
European Journal of Control (2005) 11:310–334 © 2005 EUCA

Dynamic Programming and Suboptimal Control: A Survey from ADP to MPC*

Dimitri P. Bertsekas**
Department of Electrical Engineering and Computer Science, MIT, Cambridge, MA 02139, USA

* Many thanks are due to Janey Yu for helpful comments. Research supported by NSF Grant ECS-0218328.
** E-mail: [email protected]

Received 15 June 2005; Accepted 30 June 2005. Recommended by E.F. Camacho, R. Tempo, S. Yurkovich, P.J. Fleming.

We survey some recent research directions within the field of approximate dynamic programming, with a particular emphasis on rollout algorithms and model predictive control (MPC). We argue that while they are motivated by different concerns, these two methodologies are closely connected, and the mathematical essence of their desirable properties (cost improvement and stability, respectively) is couched on the central dynamic programming idea of policy iteration. In particular, among other things, we show that the most common MPC schemes can be viewed as rollout algorithms and are related to policy iteration methods. Furthermore, we embed rollout and MPC within a new unifying suboptimal control framework, based on a concept of restricted or constrained structure policies, which contains these schemes as special cases.

Keywords: dynamic programming, stochastic optimal control, model predictive control, rollout algorithm

1. Introduction

We consider a basic stochastic optimal control problem, which is amenable to a dynamic programming solution, and is considered in many sources (including the author's dynamic programming textbook [14], whose notation we adopt). We have a discrete-time dynamic system

$$x_{k+1} = f_k(x_k, u_k, w_k), \qquad k = 0, 1, \ldots, \tag{1.1}$$

where $x_k$ is the state taking values in some set, $u_k$ is the control to be selected from a finite set $U_k(x_k)$, $w_k$ is a random disturbance, and $f_k$ is a given function. We assume that each $w_k$ is selected according to a probability distribution that may depend on $x_k$ and $u_k$, but not on previous disturbances. The cost incurred at the $k$-th time period is denoted by $g_k(x_k, u_k, w_k)$. For mathematical rigor (to avoid measurability questions), we assume that $w_k$ takes values in a countable set, but the following discussion applies qualitatively to more general cases.

We initially assume perfect state information, i.e., that the control $u_k$ is chosen with knowledge of the current state $x_k$; we later discuss the case of imperfect state information, where $u_k$ is chosen with knowledge of a history of measurements related to the state. We thus initially consider policies $\pi = \{\mu_0, \mu_1, \ldots\}$, which at time $k$ map a state $x_k$ to a control $\mu_k(x_k) \in U_k(x_k)$. We focus primarily on an $N$-stage horizon problem where $k$ takes the values $0, 1, \ldots, N-1$, and there is a terminal cost $g_N(x_N)$ that depends on the terminal state $x_N$. The cost-to-go of $\pi$ starting from a state $x_k$ at time $k$ is denoted by

$$J_k^\pi(x_k) = E\left\{ g_N(x_N) + \sum_{i=k}^{N-1} g_i\big(x_i, \mu_i(x_i), w_i\big) \right\}. \tag{1.2}$$

The optimal cost-to-go starting from a state $x_k$ at time $k$ is

$$J_k(x_k) = \inf_{\pi} J_k^\pi(x_k),$$



and it is assumed that $J_k^\pi(x_k)$ and $J_k(x_k)$ are finite for all $x_k$, $\pi$, and $k$. The cost-to-go functions $J_k$ satisfy the following recursion of dynamic programming (DP):

$$J_k(x_k) = \inf_{u_k \in U_k(x_k)} E\big\{ g_k(x_k, u_k, w_k) + J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \big\}, \qquad k = 0, 1, \ldots, N-1, \tag{1.3}$$

with the initial condition

$$J_N(x_N) = g_N(x_N).$$

Our discussion applies with minor modifications to infinite horizon problems, with the DP algorithm replaced by its asymptotic form (Bellman's equation). For example, for a stationary discounted cost problem, the analog of the DP algorithm (1.3) is

$$J(x) = \inf_{u \in U(x)} E\big\{ g(x, u, w) + \alpha J\big(f(x, u, w)\big) \big\}, \qquad \forall\, x, \tag{1.4}$$

where $J(x)$ is the optimal ($\alpha$-discounted) cost-to-go starting from $x$.

An optimal policy may be obtained in principle by minimization in the right-hand side of the DP algorithm (1.3), but this requires the calculation of the optimal cost-to-go functions $J_k$, which for many problems is prohibitively time-consuming. This has motivated approximations that require a more tractable computation, but yield a suboptimal policy. There is a long history of such suboptimal control methods, and the purpose of this paper is to survey some of them, to discuss their connections, and to place them under the umbrella of a unified methodology. Although we initially assume a stochastic model with perfect state information, much of the subsequent material is focused on other types of models, including deterministic models.
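To make the recursion (1.3) concrete, here is a minimal sketch of the backward DP computation for a hypothetical problem with small finite state, control, and disturbance spaces; all problem data below (dynamics, costs, distribution) are toy values chosen only so the recursion runs, and are not taken from the paper.

```python
# Minimal sketch of the DP recursion (1.3) for a hypothetical finite problem.
# All problem data are toy values; real applications supply their own model.
N = 4                                    # horizon
STATES = [0, 1, 2]
CONTROLS = [0, 1]
DISTURBANCES = [(-1, 0.3), (0, 0.4), (1, 0.3)]   # (value, probability)

def f(k, x, u, w):                       # system equation x_{k+1} = f_k(x_k, u_k, w_k)
    return max(0, min(2, x + u + w))     # clipped so the state stays in STATES

def g(k, x, u, w):                       # stage cost g_k(x_k, u_k, w_k)
    return x * x + u

def g_N(x):                              # terminal cost g_N(x_N)
    return x * x

# Backward recursion: J[k][x] is the optimal cost-to-go J_k(x), with J[N] = g_N.
J = [None] * N + [{x: g_N(x) for x in STATES}]
for k in reversed(range(N)):
    J[k] = {
        x: min(
            sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                for w, p in DISTURBANCES)
            for u in CONTROLS
        )
        for x in STATES
    }

print(J[0])                              # optimal cost-to-go at time 0
```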

A broad class of suboptimal control methods, which we refer to as approximate dynamic programming (ADP), is based on replacing the cost-to-go function $J_{k+1}$ in the right-hand side of the DP algorithm (1.3) by an approximation $\tilde J_{k+1}$, with $\tilde J_N = g_N$. Thus, this method applies at time $k$ and state $x_k$ a control $\bar\mu_k(x_k)$ that minimizes over $u_k \in U_k(x_k)$

$$E\big\{ g_k(x_k, u_k, w_k) + \tilde J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \big\}.$$

The corresponding suboptimal policy $\bar\pi = \{\bar\mu_0, \bar\mu_1, \ldots, \bar\mu_{N-1}\}$ is determined by the approximate cost-to-go functions $\tilde J_1, \tilde J_2, \ldots, \tilde J_N$ (given either by their functional form or by an algorithm to calculate their values at states of interest). Note that if the problem is stationary, and the functions $\tilde J_{k+1}$ are all equal, as they would normally be in an infinite horizon context, the policy $\bar\pi$ is stationary.

There are several alternative approaches for selecting or calculating the functions $\tilde J_{k+1}$. We distinguish two broad categories:

(1) Explicit cost-to-go approximation. Here $\tilde J_{k+1}$ is computed off-line in one of a number of ways. Some important examples are as follows:

(a) By solving (optimally) a related simpler problem, obtained for example by state aggregation or by some other type of problem simplification, such as some form of enforced decomposition. The functions $\tilde J_{k+1}$ are derived from the optimal cost-to-go functions of the simpler problem. We will not discuss this approach further in this paper, and we refer to the author's textbook [13] for more details.

(b) By introducing a parametric approximation architecture, such as a neural network or a weighted sum of basis functions or features. The idea here is to approximate the optimal cost-to-go $J_{k+1}(x)$ with a function of a given parametric form $\tilde J_{k+1}(x) = \hat J_{k+1}(x, r_{k+1})$, where $r_{k+1}$ is a parameter vector. This vector is tuned by some form of ad hoc/heuristic method (as for example in computer chess) or some systematic method (for example, of the type provided by the neuro-dynamic programming and reinforcement learning methodologies, such as temporal difference and Q-learning methods; see the monographs by Bertsekas and Tsitsiklis [8], and Sutton and Barto [SuB98], and the recent edited volume by Barto et al. [2], which contain extensive bibliographies). There has also been considerable recent related work based on linear programming, actor-critic, policy gradient, and other methods (for a sampling of recent work, see de Farias and Van Roy [18,19], Konda [30], Konda and Tsitsiklis [29], Marbach and Tsitsiklis [36], and Rantzer [41], which contain many other references). Again, this approach will not be discussed further in this paper.

(2) Implicit cost-to-go approximation. Here the values of $\tilde J_{k+1}$ at the states $f_k(x_k, u_k, w_k)$ are computed on-line as needed, via some computation of future costs starting from these states (optimal or suboptimal/heuristic, with or without a rolling horizon). We will focus on a few possibilities, which we will discuss in detail in Sections 4 and 5:

(a) Rollout, where the cost-to-go of a suboptimal/heuristic policy (called the base policy) is used as $\tilde J_{k+1}$. This cost is computed as necessary in whatever form is most convenient, including by on-line simulation. It can be seen that the suboptimal policy $\bar\pi$ obtained by rollout is identical to the policy obtained by a single policy improvement step of the classical policy iteration method, starting from the base policy. There is also a variant of the rollout algorithm where $\tilde J_{k+1}$ is defined via a base policy but may differ from the cost-to-go of the corresponding policy. This variant still has the cost improvement property of policy iteration under some assumptions that are often satisfied (see Example 3.1 and Section 4).

(b) Open-loop feedback control (OLFC), where an optimal open-loop computation is used, starting from the state $x_k$ (in the case of perfect state information) or the conditional probability distribution of the state (in the case of imperfect state information).

(c) Model predictive control (MPC), where an optimal control computation is used in conjunction with a rolling horizon. This computation is deterministic, possibly based on a simplification of the original problem via certainty equivalence, but there is also a minimax variant that implicitly involves reachability of target tube computations (see Section 5).

A few important variations of the preceding schemes should be mentioned. The first is the use of multistep lookahead, which aims to improve the performance of one-step lookahead, at the expense of increased on-line computation. The second is the use of certainty equivalence, which simplifies the off-line and on-line computations by replacing the current and future unknown disturbances $w_k, \ldots, w_{N-1}$ with nominal deterministic values. A third variation, which applies to problems of imperfect state information, is to use one of the preceding schemes with the unknown state $x_k$ replaced by some estimate. We briefly discuss some of these variations in the next section.

Our paper has three objectives:

(1) To derive some general performance bounds, and to relate them to observed practical performance and to the issue of stability in MPC. This is done in Section 3, and also in Section 5, where MPC is discussed in more detail.

(2) To demonstrate the connection between rollout (and hence policy iteration) on the one hand, and OLFC and MPC on the other hand. Indeed, we will show that both OLFC and MPC are special cases of rollout, obtained for particular choices of the corresponding base policy. This is explained in Sections 4 and 5.

(3) To introduce a general unifying framework for suboptimal control, which includes as special cases rollout, OLFC, and MPC, and captures the mathematical essence of their attractive properties. This is the subject of Section 6.

A complete bibliography is beyond the scope of the present paper. In this section, we will selectively provide a few major relevant sources, with preference given to survey papers and textbooks. We supplement these references with more specific remarks at the appropriate points in subsequent sections.

The main idea of rollout algorithms, obtaining an improved policy starting from some other suboptimal policy using a one-time policy improvement, has appeared in several DP application contexts, although the connection with policy iteration has not always been recognized. In the context of game-playing computer programs, it has been proposed by Abramson [1] and by Tesauro [TeG96]. The name 'rollout' was coined by Tesauro in specific reference to rolling the dice in the game of backgammon. In Tesauro's proposal, a given backgammon position is evaluated by 'rolling out' many games starting from that position, using a simulator, and the results are averaged to provide a 'score' for the position. The internet contains a lot of material on computer backgammon and the use of rollout, in some cases in conjunction with multistep lookahead and cost-to-go approximation.

Generally, it is possible to use a base policy to compute cost-to-go approximations $\tilde J_k$, but to make a strong connection with policy iteration, it is essential that $\tilde J_k$ be the true cost-to-go function of some policy. For deterministic problems, this is equivalent to requiring that the base policy has a property called sequential consistency. However, for cost improvement it is sufficient that the base policy has a weaker property, called sequential improvement, which bears a relation to concepts of Lyapunov stability theory and can also be extended to general limited lookahead policies (see Proposition 3.1 and Example 3.1). The sequential consistency and sequential improvement properties for deterministic problems were first formalized by Bertsekas et al. [3], and will be described in Section 4 in a more general context, which includes constraints.



For further work on rollout algorithms, see Christodouleas [Chr97], Bertsekas [12], Bertsekas and Castanon [4], Secomandi [43–45], Bertsimas and Demir [5], Ferris and Voelker [23,24], McGovern et al. [31], Savagaonkar et al. [SGC02], Bertsimas and Popescu [6], Guerriero and Mancini [GuM03], Tu and Pattipati [47], Wu et al. [48], Chang et al. [16], Meloni et al. [32], and Yan et al. [52]. Collectively, these works discuss a broad variety of applications and case studies, and generally report positive computational experience.

The MPC approach has become popular in a variety of control system design contexts, and particularly in chemical process control, where meeting explicit control and state constraints is an important practical issue. Over time, there has been increasing awareness of the connection with the problem of reachability of target tubes, set-membership descriptions of uncertainty, and minimax control (see the discussion of Section 5). The associated literature is voluminous and we make no attempt to provide detailed references. The stability analysis given here is based on the work of Keerthi and Gilbert [28]. For extensive surveys of the field, which give many references, see Morari and Lee [39], Mayne et al. [33], Rawlings [42], Findeisen et al. [22], and Qin and Badgwell [40]. For related textbooks, see Martin-Sanchez and Rodellar [34], Maciejowski [35], and Camacho and Bordons [17].

While our main emphasis in this paper is on highlighting the conceptual connections between several approaches to suboptimal control, we also present a few new and/or unpublished results. In particular, the material of Section 4 on rollout algorithms for constrained DP is unpublished (it is also reported separately in Ref. [14]). The connection of MPC with rollout algorithms reported in Section 5 does not seem to have been noticed earlier. Finally, the material of Section 6 on the unifying suboptimal control framework based on restricted structure policies is new.

2. Limited Lookahead Policies

An effective way to reduce the computation required by DP is to truncate the time horizon, and use at each stage a decision based on lookahead of a small number of stages. The simplest possibility is to use a one-step lookahead policy, whereby at stage $k$ and state $x_k$ one uses a control $\bar\mu_k(x_k)$ that attains the minimum in the expression

$$\min_{u_k \in U_k(x_k)} E\big\{ g_k(x_k, u_k, w_k) + \tilde J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \big\}, \tag{2.1}$$

where $\tilde J_{k+1}$ is some approximation of the true cost-to-go function $J_{k+1}$, with $\tilde J_N = g_N$. Similarly, a two-step lookahead policy applies at time $k$ and state $x_k$ the control $\bar\mu_k(x_k)$ attaining the minimum in the preceding equation, where now $\tilde J_{k+1}$ is obtained itself on the basis of a one-step lookahead approximation. In other words, for all possible states $x_{k+1}$ that can be generated via the system equation starting from $x_k$,

$$x_{k+1} = f_k(x_k, u_k, w_k),$$

we have

$$\tilde J_{k+1}(x_{k+1}) = \min_{u_{k+1} \in U_{k+1}(x_{k+1})} E\big\{ g_{k+1}(x_{k+1}, u_{k+1}, w_{k+1}) + \tilde J_{k+2}\big(f_{k+1}(x_{k+1}, u_{k+1}, w_{k+1})\big) \big\},$$

where $\tilde J_{k+2}$ is some approximation of the cost-to-go function $J_{k+2}$. Policies with lookahead of more than two stages are similarly defined.

Note that even with readily available cost-to-go approximations $\tilde J_k$, the minimization over $u_k \in U_k(x_k)$ in the calculation of the one-step lookahead control [cf. Eq. (2.1)] may involve substantial computation. This motivates several variants/simplifications of the basic method. For example, the minimization over $U_k(x_k)$ in Eq. (2.1) may be replaced by a minimization over a subset

$$\bar U_k(x_k) \subset U_k(x_k).$$

Thus, the control $\bar\mu_k(x_k)$ used in this variant is one that attains the minimum in the expression

$$\min_{u_k \in \bar U_k(x_k)} E\big\{ g_k(x_k, u_k, w_k) + \tilde J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \big\}. \tag{2.2}$$

A practical example of this approach is when, by using some heuristic or approximate optimization, we identify a subset $\bar U_k(x_k)$ of promising controls, and to save computation, we restrict attention to this subset in the one-step lookahead minimization.
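In code, the restricted one-step lookahead rule (2.2) is a single minimization per stage. The sketch below assumes hypothetical model interfaces (a system function, a stage cost, a disturbance list with probabilities, and a user-supplied approximation J_tilde); the candidate set plays the role of the restricted subset of promising controls.

```python
def one_step_lookahead(k, x, candidate_controls, f, g, disturbances, J_tilde):
    """Return a control attaining the minimum in Eq. (2.2), where
    candidate_controls stands in for the restricted subset of U_k(x_k)
    and J_tilde(k+1, x) approximates the cost-to-go J_{k+1}(x)."""
    def q_value(u):
        # expected stage cost plus approximate cost-to-go at the next state
        return sum(p * (g(k, x, u, w) + J_tilde(k + 1, f(k, x, u, w)))
                   for w, p in disturbances)
    return min(candidate_controls, key=q_value)
```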

Another major simplification is common in problems of imperfect state information, where the computational requirements of exact DP are often prohibitive. There, the state $x_k$ is not known exactly but may be estimated based on observations obtained up to time $k$. A possible approach is then to replace the state $x_k$ in Eqs. (2.1) and (2.2) by an estimate.

A third simplification, known as (assumed) certainty equivalence, aims to avoid the computation of the expected value in Eqs. (2.1) and (2.2), and to simplify the calculation of the required values of $\tilde J_{k+1}$, by replacing the stochastic quantities $w_k$ with some nominal deterministic values $\bar w_k$. Then, the one-step lookahead minimization in Eq. (2.1) is replaced by

$$\min_{u_k \in U_k(x_k)} \big[ g_k(x_k, u_k, \bar w_k) + \tilde J_{k+1}\big(f_k(x_k, u_k, \bar w_k)\big) \big]. \tag{2.3}$$

The cost-to-go approximation $\tilde J_{k+1}$ is often obtained by solving an optimal control problem where the future uncertain quantities $w_{k+1}, \ldots, w_{N-1}$ are replaced by deterministic nominal values $\bar w_{k+1}, \ldots, \bar w_{N-1}$. Then, the minimization in Eq. (2.3) can be done by solving a deterministic optimal control problem, from the present time to the end of the horizon:

(1) Find a control sequence $\{\bar u_k, \bar u_{k+1}, \ldots, \bar u_{N-1}\}$ that minimizes

$$g_N(x_N) + \sum_{i=k}^{N-1} g_i(x_i, u_i, \bar w_i)$$

subject to the constraints

$$x_{i+1} = f_i(x_i, u_i, \bar w_i), \quad u_i \in U_i(x_i), \quad i = k, k+1, \ldots, N-1.$$

(2) Use as control the first element in the control sequence found:

$$\bar\mu_k(x_k) = \bar u_k.$$

Note that in variants of Step (1) above, the deterministic optimization problem may still be difficult, and may be solved approximately using a suboptimal method that offers some computational advantage. For example, when the number of stages $N$ is large or infinite, the deterministic problem may be defined over a rolling horizon with a fixed number of stages. We finally note that certainty equivalence is often used at the modeling level, when a deterministic model is adopted for a system which is known to involve uncertainty.
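As a concrete (if naive) rendering of Steps (1) and (2), the following hypothetical sketch enumerates all control sequences of the deterministic problem obtained by fixing the disturbances at nominal values, and applies only the first control of a minimizing sequence. A practical implementation would replace the enumeration with a deterministic optimal control or shortest-path solver; all interfaces are assumptions, not the paper's.

```python
import itertools

def cec_control(k, x_k, N, f, g, g_N, controls, w_nominal):
    """Certainty equivalent control sketch: brute-force solve the deterministic
    problem from stage k with disturbances fixed at w_nominal[i], and return
    the first control of a best sequence (illustration only)."""
    best_cost, best_first = float("inf"), None
    for seq in itertools.product(controls, repeat=N - k):
        x, cost = x_k, 0.0
        for i, u in enumerate(seq, start=k):
            cost += g(i, x, u, w_nominal[i])   # stage cost at nominal disturbance
            x = f(i, x, u, w_nominal[i])       # deterministic propagation
        cost += g_N(x)                         # terminal cost
        if cost < best_cost:
            best_cost, best_first = cost, seq[0]
    return best_first
```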

3. Error Bounds

For any suboptimal control scheme it is important to have theoretical bounds on its performance. Unfortunately, despite recent progress, the methodology for performance analysis of suboptimal control schemes is not very satisfactory at present, and the validation of a suboptimal policy by simulation is often essential in practice. Still, with experience and new theoretical research, the relative merits of different approaches have been clarified to some extent, and it is now understood that some schemes possess desirable theoretical performance guarantees, while others do not. In this section, we will discuss a few performance bounds that are available for ADP. For additional theoretical analysis on performance bounds, based on different approaches, see Witsenhausen [50,51].

Let us denote by $\bar J_k(x_k)$ the expected cost-to-go incurred by a limited lookahead policy $\{\bar\mu_0, \bar\mu_1, \ldots, \bar\mu_{N-1}\}$ starting from state $x_k$ at time $k$ [$\bar J_k(x_k)$ should be distinguished from $\tilde J_k(x_k)$, the approximation of the cost-to-go that is used to compute the limited lookahead policy via the minimization in Eq. (2.1) or Eq. (2.2)]. It is generally difficult to evaluate analytically the functions $\bar J_k$, even when the functions $\tilde J_k$ are readily available. We thus aim to obtain some estimates of $\bar J_k(x_k)$. The following proposition gives a condition under which the one-step lookahead policy achieves a cost $\bar J_k(x_k)$ which is better than the approximation $\tilde J_k(x_k)$. The proposition also provides a readily computable upper bound to $\bar J_k(x_k)$.

Proposition 3.1. Assume that for all $x_k$ and $k$, we have

$$\min_{u_k \in \bar U_k(x_k)} E\big\{ g_k(x_k, u_k, w_k) + \tilde J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \big\} \le \tilde J_k(x_k). \tag{3.1}$$

Then the cost-to-go functions $\bar J_k$ corresponding to a one-step lookahead policy that uses $\tilde J_k$ and $\bar U_k(x_k)$ with $\bar U_k(x_k) \subset U_k(x_k)$ [cf. Eq. (2.2)] satisfy, for all $x_k$ and $k$,

$$\bar J_k(x_k) \le \min_{u_k \in \bar U_k(x_k)} E\big\{ g_k(x_k, u_k, w_k) + \tilde J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \big\}. \tag{3.2}$$

Proof. For $k = 0, \ldots, N-1$, denote

$$\hat J_k(x_k) = \min_{u_k \in \bar U_k(x_k)} E\big\{ g_k(x_k, u_k, w_k) + \tilde J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \big\}, \tag{3.3}$$

and let $\hat J_N = g_N$. We must show that for all $x_k$ and $k$, we have $\bar J_k(x_k) \le \hat J_k(x_k)$. We use backwards induction on $k$. In particular, we have $\bar J_N(x_N) = \hat J_N(x_N) = \tilde J_N(x_N) = g_N(x_N)$ for all $x_N$. Assuming that $\bar J_{k+1}(x_{k+1}) \le \hat J_{k+1}(x_{k+1})$ for all $x_{k+1}$, we have

$$\begin{aligned}
\bar J_k(x_k) &= E\big\{ g_k\big(x_k, \bar\mu_k(x_k), w_k\big) + \bar J_{k+1}\big(f_k(x_k, \bar\mu_k(x_k), w_k)\big) \big\} \\
&\le E\big\{ g_k\big(x_k, \bar\mu_k(x_k), w_k\big) + \hat J_{k+1}\big(f_k(x_k, \bar\mu_k(x_k), w_k)\big) \big\} \\
&\le E\big\{ g_k\big(x_k, \bar\mu_k(x_k), w_k\big) + \tilde J_{k+1}\big(f_k(x_k, \bar\mu_k(x_k), w_k)\big) \big\} \\
&= \hat J_k(x_k),
\end{aligned}$$

for all $x_k$. The first equality above follows from the DP algorithm that defines the costs-to-go $\bar J_k$ of the limited lookahead policy, the first inequality follows from the induction hypothesis, the second inequality follows from the assumption (3.1), and the final equality holds because $\bar\mu_k(x_k)$ attains the minimum in Eq. (3.3). This completes the induction proof. ∎

Note that by Eq. (3.2), the value $\hat J_k(x_k)$ of Eq. (3.3), which is the calculated one-step lookahead cost from state $x_k$ at time $k$, provides a readily obtainable performance bound for the cost-to-go $\bar J_k(x_k)$ of the one-step lookahead policy. Furthermore, using also the assumption (3.1), we obtain for all $x_k$ and $k$,

$$\bar J_k(x_k) \le \tilde J_k(x_k),$$

that is, the cost-to-go of the one-step lookahead policy is no greater than the lookahead approximation on which it is based. The critical assumption (3.1) in Proposition 3.1 can be verified in a few interesting special cases, as indicated by the following examples.

Example 3.1 (Rollout algorithm). Suppose that $\tilde J_k(x_k)$ is the cost-to-go $J_k^\pi(x_k)$ of some given suboptimal policy $\pi = \{\mu_0, \ldots, \mu_{N-1}\}$, and that the set $\bar U_k(x_k)$ contains the control $\mu_k(x_k)$ for all $x_k$ and $k$. The resulting one-step lookahead algorithm is called the rollout algorithm, and will be discussed further in Section 4. From the DP algorithm (restricted to the given policy $\pi$), we have

$$\tilde J_k(x_k) = E\big\{ g_k\big(x_k, \mu_k(x_k), w_k\big) + \tilde J_{k+1}\big(f_k(x_k, \mu_k(x_k), w_k)\big) \big\},$$

which, in view of the assumption $\mu_k(x_k) \in \bar U_k(x_k)$, yields

$$\tilde J_k(x_k) \ge \min_{u_k \in \bar U_k(x_k)} E\big\{ g_k(x_k, u_k, w_k) + \tilde J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \big\}. \tag{3.4}$$

Thus, the assumption of Proposition 3.1 is satisfied, and it follows that the rollout algorithm performs no worse than the policy on which it is based, starting from any state and stage.
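In practice the base policy's cost-to-go in Eq. (3.4) is rarely available in closed form; a common approach, sketched below under hypothetical model interfaces, is to estimate it by Monte Carlo simulation of the base policy (for a deterministic problem a single trajectory per candidate control suffices). Every name here is an assumption chosen for illustration.

```python
def simulate_base_policy(k, x, N, f, g, g_N, mu, sample_w):
    """Cost of one simulated trajectory of the base policy mu from state x
    at stage k; sample_w(i, x, u) draws a disturbance w_i."""
    cost = 0.0
    for i in range(k, N):
        u = mu(i, x)
        w = sample_w(i, x, u)
        cost += g(i, x, u, w)
        x = f(i, x, u, w)
    return cost + g_N(x)

def rollout_control(k, x, N, f, g, g_N, mu, controls, sample_w, n_sims=100):
    """One-step lookahead with J_tilde_{k+1} replaced by a Monte Carlo
    estimate of the base policy's cost-to-go (cf. Example 3.1)."""
    def q_estimate(u):
        total = 0.0
        for _ in range(n_sims):
            w = sample_w(k, x, u)
            x_next = f(k, x, u, w)
            total += g(k, x, u, w) + simulate_base_policy(
                k + 1, x_next, N, f, g, g_N, mu, sample_w)
        return total / n_sims
    return min(controls, key=q_estimate)
```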

In a generalization of the rollout algorithm, which is relevant to the algorithm for constrained DP given in Section 4, and to MPC as described in Section 5, $\tilde J_k(x_k)$ is a lower bound to the cost-to-go of some given suboptimal policy $\pi = \{\mu_0, \ldots, \mu_{N-1}\}$ starting from $x_k$, and furthermore satisfies the inequality

$$\tilde J_k(x_k) \ge E\big\{ g_k\big(x_k, \mu_k(x_k), w_k\big) + \tilde J_{k+1}\big(f_k(x_k, \mu_k(x_k), w_k)\big) \big\},$$

for all $x_k$. Then, Eq. (3.4) holds, and from Proposition 3.1, it follows that $\bar J_k(x_k) \le \tilde J_k(x_k)$. Since $\tilde J_k(x_k) \le J_k^\pi(x_k)$ by assumption, we see that again the rollout algorithm performs no worse than the policy on which it is based, starting from any state and stage.

Example 3.2 (Rollout algorithm with multiple heuristics). Consider a scheme that is similar to the one of the preceding example, except that $\tilde J_k(x_k)$ is the minimum of the cost-to-go functions corresponding to $m$ heuristics, i.e.,

$$\tilde J_k(x_k) = \min\big\{ J_k^{\pi_1}(x_k), \ldots, J_k^{\pi_m}(x_k) \big\},$$

where for each $j$, $J_k^{\pi_j}(x_k)$ is the cost-to-go of a policy $\pi_j = \{\mu_{j,0}, \ldots, \mu_{j,N-1}\}$, starting from state $x_k$ at stage $k$. From the DP algorithm, we have, for all $j$,

$$J_k^{\pi_j}(x_k) = E\big\{ g_k\big(x_k, \mu_{j,k}(x_k), w_k\big) + J_{k+1}^{\pi_j}\big(f_k(x_k, \mu_{j,k}(x_k), w_k)\big) \big\},$$

from which, using the definition of $\tilde J_k$, it follows that

$$\begin{aligned}
J_k^{\pi_j}(x_k) &\ge E\big\{ g_k\big(x_k, \mu_{j,k}(x_k), w_k\big) + \tilde J_{k+1}\big(f_k(x_k, \mu_{j,k}(x_k), w_k)\big) \big\} \\
&\ge \min_{u_k \in \bar U_k(x_k)} E\big\{ g_k(x_k, u_k, w_k) + \tilde J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \big\}.
\end{aligned}$$

Taking the minimum of the left-hand side over $j$, we obtain

$$\tilde J_k(x_k) \ge \min_{u_k \in \bar U_k(x_k)} E\big\{ g_k(x_k, u_k, w_k) + \tilde J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \big\}.$$

Thus, Proposition 3.1 implies that the one-step lookahead algorithm based on the heuristic algorithms' costs-to-go $J_k^{\pi_1}(x_k), \ldots, J_k^{\pi_m}(x_k)$ performs better than all of these heuristics, starting from any state and stage. This also follows from the analysis of the preceding example, since $\tilde J_k(x_k)$ is a lower bound to all the heuristic costs-to-go $J_k^{\pi_j}(x_k)$.

Generally, the approximate cost-to-go functions $\tilde J_k$ need not satisfy the assumption (3.1) of Proposition 3.1. The following proposition does not require this assumption, and applies to any policy, not necessarily one obtained by one-step or multistep lookahead. It is useful in some contexts, including the case where the minimization involved in the calculation of the one-step lookahead policy is not exact.

Proposition 3.2. Let $\tilde J_k$, $k = 0, 1, \ldots, N$, be functions of $x_k$, with $\tilde J_N(x_N) = g_N(x_N)$ for all $x_N$. Let also $\pi = \{\mu_0, \mu_1, \ldots, \mu_{N-1}\}$ be a policy such that for all $x_k$ and $k$, we have

$$\tilde J_k(x_k) + \delta_k \le E\big\{ g_k\big(x_k, \mu_k(x_k), w_k\big) + \tilde J_{k+1}\big(f_k(x_k, \mu_k(x_k), w_k)\big) \big\} \le \tilde J_k(x_k) + \epsilon_k, \tag{3.5}$$

where $\delta_k$, $\epsilon_k$, $k = 0, \ldots, N-1$, are some scalars. Then for all $x_k$ and $k$, we have

$$\tilde J_k(x_k) + \sum_{i=k}^{N-1} \delta_i \le J_k^\pi(x_k) \le \tilde J_k(x_k) + \sum_{i=k}^{N-1} \epsilon_i, \tag{3.6}$$

where $J_k^\pi(x_k)$ is the cost-to-go of $\pi$ starting from state $x_k$ at stage $k$.

Proof. We use backwards induction on $k$ to prove the right-hand side of Eq. (3.6); the proof of the left-hand side is similar. In particular, we have $J_N^\pi(x_N) = \tilde J_N(x_N) = g_N(x_N)$ for all $x_N$. Assuming that

$$J_{k+1}^\pi(x_{k+1}) \le \tilde J_{k+1}(x_{k+1}) + \sum_{i=k+1}^{N-1} \epsilon_i$$

for all $x_{k+1}$, we have

$$\begin{aligned}
J_k^\pi(x_k) &= E\big\{ g_k\big(x_k, \mu_k(x_k), w_k\big) + J_{k+1}^\pi\big(f_k(x_k, \mu_k(x_k), w_k)\big) \big\} \\
&\le E\big\{ g_k\big(x_k, \mu_k(x_k), w_k\big) + \tilde J_{k+1}\big(f_k(x_k, \mu_k(x_k), w_k)\big) \big\} + \sum_{i=k+1}^{N-1} \epsilon_i \\
&\le \tilde J_k(x_k) + \epsilon_k + \sum_{i=k+1}^{N-1} \epsilon_i,
\end{aligned}$$

for all $x_k$. The first equality above follows from the DP algorithm that defines the costs-to-go $J_k^\pi$ of $\pi$, the first inequality follows from the induction hypothesis, and the second inequality follows from the assumption (3.5). This completes the induction proof. ∎

Example 3.3 (Certainty equivalent control). Consider the certainty equivalent control scheme (CEC), where each disturbance $w_k$ is fixed at a nominal value $\bar w_k$, $k = 0, \ldots, N-1$, which is independent of $x_k$ and $u_k$. Let $\tilde J_k(x_k)$ be the optimal value of the problem solved by the CEC at state $x_k$ and stage $k$:

$$\tilde J_k(x_k) = \min_{\substack{x_{i+1} = f_i(x_i, u_i, \bar w_i) \\ u_i \in U_i(x_i),\ i = k, \ldots, N-1}} \left[ g_N(x_N) + \sum_{i=k}^{N-1} g_i(x_i, u_i, \bar w_i) \right],$$

and let $\tilde J_N(x_N) = g_N(x_N)$ for all $x_N$. Recall that the CEC applies the control $\bar\mu_k(x_k) = \bar u_k$ after finding an optimal control sequence $\{\bar u_k, \ldots, \bar u_{N-1}\}$ for the deterministic problem on the right-hand side above. Note also that the following DP equation

$$\tilde J_k(x_k) = \min_{u_k \in U_k(x_k)} \big[ g_k(x_k, u_k, \bar w_k) + \tilde J_{k+1}\big(f_k(x_k, u_k, \bar w_k)\big) \big]$$

holds, and that the control $\bar u_k$ applied by the CEC attains the minimum in the right-hand side.

Let us now apply Proposition 3.2 to derive a performance bound for the CEC. We have, for all $x_k$ and $k$,

$$\begin{aligned}
\tilde J_k(x_k) &= g_k\big(x_k, \bar\mu_k(x_k), \bar w_k\big) + \tilde J_{k+1}\big(f_k(x_k, \bar\mu_k(x_k), \bar w_k)\big) \\
&= E\big\{ g_k\big(x_k, \bar\mu_k(x_k), w_k\big) + \tilde J_{k+1}\big(f_k(x_k, \bar\mu_k(x_k), w_k)\big) \big\} - \epsilon_k(x_k),
\end{aligned}$$

where $\epsilon_k(x_k)$ is defined by

$$\begin{aligned}
\epsilon_k(x_k) ={}& E\big\{ g_k\big(x_k, \bar\mu_k(x_k), w_k\big) + \tilde J_{k+1}\big(f_k(x_k, \bar\mu_k(x_k), w_k)\big) \big\} \\
&- g_k\big(x_k, \bar\mu_k(x_k), \bar w_k\big) - \tilde J_{k+1}\big(f_k(x_k, \bar\mu_k(x_k), \bar w_k)\big).
\end{aligned}$$

It follows that

$$E\big\{ g_k\big(x_k, \bar\mu_k(x_k), w_k\big) + \tilde J_{k+1}\big(f_k(x_k, \bar\mu_k(x_k), w_k)\big) \big\} \le \tilde J_k(x_k) + \epsilon_k,$$

where

$$\epsilon_k = \max_{x_k} \epsilon_k(x_k),$$

and by Proposition 3.2, we obtain the following bound for the cost-to-go function $\bar J_k(x_k)$ of the CEC:

$$\bar J_k(x_k) \le \tilde J_k(x_k) + \sum_{i=k}^{N-1} \epsilon_i.$$

The preceding performance bound is helpful when it can be shown that $\epsilon_k \le 0$ for all $k$, in which case we have $\bar J_k(x_k) \le \tilde J_k(x_k)$ for all $x_k$ and $k$. This is true, for example, if for all $x_k$ and $u_k$ we have

$$E\big\{ g_k(x_k, u_k, w_k) \big\} \le g_k(x_k, u_k, \bar w_k),$$

and

$$E\big\{ \tilde J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \big\} \le \tilde J_{k+1}\big(f_k(x_k, u_k, \bar w_k)\big).$$

The most common way to assert that inequalities of this type hold is via some kind of concavity assumptions; for example, the inequalities hold if the state, control, and disturbance spaces are Euclidean spaces, $\bar w_k$ is the expected value of $w_k$, and the functions $g_k(x_k, u_k, \cdot)$ and $\tilde J_{k+1}\big(f_k(x_k, u_k, \cdot)\big)$, viewed as functions of $w_k$, are concave (this is Jensen's inequality, and follows easily from the definition of concavity). It can be shown that the concavity conditions just described are guaranteed if the system is linear with respect to $x_k$ and $w_k$, the cost functions $g_k$ are concave with respect to $x_k$ and $w_k$ for each fixed $u_k$, the terminal cost function $g_N$ is concave, and the control constraint sets $U_k$ do not depend on $x_k$.

4. Rollout and Open-Loop Feedback Control

In this section, we discuss rollout and OLFC schemes for one-step lookahead, and describe their relation. We first consider general stochastic problems, and we subsequently focus on deterministic problems, for which stronger results and algorithmic procedures are possible.

4.1. Rollout Algorithm

Consider the one-step lookahead minimization

$$\min_{u_k \in U_k(x_k)} E\big\{ g_k(x_k, u_k, w_k) + \tilde J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \big\}, \tag{4.1}$$

for the case where the approximating function $\tilde J_{k+1}$ is the cost-to-go $J_{k+1}^\pi$ of some known heuristic/suboptimal policy $\pi = \{\mu_0, \ldots, \mu_{N-1}\}$, called the base policy (see also Example 3.1). The policy thus obtained is called the rollout policy based on $\pi$. Thus the rollout policy is a one-step lookahead policy, with the optimal cost-to-go approximated by the cost-to-go of the base policy.

The salient feature of the rollout algorithm is its cost improvement property: it improves on the performance of the base policy (see Example 3.1). We note that the process of starting from some suboptimal policy and generating another policy using the one-step lookahead process described above is known as policy improvement, and is the basis of the policy iteration method, a primary method for solving DP problems. The rollout algorithm can be viewed as a single policy iteration, and its cost improvement property is a manifestation of the corresponding property of the policy iteration method. In practice, the rollout algorithm often performs surprisingly well (much better than its corresponding base policy). This is consistent with the observed practical behavior of the policy iteration method, which tends to produce large improvements in the first few iterations.

The necessary values of the approximate cost-to-go $\tilde J_{k+1}$ in Eq. (4.1) may be computed in a number of ways:

(1) By a closed-form expression. This may be possible in exceptional cases where the structure of the base policy is well-suited for a quasi-analytical cost-to-go evaluation.

(2) By an approximate off-line computation. For example, $\tilde J_{k+1}$ may be computed approximately as the output of a neural network or some other approximation architecture, which is trained by simulation.

(3) By an on-line computation, such as for example a form of simulation or optimization. The drawback of simulation is that its computational requirements may be excessive. Note, however, that for a deterministic problem, only one simulation trajectory is needed to calculate the cost-to-go of the base policy starting from some state, so in this case the computational requirements are greatly reduced.

A fuller discussion of the computational issues regarding the rollout algorithm is beyond the scope of this paper, and we refer to the literature on the subject. In particular, the textbook [13] contains an extensive account.

4.2. Rollout Algorithms for Deterministic Constrained DP

We now specialize the rollout algorithm to deterministic optimal control problems, and we also generalize it and extend its range of application to problems with some additional constraints, taking advantage of the special deterministic structure.

We consider a problem involving the system

$$x_{k+1} = f_k(x_k, u_k), \qquad k = 0, \ldots, N-1,$$

where $x_k$ and $u_k$ are the state and control at time $k$, respectively, taking values in some sets, which may depend on $k$. The initial state is given and is denoted by $x_0$. Each control $u_k$ must be chosen from a constraint set $U_k(x_k)$ that depends on the current state $x_k$. A sequence of the form

$$T = (x_0, u_0, x_1, u_1, \ldots, u_{N-1}, x_N),$$

where

$$x_{k+1} = f_k(x_k, u_k), \quad u_k \in U_k(x_k), \quad k = 0, 1, \ldots, N-1,$$

is referred to as a trajectory. In our terminology, a trajectory is complete in the sense that it starts at the given initial state $x_0$, and ends at some state $x_N$ after $N$ stages. We will also refer to partial trajectories, which are subsets of complete trajectories, involving fewer than $N$ stages and consisting of stage-contiguous states and controls.

The cost of the trajectory $T = (x_0, u_0, x_1, u_1, \ldots, u_{N-1}, x_N)$ is

$$V(T) = g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k), \tag{4.2}$$

where $g_k$, $k = 0, 1, \ldots, N$, are given functions. The problem is to find a trajectory $T$ that minimizes $V(T)$ subject to the constraint

$$T \in C, \tag{4.3}$$

where $C$ is a given set of trajectories.

An optimal solution of this problem can be found by an extension of the DP algorithm, but the associated computation can be overwhelming. It is much greater than the computation for the corresponding unconstrained problem where the constraint $T \in C$ is absent. This is true even in the special case where $C$ is specified in terms of a finite number of constraint functions that are time-additive, i.e., $T \in C$ if

$$g_N^m(x_N) + \sum_{k=0}^{N-1} g_k^m(x_k, u_k) \le b_m, \qquad m = 1, \ldots, M, \tag{4.4}$$

where $g_k^m$, $k = 0, 1, \ldots, N$, and $b_m$, $m = 1, \ldots, M$, are given functions and scalars, respectively. The literature contains several proposals of suboptimal solution methods for the case where the constraints are of the form (4.4). Most of these proposals are cast in the context of constrained and multiobjective shortest path problems; see, e.g., Jaffe [26], Martins [37], Guerriero and Musmanno [25], and Stewart and White [46], who also survey earlier work.

The extension of the rollout algorithm to deterministic problems with the constraints (4.3) uses a base policy, which has the property that given any state $x_k$ at stage $k$, it produces a partial trajectory

$$H(x_k) = (x_k, u_k, x_{k+1}, u_{k+1}, \ldots, u_{N-1}, x_N),$$

that starts at $x_k$ and satisfies

$$x_{i+1} = f_i(x_i, u_i), \quad u_i \in U_i(x_i), \quad i = k, \ldots, N-1.$$

The cost corresponding to the partial trajectory $H(x_k)$ is denoted by $\tilde J(x_k)$:

$$\tilde J(x_k) = g_N(x_N) + \sum_{i=k}^{N-1} g_i(x_i, u_i).$$

Thus, given a partial trajectory that starts at the initial state $x_0$ and ends at a state $x_k$, the base policy can be used to complete this trajectory by concatenating it with the partial trajectory $H(x_k)$. The trajectory thus obtained is not guaranteed to be feasible (i.e., it may not belong to $C$), but we assume throughout that the state/control trajectory generated by the base policy starting from the given initial state $x_0$ is feasible, that is,

$$H(x_0) \in C.$$

Finding a base policy with this property may not be easy, but this question is beyond the scope of this paper. It appears, however, that for many problems of interest, there are natural base policies that satisfy this feasibility requirement; this is true in particular when $C$ is specified by Eq. (4.4) with only one constraint function ($M = 1$), in which case a feasible base policy can be obtained (if at all possible) by ordinary (unconstrained) DP, using as cost function either the constraint function or an appropriate weighted sum of the cost and constraint functions.

We now describe the rollout algorithm. It starts at stage 0 and sequentially proceeds to the last stage. At stage $k$, it maintains a partial trajectory

$$T_k = (x_0, \bar u_0, \bar x_1, \ldots, \bar u_{k-1}, \bar x_k) \tag{4.5}$$

that starts at the given initial state $x_0$, and is such that

$$\bar x_{i+1} = f_i(\bar x_i, \bar u_i), \quad \bar u_i \in U_i(\bar x_i), \quad i = 0, 1, \ldots, k-1,$$

where $\bar x_0 = x_0$. The algorithm starts with the partial trajectory $T_0$ that consists of just the initial state $x_0$.

For each $k = 0, 1, \ldots, N-1$, and given the current partial trajectory $T_k$, it forms for each control $u_k \in U_k(\bar x_k)$ a (complete) trajectory $T_k^c(u_k)$ by concatenating:

(1) the current partial trajectory $T_k$;
(2) the control $u_k$ and the next state $x_{k+1} = f_k(\bar x_k, u_k)$;
(3) the partial trajectory $H(x_{k+1})$ generated by the base policy starting from $x_{k+1}$.

Then, it forms the subset $\bar U_k(\bar x_k)$ of controls $u_k$ for which $T_k^c(u_k)$ is feasible, i.e., $T_k^c(u_k) \in C$. The algorithm then selects from $\bar U_k(\bar x_k)$ a control $\bar u_k$ that minimizes over $u_k \in \bar U_k(\bar x_k)$

$$g_k(\bar x_k, u_k) + \tilde J\big(f_k(\bar x_k, u_k)\big).$$

[We assume that the existence of a minimizing $\bar u_k$ is guaranteed when $\bar U_k(\bar x_k)$ is nonempty; this is true for example when $U_k(\bar x_k)$, and hence also $\bar U_k(\bar x_k)$, is finite.] The algorithm then forms the partial trajectory $T_{k+1}$ by adding $(\bar u_k, \bar x_{k+1})$ to $T_k$, where

$$\bar x_{k+1} = f_k(\bar x_k, \bar u_k).$$

For a more compact definition of the rollout algorithm, let us use the union symbol $\cup$ to denote the concatenation of two or more partial trajectories. With this notation, we have

$$T_k^c(u_k) = T_k \cup (u_k) \cup H\big(f_k(\bar x_k, u_k)\big), \tag{4.6}$$

$$\bar U_k(\bar x_k) = \big\{ u_k \mid u_k \in U_k(\bar x_k),\ T_k^c(u_k) \in C \big\}. \tag{4.7}$$

The rollout algorithm selects at stage $k$

$$\bar u_k \in \arg\min_{u_k \in \bar U_k(\bar x_k)} \big[ g_k(\bar x_k, u_k) + \tilde J\big(f_k(\bar x_k, u_k)\big) \big], \tag{4.8}$$

and extends the current partial trajectory by setting

$$T_{k+1} = T_k \cup (\bar u_k, \bar x_{k+1}), \tag{4.9}$$

where

$$\bar x_{k+1} = f_k(\bar x_k, \bar u_k). \tag{4.10}$$

Figure 1 provides an illustration of the algorithm.
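The following sketch transcribes Eqs. (4.6)-(4.10) directly, representing a trajectory as a list of states plus a list of controls. The interfaces (U, H, in_C) are hypothetical stand-ins for the control constraint sets, the base policy, and the constraint set $C$; it is assumed, as Proposition 4.1 below guarantees under sequential improvement, that a feasible control exists at every stage.

```python
def constrained_rollout(x0, N, f, g, g_N, U, H, in_C):
    """Rollout for constrained deterministic DP [cf. Eqs. (4.6)-(4.10)].
    U(k, x): iterable of admissible controls at stage k and state x.
    H(k, x): (states, controls) of the base policy's partial trajectory
             from x at stage k (H(N, x) should return ([x], [])).
    in_C(states, controls): membership test for the constraint set C."""
    def V(states, controls):             # trajectory cost, cf. Eq. (4.2)
        return g_N(states[-1]) + sum(
            g(k, x, u) for k, (x, u) in enumerate(zip(states, controls)))

    xs, us = [x0], []                    # current partial trajectory T_k
    for k in range(N):
        best = None
        for u in U(k, xs[-1]):
            x_next = f(k, xs[-1], u)
            hs, hu = H(k + 1, x_next)    # base policy completion from x_next
            cs, cu = xs + hs, us + [u] + hu   # complete trajectory T_k^c(u)
            if in_C(cs, cu):             # u belongs to the feasible set of Eq. (4.7)
                cost = V(cs, cu)
                if best is None or cost < best[0]:
                    best = (cost, u, x_next)
        _, u_bar, x_bar = best           # minimizing control of Eq. (4.8); assumes
        xs.append(x_bar)                 # the feasible set is nonempty (Prop. 4.1)
        us.append(u_bar)
    return xs, us                        # the final complete trajectory
```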

Note that it is possible that there is no control $u_k \in U_k(\bar x_k)$ that satisfies the constraint $T_k^c(u_k) \in C$ [i.e., the set $\bar U_k(\bar x_k)$ is empty], in which case the algorithm breaks down. We will show, however, that this cannot happen under some conditions that we now introduce.

Definition 4.1. The base policy is called sequentially consistent if whenever it generates a partial trajectory

$$(x_k, u_k, x_{k+1}, u_{k+1}, \ldots, u_{N-1}, x_N),$$

starting from state $x_k$, it also generates the partial trajectory

$$(x_{k+1}, u_{k+1}, x_{k+2}, u_{k+2}, \ldots, u_{N-1}, x_N),$$

starting from state $x_{k+1}$.

Thus, a base policy is sequentially consistent if, when started at intermediate states of a partial trajectory that it generates, it produces the same subsequent controls and states. If we denote by $\mu_k(x_k)$ the control applied by the base policy when at state $x_k$, then the sequence of control functions $\pi = \{\mu_0, \mu_1, \ldots, \mu_{N-1}\}$ can be viewed as a policy. The cost-to-go of this policy starting at state $x_k$, call it $J_k^\pi(x_k)$, and the cost $\tilde J(x_k)$ produced by the base policy starting from $x_k$ need not be equal. However, they are equal when the base policy is sequentially consistent, that is,

$$J_k^\pi(x_k) = \tilde J(x_k), \qquad \forall\, x_k,\, k,$$

because in this case, $\pi$ generates the same partial trajectory as the base policy, when starting from the same state. It follows that, for a sequentially consistent base policy, we have

$$g_k\big(x_k, \mu_k(x_k)\big) + \tilde J\big(f_k(x_k, \mu_k(x_k))\big) = \tilde J(x_k), \qquad \forall\, x_k,\, k. \tag{4.11}$$

Note that base policies that are of the 'greedy' type tend to be sequentially consistent, as explained in Ref. [13], Section 6.4. The next definition provides an extension of the notion of sequential consistency.

Definition 4.2. The base policy is called sequentially improving if for every $x_k$, the partial trajectory

$$H(x_k) = (x_k, u_k, x_{k+1}, u_{k+1}, \ldots, u_{N-1}, x_N)$$

has the following properties:

(1)

$$g_k(x_k, u_k) + \tilde J(x_{k+1}) \le \tilde J(x_k). \tag{4.12}$$

[Figure 1 appears here.] Fig. 1. Illustration of the rollout algorithm for constrained deterministic DP problems. At stage $k$, and given the current partial trajectory $T_k$ that starts at $x_0$ and ends at $\bar x_k$, it considers all possible next states $x_{k+1} = f_k(\bar x_k, u_k)$, $u_k \in U_k(\bar x_k)$, and runs the base policy starting at $x_{k+1}$. It then finds a control $\bar u_k \in U_k(\bar x_k)$ such that the corresponding trajectory $T_k^c(\bar u_k) = T_k \cup (\bar u_k) \cup H(\bar x_{k+1})$, where $\bar x_{k+1} = f_k(\bar x_k, \bar u_k)$, is feasible and has minimum cost, and extends $T_k$ by adding $(\bar u_k, \bar x_{k+1})$ to it.



(2) If for some partial trajectory of the form $(x_0, u_0, x_1, \ldots, u_{k-1}, x_k)$ we have

$$(x_0, u_0, x_1, \ldots, u_{k-1}, x_k) \cup H(x_k) \in C,$$

then

$$(x_0, u_0, x_1, \ldots, u_{k-1}, x_k, u_k, x_{k+1}) \cup H(x_{k+1}) \in C.$$

Properties (1) and (2) of the preceding definition provide some basic guarantees for the use of the control $u_k$ dictated by the base policy at state $x_k$. In particular, property (1) guarantees that when $u_k$ is used, the cost $\tilde J(x_{k+1})$ of the base policy starting from the new state $x_{k+1}$ will not be 'excessive', in the sense that it will be no more than the cost $\tilde J(x_k) - g_k(x_k, u_k)$ 'predicted' at state $x_k$. Property (2) guarantees that when $u_k$ is used, the base policy will maintain feasibility starting from the new state $x_{k+1}$.

Note that if the base policy is sequentially consistent, it is also sequentially improving [cf. Eqs. (4.11) and (4.12)]. An example of an interesting case where the sequential improvement property holds arises in MPC (see the next section). There, the base policy drives the system to a cost-free and absorbing state within a given number of stages, while minimizing the cost over these stages. It can be verified that for a system with a cost-free and absorbing state, a base policy of this type is sequentially improving. For another case where the base policy is sequentially improving, let the constraint set $C$ consist of all trajectories $T = (x_0, u_0, x_1, u_1, \ldots, u_{N-1}, x_N)$ such that

$$g_N^m(x_N) + \sum_{k=0}^{N-1} g_k^m(x_k, u_k) \le b_m, \qquad m = 1, \ldots, M,$$

[cf. Eq. (4.4)]. Let $\tilde C^m(x_k)$ be the value of the $m$-th constraint function corresponding to the partial trajectory $H(x_k) = (x_k, u_k, x_{k+1}, u_{k+1}, \ldots, u_{N-1}, x_N)$ generated by the base policy starting from $x_k$, that is,

$$\tilde C^m(x_k) = g_N^m(x_N) + \sum_{i=k}^{N-1} g_i^m(x_i, u_i), \qquad m = 1, \ldots, M.$$

Then, it can be seen that the base policy is sequentially improving if, in addition to Eq. (4.12), the partial trajectory $H(x_k)$ satisfies

$$g_k^m(x_k, u_k) + \tilde C^m(x_{k+1}) \le \tilde C^m(x_k), \qquad m = 1, \ldots, M. \tag{4.13}$$

The reason is that if a trajectory of the form

$$(x_0, u_0, x_1, \ldots, u_{k-1}, x_k) \cup H(x_k)$$

belongs to $C$, then we have

$$\sum_{i=0}^{k-1} g_i^m(x_i, u_i) + \tilde C^m(x_k) \le b_m, \qquad m = 1, \ldots, M,$$

and from Eq. (4.13), it follows that

$$\sum_{i=0}^{k-1} g_i^m(x_i, u_i) + g_k^m(x_k, u_k) + \tilde C^m(x_{k+1}) \le b_m, \qquad m = 1, \ldots, M.$$

This implies that the trajectory

$$(x_0, u_0, x_1, \ldots, u_{k-1}, x_k, u_k, x_{k+1}) \cup H(x_{k+1})$$

belongs to $C$, thereby verifying property (2) of the definition of sequential improvement.

The essence of our main result is contained in the following proposition. To state the result compactly, consider the partial trajectory

$$T_k = (x_0, \bar u_0, \bar x_1, \ldots, \bar u_{k-1}, \bar x_k)$$

maintained by the rollout algorithm after $k$ stages, the set of trajectories

$$\big\{ T_k^c(u_k) \mid u_k \in \bar U_k(\bar x_k) \big\}$$

that are feasible [cf. Eqs. (4.6) and (4.7)], and the trajectory $\bar T_k^c$ within this set that corresponds to the control $\bar u_k$ chosen by the rollout algorithm, that is,

$$\bar T_k^c = T_k^c(\bar u_k) = T_k \cup (\bar u_k) \cup H(\bar x_{k+1}), \tag{4.14}$$

where

$$\bar x_{k+1} = f_k(\bar x_k, \bar u_k).$$

The result asserts that as $k$ increases, the cost of the corresponding trajectories $\bar T_k^c$ cannot increase.

Proposition 4.1. Assume that the base policy is sequentially improving. Then for each $k$, the set $\bar U_k(\bar x_k)$ is nonempty, and

$$V\big(H(x_0)\big) \ge V(\bar T_0^c) \ge V(\bar T_1^c) \ge \cdots \ge V(\bar T_{N-1}^c) \ge V(\bar T_N^c),$$

where the trajectories $\bar T_k^c$ are given by Eq. (4.14) for all $k$.

Proof. Let the trajectory generated by the base policy starting from $x_0$ have the form

$$H(x_0) = (x_0, u_0', x_1', u_1', \ldots, u_{N-1}', x_N'),$$

and note that since $H(x_0) \in C$ by assumption, we have $u_0' \in \bar U_0(x_0)$. Thus, the set $\bar U_0(x_0)$ is nonempty. Also, we have

$$V\big(H(x_0)\big) = \tilde J(x_0) \ge g_0(x_0, u_0') + \tilde J\big(f_0(x_0, u_0')\big),$$

where the inequality follows from the sequential improvement assumption [cf. Eq. (4.12)]. Since

$$\bar u_0 \in \arg\min_{u_0 \in \bar U_0(x_0)} \big[ g_0(x_0, u_0) + \tilde J\big(f_0(x_0, u_0)\big) \big],$$

it follows by combining the preceding two relations and the definition of $\bar T_0^c$ [cf. Eq. (4.14)] that

$$V\big(H(x_0)\big) \ge g_0(x_0, \bar u_0) + \tilde J(\bar x_1) = V(\bar T_0^c),$$

while $\bar T_0^c$ is feasible by the definition of $\bar U_0(x_0)$.

The preceding argument can be repeated for the next stage, by replacing $x_0$ with $\bar x_1$, and $H(x_0)$ with $\bar T_0^c$. In particular, let $\bar T_0^c$ have the form

$$\bar T_0^c = (x_0, \bar u_0, \bar x_1, u_1', x_2', u_2', \ldots, u_{N-1}', x_N').$$

Since $\bar T_0^c$ is feasible, as noted earlier, we have $u_1' \in \bar U_1(\bar x_1)$, so that $\bar U_1(\bar x_1)$ is nonempty. Also, we have

$$V(\bar T_0^c) = g_0(x_0, \bar u_0) + \tilde J(\bar x_1) \ge g_0(x_0, \bar u_0) + g_1(\bar x_1, u_1') + \tilde J\big(f_1(\bar x_1, u_1')\big),$$

where the inequality follows from the sequential improvement assumption. Since

$$\bar u_1 \in \arg\min_{u_1 \in \bar U_1(\bar x_1)} \big[ g_1(\bar x_1, u_1) + \tilde J\big(f_1(\bar x_1, u_1)\big) \big],$$

it follows by combining the preceding two relations and the definition of $\bar T_1^c$ that

$$V(\bar T_0^c) \ge g_0(x_0, \bar u_0) + g_1(\bar x_1, \bar u_1) + \tilde J(\bar x_2) = V(\bar T_1^c),$$

while $\bar T_1^c$ is feasible by the definition of $\bar U_1(\bar x_1)$.

Similarly, the argument can be successively repeated for every $k$, to verify that $\bar U_k(\bar x_k)$ is nonempty and that $V(\bar T_{k-1}^c) \ge V(\bar T_k^c)$ for all $k$. ∎

Proposition 4.1 implies that the rollout algorithm generates at each stage $k$ a feasible trajectory that is no worse than its predecessor in terms of cost. Since the starting trajectory is $H(x_0)$ and the final trajectory is

$$\bar T_N^c = (x_0, \bar u_0, \bar x_1, \ldots, \bar u_{N-1}, \bar x_N),$$

we have the following.

Proposition 4.2. If the base policy is sequentially improving, then the rollout algorithm produces a trajectory $(x_0, \bar u_0, \bar x_1, \ldots, \bar u_{N-1}, \bar x_N)$ that is feasible and has cost that is no larger than the cost of the trajectory generated by the base policy starting from $x_0$.

By a simple extension of the argument used to show Proposition 4.2, we can also show that when the base policy is sequentially improving, the trajectory $(x_0, \bar u_0, \bar x_1, \ldots, \bar u_{N-1}, \bar x_N)$ produced by the rollout algorithm satisfies, for all $k = 1, \ldots, N-1$,

$$\tilde J(\bar x_k) \ge g_N(\bar x_N) + \sum_{i=k}^{N-1} g_i(\bar x_i, \bar u_i).$$

In words, the cost-to-go of the rollout algorithm starting from state $\bar x_k$ is no worse than the cost-to-go of the base policy starting from $\bar x_k$. Thus, $\tilde J(\bar x_k)$ provides a readily computable upper bound to the cost-to-go of the rollout algorithm starting from $\bar x_k$.

We will now discuss some variations and extensions of the rollout algorithm for constrained deterministic problems. Let us consider the case where the sequential improvement assumption is not satisfied. Then, even if the base policy generates a feasible trajectory starting from the initial state $x_0$, it may happen that, given the partial trajectory $T_k$, the set of controls $\bar U_k(\bar x_k)$ that corresponds to feasible trajectories $T_k^c(u_k)$ [cf. Eq. (4.7)] is empty, in which case the rollout algorithm cannot extend the trajectory $T_k$ further. To bypass this difficulty, we propose a modification of the algorithm, called the fortified rollout algorithm, which is an extension of an algorithm given in Ref. [3] for the case of an unconstrained DP problem (see also Ref. [13], Section 6.4).

The fortified rollout algorithm, in addition to the partial trajectory

$$T_k = (x_0, \bar u_0, \bar x_1, \ldots, \bar u_{k-1}, \bar x_k),$$

maintains a (complete) trajectory $\hat T$, which is feasible and agrees with $T_k$ up to state $\bar x_k$, i.e., $\hat T$ has the form

$$\hat T = (x_0, \bar u_0, \bar x_1, \ldots, \bar u_{k-1}, \bar x_k, \hat u_k, \hat x_{k+1}, \ldots, \hat u_{N-1}, \hat x_N),$$

for some $\hat u_k, \hat x_{k+1}, \ldots, \hat u_{N-1}, \hat x_N$ such that $\hat x_{i+1} = f_i(\hat x_i, \hat u_i)$ for $i = k, \ldots, N-1$, with $\hat x_k = \bar x_k$. Initially, $T_0$ consists of the initial state $x_0$, and $\hat T$ is the trajectory $H(x_0)$ generated by the base policy starting from $x_0$. At stage $k$, the algorithm forms the subset $\tilde U_k(\bar x_k)$ of controls $u_k \in U_k(\bar x_k)$ such that $T_k^c(u_k) \in C$ and

$$\sum_{i=0}^{k-1} g_i(\bar x_i, \bar u_i) + g_k(\bar x_k, u_k) + \tilde J\big(f_k(\bar x_k, u_k)\big) \le V(\hat T).$$

There are two cases to consider:

(1) The set $\tilde U_k(\bar x_k)$ is nonempty. Then, the algorithm selects from $\tilde U_k(\bar x_k)$ a control $\bar u_k$ that minimizes over $u_k \in \tilde U_k(\bar x_k)$

$$g_k(\bar x_k, u_k) + \tilde J\big(f_k(\bar x_k, u_k)\big).$$

It then forms the partial trajectory

$$T_{k+1} = T_k \cup (\bar u_k, \bar x_{k+1}),$$

where $\bar x_{k+1} = f_k(\bar x_k, \bar u_k)$, and replaces $\hat T$ with the trajectory

$$T_k \cup (\bar u_k) \cup H(\bar x_{k+1}).$$

(2) The set $\tilde U_k(\bar x_k)$ is empty. Then, the algorithm forms the partial trajectory $T_{k+1}$ by concatenating $T_k$ with the control/state pair $(\hat u_k, \hat x_{k+1})$ subsequent to $\bar x_k$ in $\hat T$, and leaves $\hat T$ unchanged.

It can be seen with a little thought that the fortified rollout algorithm will follow the initial trajectory $\hat T$, the one generated by the base policy starting from $x_0$, up to the point where it discovers a new feasible trajectory with smaller cost to replace $\hat T$. Similarly, the new trajectory $\hat T$ may be subsequently replaced by another feasible trajectory with smaller cost, etc. Note that if the base policy is sequentially improving, the fortified rollout algorithm generates the same trajectory as the (nonfortified) rollout algorithm given earlier. However, it can be verified, by modifying the proofs of Propositions 4.1 and 4.2, that even when the base policy is not sequentially improving, the fortified rollout algorithm will generate a trajectory that is feasible and has cost that is no worse than the cost of the trajectory generated by the base policy starting from $x_0$.
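A compact sketch of the fortified variant follows, using the same hypothetical interfaces as the constrained_rollout sketch earlier; the maintained complete feasible trajectory (the incumbent) is the fallback whenever no control passes both the feasibility test and the cost test.

```python
def fortified_rollout(x0, N, f, g, g_N, U, H, in_C):
    """Fortified rollout sketch: alongside T_k it keeps a complete feasible
    trajectory (best_xs, best_us); at each stage it either improves on that
    trajectory [case (1)] or follows it for one more step [case (2)]."""
    def V(xs, us):
        return g_N(xs[-1]) + sum(
            g(k, x, u) for k, (x, u) in enumerate(zip(xs, us)))

    best_xs, best_us = H(0, x0)          # incumbent: the base trajectory H(x0),
    xs, us = [x0], []                    # assumed feasible (H(x0) in C)
    for k in range(N):
        best = None
        for u in U(k, xs[-1]):
            x_next = f(k, xs[-1], u)
            hs, hu = H(k + 1, x_next)
            cs, cu = xs + hs, us + [u] + hu
            # feasibility test plus the cost test against the incumbent
            if in_C(cs, cu) and V(cs, cu) <= V(best_xs, best_us):
                if best is None or V(cs, cu) < best[0]:
                    best = (V(cs, cu), u, x_next, cs, cu)
        if best is not None:             # case (1): adopt the improving control
            _, u_bar, x_bar, best_xs, best_us = best
        else:                            # case (2): follow the incumbent
            u_bar, x_bar = best_us[k], best_xs[k + 1]
        xs.append(x_bar)
        us.append(u_bar)
    return best_xs, best_us
```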

We finally consider the case where $C$ is specified by the time-additive constraints

$$g_N^m(x_N) + \sum_{k=0}^{N-1} g_k^m(x_k, u_k) \le b_m, \qquad m = 1, \ldots, M, \tag{4.15}$$

[cf. Eq. (4.4)], representing restrictions on $M$ resources. Then, it is natural for the base policy to take into account the amounts of resources already expended. One way to do this is by augmenting the state $x_k$ to include the quantities

$$y_k^m = \sum_{i=0}^{k-1} g_i^m(x_i, u_i), \qquad m = 1, \ldots, M, \quad k = 1, \ldots, N-1,$$

$$y_0^m = 0,$$

$$y_N^m = g_N^m\big(f_{N-1}(x_{N-1}, u_{N-1})\big) + \sum_{i=0}^{N-1} g_i^m(x_i, u_i).$$

In other words, we may reformulate the state of the system to be $(x_k, y_k^1, \ldots, y_k^M)$, and to evolve according to the new system equation

$$x_{k+1} = f_k(x_k, u_k), \qquad k = 0, \ldots, N-1,$$

$$y_{k+1}^m = y_k^m + g_k^m(x_k, u_k), \qquad m = 1, \ldots, M, \quad k = 0, \ldots, N-2,$$

$$y_0^m = 0, \qquad y_N^m = y_{N-1}^m + g_N^m\big(f_{N-1}(x_{N-1}, u_{N-1})\big) + g_{N-1}^m(x_{N-1}, u_{N-1}).$$

The time-additive constraints (4.15) are then reformulated as $y_N^m \le b_m$, $m = 1, \ldots, M$. The rollout algorithm described earlier, when used in conjunction with this reformulated problem, allows for base policies whose generated trajectories depend not only on $x_k$ but also on $y_k^1, \ldots, y_k^M$, and brings to bear Propositions 4.1 and 4.2.
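A minimal sketch of this reformulation, under hypothetical interfaces: the augmented state carries the accumulated constraint values, so a base policy can condition its choices on the resources already expended.

```python
def make_augmented_dynamics(f, g_m, M):
    """Sketch of the state augmentation for the constraints (4.15): the new
    state is (x_k, (y_k^1, ..., y_k^M)), with each accumulator y^m updated by
    the constraint function g_m[m](k, x, u) (hypothetical interfaces)."""
    def f_aug(k, state, u):
        x, y = state                     # y is a tuple of M accumulators
        y_next = tuple(y[m] + g_m[m](k, x, u) for m in range(M))
        return (f(k, x, u), y_next)
    return f_aug

# Usage sketch: start from (x0, (0.0,) * M); after the last stage, add the
# terminal constraint cost g_N^m(x_N) to each accumulator and check the
# reformulated feasibility conditions y_N^m <= b_m, m = 1, ..., M.
```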

4.3. Open-Loop Feedback Control

We will now discuss a classical suboptimal control scheme, which turns out to be a special case of the rollout algorithm. We have so far focused on problems of perfect state information, where the state $x_k$ is observed exactly. In an imperfect state information problem, the control at stage $k$ is applied with knowledge of the information vector

$$I_k = (z_1, \ldots, z_k, u_0, \ldots, u_{k-1}),$$

where for all $k$, $z_k$ is an observation obtained at stage $k$, and related to the current state $x_k$ and the preceding control $u_{k-1}$ through a given conditional probability distribution. Following the observation $z_k$, a control $u_k$ is chosen by the controller with knowledge of the information $I_k$ and the associated conditional probability distribution $P_{x_k|I_k}$. Then, a cost $g_k(x_k, u_k)$ is incurred, where $x_k$ is the current (hidden) state. While problems of imperfect state information are very hard to solve optimally, they can be reduced to problems of perfect state information whose state is the conditional distribution $P_{x_k|I_k}$ (see, e.g., [13]).

Generally, in a problem with imperfect state information, the performance of the optimal policy improves when extra information is available. However, the use of this information may make the DP calculation of the optimal policy intractable. This motivates a suboptimal policy, based on a more tractable computation that in part ignores the availability of extra information. This policy was first suggested by Dreyfus [21], and is known as the open-loop feedback controller (OLFC for short). It uses the current information vector $I_k$ to determine $P_{x_k|I_k}$, but it calculates the control $u_k$ as if no further measurements will be received, by using an open-loop optimization over the future evolution of the system. In particular, $u_k$ is determined as follows:

(1) Given the information vector $I_k$, compute the conditional probability distribution $P_{x_k|I_k}$ (in the case of perfect state information, where $I_k$ includes $x_k$, this step is unnecessary).

(2) Find a control sequence $\{\bar{u}_k, \bar{u}_{k+1}, \ldots, \bar{u}_{N-1}\}$ that solves the open-loop problem of minimizing
\[
E\Bigl\{ g_N(x_N) + \sum_{i=k}^{N-1} g_i(x_i, u_i, w_i) \,\Big|\, I_k \Bigr\}
\]
subject to the constraints
\[
x_{i+1} = f_i(x_i, u_i, w_i), \qquad u_i \in U_i, \quad i = k, k+1, \ldots, N-1.
\]

(3) Apply the control input
\[
\bar{\mu}_k(I_k) = \bar{u}_k.
\]

Thus the OLFC uses at time $k$ the new measurement $z_k$ to calculate the conditional probability distribution $P_{x_k|I_k}$. However, it selects the control input as if no further measurements will be taken.
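In code, one OLFC stage might look as follows; `belief_update` and `open_loop_plan` are hypothetical problem-specific routines implementing steps (1) and (2) above.

```python
def olfc_control(I_k, belief_update, open_loop_plan):
    """One stage of the open-loop feedback controller (OLFC)."""
    p_k = belief_update(I_k)    # step (1): conditional distribution of x_k given I_k
    plan = open_loop_plan(p_k)  # step (2): open-loop optimization that ignores
                                #           the arrival of future measurements
    return plan[0]              # step (3): apply only the first control
```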

In any suboptimal control scheme, one would like to be assured that measurements are advantageously used. By this we mean that the scheme performs at least as well as any open-loop policy that uses a sequence of controls that is independent of the values of the measurements received. An optimal open-loop policy can be obtained by finding a sequence $\{u_0^*, u_1^*, \ldots, u_{N-1}^*\}$ that minimizes
\[
J(u_0, u_1, \ldots, u_{N-1}) = E\Bigl\{ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) \Bigr\}
\]
subject to the constraints
\[
x_{k+1} = f_k(x_k, u_k, w_k), \qquad u_k \in U_k, \quad k = 0, 1, \ldots, N-1.
\]
A nice property of the OLFC is that it performs at least as well as an optimal open-loop policy, as shown by the following proposition. In contrast, the CEC does not have this property (for example, for a one-stage problem, the optimal open-loop controller and the OLFC are both optimal, but the CEC may be strictly suboptimal).

Proposition 4.3. The cost $\bar{J}$ corresponding to an OLFC satisfies
\[
\bar{J} \le J_0^*,
\]
where $J_0^*$ is the cost corresponding to an optimal open-loop policy.

The preceding proposition shows that the OLFC uses the measurements advantageously, even though it selects at each period the present control input as if no further measurements will be taken in the future. A proof from first principles was given in Bertsekas [11] (this proof is reproduced in Ref. [13]; it was extended by White and Harrington [49]). However, it is much simpler to argue that the proposition is a special case of the cost improvement property of the rollout algorithm (cf. Example 3.1), based on the following observation.

Let us view the given problem of imperfect state information as a problem of perfect state information whose 'state' at the $k$-th stage is the distribution $P_{x_k|I_k}$. Let us also view the optimal open-loop policy starting at the next 'state' $P_{x_{k+1}|I_{k+1}}$ as a base policy. Then it can be seen that the OLFC is just the rollout algorithm corresponding to this base policy. According to the generic cost improvement property of rollout algorithms, the performance of the OLFC is no worse than the performance of the optimal open-loop policy (the base policy) starting from the initial state $P_{x_0|I_0}$. This is precisely the statement of Proposition 4.3.

We finally note that a form of suboptimal control that is intermediate between the optimal feedback controller and the OLFC is provided by a generalization of the OLFC, called the partial OLFC (POLFC); see [13]. Similar to the OLFC, this controller uses past measurements to compute $P_{x_k|I_k}$, but calculates the control input on the basis that some (but not necessarily all) of the measurements will in fact be taken in the future, and the remaining measurements will not be taken. This method often allows one to deal with those measurements that are troublesome and complicate the solution, while taking into account the future availability of other measurements that can be reasonably dealt with. The POLFC can also be viewed as a form of rollout policy whose base policy takes into account some (but not all) of the future measurements, rather than being open-loop.

5. Model Predictive Control

MPC was motivated by the desire to introduce nonlinearities and constraints into the linear-quadratic control framework, while obtaining a suboptimal but stable closed-loop system. We will describe MPC for the more general case of a time-invariant nonlinear deterministic system and nonquadratic cost. The state and control belong to the Euclidean spaces $\Re^n$ and $\Re^m$, respectively, and may be constrained. In particular, the system is
\[
x_{k+1} = f(x_k, u_k), \qquad k = 0, 1, \ldots,
\]
and the cost per stage is
\[
g(x_k, u_k), \qquad k = 0, 1, \ldots,
\]
and is assumed nonnegative for all $x_k$ and $u_k$. We impose state and control constraints
\[
x_k \in X, \qquad u_k \in U(x_k), \qquad k = 0, 1, \ldots,
\]
and we assume that the set $X$ contains the origin of $\Re^n$. Furthermore, if the system is at the origin, it can be kept there at no cost with control equal to 0, that is,
\[
0 \in U(0), \qquad f(0, 0) = 0, \qquad g(0, 0) = 0.
\]
We want to derive a stationary feedback controller that applies control $\bar{\mu}(x)$ at state $x$, and is stable in the sense that $x_k \to 0$ and $\bar{\mu}(x_k) \to 0$ for all initial states $x_0 \in X$. Furthermore, we require that for all initial states $x_0 \in X$, the state of the closed-loop system
\[
x_{k+1} = f\bigl(x_k, \bar{\mu}(x_k)\bigr)
\]
satisfies the state and control constraints. Finally, to satisfy the stability requirement through a cost minimization, we require that the total cost of the feedback controller $\bar{\mu}$ over an infinite number of stages is finite:
\[
\sum_{k=0}^{\infty} g\bigl(x_k, \bar{\mu}(x_k)\bigr) < \infty, \tag{5.1}
\]
and that the nonnegative function $g$ is such that the preceding relation implies that $x_k \to 0$ and $\bar{\mu}(x_k) \to 0$. A primary example is a quadratic cost per stage,
\[
g(x, u) = x'Qx + u'Ru,
\]
where the matrices $Q$ and $R$ are positive definite and symmetric.

In order for such a controller to exist, it is evidently sufficient [in view of the assumptions $f(0,0) = 0$ and $g(0,0) = 0$] that there exists a positive integer $m$ such that for every initial state $x_0 \in X$, one can find a sequence of controls $u_k$, $k = 0, 1, \ldots, m-1$, which drives the state $x_m$ of the system to 0 at time $m$, while keeping all the preceding states $x_1, x_2, \ldots, x_{m-1}$ within $X$ and satisfying the control constraints $u_0 \in U(x_0), \ldots, u_{m-1} \in U(x_{m-1})$. We refer to this as the constrained controllability assumption. In practical applications, the constrained controllability assumption can often be checked easily. Alternatively, the state and control constraints can be constructed in a way that the assumption is satisfied, using the methodology of reachability of target tubes; see the end of this section.

Let us now describe a form of MPC under the constrained controllability assumption. At each stage $k$ and state $x_k \in X$, it solves an $m$-stage deterministic optimal control problem involving the same cost per stage and constraints, and the requirement that the state after $m$ stages be exactly equal to 0. This is the problem of minimizing
\[
\sum_{i=k}^{k+m-1} g(x_i, u_i), \tag{5.2}
\]
subject to the system equation constraints
\[
x_{i+1} = f(x_i, u_i), \qquad i = k, k+1, \ldots, k+m-1, \tag{5.3}
\]
the state and control constraints
\[
x_i \in X, \quad u_i \in U(x_i), \qquad i = k, k+1, \ldots, k+m-1, \tag{5.4}
\]
and the terminal state constraint
\[
x_{k+m} = 0. \tag{5.5}
\]
By the constrained controllability assumption, this problem has a feasible solution. Let
\[
\{\bar{u}_k, \bar{u}_{k+1}, \ldots, \bar{u}_{k+m-1}\}
\]
be a corresponding optimal control sequence. The MPC applies at stage $k$ the first component of this sequence,
\[
\bar{\mu}(x_k) = \bar{u}_k,
\]
and discards the remaining components.
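In code, the receding-horizon mechanism is simple; all the work is hidden in the $m$-stage trajectory optimizer, represented below by the hypothetical routine `solve_m_stage`.

```python
def mpc_control(x, m, solve_m_stage):
    """One MPC stage: solve the m-stage problem (5.2)-(5.5) from the
    current state and keep only the first control of the minimizer."""
    u_bar = solve_m_stage(x, m)  # feasible by constrained controllability
    return u_bar[0]              # discard u_bar[1:], re-optimize next stage

def closed_loop(x0, num_steps, m, f, solve_m_stage):
    """Simulate the closed-loop system x_{k+1} = f(x_k, mu(x_k)) under MPC."""
    x, trajectory = x0, [x0]
    for _ in range(num_steps):
        x = f(x, mpc_control(x, m, solve_m_stage))
        trajectory.append(x)
    return trajectory
```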

We now show that the MPC satisfies the stability condition (5.1). Let $x_0, u_0, x_1, u_1, \ldots$ be the state and control sequence generated by MPC:
\[
u_k = \bar{\mu}(x_k), \qquad x_{k+1} = f\bigl(x_k, \bar{\mu}(x_k)\bigr), \qquad k = 0, 1, \ldots.
\]
Denote by $\hat{J}(x)$ the optimal cost of the $m$-stage problem solved by MPC when at a state $x \in X$; cf. Eqs (5.2)–(5.5). Let also $\tilde{J}(x)$ be the optimal cost starting at $x$ of a corresponding $(m-1)$-stage problem, i.e. the optimal value of the cost
\[
\sum_{k=0}^{m-2} g(x_k, u_k),
\]
where $x_0 = x$, subject to the constraints
\[
x_k \in X, \quad u_k \in U(x_k), \qquad k = 0, 1, \ldots, m-2,
\]


and

\[
x_{m-1} = 0.
\]
[For states $x \in X$ for which this problem does not have a feasible solution, we write $\tilde{J}(x) = \infty$.] Having one less stage at our disposal to drive the state to 0 cannot decrease the optimal cost, so we have for all $x \in X$
\[
\hat{J}(x) \le \tilde{J}(x). \tag{5.6}
\]

From the definitions of $\hat{J}$ and $\tilde{J}$, we have for all $k$,
\[
\min_{u \in U(x_k)} \bigl[ g(x_k, u) + \tilde{J}\bigl(f(x_k, u)\bigr) \bigr] = g(x_k, u_k) + \tilde{J}(x_{k+1}) = \hat{J}(x_k), \tag{5.7}
\]
so using Eq. (5.6) with $x = x_{k+1}$, we see that
\[
g(x_k, u_k) + \hat{J}(x_{k+1}) \le \hat{J}(x_k), \qquad k = 0, 1, \ldots. \tag{5.8}
\]
Adding this equation for all $k$ in the range $[0, K]$, where $K = 0, 1, \ldots$, we obtain
\[
\hat{J}(x_{K+1}) + \sum_{k=0}^{K} g(x_k, u_k) \le \hat{J}(x_0).
\]
Since $\hat{J}(x_{K+1}) \ge 0$ (in view of the nonnegativity of $g$), it follows that
\[
\sum_{k=0}^{K} g(x_k, u_k) \le \hat{J}(x_0), \qquad K = 0, 1, \ldots, \tag{5.9}
\]
and taking the limit as $K \to \infty$,
\[
\sum_{k=0}^{\infty} g(x_k, u_k) \le \hat{J}(x_0) < \infty.
\]

This shows the stability condition (5.1).

We note that the one-step lookahead function $\tilde{J}$ implicitly used by MPC [cf. Eq. (5.7)] is the cost corresponding to a certain base policy. This policy is defined by the open-loop control sequence that drives the state to 0 after $m-1$ stages and keeps it at 0 thereafter, while observing the state and control constraints $x_k \in X$ and $u_k \in U(x_k)$, and minimizing the cost. Thus, we can also view MPC as a rollout algorithm with the base policy defined by the sequence just described. This base policy may not be sequentially consistent, but it is sequentially improving (cf. Definitions 4.1 and 4.2). The reason is that if $(x_{k+1}, u_{k+1}, x_{k+2}, u_{k+2}, \ldots, x_{k+m-1}, u_{k+m-1}, 0)$ is the sequence generated by the $(m-1)$-stage base policy starting from state $x_{k+1}$, we have
\[
g(x_{k+1}, u_{k+1}) + \tilde{J}(x_{k+2}) \le \tilde{J}(x_{k+1});
\]
this follows by an argument similar to the one we used to show Eq. (5.8), based on Eqs (5.6) and (5.7). Thus, property (1) of Definition 4.2 is satisfied, while the feasibility property (2) of that definition is also satisfied because of the structure of the constraints of the problem solved at each stage by MPC. In fact, the stability property of MPC is a special case of the cost improvement property of rollout algorithms that employ a sequentially improving base policy (cf. Proposition 4.2), which, under our assumptions on the cost per stage $g$, implies that if the base policy results in a stable closed-loop system, the same is true for the corresponding rollout algorithm. Note that the constrained controllability assumption is used to satisfy the feasibility assumption on the base policy within the rollout context of Section 4.

The connection between MPC and rollout is conceptually useful, and may lead to new algorithms for specialized or different types of control problems. For example, it can be exploited to derive MPC schemes involving more complex constraints, in analogy with the rollout algorithm given in Section 4.

Looking back at the argument that we used to prove stability [cf. Eqs (5.6)–(5.9)], we see that to obtain the stability property (5.1), we do not need to solve the $m$-stage optimal control problem (5.2)–(5.5). It is sufficient instead to apply at any stage $k$ and state $x_k \in X$ a control $u_k = \bar{u}_k$, where $\bar{u}_k$ is the first element of a sequence $\{\bar{u}_k, \bar{u}_{k+1}, \ldots, \bar{u}_{k+m-1}\}$ generated by an $m$-stage base policy starting at $x_k$ ($m$ may depend on $x_k$, but for simplicity we do not show this dependence). This control sequence, together with the corresponding generated state sequence $\{x_k, x_{k+1}, \ldots, x_{k+m}\}$, must have the following two properties (see the sketch after this discussion):

(1) $x_{k+1} \in X$.

(2) $g(x_k, \bar{u}_k) + \hat{J}(x_{k+1}) \le \hat{J}(x_k)$, where $\hat{J}(x)$ is some function of $x$ that is nonnegative over the set $X$ [for example, $\hat{J}(x)$ can be the cost incurred by the base policy over $m$ stages starting from $x$, cf. Eq. (5.8)].

Property (2) above is a Lyapunov stability-type condition, where $\hat{J}(x)$ is the Lyapunov function. Note that property (1) is weaker than the property $x_i \in X$ for all $i = k+1, \ldots, k+m-1$ satisfied by the sequence $\{\bar{u}_k, \bar{u}_{k+1}, \ldots, \bar{u}_{k+m-1}\}$ that is generated by MPC at stage $k$. However, the latter property is also needed to show property (2) [cf. Eq. (5.8)] via the argument of Eqs (5.6) and (5.7).
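In code, this sufficient condition is a direct check; `f`, `g`, `J_hat`, and `in_X` are hypothetical problem data, with `J_hat` the nonnegative function playing the role of the Lyapunov function $\hat{J}$.

```python
def is_stabilizing_first_control(x, u_seq, f, g, J_hat, in_X):
    """Check properties (1) and (2) for the first control of a base sequence."""
    x_next = f(x, u_seq[0])
    in_constraint_set = in_X(x_next)                               # property (1)
    lyapunov_descent = g(x, u_seq[0]) + J_hat(x_next) <= J_hat(x)  # property (2)
    return in_constraint_set and lyapunov_descent
```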

The MPC scheme that we have described is just the starting point for a broad methodology with many variations. For example, the $m$-stage problem solved by MPC at each stage may be modified so that instead of requiring that the state be 0 after $m$ stages, one may use a large penalty for the state being nonzero after $m$ stages. Then, the preceding analysis goes through, as long as the terminal penalty is chosen so that Eq. (5.6) is satisfied. In another variant, instead of aiming to drive the state to 0 after $m$ stages, one aims to reach a sufficiently small neighborhood of the origin, within which a stabilizing controller, designed by other methods, may be used. This variant is also well suited for taking into account disturbances described by set membership, as we now proceed to explain.

5.1. MPC with Set-Membership Disturbances

To extend the MPC methodology to the case where there are disturbances $w_k$ in the system equation
\[
x_{k+1} = f(x_k, u_k, w_k),
\]
we must first modify the stability objective. The reason is that in the presence of disturbances, the stability condition (5.1) is impossible to meet. A reasonable alternative is to introduce a set-membership constraint $w_k \in W(x_k, u_k)$ for the disturbance and a target set $T$ for the state, and to require that the controller specified by MPC drives the state to $T$ with finite cost.

To formulate the MPC, we assume that $T \subset X$, and that once the system state enters $T$, we will use some control law $\tilde{\mu}$ that keeps the state within $T$ for all possible values of the disturbances, that is,
\[
f\bigl(x, \tilde{\mu}(x), w\bigr) \in T, \qquad \text{for all } x \in T, \; w \in W\bigl(x, \tilde{\mu}(x)\bigr). \tag{5.10}
\]
Finding such a target set $T$ and control law $\tilde{\mu}$ can be addressed via the methodology of infinite time reachability, and will be discussed at the end of this section.

Once we specify the target set $T$, we can view it essentially as a cost-free and absorbing state, similar to our view of the origin in the earlier deterministic context. Consistent with this interpretation, we introduce the stage cost function
\[
\bar{g}(x, u) = \begin{cases} g(x, u) & \text{if } x \notin T, \\ 0 & \text{if } x \in T, \end{cases}
\]
where $g$ is a nonnegative function with the property that for any sequence $\{x_0, u_0, x_1, u_1, \ldots\}$ generated by the system, the condition
\[
\sum_{k=0}^{\infty} g(x_k, u_k) < \infty
\]
implies that $x_k \to 0$ and $u_k \to 0$ [for example, $g$ can be quadratic of the form $g(x, u) = x'Qx + u'Ru$, where $Q$ and $R$ are positive definite, symmetric matrices].

The MPC is now defined as follows: at each stage $k$ and state $x_k \in X$ with $x_k \notin T$, it solves the $m$-stage minimax control problem of finding a policy $\bar{\mu}_k, \bar{\mu}_{k+1}, \ldots, \bar{\mu}_{k+m-1}$ that minimizes
\[
\max_{\substack{w_i \in W(x_i, \bar{\mu}_i(x_i)) \\ i = k, k+1, \ldots, k+m-1}} \; \sum_{i=k}^{k+m-1} \bar{g}\bigl(x_i, \bar{\mu}_i(x_i)\bigr),
\]
subject to the system equation constraints
\[
x_{i+1} = f\bigl(x_i, \bar{\mu}_i(x_i), w_i\bigr), \qquad i = k, k+1, \ldots, k+m-1,
\]
the control and state constraints
\[
x_i \in X, \quad \bar{\mu}_i(x_i) \in U(x_i), \qquad i = k, k+1, \ldots, k+m-1,
\]
and the terminal state constraint
\[
x_i \in T, \qquad \text{for some } i \in [k+1, k+m].
\]
These constraints must be satisfied for all disturbance sequences satisfying
\[
w_i \in W\bigl(x_i, \bar{\mu}_i(x_i)\bigr), \qquad i = k, k+1, \ldots, k+m-1.
\]
The MPC applies at stage $k$ the first component of the policy $\bar{\mu}_k, \bar{\mu}_{k+1}, \ldots, \bar{\mu}_{k+m-1}$ thus obtained,
\[
\bar{\mu}(x_k) = \bar{\mu}_k(x_k),
\]
and discards the remaining components. For states $x$ within the target set $T$, the MPC applies the control $\tilde{\mu}(x)$ that keeps the state within $T$, as per Eq. (5.10), at no further cost [$\bar{\mu}(x) = \tilde{\mu}(x)$ for $x \in T$].
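For intuition, the sketch below solves this $m$-stage minimax problem by DP in the special case where the state, control, and disturbance sets are finite; it is an illustrative discretization under assumed data (`states`, `f`, `g_bar`, `U`, `W`, `in_X`, `in_T`), not the general continuous formulation.

```python
import math

def minimax_mpc_control(x0, m, states, f, g_bar, U, W, in_X, in_T):
    """Worst-case DP for the m-stage minimax problem at a state x0 not in T.

    After j sweeps, V[x] is the worst-case cost of reaching the target
    set T within j stages while keeping the state in X; T is treated as
    cost-free and absorbing, matching the modified stage cost g-bar.
    """
    V = {x: (0.0 if in_T(x) else math.inf) for x in states}
    for _ in range(m - 1):              # (m-1)-stage values, cf. (5.13)-(5.17)
        V_new = {}
        for x in states:
            if in_T(x):
                V_new[x] = 0.0
                continue
            best = math.inf
            for u in U(x):
                # Worst case over W(x, u); a control under which the state
                # can be pushed outside X is treated as infeasible.
                worst = max(
                    (V[f(x, u, w)] if in_X(f(x, u, w)) else math.inf)
                    for w in W(x, u))
                best = min(best, g_bar(x, u) + worst)
            V_new[x] = best
        V = V_new
    # MPC applies the first component: a one-step minimax lookahead at x0.
    def q(u):
        return g_bar(x0, u) + max(
            (V[f(x0, u, w)] if in_X(f(x0, u, w)) else math.inf)
            for w in W(x0, u))
    return min(U(x0), key=q)
```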

We make a constrained controllability assumption, namely that the $m$-stage minimax problem solved at each stage by MPC has a feasible solution for all $x_k \in X$ with $x_k \notin T$ (this assumption can be checked using the target tube reachability methods described at the end of this section). Note that this problem is a potentially difficult minimax control problem, which generally must be solved by DP (see e.g., Section 1.6 of Ref. [13]).


The stability analysis of MPC (in the modified sense of reaching the target set $T$ with finite cost, for all possible disturbance values) is similar to the one given earlier in the absence of disturbances. Furthermore, we can view MPC in the presence of disturbances as a special case of a rollout algorithm, suitably modified to take account of the set-membership description of the disturbances. Let us provide the details of this analysis.

We first show that $\bar{\mu}$ attains reachability of the target tube $\{X, X, \ldots\}$ in the sense that
\[
f\bigl(x, \bar{\mu}(x), w\bigr) \in X, \qquad \text{for all } x \in X \text{ and } w \in W\bigl(x, \bar{\mu}(x)\bigr). \tag{5.11}
\]
Indeed, for $x \in T$, MPC applies the control $\tilde{\mu}(x)$, which by assumption keeps the next state within $T$, and hence also within $X$ (since $T \subset X$). Hence, Eq. (5.11) is satisfied for $x \in T$. For $x \in X$ with $x \notin T$, the constraints of the $m$-stage minimax problem solved by MPC starting from $x$ include Eq. (5.11). Hence the reachability condition (5.11) is satisfied. (Note that the constrained controllability assumption, i.e. that the $m$-stage minimax problem solved at each stage by MPC has a feasible solution, is critical for this analysis.)

Consider now any sequence $\{x_0, u_0, x_1, u_1, \ldots\}$ generated by MPC [i.e. $x_0 \in X$, $x_0 \notin T$, $u_k = \bar{\mu}(x_k)$, $x_{k+1} = f(x_k, u_k, w_k)$, and $w_k \in W(x_k, u_k)$]. We will show that
\[
\sum_{k=0}^{K_T - 1} g(x_k, u_k) \le \hat{J}(x_0) < \infty, \tag{5.12}
\]
where $\hat{J}(x)$ is the optimal cost starting at state $x \in X$ of the $m$-stage minimax control problem solved by MPC, and $K_T$ is the smallest integer $k$ such that $x_k \in T$ (with $K_T = \infty$ if $x_k \notin T$ for all $k$). Note that as a consequence of Eq. (5.12) there are two possibilities:

(1) The state reaches the target set $T$ at some finite time $K_T$, in which case it is kept within $T$ for all subsequent times, by the definition of MPC.

(2) The state never reaches the target set $T$ ($K_T = \infty$), in which case we have $\bar{g}(x_k, u_k) = g(x_k, u_k)$ for all $k$, and the infinite horizon cost $\sum_{k=0}^{\infty} g(x_k, u_k)$ is finite. By our assumption regarding $g$, this implies that $x_k \to 0$ and $u_k \to 0$.

In either case (1) or (2), Eq. (5.12) can be viewed as a property of stability in the presence of disturbances (assuming that $T$ is a bounded set).

To show Eq. (5.12), we argue similarly to the case where there are no disturbances. Let us consider an optimal control problem that is similar to the one solved at each stage by MPC, but has one stage less. In particular, given $x \in X$ with $x \notin T$, consider the minimax control problem of finding a policy $\bar{\mu}_0, \bar{\mu}_1, \ldots, \bar{\mu}_{m-2}$ that minimizes
\[
\max_{\substack{w_i \in W(x_i, \bar{\mu}_i(x_i)) \\ i = 0, 1, \ldots, m-2}} \; \sum_{i=0}^{m-2} \bar{g}\bigl(x_i, \bar{\mu}_i(x_i)\bigr), \tag{5.13}
\]
subject to the system equation constraints
\[
x_{i+1} = f\bigl(x_i, \bar{\mu}_i(x_i), w_i\bigr), \qquad i = 0, 1, \ldots, m-2, \tag{5.14}
\]
the control and state constraints
\[
x_i \in X, \quad \bar{\mu}_i(x_i) \in U(x_i), \qquad i = 0, 1, \ldots, m-2, \tag{5.15}
\]
and the terminal state constraint
\[
x_i \in T, \qquad \text{for some } i \in [1, m-1]. \tag{5.16}
\]
These constraints must be satisfied for all disturbance sequences with
\[
w_i \in W\bigl(x_i, \bar{\mu}_i(x_i)\bigr), \qquad i = 0, 1, \ldots, m-2. \tag{5.17}
\]
Let $\tilde{J}(x_0)$ be the corresponding optimal value, and define $\tilde{J}(x_0) = 0$ for $x_0 \in T$, and $\tilde{J}(x_0) = \infty$ for all $x_0 \notin T$ for which the problem has no feasible solution. It can be seen that the control $\bar{\mu}(x)$ applied by MPC at a state $x \in X$ with $x \notin T$ minimizes over $u \in U(x)$
\[
\max_{w \in W(x, u)} \bigl[ \bar{g}(x, u) + \tilde{J}\bigl(f(x, u, w)\bigr) \bigr].
\]
By using the fact $\hat{J}(x) \le \tilde{J}(x)$, it follows that for all $x \in X$ with $x \notin T$, we have
\[
\max_{w \in W(x, \bar{\mu}(x))} \bigl[ \bar{g}\bigl(x, \bar{\mu}(x)\bigr) + \hat{J}\bigl(f(x, \bar{\mu}(x), w)\bigr) \bigr] \le \hat{J}(x).
\]
Hence, for all $k$ such that $x_k \in X$ with $x_k \notin T$, we have
\[
g(x_k, u_k) + \hat{J}(x_{k+1}) \le \hat{J}(x_k),
\]
where $\hat{J}(x_{k+1}) = 0$ if $x_{k+1} \in T$, and by adding over $k = 0, 1, \ldots, K_T - 1$, we obtain the desired result (5.12).

Note that, similarly to the case where there are no disturbances, we can interpret MPC as a rollout algorithm with a base policy defined as follows: for $x \in T$, the base policy applies $\tilde{\mu}(x)$, and for $x \in X$ with $x \notin T$, it applies the first element of a sequence that solves the $(m-1)$-stage problem described above, i.e. minimizes the cost (5.13) subject to the constraints (5.14)–(5.17).


5.2. MPC and Infinite-Time Reachability

We will now describe the connection between MPC and the methodology of reachability of target sets and tubes, first introduced and developed for discrete-time systems by the author in his Ph.D. thesis [9], and subsequent papers [7,10]. Reachability, as well as the broader subject of estimation and control for systems with a set-membership description of the uncertainty, stayed outside mainstream control theory and practice for a long time, but has received renewed attention since the late 1980s, in the context of robust control and MPC. We refer to the surveys by Deller [20], Kosut et al. [27], Blanchini [15] and Mayne [38], which give many additional references. Section 4.6.2 of the textbook [13] provides an introduction to the subject, which is relevant to the material that follows.

Consider the system

\[
x_{k+1} = f_k(x_k, u_k, w_k),
\]
where $w_k$ is known to belong to a given set $W_k(x_k, u_k)$, which may depend on the current state $x_k$ and control $u_k$. A basic reachability problem is to find a policy $\pi = \{\mu_0, \ldots, \mu_{N-1}\}$ with $\mu_k(x_k) \in U_k(x_k)$ for all $x_k$ and $k$, such that for each $k = 1, 2, \ldots, N$, the state $x_k$ of the closed-loop system
\[
x_{k+1} = f_k\bigl(x_k, \mu_k(x_k), w_k\bigr)
\]
belongs to a given set $X_k$, called the target set at time $k$. We may view the set sequence $\{X_1, X_2, \ldots, X_N\}$ as a 'tube' within which the state must stay, even under the worst possible choice of the disturbances $w_k$ from within the corresponding sets $W_k(x_k, \mu_k(x_k))$. If there exists a policy $\pi = \{\mu_0, \ldots, \mu_{N-1}\}$ that keeps the state $x_k$ within $X_k$ for all $k = 1, \ldots, N$, starting from a given initial state $x_0$, we say that the tube $\{X_1, X_2, \ldots, X_N\}$ is reachable from $x_0$.

Turning back to the MPC formulation, we note that the constrained controllability assumption is in effect an assumption about reachability of a certain target tube. In particular, for a deterministic system, this assumption requires that the state constraint set $X$ is such that for all initial states within $X$, there exists a control sequence that keeps the state of the system within $X$ for the next $m-1$ stages, and drives the state to 0 at the $m$-th stage. This is equivalent to assuming that the tube $\{X_1, X_2, \ldots, X_m\}$ is reachable from all $x_0 \in X$, where
\[
X_1 = X_2 = \cdots = X_{m-1} = X, \qquad X_m = \{0\}.
\]
For the case of a system driven by disturbances described by set membership, the constrained controllability assumption requires that we know two sets with favorable reachability properties: the target set $T$ that the system aims at, and the set $X$ within which the state must stay as it approaches $T$. An equivalent statement of the assumption is that the tube $\{X_1, X_2, \ldots, X_m\}$ is reachable from all $x_0 \in X$, where
\[
X_1 = X_2 = \cdots = X_{m-1} = X, \qquad X_m = T,
\]
and that the set $T$ is infinitely reachable, in the sense that there exists a control law $\tilde{\mu}$ that keeps the state within $T$ for all possible values of the disturbances, that is,
\[
f\bigl(x, \tilde{\mu}(x), w\bigr) \in T, \qquad \text{for all } x \in T, \; w \in W\bigl(x, \tilde{\mu}(x)\bigr). \tag{5.18}
\]
We may view the problem of infinite time reachability of a set $T$ as a limit/infinite horizon version of the problem of reachability of a target tube: instead of a tube consisting of a finite number of sets, we consider a tube with an infinite number of sets, of the form $\{T, T, \ldots\}$.

The computation as well as the definition of MPC critically depends on being able to solve the two reachability problems formulated above: reachability of a (finite horizon) target tube, and infinite time reachability of a set (or reachability of an infinite horizon target tube). One may formulate the problem of reachability of a target tube $\{X_1, X_2, \ldots, X_N\}$ as a minimax control problem, where the cost at stage $k$ is
\[
g_k(x_k) = \begin{cases} 0 & \text{if } x_k \in X_k, \\ 1 & \text{if } x_k \notin X_k. \end{cases}
\]
With this choice, the optimal cost-to-go from a given initial state $x_0$ is the minimum number of violations of the target tube constraints $x_k \in X_k$ that can occur when the $w_k$ are optimally chosen, subject to the constraint $w_k \in W_k(x_k, u_k)$, by an adversary wishing to maximize the number of violations. In particular, if $J_k(x_k) = 0$ for some $x_k \in X_k$, there exists a policy such that, starting from $x_k$, the subsequent system states $x_i$, $i = k+1, \ldots, N$, are guaranteed to be within the corresponding sets $X_i$.

It can be seen that the set
\[
\bar{X}_k = \{x_k \mid J_k(x_k) = 0\}
\]

is the set that we must reach at time $k$ in order to be able to maintain the state within the subsequent target sets. Accordingly, we refer to $\bar{X}_k$ as the effective target set at time $k$. We can generate the sets $\bar{X}_k$ with a backwards recursion, first obtained in Ref. [9], which may be derived from the DP algorithm for minimax problems, but can also be easily justified from first principles. In particular, we start with

\[
\bar{X}_N = X_N, \tag{5.19}
\]
and for $k = 0, 1, \ldots, N-1$, we have
\[
\bar{X}_k = \bigl\{ x_k \in X_k \;\big|\; \text{there exists } u_k \in U_k(x_k) \text{ such that } f_k(x_k, u_k, w_k) \in \bar{X}_{k+1} \text{ for all } w_k \in W_k(x_k, u_k) \bigr\}. \tag{5.20}
\]
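When the state, control, and disturbance sets are finite (or suitably discretized), the recursion (5.19)–(5.20) can be computed directly. A minimal sketch, with hypothetical callables standing for the problem data:

```python
def effective_target_sets(X, U, W, f, N):
    """Backward recursion (5.19)-(5.20) for the effective target sets.

    X[k] is the target set X_k given as a finite Python set, U(x) and
    W(x, u) are finite candidate sets, and f(k, x, u, w) is the system
    function.  X_bar[k] collects the states of X_k from which the rest
    of the tube can be reached for every admissible disturbance.
    """
    X_bar = [None] * (N + 1)
    X_bar[N] = set(X[N])                            # Eq. (5.19)
    for k in range(N - 1, -1, -1):                  # Eq. (5.20)
        X_bar[k] = {
            x for x in X[k]
            if any(
                all(f(k, x, u, w) in X_bar[k + 1] for w in W(x, u))
                for u in U(x))
        }
    return X_bar
```

The same iteration, with stage-independent data and $X_k \equiv Y$, yields the sets $Y_k$ of the value iteration discussed below.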

In general, it is not easy to characterize the effective target sets $\bar{X}_k$. However, a few special cases involving the linear system
\[
x_{k+1} = A_k x_k + B_k u_k + w_k, \qquad k = 0, 1, \ldots, N-1,
\]
where $A_k$ and $B_k$ are given matrices, are amenable to exact or approximate computational solution. One such case is when the sets $X_k$ are polyhedral, and the sets $U_k(x_k)$ and $W_k(x_k, u_k)$ are also polyhedral and independent of $x_k$ and $u_k$. Then the effective target sets are polyhedral and can be computed by linear programming methods.

Another case of interest is when the sets $X_k$ are ellipsoids, and the sets $U_k(x_k)$ and $W_k(x_k, u_k)$ are also ellipsoids that do not depend on $x_k$ and $(x_k, u_k)$, respectively. In this case, the effective target sets $\bar{X}_k$ are not ellipsoids, but can be approximated by inner ellipsoids $\tilde{X}_k$ with
\[
\tilde{X}_k \subset \bar{X}_k
\]
(this requires that the ellipsoids $U_k$ have sufficiently large size, for otherwise the target tube may not be reachable and the problem may not have a solution). Furthermore, the state trajectory $\{x_1, x_2, \ldots, x_N\}$ can be maintained within the ellipsoidal tube $\{\tilde{X}_1, \tilde{X}_2, \ldots, \tilde{X}_N\}$ by using a linear control law (see [13], p. 214). We refer to the author's Ph.D. thesis [9] and the subsequent paper [7] for a more detailed analysis.

When the problem is stationary and $f_k$, $X_k$, $U_k$ and $W_k$ do not depend on $k$, as in the MPC formulation, an infinitely reachable set can often be obtained, at least in principle, by a form of value iteration, i.e. start with some (sufficiently large) target set $Y$, and sequentially construct an infinite sequence of corresponding effective target sets $\{Y_k\}$ satisfying $Y_0 = Y$ and
\[
Y_{k+1} = \bigl\{ x \in Y_k \;\big|\; \text{there exists } u \in U(x) \text{ with } f(x, u, w) \in Y_k \text{ for all } w \in W(x, u) \bigr\}.
\]

Clearly, $Y_k$ is the set of all initial states $x \in Y$ for which there exists a $k$-stage policy that keeps the state of the system within $Y$ for the next $k$ stages. We have
\[
Y_{k+1} \subset Y_k, \qquad \forall k,
\]
and we may view the intersection
\[
Y_\infty = \bigcap_{k=0}^{\infty} Y_k
\]

as the set within which the state can be kept for an arbitrarily large (but finite) number of time periods. Paradoxically, under some unusual circumstances, the set $Y_\infty$ may not be infinitely reachable, i.e. there may exist states in $Y_\infty$ starting from which it is impossible to remain within $Y_\infty$ for an infinite number of time periods. Conditions that guarantee infinite reachability of $Y_\infty$ have been investigated by the author in Ref. [10], and include compactness of the sets $Y$, $U(x)$ and $W(x, u)$, and continuity of the function $f$. Under these conditions, the set $Y_\infty$ is the largest infinitely reachable set contained within $Y$. Refs [9,10] also focus on the case where the system is linear, and the sets $Y$, $U$ and $W$ are ellipsoids. For this case, these references provide a methodology, based on a Riccati-like equation, for constructing infinitely reachable ellipsoidal inner approximations to $Y_\infty$, and an associated linear control law that attains infinite time reachability.

6. Restricted Structure Policies

We will now introduce a general unifying suboptimal control scheme that contains as special cases several of the control schemes we have discussed: rollout, OLFC and MPC. The idea is to simplify the problem by selectively restricting the information and/or the controls available to the controller, thereby obtaining a restricted but more tractable problem structure, which can be used conveniently in a one-step lookahead context.

An example of such a structure is one where fewer observations are obtained, or one where the control constraint set is restricted to a single control or a small number of given controls at each state. Generally, a restricted structure is associated with a problem where the optimal cost achievable is less favorable than in the given problem; this will be made specific in what follows. At each stage, we compute a policy that solves an optimal control problem involving the remaining stages and the restricted problem structure. The control applied at the given stage is the first component of the restricted policy thus obtained.


An example of a suboptimal control approach that uses a restricted structure is the OLFC, where one uses the information available at a given stage as the starting point for an open-loop computation (where future observations are ignored). Another example is the rollout algorithm, where at a given stage one restricts the controls available at future stages to be those applied by some suboptimal policy. Still another example is MPC, which under some conditions may be viewed as a form of rollout algorithm, as discussed in the preceding section.

For a problem with $N$ stages, implementation of the suboptimal scheme to be discussed requires the solution of a problem involving the restricted structure at each stage. The horizon of this problem starts at the current stage, call it $k$, and extends up to the final stage $N$. This solution yields a control $u_k$ for stage $k$ and a policy for the remaining stages $k+1, \ldots, N-1$ (which must obey the constraints of the restricted structure). The control $u_k$ is used at the current stage, while the policy for the remaining stages $k+1, \ldots, N-1$ is discarded. The process is repeated at the next stage $k+1$, using the additional information obtained between stages $k$ and $k+1$.

Similarly, for an infinite horizon model, implementation of the suboptimal scheme requires, at each stage $k$, the solution of a problem involving the restricted structure and a (rolling) horizon of fixed length. The solution yields a control $u_k$ for stage $k$ and a policy for each of the remaining stages. The control $u_k$ is then used at stage $k$, and the policy for the remaining stages is discarded. For simplicity, in what follows we will focus attention on the finite horizon case, but the analysis applies, with minor modifications, to infinite horizon cases as well.

Our main result is that the performance of the suboptimal control scheme is no worse than that of the restricted problem, i.e. the problem corresponding to the restricted structure. This result unifies and generalizes our analysis for the OLFC (which is known to improve the cost of the optimal open-loop policy, cf. Section 4), for the rollout algorithm (which is known to improve the cost of the corresponding suboptimal policy, cf. Section 4), and for MPC (where under some reasonable assumptions, stability of the suboptimal closed-loop control scheme is guaranteed, cf. Section 5).

For simplicity, we focus on the imperfect state information framework for stationary finite-state Markov chains with $N$ stages; the ideas apply to much more general problems with perfect and imperfect state information, as well as to problems with an infinite horizon. We assume that the system state is one of a finite number of states denoted $1, 2, \ldots, n$. When a control $u$ is applied, the system moves from state $i$ to state $j$ with probability $p_{ij}(u)$. The control $u$ is chosen from a finite set $U$. Following a state transition, an observation is made by the controller. There is a finite number of possible observation outcomes, and the probability of each depends on the current state and the preceding control. The information available to the controller at stage $k$ is the information vector
\[
I_k = (z_1, \ldots, z_k, u_0, \ldots, u_{k-1}),
\]
where for all $i$, $z_i$ and $u_i$ are the observation and control at stage $i$, respectively. Following the observation $z_k$, a control $u_k$ is chosen by the controller, and a cost $g(x_k, u_k)$ is incurred, where $x_k$ is the current (hidden) state. The terminal cost for being at state $x$ at the end of the $N$ stages is denoted $G(x)$. We wish to minimize the expected value of the sum of costs incurred over the $N$ stages.

We can reformulate the problem into a problem of perfect state information, where the objective is to control the column vector of conditional probabilities
\[
p_k = (p_k^1, \ldots, p_k^n)',
\]
with
\[
p_k^j = P(x_k = j \mid I_k), \qquad j = 1, \ldots, n.
\]
We refer to $p_k$ as the belief state, and we note that it evolves according to an equation of the form
\[
p_{k+1} = \Phi(p_k, u_k, z_{k+1}).
\]
The function $\Phi$ represents an estimator, as discussed earlier. The initial belief state $p_0$ is given.

The corresponding DP algorithm has the form
\[
J_k(p_k) = \min_{u_k \in U} \Bigl[ p_k' g(u_k) + E_{z_{k+1}} \bigl\{ J_{k+1}\bigl(\Phi(p_k, u_k, z_{k+1})\bigr) \mid p_k, u_k \bigr\} \Bigr],
\]
where $g(u_k)$ is the column vector with components $g(1, u_k), \ldots, g(n, u_k)$, and $p_k' g(u_k)$, the expected stage cost, is the inner product of the vectors $p_k$ and $g(u_k)$. The algorithm starts at stage $N$, with
\[
J_N(p_N) = p_N' G,
\]
where $G$ is the column vector with components $G(1), \ldots, G(n)$, and proceeds backwards.
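A minimal sketch of the estimator $\Phi$ for this finite-state model follows; the data layout (transition matrices `P[u]` and observation-probability arrays `obs[u]`) is an assumption made for illustration, not notation from the text.

```python
import numpy as np

def belief_update(p, u, z, P, obs):
    """Belief-state estimator p_{k+1} = Phi(p_k, u_k, z_{k+1}).

    p is the current belief (length-n array), P[u] is the n x n matrix
    of transition probabilities [p_ij(u)], and obs[u][j, z] is the
    probability of observing z after a transition to state j under u.
    """
    predicted = p @ P[u]                  # P(x_{k+1} = j | I_k, u_k)
    posterior = predicted * obs[u][:, z]  # weight by the observation likelihood
    return posterior / posterior.sum()    # condition on z_{k+1} and normalize
```

The DP recursion above then runs on these belief vectors, with expected stage cost given by the inner product $p_k' g(u_k)$.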

We will also consider another control structure, where the information vector is
\[
\tilde{I}_k = (\tilde{z}_1, \ldots, \tilde{z}_k, u_0, \ldots, u_{k-1}), \qquad k = 0, \ldots, N-1,
\]
with $\tilde{z}_i$ being some observation for each $i$ (possibly different from $z_i$), and with the control constraint set at each $p_k$ being a given set $\tilde{U}(p_k)$. The probability distribution of $\tilde{z}_k$ given $x_k$ and $u_{k-1}$ is known, and may be different from that of $z_k$. Also, $\tilde{U}(p_k)$ may be different from $U$ [in what follows, we will assume that $\tilde{U}(p_k)$ is a subset of $U$].

We introduce a suboptimal policy which, at stage $k$, starting with the current belief state $p_k$, applies a control $\bar{\mu}_k(p_k) \in U$, based on the assumption that the future observations and control constraints will be according to the restricted structure. More specifically, this policy chooses the control at the typical stage $k$ and belief state $p_k$ as follows:

Restricted structure policy: At stage $k$ and belief state $p_k$, apply the control
\[
\bar{\mu}_k(p_k) = \tilde{u}_k,
\]
where
\[
\bigl(\tilde{u}_k, \tilde{\mu}_{k+1}(\tilde{z}_{k+1}, \tilde{u}_k), \ldots, \tilde{\mu}_{N-1}(\tilde{z}_{k+1}, \ldots, \tilde{z}_{N-1}, \tilde{u}_k, \ldots, \tilde{u}_{N-2})\bigr)
\]
is a policy that attains the optimal cost achievable from stage $k$ onward with knowledge of $p_k$ and with access to the future observations $\tilde{z}_{k+1}, \ldots, \tilde{z}_{N-1}$ (in addition to the future controls), and subject to the constraints
\[
\tilde{u}_k \in U, \quad \tilde{\mu}_{k+1}(p_{k+1}) \in \tilde{U}(p_{k+1}), \; \ldots, \; \tilde{\mu}_{N-1}(p_{N-1}) \in \tilde{U}(p_{N-1}).
\]

Let $\bar{J}_k(p_k)$ be the cost-to-go, starting at belief state $p_k$ at stage $k$, of the restricted structure policy $\{\bar{\mu}_0, \ldots, \bar{\mu}_{N-1}\}$ just described. This is given by the DP algorithm
\[
\bar{J}_k(p_k) = p_k' g\bigl(\bar{\mu}_k(p_k)\bigr) + E_{z_{k+1}} \bigl\{ \bar{J}_{k+1}\bigl(\Phi(p_k, \bar{\mu}_k(p_k), z_{k+1})\bigr) \mid p_k, \bar{\mu}_k(p_k) \bigr\} \tag{6.1}
\]
for all $p_k$ and $k$, with the terminal condition $\bar{J}_N(p_N) = p_N' G$ for all $p_N$.

Let us also denote by $J_k^r(p_k)$ the optimal cost-to-go of the restricted problem, i.e. the one where the observations and control constraints of the restricted structure are used exclusively. This is the optimal cost achievable, starting at belief state $p_k$ at stage $k$, using the observations $\tilde{z}_i$, $i = k+1, \ldots, N-1$, and subject to the constraints
\[
u_k \in \tilde{U}(p_k), \quad \tilde{\mu}_{k+1}(p_{k+1}) \in \tilde{U}(p_{k+1}), \; \ldots, \; \tilde{\mu}_{N-1}(p_{N-1}) \in \tilde{U}(p_{N-1}).
\]
We will show, under certain assumptions to be introduced shortly, that
\[
\bar{J}_k(p_k) \le J_k^r(p_k), \qquad \forall p_k, \; k = 0, \ldots, N-1,
\]
and we will also obtain a readily computable upper bound to $\bar{J}_k(p_k)$. To this end, for a given belief vector $p_k$ and control $u_k \in U$, we consider three optimal costs-to-go, corresponding to three different patterns of availability of information and control restriction over the remaining stages $k+1, \ldots, N-1$. We denote:

$Q_k(p_k, u_k)$: The cost achievable from stage $k$ onward starting with $p_k$, applying $u_k$ at stage $k$, and optimally choosing each future control $u_i$, $i = k+1, \ldots, N-1$, with knowledge of $p_k$, the observations $z_{k+1}, \ldots, z_i$ and the controls $u_k, \ldots, u_{i-1}$, and subject to the constraint $u_i \in U$.

$Q_k^c(p_k, u_k)$: The cost achievable from stage $k$ onward starting with $p_k$, applying $u_k$ at stage $k$, and optimally choosing each future control $u_i$, $i = k+1, \ldots, N-1$, with knowledge of $p_k$, the observations $\tilde{z}_{k+1}, \ldots, \tilde{z}_i$, and the controls $u_k, \ldots, u_{i-1}$, and subject to the constraint $u_i \in \tilde{U}(p_i)$. Note that this definition is equivalent to
\[
Q_k^c\bigl(p_k, \bar{\mu}_k(p_k)\bigr) = \min_{u_k \in U} Q_k^c(p_k, u_k), \tag{6.2}
\]
where $\bar{\mu}_k(p_k)$ is the control applied by the restricted structure policy just described.

$\hat{Q}_k^c(p_k, u_k)$: The cost achievable from stage $k$ onward starting with $p_k$, applying $u_k$ at stage $k$, optimally choosing the control $u_{k+1}$ with knowledge of $p_k$, the observation $z_{k+1}$, and the control $u_k$, subject to the constraint $u_{k+1} \in U$, and optimally choosing each of the remaining controls $u_i$, $i = k+2, \ldots, N-1$, with knowledge of $p_k$, the observations $z_{k+1}, \tilde{z}_{k+2}, \ldots, \tilde{z}_i$, and the controls $u_k, \ldots, u_{i-1}$, and subject to the constraints $u_i \in \tilde{U}(p_i)$.

Thus, the difference between $Q_k^c(p_k, u_k)$ and $Q_k(p_k, u_k)$ is due to the difference in the control constraint and the information available to the controller at all future stages $k+1, \ldots, N-1$ [$\tilde{U}(p_{k+1}), \ldots, \tilde{U}(p_{N-1})$ versus $U$, and $\tilde{z}_{k+1}, \ldots, \tilde{z}_{N-1}$ versus $z_{k+1}, \ldots, z_{N-1}$, respectively]. The difference between $Q_k^c(p_k, u_k)$ and $\hat{Q}_k^c(p_k, u_k)$ is due to the difference in the control constraint and the information available to the controller at the single stage $k+1$ [$\tilde{U}(p_{k+1})$ versus $U$, and $\tilde{z}_{k+1}$ versus $z_{k+1}$, respectively]. Our key assumptions are that
\[
\tilde{U}(p_k) \subset U, \qquad \forall p_k, \; k = 0, \ldots, N-1, \tag{6.3}
\]
\[
Q_k(p_k, u_k) \le \hat{Q}_k^c(p_k, u_k) \le Q_k^c(p_k, u_k), \qquad \forall p_k, \; u_k \in U, \; k = 0, \ldots, N-1. \tag{6.4}
\]
Roughly, this means that the control constraint $\tilde{U}(p_k)$ is more stringent than $U$, and the observations $\tilde{z}_{k+1}, \ldots, \tilde{z}_{N-1}$ are 'weaker' (no more valuable in terms of improving the cost) than the observations $z_{k+1}, \ldots, z_{N-1}$. Consequently, if Eqs (6.3) and (6.4) hold, we may interpret a controller that uses in part the observations $\tilde{z}_k$ and the control constraints $\tilde{U}(p_k)$, in place of $z_k$ and $U$, respectively, as 'handicapped' or 'restricted'.

Let us denote:

$J_k(p_k)$: The optimal cost-to-go of the original problem, starting at belief state $p_k$ at stage $k$. This is given by
\[
J_k(p_k) = \min_{u_k \in U} Q_k(p_k, u_k). \tag{6.5}
\]
$J_k^c(p_k)$: The optimal cost achievable, starting at belief state $p_k$ at stage $k$, using the observations $\tilde{z}_i$, $i = k+1, \ldots, N-1$, and subject to the constraints
\[
u_k \in U, \quad \tilde{\mu}_{k+1}(p_{k+1}) \in \tilde{U}(p_{k+1}), \; \ldots, \; \tilde{\mu}_{N-1}(p_{N-1}) \in \tilde{U}(p_{N-1}).
\]
This is given by
\[
J_k^c(p_k) = \min_{u_k \in U} Q_k^c(p_k, u_k), \tag{6.6}
\]
and it is the cost that is computed when solving the optimization problem of stage $k$ in the restricted structure policy scheme. Note that we have, for all $p_k$,
\[
J_k^r(p_k) = \min_{u_k \in \tilde{U}(p_k)} Q_k^c(p_k, u_k) \ge \min_{u_k \in U} Q_k^c(p_k, u_k) = J_k^c(p_k), \tag{6.7}
\]
where the inequality holds in view of the assumption $\tilde{U}(p_k) \subset U$.

Our main result is as follows.

Proposition 6.1. Under the assumptions (6.3) and (6.4), there holds
\[
J_k(p_k) \le \bar{J}_k(p_k) \le J_k^c(p_k) \le J_k^r(p_k), \qquad \forall p_k, \; k = 0, \ldots, N-1.
\]

Proof. The inequality $J_k(p_k) \le \bar{J}_k(p_k)$ is evident, since $J_k(p_k)$ is the optimal cost-to-go over a class of policies that includes the restricted structure policy $\{\bar{\mu}_0, \ldots, \bar{\mu}_{N-1}\}$. Also, the inequality $J_k^c(p_k) \le J_k^r(p_k)$ follows from the definitions [see Eq. (6.7)]. We prove the remaining inequality $\bar{J}_k(p_k) \le J_k^c(p_k)$ by induction on $k$.

We have $\bar{J}_N(p_N) = J_N^c(p_N)$ for all $p_N$. Assume that for all $p_{k+1}$, we have
\[
\bar{J}_{k+1}(p_{k+1}) \le J_{k+1}^c(p_{k+1}).
\]
Then, for all $p_k$,
\[
\begin{aligned}
\bar{J}_k(p_k) &= p_k' g\bigl(\bar{\mu}_k(p_k)\bigr) + E_{z_{k+1}}\bigl\{ \bar{J}_{k+1}\bigl(\Phi(p_k, \bar{\mu}_k(p_k), z_{k+1})\bigr) \mid p_k, \bar{\mu}_k(p_k) \bigr\} \\
&\le p_k' g\bigl(\bar{\mu}_k(p_k)\bigr) + E_{z_{k+1}}\bigl\{ J_{k+1}^c\bigl(\Phi(p_k, \bar{\mu}_k(p_k), z_{k+1})\bigr) \mid p_k, \bar{\mu}_k(p_k) \bigr\} \\
&= p_k' g\bigl(\bar{\mu}_k(p_k)\bigr) + E_{z_{k+1}}\Bigl\{ \min_{u_{k+1} \in U} Q_{k+1}^c\bigl(\Phi(p_k, \bar{\mu}_k(p_k), z_{k+1}), u_{k+1}\bigr) \;\Big|\; p_k, \bar{\mu}_k(p_k) \Bigr\} \\
&= \hat{Q}_k^c\bigl(p_k, \bar{\mu}_k(p_k)\bigr) \\
&\le Q_k^c\bigl(p_k, \bar{\mu}_k(p_k)\bigr) \\
&= J_k^c(p_k),
\end{aligned}
\]
where the first equality holds by Eq. (6.1), the first inequality holds by the induction hypothesis, the second equality holds by Eq. (6.6), the third equality holds by the definition of $\hat{Q}_k^c$, the second inequality holds by the assumption (6.4), and the last equality holds from the definition (6.2) of the restricted structure policy. The induction is complete. $\Box$

The main conclusion from the proposition is that the performance of the restricted structure policy $\{\bar{\mu}_0, \ldots, \bar{\mu}_{N-1}\}$ is no worse than the performance associated with the restricted control structure. Furthermore, at each stage $k$, the value $J_k^c(p_k)$, which is obtained as a byproduct of the on-line computation of the control $\bar{\mu}_k(p_k)$, is an upper bound to the cost-to-go $\bar{J}_k(p_k)$ of the suboptimal policy. This is consistent with our earlier results showing the cost improvement property of the rollout algorithm and the OLFC, and the stability property of MPC.

7. Conclusions

This paper has aimed to outline the connections between some suboptimal control schemes that have been the focus of much recent research. The underlying idea of these schemes is to start with a suboptimal/heuristic policy, and to improve it by using its cost-to-go (or a lower bound thereof) as an approximation to the optimal cost-to-go, the principal theme of the policy iteration method. While this viewpoint is natural in DP-based optimization, it is somewhat indirect within the control-oriented context of MPC, where a principal issue is the stability of the closed-loop system. We have tried to emphasize the relation between the stability property of MPC and the cost improvement property of the underlying policy iteration methodology.

The connections between the various schemes described here may be helpful in better understanding their underlying mechanisms, and in extending the scope of their practical applications. In particular, the success of MPC in control engineering applications should motivate the broader use of rollout algorithms in practice, while the rollout and restricted-policy cost improvement analyses should motivate the use of MPC methods in other problem domains, involving for example more complex state and control constraints. Finally, it should be mentioned that while in rollout algorithms for constrained DP the issue of constructing a feasible base policy is left unresolved, in MPC it is addressed via the well-understood methodology of reachability of target tubes.

References

1. Abramson B. Expected-outcome: a general model of static evaluation. IEEE Trans Pattern Anal Machine Intell 1990; 12: 182–193
2. Barto A, Powell W, Si J (eds). Learning and approximate dynamic programming. IEEE Press, NY, 2004
3. Bertsekas DP, Tsitsiklis JN, Wu C. Rollout algorithms for combinatorial optimization. J Heuristics 1997; 3: 245–262
4. Bertsekas DP, Castanon DA. Rollout algorithms for stochastic scheduling problems. J Heuristics 1999; 5: 89–108
5. Bertsimas D, Demir R. An approximate dynamic programming approach to multi-dimensional knapsack problems. Manage Sci 2002; 4: 550–565
6. Bertsimas D, Popescu I. Revenue management in a dynamic network environment. Transportation Sci 2003; 37: 257–277
7. Bertsekas DP, Rhodes IB. On the minimax reachability of target sets and target tubes. Automatica 1971; 7: 233–247
8. Bertsekas DP, Tsitsiklis JN. Neuro-dynamic programming. Athena Scientific, Belmont, MA, 1996
9. Bertsekas DP. Control of uncertain systems with a set-membership description of the uncertainty. Ph.D. Dissertation, Massachusetts Institute of Technology, Cambridge, MA, 1971 (available in scanned form from the author's www site)
10. Bertsekas DP. Infinite time reachability of state space regions by using feedback control. IEEE Trans Autom Control 1972; AC-17: 604–613
11. Bertsekas DP. On the solution of some minimax control problems. In: Proceedings of the 1972 IEEE decision and control conference, New Orleans, LA, 1972
12. Bertsekas DP. Differential training of rollout policies. In: Proceedings of the 35th Allerton conference on communication, control, and computing, Allerton Park, IL, 1997, pp. 913–922
13. Bertsekas DP. Dynamic programming and optimal control. 3rd edn. Athena Scientific, Belmont, MA, 2005
14. Bertsekas DP. Rollout algorithms for constrained dynamic programming. Laboratory for Information and Decision Systems Report 2646, MIT, 2005
15. Blanchini F. Set invariance in control—a survey. Automatica 1999; 35: 1747–1768
16. Chang HS, Givan RL, Chong EKP. Parallel rollout for online solution of partially observable Markov decision processes. Discrete Event Dynam Syst 2004; 14: 309–341
17. Camacho EF, Bordons C. Model predictive control. 2nd edn. Springer-Verlag, New York, NY, 2004
18. de Farias DP, Van Roy B. The linear programming approach to approximate dynamic programming. Operations Res 2003; 51: 850–865
19. de Farias DP, Van Roy B. On constraint sampling in the linear programming approach to approximate dynamic programming. Math Operations Res 2004; 29: 462–478
20. Deller JR. Set membership identification in digital signal processing. IEEE ASSP Mag 1989; October: 4–20
21. Dreyfus SD. Dynamic programming and the calculus of variations. Academic Press, NY, 1965
22. Findeisen R, Imsland L, Allgower F, Foss BA. State and output feedback nonlinear model predictive control: an overview. Eur J Control 2003; 9: 190–205
23. Ferris MC, Voelker MM. Neuro-dynamic programming for radiation treatment planning. Numerical Analysis Group Research Report NA-02/06, Oxford University Computing Laboratory, Oxford University, 2002
24. Ferris MC, Voelker MM. Fractionation in radiation treatment planning. Math Program B 2004; 102: 387–413
25. Guerriero F, Musmanno R. Label correcting methods to solve multicriteria shortest path problems. J Optim Theory Appl 2001; 111: 589–613
26. Jaffe JM. Algorithms for finding paths with multiple constraints. Networks 1984; 14: 95–116
27. Kosut RL, Lau MK, Boyd SP. Set-membership identification of systems with parametric and nonparametric uncertainty. IEEE Trans Autom Control 1992; AC-37: 929–941
28. Keerthi SS, Gilbert EG. Optimal, infinite horizon feedback laws for a general class of constrained discrete time systems: stability and moving-horizon approximations. J Optim Theory Appl 1988; 57: 265–293
29. Konda VR, Tsitsiklis JN. Actor-critic algorithms. SIAM J Control Optim 2003; 42: 1143–1166
30. Konda VR. Actor-critic algorithms. Ph.D. Thesis, Department of Electrical Engineering and Computer Science, MIT, Cambridge, MA, 2002
31. McGovern A, Moss E, Barto A. Building a basic building block scheduler using reinforcement learning and rollouts. Mach Learning 2002; 49: 141–160
32. Meloni C, Pacciarelli D, Pranzo M. A rollout metaheuristic for job shop scheduling problems. Ann Operations Res 2004; 131: 215–235
33. Mayne DQ, Rawlings JB, Rao CV, Scokaert POM. Constrained model predictive control: stability and optimality. Automatica 2000; 36: 789–814
34. Martin-Sanchez JM, Rodellar J. Adaptive predictive control. From the concepts to plant optimization. Prentice-Hall International (UK), Hemel Hempstead, Hertfordshire, UK, 1996
35. Maciejowski JM. Predictive control with constraints. Addison-Wesley, Reading, MA, 2002
36. Marbach P, Tsitsiklis JN. Simulation-based optimization of Markov reward processes. IEEE Trans Autom Control 2001; AC-46: 191–209
37. Martins EQV. On a multicriteria shortest path problem. Eur J Operational Res 1984; 16: 236–245
38. Mayne DQ. Control of constrained dynamic systems. Eur J Control 2001; 7: 87–99
39. Morari M, Lee JH. Model predictive control: past, present, and future. Comput Chem Eng 1999; 23: 667–682
40. Qin SJ, Badgwell TA. A survey of industrial model predictive control technology. Control Eng Practice 2003; 11: 733–764
41. Rantzer A. On relaxed dynamic programming in switching systems. IEE Proc Special Issue Hybrid Syst 2005 (to appear)
42. Rawlings JB. Tutorial overview of model predictive control. Control Syst Mag 2000; 20: 38–52
43. Secomandi N. Comparing neuro-dynamic programming algorithms for the vehicle routing problem with stochastic demands. Comput Operations Res 2000; 27: 1201–1225
44. Secomandi N. A rollout policy for the vehicle routing problem with stochastic demands. Operations Res 2001; 49: 796–802
45. Secomandi N. Analysis of a rollout approach to sequencing problems with stochastic routing applications. J Heuristics 2003; 9: 321–352
46. Stewart BS, White CC. Multiobjective A*. J ACM 1991; 38: 775–814
47. Tu F, Pattipati KR. Rollout strategies for sequential fault diagnosis. IEEE Trans Syst Man Cybernet A 2003; 33: 86–99
48. Wu G, Chong EKP, Givan RL. Congestion control using policy rollout. In: Proceedings of the 42nd IEEE CDC, Maui, Hawaii, 2003, pp. 4825–4830
49. White CC, Harrington DP. Application of Jensen's inequality to adaptive suboptimal design. J Optim Theory Appl 1980; 32: 89–99
50. Witsenhausen HS. Inequalities for the performance of suboptimal uncertain systems. Automatica 1969; 5: 507–512
51. Witsenhausen HS. On performance bounds for uncertain systems. SIAM J Control 1970; 8: 55–89
52. Yan X, Diaconis P, Rusmevichientong P, Van Roy B. Solitaire: man versus machine. Adv Neural Inform Process Syst 2005; 17: (to appear)
