Dynamic Programming and Optimal Control Script



Dynamic Programming and Optimal Control

Script

Prof. Raffaello D'Andrea

Lecture notes

Dieter Baldinger, Thomas Mantel, Daniel Rohrer

HS 2010


    Contents

1 Introduction
  1.1 Class Objective
  1.2 Key Ingredients
  1.3 Open Loop versus Closed Loop Control
  1.4 Discrete State and Finite State Problem
  1.5 The Basic Problem

2 Dynamic Programming Algorithm
  2.1 Principle of Optimality
  2.2 The DPA
  2.3 Chess Match Strategy Revisited
  2.4 Converting non-standard problems to the Basic Problem
    2.4.1 Time Lags
    2.4.2 Correlated Disturbances
    2.4.3 Forecasts
  2.5 Deterministic, Finite State Systems
    2.5.1 Convert DP to Shortest Path Problem
    2.5.2 DP algorithm
    2.5.3 Forward DP algorithm
  2.6 Converting Shortest Path to DP
  2.7 Viterbi Algorithm
  2.8 Shortest Path Algorithms
    2.8.1 Label Correcting Methods
    2.8.2 A* Algorithm
  2.9 Multi-Objective Problems
    2.9.1 Extended Principle of Optimality
  2.10 Infinite Horizon Problems
  2.11 Stochastic, Shortest Path Problems
    2.11.1 Main Result
    2.11.2 Sketch of Proof
    2.11.3 Proof of B
  2.12 Summary of previous lecture
  2.13 How do we solve Bellman's Equation?
    2.13.1 Method 1: Value Iteration (VI)
    2.13.2 Method 2: Policy Iteration (PI)
    2.13.3 Third Method: Linear Programming
    2.13.4 Analogies and Connections
  2.14 Discounted Problems

3 Continuous Time Optimal Control
  3.1 The Hamilton Jacobi Bellman (HJB) Equation
  3.2 Aside on Notation
    3.2.1 The Minimum Principle
  3.3 Extensions
    3.3.1 Fixed Terminal State
    3.3.2 Free initial state, with cost
  3.4 Linear Systems and Quadratic Costs
    3.4.1 Summary
  3.5 General Problem Formulation

Bibliography


    Chapter 1

    Introduction

    1.1 Class Objective

The class objective is to make multiple decisions in stages to minimize a cost that captures undesirable outcomes.

    1.2 Key Ingredients

    1. Underlying discrete time system:

x_{k+1} = f_k(x_k, u_k, w_k), \quad k = 0, 1, \ldots, N-1

    k: discrete time index

    xk: state

    uk: control input, decision variable

    wk: disturbance or noise, random parameters

    N: time horizon

    fk: function, captures system evolution

2. Additive cost function:

g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k)

where g_N(x_N) is the terminal cost and the sum is the accumulated stage cost; g_k is a given nonlinear function.

The cost is a function of the control applied. Because the w_k are random, we typically consider the expected cost:

E_{w_k} \left[ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) \right]


    Example 1: Inventory Control:

Keeping an item stocked in a warehouse. Too little, and you run out (bad). Too much, and you pay for storage and misuse of capital (bad).

x_k: stock in the warehouse at the beginning of the kth time period

u_k: stock ordered and immediately delivered at the beginning of the kth time period

w_k: demand during the kth period, with some given probability distribution

    Dynamics:

x_{k+1} = x_k + u_k - w_k

    Excess demand is backlogged and corresponds to negative values.

Cost:

E \left[ R(x_N) + \sum_{k=0}^{N-1} \left( r(x_k) + c\, u_k \right) \right]

r(x_k): penalizes too much stock or negative stock (backlogged demand)

c\, u_k: cost of the items ordered

R(x_N): terminal cost from items at the end that can't be sold, or demand that can't be met

Objective: The objective is to minimize the cost subject to u_k \ge 0.

    1.3 Open Loop versus Closed Loop Control

Open Loop: Come up with the control inputs u_0, \ldots, u_{N-1} before k = 0. In Open Loop the objective is to calculate \{u_0, \ldots, u_{N-1}\}.

Closed Loop: Wait until time k to make the decision. This assumes x_k is measurable. Closed Loop will always give performance at least as good, but is computationally much more expensive. In Closed Loop the objective is to calculate the optimal rule u_k = \mu_k(x_k); \pi = \{\mu_0, \ldots, \mu_{N-1}\} is a policy or control law.


    Example 2:

\mu_k(x_k) = \begin{cases} s_k - x_k & \text{if } x_k < s_k \\ 0 & \text{otherwise} \end{cases}

    sk is some threshold.

    1.4 Discrete State and Finite State Problem

When the state x_k takes on discrete values or the state space is finite, it is often convenient to express the dynamics in terms of transition probabilities:

P_{ij}(u, k) := \mathrm{Prob}(x_{k+1} = j \mid x_k = i, u_k = u)

    i: start state

    j: possible future state

    u: control input

    k: time

This is equivalent to x_{k+1} = w_k, where w_k has the distribution

\mathrm{Prob}(w_k = j \mid x_k = i, u_k = u) := P_{ij}(u, k)

    Example 3: Optimizing Chess Playing Strategies:

Two game chess match with an opponent; the objective is to come up with a strategy that maximizes the chance to win.

Each game can have 2 outcomes:
a) Win by one player: 1 point for the winner, 0 points for the loser.
b) Draw: 0.5 points for each player.

If the score is tied 1-1 at the end of 2 games, the match goes into sudden death mode until someone wins.

Decision variable for the player: two playing styles:
1) Timid play: draw with probability p_d, lose with probability 1 - p_d.
2) Bold play: win with probability p_w, lose with probability 1 - p_w.
Assume p_d > p_w as a necessary condition for the problem to make sense.


Problem: What playing style should be chosen? Since it doesn't make sense to play Timid if we are tied 1-1 at the end of 2 games, it is a 2-stage finite problem.

Transition Probability Graph: The graphs below show all possible outcomes.

[Figure 1.1: First Game. (a) Timid Play and (b) Bold Play: transition probability graphs over the score pairs, with edges labeled p_d, 1 - p_d and p_w, 1 - p_w.]

[Figure 1.2: Second Game. (a) Timid Play and (b) Bold Play: transition probability graphs from the possible first-game scores, with edges labeled p_d, 1 - p_d and p_w, 1 - p_w.]

Closed Loop Strategy: Play timid iff the player is ahead.

The probability of winning is:

p_d\, p_w + p_w \left( (1 - p_d)\, p_w + p_w (1 - p_w) \right) = p_w^2 (2 - p_w) + p_w (1 - p_w)\, p_d

For \{p_w = 0.45, p_d = 0.9\} and \{p_w = 0.5, p_d = 1\} the probabilities to win are 0.54 and 0.625, respectively.

    Open Loop Strategy Possibilities:


[Figure 1.3: Closed Loop Strategy. Transition probability graph for the policy "timid iff ahead"; the terminal score pairs are marked WIN or LOSE.]

1) Timid in the first 2 games: p_d^2\, p_w
2) Bold in both: p_w^2 (3 - 2 p_w)
3) Bold in the first, timid in the second game: p_w p_d + p_w (1 - p_d) p_w
4) Timid in the first, bold in the second game: p_w p_d + p_w (1 - p_d) p_w

Clearly 1) is not the optimal OL strategy, because p_d^2\, p_w \le p_d\, p_w \le p_w p_d + \ldots. The best strategy yields

p_w^2 + p_w (1 - p_w) \max(2 p_w, p_d),

so the optimal OL strategy is 3) or 4) if p_d > 2 p_w, and 2) otherwise. It can be shown that if p_w \le 0.5, then the open loop probability of winning is \le 0.5.

    1.5 The Basic Problem

    Summarize basic problem setup:

x_{k+1} = f_k(x_k, u_k, w_k), \quad k = 0, 1, \ldots, N-1

x_k \in S_k: state space
u_k \in C_k: control space
w_k \in D_k: disturbance space

u_k \in U(x_k) \subseteq C_k: the control is constrained not only as a function of time, but also of the current state.

w_k \sim P(\cdot \mid x_k, u_k): the noise distribution can depend on the current state and the applied control.


Consider policies, or control laws,

\pi = \{\mu_0, \mu_1, \ldots, \mu_{N-1}\}

where \mu_k maps the state x_k into the control u_k = \mu_k(x_k), such that \mu_k(x_k) \in U(x_k) for all x_k \in S_k. The set of all such \pi is called the set of Admissible Policies, denoted \Pi.

Given a policy \pi, the expected cost of starting at state x_0 is

J_\pi(x_0) := E_{w_k} \left[ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, \mu_k(x_k), w_k) \right]

Optimal Policy \pi^*: J_{\pi^*}(x_0) \le J_\pi(x_0) for all \pi \in \Pi.  Optimal Cost: J^*(x_0) := J_{\pi^*}(x_0).


    Chapter 2

Dynamic Programming Algorithm

At the heart of the DP algorithm is the following very simple and intuitive idea.

    2.1 Principle of Optimality

Let \pi^* = \{\mu_0^*, \mu_1^*, \ldots, \mu_{N-1}^*\} be an optimal policy. Assume that in the process of using \pi^*, a state x_i occurs at time i. Consider the subproblem whereby at time i we are at state x_i and we want to minimize

E_{w_k} \left[ g_N(x_N) + \sum_{k=i}^{N-1} g_k(x_k, \mu_k(x_k), w_k) \right].

Then the truncated policy \{\mu_i^*, \mu_{i+1}^*, \ldots, \mu_{N-1}^*\} is optimal for this subproblem.

The proof is simple: prove by contradiction. If the truncated policy were not optimal, you could find a different policy that would give a lower cost for the subproblem. Applying that policy to the original problem from time i onwards would therefore also give a lower cost, which contradicts that \pi^* was an optimal policy.

    Example 4: Deterministic Scheduling Problem

We have 4 machines A, B, C, D that are used to make something.

A must occur before B, and C before D.

The solution is obtained by calculating the optimal cost for each node, beginning at the bottom of the tree. See Figure 2.1.


[Figure 2.1: Problem of Example 4 with the optimal cost-to-go for each node written above it (in circles); nodes are the partial schedules (I.C., A, C, AB, AC, CA, CD, ..., CDAB) and arcs carry the transition costs.]

    2.2 The DPA

For every initial state x_0, the optimal cost J^*(x_0) is equal to J_0(x_0), given by the last step of the following recursive algorithm, which proceeds backwards in time from N-1 to 0:

Initialization: J_N(x_N) = g_N(x_N) for all x_N \in S_N

Recursion:

J_k(x_k) = \min_{u_k \in U_k(x_k)} E_{w_k} \left[ g_k(x_k, u_k, w_k) + J_{k+1}\left( f_k(x_k, u_k, w_k) \right) \right]

where the expectation is taken with respect to P(\cdot \mid x_k, u_k).

Furthermore, if u_k^* = \mu_k^*(x_k) minimizes the recursion equation for each x_k and k, the policy \pi^* = \{\mu_0^*, \ldots, \mu_{N-1}^*\} is optimal.

    Comments

For each recursion step, we have to perform the optimization over all possible values x_k \in S_k, since we don't know a priori which states we will actually visit.

This pointwise optimization is what gives us \mu_k^*.
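When the state, control and disturbance spaces are finite, the recursion translates almost directly into code. The following is a minimal sketch, not from the script; the function names and the representation of the disturbance as a list of (value, probability) pairs are assumptions made for illustration.

import numpy as np

def dp_backward(states, controls, f, g, g_terminal, disturbances, N):
    # Backward DP for the basic problem (illustrative sketch):
    #   f(k, x, u, w)         -> next state
    #   g(k, x, u, w)         -> stage cost
    #   g_terminal(x)         -> terminal cost g_N
    #   controls(k, x)        -> admissible controls U_k(x)
    #   disturbances(k, x, u) -> list of (w, probability) pairs
    # Returns the cost-to-go J[k][x] and the optimal policy mu[k][x].
    J = {N: {x: g_terminal(x) for x in states}}
    mu = {}
    for k in range(N - 1, -1, -1):
        J[k], mu[k] = {}, {}
        for x in states:
            best_u, best_cost = None, np.inf
            for u in controls(k, x):
                # expected stage cost plus cost-to-go, expectation over w_k
                cost = sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                           for w, p in disturbances(k, x, u))
                if cost < best_cost:
                    best_u, best_cost = u, cost
            J[k][x], mu[k][x] = best_cost, best_u
    return J, mu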

    Proof 1: (Read section 1.5 in [1] if you are mathematically inclined).


Denote \pi^k := \{\mu_k, \mu_{k+1}, \ldots, \mu_{N-1}\}. Denote by

J_k^*(x_k) = \min_{\pi^k} E_{w_k, \ldots, w_{N-1}} \left[ g_N(x_N) + \sum_{i=k}^{N-1} g_i\left( x_i, \mu_i(x_i), w_i \right) \right]

the optimal cost when, starting at time k, we find ourselves at state x_k. Finally, J_N^*(x_N) = g_N(x_N).

We will show that J_k^* = J_k, where J_k is generated by the DPA; this gives us the desired result when k = 0.

Induction: J_N^*(x_N) = J_N(x_N), true for k = N.
Assume true for k+1: J_{k+1}^*(x_{k+1}) = J_{k+1}(x_{k+1}) for all x_{k+1} \in S_{k+1}.
Then, since \pi^k = \{\mu_k, \pi^{k+1}\}, we have

J_k^*(x_k) = \min_{(\mu_k, \pi^{k+1})} E_{w_k, \ldots, w_{N-1}} \left[ g_k(x_k, \mu_k(x_k), w_k) + g_N(x_N) + \sum_{i=k+1}^{N-1} g_i(x_i, \mu_i(x_i), w_i) \right]

by the principle of optimality:

= \min_{\mu_k} E_{w_k} \left[ g_k(x_k, \mu_k(x_k), w_k) + \min_{\pi^{k+1}} E_{w_{k+1}, \ldots, w_{N-1}} \left[ g_N(x_N) + \sum_{i=k+1}^{N-1} g_i(x_i, \mu_i(x_i), w_i) \right] \right]

by the definition of J_{k+1}^* and the update equation:

= \min_{\mu_k} E_{w_k} \left[ g_k(x_k, \mu_k(x_k), w_k) + J_{k+1}^*\left( f_k(x_k, \mu_k(x_k), w_k) \right) \right]

by the induction hypothesis:

= \min_{\mu_k} E_{w_k} \left[ g_k(x_k, \mu_k(x_k), w_k) + J_{k+1}\left( f_k(x_k, \mu_k(x_k), w_k) \right) \right]

= \min_{u_k \in U_k(x_k)} E_{w_k} \left[ g_k(x_k, u_k, w_k) + J_{k+1}\left( f_k(x_k, u_k, w_k) \right) \right]

= J_k(x_k)

In other words: minimizing over a function (the policy \mu_k) is the same as carrying out the minimization pointwise, for each x_k.

J_k(x_k) is called the cost-to-go at state x_k. J_k(\cdot) is called the cost-to-go function.

    2.3 Chess Match Strategy Revisited

    Recall

Timid Play: prob. tie = p_d, prob. loss = 1 - p_d


Bold Play: prob. win = p_w, prob. loss = 1 - p_w
2 game match, plus tie breaker if necessary.

Objective: Find the policy which maximizes the probability of winning. We will solve with DP, replacing min by max. Assume p_d > p_w.

Define x_k = difference between our score and the opponent's score at the end of game k. Recall: 1 point for a win, 0 for a loss and 0.5 for a tie.

Define J_k(x_k) = probability of winning the match at time k if the state is x_k.

Start the recursion:

J_2(x_2) = \begin{cases} 1 & \text{if } x_2 > 0 \\ p_w & \text{if } x_2 = 0 \\ 0 & \text{if } x_2 < 0 \end{cases}

Recursive equation:

J_k(x_k) = \max \left[ \underbrace{p_d\, J_{k+1}(x_k) + (1 - p_d)\, J_{k+1}(x_k - 1)}_{\text{timid}},\; \underbrace{p_w\, J_{k+1}(x_k + 1) + (1 - p_w)\, J_{k+1}(x_k - 1)}_{\text{bold}} \right]

Convince yourself that this is equivalent to the formal definition

J_k(x_k) = \max_{u_k} E_{w_k} \left[ g_k(x_k, u_k, w_k) + J_{k+1}\left( f_k(x_k, u_k, w_k) \right) \right]

Note: there is only a terminal cost in this problem.

J_1(x_1) = \max \left[ p_d\, J_2(x_1) + (1 - p_d)\, J_2(x_1 - 1),\; p_w\, J_2(x_1 + 1) + (1 - p_w)\, J_2(x_1 - 1) \right]

If x_1 = 1: \max \left[ \underbrace{p_d + (1 - p_d)\, p_w}_{\text{timid}},\; \underbrace{p_w + (1 - p_w)\, p_w}_{\text{bold}} \right]. Which is bigger? Timid - Bold = (p_d - p_w)(1 - p_w) > 0, so Timid is optimal, and J_1(1) = p_d + (1 - p_d)\, p_w.

If x_1 = 0: \max \left[ \underbrace{p_d\, p_w + (1 - p_d) \cdot 0}_{\text{timid}},\; \underbrace{p_w + (1 - p_w) \cdot 0}_{\text{bold}} \right]. The optimum is p_w, so J_1(0) = p_w and Bold is the optimal strategy.

If x_1 = -1: \max \left[ 0,\; p_w^2 \right], so J_1(-1) = p_w^2 and the optimal strategy is Bold.


J_0(0) = \max \left[ p_d\, J_1(0) + (1 - p_d)\, J_1(-1),\; p_w\, J_1(1) + (1 - p_w)\, J_1(-1) \right]
       = \max \left[ p_d\, p_w + (1 - p_d)\, p_w^2,\; p_w \left( p_d + (1 - p_d)\, p_w \right) + (1 - p_w)\, p_w^2 \right]
       = \max \left[ p_d\, p_w + (1 - p_d)\, p_w^2,\; p_d\, p_w + (1 - p_d)\, p_w^2 + (1 - p_w)\, p_w^2 \right]

J_0(0) = p_d\, p_w + (1 - p_d)\, p_w^2 + (1 - p_w)\, p_w^2, and the optimal strategy at 0-0 is Bold.

Optimal Strategy: If ahead, play Timid; otherwise play Bold.
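A short numerical check of this recursion (a sketch, not from the script; the helper names and the dictionary keyed by score difference are illustrative choices):

def chess_dp(p_w, p_d, N=2):
    # Backward DP for the two-game chess match. State x = our score minus
    # the opponent's score; returns J_0(0) and the optimal play per (k, x).
    def diffs(k):                    # reachable score differences after k games
        return [i * 0.5 for i in range(-2 * k, 2 * k + 1)]

    # terminal condition: win if ahead, sudden death (play bold) if tied
    J = {x: 1.0 if x > 0 else (p_w if x == 0 else 0.0) for x in diffs(N)}
    policy = {}
    for k in range(N - 1, -1, -1):
        Jk = {}
        for x in diffs(k):
            timid = p_d * J[x] + (1 - p_d) * J[x - 1]      # draw keeps x, loss -> x-1
            bold = p_w * J[x + 1] + (1 - p_w) * J[x - 1]   # win -> x+1, loss -> x-1
            Jk[x] = max(timid, bold)
            policy[(k, x)] = 'timid' if timid >= bold else 'bold'
        J = Jk
    return J[0.0], policy

# e.g. chess_dp(0.45, 0.9)[0] is roughly 0.54, matching the value computed above.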

2.4 Converting non-standard problems to the Basic Problem

    2.4.1 Time Lags

Assume the update equation is of the following form:

x_{k+1} = f_k(x_k, x_{k-1}, u_k, u_{k-1}, w_k)

Define y_k = x_{k-1} and s_k = u_{k-1}. Then

\begin{pmatrix} x_{k+1} \\ y_{k+1} \\ s_{k+1} \end{pmatrix} = \begin{pmatrix} f_k(x_k, y_k, u_k, s_k, w_k) \\ x_k \\ u_k \end{pmatrix} =: \tilde f_k(\tilde x_k, u_k, w_k)

Let \tilde x_k = (x_k, y_k, s_k), so that \tilde x_{k+1} = \tilde f_k(\tilde x_k, u_k, w_k).

The control is u_k = \mu_k(\tilde x_k) = \mu_k(x_k, x_{k-1}, u_{k-1}). This can be generalized to more than one time lag.

    2.4.2 Correlated Disturbances

If the disturbances are not independent, but can be modeled as the output of a system driven by independent disturbances: Colored Noise.

    Example 5:

w_k = C_k\, y_{k+1}
y_{k+1} = A_k\, y_k + \xi_k

A_k, C_k are given, and \{\xi_k\} is an independent sequence. As usual, x_{k+1} = f_k(x_k, u_k, w_k). Then

\begin{pmatrix} x_{k+1} \\ y_{k+1} \end{pmatrix} = \begin{pmatrix} f_k\left( x_k, u_k, C_k (A_k y_k + \xi_k) \right) \\ A_k y_k + \xi_k \end{pmatrix}, \qquad u_k = \mu_k(x_k, y_k),

which is now in the standard form. In general, y_k cannot be measured and must be estimated.

    2.4.3 Forecasts

When state information includes knowledge of probability distributions: at the beginning of each period, we receive information about the probability distribution of the upcoming disturbance. In particular, assume the disturbance could have one of the probability distributions \{Q_1, Q_2, \ldots, Q_m\}, with a priori probabilities p_1, \ldots, p_m. Receiving forecast i means that Q_i is used to generate the disturbance. Model this as follows: y_{k+1} = \xi_k, where \xi_k is a random variable taking value i with probability p_i, and w_k has probability distribution Q_{y_k}.

Then we have

\begin{pmatrix} x_{k+1} \\ y_{k+1} \end{pmatrix} = \begin{pmatrix} f_k(x_k, u_k, w_k) \\ \xi_k \end{pmatrix},

with new state \tilde x_k = (x_k, y_k). Since y_k is known at time k, we have a Basic Problem formulation. The new disturbance \tilde w_k = (w_k, \xi_k) depends on the current state, which is allowed. The DPA takes on the following form:

J_N(x_N, y_N) = g_N(x_N)

J_k(x_k, y_k) = \min_{u_k} E_{w_k} E_{\xi_k} \left[ g_k(x_k, u_k, w_k) + J_{k+1}\left( f_k(x_k, u_k, w_k), \xi_k \right) \,\middle|\, y_k \right]
            = \min_{u_k} E_{w_k} \left[ g_k(x_k, u_k, w_k) + \sum_{i=1}^{m} p_i\, J_{k+1}\left( f_k(x_k, u_k, w_k), i \right) \,\middle|\, y_k \right]

where the conditional expectation simply means that w_k has probability distribution Q_{y_k}: for y_k \in \{1, \ldots, m\}, the expectation over w_k is taken with respect to the distribution Q_{y_k}.


    2.5 Deterministic, Finite State Systems

Recall the Basic Problem:

x_{k+1} = f_k(x_k, u_k, w_k), \quad k = 0, \ldots, N-1, \qquad g_k(x_k, u_k, w_k) \text{ the cost at stage } k.

Consider problems where

1. x_k \in S_k, with S_k a finite set
2. there are no disturbances w_k.

We assume, without loss of generality, that there is only one way to go from state i \in S_k to j \in S_{k+1} (if there is more than one way, pick the one with lowest cost at stage k).

    2.5.1 Convert DP to Shortest Path Problem

[Figure 2.2: General Shortest Path Problem. The stages 1 through N lie between the start node S and an artificial terminal node T.]

a_{ij}^k = cost to go from state i \in S_k to state j \in S_{k+1} at time k. This is set to \infty if there is no way to go from i \in S_k to j \in S_{k+1}.

a_{iT}^N = terminal cost of state i \in S_N.

In other words,

a_{ij}^k = g_k(i, u_k^{ij}), \quad \text{where } j = f_k(i, u_k^{ij})

a_{iT}^N = g_N(i)


    2.5.2 DP algorithm

J_N(i) = a_{iT}^N, \quad i \in S_N

J_k(i) = \min_{j \in S_{k+1}} \left[ a_{ij}^k + J_{k+1}(j) \right], \quad i \in S_k,\; k = 0, \ldots, N-1

This solves the shortest path problem.

2.5.3 Forward DP algorithm

By inspection, the problem is symmetric: the shortest path from S to T is the same as from T to S. This motivates the following algorithm, where \tilde J_k(j) is the optimal cost to arrive at state j:

\tilde J_N(j) = a_{Sj}^0, \quad j \in S_1

\tilde J_k(j) = \min_{i \in S_{N-k}} \left[ a_{ij}^{N-k} + \tilde J_{k+1}(i) \right], \quad j \in S_{N-k+1},\; k = 1, \ldots, N-1

\tilde J_0(T) = \min_{i \in S_N} \left[ a_{iT}^N + \tilde J_1(i) \right]

\tilde J_0(T) = J_0(S)
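For the deterministic, finite state case the backward recursion is a few lines of code. This is an illustrative sketch (the data layout and names are assumptions, not from the script):

import math

def shortest_path_dp(stages, arc_cost, terminal_cost):
    # Backward DP over a stage graph:
    #   stages            : list [S_0, S_1, ..., S_N] of state lists
    #   arc_cost(k, i, j) : a_ij^k for i in S_k, j in S_{k+1} (math.inf if no arc)
    #   terminal_cost(i)  : a_iT^N for i in S_N
    # Returns J, where J[k][i] is the shortest path cost from i at stage k to T.
    N = len(stages) - 1
    J = [dict() for _ in range(N + 1)]
    J[N] = {i: terminal_cost(i) for i in stages[N]}
    for k in range(N - 1, -1, -1):
        J[k] = {i: min(arc_cost(k, i, j) + J[k + 1][j] for j in stages[k + 1])
                for i in stages[k]}
    return J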

    2.6 Converting Shortest Path to DP

[Figure 2.3: Another path problem, in which cycles are allowed: a general graph with a start and an end node.]

    As an example for a mental picture, one could imagine cities on a map.


Let \{1, 2, \ldots, N, T\} be the nodes of the graph, and a_{ij} the cost to move from node i to node j, with a_{ij} = \infty if there is no edge. Here i and j denote nodes, as opposed to the previous section where they were states.

Assume that all cycles have non-negative cost. This isn't an issue if all edges have cost \ge 0.

Note that with the above assumption, an optimal path has length \le N (it visits at most all nodes).

Set up the problem so that we require exactly N moves, where degenerate moves are allowed (a_{ii} = 0):

J_k(i) = optimal cost of getting from i to T in N - k moves

J_N(i) = a_{iT} (can be infinite, of course)

J_k(i) = \min_j \left[ a_{ij} + J_{k+1}(j) \right]

(the optimal (N-k)-move cost is a_{ij} plus the optimal (N-k-1)-move cost from j). Notice that degenerate moves are allowed; they are removed in the end.

Terminate the procedure if J_k(i) = J_{k+1}(i) for all i.

    2.7 Viterbi Algorithm

This is a powerful combination of DP and Bayes' Rule for optimal estimation.

Given a Markov Chain with state transition probabilities p_{ij}:

p_{ij} = P(x_{k+1} = j \mid x_k = i), \quad 1 \le i, j \le M

p(x_0) = initial probability distribution of the starting state.

We can only indirectly observe the state via measurements:

r(z; i, j) = P(\text{meas} = z \mid x_k = i, x_{k+1} = j) \quad \text{for all } k,

where r is the likelihood function.


Objective: Given the measurements Z_N = \{z_1, \ldots, z_N\}, construct the state sequence X_N = \{x_0, \ldots, x_N\} that maximizes P(X_N \mid Z_N) over all X_N: the most likely state sequence.

Recall P(X_N, Z_N) = P(X_N \mid Z_N)\, P(Z_N). For a given Z_N, maximizing P(X_N, Z_N) over X_N gives the same result as maximizing P(X_N \mid Z_N) over X_N.

P(X_N, Z_N) = P(x_0, \ldots, x_N, z_1, \ldots, z_N)
= P(x_1, \ldots, x_N, z_1, \ldots, z_N \mid x_0)\, P(x_0)
= P(x_2, \ldots, x_N, z_2, \ldots, z_N \mid x_0, x_1, z_1)\, P(x_1, z_1 \mid x_0)\, P(x_0)
= P(x_2, \ldots, x_N, z_2, \ldots, z_N \mid x_0, x_1, z_1)\, P(z_1 \mid x_0, x_1)\, P(x_1 \mid x_0)\, P(x_0)
= P(x_2, \ldots, x_N, z_2, \ldots, z_N \mid x_0, x_1, z_1)\, r(z_1; x_0, x_1)\, p_{x_0, x_1}\, P(x_0)

One more step:

P(x_2, \ldots, x_N, z_2, \ldots, z_N \mid x_0, x_1, z_1)
= P(x_3, \ldots, x_N, z_3, \ldots, z_N \mid x_0, x_1, z_1, x_2, z_2)\, P(x_2, z_2 \mid x_0, x_1, z_1)
= P(x_3, \ldots, x_N, z_3, \ldots, z_N \mid x_0, x_1, z_1, x_2, z_2)\, P(z_2 \mid x_0, x_1, z_1, x_2)\, P(x_2 \mid x_0, x_1, z_1)
= P(x_3, \ldots, x_N, z_3, \ldots, z_N \mid x_0, x_1, z_1, x_2, z_2)\, r(z_2; x_1, x_2)\, p_{x_1, x_2}

Keep going, and one gets:

P(X_N, Z_N) = P(x_0) \prod_{k=1}^{N} p_{x_{k-1}, x_k}\, r(z_k; x_{k-1}, x_k)

Assume that all quantities are > 0 (if some are = 0, the algorithm can be modified). Since the above is strictly positive and the log function is monotonically increasing in its argument, maximizing P(X_N, Z_N) is equivalent to

\min_{X_N} \left[ -\log P(x_0) - \sum_{k=1}^{N} \log\left( p_{x_{k-1}, x_k}\, r(z_k; x_{k-1}, x_k) \right) \right],

which is a deterministic, finite state DP (shortest path) problem over the stages k = 1, \ldots, N.

Forward DP: at time k, we can calculate the cost to arrive at any state. We don't have to wait until the end to solve the problem.
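A compact forward-DP sketch of the Viterbi algorithm (illustrative; the data layout and names are assumptions, and all probabilities are taken to be strictly positive as above):

import math

def viterbi(p0, p, r, Z):
    # p0[i]      : initial probability of state i
    # p[i][j]    : transition probability P(x_{k+1}=j | x_k=i)
    # r(z, i, j) : measurement likelihood P(z | x_k=i, x_{k+1}=j)
    # Z          : measurements z_1..z_N
    # Minimizes the sum of -log terms by forward DP and returns x_0..x_N.
    states = range(len(p0))
    cost = {i: -math.log(p0[i]) for i in states}     # cost to arrive at x_0 = i
    parent = []
    for z in Z:
        new_cost, back = {}, {}
        for j in states:
            best_i = min(states,
                         key=lambda i: cost[i] - math.log(p[i][j] * r(z, i, j)))
            new_cost[j] = cost[best_i] - math.log(p[best_i][j] * r(z, best_i, j))
            back[j] = best_i
        cost, parent = new_cost, parent + [back]
    # backtrack from the cheapest terminal state
    x = [min(states, key=lambda i: cost[i])]
    for back in reversed(parent):
        x.append(back[x[-1]])
    return list(reversed(x))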

    2.8 Shortest Path Algorithms

Look at alternatives to DP for problems that are finite and deterministic.

Path length = cost.


    2.8.1 Label Correcting Methods

Assume a_{ij} \ge 0: the arc length (cost to go from node i to node j) is \ge 0.

[Figure 2.4: Diagram of the label correcting algorithm: a node i is removed from the OPEN bin, each child j is tested with "is d_i + a_{ij} < d_j?" and "is d_i + a_{ij} < d_T?", and if so d_j is set to d_i + a_{ij}.]

Let d_i be the length of the shortest path to i found so far.

Step 0: Place node S in the OPEN bin, set d_S = 0 and d_j = \infty for all j \ne S.

Step 1: Remove a node i from OPEN, and execute Step 2 for all children j of i.

Step 2: If d_i + a_{ij} < \min(d_j, d_T), set d_j = d_i + a_{ij} and set i to be the parent of j. If j \ne T, place j in OPEN if it is not already there.

Step 3: If OPEN is empty, we are done. If not, go back to Step 1.
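A direct transcription of these steps into code, as a sketch (the graph representation is an assumption; removal is last-in/first-out, i.e. the depth-first variant used in the example below, and the target is assumed reachable):

import math

def label_correcting(nodes, arcs, start, target):
    # arcs[i] is a list of (j, a_ij) pairs with a_ij >= 0
    d = {i: math.inf for i in nodes}      # d_i: shortest path to i found so far
    d[start], d_T = 0.0, math.inf
    parent = {}
    OPEN = [start]
    while OPEN:
        i = OPEN.pop()                    # Step 1: remove a node from OPEN
        for j, a_ij in arcs.get(i, []):
            if d[i] + a_ij < min(d[j], d_T):   # Step 2
                d[j] = d[i] + a_ij
                parent[j] = i
                if j == target:
                    d_T = d[j]
                elif j not in OPEN:
                    OPEN.append(j)
    # Step 3 done: reconstruct the optimal path by following parents back
    path = [target]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return d_T, list(reversed(path))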

    Example 6: Deterministic Scheduling Problem (revisited)


[Figure 2.5: Deterministic Scheduling Problem. Stage graph from the initial condition I.C. over the partial schedules (A, C, AB, AC, CA, CD, ABC, ACB, ACD, CAB, CAD, CDA) to the terminal node T, with arc costs and the precedence constraints "A before B" and "C before D".]

Iteration #   Remove   OPEN                     d_T   OPTIMAL
0             -        S(0)                     inf   -
1             S        A(5), C(3)               inf   -
2             C        A(5), CA(7), CD(9)       inf   -
3             CD       A(5), CA(7), CDA(12)     inf   -
4             CDA      A(5), CA(7)              14    CDAB
5             CA       A(5), CAB(9), CAD(11)    14    CDAB
6             CAD      A(5), CAB(9)             14    CDAB
7             CAB      A(5)                     10    CABD
8             A        AB(7), AC(8)             10    CABD
9             AC       AB(7)                    10    CABD
10            AB       -                        10    CABD

Done, optimal cost = 10, optimal path = CABD.

Different ways to remove items from OPEN give different, well known, algorithms.

Depth-First Search: Last in, first out. This is what we did in the example. It finds a feasible path quickly, and is also good if you have limited memory.


Best-First Search: Remove the best label. This is Dijkstra's method. The remove step is more expensive, but it can give good performance.

Breadth-First Search: First in, first out. Bellman-Ford.

2.8.2 A* Algorithm

The workhorse of many AI applications, e.g. path planning.

Basic idea: Replace the test d_i + a_{ij} < d_T by d_i + a_{ij} + h_j < d_T, where h_j is a lower bound on the shortest distance from j to T. Indeed, if d_i + a_{ij} + h_j \ge d_T, it is clear that a path going through j will not be optimal.

    2.9 Multi-Objective Problems

Example 7: Motivation: we care about both time and fuel.

[Figure 2.6: Possible outcomes in the time-fuel plane, split into inferior and non-inferior points.]

A vector x = (x_1, x_2, \ldots, x_M) \in S is non-inferior if there is no other y \in S such that y_l \le x_l, l = 1, \ldots, M, with strict inequality for at least one of these l.

Given a problem with M cost functions f_1(x), \ldots, f_M(x), x \in X is a non-inferior solution if the vector (f_1(x), \ldots, f_M(x)) is a non-inferior vector of the set \{ (f_1(y), \ldots, f_M(y)) \mid y \in X \}.

A reasonable goal: find all non-inferior solutions, then use another criterion to pick which one you actually want to use.


How this applies to deterministic, finite state DP (which is equivalent to a shortest path problem):

x_{k+1} = f_k(x_k, u_k) \quad \text{(dynamics)}

g_N^l(x_N) + \sum_{k=0}^{N-1} g_k^l(x_k, u_k), \quad l = 1, \ldots, M \quad \text{(costs)}

    2.9.1 Extended Principle of Optimality

If \{u_k, \ldots, u_{N-1}\} is a non-inferior control sequence for the tail subproblem that starts at x_k, then \{u_{k+1}, \ldots, u_{N-1}\} is also non-inferior for the tail subproblem that starts at f_k(x_k, u_k). Simple proof: by contradiction.

Algorithm. First define what we will do the recursion over:

F_k(x_k): the set of M-tuples (vectors of size M) of cost-to-go at x_k which are non-inferior.

F_N(x_N) = \{ (g_N^1(x_N), \ldots, g_N^M(x_N)) \}: only one element in the set for each x_N.

Given F_{k+1}(x_{k+1}) for all x_{k+1}, generate for each state x_k the set of vectors (g_k^1(x_k, u_k) + c^1, \ldots, g_k^M(x_k, u_k) + c^M) such that (c^1, \ldots, c^M) \in F_{k+1}(f_k(x_k, u_k)). These are all possible costs that are consistent with F_{k+1}(x_{k+1}). Then to obtain F_k(x_k), simply extract all non-inferior elements.

[Figure 2.7: Possible sets for F_{k+1}.]

When we calculate F_0(x_0), we will have all non-inferior solutions.


    2.10 Infinite Horizon Problems

Consider the time (or iteration) invariant case:

x_{k+1} = f(x_k, u_k, w_k), \quad x_k \in S,\; u_k \in U,\; w_k \sim P(\cdot \mid x_k, u_k)

J_\pi(x_0) = E \left[ \sum_{k=0}^{N-1} g(x_k, \mu_k(x_k), w_k) \right], \quad \text{no terminal cost.}

Write down the DP algorithm:

J_N(x_N) = 0

J_k(x_k) = \min_{u_k \in U} E_{w_k} \left[ g(x_k, u_k, w_k) + J_{k+1}\left( f(x_k, u_k, w_k) \right) \right] \quad \text{for all } k

Question: What happens as N \to \infty? Does the problem become easier? Yes. Reason: we lose the notion of time. For a very large class of problems, we have the Bellman Equation:

J^*(x) = \min_{u} E_{w} \left[ g(x, u, w) + J^*\left( f(x, u, w) \right) \right] \quad \text{for all } x \in S

The Bellman Equation involves solving for the optimal cost-to-go function J^*(x) for all x \in S.

u^* = \mu^*(x) gives the optimal policy (\mu^*(\cdot) is obtained from the solution of the Bellman Equation: for every x there is a minimizing u).

There are efficient methods for solving the Bellman Equation, and technical conditions on when this can be done.

    2.11 Stochastic, Shortest Path Problems

x_{k+1} = w_k, \quad x_k \in S, \text{ a finite set}

P(w_k = j \mid x_k = i, u_k = u) = p_{ij}(u), \quad u_k \in U(x_k), \text{ a finite set}


We have a finite number of states. The transition from one state to the next is dictated by p_{ij}(u): the probability that the next state is j given that the current state is i. u is the control input; we can control what these transition probabilities are, with a finite set of options u \in U(i). The problem data is time (or iteration) independent.

Cost: given an initial state i and a policy \pi = \{\mu_0, \mu_1, \ldots\},

J_\pi(i) = \lim_{N \to \infty} E \left[ \sum_{k=0}^{N-1} g(x_k, \mu_k(x_k)) \,\middle|\, x_0 = i \right]

Optimal cost from state i: J^*(i).

Stationary policy: \pi = \{\mu, \mu, \ldots\}. Denote J_\mu(i) as the resulting cost; \{\mu, \mu, \ldots\} is simply referred to as \mu. \mu is optimal if

J_\mu(i) = J^*(i) = \min_\pi J_\pi(i)

Assumptions:

1. Existence of a cost-free termination state t:
   p_{tt}(u) = 1 and g(t, u) = 0 for all u.
   This is a sufficient condition to make the cost meaningful; think of it as a destination state.

2. There exists an integer m such that for all admissible policies \pi
   \rho_\pi := \max_{i = 1, \ldots, n} P(x_m \ne t \mid x_0 = i, \pi) < 1.
   This is a strong assumption, which is only required for the proofs.

    2.11.1 Main Result

A) Given any initial conditions J_0(1), \ldots, J_0(n), the sequence

J_{k+1}(i) = \min_{u \in U(i)} \left[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J_k(j) \right] \quad \text{for all } i

converges to the optimal cost J^*(i) for each i.


B) The optimal cost satisfies Bellman's Equation:

J^*(i) = \min_{u \in U(i)} \left[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J^*(j) \right] \quad \text{for all } i,

which has a unique solution.

    2.11.2 Sketch of Proof

A0) First prove that the cost is bounded.

Recall: there exists m such that for all policies \pi,

\rho_\pi := \max_{i} P(x_m \ne t \mid x_0 = i, \pi) < 1

Since all problem data is finite, \rho := \max_{\pi} \rho_\pi < 1.

P(x_{2m} \ne t \mid x_0 = i, \pi) = P(x_{2m} \ne t \mid x_m \ne t, x_0 = i, \pi)\, P(x_m \ne t \mid x_0 = i, \pi) \le \rho^2

Generally, P(x_{km} \ne t \mid x_0 = i, \pi) \le \rho^k. Furthermore, the cost incurred between the periods km and (k+1)m - 1 is bounded by

m\, \rho^k \max_{i,u} |g(i, u)| =: \rho^k M, \qquad M := m \max_{i,u} |g(i, u)|

|J_\pi(i)| \le \sum_{k=0}^{\infty} M \rho^k = \frac{M}{1 - \rho}, \quad \text{finite.}

A1)

J_\pi(x_0) = \lim_{N \to \infty} E \left[ \sum_{k=0}^{N-1} g(x_k, \mu_k(x_k)) \right]
           = E \left[ \sum_{k=0}^{mK-1} g(x_k, \mu_k(x_k)) \right] + \lim_{N \to \infty} E \left[ \sum_{k=mK}^{N-1} g(x_k, \mu_k(x_k)) \right]

By the previous bound, we know that

\left| \lim_{N \to \infty} E \left[ \sum_{k=mK}^{N-1} g(x_k, \mu_k(x_k)) \right] \right| \le \frac{M \rho^K}{1 - \rho}

As expected, we can make the tail as small as we want.


A2) Recall that we can view J_0 as a terminal cost function, with J_0(i) given. Bound its expected value:

\left| E\left( J_0(x_{mK}) \right) \right| = \left| \sum_{i=1}^{n} P(x_{mK} = i \mid x_0, \pi)\, J_0(i) \right|
\le \left( \sum_{i=1}^{n} P(x_{mK} = i \mid x_0, \pi) \right) \max_i |J_0(i)|
\le \rho^K \max_i |J_0(i)|

A3) Sandwich:

E\left( J_0(x_{mK}) \right) + E \left[ \sum_{k=0}^{mK-1} g(x_k, \mu_k(x_k)) \right]
= E\left( J_0(x_{mK}) \right) + J_\pi(x_0) - \lim_{N \to \infty} E \left[ \sum_{k=mK}^{N-1} g(x_k, \mu_k(x_k)) \right]

Recall that if a = b + c, then b - |c| \le a \le b + |c|. It follows that

-\rho^K \max_i |J_0(i)| - \frac{M \rho^K}{1 - \rho} + J_\pi(x_0)
\;\le\; E \left[ J_0(x_{mK}) + \sum_{k=0}^{mK-1} g(x_k, \mu_k(x_k)) \right]
\;\le\; \rho^K \max_i |J_0(i)| + \frac{M \rho^K}{1 - \rho} + J_\pi(x_0)

A4) We take the minimum over all policies; the middle term is then exactly our DP recursion of part A after mK steps. Taking limits, we get

\lim_{K \to \infty} J_{mK}(x_0) = J^*(x_0)

Now we are almost done. Since the cost incurred within one block of m stages is bounded,

|J_{mK + l}(x_0) - J_{mK}(x_0)| \le \rho^K M \quad \text{for } 0 \le l \le m,

we have

\lim_{k \to \infty} J_k(x_0) = J^*(x_0)


    Summary

    A1: bound the tail over all policies

    A2: bound contribution from initial condition J0(i), over all policies

    A3: sandwich type of bounds, middle term is DP recursion

    A4: optimized over all policies, took limit.

2.11.3 Proof of B

Prove that the optimal cost satisfies Bellman's equation. Recall the recursion

J_{k+1}(i) = \min_{u \in U(i)} \left[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J_k(j) \right]

In Part A, we showed that J_k(\cdot) \to J^*(\cdot); just take limits on both sides. To prove uniqueness, just use a solution of the Bellman equation as the initial condition of the DP iteration.

    2.12 Summary of previous lecture

Dynamics:

x_{k+1} = w_k

P(w_k = j \mid x_k = i, u_k = u) = p_{ij}(u), \quad u \in U(i), \text{ a finite set}

x_k \in S, \text{ a finite set}, \quad S = \{1, 2, \ldots, n, t\}

p_{tt}(u) = 1 \quad \text{for all } u \in U(t)

Cost: Given \pi = \{\mu_0, \mu_1, \ldots\},

J_\pi(i) = \lim_{N \to \infty} E \left[ \sum_{k=0}^{N-1} g(x_k, \mu_k(x_k)) \,\middle|\, x_0 = i \right], \quad i \in S

g(t, u) = 0 \quad \text{for all } u \in U(t)

J^*(i) = \min_\pi J_\pi(i) \quad \text{(optimal cost)}

Note: J_\pi(t) = 0 and J^*(t) = 0.


Result:

A) Given any initial conditions J_0(1), \ldots, J_0(n), the sequence

J_{k+1}(i) = \min_{u \in U(i)} \left[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J_k(j) \right], \quad i \in S \setminus \{t\} = \{1, \ldots, n\}

converges to J^*(i).

Note: there is a bit of a short-cut here; we can also include the terminal state t, provided we pick J_0(t) = 0. This does not change the equations.

B)

J^*(i) = \min_{u \in U(i)} \left[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J^*(j) \right], \quad i \in S \setminus \{t\}

This is Bellman's Equation. It also gives the optimal policy, which is in fact stationary.

2.13 How do we solve Bellman's Equation?

2.13.1 Method 1: Value Iteration (VI)

Use the DP recursion of result A:

J_{k+1}(i) = \min_{u \in U(i)} \left[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J_k(j) \right], \quad i \in S \setminus \{t\}

until it converges. J_0(i) can be set to a guess; if the guess is good, it will speed up convergence. How do we know that we are close to converging? Exploit problem structure to get bounds, see [1].
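A minimal value iteration sketch under an assumed array layout (g[i, u] are the stage costs and P[u] is an n x n matrix of transition probabilities among the non-terminal states, rows possibly summing to less than 1 because the terminal state is left implicit; these names are illustrative, not from the script):

import numpy as np

def value_iteration(g, P, tol=1e-9, max_iter=10_000):
    n, m = g.shape                       # n states, m controls
    J = np.zeros(n)                      # initial guess J_0
    for _ in range(max_iter):
        # Q[i, u] = g(i, u) + sum_j p_ij(u) J(j)
        Q = np.stack([g[:, u] + P[u] @ J for u in range(m)], axis=1)
        J_new = Q.min(axis=1)
        if np.max(np.abs(J_new - J)) < tol:
            J = J_new
            break
        J = J_new
    policy = Q.argmin(axis=1)            # greedy policy for the final J
    return J, policy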

    2.13.2 Method 2: Policy Iteration (PI)

Iterate over the policy instead of the values. We need the following result:

C) For any stationary policy \mu, the costs J_\mu(i) are the unique solutions of

J_\mu(i) = g(i, \mu(i)) + \sum_{j=1}^{n} p_{ij}(\mu(i)) J_\mu(j), \quad i \in S \setminus \{t\}

Furthermore, given any initial conditions J_0(i), the sequence

J_{k+1}(i) = g(i, \mu(i)) + \sum_{j=1}^{n} p_{ij}(\mu(i)) J_k(j)

converges to J_\mu(i) for each i.

Proof: trivial. Consider the problem where the only allowable control at state i is \mu(i), and apply parts A and B. This is a special case of the general theorem.

Algorithm for PI. From now on, i \in S \setminus \{t\} = \{1, 2, \ldots, n\}.

Stage 1: Given \mu^k (the stationary policy at iteration k, not the policy at time k), solve for the costs J_{\mu^k}(i) by solving

J(i) = g(i, \mu^k(i)) + \sum_{j=1}^{n} p_{ij}(\mu^k(i)) J(j) \quad \text{for all } i:

n equations, n unknowns (Result C).

Stage 2: Improve the policy:

\mu^{k+1}(i) = \arg\min_{u \in U(i)} \left[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J_{\mu^k}(j) \right] \quad \text{for all } i

Iterate; quit when J_{\mu^{k+1}}(i) = J_{\mu^k}(i) for all i.
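A policy iteration sketch with the same assumed data layout as the value iteration sketch above (terminating when the policy stops changing, which here is equivalent to the costs stopping to change):

import numpy as np

def policy_iteration(g, P):
    n, m = g.shape
    mu = np.zeros(n, dtype=int)                 # initial stationary policy
    while True:
        # Stage 1: policy evaluation, solve (I - P_mu) J = g_mu (n equations)
        P_mu = np.array([P[mu[i]][i] for i in range(n)])
        g_mu = g[np.arange(n), mu]
        J = np.linalg.solve(np.eye(n) - P_mu, g_mu)
        # Stage 2: policy improvement
        Q = np.stack([g[:, u] + P[u] @ J for u in range(m)], axis=1)
        mu_new = Q.argmin(axis=1)
        if np.array_equal(mu_new, mu):
            return J, mu
        mu = mu_new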

Theorem: The above terminates after a finite number of steps, and converges to the optimal policy.

Proof (two steps):

1) We first show that J_{\mu^k}(i) \ge J_{\mu^{k+1}}(i) for all i, k.
2) We then show that what we converge to satisfies Bellman's Equation.

1) For fixed k, consider the following recursion in N:

J_{N+1}(i) = g(i, \mu^{k+1}(i)) + \sum_{j=1}^{n} p_{ij}(\mu^{k+1}(i)) J_N(j) \quad \text{for all } i, \qquad J_0(i) = J_{\mu^k}(i)

By result C), J_N \to J_{\mu^{k+1}} as N \to \infty.

J_0(i) = g(i, \mu^k(i)) + \sum_j p_{ij}(\mu^k(i)) J_0(j)
       \ge g(i, \mu^{k+1}(i)) + \sum_j p_{ij}(\mu^{k+1}(i)) J_0(j) = J_1(i)

(the inequality holds because \mu^{k+1}(i) is the minimizer in Stage 2). Next,

J_1(i) \ge g(i, \mu^{k+1}(i)) + \sum_j p_{ij}(\mu^{k+1}(i)) J_1(j) = J_2(i)

since J_1(i) \le J_0(i). Keep going, and get

J_0(i) \ge J_1(i) \ge \ldots \ge J_N(i) \ge \ldots

Taking the limit,

J_{\mu^k}(i) \ge J_{\mu^{k+1}}(i) \quad \text{for all } i

Since the number of stationary policies is finite, we will eventually have J_{\mu^k}(i) = J_{\mu^{k+1}}(i) for all i, for some finite k.

2) It follows from Stage 2 that, once converged,

J_{\mu^{k+1}}(i) = J_{\mu^k}(i) = \min_{u \in U(i)} \left[ g(i, u) + \sum_j p_{ij}(u) J_{\mu^k}(j) \right]

but this is Bellman's Equation! We have converged to the optimal policy.

Discussion: Complexity

Stage 1: a linear system of equations of size n, complexity O(n^3).

Stage 2: n minimizations with p choices each (p different values of u that can be used), complexity O(p\, n^2).

Put together: O(n^2 (n + p)).

Worst case number of iterations: a search over all policies, p^n. But in practice, it converges very quickly.

Why does Policy Iteration converge so quickly relative to Value Iteration?


Rewrite Value Iteration:

Stage 2: \mu^k(i) = \arg\min_{u \in U(i)} \left[ g(i, u) + \sum_j p_{ij}(u) J_k(j) \right]

Stage 1: J_{k+1}(i) = g(i, \mu^k(i)) + \sum_j p_{ij}(\mu^k(i)) J_k(j)

and iterate.

    2.13.3 Third Method: Linear Programming

Recall Bellman's Equation:

J^*(i) = \min_{u \in U(i)} \left[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J^*(j) \right], \quad i = 1, \ldots, n

and Value Iteration:

J_{k+1}(i) = \min_{u \in U(i)} \left[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J_k(j) \right], \quad i = 1, \ldots, n

We showed that Value Iteration (V.I.) converges to the optimal cost-to-go J^* for all initial guesses J_0.

Assume we start V.I. with any J_0 that satisfies

J_0(i) \le \min_{u \in U(i)} \left[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J_0(j) \right], \quad i = 1, \ldots, n

It follows that J_1(i) \ge J_0(i) for all i. Then

J_1(i) \le \min_{u \in U(i)} \left[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J_1(j) \right], \quad i = 1, \ldots, n \;\Rightarrow\; J_2(i) \ge J_1(i) \text{ for all } i

In general:

J_{k+1}(i) \ge J_k(i) \text{ for all } i, k; \qquad J_k \to J^* \;\Rightarrow\; J_0(i) \le J^*(i) \text{ for all } i

Now let \hat J solve the following problem:

\max \sum_i J(i) \quad \text{subject to} \quad J(i) \le g(i, u) + \sum_j p_{ij}(u) J(j) \quad \text{for all } i \text{ and all } u \in U(i)

It is clear that \hat J(i) \le J^*(i) for all i, by the previous analysis. Since J^* satisfies the constraints, it follows that J^* achieves the maximum. This is a Linear Program!
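A sketch of this LP formulation using SciPy's generic LP solver (same assumed data layout as the earlier sketches; linprog minimizes, so the objective is negated):

import numpy as np
from scipy.optimize import linprog

def bellman_lp(g, P):
    # maximize sum_i J(i) subject to J(i) <= g(i,u) + sum_j p_ij(u) J(j)
    n, m = g.shape
    A_ub, b_ub = [], []
    for i in range(n):
        for u in range(m):
            row = -P[u][i].copy()        # -sum_j p_ij(u) J(j)
            row[i] += 1.0                #  + J(i)   <= g(i, u)
            A_ub.append(row)
            b_ub.append(g[i, u])
    res = linprog(c=-np.ones(n), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n)   # J is unbounded in sign
    return res.x                         # J*(1), ..., J*(n)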

    2.13.4 Analogies and Connections

Say I want to solve

J = G + PJ, \quad J \in \mathbb{R}^n, \; G \in \mathbb{R}^n, \; P \in \mathbb{R}^{n \times n}

The direct way to solve it:

(I - P)J = G \;\Rightarrow\; J = (I - P)^{-1} G

This is exactly what we do in Stage 1 of policy iteration: solve for the cost associated with a specific policy.

Why is (I - P)^{-1} guaranteed to exist? For a given policy, let \tilde P \in \mathbb{R}^{(n+1) \times (n+1)} be the probability matrix that captures our Markov Chain:

\tilde P = \begin{pmatrix} P & p_t \\ 0 & 1 \end{pmatrix}

where P contains the transition probabilities among the states 1, \ldots, n (p_{ij}: probability that the next state is j given that the current state is i) and p_t the transition probabilities into the terminal state.

Facts:

- \tilde P is a right stochastic matrix: all rows sum up to 1, all elements are \ge 0.
- Perron-Frobenius Theorem: the eigenvalues of \tilde P have absolute value \le 1, with at least one eigenvalue equal to 1.
- Assumption on the terminal state: P^N \to 0 as N \to \infty. We will eventually reach the termination state!

Therefore:

- the eigenvalues of P have absolute value < 1,
- (I - P)^{-1} exists.

Furthermore:

(I - P)^{-1} = I + P + P^2 + \ldots

Proof:

(I - P)(I + P + P^2 + \ldots) = I + (P + P^2 + \ldots) - (P + P^2 + \ldots) = I

Therefore one way to solve for J is as follows:

J_1 = G + P J_0
J_2 = G + P J_1 = G + P G + P^2 J_0
\vdots
J_N = (I + P + \ldots + P^{N-1}) G + P^N J_0

and J_N \to (I - P)^{-1} G as N \to \infty!

Analogy:

Value Iteration: one step of this update.
Policy Iteration: an infinite number of updates, i.e. solving the system of equations exactly.

What is truly amazing is that various combinations of policy iteration and value iteration all converge to the solution of Bellman's Equation.

Recall value iteration:

J_{k+1}(i) = \min_{u \in U(i)} \left[ g(i, u) + \sum_j p_{ij}(u) J_k(j) \right], \quad i = 1, \ldots, n

In practice, you would implement it as follows:

\hat J(i) \leftarrow \min_{u \in U(i)} \left[ g(i, u) + \sum_j p_{ij}(u) J(j) \right], \quad i = 1, \ldots, n
J(i) \leftarrow \hat J(i), \quad i = 1, \ldots, n

You don't have to do this! You can also update in place:

J(i) \leftarrow \min_{u \in U(i)} \left[ g(i, u) + \sum_j p_{ij}(u) J(j) \right], \quad i = 1, \ldots, n

This is a Gauss-Seidel update: a generic technique for solving iterative equations.

It gets even better: Asynchronous Policy Iteration.

- Any number of value updates in between policy updates.
- Any number of states updated at each value update.
- Any number of states updated at each policy update.

Under some mild assumptions, all converge to J^*.
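The in-place (Gauss-Seidel) variant described above differs from the two-array implementation by a single line: each state update immediately reuses the freshest values of J. A sketch, with the same assumed data layout as before:

import numpy as np

def gauss_seidel_value_iteration(g, P, sweeps=100):
    n, m = g.shape
    J = np.zeros(n)
    for _ in range(sweeps):
        for i in range(n):
            # update J(i) in place, using already-updated entries of J
            J[i] = min(g[i, u] + P[u][i] @ J for u in range(m))
    return J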

    2.14 Discounted Problems

J_\pi(i) = \lim_{N \to \infty} E \left[ \sum_{k=0}^{N-1} \alpha^k g(x_k, \mu_k(x_k)) \,\middle|\, x_0 = i \right], \quad \alpha < 1, \; i \in \{1, \ldots, n\}

No explicit termination state is required, and no assumption on the transition probabilities is required.

Bellman's Equation for this problem:

J^*(i) = \min_{u \in U(i)} \left[ g(i, u) + \alpha \sum_{j=1}^{n} p_{ij}(u) J^*(j) \right] \quad \text{for all } i

How do we show this?

Define an associated problem with states \{1, 2, \ldots, n, t\}. From state i \ne t, when u is applied in the new problem the cost is g(i, u); the next state is j with probability \alpha\, p_{ij}(u), and t with probability 1 - \alpha (since \sum_j p_{ij}(u) = 1).

Clearly, since \alpha < 1, we have a non-zero probability of making it to state t, therefore our assumption on reaching the termination state is satisfied. Suppose we use the same policy in the discounted problem as in the associated problem. Note that

P\left( x_{k+1} = j \,\middle|\, x_k = i, x_{k+1} \ne t, u \right) = \frac{\alpha\, p_{ij}(u)}{\alpha\, p_{i1}(u) + \ldots + \alpha\, p_{in}(u)} = \frac{\alpha\, p_{ij}(u)}{\alpha} = p_{ij}(u)

So as long as we have not reached the termination state, the state evolution is governed by the same probabilities. The expected cost of the kth stage of the associated problem is g(x_k, \mu_k(x_k)) times the probability that t has not been reached by stage k, which is \alpha^k; therefore we recover \alpha^k g(x_k, \mu_k(x_k)).

Connections: for a given policy, we have

\tilde P_\alpha = \begin{pmatrix} \alpha P & (1 - \alpha)\mathbf{1} \\ 0 & 1 \end{pmatrix}

and it is clear that (\alpha P)^N = \alpha^N P^N \to 0.


    Chapter 3

Continuous Time Optimal Control

Consider the following system:

\dot x(t) = f(x(t), u(t)), \quad 0 \le t \le T, \qquad x(0) = x_0 \quad \text{(no noise!)}

State x(t) \in \mathbb{R}^n

Time t \in \mathbb{R}; T is the terminal time

Control u(t) \in U \subseteq \mathbb{R}^m, with U the control constraint set

Assume:

- f is continuously differentiable with respect to x (a less stringent requirement: Lipschitz)
- f is continuous with respect to u
- u(t) is piecewise continuous

See Appendix A in [1] for details.

Assume existence and uniqueness of solutions. This cannot be taken for granted:

Example 8:

\dot x(t) = x(t)^{1/3}, \quad x(0) = 0

Solutions: x(t) = 0 for all t, and x(t) = \left( \tfrac{2}{3} t \right)^{3/2}. Not unique!


Example 9:

\dot x(t) = x(t)^2, \quad x(0) = 1

Solution: x(t) = \frac{1}{1 - t}, finite escape time, x(1) = \infty. The solution does not exist on an interval that includes 1, e.g. [0, 2].

Objective: Minimize h(x(T)) + \int_0^T g(x(t), u(t))\, dt, where g and h are continuously differentiable with respect to x, and g is continuous with respect to u.

This is very similar to the discrete time problem (integrals replace sums, and \dot x = f(x, u) replaces x_{k+1} = f_k(x_k, u_k)), except for the technical assumptions.

    3.1 The Hamilton Jacobi Bellman (HJB) Equation

The HJB equation is the continuous time analog of the DP algorithm. Derive it informally by discretizing the problem and taking limits. This is not a rigorous derivation, but it does capture the main ideas.

Divide the time horizon into N pieces, define \delta = T/N,

x_k := x(k\delta), \quad u_k := u(k\delta), \quad k = 0, 1, \ldots, N

Approximate the differential equation by

\dot x(k\delta) = f(x(k\delta), u(k\delta)) \approx \frac{x_{k+1} - x_k}{\delta}, \qquad x_{k+1} = x_k + f(x_k, u_k)\, \delta

Approximate the cost function:

h(x_N) + \sum_{k=0}^{N-1} g(x_k, u_k)\, \delta

Define J^*(t, x) = optimal cost-to-go at time t and state x for the continuous problem.
Define \tilde J^*(t, x) = discrete approximation of the optimal cost-to-go.
Apply the DP algorithm:

terminal condition: \tilde J^*(N\delta, x) = h(x)

recursion: \tilde J^*(k\delta, x) = \min_{u \in U} \left[ g(x, u)\, \delta + \tilde J^*\left( (k+1)\delta, x + f(x, u)\, \delta \right) \right], \quad k = 0, \ldots, N-1


Do a Taylor expansion of \tilde J^*, because \delta \to 0:

\tilde J^*\left( (k+1)\delta, x + f(x, u)\delta \right) = \tilde J^*(k\delta, x) + \frac{\partial \tilde J^*(k\delta, x)}{\partial t}\,\delta + \left( \frac{\partial \tilde J^*(k\delta, x)}{\partial x} \right)^T f(x, u)\,\delta + o(\delta), \qquad \lim_{\delta \to 0} \frac{o(\delta)}{\delta} = 0

(little-oh notation: o(\delta) collects the terms that are quadratic or higher in \delta).

Substitute back into the DP recursion and divide by \delta:

0 = \min_{u \in U} \left[ g(x, u) + \frac{\partial \tilde J^*(k\delta, x)}{\partial t} + \left( \frac{\partial \tilde J^*(k\delta, x)}{\partial x} \right)^T f(x, u) + \frac{o(\delta)}{\delta} \right]

Now let t = k\delta, and let \delta \to 0. Assuming \tilde J^* \to J^*, we have

0 = \min_{u \in U} \left[ g(x, u) + \frac{\partial J^*(t, x)}{\partial t} + \left( \frac{\partial J^*(t, x)}{\partial x} \right)^T f(x, u) \right] \quad \text{for all } x, t \qquad (3.1)

J^*(T, x) = h(x)

The HJB equation (3.1):

- is a Partial Differential Equation, very difficult to solve;
- the u = \mu^*(t, x) that minimizes the R.H.S. of the HJB equation is the optimal policy.

Example 10: Consider the system

\dot x(t) = u(t), \quad |u(t)| \le 1

The cost is \frac{1}{2} x^2(T), i.e. only a terminal cost.

Intuitive solution: drive x towards 0,

u(t) = \mu^*(t, x) = -\mathrm{sgn}(x) = \begin{cases} -1 & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ 1 & \text{if } x < 0 \end{cases}

What is the cost-to-go associated with this policy?

V(t, x) = \frac{1}{2} \left( \max\{0, |x| - (T - t)\} \right)^2

Verify that this is indeed the cost-to-go associated with the policy outlined above.

For fixed t:


[Figure 3.1: Example 10 for fixed t: V(t, x) as a function of x, zero for |x| \le T - t and quadratic outside.]

[Figure 3.2: First derivative \partial V(t, x)/\partial x as a function of x for fixed t.]

[Figure 3.3: Example 10 for fixed x: \partial V(t, x)/\partial t as a function of t, for the cases |x| \ge T and |x| < T.]

\frac{\partial V}{\partial x}(t, x) = \mathrm{sgn}(x) \max\{0, |x| - (T - t)\}


For fixed x, \partial V/\partial t is shown in Figure 3.3.

Does V(t, x) satisfy the HJB equation?

First check: does it satisfy the boundary condition? V(T, x) = \frac{1}{2} x^2 = h(x). Yes.

Second check:

\min_{|u| \le 1} \left[ \frac{\partial V(t, x)}{\partial t} + \frac{\partial V(t, x)}{\partial x} f(x, u) \right]
= \min_{|u| \le 1} \left( 1 + \mathrm{sgn}(x)\, u \right) \max\{0, |x| - (T - t)\}
= 0 \quad \text{by choosing } u = -\mathrm{sgn}(x)

So V(t, x) satisfies the HJB equation, and V(t, x) = J^*(t, x). Furthermore u = -\mathrm{sgn}(x) is an optimal policy. It is not unique!

Note: Verifying that V(t, x) satisfies HJB is not trivial, even for this simple example. Imagine solving for it!

Another issue: the terminal cost \frac{1}{2} x^2(T) gives the same optimal policy as the cost |x(T)|, so different costs give the same optimal policy, but some costs are nicer to work with than others.

    3.2 Aside on Notation

Let F(t, x) be a continuously differentiable function. Then

1. \frac{\partial F(t, x)}{\partial t}: partial derivative of F with respect to the first argument.

2. \frac{\partial F(t, x(t))}{\partial t} = \frac{\partial F(t, x)}{\partial t} \Big|_{x = x(t)}: shorthand notation.

3. \frac{d F(t, x(t))}{d t} = \frac{\partial F(t, x(t))}{\partial t} + \frac{\partial F(t, x(t))}{\partial x}\, \dot x(t): total derivative.

Example 11: For F(t, x) = t\, x,

\frac{\partial F(t, x)}{\partial t} = x, \qquad \frac{\partial F(t, x(t))}{\partial t} = x(t), \qquad \frac{d F(t, x(t))}{d t} = x(t) + t\, \dot x(t)

Lemma 3.2.1: Let F(t, x, u) be a continuously differentiable function and let U be a convex set. Assume that \mu^*(t, x) := \arg\min_{u \in U} F(t, x, u) is continuously differentiable.


Then:

1) \frac{\partial}{\partial t} \min_{u \in U} F(t, x, u) = \frac{\partial F(t, x, \mu^*(t, x))}{\partial t} \quad \text{for all } t, x

2) \frac{\partial}{\partial x} \min_{u \in U} F(t, x, u) = \frac{\partial F(t, x, \mu^*(t, x))}{\partial x} \quad \text{for all } t, x

Example 12: Let F(t, x, u) = (1 + t)u^2 + ux + 1, t \ge 0, where U is the real line (no constraint on u). Then

\min_u F(t, x, u): \quad 2(1 + t)u + x = 0 \;\Rightarrow\; u = -\frac{x}{2(1 + t)}, \qquad \mu^*(t, x) = -\frac{x}{2(1 + t)}

\min_u F(t, x, u) = \frac{(1 + t)x^2}{4(1 + t)^2} - \frac{x^2}{2(1 + t)} + 1 = -\frac{x^2}{4(1 + t)} + 1

1) \frac{\partial}{\partial t} \min_u F(t, x, u) = \frac{x^2}{4(1 + t)^2}; \qquad \frac{\partial F(t, x, \mu^*(t, x))}{\partial t} = u^2 \Big|_{u = \mu^*(t, x)} = \frac{x^2}{4(1 + t)^2}

2) \frac{\partial}{\partial x} \min_u F(t, x, u) = -\frac{x}{2(1 + t)}; \qquad \frac{\partial F(t, x, \mu^*(t, x))}{\partial x} = u \Big|_{u = \mu^*(t, x)} = -\frac{x}{2(1 + t)}

Proof 2 (of Lemma 3.2.1 when u is unconstrained, U = \mathbb{R}^m): Let

G(t, x) = \min_{u \in U} F(t, x, u) = F(t, x, \mu^*(t, x))

Then

\frac{\partial G(t, x)}{\partial t} = \frac{\partial F(t, x, \mu^*(t, x))}{\partial t} + \underbrace{\frac{\partial F(t, x, \mu^*(t, x))}{\partial u}}_{= 0 \text{ because } \mu^*(t, x) \text{ minimizes}} \frac{\partial \mu^*(t, x)}{\partial t}

The same can be done for \frac{\partial G(t, x)}{\partial x}.


    3.2.1 The Minimum Principle

The HJB equation gives us a lot of information: the optimal cost-to-go for all times and all possible states; it also gives the optimal feedback law u = \mu^*(t, x). What if we only cared about the optimal control trajectory for a specific initial condition x(0) = x_0? Can we exploit the fact that we are asking for much less to simplify the mathematical conditions?

Starting Point: HJB

0 = \min_{u \in U} \left[ g(x, u) + \frac{\partial J^*(t, x)}{\partial t} + \left( \frac{\partial J^*(t, x)}{\partial x} \right)^T f(x, u) \right] \quad \text{for all } t, x

J^*(T, x) = h(x) \quad \text{for all } x

Let \mu^*(t, x) be the corresponding optimal strategy (feedback law). Let

F(t, x, u) = g(x, u) + \frac{\partial J^*(t, x)}{\partial t} + \left( \frac{\partial J^*(t, x)}{\partial x} \right)^T f(x, u)

So the HJB equation gives us

G(t, x) = \min_{u \in U} F(t, x, u) = 0

Apply the Lemma:

1) \frac{\partial G(t, x)}{\partial t} = 0 = \frac{\partial^2 J^*(t, x)}{\partial t^2} + \left( \frac{\partial^2 J^*(t, x)}{\partial x \partial t} \right)^T f\left( x, \mu^*(t, x) \right) \quad \text{for all } t, x

2) \frac{\partial G(t, x)}{\partial x} = 0 = \frac{\partial g\left( x, \mu^*(t, x) \right)}{\partial x} + \frac{\partial^2 J^*(t, x)}{\partial x \partial t} + \frac{\partial^2 J^*(t, x)}{\partial x^2} f\left( x, \mu^*(t, x) \right) + \left( \frac{\partial f\left( x, \mu^*(t, x) \right)}{\partial x} \right)^T \frac{\partial J^*(t, x)}{\partial x} \quad \text{for all } t, x

Consider a specific optimal trajectory:

u^*(t) = \mu^*\left( t, x^*(t) \right), \qquad \dot x^*(t) = f\left( x^*(t), u^*(t) \right), \qquad x^*(0) = x_0

Along this trajectory, the two equations above become

1) \; 0 = \frac{d}{dt} \left[ \frac{\partial J^*(t, x^*(t))}{\partial t} \right]

2) \; 0 = \frac{\partial g\left( x^*(t), u^*(t) \right)}{\partial x} + \frac{d}{dt} \left[ \frac{\partial J^*(t, x^*(t))}{\partial x} \right] + \left( \frac{\partial f\left( x^*(t), u^*(t) \right)}{\partial x} \right)^T \frac{\partial J^*(t, x^*(t))}{\partial x}


Define

p(t) = \frac{\partial J^*(t, x^*(t))}{\partial x}, \qquad p_0(t) = \frac{\partial J^*(t, x^*(t))}{\partial t}

1) \dot p_0(t) = 0, so p_0(t) = \text{constant for } 0 \le t \le T.

2) \dot p(t) = -\left( \frac{\partial f\left( x^*(t), u^*(t) \right)}{\partial x} \right)^T p(t) - \frac{\partial g\left( x^*(t), u^*(t) \right)}{\partial x}, \quad 0 \le t \le T

Since \frac{\partial J^*(T, x)}{\partial x} = \frac{\partial h(x)}{\partial x}, we get p(T) = \frac{\partial h\left( x^*(T) \right)}{\partial x}.

Put all of this together. Define the Hamiltonian H(x, u, p) = g(x, u) + p^T f(x, u). Let u^*(t) be an optimal control trajectory and x^*(t) the resulting state trajectory. Then

\dot x^*(t) = \frac{\partial H}{\partial p}\left( x^*(t), u^*(t), p(t) \right), \qquad x^*(0) = x_0

\dot p(t) = -\frac{\partial H}{\partial x}\left( x^*(t), u^*(t), p(t) \right), \qquad p(T) = \frac{\partial h\left( x^*(T) \right)}{\partial x}

u^*(t) = \arg\min_{u \in U} H\left( x^*(t), u, p(t) \right)

H\left( x^*(t), u^*(t), p(t) \right) = \text{constant for } t \in [0, T]

(H(\cdot) = constant follows from p_0(t) = constant.)

Some remarks:

- This is a set of 2n ODEs with split boundary conditions; it is not trivial to solve.
- These are necessary conditions, but not sufficient. There can be multiple solutions, and not all of them may be optimal.
- If f(x, u) is linear, U is convex, and h and g are convex, then the conditions are necessary and sufficient.

Example 13 (Resource Allocation): Some robots are sent to Mars to build habitats for a later exploration by humans.

x(t): number of reconfigurable robots, which can build either habitats or more robots.

x(0) is given: the number of robots that arrive on Mars.

\dot x(t) = u(t)\, x(t), \qquad x(0) = x_0
\dot y(t) = \left( 1 - u(t) \right) x(t), \qquad y(0) = 0
0 \le u(t) \le 1


Objective: given the terminal time T, find the control input u(t) that maximizes y(T), the number of habitats built.

Note: y(T) = \int_0^T \left( 1 - u(t) \right) x(t)\, dt.

Solution (we work directly with the maximization; the minimum principle applies with min replaced by max):

g(x, u) = (1 - u)\, x, \qquad f(x, u) = u\, x

H(x, u, p) = (1 - u)\, x + p\, u\, x

\dot p(t) = -\frac{\partial H\left( x^*(t), u^*(t), p(t) \right)}{\partial x} = -1 + u^*(t) - p(t)\, u^*(t)

p(T) = 0 \quad (\text{since } h(x) \equiv 0)

u^*(t) = \arg\max_{0 \le u \le 1} \left( x^*(t) + \left( p(t)\, x^*(t) - x^*(t) \right) u \right)

which gives

u^* = 0 \text{ if } p(t) < 1, \qquad u^* = 1 \text{ if } p(t) > 1

Since p(T) = 0, for t close to T we have u^*(t) = 0 and therefore \dot p(t) = -1. Therefore at time t = T - 1 we have p(t) = 1, and that is where the switch occurs:

\dot p(t) = -p(t), \quad 0 \le t \le T - 1, \qquad p(T - 1) = 1 \;\Rightarrow\; p(t) = \exp(-t)\exp(T - 1), \quad 0 \le t \le T - 1

Conclusion:

u^*(t) = \begin{cases} 1 & 0 \le t \le T - 1 \\ 0 & T - 1 \le t \le T \end{cases}

How to use this in practice:

1. If you can solve HJB, you get a feedback law u = \mu^*(x). Very convenient, just a controller: measure the state and apply the control input.
2. Solve for the optimal trajectory and use a feedback law (probably linear) to keep you on that trajectory.
3. Solve for the optimal trajectory online after measuring the state. Do this often.


[Figure 3.4: Different approaches to find a solution. The HJB equation yields optimal solutions (hard to show rigorously; viscosity solutions, calculus of variations); the Minimum Principle follows from HJB without too much difficulty (easy to show, in the book), but it also admits non-optimal solutions (local minima).]

    3.3 Extensions

(We drop the x^*(t), J^* notation for simplicity.)

3.3.1 Fixed Terminal State

Consider the case where x(T) is given. Clearly there is then no need for a terminal cost.

Recall the co-state p(t) = \frac{\partial J(t, x(t))}{\partial x} and p(T) = \lim_{t \to T} \frac{\partial J(t, x(t))}{\partial x}; we can no longer use h, the terminal cost, to constrain p(T). But we don't need constraints on p:

\dot x(t) = f(x(t), u(t)), \quad x(0) = x_0, \; x(T) = x_T \qquad \text{(2n ODEs)}

\dot p(t) = -\frac{\partial H}{\partial x}(x(t), u(t), p(t)) \qquad \text{(2n boundary conditions)}

Example 14:

\dot x(t) = u(t), \quad x(0) = 0, \; x(1) = 1

g(x, u) = \frac{1}{2}(x^2 + u^2), \qquad \text{cost} = \frac{1}{2} \int_0^1 \left( x^2(t) + u^2(t) \right) dt


Hamiltonian: H(x, u, p) = \frac{1}{2}(x^2 + u^2) + p\, u

We get

\dot x(t) = u(t)
\dot p(t) = -x(t)
u(t) = \arg\min_u \left[ \frac{1}{2}\left( x^2(t) + u^2 \right) + p(t)\, u \right]

therefore u(t) = -p(t), so \dot x(t) = -p(t), \dot p(t) = -x(t), and \ddot x(t) = x(t):

x(t) = A \cosh(t) + B \sinh(t)

x(0) = 0 \Rightarrow A = 0, \qquad x(1) = 1 \Rightarrow B = \frac{1}{\sinh(1)}

x(t) = \frac{\sinh(t)}{\sinh(1)} = \frac{e^t - e^{-t}}{e^1 - e^{-1}}

Exercise: show that the Hamiltonian is constant along this trajectory.
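The two-point boundary value problem can also be checked numerically. A sketch using SciPy's boundary value solver (the numerical check itself is my addition; the ODEs and the analytic solution sinh(t)/sinh(1) are from the example above):

import numpy as np
from scipy.integrate import solve_bvp

def odes(t, y):                 # y[0] = x, y[1] = p;  x' = -p, p' = -x
    return np.vstack((-y[1], -y[0]))

def bc(y0, y1):                 # boundary conditions x(0) = 0, x(1) = 1
    return np.array([y0[0] - 0.0, y1[0] - 1.0])

t = np.linspace(0, 1, 50)
sol = solve_bvp(odes, bc, t, np.zeros((2, t.size)))
assert np.allclose(sol.sol(t)[0], np.sinh(t) / np.sinh(1), atol=1e-3)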

    3.3.2 Free initial state, with cost

x(0) is not fixed, but there is an initial cost l(x(0)). One can show that the resulting condition is p(0) = -\frac{\partial l(x(0))}{\partial x}.

Example 15:

\dot x(t) = u(t), \quad x(1) = 1, \; x(0) \text{ is free}

g(x, u) = \frac{1}{2}(x^2 + u^2), \qquad l(x) = 0 \text{ (no initial cost, given)}

Apply the Minimum Principle as before:

\dot x(t) = u(t), \quad \dot p(t) = -x(t), \quad u(t) = -p(t) \;\Rightarrow\; \ddot x(t) = x(t), \quad x(t) = A \cosh(t) + B \sinh(t)

\dot x(0) = u(0) = -p(0) = 0 \Rightarrow B = 0

x(t) = \frac{\cosh(t)}{\cosh(1)} = \frac{e^t + e^{-t}}{e^1 + e^{-1}}, \qquad x(0) \approx 0.65


Free Terminal Time. Result: the Hamiltonian is = 0 on the optimal trajectory. We gain an extra degree of freedom in choosing T, and we lose a degree of freedom because H \equiv 0.

Time Varying System and Cost. What happens if f = f(x, u, t), g = g(x, u, t)? Result: everything stays the same, except that the Hamiltonian is no longer constant along the trajectory. Hint: augment the state with time, \dot \tau = 1 and \dot x = f(x, u, \tau).

Singular Problems. Motivate via an example.

Tracking problem: z(t) = 1 - t^2, 0 \le t \le 1. Minimize \frac{1}{2} \int_0^1 \left( x(t) - z(t) \right)^2 dt subject to |\dot x(t)| \le 1.

Apply the Minimum Principle:

\dot x(t) = u(t), \quad |u(t)| \le 1, \quad x(0), x(1) \text{ are free}

g(x, u, t) = \frac{1}{2}\left( x - z(t) \right)^2

H(x, u, p, t) = \frac{1}{2}\left( x - z(t) \right)^2 + p\, u

Co-state equation:

\dot p(t) = -\left( x(t) - z(t) \right), \quad p(0) = 0, \; p(1) = 0

Optimal u:

u(t) = \arg\min_{|u| \le 1} H(x(t), u, p(t), t)

u(t) = \begin{cases} -1 & \text{if } p(t) > 0 \\ +1 & \text{if } p(t) < 0 \\ ? & \text{if } p(t) = 0 \end{cases}

The problem is singular if the Hamiltonian is not a function of u over a non-trivial time interval.

Try the following:

p(t) = 0 \text{ for } 0 \le t \le T, \quad T \text{ to be determined}

Then

\dot p(t) = 0 \text{ for } 0 \le t \le T \;\Rightarrow\; x(t) = z(t) \text{ for } 0 \le t \le T \;\Rightarrow\; \dot x(t) = \dot z(t) = -2t = u(t)


    One guess: pick T = 1/2. This can't be the solution: for t > 1/2,

    x(t) - z(t) > 0  ⇒  ṗ < 0  ⇒  p(1) < 0,

    so the constraint p(1) = 0 can't be satisfied.

    Explore this instead: switch before T = 1/2.

    x(t) = z(t)                                             0 ≤ t ≤ T < 1/2
    ẋ(t) = -1  ⇒  x(t) = z(T) - (t - T) = 1 - T² - t + T    T < t ≤ 1

    ṗ(t) = -(x(t) - z(t)) = -(1 - T² - t + T - 1 + t²) = T² - T + t - t²    T < t ≤ 1

    p(1) = ∫_T^1 (T² - T + t - t²) dt
         = T² - T³ - T + T² - 1/3 + T³/3 + 1/2 - T²/2 = 0

    Simplifying (multiply by 6):

    0 = -4T³ + 9T² - 6T + 1 = (T - 1)(T - 1)(1 - 4T)  ⇒  T = 1 or T = 1/4

    T = 1/4 satisfies all the constraints and we are done!

    One can easily verify that p(t) > 0 for 1/4 < t < 1, giving u(t) = -1 as required.
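    A quick numerical confirmation (our addition, not part of the original notes) that T = 1/4 works: integrate ṗ from T to t in closed form and check the sign of p and the value of p(1).

        import numpy as np

        # Singular-arc check: with T = 1/4, p(t) = integral of (T^2 - T + s - s^2)
        # from T to t; we expect p(1) = 0 and p(t) > 0 on (T, 1), so u = -1 there.
        T = 0.25
        t = np.linspace(T, 1.0, 10001)
        p = (T**2 - T) * (t - T) + (t**2 - T**2) / 2.0 - (t**3 - T**3) / 3.0

        print(p[-1])                            # ~0: p(1) = 0 holds
        print(p[1:-1].min() > 0)                # True: p > 0 strictly inside (1/4, 1)
        print(np.polyval([-4, 9, -6, 1], T))    # 0: T = 1/4 solves -4T^3 + 9T^2 - 6T + 1 = 0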

    3.4 Linear Systems and Quadratic Costs

    Look at the infinite horizon, LTI (linear time invariant) system:

    x_{k+1} = A x_k + B u_k,   k = 0, 1, ...

    cost = ∑_{k=0}^{∞} ( x_k^T Q x_k + u_k^T R u_k ),   R > 0, Q ≥ 0, R = R^T, Q = Q^T

    Informally, the cost to go is time invariant: it only depends on the state and not on when we get there.

    J(x) = min_u [ x^T Q x + u^T R u + J(Ax + Bu) ]


    Conjecture that the optimal cost to go is quadratic in x: J(x) = x^T K x, where K = K^T, K ≥ 0. Then

    x^T K x = x^T Q x + x^T A^T K A x + min_u [ u^T R u + u^T B^T K B u + x^T A^T K B u + u^T B^T K A x ]

    Since R > 0 and B^T K B ≥ 0, we have R + B^T K B > 0. Setting the derivative with respect to u to zero:

    2 (R + B^T K B) u + 2 B^T K A x = 0
    u = -(R + B^T K B)^{-1} B^T K A x

    Substitute back in. All terms are of the form x^T ( · ) x, therefore we must have:

    K = Q + A^T K A + A^T K B (R + B^T K B)^{-1} (R + B^T K B) (R + B^T K B)^{-1} B^T K A
          - 2 A^T K B (R + B^T K B)^{-1} B^T K A

    K = A^T ( K - K B (R + B^T K B)^{-1} B^T K ) A + Q,    K ≥ 0

    Summary

    Optimal cost to go: J(x) = x^T K x

    Optimal feedback strategy: u = F x,   F = -(R + B^T K B)^{-1} B^T K A

    Questions

    1. Can we always solve for K?

    2. Is the closed loop system x_{k+1} = (A + BF) x_k stable?

    Example 16:

    x_{k+1} = 2 x_k + 0·u_k

    cost = ∑_{k=0}^{∞} ( x_k² + u_k² )

    A = 2, B = 0, Q = 1, R = 1

    Solve for K:

    K = 4 (K - 0) + 1  ⇒  -3K = 1  ⇒  K = -1/3


    K = -1/3 does not satisfy the K ≥ 0 constraint. There is no solution to this problem: the cost is infinite. The problem with this example is that (A, B) is not stabilizable.

    Stabilizable: one can find a matrix F such that A + BF is stable, i.e. ρ(A + BF) < 1 (the eigenvalues of A + BF have magnitude < 1).

    Example 17:

    x_{k+1} = 0.5 x_k + 0·u_k

    cost = ∑_{k=0}^{∞} ( x_k² + u_k² )

    A = 0.5, B = 0, Q = 1, R = 1

    Solve for K:

    K = 0.25 K + 1  ⇒  K = 4/3

    Cost to go = (4/3) x_k².  F = 0.

    Example 18:

    x_{k+1} = 2 x_k + u_k

    cost = ∑_{k=0}^{∞} ( x_k² + u_k² )

    A = 2, B = 1, Q = 1, R = 1

    (A, B) is stabilizable. Solve for K:

    K = 4 ( K - K²(1 + K)^{-1} ) + 1
    0 = K² - 4K - 1
    K = (4 ± √20)/2 = 2 ± √5

    Pick K = 2 + √5 ≈ 4.236 and solve for F:

    F = -(1 + K)^{-1} · 2K = -2K/(1 + K) ≈ -1.618

    A + BF = 2 - 1.618 = 0.382 is stable. Our optimizing strategy stabilizes the system as expected.
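    As a numerical cross-check (our addition, not part of the original notes), the scalar DARE of this example can be solved by simple fixed-point iteration, the same backward-iteration idea used later for the general case:

        import numpy as np

        # Example 18 check: iterate K <- A*(K - K*B*(R + B*K*B)^(-1)*B*K)*A + Q.
        A, B, Q, R = 2.0, 1.0, 1.0, 1.0
        K = 0.0
        for _ in range(200):
            K = A * (K - K * B * B * K / (R + B * K * B)) * A + Q

        F = -B * K * A / (R + B * K * B)
        print(K, 2 + np.sqrt(5))     # both ~4.236
        print(F)                     # ~-1.618
        print(abs(A + B * F) < 1)    # True: the closed loop is stable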


    Example 19:

    x_{k+1} = 2 x_k + u_k

    cost = ∑_{k=0}^{∞} u_k²

    A = 2, B = 1, Q = 0, R = 1

    Solve for K:

    K = 4 ( K - K²(1 + K)^{-1} )
    K = {0, 3}

    K = 0 is clearly the optimal thing to do, but it leads to an unstable system. K = 3, however, while not being optimal, leads to F = -(1 + 3)^{-1} · 3 · 2 = -1.5, A + BF = 0.5, which is stable.

    Modify the cost to ∑_{k=0}^{∞} ( u_k² + ε x_k² ),   ε > 0, ε ≪ 1.

    A = 2, B = 1, Q = ε, R = 1

    Solve for K:

    K(ε) = {3 + 4ε/3, -ε/3}   (to first order in ε)

    The K = 3 solution is the limiting case as we put an arbitrarily small cost on the state, which would otherwise grow unbounded.

    Example 20:

    x_{k+1} = 0.5 x_k + u_k

    cost = ∑_{k=0}^{∞} u_k²

    A = 0.5, B = 1, Q = 0, R = 1

    Solve for K:

    K = {0, -0.75}

    Here K = 0 makes perfect sense: it gives the optimal strategy u = 0 and a stable closed loop system.

    If Q = ε:

    K(ε) ≈ {4ε/3, -0.75 - ε/3}

    K = 0 is a well behaved solution for ε = 0.
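    The first-order expressions for K(ε) in Examples 19 and 20 are easy to check numerically (our addition, not part of the original notes). With B = 1 and R = 1 the scalar DARE K = A² K/(1 + K) + Q reduces to the quadratic K² + (1 - A² - Q)K - Q = 0:

        import numpy as np

        # Roots of the scalar DARE quadratic, compared with the first-order formulas.
        def dare_roots(A, Q):
            return np.sort(np.roots([1.0, 1.0 - A**2 - Q, -Q]))

        eps = 1e-3
        print(dare_roots(2.0, eps))                 # Example 19
        print(np.sort([3 + 4*eps/3, -eps/3]))       # ~[-eps/3, 3 + 4*eps/3]
        print(dare_roots(0.5, eps))                 # Example 20
        print(np.sort([4*eps/3, -0.75 - eps/3]))    # ~[-0.75 - eps/3, 4*eps/3]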


    Need Concept of Detectability   Let Q be decomposed as Q = C^T C (one can always do this). We need a detectability assumption: (A, C) is detectable if there exists an L such that A + LC is stable. Then C x_k → 0 implies x_k → 0, and C x_k → 0 is equivalent to x_k^T Q x_k → 0.

    3.4.1 Summary

    Given

    x_{k+1} = A x_k + B u_k,   k = 0, 1, ...

    cost = ∑_{k=0}^{∞} ( x_k^T Q x_k + u_k^T R u_k ),   Q ≥ 0, R > 0,

    with (A, B) stabilizable and (A, C) detectable, where C is any matrix that satisfies C^T C = Q. Then:

    1. There is a unique (positive semidefinite) solution K to the D.A.R.E.

    2. The optimal cost to go is J(x) = x^T K x.

    3. The optimal feedback strategy is u = F x.

    4. The closed loop system is stable.

    3.5 General Problem Formulation

    Finite horizon, time varying, with disturbances:

    x_{k+1} = A_k x_k + B_k u_k + w_k,   k = 0, ..., N - 1

    E(w_k) = 0,   E(w_k w_k^T) finite

    Cost:

    E[ x_N^T Q_N x_N + ∑_{k=0}^{N-1} ( x_k^T Q_k x_k + u_k^T R_k u_k ) ]

    Q_k = Q_k^T ≥ 0   (eigenvalues ≥ 0)
    R_k = R_k^T > 0


    Apply DP to solve the problem:

    J_N(x_N) = x_N^T Q_N x_N

    J_k(x_k) = min_{u_k} E[ x_k^T Q_k x_k + u_k^T R_k u_k + J_{k+1}(A_k x_k + B_k u_k + w_k) ]

    Let's do the first step of this recursion, or equivalently take N = 1:

    J_0(x_0) = min_{u_0} E[ x_0^T Q_0 x_0 + u_0^T R_0 u_0 + (A_0 x_0 + B_0 u_0 + w_0)^T Q_1 (A_0 x_0 + B_0 u_0 + w_0) ]

    Consider the last term:

    E[ (A_0 x_0 + B_0 u_0)^T Q_1 (A_0 x_0 + B_0 u_0) + 2 (A_0 x_0 + B_0 u_0)^T Q_1 w_0 + w_0^T Q_1 w_0 ]
      = (A_0 x_0 + B_0 u_0)^T Q_1 (A_0 x_0 + B_0 u_0) + E( w_0^T Q_1 w_0 ),

    since E(w_0) = 0. Therefore

    J_0(x_0) = min_{u_0} [ x_0^T Q_0 x_0 + u_0^T R_0 u_0 + (A_0 x_0 + B_0 u_0)^T Q_1 (A_0 x_0 + B_0 u_0) + E( w_0^T Q_1 w_0 ) ]

    The strategy is the same as if there were no noise (although the noise does give a different cost). This is certainty equivalence (it works only for some problems, not all).

    Solve for the minimizing u_0: differentiate and set to 0:

    2 R_0 u_0 + 2 B_0^T Q_1 B_0 u_0 + 2 B_0^T Q_1 A_0 x_0 = 0

    u_0 = -(R_0 + B_0^T Q_1 B_0)^{-1} B_0^T Q_1 A_0 x_0 =: F_0 x_0

    Optimal feedback strategy is a linear function of the state.

    Substitute back and solve for J_0(x_0):

    J_0(x_0) = x_0^T Q_0 x_0 + u_0^T (R_0 + B_0^T Q_1 B_0) u_0 + x_0^T A_0^T Q_1 A_0 x_0 + 2 x_0^T A_0^T Q_1 B_0 u_0 + E( w_0^T Q_1 w_0 )
             = x_0^T K_0 x_0 + E( w_0^T Q_1 w_0 )

    K_0 = Q_0 + A_0^T ( K_1 - K_1 B_0 (R_0 + B_0^T K_1 B_0)^{-1} B_0^T K_1 ) A_0
    K_1 = Q_1

    The cost at k = 1 is quadratic in x_1; at k = 0 it is quadratic in x_0 plus a constant.

    This can be extended to any horizon, giving the Discrete Riccati Equation (DRE):

    K_k = Q_k + A_k^T ( K_{k+1} - K_{k+1} B_k (R_k + B_k^T K_{k+1} B_k)^{-1} B_k^T K_{k+1} ) A_k
    K_N = Q_N

    Feedback law:

    u_k = F_k x_k,   F_k = -(R_k + B_k^T K_{k+1} B_k)^{-1} B_k^T K_{k+1} A_k


    Cost:

    J_k(x_k) = x_k^T K_k x_k + ∑_{j=k}^{N-1} E( w_j^T K_{j+1} w_j )

    With no noise, a time invariant system, and an infinite horizon (N → ∞), we recover the previous results and the DARE. In fact, the above iterative method is one way to solve the DARE: iterate backwards until it converges. ([1] has the proof of convergence; it is not trivial.)
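    A minimal sketch of that backward iteration (our addition; the 2-state system below is illustrative and not from the notes):

        import numpy as np

        # Solve the DARE by iterating K <- Q + A'(K - K B (R + B'K B)^(-1) B'K) A
        # until it converges, starting from the terminal cost K = Q.
        def dare_by_iteration(A, B, Q, R, iters=10000, tol=1e-12):
            K = Q.copy()
            for _ in range(iters):
                S = np.linalg.inv(R + B.T @ K @ B)
                K_next = Q + A.T @ (K - K @ B @ S @ B.T @ K) @ A
                if np.max(np.abs(K_next - K)) < tol:
                    return K_next
                K = K_next
            return K

        A = np.array([[1.2, 0.5], [0.0, 0.8]])   # illustrative stabilizable pair
        B = np.array([[0.0], [1.0]])
        Q = np.eye(2)                            # detectable since Q > 0
        R = np.array([[1.0]])

        K = dare_by_iteration(A, B, Q, R)
        F = -np.linalg.inv(R + B.T @ K @ B) @ B.T @ K @ A
        print(K)
        print(np.abs(np.linalg.eigvals(A + B @ F)))   # all < 1: closed loop stable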

    Time invariant, infinite horizon with noise: the cost goes to infinity. Approach: divide the cost by N and let N → ∞; then cost = E(w^T K w).

    Example 21: Given the system

    z̈(t) = u(t)

    Objective: apply a force to move a mass from any starting point to z = 0, ż = 0. Implement on a computer that can only update information once per second.

    1. Discretize the problem:

       ż(t) = ż(0) + u(0) t                            0 ≤ t < 1
       z(t) = z(0) + ż(0) t + (1/2) u(0) t²            0 ≤ t < 1

       Let x1(k) = z(k), x2(k) = ż(k). Then

       x(k + 1) = A x(k) + B u(k),   k = 0, 1, ...

       A = [1 1; 0 1],   B = [0.5; 1]

    2. Cost = ∑_{k=0}^{∞} ( x1²(k) + u²(k) ). Therefore

       Q = [1 0; 0 0],   R = 1

    3. Is the system stabilizable? Can we make A + BF stable for some F?

       Yes: F = [-1  -1.5] makes the eigenvalues of A + BF equal to 0.

    4. Q can be decomposed as follows:

       Q = [1 0; 0 0] = [1; 0] [1 0]


       Therefore

       C = [1 0],   Q = C^T C

       Is (A, C) detectable? Yes: L = [-2; -1] makes the eigenvalues of A + LC equal to 0.

    5. Solve the DARE. Use the MATLAB command dare (a SciPy version is sketched after this list).

       K = [2 1; 1 1.5]

       Optimal feedback matrix:

       F = [-0.5  -1.0]

    6. Physical interpretation: a spring and a damper. The spring has coefficient 0.5, the damper has coefficient 1.0.
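    The numbers above can be reproduced as follows (our addition; the notes use the MATLAB command dare, here we use the SciPy equivalent solve_discrete_are):

        import numpy as np
        from scipy.linalg import solve_discrete_are

        # Example 21: double integrator discretized with a 1 s update rate.
        A = np.array([[1.0, 1.0], [0.0, 1.0]])
        B = np.array([[0.5], [1.0]])
        Q = np.array([[1.0, 0.0], [0.0, 0.0]])
        R = np.array([[1.0]])

        K = solve_discrete_are(A, B, Q, R)
        F = -np.linalg.inv(R + B.T @ K @ B) @ B.T @ K @ A

        print(K)                                      # [[2, 1], [1, 1.5]]
        print(F)                                      # [[-0.5, -1.0]]
        print(np.abs(np.linalg.eigvals(A + B @ F)))   # both 0.5: closed loop stable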


    Bibliography

    [1] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, volume I. Athena Scientific, 3rd edition, 2005.