Dynamic Programming and Optimal Control Script



Dynamic Programming and Optimal Control

Script

Prof. Raffaello D'Andrea

Lecture notes

Dieter Baldinger, Thomas Mantel, Daniel Rohrer

HS 2010


    Contents

1 Introduction
  1.1 Class Objective
  1.2 Key Ingredients
  1.3 Open Loop versus Closed Loop Control
  1.4 Discrete State and Finite State Problem
  1.5 The Basic Problem

2 Dynamic Programming Algorithm
  2.1 Principle of Optimality
  2.2 The DPA
  2.3 Chess Match Strategy Revisited
  2.4 Converting non-standard problems to the Basic Problem
    2.4.1 Time Lags
    2.4.2 Correlated Disturbances
    2.4.3 Forecasts
  2.5 Deterministic, Finite State Systems
    2.5.1 Convert DP to Shortest Path Problem
    2.5.2 DP algorithm
    2.5.3 Forward DP algorithm
  2.6 Converting Shortest Path to DP
  2.7 Viterbi Algorithm
  2.8 Shortest Path Algorithms
    2.8.1 Label Correcting Methods
    2.8.2 A* Algorithm
  2.9 Multi-Objective Problems
    2.9.1 Extended Principle of Optimality
  2.10 Infinite Horizon Problems
  2.11 Stochastic, Shortest Path Problems
    2.11.1 Main Result
    2.11.2 Sketch of Proof
    2.11.3 Proof of B
  2.12 Summary of previous lecture
  2.13 How do we solve Bellman's Equation?
    2.13.1 Method 1: Value Iteration (VI)
    2.13.2 Method 2: Policy Iteration (PI)
    2.13.3 Third Method: Linear Programming
    2.13.4 Analogies and Connections
  2.14 Discounted Problems

3 Continuous Time Optimal Control
  3.1 The Hamilton Jacobi Bellman (HJB) Equation
  3.2 Aside on Notation
    3.2.1 The Minimum Principle
  3.3 Extensions
    3.3.1 Fixed Terminal State
    3.3.2 Free initial state, with cost
  3.4 Linear Systems and Quadratic Costs
    3.4.1 Summary
  3.5 General Problem Formulation

Bibliography


    Chapter 1

    Introduction

    1.1 Class Objective

The class objective is to make multiple decisions in stages to minimize a cost that captures undesirable outcomes.

    1.2 Key Ingredients

    1. Underlying discrete time system:

x_{k+1} = f_k(x_k, u_k, w_k), \quad k = 0, 1, \ldots, N-1

    k: discrete time index

    xk: state

    uk: control input, decision variable

    wk: disturbance or noise, random parameters

    N: time horizon

    fk: function, captures system evolution

2. Additive cost function:

g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k)

where g_N(x_N) is the terminal cost and the sum is the accumulated stage cost; g_k is a given nonlinear function.

The cost is a function of the control applied. Because the w_k are random, we typically consider the expected cost:

E_{w_k} \left[ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, u_k, w_k) \right]


    Example 1: Inventory Control:

Keeping an item stocked in a warehouse. Too little, and you run out (bad). Too much, and you pay for storage and misuse of capital (bad).

x_k: stock in the warehouse at the beginning of the kth time period

u_k: stock ordered and immediately delivered at the beginning of the kth time period

w_k: demand during the kth period, with some given probability distribution

    Dynamics:

x_{k+1} = x_k + u_k - w_k

    Excess demand is backlogged and corresponds to negative values.

Cost:

E \left[ R(x_N) + \sum_{k=0}^{N-1} \left( r(x_k) + c\, u_k \right) \right]

r(x_k): penalizes too much stock or negative stock (backlogged demand)

c\, u_k: cost of the items ordered

R(x_N): terminal cost from items at the end that can't be sold, or demand that can't be met

Objective: The objective is to minimize the cost subject to u_k \ge 0.

    1.3 Open Loop versus Closed Loop Control

Open Loop: Come up with the control inputs u_0, \ldots, u_{N-1} before k = 0. In Open Loop the objective is to calculate \{u_0, \ldots, u_{N-1}\}.

Closed Loop: Wait until time k to make the decision. This assumes x_k is measurable. Closed Loop will always give performance at least as good, but is computationally much more expensive. In Closed Loop the objective is to calculate the optimal rule u_k = \mu_k(x_k); \pi = \{\mu_0, \ldots, \mu_{N-1}\} is a policy or control law.


    Example 2:

\mu_k(x_k) = \begin{cases} s_k - x_k & \text{if } x_k < s_k \\ 0 & \text{otherwise} \end{cases}

    sk is some threshold.

    1.4 Discrete State and Finite State Problem

When the state x_k takes on discrete values or the state space is finite, it is often convenient to express the dynamics in terms of transition probabilities:

P_{ij}(u, k) := \mathrm{Prob}(x_{k+1} = j \mid x_k = i, u_k = u)

    i: start state

    j: possible future state

    u: control input

    k: time

This is equivalent to x_{k+1} = w_k, where w_k has the distribution

\mathrm{Prob}(w_k = j \mid x_k = i, u_k = u) := P_{ij}(u, k)

    Example 3: Optimizing Chess Playing Strategies:

Two game chess match with an opponent; the objective is to come up with a strategy that maximizes the chance to win.

Each game can have 2 outcomes:
a) Win by one player: 1 point for the winner, 0 points for the loser.
b) Draw: 0.5 points for each player.

If the score is tied 1-1 at the end of 2 games, the match goes into sudden death mode until someone wins.

Decision variable for the player: two playing styles:
1) Timid play: draw with probability p_d, lose with probability 1 - p_d.
2) Bold play: win with probability p_w, lose with probability 1 - p_w.
Assume p_d > p_w as a necessary condition for the problem to make sense.


Problem: What playing style should be chosen? Since it doesn't make sense to play Timid if we are tied 1-1 at the end of 2 games, it is a 2-stage finite problem.

Transition Probability Graph: The graphs below show all possible outcomes.

[Figure 1.1: First Game. (a) Timid Play and (b) Bold Play: transition probability graphs over the score pairs, with edges labeled p_d, 1 - p_d and p_w, 1 - p_w.]

[Figure 1.2: Second Game. (a) Timid Play and (b) Bold Play: transition probability graphs from the possible first-game scores, with edges labeled p_d, 1 - p_d and p_w, 1 - p_w.]

Closed Loop Strategy: Play timid iff the player is ahead.

The probability of winning is:

p_d\, p_w + p_w \left( (1 - p_d)\, p_w + p_w (1 - p_w) \right) = p_w^2 (2 - p_w) + p_w (1 - p_w)\, p_d

For \{p_w = 0.45, p_d = 0.9\} and \{p_w = 0.5, p_d = 1\} the probabilities to win are 0.54 and 0.625, respectively.

    Open Loop Strategy Possibilities:


[Figure 1.3: Closed Loop Strategy. Transition probability graph for the policy "timid iff ahead"; the terminal score pairs are marked WIN or LOSE.]

1) Timid in the first 2 games: p_d^2\, p_w
2) Bold in both: p_w^2 (3 - 2 p_w)
3) Bold in the first, timid in the second game: p_w p_d + p_w (1 - p_d) p_w
4) Timid in the first, bold in the second game: p_w p_d + p_w (1 - p_d) p_w

Clearly 1) is not the optimal OL strategy, because p_d^2\, p_w \le p_d\, p_w \le p_w p_d + \ldots. The best strategy yields

p_w^2 + p_w (1 - p_w) \max(2 p_w, p_d),

so the optimal OL strategy is 3) or 4) if p_d > 2 p_w, and 2) otherwise. It can be shown that if p_w \le 0.5, then the open loop probability of winning is \le 0.5.

    1.5 The Basic Problem

    Summarize basic problem setup:

x_{k+1} = f_k(x_k, u_k, w_k), \quad k = 0, 1, \ldots, N-1

x_k \in S_k: state space
u_k \in C_k: control space
w_k \in D_k: disturbance space

u_k \in U(x_k) \subseteq C_k: the control is constrained not only as a function of time, but also of the current state.

w_k \sim P(\cdot \mid x_k, u_k): the noise distribution can depend on the current state and the applied control.


Consider policies, or control laws,

\pi = \{\mu_0, \mu_1, \ldots, \mu_{N-1}\}

where \mu_k maps the state x_k into the control u_k = \mu_k(x_k), such that \mu_k(x_k) \in U(x_k) for all x_k \in S_k. The set of all such \pi is called the set of Admissible Policies, denoted \Pi.

Given a policy \pi, the expected cost of starting at state x_0 is

J_\pi(x_0) := E_{w_k} \left[ g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k, \mu_k(x_k), w_k) \right]

Optimal Policy \pi^*: J_{\pi^*}(x_0) \le J_\pi(x_0) for all \pi \in \Pi.  Optimal Cost: J^*(x_0) := J_{\pi^*}(x_0).


    Chapter 2

Dynamic Programming Algorithm

At the heart of the DP algorithm is the following very simple and intuitive idea.

    2.1 Principle of Optimality

Let \pi^* = \{\mu_0^*, \mu_1^*, \ldots, \mu_{N-1}^*\} be an optimal policy. Assume that in the process of using \pi^*, a state x_i occurs at time i. Consider the subproblem whereby at time i we are at state x_i and we want to minimize

E_{w_k} \left[ g_N(x_N) + \sum_{k=i}^{N-1} g_k(x_k, \mu_k(x_k), w_k) \right].

Then the truncated policy \{\mu_i^*, \mu_{i+1}^*, \ldots, \mu_{N-1}^*\} is optimal for this subproblem.

The proof is simple: prove by contradiction. If the truncated policy were not optimal, you could find a different policy that would give a lower cost for the subproblem. Applying that policy to the original problem from time i onwards would therefore also give a lower cost, which contradicts that \pi^* was an optimal policy.

    Example 4: Deterministic Scheduling Problem

We have 4 machines A, B, C, D that are used to make something.

A must occur before B, and C before D.

The solution is obtained by calculating the optimal cost for each node, beginning at the bottom of the tree. See Figure 2.1.


[Figure 2.1: Problem of Example 4 with the optimal cost-to-go for each node written above it (in circles); nodes are the partial schedules (I.C., A, C, AB, AC, CA, CD, ..., CDAB) and arcs carry the transition costs.]

    2.2 The DPA

For every initial state x_0, the optimal cost J^*(x_0) is equal to J_0(x_0), given by the last step of the following recursive algorithm, which proceeds backwards in time from N-1 to 0:

Initialization: J_N(x_N) = g_N(x_N) for all x_N \in S_N

Recursion:

J_k(x_k) = \min_{u_k \in U_k(x_k)} E_{w_k} \left[ g_k(x_k, u_k, w_k) + J_{k+1}\left( f_k(x_k, u_k, w_k) \right) \right]

where the expectation is taken with respect to P(\cdot \mid x_k, u_k).

Furthermore, if u_k^* = \mu_k^*(x_k) minimizes the recursion equation for each x_k and k, the policy \pi^* = \{\mu_0^*, \ldots, \mu_{N-1}^*\} is optimal.

    Comments

For each recursion step, we have to perform the optimization over all possible values x_k \in S_k, since we don't know a priori which states we will actually visit.

This pointwise optimization is what gives us \mu_k^*.
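When the state, control and disturbance spaces are finite, the recursion translates almost directly into code. The following is a minimal sketch, not from the script; the function names and the representation of the disturbance as a list of (value, probability) pairs are assumptions made for illustration.

import numpy as np

def dp_backward(states, controls, f, g, g_terminal, disturbances, N):
    # Backward DP for the basic problem (illustrative sketch):
    #   f(k, x, u, w)         -> next state
    #   g(k, x, u, w)         -> stage cost
    #   g_terminal(x)         -> terminal cost g_N
    #   controls(k, x)        -> admissible controls U_k(x)
    #   disturbances(k, x, u) -> list of (w, probability) pairs
    # Returns the cost-to-go J[k][x] and the optimal policy mu[k][x].
    J = {N: {x: g_terminal(x) for x in states}}
    mu = {}
    for k in range(N - 1, -1, -1):
        J[k], mu[k] = {}, {}
        for x in states:
            best_u, best_cost = None, np.inf
            for u in controls(k, x):
                # expected stage cost plus cost-to-go, expectation over w_k
                cost = sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                           for w, p in disturbances(k, x, u))
                if cost < best_cost:
                    best_u, best_cost = u, cost
            J[k][x], mu[k][x] = best_cost, best_u
    return J, mu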

    Proof 1: (Read section 1.5 in [1] if you are mathematically inclined).


Denote \pi^k := \{\mu_k, \mu_{k+1}, \ldots, \mu_{N-1}\}. Denote by

J_k^*(x_k) = \min_{\pi^k} E_{w_k, \ldots, w_{N-1}} \left[ g_N(x_N) + \sum_{i=k}^{N-1} g_i\left( x_i, \mu_i(x_i), w_i \right) \right]

the optimal cost when, starting at time k, we find ourselves at state x_k. Finally, J_N^*(x_N) = g_N(x_N).

We will show that J_k^* = J_k, where J_k is generated by the DPA; this gives us the desired result when k = 0.

Induction: J_N^*(x_N) = J_N(x_N), true for k = N.
Assume true for k+1: J_{k+1}^*(x_{k+1}) = J_{k+1}(x_{k+1}) for all x_{k+1} \in S_{k+1}.
Then, since \pi^k = \{\mu_k, \pi^{k+1}\}, we have

J_k^*(x_k) = \min_{(\mu_k, \pi^{k+1})} E_{w_k, \ldots, w_{N-1}} \left[ g_k(x_k, \mu_k(x_k), w_k) + g_N(x_N) + \sum_{i=k+1}^{N-1} g_i(x_i, \mu_i(x_i), w_i) \right]

by the principle of optimality:

= \min_{\mu_k} E_{w_k} \left[ g_k(x_k, \mu_k(x_k), w_k) + \min_{\pi^{k+1}} E_{w_{k+1}, \ldots, w_{N-1}} \left[ g_N(x_N) + \sum_{i=k+1}^{N-1} g_i(x_i, \mu_i(x_i), w_i) \right] \right]

by the definition of J_{k+1}^* and the update equation:

= \min_{\mu_k} E_{w_k} \left[ g_k(x_k, \mu_k(x_k), w_k) + J_{k+1}^*\left( f_k(x_k, \mu_k(x_k), w_k) \right) \right]

by the induction hypothesis:

= \min_{\mu_k} E_{w_k} \left[ g_k(x_k, \mu_k(x_k), w_k) + J_{k+1}\left( f_k(x_k, \mu_k(x_k), w_k) \right) \right]

= \min_{u_k \in U_k(x_k)} E_{w_k} \left[ g_k(x_k, u_k, w_k) + J_{k+1}\left( f_k(x_k, u_k, w_k) \right) \right]

= J_k(x_k)

In other words: minimizing over a function (the policy \mu_k) is the same as carrying out the minimization pointwise, for each x_k.

J_k(x_k) is called the cost-to-go at state x_k. J_k(\cdot) is called the cost-to-go function.

    2.3 Chess Match Strategy Revisited

    Recall

Timid Play: prob. tie = p_d, prob. loss = 1 - p_d


Bold Play: prob. win = p_w, prob. loss = 1 - p_w
2 game match, plus tie breaker if necessary.

Objective: Find the policy which maximizes the probability of winning. We will solve with DP, replacing min by max. Assume p_d > p_w.

Define x_k = difference between our score and the opponent's score at the end of game k. Recall: 1 point for a win, 0 for a loss and 0.5 for a tie.

Define J_k(x_k) = probability of winning the match at time k if the state is x_k.

Start the recursion:

J_2(x_2) = \begin{cases} 1 & \text{if } x_2 > 0 \\ p_w & \text{if } x_2 = 0 \\ 0 & \text{if } x_2 < 0 \end{cases}

Recursive equation:

J_k(x_k) = \max \left[ \underbrace{p_d\, J_{k+1}(x_k) + (1 - p_d)\, J_{k+1}(x_k - 1)}_{\text{timid}},\; \underbrace{p_w\, J_{k+1}(x_k + 1) + (1 - p_w)\, J_{k+1}(x_k - 1)}_{\text{bold}} \right]

Convince yourself that this is equivalent to the formal definition

J_k(x_k) = \max_{u_k} E_{w_k} \left[ g_k(x_k, u_k, w_k) + J_{k+1}\left( f_k(x_k, u_k, w_k) \right) \right]

Note: there is only a terminal cost in this problem.

J_1(x_1) = \max \left[ p_d\, J_2(x_1) + (1 - p_d)\, J_2(x_1 - 1),\; p_w\, J_2(x_1 + 1) + (1 - p_w)\, J_2(x_1 - 1) \right]

If x_1 = 1: \max \left[ \underbrace{p_d + (1 - p_d)\, p_w}_{\text{timid}},\; \underbrace{p_w + (1 - p_w)\, p_w}_{\text{bold}} \right]. Which is bigger? Timid - Bold = (p_d - p_w)(1 - p_w) > 0, so Timid is optimal, and J_1(1) = p_d + (1 - p_d)\, p_w.

If x_1 = 0: \max \left[ \underbrace{p_d\, p_w + (1 - p_d) \cdot 0}_{\text{timid}},\; \underbrace{p_w + (1 - p_w) \cdot 0}_{\text{bold}} \right]. The optimum is p_w, so J_1(0) = p_w and Bold is the optimal strategy.

If x_1 = -1: \max \left[ 0,\; p_w^2 \right], so J_1(-1) = p_w^2 and the optimal strategy is Bold.


J_0(0) = \max \left[ p_d\, J_1(0) + (1 - p_d)\, J_1(-1),\; p_w\, J_1(1) + (1 - p_w)\, J_1(-1) \right]
       = \max \left[ p_d\, p_w + (1 - p_d)\, p_w^2,\; p_w \left( p_d + (1 - p_d)\, p_w \right) + (1 - p_w)\, p_w^2 \right]
       = \max \left[ p_d\, p_w + (1 - p_d)\, p_w^2,\; p_d\, p_w + (1 - p_d)\, p_w^2 + (1 - p_w)\, p_w^2 \right]

J_0(0) = p_d\, p_w + (1 - p_d)\, p_w^2 + (1 - p_w)\, p_w^2, and the optimal strategy at 0-0 is Bold.

Optimal Strategy: If ahead, play Timid; otherwise play Bold.
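A short numerical check of this recursion (a sketch, not from the script; the helper names and the dictionary keyed by score difference are illustrative choices):

def chess_dp(p_w, p_d, N=2):
    # Backward DP for the two-game chess match. State x = our score minus
    # the opponent's score; returns J_0(0) and the optimal play per (k, x).
    def diffs(k):                    # reachable score differences after k games
        return [i * 0.5 for i in range(-2 * k, 2 * k + 1)]

    # terminal condition: win if ahead, sudden death (play bold) if tied
    J = {x: 1.0 if x > 0 else (p_w if x == 0 else 0.0) for x in diffs(N)}
    policy = {}
    for k in range(N - 1, -1, -1):
        Jk = {}
        for x in diffs(k):
            timid = p_d * J[x] + (1 - p_d) * J[x - 1]      # draw keeps x, loss -> x-1
            bold = p_w * J[x + 1] + (1 - p_w) * J[x - 1]   # win -> x+1, loss -> x-1
            Jk[x] = max(timid, bold)
            policy[(k, x)] = 'timid' if timid >= bold else 'bold'
        J = Jk
    return J[0.0], policy

# e.g. chess_dp(0.45, 0.9)[0] is roughly 0.54, matching the value computed above.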

2.4 Converting non-standard problems to the Basic Problem

    2.4.1 Time Lags

Assume the update equation is of the following form:

x_{k+1} = f_k(x_k, x_{k-1}, u_k, u_{k-1}, w_k)

Define y_k = x_{k-1} and s_k = u_{k-1}. Then

\begin{pmatrix} x_{k+1} \\ y_{k+1} \\ s_{k+1} \end{pmatrix} = \begin{pmatrix} f_k(x_k, y_k, u_k, s_k, w_k) \\ x_k \\ u_k \end{pmatrix} =: \tilde f_k(\tilde x_k, u_k, w_k)

Let \tilde x_k = (x_k, y_k, s_k), so that \tilde x_{k+1} = \tilde f_k(\tilde x_k, u_k, w_k).

The control is u_k = \mu_k(\tilde x_k) = \mu_k(x_k, x_{k-1}, u_{k-1}). This can be generalized to more than one time lag.

    2.4.2 Correlated Disturbances

If the disturbances are not independent, but can be modeled as the output of a system driven by independent disturbances: Colored Noise.

    Example 5:

w_k = C_k\, y_{k+1}
y_{k+1} = A_k\, y_k + \xi_k

A_k, C_k are given, and \{\xi_k\} is an independent sequence. As usual, x_{k+1} = f_k(x_k, u_k, w_k). Then

\begin{pmatrix} x_{k+1} \\ y_{k+1} \end{pmatrix} = \begin{pmatrix} f_k\left( x_k, u_k, C_k (A_k y_k + \xi_k) \right) \\ A_k y_k + \xi_k \end{pmatrix}, \qquad u_k = \mu_k(x_k, y_k),

which is now in the standard form. In general, y_k cannot be measured and must be estimated.

    2.4.3 Forecasts

When state information includes knowledge of probability distributions: at the beginning of each period, we receive information about the probability distribution of the upcoming disturbance. In particular, assume the disturbance could have one of the probability distributions \{Q_1, Q_2, \ldots, Q_m\}, with a priori probabilities p_1, \ldots, p_m. Receiving forecast i means that Q_i is used to generate the disturbance. Model this as follows: y_{k+1} = \xi_k, where \xi_k is a random variable taking value i with probability p_i, and w_k has probability distribution Q_{y_k}.

Then we have

\begin{pmatrix} x_{k+1} \\ y_{k+1} \end{pmatrix} = \begin{pmatrix} f_k(x_k, u_k, w_k) \\ \xi_k \end{pmatrix},

with new state \tilde x_k = (x_k, y_k). Since y_k is known at time k, we have a Basic Problem formulation. The new disturbance \tilde w_k = (w_k, \xi_k) depends on the current state, which is allowed. The DPA takes on the following form:

J_N(x_N, y_N) = g_N(x_N)

J_k(x_k, y_k) = \min_{u_k} E_{w_k} E_{\xi_k} \left[ g_k(x_k, u_k, w_k) + J_{k+1}\left( f_k(x_k, u_k, w_k), \xi_k \right) \,\middle|\, y_k \right]
            = \min_{u_k} E_{w_k} \left[ g_k(x_k, u_k, w_k) + \sum_{i=1}^{m} p_i\, J_{k+1}\left( f_k(x_k, u_k, w_k), i \right) \,\middle|\, y_k \right]

where the conditional expectation simply means that w_k has probability distribution Q_{y_k}: for y_k \in \{1, \ldots, m\}, the expectation over w_k is taken with respect to the distribution Q_{y_k}.


    2.5 Deterministic, Finite State Systems

Recall the Basic Problem:

x_{k+1} = f_k(x_k, u_k, w_k), \quad k = 0, \ldots, N-1, \qquad g_k(x_k, u_k, w_k) \text{ the cost at stage } k.

Consider problems where

1. x_k \in S_k, with S_k a finite set
2. there are no disturbances w_k.

We assume, without loss of generality, that there is only one way to go from state i \in S_k to j \in S_{k+1} (if there is more than one way, pick the one with lowest cost at stage k).

    2.5.1 Convert DP to Shortest Path Problem

[Figure 2.2: General Shortest Path Problem. The stages 1 through N lie between the start node S and an artificial terminal node T.]

a_{ij}^k = cost to go from state i \in S_k to state j \in S_{k+1} at time k. This is set to \infty if there is no way to go from i \in S_k to j \in S_{k+1}.

a_{iT}^N = terminal cost of state i \in S_N.

In other words,

a_{ij}^k = g_k(i, u_k^{ij}), \quad \text{where } j = f_k(i, u_k^{ij})

a_{iT}^N = g_N(i)


    2.5.2 DP algorithm

J_N(i) = a_{iT}^N, \quad i \in S_N

J_k(i) = \min_{j \in S_{k+1}} \left[ a_{ij}^k + J_{k+1}(j) \right], \quad i \in S_k,\; k = 0, \ldots, N-1

This solves the shortest path problem.

2.5.3 Forward DP algorithm

By inspection, the problem is symmetric: the shortest path from S to T is the same as from T to S. This motivates the following algorithm, where \tilde J_k(j) is the optimal cost to arrive at state j:

\tilde J_N(j) = a_{Sj}^0, \quad j \in S_1

\tilde J_k(j) = \min_{i \in S_{N-k}} \left[ a_{ij}^{N-k} + \tilde J_{k+1}(i) \right], \quad j \in S_{N-k+1},\; k = 1, \ldots, N-1

\tilde J_0(T) = \min_{i \in S_N} \left[ a_{iT}^N + \tilde J_1(i) \right]

\tilde J_0(T) = J_0(S)
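For the deterministic, finite state case the backward recursion is a few lines of code. This is an illustrative sketch (the data layout and names are assumptions, not from the script):

import math

def shortest_path_dp(stages, arc_cost, terminal_cost):
    # Backward DP over a stage graph:
    #   stages            : list [S_0, S_1, ..., S_N] of state lists
    #   arc_cost(k, i, j) : a_ij^k for i in S_k, j in S_{k+1} (math.inf if no arc)
    #   terminal_cost(i)  : a_iT^N for i in S_N
    # Returns J, where J[k][i] is the shortest path cost from i at stage k to T.
    N = len(stages) - 1
    J = [dict() for _ in range(N + 1)]
    J[N] = {i: terminal_cost(i) for i in stages[N]}
    for k in range(N - 1, -1, -1):
        J[k] = {i: min(arc_cost(k, i, j) + J[k + 1][j] for j in stages[k + 1])
                for i in stages[k]}
    return J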

    2.6 Converting Shortest Path to DP

[Figure 2.3: Another path problem, in which cycles are allowed: a general graph with a start and an end node.]

    As an example for a mental picture, one could imagine cities on a map.


Let \{1, 2, \ldots, N, T\} be the nodes of the graph, and a_{ij} the cost to move from node i to node j, with a_{ij} = \infty if there is no edge. Here i and j denote nodes, as opposed to the previous section where they were states.

Assume that all cycles have non-negative cost. This isn't an issue if all edges have cost \ge 0.

Note that with the above assumption, an optimal path has length \le N (it visits at most all nodes).

Set up the problem so that we require exactly N moves, where degenerate moves are allowed (a_{ii} = 0):

J_k(i) = optimal cost of getting from i to T in N - k moves

J_N(i) = a_{iT} (can be infinite, of course)

J_k(i) = \min_j \left[ a_{ij} + J_{k+1}(j) \right]

(the optimal (N-k)-move cost is a_{ij} plus the optimal (N-k-1)-move cost from j). Notice that degenerate moves are allowed; they are removed in the end.

Terminate the procedure if J_k(i) = J_{k+1}(i) for all i.

    2.7 Viterbi Algorithm

This is a powerful combination of DP and Bayes' Rule for optimal estimation.

Given a Markov Chain with state transition probabilities p_{ij}:

p_{ij} = P(x_{k+1} = j \mid x_k = i), \quad 1 \le i, j \le M

p(x_0) = initial probability distribution of the starting state.

We can only indirectly observe the state via measurements:

r(z; i, j) = P(\text{meas} = z \mid x_k = i, x_{k+1} = j) \quad \text{for all } k,

where r is the likelihood function.


Objective: Given the measurements Z_N = \{z_1, \ldots, z_N\}, construct the state sequence X_N = \{x_0, \ldots, x_N\} that maximizes P(X_N \mid Z_N) over all X_N: the most likely state sequence.

Recall P(X_N, Z_N) = P(X_N \mid Z_N)\, P(Z_N). For a given Z_N, maximizing P(X_N, Z_N) over X_N gives the same result as maximizing P(X_N \mid Z_N) over X_N.

P(X_N, Z_N) = P(x_0, \ldots, x_N, z_1, \ldots, z_N)
= P(x_1, \ldots, x_N, z_1, \ldots, z_N \mid x_0)\, P(x_0)
= P(x_2, \ldots, x_N, z_2, \ldots, z_N \mid x_0, x_1, z_1)\, P(x_1, z_1 \mid x_0)\, P(x_0)
= P(x_2, \ldots, x_N, z_2, \ldots, z_N \mid x_0, x_1, z_1)\, P(z_1 \mid x_0, x_1)\, P(x_1 \mid x_0)\, P(x_0)
= P(x_2, \ldots, x_N, z_2, \ldots, z_N \mid x_0, x_1, z_1)\, r(z_1; x_0, x_1)\, p_{x_0, x_1}\, P(x_0)

One more step:

P(x_2, \ldots, x_N, z_2, \ldots, z_N \mid x_0, x_1, z_1)
= P(x_3, \ldots, x_N, z_3, \ldots, z_N \mid x_0, x_1, z_1, x_2, z_2)\, P(x_2, z_2 \mid x_0, x_1, z_1)
= P(x_3, \ldots, x_N, z_3, \ldots, z_N \mid x_0, x_1, z_1, x_2, z_2)\, P(z_2 \mid x_0, x_1, z_1, x_2)\, P(x_2 \mid x_0, x_1, z_1)
= P(x_3, \ldots, x_N, z_3, \ldots, z_N \mid x_0, x_1, z_1, x_2, z_2)\, r(z_2; x_1, x_2)\, p_{x_1, x_2}

Keep going, and one gets:

P(X_N, Z_N) = P(x_0) \prod_{k=1}^{N} p_{x_{k-1}, x_k}\, r(z_k; x_{k-1}, x_k)

Assume that all quantities are > 0 (if some are = 0, the algorithm can be modified). Since the above is strictly positive and the log function is monotonically increasing in its argument, maximizing P(X_N, Z_N) is equivalent to

\min_{X_N} \left[ -\log P(x_0) - \sum_{k=1}^{N} \log\left( p_{x_{k-1}, x_k}\, r(z_k; x_{k-1}, x_k) \right) \right],

which is a deterministic, finite state DP (shortest path) problem over the stages k = 1, \ldots, N.

Forward DP: at time k, we can calculate the cost to arrive at any state. We don't have to wait until the end to solve the problem.
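A compact forward-DP sketch of the Viterbi algorithm (illustrative; the data layout and names are assumptions, and all probabilities are taken to be strictly positive as above):

import math

def viterbi(p0, p, r, Z):
    # p0[i]      : initial probability of state i
    # p[i][j]    : transition probability P(x_{k+1}=j | x_k=i)
    # r(z, i, j) : measurement likelihood P(z | x_k=i, x_{k+1}=j)
    # Z          : measurements z_1..z_N
    # Minimizes the sum of -log terms by forward DP and returns x_0..x_N.
    states = range(len(p0))
    cost = {i: -math.log(p0[i]) for i in states}     # cost to arrive at x_0 = i
    parent = []
    for z in Z:
        new_cost, back = {}, {}
        for j in states:
            best_i = min(states,
                         key=lambda i: cost[i] - math.log(p[i][j] * r(z, i, j)))
            new_cost[j] = cost[best_i] - math.log(p[best_i][j] * r(z, best_i, j))
            back[j] = best_i
        cost, parent = new_cost, parent + [back]
    # backtrack from the cheapest terminal state
    x = [min(states, key=lambda i: cost[i])]
    for back in reversed(parent):
        x.append(back[x[-1]])
    return list(reversed(x))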

    2.8 Shortest Path Algorithms

Look at alternatives to DP for problems that are finite and deterministic.

Path length = cost.


    2.8.1 Label Correcting Methods

Assume a_{ij} \ge 0: the arc length (cost to go from node i to node j) is \ge 0.

[Figure 2.4: Diagram of the label correcting algorithm: a node i is removed from the OPEN bin, each child j is tested with "is d_i + a_{ij} < d_j?" and "is d_i + a_{ij} < d_T?", and if so d_j is set to d_i + a_{ij}.]

Let d_i be the length of the shortest path to i found so far.

Step 0: Place node S in the OPEN bin, set d_S = 0 and d_j = \infty for all j \ne S.

Step 1: Remove a node i from OPEN, and execute Step 2 for all children j of i.

Step 2: If d_i + a_{ij} < \min(d_j, d_T), set d_j = d_i + a_{ij} and set i to be the parent of j. If j \ne T, place j in OPEN if it is not already there.

Step 3: If OPEN is empty, we are done. If not, go back to Step 1.
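A direct transcription of these steps into code, as a sketch (the graph representation is an assumption; removal is last-in/first-out, i.e. the depth-first variant used in the example below, and the target is assumed reachable):

import math

def label_correcting(nodes, arcs, start, target):
    # arcs[i] is a list of (j, a_ij) pairs with a_ij >= 0
    d = {i: math.inf for i in nodes}      # d_i: shortest path to i found so far
    d[start], d_T = 0.0, math.inf
    parent = {}
    OPEN = [start]
    while OPEN:
        i = OPEN.pop()                    # Step 1: remove a node from OPEN
        for j, a_ij in arcs.get(i, []):
            if d[i] + a_ij < min(d[j], d_T):   # Step 2
                d[j] = d[i] + a_ij
                parent[j] = i
                if j == target:
                    d_T = d[j]
                elif j not in OPEN:
                    OPEN.append(j)
    # Step 3 done: reconstruct the optimal path by following parents back
    path = [target]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return d_T, list(reversed(path))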

    Example 6: Deterministic Scheduling Problem (revisited)


[Figure 2.5: Deterministic Scheduling Problem. Stage graph from the initial condition I.C. over the partial schedules (A, C, AB, AC, CA, CD, ABC, ACB, ACD, CAB, CAD, CDA) to the terminal node T, with arc costs and the precedence constraints "A before B" and "C before D".]

Iteration #   Remove   OPEN                     d_T   OPTIMAL
0             -        S(0)                     inf   -
1             S        A(5), C(3)               inf   -
2             C        A(5), CA(7), CD(9)       inf   -
3             CD       A(5), CA(7), CDA(12)     inf   -
4             CDA      A(5), CA(7)              14    CDAB
5             CA       A(5), CAB(9), CAD(11)    14    CDAB
6             CAD      A(5), CAB(9)             14    CDAB
7             CAB      A(5)                     10    CABD
8             A        AB(7), AC(8)             10    CABD
9             AC       AB(7)                    10    CABD
10            AB       -                        10    CABD

Done, optimal cost = 10, optimal path = CABD.

Different ways to remove items from OPEN give different, well known, algorithms.

Depth-First Search: Last in, first out. This is what we did in the example. It finds a feasible path quickly, and is also good if you have limited memory.


Best-First Search: Remove the best label. This is Dijkstra's method. The remove step is more expensive, but it can give good performance.

Breadth-First Search: First in, first out. Bellman-Ford.

2.8.2 A* Algorithm

The workhorse of many AI applications, e.g. path planning.

Basic idea: Replace the test d_i + a_{ij} < d_T by d_i + a_{ij} + h_j < d_T, where h_j is a lower bound on the shortest distance from j to T. Indeed, if d_i + a_{ij} + h_j \ge d_T, it is clear that a path going through j will not be optimal.

    2.9 Multi-Objective Problems

Example 7: Motivation: we care about both time and fuel.

[Figure 2.6: Possible outcomes in the time-fuel plane, split into inferior and non-inferior points.]

A vector x = (x_1, x_2, \ldots, x_M) \in S is non-inferior if there is no other y \in S such that y_l \le x_l, l = 1, \ldots, M, with strict inequality for at least one of these l.

Given a problem with M cost functions f_1(x), \ldots, f_M(x), x \in X is a non-inferior solution if the vector (f_1(x), \ldots, f_M(x)) is a non-inferior vector of the set \{ (f_1(y), \ldots, f_M(y)) \mid y \in X \}.

A reasonable goal: find all non-inferior solutions, then use another criterion to pick which one you actually want to use.


How this applies to deterministic, finite state DP (which is equivalent to a shortest path problem):

x_{k+1} = f_k(x_k, u_k) \quad \text{(dynamics)}

g_N^l(x_N) + \sum_{k=0}^{N-1} g_k^l(x_k, u_k), \quad l = 1, \ldots, M \quad \text{(costs)}

    2.9.1 Extended Principle of Optimality

If \{u_k, \ldots, u_{N-1}\} is a non-inferior control sequence for the tail subproblem that starts at x_k, then \{u_{k+1}, \ldots, u_{N-1}\} is also non-inferior for the tail subproblem that starts at f_k(x_k, u_k). Simple proof: by contradiction.

Algorithm. First define what we will do the recursion over:

F_k(x_k): the set of M-tuples (vectors of size M) of cost-to-go at x_k which are non-inferior.

F_N(x_N) = \{ (g_N^1(x_N), \ldots, g_N^M(x_N)) \}: only one element in the set for each x_N.

Given F_{k+1}(x_{k+1}) for all x_{k+1}, generate for each state x_k the set of vectors (g_k^1(x_k, u_k) + c^1, \ldots, g_k^M(x_k, u_k) + c^M) such that (c^1, \ldots, c^M) \in F_{k+1}(f_k(x_k, u_k)). These are all possible costs that are consistent with F_{k+1}(x_{k+1}). Then to obtain F_k(x_k), simply extract all non-inferior elements.

[Figure 2.7: Possible sets for F_{k+1}.]

When we calculate F_0(x_0), we will have all non-inferior solutions.


    2.10 Infinite Horizon Problems

Consider the time (or iteration) invariant case:

x_{k+1} = f(x_k, u_k, w_k), \quad x_k \in S,\; u_k \in U,\; w_k \sim P(\cdot \mid x_k, u_k)

J_\pi(x_0) = E \left[ \sum_{k=0}^{N-1} g(x_k, \mu_k(x_k), w_k) \right], \quad \text{no terminal cost.}

Write down the DP algorithm:

J_N(x_N) = 0

J_k(x_k) = \min_{u_k \in U} E_{w_k} \left[ g(x_k, u_k, w_k) + J_{k+1}\left( f(x_k, u_k, w_k) \right) \right] \quad \text{for all } k

Question: What happens as N \to \infty? Does the problem become easier? Yes. Reason: we lose the notion of time. For a very large class of problems, we have the Bellman Equation:

J^*(x) = \min_{u} E_{w} \left[ g(x, u, w) + J^*\left( f(x, u, w) \right) \right] \quad \text{for all } x \in S

The Bellman Equation involves solving for the optimal cost-to-go function J^*(x) for all x \in S.

u^* = \mu^*(x) gives the optimal policy (\mu^*(\cdot) is obtained from the solution of the Bellman Equation: for every x there is a minimizing u).

There are efficient methods for solving the Bellman Equation, and technical conditions on when this can be done.

    2.11 Stochastic, Shortest Path Problems

x_{k+1} = w_k, \quad x_k \in S, \text{ a finite set}

P(w_k = j \mid x_k = i, u_k = u) = p_{ij}(u), \quad u_k \in U(x_k), \text{ a finite set}


We have a finite number of states. The transition from one state to the next is dictated by p_{ij}(u): the probability that the next state is j given that the current state is i. u is the control input; we can control what these transition probabilities are, with a finite set of options u \in U(i). The problem data is time (or iteration) independent.

Cost: given an initial state i and a policy \pi = \{\mu_0, \mu_1, \ldots\},

J_\pi(i) = \lim_{N \to \infty} E \left[ \sum_{k=0}^{N-1} g(x_k, \mu_k(x_k)) \,\middle|\, x_0 = i \right]

Optimal cost from state i: J^*(i).

Stationary policy: \pi = \{\mu, \mu, \ldots\}. Denote J_\mu(i) as the resulting cost; \{\mu, \mu, \ldots\} is simply referred to as \mu. \mu is optimal if

J_\mu(i) = J^*(i) = \min_\pi J_\pi(i)

Assumptions:

1. Existence of a cost-free termination state t:
   p_{tt}(u) = 1 and g(t, u) = 0 for all u.
   This is a sufficient condition to make the cost meaningful; think of it as a destination state.

2. There exists an integer m such that for all admissible policies \pi
   \rho_\pi := \max_{i = 1, \ldots, n} P(x_m \ne t \mid x_0 = i, \pi) < 1.
   This is a strong assumption, which is only required for the proofs.

    2.11.1 Main Result

A) Given any initial conditions J_0(1), \ldots, J_0(n), the sequence

J_{k+1}(i) = \min_{u \in U(i)} \left[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J_k(j) \right] \quad \text{for all } i

converges to the optimal cost J^*(i) for each i.


B) The optimal cost satisfies Bellman's Equation:

J^*(i) = \min_{u \in U(i)} \left[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J^*(j) \right] \quad \text{for all } i,

which has a unique solution.

    2.11.2 Sketch of Proof

A0) First prove that the cost is bounded.

Recall: there exists m such that for all policies \pi,

\rho_\pi := \max_{i} P(x_m \ne t \mid x_0 = i, \pi) < 1

Since all problem data is finite, \rho := \max_{\pi} \rho_\pi < 1.

P(x_{2m} \ne t \mid x_0 = i, \pi) = P(x_{2m} \ne t \mid x_m \ne t, x_0 = i, \pi)\, P(x_m \ne t \mid x_0 = i, \pi) \le \rho^2

Generally, P(x_{km} \ne t \mid x_0 = i, \pi) \le \rho^k. Furthermore, the cost incurred between the periods km and (k+1)m - 1 is bounded by

m\, \rho^k \max_{i,u} |g(i, u)| =: \rho^k M, \qquad M := m \max_{i,u} |g(i, u)|

|J_\pi(i)| \le \sum_{k=0}^{\infty} M \rho^k = \frac{M}{1 - \rho}, \quad \text{finite.}

A1)

J_\pi(x_0) = \lim_{N \to \infty} E \left[ \sum_{k=0}^{N-1} g(x_k, \mu_k(x_k)) \right]
           = E \left[ \sum_{k=0}^{mK-1} g(x_k, \mu_k(x_k)) \right] + \lim_{N \to \infty} E \left[ \sum_{k=mK}^{N-1} g(x_k, \mu_k(x_k)) \right]

By the previous bound, we know that

\left| \lim_{N \to \infty} E \left[ \sum_{k=mK}^{N-1} g(x_k, \mu_k(x_k)) \right] \right| \le \frac{M \rho^K}{1 - \rho}

As expected, we can make the tail as small as we want.


A2) Recall that we can view J_0 as a terminal cost function, with J_0(i) given. Bound its expected value:

\left| E\left( J_0(x_{mK}) \right) \right| = \left| \sum_{i=1}^{n} P(x_{mK} = i \mid x_0, \pi)\, J_0(i) \right|
\le \left( \sum_{i=1}^{n} P(x_{mK} = i \mid x_0, \pi) \right) \max_i |J_0(i)|
\le \rho^K \max_i |J_0(i)|

A3) Sandwich:

E\left( J_0(x_{mK}) \right) + E \left[ \sum_{k=0}^{mK-1} g(x_k, \mu_k(x_k)) \right]
= E\left( J_0(x_{mK}) \right) + J_\pi(x_0) - \lim_{N \to \infty} E \left[ \sum_{k=mK}^{N-1} g(x_k, \mu_k(x_k)) \right]

Recall that if a = b + c, then b - |c| \le a \le b + |c|. It follows that

-\rho^K \max_i |J_0(i)| - \frac{M \rho^K}{1 - \rho} + J_\pi(x_0)
\;\le\; E \left[ J_0(x_{mK}) + \sum_{k=0}^{mK-1} g(x_k, \mu_k(x_k)) \right]
\;\le\; \rho^K \max_i |J_0(i)| + \frac{M \rho^K}{1 - \rho} + J_\pi(x_0)

A4) We take the minimum over all policies; the middle term is then exactly our DP recursion of part A after mK steps. Taking limits, we get

\lim_{K \to \infty} J_{mK}(x_0) = J^*(x_0)

Now we are almost done. Since the cost incurred within one block of m stages is bounded,

|J_{mK + l}(x_0) - J_{mK}(x_0)| \le \rho^K M \quad \text{for } 0 \le l \le m,

we have

\lim_{k \to \infty} J_k(x_0) = J^*(x_0)


    Summary

    A1: bound the tail over all policies

    A2: bound contribution from initial condition J0(i), over all policies

    A3: sandwich type of bounds, middle term is DP recursion

    A4: optimized over all policies, took limit.

2.11.3 Proof of B

Prove that the optimal cost satisfies Bellman's equation. Recall the recursion

J_{k+1}(i) = \min_{u \in U(i)} \left[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J_k(j) \right]

In Part A, we showed that J_k(\cdot) \to J^*(\cdot); just take limits on both sides. To prove uniqueness, just use a solution of the Bellman equation as the initial condition of the DP iteration.

    2.12 Summary of previous lecture

Dynamics:

x_{k+1} = w_k

P(w_k = j \mid x_k = i, u_k = u) = p_{ij}(u), \quad u \in U(i), \text{ a finite set}

x_k \in S, \text{ a finite set}, \quad S = \{1, 2, \ldots, n, t\}

p_{tt}(u) = 1 \quad \text{for all } u \in U(t)

Cost: Given \pi = \{\mu_0, \mu_1, \ldots\},

J_\pi(i) = \lim_{N \to \infty} E \left[ \sum_{k=0}^{N-1} g(x_k, \mu_k(x_k)) \,\middle|\, x_0 = i \right], \quad i \in S

g(t, u) = 0 \quad \text{for all } u \in U(t)

J^*(i) = \min_\pi J_\pi(i) \quad \text{(optimal cost)}

Note: J_\pi(t) = 0 and J^*(t) = 0.


Result:

A) Given any initial conditions J_0(1), \ldots, J_0(n), the sequence

J_{k+1}(i) = \min_{u \in U(i)} \left[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J_k(j) \right], \quad i \in S \setminus \{t\} = \{1, \ldots, n\}

converges to J^*(i).

Note: there is a bit of a short-cut here; we can also include the terminal state t, provided we pick J_0(t) = 0. This does not change the equations.

B)

J^*(i) = \min_{u \in U(i)} \left[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J^*(j) \right], \quad i \in S \setminus \{t\}

This is Bellman's Equation. It also gives the optimal policy, which is in fact stationary.

2.13 How do we solve Bellman's Equation?

2.13.1 Method 1: Value Iteration (VI)

Use the DP recursion of result A:

J_{k+1}(i) = \min_{u \in U(i)} \left[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J_k(j) \right], \quad i \in S \setminus \{t\}

until it converges. J_0(i) can be set to a guess; if the guess is good, it will speed up convergence. How do we know that we are close to converging? Exploit problem structure to get bounds, see [1].
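A minimal value iteration sketch under an assumed array layout (g[i, u] are the stage costs and P[u] is an n x n matrix of transition probabilities among the non-terminal states, rows possibly summing to less than 1 because the terminal state is left implicit; these names are illustrative, not from the script):

import numpy as np

def value_iteration(g, P, tol=1e-9, max_iter=10_000):
    n, m = g.shape                       # n states, m controls
    J = np.zeros(n)                      # initial guess J_0
    for _ in range(max_iter):
        # Q[i, u] = g(i, u) + sum_j p_ij(u) J(j)
        Q = np.stack([g[:, u] + P[u] @ J for u in range(m)], axis=1)
        J_new = Q.min(axis=1)
        if np.max(np.abs(J_new - J)) < tol:
            J = J_new
            break
        J = J_new
    policy = Q.argmin(axis=1)            # greedy policy for the final J
    return J, policy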

    2.13.2 Method 2: Policy Iteration (PI)

Iterate over the policy instead of the values. We need the following result:

C) For any stationary policy \mu, the costs J_\mu(i) are the unique solutions of

J_\mu(i) = g(i, \mu(i)) + \sum_{j=1}^{n} p_{ij}(\mu(i)) J_\mu(j), \quad i \in S \setminus \{t\}

Furthermore, given any initial conditions J_0(i), the sequence

J_{k+1}(i) = g(i, \mu(i)) + \sum_{j=1}^{n} p_{ij}(\mu(i)) J_k(j)

converges to J_\mu(i) for each i.

Proof: trivial. Consider the problem where the only allowable control at state i is \mu(i), and apply parts A and B. This is a special case of the general theorem.

Algorithm for PI. From now on, i \in S \setminus \{t\} = \{1, 2, \ldots, n\}.

Stage 1: Given \mu^k (the stationary policy at iteration k, not the policy at time k), solve for the costs J_{\mu^k}(i) by solving

J(i) = g(i, \mu^k(i)) + \sum_{j=1}^{n} p_{ij}(\mu^k(i)) J(j) \quad \text{for all } i:

n equations, n unknowns (Result C).

Stage 2: Improve the policy:

\mu^{k+1}(i) = \arg\min_{u \in U(i)} \left[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J_{\mu^k}(j) \right] \quad \text{for all } i

Iterate; quit when J_{\mu^{k+1}}(i) = J_{\mu^k}(i) for all i.
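A policy iteration sketch with the same assumed data layout as the value iteration sketch above (terminating when the policy stops changing, which here is equivalent to the costs stopping to change):

import numpy as np

def policy_iteration(g, P):
    n, m = g.shape
    mu = np.zeros(n, dtype=int)                 # initial stationary policy
    while True:
        # Stage 1: policy evaluation, solve (I - P_mu) J = g_mu (n equations)
        P_mu = np.array([P[mu[i]][i] for i in range(n)])
        g_mu = g[np.arange(n), mu]
        J = np.linalg.solve(np.eye(n) - P_mu, g_mu)
        # Stage 2: policy improvement
        Q = np.stack([g[:, u] + P[u] @ J for u in range(m)], axis=1)
        mu_new = Q.argmin(axis=1)
        if np.array_equal(mu_new, mu):
            return J, mu
        mu = mu_new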

Theorem: The above terminates after a finite number of steps, and converges to the optimal policy.

Proof (two steps):

1) We first show that J_{\mu^k}(i) \ge J_{\mu^{k+1}}(i) for all i, k.
2) We then show that what we converge to satisfies Bellman's Equation.

1) For fixed k, consider the following recursion in N:

J_{N+1}(i) = g(i, \mu^{k+1}(i)) + \sum_{j=1}^{n} p_{ij}(\mu^{k+1}(i)) J_N(j) \quad \text{for all } i, \qquad J_0(i) = J_{\mu^k}(i)

By result C), J_N \to J_{\mu^{k+1}} as N \to \infty.

J_0(i) = g(i, \mu^k(i)) + \sum_j p_{ij}(\mu^k(i)) J_0(j)
       \ge g(i, \mu^{k+1}(i)) + \sum_j p_{ij}(\mu^{k+1}(i)) J_0(j) = J_1(i)

(the inequality holds because \mu^{k+1}(i) is the minimizer in Stage 2). Next,

J_1(i) \ge g(i, \mu^{k+1}(i)) + \sum_j p_{ij}(\mu^{k+1}(i)) J_1(j) = J_2(i)

since J_1(i) \le J_0(i). Keep going, and get

J_0(i) \ge J_1(i) \ge \ldots \ge J_N(i) \ge \ldots

Taking the limit,

J_{\mu^k}(i) \ge J_{\mu^{k+1}}(i) \quad \text{for all } i

Since the number of stationary policies is finite, we will eventually have J_{\mu^k}(i) = J_{\mu^{k+1}}(i) for all i, for some finite k.

2) It follows from Stage 2 that, once converged,

J_{\mu^{k+1}}(i) = J_{\mu^k}(i) = \min_{u \in U(i)} \left[ g(i, u) + \sum_j p_{ij}(u) J_{\mu^k}(j) \right]

but this is Bellman's Equation! We have converged to the optimal policy.

Discussion: Complexity

Stage 1: a linear system of equations of size n, complexity O(n^3).

Stage 2: n minimizations with p choices each (p different values of u that can be used), complexity O(p\, n^2).

Put together: O(n^2 (n + p)).

Worst case number of iterations: a search over all policies, p^n. But in practice, it converges very quickly.

Why does Policy Iteration converge so quickly relative to Value Iteration?


Rewrite Value Iteration:

Stage 2: \mu^k(i) = \arg\min_{u \in U(i)} \left[ g(i, u) + \sum_j p_{ij}(u) J_k(j) \right]

Stage 1: J_{k+1}(i) = g(i, \mu^k(i)) + \sum_j p_{ij}(\mu^k(i)) J_k(j)

and iterate.

    2.13.3 Third Method: Linear Programming

Recall Bellman's Equation:

J^*(i) = \min_{u \in U(i)} \left[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J^*(j) \right], \quad i = 1, \ldots, n

and Value Iteration:

J_{k+1}(i) = \min_{u \in U(i)} \left[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J_k(j) \right], \quad i = 1, \ldots, n

We showed that Value Iteration (V.I.) converges to the optimal cost-to-go J^* for all initial guesses J_0.

Assume we start V.I. with any J_0 that satisfies

J_0(i) \le \min_{u \in U(i)} \left[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J_0(j) \right], \quad i = 1, \ldots, n

It follows that J_1(i) \ge J_0(i) for all i. Then

J_1(i) \le \min_{u \in U(i)} \left[ g(i, u) + \sum_{j=1}^{n} p_{ij}(u) J_1(j) \right], \quad i = 1, \ldots, n \;\Rightarrow\; J_2(i) \ge J_1(i) \text{ for all } i

In general:

J_{k+1}(i) \ge J_k(i) \text{ for all } i, k; \qquad J_k \to J^* \;\Rightarrow\; J_0(i) \le J^*(i) \text{ for all } i

Now let \hat J solve the following problem:

\max \sum_i J(i) \quad \text{subject to} \quad J(i) \le g(i, u) + \sum_j p_{ij}(u) J(j) \quad \text{for all } i \text{ and all } u \in U(i)

It is clear that \hat J(i) \le J^*(i) for all i, by the previous analysis. Since J^* satisfies the constraints, it follows that J^* achieves the maximum. This is a Linear Program!
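A sketch of this LP formulation using SciPy's generic LP solver (same assumed data layout as the earlier sketches; linprog minimizes, so the objective is negated):

import numpy as np
from scipy.optimize import linprog

def bellman_lp(g, P):
    # maximize sum_i J(i) subject to J(i) <= g(i,u) + sum_j p_ij(u) J(j)
    n, m = g.shape
    A_ub, b_ub = [], []
    for i in range(n):
        for u in range(m):
            row = -P[u][i].copy()        # -sum_j p_ij(u) J(j)
            row[i] += 1.0                #  + J(i)   <= g(i, u)
            A_ub.append(row)
            b_ub.append(g[i, u])
    res = linprog(c=-np.ones(n), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n)   # J is unbounded in sign
    return res.x                         # J*(1), ..., J*(n)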

    2.13.4 Analogies and Connections

Say I want to solve

J = G + PJ, \quad J \in \mathbb{R}^n, \; G \in \mathbb{R}^n, \; P \in \mathbb{R}^{n \times n}

The direct way to solve it:

(I - P)J = G \;\Rightarrow\; J = (I - P)^{-1} G

This is exactly what we do in Stage 1 of policy iteration: solve for the cost associated with a specific policy.

Why is (I - P)^{-1} guaranteed to exist? For a given policy, let \tilde P \in \mathbb{R}^{(n+1) \times (n+1)} be the probability matrix that captures our Markov Chain:

\tilde P = \begin{pmatrix} P & p_t \\ 0 & 1 \end{pmatrix}

where P contains the transition probabilities among the states 1, \ldots, n (p_{ij}: probability that the next state is j given that the current state is i) and p_t the transition probabilities into the terminal state.

Facts:

- \tilde P is a right stochastic matrix: all rows sum up to 1, all elements are \ge 0.
- Perron-Frobenius Theorem: the eigenvalues of \tilde P have absolute value \le 1, with at least one eigenvalue equal to 1.
- Assumption on the terminal state: P^N \to 0 as N \to \infty. We will eventually reach the termination state!

Therefore:

- the eigenvalues of P have absolute value < 1,
- (I - P)^{-1} exists.

Furthermore:

(I - P)^{-1} = I + P + P^2 + \ldots

Proof:

(I - P)(I + P + P^2 + \ldots) = I + (P + P^2 + \ldots) - (P + P^2 + \ldots) = I

Therefore one way to solve for J is as follows:

J_1 = G + P J_0
J_2 = G + P J_1 = G + P G + P^2 J_0
\vdots
J_N = (I + P + \ldots + P^{N-1}) G + P^N J_0

and J_N \to (I - P)^{-1} G as N \to \infty!

Analogy:

Value Iteration: one step of this update.
Policy Iteration: an infinite number of updates, i.e. solving the system of equations exactly.

What is truly amazing is that various combinations of policy iteration and value iteration all converge to the solution of Bellman's Equation.

Recall value iteration:

J_{k+1}(i) = \min_{u \in U(i)} \left[ g(i, u) + \sum_j p_{ij}(u) J_k(j) \right], \quad i = 1, \ldots, n

In practice, you would implement it as follows:

\hat J(i) \leftarrow \min_{u \in U(i)} \left[ g(i, u) + \sum_j p_{ij}(u) J(j) \right], \quad i = 1, \ldots, n
J(i) \leftarrow \hat J(i), \quad i = 1, \ldots, n

You don't have to do this! You can also update in place:

J(i) \leftarrow \min_{u \in U(i)} \left[ g(i, u) + \sum_j p_{ij}(u) J(j) \right], \quad i = 1, \ldots, n

This is a Gauss-Seidel update: a generic technique for solving iterative equations.

It gets even better: Asynchronous Policy Iteration.

- Any number of value updates in between policy updates.
- Any number of states updated at each value update.
- Any number of states updated at each policy update.

Under some mild assumptions, all converge to J^*.
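The in-place (Gauss-Seidel) variant described above differs from the two-array implementation by a single line: each state update immediately reuses the freshest values of J. A sketch, with the same assumed data layout as before:

import numpy as np

def gauss_seidel_value_iteration(g, P, sweeps=100):
    n, m = g.shape
    J = np.zeros(n)
    for _ in range(sweeps):
        for i in range(n):
            # update J(i) in place, using already-updated entries of J
            J[i] = min(g[i, u] + P[u][i] @ J for u in range(m))
    return J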

    2.14 Discounted Problems

J_\pi(i) = \lim_{N \to \infty} E \left[ \sum_{k=0}^{N-1} \alpha^k g(x_k, \mu_k(x_k)) \,\middle|\, x_0 = i \right], \quad \alpha < 1, \; i \in \{1, \ldots, n\}

No explicit termination state is required, and no assumption on the transition probabilities is required.

Bellman's Equation for this problem:

J^*(i) = \min_{u \in U(i)} \left[ g(i, u) + \alpha \sum_{j=1}^{n} p_{ij}(u) J^*(j) \right] \quad \text{for all } i

How do we show this?

Define an associated problem with states \{1, 2, \ldots, n, t\}. From state i \ne t, when u is applied in the new problem the cost is g(i, u); the next state is j with probability \alpha\, p_{ij}(u), and t with probability 1 - \alpha (since \sum_j p_{ij}(u) = 1).

Clearly, since \alpha < 1, we have a non-zero probability of making it to state t, therefore our assumption on reaching the termination state is satisfied. Suppose we use the same policy in the discounted problem as in the associated problem. Note that

P\left( x_{k+1} = j \,\middle|\, x_k = i, x_{k+1} \ne t, u \right) = \frac{\alpha\, p_{ij}(u)}{\alpha\, p_{i1}(u) + \ldots + \alpha\, p_{in}(u)} = \frac{\alpha\, p_{ij}(u)}{\alpha} = p_{ij}(u)

So as long as we have not reached the termination state, the state evolution is governed by the same probabilities. The expected cost of the kth stage of the associated problem is g(x_k, \mu_k(x_k)) times the probability that t has not been reached by stage k, which is \alpha^k; therefore we recover \alpha^k g(x_k, \mu_k(x_k)).

Connections: for a given policy, we have

\tilde P_\alpha = \begin{pmatrix} \alpha P & (1 - \alpha)\mathbf{1} \\ 0 & 1 \end{pmatrix}

and it is clear that (\alpha P)^N = \alpha^N P^N \to 0.


    Chapter 3

Continuous Time Optimal Control

Consider the following system:

\dot x(t) = f(x(t), u(t)), \quad 0 \le t \le T, \qquad x(0) = x_0 \quad \text{(no noise!)}

State x(t) \in \mathbb{R}^n

Time t \in \mathbb{R}; T is the terminal time

Control u(t) \in U \subseteq \mathbb{R}^m, with U the control constraint set

Assume:

- f is continuously differentiable with respect to x (a less stringent requirement: Lipschitz)
- f is continuous with respect to u
- u(t) is piecewise continuous

See Appendix A in [1] for details.

Assume existence and uniqueness of solutions. This cannot be taken for granted:

Example 8:

\dot x(t) = x(t)^{1/3}, \quad x(0) = 0

Solutions: x(t) = 0 for all t, and x(t) = \left( \tfrac{2}{3} t \right)^{3/2}. Not unique!


Example 9:

\dot x(t) = x(t)^2, \quad x(0) = 1

Solution: x(t) = \frac{1}{1 - t}, finite escape time, x(1) = \infty. The solution does not exist on an interval that includes 1, e.g. [0, 2].

Objective: Minimize h(x(T)) + \int_0^T g(x(t), u(t))\, dt, where g and h are continuously differentiable with respect to x, and g is continuous with respect to u.

This is very similar to the discrete time problem (integrals replace sums, and \dot x = f(x, u) replaces x_{k+1} = f_k(x_k, u_k)), except for the technical assumptions.

    3.1 The Hamilton Jacobi Bellman (HJB) Equation

The HJB equation is the continuous time analog of the DP algorithm. Derive it informally by discretizing the problem and taking limits. This is not a rigorous derivation, but it does capture the main ideas.

Divide the time horizon into N pieces, define \delta = T/N,

x_k := x(k\delta), \quad u_k := u(k\delta), \quad k = 0, 1, \ldots, N

Approximate the differential equation by

\dot x(k\delta) = f(x(k\delta), u(k\delta)) \approx \frac{x_{k+1} - x_k}{\delta}, \qquad x_{k+1} = x_k + f(x_k, u_k)\, \delta

Approximate the cost function:

h(x_N) + \sum_{k=0}^{N-1} g(x_k, u_k)\, \delta

Define J^*(t, x) = optimal cost-to-go at time t and state x for the continuous problem.
Define \tilde J^*(t, x) = discrete approximation of the optimal cost-to-go.
Apply the DP algorithm:

terminal condition: \tilde J^*(N\delta, x) = h(x)

recursion: \tilde J^*(k\delta, x) = \min_{u \in U} \left[ g(x, u)\, \delta + \tilde J^*\left( (k+1)\delta, x + f(x, u)\, \delta \right) \right], \quad k = 0, \ldots, N-1


Do a Taylor expansion of \tilde J^*, because \delta \to 0:

\tilde J^*\left( (k+1)\delta, x + f(x, u)\delta \right) = \tilde J^*(k\delta, x) + \frac{\partial \tilde J^*(k\delta, x)}{\partial t}\,\delta + \left( \frac{\partial \tilde J^*(k\delta, x)}{\partial x} \right)^T f(x, u)\,\delta + o(\delta), \qquad \lim_{\delta \to 0} \frac{o(\delta)}{\delta} = 0

(little-oh notation: o(\delta) collects the terms that are quadratic or higher in \delta).

Substitute back into the DP recursion and divide by \delta:

0 = \min_{u \in U} \left[ g(x, u) + \frac{\partial \tilde J^*(k\delta, x)}{\partial t} + \left( \frac{\partial \tilde J^*(k\delta, x)}{\partial x} \right)^T f(x, u) + \frac{o(\delta)}{\delta} \right]

Now let t = k\delta, and let \delta \to 0. Assuming \tilde J^* \to J^*, we have

0 = \min_{u \in U} \left[ g(x, u) + \frac{\partial J^*(t, x)}{\partial t} + \left( \frac{\partial J^*(t, x)}{\partial x} \right)^T f(x, u) \right] \quad \text{for all } x, t \qquad (3.1)

J^*(T, x) = h(x)

The HJB equation (3.1):

- is a Partial Differential Equation, very difficult to solve;
- the u = \mu^*(t, x) that minimizes the R.H.S. of the HJB equation is the optimal policy.

Example 10: Consider the system

\dot x(t) = u(t), \quad |u(t)| \le 1

The cost is \frac{1}{2} x^2(T), i.e. only a terminal cost.

Intuitive solution: drive x towards 0,

u(t) = \mu^*(t, x) = -\mathrm{sgn}(x) = \begin{cases} -1 & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ 1 & \text{if } x < 0 \end{cases}

What is the cost-to-go associated with this policy?

V(t, x) = \frac{1}{2} \left( \max\{0, |x| - (T - t)\} \right)^2

Verify that this is indeed the cost-to-go associated with the policy outlined above.

For fixed t:


[Figure 3.1: Example 10 for fixed t: V(t, x) as a function of x, zero for |x| \le T - t and quadratic outside.]

[Figure 3.2: First derivative \partial V(t, x)/\partial x as a function of x for fixed t.]

[Figure 3.3: Example 10 for fixed x: \partial V(t, x)/\partial t as a function of t, for the cases |x| \ge T and |x| < T.]

\frac{\partial V}{\partial x}(t, x) = \mathrm{sgn}(x) \max\{0, |x| - (T - t)\}


For fixed x, \partial V/\partial t is shown in Figure 3.3.

Does V(t, x) satisfy the HJB equation?

First check: does it satisfy the boundary condition? V(T, x) = \frac{1}{2} x^2 = h(x). Yes.

Second check:

\min_{|u| \le 1} \left[ \frac{\partial V(t, x)}{\partial t} + \frac{\partial V(t, x)}{\partial x} f(x, u) \right]
= \min_{|u| \le 1} \left( 1 + \mathrm{sgn}(x)\, u \right) \max\{0, |x| - (T - t)\}
= 0 \quad \text{by choosing } u = -\mathrm{sgn}(x)

So V(t, x) satisfies the HJB equation, and V(t, x) = J^*(t, x). Furthermore u = -\mathrm{sgn}(x) is an optimal policy. It is not unique!

Note: Verifying that V(t, x) satisfies HJB is not trivial, even for this simple example. Imagine solving for it!

Another issue: the terminal cost \frac{1}{2} x^2(T) gives the same optimal policy as the cost |x(T)|, so different costs give the same optimal policy, but some costs are nicer to work with than others.

    3.2 Aside on Notation

Let F(t, x) be a continuously differentiable function. Then

1. \frac{\partial F(t, x)}{\partial t}: partial derivative of F with respect to the first argument.

2. \frac{\partial F(t, x(t))}{\partial t} = \frac{\partial F(t, x)}{\partial t} \Big|_{x = x(t)}: shorthand notation.

3. \frac{d F(t, x(t))}{d t} = \frac{\partial F(t, x(t))}{\partial t} + \frac{\partial F(t, x(t))}{\partial x}\, \dot x(t): total derivative.

Example 11: For F(t, x) = t\, x,

\frac{\partial F(t, x)}{\partial t} = x, \qquad \frac{\partial F(t, x(t))}{\partial t} = x(t), \qquad \frac{d F(t, x(t))}{d t} = x(t) + t\, \dot x(t)

Lemma 3.2.1: Let F(t, x, u) be a continuously differentiable function and let U be a convex set. Assume that \mu^*(t, x) := \arg\min_{u \in U} F(t, x, u) is continuously differentiable.


Then:

1) \frac{\partial}{\partial t} \min_{u \in U} F(t, x, u) = \frac{\partial F(t, x, \mu^*(t, x))}{\partial t} \quad \text{for all } t, x

2) \frac{\partial}{\partial x} \min_{u \in U} F(t, x, u) = \frac{\partial F(t, x, \mu^*(t, x))}{\partial x} \quad \text{for all } t, x

Example 12: Let F(t, x, u) = (1 + t)u^2 + ux + 1, t \ge 0, where U is the real line (no constraint on u). Then

\min_u F(t, x, u): \quad 2(1 + t)u + x = 0 \;\Rightarrow\; u = -\frac{x}{2(1 + t)}, \qquad \mu^*(t, x) = -\frac{x}{2(1 + t)}

\min_u F(t, x, u) = \frac{(1 + t)x^2}{4(1 + t)^2} - \frac{x^2}{2(1 + t)} + 1 = -\frac{x^2}{4(1 + t)} + 1

1) \frac{\partial}{\partial t} \min_u F(t, x, u) = \frac{x^2}{4(1 + t)^2}; \qquad \frac{\partial F(t, x, \mu^*(t, x))}{\partial t} = u^2 \Big|_{u = \mu^*(t, x)} = \frac{x^2}{4(1 + t)^2}

2) \frac{\partial}{\partial x} \min_u F(t, x, u) = -\frac{x}{2(1 + t)}; \qquad \frac{\partial F(t, x, \mu^*(t, x))}{\partial x} = u \Big|_{u = \mu^*(t, x)} = -\frac{x}{2(1 + t)}

Proof 2 (of Lemma 3.2.1 when u is unconstrained, U = \mathbb{R}^m): Let

G(t, x) = \min_{u \in U} F(t, x, u) = F(t, x, \mu^*(t, x))

Then

\frac{\partial G(t, x)}{\partial t} = \frac{\partial F(t, x, \mu^*(t, x))}{\partial t} + \underbrace{\frac{\partial F(t, x, \mu^*(t, x))}{\partial u}}_{= 0 \text{ because } \mu^*(t, x) \text{ minimizes}} \frac{\partial \mu^*(t, x)}{\partial t}

The same can be done for \frac{\partial G(t, x)}{\partial x}.


    3.2.1 The Minimum Principle

The HJB equation gives us a lot of information: the optimal cost-to-go for all times and all possible states; it also gives the optimal feedback law u = \mu^*(t, x). What if we only cared about the optimal control trajectory for a specific initial condition x(0) = x_0? Can we exploit the fact that we are asking for much less to simplify the mathematical conditions?

Starting Point: HJB

0 = \min_{u \in U} \left[ g(x, u) + \frac{\partial J^*(t, x)}{\partial t} + \left( \frac{\partial J^*(t, x)}{\partial x} \right)^T f(x, u) \right] \quad \text{for all } t, x

J^*(T, x) = h(x) \quad \text{for all } x

Let \mu^*(t, x) be the corresponding optimal strategy (feedback law). Let

F(t, x, u) = g(x, u) + \frac{\partial J^*(t, x)}{\partial t} + \left( \frac{\partial J^*(t, x)}{\partial x} \right)^T f(x, u)

So the HJB equation gives us

G(t, x) = \min_{u \in U} F(t, x, u) = 0

Apply the Lemma:

1) \frac{\partial G(t, x)}{\partial t} = 0 = \frac{\partial^2 J^*(t, x)}{\partial t^2} + \left( \frac{\partial^2 J^*(t, x)}{\partial x \partial t} \right)^T f\left( x, \mu^*(t, x) \right) \quad \text{for all } t, x

2) \frac{\partial G(t, x)}{\partial x} = 0 = \frac{\partial g\left( x, \mu^*(t, x) \right)}{\partial x} + \frac{\partial^2 J^*(t, x)}{\partial x \partial t} + \frac{\partial^2 J^*(t, x)}{\partial x^2} f\left( x, \mu^*(t, x) \right) + \left( \frac{\partial f\left( x, \mu^*(t, x) \right)}{\partial x} \right)^T \frac{\partial J^*(t, x)}{\partial x} \quad \text{for all } t, x

Consider a specific optimal trajectory:

u^*(t) = \mu^*\left( t, x^*(t) \right), \qquad \dot x^*(t) = f\left( x^*(t), u^*(t) \right), \qquad x^*(0) = x_0

Along this trajectory, the two equations above become

1) \; 0 = \frac{d}{dt} \left[ \frac{\partial J^*(t, x^*(t))}{\partial t} \right]

2) \; 0 = \frac{\partial g\left( x^*(t), u^*(t) \right)}{\partial x} + \frac{d}{dt} \left[ \frac{\partial J^*(t, x^*(t))}{\partial x} \right] + \left( \frac{\partial f\left( x^*(t), u^*(t) \right)}{\partial x} \right)^T \frac{\partial J^*(t, x^*(t))}{\partial x}


Define

p(t) = \frac{\partial J^*(t, x^*(t))}{\partial x}, \qquad p_0(t) = \frac{\partial J^*(t, x^*(t))}{\partial t}

1) \dot p_0(t) = 0, so p_0(t) = \text{constant for } 0 \le t \le T.

2) \dot p(t) = -\left( \frac{\partial f\left( x^*(t), u^*(t) \right)}{\partial x} \right)^T p(t) - \frac{\partial g\left( x^*(t), u^*(t) \right)}{\partial x}, \quad 0 \le t \le T

Since \frac{\partial J^*(T, x)}{\partial x} = \frac{\partial h(x)}{\partial x}, we get p(T) = \frac{\partial h\left( x^*(T) \right)}{\partial x}.

Put all of this together. Define the Hamiltonian H(x, u, p) = g(x, u) + p^T f(x, u). Let u^*(t) be an optimal control trajectory and x^*(t) the resulting state trajectory. Then

\dot x^*(t) = \frac{\partial H}{\partial p}\left( x^*(t), u^*(t), p(t) \right), \qquad x^*(0) = x_0

\dot p(t) = -\frac{\partial H}{\partial x}\left( x^*(t), u^*(t), p(t) \right), \qquad p(T) = \frac{\partial h\left( x^*(T) \right)}{\partial x}

u^*(t) = \arg\min_{u \in U} H\left( x^*(t), u, p(t) \right)

H\left( x^*(t), u^*(t), p(t) \right) = \text{constant for } t \in [0, T]

(H(\cdot) = constant follows from p_0(t) = constant.)

Some remarks:

- This is a set of 2n ODEs with split boundary conditions; it is not trivial to solve.
- These are necessary conditions, but not sufficient. There can be multiple solutions, and not all of them may be optimal.
- If f(x, u) is linear, U is convex, and h and g are convex, then the conditions are necessary and sufficient.

Example 13 (Resource Allocation): Some robots are sent to Mars to build habitats for a later exploration by humans.

x(t): number of reconfigurable robots, which can build either habitats or more robots.

x(0) is given: the number of robots that arrive on Mars.

\dot x(t) = u(t)\, x(t), \qquad x(0) = x_0
\dot y(t) = \left( 1 - u(t) \right) x(t), \qquad y(0) = 0
0 \le u(t) \le 1


Objective: given the terminal time T, find the control input u(t) that maximizes y(T), the number of habitats built.

Note: y(T) = \int_0^T \left( 1 - u(t) \right) x(t)\, dt.

Solution (we work directly with the maximization; the minimum principle applies with min replaced by max):

g(x, u) = (1 - u)\, x, \qquad f(x, u) = u\, x

H(x, u, p) = (1 - u)\, x + p\, u\, x

\dot p(t) = -\frac{\partial H\left( x^*(t), u^*(t), p(t) \right)}{\partial x} = -1 + u^*(t) - p(t)\, u^*(t)

p(T) = 0 \quad (\text{since } h(x) \equiv 0)

u^*(t) = \arg\max_{0 \le u \le 1} \left( x^*(t) + \left( p(t)\, x^*(t) - x^*(t) \right) u \right)

which gives

u^* = 0 \text{ if } p(t) < 1, \qquad u^* = 1 \text{ if } p(t) > 1

Since p(T) = 0, for t close to T we have u^*(t) = 0 and therefore \dot p(t) = -1. Therefore at time t = T - 1 we have p(t) = 1, and that is where the switch occurs:

\dot p(t) = -p(t), \quad 0 \le t \le T - 1, \qquad p(T - 1) = 1 \;\Rightarrow\; p(t) = \exp(-t)\exp(T - 1), \quad 0 \le t \le T - 1

Conclusion:

u^*(t) = \begin{cases} 1 & 0 \le t \le T - 1 \\ 0 & T - 1 \le t \le T \end{cases}

How to use this in practice:

1. If you can solve HJB, you get a feedback law u = \mu^*(x). Very convenient, just a controller: measure the state and apply the control input.
2. Solve for the optimal trajectory and use a feedback law (probably linear) to keep you on that trajectory.
3. Solve for the optimal trajectory online after measuring the state. Do this often.


[Figure 3.4: Different approaches to find a solution. The HJB equation yields optimal solutions (hard to show rigorously; viscosity solutions, calculus of variations); the Minimum Principle follows from HJB without too much difficulty (easy to show, in the book), but it also admits non-optimal solutions (local minima).]

    3.3 Extensions

(We drop the x^*(t), J^* notation for simplicity.)

3.3.1 Fixed Terminal State

Consider the case where x(T) is given. Clearly there is then no need for a terminal cost.

Recall the co-state p(t) = \frac{\partial J(t, x(t))}{\partial x} and p(T) = \lim_{t \to T} \frac{\partial J(t, x(t))}{\partial x}; we can no longer use h, the terminal cost, to constrain p(T). But we don't need constraints on p:

\dot x(t) = f(x(t), u(t)), \quad x(0) = x_0, \; x(T) = x_T \qquad \text{(2n ODEs)}

\dot p(t) = -\frac{\partial H}{\partial x}(x(t), u(t), p(t)) \qquad \text{(2n boundary conditions)}

Example 14:

\dot x(t) = u(t), \quad x(0) = 0, \; x(1) = 1

g(x, u) = \frac{1}{2}(x^2 + u^2), \qquad \text{cost} = \frac{1}{2} \int_0^1 \left( x^2(t) + u^2(t) \right) dt


Hamiltonian: H(x, u, p) = \frac{1}{2}(x^2 + u^2) + p\, u

We get

\dot x(t) = u(t)
\dot p(t) = -x(t)
u(t) = \arg\min_u \left[ \frac{1}{2}\left( x^2(t) + u^2 \right) + p(t)\, u \right]

therefore u(t) = -p(t), so \dot x(t) = -p(t), \dot p(t) = -x(t), and \ddot x(t) = x(t):

x(t) = A \cosh(t) + B \sinh(t)

x(0) = 0 \Rightarrow A = 0, \qquad x(1) = 1 \Rightarrow B = \frac{1}{\sinh(1)}

x(t) = \frac{\sinh(t)}{\sinh(1)} = \frac{e^t - e^{-t}}{e^1 - e^{-1}}

Exercise: show that the Hamiltonian is constant along this trajectory.
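The two-point boundary value problem can also be checked numerically. A sketch using SciPy's boundary value solver (the numerical check itself is my addition; the ODEs and the analytic solution sinh(t)/sinh(1) are from the example above):

import numpy as np
from scipy.integrate import solve_bvp

def odes(t, y):                 # y[0] = x, y[1] = p;  x' = -p, p' = -x
    return np.vstack((-y[1], -y[0]))

def bc(y0, y1):                 # boundary conditions x(0) = 0, x(1) = 1
    return np.array([y0[0] - 0.0, y1[0] - 1.0])

t = np.linspace(0, 1, 50)
sol = solve_bvp(odes, bc, t, np.zeros((2, t.size)))
assert np.allclose(sol.sol(t)[0], np.sinh(t) / np.sinh(1), atol=1e-3)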

    3.3.2 Free initial state, with cost

x(0) is not fixed, but there is an initial cost l(x(0)). One can show that the resulting condition is p(0) = -\frac{\partial l(x(0))}{\partial x}.

Example 15:

\dot x(t) = u(t), \quad x(1) = 1, \; x(0) \text{ is free}

g(x, u) = \frac{1}{2}(x^2 + u^2), \qquad l(x) = 0 \text{ (no initial cost, given)}

Apply the Minimum Principle as before:

\dot x(t) = u(t), \quad \dot p(t) = -x(t), \quad u(t) = -p(t) \;\Rightarrow\; \ddot x(t) = x(t), \quad x(t) = A \cosh(t) + B \sinh(t)

\dot x(0) = u(0) = -p(0) = 0 \Rightarrow B = 0

x(t) = \frac{\cosh(t)}{\cosh(1)} = \frac{e^t + e^{-t}}{e^1 + e^{-1}}, \qquad x(0) \approx 0.65


Free Terminal Time. Result: the Hamiltonian is = 0 on the optimal trajectory. We gain an extra degree of freedom in choosing T, and we lose a degree of freedom because H \equiv 0.

Time Varying System and Cost. What happens if f = f(x, u, t), g = g(x, u, t)? Result: everything stays the same, except that the Hamiltonian is no longer constant along the trajectory. Hint: augment the state with time, \dot \tau = 1 and \dot x = f(x, u, \tau).

Singular Problems. Motivate via an example.

Tracking problem: z(t) = 1 - t^2, 0 \le t \le 1. Minimize \frac{1}{2} \int_0^1 \left( x(t) - z(t) \right)^2 dt subject to |\dot x(t)| \le 1.

Apply the Minimum Principle:

\dot x(t) = u(t), \quad |u(t)| \le 1, \quad x(0), x(1) \text{ are free}

g(x, u, t) = \frac{1}{2}\left( x - z(t) \right)^2

H(x, u, p, t) = \frac{1}{2}\left( x - z(t) \right)^2 + p\, u

Co-state equation:

\dot p(t) = -\left( x(t) - z(t) \right), \quad p(0) = 0, \; p(1) = 0

Optimal u:

u(t) = \arg\min_{|u| \le 1} H(x(t), u, p(t), t)

u(t) = \begin{cases} -1 & \text{if } p(t) > 0 \\ +1 & \text{if } p(t) < 0 \\ ? & \text{if } p(t) = 0 \end{cases}

The problem is singular if the Hamiltonian is not a function of u over a non-trivial time interval.

Try the following:

p(t) = 0 \text{ for } 0 \le t \le T, \quad T \text{ to be determined}

Then

\dot p(t) = 0 \text{ for } 0 \le t \le T \;\Rightarrow\; x(t) = z(t) \text{ for } 0 \le t \le T \;\Rightarrow\; \dot x(t) = \dot z(t) = -2t = u(t)


    One guess: pick T = 1/2. This can't be the solution: for t > 1/2,

    x(t) - z(t) > 0  ⇒  ṗ < 0  ⇒  p(1) < 0,

    so the constraint p(1) = 0 can't be satisfied.

    Explore this instead: switch before T = 1/2.

    x(t) = z(t)                                             0 ≤ t ≤ T < 1/2
    ẋ(t) = -1  ⇒  x(t) = z(T) - (t - T) = 1 - T² - t + T    T < t ≤ 1

    ṗ(t) = -(x(t) - z(t)) = -(1 - T² - t + T - 1 + t²) = T² - T + t - t²    T < t ≤ 1

    p(1) = ∫_T^1 (T² - T + t - t²) dt
         = T² - T³ - T + T² - 1/3 + T³/3 + 1/2 - T²/2 = 0

    Simplifying (multiply by 6):

    0 = -4T³ + 9T² - 6T + 1 = (T - 1)(T - 1)(1 - 4T)  ⇒  T = 1 or T = 1/4

    T = 1/4 satisfies all the constraints and we are done!

    One can easily verify that p(t) > 0 for 1/4 < t < 1, giving u(t) = -1 as required.
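    A quick numerical confirmation (our addition, not part of the original notes) that T = 1/4 works: integrate ṗ from T to t in closed form and check the sign of p and the value of p(1).

        import numpy as np

        # Singular-arc check: with T = 1/4, p(t) = integral of (T^2 - T + s - s^2)
        # from T to t; we expect p(1) = 0 and p(t) > 0 on (T, 1), so u = -1 there.
        T = 0.25
        t = np.linspace(T, 1.0, 10001)
        p = (T**2 - T) * (t - T) + (t**2 - T**2) / 2.0 - (t**3 - T**3) / 3.0

        print(p[-1])                            # ~0: p(1) = 0 holds
        print(p[1:-1].min() > 0)                # True: p > 0 strictly inside (1/4, 1)
        print(np.polyval([-4, 9, -6, 1], T))    # 0: T = 1/4 solves -4T^3 + 9T^2 - 6T + 1 = 0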

    3.4 Linear Systems and Quadratic Costs

    Look at the infinite horizon, LTI (linear time invariant) system:

    x_{k+1} = A x_k + B u_k,   k = 0, 1, ...

    cost = ∑_{k=0}^{∞} ( x_k^T Q x_k + u_k^T R u_k ),   R > 0, Q ≥ 0, R = R^T, Q = Q^T

    Informally, the cost to go is time invariant: it only depends on the state and not on when we get there.

    J(x) = min_u [ x^T Q x + u^T R u + J(Ax + Bu) ]


    Conjecture that the optimal cost to go is quadratic in x: J(x) = x^T K x, where K = K^T, K ≥ 0. Then

    x^T K x = x^T Q x + x^T A^T K A x + min_u [ u^T R u + u^T B^T K B u + x^T A^T K B u + u^T B^T K A x ]

    Since R > 0 and B^T K B ≥ 0, we have R + B^T K B > 0. Setting the derivative with respect to u to zero:

    2 (R + B^T K B) u + 2 B^T K A x = 0
    u = -(R + B^T K B)^{-1} B^T K A x

    Substitute back in. All terms are of the form x^T ( · ) x, therefore we must have:

    K = Q + A^T K A + A^T K B (R + B^T K B)^{-1} (R + B^T K B) (R + B^T K B)^{-1} B^T K A
          - 2 A^T K B (R + B^T K B)^{-1} B^T K A

    K = A^T ( K - K B (R + B^T K B)^{-1} B^T K ) A + Q,    K ≥ 0

    Summary

    Optimal cost to go: J(x) = x^T K x

    Optimal feedback strategy: u = F x,   F = -(R + B^T K B)^{-1} B^T K A

    Questions

    1. Can we always solve for K?

    2. Is the closed loop system x_{k+1} = (A + BF) x_k stable?

    Example 16:

    x_{k+1} = 2 x_k + 0·u_k

    cost = ∑_{k=0}^{∞} ( x_k² + u_k² )

    A = 2, B = 0, Q = 1, R = 1

    Solve for K:

    K = 4 (K - 0) + 1  ⇒  -3K = 1  ⇒  K = -1/3


    K = -1/3 does not satisfy the K ≥ 0 constraint. There is no solution to this problem: the cost is infinite. The problem with this example is that (A, B) is not stabilizable.

    Stabilizable: one can find a matrix F such that A + BF is stable, i.e. ρ(A + BF) < 1 (the eigenvalues of A + BF have magnitude < 1).

    Example 17:

    x_{k+1} = 0.5 x_k + 0·u_k

    cost = ∑_{k=0}^{∞} ( x_k² + u_k² )

    A = 0.5, B = 0, Q = 1, R = 1

    Solve for K:

    K = 0.25 K + 1  ⇒  K = 4/3

    Cost to go = (4/3) x_k².  F = 0.

    Example 18:

    x_{k+1} = 2 x_k + u_k

    cost = ∑_{k=0}^{∞} ( x_k² + u_k² )

    A = 2, B = 1, Q = 1, R = 1

    (A, B) is stabilizable. Solve for K:

    K = 4 ( K - K²(1 + K)^{-1} ) + 1
    0 = K² - 4K - 1
    K = (4 ± √20)/2 = 2 ± √5

    Pick K = 2 + √5 ≈ 4.236 and solve for F:

    F = -(1 + K)^{-1} · 2K = -2K/(1 + K) ≈ -1.618

    A + BF = 2 - 1.618 = 0.382 is stable. Our optimizing strategy stabilizes the system as expected.
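    As a numerical cross-check (our addition, not part of the original notes), the scalar DARE of this example can be solved by simple fixed-point iteration, the same backward-iteration idea used later for the general case:

        import numpy as np

        # Example 18 check: iterate K <- A*(K - K*B*(R + B*K*B)^(-1)*B*K)*A + Q.
        A, B, Q, R = 2.0, 1.0, 1.0, 1.0
        K = 0.0
        for _ in range(200):
            K = A * (K - K * B * B * K / (R + B * K * B)) * A + Q

        F = -B * K * A / (R + B * K * B)
        print(K, 2 + np.sqrt(5))     # both ~4.236
        print(F)                     # ~-1.618
        print(abs(A + B * F) < 1)    # True: the closed loop is stable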


    Example 19:

    x_{k+1} = 2 x_k + u_k

    cost = ∑_{k=0}^{∞} u_k²

    A = 2, B = 1, Q = 0, R = 1

    Solve for K:

    K = 4 ( K - K²(1 + K)^{-1} )
    K = {0, 3}

    K = 0 is clearly the optimal thing to do, but it leads to an unstable system. K = 3, however, while not being optimal, leads to F = -(1 + 3)^{-1} · 3 · 2 = -1.5, A + BF = 0.5, which is stable.

    Modify the cost to ∑_{k=0}^{∞} ( u_k² + ε x_k² ),   ε > 0, ε ≪ 1.

    A = 2, B = 1, Q = ε, R = 1

    Solve for K:

    K(ε) = {3 + 4ε/3, -ε/3}   (to first order in ε)

    The K = 3 solution is the limiting case as we put an arbitrarily small cost on the state, which would otherwise grow unbounded.

    Example 20:

    x_{k+1} = 0.5 x_k + u_k

    cost = ∑_{k=0}^{∞} u_k²

    A = 0.5, B = 1, Q = 0, R = 1

    Solve for K:

    K = {0, -0.75}

    Here K = 0 makes perfect sense: it gives the optimal strategy u = 0 and a stable closed loop system.

    If Q = ε:

    K(ε) ≈ {4ε/3, -0.75 - ε/3}

    K = 0 is a well behaved solution for ε = 0.
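    The first-order expressions for K(ε) in Examples 19 and 20 are easy to check numerically (our addition, not part of the original notes). With B = 1 and R = 1 the scalar DARE K = A² K/(1 + K) + Q reduces to the quadratic K² + (1 - A² - Q)K - Q = 0:

        import numpy as np

        # Roots of the scalar DARE quadratic, compared with the first-order formulas.
        def dare_roots(A, Q):
            return np.sort(np.roots([1.0, 1.0 - A**2 - Q, -Q]))

        eps = 1e-3
        print(dare_roots(2.0, eps))                 # Example 19
        print(np.sort([3 + 4*eps/3, -eps/3]))       # ~[-eps/3, 3 + 4*eps/3]
        print(dare_roots(0.5, eps))                 # Example 20
        print(np.sort([4*eps/3, -0.75 - eps/3]))    # ~[-0.75 - eps/3, 4*eps/3]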


    Need Concept of Detectability   Let Q be decomposed as Q = C^T C (one can always do this). We need a detectability assumption: (A, C) is detectable if there exists an L such that A + LC is stable. Then C x_k → 0 implies x_k → 0, and C x_k → 0 is equivalent to x_k^T Q x_k → 0.

    3.4.1 Summary

    Given

    x_{k+1} = A x_k + B u_k,   k = 0, 1, ...

    cost = ∑_{k=0}^{∞} ( x_k^T Q x_k + u_k^T R u_k ),   Q ≥ 0, R > 0,

    with (A, B) stabilizable and (A, C) detectable, where C is any matrix that satisfies C^T C = Q. Then:

    1. There is a unique (positive semidefinite) solution K to the D.A.R.E.

    2. The optimal cost to go is J(x) = x^T K x.

    3. The optimal feedback strategy is u = F x.

    4. The closed loop system is stable.

    3.5 General Problem Formulation

    Finite horizon, time varying, with disturbances:

    x_{k+1} = A_k x_k + B_k u_k + w_k,   k = 0, ..., N - 1

    E(w_k) = 0,   E(w_k w_k^T) finite

    Cost:

    E[ x_N^T Q_N x_N + ∑_{k=0}^{N-1} ( x_k^T Q_k x_k + u_k^T R_k u_k ) ]

    Q_k = Q_k^T ≥ 0   (eigenvalues ≥ 0)
    R_k = R_k^T > 0


    Apply DP to solve the problem:

    J_N(x_N) = x_N^T Q_N x_N

    J_k(x_k) = min_{u_k} E[ x_k^T Q_k x_k + u_k^T R_k u_k + J_{k+1}(A_k x_k + B_k u_k + w_k) ]

    Let's do the first step of this recursion, or equivalently take N = 1:

    J_0(x_0) = min_{u_0} E[ x_0^T Q_0 x_0 + u_0^T R_0 u_0 + (A_0 x_0 + B_0 u_0 + w_0)^T Q_1 (A_0 x_0 + B_0 u_0 + w_0) ]

    Consider the last term:

    E[ (A_0 x_0 + B_0 u_0)^T Q_1 (A_0 x_0 + B_0 u_0) + 2 (A_0 x_0 + B_0 u_0)^T Q_1 w_0 + w_0^T Q_1 w_0 ]
      = (A_0 x_0 + B_0 u_0)^T Q_1 (A_0 x_0 + B_0 u_0) + E( w_0^T Q_1 w_0 ),

    since E(w_0) = 0. Therefore

    J_0(x_0) = min_{u_0} [ x_0^T Q_0 x_0 + u_0^T R_0 u_0 + (A_0 x_0 + B_0 u_0)^T Q_1 (A_0 x_0 + B_0 u_0) + E( w_0^T Q_1 w_0 ) ]

    The strategy is the same as if there were no noise (although the noise does give a different cost). This is certainty equivalence (it works only for some problems, not all).

    Solve for the minimizing u_0: differentiate and set to 0:

    2 R_0 u_0 + 2 B_0^T Q_1 B_0 u_0 + 2 B_0^T Q_1 A_0 x_0 = 0

    u_0 = -(R_0 + B_0^T Q_1 B_0)^{-1} B_0^T Q_1 A_0 x_0 =: F_0 x_0

    Optimal feedback strategy is a linear function of the state.

    Substitute back and solve for J_0(x_0):

    J_0(x_0) = x_0^T Q_0 x_0 + u_0^T (R_0 + B_0^T Q_1 B_0) u_0 + x_0^T A_0^T Q_1 A_0 x_0 + 2 x_0^T A_0^T Q_1 B_0 u_0 + E( w_0^T Q_1 w_0 )
             = x_0^T K_0 x_0 + E( w_0^T Q_1 w_0 )

    K_0 = Q_0 + A_0^T ( K_1 - K_1 B_0 (R_0 + B_0^T K_1 B_0)^{-1} B_0^T K_1 ) A_0
    K_1 = Q_1

    The cost at k = 1 is quadratic in x_1; at k = 0 it is quadratic in x_0 plus a constant.

    This can be extended to any horizon, giving the Discrete Riccati Equation (DRE):

    K_k = Q_k + A_k^T ( K_{k+1} - K_{k+1} B_k (R_k + B_k^T K_{k+1} B_k)^{-1} B_k^T K_{k+1} ) A_k
    K_N = Q_N

    Feedback law:

    u_k = F_k x_k,   F_k = -(R_k + B_k^T K_{k+1} B_k)^{-1} B_k^T K_{k+1} A_k


    Cost:

    J_k(x_k) = x_k^T K_k x_k + ∑_{j=k}^{N-1} E( w_j^T K_{j+1} w_j )

    With no noise, a time invariant system, and an infinite horizon (N → ∞), we recover the previous results and the DARE. In fact, the above iterative method is one way to solve the DARE: iterate backwards until it converges. ([1] has the proof of convergence; it is not trivial.)
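    A minimal sketch of that backward iteration (our addition; the 2-state system below is illustrative and not from the notes):

        import numpy as np

        # Solve the DARE by iterating K <- Q + A'(K - K B (R + B'K B)^(-1) B'K) A
        # until it converges, starting from the terminal cost K = Q.
        def dare_by_iteration(A, B, Q, R, iters=10000, tol=1e-12):
            K = Q.copy()
            for _ in range(iters):
                S = np.linalg.inv(R + B.T @ K @ B)
                K_next = Q + A.T @ (K - K @ B @ S @ B.T @ K) @ A
                if np.max(np.abs(K_next - K)) < tol:
                    return K_next
                K = K_next
            return K

        A = np.array([[1.2, 0.5], [0.0, 0.8]])   # illustrative stabilizable pair
        B = np.array([[0.0], [1.0]])
        Q = np.eye(2)                            # detectable since Q > 0
        R = np.array([[1.0]])

        K = dare_by_iteration(A, B, Q, R)
        F = -np.linalg.inv(R + B.T @ K @ B) @ B.T @ K @ A
        print(K)
        print(np.abs(np.linalg.eigvals(A + B @ F)))   # all < 1: closed loop stable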

    Time invariant, infinite horizon with noise: the cost goes to infinity. Approach: divide the cost by N and let N → ∞; then cost = E(w^T K w).

    Example 21: Given the system

    z̈(t) = u(t)

    Objective: apply a force to move a mass from any starting point to z = 0, ż = 0. Implement on a computer that can only update information once per second.

    1. Discretize the problem:

       ż(t) = ż(0) + u(0) t                            0 ≤ t < 1
       z(t) = z(0) + ż(0) t + (1/2) u(0) t²            0 ≤ t < 1

       Let x1(k) = z(k), x2(k) = ż(k). Then

       x(k + 1) = A x(k) + B u(k),   k = 0, 1, ...

       A = [1 1; 0 1],   B = [0.5; 1]

    2. Cost = ∑_{k=0}^{∞} ( x1²(k) + u²(k) ). Therefore

       Q = [1 0; 0 0],   R = 1

    3. Is the system stabilizable? Can we make A + BF stable for some F?

       Yes: F = [-1  -1.5] makes the eigenvalues of A + BF equal to 0.

    4. Q can be decomposed as follows:

       Q = [1 0; 0 0] = [1; 0] [1 0]


       Therefore

       C = [1 0],   Q = C^T C

       Is (A, C) detectable? Yes: L = [-2; -1] makes the eigenvalues of A + LC equal to 0.

    5. Solve the DARE. Use the MATLAB command dare (a SciPy version is sketched after this list).

       K = [2 1; 1 1.5]

       Optimal feedback matrix:

       F = [-0.5  -1.0]

    6. Physical interpretation: a spring and a damper. The spring has coefficient 0.5, the damper has coefficient 1.0.
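    The numbers above can be reproduced as follows (our addition; the notes use the MATLAB command dare, here we use the SciPy equivalent solve_discrete_are):

        import numpy as np
        from scipy.linalg import solve_discrete_are

        # Example 21: double integrator discretized with a 1 s update rate.
        A = np.array([[1.0, 1.0], [0.0, 1.0]])
        B = np.array([[0.5], [1.0]])
        Q = np.array([[1.0, 0.0], [0.0, 0.0]])
        R = np.array([[1.0]])

        K = solve_discrete_are(A, B, Q, R)
        F = -np.linalg.inv(R + B.T @ K @ B) @ B.T @ K @ A

        print(K)                                      # [[2, 1], [1, 1.5]]
        print(F)                                      # [[-0.5, -1.0]]
        print(np.abs(np.linalg.eigvals(A + B @ F)))   # both 0.5: closed loop stable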


    Bibliography

    [1] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, volume I. Athena Scientific, 3rd edition, 2005.