TRANSCRIPT
-
4SC000 Q2 2017-2018
Optimal Control and Dynamic Programming
Duarte Antunes
-
Part III: Continuous-time optimal control problems
-
Recap
1
                        Discrete optimization problems      | Stage decision problems
Formulation:            transition diagram                  | dynamic system & additive cost function
DP algorithm:           graphical DP algorithm & DP equation| DP equation
Partial information:                                        | Bayesian inference & decisions based on prob. distribution; Kalman filter and separation principle
Alternative algorithms: Dijkstra's algorithm                | static optimization
-
2
Goals of part III

Introduce optimal control concepts for continuous-time optimal control problems:

                        Discrete optimization problems      | Stage decision problems                  | Continuous-time control problems
Formulation:            transition diagram                  | discrete-time system & additive cost     | differential equations & additive cost function
DP algorithm:           graphical DP algorithm & DP equation| DP equation                              | Hamilton-Jacobi-Bellman equation
Partial information:                                        | Bayesian inference & decisions based on prob. distribution; Kalman filter and separation principle | continuous-time Kalman filter and separation principle
Alternative algorithms: Dijkstra's algorithm                | static optimization                      | Pontryagin's maximum principle

And analyze frequency-domain properties of continuous-time LQR/LQG.
-
Outline
• Problem formulation and approach
• Hamilton Jacobi Bellman equation
• Linear quadratic regulator
-
3
Continuous-time optimal control problems

Dynamic model
ẋ(t) = f(x(t), u(t)), x(0) = x_0, t ∈ [0, T]

Cost function
∫_0^T g(x(t), u(t)) dt + g_T(x(T))

The goal is to find an optimal path and an optimal policy.

Assumptions
• x(t) ∈ R^n and u(t) ∈ U ⊆ R^m
• The differential equation has a unique solution in t ∈ [0, T]
• We assume that f, g do not explicitly depend on time, for simplicity; we could also consider f(t, x(t), u(t)), g(t, x(t), u(t))
-
4
Optimal path

• A path (u(t), x(t)), t ∈ [0, T], consists of a control input u(t) and a corresponding solution x(t) to the differential equation
ẋ(t) = f(x(t), u(t)), x(0) = x_0, t ∈ [0, T]
• A path is said to be optimal if there is no other path with a smaller cost
∫_0^T g(x(t), u(t)) dt + g_T(x(T))
• Choosing the control input can be seen as making decisions in infinitesimal time intervals which shape the derivative of the state (and thus determine its evolution up to the final state x(T) at t = T)
-
5
Optimal policy

• A policy is a function μ which maps states into actions at every time:
u(t) = μ(t, x(t)), t ∈ [0, T]
• A policy μ is said to be optimal if, for every state x(t) = x̄ at every time t, the cost
∫_t^T g(x(s), μ(s, x(s))) ds + g_T(x(T))
coincides with the cost of the optimal path to the problem
minimize ∫_t^T g(x(s), u(s)) ds + g_T(x(T)) subject to ẋ(s) = f(x(s), u(s)), x(t) = x̄, s ∈ [t, T]
• We denote the cost of the latter problem by J(t, x̄), the optimal cost-to-go
-
6
Approach

• Dynamic programming (DP) will allow us to compute optimal policies and optimal paths, and Pontryagin's maximum principle (PMP) will allow us to compute optimal paths.
• However, obtaining these results in continuous time (CT) is mathematically involved.
• To gain intuition, in both cases we will first discretize the problem as a function of the discretization step (previously the sampling period), apply DP, and take the limit as the discretization step converges to zero.

[diagram: CT control problem → (discretization, step τ) → stage decision problem → DT DP → optimal path and policy; taking the limit τ → 0 turns DT DP into CT DP, which yields the optimal path and policy of the CT control problem]
-
7
Example

How to charge the capacitor in an RC circuit with minimum energy loss in the resistor?

[figure: circuit with source voltage u, resistor R carrying current i, and capacitor C with voltage x]

Dynamic model: ẋ(t) = (1/(RC))(u(t) - x(t))

Problem: min_{u(t)} ∫_0^T (x(t) - u(t))^2 / R dt subject to x(0) = 0 and x(T) = x_desired.

Let us consider R = C = T = x_desired = 1.
-
8
Discretization

Discretization times t_k = kτ, k ∈ {0, ..., h}, with hτ = T; discretization step τ.

Dynamic model: for t ∈ [t_k, t_{k+1}), with x_k := x(t_k) and u_k := u(t_k),
x(t) = e^{-(t - t_k)} x_k + (1 - e^{-(t - t_k)}) u_k
and in particular
x_{k+1} = e^{-τ} x_k + (1 - e^{-τ}) u_k.

Cost function:
∫_0^1 (x(t) - u(t))^2 dt = Σ_{k=0}^{h-1} ∫_{t_k}^{t_{k+1}} (e^{-(t - t_k)} x_k + (1 - e^{-(t - t_k)}) u_k - u_k)^2 dt
= Σ_{k=0}^{h-1} ∫_{t_k}^{t_{k+1}} e^{-2(t - t_k)} dt (x_k - u_k)^2
= Σ_{k=0}^{h-1} (1 - e^{-2τ})/2 (x_k - u_k)^2
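As a sanity check (not on the slides), the exact-discretization formulas above can be verified numerically by integrating ẋ = u - x over one step with a constant input; the values of x_k and u_k below are arbitrary test data:

```python
import numpy as np

tau = 0.2
a, b = np.exp(-tau), 1 - np.exp(-tau)     # x_{k+1} = a*x_k + b*u_k
c = (1 - np.exp(-2 * tau)) / 2            # stage-cost weight of (x_k - u_k)^2

# simulate xdot = u - x over one step [0, tau] with constant input u_k
xk, uk = 0.3, 1.2
n = 100000
dt = tau / n
x, cost = xk, 0.0
for _ in range(n):
    cost += (x - uk) ** 2 * dt            # accumulate the integral of (x - u)^2
    x += dt * (uk - x)                    # forward Euler on xdot = u - x

# compare with the closed-form discretization
x_exact = a * xk + b * uk
cost_exact = c * (xk - uk) ** 2
```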
-
9
From terminal constraint to terminal cost

The framework of stage decision problems does not take terminal constraints into account. Thus we apply a trick: we consider that a final control input u(1) is applied at the terminal time, setting the state to the desired terminal value after Δ seconds, x(1 + Δ) = 1.

[figure: x(t) driven from x(1) to 1 during the interval [1, 1 + Δ]]

Since x(1 + Δ) = e^{-Δ} x(1) + (1 - e^{-Δ}) u(1), this terminal control input is given by
u(1) = (1 - e^{-Δ} x(1)) / (1 - e^{-Δ}).
-
10
From terminal constraint to terminal cost

The following cost approximates the original one that we are interested in:
∫_0^{1+Δ} (x(t) - u(t))^2 dt = ∫_0^1 (x(t) - u(t))^2 dt + ∫_1^{1+Δ} (x(t) - u(t))^2 dt
= (Σ_{k=0}^{h-1} (1 - e^{-2τ})/2 (x_k - u_k)^2) + (1 - e^{-2Δ})/2 (x_h - u_h)^2
= (Σ_{k=0}^{h-1} (1 - e^{-2τ})/2 (x_k - u_k)^2) + σ(Δ)(x_h - 1)^2

where, substituting the terminal input u_h = (1 - e^{-Δ} x_h)/(1 - e^{-Δ}), the terminal cost weight is
σ(Δ) = (1 - e^{-2Δ}) / (2(1 - e^{-Δ})^2).

Note that σ(Δ) → ∞ as Δ → 0, but σ(Δ)(x_h - 1)^2 → 0 if x_h → 1.
-
11
Dynamic programming

Applying DP:
J_h(x_h) = σ(Δ)(x_h - 1)^2
J_k(x_k) = min_{u_k} (1 - e^{-2τ})/2 (x_k - u_k)^2 + J_{k+1}(e^{-τ} x_k + (1 - e^{-τ}) u_k)

This results in (obtained from Riccati equations)
u_k = K_k x_k + α_k
J_k(x_k) = θ_k x_k^2 + β_k x_k + γ_k

Example: τ = 0.2, Δ = 0.01
[figure: resulting x(t) and u(t) plotted over t ∈ [0, 1]]
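The backward recursion above can be sketched in a few lines (a hedged illustration; τ = 0.01 and Δ = 0.001 are chosen smaller than the slide's example so the result is near the continuous limit). Since each J_k is a quadratic, it is represented by three coefficients, recovered here by evaluating the minimized cost at three sample points instead of deriving the Riccati recursion by hand:

```python
import numpy as np

tau, Delta, T = 0.01, 0.001, 1.0
h = round(T / tau)
a, b = np.exp(-tau), 1 - np.exp(-tau)
c = (1 - np.exp(-2 * tau)) / 2
sigma = (1 - np.exp(-2 * Delta)) / (2 * (1 - np.exp(-Delta)) ** 2)

# J_k(x) = theta*x^2 + beta*x + gamma; terminal cost sigma*(x - 1)^2
coeffs = [None] * (h + 1)
coeffs[h] = (sigma, -2 * sigma, sigma)

def u_star(x, theta, beta):
    # closed-form minimizer of c*(x-u)^2 + theta*(a*x + b*u)^2 + beta*(a*x + b*u)
    return ((2 * c - 2 * theta * a * b) * x - beta * b) / (2 * c + 2 * theta * b * b)

for k in range(h - 1, -1, -1):
    th, be, ga = coeffs[k + 1]
    xs = np.array([0.0, 1.0, 2.0])          # fit the quadratic J_k through 3 points
    us = u_star(xs, th, be)
    ys = a * xs + b * us
    vals = c * (xs - us) ** 2 + th * ys ** 2 + be * ys + ga
    th_k = (vals[2] - 2 * vals[1] + vals[0]) / 2
    be_k = vals[1] - vals[0] - th_k
    coeffs[k] = (th_k, be_k, vals[0])

# roll the optimal policy forward from x(0) = 0
x, traj = 0.0, []
for k in range(h):
    th, be, _ = coeffs[k + 1]
    u = u_star(x, th, be)
    traj.append((k * tau, x, u))
    x = a * x + b * u
```

With τ and Δ this small, the trajectory is already close to the limit x(t) = t, u(t) = 1 + t claimed on the next slide.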
-
12
Taking the limit τ → 0

Seems to be converging to u(t) = 1 + t, x(t) = t. Later we will prove this.

[figure: x(t) and u(t) over t ∈ [0, 1] for smaller values of the parameters, τ ∈ {0.05, 0.01} and Δ ∈ {0.01, 0.001}]
-
13
Static optimization

min_{u_0, ..., u_{h-1}} Σ_{k=0}^{h-1} (1 - e^{-2τ})/2 (x_k - u_k)^2
s.t. x_{k+1} = e^{-τ} x_k + (1 - e^{-τ}) u_k, x_0 = 0, x_h = 1

This is a static optimization problem, which can handle constraints.

Lagrangian:
L(x_1, u_0, λ_1, ..., x_{h-1}, u_{h-1}, λ_h) = Σ_{k=0}^{h-1} (1 - e^{-2τ})/2 (x_k - u_k)^2 + Σ_{k=0}^{h-1} λ_{k+1}(e^{-τ} x_k + (1 - e^{-τ}) u_k - x_{k+1})

The necessary optimality conditions amount to solving a linear system:
∂L/∂x_k = 0: λ_k = (1 - e^{-2τ})(x_k - u_k) + λ_{k+1} e^{-τ}, k ∈ {1, ..., h-1}
∂L/∂u_k = 0: 0 = -(1 - e^{-2τ})(x_k - u_k) + λ_{k+1}(1 - e^{-τ}), k ∈ {0, ..., h-1}
∂L/∂λ_{k+1} = 0: x_{k+1} = e^{-τ} x_k + (1 - e^{-τ}) u_k, k ∈ {0, ..., h-1}
together with the boundary conditions x_0 = 0, x_h = 1.
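The linear system above can be assembled and solved directly; this is a sketch with numpy (the stacking order of the unknowns x_1..x_{h-1}, u_0..u_{h-1}, λ_1..λ_h is my own choice):

```python
import numpy as np

tau, T = 0.01, 1.0
h = round(T / tau)
a, b, c2 = np.exp(-tau), 1 - np.exp(-tau), 1 - np.exp(-2 * tau)

# unknowns stacked as z = [x_1..x_{h-1}, u_0..u_{h-1}, lam_1..lam_h]
n = 3 * h - 1
ix = lambda k: k - 1                  # x_k,   k = 1..h-1
iu = lambda k: (h - 1) + k            # u_k,   k = 0..h-1
il = lambda k: (2 * h - 1) + (k - 1)  # lam_k, k = 1..h

A, rhs = np.zeros((n, n)), np.zeros(n)
row = 0
# dL/dx_k = 0:  lam_k = c2*(x_k - u_k) + a*lam_{k+1},  k = 1..h-1
for k in range(1, h):
    A[row, ix(k)], A[row, iu(k)] = c2, -c2
    A[row, il(k)], A[row, il(k + 1)] = -1.0, a
    row += 1
# dL/du_k = 0:  0 = -c2*(x_k - u_k) + b*lam_{k+1},  k = 0..h-1  (x_0 = 0 known)
for k in range(h):
    if k >= 1:
        A[row, ix(k)] = -c2
    A[row, iu(k)], A[row, il(k + 1)] = c2, b
    row += 1
# dynamics:  x_{k+1} = a*x_k + b*u_k, with x_0 = 0 and x_h = 1
for k in range(h):
    if k >= 1:
        A[row, ix(k)] = a
    A[row, iu(k)] = b
    if k + 1 <= h - 1:
        A[row, ix(k + 1)] = -1.0
    else:
        rhs[row] = 1.0                # terminal constraint x_h = 1
    row += 1

z = np.linalg.solve(A, rhs)
x = np.concatenate(([0.0], z[:h - 1], [1.0]))   # optimal state trajectory
u = z[h - 1:2 * h - 1]                          # optimal inputs
```

For this problem the discrete optimum turns out to be the equally spaced trajectory x_k = k/h, consistent with the limit x(t) = t shown on the next slide.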
-
14
Taking the limit τ → 0

Again, seems to be converging to u(t) = 1 + t, x(t) = t.

[figure: x(t) and u(t) over t ∈ [0, 1] for τ = 0.2, τ = 0.05 and τ = 0.01]
-
15
Discussion

• In this lecture we follow this discretization approach (the more formal continuous-time approach can be found in Bertsekas' book) to derive the counterpart of DP for continuous-time control problems, which is the Hamilton-Jacobi-Bellman equation.
• Later we will use both the discretization approach and the continuous-time approach to derive Pontryagin's maximum principle.
• With such tools we will be able to establish the optimal solution for charging the capacitor, and to solve many other problems.

[diagram: CT control problem → (discretization, step τ) → stage decision problem → DT DP / DT PMP → optimal path and policy; taking the limit τ → 0 turns DT DP into CT DP and DT PMP into CT PMP, which yield the optimal path and policy of the CT control problem]
-
Outline
• Problem formulation and approach
• Hamilton Jacobi Bellman equation
• Linear quadratic regulator
-
16
Discretization approach

Discretization times t_k = kτ, k ∈ {0, ..., h}, with hτ = T; discretization step τ.

Dynamic model: ẋ(t) = f(x(t), u(t)), x(0) = x_0, t ∈ [0, T] is discretized as
x_{k+1} = x_k + τ f(x_k, u_k), x_k = x(kτ), u_k = u(kτ).

Cost function: ∫_0^T g(x(t), u(t)) dt + g_T(x(T)) is discretized as
Σ_{k=0}^{h-1} g(x_k, u_k) τ + g_h(x_h), with g_h(x) = g_T(x), ∀x.

• Note that these are approximate discretizations. We could have considered exact discretizations, as in the linear case, but this approximation will suffice.
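A quick illustration (not from the slides) that the forward-Euler discretization x_{k+1} = x_k + τ f(x_k, u_k) converges to the continuous solution as τ → 0, using the RC example f(x, u) = u - x with a constant input as assumed test data:

```python
import numpy as np

# RC example f(x, u) = u - x with a constant input u = 1 (assumed test input)
f = lambda x, u: u - x

def euler_rollout(tau, T=1.0, x0=0.0, u=1.0):
    x = x0
    for _ in range(round(T / tau)):
        x = x + tau * f(x, u)        # x_{k+1} = x_k + tau * f(x_k, u_k)
    return x

exact = 1 - np.exp(-1.0)             # closed-form x(1) for u = 1, x(0) = 0
errs = [abs(euler_rollout(tau) - exact) for tau in (0.1, 0.01, 0.001)]
```

The error shrinks roughly linearly in τ, which is why the approximate discretization suffices for the limiting argument.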
-
17
Dynamic programming

DP equations for the resulting stage decision problem:
J_h(x_h) = g_h(x_h)
J_k(x_k) = min_{u_k ∈ U} g(x_k, u_k) τ + J_{k+1}(x_k + τ f(x_k, u_k)), k ∈ {h-1, ..., 0}

For convenience let us define
J̄(t, x) = J_k(x) for t ∈ [kτ, (k+1)τ), J̄(hτ, x) = J_h(x).

Then the dynamic programming algorithm can be written as
J̄(hτ, x) = g_h(x), ∀x
J̄(kτ, x) = min_{u ∈ U} g(x, u) τ + J̄((k+1)τ, x + τ f(x, u)), ∀x, k ∈ {h-1, ..., 0}.
-
18
Taking the limit τ → 0

Using a first-order Taylor series expansion,
J̄((k+1)τ, x + τ f(x, u)) = J̄(kτ, x) + τ(∂/∂t J̄(kτ, x) + ∂/∂x J̄(kτ, x) f(x, u)) + o(τ)
and replacing in the DP algorithm, we obtain
J̄(kτ, x) = min_{u ∈ U} g(x, u) τ + J̄(kτ, x) + τ(∂/∂t J̄(kτ, x) + ∂/∂x J̄(kτ, x) f(x, u)) + o(τ).

Assuming that (wishful thinking....) as τ → 0, J̄(t, x) converges to a continuously differentiable function, then cancelling J̄(kτ, x) on both sides, dividing by τ and letting τ → 0 yields
0 = min_{u ∈ U} [g(x, u) + ∂/∂t J̄(t, x) + ∂/∂x J̄(t, x) f(x, u)].
-
19
Theorem (HJB)

Suppose that V(t, x) is continuously differentiable in t and x, and satisfies the Hamilton-Jacobi-Bellman equation
0 = min_{u ∈ U} [g(x, u) + ∂/∂t V(t, x) + ∂/∂x V(t, x) f(x, u)], ∀t, x
with the terminal condition V(T, x) = g_T(x).

Suppose also that u = μ(t, x) attains the minimum in the HJB equation for all t, x.

Then V(t, x) coincides with the optimal cost-to-go J(t, x), and μ(t, x) coincides with the optimal policy.
-
20
Discussion

• The HJB equation is a partial differential equation.
• The intuitive arguments provided before show that this partial differential equation is just an extension of the DP algorithm.
• The bottleneck of such intuitive arguments is how to establish that the cost-to-go is differentiable.
• The formal proof uses a different argument, following a continuous-time approach. It can be found in Bertsekas' book, p. 111.
• Partial differential equations are in general very hard to solve analytically.
• We are going to apply the HJB equation first to a simple example, then to linear systems, and finally solve the previous problem of charging a capacitor.
-
21
Example

For the simple problem*
dynamics: ẋ(t) = u(t), u(t) ∈ U := [-1, 1], t ∈ [0, T]
cost: (1/2)(x(T))^2
the HJB equation is
0 = min_{u ∈ [-1,1]} [∂/∂t V(t, x) + ∂/∂x V(t, x) u]
with the terminal condition V(T, x) = (1/2)x^2.

Approach: find a candidate for optimality and check that it satisfies HJB.

* example taken from Bertsekas' book, p. 112
-
22
Example

There is an obvious candidate for optimality: move the state towards zero as quickly as possible,
μ*(t, x) = -sign(x) = { 1 if x < 0; 0 if x = 0; -1 if x > 0 }
and for an initial time t and initial state x the cost is given by
J*(t, x) = (1/2)(max{0, |x| - (T - t)})^2.
-
23
Example

This function satisfies the terminal condition of the HJB theorem,
J*(T, x) = (1/2)x^2,
its partial derivatives are
∂/∂t J*(t, x) = max{0, |x| - (T - t)}
∂/∂x J*(t, x) = sign(x) max{0, |x| - (T - t)},
and it satisfies the HJB equation
0 = min_{u ∈ [-1,1]} [1 + sign(x)u] max{0, |x| - (T - t)},
where the minimum in the HJB equation is achieved by u = μ*(t, x) = -sign(x) (not unique when |x| ≤ T - t).

Then this is an optimal policy.
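The verification above can also be checked numerically: a minimal sketch that evaluates the HJB residual of J* on a grid of (t, x) points, using the closed-form partial derivatives:

```python
import numpy as np

T = 1.0
m = lambda t, x: np.maximum(0.0, np.abs(x) - (T - t))
Jt = lambda t, x: m(t, x)                      # dJ*/dt
Jx = lambda t, x: np.sign(x) * m(t, x)         # dJ*/dx

us = np.linspace(-1.0, 1.0, 201)               # grid over admissible inputs
worst = 0.0
for t in np.linspace(0.0, 0.9, 10):
    for x in np.linspace(-2.0, 2.0, 21):
        residual = np.min(Jt(t, x) + Jx(t, x) * us)   # min_u [dJ/dt + dJ/dx * u]
        worst = max(worst, abs(residual))
```

The residual is zero at every grid point: at u = -sign(x) the bracket [1 + sign(x)u] vanishes exactly.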
-
Outline
• Problem formulation and approach
• Hamilton Jacobi Bellman equation
• Linear quadratic regulator
-
24
Linear systems, quadratic cost

Dynamic model
ẋ(t) = Ax(t) + Bu(t), x(0) = x_0

Cost function
x(T)ᵀQ_T x(T) + ∫_0^T (x(t)ᵀQx(t) + 2x(t)ᵀSu(t) + u(t)ᵀRu(t)) dt, with [Q S; Sᵀ R] > 0

HJB
0 = min_{u ∈ R^m} [xᵀQx + 2xᵀSu + uᵀRu + ∂V(t, x)/∂t + ∂V(t, x)/∂x (Ax + Bu)]
V(T, x) = xᵀQ_T x

Inspired by the fact that a discretization-based approach would result in quadratic costs-to-go, let us try V(t, x) = xᵀP(t)x. If such a function satisfies the HJB equations, it is the cost-to-go!
-
25
Linear systems, quadratic cost

The HJB equation then takes the form
0 = min_{u ∈ R^m} [xᵀQx + 2xᵀSu + uᵀRu + xᵀṖ(t)x + 2xᵀP(t)Ax + 2xᵀP(t)Bu].

To obtain the minimum, differentiate with respect to u and equate to zero:
2(BᵀP(t) + Sᵀ)x + 2Ru = 0 ⟹ u = -R^{-1}(BᵀP(t) + Sᵀ)x =: K(t)x

which leads to
0 = xᵀ(Ṗ(t) + P(t)A + AᵀP(t) - (P(t)B + S)R^{-1}(BᵀP(t) + Sᵀ) + Q)x for all x,

which is only satisfied if
Ṗ(t) = -(P(t)A + AᵀP(t) - (P(t)B + S)R^{-1}(BᵀP(t) + Sᵀ) + Q), P(T) = Q_T.

We have concluded that if P(t) satisfies this Riccati equation, then J(t, x) = xᵀP(t)x is the cost-to-go and μ(t, x) = K(t)x is the optimal policy.
-
26
Finite horizon quadratic control

Finite horizon. The optimal control policy for the problem
min_u ∫_0^T (x(t)ᵀQx(t) + 2x(t)ᵀSu(t) + u(t)ᵀRu(t)) dt + x(T)ᵀQ_T x(T)
subject to ẋ(t) = Ax(t) + Bu(t), x(0) = x_0
is u(t) = K(t)x(t), K(t) = -R^{-1}(BᵀP(t) + Sᵀ), where P(t) is the unique solution of the Riccati equation
Ṗ(t) = -(P(t)A + AᵀP(t) - (P(t)B + S)R^{-1}(BᵀP(t) + Sᵀ) + Q), P(T) = Q_T.

Moreover, the optimal cost-to-go is given by x_0ᵀP(0)x_0.
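A sketch of the finite-horizon result in code (the system, weights and horizon are assumed example data, not from the slides): integrate the Riccati ODE backward from P(T) = Q_T with RK4, simulate the closed loop u = K(t)x forward, and compare the accumulated cost with the predicted cost-to-go x_0ᵀP(0)x_0:

```python
import numpy as np

# assumed example data (not from the slides): double integrator, unit weights
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q, R, QT = np.eye(2), np.array([[1.0]]), np.eye(2)
T, N = 2.0, 4000
dt = T / N

def riccati_rhs(P):
    # Pdot = -(P A + A' P - (P B) R^{-1} (B' P) + Q), here with S = 0
    return -(P @ A + A.T @ P - P @ B @ np.linalg.solve(R, B.T @ P) + Q)

# backward sweep from P(T) = QT, storing P_k ~ P(k*dt) (RK4 with step -dt)
Ps = [None] * (N + 1)
Ps[N] = QT.copy()
for k in range(N, 0, -1):
    P = Ps[k]
    k1 = riccati_rhs(P)
    k2 = riccati_rhs(P - 0.5 * dt * k1)
    k3 = riccati_rhs(P - 0.5 * dt * k2)
    k4 = riccati_rhs(P - 0.5 * dt * k3)
    Ps[k - 1] = P - dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

# forward Euler simulation of the closed loop u = K(t) x, accumulating the cost
x0 = np.array([1.0, 0.0])
x, cost = x0.copy(), 0.0
for k in range(N):
    K = -np.linalg.solve(R, B.T @ Ps[k])   # K(t) = -R^{-1} B' P(t)
    u = K @ x
    cost += (x @ Q @ x + u @ R @ u) * dt
    x = x + dt * (A @ x + B @ u)
cost += x @ QT @ x                         # terminal cost

predicted = x0 @ Ps[0] @ x0                # cost-to-go x0' P(0) x0
```

Up to discretization error, the simulated closed-loop cost matches the predicted x_0ᵀP(0)x_0.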
-
27
Linear Quadratic Regulator

Infinite horizon. Assume [Q S; Sᵀ R] > 0 and (A, B) controllable. The optimal policy for the problem
min_u ∫_0^∞ (x(t)ᵀQx(t) + 2x(t)ᵀSu(t) + u(t)ᵀRu(t)) dt
subject to ẋ(t) = Ax(t) + Bu(t), x(0) = x_0
is u(t) = Kx(t), K = -R^{-1}(BᵀP + Sᵀ), where P is the unique positive definite solution to the algebraic Riccati equation
0 = PA + AᵀP - (PB + S)R^{-1}(BᵀP + Sᵀ) + Q.

Moreover, the closed-loop matrix (A + BK) has all its eigenvalues in the open left-half complex plane, and the optimal cost-to-go is given by x_0ᵀPx_0.

The reasoning follows from arguments similar to those used in the context of stage decision problems.
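For the infinite-horizon case, scipy's algebraic Riccati solver gives P directly (assumed example system, with S = 0):

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# assumed example system (not from the slides): double integrator, S = 0
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])

P = solve_continuous_are(A, B, Q, R)     # solves 0 = PA + A'P - PB R^{-1} B'P + Q
K = -np.linalg.solve(R, B.T @ P)         # optimal gain K = -R^{-1} B'P
cl_eigs = np.linalg.eigvals(A + B @ K)   # closed-loop eigenvalues
```

As the theorem states, P is positive definite and the closed-loop eigenvalues lie in the open left-half plane.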
-
28
Charging a capacitor

Applying a trick allows us to cast our problem in the standard LQR formulation: augment the state with a constant y(t) ≡ 1.

Dynamic model: ẋ(t) = -x(t) + u(t) becomes
[ẋ(t); ẏ(t)] = [-1 0; 0 0][x(t); y(t)] + [1; 0]u(t), [x(0); y(0)] = [x_0; 1]
so A = [-1 0; 0 0] and B = [1; 0].

Cost function: ∫_0^1 (x(t) - u(t))^2 dt + σ(x(1) - 1)^2 becomes
∫_0^1 ([x(t) y(t)][1 0; 0 0][x(t); y(t)] + 2[x(t) y(t)][-1; 0]u(t) + 1·u(t)^2) dt + [x(1) y(1)][σ -σ; -σ σ][x(1); y(1)]
so Q = [1 0; 0 0], S = [-1; 0], R = 1, Q_T = [σ -σ; -σ σ].
-
29
Riccati equations

The Riccati equations
Ṗ(t) = -(P(t)A + AᵀP(t) - (P(t)B + S)R^{-1}(BᵀP(t) + Sᵀ) + Q), P(T) = Q_T
with
P(t) = [p_1(t) p_2(t); p_2(t) p_3(t)]
boil down to
[ṗ_1 ṗ_2; ṗ_2 ṗ_3] = -[p_1 p_2; p_2 p_3][-1 0; 0 0] - [-1 0; 0 0][p_1 p_2; p_2 p_3] + [p_1 - 1; p_2][p_1 - 1  p_2] - [1 0; 0 0]
with terminal conditions p_1(1) = -p_2(1) = p_3(1) = σ,
or equivalently to the nonlinear differential equations
ṗ_1(t) = 2p_1(t) + (p_1(t) - 1)^2 - 1 = p_1(t)^2
ṗ_2(t) = p_2(t) + p_2(t)(p_1(t) - 1) = p_1(t)p_2(t)
ṗ_3(t) = p_2(t)^2
whose solution is (solution method not addressed here)
p_1(t) = -p_2(t) = p_3(t) = 1/(1 + 1/σ - t).
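The claimed solution can be checked numerically (a small sketch; σ = 50 is an arbitrary test value), comparing central finite differences of p_1 and p_2 against the right-hand sides of the differential equations:

```python
import numpy as np

sigma = 50.0                                  # arbitrary test value for the weight
c = 1.0 + 1.0 / sigma
p1 = lambda t: 1.0 / (c - t)                  # candidate solution
p2 = lambda t: -p1(t)                         # p2 = -p1 (and p3 = p1)

ts = np.linspace(0.0, 0.99, 100)
eps = 1e-6
d = lambda f, t: (f(t + eps) - f(t - eps)) / (2 * eps)   # central difference

res1 = np.max(np.abs(d(p1, ts) - p1(ts) ** 2))       # p1' = p1^2
res2 = np.max(np.abs(d(p2, ts) - p1(ts) * p2(ts)))   # p2' = p1*p2
bc = abs(p1(1.0) - sigma)                            # p1(1) = sigma
```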
-
30
Optimal policy and optimal path

Optimal policy:
u(t) = -R^{-1}(BᵀP(t) + Sᵀ)[x(t); y(t)] = [-(p_1(t) - 1)  -p_2(t)][x(t); 1]
= -(p_1(t) - 1)x(t) + p_1(t) = -p_1(t)(x(t) - 1) + x(t)

Optimal path for x(0) = 0: with p_1(t) = 1/(1 + 1/σ - t), the closed loop is
ẋ(t) = -x(t) + u(t) = -p_1(t)(x(t) - 1)
which gives
x(t) = t/(1 + 1/σ), u(t) = (1 + t)/(1 + 1/σ).

Letting the parameter Δ of the artificial terminal cost converge to zero (Δ → 0, hence σ → ∞) we obtain
u(t) = 1 + t, x(t) = t.
-
31
Discussion

• The HJB equation is a partial differential equation, and an analytical solution is very hard to find.
• For problems with linear models and quadratic costs, computing the optimal policy and optimal paths involves solving nonlinear differential equations (Riccati equations).
• We were able to solve these Riccati equations since the dimension of the state space in our example was small.
• The approach based on Pontryagin's maximum principle will lead to different conditions, which can be applied in more cases.
• We will later consider stochastic disturbances, but the advantages of having a policy are exactly the same as for stage decision problems.
-
32
Concluding remarks

• The counterpart of DP for stage decision problems is the HJB equation.
• This is a partial differential equation, very hard to solve in general.
• However, for linear systems we can solve it, and this leads to the Riccati equations.
• As for discrete-time optimal control problems, this leads to an algebraic Riccati equation (continuous-time LQR) when the horizon is infinite.

Summary: after this lecture you should be able to:
• Compute the optimal policy and optimal path for problems with a linear model and finite-horizon quadratic cost (Riccati equations).
• Compute the optimal policy for problems with linear models and infinite-horizon quadratic cost.
• Solve the algebraic Riccati equation analytically when the dimension of the state space is small.