TRANSCRIPT
-
4SC000 Q2 2017-2018
Optimal Control and Dynamic Programming
Duarte Antunes
-
Part III: Continuous-time optimal control problems
-
Recap
1
                        Discrete optimization problems      | Stage decision problems
Formulation:            transition diagram                  | dynamic system & additive cost function
DP algorithm:           graphical DP algorithm & DP equation| DP equation
Partial information:                                        | Bayesian inference & decisions based on prob. distribution; Kalman filter and separation principle
Alternative algorithms: Dijkstra's algorithm                | static optimization
-
2
Goals of part III

Introduce optimal control concepts for continuous-time optimal control problems:

                        Discrete optimization problems      | Stage decision problems                  | Continuous-time control problems
Formulation:            transition diagram                  | discrete-time system & additive cost     | differential equations & additive cost function
DP algorithm:           graphical DP algorithm & DP equation| DP equation                              | Hamilton-Jacobi-Bellman equation
Partial information:                                        | Bayesian inference & decisions based on prob. distribution; Kalman filter and separation principle | continuous-time Kalman filter and separation principle
Alternative algorithms: Dijkstra's algorithm                | static optimization                      | Pontryagin's maximum principle

And analyze frequency-domain properties of continuous-time LQR/LQG.
-
Outline
• Problem formulation and approach
• Hamilton Jacobi Bellman equation
• Linear quadratic regulator
-
3
Continuous-time optimal control problems

Dynamic model
ẋ(t) = f(x(t), u(t)), x(0) = x_0, t ∈ [0, T]

Cost function
∫_0^T g(x(t), u(t)) dt + g_T(x(T))

The goal is to find an optimal path and an optimal policy.

Assumptions
• x(t) ∈ R^n and u(t) ∈ U ⊆ R^m
• The differential equation has a unique solution in t ∈ [0, T]
• We assume that f, g do not explicitly depend on time, for simplicity; we could also consider f(t, x(t), u(t)), g(t, x(t), u(t))
-
4
Optimal path

• A path (u(t), x(t)), t ∈ [0, T], consists of a control input u(t) and a corresponding solution x(t) to the differential equation
ẋ(t) = f(x(t), u(t)), x(0) = x_0, t ∈ [0, T]
• A path is said to be optimal if there is no other path with a smaller cost
∫_0^T g(x(t), u(t)) dt + g_T(x(T))
• Choosing the control input can be seen as making decisions in infinitesimal time intervals which shape the derivative of the state (and thus determine its evolution up to the final state x(T) at t = T)
-
5
Optimal policy

• A policy is a function μ which maps states into actions at every time:
u(t) = μ(t, x(t)), t ∈ [0, T]
• A policy μ is said to be optimal if, for every state x(t) = x̄ at every time t, the cost
∫_t^T g(x(s), μ(s, x(s))) ds + g_T(x(T))
coincides with the cost of the optimal path to the problem
minimize ∫_t^T g(x(s), u(s)) ds + g_T(x(T)) subject to ẋ(s) = f(x(s), u(s)), x(t) = x̄, s ∈ [t, T]
• We denote the cost of the latter problem by J(t, x̄), the optimal cost-to-go
-
6
Approach

• Dynamic programming (DP) will allow us to compute optimal policies and optimal paths, and Pontryagin's maximum principle (PMP) will allow us to compute optimal paths.
• However, obtaining these results in continuous time (CT) is mathematically involved.
• To gain intuition, in both cases we will first discretize the problem as a function of the discretization step (previously the sampling period), apply DP, and take the limit as the discretization step converges to zero.

[diagram: CT control problem → (discretization, step τ) → stage decision problem → DT DP → optimal path and policy; taking the limit τ → 0 turns DT DP into CT DP, which yields the optimal path and policy of the CT control problem]
-
7
Example

How to charge the capacitor in an RC circuit with minimum energy loss in the resistor?

[figure: circuit with source voltage u, resistor R carrying current i, and capacitor C with voltage x]

Dynamic model: ẋ(t) = (1/(RC))(u(t) - x(t))

Problem: min_{u(t)} ∫_0^T (x(t) - u(t))^2 / R dt subject to x(0) = 0 and x(T) = x_desired.

Let us consider R = C = T = x_desired = 1.
-
8
Discretization

Discretization times t_k = kτ, k ∈ {0, ..., h}, with hτ = T; discretization step τ.

Dynamic model: for t ∈ [t_k, t_{k+1}), with x_k := x(t_k) and u_k := u(t_k),
x(t) = e^{-(t - t_k)} x_k + (1 - e^{-(t - t_k)}) u_k
and in particular
x_{k+1} = e^{-τ} x_k + (1 - e^{-τ}) u_k.

Cost function:
∫_0^1 (x(t) - u(t))^2 dt = Σ_{k=0}^{h-1} ∫_{t_k}^{t_{k+1}} (e^{-(t - t_k)} x_k + (1 - e^{-(t - t_k)}) u_k - u_k)^2 dt
= Σ_{k=0}^{h-1} ∫_{t_k}^{t_{k+1}} e^{-2(t - t_k)} dt (x_k - u_k)^2
= Σ_{k=0}^{h-1} (1 - e^{-2τ})/2 (x_k - u_k)^2
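As a sanity check (not on the slides), the exact-discretization formulas above can be verified numerically by integrating ẋ = u - x over one step with a constant input; the values of x_k and u_k below are arbitrary test data:

```python
import numpy as np

tau = 0.2
a, b = np.exp(-tau), 1 - np.exp(-tau)     # x_{k+1} = a*x_k + b*u_k
c = (1 - np.exp(-2 * tau)) / 2            # stage-cost weight of (x_k - u_k)^2

# simulate xdot = u - x over one step [0, tau] with constant input u_k
xk, uk = 0.3, 1.2
n = 100000
dt = tau / n
x, cost = xk, 0.0
for _ in range(n):
    cost += (x - uk) ** 2 * dt            # accumulate the integral of (x - u)^2
    x += dt * (uk - x)                    # forward Euler on xdot = u - x

# compare with the closed-form discretization
x_exact = a * xk + b * uk
cost_exact = c * (xk - uk) ** 2
```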
-
9
From terminal constraint to terminal cost

The framework of stage decision problems does not take terminal constraints into account. Thus we apply a trick: we consider that a final control input u(1) is applied at the terminal time, setting the state to the desired terminal value after Δ seconds, x(1 + Δ) = 1.

[figure: x(t) driven from x(1) to 1 during the interval [1, 1 + Δ]]

Since x(1 + Δ) = e^{-Δ} x(1) + (1 - e^{-Δ}) u(1), this terminal control input is given by
u(1) = (1 - e^{-Δ} x(1)) / (1 - e^{-Δ}).
-
10
From terminal constraint to terminal cost

The following cost approximates the original one that we are interested in:
∫_0^{1+Δ} (x(t) - u(t))^2 dt = ∫_0^1 (x(t) - u(t))^2 dt + ∫_1^{1+Δ} (x(t) - u(t))^2 dt
= (Σ_{k=0}^{h-1} (1 - e^{-2τ})/2 (x_k - u_k)^2) + (1 - e^{-2Δ})/2 (x_h - u_h)^2
= (Σ_{k=0}^{h-1} (1 - e^{-2τ})/2 (x_k - u_k)^2) + σ(Δ)(x_h - 1)^2

where, substituting the terminal input u_h = (1 - e^{-Δ} x_h)/(1 - e^{-Δ}), the terminal cost weight is
σ(Δ) = (1 - e^{-2Δ}) / (2(1 - e^{-Δ})^2).

Note that σ(Δ) → ∞ as Δ → 0, but σ(Δ)(x_h - 1)^2 → 0 if x_h → 1.
-
11
Dynamic programming

Applying DP:
J_h(x_h) = σ(Δ)(x_h - 1)^2
J_k(x_k) = min_{u_k} (1 - e^{-2τ})/2 (x_k - u_k)^2 + J_{k+1}(e^{-τ} x_k + (1 - e^{-τ}) u_k)

This results in (obtained from Riccati equations)
u_k = K_k x_k + α_k
J_k(x_k) = θ_k x_k^2 + β_k x_k + γ_k

Example: τ = 0.2, Δ = 0.01
[figure: resulting x(t) and u(t) plotted over t ∈ [0, 1]]
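The backward recursion above can be sketched in a few lines (a hedged illustration; τ = 0.01 and Δ = 0.001 are chosen smaller than the slide's example so the result is near the continuous limit). Since each J_k is a quadratic, it is represented by three coefficients, recovered here by evaluating the minimized cost at three sample points instead of deriving the Riccati recursion by hand:

```python
import numpy as np

tau, Delta, T = 0.01, 0.001, 1.0
h = round(T / tau)
a, b = np.exp(-tau), 1 - np.exp(-tau)
c = (1 - np.exp(-2 * tau)) / 2
sigma = (1 - np.exp(-2 * Delta)) / (2 * (1 - np.exp(-Delta)) ** 2)

# J_k(x) = theta*x^2 + beta*x + gamma; terminal cost sigma*(x - 1)^2
coeffs = [None] * (h + 1)
coeffs[h] = (sigma, -2 * sigma, sigma)

def u_star(x, theta, beta):
    # closed-form minimizer of c*(x-u)^2 + theta*(a*x + b*u)^2 + beta*(a*x + b*u)
    return ((2 * c - 2 * theta * a * b) * x - beta * b) / (2 * c + 2 * theta * b * b)

for k in range(h - 1, -1, -1):
    th, be, ga = coeffs[k + 1]
    xs = np.array([0.0, 1.0, 2.0])          # fit the quadratic J_k through 3 points
    us = u_star(xs, th, be)
    ys = a * xs + b * us
    vals = c * (xs - us) ** 2 + th * ys ** 2 + be * ys + ga
    th_k = (vals[2] - 2 * vals[1] + vals[0]) / 2
    be_k = vals[1] - vals[0] - th_k
    coeffs[k] = (th_k, be_k, vals[0])

# roll the optimal policy forward from x(0) = 0
x, traj = 0.0, []
for k in range(h):
    th, be, _ = coeffs[k + 1]
    u = u_star(x, th, be)
    traj.append((k * tau, x, u))
    x = a * x + b * u
```

With τ and Δ this small, the trajectory is already close to the limit x(t) = t, u(t) = 1 + t claimed on the next slide.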
-
12
Taking the limit τ → 0

Seems to be converging to u(t) = 1 + t, x(t) = t. Later we will prove this.

[figure: x(t) and u(t) over t ∈ [0, 1] for smaller values of the parameters, τ ∈ {0.05, 0.01} and Δ ∈ {0.01, 0.001}]
-
13
Static optimization

min_{u_0, ..., u_{h-1}} Σ_{k=0}^{h-1} (1 - e^{-2τ})/2 (x_k - u_k)^2
s.t. x_{k+1} = e^{-τ} x_k + (1 - e^{-τ}) u_k, x_0 = 0, x_h = 1

This is a static optimization problem, which can handle constraints.

Lagrangian:
L(x_1, u_0, λ_1, ..., x_{h-1}, u_{h-1}, λ_h) = Σ_{k=0}^{h-1} (1 - e^{-2τ})/2 (x_k - u_k)^2 + Σ_{k=0}^{h-1} λ_{k+1}(e^{-τ} x_k + (1 - e^{-τ}) u_k - x_{k+1})

The necessary optimality conditions amount to solving a linear system:
∂L/∂x_k = 0: λ_k = (1 - e^{-2τ})(x_k - u_k) + λ_{k+1} e^{-τ}, k ∈ {1, ..., h-1}
∂L/∂u_k = 0: 0 = -(1 - e^{-2τ})(x_k - u_k) + λ_{k+1}(1 - e^{-τ}), k ∈ {0, ..., h-1}
∂L/∂λ_{k+1} = 0: x_{k+1} = e^{-τ} x_k + (1 - e^{-τ}) u_k, k ∈ {0, ..., h-1}
together with the boundary conditions x_0 = 0, x_h = 1.
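The linear system above can be assembled and solved directly; this is a sketch with numpy (the stacking order of the unknowns x_1..x_{h-1}, u_0..u_{h-1}, λ_1..λ_h is my own choice):

```python
import numpy as np

tau, T = 0.01, 1.0
h = round(T / tau)
a, b, c2 = np.exp(-tau), 1 - np.exp(-tau), 1 - np.exp(-2 * tau)

# unknowns stacked as z = [x_1..x_{h-1}, u_0..u_{h-1}, lam_1..lam_h]
n = 3 * h - 1
ix = lambda k: k - 1                  # x_k,   k = 1..h-1
iu = lambda k: (h - 1) + k            # u_k,   k = 0..h-1
il = lambda k: (2 * h - 1) + (k - 1)  # lam_k, k = 1..h

A, rhs = np.zeros((n, n)), np.zeros(n)
row = 0
# dL/dx_k = 0:  lam_k = c2*(x_k - u_k) + a*lam_{k+1},  k = 1..h-1
for k in range(1, h):
    A[row, ix(k)], A[row, iu(k)] = c2, -c2
    A[row, il(k)], A[row, il(k + 1)] = -1.0, a
    row += 1
# dL/du_k = 0:  0 = -c2*(x_k - u_k) + b*lam_{k+1},  k = 0..h-1  (x_0 = 0 known)
for k in range(h):
    if k >= 1:
        A[row, ix(k)] = -c2
    A[row, iu(k)], A[row, il(k + 1)] = c2, b
    row += 1
# dynamics:  x_{k+1} = a*x_k + b*u_k, with x_0 = 0 and x_h = 1
for k in range(h):
    if k >= 1:
        A[row, ix(k)] = a
    A[row, iu(k)] = b
    if k + 1 <= h - 1:
        A[row, ix(k + 1)] = -1.0
    else:
        rhs[row] = 1.0                # terminal constraint x_h = 1
    row += 1

z = np.linalg.solve(A, rhs)
x = np.concatenate(([0.0], z[:h - 1], [1.0]))   # optimal state trajectory
u = z[h - 1:2 * h - 1]                          # optimal inputs
```

For this problem the discrete optimum turns out to be the equally spaced trajectory x_k = k/h, consistent with the limit x(t) = t shown on the next slide.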
-
14
Taking the limit τ → 0

Again, seems to be converging to u(t) = 1 + t, x(t) = t.

[figure: x(t) and u(t) over t ∈ [0, 1] for τ = 0.2, τ = 0.05 and τ = 0.01]
-
15
Discussion

• In this lecture we follow this discretization approach (the more formal continuous-time approach can be found in Bertsekas' book) to derive the counterpart of DP for continuous-time control problems, which is the Hamilton-Jacobi-Bellman equation.
• Later we will use both the discretization approach and the continuous-time approach to derive Pontryagin's maximum principle.
• With such tools we will be able to establish the optimal solution for charging the capacitor, and to solve many other problems.

[diagram: CT control problem → (discretization, step τ) → stage decision problem → DT DP / DT PMP → optimal path and policy; taking the limit τ → 0 turns DT DP into CT DP and DT PMP into CT PMP, which yield the optimal path and policy of the CT control problem]
-
Outline
• Problem formulation and approach
• Hamilton Jacobi Bellman equation
• Linear quadratic regulator
-
16
Discretization approach

Discretization times t_k = kτ, k ∈ {0, ..., h}, with hτ = T; discretization step τ.

Dynamic model: ẋ(t) = f(x(t), u(t)), x(0) = x_0, t ∈ [0, T] is discretized as
x_{k+1} = x_k + τ f(x_k, u_k), x_k = x(kτ), u_k = u(kτ).

Cost function: ∫_0^T g(x(t), u(t)) dt + g_T(x(T)) is discretized as
Σ_{k=0}^{h-1} g(x_k, u_k) τ + g_h(x_h), with g_h(x) = g_T(x), ∀x.

• Note that these are approximate discretizations. We could have considered exact discretizations, as in the linear case, but this approximation will suffice.
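A quick illustration (not from the slides) that the forward-Euler discretization x_{k+1} = x_k + τ f(x_k, u_k) converges to the continuous solution as τ → 0, using the RC example f(x, u) = u - x with a constant input as assumed test data:

```python
import numpy as np

# RC example f(x, u) = u - x with a constant input u = 1 (assumed test input)
f = lambda x, u: u - x

def euler_rollout(tau, T=1.0, x0=0.0, u=1.0):
    x = x0
    for _ in range(round(T / tau)):
        x = x + tau * f(x, u)        # x_{k+1} = x_k + tau * f(x_k, u_k)
    return x

exact = 1 - np.exp(-1.0)             # closed-form x(1) for u = 1, x(0) = 0
errs = [abs(euler_rollout(tau) - exact) for tau in (0.1, 0.01, 0.001)]
```

The error shrinks roughly linearly in τ, which is why the approximate discretization suffices for the limiting argument.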
-
17
Dynamic programming

DP equations for the resulting stage decision problem:
J_h(x_h) = g_h(x_h)
J_k(x_k) = min_{u_k ∈ U} g(x_k, u_k) τ + J_{k+1}(x_k + τ f(x_k, u_k)), k ∈ {h-1, ..., 0}

For convenience let us define
J̄(t, x) = J_k(x) for t ∈ [kτ, (k+1)τ), J̄(hτ, x) = J_h(x).

Then the dynamic programming algorithm can be written as
J̄(hτ, x) = g_h(x), ∀x
J̄(kτ, x) = min_{u ∈ U} g(x, u) τ + J̄((k+1)τ, x + τ f(x, u)), ∀x, k ∈ {h-1, ..., 0}.
-
18
Taking the limit τ → 0

Using a first-order Taylor series expansion,
J̄((k+1)τ, x + τ f(x, u)) = J̄(kτ, x) + τ(∂/∂t J̄(kτ, x) + ∂/∂x J̄(kτ, x) f(x, u)) + o(τ)
and replacing in the DP algorithm, we obtain
J̄(kτ, x) = min_{u ∈ U} g(x, u) τ + J̄(kτ, x) + τ(∂/∂t J̄(kτ, x) + ∂/∂x J̄(kτ, x) f(x, u)) + o(τ).

Assuming that (wishful thinking....) as τ → 0, J̄(t, x) converges to a continuously differentiable function, then cancelling J̄(kτ, x) on both sides, dividing by τ and letting τ → 0 yields
0 = min_{u ∈ U} [g(x, u) + ∂/∂t J̄(t, x) + ∂/∂x J̄(t, x) f(x, u)].
-
19
Theorem (HJB)

Suppose that V(t, x) is continuously differentiable in t and x, and satisfies the Hamilton-Jacobi-Bellman equation
0 = min_{u ∈ U} [g(x, u) + ∂/∂t V(t, x) + ∂/∂x V(t, x) f(x, u)], ∀t, x
with the terminal condition V(T, x) = g_T(x).

Suppose also that u = μ(t, x) attains the minimum in the HJB equation for all t, x.

Then V(t, x) coincides with the optimal cost-to-go J(t, x), and μ(t, x) coincides with the optimal policy.
-
20
Discussion

• The HJB equation is a partial differential equation.
• The intuitive arguments provided before show that this partial differential equation is just an extension of the DP algorithm.
• The bottleneck of such intuitive arguments is how to establish that the cost-to-go is differentiable.
• The formal proof uses a different argument, following a continuous-time approach. It can be found in Bertsekas' book, p. 111.
• Partial differential equations are in general very hard to solve analytically.
• We are going to apply the HJB equation first to a simple example, then to linear systems, and finally solve the previous problem of charging a capacitor.
-
21
Example

For the simple problem*
dynamics: ẋ(t) = u(t), u(t) ∈ U := [-1, 1], t ∈ [0, T]
cost: (1/2)(x(T))^2
the HJB equation is
0 = min_{u ∈ [-1,1]} [∂/∂t V(t, x) + ∂/∂x V(t, x) u]
with the terminal condition V(T, x) = (1/2)x^2.

Approach: find a candidate for optimality and check that it satisfies HJB.

* example taken from Bertsekas' book, p. 112
-
22
Example

There is an obvious candidate for optimality: move the state towards zero as quickly as possible,
μ*(t, x) = -sign(x) = { 1 if x < 0; 0 if x = 0; -1 if x > 0 }
and for an initial time t and initial state x the cost is given by
J*(t, x) = (1/2)(max{0, |x| - (T - t)})^2.
-
23
Example

This function satisfies the terminal condition of the HJB theorem,
J*(T, x) = (1/2)x^2,
its partial derivatives are
∂/∂t J*(t, x) = max{0, |x| - (T - t)}
∂/∂x J*(t, x) = sign(x) max{0, |x| - (T - t)},
and it satisfies the HJB equation
0 = min_{u ∈ [-1,1]} [1 + sign(x)u] max{0, |x| - (T - t)},
where the minimum in the HJB equation is achieved by u = μ*(t, x) = -sign(x) (not unique when |x| ≤ T - t).

Then this is an optimal policy.
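The verification above can also be checked numerically: a minimal sketch that evaluates the HJB residual of J* on a grid of (t, x) points, using the closed-form partial derivatives:

```python
import numpy as np

T = 1.0
m = lambda t, x: np.maximum(0.0, np.abs(x) - (T - t))
Jt = lambda t, x: m(t, x)                      # dJ*/dt
Jx = lambda t, x: np.sign(x) * m(t, x)         # dJ*/dx

us = np.linspace(-1.0, 1.0, 201)               # grid over admissible inputs
worst = 0.0
for t in np.linspace(0.0, 0.9, 10):
    for x in np.linspace(-2.0, 2.0, 21):
        residual = np.min(Jt(t, x) + Jx(t, x) * us)   # min_u [dJ/dt + dJ/dx * u]
        worst = max(worst, abs(residual))
```

The residual is zero at every grid point: at u = -sign(x) the bracket [1 + sign(x)u] vanishes exactly.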
-
Outline
• Problem formulation and approach
• Hamilton Jacobi Bellman equation
• Linear quadratic regulator
-
24
Linear systems, quadratic cost

Dynamic model
ẋ(t) = Ax(t) + Bu(t), x(0) = x_0

Cost function
x(T)ᵀQ_T x(T) + ∫_0^T (x(t)ᵀQx(t) + 2x(t)ᵀSu(t) + u(t)ᵀRu(t)) dt, with [Q S; Sᵀ R] > 0

HJB
0 = min_{u ∈ R^m} [xᵀQx + 2xᵀSu + uᵀRu + ∂V(t, x)/∂t + ∂V(t, x)/∂x (Ax + Bu)]
V(T, x) = xᵀQ_T x

Inspired by the fact that a discretization-based approach would result in quadratic costs-to-go, let us try V(t, x) = xᵀP(t)x. If such a function satisfies the HJB equations, it is the cost-to-go!
-
25
Linear systems, quadratic cost

The HJB equation then takes the form
0 = min_{u ∈ R^m} [xᵀQx + 2xᵀSu + uᵀRu + xᵀṖ(t)x + 2xᵀP(t)Ax + 2xᵀP(t)Bu].

To obtain the minimum, differentiate with respect to u and equate to zero:
2(BᵀP(t) + Sᵀ)x + 2Ru = 0 ⟹ u = -R^{-1}(BᵀP(t) + Sᵀ)x =: K(t)x

which leads to
0 = xᵀ(Ṗ(t) + P(t)A + AᵀP(t) - (P(t)B + S)R^{-1}(BᵀP(t) + Sᵀ) + Q)x for all x,

which is only satisfied if
Ṗ(t) = -(P(t)A + AᵀP(t) - (P(t)B + S)R^{-1}(BᵀP(t) + Sᵀ) + Q), P(T) = Q_T.

We have concluded that if P(t) satisfies this Riccati equation, then J(t, x) = xᵀP(t)x is the cost-to-go and μ(t, x) = K(t)x is the optimal policy.
-
26
Finite horizon quadratic control

Finite horizon. The optimal control policy for the problem
min_u ∫_0^T (x(t)ᵀQx(t) + 2x(t)ᵀSu(t) + u(t)ᵀRu(t)) dt + x(T)ᵀQ_T x(T)
subject to ẋ(t) = Ax(t) + Bu(t), x(0) = x_0
is u(t) = K(t)x(t), K(t) = -R^{-1}(BᵀP(t) + Sᵀ), where P(t) is the unique solution of the Riccati equation
Ṗ(t) = -(P(t)A + AᵀP(t) - (P(t)B + S)R^{-1}(BᵀP(t) + Sᵀ) + Q), P(T) = Q_T.

Moreover, the optimal cost-to-go is given by x_0ᵀP(0)x_0.
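A sketch of the finite-horizon result in code (the system, weights and horizon are assumed example data, not from the slides): integrate the Riccati ODE backward from P(T) = Q_T with RK4, simulate the closed loop u = K(t)x forward, and compare the accumulated cost with the predicted cost-to-go x_0ᵀP(0)x_0:

```python
import numpy as np

# assumed example data (not from the slides): double integrator, unit weights
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q, R, QT = np.eye(2), np.array([[1.0]]), np.eye(2)
T, N = 2.0, 4000
dt = T / N

def riccati_rhs(P):
    # Pdot = -(P A + A' P - (P B) R^{-1} (B' P) + Q), here with S = 0
    return -(P @ A + A.T @ P - P @ B @ np.linalg.solve(R, B.T @ P) + Q)

# backward sweep from P(T) = QT, storing P_k ~ P(k*dt) (RK4 with step -dt)
Ps = [None] * (N + 1)
Ps[N] = QT.copy()
for k in range(N, 0, -1):
    P = Ps[k]
    k1 = riccati_rhs(P)
    k2 = riccati_rhs(P - 0.5 * dt * k1)
    k3 = riccati_rhs(P - 0.5 * dt * k2)
    k4 = riccati_rhs(P - 0.5 * dt * k3)
    Ps[k - 1] = P - dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

# forward Euler simulation of the closed loop u = K(t) x, accumulating the cost
x0 = np.array([1.0, 0.0])
x, cost = x0.copy(), 0.0
for k in range(N):
    K = -np.linalg.solve(R, B.T @ Ps[k])   # K(t) = -R^{-1} B' P(t)
    u = K @ x
    cost += (x @ Q @ x + u @ R @ u) * dt
    x = x + dt * (A @ x + B @ u)
cost += x @ QT @ x                         # terminal cost

predicted = x0 @ Ps[0] @ x0                # cost-to-go x0' P(0) x0
```

Up to discretization error, the simulated closed-loop cost matches the predicted x_0ᵀP(0)x_0.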
-
27
Linear Quadratic Regulator

Infinite horizon. Assume [Q S; Sᵀ R] > 0 and (A, B) controllable. The optimal policy for the problem
min_u ∫_0^∞ (x(t)ᵀQx(t) + 2x(t)ᵀSu(t) + u(t)ᵀRu(t)) dt
subject to ẋ(t) = Ax(t) + Bu(t), x(0) = x_0
is u(t) = Kx(t), K = -R^{-1}(BᵀP + Sᵀ), where P is the unique positive definite solution to the algebraic Riccati equation
0 = PA + AᵀP - (PB + S)R^{-1}(BᵀP + Sᵀ) + Q.

Moreover, the closed-loop matrix (A + BK) has all its eigenvalues in the open left-half complex plane, and the optimal cost-to-go is given by x_0ᵀPx_0.

The reasoning follows from arguments similar to those used in the context of stage decision problems.
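For the infinite-horizon case, scipy's algebraic Riccati solver gives P directly (assumed example system, with S = 0):

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# assumed example system (not from the slides): double integrator, S = 0
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])

P = solve_continuous_are(A, B, Q, R)     # solves 0 = PA + A'P - PB R^{-1} B'P + Q
K = -np.linalg.solve(R, B.T @ P)         # optimal gain K = -R^{-1} B'P
cl_eigs = np.linalg.eigvals(A + B @ K)   # closed-loop eigenvalues
```

As the theorem states, P is positive definite and the closed-loop eigenvalues lie in the open left-half plane.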
-
28
Charging a capacitor

Applying a trick allows us to cast our problem in the standard LQR formulation: augment the state with a constant y(t) ≡ 1.

Dynamic model: ẋ(t) = -x(t) + u(t) becomes
[ẋ(t); ẏ(t)] = [-1 0; 0 0][x(t); y(t)] + [1; 0]u(t), [x(0); y(0)] = [x_0; 1]
so A = [-1 0; 0 0] and B = [1; 0].

Cost function: ∫_0^1 (x(t) - u(t))^2 dt + σ(x(1) - 1)^2 becomes
∫_0^1 ([x(t) y(t)][1 0; 0 0][x(t); y(t)] + 2[x(t) y(t)][-1; 0]u(t) + 1·u(t)^2) dt + [x(1) y(1)][σ -σ; -σ σ][x(1); y(1)]
so Q = [1 0; 0 0], S = [-1; 0], R = 1, Q_T = [σ -σ; -σ σ].
-
29
Riccati equations

The Riccati equations
Ṗ(t) = -(P(t)A + AᵀP(t) - (P(t)B + S)R^{-1}(BᵀP(t) + Sᵀ) + Q), P(T) = Q_T
with
P(t) = [p_1(t) p_2(t); p_2(t) p_3(t)]
boil down to
[ṗ_1 ṗ_2; ṗ_2 ṗ_3] = -[p_1 p_2; p_2 p_3][-1 0; 0 0] - [-1 0; 0 0][p_1 p_2; p_2 p_3] + [p_1 - 1; p_2][p_1 - 1  p_2] - [1 0; 0 0]
with terminal conditions p_1(1) = -p_2(1) = p_3(1) = σ,
or equivalently to the nonlinear differential equations
ṗ_1(t) = 2p_1(t) + (p_1(t) - 1)^2 - 1 = p_1(t)^2
ṗ_2(t) = p_2(t) + p_2(t)(p_1(t) - 1) = p_1(t)p_2(t)
ṗ_3(t) = p_2(t)^2
whose solution is (solution method not addressed here)
p_1(t) = -p_2(t) = p_3(t) = 1/(1 + 1/σ - t).
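The claimed solution can be checked numerically (a small sketch; σ = 50 is an arbitrary test value), comparing central finite differences of p_1 and p_2 against the right-hand sides of the differential equations:

```python
import numpy as np

sigma = 50.0                                  # arbitrary test value for the weight
c = 1.0 + 1.0 / sigma
p1 = lambda t: 1.0 / (c - t)                  # candidate solution
p2 = lambda t: -p1(t)                         # p2 = -p1 (and p3 = p1)

ts = np.linspace(0.0, 0.99, 100)
eps = 1e-6
d = lambda f, t: (f(t + eps) - f(t - eps)) / (2 * eps)   # central difference

res1 = np.max(np.abs(d(p1, ts) - p1(ts) ** 2))       # p1' = p1^2
res2 = np.max(np.abs(d(p2, ts) - p1(ts) * p2(ts)))   # p2' = p1*p2
bc = abs(p1(1.0) - sigma)                            # p1(1) = sigma
```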
-
30
Optimal policy and optimal path

Optimal policy:
u(t) = -R^{-1}(BᵀP(t) + Sᵀ)[x(t); y(t)] = [-(p_1(t) - 1)  -p_2(t)][x(t); 1]
= -(p_1(t) - 1)x(t) + p_1(t) = -p_1(t)(x(t) - 1) + x(t)

Optimal path for x(0) = 0: with p_1(t) = 1/(1 + 1/σ - t), the closed loop is
ẋ(t) = -x(t) + u(t) = -p_1(t)(x(t) - 1)
which gives
x(t) = t/(1 + 1/σ), u(t) = (1 + t)/(1 + 1/σ).

Letting the parameter Δ of the artificial terminal cost converge to zero (Δ → 0, hence σ → ∞) we obtain
u(t) = 1 + t, x(t) = t.
-
31
Discussion

• The HJB equation is a partial differential equation, and an analytical solution is very hard to find.
• For problems with linear models and quadratic costs, computing the optimal policy and optimal paths involves solving nonlinear differential equations (Riccati equations).
• We were able to solve these Riccati equations since the dimension of the state space in our example was small.
• The approach based on Pontryagin's maximum principle will lead to different conditions, which can be applied in more cases.
• We will later consider stochastic disturbances, but the advantages of having a policy are exactly the same as for stage decision problems.
-
32
Concluding remarks

• The counterpart of DP for stage decision problems is the HJB equation.
• This is a partial differential equation, very hard to solve in general.
• However, for linear systems we can solve it, and this leads to the Riccati equations.
• As for discrete-time optimal control problems, this leads to an algebraic Riccati equation (continuous-time LQR) when the horizon is infinite.

Summary: after this lecture you should be able to:
• Compute the optimal policy and optimal path for problems with a linear model and finite-horizon quadratic cost (Riccati equations).
• Compute the optimal policy for problems with linear models and infinite-horizon quadratic cost.
• Solve the algebraic Riccati equation analytically when the dimension of the state space is small.