f.l. lewis - uta 06 rl short course seu/rl papers/ppt/… · derivation of linear quadratic...
TRANSCRIPT
![Page 1: F.L. Lewis - UTA 06 RL short course SEU/RL papers/PPT/… · Derivation of Linear Quadratic Regulator System Cost Differentiate using Leibniz’ formula Optimal Control is Scalar](https://reader034.vdocuments.us/reader034/viewer/2022042216/5ebfc19362f44a64f602cda7/html5/thumbnails/1.jpg)
New Developments in Integral Reinforcement Learning: Continuous-time Optimal Control
F.L. Lewis, National Academy of Inventors
Moncrief-O’Donnell Chair, UTA Research Institute (UTARI), The University of Texas at Arlington, USA
and
Qian Ren Consulting Professor, State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang, China
Supported by: ONR, US NSF
Supported by: China NNSF, China Project 111
Talk available online at http://www.UTA.edu/UTARI/acs
Invited by Zhongping Jiang, Wen Changyun, Yang Guanghong
New Research Results
Integral Reinforcement Learning for Online Optimal Control
IRL for Online Solution of Multi-player Games
Multi-Player Games on Communication Graphs
Off-Policy Learning
Experience Replay
Bio-inspired Multi-Actor Critics
Output Synchronization of Heterogeneous MAS
Applications to: Microgrid, Robotics, Industry Process Control
Optimality and Games
Optimal Control is effective for: aircraft autopilots, vehicle engine control, aerospace vehicles, ship control, industrial process control.
Multi-player Games occur in: networked systems bandwidth assignment, economics, control theory disturbance rejection, team games, international politics, sports strategy.
But optimal control and game solutions are found by offline solution of matrix design equations. A full dynamical model of the system is needed.
Optimal Control - The Linear Quadratic Regulator (LQR)
System: $\dot{x} = Ax + Bu$
Control: $u = -Kx$ with $K = R^{-1}B^TP$ (on-line real-time control loop)
User-prescribed optimization criterion $(Q, R)$:
$$V(x(t)) = \int_t^\infty \left(x^TQx + u^TRu\right) d\tau$$
Off-line design loop using the ARE, MATLAB LQR(A,B,Q,R):
$$0 = PA + A^TP + Q - PBR^{-1}B^TP$$
An offline design procedure that requires knowledge of the system dynamics model (A,B).
System modeling is expensive, time consuming, and inaccurate.
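To make the off-line design loop concrete, here is a minimal sketch for a scalar plant, where the ARE reduces to a quadratic in $p$ and has a closed-form stabilizing root. The function name `scalar_lqr` and the numbers $a = b = q = r = 1$ are assumptions chosen for illustration, not values from the talk; for matrix problems one would use an ARE solver such as the MATLAB LQR(A,B,Q,R) call named on the slide.

```python
import math

def scalar_lqr(a, b, q, r):
    """Scalar LQR: solve 0 = 2*a*p + q - (b*p)**2 / r for the root p > 0.

    Returns (p, k) with k = b*p/r, the scalar form of K = R^{-1} B^T P.
    """
    p = r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)
    k = b * p / r
    return p, k

# Illustrative (assumed) plant: x' = x + u, which is open-loop unstable.
a, b, q, r = 1.0, 1.0, 1.0, 1.0
p, k = scalar_lqr(a, b, q, r)
are_residual = 2 * a * p + q - (b * p) ** 2 / r   # should vanish
closed_loop_pole = a - b * k                      # should be negative
print(p, k, are_residual, closed_loop_pole)
```

Note that the design uses $a$ and $b$ directly, which is exactly the model dependence the rest of the talk sets out to remove.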
Bring together Optimal Control and Adaptive Control
Adaptive Control is online and works for unknown systems. It is generally not optimal.
Optimal Control is off-line, and needs to know the system dynamics to solve the design equations.
We want to find optimal control solutions online in real time, using adaptive control techniques, without knowing the full dynamics, for nonlinear systems and general performance indices.
Reinforcement Learning turns out to be the key to this!
Optimality in Biological Systems
Every living organism improves its control actions based on rewards received from the environment.
The resources available to living organisms are usually meager. Nature uses optimal control.
Reinforcement learning: Ivan Pavlov, 1890s
Actor-Critic Learning
[Figure: actor-critic diagram. The actor (adaptive learning system) sends control inputs to the system/environment and receives its outputs; the critic compares the outputs with the desired performance and issues a reinforcement signal that tunes the actor.]
We want OPTIMAL performance: ADP (Approximate Dynamic Programming)
Books
D. Vrabie, K. Vamvoudakis, and F.L. Lewis, Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles, IET Press, 2012.
F.L. Lewis, D. Vrabie, and V. Syrmos, Optimal Control, third edition, John Wiley and Sons, New York, 2012. New chapters on: Reinforcement Learning, Differential Games.
F.L. Lewis and D. Vrabie, “Reinforcement learning and adaptive dynamic programming for feedback control,” IEEE Circuits & Systems Magazine, Invited Feature Article, pp. 32-50, Third Quarter 2009.
F. Lewis, D. Vrabie, and K. Vamvoudakis, “Reinforcement learning and feedback control,” IEEE Control Systems Magazine, Dec. 2012.
Multi-player Game Solutions, IEEE Control Systems Magazine, Feb. 2017.
Bahare Kiumarsi, K. Vamvoudakis, H. Modares, and F.L. Lewis, “Optimal and Autonomous Control Using Reinforcement Learning: A Survey,” IEEE Trans. Neural Networks and Learning Systems, to appear 2018.
RL for Markov Decision Processes $(X, U, P, R)$
X = states, U = controls
P = probability of going to state x' from state x given that the control is u
R = expected reward on going to state x' from state x given that the control is u
R.S. Sutton and A.G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, Massachusetts, 1998.
D.P. Bertsekas and J.N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, MA, 1996.
W.B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, Wiley, New York, 2009.
Optimal control problem: determine a policy $\pi(x,u)$ to minimize the expected future cost.
Expected value of a policy:
$$V_k^{\pi}(x) = E_{\pi}\{J_{k,T} \mid x_k = x\} = E_{\pi}\Big\{\sum_{i=k}^{k+T} \gamma^{\,i-k} r_i \,\Big|\, x_k = x\Big\}$$
Optimal policy:
$$\pi^*(x,u) = \arg\min_{\pi} V_k^{\pi}(x) = \arg\min_{\pi} E_{\pi}\Big\{\sum_{i=k}^{k+T} \gamma^{\,i-k} r_i \,\Big|\, x_k = x\Big\}$$
Optimal value:
$$V_k^*(x) = \min_{\pi} V_k^{\pi}(x) = \min_{\pi} E_{\pi}\Big\{\sum_{i=k}^{k+T} \gamma^{\,i-k} r_i \,\Big|\, x_k = x\Big\}$$
Policy Iteration
Policy evaluation by Bellman eq.:
$$V_j(x) = \sum_{u} \pi_j(x,u) \sum_{x'} P^{u}_{xx'}\big[R^{u}_{xx'} + \gamma V_j(x')\big] \quad \text{for all } x \in X$$
Policy improvement:
$$\pi_{j+1}(x,u) = \arg\min_{u} \sum_{x'} P^{u}_{xx'}\big[R^{u}_{xx'} + \gamma V_j(x')\big] \quad \text{for all } x \in X$$
The policy evaluation equation is a system of N simultaneous linear equations, one for each state. Policy improvement makes $V_{\pi'}(x) \le V_{\pi}(x)$.
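The two policy-iteration steps can be sketched on a toy problem. Everything below is an illustrative assumption rather than material from the talk: a hypothetical two-state, two-action MDP with deterministic transitions, a discount factor of 0.9, and a deterministic policy stored as one action per state. Policy evaluation is done by solving the N linear equations directly, as the slide describes.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [mr - factor * mc for mr, mc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

GAMMA = 0.9   # assumed discount factor
N = 2         # number of states
# Hypothetical MDP: P[u][x][y] = transition probability, R[u][x][y] = cost.
P = [[[1.0, 0.0], [0.0, 1.0]],    # u = 0: stay in the current state
     [[0.0, 1.0], [1.0, 0.0]]]    # u = 1: switch to the other state
R = [[[0.0, 0.0], [0.0, 2.0]],    # staying costs 0 in state 0 and 2 in state 1
     [[0.0, 1.0], [1.0, 0.0]]]    # switching costs 1

def q_value(x, u, v):
    """Expected cost of applying u in x and then following value v."""
    return sum(P[u][x][y] * (R[u][x][y] + GAMMA * v[y]) for y in range(N))

def evaluate(policy):
    """Policy evaluation: solve (I - gamma * P_pi) V = c_pi, N linear equations."""
    a = [[(1.0 if x == y else 0.0) - GAMMA * P[policy[x]][x][y]
          for y in range(N)] for x in range(N)]
    b = [sum(P[policy[x]][x][y] * R[policy[x]][x][y] for y in range(N))
         for x in range(N)]
    return solve_linear(a, b)

def policy_iteration(policy):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    while True:
        v = evaluate(policy)
        improved = [min(range(2), key=lambda u: q_value(x, u, v))
                    for x in range(N)]
        if improved == policy:
            return policy, v
        policy = improved

policy, value = policy_iteration([1, 0])   # start from a deliberately poor policy
print(policy, value)
```

Starting from the poor policy, the value of each state does not increase from one iteration to the next, which is the improvement property stated on the slide.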
Discrete-Time Systems Optimal Adaptive Control
System: $x_{k+1} = f(x_k) + g(x_k)u_k$
Cost:
$$V_h(x_k) = \sum_{i=k}^{\infty} \gamma^{\,i-k} r(x_i, u_i)$$
Example: $r(x_k, u_k) = x_k^TQx_k + u_k^TRu_k$
Difference eq. equivalent:
$$V_h(x_k) = r(x_k, u_k) + \gamma \sum_{i=k+1}^{\infty} \gamma^{\,i-(k+1)} r(x_i, u_i)$$
Bellman equation:
$$V_h(x_k) = x_k^TQx_k + u_k^TRu_k + \gamma V_h(x_{k+1})$$
Hamiltonian:
$$H(x_k, \Delta V(x_k), u_k) = r(x_k, u_k) + \gamma V_h(x_{k+1}) - V_h(x_k)$$
System dynamics does not appear.
Continuous-time Systems Nonlinear Optimal Regulator
Nonlinear system dynamics: $\dot{x} = f(x,u) = f(x) + g(x)u$
Cost/value:
$$V(x(t)) = \int_t^\infty r(x,u)\,d\tau = \int_t^\infty \left(Q(x) + u^TRu\right) d\tau$$
Leibniz gives the differential equivalent, the Bellman equation in terms of the Hamiltonian function:
$$H\Big(x, \frac{\partial V}{\partial x}, u\Big) = \dot{V} + r(x,u) = \Big(\frac{\partial V}{\partial x}\Big)^T \dot{x} + r(x,u) = \Big(\frac{\partial V}{\partial x}\Big)^T \big(f(x) + g(x)u\big) + r(x,u) = 0$$
Problem: system dynamics shows up in the Hamiltonian. STABILITY?!
How can one do Policy Iteration for unknown continuous-time systems?
What is Value Iteration for continuous-time systems?
How can one do ADP for CT systems?
RL ADP has been developed for discrete-time systems.
Discrete-Time System Hamiltonian Function, $x_{k+1} = f(x_k, u_k)$:
$$H(x_k, \Delta V(x_k), h) = r(x_k, h(x_k)) + \gamma V_h(x_{k+1}) - V_h(x_k)$$
Directly leads to temporal difference techniques. System dynamics does not occur. Two occurrences of value allow APPROXIMATE DYNAMIC PROGRAMMING methods.
Continuous-Time System Hamiltonian Function, $\dot{x} = f(x,u)$:
$$H\Big(x, \frac{\partial V}{\partial x}, u\Big) = \dot{V} + r(x,u) = \Big(\frac{\partial V}{\partial x}\Big)^T \dot{x} + r(x,u) = \Big(\frac{\partial V}{\partial x}\Big)^T f(x,u) + r(x,u)$$
Leads to off-line solutions if system dynamics is known. Hard to do on-line learning.
How to define temporal difference? System dynamics DOES occur. Only ONE occurrence of value gradient.
Discrete-Time Systems Adaptive (Approximate) Dynamic Programming
Value Iteration
Four ADP Methods proposed by Paul Werbos
Critic NN to approximate:
Heuristic dynamic programming: value $V(x_k)$
Dual heuristic programming: gradient $\partial V / \partial x$
AD Heuristic dynamic programming (Watkins Q Learning): Q function $Q(x_k, u_k)$
AD Dual heuristic programming: gradients $\partial Q / \partial x$, $\partial Q / \partial u$
Action NN to approximate the Control
Bertsekas: Neurodynamic Programming
Barto & Bradtke: Q-learning proof (imposed a settling time)
Derivation of Linear Quadratic Regulator
System: $\dot{x} = Ax + Bu$
Cost:
$$V(x(t)) = \int_t^\infty \left(x^TQx + u^TRu\right) d\tau = x^T(t)Px(t)$$
Differentiate using Leibniz’ formula:
$$\dot{V} = -\left(x^TQx + u^TRu\right)$$
Differential equivalent is the Bellman equation:
$$0 = H\Big(x, \frac{\partial V}{\partial x}, u\Big) = \dot{V} + x^TQx + u^TRu = 2x^TP\dot{x} + x^TQx + u^TRu = 2x^TP(Ax + Bu) + x^TQx + u^TRu$$
Stationarity condition:
$$0 = \frac{\partial H}{\partial u} = 2Ru + 2B^TPx$$
Optimal Control is $u = -R^{-1}B^TPx = -Kx$
HJB equation (scalar equation):
$$0 = 2x^TP\left(Ax - BR^{-1}B^TPx\right) + x^TQx + x^TPBR^{-1}B^TPx$$
$$0 = x^TPAx + x^TA^TPx + x^TQx - x^TPBR^{-1}B^TPx$$
Given any stabilizing FB policy $u = -Kx$, the Lyapunov equation:
$$0 = H\Big(x, \frac{\partial V}{\partial x}, u\Big) = x^T\left[(A - BK)^TP + P(A - BK) + Q + K^TRK\right]x$$
Full system dynamics must be known. Off-line solution.
Standard Solution for Linear Quadratic Regulator
System: $\dot{x} = Ax + Bu$
Cost:
$$V(x(t)) = \int_t^\infty \left(x^TQx + u^TRu\right) d\tau = x^T(t)Px(t)$$
Differential equivalent is the Bellman equation:
$$0 = H\Big(x, \frac{\partial V}{\partial x}, u\Big) = 2x^TP(Ax + Bu) + x^TQx + u^TRu$$
Given any stabilizing FB policy $u = -Kx$, the cost value is found by solving Lyapunov equation = Bellman equation.
Scalar equation:
$$0 = H\Big(x, \frac{\partial V}{\partial x}, u\Big) = 2x^TP(A - BK)x + x^TQx + x^TK^TRKx = x^T\left[(A - BK)^TP + P(A - BK) + Q + K^TRK\right]x$$
Matrix equation:
$$0 = (A - BK)^TP + P(A - BK) + Q + K^TRK$$
Optimal Control is $u = -R^{-1}B^TPx = -Kx$
HJB equation (scalar equation):
$$0 = x^TPAx + x^TA^TPx + x^TQx - x^TPBR^{-1}B^TPx$$
Algebraic Riccati equation (matrix equation):
$$0 = PA + A^TP + Q - PBR^{-1}B^TP$$
Full system dynamics must be known. Off-line solution.
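CT Systems: Derivation of Nonlinear Optimal Regulator
Nonlinear system dynamics: $\dot{x} = f(x,u) = f(x) + g(x)u$
Cost/value:
$$V(x(t)) = \int_t^\infty r(x,u)\,d\tau = \int_t^\infty \left(Q(x) + u^TRu\right) d\tau$$
Leibniz gives the differential equivalent, the Bellman equation in terms of the Hamiltonian function:
$$H\Big(x, \frac{\partial V}{\partial x}, u\Big) = \dot{V} + r(x,u) = \Big(\frac{\partial V}{\partial x}\Big)^T \dot{x} + r(x,u) = \Big(\frac{\partial V}{\partial x}\Big)^T \big(f(x) + g(x)u\big) + r(x,u) = 0$$
Stationarity condition: $\partial H / \partial u = 0$
Stationary control policy:
$$u = h(x) = -\tfrac{1}{2}R^{-1}g^T(x)\frac{\partial V}{\partial x}$$
HJB equation:
$$0 = \Big(\frac{dV^*}{dx}\Big)^T f + Q(x) - \tfrac{1}{4}\Big(\frac{dV^*}{dx}\Big)^T gR^{-1}g^T \frac{dV^*}{dx}, \qquad V(0) = 0$$
Off-line solution. HJB is hard to solve and may not have a smooth solution. Dynamics must be known.
To find online methods for optimal control, focus on these two equations (the Bellman equation and the control policy).
Problem: system dynamics shows up in the Hamiltonian.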
CT Policy Iteration: a Reinforcement Learning Technique
Utility: $r(x,u) = Q(x) + u^TRu$
Given any admissible policy $u(x) = h(x)$, the cost is given by solving the CT Bellman equation (scalar equation):
$$0 = \Big(\frac{\partial V}{\partial x}\Big)^T f(x,u) + r(x,u) \equiv H\Big(x, \frac{\partial V}{\partial x}, u\Big)$$
Policy Iteration Solution
Pick stabilizing initial control policy $h_0(x)$.
Policy evaluation (find cost, Bellman eq.):
$$0 = \Big(\frac{\partial V_j}{\partial x}\Big)^T f(x, h_j(x)) + r(x, h_j(x)), \qquad V_j(0) = 0$$
Policy improvement (update control):
$$h_{j+1}(x) = -\tfrac{1}{2}R^{-1}g^T(x)\frac{\partial V_j}{\partial x}$$
Converges to the solution of the HJB equation:
$$0 = \Big(\frac{dV^*}{dx}\Big)^T f + Q(x) - \tfrac{1}{4}\Big(\frac{dV^*}{dx}\Big)^T gR^{-1}g^T \frac{dV^*}{dx}$$
Full system dynamics must be known. Off-line solution.
Convergence proved by Leake and Liu 1967, Saridis 1979 if the Lyapunov eq. is solved exactly.
Beard & Saridis used Galerkin integrals to solve the Lyapunov eq.
Abu Khalaf & Lewis used NN to approximate V for nonlinear systems and proved convergence.
M. Abu-Khalaf, F.L. Lewis, and J. Huang, “Policy iterations on the Hamilton-Jacobi-Isaacs equation for H-infinity state feedback control with input saturation,” IEEE Trans. Automatic Control, vol. 51, no. 12, pp. 1989-1995, Dec. 2006.
Policy Iterations for the Linear Quadratic Regulator
System: $\dot{x} = Ax + Bu$
Cost:
$$V(x(t)) = \int_t^\infty \left(x^TQx + u^TRu\right) d\tau = x^T(t)Px(t)$$
Differential equivalent is the Bellman equation:
$$0 = H\Big(x, \frac{\partial V}{\partial x}, u\Big) = \dot{V} + x^TQx + u^TRu = 2x^TP\dot{x} + x^TQx + u^TRu = 2x^TP(Ax + Bu) + x^TQx + u^TRu$$
Given any stabilizing FB policy $u = -Kx$, the cost value is found by solving Lyapunov equation = Bellman equation:
$$0 = (A - BK)^TP + P(A - BK) + Q + K^TRK$$
Optimal Control is $u = -R^{-1}B^TPx = -Kx$
Algebraic Riccati equation:
$$0 = PA + A^TP + Q - PBR^{-1}B^TP$$
Full system dynamics must be known. Off-line solution.
LQR Policy iteration = Kleinman algorithm (Kleinman 1968)
1. For a given control policy $u = -K_jx$, solve for the cost (Bellman eq. = Lyapunov eq., a matrix equation):
$$0 = A_j^TP_j + P_jA_j + Q + K_j^TRK_j, \qquad A_j = A - BK_j$$
2. Improve policy:
$$K_{j+1} = R^{-1}B^TP_j$$
If started with a stabilizing control policy $K_0$, the matrix $P_j$ monotonically converges to the unique positive definite solution of the Riccati equation.
Every iteration step will return a stabilizing controller. The system has to be known.
OFF-LINE DESIGN. MUST SOLVE LYAPUNOV EQUATION AT EACH STEP.
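The two Kleinman steps can be traced on a scalar plant, where the Lyapunov equation has a one-line solution. This is a sketch under stated assumptions: the function name `kleinman_scalar`, the plant $a = b = q = r = 1$ and the initial gain $k_0 = 3$ are hypothetical illustration values, not taken from the slides.

```python
import math

def kleinman_scalar(a, b, q, r, k0, iterations=20):
    """Scalar Kleinman iteration for x' = a*x + b*u with u = -k*x.

    Step 1 solves the scalar Lyapunov equation 0 = 2*(a - b*k)*p + q + r*k**2.
    Step 2 applies the policy improvement k = b*p/r.
    """
    if a - b * k0 >= 0:
        raise ValueError("the initial gain must be stabilizing")
    k = k0
    history = []
    for _ in range(iterations):
        a_cl = a - b * k                       # closed-loop dynamics A_j
        p = -(q + r * k * k) / (2.0 * a_cl)    # policy evaluation
        k = b * p / r                          # policy improvement
        history.append(p)
    return p, k, history

p, k, history = kleinman_scalar(a=1.0, b=1.0, q=1.0, r=1.0, k0=3.0)
print(history[:4], p, k)
```

In this example the cost sequence decreases monotonically toward the stabilizing Riccati root $1 + \sqrt{2}$, but each step still needs the model parameter $a$ to solve the Lyapunov equation.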
Integral Reinforcement Learning
Work of Draguna Vrabie
System: $\dot{x} = f(x) + g(x)u$
Policy iteration requires repeated solution of the CT Bellman equation:
$$0 = \dot{V} + r(x, u(x)) = \Big(\frac{\partial V}{\partial x}\Big)^T \dot{x} + r(x, u(x)) = \Big(\frac{\partial V}{\partial x}\Big)^T f(x, u(x)) + Q(x) + u^TRu \equiv H\Big(x, \frac{\partial V}{\partial x}, u(x)\Big)$$
This can be done online without knowing f(x), using measurements of x(t), u(t) along the system trajectories.
Can avoid knowledge of the drift term f(x).
D. Vrabie, O. Pastravanu, M. Abu-Khalaf, and F. L. Lewis, “Adaptive optimal control for continuous-time linear systems based on policy iteration,” Automatica, vol. 45, pp. 477-484, 2009.
Integral Reinforcement Learning: work of Draguna Vrabie 2009
Key Idea = US Patent
Lemma 1 (Draguna Vrabie)
Bad Bellman Equation:
$$0 = \Big(\frac{\partial V}{\partial x}\Big)^T f(x,u) + r(x,u) \equiv H\Big(x, \frac{\partial V}{\partial x}, u\Big), \qquad V(0) = 0$$
is equivalent to
Good Bellman Equation, the integral reinforcement (IRL) form of the CT Bellman eq.:
$$V(x(t)) = \int_t^{t+T} r(x,u)\,d\tau + V(x(t+T)), \qquad V(0) = 0$$
Solves the Bellman equation without knowing f(x,u).
Value:
$$V(x(t)) = \int_t^\infty r(x,u)\,d\tau = \int_t^{t+T} r(x,u)\,d\tau + \int_{t+T}^\infty r(x,u)\,d\tau$$
Allows definition of temporal difference error for CT systems:
$$e(t) = -V(x(t)) + \int_t^{t+T} r(x,u)\,d\tau + V(x(t+T))$$
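A short sketch of why the two forms agree, assuming $V$ is continuously differentiable and $(x, u)$ is a trajectory of $\dot{x} = f(x,u)$. The symbols are those of the slide; the argument is the standard integration step and is not quoted from the talk.

```latex
\begin{align*}
\frac{d}{d\tau} V(x(\tau)) &= \Big(\frac{\partial V}{\partial x}\Big)^{T} f(x,u) = -\,r(x,u)
  && \text{(differential Bellman equation)} \\
V(x(t+T)) - V(x(t)) &= -\int_{t}^{t+T} r(x,u)\,d\tau
  && \text{(integrate over } [t,\,t+T]\text{)} \\
V(x(t)) &= \int_{t}^{t+T} r(x,u)\,d\tau + V(x(t+T))
  && \text{(IRL form)}
\end{align*}
```

Conversely, dividing the IRL form by $T$ and letting $T \to 0$ recovers the differential form. The drift $f(x,u)$ appears only in the first line; the last line involves only quantities that can be measured along the trajectory.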
Integral Reinforcement Learning (IRL): Draguna Vrabie
IRL Policy iteration. Initial stabilizing control is needed.
Policy evaluation, IRL Bellman Equation (cost update):
$$V_k(x(t)) = \int_t^{t+T} r(x, u_k)\,dt + V_k(x(t+T))$$
f(x) and g(x) do not appear.
Equivalent to the CT Bellman eq.:
$$0 = \Big(\frac{\partial V}{\partial x}\Big)^T f(x,u) + r(x,u) \equiv H\Big(x, \frac{\partial V}{\partial x}, u\Big)$$
Solves the Bellman eq. (nonlinear Lyapunov eq.) without knowing system dynamics.
Policy improvement (control gain update):
$$u_{k+1} = h_{k+1}(x) = -\tfrac{1}{2}R^{-1}g^T(x)\frac{\partial V_k}{\partial x}$$
g(x) is needed for the control update.
Converges to the solution of the HJB eq.:
$$0 = \Big(\frac{dV^*}{dx}\Big)^T f + Q(x) - \tfrac{1}{4}\Big(\frac{dV^*}{dx}\Big)^T gR^{-1}g^T \frac{dV^*}{dx}$$
D. Vrabie, O. Pastravanu, M. Abu-Khalaf, and F. L. Lewis, “Adaptive optimal control for continuous-time linear systems based on policy iteration,” Automatica, vol. 45, pp. 477-484, 2009.
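CT Policy Iteration: How to implement online?
Linear Systems Quadratic Cost: LQR
Value function is quadratic: $V(x(t)) = x^T(t)Px(t)$
Policy evaluation, solve the IRL Bellman Equation:
$$x^T(t)P_kx(t) = \int_t^{t+T} x^T\left(Q + K_k^TRK_k\right)x\,d\tau + x^T(t+T)P_kx(t+T)$$
$$x^T(t)P_kx(t) - x^T(t+T)P_kx(t+T) = \int_t^{t+T} x^T\left(Q + K_k^TRK_k\right)x\,d\tau$$
Quadratic basis set (two-state case):
$$x^T(t)P_kx(t) - x^T(t+T)P_kx(t+T) = \begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} p_{11} & p_{12} \\ p_{12} & p_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\Bigg|_{t} - \begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} p_{11} & p_{12} \\ p_{12} & p_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\Bigg|_{t+T}$$
$$= \begin{bmatrix} p_{11} & p_{12} & p_{22} \end{bmatrix} \begin{bmatrix} x_1^2 \\ 2x_1x_2 \\ x_2^2 \end{bmatrix}\Bigg|_{t} - \begin{bmatrix} p_{11} & p_{12} & p_{22} \end{bmatrix} \begin{bmatrix} x_1^2 \\ 2x_1x_2 \\ x_2^2 \end{bmatrix}\Bigg|_{t+T} = \bar{p}_k^T\left[\bar{x}(t) - \bar{x}(t+T)\right]$$
$$\bar{p}_k^T\left[\bar{x}(t) - \bar{x}(t+T)\right] = \int_t^{t+T} x^T\left(Q + K_k^TRK_k\right)x\,d\tau \equiv \rho(t, t+T)$$
Same form as standard System ID problems.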
Approximate Dynamic Programming Implementation
Direct Optimal Adaptive Control for Partially Unknown CT Systems
Value Function Approximation (VFA) to solve the Bellman Equation: Paul Werbos (ADP), Dimitri Bertsekas (NDP)
$$V_k(x(t)) = \int_t^{t+T} \left(Q(x) + u_k^TRu_k\right) dt + V_k(x(t+T))$$
Approximate the value by a Weierstrass Approximator Network, $V = W^T\phi(x)$:
$$W_k^T\phi(x(t)) = \int_t^{t+T} \left(Q(x) + u_k^TRu_k\right) dt + W_k^T\phi(x(t+T))$$
$$W_k^T\left[\phi(x(t)) - \phi(x(t+T))\right] = \int_t^{t+T} \left(Q(x) + u_k^TRu_k\right) dt$$
The bracketed term on the left is the regression vector; the right-hand side is the reinforcement on the time interval [t, t+T].
Scalar equation with vector unknowns. Same form as standard System ID problems in Adaptive Control.
Now use RLS or batch least-squares along the trajectory to get new weights $W_k$.
Then find the updated FB:
$$u_{k+1} = h_{k+1}(x) = -\tfrac{1}{2}R^{-1}g^T(x)\frac{\partial V_k}{\partial x} = -\tfrac{1}{2}R^{-1}g^T(x)\Big[\frac{\partial \phi(x(t))}{\partial x(t)}\Big]^T W_k$$
Optimal Control and Adaptive Control come together on this slide, because of RL.
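Solving the IRL Bellman Equation: LQR case
LQR case: $V(x(t)) = x^T(t)Px(t)$, with
$$P = \begin{bmatrix} p_{11} & p_{12} \\ p_{12} & p_{22} \end{bmatrix}, \qquad W = \begin{bmatrix} p_{11} & p_{12} & p_{22} \end{bmatrix}^T$$
Solve for the value function parameters. Need data from 3 time intervals to get 3 equations to solve for 3 unknowns:
$$W_k^T\left[\phi(x(t)) - \phi(x(t+T))\right] = \int_t^{t+T} \left(Q(x) + u_k^TRu_k\right) dt$$
$$W_k^T\left[\phi(x(t+T)) - \phi(x(t+2T))\right] = \int_{t+T}^{t+2T} \left(Q(x) + u_k^TRu_k\right) dt$$
$$W_k^T\left[\phi(x(t+2T)) - \phi(x(t+3T))\right] = \int_{t+2T}^{t+3T} \left(Q(x) + u_k^TRu_k\right) dt$$
Now solve by batch least-squares.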
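The batch least-squares step can be sketched end to end on a small example. All specifics below are assumptions for illustration and do not come from the talk: a double-integrator plant, the stabilizing gain $K = [1, 1]$, $Q = I$, $R = 1$, six intervals of length $T = 0.5$, and a fixed-step RK4 simulator standing in for the real plant. The learner sees only the sampled states and the integrated cost; the matrix $A$ is used solely inside the simulator.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [mr - factor * mc for mr, mc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

# Hypothetical plant; A is used ONLY by the simulator below, never by the learner.
A = [[0.0, 1.0], [0.0, 0.0]]
B = [0.0, 1.0]
Q = [[1.0, 0.0], [0.0, 1.0]]
R = 1.0
K = [1.0, 1.0]   # assumed stabilizing gain for the policy u = -K x

def deriv(z):
    """Augmented dynamics z = (x1, x2, rho) with rho' = x'Qx + u'Ru."""
    x1, x2, _ = z
    u = -(K[0] * x1 + K[1] * x2)
    dx1 = A[0][0] * x1 + A[0][1] * x2 + B[0] * u
    dx2 = A[1][0] * x1 + A[1][1] * x2 + B[1] * u
    cost = Q[0][0] * x1 * x1 + 2 * Q[0][1] * x1 * x2 + Q[1][1] * x2 * x2 + R * u * u
    return [dx1, dx2, cost]

def rk4_step(z, dt):
    k1 = deriv(z)
    k2 = deriv([zi + 0.5 * dt * ki for zi, ki in zip(z, k1)])
    k3 = deriv([zi + 0.5 * dt * ki for zi, ki in zip(z, k2)])
    k4 = deriv([zi + dt * ki for zi, ki in zip(z, k3)])
    return [zi + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for zi, a, b, c, d in zip(z, k1, k2, k3, k4)]

def basis(x1, x2):
    """Quadratic basis matching W = [p11, p12, p22]."""
    return [x1 * x1, 2.0 * x1 * x2, x2 * x2]

def collect(x0, T=0.5, intervals=6, dt=0.001):
    """Record one (regression vector, integral reinforcement) pair per interval."""
    rows, targets = [], []
    z = [x0[0], x0[1], 0.0]
    for _ in range(intervals):
        start = basis(z[0], z[1])
        z[2] = 0.0                      # restart the cost integral
        for _ in range(round(T / dt)):
            z = rk4_step(z, dt)
        end = basis(z[0], z[1])
        rows.append([s - e for s, e in zip(start, end)])
        targets.append(z[2])
    return rows, targets

def batch_least_squares(rows, targets):
    """Solve the normal equations (Phi^T Phi) W = Phi^T rho."""
    n = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    atb = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(n)]
    return solve_linear(ata, atb)

rows, targets = collect(x0=[1.0, 0.0])
p11, p12, p22 = batch_least_squares(rows, targets)
# Policy improvement K = R^{-1} B^T P needs B but not A.
K_next = [(B[0] * p11 + B[1] * p12) / R, (B[0] * p12 + B[1] * p22) / R]
print(p11, p12, p22, K_next)
```

For this assumed setup the Lyapunov equation can be solved by hand, giving $P = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}$, which the data-based estimate should reproduce. Using more than three intervals, as here, turns the three-equation solve into an overdetermined least-squares fit.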
Integral Reinforcement Learning (IRL)
Solve the Bellman Equation: solves the Lyapunov eq. without knowing dynamics
$$W_k^T\left[\phi(x(t)) - \phi(x(t+T))\right] = \int_t^{t+T} x^T(\tau)\left(Q + K_k^TRK_k\right)x(\tau)\,d\tau \equiv \rho(t, t+T)$$
Data set at time [t, t+T): $\big(x(t),\ \rho(t, t+T),\ x(t+T)\big)$
Interval t to t+T: observe x(t), apply $u_k = -K_kx$, observe cost integral $\rho(t, t+T)$, update P
Interval t+T to t+2T: observe x(t+T), apply $u_k = -K_kx$, observe cost integral $\rho(t+T, t+2T)$, update P
Interval t+2T to t+3T: observe x(t+2T), apply $u_k = -K_kx$, observe cost integral $\rho(t+2T, t+3T)$, update P
Do RLS until convergence to $P_k$, or use batch least-squares.
Update control gain: $K_{k+1} = R^{-1}B^TP_k$
This is a data-based approach that uses measurements of x(t), u(t) instead of the plant dynamical model.
A is not needed anywhere.
Continuous-time control with discrete gain updates
Control: $u_k(t) = -K_kx(t)$
Gain update (Policy)
[Figure: gain $K_k$ plotted against time t as a staircase, with updates at k = 0, 1, 2, 3, 4, 5 separated by intervals T]
Reinforcement intervals T need not be the same. They can be selected on-line in real time. Interval T can vary.
Persistence of Excitation
$$W_k^T\left[\phi(x(t)) - \phi(x(t+T))\right] = \int_t^{t+T} \left(Q(x) + u_k^TRu_k\right) dt$$
The regression vector $\phi(x(t)) - \phi(x(t+T))$ must be PE.
This relates to the choice of reinforcement interval T.
Implementation
Policy evaluation: need to solve online
$$W_k^T\left[\phi(x(t)) - \phi(x(t+T))\right] = \int_t^{t+T} x^T(\tau)\left(Q + K_k^TRK_k\right)x(\tau)\,d\tau \equiv \rho(t, t+T)$$
Add a new state = Integral Reinforcement:
$$\dot{\rho} = x^TQx + u^TRu$$
This is the controller dynamics or memory.
Direct Optimal Adaptive Controller (Draguna Vrabie)
CT Actor-Critic Structure
Solves the Riccati Equation online without knowing the A matrix.
A hybrid continuous/discrete dynamic controller whose internal state is the observed cost over the interval.
[Figure: block diagram. System $\dot{x} = Ax + Bu$ with state x and input u; Actor gain K; Critic V fed by the dynamic control system with MEMORY, $\dot{\rho} = x^TQx + u^TRu$, sampled through a ZOH with period T.]
Run RLS or use batch L.S. to identify the value of the current control.
Update the FB gain after the Critic has converged.
Reinforcement interval T can be selected on-line on the fly and can change.
Optimal Adaptive IRL for CT systems (D. Vrabie, 2009)
A new structure of adaptive controllers
Actor / Critic structure for CT Systems
$$V_k(x(t)) = \int_t^{t+T} r(x, u_k)\,dt + V_k(x(t+T))$$
$$u_{k+1} = h_{k+1}(x) = -\tfrac{1}{2}R^{-1}g^T(x)\frac{\partial V_k}{\partial x}$$
Reinforcement learning: theta waves 4-8 Hz
Motor control: 200 Hz
Data-driven Online Adaptive Optimal Control (DDO)
System: $\dot{x} = Ax + Bu$
Control: $u = -Kx$ (on-line control loop)
On-line performance loop, with user-prescribed optimization criterion $J(Q, R)$:
$$x^T(t)P_kx(t) = \int_t^{t+T} x^T\left(Q + K_k^TRK_k\right)x\,d\tau + x^T(t+T)P_kx(t+T)$$
$$K_{k+1} = R^{-1}B^TP_k$$
Data set at time [t, t+T): $\big(x(t),\ \rho(t, t+T),\ x(t+T)\big)$
This replaces the off-line design $K = R^{-1}B^TP$, $0 = PA + A^TP + Q - PBR^{-1}B^TP$.
An online supervisory control procedure that requires no knowledge of the system dynamics model A.
Automatically tunes the control gains in real time to optimize a user-given cost function.
Uses measured data (u(t), x(t)) along system trajectories.
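Simulation 1: F-16 aircraft pitch rate controller (Stevens and Lewis 2003)
$$\dot{x} = \begin{bmatrix} -1.01887 & 0.90506 & -0.00215 \\ 0.82225 & -1.07741 & -0.17555 \\ 0 & 0 & -1 \end{bmatrix} x + \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} u$$
$x = [\alpha \ \ q \ \ \delta_e]^T$, $Q = I$, $R = I$
Select quadratic NN basis set for VFA.
Exact solution of the ARE $0 = PA + A^TP + Q - PBR^{-1}B^TP$:
$$W_1^* = [p_{11} \ \ 2p_{12} \ \ 2p_{13} \ \ p_{22} \ \ 2p_{23} \ \ p_{33}]^T = [1.4245 \ \ 1.1682 \ \ {-0.1352} \ \ 1.4349 \ \ {-0.1501} \ \ 0.4329]^T$$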
Simulations on: F-16 autopilot
[Figure: four panels plotted against time (s): system states, control signal, controller parameters, and critic parameters P(1,1), P(1,2), P(2,2) together with their optimal values]
A matrix not needed.
Converges to the steady-state Riccati equation solution.
Solves the ARE online without knowing A:
$$0 = PA + A^TP + Q - PBR^{-1}B^TP$$
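Simulation 2: Load Frequency Control of Electric Power System
A. Al-Tamimi, D. Vrabie, Youyi Wang
$\dot{x} = Ax + Bu$, with $x(t) = [\Delta f(t) \ \ \Delta P_g(t) \ \ \Delta X_g(t) \ \ \Delta E(t)]^T$: frequency, generator output, governor position, integral control
$$A = \begin{bmatrix} -1/T_p & K_p/T_p & 0 & 0 \\ 0 & -1/T_T & 1/T_T & 0 \\ -1/(RT_G) & 0 & -1/T_G & -1/T_G \\ K_E & 0 & 0 & 0 \end{bmatrix}, \qquad B = \begin{bmatrix} 0 \\ 0 \\ 1/T_G \\ 0 \end{bmatrix}$$
ARE: $0 = PA + A^TP + Q - PBR^{-1}B^TP$
ARE solution using the full dynamics model (A,B):
$$P_{ARE} = \begin{bmatrix} 0.4750 & 0.4766 & 0.0601 & 0.4751 \\ 0.4766 & 0.7831 & 0.1237 & 0.3829 \\ 0.0601 & 0.1237 & 0.0513 & 0.0298 \\ 0.4751 & 0.3829 & 0.0298 & 2.3370 \end{bmatrix}$$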
Solves the ARE online without knowing A:
$$0 = PA + A^TP + Q - PBR^{-1}B^TP$$
IRL period of T = 0.1 s. Fifteen data points $\big(x(t),\ x(t+T),\ \rho(t : t+T)\big)$. Hence, the value estimate was updated every 1.5 s.
$$P_{ARE} = \begin{bmatrix} 0.4750 & 0.4766 & 0.0601 & 0.4751 \\ 0.4766 & 0.7831 & 0.1237 & 0.3829 \\ 0.0601 & 0.1237 & 0.0513 & 0.0298 \\ 0.4751 & 0.3829 & 0.0298 & 2.3370 \end{bmatrix}$$
$$P_{critic\ NN} = \begin{bmatrix} 0.4802 & 0.4768 & 0.0603 & 0.4754 \\ 0.4768 & 0.7887 & 0.1239 & 0.3834 \\ 0.0603 & 0.1239 & 0.0567 & 0.0300 \\ 0.4754 & 0.3843 & 0.0300 & 2.3433 \end{bmatrix}$$
[Figure: P matrix parameters P(1,1), P(1,3), P(2,4), P(4,4) versus time (s), 0 to 60 s]
[Figure: system states versus time (s), 0 to 6 s]
Optimal Control Design Allows a Lot of Design Freedom
Issues with Nonlinear ADP
Selection of NN Training Set
$$0 = \Big(\frac{\partial V}{\partial x}\Big)^T f(x,u) + r(x,u) \equiv H\Big(x, \frac{\partial V}{\partial x}, u\Big), \qquad V(0) = 0$$
$$V(x(t)) = \int_t^{t+T} r(x,u)\,d\tau + V(x(t+T)), \qquad V(0) = 0$$
Batch LS: integral over a region of state-space, approximated using a set of points.
Recursive Least-Squares (RLS): take sample points along a single trajectory.
LS local smooth solution for Critic NN update.
[Figure: set of points over a region vs. points along a trajectory, each drawn on axes x1, x2, and time]
PE versus Exploration
For nonlinear systems, persistence of excitation is needed to solve for the weights, but EXPLORATION is needed to identify the complete value function.
For linear systems, these are the same.
IRL Policy iteration
Initial stabilizing control is needed
Policy evaluation (cost update)‐ IRL Bellman Equation
$V_k(x(t)) = \int_t^{t+T} r(x,u_k)\,d\tau + V_k(x(t+T))$
Policy improvement (control gain update)
$u_{k+1} = h_{k+1}(x) = -\tfrac{1}{2} R^{-1} g^T(x)\,\frac{\partial V_k}{\partial x}$
CT PI Bellman eq. = Lyapunov eq.
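The two IRL policy iteration steps can be tried on a scalar linear plant. The following is a minimal sketch, not the authors' implementation. It assumes a hypothetical plant x' = a x + b u with cost q x^2 + r u^2 and V_k(x) = p x^2, and it restarts every data interval from the same state x0 to sidestep the excitation issue. The helper names `run_interval` and `irl_policy_iteration` are illustrative. The drift a appears only in the simulator that stands in for the real plant; the learner uses measured states, the integral reinforcement, and b.

```python
import math

def run_interval(x0, k, a, b, q, r, T, steps=1000):
    """Simulate x' = a*x + b*u with u = -k*x over one interval of length T.

    Returns x(T) and rho = integral of (q*x^2 + r*u^2) over the interval.
    The drift a is used only here, as a stand-in for the real plant.
    """
    h = T / steps
    ac = a - b * k                                  # closed-loop dynamics
    cost = lambda x: (q + r * k * k) * x * x
    x, rho = x0, 0.0
    for _ in range(steps):
        # classical RK4 step for the state
        s1 = ac * x
        s2 = ac * (x + 0.5 * h * s1)
        s3 = ac * (x + 0.5 * h * s2)
        s4 = ac * (x + h * s3)
        x_next = x + h * (s1 + 2 * s2 + 2 * s3 + s4) / 6.0
        # trapezoidal rule for the running cost
        rho += 0.5 * h * (cost(x) + cost(x_next))
        x = x_next
    return x, rho

def irl_policy_iteration(a, b, q, r, k0, T=0.1, iters=10, x0=1.0):
    """IRL policy iteration for V_k(x) = p*x^2, starting from a stabilizing gain k0."""
    k, p = k0, 0.0
    for _ in range(iters):
        xT, rho = run_interval(x0, k, a, b, q, r, T)
        # policy evaluation: p*x0^2 = rho + p*xT^2
        p = rho / (x0 ** 2 - xT ** 2)
        # policy improvement: u = -1/2 R^-1 g^T dV/dx = -(b*p/r) * x
        k = b * p / r
    return p, k

p, k = irl_policy_iteration(a=-1.0, b=1.0, q=1.0, r=1.0, k0=0.0)
p_are = math.sqrt(2.0) - 1.0   # positive root of 2*a*p + q - (b*p)**2/r = 0 for these values
print(p, k, p_are)
```

With the open-loop stable choice a = -1, the zero gain k0 = 0 is an admissible initial policy, and the iterates approach the scalar Riccati solution.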
IRL Value Iteration - Draguna Vrabie
Converges to solution to HJB eq.
$0 = \left(\frac{dV^*}{dx}\right)^T f + Q(x) - \tfrac{1}{4}\left(\frac{dV^*}{dx}\right)^T g R^{-1} g^T \frac{dV^*}{dx}$
IRL Value iteration
Initial stabilizing control is NOT needed
Value evaluation (cost update)‐ IRL Bellman Equation
$V_{k+1}(x(t)) = \int_t^{t+T} r(x,u_k)\,d\tau + V_k(x(t+T))$
Policy improvement (control gain update)
$u_{k+1} = h_{k+1}(x) = -\tfrac{1}{2} R^{-1} g^T(x)\,\frac{\partial V_{k+1}}{\partial x}$
CT VI Bellman eq.
Converges if T is small enough
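The value iteration variant can be sketched on the same kind of scalar example. This is an illustrative sketch under stated assumptions, not the authors' code. It assumes a hypothetical open-loop unstable plant x' = a x + b u with a = 1, cost q x^2 + r u^2, V_k(x) = p_k x^2, and a restart from the same state x0 for each interval. It starts from V_0 = 0, so the first policy is u = 0 and is not stabilizing. The names `run_interval` and `irl_value_iteration` are illustrative.

```python
import math

def run_interval(x0, k, a, b, q, r, T, steps=1000):
    """Simulate x' = a*x + b*u with u = -k*x over one interval of length T.

    Returns x(T) and rho = integral of (q*x^2 + r*u^2) over the interval.
    The drift a is used only here, as a stand-in for the real plant.
    """
    h = T / steps
    ac = a - b * k
    cost = lambda x: (q + r * k * k) * x * x
    x, rho = x0, 0.0
    for _ in range(steps):
        s1 = ac * x
        s2 = ac * (x + 0.5 * h * s1)
        s3 = ac * (x + 0.5 * h * s2)
        s4 = ac * (x + h * s3)
        x_next = x + h * (s1 + 2 * s2 + 2 * s3 + s4) / 6.0
        rho += 0.5 * h * (cost(x) + cost(x_next))   # trapezoidal rule
        x = x_next
    return x, rho

def irl_value_iteration(a, b, q, r, T=0.1, iters=200, x0=1.0):
    """IRL value iteration for V_k(x) = p_k*x^2, starting from V_0 = 0."""
    p = 0.0
    for _ in range(iters):
        k = b * p / r                       # policy derived from the current value V_k
        xT, rho = run_interval(x0, k, a, b, q, r, T)
        # value update: V_{k+1}(x0) = rho + V_k(x(T))
        p = (rho + p * xT ** 2) / x0 ** 2
    return p, b * p / r

p, k = irl_value_iteration(a=1.0, b=1.0, q=1.0, r=1.0)
p_are = 1.0 + math.sqrt(2.0)   # positive root of 2*a*p + q - (b*p)**2/r = 0 for these values
print(p, k, p_are)
```

For small T each update behaves like one integration step of the Riccati differential equation, which is one way to read the remark that convergence requires T to be small enough.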
![Page 41: F.L. Lewis - UTA 06 RL short course SEU/RL papers/PPT/… · Derivation of Linear Quadratic Regulator System Cost Differentiate using Leibniz’ formula Optimal Control is Scalar](https://reader034.vdocuments.us/reader034/viewer/2022042216/5ebfc19362f44a64f602cda7/html5/thumbnails/41.jpg)
孔子 Kung Tz (Confucius), 500 BC
Archery, Chariot driving, Music, Rites and Rituals, Poetry, Mathematics
Man’s relations to: Family, Friends, Society, Nation, Emperor, Ancestors
Tian xia da tong: Harmony under heaven
![Page 42: F.L. Lewis - UTA 06 RL short course SEU/RL papers/PPT/… · Derivation of Linear Quadratic Regulator System Cost Differentiate using Leibniz’ formula Optimal Control is Scalar](https://reader034.vdocuments.us/reader034/viewer/2022042216/5ebfc19362f44a64f602cda7/html5/thumbnails/42.jpg)
![Page 43: F.L. Lewis - UTA 06 RL short course SEU/RL papers/PPT/… · Derivation of Linear Quadratic Regulator System Cost Differentiate using Leibniz’ formula Optimal Control is Scalar](https://reader034.vdocuments.us/reader034/viewer/2022042216/5ebfc19362f44a64f602cda7/html5/thumbnails/43.jpg)
Actor / Critic structure for CT Systems
Theta waves 4-8 Hz
Reinforcement learning
Motor control 200 Hz
Optimal Adaptive IRL for CT systems
A new structure of adaptive controllers
$V_k(x(t)) = \int_t^{t+T} r(x,u_k)\,d\tau + V_k(x(t+T))$
$u_{k+1} = h_{k+1}(x) = -\tfrac{1}{2} R^{-1} g^T(x)\,\frac{\partial V_k}{\partial x}$
D. Vrabie, 2009
![Page 44: F.L. Lewis - UTA 06 RL short course SEU/RL papers/PPT/… · Derivation of Linear Quadratic Regulator System Cost Differentiate using Leibniz’ formula Optimal Control is Scalar](https://reader034.vdocuments.us/reader034/viewer/2022042216/5ebfc19362f44a64f602cda7/html5/thumbnails/44.jpg)
Motor control 200 Hz
Oscillation is a fundamental property of neural tissue
Brain has multiple adaptive clocks with different timescales
Theta rhythm (hippocampus, thalamus), 4-10 Hz: sensory processing, memory, and voluntary control of movement.
Gamma rhythms (hippocampus and neocortex), 30-100 Hz: high cognitive activity.
• consolidation of memory
• spatial mapping of the environment – place cells
The high-frequency processing is due to the large amount of sensory data to be processed.
Spinal cord
D. Vrabie and F.L. Lewis, “Neural network approach to continuous-time direct adaptive optimal control for partially-unknown nonlinear systems,” Neural Networks, vol. 22, no. 3, pp. 237-246, Apr. 2009.
D. Vrabie, F.L. Lewis, D. Levine, “Neural Network-Based Adaptive Optimal Controller- A Continuous-Time Formulation -,” Proc. Int. Conf. Intelligent Control, Shanghai, Sept. 2008.
![Page 45: F.L. Lewis - UTA 06 RL short course SEU/RL papers/PPT/… · Derivation of Linear Quadratic Regulator System Cost Differentiate using Leibniz’ formula Optimal Control is Scalar](https://reader034.vdocuments.us/reader034/viewer/2022042216/5ebfc19362f44a64f602cda7/html5/thumbnails/45.jpg)
Doya, Kimura, Kawato 2001
Limbic system
Motor control 200 Hz
theta rhythms 4-10 Hz
Deliberative evaluation
control
![Page 46: F.L. Lewis - UTA 06 RL short course SEU/RL papers/PPT/… · Derivation of Linear Quadratic Regulator System Cost Differentiate using Leibniz’ formula Optimal Control is Scalar](https://reader034.vdocuments.us/reader034/viewer/2022042216/5ebfc19362f44a64f602cda7/html5/thumbnails/46.jpg)
Summary of Motor Control in the Human Nervous System
Hierarchy of multiple parallel loops
Kenji Doya
[Figure: block diagram with the labels listed below; picture by E. Stingu, D. Vrabie]
Cerebral cortex, motor areas
Thalamus, basal ganglia
Cerebellum
Brainstem
Spinal cord
Interoceptive receptors
Exteroceptive receptors
Muscle contraction and movement
reflex
Supervised learning
Reinforcement learning - dopamine
inf. olive (eye movement)
Hippocampus
Unsupervised learning
Limbic System
Memory functions: long term, short term
Motor control 200 Hz
theta rhythms 4-10 Hz
gamma rhythms 30-100 Hz
![Page 47: F.L. Lewis - UTA 06 RL short course SEU/RL papers/PPT/… · Derivation of Linear Quadratic Regulator System Cost Differentiate using Leibniz’ formula Optimal Control is Scalar](https://reader034.vdocuments.us/reader034/viewer/2022042216/5ebfc19362f44a64f602cda7/html5/thumbnails/47.jpg)
![Page 48: F.L. Lewis - UTA 06 RL short course SEU/RL papers/PPT/… · Derivation of Linear Quadratic Regulator System Cost Differentiate using Leibniz’ formula Optimal Control is Scalar](https://reader034.vdocuments.us/reader034/viewer/2022042216/5ebfc19362f44a64f602cda7/html5/thumbnails/48.jpg)
Synchronous Real‐time Data‐driven Optimal Control
![Page 49: F.L. Lewis - UTA 06 RL short course SEU/RL papers/PPT/… · Derivation of Linear Quadratic Regulator System Cost Differentiate using Leibniz’ formula Optimal Control is Scalar](https://reader034.vdocuments.us/reader034/viewer/2022042216/5ebfc19362f44a64f602cda7/html5/thumbnails/49.jpg)
Actor / Critic structure for CT Systems
Theta waves 4-8 Hz
Motor control 200 Hz
Optimal Adaptive Integral Reinforcement Learning for CT systems
A new structure of adaptive controllers
$V_k(x(t)) = \int_t^{t+T} r(x,u_k)\,d\tau + V_k(x(t+T))$
$u_{k+1} = h_{k+1}(x) = -\tfrac{1}{2} R^{-1} g^T(x)\,\frac{\partial V_k}{\partial x}$
D. Vrabie, 2009
Policy Iteration gives the structure needed for online optimal solution
![Page 50: F.L. Lewis - UTA 06 RL short course SEU/RL papers/PPT/… · Derivation of Linear Quadratic Regulator System Cost Differentiate using Leibniz’ formula Optimal Control is Scalar](https://reader034.vdocuments.us/reader034/viewer/2022042216/5ebfc19362f44a64f602cda7/html5/thumbnails/50.jpg)
Synchronous Online Solution of Optimal Control for Nonlinear Systems
Kyriakos Vamvoudakis
Critic Network: take the VFA as
$\hat V(x) = \hat W_1^T \phi_1(x)$
Action Network for Control Approximation
$u(x) = -\tfrac{1}{2} R^{-1} g^T(x)\,\nabla\phi_1^T \hat W_2$
Then the IRL Bellman eq
$V(x(t-T)) = \int_{t-T}^{t} \left(Q(x) + u^T R u\right) d\tau + V(x(t))$
becomes
$\hat W_1^T \phi_1(x(t-T)) = \int_{t-T}^{t} \left(Q(x) + u^T R u\right) d\tau + \hat W_1^T \phi_1(x(t))$
Define
$\Delta\phi_1(x(t)) \equiv \phi_1(x(t)) - \phi_1(x(t-T))$
Bellman eq becomes
$\Delta\phi_1(x(t))^T \hat W_1 + \int_{t-T}^{t} \left(Q(x) + \tfrac{1}{4} \hat W_2^T \bar D_1 \hat W_2\right) d\tau = 0$
where $\bar D_1(x) = \nabla\phi_1(x)\, g(x) R^{-1} g^T(x)\, \nabla\phi_1^T(x)$ follows from substituting the action network into $u^T R u$.
K.G. Vamvoudakis and F.L. Lewis, “Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem,” Automatica, vol. 46, no. 5, pp. 878-888, May 2010.
![Page 51: F.L. Lewis - UTA 06 RL short course SEU/RL papers/PPT/… · Derivation of Linear Quadratic Regulator System Cost Differentiate using Leibniz’ formula Optimal Control is Scalar](https://reader034.vdocuments.us/reader034/viewer/2022042216/5ebfc19362f44a64f602cda7/html5/thumbnails/51.jpg)
Data‐driven Online Synchronous Policy Iteration using IRL
Does not need to know f(x)
Theorem (Vamvoudakis & Vrabie)‐ Online Learning of Nonlinear Optimal Control
Let $\Delta\phi_1(x(t)) \equiv \phi_1(x(t)) - \phi_1(x(t-T))$ be PE.
Tune critic NN weights as (learning the value)
$\dot{\hat W}_1 = -a_1 \frac{\Delta\phi_1(x(t))}{\left(1 + \Delta\phi_1(x(t))^T \Delta\phi_1(x(t))\right)^2} \left( \Delta\phi_1(x(t))^T \hat W_1 + \int_{t-T}^{t} \left(Q(x) + \tfrac{1}{4} \hat W_2^T \bar D_1 \hat W_2\right) d\tau \right)$
Tune actor NN weights as (learning the control policy)
$\dot{\hat W}_2 = -a_2 \left\{ \left(F_2 \hat W_2 - F_1 \Delta\phi_1(x(t))^T \hat W_1\right) - \tfrac{1}{4} a_2 \bar D_1(x) \hat W_2 \frac{\Delta\phi_1(x(t))^T}{\left(1 + \Delta\phi_1(x(t))^T \Delta\phi_1(x(t))\right)^2} \hat W_1 \right\}$
Then there exists an $N_0$ such that, for the number of hidden layer units $N > N_0$, the closed‐loop system state, the critic NN error $\tilde W_1 = W_1 - \hat W_1$, and the actor NN error $\tilde W_2 = W_1 - \hat W_2$ are UUB (uniformly ultimately bounded).
Data set at time [t,t+T): $(x(t),\ \rho(t-T,t),\ x(t-T))$
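The critic tuning law is a normalized gradient descent on the IRL Bellman residual. Its behavior can be seen in a stripped-down setting. The sketch below is a simplification under loudly stated assumptions, not the synchronous algorithm itself. The actor is frozen at a fixed stabilizing gain k, the plant is a hypothetical scalar system x' = a x + b u, the basis is phi(x) = x^2, the continuous-time law is replaced by Euler steps with a combined gain `gain` (playing the role of a_1 times the step size), and each data interval restarts from x0 so that the regressor stays exciting. All function names are illustrative.

```python
import math

def run_interval(x0, k, a, b, q, r, T, steps=1000):
    """Simulate x' = a*x + b*u, u = -k*x, over one interval of length T.

    Returns x(T) and rho = integral of (q*x^2 + r*u^2); a is used only to generate data.
    """
    h = T / steps
    ac = a - b * k
    cost = lambda x: (q + r * k * k) * x * x
    x, rho = x0, 0.0
    for _ in range(steps):
        s1 = ac * x
        s2 = ac * (x + 0.5 * h * s1)
        s3 = ac * (x + 0.5 * h * s2)
        s4 = ac * (x + h * s3)
        x_next = x + h * (s1 + 2 * s2 + 2 * s3 + s4) / 6.0
        rho += 0.5 * h * (cost(x) + cost(x_next))   # trapezoidal rule
        x = x_next
    return x, rho

def critic_gradient_learning(a, b, q, r, k, gain=1.0, T=0.1, iters=500, x0=1.0, w0=0.0):
    """Euler-discretized normalized-gradient critic update for a fixed policy u = -k*x."""
    w = w0
    for _ in range(iters):
        xT, rho = run_interval(x0, k, a, b, q, r, T)
        dphi = xT ** 2 - x0 ** 2            # phi(x(t)) - phi(x(t-T)) with phi(x) = x^2
        residual = dphi * w + rho           # IRL Bellman residual for the current weight
        w -= gain * dphi / (1.0 + dphi * dphi) ** 2 * residual
    return w

k_fixed = 0.5
w = critic_gradient_learning(a=-1.0, b=1.0, q=1.0, r=1.0, k=k_fixed)
# value of the fixed policy from the scalar Lyapunov equation 2*(a - b*k)*w + q + r*k^2 = 0
w_true = (1.0 + k_fixed ** 2) / (2.0 * (1.0 + k_fixed))
print(w, w_true)
```

The weight settles at the value of the frozen policy, which is the fixed point of the IRL Bellman equation; the full theorem additionally tunes the actor at the same time.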
![Page 52: F.L. Lewis - UTA 06 RL short course SEU/RL papers/PPT/… · Derivation of Linear Quadratic Regulator System Cost Differentiate using Leibniz’ formula Optimal Control is Scalar](https://reader034.vdocuments.us/reader034/viewer/2022042216/5ebfc19362f44a64f602cda7/html5/thumbnails/52.jpg)
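Lyapunov energy‐based Proof:
$L(t) = V(x) + \tfrac{1}{2}\,\mathrm{tr}\!\left(\tilde W_1^T a_1^{-1} \tilde W_1\right) + \tfrac{1}{2}\,\mathrm{tr}\!\left(\tilde W_2^T a_2^{-1} \tilde W_2\right)$
V(x) = Unknown solution to HJB eq.
$0 = \left(\frac{dV}{dx}\right)^T f + Q(x) - \tfrac{1}{4}\left(\frac{dV}{dx}\right)^T g R^{-1} g^T \frac{dV}{dx}$
W1 = Unknown LS solution to Bellman equation for given N
$H(x, W_1, u) = W_1^T \nabla\phi_1\,(f + g u) + Q(x) + u^T R u = \varepsilon_H$
$\tilde W_1 = W_1 - \hat W_1, \quad \tilde W_2 = W_1 - \hat W_2$
Guarantees stability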
![Page 53: F.L. Lewis - UTA 06 RL short course SEU/RL papers/PPT/… · Derivation of Linear Quadratic Regulator System Cost Differentiate using Leibniz’ formula Optimal Control is Scalar](https://reader034.vdocuments.us/reader034/viewer/2022042216/5ebfc19362f44a64f602cda7/html5/thumbnails/53.jpg)
Adaptive Critic structure Reinforcement learning
Two Learning Networks: tune them simultaneously
Synchronous Online Solution of Optimal Control for Nonlinear Systems
A new form of Adaptive Control with TWO tunable networks
A new structure of adaptive controllers
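K.G. Vamvoudakis and F.L. Lewis, “Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem,” Automatica, vol. 46, no. 5, pp. 878-888, May 2010.
$\dot{\hat W}_1 = -a_1 \frac{\Delta\phi_1(x(t))}{\left(1 + \Delta\phi_1(x(t))^T \Delta\phi_1(x(t))\right)^2} \left( \Delta\phi_1(x(t))^T \hat W_1 + \int_{t-T}^{t} \left(Q(x) + \tfrac{1}{4} \hat W_2^T \bar D_1 \hat W_2\right) d\tau \right)$
$\dot{\hat W}_2 = -a_2 \left\{ \left(F_2 \hat W_2 - F_1 \Delta\phi_1(x(t))^T \hat W_1\right) - \tfrac{1}{4} a_2 \bar D_1(x) \hat W_2 \frac{\Delta\phi_1(x(t))^T}{\left(1 + \Delta\phi_1(x(t))^T \Delta\phi_1(x(t))\right)^2} \hat W_1 \right\}$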
![Page 54: F.L. Lewis - UTA 06 RL short course SEU/RL papers/PPT/… · Derivation of Linear Quadratic Regulator System Cost Differentiate using Leibniz’ formula Optimal Control is Scalar](https://reader034.vdocuments.us/reader034/viewer/2022042216/5ebfc19362f44a64f602cda7/html5/thumbnails/54.jpg)
A New Class of Adaptive Control
[Figure: plant block with control input and output]
Identify the controller: Direct Adaptive
Identify the system model: Indirect Adaptive
Identify the performance value: Optimal Adaptive
$V(x) = W^T \phi(x)$
![Page 55: F.L. Lewis - UTA 06 RL short course SEU/RL papers/PPT/… · Derivation of Linear Quadratic Regulator System Cost Differentiate using Leibniz’ formula Optimal Control is Scalar](https://reader034.vdocuments.us/reader034/viewer/2022042216/5ebfc19362f44a64f602cda7/html5/thumbnails/55.jpg)
Simulation 1‐ F‐16 aircraft pitch rate controller
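$\dot x = \begin{bmatrix} -1.01887 & 0.90506 & -0.00215 \\ 0.82225 & -1.07741 & -0.17555 \\ 0 & 0 & -1 \end{bmatrix} x + \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} u, \quad x = [\alpha \ \ q \ \ \delta_e]$
Stevens and Lewis 2003
$Q = I, \quad R = I$
Select quadratic NN basis set for VFA
Solves ARE online
$0 = A^T P + P A + Q - P B R^{-1} B^T P$
Exact solution
$W_1^* = [p_{11} \ \ 2p_{12} \ \ 2p_{13} \ \ p_{22} \ \ 2p_{23} \ \ p_{33}]^T = [1.4245 \ \ 1.1682 \ \ {-0.1352} \ \ 1.4349 \ \ {-0.1501} \ \ 0.4329]^T$
Algorithm converges to
$\hat W_1(t_f) = [1.4279 \ \ 1.1612 \ \ {-0.1366} \ \ 1.4462 \ \ {-0.1480} \ \ 0.4317]^T$
$\hat W_2(t_f) = [1.4279 \ \ 1.1612 \ \ {-0.1366} \ \ 1.4462 \ \ {-0.1480} \ \ 0.4317]^T$
Must add probing noise to get PE
$u(x) = -\tfrac{1}{2} R^{-1} g^T(x)\,\nabla\phi_1^T \hat W_2 + n(t)$
(exponentially decay n(t))
$\hat u_2(x) = -R^{-1} B^T P x = -\tfrac{1}{2} R^{-1} \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}^T \begin{bmatrix} 2x_1 & 0 & 0 \\ x_2 & x_1 & 0 \\ x_3 & 0 & x_1 \\ 0 & 2x_2 & 0 \\ 0 & x_3 & x_2 \\ 0 & 0 & 2x_3 \end{bmatrix}^T \begin{bmatrix} 1.4279 \\ 1.1612 \\ -0.1366 \\ 1.4462 \\ -0.1480 \\ 0.4317 \end{bmatrix}$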
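The way the actor weights turn into a state-feedback control for this example can be written out directly. The sketch below is illustrative only. It assumes the quadratic basis phi_1(x) = [x1^2, x1 x2, x1 x3, x2^2, x2 x3, x3^2], which is the basis implied by the Jacobian on the slide, uses the converged weights quoted above with B = [0 0 1]^T and R = 1, and leaves out the probing noise n(t). The names `grad_phi` and `actor_control` are hypothetical.

```python
def grad_phi(x):
    """Jacobian (6 x 3) of the assumed quadratic basis with respect to x."""
    x1, x2, x3 = x
    return [
        [2.0 * x1, 0.0, 0.0],   # d(x1^2)/dx
        [x2, x1, 0.0],          # d(x1*x2)/dx
        [x3, 0.0, x1],          # d(x1*x3)/dx
        [0.0, 2.0 * x2, 0.0],   # d(x2^2)/dx
        [0.0, x3, x2],          # d(x2*x3)/dx
        [0.0, 0.0, 2.0 * x3],   # d(x3^2)/dx
    ]

B = [0.0, 0.0, 1.0]             # input vector of the F-16 example
R_INV = 1.0                     # R = I (scalar input)
W2_FINAL = [1.4279, 1.1612, -0.1366, 1.4462, -0.1480, 0.4317]

def actor_control(x, w2=W2_FINAL):
    """u(x) = -1/2 R^-1 B^T grad_phi(x)^T w2, without probing noise."""
    J = grad_phi(x)
    # gradient of the approximate value: grad_V = J^T w2 (3-vector)
    grad_v = [sum(J[i][j] * w2[i] for i in range(6)) for j in range(3)]
    return -0.5 * R_INV * sum(B[j] * grad_v[j] for j in range(3))

print(actor_control([1.0, 0.0, 0.0]))
print(actor_control([0.0, 0.0, 1.0]))
```

Because B selects only the third component of the gradient, the control depends on the third, fifth, and sixth weights alone.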
![Page 56: F.L. Lewis - UTA 06 RL short course SEU/RL papers/PPT/… · Derivation of Linear Quadratic Regulator System Cost Differentiate using Leibniz’ formula Optimal Control is Scalar](https://reader034.vdocuments.us/reader034/viewer/2022042216/5ebfc19362f44a64f602cda7/html5/thumbnails/56.jpg)
Critic NN parameters‐Converge to ARE solution
System states
![Page 57: F.L. Lewis - UTA 06 RL short course SEU/RL papers/PPT/… · Derivation of Linear Quadratic Regulator System Cost Differentiate using Leibniz’ formula Optimal Control is Scalar](https://reader034.vdocuments.us/reader034/viewer/2022042216/5ebfc19362f44a64f602cda7/html5/thumbnails/57.jpg)
Simulation 2. – Nonlinear System
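$\dot x = f(x) + g(x)u, \quad x \in \mathbb{R}^2$
$f(x) = \begin{bmatrix} -x_1 + x_2 \\ -0.5x_1 - 0.5x_2\left(1 - (\cos(2x_1) + 2)^2\right) \end{bmatrix}, \quad g(x) = \begin{bmatrix} 0 \\ \cos(2x_1) + 2 \end{bmatrix}$
$Q = I, \quad R = I$
Converse optimal: Nevistic V. and Primbs J. A. (1996)
Optimal Value: $V^*(x) = \tfrac{1}{2}x_1^2 + x_2^2$
Optimal control: $u^*(x) = -(\cos(2x_1) + 2)\,x_2$
Select VFA basis set: $\phi_1(x) = [x_1^2 \ \ x_1x_2 \ \ x_2^2]^T$
Algorithm converges to
$\hat W_1(t_f) = [0.5017 \ \ {-0.0020} \ \ 1.0008]^T$
$\hat W_2(t_f) = [0.5017 \ \ {-0.0020} \ \ 1.0008]^T$
$\hat u_2(x) = -\tfrac{1}{2} R^{-1} \begin{bmatrix} 0 \\ \cos(2x_1) + 2 \end{bmatrix}^T \begin{bmatrix} 2x_1 & 0 \\ x_2 & x_1 \\ 0 & 2x_2 \end{bmatrix}^T \begin{bmatrix} 0.5017 \\ -0.0020 \\ 1.0008 \end{bmatrix}$
Solves HJB equation online
$0 = \left(\frac{dV^*}{dx}\right)^T f + Q(x) - \tfrac{1}{4}\left(\frac{dV^*}{dx}\right)^T g R^{-1} g^T \frac{dV^*}{dx}$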
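The claim that the stated optimal value solves the HJB equation for this system can be checked numerically. The following is a small sketch, assuming the reading Q(x) = x^T Q x with Q = I and scalar R = 1, and the dynamics f and g as given for this simulation. It also compares the control produced by the converged actor weights with the optimal control. The function names are illustrative.

```python
import math

def f(x):
    x1, x2 = x
    c = math.cos(2.0 * x1) + 2.0
    return [-x1 + x2, -0.5 * x1 - 0.5 * x2 * (1.0 - c * c)]

def g(x):
    return [0.0, math.cos(2.0 * x[0]) + 2.0]

def grad_v_star(x):
    """Gradient of V*(x) = 0.5*x1^2 + x2^2."""
    return [x[0], 2.0 * x[1]]

def hjb_residual(x):
    """(dV*/dx)^T f + Q(x) - 1/4 (dV*/dx)^T g R^-1 g^T (dV*/dx) with Q = I, R = 1."""
    dv, fx, gx = grad_v_star(x), f(x), g(x)
    dv_f = dv[0] * fx[0] + dv[1] * fx[1]
    dv_g = dv[0] * gx[0] + dv[1] * gx[1]
    qx = x[0] ** 2 + x[1] ** 2
    return dv_f + qx - 0.25 * dv_g * dv_g

def u_star(x):
    return -(math.cos(2.0 * x[0]) + 2.0) * x[1]

def actor_control(x, w2=(0.5017, -0.0020, 1.0008)):
    """-1/2 R^-1 g^T grad_phi^T w2 for the basis [x1^2, x1*x2, x2^2]."""
    x1, x2 = x
    dv_dx2 = x1 * w2[1] + 2.0 * x2 * w2[2]   # only the x2-component matters since g1 = 0
    return -0.5 * g(x)[1] * dv_dx2

points = [(0.3, -0.7), (1.0, 1.0), (-0.5, 0.2), (0.0, -1.0)]
for pt in points:
    print(pt, hjb_residual(pt), u_star(pt), actor_control(pt))
```

The residual vanishes up to rounding at every test point, and the learned control stays close to the optimal one on this region.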
![Page 58: F.L. Lewis - UTA 06 RL short course SEU/RL papers/PPT/… · Derivation of Linear Quadratic Regulator System Cost Differentiate using Leibniz’ formula Optimal Control is Scalar](https://reader034.vdocuments.us/reader034/viewer/2022042216/5ebfc19362f44a64f602cda7/html5/thumbnails/58.jpg)
[Figure panels: critic NN parameters; states; optimal value fn.; value fn. approx. error; control approx. error]
![Page 59: F.L. Lewis - UTA 06 RL short course SEU/RL papers/PPT/… · Derivation of Linear Quadratic Regulator System Cost Differentiate using Leibniz’ formula Optimal Control is Scalar](https://reader034.vdocuments.us/reader034/viewer/2022042216/5ebfc19362f44a64f602cda7/html5/thumbnails/59.jpg)