TRANSCRIPT

Page 1:

CSC2621 Imitation Learning for Robotics

Florian Shkurti

Week 2: Introduction to Optimal Control & Model-Based RL

Page 2:

Today’s agenda

• Intro to Control & Reinforcement Learning

• Linear Quadratic Regulator (LQR)

• Iterative LQR

• Model Predictive Control

• Learning the dynamics and model-based RL

Acknowledgments

Today’s slides have been influenced by: Pieter Abbeel (ECE287), Sergey Levine (DeepRL), Ben Recht (ICML’18), Emo Todorov, Zico Kolter

Page 3:

Optimal Control

[Block diagram: a system with an internal state produces an observation; its inputs are the control / action and an external noise / disturbance / error.]

Page 6:

Optimal Control

[Same block diagram, with a control law / policy mapping the observation to the control, under known dynamics.]

Page 7:

Reinforcement Learning

[Analogous diagram: a policy interacts with the system / world through actions and observations of its state.]

Page 8:

[Correspondence between the two settings: control law ↔ policy, system ↔ dynamics, cost = - reward.]

Page 9:

Optimal cost-to-go:
"If you land at state x and you follow the optimal actions, what is the expected cost you will pay?"

Page 10:

Optimal value function:
"If you land at state x and you follow the optimal policy, what is the expected reward you will accumulate?"

For finite time horizon.

Page 11:

Optimal state-action value function:
"If you land at state x, and you commit to first execute action a, and then follow the optimal policy, how much reward will you accumulate?"

Page 12:

Value function of policy pi:
"If you land at state x and you follow policy pi, what is the expected reward you will accumulate?"

Page 13:

State-action value function of policy pi:
"If you land at state x, and you commit to first execute action a, and then follow policy pi, how much reward will you accumulate?"

Page 14:

Summary of the quantities introduced so far: optimal cost-to-go, optimal value function, value function of policy pi, state-action value function of policy pi, optimal state-action value function (all for a finite time horizon).

[Each of these slides repeats the optimal control / reinforcement learning block diagrams from Pages 3–8.]
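The formulas for these quantities were embedded as images and are missing from the transcript. For a finite horizon T, a standard way to write the verbal definitions above is sketched below; the symbols r (reward), f (dynamics), pi and T are notation assumed here, not taken from the slides.

```latex
% Standard finite-horizon definitions matching the verbal descriptions above (notation assumed).
V^{\pi}_t(x)   = \mathbb{E}\!\left[\sum_{k=t}^{T} r(x_k, u_k) \;\middle|\; x_t = x,\; u_k = \pi(x_k)\right]
Q^{\pi}_t(x,a) = r(x,a) + \mathbb{E}\!\left[V^{\pi}_{t+1}(x_{t+1}) \;\middle|\; x_t = x,\; u_t = a\right]
V^{*}_t(x) = \max_{\pi} V^{\pi}_t(x), \qquad Q^{*}_t(x,a) = r(x,a) + \mathbb{E}\!\left[V^{*}_{t+1}(x_{t+1})\right]
% The optimal cost-to-go is the same object with cost = -reward and min in place of max.
```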

Page 15:

Today’s agenda

• Intro to Control & Reinforcement Learning

• Linear Quadratic Regulator (LQR)

• Iterative LQR

• Model Predictive Control

• Learning the dynamics and model-based RL

Acknowledgments

Today’s slides have been influenced by: Pieter Abbeel (ECE287), Sergey Levine (DeepRL), Ben Recht (ICML’18), Emo Todorov, Zico Kolter

Page 16:

What you can do with LQR control

Page 17:

What you can do with (variants of) LQR control

Pieter Abbeel, Helicopter Aerobatics

Page 18:

LQR: assumptions

• You know the dynamics model of the system

• It is linear: x_{t+1} = A x_t + B u_t

where x_{t+1} is the state at the next time step and u_t is the control / command / action applied to the system

Page 19:

Which systems are linear?

• Omnidirectional robot

Page 20:

Which systems are linear?

• Omnidirectional robot

• Simple car

Page 23:

The goal of LQR

• Stabilize the system around the state x = 0 with control u = 0

• Then x_{t+1} = A·0 + B·0 = 0 and the system will remain at zero forever

Page 24:

The goal of LQR

• If we want to stabilize around some other state x*, then let x – x* be the state

Page 25:

LQR: assumptions

• You know the dynamics model of the system

• It is linear: x_{t+1} = A x_t + B u_t

• There is an instantaneous cost associated with being at state x and taking the action u:

c(x, u) = x^T Q x + u^T R u

Quadratic state cost x^T Q x: penalizes deviation from the zero vector

Quadratic control cost u^T R u: penalizes high control signals

Page 26:

LQR: assumptions

• You know the dynamics model of the system and it is linear: x_{t+1} = A x_t + B u_t

• There is an instantaneous cost c(x, u) = x^T Q x + u^T R u associated with being at state x and taking the action u

• The square matrices Q and R must be positive definite: i.e. positive cost for ANY nonzero state and control vector

Page 27:

Finite-Horizon LQR

• Idea: finding controls is an optimization problem

• Compute the control variables that minimize the cumulative cost

Page 28:

Finite-Horizon LQR

• Idea: finding controls is an optimization problem

• Compute the control variables that minimize the cumulative cost

We could solve this as a constrained nonlinear optimization problem. But there is a better way: we can find a closed-form solution.

Page 29:

Finite-Horizon LQR

• Idea: finding controls is an optimization problem

• Compute the control variables that minimize the cumulative cost

Open-loop plan! Given the first state, compute the whole action sequence.
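The objective itself appeared as an image on the slide; under the assumptions on Pages 18, 25 and 26, the optimization problem described here has the standard form below (the terminal-cost term is an assumption of this sketch, not copied from the slide).

```latex
% Finite-horizon LQR: pick the action sequence minimizing the cumulative quadratic cost.
\min_{u_0, \dots, u_{N-1}} \;\; \sum_{t=0}^{N-1} \left( x_t^\top Q\, x_t + u_t^\top R\, u_t \right) + x_N^\top Q\, x_N
\quad \text{s.t.} \quad x_{t+1} = A x_t + B u_t, \qquad x_0 \ \text{given}
```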

Page 30:

Finding the LQR controller in closed-form by recursion

• Let J_n(x) denote the cumulative cost-to-go starting from state x and moving for n time steps.

• I.e. the cumulative future cost from now until n more steps have passed.

• J_0(x) is the terminal cost of ending up at state x, with no actions left to perform. Recall that it is quadratic: J_0(x) = x^T P_0 x.

Page 31:

Q: What is the optimal cumulative cost-to-go function with 1 time step left?
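The slide's answer is an image; with the quadratic terminal cost J_0(x) = x^T P_0 x from the previous slide, the standard one-step answer is:

```latex
% With one step left: pay the instantaneous cost, act, then pay the terminal cost at the next state.
J_1(x) = \min_{u} \left[ x^\top Q x + u^\top R u + J_0(A x + B u) \right]
       = \min_{u} \left[ x^\top Q x + u^\top R u + (A x + B u)^\top P_0 (A x + B u) \right]
```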

Page 32:

Finding the LQR controller in closed-form by recursion

Page 33:

Finding the LQR controller in closed-form by recursion

For notational convenience later on

Page 34:

Finding the LQR controller in closed-form by recursion

Bellman Update

Dynamic Programming

Value Iteration

In RL this would be the state-action value function
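For reference, the Bellman update being applied (reconstructed in the J_n notation above, since the slide's equation is not in the transcript):

```latex
% One step of dynamic programming / value iteration on the cost-to-go.
J_{n}(x) = \min_{u} \left[ x^\top Q x + u^\top R u + J_{n-1}(A x + B u) \right]
% In RL, the bracketed quantity (cost of taking u now, then acting optimally afterwards)
% plays the role of the state-action value function.
```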

Page 35:

Finding the LQR controller in closed-form by recursion

Q: How do we optimize a multivariable function with respect to some of its variables (in our case, the controls)?

Page 36:

Finding the LQR controller in closed-form by recursion

Page 38:

Finding the LQR controller in closed-form by recursion

A: Take the partial derivative w.r.t. the controls and set it to zero. That will give you a critical point.

[Annotations on the expanded expression: quadratic terms in u, a linear term in u.]

Page 39:

Finding the LQR controller in closed-form by recursion

From calculus/algebra: if M is symmetric, the gradient of u^T M u with respect to u is 2 M u.

Page 40:

The minimum is attained at:

u = -(R + B^T P_0 B)^{-1} B^T P_0 A x

Q: Is this matrix invertible? Recall that R and P_0 are positive definite matrices.

Page 41:

A: R + B^T P_0 B is positive definite, so it is invertible.

Page 42:

So, the optimal control for the last time step is:

u = -(R + B^T P_0 B)^{-1} B^T P_0 A x

Linear controller in terms of the state.

Page 43:

We computed the location of the minimum. Now, plug it back in and compute the minimum value.

Page 44:

Finding the LQR controller in closed-form by recursion

Q: Why is this a big deal?

A: The cost-to-go function remains quadratic after the first recursive step.
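Carrying out the plug-back-in step gives the cost-to-go matrix for the next recursion level; in the notation used above (a standard reconstruction, since the slide's equation is missing), the update is:

```latex
% Substituting the minimizing control back in keeps the cost-to-go quadratic: J_n(x) = x^T P_n x.
P_n = Q + A^\top P_{n-1} A - A^\top P_{n-1} B \left( R + B^\top P_{n-1} B \right)^{-1} B^\top P_{n-1} A
```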

Page 45:

Finding the LQR controller in closed-form by recursion

J remains quadratic in x throughout the recursion, and u remains linear in x throughout the recursion.

[Timeline annotation from Time 0 to Time N (planning horizon).]

Page 46:

Finite-Horizon LQR: algorithm summary

// n is the # of steps left
for n = 1…N
    (update the cost-to-go matrix and the gain for the step with n steps left)

Optimal control for time t = N – n is linear in the state, with a quadratic cost-to-go, where the states are predicted forward in time according to the linear dynamics.

Page 47:

One pass backward in time: the matrix gains are precomputed based on the dynamics and the instantaneous cost.

Page 48:

One pass forward in time: predict states, compute controls and cost-to-go.
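A minimal NumPy sketch of this two-pass summary, using the Riccati-style update reconstructed above; the function and variable names (finite_horizon_lqr, rollout, K, P) are illustrative, not taken from the course code, and the terminal cost-to-go matrix is assumed to equal Q.

```python
import numpy as np

def finite_horizon_lqr(A, B, Q, R, N):
    """Backward pass: precompute gains K[n] and cost-to-go matrices P[n] for n steps left."""
    P = [Q]            # P[0]: terminal cost-to-go matrix (assumed equal to Q in this sketch)
    K = [None]         # no control is taken with 0 steps left
    for n in range(1, N + 1):
        Pp = P[n - 1]
        Kn = -np.linalg.solve(R + B.T @ Pp @ B, B.T @ Pp @ A)   # gain: u = K_n x
        Pn = Q + A.T @ Pp @ A + A.T @ Pp @ B @ Kn               # Riccati update on the cost-to-go
        K.append(Kn)
        P.append(Pn)
    return K, P

def rollout(A, B, K, x0, N):
    """Forward pass: predict states with the linear dynamics and apply the precomputed gains."""
    x, xs, us = x0, [x0], []
    for t in range(N):
        u = K[N - t] @ x            # at time t there are N - t steps left
        x = A @ x + B @ u
        us.append(u)
        xs.append(x)
    return xs, us
```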

Page 49:

Finite-Horizon LQR: algorithm summary

Potential problem for states of dimension >> 100: matrix inversion is expensive, O(k^2.3) for the best known algorithm and O(k^3) for Gaussian elimination.

Page 50:

LQR: general form of dynamics and cost functions

Even though we assumed purely linear dynamics and a purely quadratic cost, we can also accommodate more general affine dynamics and costs with linear terms, but the form of the computed controls then becomes affine in the state rather than purely linear.

Page 51:

LQR with stochastic dynamics

Assume x_{t+1} = A x_t + B u_t + w_t, where w_t is zero-mean Gaussian noise, and the cost is quadratic as before.

Then the form of the optimal policy is the same as in LQR.

No need to change the algorithm, as long as you observe the state at each step (closed-loop policy).

Page 52:

This setting is known as Linear Quadratic Gaussian (LQG) control.

Page 53:

LQR summary

• Advantages:

• If the system is linear, LQR gives the optimal controller that takes the system's state to 0 (or to the desired target state, which is the same thing after a change of variables)

Page 54:

• Drawbacks:

• Linear dynamics

• How can you include obstacles or constraints in the specification?

• Not easy to put bounds on control values

Page 55:

Today’s agenda

• Intro to Control & Reinforcement Learning

• Linear Quadratic Regulator (LQR)

• Iterative LQR

• Model Predictive Control

• Learning the dynamics and model-based RL

Page 56:

What happens in the general nonlinear case?

Arbitrary differentiable functions c, f (instantaneous cost c and dynamics f)

Page 57:

Idea: iteratively approximate the solution by solving linearized versions of the problem via LQR

Page 58:

Iterative LQR (iLQR)

Given an initial sequence of states and actions:

• Linearize the dynamics around the current sequence (see the sketch after this list)

• Taylor expand the cost around the current sequence

• Use the LQR backward pass on the approximate dynamics and cost

• Do a forward pass to get the new controls and states, and update the state and action sequence
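One concrete way to implement the "linearize dynamics" step is finite differencing of the nonlinear dynamics f around the current state-action sequence; this is only an illustration (the slides do not prescribe a particular method), and the remaining steps reuse the time-varying LQR machinery from earlier in the lecture.

```python
import numpy as np

def linearize_dynamics(f, x_ref, u_ref, eps=1e-5):
    """Finite-difference Jacobians A = df/dx and B = df/du around a reference state/action pair.

    f(x, u) -> next state is the (nonlinear) dynamics function; x_ref, u_ref are 1-D arrays.
    Called once per time step of the nominal trajectory inside the iLQR loop.
    """
    nx, nu = x_ref.size, u_ref.size
    f0 = f(x_ref, u_ref)
    A = np.zeros((nx, nx))
    B = np.zeros((nx, nu))
    for i in range(nx):                       # perturb each state coordinate
        dx = np.zeros(nx)
        dx[i] = eps
        A[:, i] = (f(x_ref + dx, u_ref) - f0) / eps
    for j in range(nu):                       # perturb each control coordinate
        du = np.zeros(nu)
        du[j] = eps
        B[:, j] = (f(x_ref, u_ref + du) - f0) / eps
    return A, B
```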

Page 62:

Iterative LQR: convergence & tricks

• The new state and action sequence in iLQR is not guaranteed to be close to the linearization point (so the linear approximation might be bad)

• Trick: try to penalize the magnitude of the deviations from the previous state and action sequences, i.e. replace the old LQR linearized cost with one that adds a penalty on these deviations

• Problem: can get stuck in local optima, need to initialize well

• Problem: the Hessian might not be positive definite

Page 63:

Today’s agenda

• Intro to Control & Reinforcement Learning

• Linear Quadratic Regulator (LQR)

• Iterative LQR

• Model Predictive Control

• Learning the dynamics and model-based RL

Page 64:

Open loop vs. closed loop

• The instances of LQR and iLQR that we saw were open-loop

• Commands are executed in sequence, without feedback

Page 65:

• Idea: what if we throw away all the planned commands except the first?

• We can execute the first command, and then replan. This takes into account the changing state.

Page 66:

Model Predictive Control

while True:
    observe the current state
    run LQR/iLQR or LQG/iLQG or another planner to get a planned action sequence
    execute the first planned action
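A minimal Python sketch of this receding-horizon loop; env.observe(), env.apply() and plan() are assumed placeholder interfaces for this illustration, not APIs from the course code.

```python
def mpc_loop(env, plan, horizon=20):
    """Receding-horizon control: replan from the latest observed state, apply only the first action.

    env.observe() returns the current state, env.apply(u) sends one command to the system,
    and plan(x, horizon) stands in for LQR/iLQR/LQG or any other trajectory planner.
    """
    while True:
        x = env.observe()                # observe the current state
        u_seq = plan(x, horizon)         # plan a short sequence of actions starting from x
        env.apply(u_seq[0])              # execute only the first planned action, then replan
```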

Page 67:

Model Predictive Control

Possible speedups:

1. Don't plan too far ahead with LQR

2. Execute more than one planned action before replanning

3. Warm starts and initialization

4. Use a faster / custom optimizer (e.g. CPLEX, sequential quadratic programming)

Page 68:

Online trajectory optimization / MPC

Page 72:

Today’s agenda

• Intro to Control & Reinforcement Learning

• Linear Quadratic Regulator (LQR)

• Iterative LQR

• Model Predictive Control

• Learning the dynamics and model-based RL

Page 73:

Learning a dynamics model

Idea #1: Collect a dataset of transitions, do supervised learning to minimize the one-step prediction error, and then use the learned model for planning.

Caveat: the test distribution is different from the training distribution (covariate shift).
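A small sketch of Idea #1 with an ordinary least-squares model; the linear model class, array shapes and function names are illustrative assumptions, since the slide does not commit to a specific model.

```python
import numpy as np

def fit_linear_dynamics(states, actions, next_states):
    """Least-squares fit of x_{t+1} ≈ W^T [x_t; u_t; 1], minimizing one-step prediction error.

    states: (T, nx), actions: (T, nu), next_states: (T, nx) arrays of observed transitions.
    """
    X = np.hstack([states, actions, np.ones((states.shape[0], 1))])  # features with a bias column
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)              # shape (nx + nu + 1, nx)
    return W

def predict_next_state(W, x, u):
    """One-step prediction with the fitted model, e.g. for use inside a planner."""
    return np.concatenate([x, u, [1.0]]) @ W
```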

Page 74:

Learning a dynamics model

Possibly a better idea: instead of minimizing single-step prediction errors, minimize multi-step errors.

See "Improving Multi-step Prediction of Learned Time Series Models" by Venkatraman, Hebert, Bagnell.

Page 75:

Learning a dynamics model

Possibly a better idea: instead of predicting the next state, predict the next change in state.

See "PILCO: A Model-Based and Data-Efficient Approach to Policy Search" by Deisenroth, Rasmussen.

Page 76:

Model-based RL

Collect an initial dataset of transitions

Fit a dynamics model to the dataset

Plan through the learned model to get actions

Execute the first action, observe the new state

Append the new transition to the dataset, and repeat
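A compact sketch of this loop; fit_model, plan, env.observe and env.apply are hypothetical placeholders (the slide gives no code), and the structure simply strings together the steps listed above.

```python
def model_based_rl(env, fit_model, plan, initial_data, num_steps=1000):
    """Alternate between fitting a dynamics model on all data collected so far and planning through it."""
    data = list(initial_data)              # dataset of (x, u, x_next) transitions
    for _ in range(num_steps):
        model = fit_model(data)            # fit the dynamics model to the dataset
        x = env.observe()                  # current state
        actions = plan(model, x)           # plan through the learned model to get actions
        x_next = env.apply(actions[0])     # execute the first action, observe the new state
        data.append((x, actions[0], x_next))   # append the new transition, then repeat
    return data
```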

Page 77:

Today’s agenda

• Intro to Control & Reinforcement Learning

• Linear Quadratic Regulator (LQR)

• Iterative LQR

• Model Predictive Control

• Learning the dynamics and model-based RL

• Appendix

Page 78:

Appendix #1 (optional reading): LQR extensions: time-varying systems

• What can we do when the dynamics matrices vary with time, i.e. x_{t+1} = A_t x_t + B_t u_t?

• It turns out the proof and the algorithm are almost the same:

// n is the # of steps left
for n = 1…N
    (same backward recursion, now using the time-indexed matrices)

The optimal controller for the n-step horizon is still linear in the state, with a quadratic cost-to-go.

Page 79:

Appendix #2 (optional reading): Why not use PID control?

• We could, but:

• The gains for PID are good for a small region of state-space.

• If the system reaches a state outside this set, it becomes unstable.

• PID has no formal guarantees on the size of the set.

• We would need to tune PID gains for every control variable.

• If the state vector has multiple dimensions it becomes harder to tune every control variable in isolation. Need to consider interactions and correlations.

• We would need to tune PID gains for different regions of the state-space and guarantee smooth gain transitions.

• This is called gain scheduling, and it takes a lot of effort and time.

Page 80:

LQR addresses these problems.

Page 81:

Appendix #3 (optional reading): Examples of models and solutions with LQR

Page 82:

LQR example #1: omnidirectional vehicle with friction

• Similar to a double integrator dynamical system, but with friction:

[Equation annotations: force applied to the vehicle, control applied to the vehicle, friction opposed to motion.]

Page 83:

LQR example #1: omnidirectional vehicle with friction

• Similar to a double integrator dynamical system, but with friction

• Setting the constants gives the continuous-time dynamics

Page 84:

• We discretize by choosing a fixed time step

Page 85:

LQR example #1: omnidirectional vehicle with friction

• Define the state vector (the vehicle's positions and velocities)

Q: How can we express this as a linear system?

Page 88:

• Writing the discretized dynamics in matrix form gives the system matrices A and B

Page 89:

LQR example #1: omnidirectional vehicle with friction

• Define the state vector and the system matrices A and B as above

• Define the instantaneous cost function
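A plausible concrete instantiation of this example, since the slide's matrices are not recoverable from the transcript: planar position and velocity as the state, force as the control, viscous friction, Euler discretization. The constants (dt, k_friction), the cost weights and the initial state are illustrative assumptions, and the call reuses the finite_horizon_lqr / rollout sketch from the algorithm-summary section.

```python
import numpy as np

# State: [px, py, vx, vy]; control: force [ux, uy]. Constants are illustrative, not the slide's.
dt, k_friction = 0.1, 0.5
A = np.block([
    [np.eye(2),         dt * np.eye(2)],
    [np.zeros((2, 2)),  (1.0 - dt * k_friction) * np.eye(2)],
])                                                   # position integrates velocity; friction damps velocity
B = np.vstack([np.zeros((2, 2)), dt * np.eye(2)])    # control is a force acting on the velocity

Q = np.eye(4)            # penalize deviation of the state from the zero vector
R = 0.1 * np.eye(2)      # penalize high control signals

K, P = finite_horizon_lqr(A, B, Q, R, N=100)                          # backward pass (defined earlier)
xs, us = rollout(A, B, K, np.array([5.0, -3.0, 0.0, 0.0]), N=100)     # forward pass
```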

Page 90:

LQR example #1: omnidirectional vehicle with friction

[Simulation plots for a particular initial state and instantaneous cost function.]

Page 92:

Notice how the controls tend to zero. It's because the state tends to zero as well.

Page 93:

Also note that in the current LQR framework we have not included hard constraints on the controls, i.e. upper or lower bounds. We only penalize a large norm for the controls.

Page 94:

Notice how the state tends to zero.

Page 95:

LQR example #2: trajectory following for omnidirectional vehicle

Page 96:

LQR example #2: trajectory following for omnidirectional vehicle

• Same system matrices A and B as in example #1

• We are given a desired trajectory

• The instantaneous cost penalizes deviation from the desired trajectory

Page 97:

LQR example #2: trajectory following for omnidirectional vehicle

• Define the deviation from the desired trajectory as the quantity we want to drive to zero.

Page 98:

• Writing the dynamics of this deviation, an additive term appears that does not fit the form x_{t+1} = A x_t + B u_t. We need to get rid of this additive term.

Page 99:

• Idea: redefine (augment) the state so that the additive term c is absorbed into the linear dynamics.

Page 100:

• Redefine the cost function for the augmented state accordingly. The problem is then back in standard LQR form.

Page 101:

LQR example #2: trajectory following for omnidirectional vehicle

[Simulation plots for a particular initial state and instantaneous cost function.]

Page 103:

Appendix #4 (optional reading): LQR extensions: trajectory following

• You are given a reference trajectory (not just a path, but states and times, or states and controls) that needs to be approximated.

• Linearize the nonlinear dynamics around each reference point.

Trajectory following can then be implemented as a time-varying LQR approximation. It is not always clear whether this is the best way, though.

Page 104:

Appendix #5 (optional reading): LQR with nonlinear dynamics, quadratic cost

Page 105:

LQR variants: nonlinear dynamics, quadratic cost

• What can we do when the dynamics x_{t+1} = f(x_t, u_t) are nonlinear but the cost is quadratic?

• We want to stabilize the system around a target state x*.

• But with nonlinear dynamics we do not know which control will keep the system at that state.

Page 106:

→ Need to compute a control u* such that the system stays put at the target: f(x*, u*) = x*.

Page 107:

Taylor expansion: linearize the nonlinear dynamics around the point (x*, u*), which yields the matrices A and B.

Page 108:

Solve the resulting linearized problem via LQR.
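In symbols (a standard reconstruction; the definitions of A and B were images on the original slides), the linearization used here is the first-order Taylor expansion of f around the operating point:

```latex
% First-order Taylor expansion of the nonlinear dynamics around (x^*, u^*).
f(x, u) \approx f(x^*, u^*) + A\,(x - x^*) + B\,(u - u^*),
\qquad A = \left.\frac{\partial f}{\partial x}\right|_{(x^*,\,u^*)},
\qquad B = \left.\frac{\partial f}{\partial u}\right|_{(x^*,\,u^*)}
```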

Page 109:

LQR examples: code to replicate these results

• https://github.com/florianshkurti/comp417.git

• Look under comp417/lqr_examples/python