Apprenticeship Learning
Pieter Abbeel
Stanford University
In collaboration with: Andrew Y. Ng, Adam Coates, J. Zico Kolter, Morgan Quigley, Dmitri Dolgov, Sebastian Thrun.
Machine Learning

Large number of success stories: handwritten digit recognition, face detection, disease diagnosis, …
All learn from examples a direct mapping from inputs to outputs.
Reinforcement learning / sequential decision making: humans still greatly outperform machines.
Reinforcement learning

Dynamics model Psa: probability distribution over next states, given the current state and action.
Reward function R: describes the desirability (how much it costs) to be in a state.
Reinforcement learning maps (Psa, R) to a controller π, which prescribes the actions to take.
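A minimal sketch of these three ingredients on a toy discrete MDP (all states, transition probabilities, and rewards here are illustrative, not from the talk):

```python
import numpy as np

# Toy MDP with 3 states and 2 actions; all values illustrative.
n_states, n_actions = 3, 2

# Dynamics model Psa: P[a][s, s'] = probability of next state s'
# given current state s and action a.
P = np.zeros((n_actions, n_states, n_states))
P[0] = [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]]  # action 0: drift right
P[1] = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]  # action 1: reset to state 0

# Reward function R: desirability of being in each state.
R = np.array([0.0, 0.0, 1.0])

# Controller (policy) pi: prescribes an action for each state.
pi = np.array([0, 0, 0])

# One step of the interaction: distribution over next states under pi.
s = 0
next_state_dist = P[pi[s], s]
print(next_state_dist)  # [0.9 0.1 0. ]
```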
Apprenticeship learning

Teacher demonstration (s0, a0, s1, a1, …) → Dynamics model Psa and reward function R → Reinforcement learning → Controller π.
Example task: driving
Learning from demonstrations
Learn a direct mapping from states to actions. Assumes controller simplicity. E.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002.
Inverse reinforcement learning [Ng & Russell, 2000]: tries to recover the reward function from demonstrations. Inherent ambiguity makes the reward function impossible to recover exactly.
Apprenticeship learning [Abbeel & Ng, 2004]: exploits reward function structure and provides strong guarantees.
Related work since: Ratliff et al., 2006, 2007; Neu & Szepesvari, 2007; Syed & Schapire, 2008.
Apprenticeship learning
Key desirable properties:
Returns a controller with a performance guarantee.
Short running time.
Small number of demonstrations required.
Apprenticeship learning algorithm

Assume the reward function is a linear combination of known features: Rw(s) = w · φ(s).
Initialize: pick some controller π0.
Iterate for i = 1, 2, … :
Make the current best guess for the reward function. Concretely, find the reward weights w such that the teacher maximally outperforms all previously found controllers.
Find the optimal controller πi for the current guess of the reward function Rw.
If the teacher outperforms all controllers found so far by less than a tolerance ε, exit the algorithm.
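The loop above can be sketched with the projection-style update from the ICML 2004 paper. Here the RL step is a stand-in that picks among a fixed set of candidate feature-expectation vectors; all numeric values are illustrative:

```python
import numpy as np

# Feature expectations of the teacher and of candidate controllers.
# In the real algorithm, mu comes from solving the MDP for the
# current reward guess R_w(s) = w . phi(s); here it is a lookup.
mu_teacher = np.array([0.6, 0.4])
candidates = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]

def best_response(w):
    """RL step (stand-in): controller maximizing expected reward w . mu."""
    return max(candidates, key=lambda mu: w @ mu)

def apprenticeship_loop(mu_bar, eps=1e-6, max_iters=50):
    """Projection variant of the loop: shrink the gap between the
    teacher's feature expectations and what our controllers achieve."""
    for _ in range(max_iters):
        w = mu_teacher - mu_bar          # IRL step: reward guess
        if np.linalg.norm(w) <= eps:     # teacher no longer outperforms us
            break
        mu = best_response(w)            # RL step for reward R_w
        d = mu - mu_bar                  # projection update of mu_bar
        mu_bar = mu_bar + (d @ (mu_teacher - mu_bar)) / (d @ d) * d
    return mu_bar

mu_final = apprenticeship_loop(np.array([0.0, 1.0]))
print(np.linalg.norm(mu_teacher - mu_final))  # ~0: teacher matched
```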
Theoretical guarantees
Highway driving
Input: driving demonstration. Output: learned behavior.
The only input to the learning algorithm was the driving demonstration (left panel). No reward function was provided.
Parking lot navigation
Reward function trades off: curvature, smoothness, distance to obstacles, alignment with principal directions.
Quadruped [NIPS 2008]

Reward function trades off 25 features.
Learn on training terrain.
Test on previously unseen terrain.
Quadruped on test-board
Apprenticeship learning

Teacher's flight (s0, a0, s1, a1, …) → Learn R (reward function R); with the dynamics model Psa, reinforcement learning produces a controller π.
Motivating example

How to obtain an accurate dynamics model Psa? Two routes:
Textbook model / specification.
Collect flight data and learn the model from data.
How to fly the helicopter for data collection? How to ensure that the entire flight envelope is covered by the data collection process?
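As a toy instance of "learn the model from data," one can fit a linear dynamics model s_{t+1} ≈ A s_t + B a_t to logged (state, action, next state) triples by least squares. The true dynamics and noise level below are made up, and a real helicopter model is far richer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth linear dynamics (unknown to the learner); illustrative only.
A_true = np.array([[0.95, 0.1], [0.0, 0.9]])
B_true = np.array([[0.0], [0.2]])

# Collect flight data: (state, action, next state) triples.
states, actions, next_states = [], [], []
s = np.zeros(2)
for _ in range(200):
    a = rng.normal(size=1)
    s_next = A_true @ s + B_true @ a + 0.01 * rng.normal(size=2)
    states.append(s); actions.append(a); next_states.append(s_next)
    s = s_next

# Least-squares fit of [A B] from the data.
X = np.hstack([np.array(states), np.array(actions)])   # (T, 3)
Y = np.array(next_states)                              # (T, 2)
AB, *_ = np.linalg.lstsq(X, Y, rcond=None)
A_hat, B_hat = AB.T[:, :2], AB.T[:, 2:]
print(np.abs(A_hat - A_true).max())  # small estimation error
```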
Learning the dynamics model
State-of-the-art: the E3 algorithm, Kearns and Singh (1998, 2002). (And its variants/extensions: Kearns and Koller, 1999; Kakade, Kearns and Langford, 2003; Brafman and Tennenholtz, 2002.)
Have a good model of the dynamics? NO → “Explore”. YES → “Exploit”.
Exploration policies are impractical: they do not even try to perform well. Can we avoid explicit exploration and just exploit?
Apprenticeship learning of the model

Teacher's flight (s0, a0, s1, a1, …) and autonomous flight (s0, a0, s1, a1, …) → Learn Psa (dynamics model Psa); with the reward function R, reinforcement learning produces a controller π.
Theoretical guarantees

Here, “polynomial” is with respect to 1/ε, 1/δ (the failure probability), the horizon T, the maximum reward R, and the size of the state space.
Model Learning: Proof Idea

From initial pilot demonstrations, our model/simulator Psa will be accurate for the part of the state space (s, a) visited by the pilot.
Our model/simulator will correctly predict the helicopter's behavior under the pilot's controller π*.
Consequently, there is at least one controller (namely π*) that looks capable of flying the helicopter well in our simulation.
Thus, each time we solve for the optimal controller using the current model/simulator Psa, we will find a controller that successfully flies the helicopter according to Psa.
If, on the actual helicopter, this controller fails to fly the helicopter---despite the model Psa predicting that it should---then it must be visiting parts of the state space that are inaccurately modeled.
Hence, we get useful training data to improve the model. This can happen only a small number of times.
Learning the dynamics model

Exploiting structure from physics:
Explicitly encode gravity and inertia. Estimate the remaining dynamics from data.
Lagged learning criterion:
Maximize the prediction accuracy of the simulator over time scales relevant for control (vs. the digital integration time scale).
Similar to machine learning: discriminative vs. generative.
[Abbeel et al., NIPS 2005, NIPS 2006]
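The lagged criterion can be illustrated by scoring candidate models on multi-step open-loop rollouts rather than one-step predictions. The linear models below are fabricated specifically so that the two criteria rank the models differently:

```python
import numpy as np

def rollout_error(A_model, A_true, s0, horizon):
    """Prediction error after simulating `horizon` steps open-loop,
    i.e. the error at the time scale that matters for control."""
    s_model, s_true = s0.copy(), s0.copy()
    for _ in range(horizon):
        s_model = A_model @ s_model
        s_true = A_true @ s_true
    return np.linalg.norm(s_model - s_true)

A_true = np.array([[0.99, 0.05], [0.0, 0.99]])
A_a = np.array([[0.99, 0.0], [0.0, 0.99]])   # misses the cross-coupling
A_b = np.array([[0.99, 0.05], [0.0, 0.93]])  # has coupling, wrong decay

s0 = np.array([1.0, 1.0])
# One-step error slightly favors A_a...
print(rollout_error(A_a, A_true, s0, 1), rollout_error(A_b, A_true, s0, 1))
# ...but over a control-relevant horizon, A_b is clearly better:
# the missing coupling in A_a compounds over time.
print(rollout_error(A_a, A_true, s0, 50), rollout_error(A_b, A_true, s0, 50))
```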
Autonomous nose-in funnel
Related work: Bagnell & Schneider, 2001; LaCivita et al., 2006; Ng et al., 2004a; Roberts et al., 2003; Saripalli et al., 2003; Ng et al., 2004b; Gavrilets, Martinos, Mettler and Feron, 2002.
The maneuvers presented here are significantly more difficult than those flown by any other autonomous helicopter.
Apprenticeship learning

Teacher's flight (s0, a0, s1, a1, …) → Learn R (reward function R); teacher's and autonomous flight (s0, a0, s1, a1, …) → Learn Psa (dynamics model Psa). Reinforcement learning then produces a controller π.
Model predictive control: receding horizon differential dynamic programming.
Apprenticeship learning: summary

Teacher's flight (s0, a0, s1, a1, …) → Learn Psa (dynamics model) and Learn R (reward function); reinforcement learning then produces a controller π, and autonomous flight (s0, a0, s1, a1, …) feeds back into model learning.
Applications:
Demonstrations
Learned reward (trajectory)
Current and future work

Applications:
Autonomous helicopters to assist in wildland fire fighting.
Fixed-wing formation flight: estimated fuel savings for a three-aircraft formation: 20%.
Learning from demonstrations only scratches the surface of how humans learn (and teach).
Safe autonomous learning.
More general advice taking.
Thank you.
Apprenticeship Learning via Inverse Reinforcement Learning, Pieter Abbeel and Andrew Y. Ng. In Proc. ICML, 2004.
Learning First Order Markov Models for Control, Pieter Abbeel and Andrew Y. Ng. In NIPS 17, 2005.
Exploration and Apprenticeship Learning in Reinforcement Learning, Pieter Abbeel and Andrew Y. Ng. In Proc. ICML, 2005.
Modeling Vehicular Dynamics, with Application to Modeling Helicopters, Pieter Abbeel, Varun Ganapathi and Andrew Y. Ng. In NIPS 18, 2006.
Using Inaccurate Models in Reinforcement Learning, Pieter Abbeel, Morgan Quigley and Andrew Y. Ng. In Proc. ICML, 2006.
An Application of Reinforcement Learning to Aerobatic Helicopter Flight, Pieter Abbeel, Adam Coates, Morgan Quigley and Andrew Y. Ng. In NIPS 19, 2007.
Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion, J. Zico Kolter, Pieter Abbeel and Andrew Y. Ng. In NIPS 20, 2008.
Airshow accuracy
Chaos
Tic-toc
Full Inverse RL Algorithm

Initialize: pick some arbitrary reward weights w.
For i = 1, 2, …
RL step: compute the optimal controller πi for the current estimate of the reward function Rw.
Inverse RL step: re-estimate the reward function Rw.
If the teacher outperforms all controllers found so far by less than a tolerance ε, exit the algorithm.
Helicopter dynamics model in auto
Parking lot navigation---experiments
Helicopter inverse RL: experiments
Auto-rotation descent
Apprenticeship learning

Teacher's flight (s0, a0, s1, a1, …) → Learn R (reward function R); teacher's and autonomous flight (s0, a0, s1, a1, …) → Learn Psa (dynamics model Psa). Reinforcement learning then produces a controller π.
Algorithm Idea

Input to algorithm: approximate model. Start by computing the optimal controller according to the model.
Real-life trajectory vs. target trajectory.
Algorithm Idea (2)

Update the model such that it becomes exact for the current controller.
Performance Guarantees
First trial (model-based controller). After learning (10 iterations).
Performance guarantee intuition

Intuition by example: suppose the reward is linear in features, Rw(s) = w · φ(s), so a controller's expected performance is w · μ(π), where μ(π) is its vector of expected feature counts.
If the returned controller π matches the teacher's feature expectations, μ(π) = μ(π*), then no matter what the values of the reward weights w are, the controller π performs as well as the teacher's controller π*.
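Numerically, the intuition is just linearity of the expected reward in the feature expectations (feature values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# If R_w(s) = w . phi(s), expected total reward is w . mu(pi), where
# mu(pi) is the controller's vector of expected feature counts.
mu_teacher = np.array([0.3, 0.7, 1.2])
mu_learned = mu_teacher.copy()   # feature expectations matched

# Then for ANY weight vector w, the two controllers score identically.
for _ in range(5):
    w = rng.normal(size=3)
    assert np.isclose(w @ mu_learned, w @ mu_teacher)
print("equal expected reward for every sampled w")
```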
Summary

Teacher: human pilot flight (a1, s1, a2, s2, a3, s3, …) → Learn Psa (dynamics model Psa) and Learn R (reward function R). Reinforcement learning, max E[R(s0) + … + R(sT)], produces a controller π; autonomous flight (a1, s1, a2, s2, a3, s3, …) feeds back to improve the model.
When given a demonstration:
Automatically learn the reward function, rather than (time-consumingly) hand-engineer it.
Unlike exploration methods, our algorithm concentrates on the task of interest, and always tries to fly as well as possible.
High-performance control with a crude model + a small number of trials.
Reward: Intended trajectory

Perfect demonstrations are extremely hard to obtain.
Multiple trajectory demonstrations: every demonstration is a noisy instantiation of the intended trajectory.
The noise model captures (among others): position drift, time warping.
If different demonstrations are suboptimal in different ways, they can capture the “intended” trajectory implicitly.
[Related work: Atkeson & Schaal, 1997.]
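The full method infers the intended trajectory with a probabilistic model of drift and time warping. As a crude stand-in, averaging already time-aligned noisy demonstrations shows why independent errors cancel (all signals below are fabricated):

```python
import numpy as np

rng = np.random.default_rng(0)

# Intended trajectory (unknown to the learner): a 1-D signal, 100 steps.
t = np.linspace(0, 1, 100)
intended = np.sin(2 * np.pi * t)

# Each demonstration is a noisy instantiation (position drift modeled
# as a random walk). The full method also models time warping; omitted.
demos = []
for _ in range(20):
    drift = np.cumsum(0.02 * rng.normal(size=100))
    demos.append(intended + drift)

# Crude stand-in for inference of the intended trajectory: the mean of
# (already time-aligned) demonstrations; independent errors average out.
estimate = np.mean(demos, axis=0)
err_single = np.abs(demos[0] - intended).max()
err_mean = np.abs(estimate - intended).max()
print(err_single, err_mean)  # the mean is the better estimate
```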
Outline

Preliminaries: reinforcement learning.
Apprenticeship learning algorithms.
Experimental results on various robotic platforms.
Reinforcement learning (RL)

State sequence s0 → s1 → s2 → … → sT-1 → sT, where each transition is governed by the system dynamics Psa and the chosen actions a0, a1, …, aT-1.
Reward: R(s0) + R(s1) + R(s2) + … + R(sT-1) + R(sT).
Goal: pick actions over time so as to maximize the expected score: E[R(s0) + R(s1) + … + R(sT)].
Solution: a controller π which specifies an action for each possible state for all times t = 0, 1, …, T-1.
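For a small discrete MDP, this objective can be solved exactly by backward dynamic programming (toy transitions and rewards, not from the talk):

```python
import numpy as np

# Toy MDP: P[a][s, s'] transition probabilities, R[s] state rewards.
P = np.array([
    [[0.8, 0.2], [0.0, 1.0]],   # action 0
    [[0.2, 0.8], [1.0, 0.0]],   # action 1
])
R = np.array([0.0, 1.0])
T = 10

# Backward dynamic programming: V[s] = max expected reward-to-go,
# policy[t, s] = the maximizing action at time t in state s.
n_states = len(R)
V = R.astype(float)                      # value at final time T
policy = np.zeros((T, n_states), dtype=int)
for t in reversed(range(T)):
    Q = R[:, None] + (P @ V).T           # Q[s, a] = R(s) + E[V(s')]
    policy[t] = Q.argmax(axis=1)
    V = Q.max(axis=1)

print(V)  # optimal expected sum of rewards from each start state
```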
Model-based reinforcement learning

Run a reinforcement learning algorithm in the simulator to obtain a controller.
Probabilistic graphical model for multiple demonstrations
Apprenticeship learning for the dynamics model [ICML 2005]

Algorithms such as E3 (Kearns and Singh, 2002) learn the dynamics by using exploration policies, which are dangerous/impractical for many systems.
Our algorithm:
Initializes the model from a demonstration.
Repeatedly executes “exploitation policies” that try to maximize rewards.
Provably achieves near-optimal performance (compared to the teacher).
Machine learning theory: complicated non-IID sample generating process; standard learning theory bounds are not applicable. The proof uses a martingale construction over relative losses.
Accuracy
Non-stationary maneuvers

Modeling is extremely complex. Our dynamics model state: position, orientation, velocity, angular rate.
True state: air (!), head speed, servos, deformation, etc.
Key observation: in the vicinity of a specific point along a specific trajectory, these unknown state variables tend to take on similar values.
Example: z-acceleration
Local model learning algorithm

1. Time-align trajectories.
2. Learn locally weighted models in the vicinity of the trajectory, with weights W(t') = exp(-(t - t')² / (2σ²)).
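Step 2 can be sketched as weighted least squares with Gaussian time weights; the time-varying coefficient being tracked below is fabricated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Time-indexed data along one aligned trajectory (illustrative): a
# slowly varying "dynamics coefficient" we want to track locally.
T = 200
ts = np.arange(T)
coeff = 0.5 + 0.4 * np.sin(2 * np.pi * ts / T)   # hidden, time-varying
x = rng.normal(size=T)
y = coeff * x + 0.01 * rng.normal(size=T)

def local_fit(t, sigma=10.0):
    """Weighted least squares around time t with Gaussian weights
    W(t') = exp(-(t - t')^2 / (2 sigma^2))."""
    w = np.exp(-(ts - t) ** 2 / (2 * sigma ** 2))
    return (w * x * y).sum() / (w * x * x).sum()

# Local fits track the hidden coefficient near each query time.
print(local_fit(50), local_fit(150))
```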
Algorithm Idea w/Teacher [ICML 2006]

Input to algorithm: teacher demonstration; approximate model.
Teacher trajectory vs. the trajectory predicted by the simulator/model for the same inputs.
Algorithm Idea w/Teacher (2)

Update the model such that it becomes exact for the demonstration.
The updated model perfectly predicts the state sequence obtained during the demonstration. We can use the updated model to find a feedback controller.
Algorithm w/Teacher

1. Record the teacher’s demonstration s0, s1, …
2. Update the (crude) model/simulator to be exact for the teacher’s demonstration by adding appropriate time biases for each time step.
3. Return the policy that is optimal according to the updated model/simulator.
Performance guarantees w/Teacher

Theorem.
Algorithm [iterative]
1. Record teacher’s demonstration s0, s1, …
2. Update the (crude) model/simulator to be exact for the teacher’s demonstration by adding appropriate time biases for each time step.
3. Find the policy that is optimal according to the updated model/simulator.
4. Execute the policy and record the state trajectory.
5. Update the (crude) model/simulator to be exact along the trajectory obtained with the current policy.
6. Go to step 3.
Related work: iterative learning control (ILC).
![Page 71: Apprenticeship Learning Pieter Abbeel Stanford University In collaboration with: Andrew Y. Ng, Adam Coates, J. Zico Kolter, Morgan Quigley, Dmitri Dolgov,](https://reader038.vdocuments.us/reader038/viewer/2022110207/56649d2d5503460f94a048a5/html5/thumbnails/71.jpg)
Algorithm
1. Find the (locally) optimal policy π for the model.
2. Execute the current policy π and record the state trajectory.
3. Update the model such that the new model is exact for the current policy π.
4. Use the new model to compute the policy gradient and update the policy: θ := θ + α ∇θU(θ).
5. Go back to Step 2.
Notes: The step-size parameter α is determined by a line search. Instead of the policy gradient, any algorithm that provides a local policy improvement direction can be used. In our experiments we used differential dynamic programming.
![Page 72: Apprenticeship Learning Pieter Abbeel Stanford University In collaboration with: Andrew Y. Ng, Adam Coates, J. Zico Kolter, Morgan Quigley, Dmitri Dolgov,](https://reader038.vdocuments.us/reader038/viewer/2022110207/56649d2d5503460f94a048a5/html5/thumbnails/72.jpg)
Algorithm
1. Find the (locally) optimal policy for the model.
2. Execute the current policy and record the state trajectory.
3. Update the model such that the new model is exact for the current policy π.
4. Use the new model to compute the policy gradient and update the policy: θ := θ + α ∇θU(θ).
5. Go back to Step 2.
Related work: Iterative learning control.
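Step 4's update with the line-searched step size can be sketched as below. This is a minimal illustrative sketch, not the talk's implementation (which used differential dynamic programming rather than a raw gradient): `model_grad` stands in for the policy gradient computed from the updated model, and `U` for the utility being improved.

```python
def policy_gradient_step(theta, model_grad, U,
                         alphas=(1.0, 0.5, 0.25, 0.1)):
    """One update of step 4: theta := theta + alpha * g, with the step
    size alpha chosen by a backtracking line search on the utility U.
    (model_grad and U are placeholder names.)"""
    g = model_grad(theta)
    best = theta
    for alpha in alphas:
        cand = theta + alpha * g
        if U(cand) > U(best):
            best = cand
            break  # accept the first step size that improves utility
    return best
```

Any routine supplying a local improvement direction could replace `model_grad` here, which is exactly the flexibility the notes on the previous slide describe.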
![Page 73: Apprenticeship Learning Pieter Abbeel Stanford University In collaboration with: Andrew Y. Ng, Adam Coates, J. Zico Kolter, Morgan Quigley, Dmitri Dolgov,](https://reader038.vdocuments.us/reader038/viewer/2022110207/56649d2d5503460f94a048a5/html5/thumbnails/73.jpg)
Future work
![Page 74: Apprenticeship Learning Pieter Abbeel Stanford University In collaboration with: Andrew Y. Ng, Adam Coates, J. Zico Kolter, Morgan Quigley, Dmitri Dolgov,](https://reader038.vdocuments.us/reader038/viewer/2022110207/56649d2d5503460f94a048a5/html5/thumbnails/74.jpg)
Acknowledgments
J. Zico Kolter, Andrew Y. Ng
Morgan Quigley, Andrew Y. Ng
Andrew Y. Ng
Adam Coates, Morgan Quigley, Andrew Y. Ng
![Page 75: Apprenticeship Learning Pieter Abbeel Stanford University In collaboration with: Andrew Y. Ng, Adam Coates, J. Zico Kolter, Morgan Quigley, Dmitri Dolgov,](https://reader038.vdocuments.us/reader038/viewer/2022110207/56649d2d5503460f94a048a5/html5/thumbnails/75.jpg)
RC Car: Circle
![Page 76: Apprenticeship Learning Pieter Abbeel Stanford University In collaboration with: Andrew Y. Ng, Adam Coates, J. Zico Kolter, Morgan Quigley, Dmitri Dolgov,](https://reader038.vdocuments.us/reader038/viewer/2022110207/56649d2d5503460f94a048a5/html5/thumbnails/76.jpg)
RC Car: Figure-8 Maneuver
![Page 77: Apprenticeship Learning Pieter Abbeel Stanford University In collaboration with: Andrew Y. Ng, Adam Coates, J. Zico Kolter, Morgan Quigley, Dmitri Dolgov,](https://reader038.vdocuments.us/reader038/viewer/2022110207/56649d2d5503460f94a048a5/html5/thumbnails/77.jpg)
Teacher demonstration for quadruped
Full teacher demonstration = sequence of footsteps.
Much simpler to “teach hierarchically”: Specify a body path. Specify best footstep in a small area.
![Page 78: Apprenticeship Learning Pieter Abbeel Stanford University In collaboration with: Andrew Y. Ng, Adam Coates, J. Zico Kolter, Morgan Quigley, Dmitri Dolgov,](https://reader038.vdocuments.us/reader038/viewer/2022110207/56649d2d5503460f94a048a5/html5/thumbnails/78.jpg)
Hierarchical inverse RL
Quadratic programming problem (QP): quadratic objective, linear constraints.
Constraint generation for path constraints.
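The flavor of the QP with constraint generation can be sketched as follows. This is a toy margin-based sketch, not the paper's exact solver: it finds reward weights under which the labeled best footstep outscores each alternative by a margin, repeatedly selecting the most violated constraint (the constraint-generation idea) and taking a subgradient step instead of re-solving a full QP. All names are illustrative.

```python
import numpy as np

def fit_reward_weights(phi_best, phi_alts, lr=0.1, n_iters=2000):
    """Toy inverse-RL sketch: find weights w so the labeled best choice
    outscores every alternative by a margin of 1.
    phi_best: feature vector of the labeled footstep/path;
    phi_alts: feature vectors of the alternatives."""
    w = np.zeros(len(phi_best))
    for _ in range(n_iters):
        margins = [w @ (phi_best - alt) for alt in phi_alts]
        worst = int(np.argmin(margins))   # most violated constraint
        if margins[worst] >= 1.0:         # all margin constraints satisfied
            break
        w = w + lr * (phi_best - phi_alts[worst])  # subgradient step
    return w
```

In the actual formulation the objective is quadratic in w with linear margin constraints, and path constraints are generated lazily because enumerating all alternative paths up front is intractable.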
![Page 79: Apprenticeship Learning Pieter Abbeel Stanford University In collaboration with: Andrew Y. Ng, Adam Coates, J. Zico Kolter, Morgan Quigley, Dmitri Dolgov,](https://reader038.vdocuments.us/reader038/viewer/2022110207/56649d2d5503460f94a048a5/html5/thumbnails/79.jpg)
Training: Have quadruped walk straight across a fairly simple board with fixed-spaced foot placements.
Around each foot placement: label the best foot placement. (about 20 labels)
Label the best body-path for the training board.
Use our hierarchical inverse RL algorithm to learn a reward function from the footstep and path labels.
Test on hold-out terrains: Plan a path across the test-board.
Experimental setup
![Page 80: Apprenticeship Learning Pieter Abbeel Stanford University In collaboration with: Andrew Y. Ng, Adam Coates, J. Zico Kolter, Morgan Quigley, Dmitri Dolgov,](https://reader038.vdocuments.us/reader038/viewer/2022110207/56649d2d5503460f94a048a5/html5/thumbnails/80.jpg)
Task: Hover at a specific point. Initial state: tens of meters away from target.
Reward function trades off: Position accuracy, Orientation accuracy, Zero velocity, Zero angular rate, … (11 features total)
Helicopter Flight
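A reward of this shape can be sketched as a weighted sum of error features. This is an illustrative sketch only: the feature names below are hypothetical stand-ins for four of the eleven features the slide mentions, and the learned weights `w` encode the trade-off between them.

```python
import numpy as np

def hover_reward(state, target, w):
    """Sketch of a linear-in-features hover reward: penalties on position,
    orientation, velocity, and angular-rate errors. (Feature names are
    illustrative; the actual reward used 11 features.)"""
    pos_err  = np.sum((state["pos"] - target["pos"]) ** 2)
    ori_err  = np.sum((state["ori"] - target["ori"]) ** 2)
    vel_err  = np.sum(state["vel"] ** 2)        # want zero velocity
    rate_err = np.sum(state["ang_rate"] ** 2)   # want zero angular rate
    features = np.array([pos_err, ori_err, vel_err, rate_err])
    return -(w @ features)  # weighted trade-off among the error terms
```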
![Page 81: Apprenticeship Learning Pieter Abbeel Stanford University In collaboration with: Andrew Y. Ng, Adam Coates, J. Zico Kolter, Morgan Quigley, Dmitri Dolgov,](https://reader038.vdocuments.us/reader038/viewer/2022110207/56649d2d5503460f94a048a5/html5/thumbnails/81.jpg)
Learned from “careful” pilot
![Page 82: Apprenticeship Learning Pieter Abbeel Stanford University In collaboration with: Andrew Y. Ng, Adam Coates, J. Zico Kolter, Morgan Quigley, Dmitri Dolgov,](https://reader038.vdocuments.us/reader038/viewer/2022110207/56649d2d5503460f94a048a5/html5/thumbnails/82.jpg)
Learned from “aggressive” pilot
![Page 83: Apprenticeship Learning Pieter Abbeel Stanford University In collaboration with: Andrew Y. Ng, Adam Coates, J. Zico Kolter, Morgan Quigley, Dmitri Dolgov,](https://reader038.vdocuments.us/reader038/viewer/2022110207/56649d2d5503460f94a048a5/html5/thumbnails/83.jpg)
More driving examples
In each video, the left sub-panel shows a demonstration of a different driving “style”, and the right sub-panel shows the behavior learned from watching the demonstration.