Inverse Reinforcement Learning (rll.berkeley.edu/deeprlcourse/docs/inverserl.pdf)
TRANSCRIPT
Inverse Reinforcement Learning
Chelsea Finn
3/5/2017
Course Reminders:
- March 22nd: Project group & title due
- April 17th: Milestone report due & milestone presentations
- April 26th: Beginning of project presentations
Inverse RL: Outline
1. Motivation & Definition
2. Early Approaches
3. Maximum Entropy Inverse RL
4. Scaling inverse RL to deep cost functions
Mnih et al. '15
video from Montessori New Zealand
[diagram: reinforcement learning agent receiving a reward; what is the reward?]
In the real world, humans don't get a score.
reward function is essential for RL
real-world domains: reward/cost often difficult to specify
• robotic manipulation
• autonomous driving
• dialog systems
• virtual assistants
• and more…
Kohl & Stone '04, Tesauro '95, Mnih et al. '15, Silver et al. '16
Motivation
Behavioral Cloning: mimic the actions of the expert
- but no reasoning about outcomes or dynamics
- the expert might have different degrees of freedom
Can we reason about what the expert is trying to achieve?
Inverse Optimal Control / Inverse Reinforcement Learning (IOC/IRL): infer the cost/reward function from expert demonstrations (Kalman '64, Ng & Russell '00)
Inverse Optimal Control / Inverse Reinforcement Learning: infer cost/reward function from demonstrations

given:
- state & action space
- roll-outs from π*
- dynamics model [sometimes]

goal:
- recover reward function
- then use reward to get policy

Challenges:
- underdefined problem
- difficult to evaluate a learned cost
- demonstrations may not be precisely optimal

Compare to DAgger: no direct access to π*
Early IRL Approaches
All of these alternate between solving the MDP w.r.t. the current cost and updating the cost.
Ng & Russell ‘00: expert actions should have higher value than other actions, larger gap is better
Abbeel & Ng '04: the policy that is optimal w.r.t. the learned cost should match the feature counts of the expert trajectories
Ratliff et al. ’06: max margin formulation between value of expert actions and other actions
How to handle ambiguity? What if expert is not perfect?
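Abbeel & Ng's feature-matching idea can be made concrete with a small sketch; the `feature_fn` helper and the trajectory format (lists of states) are assumptions for illustration, not the paper's actual interface:

```python
import numpy as np

def feature_expectations(trajectories, feature_fn, gamma=0.99):
    """Monte Carlo estimate of discounted feature counts
    mu(pi) = E[ sum_t gamma^t f(s_t) ] from sampled trajectories."""
    mu = None
    for traj in trajectories:
        for t, state in enumerate(traj):
            f = (gamma ** t) * feature_fn(state)
            mu = f if mu is None else mu + f
    return mu / len(trajectories)
```

With a linear cost c(s) = wᵀ f(s), matching μ(π) = μ(expert) implies equal expected cost under every w, which is the core of Abbeel & Ng's argument.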
Inverse RL: Outline
1. Motivation & Examples
2. Early Approaches
3. Maximum Entropy Inverse RL
4. Scaling inverse RL to deep cost functions
Whiteboard
Maximum Entropy Inverse RL (Ziebart et al. '08)
Notation:
- c_θ: cost with parameters θ [linear case: c_θ(x) = θᵀ f(x)]
- D = {τ_i}: dataset of demonstrations
- p(x_{t+1} | x_t, u_t): transition dynamics
Maximum Entropy Inverse RL (Ziebart et al. '08)
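The whiteboard derivation is not captured in the transcript; for reference, the standard MaxEnt IRL objective (Ziebart et al. '08), written in the cost notation above, is:

```latex
p_\theta(\tau) = \frac{1}{Z}\exp(-c_\theta(\tau)),
\qquad Z = \int \exp(-c_\theta(\tau))\, d\tau
```

Maximum likelihood over the demonstrations $\mathcal{D}$ minimizes

```latex
\mathcal{L}(\theta) = \frac{1}{N}\sum_{\tau_i \in \mathcal{D}} c_\theta(\tau_i) + \log Z,
\qquad
\nabla_\theta \mathcal{L} = \frac{1}{N}\sum_{\tau_i \in \mathcal{D}} \nabla_\theta c_\theta(\tau_i)
- \mathbb{E}_{\tau \sim p_\theta}\!\big[\nabla_\theta c_\theta(\tau)\big]
```

so the gradient pushes demo costs down and the costs of trajectories likely under the current model up; the expectation term is what makes the inner MDP solve necessary.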
Inverse RL: Outline
1. Motivation & Examples
2. Early Approaches
3. Maximum Entropy IRL
4. Scaling IRL to deep cost functions
Case Study: MaxEnt Deep IRL
MaxEnt IRL with known dynamics (tabular setting), neural net cost
NIPS Deep RL workshop 2015; IROS 2016
Case Study: MaxEnt Deep IRL
MaxEnt IRL with known dynamics (tabular setting), neural net cost
Need to iteratively solve the MDP for every weight update.
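That inner-loop MDP solve can be sketched for the tabular, known-dynamics setting; the array shapes, finite-horizon formulation, and function name below are assumptions following the standard MaxEnt backward/forward pass, not the paper's exact code:

```python
import numpy as np

def maxent_visitation(P, cost, d0, horizon):
    """P: [A, S, S] known dynamics, cost: [S] per-state cost,
    d0: [S] initial state distribution.
    Returns expected state-visitation frequencies under the
    soft-optimal policy for the current cost."""
    A, S, _ = P.shape
    # backward pass: soft value iteration
    V = np.zeros(S)
    policies = []
    for _ in range(horizon):
        Q = -cost[:, None] + (P @ V).T                       # [S, A] soft Q-values
        m = Q.max(axis=1)
        V = m + np.log(np.exp(Q - m[:, None]).sum(axis=1))   # soft max over actions
        policies.append(np.exp(Q - V[:, None]))              # pi_t(a|s)
    policies.reverse()  # backward induction produced reverse-time order
    # forward pass: propagate the state distribution
    d, visitation = d0.copy(), np.zeros(S)
    for pi in policies:
        visitation += d
        d = np.einsum('s,sa,asj->j', d, pi, P)
    return visitation
```

The cost gradient for each weight update is then (demo visitation counts) minus (this expected visitation), backpropagated through the cost network.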
Case Study: MaxEnt Deep IRL MaxEnt IRL with known dynamics (tabular setting), neural net cost
[figure: demonstrations; cost features: mean height, height variance, cell visibility]
120 km of demonstration data
test-set trajectory prediction, compared against a manually designed cost
(MHD: modified Hausdorff distance)
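The MHD metric used for that evaluation is simple to state; here is a sketch for trajectories given as [N, 2] point arrays (the input format is an assumption):

```python
import numpy as np

def modified_hausdorff(A, B):
    """Modified Hausdorff distance (Dubuisson & Jain '94): the larger of
    the two mean nearest-neighbor distances between point sets A and B."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # pairwise dists
    return max(D.min(axis=1).mean(),   # A -> B
               D.min(axis=0).mean())   # B -> A
```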
Case Study: MaxEnt Deep IRL
MaxEnt IRL with known dynamics (tabular setting), neural net cost

Strengths:
- scales to neural net costs
- efficient enough for real robots

Limitations:
- still need to repeatedly solve the MDP
- assumes known dynamics
Whiteboard
What about unknown dynamics?
Goals:
- remove the need to solve the MDP in the inner loop
- handle unknown dynamics
- handle continuous state & action spaces
Case Study: Guided Cost Learning
ICML 2016
guided cost learning algorithm
[diagram: starting from an initial policy π0, loop: generate policy samples from π; update cost using samples & demos; update π w.r.t. cost. The cost c_θ(x) is a neural network.]
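The loop in the diagram can be written out schematically; the three callables below are hypothetical stand-ins for the boxes on the slide, not the paper's actual interfaces:

```python
def guided_cost_learning(demos, policy, cost, n_iters,
                         sample_fn, cost_update_fn, policy_update_fn):
    """Alternate the three boxes of the GCL diagram."""
    for _ in range(n_iters):
        samples = sample_fn(policy)                   # generate policy samples from pi
        cost = cost_update_fn(cost, samples, demos)   # update cost using samples & demos
        policy = policy_update_fn(policy, cost)       # update pi w.r.t. cost
    return cost, policy
```

The key design choice, made explicit on the next slide, is that the policy update is only partial: the cost is updated in the inner loop of policy optimization rather than after solving the MDP to convergence.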
guided cost learning algorithm
[same diagram, now labeled as a GAN: the policy π is the generator (generate policy samples from π; update π w.r.t. cost, only partially optimizing), and the neural-net cost c_θ(x) is the discriminator (update cost using samples & demos). The cost is updated in the inner loop of policy optimization.]
guided cost learning algorithm
[same generator/discriminator diagram; cf. Ho et al., ICML '16, NIPS '16]
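The generator/discriminator correspondence can be made explicit. For MaxEnt IRL with sampler density q(τ), the discriminator takes a structured form; this is the standard formula from the GAN-IRL connection literature, not transcribed from the slide:

```latex
D_\theta(\tau) \;=\;
\frac{\tfrac{1}{Z}\exp(-c_\theta(\tau))}
     {\tfrac{1}{Z}\exp(-c_\theta(\tau)) + q(\tau)}
```

At the optimum, q(τ) matches (1/Z) exp(−c_θ(τ)) and D = 1/2 everywhere, which is why the learned cost is meaningful at convergence rather than degenerate.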
What about unknown dynamics? Adaptive importance sampling.
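With unknown dynamics, the partition function Z is estimated with importance sampling from the current sampler q; a numerically stable sketch (the function name and interface are assumptions):

```python
import numpy as np

def log_partition_estimate(costs, log_q):
    """Estimate log Z = log mean_j [ exp(-c(tau_j)) / q(tau_j) ] from M
    sampled trajectories, given their costs c(tau_j) and sampler
    log-densities log q(tau_j); computed in log space for stability."""
    logw = -np.asarray(costs) - np.asarray(log_q)   # log importance weights
    m = logw.max()
    return m + np.log(np.exp(logw - m).mean())
```

Adapting q toward (1/Z) exp(−c) keeps the importance weights low-variance, which is the "adaptive" part.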
GCL Experiments: Real-world Tasks
- dish placement: state includes goal plate pose
- pouring almonds: state includes unsupervised visual features [Finn et al. '16]
- action: joint torques
Comparisons: Relative Entropy IRL (Boularias et al. '11), Path Integral IRL (Kalakrishnan et al. '13)
[diagram: generate policy samples from a fixed sampler q; update cost using samples & demos (neural-net cost c_θ(x)); no policy-update step]
Dish placement, demos
Dish placement, standard cost
Dish placement, RelEnt IRL
• video of dish baseline method
Dish placement, GCL policy
• video of dish our method - samples & reoptimizing
Pouring, demos
• video of pouring demos
Pouring, RelEnt IRL
• video of pouring baseline method
Pouring, GCL policy
• video of pouring our method - samples
Conclusion: We can recover successful policies for new positions.
Is the cost function also useful for new scenarios?
Dish placement - GCL reopt.
• video of dish our method - samples & reoptimizing
Pouring - GCL reopt.
• video of pouring our method - reoptimization
Note: normally the GAN discriminator is discarded
Case Study: Generative Adversarial Imitation Learning
NIPS 2016
Case Study: Generative Adversarial Imitation Learning
- demonstrations from a TRPO-optimized policy
- use TRPO as the policy optimizer
- OpenAI gym tasks
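As a reference point (not shown on the slide), the GAIL saddle-point objective from Ho & Ermon '16 is, up to sign conventions:

```latex
\min_\pi \; \max_{D \in (0,1)} \;
\mathbb{E}_{\pi}\big[\log D(s,a)\big]
+ \mathbb{E}_{\pi_E}\big[\log\big(1 - D(s,a)\big)\big]
- \lambda H(\pi)
```

with the policy step (TRPO here) treating log D(s,a) as its cost. Unlike guided cost learning, only a policy is recovered, not a reusable cost function, consistent with the earlier note that the GAN discriminator is normally discarded.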
Guided Cost Learning & Generative Adversarial Imitation Learning
Strengths:
- can handle unknown dynamics
- scales to neural net costs
- efficient enough for real robots

Limitations:
- adversarial optimization is hard
- can't scale to raw pixel observations of demos
- demonstrations typically collected with kinesthetic teaching or teleoperation (first person)
Next Time: Back to forward RL (advanced policy gradients)
IOC is under-defined; need regularization:
• encourage slowly changing cost
• cost of demos decreases strictly monotonically in time
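The two regularizers on this slide can be sketched directly as penalties over a demo's per-timestep costs; the function names are illustrative and the exact functional forms in the paper may differ slightly:

```python
import numpy as np

def lcr_reg(c):
    """'Slowly changing cost': squared second difference of the cost
    along a trajectory; c is a [T] array of per-timestep costs."""
    d2 = c[2:] - 2.0 * c[1:-1] + c[:-2]
    return float(np.sum(d2 ** 2))

def mono_reg(c):
    """'Strictly monotonically decreasing along demos': hinge penalty
    on any timestep where the demo cost fails to drop."""
    return float(np.sum(np.maximum(0.0, c[1:] - c[:-1]) ** 2))
```

A linearly decreasing demo cost incurs zero penalty from both terms, which is the intended fixed point.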
Regularization ablation
[plots: Reaching and Peg Insertion; x-axis: samples (5 to 85); y-axis: distance (0 to 0.8 for Reaching, 0 to 0.5 for Peg Insertion); legend: ours (demo init), ours (rand. init), ours no lcr reg (demo init), ours no lcr reg (rand. init), ours no mono reg (demo init), ours no mono reg (rand. init)]