class notes - bair
TRANSCRIPT
![Page 1: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/1.jpg)
1. Homework 5 due Tuesday, November 13th 11:59pm
Class notes
![Page 2: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/2.jpg)
Real-World Robot Learning:Safety and Flexibility
CS294-112: Deep Reinforcement Learning
Gregory Kahn
![Page 3: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/3.jpg)
Safety Flexibility
Why should you care?
![Page 4: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/4.jpg)
Topics• Safety
• Flexibility
Outline
Algorithms• Imitation learning
• Model-free
• Model-based
Safety Flexibility Imitation learning Model-free Model-based
2 * 3 = 6 papers we’ll cover
By no means the best / only papers on these topics
![Page 5: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/5.jpg)
Safety Flexibility Imitation learning Model-free Model-based
![Page 6: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/6.jpg)
Learn control policy that maps observations to controls
Observation
Control
Policy
Goal
Safety Flexibility Imitation learning Model-free Model-based
![Page 7: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/7.jpg)
Human expert
● Able to generate good trajectories using an expert policy
- cost function
- optimization
- full state information
only during training
Trajectory optimization
Assumption
Safety Flexibility Imitation learning Model-free Model-based
![Page 8: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/8.jpg)
● Problem: training and test distributions differ
Gather expert trajectories
Supervised learning
Training trajectory
Policy reaches states not in training set! [Ross et al 2010]
Learned policy trajectory
Trajectoryoptimization
Supervised Learning
Safety Flexibility Imitation learning Model-free Model-based
![Page 9: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/9.jpg)
[Ross et al 2011]
● Problem: training and test distributions differ● Solution: execute policy during training
Gather expert trajectories
Supervised learning
Dataset Aggregation (DAgger)
Safety Flexibility Imitation learning Model-free Model-based
![Page 10: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/10.jpg)
● DAgger mixes the actions
Safety during training
Safety Flexibility Imitation learning Model-free Model-based
![Page 11: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/11.jpg)
● DAgger mixes the actions ● PLATO mixes the objectives
cost J → avoids high cost
Policy Learning using Adaptive Trajectory Optimization (PLATO)
Safety Flexibility Imitation learning Model-free Model-based
![Page 12: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/12.jpg)
approach sampling policy
safe similar training and
test distributions
PLATO
supervised learning
DAgger
Algorithm comparisons
Safety Flexibility Imitation learning Model-free Model-based
![Page 13: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/13.jpg)
Canyon Forest
Experiments: final neural network policies
Safety Flexibility Imitation learning Model-free Model-based
![Page 14: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/14.jpg)
Canyon Forest
Experiments: metrics
Safety Flexibility Imitation learning Model-free Model-based
![Page 15: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/15.jpg)
CanyonForest CanyonForest
Experiments: metrics
Safety Flexibility Imitation learning Model-free Model-based
![Page 16: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/16.jpg)
Safety Flexibility Imitation learning Model-free Model-based
![Page 17: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/17.jpg)
Goal
NOT SAFE
Safety Flexibility Imitation learning Model-free Model-based
![Page 18: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/18.jpg)
Shielding
Like learning in a transformed MDP
Pre-emptive shielding
Shield can be used at test time
Post-posed shielding
Safety Flexibility Imitation learning Model-free Model-based
![Page 19: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/19.jpg)
How to shield: linear temporal logic
● Encode safety with temporal logic
● Assumption: Known approximate/conservative transition dynamics
Safety Flexibility Imitation learning Model-free Model-based
![Page 20: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/20.jpg)
Experiments
Safety criteria- Don’t crash
Safety Flexibility Imitation learning Model-free Model-based
![Page 21: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/21.jpg)
Experiments
Safety criteria- Don’t run out of oxygen- If enough oxygen,
don’t surface w/o divers
Safety Flexibility Imitation learning Model-free Model-based
![Page 22: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/22.jpg)
Safety Flexibility Imitation learning Model-free Model-based
![Page 23: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/23.jpg)
How to do reinforcement learningwithout destroying the robot during training using only onboard images
unknown environment
Goal
Safety Flexibility Imitation learning Model-free Model-based
![Page 24: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/24.jpg)
unknown environment
learn a collision prediction model
command velocities
raw image
neural network
Approach
Safety Flexibility Imitation learning Model-free Model-based
![Page 25: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/25.jpg)
Collision prediction model
Safety Flexibility Imitation learning Model-free Model-based
![Page 26: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/26.jpg)
Train uncertainty-aware collision prediction model
Gather trajectories using MPC controller
Data
Deep neural network with uncertainty estimates from bootstrapping and dropout
Encourage safe, low-speed collisions by reasoning about
the model’s uncertainty
Robot increases speed as model becomes more
confident
May experience collisions
Form speed-dependent, uncertainty-awarecollision cost .
Model-based RL using collision prediction model
Safety Flexibility Imitation learning Model-free Model-based
![Page 27: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/27.jpg)
high speed predict collision large uncertainty
large cost
Collision cost
Safety Flexibility Imitation learning Model-free Model-based
![Page 28: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/28.jpg)
Bootstrapping
Data
D1 D2 D3
Resample with replacement
TrainTrainTrain
M1 M3M2
Training time Test time
Input
M1 M2 M3
Estimating neural network output uncertainty
Safety Flexibility Imitation learning Model-free Model-based
![Page 29: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/29.jpg)
Dropout
Data
Model ModelModel Model ModelModel
Input
Training time Test time
Estimating neural network output uncertainty
Safety Flexibility Imitation learning Model-free Model-based
![Page 30: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/30.jpg)
Not accounting for uncertainty(higher-speed collisions)
Preliminary real-world experiments
Safety Flexibility Imitation learning Model-free Model-based
![Page 31: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/31.jpg)
![Page 32: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/32.jpg)
accounting for uncertainty(lower-speed collisions)
Preliminary real-world experiments
Safety Flexibility Imitation learning Model-free Model-based
![Page 33: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/33.jpg)
![Page 34: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/34.jpg)
successful flight past obstacle
Preliminary real-world experiments
Safety Flexibility Imitation learning Model-free Model-based
![Page 35: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/35.jpg)
![Page 36: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/36.jpg)
• Tradeoff between safety and exploration
• Safety guarantees require expert oversight or known environment + dynamics
• Uncertainty can play a key role
Safety takeaways
Safety Flexibility Imitation learning Model-free Model-based
![Page 37: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/37.jpg)
Safety Flexibility Imitation learning Model-free Model-based
![Page 38: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/38.jpg)
Goal
User-specified command
Safety Flexibility Imitation learning Model-free Model-based
![Page 39: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/39.jpg)
Approach
Option A: Input command Option B: Branch using command
+ empirically better- only works for discrete commands
Safety Flexibility Imitation learning Model-free Model-based
![Page 40: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/40.jpg)
Important details
• Data augmentation• Contrast
• Brightness
• Tone
• Gaussian blur
• Salt-and-pepper noise
• Region dropout
• Adding noise to expert
Approach
Safety Flexibility Imitation learning Model-free Model-based
![Page 41: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/41.jpg)
![Page 42: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/42.jpg)
![Page 43: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/43.jpg)
![Page 44: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/44.jpg)
[slides adapted from Tuomas Haarnoja]
Safety Flexibility Imitation learning Model-free Model-based
![Page 45: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/45.jpg)
Avoidance skill
Reaching skill
Task 1: Reach
Task 2: Avoid
Reaching while avoiding skill
Space of trajectories
Goal
Safety Flexibility Imitation learning Model-free Model-based
![Page 46: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/46.jpg)
Task 1+2: Reach and avoid
Task 1: Reach Task 2: Avoid
Reusability!Related to divergence between and
Avoidance skill
Reaching skillReaching while avoiding skill
Space of trajectories
Policy Composition
Safety Flexibility Imitation learning Model-free Model-based
![Page 47: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/47.jpg)
Task 1 Task 2 Task 1 + 2
![Page 48: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/48.jpg)
Avoidance policyStacking policy
![Page 49: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/49.jpg)
Avoidance policyStacking policy Combined policy
![Page 50: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/50.jpg)
Safety Flexibility Imitation learning Model-free Model-based
![Page 51: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/51.jpg)
Standard Reinforcement Learning
Data
Policy
Trai
nTe
st
Data
Policy
Data
Policy
Data inefficientExpert in the loop
Inflexible
![Page 52: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/52.jpg)
CAPs Approach
CAPs
DataTrai
nTe
st
Event CuesDetector
Data efficientDetector in the loop
Flexible
![Page 53: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/53.jpg)
Detect Predict Control
Safety Flexibility Imitation learning Model-free Model-based
![Page 54: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/54.jpg)
Detect Predict Control
Event Cues
Detector
Safety Flexibility Imitation learning Model-free Model-based
![Page 55: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/55.jpg)
Detect Predict Control
Safety Flexibility Imitation learning Model-free Model-based
![Page 56: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/56.jpg)
Detect Predict Control
Safety Flexibility Imitation learning Model-free Model-based
![Page 57: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/57.jpg)
Detect Predict Control
Safety Flexibility Imitation learning Model-free Model-based
![Page 58: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/58.jpg)
8x 8x
8x8x
8x 8x
Safety Flexibility Imitation learning Model-free Model-based
![Page 59: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/59.jpg)
8x
Safety Flexibility Imitation learning Model-free Model-based
![Page 60: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/60.jpg)
Drive in right lane
6x 6x
Drive in either lane
Drive at 7m/sAvoid collisions
Safety Flexibility Imitation learning Model-free Model-based
![Page 61: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/61.jpg)
6x
CAPs
![Page 62: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/62.jpg)
Safety Flexibility Imitation learning Model-free Model-based
![Page 63: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/63.jpg)
Safety Flexibility Imitation learning Model-free Model-based
![Page 64: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/64.jpg)
CAPs DQL
Collision Avoidance
Safety Flexibility Imitation learning Model-free Model-based
![Page 65: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/65.jpg)
Heading
Avoid collisionsFollow goal headingMove towards doors
![Page 66: Class notes - BAIR](https://reader035.vdocuments.us/reader035/viewer/2022062405/62aba59c665c582a1a789a1a/html5/thumbnails/66.jpg)
• Carefully construct how your policy / model deals with goals
• Model-free methods require extra care to reuse
• Model-based methods are flexible by construction
Flexibility takeaways
Safety Flexibility Imitation learning Model-free Model-based