Exploration (Part 2)
CS 285
Instructor: Sergey Levine, UC Berkeley
Recap: what’s the problem?
this is easy (mostly) vs. this is impossible
Why?
Unsupervised learning of diverse behaviors
What if we want to recover diverse behavior without any reward function at all?
Why?
➢ Learn skills without supervision, then use them to accomplish goals
➢ Learn sub-skills to use with hierarchical reinforcement learning
➢ Explore the space of possible behaviors
An Example Scenario
training time: unsupervised
How can you prepare for an unknown future goal?
In this lecture…
➢ Definitions & concepts from information theory
➢ Learning without a reward function by reaching goals
➢ A state distribution-matching formulation of reinforcement learning
➢ Is coverage of valid states a good exploration objective?
➢ Beyond state covering: covering the space of skills
In this lecture… (now: definitions & concepts from information theory)
Some useful identities
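The identities themselves are in the slide images; as a sketch, these are the standard definitions the rest of the lecture relies on (entropy, KL divergence, and mutual information):

```latex
\begin{align*}
% Entropy: how broad (high-coverage) is the distribution p(x)?
\mathcal{H}(p(x)) &= -\mathbb{E}_{x \sim p(x)}\left[\log p(x)\right] \\
% KL divergence: how different are two distributions p and q?
D_{\mathrm{KL}}(p \,\|\, q) &= \mathbb{E}_{x \sim p(x)}\left[\log \frac{p(x)}{q(x)}\right] \\
% Mutual information: how much does observing y tell us about x?
\mathcal{I}(x; y) &= D_{\mathrm{KL}}\left(p(x, y) \,\|\, p(x)\,p(y)\right) \\
                  &= \mathcal{H}(p(y)) - \mathcal{H}(p(y \mid x))
\end{align*}
```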
Information theoretic quantities in RL
the entropy of the state marginal, H(p(s)), quantifies coverage
the mutual information between actions and future states (“empowerment”) can be viewed as quantifying “control authority” in an information-theoretic way
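The exact formulas on this slide are images; as a sketch, the two quantities in question (notation assumed):

```latex
\begin{align*}
% State marginal entropy: large when the policy covers many states
\mathcal{H}(p_\pi(\mathbf{s})) &= -\mathbb{E}_{\mathbf{s} \sim p_\pi(\mathbf{s})}\left[\log p_\pi(\mathbf{s})\right] \\
% Empowerment: how much the action determines the next state
\mathcal{I}(\mathbf{s}_{t+1}; \mathbf{a}_t) &= \mathcal{H}(\mathbf{s}_{t+1}) - \mathcal{H}(\mathbf{s}_{t+1} \mid \mathbf{a}_t)
\end{align*}
```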
In this lecture… (now: learning without a reward function by reaching goals)
Learn without any rewards at all
(but there are many other choices)
Nair*, Pong*, Bahl, Dalal, Lin, Levine. Visual Reinforcement Learning with Imagined Goals. 2018.
Dalal*, Pong*, Lin*, Nair, Bahl, Levine. Skew-Fit: State-Covering Self-Supervised Reinforcement Learning. 2019.
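The algorithm on these slides is shown as an image; below is a minimal sketch of the loop the lecture describes, assuming RIG-style components (the `vae`, `policy`, and `env` interfaces are hypothetical stand-ins, and the latent-distance reward is one choice among the many mentioned above):

```python
import numpy as np

def unsupervised_goal_reaching(env, vae, policy, n_iterations, episode_len):
    """Sketch of RIG-style unsupervised RL: no external reward is used."""
    replay_buffer = []
    for _ in range(n_iterations):
        # 1. Imagine a goal: sample a latent from the VAE prior, decode it.
        z_goal = np.random.randn(vae.latent_dim)
        x_goal = vae.decode(z_goal)

        # 2. Attempt to reach the imagined goal with a goal-conditioned policy.
        x = env.reset()
        for _ in range(episode_len):
            a = policy.act(x, x_goal)
            x_next = env.step(a)  # simplified: returns the next observation
            # Self-supervised reward: negative distance to the goal in
            # the VAE's latent space (one choice among many).
            r = -np.linalg.norm(vae.encode(x_next) - z_goal)
            replay_buffer.append((x, a, r, x_next, x_goal))
            x = x_next

        # 3. Update the policy with any off-policy RL algorithm.
        policy.update(replay_buffer)

        # 4. Update the generative model on all states seen so far.
        vae.train([transition[0] for transition in replay_buffer])
```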
How do we get diverse goals?
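The derivation on the following slides is image-based; the key idea of Skew-Fit, as a sketch: instead of sampling goals from the current density model, reweight states by a negative power of their density so that rare states become more likely goals, which (under the paper's assumptions) raises goal entropy each iteration:

```latex
\begin{align*}
% Skew sampling toward rare states; alpha in [-1, 0)
p_{\text{skewed}}(\mathbf{x}) &\propto p_\theta(\mathbf{x})^{\alpha}, \qquad \alpha \in [-1, 0) \\
% Fitting the next density model to skewed samples grows entropy:
\mathcal{H}\!\left(p_\theta^{k+1}(\mathbf{x})\right) &\geq \mathcal{H}\!\left(p_\theta^{k}(\mathbf{x})\right)
\end{align*}
```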
goals get higher entropy due to Skew-Fit
[figure: pairs of goal images and the final states reached]
Reinforcement learning with imagined goals
[figure: an imagined goal proposed by the model, followed by the RL episode that attempts to reach it]
In this lecture… (now: a state distribution-matching formulation of reinforcement learning)
Aside: exploration with intrinsic motivation
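The recap equations here are images; roughly, intrinsic motivation augments the reward with a novelty bonus computed from a density model of visited states (notation assumed):

```latex
\begin{equation*}
% Bonus is large where the fitted state density p_pi(s) is small,
% e.g. B(p) = -log p
\tilde{r}(\mathbf{s}) = r(\mathbf{s}) + \mathcal{B}\big(p_\pi(\mathbf{s})\big)
\end{equation*}
```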
Can we use this for state marginal matching?
Lee*, Eysenbach*, Parisotto*, Xing, Levine, Salakhutdinov. Efficient Exploration via State Marginal Matching.
See also: Hazan, Kakade, Singh, Van Soest. Provably Efficient Maximum Entropy Exploration.
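The derivation is in the slide images; as a sketch, the state marginal matching objective the lecture builds, for a given target distribution p*(s):

```latex
\begin{align*}
\min_\pi \; D_{\mathrm{KL}}\big(p_\pi(\mathbf{s}) \,\|\, p^{\star}(\mathbf{s})\big)
  &= \max_\pi \; \mathbb{E}_{\mathbf{s} \sim p_\pi(\mathbf{s})}
     \big[\log p^{\star}(\mathbf{s}) - \log p_\pi(\mathbf{s})\big] \\
  &= \max_\pi \; \mathbb{E}_{\mathbf{s} \sim p_\pi(\mathbf{s})}\big[\tilde{r}(\mathbf{s})\big],
  \qquad \tilde{r}(\mathbf{s}) = \log p^{\star}(\mathbf{s}) - \log p_\pi(\mathbf{s})
\end{align*}
```

Note that the reward depends on p_π itself, so this is not an ordinary RL problem; the SMM paper treats it as a game and returns a mixture over the policy iterates rather than the final policy.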
State marginal matching for exploration
[figure: maze coverage comparison, MaxEnt on actions vs. variants of SMM; SMM gives much better coverage!]
In this lecture… (now: is coverage of valid states a good exploration objective?)
Is state entropy really a good objective?
more or less the same thing: maximizing the entropy of imagined goals (Skew-Fit) and maximizing the state marginal entropy (state marginal matching against a uniform target) optimize essentially the same quantity, and the cited analysis suggests that a maximum-state-entropy policy is the best preparation, in a minimax sense, for a future goal that is not known in advance
Gupta, Eysenbach, Finn, Levine. Unsupervised Meta-Learning for Reinforcement Learning.
In this lecture… (now: beyond state covering: covering the space of skills)
Learning diverse skills
Reaching diverse goals is not the same as performing diverse tasks: not all behaviors can be captured by goal-reaching.
Idea: condition the policy on a task index z and learn a distinct skill for each z.
Intuition: different skills should visit different state-space regions.
Eysenbach, Gupta, Ibarz, Levine. Diversity is All You Need.
Diversity-promoting reward function
[diagram: a skill z is sampled and given to the policy (agent), which takes actions in the environment; a discriminator D observes the resulting states and tries to predict the skill z]
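As a sketch of the diversity-promoting reward implied by the diagram (a simplified rendering of the DIAYN idea; the `discriminator.log_prob` interface is assumed):

```python
def diversity_reward(discriminator, state, skill_z, log_p_z):
    """Reward states from which the skill is easy to recognize:
    r(s, z) = log q(z | s) - log p(z).

    `discriminator.log_prob(z, s)` is an assumed interface returning
    the discriminator's log q(z | s)."""
    log_q_z_given_s = discriminator.log_prob(skill_z, state)
    # Subtracting log p(z) makes the reward positive exactly when the
    # discriminator beats chance at guessing the skill from the state.
    return log_q_z_given_s - log_p_z
```

Training alternates: the policy maximizes this reward, while the discriminator is trained to predict z from the states each skill visits.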
Examples of learned tasks
[videos: skills learned on Cheetah, Ant, and Mountain car]
A connection to mutual information
See also: Gregor et al. Variational Intrinsic Control. 2016
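The derivation on the slide is an image; the connection, as a sketch: the diversity objective can be read as maximizing the mutual information between the skill and the states it visits:

```latex
% The diversity objective maximizes the mutual information between
% the skill z and the states s the policy visits:
\begin{equation*}
\mathcal{I}(\mathbf{z}; \mathbf{s}) = \mathcal{H}(\mathbf{z}) - \mathcal{H}(\mathbf{z} \mid \mathbf{s})
\end{equation*}
% H(z) is kept large by sampling skills from a fixed (e.g. uniform) prior;
% H(z|s) is made small by rewarding states where the discriminator
% q(z|s) can recover the skill.
```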