Hierarchical Reinforcement Learning (Part II)
TRANSCRIPT
![Page 1: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/1.jpg)
Hierarchical Reinforcement Learning (Part II)
Mayank Mittal
![Page 2: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/2.jpg)
What are humans good at?
![Page 3: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/3.jpg)
Let’s go and have lunch!
![Page 4: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/4.jpg)
Let’s go and have lunch!
1. Exit ETZ building
2. Cross the street
3. Eat at mensa
![Page 5: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/5.jpg)
Let’s go and have lunch!
1. Exit ETZ building
➔ Open door ➔ Walk to the lift ➔ Press button ➔ Wait for lift ➔ …
2. Cross the street
➔ Find shortest route ➔ Walk safely ➔ Follow traffic rules ➔ …
3. Eat at mensa
➔ Open door ➔ Wait in a queue ➔ Take food ➔ …
![Page 6: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/6.jpg)
What are humans good at?
Temporal abstraction
![Page 7: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/7.jpg)
Let’s go and have lunch!
1. Exit ETZ building
➔ Open door ➔ Walk to the lift ➔ Press button ➔ Wait for lift ➔ …
2. Cross the street
➔ Find shortest route ➔ Walk safely ➔ Follow traffic rules ➔ …
3. Eat at mensa
➔ Open door ➔ Wait in a queue ➔ Take food ➔ …
![Page 8: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/8.jpg)
What are humans good at?
Transfer/Reusability of Skills
Temporal abstraction
![Page 9: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/9.jpg)
Let’s go and have lunch!
1. Exit ETZ building
➔ Open door ➔ Walk to the lift ➔ Press button ➔ Wait for lift ➔ …
2. Cross the street
➔ Find shortest route ➔ Walk safely ➔ Follow traffic rules ➔ …
3. Eat at mensa
➔ Open door ➔ Wait in a queue ➔ Take food ➔ …
How to represent these different goals?
![Page 10: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/10.jpg)
What are humans good at?
Powerful/meaningful state abstraction
Transfer/Reusability of Skills
Temporal abstraction
![Page 11: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/11.jpg)
What are humans good at?
Can a learning-based agent do the same?
Powerful/meaningful state abstraction
Transfer/Reusability of Skills
Temporal abstraction
![Page 12: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/12.jpg)
Promise of Hierarchical RL
Structured exploration
Transfer learning
Long-term credit assignment (and memory)
![Page 13: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/13.jpg)
Hierarchical RL
Environment
Agent
Manager
Worker(s)
![Page 14: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/14.jpg)
Hierarchical RL
FeUdal Networks for Hierarchical Reinforcement Learning (ICML 2017)
Meta-Learning Shared Hierarchies (ICLR 2018)
Data-Efficient Hierarchical Reinforcement Learning (NeurIPS 2018)
![Page 15: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/15.jpg)
Hierarchical RL
FeUdal Networks for Hierarchical Reinforcement Learning (ICML 2017)
Meta-Learning Shared Hierarchies (ICLR 2018)
Data-Efficient Hierarchical Reinforcement Learning (NeurIPS 2018)
![Page 16: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/16.jpg)
FeUdal Networks (FUN)
![Page 17: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/17.jpg)
FeUdal Networks (FUN)
Level of Abstraction
Temporal Resolution
Dayan, Peter and Geoffrey E. Hinton. “Feudal Reinforcement Learning.” NIPS (1992).
![Page 18: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/18.jpg)
FeUdal Networks (FUN)
![Page 19: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/19.jpg)
Detour: Dilated RNN
▪ Able to preserve memories over longer periods
For more details: Chang, S. et al. (2017). Dilated Recurrent Neural Networks, NIPS.
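As a rough illustration (not the paper's implementation), a dilated RNN with dilation radius r keeps r separate recurrent cores and updates only core t mod r at step t, so each core effectively sees a horizon r times longer; `cell` below is a toy leaky integrator standing in for an LSTM core:

```python
import numpy as np

def cell(h, x):
    # Toy recurrent core (leaky integrator) standing in for an LSTM.
    return 0.9 * h + 0.1 * x

def dilated_rnn_step(states, x, t, r):
    """Update only core (t % r); pool all cores for the output."""
    states[t % r] = cell(states[t % r], x)
    return states, states.mean(axis=0)

states = np.zeros((4, 3))                   # r = 4 cores, hidden size 3
for t in range(8):
    states, out = dilated_rnn_step(states, np.full(3, float(t)), t, r=4)
# Core 0 was updated only at t = 0 and t = 4; between its updates it
# simply preserves its state, which is what stretches the memory horizon.
```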
![Page 20: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/20.jpg)
FeUdal Networks (FUN)
Manager
Worker
Agent
![Page 21: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/21.jpg)
FeUdal Networks (FUN)
![Page 22: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/22.jpg)
FeUdal Networks (FUN)
Manager
![Page 23: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/23.jpg)
FeUdal Networks (FUN)
Manager
![Page 24: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/24.jpg)
FeUdal Networks (FUN)
Worker
Manager
![Page 25: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/25.jpg)
FeUdal Networks (FUN)
Absolute Goal
(-3, 1)
(3, 9)
c : Manager’s Horizon
![Page 26: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/26.jpg)
FeUdal Networks (FUN)
Directional Goal
![Page 27: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/27.jpg)
FeUdal Networks (FUN)
Directional Goal
Idea: A single sub-goal (direction) can be reused in many different locations in state space
![Page 28: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/28.jpg)
FeUdal Networks (FUN)
▪ Intrinsic reward
![Page 29: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/29.jpg)
FeUdal Networks (FUN)
▪ Intrinsic reward
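In the paper, the Worker's intrinsic reward is the average cosine similarity between recent state displacements and the Manager's past goal directions, r^I_t = (1/c) Σ_{i=1..c} d_cos(s_t − s_{t−i}, g_{t−i}). A minimal NumPy sketch (the `states`/`goals` trajectories here are hypothetical):

```python
import numpy as np

def d_cos(u, v, eps=1e-8):
    # Cosine similarity between a state displacement and a goal direction.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def intrinsic_reward(states, goals, t, c):
    """r^I_t = (1/c) * sum_{i=1..c} d_cos(s_t - s_{t-i}, g_{t-i})."""
    return sum(d_cos(states[t] - states[t - i], goals[t - i])
               for i in range(1, c + 1)) / c

# A worker that moved exactly along the goal direction earns reward ~1.
states = [np.array([float(k), 0.0]) for k in range(5)]
goals = [np.array([1.0, 0.0])] * 5
r = intrinsic_reward(states, goals, t=4, c=3)
```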
![Page 30: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/30.jpg)
FeUdal Networks (FUN)
Worker
Manager
![Page 31: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/31.jpg)
FeUdal Networks (FUN)
![Page 32: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/32.jpg)
FeUdal Networks (FUN)
▪ Action: Stochastic Policy!
![Page 33: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/33.jpg)
FeUdal Networks (FUN)
Manager
Worker
Agent
![Page 34: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/34.jpg)
FeUdal Networks (FUN)
Manager
Worker
Agent
Why not do end-to-end learning?
![Page 35: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/35.jpg)
FeUdal Networks (FUN)
Manager
Worker
Agent
Manager & Worker: Separate Actor-Critic
No gradient
Transition Policy
Gradient
Policy Gradient
![Page 36: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/36.jpg)
FeUdal Networks (FUN)
Qualitative Analysis
![Page 37: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/37.jpg)
FeUdal Networks (FUN)
Ablative Analysis
![Page 38: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/38.jpg)
FeUdal Networks (FUN)
Ablative Analysis
![Page 39: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/39.jpg)
FeUdal Networks (FUN)
Comparison
![Page 40: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/40.jpg)
FeUdal Networks (FUN)
Action Repeat Transfer
![Page 41: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/41.jpg)
Experiences
FeUdal Networks (FUN)
On-Policy Learning
Learning
Wastage!
![Page 42: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/42.jpg)
Experiences
Can we do better?
Off-Policy Learning
Learning
Replay Buffer
Reusage!
![Page 43: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/43.jpg)
Can we do better?
Off-Policy Learning
Unstable Learning
![Page 44: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/44.jpg)
Can we do better?
Off-Policy Learning
Unstable Learning To-Be-Disclosed
![Page 45: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/45.jpg)
Hierarchical RL
FeUdal Networks for Hierarchical Reinforcement Learning (ICML 2017)
Meta-Learning Shared Hierarchies (ICLR 2018)
Data-Efficient Hierarchical Reinforcement Learning (NeurIPS 2018)
![Page 46: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/46.jpg)
Manager
Worker
Data-Efficient HRL (HIRO)
![Page 47: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/47.jpg)
Input Goal Action
Raw Observation Space
Data-Efficient HRL (HIRO)
![Page 48: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/48.jpg)
Manager
Worker
Rollout sequence
Data-Efficient HRL (HIRO)
![Page 49: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/49.jpg)
Manager
Worker
Rollout sequence
Data-Efficient HRL (HIRO)
![Page 50: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/50.jpg)
Manager
Worker
Rollout sequence
Data-Efficient HRL (HIRO)
![Page 51: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/51.jpg)
Data-Efficient HRL (HIRO)
c : Manager’s Horizon
![Page 52: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/52.jpg)
Data-Efficient HRL (HIRO)
![Page 53: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/53.jpg)
▪ Intrinsic reward
Data-Efficient HRL (HIRO)
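In HIRO the intrinsic reward is parameterized directly in the raw observation space, r = −‖s_t + g_t − s_{t+1}‖₂, and the goal is shifted between Manager decisions so that the absolute target stays fixed. A small sketch of both pieces:

```python
import numpy as np

def intrinsic_reward(s_t, g_t, s_next):
    # Negative distance between the target state (s_t + g_t) and the
    # state the worker actually reached.
    return -float(np.linalg.norm(s_t + g_t - s_next))

def goal_transition(s_t, g_t, s_next):
    # g_{t+1} = s_t + g_t - s_{t+1}: same absolute target, new origin.
    return s_t + g_t - s_next

s_t = np.array([0.0, 0.0])
g_t = np.array([1.0, 0.0])
s_next = np.array([1.0, 0.0])   # worker reached the target exactly
r = intrinsic_reward(s_t, g_t, s_next)
g_next = goal_transition(s_t, g_t, s_next)
```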
![Page 54: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/54.jpg)
Data-Efficient HRL (HIRO)
Environment
Manager
Worker(s)
Agent
Replay Buffer
Replay Buffer
![Page 55: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/55.jpg)
Data-Efficient HRL (HIRO)
Environment
Manager
Worker(s)
Agent
Replay Buffer
Replay Buffer
![Page 56: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/56.jpg)
Data-Efficient HRL (HIRO)
Environment
Manager
Worker(s)
Agent
Replay Buffer
Replay Buffer
![Page 57: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/57.jpg)
Data-Efficient HRL (HIRO)
Environment
Manager
Worker(s)
Agent
Replay Buffer
Replay Buffer
![Page 58: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/58.jpg)
Can we do better?
Off-Policy Learning
Unstable Learning To-Be-Disclosed
![Page 59: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/59.jpg)
Can we do better?
Off-Policy Learning
Unstable Learning: the Manager’s past experience might become useless
![Page 60: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/60.jpg)
Can we do better?
Off-Policy Learning
t = 12 yrs
Goal: “wear a shirt”
![Page 61: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/61.jpg)
Can we do better?
Off-Policy Learning
Same goal induces different behavior
t = 22 yrs
Goal: “wear a shirt”
![Page 62: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/62.jpg)
Can we do better?
Off-Policy Learning
Goal relabelling required!
t = 22 yrs
Goal: “wear a dress”
Goal: “wear a shirt”
![Page 63: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/63.jpg)
Data-Efficient HRL (HIRO)
Off-Policy Correction for Manager
where
![Page 64: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/64.jpg)
Data-Efficient HRL (HIRO)
Off-Policy Correction for Manager
where
...
![Page 65: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/65.jpg)
Data-Efficient HRL (HIRO)
Environment
Manager
Worker(s)
Agent
Replay Buffer
Replay Buffer
![Page 66: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/66.jpg)
Data-Efficient HRL (HIRO)
Ant Push
![Page 67: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/67.jpg)
Data-Efficient HRL (HIRO)
Qualitative Analysis
![Page 68: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/68.jpg)
Data-Efficient HRL (HIRO)
Ablative Analysis
Axes: Performance vs. Experience Samples (in millions)
![Page 69: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/69.jpg)
Data-Efficient HRL (HIRO)
Comparison
![Page 70: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/70.jpg)
Data-Efficient HRL (HIRO)
Comparison
Axes: Performance vs. Experience Samples (in millions)
![Page 71: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/71.jpg)
Can we do better?
What is missing?
Structured exploration
![Page 72: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/72.jpg)
Hierarchical RL
FeUdal Networks for Hierarchical Reinforcement Learning (ICML 2017)
Meta-Learning Shared Hierarchies (ICLR 2018)
Data-Efficient Hierarchical Reinforcement Learning (NeurIPS 2018)
![Page 73: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/73.jpg)
Meta-Learning Shared Hierarchies (MLSH)
![Page 74: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/74.jpg)
Taken every N steps
Meta-Learning Shared Hierarchies (MLSH)
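Concretely, the master policy acts on a slower timescale: it picks a sub-policy index every N steps, and the selected sub-policy emits primitive actions in between. A toy rollout sketch (all of `env_step`, `master`, and the sub-policies here are hypothetical stand-ins):

```python
def mlsh_rollout(env_step, s0, master, subpolicies, T, N):
    """Master picks a sub-policy every N steps; that sub-policy then
    produces primitive actions until the next master decision."""
    s, traj, k = s0, [], None
    for t in range(T):
        if t % N == 0:
            k = master(s)          # master action: index of a sub-policy
        a = subpolicies[k](s)      # primitive action from the chosen skill
        s, r = env_step(s, a)
        traj.append((k, a, r))
    return traj

# Toy 1-D chain: sub-policy 0 moves right, sub-policy 1 moves left.
traj = mlsh_rollout(
    env_step=lambda s, a: (s + a, 0.0),
    s0=0,
    master=lambda s: 0 if s < 5 else 1,
    subpolicies=[lambda s: +1, lambda s: -1],
    T=6, N=2,
)
```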
![Page 75: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/75.jpg)
Computer Vision practice:
▪ Train on ImageNet
▪ Fine-tune on actual task
Slide Credits: Pieter Abbeel, Meta-Learning Symposium (NIPS 2017)
Meta-Learning Shared Hierarchies (MLSH)
![Page 76: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/76.jpg)
Computer Vision practice:
▪ Train on ImageNet
▪ Fine-tune on actual task
How to generalize this to behavior learning?
Slide Credits: Pieter Abbeel, Meta-Learning Symposium (NIPS 2017)
Meta-Learning Shared Hierarchies (MLSH)
![Page 77: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/77.jpg)
Environment A
Environment B
…
Meta-RL Algorithm
“Fast” RL
Agent
Image Credits: Pieter Abbeel, Meta-Learning Symposium (NIPS 2017)
Meta-Learning Shared Hierarchies (MLSH)
![Page 78: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/78.jpg)
Environment A
Environment B
…
Meta-RL Algorithm
“Fast” RL
Agent
Environment F
Testing environments
Image Credits: Pieter Abbeel, Meta-Learning Symposium (NIPS 2017)
Meta-Learning Shared Hierarchies (MLSH)
![Page 79: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/79.jpg)
GOAL: Find sub-policies that enable fast learning of master policy
Meta-Learning Shared Hierarchies (MLSH)
![Page 80: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/80.jpg)
GOAL: Find sub-policies that enable fast learning of master policy
Meta-Learning Shared Hierarchies (MLSH)
![Page 81: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/81.jpg)
Meta-Learning Shared Hierarchies (MLSH)
![Page 82: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/82.jpg)
Meta-Learning Shared Hierarchies (MLSH)
![Page 83: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/83.jpg)
Meta-Learning Shared Hierarchies (MLSH)
![Page 84: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/84.jpg)
Ant Two-walks
Meta-Learning Shared Hierarchies (MLSH)
![Page 85: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/85.jpg)
Ant Obstacle Course
Meta-Learning Shared Hierarchies (MLSH)
![Page 86: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/86.jpg)
Movement Bandits
Meta-Learning Shared Hierarchies (MLSH)
![Page 88: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/88.jpg)
Ablative Analysis
Meta-Learning Shared Hierarchies (MLSH)
![Page 89: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/89.jpg)
Ablative Analysis
Meta-Learning Shared Hierarchies (MLSH)
![Page 90: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/90.jpg)
Four Rooms
Meta-Learning Shared Hierarchies (MLSH)
![Page 91: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/91.jpg)
Comparison
Meta-Learning Shared Hierarchies (MLSH)
![Page 92: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/92.jpg)
Summary
FUN
● Directional goals
● Dilated RNN
● Transition Policy Gradient
MLSH
● Generalization in RL algorithm
● Inspired by the “Options” framework
HIRO
● Absolute goals in observation space
● Data-efficient
● Off-policy label correction
![Page 93: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/93.jpg)
Discussion
▪ How to decide the temporal resolution (i.e., c and N)?
▪ Do we need a discrete number of sub-policies?
▪ Future prospects of HRL? More levels of hierarchy?
![Page 94: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/94.jpg)
Thank you for your attention!
![Page 95: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/95.jpg)
Any Questions?
![Page 96: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/96.jpg)
Let’s go and have lunch!
![Page 97: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/97.jpg)
References
▪ Vezhnevets, A.S., Osindero, S., Schaul, T., Heess, N., Jaderberg, M., Silver, D., & Kavukcuoglu, K. (2017). FeUdal Networks for Hierarchical Reinforcement Learning. ICML.
▪ Nachum, O., Gu, S., Lee, H., & Levine, S. (2018). Data-Efficient Hierarchical Reinforcement Learning. NeurIPS.
▪ Frans, K., Ho, J., Chen, X., Abbeel, P., & Schulman, J. (2018). Meta-Learning Shared Hierarchies. ICLR.
![Page 98: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/98.jpg)
Appendix
![Page 99: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/99.jpg)
Hierarchical RL
Environment
Manager
Worker(s)
Agent
![Page 100: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/100.jpg)
Hierarchical RL
Image Credits: Levy, A. et al. (2019). Learning Multi-Level Hierarchies with Hindsight. ICLR.
![Page 101: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/101.jpg)
Detour: A2C
Image Credits: Sergey Levine (2018), CS 294-112 (Lecture 6)
![Page 102: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/102.jpg)
Advantage Function:
Update Rule:
FeUdal Networks (FUN)
Worker
Policy Gradient
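The Worker's advantage function and update rule on this slide are, in the paper's notation, a standard advantage actor-critic update on the environment reward plus the α-weighted intrinsic reward (reconstructed here since the slide's formula images did not survive extraction):

```latex
\nabla_{\theta} \pi_t
  = A^{D}_t \,\nabla_{\theta} \log \pi(a_t \mid x_t; \theta),
\qquad
A^{D}_t = R_t + \alpha R^{I}_t - V^{D}_t(x_t).
```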
![Page 103: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/103.jpg)
FeUdal Networks (FUN)
Manager
Advantage Function:
Update Rule:
Transition Policy Gradient
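The Manager's update on this slide is the transition policy gradient from the paper (reconstructed here since the slide's formula images did not survive extraction): the goal direction is moved toward state transitions that carried a positive Manager advantage,

```latex
\nabla_{\theta} g_t
  = A^{M}_t \,\nabla_{\theta}\, d_{\cos}\!\left(s_{t+c} - s_t,\; g_t(\theta)\right),
\qquad
A^{M}_t = R_t - V^{M}_t(x_t),
```

where $d_{\cos}(\alpha, \beta) = \alpha^{\top}\beta / (\lVert\alpha\rVert\,\lVert\beta\rVert)$ is cosine similarity and c is the Manager's horizon.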
![Page 104: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/104.jpg)
FeUdal Networks (FUN)
Transition Policy Gradient
Assumptions:
● Worker will eventually learn to follow the goal directions
● Directions in state space follow a von Mises-Fisher distribution
![Page 105: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/105.jpg)
FeUdal Networks (FUN)
Learnt sub-goals by Manager
![Page 106: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/106.jpg)
FeUdal Networks (FUN)
Memory Task: Non-Match
![Page 107: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/107.jpg)
FeUdal Networks (FUN)
Memory Task: T-Maze
![Page 108: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/108.jpg)
FeUdal Networks (FUN)
Memory Task: Water-Maze
![Page 109: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/109.jpg)
FeUdal Networks (FUN)
Comparison
![Page 110: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/110.jpg)
FeUdal Networks (FUN)
Comparison
![Page 111: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/111.jpg)
Network Structure: TD3
Data-Efficient HRL (HIRO)
Dimension of raw observation space
Dimension of Action Space
Manager
Actor-Critic with 2-layer MLP each
Worker
Actor-Critic with 2-layer MLP each
For more details: Fujimoto, S., et al. (2018). Addressing Function Approximation Error in Actor-Critic Methods. ICML.
![Page 112: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/112.jpg)
Data-Efficient HRL (HIRO)
Off-Policy Correction for Manager
where
...
![Page 113: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/113.jpg)
Data-Efficient HRL (HIRO)
Off-Policy Correction for Manager
Approximately solved by generating candidate goals
![Page 114: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/114.jpg)
Data-Efficient HRL (HIRO)
Off-Policy Correction for Manager
Approximately solved by generating candidate goals :
● Original goal given:
● Absolute goal based on transition observed:
● Randomly sampled candidates:
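Putting the correction together: among the candidate goals above, the Manager's transition is relabelled with the candidate that maximizes the log-probability of the Worker's logged actions under the current Worker policy; for a Gaussian policy this reduces to least squared error. A sketch assuming a deterministic Worker mean `policy_mean(s, g)` (a hypothetical stand-in for the trained Worker):

```python
import numpy as np

def relabel_goal(states, actions, candidates, policy_mean):
    """Pick the candidate goal making the logged worker actions most
    likely under the current worker policy (Gaussian => squared error),
    advancing the goal with the HIRO goal transition between steps."""
    best, best_logp = None, -np.inf
    for g0 in candidates:
        g, logp = np.array(g0, dtype=float), 0.0
        for s, a, s_next in zip(states[:-1], actions, states[1:]):
            logp -= 0.5 * np.sum((a - policy_mean(s, g)) ** 2)
            g = s + g - s_next                  # goal transition
        if logp > best_logp:
            best, best_logp = g0, logp
    return best

# Toy 1-D check: the worker just outputs the goal direction; logged
# actions match the candidate goal [1.0] better than [0.0].
states = [np.array([0.0]), np.array([1.0]), np.array([2.0])]
actions = [np.array([1.0]), np.array([1.0])]
g = relabel_goal(states, actions,
                 [np.array([1.0]), np.array([0.0])],
                 policy_mean=lambda s, g: g)
```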
![Page 115: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/115.jpg)
Training
Data-Efficient HRL (HIRO)
![Page 116: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/116.jpg)
Data-Efficient HRL (HIRO)
Environments
Ant Push Ant Fall
Ant Maze Ant Gather
![Page 117: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/117.jpg)
Network Structure: PPO
Meta-Learning Shared Hierarchies (MLSH)
Number of sub-policies
Dimension of Action Space
Manager
2-layer MLP with 64 hidden units
Each sub-policy
2-layer MLP with 64 hidden units
For more details: Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. CoRR, abs/1707.06347.
![Page 119: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/119.jpg)
Comparison
Meta-Learning Shared Hierarchies (MLSH)
![Page 120: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/120.jpg)
Comparison
Meta-Learning Shared Hierarchies (MLSH)
![Page 121: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/121.jpg)
Comparison
Meta-Learning Shared Hierarchies (MLSH)
![Page 122: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/122.jpg)
▪ Useful when input data is sequential (such as in speech recognition, language modelling)
Recurrent Neural Network
For more details: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
![Page 123: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/123.jpg)
Stochastic NN for HRL (SNN4HRL)
For more details: Florensa, C., et al. (2017). Stochastic Neural Networks for Hierarchical Reinforcement Learning. ICLR.
Aims to learn useful skills during pre-training and then leverage them for learning faster in future tasks
![Page 124: Learning (Part II) Hierarchical Reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022050313/626f8a7e7c01e052e67bf2fa/html5/thumbnails/124.jpg)
Variational Information Maximizing Exploration (VIME)
For more details: Houthooft, R., et al. (2016). VIME: Variational Information Maximizing Exploration. NIPS.
Exploration based on maximizing information gain about the agent’s belief of the environment