csc2547 presentation: curiosity-driven exploration · csc2547 presentation: curiosity-driven...
TRANSCRIPT
![Page 1: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/1.jpg)
CSC2547 Presentation:Curiosity-driven exploration
Count-based VS Info gain-based
Sheng Jia, Tinglin Duan(First year master students)
![Page 2: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/2.jpg)
1. PLAN (2011)
1. VIME (NeurIPS2016)
1. CTS (NeurIPS2016)
![Page 3: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/3.jpg)
OutlineMotivation, Related Works and Demo
Unifying Count-Based Exploration and Intrinsic Motivation
Comparisons and Discussion
Planning to Be Surprised
Variational Information Maximizing Exploration
![Page 4: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/4.jpg)
Outline
Unifying Count-Based Exploration and Intrinsic Motivation
Comparisons and Discussion
Planning to Be Surprised
Variational Information Maximizing Exploration
Motivation, Related Works and Demo
![Page 5: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/5.jpg)
BackgroundRL+Curiosity
Next state
Extrinsic reward
Intrinsic reward/exploration bonus
action
History:
![Page 6: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/6.jpg)
What is exploration?
- Reducing the agent’s uncertainty over the environment’s dynamics.
[VIME]
[Plan]
Intrinsic motivation:
[CTS] Count-based
- Use (pseudo) visitation counts to guide agents to unvisited states.
![Page 7: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/7.jpg)
Why exploration useful? DEMO Our original plot & demo
X-axis S1, s2, s3, … . sT
Y-axis Intrinsic Reward function
timestamp
/Training Timestamp Z-ax
is In
trins
ic R
ewar
d
Sparse Reward ProblemMontezuma’s revenge
DQN DQN + Exploration bonus
![Page 8: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/8.jpg)
Related work (Timeline)
2019 On Bonus Based Exploration Methods In The Arcade Learning Environment
2016 VIME CTS
Pseudocount in 2016 still achieves SOTA for Montezuma’s revenge”
Distillation error as a quantification of uncertainty
2011 PLAN
2018 Exploration by Random Network Distillation
2017 Count-Based Exploration with Neural Density Models
2015 Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models
2010 Formal Theory of Creativity, Fun,and Intrinsic Motivation (1990-2010)
The notion of Intrinsic Motivation
L2 prediction error using neural networks
Pseudocount + Pixel CNN
Bayesian Optimal Exploration
Approximate “PLAN”
Pseudocount exploration
![Page 9: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/9.jpg)
Outline
Unifying Count-Based Exploration and Intrinsic Motivation
Comparisons and Discussion
Motivation, Related Works and Demo
Variational Information Maximizing Exploration
Planning to Be Surprised
![Page 10: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/10.jpg)
[PLAN] contribution
Dynamics model
Bayes update for posterior distribution of the dynamics model
Optimal Bayesian Exploration based on:
Expected cumulative info gain fo tau steps if performing this action
Expected one-step info gain Expected cumulative info gain for tau-1 steps if performing this next action
![Page 11: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/11.jpg)
[PLAN] Quantify “surprise” with info gain
p
𝜃
![Page 12: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/12.jpg)
[PLAN] 1-step expected information gain
NOTE: VIME uses this as the Intrinsic reward!
“1-step expected info gain” “expected immediate info gain”
“Mutual info between next state distribution & model parameter”
![Page 13: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/13.jpg)
[PLAN] “Planning to be surprised”
Curious Q-value
Perform an actionFollow a policy “Planning tau steps” because not actually observed yet
Cumulative steps info gain
![Page 14: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/14.jpg)
[PLAN] Optimal Bayesian Exploration policy[Method1] Computing optimal curiosity-Q backwards for tau steps
[Method2] Policy Iteration
Repeat applyingPolicy evaluation
Policy improvement
![Page 15: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/15.jpg)
[Plan] Non-triviality of curious Q-valueCumulative information gain fluctuates!
Cumulative != Sum
Info gain additive in expectation!
![Page 16: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/16.jpg)
[Plan] Results
RandomGreedy w.r.t expected one-step info gain
Policy iteration (Dynamic programming approximation to optimal bayesian exploration)
Q-learning using one-step info gain
.
.
.50 states
![Page 17: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/17.jpg)
[Plan] Results
![Page 18: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/18.jpg)
Outline
Unifying Count-Based Exploration and Intrinsic Motivation
Comparisons and Discussion
Motivation, Related Works and Demo
Planning to Be Surprised
Variational Information Maximizing Exploration
![Page 19: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/19.jpg)
[VIME] contribution
Dynamics model
Variational inference for posterior distribution of dynamics model
1-step exploration bonus
![Page 20: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/20.jpg)
[VIME] Quantify the information gainedReminder: PLAN cumulative info gain
![Page 21: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/21.jpg)
[VIME] Variational BayesWhat’s hard?
Minimize negative ELBO
Computing posterior for highly parameterized models (e.g. neural networks)
Approximate posterior by minimizing
![Page 22: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/22.jpg)
[VIME] Optimization for variational bayes
How to minimize negative ELBO?
Take an efficient single second-order (Newton) update step to minimize negative ELBO:
![Page 23: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/23.jpg)
[VIME] Estimate 1-step expected info gainWhat’s hard?
Computing the exact one-step expected info-gain. High-dimensional states
→ Monte-carlo estimation.
![Page 24: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/24.jpg)
[VIME] Results (Walker-2D) Average extrinsic return
Dense reward
RL algorithm: TRPO
![Page 25: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/25.jpg)
[VIME] Results (Swimmer-Gather) Average extrinsic return
Sparse reward
RL algorithm: TRPO
![Page 26: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/26.jpg)
Outline
Variational Information Maximizing Exploration
Comparisons and Discussion
Motivation, Related Works and Demo
Planning to Be Surprised
Unifying Count-Based Exploration and Intrinsic Motivation
![Page 27: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/27.jpg)
[CTS] contribution States Density model
Pseudo-count
1-step exploration bonus
![Page 28: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/28.jpg)
[CTS] Count state visitation
Empirical distribution
These two are different states!
But we want to increment visitation counts for both when visiting either one.
Pixel difference
Empirical count
![Page 29: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/29.jpg)
[CTS] Introduce state density model
x=s1 s2 s2X =s1
p
s
p
s
![Page 30: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/30.jpg)
How to update CTS density model?Check the “context tree switching” paper! https://arxiv.org/abs/1111.3182
This was the difficulty of reading this paper as it only shows a bayes rule update for mixture of density models (e.g. CTS).
Remark: For pixel-cnn density model in “Count-based exploration with neural density model”, just backprop.
![Page 31: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/31.jpg)
[CTS] Derive pseudo-count from density model
Two constraints:Linear system
Pseudo-count derived!
Solve linear system
![Page 32: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/32.jpg)
[CTS] Results (Montezuma’s Revenge)
State: 84x84x4# Actions: 18
RL algorithm: Double DQN
![Page 33: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/33.jpg)
Summary, Comparisons and Discussion
Outline
Variational Information Maximizing Exploration
Unifying Count-Based Exploration and Intrinsic Motivation
Motivation, Related Works and Demo
Planning to Be Surprised
![Page 34: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/34.jpg)
Deriving posterior dynamics model/ density model
PLAN CTSVIME
Bayes rule Variational inference Bayes rule
![Page 35: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/35.jpg)
Derive exploratory policy
[VIME] 1-step Information gain
[CTS] Pseudo-count
Policy trained with the reward augmented by intrinsic reward.
[PLAN] Directly argmax(curiosity Q)
![Page 36: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/36.jpg)
Pseudo-count VS Intrinsic MotivationMixture model
“Unifying count-based exploration and intrinsic motivations”!
![Page 37: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/37.jpg)
Limitations & Future Directions→ Intractable posterior & use dynamics model for expectation
Difficult to be scaled outside Tabular RL.
→ Currently maximize sum of 1-step info gain.
→ which density model leads to better generalization over states?
Learning rates of policy network VS Updating dynamic model/density model.
PLAN
VIME
CTS
![Page 38: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/38.jpg)
Thank you!
(Appendix)
![Page 39: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/39.jpg)
Our derivation for “Additive in expectation”
h’’ contains h’
![Page 40: CSC2547 Presentation: Curiosity-driven exploration · CSC2547 Presentation: Curiosity-driven exploration Count-based VS Info gain-based Sheng Jia, Tinglin Duan (First year master](https://reader033.vdocuments.us/reader033/viewer/2022050510/5f9af72d945d7c4b060714f3/html5/thumbnails/40.jpg)