TRANSCRIPT
Distributional Reinforcement Learning
Marc Bellemare, Will Dabney, Georg Ostrovski, Mark Rowland, Rémi Munos
Deep RL is already a successful empirical research domain
Can we make it a fundamental research domain?
Related fundamental works:
- RL side: tabular, linear TD, ADP, sample complexity, ...
- Deep learning side: VC-dim, convergence, stability, robustness, ...
Nice theoretical results, but how much do they tell us about deep RL?
What is specific about RL when combined with deep learning?
Distributional RL
Shows interesting interactions between RL and deep learning
Outline:
● The idea of distributional RL
● The theory
● How to represent distributions?
● Neural net implementation
● Results
● Why does this work?
Random immediate reward
Expected immediate reward: r(x, a) = E[R(x, a)]
Random-variable reward: R(x, a)
The return = sum of future discounted rewards: Z = Σ_{t≥0} γ^t R_t
● Returns are often complex, multimodal
● Modelling only the expected return hides this intrinsic randomness
● Model all possible returns!
The r.v. return (e.g., one sampled trajectory gives a return of +10)
Captures intrinsic randomness from:
● Immediate rewards
● Stochastic dynamics
● Possibly stochastic policy
(Different trajectories from the same state give different returns, e.g. +10, +8, −2.)
The expected return
The value function V^π(x) = E[Z^π(x)]
satisfies the Bellman equation: V^π(x) = E[R(x, π(x))] + γ E[V^π(x')]
Distributional Bellman equation?
We would like to write a Bellman equation for the distributions: Z^π(x, a) =_D R(x, a) + γ Z^π(x', a'), with x' ~ p(·|x, a) and a' ~ π(·|x').
Does this equation make sense?
Example: reward = Bernoulli(½), discount factor γ = ½
Bellman equation: V = ½ + ½ V, thus V = 1
Return distribution?
(rewards = binary expansion of a real number, so the return is uniformly distributed on [0, 2])
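A quick Monte Carlo check of this example (a sketch; the horizon is truncated at 60 steps, which is negligible at γ = ½):

```python
import random

def sample_return(gamma=0.5, horizon=60):
    """One discounted return with i.i.d. Bernoulli(1/2) rewards."""
    return sum((gamma ** t) * random.randint(0, 1) for t in range(horizon))

random.seed(0)
returns = [sample_return() for _ in range(100_000)]

mean = sum(returns) / len(returns)                           # close to V = 1
below_half = sum(r < 0.5 for r in returns) / len(returns)    # ~0.25 for Uniform[0, 2]
print(f"mean ~ {mean:.3f}, P(return < 0.5) ~ {below_half:.3f}")
```

The sample mean matches V = 1, while the fraction of returns below 0.5 matches the uniform law on [0, 2].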
Distributional Bellman equation: Z(x, a) =_D R(x, a) + γ Z(x', a')
In terms of distributions: the law of Z(x, a) equals the law of R(x, a) + γ Z(x', a') (shift by the reward, scale by γ).
Distributional Bellman operator: (T^π Z)(x, a) =_D R(x, a) + γ Z(x', a')
Does there exist a fixed point?
Properties
Theorem [Rowland et al., 2018]: T^π is a contraction in the Cramér metric.
Theorem [Bellemare et al., 2017]: T^π is a contraction in the Wasserstein metric (but not in KL, nor in total variation).
Intuition: the size of the support shrinks by a factor γ.
Distributional dynamic programming
Thus T^π has a unique fixed point, and it is Z^π.
Policy evaluation: for a given policy π, the iterates Z_{k+1} = T^π Z_k converge to Z^π.
Policy iteration:
● For the current policy π_k, compute Z^{π_k}
● Improve the policy: π_{k+1} greedy with respect to the mean of Z^{π_k}
Does Z_k converge to the return distribution of the optimal policy?
Distributional Bellman optimality operator: (T Z)(x, a) =_D R(x, a) + γ Z(x', a*), where a* = argmax_{a'} E[Z(x', a')]
Is this operator a contraction mapping?
No! It is not even continuous.
The distributional optimality Bellman operator is not smooth
Consider two actions: one whose return distribution is a Dirac at 0, the other a bimodal distribution with mean ε.
If ε > 0 we back up the bimodal distribution.
If ε < 0 we back up the Dirac at 0.
Thus the map ε ↦ T Z is not continuous.
Distributional Bellman optimality operator
Theorem [Bellemare et al., 2017]: if the optimal policy is unique, then the iterates Z_{k+1} = T Z_k converge to Z^{π*}.
Intuition: the distributional Bellman operator preserves the mean, so the means converge to the optimal values and the greedy policy eventually settles on the optimal policy. Once the policy is fixed, we revert to iterating T^{π*}, which is a contraction.
How to represent distributions?
● Categorical
● Inverse CDF for specific quantile levels
● Parametric inverse CDF
Categorical distributions
Discrete distributions supported on a fixed finite set of atoms z0 < z1 < … < zn
Projected Distributional Bellman Update
1. Transition: x → x'
2. Discount / shrink the atoms by γ
3. Reward / shift the atoms by r
4. Fit / project back onto the support
Projected distributional Bellman operator
Let Π_C be the projection onto the support (piecewise linear interpolation).
Theorem [Rowland et al., 2018]: Π_C T^π is a contraction (in Cramér distance).
Intuition: Π_C is a non-expansion (in Cramér distance).
Its fixed point Z_C can be computed by value iteration.
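The fit/project step can be sketched as follows (a minimal single-transition version of the categorical projection; `support` is assumed to be a uniform grid of atoms):

```python
import numpy as np

def project(support, probs, r, gamma):
    """Project the distribution of r + gamma*Z (Z categorical on `support`
    with probabilities `probs`) back onto `support` by linear interpolation."""
    v_min, v_max = support[0], support[-1]
    dz = support[1] - support[0]
    out = np.zeros_like(probs)
    for z, p in zip(support, probs):
        tz = np.clip(r + gamma * z, v_min, v_max)  # shifted / shrunk atom
        b = (tz - v_min) / dz                      # fractional grid index
        lo, hi = int(np.floor(b)), int(np.ceil(b))
        if lo == hi:
            out[lo] += p                           # atom lands exactly on the grid
        else:
            out[lo] += p * (hi - b)                # split mass between neighbours
            out[hi] += p * (b - lo)
    return out
```

For example, with a point mass at z = 4, r = 1.25 and γ = 0.5, the backed-up atom 3.25 splits between neighbouring atoms 3 and 4 with weights 0.75 and 0.25, which preserves the mean.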
Projected distributional Bellman operator
Policy iteration: iterate
- Policy evaluation: Z_{k+1} = Π_C T^{π_k} Z_k
- Policy improvement: π_{k+1} greedy with respect to the mean of Z_{k+1}
Theorem: assume there is a unique optimal policy. Then Z_k converges to a fixed point whose greedy policy is optimal.
Categorical distributional Q-learning
Observe transition samples (x, a, r, x')
Update: mix the current estimate toward the projected sample backup, Z(x, a) ← (1 − α) Z(x, a) + α Π_C (r + γ Z(x', a*))
Theorem [Rowland et al., 2018]: under the same assumptions as for Q-learning, if there is a unique optimal policy π*, then Z_k converges and the resulting greedy policy is optimal.
Deep RL implementation
DQN [Mnih et al., 2013]
C51 [Bellemare et al., 2017]
C51 (categorical distributional DQN)
1. Transition x, a → x’
2. Select best action at x’
3. Compute Bellman backup
4. Project onto support
5. Update toward the projection (e.g., by minimizing a KL loss)
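Step 5 can be sketched as follows (a minimal NumPy version; `logits` stands for the network's outputs at (x, a) and `target_probs` for the projected distribution from step 4 — both are hypothetical placeholders here):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())  # shift for numerical stability
    return e / e.sum()

def kl_loss_and_grad(logits, target_probs):
    """Cross-entropy between the projected target m and the predicted
    categorical distribution p; its gradient w.r.t. the logits is p - m."""
    p = softmax(logits)
    loss = -(target_probs * np.log(p + 1e-12)).sum()
    return loss, p - target_probs
```

With the target held fixed (no gradient flows through the projection), a gradient step on the logits moves the predicted distribution toward the projected backup.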
Randomness from future choices
Results on 57 Atari 2600 games:

| | Mean | Median | >human |
|---|---|---|---|
| DQN | 228% | 79% | 24 |
| Double DQN | 307% | 118% | 33 |
| Dueling | 373% | 151% | 37 |
| Prio. Duel. | 592% | 172% | 39 |
| C51 | 701% | 178% | 40 |
Categorical representation: fixed support z_i = z_0 + iΔz, learned probabilities p_0 … p_{n−1}
Quantile Regression Networks: fixed probabilities (uniform weights Δp = 1/n), learned support z_0 … z_{n−1}
Inverse CDF learnt by quantile regression (quantile levels τ_0 … τ_n mapped to values z_0 … z_n)
● ℓ2-regression → mean
● ℓ1-regression → median
● ¼-quantile regression → ¼-quantile
● ¾-quantile regression → ¾-quantile
● many quantile regressions → many quantiles
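These facts can be checked numerically with the quantile (pinball) loss; the sketch below uses skewed exponential data (an illustrative assumption) and recovers each τ-quantile by grid search over a scalar:

```python
import numpy as np

def pinball_loss(z, samples, tau):
    """Quantile-regression loss: weight tau on under-estimates,
    (1 - tau) on over-estimates."""
    u = samples - z
    return np.mean(np.where(u > 0, tau * u, (tau - 1) * u))

rng = np.random.default_rng(0)
samples = rng.exponential(scale=1.0, size=20_000)  # skewed, multimodal-unfriendly data

grid = np.linspace(0.0, 5.0, 501)
q_hat = {}
for tau in (0.25, 0.5, 0.75):
    q_hat[tau] = grid[np.argmin([pinball_loss(z, samples, tau) for z in grid])]
    print(tau, q_hat[tau], np.quantile(samples, tau))
```

The minimizers match the empirical quantiles; for τ = ½ the loss reduces to half the absolute error, so the minimizer is the median.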
Quantile Regression DQN
![Page 52: Rémi Munos Marc Bellemare, Will Dabney, Georg Ostrovski ... · Why does it work? In the end we only use the mean of these distributions When we use deep networks, maybe: Auxiliary](https://reader034.vdocuments.us/reader034/viewer/2022052008/601cf7e5094a5e58626a369f/html5/thumbnails/52.jpg)
Quantile regression = projection in Wasserstein distance! (on a uniform grid of quantile levels)
QR distributional Bellman operator
Theorem [Dabney et al., 2018]: Π_W T^π is a contraction (in Wasserstein distance).
Intuition: quantile regression = projection in Wasserstein.
Reminder:
● T^π is a contraction (both in Cramér and Wasserstein)
● Π_C T^π is a contraction (in Cramér)
(Learning curves comparing DQN, C51, and QR-DQN.)
Quantile-Regression DQN

| | Mean | Median |
|---|---|---|
| DQN | 228% | 79% |
| Double DQN | 307% | 118% |
| Dueling | 373% | 151% |
| Prio. Duel. | 592% | 172% |
| C51 | 701% | 178% |
| QR-DQN | 864% | 193% |
Implicit Quantile Networks (IQN)
Learn a parametric inverse CDF: the network takes a quantile level τ ∈ [0, 1] as an extra input and outputs the corresponding quantile of the return.
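A minimal sketch of the idea, following the cosine embedding of τ used in the IQN paper; the weights below are untrained random placeholders, so the outputs are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_cos = 64, 64                           # feature width, cosine basis size
W_embed = rng.normal(0, 0.1, (n_cos, d))    # placeholder (untrained) weights
W_out = rng.normal(0, 0.1, (d,))

def quantile_value(state_features, tau):
    """IQN-style parametric inverse CDF: embed tau with a cosine basis,
    gate the state features with it, then map to a scalar quantile value."""
    basis = np.cos(np.pi * np.arange(n_cos) * tau)  # cos(pi * i * tau)
    phi = np.maximum(0.0, basis @ W_embed)          # ReLU embedding of tau
    return float((state_features * phi) @ W_out)
```

At training time, quantile levels are sampled τ ~ U(0, 1) per update and trained with the quantile loss at the sampled level; averaging the output over many sampled τ estimates the expected return.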
(Learning curves comparing DQN, C51, QR-DQN, and IQN.)
Implicit Quantile Networks for TD
Implicit Quantile Networks

| | Mean | Median | Human starts |
|---|---|---|---|
| DQN | 228% | 79% | 68% |
| Prio. Duel. | 592% | 172% | 128% |
| C51 | 701% | 178% | 116% |
| QR-DQN | 864% | 193% | 153% |
| IQN | 1019% | 218% | 162% |
| Rainbow | 1189% | 230% | 125% |

Almost as good as the state of the art (Rainbow/Reactor), which combines prioritized replay, dueling, categorical, ...
Why does it work?
● In the end we only use the mean of these distributions
When we use deep networks, maybe:
● Auxiliary task effect:
○ Same signal to learn from, but more predictions
○ More predictions → richer signal → better representations
○ Reduced state aliasing (disambiguate different states based on their return)
● Density estimation instead of ℓ2-regression:
○ RL uses the same tools as deep learning
○ Lower-variance gradients
● Other reasons?
Distributional RL: the research landscape

Theory:
- Convergence analysis: contraction property, control case, SGD friendly
- Distributional loss: Wasserstein, Cramér, other?

Representation of distributions: categorical, quantile regression, mixture of Gaussians, generative models

Algorithms:
- Value-based, policy-based
- Agents: DQN, A3C, Impala, DDPG, TRPO, PPO, ...
- Distribution over returns or policies
- Policy: risk-neutral, risk-seeking/averse, exploration (optimism, Thompson sampling)

Deep learning impact: lower-variance gradients, richer representations

Other: state aliasing, reward clipping, undiscounted RL

Evaluation: environments — Atari, DMLab30, Control Suite, Go, ...
References
● A Distributional Perspective on Reinforcement Learning. Bellemare, Dabney, Munos. ICML 2017.
● An Analysis of Categorical Distributional Reinforcement Learning. Rowland, Bellemare, Dabney, Munos, Teh. AISTATS 2018.
● Distributional Reinforcement Learning with Quantile Regression. Dabney, Rowland, Bellemare, Munos. AAAI 2018.
● Implicit Quantile Networks for Distributional Reinforcement Learning. Dabney, Ostrovski, Silver, Munos. ICML 2018.