Safe and efficient off-policy reinforcement learning
Remi Munos, Tom Stepleton, Anna Harutyunyan, Marc Bellemare
Google DeepMind
Two desired properties of an RL algorithm:
● Off-policy learning
○ use memory replay
○ do exploration
● Use multi-step returns
○ propagate rewards rapidly
○ avoid accumulation of approximation/estimation errors
Ex: Q-learning (and DQN) is off-policy but does not use multi-step returns. Policy gradient (and A3C) uses returns but is on-policy.
Both properties are important in deep RL. Can we have both simultaneously?
Off-policy reinforcement learning
Behavior policy $\mu$, target policy $\pi$
Observe trajectory $(x_t, a_t, r_t)_{t \ge 0}$
where $a_t \sim \mu(\cdot \mid x_t)$, $r_t = r(x_t, a_t)$, and $x_{t+1} \sim P(\cdot \mid x_t, a_t)$
Goal:
● Policy evaluation: estimate $Q^\pi$
● Optimal control: estimate $Q^*$
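To make the setup concrete, here is a minimal sketch (the toy MDP and all names are illustrative, not from the talk) of sampling a trajectory under the behavior policy $\mu$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite MDP: transition kernel P, reward table R, and the two policies
# as state-indexed action distributions.
n_states, n_actions = 5, 2
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[x, a] = distribution over x'
R = rng.uniform(size=(n_states, n_actions))                       # r(x, a)
mu = rng.dirichlet(np.ones(n_actions), size=n_states)             # behavior policy mu(a | x)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # target policy pi(a | x)

def sample_trajectory(x0, T):
    """Observe (x_t, a_t, r_t), t = 0..T-1, with a_t ~ mu(.|x_t) and x_{t+1} ~ P(.|x_t, a_t)."""
    traj, x = [], x0
    for _ in range(T):
        a = rng.choice(n_actions, p=mu[x])
        traj.append((x, a, R[x, a]))
        x = rng.choice(n_states, p=P[x, a])
    return traj
```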
Off-policy credit assignment problem
Behavior policy $\mu$, target policy $\pi$
Can we use the TD error $\delta_t = r_t + \gamma \mathbb{E}_\pi Q(x_{t+1}, \cdot) - Q(x_t, a_t)$ observed at time $t$ to estimate $Q^\pi(x_s, a_s)$ for all $s \le t$?
Importance sampling
Reweight the trace by the product of IS ratios: the TD at time $t$ is weighted by $\prod_{s=1}^{t} \frac{\pi(a_s \mid x_s)}{\mu(a_s \mid x_s)}$
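Continuing the toy setup above, a sketch of these cumulated IS weights (the empty product gives weight 1 at $t = 0$):

```python
def is_weights(traj, pi, mu):
    """Return prod_{s=1..t} pi(a_s|x_s) / mu(a_s|x_s) for every step t of the trajectory."""
    w, weights = 1.0, []
    for t, (x, a, _) in enumerate(traj):
        if t > 0:                          # the product starts at s = 1
            w *= pi[x, a] / mu[x, a]
        weights.append(w)
    return weights
```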
Unbiased estimate of $Q^\pi(x, a)$, but large (possibly infinite) variance. Not efficient! [Precup, Sutton, Singh, 2000], [Mahmood, Yu, White, Sutton, 2015], ...
$Q^\pi(\lambda)$ algorithm [Harutyunyan, Bellemare, Stepleton, Munos, 2016]
Cut the traces by a constant: $c_s = \lambda$
Works if $\mu$ and $\pi$ are sufficiently close: $\lambda \le \frac{1-\gamma}{\gamma \epsilon}$, where $\epsilon := \max_x \|\pi(\cdot \mid x) - \mu(\cdot \mid x)\|_1$
May not work otherwise. Not safe!
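This closeness condition is easy to check directly; a small sketch, assuming $\pi$ and $\mu$ are given as `[n_states, n_actions]` probability arrays:

```python
import numpy as np

def q_lambda_is_safe(pi, mu, gamma, lam):
    """Check lam <= (1 - gamma) / (gamma * eps), with eps = max_x ||pi(.|x) - mu(.|x)||_1."""
    eps = np.abs(pi - mu).sum(axis=1).max()
    return eps == 0.0 or lam <= (1.0 - gamma) / (gamma * eps)
```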
Tree-backup TB(λ) algorithm [Precup, Sutton, Singh, 2000]
Reweight the traces by the product of target probabilities: $c_s = \lambda\, \pi(a_s \mid x_s)$
Works for arbitrary policies $\pi$ and $\mu$
But it cuts the traces unnecessarily when (near) on-policy. Not efficient!
General off-policy return-based algorithm: update $Q(x, a)$ by

$$\Delta Q(x, a) = \alpha \sum_{t \ge 0} \gamma^t \Big(\prod_{s=1}^{t} c_s\Big) \big(\underbrace{r_t + \gamma\, \mathbb{E}_\pi Q(x_{t+1}, \cdot) - Q(x_t, a_t)}_{\delta_t}\big)$$

| Algorithm | Trace coefficient $c_s$ | Problem |
| --- | --- | --- |
| IS | $\frac{\pi(a_s \mid x_s)}{\mu(a_s \mid x_s)}$ | high variance |
| $Q^\pi(\lambda)$ | $\lambda$ | not safe (off-policy) |
| TB($\lambda$) | $\lambda\, \pi(a_s \mid x_s)$ | not efficient (on-policy) |
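A minimal sketch (not the authors' code) of this general update, continuing the toy setup above. The forward sums over $t \ge s$ are computed with the backward recursion $G_s = \delta_s + \gamma\, c_{s+1} G_{s+1}$, and each trace coefficient from the table becomes a one-line function:

```python
def return_based_update(Q, traj, x_last, pi, gamma, alpha, trace_fn):
    """Update Q[x, a] for every (x_s, a_s) visited along the trajectory.

    traj:     list of (x, a, r); x_last is the state reached after the last step.
    trace_fn: (x, a) -> c_s, the per-step trace coefficient.
    """
    states = [x for (x, _, _) in traj] + [x_last]
    # TD errors delta_t = r_t + gamma * E_pi[Q(x_{t+1}, .)] - Q(x_t, a_t)
    delta = [r + gamma * pi[states[t + 1]] @ Q[states[t + 1]] - Q[x, a]
             for t, (x, a, r) in enumerate(traj)]
    G = 0.0
    for s in reversed(range(len(traj))):
        c_next = trace_fn(*traj[s + 1][:2]) if s + 1 < len(traj) else 0.0
        G = delta[s] + gamma * c_next * G    # G_s = delta_s + gamma * c_{s+1} * G_{s+1}
        x, a, _ = traj[s]
        Q[x, a] += alpha * G
    return Q

lam = 0.9
is_trace = lambda x, a: pi[x, a] / mu[x, a]    # IS: high variance
q_lam    = lambda x, a: lam                    # Q^pi(lambda): not safe off-policy
tb_trace = lambda x, a: lam * pi[x, a]         # TB(lambda): not efficient on-policy
```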
Off-policy policy evaluation:
Theorem 1: Assume a finite state space. Generate trajectories according to the behavior policy $\mu$, and update all states along the trajectories according to
$$Q(x_s, a_s) \leftarrow Q(x_s, a_s) + \alpha \sum_{t \ge s} \gamma^{t-s} \Big(\prod_{u=s+1}^{t} c_u\Big) \delta_t.$$
Assume all states are visited infinitely often. Then, under the usual stochastic-approximation assumptions, if $0 \le c_s \le \frac{\pi(a_s \mid x_s)}{\mu(a_s \mid x_s)}$ then $Q \to Q^\pi$ a.s.
These are sufficient conditions for a safe algorithm (it works for any $\pi$ and $\mu$).
Off-policy return-based operator
Lemma: Assume the traces satisfy $0 \le c_s \le \frac{\pi(a_s \mid x_s)}{\mu(a_s \mid x_s)}$.
Then the off-policy return-based operator
$$\mathcal{R} Q(x, a) = Q(x, a) + \mathbb{E}_\mu \Big[\sum_{t \ge 0} \gamma^t \Big(\prod_{s=1}^{t} c_s\Big) \big(r_t + \gamma\, \mathbb{E}_\pi Q(x_{t+1}, \cdot) - Q(x_t, a_t)\big)\Big]$$
is a contraction mapping (whatever $\pi$ and $\mu$) and $Q^\pi$ is its fixed point.
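As a sanity check (my own sketch, not from the slides), the expected operator has the closed form $\mathcal{R}Q = Q + (I - \gamma P^{c\mu})^{-1}(\mathcal{T}^\pi Q - Q)$, which can be evaluated exactly on a small random MDP to verify the contraction numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma, lam = 4, 3, 0.9, 1.0
P = rng.dirichlet(np.ones(S), size=(S, A))     # P[x, a] = distribution over x'
R = rng.uniform(size=(S, A))
mu = rng.dirichlet(np.ones(A), size=S)
pi = rng.dirichlet(np.ones(A), size=S)
c = lam * np.minimum(1.0, pi / mu)             # traces with 0 <= c <= pi/mu (here: Retrace)

n = S * A
idx = lambda x, a: x * A + a
P_cmu = np.zeros((n, n))    # (x,a) -> (x',a'): P(x'|x,a) * mu(a'|x') * c(x',a')
P_pi = np.zeros((n, n))     # (x,a) -> (x',a'): P(x'|x,a) * pi(a'|x')
for x in range(S):
    for a in range(A):
        for x2 in range(S):
            for a2 in range(A):
                P_cmu[idx(x, a), idx(x2, a2)] = P[x, a, x2] * mu[x2, a2] * c[x2, a2]
                P_pi[idx(x, a), idx(x2, a2)] = P[x, a, x2] * pi[x2, a2]

r = R.reshape(n)
T_pi = lambda q: r + gamma * P_pi @ q                                  # Bellman operator T^pi
R_op = lambda q: q + np.linalg.solve(np.eye(n) - gamma * P_cmu, T_pi(q) - q)

Q_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, r)                    # fixed point Q^pi
Q = rng.normal(size=n)                                                 # arbitrary Q
ratio = np.abs(R_op(Q) - Q_pi).max() / np.abs(Q - Q_pi).max()
print(f"contraction ratio {ratio:.3f} <= gamma = {gamma}")             # lemma: at most gamma
```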
Proof [part 1]
Since $Q^\pi$ is the fixed point, subtract it inside the operator. Thus $\mathcal{R}Q(x, a) - Q^\pi(x, a)$ can be written in terms of the differences $(Q - Q^\pi)$ at later state-action pairs,
which is a linear combination weighted by non-negative coefficients (non-negative since $0 \le c_s \le \pi/\mu$); it remains to compute what these coefficients sum to.
Proof [part 2]
The sum of the coefficients is at most $\gamma$.
Thus $\|\mathcal{R}Q - Q^\pi\|_\infty \le \gamma\, \|Q - Q^\pi\|_\infty$, i.e. $\mathcal{R}$ is a $\gamma$-contraction around $Q^\pi$.
Tradeoff for trace coefficients
● Contraction coefficient of the expected operator: $\gamma$ when $c_s = 0$ (one-step Bellman update), down to $0$ when $c_s = \frac{\pi(a_s \mid x_s)}{\mu(a_s \mid x_s)}$ (full Monte-Carlo rollouts)
● Variance of the estimate (can be infinite for IS, $c_s = \pi/\mu$)
Large $c_s$ uses multi-step returns, but has large variance. Small $c_s$ has low variance, but does not use multi-step returns.
Our recommendation:
Use Retrace(λ), defined by $c_s = \lambda \min\Big(1, \frac{\pi(a_s \mid x_s)}{\mu(a_s \mid x_s)}\Big)$
Properties:
● Low variance, since $c_s \le 1$
● Safe (off-policy): cuts the traces when needed
● Efficient (on-policy): but only when needed; note that $c_s = \lambda$ in the on-policy case ($\mu = \pi$)
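Expressed as a trace function pluggable into the general update sketched after the table above (same toy names):

```python
retrace = lambda x, a: lam * min(1.0, pi[x, a] / mu[x, a])  # Retrace(lambda)
# On-policy (mu == pi) the ratio is 1, so c_s = lam: no trace is cut.
# Off-policy, the min truncates large IS ratios, keeping c_s <= 1 (low variance).
```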
Summary

| Algorithm | Trace coefficient $c_s$ | Problem |
| --- | --- | --- |
| IS | $\frac{\pi(a_s \mid x_s)}{\mu(a_s \mid x_s)}$ | high variance |
| $Q^\pi(\lambda)$ | $\lambda$ | not safe (off-policy) |
| TB($\lambda$) | $\lambda\, \pi(a_s \mid x_s)$ | not efficient (on-policy) |
| Retrace($\lambda$) | $\lambda \min\big(1, \frac{\pi(a_s \mid x_s)}{\mu(a_s \mid x_s)}\big)$ | none! |
Retrace(λ) for optimal control
Let $(\mu_k)$ and $(\pi_k)$ be sequences of behavior and target policies, and update $Q_k$ with the Retrace(λ) operator for $(\mu_k, \pi_k)$.
Theorem 2: Under the previous assumptions (+ a technical assumption), and assuming the $(\pi_k)$ are "increasingly greedy" w.r.t. $(Q_k)$, then $Q_k \to Q^*$ a.s.
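One standard way to obtain "increasingly greedy" target policies is an ε-greedy construction with $\epsilon_k \downarrow 0$; a hedged sketch, where the schedule and the `retrace_evaluation_step` helper are hypothetical:

```python
import numpy as np

def eps_greedy(Q, eps):
    """pi(a|x) = eps / A for every action, plus 1 - eps on argmax_a Q(x, a)."""
    n_states, n_actions = Q.shape
    pi = np.full((n_states, n_actions), eps / n_actions)
    pi[np.arange(n_states), Q.argmax(axis=1)] += 1.0 - eps
    return pi

# Hypothetical control loop: only the targets pi_k become greedy; the behavior
# policies mu_k may stay exploratory (no GLIE assumption needed).
# for k in range(n_iters):
#     pi_k = eps_greedy(Q, eps0 / (k + 1))          # increasingly greedy targets
#     Q = retrace_evaluation_step(Q, mu_k, pi_k)    # hypothetical helper
```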
Remarks
● If the $\pi_k$ are greedy policies (w.r.t. $Q_k$), this gives convergence of Watkins's Q(λ) to $Q^*$ (an open problem since 1989)
● "Increasingly greedy" allows for smoother traces, and thus faster convergence
● The behavior policies $\mu_k$ do not need to become greedy w.r.t. $Q_k$ → no GLIE assumption (Greedy in the Limit with Infinite Exploration); this is the first return-based algorithm shown to converge to $Q^*$ without GLIE
Retrace for deep RL
Retrace is particularly suitable for deep RL, as it is off-policy (useful for exploration and when using memory replay) and uses multi-step returns (fast propagation of rewards).
It automatically adjusts the length of the return to the degree of "off-policy-ness".
Typical uses (a replay sketch follows below):
→ Memory replay, e.g. DQN [Mnih et al., 2015] (behavior policy = the one stored in memory, target policy = current policy)
→ Online updates, e.g. A3C [Mnih et al., 2016] (behavior policy = exploratory policy, target policy = evaluation policy)
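A minimal sketch of the replay use case (assuming discrete actions and that the behavior probabilities $\mu(a_t \mid x_t)$ of the taken actions were stored in the buffer; all names are illustrative, not DQN's actual API):

```python
import numpy as np

def retrace_targets(q_taken, q_next, pi_next, pi_taken, mu_taken, rewards, gamma, lam):
    """Regression targets Q(x_t,a_t) + sum_{s>=t} gamma^(s-t) (prod_{u=t+1..s} c_u) delta_s.

    q_taken[t]   = Q(x_t, a_t)      (shape [T])
    q_next[t]    = Q(x_{t+1}, .)    (shape [T, A])
    pi_next[t]   = pi(. | x_{t+1})  (shape [T, A])
    pi_taken[t], mu_taken[t] = pi and mu probabilities of the action taken at t.
    """
    T = len(rewards)
    delta = rewards + gamma * (pi_next * q_next).sum(axis=1) - q_taken  # TD errors
    c = lam * np.minimum(1.0, pi_taken / mu_taken)                      # Retrace traces
    targets, G = np.empty(T), 0.0
    for t in reversed(range(T)):              # G_t = delta_t + gamma * c_{t+1} * G_{t+1}
        G = delta[t] + (gamma * c[t + 1] * G if t + 1 < T else 0.0)
        targets[t] = q_taken[t] + G
    return targets                            # train the network to regress these
```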
Atari 2600 environments: Retrace vs DQN
Games: Asteroids, Defender, Demon Attack, Hero, Krull, River Raid, Space Invaders, Star Gunner, Wizard of Wor, Zaxxon
Experiments on 60 Atari games
Comparison of the traces: Retrace vs TB
Retrace: $c_s = \lambda \min\big(1, \frac{\pi(a_s \mid x_s)}{\mu(a_s \mid x_s)}\big)$
TB: $c_s = \lambda\, \pi(a_s \mid x_s)$
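A toy numeric comparison (illustrative probabilities, not from the slides): on-policy, TB still shrinks the trace by $\pi(a \mid x) < 1$, while Retrace keeps the full $c_s = \lambda$:

```python
lam = 1.0
for p_pi, p_mu in [(0.5, 0.5), (0.9, 0.1), (0.1, 0.9)]:
    tb = lam * p_pi
    re = lam * min(1.0, p_pi / p_mu)
    print(f"pi(a|x)={p_pi}, mu(a|x)={p_mu}:  TB c_s={tb:.2f}  Retrace c_s={re:.2f}")
```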
Conclusions
● General update rule for off-policy return-based RL
● Conditions under which an algorithm is safe and efficient
● We recommend using Retrace(λ):
○ Converges to $Q^*$
○ Safe: cuts the traces when needed
○ Efficient: but only when needed
○ Works for policy evaluation and for control
○ Particularly suited for deep RL
Extensions:
● Works in continuous action spaces
● Can be used in off-policy policy gradient [Wang et al., 2016]
If you want to know more, come to our poster at NIPS!
References (mentioned in the slides):
● [Harutyunyan, Bellemare, Stepleton, Munos, 2016] Q(λ) with Off-Policy Corrections
● [Mahmood, Yu, White, Sutton, 2015] Emphatic Temporal-Difference Learning
● [Mnih et al., 2015] Human-Level Control Through Deep Reinforcement Learning
● [Mnih et al., 2016] Asynchronous Methods for Deep Reinforcement Learning
● [Munos, Stepleton, Harutyunyan, Bellemare, 2016] Safe and Efficient Off-Policy Reinforcement Learning
● [Precup, Sutton, Singh, 2000] Eligibility Traces for Off-Policy Policy Evaluation
● [Sutton, 1988] Learning to Predict by the Methods of Temporal Differences
● [Wang et al., 2016] Sample Efficient Actor-Critic with Experience Replay
Temporal credit assignment problem
Behavior policy $\mu$
Can a "surprise" (the TD error) observed at time $t$ be used to train the value estimates at all previously visited states $x_s$, $s \le t$?
Temporal difference learning TD(λ) [Sutton, 1988]
Behavior policy $\mu$