Maximum-a-Posteriori (MAP) Policy Optimization
Mayank Mittal
MAP Policy Optimization
Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, Martin Riedmiller (2018)
V-MPO: On-Policy MAP Policy Optimization For Discrete and Continuous Control
H. Francis Song*, Abbas Abdolmaleki*, Jost Tobias Springenberg, Aidan Clark, Hubert Soyer, Jack W. Rae, Seb Noury, Arun Ahuja, Siqi Liu, Dhruva Tirumala, Nicolas Heess, Dan Belov, Martin Riedmiller, Matthew M. Botvinick (2019)
Duality: Control and Estimation
▪ What are the actions which maximize future rewards?
▪ Assuming future success in maximizing rewards, what are the actions most likely to have been taken?
Solved using Expectation Maximization (EM)
Expectation Maximization
▪ Generally, point estimation via MLE/MAP is not possible due to intractability of the marginal likelihood
▪ Consider a latent variable model with "global" parameters Θ and local latent variables z_n for each observation x_n
Slide from course: Topics in Probabilistic Modeling and Inference, Piyush Rai (Fall Semester, 2019), IIT Kanpur
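In standard notation (assuming the usual latent-variable setup with observations x_n, latents z_n, and global parameters Θ), the quantity that makes point estimation intractable is the marginal likelihood:

```latex
% Log marginal likelihood of a latent variable model: the integral over
% the local latents z_n makes direct MLE/MAP for \Theta intractable.
\[ \log p(X \mid \Theta) \;=\; \sum_{n=1}^{N} \log \int p(x_n, z_n \mid \Theta)\, dz_n \]
```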
Expectation Maximization (continued)
▪ Introduce a variational distribution q(z) over the latents to lower-bound the log marginal likelihood; the bound is the ELBO
▪ E-step: tighten the bound by updating q toward the posterior over the latents under the current parameters
▪ M-step: maximize the bound w.r.t. the parameters with q held fixed
Slide from course: Topics in Probabilistic Modeling and Inference, Piyush Rai (Fall Semester, 2019), IIT Kanpur
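The identity the slides build on can be written as follows (a standard reconstruction):

```latex
% For any q(z), the log marginal likelihood splits into the ELBO plus a
% KL term; the E-step closes the KL gap, the M-step raises the ELBO.
\[ \log p(x \mid \Theta)
  \;=\; \underbrace{\mathbb{E}_{q(z)}\!\left[\log \frac{p(x, z \mid \Theta)}{q(z)}\right]}_{\mathrm{ELBO}(q,\,\Theta)}
  \;+\; \mathrm{KL}\!\left(q(z)\,\|\,p(z \mid x, \Theta)\right) \]
```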
MAP Policy Optimization (Abdolmaleki et al., 2018)
Inference for Optimal Control
▪ Given a prior distribution over trajectories
▪ Estimate the posterior distribution over trajectories consistent with a desired outcome O (such as achieving a goal), where O is interpreted as the event of succeeding at the RL task
Inference for Optimal Control
Likelihood function (undiscounted case), with O interpreted as the event of succeeding at the RL task and a temperature α, together with the resulting likelihood objective:
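Reconstructed in the notation of Abdolmaleki et al. (2018):

```latex
% Optimality likelihood: O = 1 is the event of succeeding at the RL task,
% \alpha > 0 is the temperature (undiscounted case):
\[ p(O = 1 \mid \tau) \;\propto\; \exp\!\left(\frac{\sum_t r_t}{\alpha}\right) \]
% Likelihood objective: maximize the log evidence of success under the
% trajectory distribution p_\pi(\tau) induced by the policy:
\[ \max_{\pi}\; \log p_{\pi}(O = 1)
  \;=\; \max_{\pi}\; \log \int p_{\pi}(\tau)\, p(O = 1 \mid \tau)\, d\tau \]
```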
Inference for Optimal Control
Lower-bound the likelihood objective by introducing an auxiliary (variational) distribution q over trajectories; the bound is an ELBO, optimized by coordinate ascent:
▪ E-step: improves the ELBO w.r.t. q
▪ M-step: improves the ELBO w.r.t. the policy
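Applying Jensen's inequality gives the ELBO (up to an additive constant from the likelihood normalization):

```latex
% ELBO for control, with auxiliary distribution q(\tau):
\[ \log p_{\pi}(O = 1)
  \;\ge\; \mathbb{E}_{q(\tau)}\!\left[\frac{\sum_t r_t}{\alpha}\right]
  \;-\; \mathrm{KL}\big(q(\tau)\,\|\,p_{\pi}(\tau)\big)
  \;=\; \mathcal{J}(q, \pi) \]
```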
Inference for Optimal Control
Define the variational distribution over trajectories to share the system dynamics with the policy, differing only in a per-state action distribution q(a|s), and place a prior over the policy parameters θ. The likelihood objective then becomes a KL-regularized return, in an undiscounted and a discounted form (with discount factor γ):
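Reconstructed in the paper's notation:

```latex
% Undiscounted case: per-step KL regularization toward the policy, plus
% the log prior over parameters (the MAP term):
\[ \mathcal{J}(q, \theta) = \mathbb{E}_{q}\!\left[\sum_t \frac{r_t}{\alpha}
  - \mathrm{KL}\big(q(\cdot \mid s_t)\,\|\,\pi(\cdot \mid s_t, \theta)\big)\right]
  + \log p(\theta) \]
% Discounted case, multiplying through by the temperature \alpha:
\[ \mathcal{J}(q, \theta) = \mathbb{E}_{q}\!\left[\sum_t \gamma^t \Big(r_t
  - \alpha\, \mathrm{KL}\big(q(\cdot \mid s_t)\,\|\,\pi(\cdot \mid s_t, \theta)\big)\Big)\right]
  + \alpha \log p(\theta) \]
```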
Regularized RL
For the discounted case, the likelihood objective can be rewritten in terms of a regularized Q-value function:
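The regularized Q-value function in question (reconstructed from the paper):

```latex
% KL-regularized action value: take action a in state s, then follow q,
% paying an \alpha-weighted KL penalty toward \pi at every later step:
\[ Q^{q}_{\theta}(s, a) = r_0 + \mathbb{E}_{q,\; s_0 = s,\, a_0 = a}\!\left[
  \sum_{t \ge 1} \gamma^t \Big(r_t
  - \alpha\, \mathrm{KL}\big(q(\cdot \mid s_t)\,\|\,\pi(\cdot \mid s_t, \theta)\big)\Big)\right] \]
```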
Regularized RL
Proposition
Objective of MPO
▪ Uses EM-style coordinate ascent to maximize the estimation objective
▪ Proposes an off-policy algorithm that
  ▪ is scalable, robust, and insensitive to hyperparameters
  ▪ offers data efficiency
(Figure: spectrum of on-policy algorithms vs. off-policy algorithms)
E-step: Maximization w.r.t. q
Consider iteration i:
1. Set q(a|s) = π(a|s, θ_i), the current policy
2. Estimate the unregularized action value Q_{θ_i}(s, a) using the Retrace algorithm. Off-policy!
Munos, Remi, et al. 'Safe and Efficient Off-Policy Reinforcement Learning'. Advances in Neural Information Processing Systems, 2016, pp. 1054–1062.
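As a rough illustration of step 2, here is a minimal NumPy sketch of the Retrace(λ) targets for a single trajectory; the function name and array layout are assumptions for exposition, not the paper's implementation:

```python
import numpy as np

def retrace_targets(q, exp_q, rewards, ratios, gamma=0.99, lam=1.0):
    """Retrace(lambda) targets for one behavior trajectory (a sketch).

    q       -- Q(s_t, a_t) along the trajectory, shape [T]
    exp_q   -- E_{a~pi}[Q(s_t, a)] under the current policy, shape [T+1]
               (the last entry bootstraps the value after the final step)
    rewards -- r_t, shape [T]
    ratios  -- importance ratios pi(a_t|s_t) / mu(a_t|s_t), shape [T]
    """
    T = len(rewards)
    # Truncated importance weights c_t = lam * min(1, ratio) keep the
    # operator safe for arbitrarily off-policy data (Munos et al., 2016).
    c = lam * np.minimum(1.0, ratios)
    targets = np.zeros(T)
    acc = 0.0  # discounted, trace-weighted sum of future TD errors
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * exp_q[t + 1] - q[t]  # 1-step TD error
        acc = delta if t == T - 1 else delta + gamma * c[t + 1] * acc
        targets[t] = q[t] + acc
    return targets
```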
3. Maximize the "one-step" KL-regularized objective (the remaining terms of the ELBO are constant w.r.t. q)
Interpretation: q chooses the soft-optimal action for one step and then resorts to executing the current policy π_{θ_i}. The state distribution μ(s) is stationary since samples are drawn from the replay buffer.
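The one-step objective, reconstructed in the paper's notation (μ is the replay-buffer state distribution):

```latex
\[ \max_{q}\;\bar{\mathcal{J}}(q, \theta_i) = \max_{q}\;\mathbb{E}_{\mu(s)}\!\left[
  \mathbb{E}_{q(\cdot \mid s)}\big[Q_{\theta_i}(s, a)\big]
  - \alpha\, \mathrm{KL}\big(q(\cdot \mid s)\,\|\,\pi(\cdot \mid s, \theta_i)\big)\right] \]
```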
The temperature α has an arbitrary scale and is hard to tune, so MPO replaces the soft KL penalty with a hard constraint.
(Hard) Constrained E-step:
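The hard-constrained form trades the temperature for an explicit KL budget ε:

```latex
\[ \max_{q}\; \mathbb{E}_{\mu(s)}\!\left[\mathbb{E}_{q(\cdot \mid s)}\big[Q_{\theta_i}(s, a)\big]\right]
\quad \text{s.t.} \quad
\mathbb{E}_{\mu(s)}\!\left[\mathrm{KL}\big(q(\cdot \mid s)\,\|\,\pi(\cdot \mid s, \theta_i)\big)\right] < \epsilon \]
```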
Two methods to solve the constrained E-step:
▪ Method 1: use a parametric variational distribution (similar to TRPO/PPO)
▪ Method 2: use a non-parametric variational distribution, i.e. a sample-based distribution over actions for each state
MPO adopts Method 2: a non-parametric, sample-based variational distribution over actions for each state, obtained by solving the constrained E-step through its Lagrangian formulation.
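Solving the Lagrangian in closed form (as in the paper) gives a Boltzmann-like reweighting of the current policy, with the temperature η obtained by minimizing a convex dual:

```latex
% Closed-form non-parametric E-step solution:
\[ q_i(a \mid s) \;\propto\; \pi(a \mid s, \theta_i)\,
  \exp\!\left(\frac{Q_{\theta_i}(s, a)}{\eta^{\star}}\right) \]
% Dual function minimized to find \eta^{\star}:
\[ g(\eta) = \eta\,\epsilon + \eta\, \mathbb{E}_{\mu(s)}\!\left[
  \log \mathbb{E}_{\pi(a \mid s, \theta_i)}\!\left[
    \exp\!\left(\frac{Q_{\theta_i}(s, a)}{\eta}\right)\right]\right] \]
```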
E-step summary. At iteration i:
1. Set q(a|s) = π(a|s, θ_i)
2. Estimate the unregularized action value Q_{θ_i}(s, a) using the Retrace algorithm
3. Maximize the "one-step" KL-regularized objective to obtain the reweighted distribution q_i
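A minimal sample-based sketch of this E-step (a grid search stands in for the gradient-based dual solver used in practice; names and shapes are illustrative assumptions):

```python
import numpy as np
from scipy.special import logsumexp

def nonparametric_estep(q_values, epsilon=0.1):
    """Sample-based non-parametric E-step (a sketch).

    q_values -- Q(s_j, a_jk) for N states, each with K actions sampled
                from the current policy; shape [N, K].
    Returns per-sample weights approximating q_i(a|s), and eta*.
    """
    n_actions = q_values.shape[1]

    def dual(eta):
        # g(eta) = eta*eps + eta * E_s[ log E_a[ exp(Q/eta) ] ];
        # a log-mean-exp over the K sampled actions approximates the
        # inner expectation under the current policy.
        lme = logsumexp(q_values / eta, axis=1) - np.log(n_actions)
        return eta * epsilon + eta * lme.mean()

    # The dual is convex in eta; a coarse grid search suffices here.
    etas = np.logspace(-3.0, 2.0, 200)
    eta_star = min(etas, key=dual)
    # Subtract the per-state max before exponentiating, for stability;
    # the per-state normalization below cancels the shift.
    weights = np.exp((q_values - q_values.max(axis=1, keepdims=True)) / eta_star)
    return weights / weights.sum(axis=1, keepdims=True), eta_star
```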
M-step: Maximization w.r.t. θ
Plugging q_i from the E-step back into the discounted likelihood objective and partially maximizing w.r.t. the policy yields a weighted maximum-likelihood problem: samples are weighted by the variational distribution from the E-step. Looks similar to supervised learning!
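The resulting M-step objective (paper notation):

```latex
% Weighted maximum likelihood over E-step samples, plus the parameter prior:
\[ \max_{\theta}\; \mathbb{E}_{\mu(s)}\!\left[
  \mathbb{E}_{q_i(a \mid s)}\big[\log \pi(a \mid s, \theta)\big]\right]
  + \log p(\theta) \]
```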
For the generalized case, the parameter prior p(θ) is chosen to keep the new policy close to the current one, which is again enforced as a hard constraint.
(Hard) Constrained M-step: the constraint prevents overfitting on the samples, since it decreases the tendency of the policy's entropy to collapse.
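With a Gaussian prior centered on the current parameters, the generalized M-step becomes a trust-region-constrained fit:

```latex
\[ \max_{\theta}\; \mathbb{E}_{\mu(s)}\!\left[
  \mathbb{E}_{q_i(a \mid s)}\big[\log \pi(a \mid s, \theta)\big]\right]
\quad \text{s.t.} \quad
\mathbb{E}_{\mu(s)}\!\left[\mathrm{KL}\big(\pi(\cdot \mid s, \theta_i)\,\|\,\pi(\cdot \mid s, \theta)\big)\right] < \epsilon_{\pi} \]
```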
Algorithm
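A compact illustrative sketch of one MPO update, combining the pieces above; the `policy`, `old_policy`, `q_net`, and `replay` objects and the `kl_penalty` helper are assumptions for exposition, not the authors' code:

```python
# One MPO iteration (illustrative only; reuses the sketches above).
def mpo_step(policy, old_policy, q_net, replay, eps=0.1, eps_pi=1e-3,
             num_action_samples=20):
    batch = replay.sample()  # off-policy transitions with Retrace targets
    # Policy evaluation: regress Q onto Retrace targets (E-step, step 2).
    q_loss = ((q_net(batch.s, batch.a) - batch.retrace_targets) ** 2).mean()
    # E-step: sample actions from the old policy, weight by exp(Q / eta).
    actions = old_policy.sample(batch.s, num_action_samples)     # [N, K]
    weights, eta = nonparametric_estep(q_net(batch.s, actions), eps)
    # M-step: weighted maximum likelihood under a KL trust region.
    pi_loss = -(weights * policy.log_prob(batch.s, actions)).sum()
    pi_loss += kl_penalty(old_policy, policy, batch.s, eps_pi)
    return q_loss, pi_loss
```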
Experimental Evaluation
▪ Gaussian parametrization of the policy
▪ Benchmarked on continuous control tasks
▪ Stable learning on all tasks
▪ Significant sample efficiency
V-MPO: On-Policy MAP Policy Optimization For Discrete and Continuous Control (Song, Abdolmaleki, et al., 2019)
Objective of V-MPO
▪ Uses EM-style coordinate ascent to maximize the estimation objective
▪ Proposes an on-policy algorithm that
  ▪ replaces the state-action value function in MPO with a state value function
  ▪ scales to the multi-task setting without population-based tuning of hyperparameters
Inference for Optimal Control
▪ In MPO: O is interpreted as the event of succeeding at the RL task, with temperature α
▪ In V-MPO: the event is interpreted as a relative improvement of the policy over the previous policy, with temperature η
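The two likelihood choices, reconstructed (MPO's α from Abdolmaleki et al., 2018; V-MPO's improvement event with temperature η following Song et al., 2019):

```latex
% MPO: trajectory-level optimality event, temperature \alpha
\[ p(O = 1 \mid \tau) \;\propto\; \exp\!\left(\frac{\sum_t r_t}{\alpha}\right) \]
% V-MPO: per-sample improvement event, temperature \eta, scored by the
% advantage under the previous policy
\[ p(I = 1 \mid s, a) \;\propto\; \exp\!\left(\frac{A^{\pi_{\mathrm{old}}}(s, a)}{\eta}\right) \]
```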
Inference for Control
MAP objective: apply the standard log-likelihood identity (ELBO plus KL gap) with a variational distribution over state-action samples, then ascend by coordinates:
▪ E-step: improves the ELBO w.r.t. the variational distribution
▪ M-step: improves the ELBO w.r.t. the policy
E-step: Maximization w.r.t. q
Consider iteration i:
1. Set the variational distribution equal to the current policy π_{θ_i}
2. Estimate the state value function V(s) using n-step targets. On-policy!
3. Calculate advantages A(s, a) from the n-step returns and the value estimates
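A minimal sketch of steps 2-3 (n-step value targets and advantages for one on-policy rollout; the function name and array layout are assumptions):

```python
import numpy as np

def nstep_targets_and_advantages(rewards, values, bootstrap_value,
                                 gamma=0.99, n=5):
    """On-policy n-step targets G_t^(n) and advantages A_t (a sketch).

    rewards         -- r_t along the rollout, shape [T]
    values          -- V(s_t) from the current value head, shape [T]
    bootstrap_value -- V(s_T) for the state after the rollout
    """
    T = len(rewards)
    v_ext = np.append(values, bootstrap_value)
    targets = np.empty(T)
    for t in range(T):
        h = min(n, T - t)  # truncate the n-step window at the rollout end
        discounts = gamma ** np.arange(h)
        targets[t] = discounts @ rewards[t:t + h] + gamma ** h * v_ext[t + h]
    advantages = targets - values  # A_t = G_t^(n) - V(s_t)
    return targets, advantages
```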
4. Maximize the objective under a hard KL constraint ((Hard) Constrained E-step), using a non-parametric, sample-based variational distribution over the state-action samples in the batch; the constrained problem is solved via its Lagrangian formulation.
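The closed-form solution over the batch D, with its convex dual for the temperature (reconstructed from Song et al., 2019):

```latex
% Non-parametric weights: normalized exponentiated advantages over D
\[ \psi(s, a) = \frac{\exp\!\big(A(s, a)/\eta\big)}
                  {\sum_{(s', a') \in \mathcal{D}} \exp\!\big(A(s', a')/\eta\big)} \]
% Temperature found by minimizing the dual, with KL budget \epsilon_\eta:
\[ \mathcal{L}_{\eta} = \eta\,\epsilon_{\eta}
  + \eta \log\!\left(\frac{1}{|\mathcal{D}|} \sum_{(s, a) \in \mathcal{D}}
      \exp\!\big(A(s, a)/\eta\big)\right) \]
```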
Engineering note: learning improves substantially if only the samples corresponding to the highest 50% of the advantages in each batch are taken.
M-step: Maximization w.r.t. θ
Partial maximization w.r.t. the policy; due to the negative sign it is written as a minimization, giving a weighted maximum-likelihood policy loss!
Assumption: during sample-based computation of the loss, any state-action pairs not in the batch of trajectories have zero weight.
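Written out, the loss being minimized is (paper notation):

```latex
\[ \mathcal{L}_{\pi}(\theta) = -\sum_{(s, a) \in \mathcal{D}}
  \psi(s, a)\, \log \pi(a \mid s, \theta) \]
```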
For the generalized case (again a minimization due to the negative sign), add a hard constraint on the KL between the previous and the new policy.
(Hard) Constrained M-step: the constraint prevents overfitting on the samples, since it decreases the tendency of the policy's entropy to collapse.
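The generalized constrained form, reconstructed (ε_α is V-MPO's KL bound):

```latex
\[ \min_{\theta}\; \mathcal{L}_{\pi}(\theta)
\quad \text{s.t.} \quad
\mathbb{E}_{s \in \mathcal{D}}\!\left[
  \mathrm{KL}\big(\pi_{\mathrm{old}}(\cdot \mid s)\,\|\,\pi(\cdot \mid s, \theta)\big)\right]
  < \epsilon_{\alpha} \]
```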
Experimental Evaluation
Multi-task Control: DMLab-30
Discrete Control: Atari
Continuous Control
Summary
▪ Formulation of the RL optimization problem as an inference problem
▪ Two particular formulations:
  ▪ MPO: off-policy algorithm
  ▪ V-MPO: on-policy algorithm
(Table comparing MPO and V-MPO)
Thank you!
References
▪ Abdolmaleki, A., Springenberg, J. T., Tassa, Y., Munos, R., Heess, N., & Riedmiller, M. (2018). Maximum a Posteriori Policy Optimisation. arXiv:1806.06920.
▪ Song, H. F., Abdolmaleki, A., Springenberg, J. T., et al. (2019). V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control. arXiv:1909.12238.
▪ Munos, R., Stepleton, T., Harutyunyan, A., & Bellemare, M. G. (2016). Safe and Efficient Off-Policy Reinforcement Learning. Advances in Neural Information Processing Systems, pp. 1054–1062.