Safety in Sequential Decision Making

Yinlam Chow, Google DeepMind

January 11, 2019

Page 1:

Safety in Sequential Decision Making

Yinlam Chow

Google DeepMind

Page 2:

Reinforcement Learning (RL)

● Combines machine learning with decision making

● Interaction modeled by a Markov decision process (MDP)

Page 3:

Successful Use Cases of RL [V. Mnih et al., NIPS 13; D. Silver et al., Nature 17]

Page 4:

Some Challenges in Applying RL in the Real World [A. Irpan 18]

● Noisy data

● Training with insufficient data

● Unknown reward functions

● Robust models w.r.t. uncertainty?

● Safety guarantees in RL?

● Safe exploration

Page 5:

Page 6:

Safety Problems in RL

Studied the following safety problems:

1. Safety w.r.t. Baseline

○ Model-free

○ Model-based

○ Apprenticeship learning

2. Safety w.r.t. Environment Constraints

3. Robust and Risk-sensitive RL (skip here)

Page 7:

Page 8:

Safety w.r.t. Baseline

Question: How do we train an RL policy offline that performs no worse than a baseline?

Page 9:

Safety w.r.t. Baseline: Model-free Approach

Problem: Safe off-policy optimization [P. Thomas et al., AAAI 15 & ICML 15]

Recent work: More Robust Doubly Robust Off-policy Evaluation [M. Farajtabar, Y. Chow, M. Ghavamzadeh, ICML 18]
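The doubly robust (DR) estimator that this work makes "more robust" can be sketched in its standard contextual-bandit form: a model-based value estimate plus an importance-weighted correction on the logged actions. The function name and array layout below are illustrative, not taken from the paper:

```python
import numpy as np

def doubly_robust_value(actions, rewards, behavior_probs, target_probs, q_hat):
    """Standard doubly robust off-policy value estimate (contextual bandit).

    actions, rewards, behavior_probs: logged data, shape (n,)
    target_probs, q_hat: target-policy action probabilities and a reward
    model, both shape (n, num_actions).
    """
    n = q_hat.shape[0]
    # Model-based term: expected reward of the target policy under q_hat.
    v_hat = (target_probs * q_hat).sum(axis=1)
    # Importance-weighted correction using the logged (action, reward) pairs;
    # it vanishes wherever the reward model q_hat is exact.
    rho = target_probs[np.arange(n), actions] / behavior_probs
    correction = rho * (rewards - q_hat[np.arange(n), actions])
    return float(np.mean(v_hat + correction))
```

When q_hat is exact the correction term is zero and the estimate is purely model-based; when q_hat is poor, the importance-weighted term keeps the estimate unbiased.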

Page 10:

Safety w.r.t. Baseline: Model-based Approach

Problem: Sim-to-real with transferable guarantees in performance

Recent work: Safe Policy Improvement by Minimizing Robust Regret [M. Ghavamzadeh, M. Petrik, Y. Chow, NIPS 16]

Page 11:

Safety and Apprenticeship Learning [Abbeel ICML04]

Problem: How to imitate an expert while being more risk-averse, without access to a reward function?

Recent work: Risk-Sensitive Generative Adversarial Imitation Learning [J. Lacotte, M. Ghavamzadeh, Y. Chow, M. Pavone, UAI workshop 18, AISTATS 19 (submitted)]

Page 12:

Safety Problems in RL

Studied the following safety problems:

1. Safety w.r.t. Baseline

○ Model-free

○ Model-based

○ Apprenticeship learning

2. Safety w.r.t. Environment Constraints

3. Robust and Risk-sensitive RL (skip here)

Page 13:

The Safe Decision Making Problem

Safety definitions come directly from environment constraints, e.g.,

● Collision avoidance, speed & fuel limits, traffic rules

● System overheating, quality-of-service guarantees

● User satisfaction in online recommendation

Page 14:

A Lyapunov Approach to Safe RL

Collaboration with Ofir Nachum (Brain), Mohammad Ghavamzadeh, Edgar Duenez-Guzman (DMG) -- NIPS 2018, ICLR 2019 (submitted)

Safety w.r.t. Environment Constraints

Page 15:

Overview of Notions of Safety

● Reachability (safety probability):

Safe if agent doesn’t enter an undesirable region (w.h.p.)

Application: Robot motion planning

● Limit Visits to Undesirable States (time spent in dangerous regions):

Safe if agent doesn’t stay in an undesirable region for long

Applications: System (data center) maintenance

NOTE: Constraints are trajectory-based
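Both notions above can be written as trajectory-based constraints on a cumulative constraint cost. A hedged sketch, with assumed symbols (x_t: state at time t, U: undesirable region, δ and d_0: budgets):

```latex
% Reachability: the probability of ever entering U is bounded, which is
% equivalent to a constraint on the expected maximum of an indicator cost.
\text{Reachability:}\quad
  \Pr\big(\exists\, t \le T : x_t \in U\big) \le \delta
  \;\Longleftrightarrow\;
  \mathbb{E}\Big[\max_{t \le T} \mathbf{1}\{x_t \in U\}\Big] \le \delta

% Limited visits: the expected total time spent in U is bounded.
\text{Limited visits:}\quad
  \mathbb{E}\Big[\sum_{t=0}^{T} \mathbf{1}\{x_t \in U\}\Big] \le d_0
```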

Page 16:
Page 17:

General Problem Formulation

Modeled by Constrained Markov Decision Process (CMDP)

● Reward: primary return performance

● Constraint cost: model safety constraints

Goals:

1. Find an optimal (feasible) RL agent

2. More restrictive: Guarantee safety during training
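A minimal sketch of the CMDP objective implied above, with the usual symbols assumed (r: reward, d: constraint cost, γ: discount, d_0: constraint budget):

```latex
% CMDP: maximize expected return subject to a bound on the expected
% cumulative constraint cost.
\max_{\pi}\;
  \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(x_t, a_t)\Big]
\quad\text{s.t.}\quad
  \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} d(x_t, a_t)\Big] \le d_0
```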

Page 18:
Page 19:

Some Prior Art and Limitations

● Prior Art:

● Contributions of the Lyapunov-based method:

○ Safety during training

○ Scalable model-free RL (on-policy/off-policy); less conservative policies

Page 20:

Lyapunov Function and Safety
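As a rough sketch of the idea (notation assumed here, not copied from the slides): a Lyapunov function L is a state-wise upper bound on future constraint cost under a feasible baseline policy π_B, and it induces a set of policies that are guaranteed to respect the budget:

```latex
% L dominates the one-step constraint cost plus its own expected next
% value under the baseline, and respects the budget at the start state.
L(x) \;\ge\; d(x) +
  \mathbb{E}_{a\sim\pi_B,\; x'\sim P(\cdot\mid x,a)}\big[L(x')\big],
\qquad L(x_0) \le d_0

% The L-induced set: any policy satisfying the same inequality keeps the
% cumulative constraint cost within the budget.
\mathcal{F}_{L}(x) =
  \Big\{\pi \;:\; d(x) +
    \mathbb{E}_{a\sim\pi,\; x'}\big[L(x')\big] \le L(x)\Big\}
```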

Page 21:

Safe Policy Iteration

Page 22:

Environment with Discrete Actions

lava

agent

goal

agent actions (with ε-error)

● Stochastic 2D GridWorld

● Stage-wise cost is 1 for fuel usage

● Goal reward is 1000

● Incurs a constraint cost of 1 in lava; tolerate at most 5 touches

● Experiments: (i) Planning, (ii) RL (explicit position or image obs.)
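A hypothetical minimal sketch of the environment just described; the grid size, lava layout, and class/method names are assumptions, and the fuel cost is folded into the reward as -1 per step:

```python
import numpy as np

class LavaGridWorld:
    """Stochastic 2D GridWorld: -1 reward per step (fuel), +1000 on reaching
    the goal, constraint cost 1 per step in lava (budget: 5 touches)."""

    def __init__(self, size=8, lava=frozenset({(3, 3), (3, 4)}),
                 goal=(7, 7), eps=0.1, budget=5, seed=0):
        self.size, self.lava, self.goal = size, lava, goal
        self.eps, self.budget = eps, budget  # eps: action-error probability
        self.rng = np.random.default_rng(seed)
        self.moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
        self.reset()

    def reset(self):
        self.pos, self.lava_touches = (0, 0), 0
        return self.pos

    def step(self, action):
        # With probability eps the intended action is replaced by a random one.
        if self.rng.random() < self.eps:
            action = int(self.rng.integers(4))
        dr, dc = self.moves[action]
        r = min(max(self.pos[0] + dr, 0), self.size - 1)
        c = min(max(self.pos[1] + dc, 0), self.size - 1)
        self.pos = (r, c)
        reward = -1.0  # stage-wise fuel cost
        done = self.pos == self.goal
        if done:
            reward += 1000.0
        cost = 1.0 if self.pos in self.lava else 0.0  # constraint cost
        self.lava_touches += int(cost)
        return self.pos, reward, cost, done
```

A CMDP solver would then constrain the expected sum of `cost` per episode to at most `budget`, rather than mixing it into the reward.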

Page 23:

Safe Planning with Discrete Actions

Lyapunov-based (Our Method) Dual-LP (Exact But Expensive)

Page 24:

Safe Planning with Discrete Actions

Lagrangian-based (Baseline) Unconstrained (Baseline)

Page 25:
Page 26:

From Safe Policy/Value Iteration (SPI/SVI) to Safe Value-based RL

Page 27:

Safe DQN/DPI with Discrete Actions

Baseline: Unsafe during training!

Unconstrained: Always unsafe!

Our Methods: Balance performance and safety

Page 28:

Safe Policy Gradient (Recent Extension)

To handle safety in RL (policy gradient) with continuous actions:

1. Constrained optimization w.r.t. policy parameter (CPO)

2. Embed Lyapunov constraint into policy network via network augmentation (see [Dalal et al. 2018] for simpler setting)
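Point 2 above, in the simpler single-constraint setting of [Dalal et al. 2018], amounts to a closed-form projection layer appended to the policy network; a hedged sketch (the function name and the linearized constraint `c + g·a ≤ budget` are assumptions):

```python
import numpy as np

def safety_layer(action, g, c, budget):
    """Project a proposed action onto a single linearized safety constraint.

    Finds the smallest perturbation of `action` such that
    c + g @ a' <= budget, where g is the (learned) gradient of the
    constraint cost w.r.t. the action. This QP has a closed-form solution.
    """
    violation = c + g @ action - budget
    if violation <= 0:
        return action  # already safe: pass through unchanged
    # KKT conditions of the QP give a correction along g.
    lam = violation / (g @ g)
    return action - lam * g
```

Because the projection is differentiable almost everywhere, it can be trained end-to-end as the final layer of the policy network.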

Page 29:

Safe Policy Gradient

Page 30:

Safe RL for Continuous Control

Objective: Train a MuJoCo HalfCheetah to run stably.

Standard method: DDPG/PPO

Issue: Unstable if it runs too fast!

Remedy: Soft constraints on torque at the joints

Safe if the total torque violation is bounded (d0 = 50)
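The Lagrangian baseline compared on the next slides can be sketched as plain dual ascent on the multiplier; the function name and learning rate below are illustrative:

```python
def dual_ascent_update(lmbda, episode_cost, d0=50.0, lr=0.01):
    """One dual-ascent step on the Lagrange multiplier.

    lambda grows when the observed episode constraint cost exceeds the
    budget d0, and shrinks (clipped at 0) when the constraint is slack.
    """
    return max(0.0, lmbda + lr * (episode_cost - d0))
```

In Lagrangian-PPO the per-step reward is then shaped as r - lambda * c before the usual PPO update, so the primal and dual variables are optimized in alternation.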

Page 31:

Lagrangian-PPO versus Lyapunov-PPO

Lagrangian PPO (Baseline) Lyapunov PPO (Our Method)

Page 32:

Page 33:

Conclusion

● Contributions:

a. Formulated safety problems as CMDPs

b. Proposed a Lyapunov-based safe RL methodology

■ Works well in on-policy/off-policy settings

■ Applicable to value-based and policy-based algorithms

c. Guarantees* safety during training

● Limitations:

a. Provably optimal only under restricted conditions!

b. Theoretically justified in MDPs, but not with function approximation (RL)!

■ Future work in the model-based setting

Page 34:

Current Work

In talks with the following projects at Google & DeepMind:

● PRM-RL for indoor robot navigation (Brain Robotics)

● FineTuner/TF.agents (Brain RL/rSWE)

● DrData (DeepMind)

Page 35:

Acknowledgements

● General:

○ Mohammad Ghavamzadeh (FAIR)

● Safety w.r.t. Baseline:

○ Mehrdad Farajtabar (DMG)

○ Marek Petrik (UNH)

○ Jonathan Lacotte, Marco Pavone (Stanford)

● Safety w.r.t. Environment Constraint:

○ Ofir Nachum (Brain)

○ Edgar Guzman Duenez (DMG)

○ Aleksandra Faust, Jasmine Hsu, Vikas Sindhwani (Brain Robotics)

Page 36:

Appendix of Lyapunov Work

Page 37:
Page 38:
Page 39:
Page 40:

Highlights of Other Recent Work

Page 41:

1. Robust and Controllable Representation Learning

Page 42:
Page 43:
Page 44:
Page 45:

2. Risk-Sensitive Imitation Learning

Page 46:
Page 47:
Page 48:
Page 49:

3. Minimizing Robust Regret in RL (Sim-to-Real)

Page 50:
Page 51:
Page 52:
Page 53:

4. More Robust Doubly Robust Off-policy Evaluation

Page 54:
Page 55:
Page 56: