
Page 1

Reinforcement Learning for the People and/or by the People

Emma Brunskill

Stanford University

NIPS 2017 Tutorial


Page 2

Amazing Reinforcement Learning Progress

Page 3

Page 4

Overview

● RL introduction
● RL for people
● RL by the people

Page 5

Audience

• If you are:
  – Interested in a quick overview of RL (section 1)
  – Want to learn about the RL technical challenges involved in people-facing applications (section 2)
  – Want to learn about how people can help RL systems (section 3)


Page 6

*Caveats

• Not trying to cover all the domain-specific methods for tackling these questions
• Focus will be on the RL setting, and new technical challenges for RL with people and using people
• Always delighted to learn about new areas/key references I might have missed -- email me at [email protected]


Page 7

Reinforcement Learning for People
[Diagram: the agent gives a student a math exercise (action) and observes the student's answer (observation)]

Page 8

Reinforcement Learning for People
[Diagram: the agent gives a student a math exercise (action), observes the student's answer (observation), and is rewarded if the student passes the exam (reward)]

Page 9

Reinforcement Learning for People
[Diagram: the agent suggests a product (action) to a customer, observes whether they buy the product (observation), and receives revenue (reward)]
Policy: Prior Recommendations & Purchases → Product Ad
Goal: Choose actions to maximize expected revenue

Page 10

Reinforcement Learning
[Diagram: the agent takes an action and receives an observation and a reward]
Policy: Map Observations → Actions
Goal: Choose actions to maximize expected rewards

Page 11

Overview

● RL introduction
● RL for people
● RL by the people

Page 12

RL Basics

● MDP
● POMDP
● Planning
● 3 views
  ○ Model-based
  ○ Model-free
  ○ Policy search
● Exploration vs exploitation -- how do we get data?

Page 13

Markov Decision Process (MDP)

• Set of states S
• Set of actions A
• Stochastic transition/dynamics model T(s,a,s′)
  – Probability of reaching s′ after taking action a in state s
• Reward model R(s,a) (or R(s) or R(s,a,s′))
• Maybe a discount factor γ or horizon H

Page 14

Markov Decision Process (MDP)

• Set of states S
• Set of actions A
• Stochastic transition/dynamics model T(s,a,s′)
  – Probability of reaching s′ after taking action a in state s
• Reward model R(s,a) (or R(s) or R(s,a,s′))
• Maybe a discount factor γ or horizon H
• Policy π: s → a
• Optimal policy is one with the highest expected discounted sum of future rewards

Page 15

Partially Observable MDP

• Set of states S
• Set of actions A
• Set of observations Z
• Stochastic transition/dynamics model T(s,a,s′)
  – Probability of reaching s′ after taking action a in state s
• Reward model R(s,a) (or R(s) or R(s,a,s′))
• Observation model P(z|s′,a)
• Policy π: history (z,a,z′,a′,...) → a
• Optimal policy is one with the highest expected discounted sum of future rewards

Page 16

Model-Based Policy Evaluation

• Given an MDP and a policy π: s → a
• How good is this policy?
• Want to compute the expected sum of discounted rewards if we follow it (starting from some initial states)
• Could do Monte Carlo rollouts and average
• But if the domain is Markov, can do better...

Page 17

Model-Based Policy Evaluation with Dynamic Programming

• Given an MDP and a policy π: s → a
• Set V(s) = 0 for all s
• Iteratively update the value:
  V(s) ← R(s) + γ Σs′ T(s, π(s), s′) V(s′)
  (immediate reward + discounted sum of future rewards)
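This update is easy to run as code. A minimal sketch, assuming a small tabular MDP stored as NumPy arrays T[s, a, s'] and R[s] (the array layout and function name are illustrative, not from the slides):

```python
import numpy as np

def evaluate_policy(T, R, pi, gamma=0.95, iters=500):
    """Iterative policy evaluation for a tabular MDP.

    T:  |S| x |A| x |S| array, T[s, a, s'] = P(s' | s, a)
    R:  length-|S| reward vector R(s)
    pi: length-|S| integer array, pi[s] = action chosen in state s
    """
    V = np.zeros(len(R))                     # V(s) = 0 for all s
    for _ in range(iters):
        for s in range(len(R)):
            # V(s) <- R(s) + gamma * sum_s' T(s, pi(s), s') V(s')
            V[s] = R[s] + gamma * T[s, pi[s]] @ V
    return V
```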

Page 18

Planning & Bellman Equation

• Given an MDP (includes dynamics and reward model)
• Compute the optimal policy
• Bellman optimality equations:
  – V(s) = R(s) + γ Σs′ T(s, π*(s), s′) V(s′)
  – Q(s,a) = R(s) + γ Σs′ T(s, a, s′) V(s′)
  – π*(s) = argmaxa [R(s) + γ Σs′ T(s,a,s′) V(s′)] = argmaxa Q(s,a)
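Turning the Bellman optimality equation into an algorithm gives value iteration; a minimal sketch using the same hypothetical tabular arrays as the previous snippet:

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, iters=500):
    """Value iteration for a tabular MDP with state rewards R(s)."""
    V = np.zeros(T.shape[0])
    for _ in range(iters):
        # Q(s, a) = R(s) + gamma * sum_s' T(s, a, s') V(s')
        Q = R[:, None] + gamma * T @ V
        V = Q.max(axis=1)                 # V(s) = max_a Q(s, a)
    pi_star = Q.argmax(axis=1)            # pi*(s) = argmax_a Q(s, a)
    return V, pi_star
```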

Page 19

Reinforcement Learning

• Typically still assume we are in an MDP
• Know the set of states S
• Know the set of actions A
• Maybe a discount factor γ or horizon H
• Don’t know the dynamics and/or reward model!
• Still want to take actions to yield high reward

Page 20

RL: 3 Common Approaches

Image from David Silver

Page 21

Model-Based RL

• Typically still assume we are in an MDP
• As we interact in the world, observe states, actions and rewards
• Use machine learning to estimate a model of dynamics and/or rewards using this experience
• Now have (one or more) estimated MDP models
• Now can do planning to compute an optimal policy
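A minimal sketch of the estimation step, assuming experience has been logged as (s, a, r, s') tuples over small discrete state and action spaces; the planner could be the value-iteration sketch shown earlier (all names are illustrative):

```python
import numpy as np

def estimate_mdp(transitions, num_states, num_actions):
    """Maximum-likelihood estimates of T(s,a,s') and R(s,a) from logged experience."""
    counts = np.zeros((num_states, num_actions, num_states))
    reward_sum = np.zeros((num_states, num_actions))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r
    visits = counts.sum(axis=2)
    # Unvisited (s, a) pairs fall back to a uniform next-state distribution.
    T_hat = np.where(visits[:, :, None] > 0,
                     counts / np.maximum(visits, 1)[:, :, None],
                     1.0 / num_states)
    R_hat = reward_sum / np.maximum(visits, 1)
    return T_hat, R_hat
```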

Page 22

Model Free: Q-Learning

• Initialize Q(s,a) for all (s,a) pairs

Page 23

Model Free: Q-Learning

• Initialize Q(s,a) for all (s,a) pairs
• On observing transition <st, at, rt, st+1>
  – Calculate temporal difference error
    δ(st, at) = rt + γ maxa′ Q(st+1, a′) − Q(st, at)
  – Difference between what was observed and the current estimate of long-term expected reward (Q)
  – Uses the Markov property and bootstraps (didn’t observe reward till end of episode, so use Q(st+1, a′) as a proxy)

Page 24

Model Free: Q-Learning

• Initialize Q(s,a) for all (s,a) pairs
• On observing transition <st, at, rt, st+1>
  – Calculate TD-error: δ(st, at) = rt + γ maxa′ Q(st+1, a′) − Q(st, at)
  – Use it to update the estimate of Q(st, at):
    Q(st, at) ← Q(st, at) + α δ(st, at)
    (equivalently, (1−α) Q(st, at) + α [rt + γ maxa′ Q(st+1, a′)])

• Slowly moves estimate toward observed experience

Page 25

Model Free: Q-Learning

• Initialize Q(s,a) for all (s,a) pairs
• On observing transition <st, at, rt, st+1>
  – Calculate TD-error: δ(st, at) = rt + γ maxa′ Q(st+1, a′) − Q(st, at)
  – Use it to update the estimate of Q(st, at):
    Q(st, at) ← Q(st, at) + α δ(st, at)
• Computationally cheap
• But only propagates experience one step
  – Replay can help fix this
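The update on these slides fits in a few lines; a minimal sketch, assuming Q is a NumPy array indexed by (state, action):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One tabular Q-learning update from a transition <s_t, a_t, r_t, s_{t+1}>."""
    # TD error: delta = r_t + gamma * max_a' Q(s_{t+1}, a') - Q(s_t, a_t)
    delta = r + gamma * Q[s_next].max() - Q[s, a]
    # Slowly move the estimate toward the observed experience.
    Q[s, a] += alpha * delta
    return Q
```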

Page 26

Policy Search

• Directly search π space for argmaxπ Vπ

• Parameterize policy and do stochastic gradient descent
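One simple instance of "parameterize the policy and do stochastic gradient descent" is the REINFORCE gradient estimate for a tabular softmax policy; a minimal sketch with no baseline and no discounting (names are illustrative, not the method of any particular slide):

```python
import numpy as np

def softmax_policy(theta, s):
    """Action probabilities for a tabular softmax policy with parameters theta[s, a]."""
    prefs = theta[s] - theta[s].max()
    p = np.exp(prefs)
    return p / p.sum()

def reinforce_update(theta, trajectory, alpha=0.01):
    """One gradient step from a single trajectory of (state, action, reward) triples."""
    G = sum(r for _, _, r in trajectory)      # return of the whole trajectory
    for s, a, _ in trajectory:
        p = softmax_policy(theta, s)
        grad_log_pi = -p
        grad_log_pi[a] += 1.0                 # d log pi(a|s) / d theta[s, :]
        theta[s] += alpha * G * grad_log_pi   # step along the estimated gradient
    return theta
```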

Page 27

RL: 3 Common Approaches

Image from David Silver

Page 28

Exploration vs Exploitation

● In online reinforcement learning, learn about the world through acting
● Trade off between
  ○ Learning more about how the world works
  ○ Using that knowledge to maximize reward
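The simplest way this trade-off shows up in practice is an ε-greedy choice between exploring and exploiting; a minimal sketch (ε and the tabular Q array are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(Q, s, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit current Q."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore
    return int(Q[s].argmax())                  # exploit
```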

Page 29

Overview

● RL introduction
● RL for people
● RL by the people

Page 30

Reinforcement Learning Progress

Page 31

Page 32

Cheap to try things, or simulate
vs. high stakes, hard to model

Page 33

Technical Challenges in RL for People

1. Sample efficient learning
2. What do we want to optimize?
3. Batch (purely offline) reinforcement learning

Page 34

Sample Efficiency Through Transfer

[Spectrum: tasks range from all different to all the same]

Page 35

Assume Finite Set of Models

Page 36

Assume Finite Set of Groups: Multi-Task Reinforcement Learning
[Diagram: MDP Y (TY, RY), MDP R (TR, RR), MDP G (TG, RG)]
Sample a task from a finite set of MDPs (shared S & A space)

Page 37

Multi-Task Reinforcement Learning
[Diagram: MDP Y (TY, RY), MDP R (TR, RR), MDP G (TG, RG)]
Series of tasks; act in each task for H steps

Page 38

Multi-Task Reinforcement Learning
[Diagram: MDP Y (TY, RY), MDP R (TR, RR), MDP G (TG, RG)]

Page 39

Multi-Task Reinforcement Learning
[Diagram: MDP Y (T=?, R=?), MDP R (T=?, R=?), MDP G (T=?, R=?)]
• Captures a number of settings of interest
• Our primary contributions have been showing we can provably speed learning (Brunskill and Li UAI 2013; Brunskill and Li ICML 2014; Guo and Brunskill AAAI 2015)
• Limitations: focused on discrete state and action, impractical bounds, optimizing for average performance

Page 40

Assume Related Meta-Groups

Wilson, A., Fern, A., Ray, S., & Tadepalli, P. (2007, June). Multi-task reinforcement learning: a hierarchical Bayesian approach. In Proceedings of the 24th international conference on Machine learning (pp. 1015-1022). ACM.

Page 41

Wilson, A., Fern, A., Ray, S., & Tadepalli, P. (2007, June). Multi-task reinforcement learning: a hierarchical Bayesian approach. In Proceedings of the 24th international conference on Machine learning (pp. 1015-1022). ACM.

Page 42

Wilson, A., Fern, A., Ray, S., & Tadepalli, P. (2007, June). Multi-task reinforcement learning: a hierarchical Bayesian approach. In Proceedings of the 24th international conference on Machine learning (pp. 1015-1022). ACM.

Encouraging empirically
No guarantees

Scalability?

Page 43

Hidden Parameter MDP

● Allow for smooth linear parameterization of dynamics model

Doshi-Velez, F., & Konidaris, G. (2016, July). Hidden parameter Markov decision processes: A semiparametric regression approach for discovering latent task parametrizations. In IJCAI: proceedings of the conference(Vol. 2016, p. 1432).

Page 44

Hidden Parameter MDP++

● Use Bayesian Neural Nets for dynamics
● Benefits for an HIV treatment simulation
● Each episode is a new patient

TW Killian, G Konidaris, F Doshi-Velez. Robust and Efficient Transfer Learning with Hidden Parameter Markov Decision Processes. NIPS 2017.

Page 45

Sample Efficient Online RL: Policy Search

• Policy search is a hugely popular technique in RL
  – Deep Reinforcement Learning through Policy Optimization, Schulman and Abbeel NIPS 2016 tutorial: https://people.eecs.berkeley.edu/~pabbeel/nips-tutorial-policy-optimization-Schulman-Abbeel.pdf
• Not always associated with data efficiency
• But can be used to tune a small set of parameters efficiently
Thomas and Brunskill, ICML 2016

Page 46

Policy Search as Function Optimization

Figure from Wilson et al. JMLR 2014

π = f(θ) defines the policy class
[Plot: performance of the policy as a function of θ]

Page 47

Policy Search as Function Optimization: Gradient Approaches
• Use structure of sequential decision making
• Only find local optima
π = f(θ) defines the policy class
[Plot: performance of the policy as a function of θ]
Figure from Wilson et al. JMLR 2014

Page 48

Policy Search as Bayesian Optimization: Treats Each Evaluation as Costly
π = f(θ) defines the policy class
[Plot: performance of the policy as a function of θ]
• Explicitly represents uncertainty, finds global optima
• Generally ignores sequential structure
Figure from Wilson et al. JMLR 2014

Page 49

Policy Search as Bayesian Optimization

● Data efficient policy search

● Can leverage Markovian structure (e.g. Wilson et al. 2014)

● Doesn’t have to make assumptions about world model

● Can combine with off policy evaluation to further speed up learning (in terms of amount of data required)

Goel, Dann and Brunskill IJCAI 2017

Page 50

Policy Search for Optimal Stopping Problems

● Can leverage full trajectories to retrospectively evaluate alternate optimal stopping policies

● Substantial increase in data efficiency over generic bounds (Ng and Jordan UAI 2000)

● But Gaussian Processes struggle in high dimensions

Goel, Dann and Brunskill IJCAI 2017

Page 51

Assuming Finite Set of Policies to Speed Learning in a New Task

● Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement learning agent. In Proceedings of the fifth international joint conference on Autonomous agents and multiagent systems (pp. 720-727). ACM.

● Talvitie, E., & Singh, S. P. (2007, January). An Experts Algorithm for Transfer Learning. In IJCAI (pp. 1065-1070).

● Azar, M. G., Lazaric, A., & Brunskill, E. (2013, September). Regret bounds for reinforcement learning with policy advice. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 97-112).

● Focus on the discrete state and action space setting

Page 52

Multi-Task Policy Learning

Ammar, H. B., Eaton, E., Luna, J. M., & Ruvolo, P. (2015). Autonomous cross-domain knowledge transfer in lifelong policy gradient reinforcement learning. In IJCAI.

Page 53

Multi-Task Policy Learning

Goel, Dann and Brunskill IJCAI 2017

● Set of policies with shared basis set of parameters
● Can be used to do cross domain transfer (different state & actions)

Page 54

Multi-Task Policy Learning with Shared Policy and Feature Parameters

● Descriptor features and policy features
● Primarily tackling continuous control tasks
● Scaling to enormous state spaces and performance for a single trajectory is less clear

Isele, D., Rostami, M., & Eaton, E. (2016, July). Using Task Features for Zero-Shot Knowledge Transfer in Lifelong Learning. In IJCAI (pp. 1620-1626).

Page 55

Deep RL Transfer

● Deep reinforcement learning to find good shared representation (Finn, Abbeel, Levine ICML 2017)

● Fast transfer by encouraging shared representation learning across tasks

Page 56

Direct Policy Search

• Most of these are trying to speed online learning in a new task

• Not making guarantees on performance for a single task

Page 57

(Some) Recent Sample Efficient RL References

Brunskill, E., & Li, L. Sample Complexity of Multi-task Reinforcement Learning. UAI 2013.

Wilson, A., Fern, A., Ray, S., & Tadepalli, P. (2007, June). Multi-task reinforcement learning: a hierarchical Bayesian approach. In Proceedings of the 24th international conference on Machine learning (pp. 1015-1022). ACM.

Doshi-Velez, F., & Konidaris, G. (2016, July). Hidden parameter Markov decision processes: A semiparametric regression approach for discovering latent task parametrizations. In IJCAI: proceedings of the conference(Vol. 2016, p. 1432). NIH Public Access.

Robust and Efficient Transfer Learning with Hidden Parameter Markov Decision Processes.TW Killian, G Konidaris, F Doshi-Velez. NIPS 2017

Goel, Dann, & Brunskill. Sample Efficient Policy Search for Optimal Stopping Domains. IJCAI 2017.

Guo, Zhaohan, and Emma Brunskill. "Concurrent PAC RL." AAAI 2015.

Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., & Schapire, R. E. (2017). Contextual Decision Processes with Low Bellman Rank are PAC-Learnable. ICML.

Silver, D., Newnham, L., Barker, D., Weller, S., & McFall, J. (2013, February). Concurrent reinforcement learning from customer interactions. In International Conference on Machine Learning (pp. 924-932)

Carlos Diuk, Lihong Li, and Bethany R. Leffler. The adaptive k-meteorologists problem and its application to structure learning and feature selection in reinforcement learning. In ICML, 2009.

Kirill Dyagilev, Shie Mannor, and Nahum Shimkin. Efficient reinforcement learning in parameterized models: Discrete parameter case. In European Workshop on Reinforcement Learning, 2008

Liu, Y., Guo, Z., & Brunskill, E. PAC Continuous State Online Multitask Reinforcement Learning with Identification. AAMAS 2016.

Osband, I., Russo, D., & Van Roy, B. (2013). (More) efficient reinforcement learning via posterior sampling. NIPS

Osband, I., & Van Roy, B. (2014). Near-optimal Regret Bounds for Reinforcement Learning in Factored MDPs. NIPS.

Zhou, L., & Brunskill, E. Latent Contextual Bandits and Their Application to Personalized Recommendations for New Users. IJCAI 2016.

Ammar, H. B., Eaton, E., Luna, J. M., & Ruvolo, P. (2015, July). Autonomous cross-domain knowledge transfer in lifelong policy gradient reinforcement learning. In International Joint Conference on Artificial Intelligence (pp. 3345-3351)

Isele, D., Rostami, M., & Eaton, E. (2016, July). Using Task Features for Zero-Shot Knowledge Transfer in Lifelong Learning. In IJCAI (pp. 1620-1626).

Finn, C., Abbeel, P., & Levine, S. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML 2017.


Page 58

Beyond Expectation

• If interacting with people, sometimes may care not just about expected performance, averaged across many rounds

Page 59

Beyond Expectation

• If interacting with people, sometimes may care not just about expected performance, averaged across many rounds

• Work on risk-sensitive policies includes
  – García, J., & Fernández, F. (2015). A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1), 1437-1480.
  – Castro, D. D., Tamar, A., & Mannor, S. (2012). Policy gradients with variance related risk criteria. In Proceedings of the 29th International Conference on Machine Learning (ICML-12) (pp. 935-942).
  – Doshi-Velez, F., Pineau, J., & Roy, N. (2012). Reinforcement learning with limited reinforcement: Using Bayes risk for active learning in POMDPs. Artificial Intelligence, 187, 115-132.
  – Prashanth, L. A., & Ghavamzadeh, M. (2013). Actor-critic algorithms for risk-sensitive MDPs. In Advances in Neural Information Processing Systems (pp. 252-260).
  – Chow, Y., & Ghavamzadeh, M. (2014). Algorithms for CVaR optimization in MDPs. In Advances in Neural Information Processing Systems (pp. 3509-3517).
  – Delage, E., & Mannor, S. (2007, June). Percentile optimization in uncertain Markov decision processes with application to efficient exploration. In Proceedings of the 24th International Conference on Machine Learning (pp. 225-232). ACM.

Page 60

Beyond Expectation

• If interacting with people, sometimes may care not just about expected performance, averaged across many rounds

• Work on risk-sensitive policies includes
  – García, J., & Fernández, F. (2015). A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1), 1437-1480.
  – Castro, D. D., Tamar, A., & Mannor, S. (2012). Policy gradients with variance related risk criteria. In Proceedings of the 29th International Conference on Machine Learning (ICML-12) (pp. 935-942).
  – Doshi-Velez, F., Pineau, J., & Roy, N. (2012). Reinforcement learning with limited reinforcement: Using Bayes risk for active learning in POMDPs. Artificial Intelligence, 187, 115-132.
  – Prashanth, L. A., & Ghavamzadeh, M. (2013). Actor-critic algorithms for risk-sensitive MDPs. In Advances in Neural Information Processing Systems (pp. 252-260).
  – Chow, Y., & Ghavamzadeh, M. (2014). Algorithms for CVaR optimization in MDPs. In Advances in Neural Information Processing Systems (pp. 3509-3517).
  – Delage, E., & Mannor, S. (2007, June). Percentile optimization in uncertain Markov decision processes with application to efficient exploration. In Proceedings of the 24th International Conference on Machine Learning (pp. 225-232). ACM.
• Work on safe exploration includes
  – Geramifard, A. (2012). Practical reinforcement learning using representation learning and safe exploration for large scale Markov decision processes (Doctoral dissertation, Massachusetts Institute of Technology).
  – Moldovan, T. M., & Abbeel, P. (2012). Safe exploration in Markov decision processes. arXiv preprint arXiv:1205.481
  – Turchetta, M., Berkenkamp, F., & Krause, A. (2016). Safe exploration in finite Markov decision processes with Gaussian processes. In Advances in Neural Information Processing Systems (pp. 4312-4320).
  – Gillula, J. H., & Tomlin, C. J. (2012, May). Guaranteed safe online learning via reachability: tracking a ground target using a quadrotor. In Robotics and Automation (ICRA), 2012 IEEE International Conference on (pp. 2723-2730). IEEE.
• Many make assumptions on structural regularities of the world model

Page 61

Limited prior data vs. often lots of prior data!

Page 62

Limited prior data vs. often lots of prior data!

Batch (Entirely Offline) RL

Page 63

A Classrooms Avg Score: 95

Page 64

A Classrooms Avg Score: 95

B Classrooms Avg Score: 92

Page 65

A Classrooms Avg Score: 95

B Classrooms Avg Score: 92

What should we do for a new student?

Page 66

Comes Up in Many Domains: e.g. Equipment Maintenance Scheduling

Page 67

Comes Up in Many Domains: e.g. Patient Treatment Ordering

Page 68

B Classrooms Avg Score: ????


A Classrooms Avg Score: 95

B Classrooms Avg Score: 92

Challenge: Counterfactual Reasoning

Page 69

B Classrooms Avg Score: ????

A Classrooms Avg Score: 95

B Classrooms Avg Score: 92

Challenge: Generalization to Untried Policies

Page 70

Counterfactual Estimation

● Active area of research in many fields
● Today focus on this in the context of reinforcement learning (sequential decision making)
● A few pointers to some other work and settings:

○ Saria and Schulam. ML Foundations and Methods for Precision Medicine and Healthcare. NIPS 2016 Tutorial.

○ T. Joachims, A. Swaminathan, Tutorial on Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement, ACM Conference on Research and Development in Information Retrieval (SIGIR), 2016.

○ Wager, S., & Athey, S. (2017). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, (to appear)

○ Austin, P. C. (2011). An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate behavioral research, 46(3), 399-424.

○ T. Schnabel, A. Swaminathan, A. Singh, N. Chandak, T. Joachims, Recommendations as Treatments: Debiasing Learning and Evaluation, International Conference on Machine Learning (ICML), 2016.

○ Rubin, D. B. (2005). Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469), 322-331.

Page 71

Batch Data Policy Evaluation

Data on decisions & outcomes

Page 72

Batch Data Policy Evaluation

Data on decisions & outcomes

Policy 1

Policy 3

Policy N

Policy 2

Page 73

Batch Data Policy Evaluation

Data on decisions & outcomes

... ...

Policy 1 Estimated Performance 1

Policy 3 Estimated Performance 3

Policy N Estimated Performance N

Policy 2 Estimated Performance 2

Policy Evaluation

Page 74

Batch Data Policy Selection

Data on decisions & outcomes

... ...

Policy i

Policy Selection

Policy 1 Estimated Performance 1

Policy 3 Estimated Performance 3

Policy N Estimated Performance N

Policy 2 Estimated Performance 2

Policy Evaluation

Page 75

Batch RL for People

• May want to do only policy evaluation
• May want asymptotic guarantees on policy evaluation estimators
  – Valid confidence intervals
  – Consistency

Page 76

Batch RL for People

• May want to do only policy evaluation
• May want asymptotic guarantees on policy evaluation estimators
  – Valid confidence intervals
  – Consistency
• May want measure of confidence in values: confidence intervals
• Measure of confidence in policy
• Empirically good performance
• Robustness to assumptions in the estimator

Page 77

Batch RL for People

• May want to do only policy evaluation
• May want asymptotic guarantees on policy evaluation estimators
  – Valid confidence intervals
  – Consistency
• May want measure of confidence in values: confidence intervals
• Measure of confidence in policy
• Empirically good performance
• Robustness to assumptions in the estimator
• Lots of great work on doing off-policy policy learning
  – Here assume we only get access to a fixed batch of data (no more learning) and care about accuracy of the result
• ...

Page 78

Policy: Player state → level
Goal: Maximize engagement
Old data: ~11,000 students

Page 79

Build Predictive Model
Data on decisions & outcomes → predictive statistical model of player behavior

Page 80

Use Models as a Simulator
[Diagram: the predictive statistical model of player behavior returns observations and rewards for the agent's actions]
Compute Policy that Optimizes Expected Rewards for this Model

Page 81

Problem: Model May Not be Accurate… Yields Poor Estimate of Policy Performance
[Diagram: the predictive statistical model of player behavior used as a simulator]
Compute Policy that Optimizes Expected Rewards for this Model

Page 82

Worse: More Accurate Models Can Yield Even Poorer Performing Policies?

Mandel, Liu, Brunskill and Popovic AAMAS 2014

Page 83

Compute Best Policy for Model
Data on decisions & outcomes → predictive statistical model of player behavior
● Much prior work: if model good, policy is good

Page 84

Using Estimators that Rely on Model Class Being Correct Can Fail
Data on decisions & outcomes → predictive statistical model of player behavior → compute best policy for model
● Much prior work: if model good, policy is good
● Challenge: model class may be wrong
  ○ May model as Markov but it is not Markov
  ○ If we compute a policy assuming the model is Markov, the resulting value estimate is not valid

Page 85

Using Estimators that Rely on Model Class Being Correct Can Fail
Data on decisions & outcomes → predictive statistical model of player behavior → compute best policy for model
● Much prior work: if model good, policy is good
● Challenge: model class may be wrong
● How can we identify if the model type is wrong?
  ○ Sometimes feasible, sometimes hard
● Relates to
  ○ Sim2Real problem
  ○ Adversarial examples

Page 86

Prior Work: Estimate Model from Data

• Strengths
  – Low variance estimator of policy performance
• Weaknesses
  – Not an unbiased estimator (model may be poor!)
  – Not a consistent estimator
Build & Estimate a Model
• Define states and actions
• Dynamics p(s′|s,a)
• Observation model p(z|s,a)
• Rewards r(s,a,z)
Historical Data: History 1, R1 = Σi ri1; History 2, R2 = Σi ri2; …

Page 87

Key Challenge: Distribution Mismatch

• Rewards r(s,a,z)
• Policy maps history (a,z,r,a′,z′,…) → a
Behavior Policy → Histories → E[Σi ri]
New Policy → Histories → E[Σi ri]?

Page 88

Importance Sampling (IS) (e.g. Precup et al. 2002, Mandel et al. 2014)
Historical Data: History 1, R1 = Σi ri1; History 2, R2 = Σi ri2; …
Estimate of Behavior Policy πb Performance → Estimate of Evaluation Policy πe Performance

Page 89

Importance Sampling (IS) (e.g. Precup et al. 2002, Mandel et al. 2014)
• Strengths
  – Unbiased estimator of policy performance
  – Strongly consistent estimator of policy performance
• Weaknesses
  – High variance estimator*
*Many extensions, including Retrace (NIPS 2016), but most focused on the online off-policy setting
Historical Data: History 1, R1 = Σi ri1; History 2, R2 = Σi ri2; …
Estimated Evaluation Policy πe performance
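A minimal sketch of the basic trajectory-wise IS estimator, assuming each logged history stores the behavior policy's probability for the action it took (the data layout and names are my assumptions; per-decision and weighted variants reduce variance further):

```python
import numpy as np

def is_estimate(histories, pi_e):
    """Trajectory-wise importance sampling estimate of the evaluation policy's value.

    histories: list of trajectories, each a list of (s, a, p_behavior, r) tuples,
               where p_behavior = pi_b(a | s) under the policy that collected the data.
    pi_e:      function pi_e(a, s) giving the evaluation policy's probability of a in s.
    """
    per_traj = []
    for traj in histories:
        weight, ret = 1.0, 0.0
        for s, a, p_b, r in traj:
            weight *= pi_e(a, s) / p_b   # cumulative importance weight
            ret += r                     # return R = sum_i r_i
        per_traj.append(weight * ret)
    return float(np.mean(per_traj))      # unbiased, but can have high variance
```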

Page 90

We used off-policy evaluation to find a policy with 30% higher

engagement (Mandel et al. AAMAS 2014)

Page 91

Thomas, Theocharous, & Ghavamzadeh AAAI 2015 (slide from Thomas)

High Confidence Off-Policy Policy Evaluation

Page 92

High Confidence Off-Policy Policy Evaluation

Thomas, Theocharous, & Ghavamzadeh AAAI 2015 (slide modified from Thomas)

Page 93

High Confidence Off-Policy Policy Evaluation

Thomas, Theocharous, & Ghavamzadeh AAAI 2015

Page 94

High Confidence Off-Policy Policy Evaluation

~2 million trajectories

Thomas, Theocharous, & Ghavamzadeh AAAI 2015

Page 95

High Confidence Off-Policy Policy Improvement

• Use approximate confidence intervals

• Policy evaluation interleaved with running policy for high confidence iterative improvement

• Domain horizon (10) still quite small

Thomas, Theocharous, & Ghavamzadeh ICML 2015

Page 96

Two Extremes of Offline Reinforcement Learning

ML model + plan: + data efficient, - biased
Importance sampling: - data intensive, + unbiased

Image: https://upload.wikimedia.org/wikipedia/commons/f/f0/DARPA_Big_Data.jpg

Page 97

Doubly Robust Estimation

ML model + plan (+ data efficient, - biased)  +  importance sampling (- data intensive, + unbiased)
Image: https://upload.wikimedia.org/wikipedia/commons/f/f0/DARPA_Big_Data.jpg

Page 98

Doubly Robust (DR) Estimation

• Model + IS-based estimator
• Bandits (Dudik et al. 2011)
(Equation annotations: reward received; model of reward)

Page 99

Doubly Robust (DR) Estimation

• Model + IS-based estimator
• Bandits (Dudik et al. 2011)
(Equation annotations: reward received; model of reward)

Page 100

Doubly Robust (DR) Estimation

• Model + IS-based estimator
• Bandits (Dudik et al. 2011)
(Equation annotations: reward received; model of reward)
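For the bandit case, the Dudik et al. (2011) estimator uses the reward model everywhere and applies an importance-weighted correction only on the action that was actually logged. A minimal sketch with illustrative names:

```python
import numpy as np

def dr_bandit_estimate(logged_data, pi_e, r_hat, action_set):
    """Doubly robust estimate of the evaluation policy's value from bandit logs.

    logged_data: list of (x, a, p_behavior, r) tuples from the logging policy.
    pi_e:        pi_e(a, x), evaluation policy's probability of action a in context x.
    r_hat:       r_hat(x, a), a learned model of the expected reward.
    """
    vals = []
    for x, a, p_b, r in logged_data:
        # Model term: expected reward of pi_e according to the reward model.
        model_term = sum(pi_e(b, x) * r_hat(x, b) for b in action_set)
        # IS correction on the logged action: vanishes if the reward model is exact.
        correction = (pi_e(a, x) / p_b) * (r - r_hat(x, a))
        vals.append(model_term + correction)
    return float(np.mean(vals))
```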

Page 101

Doubly Robust Estimation for RL

• Jiang and Li (ICML 2016) extended DR to RL
• Limitation: estimator derived is unbiased
(Equation terms: model-based estimate of Q, actual rewards in the dataset, importance weights, model-based estimate of V)
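A minimal sketch of the sequential DR recursion, written from the standard statement of the Jiang and Li (2016) estimator (variable names are mine; q_hat and v_hat are the model-based value estimates referred to above):

```python
import numpy as np

def dr_trajectory(traj, pi_e, q_hat, v_hat, gamma=1.0):
    """Doubly robust value estimate for a single trajectory.

    traj:  list of (s, a, p_behavior, r) tuples in time order.
    pi_e:  pi_e(a, s), evaluation policy probability.
    q_hat: model-based estimate of Q(s, a) under pi_e.
    v_hat: model-based estimate of V(s) under pi_e.
    """
    dr = 0.0
    # Backward recursion: DR_t = V_hat(s_t) + rho_t * (r_t + gamma * DR_{t+1} - Q_hat(s_t, a_t))
    for s, a, p_b, r in reversed(traj):
        rho = pi_e(a, s) / p_b
        dr = v_hat(s) + rho * (r + gamma * dr - q_hat(s, a))
    return dr

def dr_estimate(histories, pi_e, q_hat, v_hat, gamma=1.0):
    """Average the per-trajectory DR estimates over the batch."""
    return float(np.mean([dr_trajectory(t, pi_e, q_hat, v_hat, gamma) for t in histories]))
```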

Page 102

Tight Estimate Often Better than Unbiased: Measure with Mean Squared Error

Thomas and Brunskill, ICML 2016

• Trade bias and variance
[Diagram: model-based estimator (bias) + importance sampling estimator (variance)]

Page 103

Blend IS-Based & Model Based Estimators to Directly Min Mean Squared Error

[Diagram: partial estimators x1, x2, …, xN, ranging from the 1-step estimate to 2-step to N-step, trading bias against variance]

Thomas and Brunskill, ICML 2016

Page 104

Model and Guided Importance Sampling combining (MAGIC) Estimator

Estimated policy value using particular weighting of model estimate and

importance sampling estimate

Thomas and Brunskill, ICML 2016

• Solve quadratic program

• Strongly consistent*

Page 105

MAGIC Can Yield Orders of Magnitude Better Estimates of Policy Performance

[Plot (log scale!): comparison of IS-based, Model, DR, MAGIC, and MAGIC-B estimators as the number of histories grows]

Thomas and Brunskill, ICML 2016

Page 106

MAGICal Policy Search

Thomas and Brunskill, EWRL 2016

Page 107


Variance of Importance Sampling Can Grow With Exponentially with Time Horizon

Page 108

Trading Bias & Variance for Long Horizon Off Policy Policy Evaluation

● Importance sampling based estimator
● Dropping weights (covariance) reduces variance at cost of bias
● Strongly consistent
(Equation terms: weights for first part of trajectory, weights for 2nd part of trajectory)
Guo, Thomas and Brunskill, NIPS 2017

Page 109

Trading Bias & Variance for Long Horizon Off Policy Policy Evaluation
Promising for some forms of evaluation in Atari

But subtle: depends on particular form of rewards and policies

*Note: updated since original tutorial given a bug found in original plots

Guo, Thomas and Brunskill, NIPS workshop 2017

Page 110

High Confidence Bounds for Weighted Doubly Robust Estimation
● High confidence bounds for weighted doubly robust off-policy policy estimation (Hanna, Stone, Niekum AAMAS 2017)

Page 111

Fairness of Importance Sampling for Policy Selection

• Though IS policy estimates are unbiased, policy selection using them can be unfair

• Define unfair as choosing the wrong policy as having higher performance > 50% of the time
• With finite data, can bias towards certain policies
  – more myopic / shorter trajectories
[Plot: estimated value of Policy 1 vs. Policy 2]

Doroudi, Thomas and Brunskill, UAI 2017, Best Paper

Page 112

Re-thinking Models: Robust Matrix Evaluation

• When data is very limited, and horizon is long

• IS estimators still too high variance

• Train multiple models & simulate policies on each

• Can use for minimax policy selection

Doroudi, Aleven and Brunskill, Learning at Scale, 2017

Model \ Policy | Policy 1 | Policy 2
Model 1        | 8        | 1
Model 2        | 10       | 27

Page 113

Re-thinking Models

• Which models to include?

• What are good models with limited data?

• Bayesian Neural Networks seem promising

– Applications to stochastic dynamical systems (e.g. Depeweg, Hernández-Lobato, Doshi-Velez, & Udluft, ICLR 2017)

Page 114

Overview

● RL introduction
● RL for people
● RL by the people

Page 115

RL By the People

● Reward specification
● Demonstrations
● Rewarding
● Advice / labeling / critiquing
● Teaching
● Shaping the space

*Also see Section 4 of great tutorial by Taylor, Kamar and Hayes: http://interactiveml.net/

Page 116

Human Specifies Reward

● Is the true reward the best reward for the agent to learn from? E.g.
  ○ Singh, S., Lewis, R. L., & Barto, A. G. Where do rewards come from. CogSci 2009.
● Given a (computationally) bounded agent, it may not be
● Human may not write down the reward they really want
● May want constraints on behavior (Thomas, Castro da Silva, Barto, Brunskill arXiv 2017)

Page 117

Inverse Reward Design
Hadfield-Menell et al., NIPS 2017
● Reward is an observation of true desired reward function
● Compute Bayesian posterior over true reward
● Compute risk-averse policy
● Avoids new bad things
● Also avoids new good things

Image credit: http://www.infocuriosity.com/king-midas-and-his-golden-touch-an-ancient-greek-mythology-for-children/

Page 118

Human Specifies Reward

● May be hard for people to specify reward
● Can use as partial specification / shaping
● Do RL on top of specified reward

Page 119

Human Provides Demonstrations

● Imitation / IRL / learning from demonstration / apprenticeship learning

● Enormously influential, especially in robotics
● Recent tutorial at ICRA: http://lasa.epfl.ch/tutorialICRA16/
● Key idea: the human provides demonstrations and the agent uses these to learn the task
● Here assume we get access to an initial set of trajectories & then have no more interactions with the person
● Goal is to learn a good policy

Page 120

Behavioral Cloning

● Key challenge: data is not iid

Figure from Ross, S., & Bagnell, J. A. (2012). Agnostic system identification for model-based reinforcement learning. ICML.
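Behavioral cloning itself is plain supervised learning on the demonstrated (state, action) pairs; a minimal sketch using scikit-learn (the choice of a logistic-regression classifier is arbitrary and only for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def behavioral_cloning(demo_states, demo_actions):
    """Fit a classifier that imitates the demonstrator's action choices."""
    policy = LogisticRegression(max_iter=1000)
    policy.fit(np.asarray(demo_states), np.asarray(demo_actions))
    return policy

# Deployment: act with policy.predict(state.reshape(1, -1)).
# The non-i.i.d. problem: the learner's own mistakes take it to states the
# demonstrator never visited, where this classifier was never trained.
```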

Page 121

Human Provides Demonstrations: Inverse Reinforcement Learning

● Learn a reward function from human demonstrations
● Develop a policy given that reward function
  ○ Often involves assuming access to a dynamics model or the ability to try new policies online
  ○ IRL is ill specified: a 0 reward is always sufficient
  ○ Try to match state features
  ○ Max entropy approach has been very influential
  ○ Meta-learning
  ○ Learning from different demonstrations
  ○ Doing RL on top can be useful
  ○ A number of recent efforts to combine with deep learning

Page 122

Human Provides Demonstrations: Guided Cost Learning

● Learn a reward function from human demonstrations
● Develop a policy given that reward function
  ○ Often involves assuming access to a dynamics model or the ability to try new policies online
  ○ IRL is ill specified: a 0 reward is always sufficient
  ○ Try to match state features
  ○ Max entropy approach has been very influential
  ○ Meta-learning
  ○ Learning from different demonstrations
  ○ Doing RL on top can be useful

Finn, C., Levine, S., & Abbeel, P. (2016, June). Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning (pp. 49-58).

Page 123

Human Provides Demonstrations: Generative Adversarial Imitation Learning

Ho, J., & Ermon, S. (2016). Generative adversarial imitation learning. In Advances in Neural Information Processing Systems (pp. 4565-4573).

[Plot: number of trajectories on the x-axis, performance on the y-axis]

Page 124

Some Benefits & Limitations to Learning from Demonstrations

● Need access to experts doing the task. Just using these can be limited, so often combined with online RL, which is not always feasible
● In some cases expensive/slow to gather data (teach a student across a year? provide customer recommendations across months)
● May not help if the solution we need is radically far away -- still have to handle the exploration/exploitation trade-off online afterwards
● Still a question of how to capture expertise (e.g. what state features to use)

● But generally very promising

○ Demonstrations can be prior data collected for other purposes

○ Great for automating, scaling up, & fine tuning good solutions

Page 125

Humans Providing Online Rewards

● Sophie’s Kitchen
● Human trainer can award a scalar reward signal r ∈ [−1, 1]
"Teachable robots: Understanding human teaching behavior to build more effective robot learners." Artificial Intelligence Journal, 172:716-737, 2008.

Page 126

Humans Providing Online Rewards

W. Bradley Knox and Peter Stone. Combining Manual Feedback with Subsequent MDP Reward Signals for Reinforcement Learning. In Proceedings of the Ninth International Conference on Autonomous Agents and Multiagent Systems. May 2010. Best student paper

● TAMER framework
● Uses reward feedback from the human to help shape the agent’s reward
● Can be used in addition to other reward signals from the domain
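A minimal sketch of the idea behind TAMER+RL: learn a regression model of the trainer's scalar feedback and add its prediction to the environment reward the agent optimizes. The linear model, learning rate, and function names below are illustrative assumptions, not the authors' implementation.

    # Sketch (in the spirit of Knox & Stone's TAMER+RL): learn a model of human
    # feedback H(s, a) and use it to augment the environment reward.
    import numpy as np

    class HumanFeedbackModel:
        """Online linear regression of scalar trainer feedback onto state-action features."""
        def __init__(self, n_features, lr=0.05):
            self.w = np.zeros(n_features)
            self.lr = lr

        def update(self, features, feedback):
            # Move the prediction toward the trainer's signal (e.g. in [-1, 1]).
            self.w += self.lr * (feedback - self.w @ features) * features

        def predict(self, features):
            return float(self.w @ features)

    def shaped_reward(env_reward, feedback_model, features, weight=1.0):
        """Environment reward plus (weighted) predicted human reinforcement."""
        return env_reward + weight * feedback_model.predict(features)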

Page 127:

Human Provides Advice / Labels / Feedback

● Human will continue to provide advice about policies over time

Ross, S., Gordon, G. J., & Bagnell, D. (2011). A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics (pp. 627-635).
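The DAgger reduction in Ross et al. (2011) is easy to sketch: roll out the current learner, ask the expert to label every visited state, aggregate the labels, and retrain. The gym-style env interface, expert_label function, and classifier factory below are illustrative assumptions; the real algorithm also mixes in the expert's actions with a decaying probability.

    # Sketch of DAgger (Dataset Aggregation), after Ross, Gordon & Bagnell (2011).
    def dagger(env, expert_label, make_classifier, iterations=10, steps_per_iter=200):
        states, labels = [], []                     # aggregated (state, expert action) data
        policy = None                               # before any training, roll out the expert
        for _ in range(iterations):
            state = env.reset()
            for _ in range(steps_per_iter):
                states.append(state)
                labels.append(expert_label(state))  # query the expert on every visited state
                if policy is None:
                    action = expert_label(state)
                else:
                    action = policy.predict([state])[0]
                state, _, done, _ = env.step(action)
                if done:
                    state = env.reset()
            policy = make_classifier().fit(states, labels)  # retrain on all data so far
        return policy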

Page 128:

Human Provides Advice

Griffith, S., Subramanian, K., Scholz, J., Isbell, C., & Thomaz, A. L. (2013). Policy shaping: Integrating human feedback with reinforcement learning. In Advances in Neural Information Processing Systems (pp. 2625-2633).

● Advise: a Bayesian approach for policy shaping
● The agent may get a “right”/“wrong” label after performing an action
● C: feedback is consistent with the optimal policy with probability 0 < C < 1
● Update a posterior over the policy from the human feedback
● Combine the policy learned from the agent’s own experience with the policy estimated from the human
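Written out (from memory of Griffith et al. 2013, so treat it as a sketch), the Advise estimate that an action is optimal depends only on delta, the number of "right" minus "wrong" labels the human has given for that state-action pair, and on the assumed feedback consistency C:

    def prob_action_optimal(delta, C):
        """Advise-style estimate that an action is optimal, given delta = (#right - #wrong)
        human labels for this state-action pair and feedback consistency C in (0, 1)."""
        return C ** delta / (C ** delta + (1 - C) ** delta)

The slide's last bullet, combining this with the policy learned from the agent's own experience, is done by multiplying the two action distributions.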

Page 129:

Discussion: Human Provides Advice / Labels / Feedback

● Performance gains are often significant
● Challenge: it is expensive to have someone in the loop, and often not feasible (the number of examples of advice can be in the hundreds)
● Maybe in some cases (a household robot, asking a supervisor for help with a tricky case) it could be realistic and easier than an expert providing demonstrations
● Might be some interesting merged cases:
○ Learning from demonstrators, where the human can finish the demonstration once the agent gets stuck?
● Could also potentially be good for unknown unknowns:
○ A human may help the agent realize there is a problem/limitation even if the agent would not have on its own

Page 130:

Humans Teaching Agents

● How do people try to teach?
● Evidence that (at least for animal training) people may expect their reward feedback to be interpreted as an action label
○ This differs from a learner trying to maximize its long-term reward
○ Ho, M. K., Littman, M. L., Cushman, F., & Austerweil, J. L. (2015). Teaching with rewards and punishments: Reinforcement or communication? In CogSci.

Page 131:

How do People Teach

● People can perform a task differently when trying to show it to someone versus doing the task at a very high level of performance

○ Ho, M. K., Littman, M., MacGlashan, J., Cushman, F., & Austerweil, J. L. (2016). Showing versus doing: Teaching by demonstration. In Advances in Neural Information Processing Systems (pp. 3027-3035)

Page 132:

How do People Teach

● People pay attention to the learner’s current policy: whether human trainers give positive or negative feedback for a decision depends on the learner’s current policy, which in turn affects which learning algorithm works best
○ MacGlashan, J., Ho, M. K., Loftin, R., Peng, B., Wang, G., Roberts, D. L., Taylor, M. E., & Littman, M. L. (2017). Interactive learning from policy-dependent human feedback. In Proceedings of the 34th International Conference on Machine Learning.

Page 133:

Cooperative Inverse RL
Hadfield-Menell, D., Russell, S. J., Abbeel, P., & Dragan, A. (2016). Cooperative inverse reinforcement learning. In NIPS.

● A game in which the human and the agent receive rewards determined by the same reward function.
● If the learner has different capabilities from the teacher, the teacher should teach in a way that helps the learning agent learn its own optimal policy (which may differ from how the teacher would perform the task herself)
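A hedged, informal statement of the CIRL setup (paraphrasing Hadfield-Menell et al., 2016, not their exact notation): the human H and robot R act in the same environment and are scored by the same reward, but only the human observes the reward parameters theta, so the robot must infer theta from the human's behavior while helping maximize the shared return:

    \max_{\pi^H,\ \pi^R}\;\; \mathbb{E}\Big[\sum_{t} \gamma^{t}\, R\big(s_t, a^H_t, a^R_t;\ \theta\big)\Big],
    \qquad \text{with } \theta \text{ observed by } H \text{ only.}

Because the payoff is shared, the human's best strategy can be pedagogic (acting in a way that reveals theta) rather than simply optimal for the task, which is the point of the second bullet above.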

Page 134:

How Should We Teach Agents

● Machine teaching: the optimal way to teach an agent given knowledge of the agent’s algorithm
○ Zhu, X. (2015). Machine teaching: An inverse problem to machine learning and an approach toward optimal education. In AAAI (pp. 4083-4087).
● How to teach a sequential decision-making algorithm, e.g.:
○ Cakmak, M., & Lopes, M. (2012). Algorithmic and human teaching of sequential decision tasks. In AAAI.
○ Walsh, T. J., & Goschin, S. (2012). Dynamic teaching in sequential decision making environments. In UAI.
○ Peng, B., MacGlashan, J., Loftin, R., Littman, M. L., Roberts, D. L., & Taylor, M. E. (2017). Curriculum design for machine learners in sequential decision tasks. In AAMAS.

Page 135:

Overview

● Reward specification
● Demonstrations
● Rewarding
● Advice / labeling / critiquing
● Teaching
● Shaping the space

*Also see Section 4 of great tutorial by Taylor, Kamar and Hayes: http://interactiveml.net/

Page 136:

Histogram Tutor


Page 137:

Continually Improving Tutoring System

[Diagram: tutoring loop in which the system chooses an action (problem), observes correct/wrong, and gives a post test at the end]

Page 138:

Improving Across Many Students

[Diagram: the tutoring system chooses actions for many students in parallel]

Page 139:

Over Time Tutoring System Stopped Giving Some Problems to Students


Page 140:

System Self-Diagnosed that Problems Weren’t Helping Student Learning

Page 141:

Rules of the World Are Not Fixed

Page 142:

Humans are Invention Machines

New actions New sensors

Page 143:

Human in the Loop Reinforcement Learning

[Diagram: agent-environment loop with action, observation, and reward]

Goal: Choose actions to maximize expected rewards

Page 144:

Human in the Loop Reinforcement Learning

[Diagram: agent-environment loop with action, observation, and reward; a human can add a new action]

Goal: Choose actions to maximize expected rewards

Page 145:

Direct Human Effort for Adding New Actions

[Diagram: agent-environment loop with action, observation, and reward; the human adds a new action]

Goal: Choose actions to maximize expected rewards

Mandel, Liu, Brunskill & Popovic, AAAI 2017

Page 146:

Expected Local Improvement

[Diagram: the expected local improvement at state s combines the probability that the human supplies a new action a_h for state s with the improvement in value (outcomes) at state s if a_h is added]

Mandel, Liu, Brunskill & Popovic, AAAI 2017
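One way to write the quantity sketched on this slide (a paraphrase of the expected-local-improvement idea in Mandel et al., AAAI 2017; the notation is ours, not theirs): for each state s, weight the probability that the human would supply a new action a_h there by the gain in value at s if that action were added, then direct human effort where the expectation is largest:

    \mathrm{ELI}(s) \;=\; \sum_{a_h} P(a_h \mid s)\,\Big[\, V^{+a_h}(s) \;-\; V(s) \,\Big]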

Page 147:

Mostly Bad Human Input

Mandel, Liu, Brunskill & Popovic, AAAI 2017

Page 148:

Page 149:

Ask people to add new hints where they might help

Page 150:

Is There a Latent Feature That, If the System Knew It, Would Let It Make Better Decisions?

[Diagram: agent-environment loop with action, observation, and reward; the human adds a new feature]

Ongoing work with Ramtin Keramati

Page 151:

Ongoing work with Ramtin Keramati

Ask for a Feature If There Exists a Latent-Variable Model of a State That Could Change the Policy

Page 152:

Ongoing work with Ramtin Keramati

Ask for a Feature If There Exists a Latent-Variable Model of a State That Could Change the Policy

Page 153:

Ongoing work with Ramtin Keramati

Learn a Model That Supports Making Good Decisions, Not a Perfect Model

Page 154:

Directing Human Experts to Change Actions & Observation Space

[Diagram: agent-environment loop with action, observation, and reward; the agent issues requests to a human expert]

Ongoing work with Ramtin Keramati

Page 155:

Another Way to Have Humans Teach Agents

[Diagram: agent-environment loop with action, observation, and reward; the agent issues requests to a human]

Ongoing work with Ramtin Keramati

Page 156:

Ask people to add new hints where they might help

But the system isn’t improving much… what’s going on?

Page 157:

Humans Teaching Agents: Questions

● How do we teach humans to teach agents?
○ People may not have a good model of (a) the agent/learner, or (b) access to content (actions) that can change that agent/learner’s state
○ They may need guidance about how best to help the agent/learner
○ Just like people have to learn to be effective human teachers…
○ In our group: a new project on teaching the teachers...
● Lots of open directions here
● Involving teachers is still expensive relative to leveraging prior trajectories of demonstrations or instructions
● Can we make better algorithms to leverage past teaching demonstrations (YouTube has many lecture videos) rather than expert demonstrations? COACH (on the prior slide, MacGlashan et al.) is one step in this direction

Page 158:

RL with help from People

● Reward specification
● Demonstrations
● Rewarding
● Advice / labeling / critiquing
● Teaching
● Shaping the space

*Also see Section 4 of great tutorial by Taylor, Kamar and Hayes: http://interactiveml.net/

Page 159:

Overview

● RL introduction
● RL for people
● RL by the people

● Lots of applications that could benefit from RL
● Lots of ways people can help make RL systems better
● Interested in discussing a postdoc opportunity? Email me at [email protected]

Page 160:

Thanks to [collaborators pictured on slide] and Karan Goel, Travis Mandel, Yun-En Liu, Ramtin Keramati, NSF, ONR, Microsoft, Google, Yahoo & IES