Reinforcement Learning for the People and/or by the People
Emma Brunskill
Stanford University
NIPS 2017 Tutorial
Amazing Reinforcement Learning Progress
≠
Overview
● RL introduction
● RL for people
● RL by the people
Audience
• If you are:
– interested in a quick overview of RL: see section 1
– want to learn about the RL technical challenges involved in people-facing applications: see section 2
– want to learn about how people can help RL systems: see section 3
Caveats
• Not trying to cover all the domain-specific methods for tackling these questions
• Focus will be on the RL setting, and on the new technical challenges for RL with people and using people
• Always delighted to learn about new areas or key references I might have missed: email me at [email protected]
Reinforcement Learning for People
[Diagram: tutor gives a math exercise (action) to a student; student returns an answer (observation)]
Reinforcement Learning for People
[Diagram: tutor gives a math exercise (action) to a student; student returns an answer (observation); passing the exam is the reward]
Reinforcement Learning for People
[Diagram: system suggests a product (action) to a customer; customer buys the product (observation); revenue is the reward]
Policy: Prior Recommendations & Purchases → Product Ad
Goal: Choose actions to maximize expected revenue
Reinforcement Learning
[Diagram: agent takes an action, receives an observation and a reward]
Policy: Map Observations → Actions
Goal: Choose actions to maximize expected rewards
Overview
● RL introduction
● RL for people
● RL by the people
RL Basics
● MDP
● POMDP
● Planning
● 3 views
○ Model
○ Model-free
○ Policy search
● Exploration vs. exploitation: how do we get data?
Markov Decision Process (MDP)
• Set of states S
• Set of actions A
• Stochastic transition/dynamics model T(s,a,s′)
– Probability of reaching s′ after taking action a in state s
• Reward model R(s,a) (or R(s) or R(s,a,s′))
• Maybe a discount factor γ or horizon H
Markov Decision Process (MDP)
• Set of states S
• Set of actions A
• Stochastic transition/dynamics model T(s,a,s′)
– Probability of reaching s′ after taking action a in state s
• Reward model R(s,a) (or R(s) or R(s,a,s′))
• Maybe a discount factor γ or horizon H
• Policy π: s → a
• Optimal policy is the one with the highest expected discounted sum of future rewards
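As a concrete instance of the tuple above, a tiny MDP can be written down directly as tables. This is a minimal sketch; the state and action names ("study"/"rest", "work"/"play") and all the probabilities are invented for illustration:

```python
import random

# Hypothetical two-state, two-action MDP (S, A, T, R, gamma).
S = ["study", "rest"]
A = ["work", "play"]

# T[(s, a)] -> {s': probability}; each row must sum to 1.
T = {
    ("study", "work"): {"study": 0.8, "rest": 0.2},
    ("study", "play"): {"study": 0.2, "rest": 0.8},
    ("rest", "work"):  {"study": 0.6, "rest": 0.4},
    ("rest", "play"):  {"study": 0.1, "rest": 0.9},
}

# R(s, a): immediate reward for taking action a in state s.
R = {("study", "work"): 1.0, ("study", "play"): 0.0,
     ("rest", "work"): 0.5, ("rest", "play"): 0.0}

gamma = 0.9  # discount factor

def sample_next_state(s, a, rng=random):
    """Draw s' ~ T(s, a, .) by inverting the cumulative distribution."""
    u, cum = rng.random(), 0.0
    for s_next, p in T[(s, a)].items():
        cum += p
        if u < cum:
            return s_next
    return s_next  # guard against floating-point slack
```

A policy here is just a map from each state to an action, e.g. `{"study": "work", "rest": "work"}`.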
Partially Observable MDP
• Set of states S
• Set of actions A
• Set of observations Z
• Stochastic transition/dynamics model T(s,a,s′)
– Probability of reaching s′ after taking action a in state s
• Reward model R(s,a) (or R(s) or R(s,a,s′))
• Observation model P(z|s′,a)
• Policy π: history (z,a,z′,a′,...) → a
• Optimal policy is the one with the highest expected discounted sum of future rewards
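Since a POMDP policy depends on the history only through the belief state, a sketch of the Bayes-filter belief update may help: b′(s′) ∝ P(z|s′,a) Σ_s T(s,a,s′) b(s). The two-state dynamics and observation models below are hypothetical, invented for illustration:

```python
# Hypothetical two-state POMDP fragment for one action ("treat").
S = ["healthy", "sick"]

T = {("healthy", "treat"): {"healthy": 0.9, "sick": 0.1},
     ("sick", "treat"):    {"healthy": 0.6, "sick": 0.4}}
# O[(a, s')] -> {z: P(z | s', a)}
O = {("treat", "healthy"): {"no_cough": 0.8, "cough": 0.2},
     ("treat", "sick"):    {"no_cough": 0.3, "cough": 0.7}}

def belief_update(b, a, z):
    """Bayes filter: push belief through the dynamics, weight by the
    observation likelihood, then renormalize."""
    new_b = {}
    for s_next in S:
        pred = sum(T[(s, a)][s_next] * b[s] for s in S)  # prediction step
        new_b[s_next] = O[(a, s_next)][z] * pred         # correction step
    norm = sum(new_b.values())
    return {s: p / norm for s, p in new_b.items()}

b0 = {"healthy": 0.5, "sick": 0.5}
b1 = belief_update(b0, "treat", "cough")  # observing a cough raises P(sick)
```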
Model-Based Policy Evaluation
• Given an MDP and a policy π: s → a
• How good is this policy?
• Want to compute the expected sum of discounted rewards if we follow it (starting from some initial states)
• Could do Monte Carlo rollouts and average
• But if the domain is Markov, we can do better...
Model-Based Policy Evaluation with Dynamic Programming
• Given an MDP and a policy π: s → a
• Set V(s) = 0 for all s
• Iteratively update the value:
V(s) ← R(s) + γ Σ_s′ T(s, π(s), s′) V(s′)
where R(s) is the immediate reward and γ Σ_s′ T(s, π(s), s′) V(s′) is the discounted sum of future rewards
Planning & Bellman Equation
• Given an MDP (includes dynamics and reward model)
• Compute the optimal policy
• Bellman optimality equation:
– V(s) = R(s) + γ max_a Σ_s′ T(s,a,s′) V(s′)
– Q(s,a) = R(s) + γ Σ_s′ T(s,a,s′) V(s′)
– π*(s) = argmax_a [R(s) + γ Σ_s′ T(s,a,s′) V(s′)] = argmax_a Q(s,a)
Reinforcement Learning
• Typically still assume the world is an MDP
• Know the set of states S
• Know the set of actions A
• Maybe a discount factor γ or horizon H
• Don’t know the dynamics and/or reward model!
• Still want to take actions to yield high reward
RL: 3 Common Approaches
Image from David Silver
Model-Based RL
• Typically still assume the world is an MDP
• As we interact in the world, observe states, actions and rewards
• Use machine learning to estimate a model of the dynamics and/or rewards from this experience
• Now have (one or more) estimated MDP models
• Now can do planning to compute an optimal policy
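The estimation step of this recipe can be sketched with transition counts: count observed (s, a, r, s′) tuples, then normalize into maximum-likelihood estimates T̂ and R̂, which would then be handed to any planner (e.g. value iteration). The logged experience below is hypothetical:

```python
from collections import defaultdict

# Hypothetical logged experience: (s, a, r, s') tuples observed while acting.
experience = [
    ("s0", "a0", 0.0, "s1"), ("s0", "a0", 0.0, "s1"),
    ("s0", "a0", 0.0, "s0"), ("s1", "a0", 1.0, "s1"),
]

counts = defaultdict(lambda: defaultdict(int))  # counts[(s,a)][s'] = n
reward_sum = defaultdict(float)
visits = defaultdict(int)
for s, a, r, s_next in experience:
    counts[(s, a)][s_next] += 1
    reward_sum[(s, a)] += r
    visits[(s, a)] += 1

# Maximum-likelihood estimates of the dynamics and reward models.
T_hat = {sa: {s2: n / visits[sa] for s2, n in row.items()}
         for sa, row in counts.items()}
R_hat = {sa: reward_sum[sa] / visits[sa] for sa in visits}
```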
Model-Free: Q-Learning
• Initialize Q(s,a) for all (s,a) pairs
Model-Free: Q-Learning
• Initialize Q(s,a) for all (s,a) pairs
• On observing transition <s_t, a_t, r_t, s_t+1>:
– Calculate the temporal difference error:
δ(s_t, a_t) = r_t + γ max_a′ Q(s_t+1, a′) − Q(s_t, a_t)
– The difference between what was observed and the current estimate of long-term expected reward (Q)
– Uses the Markov property and bootstraps (we didn’t observe the reward until the end of the episode, so use Q(s_t+1, a′) as a proxy)
Model-Free: Q-Learning
• Initialize Q(s,a) for all (s,a) pairs
• On observing transition <s_t, a_t, r_t, s_t+1>:
– Calculate the TD error:
δ(s_t, a_t) = r_t + γ max_a′ Q(s_t+1, a′) − Q(s_t, a_t)
– Use it to update the estimate of Q(s_t, a_t):
Q(s_t, a_t) ← Q(s_t, a_t) + α δ(s_t, a_t)
(equivalently, (1−α) Q(s_t, a_t) + α (r_t + γ max_a′ Q(s_t+1, a′)))
• Slowly moves the estimate toward observed experience
Model-Free: Q-Learning
• Initialize Q(s,a) for all (s,a) pairs
• On observing transition <s_t, a_t, r_t, s_t+1>:
– Calculate the TD error:
δ(s_t, a_t) = r_t + γ max_a′ Q(s_t+1, a′) − Q(s_t, a_t)
– Use it to update the estimate of Q(s_t, a_t):
Q(s_t, a_t) ← Q(s_t, a_t) + α δ(s_t, a_t)
• Computationally cheap
• But only propagates experience one step
– Experience replay can help fix this
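The update rule on these slides can be sketched as a tabular learner; the state/action names and the single transition below are hypothetical:

```python
from collections import defaultdict

# Tabular Q-learning update:
#   delta = r + gamma * max_a' Q(s', a') - Q(s, a);  Q(s, a) += alpha * delta
actions = ["left", "right"]
gamma, alpha = 0.9, 0.5
Q = defaultdict(float)  # Q[(s, a)], initialized to 0

def q_update(s, a, r, s_next):
    """Apply one Q-learning step and return the TD error."""
    td_error = r + gamma * max(Q[(s_next, a2)] for a2 in actions) - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error

# One observed transition with reward 1: Q moves partway toward the target.
delta = q_update("s0", "right", 1.0, "s1")
```

With all Q-values initialized to 0, this single step gives δ = 1.0 and Q(s0, right) = 0.5: the estimate moves a fraction α of the way toward the observed return.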
Policy Search
• Directly search the policy space for argmax_π V^π
• Parameterize the policy and do stochastic gradient descent
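A minimal sketch of this idea: a softmax policy over a hypothetical two-armed bandit, updated by stochastic gradient ascent with the REINFORCE estimator (∇_θ log π_θ(a) · r). The payout probabilities, step size, and iteration count are all invented:

```python
import math, random

random.seed(0)
payout = {0: 0.2, 1: 0.8}  # true success probabilities (unknown to the learner)
theta = [0.0, 0.0]         # one policy parameter per arm
lr = 0.1                   # step size

def pi(theta):
    """Softmax policy: action probabilities from parameters theta."""
    z = [math.exp(t) for t in theta]
    s = sum(z)
    return [p / s for p in z]

for _ in range(2000):
    probs = pi(theta)
    a = 0 if random.random() < probs[0] else 1          # sample an action
    r = 1.0 if random.random() < payout[a] else 0.0     # observe reward
    # REINFORCE: grad of log softmax is indicator(i == a) - probs[i]
    for i in range(2):
        theta[i] += lr * ((1.0 if i == a else 0.0) - probs[i]) * r

best_arm = max(range(2), key=lambda i: pi(theta)[i])
```

After training, the policy concentrates on the higher-paying arm; adding a baseline (e.g. a running average of rewards) would reduce the variance of the gradient estimate.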
RL: 3 Common Approaches
Image from David Silver
Exploration vs Exploitation
● In online reinforcement learning, we learn about the world through acting
● Trade-off between:
○ learning more about how the world works
○ using that knowledge to maximize reward
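One standard (though not the only) way to manage this trade-off is ε-greedy action selection: explore with a random action with probability ε, otherwise exploit the current Q estimates. The Q-table below is hypothetical:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """With probability epsilon explore uniformly; otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.choice(actions)                    # explore
    return max(actions, key=lambda a: Q[(state, a)])  # exploit

# Hypothetical Q-values for one state.
Q = {("s0", "up"): 1.0, ("s0", "down"): 0.2}
greedy = epsilon_greedy(Q, "s0", ["up", "down"], epsilon=0.0)  # pure exploit
```

In practice ε is often decayed over time, so the agent explores early and exploits once its estimates are trustworthy.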
Overview
● RL introduction
● RL for people
● RL by the people
Reinforcement Learning Progress
≠
≠
Cheap to try things, or simulate
High stakes; hard to model
Technical Challenges in RL for People
1. Sample-efficient learning
2. What do we want to optimize?
3. Batch (purely offline) reinforcement learning
Sample Efficiency Through Transfer
all different ↔ all the same
Assume Finite Set of Models
Assume Finite Set of Groups: Multi-Task Reinforcement Learning
MDP Y: T_Y, R_Y
MDP R: T_R, R_R
MDP G: T_G, R_G
Sample a task from a finite set of MDPs (shared S & A space)
Multi-Task Reinforcement Learning
Series of tasks; act in each task for H steps
MDP Y: T_Y, R_Y
MDP R: T_R, R_R
MDP G: T_G, R_G
…
Multi-Task Reinforcement Learning
…
MDP Y: T=?, R=?
MDP R: T=?, R=?
MDP G: T=?, R=?
• Captures a number of settings of interest
• Our primary contributions have been showing that we can provably speed learning (Brunskill and Li UAI 2013; Brunskill and Li ICML 2014; Guo and Brunskill AAAI 2015)
• Limitations: focused on discrete state and action spaces, impractical bounds, optimizing for average performance
Assume Related Meta-Groups
Wilson, A., Fern, A., Ray, S., & Tadepalli, P. (2007, June). Multi-task reinforcement learning: a hierarchical Bayesian approach. In Proceedings of the 24th international conference on Machine learning (pp. 1015-1022). ACM.
Encouraging empirically
No guarantees
Scalability?
Hidden Parameter MDP
● Allow for smooth linear parameterization of dynamics model
Doshi-Velez, F., & Konidaris, G. (2016, July). Hidden parameter Markov decision processes: A semiparametric regression approach for discovering latent task parametrizations. In IJCAI (Vol. 2016, p. 1432).
Hidden Parameter MDP++
● Use Bayesian neural nets for dynamics
● Benefits for an HIV treatment simulation
● Each episode is a new patient
TW Killian, G Konidaris, F Doshi-Velez. Robust and Efficient Transfer Learning with Hidden Parameter Markov Decision Processes. NIPS 2017.
Sample Efficient Online RL: Policy Search
• Policy search is a hugely popular technique in RL
– See the Schulman and Abbeel NIPS 2016 tutorial, Deep Reinforcement Learning through Policy Optimization: https://people.eecs.berkeley.edu/~pabbeel/nips-tutorial-policy-optimization-Schulman-Abbeel.pdf
• Not always associated with data efficiency
• But can be used to tune a small set of parameters efficiently (Thomas and Brunskill, ICML 2016)
Policy Search as Function Optimization
[Figure: performance of policy as a function of policy parameters θ, where π = f(θ) defines the policy class. Figure from Wilson et al. JMLR 2014]
Policy Search as Function Optimization: Gradient Approaches
• Use the structure of sequential decision making
• Only find local optima
[Figure: performance of policy as a function of θ, where π = f(θ) defines the policy class. Figure from Wilson et al. JMLR 2014]
Policy Search as Bayesian Optimization: Treats Each Evaluation as Costly
• Explicitly represents uncertainty; finds global optima
• Generally ignores sequential structure
[Figure: performance of policy as a function of θ, where π = f(θ) defines the policy class. Figure from Wilson et al. JMLR 2014]
Policy Search as Bayesian Optimization
● Data efficient policy search
● Can leverage Markovian structure (e.g. Wilson et al. 2014)
● Doesn’t have to make assumptions about world model
● Can combine with off policy evaluation to further speed up learning (in terms of amount of data required)
Goel, Dann and Brunskill IJCAI 2017
Policy Search for Optimal Stopping Problems
● Can leverage full trajectories to retrospectively evaluate alternate optimal stopping policies
● Substantial increase in data efficiency over generic bounds (Ng and Jordan UAI 2000)
● But Gaussian Processes struggle in high dimensions
Goel, Dann and Brunskill IJCAI 2017
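The retrospective-evaluation idea can be sketched for a simple threshold policy class: because a logged full trajectory reveals the value at every step, a single trajectory can score every candidate "stop as soon as the observed value meets the threshold" policy, with no new samples needed. The trajectories and policy class below are invented for illustration, not the paper's actual construction:

```python
# Hypothetical logged episodes: observed values over time, one list per episode.
trajectories = [
    [0.1, 0.5, 0.9, 0.4],
    [0.3, 0.2, 0.8, 0.6],
]

def stopped_reward(traj, threshold):
    """Reward under 'stop at the first value >= threshold' (else take the last)."""
    for v in traj:
        if v >= threshold:
            return v
    return traj[-1]

def evaluate(threshold):
    """Retrospective (off-policy) estimate of the stopping policy's value."""
    return sum(stopped_reward(t, threshold) for t in trajectories) / len(trajectories)

# Sweep all candidate thresholds against the same logged data.
candidates = [0.0, 0.25, 0.5, 0.75]
best = max(candidates, key=evaluate)
```

Here every candidate threshold is evaluated from the same two logged trajectories, which is the source of the data-efficiency gain over methods that need fresh rollouts per candidate.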
Assuming Finite Set of Policies to Speed Learning in a New Task
● Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement learning agent. In Proceedings of the fifth international joint conference on Autonomous agents and multiagent systems (pp. 720-727). ACM.
● Talvitie, E., & Singh, S. P. (2007, January). An Experts Algorithm for Transfer Learning. In IJCAI (pp. 1065-1070).
● Azar, M. G., Lazaric, A., & Brunskill, E. (2013, September). Regret bounds for reinforcement learning with policy advice. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 97-112).
● Focus on the discrete state and action space setting
Multi-Task Policy Learning
Ammar, H. B., Eaton, E., Luna, J. M., & Ruvolo, P. (2015). Autonomous cross-domain knowledge transfer in lifelong policy gradient reinforcement learning. In IJCAI.
Multi-Task Policy Learning
Goel, Dann and Brunskill IJCAI 2017
● Set of policies with a shared basis set of parameters
● Can be used to do cross-domain transfer (different states & actions)
Multi-Task Policy Learning with Shared Policy and Feature Parameters
● Descriptor features and policy features
● Primarily tackling continuous control tasks
● Scaling to enormous state spaces and performance for a single trajectory less clear
Isele, D., Rostami, M., & Eaton, E. (2016, July). Using Task Features for Zero-Shot Knowledge Transfer in Lifelong Learning. In IJCAI (pp. 1620-1626).
Deep RL Transfer
● Deep reinforcement learning to find good shared representation (Finn, Abbeel, Levine ICML 2017)
● Fast transfer by encouraging shared representation learning across tasks
Direct Policy Search
• Most of these approaches aim to speed up online learning in a new task
• They do not make guarantees on performance for a single task
(Some) Recent Sample Efficient RL References
Brunskill, E., & Li, L. Sample Complexity of Multi-task Reinforcement Learning. UAI 2013.
Wilson, A., Fern, A., Ray, S., & Tadepalli, P. (2007, June). Multi-task reinforcement learning: a hierarchical Bayesian approach. In Proceedings of the 24th international conference on Machine learning (pp. 1015-1022). ACM.
Doshi-Velez, F., & Konidaris, G. (2016, July). Hidden parameter Markov decision processes: A semiparametric regression approach for discovering latent task parametrizations. In IJCAI (Vol. 2016, p. 1432). NIH Public Access.
Killian, T. W., Konidaris, G., & Doshi-Velez, F. (2017). Robust and Efficient Transfer Learning with Hidden Parameter Markov Decision Processes. NIPS.
Goel, Dann, & Brunskill. Sample Efficient Policy Search for Optimal Stopping Domains. IJCAI 2017.
Guo, Zhaohan, and Emma Brunskill. "Concurrent PAC RL." AAAI 2015.
Jiang, N., Krishnamurthy, A., Agarwal, A., Langford, J., & Schapire, R. E. (2017). Contextual Decision Processes with Low Bellman Rank are PAC-Learnable. ICML.
Silver, D., Newnham, L., Barker, D., Weller, S., & McFall, J. (2013, February). Concurrent reinforcement learning from customer interactions. In International Conference on Machine Learning (pp. 924-932)
Carlos Diuk, Lihong Li, and Bethany R. Leffler. The adaptive k-meteorologists problem and its application to structure learning and feature selection in reinforcement learning. In ICML, 2009.
Kirill Dyagilev, Shie Mannor, and Nahum Shimkin. Efficient reinforcement learning in parameterized models: Discrete parameter case. In European Workshop on Reinforcement Learning, 2008
Liu, Y., Guo, Z., & Brunskill, E. PAC Continuous State Online Multitask Reinforcement Learning with Identification. AAMAS 2016.
Osband, I., Russo, D., & Van Roy, B. (2013). (More) efficient reinforcement learning via posterior sampling. NIPS
Osband, I., & Van Roy, B. (2014). Near-optimal Regret Bounds for Reinforcement Learning in Factored MDPs. NIPS.
Zhou, L., & Brunskill, E. Latent Contextual Bandits and Their Application to Personalized Recommendations for New Users. IJCAI 2016.
Ammar, H. B., Eaton, E., Luna, J. M., & Ruvolo, P. (2015, July). Autonomous cross-domain knowledge transfer in lifelong policy gradient reinforcement learning. In International Joint Conference on Artificial Intelligence (pp. 3345-3351)
Isele, D., Rostami, M., & Eaton, E. (2016, July). Using Task Features for Zero-Shot Knowledge Transfer in Lifelong Learning. In IJCAI (pp. 1620-1626).
Finn, C., Abbeel, P., & Levine, S. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML 2017.
Beyond Expectation
• When interacting with people, we may sometimes care about more than expected performance averaged across many rounds
• Work on risk-sensitive policies includes:
– García, J., & Fernández, F. (2015). A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1), 1437-1480.
– Castro, D. D., Tamar, A., & Mannor, S. (2012). Policy gradients with variance related risk criteria. ICML (pp. 935-942).
– Doshi-Velez, F., Pineau, J., & Roy, N. (2012). Reinforcement learning with limited reinforcement: Using Bayes risk for active learning in POMDPs. Artificial Intelligence, 187, 115-132.
– Prashanth, L. A., & Ghavamzadeh, M. (2013). Actor-critic algorithms for risk-sensitive MDPs. NIPS (pp. 252-260).
– Chow, Y., & Ghavamzadeh, M. (2014). Algorithms for CVaR optimization in MDPs. NIPS (pp. 3509-3517).
– Delage, E., & Mannor, S. (2007). Percentile optimization in uncertain Markov decision processes with application to efficient exploration. ICML (pp. 225-232).
![Page 60: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/60.jpg)
Beyond Expectation

• If interacting with people, we sometimes care not just about expected performance, averaged across many rounds
• Work on risk-sensitive policies includes:
  – García, J., & Fernández, F. (2015). A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1), 1437-1480.
  – Castro, D. D., Tamar, A., & Mannor, S. (2012). Policy gradients with variance related risk criteria. ICML (pp. 935-942).
  – Doshi-Velez, F., Pineau, J., & Roy, N. (2012). Reinforcement learning with limited reinforcement: Using Bayes risk for active learning in POMDPs. Artificial Intelligence, 187, 115-132.
  – Prashanth, L. A., & Ghavamzadeh, M. (2013). Actor-critic algorithms for risk-sensitive MDPs. NIPS (pp. 252-260).
  – Chow, Y., & Ghavamzadeh, M. (2014). Algorithms for CVaR optimization in MDPs. NIPS (pp. 3509-3517).
  – Delage, E., & Mannor, S. (2007). Percentile optimization in uncertain Markov decision processes with application to efficient exploration. ICML (pp. 225-232).
• Work on safe exploration includes:
  – Geramifard, A. (2012). Practical reinforcement learning using representation learning and safe exploration for large scale Markov decision processes. Doctoral dissertation, Massachusetts Institute of Technology.
  – Moldovan, T. M., & Abbeel, P. (2012). Safe exploration in Markov decision processes. arXiv preprint arXiv:1205.4810.
  – Turchetta, M., Berkenkamp, F., & Krause, A. (2016). Safe exploration in finite Markov decision processes with Gaussian processes. NIPS (pp. 4312-4320).
  – Gillula, J. H., & Tomlin, C. J. (2012). Guaranteed safe online learning via reachability: tracking a ground target using a quadrotor. ICRA (pp. 2723-2730).
• Many make assumptions on structural regularities of the world model
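One concrete criterion that goes "beyond expectation" is Conditional Value at Risk (CVaR), which several of the papers above optimize. A minimal sketch, with hypothetical return samples (the two "policies" and all numbers are invented for illustration): policy A has the higher mean, but policy B has the better lower tail.

```python
import numpy as np

rng = np.random.default_rng(0)

def cvar(returns, alpha=0.1):
    """Mean of the worst alpha-fraction of returns (lower-tail CVaR)."""
    returns = np.sort(np.asarray(returns))
    k = max(1, int(alpha * len(returns)))
    return returns[:k].mean()

# Hypothetical return samples from two policies: A is better on average,
# but has a much heavier lower tail than B.
returns_a = rng.normal(10.0, 8.0, size=10_000)
returns_b = rng.normal(9.0, 1.0, size=10_000)

print(returns_a.mean(), cvar(returns_a))  # higher mean, worse tail
print(returns_b.mean(), cvar(returns_b))  # lower mean, better tail
```

A risk-neutral agent prefers A; a CVaR-sensitive one prefers B, which is exactly the distinction that matters when a single bad outcome for one person is unacceptable.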
Limited prior data ≠ Often lots of prior data!

Batch (Entirely Offline) RL
A Classrooms Avg Score: 95
B Classrooms Avg Score: 92

What should we do for a new student?
Comes Up in Many Domains: e.g. Equipment Maintenance Scheduling
Comes Up in Many Domains: e.g. Patient Treatment Ordering
A Classrooms Avg Score: 95
B Classrooms Avg Score: 92
B Classrooms Avg Score: ????

Challenge: Counterfactual Reasoning

A Classrooms Avg Score: 95
B Classrooms Avg Score: 92
B Classrooms Avg Score: ????

Challenge: Generalization to Untried Policies
Counterfactual Estimation

● Active area of research in many fields
● Today we focus on it in the context of reinforcement learning (sequential decision making)
● A few pointers to some other work and settings:
○ Saria and Schulam. ML Foundations and Methods for Precision Medicine and Healthcare. NIPS 2016 Tutorial.
○ T. Joachims, A. Swaminathan, Tutorial on Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement, ACM Conference on Research and Development in Information Retrieval (SIGIR), 2016.
○ Wager, S., & Athey, S. (2017). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, (to appear)
○ Austin, P. C. (2011). An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate behavioral research, 46(3), 399-424.
○ T. Schnabel, A. Swaminathan, A. Singh, N. Chandak, T. Joachims, Recommendations as Treatments: Debiasing Learning and Evaluation, International Conference on Machine Learning (ICML), 2016.
○ Rubin, D. B. (2005). Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469), 322-331.
Batch Data Policy Selection

Data on decisions & outcomes

Policy 1 → Estimated Performance 1
Policy 2 → Estimated Performance 2
Policy 3 → Estimated Performance 3
... ...
Policy N → Estimated Performance N

Policy Evaluation → Policy Selection: Policy i
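The evaluate-then-select loop can be sketched in the one-step (bandit) case. All of the logged data, behavior probabilities, and reward means below are hypothetical, and importance sampling is used as the evaluator as a stand-in for whichever estimator one prefers:

```python
import numpy as np

rng = np.random.default_rng(1)
n_actions, n = 3, 5_000

# Hypothetical logged data: a uniform behavior policy chose among three
# actions; action 2 has the highest (unknown) expected reward.
true_means = np.array([0.2, 0.5, 0.8])
b_prob = 1.0 / n_actions
actions = rng.integers(n_actions, size=n)
rewards = rng.binomial(1, true_means[actions]).astype(float)

def is_value(target_action):
    """IS estimate of the deterministic policy 'always play target_action'."""
    w = (actions == target_action) / b_prob  # pi_e / pi_b per logged decision
    return float(np.mean(w * rewards))

# Policy evaluation: one estimate per candidate policy ...
estimates = [is_value(a) for a in range(n_actions)]
# ... then policy selection: deploy the candidate with the best estimate.
best = int(np.argmax(estimates))
print(best, estimates)
```

With this much data the selection recovers the best arm; the rest of this section is about what happens when the data are scarcer, the horizon longer, and the estimators less well behaved.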
Batch RL for People

• May want to do only policy evaluation
• May want asymptotic guarantees on policy evaluation estimators
  – Valid confidence intervals
  – Consistency
• May want a measure of confidence in values: confidence intervals
• Measure of confidence in policy
• Empirically good performance
• Robustness to assumptions in the estimator
• Lots of great work on off-policy policy learning
  – Here assume we only get access to a fixed batch of data (no more learning) and care about the accuracy of the result
Policy: Player state → level
Goal: Maximize engagement
Old data: ~11,000 students
Build Predictive Model

Data on decisions & outcomes → Predictive statistical model of player behavior
Use Models as a Simulator

Predictive statistical model of player behavior: Action → Observation, Reward

Compute Policy that Optimizes Expected Rewards for this Model
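A minimal sketch of this model-as-simulator loop for a tabular MDP. The domain, the count-based model, and every number are invented for illustration; this is not any particular system from the tutorial:

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions = 4, 2

def make_batch(n=20_000):
    """Hypothetical logged transitions (s, a, r, s') from a random world."""
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    R = rng.uniform(0, 1, size=(n_states, n_actions))
    batch = []
    for _ in range(n):
        s, a = rng.integers(n_states), rng.integers(n_actions)
        batch.append((s, a, R[s, a], rng.choice(n_states, p=P[s, a])))
    return batch

def fit_model(batch):
    """Count-based estimates of p(s'|s,a) and r(s,a), with +1 smoothing."""
    counts = np.ones((n_states, n_actions, n_states))
    r_sum = np.zeros((n_states, n_actions))
    r_n = np.zeros((n_states, n_actions))
    for s, a, r, s2 in batch:
        counts[s, a, s2] += 1
        r_sum[s, a] += r
        r_n[s, a] += 1
    return counts / counts.sum(axis=2, keepdims=True), r_sum / np.maximum(r_n, 1)

def simulate(policy, P_hat, R_hat, horizon=10, n_rollouts=500):
    """Monte Carlo estimate of the policy's value under the learned model."""
    total = 0.0
    for _ in range(n_rollouts):
        s = 0
        for _ in range(horizon):
            a = policy[s]
            total += R_hat[s, a]
            s = rng.choice(n_states, p=P_hat[s, a])
    return total / n_rollouts

P_hat, R_hat = fit_model(make_batch())
policy = np.zeros(n_states, dtype=int)  # a fixed policy to evaluate
print(simulate(policy, P_hat, R_hat))
```

The estimate is only as good as the fitted model, which is exactly the failure mode the next slides highlight.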
Problem: Model May Not be Accurate… Yields Poor Estimate of Policy Performance
Worse: More Accurate Models Can Yield Even Poorer Performing Policies?

Mandel, Liu, Brunskill and Popovic AAMAS 2014
Compute Best Policy for Model

Data on decisions & outcomes → Predictive statistical model of player

● Much prior work: if the model is good, the policy is good
Using Estimators that Rely on Model Class Being Correct Can Fail

● Much prior work: if the model is good, the policy is good
● Challenge: the model class may be wrong
  ○ May model the domain as Markov when it is not
  ○ If we compute a policy assuming the model is Markov, the resulting value estimate is not valid
● How can we identify if the model class is wrong?
  ○ Sometimes feasible, sometimes hard
● Relates to
  ○ the Sim2Real problem
  ○ adversarial examples
Prior Work: Estimate Model from Data

• Strengths
  – Low variance estimator of policy performance
• Weaknesses
  – Not an unbiased estimator (the model may be poor!)
  – Not a consistent estimator

Build & Estimate a Model
• Define states and actions
• Dynamics p(s'|s,a)
• Observation model p(z|s,a)
• Rewards r(s,a,z)

Historical Data: History 1, R1 = Σi ri1; History 2, R2 = Σi ri2; …
Key Challenge: Distribution Mismatch

Behavior Policy → Histories → E[Σi ri]

• Rewards r(s,a,z)
• Policy maps history (a,z,r,a',z',…) → a

New Policy → Histories → E[Σi ri]?
Importance Sampling (IS)
(e.g. Precup et al. 2002, Mandel et al. 2014)

Historical Data: History 1, R1 = Σi ri1; History 2, R2 = Σi ri2; …

Estimate of Behavior Policy πb Performance → Estimate of Evaluation Policy πe Performance
Importance Sampling (IS)
(e.g. Precup et al. 2002, Mandel et al. 2014)

• Strengths
  – Unbiased estimator of policy performance
  – Strongly consistent estimator of policy performance
• Weaknesses
  – High variance estimator*

*Many extensions, including Retrace (NIPS 2016), but most are focused on the online off-policy setting

Historical Data: History 1, R1 = Σi ri1; History 2, R2 = Σi ri2; …

Estimated Evaluation Policy πe performance
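A minimal sketch of the per-trajectory IS estimator: weight each logged return by the product of action-probability ratios πe/πb. The two-state chain, both policies, and the horizon are invented so the result can be checked against on-policy Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(3)

def is_estimate(trajectories, pi_e, pi_b):
    """Per-trajectory IS: mean of (prod_t pi_e(a_t|s_t)/pi_b(a_t|s_t)) * return."""
    vals = []
    for traj in trajectories:
        w, ret = 1.0, 0.0
        for s, a, r in traj:
            w *= pi_e[s][a] / pi_b[s][a]
            ret += r
        vals.append(w * ret)
    return float(np.mean(vals))

# Hypothetical 2-state chain: action 1 moves to state 1; reward 1 for
# playing action 1 while in state 1. Horizon 3, start in state 0.
pi_b = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}  # uniform behavior policy
pi_e = {0: {0: 0.1, 1: 0.9}, 1: {0: 0.1, 1: 0.9}}  # evaluation policy

def rollout(pi, horizon=3):
    traj, s = [], 0
    for _ in range(horizon):
        a = int(rng.random() < pi[s][1])
        traj.append((s, a, float(s == 1 and a == 1)))
        s = a
    return traj

logged = [rollout(pi_b) for _ in range(50_000)]
v_is = is_estimate(logged, pi_e, pi_b)  # off-policy estimate of pi_e's value
v_mc = np.mean([sum(r for _, _, r in rollout(pi_e)) for _ in range(50_000)])
print(v_is, v_mc)  # the two estimates should roughly agree
```

The agreement here relies on tens of thousands of short trajectories; the weaknesses slide above is about what happens when neither is plentiful.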
We used off-policy evaluation to find a policy with 30% higher engagement (Mandel et al. AAMAS 2014)
High Confidence Off-Policy Policy Evaluation

~2 million trajectories

Thomas, Theocharous, & Ghavamzadeh AAAI 2015 (slides from Thomas)
High Confidence Off-Policy Policy Improvement

• Use approximate confidence intervals
• Policy evaluation interleaved with running the policy, for high-confidence iterative improvement
• The domain horizon (10) is still quite small
Thomas, Theocharous, & Ghavamzadeh ICML 2015
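To give the flavor of "high confidence": a distribution-free lower bound on the mean of bounded values via Hoeffding's inequality, a simpler stand-in for the tighter concentration bounds used in the papers above. The clipped importance-weighted returns here are hypothetical:

```python
import numpy as np

def hoeffding_lower_bound(values, b, delta=0.05):
    """Distribution-free 1-delta lower confidence bound on the mean of
    values known to lie in [0, b] (Hoeffding's inequality)."""
    n = len(values)
    return float(np.mean(values) - b * np.sqrt(np.log(1.0 / delta) / (2.0 * n)))

# Hypothetical clipped importance-weighted returns, bounded in [0, b].
rng = np.random.default_rng(4)
b = 5.0
clipped = np.clip(rng.exponential(1.0, size=2_000), 0.0, b)

lb = hoeffding_lower_bound(clipped, b)
print(lb)  # deploy the new policy only if this beats the baseline's value
```

A conservative bound like this is why so many trajectories are needed: the penalty term shrinks only as 1/sqrt(n) and scales with the weight bound b.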
Two Extremes of Offline Reinforcement Learning

ML model + plan: + data efficient, - biased
Importance sampling: - data intensive, + unbiased
Image: https://upload.wikimedia.org/wikipedia/commons/f/f0/DARPA_Big_Data.jpg
Doubly Robust Estimation

Combine both: ML model + plan (data efficient, biased) + importance sampling (data intensive, unbiased)
Doubly Robust (DR) Estimation

• Model + IS-based estimator
• Bandits (Dudik et al. 2011):

  V_DR = (1/n) Σi [ r̂(xi, π(xi)) + (ri − r̂(xi, ai)) · π(ai|xi) / μ(ai|xi) ]

  where ri is the reward received and r̂ is the model of reward
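The bandit DR estimator can be sketched in a few lines: use the reward model as a baseline and importance-weight only the model's residual. The logged data and the deliberately biased reward model below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
n, n_actions = 4_000, 3
true_means = np.array([0.3, 0.6, 0.9])

# Hypothetical logged bandit data under a uniform behavior policy mu.
mu = 1.0 / n_actions
actions = rng.integers(n_actions, size=n)
rewards = rng.normal(true_means[actions], 1.0)

r_hat = true_means + 0.2        # deliberately biased reward model
target = 2                      # evaluation policy: always play action 2

dm = r_hat[target]              # direct method: trust the model only
w = (actions == target) / mu    # importance weights pi_e / pi_b
dr = float(np.mean(r_hat[target] + w * (rewards - r_hat[actions])))
print(dm, dr)  # dm inherits the model's bias; dr corrects it on average
```

DR stays unbiased when either the model or the logged propensities are correct, and the model baseline soaks up much of the variance that plain IS would carry.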
Doubly Robust Estimation for RL

• Jiang and Li (ICML 2016) extended DR to RL, applying the correction at each step:

  V_DR ← V̂(s_t) + ρ_t ( r_t + γ V_DR − Q̂(s_t, a_t) )

  where ρ_t are the importance weights, r_t the actual rewards in the dataset, Q̂ the model-based estimate of Q, and V̂ the model-based estimate of V

• Limitation: the estimator is derived to be unbiased, so it cannot trade bias for lower variance
Tight Estimate Often Better than Unbiased: Measure with Mean Squared Error

• Trade bias and variance: MSE = Bias² + Variance
• Model-based estimator: bias, little variance
• Importance sampling estimator: variance, no bias

Thomas and Brunskill, ICML 2016
Blend IS-Based & Model-Based Estimators to Directly Minimize Mean Squared Error

1-step estimate, 2-step, …, N-step estimate, weighted x1, x2, …, xN (trading bias against variance)

Thomas and Brunskill, ICML 2016
Model and Guided Importance Sampling combining (MAGIC) Estimator

• Estimated policy value uses a particular weighting of the model estimate and the importance sampling estimate
• Solve a quadratic program
• Strongly consistent*

Thomas and Brunskill, ICML 2016
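A one-dimensional caricature of the MAGIC idea (not the actual quadratic program over partial-horizon estimators): choose the weight on the model estimate that minimizes a plug-in MSE estimate, so a wildly biased model is ignored while a nearly unbiased one is leaned on. All quantities are hypothetical:

```python
import numpy as np

def blend(model_est, is_vals):
    """Weight a biased model estimate against an unbiased IS estimate by
    minimizing estimated MSE(x) = (x * bias)^2 + (1 - x)^2 * var,
    giving x* = var / (bias^2 + var)."""
    u = float(np.mean(is_vals))                        # unbiased but noisy
    v = float(np.var(is_vals, ddof=1)) / len(is_vals)  # variance of IS mean
    b2 = (model_est - u) ** 2                          # crude plug-in bias estimate
    x = v / (b2 + v)                                   # weight on the model
    return x * model_est + (1.0 - x) * u

rng = np.random.default_rng(6)
is_vals = rng.normal(1.0, 5.0, size=200)  # high-variance IS samples, true value 1.0
print(blend(1.1, is_vals))   # slightly biased model: blend leans on it
print(blend(50.0, is_vals))  # wildly biased model: blend falls back to IS
```

MAGIC does this over a whole spectrum of i-step blended estimators at once, which is where the quadratic program and the strong-consistency argument come in.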
MAGIC Can Yield Orders of Magnitude Better Estimates of Policy Performance

(Plot: mean squared error vs. number of histories, on a log scale, for the IS-based, Model, DR, MAGIC, and MAGIC-B estimators)

Thomas and Brunskill, ICML 2016
MAGICal Policy Search
Thomas and Brunskill, EWRL 2016
Variance of Importance Sampling Can Grow Exponentially with Time Horizon
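A quick numeric illustration of the claim, with hypothetical per-step ratios πe/πb that each have mean 1: the standard deviation of the product weight still blows up multiplicatively with the horizon.

```python
import numpy as np

rng = np.random.default_rng(7)

def weight_std(horizon, n=200_000):
    """Empirical std of a product of `horizon` iid per-step importance
    ratios, each with mean 1 (hypothetical values 1.8 or 0.2, equally likely)."""
    ratios = rng.choice([1.8, 0.2], size=(n, horizon))
    return float(ratios.prod(axis=1).std())

for h in (1, 5, 10):
    print(h, weight_std(h))  # grows roughly like (E[ratio^2])^(h/2)
```

Since E[ratio²] = 1.64 here, the variance of the product is 1.64^h − 1, which is the exponential growth the headline refers to.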
Trading Bias & Variance for Long Horizon Off Policy Policy Evaluation

● Importance sampling based estimator
● Dropping weights (covariance) reduces variance at the cost of bias
● Strongly consistent

(Weights are split between the first part of the trajectory and the second part of the trajectory)

Guo, Thomas and Brunskill, NIPS 2017
Trading Bias & Variance for Long Horizon Off Policy Policy Evaluation

Promising for some forms of evaluation in Atari, but subtle: depends on the particular form of rewards and policies

*Note: updated since the original tutorial, given a bug found in the original plots

Guo, Thomas and Brunskill, NIPS workshop 2017
High Confidence Bounds for Weighted Doubly Robust Estimation

● High confidence bounds for weighted doubly robust off-policy policy estimation (Hanna, Stone, Niekum AAMAS 2017)
Fairness of Importance Sampling for Policy Selection

• Though IS policy estimates are unbiased, policy selection using them can be unfair
• Define unfair as: choosing the wrong policy as having higher performance > 50% of the time
• With finite data, selection can be biased towards certain policies
  – more myopic policies / shorter trajectories

Doroudi, Thomas and Brunskill, UAI 2017, Best Paper
Re-thinking Models: Robust Matrix Evaluation

• When data is very limited and the horizon is long
• IS estimators are still too high variance
• Train multiple models & simulate policies on each
• Can use for minimax policy selection

Model \ Policy | Policy 1 | Policy 2
Model 1        |    8     |    1
Model 2        |   10     |   27

Doroudi, Aleven and Brunskill, Learning at Scale, 2017
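Minimax selection over the model × policy matrix takes only a few lines (values from the slide's table):

```python
import numpy as np

# Model x policy value matrix from the slide (rows: models, cols: policies).
V = np.array([[8.0, 1.0],
              [10.0, 27.0]])

worst_case = V.min(axis=0)         # each policy's worst value across models
best = int(np.argmax(worst_case))  # minimax choice: Policy 1 (index 0)
print(best)
```

Policy 2 looks far better if Model 2 is right (27 vs 10), but minimax prefers Policy 1 because Policy 2's value collapses to 1 if Model 1 is the truth.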
Re-thinking Models
• Which models to include?
• What are good models with limited data?
• Bayesian Neural Networks seem promising
– Applications to stochastic dynamical systems (e.g. Depeweg, Hernández-Lobato, Doshi-Velez, & Udluft, ICLR 2017)
Overview

● RL introduction
● RL for people
● RL by the people
RL By the People

● Reward specification
● Demonstrations
● Rewarding
● Advice / labeling / critiquing
● Teaching
● Shaping the space
*Also see Section 4 of great tutorial by Taylor, Kamar and Hayes: http://interactiveml.net/
Human Specifies Reward

● Is the true reward the best reward for the agent to learn from? E.g.
  ○ Singh, S., Lewis, R. L., & Barto, A. G. Where do rewards come from. CogSci 2009.
● For a (computationally) bounded agent, it may not be
● A human may not write down the reward they really want
● May want constraints on behavior (Thomas, Castro da Silva, Barto, Brunskill, arXiv 2017)
Inverse Reward Design
Hadfield-Menell et al., NIPS 2017

● The specified reward is an observation of the true desired reward function
● Compute a Bayesian posterior over the true reward
● Compute a risk-averse policy
● Avoids new bad things
● Also avoids new good things
Image credit: http://www.infocuriosity.com/king-midas-and-his-golden-touch-an-ancient-greek-mythology-for-children/
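The risk-averse step can be sketched numerically. In the toy sketch below (names and numbers are all hypothetical), candidate trajectories are scored by their worst-case reward over samples from the posterior over the true reward, so options with novel features the posterior disagrees about are avoided:

```python
import numpy as np

def risk_averse_choice(feature_vectors, reward_samples):
    """Pick the option whose worst-case reward over posterior samples is best.

    feature_vectors: (n_options, d) feature counts of candidate trajectories
    reward_samples:  (n_samples, d) weight vectors drawn from the posterior
                     over the true reward (the proxy reward is one observation)
    """
    # Reward of each option under each sampled reward function
    rewards = feature_vectors @ reward_samples.T   # (n_options, n_samples)
    worst_case = rewards.min(axis=1)               # risk-averse evaluation
    return int(np.argmax(worst_case))

# Two options: one seen during reward design (samples agree), one novel
# (samples disagree wildly). Risk aversion avoids the novel option.
options = np.array([[1.0, 0.0],    # familiar terrain
                    [0.0, 1.0]])   # novel terrain
samples = np.array([[0.8, 2.0],
                    [0.9, -3.0],   # some samples think novel terrain is bad
                    [0.7, 1.5]])
print(risk_averse_choice(options, samples))   # → 0 (the familiar option)
```

Note the asymmetry on the slide: by construction this avoids new bad things but also forgoes new good things.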
![Page 118: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/118.jpg)
Human Specifies Reward
● May be hard for people to specify a reward
● Can use it as a partial specification / shaping
● Do RL on top of the specified reward
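One safe way to use a human-specified signal as a partial specification is potential-based shaping (Ng, Harada & Russell, 1999): adding F(s, s') = γΦ(s') − Φ(s) to the environment reward provably leaves the optimal policy unchanged. A minimal sketch (the corridor potential is a made-up example):

```python
def shaped_reward(r, s, s_next, potential, gamma=0.99):
    """Potential-based reward shaping (Ng, Harada & Russell, 1999).

    Adding F(s, s') = gamma * phi(s') - phi(s) to the environment reward
    preserves the optimal policy, so a human-specified potential can
    safely serve as a partial specification / hint for RL on top.
    """
    return r + gamma * potential(s_next) - potential(s)

# Hypothetical example: distance-to-goal potential on a 1-D corridor.
goal = 10
phi = lambda s: -abs(goal - s)        # higher potential nearer the goal
print(shaped_reward(0.0, 3, 4, phi, gamma=1.0))   # → 1.0 (moved toward goal)
```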
![Page 119: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/119.jpg)
Human Provides Demonstrations
● Imitation learning / IRL / learning from demonstration / apprenticeship learning
● Enormously influential, especially in robotics
● Recent tutorial at ICRA: http://lasa.epfl.ch/tutorialICRA16/
● Key idea: humans provide demonstrations, and the agent uses these to learn the task
● Here, assume we get access to an initial set of trajectories and then have no more interactions with the person
● Goal is to learn a good policy
![Page 120: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/120.jpg)
Behavioral Cloning
● Key challenge: data is not iid
Figure from Ross, S., & Bagnell, J. A. (2012). Agnostic system identification for model-based reinforcement learning. ICML.
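Behavioral cloning reduces imitation to ordinary supervised learning on expert (state, action) pairs. The sketch below (a hypothetical 1-D task with a nearest-neighbour learner) shows why the non-i.i.d. data matters: nothing corrects the learner in states the expert never visited, so its own mistakes drive it off the demonstration distribution and errors compound:

```python
import numpy as np

def behavioral_cloning(demo_states, demo_actions):
    """Fit a 1-nearest-neighbour policy to expert (state, action) pairs.

    This treats imitation as plain supervised learning; once the rollout
    drifts to states the expert never visited, the policy can only guess,
    which is why errors compound with the horizon (Ross & Bagnell).
    """
    demo_states = np.asarray(demo_states, dtype=float)
    demo_actions = list(demo_actions)

    def policy(s):
        i = int(np.argmin(np.abs(demo_states - s)))
        return demo_actions[i]
    return policy

# Expert demonstrations on a 1-D track: go right until state 5, then stop.
states = [0, 1, 2, 3, 4, 5]
actions = ["right", "right", "right", "right", "right", "stop"]
pi = behavioral_cloning(states, actions)
print(pi(2))   # → "right" (on the demonstration distribution)
print(pi(7))   # state never demonstrated: nearest neighbour guesses "stop"
```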
![Page 121: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/121.jpg)
Human Provides Demonstrations: Inverse Reinforcement Learning
● Learn a reward function from human demonstrations
● Derive a policy given that reward function
○ Often involves assuming access to a dynamics model or the ability to try new policies online
○ IRL is ill-specified: a reward of 0 everywhere is always sufficient to explain the demonstrations
○ Try to match state features
○ The max-entropy approach has been very influential
○ Meta-learning
○ Learning from different demonstrations
○ Doing RL on top can be useful
○ A number of recent efforts combine IRL with deep learning
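The feature-matching and max-entropy ideas can be sketched together. Under a linear reward w·f(τ) and the max-entropy trajectory distribution p(τ) ∝ exp(w·f(τ)), the log-likelihood gradient is simply the expert's feature expectations minus the model's (Ziebart et al., 2008). A toy sketch, with trajectories and features invented for illustration:

```python
import numpy as np

def maxent_irl_step(w, expert_features, trajectory_features, lr=0.1):
    """One gradient step of maximum-entropy IRL (Ziebart et al. style).

    With linear reward r(traj) = w . f(traj), the log-likelihood gradient
    is the expert's feature expectations minus the learner's feature
    expectations under the soft-optimal distribution p ∝ exp(w . f).
    """
    trajectory_features = np.asarray(trajectory_features, dtype=float)
    scores = trajectory_features @ w
    p = np.exp(scores - scores.max())          # softmax over trajectories
    p /= p.sum()
    model_expectation = p @ trajectory_features
    grad = np.asarray(expert_features) - model_expectation
    return w + lr * grad

# Toy set of 3 candidate trajectories described by 2 features each.
trajs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
expert = [1.0, 0.0]                  # expert always takes the first trajectory
w = np.zeros(2)
for _ in range(200):
    w = maxent_irl_step(w, expert, trajs)
print(w[0] > w[1])   # → True: learned reward prefers the expert's feature
```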
![Page 122: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/122.jpg)
Human Provides Demonstrations: Guided Cost Learning
Finn, C., Levine, S., & Abbeel, P. (2016, June). Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning (pp. 49-58).
![Page 123: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/123.jpg)
Human Provides Demonstrations: Generative Adversarial Imitation Learning
Ho, J., & Ermon, S. (2016). Generative adversarial imitation learning. In Advances in Neural Information Processing Systems (pp. 4565-4573).
(Figure: policy performance on the y-axis vs. number of expert trajectories on the x-axis)
![Page 124: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/124.jpg)
Some Benefits & Limitations to Learning from Demonstrations
● Need access to experts doing the task. Demonstrations alone can be limiting, so they are often combined with online RL, which is not always feasible
● In some cases it is expensive or slow to gather data (teach a student across a year? provide customer recommendations across months?)
● May not help if the solution we need is radically far away: we still have to handle the exploration/exploitation trade-off online afterwards
● Still a question of how to capture expertise (e.g., what state features to use)
● But generally very promising
○ Demonstrations can be prior data collected for other purposes
○ Great for automating, scaling up, & fine-tuning good solutions
![Page 125: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/125.jpg)
Humans Providing Online Rewards
● Sophie’s Kitchen
● Human trainer can award a scalar reward signal r ∈ [−1, 1]
Thomaz, A. L., & Breazeal, C. (2008). Teachable robots: Understanding human teaching behavior to build more effective robot learners. Artificial Intelligence Journal, 172:716-737.
![Page 126: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/126.jpg)
Humans Providing Online Rewards
W. Bradley Knox and Peter Stone. Combining Manual Feedback with Subsequent MDP Reward Signals for Reinforcement Learning. In Proceedings of the Ninth International Conference on Autonomous Agents and Multiagent Systems. May 2010. Best student paper
● TAMER framework
● Uses human reward feedback to help shape the agent’s reward
● Can be used in addition to other reward signals from the domain
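A minimal TAMER-style sketch (my own toy, not the released implementation): learn a model H(s, a) of the human's feedback, here just a running average per state-action pair, and act greedily on it. The full framework regresses H over state features and handles delayed feedback:

```python
from collections import defaultdict

class TamerAgent:
    """Toy TAMER-style learner: model the *human's* reinforcement signal
    H(s, a) and act greedily on it. Here H is a per-(state, action)
    running average; the real framework uses function approximation.
    """
    def __init__(self, actions):
        self.actions = actions
        self.h = defaultdict(float)      # estimated human reinforcement
        self.n = defaultdict(int)        # feedback counts

    def act(self, s):
        # Greedy with respect to predicted human feedback
        return max(self.actions, key=lambda a: self.h[(s, a)])

    def give_feedback(self, s, a, human_reward):
        # Incremental running-average update of H(s, a)
        self.n[(s, a)] += 1
        self.h[(s, a)] += (human_reward - self.h[(s, a)]) / self.n[(s, a)]

agent = TamerAgent(actions=["left", "right"])
agent.give_feedback("start", "right", +1.0)   # trainer approves
agent.give_feedback("start", "left", -1.0)    # trainer disapproves
print(agent.act("start"))   # → "right"
```

Combining with domain reward (as in Knox & Stone, 2010) would add H(s, a) into the action-selection criterion alongside the MDP's own value estimate.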
![Page 127: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/127.jpg)
Human Provides Advice / Labels / Feedback
● Human will continue to provide advice about policies over time
Ross, S., Gordon, G. J., & Bagnell, D. (2011). A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics (pp. 627-635).
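The cited paper's DAgger algorithm makes this concrete: roll out the current learned policy, have the expert label the states the learner itself visits, aggregate the dataset, and retrain. A toy sketch (the `train` and `rollout` helpers here are illustrative stand-ins, not the paper's code):

```python
import numpy as np

def dagger(expert_policy, train, rollout, n_iters=5):
    """DAgger sketch (Ross, Gordon & Bagnell, 2011): the expert labels the
    states the *learner* visits, so the training distribution matches the
    distribution the learned policy actually induces.
    """
    states, labels = [], []
    policy = expert_policy                 # iteration 0: behave like expert
    for _ in range(n_iters):
        visited = rollout(policy)          # states reached by current policy
        states.extend(visited)
        labels.extend(expert_policy(s) for s in visited)   # expert relabels
        policy = train(states, labels)     # supervised step on the aggregate
    return policy

# Hypothetical 1-D task: expert moves right (+1) until state 5, then stops.
expert = lambda s: 1 if s < 5 else 0

def train(xs, ys):                         # 1-nearest-neighbour "learner"
    xs, ys = list(xs), list(ys)
    return lambda s: ys[int(np.argmin([abs(x - s) for x in xs]))]

def rollout(policy, start=0, steps=6):
    s, visited = start, []
    for _ in range(steps):
        visited.append(s)
        s += policy(s)
    return visited

pi = dagger(expert, train, rollout)
print([pi(s) for s in range(7)])   # → [1, 1, 1, 1, 1, 0, 0]
```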
![Page 128: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/128.jpg)
Human Provides Advice
Griffith, S., Subramanian, K., Scholz, J., Isbell, C., & Thomaz, A. L. (2013). Policy shaping: Integrating human feedback with reinforcement learning. In Advances in Neural Information Processing Systems (pp. 2625-2633).
● Advise: a Bayesian approach for policy shaping
● Agent may get a “right”/“wrong” label after performing an action
● C: feedback is consistent with the optimal policy with probability 0 < C < 1
● Update a posterior over the optimal policy from the human’s labels
● Combine the policy learned from the agent’s own experience with the one estimated from the human
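The posterior update has a simple closed form in the paper: with Δ = (#“right” − #“wrong”) labels for an action and feedback consistency C, the probability the action is optimal is C^Δ / (C^Δ + (1−C)^Δ). A sketch:

```python
def advise_probability(n_right, n_wrong, consistency=0.9):
    """Advise-style estimate (Griffith et al., 2013) that an action is
    optimal, given counts of human "right"/"wrong" labels and the
    probability C that each label is consistent with the optimal policy.
    """
    delta = n_right - n_wrong
    c, nc = consistency ** delta, (1 - consistency) ** delta
    return c / (c + nc)

print(advise_probability(3, 0))   # ≈ 0.999: strong evidence action is optimal
print(advise_probability(1, 1))   # → 0.5: contradictory labels cancel out
```

The resulting distribution over actions is then multiplied with the one the agent estimates from its own experience.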
![Page 129: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/129.jpg)
Discussion: Human Provides Advice / Labels / Feedback
● Performance gains are often significant
● Challenge: it is expensive to have someone in the loop, and often not feasible (the number of examples of advice can be in the hundreds)
● Maybe in some cases (a household robot, asking a supervisor for help with a tricky case) it could be realistic and easier than an expert providing demonstrations
● Might be some interesting merged cases:
○ Learning from demonstrators, where the agent can finish someone’s demonstration once they get stuck?
● Could also potentially be good for unknown unknowns
○ A human may help the agent realize there’s a problem/limitation even if the agent would not have on its own
![Page 130: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/130.jpg)
Humans Teaching Agents
● How do people try to teach?
● Evidence that (at least for animal training) people may expect their reward feedback to be interpreted as an action label
○ This is different if the learner is trying to maximize its long-term reward
○ Ho, M. K., Littman, M. L., Cushman, F., & Austerweil, J. L. (2015). Teaching with rewards and punishments: Reinforcement or communication? In CogSci.
![Page 131: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/131.jpg)
How do People Teach
● People can perform tasks differently when trying to show a task versus doing it at a very high level of performance
○ Ho, M. K., Littman, M., MacGlashan, J., Cushman, F., & Austerweil, J. L. (2016). Showing versus doing: Teaching by demonstration. In Advances in Neural Information Processing Systems (pp. 3027-3035)
![Page 132: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/132.jpg)
How do People Teach
● People pay attention to the learner’s current policy: whether human trainers give positive or negative feedback for a decision is influenced by the learner’s current policy, and this in turn influences which learning algorithm works best
○ MacGlashan, J., Ho, M. K., Loftin, R., Peng, B., Wang, G., Roberts, D. L., Taylor, M. E., & Littman, M. L. (2017). Interactive learning from policy-dependent human feedback. In ICML.
![Page 133: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/133.jpg)
Cooperative Inverse RL
Hadfield-Menell, D., Russell, S. J., Abbeel, P., & Dragan, A. (NIPS 2016)
● Game where human and the agent get rewards determined by the same reward function.
● If the learner has different capabilities from teacher, teacher should teach in a way to help learning agent learn its optimal policy (which may be different than if the teacher performed the task herself)
![Page 134: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/134.jpg)
How Should We Teach Agents
● Machine teaching: the optimal way to teach an agent given knowledge of the agent’s algorithm
○ Zhu, X. (2015). Machine teaching: An inverse problem to machine learning and an approach toward optimal education. In AAAI (pp. 4083-4087).
● How to teach a sequential decision making algorithm, e.g.
○ Cakmak, M., & Lopes, M. (2012). Algorithmic and human teaching of sequential decision tasks. In AAAI.
○ Walsh, T. J., & Goschin, S. (2012). Dynamic teaching in sequential decision making environments. In UAI.
○ Peng, B., MacGlashan, J., Loftin, R., Littman, M. L., Roberts, D. L., & Taylor, M. E. (2017). Curriculum design for machine learners in sequential decision tasks. In AAMAS.
![Page 135: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/135.jpg)
Overview
● Reward specification
● Demonstrations
● Rewarding
● Advice / labeling / critiquing
● Teaching
● Shaping the space
*Also see Section 4 of great tutorial by Taylor, Kamar and Hayes: http://interactiveml.net/
![Page 136: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/136.jpg)
Histogram Tutor
136
![Page 137: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/137.jpg)
Continually Improving Tutoring System
(Diagram: tutor chooses actions, observes correct/wrong responses; post-test at the end)
![Page 138: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/138.jpg)
Improving Across Many Students
(Diagram: the system chooses actions across many students)
![Page 139: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/139.jpg)
Over Time Tutoring System Stopped Giving Some Problems to Students
139
![Page 140: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/140.jpg)
System Self-Diagnosed that Problems Weren’t Helping Student Learning
![Page 141: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/141.jpg)
Rules of the World Are Not Fixed
≠
![Page 142: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/142.jpg)
Humans are Invention Machines
New actions New sensors
![Page 143: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/143.jpg)
Human in the Loop Reinforcement Learning
Goal: Choose actions to maximize expected rewards
(Diagram: agent chooses actions, receives observations and rewards)
![Page 144: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/144.jpg)
Human in the Loop Reinforcement Learning: Add Action
![Page 145: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/145.jpg)
Direct Human Effort for Adding New Actions
Mandel, Liu, Brunskill & Popovic, AAAI 2017
![Page 146: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/146.jpg)
Expected Local Improvement
(Formula: probability the human gives you action a_h for state s × improvement in value (outcomes) at state s if action a_h is added)
Mandel, Liu, Brunskill & Popovic, AAAI 2017
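The score on this slide can be sketched directly: query the human at the state maximizing P(human supplies a new action a_h | s) × the value improvement at s if a_h were added. Both terms in the sketch below are made-up estimates for illustration:

```python
def expected_local_improvement(states, p_useful_action, value_gain):
    """Sketch of the Expected Local Improvement heuristic (Mandel et al.,
    AAAI 2017 spirit): ask the human for a new action at the state where
    (probability the human supplies a helpful action) x (value improvement
    at that state if it were added) is largest. Both terms are
    caller-supplied estimates.
    """
    return max(states, key=lambda s: p_useful_action[s] * value_gain[s])

# Hypothetical estimates for three tutoring states.
p = {"s1": 0.9, "s2": 0.5, "s3": 0.3}
gain = {"s1": 0.1, "s2": 0.4, "s3": 1.0}
print(expected_local_improvement(["s1", "s2", "s3"], p, gain))   # → s3
```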
![Page 147: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/147.jpg)
Mostly Bad Human Input
Mandel, Liu, Brunskill & Popovic, AAAI 2017
![Page 148: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/148.jpg)
![Page 149: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/149.jpg)
Ask people to add in new hints where they might help
![Page 150: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/150.jpg)
Add Feature: Is there a Latent Feature that, if the System Knew It, Could Make Better Decisions?
(Diagram: a human can add a new feature to the agent’s observation space)
Ongoing work with Ramtin Keramati
![Page 151: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/151.jpg)
Ongoing work with Ramtin Keramati
Ask for Feature if Exists Latent Variable Model of a State That Could Change Policy
![Page 152: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/152.jpg)
![Page 153: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/153.jpg)
Ongoing work with Ramtin Keramati
Learn Model That Supports Making Good Decisions, Not Perfect Model
![Page 154: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/154.jpg)
Directing Human Experts to Change Actions & Observation Space
(Diagram: the agent issues requests; the human modifies the action and observation spaces)
Ongoing work with Ramtin Keramati
![Page 155: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/155.jpg)
Another Way to Have Humans Teach Agents
Ongoing work with Ramtin Keramati
![Page 156: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/156.jpg)
Ask people to add in new hints where they might help
But system isn’t improving much… what’s going on?
![Page 157: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/157.jpg)
Humans Teaching Agents: Questions
● How do we teach humans to teach agents?
○ People may not have a good model of (a) the agent/learner, or (b) access to content (actions) that can change that agent/learner’s state
○ May need guidance about how best to help the agent/learner
○ Just like people have to learn to be effective human teachers…
○ In our group, a new project on teaching the teachers...
● Lots of open directions here
● Involving teachers is still expensive relative to leveraging prior trajectories of demonstrations or instructions
● Can we make better algorithms to leverage past teaching demonstrations (YouTube has many lecture videos) rather than expert demonstrations? COACH (on a prior slide; MacGlashan et al.) is one step in this direction
![Page 158: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/158.jpg)
RL with help from People
● Reward specification
● Demonstrations
● Rewarding
● Advice / labeling / critiquing
● Teaching
● Shaping the space
*Also see Section 4 of great tutorial by Taylor, Kamar and Hayes: http://interactiveml.net/
![Page 159: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/159.jpg)
Overview
● RL introduction
● RL for people
● RL by the people
● Lots of applications could benefit from RL
● Lots of ways people can help make RL systems better
● Interested in discussing a postdoc opportunity? Email me at [email protected]
![Page 160: and/or by the People Reinforcement Learning for the People · Speed Learning in a New Task Fernández, F., & Veloso, M. (2006, May). Probabilistic policy reuse in a reinforcement](https://reader033.vdocuments.us/reader033/viewer/2022042106/5e845f765297c726322ab301/html5/thumbnails/160.jpg)
Thanks to
and Karan Goel, Travis Mandel, Yun-En Liu, Ramtin Keramati, NSF, ONR, Microsoft, Google, Yahoo & IES