![Page 1: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/1.jpg)
Counterfactuals and RL
Emma Brunskill
RLDM 2019 Tutorial
Assistant Professor, Computer Science, Stanford
Thanks to Christoph Dann, Andrea Zanette, Phil Thomas, and Xinkun Nie for some figures
![Page 2: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/2.jpg)
A Brief Tale of 2 Hamburgers
1/4 1/3
![Page 3: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/3.jpg)
![Page 4: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/4.jpg)
Took > 30s
Took <= 30s
![Page 5: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/5.jpg)
Given ~11k Learners’ Trajectories With Random Action (Levels)
Goal: Learn a New Policy to Maximize Student Persistence
![Page 6: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/6.jpg)
Parallel Legacy of “RL” to Benefit People
https://web.stanford.edu/group/cslipublications/cslipublications/SuppesCorpus/Professional%20Photos/album/1960s/slides/5.html
![Page 7: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/7.jpg)
● Simulator of domain
● Enormous data to train
● Can always try out a new strategy in domain
vs
![Page 8: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/8.jpg)
● Simulator of domain
● Enormous data to train
● Can always try out a new strategy in domain
vs
● No good simulator of human physiology, behavior & learning
● Gathering real data involves impacting real people
![Page 9: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/9.jpg)
Techniques to Minimize & Understand Data Needed to Learn to Make Good Decisions
And if we can learn to make good decisions faster, we can benefit more people
![Page 10: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/10.jpg)
Background: Markov Decision Process
![Page 11: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/11.jpg)
Background: Markov Decision Process
![Page 12: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/12.jpg)
Background: Markov Decision Process
![Page 13: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/13.jpg)
Background: Markov Decision Process Value Function
![Page 14: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/14.jpg)
Background: Markov Decision Process Value Function
![Page 15: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/15.jpg)
Background: Reinforcement Learning
Only observed through samples (experience)
![Page 16: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/16.jpg)
Today: Counterfactual / Batch RL
![Page 17: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/17.jpg)
“What If?” Reasoning Given Past Data
Outcome: 91
Outcome: 92
Outcome: 85
?
![Page 18: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/18.jpg)
Data Is Censored
Outcome: 91
Outcome: 92
Outcome: 85
![Page 19: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/19.jpg)
Need for Generalization
Outcome: 91
Outcome: 92
Outcome: 85
?
![Page 20: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/20.jpg)
Growing Interest in Causal Inference & ML
![Page 21: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/21.jpg)
Batch Policy Optimization: Find a Good Policy That Will Perform Well in the Future
![Page 22: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/22.jpg)
Batch Policy Optimization: Find a Good Policy That Will Perform Well in the Future
![Page 23: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/23.jpg)
Batch Policy Optimization: Find a Good Policy That Will Perform Well in the Future
![Page 24: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/24.jpg)
Batch Policy Optimization: Find a Good Policy That Will Perform Well in the Future
● Today will not be a comprehensive overview; instead it will highlight some of the challenges involved & some approaches with desirable statistical properties: convergence, sample efficiency & bounds
![Page 25: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/25.jpg)
![Page 26: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/26.jpg)
Substantial Literature Focuses on 1 Binary Decision: Treatment Effect Estimation from Old Data
![Page 27: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/27.jpg)
Challenge: Covariate Shift
Different Policies → Different Actions → Different State Distributions
Importance sampling (IS) allows us to reweight data to make it look as if it came from the original distribution
Gottesman et al. Guidelines for reinforcement learning in healthcare. Nature Medicine 2019. Figure by Debbie Maizels/Springer Nature
![Page 28: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/28.jpg)
Policy Evaluation
1. Model based
2. Model free
3. Importance sampling
4. Doubly robust
![Page 29: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/29.jpg)
Learn Dynamics and Reward Models from Data, Evaluate Policy
● (Mannor, Simester, Sun, Tsitsiklis 2007)
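A minimal sketch of this model-based route, assuming a small tabular problem: count transitions to estimate the dynamics and mean rewards, then run iterative policy evaluation inside the learned model. The function names, the toy batch, and the choice to give unseen state-action pairs a value of zero are illustrative assumptions, not taken from the tutorial.

```python
from collections import defaultdict


def fit_tabular_model(transitions):
    """Estimate P(s'|s,a) and mean reward R(s,a) by counting (s, a, r, s_next) tuples."""
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    visits = defaultdict(int)
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += r
        visits[(s, a)] += 1
    P = {sa: {s2: c / visits[sa] for s2, c in nxt.items()}
         for sa, nxt in counts.items()}
    R = {sa: reward_sum[sa] / visits[sa] for sa in visits}
    return P, R


def evaluate_policy(P, R, policy, states, gamma=0.9, sweeps=500):
    """Iterative policy evaluation in the learned model; policy maps state -> action."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        new_V = {}
        for s in states:
            sa = (s, policy[s])
            if sa not in R:
                # Assumption: no data for this pair, so the model says nothing; use 0.
                new_V[s] = 0.0
                continue
            new_V[s] = R[sa] + gamma * sum(p * V.get(s2, 0.0)
                                           for s2, p in P[sa].items())
        V = new_V
    return V


# Hypothetical batch: state 0 moves to state 1 with reward 1; state 1 loops with reward 0.
data = [(0, "go", 1.0, 1), (1, "stay", 0.0, 1), (0, "go", 1.0, 1)]
P, R = fit_tabular_model(data)
V = evaluate_policy(P, R, {0: "go", 1: "stay"}, states=[0, 1])
```

The estimate is only as good as the fitted model: any state-action pair the evaluation policy visits but the batch does not cover is filled in by an assumption rather than by data.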
![Page 30: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/30.jpg)
Model Free Value Function Approximation
● Fitted Q iteration, DQN, LSTD, ...
![Page 31: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/31.jpg)
Counterfactual Reasoning for Policy Evaluation*
Parametric Models
of dynamics, rewards or values fit to data
+ Low variance
- Bias (unless realizable)
![Page 32: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/32.jpg)
Importance Sampling Refresher
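For reference, the standard importance sampling identity is written out below. The symbols are assumptions chosen for this note: $p$ is the distribution of interest, $q$ is the distribution the samples came from, and $f$ is the quantity being averaged. The slide's own notation may differ.

```latex
\mathbb{E}_{x \sim p}[f(x)]
  = \int f(x)\, p(x)\, dx
  = \int f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx
  = \mathbb{E}_{x \sim q}\!\left[\frac{p(x)}{q(x)}\, f(x)\right]
  \approx \frac{1}{n} \sum_{i=1}^{n} \frac{p(x_i)}{q(x_i)}\, f(x_i),
  \qquad x_i \sim q
```

The identity requires $q(x) > 0$ wherever $p(x) f(x) \neq 0$; the variance of the estimate grows when the ratio $p/q$ is large on some samples.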
![Page 33: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/33.jpg)
Importance Sampling for RL Policy Evaluation
![Page 34: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/34.jpg)
Importance Sampling for RL Policy Evaluation
![Page 35: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/35.jpg)
Importance Sampling for RL Policy Evaluation
● First used for RL by Precup, Sutton & Singh 2000. Recent work includes: Thomas, Theocharous, Ghavamzadeh 2015; Thomas and Brunskill 2017; Guo, Thomas, Brunskill 2017; Hanna, Niekum, Stone 2019
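As a rough sketch of the basic trajectory-wise estimator, assuming known behavior-policy probabilities and tabular policies stored as dictionaries: each episode's return is multiplied by the product of per-step probability ratios. The names `is_estimate`, `pi_e`, `pi_b` and the toy data are hypothetical.

```python
def is_estimate(trajectories, pi_e, pi_b, gamma=1.0):
    """Ordinary (trajectory-wise) importance sampling estimate of a policy's value.

    trajectories: list of episodes, each a list of (state, action, reward) tuples
    pi_e, pi_b:   dicts mapping (state, action) -> probability under the
                  evaluation policy and the behavior policy
    """
    total = 0.0
    for episode in trajectories:
        weight, ret, discount = 1.0, 0.0, 1.0
        for s, a, r in episode:
            weight *= pi_e[(s, a)] / pi_b[(s, a)]  # product of likelihood ratios
            ret += discount * r                    # discounted return of the episode
            discount *= gamma
        total += weight * ret
    return total / len(trajectories)


# Hypothetical one-step problem: behavior is uniform, evaluation policy always plays "a".
pi_b = {("s", "a"): 0.5, ("s", "b"): 0.5}
pi_e = {("s", "a"): 1.0, ("s", "b"): 0.0}
episodes = [[("s", "a", 1.0)], [("s", "b", 0.0)]]
est = is_estimate(episodes, pi_e, pi_b)
```

On this toy batch `est` equals 1.0, the true value of always playing "a". With longer horizons the product of ratios can become very large or very small, which is the source of the high variance discussed later in the deck.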
![Page 36: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/36.jpg)
Stationary Importance Sampling (SIS) for RL Policy Evaluation
● Can be approximated and used as part of Q-learning style update
● Hallak & Mannor 2017; Liu, Li, Tang, & Zhou 2018; Gelada & Bellemare 2019
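One common way to write the stationary form, offered here as a sketch rather than the slide's exact equation: instead of a product of per-step ratios along a whole trajectory, each sample is reweighted by a ratio of stationary state distributions and a single action ratio. The symbols $d_{\pi}$, $d_{\pi_b}$ and $w$ are assumed notation.

```latex
R_{\pi}
  = \mathbb{E}_{s \sim d_{\pi_b},\; a \sim \pi_b(\cdot \mid s)}
    \!\left[\, w(s)\, \frac{\pi(a \mid s)}{\pi_b(a \mid s)}\, r(s, a) \right],
  \qquad
  w(s) = \frac{d_{\pi}(s)}{d_{\pi_b}(s)}
```

Here $d_{\pi}$ and $d_{\pi_b}$ denote the stationary state distributions of the evaluation and behavior policies. The ratio $w(s)$ is not known and has to be estimated from data, which is where the approximations in the cited papers come in.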
![Page 37: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/37.jpg)
Counterfactual Reasoning for Policy Evaluation
Parametric Models
of dynamics, rewards or values fit to data
+ Low variance
- Bias (unless realizable)
Importance Sampling
correct mismatch of state-action distribution
+ Unbiased under certain assumptions
- High variance
![Page 38: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/38.jpg)
Doubly Robust (DR) Estimation
• Model + IS-based estimator
• Bandits (Dudik et al. 2011)
reward received
model of reward
![Page 39: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/39.jpg)
Doubly Robust (DR) Estimation
• Model + IS-based estimator
• Bandits (Dudik et al. 2011)
reward received
model of reward
![Page 40: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/40.jpg)
Doubly Robust (DR) Estimation
• Model + IS-based estimator
• Bandits (Dudik et al. 2011)
reward received
model of reward
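A minimal sketch of the doubly robust estimator for bandits, assuming known behavior probabilities: the reward model supplies a baseline, and an importance-weighted term corrects it using the difference between the reward received and the model of reward. The function signature and the toy log are illustrative assumptions.

```python
def dr_bandit_estimate(logs, pi_e, pi_b, r_hat, actions):
    """Doubly robust value estimate for a contextual bandit.

    logs:       list of (context, action, reward) collected under pi_b
    pi_e, pi_b: functions (action, context) -> probability
    r_hat:      function (context, action) -> modelled reward
    """
    total = 0.0
    for x, a, r in logs:
        # Model-based part: expected modelled reward under the evaluation policy.
        model_term = sum(pi_e(b, x) * r_hat(x, b) for b in actions)
        # IS part: reweighted gap between reward received and model of reward.
        correction = pi_e(a, x) / pi_b(a, x) * (r - r_hat(x, a))
        total += model_term + correction
    return total / len(logs)


# Hypothetical log: uniform behavior policy, evaluation policy always plays "a".
actions = ["a", "b"]
logs = [(None, "a", 1.0), (None, "b", 0.0)]
pi_b = lambda a, x: 0.5
pi_e = lambda a, x: 1.0 if a == "a" else 0.0

perfect_model = lambda x, a: 1.0 if a == "a" else 0.0
wrong_model = lambda x, a: 0.0

est_perfect = dr_bandit_estimate(logs, pi_e, pi_b, perfect_model, actions)
est_wrong = dr_bandit_estimate(logs, pi_e, pi_b, wrong_model, actions)
```

Both calls return 1.0 on this toy log: with a perfect model the correction vanishes, and with a wrong model the importance-weighted correction makes up the difference, which is the sense in which the estimator is doubly robust.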
![Page 41: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/41.jpg)
Doubly Robust Estimation for RL
• Jiang and Li (ICML 2016) extended DR to RL
model-based estimate of Q
actual rewards in the dataset
importance weights
model-based estimate of V
![Page 42: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/42.jpg)
Doubly Robust Estimation for RL
• Jiang and Li (ICML 2016) extended DR to RL
• Limitation: the estimator is derived to be unbiased, so it cannot trade a little bias for lower variance
model-based estimate of Q
actual rewards in the dataset
importance weights
model-based estimate of V
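The four annotated ingredients can be combined in a backward recursion over each episode. The sketch below follows the recursive form commonly given for the Jiang and Li estimator; the dictionary-based interface, names, and the two-step toy problem are assumptions made for illustration.

```python
def dr_rl_estimate(trajectories, pi_e, pi_b, q_hat, v_hat, gamma=1.0):
    """Doubly robust off-policy value estimate for episodic RL.

    trajectories: list of episodes, each a list of (state, action, reward)
    pi_e, pi_b:   dicts (state, action) -> probability
    q_hat:        dict (state, action) -> model-based estimate of Q
    v_hat:        dict state -> model-based estimate of V
    """
    total = 0.0
    for episode in trajectories:
        v_dr = 0.0
        # Work backwards: V_DR = V_hat(s) + rho * (r + gamma * V_DR_next - Q_hat(s, a))
        for s, a, r in reversed(episode):
            rho = pi_e[(s, a)] / pi_b[(s, a)]  # importance weight for this step
            v_dr = v_hat[s] + rho * (r + gamma * v_dr - q_hat[(s, a)])
        total += v_dr
    return total / len(trajectories)


# Hypothetical two-step problem: s0 -> s1, action "a" pays 1 and "b" pays 0.
reward = {"a": 1.0, "b": 0.0}
states = ["s0", "s1"]
pi_b = {(s, a): 0.5 for s in states for a in "ab"}
pi_e = {(s, a): (1.0 if a == "a" else 0.0) for s in states for a in "ab"}
episodes = [[("s0", a0, reward[a0]), ("s1", a1, reward[a1])]
            for a0 in "ab" for a1 in "ab"]

# A zero model reduces the recursion to per-decision importance sampling.
zero_q = {(s, a): 0.0 for s in states for a in "ab"}
zero_v = {s: 0.0 for s in states}
est_zero_model = dr_rl_estimate(episodes, pi_e, pi_b, zero_q, zero_v)

# A perfect model of the evaluation policy's Q and V.
true_q = {("s0", "a"): 2.0, ("s0", "b"): 1.0, ("s1", "a"): 1.0, ("s1", "b"): 0.0}
true_v = {"s0": 2.0, "s1": 1.0}
est_true_model = dr_rl_estimate(episodes, pi_e, pi_b, true_q, true_v)
```

Both estimates equal 2.0, the true value of always playing "a". With the perfect model every single episode already evaluates to 2.0, so the importance weights add no variance; with the zero model the individual episodes range from 0 to 6.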
![Page 43: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/43.jpg)
Instead Prioritize Accuracy & Measure with Mean Squared Error
Thomas and Brunskill, ICML 2016
• Trade bias and variance
[Figure: bias and variance of the model-based estimator + the importance sampling estimator]
![Page 44: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/44.jpg)
Two New Off Policy Evaluation Estimators
1. Weighted doubly robust for RL problems
a. Weighted importance sampling often has much lower variance
![Page 45: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/45.jpg)
Two New Off Policy Evaluation Estimators
1. Weighted doubly robust for RL problems
a. Weighted importance sampling often has much lower variance
b. WDR: doubly robust, just use normalized weights!
c. Empirically can give much better estimates
d. Still has good properties (strongly consistent)
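To illustrate only the weight-normalization idea, here is a sketch comparing ordinary and weighted importance sampling at the trajectory level. It is not WDR itself, which applies normalized weights inside the doubly robust recursion. Names and the toy batch are assumptions.

```python
def weights_and_returns(trajectories, pi_e, pi_b, gamma=1.0):
    """Per-episode importance weight and discounted return."""
    weights, returns = [], []
    for episode in trajectories:
        w, g, discount = 1.0, 0.0, 1.0
        for s, a, r in episode:
            w *= pi_e[(s, a)] / pi_b[(s, a)]
            g += discount * r
            discount *= gamma
        weights.append(w)
        returns.append(g)
    return weights, returns


def ordinary_is(trajectories, pi_e, pi_b, gamma=1.0):
    """Divide the weighted returns by the number of episodes."""
    w, g = weights_and_returns(trajectories, pi_e, pi_b, gamma)
    return sum(wi * gi for wi, gi in zip(w, g)) / len(w)


def weighted_is(trajectories, pi_e, pi_b, gamma=1.0):
    """Divide the weighted returns by the sum of the weights instead."""
    w, g = weights_and_returns(trajectories, pi_e, pi_b, gamma)
    total_w = sum(w)
    if total_w == 0.0:
        return 0.0  # Assumption: no episode is consistent with pi_e, return 0.
    return sum(wi * gi for wi, gi in zip(w, g)) / total_w


# Hypothetical batch in which the rare action "a" happens to be over-represented.
pi_b = {("s", "a"): 0.1, ("s", "b"): 0.9}
pi_e = {("s", "a"): 1.0, ("s", "b"): 0.0}
episodes = [[("s", "a", 1.0)], [("s", "a", 1.0)], [("s", "b", 0.0)]]

ois = ordinary_is(episodes, pi_e, pi_b)
wis = weighted_is(episodes, pi_e, pi_b)
```

Here the ordinary estimate is about 6.67, far outside the range of achievable returns, while the weighted estimate is 1.0. Normalizing keeps the estimate inside the range of observed returns at the cost of a small bias that vanishes as the batch grows.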
![Page 46: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/46.jpg)
Price of Robustness?
Le, Voloshin, Yue (2019)
![Page 47: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/47.jpg)
Two New Off Policy Evaluation Estimators
1. Weighted doubly robust for RL problems
2. Model And Guided Importance sampling Combining estimator
a. Directly try to minimize mean squared error by balancing between value and importance sampling estimate
b. Mean squared error is a function of bias and variance
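The dependence of mean squared error on bias and variance is the standard decomposition below, written with assumed notation: $\hat{V}$ is the estimator and $V^{\pi}$ is the true value of the evaluation policy.

```latex
\mathrm{MSE}(\hat{V})
  = \mathbb{E}\!\left[(\hat{V} - V^{\pi})^{2}\right]
  = \underbrace{\left(\mathbb{E}[\hat{V}] - V^{\pi}\right)^{2}}_{\text{bias}^{2}}
  + \underbrace{\mathbb{E}\!\left[(\hat{V} - \mathbb{E}[\hat{V}])^{2}\right]}_{\text{variance}}
```

An estimator with a small bias and low variance can therefore have lower mean squared error than an unbiased estimator with high variance.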
![Page 48: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/48.jpg)
Blend IS-Based & Model Based Estimators to Directly Min Mean Squared Error
[Figure: bias and variance across the 1-step, 2-step, …, N-step estimates, blended with weights x1, x2, …, xN]
Thomas and Brunskill, ICML 2016
![Page 49: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/49.jpg)
Model and Guided Importance Sampling combining (MAGIC) Estimator
Estimated policy value using particular weighting of model estimate and importance sampling estimate
Thomas and Brunskill, ICML 2016
• Solve quadratic program
• Strongly consistent (under similar set of assumptions as WDR)
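A sketch of the blending step for the simplest case of two estimators (one model-based, one importance-sampling-based), assuming their covariance matrix and bias vector have already been estimated. With weights x on the simplex, the objective is x^T (Omega + b b^T) x; for two estimators this reduces to a one-dimensional quadratic that can be minimized in closed form and clipped to [0, 1]. The function names and numbers are hypothetical, and the full method blends more than two partial estimates.

```python
def blend_weight(cov, bias):
    """Weight w on estimator 0 (and 1 - w on estimator 1) minimizing estimated MSE.

    cov:  2x2 covariance matrix of the two estimators (nested lists)
    bias: length-2 list of estimated biases
    """
    # Second-moment matrix M = cov + bias * bias^T; the objective is x^T M x.
    m = [[cov[i][j] + bias[i] * bias[j] for j in range(2)] for i in range(2)]
    denom = m[0][0] - 2.0 * m[0][1] + m[1][1]
    if denom <= 0.0:
        # Degenerate case: objective is linear in w, so pick the better endpoint.
        return 1.0 if m[0][0] <= m[1][1] else 0.0
    w = (m[1][1] - m[0][1]) / denom
    return min(1.0, max(0.0, w))  # clip to the simplex


def blended_mse(w, cov, bias):
    """Estimated MSE of w * estimator0 + (1 - w) * estimator1."""
    x = [w, 1.0 - w]
    variance = sum(x[i] * x[j] * cov[i][j] for i in range(2) for j in range(2))
    blended_bias = sum(x[i] * bias[i] for i in range(2))
    return variance + blended_bias ** 2


# Hypothetical numbers: a biased, low-variance model estimate and an
# unbiased, high-variance importance sampling estimate.
cov = [[0.01, 0.0], [0.0, 1.0]]
bias = [0.5, 0.0]

w = blend_weight(cov, bias)
mse_blend = blended_mse(w, cov, bias)
mse_model = blended_mse(1.0, cov, bias)
mse_is = blended_mse(0.0, cov, bias)
```

With these numbers the blend puts roughly 0.79 of the weight on the model estimate and reaches an estimated MSE of about 0.21, lower than either estimator alone (0.26 and 1.0).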
![Page 50: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/50.jpg)
Estimating Bias & Covariance
• Estimated covariance: sample covariance matrix
• Estimated bias:
– May be as hard as estimating true policy value
[Figure labels: importance sampling estimate, model based estimate, estimate of bias]
Thomas and Brunskill, ICML 2016
![Page 51: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/51.jpg)
Gridworld Simulation: Needed Only 10% of the Data to Learn a Good Estimate of New Policy’s Value
[Figure: estimators compared: IS-based, Model, DR, MAGIC, MAGIC-B; x-axis: Number of Histories]
Thomas and Brunskill 2016
![Page 52: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/52.jpg)
Gridworld Simulation: Needed Only 10% of the Data to Learn a Good Estimate of New Policy’s Value
[Figure: estimators compared: IS-based, Model, DR, MAGIC, MAGIC-B; x-axis: Number of Histories]
Thomas and Brunskill 2016
![Page 53: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/53.jpg)
Sepsis treatment example
(Gottesman et al. arxiv 2018)
● Actions: IV fluids & vasopressors
● Reward: +100 survival, -100 death
● State space: 750 (discretized)
● 19,275 ICU patients
WDR in Health Example
![Page 54: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/54.jpg)
Sepsis treatment example
(Gottesman et al. arxiv 2018)
● Actions: IV fluids & vasopressors
● Reward: +100 survival, -100 death
● State space: 750 (discretized)
● 19,275 ICU patients
Our weighted DR (WDR) was the only consistent off-policy estimator tried (PDDR, PDIS, WPDIS, WDR) that could find an optimal policy that it estimated would improve over the prior
WDR in Health Example
![Page 55: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/55.jpg)
Sepsis treatment example
(Gottesman et al. arxiv 2018)
● Actions: IV fluids & vasopressors
● Reward: +100 survival, -100 death
● State space: 750 (discretized)
● 19,275 ICU patients
Our weighted DR (WDR) was the only consistent off-policy estimator tried (PDDR, PDIS, WPDIS, WDR) that could find an optimal policy that it estimated would improve over the prior
Under the (common) assumption of no confounding, which is not likely to hold in practice
WDR in Health Example
![Page 56: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/56.jpg)
Policy Evaluation
1. Model based
2. Model free
3. Importance sampling
4. Doubly robust
![Page 57: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/57.jpg)
Policy Optimization: Find Good Policy to Deploy
![Page 58: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/58.jpg)
Learn Dynamics and Reward Models from Data, Plan
![Page 59: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/59.jpg)
Mandel, Liu, Brunskill, Popovic 2014
Better Dynamics/Reward Models for Existing Data May Not Lead to Better Policies for Future Use
![Page 60: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/60.jpg)
Importance Sampling Estimators Unbiased for Policy Evaluation
![Page 61: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/61.jpg)
Importance Sampling Estimators Unbiased for Policy Evaluation
• But using them for policy selection can lead to poor results
![Page 62: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/62.jpg)
Fairness of Importance Sampling-Based Estimators for Policy Selection
• Unfortunately even if IS estimates are unbiased, policy selection using them can be unfair
• Here define unfair as:
– Given two policies π1 and π2
– Where true performance V(π1) > V(π2)
– Choose π2 more than 50% of time
Doroudi, Thomas and Brunskill, Best Paper, UAI 2017
![Page 63: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/63.jpg)
Max over Estimates with Differing Variances
[Figure: Value for Policy 1 and Policy 2]
Doroudi, Thomas and Brunskill, Best Paper, UAI 2017
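A toy illustration of how taking the max over unbiased estimates can favor the worse policy, worked out exactly rather than by simulation. The two-action bandit, its probabilities, and its rewards are invented for this sketch and are not from the cited paper.

```python
# Hypothetical one-step problem with a single logged sample.
pi_b = {"a": 0.1, "b": 0.9}    # behavior policy: rarely plays "a"
reward = {"a": 1.0, "b": 0.5}  # so "always a" is truly better than "always b"


def is_value(target_action, logged_action):
    """One-sample IS estimate for the policy that always plays target_action."""
    if logged_action != target_action:
        return 0.0  # importance weight is zero
    return reward[logged_action] / pi_b[logged_action]


expected = {"a": 0.0, "b": 0.0}
prob_pick_worse = 0.0
for logged, p in pi_b.items():
    est_a = is_value("a", logged)
    est_b = is_value("b", logged)
    expected["a"] += p * est_a
    expected["b"] += p * est_b
    if est_b > est_a:
        # Selecting the policy with the larger estimate picks "always b".
        prob_pick_worse += p
```

Both estimates are unbiased (their expectations are 1.0 and 0.5), yet the worse policy is selected with probability 0.9, because the better policy's estimate is zero most of the time and very large only rarely.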
![Page 64: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/64.jpg)
Importance Sampling Favors Myopic Policies
Doroudi, Thomas and Brunskill, Best Paper, UAI 2017
![Page 65: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/65.jpg)
Quest for Batch Policy Optimization with Generalization Guarantees → SRM for Reinforcement Learning
![Page 66: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/66.jpg)
Quest for Batch Policy Optimization with Generalization Guarantees → SRM for Reinforcement Learning
![Page 67: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/67.jpg)
Challenge: Good Error Bound Analysis
![Page 68: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/68.jpg)
Challenge: Good Error Bound Analysis
● Importance sampling bounds (e.g. Thomas et al, 2015) ignore hypothesis class structure & typically require very large n
![Page 69: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/69.jpg)
Challenge: Good Error Bound Analysis
● Importance sampling bounds (e.g. Thomas et al, 2015) ignore hypothesis class structure & typically require very large n
● Kernel function & averager approaches (e.g. Ormoneit & Sen 2002) can need # samples exponential in input state dimension
![Page 70: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/70.jpg)
Challenge: Good Error Bound Analysis
● Importance sampling bounds (e.g. Thomas et al, 2015) ignore hypothesis class structure & typically require very large n
● Kernel function & averager approaches (e.g. Ormoneit & Sen 2002) can need # samples exponential in input state dimension
● FQI bounds (e.g. Munos 2003; Munos & Szepesvári 2008; Antos et al., 2008; Lazaric et al., 2012; Farahmand et al., 2009; Maillard et al., 2010; Le, Voloshin, Yue 2019; Chen & Jiang 2019)
- Require stronger assumptions (realizability and bounds on the inherent Bellman error)
- If not realizable, FQI bounds depend on unknown quantities
![Page 71: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/71.jpg)
Challenge: Good Error Bound Analysis
● Importance sampling bounds (e.g. Thomas et al., 2015)
● Kernel function (e.g. Ormoneit & Sen 2002)
● FQI bounds (e.g. Munos 2003; Munos & Szepesvári 2008; Antos et al., 2008; Lazaric et al., 2012; Farahmand et al., 2009; Maillard et al., 2010; Le, Voloshin, Yue 2019; Chen & Jiang 2019)
- Require stronger assumptions (realizability and bounds on the inherent Bellman error)
- If not realizable, FQI bounds depend on unknown quantities
● Primal dual approaches (e.g. Dai, Shaw, Li, Xiao, He, Liu, Chen, Song 2018) are promising and have similar dependencies
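To illustrate why importance sampling bounds tend to need large n, here is a minimal sketch of the ordinary per-trajectory importance sampling estimator. The trajectory layout and the `pi_e` / `pi_b` callables are assumed interfaces for illustration, not the setup of any cited paper.

```python
import numpy as np

def trajectory_is_estimate(trajectories, pi_e, pi_b, gamma=1.0):
    """Ordinary importance sampling estimate of the evaluation policy's value.

    trajectories: list of trajectories, each a list of (state, action, reward).
    pi_e, pi_b:   hypothetical callables (state, action) -> probability under
                  the evaluation and behavior policies respectively.
    """
    estimates = []
    for traj in trajectories:
        weight, discounted_return = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            # The weight is a product of per-step ratios, so its variance
            # can grow quickly with the trajectory length.
            weight *= pi_e(s, a) / pi_b(s, a)
            discounted_return += (gamma ** t) * r
        estimates.append(weight * discounted_return)
    return float(np.mean(estimates))
```

The estimator uses no model or value function class at all, which is why the associated bounds cannot exploit hypothesis class structure.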
![Page 72: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/72.jpg)
Aim: Strong Generalization Guarantees on Policy Performance, Alternative: Find Good in Class Policy Given Past Data
![Page 73: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/73.jpg)
Direct Batch Policy Search & Optimization
● Despite popularity, relatively little success in direct policy optimization using offline / batch data
● Correcting for the mismatch in state distributions can yield high variance (“alternative life” from Sutton / White terminology)
● Algorithmically often just correct for 1 step (e.g. Degris, White, & Sutton 2012)
![Page 74: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/74.jpg)
Off-Policy Policy Gradient with State Distribution Correction
● Leverage Markov structure idea of stationary importance sampling for RL
Monday RLDM Poster 114 & Liu, Swaminathan, Agarwal, B UAI 2019
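As a rough sketch of what a state distribution correction looks like (the notation is assumed here, not taken from the slide): let $\mu$ be the behavior policy, $d_\mu$ and $d_{\pi_\theta}$ the state visitation distributions of $\mu$ and of the target policy $\pi_\theta$, and $w(s) = d_{\pi_\theta}(s) / d_\mu(s)$. Rewriting the policy gradient theorem as an expectation over batch data gives:

```latex
\nabla_\theta J(\pi_\theta) \;\propto\;
\mathbb{E}_{s \sim d_\mu,\; a \sim \mu(\cdot \mid s)}
\left[
  w(s)\,
  \frac{\pi_\theta(a \mid s)}{\mu(a \mid s)}\,
  Q^{\pi_\theta}(s, a)\,
  \nabla_\theta \log \pi_\theta(a \mid s)
\right],
\qquad
w(s) = \frac{d_{\pi_\theta}(s)}{d_\mu(s)}
```

Correcting for only one step corresponds to keeping the action ratio $\pi_\theta / \mu$ but setting $w(s) \equiv 1$. In practice $w$ and $Q^{\pi_\theta}$ are unknown and have to be estimated from the batch.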
![Page 75: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/75.jpg)
First Result that Can Provably Converge to Local Solution with Off Policy Batch Policy Gradient
Monday RLDM Poster 114 & Liu, Swaminathan, Agarwal, B UAI 2019
![Page 76: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/76.jpg)
Aim: Strong Generalization Guarantees on Policy Performance, Alternative: Guarantee Find Best in Class Policy
![Page 77: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/77.jpg)
1st Guarantees on Performance of Policy Chosen vs. Best in Class for When-to-Treat Policies (w/ Xinkun Nie & Stefan Wager, arXiv)
![Page 78: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/78.jpg)
Example: Linear Thresholding Policies
Starting HIV treatment as soon as CD4 count dips below 200
[Figure: CD4 count over time, with labels "HIV Infection", "HIV Treatment", "2-10 years", and the threshold 200 marked as the policy parameter]
Source: https://alv.mizoapp.com/cd4count/
![Page 79: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/79.jpg)
Example: Linear Thresholding Policies
Starting HIV treatment as soon as CD4 count dips below 200
Stopping treatment as soon as health metric above line
[Figure: CD4 count over time, with labels "HIV Infection", "HIV Treatment", "2-10 years", and the threshold 200 marked as the policy parameter; regions labeled "Treatment On" and "Treatment Off" separated by a line labeled "policy"]
Source: https://alv.mizoapp.com/cd4count/
![Page 80: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/80.jpg)
Selecting a When to Treat Policy
![Page 81: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/81.jpg)
Use an Advantage Decomposition
[Equation not recovered; one term is labeled "never acting"]
![Page 82: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/82.jpg)
Use a Doubly Robust Advantage Decomposition
[Equation not recovered; one term is labeled "never acting"]
● Estimate treatment effect with a doubly robust estimator given available dataset D
● Can learn “nuisance” parameters (propensity weights and value function estimates) at a slower rate and still get sqrt(n) regret bounds, under various assumptions
● Insights from orthogonal / double machine learning ideas from econometrics
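To make the "doubly robust" idea concrete, here is a minimal sketch of the standard augmented inverse propensity weighted (AIPW) score for a single binary treatment decision. This is a simplified one-step stand-in, not the multi-step advantage estimator from the when-to-treat work; the function and argument names are illustrative assumptions.

```python
import numpy as np

def aipw_scores(y, w, e_hat, mu0_hat, mu1_hat):
    """Per-sample doubly robust scores for a binary treatment effect.

    y:       observed outcomes
    w:       treatment indicators (0 or 1)
    e_hat:   estimated propensity P(w = 1 | x), a nuisance estimate
    mu0_hat: estimated outcome under control, a nuisance estimate
    mu1_hat: estimated outcome under treatment, a nuisance estimate
    """
    y, w = np.asarray(y, dtype=float), np.asarray(w, dtype=float)
    e_hat = np.asarray(e_hat, dtype=float)
    mu0_hat = np.asarray(mu0_hat, dtype=float)
    mu1_hat = np.asarray(mu1_hat, dtype=float)
    # Model-based effect plus inverse-propensity-weighted residual corrections.
    return (mu1_hat - mu0_hat
            + w * (y - mu1_hat) / e_hat
            - (1.0 - w) * (y - mu0_hat) / (1.0 - e_hat))
```

Averaging the scores estimates the treatment effect. The average stays consistent if either the propensity estimate or the outcome estimates are accurate, which is the property that lets nuisance parameters be learned at slower rates.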
![Page 83: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/83.jpg)
Keeping a health metric above 0
• Health metric evolves with Brownian motion
• Treatment nudges it up, but at a cost
• Always start with treatment ON
• Optimal stopping time of treatment?
• Unknown propensity
• Linear Decision Rules (#covariates = 2)
• Observe states + noise
Horizon
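A toy version of this simulation setting can be sketched as below. Every number and modeling choice is an assumption made for illustration: the drift, cost, and noise values, the reward of 1 per step with the metric above 0, and the choice of the noisy observation and the time step as the two covariates. None of these come from the slide.

```python
import numpy as np

def simulate_episode(theta, horizon=20, drift=0.5, cost=0.2,
                     obs_noise=0.1, rng=None):
    """Simulate one episode under a linear stopping rule.

    Treatment starts ON and is switched off permanently the first time
    theta[0] * obs + theta[1] * t + theta[2] > 0, where obs is a noisy
    observation of the health metric. Returns (total_reward, stop_time);
    stop_time equals horizon if treatment is never stopped.
    """
    rng = np.random.default_rng() if rng is None else rng
    health, treating, total, stop_time = 0.0, True, 0.0, horizon
    for t in range(horizon):
        obs = health + obs_noise * rng.normal()  # states observed with noise
        if treating and theta[0] * obs + theta[1] * t + theta[2] > 0:
            treating, stop_time = False, t
        # Random-walk (discretized Brownian) step, nudged up while treating.
        health += (drift if treating else 0.0) + rng.normal()
        # Reward for staying above 0, minus the per-step treatment cost.
        total += float(health > 0) - (cost if treating else 0.0)
    return total, stop_time
```

Averaging `simulate_episode` over many seeds for different `theta` values gives a simple way to compare linear stopping rules in this toy setting.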
![Page 84: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/84.jpg)
Fitted Q Iteration Policy Less Interpretable
![Page 85: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/85.jpg)
Quest for Batch Policy Optimization with Generalization Guarantees
→ SRM for Reinforcement Learning
& many colleagues’ work (Murphy, Jiang, Yue, Munos, Lazaric, Szepesvari…)
→ Much to be done, including to relax common assumptions
AAMAS 2014, AAAI 2015, AAAI 2016, ICML 2016, IJCAI 2016, AAAI 2017, NeurIPS 2017, NeurIPS 2018, ICML 2019
AAAI 2015, AAAI 2016, L@S 2017, UAI 2017, UAI 2019, (Nie, B, Wager arxiv)
Nie, B, Wager arxiv; Thomas, da Silva, Barto, B, arxiv
![Page 86: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/86.jpg)
Power of Models for Off Policy Evaluation?
● Model based approaches can be provably more efficient than model free value function approaches for online evaluation or control
Sun, Jiang, Krishnamurthy, Agarwal, Langford COLT 2019
Tu & Recht COLT 2019
![Page 87: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/87.jpg)
Models Fit for Off Policy Evaluation May Benefit from Different Loss Function
Liu, Gottesman, Raghu, Komorowski, Faisal, Doshi-Velez, Brunskill NeurIPS 2018
![Page 88: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/88.jpg)
Given ~11k Learners’ Trajectories With Random Action (Levels)
Learn a Policy that Increases Student Persistence
(Mandel, Liu, B, Popovic 2014)
![Page 89: Assistant Professor, Computer Science, Stanford Thanks to ......RLDM 2019 Tutorial Assistant Professor, Computer Science, Stanford Thanks to Christoph Dann, Andrea Zanette, Phil Thomas,](https://reader036.vdocuments.us/reader036/viewer/2022071606/6143ebf56cc38f259c25d831/html5/thumbnails/89.jpg)
Given ~11k Learners’ Trajectories With Random Action (Levels)
Learned a Policy that Increased Student Persistence by +30%
(Mandel, Liu, B, Popovic 2014)