TRANSCRIPT
High Confidence Off-Policy Evaluation (HCOPE)
Philip Thomas
CMU 15-899E, Real Life Reinforcement Learning, Fall 2015
Sequential Decision Problems
The agent sends an action, 𝑎, to the environment, and the environment returns a state, 𝑠.
• Actions may be discrete or continuous.
• States may be discrete or continuous, and fully or partially observable.
Example: Digital Marketing
Example: Educational Games
![Page 5: High Confidence Off-Policy Evaluation (HCOPE)ebrun/15889e/lectures/thomas_lecture.pdf · Digital Marketing Example •10 real-valued features •Two groups of campaigns to choose](https://reader033.vdocuments.us/reader033/viewer/2022050601/5fa8d25da9825405236e05dc/html5/thumbnails/5.jpg)
Example: Decision Support Systems
![Page 6: High Confidence Off-Policy Evaluation (HCOPE)ebrun/15889e/lectures/thomas_lecture.pdf · Digital Marketing Example •10 real-valued features •Two groups of campaigns to choose](https://reader033.vdocuments.us/reader033/viewer/2022050601/5fa8d25da9825405236e05dc/html5/thumbnails/6.jpg)
Example: Gridworld
![Page 7: High Confidence Off-Policy Evaluation (HCOPE)ebrun/15889e/lectures/thomas_lecture.pdf · Digital Marketing Example •10 real-valued features •Two groups of campaigns to choose](https://reader033.vdocuments.us/reader033/viewer/2022050601/5fa8d25da9825405236e05dc/html5/thumbnails/7.jpg)
Example: Mountain Car
![Page 8: High Confidence Off-Policy Evaluation (HCOPE)ebrun/15889e/lectures/thomas_lecture.pdf · Digital Marketing Example •10 real-valued features •Two groups of campaigns to choose](https://reader033.vdocuments.us/reader033/viewer/2022050601/5fa8d25da9825405236e05dc/html5/thumbnails/8.jpg)
Reinforcement Learning Algorithms
• Sarsa
• Q-learning
• LSPI
• Fitted Q Iteration
• REINFORCE
• Residual Gradient
• Continuous-Time Actor-Critic
• Value Gradient
• POWER
• PILCO
• PIPI
• Policy Gradient
• DQN
• Double Q-Learning
• Deterministic Policy Gradient
• NAC-LSTD
• INAC
• Average-Reward INAC
• Unbiased NAC
• Projected NAC
• Risk-sensitive policy gradient
• Natural Sarsa
• PGPE / PGPE-SyS
• True Online
• GTD/TDC
• ARP
• GPTD
• Auto-Actor Auto-Critic
• Approximate Value Iteration
If you apply an existing method, do you have confidence that it will work?
Notation
• 𝑠: state
• 𝑎: action
• 𝑆𝑡, 𝐴𝑡: state and action at time 𝑡
• 𝜋(𝑎|𝑠) = Pr(𝐴𝑡 = 𝑎 | 𝑆𝑡 = 𝑠)
• 𝜏 = (𝑆0, 𝐴0, 𝑆1, …, 𝑆𝐿, 𝐴𝐿): a trajectory
• 𝐺(𝜏) ∈ [0, 1]: the return of trajectory 𝜏
• 𝜌(𝜋) = 𝐄[𝐺(𝜏) | 𝜏 ∼ 𝜋]: the performance of policy 𝜋
Two Goals:
• High confidence off-policy evaluation (HCOPE)
  • Inputs: historical data 𝒟, proposed policy 𝜋𝑒, confidence level 𝛿.
  • Output: a 1 − 𝛿 confidence lower bound on 𝜌(𝜋𝑒).
• Safe policy improvement (SPI)
  • Inputs: historical data 𝒟, performance baseline 𝜌−, confidence level 𝛿.
  • Output: an improved* policy 𝜋.
*The probability that 𝜋’s performance is below 𝜌− is at most 𝛿.
High Confidence Off-Policy Evaluation
• Historical data: 𝒟 = {(𝜏𝑖, 𝜋𝑖)}_{i=1}^{n}, where 𝜏𝑖 ∼ 𝜋𝑖
• Evaluation policy, 𝜋𝑒
• Confidence level, 𝛿
• Compute HCOPE(𝜋𝑒 | 𝒟, 𝛿) such that
  Pr(𝜌(𝜋𝑒) ≥ HCOPE(𝜋𝑒 | 𝒟, 𝛿)) ≥ 1 − 𝛿
Importance Sampling
• We would like to estimate 𝜃 ≔ 𝐄[𝑓(𝑥) | 𝑥 ∼ 𝑝]
• Monte Carlo estimator: sample 𝑋1, …, 𝑋𝑛 from 𝑝 and set:
  𝜃̂𝑛 ≔ (1/𝑛) ∑_{i=1}^{n} 𝑓(𝑋𝑖)
• Nice properties:
  • The Monte Carlo estimator is strongly consistent: 𝜃̂𝑛 → 𝜃 almost surely.
  • The Monte Carlo estimator is unbiased for all 𝑛 ≥ 1: 𝐄[𝜃̂𝑛] = 𝜃
Importance Sampling
• We would like to estimate 𝜃 ≔ 𝐄[𝑓(𝑥) | 𝑥 ∼ 𝑝]
• … but we can only sample from a distribution, 𝑞, not 𝑝.
• Assume: if 𝑞(𝑥) = 0 then 𝑓(𝑥)𝑝(𝑥) = 0. Then:
  𝐄[𝑓(𝑥) | 𝑥 ∼ 𝑝] = ∑_{𝑥 ∈ supp(𝑝)} 𝑝(𝑥)𝑓(𝑥)
                = ∑_{𝑥 ∈ supp(𝑞)} 𝑝(𝑥)𝑓(𝑥)
                = ∑_{𝑥 ∈ supp(𝑞)} 𝑞(𝑥) (𝑝(𝑥)/𝑞(𝑥)) 𝑓(𝑥)
                = 𝐄[(𝑝(𝑥)/𝑞(𝑥)) 𝑓(𝑥) | 𝑥 ∼ 𝑞]
Importance Sampling
• We would like to estimate 𝜃 ≔ 𝐄[𝑓(𝑥) | 𝑥 ∼ 𝑝]
• Importance sampling estimator: sample 𝑋1, …, 𝑋𝑛 from 𝑞 and set:
  𝜃̂𝑛 ≔ (1/𝑛) ∑_{i=1}^{n} (𝑝(𝑋𝑖)/𝑞(𝑋𝑖)) 𝑓(𝑋𝑖)
• Nice properties (under mild assumptions):
  • The importance sampling estimator is strongly consistent: 𝜃̂𝑛 → 𝜃 almost surely.
  • The importance sampling estimator is unbiased for all 𝑛 ≥ 1: 𝐄[𝜃̂𝑛] = 𝜃
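The two estimators above can be sketched in Python. This is only an illustration: the function names, the toy discrete distributions, and the choice of 𝑓(𝑥) = 𝑥 are all mine, not from the lecture.

```python
import random

def monte_carlo(f, sample_p, n):
    """Plain Monte Carlo estimate of E[f(x) | x ~ p]: average f(X_i), X_i ~ p."""
    return sum(f(sample_p()) for _ in range(n)) / n

def importance_sampling(f, p_pdf, q_pdf, sample_q, n):
    """Importance sampling estimate of E[f(x) | x ~ p] using samples from q:
    average (p(X_i)/q(X_i)) * f(X_i), X_i ~ q."""
    total = 0.0
    for _ in range(n):
        x = sample_q()
        total += (p_pdf(x) / q_pdf(x)) * f(x)
    return total / n

# Toy discrete example over {0, 1, 2} with f(x) = x.
# True value: E_p[f] = 0*0.2 + 1*0.5 + 2*0.3 = 1.1
p = {0: 0.2, 1: 0.5, 2: 0.3}
q = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}   # we can only sample from q
sample_p = lambda: random.choices([0, 1, 2], weights=[p[x] for x in (0, 1, 2)])[0]
sample_q = lambda: random.choice([0, 1, 2])

random.seed(0)
mc = monte_carlo(lambda x: x, sample_p, 100_000)
is_est = importance_sampling(lambda x: x, p.get, q.get, sample_q, 100_000)
```

Both estimates should land near the true value 1.1, which is the point of the two slides: reweighting samples from 𝑞 by 𝑝(𝑥)/𝑞(𝑥) recovers the expectation under 𝑝.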
Importance Sampling
𝜌(𝜋𝑒) = 𝐄_{𝜏∼𝜋𝑒}[𝐺(𝜏)] = 𝐄_{𝜏∼𝜋𝑏}[(Pr(𝜏|𝜋𝑒) / Pr(𝜏|𝜋𝑏)) 𝐺(𝜏)]
Here Pr(𝜏|𝜋) is the probability of trajectory 𝜏 under policy 𝜋; 𝜋𝑒 is the evaluation policy and 𝜋𝑏 is the behavior policy.
Importance Sampling for Reinforcement Learning
• 𝜌(𝜋𝑒) = 𝐄_{𝜏∼𝜋𝑒}[𝐺(𝜏)] = 𝐄_{𝜏∼𝜋𝑏}[(Pr(𝜏|𝜋𝑒) / Pr(𝜏|𝜋𝑏)) 𝐺(𝜏)]
• The state-transition terms cancel, so:
  (Pr(𝜏|𝜋𝑒) / Pr(𝜏|𝜋𝑏)) 𝐺(𝜏)
  = [∏_{t=0}^{L} Pr(𝑆𝑡 | past) Pr(𝐴𝑡 | past, 𝜋𝑒)] / [∏_{t=0}^{L} Pr(𝑆𝑡 | past) Pr(𝐴𝑡 | past, 𝜋𝑏)] · 𝐺(𝜏)
  = [∏_{t=0}^{L} Pr(𝐴𝑡 | past, 𝜋𝑒)] / [∏_{t=0}^{L} Pr(𝐴𝑡 | past, 𝜋𝑏)] · 𝐺(𝜏)
  = [∏_{t=0}^{L} 𝜋𝑒(𝐴𝑡|𝑆𝑡) / 𝜋𝑏(𝐴𝑡|𝑆𝑡)] 𝐺(𝜏)
• The importance weighted return:
  𝜌̂(𝜋𝑒, 𝜏, 𝜋𝑏) = [∏_{t=0}^{L} 𝜋𝑒(𝐴𝑡|𝑆𝑡) / 𝜋𝑏(𝐴𝑡|𝑆𝑡)] 𝐺(𝜏)
(D. Precup, R. S. Sutton, and S. Dasgupta, 2001)
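The importance weighted return above can be sketched directly in Python. The function name and the toy one-state policies below are mine, for illustration only:

```python
def importance_weighted_return(steps, G, pi_e, pi_b):
    """Importance weighted return:
    rho_hat = [prod_{t=0}^{L} pi_e(A_t | S_t) / pi_b(A_t | S_t)] * G(tau).

    steps: sequence of (state, action) pairs (S_t, A_t)
    G: the trajectory's return, in [0, 1]
    pi_e(a, s), pi_b(a, s): action probabilities under each policy
    Requires pi_b(a, s) > 0 for every observed (s, a).
    """
    weight = 1.0
    for s, a in steps:
        weight *= pi_e(a, s) / pi_b(a, s)
    return weight * G

# Toy example: one state, two actions; pi_b is uniform, pi_e prefers action 0.
pi_b = lambda a, s: 0.5
pi_e = lambda a, s: 0.8 if a == 0 else 0.2
steps = [("s", 0), ("s", 0), ("s", 1)]
# weight = (0.8/0.5) * (0.8/0.5) * (0.2/0.5) = 1.6 * 1.6 * 0.4 = 1.024
rho_hat = importance_weighted_return(steps, G=1.0, pi_e=pi_e, pi_b=pi_b)
```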
Per-Decision Importance Sampling
• Use importance sampling to estimate each 𝑅𝑡.
• Still an unbiased and strongly consistent estimator of 𝜌(𝜋𝑒).
• Often has lower variance than ordinary importance sampling.
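A sketch of the per-decision idea, under an assumption I am adding for concreteness: the return decomposes as 𝐺(𝜏) = ∑𝑡 𝑅𝑡 with no discounting, so each reward 𝑅𝑡 is weighted only by the action probabilities up to time 𝑡 (the function name is mine):

```python
def pdis_return(steps, rewards, pi_e, pi_b):
    """Per-decision importance sampling estimate for one trajectory:
        sum_t [prod_{t'<=t} pi_e(A_t'|S_t') / pi_b(A_t'|S_t')] * R_t
    Assumes G(tau) = sum_t R_t with no discounting.
    """
    weight = 1.0
    total = 0.0
    for (s, a), r in zip(steps, rewards):
        weight *= pi_e(a, s) / pi_b(a, s)   # weight up to and including time t
        total += weight * r
    return total

# Same toy policies as before; reward arrives only at the first step.
pi_b = lambda a, s: 0.5
pi_e = lambda a, s: 0.8 if a == 0 else 0.2
steps = [("s", 0), ("s", 0), ("s", 1)]
pdis = pdis_return(steps, [1.0, 0.0, 0.0], pi_e, pi_b)
```

Because the reward at time 𝑡 ignores later action-probability ratios, rewards that arrive early are not multiplied by the full trajectory weight, which is one intuition for the variance reduction the slide mentions.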
The HCOPE pipeline takes historical data 𝒟, the proposed policy 𝜋𝑒, and a confidence level 𝛿, and outputs a 1 − 𝛿 confidence lower bound on 𝜌(𝜋𝑒). Importance sampling first converts (𝒟, 𝜋𝑒) into the importance weighted returns {𝜌̂(𝜋𝑒, 𝜏𝑖, 𝜋𝑖)}_{i=1}^{n}, using
𝜌̂(𝜋𝑒, 𝜏, 𝜋𝑏) = [∏_{t=0}^{L} 𝜋𝑒(𝐴𝑡|𝑆𝑡) / 𝜋𝑏(𝐴𝑡|𝑆𝑡)] 𝐺(𝜏).
What remains is the "?" step: converting these returns and 𝛿 into the lower bound.
Chernoff-Hoeffding Inequality
• Let 𝑋1, …, 𝑋𝑛 be 𝑛 independent identically distributed random variables such that 𝑋𝑖 ∈ [0, 𝑏].
• Then with probability at least 1 − 𝛿:
  𝐄[𝑋𝑖] ≥ (1/𝑛) ∑_{i=1}^{n} 𝑋𝑖 − 𝑏 √(ln(1/𝛿) / (2𝑛))
With probability at least 1 − 𝛿:
𝐄[𝑋𝑖] ≥ (1/𝑛) ∑_{i=1}^{n} 𝑋𝑖 − 𝑏 √(ln(1/𝛿) / (2𝑛))
Applying this with 𝑋𝑖 = 𝜌̂(𝜋𝑒, 𝜏𝑖, 𝜋𝑖):
𝜌(𝜋𝑒) = 𝐄[𝜌̂(𝜋𝑒, 𝜏𝑖, 𝜋𝑖)] ≥ (1/𝑛) ∑_{i=1}^{n} 𝜌̂(𝜋𝑒, 𝜏𝑖, 𝜋𝑖) − 𝑏 √(ln(1/𝛿) / (2𝑛))
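The Chernoff-Hoeffding lower bound is a one-liner to compute. A sketch (the function name and the toy data are mine):

```python
import math

def hcope_hoeffding(weighted_returns, b, delta):
    """1 - delta confidence lower bound on rho(pi_e) via Chernoff-Hoeffding:
        (1/n) sum_i rho_hat_i  -  b * sqrt(ln(1/delta) / (2n))
    Requires every importance weighted return to lie in [0, b].
    """
    n = len(weighted_returns)
    mean = sum(weighted_returns) / n
    return mean - b * math.sqrt(math.log(1.0 / delta) / (2.0 * n))

# Toy data: 1,000 returns in [0, 1] with mean 0.4, so b = 1 is valid.
returns = [0.2, 0.5, 0.4, 0.3, 0.6] * 200
bound = hcope_hoeffding(returns, b=1.0, delta=0.05)
```

With well-behaved returns and 𝑏 = 1 the bound sits just below the sample mean; the upcoming Mountain Car example shows what happens when 𝑏 is enormous instead.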
The "?" in the pipeline is a concentration inequality: it maps the importance weighted returns {𝜌̂(𝜋𝑒, 𝜏𝑖, 𝜋𝑖)}_{i=1}^{n} and the confidence level 𝛿 to the 1 − 𝛿 confidence lower bound
(1/𝑛) ∑_{i=1}^{n} 𝜌̂(𝜋𝑒, 𝜏𝑖, 𝜋𝑖) − 𝑏 √(ln(1/𝛿) / (2𝑛)).
Example: Mountain Car
• Using 100,000 trajectories
• Evaluation policy’s true performance: 𝜌(𝜋𝑒) = 0.19 (returns lie in [0, 1]).
• We get a 95% confidence lower bound of −5,831,000.
What went wrong?
𝜌̂(𝜋𝑒, 𝜏, 𝜋𝑏) = [∏_{t=0}^{L} 𝜋𝑒(𝐴𝑡|𝑆𝑡) / 𝜋𝑏(𝐴𝑡|𝑆𝑡)] 𝐺(𝜏)
The importance weight is a product of 𝐿 + 1 likelihood ratios, so its range can be astronomically large.
What went wrong?
𝐄[𝑋𝑖] ≥ (1/𝑛) ∑_{i=1}^{n} 𝑋𝑖 − 𝑏 √(ln(1/𝛿) / (2𝑛))
𝑏 ≈ 10^9.4
Largest observed importance weighted return: 316.
Another problem:
𝜌̂(𝜋𝑒, 𝜏, 𝜋𝑏) = [∏_{t=0}^{L} 𝜋𝑒(𝐴𝑡|𝑆𝑡) / 𝜋𝑏(𝐴𝑡|𝑆𝑡)] 𝐺(𝜏)
• One behavior policy: the importance weighted returns are independent and identically distributed.
• More than one behavior policy: they are independent, but not identically distributed.
Conservative Policy Iteration
• ≈ 1,000,000 trajectories for a single policy improvement.
(S. Kakade and J. Langford, 2002)
PAC-RL
• ≈ 10^17 time steps to guarantee convergence to a near-optimal policy.
(T. Lattimore and M. Hutter, 2012)
Thesis
High Confidence Off-Policy Evaluation (HCOPE)
and
Safe Policy Improvement (SPI)
are tractable using a practical amount of data.
Extending Maurer’s Inequality
• First key idea:
  • Generalize: allow random variables with different ranges.
  • Specialize: assume the random variables have the same mean.
Extending Maurer’s Inequality
• Second key idea:
  • Removing the upper tail only decreases the expected value.
Tradeoff
Threshold Optimization
• Use 20% of the data to optimize 𝑐.
• Use 80% to compute lower bound with optimized 𝑐.
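A rough sketch of the 20/80 threshold scheme, under simplifying assumptions of my own: I use the Chernoff-Hoeffding form for concreteness, whereas the lecture's actual bound extends Maurer's inequality. Truncating at 𝑐 is sound because min(𝑋, 𝑐) ∈ [0, 𝑐] and, by the second key idea above, removing the upper tail only decreases the expected value, so a bound on the truncated variable remains a valid lower bound on 𝐄[𝑋].

```python
import math

def truncated_lower_bound(xs, c, delta):
    """Lower bound on E[X] from samples truncated at threshold c.
    min(X, c) lies in [0, c] and E[min(X, c)] <= E[X], so a Hoeffding-style
    bound on the truncated variable is still a valid lower bound on E[X]."""
    n = len(xs)
    mean = sum(min(x, c) for x in xs) / n
    return mean - c * math.sqrt(math.log(1.0 / delta) / (2.0 * n))

def optimized_bound(xs, delta, candidates):
    """20/80 split: choose the threshold c on 20% of the data, then report
    the bound computed on the held-out 80% with that c."""
    k = len(xs) // 5
    tune, hold = xs[:k], xs[k:]
    c = max(candidates, key=lambda c: truncated_lower_bound(tune, c, delta))
    return truncated_lower_bound(hold, c, delta)

# Heavy-tailed toy data: mostly small returns plus a few huge importance weights.
xs = [0.5] * 1000 + [1000.0] * 10
tuned = optimized_bound(xs, 0.05, [1.0, 10.0, 1000.0])
naive = truncated_lower_bound(xs, 1000.0, 0.05)   # no real truncation: b = max
```

On this toy data the naive bound is driven far below zero by 𝑏 = 1000, while the tuned threshold keeps the bound near the bulk of the data, which is the tradeoff the preceding slides describe.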
Given 𝑛 samples, 𝒳 ≔ {𝑋𝑖}_{i=1}^{n}, predict what the lower bound would be if computed from 𝑚 samples, rather than 𝑛.
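One simple way to sketch such a prediction for a Hoeffding-style bound (this construction is my assumption, not necessarily the lecture's method): keep the sample mean computed from the 𝑛 available samples, but substitute 𝑚 into the concentration term.

```python
import math

def predict_lower_bound(xs, b, delta, m):
    """Predicted Chernoff-Hoeffding lower bound if we had m samples:
    keep the mean from the n samples we actually have, but use m in the
    concentration term. A simple sketch; it ignores how the sample mean
    itself would change with more data."""
    mean = sum(xs) / len(xs)
    return mean - b * math.sqrt(math.log(1.0 / delta) / (2.0 * m))

xs = [0.4] * 100                                           # n = 100 observed samples
p100 = predict_lower_bound(xs, b=1.0, delta=0.05, m=100)   # bound at current n
p400 = predict_lower_bound(xs, b=1.0, delta=0.05, m=400)   # predicted bound at m = 400
```

The prediction tightens as 𝑚 grows, since only the √(ln(1/𝛿)/(2𝑚)) term shrinks.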
| | CUT | Chernoff-Hoeffding | Maurer | Anderson | Bubeck et al. |
|---|---|---|---|---|---|
| 95% confidence lower bound on the mean | 0.145 | −5,831,000 | −129,703 | 0.055 | −0.046 |
Digital Marketing Example
• 10 real-valued features
• Two groups of campaigns to choose between
• User interactions limited to 𝐿 = 10
• Data collected from a Fortune 20 company
• Data was not used directly.
Example: Digital Marketing
We can now evaluate policies proposed by RL algorithms without needing to execute them, and in a way that gives users confidence that the new policy will actually work.