TRANSCRIPT
RL Course
Introduction to Reinforcement Learning
A. LAZARIC (SequeL Team @INRIA-Lille)
Machine Learning Summer School – Toulouse, France
Presented by A. GARIVIER (Institut de Mathématiques de Toulouse)
SequeL – INRIA Lille
Outline

- Motivation
  - Interactive Learning Problems
  - A Model for Sequential Decision Making
- Multi-armed Bandit Problems
- Extensions
A. LAZARIC – Introduction to Reinforcement Learning Sept 14th, 2015 - 2/124
Motivation: Interactive Learning Problems

Why: Important Problems

- Autonomous robotics
  - Elder care
  - Exploration of unknown / dangerous environments
  - Robotics for entertainment
- Financial applications
  - Trading execution algorithms
  - Portfolio management
  - Option pricing
- Energy management
  - Energy grid integration
  - Maintenance scheduling
  - Energy market regulation
  - Energy production management
- Recommender systems
  - Web advertising
  - Product recommendation
  - Date matching
- Social applications
  - Bike sharing optimization
  - Election campaigns
  - ER service optimization
  - Intelligent tutoring systems
- And many more...
Motivation: A Model for Sequential Decision Making
What: Sequential Decision-Making under Uncertainty

[Diagram: the agent acts on the environment (action / actuation) and perceives the resulting state; a critic adds a reward signal that drives the agent's learning.]
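The interaction loop in the diagram can be sketched in code. This is a minimal illustration, not part of the course material: the environment, its dynamics, and the random policy are all hypothetical stand-ins showing how action, perception, and reward circulate between agent, environment, and critic.

```python
import random

class ToyEnvironment:
    """Hypothetical two-state environment: action in, (state, reward) out."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Stochastic transition: the action nudges the state, noise may flip it.
        self.state = (self.state + action + random.choice([0, 1])) % 2
        # The critic's reward signal: being in state 1 pays 1.
        reward = 1.0 if self.state == 1 else 0.0
        return self.state, reward

def random_agent(state):
    # Placeholder policy; a learning agent would adapt it using the rewards.
    return random.choice([0, 1])

env = ToyEnvironment()
state = env.state
total_reward = 0.0
for t in range(100):
    action = random_agent(state)      # agent -> actuation
    state, reward = env.step(action)  # environment -> perception, critic -> reward
    total_reward += reward
```

The agent here ignores its perceptions; replacing `random_agent` with a state-dependent, reward-driven policy is exactly what the rest of the course is about.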
What: A Different Machine Learning Paradigm

- Supervised learning: an expert (supervisor) provides examples of the right strategy (e.g., classification of clinical images). Supervision is expensive.
- Unsupervised learning: different objects are clustered together by similarity (e.g., clustering images on the basis of their similarity). No actual performance measure is optimized.
- Reinforcement learning: learning by direct interaction (e.g., autonomous robotics). Minimal supervision (a reward signal) and maximization of long-term performance.
How: the Course

A formal and rigorous treatment of the RL approach to sequential decision-making under uncertainty.
- How to model an RL problem
- Models without states = multi-armed bandits (MAB)
- How to solve a (small) MDP exactly
- Hands-on session! (2h)
- How to solve a (larger) MDP approximately
- How to solve an MDP incrementally
- How to efficiently explore an MDP
How to model an RL problem

The Markov Decision Process
- The Model
- Value Functions
The Agent-Environment Interaction Model

[Diagram: the agent acts on the environment and perceives the resulting state; a critic returns a reward used for learning.]
The environment
- Controllability: full (e.g., chess) or partial (e.g., portfolio optimization)
- Uncertainty: deterministic (e.g., chess) or stochastic (e.g., backgammon)
- Reactivity: adversarial (e.g., chess) or fixed (e.g., Tetris)
- Observability: full (e.g., chess) or partial (e.g., robotics)
- Availability: known (e.g., chess) or unknown (e.g., robotics)

The critic
- Sparse (e.g., win or lose) vs. informative (e.g., closer or further)
- Preference-based reward
- Frequent or sporadic
- Known or unknown

The agent
- Open-loop control
- Closed-loop control (i.e., adaptive)
- Non-stationary closed-loop control (i.e., learning)
Markov Decision Process

Definition (Markov decision process [1, 4, 3, 5, 2]). A Markov decision process is defined as a tuple M = (X, A, p, r), where:

- X is the state space,
- A is the action space,
- p(y | x, a) is the transition probability, with p(y | x, a) = P(x_{t+1} = y | x_t = x, a_t = a),
- r(x, a, y) is the reward of the transition (x, a, y).
![Page 51: A. LAZARIC (SequeL Team @INRIA-Lille - IRIT · I Robotics for entertainment A. LAZARIC – Introduction to Reinforcement Learning Sept 14th, 2015 - 6/124. Motivation Interactive Learning](https://reader035.vdocuments.us/reader035/viewer/2022062917/5ed906846714ca7f476901e1/html5/thumbnails/51.jpg)
Motivation Outline
Markov Decision Process: the Assumptions

Time assumption: time is discrete, t → t + 1.

Possible relaxations:
- Identify the proper time granularity
- Most of the MDP literature extends to continuous time
Markov Decision Process: the Assumptions

Markov assumption: the current state x and action a are a sufficient statistic for the next state y:

p(y|x, a) = P(x_{t+1} = y | x_t = x, a_t = a)

Possible relaxations:
- Define a new state h_t = (x_t, x_{t−1}, x_{t−2}, ...)
- Move to partially observable MDPs (POMDPs)
- Move to the predictive state representation (PSR) model
Markov Decision Process: the Assumptions

Reward assumption: the reward is uniquely defined by a transition (or part of it):

r(x, a, y)

Possible relaxations:
- Distinguish between the global goal and the reward function
- Move to inverse reinforcement learning (IRL) to induce the reward function from desired behaviors
Markov Decision Process: the Assumptions

Stationarity assumption: the dynamics and the reward do not change over time:

p(y|x, a) = P(x_{t+1} = y | x_t = x, a_t = a),   r(x, a, y)

Possible relaxations:
- Identify and remove the non-stationary components (e.g., cyclo-stationary dynamics)
- Identify the time scale of the changes
Question
Is the MDP formalism powerful enough?
⇒ Let’s try!
Example: the Retail Store Management Problem
Description. In each month t, a store contains x_t items of a specific good, and the demand for that good is D_t. At the end of each month, the manager of the store can order a_t more items from the supplier. Furthermore, we know that:
- The cost of maintaining an inventory of x items is h(x).
- The cost of ordering a items is C(a).
- The income for selling q items is f(q).
- If the demand D is bigger than the available inventory x, customers that cannot be served leave.
- The value of the remaining inventory at the end of the year is g(x).
- Constraint: the store has a maximum capacity M.
Example: the Retail Store Management Problem
- State space: x ∈ X = {0, 1, ..., M}.
- Action space: it is not possible to order more items than the capacity of the store, so the action space depends on the current state. Formally, at state x, a ∈ A(x) = {0, 1, ..., M − x}.
- Dynamics: x_{t+1} = [x_t + a_t − D_t]^+.
  Problem: the dynamics should be Markov and stationary! The demand D_t is stochastic and time-independent; formally, D_t ~ D i.i.d.
- Reward: r_t = −C(a_t) − h(x_t + a_t) + f([x_t + a_t − x_{t+1}]^+).
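The dynamics and reward above can be sketched as a one-step simulator. The cost and income functions h, C, f below are hypothetical placeholders; only the shapes x_{t+1} = [x_t + a_t − D_t]^+ and r_t come from the model:

```python
import random

# Sketch of the retail-store MDP dynamics and reward. h, C, f are
# hypothetical placeholders; the transition and reward shapes follow
# the model in the text.
M = 20                         # maximum capacity
h = lambda x: 0.1 * x          # inventory maintenance cost (illustrative)
C = lambda a: 2.0 * a          # ordering cost (illustrative)
f = lambda q: 5.0 * q          # income for q sold items (illustrative)

def step(x, a, D):
    """One month: state x, order a in A(x) = {0, ..., M - x}, demand D."""
    assert 0 <= a <= M - x
    y = max(x + a - D, 0)      # next state: unmet demand is lost
    sold = x + a - y           # equals min(x + a, D)
    return y, -C(a) - h(x + a) + f(sold)

random.seed(0)
x = 5
for t in range(12):            # simulate one year
    a = min(4, M - x)          # an arbitrary fixed ordering rule
    D = random.randint(0, 10)  # i.i.d. demand, time-independent
    x, reward = step(x, a, D)
```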
Policy
Definition (Policy). A decision rule π_t can be:
- Deterministic: π_t : X → A,
- Stochastic: π_t : X → Δ(A).

A policy (strategy, plan) can be:
- Non-stationary: π = (π_0, π_1, π_2, ...),
- Stationary (Markovian): π = (π, π, π, ...).

Remark: an MDP M together with a stationary policy π yields a Markov chain on the state space X with transition probability p(y|x) = p(y|x, π(x)).
Example: the Retail Store Management Problem
- Stationary policy 1:
  π(x) = M − x if x < M/4, 0 otherwise
- Stationary policy 2:
  π(x) = max{(M − x)/2 − x; 0}
- Non-stationary policy:
  π_t(x) = M − x if t < 6, ⌊(M − x)/5⌋ otherwise
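These policies can be written directly as functions. A sketch with an illustrative capacity M = 20; integer division stands in for the floors so that every order is a valid integer action in A(x) = {0, ..., M − x}:

```python
# The three example policies for a store of capacity M (M = 20 illustrative).
M = 20

def policy1(x):
    """Stationary policy 1: refill to capacity while the inventory is low."""
    return M - x if x < M / 4 else 0

def policy2(x):
    """Stationary policy 2: max{(M - x)/2 - x, 0}, floored to an integer."""
    return max((M - x) // 2 - x, 0)

def policy3(t, x):
    """Non-stationary policy: aggressive refill for t < 6, milder afterwards."""
    return M - x if t < 6 else (M - x) // 5
```

Each function returns an order in {0, ..., M − x}, as required by the state-dependent action space.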
Multi-armed Bandit Problems
Outline
Motivation
Multi-armed Bandit Problems
- Introduction
- The Bandit Model
- Bandit Algorithms: UCB
- A (distribution-dependent) Lower Bound for the Regret
- Worst-case Performance
Extensions
How to efficiently explore an MDP: the Exploration-Exploitation Dilemma
- Multi-Armed Bandit
- Contextual Linear Bandit
- Reinforcement Learning
The Navigation Problem
Question: which route should we take?

Problem: each day we obtain only limited feedback: the traveling time of the chosen route.

Result: if we do not repeatedly try different options, we cannot learn.

Solution: trade off between optimization and learning.
Learning the Optimal Policy
For i = 1, ..., n
  1. Set t = 0
  2. Set initial state x_0
  3. While (x_t not terminal)
     3.1 Take action a_t according to a suitable exploration policy
     3.2 Observe next state x_{t+1} and reward r_t
     3.3 Compute the temporal difference δ_t (e.g., Q-learning)
     3.4 Update the Q-function: Q(x_t, a_t) = Q(x_t, a_t) + α(x_t, a_t) δ_t
     3.5 Set t = t + 1
  EndWhile
EndFor
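Filling in step 3.1 with ε-greedy exploration (one standard choice of "suitable exploration policy") gives a runnable sketch. The chain environment, step size, discount, and ε below are illustrative assumptions, not from the slides:

```python
import random

# Tabular Q-learning on a tiny, hypothetical chain MDP (states 0..3,
# state 3 terminal), with epsilon-greedy exploration as one concrete
# instance of a "suitable exploration policy". Parameters illustrative.
random.seed(0)
N_STATES, ACTIONS = 4, [0, 1]
Q = {(x, a): 0.0 for x in range(N_STATES) for a in ACTIONS}
alpha, gamma, eps = 0.1, 0.95, 0.2

def env_step(x, a):
    """Action 1 moves right, action 0 moves left; reward 1 on reaching 3."""
    y = min(x + 1, N_STATES - 1) if a == 1 else max(x - 1, 0)
    return y, (1.0 if y == N_STATES - 1 else 0.0)

for episode in range(200):
    x = 0
    while x != N_STATES - 1:
        # 3.1 epsilon-greedy action selection
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda b: Q[(x, b)])
        # 3.2 observe next state and reward
        y, r = env_step(x, a)
        # 3.3 temporal difference (Q-learning)
        delta = r + gamma * max(Q[(y, b)] for b in ACTIONS) - Q[(x, a)]
        # 3.4 update the Q-function
        Q[(x, a)] += alpha * delta
        # 3.5
        x = y
```

After training, the greedy policy at every non-terminal state moves right, toward the rewarding terminal state.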
Learning the Optimal Policy
For i = 1, ..., n
  1. Set t = 0
  2. Set initial state x_0
  3. While (x_t not terminal)
     3.1 Take action a_t = arg max_a Q(x_t, a)
     3.2 Observe next state x_{t+1} and reward r_t
     3.3 Compute the temporal difference δ_t (e.g., Q-learning)
     3.4 Update the Q-function: Q(x_t, a_t) = Q(x_t, a_t) + α(x_t, a_t) δ_t
     3.5 Set t = t + 1
  EndWhile
EndFor

⇒ no convergence
Learning the Optimal Policy
For i = 1, ..., n
  1. Set t = 0
  2. Set initial state x_0
  3. While (x_t not terminal)
     3.1 Take action a_t ~ U(A)
     3.2 Observe next state x_{t+1} and reward r_t
     3.3 Compute the temporal difference δ_t (e.g., Q-learning)
     3.4 Update the Q-function: Q(x_t, a_t) = Q(x_t, a_t) + α(x_t, a_t) δ_t
     3.5 Set t = t + 1
  EndWhile
EndFor

⇒ very poor rewards
Reducing RL down to a Multi-Armed Bandit

Definition (Markov decision process [1, 4, 3, 5, 2]). A Markov decision process is defined as a tuple M = (X, A, p, r):
- X is the state space,
- A is the action space,
- p(y|x, a) is the transition probability,
- r(x, a, y) is the reward of transition (x, a, y).

⇒ dropping the state, r(a) is the reward of action a
Notice
For coherence with the bandit literature, we use the notation:
- i = 1, ..., K: the set of possible actions
- t = 1, ..., n: time
- I_t: the action selected at time t
- X_{i,t}: the reward for action i at time t
Learning the Optimal Policy
Objective: learn the optimal policy π* as efficiently as possible.

For i = 1, ..., n
  1. Set t = 0
  2. Set initial state x_0
  3. While (x_t not terminal)
     3.1 Take action a_t
     3.2 Observe next state x_{t+1} and reward r_t
     3.3 Set t = t + 1
  EndWhile
EndFor
The Multi–armed Bandit Protocol
The learner has i = 1, ..., K arms (actions).

At each round t = 1, ..., n:
- At the same time:
  - The environment chooses a vector of rewards {X_{i,t}}_{i=1}^K
  - The learner chooses an arm I_t
- The learner receives the reward X_{I_t,t}
- The environment does not reveal the rewards of the other arms
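The protocol can be sketched as a loop: each round the environment draws a full reward vector but reveals only the entry of the pulled arm. The Bernoulli arms and the placeholder learner below are illustrative assumptions:

```python
import random

# The bandit protocol as a loop. Bernoulli arms with illustrative
# parameters theta; the learner is a placeholder (uniform choice).
random.seed(0)
theta = [0.3, 0.5, 0.7]
K, n = len(theta), 1000

def learner_choose(t, history):
    """Placeholder strategy: uniform; a real algorithm uses the history."""
    return random.randrange(K)

history = []  # (I_t, X_{I_t,t}) pairs: everything the learner ever observes
for t in range(1, n + 1):
    X = [1.0 if random.random() < theta[i] else 0.0 for i in range(K)]  # hidden
    I = learner_choose(t, history)
    history.append((I, X[I]))  # only the chosen arm's reward is revealed
```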
Paradigmatic Example
Imagine you are a doctor:
- patients visit you one after another for a given disease
- you prescribe one of the (say) 5 available treatments
- the treatments are not equally efficient
- you do not know which one is the best; you observe the effect of the prescribed treatment on each patient

⇒ What do you do?
- You must choose each prescription using only the previous observations
- Your goal is not to estimate each treatment's efficiency precisely, but to heal as many patients as possible
The (stochastic) Multi-Armed Bandit Model

Environment: K arms with parameters θ = (θ_1, ..., θ_K) such that, for any possible choice of arm I_t ∈ {1, ..., K} at time t, one receives the reward X_t = X_{I_t,t}, where, for any 1 ≤ i ≤ K and s ≥ 1, X_{i,s} ~ ν_i, and the (X_{i,s})_{i,s} are independent.

Reward distributions: ν_i ∈ F_i, a parametric family or not. Examples: canonical exponential family, general bounded rewards.

Example: Bernoulli rewards: θ ∈ [0, 1]^K, ν_i = B(θ_i).

Strategy: the agent's actions follow a dynamical strategy π = (π_1, π_2, ...) such that I_t = π_t(X_1, ..., X_{t−1}).
The Multi–armed Bandit Game (cont'd)

Goal: choose π so as to maximize

E[S_n] = Σ_{t=1}^{n} Σ_{i=1}^{K} E[ E[ X_t 1{I_t = i} | X_1, ..., X_{t-1} ] ] = Σ_{i=1}^{K} μ_i E[T_{i,n}]

where T_{i,n} = Σ_{t≤n} 1{I_t = i} is the number of draws of arm i up to time n, and μ_i = E(ν_i).

⇒ Equivalent to minimizing the regret

R_n(A) = max_{i=1,...,K} E[ Σ_{t=1}^{n} X_{i,t} ] − E[ Σ_{t=1}^{n} X_{I_t,t} ]

where μ* = max{μ_i : 1 ≤ i ≤ K}.
The Exploration–Exploitation Dilemma

Problem 1: The environment does not reveal the rewards of the arms not pulled by the learner
⇒ the learner should gain information by repeatedly pulling all the arms ⇒ exploration

Problem 2: Whenever the learner pulls a bad arm, it suffers some regret
⇒ the learner should reduce the regret by repeatedly pulling the best arm ⇒ exploitation

Challenge: the learner must solve the exploration-exploitation dilemma!
The Multi–armed Bandit Game (cont'd)

Examples:
- Packet routing
- Clinical trials
- Web advertising
- Computer games
- Resource mining
- ...
The Stochastic Multi–armed Bandit Problem

Definition: the environment is stochastic.
- Each arm has a distribution ν_i bounded in [0, 1] and characterized by an expected value μ_i
- The rewards are i.i.d.: X_{i,t} ∼ ν_i (as in the MDP model)
The Stochastic Multi–armed Bandit Problem (cont'd)

Notation:
- Number of times arm i has been pulled after n rounds:

T_{i,n} = Σ_{t=1}^{n} 1{I_t = i}

- Gap: ∆_i = μ_{i*} − μ_i

- Regret, rewritten step by step:

R_n(A) = max_{i=1,...,K} E[ Σ_{t=1}^{n} X_{i,t} ] − E[ Σ_{t=1}^{n} X_{I_t,t} ]
       = max_{i=1,...,K} (n μ_i) − E[ Σ_{t=1}^{n} X_{I_t,t} ]
       = max_{i=1,...,K} (n μ_i) − Σ_{i=1}^{K} E[T_{i,n}] μ_i
       = n μ_{i*} − Σ_{i=1}^{K} E[T_{i,n}] μ_i
       = Σ_{i≠i*} E[T_{i,n}] (μ_{i*} − μ_i)
       = Σ_{i≠i*} E[T_{i,n}] ∆_i
The Stochastic Multi–armed Bandit Problem (cont'd)

R_n(A) = Σ_{i≠i*} E[T_{i,n}] ∆_i

⇒ we only need to study the expected number of pulls of the suboptimal arms
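The equivalence between the two forms of the regret can be sanity-checked numerically. The sketch below is ours (the function name `regret_two_ways` is an assumption): it compares n·μ* − Σ_i E[T_{i,n}] μ_i with Σ_{i≠i*} E[T_{i,n}] ∆_i on made-up pull counts.

```python
def regret_two_ways(mu, pulls):
    """Evaluate the regret in its two equivalent forms:
    n * mu_star - sum_i T_i * mu_i   and   sum_i T_i * (mu_star - mu_i).

    `mu` are the arm means, `pulls` the (expected) pull counts E[T_{i,n}];
    the second sum effectively ranges over suboptimal arms (gap 0 at i*).
    """
    n = sum(pulls)
    mu_star = max(mu)
    lhs = n * mu_star - sum(t * m for t, m in zip(pulls, mu))
    rhs = sum(t * (mu_star - m) for t, m in zip(pulls, mu))
    return lhs, rhs

# illustrative numbers: 100 rounds, the best arm pulled 80 times
lhs, rhs = regret_two_ways(mu=[0.9, 0.6, 0.5], pulls=[80, 12, 8])
```

Both forms give 90 − 83.2 = 6.8 here, and only the suboptimal pull counts (12 and 8) contribute.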
Multi-armed Bandit Problems Bandit Algorithms: UCB
Outline

Motivation

Multi-armed Bandit Problems
- Introduction
- The Bandit Model
- Bandit Algorithms: UCB
- A (distribution-dependent) Lower Bound for the Regret
- Worst-case Performance

Extensions
The Stochastic Multi–armed Bandit Problem (cont'd)

Optimism in Face of Uncertainty Learning (OFUL)

Whenever we are uncertain about the outcome of an arm, we consider the best possible world and choose the best arm.

Why it works:
- If the best possible world is correct ⇒ no regret
- If the best possible world is wrong ⇒ the reduction in the uncertainty is maximized
The Stochastic Multi–armed Bandit Problem (cont'd)

[Figure: empirical reward histograms of four arms, after 100, 200, 50, and 20 pulls respectively.]
The Stochastic Multi–armed Bandit Problem (cont'd)

Optimism in face of uncertainty

[Figure: the same four reward histograms.]
The Upper–Confidence Bound (UCB) Algorithm

The idea:

[Figure: reward estimates for four arms, pulled 10, 73, 3, and 23 times respectively (y-axis: reward).]
The Upper–Confidence Bound (UCB) Algorithm
Show time!
The Upper–Confidence Bound (UCB) Algorithm (cont'd)

At each round t = 1, ..., n:
- Compute the score of each arm i: B_i = (optimistic score of arm i)
- Pull arm I_t = arg max_{i=1,...,K} B_i
- Update the number of pulls T_{I_t,t} = T_{I_t,t−1} + 1 and the other statistics
The Upper–Confidence Bound (UCB) Algorithm (cont'd)

The score (with parameters ρ and δ):

B_{i,s,t} = (optimistic score of arm i if pulled s times up to round t)
          = knowledge + uncertainty   (optimism)
          = μ_{i,s} + ρ √( log(1/δ) / (2s) )

Optimism in face of uncertainty:
- current knowledge: the average reward μ_{i,s}
- current uncertainty: the number of pulls s
The Upper–Confidence Bound (UCB) Algorithm (cont'd)

At each round t = 1, ..., n:
- Compute the score of each arm i:

B_{i,t} = μ_{i,T_{i,t}} + ρ √( log(t) / (2 T_{i,t}) )

- Pull arm I_t = arg max_{i=1,...,K} B_{i,t}
- Update the number of pulls T_{I_t,t} = T_{I_t,t−1} + 1 and the empirical mean μ_{i,T_{i,t}}
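The loop above is short enough to implement directly. This is a sketch under our own naming and the Bernoulli-arm assumption (function name `ucb` is ours); it runs UCB(ρ) with the score δ = 1/t and returns the pull counts T_{i,n}.

```python
import math
import random

def ucb(means, n, rho=2.0, seed=0):
    """UCB(rho) on Bernoulli arms with the given means, for n rounds.

    Score at round t (with delta = 1/t):
        B_{i,t} = mu_hat_i + rho * sqrt(log(t) / (2 * T_i)).
    Returns the pull counts T_{i,n}.
    """
    rng = random.Random(seed)
    K = len(means)
    T = [0] * K      # number of pulls per arm
    S = [0.0] * K    # summed observed rewards per arm
    for t in range(1, n + 1):
        if t <= K:
            i = t - 1    # initialization: pull each arm once
        else:
            i = max(range(K), key=lambda a: S[a] / T[a]
                    + rho * math.sqrt(math.log(t) / (2 * T[a])))
        x = 1.0 if rng.random() < means[i] else 0.0
        T[i] += 1
        S[i] += x
    return T
```

On means (0.8, 0.5, 0.2) with n = 2000 the optimal arm quickly dominates the pull counts, while the suboptimal arms keep being probed occasionally, as the exploration bonus dictates.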
The Upper–Confidence Bound (UCB) Algorithm (cont'd)

Theorem (Hoeffding's inequality)
Let X_1, ..., X_n be i.i.d. samples from a distribution bounded in [a, b]; then, for any δ ∈ (0, 1),

P[ | (1/n) Σ_{t=1}^{n} X_t − E[X_1] | > (b − a) √( log(2/δ) / (2n) ) ] ≤ δ
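The theorem admits a quick Monte Carlo sanity check. The sketch below is ours (the Uniform[0,1] sampling distribution and the function name are our assumptions): it counts how often the empirical mean deviates from E[X_1] = 1/2 by more than the Hoeffding radius, which should happen with frequency at most δ.

```python
import math
import random

def hoeffding_violation_rate(n=100, delta=0.05, trials=2000, seed=0):
    """Fraction of trials where the mean of n Uniform[0,1] samples
    (so b - a = 1) deviates from 1/2 by more than
    sqrt(log(2/delta) / (2n))."""
    rng = random.Random(seed)
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    bad = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(n)) / n
        if abs(mean - 0.5) > eps:
            bad += 1
    return bad / trials
```

Since Hoeffding's inequality is loose for this distribution, the observed rate is typically far below δ.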
The Upper–Confidence Bound (UCB) Algorithm (cont'd)

After s pulls of arm i,

P[ μ_i ≤ (1/s) Σ_{t=1}^{s} X_{i,t} + √( log(1/δ) / (2s) ) ] ≥ 1 − δ

that is, writing μ_{i,s} for the empirical average,

P[ μ_i ≤ μ_{i,s} + √( log(1/δ) / (2s) ) ] ≥ 1 − δ

⇒ UCB uses an upper confidence bound on the expectation
The Upper–Confidence Bound (UCB) Algorithm (cont'd)

Theorem
For any set of K arms with distributions bounded in [0, b], if δ = 1/t, then UCB(ρ) with ρ > 1 achieves a regret

R_n(A) ≤ Σ_{i≠i*} [ (4 b² / ∆_i) ρ log(n) + ∆_i ( 3/2 + 1/(2(ρ − 1)) ) ]
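The right-hand side can be evaluated for concrete instances. This small sketch is ours (the function name and default constants are assumptions): plugging in the gaps and the horizon makes the two qualitative features of the bound visible, logarithmic growth in n and a 1/∆_i blow-up as gaps shrink.

```python
import math

def ucb_regret_upper_bound(gaps, n, rho=1.1, b=1.0):
    """Right-hand side of the theorem: sum over suboptimal arms of
        (4 b^2 / Delta_i) * rho * log(n) + Delta_i * (3/2 + 1/(2(rho - 1))).

    `gaps` are the suboptimality gaps Delta_i of the suboptimal arms.
    """
    const = 1.5 + 1.0 / (2.0 * (rho - 1.0))
    return sum((4.0 * b * b / d) * rho * math.log(n) + d * const
               for d in gaps if d > 0)
```

For a fixed gap, multiplying the horizon by 10 only adds a constant times log(10); halving the gap roughly doubles the dominant term.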
The Upper–Confidence Bound (UCB) Algorithm (cont'd)

Let K = 2 with i* = 1:

R_n(A) ≤ O( (ρ log(n)) / ∆ )

Remark 1: the cumulative regret increases only slowly, as log(n).
Remark 2: the smaller the gap, the bigger the regret... why?
The Upper–Confidence Bound (UCB) Algorithm (cont’d)
Show time (again)!
Multi-armed Bandit Problems A (distribution-dependent) Lower Bound for the Regret
Outline

Motivation

Multi-armed Bandit Problems
- Introduction
- The Bandit Model
- Bandit Algorithms: UCB
- A (distribution-dependent) Lower Bound for the Regret
- Worst-case Performance

Extensions
Asymptotically Optimal Strategies

- A strategy π is said to be consistent if, for any (ν_i)_i ∈ F^K,

(1/n) E[S_n] → μ*

- The strategy is efficient if, for all θ ∈ [0, 1]^K and all α > 0,

R_n(A) = o(n^α)

- There are efficient strategies, and we consider the best achievable asymptotic performance among efficient strategies
The Bound of Lai and Robbins

One-parameter reward distributions: ν_i = ν_{θ_i}, θ_i ∈ Θ ⊂ R.

Theorem [Lai and Robbins, '85]
If π is an efficient strategy, then, for any θ ∈ Θ^K,

lim inf_{n→∞} R_n(A) / log(n) ≥ Σ_{i: μ_i < μ*} (μ* − μ_i) / KL(ν_i, ν*)

where KL(ν, ν′) denotes the Kullback-Leibler divergence. For example, in the Bernoulli case:

KL( B(p), B(q) ) = d_ber(p, q) = p log( p/q ) + (1 − p) log( (1 − p)/(1 − q) )
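For Bernoulli arms the lower bound is fully explicit. The sketch below is ours (the function names `kl_bernoulli` and `lai_robbins_constant` are assumptions): it computes d_ber and the constant multiplying log(n) in the bound.

```python
import math

def kl_bernoulli(p, q):
    """d_ber(p, q) = p log(p/q) + (1-p) log((1-p)/(1-q)),
    with the convention 0 log 0 = 0 (q is assumed to lie in (0, 1))."""
    out = 0.0
    if p > 0.0:
        out += p * math.log(p / q)
    if p < 1.0:
        out += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return out

def lai_robbins_constant(means):
    """Constant in front of log(n) in the Lai-Robbins lower bound for a
    Bernoulli bandit: sum over suboptimal arms of
    (mu* - mu_i) / KL(B(mu_i), B(mu*))."""
    mu_star = max(means)
    return sum((mu_star - m) / kl_bernoulli(m, mu_star)
               for m in means if m < mu_star)
```

Note the asymmetry: d_ber(p, q) ≠ d_ber(q, p) in general, so the order of the arguments (suboptimal arm first, optimal mean second) matters.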
The Bound of Burnetas and Katehakis

More general reward distributions: ν_i ∈ F_i.

Theorem [Burnetas and Katehakis, '96]
If π is an efficient strategy, then, for any θ ∈ [0, 1]^K,

lim inf_{n→∞} R_n / log(n) ≥ Σ_{i: μ_i < μ*} (μ* − μ_i) / K_inf(ν_i, μ*)

where

K_inf(ν_i, μ*) = inf{ KL(ν_i, ν′) : ν′ ∈ F_i, E(ν′) ≥ μ* }

[Figure: a probability simplex with vertices δ_0, δ_{1/2}, δ_1, showing a distribution ν_a and the divergence K_inf(ν_a, μ*) to the set of distributions with mean at least μ*.]
Intuition

- First assume that µ* is known and that n is fixed.
- How many draws n_i of ν_i are necessary to know that µ_i < µ* with probability at least 1 − 1/n?
- Test: H_0 : µ_i = µ* against H_1 : ν = ν_i
- Stein's Lemma: if the type-I error α_{n_i} ≤ 1/n, then

  $$\beta_{n_i} \gtrsim \exp\big(-n_i\,K_{\inf}(\nu_i,\mu^*)\big)$$

  ⟹ it can be smaller than 1/n only if

  $$n_i \geq \frac{\log(n)}{K_{\inf}(\nu_i,\mu^*)}$$

- How to do as well without knowing µ* and n in advance? Non-asymptotically?
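The sample-size requirement above is easy to evaluate; a sketch assuming Bernoulli arms, where K_inf reduces to the binary KL divergence:

```python
import math

def required_pulls(mu_i, mu_star, n):
    """Stein's-lemma heuristic n_i >= log(n) / K_inf(nu_i, mu*),
    for Bernoulli arms (K_inf = binary KL divergence)."""
    kl = (mu_i * math.log(mu_i / mu_star)
          + (1 - mu_i) * math.log((1 - mu_i) / (1 - mu_star)))
    return math.ceil(math.log(n) / kl)

# the smaller the gap, the more pulls are needed to rule the arm out
print(required_pulls(0.4, 0.5, 1000), required_pulls(0.1, 0.5, 1000))
```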
Multi-armed Bandit Problems Worst-case Performance
Outline

Motivation

Multi-armed Bandit Problems
- Introduction
- The Bandit Model
- Bandit Algorithms: UCB
- A (distribution-dependent) Lower Bound for the Regret
- Worst-case Performance

Extensions
The Worst–case Performance

Remark: the regret bound is distribution–dependent:

$$R_n(A;\Delta) \leq O\Big(\frac{1}{\Delta}\,\rho\log(n)\Big)$$

Meaning: the algorithm is able to adapt to the specific problem at hand!

Worst–case performance: what is the distribution which leads to the worst possible performance of UCB? What is the distribution–free performance of UCB?

$$R_n(A) = \sup_{\Delta} R_n(A;\Delta)$$
The Worst–case Performance
Problem: it seems that if ∆ → 0 then the regret tends to infinity...

... nonsense, because (with two arms) the regret is defined as

$$R_n(A;\Delta) = \mathbb{E}[T_{2,n}]\,\Delta$$

so if ∆ is small, the regret is also small. In fact

$$R_n(A;\Delta) = \min\Big\{O\Big(\frac{1}{\Delta}\,\rho\log(n)\Big),\; \mathbb{E}[T_{2,n}]\,\Delta\Big\}$$
Then

$$R_n(A) = \sup_{\Delta} R_n(A;\Delta) = \sup_{\Delta}\min\Big\{O\Big(\frac{1}{\Delta}\,\rho\log(n)\Big),\; n\Delta\Big\} \approx \sqrt{n}$$

for ∆ = √(1/n).

Remark (non-stochastic bandits): it is possible to ensure the same O(√n) regret even without any stochastic assumption on the reward process.
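The sup–min expression can be checked numerically; ρ = 2 and the grid over ∆ are arbitrary illustrative choices (the maximizer sits near ∆ ≈ √(ρ log(n)/n), so the value scales like √n up to logarithmic factors):

```python
import math

def worst_case_regret(n, rho=2.0, grid=10000):
    """Numerically evaluate sup over Delta of min(rho*log(n)/Delta, n*Delta)
    (illustrative sketch; rho and the grid resolution are arbitrary)."""
    best = 0.0
    for k in range(1, grid + 1):
        delta = k / grid  # Delta ranges over (0, 1]
        best = max(best, min(rho * math.log(n) / delta, n * delta))
    return best

n = 10000
print(worst_case_regret(n), math.sqrt(n))  # same order, up to log factors
```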
Tuning the confidence δ of UCB

Remark: UCB is an anytime algorithm (δ = 1/t):

$$B_{i,s,t} = \hat\mu_{i,s} + \rho\sqrt{\frac{\log t}{2s}}$$

Remark: if the time horizon n is known, then the optimal choice is δ = 1/n:

$$B_{i,s,t} = \hat\mu_{i,s} + \rho\sqrt{\frac{\log n}{2s}}$$
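A minimal sketch of anytime UCB (δ = 1/t) on Bernoulli arms, with the index of the first remark; the means, ρ = 2 and the horizon are illustrative choices, not from the slides:

```python
import math
import random

def ucb(means, n, rho=2.0):
    """Anytime UCB sketch for Bernoulli arms with the index
    B_i = mu_hat_i + rho * sqrt(log(t) / (2 * T_i)).  Returns pull counts."""
    K = len(means)
    counts = [0] * K
    sums = [0.0] * K
    for t in range(1, n + 1):
        if t <= K:
            i = t - 1  # pull each arm once to initialize
        else:
            i = max(range(K), key=lambda j: sums[j] / counts[j]
                    + rho * math.sqrt(math.log(t) / (2 * counts[j])))
        counts[i] += 1
        sums[i] += 1.0 if random.random() < means[i] else 0.0
    return counts

random.seed(0)
counts = ucb([0.5, 0.3], n=5000)
print(counts)  # the better arm (index 0) should dominate
```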
Tuning the confidence δ of UCB (cont’d)

Intuition: UCB should pull the suboptimal arms
- Enough: so as to understand which arm is the best
- Not too much: so as to keep the regret as small as possible

The confidence 1 − δ has the following impact (similarly for ρ):
- Large 1 − δ: high level of exploration
- Small 1 − δ: high level of exploitation

Solution: depending on the time horizon, we can tune the trade-off between exploration and exploitation.
UCB Proof
Let’s dig into the (one-and-a-half-page!) proof.

Define the (high-probability) event [statistics]

$$E = \Big\{\forall i, s:\; \big|\hat\mu_{i,s} - \mu_i\big| \leq \sqrt{\frac{\log 1/\delta}{2s}}\Big\}$$

By Chernoff–Hoeffding, P[E] ≥ 1 − nKδ.

At time t we pull arm i [algorithm]:

$$B_{i,T_{i,t-1}} \geq B_{i^*,T_{i^*,t-1}}$$

that is,

$$\hat\mu_{i,T_{i,t-1}} + \sqrt{\frac{\log 1/\delta}{2T_{i,t-1}}} \;\geq\; \hat\mu_{i^*,T_{i^*,t-1}} + \sqrt{\frac{\log 1/\delta}{2T_{i^*,t-1}}}$$

On the event E we have [math]:

$$\mu_i + 2\sqrt{\frac{\log 1/\delta}{2T_{i,t-1}}} \;\geq\; \mu_{i^*}$$
UCB Proof (cont’d)

Assume t is the last time arm i is pulled; then T_{i,n} = T_{i,t−1} + 1, thus

$$\mu_i + 2\sqrt{\frac{\log 1/\delta}{2(T_{i,n}-1)}} \;\geq\; \mu_{i^*}$$

Reordering [math]:

$$T_{i,n} \leq \frac{\log 1/\delta}{2\Delta_i^2} + 1$$

under event E, and thus with probability at least 1 − nKδ.

Moving to the expectation [statistics]:

$$\mathbb{E}[T_{i,n}] = \mathbb{E}[T_{i,n}\mathbb{I}_E] + \mathbb{E}[T_{i,n}\mathbb{I}_{E^C}] \;\leq\; \frac{\log 1/\delta}{2\Delta_i^2} + 1 + n(nK\delta)$$

Trading off the two terms with δ = 1/n², the index becomes

$$\hat\mu_{i,T_{i,t-1}} + \sqrt{\frac{2\log n}{2T_{i,t-1}}}$$

and

$$\mathbb{E}[T_{i,n}] \leq \frac{\log n}{\Delta_i^2} + 1 + K$$
Tuning the confidence δ of UCB (cont’d)
Multi–armed Bandit: the same for δ = 1/t and δ = 1/n...
... almost (i.e., in expectation)
Tuning the confidence δ of UCB (cont’d)

The value–at–risk of the regret for UCB-anytime.
Tuning the ρ of UCB (cont’d)

UCB values (for the δ = 1/n algorithm):

$$B_{i,s} = \hat\mu_{i,s} + \rho\sqrt{\frac{\log n}{2s}}$$

Theory:
- ρ < 0.5: polynomial regret w.r.t. n
- ρ > 0.5: logarithmic regret w.r.t. n

Practice: ρ = 0.2 is often the best choice.
Improvements: UCB-V

Idea: use empirical Bernstein bounds for more accurate confidence intervals.

Algorithm:
- Compute the score of each arm i:

  $$B_{i,t} = \hat\mu_{i,T_{i,t}} + \sqrt{\frac{2\hat\sigma^2_{i,T_{i,t}}\log t}{T_{i,t}}} + \frac{8\log t}{3T_{i,t}}$$

- Pull arm

  $$I_t = \arg\max_{i=1,\dots,K} B_{i,t}$$

- Update the number of pulls T_{I_t,t}, the mean estimate $\hat\mu_{i,T_{i,t}}$ and the variance estimate $\hat\sigma^2_{i,T_{i,t}}$

Regret:

$$R_n \leq O\Big(\frac{\sigma^2}{\Delta}\log n\Big)$$
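A sketch of the UCB-V score above on Bernoulli arms; the reward interface rewards_fn and the arm means are our own illustrative choices, and the empirical variance is recomputed naively at every step:

```python
import math
import random

def ucb_v(rewards_fn, K, n, seed=0):
    """UCB-V sketch: empirical-Bernstein score
    B_i = mu_hat + sqrt(2*var_hat*log(t)/T_i) + 8*log(t)/(3*T_i).
    rewards_fn(i, rng) draws a reward in [0, 1] for arm i (assumed interface)."""
    rng = random.Random(seed)
    obs = [[] for _ in range(K)]
    for t in range(1, n + 1):
        if t <= K:
            i = t - 1  # initialize every arm once
        else:
            def score(j):
                T = len(obs[j])
                mu = sum(obs[j]) / T
                var = sum((x - mu) ** 2 for x in obs[j]) / T
                return (mu + math.sqrt(2 * var * math.log(t) / T)
                        + 8 * math.log(t) / (3 * T))
            i = max(range(K), key=score)
        obs[i].append(rewards_fn(i, rng))
    return [len(o) for o in obs]

# Bernoulli arms with means 0.6 and 0.4 (hypothetical example)
pulls = ucb_v(lambda i, rng: 1.0 if rng.random() < [0.6, 0.4][i] else 0.0,
              K=2, n=3000)
print(pulls)
```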
Improvements: KL-UCB

Idea: use even tighter confidence intervals based on the Kullback–Leibler divergence

$$d(p,q) = p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}$$

Algorithm: compute the score of each arm i (a convex optimization problem)

$$B_{i,t} = \max\Big\{q\in[0,1] \,:\, T_{i,t}\,d\big(\hat\mu_{i,T_{i,t}}, q\big) \leq \log(t) + c\log(\log(t))\Big\}$$

Regret: the pulls to suboptimal arms satisfy

$$\mathbb{E}\big[T_{i,n}\big] \leq (1+\epsilon)\frac{\log(n)}{d(\mu_i,\mu^*)} + C_1\log(\log(n)) + \frac{C_2(\epsilon)}{n^{\beta(\epsilon)}}$$

where d(µ_i, µ*) > 2∆_i².
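Since q ↦ d(µ̂, q) is increasing on [µ̂, 1], the score B_{i,t} can be computed by simple bisection rather than a generic convex solver. A sketch with an arbitrary choice c = 3:

```python
import math

def kl_ucb_index(mu_hat, pulls, t, c=3.0, tol=1e-6):
    """KL-UCB score: largest q in [0, 1] with pulls * d(mu_hat, q)
    <= log(t) + c*log(log(t)), found by bisection (Bernoulli KL)."""
    def d(p, q):
        eps = 1e-12
        p = min(max(p, eps), 1 - eps)
        q = min(max(q, eps), 1 - eps)
        return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

    budget = (math.log(t) + c * math.log(max(math.log(t), 1.0))) / pulls
    lo, hi = mu_hat, 1.0
    while hi - lo > tol:  # d(mu_hat, .) is increasing on [mu_hat, 1]
        mid = (lo + hi) / 2
        if d(mu_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

print(kl_ucb_index(0.5, pulls=100, t=1000))
```

The index shrinks toward the empirical mean as the arm is pulled more often, exactly as a confidence bound should.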
Improvements: Thompson strategy

Idea: use a Bayesian approach to estimate the means µ_i.

Algorithm: assuming Bernoulli arms and a Beta prior on each mean:
- Compute the posterior D_{i,t} = Beta(S_{i,t} + 1, F_{i,t} + 1)
- Draw a mean sample µ̃_{i,t} ∼ D_{i,t}
- Pull arm I_t = arg max_i µ̃_{i,t}
- If X_{I_t,t} = 1, update S_{I_t,t+1} = S_{I_t,t} + 1; else update F_{I_t,t+1} = F_{I_t,t} + 1

Regret:

$$\lim_{n\to\infty}\frac{R_n}{\log(n)} = \sum_{i=1}^{K}\frac{\Delta_i}{d(\mu_i,\mu^*)}$$
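The four steps above translate almost line by line into code; a Beta–Bernoulli sketch with illustrative means:

```python
import random

def thompson(means, n, seed=0):
    """Beta-Bernoulli Thompson sampling: sample a mean from each posterior
    Beta(S_i + 1, F_i + 1), pull the argmax, update the success/failure counts."""
    rng = random.Random(seed)
    K = len(means)
    S, F = [0] * K, [0] * K
    for _ in range(n):
        samples = [rng.betavariate(S[i] + 1, F[i] + 1) for i in range(K)]
        i = max(range(K), key=samples.__getitem__)
        if rng.random() < means[i]:
            S[i] += 1
        else:
            F[i] += 1
    return S, F

S, F = thompson([0.6, 0.4], 3000)
print([s + f for s, f in zip(S, F)])  # pulls per arm; the better arm dominates
```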
How to efficiently explore an MDP
The Exploration–Exploitation Dilemma
Multi-Armed Bandit
Contextual Linear Bandit
Reinforcement Learning
The Contextual Linear Bandit Problem
Motivating examples:
- Different users may have different preferences
- The set of available news may change over time
- We want to minimise the regret w.r.t. the best news for each user
The Contextual Linear Bandit Problem

The problem: at each time t = 1, . . . , n
- User u_t arrives and a set of news A_t is provided
- The user u_t together with a news a ∈ A_t is described by a feature vector x_{t,a}
- The learner chooses a news a_t and receives a reward r_{t,a_t}

The optimal news: at each time t = 1, . . . , n, the optimal news is

$$a^*_t = \arg\max_{a\in A_t} \mathbb{E}[r_{t,a}]$$

The regret:

$$R_n = \mathbb{E}\Big[\sum_{t=1}^n r_{t,a^*_t}\Big] - \mathbb{E}\Big[\sum_{t=1}^n r_{t,a_t}\Big]$$
Multi-armed Bandit Problems Worst-case Performance
The Contextual Linear Bandit Problem
The linear assumption: the expected reward is a linear function of the context, with an unknown parameter vector per arm

E[rt,a|xt,a] = x⊤t,a θa
Multi-armed Bandit Problems Worst-case Performance
The Contextual Linear Bandit Problem
The linear regression estimate:
I Ta = {t : at = a}
I Construct the design matrix Da ∈ R^{|Ta|×d} of all the contexts observed when action a has been taken
I Construct the reward vector ca ∈ R^{|Ta|} of all the rewards observed when action a has been taken
I Estimate θa as

θ̂a = (D⊤a Da + I)−1 D⊤a ca
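The estimate above is ridge regression with unit regularization. A minimal numpy sketch (solving the normal equations rather than inverting explicitly):

```python
import numpy as np

def ridge_estimate(D, c):
    """theta_hat_a = (D^T D + I)^{-1} D^T c_a:
    regularized least squares on the data observed for one arm."""
    d = D.shape[1]
    # solve (D^T D + I) theta = D^T c instead of forming the inverse
    return np.linalg.solve(D.T @ D + np.eye(d), D.T @ c)
```

With noiseless rewards c = D θ and enough samples, the estimate is close to θ, with a small bias induced by the +I regularizer.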
Multi-armed Bandit Problems Worst-case Performance
The Contextual Linear Bandit Problem
Optimism in the face of uncertainty: the LinUCB algorithm
I Chernoff-Hoeffding in this case becomes

|x⊤t,a θ̂a − E[rt,a|xt,a]| ≤ α √(x⊤t,a (D⊤a Da + I)−1 xt,a)

I and the UCB strategy is

at = arg max_{a∈At} x⊤t,a θ̂a + α √(x⊤t,a (D⊤a Da + I)−1 xt,a)
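The selection rule above can be sketched directly (disjoint LinUCB, one ridge model per arm; the dict-based interface is an illustrative choice, not from the slides):

```python
import numpy as np

def linucb_choose(contexts, D, c, alpha=1.0):
    """Pick the arm maximizing x^T theta_hat + alpha * sqrt(x^T A^{-1} x),
    with A = D_a^T D_a + I estimated separately for each arm.

    contexts: dict arm -> feature vector x_{t,a}
    D, c: dict arm -> design matrix / reward vector observed for that arm
    """
    best_arm, best_ucb = None, -np.inf
    for a, x in contexts.items():
        d = len(x)
        A = D[a].T @ D[a] + np.eye(d)
        A_inv = np.linalg.inv(A)
        theta = A_inv @ D[a].T @ c[a]          # ridge estimate theta_hat_a
        ucb = x @ theta + alpha * np.sqrt(x @ A_inv @ x)  # mean + bonus
        if ucb > best_ucb:
            best_arm, best_ucb = a, ucb
    return best_arm
```

When two arms have equal uncertainty, the one with the higher estimated reward wins; when estimates tie, the less-explored arm gets the larger bonus.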
Multi-armed Bandit Problems Worst-case Performance
The Contextual Linear Bandit Problem
The evaluation problem
I Online evaluation: too expensive
I Offline evaluation: how to use the logged data?
Multi-armed Bandit Problems Worst-case Performance
The Contextual Linear Bandit Problem
Evaluation from logged data
I Assumption 1: contexts and rewards are i.i.d. from a stationary distribution

(x1, . . . , xK, r1, . . . , rK) ∼ D

I Assumption 2: the logging strategy is random
Multi-armed Bandit Problems Worst-case Performance
The Contextual Linear Bandit Problem

Evaluation from logged data: given a bandit strategy π, a desired number of samples T, and an (infinite) stream of logged data
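The slide's two assumptions enable the "replay" estimator known from the offline bandit-evaluation literature: scan the logged stream and keep exactly the events where the evaluated policy agrees with the (uniformly random) logged action. A sketch under those assumptions; the tuple-based stream format is an illustrative choice:

```python
def replay_evaluate(policy, logged_stream, T):
    """Offline evaluation from logged data: an event is kept only when the
    policy's choice matches the logged (uniformly random) action.
    logged_stream: iterable of (context, logged_action, reward) tuples.
    Returns the average reward over the first T matching events."""
    history, total = [], 0.0
    for context, logged_action, reward in logged_stream:
        if policy(context, history) == logged_action:
            history.append((context, logged_action, reward))
            total += reward
            if len(history) == T:
                break
    return total / max(len(history), 1)
```

Because the logging policy is uniform, the retained events are distributed as if the evaluated policy had run online, so the average is an unbiased estimate of its reward.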
Extensions
Outline
Motivation
Multi-armed Bandit Problems
Extensions
Some Examples
Best Arm Identification
Exploration with Probabilistic Expert Advice
Extensions Some Examples
Outline
Motivation
Multi-armed Bandit Problems
Extensions
Some Examples
Best Arm Identification
Exploration with Probabilistic Expert Advice
Extensions Some Examples
Non-stationary Bandits
I Changepoint: reward distributions change abruptly
I Goal: follow the best arm
I Application: scanning tunnelling microscope
I Variants D-UCB and SW-UCB including a progressive discount of the past
I Bounds O(√(n log n)) are proved, which is (almost) optimal
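The sliding-window idea behind SW-UCB can be sketched as follows: compute each arm's mean and exploration bonus on the last τ plays only, so old observations are forgotten after a changepoint. The constants B and ξ and the window handling are simplified relative to the published algorithm:

```python
import math

def sw_ucb_choose(history, K, t, tau=100, B=1.0, xi=0.6):
    """Sliding-window UCB index (sketch): statistics from the last tau
    plays only, so the index can track abrupt changes.
    history: list of (arm, reward) pairs, oldest first."""
    recent = history[-tau:]
    counts = [0] * K
    sums = [0.0] * K
    for arm, r in recent:
        counts[arm] += 1
        sums[arm] += r
    best_arm, best_idx = 0, -float("inf")
    for i in range(K):
        if counts[i] == 0:
            return i  # play any arm absent from the window
        idx = sums[i] / counts[i] + B * math.sqrt(
            xi * math.log(min(t, tau)) / counts[i])
        if idx > best_idx:
            best_arm, best_idx = i, idx
    return best_arm
```

Discarding samples older than τ is what costs the extra √log n factor compared with the stationary O(log n) regret.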
Extensions Some Examples
Generalized Linear Bandits
I Bandit with contextual information:

E[Xt|It] = µ(m⊤It θ∗)

where θ∗ ∈ Rd is an unknown parameter and µ : R → R is a link function
I Example: binary rewards,

µ(x) = exp(x)/(1 + exp(x))

I Application: targeted web ads
I GLM-UCB: regret bound depending on the dimension d and not on the number of arms
Extensions Some Examples
Stochastic Optimization
I Goal: find the maximum of a function f : C ⊂ Rd → R, (possibly) observed in noise
I Application: DAS
(plot: a sampled realization of the function to optimize)
I Model: f is the realization of a Gaussian Process (or has a small norm in some RKHS)
I GP-UCB: evaluate f at the point x ∈ C where the confidence interval for f(x) has the highest upper bound
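The GP-UCB rule can be sketched with a plain numpy Gaussian-process posterior: compute the posterior mean and standard deviation at each candidate point and pick the highest upper bound µ(x) + √β σ(x). The RBF kernel, length-scale, and β below are illustrative choices:

```python
import numpy as np

def rbf(a, b, ls=0.2):
    """Squared-exponential kernel matrix between 1-d point sets a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

def gp_ucb_next(X_obs, y_obs, candidates, beta=4.0, noise=1e-4):
    """Pick the candidate with the highest posterior upper bound
    mu(x) + sqrt(beta) * sigma(x)."""
    K = rbf(X_obs, X_obs) + noise * np.eye(len(X_obs))
    Ks = rbf(X_obs, candidates)               # cross-covariances
    Kinv_y = np.linalg.solve(K, y_obs)
    Kinv_Ks = np.linalg.solve(K, Ks)
    mu = Ks.T @ Kinv_y                        # posterior mean at candidates
    var = 1.0 - np.sum(Ks * Kinv_Ks, axis=0)  # posterior variance (k(x,x)=1)
    sigma = np.sqrt(np.maximum(var, 0.0))
    return candidates[np.argmax(mu + np.sqrt(beta) * sigma)]
```

A candidate far from all observations has σ ≈ 1, so its upper bound dominates until it is explored; this is the exploration side of the rule.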
Extensions Best Arm Identification
Outline
Motivation
Multi-armed Bandit Problems
Extensions
Some Examples
Best Arm Identification
Exploration with Probabilistic Expert Advice
Extensions Best Arm Identification
Motivation
Extensions Best Arm Identification
Goal ≠ regret minimization

Improve performance:
I fixed number of test users → smaller probability of error
I fixed probability of error → fewer test users

Tools: sequential allocation and stopping
Extensions Best Arm Identification
The model
A two-armed bandit model is
I a set ν = (ν1, ν2) of two probability distributions ("arms") with respective means µ1 and µ2
I a∗ = arg maxa µa is the (unknown) best arm

To find the best arm, an agent interacts with the bandit model with
I a sampling rule (At)t∈N, where At ∈ {1, 2} is the arm chosen at time t (based on past observations) → a sample Zt ∼ νAt is observed
I a stopping rule τ indicating when it stops sampling the arms
I a recommendation rule âτ ∈ {1, 2} indicating which arm it thinks is best (at the end of the interaction)

In classical A/B testing, the sampling rule At is uniform on {1, 2} and the stopping rule τ = t is fixed in advance.
Extensions Best Arm Identification
Two possible goals
The agent's goal is to design a strategy A = ((At), τ, âτ) satisfying

Fixed-budget setting                 Fixed-confidence setting
τ = t fixed in advance               Pν(âτ ≠ a∗) ≤ δ
pt(ν) := Pν(ât ≠ a∗) as small        Eν[τ] as small
as possible                          as possible

An algorithm using uniform sampling is

Fixed-budget setting                 Fixed-confidence setting
a classical test of                  a sequential test of
(µ1 > µ2) against (µ1 < µ2)          (µ1 > µ2) against (µ1 < µ2)
based on t samples                   with probability of error
                                     uniformly bounded by δ

[Siegmund 85]: sequential tests can save samples!
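The fixed-confidence column can be sketched as a sequential test with uniform sampling: stop as soon as the Hoeffding confidence intervals of the two empirical means separate. The stopping threshold below (a crude union bound over time) is an illustrative choice, not the tuning from the literature cited on the slide:

```python
import math
import random

def sequential_ab_test(p1, p2, delta=0.05, max_t=100000, seed=0):
    """Uniform sampling rule plus a sequential stopping rule for two
    Bernoulli arms: stop when |mu1 - mu2| exceeds twice the Hoeffding
    radius, then recommend the arm with the larger empirical mean."""
    rng = random.Random(seed)
    s = [0.0, 0.0]
    n = [0, 0]
    means = (p1, p2)
    mu = [0.0, 0.0]
    for t in range(1, max_t + 1):
        for a in (0, 1):  # uniform sampling: one draw from each arm
            s[a] += 1.0 if rng.random() < means[a] else 0.0
            n[a] += 1
        mu = [s[0] / n[0], s[1] / n[1]]
        rad = math.sqrt(math.log(4.0 * t * t / delta) / (2 * n[0]))
        if abs(mu[0] - mu[1]) > 2 * rad:
            break  # intervals have separated: stop sampling
    return (0 if mu[0] > mu[1] else 1), n[0] + n[1]
```

With a large gap the test stops after a handful of samples, which is Siegmund's point: the stopping time adapts to the (unknown) difficulty of the instance.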
Extensions Exploration with Probabilistic Expert Advice
Outline
Motivation
Multi-armed Bandit Problems
Extensions
Some Examples
Best Arm Identification
Exploration with Probabilistic Expert Advice
Extensions Exploration with Probabilistic Expert Advice
The model
Optimal Discovery with Probabilistic Expert Advice: Finite Time Analysis and Macroscopic Optimality, JMLR 2013
joint work with S. Bubeck and D. Ernst

I Subset A ⊂ X of important items
I |X| ≫ 1, |A| ≪ |X|
I Access to X only by probabilistic experts (Pi)1≤i≤K: sequential independent draws

Goal: discover rapidly the elements of A
![Page 208: A. LAZARIC (SequeL Team @INRIA-Lille - IRIT · I Robotics for entertainment A. LAZARIC – Introduction to Reinforcement Learning Sept 14th, 2015 - 6/124. Motivation Interactive Learning](https://reader035.vdocuments.us/reader035/viewer/2022062917/5ed906846714ca7f476901e1/html5/thumbnails/208.jpg)
Extensions Exploration with Probabilistic Expert Advice
Optimal Exploration with Probabilistic Expert Advice
Search space: A ⊂ Ω, a discrete set
Probabilistic experts: Pi ∈ M1(Ω) for i ∈ {1, . . . , K}
Requests: at time t, calling expert It yields a realization of Xt = XIt,t, independent with law PIt
Goal: find as many distinct elements of A as possible with few requests:

Fn = Card(A ∩ {X1, . . . , Xn})
Extensions Exploration with Probabilistic Expert Advice
Goal
At each time step t = 1, 2, . . .:
I pick an index It = πt(I1, Y1, . . . , It−1, Yt−1) ∈ {1, . . . , K} according to past observations
I observe Yt = XIt,nIt,t ∼ PIt, where ni,t = Σ_{s≤t} 1{Is = i}

Goal: design the strategy π = (πt)t so as to maximize the number of important items found after t requests

Fπ(t) = |A ∩ {Y1, . . . , Yt}|

Assumption: non-intersecting supports

A ∩ supp(Pi) ∩ supp(Pj) = ∅ for i ≠ j
Extensions Exploration with Probabilistic Expert Advice
Is it a Bandit Problem ?
It looks like a bandit problem. . .
I sequential choices among K options
I want to maximize cumulative rewards
I exploration vs exploitation dilemma

. . . but it is not a bandit problem!
I rewards are not i.i.d.
I destructive rewards: no interest in observing the same important item twice
I all strategies are eventually equivalent
Extensions Exploration with Probabilistic Expert Advice
The oracle strategy
Proposition: Under the non-intersecting support hypothesis, the greedy oracle strategy selecting the expert with the highest "missing mass"

I∗t ∈ arg max_{1≤i≤K} Pi(A \ {Y1, . . . , Yt})

is optimal: for every possible strategy π, E[Fπ(t)] ≤ E[F∗(t)].

Remark: the proposition is false if the supports may intersect

=⇒ estimate the "missing mass of important items"!
Extensions Exploration with Probabilistic Expert Advice
Missing mass estimation
Let us first focus on one expert i: P = Pi, Xn = Xi,n

X1, . . . , Xn independent draws of P

On(x) = Σ_{m=1}^n 1{Xm = x}

How to "estimate" the total mass of the unseen important items

Rn = Σ_{x∈A} P(x) 1{On(x) = 0} ?
Extensions Exploration with Probabilistic Expert Advice
The Good-Turing Estimator
Idea: use the hapaxes = items seen only once (linguistics)

R̂n = Un/n, where Un = Σ_{x∈A} 1{On(x) = 1}

Lemma [Good '53]: For every distribution P,

0 ≤ E[Rn] − E[R̂n] ≤ 1/n

Proposition: With probability at least 1 − δ, for every P,

R̂n − 1/n − (1 + √2)√(log(4/δ)/n) ≤ Rn ≤ R̂n + (1 + √2)√(log(4/δ)/n)

See [McAllester and Schapire '00, McAllester and Ortiz '03]:
I deviations of R̂n: McDiarmid's inequality
I deviations of Rn: negative association
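The estimator R̂n = Un/n is a one-liner. The sketch below simplifies the slide in one respect: the slide counts only hapaxes belonging to A, while here every item counts (equivalently, every item is assumed important):

```python
from collections import Counter

def good_turing_missing_mass(samples):
    """R_hat_n = U_n / n, where U_n is the number of hapaxes
    (items seen exactly once among the n samples)."""
    counts = Counter(samples)
    n = len(samples)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / n
```

Intuitively, items seen exactly once are the ones we only just discovered, so their frequency estimates how much probability mass is still hiding in items never seen at all.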
Extensions Exploration with Probabilistic Expert Advice
The Good-UCB algorithm [Bubeck, Ernst & G.]
Optimistic algorithm based on the Good-Turing estimator:

It+1 = arg max_{i∈{1,...,K}} Hi(t)/Ni(t) + c √(log(t)/Ni(t))

I Ni(t) = number of draws of Pi up to time t
I Hi(t) = number of elements of A seen exactly once thanks to Pi
I c = tuning parameter
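The index can be sketched as follows. With non-intersecting supports, the hapaxes obtained from expert i are exactly the items seen once overall, which the sketch exploits; as in the Good-Turing sketch above, every item is treated as important for brevity:

```python
import math
from collections import Counter

def good_ucb_choose(draws, K, t, c=1.0):
    """Good-UCB index: H_i(t)/N_i(t) + c * sqrt(log(t)/N_i(t)),
    where H_i(t) counts items drawn exactly once from expert i.
    draws: dict expert -> list of items obtained from that expert."""
    best_arm, best_idx = 0, -float("inf")
    for i in range(K):
        obs = draws.get(i, [])
        if not obs:
            return i  # call each expert at least once
        n_i = len(obs)
        counts = Counter(obs)
        h_i = sum(1 for v in counts.values() if v == 1)  # hapaxes of expert i
        idx = h_i / n_i + c * math.sqrt(math.log(t) / n_i)
        if idx > best_idx:
            best_arm, best_idx = i, idx
    return best_arm
```

An expert that keeps returning the same items has a vanishing Good-Turing index and is abandoned, while an expert still producing fresh items keeps being called.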
Extensions Exploration with Probabilistic Expert Advice
Classical analysis
Theorem: For any t ≥ 1, under the non-intersecting support assumption, Good-UCB (with constant C = (1 + √2)√3) satisfies

E[F∗(t) − F_UCB(t)] ≤ 17√(Kt log(t)) + 20√(Kt) + K + K log(t/K)

Remark: a usual result for bandit problems, but a not-so-simple analysis
Extensions Exploration with Probabilistic Expert Advice
A Typical Run of Good-UCB
Extensions Exploration with Probabilistic Expert Advice
The macroscopic limit
I Restricted framework: Pi = U{1, . . . , N}
I N → ∞
I |A ∩ supp(Pi)|/N → qi ∈ (0, 1), q = Σi qi
Extensions Exploration with Probabilistic Expert Advice
The Oracle behaviour
The limiting discovery process of the Oracle strategy is deterministic.

Proposition: For every λ ∈ (0, q1) and every sequence (λN)N converging to λ as N goes to infinity, almost surely

lim_{N→∞} T^N_∗(λN)/N = Σ_i (log(qi/λ))+
Extensions Exploration with Probabilistic Expert Advice
Oracle vs. uniform sampling
Oracle: the proportion of important items not found after Nt draws tends to

q − F∗(t) = I(t) qI(t) exp(−t/I(t)) ≤ K qK exp(−t/K)

with qK = (Π_{i=1}^K qi)^{1/K} the geometric mean of the (qi)i.

Uniform: the proportion of important items not found after Nt draws tends to K q̄K exp(−t/K), where q̄K = (1/K) Σ_i qi is the arithmetic mean of the (qi)i.

=⇒ Asymptotic ratio of efficiency

ρ(q) = q̄K/qK = ((1/K) Σ_{i=1}^K qi) / (Π_{i=1}^K qi)^{1/K} ≥ 1

larger if the (qi)i are unbalanced
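The ratio ρ(q) is the arithmetic mean of the qi divided by their geometric mean, so the AM-GM inequality gives ρ(q) ≥ 1, with equality exactly when all qi are equal. A quick numeric check (the example values of qi are illustrative):

```python
import math

def efficiency_ratio(qs):
    """rho(q) = arithmetic mean / geometric mean of the q_i; by AM-GM
    this is >= 1, with equality iff all q_i are equal."""
    k = len(qs)
    arith = sum(qs) / k
    geom = math.prod(qs) ** (1.0 / k)
    return arith / geom
```

Balanced experts give ρ ≈ 1 (no gain from adaptivity), while unbalanced experts make the oracle's advantage over uniform sampling arbitrarily large.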
Extensions Exploration with Probabilistic Expert Advice
Macroscopic optimality
Theorem: Take C = (1 + √2)√(c + 2) with c > 3/2 in the Good-UCB algorithm.

I For every sequence (λN)N converging to λ as N goes to infinity, almost surely

lim sup_{N→+∞} T^N_UCB(λN)/N ≤ Σ_i (log(qi/λ))+

I The proportion of items found after Nt steps, F_GUCB, converges uniformly to F∗ as N goes to infinity
Extensions Exploration with Probabilistic Expert Advice
Simulation
Number of items found by Good-UCB (solid line), the oracle (bold dashed), and uniform sampling (light dotted) as a function of time, for sample sizes N = 128, N = 500, N = 1000, and N = 10000, in an environment with 7 experts.
Extensions Exploration with Probabilistic Expert Advice
Reinforcement Learning
Alessandro Lazaric
[email protected]
sequel.lille.inria.fr

dubbing: Aurélien Garivier
[email protected]