(source: files.meetup.com/1406240/ReinforcementLearningTalk.pdf)
In Search of Pi
A General Introduction to Reinforcement Learning
Shane M. Conway
@statalgo, www.statalgo.com, [email protected]
"It would be in vain for one intelligent Being, to set a Rule to the Actions of another, if he had not in his Power, to reward the compliance with, and punish deviation from his Rule, by some Good and Evil, that is not the natural product and consequence of the action itself." (Locke, "Essay", 2.28.6)
"The use of punishments and rewards can at best be a part of the teaching process. Roughly speaking, if the teacher has no other means of communicating to the pupil, the amount of information which can reach him does not exceed the total number of rewards and punishments applied." (Turing (1950), "Computing Machinery and Intelligence")
Table of contents
The recipe (some context)
Looking at other pi’s (motivating examples)
Prep the ingredients (the simplest example)
Mixing the ingredients (models)
Baking (methods)
Eat your own pi (code)
I ate the whole pi, but I’m still hungry! (references)
Outline
The recipe (some context)
Looking at other pi’s (motivating examples)
Prep the ingredients (the simplest example)
Mixing the ingredients (models)
Baking (methods)
Eat your own pi (code)
I ate the whole pi, but I’m still hungry! (references)
What is Reinforcement Learning?
Some context
Why is reinforcement learning so rare here?
Figure: The machine learning sub-reddit on July 23, 2014.
Reinforcement learning is useful for optimizing the long-run behavior of an agent:
- Handles more complex environments than supervised learning
- Provides a powerful framework for modeling streaming data
Machine Learning
Machine Learning is often introduced as three distinct approaches:
- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning
Machine Learning (Relationships)
Figure: a Venn-style diagram relating Reinforcement, Supervised, and Unsupervised learning, with overlaps labeled RL with function approximation, semi-supervised learning, and active learning.
Machine Learning (Complexity and Reductions)
Figure: a ladder of learning problems, ordered along axes of reward-structure complexity and interactive/sequential complexity:
- Binary Classification
- Supervised Learning
- Cost-sensitive Learning
- Contextual Bandit
- Structured Prediction
- Imitation Learning
- Reinforcement Learning
(Langford/Zadrozny 2005)
Reinforcement Learning
"...the idea of a learning system that wants something. This was the idea of a 'hedonistic' learning system, or, as we would say now, the idea of reinforcement learning."
- Barto/Sutton (1998), p. viii
Definition
- Agents take actions in an environment and receive rewards
- Goal is to find the policy π that maximizes rewards
- Inspired by research into psychology and animal learning
RL Model
In a single-agent version, we consider two major components: the agent and the environment.

Figure: the agent-environment loop: the agent sends an Action to the environment, which returns a Reward and State.

The agent takes actions, and receives updates in the form of state/reward pairs.
Reinforcement Learning (Fields)
Reinforcement learning gets covered in a number of different fields:
- Artificial intelligence/machine learning
- Control theory/optimal control
- Neuroscience
- Psychology
One primary research area is robotics, although the same methods are applied in optimal control theory (often under the heading of Approximate Dynamic Programming, or Sequential Decision Making Under Uncertainty).
Reinforcement Learning (Fields)
From "Deconstructing Reinforcement Learning", ICML 2009
Artificial Intelligence
Major goal of Artificial Intelligence: build intelligent agents.
"An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators."
Russell and Norvig (2003)
1. Belief Networks (Chp. 14)
2. Dynamic Belief Networks (Chp. 15)
3. Single Decisions (Chp. 16)
4. Sequential Decisions (Chp. 17) (includes MDP, POMDP, and Game Theory)
5. Reinforcement Learning (Chp. 21)
Major Considerations
- Generalization (Learning)
- Sequential Decisions (Planning)
- Exploration vs. Exploitation (Multi-Armed Bandits)
- Convergence (PAC learnability)
Variations
- Type of uncertainty.
- Full vs. partial state observability.
- Single vs. multiple decision-makers.
- Model-based vs. model-free methods.
- Finite vs. infinite state space.
- Discrete vs. continuous time.
- Finite vs. infinite horizon.
Key Ideas
1. Time/life/interaction
2. Reward/value/verification
3. Sampling
4. Bootstrapping
Richard Sutton's list of key ideas for reinforcement learning ("Deconstructing Reinforcement Learning", ICML 2009).
Outline
The recipe (some context)
Looking at other pi’s (motivating examples)
Prep the ingredients (the simplest example)
Mixing the ingredients (models)
Baking (methods)
Eat your own pi (code)
I ate the whole pi, but I’m still hungry! (references)
How is Reinforcement Learning being used?
Behaviorism
Human Trials
Figure: "How Pavlok Works: Earn Rewards when you Succeed. Face Penalties if you Fail. Choose your level of commitment. Pavlok can reward you when you achieve your goals. Earn prizes and even money when you complete your daily task. But be warned: if you fail, you'll face penalties. Pay a fine, lose access to your phone, or even suffer an electric shock...at the hands of your friends."
Shortest Path, Travelling Salesman Problem
Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city exactly once and returns to the origin city?
- Bellman, R. (1962), "Dynamic Programming Treatment of the Travelling Salesman Problem"
- Example in Python from Mariano Chouza
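Chouza's code is not reproduced here, but the core of Bellman's dynamic-programming treatment (now known as Held-Karp) can be sketched as follows; the 4-city distance matrix is an invented example:

```python
from itertools import combinations

def held_karp(dist):
    """Shortest tour visiting every city once, via Bellman's DP over subsets.

    dist[i][j] is the distance from city i to city j; city 0 is the origin.
    Runs in O(n^2 * 2^n) time, versus O(n!) for brute-force enumeration.
    """
    n = len(dist)
    # best[(S, j)]: length of the shortest path from 0 through set S, ending at j
    best = {(frozenset([j]), j): dist[0][j] for j in range(1, n)}
    for size in range(2, n):
        for subset in combinations(range(1, n), size):
            S = frozenset(subset)
            for j in S:
                best[(S, j)] = min(best[(S - {j}, k)] + dist[k][j]
                                   for k in S if k != j)
    full = frozenset(range(1, n))
    # Close the tour by returning to the origin city.
    return min(best[(full, j)] + dist[j][0] for j in full)

dist = [[0, 2, 9, 10],
        [1, 0, 6, 4],
        [15, 7, 0, 8],
        [6, 3, 12, 0]]
print(held_karp(dist))  # 21  (tour 0 -> 2 -> 3 -> 1 -> 0)
```

The recurrence is exactly Bellman's: the best path through a set of cities ending at j must arrive from some k in the set, so we minimize over k.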
TD-Gammon
Tesauro (1995) "Temporal Difference Learning and TD-Gammon" may be the most famous success story for RL, using a combination of the TD(λ) algorithm and nonlinear function approximation with a multilayer neural network trained by backpropagating TD errors.
Go
From Sutton (2009) "Deconstructing Reinforcement Learning", ICML
Go
From Sutton (2009) "Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear Function Approximation", ICML
Andrew Ng’s Helicopters
https://www.youtube.com/watch?v=Idn10JBsA3Q
Outline
The recipe (some context)
Looking at other pi’s (motivating examples)
Prep the ingredients (the simplest example)
Mixing the ingredients (models)
Baking (methods)
Eat your own pi (code)
I ate the whole pi, but I’m still hungry! (references)
Multi-Armed Bandits
Single-state reinforcement learning problems.
Multi-Armed Bandits
A simple introduction to the reinforcement learning problem is the case with only one state, also called a multi-armed bandit, named after slot machines (one-armed bandits).
Definition
- Set of actions A = {1, ..., n}
- Each action gives you a random reward with distribution P(r_t | a_t = i)
- The value (or utility) is V = Σ_t r_t
Exploration vs. Exploitation
ε-Greedy
The ε-greedy algorithm is one of the simplest and yet most popular approaches to the exploration/exploitation dilemma.
(picture courtesy of "Python Multi-armed Bandits" by Eric Chiang, yhat)
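A minimal sketch of ε-greedy on a Bernoulli bandit (the arm probabilities, step count, and ε value here are all made up for illustration): with probability ε pick a random arm, otherwise pick the arm with the highest estimated value.

```python
import random

def epsilon_greedy(true_probs, steps=10000, epsilon=0.1, seed=0):
    """Run epsilon-greedy on a Bernoulli multi-armed bandit.

    Value estimates are incremental sample means of the observed rewards.
    Returns the final estimates and the number of pulls per arm.
    """
    rng = random.Random(seed)
    n_arms = len(true_probs)
    counts = [0] * n_arms    # pulls per arm
    values = [0.0] * n_arms  # estimated mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)      # explore: random arm
        else:
            arm = values.index(max(values))  # exploit: current best arm
        reward = 1.0 if rng.random() < true_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental mean update: V <- V + (r - V) / n
        values[arm] += (reward - values[arm]) / counts[arm]
    return values, counts

values, counts = epsilon_greedy([0.2, 0.5, 0.8])
print(counts)  # the 0.8 arm ends up with the vast majority of pulls
```

The constant ε means the algorithm never stops exploring; schedules that decay ε over time trade off faster convergence against the risk of locking in early mistakes.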
![Page 34: In Search of Pi - Meetupfiles.meetup.com/1406240/ReinforcementLearningTalk.pdfReinforcement Learning...the idea of a learning system that wants something. This was the idea of a "hedonistic"learning](https://reader035.vdocuments.us/reader035/viewer/2022071116/5ffd68db00cc002e69079cc9/html5/thumbnails/34.jpg)
Outline
The recipe (some context)
Looking at other pi’s (motivating examples)
Prep the ingredients (the simplest example)
Mixing the ingredients (models)
Baking (methods)
Eat your own pi (code)
I ate the whole pi, but I’m still hungry! (references)
![Page 35: In Search of Pi - Meetupfiles.meetup.com/1406240/ReinforcementLearningTalk.pdfReinforcement Learning...the idea of a learning system that wants something. This was the idea of a "hedonistic"learning](https://reader035.vdocuments.us/reader035/viewer/2022071116/5ffd68db00cc002e69079cc9/html5/thumbnails/35.jpg)
Reinforcement Learning Models
Especially Markov Decision Processes.
![Page 36: In Search of Pi - Meetupfiles.meetup.com/1406240/ReinforcementLearningTalk.pdfReinforcement Learning...the idea of a learning system that wants something. This was the idea of a "hedonistic"learning](https://reader035.vdocuments.us/reader035/viewer/2022071116/5ffd68db00cc002e69079cc9/html5/thumbnails/36.jpg)
Dynamic Decision Networks
Bayesian networks are a popular method for characterizing probabilistic models. They can be extended into a Dynamic Decision Network (DDN) by adding decision (action) and utility (value) nodes.

Figure: a DDN fragment with a state node s, a decision node a, and a utility node r.
![Page 37: In Search of Pi - Meetupfiles.meetup.com/1406240/ReinforcementLearningTalk.pdfReinforcement Learning...the idea of a learning system that wants something. This was the idea of a "hedonistic"learning](https://reader035.vdocuments.us/reader035/viewer/2022071116/5ffd68db00cc002e69079cc9/html5/thumbnails/37.jpg)
Markov Models
We can extend the Markov process to study other models with the same property.

Markov Model   | States Observable? | Control Over Transitions?
Markov Chains  | Yes                | No
MDP            | Yes                | Yes
HMM            | No                 | No
POMDP          | No                 | Yes
![Page 38: In Search of Pi - Meetupfiles.meetup.com/1406240/ReinforcementLearningTalk.pdfReinforcement Learning...the idea of a learning system that wants something. This was the idea of a "hedonistic"learning](https://reader035.vdocuments.us/reader035/viewer/2022071116/5ffd68db00cc002e69079cc9/html5/thumbnails/38.jpg)
Markov Processes
Markov processes are among the most elementary models in time series analysis.

s1 → s2 → s3 → s4

Definition

P(s_{t+1} | s_t, ..., s_1) = P(s_{t+1} | s_t)   (1)

- s_t is the state of the Markov process at time t.
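The Markov property says the next state depends only on the current one. A toy illustration, using an invented two-state "weather" chain:

```python
import random

# Transition probabilities for a made-up two-state Markov chain.
P = {"sunny": {"sunny": 0.8, "rainy": 0.2},
     "rainy": {"sunny": 0.4, "rainy": 0.6}}

def step(state, rng):
    """Sample the next state given only the current one (the Markov property)."""
    r, cum = rng.random(), 0.0
    for nxt, p in P[state].items():
        cum += p
        if r < cum:
            return nxt
    return nxt  # guard against floating-point rounding at the boundary

rng = random.Random(1)
state = "sunny"
visits = {"sunny": 0, "rainy": 0}
for _ in range(100000):
    state = step(state, rng)
    visits[state] += 1
print(visits)  # long-run frequencies approach the stationary distribution (2/3, 1/3)
```

For these numbers the stationary distribution solves π_sunny = 0.8 π_sunny + 0.4 π_rainy, giving (2/3, 1/3); the simulated visit frequencies converge to it.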
![Page 39: In Search of Pi - Meetupfiles.meetup.com/1406240/ReinforcementLearningTalk.pdfReinforcement Learning...the idea of a learning system that wants something. This was the idea of a "hedonistic"learning](https://reader035.vdocuments.us/reader035/viewer/2022071116/5ffd68db00cc002e69079cc9/html5/thumbnails/39.jpg)
Markov Decision Process (MDP)
A Markov Decision Process (MDP) adds some further structure to the problem.

Figure: an MDP unrolled in time: states s1 → s2 → s3 → s4, with actions a1, a2, a3 and rewards r1, r2, r3 at the transitions.
![Page 40: In Search of Pi - Meetupfiles.meetup.com/1406240/ReinforcementLearningTalk.pdfReinforcement Learning...the idea of a learning system that wants something. This was the idea of a "hedonistic"learning](https://reader035.vdocuments.us/reader035/viewer/2022071116/5ffd68db00cc002e69079cc9/html5/thumbnails/40.jpg)
Hidden Markov Model (HMM)
Hidden Markov Models (HMMs) provide a mechanism for modeling a hidden (i.e. unobserved) stochastic process through a related observed process. HMMs have grown increasingly popular following their success in NLP.

Figure: hidden states s1 → s2 → s3 → s4, each emitting an observation (o1, o2, o3, o4).
![Page 41: In Search of Pi - Meetupfiles.meetup.com/1406240/ReinforcementLearningTalk.pdfReinforcement Learning...the idea of a learning system that wants something. This was the idea of a "hedonistic"learning](https://reader035.vdocuments.us/reader035/viewer/2022071116/5ffd68db00cc002e69079cc9/html5/thumbnails/41.jpg)
Partially Observable Markov Decision Processes (POMDP)
A Partially Observable Markov Decision Process (POMDP) extends the MDP by assuming only partial observability of the states, where the current state is represented by a probability distribution (a belief state).

Figure: a POMDP unrolled in time: hidden states s1 → s2 → s3 → s4 emitting observations o1–o4, with actions a1–a3 and rewards r1–r3.
![Page 42: In Search of Pi - Meetupfiles.meetup.com/1406240/ReinforcementLearningTalk.pdfReinforcement Learning...the idea of a learning system that wants something. This was the idea of a "hedonistic"learning](https://reader035.vdocuments.us/reader035/viewer/2022071116/5ffd68db00cc002e69079cc9/html5/thumbnails/42.jpg)
RL Model
An MDP transitions from state s to state s′ following an action a, receiving a reward r as a result of each transition:

s_0 --a_0, r_0--> s_1 --a_1, r_1--> s_2 ...   (2)
MDP Components
- S is a set of states
- A is a set of actions
- R(s) is a reward function
In addition we define:
- T(s′ | s, a), a probability transition function
- γ, a discount factor (between 0 and 1)
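The components above can be written out directly as data. A tiny two-state MDP (the states, actions, and numbers are invented for illustration):

```python
# A tiny MDP as plain data, mirroring the components (S, A, R, T, gamma) above.
states = ["s0", "s1"]
actions = ["stay", "move"]
gamma = 0.9

R = {"s0": 0.0, "s1": 1.0}  # reward function R(s)

# T[s][a][s2] = probability of landing in s2 after taking action a in state s
T = {
    "s0": {"stay": {"s0": 1.0, "s1": 0.0},
           "move": {"s0": 0.1, "s1": 0.9}},
    "s1": {"stay": {"s0": 0.0, "s1": 1.0},
           "move": {"s0": 0.9, "s1": 0.1}},
}

# Sanity check: transition probabilities out of each (s, a) must sum to 1.
for s in states:
    for a in actions:
        assert abs(sum(T[s][a].values()) - 1.0) < 1e-9
print("valid MDP")
```

Everything that follows (policies, value functions, value iteration) operates on exactly these five objects.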
![Page 43: In Search of Pi - Meetupfiles.meetup.com/1406240/ReinforcementLearningTalk.pdfReinforcement Learning...the idea of a learning system that wants something. This was the idea of a "hedonistic"learning](https://reader035.vdocuments.us/reader035/viewer/2022071116/5ffd68db00cc002e69079cc9/html5/thumbnails/43.jpg)
Policy
The objective is to find a policy π that maps states to actions and maximizes the rewards over time:

π(s) → a
![Page 44: In Search of Pi - Meetupfiles.meetup.com/1406240/ReinforcementLearningTalk.pdfReinforcement Learning...the idea of a learning system that wants something. This was the idea of a "hedonistic"learning](https://reader035.vdocuments.us/reader035/viewer/2022071116/5ffd68db00cc002e69079cc9/html5/thumbnails/44.jpg)
RL Model
We define a value function, the expected return, which we aim to maximize:

V^π(s) = E[R(s_0) + γ R(s_1) + γ² R(s_2) + ··· | s_0 = s, π]

We can rewrite this as a recurrence relation, known as the Bellman Equation:

V^π(s) = R(s) + γ Σ_{s′∈S} T(s′ | s, π(s)) V^π(s′)

The analogous optimality equation for the action-value function is:

Q*(s, a) = R(s) + γ Σ_{s′∈S} T(s′ | s, a) max_{a′} Q*(s′, a′)
![Page 45: In Search of Pi - Meetupfiles.meetup.com/1406240/ReinforcementLearningTalk.pdfReinforcement Learning...the idea of a learning system that wants something. This was the idea of a "hedonistic"learning](https://reader035.vdocuments.us/reader035/viewer/2022071116/5ffd68db00cc002e69079cc9/html5/thumbnails/45.jpg)
Grid World
Grid world is a canonical example used in reinforcement learning.
(from David Silver's introductory RL lectures: http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching_files/introRL.pdf)
Model-based vs. Model-free
- Model-free: learn a controller without learning a model.
- Model-based: learn a model, and use it to derive a controller.
Notation Comment
I am using a fairly standard notation throughout this talk, which focuses on maximization of a utility; an alternative convention uses minimization of cost, where the cost is the negative of the reward:

Here                               | Alternative
action a                           | control u
reward R                           | cost g
value V                            | cost-to-go J
policy π                           | policy µ
discounting factor γ               | discounting factor α
transition probability P_a(s, s′)  | transition probability p_{ss′}(a)
Outline
The recipe (some context)
Looking at other pi’s (motivating examples)
Prep the ingredients (the simplest example)
Mixing the ingredients (models)
Baking (methods)
Eat your own pi (code)
I ate the whole pi, but I’m still hungry! (references)
How to Solve an MDP
The basics, from dynamic programming to TD(λ).
Families of Approaches
The approaches to RL can be summarized based on what they learn (from Littman's 2009 talk at NIPS).
Backup Diagrams
Backup diagrams provide a mechanism for summarizing how different methods operate, by showing how information is backed up to a state.

Figure: a backup tree rooted at state s0, branching over actions a0, a1, a2 into successor states s1–s6, with a reward r on each transition.
Dynamic Programming
Dynamic programming is one of the most widely known methods for multi-period optimization.
Dynamic Programming Methods
Dynamic programming methods require full knowledge of the environment: T (the probability transition function) and R (the reward function).
- Value iteration: Bellman (1957) introduced this method, which finds the value of each state; the values can then be used to compute a policy.
- Policy iteration: Howard (1960) alternates between evaluating the value of the current policy and improving the policy, repeating until the policy no longer changes.
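Value iteration can be sketched in a few lines. This runs the Bellman optimality backup to a fixed point on a toy two-state MDP (the transition numbers are invented for illustration):

```python
# Value iteration on a toy two-state MDP (all numbers are invented).
gamma = 0.9
states = ["s0", "s1"]
actions = ["stay", "move"]
R = {"s0": 0.0, "s1": 1.0}
T = {"s0": {"stay": {"s0": 1.0, "s1": 0.0}, "move": {"s0": 0.1, "s1": 0.9}},
     "s1": {"stay": {"s0": 0.0, "s1": 1.0}, "move": {"s0": 0.9, "s1": 0.1}}}

V = {s: 0.0 for s in states}
for _ in range(1000):
    # Bellman optimality backup: V(s) = R(s) + gamma * max_a sum_s' T(s'|s,a) V(s')
    new_V = {s: R[s] + gamma * max(sum(p * V[s2] for s2, p in T[s][a].items())
                                   for a in actions)
             for s in states}
    if max(abs(new_V[s] - V[s]) for s in states) < 1e-10:
        break  # converged (the backup is a gamma-contraction)
    V = new_V

# Extract the greedy policy from the converged values.
policy = {s: max(actions, key=lambda a: sum(p * V[s2] for s2, p in T[s][a].items()))
          for s in states}
print(V, policy)
```

For these numbers the fixed point is V(s1) = 1/(1−γ) = 10 (stay in the rewarding state forever) and V(s0) ≈ 8.9, with the greedy policy moving from s0 to s1 and staying there.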
Generalized Policy Iteration
Almost all reinforcement learning methods can be described by the general idea of generalized policy iteration (GPI), which breaks the optimization into two interacting processes: policy evaluation and policy improvement.

Figure: the GPI loop: evaluation drives the value estimate toward V^π, while improvement drives the policy toward one that is greedy with respect to V, converging to π* and V*.
Monte Carlo
Monte Carlo Methods
Monte Carlo methods learn from on-line, simulated experience, and require no prior knowledge of the environment's dynamics.
Temporal Difference
Temporal Difference (TD) learning was formally introduced in Sutton (1984, 1988). Similar ideas appeared earlier in Samuel's checkers program (1959).
TD(0) Updates
TD learning computes the temporal-difference error and adds it to the current estimate, scaled by the learning rate α:

V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) − V(S_t)]
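The update above can be sketched as a short simulation. This evaluates the random policy on a 5-state random walk (a standard illustrative task; the exact layout here is our own small variant): start in the middle, move left or right with equal probability, and receive reward 1 only for exiting off the right end.

```python
import random

alpha, gamma = 0.1, 1.0
V = [0.5] * 5  # value estimates for the 5 nonterminal states 0..4
rng = random.Random(0)

for _ in range(5000):
    s = 2  # every episode starts in the center state
    while True:
        s_next = s + (1 if rng.random() < 0.5 else -1)
        if s_next < 0:            # fell off the left end: reward 0, terminal
            r, v_next, done = 0.0, 0.0, True
        elif s_next > 4:          # fell off the right end: reward 1, terminal
            r, v_next, done = 1.0, 0.0, True
        else:
            r, v_next, done = 0.0, V[s_next], False
        # TD(0) update: V(S_t) <- V(S_t) + alpha * [R_{t+1} + gamma*V(S_{t+1}) - V(S_t)]
        V[s] += alpha * (r + gamma * v_next - V[s])
        if done:
            break
        s = s_next

print([round(v, 2) for v in V])  # true values are [1/6, 2/6, 3/6, 4/6, 5/6]
```

With a constant α the estimates keep fluctuating around the true values; a decaying step size would let them converge exactly.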
Q-Learning
Q-learning (Watkins 1989) is a model-free method, and is one of the most important methods in reinforcement learning, as it was one of the first shown to converge. Rather than learning the optimal value function directly, Q-learning learns the Q (action-value) function:

Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t)]
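A minimal tabular Q-learning sketch on an invented deterministic chain: states 0..3 in a line, actions left/right, reward 1 for reaching the terminal state 3.

```python
import random

n_states, actions = 4, ["left", "right"]
alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
rng = random.Random(0)

def move(s, a):
    """Deterministic dynamics: step along the chain, reward 1 at the goal."""
    s2 = max(0, s - 1) if a == "left" else min(n_states - 1, s + 1)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

for _ in range(500):
    s = 0
    while s != n_states - 1:
        # epsilon-greedy behavior policy
        if rng.random() < epsilon:
            a = rng.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s2, r = move(s, a)
        # Q-learning update: bootstrap from the greedy (max) action in s2,
        # regardless of what the behavior policy does next (off-policy).
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
        s = s2

greedy = {s: max(actions, key=lambda act: Q[(s, act)]) for s in range(n_states - 1)}
print(greedy)  # the learned greedy policy heads right toward the goal
```

Note the max in the update: Q-learning evaluates the greedy policy while following an exploratory one, which is what makes it off-policy.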
SARSA
SARSA (Rummery and Niranjan 1994, who called it modified Q-learning) is an on-policy temporal-difference learning method; the update bootstraps from the action A_{t+1} actually taken:

Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t)]
Eligibility Traces
Eligibility traces provide a mechanism for assigning credit more quickly, and can improve learning.
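A hedged sketch of TD(λ) with accumulating traces on the same 5-state random walk: each TD error updates every recently visited state in proportion to its decaying trace, so credit flows back faster than in TD(0). (Names and constants are illustrative, not the talk's R code.)

```python
import random

# TD(lambda) with accumulating eligibility traces; terminals at 0 and 6.
ALPHA, GAMMA, LAM = 0.1, 1.0, 0.8
V = [0.0] * 7

for _ in range(1000):
    e = [0.0] * 7  # traces reset at the start of each episode
    s = 3
    while s not in (0, 6):
        s2 = s + random.choice((-1, 1))
        r = 1.0 if s2 == 6 else 0.0
        v_next = 0.0 if s2 in (0, 6) else V[s2]
        delta = r + GAMMA * v_next - V[s]
        e[s] += 1.0  # accumulate trace for the state just left
        for i in range(1, 6):
            V[i] += ALPHA * delta * e[i]  # every traced state shares the error
            e[i] *= GAMMA * LAM           # decay all traces
        s = s2

print([round(v, 2) for v in V[1:6]])  # approaches 1/6, 2/6, ..., 5/6
```

Setting λ = 0 recovers TD(0) (only the current state has a nonzero trace), while λ = 1 approaches a Monte Carlo update.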
![Page 62: In Search of Pi - Meetupfiles.meetup.com/1406240/ReinforcementLearningTalk.pdfReinforcement Learning...the idea of a learning system that wants something. This was the idea of a "hedonistic"learning](https://reader035.vdocuments.us/reader035/viewer/2022071116/5ffd68db00cc002e69079cc9/html5/thumbnails/62.jpg)
Methods, the Unified View
The space of RL methods (from Maei (2011) ”Gradient Temporal-Difference Learning Algorithms”)
![Page 63: In Search of Pi - Meetupfiles.meetup.com/1406240/ReinforcementLearningTalk.pdfReinforcement Learning...the idea of a learning system that wants something. This was the idea of a "hedonistic"learning](https://reader035.vdocuments.us/reader035/viewer/2022071116/5ffd68db00cc002e69079cc9/html5/thumbnails/63.jpg)
Outline
The recipe (some context)
Looking at other pi’s (motivating examples)
Prep the ingredients (the simplest example)
Mixing the ingredients (models)
Baking (methods)
Eat your own pi (code)
I ate the whole pi, but I’m still hungry! (references)
![Page 64: In Search of Pi - Meetupfiles.meetup.com/1406240/ReinforcementLearningTalk.pdfReinforcement Learning...the idea of a learning system that wants something. This was the idea of a "hedonistic"learning](https://reader035.vdocuments.us/reader035/viewer/2022071116/5ffd68db00cc002e69079cc9/html5/thumbnails/64.jpg)
From Theory to Practice
A tour of reinforcement learning software.
![Page 65: In Search of Pi - Meetupfiles.meetup.com/1406240/ReinforcementLearningTalk.pdfReinforcement Learning...the idea of a learning system that wants something. This was the idea of a "hedonistic"learning](https://reader035.vdocuments.us/reader035/viewer/2022071116/5ffd68db00cc002e69079cc9/html5/thumbnails/65.jpg)
The State of Open Source RL
There are a number of projects that provide RL algorithms:
- RL-Glue/RL-Library (multi-language)
- RLPark (Java)
- PyBrain (Python)
- RLPy (Python)
- RLToolkit (Python, paper)
- RL Toolbox (C++) (Master’s Thesis)
![Page 66: In Search of Pi - Meetupfiles.meetup.com/1406240/ReinforcementLearningTalk.pdfReinforcement Learning...the idea of a learning system that wants something. This was the idea of a "hedonistic"learning](https://reader035.vdocuments.us/reader035/viewer/2022071116/5ffd68db00cc002e69079cc9/html5/thumbnails/66.jpg)
RL-Glue
RL-Glue is a fundamental library for the RL community.
Figure: RL-Glue standard
![Page 67: In Search of Pi - Meetupfiles.meetup.com/1406240/ReinforcementLearningTalk.pdfReinforcement Learning...the idea of a learning system that wants something. This was the idea of a "hedonistic"learning](https://reader035.vdocuments.us/reader035/viewer/2022071116/5ffd68db00cc002e69079cc9/html5/thumbnails/67.jpg)
RL-Glue: Codecs
RL-Glue currently offers codecs for multiple languages (see RL-Glue Extensions):
- C/C++
- Lisp
- Java
- Matlab
- Python
We recently created the rlglue R package:
library(devtools); install_github("smc77/rlglue")
![Page 68: In Search of Pi - Meetupfiles.meetup.com/1406240/ReinforcementLearningTalk.pdfReinforcement Learning...the idea of a learning system that wants something. This was the idea of a "hedonistic"learning](https://reader035.vdocuments.us/reader035/viewer/2022071116/5ffd68db00cc002e69079cc9/html5/thumbnails/68.jpg)
RL R Package
The RL package in R is intended for three things:
- Clear RL algorithms for education
- Generic, reusable models that can be applied to any dataset
- Sophisticated, cutting-edge methods
It also includes features such as ensemble methods.
![Page 69: In Search of Pi - Meetupfiles.meetup.com/1406240/ReinforcementLearningTalk.pdfReinforcement Learning...the idea of a learning system that wants something. This was the idea of a "hedonistic"learning](https://reader035.vdocuments.us/reader035/viewer/2022071116/5ffd68db00cc002e69079cc9/html5/thumbnails/69.jpg)
RL R Package (Roadmap)
- On-policy prediction: TD(λ)
- Off-policy prediction: GTD(λ), GQ(λ)
- On-policy control: SARSA(λ)
- Off-policy control: Q(λ)
- Acting: softmax, greedy, ε-greedy
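The three acting strategies on the roadmap can be sketched in a few lines (a hedged Python illustration; these function names are not the package's API):

```python
import math
import random

# Three common ways to turn action values into an action.

def greedy(q_values):
    return max(range(len(q_values)), key=q_values.__getitem__)

def epsilon_greedy(q_values, eps=0.1):
    if random.random() < eps:
        return random.randrange(len(q_values))  # explore uniformly
    return greedy(q_values)                     # otherwise exploit

def softmax(q_values, tau=1.0):
    # Boltzmann exploration: action probability is proportional to
    # exp(Q / tau); smaller tau concentrates mass on the best action.
    weights = [math.exp(q / tau) for q in q_values]
    return random.choices(range(len(q_values)), weights=weights)[0]

print(greedy([1.0, 3.0, 2.0]))  # -> 1
```

Greedy never explores, ε-greedy explores uniformly with probability ε, and softmax explores in proportion to how good each action currently looks.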
![Page 70: In Search of Pi - Meetupfiles.meetup.com/1406240/ReinforcementLearningTalk.pdfReinforcement Learning...the idea of a learning system that wants something. This was the idea of a "hedonistic"learning](https://reader035.vdocuments.us/reader035/viewer/2022071116/5ffd68db00cc002e69079cc9/html5/thumbnails/70.jpg)
Approach
The RL package in R follows a basic routine:
1. Define an agent
   - Specify a model (e.g. MDP, POMDP)
   - Choose a learning method (e.g. value iteration, Q-learning)
   - Choose a planning method (e.g. ε-greedy, UCB, Bayesian)
2. Define an environment (a dataset or simulator, terminal state)
3. Run an experiment (number of episodes, specify ε)
The result of running a simulation is an RLModel object, which can hold several different utilities, including the optimal policy. The package also includes a number of examples (grid world, pole balancing).
![Page 71: In Search of Pi - Meetupfiles.meetup.com/1406240/ReinforcementLearningTalk.pdfReinforcement Learning...the idea of a learning system that wants something. This was the idea of a "hedonistic"learning](https://reader035.vdocuments.us/reader035/viewer/2022071116/5ffd68db00cc002e69079cc9/html5/thumbnails/71.jpg)
Simulation
Similar to RLinterface in RLToolkit. Methods: step, steps, episode, episodes.
![Page 72: In Search of Pi - Meetupfiles.meetup.com/1406240/ReinforcementLearningTalk.pdfReinforcement Learning...the idea of a learning system that wants something. This was the idea of a "hedonistic"learning](https://reader035.vdocuments.us/reader035/viewer/2022071116/5ffd68db00cc002e69079cc9/html5/thumbnails/72.jpg)
Outline
The recipe (some context)
Looking at other pi’s (motivating examples)
Prep the ingredients (the simplest example)
Mixing the ingredients (models)
Baking (methods)
Eat your own pi (code)
I ate the whole pi, but I’m still hungry! (references)
![Page 73: In Search of Pi - Meetupfiles.meetup.com/1406240/ReinforcementLearningTalk.pdfReinforcement Learning...the idea of a learning system that wants something. This was the idea of a "hedonistic"learning](https://reader035.vdocuments.us/reader035/viewer/2022071116/5ffd68db00cc002e69079cc9/html5/thumbnails/73.jpg)
Try this at home!
All the source code from this talk is available at: https://github.com/smc77/rl
Other open source software:
- RL-Glue/RL-Library (multi-language)
- RLPark (Java)
- PyBrain (Python)
- RLPy (Python)
- RLToolkit (Python, paper)
- RL Toolbox (C++) (Master’s Thesis)
![Page 74: In Search of Pi - Meetupfiles.meetup.com/1406240/ReinforcementLearningTalk.pdfReinforcement Learning...the idea of a learning system that wants something. This was the idea of a "hedonistic"learning](https://reader035.vdocuments.us/reader035/viewer/2022071116/5ffd68db00cc002e69079cc9/html5/thumbnails/74.jpg)
Community
- https://groups.google.com/forum/!forum/rl-list
- glue.rl-community.org/
- http://www.rl-competition.org/
![Page 75: In Search of Pi - Meetupfiles.meetup.com/1406240/ReinforcementLearningTalk.pdfReinforcement Learning...the idea of a learning system that wants something. This was the idea of a "hedonistic"learning](https://reader035.vdocuments.us/reader035/viewer/2022071116/5ffd68db00cc002e69079cc9/html5/thumbnails/75.jpg)
Papers
Surveys:
- Kaelbling, Littman, and Moore (1996) ”Reinforcement Learning: A Survey”
- Littman (1996) ”Algorithms for Sequential Decision Making”
- Kober, Bagnell, and Peters (2013) ”Reinforcement Learning in Robotics: A Survey”
![Page 76: In Search of Pi - Meetupfiles.meetup.com/1406240/ReinforcementLearningTalk.pdfReinforcement Learning...the idea of a learning system that wants something. This was the idea of a "hedonistic"learning](https://reader035.vdocuments.us/reader035/viewer/2022071116/5ffd68db00cc002e69079cc9/html5/thumbnails/76.jpg)
Books (AI/ML/Robotics/Planning)
These are general textbooks that provide a good overview ofreinforcement learning.
- Russell and Norvig (2010) ”Artificial Intelligence: A Modern Approach”
- Ghallab, Nau, and Traverso ”Automated Planning: Theory and Practice”
- Thrun ”Probabilistic Robotics”
- Poole and Mackworth (2010) ”Artificial Intelligence: Foundations of Computational Agents”
- Mitchell (1997) ”Machine Learning”
- Marsland (2009) ”Machine Learning: An Algorithmic Perspective”
![Page 77: In Search of Pi - Meetupfiles.meetup.com/1406240/ReinforcementLearningTalk.pdfReinforcement Learning...the idea of a learning system that wants something. This was the idea of a "hedonistic"learning](https://reader035.vdocuments.us/reader035/viewer/2022071116/5ffd68db00cc002e69079cc9/html5/thumbnails/77.jpg)
Books (RL)
- Sutton and Barto (1998) ”Reinforcement Learning: An Introduction”
- Bertsekas and Tsitsiklis (1996) ”Neuro-Dynamic Programming”
- Wiering and van Otterlo (2012) ”Reinforcement Learning: State-of-the-Art”
- Csaba Szepesvári (2009) ”Algorithms for Reinforcement Learning”
![Page 78: In Search of Pi - Meetupfiles.meetup.com/1406240/ReinforcementLearningTalk.pdfReinforcement Learning...the idea of a learning system that wants something. This was the idea of a "hedonistic"learning](https://reader035.vdocuments.us/reader035/viewer/2022071116/5ffd68db00cc002e69079cc9/html5/thumbnails/78.jpg)
People
- Richard Sutton (Alberta) - http://webdocs.cs.ualberta.ca/~sutton/
- Andrew Barto (UMass) - http://www-anw.cs.umass.edu/~barto/
- Michael Littman (Brown) - http://cs.brown.edu/~mlittman/
- Benjamin Van Roy (Stanford) - http://web.stanford.edu/~bvr/
- Leslie Kaelbling (MIT) - http://people.csail.mit.edu/lpk/
- Emma Brunskill (CMU) - http://www.cs.cmu.edu/~ebrun/
- Dimitri Bertsekas (MIT) - http://www.mit.edu/~dimitrib/home.html
- Csaba Szepesvári - http://www.ualberta.ca/~szepesva/
- Chris Watkins - http://www.cs.rhul.ac.uk/home/chrisw/
- Lihong Li (Microsoft) - http://www.research.rutgers.edu/~lihong/
- John Langford (Microsoft) - http://hunch.net/~jl/
- Hado van Hasselt - http://webdocs.cs.ualberta.ca/~vanhasse/
![Page 79: In Search of Pi - Meetupfiles.meetup.com/1406240/ReinforcementLearningTalk.pdfReinforcement Learning...the idea of a learning system that wants something. This was the idea of a "hedonistic"learning](https://reader035.vdocuments.us/reader035/viewer/2022071116/5ffd68db00cc002e69079cc9/html5/thumbnails/79.jpg)
Questions?