lecture 1: introduction to rl - stanford...
TRANSCRIPT
![Page 1: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/1.jpg)
Lecture 1: Introduction to RL
Emma Brunskill
CS234 RL
Winter 2020
Today the 3rd part of the lecture includes slides from David Silver’sintroduction to RL slides or modifications of
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 1 / 67
![Page 2: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/2.jpg)
Today’s Plan
Overview of reinforcement learning
Course logistics
Introduction to sequential decision making under uncertainty
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 2 / 67
![Page 3: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/3.jpg)
Make good sequences of decisions
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 3 / 67
![Page 4: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/4.jpg)
Learn to make good sequences of decisions
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 4 / 67
![Page 5: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/5.jpg)
Reinforcement Learning
Fundamental challenge in artificial intelligence and machine learning islearning to make good decisions under uncertainty
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 5 / 67
![Page 6: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/6.jpg)
2010s: New Era of RL. Atari
Figure: DeepMind Nature, 2015
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 6 / 67
![Page 7: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/7.jpg)
2010s: New Era of RL. Robotics
Figure: Chelsea Finn, Sergey Levine, Pieter Abbeel
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 7 / 67
![Page 8: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/8.jpg)
Expanding Reach. Educational Games
Figure: RL used to optimize Refraction 1, Madel, Liu, Brunskill, Popvic AAMAS2014.
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 8 / 67
![Page 9: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/9.jpg)
Expanding Reach. Health
Figure: Personalized HeartSteps: A Reinforcement Learning Algorithm forOptimizing Physical Activity. Liao, Greenewald, Klasnja, Murphy 2019 arxiv
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 9 / 67
![Page 10: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/10.jpg)
With great power there must also come – great responsibility–Spiderman comics (though related comments appear in the French
National Convention 1793, by Lamb 1817 & Churchill 1906)
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 10 / 67
![Page 11: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/11.jpg)
Reinforcement Learning Involves
Optimization
Delayed consequences
Exploration
Generalization
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 11 / 67
![Page 12: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/12.jpg)
Optimization
Goal is to find an optimal way to make decisions
Yielding best outcomes or at least very good outcomes
Explicit notion of utility of decisions
Example: finding minimum distance route between two cities givennetwork of roads
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 12 / 67
![Page 13: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/13.jpg)
Delayed Consequences
Decisions now can impact things much later...
Saving for retirementFinding a key in video game Montezuma’s revenge
Introduces two challenges
When planning: decisions involve reasoning about not just immediatebenefit of a decision but also its longer term ramificationsWhen learning: temporal credit assignment is hard (what caused laterhigh or low rewards?)
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 13 / 67
![Page 14: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/14.jpg)
Exploration
Learning about the world by making decisions
Agent as scientistLearn to ride a bike by trying (and failing)Finding a key in Montezuma’s revenge
Censored data
Only get a reward (label) for decision madeDon’t know what would have happened if we had taken red pill insteadof blue pill (Matrix movie reference)
Decisions impact what we learn about
If we choose to go to Stanford instead of MIT, we will have differentlater experiences...
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 14 / 67
![Page 15: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/15.jpg)
Policy is mapping from past experience to action
Why not just pre-program a policy?
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 15 / 67
![Page 16: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/16.jpg)
Generalization
Policy is mapping from past experience to action
Why not just pre-program a policy?
Figure: DeepMind Nature, 2015
How many possible images are there?(256100×200
)3Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 16 / 67
![Page 17: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/17.jpg)
Reinforcement Learning Involves
Optimization
Exploration
Generalization
Delayed consequences
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 17 / 67
![Page 18: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/18.jpg)
RL vs Other AI and Machine Learning
AI Planning SL UL RL IL
Optimization
Learns from experience
Generalization
Delayed Consequences
Exploration
SL = Supervised learning; UL = Unsupervised learning; RL =Reinforcement Learning; IL = Imitation Learning
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 18 / 67
![Page 19: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/19.jpg)
RL vs Other AI and Machine Learning
AI Planning SL UL RL IL
Optimization X
Learns from experience
Generalization X
Delayed Consequences X
Exploration
SL = Supervised learning; UL = Unsupervised learning; RL =Reinforcement Learning; IL = Imitation Learning
AI planning assumes have a model of how decisions impactenvironment
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 19 / 67
![Page 20: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/20.jpg)
RL vs Other AI and Machine Learning
AI Planning SL UL RL IL
Optimization X
Learns from experience X
Generalization X X
Delayed Consequences X
Exploration
SL = Supervised learning; UL = Unsupervised learning; RL =Reinforcement Learning; IL = Imitation Learning
Supervised learning is provided correct labels
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 20 / 67
![Page 21: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/21.jpg)
RL vs Other AI and Machine Learning
AI Planning SL UL RL IL
Optimization X
Learns from experience X X
Generalization X X X
Delayed Consequences X
Exploration
SL = Supervised learning; UL = Unsupervised learning; RL =Reinforcement Learning; IL = Imitation Learning
Unsupervised learning is provided no labels
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 21 / 67
![Page 22: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/22.jpg)
RL vs Other AI and Machine Learning
AI Planning SL UL RL IL
Optimization X X
Learns from experience X X X
Generalization X X X X
Delayed Consequences X X
Exploration X
SL = Supervised learning; UL = Unsupervised learning; RL =Reinforcement Learning; IL = Imitation Learning
Reinforcement learning is provided with censored labels
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 22 / 67
![Page 23: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/23.jpg)
Sidenote: Imitation Learning
AI Planning SL UL RL IL
Optimization X X X
Learns from experience X X X X
Generalization X X X X X
Delayed Consequences X X X
Exploration X
SL = Supervised learning; UL = Unsupervised learning; RL =Reinforcement Learning; IL = Imitation Learning
Imitation learning assumes input demonstrations of good policies
IL reduces RL to SL. IL + RL is promising area
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 23 / 67
![Page 24: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/24.jpg)
How Do We Proceed?
Explore the world
Use experience to guide future decisions
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 24 / 67
![Page 25: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/25.jpg)
Other Issues
Where do rewards come from?
And what happens if we get it wrong?
Robustness / Risk sensitivity
We are not alone...
Multi-agent RL
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 25 / 67
![Page 26: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/26.jpg)
Today’s Plan
Overview of reinforcement learning
Course structure overview
Introduction to sequential decision making under uncertainty
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 26 / 67
![Page 27: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/27.jpg)
High Level Learning Goals*
Define the key features of RL
Given an application probem how (and whether) to use RL for it
Compare and contrast RL algorithms on multiple criteria
*For more detailed descriptions, see website
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 27 / 67
![Page 28: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/28.jpg)
Quick Activity
Think of something you are really good at. Write it down (you don’thave to share it with anyone).
Now in 1 or 2 words, explain how you got to be very good at it.
On the count of 3 shout out how you got to be that good at this
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 28 / 67
![Page 29: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/29.jpg)
Practice!
Think of something you are really good at. Write it down (you don’thave to share it with anyone).
Now in 1 or 2 words, explain how you got to be very good at it.
On the count of 3 shout out how you got to be that good at this
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 29 / 67
![Page 30: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/30.jpg)
Course Staff
Instructor: Emma Brunskill
CA’s: Will Deaderick (Head CA), Rohan Badlani, Yao Liu, Tong Mu,Benjamin Petit, Garrett Thomas, Christina Yuan and Andrea Zanette
Additional information
Course webpage: http://cs234.stanford.eduSchedule, Piazza (fastest way to get help), lecture slidesPrerequisites, grading details, late policy, see webpage
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 30 / 67
![Page 31: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/31.jpg)
Standing on the shoulders of giants...
A key part of human progress is our ability to learn beyond our ownexperience
Enormous variability in the effectiveness of education
Practice, coupled with prompt feedback, is key
Use some of our class time to provide opportunities for practice andfeedback
Huge body of evidence which supports that retrieval practice helpsincrease retention more than many other methods, and can supportdeep learning: New ”refresh your understanding” exercises in manylectures
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 31 / 67
![Page 32: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/32.jpg)
Effective Practice Strategies for Learning Class Content
Keep up with Refresh/Check your understanding exercises
Do homework
Attend office hours for help
Do past midterm for practice without looking at solutions
Complete project
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 32 / 67
![Page 33: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/33.jpg)
Criteria for Doing Well in Class
All of you can succeed if you put in the effort
We, the class staff, and your fellow classmates, are here to help
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 33 / 67
![Page 34: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/34.jpg)
Today’s Plan
Overview of reinforcement learning
Course logistics
Introduction to sequential decision making under uncertainty
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 34 / 67
![Page 35: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/35.jpg)
Refresher Exercise: AI Tutor as a Decision Process
Student initially does not know addition (easier) nor subtraction(harder)
AI tutor agent can provide practice problems about addition orsubtraction
AI agent gets rewarded +1 if student gets problem right, -1 if getproblem wrong
Model this as a Decision Process. Define state space, action space,and reward model. What does the dynamics model represent? Whatwould a policy to optimize the expected discounted sum of rewardsyield?
Write down your own answers (5 min) and then discuss in groups of3-4.
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 35 / 67
![Page 36: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/36.jpg)
Refresher Exercise: AI Tutor as a Decision Process
State:
Actions:
Reward model:
Meaning of dynamics model:
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 36 / 67
![Page 37: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/37.jpg)
Refresher Exercise: AI Tutor as a Decision Process
Student initially does not know addition (easier) nor subtraction(harder)
Teaching agent can provide activities about addition or subtraction
Agent gets rewarded for student performance: +1 if student getsproblem right, -1 if get problem wrong
Which items will agent learn to give to max expected reward? Is thisthe best way to optimize for learning? If not, what other rewardmight one give to encourage learning?
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 37 / 67
![Page 38: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/38.jpg)
Sequential Decision Making
Goal: Select actions to maximize total expected future reward
May require balancing immediate & long term rewards
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 38 / 67
![Page 39: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/39.jpg)
Example: Web Advertising
Goal: Select actions to maximize total expected future reward
May require balancing immediate & long term rewards
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 39 / 67
![Page 40: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/40.jpg)
Example: Robot Unloading Dishwasher
Goal: Select actions to maximize total expected future reward
May require balancing immediate & long term rewards
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 40 / 67
![Page 41: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/41.jpg)
Example: Blood Pressure Control
Goal: Select actions to maximize total expected future reward
May require balancing immediate & long term rewards
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 41 / 67
![Page 42: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/42.jpg)
Sequential Decision Process: Agent & the World (DiscreteTime)
Each time step t:
Agent takes an action atWorld updates given action at , emits observation ot and reward rtAgent receives observation ot and reward rt
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 42 / 67
![Page 43: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/43.jpg)
History: Sequence of Past Observations, Actions &Rewards
History ht = (a1, o1, r1, . . . , at , ot , rt)
Agent chooses action based on history
State is information assumed to determine what happens next
Function of history: st = (ht)
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 43 / 67
![Page 44: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/44.jpg)
World State
This is true state of the world used to determine how world generatesnext observation and reward
Often hidden or unknown to agent
Even if known may contain information not needed by agent
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 44 / 67
![Page 45: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/45.jpg)
Agent State: Agent’s Internal Representation
What the agent / algorithm uses to make decisions about how to act
Generally a function of the history: st = f (ht)
Could include meta information like state of algorithm (how manycomputations executed, etc) or decision process (how many decisionsleft until an episode ends)
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 45 / 67
![Page 46: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/46.jpg)
Markov Assumption
Information state: sufficient statistic of history
State st is Markov if and only if:
p(st+1|st , at) = p(st+1|ht , at)
Future is independent of past given present
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 46 / 67
![Page 47: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/47.jpg)
Markov Assumption for Prior Examples
Information state: sufficient statistic of history
State st is Markov if and only if:
p(st+1|st , at) = p(st+1|ht , at)
Future is independent of past given present
Hypertension control: let state be current blood pressure, and actionbe whether to take medication or not. Is this system Markov?
Website shopping: state is current product viewed by customer, andaction is what other product to recommend. Is this system Markov?
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 47 / 67
![Page 48: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/48.jpg)
Why is Markov Assumption Popular?
Can always be satisfied
Setting state as history always Markov: st = ht
In practice often assume most recent observation is sufficient statisticof history: st = otState representation has big implications for:
Computational complexityData requiredResulting performance
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 48 / 67
![Page 49: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/49.jpg)
Full Observability / Markov Decision Process (MDP)
Environment and world state st = ot
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 49 / 67
![Page 50: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/50.jpg)
Types of Sequential Decision Processes
Is state Markov? Is world partially observable? (POMDP)
Are dynamics deterministic or stochastic?
Do actions influence only immediate reward or reward and next state?
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 50 / 67
![Page 51: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/51.jpg)
Example: Mars Rover as a Markov Decision Process
!" !# !$ !% !& !' !(
Figure: Mars rover image: NASA/JPL-Caltech
States: Location of rover (s1, . . . , s7)
Actions: TryLeft or TryRight
Rewards:
+1 in state s1+10 in state s70 in all other states
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 51 / 67
![Page 52: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/52.jpg)
RL Algorithm Components
Often includes one or more of: Model, Policy, Value Function
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 52 / 67
![Page 53: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/53.jpg)
MDP Model
Agent’s representation of how world changes given agent’s action
Transition / dynamics model predicts next agent state
p(st+1 = s ′|st = s, at = a)
Reward model predicts immediate reward
r(st = s, at = a) = E[rt |st = s, at = a]
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 53 / 67
![Page 54: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/54.jpg)
Example: Mars Rover Stochastic Markov Model
!" !# !$ !% !& !' !(
*̂ = 0 *̂ = 0 *̂ = 0 *̂ = 0 *̂ = 0 *̂ = 0 *̂ = 0
Numbers above show RL agent’s reward model
Part of agent’s transition model:
0.5 = P(s1|s1,TryRight) = P(s2|s1,TryRight)0.5 = P(s2|s2,TryRight) = P(s3|s2,TryRight) · · ·
Model may be wrong
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 54 / 67
![Page 55: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/55.jpg)
Policy
Policy π determines how the agent chooses actions
π : S → A, mapping from states to actions
Deterministic policy:π(s) = a
Stochastic policy:
π(a|s) = Pr(at = a|st = s)
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 55 / 67
![Page 56: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/56.jpg)
Example: Mars Rover Policy
!" !# !$ !% !& !' !(
π(s1) = π(s2) = · · · = π(s7) = TryRight
Quick check: is this a deterministic policy or a stochastic policy?
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 56 / 67
![Page 57: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/57.jpg)
Value Function
Value function V π: expected discounted sum of future rewards undera particular policy π
V π(st = s) = Eπ[rt + γrt+1 + γ2rt+2 + γ3rt+3 + · · · |st = s]
Discount factor γ weighs immediate vs future rewards
Can be used to quantify goodness/badness of states and actions
And decide how to act by comparing policies
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 57 / 67
![Page 58: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/58.jpg)
Example: Mars Rover Value Function
!" !# !$ !% !& !' !(
)* !" = +1 )* !# = 0 )* !$ = 0 )* !% = 0 )* !& = 0 )* !' = 0 )* !( = +10
Discount factor, γ = 0
π(s1) = π(s2) = · · · = π(s7) = TryRight
Numbers show value V π(s) for this policy and this discount factor
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 58 / 67
![Page 59: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/59.jpg)
Types of RL Agents
Model-based
Explicit: ModelMay or may not have policy and/or value function
Model-free
Explicit: Value function and/or policy functionNo model
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 59 / 67
![Page 60: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/60.jpg)
RL Agents
Figure: Figure from David Silver’s RL course
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 60 / 67
![Page 61: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/61.jpg)
Evaluation and Control
Evaluation
Estimate/predict the expected rewards from following a given policy
Control
Optimization: find the best policy
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 61 / 67
![Page 62: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/62.jpg)
Example: Mars Rover Policy Evaluation
!" !# !$ !% !& !' !(
π(s1) = π(s2) = · · · = π(s7) = TryRight
Discount factor, γ = 0
What is the value of this policy?
V π(st = s) = Eπ[rt + γrt+1 + γ2rt+2 + · · · |st = s]
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 62 / 67
![Page 63: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/63.jpg)
Example: Mars Rover Policy Control
!" !# !$ !% !& !' !(
Discount factor, γ = 0
What is the policy that optimizes the expected discounted sum ofrewards?
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 63 / 67
![Page 64: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/64.jpg)
Course Outline
Markov decision processes & planning
Model-free policy evaluation
Model-free control
Reinforcement learning with function approximation & Deep RL
Policy Search
Exploration
Advanced Topics
See website for more details
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 64 / 67
![Page 65: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/65.jpg)
Imitation Learning
Figure: Abbeel, Coates and Ng helicopter team, Stanford
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 65 / 67
![Page 66: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/66.jpg)
Imitation Learning
Reduces RL to supervised learning
BenefitsGreat tools for supervised learningAvoids exploration problemWith big data lots of data about outcomes of decisions
LimitationsCan be expensive to captureLimited by data collected
Imitation learning + RL promising!
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 66 / 67
![Page 67: Lecture 1: Introduction to RL - Stanford Universityweb.stanford.edu/class/cs234/slides/lecture1.pdf · 2020-01-06 · Fundamental challenge in arti cial intelligence and machine learning](https://reader034.vdocuments.us/reader034/viewer/2022042222/5ec9009c6b6392450e6125ed/html5/thumbnails/67.jpg)
Expanding Reach. NLP, Vision, ...
Figure: Yeung, Russakovsky, Mori, Li 2016.
Emma Brunskill (CS234 RL) Lecture 1: Introduction to RL Winter 2020 67 / 67