CSE 573: Artificial Intelligence (courses.cs.washington.edu/courses/cse573/17wi/slides/11-rl.pdf)
TRANSCRIPT
CSE 573: Artificial Intelligence
Reinforcement Learning

Dan Weld / University of Washington
[Many slides taken from Dan Klein and Pieter Abbeel / CS188 Intro to AI at UC Berkeley; materials available at http://ai.berkeley.edu.]
Logistics

§ PS3 due today
§ PS4 due in one week (Thurs 2/16)
§ Research paper comments due on Tues
§ Paper itself will be on Web calendar after class
Reinforcement Learning
Reinforcement Learning

§ Basic idea:
§ Receive feedback in the form of rewards
§ Agent's utility is defined by the reward function
§ Must (learn to) act so as to maximize expected rewards
§ All learning is based on observed samples of outcomes!

[Diagram: agent/environment loop. The agent sends actions a to the environment; the environment returns state s and reward r.]
Example: Animal Learning

§ RL studied experimentally for more than 60 years in psychology
§ Example: foraging
§ Rewards: food, pain, hunger, drugs, etc.
§ Mechanisms and sophistication debated
§ Bees learn near-optimal foraging plan in field of artificial flowers with controlled nectar supplies
§ Bees have a direct neural connection from nectar intake measurement to motor planning area
Example: Backgammon
§ Reward only for win / loss in terminal states, zero otherwise
§ TD-Gammon learns a function approximation to V(s) using a neural network
§ Combined with depth 3 search, one of the top 3 players in the world
§ You could imagine training Pacman this way…
§ … but it’s tricky! (It’s also PS 4)
Example: Learning to Walk

Initial [Video: AIBO WALK, initial] [Kohl and Stone, ICRA 2004]
Example: Learning to Walk

Finished [Video: AIBO WALK, finished] [Kohl and Stone, ICRA 2004]
Example: Sidewinding

[Andrew Ng] [Video: SNAKE, climbStep + sidewinding]
Parallel Parking

"Few driving tasks are as intimidating as parallel parking…."
https://www.youtube.com/watch?v=pB_iFY2jIdI
Other Applications

§ Go playing
§ Robotic control: helicopter maneuvering, autonomous vehicles
§ Mars rover: path planning, oversubscription planning
§ Elevator planning
§ Game playing: backgammon, tetris, checkers
§ Neuroscience
§ Computational finance, sequential auctions
§ Assisting elderly in simple tasks
§ Spoken dialog management
§ Communication networks: switching, routing, flow control
§ War planning, evacuation planning
Reinforcement Learning

§ Still assume a Markov decision process (MDP):
§ A set of states s ∈ S
§ A set of actions (per state) A
§ A model T(s, a, s')
§ A reward function R(s, a, s') & discount γ

§ Still looking for a policy π(s)

§ New twist: don't know T or R
§ I.e., we don't know which states are good or what the actions do
§ Must actually try actions and states out to learn
Offline (MDPs) vs. Online (RL)

§ Offline Solution (Planning)
§ Monte Carlo Planning (uses a Simulator)
§ Online Learning (RL)

Diff: 1) dying ok; 2) (re)set button
Four Key Ideas for RL

§ Credit-assignment problem
§ What was the real cause of reward?

§ Exploration-exploitation tradeoff

§ Model-based vs. model-free learning
§ What function is being learned?

§ Approximating the value function
§ Smaller → easier to learn & better generalization
Credit Assignment Problem
Exploration-Exploitation Tradeoff

§ You have visited part of the state space and found a reward of 100
§ Is this the best you can hope for???

§ Exploitation: should I stick with what I know and find a good policy w.r.t. this knowledge?
§ At risk of missing out on a better reward somewhere

§ Exploration: should I look for states with more reward?
§ At risk of wasting time & getting some negative reward
Model-Based Learning
Model-Based Learning

§ Model-based idea:
§ Learn an approximate model based on experiences
§ Solve for values as if the learned model were correct

§ Step 1: Learn empirical MDP model
§ Explore (e.g., move randomly)
§ Count outcomes s' for each s, a
§ Normalize to give an estimate of T̂(s, a, s')
§ Discover each R̂(s, a, s') when we experience (s, a, s')

§ Step 2: Solve the learned MDP
§ For example, use value iteration, as before
Example: Model-Based Learning

Random policy π. Assume: γ = 1. Gridworld states: A, B, C, D, E; 'x' marks exiting the grid.

Observed Episodes (Training):
Episode 1: B, east, C, -1; C, east, D, -1; D, exit, x, +10
Episode 2: B, east, C, -1; C, east, D, -1; D, exit, x, +10
Episode 3: E, north, C, -1; C, east, A, -1; A, exit, x, -10
Episode 4: E, north, C, -1; C, east, D, -1; D, exit, x, +10

Learned Model:
T(s, a, s'):
T(B, east, C) = 1.00
T(C, east, D) = 0.75
T(C, east, A) = 0.25
…
R(s, a, s'):
R(B, east, C) = -1
R(C, east, D) = -1
R(D, exit, x) = +10
…
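Step 1 of model-based learning can be reproduced with a short sketch (the episode list below is transcribed from the example above; counting and normalizing recovers the learned model):

```python
from collections import Counter, defaultdict

# Estimate T-hat and R-hat by counting observed transitions.
# 'x' is the dummy terminal state from the example.
episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
]

counts = defaultdict(Counter)   # (s, a) -> Counter of next states s'
rewards = {}                    # (s, a, s') -> observed reward

for episode in episodes:
    for s, a, s2, r in episode:
        counts[(s, a)][s2] += 1
        rewards[(s, a, s2)] = r

def T(s, a, s2):
    """Empirical estimate of P(s' | s, a): normalized outcome counts."""
    n = sum(counts[(s, a)].values())
    return counts[(s, a)][s2] / n

print(T("B", "east", "C"))          # 1.0
print(T("C", "east", "D"))          # 0.75
print(T("C", "east", "A"))          # 0.25
print(rewards[("D", "exit", "x")])  # 10
```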
Convergence

§ If policy explores "enough" (doesn't starve any state)
§ Then T & R converge

§ So VI, PI, LAO*, etc. will find optimal policy
§ Using Bellman equations

§ When can agent start exploiting??
§ (We'll answer this question later)
Two main reinforcement learning approaches

§ Model-based approaches:
§ explore environment & learn model, T = P(s'|s, a) and R(s, a), (almost) everywhere
§ use model to plan policy, MDP-style
§ approach leads to strongest theoretical results
§ often works well when state space is manageable

§ Model-free approach:
§ don't learn a model of T & R; instead, learn Q-function (or policy) directly
§ weaker theoretical results
§ often works better when state space is large
Two main reinforcement learning approaches

§ Model-based approaches: learn T + R
§ |S|²|A| + |S||A| parameters (40,400)

§ Model-free approach: learn Q
§ |S||A| parameters (400)
Model-Free Learning
Nothing is Free in Life!

§ What exactly is free???
§ No model of T
§ No model of R
§ (Instead, just model Q)
Reminder: Q-Value Iteration

V_k(s') = max_a' Q_k(s', a')

§ For all s, a: initialize Q_0(s, a) = 0    (no time steps left means an expected reward of zero)
§ k = 0
§ Repeat (do Bellman backups):
    For every (s, a) pair:
        Q_{k+1}(s, a) ← Σ_s' T(s, a, s') [ R(s, a, s') + γ max_a' Q_k(s', a') ]
    k += 1
§ Until convergence, i.e., Q-values don't change much

(The max over a' is easy to compute; the expectation over s' is the part we can sample.)
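The backup loop above can be sketched in code. A minimal example, assuming a tiny two-state MDP that is made up for illustration (not from the slides):

```python
GAMMA = 0.9

# T[(s, a)] = list of (s', prob); R[(s, a, s')] = reward.  Toy MDP:
# from s0, "go" usually reaches s1 (reward 1); s1 is absorbing.
T = {
    ("s0", "stay"): [("s0", 1.0)],
    ("s0", "go"):   [("s1", 0.8), ("s0", 0.2)],
    ("s1", "stay"): [("s1", 1.0)],
}
R = {
    ("s0", "stay", "s0"): 0.0,
    ("s0", "go", "s1"): 1.0,
    ("s0", "go", "s0"): 0.0,
    ("s1", "stay", "s1"): 0.0,
}
actions = {"s0": ["stay", "go"], "s1": ["stay"]}

# Initialize Q0(s, a) = 0: no time steps left means expected reward zero.
Q = {(s, a): 0.0 for s in actions for a in actions[s]}

for _ in range(1000):  # repeat Bellman backups until values barely change
    newQ = {}
    for (s, a) in Q:
        newQ[(s, a)] = sum(
            p * (R[(s, a, s2)] + GAMMA * max(Q[(s2, a2)] for a2 in actions[s2]))
            for s2, p in T[(s, a)]
        )
    converged = max(abs(newQ[k] - Q[k]) for k in Q) < 1e-9
    Q = newQ
    if converged:
        break
```

At the fixed point V(s0) = max(Q(s0, stay), Q(s0, go)) satisfies the Bellman equation; here Q(s0, go) converges to 0.8/0.82.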
Puzzle: Q-Learning

V_k(s') = max_a' Q_k(s', a')

§ For all s, a: initialize Q_0(s, a) = 0    (no time steps left means an expected reward of zero)
§ k = 0
§ Repeat (do Bellman backups):
    For every (s, a) pair:
        Q_{k+1}(s, a) ← Σ_s' T(s, a, s') [ R(s, a, s') + γ max_a' Q_k(s', a') ]
    k += 1
§ Until convergence, i.e., Q-values don't change much

Q: How can we compute this without R, T?!?
A: Compute averages using sampled outcomes
Simple Example: Expected Age
Goal: Compute expected age of CSE students

Known P(A):
    E[A] = Σ_a P(a) · a

Unknown P(A): "Model Based"
    Estimate P̂(a) = num(a)/N from samples, then E[A] ≈ Σ_a P̂(a) · a
    Why does this work? Because eventually you learn the right model.

Unknown P(A): "Model Free"
    Without P(A), instead collect samples [a1, a2, … aN]
    E[A] ≈ (1/N) Σ_i a_i
    Why does this work? Because samples appear with the right frequencies.
    Note: never know P(age=22)
Anytime Model-Free Expected Age
Goal: Compute expected age of CSE students

Unknown P(A): "Model Free". Without P(A), instead collect samples [a1, a2, … aN].

Exact running average:
    Let A = 0
    Loop for i = 1 to ∞
        a_i ← ask "what is your age?"
        A ← (i-1)/i * A + (1/i) * a_i

Exponential running average (constant step size α weights recent samples more):
    Let A = 0
    Loop for i = 1 to ∞
        a_i ← ask "what is your age?"
        A ← (1-α) * A + α * a_i
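The two anytime estimators above can be sketched side by side (the sampled "ages" below are synthetic stand-ins):

```python
import random

random.seed(0)
samples = [random.randint(18, 30) for _ in range(1000)]  # stand-in "ages"

A_exact, A_exp, alpha = 0.0, 0.0, 0.05
for i, a_i in enumerate(samples, start=1):
    A_exact = (i - 1) / i * A_exact + (1 / i) * a_i  # exact running mean
    A_exp = (1 - alpha) * A_exp + alpha * a_i        # recent samples weigh more

true_mean = sum(samples) / len(samples)
# The exact running average matches the batch mean (up to float rounding);
# the exponential average hovers near it but keeps tracking recent samples.
print(A_exact, A_exp, true_mean)
```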
Sampling Q-Values

§ Big idea: learn from every experience!
§ Follow exploration policy a ← π(s)
§ Update Q(s, a) each time we experience a transition (s, a, s', r)
§ Likely outcomes s' will contribute updates more often

§ Update towards running average:
§ Get a sample of Q(s, a):  sample = R(s, a, s') + γ max_a' Q(s', a')
§ Update to Q(s, a):  Q(s, a) ← (1-α) Q(s, a) + α · sample
§ Same update:  Q(s, a) ← Q(s, a) + α (sample - Q(s, a))
§ Rearranging:  Q(s, a) ← Q(s, a) + α · difference, where difference = (R(s, a, s') + γ max_a' Q(s', a')) - Q(s, a)
Q-Learning

§ For all s, a: initialize Q(s, a) = 0
§ Repeat forever:
    Observe current state s; choose some action a
    Execute it in the real world: (s, a, r, s')
    Do update:
        difference ← [R(s, a, s') + γ max_a' Q(s', a')] - Q(s, a)
        Q(s, a) ← Q(s, a) + α · difference
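The loop above can be sketched as a minimal tabular Q-learner. The five-state corridor environment below is a made-up stand-in, not from the slides: "east" from state 4 exits with reward +10, every other step costs -1.

```python
import random
from collections import defaultdict

random.seed(1)
GAMMA, ALPHA, EPS = 1.0, 0.5, 0.3
ACTIONS = ["east", "west"]

def step(s, a):
    """Return (next_state, reward, done) for the toy corridor."""
    if a == "east":
        if s == 4:
            return s, 10, True       # exit from the right end
        return s + 1, -1, False
    return max(s - 1, 0), -1, False

Q = defaultdict(float)               # Q[(s, a)], initialized to 0

for _ in range(500):                 # "repeat forever" (here: 500 episodes)
    s, done = 0, False
    while not done:
        # epsilon-greedy exploration policy
        if random.random() < EPS:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[(s, x)])
        s2, r, done = step(s, a)
        # difference = sample target minus current estimate
        target = r if done else r + GAMMA * max(Q[(s2, x)] for x in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s2
```

After training, the greedy policy heads east from every state, even though the behavior was partly random; that off-policy property is discussed below.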
Example. Assume: γ = 1, α = 1/2.

[Gridworld diagram: states A, B, C, D, E; all Q-values start at 0 except the exit value at D, which is 8.]

In state B. What should you do?
Suppose (for now) we follow a random exploration policy
→ "Go east"

Observed transition: B, east, C, -2
Q(B, east) ← ½ · 0 + ½ · (-2 + max_a' Q(C, a')) = ½ · 0 + ½ · (-2 + 0) = -1

Observed transition: C, east, D, -2
Q(C, east) ← ½ · 0 + ½ · (-2 + max_a' Q(D, a')) = ½ · 0 + ½ · (-2 + 8) = 3

[Resulting Q-values: Q(B, east) = -1, Q(C, east) = 3; all others unchanged.]
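The two worked updates above (γ = 1, α = 1/2) check out in code:

```python
gamma, alpha = 1.0, 0.5
Q = {("B", "east"): 0.0, ("C", "east"): 0.0, ("D", "exit"): 8.0}

def update(sa, r, max_next):
    """Q(s,a) <- (1-alpha) Q(s,a) + alpha (r + gamma * max_a' Q(s',a'))."""
    Q[sa] = (1 - alpha) * Q[sa] + alpha * (r + gamma * max_next)

# Transition B, east, C, -2: all of C's Q-values are still 0.
update(("B", "east"), -2, 0.0)
print(Q[("B", "east")])   # -1.0

# Transition C, east, D, -2: D's best (exit) value is 8.
update(("C", "east"), -2, Q[("D", "exit")])
print(Q[("C", "east")])   # 3.0
```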
Q-Learning Properties

§ Q-learning converges to the optimal Q-function (and hence learns the optimal policy)
§ even if you're acting suboptimally!
§ This is called off-policy learning

§ Caveats:
§ You have to explore enough
§ You have to eventually shrink the learning rate, α
§ … but not decrease it too quickly

§ And… if you want to act optimally
§ You have to switch from explore to exploit

[Demo: Q-learning, auto, cliff grid (L11D1)]
Video of Demo Q-Learning Auto Cliff Grid
Q-Learning

§ For all s, a: initialize Q(s, a) = 0
§ Repeat forever:
    Observe current state s; choose some action a
    Execute it in the real world: (s, a, r, s')
    Do update:
        difference ← [R(s, a, s') + γ max_a' Q(s', a')] - Q(s, a)
        Q(s, a) ← Q(s, a) + α · difference
Exploration vs. Exploitation
Questions

§ How to explore?
§ Random exploration
§ Uniform exploration
§ Epsilon-greedy
§ With (small) probability ε, act randomly
§ With (large) probability 1-ε, act on current policy
§ Exploration functions (such as UCB)
§ Thompson sampling

§ When to exploit?

§ How to even think about this tradeoff?
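Epsilon-greedy action selection can be sketched in a few lines (the Q-table below is a made-up example):

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1, rng=random):
    """With (small) probability eps act randomly; otherwise act greedily."""
    if rng.random() < eps:
        return rng.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((s, a), 0.0))   # exploit

random.seed(0)
Q = {("s0", "left"): 1.0, ("s0", "right"): 5.0}
picks = [epsilon_greedy(Q, "s0", ["left", "right"], eps=0.1) for _ in range(1000)]
print(picks.count("right") / len(picks))   # roughly 0.95
```

"right" wins about 95% of the time: the 90% greedy picks plus half of the 10% random ones.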
![Page 42: CSE 573: Artificial Intelligencecourses.cs.washington.edu/courses/cse573/17wi/slides/11-rl.pdf · §RL studied experimentally for more than 60 years in psychology §Example: foraging](https://reader035.vdocuments.us/reader035/viewer/2022081613/5fba55e9ab106f107f4a8293/html5/thumbnails/42.jpg)
ExplorationFunctions§ Whentoexplore?
§ Randomactions:exploreafixedamount§ Betteridea:exploreareaswhosebadnessisnot(yet)established,eventuallystopexploring
§ Explorationfunction§ Takesavalueestimateuandavisitcountn,andreturnsanoptimisticutility,e.g.
§ Note:thispropagatesthe“bonus”backtostatesthatleadtounknownstatesaswell!
ModifiedQ-Update:
RegularQ-Update:
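A minimal sketch of one such exploration function, f(u, n) = u + k/n (the constant k and the treatment of unvisited actions are illustrative choices, not from the slides):

```python
K = 10.0  # exploration constant (tuning choice)

def f(u, n):
    """Optimistic utility: rarely-visited actions look better than they are."""
    return u + K / n if n > 0 else float("inf")  # unvisited: maximally optimistic

# A barely-tried action can beat a well-known good one:
print(f(2.0, 1))     # 12.0 -- visited once, big bonus
print(f(5.0, 100))   # 5.1  -- visited often, bonus has decayed
```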
![Page 43: CSE 573: Artificial Intelligencecourses.cs.washington.edu/courses/cse573/17wi/slides/11-rl.pdf · §RL studied experimentally for more than 60 years in psychology §Example: foraging](https://reader035.vdocuments.us/reader035/viewer/2022081613/5fba55e9ab106f107f4a8293/html5/thumbnails/43.jpg)
Video of Demo Crawler Bot

More demos at: http://inst.eecs.berkeley.edu/~ee128/fa11/videos.html
Approximate Q-Learning
Generalizing Across States

§ Basic Q-learning keeps a table of all Q-values

§ In realistic situations, we cannot possibly learn about every single state!
§ Too many states to visit them all in training
§ Too many states to hold the Q-tables in memory

§ Instead, we want to generalize:
§ Learn about some small number of training states from experience
§ Generalize that experience to new, similar situations
§ This is a fundamental idea in machine learning, and we'll see it over and over again

[demo: RL pacman]
Example: Pacman

Let's say we discover through experience that this state is bad:
In naïve Q-learning, we know nothing about this state:

Example: Pacman

Let's say we discover through experience that this state is bad:
Or even this one!
Feature-Based Representations

Solution: describe a state using a vector of features (aka "properties")

§ Features = functions from states to R (often 0/1) capturing important properties of the state
§ Example features:
§ Distance to closest ghost or dot
§ Number of ghosts
§ 1/(dist to dot)²
§ Is Pacman in a tunnel? (0/1)
§ … etc.
§ Is it the exact state on this slide?
§ Can also describe a q-state (s, a) with features (e.g., action moves closer to food)
Linear Combination of Features

§ Using a feature representation, we can write a q function (or value function) for any state using a few weights:
    Q(s, a) = w_1 · f_1(s, a) + w_2 · f_2(s, a) + … + w_n · f_n(s, a)

§ Advantage: our experience is summed up in a few powerful numbers
§ Disadvantage: states sharing features may actually have very different values!
Approximate Q-Learning

§ Q-learning with linear Q-functions:
    Exact Q's:        Q(s, a) ← Q(s, a) + α · difference
    Approximate Q's:  For all i do:  w_i ← w_i + α · difference · f_i(s, a)
    where difference = [R(s, a, s') + γ max_a' Q(s', a')] - Q(s, a)

§ Intuitive interpretation:
§ Adjust weights of active features
§ E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state's features

§ Formal justification: in a few slides!
Q-Learning

§ For all s, a: initialize Q(s, a) = 0
§ Repeat forever:
    Observe current state s; choose some action a
    Execute it in the real world: (s, a, r, s')
    Do update:
        difference ← [R(s, a, s') + γ max_a' Q(s', a')] - Q(s, a)
        Q(s, a) ← Q(s, a) + α · difference
Approximate Q-Learning

§ For all i: initialize w_i = 0
§ Repeat forever:
    Observe current state s; choose some action a
    Execute it in the real world: (s, a, r, s')
    Do update:
        difference ← [R(s, a, s') + γ max_a' Q(s', a')] - Q(s, a)
        For all i:  w_i ← w_i + α · difference · f_i(s, a)
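The approximate update can be sketched with a linear Q-function. The single "position" feature and the corridor dynamics below are made up for illustration; only the update rule comes from the slides:

```python
GAMMA, ALPHA = 1.0, 0.1

def features(s, a):
    """f(s, a): feature vector for a q-state (hypothetical example features)."""
    s2 = s + 1 if a == "east" else s - 1
    return [1.0, float(s2)]          # bias feature and resulting position

w = [0.0, 0.0]                       # for all i: initialize w_i = 0

def Q(s, a):
    """Q(s, a) = sum_i w_i * f_i(s, a)."""
    return sum(wi * fi for wi, fi in zip(w, features(s, a)))

def update(s, a, r, s2, actions, done):
    target = r if done else r + GAMMA * max(Q(s2, a2) for a2 in actions)
    difference = target - Q(s, a)
    for i, fi in enumerate(features(s, a)):
        w[i] += ALPHA * difference * fi   # adjust weights of active features

# One terminal transition: (s=0, east, r=1).  difference = 1 - 0 = 1,
# so each weight moves by ALPHA * 1 * f_i = 0.1 * f_i.
update(0, "east", 1.0, 1, ["east", "west"], True)
print(w)   # [0.1, 0.1]
```

Note how one experience moves every weight whose feature was active, so all states sharing those features are affected at once.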