
Markov Decision Processes (continued)

Robot Image Credit: Viktoriya Sukhanova © 123RF.com

Based on slides by Dan Klein

Example: Grid World

§ A maze-like problem
  § The agent lives in a grid
  § Walls block the agent's path

§ Noisy movement: actions do not always go as planned (a code sketch follows this list)
  § 80% of the time, the action North takes the agent North
  § 10% of the time, North takes the agent West; 10% East
  § If there is a wall in the direction the agent would have been taken, the agent stays put

§ The agent receives rewards each time step
  § Small "living" reward each step (can be negative)
  § Big rewards come at the end (good or bad)

§ Goal: maximize sum of (discounted) rewards
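
As a concrete illustration of the noisy movement rule, here is a minimal Python sketch of the transition model; the coordinate encoding, the wall set, and the function name are assumptions made for this example, not part of the original slides.

```python
# Hypothetical transition model for the noisy grid world described above:
# the intended action succeeds 80% of the time; 10% of the time the agent
# slips to each perpendicular direction; moves into walls leave it in place.

MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
PERPENDICULAR = {"N": ("W", "E"), "S": ("E", "W"), "E": ("N", "S"), "W": ("S", "N")}

def transition_distribution(state, action, walls):
    """Return {next_state: probability} for one noisy move on the grid."""
    def attempt(direction):
        dx, dy = MOVES[direction]
        nxt = (state[0] + dx, state[1] + dy)
        return state if nxt in walls else nxt  # blocked moves stay put

    left, right = PERPENDICULAR[action]
    dist = {}
    for direction, prob in [(action, 0.8), (left, 0.1), (right, 0.1)]:
        s_next = attempt(direction)
        dist[s_next] = dist.get(s_next, 0.0) + prob
    return dist

# Example: from (1, 1) with a wall directly north, "N" keeps the agent put
# with probability 0.8 and sends it west/east with probability 0.1 each.
print(transition_distribution((1, 1), "N", walls={(1, 2)}))
```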

Recap: MDPs

§ Markov decision processes:
  § States S
  § Actions A
  § Transitions P(s'|s,a) (or T(s,a,s'))
  § Rewards R(s,a,s') (and discount γ)
  § Start state s0

§ Quantities:
  § Policy = map of states to actions
  § Utility = sum of discounted rewards (formula below)
  § Values = expected future utility from a state (max node)
  § Q-Values = expected future utility from a q-state (chance node)

[Expectimax tree diagram: state s, action a, q-state (s,a), outcome (s,a,s') leading to s']
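
The "sum of discounted rewards" in the Quantities bullet, for a reward sequence r0, r1, r2, ... and discount γ, is:

$$U([r_0, r_1, r_2, \ldots]) = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots = \sum_{t \ge 0} \gamma^t r_t$$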

Optimal Quantities

§ The value of a state s: V*(s) = expected utility starting in s and acting optimally

§ The value of a q-state (s,a): Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally

§ The optimal policy: π*(s) = optimal action from state s

[Diagram: s is a state, (s, a) is a q-state, (s,a,s') is a transition]

[Demo: gridworld values (L9D1)]

Gridworld Values V*

Gridworld: Q*

The Bellman Equations

How to be optimal:

  Step 1: Take correct first action

  Step 2: Keep being optimal

The Bellman Equations

§ Definition of "optimal utility" via expectimax recurrence gives a simple one-step lookahead relationship amongst optimal utility values

§ These are the Bellman equations, and they characterize optimal values in a way we'll use over and over (written out below)

[Expectimax tree diagram: state s, action a, q-state (s,a), outcome (s,a,s') leading to s']
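
Written out in the notation of the recap slide (transitions T(s,a,s'), rewards R(s,a,s'), discount γ), the Bellman equations are:

$$V^*(s) = \max_a Q^*(s,a)$$

$$Q^*(s,a) = \sum_{s'} T(s,a,s')\,\big[\, R(s,a,s') + \gamma\, V^*(s') \,\big]$$

or, combining the two,

$$V^*(s) = \max_a \sum_{s'} T(s,a,s')\,\big[\, R(s,a,s') + \gamma\, V^*(s') \,\big]$$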

Value Iteration

§ Bellman equations characterize the optimal values:

§ Value iteration computes them (the update is written out below):

§ Value iteration is just a fixed point solution method

[Diagram: one-step lookahead from V(s) through action a and outcome (s,a,s') to V(s')]
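
The update that value iteration applies, starting from all-zero values, is the Bellman equation turned into an iterative assignment:

$$V_0(s) = 0$$

$$V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s,a,s')\,\big[\, R(s,a,s') + \gamma\, V_k(s') \,\big]$$

Repeating this update drives Vk toward the fixed point V* (convergence is discussed on the next slide).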

Convergence*

§ How do we know the Vk vectors are going to converge?

§ Case 1: If the tree has maximum depth M, then VM holds the actual untruncated values

§ Case 2: If the discount is less than 1
  § Sketch: For any state, Vk and Vk+1 can be viewed as depth k+1 expectimax results in nearly identical search trees
  § The difference is that on the bottom layer, Vk+1 has actual rewards while Vk has zeros
  § That last layer is at best all RMAX
  § It is at worst RMIN
  § But everything is discounted by γ^k that far out
  § So Vk and Vk+1 are at most γ^k max|R| different
  § So as k increases, the values converge (a code sketch of this stopping rule follows)
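
A minimal Python sketch of value iteration with the convergence test suggested above (stop once successive value vectors differ by less than a tolerance); the dictionary-based MDP encoding, the function name, and the toy example are assumptions for this sketch, not from the slides.

```python
# Value iteration on a generic finite MDP.
# T[s][a] is a list of (next_state, probability, reward) triples -- an
# assumed encoding chosen for this example.

def value_iteration(states, actions, T, gamma=0.9, tol=1e-6):
    V = {s: 0.0 for s in states}            # V_0 = 0 everywhere
    while True:
        V_new = {}
        for s in states:
            # Bellman update: max over actions of expected reward + discounted value
            V_new[s] = max(
                sum(p * (r + gamma * V[s2]) for s2, p, r in T[s][a])
                for a in actions[s]
            )
        # Stop once the largest change across states is below the tolerance
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new

# Tiny two-state example (made up for illustration): "stay" earns a reward,
# "switch" moves to the other state for free.
states = ["A", "B"]
actions = {"A": ["stay", "switch"], "B": ["stay", "switch"]}
T = {
    "A": {"stay": [("A", 1.0, 1.0)], "switch": [("B", 1.0, 0.0)]},
    "B": {"stay": [("B", 1.0, 2.0)], "switch": [("A", 1.0, 0.0)]},
}
print(value_iteration(states, actions, T))
```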

Policy Evaluation

Fixed Policies

§ Value iteration computes search trees that max over all actions to compute optimal values

§ If we fixed some policy π(s), then the tree would be simpler – only one action per state
  § ... although the tree's value would depend on which policy we fixed

[Diagram: two lookahead trees side by side – "Do the optimal action" (root s, all actions a, q-states (s,a), outcomes (s,a,s')) vs. "Do what π says to do" (root s, single action π(s), q-state (s,π(s)), outcomes (s,π(s),s'))]

Utilities for a Fixed Policy

§ Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy

§ Define the utility of a state s, under a fixed policy π: Vπ(s) = expected total discounted rewards starting in s and following π

§ Recursive relation (one-step look-ahead / Bellman equation), given below:

[Diagram: lookahead from s through π(s) and outcome (s,π(s),s') to s']
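
In the notation used throughout these slides, the recursive relation is:

$$V^{\pi}(s) = \sum_{s'} T\big(s, \pi(s), s'\big)\,\big[\, R\big(s, \pi(s), s'\big) + \gamma\, V^{\pi}(s') \,\big]$$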

Example: Policy Evaluation (Always Go Right vs. Always Go Forward)

Policy Evaluation

§ How do we calculate the V's for a fixed policy π?

§ Idea 1: Turn recursive Bellman equations into updates (like value iteration)
  § Efficiency: O(S²) per iteration

§ Idea 2: Without the maxes, the Bellman equations are just a linear system
  § Solve with Matlab (or your favorite linear system solver); a NumPy sketch follows

[Diagram: lookahead from s through π(s) and outcome (s,π(s),s') to s']
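
A small NumPy sketch of Idea 2: for a fixed policy the Bellman equations read Vπ = Rπ + γ Tπ Vπ, which is linear, so Vπ = (I − γ Tπ)⁻¹ Rπ. The matrix encoding and the example numbers below are assumptions for illustration.

```python
import numpy as np

# Exact policy evaluation by solving the linear system (I - gamma * T_pi) V = R_pi.
# T_pi[i, j] = probability of moving from state i to state j under the fixed policy;
# R_pi[i]    = expected immediate reward from state i under that policy.

def evaluate_policy(T_pi, R_pi, gamma=0.9):
    n = T_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)

# Two-state example (made up): under this policy, state 0 earns 1 and stays,
# state 1 earns 2 and stays, so V = [1/(1-gamma), 2/(1-gamma)] = [10, 20].
T_pi = np.array([[1.0, 0.0],
                 [0.0, 1.0]])
R_pi = np.array([1.0, 2.0])
print(evaluate_policy(T_pi, R_pi))  # -> [10. 20.]
```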

Policy Extraction

Computing Actions from Values

§ Let's imagine we have the optimal values V*(s)

§ How should we act?
  § It's not obvious!

§ We need to solve one step of lookahead (shown below)

§ This is called policy extraction, since it gets the policy implied by the values
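
The one-step computation is the familiar expectimax backup with an argmax over actions:

$$\pi^*(s) = \arg\max_a \sum_{s'} T(s,a,s')\,\big[\, R(s,a,s') + \gamma\, V^*(s') \,\big]$$

Note that this still needs the transition model and rewards in addition to V*.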

Computing Actions from Q-Values

§ Let's imagine we have the optimal q-values Q*(s,a) (see below)

§ How should we act?
  § Completely trivial to decide!

§ Important lesson: actions are easier to select from q-values than values!
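
With q-values in hand, the rule is simply:

$$\pi^*(s) = \arg\max_a Q^*(s,a)$$

No transition model or expectation over s' is needed, which is exactly why actions are easier to select from q-values.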

Policy Iteration

Problems with Value Iteration

§ Value iteration repeats the Bellman updates:

§ Problem 1: It's slow – O(S²A) per iteration

§ Problem 2: The "max" at each state rarely changes

§ Problem 3: The policy often converges long before the values

[Diagram: one-step lookahead from s through a and outcome (s,a,s') to s']

[Figures: value iteration on gridworld for k = 0, 1, 2, ..., 12, and k = 100; noise = 0.2, discount = 0.9, living reward = 0]

Policy Iteration

§ Alternative approach for optimal values:
  § Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence
  § Step 2: Policy improvement: update policy using one-step look-ahead with resulting converged (but not optimal!) utilities as future values
  § Repeat steps until policy converges

§ This is policy iteration
  § It's still optimal!
  § Can converge (much) faster under some conditions

Policy Iteration

§ Evaluation: For fixed current policy π, find values with policy evaluation:
  § Iterate until values converge:

§ Improvement: For fixed values, get a better policy using policy extraction
  § One-step look-ahead:
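
Spelled out in the notation used above, the two alternating steps for the current policy π_i are:

Evaluation (iterate until the values converge):

$$V^{\pi_i}_{k+1}(s) \leftarrow \sum_{s'} T\big(s, \pi_i(s), s'\big)\,\big[\, R\big(s, \pi_i(s), s'\big) + \gamma\, V^{\pi_i}_{k}(s') \,\big]$$

Improvement (one-step look-ahead):

$$\pi_{i+1}(s) = \arg\max_a \sum_{s'} T(s,a,s')\,\big[\, R(s,a,s') + \gamma\, V^{\pi_i}(s') \,\big]$$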

Comparison

§ Both value iteration and policy iteration compute the same thing (all optimal values)

§ In value iteration:
  § Every iteration updates both the values and (implicitly) the policy
  § We don't track the policy, but taking the max over actions implicitly recomputes it

§ In policy iteration:
  § We do several passes that update utilities with a fixed policy (each pass is fast because we consider only one action, not all of them)
  § After the policy is evaluated, a new policy is selected (slow like a value iteration pass)
  § The new policy will be better (or we're done)

§ Both are dynamic programs for solving MDPs

Summary: MDP Algorithms

§ So you want to...
  § Compute optimal values: use value iteration or policy iteration
  § Compute values for a particular policy: use policy evaluation
  § Turn your values into a policy: use policy extraction (one-step lookahead)

§ These all look the same!
  § They basically are – they are all variations of Bellman updates
  § They all use one-step lookahead
  § They differ only in whether we plug in a fixed policy or max over actions

Double Bandits

Double-Bandit MDP

§ Actions: Blue, Red
§ States: Win, Lose

[Diagram: from either state, the Blue action pays $1 with probability 1.0; the Red action pays $2 with probability 0.75 and $0 with probability 0.25]

No discount
100 time steps
Both states have the same value

Offline Planning

§ Solving MDPs is offline planning
  § You determine all quantities through computation
  § You need to know the details of the MDP
  § You do not actually play the game!

[Chart: with no discount over 100 time steps, the value of Play Red is 150 and the value of Play Blue is 100; both states have the same value. The double-bandit diagram is repeated: Blue pays $1 with probability 1.0, Red pays $2 with probability 0.75 and $0 with probability 0.25]
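
The chart's numbers follow from the per-step expected payoffs of each arm:

$$\mathbb{E}[\text{Red}] = 0.75 \times \$2 + 0.25 \times \$0 = \$1.50, \qquad \mathbb{E}[\text{Blue}] = 1.0 \times \$1 = \$1.00$$

With no discount over 100 time steps, always playing Red is therefore worth 100 × 1.50 = 150 and always playing Blue is worth 100 × 1.00 = 100, matching the chart.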

Online Planning

§ Rules changed! Red's win chance is different.

[Diagram: same double-bandit, but Red's payoff probabilities are now unknown ("??"); Blue still pays $1 with probability 1.0]

Let's Play!

$0 $0 $0 $2 $0 $2 $0 $0 $0 $0

What Just Happened?

§ That wasn't planning, it was learning!
  § Specifically, reinforcement learning
  § There was an MDP, but you couldn't solve it with just computation
  § You needed to actually act to figure it out

§ Important ideas in reinforcement learning that came up
  § Exploration: you have to try unknown actions to get information
  § Exploitation: eventually, you have to use what you know
  § Regret: even if you learn intelligently, you make mistakes
  § Sampling: because of chance, you have to try things repeatedly
  § Difficulty: learning can be much harder than solving a known MDP

Next Time: Reinforcement Learning!
