Markov Decision Processes (continued): CS4641 Fall 2018, Lecture 19 (bboots3/CS4641-Fall2018/Lecture19/19_MDPs2.pdf)

Page 1:

Markov Decision Processes (continued)

Robot Image Credit: Viktoriya Sukhanova © 123RF.com

Based on slides by Dan Klein

Page 2:

Example: GridWorld

§ A maze-like problem
§ The agent lives in a grid
§ Walls block the agent's path

§ Noisy movement: actions do not always go as planned (see the sketch after this list)
§ 80% of the time, the action North takes the agent North
§ 10% of the time, North takes the agent West; 10% East
§ If there is a wall in the direction the agent would have been taken, the agent stays put

§ The agent receives rewards each time step
§ Small "living" reward each step (can be negative)
§ Big rewards come at the end (good or bad)

§ Goal: maximize sum of (discounted) rewards
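A small sampling sketch of that noise model follows; the grid encoding, wall set, and function name are illustrative assumptions, not something defined in the lecture.

import random

# Grid directions and, for each intended direction, the perpendicular
# directions the agent slips to 10% of the time each.
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
LEFT_OF = {"N": "W", "W": "S", "S": "E", "E": "N"}
RIGHT_OF = {v: k for k, v in LEFT_OF.items()}

def noisy_step(state, action, walls, rng=random):
    # With probability 0.8 move as intended, otherwise slip sideways (0.1 each).
    roll = rng.random()
    if roll < 0.8:
        direction = action
    elif roll < 0.9:
        direction = LEFT_OF[action]
    else:
        direction = RIGHT_OF[action]
    dx, dy = MOVES[direction]
    nxt = (state[0] + dx, state[1] + dy)
    return state if nxt in walls else nxt  # a blocked move leaves the agent put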

Page 3:

Recap: MDPs

§ Markov decision processes:
§ States S
§ Actions A
§ Transitions P(s'|s,a) (or T(s,a,s'))
§ Rewards R(s,a,s') (and discount γ)
§ Start state s0

§ Quantities:
§ Policy = map of states to actions
§ Utility = sum of discounted rewards
§ Values = expected future utility from a state (max node)
§ Q-Values = expected future utility from a q-state (chance node)

[Diagram: expectimax-style tree with state node s, action a, q-state (s,a), and outcome (s,a,s') leading to s']

Page 4:

Optimal Quantities

§ The value of a state s: V*(s) = expected utility starting in s and acting optimally

§ The value of a q-state (s,a): Q*(s,a) = expected utility starting out having taken action a from state s and (thereafter) acting optimally

§ The optimal policy: π*(s) = optimal action from state s

[Diagram: tree where s is a state, (s, a) is a q-state, and (s, a, s') is a transition]
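In equation form (the standard definitions matching the prose above, since the slide states them only in words):

V^*(s) = \max_\pi \mathbb{E}\left[ \sum_{t \ge 0} \gamma^t R(s_t, a_t, s_{t+1}) \;\middle|\; s_0 = s, \pi \right]
\qquad
Q^*(s,a) = \mathbb{E}\left[ \sum_{t \ge 0} \gamma^t R(s_t, a_t, s_{t+1}) \;\middle|\; s_0 = s, a_0 = a, \text{ then act optimally} \right]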

[Demo: gridworld values (L9D1)]

Page 5:

Gridworld Values V*

Page 6:

Gridworld: Q*

Page 7:

The Bellman Equations

How to be optimal:

Step 1: Take correct first action

Step 2: Keep being optimal

Page 8:

The Bellman Equations

§ Definition of "optimal utility" via expectimax recurrence gives a simple one-step lookahead relationship amongst optimal utility values

§ These are the Bellman equations, and they characterize optimal values in a way we'll use over and over

[Diagram: expectimax tree with state s, action a, q-state (s,a), and outcome (s,a,s') leading to s']

Page 9:

Value Iteration

§ Bellman equations characterize the optimal values:

§ Value iteration computes them:

§ Value iteration is just a fixed-point solution method

[Diagram: backup tree from V(s) through action a and q-state (s,a) to successor values V(s')]

Page 10:

Convergence*

§ How do we know the V_k vectors are going to converge?

§ Case 1: If the tree has maximum depth M, then V_M holds the actual untruncated values

§ Case 2: If the discount is less than 1
§ Sketch: For any state, V_k and V_{k+1} can be viewed as depth-(k+1) results in nearly identical search trees
§ The difference is that on the bottom layer, V_{k+1} has actual rewards while V_k has zeros
§ That last layer is at best all R_MAX
§ It is at worst R_MIN
§ But everything is discounted by γ^k that far out
§ So V_k and V_{k+1} are at most γ^k max|R| different
§ So as k increases, the values converge
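The sketch above as a bound: the two value functions differ only in a bottom layer that is discounted by \gamma^k, so

\| V_{k+1} - V_k \|_\infty \le \gamma^k \max_{s,a,s'} |R(s,a,s')|

which goes to zero as k grows whenever \gamma < 1.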

Page 11:

Policy Evaluation

Page 12:

Fixed Policies

§ Value iteration computes search trees that max over all actions to compute optimal values

§ If we fixed some policy π(s), then the tree would be simpler – only one action per state
§ ... although the tree's value would depend on which policy we fixed

[Diagrams: two trees side by side, one branching over all actions a from s ("Do the optimal action"), one following the single action π(s) from s ("Do what π says to do")]

Page 13:

Utilities for a Fixed Policy

§ Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy

§ Define the utility of a state s, under a fixed policy π: Vπ(s) = expected total discounted rewards starting in s and following π

§ Recursive relation (one-step look-ahead / Bellman equation):
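In the notation of the earlier slides, that recursive relation is the fixed-policy Bellman equation:

V^\pi(s) = \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V^\pi(s') \right]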

[Diagram: one-action tree from s through π(s) and q-state (s, π(s)) to successor s']

Page 14:

Example: Policy Evaluation (Always Go Right vs. Always Go Forward)

Page 15:

Example: Policy Evaluation (Always Go Right vs. Always Go Forward)

Page 16:

Policy Evaluation

§ How do we calculate the V's for a fixed policy π?

§ Idea 1: Turn recursive Bellman equations into updates (like value iteration)

§ Efficiency: O(S²) per iteration

§ Idea 2: Without the maxes, the Bellman equations are just a linear system
§ Solve with Matlab (or your favorite linear system solver); a numpy sketch follows below

[Diagram: one-action tree from s through π(s) to successor s']

Page 17:

Policy Extraction

Page 18:

Computing Actions from Values

§ Let's imagine we have the optimal values V*(s)

§ How should we act?
§ It's not obvious!

§ We need to solve one step of the Bellman look-ahead (written out below)

§ This is called policy extraction, since it gets the policy implied by the values
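That one step of look-ahead, in the usual notation:

\pi^*(s) = \arg\max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^*(s') \right]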

Page 19:

Computing Actions from Q-Values

§ Let's imagine we have the optimal q-values Q*(s,a)

§ How should we act?
§ Completely trivial to decide!

§ Important lesson: actions are easier to select from q-values than from values!
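Written out, the trivial rule is simply:

\pi^*(s) = \arg\max_a Q^*(s,a)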

Page 20:

Policy Iteration

Page 21:

Problems with Value Iteration

§ Value iteration repeats the Bellman updates:

§ Problem 1: It's slow – O(S²A) per iteration

§ Problem 2: The "max" at each state rarely changes

§ Problem 3: The policy often converges long before the values

[Diagram: the same expectimax backup tree as before]

Pages 22-35:

k = 0, 1, 2, ..., 12, and k = 100

[Figures: value iteration snapshots on the Gridworld for increasing k, with Noise = 0.2, Discount = 0.9, Living reward = 0]

Page 36:

Policy Iteration

§ Alternative approach for optimal values:
§ Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence

§ Step 2: Policy improvement: update policy using one-step look-ahead with resulting converged (but not optimal!) utilities as future values

§ Repeat steps until policy converges

§ This is policy iteration
§ It's still optimal!
§ Can converge (much) faster under some conditions (a compact code sketch of the loop follows below)
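A compact sketch of the two-step loop, reusing the same illustrative dense-array conventions as the earlier sketches (T[s, a, s'], R[s, a, s']); this is one reasonable rendering, not the lecture's reference code:

import numpy as np

def policy_iteration(T, R, gamma=0.9):
    n_states, n_actions, _ = T.shape
    policy = np.zeros(n_states, dtype=int)        # arbitrary initial policy
    idx = np.arange(n_states)
    while True:
        # Step 1: policy evaluation (solved exactly here; iterating would also work)
        P_pi = T[idx, policy]
        r_pi = np.sum(P_pi * R[idx, policy], axis=1)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Step 2: policy improvement by one-step look-ahead on the evaluated values
        Q = np.einsum("sap,sap->sa", T, R + gamma * V[None, None, :])
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):     # policy stable, hence done
            return policy, V
        policy = new_policy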

Page 37:

Policy Iteration

§ Evaluation: For fixed current policy π, find values with policy evaluation:
§ Iterate until values converge:

§ Improvement: For fixed values, get a better policy using policy extraction
§ One-step look-ahead:
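The two equations referred to above, reconstructed in the standard form consistent with the bullets (the original slide images are not preserved in this transcript):

Evaluation (iterate until the values converge):

V^{\pi_i}_{k+1}(s) = \sum_{s'} T(s, \pi_i(s), s') \left[ R(s, \pi_i(s), s') + \gamma V^{\pi_i}_k(s') \right]

Improvement (one-step look-ahead on the converged values):

\pi_{i+1}(s) = \arg\max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^{\pi_i}(s') \right]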

Page 38:

Comparison

§ Both value iteration and policy iteration compute the same thing (all optimal values)

§ In value iteration:
§ Every iteration updates both the values and (implicitly) the policy
§ We don't track the policy, but taking the max over actions implicitly recomputes it

§ In policy iteration:
§ We do several passes that update utilities with a fixed policy (each pass is fast because we consider only one action, not all of them)
§ After the policy is evaluated, a new policy is selected (slow like a value iteration pass)
§ The new policy will be better (or we're done)

§ Both are dynamic programs for solving MDPs

Page 39:

Summary: MDP Algorithms

§ So you want to...
§ Compute optimal values: use value iteration or policy iteration
§ Compute values for a particular policy: use policy evaluation
§ Turn your values into a policy: use policy extraction (one-step lookahead)

§ These all look the same!
§ They basically are – they are all variations of Bellman updates
§ They all use one-step lookahead
§ They differ only in whether we plug in a fixed policy or max over actions

Page 40:

Double Bandits

Page 41:

Double-Bandit MDP

§ Actions: Blue, Red
§ States: Win, Lose

[Diagram: two-state MDP over W and L. The Blue action pays $1 with probability 1.0; the Red action pays $2 with probability 0.75 and $0 with probability 0.25.]

No discount
100 time steps
Both states have the same value

Page 42:

Offline Planning

§ Solving MDPs is offline planning
§ You determine all quantities through computation
§ You need to know the details of the MDP
§ You do not actually play the game!

No discount, 100 time steps, both states have the same value.
Value of Play Red: 150
Value of Play Blue: 100

[Diagram: the same double-bandit MDP as on the previous page]
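Where those two numbers come from (no discount, 100 time steps):

V_\text{Red} = 100 \times (0.75 \cdot \$2 + 0.25 \cdot \$0) = \$150
\qquad
V_\text{Blue} = 100 \times (1.0 \cdot \$1) = \$100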

Page 43:

Online Planning

§ Rules changed! Red's win chance is different.

[Diagram: the same double-bandit MDP, but Red's probabilities are now unknown (shown as ??)]

Page 44:

Let's Play!

$0  $0  $0  $2  $0  $2  $0  $0  $0  $0

Page 45:

What Just Happened?

§ That wasn't planning, it was learning!
§ Specifically, reinforcement learning
§ There was an MDP, but you couldn't solve it with just computation
§ You needed to actually act to figure it out

§ Important ideas in reinforcement learning that came up
§ Exploration: you have to try unknown actions to get information
§ Exploitation: eventually, you have to use what you know
§ Regret: even if you learn intelligently, you make mistakes
§ Sampling: because of chance, you have to try things repeatedly
§ Difficulty: learning can be much harder than solving a known MDP

Page 46:

Next Time: Reinforcement Learning!