Planning & Reinforcement Learning
Slides borrowed from Sheila McIlraith, Kate Larson, and David Silver
CSC384, University of Toronto



Why Planning
• E.g. if we have a robot, we want the robot to decide what to do: how to act to achieve our goals.


Planning vs. Search

• How to change the world to suit our needs.
• Critical issue: we need to reason about what the world will be like after doing a few actions.

• This aspect of planning is just like search.

GOAL: Steven has coffee
CURRENTLY: robot in mailroom, has no coffee, coffee not made, Steven in office, etc.
TO DO: go to lounge, make coffee, …

Autonomous Agents for Space Exploration
• Autonomous planning, scheduling, control
  • NASA: JPL and Ames
• Remote Agent Experiment (RAX)
  • Deep Space 1
• Mars Exploration Rover (MER)

Other Applications (cont.)
• Scheduling with action choices & resource requirements
  • Problems in supply chain management
  • HSTS (Hubble Space Telescope scheduler)
  • Workflow management
• Air traffic control
  • Route aircraft between runways and terminals. Craft must be kept safely separated. Safe distance depends on craft and mode of transport. Minimize taxi and wait time.
• Character animation
  • Generate step-by-step character behaviour from a high-level spec
• Plan-based interfaces
  • E.g. NLP-to-database interfaces
  • Plan recognition, activity recognition

Applications

These applications require more than search. It is not sufficient to simply find a sequence of actions for transforming the world so as to achieve a goal state.

• These applications involve dealing with uncertainty.
• Sensing the world, and planning to sense the world, so as to reduce uncertainty.
• Generating a plan that has high payoff, or high expected payoff, rather than simply achieving a fixed goal.
• Running into problems when executing a plan and having to recover.
• Etc.

Planning

• Agent: single agent or multi-agent
• State: complete or incomplete (logical/probabilistic); state of the world and/or the agent's state of knowledge
• Actions: world-altering and/or knowledge-altering (e.g. sensing); deterministic or non-deterministic (logical/stochastic)
• Goal condition: satisfying or optimizing; final-state or temporally extended; optimizing for preference/cost/utility
• Reasoning: offline or online (fully observable, partially observable)
• Plans: partial-order, sequential, conditional

Simplifying the Planning Problem

• We simplify the planning problem as follows:
  • Assume complete information about the initial state, via the closed world assumption (CWA)
  • Assume a finite domain of objects
  • Assume action effects are restricted to making conjunctions of atomic formulae true or false; no conditional effects, etc.
  • Assume action preconditions are restricted to conjunctions of ground atoms
• Perform classical planning: no incomplete or uncertain knowledge

Classical Planning Assumptions

• Finite system: finitely many states, actions, and events
• Fully observable: the controller always knows the current state
• Deterministic: each action has only one outcome
• Static: changes occur only as the result of controller actions
• Attainment goals: a set of goal states Sg
• Sequential plans: a plan is a linearly ordered sequence of actions (a1, …, an)
• Implicit time: actions are instantaneous (have no duration)
• Off-line planning: the planner doesn't know the execution status

STRIPS Representation

• STRIPS (Stanford Research Institute Problem Solver)
• A way of representing actions with respect to a CW-KB: a closed world knowledge base representing the state of the world

Sequence of Worlds

STRIPS Actions

• STRIPS represents actions using three lists:
  • A list of action preconditions
  • A list of action add effects
  • A list of action delete effects
• These lists contain variables, so that we can represent a whole class of actions with one specification
• Each ground instantiation of the variables yields a specific action

STRIPS Actions: Example

pickup(X):

Pre: {handempty, clear(X), ontable(X)}
Adds: {holding(X)}
Dels: {handempty, clear(X), ontable(X)}

[Blocks world figure: robot hand and blocks A, B, C]
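As a concrete illustration of the three-list representation, the pickup(X) operator above can be written as a small data structure whose ground instances are actions. This is a minimal sketch of my own (not the course's code); ground facts are represented here as plain strings, which is an assumption made for illustration.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class StripsOperator:
        """A ground STRIPS action: precondition, add, and delete lists (as sets of facts)."""
        name: str
        pre: frozenset
        adds: frozenset
        dels: frozenset

    def pickup(x):
        """Ground instance of the pickup(X) operator for block x."""
        return StripsOperator(
            name=f"pickup({x})",
            pre=frozenset({"handempty", f"clear({x})", f"ontable({x})"}),
            adds=frozenset({f"holding({x})"}),
            dels=frozenset({"handempty", f"clear({x})", f"ontable({x})"}),
        )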

STRIPS Actions: Example

pickup(X) is called a STRIPS operator.

pickup(a) (a particular ground instance) is called an action.

[Blocks world figure]

STRIPS Actions: Example

putdown(X)

Pre: {holding(X)}
Adds: {clear(X), ontable(X), handempty}
Dels: {holding(X)}

[Blocks world figure]

STRIPS Actions: Example

stack(X,Y)

Pre: {holding(X), clear(Y)}
Adds: {on(X,Y), handempty, clear(X)}
Dels: {holding(X), clear(Y)}

[Blocks world figure]

STRIPS has no Conditional Effects

• Blocks World assumption: the table has infinite space, so it is always clear
• If we stack something on the table (Y = table), we cannot delete clear(table)
• But if Y is an ordinary block, we must delete clear(Y)

stack(X,Y)

Pre: {holding(X), clear(Y)}
Adds: {on(X,Y), handempty, clear(X)}
Dels: {holding(X), clear(Y)}

STRIPS has no Conditional Effects
• Since STRIPS has no conditional effects, we must sometimes utilize extra actions: one for each type of condition.
• We embed the condition in the precondition and then alter the effects accordingly.

stack(X,Y)

Pre: {holding(X), clear(Y)}
Adds: {on(X,Y), handempty, clear(X)}
Dels: {holding(X), clear(Y)}

putdown(X)

Pre: {holding(X)}
Adds: {ontable(X), handempty, clear(X)}
Dels: {holding(X)}

STRIPS Actions: Example

unstack(X,Y)

Pre: { }
Adds: { }
Dels: { }

[Blocks world figure]

STRIPS Actions: Example

unstack(X,Y)

Pre: {clear(X), on(X,Y), handempty}
Adds: {holding(X), clear(Y)}
Dels: {clear(X), on(X,Y), handempty}

[Blocks world figure]

Planning as a Search Problem

• Given
  • A CW-KB representing the initial state
  • A set of STRIPS operators that map a state to a new state
  • A goal condition (a conjunction of facts, or a formula)

• The planning problem is to determine a sequence of actions that, when applied to the initial CW-KB, yields an updated CW-KB which satisfies the goal.

• This is the classical planning task.

Planning As Search

• This is a search problem, in which our state space representation is a CW-KB
  • The initial CW-KB is the initial state
  • Actions are operators mapping a state to a new state
  • The goal is satisfied by any state that satisfies the goal condition. Typically the goal is a conjunction of primitive facts, so we just need to check whether all the facts in the goal are contained in the CW-KB
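To make this formulation concrete, here is a rough sketch (my own illustration, not the course's code) of forward state-space search over CW-KBs represented as sets of ground facts, using the operator encoding sketched earlier. Breadth-first search is used only for clarity; real planners search this same space with heuristics.

    from collections import deque

    def applicable(state, op):
        # An operator is applicable when all its preconditions are in the CW-KB.
        return op.pre <= state

    def progress(state, op):
        # Apply the operator: remove the delete-list facts, then add the add-list facts.
        return (state - op.dels) | op.adds

    def bfs_plan(init, goal, operators):
        """Breadth-first forward search. `goal` is a set of facts that must all hold;
        `operators` is a list of ground StripsOperator instances."""
        start = frozenset(init)
        frontier = deque([(start, [])])
        seen = {start}
        while frontier:
            state, plan = frontier.popleft()
            if goal <= state:                      # goal check: goal facts contained in the CW-KB
                return plan
            for op in operators:
                if applicable(state, op):
                    nxt = progress(state, op)
                    if nxt not in seen:
                        seen.add(nxt)
                        frontier.append((nxt, plan + [op.name]))
        return None                                # no plan exists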

Example

[Figure: blocks-world search — from an initial configuration to a goal configuration]

Example

[Figure: one search state expanded via move(b,c), move(c,b), move(c,table), and move(a,b), yielding four successor configurations]
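Continuing the sketches above (and using the pickup/putdown/stack/unstack encoding from the earlier slides rather than the move(X,Y) operators shown in this figure), a hypothetical three-block instance can be solved like this. The instance itself is made up for illustration.

    def putdown(x):
        return StripsOperator(f"putdown({x})",
                              pre=frozenset({f"holding({x})"}),
                              adds=frozenset({f"clear({x})", f"ontable({x})", "handempty"}),
                              dels=frozenset({f"holding({x})"}))

    def stack(x, y):
        return StripsOperator(f"stack({x},{y})",
                              pre=frozenset({f"holding({x})", f"clear({y})"}),
                              adds=frozenset({f"on({x},{y})", "handempty", f"clear({x})"}),
                              dels=frozenset({f"holding({x})", f"clear({y})"}))

    def unstack(x, y):
        return StripsOperator(f"unstack({x},{y})",
                              pre=frozenset({f"clear({x})", f"on({x},{y})", "handempty"}),
                              adds=frozenset({f"holding({x})", f"clear({y})"}),
                              dels=frozenset({f"clear({x})", f"on({x},{y})", "handempty"}))

    blocks = ["a", "b", "c"]
    ops = ([pickup(x) for x in blocks] + [putdown(x) for x in blocks] +
           [stack(x, y) for x in blocks for y in blocks if x != y] +
           [unstack(x, y) for x in blocks for y in blocks if x != y])

    # Initial CW-KB: c is on a, a and b are on the table, the hand is empty.
    init = {"on(c,a)", "ontable(a)", "ontable(b)", "clear(c)", "clear(b)", "handempty"}
    # Goal: b stacked on a, with c on the table.
    goal = {"on(b,a)", "ontable(c)"}

    print(bfs_plan(init, goal, ops))
    # -> ['unstack(c,a)', 'putdown(c)', 'pickup(b)', 'stack(b,a)']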

Problems

• The search tree is generally quite large
  • Randomly reconfiguring 9 blocks takes thousands of CPU seconds
• But: the representation suggests some structure
  • Each action only affects a small set of facts
  • Actions depend on each other via their preconditions
• Planning algorithms are designed to take advantage of the fact that the representation makes the "locality" of action changes explicit

Planning Summary

• The model of the environment is known
• The agent performs computations with its model (without external interaction)
• The agent improves its policy
  • Deliberation, reasoning, introspection, pondering, thought, search

But… what happens if the environment is unknown?

How can we inform our agent of what actions to take?
• Assume: the environment is initially unknown
• Consider using a reward function to guide the agent
• If the agent doesn't know what actions to take:
  • Try an action out
  • See what the reward is for taking that action
• This is Reinforcement Learning

Reinforcement Learning

• Learning what to do, so as to maximize some reward signal

Example: Tic Tac Toe

• State: board configuration
• Actions: next move
• Reward: 1 for a win, -1 for a loss, 0 for a draw
• Problem: find π: S → A that maximizes reward

Example: Mobile Robot

• State: location of robot, people
• Actions: motion
• Reward: number of happy faces
• Problem: find π: S → A that maximizes reward

Example: Atari

• State: pixel locations of game agents
• Actions: agent movement
• Reward: score
• Problem: find π: S → A that maximizes reward

Autonomous Helicopter Flight


http://heli.stanford.edu/

Quadruped Robot


http://www.andrewng.org/portfolio/quadruped-robot-locomotion/

Reinforcement Learning
• Goal: learn to choose actions that maximize

    REWARD = r0 + γ r1 + γ² r2 + … ,  where 0 < γ < 1
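As a quick sanity check on this formula, the discounted return of a finite reward sequence can be computed directly (a small illustrative snippet, not course code):

    def discounted_return(rewards, gamma=0.9):
        """REWARD = r0 + gamma*r1 + gamma^2*r2 + ... for a finite list of rewards."""
        return sum(gamma ** t * r for t, r in enumerate(rewards))

    print(discounted_return([0, 0, 1], gamma=0.9))   # ≈ 0.81: a reward two steps away is discounted by gamma**2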

Reward

• A reward Rt is a scalar feedback signal
• It indicates how well the agent is doing at step t
• The agent's job is to maximize cumulative reward

Reward hypothesis: all goals can be described by the maximization of expected cumulative reward

Sequential Decision Making

• Goal: select actions to maximize total future reward
• Actions may have long-term consequences
• Reward may be delayed
• It may be better to sacrifice immediate reward to gain more long-term reward (exploitation vs. exploration)

Exploration and Exploitation

• Reinforcement learning is like trial-and-error learning
• The agent should discover a good policy
  • From its experiences of the environment (explore)
  • Without losing too much of the reward along the way (exploit)

Agent's Learning Task

• Execute actions in the world
• Observe the results
• Learn a policy π: S → A that maximizes reward from some initial state

Fully Observable Environment

• Full observability: the agent directly observes the environment state
• Agent state = environment state = information state
• Formally, this is a Markov Decision Process (MDP)

Partially Observable Environment

• Partial observability: the agent indirectly observes the environment
  • E.g. a robot with camera vision isn't told its absolute location
  • A trading agent only observes current prices
  • A poker-playing agent only observes the public cards
• Agent state ≠ environment state
• Formally this is a partially observable Markov Decision Process (POMDP)
• The agent must construct its own state representation S_t^a, e.g.:
  • Complete history: S_t^a = H_t
  • Beliefs of environment state: S_t^a = (P[S_t^e = s^1], …, P[S_t^e = s^n])
  • Recurrent neural network: S_t^a = σ(S_{t-1}^a W_s + O_t W_o)
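For the belief-state representation listed above, the belief vector can be maintained with a Bayes filter. Below is a compact sketch of my own; the transition table T and observation table O are hypothetical inputs, not something defined in the slides.

    def belief_update(belief, action, obs, T, O):
        """One Bayes-filter step: b'(s') ∝ O[s'][obs] * sum_s T[s][action][s'] * b(s).

        belief: dict state -> probability
        T:      dict state -> action -> dict next_state -> probability
        O:      dict state -> dict observation -> probability
        """
        new_belief = {}
        for s2 in belief:
            new_belief[s2] = O[s2][obs] * sum(T[s][action][s2] * belief[s] for s in belief)
        total = sum(new_belief.values())       # normalize (assumes the observation is possible)
        return {s: p / total for s, p in new_belief.items()}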

RL Agent

• An RL agent may include one or more of these components:
  • Policy: the agent's behaviour function
  • Value function: how good is each state and/or action
  • Model: the agent's representation of the environment

Maze Example

• States: the agent's location
• Actions: N, E, S, W
• Rewards: -1 per time-step

Maze Example
Policy: the agent's behaviour
• A map from state to action
• Deterministic policy: a = π(s)
• Stochastic policy: π(a|s) = P(At = a | St = s)

[Figure: each arrow represents the policy π(s) for each state s]
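In code, the two kinds of policy on this slide can be represented directly. A small sketch (the state names are made up for illustration):

    import random

    # Deterministic policy: a = pi(s), a plain map from state to action.
    pi_det = {"s1": "N", "s2": "E", "s3": "S"}

    # Stochastic policy: pi(a | s) = P(At = a | St = s), a distribution over actions per state.
    pi_stoch = {"s1": {"N": 0.8, "E": 0.2},
                "s2": {"E": 1.0},
                "s3": {"S": 0.5, "W": 0.5}}

    def sample_action(state):
        """Sample an action for `state` from the stochastic policy."""
        actions, probs = zip(*pi_stoch[state].items())
        return random.choices(actions, weights=probs, k=1)[0]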

Maze Example
Value function:
• A prediction of future reward
• Used to evaluate the goodness/badness of states

    vπ(s) = Eπ[Rt+1 + γ Rt+2 + … | St = s]

[Figure: the numbers represent the value function vπ(s) of each state s]
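When the model is known, vπ can be computed by iterative policy evaluation. Below is a minimal sketch of my own, assuming a deterministic policy, deterministic transitions given by a next_state function, and a reward function r(s, a); none of these names come from the slides.

    def evaluate_policy(states, pi, next_state, r, gamma=0.9, sweeps=200):
        """Iterative policy evaluation:
        repeatedly apply v(s) <- r(s, pi[s]) + gamma * v(next_state(s, pi[s])).
        Terminal states should map to themselves with reward 0."""
        v = {s: 0.0 for s in states}
        for _ in range(sweeps):
            v = {s: r(s, pi[s]) + gamma * v[next_state(s, pi[s])] for s in states}
        return v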

Maze Example
Model:
• Predicts what the environment will do next
• The agent may have an internal model of the environment, which determines
  • how actions change the state, and
  • how much reward is given for each state.
• The model may be imperfect


RL Agent

• Model-based:
  • Policy and/or value function
  • Model
• Model-free:
  • Policy and/or value function
  • No model