Planning & Reinforcement Learning
Slides borrowed from Sheila McIlraith, Kate Larson, and David Silver
CSC384, University of Toronto



Why Planning
• E.g. if we have a robot, we want the robot to decide what to do: how to act to achieve our goals.


Planning vs. Search

• How to change the world to suit our needs.
• Critical issue: we need to reason about what the world will be like after doing a few actions.

• This aspect of planning is just like search.

GOAL: Steven has coffee
CURRENTLY: robot in mailroom, has no coffee, coffee not made, Steven in office, etc.
TO DO: go to lounge, make coffee, …

Autonomous Agents for Space Exploration
• Autonomous planning, scheduling, control
  • NASA: JPL and Ames
• Remote Agent Experiment (RAX)
  • Deep Space 1
• Mars Exploration Rover (MER)

Other Applications (cont.)
• Scheduling with action choices & resource requirements
  • Problems in supply chain management
  • HSTS (Hubble Space Telescope scheduler)
  • Workflow management
• Air traffic control
  • Route aircraft between runways and terminals. Craft must be kept safely separated. Safe distance depends on craft and mode of transport. Minimize taxi and wait time.
• Character animation
  • Generate step-by-step character behaviour from a high-level spec
• Plan-based interfaces
  • E.g. NLP-to-database interfaces
  • Plan recognition, activity recognition

Applications

These applications require more than search. It is not sufficient to simply find a sequence of actions for transforming the world so as to achieve a goal state.

• These applications involve dealing with uncertainty.
• Sensing the world, and planning to sense the world, so as to reduce uncertainty.
• Generating a plan that has high payoff, or high expected payoff, rather than simply achieving a fixed goal.
• Running into problems when executing a plan and having to recover.
• Etc.

Planning

• Agent: single agent or multi-agent
• State: complete or incomplete (logical/probabilistic); state of the world and/or the agent's state of knowledge
• Actions: world-altering and/or knowledge-altering (e.g. sensing); deterministic or non-deterministic (logical/stochastic)
• Goal condition: satisfying or optimizing; final-state or temporally extended; optimizing for preference/cost/utility
• Reasoning: offline or online (fully observable, partially observable)
• Plans: partial-order, sequential, conditional

Simplifying the Planning Problem

• We simplify the planning problem as follows:
  • Assume complete information about the initial state, via the closed world assumption (CWA)
  • Assume a finite domain of objects
  • Assume action effects are restricted to making conjunctions of atomic formulae true or false; no conditional effects, etc.
  • Assume action preconditions are restricted to conjunctions of ground atoms
• Perform classical planning: no incomplete or uncertain knowledge

Classical Planning Assumptions

• Finite system: finitely many states, actions, and events
• Fully observable: the controller always knows the current state
• Deterministic: each action has only one outcome
• Static: changes occur only as the result of controller actions
• Attainment goals: a set of goal states Sg
• Sequential plans: a plan is a linearly ordered sequence of actions (a1, …, an)
• Implicit time: actions are instantaneous (have no duration)
• Off-line planning: the planner doesn't know the execution status

STRIPS Representation

• STRIPS (Stanford Research Institute Problem Solver)
• A way of representing actions with respect to a CW-KB: a closed world knowledge base representing the state of the world

Sequence of Worlds

STRIPS Actions

• STRIPS represents actions using three lists:
  • A list of action preconditions
  • A list of action add effects
  • A list of action delete effects
• These lists contain variables, so that we can represent a whole class of actions with one specification
• Each ground instantiation of the variables yields a specific action

STRIPS Actions: Example

pickup(X):

Pre: {handempty, clear(X), ontable(X)}
Adds: {holding(X)}
Dels: {handempty, clear(X), ontable(X)}

[Blocks world figure: robot hand and blocks A, B, C]
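As a concrete illustration of the three-list representation, the pickup(X) operator above can be written as a small data structure whose ground instances are actions. This is a minimal sketch of my own (not the course's code); ground facts are represented here as plain strings, which is an assumption made for illustration.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class StripsOperator:
        """A ground STRIPS action: precondition, add, and delete lists (as sets of facts)."""
        name: str
        pre: frozenset
        adds: frozenset
        dels: frozenset

    def pickup(x):
        """Ground instance of the pickup(X) operator for block x."""
        return StripsOperator(
            name=f"pickup({x})",
            pre=frozenset({"handempty", f"clear({x})", f"ontable({x})"}),
            adds=frozenset({f"holding({x})"}),
            dels=frozenset({"handempty", f"clear({x})", f"ontable({x})"}),
        )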

STRIPS Actions: Example

pickup(X) is called a STRIPS operator.

pickup(a) (a particular ground instance) is called an action.

[Blocks world figure]

STRIPS Actions: Example

putdown(X)

Pre: {holding(X)}
Adds: {clear(X), ontable(X), handempty}
Dels: {holding(X)}

[Blocks world figure]

STRIPS Actions: Example

stack(X,Y)

Pre: {holding(X), clear(Y)}
Adds: {on(X,Y), handempty, clear(X)}
Dels: {holding(X), clear(Y)}

[Blocks world figure]

STRIPS has no Conditional Effects

• Blocks World assumption: the table has infinite space, so it is always clear
• If we stack something on the table (Y = table), we cannot delete clear(table)
• But if Y is an ordinary block, we must delete clear(Y)

stack(X,Y)

Pre: {holding(X), clear(Y)}
Adds: {on(X,Y), handempty, clear(X)}
Dels: {holding(X), clear(Y)}

STRIPS has no Conditional Effects
• Since STRIPS has no conditional effects, we must sometimes utilize extra actions: one for each type of condition.
• We embed the condition in the precondition and then alter the effects accordingly.

stack(X,Y)

Pre: {holding(X), clear(Y)}
Adds: {on(X,Y), handempty, clear(X)}
Dels: {holding(X), clear(Y)}

putdown(X)

Pre: {holding(X)}
Adds: {ontable(X), handempty, clear(X)}
Dels: {holding(X)}

STRIPS Actions: Example

unstack(X,Y)

Pre: { }
Adds: { }
Dels: { }

[Blocks world figure]

STRIPS Actions: Example

unstack(X,Y)

Pre: {clear(X), on(X,Y), handempty}
Adds: {holding(X), clear(Y)}
Dels: {clear(X), on(X,Y), handempty}

[Blocks world figure]

Planning as a Search Problem

• Given
  • A CW-KB representing the initial state
  • A set of STRIPS operators that map a state to a new state
  • A goal condition (a conjunction of facts, or a formula)

• The planning problem is to determine a sequence of actions that, when applied to the initial CW-KB, yields an updated CW-KB which satisfies the goal.

• This is the classical planning task.

Planning As Search

• This is a search problem, in which our state space representation is a CW-KB
  • The initial CW-KB is the initial state
  • Actions are operators mapping a state to a new state
  • The goal is satisfied by any state that satisfies the goal condition. Typically the goal is a conjunction of primitive facts, so we just need to check whether all the facts in the goal are contained in the CW-KB
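To make this formulation concrete, here is a rough sketch (my own illustration, not the course's code) of forward state-space search over CW-KBs represented as sets of ground facts, using the operator encoding sketched earlier. Breadth-first search is used only for clarity; real planners search this same space with heuristics.

    from collections import deque

    def applicable(state, op):
        # An operator is applicable when all its preconditions are in the CW-KB.
        return op.pre <= state

    def progress(state, op):
        # Apply the operator: remove the delete-list facts, then add the add-list facts.
        return (state - op.dels) | op.adds

    def bfs_plan(init, goal, operators):
        """Breadth-first forward search. `goal` is a set of facts that must all hold;
        `operators` is a list of ground StripsOperator instances."""
        start = frozenset(init)
        frontier = deque([(start, [])])
        seen = {start}
        while frontier:
            state, plan = frontier.popleft()
            if goal <= state:                      # goal check: goal facts contained in the CW-KB
                return plan
            for op in operators:
                if applicable(state, op):
                    nxt = progress(state, op)
                    if nxt not in seen:
                        seen.add(nxt)
                        frontier.append((nxt, plan + [op.name]))
        return None                                # no plan exists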

Example

[Figure: blocks-world search — from an initial configuration to a goal configuration]

Example

[Figure: one search state expanded via move(b,c), move(c,b), move(c,table), and move(a,b), yielding four successor configurations]
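Continuing the sketches above (and using the pickup/putdown/stack/unstack encoding from the earlier slides rather than the move(X,Y) operators shown in this figure), a hypothetical three-block instance can be solved like this. The instance itself is made up for illustration.

    def putdown(x):
        return StripsOperator(f"putdown({x})",
                              pre=frozenset({f"holding({x})"}),
                              adds=frozenset({f"clear({x})", f"ontable({x})", "handempty"}),
                              dels=frozenset({f"holding({x})"}))

    def stack(x, y):
        return StripsOperator(f"stack({x},{y})",
                              pre=frozenset({f"holding({x})", f"clear({y})"}),
                              adds=frozenset({f"on({x},{y})", "handempty", f"clear({x})"}),
                              dels=frozenset({f"holding({x})", f"clear({y})"}))

    def unstack(x, y):
        return StripsOperator(f"unstack({x},{y})",
                              pre=frozenset({f"clear({x})", f"on({x},{y})", "handempty"}),
                              adds=frozenset({f"holding({x})", f"clear({y})"}),
                              dels=frozenset({f"clear({x})", f"on({x},{y})", "handempty"}))

    blocks = ["a", "b", "c"]
    ops = ([pickup(x) for x in blocks] + [putdown(x) for x in blocks] +
           [stack(x, y) for x in blocks for y in blocks if x != y] +
           [unstack(x, y) for x in blocks for y in blocks if x != y])

    # Initial CW-KB: c is on a, a and b are on the table, the hand is empty.
    init = {"on(c,a)", "ontable(a)", "ontable(b)", "clear(c)", "clear(b)", "handempty"}
    # Goal: b stacked on a, with c on the table.
    goal = {"on(b,a)", "ontable(c)"}

    print(bfs_plan(init, goal, ops))
    # -> ['unstack(c,a)', 'putdown(c)', 'pickup(b)', 'stack(b,a)']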

Problems

• The search tree is generally quite large
  • Randomly reconfiguring 9 blocks takes thousands of CPU seconds
• But: the representation suggests some structure
  • Each action only affects a small set of facts
  • Actions depend on each other via their preconditions
• Planning algorithms are designed to take advantage of the fact that the representation makes the "locality" of action changes explicit

Planning Summary

• The model of the environment is known
• The agent performs computations with its model (without external interaction)
• The agent improves its policy
  • Deliberation, reasoning, introspection, pondering, thought, search

But… what happens if the environment is unknown?

How can we inform our agent of what actions to take?
• Assume: the environment is initially unknown
• Consider using a reward function to guide the agent
• If the agent doesn't know what actions to take:
  • Try an action out
  • See what the reward is for taking that action
• This is Reinforcement Learning

Reinforcement Learning

• Learning what to do, so as to maximize some reward signal

Example: Tic Tac Toe

• State: board configuration
• Actions: next move
• Reward: 1 for a win, -1 for a loss, 0 for a draw
• Problem: find π: S → A that maximizes reward

Example: Mobile Robot

• State: location of robot, people
• Actions: motion
• Reward: number of happy faces
• Problem: find π: S → A that maximizes reward

Example: Atari

• State: pixel locations of game agents
• Actions: agent movement
• Reward: score
• Problem: find π: S → A that maximizes reward

Autonomous Helicopter Flight


http://heli.stanford.edu/

Quadruped Robot


http://www.andrewng.org/portfolio/quadruped-robot-locomotion/

Reinforcement Learning
• Goal: learn to choose actions that maximize

    REWARD = r0 + γ r1 + γ² r2 + … ,  where 0 < γ < 1
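As a quick sanity check on this formula, the discounted return of a finite reward sequence can be computed directly (a small illustrative snippet, not course code):

    def discounted_return(rewards, gamma=0.9):
        """REWARD = r0 + gamma*r1 + gamma^2*r2 + ... for a finite list of rewards."""
        return sum(gamma ** t * r for t, r in enumerate(rewards))

    print(discounted_return([0, 0, 1], gamma=0.9))   # ≈ 0.81: a reward two steps away is discounted by gamma**2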

Reward

• A reward Rt is a scalar feedback signal
• It indicates how well the agent is doing at step t
• The agent's job is to maximize cumulative reward

Reward hypothesis: all goals can be described by the maximization of expected cumulative reward

Sequential Decision Making

• Goal: select actions to maximize total future reward
• Actions may have long-term consequences
• Reward may be delayed
• It may be better to sacrifice immediate reward to gain more long-term reward (exploitation vs. exploration)

Exploration and Exploitation

• Reinforcement learning is like trial-and-error learning
• The agent should discover a good policy
  • From its experiences of the environment (explore)
  • Without losing too much of the reward along the way (exploit)

Agent's Learning Task

• Execute actions in the world
• Observe the results
• Learn a policy π: S → A that maximizes reward from some initial state

Fully Observable Environment

• Full observability: the agent directly observes the environment state
• Agent state = environment state = information state
• Formally, this is a Markov Decision Process (MDP)

Partially Observable Environment

• Partial observability: the agent indirectly observes the environment
  • E.g. a robot with camera vision isn't told its absolute location
  • A trading agent only observes current prices
  • A poker-playing agent only observes the public cards
• Agent state ≠ environment state
• Formally this is a partially observable Markov Decision Process (POMDP)
• The agent must construct its own state representation S_t^a, e.g.:
  • Complete history: S_t^a = H_t
  • Beliefs of environment state: S_t^a = (P[S_t^e = s^1], …, P[S_t^e = s^n])
  • Recurrent neural network: S_t^a = σ(S_{t-1}^a W_s + O_t W_o)
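For the belief-state representation listed above, the belief vector can be maintained with a Bayes filter. Below is a compact sketch of my own; the transition table T and observation table O are hypothetical inputs, not something defined in the slides.

    def belief_update(belief, action, obs, T, O):
        """One Bayes-filter step: b'(s') ∝ O[s'][obs] * sum_s T[s][action][s'] * b(s).

        belief: dict state -> probability
        T:      dict state -> action -> dict next_state -> probability
        O:      dict state -> dict observation -> probability
        """
        new_belief = {}
        for s2 in belief:
            new_belief[s2] = O[s2][obs] * sum(T[s][action][s2] * belief[s] for s in belief)
        total = sum(new_belief.values())       # normalize (assumes the observation is possible)
        return {s: p / total for s, p in new_belief.items()}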

RL Agent

• An RL agent may include one or more of these components:
  • Policy: the agent's behaviour function
  • Value function: how good is each state and/or action
  • Model: the agent's representation of the environment

Maze Example

• States: the agent's location
• Actions: N, E, S, W
• Rewards: -1 per time-step

Maze Example
Policy: the agent's behaviour
• A map from state to action
• Deterministic policy: a = π(s)
• Stochastic policy: π(a|s) = P(At = a | St = s)

[Figure: each arrow represents the policy π(s) for each state s]
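In code, the two kinds of policy on this slide can be represented directly. A small sketch (the state names are made up for illustration):

    import random

    # Deterministic policy: a = pi(s), a plain map from state to action.
    pi_det = {"s1": "N", "s2": "E", "s3": "S"}

    # Stochastic policy: pi(a | s) = P(At = a | St = s), a distribution over actions per state.
    pi_stoch = {"s1": {"N": 0.8, "E": 0.2},
                "s2": {"E": 1.0},
                "s3": {"S": 0.5, "W": 0.5}}

    def sample_action(state):
        """Sample an action for `state` from the stochastic policy."""
        actions, probs = zip(*pi_stoch[state].items())
        return random.choices(actions, weights=probs, k=1)[0]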

Maze Example
Value function:
• A prediction of future reward
• Used to evaluate the goodness/badness of states

    vπ(s) = Eπ[Rt+1 + γ Rt+2 + … | St = s]

[Figure: the numbers represent the value function vπ(s) of each state s]
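When the model is known, vπ can be computed by iterative policy evaluation. Below is a minimal sketch of my own, assuming a deterministic policy, deterministic transitions given by a next_state function, and a reward function r(s, a); none of these names come from the slides.

    def evaluate_policy(states, pi, next_state, r, gamma=0.9, sweeps=200):
        """Iterative policy evaluation:
        repeatedly apply v(s) <- r(s, pi[s]) + gamma * v(next_state(s, pi[s])).
        Terminal states should map to themselves with reward 0."""
        v = {s: 0.0 for s in states}
        for _ in range(sweeps):
            v = {s: r(s, pi[s]) + gamma * v[next_state(s, pi[s])] for s in states}
        return v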

Maze Example
Model:
• Predicts what the environment will do next
• The agent may have an internal model of the environment, which determines
  • how actions change the state, and
  • how much reward is given for each state.
• The model may be imperfect


RL Agent

• Model-based:
  • Policy and/or value function
  • Model
• Model-free:
  • Policy and/or value function
  • No model