Trust Region Policy Optimization (TRPO) - SJSU, 2018-04-17

TRANSCRIPT

Page 1:

Trust region policy optimization (TRPO)

Page 2:

Value Iteration

Page 3:

Value Iteration

• This is similar to what Q-Learning does, the main difference being that we might not know the actual expected reward, and instead explore the world and use discounted rewards to model our value function.
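As a concrete illustration (a minimal sketch, not from the slides), a tabular Q-learning update uses sampled rewards and bootstrapped discounted estimates instead of a known reward model. The state/action counts, hyperparameters, and any environment interface are assumptions for the example.

```python
import numpy as np

# Minimal tabular Q-learning sketch (illustrative; the environment API is assumed).
n_states, n_actions = 16, 4
gamma, alpha, epsilon = 0.99, 0.1, 0.1      # discount, learning rate, exploration rate
Q = np.zeros((n_states, n_actions))

def q_learning_step(s, a, r, s_next):
    """One-step Q-learning update from an observed transition (s, a, r, s_next)."""
    td_target = r + gamma * np.max(Q[s_next])   # bootstrap from the current estimate
    Q[s, a] += alpha * (td_target - Q[s, a])

def epsilon_greedy(s):
    """Explore with probability epsilon, otherwise act greedily w.r.t. Q."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))
```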

Page 4:

Value Iteration

[Diagram: Model-based vs. Model-free approaches]

Page 5:

Value Iteration

• Once we have Q(s, a), we can find the optimal policy π* using:

    π*(s) = argmax_a Q(s, a)
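A one-line sketch of that extraction, assuming a tabular Q of shape (n_states, n_actions):

```python
import numpy as np

Q = np.random.rand(16, 4)            # placeholder Q(s, a) table
pi_star = np.argmax(Q, axis=1)       # pi*(s) = argmax_a Q(s, a)
```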

Page 6:

Policy Iteration

• We can directly optimize in the policy space.

Page 7:

Policy Iteration

• We can directly optimize in the policy space (see the sketch below).
• The policy space is smaller than the Q-function space.
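As an illustration of optimizing directly in policy space (a sketch; this particular parameterization is an assumption, not from the slides), the policy can be a differentiable function of a parameter vector θ, e.g. a softmax over per-state action preferences:

```python
import numpy as np

n_states, n_actions = 16, 4
theta = np.zeros((n_states, n_actions))      # policy parameters

def policy(theta, s):
    """pi_theta(. | s): softmax over the action preferences for state s."""
    z = theta[s] - np.max(theta[s])           # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()
```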

Page 8:

Preliminaries

The following identity expresses the expected return η(π̃) of another policy π̃ in terms of the advantage over π, accumulated over timesteps:

    η(π̃) = η(π) + E_{s_0, a_0, … ~ π̃} [ Σ_{t≥0} γ^t A_π(s_t, a_t) ]
          = η(π) + Σ_s ρ_π̃(s) Σ_a π̃(a|s) A_π(s, a)

where A_π is the advantage function:

    A_π(s, a) = Q_π(s, a) − V_π(s)

and ρ_π̃ is the discounted visitation frequency of states under policy π̃:

    ρ_π̃(s) = P(s_0 = s) + γ P(s_1 = s) + γ² P(s_2 = s) + …
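A small numeric sketch of these two quantities (assuming tabular Q, V, and sampled trajectories of visited states; all names are illustrative):

```python
import numpy as np

gamma = 0.99

def advantage(Q, V):
    """A_pi(s, a) = Q_pi(s, a) - V_pi(s) for a tabular Q (S x A) and V (S,)."""
    return Q - V[:, None]

def discounted_visitation(state_trajectories, n_states):
    """Monte Carlo estimate of rho(s) = sum_t gamma^t P(s_t = s)."""
    rho = np.zeros(n_states)
    for traj in state_trajectories:          # each traj is a list of visited states
        for t, s in enumerate(traj):
            rho[s] += gamma ** t
    return rho / len(state_trajectories)
```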

Page 9:

Preliminaries

To remove the complexity due to ρ_π̃, the following local approximation is introduced:

    L_π(π̃) = η(π) + Σ_s ρ_π(s) Σ_a π̃(a|s) A_π(s, a)

If we have a parameterized policy π_θ, where π_θ(a|s) is a differentiable function of the parameter vector θ, then L matches η to first order, i.e.,

    L_{π_θ0}(π_θ0) = η(π_θ0),   ∇_θ L_{π_θ0}(π_θ) |_{θ=θ0} = ∇_θ η(π_θ) |_{θ=θ0}

This implies that a sufficiently small step θ0 → θ̃ that improves L_{π_θ0} will also improve η, but it does not give us any guidance on how big a step to take.
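A sketch of evaluating the local approximation on a small discrete MDP, given tabular ρ_π, A_π, and a candidate policy (the array inputs and η(π) value are assumed to be available):

```python
import numpy as np

def surrogate_L(eta_pi, rho_pi, pi_tilde, A_pi):
    """L_pi(pi_tilde) = eta(pi) + sum_s rho_pi(s) sum_a pi_tilde(a|s) A_pi(s, a).

    rho_pi:   (S,)   state visitation frequency of the *current* policy
    pi_tilde: (S, A) candidate policy
    A_pi:     (S, A) advantages of the current policy
    """
    return eta_pi + np.sum(rho_pi[:, None] * pi_tilde * A_pi)
```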

Page 10:

Preliminaries

• To address this issue, Kakade & Langford (2002) proposed conservative policy iteration:

    π_new(a|s) = (1 − α) π_old(a|s) + α π'(a|s)

  where π_old is the current policy and π' = argmax_{π'} L_{π_old}(π').

• They derived the following lower bound:

    η(π_new) ≥ L_{π_old}(π_new) − (2εγ / (1 − γ)²) α²,   where ε = max_s | E_{a ~ π'(a|s)} [ A_π(s, a) ] |
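A tiny numeric sketch of the mixture update and the corresponding lower bound (illustrative values only, not from the slides):

```python
import numpy as np

def conservative_update(pi_old, pi_prime, alpha):
    """Mixture policy of Kakade & Langford: pi_new = (1 - alpha) * pi_old + alpha * pi_prime."""
    return (1.0 - alpha) * pi_old + alpha * pi_prime

def lower_bound(L_new, alpha, eps, gamma):
    """Lower bound on eta(pi_new): L_{pi_old}(pi_new) - 2*eps*gamma/(1-gamma)^2 * alpha^2."""
    return L_new - 2.0 * eps * gamma / (1.0 - gamma) ** 2 * alpha ** 2
```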

Page 11:

Preliminaries

• Computationally, this α-coupling means that if we randomly choose a seed for our random number generator, and then we sample from each of π and π_new after setting that seed, the results will agree for at least a fraction 1 − α of seeds.
• Thus α can be considered a measure of disagreement between π and π_new.
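A small sketch of that idea using common random numbers: both policies are sampled with the same uniform draw (inverse-CDF sampling), and the fraction of disagreeing draws plays the role of α. The two distributions here are illustrative.

```python
import numpy as np

def coupled_sample(p, q, u):
    """Sample from two discrete distributions p, q using the same uniform draw u."""
    return int(np.searchsorted(np.cumsum(p), u)), int(np.searchsorted(np.cumsum(q), u))

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])          # pi(. | s)
q = np.array([0.45, 0.35, 0.2])        # pi_new(. | s)
draws = [coupled_sample(p, q, rng.random()) for _ in range(100_000)]
disagreement = np.mean([a != b for a, b in draws])
print(disagreement)   # ~0.05 for this p, q: the fraction of disagreeing seeds is the alpha above
```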

Page 12:

Theorem 1

• The previous result was applicable to mixture policies only. Schulman showed that it can be extended to general stochastic policies by using a distance measure called the "total variation" divergence between π and π̃:

    D_TV(p ‖ q) = (1/2) Σ_i | p_i − q_i |   for discrete probability distributions p, q

• Let D_TV^max(π, π̃) = max_s D_TV( π(·|s) ‖ π̃(·|s) ).

• They proved that for α = D_TV^max(π_old, π_new), the following result holds:

    η(π_new) ≥ L_{π_old}(π_new) − (4εγ / (1 − γ)²) α²,   where ε = max_{s,a} | A_π(s, a) |
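A short sketch of computing the total-variation terms above (the (S, A) policy tables are illustrative):

```python
import numpy as np

def tv_distance(p, q):
    """D_TV(p, q) = 1/2 * sum_i |p_i - q_i| for discrete distributions."""
    return 0.5 * np.sum(np.abs(p - q))

def tv_max(pi, pi_tilde):
    """D_TV^max(pi, pi_tilde) = max_s D_TV(pi(.|s), pi_tilde(.|s)) for (S, A) policy tables."""
    return max(tv_distance(pi[s], pi_tilde[s]) for s in range(pi.shape[0]))
```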

Page 13:

Theorem 1

• Note the following relation between total variation and Kullback–Leibler divergence:

    D_TV(p ‖ q)² ≤ D_KL(p ‖ q)

• Thus the bounding condition becomes:

    η(π̃) ≥ L_π(π̃) − C · D_KL^max(π, π̃),   where C = 4εγ / (1 − γ)²
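A quick numerical check of that relation on random discrete distributions (illustrative only, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def tv(p, q):  return 0.5 * np.sum(np.abs(p - q))
def kl(p, q):  return np.sum(p * np.log(p / q))

for _ in range(1000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    assert tv(p, q) ** 2 <= kl(p, q) + 1e-12   # D_TV(p||q)^2 <= D_KL(p||q)
```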

Page 14:

Algorithm 1

• Policy iteration with guaranteed monotonic improvement: at each iteration i, compute the advantages A_{π_i}(s, a) of the current policy and update

    π_{i+1} = argmax_π [ L_{π_i}(π) − C · D_KL^max(π_i, π) ]

  so that the bound above guarantees η(π_{i+1}) ≥ η(π_i).

Page 15:

Trust Region Policy Optimization

• For parameterized policies π_θ with parameter vector θ, we are guaranteed to improve the true objective by performing the following maximization:

    maximize_θ [ L_{θ_old}(θ) − C · D_KL^max(θ_old, θ) ]

• However, using the penalty coefficient C as above results in very small step sizes. One way to take larger steps in a robust way is to use a constraint on the KL divergence between the new policy and the old policy, i.e., a trust region constraint:

    maximize_θ L_{θ_old}(θ)   subject to   D_KL^max(θ_old, θ) ≤ δ

Page 16:

Trust Region Policy Optimization

• The constraint bounds the KL divergence at every point in state space, which is not practical. We can use the following heuristic approximation, the average KL divergence over states visited by the old policy:

    D̄_KL^ρ(θ_1, θ_2) := E_{s ~ ρ} [ D_KL( π_{θ_1}(·|s) ‖ π_{θ_2}(·|s) ) ]

• Thus, the optimization problem becomes:

    maximize_θ L_{θ_old}(θ)   subject to   D̄_KL^{ρ_{θ_old}}(θ_old, θ) ≤ δ
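A sketch of the heuristic: estimate the mean KL over states sampled under the old policy rather than the maximum over all states (the policy tables and sampled states are illustrative):

```python
import numpy as np

def kl_categorical(p, q):
    """D_KL(p || q) for discrete distributions."""
    return np.sum(p * np.log(p / q))

def mean_kl(pi_old, pi_new, sampled_states):
    """Average KL over states sampled under the old policy (approximates the rho-averaged KL)."""
    return np.mean([kl_categorical(pi_old[s], pi_new[s]) for s in sampled_states])
```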

Page 17:

Trust Region Policy Optimization

• In terms of expectations, the previous equation can be written as:

    maximize_θ  E_{s ~ ρ_{θ_old}, a ~ q} [ ( π_θ(a|s) / q(a|s) ) · Q_{θ_old}(s, a) ]
    subject to  E_{s ~ ρ_{θ_old}} [ D_KL( π_{θ_old}(·|s) ‖ π_θ(·|s) ) ] ≤ δ

  where q denotes the sampling distribution.

• This sampling distribution can be constructed in two ways:

    a) Single Path method
    b) Vine method
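A sketch of the single-path variant: roll out the current policy, record state-action pairs, and use the discounted return from each timestep as a Monte Carlo Q-value estimate. The environment API (reset/step returning next state, reward, done) and the policy function are assumptions for the example.

```python
import numpy as np

def collect_single_path(env, policy, gamma, horizon):
    """Roll out one trajectory; here q = pi_theta_old, so the importance ratio starts at 1."""
    states, actions, rewards = [], [], []
    s = env.reset()
    for _ in range(horizon):
        a = policy(s)
        s_next, r, done = env.step(a)      # assumed env interface
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
        if done:
            break
    # Monte Carlo Q estimate: discounted return from each timestep onwards
    q_values, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        q_values.append(G)
    return states, actions, list(reversed(q_values))
```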

Page 18:

Final Algorithm

• Step 1: Use the single-path or vine procedure to collect a set of state-action pairs along with Monte Carlo estimates of their Q-values.
• Step 2: By averaging over samples, construct the estimated objective and constraint in Equation (14) (the expectation form on the previous page).
• Step 3: Approximately solve this constrained optimization problem to update the policy's parameter vector θ (see the sketch below).
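Putting the three steps together for a tabular softmax policy, as a minimal sketch only (not the paper's conjugate-gradient implementation): maximize the sampled surrogate and keep a step only if the average KL stays below δ, using a crude finite-difference gradient and a backtracking line search. All names and hyperparameters here are assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def surrogate(theta, theta_old, states, actions, q_values):
    """Sampled objective: mean of (pi_theta(a|s) / pi_theta_old(a|s)) * Q(s, a) over samples."""
    pi, pi_old = softmax(theta), softmax(theta_old)
    ratio = pi[states, actions] / pi_old[states, actions]
    return np.mean(ratio * q_values)

def mean_kl(theta_old, theta, states):
    """Average KL(pi_theta_old || pi_theta) over the sampled states."""
    pi_old, pi = softmax(theta_old), softmax(theta)
    return np.mean(np.sum(pi_old[states] * np.log(pi_old[states] / pi[states]), axis=-1))

def trpo_style_update(theta, states, actions, q_values, delta=0.01, lr=1.0, n_backtracks=10):
    """Step 3 (simplified): ascend the surrogate, then backtrack until mean KL <= delta."""
    states, actions = np.asarray(states), np.asarray(actions)
    q_values = np.asarray(q_values, dtype=float)
    theta_old = theta.copy()
    # Finite-difference gradient of the surrogate (a stand-in for the analytic gradient / CG step)
    grad = np.zeros_like(theta)
    eps = 1e-5
    base = surrogate(theta, theta_old, states, actions, q_values)
    for idx in np.ndindex(theta.shape):
        theta[idx] += eps
        grad[idx] = (surrogate(theta, theta_old, states, actions, q_values) - base) / eps
        theta[idx] -= eps
    # Backtracking line search enforcing the trust-region constraint
    for i in range(n_backtracks):
        theta_new = theta_old + lr * (0.5 ** i) * grad
        improved = surrogate(theta_new, theta_old, states, actions, q_values) > base
        if improved and mean_kl(theta_old, theta_new, states) <= delta:
            return theta_new
    return theta_old    # reject the step if no candidate satisfies the constraint
```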