Variance Reduction for Reinforcement Learning in Input-Driven Environments
Hongzi Mao, Shaileshh Bojja Venkatakrishnan, Malte Schwarzkopf, Mohammad Alizadeh
MIT Computer Science and Artificial Intelligence Laboratory
Input-Driven Processes
Input-driven environments are environments with exogenous, stochastic input processes that affect the dynamics. Examples of such inputs: workload, network conditions, wind, buoyancy, a moving target.
Since the reward is partially dictated by the input process, the state alone provides only limited information for estimating the average return. Policy gradient methods with standard state-dependent baselines therefore suffer from high variance.
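One way to see the variance claim (this decomposition is my framing via the law of total variance, not spelled out on the poster): conditioning on the input splits the return's variance into a part any baseline can address and a part only an input-dependent baseline can remove.

```latex
% Law of total variance applied to the return R from state s:
\[
\operatorname{Var}[R \mid s]
  = \underbrace{\mathbb{E}_{z}\!\big[\operatorname{Var}[R \mid s, z]\big]}_{\text{within one input sequence}}
  + \underbrace{\operatorname{Var}_{z}\!\big[\mathbb{E}[R \mid s, z]\big]}_{\text{across input sequences}}
\]
% A state-dependent baseline b(s) cannot touch the second term;
% b(s, z) = V(s|z) centers the return within each input, removing it.
```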
Motivation Example
Load balancing example: jobs of varying size arrive over time (the input process), and a load balancer assigns each incoming job to Server 1 or Server 2. [Figure: the job-size sequence over time, the load balancer with two servers, and two sample input sequences, Input 1 and Input 2. Companion plots show reward and policy variance during training, plus a policy visualization.]
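To make the example concrete, here is a minimal sketch of such a load balancing environment; the two-server interface, the unit drain rate, and the Pareto job sizes are illustrative assumptions, not the poster's exact setup.

```python
import numpy as np

class LoadBalanceEnv:
    """Toy two-server load balancer. The job-size sequence is the
    exogenous input process z; each action assigns one job to a server.
    (Illustrative sketch; distributions and dynamics are assumptions.)"""

    def __init__(self, job_sizes):
        self.job_sizes = job_sizes           # input sequence z_0..z_T
        self.t = 0
        self.backlog = np.zeros(2)           # outstanding work per server

    def step(self, action):
        self.backlog[action] += self.job_sizes[self.t]       # enqueue job
        self.backlog = np.maximum(self.backlog - 1.0, 0.0)   # servers drain
        self.t += 1
        reward = -self.backlog.sum()         # penalize total queued work
        done = self.t >= len(self.job_sizes)
        return self.backlog.copy(), reward, done

# The same join-the-shorter-queue policy earns very different returns
# under different input realizations: the input, not the policy,
# drives much of the return's variance.
rng = np.random.default_rng(0)
for i in range(2):
    z = rng.pareto(2.0, size=100) + 1.0      # heavy-tailed job sizes
    env, ret, done = LoadBalanceEnv(z), 0.0, False
    while not done:
        action = int(env.backlog[1] < env.backlog[0])
        _, r, done = env.step(action)
        ret += r
    print(f"input realization {i}: return = {ret:.1f}")
```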
[Figure: graphical models with states s_t, actions a_t, and inputs z_t for (a) a standard MDP, with no input process; (b) an input-driven MDP, where the exogenous input z_t also drives the transition to s_{t+1}; (c) an input-driven POMDP, where the agent observes s_t but not the input z_t.]
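In symbols (my rendering of the figure; a Markov input process is one case the framework covers):

```latex
% Standard MDP: transitions depend only on state and action.
\[ s_{t+1} \sim P(\cdot \mid s_t, a_t) \]
% Input-driven (PO)MDP: an exogenous input z_t, unaffected by the
% agent's actions, also drives the dynamics:
\[ s_{t+1} \sim P(\cdot \mid s_t, a_t, z_t), \qquad z_{t+1} \sim P(\cdot \mid z_t) \]
% MDP case: the agent observes (s_t, z_t); POMDP case: only s_t.
```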
Input-Dependent Baselines
State-dependent baseline: $b(s_t) = V(s_t),\ \forall z_{t:\infty}$
Input-dependent baseline: $b(s_t, z_{t:\infty}) = V(s_t \mid z_{t:\infty})$
Input-dependent baselines depend on the entire future input sequence $\{z_t, z_{t+1}, \ldots, z_\infty\}$ during training.
Input-dependent baselines are bias-free for policy gradients:
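A one-line sketch of why (the standard score-function argument, conditioned on the input; my rendering, consistent with the poster's claim):

```latex
% For a fixed input sequence z, the baseline is constant with respect
% to the sampled action, so the score-function term has zero mean:
\[
\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}
  \big[\, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \; b(s_t, z_{t:\infty}) \,\big]
  = b(s_t, z_{t:\infty}) \, \nabla_\theta \!\sum_{a_t} \pi_\theta(a_t \mid s_t)
  = b(s_t, z_{t:\infty}) \, \nabla_\theta 1 = 0 .
\]
% Averaging over z ~ p(z) preserves the zero, so subtracting the
% input-dependent baseline leaves the policy gradient unbiased.
```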
Implementations of input-dependent baselines:
1. Sample $z \sim p(z)$, use an LSTM to compute $V_\theta(s, z)$ for the policy gradient (not efficient). [Diagram: the input $z \sim p(z)$ is encoded by an LSTM to produce the baseline $V_\theta(s, z)$.]
2. Sample $z_i \sim \{z_1, z_2, \ldots, z_N\}$, use the corresponding value net for the policy gradient. [Diagram: one value network $V_{\theta_i}(s)$ per input instance $z_i$.]
3. Sample $z \sim p(z)$ and trajectories, then use MAML to adapt the meta-baseline $V_\theta(s)$ for the policy gradient. [Diagram: the meta-baseline $V_\theta(s)$ is adapted via MAML into an input-specific baseline $V_{\theta_z}(s)$ for each sampled input.]
Code sketches of the LSTM and MAML variants follow this list.
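A minimal PyTorch sketch of the LSTM variant (item 1); the layer sizes and the way the state and input encoding are combined are my assumptions:

```python
import torch
import torch.nn as nn

class LSTMBaseline(nn.Module):
    """Input-dependent baseline V_theta(s, z): encode the future input
    sequence with an LSTM, then score it together with the state.
    (Sketch; architecture details are assumptions.)"""

    def __init__(self, state_dim, input_dim, hidden=64):
        super().__init__()
        self.encoder = nn.LSTM(input_dim, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(state_dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state, z_future):
        # z_future: (batch, T, input_dim), the future inputs that the
        # baseline (but never the policy) may condition on at training time.
        _, (h, _) = self.encoder(z_future)
        return self.head(torch.cat([state, h[-1]], dim=-1)).squeeze(-1)

v = LSTMBaseline(state_dim=4, input_dim=1)
value = v(torch.randn(8, 4), torch.randn(8, 50, 1))  # shape (8,)
```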
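A minimal sketch of the MAML adaptation step (item 3), written first-order for brevity; the helper name, inner learning rate, and squared-error loss are assumptions rather than the authors' exact training loop:

```python
import copy
import torch

def adapt_meta_baseline(meta_v, states, returns, inner_lr=0.1, steps=1):
    """Adapt the meta-baseline V_theta(s) to one sampled input sequence z,
    using rollouts collected under that z. First-order sketch: the full
    MAML meta-update would differentiate through these inner steps."""
    v_z = copy.deepcopy(meta_v)                  # start from meta-weights
    opt = torch.optim.SGD(v_z.parameters(), lr=inner_lr)
    for _ in range(steps):                       # a few inner steps
        loss = ((v_z(states).squeeze(-1) - returns) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return v_z   # input-specific baseline b(s, z) = V_{theta_z}(s)

# Usage: adapt per sampled z, then use v_z for advantage estimates on
# the remaining trajectories that share the same input sequence.
meta_v = torch.nn.Sequential(torch.nn.Linear(4, 64), torch.nn.ReLU(),
                             torch.nn.Linear(64, 1))
v_z = adapt_meta_baseline(meta_v, torch.randn(256, 4), torch.randn(256))
```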
Experiments
[Plots: learning curves comparing input-dependent and state-dependent baselines. Legend entries: b(s, z) = V(s|z) (MAML); b(s, z) = V(s|z) (10 value networks); b(s) = V(s), ∀z; a heuristic; and each baseline paired with TRPO, A2C, RARL, and MPO.]
Input-dependent baselines are applicable to many policy gradient methods, such as A2C, TRPO, and PPO, and they are complementary to robust adversarial RL methods such as RARL (Pinto et al., 2017) and to meta-policy optimization such as MPO (Clavera et al., 2018).
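Schematically, plugging the baseline into any of these methods only changes the advantage computation (the function name and signature below are assumed for illustration):

```python
import torch

def pg_loss(logp, returns, states, z_future, baseline):
    """Generic policy-gradient surrogate: pass a state-only critic for
    b(s), or an input-dependent one (e.g. LSTMBaseline above) for
    b(s, z) = V(s|z). The gradient stays unbiased either way; only
    its variance changes."""
    with torch.no_grad():                 # baseline is not differentiated
        b = baseline(states, z_future)    # b(s, z)
    advantage = returns - b               # lower-variance advantages
    return -(logp * advantage).mean()
```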