Variance Reduction for Reinforcement Learning in Input-Driven Environments

Hongzi Mao, Shaileshh Bojja Venkatakrishnan, Malte Schwarzkopf, Mohammad Alizadeh
MIT Computer Science and Artificial Intelligence Laboratory
Poster: people.csail.mit.edu/hongzi/var-website/content/poster.pdf


Input-Dependent Baselines

State-dependent baseline: b(s_t) = V(s_t), the same for every realization of the input process z_{t:∞}
Input-dependent baseline: b(s_t, z_{t:∞}) = V(s_t | z_{t:∞}), where z_{t:∞} = {z_t, z_{t+1}, …} is the realization of the environment's exogenous input process

Input-dependent baselines may therefore depend on the entire future input sequence during training. Because the input process does not depend on the agent's actions, they remain bias-free for policy gradients.
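The bias-free property follows from the standard baseline argument; the derivation below is a sketch in conventional notation (not reproduced from the poster), using only the fact that b(s_t, z_{t:∞}) does not depend on the action a_t:

\[
\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t, z_{t:\infty})\big]
  = b(s_t, z_{t:\infty}) \, \nabla_\theta \sum_{a_t} \pi_\theta(a_t \mid s_t)
  = b(s_t, z_{t:\infty}) \, \nabla_\theta 1 = 0 .
\]

Since the input process is exogenous, this holds conditioned on each realization z ~ P(z), and hence also after averaging over inputs, so subtracting the input-dependent baseline leaves the policy gradient unbiased.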

Motivation Example

[Figure: load-balancing example. Jobs of varying size arrive over time and a load balancer assigns each job to Server 1 or Server 2; two different job-arrival sequences (Input 1 and Input 2) are shown. Accompanying panels compare policy variance, reward, and a policy visualization for the input-dependent versus the state-dependent baseline on the load-balancing example.]


Input-Driven Processes

Input-driven environments are environments with exogenous, stochastic input processes that affect the dynamics; examples of such input processes include workload, network conditions, wind, buoyancy, and a moving target.

[Figure: graphical models of (a) a standard MDP, (b) an input-driven MDP, and (c) an input-driven POMDP, showing states s_{t-1}, s_t, s_{t+1}, actions a_{t-1}, a_t, a_{t+1}, and, in the input-driven cases, the exogenous inputs z_{t-1}, z_t, z_{t+1}.]

Since the reward is partially dictated by the input process, the state alone provides only limited information to estimate the average return. Thus, policy gradient methods with standard state-dependent baselines suffer from high variance.
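To make the setting concrete, here is a minimal, hypothetical sketch of an input-driven environment in the spirit of the load-balancing example above. The class name, two-server dynamics, and reward are illustrative assumptions rather than code from the poster; the point is only that the input trace z (job sizes) is sampled up front, independently of the agent's actions, yet drives the transitions and rewards.

```python
import numpy as np

class LoadBalanceEnv:
    """Toy input-driven environment: a load balancer assigns each arriving
    job (the exogenous input z_t = job size) to one of two servers.
    The input trace z is drawn *before* the episode, independently of actions."""

    def __init__(self, input_trace):
        self.z = np.asarray(input_trace, dtype=float)  # exogenous input process
        self.t = 0
        self.backlog = np.zeros(2)                     # remaining work per server

    def reset(self):
        self.t = 0
        self.backlog[:] = 0.0
        return self._obs()

    def _obs(self):
        # Observed state: current backlogs plus the size of the arriving job.
        return np.concatenate([self.backlog, [self.z[self.t]]])

    def step(self, action):
        # Dynamics depend on both the action and the exogenous input z_t.
        self.backlog[action] += self.z[self.t]             # enqueue job on chosen server
        self.backlog = np.maximum(self.backlog - 1.0, 0.0) # each server processes 1 unit/step
        reward = -self.backlog.sum()                        # penalize outstanding work
        self.t += 1
        done = self.t >= len(self.z)
        return (None if done else self._obs()), reward, done

# One episode under a random policy, with the input trace drawn up front.
rng = np.random.default_rng(0)
env = LoadBalanceEnv(input_trace=rng.exponential(scale=2.0, size=100))
obs, done, total = env.reset(), False, 0.0
while not done:
    obs, r, done = env.step(rng.integers(2))
    total += r
```

Running the same policy on different traces (Input 1 vs. Input 2) yields very different returns; this is exactly the return variability that a state-only baseline cannot explain away.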


Implementations of input-dependent baselines (illustrative code sketches follow below):

1. Multi-value-network: sample z_i from a fixed set {z_1, z_2, …, z_N} and use the corresponding value network V_θi(s) for the policy gradient.
2. Meta-baseline: sample z ~ P(z) and trajectories, then use MAML to adapt a meta baseline V_θ(s) into a per-input baseline V_θz(s) for the policy gradient.
3. LSTM: sample z ~ P(z) and use an LSTM to compute V_θ(s, z) for the policy gradient.

[Diagrams of the three implementations: separate value networks V_θ1(s), V_θ2(s), V_θ3(s) for inputs z_1, z_2, z_3; a meta baseline V_θ(s) adapted by MAML into a per-input baseline V_θz(s); and an LSTM mapping the input z ~ P(z) to the baseline V_θ(s, z). One variant is annotated "(Not efficient)".]
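A minimal sketch of the meta-baseline variant, assuming PyTorch; for brevity it uses a first-order (Reptile-style) outer update rather than the full MAML objective, and all shapes, learning rates, and helper names (e.g. adapt_to_input) are illustrative assumptions. The flow is: clone the meta baseline V_θ(s), adapt it on trajectories that share the sampled input z, and use the adapted V_θz(s) to compute advantages.

```python
import copy
import torch
import torch.nn as nn

def value_net(state_dim, hidden=64):
    # Meta baseline V_theta(s): a plain state-dependent value network whose
    # parameters get adapted to each sampled input sequence z.
    return nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

meta_baseline = value_net(state_dim=3)

def adapt_to_input(meta_net, states, returns, inner_lr=0.1, steps=1):
    """Clone the meta baseline and take a few inner-loop gradient steps on
    trajectories that share one input realization z, yielding V_{theta_z}."""
    net = copy.deepcopy(meta_net)
    opt = torch.optim.SGD(net.parameters(), lr=inner_lr)
    for _ in range(steps):
        loss = ((net(states).squeeze(-1) - returns) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return net

# Trajectories collected under one sampled input z (random placeholder data).
states, returns = torch.randn(32, 3), torch.randn(32)
adapted = adapt_to_input(meta_baseline, states, returns)
advantage = (returns - adapted(states).squeeze(-1)).detach()

# First-order meta-update: nudge the meta parameters toward the adapted ones
# (a Reptile-style simplification of the MAML outer loop).
with torch.no_grad():
    for p_meta, p_adapt in zip(meta_baseline.parameters(), adapted.parameters()):
        p_meta += 1e-2 * (p_adapt - p_meta)
```

For the multi-value-network variant, one would instead keep a separate value network per input trace z_i and select it by whichever trace was sampled for the episode, with no adaptation step.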

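A minimal sketch of the LSTM variant, also assuming PyTorch with illustrative sizes: an LSTM encodes the input sequence that the baseline conditions on, the encoding is concatenated with the state, and the value head is fit by regression on empirical returns; the resulting (return āˆ’ baseline) term is the input-dependent advantage used in the policy-gradient update.

```python
import torch
import torch.nn as nn

class LSTMBaseline(nn.Module):
    """Sketch of an input-dependent baseline V_theta(s, z): an LSTM encodes the
    input sequence z the baseline conditions on, and the encoding is concatenated
    with the state before the value head. Sizes and architecture are illustrative."""

    def __init__(self, state_dim, input_dim, hidden=64):
        super().__init__()
        self.encoder = nn.LSTM(input_dim, hidden, batch_first=True)
        self.value_head = nn.Sequential(
            nn.Linear(state_dim + hidden, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def forward(self, state, z_seq):
        # z_seq: (batch, T, input_dim) -- the input sequence the baseline conditions on.
        _, (h, _) = self.encoder(z_seq)          # final hidden state summarizes z
        return self.value_head(torch.cat([state, h[-1]], dim=-1)).squeeze(-1)

# Training step: regress V_theta(s, z) onto empirical returns, then use
# (return - baseline) as the advantage in the policy-gradient update.
baseline = LSTMBaseline(state_dim=3, input_dim=1)
opt = torch.optim.Adam(baseline.parameters(), lr=1e-3)

states  = torch.randn(32, 3)        # states s_t from sampled trajectories
z_seq   = torch.randn(32, 20, 1)    # corresponding input sequences z
returns = torch.randn(32)           # empirical discounted returns

value = baseline(states, z_seq)
loss = ((value - returns) ** 2).mean()
opt.zero_grad(); loss.backward(); opt.step()

advantage = (returns - value).detach()   # input-dependent advantage estimate
```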

Experiments

Input-dependent baselines are applicable to many policy gradient methods, such as A2C, TRPO, and PPO, and they are complementary and orthogonal to robust adversarial RL methods such as RARL (Pinto et al., 2017) and to meta-policy optimization methods such as MPO (Clavera et al., 2018).

[Plots: TRPO and A2C with the input-dependent baseline b(s, z) = V(s|z), implemented via MAML and via 10 separate value networks, compared against the state-dependent baseline b(s) = V(s) (the same for all z) and a heuristic. Further panels pair robust adversarial RL (TRPO vs. RARL) and meta-policy optimization (TRPO vs. MPO) with b(s, z) = V(s|z) and with b(s) = V(s) for all z.]