
Page 1:

Trust Region Policy Optimization
John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel

@ICML 2015

Presenter: Shivam Kalra

[email protected]

CS 885 (Reinforcement Learning)

Prof. Pascal Poupart

June 20th 2018

Page 2:

Reinforcement Learning

[Figure: taxonomy of RL methods: action-value methods (Q-Learning) and policy-gradient methods (TRPO, Actor Critic, PPO, A3C, ACKTR).]

Ref: https://www.youtube.com/watch?v=CKaN5PgkSBc

Page 3:

Policy Gradient

For i = 1, 2, …
  Collect N trajectories under policy π_θ
  Estimate the advantage function A
  Compute the policy gradient g
  Update the policy parameters: θ = θ_old + α g
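For concreteness, a minimal sketch of one such update in PyTorch (not code from the slides; `policy`, `states`, `actions`, and `advantages` are assumed to come from the N collected trajectories):

```python
import torch

def policy_gradient_step(policy, optimizer, states, actions, advantages):
    # Score-function gradient: maximize E[ log pi_theta(a|s) * A ].
    logits = policy(states)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    loss = -(log_probs * advantages).mean()    # minimize the negative objective
    optimizer.zero_grad()
    loss.backward()                            # computes g
    optimizer.step()                           # theta = theta_old + alpha * g (plain SGD)
```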

Page 4:

Problems of Policy Gradient

For i = 1, 2, …
  Collect N trajectories under policy π_θ
  Estimate the advantage function A
  Compute the policy gradient g
  Update the policy parameters: θ = θ_old + α g

Non-stationary input data: the data distribution changes as the policy (and hence the reward distribution) changes.

Page 5:

Problems of Policy Gradient

For i = 1, 2, …
  Collect N trajectories under policy π_θ
  Estimate the advantage function A
  Compute the policy gradient g
  Update the policy parameters: θ = θ_old + α g

The advantage estimates are very noisy (essentially random) early in training.

Page 6:

Problems of Policy Gradient

For i = 1, 2, …
  Collect N trajectories under policy π_θ
  Estimate the advantage function A
  Compute the policy gradient g
  Update the policy parameters: θ = θ_old + α g

We need a more carefully crafted policy update.

We want improvement, not degradation.

Idea: update the old policy π_old to a new policy π such that the two stay a "trusted" distance apart. Such a conservative policy update yields improvement instead of degradation.

Page 7:

RL to Optimization

• Most of ML is optimization
• Supervised learning reduces a training loss

• RL: what is the policy gradient optimizing?
• It favors the (s, a) pairs that had higher advantage A.

• Can we write down an optimization problem that performs a small update on a policy π based on data sampled from π (on-policy data)?

Ref: https://www.youtube.com/watch?v=xvRrgxcpaHY (6:40)

Page 8:

What loss to optimize?

• Optimize η(π), i.e., the expected return of a policy π.

• We collect data with π_old and optimize the objective to get a new policy π.

πœ‚ πœ‹ = 𝔼𝑠0~𝜌0,π‘Žπ‘‘~πœ‹ . 𝑠𝑑

𝑑=0

∞

π›Ύπ‘‘π‘Ÿπ‘‘

Page 9:

What loss to optimize?

• We can express η(π) in terms of the advantage over the original policy [1]:

πœ‚ πœ‹ = πœ‚ πœ‹π‘œπ‘™π‘‘ + π”Όπœ~πœ‹

𝑑=0

∞

π›Ύπ‘‘π΄πœ‹π‘œπ‘™π‘‘(𝑠𝑑 , π‘Žπ‘‘)

[1] Kakade, Sham, and John Langford. "Approximately optimal approximate reinforcement learning." ICML. Vol. 2. 2002.

(Here η(π) is the expected return of the new policy, η(π_old) is the expected return of the old policy, and the expectation is over trajectories sampled from the new policy.)
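The identity can be checked with a short telescoping argument (a sketch; V_{π_old} is the value function of the old policy, and the start-state distribution ρ_0 does not depend on the policy, so E[V_{π_old}(s_0)] = η(π_old)):

```latex
\begin{aligned}
\mathbb{E}_{\tau\sim\pi}\!\Big[\sum_{t=0}^{\infty}\gamma^{t}A_{\pi_{old}}(s_t,a_t)\Big]
&= \mathbb{E}_{\tau\sim\pi}\!\Big[\sum_{t=0}^{\infty}\gamma^{t}\big(r_t+\gamma V_{\pi_{old}}(s_{t+1})-V_{\pi_{old}}(s_t)\big)\Big] \\
&= \mathbb{E}_{\tau\sim\pi}\!\Big[-V_{\pi_{old}}(s_0)+\sum_{t=0}^{\infty}\gamma^{t}r_t\Big]
 = -\eta(\pi_{old}) + \eta(\pi).
\end{aligned}
```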

Page 10:

What loss to optimize?

• The previous equation can be rewritten as [1]:

η(π) = η(π_old) + Σ_s ρ_π(s) Σ_a π(a|s) A_{π_old}(s, a)

[1] Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. 2015.

(Here η(π) is the expected return of the new policy, η(π_old) is the expected return of the old policy, and ρ_π is the discounted visitation frequency under the new policy:)

ρ_π(s) = P(s_0 = s) + γ P(s_1 = s) + γ² P(s_2 = s) + ⋯

Page 11:

πœ‚ πœ‹ = πœ‚ πœ‹π‘œπ‘™π‘‘ +

𝑠

πœŒπœ‹(𝑠)

π‘Ž

πœ‹ π‘Ž 𝑠 π΄πœ‹(𝑠, π‘Ž)

What loss to optimize?

New Expected Return

Old Expected Return

β‰₯ 𝟎

Page 12:

πœ‚ πœ‹ = πœ‚ πœ‹π‘œπ‘™π‘‘ +

𝑠

πœŒπœ‹(𝑠)

π‘Ž

πœ‹ π‘Ž 𝑠 π΄πœ‹(𝑠, π‘Ž)

New Expected Return Old Expected Return>

What loss to optimize?

β‰₯ 𝟎

Guaranteed Improvement from πœ‹π‘œπ‘™π‘‘ β†’ πœ‹

Page 13:

πœ‚ πœ‹ = πœ‚ πœ‹π‘œπ‘™π‘‘ +

𝑠

πœŒπœ‹(𝑠)

π‘Ž

πœ‹ π‘Ž 𝑠 π΄πœ‹(𝑠, π‘Ž)

New State Visitation is Difficult

State visitation based on new policy

New policy

β€œComplex dependency of πœŒπœ‹(𝑠) onπœ‹ makes the equation difficult tooptimize directly.” [1]

[1] Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. 2015.

Page 14:

πœ‚ πœ‹ = πœ‚ πœ‹π‘œπ‘™π‘‘ +

𝑠

πœŒπœ‹(𝑠)

π‘Ž

πœ‹ π‘Ž 𝑠 π΄πœ‹(𝑠, π‘Ž)

New State Visitation is Difficult

[1] Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. 2015.

𝐿 πœ‹ = πœ‚ πœ‹π‘œπ‘™π‘‘ +

𝑠

πœŒπœ‹(𝑠)

π‘Ž

πœ‹ π‘Ž 𝑠 π΄πœ‹(𝑠, π‘Ž)

Local approximation of 𝜼(𝝅)

Page 15:

Local approximation of η(π)

[1] Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. 2015.

𝐿 πœ‹ = πœ‚ πœ‹π‘œπ‘™π‘‘ +

𝑠

πœŒπœ‹(𝑠)

π‘Ž

πœ‹ π‘Ž 𝑠 π΄πœ‹π‘œπ‘™π‘‘(𝑠, π‘Ž)

The approximation is accurate within a step size δ (the trust region).

Monotonic improvement is guaranteed.

[Figure: a trust region around θ; for θ′ inside it, π_θ′(a|s) does not change dramatically.]

Page 16:

Local approximation of η(π)

• The following bound holds:

  η(π) ≥ L(π) − C · D_KL^max(π_old, π),  where C = 4εγ / (1 − γ)²

• Monotonically improving policies can be generated by:

  π_new = argmax_π [ L(π) − C · D_KL^max(π_old, π) ]

Page 17:

Minorization Maximization (MM) algorithm

[Figure: the actual objective η(π) and the surrogate function L(π) − C · D_KL^max(π_old, π), which minorizes η(π) and touches it at π_old.]
Page 18:

Optimization of Parameterized Policies

• Now the policies are parameterized: π_θ(a|s) with parameters θ.

• Accordingly, the surrogate objective becomes:

  argmax_θ [ L(θ) − C · D_KL^max(θ_old, θ) ]

Page 19:

Optimization of Parameterized Policies

argmax_θ [ L(θ) − C · D_KL^max(θ_old, θ) ]

In practice, the penalty coefficient C results in very small step sizes.

One way to take larger steps is to constrain the KL divergence between the new policy and the old policy, i.e., use a trust region constraint:

maximize_θ  L_{θ_old}(θ)
subject to  D_KL^max(θ_old, θ) ≤ δ

Page 20:

Solving KL-Penalized Problem

• maximize_θ  L(θ) − C · D_KL^max(θ_old, θ)

• Use the mean KL divergence instead of the max:
  maximize_θ  L(θ) − C · D_KL(θ_old, θ)

• Make a linear approximation to L and a quadratic approximation to the KL term:

  maximize_θ  g · (θ − θ_old) − (C/2) (θ − θ_old)ᵀ F (θ − θ_old)

  where g = ∂L(θ)/∂θ |_{θ=θ_old}  and  F = ∂²D_KL(θ_old, θ)/∂θ² |_{θ=θ_old}

Page 21:

Solving KL-Penalized Problem

• Make a linear approximation to L and a quadratic approximation to the KL term:

  maximize_θ  g · (θ − θ_old) − (C/2) (θ − θ_old)ᵀ F (θ − θ_old)

  where g = ∂L(θ)/∂θ |_{θ=θ_old}  and  F = ∂²D_KL(θ_old, θ)/∂θ² |_{θ=θ_old}

• Solution: θ − θ_old = (1/C) F⁻¹ g. We do not want to form the full Hessian matrix F.

• We can compute F⁻¹ g approximately using the conjugate gradient algorithm without forming F explicitly.

Page 22:

Conjugate Gradient (CG)

• The conjugate gradient (CG) algorithm approximately solves x = A⁻¹ b without explicitly forming the matrix A; it only needs matrix–vector products A·v.

• After k iterations, CG has minimized ½ xᵀAx − bᵀx over a k-dimensional (Krylov) subspace.

Page 23:

TRPO: KL-Constrained

• Unconstrained problem: maximize_θ  L(θ) − C · D_KL(θ_old, θ)

• Constrained problem: maximize_θ  L(θ)  subject to  D_KL(θ_old, θ) ≤ δ

• δ is a hyperparameter that stays fixed over the whole learning process.

• Solve the constrained quadratic problem: compute F⁻¹ g and then rescale the step to satisfy the KL constraint:

  maximize_θ  g · (θ − θ_old)  subject to  ½ (θ − θ_old)ᵀ F (θ − θ_old) ≤ δ

• Lagrangian: ℒ(θ, λ) = g · (θ − θ_old) − (λ/2) [ (θ − θ_old)ᵀ F (θ − θ_old) − δ ]

• Differentiate with respect to θ to get θ − θ_old = (1/λ) F⁻¹ g.

• We want the step s to satisfy ½ sᵀ F s = δ.

• Given a candidate step s_unscaled, rescale it to s = sqrt( 2δ / (s_unscaledᵀ F s_unscaled) ) · s_unscaled.

Page 24:

TRPO Algorithm

For i = 1, 2, …
  Collect N trajectories under policy π_θ
  Estimate the advantage function A
  Compute the policy gradient g
  Use CG to compute F⁻¹ g
  Compute the rescaled step s = α F⁻¹ g using the KL rescaling and a line search
  Apply the update: θ = θ_old + s

maximize_θ  L(θ)  subject to  D_KL(θ_old, θ) ≤ δ

Page 25:

Questions?