Reinforcement Learning Lecture: Function Approximation · 2013. 11. 12.
Reinforcement Learning
Function Approximation
Continuous state/action space, mean-squared error, gradient temporal difference learning, least-squares temporal difference, least-squares policy iteration
Vien Ngo, Marc Toussaint
University of Stuttgart
Outline
• Function Approximation
– Gradient Descent Methods.
– Least-Squares Temporal Difference.
Value Iteration in Continuous MDP
V(s) = sup_a [ r(s, a) + γ ∫ P(s′ | s, a) V(s′) ds′ ]
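The backup above can be sketched numerically by discretizing the continuous state space into grid cells, so the integral becomes a sum. This is an illustrative sketch, not from the slides; the transition tensors `P`, reward vectors `r`, and grid are assumptions.

```python
import numpy as np

def value_iteration(P, r, gamma=0.95, iters=200):
    """P[a][i, j]: probability of moving from cell i to cell j under action a.
    r[a][i]: immediate reward for taking action a in cell i."""
    n = P[0].shape[0]
    V = np.zeros(n)
    for _ in range(iters):
        # Bellman backup: V(s) = max_a [ r(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]
        V = np.max([r[a] + gamma * P[a] @ V for a in range(len(P))], axis=0)
    return V
```

The finer the grid, the better the sum approximates the integral, at the price of exponentially many cells in the state dimension.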
Continuous state/actions in model-free RL
• All of this is fine in small finite state & action spaces: Q(s, a) is a |S| × |A| matrix of numbers; π(a|s) is a |S| × |A| matrix of numbers.
• In the following: two examples for handling continuous states/actions:
– use function approximation to estimate Q(s, a): gradient descent (TD with FA), LSPI.
– optimize a parameterized π(a|s) (policy search, next lecture).
Value Function Approximation
(from Satinder Singh, RL: A tutorial at videolectures.net)
• Estimate of the value function:
Vt(s) = V(s, βt)
Performance Measure
• Minimize the mean-squared error (MSE) over some distribution P of the states:
MSE(βt) = Σ_{s∈S} P(s) [V^π(s) − Vt(s)]²
where V^π(s) is the true value function of the policy π.
• Set P to the stationary distribution of the policy π in on-policy learning methods (e.g. SARSA).
Value Function Approximation
• The estimated value function:
V(s, βt) = βtᵀ φ(s)
where β ∈ ℝ^d is a vector of parameters and φ : S → ℝ^d is a mapping from states to d-dimensional feature vectors.
– Examples: polynomial, RBF, Fourier, wavelet bases, tile coding (these suffer from the curse of dimensionality).
• Nonparametric methods: k-nearest neighbor, nonparametric kernel smoothing, spline smoothers, Gaussian process regression, ...
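A minimal sketch of one such basis: Gaussian RBF features φ : S → ℝ^d for a scalar state. The centers, width, and weight vector below are illustrative choices, not from the slides.

```python
import numpy as np

def rbf_features(s, centers, width=0.5):
    """Map state s to d Gaussian bump activations (one per center)."""
    return np.exp(-((s - centers) ** 2) / (2 * width ** 2))

centers = np.linspace(0.0, 1.0, 10)   # d = 10 basis centers on [0, 1]
phi = rbf_features(0.3, centers)      # feature vector for state s = 0.3
v_hat = np.ones(10) @ phi             # linear value estimate beta^T phi(s)
```

For multi-dimensional states one typically takes a grid of centers per dimension, which is exactly where the curse of dimensionality bites.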
Value Function Approximation
(10^35 states, 10^5 binary features and parameters.)
(Sutton, presentation at ICML 2009)
TD(λ) with Function Approximation
• The gradient at any point βt:
∇MSE(βt) = −2 Σ_{s∈S} P(s) [V^π(s) − V(s, βt)] ∇V(s, βt)
= −2 Σ_{s∈S} P(s) [V^π(s) − V(s, βt)] φ(s)
• Applying stochastic approximation and bootstrapping, we can iteratively update the parameters (TD(0) with function approximation):
βt+1 = βt + αt [rt + γ V(s′, βt) − V(s, βt)] φ(s)
• TD(λ) (with eligibility traces):
et+1 = γλ et + φ(s)
βt+1 = βt + αt [rt + γ V(s′, βt) − V(s, βt)] et+1
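A single TD(0) step with linear features can be sketched as below; the features, reward, and step size in the usage are illustrative assumptions.

```python
import numpy as np

def td0_update(beta, phi_s, phi_s_next, r, alpha=0.1, gamma=0.9):
    """One step: beta <- beta + alpha * delta * phi(s), delta the TD error."""
    delta = r + gamma * beta @ phi_s_next - beta @ phi_s  # TD error
    return beta + alpha * delta * phi_s
```

Repeatedly applying this update along sampled transitions drives β toward the TD fixed point (for a terminal successor, pass a zero feature vector).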
TD(λ) with Function Approximation (Gradient-descent SARSA(λ))
Repeat (for each episode):
• e = 0
• initial state s = s0
• Repeat:
a = π(s)
Take a, observe r, s′
e ← γλ e + φ(s, a)
β ← β + α [r + γ Q(s′, π(s′), β) − Q(s, a, β)] e
s ← s′
• until s is terminal.
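A runnable sketch of this episode loop. `env_step`, `policy`, and `phi` are hypothetical stand-ins for the environment, the policy π, and the feature map φ(s, a); the step sizes are illustrative.

```python
import numpy as np

def sarsa_lambda_episode(beta, env_step, policy, phi, s0,
                         alpha=0.1, gamma=0.9, lam=0.8, max_steps=100):
    e = np.zeros_like(beta)                  # eligibility trace, e = 0
    s = s0
    for _ in range(max_steps):
        a = policy(s)
        r, s_next, done = env_step(s, a)     # take a, observe r, s'
        e = gamma * lam * e + phi(s, a)
        q_next = 0.0 if done else beta @ phi(s_next, policy(s_next))
        delta = r + gamma * q_next - beta @ phi(s, a)
        beta = beta + alpha * delta * e
        if done:
            break
        s = s_next
    return beta
```

Calling this once per episode and reusing the returned β implements the outer "Repeat (for each episode)" loop.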
TD(λ) with Function Approximation
• Convergence proof: if the stochastic process st is an ergodic Markov process whose stationary distribution is the same as the stationary distribution of the underlying MDP (i.e. the on-policy distribution), then TD(λ) with linear function approximation converges.
• The convergence property:
MSE(β∞) ≤ (1 − γλ)/(1 − γ) · MSE(β∗)
(Tsitsiklis & Van Roy: An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 1997)
• Is there a convergence guarantee for off-policy methods (e.g. Q-learning with linear function approximation)?
Gradient temporal difference learning
• GTD (gradient temporal difference learning)
• GTD2 (gradient temporal difference learning, version 2)
• TDC (temporal difference learning with corrections)

1. Sutton, Szepesvári and Maei: A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. NIPS 2008.
2. Sutton, Maei, Precup, Bhatnagar, Silver, Szepesvári, Wiewiora: Fast gradient-descent methods for temporal-difference learning with linear function approximation. ICML 2009.
Value function geometry
• Bellman operator:
TV = R + γPV
(Figure: the projection onto the space spanned by the feature vectors.)
RMSBE: residual mean-squared Bellman error
RMSPBE: residual mean-squared projected Bellman error
TD performance measure
• Error from the true value: ||Vβ − V∗||
• Error in the Bellman update (used in the previous section: gradient-descent methods): ||Vβ − TVβ||
• Error in the Bellman update after projection: ||Vβ − ΠTVβ||
TD performance measure
• GTD: the norm of the expected TD update
NEU(β) = E(δφ)ᵀ E(δφ)
• GTD2 and TDC: the norm of the expected TD update weighted by the inverse covariance matrix of the features
MSPBE(β) = E(δφ)ᵀ E(φφᵀ)⁻¹ E(δφ)
(δ is the TD error.)
(GTD2 and TDC differ slightly in their derivation of the approximate gradient direction.)
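A sketch of a single TDC update (the "TD with gradient correction" rules of Sutton et al., ICML 2009), which maintains auxiliary weights w alongside β. The step sizes and features in the usage below are illustrative assumptions.

```python
import numpy as np

def tdc_update(beta, w, phi_s, phi_s_next, r,
               alpha=0.05, alpha_w=0.1, gamma=0.9):
    delta = r + gamma * beta @ phi_s_next - beta @ phi_s      # TD error
    beta = beta + alpha * (delta * phi_s
                           - gamma * (w @ phi_s) * phi_s_next)
    w = w + alpha_w * (delta - w @ phi_s) * phi_s             # LMS step on w
    return beta, w
```

The correction term −γ(wᵀφ)φ′ is what removes the bias of plain TD under off-policy sampling while keeping the per-step cost O(n).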
• These TD algorithms with linear function approximation are guaranteed to converge under both general on-policy and off-policy training.
• The computational complexity is only O(n) (n is the number of features).
• The curse of dimensionality is removed.
LSPI: Least Squares Policy Iteration
• Gradient-descent methods are sensitive to the choice of learning rates and initial parameter values.
• Least-squares temporal difference (LSTD) methods: LSPI.
– Bellman residual minimization
– least-squares fixed-point approximation
Bellman residual minimization
• The Q-function of a given policy π fulfills, for any s, a:
Qπ(s, a) = R(s, a) + γ Σ_{s′} P(s′ | s, a) Qπ(s′, π(s′))
• If we have n data points D = {(si, ai, ri, s′i)}_{i=1,...,n}, we require that this equation holds (approximately) for these n data points:
∀i : Qπ(si, ai) ≈ ri + γ Qπ(s′i, π(s′i))
• Written in vector notation: Q = R + γQ̄, with n-dimensional data vectors Q, R, Q̄.
• Written as an optimization: minimize the Bellman residual error
L(Qπ) = ||R + γPΠQπ − Qπ||² (true residual)
≈ Σ_{i=1}^{n} [Qπ(si, ai) − ri − γQπ(s′i, π(s′i))]² = ||Q − R − γQ̄||²
Bellman residual minimization
• The true fixed point of Bellman residual minimization (an overconstrained system, solved in the least-squares sense):
βπ = ((Φ − γPΠΦ)ᵀ (Φ − γPΠΦ))⁻¹ (Φ − γPΠΦ)ᵀ r
• The solution βπ of the system is unique since the columns of Φ (the basis functions) are linearly independent by definition.
(See Lagoudakis & Parr (JMLR 2003) for details.)
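On a batch of samples this is ordinary least squares. A sketch, where `Phi`, `Phi_next`, `R` are assumed design matrices with rows φ(si, ai), φ(s′i, π(s′i)) and entries ri:

```python
import numpy as np

def brm_weights(Phi, Phi_next, R, gamma=0.9):
    A = Phi - gamma * Phi_next
    # beta = (A^T A)^{-1} A^T R, the least-squares solution of A beta = R
    return np.linalg.solve(A.T @ A, A.T @ R)
```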
LSPI: Least-Squares Fixed-Point Approximation
• Project TπQ back onto span(Φ):
T̂π(Q) = Φ(ΦᵀΦ)⁻¹Φᵀ(TπQ)
• The approximate fixed point:
βπ = (Φᵀ(Φ − γPΠΦ))⁻¹ Φᵀ r
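The fixed-point solution admits the same batch sketch, with `Phi`, `Phi_next`, `R` again assumed design matrices built from samples:

```python
import numpy as np

def lsfp_weights(Phi, Phi_next, R, gamma=0.9):
    # beta = (Phi^T (Phi - gamma * Phi_next))^{-1} Phi^T R
    A = Phi.T @ (Phi - gamma * Phi_next)
    return np.linalg.solve(A, Phi.T @ R)
```

Note the only difference from Bellman residual minimization is which matrix multiplies from the left: Φᵀ here, (Φ − γPΠΦ)ᵀ there.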
LSPI: Comparison of the two views
• The Bellman residual minimization method focuses on the magnitude of the change.
• The least-squares fixed-point approximation focuses on the direction of the change.
• The least-squares fixed-point approximation is less stable and less predictable.
• Nevertheless, the least-squares fixed-point method might be preferable, because:
– learning the Bellman residual minimizing approximation requires doubled samples;
– experimentally, it often delivers superior policies.
(See Lagoudakis & Parr (JMLR 2003) for details.)
LSPI: LSTDQ algorithm
A = Φᵀ(Φ − γPΠΦ), b = Φᵀr
• Initialize A ← 0, b ← 0
• For each (s, a, r, s′) ∈ D:
A ← A + φ(s, a) (φ(s, a) − γφ(s′, π(s′)))ᵀ
b ← b + φ(s, a) r
• β ← A⁻¹b
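A runnable sketch of this loop; `phi` and `pi` are hypothetical stand-ins for the feature map and the policy being evaluated, and the small ridge term on A is an added assumption for numerical invertibility (not part of the algorithm above).

```python
import numpy as np

def lstdq(D, phi, pi, d, gamma=0.9, reg=1e-6):
    A = reg * np.eye(d)
    b = np.zeros(d)
    for s, a, r, s_next in D:
        f = phi(s, a)
        f_next = phi(s_next, pi(s_next))
        A += np.outer(f, f - gamma * f_next)    # A <- A + phi (phi - g phi')^T
        b += f * r                              # b <- b + phi r
    return np.linalg.solve(A, b)                # beta = A^{-1} b
```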
LSPI algorithm
given D:
• repeat:
π ← π′
βπ ← LSTDQ(π); π′ ← the greedy policy w.r.t. Q(·, ·; βπ)
• until π′ ≈ π
• return π
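The outer loop can be sketched as follows. `lstdq_solver` is a hypothetical callable with signature `lstdq_solver(D, phi, pi, d, gamma) -> beta`; the greedy improvement step maximizes βᵀφ(s, a) over a finite action set (all names are illustrative).

```python
import numpy as np

def lspi(D, phi, actions, d, lstdq_solver, gamma=0.9, iters=10):
    beta = np.zeros(d)
    for _ in range(iters):
        # greedy policy w.r.t. the current Q(s, a) = beta^T phi(s, a)
        pi = lambda s, b=beta: max(actions, key=lambda a: b @ phi(s, a))
        beta = lstdq_solver(D, phi, pi, d, gamma)   # policy evaluation
    return beta
```

The default argument `b=beta` pins the current weights inside the lambda so the policy does not shift while the solver runs.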
LSPI: Riding a bike
(from Alma A. M. Rahat’s simulation)
• States: {θ, θ̇, ω, ω̇, ω̈, ψ}, where θ is the angle of the handlebar, ω is the vertical angle of the bicycle, and ψ is the angle of the bicycle to the goal.
• Actions: {τ, ν}. τ ∈ {−2, 0, 2} is the torque applied to the handlebar; ν ∈ {−0.02, 0, 0.02} is the displacement of the rider.
• For each a, the value function Q(s, a) uses 20 features
(1, ω, ω̇, ω², ωω̇, θ, θ̇, θ², θ̇², θθ̇, ωθ, ωθ², ω²θ, ψ, ψ², ψθ, ψ̄, ψ̄², ψ̄θ)
where ψ̄ = sign(ψ) · π − ψ.
LSPI: Riding a bike
from Lagoudakis & Parr (JMLR 2003)
LSPI: Riding a bike
• Training samples were collected in advance by initializing the bicycle to a small random perturbation from the initial position (0, 0, 0, 0, 0, π/2) and running each episode for up to 20 steps under a purely random policy.
• Each successful ride must complete a distance of 2 kilometers.
• This experiment was repeated 100 times.
(from Lagoudakis & Parr (JMLR 2003))
Feature Selection/Building Problems
• Feature selection.
• Online/incremental feature learning.
Wu and Givan (2005); Keller et al. (2006); Mahadevan et al. (2006); Parr et al. (2007); Kolter and Ng (2009); Mahadevan and Liu (2010); Boots and Gordon (2010); Sun et al. (2011); etc.