Reinforcement Learning Lecture: Function Approximation · 2013. 11. 12.
Reinforcement Learning
Function Approximation
Continuous state/action space, mean-squared error, gradient temporal difference learning, least-squares temporal difference, least-squares policy iteration
Vien Ngo, Marc Toussaint
University of Stuttgart
Outline
• Function Approximation
– Gradient Descent Methods.
– Least-Squares Temporal Difference.
Value Iteration in Continuous MDP
V(s) = sup_a [ r(s, a) + γ ∫ P(s′ | s, a) V(s′) ds′ ]
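The backup above can be sketched numerically by discretizing the continuous state space into grid cells, so the integral becomes a sum. This is an illustrative sketch, not from the slides; the transition tensors `P`, reward vectors `r`, and grid are assumptions.

```python
import numpy as np

def value_iteration(P, r, gamma=0.95, iters=200):
    """P[a][i, j]: probability of moving from cell i to cell j under action a.
    r[a][i]: immediate reward for taking action a in cell i."""
    n = P[0].shape[0]
    V = np.zeros(n)
    for _ in range(iters):
        # Bellman backup: V(s) = max_a [ r(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]
        V = np.max([r[a] + gamma * P[a] @ V for a in range(len(P))], axis=0)
    return V
```

The finer the grid, the better the sum approximates the integral, at the price of exponentially many cells in the state dimension.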
Continuous state/actions in model-free RL
• All of this is fine in small finite state & action spaces: Q(s, a) is a |S| × |A| matrix of numbers; π(a|s) is a |S| × |A| matrix of numbers.
• In the following: two examples for handling continuous states/actions:
– use function approximation to estimate Q(s, a): gradient descent (TD with FA), LSPI.
– optimize a parameterized π(a|s) (policy search, next lecture).
Value Function Approximation
(from Satinder Singh, RL: A tutorial at videolectures.net)
• Estimate of the value function:
Vt(s) = V(s, βt)
Performance Measure
• Minimize the mean-squared error (MSE) over some distribution P of the states:
MSE(βt) = Σ_{s∈S} P(s) [V^π(s) − Vt(s)]²
where V^π(s) is the true value function of the policy π.
• Set P to the stationary distribution of the policy π in on-policy learning methods (e.g. SARSA).
Value Function Approximation
• The estimated value function:
V(s, βt) = βtᵀ φ(s)
where β ∈ ℝ^d is a vector of parameters and φ : S → ℝ^d is a mapping from states to d-dimensional feature vectors.
– Examples: polynomial, RBF, Fourier, wavelet bases, tile coding (these suffer from the curse of dimensionality).
• Nonparametric methods: k-nearest neighbor, nonparametric kernel smoothing, spline smoothers, Gaussian process regression, ...
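A minimal sketch of one such basis: Gaussian RBF features φ : S → ℝ^d for a scalar state. The centers, width, and weight vector below are illustrative choices, not from the slides.

```python
import numpy as np

def rbf_features(s, centers, width=0.5):
    """Map state s to d Gaussian bump activations (one per center)."""
    return np.exp(-((s - centers) ** 2) / (2 * width ** 2))

centers = np.linspace(0.0, 1.0, 10)   # d = 10 basis centers on [0, 1]
phi = rbf_features(0.3, centers)      # feature vector for state s = 0.3
v_hat = np.ones(10) @ phi             # linear value estimate beta^T phi(s)
```

For multi-dimensional states one typically takes a grid of centers per dimension, which is exactly where the curse of dimensionality bites.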
Value Function Approximation
(10^35 states, 10^5 binary features and parameters.)
(Sutton, presentation at ICML 2009)
TD(λ) with Function Approximation
• The gradient at any point βt:
∇MSE(βt) = −2 Σ_{s∈S} P(s) [V^π(s) − V(s, βt)] ∇V(s, βt)
= −2 Σ_{s∈S} P(s) [V^π(s) − V(s, βt)] φ(s)
• Applying stochastic approximation and bootstrapping, we can iteratively update the parameters (TD(0) with function approximation):
βt+1 = βt + αt [rt + γ V(s′, βt) − V(s, βt)] φ(s)
• TD(λ) (with eligibility traces):
et+1 = γλ et + φ(s)
βt+1 = βt + αt [rt + γ V(s′, βt) − V(s, βt)] et+1
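A single TD(0) step with linear features can be sketched as below; the features, reward, and step size in the usage are illustrative assumptions.

```python
import numpy as np

def td0_update(beta, phi_s, phi_s_next, r, alpha=0.1, gamma=0.9):
    """One step: beta <- beta + alpha * delta * phi(s), delta the TD error."""
    delta = r + gamma * beta @ phi_s_next - beta @ phi_s  # TD error
    return beta + alpha * delta * phi_s
```

Repeatedly applying this update along sampled transitions drives β toward the TD fixed point (for a terminal successor, pass a zero feature vector).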
TD(λ) with Function Approximation (Gradient-descent SARSA(λ))
Repeat (for each episode):
• e = 0
• initial state s = s0
• Repeat:
a = π(s)
Take a, observe r, s′
e ← γλ e + φ(s, a)
β ← β + α [r + γ Q(s′, π(s′), β) − Q(s, a, β)] e
s ← s′
• until s is terminal.
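A runnable sketch of this episode loop. `env_step`, `policy`, and `phi` are hypothetical stand-ins for the environment, the policy π, and the feature map φ(s, a); the step sizes are illustrative.

```python
import numpy as np

def sarsa_lambda_episode(beta, env_step, policy, phi, s0,
                         alpha=0.1, gamma=0.9, lam=0.8, max_steps=100):
    e = np.zeros_like(beta)                  # eligibility trace, e = 0
    s = s0
    for _ in range(max_steps):
        a = policy(s)
        r, s_next, done = env_step(s, a)     # take a, observe r, s'
        e = gamma * lam * e + phi(s, a)
        q_next = 0.0 if done else beta @ phi(s_next, policy(s_next))
        delta = r + gamma * q_next - beta @ phi(s, a)
        beta = beta + alpha * delta * e
        if done:
            break
        s = s_next
    return beta
```

Calling this once per episode and reusing the returned β implements the outer "Repeat (for each episode)" loop.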
TD(λ) with Function Approximation
• Convergence proof: if the stochastic process st is an ergodic Markov process whose stationary distribution is the same as the stationary distribution of the underlying MDP (i.e. the on-policy distribution), then TD(λ) with linear function approximation converges.
• The convergence property:
MSE(β∞) ≤ (1 − γλ)/(1 − γ) · MSE(β∗)
(Tsitsiklis & Van Roy: An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 1997)
• Is there a convergence guarantee for off-policy methods (e.g. Q-learning with linear function approximation)?
Gradient temporal difference learning
• GTD (gradient temporal difference learning)
• GTD2 (gradient temporal difference learning, version 2)
• TDC (temporal difference learning with corrections)

1. Sutton, Szepesvári and Maei: A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. NIPS 2008.
2. Sutton, Maei, Precup, Bhatnagar, Silver, Szepesvári, Wiewiora: Fast gradient-descent methods for temporal-difference learning with linear function approximation. ICML 2009.
Value function geometry
• Bellman operator:
TV = R + γPV
(Figure: the projection onto the space spanned by the feature vectors.)
RMSBE: residual mean-squared Bellman error
RMSPBE: residual mean-squared projected Bellman error
TD performance measure
• Error from the true value: ||Vβ − V∗||
• Error in the Bellman update (used in the previous section: gradient-descent methods): ||Vβ − TVβ||
• Error in the Bellman update after projection: ||Vβ − ΠTVβ||
TD performance measure
• GTD: the norm of the expected TD update
NEU(β) = E(δφ)ᵀ E(δφ)
• GTD2 and TDC: the norm of the expected TD update weighted by the inverse covariance matrix of the features
MSPBE(β) = E(δφ)ᵀ E(φφᵀ)⁻¹ E(δφ)
(δ is the TD error.)
(GTD2 and TDC differ slightly in their derivation of the approximate gradient direction.)
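A sketch of a single TDC update (the "TD with gradient correction" rules of Sutton et al., ICML 2009), which maintains auxiliary weights w alongside β. The step sizes and features in the usage below are illustrative assumptions.

```python
import numpy as np

def tdc_update(beta, w, phi_s, phi_s_next, r,
               alpha=0.05, alpha_w=0.1, gamma=0.9):
    delta = r + gamma * beta @ phi_s_next - beta @ phi_s      # TD error
    beta = beta + alpha * (delta * phi_s
                           - gamma * (w @ phi_s) * phi_s_next)
    w = w + alpha_w * (delta - w @ phi_s) * phi_s             # LMS step on w
    return beta, w
```

The correction term −γ(wᵀφ)φ′ is what removes the bias of plain TD under off-policy sampling while keeping the per-step cost O(n).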
• These TD algorithms with linear function approximation are guaranteed to converge under both general on-policy and off-policy training.
• The computational complexity is only O(n) (n is the number of features).
• The curse of dimensionality is removed.
LSPI: Least Squares Policy Iteration
• Gradient-descent methods are sensitive to the choice of learning rates and initial parameter values.
• Least-squares temporal difference (LSTD) methods: LSPI.
– Bellman residual minimization
– least-squares fixed-point approximation
Bellman residual minimization
• The Q-function of a given policy π fulfills, for any s, a:
Qπ(s, a) = R(s, a) + γ Σ_{s′} P(s′ | s, a) Qπ(s′, π(s′))
• If we have n data points D = {(si, ai, ri, s′i)}_{i=1,...,n}, we require that this equation holds (approximately) for these n data points:
∀i : Qπ(si, ai) ≈ ri + γ Qπ(s′i, π(s′i))
• Written in vector notation: Q = R + γQ̄, with n-dimensional data vectors Q, R, Q̄.
• Written as an optimization: minimize the Bellman residual error
L(Qπ) = ||R + γPΠQπ − Qπ||² (true residual)
≈ Σ_{i=1}^{n} [Qπ(si, ai) − ri − γQπ(s′i, π(s′i))]² = ||Q − R − γQ̄||²
Bellman residual minimization
• The true fixed point of Bellman residual minimization (an overconstrained system, solved in the least-squares sense):
βπ = ((Φ − γPΠΦ)ᵀ (Φ − γPΠΦ))⁻¹ (Φ − γPΠΦ)ᵀ r
• The solution βπ of the system is unique since the columns of Φ (the basis functions) are linearly independent by definition.
(See Lagoudakis & Parr (JMLR 2003) for details.)
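On a batch of samples this is ordinary least squares. A sketch, where `Phi`, `Phi_next`, `R` are assumed design matrices with rows φ(si, ai), φ(s′i, π(s′i)) and entries ri:

```python
import numpy as np

def brm_weights(Phi, Phi_next, R, gamma=0.9):
    A = Phi - gamma * Phi_next
    # beta = (A^T A)^{-1} A^T R, the least-squares solution of A beta = R
    return np.linalg.solve(A.T @ A, A.T @ R)
```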
LSPI: Least-Squares Fixed-Point Approximation
• Project TπQ back onto span(Φ):
T̂π(Q) = Φ(ΦᵀΦ)⁻¹Φᵀ(TπQ)
• The approximate fixed point:
βπ = (Φᵀ(Φ − γPΠΦ))⁻¹ Φᵀ r
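The fixed-point solution admits the same batch sketch, with `Phi`, `Phi_next`, `R` again assumed design matrices built from samples:

```python
import numpy as np

def lsfp_weights(Phi, Phi_next, R, gamma=0.9):
    # beta = (Phi^T (Phi - gamma * Phi_next))^{-1} Phi^T R
    A = Phi.T @ (Phi - gamma * Phi_next)
    return np.linalg.solve(A, Phi.T @ R)
```

Note the only difference from Bellman residual minimization is which matrix multiplies from the left: Φᵀ here, (Φ − γPΠΦ)ᵀ there.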
LSPI: Comparison of the two views
• The Bellman residual minimization method focuses on the magnitude of the change.
• The least-squares fixed-point approximation focuses on the direction of the change.
• The least-squares fixed-point approximation is less stable and less predictable.
• Nevertheless, the least-squares fixed-point method might be preferable, because:
– learning the Bellman residual minimizing approximation requires doubled samples;
– experimentally, it often delivers superior policies.
(See Lagoudakis & Parr (JMLR 2003) for details.)
LSPI: LSTDQ algorithm
A = Φᵀ(Φ − γPΠΦ), b = Φᵀr
• Initialize A ← 0, b ← 0
• For each (s, a, r, s′) ∈ D:
A ← A + φ(s, a) (φ(s, a) − γφ(s′, π(s′)))ᵀ
b ← b + φ(s, a) r
• β ← A⁻¹b
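A runnable sketch of this loop; `phi` and `pi` are hypothetical stand-ins for the feature map and the policy being evaluated, and the small ridge term on A is an added assumption for numerical invertibility (not part of the algorithm above).

```python
import numpy as np

def lstdq(D, phi, pi, d, gamma=0.9, reg=1e-6):
    A = reg * np.eye(d)
    b = np.zeros(d)
    for s, a, r, s_next in D:
        f = phi(s, a)
        f_next = phi(s_next, pi(s_next))
        A += np.outer(f, f - gamma * f_next)    # A <- A + phi (phi - g phi')^T
        b += f * r                              # b <- b + phi r
    return np.linalg.solve(A, b)                # beta = A^{-1} b
```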
LSPI algorithm
given D:
• repeat:
π ← π′
βπ ← LSTDQ(π); π′ ← the greedy policy w.r.t. Q(·, ·; βπ)
• until π′ ≈ π
• return π
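The outer loop can be sketched as follows. `lstdq_solver` is a hypothetical callable with signature `lstdq_solver(D, phi, pi, d, gamma) -> beta`; the greedy improvement step maximizes βᵀφ(s, a) over a finite action set (all names are illustrative).

```python
import numpy as np

def lspi(D, phi, actions, d, lstdq_solver, gamma=0.9, iters=10):
    beta = np.zeros(d)
    for _ in range(iters):
        # greedy policy w.r.t. the current Q(s, a) = beta^T phi(s, a)
        pi = lambda s, b=beta: max(actions, key=lambda a: b @ phi(s, a))
        beta = lstdq_solver(D, phi, pi, d, gamma)   # policy evaluation
    return beta
```

The default argument `b=beta` pins the current weights inside the lambda so the policy does not shift while the solver runs.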
LSPI: Riding a bike
(from Alma A. M. Rahat’s simulation)
• States: {θ, θ̇, ω, ω̇, ω̈, ψ}, where θ is the angle of the handlebar, ω is the vertical angle of the bicycle, and ψ is the angle of the bicycle to the goal.
• Actions: {τ, ν}. τ ∈ {−2, 0, 2} is the torque applied to the handlebar; ν ∈ {−0.02, 0, 0.02} is the displacement of the rider.
• For each a, the value function Q(s, a) uses 20 features
(1, ω, ω̇, ω², ωω̇, θ, θ̇, θ², θ̇², θθ̇, ωθ, ωθ², ω²θ, ψ, ψ², ψθ, ψ̄, ψ̄², ψ̄θ)
where ψ̄ = sign(ψ) · π − ψ.
LSPI: Riding a bike
from Lagoudakis & Parr (JMLR 2003)
LSPI: Riding a bike
• Training samples were collected in advance by initializing the bicycle to a small random perturbation from the initial position (0, 0, 0, 0, 0, π/2) and running each episode for up to 20 steps under a purely random policy.
• Each successful ride must complete a distance of 2 kilometers.
• This experiment was repeated 100 times.
(from Lagoudakis & Parr (JMLR 2003))
Feature Selection/Building Problems
• Feature selection.
• Online/incremental feature learning.
Wu and Givan (2005); Keller et al. (2006); Mahadevan et al. (2006); Parr et al. (2007); Kolter and Ng (2009); Mahadevan and Liu (2010); Boots and Gordon (2010); Sun et al. (2011); etc.