Bayesian Inference for Least-Squares Temporal Difference Regularization
Nikolaos Tziortziotis† and Christos Dimitrakakis⋆
† DaSciM, LIX, École Polytechnique, France
⋆ University of Lille, France; SEAS, Harvard University, USA
Email: {ntziorzi, christos.dimitrakakis}@gmail.com
Abstract
Bayesian Least-Squares Temporal Difference:
✓ Fully Bayesian approach for LSTD
✓ Probabilistic inference of value functions
✓ Quantifies our uncertainty about the value function
Variational Bayesian LSTD:
✓ Sparse model with good generalisation capabilities
✓ Automatically determines the model's complexity
✓ No need to select a regularization parameter
Reinforcement Learning (RL)
Learning to act in an unknown environment, by interaction and reinforcement.
[Diagram: the agent-environment interaction loop; the agent sends an action to the environment, which returns a reward and the next state.]
RL tasks are formulated as MDPs $\{S, A, P, r, \gamma\}$.
Policy $\pi: S \to A$ (maps states to actions)
Value function: $V^\pi(s) \triangleq \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t) \mid s_0 = s\right]$
Bellman operator:
$(T^\pi V)(s) = r(s) + \gamma \int_S V(s')\, dP(s' \mid s, \pi(s))$
$V^\pi$ is the unique fixed point of the Bellman operator:
$V^\pi = T^\pi V^\pi \;\Rightarrow\; V^\pi = (I - \gamma P^\pi)^{-1} r$
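As a quick illustration of the fixed-point formula, the following numpy sketch solves $V^\pi = (I - \gamma P^\pi)^{-1} r$ for a hypothetical 3-state chain; the transition matrix, rewards, and discount are assumptions chosen for illustration, not from the poster:

```python
import numpy as np

gamma = 0.9
# Hypothetical transition matrix P[s, s'] under a fixed policy pi
P = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.1, 0.9]])
r = np.array([0.0, 0.0, 1.0])   # assumed reward: 1 only in the last state

# V^pi = (I - gamma P^pi)^{-1} r
V = np.linalg.solve(np.eye(3) - gamma * P, r)

# Sanity check: V^pi is a fixed point of the Bellman operator, T^pi V = r + gamma P V = V
assert np.allclose(r + gamma * P @ V, V)
```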
§ The value function cannot be represented in an explicit way (continuous state space)
§ The model of the MDP is unknown
© Value function approximation:
$V^\pi_\theta(s) = \phi(s)^\top \theta = \sum_{i=1}^{k} \phi_i(s)\, \theta_i$,
where $\mathcal{F} = \{f_\theta \mid f_\theta(\cdot) = \phi(\cdot)^\top \theta\}$.
© Access to a set of transitions: $D = \{(s_i, r_i, s'_i)\}_{i=1}^{n}$,
where we define $\Phi = [\phi(s_1)^\top; \ldots; \phi(s_n)^\top]$, $R = [r_1, \ldots, r_n]^\top$, and $\Phi' = [\phi(s'_1)^\top; \ldots; \phi(s'_n)^\top]$.
Least-Squares Temporal Difference
Minimize the mean-square projected Bellman error (MSPBE):
$\theta = \arg\min_{\theta \in \mathbb{R}^k} \|V^\pi_\theta - \Pi T^\pi V^\pi_\theta\|^2_D$.
[Diagram: $T^\pi$ maps $V^\pi_\theta$ out of the hypothesis space $\mathcal{F}$; the projection $\Pi$ maps $T^\pi V^\pi_\theta$ back onto $\mathcal{F}$.]
Projection operator onto $\mathcal{F}$ (with respect to $\|\cdot\|_D$): $\Pi = \Phi(\Phi^\top D \Phi)^{-1}\Phi^\top D$; in the empirical case, with $C \triangleq \Phi^\top\Phi$, this becomes $\Pi = \Phi C^{-1}\Phi^\top$.
Nested Optimization Problem
$u^* = \arg\min_{u \in \mathbb{R}^k} \|\Phi u - T^\pi \Phi\theta\|^2_D$ (projection step)
$\theta = \arg\min_{\theta \in \mathbb{R}^k} \|\Phi\theta - \Phi u^*\|^2_D$ (fixed-point step)
Solution
$u^* = C^{-1}\Phi^\top(R + \gamma\Phi'\theta)$,
$\theta = (\Phi^\top(\Phi - \gamma\Phi'))^{-1}\Phi^\top R = A^{-1}b$,
where $C \triangleq \Phi^\top\Phi$, $A \triangleq \Phi^\top(\Phi - \gamma\Phi')$, and $b \triangleq \Phi^\top R$.
As the number of samples $n$ increases, the LSTD solution $\Phi\theta$ converges to the fixed point of $\Pi T^\pi$.
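The closed-form solution lends itself to a direct implementation. Below is a minimal numpy sketch of the LSTD estimator; the feature map `phi` and the transition list are assumed inputs, not part of the poster:

```python
import numpy as np

def lstd(phi, transitions, gamma=0.95):
    """phi: maps a state to a k-dim feature vector; transitions: list of (s, r, s_next)."""
    Phi      = np.array([phi(s)  for s, _, _  in transitions])   # n x k
    R        = np.array([r       for _, r, _  in transitions])   # n
    Phi_next = np.array([phi(sn) for _, _, sn in transitions])   # n x k
    A = Phi.T @ (Phi - gamma * Phi_next)   # A = Phi^T (Phi - gamma Phi')
    b = Phi.T @ R                          # b = Phi^T R
    return np.linalg.solve(A, b)           # theta = A^{-1} b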
Regularized LSTD Schemes
• $\ell_2$-LSTD
• Lasso-TD
• LC-TD
• LARS-TD
• $\ell_1$-PBR
• $\ell_{2,2}$-LSTD
• $\ell_{2,1}$-LSTD
• Dantzig-LSTD
• ODDS-TD
Bayesian LSTD
Empirical Bellman operator:
$T^\pi V^\pi_\theta = r + \gamma P^\pi V^\pi_\theta + N$, $N \sim \mathcal{N}(0, \beta^{-1}I)$
Given observations $D$ (LSTD solution):
$V^\pi_\theta = \Pi T^\pi V^\pi_\theta \;\Leftrightarrow\; \Phi^\top R = \Phi^\top(\Phi - \gamma\Phi')\theta + \Phi^\top N$
Linear regression model: $b = A\theta + \Phi^\top N$
Likelihood function: $p(b \mid \theta, \beta) = \mathcal{N}(b \mid A\theta, \beta^{-1}C)$
✓ Maximum likelihood inference corresponds to the standard LSTD solution
Prior distribution over $\theta$: $p(\theta \mid \alpha) = \mathcal{N}(\theta \mid 0, \alpha^{-1}I)$.
Logarithm of the posterior distribution:
$\ln p(\theta \mid D) \propto -\frac{\beta}{2}E_D(\theta) - \frac{\alpha}{2}\theta^\top\theta$,
where $E_D(\theta) = (b - A\theta)^\top C^{-1}(b - A\theta)$ (MSPBE).
✓ Maximum a posteriori inference corresponds to $\ell_2$ regularization ($\lambda = \alpha/\beta$)
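To make the correspondence concrete, here is a minimal sketch of the equivalent $\ell_2$-regularized solve with $\lambda = \alpha/\beta$, assuming the $A$, $b$, $C$ matrices from the LSTD section:

```python
import numpy as np

def ridge_lstd(A, b, C, lam=0.1):
    """l2-regularized LSTD: minimizes (beta/2) E_D(theta) + (alpha/2) ||theta||^2."""
    k = A.shape[1]
    Sigma = A.T @ np.linalg.solve(C, A)    # Sigma = A^T C^{-1} A
    rhs   = A.T @ np.linalg.solve(C, b)    # A^T C^{-1} b
    # Normal equations: (Sigma + lam I) theta = A^T C^{-1} b, with lam = alpha / beta
    return np.linalg.solve(Sigma + lam * np.eye(k), rhs)
```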
Posterior distribution: $p(\theta \mid D) = \mathcal{N}(\theta \mid m, S)$
$S = (\alpha I + \beta \underbrace{A^\top C^{-1} A}_{\Sigma})^{-1}$ and $m = \beta S A^\top C^{-1} b$
Predictive distribution:
$p(V^\pi_\theta(s^*) \mid s^*, D) = \int_\theta p(V^\pi_\theta(s^*) \mid \theta, s^*)\, dp(\theta \mid b, \alpha, \beta) = \mathcal{N}(V^\pi_\theta(s^*) \mid \phi(s^*)^\top m,\; \phi(s^*)^\top S\, \phi(s^*))$
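A minimal numpy sketch of the posterior and predictive computations above, assuming fixed hyperparameters $\alpha$ and $\beta$ and the $A$, $b$, $C$ matrices defined earlier:

```python
import numpy as np

def blstd_posterior(A, b, C, alpha=1.0, beta=1.0):
    k = A.shape[1]
    Sigma = A.T @ np.linalg.solve(C, A)                    # Sigma = A^T C^{-1} A
    S = np.linalg.inv(alpha * np.eye(k) + beta * Sigma)    # posterior covariance
    m = beta * S @ (A.T @ np.linalg.solve(C, b))           # posterior mean
    return m, S

def predict_value(phi_star, m, S):
    """Predictive mean and variance of V(s*) given features phi_star = phi(s*)."""
    return phi_star @ m, phi_star @ S @ phi_star
```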
Variational Bayesian LSTD
Prior distribution: $p(\theta \mid \boldsymbol{\alpha}) = \prod_{i=1}^{k} \mathcal{N}(\theta_i \mid 0, \alpha_i^{-1})$
Hyperprior over $\boldsymbol{\alpha}$: $p(\boldsymbol{\alpha}) = \prod_{i=1}^{k} \mathrm{Gamma}(\alpha_i \mid h_a, h_b)$
Hyperprior over $\beta$: $p(\beta) = \mathrm{Gamma}(\beta \mid h_c, h_d)$
Posterior distribution over $Z = \{\theta, \boldsymbol{\alpha}, \beta\}$:
$p(\theta, \boldsymbol{\alpha}, \beta \mid b) = \dfrac{p(b \mid \theta, \beta)\, p(\beta)\, p(\theta \mid \boldsymbol{\alpha})\, p(\boldsymbol{\alpha})}{p(b)}$
§ The marginal likelihood is analytically intractable
✓ We resort to variational inference
Variational distribution: $Q(Z) = Q_\theta(\theta)\, Q_{\boldsymbol{\alpha}}(\boldsymbol{\alpha})\, Q_\beta(\beta)$
Optimal distribution for each factor:
$Q_\theta(\theta) = \mathcal{N}(\theta \mid m, S)$
$Q_\beta(\beta) = \mathrm{Gamma}(\beta \mid c, d)$
$Q_{\boldsymbol{\alpha}}(\boldsymbol{\alpha}) = \prod_{i=1}^{k} \mathrm{Gamma}(\alpha_i \mid a_i, b_i)$
where
$S = (\operatorname{diag} \mathbb{E}[\boldsymbol{\alpha}] + \mathbb{E}[\beta]\Sigma)^{-1}$, $m = \mathbb{E}[\beta]\, S A^\top C^{-1} b$,
$a_i = h_a + \frac{1}{2}$, $b_i = h_b + \frac{1}{2}\mathbb{E}[\theta_i^2]$,
$c = h_c + \frac{k}{2}$, $d = h_d + \frac{1}{2}\|b - Am\|^2_{C^{-1}} + \frac{1}{2}\operatorname{tr}(\Sigma S)$.
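The updates are coupled, so in practice they are iterated until convergence. The following sketch is one plausible implementation of the coordinate updates; the hyperparameter values, initialization, and fixed iteration count are illustrative assumptions, and a rate parameterization of the Gamma is assumed, so $\mathbb{E}[\alpha_i] = a_i/b_i$:

```python
import numpy as np

def vblstd(A, b, C, ha=1e-6, hb=1e-6, hc=1e-6, hd=1e-6, n_iter=100):
    k = A.shape[1]
    Sigma = A.T @ np.linalg.solve(C, A)        # Sigma = A^T C^{-1} A
    Atb = A.T @ np.linalg.solve(C, b)          # A^T C^{-1} b
    E_alpha = np.ones(k)                       # initial E[alpha_i] (assumed)
    E_beta = 1.0                               # initial E[beta] (assumed)
    for _ in range(n_iter):
        # Q_theta: S = (diag E[alpha] + E[beta] Sigma)^{-1}, m = E[beta] S A^T C^{-1} b
        S = np.linalg.inv(np.diag(E_alpha) + E_beta * Sigma)
        m = E_beta * S @ Atb
        # Q_alpha: a_i = h_a + 1/2, b_i = h_b + E[theta_i^2]/2, E[theta_i^2] = m_i^2 + S_ii
        a = ha + 0.5
        b_post = hb + 0.5 * (m ** 2 + np.diag(S))
        E_alpha = a / b_post                   # Gamma mean (shape / rate)
        # Q_beta: c = h_c + k/2, d = h_d + ||b - Am||^2_{C^{-1}}/2 + tr(Sigma S)/2
        resid = b - A @ m
        c = hc + 0.5 * k
        d = hd + 0.5 * (resid @ np.linalg.solve(C, resid) + np.trace(Sigma @ S))
        E_beta = c / d
    return m, S, E_alpha, E_beta
```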
Lower bound
$\mathcal{L}(Q) = \frac{1}{2}\ln|S| - \frac{1}{2}\ln|C| + \sum_{i=1}^{k}\{\ln\Gamma(a_i) - a_i \ln b_i\} + \ln\Gamma(c) - c\ln d + \frac{k}{2}(1 - \ln 2\pi) - k\ln\Gamma(h_a) + k h_a \ln h_b - \ln\Gamma(h_c) + h_c \ln h_d$
Experimental Results
Corrupted Chain
• A 20-state, 2-action (left or right) MDP
• The probability of an action succeeding is 0.9
• A reward of 1 is given only at the ends of the chain
• The horizon of each episode is set to 20
• Optimal policy: i) $s \le 10$: $a = L$, ii) $s > 10$: $a = R$
• Value function representation (sketched below):
$\phi(s) = (1, \mathrm{RBF}_1(s), \ldots, \mathrm{RBF}_5(s), x_1, \ldots, x_{600})$,
where $x_i \sim \mathcal{N}(0, 1)$, $\forall i \in \{1, \ldots, 600\}$
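A sketch of how such a feature vector might be built; the RBF centers and width, and the way the noise features are drawn per state, are illustrative assumptions not specified on the poster:

```python
import numpy as np

N_NOISE = 600
centers = np.linspace(1, 20, 5)   # assumed RBF centers spread over the 20 states
width = 4.0                       # assumed RBF width

def phi(s):
    rbf = np.exp(-((s - centers) ** 2) / (2.0 * width ** 2))
    # Noise features x_i ~ N(0, 1), drawn once per state (seeded by s) so the
    # corrupted features are fixed but uninformative; an assumption about how
    # the poster's noise features are generated.
    noise = np.random.default_rng(s).standard_normal(N_NOISE)
    return np.concatenate(([1.0], rbf, noise))   # 1 + 5 + 600 = 606 features
```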
[Figure: approximation error $\|V^* - V_\theta\|_2$ on the corrupted chain, varying the number of training episodes and varying the number of noisy features, comparing $\ell_2$-LSTD, VBLSTD, OMPTD, and LarsTD.]
[Figure: mean weight values learned by LSTD, VBLSTD, OMPTD, and LarsTD, after 10 and 100 training episodes.]
The 606 mean weight values: the first weight is the bias term, the next 5 correspond to the relevant features (RBFs), and the remaining 600 correspond to the noise (irrelevant) features.