Bayesian Inference for Least-Squares Temporal Difference Regularization
Nikolaos Tziortziotis† and Christos Dimitrakakis⋆
† DaSciM, LIX, École Polytechnique, France
⋆ University of Lille, France; SEAS, Harvard University, USA
Email: {ntziorzi, christos.dimitrakakis}@gmail.com
Abstract
Bayesian Least-Squares Temporal Difference:
✓ Fully Bayesian approach for LSTD
✓ Probabilistic inference of value functions
✓ Quantifies our uncertainty about the value function
Variational Bayesian LSTD:
✓ Sparse model with good generalisation capabilities
✓ Automatically determines the model's complexity
✓ No need to select a regularization parameter
Reinforcement Learning (RL)
Learning to act in an unknown environment, by interaction and reinforcement.
[Diagram: the agent-environment interaction loop; the agent sends an action to the environment, which returns a reward and the next state.]
RL tasks are formulated as MDPs $\{S, A, P, r, \gamma\}$.
Policy $\pi: S \to A$ (maps states to actions)
Value function: $V^\pi(s) \triangleq \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t) \mid s_0 = s\right]$
Bellman operator:
$(T^\pi V)(s) = r(s) + \gamma \int_S V(s')\, dP(s' \mid s, \pi(s))$
$V^\pi$ is the unique fixed point of the Bellman operator:
$V^\pi = T^\pi V^\pi \;\Rightarrow\; V^\pi = (I - \gamma P^\pi)^{-1} r$
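As a quick illustration of the fixed-point formula, the following numpy sketch solves $V^\pi = (I - \gamma P^\pi)^{-1} r$ for a hypothetical 3-state chain; the transition matrix, rewards, and discount are assumptions chosen for illustration, not from the poster:

```python
import numpy as np

gamma = 0.9
# Hypothetical transition matrix P[s, s'] under a fixed policy pi
P = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.1, 0.9]])
r = np.array([0.0, 0.0, 1.0])   # assumed reward: 1 only in the last state

# V^pi = (I - gamma P^pi)^{-1} r
V = np.linalg.solve(np.eye(3) - gamma * P, r)

# Sanity check: V^pi is a fixed point of the Bellman operator, T^pi V = r + gamma P V = V
assert np.allclose(r + gamma * P @ V, V)
```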
§ The value function cannot be represented in an explicit way (continuous state space)
§ The model of the MDP is unknown
© Value function approximation:
$V^\pi_\theta(s) = \phi(s)^\top \theta = \sum_{i=1}^{k} \phi_i(s)\, \theta_i$,
where $\mathcal{F} = \{f_\theta \mid f_\theta(\cdot) = \phi(\cdot)^\top \theta\}$.
© Access to a set of transitions: $D = \{(s_i, r_i, s'_i)\}_{i=1}^{n}$,
where we define $\Phi = [\phi(s_1)^\top; \ldots; \phi(s_n)^\top]$, $R = [r_1, \ldots, r_n]^\top$, and $\Phi' = [\phi(s'_1)^\top; \ldots; \phi(s'_n)^\top]$.
Least-Squares Temporal Difference
Minimize the mean-square projected Bellman error (MSPBE):
$\theta = \arg\min_{\theta \in \mathbb{R}^k} \|V^\pi_\theta - \Pi T^\pi V^\pi_\theta\|^2_D$.
[Diagram: $T^\pi$ maps $V^\pi_\theta$ out of the hypothesis space $\mathcal{F}$; the projection $\Pi$ maps $T^\pi V^\pi_\theta$ back onto $\mathcal{F}$.]
Projection operator onto $\mathcal{F}$ (with respect to $\|\cdot\|_D$): $\Pi = \Phi(\Phi^\top D \Phi)^{-1}\Phi^\top D$; in the empirical case, with $C \triangleq \Phi^\top\Phi$, this becomes $\Pi = \Phi C^{-1}\Phi^\top$.
Nested Optimization Problem
$u^* = \arg\min_{u \in \mathbb{R}^k} \|\Phi u - T^\pi \Phi\theta\|^2_D$ (projection step)
$\theta = \arg\min_{\theta \in \mathbb{R}^k} \|\Phi\theta - \Phi u^*\|^2_D$ (fixed-point step)
Solution
$u^* = C^{-1}\Phi^\top(R + \gamma\Phi'\theta)$,
$\theta = (\Phi^\top(\Phi - \gamma\Phi'))^{-1}\Phi^\top R = A^{-1}b$,
where $C \triangleq \Phi^\top\Phi$, $A \triangleq \Phi^\top(\Phi - \gamma\Phi')$, and $b \triangleq \Phi^\top R$.
As the number of samples $n$ increases, the LSTD solution $\Phi\theta$ converges to the fixed point of $\Pi T^\pi$.
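The closed-form solution lends itself to a direct implementation. Below is a minimal numpy sketch of the LSTD estimator; the feature map `phi` and the transition list are assumed inputs, not part of the poster:

```python
import numpy as np

def lstd(phi, transitions, gamma=0.95):
    """phi: maps a state to a k-dim feature vector; transitions: list of (s, r, s_next)."""
    Phi      = np.array([phi(s)  for s, _, _  in transitions])   # n x k
    R        = np.array([r       for _, r, _  in transitions])   # n
    Phi_next = np.array([phi(sn) for _, _, sn in transitions])   # n x k
    A = Phi.T @ (Phi - gamma * Phi_next)   # A = Phi^T (Phi - gamma Phi')
    b = Phi.T @ R                          # b = Phi^T R
    return np.linalg.solve(A, b)           # theta = A^{-1} b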
Regularized LSTD Schemes
• $\ell_2$-LSTD
• Lasso-TD
• LC-TD
• LARS-TD
• $\ell_1$-PBR
• $\ell_{2,2}$-LSTD
• $\ell_{2,1}$-LSTD
• Dantzig-LSTD
• ODDS-TD
Bayesian LSTD
Empirical Bellman operator:
$T^\pi V^\pi_\theta = r + \gamma P^\pi V^\pi_\theta + N$, $N \sim \mathcal{N}(0, \beta^{-1}I)$
Given observations $D$ (LSTD solution):
$V^\pi_\theta = \Pi T^\pi V^\pi_\theta \;\Leftrightarrow\; \Phi^\top R = \Phi^\top(\Phi - \gamma\Phi')\theta + \Phi^\top N$
Linear regression model: $b = A\theta + \Phi^\top N$
Likelihood function: $p(b \mid \theta, \beta) = \mathcal{N}(b \mid A\theta, \beta^{-1}C)$
✓ Maximum likelihood inference corresponds to the standard LSTD solution
Prior distribution over $\theta$: $p(\theta \mid \alpha) = \mathcal{N}(\theta \mid 0, \alpha^{-1}I)$.
Logarithm of the posterior distribution:
$\ln p(\theta \mid D) \propto -\frac{\beta}{2}E_D(\theta) - \frac{\alpha}{2}\theta^\top\theta$,
where $E_D(\theta) = (b - A\theta)^\top C^{-1}(b - A\theta)$ (MSPBE).
✓ Maximum a posteriori inference corresponds to $\ell_2$ regularization ($\lambda = \alpha/\beta$)
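To make the correspondence concrete, here is a minimal sketch of the equivalent $\ell_2$-regularized solve with $\lambda = \alpha/\beta$, assuming the $A$, $b$, $C$ matrices from the LSTD section:

```python
import numpy as np

def ridge_lstd(A, b, C, lam=0.1):
    """l2-regularized LSTD: minimizes (beta/2) E_D(theta) + (alpha/2) ||theta||^2."""
    k = A.shape[1]
    Sigma = A.T @ np.linalg.solve(C, A)    # Sigma = A^T C^{-1} A
    rhs   = A.T @ np.linalg.solve(C, b)    # A^T C^{-1} b
    # Normal equations: (Sigma + lam I) theta = A^T C^{-1} b, with lam = alpha / beta
    return np.linalg.solve(Sigma + lam * np.eye(k), rhs)
```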
Posterior distribution: $p(\theta \mid D) = \mathcal{N}(\theta \mid m, S)$
$S = (\alpha I + \beta \underbrace{A^\top C^{-1} A}_{\Sigma})^{-1}$ and $m = \beta S A^\top C^{-1} b$
Predictive distribution:
$p(V^\pi_\theta(s^*) \mid s^*, D) = \int_\theta p(V^\pi_\theta(s^*) \mid \theta, s^*)\, dp(\theta \mid b, \alpha, \beta) = \mathcal{N}(V^\pi_\theta(s^*) \mid \phi(s^*)^\top m,\; \phi(s^*)^\top S\, \phi(s^*))$
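A minimal numpy sketch of the posterior and predictive computations above, assuming fixed hyperparameters $\alpha$ and $\beta$ and the $A$, $b$, $C$ matrices defined earlier:

```python
import numpy as np

def blstd_posterior(A, b, C, alpha=1.0, beta=1.0):
    k = A.shape[1]
    Sigma = A.T @ np.linalg.solve(C, A)                    # Sigma = A^T C^{-1} A
    S = np.linalg.inv(alpha * np.eye(k) + beta * Sigma)    # posterior covariance
    m = beta * S @ (A.T @ np.linalg.solve(C, b))           # posterior mean
    return m, S

def predict_value(phi_star, m, S):
    """Predictive mean and variance of V(s*) given features phi_star = phi(s*)."""
    return phi_star @ m, phi_star @ S @ phi_star
```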
Variational Bayesian LSTD
Prior distribution: $p(\theta \mid \boldsymbol{\alpha}) = \prod_{i=1}^{k} \mathcal{N}(\theta_i \mid 0, \alpha_i^{-1})$
Hyperprior over $\boldsymbol{\alpha}$: $p(\boldsymbol{\alpha}) = \prod_{i=1}^{k} \mathrm{Gamma}(\alpha_i \mid h_a, h_b)$
Hyperprior over $\beta$: $p(\beta) = \mathrm{Gamma}(\beta \mid h_c, h_d)$
Posterior distribution over $Z = \{\theta, \boldsymbol{\alpha}, \beta\}$:
$p(\theta, \boldsymbol{\alpha}, \beta \mid b) = \dfrac{p(b \mid \theta, \beta)\, p(\beta)\, p(\theta \mid \boldsymbol{\alpha})\, p(\boldsymbol{\alpha})}{p(b)}$
§ The marginal likelihood is analytically intractable
✓ We resort to variational inference
Variational distribution: $Q(Z) = Q_\theta(\theta)\, Q_{\boldsymbol{\alpha}}(\boldsymbol{\alpha})\, Q_\beta(\beta)$
Optimal distribution for each factor:
$Q_\theta(\theta) = \mathcal{N}(\theta \mid m, S)$
$Q_\beta(\beta) = \mathrm{Gamma}(\beta \mid c, d)$
$Q_{\boldsymbol{\alpha}}(\boldsymbol{\alpha}) = \prod_{i=1}^{k} \mathrm{Gamma}(\alpha_i \mid a_i, b_i)$
where
$S = (\operatorname{diag} \mathbb{E}[\boldsymbol{\alpha}] + \mathbb{E}[\beta]\Sigma)^{-1}$, $m = \mathbb{E}[\beta]\, S A^\top C^{-1} b$,
$a_i = h_a + \frac{1}{2}$, $b_i = h_b + \frac{1}{2}\mathbb{E}[\theta_i^2]$,
$c = h_c + \frac{k}{2}$, $d = h_d + \frac{1}{2}\|b - Am\|^2_{C^{-1}} + \frac{1}{2}\operatorname{tr}(\Sigma S)$.
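The updates are coupled, so in practice they are iterated until convergence. The following sketch is one plausible implementation of the coordinate updates; the hyperparameter values, initialization, and fixed iteration count are illustrative assumptions, and a rate parameterization of the Gamma is assumed, so $\mathbb{E}[\alpha_i] = a_i/b_i$:

```python
import numpy as np

def vblstd(A, b, C, ha=1e-6, hb=1e-6, hc=1e-6, hd=1e-6, n_iter=100):
    k = A.shape[1]
    Sigma = A.T @ np.linalg.solve(C, A)        # Sigma = A^T C^{-1} A
    Atb = A.T @ np.linalg.solve(C, b)          # A^T C^{-1} b
    E_alpha = np.ones(k)                       # initial E[alpha_i] (assumed)
    E_beta = 1.0                               # initial E[beta] (assumed)
    for _ in range(n_iter):
        # Q_theta: S = (diag E[alpha] + E[beta] Sigma)^{-1}, m = E[beta] S A^T C^{-1} b
        S = np.linalg.inv(np.diag(E_alpha) + E_beta * Sigma)
        m = E_beta * S @ Atb
        # Q_alpha: a_i = h_a + 1/2, b_i = h_b + E[theta_i^2]/2, E[theta_i^2] = m_i^2 + S_ii
        a = ha + 0.5
        b_post = hb + 0.5 * (m ** 2 + np.diag(S))
        E_alpha = a / b_post                   # Gamma mean (shape / rate)
        # Q_beta: c = h_c + k/2, d = h_d + ||b - Am||^2_{C^{-1}}/2 + tr(Sigma S)/2
        resid = b - A @ m
        c = hc + 0.5 * k
        d = hd + 0.5 * (resid @ np.linalg.solve(C, resid) + np.trace(Sigma @ S))
        E_beta = c / d
    return m, S, E_alpha, E_beta
```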
Lower bound
$\mathcal{L}(Q) = \frac{1}{2}\ln|S| - \frac{1}{2}\ln|C| + \sum_{i=1}^{k}\{\ln\Gamma(a_i) - a_i \ln b_i\} + \ln\Gamma(c) - c\ln d + \frac{k}{2}(1 - \ln 2\pi) - k\ln\Gamma(h_a) + k h_a \ln h_b - \ln\Gamma(h_c) + h_c \ln h_d$
Experimental Results
Corrupted Chain
• A 20-state, 2-action (left or right) MDP
• The probability of an action succeeding is 0.9
• A reward of 1 is given only at the ends of the chain
• The horizon of each episode is set to 20
• Optimal policy: i) $s \le 10$: $a = L$, ii) $s > 10$: $a = R$
• Value function representation (sketched below):
$\phi(s) = (1, \mathrm{RBF}_1(s), \ldots, \mathrm{RBF}_5(s), x_1, \ldots, x_{600})$,
where $x_i \sim \mathcal{N}(0, 1)$, $\forall i \in \{1, \ldots, 600\}$
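A sketch of how such a feature vector might be built; the RBF centers and width, and the way the noise features are drawn per state, are illustrative assumptions not specified on the poster:

```python
import numpy as np

N_NOISE = 600
centers = np.linspace(1, 20, 5)   # assumed RBF centers spread over the 20 states
width = 4.0                       # assumed RBF width

def phi(s):
    rbf = np.exp(-((s - centers) ** 2) / (2.0 * width ** 2))
    # Noise features x_i ~ N(0, 1), drawn once per state (seeded by s) so the
    # corrupted features are fixed but uninformative; an assumption about how
    # the poster's noise features are generated.
    noise = np.random.default_rng(s).standard_normal(N_NOISE)
    return np.concatenate(([1.0], rbf, noise))   # 1 + 5 + 600 = 606 features
```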
[Figure: approximation error $\|V^* - V_\theta\|_2$ on the corrupted chain, varying the number of training episodes and varying the number of noisy features, comparing $\ell_2$-LSTD, VBLSTD, OMPTD, and LarsTD.]
[Figure: mean weight values learned by LSTD, VBLSTD, OMPTD, and LarsTD, after 10 and 100 training episodes.]
The 606 mean weight values: the first weight is the bias term, the next 5 correspond to the relevant features (RBFs), and the remaining 600 correspond to the noise (irrelevant) features.