
Bayesian Inference for Least-Squares Temporal Difference Regularization

Nikolaos Tziortziotis† and Christos Dimitrakakis⋆

† DaSciM, LIX, École Polytechnique, France
⋆ University of Lille, France; SEAS, Harvard University, USA

Email: {ntziorzi, christos.dimitrakakis}@gmail.com

Abstract

Bayesian Least-Squares Temporal Difference:
✓ Fully Bayesian approach for LSTD
✓ Probabilistic inference of value functions
✓ Quantifies our uncertainty about the value function

Variational Bayesian LSTD:
✓ Sparse model with good generalisation capabilities
✓ Automatically determines the model's complexity
✓ No need to select a regularization parameter

Reinforcement Learning (RL)

Learning to act in an unknown environment, by interaction and reinforcement.

[Diagram: agent-environment interaction loop (action, reward, state).]

RL tasks are formulated as MDPs $\{\mathcal{S}, \mathcal{A}, P, r, \gamma\}$.
Policy $\pi : \mathcal{S} \to \mathcal{A}$ (maps states to actions)
Value function: $V^{\pi}(s) \triangleq \mathbb{E}^{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t) \,\middle|\, s_0 = s\right]$

Bellman operator:
$(T^{\pi}V)(s) = r(s) + \gamma \int_{\mathcal{S}} V(s')\, dP(s' \mid s, \pi(s))$

$V^{\pi}$ is the unique fixed point of the Bellman operator:
$V^{\pi} = T^{\pi}V^{\pi} \;\Rightarrow\; V^{\pi} = (I - \gamma P^{\pi})^{-1} r$
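As a concrete check of the fixed-point identity above, the following minimal sketch evaluates a fixed policy by solving the linear system directly; the 3-state chain, its rewards, and the discount factor are invented purely for illustration.

```python
import numpy as np

# Hypothetical 3-state chain under a fixed policy pi:
# P_pi[s, s'] is the transition probability, r[s] the expected reward.
P_pi = np.array([[0.9, 0.1, 0.0],
                 [0.1, 0.8, 0.1],
                 [0.0, 0.1, 0.9]])
r = np.array([0.0, 0.0, 1.0])
gamma = 0.95

# V^pi = (I - gamma * P_pi)^{-1} r, the unique fixed point of T^pi.
V = np.linalg.solve(np.eye(3) - gamma * P_pi, r)
print(V)
```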

§ The value function cannot be represented explicitly (continuous state space)
§ The model of the MDP is unknown

Value function approximation:
$V^{\pi}_{\theta}(s) = \phi(s)^{\top}\theta = \sum_{i=1}^{k} \phi_i(s)\,\theta_i$, where $\mathcal{F} = \{f_{\theta} \mid f_{\theta}(\cdot) = \phi(\cdot)^{\top}\theta\}$.

Access to a set of transitions $\mathcal{D} = \{(s_i, r_i, s'_i)\}_{i=1}^{n}$, from which we define (see the sketch below):
$\tilde{\Phi} = [\phi(s_1)^{\top}; \ldots; \phi(s_n)^{\top}]$, $\tilde{R} = [r_1, \ldots, r_n]^{\top}$ and $\tilde{\Phi}' = [\phi(s'_1)^{\top}; \ldots; \phi(s'_n)^{\top}]$.
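A minimal sketch of assembling the empirical matrices $\tilde{\Phi}$, $\tilde{R}$ and $\tilde{\Phi}'$ from a batch of transitions; the helper name and the callable `phi` are illustrative assumptions, not code from the poster.

```python
import numpy as np

def build_matrices(transitions, phi):
    """transitions: list of (s, r, s_next) tuples.
    phi: callable mapping a state to a length-k feature vector."""
    Phi      = np.array([phi(s)      for s, r, s_next in transitions])  # n x k
    R        = np.array([r           for s, r, s_next in transitions])  # n
    Phi_next = np.array([phi(s_next) for s, r, s_next in transitions])  # n x k
    return Phi, R, Phi_next
```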

Least-Squares Temporal Difference

Minimize the mean-square projected Bellman error (MSPBE):
$\theta = \arg\min_{\theta \in \mathbb{R}^k} \|V^{\pi}_{\theta} - \Pi T^{\pi} V^{\pi}_{\theta}\|^2_{D}$.

[Diagram: $T^{\pi}$ maps $V^{\pi}_{\theta}$ out of the hypothesis space $\mathcal{F}$; $\Pi$ projects $T^{\pi}V^{\pi}_{\theta}$ back onto $\mathcal{F}$.]

Projection operator onto $\mathcal{F}$: $\Pi = \Phi C^{-1}\Phi^{\top} D$

Nested Optimization Problem
$u^{*} = \arg\min_{u \in \mathbb{R}^k} \|\Phi u - T^{\pi}\Phi\theta\|^2_{D}$ (projection step)
$\theta = \arg\min_{\theta \in \mathbb{R}^k} \|\Phi\theta - \Phi u^{*}\|^2_{D}$ (fixed-point step)

Solution (a code sketch follows below):
$u^{*} = \tilde{C}^{-1}\tilde{\Phi}^{\top}(\tilde{R} + \gamma\tilde{\Phi}'\theta)$,
$\theta = (\tilde{\Phi}^{\top}(\tilde{\Phi} - \gamma\tilde{\Phi}'))^{-1}\tilde{\Phi}^{\top}\tilde{R} = A^{-1}b$,
where $\tilde{C} \triangleq \tilde{\Phi}^{\top}\tilde{\Phi}$, $A \triangleq \tilde{\Phi}^{\top}(\tilde{\Phi} - \gamma\tilde{\Phi}')$, and $b \triangleq \tilde{\Phi}^{\top}\tilde{R}$.

As the number of samples $n$ increases, the LSTD solution $\tilde{\Phi}\theta$ converges to the fixed point of $\hat{\Pi}T^{\pi}$.
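Given those matrices, the LSTD solution $\theta = A^{-1}b$ is a few lines of linear algebra. The sketch below follows the definitions above and is not the authors' implementation; in practice $A$ may be ill-conditioned, which is exactly where the regularized and Bayesian variants discussed next come in.

```python
import numpy as np

def lstd(Phi, R, Phi_next, gamma):
    """Batch LSTD: solves A theta = b with
    A = Phi^T (Phi - gamma * Phi') and b = Phi^T R."""
    A = Phi.T @ (Phi - gamma * Phi_next)   # k x k
    b = Phi.T @ R                          # k
    return np.linalg.solve(A, b)

# Approximate value at a state s: V(s) = phi(s) @ theta
```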

Regularized LSTD Schemes

• `2-LSTD• Lasso-TD• LC-TD

• LARS-TD• `1-PBR• `2,2-LSTD

• `2,1-LSTD• Dantzig-LSTD• ODDS-TD

Bayesian LSTD

Empirical Bellman operator:
$\hat{T}^{\pi} V^{\pi}_{\theta} = r + \gamma P^{\pi} V^{\pi}_{\theta} + N, \quad N \sim \mathcal{N}(0, \beta^{-1}I)$

Given observations $\mathcal{D}$ (LSTD solution):
$V^{\pi}_{\theta} = \hat{\Pi}\hat{T}^{\pi}V^{\pi}_{\theta} \;\Leftrightarrow\; \tilde{\Phi}^{\top}\tilde{R} = \tilde{\Phi}^{\top}(\tilde{\Phi} - \gamma\tilde{\Phi}')\theta + \tilde{\Phi}^{\top}N$

Linear regression model: $b = A\theta + \tilde{\Phi}^{\top}N$

Likelihood function: $p(b \mid \theta, \beta) = \mathcal{N}(b \mid A\theta, \beta^{-1}\tilde{C})$

✓ Maximum likelihood inference corresponds to the standard LSTD solution.

Prior distribution over $\theta$: $p(\theta \mid \alpha) = \mathcal{N}(\theta \mid 0, \alpha^{-1}I)$.
Logarithm of the posterior distribution:
$\ln p(\theta \mid \mathcal{D}) \propto -\frac{\beta}{2}E_{\mathcal{D}}(\theta) - \frac{\alpha}{2}\theta^{\top}\theta$,
where $E_{\mathcal{D}}(\theta) = (b - A\theta)^{\top}\tilde{C}^{-1}(b - A\theta)$ (MSPBE).

✓ Maximum a posteriori inference corresponds to $\ell_2$ regularization ($\lambda = \alpha/\beta$).

Posterior distribution: $p(\theta \mid \mathcal{D}) = \mathcal{N}(\theta \mid m, S)$, with
$S = (\alpha I + \beta\underbrace{A^{\top}\tilde{C}^{-1}A}_{\Sigma})^{-1}$ and $m = \beta S A^{\top}\tilde{C}^{-1}b$

Predictive distribution:
$p(V^{\pi}_{\theta}(s_*) \mid s_*, \mathcal{D}) = \int_{\theta} p(V^{\pi}_{\theta}(s_*) \mid \theta, s_*)\, dp(\theta \mid b, \alpha, \beta) = \mathcal{N}\!\left(V^{\pi}_{\theta}(s_*) \mid \phi(s_*)^{\top}m,\; \phi(s_*)^{\top}S\,\phi(s_*)\right)$
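A sketch that transcribes the Bayesian LSTD posterior and predictive formulas above, assuming $\alpha$ and $\beta$ are fixed and known (choosing them is what the variational treatment below removes); the function names are illustrative, not the authors' API.

```python
import numpy as np

def bayesian_lstd(Phi, R, Phi_next, gamma, alpha, beta):
    """Posterior N(theta | m, S) for the linear model b = A theta + Phi^T N."""
    A = Phi.T @ (Phi - gamma * Phi_next)
    b = Phi.T @ R
    C_inv = np.linalg.inv(Phi.T @ Phi)
    Sigma = A.T @ C_inv @ A
    k = A.shape[0]
    S = np.linalg.inv(alpha * np.eye(k) + beta * Sigma)   # posterior covariance
    m = beta * S @ A.T @ C_inv @ b                        # posterior mean
    return m, S

def predict_value(phi_s, m, S):
    """Predictive mean and variance of V(s*) for feature vector phi_s."""
    return phi_s @ m, phi_s @ S @ phi_s
```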

Variational Bayesian LSTD

Prior distribution: $p(\theta \mid \alpha) = \prod_{i=1}^{k} \mathcal{N}(\theta_i \mid 0, \alpha_i^{-1})$
Hyperprior over $\alpha$: $p(\alpha) = \prod_{i=1}^{k} \mathrm{Gamma}(\alpha_i \mid h_a, h_b)$
Hyperprior over $\beta$: $p(\beta) = \mathrm{Gamma}(\beta \mid h_c, h_d)$

Posterior distribution over $Z = \{\theta, \alpha, \beta\}$:
$p(\theta, \alpha, \beta \mid b) = \dfrac{p(b \mid \theta, \beta)\, p(\beta)\, p(\theta \mid \alpha)\, p(\alpha)}{p(b)}$

§ The marginal likelihood is analytically intractable
✓ We resort to variational inference

Variational distribution: $Q(Z) = Q_{\theta}(\theta)\, Q_{\alpha}(\alpha)\, Q_{\beta}(\beta)$
Optimal distribution for each factor:
$Q_{\theta}(\theta) = \mathcal{N}(\theta \mid m, S)$
$Q_{\beta}(\beta) = \mathrm{Gamma}(\beta \mid \tilde{c}, \tilde{d})$
$Q_{\alpha}(\alpha) = \prod_{i=1}^{k} \mathrm{Gamma}(\alpha_i \mid \tilde{a}_i, \tilde{b}_i)$

where
$S = (\mathrm{diag}\,\mathbb{E}[\alpha] + \mathbb{E}[\beta]\Sigma)^{-1}$, $m = \mathbb{E}[\beta]\, S A^{\top}\tilde{C}^{-1} b$,
$\tilde{a}_i = h_a + \tfrac{1}{2}$, $\tilde{b}_i = h_b + \tfrac{1}{2}\mathbb{E}[\theta_i^2]$,
$\tilde{c} = h_c + \tfrac{k}{2}$, $\tilde{d} = h_d + \tfrac{1}{2}\|b - Am\|^2_{\tilde{C}} + \tfrac{1}{2}\mathrm{tr}(\Sigma S)$.
These updates are iterated to convergence (see the sketch below).

Lower bound:
$\mathcal{L}(Q) = \tfrac{1}{2}\ln|S| - \tfrac{1}{2}\ln|\tilde{C}| + \sum_{i=1}^{k}\{\ln\Gamma(\tilde{a}_i) - \tilde{a}_i\ln\tilde{b}_i\} + \ln\Gamma(\tilde{c}) - \tilde{c}\ln\tilde{d} + \tfrac{k}{2}(1 - \ln 2\pi) - k\ln\Gamma(h_a) + k h_a \ln h_b - \ln\Gamma(h_c) + h_c \ln h_d$
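A minimal sketch of iterating the variational updates above. It assumes a shape/rate Gamma parameterization (so $\mathbb{E}[\alpha_i] = \tilde{a}_i/\tilde{b}_i$ and $\mathbb{E}[\beta] = \tilde{c}/\tilde{d}$), reads $\|b - Am\|^2_{\tilde{C}}$ as $(b - Am)^{\top}\tilde{C}^{-1}(b - Am)$, and picks its own initialization and hyperparameter defaults; none of these choices come from the poster.

```python
import numpy as np

def vb_lstd(Phi, R, Phi_next, gamma,
            h_a=1e-6, h_b=1e-6, h_c=1e-6, h_d=1e-6, n_iter=100):
    """Mean-field VB for the LSTD regression model b = A theta + noise."""
    A = Phi.T @ (Phi - gamma * Phi_next)
    b = Phi.T @ R
    C_inv = np.linalg.inv(Phi.T @ Phi)
    Sigma = A.T @ C_inv @ A
    k = A.shape[0]

    E_alpha = np.ones(k)   # E[alpha_i], initialised to 1 (assumption)
    E_beta = 1.0           # E[beta], initialised to 1 (assumption)
    for _ in range(n_iter):
        # Q(theta) = N(m, S)
        S = np.linalg.inv(np.diag(E_alpha) + E_beta * Sigma)
        m = E_beta * S @ A.T @ C_inv @ b
        E_theta2 = m**2 + np.diag(S)

        # Q(alpha_i) = Gamma(a_i, b_i), Q(beta) = Gamma(c, d)  (shape/rate)
        a_i, b_i = h_a + 0.5, h_b + 0.5 * E_theta2
        resid = b - A @ m
        c = h_c + 0.5 * k
        d = h_d + 0.5 * resid @ C_inv @ resid + 0.5 * np.trace(Sigma @ S)

        E_alpha, E_beta = a_i / b_i, c / d
    return m, S, E_alpha
```

Irrelevant features end up with large $\mathbb{E}[\alpha_i]$, driving the corresponding posterior weights toward zero, which is the sparsity mechanism the abstract refers to.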

Experimental Results

Corrupted Chain

• A 20-state, 2-action (left or right) MDP
• The probability that an action succeeds is 0.9
• A reward of 1 is given only at the two ends of the chain
• The horizon of each episode is set to 20
• Optimal policy: i) $s \le 10$: $a = L$; ii) $s > 10$: $a = R$
• Value function representation (a feature-map sketch follows below):
  $\phi(s) = (1, \mathrm{RBF}_1(s), \ldots, \mathrm{RBF}_5(s), x_1, \ldots, x_{600})$,
  where $x_i \sim \mathcal{N}(0, 1)$, $\forall i \in \{1, \ldots, 600\}$
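A sketch of the corrupted-chain feature map described above: a bias term, 5 Gaussian RBFs over the 20 states, and 600 noise features drawn from $\mathcal{N}(0, 1)$. The RBF centres and width, and drawing one fixed noise vector per state, are assumptions; the poster only fixes the feature counts and the noise distribution.

```python
import numpy as np

n_states, n_rbf, n_noise = 20, 5, 600
rng = np.random.default_rng(0)

# Assumed RBF layout: centres evenly spaced over the chain, fixed width.
centres = np.linspace(1, n_states, n_rbf)
width = (n_states - 1) / (n_rbf - 1)

# One fixed draw of noise features per state (assumption).
noise = rng.normal(size=(n_states, n_noise))

def phi(s):
    """Feature vector of length 1 + 5 + 600 = 606 for state s in {1, ..., 20}."""
    rbf = np.exp(-0.5 * ((s - centres) / width) ** 2)
    return np.concatenate(([1.0], rbf, noise[s - 1]))
```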

[Figure: Corrupted chain, approximation error $\|V^* - V_{\theta}\|_2$ (log scale) as a function of (left) the number of training episodes (10 to 100) and (right) the number of noisy features (0 to 1,000). Methods compared: $\ell_2$-LSTD, VBLSTD, OMPTD, LarsTD.]

[Figure: Mean weight values for LSTD, VBLSTD, OMPTD, and LarsTD after 10 training episodes (top row) and 100 training episodes (bottom row), plotted against the feature index on a log scale.]

The 606 mean weight values. The first weight is the bias term, the next 5 correspond to the relevant features (RBFs), and the remaining 600 correspond to the noise (irrelevant) features.