Reinforcement Learning (Machine Learning, SIR)
Matthieu Geist (CentraleSupelec), [email protected]
Figure: The perception-action cycle in reinforcement learning.
Applications

- playing games (Backgammon, Go, Tetris, Atari, ...)
- robotics, e.g. autonomous acrobatic helicopter control (http://heli.stanford.edu/)
- operations research (pricing, vehicle routing, ...)
- human-computer interaction (dialogue management, e-learning, ...)
- virtually any control problem (an old list: http://umichrl.pbworks.com/w/page/7597597/Successes_of_Reinforcement_Learning)
Outline

1 Formalism: Markov Decision Processes; Policy and value function; Bellman operators
2 Dynamic Programming: Linear programming; Value iteration; Policy iteration
3 Approximate Dynamic Programming: State-action value function; Approximate value iteration; Approximate policy iteration
4 Online learning: SARSA and Q-learning; The exploration-exploitation dilemma
5 Policy search and actor-critic methods: The policy gradient theorem; Actor-critic methods
1 Formalism: Markov Decision Processes
A Markov Decision Process (MDP) is a tuple $\{\mathcal{S}, \mathcal{A}, P, r, \gamma\}$ where:
- $\mathcal{S}$ is the (finite) state space;
- $\mathcal{A}$ is the (finite) action space;
- $P \in \Delta_{\mathcal{S}}^{\mathcal{S} \times \mathcal{A}}$ is the Markovian transition kernel; $P(s'|s,a)$ denotes the probability of transitioning to state $s'$ given that action $a$ was chosen in state $s$;
- $r \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$ is the reward function: $r(s,a)$ is the reward for taking action $a$ in state $s$; the reward function is assumed to be uniformly bounded;
- $\gamma \in (0,1)$ is a discount factor that favors shorter-term rewards (usually set to a value close to 1).
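To make these objects concrete, here is a minimal sketch (a hypothetical two-state, two-action MDP, not taken from the slides) representing $P$, $r$ and $\gamma$ as NumPy arrays; the later dynamic-programming sketches reuse this array layout.

```python
import numpy as np

# Hypothetical toy MDP with |S| = 2 states and |A| = 2 actions.
n_states, n_actions = 2, 2

# Transition kernel P[s, a, s'] = P(s' | s, a); each P[s, a, :] sums to 1.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0 under actions 0 and 1
    [[0.0, 1.0], [0.5, 0.5]],   # transitions from state 1 under actions 0 and 1
])

# Reward function r[s, a] = r(s, a), uniformly bounded.
r = np.array([
    [0.0, 1.0],
    [0.5, 0.0],
])

gamma = 0.95  # discount factor, close to 1
```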
1 Formalism: Policy and value function
Policy: $\pi \in \mathcal{A}^{\mathcal{S}}$; in state $s$, an agent applying policy $\pi$ chooses the action $\pi(s)$.

Value function (quantifies the quality of a policy):
$$v_\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(S_t, \pi(S_t)) \,\middle|\, S_0 = s,\ S_{t+1} \sim P(\cdot|S_t, \pi(S_t))\right].$$

Comparing policies (partial ordering):
$$\pi_1 \geq \pi_2 \ \Leftrightarrow\ \forall s \in \mathcal{S},\ v_{\pi_1}(s) \geq v_{\pi_2}(s).$$

Optimal policy:
$$\pi_* \in \arg\max_{\pi \in \mathcal{A}^{\mathcal{S}}} v_\pi.$$
1 Formalism: Bellman operators
Rewriting the Bellman equation

$$\begin{aligned}
v_\pi(s) &= \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(S_t, \pi(S_t)) \,\middle|\, S_0 = s,\ S_{t+1} \sim P(\cdot|S_t, \pi(S_t))\right] \\
&= r(s, \pi(s)) + \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^t r(S_t, \pi(S_t)) \,\middle|\, S_0 = s,\ S_{t+1} \sim P(\cdot|S_t, \pi(S_t))\right] \\
&= r(s, \pi(s)) + \gamma\, \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(S_{t+1}, \pi(S_{t+1})) \,\middle|\, S_0 = s,\ S_{t+1} \sim P(\cdot|S_t, \pi(S_t))\right] \\
\Leftrightarrow\ v_\pi(s) &= r(s, \pi(s)) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s, \pi(s))\, v_\pi(s').
\end{aligned}$$
Rewriting the Bellman equation (cont.)

Define the stochastic matrix $P_\pi \in \mathbb{R}^{\mathcal{S} \times \mathcal{S}}$ and the vector $r_\pi \in \mathbb{R}^{\mathcal{S}}$ as
$$P_\pi = \big(P(s'|s, \pi(s))\big)_{s, s' \in \mathcal{S}} \quad \text{and} \quad r_\pi = \big(r(s, \pi(s))\big)_{s \in \mathcal{S}}.$$
Using these notations, we have:
$$v_\pi = r_\pi + \gamma P_\pi v_\pi \ \Leftrightarrow\ v_\pi = (I - \gamma P_\pi)^{-1} r_\pi.$$
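A minimal sketch of exact policy evaluation through this linear-system form, assuming the array layout of the toy-MDP sketch above; np.linalg.solve is used rather than forming the inverse explicitly.

```python
import numpy as np

def evaluate_policy(P, r, gamma, pi):
    """Exact policy evaluation: v_pi = (I - gamma * P_pi)^{-1} r_pi."""
    n_states = P.shape[0]
    # Restrict the kernel and reward to the actions chosen by pi.
    P_pi = P[np.arange(n_states), pi]        # shape (|S|, |S|)
    r_pi = r[np.arange(n_states), pi]        # shape (|S|,)
    # Solve the linear system (I - gamma * P_pi) v = r_pi.
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Example with the toy MDP: the policy that always takes action 0.
# v = evaluate_policy(P, r, gamma, pi=np.array([0, 0]))
```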
Bellman evaluation operator

Define the Bellman evaluation operator $T_\pi : \mathbb{R}^{\mathcal{S}} \to \mathbb{R}^{\mathcal{S}}$ as
$$\forall v \in \mathbb{R}^{\mathcal{S}}, \quad T_\pi v = r_\pi + \gamma P_\pi v,$$
or equivalently, componentwise,
$$\forall s \in \mathcal{S}, \quad [T_\pi v](s) = r(s, \pi(s)) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s, \pi(s))\, v(s').$$
$T_\pi$ is a contraction (in supremum norm) and $v_\pi$ is its unique fixed point:
$$v_\pi = T_\pi v_\pi.$$
Optimal value function and policies

Assume that $v_* = v_{\pi_*}$ is known; an optimal policy is obtained by acting greedily with respect to $v_*$:
$$\pi_*(s) \in \arg\max_{a \in \mathcal{A}} \left(r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, v_*(s')\right).$$
Characterizing $v_*$:
$$\forall s \in \mathcal{S}, \quad v_*(s) = \max_{a \in \mathcal{A}} \left(r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, v_*(s')\right).$$
Bellman optimality operator

Define the Bellman optimality operator $T_* : \mathbb{R}^{\mathcal{S}} \to \mathbb{R}^{\mathcal{S}}$ as
$$\forall v \in \mathbb{R}^{\mathcal{S}}, \quad T_* v = \max_{\pi \in \mathcal{A}^{\mathcal{S}}} \left(r_\pi + \gamma P_\pi v\right),$$
or equivalently, componentwise,
$$\forall s \in \mathcal{S}, \quad [T_* v](s) = \max_{a \in \mathcal{A}} \left(r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, v(s')\right).$$
$T_*$ is a contraction (in supremum norm) and $v_*$ is its unique fixed point:
$$v_* = T_* v_*.$$
2 Dynamic Programming
DP: solve an MDP when the model is known.
In practice, the model is unknown and one has to rely on data.
Even so, related learning methods are often based on DP.
2 Dynamic Programming: Linear programming
$v_*$ is the solution of the following linear program:
$$\min_{v \in \mathbb{R}^{\mathcal{S}}}\ \mathbf{1}^\top v \quad \text{subject to} \quad v \geq T_* v.$$

Proof. If $v \geq T_* v$, then by monotonicity of $T_*$, iterating gives $v \geq T_*^k v \to v_*$, so $v \geq v_*$ and thus $\mathbf{1}^\top v \geq \mathbf{1}^\top v_*$; moreover $v_*$ is feasible since $v_* = T_* v_*$, hence it is the minimizer.
Algorithm 1: Linear programming

1: Solve
$$\min_{v \in \mathbb{R}^{\mathcal{S}}} \sum_{s \in \mathcal{S}} v(s) \quad \text{subject to} \quad v(s) \geq r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, v(s'), \quad \forall s \in \mathcal{S}, \forall a \in \mathcal{A},$$
and get $v_*$.
2: Return the policy $\pi_*$ defined as
$$\pi_*(s) \in \arg\max_{a \in \mathcal{A}} \left(r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, v_*(s')\right).$$
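A minimal sketch of this linear program with scipy.optimize.linprog, under the same assumed array layout as the toy-MDP sketch; the function and variable names are illustrative only.

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(P, r, gamma):
    """min_v sum(v) s.t. v(s) >= r(s,a) + gamma * sum_s' P(s'|s,a) v(s') for all (s,a)."""
    n_states, n_actions = r.shape
    c = np.ones(n_states)                                # objective: 1^T v
    A_ub, b_ub = [], []
    for s in range(n_states):
        for a in range(n_actions):
            # Constraint rewritten as -(e_s - gamma P(.|s,a)) . v <= -r(s,a).
            A_ub.append(-np.eye(n_states)[s] + gamma * P[s, a])
            b_ub.append(-r[s, a])
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n_states)      # v is unbounded in sign
    v_star = res.x
    q = r + gamma * P @ v_star                           # q[s, a] = r(s,a) + gamma sum_s' P v
    return v_star, q.argmax(axis=1)                      # optimal value and a greedy policy
```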
2 Dynamic Programming: Value iteration
- $T_*$ is a contraction: $\forall u, v \in \mathbb{R}^{\mathcal{S}},\ \|T_* u - T_* v\|_\infty \leq \gamma \|u - v\|_\infty$;
- $v_*$ is its unique fixed point: $T_* v_* = v_*$;
- Banach fixed-point theorem: for any $v_0$, the sequence $v_{k+1} = T_* v_k$ converges to $v_*$;
- natural stopping criterion: $\|v_{k+1} - v_k\|_\infty \leq \epsilon$;
- output a policy greedy with respect to $v_k$, $\pi_k \in \mathcal{G}(v_k)$:
$$\pi \in \mathcal{G}(v) \ \Leftrightarrow\ T_\pi v = T_* v \ \Leftrightarrow\ \forall s \in \mathcal{S},\ \pi(s) \in \arg\max_{a \in \mathcal{A}} \left(r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, v(s')\right).$$
Algorithm 2: Value iteration

Require: an initial $v_0 \in \mathbb{R}^{\mathcal{S}}$, a stopping criterion $\epsilon$
1: $k = 0$
2: repeat
3:   for all $s \in \mathcal{S}$ do
4:     $v_{k+1}(s) = \max_{a \in \mathcal{A}} \left(r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, v_k(s')\right)$
5:   end for
6:   $k \leftarrow k + 1$
7: until $\|v_k - v_{k-1}\|_\infty \leq \epsilon$
8: return a policy $\pi_k \in \mathcal{G}(v_k)$:
$$\pi_k(s) \in \arg\max_{a \in \mathcal{A}} \left(r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, v_k(s')\right).$$
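A minimal NumPy sketch of Algorithm 2, again assuming the toy-MDP array layout (P[s, a, s'], r[s, a]).

```python
import numpy as np

def value_iteration(P, r, gamma, eps=1e-8):
    """Value iteration: iterate v <- T_* v until the sup-norm change is <= eps."""
    v = np.zeros(r.shape[0])
    while True:
        q = r + gamma * P @ v          # q[s, a] = r(s,a) + gamma * sum_s' P(s'|s,a) v(s')
        v_next = q.max(axis=1)         # [T_* v](s)
        if np.max(np.abs(v_next - v)) <= eps:
            return v_next, q.argmax(axis=1)   # value estimate and a greedy policy
        v = v_next
```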
Quality of the obtained solution?

Stop iterations when $\|v_{k+1} - v_k\|_\infty \leq \epsilon$.

Guarantee on the value function $v_k$:
$$\|v_* - v_k\|_\infty \leq \frac{1}{1-\gamma}\, \epsilon.$$

Guarantee on the policy $\pi_k$:
$$\|v_* - v_{\pi_k}\|_\infty \leq \frac{2\gamma}{(1-\gamma)^2}\, \epsilon.$$
2 Dynamic Programming: Policy iteration
Let $\pi$ be any policy and $v_\pi$ its value function.

Let $\pi'$ be greedy with respect to $v_\pi$, $\pi' \in \mathcal{G}(v_\pi)$:
$$\forall s \in \mathcal{S}, \quad \pi'(s) \in \arg\max_{a \in \mathcal{A}} \left(r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, v_\pi(s')\right).$$

Then $\pi'$ is at least as good a policy as $\pi$:
$$v_{\pi'} \geq v_\pi.$$

This suggests the following algorithmic scheme; iterate:
1. policy evaluation: solve $T_{\pi_k} v_{\pi_k} = v_{\pi_k}$;
2. policy improvement: compute $\pi_{k+1} \in \mathcal{G}(v_{\pi_k})$.
Algorithm 3: Policy iteration

Require: an initial $\pi_0 \in \mathcal{A}^{\mathcal{S}}$
1: $k = 0$
2: repeat
3:   solve (policy evaluation)
$$v_k(s) = r(s, \pi_k(s)) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s, \pi_k(s))\, v_k(s'), \quad \forall s \in \mathcal{S}.$$
4:   compute (policy improvement)
$$\pi_{k+1}(s) \in \arg\max_{a \in \mathcal{A}} \left(r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, v_k(s')\right).$$
5:   $k \leftarrow k + 1$
6: until $\pi_k = \pi_{k-1}$
7: return the policy $\pi_k = \pi_*$
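A minimal NumPy sketch of Algorithm 3 under the same assumptions; it stops when the greedy policy no longer changes, which matches the stopping test above.

```python
import numpy as np

def policy_iteration(P, r, gamma):
    """Policy iteration: alternate exact evaluation and greedy improvement."""
    n_states = r.shape[0]
    pi = np.zeros(n_states, dtype=int)            # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi.
        P_pi = P[np.arange(n_states), pi]
        r_pi = r[np.arange(n_states), pi]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to v.
        pi_next = (r + gamma * P @ v).argmax(axis=1)
        if np.array_equal(pi_next, pi):
            return pi, v
        pi = pi_next
```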
3 Approximate Dynamic Programming
DP requires:
- the state and action spaces to be small enough;
- the model to be known.

Unfortunately:
- the state space can be too large (even continuous) for the value function to be represented exactly, so one uses a parametric representation, e.g. a linear one:
$$v_\theta(s) = \theta^\top \phi(s) = \sum_{i=1}^{d} \theta_i \phi_i(s);$$
- the model might be unknown and one has to rely on a dataset
$$\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{1 \leq i \leq n}.$$
  The dataset can be obtained in multiple ways. The evaluation operator can be sampled (assume here $a_i = \pi(s_i)$):
$$[\widehat{T}_\pi v](s_i) = r_i + \gamma v(s'_i),$$
  and this sampled operator is unbiased:
$$\mathbb{E}\big[[\widehat{T}_\pi v](s_i) \,\big|\, s_i\big] = \mathbb{E}_{S' \sim P(\cdot|s_i, a_i)}[r_i + \gamma v(S')] = [T_\pi v](s_i).$$
3 Approximate Dynamic Programming: State-action value function
Problems with value functions

Computing a greedy policy requires knowing the model:
$$\pi \in \mathcal{G}(v) \ \Leftrightarrow\ \forall s \in \mathcal{S},\ \pi(s) \in \arg\max_{a \in \mathcal{A}} \left(r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, v(s')\right).$$

Sampling the optimality operator?
- Optimality operator:
$$[T_* v](s) = \max_{a \in \mathcal{A}} \mathbb{E}_{S' \sim P(\cdot|s,a)}[r(s,a) + \gamma v(S')];$$
- with $s'_{i,a} \sim P(\cdot|s_i, a)$, a possible sampled operator is
$$[\widehat{T}_* v](s_i) = \max_{a \in \mathcal{A}} \left(r(s_i,a) + \gamma v(s'_{i,a})\right);$$
- it is biased: $\mathbb{E}\big[[\widehat{T}_* v](s_i) \,\big|\, s_i\big] \neq [T_* v](s_i)$ (the expectation of a maximum is not the maximum of expectations).
State-action value function (also called Q-function or quality function):
$$Q_\pi(s,a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(S_t, A_t) \,\middle|\, S_0 = s,\ A_0 = a,\ S_{t+1} \sim P(\cdot|S_t, A_t),\ A_{t+1} = \pi(S_{t+1})\right].$$

Bellman evaluation operator $T_\pi : \mathbb{R}^{\mathcal{S} \times \mathcal{A}} \to \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$:
- definition: $[T_\pi Q](s,a) = r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, Q(s', \pi(s'))$;
- $Q_\pi$ is its unique fixed point: $T_\pi Q_\pi = Q_\pi$;
- link to $v_\pi$: $v_\pi(s) = Q_\pi(s, \pi(s))$.

Bellman optimality operator $T_* : \mathbb{R}^{\mathcal{S} \times \mathcal{A}} \to \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$:
- definition: $[T_* Q](s,a) = r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a) \max_{a' \in \mathcal{A}} Q(s', a')$;
- $Q_*$ is its unique fixed point: $Q_* = T_* Q_*$;
- link to $v_*$: $v_*(s) = \max_{a \in \mathcal{A}} Q_*(s,a)$.
The Q-function allows acting greedily without the model:
- with respect to $v_\pi(s) = Q_\pi(s, \pi(s))$:
$$\pi' \in \mathcal{G}(v_\pi) \ \Leftrightarrow\ \forall s \in \mathcal{S},\ \pi'(s) \in \arg\max_{a \in \mathcal{A}} \left(r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, v_\pi(s')\right) \ \Leftrightarrow\ \forall s \in \mathcal{S},\ \pi'(s) \in \arg\max_{a \in \mathcal{A}} Q_\pi(s,a);$$
- with respect to $v_*$:
$$\pi_*(s) \in \arg\max_{a \in \mathcal{A}} Q_*(s,a);$$
- with respect to any $Q \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$:
$$\pi \in \mathcal{G}(Q) \ \Leftrightarrow\ \forall s \in \mathcal{S},\ \pi(s) \in \arg\max_{a \in \mathcal{A}} Q(s,a).$$
The Q-function also allows sampling the related operators easily. Recall the dataset
$$\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{1 \leq i \leq n}.$$
- Sampled Bellman evaluation operator:
$$[\widehat{T}_\pi Q](s_i, a_i) = r_i + \gamma Q(s'_i, \pi(s'_i));$$
- sampled Bellman optimality operator:
$$[\widehat{T}_* Q](s_i, a_i) = r_i + \gamma \max_{a' \in \mathcal{A}} Q(s'_i, a').$$

Features for the Q-function:
- linear parameterization of the Q-function: $Q_\theta(s,a) = \theta^\top \phi(s,a)$ with, for instance,
$$\phi(s,a) = \left[\delta_{a=a_1} \phi(s)^\top \ \dots\ \delta_{a=a_{|\mathcal{A}|}} \phi(s)^\top\right]^\top;$$
- other representations are possible.
3 Approximate Dynamic Programming: Approximate value iteration
Value iteration: $Q_{k+1} = T_* Q_k$.
- $T_*$ cannot be applied exactly, the model being unknown;
- with a large space and $Q_k \in \mathcal{H}$, there is no reason for $T_* Q_k \in \mathcal{H}$ to hold.

Approximate value iteration (an introductory example):
- linear parameterization for the Q-functions:
$$\mathcal{H} = \{Q_\theta(s,a) = \theta^\top \phi(s,a),\ \theta \in \mathbb{R}^d\};$$
- writing $Q_k = Q_{\theta_k}$, sampled operator:
$$[\widehat{T}_* Q_k](s_i, a_i) = r_i + \gamma \max_{a' \in \mathcal{A}} Q_k(s'_i, a');$$
- search for the $Q \in \mathcal{H}$ closest to $\widehat{T}_* Q_k$:
$$Q_{k+1} \in \arg\min_{Q_\theta \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \left(Q_\theta(s_i, a_i) - [\widehat{T}_* Q_k](s_i, a_i)\right)^2;$$
- in summary: $Q_{k+1} = \Pi T_* Q_k$, with $\Pi$ the projection onto $\mathcal{H}$.
Abstraction

Approximate value iteration:
$$Q_{k+1} = A T_* Q_k,$$
where $A$ is an abstract approximation operator.
- $A T_*$ should be a contraction!
- Otherwise, divergence can (and will) occur.
- This is not the case for the least-squares projection $\Pi T_*$... do not implement the introductory example!
- It is true if averagers are used for function approximation, such as:
  - ensembles of trees, notably extremely randomized trees (fitted-Q iteration);
  - kernel averagers (Nadaraya-Watson).
Algorithm 4: Approximate value iteration

Require: a dataset $\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{1 \leq i \leq n}$, the number $K$ of iterations, a function approximator, an initial state-action value function $Q_0$
1: for $k = 0$ to $K$ do
2:   apply the sampled Bellman optimality operator to the function $Q_k$:
$$[\widehat{T}_* Q_k](s_i, a_i) = r_i + \gamma \max_{a' \in \mathcal{A}} Q_k(s'_i, a').$$
3:   solve the regression problem with inputs $(s_i, a_i)$ and outputs $[\widehat{T}_* Q_k](s_i, a_i)$ to get the Q-function $Q_{k+1}$
4: end for
5: return the greedy policy $\pi_{K+1} \in \mathcal{G}(Q_{K+1})$:
$$\forall s \in \mathcal{S}, \quad \pi_{K+1}(s) \in \arg\max_{a \in \mathcal{A}} Q_{K+1}(s,a).$$
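A minimal sketch of Algorithm 4 as fitted-Q iteration with an averager-style regressor (scikit-learn's ExtraTreesRegressor); the dataset layout and the way discrete actions are appended to the state features are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(s, a, r, s_next, n_actions, gamma, K=50):
    """Approximate value iteration (fitted-Q) on a dataset of transitions.

    s, s_next: (n, state_dim) arrays; a: (n,) int array; r: (n,) array.
    """
    n = len(r)
    X = np.hstack([s, a.reshape(-1, 1)])            # regression inputs (s_i, a_i)
    q = None
    for _ in range(K):
        if q is None:
            y = r                                   # Q_0 = 0, so the first targets are the rewards
        else:
            # max_a' Q_k(s'_i, a'), evaluated by trying every discrete action.
            q_next = np.column_stack([
                q.predict(np.hstack([s_next, np.full((n, 1), a_prime)]))
                for a_prime in range(n_actions)
            ])
            y = r + gamma * q_next.max(axis=1)      # sampled T_* Q_k
        q = ExtraTreesRegressor(n_estimators=50).fit(X, y)
    return q  # greedy policy: argmax over a of q.predict([s, a])
```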
3 Approximate Dynamic Programming: Approximate policy iteration
Policy iteration:
1. policy evaluation: solve the fixed-point equation $Q_{\pi_k} = T_{\pi_k} Q_{\pi_k}$;
2. policy improvement: compute the greedy policy $\pi_{k+1} \in \mathcal{G}(Q_{\pi_k})$.

Approximate policy iteration:
1. approximate policy evaluation: find a function $Q_k \in \mathcal{H}$ such that $Q_k \approx T_{\pi_k} Q_k$;
2. policy improvement: compute the greedy policy $\pi_{k+1} \in \mathcal{G}(Q_k)$.
Algorithm 5: Approximate policy iteration

Require: an initial $\pi_0 \in \mathcal{A}^{\mathcal{S}}$ (possibly an initial $Q_0$ and $\pi_0 \in \mathcal{G}(Q_0)$), the number of iterations $K$
1: for $k = 0$ to $K$ do
2:   approximate policy evaluation: find $Q_k \in \mathcal{H}$ such that $Q_k \approx T_{\pi_k} Q_k$
3:   policy improvement: $\pi_{k+1} \in \mathcal{G}(Q_k)$
4: end for
5: return the policy $\pi_{K+1}$

Problem: how to find an approximate fixed point of $T_\pi$, that is, a function $Q_\theta \in \mathcal{H}$ such that $Q_\theta \approx T_\pi Q_\theta$?
Monte Carlo rollouts

An approximate fixed point of $T_\pi$ is an approximation of $Q_\pi$. If $Q_\pi$ were known, this would simply be a regression problem; for example, linear least-squares:
$$\min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \left(Q_\pi(s_i, a_i) - Q_\theta(s_i, a_i)\right)^2.$$

$Q_\pi$ is (obviously) unknown. Monte Carlo rollout:
- sample a full trajectory starting in $s_i$ where action $a_i$ is chosen first, all subsequent states being sampled according to the system dynamics and all subsequent actions being chosen according to $\pi$; write $q_i$ the associated discounted cumulative reward;
- this gives an unbiased estimate: $\mathbb{E}[q_i \,|\, s_i, a_i] = Q_\pi(s_i, a_i)$;
- replace $Q_\pi(s_i, a_i)$ by the unbiased estimate $q_i$:
$$\min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \left(q_i - Q_\theta(s_i, a_i)\right)^2.$$

Drawbacks: this requires a simulator, and rollouts can be quite noisy.
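A minimal sketch of one such rollout estimate, assuming a hypothetical simulator interface step(s, a) -> (reward, next_state) and a deterministic policy pi(s); a finite horizon truncates the discounted sum.

```python
def rollout_q_estimate(step, pi, s, a, gamma, horizon=200):
    """Monte Carlo estimate of Q_pi(s, a), unbiased up to the horizon truncation.

    step(s, a) -> (reward, next_state) is a hypothetical simulator interface.
    """
    q, discount = 0.0, 1.0
    for _ in range(horizon):
        reward, s = step(s, a)
        q += discount * reward
        discount *= gamma
        a = pi(s)              # all subsequent actions follow the evaluated policy
    return q
```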
Residual approach

Idea: minimize the residual $\|Q_\theta - T_\pi Q_\theta\|$ for some norm.

With an $\ell_2$-loss, a parametric representation and the sampled operator:
$$\min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \left([\widehat{T}_\pi Q_\theta](s_i, a_i) - Q_\theta(s_i, a_i)\right)^2 = \min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \left(r_i + \gamma Q_\theta(s'_i, \pi(s'_i)) - Q_\theta(s_i, a_i)\right)^2.$$

However, there is a bias problem:
$$\mathbb{E}\left[\left([\widehat{T}_\pi Q_\theta](s_i, a_i) - Q_\theta(s_i, a_i)\right)^2 \,\middle|\, s_i, a_i\right] = \left([T_\pi Q_\theta](s_i, a_i) - Q_\theta(s_i, a_i)\right)^2 + \operatorname{var}\left([\widehat{T}_\pi Q_\theta](s_i, a_i) \,\middle|\, s_i, a_i\right) \neq \left([T_\pi Q_\theta](s_i, a_i) - Q_\theta(s_i, a_i)\right)^2.$$
Least-Squares Temporal Differences

Idea: solve
$$Q_\theta = \Pi T_\pi Q_\theta.$$
As a nested optimization problem:
$$\begin{cases}
w_\theta = \arg\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \left(r_i + \gamma Q_\theta(s'_i, \pi(s'_i)) - Q_w(s_i, a_i)\right)^2, \\
\theta_n = \arg\min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \left(Q_\theta(s_i, a_i) - Q_{w_\theta}(s_i, a_i)\right)^2.
\end{cases}$$
Least-Squares Temporal Differences (cont.)

Optimization problem (linear parameterization assumed):
$$\begin{cases}
w_\theta = \arg\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \left(r_i + \gamma Q_\theta(s'_i, \pi(s'_i)) - Q_w(s_i, a_i)\right)^2, \\
\theta_n = \arg\min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \left(Q_\theta(s_i, a_i) - Q_{w_\theta}(s_i, a_i)\right)^2.
\end{cases}$$

First equation, a linear least-squares problem in $w$:
$$w_\theta = \left(\sum_{i=1}^{n} \phi(s_i, a_i)\, \phi(s_i, a_i)^\top\right)^{-1} \sum_{i=1}^{n} \phi(s_i, a_i)\left(r_i + \gamma \theta^\top \phi(s'_i, \pi(s'_i))\right).$$

Second equation, minimized for $\theta = w_\theta$:
$$\theta_n = w_{\theta_n} \ \Leftrightarrow\ \theta_n = \left(\sum_{i=1}^{n} \phi(s_i, a_i)\, \phi(s_i, a_i)^\top\right)^{-1} \sum_{i=1}^{n} \phi(s_i, a_i)\left(r_i + \gamma \theta_n^\top \phi(s'_i, \pi(s'_i))\right)
\ \Leftrightarrow\ \theta_n = \left(\sum_{i=1}^{n} \phi(s_i, a_i)\left(\phi(s_i, a_i) - \gamma \phi(s'_i, \pi(s'_i))\right)^\top\right)^{-1} \sum_{i=1}^{n} \phi(s_i, a_i)\, r_i.$$

Approximate policy iteration with LSTD as the evaluation step is named LSPI (Least-Squares Policy Iteration).
Algorithm 6: Least-squares policy iteration (LSPI)

Require: an initial $\pi_0 \in \mathcal{A}^{\mathcal{S}}$ (possibly an initial $Q_0$ and $\pi_0 \in \mathcal{G}(Q_0)$), the number of iterations $K$
1: for $k = 0$ to $K$ do
2:   approximate policy evaluation (LSTD):
$$\theta_k = \left(\sum_{i=1}^{n} \phi(s_i, a_i)\left(\phi(s_i, a_i) - \gamma \phi(s'_i, \pi_k(s'_i))\right)^\top\right)^{-1} \sum_{i=1}^{n} \phi(s_i, a_i)\, r_i.$$
3:   policy improvement: $\pi_{k+1} \in \mathcal{G}(Q_{\theta_k})$
4: end for
5: return the policy $\pi_{K+1}$
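A minimal NumPy sketch of the LSTD estimator of step 2, assuming a feature map phi(s, a) returning a d-dimensional vector; a small ridge term (an added assumption) keeps the matrix invertible. The improvement step then acts greedily on $Q_\theta(s,a) = \theta^\top \phi(s,a)$.

```python
import numpy as np

def lstd(phi, transitions, pi, gamma, d, reg=1e-6):
    """LSTD: theta = (sum_i phi_i (phi_i - gamma phi'_i)^T)^{-1} sum_i phi_i r_i.

    transitions: iterable of (s, a, r, s_next); phi(s, a) -> feature vector of size d.
    """
    A = reg * np.eye(d)          # small ridge term for numerical stability
    b = np.zeros(d)
    for s, a, r, s_next in transitions:
        f = phi(s, a)
        f_next = phi(s_next, pi(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)
```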
Approximating the policy

Idea: instead of generalizing the Q-function, generalize the policy.

Motivation: a policy might be easier to learn than a Q-function.

At iteration $k$, let $\mathcal{F} \subset \mathcal{A}^{\mathcal{S}}$ be a hypothesis space of policies, assume that the $Q_{\pi_k}(s_i, a)$ are known, and solve the cost-sensitive multi-class classification problem
$$\pi_{k+1} \in \arg\min_{\pi \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \left(\max_{a \in \mathcal{A}} Q_{\pi_k}(s_i, a) - Q_{\pi_k}(s_i, \pi(s_i))\right).$$

In practice, replace $Q_{\pi_k}(s_i, a)$ by a Monte Carlo rollout.

This scheme is often called DPI, for Direct Policy Iteration.
4 Online learning
For ADP, we assumed that the dataset was provided.

What about online learning?
- It requires an online learner;
- there is a dilemma between exploration and exploitation.
4 Online learning: SARSA and Q-learning
SARSA

Goal: online estimation of $Q_\pi$, for a given $\pi$.

Assume a linear parameterization and that $Q_\pi$ is known; the risk of interest is
$$\min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \left(Q_\pi(s_i, a_i) - Q_\theta(s_i, a_i)\right)^2.$$
Minimize it with a stochastic gradient descent:
$$\theta_{i+1} = \theta_i - \frac{\alpha_i}{2} \nabla \left(Q_\pi(s_i, a_i) - Q_{\theta_i}(s_i, a_i)\right)^2 = \theta_i + \alpha_i \phi(s_i, a_i)\left(Q_\pi(s_i, a_i) - \theta_i^\top \phi(s_i, a_i)\right).$$
$Q_\pi(s_i, a_i)$ is unknown; bootstrap it (with $s_{i+1} \sim P(\cdot|s_i, a_i)$ and $a_{i+1} = \pi(s_{i+1})$):
$$Q_\pi(s_i, a_i) \to [\widehat{T}_\pi Q_{\theta_i}](s_i, a_i) = r_i + \gamma Q_{\theta_i}(s_{i+1}, a_{i+1}).$$
Replace $Q_\pi(s_i, a_i)$ by this estimate in the update rule:
$$\theta_{i+1} = \theta_i + \alpha_i \phi(s_i, a_i)\left(r_i + \gamma Q_{\theta_i}(s_{i+1}, a_{i+1}) - Q_{\theta_i}(s_i, a_i)\right) = \theta_i + \alpha_i \phi(s_i, a_i)\left(r_i + \gamma \theta_i^\top \phi(s_{i+1}, a_{i+1}) - \theta_i^\top \phi(s_i, a_i)\right).$$
This is called a temporal difference algorithm.
SARSA (cont.)

Algorithm 7: SARSA

Require: an initial parameter vector $\theta_0$, the initial state $s_0$, an initial action $a_0$, the learning rates $(\alpha_i)_{i \geq 0}$
1: $i = 0$
2: while true do
3:   apply action $a_i$ in state $s_i$
4:   get the reward $r_i$ and observe the new state $s_{i+1}$
5:   choose the action $a_{i+1}$ to be applied in state $s_{i+1}$
6:   update the parameter vector of the Q-function according to the transition $(s_i, a_i, r_i, s_{i+1}, a_{i+1})$:
$$\theta_{i+1} = \theta_i + \alpha_i \phi(s_i, a_i)\left(r_i + \gamma \theta_i^\top \phi(s_{i+1}, a_{i+1}) - \theta_i^\top \phi(s_i, a_i)\right)$$
7:   $i \leftarrow i + 1$
8: end while

Remark: SARSA is an on-policy algorithm.
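A minimal sketch of the SARSA loop with linear features, assuming a hypothetical environment interface env.reset() -> s and env.step(a) -> (next_state, reward), and a behaviour policy behave(theta, s) (e.g. epsilon-greedy, see the exploration slide further below) that selects actions.

```python
import numpy as np

def sarsa(env, behave, phi, d, gamma, alpha=0.05, n_steps=100_000):
    """On-policy temporal-difference learning of Q with linear features.

    Assumed interface: env.reset() -> s, env.step(a) -> (s_next, r);
    behave(theta, s) -> action; phi(s, a) -> feature vector of size d.
    """
    theta = np.zeros(d)
    s = env.reset()
    a = behave(theta, s)
    for _ in range(n_steps):
        s_next, r = env.step(a)
        a_next = behave(theta, s_next)     # on-policy: next action from the same policy
        td_error = r + gamma * theta @ phi(s_next, a_next) - theta @ phi(s, a)
        theta = theta + alpha * td_error * phi(s, a)
        s, a = s_next, a_next
    return theta
```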
Q-learning

Goal: direct online estimation of $Q_*$.

Assume a linear parameterization and that $Q_*$ is known; the risk of interest is
$$\min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \left(Q_*(s_i, a_i) - Q_\theta(s_i, a_i)\right)^2.$$
Minimize it with a stochastic gradient descent:
$$\theta_{i+1} = \theta_i - \frac{\alpha_i}{2} \nabla \left(Q_*(s_i, a_i) - Q_{\theta_i}(s_i, a_i)\right)^2 = \theta_i + \alpha_i \phi(s_i, a_i)\left(Q_*(s_i, a_i) - \theta_i^\top \phi(s_i, a_i)\right).$$
$Q_*(s_i, a_i)$ is unknown; bootstrap it (with $s_{i+1} \sim P(\cdot|s_i, a_i)$):
$$Q_*(s_i, a_i) \to [\widehat{T}_* Q_{\theta_i}](s_i, a_i) = r_i + \gamma \max_{a \in \mathcal{A}} Q_{\theta_i}(s_{i+1}, a).$$
Replace $Q_*(s_i, a_i)$ by this estimate in the update rule:
$$\theta_{i+1} = \theta_i + \alpha_i \phi(s_i, a_i)\left(r_i + \gamma \max_{a \in \mathcal{A}} Q_{\theta_i}(s_{i+1}, a) - Q_{\theta_i}(s_i, a_i)\right) = \theta_i + \alpha_i \phi(s_i, a_i)\left(r_i + \gamma \max_{a \in \mathcal{A}}\left(\theta_i^\top \phi(s_{i+1}, a)\right) - \theta_i^\top \phi(s_i, a_i)\right).$$
Q-learning (cont.)

Algorithm 8: Q-learning

Require: an initial parameter vector $\theta_0$, the initial state $s_0$, the learning rates $(\alpha_i)_{i \geq 0}$
1: $i = 0$
2: while true do
3:   choose the action $a_i$ to be applied in state $s_i$
4:   apply action $a_i$ in state $s_i$
5:   get the reward $r_i$ and observe the new state $s_{i+1}$
6:   update the parameter vector of the Q-function according to the transition $(s_i, a_i, r_i, s_{i+1})$:
$$\theta_{i+1} = \theta_i + \alpha_i \phi(s_i, a_i)\left(r_i + \gamma \max_{a \in \mathcal{A}}\left(\theta_i^\top \phi(s_{i+1}, a)\right) - \theta_i^\top \phi(s_i, a_i)\right)$$
7:   $i \leftarrow i + 1$
8: end while

Remark: Q-learning is an off-policy algorithm.
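The corresponding sketch for Q-learning, under the same assumed environment and feature interface as the SARSA sketch; only the bootstrap target changes (a max over actions instead of the next on-policy action).

```python
import numpy as np

def q_learning(env, behave, phi, actions, d, gamma, alpha=0.05, n_steps=100_000):
    """Off-policy temporal-difference learning of Q_* with linear features."""
    theta = np.zeros(d)
    s = env.reset()
    for _ in range(n_steps):
        a = behave(theta, s)                                    # any exploratory behaviour policy
        s_next, r = env.step(a)
        q_next = max(theta @ phi(s_next, b) for b in actions)   # max_a' Q_theta(s', a')
        td_error = r + gamma * q_next - theta @ phi(s, a)
        theta = theta + alpha * td_error * phi(s, a)
        s = s_next
    return theta
```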
4 Online learning: The exploration-exploitation dilemma
With SARSA or Q-learning, what action should be applied?
- acting always greedily is not wise;
- there is a dilemma between exploration and exploitation.

$\epsilon$-greedy policy:
$$\pi_\epsilon(s) = \begin{cases} \arg\max_{a \in \mathcal{A}} Q_\theta(s,a) & \text{with probability } 1 - \epsilon, \\ \text{a random action} & \text{with probability } \epsilon. \end{cases}$$

Softmax (stochastic) policy ($\tau$ is the temperature parameter):
$$\pi_\tau(a|s) = \frac{e^{\frac{1}{\tau} Q_\theta(s,a)}}{\sum_{a' \in \mathcal{A}} e^{\frac{1}{\tau} Q_\theta(s,a')}}.$$

Other schemes exist.
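A minimal sketch of these two action-selection rules for a linear Q-function; the feature map phi and the explicit list of actions are assumptions carried over from the earlier sketches.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(theta, phi, s, actions, eps=0.1):
    """Greedy action with probability 1 - eps, uniform random action with probability eps."""
    if rng.random() < eps:
        return actions[rng.integers(len(actions))]
    q = np.array([theta @ phi(s, a) for a in actions])
    return actions[int(q.argmax())]

def softmax_policy(theta, phi, s, actions, tau=1.0):
    """Sample an action with probability proportional to exp(Q(s, a) / tau)."""
    q = np.array([theta @ phi(s, a) for a in actions])
    p = np.exp((q - q.max()) / tau)            # subtract the max for numerical stability
    p /= p.sum()
    return actions[rng.choice(len(actions), p=p)]
```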
5 Policy search and actor-critic methods
What about continuous actions (where the max and argmax become problematic)?

Policy search: parameterize the policy and search directly in the policy space.

We will use stochastic policies:
- a stochastic policy $\pi \in \Delta_{\mathcal{A}}^{\mathcal{S}}$ associates to each state $s$ a conditional probability distribution over actions, $\pi(\cdot|s)$;
- everything defined so far extends naturally to stochastic policies:
$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s)\left(r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, v_\pi(s')\right),$$
$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s)\, Q_\pi(s,a), \qquad Q_\pi(s,a) = r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, v_\pi(s').$$
Examples of parameterized policies:
- discrete actions (softmax):
$$\pi_\theta(a|s) = \frac{e^{\theta^\top \phi(s,a)}}{\sum_{a' \in \mathcal{A}} e^{\theta^\top \phi(s,a')}};$$
- continuous (here one-dimensional) actions (Gaussian):
$$\pi_\theta(a|s) \propto e^{-\frac{1}{2}\left(\frac{a - \theta^\top \phi(s)}{\sigma}\right)^2}.$$

The policy search problem:
- let $\nu \in \Delta_{\mathcal{S}}$ be a user-defined distribution over states;
- solve
$$\max_{\theta \in \mathbb{R}^d} J(\theta) \quad \text{with} \quad J(\theta) = \sum_{s \in \mathcal{S}} \nu(s)\, v_{\pi_\theta}(s) = \mathbb{E}_{S \sim \nu}[v_{\pi_\theta}(S)].$$

Difference with (approximate) dynamic programming:
- DP: find a policy that maximizes the value for every state;
- policy search: find a policy that maximizes the value on average (under $\nu$).
5 Policy search and actor-critic methods: The policy gradient theorem
A natural approach, gradient ascent:
$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta).$$
How to compute the gradient $\nabla_\theta J(\theta)$?

Define $d_{\nu,\pi} \in \Delta_{\mathcal{S}}$, the $\gamma$-weighted occupancy measure:
$$d_{\nu,\pi} = (1-\gamma)\, \nu^\top (I - \gamma P_\pi)^{-1}.$$

Theorem (Policy gradient). Let $\pi_\theta$ be such that $\pi_\theta(a|s) > 0$ for all $s, a$. We have
$$\nabla_\theta J(\theta) = \frac{1}{1-\gamma} \sum_{s \in \mathcal{S}} d_{\nu,\pi}(s) \sum_{a \in \mathcal{A}} \pi_\theta(a|s)\, Q_{\pi_\theta}(s,a)\, \nabla_\theta \ln \pi_\theta(a|s) = \frac{1}{1-\gamma}\, \mathbb{E}_{S \sim d_{\nu,\pi},\, A \sim \pi_\theta(\cdot|S)}\left[Q_{\pi_\theta}(S,A)\, \nabla_\theta \ln \pi_\theta(A|S)\right].$$
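A minimal sketch of a Monte Carlo estimate of this gradient for a softmax policy, assuming samples (s_i, a_i, q_i) with s_i drawn from d_{nu,pi}, a_i drawn from pi_theta(.|s_i), and q_i a rollout estimate of Q_{pi_theta}(s_i, a_i); the feature map phi is an assumption.

```python
import numpy as np

def softmax_probs(theta, phi, s, actions):
    """pi_theta(.|s) for the softmax policy pi_theta(a|s) proportional to exp(theta^T phi(s, a))."""
    logits = np.array([theta @ phi(s, a) for a in actions])
    p = np.exp(logits - logits.max())
    return p / p.sum()

def grad_log_softmax(theta, phi, s, a, actions):
    """grad_theta ln pi_theta(a|s) = phi(s, a) - sum_a' pi_theta(a'|s) phi(s, a')."""
    p = softmax_probs(theta, phi, s, actions)
    mean_feat = sum(p_a * phi(s, a_prime) for p_a, a_prime in zip(p, actions))
    return phi(s, a) - mean_feat

def policy_gradient_estimate(theta, phi, samples, actions, gamma):
    """Monte Carlo estimate of grad J(theta) from samples (s_i, a_i, q_i)."""
    grads = [q * grad_log_softmax(theta, phi, s, a, actions) for s, a, q in samples]
    return np.mean(grads, axis=0) / (1.0 - gamma)
```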
5 Policy search and actor-critic methods: Actor-critic methods
Policy gradient:
$$\nabla_\theta J(\theta) = \frac{1}{1-\gamma}\, \mathbb{E}_{S \sim d_{\nu,\pi},\, A \sim \pi_\theta(\cdot|S)}\left[Q_{\pi_\theta}(S,A)\, \nabla_\theta \ln \pi_\theta(A|S)\right].$$
- $Q_\pi$ can be estimated pointwise using Monte Carlo rollouts.
- Policy search is called an actor method (DPI too).
- Can we replace $Q_\pi$ by an approximation $Q_w \in \mathcal{H}$ without changing the gradient?
- If so, the resulting approach is called an actor-critic method.
Policy gradient with a critic

Theorem (Policy gradient with a critic). If the parameterization of the state-action value function is compatible, in the sense that
$$\forall (s,a) \in \mathcal{S} \times \mathcal{A}, \quad \nabla_\theta \ln \pi_\theta(a|s) = \nabla_w Q_w(s,a),$$
and if $Q_w$ is a local optimum of the risk based on the $\ell_2$-loss, with state-action distribution given by $d_{\nu,\pi}$ and with the target function being $Q_\pi$, that is,
$$\nabla_w \mathbb{E}_{d_{\nu,\pi}}\left[\left(Q_{\pi_\theta}(S,A) - Q_w(S,A)\right)^2\right] = 0,$$
then the gradient satisfies
$$\nabla_\theta J(\theta) = \mathbb{E}_{d_{\nu,\pi}}\left[Q_w(S,A)\, \nabla_\theta \ln \pi_\theta(A|S)\right].$$
Policy gradient with a critic (cont.)

Example of a compatible approximation.
- Softmax policy: $\pi_\theta(a|s) = \frac{e^{\theta^\top \phi(s,a)}}{\sum_{a' \in \mathcal{A}} e^{\theta^\top \phi(s,a')}}$.
- Gradient: $\nabla_\theta \ln \pi_\theta(a|s) = \phi(s,a) - \sum_{a' \in \mathcal{A}} \pi_\theta(a'|s)\, \phi(s,a')$.
- Compatible approximation: $Q_w(s,a) = w^\top\left(\phi(s,a) - \sum_{a' \in \mathcal{A}} \pi_\theta(a'|s)\, \phi(s,a')\right)$.
- This is not a Q-function, since $\sum_{a \in \mathcal{A}} \pi_\theta(a|s)\, Q_w(s,a) = 0$; it is more an advantage function: $A_\pi(s,a) = Q_\pi(s,a) - v_\pi(s)$.
- Yet, for any $v \in \mathbb{R}^{\mathcal{S}}$, $\mathbb{E}_{d_{\nu,\pi}}[v(S)\, \nabla_\theta \ln \pi_\theta(A|S)] = 0$, so a state-only term does not change the gradient.
- As the term $w^\top \sum_{a' \in \mathcal{A}} \pi_\theta(a'|s)\, \phi(s,a')$ does not depend on $a$, a compatible approximation is also given by
$$Q_w(s,a) = w^\top \phi(s,a).$$
Natural policy gradient

Natural gradient:
- the gradient premultiplied by the inverse of the Fisher information matrix;
- instead of following the steepest direction in the parameter space, it follows the steepest direction with respect to the Fisher metric;
- it tends to be much more efficient empirically.

In our case, the natural gradient $\widetilde{\nabla}$ is
$$\widetilde{\nabla}_\theta J(\theta) = F(\theta)^{-1} \nabla_\theta J(\theta) \quad \text{with} \quad F(\theta) = \mathbb{E}_{d_{\nu,\pi}}\left[\nabla_\theta \ln \pi_\theta(A|S)\, \left(\nabla_\theta \ln \pi_\theta(A|S)\right)^\top\right].$$
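A minimal sketch of the natural gradient, estimating the Fisher matrix from the same samples and reusing grad_log_softmax and policy_gradient_estimate from the policy-gradient sketch above; the small ridge term is an added assumption for invertibility.

```python
import numpy as np

def natural_gradient(theta, phi, samples, actions, gamma, reg=1e-6):
    """Natural gradient: F(theta)^{-1} grad J(theta), both estimated from samples."""
    d = len(theta)
    F = reg * np.eye(d)
    for s, a, _q in samples:
        g = grad_log_softmax(theta, phi, s, a, actions)   # score of the softmax policy
        F += np.outer(g, g) / len(samples)                # Fisher information estimate
    grad = policy_gradient_estimate(theta, phi, samples, actions, gamma)
    return np.linalg.solve(F, grad)
```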
Natural policy gradient (cont.)

Theorem (Natural policy gradient with a critic). If the parameterization of the state-action value function is compatible, in the sense that
$$\forall (s,a) \in \mathcal{S} \times \mathcal{A}, \quad \nabla_\theta \ln \pi_\theta(a|s) = \nabla_w Q_w(s,a),$$
and if $Q_w$ is a local optimum of the risk based on the $\ell_2$-loss, with state-action distribution given by $d_{\nu,\pi}$ and with the target function being $Q_\pi$, that is,
$$\nabla_w \mathbb{E}_{d_{\nu,\pi}}\left[\left(Q_{\pi_\theta}(S,A) - Q_w(S,A)\right)^2\right] = 0,$$
then the natural gradient satisfies
$$\widetilde{\nabla}_\theta J(\theta) = w.$$