Reinforcement Learning (Machine Learning, SIR)
Matthieu Geist (CentraleSupelec), [email protected]
Figure: The perception-action cycle in reinforcement learning.
Applications

- playing games (Backgammon, Go, Tetris, Atari, ...)
- robotics, e.g. autonomous acrobatic helicopter control (http://heli.stanford.edu/)
- operations research (pricing, vehicle routing, ...)
- human-computer interaction (dialogue management, e-learning, ...)
- virtually any control problem (an old list: http://umichrl.pbworks.com/w/page/7597597/Successes_of_Reinforcement_Learning)
Outline

1 Formalism: Markov Decision Processes; Policy and value function; Bellman operators
2 Dynamic Programming: Linear programming; Value iteration; Policy iteration
3 Approximate Dynamic Programming: State-action value function; Approximate value iteration; Approximate policy iteration
4 Online learning: SARSA and Q-learning; The exploration-exploitation dilemma
5 Policy search and actor-critic methods: The policy gradient theorem; Actor-critic methods
1 Formalism: Markov Decision Processes
A Markov Decision Process (MDP) is a tuple $\{\mathcal{S}, \mathcal{A}, P, r, \gamma\}$ where:
- $\mathcal{S}$ is the (finite) state space;
- $\mathcal{A}$ is the (finite) action space;
- $P \in \Delta_{\mathcal{S}}^{\mathcal{S} \times \mathcal{A}}$ is the Markovian transition kernel; $P(s'|s,a)$ denotes the probability of transitioning to state $s'$ given that action $a$ was chosen in state $s$;
- $r \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$ is the reward function: $r(s,a)$ is the reward for taking action $a$ in state $s$; the reward function is assumed to be uniformly bounded;
- $\gamma \in (0,1)$ is a discount factor that favors shorter-term rewards (usually set to a value close to 1).
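To make these objects concrete, here is a minimal sketch (a hypothetical two-state, two-action MDP, not taken from the slides) representing $P$, $r$ and $\gamma$ as NumPy arrays; the later dynamic-programming sketches reuse this array layout.

```python
import numpy as np

# Hypothetical toy MDP with |S| = 2 states and |A| = 2 actions.
n_states, n_actions = 2, 2

# Transition kernel P[s, a, s'] = P(s' | s, a); each P[s, a, :] sums to 1.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0 under actions 0 and 1
    [[0.0, 1.0], [0.5, 0.5]],   # transitions from state 1 under actions 0 and 1
])

# Reward function r[s, a] = r(s, a), uniformly bounded.
r = np.array([
    [0.0, 1.0],
    [0.5, 0.0],
])

gamma = 0.95  # discount factor, close to 1
```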
1 Formalism: Policy and value function
Policy: $\pi \in \mathcal{A}^{\mathcal{S}}$; in state $s$, an agent applying policy $\pi$ chooses the action $\pi(s)$.

Value function (quantifies the quality of a policy):
$$v_\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(S_t, \pi(S_t)) \,\middle|\, S_0 = s,\ S_{t+1} \sim P(\cdot|S_t, \pi(S_t))\right].$$

Comparing policies (partial ordering):
$$\pi_1 \geq \pi_2 \ \Leftrightarrow\ \forall s \in \mathcal{S},\ v_{\pi_1}(s) \geq v_{\pi_2}(s).$$

Optimal policy:
$$\pi_* \in \arg\max_{\pi \in \mathcal{A}^{\mathcal{S}}} v_\pi.$$
1 Formalism: Bellman operators
Rewriting the Bellman equation

$$\begin{aligned}
v_\pi(s) &= \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(S_t, \pi(S_t)) \,\middle|\, S_0 = s,\ S_{t+1} \sim P(\cdot|S_t, \pi(S_t))\right] \\
&= r(s, \pi(s)) + \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^t r(S_t, \pi(S_t)) \,\middle|\, S_0 = s,\ S_{t+1} \sim P(\cdot|S_t, \pi(S_t))\right] \\
&= r(s, \pi(s)) + \gamma\, \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(S_{t+1}, \pi(S_{t+1})) \,\middle|\, S_0 = s,\ S_{t+1} \sim P(\cdot|S_t, \pi(S_t))\right] \\
\Leftrightarrow\ v_\pi(s) &= r(s, \pi(s)) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s, \pi(s))\, v_\pi(s').
\end{aligned}$$
Rewriting the Bellman equation (cont.)

Define the stochastic matrix $P_\pi \in \mathbb{R}^{\mathcal{S} \times \mathcal{S}}$ and the vector $r_\pi \in \mathbb{R}^{\mathcal{S}}$ as
$$P_\pi = \big(P(s'|s, \pi(s))\big)_{s, s' \in \mathcal{S}} \quad \text{and} \quad r_\pi = \big(r(s, \pi(s))\big)_{s \in \mathcal{S}}.$$
Using these notations, we have:
$$v_\pi = r_\pi + \gamma P_\pi v_\pi \ \Leftrightarrow\ v_\pi = (I - \gamma P_\pi)^{-1} r_\pi.$$
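A minimal sketch of exact policy evaluation through this linear-system form, assuming the array layout of the toy-MDP sketch above; np.linalg.solve is used rather than forming the inverse explicitly.

```python
import numpy as np

def evaluate_policy(P, r, gamma, pi):
    """Exact policy evaluation: v_pi = (I - gamma * P_pi)^{-1} r_pi."""
    n_states = P.shape[0]
    # Restrict the kernel and reward to the actions chosen by pi.
    P_pi = P[np.arange(n_states), pi]        # shape (|S|, |S|)
    r_pi = r[np.arange(n_states), pi]        # shape (|S|,)
    # Solve the linear system (I - gamma * P_pi) v = r_pi.
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Example with the toy MDP: the policy that always takes action 0.
# v = evaluate_policy(P, r, gamma, pi=np.array([0, 0]))
```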
Bellman evaluation operator

Define the Bellman evaluation operator $T_\pi : \mathbb{R}^{\mathcal{S}} \to \mathbb{R}^{\mathcal{S}}$ as
$$\forall v \in \mathbb{R}^{\mathcal{S}}, \quad T_\pi v = r_\pi + \gamma P_\pi v,$$
or equivalently, componentwise,
$$\forall s \in \mathcal{S}, \quad [T_\pi v](s) = r(s, \pi(s)) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s, \pi(s))\, v(s').$$
$T_\pi$ is a contraction (in supremum norm) and $v_\pi$ is its unique fixed point:
$$v_\pi = T_\pi v_\pi.$$
Optimal value function and policies

Assume that $v_* = v_{\pi_*}$ is known; an optimal policy is obtained by acting greedily with respect to $v_*$:
$$\pi_*(s) \in \arg\max_{a \in \mathcal{A}} \left(r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, v_*(s')\right).$$
Characterizing $v_*$:
$$\forall s \in \mathcal{S}, \quad v_*(s) = \max_{a \in \mathcal{A}} \left(r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, v_*(s')\right).$$
Bellman optimality operator

Define the Bellman optimality operator $T_* : \mathbb{R}^{\mathcal{S}} \to \mathbb{R}^{\mathcal{S}}$ as
$$\forall v \in \mathbb{R}^{\mathcal{S}}, \quad T_* v = \max_{\pi \in \mathcal{A}^{\mathcal{S}}} \left(r_\pi + \gamma P_\pi v\right),$$
or equivalently, componentwise,
$$\forall s \in \mathcal{S}, \quad [T_* v](s) = \max_{a \in \mathcal{A}} \left(r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, v(s')\right).$$
$T_*$ is a contraction (in supremum norm) and $v_*$ is its unique fixed point:
$$v_* = T_* v_*.$$
2 Dynamic Programming
DP: solve an MDP when the model is known.
In practice, the model is unknown and one has to rely on data.
Even so, related learning methods are often based on DP.
2 Dynamic Programming: Linear programming
$v_*$ is the solution of the following linear program:
$$\min_{v \in \mathbb{R}^{\mathcal{S}}}\ \mathbf{1}^\top v \quad \text{subject to} \quad v \geq T_* v.$$

Proof. If $v \geq T_* v$, then by monotonicity of $T_*$, iterating gives $v \geq T_*^k v \to v_*$, so $v \geq v_*$ and thus $\mathbf{1}^\top v \geq \mathbf{1}^\top v_*$; moreover $v_*$ is feasible since $v_* = T_* v_*$, hence it is the minimizer.
Algorithm 1: Linear programming

1: Solve
$$\min_{v \in \mathbb{R}^{\mathcal{S}}} \sum_{s \in \mathcal{S}} v(s) \quad \text{subject to} \quad v(s) \geq r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, v(s'), \quad \forall s \in \mathcal{S}, \forall a \in \mathcal{A},$$
and get $v_*$.
2: Return the policy $\pi_*$ defined as
$$\pi_*(s) \in \arg\max_{a \in \mathcal{A}} \left(r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, v_*(s')\right).$$
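A minimal sketch of this linear program with scipy.optimize.linprog, under the same assumed array layout as the toy-MDP sketch; the function and variable names are illustrative only.

```python
import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(P, r, gamma):
    """min_v sum(v) s.t. v(s) >= r(s,a) + gamma * sum_s' P(s'|s,a) v(s') for all (s,a)."""
    n_states, n_actions = r.shape
    c = np.ones(n_states)                                # objective: 1^T v
    A_ub, b_ub = [], []
    for s in range(n_states):
        for a in range(n_actions):
            # Constraint rewritten as -(e_s - gamma P(.|s,a)) . v <= -r(s,a).
            A_ub.append(-np.eye(n_states)[s] + gamma * P[s, a])
            b_ub.append(-r[s, a])
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n_states)      # v is unbounded in sign
    v_star = res.x
    q = r + gamma * P @ v_star                           # q[s, a] = r(s,a) + gamma sum_s' P v
    return v_star, q.argmax(axis=1)                      # optimal value and a greedy policy
```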
2 Dynamic Programming: Value iteration
- $T_*$ is a contraction: $\forall u, v \in \mathbb{R}^{\mathcal{S}},\ \|T_* u - T_* v\|_\infty \leq \gamma \|u - v\|_\infty$;
- $v_*$ is its unique fixed point: $T_* v_* = v_*$;
- Banach fixed-point theorem: for any $v_0$, the sequence $v_{k+1} = T_* v_k$ converges to $v_*$;
- natural stopping criterion: $\|v_{k+1} - v_k\|_\infty \leq \epsilon$;
- output a policy greedy with respect to $v_k$, $\pi_k \in \mathcal{G}(v_k)$:
$$\pi \in \mathcal{G}(v) \ \Leftrightarrow\ T_\pi v = T_* v \ \Leftrightarrow\ \forall s \in \mathcal{S},\ \pi(s) \in \arg\max_{a \in \mathcal{A}} \left(r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, v(s')\right).$$
Algorithm 2: Value iteration

Require: an initial $v_0 \in \mathbb{R}^{\mathcal{S}}$, a stopping criterion $\epsilon$
1: $k = 0$
2: repeat
3:   for all $s \in \mathcal{S}$ do
4:     $v_{k+1}(s) = \max_{a \in \mathcal{A}} \left(r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, v_k(s')\right)$
5:   end for
6:   $k \leftarrow k + 1$
7: until $\|v_k - v_{k-1}\|_\infty \leq \epsilon$
8: return a policy $\pi_k \in \mathcal{G}(v_k)$:
$$\pi_k(s) \in \arg\max_{a \in \mathcal{A}} \left(r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, v_k(s')\right).$$
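A minimal NumPy sketch of Algorithm 2, again assuming the toy-MDP array layout (P[s, a, s'], r[s, a]).

```python
import numpy as np

def value_iteration(P, r, gamma, eps=1e-8):
    """Value iteration: iterate v <- T_* v until the sup-norm change is <= eps."""
    v = np.zeros(r.shape[0])
    while True:
        q = r + gamma * P @ v          # q[s, a] = r(s,a) + gamma * sum_s' P(s'|s,a) v(s')
        v_next = q.max(axis=1)         # [T_* v](s)
        if np.max(np.abs(v_next - v)) <= eps:
            return v_next, q.argmax(axis=1)   # value estimate and a greedy policy
        v = v_next
```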
Quality of the obtained solution?

Stop iterations when $\|v_{k+1} - v_k\|_\infty \leq \epsilon$.

Guarantee on the value function $v_k$:
$$\|v_* - v_k\|_\infty \leq \frac{1}{1-\gamma}\, \epsilon.$$

Guarantee on the policy $\pi_k$:
$$\|v_* - v_{\pi_k}\|_\infty \leq \frac{2\gamma}{(1-\gamma)^2}\, \epsilon.$$
2 Dynamic Programming: Policy iteration
Let $\pi$ be any policy and $v_\pi$ its value function.

Let $\pi'$ be greedy with respect to $v_\pi$, $\pi' \in \mathcal{G}(v_\pi)$:
$$\forall s \in \mathcal{S}, \quad \pi'(s) \in \arg\max_{a \in \mathcal{A}} \left(r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, v_\pi(s')\right).$$

Then $\pi'$ is at least as good a policy as $\pi$:
$$v_{\pi'} \geq v_\pi.$$

This suggests the following algorithmic scheme; iterate:
1. policy evaluation: solve $T_{\pi_k} v_{\pi_k} = v_{\pi_k}$;
2. policy improvement: compute $\pi_{k+1} \in \mathcal{G}(v_{\pi_k})$.
Algorithm 3: Policy iteration

Require: an initial $\pi_0 \in \mathcal{A}^{\mathcal{S}}$
1: $k = 0$
2: repeat
3:   solve (policy evaluation)
$$v_k(s) = r(s, \pi_k(s)) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s, \pi_k(s))\, v_k(s'), \quad \forall s \in \mathcal{S}.$$
4:   compute (policy improvement)
$$\pi_{k+1}(s) \in \arg\max_{a \in \mathcal{A}} \left(r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, v_k(s')\right).$$
5:   $k \leftarrow k + 1$
6: until $\pi_k = \pi_{k-1}$
7: return the policy $\pi_k = \pi_*$
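A minimal NumPy sketch of Algorithm 3 under the same assumptions; it stops when the greedy policy no longer changes, which matches the stopping test above.

```python
import numpy as np

def policy_iteration(P, r, gamma):
    """Policy iteration: alternate exact evaluation and greedy improvement."""
    n_states = r.shape[0]
    pi = np.zeros(n_states, dtype=int)            # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi.
        P_pi = P[np.arange(n_states), pi]
        r_pi = r[np.arange(n_states), pi]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to v.
        pi_next = (r + gamma * P @ v).argmax(axis=1)
        if np.array_equal(pi_next, pi):
            return pi, v
        pi = pi_next
```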
3 Approximate Dynamic Programming
DP requires:
- the state and action spaces to be small enough;
- the model to be known.

Unfortunately:
- the state space can be too large (even continuous) for the value function to be represented exactly, so one uses a parametric representation, e.g. a linear one:
$$v_\theta(s) = \theta^\top \phi(s) = \sum_{i=1}^{d} \theta_i \phi_i(s);$$
- the model might be unknown and one has to rely on a dataset
$$\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{1 \leq i \leq n}.$$
  The dataset can be obtained in multiple ways. The evaluation operator can be sampled (assume here $a_i = \pi(s_i)$):
$$[\widehat{T}_\pi v](s_i) = r_i + \gamma v(s'_i),$$
  and this sampled operator is unbiased:
$$\mathbb{E}\big[[\widehat{T}_\pi v](s_i) \,\big|\, s_i\big] = \mathbb{E}_{S' \sim P(\cdot|s_i, a_i)}[r_i + \gamma v(S')] = [T_\pi v](s_i).$$
3 Approximate Dynamic Programming: State-action value function
Problems with value functions

Computing a greedy policy requires knowing the model:
$$\pi \in \mathcal{G}(v) \ \Leftrightarrow\ \forall s \in \mathcal{S},\ \pi(s) \in \arg\max_{a \in \mathcal{A}} \left(r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, v(s')\right).$$

Sampling the optimality operator?
- Optimality operator:
$$[T_* v](s) = \max_{a \in \mathcal{A}} \mathbb{E}_{S' \sim P(\cdot|s,a)}[r(s,a) + \gamma v(S')];$$
- with $s'_{i,a} \sim P(\cdot|s_i, a)$, a possible sampled operator is
$$[\widehat{T}_* v](s_i) = \max_{a \in \mathcal{A}} \left(r(s_i,a) + \gamma v(s'_{i,a})\right);$$
- it is biased: $\mathbb{E}\big[[\widehat{T}_* v](s_i) \,\big|\, s_i\big] \neq [T_* v](s_i)$ (the expectation of a maximum is not the maximum of expectations).
State-action value function (also called Q-function or quality function):
$$Q_\pi(s,a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(S_t, A_t) \,\middle|\, S_0 = s,\ A_0 = a,\ S_{t+1} \sim P(\cdot|S_t, A_t),\ A_{t+1} = \pi(S_{t+1})\right].$$

Bellman evaluation operator $T_\pi : \mathbb{R}^{\mathcal{S} \times \mathcal{A}} \to \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$:
- definition: $[T_\pi Q](s,a) = r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, Q(s', \pi(s'))$;
- $Q_\pi$ is its unique fixed point: $T_\pi Q_\pi = Q_\pi$;
- link to $v_\pi$: $v_\pi(s) = Q_\pi(s, \pi(s))$.

Bellman optimality operator $T_* : \mathbb{R}^{\mathcal{S} \times \mathcal{A}} \to \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$:
- definition: $[T_* Q](s,a) = r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a) \max_{a' \in \mathcal{A}} Q(s', a')$;
- $Q_*$ is its unique fixed point: $Q_* = T_* Q_*$;
- link to $v_*$: $v_*(s) = \max_{a \in \mathcal{A}} Q_*(s,a)$.
The Q-function allows acting greedily without the model:
- with respect to $v_\pi(s) = Q_\pi(s, \pi(s))$:
$$\pi' \in \mathcal{G}(v_\pi) \ \Leftrightarrow\ \forall s \in \mathcal{S},\ \pi'(s) \in \arg\max_{a \in \mathcal{A}} \left(r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, v_\pi(s')\right) \ \Leftrightarrow\ \forall s \in \mathcal{S},\ \pi'(s) \in \arg\max_{a \in \mathcal{A}} Q_\pi(s,a);$$
- with respect to $v_*$:
$$\pi_*(s) \in \arg\max_{a \in \mathcal{A}} Q_*(s,a);$$
- with respect to any $Q \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$:
$$\pi \in \mathcal{G}(Q) \ \Leftrightarrow\ \forall s \in \mathcal{S},\ \pi(s) \in \arg\max_{a \in \mathcal{A}} Q(s,a).$$
The Q-function also allows sampling the related operators easily. Recall the dataset
$$\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{1 \leq i \leq n}.$$
- Sampled Bellman evaluation operator:
$$[\widehat{T}_\pi Q](s_i, a_i) = r_i + \gamma Q(s'_i, \pi(s'_i));$$
- sampled Bellman optimality operator:
$$[\widehat{T}_* Q](s_i, a_i) = r_i + \gamma \max_{a' \in \mathcal{A}} Q(s'_i, a').$$

Features for the Q-function:
- linear parameterization of the Q-function: $Q_\theta(s,a) = \theta^\top \phi(s,a)$ with, for instance,
$$\phi(s,a) = \left[\delta_{a=a_1} \phi(s)^\top \ \dots\ \delta_{a=a_{|\mathcal{A}|}} \phi(s)^\top\right]^\top;$$
- other representations are possible.
3 Approximate Dynamic Programming: Approximate value iteration
Value iteration: $Q_{k+1} = T_* Q_k$.
- $T_*$ cannot be applied exactly, the model being unknown;
- with a large space and $Q_k \in \mathcal{H}$, there is no reason for $T_* Q_k \in \mathcal{H}$ to hold.

Approximate value iteration (an introductory example):
- linear parameterization for the Q-functions:
$$\mathcal{H} = \{Q_\theta(s,a) = \theta^\top \phi(s,a),\ \theta \in \mathbb{R}^d\};$$
- writing $Q_k = Q_{\theta_k}$, sampled operator:
$$[\widehat{T}_* Q_k](s_i, a_i) = r_i + \gamma \max_{a' \in \mathcal{A}} Q_k(s'_i, a');$$
- search for the $Q \in \mathcal{H}$ closest to $\widehat{T}_* Q_k$:
$$Q_{k+1} \in \arg\min_{Q_\theta \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \left(Q_\theta(s_i, a_i) - [\widehat{T}_* Q_k](s_i, a_i)\right)^2;$$
- in summary: $Q_{k+1} = \Pi T_* Q_k$, with $\Pi$ the projection onto $\mathcal{H}$.
Abstraction

Approximate value iteration:
$$Q_{k+1} = A T_* Q_k,$$
where $A$ is an abstract approximation operator.
- $A T_*$ should be a contraction!
- Otherwise, divergence can (and will) occur.
- This is not the case for the least-squares projection $\Pi T_*$... do not implement the introductory example!
- It is true if averagers are used for function approximation, such as:
  - ensembles of trees, notably extremely randomized trees (fitted-Q iteration);
  - kernel averagers (Nadaraya-Watson).
Algorithm 4: Approximate value iteration

Require: a dataset $\mathcal{D} = \{(s_i, a_i, r_i, s'_i)\}_{1 \leq i \leq n}$, the number $K$ of iterations, a function approximator, an initial state-action value function $Q_0$
1: for $k = 0$ to $K$ do
2:   apply the sampled Bellman optimality operator to the function $Q_k$:
$$[\widehat{T}_* Q_k](s_i, a_i) = r_i + \gamma \max_{a' \in \mathcal{A}} Q_k(s'_i, a').$$
3:   solve the regression problem with inputs $(s_i, a_i)$ and outputs $[\widehat{T}_* Q_k](s_i, a_i)$ to get the Q-function $Q_{k+1}$
4: end for
5: return the greedy policy $\pi_{K+1} \in \mathcal{G}(Q_{K+1})$:
$$\forall s \in \mathcal{S}, \quad \pi_{K+1}(s) \in \arg\max_{a \in \mathcal{A}} Q_{K+1}(s,a).$$
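A minimal sketch of Algorithm 4 as fitted-Q iteration with an averager-style regressor (scikit-learn's ExtraTreesRegressor); the dataset layout and the way discrete actions are appended to the state features are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(s, a, r, s_next, n_actions, gamma, K=50):
    """Approximate value iteration (fitted-Q) on a dataset of transitions.

    s, s_next: (n, state_dim) arrays; a: (n,) int array; r: (n,) array.
    """
    n = len(r)
    X = np.hstack([s, a.reshape(-1, 1)])            # regression inputs (s_i, a_i)
    q = None
    for _ in range(K):
        if q is None:
            y = r                                   # Q_0 = 0, so the first targets are the rewards
        else:
            # max_a' Q_k(s'_i, a'), evaluated by trying every discrete action.
            q_next = np.column_stack([
                q.predict(np.hstack([s_next, np.full((n, 1), a_prime)]))
                for a_prime in range(n_actions)
            ])
            y = r + gamma * q_next.max(axis=1)      # sampled T_* Q_k
        q = ExtraTreesRegressor(n_estimators=50).fit(X, y)
    return q  # greedy policy: argmax over a of q.predict([s, a])
```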
3 Approximate Dynamic Programming: Approximate policy iteration
Policy iteration:
1. policy evaluation: solve the fixed-point equation $Q_{\pi_k} = T_{\pi_k} Q_{\pi_k}$;
2. policy improvement: compute the greedy policy $\pi_{k+1} \in \mathcal{G}(Q_{\pi_k})$.

Approximate policy iteration:
1. approximate policy evaluation: find a function $Q_k \in \mathcal{H}$ such that $Q_k \approx T_{\pi_k} Q_k$;
2. policy improvement: compute the greedy policy $\pi_{k+1} \in \mathcal{G}(Q_k)$.
Algorithm 5: Approximate policy iteration

Require: an initial $\pi_0 \in \mathcal{A}^{\mathcal{S}}$ (possibly an initial $Q_0$ and $\pi_0 \in \mathcal{G}(Q_0)$), the number of iterations $K$
1: for $k = 0$ to $K$ do
2:   approximate policy evaluation: find $Q_k \in \mathcal{H}$ such that $Q_k \approx T_{\pi_k} Q_k$
3:   policy improvement: $\pi_{k+1} \in \mathcal{G}(Q_k)$
4: end for
5: return the policy $\pi_{K+1}$

Problem: how to find an approximate fixed point of $T_\pi$, that is, a function $Q_\theta \in \mathcal{H}$ such that $Q_\theta \approx T_\pi Q_\theta$?
Monte Carlo rollouts

An approximate fixed point of $T_\pi$ is an approximation of $Q_\pi$. If $Q_\pi$ were known, this would simply be a regression problem; for example, linear least-squares:
$$\min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \left(Q_\pi(s_i, a_i) - Q_\theta(s_i, a_i)\right)^2.$$

$Q_\pi$ is (obviously) unknown. Monte Carlo rollout:
- sample a full trajectory starting in $s_i$ where action $a_i$ is chosen first, all subsequent states being sampled according to the system dynamics and all subsequent actions being chosen according to $\pi$; write $q_i$ the associated discounted cumulative reward;
- this gives an unbiased estimate: $\mathbb{E}[q_i \,|\, s_i, a_i] = Q_\pi(s_i, a_i)$;
- replace $Q_\pi(s_i, a_i)$ by the unbiased estimate $q_i$:
$$\min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \left(q_i - Q_\theta(s_i, a_i)\right)^2.$$

Drawbacks: this requires a simulator, and rollouts can be quite noisy.
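A minimal sketch of one such rollout estimate, assuming a hypothetical simulator interface step(s, a) -> (reward, next_state) and a deterministic policy pi(s); a finite horizon truncates the discounted sum.

```python
def rollout_q_estimate(step, pi, s, a, gamma, horizon=200):
    """Monte Carlo estimate of Q_pi(s, a), unbiased up to the horizon truncation.

    step(s, a) -> (reward, next_state) is a hypothetical simulator interface.
    """
    q, discount = 0.0, 1.0
    for _ in range(horizon):
        reward, s = step(s, a)
        q += discount * reward
        discount *= gamma
        a = pi(s)              # all subsequent actions follow the evaluated policy
    return q
```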
Residual approach

Idea: minimize the residual $\|Q_\theta - T_\pi Q_\theta\|$ for some norm.

With an $\ell_2$-loss, a parametric representation and the sampled operator:
$$\min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \left([\widehat{T}_\pi Q_\theta](s_i, a_i) - Q_\theta(s_i, a_i)\right)^2 = \min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \left(r_i + \gamma Q_\theta(s'_i, \pi(s'_i)) - Q_\theta(s_i, a_i)\right)^2.$$

However, there is a bias problem:
$$\mathbb{E}\left[\left([\widehat{T}_\pi Q_\theta](s_i, a_i) - Q_\theta(s_i, a_i)\right)^2 \,\middle|\, s_i, a_i\right] = \left([T_\pi Q_\theta](s_i, a_i) - Q_\theta(s_i, a_i)\right)^2 + \operatorname{var}\left([\widehat{T}_\pi Q_\theta](s_i, a_i) \,\middle|\, s_i, a_i\right) \neq \left([T_\pi Q_\theta](s_i, a_i) - Q_\theta(s_i, a_i)\right)^2.$$
Least-Squares Temporal Differences

Idea: solve
$$Q_\theta = \Pi T_\pi Q_\theta.$$
As a nested optimization problem:
$$\begin{cases}
w_\theta = \arg\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \left(r_i + \gamma Q_\theta(s'_i, \pi(s'_i)) - Q_w(s_i, a_i)\right)^2, \\
\theta_n = \arg\min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \left(Q_\theta(s_i, a_i) - Q_{w_\theta}(s_i, a_i)\right)^2.
\end{cases}$$
Least-Squares Temporal Differences (cont.)

Optimization problem (linear parameterization assumed):
$$\begin{cases}
w_\theta = \arg\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \left(r_i + \gamma Q_\theta(s'_i, \pi(s'_i)) - Q_w(s_i, a_i)\right)^2, \\
\theta_n = \arg\min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \left(Q_\theta(s_i, a_i) - Q_{w_\theta}(s_i, a_i)\right)^2.
\end{cases}$$

First equation, a linear least-squares problem in $w$:
$$w_\theta = \left(\sum_{i=1}^{n} \phi(s_i, a_i)\, \phi(s_i, a_i)^\top\right)^{-1} \sum_{i=1}^{n} \phi(s_i, a_i)\left(r_i + \gamma \theta^\top \phi(s'_i, \pi(s'_i))\right).$$

Second equation, minimized for $\theta = w_\theta$:
$$\theta_n = w_{\theta_n} \ \Leftrightarrow\ \theta_n = \left(\sum_{i=1}^{n} \phi(s_i, a_i)\, \phi(s_i, a_i)^\top\right)^{-1} \sum_{i=1}^{n} \phi(s_i, a_i)\left(r_i + \gamma \theta_n^\top \phi(s'_i, \pi(s'_i))\right)
\ \Leftrightarrow\ \theta_n = \left(\sum_{i=1}^{n} \phi(s_i, a_i)\left(\phi(s_i, a_i) - \gamma \phi(s'_i, \pi(s'_i))\right)^\top\right)^{-1} \sum_{i=1}^{n} \phi(s_i, a_i)\, r_i.$$

Approximate policy iteration with LSTD as the evaluation step is named LSPI (Least-Squares Policy Iteration).
Algorithm 6: Least-squares policy iteration (LSPI)

Require: an initial $\pi_0 \in \mathcal{A}^{\mathcal{S}}$ (possibly an initial $Q_0$ and $\pi_0 \in \mathcal{G}(Q_0)$), the number of iterations $K$
1: for $k = 0$ to $K$ do
2:   approximate policy evaluation (LSTD):
$$\theta_k = \left(\sum_{i=1}^{n} \phi(s_i, a_i)\left(\phi(s_i, a_i) - \gamma \phi(s'_i, \pi_k(s'_i))\right)^\top\right)^{-1} \sum_{i=1}^{n} \phi(s_i, a_i)\, r_i.$$
3:   policy improvement: $\pi_{k+1} \in \mathcal{G}(Q_{\theta_k})$
4: end for
5: return the policy $\pi_{K+1}$
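A minimal NumPy sketch of the LSTD estimator of step 2, assuming a feature map phi(s, a) returning a d-dimensional vector; a small ridge term (an added assumption) keeps the matrix invertible. The improvement step then acts greedily on $Q_\theta(s,a) = \theta^\top \phi(s,a)$.

```python
import numpy as np

def lstd(phi, transitions, pi, gamma, d, reg=1e-6):
    """LSTD: theta = (sum_i phi_i (phi_i - gamma phi'_i)^T)^{-1} sum_i phi_i r_i.

    transitions: iterable of (s, a, r, s_next); phi(s, a) -> feature vector of size d.
    """
    A = reg * np.eye(d)          # small ridge term for numerical stability
    b = np.zeros(d)
    for s, a, r, s_next in transitions:
        f = phi(s, a)
        f_next = phi(s_next, pi(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)
```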
Approximating the policy

Idea: instead of generalizing the Q-function, generalize the policy.

Motivation: a policy might be easier to learn than a Q-function.

At iteration $k$, let $\mathcal{F} \subset \mathcal{A}^{\mathcal{S}}$ be a hypothesis space of policies, assume that the $Q_{\pi_k}(s_i, a)$ are known, and solve the cost-sensitive multi-class classification problem
$$\pi_{k+1} \in \arg\min_{\pi \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \left(\max_{a \in \mathcal{A}} Q_{\pi_k}(s_i, a) - Q_{\pi_k}(s_i, \pi(s_i))\right).$$

In practice, replace $Q_{\pi_k}(s_i, a)$ by a Monte Carlo rollout.

This scheme is often called DPI, for Direct Policy Iteration.
4 Online learning
For ADP, we assumed that the dataset was provided.

What about online learning?
- It requires an online learner;
- there is a dilemma between exploration and exploitation.
4 Online learning: SARSA and Q-learning
SARSA

Goal: online estimation of $Q_\pi$, for a given $\pi$.

Assume a linear parameterization and that $Q_\pi$ is known; the risk of interest is
$$\min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \left(Q_\pi(s_i, a_i) - Q_\theta(s_i, a_i)\right)^2.$$
Minimize it with a stochastic gradient descent:
$$\theta_{i+1} = \theta_i - \frac{\alpha_i}{2} \nabla \left(Q_\pi(s_i, a_i) - Q_{\theta_i}(s_i, a_i)\right)^2 = \theta_i + \alpha_i \phi(s_i, a_i)\left(Q_\pi(s_i, a_i) - \theta_i^\top \phi(s_i, a_i)\right).$$
$Q_\pi(s_i, a_i)$ is unknown; bootstrap it (with $s_{i+1} \sim P(\cdot|s_i, a_i)$ and $a_{i+1} = \pi(s_{i+1})$):
$$Q_\pi(s_i, a_i) \to [\widehat{T}_\pi Q_{\theta_i}](s_i, a_i) = r_i + \gamma Q_{\theta_i}(s_{i+1}, a_{i+1}).$$
Replace $Q_\pi(s_i, a_i)$ by this estimate in the update rule:
$$\theta_{i+1} = \theta_i + \alpha_i \phi(s_i, a_i)\left(r_i + \gamma Q_{\theta_i}(s_{i+1}, a_{i+1}) - Q_{\theta_i}(s_i, a_i)\right) = \theta_i + \alpha_i \phi(s_i, a_i)\left(r_i + \gamma \theta_i^\top \phi(s_{i+1}, a_{i+1}) - \theta_i^\top \phi(s_i, a_i)\right).$$
This is called a temporal difference algorithm.
SARSA (cont.)

Algorithm 7: SARSA

Require: an initial parameter vector $\theta_0$, the initial state $s_0$, an initial action $a_0$, the learning rates $(\alpha_i)_{i \geq 0}$
1: $i = 0$
2: while true do
3:   apply action $a_i$ in state $s_i$
4:   get the reward $r_i$ and observe the new state $s_{i+1}$
5:   choose the action $a_{i+1}$ to be applied in state $s_{i+1}$
6:   update the parameter vector of the Q-function according to the transition $(s_i, a_i, r_i, s_{i+1}, a_{i+1})$:
$$\theta_{i+1} = \theta_i + \alpha_i \phi(s_i, a_i)\left(r_i + \gamma \theta_i^\top \phi(s_{i+1}, a_{i+1}) - \theta_i^\top \phi(s_i, a_i)\right)$$
7:   $i \leftarrow i + 1$
8: end while

Remark: SARSA is an on-policy algorithm.
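A minimal sketch of the SARSA loop with linear features, assuming a hypothetical environment interface env.reset() -> s and env.step(a) -> (next_state, reward), and a behaviour policy behave(theta, s) (e.g. epsilon-greedy, see the exploration slide further below) that selects actions.

```python
import numpy as np

def sarsa(env, behave, phi, d, gamma, alpha=0.05, n_steps=100_000):
    """On-policy temporal-difference learning of Q with linear features.

    Assumed interface: env.reset() -> s, env.step(a) -> (s_next, r);
    behave(theta, s) -> action; phi(s, a) -> feature vector of size d.
    """
    theta = np.zeros(d)
    s = env.reset()
    a = behave(theta, s)
    for _ in range(n_steps):
        s_next, r = env.step(a)
        a_next = behave(theta, s_next)     # on-policy: next action from the same policy
        td_error = r + gamma * theta @ phi(s_next, a_next) - theta @ phi(s, a)
        theta = theta + alpha * td_error * phi(s, a)
        s, a = s_next, a_next
    return theta
```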
Q-learning

Goal: direct online estimation of $Q_*$.

Assume a linear parameterization and that $Q_*$ is known; the risk of interest is
$$\min_{\theta \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \left(Q_*(s_i, a_i) - Q_\theta(s_i, a_i)\right)^2.$$
Minimize it with a stochastic gradient descent:
$$\theta_{i+1} = \theta_i - \frac{\alpha_i}{2} \nabla \left(Q_*(s_i, a_i) - Q_{\theta_i}(s_i, a_i)\right)^2 = \theta_i + \alpha_i \phi(s_i, a_i)\left(Q_*(s_i, a_i) - \theta_i^\top \phi(s_i, a_i)\right).$$
$Q_*(s_i, a_i)$ is unknown; bootstrap it (with $s_{i+1} \sim P(\cdot|s_i, a_i)$):
$$Q_*(s_i, a_i) \to [\widehat{T}_* Q_{\theta_i}](s_i, a_i) = r_i + \gamma \max_{a \in \mathcal{A}} Q_{\theta_i}(s_{i+1}, a).$$
Replace $Q_*(s_i, a_i)$ by this estimate in the update rule:
$$\theta_{i+1} = \theta_i + \alpha_i \phi(s_i, a_i)\left(r_i + \gamma \max_{a \in \mathcal{A}} Q_{\theta_i}(s_{i+1}, a) - Q_{\theta_i}(s_i, a_i)\right) = \theta_i + \alpha_i \phi(s_i, a_i)\left(r_i + \gamma \max_{a \in \mathcal{A}}\left(\theta_i^\top \phi(s_{i+1}, a)\right) - \theta_i^\top \phi(s_i, a_i)\right).$$
Q-learning (cont.)

Algorithm 8: Q-learning

Require: an initial parameter vector $\theta_0$, the initial state $s_0$, the learning rates $(\alpha_i)_{i \geq 0}$
1: $i = 0$
2: while true do
3:   choose the action $a_i$ to be applied in state $s_i$
4:   apply action $a_i$ in state $s_i$
5:   get the reward $r_i$ and observe the new state $s_{i+1}$
6:   update the parameter vector of the Q-function according to the transition $(s_i, a_i, r_i, s_{i+1})$:
$$\theta_{i+1} = \theta_i + \alpha_i \phi(s_i, a_i)\left(r_i + \gamma \max_{a \in \mathcal{A}}\left(\theta_i^\top \phi(s_{i+1}, a)\right) - \theta_i^\top \phi(s_i, a_i)\right)$$
7:   $i \leftarrow i + 1$
8: end while

Remark: Q-learning is an off-policy algorithm.
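The corresponding sketch for Q-learning, under the same assumed environment and feature interface as the SARSA sketch; only the bootstrap target changes (a max over actions instead of the next on-policy action).

```python
import numpy as np

def q_learning(env, behave, phi, actions, d, gamma, alpha=0.05, n_steps=100_000):
    """Off-policy temporal-difference learning of Q_* with linear features."""
    theta = np.zeros(d)
    s = env.reset()
    for _ in range(n_steps):
        a = behave(theta, s)                                    # any exploratory behaviour policy
        s_next, r = env.step(a)
        q_next = max(theta @ phi(s_next, b) for b in actions)   # max_a' Q_theta(s', a')
        td_error = r + gamma * q_next - theta @ phi(s, a)
        theta = theta + alpha * td_error * phi(s, a)
        s = s_next
    return theta
```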
4 Online learning: The exploration-exploitation dilemma
With SARSA or Q-learning, what action should be applied?
- acting always greedily is not wise;
- there is a dilemma between exploration and exploitation.

$\epsilon$-greedy policy:
$$\pi_\epsilon(s) = \begin{cases} \arg\max_{a \in \mathcal{A}} Q_\theta(s,a) & \text{with probability } 1 - \epsilon, \\ \text{a random action} & \text{with probability } \epsilon. \end{cases}$$

Softmax (stochastic) policy ($\tau$ is the temperature parameter):
$$\pi_\tau(a|s) = \frac{e^{\frac{1}{\tau} Q_\theta(s,a)}}{\sum_{a' \in \mathcal{A}} e^{\frac{1}{\tau} Q_\theta(s,a')}}.$$

Other schemes exist.
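A minimal sketch of these two action-selection rules for a linear Q-function; the feature map phi and the explicit list of actions are assumptions carried over from the earlier sketches.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(theta, phi, s, actions, eps=0.1):
    """Greedy action with probability 1 - eps, uniform random action with probability eps."""
    if rng.random() < eps:
        return actions[rng.integers(len(actions))]
    q = np.array([theta @ phi(s, a) for a in actions])
    return actions[int(q.argmax())]

def softmax_policy(theta, phi, s, actions, tau=1.0):
    """Sample an action with probability proportional to exp(Q(s, a) / tau)."""
    q = np.array([theta @ phi(s, a) for a in actions])
    p = np.exp((q - q.max()) / tau)            # subtract the max for numerical stability
    p /= p.sum()
    return actions[rng.choice(len(actions), p=p)]
```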
5 Policy search and actor-critic methods
What about continuous actions (where the max and argmax become problematic)?

Policy search: parameterize the policy and search directly in the policy space.

We will use stochastic policies:
- a stochastic policy $\pi \in \Delta_{\mathcal{A}}^{\mathcal{S}}$ associates to each state $s$ a conditional probability distribution over actions, $\pi(\cdot|s)$;
- everything defined so far extends naturally to stochastic policies:
$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s)\left(r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, v_\pi(s')\right),$$
$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s)\, Q_\pi(s,a), \qquad Q_\pi(s,a) = r(s,a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s,a)\, v_\pi(s').$$
Examples of parameterized policies:
- discrete actions (softmax):
$$\pi_\theta(a|s) = \frac{e^{\theta^\top \phi(s,a)}}{\sum_{a' \in \mathcal{A}} e^{\theta^\top \phi(s,a')}};$$
- continuous (here one-dimensional) actions (Gaussian):
$$\pi_\theta(a|s) \propto e^{-\frac{1}{2}\left(\frac{a - \theta^\top \phi(s)}{\sigma}\right)^2}.$$

The policy search problem:
- let $\nu \in \Delta_{\mathcal{S}}$ be a user-defined distribution over states;
- solve
$$\max_{\theta \in \mathbb{R}^d} J(\theta) \quad \text{with} \quad J(\theta) = \sum_{s \in \mathcal{S}} \nu(s)\, v_{\pi_\theta}(s) = \mathbb{E}_{S \sim \nu}[v_{\pi_\theta}(S)].$$

Difference with (approximate) dynamic programming:
- DP: find a policy that maximizes the value for every state;
- policy search: find a policy that maximizes the value on average (under $\nu$).
5 Policy search and actor-critic methods: The policy gradient theorem
A natural approach, gradient ascent:
$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta).$$
How to compute the gradient $\nabla_\theta J(\theta)$?

Define $d_{\nu,\pi} \in \Delta_{\mathcal{S}}$, the $\gamma$-weighted occupancy measure:
$$d_{\nu,\pi} = (1-\gamma)\, \nu^\top (I - \gamma P_\pi)^{-1}.$$

Theorem (Policy gradient). Let $\pi_\theta$ be such that $\pi_\theta(a|s) > 0$ for all $s, a$. We have
$$\nabla_\theta J(\theta) = \frac{1}{1-\gamma} \sum_{s \in \mathcal{S}} d_{\nu,\pi}(s) \sum_{a \in \mathcal{A}} \pi_\theta(a|s)\, Q_{\pi_\theta}(s,a)\, \nabla_\theta \ln \pi_\theta(a|s) = \frac{1}{1-\gamma}\, \mathbb{E}_{S \sim d_{\nu,\pi},\, A \sim \pi_\theta(\cdot|S)}\left[Q_{\pi_\theta}(S,A)\, \nabla_\theta \ln \pi_\theta(A|S)\right].$$
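A minimal sketch of a Monte Carlo estimate of this gradient for a softmax policy, assuming samples (s_i, a_i, q_i) with s_i drawn from d_{nu,pi}, a_i drawn from pi_theta(.|s_i), and q_i a rollout estimate of Q_{pi_theta}(s_i, a_i); the feature map phi is an assumption.

```python
import numpy as np

def softmax_probs(theta, phi, s, actions):
    """pi_theta(.|s) for the softmax policy pi_theta(a|s) proportional to exp(theta^T phi(s, a))."""
    logits = np.array([theta @ phi(s, a) for a in actions])
    p = np.exp(logits - logits.max())
    return p / p.sum()

def grad_log_softmax(theta, phi, s, a, actions):
    """grad_theta ln pi_theta(a|s) = phi(s, a) - sum_a' pi_theta(a'|s) phi(s, a')."""
    p = softmax_probs(theta, phi, s, actions)
    mean_feat = sum(p_a * phi(s, a_prime) for p_a, a_prime in zip(p, actions))
    return phi(s, a) - mean_feat

def policy_gradient_estimate(theta, phi, samples, actions, gamma):
    """Monte Carlo estimate of grad J(theta) from samples (s_i, a_i, q_i)."""
    grads = [q * grad_log_softmax(theta, phi, s, a, actions) for s, a, q in samples]
    return np.mean(grads, axis=0) / (1.0 - gamma)
```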
5 Policy search and actor-critic methods: Actor-critic methods
Policy gradient:
$$\nabla_\theta J(\theta) = \frac{1}{1-\gamma}\, \mathbb{E}_{S \sim d_{\nu,\pi},\, A \sim \pi_\theta(\cdot|S)}\left[Q_{\pi_\theta}(S,A)\, \nabla_\theta \ln \pi_\theta(A|S)\right].$$
- $Q_\pi$ can be estimated pointwise using Monte Carlo rollouts.
- Policy search is called an actor method (DPI too).
- Can we replace $Q_\pi$ by an approximation $Q_w \in \mathcal{H}$ without changing the gradient?
- If so, the resulting approach is called an actor-critic method.
Policy gradient with a critic

Theorem (Policy gradient with a critic). If the parameterization of the state-action value function is compatible, in the sense that
$$\forall (s,a) \in \mathcal{S} \times \mathcal{A}, \quad \nabla_\theta \ln \pi_\theta(a|s) = \nabla_w Q_w(s,a),$$
and if $Q_w$ is a local optimum of the risk based on the $\ell_2$-loss, with state-action distribution given by $d_{\nu,\pi}$ and with the target function being $Q_\pi$, that is,
$$\nabla_w \mathbb{E}_{d_{\nu,\pi}}\left[\left(Q_{\pi_\theta}(S,A) - Q_w(S,A)\right)^2\right] = 0,$$
then the gradient satisfies
$$\nabla_\theta J(\theta) = \mathbb{E}_{d_{\nu,\pi}}\left[Q_w(S,A)\, \nabla_\theta \ln \pi_\theta(A|S)\right].$$
Policy gradient with a critic (cont.)

Example of a compatible approximation.
- Softmax policy: $\pi_\theta(a|s) = \frac{e^{\theta^\top \phi(s,a)}}{\sum_{a' \in \mathcal{A}} e^{\theta^\top \phi(s,a')}}$.
- Gradient: $\nabla_\theta \ln \pi_\theta(a|s) = \phi(s,a) - \sum_{a' \in \mathcal{A}} \pi_\theta(a'|s)\, \phi(s,a')$.
- Compatible approximation: $Q_w(s,a) = w^\top\left(\phi(s,a) - \sum_{a' \in \mathcal{A}} \pi_\theta(a'|s)\, \phi(s,a')\right)$.
- This is not a Q-function, since $\sum_{a \in \mathcal{A}} \pi_\theta(a|s)\, Q_w(s,a) = 0$; it is more an advantage function: $A_\pi(s,a) = Q_\pi(s,a) - v_\pi(s)$.
- Yet, for any $v \in \mathbb{R}^{\mathcal{S}}$, $\mathbb{E}_{d_{\nu,\pi}}[v(S)\, \nabla_\theta \ln \pi_\theta(A|S)] = 0$, so a state-only term does not change the gradient.
- As the term $w^\top \sum_{a' \in \mathcal{A}} \pi_\theta(a'|s)\, \phi(s,a')$ does not depend on $a$, a compatible approximation is also given by
$$Q_w(s,a) = w^\top \phi(s,a).$$
Natural policy gradient

Natural gradient:
- the gradient premultiplied by the inverse of the Fisher information matrix;
- instead of following the steepest direction in the parameter space, it follows the steepest direction with respect to the Fisher metric;
- it tends to be much more efficient empirically.

In our case, the natural gradient $\widetilde{\nabla}$ is
$$\widetilde{\nabla}_\theta J(\theta) = F(\theta)^{-1} \nabla_\theta J(\theta) \quad \text{with} \quad F(\theta) = \mathbb{E}_{d_{\nu,\pi}}\left[\nabla_\theta \ln \pi_\theta(A|S)\, \left(\nabla_\theta \ln \pi_\theta(A|S)\right)^\top\right].$$
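A minimal sketch of the natural gradient, estimating the Fisher matrix from the same samples and reusing grad_log_softmax and policy_gradient_estimate from the policy-gradient sketch above; the small ridge term is an added assumption for invertibility.

```python
import numpy as np

def natural_gradient(theta, phi, samples, actions, gamma, reg=1e-6):
    """Natural gradient: F(theta)^{-1} grad J(theta), both estimated from samples."""
    d = len(theta)
    F = reg * np.eye(d)
    for s, a, _q in samples:
        g = grad_log_softmax(theta, phi, s, a, actions)   # score of the softmax policy
        F += np.outer(g, g) / len(samples)                # Fisher information estimate
    grad = policy_gradient_estimate(theta, phi, samples, actions, gamma)
    return np.linalg.solve(F, grad)
```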
Natural policy gradient (cont.)

Theorem (Natural policy gradient with a critic). If the parameterization of the state-action value function is compatible, in the sense that
$$\forall (s,a) \in \mathcal{S} \times \mathcal{A}, \quad \nabla_\theta \ln \pi_\theta(a|s) = \nabla_w Q_w(s,a),$$
and if $Q_w$ is a local optimum of the risk based on the $\ell_2$-loss, with state-action distribution given by $d_{\nu,\pi}$ and with the target function being $Q_\pi$, that is,
$$\nabla_w \mathbb{E}_{d_{\nu,\pi}}\left[\left(Q_{\pi_\theta}(S,A) - Q_w(S,A)\right)^2\right] = 0,$$
then the natural gradient satisfies
$$\widetilde{\nabla}_\theta J(\theta) = w.$$