TRANSCRIPT
Reinforcement Learning: An Overview
Yi Cui, Fei Feng, Yibo Zeng
Dept. of Mathematics, UCLA
November 13, 2017
1 / 32
Model : Sequential Markov Decision Process
• discrete time steps, t = 1, 2, 3, ...
• S_t ∈ S, state space
• A_t ∈ A(S_t) ⊂ A, action space
• R_t(S_t, A_t) : S × A → R, immediate reward of taking A_t at S_t
• p(S_{t+1}, R_t | S_t, A_t), transition probability
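As a concrete illustration (not part of the original slides), a small finite MDP can be stored as a table of transitions; the states, actions, and reward values below are invented for the example.

```python
import random

# Toy finite MDP: model[(s, a)] is a list of (next_state, reward, probability)
# triples representing p(s', r | s, a). All values here are made up.
model = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "go"):   [("s1", 1.0, 0.9), ("s0", 0.0, 0.1)],
    ("s1", "stay"): [("s1", 2.0, 1.0)],
    ("s1", "go"):   [("s0", 0.0, 1.0)],
}

def step(state, action):
    """Sample (next_state, reward) from p(s', r | s, a)."""
    transitions = model[(state, action)]
    weights = [prob for (_, _, prob) in transitions]
    s_next, r, _ = random.choices(transitions, weights=weights, k=1)[0]
    return s_next, r
```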
2 / 32
An Optimization Problem
Policy: π : S → P(A)
Agent’s Goal: choose a policy to maximize a function of reward sequence.
• expected total discounted reward

  E[G_1] = E[R_1 + γR_2 + γ²R_3 + · · ·] = E[ ∑_{k=1}^∞ γ^{k-1} R_k ],

  where 0 < γ < 1 is a discount rate.
• average reward
3 / 32
An Optimization Problem
State-value function of π

V^π(s) = E[ G_t | S_t = s; π ]

For a fixed π, V^π satisfies the Bellman equation (a self-consistency condition):

V^π(s) = ∑_a π(a|s) ∑_{s′,r} p(s′, r | s, a) [ r + γ V^π(s′) ],  ∀ s ∈ S

Action-value function of π

Q^π(s, a) = E[ G_t | S_t = s, A_t = a; π ]
          = E[ R_t + γ G_{t+1} | S_t = s, A_t = a; π ]
          = E[ R_t + γ V^π(S_{t+1}) | S_t = s, A_t = a; π ]
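The Bellman equation above translates directly into iterative policy evaluation. A minimal tabular sketch, assuming the model p and the policy π are given as dictionaries in the toy format introduced earlier (names invented for illustration):

```python
def policy_evaluation(states, actions, p, pi, gamma=0.9, tol=1e-8):
    """Repeatedly apply the Bellman equation for a fixed policy pi.

    p[(s, a)] -> list of (next_state, reward, probability) triples.
    pi[s]     -> dict mapping each action a to pi(a | s).
    Returns a dict approximating V^pi.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(
                pi[s].get(a, 0.0)
                * sum(prob * (r + gamma * V[s2]) for (s2, r, prob) in p[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```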
4 / 32
Optimal Policy and Optimal Value
Build a partial ordering over policies:

π ≥ π′  ⟺  V^π(s) ≥ V^{π′}(s),  ∀ s ∈ S

Find an optimal policy π* such that

V*(s) := V^{π*}(s) = max_π V^π(s),  ∀ s ∈ S.

The Bellman optimality equations for V*, Q* are

V*(s) = max_a E[ R_t + γ V*(S_{t+1}) | S_t = s, A_t = a ]              (1)
Q*(s, a) = E[ R_t + γ max_{a′} Q*(S_{t+1}, a′) | S_t = s, A_t = a ].   (2)

Note that

π*(s) = arg max_a Q*(s, a),   V*(s) = max_a Q*(s, a),   ∀ s ∈ S.
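When the model is known, backup (1) gives the classical value-iteration scheme, with the greedy policy then read off from Q* as in the note above. A minimal sketch in the same assumed tabular format:

```python
def value_iteration(states, actions, p, gamma=0.9, tol=1e-8):
    """Apply the Bellman optimality backup (1) until convergence,
    then extract the greedy policy pi*(s) = argmax_a Q*(s, a)."""
    def q(s, a, V):
        return sum(prob * (r + gamma * V[s2]) for (s2, r, prob) in p[(s, a)])

    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(q(s, a, V) for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    pi_star = {s: max(actions, key=lambda a: q(s, a, V)) for s in states}
    return V, pi_star
```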
5 / 32
General Approach for π∗
• If the model p(s′, r | s, a) is available → dynamic programming.
• If no model is available → reinforcement learning (RL) algorithms:
  1. model-based methods: learn the model, then derive the optimal policy;
  2. model-free methods: learn the optimal policy without learning the model.

For model-free methods,

• Value-based: evaluate Q*, then derive a greedy policy
  1. Tabular implementation: Sarsa [1], Q-learning [2]
  2. Function approximation: Deep Q-learning
• Policy-based: search over the policy space for better value: actor-critic (DDPG)
6 / 32
On-policy vs. Off-policy

• Learning policy: a policy that maps experience into a current choice
  (experience contains: states visited, actions chosen, rewards received, etc.)
• Update rule: how the algorithm uses experience to change its estimate of the optimal value function.
7 / 32
Sarsa: On-policy TD Control
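The pseudocode image on this slide did not survive extraction. Below is a minimal tabular Sarsa sketch following the standard on-policy TD control algorithm from Sutton & Barto [1]; the environment interface (`env.reset()`, `env.step(a)` returning `(next_state, reward, done)`) and the ε-greedy helper are assumptions made for the example.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, eps):
    """Pick a random action with probability eps, else the greedy one."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Sarsa: move Q(s, a) toward r + gamma * Q(s', a'), where a'
    is the action actually chosen next by the current behaviour policy."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, eps)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = epsilon_greedy(Q, s2, actions, eps)
            target = r + (0.0 if done else gamma * Q[(s2, a2)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q
```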
8 / 32
Q-learning: Off-policy TD Control
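The pseudocode image is again missing; a minimal tabular Q-learning sketch in the same assumed environment interface, reusing the ε-greedy helper from the Sarsa sketch. The only change is the update target, which uses max_{a′} Q(s′, a′) regardless of which action the behaviour policy will actually take next (hence off-policy).

```python
def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, actions, eps)   # behaviour policy
            s2, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # greedy target
            s = s2
    return Q
```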
9 / 32
The Drawbacks of Tabular Methods
• Huge memory demand, non-scalable;
• Impossible to list all states in more complicated scenarios.
Solution: Use functions to approximate Q.
10 / 32
Deep Q-learning: Neural Network + Q-learning
Idea: represent the Q-function with a neural network that takes the state and action as input and outputs the corresponding Q-value.
Updating a Q-table each episode → updating the parameters of a deep Q-network (DQN).
11 / 32
Deep Q-learning
• Input: four consecutive 84 × 84 images (stacked frames).
• Two convolutional layers.
• Two fully-connected (dense) layers.
• Output: a vector containing every action's Q-value.
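A sketch of a network with this shape (two convolutions, two fully-connected layers, one Q-value per action), written in PyTorch purely for illustration; the particular filter sizes follow the original DQN paper and are an assumption, not something stated on the slide.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps a state (4 stacked 84x84 frames) to a vector of Q-values, one per action."""
    def __init__(self, n_actions):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 84x84 -> 20x20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 20x20 -> 9x9
            nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),
            nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, x):          # x: (batch, 4, 84, 84)
        return self.fc(self.conv(x))
```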
12 / 32
Deep Q-learning Algorithm
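The algorithm box on this slide is an image and is not reproduced in the transcript. Below is a condensed sketch of the usual DQN training loop (ε-greedy exploration, experience replay, and a periodically synchronised target network θ⁻); the environment interface, hyperparameters, and the `DQN` class from the previous sketch are all assumptions for illustration.

```python
import random
from collections import deque
import numpy as np
import torch
import torch.nn.functional as F

def train_dqn(env, n_actions, steps=100_000, gamma=0.99, eps=0.1,
              batch_size=32, buffer_size=50_000, target_sync=1_000, lr=1e-4):
    q_net, target_net = DQN(n_actions), DQN(n_actions)
    target_net.load_state_dict(q_net.state_dict())
    opt = torch.optim.Adam(q_net.parameters(), lr=lr)
    replay = deque(maxlen=buffer_size)

    s = env.reset()                                  # assumed to return a (4, 84, 84) array
    for t in range(steps):
        # epsilon-greedy action from the online network
        if random.random() < eps:
            a = random.randrange(n_actions)
        else:
            with torch.no_grad():
                q_vals = q_net(torch.as_tensor(s, dtype=torch.float32).unsqueeze(0))
            a = int(q_vals.argmax(dim=1).item())
        s2, r, done = env.step(a)
        replay.append((s, a, r, s2, float(done)))
        s = env.reset() if done else s2

        if len(replay) >= batch_size:
            batch = random.sample(replay, batch_size)
            ss, aa, rr, ss2, dd = zip(*batch)
            ss = torch.as_tensor(np.stack(ss), dtype=torch.float32)
            ss2 = torch.as_tensor(np.stack(ss2), dtype=torch.float32)
            aa = torch.as_tensor(aa, dtype=torch.int64)
            rr = torch.as_tensor(rr, dtype=torch.float32)
            dd = torch.as_tensor(dd, dtype=torch.float32)
            # y_j = r_j + gamma * max_a' Q(s'_j, a'; theta^-) for non-terminal s'_j
            with torch.no_grad():
                y = rr + gamma * (1.0 - dd) * target_net(ss2).max(dim=1).values
            q = q_net(ss).gather(1, aa.unsqueeze(1)).squeeze(1)
            loss = F.smooth_l1_loss(q, y)
            opt.zero_grad()
            loss.backward()
            opt.step()

        if t % target_sync == 0:                     # refresh theta^-
            target_net.load_state_dict(q_net.state_dict())
    return q_net
```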
13 / 32
However,

DQN can only handle discrete, low-dimensional action spaces.

Recall:

• Learning policy:

  arg max_a Q(φ(s_t), a; θ)

• Update rule:

  y_j := r_j + γ max_{a′} Q(φ_{j+1}, a′; θ⁻)

Both involve a maximization over the action set, which becomes impractical when A is continuous or high-dimensional.
14 / 32
Discretize the action space?
• not scalable:
  the number of discrete actions grows exponentially with the dimension of the action space, e.g. 3 values per dimension in a 7-dimensional action space already gives 3^7 = 2187 actions
• not fine-grained:
  naive discretization removes crucial information: the structure of A, and possibly the optimal action itself
15 / 32
Policy-Based Methods
Deterministic policy: µ_θ : S → A,

where θ ∈ R^n is a vector of parameters.

Agent's goal: obtain an optimal policy which maximizes the expected discounted return from the start state, denoted by the performance objective

J(µ_θ) := E[ G_1 | µ_θ ]    (3)
16 / 32
J(µ_θ) := E[ G_1 | µ_θ ]

More specifically,

J(µ_θ) = ∫_S ρ^µ(s) r(s, µ_θ(s)) ds = E_{s∼ρ^µ}[ r(s, µ_θ(s)) ],    (4)

where ρ^µ(s′) is the (improper) discounted state distribution:

ρ^µ(s′) := ∫_S ∑_{t=1}^∞ γ^{t-1} p_1(s) p(s → s′, t, µ) ds.    (5)
17 / 32
Deterministic Policy Ascent
Theorem 1 (Deterministic Policy Gradient Theorem, Silver et al., 2014)

∇_θ J(µ_θ) = ∫_S ρ^µ(s) ∇_θ Q^µ(s, µ_θ(s)) ds
           = E_{s∼ρ^µ}[ ∇_θ Q^µ(s, µ_θ(s)) ]
           = E_{s∼ρ^µ}[ ∇_θ µ_θ(s) ∇_a Q^µ(s, a)|_{a=µ_θ(s)} ]    (6)
Question:
How do we estimate the action-value function Q^µ(s, a)?
18 / 32
Actor-Critic
• Actor: adjusts the parameters θ of the deterministic policy µ_θ(s) by stochastic gradient ascent.
• Critic: inspired by the success of DQN, estimates the action-value function:

  Q(s, a | θ^Q) ≈ Q^{µ_θ}(s, a)    (7)
DNN? Experience Replay? Target Network?
19 / 32
Off-Policy Deterministic Policy Gradient (Silver et al., 2014)

∇_{θ^µ} J ≈ E_{µ′}[ ∇_{θ^µ} Q(s, a | θ^Q) |_{s=s_t, a=µ(s_t | θ^µ)} ]    (8)
          = E_{µ′}[ ∇_a Q(s, a | θ^Q) |_{s=s_t, a=µ(s_t)} ∇_{θ^µ} µ(s | θ^µ) |_{s=s_t} ]    (9)
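A minimal sketch of how equations (8)–(9) are used in practice, i.e. the DDPG-style actor update with a critic trained as in DQN. The network modules, optimizers, and replay batch format are assumptions made for illustration, not the authors' code; autograd applies the chain rule of (9) when we differentiate −Q(s, µ(s)) with respect to the actor parameters.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, actor_targ, critic_targ,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    """One actor-critic update on a replay batch of tensors (s, a, r, s2, done)."""
    s, a, r, s2, done = batch

    # Critic: DQN-style target computed with the target networks.
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * critic_targ(s2, actor_targ(s2))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: ascend E[ grad_theta mu(s) * grad_a Q(s, a)|_{a=mu(s)} ]  (eqs. (8)-(9)),
    # implemented by descending the loss -Q(s, mu(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks (a DDPG detail, assumed here).
    for net, targ in ((actor, actor_targ), (critic, critic_targ)):
        for p, p_t in zip(net.parameters(), targ.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```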
20 / 32
21 / 32
22 / 32
Asynchronous One-Step Q-learning
23 / 32
Asynchronous Advantage Actor-Critic (A3C)
24 / 32
Mengdi Wang: Primal-Dual π
Infinite-horizon Average-reward MDP
• Maximize the value function
• Bellman equation
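The formulas on this slide are not reproduced in the transcript. For reference, a reconstruction of the usual statements (possibly differing in notation from the slide): the objective is the long-run average reward

v̄^π := lim_{T→∞} (1/T) E[ ∑_{t=1}^T R_t | π ],

and the optimal average reward v̄*, together with a differential value function h*, satisfies the average-reward Bellman equation

v̄* + h*(s) = max_a [ r(s, a) + ∑_{s′} p(s′ | s, a) h*(s′) ],  ∀ s ∈ S.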
25 / 32
Linear Programming
• Primal
• Dual
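The LP itself is missing from the transcript; the standard linear-programming formulation of the average-reward MDP (again a reconstruction of the usual textbook form, not copied from the slide) is:

Primal:  min_{v̄, h} v̄   s.t.   v̄ + h(s) ≥ r(s, a) + ∑_{s′} p(s′ | s, a) h(s′),  ∀ s ∈ S, a ∈ A

Dual:    max_{µ ≥ 0} ∑_{s,a} µ(s, a) r(s, a)   s.t.   ∑_a µ(s′, a) = ∑_{s,a} p(s′ | s, a) µ(s, a)  ∀ s′ ∈ S,   ∑_{s,a} µ(s, a) = 1,

where µ(s, a) can be read as a stationary state-action occupancy measure.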
26 / 32
• Minimax formulation
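Reconstructing the usual form rather than copying the slide: pairing the primal and dual variables through the Lagrangian turns the LP into a bilinear saddle-point problem, roughly

min_{v̄, h} max_{µ ≥ 0, ∑µ = 1}  v̄ + ∑_{s,a} µ(s, a) [ r(s, a) + ∑_{s′} p(s′ | s, a) h(s′) − h(s) − v̄ ],

which is the form the dual-ascent / primal-descent iterations on the following slides operate on.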
27 / 32
Dual-ascent-primal-descent
28 / 32
Simplify for implementation
29 / 32
Algorithm
30 / 32
Convergence
31 / 32
References:
[1] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Vol. 1. MIT Press, Cambridge, 1998.

[2] Christopher J. C. H. Watkins. "Learning from Delayed Rewards." PhD thesis, King's College, Cambridge, 1989.
32 / 32