TRANSCRIPT
Reinforcement Learning: An Overview
Yi Cui, Fei Feng, Yibo Zeng
Dept. of Mathematics, UCLA
November 13, 2017
1 / 32
Model : Sequential Markov Decision Process
• discrete time steps, t = 1, 2, 3, ...
• S_t ∈ S, state space
• A_t ∈ A(S_t) ⊂ A, action space
• R_t(S_t, A_t) : S × A → R, immediate reward of taking A_t at S_t
• p(S_{t+1}, R_t | S_t, A_t), transition probability
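As a concrete illustration (not part of the original slides), a small finite MDP can be stored as a table of transitions; the states, actions, and reward values below are invented for the example.

```python
import random

# Toy finite MDP: model[(s, a)] is a list of (next_state, reward, probability)
# triples representing p(s', r | s, a). All values here are made up.
model = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "go"):   [("s1", 1.0, 0.9), ("s0", 0.0, 0.1)],
    ("s1", "stay"): [("s1", 2.0, 1.0)],
    ("s1", "go"):   [("s0", 0.0, 1.0)],
}

def step(state, action):
    """Sample (next_state, reward) from p(s', r | s, a)."""
    transitions = model[(state, action)]
    weights = [prob for (_, _, prob) in transitions]
    s_next, r, _ = random.choices(transitions, weights=weights, k=1)[0]
    return s_next, r
```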
2 / 32
An Optimization Problem
Policy: π : S → P(A)
Agent’s Goal: choose a policy to maximize a function of reward sequence.
• expected total discounted reward

  E[G_1] = E[R_1 + γR_2 + γ²R_3 + · · ·] = E[ ∑_{k=1}^∞ γ^{k-1} R_k ],

  where 0 < γ < 1 is a discount rate.
• average reward
3 / 32
An Optimization Problem
State-value function of π

V^π(s) = E[ G_t | S_t = s; π ]

For a fixed π, V^π satisfies the Bellman equation (a self-consistency condition):

V^π(s) = ∑_a π(a|s) ∑_{s′,r} p(s′, r | s, a) [ r + γ V^π(s′) ],  ∀ s ∈ S

Action-value function of π

Q^π(s, a) = E[ G_t | S_t = s, A_t = a; π ]
          = E[ R_t + γ G_{t+1} | S_t = s, A_t = a; π ]
          = E[ R_t + γ V^π(S_{t+1}) | S_t = s, A_t = a; π ]
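The Bellman equation above translates directly into iterative policy evaluation. A minimal tabular sketch, assuming the model p and the policy π are given as dictionaries in the toy format introduced earlier (names invented for illustration):

```python
def policy_evaluation(states, actions, p, pi, gamma=0.9, tol=1e-8):
    """Repeatedly apply the Bellman equation for a fixed policy pi.

    p[(s, a)] -> list of (next_state, reward, probability) triples.
    pi[s]     -> dict mapping each action a to pi(a | s).
    Returns a dict approximating V^pi.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = sum(
                pi[s].get(a, 0.0)
                * sum(prob * (r + gamma * V[s2]) for (s2, r, prob) in p[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```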
4 / 32
Optimal Policy and Optimal Value
Build a partial ordering over policies:

π ≥ π′  ⟺  V^π(s) ≥ V^{π′}(s),  ∀ s ∈ S

Find an optimal policy π* such that

V*(s) := V^{π*}(s) = max_π V^π(s),  ∀ s ∈ S.

The Bellman optimality equations for V*, Q* are

V*(s) = max_a E[ R_t + γ V*(S_{t+1}) | S_t = s, A_t = a ]              (1)
Q*(s, a) = E[ R_t + γ max_{a′} Q*(S_{t+1}, a′) | S_t = s, A_t = a ].   (2)

Note that

π*(s) = arg max_a Q*(s, a),   V*(s) = max_a Q*(s, a),   ∀ s ∈ S.
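When the model is known, backup (1) gives the classical value-iteration scheme, with the greedy policy then read off from Q* as in the note above. A minimal sketch in the same assumed tabular format:

```python
def value_iteration(states, actions, p, gamma=0.9, tol=1e-8):
    """Apply the Bellman optimality backup (1) until convergence,
    then extract the greedy policy pi*(s) = argmax_a Q*(s, a)."""
    def q(s, a, V):
        return sum(prob * (r + gamma * V[s2]) for (s2, r, prob) in p[(s, a)])

    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(q(s, a, V) for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    pi_star = {s: max(actions, key=lambda a: q(s, a, V)) for s in states}
    return V, pi_star
```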
5 / 32
General Approach for π∗
• If the model p(s′, r | s, a) is available → dynamic programming.
• If no model is available → reinforcement learning (RL) algorithms:
  1. model-based methods: learn the model, then derive the optimal policy;
  2. model-free methods: learn the optimal policy without learning the model.

For model-free methods,

• Value-based: evaluate Q*, then derive a greedy policy
  1. Tabular implementation: Sarsa [1], Q-learning [2]
  2. Function approximation: Deep Q-learning
• Policy-based: search over the policy space for better value: actor-critic (DDPG)
6 / 32
On-policy vs. Off-policy

• Learning policy: a policy that maps experience into a current choice
  (experience contains: states visited, actions chosen, rewards received, etc.)
• Update rule: how the algorithm uses experience to change its estimate of the optimal value function.
7 / 32
Sarsa: On-policy TD Control
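The pseudocode image on this slide did not survive extraction. Below is a minimal tabular Sarsa sketch following the standard on-policy TD control algorithm from Sutton & Barto [1]; the environment interface (`env.reset()`, `env.step(a)` returning `(next_state, reward, done)`) and the ε-greedy helper are assumptions made for the example.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, eps):
    """Pick a random action with probability eps, else the greedy one."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Sarsa: move Q(s, a) toward r + gamma * Q(s', a'), where a'
    is the action actually chosen next by the current behaviour policy."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, eps)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = epsilon_greedy(Q, s2, actions, eps)
            target = r + (0.0 if done else gamma * Q[(s2, a2)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q
```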
8 / 32
Q-learning: Off-policy TD Control
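The pseudocode image is again missing; a minimal tabular Q-learning sketch in the same assumed environment interface, reusing the ε-greedy helper from the Sarsa sketch. The only change is the update target, which uses max_{a′} Q(s′, a′) regardless of which action the behaviour policy will actually take next (hence off-policy).

```python
def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular Q-learning: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, actions, eps)   # behaviour policy
            s2, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # greedy target
            s = s2
    return Q
```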
9 / 32
The Drawbacks of Tabular Methods
• Huge memory demand, non-scalable;
• Impossible to list all states in more complicated scenarios.
Solution: Use functions to approximate Q.
10 / 32
Deep Q-learning: Neural Network + Q-learning
Idea: represent the Q-function with a neural network that takes the state and action as input and outputs the corresponding Q-value.
Updating a Q-table each episode → updating the parameters of a deep Q-network (DQN).
11 / 32
Deep Q-learning
• Input: four consecutive 84 × 84 images (stacked frames).
• Two convolutional layers.
• Two fully-connected (dense) layers.
• Output: a vector containing every action's Q-value.
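A sketch of a network with this shape (two convolutions, two fully-connected layers, one Q-value per action), written in PyTorch purely for illustration; the particular filter sizes follow the original DQN paper and are an assumption, not something stated on the slide.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps a state (4 stacked 84x84 frames) to a vector of Q-values, one per action."""
    def __init__(self, n_actions):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 84x84 -> 20x20
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 20x20 -> 9x9
            nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),
            nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, x):          # x: (batch, 4, 84, 84)
        return self.fc(self.conv(x))
```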
12 / 32
Deep Q-learning Algorithm
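The algorithm box on this slide is an image and is not reproduced in the transcript. Below is a condensed sketch of the usual DQN training loop (ε-greedy exploration, experience replay, and a periodically synchronised target network θ⁻); the environment interface, hyperparameters, and the `DQN` class from the previous sketch are all assumptions for illustration.

```python
import random
from collections import deque
import numpy as np
import torch
import torch.nn.functional as F

def train_dqn(env, n_actions, steps=100_000, gamma=0.99, eps=0.1,
              batch_size=32, buffer_size=50_000, target_sync=1_000, lr=1e-4):
    q_net, target_net = DQN(n_actions), DQN(n_actions)
    target_net.load_state_dict(q_net.state_dict())
    opt = torch.optim.Adam(q_net.parameters(), lr=lr)
    replay = deque(maxlen=buffer_size)

    s = env.reset()                                  # assumed to return a (4, 84, 84) array
    for t in range(steps):
        # epsilon-greedy action from the online network
        if random.random() < eps:
            a = random.randrange(n_actions)
        else:
            with torch.no_grad():
                q_vals = q_net(torch.as_tensor(s, dtype=torch.float32).unsqueeze(0))
            a = int(q_vals.argmax(dim=1).item())
        s2, r, done = env.step(a)
        replay.append((s, a, r, s2, float(done)))
        s = env.reset() if done else s2

        if len(replay) >= batch_size:
            batch = random.sample(replay, batch_size)
            ss, aa, rr, ss2, dd = zip(*batch)
            ss = torch.as_tensor(np.stack(ss), dtype=torch.float32)
            ss2 = torch.as_tensor(np.stack(ss2), dtype=torch.float32)
            aa = torch.as_tensor(aa, dtype=torch.int64)
            rr = torch.as_tensor(rr, dtype=torch.float32)
            dd = torch.as_tensor(dd, dtype=torch.float32)
            # y_j = r_j + gamma * max_a' Q(s'_j, a'; theta^-) for non-terminal s'_j
            with torch.no_grad():
                y = rr + gamma * (1.0 - dd) * target_net(ss2).max(dim=1).values
            q = q_net(ss).gather(1, aa.unsqueeze(1)).squeeze(1)
            loss = F.smooth_l1_loss(q, y)
            opt.zero_grad()
            loss.backward()
            opt.step()

        if t % target_sync == 0:                     # refresh theta^-
            target_net.load_state_dict(q_net.state_dict())
    return q_net
```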
13 / 32
However,

DQN can only handle discrete, low-dimensional action spaces.

Recall:

• Learning policy:

  arg max_a Q(φ(s_t), a; θ)

• Update rule:

  y_j := r_j + γ max_{a′} Q(φ_{j+1}, a′; θ⁻)

Both involve a maximization over the action set, which becomes impractical when A is continuous or high-dimensional.
14 / 32
Discretize the action space?
• not scalable:
  the number of discrete actions grows exponentially with the dimension of the action space, e.g. 3 values per dimension in a 7-dimensional action space already gives 3^7 = 2187 actions
• not fine-grained:
  naive discretization removes crucial information: the structure of A, and possibly the optimal action itself
15 / 32
Policy-Based Methods
Deterministic policy: µ_θ : S → A,

where θ ∈ R^n is a vector of parameters.

Agent's goal: obtain an optimal policy which maximizes the expected discounted return from the start state, denoted by the performance objective

J(µ_θ) := E[ G_1 | µ_θ ]    (3)
16 / 32
J(µ_θ) := E[ G_1 | µ_θ ]

More specifically,

J(µ_θ) = ∫_S ρ^µ(s) r(s, µ_θ(s)) ds = E_{s∼ρ^µ}[ r(s, µ_θ(s)) ],    (4)

where ρ^µ(s′) is the (improper) discounted state distribution:

ρ^µ(s′) := ∫_S ∑_{t=1}^∞ γ^{t-1} p_1(s) p(s → s′, t, µ) ds.    (5)
17 / 32
Deterministic Policy Ascent
Theorem 1 (Deterministic Policy Gradient Theorem, Silver et al., 2014)

∇_θ J(µ_θ) = ∫_S ρ^µ(s) ∇_θ Q^µ(s, µ_θ(s)) ds
           = E_{s∼ρ^µ}[ ∇_θ Q^µ(s, µ_θ(s)) ]
           = E_{s∼ρ^µ}[ ∇_θ µ_θ(s) ∇_a Q^µ(s, a)|_{a=µ_θ(s)} ]    (6)
Question:
How do we estimate the action-value function Q^µ(s, a)?
18 / 32
Actor-Critic
• Actor: adjusts the parameters θ of the deterministic policy µ_θ(s) by stochastic gradient ascent.
• Critic: inspired by the success of DQN, estimates the action-value function:

  Q(s, a | θ^Q) ≈ Q^{µ_θ}(s, a)    (7)
DNN? Experience Replay? Target Network?
19 / 32
Off-Policy Deterministic Policy Gradient (Silver et al., 2014)

∇_{θ^µ} J ≈ E_{µ′}[ ∇_{θ^µ} Q(s, a | θ^Q) |_{s=s_t, a=µ(s_t | θ^µ)} ]    (8)
          = E_{µ′}[ ∇_a Q(s, a | θ^Q) |_{s=s_t, a=µ(s_t)} ∇_{θ^µ} µ(s | θ^µ) |_{s=s_t} ]    (9)
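A minimal sketch of how equations (8)–(9) are used in practice, i.e. the DDPG-style actor update with a critic trained as in DQN. The network modules, optimizers, and replay batch format are assumptions made for illustration, not the authors' code; autograd applies the chain rule of (9) when we differentiate −Q(s, µ(s)) with respect to the actor parameters.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, actor_targ, critic_targ,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    """One actor-critic update on a replay batch of tensors (s, a, r, s2, done)."""
    s, a, r, s2, done = batch

    # Critic: DQN-style target computed with the target networks.
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * critic_targ(s2, actor_targ(s2))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: ascend E[ grad_theta mu(s) * grad_a Q(s, a)|_{a=mu(s)} ]  (eqs. (8)-(9)),
    # implemented by descending the loss -Q(s, mu(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks (a DDPG detail, assumed here).
    for net, targ in ((actor, actor_targ), (critic, critic_targ)):
        for p, p_t in zip(net.parameters(), targ.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```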
20 / 32
21 / 32
22 / 32
Asynchronous One-Step Q-learning
23 / 32
Asynchronous Advantage Actor-Critic (A3C)
24 / 32
Mengdi Wang: Primal-Dual π
Infinite-horizon Average-reward MDP
• Maximize the value function
• Bellman equation
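The formulas on this slide are not reproduced in the transcript. For reference, a reconstruction of the usual statements (possibly differing in notation from the slide): the objective is the long-run average reward

v̄^π := lim_{T→∞} (1/T) E[ ∑_{t=1}^T R_t | π ],

and the optimal average reward v̄*, together with a differential value function h*, satisfies the average-reward Bellman equation

v̄* + h*(s) = max_a [ r(s, a) + ∑_{s′} p(s′ | s, a) h*(s′) ],  ∀ s ∈ S.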
25 / 32
Linear Programming
• Primal
• Dual
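The LP itself is missing from the transcript; the standard linear-programming formulation of the average-reward MDP (again a reconstruction of the usual textbook form, not copied from the slide) is:

Primal:  min_{v̄, h} v̄   s.t.   v̄ + h(s) ≥ r(s, a) + ∑_{s′} p(s′ | s, a) h(s′),  ∀ s ∈ S, a ∈ A

Dual:    max_{µ ≥ 0} ∑_{s,a} µ(s, a) r(s, a)   s.t.   ∑_a µ(s′, a) = ∑_{s,a} p(s′ | s, a) µ(s, a)  ∀ s′ ∈ S,   ∑_{s,a} µ(s, a) = 1,

where µ(s, a) can be read as a stationary state-action occupancy measure.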
26 / 32
• Minimax formulation
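Reconstructing the usual form rather than copying the slide: pairing the primal and dual variables through the Lagrangian turns the LP into a bilinear saddle-point problem, roughly

min_{v̄, h} max_{µ ≥ 0, ∑µ = 1}  v̄ + ∑_{s,a} µ(s, a) [ r(s, a) + ∑_{s′} p(s′ | s, a) h(s′) − h(s) − v̄ ],

which is the form the dual-ascent / primal-descent iterations on the following slides operate on.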
27 / 32
Dual-ascent-primal-descent
28 / 32
Simplify for implementation
29 / 32
Algorithm
30 / 32
Convergence
31 / 32
References:
[1] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Vol. 1. MIT Press, Cambridge, 1998.

[2] Christopher J. C. H. Watkins. "Learning from Delayed Rewards." PhD thesis, King's College, Cambridge, 1989.
32 / 32