
Machine Learning
Lecture 16: Value-Based Deep Reinforcement Learning

Nevin L. [email protected]

Department of Computer Science and Engineering, The Hong Kong University of Science and Technology

This set of notes is based on the references listed at the end and internet resources.


Outline

1 Introduction

2 Training Target

3 Experience Replay

4 DQN in Atari

5 Improvements to DQN
   Double DQN
   Prioritized Experience Replay


Introduction

Motivation

Q-learning:

Difficult to represent Q(s, a) as a lookup table when the number of states is large.

Atari games: A state is an image with 210 × 160 pixels, and each pixel has 128 possible color values. The total number of states is 128^(210×160).

Cannot determine actions for states that have never been visited (and there are many such states).


Deep Q-Networks

In DQN, we aim to approximate the optimal action-value function Q∗(s, a) using a neural network with parameters θ:

Q(s, a; θ) ≈ Q∗(s, a)

The function is defined over all possible states (images) without enumerating them.

Hence there is information for picking actions at any state, even one never visited before.
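To make this concrete, here is a minimal sketch of such a network in PyTorch (the state dimension, number of actions, and layer sizes are illustrative assumptions, not part of the lecture): a single network maps a state to one Q-value per action, so it can score actions even for states never seen during training.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state s to a vector of Q-values, one entry per action: Q(s, ·; θ)."""

    def __init__(self, state_dim: int = 4, num_actions: int = 2, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),  # one output per action
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

# Greedy action selection works for any state, even one never visited before.
q_net = QNetwork()
state = torch.randn(1, 4)            # a hypothetical 4-dimensional state
action = q_net(state).argmax(dim=1)  # arg max_a Q(s, a; θ)
```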


Key question: How to learn θ from experiences with the environment?

s0, a0, r0, s1, a1, r1, s2, a2, r2, . . .


Training Target

Learning DQN

Q-learning: Q(s, a) ← Q(s, a) + α[r(s, a) + γ max_{a′} Q(s′, a′) − Q(s, a)]

We will learn θ iteratively, just as Q(s, a) is learned iteratively in Q-learning.

As the first step toward a DQN learning algorithm, we can change the Q-value update step into a θ update step:

θ ← update(θ; s, a, s′, r)

To do so, we need to come up with an objective function and update θ using its gradient.


Learning from experience tuple e = (s, a, s′, r)

We want to change θ so that Q(s, a; θ) gets closer to Q∗(s, a).

The problem is that we do not know the true target Q∗(s, a).

So, we use the TD target:

y = r(s, a) + γ max_{a′} Q(s′, a′; θ)

Let Q′ be the result of running one step of value iteration on Q(s, a; θ). It is closer to Q∗ than Q.

As explained in the previous lecture, y is an unbiased estimate of Q′ based on the experience tuple.


Our goal is now to push Q(s, a; θ) toward the TD target. So, we use the following loss function:

L(θ) = (y − Q(s, a; θ))² = ([r(s, a) + γ max_{a′} Q(s′, a′; θ)] − Q(s, a; θ))²   (1)

The update rule for θ is:

θ ← θ − α ∇_θ ([r(s, a) + γ max_{a′} Q(s′, a′; θ)] − Q(s, a; θ))²   (2)

where α is the learning rate.
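As an illustration only (not code from the lecture), the following PyTorch sketch applies loss (1) and update (2) to a single batched experience tuple; the network, learning rate, and discount factor are arbitrary choices, and detach() treats the TD target as a constant when differentiating, which is the usual semi-gradient reading of (2).

```python
import torch
import torch.nn as nn

gamma, alpha = 0.99, 1e-3
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # Q(·, ·; θ), one output per action
optimizer = torch.optim.SGD(q_net.parameters(), lr=alpha)

def td_update(s, a, s_next, r):
    """One application of update rule (2) to the experience tuple (s, a, s', r)."""
    # TD target y = r + γ max_a' Q(s', a'; θ); treated as a constant in the gradient.
    y = r + gamma * q_net(s_next).max(dim=1).values.detach()
    q_sa = q_net(s).gather(1, a.view(-1, 1)).squeeze(1)  # Q(s, a; θ)
    loss = (y - q_sa).pow(2).mean()                      # loss (1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                     # θ ← θ − α ∇_θ loss
```

The next slides address the fact that the target in this update still moves with θ.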


Moving Target

The TD target y = r(s, a) + γ max_{a′} Q(s′, a′; θ) depends on the parameters θ.

Every time the parameters are updated, the target is also changed.

So, learning θ with update rule (2) is like chasing a moving target.

This makes learning unstable.


Target Network

To deal with the moving target issue, create a target network that has the same structure as the DQN. Let θ− be its parameters.

Use the target network to compute the TD target y.

Set θ− = θ once in a while (every C steps).

So, we have this new update rule:

θ ← θ − α ∇_θ ([r(s, a) + γ max_{a′} Q(s′, a′; θ−)] − Q(s, a; θ))²   (3)

This way, we are solving a sequence of well-defined optimization problems.
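A minimal sketch of how the target network can be maintained in code (PyTorch assumed; the copy interval C is an arbitrary choice): θ− is a frozen copy of θ that is refreshed every C steps.

```python
import copy
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # Q(·, ·; θ)
target_net = copy.deepcopy(q_net)                                     # Q(·, ·; θ−), same structure
for p in target_net.parameters():
    p.requires_grad_(False)  # θ− is never trained directly

C = 1000  # copy interval (an arbitrary choice)

def maybe_sync(step: int) -> None:
    """Set θ− = θ once every C steps."""
    if step % C == 0:
        target_net.load_state_dict(q_net.state_dict())
```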


Deep Q-Learning with Target Network

Repeat:

Take action a in current state s, observe r and s′.

Update the parameters:

θ ← θ − α ∇_θ ([r(s, a) + γ max_{a′} Q(s′, a′; θ−)] − Q(s, a; θ))²

s ← s′

θ− ← θ every C steps.


Experience Replay

In Deep Q-Learning with a target network, only the latest experience tuple is used for each parameter update.

The idea of experience replay is:

Store experience tuples in a buffer.
Use a random minibatch of experience tuples for each parameter update.


Deep Q-Learning with Target Network and Experience Replay

Repeat:

Take action a in current state s, observe r and s′; add experience tuple (s, a, s′, r) to a buffer D; s ← s′.

Sample a minibatch B = {(s_j, a_j, s′_j, r_j)} from D.

Update the parameters:

θ ← θ − α ∇_θ ∑_j ([r(s_j, a_j) + γ max_{a′} Q(s′_j, a′; θ−)] − Q(s_j, a_j; θ))²

θ− ← θ every C steps.
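Putting the pieces together, here is a compact sketch of this loop (PyTorch assumed; the buffer capacity, batch size, copy interval, and the convention that states are stored as tensors are all illustrative assumptions):

```python
import random
from collections import deque
import torch
import torch.nn as nn

gamma, batch_size, C = 0.99, 32, 1000
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))       # Q(·, ·; θ)
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # Q(·, ·; θ−)
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

buffer = deque(maxlen=100_000)  # D; old experiences are discarded when the buffer is full

def store(s, a, s_next, r):
    """Add the experience tuple (s, a, s', r) to D."""
    buffer.append((s, a, s_next, r))

def replay_update(step: int):
    """Sample a minibatch B from D, apply the summed squared-error update, and sync θ− every C steps."""
    if len(buffer) < batch_size:
        return
    batch = random.sample(buffer, batch_size)              # uniform sampling from D
    s = torch.stack([e[0] for e in batch])
    a = torch.tensor([e[1] for e in batch])
    s_next = torch.stack([e[2] for e in batch])
    r = torch.tensor([e[3] for e in batch], dtype=torch.float32)
    with torch.no_grad():                                   # TD targets are computed with θ−
        y = r + gamma * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.view(-1, 1)).squeeze(1)     # Q(s_j, a_j; θ)
    loss = (y - q_sa).pow(2).sum()                          # sum over the minibatch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % C == 0:
        target_net.load_state_dict(q_net.state_dict())      # θ− ← θ every C steps
```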


Advantages of Experience Replay

Without experience replay, the experience tuples used in consecutive parameter updates are strongly correlated. Experience replay breaks these correlations and hence reduces the variance of the updates.

Experience replay avoids oscillations or divergence in the parameters.

With experience replay, each experience tuple is potentially used multiple times. Hence better data efficiency.

With experience replay, each update improves Q(s, a; θ) using signals from multiple points (experience tuples). Hence faster convergence.


A More General View of DQN

DQN can be viewed as three interacting processes: (1) collecting experience tuples and adding them to the buffer, (2) updating the target network, and (3) regressing the Q-function on minibatches sampled from the buffer. Processes 1 and 3 run at the same pace, while Process 2 is slower (the target network is only refreshed every C steps).

Old experiences are discarded when the buffer is full.


DQN in Atari

Atari Games

Space Invaders


DQN in Atari

End-to-end learning of values Q(s, a) from pixels s

Input state s is stack of raw pixels from last 4 frames

Output is Q(s, a) for 18 joystick/button positions

Reward is change in score for that step

Network architecture and hyperparameters fixed across all games
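For reference, a sketch of a convolutional Q-network for this setting (the layer sizes follow the widely used DQN architecture for Atari, and the 84 × 84 grayscale preprocessing of the 4-frame stack is an assumption taken from that line of work rather than from this slide):

```python
import torch
import torch.nn as nn

class AtariQNet(nn.Module):
    """Maps a stack of 4 preprocessed frames (4 × 84 × 84) to Q-values for 18 joystick/button positions."""

    def __init__(self, num_actions: int = 18):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),  # one Q-value per action
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(frames / 255.0))  # scale raw pixel values to [0, 1]
```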


Comparable with or superior to humans at the majority of games.


Results on Breakout

https://www.youtube.com/watch?v=V1eYniJ0Rnk


Improvements to DQN


Double DQN

Overestimation of the Value Function

In (3), we estimate the TD target as follows:

y_Q = r(s, a) + γ max_{a′} Q(s′, a′; θ−)

The subscript Q indicates that y_Q is the estimate used in Q-learning (DQN).

The Q-values used are not accurate; they are influenced by random factors, denoted ω−. So, we can write:

y_Q = r(s, a) + γ max_{a′} Q(s′, a′; θ−, ω−)

If we run the process many times and take the average of the y values, we get:

E_{ω−}[y_Q] = r(s, a) + γ E_{ω−}[max_{a′} Q(s′, a′; θ−, ω−)]


Ideally, we would like to first get a robust estimate of Q by taking the average over many runs, i.e.,

E_{ω−}[Q(s′, a′; θ−, ω−)]

and then use it to estimate the TD target:

y = r(s, a) + γ max_{a′} E_{ω−}[Q(s′, a′; θ−, ω−)]

On average, y_Q overestimates this ideal value y because

E_{ω−}[y_Q] = r(s, a) + γ E_{ω−}[max_{a′} Q(s′, a′; θ−, ω−)]
            ≥ r(s, a) + γ max_{a′} E_{ω−}[Q(s′, a′; θ−, ω−)]
            = y

Note that, for any two random variables X1 and X2, E[max(X1, X2)] ≥ max(E[X1], E[X2]).
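The inequality, and hence the overestimation, is easy to check numerically; the sketch below (an illustration, with an arbitrary noise scale) compares the two sides for noisy estimates of two actions whose true values are both 0:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# Noisy Q-estimates of two actions whose true values are both 0.
x1 = rng.normal(loc=0.0, scale=1.0, size=n)
x2 = rng.normal(loc=0.0, scale=1.0, size=n)

print(np.maximum(x1, x2).mean())   # E[max(X1, X2)] ≈ 0.56 > 0
print(max(x1.mean(), x2.mean()))   # max(E[X1], E[X2]) ≈ 0
```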


Double DQN

In Double DQN, the action in the TD target is selected with the online network θ and evaluated with the target network θ−:

a∗ = arg max_{a′} Q(s′, a′; θ, ω)

y_DoubleQ = r(s, a) + γ Q(s′, a∗; θ−, ω−)

Taking the average, we get:

E_{ω−}[y_DoubleQ] = r(s, a) + γ E_{ω−}[Q(s′, a∗; θ−, ω−)]

This is closer to

y = r(s, a) + γ max_{a′} E_{ω−}[Q(s′, a′; θ−, ω−)]

than E_{ω−}[y_Q] = r(s, a) + γ E_{ω−}[max_{a′} Q(s′, a′; θ−, ω−)] is.


An Analogy

TD Target in DQN:

y_Q = r(s, a) + γ max_{a′} Q(s′, a′; θ−)

The second term estimates the optimal future value of s′ in two steps:

Pick the next action a′, and
Evaluate Q(s′, a′).

In both steps, we use θ−.


Task: Estimate the highest exam score of a class given the current situation s′.

Method 1: Ask one teacher θ− to:

Estimate the score of each student a′: Q(s′, a′; θ−).
Pick the best student: a∗ = arg max_{a′} Q(s′, a′; θ−).
Report Q(s′, a∗; θ−).

Problem: The best student picked by teacher θ− might not be the best after all. The score for a∗ is overestimated, and hence the max score of the class is overestimated.

Method 2:

Ask one teacher θ to pick the best student: a∗ = arg max_{a′} Q(s′, a′; θ).
Ask another teacher θ− to estimate her score: Q(s′, a∗; θ−).


Double DQN

Update rule for DQN:

θ ← θ − α∇θ([r(s, a) + γ maxa′ Q(s′, a′; θ−)] − Q(s, a; θ))²

Or, equivalently:

θ ← θ − α∇θ([r(s, a) + γ Q(s′, a∗; θ−)] − Q(s, a; θ))²

where a∗ = arg maxa′ Q(s′, a′; θ−).

Update rule for Double DQN:

θ ← θ − α∇θ([r(s, a) + γ Q(s′, a∗; θ−)] − Q(s, a; θ))²   (4)

where a∗ = arg maxa′ Q(s′, a′; θ).
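To make the one-line difference concrete, here is a minimal PyTorch-style sketch (not from the lecture; the function and variable names are illustrative) of how the two bootstrap targets can be computed for a sampled batch. q_net stands for the online network with parameters θ and target_net for the target network with parameters θ−; terminal transitions, which would zero out the bootstrap term, are ignored for brevity.

```python
import torch

# Illustrative sketch; function and variable names are not from the lecture.

def dqn_target(rewards, next_states, gamma, target_net):
    # Vanilla DQN: the target network (theta-) both selects and evaluates
    # the next action, which tends to overestimate max_a' Q(s', a'; theta-).
    with torch.no_grad():
        next_q = target_net(next_states)                  # [batch, n_actions]
        return rewards + gamma * next_q.max(dim=1).values

def double_dqn_target(rewards, next_states, gamma, q_net, target_net):
    # Double DQN (Equation (4)): the online network (theta) picks a*,
    # the target network (theta-) evaluates Q(s', a*; theta-).
    with torch.no_grad():
        a_star = q_net(next_states).argmax(dim=1, keepdim=True)  # selection
        next_q = target_net(next_states).gather(1, a_star)       # evaluation
        return rewards + gamma * next_q.squeeze(1)
```

In either case, the squared error between the target and Q(s, a; θ) is minimized by a gradient step on θ alone; θ− is not touched by the gradient.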

Improvements to DQN Double DQN

DQN versus Double DQN

Improvements to DQN Prioritized Experience Replay

Outline

1 Introduction

2 Training Target

3 Experience Replay

4 DQN in Atari

5 Improvements to DQN
Double DQN
Prioritized Experience Replay

Improvements to DQN Prioritized Experience Replay

Why Prioritized Experience Replay?

In DQN and Double DQN, experience transitions are uniformly sampled from a replay memory for the purpose of Q-function regression. (Process 3)

This approach does not take the significance of the experience transitions into consideration.

Improvements to DQN Prioritized Experience Replay

Prioritized Experience Replay

The significance of an experience tuple (s, a, s′, r) is measured using the TD error:

δ = [r(s, a) + γ maxa′ Q(s′, a′; θ)] − Q(s, a; θ)

Let {(si, ai, s′i, ri)} be the collection of experience tuples in the replay memory.

For each update of θ, experience tuples are sampled using the following distribution:

P(i) = pi^α / ∑k pk^α

pi is a function of δi: pi = |δi| + ε (proportional variant) or pi = 1/rank(i) (rank-based variant).

α determines how much prioritization is used. α = 0 amounts to uniform sampling.
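Below is a minimal NumPy sketch (not from the lecture; the class and method names are illustrative) of the proportional variant of this sampling scheme. A plain Python list is used for clarity, whereas practical implementations use a sum-tree so that sampling and priority updates cost O(log N); the importance-sampling correction used by Schaul et al. (2016) is also omitted.

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Illustrative sketch of proportional prioritized replay:
    P(i) = p_i^alpha / sum_k p_k^alpha, with p_i = |delta_i| + eps."""

    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.priorities = [], []

    def add(self, transition):
        # New transitions get the current maximum priority so that they are
        # replayed at least once before their TD error is known.
        p = max(self.priorities, default=1.0)
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(p)

    def sample(self, batch_size):
        # Sample indices with probability P(i) proportional to p_i^alpha.
        probs = np.asarray(self.priorities) ** self.alpha
        probs = probs / probs.sum()
        idx = np.random.choice(len(self.data), size=batch_size, p=probs)
        return idx, [self.data[i] for i in idx]

    def update_priorities(self, idx, td_errors):
        # After the gradient step, refresh p_i = |delta_i| + eps.
        for i, delta in zip(idx, td_errors):
            self.priorities[i] = abs(float(delta)) + self.eps
```

Setting alpha = 0 makes every pi^α equal to 1, which recovers the uniform sampling used in DQN and Double DQN.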

Improvements to DQN Prioritized Experience Replay

Prioritized Experience Replay Improves Double DQN in Most Cases

Improvements to DQN Prioritized Experience Replay

Code Example

github.com/vmayoral/basic_reinforcement_learning/blob/master/tutorial6/README.md

Improvements to DQN Prioritized Experience Replay

References

Sergey Levine. Deep Reinforcement Learning, http://rail.eecs.berkeley.edu/deeprlcourse/, 2018.

David Silver. Tutorial: Deep Reinforcement Learning, http://hunch.net/~beygel/deep_rl_tutorial.pdf, 2016.

Tambet Matiisen. Demystifying Deep Reinforcement Learning. https://ai.intel.com/demystifying-deep-reinforcement-learning/, 2015.

Watkins (1989). Learning from delayed rewards: introduces Q-learning.

Riedmiller (2005). Neural fitted Q-iteration: batch-mode Q-learning with neural networks.

Mnih et al. (2013). Human-level control through deep reinforcement learning: Q-learning with convolutional networks for playing Atari.

Van Hasselt, Guez, Silver (2015). Deep reinforcement learning with double Q-learning: a very effective trick to improve performance of deep Q-learning.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2016). Prioritized experience replay. In ICLR, 2016.

Wang, Schaul, Hessel, van Hasselt, Lanctot, de Freitas (2016). Dueling network architectures for deep reinforcement learning: separates value and advantage estimation in the Q-function.
