
Machine Learning
Lecture 16: Value-Based Deep Reinforcement Learning

Nevin L. [email protected]

Department of Computer Science and Engineering, The Hong Kong University of Science and Technology

This set of notes is based on the references listed at the end and internet resources.


Outline

1 Introduction

2 Training Target

3 Experience Replay

4 DQN in Atari

5 Improvements to DQN
   Double DQN
   Prioritized Experience Replay


Introduction

Motivation

Q-learning:

Difficult to represent Q(s, a) as a lookup table when the number of states is large.

Atari games: A state is an image with 210 × 160 pixels, and each pixel has 128 possible color values. The total number of states is 128^(210×160).

Cannot determine actions for states that have never been visited (and there are many such states).


Deep Q-Networks

In DQN, we aim to approximate the optimal action-value function Q∗(s, a) using a neural network with parameters θ:

Q(s, a; θ) ≈ Q∗(s, a)

The function is defined over all possible states (images) without enumerating them.

Hence there is information for picking actions at any state, even one never visited before.
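To make this concrete, here is a minimal sketch of such a network in PyTorch (the state dimension, number of actions, and layer sizes are illustrative assumptions, not part of the lecture): a single network maps a state to one Q-value per action, so it can score actions even for states never seen during training.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state s to a vector of Q-values, one entry per action: Q(s, ·; θ)."""

    def __init__(self, state_dim: int = 4, num_actions: int = 2, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),  # one output per action
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

# Greedy action selection works for any state, even one never visited before.
q_net = QNetwork()
state = torch.randn(1, 4)            # a hypothetical 4-dimensional state
action = q_net(state).argmax(dim=1)  # arg max_a Q(s, a; θ)
```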


Key question: How to learn θ from experiences with the environment?

s0, a0, r0, s1, a1, r1, s2, a2, r2, . . .


Training Target

Learning DQN

Q-learning: Q(s, a) ← Q(s, a) + α[r(s, a) + γ max_{a′} Q(s′, a′) − Q(s, a)]

We will learn θ iteratively, just as Q(s, a) is learned iteratively in Q-learning.

As the first step toward a DQN learning algorithm, we can change the Q-value update step into a θ update step:

θ ← update(θ; s, a, s′, r)

To do so, we need to come up with an objective function and update θ using its gradient.


Learning from experience tuple e = (s, a, s′, r)

We want to change θ so that Q(s, a; θ) gets closer to Q∗(s, a).

The problem is that we do not know the true target Q∗(s, a).

So, we use the TD target:

y = r(s, a) + γ max_{a′} Q(s′, a′; θ)

Let Q′ be the result of running one step of value iteration on Q(s, a; θ). It is closer to Q∗ than Q.

As explained in the previous lecture, y is an unbiased estimate of Q′ based on the experience tuple.


Our goal is now to push Q(s, a; θ) toward the TD target. So, we use the following loss function:

L(θ) = (y − Q(s, a; θ))² = ([r(s, a) + γ max_{a′} Q(s′, a′; θ)] − Q(s, a; θ))²   (1)

The update rule for θ is:

θ ← θ − α ∇_θ ([r(s, a) + γ max_{a′} Q(s′, a′; θ)] − Q(s, a; θ))²   (2)

where α is the learning rate.
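As an illustration only (not code from the lecture), the following PyTorch sketch applies loss (1) and update (2) to a single batched experience tuple; the network, learning rate, and discount factor are arbitrary choices, and detach() treats the TD target as a constant when differentiating, which is the usual semi-gradient reading of (2).

```python
import torch
import torch.nn as nn

gamma, alpha = 0.99, 1e-3
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # Q(·, ·; θ), one output per action
optimizer = torch.optim.SGD(q_net.parameters(), lr=alpha)

def td_update(s, a, s_next, r):
    """One application of update rule (2) to the experience tuple (s, a, s', r)."""
    # TD target y = r + γ max_a' Q(s', a'; θ); treated as a constant in the gradient.
    y = r + gamma * q_net(s_next).max(dim=1).values.detach()
    q_sa = q_net(s).gather(1, a.view(-1, 1)).squeeze(1)  # Q(s, a; θ)
    loss = (y - q_sa).pow(2).mean()                      # loss (1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                     # θ ← θ − α ∇_θ loss
```

The next slides address the fact that the target in this update still moves with θ.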


Moving Target

The TD target y = r(s, a) + γ max_{a′} Q(s′, a′; θ) depends on the parameters θ.

Every time the parameters are updated, the target is also changed.

So, learning θ with update rule (2) is like chasing a moving target.

This makes learning unstable.


Target Network

To deal with the moving target issue, create a target network that has the same structure as the DQN. Let θ− be its parameters.

Use the target network to compute the TD target y.

Set θ− = θ once in a while (every C steps).

So, we have this new update rule:

θ ← θ − α ∇_θ ([r(s, a) + γ max_{a′} Q(s′, a′; θ−)] − Q(s, a; θ))²   (3)

This way, we are solving a sequence of well-defined optimization problems.
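A minimal sketch of how the target network can be maintained in code (PyTorch assumed; the copy interval C is an arbitrary choice): θ− is a frozen copy of θ that is refreshed every C steps.

```python
import copy
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # Q(·, ·; θ)
target_net = copy.deepcopy(q_net)                                     # Q(·, ·; θ−), same structure
for p in target_net.parameters():
    p.requires_grad_(False)  # θ− is never trained directly

C = 1000  # copy interval (an arbitrary choice)

def maybe_sync(step: int) -> None:
    """Set θ− = θ once every C steps."""
    if step % C == 0:
        target_net.load_state_dict(q_net.state_dict())
```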


Deep Q-Learning with Target Network

Repeat:

Take action a in current state s, observe r and s′.

Update the parameters:

θ ← θ − α ∇_θ ([r(s, a) + γ max_{a′} Q(s′, a′; θ−)] − Q(s, a; θ))²

s ← s′

θ− ← θ every C steps.


Experience Replay

In Deep Q-Learning with a target network, only the latest experience tuple is used for each parameter update.

The idea of experience replay is:

Store experience tuples in a buffer.
Use a random minibatch of experience tuples for each parameter update.


Deep Q-Learning with Target Network and Experience Replay

Repeat:

Take action a in current state s, observe r and s′; add experience tuple (s, a, s′, r) to a buffer D; s ← s′.

Sample a minibatch B = {(s_j, a_j, s′_j, r_j)} from D.

Update the parameters:

θ ← θ − α ∇_θ ∑_j ([r(s_j, a_j) + γ max_{a′} Q(s′_j, a′; θ−)] − Q(s_j, a_j; θ))²

θ− ← θ every C steps.
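Putting the pieces together, here is a compact sketch of this loop (PyTorch assumed; the buffer capacity, batch size, copy interval, and the convention that states are stored as tensors are all illustrative assumptions):

```python
import random
from collections import deque
import torch
import torch.nn as nn

gamma, batch_size, C = 0.99, 32, 1000
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))       # Q(·, ·; θ)
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # Q(·, ·; θ−)
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

buffer = deque(maxlen=100_000)  # D; old experiences are discarded when the buffer is full

def store(s, a, s_next, r):
    """Add the experience tuple (s, a, s', r) to D."""
    buffer.append((s, a, s_next, r))

def replay_update(step: int):
    """Sample a minibatch B from D, apply the summed squared-error update, and sync θ− every C steps."""
    if len(buffer) < batch_size:
        return
    batch = random.sample(buffer, batch_size)              # uniform sampling from D
    s = torch.stack([e[0] for e in batch])
    a = torch.tensor([e[1] for e in batch])
    s_next = torch.stack([e[2] for e in batch])
    r = torch.tensor([e[3] for e in batch], dtype=torch.float32)
    with torch.no_grad():                                   # TD targets are computed with θ−
        y = r + gamma * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.view(-1, 1)).squeeze(1)     # Q(s_j, a_j; θ)
    loss = (y - q_sa).pow(2).sum()                          # sum over the minibatch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % C == 0:
        target_net.load_state_dict(q_net.state_dict())      # θ− ← θ every C steps
```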


Advantages of Experience Replay

Without experience replay, the experience tuples used in consecutive parameter updates are strongly correlated. Experience replay breaks these correlations and hence reduces the variance of the updates.

Experience replay avoids oscillations or divergence in the parameters.

With experience replay, each experience tuple is potentially used multiple times. Hence better data efficiency.

With experience replay, each update improves Q(s, a; θ) using signals from multiple points (experience tuples). Hence faster convergence.


A More General View of DQN

DQN can be viewed as three interacting processes: (1) collecting experience tuples and adding them to the buffer, (2) updating the target network, and (3) regressing the Q-function on minibatches sampled from the buffer. Processes 1 and 3 run at the same pace, while Process 2 is slower (the target network is only refreshed every C steps).

Old experiences are discarded when the buffer is full.


DQN in Atari

Atari Games

Space Invaders


DQN in Atari

End-to-end learning of values Q(s, a) from pixels s

Input state s is stack of raw pixels from last 4 frames

Output is Q(s, a) for 18 joystick/button positions

Reward is change in score for that step

Network architecture and hyperparameters fixed across all games
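For reference, a sketch of a convolutional Q-network for this setting (the layer sizes follow the widely used DQN architecture for Atari, and the 84 × 84 grayscale preprocessing of the 4-frame stack is an assumption taken from that line of work rather than from this slide):

```python
import torch
import torch.nn as nn

class AtariQNet(nn.Module):
    """Maps a stack of 4 preprocessed frames (4 × 84 × 84) to Q-values for 18 joystick/button positions."""

    def __init__(self, num_actions: int = 18):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),  # one Q-value per action
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(frames / 255.0))  # scale raw pixel values to [0, 1]
```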


Comparable with or superior to humans at the majority of games.


Results on Breakout

https://www.youtube.com/watch?v=V1eYniJ0Rnk


Improvements to DQN


Double DQN

Overestimation of the Value Function

In (3), we estimate the TD target as follows:

y_Q = r(s, a) + γ max_{a′} Q(s′, a′; θ−)

The subscript Q indicates that y_Q is the estimate used in Q-learning (DQN).

The Q-values used are not accurate; they are influenced by random factors, denoted ω−. So, we can write:

y_Q = r(s, a) + γ max_{a′} Q(s′, a′; θ−, ω−)

If we run the process many times and take the average of the y values, we get:

E_{ω−}[y_Q] = r(s, a) + γ E_{ω−}[max_{a′} Q(s′, a′; θ−, ω−)]


Ideally, we would like to first get a robust estimate of Q by taking the average over many runs, i.e.,

E_{ω−}[Q(s′, a′; θ−, ω−)]

and then use it to estimate the TD target:

y = r(s, a) + γ max_{a′} E_{ω−}[Q(s′, a′; θ−, ω−)]

On average, y_Q overestimates this ideal value y because

E_{ω−}[y_Q] = r(s, a) + γ E_{ω−}[max_{a′} Q(s′, a′; θ−, ω−)]
            ≥ r(s, a) + γ max_{a′} E_{ω−}[Q(s′, a′; θ−, ω−)]
            = y

Note that, for any two random variables X1 and X2, E[max(X1, X2)] ≥ max(E[X1], E[X2]).
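The inequality, and hence the overestimation, is easy to check numerically; the sketch below (an illustration, with an arbitrary noise scale) compares the two sides for noisy estimates of two actions whose true values are both 0:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# Noisy Q-estimates of two actions whose true values are both 0.
x1 = rng.normal(loc=0.0, scale=1.0, size=n)
x2 = rng.normal(loc=0.0, scale=1.0, size=n)

print(np.maximum(x1, x2).mean())   # E[max(X1, X2)] ≈ 0.56 > 0
print(max(x1.mean(), x2.mean()))   # max(E[X1], E[X2]) ≈ 0
```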


Double DQN

In Double DQN, the action in the TD target is selected with the online network θ and evaluated with the target network θ−:

a∗ = arg max_{a′} Q(s′, a′; θ, ω)

y_DoubleQ = r(s, a) + γ Q(s′, a∗; θ−, ω−)

Taking the average, we get:

E_{ω−}[y_DoubleQ] = r(s, a) + γ E_{ω−}[Q(s′, a∗; θ−, ω−)]

This is closer to

y = r(s, a) + γ max_{a′} E_{ω−}[Q(s′, a′; θ−, ω−)]

than E_{ω−}[y_Q] = r(s, a) + γ E_{ω−}[max_{a′} Q(s′, a′; θ−, ω−)] is.


An Analogy

TD Target in DQN:

y_Q = r(s, a) + γ max_{a′} Q(s′, a′; θ−)

The second term estimates the optimal future value of s′ in two steps:

Pick the next action a′, and
Evaluate Q(s′, a′).

In both steps, we use θ−.


Task: Estimate the highest exam score of a class given the current situation s′.

Method 1: Ask one teacher θ− to:

Estimate the score of each student a′: Q(s′, a′; θ−).
Pick the best student: a∗ = arg max_{a′} Q(s′, a′; θ−).
Report Q(s′, a∗; θ−).

Problem: The best student picked by teacher θ− might not be the best after all. The score for a∗ is overestimated, and hence the max score of the class is overestimated.

Method 2:

Ask one teacher θ to pick the best student: a∗ = arg max_{a′} Q(s′, a′; θ).
Ask another teacher θ− to estimate her score: Q(s′, a∗; θ−).


Double DQN

Update rule for DQN:

θ ← θ − α∇θ([r(s, a) + γ maxa′ Q(s′, a′; θ−)] − Q(s, a; θ))²

Or, equivalently:

θ ← θ − α∇θ([r(s, a) + γ Q(s′, a∗; θ−)] − Q(s, a; θ))²

where a∗ = arg maxa′ Q(s′, a′; θ−).

Update rule for Double DQN:

θ ← θ − α∇θ([r(s, a) + γ Q(s′, a∗; θ−)] − Q(s, a; θ))²   (4)

where a∗ = arg maxa′ Q(s′, a′; θ).
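To make the one-line difference concrete, here is a minimal PyTorch-style sketch (not from the lecture; the function and variable names are illustrative) of how the two bootstrap targets can be computed for a sampled batch. q_net stands for the online network with parameters θ and target_net for the target network with parameters θ−; terminal transitions, which would zero out the bootstrap term, are ignored for brevity.

```python
import torch

# Illustrative sketch; function and variable names are not from the lecture.

def dqn_target(rewards, next_states, gamma, target_net):
    # Vanilla DQN: the target network (theta-) both selects and evaluates
    # the next action, which tends to overestimate max_a' Q(s', a'; theta-).
    with torch.no_grad():
        next_q = target_net(next_states)                  # [batch, n_actions]
        return rewards + gamma * next_q.max(dim=1).values

def double_dqn_target(rewards, next_states, gamma, q_net, target_net):
    # Double DQN (Equation (4)): the online network (theta) picks a*,
    # the target network (theta-) evaluates Q(s', a*; theta-).
    with torch.no_grad():
        a_star = q_net(next_states).argmax(dim=1, keepdim=True)  # selection
        next_q = target_net(next_states).gather(1, a_star)       # evaluation
        return rewards + gamma * next_q.squeeze(1)
```

In either case, the squared error between the target and Q(s, a; θ) is minimized by a gradient step on θ alone; θ− is not touched by the gradient.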

Improvements to DQN Double DQN

DQN versus Double DQN

Improvements to DQN Prioritized Experience Replay

Outline

1 Introduction

2 Training Target

3 Experience Replay

4 DQN in Atari

5 Improvements to DQN
Double DQN
Prioritized Experience Replay

Improvements to DQN Prioritized Experience Replay

Why Prioritized Experience Replay?

In DQN and Double DQN, experience transitions are uniformly sampled from a replay memory for the purpose of Q-function regression. (Process 3)

This approach does not take the significance of the experience transitions into consideration.

Improvements to DQN Prioritized Experience Replay

Prioritized Experience Replay

The significance of an experience tuple (s, a, s′, r) is measured using the TD error:

δ = [r(s, a) + γ maxa′ Q(s′, a′; θ)] − Q(s, a; θ)

Let {(si, ai, s′i, ri)} be the collection of experience tuples in the replay memory.

For each update of θ, experience tuples are sampled using the following distribution:

P(i) = pi^α / ∑k pk^α

pi is a function of δi: pi = |δi| + ε (proportional variant) or pi = 1/rank(i) (rank-based variant).

α determines how much prioritization is used. α = 0 amounts to uniform sampling.
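Below is a minimal NumPy sketch (not from the lecture; the class and method names are illustrative) of the proportional variant of this sampling scheme. A plain Python list is used for clarity, whereas practical implementations use a sum-tree so that sampling and priority updates cost O(log N); the importance-sampling correction used by Schaul et al. (2016) is also omitted.

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Illustrative sketch of proportional prioritized replay:
    P(i) = p_i^alpha / sum_k p_k^alpha, with p_i = |delta_i| + eps."""

    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.priorities = [], []

    def add(self, transition):
        # New transitions get the current maximum priority so that they are
        # replayed at least once before their TD error is known.
        p = max(self.priorities, default=1.0)
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(p)

    def sample(self, batch_size):
        # Sample indices with probability P(i) proportional to p_i^alpha.
        probs = np.asarray(self.priorities) ** self.alpha
        probs = probs / probs.sum()
        idx = np.random.choice(len(self.data), size=batch_size, p=probs)
        return idx, [self.data[i] for i in idx]

    def update_priorities(self, idx, td_errors):
        # After the gradient step, refresh p_i = |delta_i| + eps.
        for i, delta in zip(idx, td_errors):
            self.priorities[i] = abs(float(delta)) + self.eps
```

Setting alpha = 0 makes every pi^α equal to 1, which recovers the uniform sampling used in DQN and Double DQN.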

Improvements to DQN Prioritized Experience Replay

Prioritized Experience Replay Improves Double DQN in Most Cases

Improvements to DQN Prioritized Experience Replay

Code Example

github.com/vmayoral/basic_reinforcement_learning/blob/master/tutorial6/README.md

Improvements to DQN Prioritized Experience Replay

References

Sergey Levine. Deep Reinforcement Learning, http://rail.eecs.berkeley.edu/deeprlcourse/, 2018.

David Silver. Tutorial: Deep Reinforcement Learning, http://hunch.net/~beygel/deep_rl_tutorial.pdf, 2016.

Tambet Matiisen. Demystifying Deep Reinforcement Learning. https://ai.intel.com/demystifying-deep-reinforcement-learning/, 2015.

Watkins (1989). Learning from delayed rewards: introduces Q-learning.

Riedmiller (2005). Neural fitted Q-iteration: batch-mode Q-learning with neural networks.

Mnih et al. (2013). Human-level control through deep reinforcement learning: Q-learning with convolutional networks for playing Atari.

Van Hasselt, Guez, Silver (2015). Deep reinforcement learning with double Q-learning: a very effective trick to improve performance of deep Q-learning.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2016). Prioritized experience replay. In ICLR, 2016.

Wang, Schaul, Hessel, van Hasselt, Lanctot, de Freitas (2016). Dueling network architectures for deep reinforcement learning: separates value and advantage estimation in the Q-function.
