
Deep Learning & Reinforcement Learning

Renārs Liepiņš, Lead Researcher, LUMII & LETA (renars.liepins@lumii.lv)

At “Riga AI, Machine Learning and Bots”, February 16, 2017

Outline

• Current State

• Deep Learning

• Reinforcement Learning


"Machine learning is a core transformative way by which we are rethinking everything we are doing." – Sundar Pichai (CEO, Google), 2015

Source

Why such optimism?

Artificial Intelligence

Computer systems able to perform tasks normally requiring human intelligence.

[Chart: image-classification error (%) by year, 2010–2016. Classic methods: roughly 29, 28, 27.8; deep learning: 16.4, 11.7, 6.7, 3.57, 3.08, dropping below human level.]

Nice, but so what?

First Universal Learning Algorithm

Before Deep Learning

[Figure (Andrew Ng), "Features for machine learning": Image → vision features → detection; Audio → audio features → speaker ID; Text → text features → web search, …]

Source

With Deep Learning

[Figure (Andrew Ng): the same pipeline, but with the features learned by a deep neural network rather than engineered by hand.]

Universal Learning Algorithm

[Figure (Andrew Ng): a deep neural network - layers of artificial neurons between input and output, e.g. mapping an image to the caption "A yellow bus driving down…".]

Universal Learning Algorithm – Speech Recognition

[Figure (Andrew Ng): Baidu Deep Speech, a bi-directional recurrent neural network (BDRNN) transcribing audio into text, "T h e _ q u i c k …".]

Source

Universal Learning Algorithm – Translation

"Dzeltens autobuss brauc pa ceļu…" → "A yellow bus driving down…"

Source

Universal Learning Algorithm – Self driving cars


Source


The limitations of supervised learning

Universal Learning Algorithm – Image captions

[Figure (Andrew Ng): caption data - an image paired with the caption "A yellow bus driving down…".]

Source

Chinese captions

[Figure (Andrew Ng): images captioned in Chinese; translations: "A baseball player getting ready to bat.", "A person surfing on the ocean.", "A double-decker bus driving on a street."]

Universal Learning Algorithm – X-ray reports


Source

Universal Learning Algorithm – Photo localisation


Deep Learning in Computer Vision: Image Localization

PlaNet is able to determine the location of almost any image with superhuman ability.

Source

Universal Learning Algorithm – Style Transfer


Source

Universal Learning Algorithm – Semantic Face Transforms


[Slide: excerpt from the Deep Feature Interpolation (DFI) paper. Figure 1: aging a 400x400 face, before and after the artifact removal step. Figure 2: a test image (Silvio Berlusconi) transformed towards six attributes (older, mouth open, eyes open, smiling, facial hair, spectacles) via linear interpolation in a deep feature space of pre-trained VGG features. DFI needs no convnet re-training, is not specialized to any particular task, and handles much higher-resolution images than the more involved, specialized generative approaches it is compared against.]

Source

Universal Learning Algorithm – Lipreading


Deep Learning in Computer Vision: LipNet - Sentence-level Lipreading

Source

LipNet achieves 93.4% accuracy, outperforming experienced human lipreaders and the previous 79.6% state-of-the-art accuracy.

Source

Universal Learning Algorithm – Sketch Vectorisation


Source

Universal Learning Algorithm – Handwriting Generation


Source

Deep Learning in Computer Vision: Image Generation - Handwriting

This LSTM recurrent neural network is able to generate highly realistic cursive handwriting in a wide variety of styles, simply by predicting one data point at a time.

Source

Universal Learning Algorithm – Image upscaling


Source

Google – Saving you bandwidth through machine learning

Source

First Universal Learning Algorithm

Not Magic

• Simply downloading and “applying” open-source software won’t work.

• Needs to be customised to your business context and data.

• Needs lots of examples and computing power for training

Source

Outline

• Current State

• Deep Learning

• Reinforcement Learning


Neuron

[Figure: a biological neuron - cell body, output axon, synapse.]

Source

Artificial Neuron

[Figure: the corresponding artificial neuron - a weighted sum of inputs passed through an activation function.]

Source
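Below is a minimal sketch of the artificial neuron in the figure: a weighted sum of inputs followed by a nonlinearity. The sigmoid activation and the example numbers are illustrative assumptions, not from the slides.

```python
import numpy as np

def artificial_neuron(inputs, weights, bias):
    """Weighted sum of the inputs followed by a nonlinear activation (sigmoid here)."""
    z = np.dot(weights, inputs) + bias      # analogue of signals arriving at the cell body
    return 1.0 / (1.0 + np.exp(-z))         # activation decides the output on the "axon"

# Example: three inputs with illustrative weights
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.2])
print(artificial_neuron(x, w, bias=0.05))
```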

Deep Learning: Neural network

[Figure (Andrew Ng): layers of artificial neurons, loosely inspired by neurons in the brain, connecting input to output.]

What is a neural network? (Andrew Ng)

Data (image) → x1 → x2 → x3 → x4 → x5 → Yes/No (Mug or not?)

x1 ∈ ℝ^5, x2 ∈ ℝ^5, …

x2 = (W1 × x1)+
x3 = (W2 × x2)+
…

with weight matrices W1, W2, W3, W4 and (·)+ a simple nonlinearity (e.g. ReLU).
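A sketch of the forward pass described above, with layers x1…x5, weight matrices W1…W4, and the (·)+ nonlinearity taken to be a ReLU. The layer sizes and random weights are illustrative assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)          # the (.)+ nonlinearity

rng = np.random.default_rng(0)
# Illustrative layer sizes: 5 -> 5 -> 5 -> 5 -> 3 (three output scores)
W1, W2, W3 = (rng.normal(scale=0.5, size=(5, 5)) for _ in range(3))
W4 = rng.normal(scale=0.5, size=(3, 5))

x1 = rng.normal(size=5)                # "data (image)" flattened into a vector
x2 = relu(W1 @ x1)
x3 = relu(W2 @ x2)
x4 = relu(W3 @ x3)
x5 = W4 @ x4                           # output layer, e.g. "mug or not?" scores
print(x5)
```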

Training

For a training example, the network's output (0.9, 0.3, 0.2) is compared with the true output (1.0, 0.0, 1.0) from the training data, giving an error of (0.1, 0.3, 0.8).

This error is then propagated backwards through the network (error backpropagation) and used to adjust the weights W1…W4.
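The whole training step can be sketched end to end: run the forward pass, compare the output with the true output, and backpropagate the error to adjust the weights. The two-layer network, squared-error loss, and learning rate below are illustrative simplifications, not the slides' exact setup.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(scale=0.5, size=(4, 5))   # first layer weights
W2 = rng.normal(scale=0.5, size=(3, 4))   # second layer weights

x = rng.normal(size=5)                    # one training input
y_true = np.array([1.0, 0.0, 1.0])        # true output from the training data

lr = 0.1
for step in range(100):
    # forward pass
    h = np.maximum(W1 @ x, 0.0)           # hidden layer with the (.)+ nonlinearity
    y = W2 @ h                            # network output
    error = y - y_true                    # e.g. (0.9, 0.3, 0.2) vs (1.0, 0.0, 1.0)

    # backward pass: propagate the error to get weight gradients
    dW2 = np.outer(error, h)
    dh = (W2.T @ error) * (h > 0)         # gradient flows only through active units
    dW1 = np.outer(dh, x)

    # adjust the weights a little against the gradient
    W2 -= lr * dW2
    W1 -= lr * dW1

print("final error:", np.abs(W2 @ np.maximum(W1 @ x, 0.0) - y_true))
```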


WHAT MAKES DEEP LEARNING DEEP?

[Figure: Input → many layers → Result]

Today's largest networks: ~10 layers, 1B parameters, 10M images, ~30 exaflops, ~30 GPU days. The human brain has trillions of parameters – only about 1,000 times more.

What is Deep Learning?

● Loosely based on (what little) we know about the brain

[Figure: a deep network labelling an image as "cat".]


Demo

http://playground.tensorflow.org/

https://transcranial.github.io/keras-js/

Why Now?

A brief history (1943 … 2017):

• A long time ago… 1943, 1956
• 1958: Perceptron
• 1969: Perceptron criticized
• 1974: Backpropagation, then awkward silence (AI Winter)
• 1995: SVM reigns
• 1998: Convolutional Neural Networks for handwritten recognition
• 2006: Restricted Boltzmann Machine
• 2012: Google Brain project on 16k cores
• 2012: AlexNet wins ImageNet – the start of the Deep Learning era

Why Now?

• Computational Power

• Big Data

• Algorithms

Current Situation

Outline

• Current State

• Deep Learning

• Reinforcement Learning


Learning from Experience

What is Reinforcement Learning?

[Figure: the Agent interacts with the Environment in a loop. At step 1 it observes State (S1), takes Action (A1) and receives Reward (R1); at step 2, State (S2), Action (A2), Reward (R2); and in general, at step i, State (Si), Action (Ai), Reward (Ri).]

(Gorila, the General Reinforcement Learning Architecture: 10x faster than Nature DQN on 38 out of 49 Atari games; applied to recommender systems within Google.)

Goal: Maximize Accumulated Rewards R1 + R2 + R3 + … = Σi Ri

Pong Example

[Figure: Pong as a reinforcement learning problem. States (S): the screen frames; Actions (A): the paddle moves; Rewards (R): +1, -1 or 0. The Agent plays against the Environment.]

Reinforcement Agent = Policy Function

Policy Function: π(S) -> A

Pong Example

π(Si) -> Ai, i.e. the policy maps the current screen frame Si to the next action Ai
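The interaction loop above can be sketched directly in code: at each step the agent observes the state Si, chooses an action Ai = π(Si), receives a reward Ri, and the rewards are accumulated into Σi Ri. The toy Pong-like environment and the random placeholder policy below are assumptions for illustration only.

```python
import random

def pi(state):
    """Placeholder policy π(S) -> A: picks a paddle move at random."""
    return random.choice(["up", "down"])

def environment_step(state, action):
    """Placeholder environment: returns the next state and a reward in {+1, -1, 0}."""
    next_state = state + 1                       # stand-in for the next screen frame
    reward = random.choice([0, 0, 0, +1, -1])    # mostly 0, occasionally a point is scored
    done = next_state >= 20                      # stand-in for "game over"
    return next_state, reward, done

state, total_reward, done = 0, 0, False
while not done:
    action = pi(state)                           # A_i = π(S_i)
    state, reward, done = environment_step(state, action)
    total_reward += reward                       # accumulate Σ_i R_i
print("accumulated reward:", total_reward)
```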

[The Breakout and Pong frames on these slides are from the DQN Nature paper, Extended Data Figure 2, "Visualization of learned value functions on two games, Breakout and Pong"; with permission from Atari Interactive, Inc.]

Reinforcement Learning Problem

Find a Policy Function π(S) -> A that maximizes the Accumulated Rewards Σi Ri.

How to Find π(S) -> A ?

Reinforcement Learning Algorithms

• Q-Learning

• Actor-Critic methods

• Policy Gradient

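As a concrete example of the first family, here is a minimal tabular Q-Learning sketch: learn a table Q[S][A] of expected accumulated reward and read the policy off as π(S) = argmax over A of Q[S][A]. The toy environment and the hyperparameters are illustrative assumptions, not part of the talk.

```python
import random
from collections import defaultdict

ACTIONS = ["up", "down"]
alpha, gamma, epsilon = 0.1, 0.9, 0.1     # learning rate, discount, exploration (illustrative)
Q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})

def toy_step(state, action):
    """Placeholder environment: 'up' is rewarded in even states, 'down' in odd ones."""
    reward = 1 if (state % 2 == 0) == (action == "up") else -1
    next_state = (state + 1) % 10
    return next_state, reward, next_state == 0   # episode ends when the counter wraps around

for episode in range(500):
    state, done = 0, False
    while not done:
        # epsilon-greedy: mostly follow the current estimate, sometimes explore
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(Q[state], key=Q[state].get)
        next_state, reward, done = toy_step(state, action)
        # Q-learning update towards reward + gamma * best future value
        best_next = max(Q[next_state].values())
        Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
        state = next_state

policy = {s: max(Q[s], key=Q[s].get) for s in Q}   # π(S) = argmax_A Q[S][A]
print(policy)
```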

Episode

Extended Data Figure 2 | Visualization of learned value functions on twogames, Breakout and Pong. a, A visualization of the learned value function onthe game Breakout. At time points 1 and 2, the state value is predicted to be ,17and the agent is clearing the bricks at the lowest level. Each of the peaks inthe value function curve corresponds to a reward obtained by clearing a brick.At time point 3, the agent is about to break through to the top level of bricks andthe value increases to ,21 in anticipation of breaking out and clearing alarge set of bricks. At point 4, the value is above 23 and the agent has brokenthrough. After this point, the ball will bounce at the upper part of the bricksclearing many of them by itself. b, A visualization of the learned action-valuefunction on the game Pong. At time point 1, the ball is moving towards thepaddle controlled by the agent on the right side of the screen and the values of

all actions are around 0.7, reflecting the expected value of this state based onprevious experience. At time point 2, the agent starts moving the paddletowards the ball and the value of the ‘up’ action stays high while the value of the‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would leadto the agent losing the ball and incurring a reward of 21. At time point 3,the agent hits the ball by pressing ‘up’ and the expected reward keeps increasinguntil time point 4, when the ball reaches the left edge of the screen and the valueof all actions reflects that the agent is about to receive a reward of 1. Note,the dashed line shows the past trajectory of the ball purely for illustrativepurposes (that is, not shown during the game). With permission from AtariInteractive, Inc.

LETTER RESEARCH

Macmillan Publishers Limited. All rights reserved©2015

😁R3=+1R1=0 R2=0

Extended Data Figure 2 | Visualization of learned value functions on twogames, Breakout and Pong. a, A visualization of the learned value function onthe game Breakout. At time points 1 and 2, the state value is predicted to be ,17and the agent is clearing the bricks at the lowest level. Each of the peaks inthe value function curve corresponds to a reward obtained by clearing a brick.At time point 3, the agent is about to break through to the top level of bricks andthe value increases to ,21 in anticipation of breaking out and clearing alarge set of bricks. At point 4, the value is above 23 and the agent has brokenthrough. After this point, the ball will bounce at the upper part of the bricksclearing many of them by itself. b, A visualization of the learned action-valuefunction on the game Pong. At time point 1, the ball is moving towards thepaddle controlled by the agent on the right side of the screen and the values of

all actions are around 0.7, reflecting the expected value of this state based onprevious experience. At time point 2, the agent starts moving the paddletowards the ball and the value of the ‘up’ action stays high while the value of the‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would leadto the agent losing the ball and incurring a reward of 21. At time point 3,the agent hits the ball by pressing ‘up’ and the expected reward keeps increasinguntil time point 4, when the ball reaches the left edge of the screen and the valueof all actions reflects that the agent is about to receive a reward of 1. Note,the dashed line shows the past trajectory of the ball purely for illustrativepurposes (that is, not shown during the game). With permission from AtariInteractive, Inc.

LETTER RESEARCH

Macmillan Publishers Limited. All rights reserved©2015

Extended Data Figure 2 | Visualization of learned value functions on twogames, Breakout and Pong. a, A visualization of the learned value function onthe game Breakout. At time points 1 and 2, the state value is predicted to be ,17and the agent is clearing the bricks at the lowest level. Each of the peaks inthe value function curve corresponds to a reward obtained by clearing a brick.At time point 3, the agent is about to break through to the top level of bricks andthe value increases to ,21 in anticipation of breaking out and clearing alarge set of bricks. At point 4, the value is above 23 and the agent has brokenthrough. After this point, the ball will bounce at the upper part of the bricksclearing many of them by itself. b, A visualization of the learned action-valuefunction on the game Pong. At time point 1, the ball is moving towards thepaddle controlled by the agent on the right side of the screen and the values of

all actions are around 0.7, reflecting the expected value of this state based onprevious experience. At time point 2, the agent starts moving the paddletowards the ball and the value of the ‘up’ action stays high while the value of the‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would leadto the agent losing the ball and incurring a reward of 21. At time point 3,the agent hits the ball by pressing ‘up’ and the expected reward keeps increasinguntil time point 4, when the ball reaches the left edge of the screen and the valueof all actions reflects that the agent is about to receive a reward of 1. Note,the dashed line shows the past trajectory of the ball purely for illustrativepurposes (that is, not shown during the game). With permission from AtariInteractive, Inc.

LETTER RESEARCH

Macmillan Publishers Limited. All rights reserved©2015

Extended Data Figure 2 | Visualization of learned value functions on twogames, Breakout and Pong. a, A visualization of the learned value function onthe game Breakout. At time points 1 and 2, the state value is predicted to be ,17and the agent is clearing the bricks at the lowest level. Each of the peaks inthe value function curve corresponds to a reward obtained by clearing a brick.At time point 3, the agent is about to break through to the top level of bricks andthe value increases to ,21 in anticipation of breaking out and clearing alarge set of bricks. At point 4, the value is above 23 and the agent has brokenthrough. After this point, the ball will bounce at the upper part of the bricksclearing many of them by itself. b, A visualization of the learned action-valuefunction on the game Pong. At time point 1, the ball is moving towards thepaddle controlled by the agent on the right side of the screen and the values of

all actions are around 0.7, reflecting the expected value of this state based onprevious experience. At time point 2, the agent starts moving the paddletowards the ball and the value of the ‘up’ action stays high while the value of the‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would leadto the agent losing the ball and incurring a reward of 21. At time point 3,the agent hits the ball by pressing ‘up’ and the expected reward keeps increasinguntil time point 4, when the ball reaches the left edge of the screen and the valueof all actions reflects that the agent is about to receive a reward of 1. Note,the dashed line shows the past trajectory of the ball purely for illustrativepurposes (that is, not shown during the game). With permission from AtariInteractive, Inc.

LETTER RESEARCH

Macmillan Publishers Limited. All rights reserved©2015

Extended Data Figure 2 | Visualization of learned value functions on twogames, Breakout and Pong. a, A visualization of the learned value function onthe game Breakout. At time points 1 and 2, the state value is predicted to be ,17and the agent is clearing the bricks at the lowest level. Each of the peaks inthe value function curve corresponds to a reward obtained by clearing a brick.At time point 3, the agent is about to break through to the top level of bricks andthe value increases to ,21 in anticipation of breaking out and clearing alarge set of bricks. At point 4, the value is above 23 and the agent has brokenthrough. After this point, the ball will bounce at the upper part of the bricksclearing many of them by itself. b, A visualization of the learned action-valuefunction on the game Pong. At time point 1, the ball is moving towards thepaddle controlled by the agent on the right side of the screen and the values of

all actions are around 0.7, reflecting the expected value of this state based onprevious experience. At time point 2, the agent starts moving the paddletowards the ball and the value of the ‘up’ action stays high while the value of the‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would leadto the agent losing the ball and incurring a reward of 21. At time point 3,the agent hits the ball by pressing ‘up’ and the expected reward keeps increasinguntil time point 4, when the ball reaches the left edge of the screen and the valueof all actions reflects that the agent is about to receive a reward of 1. Note,the dashed line shows the past trajectory of the ball purely for illustrativepurposes (that is, not shown during the game). With permission from AtariInteractive, Inc.

LETTER RESEARCH

Macmillan Publishers Limited. All rights reserved©2015

Extended Data Figure 2 | Visualization of learned value functions on twogames, Breakout and Pong. a, A visualization of the learned value function onthe game Breakout. At time points 1 and 2, the state value is predicted to be ,17and the agent is clearing the bricks at the lowest level. Each of the peaks inthe value function curve corresponds to a reward obtained by clearing a brick.At time point 3, the agent is about to break through to the top level of bricks andthe value increases to ,21 in anticipation of breaking out and clearing alarge set of bricks. At point 4, the value is above 23 and the agent has brokenthrough. After this point, the ball will bounce at the upper part of the bricksclearing many of them by itself. b, A visualization of the learned action-valuefunction on the game Pong. At time point 1, the ball is moving towards thepaddle controlled by the agent on the right side of the screen and the values of

all actions are around 0.7, reflecting the expected value of this state based onprevious experience. At time point 2, the agent starts moving the paddletowards the ball and the value of the ‘up’ action stays high while the value of the‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would leadto the agent losing the ball and incurring a reward of 21. At time point 3,the agent hits the ball by pressing ‘up’ and the expected reward keeps increasinguntil time point 4, when the ball reaches the left edge of the screen and the valueof all actions reflects that the agent is about to receive a reward of 1. Note,the dashed line shows the past trajectory of the ball purely for illustrativepurposes (that is, not shown during the game). With permission from AtariInteractive, Inc.

LETTER RESEARCH

Macmillan Publishers Limited. All rights reserved©2015

Extended Data Figure 2 | Visualization of learned value functions on twogames, Breakout and Pong. a, A visualization of the learned value function onthe game Breakout. At time points 1 and 2, the state value is predicted to be ,17and the agent is clearing the bricks at the lowest level. Each of the peaks inthe value function curve corresponds to a reward obtained by clearing a brick.At time point 3, the agent is about to break through to the top level of bricks andthe value increases to ,21 in anticipation of breaking out and clearing alarge set of bricks. At point 4, the value is above 23 and the agent has brokenthrough. After this point, the ball will bounce at the upper part of the bricksclearing many of them by itself. b, A visualization of the learned action-valuefunction on the game Pong. At time point 1, the ball is moving towards thepaddle controlled by the agent on the right side of the screen and the values of

all actions are around 0.7, reflecting the expected value of this state based onprevious experience. At time point 2, the agent starts moving the paddletowards the ball and the value of the ‘up’ action stays high while the value of the‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would leadto the agent losing the ball and incurring a reward of 21. At time point 3,the agent hits the ball by pressing ‘up’ and the expected reward keeps increasinguntil time point 4, when the ball reaches the left edge of the screen and the valueof all actions reflects that the agent is about to receive a reward of 1. Note,the dashed line shows the past trajectory of the ball purely for illustrativepurposes (that is, not shown during the game). With permission from AtariInteractive, Inc.

LETTER RESEARCH

Macmillan Publishers Limited. All rights reserved©2015

Game Over

👍 👍 👍

iRi∑ = +1

😭

Extended Data Figure 2 | Visualization of learned value functions on twogames, Breakout and Pong. a, A visualization of the learned value function onthe game Breakout. At time points 1 and 2, the state value is predicted to be ,17and the agent is clearing the bricks at the lowest level. Each of the peaks inthe value function curve corresponds to a reward obtained by clearing a brick.At time point 3, the agent is about to break through to the top level of bricks andthe value increases to ,21 in anticipation of breaking out and clearing alarge set of bricks. At point 4, the value is above 23 and the agent has brokenthrough. After this point, the ball will bounce at the upper part of the bricksclearing many of them by itself. b, A visualization of the learned action-valuefunction on the game Pong. At time point 1, the ball is moving towards thepaddle controlled by the agent on the right side of the screen and the values of

all actions are around 0.7, reflecting the expected value of this state based onprevious experience. At time point 2, the agent starts moving the paddletowards the ball and the value of the ‘up’ action stays high while the value of the‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would leadto the agent losing the ball and incurring a reward of 21. At time point 3,the agent hits the ball by pressing ‘up’ and the expected reward keeps increasinguntil time point 4, when the ball reaches the left edge of the screen and the valueof all actions reflects that the agent is about to receive a reward of 1. Note,the dashed line shows the past trajectory of the ball purely for illustrativepurposes (that is, not shown during the game). With permission from AtariInteractive, Inc.

LETTER RESEARCH

Macmillan Publishers Limited. All rights reserved©2015

Extended Data Figure 2 | Visualization of learned value functions on twogames, Breakout and Pong. a, A visualization of the learned value function onthe game Breakout. At time points 1 and 2, the state value is predicted to be ,17and the agent is clearing the bricks at the lowest level. Each of the peaks inthe value function curve corresponds to a reward obtained by clearing a brick.At time point 3, the agent is about to break through to the top level of bricks andthe value increases to ,21 in anticipation of breaking out and clearing alarge set of bricks. At point 4, the value is above 23 and the agent has brokenthrough. After this point, the ball will bounce at the upper part of the bricksclearing many of them by itself. b, A visualization of the learned action-valuefunction on the game Pong. At time point 1, the ball is moving towards thepaddle controlled by the agent on the right side of the screen and the values of

all actions are around 0.7, reflecting the expected value of this state based onprevious experience. At time point 2, the agent starts moving the paddletowards the ball and the value of the ‘up’ action stays high while the value of the‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would leadto the agent losing the ball and incurring a reward of 21. At time point 3,the agent hits the ball by pressing ‘up’ and the expected reward keeps increasinguntil time point 4, when the ball reaches the left edge of the screen and the valueof all actions reflects that the agent is about to receive a reward of 1. Note,the dashed line shows the past trajectory of the ball purely for illustrativepurposes (that is, not shown during the game). With permission from AtariInteractive, Inc.

LETTER RESEARCH


Episode

R1=0   R2=0   R3=−1   Game Over

👎 👎 👎

∑i Ri = −1
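As a concrete (made-up) illustration of this sum: the episode return is simply the per-step rewards added up, so a lost Pong point gives −1.

    # Illustrative only: the return of an episode is the sum of its rewards.
    rewards = [0, 0, -1]           # R1, R2, R3 of the losing episode above
    episode_return = sum(rewards)  # -1: this episode should be discouraged
    print(episode_return)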

How to Find π(S) -> A ?

1. Change π to Stochastic: π(S) -> P(A)

Pong Example

π( [Pong frame] ) -> ?

[Slide background: game frames from Extended Data Figure 2, "Visualization of learned value functions on two games, Breakout and Pong", of the DQN Nature letter (Mnih et al., 2015). © 2015 Macmillan Publishers Limited; with permission from Atari Interactive, Inc.]

π( [Pong frame] ) -> Action Probability   (bar chart, scale 0 to 1)
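A minimal sketch (my own, not the speaker's code) of what a stochastic policy looks like in Python: the policy returns a probability for each action and the agent samples instead of always taking the most likely move. The probabilities below are placeholder values, not the output of a real network.

    import random

    def stochastic_policy(frame):
        # Placeholder: a real policy would compute these probabilities from the frame.
        return {"up": 0.7, "down": 0.3}

    def act(frame):
        probs = stochastic_policy(frame)
        actions, weights = zip(*probs.items())
        return random.choices(actions, weights=weights)[0]  # sample, do not argmax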


How to Find π(S) -> A ?

1. Change π to Stochastic: π(S) -> P(A)
2. Approximate π with NeuralNet: π(S, θ) -> P(A)

π( Si , θ ) -> Action Probability   (bar chart, scale 0 to 1)
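A rough sketch of π(Si, θ) as a small neural network (the layer sizes and two-action output are assumptions for Pong, not details given in the talk): θ is the set of weight matrices, the input is a flattened frame, and the output is a probability distribution over the actions.

    import numpy as np

    rng = np.random.default_rng(0)

    # theta: the parameters of the policy network (sizes are illustrative).
    theta = {
        "W1": rng.normal(scale=0.01, size=(80 * 80, 200)),
        "W2": rng.normal(scale=0.01, size=(200, 2)),       # two actions: up, down
    }

    def policy(frame, theta):
        # pi(S, theta) -> P(A): map a flattened frame to action probabilities.
        h = np.maximum(0, frame @ theta["W1"])    # hidden layer, ReLU
        logits = h @ theta["W2"]
        exp = np.exp(logits - logits.max())       # softmax over the actions
        return exp / exp.sum()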


How to Find π(Si, θ) -> P(A) ?

How to Find θ

Loss Function…


R1=0   R2=0   R3=+1   Game Over


∑i Ri = +1

😁

👍 👍 👍
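One way to read the thumbs-up slide in code (a sketch of my own, with hypothetical names): every (state, action) pair in an episode inherits that episode's total return, so all actions of a won point get a positive weight and all actions of a lost point a negative one.

    def label_episode(trajectory, rewards):
        # trajectory: list of (state, action) pairs taken during the episode.
        # rewards:    per-step rewards, e.g. [0, 0, +1].
        episode_return = sum(rewards)   # +1 for a won point, -1 for a lost one
        return [(s, a, episode_return) for (s, a) in trajectory]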


π(Si , θ | Ai)

θk



π(Si , θ | Ai)   👍

👎

Δ π(Si , θ | Ai) / Δ θk
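A minimal sketch of the parameter update this gradient suggests. Hedge: the slide writes the gradient of π itself; most REINFORCE-style implementations use the gradient of log π(Ai | Si, θ) instead, which is what an autograd framework would hand back, and that is the form sketched here. All names are illustrative.

    def reinforce_update(theta, grad_log_pi, episode_return, learning_rate=1e-3):
        # grad_log_pi: dict with the same keys as theta, holding
        #              d log pi(Ai | Si, theta) / d theta, summed over the episode.
        # episode_return: sum of rewards of the episode (+1 or -1 in Pong).
        for k in theta:
            # Good episodes (+1) push the taken actions' probabilities up,
            # bad episodes (-1) push them down.
            theta[k] = theta[k] + learning_rate * episode_return * grad_log_pi[k]
        return theta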


[Slide figure: three bar charts with values from 0.0 to 1.0, showing the agent's action probabilities at successive time points of an episode, annotated with the rewards received: R1 = 0, R2 = 0, and R3 = +1 at Game Over.]


∑ᵢ₌₁ⁿ Ri = +1 — the episode's total reward is positive 😁, so the probability of every action taken along the way is increased 👍 👍 👍
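Put together, this is essentially the REINFORCE update: play an episode, sum the rewards, and scale every per-action gradient step by that sum, so a positive total makes all the taken actions more likely and a negative total makes them less likely. The sketch below assumes the same kind of linear softmax policy as in the earlier sketch; all names are illustrative, not from the talk.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_episode_update(theta, episode, lr=0.01):
    # theta   : (n_actions, n_features) parameters of a linear softmax policy
    # episode : list of (state, action, reward) tuples, e.g. rewards 0, 0, +1
    total_reward = sum(r for _, _, r in episode)          # ∑ Ri
    for state, action, _ in episode:
        probs = softmax(theta @ state)
        one_hot = np.zeros_like(probs)
        one_hot[action] = 1.0
        grad_log_pi = np.outer(one_hot - probs, state)
        # positive total reward → increase π(Ai | Si); negative → decrease it
        theta = theta + lr * total_reward * grad_log_pi
    return theta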


Reinforcement Learning

Outline

• Current State

• Deep Learning

• Reinforcement Learning

• Conclusions

Conclusions

1.

2.

3.
