
Deep Learning & Reinforcement Learning

Renārs Liepiņš, Lead Researcher, LUMII & LETA (renars.liepins@lumii.lv)

At “Riga AI, Machine Learning and Bots”, February 16, 2017

Outline

• Current State

• Deep Learning

• Reinforcement Learning


"Machine learning is a core transformative way by which we are rethinking everything we are doing." – Sundar Pichai (CEO, Google), 2015

Source

Why such optimism?

Artificial Intelligence

Computer systems able to perform tasks normally requiring human intelligence.

[Chart: image-classification error (%) by year, 2010–2016. Classic methods: roughly 29, 28, 27.8; deep learning: 16.4, 11.7, 6.7, 3.57, 3.08, dropping below human level.]

Nice, but so what?

First Universal Learning Algorithm

Before Deep Learning

[Figure (Andrew Ng), "Features for machine learning": Image → vision features → detection; Audio → audio features → speaker ID; Text → text features → web search, …]

Source

With Deep Learning

[Figure (Andrew Ng): the same pipeline, but with the features learned by a deep neural network rather than engineered by hand.]

Universal Learning Algorithm

[Figure (Andrew Ng): a deep neural network - layers of artificial neurons between input and output, e.g. mapping an image to the caption "A yellow bus driving down…".]

Universal Learning Algorithm – Speech Recognition

[Figure (Andrew Ng): Baidu Deep Speech, a bi-directional recurrent neural network (BDRNN) transcribing audio into text, "T h e _ q u i c k …".]

Source

Universal Learning Algorithm – Translation

"Dzeltens autobuss brauc pa ceļu…" → "A yellow bus driving down…"

Source

Universal Learning Algorithm – Self driving cars


Source


The limitations of supervised learning

Universal Learning Algorithm – Image captions

[Figure (Andrew Ng): caption data - an image paired with the caption "A yellow bus driving down…".]

Source

Chinese captions

[Figure (Andrew Ng): images captioned in Chinese; translations: "A baseball player getting ready to bat.", "A person surfing on the ocean.", "A double-decker bus driving on a street."]

Universal Learning Algorithm – X-ray reports


Source

Universal Learning Algorithm – Photo localisation


Deep Learning in Computer Vision: Image Localization

PlaNet is able to determine the location of almost any image with superhuman ability.

Source

Universal Learning Algorithm – Style Transfer


Source

Universal Learning Algorithm – Semantic Face Transforms


[Slide: excerpt from the Deep Feature Interpolation (DFI) paper. Figure 1: aging a 400x400 face, before and after the artifact removal step. Figure 2: a test image (Silvio Berlusconi) transformed towards six attributes (older, mouth open, eyes open, smiling, facial hair, spectacles) via linear interpolation in a deep feature space of pre-trained VGG features. DFI needs no convnet re-training, is not specialized to any particular task, and handles much higher-resolution images than the more involved, specialized generative approaches it is compared against.]

Source

Universal Learning Algorithm – Lipreading


Deep Learning in Computer Vision: LipNet - Sentence-level Lipreading

Source

LipNet achieves 93.4% accuracy, outperforming experienced human lipreaders and the previous 79.6% state-of-the-art accuracy.

Source

Universal Learning Algorithm – Sketch Vectorisation


Source

Universal Learning Algorithm – Handwriting Generation


Source

Deep Learning in Computer Vision: Image Generation - Handwriting

This LSTM recurrent neural network is able to generate highly realistic cursive handwriting in a wide variety of styles, simply by predicting one data point at a time.

Source

Universal Learning Algorithm – Image upscaling


Source

Google – Saving you bandwidth through machine learning

Source

First Universal Learning Algorithm

Not Magic

• Simply downloading and “applying” open-source software won’t work.

• Needs to be customised to your business context and data.

• Needs lots of examples and computing power for training

Source

Outline

• Current State

• Deep Learning

• Reinforcement Learning


Neuron

[Figure: a biological neuron - cell body, output axon, synapse.]

Source

Artificial Neuron

[Figure: the corresponding artificial neuron - a weighted sum of inputs passed through an activation function.]

Source
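Below is a minimal sketch of the artificial neuron in the figure: a weighted sum of inputs followed by a nonlinearity. The sigmoid activation and the example numbers are illustrative assumptions, not from the slides.

```python
import numpy as np

def artificial_neuron(inputs, weights, bias):
    """Weighted sum of the inputs followed by a nonlinear activation (sigmoid here)."""
    z = np.dot(weights, inputs) + bias      # analogue of signals arriving at the cell body
    return 1.0 / (1.0 + np.exp(-z))         # activation decides the output on the "axon"

# Example: three inputs with illustrative weights
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.2])
print(artificial_neuron(x, w, bias=0.05))
```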

Deep Learning: Neural network

[Figure (Andrew Ng): layers of artificial neurons, loosely inspired by neurons in the brain, connecting input to output.]

What is a neural network? (Andrew Ng)

Data (image) → x1 → x2 → x3 → x4 → x5 → Yes/No (Mug or not?)

x1 ∈ ℝ^5, x2 ∈ ℝ^5, …

x2 = (W1 × x1)+
x3 = (W2 × x2)+
…

with weight matrices W1, W2, W3, W4 and (·)+ a simple nonlinearity (e.g. ReLU).
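A sketch of the forward pass described above, with layers x1…x5, weight matrices W1…W4, and the (·)+ nonlinearity taken to be a ReLU. The layer sizes and random weights are illustrative assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)          # the (.)+ nonlinearity

rng = np.random.default_rng(0)
# Illustrative layer sizes: 5 -> 5 -> 5 -> 5 -> 3 (three output scores)
W1, W2, W3 = (rng.normal(scale=0.5, size=(5, 5)) for _ in range(3))
W4 = rng.normal(scale=0.5, size=(3, 5))

x1 = rng.normal(size=5)                # "data (image)" flattened into a vector
x2 = relu(W1 @ x1)
x3 = relu(W2 @ x2)
x4 = relu(W3 @ x3)
x5 = W4 @ x4                           # output layer, e.g. "mug or not?" scores
print(x5)
```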

Training

For a training example, the network's output (0.9, 0.3, 0.2) is compared with the true output (1.0, 0.0, 1.0) from the training data, giving an error of (0.1, 0.3, 0.8).

This error is then propagated backwards through the network (error backpropagation) and used to adjust the weights W1…W4.
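The whole training step can be sketched end to end: run the forward pass, compare the output with the true output, and backpropagate the error to adjust the weights. The two-layer network, squared-error loss, and learning rate below are illustrative simplifications, not the slides' exact setup.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(scale=0.5, size=(4, 5))   # first layer weights
W2 = rng.normal(scale=0.5, size=(3, 4))   # second layer weights

x = rng.normal(size=5)                    # one training input
y_true = np.array([1.0, 0.0, 1.0])        # true output from the training data

lr = 0.1
for step in range(100):
    # forward pass
    h = np.maximum(W1 @ x, 0.0)           # hidden layer with the (.)+ nonlinearity
    y = W2 @ h                            # network output
    error = y - y_true                    # e.g. (0.9, 0.3, 0.2) vs (1.0, 0.0, 1.0)

    # backward pass: propagate the error to get weight gradients
    dW2 = np.outer(error, h)
    dh = (W2.T @ error) * (h > 0)         # gradient flows only through active units
    dW1 = np.outer(dh, x)

    # adjust the weights a little against the gradient
    W2 -= lr * dW2
    W1 -= lr * dW1

print("final error:", np.abs(W2 @ np.maximum(W1 @ x, 0.0) - y_true))
```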


WHAT MAKES DEEP LEARNING DEEP?

[Figure: Input → many layers → Result]

Today's largest networks: ~10 layers, 1B parameters, 10M images, ~30 exaflops, ~30 GPU days. The human brain has trillions of parameters – only about 1,000 times more.

What is Deep Learning?

● Loosely based on (what little) we know about the brain

[Figure: a deep network labelling an image as "cat".]


Demo

http://playground.tensorflow.org/

https://transcranial.github.io/keras-js/

Why Now?

A brief history (1943 … 2017):

• A long time ago… 1943, 1956
• 1958: Perceptron
• 1969: Perceptron criticized
• 1974: Backpropagation, then awkward silence (AI Winter)
• 1995: SVM reigns
• 1998: Convolutional Neural Networks for handwritten recognition
• 2006: Restricted Boltzmann Machine
• 2012: Google Brain project on 16k cores
• 2012: AlexNet wins ImageNet – the start of the Deep Learning era

Why Now?

• Computational Power

• Big Data

• Algorithms

Current Situation

Outline

• Current State

• Deep Learning

• Reinforcement Learning


Learning from Experience

What is Reinforcement Learning?

[Figure: the Agent interacts with the Environment in a loop. At step 1 it observes State (S1), takes Action (A1) and receives Reward (R1); at step 2, State (S2), Action (A2), Reward (R2); and in general, at step i, State (Si), Action (Ai), Reward (Ri).]

(Gorila, the General Reinforcement Learning Architecture: 10x faster than Nature DQN on 38 out of 49 Atari games; applied to recommender systems within Google.)

Goal: Maximize Accumulated Rewards R1 + R2 + R3 + … = Σi Ri

Pong Example

[Figure: Pong as a reinforcement learning problem. States (S): the screen frames; Actions (A): the paddle moves; Rewards (R): +1, -1 or 0. The Agent plays against the Environment.]

Reinforcement Agent = Policy Function

Policy Function: π(S) -> A

Pong Example

π(Si) -> Ai, i.e. the policy maps the current screen frame Si to the next action Ai
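The interaction loop above can be sketched directly in code: at each step the agent observes the state Si, chooses an action Ai = π(Si), receives a reward Ri, and the rewards are accumulated into Σi Ri. The toy Pong-like environment and the random placeholder policy below are assumptions for illustration only.

```python
import random

def pi(state):
    """Placeholder policy π(S) -> A: picks a paddle move at random."""
    return random.choice(["up", "down"])

def environment_step(state, action):
    """Placeholder environment: returns the next state and a reward in {+1, -1, 0}."""
    next_state = state + 1                       # stand-in for the next screen frame
    reward = random.choice([0, 0, 0, +1, -1])    # mostly 0, occasionally a point is scored
    done = next_state >= 20                      # stand-in for "game over"
    return next_state, reward, done

state, total_reward, done = 0, 0, False
while not done:
    action = pi(state)                           # A_i = π(S_i)
    state, reward, done = environment_step(state, action)
    total_reward += reward                       # accumulate Σ_i R_i
print("accumulated reward:", total_reward)
```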

[The Breakout and Pong frames on these slides are from the DQN Nature paper, Extended Data Figure 2, "Visualization of learned value functions on two games, Breakout and Pong"; with permission from Atari Interactive, Inc.]

Reinforcement Learning Problem

Find a Policy Function π(S) -> A that maximizes the Accumulated Rewards Σi Ri.

How to Find π(S) -> A ?

Reinforcement Learning Algorithms

• Q-Learning

• Actor-Critic methods

• Policy Gradient

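As a concrete example of the first family, here is a minimal tabular Q-Learning sketch: learn a table Q[S][A] of expected accumulated reward and read the policy off as π(S) = argmax over A of Q[S][A]. The toy environment and the hyperparameters are illustrative assumptions, not part of the talk.

```python
import random
from collections import defaultdict

ACTIONS = ["up", "down"]
alpha, gamma, epsilon = 0.1, 0.9, 0.1     # learning rate, discount, exploration (illustrative)
Q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})

def toy_step(state, action):
    """Placeholder environment: 'up' is rewarded in even states, 'down' in odd ones."""
    reward = 1 if (state % 2 == 0) == (action == "up") else -1
    next_state = (state + 1) % 10
    return next_state, reward, next_state == 0   # episode ends when the counter wraps around

for episode in range(500):
    state, done = 0, False
    while not done:
        # epsilon-greedy: mostly follow the current estimate, sometimes explore
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(Q[state], key=Q[state].get)
        next_state, reward, done = toy_step(state, action)
        # Q-learning update towards reward + gamma * best future value
        best_next = max(Q[next_state].values())
        Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
        state = next_state

policy = {s: max(Q[s], key=Q[s].get) for s in Q}   # π(S) = argmax_A Q[S][A]
print(policy)
```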

Episode

Extended Data Figure 2 | Visualization of learned value functions on twogames, Breakout and Pong. a, A visualization of the learned value function onthe game Breakout. At time points 1 and 2, the state value is predicted to be ,17and the agent is clearing the bricks at the lowest level. Each of the peaks inthe value function curve corresponds to a reward obtained by clearing a brick.At time point 3, the agent is about to break through to the top level of bricks andthe value increases to ,21 in anticipation of breaking out and clearing alarge set of bricks. At point 4, the value is above 23 and the agent has brokenthrough. After this point, the ball will bounce at the upper part of the bricksclearing many of them by itself. b, A visualization of the learned action-valuefunction on the game Pong. At time point 1, the ball is moving towards thepaddle controlled by the agent on the right side of the screen and the values of

all actions are around 0.7, reflecting the expected value of this state based onprevious experience. At time point 2, the agent starts moving the paddletowards the ball and the value of the ‘up’ action stays high while the value of the‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would leadto the agent losing the ball and incurring a reward of 21. At time point 3,the agent hits the ball by pressing ‘up’ and the expected reward keeps increasinguntil time point 4, when the ball reaches the left edge of the screen and the valueof all actions reflects that the agent is about to receive a reward of 1. Note,the dashed line shows the past trajectory of the ball purely for illustrativepurposes (that is, not shown during the game). With permission from AtariInteractive, Inc.

LETTER RESEARCH

Macmillan Publishers Limited. All rights reserved©2015

😁R3=+1R1=0 R2=0

Extended Data Figure 2 | Visualization of learned value functions on twogames, Breakout and Pong. a, A visualization of the learned value function onthe game Breakout. At time points 1 and 2, the state value is predicted to be ,17and the agent is clearing the bricks at the lowest level. Each of the peaks inthe value function curve corresponds to a reward obtained by clearing a brick.At time point 3, the agent is about to break through to the top level of bricks andthe value increases to ,21 in anticipation of breaking out and clearing alarge set of bricks. At point 4, the value is above 23 and the agent has brokenthrough. After this point, the ball will bounce at the upper part of the bricksclearing many of them by itself. b, A visualization of the learned action-valuefunction on the game Pong. At time point 1, the ball is moving towards thepaddle controlled by the agent on the right side of the screen and the values of

all actions are around 0.7, reflecting the expected value of this state based onprevious experience. At time point 2, the agent starts moving the paddletowards the ball and the value of the ‘up’ action stays high while the value of the‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would leadto the agent losing the ball and incurring a reward of 21. At time point 3,the agent hits the ball by pressing ‘up’ and the expected reward keeps increasinguntil time point 4, when the ball reaches the left edge of the screen and the valueof all actions reflects that the agent is about to receive a reward of 1. Note,the dashed line shows the past trajectory of the ball purely for illustrativepurposes (that is, not shown during the game). With permission from AtariInteractive, Inc.

LETTER RESEARCH

Macmillan Publishers Limited. All rights reserved©2015

Extended Data Figure 2 | Visualization of learned value functions on twogames, Breakout and Pong. a, A visualization of the learned value function onthe game Breakout. At time points 1 and 2, the state value is predicted to be ,17and the agent is clearing the bricks at the lowest level. Each of the peaks inthe value function curve corresponds to a reward obtained by clearing a brick.At time point 3, the agent is about to break through to the top level of bricks andthe value increases to ,21 in anticipation of breaking out and clearing alarge set of bricks. At point 4, the value is above 23 and the agent has brokenthrough. After this point, the ball will bounce at the upper part of the bricksclearing many of them by itself. b, A visualization of the learned action-valuefunction on the game Pong. At time point 1, the ball is moving towards thepaddle controlled by the agent on the right side of the screen and the values of

all actions are around 0.7, reflecting the expected value of this state based onprevious experience. At time point 2, the agent starts moving the paddletowards the ball and the value of the ‘up’ action stays high while the value of the‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would leadto the agent losing the ball and incurring a reward of 21. At time point 3,the agent hits the ball by pressing ‘up’ and the expected reward keeps increasinguntil time point 4, when the ball reaches the left edge of the screen and the valueof all actions reflects that the agent is about to receive a reward of 1. Note,the dashed line shows the past trajectory of the ball purely for illustrativepurposes (that is, not shown during the game). With permission from AtariInteractive, Inc.

LETTER RESEARCH

Macmillan Publishers Limited. All rights reserved©2015

Extended Data Figure 2 | Visualization of learned value functions on twogames, Breakout and Pong. a, A visualization of the learned value function onthe game Breakout. At time points 1 and 2, the state value is predicted to be ,17and the agent is clearing the bricks at the lowest level. Each of the peaks inthe value function curve corresponds to a reward obtained by clearing a brick.At time point 3, the agent is about to break through to the top level of bricks andthe value increases to ,21 in anticipation of breaking out and clearing alarge set of bricks. At point 4, the value is above 23 and the agent has brokenthrough. After this point, the ball will bounce at the upper part of the bricksclearing many of them by itself. b, A visualization of the learned action-valuefunction on the game Pong. At time point 1, the ball is moving towards thepaddle controlled by the agent on the right side of the screen and the values of

all actions are around 0.7, reflecting the expected value of this state based onprevious experience. At time point 2, the agent starts moving the paddletowards the ball and the value of the ‘up’ action stays high while the value of the‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would leadto the agent losing the ball and incurring a reward of 21. At time point 3,the agent hits the ball by pressing ‘up’ and the expected reward keeps increasinguntil time point 4, when the ball reaches the left edge of the screen and the valueof all actions reflects that the agent is about to receive a reward of 1. Note,the dashed line shows the past trajectory of the ball purely for illustrativepurposes (that is, not shown during the game). With permission from AtariInteractive, Inc.

LETTER RESEARCH

Macmillan Publishers Limited. All rights reserved©2015

Extended Data Figure 2 | Visualization of learned value functions on twogames, Breakout and Pong. a, A visualization of the learned value function onthe game Breakout. At time points 1 and 2, the state value is predicted to be ,17and the agent is clearing the bricks at the lowest level. Each of the peaks inthe value function curve corresponds to a reward obtained by clearing a brick.At time point 3, the agent is about to break through to the top level of bricks andthe value increases to ,21 in anticipation of breaking out and clearing alarge set of bricks. At point 4, the value is above 23 and the agent has brokenthrough. After this point, the ball will bounce at the upper part of the bricksclearing many of them by itself. b, A visualization of the learned action-valuefunction on the game Pong. At time point 1, the ball is moving towards thepaddle controlled by the agent on the right side of the screen and the values of

all actions are around 0.7, reflecting the expected value of this state based onprevious experience. At time point 2, the agent starts moving the paddletowards the ball and the value of the ‘up’ action stays high while the value of the‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would leadto the agent losing the ball and incurring a reward of 21. At time point 3,the agent hits the ball by pressing ‘up’ and the expected reward keeps increasinguntil time point 4, when the ball reaches the left edge of the screen and the valueof all actions reflects that the agent is about to receive a reward of 1. Note,the dashed line shows the past trajectory of the ball purely for illustrativepurposes (that is, not shown during the game). With permission from AtariInteractive, Inc.

LETTER RESEARCH

Macmillan Publishers Limited. All rights reserved©2015

Extended Data Figure 2 | Visualization of learned value functions on twogames, Breakout and Pong. a, A visualization of the learned value function onthe game Breakout. At time points 1 and 2, the state value is predicted to be ,17and the agent is clearing the bricks at the lowest level. Each of the peaks inthe value function curve corresponds to a reward obtained by clearing a brick.At time point 3, the agent is about to break through to the top level of bricks andthe value increases to ,21 in anticipation of breaking out and clearing alarge set of bricks. At point 4, the value is above 23 and the agent has brokenthrough. After this point, the ball will bounce at the upper part of the bricksclearing many of them by itself. b, A visualization of the learned action-valuefunction on the game Pong. At time point 1, the ball is moving towards thepaddle controlled by the agent on the right side of the screen and the values of

all actions are around 0.7, reflecting the expected value of this state based onprevious experience. At time point 2, the agent starts moving the paddletowards the ball and the value of the ‘up’ action stays high while the value of the‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would leadto the agent losing the ball and incurring a reward of 21. At time point 3,the agent hits the ball by pressing ‘up’ and the expected reward keeps increasinguntil time point 4, when the ball reaches the left edge of the screen and the valueof all actions reflects that the agent is about to receive a reward of 1. Note,the dashed line shows the past trajectory of the ball purely for illustrativepurposes (that is, not shown during the game). With permission from AtariInteractive, Inc.

LETTER RESEARCH

Macmillan Publishers Limited. All rights reserved©2015

Extended Data Figure 2 | Visualization of learned value functions on twogames, Breakout and Pong. a, A visualization of the learned value function onthe game Breakout. At time points 1 and 2, the state value is predicted to be ,17and the agent is clearing the bricks at the lowest level. Each of the peaks inthe value function curve corresponds to a reward obtained by clearing a brick.At time point 3, the agent is about to break through to the top level of bricks andthe value increases to ,21 in anticipation of breaking out and clearing alarge set of bricks. At point 4, the value is above 23 and the agent has brokenthrough. After this point, the ball will bounce at the upper part of the bricksclearing many of them by itself. b, A visualization of the learned action-valuefunction on the game Pong. At time point 1, the ball is moving towards thepaddle controlled by the agent on the right side of the screen and the values of

all actions are around 0.7, reflecting the expected value of this state based onprevious experience. At time point 2, the agent starts moving the paddletowards the ball and the value of the ‘up’ action stays high while the value of the‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would leadto the agent losing the ball and incurring a reward of 21. At time point 3,the agent hits the ball by pressing ‘up’ and the expected reward keeps increasinguntil time point 4, when the ball reaches the left edge of the screen and the valueof all actions reflects that the agent is about to receive a reward of 1. Note,the dashed line shows the past trajectory of the ball purely for illustrativepurposes (that is, not shown during the game). With permission from AtariInteractive, Inc.

LETTER RESEARCH

Macmillan Publishers Limited. All rights reserved©2015

Game Over

👍 👍 👍

iRi∑ = +1

😭

Extended Data Figure 2 | Visualization of learned value functions on twogames, Breakout and Pong. a, A visualization of the learned value function onthe game Breakout. At time points 1 and 2, the state value is predicted to be ,17and the agent is clearing the bricks at the lowest level. Each of the peaks inthe value function curve corresponds to a reward obtained by clearing a brick.At time point 3, the agent is about to break through to the top level of bricks andthe value increases to ,21 in anticipation of breaking out and clearing alarge set of bricks. At point 4, the value is above 23 and the agent has brokenthrough. After this point, the ball will bounce at the upper part of the bricksclearing many of them by itself. b, A visualization of the learned action-valuefunction on the game Pong. At time point 1, the ball is moving towards thepaddle controlled by the agent on the right side of the screen and the values of

all actions are around 0.7, reflecting the expected value of this state based onprevious experience. At time point 2, the agent starts moving the paddletowards the ball and the value of the ‘up’ action stays high while the value of the‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would leadto the agent losing the ball and incurring a reward of 21. At time point 3,the agent hits the ball by pressing ‘up’ and the expected reward keeps increasinguntil time point 4, when the ball reaches the left edge of the screen and the valueof all actions reflects that the agent is about to receive a reward of 1. Note,the dashed line shows the past trajectory of the ball purely for illustrativepurposes (that is, not shown during the game). With permission from AtariInteractive, Inc.

LETTER RESEARCH

Macmillan Publishers Limited. All rights reserved©2015

Extended Data Figure 2 | Visualization of learned value functions on twogames, Breakout and Pong. a, A visualization of the learned value function onthe game Breakout. At time points 1 and 2, the state value is predicted to be ,17and the agent is clearing the bricks at the lowest level. Each of the peaks inthe value function curve corresponds to a reward obtained by clearing a brick.At time point 3, the agent is about to break through to the top level of bricks andthe value increases to ,21 in anticipation of breaking out and clearing alarge set of bricks. At point 4, the value is above 23 and the agent has brokenthrough. After this point, the ball will bounce at the upper part of the bricksclearing many of them by itself. b, A visualization of the learned action-valuefunction on the game Pong. At time point 1, the ball is moving towards thepaddle controlled by the agent on the right side of the screen and the values of

all actions are around 0.7, reflecting the expected value of this state based onprevious experience. At time point 2, the agent starts moving the paddletowards the ball and the value of the ‘up’ action stays high while the value of the‘down’ action falls to 20.9. This reflects the fact that pressing ‘down’ would leadto the agent losing the ball and incurring a reward of 21. At time point 3,the agent hits the ball by pressing ‘up’ and the expected reward keeps increasinguntil time point 4, when the ball reaches the left edge of the screen and the valueof all actions reflects that the agent is about to receive a reward of 1. Note,the dashed line shows the past trajectory of the ball purely for illustrativepurposes (that is, not shown during the game). With permission from AtariInteractive, Inc.

LETTER RESEARCH


Episode

R1=0   R2=0   R3=−1   Game Over

👎 👎 👎

∑i Ri = −1
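As a concrete (made-up) illustration of this sum: the episode return is simply the per-step rewards added up, so a lost Pong point gives −1.

    # Illustrative only: the return of an episode is the sum of its rewards.
    rewards = [0, 0, -1]           # R1, R2, R3 of the losing episode above
    episode_return = sum(rewards)  # -1: this episode should be discouraged
    print(episode_return)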

How to Find π(S) -> A ?

1. Change π to Stochastic: π(S) -> P(A)

Pong Example

π( [Pong frame] ) -> ?

[Slide background: game frames from Extended Data Figure 2, "Visualization of learned value functions on two games, Breakout and Pong", of the DQN Nature letter (Mnih et al., 2015). © 2015 Macmillan Publishers Limited; with permission from Atari Interactive, Inc.]

π( [Pong frame] ) -> Action Probability   (bar chart, scale 0 to 1)
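A minimal sketch (my own, not the speaker's code) of what a stochastic policy looks like in Python: the policy returns a probability for each action and the agent samples instead of always taking the most likely move. The probabilities below are placeholder values, not the output of a real network.

    import random

    def stochastic_policy(frame):
        # Placeholder: a real policy would compute these probabilities from the frame.
        return {"up": 0.7, "down": 0.3}

    def act(frame):
        probs = stochastic_policy(frame)
        actions, weights = zip(*probs.items())
        return random.choices(actions, weights=weights)[0]  # sample, do not argmax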


How to Find π(S) -> A ?

1. Change π to Stochastic: π(S) -> P(A)
2. Approximate π with NeuralNet: π(S, θ) -> P(A)

π( Si , θ ) -> Action Probability   (bar chart, scale 0 to 1)
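A rough sketch of π(Si, θ) as a small neural network (the layer sizes and two-action output are assumptions for Pong, not details given in the talk): θ is the set of weight matrices, the input is a flattened frame, and the output is a probability distribution over the actions.

    import numpy as np

    rng = np.random.default_rng(0)

    # theta: the parameters of the policy network (sizes are illustrative).
    theta = {
        "W1": rng.normal(scale=0.01, size=(80 * 80, 200)),
        "W2": rng.normal(scale=0.01, size=(200, 2)),       # two actions: up, down
    }

    def policy(frame, theta):
        # pi(S, theta) -> P(A): map a flattened frame to action probabilities.
        h = np.maximum(0, frame @ theta["W1"])    # hidden layer, ReLU
        logits = h @ theta["W2"]
        exp = np.exp(logits - logits.max())       # softmax over the actions
        return exp / exp.sum()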


How to Find π(Si, θ) -> P(A) ?

How to Find θ

Loss Function…


R1=0   R2=0   R3=+1   Game Over


∑i Ri = +1

😁

👍 👍 👍
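One way to read the thumbs-up slide in code (a sketch of my own, with hypothetical names): every (state, action) pair in an episode inherits that episode's total return, so all actions of a won point get a positive weight and all actions of a lost point a negative one.

    def label_episode(trajectory, rewards):
        # trajectory: list of (state, action) pairs taken during the episode.
        # rewards:    per-step rewards, e.g. [0, 0, +1].
        episode_return = sum(rewards)   # +1 for a won point, -1 for a lost one
        return [(s, a, episode_return) for (s, a) in trajectory]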


π(Si , θ | Ai)

θk



π(Si , θ | Ai)   👍

👎

Δ π(Si , θ | Ai) / Δ θk
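A minimal sketch of the parameter update this gradient suggests. Hedge: the slide writes the gradient of π itself; most REINFORCE-style implementations use the gradient of log π(Ai | Si, θ) instead, which is what an autograd framework would hand back, and that is the form sketched here. All names are illustrative.

    def reinforce_update(theta, grad_log_pi, episode_return, learning_rate=1e-3):
        # grad_log_pi: dict with the same keys as theta, holding
        #              d log pi(Ai | Si, theta) / d theta, summed over the episode.
        # episode_return: sum of rewards of the episode (+1 or -1 in Pong).
        for k in theta:
            # Good episodes (+1) push the taken actions' probabilities up,
            # bad episodes (-1) push them down.
            theta[k] = theta[k] + learning_rate * episode_return * grad_log_pi[k]
        return theta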


[Slide figure: three bar charts with values from 0.0 to 1.0, showing the agent's action probabilities at successive time points of an episode, annotated with the rewards received: R1 = 0, R2 = 0, and R3 = +1 at Game Over.]


∑ᵢ₌₁ⁿ Ri = +1 — the episode's total reward is positive 😁, so the probability of every action taken along the way is increased 👍 👍 👍
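Put together, this is essentially the REINFORCE update: play an episode, sum the rewards, and scale every per-action gradient step by that sum, so a positive total makes all the taken actions more likely and a negative total makes them less likely. The sketch below assumes the same kind of linear softmax policy as in the earlier sketch; all names are illustrative, not from the talk.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_episode_update(theta, episode, lr=0.01):
    # theta   : (n_actions, n_features) parameters of a linear softmax policy
    # episode : list of (state, action, reward) tuples, e.g. rewards 0, 0, +1
    total_reward = sum(r for _, _, r in episode)          # ∑ Ri
    for state, action, _ in episode:
        probs = softmax(theta @ state)
        one_hot = np.zeros_like(probs)
        one_hot[action] = 1.0
        grad_log_pi = np.outer(one_hot - probs, state)
        # positive total reward → increase π(Ai | Si); negative → decrease it
        theta = theta + lr * total_reward * grad_log_pi
    return theta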


Reinforcement Learning

Outline

• Current State

• Deep Learning

• Reinforcement Learning

• Conclusions

Conclusions

1.

2.

3.
