Recurrent Neural Networks - uvadlc.github.io/lectures/nov2017/lecture6.pdf


Recurrent Neural Networks
Deep Learning – Lecture 5

Efstratios Gavves

Sequential Data

So far, all tasks assumed stationary data

Neither all data, nor all tasks are stationary though

Sequential Data: Text

What about inputs that appear in sequences, such as text? Could a neural network handle such modalities?

Memory needed

Pr(x) = ∏_i Pr(x_i | x_1, …, x_{i-1})

Sequential data: Video

Quiz: Other sequential data?


Time series data

Stock exchange

Biological measurements

Climate measurements

Market analysis

Speech/Music

User behavior in websites

...

Machine Translation

The phrase in the source language is one sequence
– "Today the weather is good"

The phrase in the target language is also a sequence
– "Погода сегодня хорошая"

[Figure: an encoder-decoder chain of LSTM cells; the encoder reads "Today the weather is good <EOS>", the decoder emits "Погода сегодня хорошая <EOS>".]

Image captioning

An image is a thousand words, literally!

Pretty much the same as machine translation

Replace the encoder part with the output of a ConvNet
– E.g. use AlexNet or a VGG16 network

Keep the decoder part to operate as a translator

[Figure: a ConvNet encodes the image; a chain of LSTM cells decodes the caption "Today the weather is good <EOS>".]

Demo (video on YouTube)

Question answering

Bleeding-edge research, no real consensus
– Very interesting open research problems

Again, pretty much like machine translation

Again, the Encoder-Decoder paradigm
– Insert the question into the encoder part
– Model the answer at the decoder part

Question answering with images also
– Again, bleeding-edge research
– How/where to add the image?
– What has been working so far is to add the image only in the beginning

Example (visual QA):
Q: What are the people playing?
A: They play beach football

Example (text QA):
Q: John entered the living room, where he met Mary. She was drinking some wine and watching a movie. What room did John enter?
A: John entered the living room.

Demo (website)

Handwriting (demo website)

Recurrent Networks

[Figure: an NN cell with an input, an output, a state, and recurrent connections feeding the state back into the cell.]

Sequences

Next data depend on previous data

Roughly equivalent to predicting what comes next

Pr(x) = ∏_i Pr(x_i | x_1, …, x_{i-1})

Why sequences?

Parameters are reused → fewer parameters, easier modelling

Generalizes well to arbitrary lengths → not constrained to a specific length

However, often we pick a "frame" T instead of an arbitrary length

Pr(x) = ∏_i Pr(x_i | x_{i-T}, …, x_{i-1})

[Figure: the same RecurrentModel(·) is applied both to "I think, therefore, I am!" and to "Everything is repeated, in a circle. History is a master because it teaches us that it doesn't exist. It's the permutations that matter.", i.e. to sequences of very different lengths.]

Quiz: What is really a sequence?

Data inside a sequence are … ?

[Figure: "I am Bond, James ___" with candidate next words: Bond, McGuire, tired, am, !]

Quiz: What is really a sequence?

Data inside a sequence are non-i.i.d.
– Not independently, identically distributed

The next "word" depends on the previous "words"
– Ideally on all of them

We need context, and we need memory!

How to model context and memory?

[Figure: "I am Bond, James ___" → Bond]

x_i ≡ one-hot vectors

A vector with all zeros except for the active dimension

12 words in a sequence → 12 one-hot vectors

After the one-hot vectors, apply an embedding
– Word2Vec, GloVe

[Figure: vocabulary {I, am, Bond, ",", James, McGuire, tired, !}; each word of "I am Bond , …" is a one-hot column over the vocabulary, with a single 1 at that word's index and 0 everywhere else.]
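As an illustration of the above, a minimal numpy sketch (my own, not from the slides; the vocabulary is the slide's example) of building one-hot vectors:

```python
import numpy as np

# Toy vocabulary from the slide; the helper name one_hot is my own.
vocab = ["I", "am", "Bond", ",", "James", "McGuire", "tired", "!"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

sentence = ["I", "am", "Bond", ",", "James", "Bond"]
X = np.stack([one_hot(w) for w in sentence])   # one one-hot row per word
print(X.shape)                                 # (6, 8): 6 words, vocabulary of 8
```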

Quiz: Why not just indices?

"I am James McGuire"

One-hot representation: x_{t=1,2,3,4} = one-hot vectors over the vocabulary

OR?

Index representation:
x_"I" = 1
x_"am" = 2
x_"James" = 4
x_"McGuire" = 7

[Figure: the four words shown both as one-hot columns and as integer indices.]

Quiz: Why not just indices?

No, because then some words get closer together for no good reason (an artificial bias between words)

"I am James McGuire"

Index representation:
q_"I" = 1, q_"am" = 2, q_"James" = 4, q_"McGuire" = 7
distance(q_"I", q_"am") = 1
distance(q_"am", q_"McGuire") = 5

One-hot representation:
x_"I", x_"am", x_"James", x_"McGuire"
distance(x_"I", x_"am") = 1
distance(x_"am", x_"McGuire") = 1
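To make the "artificial bias" concrete, a small sketch (my own, using Euclidean distance, so the exact numbers differ from the slide's) comparing the two representations:

```python
import numpy as np

# With indices, "am" is 5 units from "McGuire" but only 1 unit from "I",
# purely because of where the words sit in the vocabulary.
# With one-hot vectors every pair of distinct words is equally far apart
# (here sqrt(2) under the Euclidean distance), so no artificial ordering.
q = {"I": 1, "am": 2, "James": 4, "McGuire": 7}      # index representation
x = {w: np.eye(8)[i] for w, i in q.items()}          # one-hot representation

print(abs(q["am"] - q["McGuire"]), abs(q["I"] - q["am"]))   # 5 1
print(np.linalg.norm(x["am"] - x["McGuire"]),
      np.linalg.norm(x["I"] - x["am"]))                     # equal distances
```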

Memory

A representation of the past

A memory must project the information at timestep t onto a latent space c_t using parameters θ

Then, re-use the projected information from t at t+1:
c_{t+1} = h(x_{t+1}, c_t; θ)

Memory parameters θ are shared for all timesteps t = 0, …:
c_{t+1} = h(x_{t+1}, h(x_t, h(x_{t-1}, … h(x_1, c_0; θ); θ); θ); θ)
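The shared-θ recursion can be read as a fold over the sequence; a tiny sketch (my own, with a made-up update h and θ) to make that concrete:

```python
from functools import reduce

# h is a placeholder memory update; theta is shared across all timesteps.
theta = 0.5
def h(c_prev, x):
    return theta * c_prev + x          # stand-in for h(x_t, c_{t-1}; theta)

xs = [1.0, 2.0, 3.0, 4.0]              # a toy input sequence
c0 = 0.0
c_final = reduce(h, xs, c0)            # h(h(h(h(c0, x1), x2), x3), x4)
print(c_final)
```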

Memory as a Graph

Simplest model
– Input with parameters U
– Memory embedding with parameters W
– Output with parameters V

[Figure: input x_t → (U) → memory c_t, with recurrent memory parameters W → (V) → output y_t.]

[Figure: the same graph unrolled over several timesteps: inputs x_t, x_{t+1}, x_{t+2}, x_{t+3} feed memories c_t, c_{t+1}, c_{t+2}, c_{t+3}, which produce outputs y_t, y_{t+1}, y_{t+2}, y_{t+3}; the parameters U, V, W are the same at every timestep.]

Folding the memory

[Figure: the unrolled/unfolded network (x_t, x_{t+1}, x_{t+2} → c_t, c_{t+1}, c_{t+2} → y_t, y_{t+1}, y_{t+2}, all sharing U, V, W) is equivalent to the folded network: a single cell c_t with input x_t via U, output y_t via V, and a recurrent connection W from c_{t-1}.]

Recurrent Neural Network (RNN)

Only two equations:

c_t = tanh(U x_t + W c_{t-1})
y_t = softmax(V c_t)
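A minimal numpy sketch of these two equations; the sizes, initialisation, and toy inputs are my own choices rather than the lecture's:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x, c_prev, U, W, V):
    """One RNN step: c_t = tanh(U x_t + W c_{t-1}),  y_t = softmax(V c_t)."""
    c = np.tanh(U @ x + W @ c_prev)
    y = softmax(V @ c)
    return c, y

rng = np.random.default_rng(0)
vocab_size, mem_size = 5, 3
U = rng.standard_normal((mem_size, vocab_size)) * 0.1   # input projection
W = rng.standard_normal((mem_size, mem_size)) * 0.1     # memory parameters
V = rng.standard_normal((vocab_size, mem_size)) * 0.1   # output projection

c = np.zeros(mem_size)
for t in range(4):                     # feed a toy sequence of one-hot inputs
    x = np.eye(vocab_size)[t]
    c, y = rnn_step(x, c, U, W, V)
print(y)                               # a distribution over the 5 words
```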

RNN Example

Vocabulary of 5 words

A memory of 3 units
– Hyperparameter that we choose, like layer size
– c_t: [3 × 1], W: [3 × 3]

An input projection of 3 dimensions
– U: [3 × 5]

An output projection of 10 dimensions
– V: [10 × 3]

c_t = tanh(U x_t + W c_{t-1})
y_t = softmax(V c_t)

[Figure: U · x_{t=4} with a one-hot x picks out the 4th column of U.]
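One useful observation (my own note): with a one-hot x_t, the product U · x_t simply selects a column of U, so U acts as a lookup table of input embeddings. A quick check:

```python
import numpy as np

rng = np.random.default_rng(1)
U = rng.standard_normal((3, 5))        # input projection, as in the example
x4 = np.eye(5)[3]                      # one-hot for the 4th word (index 3)

print(np.allclose(U @ x4, U[:, 3]))    # True: U·x picks out the 4th column of U
```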

Rolled Network vs. MLP?

What is really different?
– Steps instead of layers
– Step parameters are shared in a Recurrent Network
– In a Multi-Layer Network the parameters are different

Sometimes the intermediate outputs are not even needed
– Removing them, we almost end up with a standard Neural Network

[Figure: a 3-gram unrolled recurrent network ("Layer/Step" 1-3, all sharing U, V, W) next to a 3-layer neural network (Layers 1-3 with different parameters W_1, W_2, W_3), both producing a final output y.]

Training Recurrent Networks

Cross-entropy loss:

P = ∏_{t,k} y_{t,k}^{l_{t,k}} ⇒ ℒ = −log P = Σ_t ℒ_t = −(1/T) Σ_t l_t log y_t

Backpropagation Through Time (BPTT)
– Again, the chain rule
– Only difference: gradients survive over time steps
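A small sketch (mine) of the summed cross-entropy, with made-up predictions y_t and one-hot targets l_t:

```python
import numpy as np

# L = -(1/T) Σ_t l_t · log y_t, with l_t one-hot targets and y_t softmax outputs.
y = np.array([[0.7, 0.2, 0.1],         # predictions for 4 timesteps
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4],
              [0.2, 0.1, 0.7]])
l = np.eye(3)[[0, 1, 2, 2]]            # one-hot targets per timestep
T = len(y)

loss = -(l * np.log(y)).sum() / T
print(loss)
```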

Training RNNs is hard

Vanishing gradients
– After a few time steps the gradients become almost 0

Exploding gradients
– After a few time steps the gradients become huge

Can't capture long-term dependencies

Alternative formulation for RNNs

An alternative formulation to derive conclusions and intuitions:

c_t = W · tanh(c_{t-1}) + U · x_t + b
ℒ = Σ_t ℒ_t(c_t)

Another look at the gradients

ℒ = L(c_T(c_{T-1}(… c_1(x_1, c_0; W); W); W); W)

∂ℒ_t/∂W = Σ_{τ=1..t} (∂ℒ_t/∂c_t) (∂c_t/∂c_τ) (∂c_τ/∂W)

(∂ℒ/∂c_t)(∂c_t/∂c_τ) = (∂ℒ/∂c_t) · (∂c_t/∂c_{t-1}) · (∂c_{t-1}/∂c_{t-2}) · … · (∂c_{τ+1}/∂c_τ) ≤ η^{t−τ} (∂ℒ_t/∂c_t)

The RNN gradient is a recursive product of ∂c_k/∂c_{k-1}

t ≫ τ → long-term factors; the rest → short-term factors

Exploding/Vanishing gradients

∂ℒ/∂c_t = (∂ℒ/∂c_T) · (∂c_T/∂c_{T-1}) · (∂c_{T-1}/∂c_{T-2}) · … · (∂c_{t+1}/∂c_t)

If each factor is < 1: ∂ℒ/∂W ≪ 1 ⟹ vanishing gradient

If each factor is > 1: ∂ℒ/∂W ≫ 1 ⟹ exploding gradient

Vanishing gradients

The gradient of the error w.r.t. an intermediate cell:

∂ℒ_t/∂W = Σ_{τ=1..t} (∂ℒ_t/∂y_t) (∂y_t/∂c_t) (∂c_t/∂c_τ) (∂c_τ/∂W)

∂c_t/∂c_τ = ∏_{t ≥ k ≥ τ} ∂c_k/∂c_{k-1} = ∏_{t ≥ k ≥ τ} W · ∂tanh(c_{k-1})

Long-term dependencies get exponentially smaller weights
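A quick numerical illustration (my own, not from the slides) of why this product vanishes or explodes: chaining 50 Jacobians W · diag(tanh′(c)) with contractive vs. expansive recurrent weights.

```python
import numpy as np

# The norm of ∏ W · diag(tanh'(c)) shrinks or blows up roughly exponentially,
# depending on how large the recurrent weights W are.
rng = np.random.default_rng(0)
for scale in (0.5, 1.5):                     # small vs. large recurrent weights
    W = scale * rng.standard_normal((3, 3)) / np.sqrt(3)
    J = np.eye(3)
    for step in range(50):
        c = rng.standard_normal(3)           # stand-in hidden state
        J = (W * (1 - np.tanh(c) ** 2)) @ J  # chain one more ∂c_k/∂c_{k-1}
    print(f"scale={scale}: ||J|| after 50 steps = {np.linalg.norm(J):.2e}")
```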

Gradient clipping for exploding gradients

Scale the gradients to a threshold

Pseudocode:
1. g ← ∂ℒ/∂W
2. if ‖g‖ > θ_0:
       g ← (θ_0 / ‖g‖) · g
   else:
       do nothing

Simple, but works!
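A runnable version of the pseudocode (my own sketch; the threshold name theta0 mirrors θ_0 above):

```python
import numpy as np

def clip_gradient(g, theta0=5.0):
    """Rescale gradient g so its L2 norm never exceeds theta0."""
    norm = np.linalg.norm(g)
    if norm > theta0:
        g = (theta0 / norm) * g     # keep the direction, shrink the magnitude
    return g

g = np.array([3.0, 4.0])                 # ||g|| = 5
print(clip_gradient(g, theta0=1.0))      # -> [0.6 0.8], norm 1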

Rescaling vanishing gradients?

Not a good solution

Weights are shared between timesteps → the loss is summed over timesteps

ℒ = Σ_t ℒ_t ⟹ ∂ℒ/∂W = Σ_t ∂ℒ_t/∂W

∂ℒ_t/∂W = Σ_{τ=1..t} (∂ℒ_t/∂c_τ) (∂c_τ/∂W) = Σ_{τ=1..t} (∂ℒ_t/∂c_t) (∂c_t/∂c_τ) (∂c_τ/∂W)

Rescaling for one timestep (∂ℒ_t/∂W) affects all timesteps
– The rescaling factor for one timestep does not work for another

More intuitively

∂ℒ/∂W = ∂ℒ_1/∂W + ∂ℒ_2/∂W + ∂ℒ_3/∂W + ∂ℒ_4/∂W + ∂ℒ_5/∂W

[Figure: the five per-timestep gradient contributions ∂ℒ_1/∂W, …, ∂ℒ_5/∂W all flowing into the shared W.]

More intuitively

Let's say ∂ℒ_1/∂W ∝ 1, ∂ℒ_2/∂W ∝ 1/10, ∂ℒ_3/∂W ∝ 1/100, ∂ℒ_4/∂W ∝ 1/1000, ∂ℒ_5/∂W ∝ 1/10000

∂ℒ/∂W = Σ_r ∂ℒ_r/∂W = 1.1111

If ∂ℒ/∂W is rescaled to 1, then ∂ℒ_5/∂W ∝ 10^{-5}

Longer-term dependencies become negligible
– Weak recurrent modelling
– Learning focuses on the short-term only

Recurrent networks ∝ chaotic systems
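The arithmetic of the example, checked in a few lines (the values are the slide's):

```python
grads = [1, 1/10, 1/100, 1/1000, 1/10000]   # per-timestep gradient contributions
total = sum(grads)
print(total)                                # 1.1111
print(grads[-1] / total)                    # ~9e-5: timestep 5 barely matters
```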

Fixing vanishing gradients

Regularization on the recurrent weights
– Force the error signal not to vanish

Ω = Σ_t Ω_t = Σ_t ( ‖(∂ℒ/∂c_{t+1}) (∂c_{t+1}/∂c_t)‖ / ‖∂ℒ/∂c_{t+1}‖ − 1 )²

Advanced recurrent modules

Long Short-Term Memory (LSTM) module

Gated Recurrent Unit (GRU) module

Pascanu et al., On the difficulty of training Recurrent Neural Networks, 2013

Advanced Recurrent Nets

[Figure: LSTM cell diagram: the input x_t and previous memory m_{t-1} feed three sigmoid gates (i_t, f_t, o_t) and a tanh candidate c̃_t; the cell state runs from c_{t-1} to c_t, and the output memory is m_t.]

How to fix the vanishing gradients?

The error signal over time must have neither too large nor too small a norm

Let's have another look at the gradients:

∂ℒ_t/∂W = Σ_{τ=1..t} (∂ℒ_t/∂y_t) (∂y_t/∂c_t) (∂c_t/∂c_τ) (∂c_τ/∂W)

∂c_t/∂c_τ = ∏_{t ≥ k ≥ τ} ∂c_k/∂c_{k-1}

Solution: have an activation function with gradient equal to 1
– The identity function has a gradient equal to 1

By doing so, the gradients become neither too small nor too large

Long Short-Term Memory (LSTM): a beefed-up RNN

LSTM:
i = σ(x_t U^(i) + m_{t-1} W^(i))
f = σ(x_t U^(f) + m_{t-1} W^(f))
o = σ(x_t U^(o) + m_{t-1} W^(o))
c̃_t = tanh(x_t U^(g) + m_{t-1} W^(g))
c_t = c_{t-1} ⊙ f + c̃_t ⊙ i
m_t = tanh(c_t) ⊙ o

Simple RNN:
c_t = W · tanh(c_{t-1}) + U · x_t + b

– The previous state c_{t-1} is connected to the new c_t with no nonlinearity (identity function)
– The only other factor is the forget gate f, which rescales the previous LSTM state

More info at: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
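A minimal numpy sketch of one LSTM step following the equations above; the shapes, initialisation, and toy sequence are my own choices, with x as a row vector so x U + m W matches the slide's convention:

```python
import numpy as np

def lstm_step(x, m_prev, c_prev, params):
    U_i, W_i, U_f, W_f, U_o, W_o, U_g, W_g = params
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    i = sigmoid(x @ U_i + m_prev @ W_i)        # input gate
    f = sigmoid(x @ U_f + m_prev @ W_f)        # forget gate
    o = sigmoid(x @ U_o + m_prev @ W_o)        # output gate
    c_tilde = np.tanh(x @ U_g + m_prev @ W_g)  # candidate memory
    c = c_prev * f + c_tilde * i               # new cell state
    m = np.tanh(c) * o                         # new output memory
    return m, c

rng = np.random.default_rng(0)
d_in, d_mem = 5, 3
params = tuple(rng.standard_normal(s) * 0.1
               for s in [(d_in, d_mem), (d_mem, d_mem)] * 4)
m, c = np.zeros(d_mem), np.zeros(d_mem)
for t in range(4):                             # run over a toy sequence
    x = rng.standard_normal(d_in)
    m, c = lstm_step(x, m, c, params)
print(m, c)
```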

Cell state

The cell state carries the essential information over time

c_t = c_{t-1} ⊙ f + c̃_t ⊙ i

[Figure: the cell state line runs horizontally through the LSTM cell, from c_{t-1} to c_t, modified only by the forget gate (⊙ f) and the gated candidate (+ c̃_t ⊙ i).]

LSTM nonlinearities

σ ∈ (0, 1): control gate, something like a switch

tanh ∈ (−1, 1): recurrent nonlinearity

LSTM Step-by-Step: Step (1)

E.g. an LSTM on "Yesterday she slapped me. Today she loves me."

Decide what to forget and what to remember for the new memory
– Sigmoid = 1 → remember everything
– Sigmoid = 0 → forget everything

f_t = σ(x_t U^(f) + m_{t-1} W^(f))

LSTM Step-by-Step: Step (2)

Decide what new information is relevant from the new input and should be added to the new memory
– Modulate the input: i_t = σ(x_t U^(i) + m_{t-1} W^(i))
– Generate candidate memories: c̃_t = tanh(x_t U^(g) + m_{t-1} W^(g))

LSTM Step-by-Step: Step (3)

Compute and update the current cell state c_t
– Depends on the previous cell state
– What we decide to forget
– What inputs we allow
– The candidate memories

c_t = c_{t-1} ⊙ f_t + c̃_t ⊙ i_t

LSTM Step-by-Step: Step (4)

Modulate the output
– Is the new cell state relevant? → Sigmoid = 1
– If not → Sigmoid = 0

Generate the new memory:
o_t = σ(x_t U^(o) + m_{t-1} W^(o))
m_t = tanh(c_t) ⊙ o_t

LSTM Unrolled Network

Macroscopically very similar to standard RNNs

The engine is a bit different (more complicated)
– Because of their gates, LSTMs capture long- and short-term dependencies

[Figure: three LSTM cells chained over time, each with its gates (σ, σ, σ), tanh nonlinearities, and the cell-state line passing through all of them.]

Beyond RNN & LSTM

LSTM with peephole connections
– Gates also have access to the previous cell states c_{t-1} (not only the memories)

Coupled forget and input gates: c_t = f_t ⊙ c_{t-1} + (1 − f_t) ⊙ c̃_t

Bi-directional recurrent networks

Gated Recurrent Units (GRU)

Deep recurrent architectures

Recursive neural networks
– Tree structured

Multiplicative interactions

Generative recurrent architectures

[Figure: a two-layer (stacked) recurrent architecture, LSTM (1) feeding LSTM (2) at every timestep.]

Take-away message

Recurrent Neural Networks (RNN) for sequences

Backpropagation Through Time

Vanishing and Exploding Gradients and Remedies

RNNs using Long Short-Term Memory (LSTM)

Applications of Recurrent Neural Networks

Thank you!
