Source: uvadlc.github.io/lectures/nov2017/lecture6.pdf
Recurrent Neural Networks (Deep Learning, Lecture 5)
Efstratios Gavves
Sequential Data
So far, all tasks assumed stationary data
Neither all data nor all tasks are stationary, though
Sequential Data: Text
What about inputs that appear in sequences, such as text? Could a neural network handle such modalities?
Memory needed
$\Pr(x) = \prod_i \Pr(x_i \mid x_1, \dots, x_{i-1})$
Sequential data: Video
Quiz: Other sequential data?
Time series data
Stock exchange
Biological measurements
Climate measurements
Market analysis
Speech/Music
User behavior in websites
...
Applications
Machine Translation
The phrase in the source language is one sequence
- "Today the weather is good"
The phrase in the target language is also a sequence
- "Погода сегодня хорошая"
[Figure: sequence-to-sequence model. An encoder LSTM reads "Today the weather is good <EOS>"; a decoder LSTM then generates "Погода сегодня хорошая <EOS>".]
Image captioning
An image is a thousand words, literally!
Pretty much the same as machine translation
Replace the encoder part with the output of a ConvNet
- E.g. use AlexNet or a VGG16 network
Keep the decoder part to operate as a translator
[Figure: a ConvNet encodes the image; a decoder LSTM generates the caption "Today the weather is good <EOS>".]
Question answering
Bleeding-edge research, no real consensus
- Very interesting open research problems
Again, pretty much like machine translation
Again, the encoder-decoder paradigm
- Insert the question into the encoder part
- Model the answer at the decoder part
Question answering with images also
- Again, bleeding-edge research
- How/where to add the image?
- What has been working so far is to add the image only in the beginning
Q: What are the people playing?
A: They play beach football
Q: John entered the living room, where he met Mary. She was drinking some wine and watching a movie. What room did John enter?
A: John entered the living room.
Recurrent Networks
[Figure: an NN cell with an input, an output, a state, and recurrent connections feeding the state back into the cell.]
Sequences
Next data depend on previous data
Roughly equivalent to predicting what comes next
$\Pr(x) = \prod_i \Pr(x_i \mid x_1, \dots, x_{i-1})$
Why sequences?
Parameters are reused: fewer parameters, easier modelling
Generalizes well to arbitrary lengths: not constrained to a specific length
However, often we pick a "frame" $T$ instead of an arbitrary length
$\Pr(x) = \prod_i \Pr(x_i \mid x_{i-T}, \dots, x_{i-1})$
I think, therefore, I am!
Everything is repeated, in a circle. History is a master because it
teaches us that it doesn't exist. It's the permutations that matter.
[Figure: sequence_model(sequence_model( … )) ≡ …]
Quiz: What is really a sequence?
Data inside a sequence are … ?
[Figure: "I am Bond , James ___" with candidate next words: Bond, McGuire, tired, am, !]
Quiz: What is really a sequence?
Data inside a sequence are non-i.i.d.
- i.e. not independently and identically distributed
The next "word" depends on the previous "words"
- Ideally on all of them
We need context, and we need memory!
How to model context and memory?
$x_i$ ≡ one-hot vectors
A vector with all zeros except for the active dimension
12 words in a sequence, so 12 one-hot vectors
After the one-hot vectors, apply an embedding
- Word2Vec, GloVe
[Figure: the vocabulary (I, am, Bond, ",", James, McGuire, tired, !) and the corresponding one-hot vectors; e.g. "tired" is all zeros except a 1 in its own dimension.]
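As an illustration of the one-hot-then-embed pipeline, here is a minimal NumPy sketch. The vocabulary, its ordering, and the random embedding matrix are made up for the example; a real system would use learned Word2Vec/GloVe vectors or a trained embedding layer.

```python
import numpy as np

# Hypothetical 8-word vocabulary from the slide example.
vocab = ["I", "am", "Bond", ",", "James", "McGuire", "tired", "!"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word, vocab_size=len(vocab)):
    """Return a vector of zeros with a 1 at the word's dimension."""
    v = np.zeros(vocab_size)
    v[word_to_idx[word]] = 1.0
    return v

sentence = ["I", "am", "Bond", ",", "James", "Bond"]
X = np.stack([one_hot(w) for w in sentence])   # 6 words -> 6 one-hot vectors

# After the one-hot vectors, apply an embedding (random here; in practice
# Word2Vec/GloVe or a learned embedding matrix would be used).
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), 4))           # vocab_size x embedding_dim
embedded = X @ E                               # a one-hot row selects a row of E
print(embedded.shape)                          # (6, 4)
```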
Quiz: Why not just indices?
[Figure: the sentence "I am James McGuire" represented by one-hot vectors vs. by indices.]
One-hot representation: $x_{t=1,2,3,4}$ are one-hot vectors
OR index representation:
$x_{\text{"I"}} = 1$, $x_{\text{"am"}} = 2$, $x_{\text{"James"}} = 4$, $x_{\text{"McGuire"}} = 7$
Quiz: Why not just indices?
No, because then some words get closer together for no good reason (artificial bias between words)
[Figure: the same sentence "I am James McGuire" as one-hot vectors vs. as indices.]
Index representation:
$\mathrm{distance}(d_{\text{"I"}}, d_{\text{"am"}}) = 1$, but $\mathrm{distance}(d_{\text{"am"}}, d_{\text{"McGuire"}}) = 5$
One-hot representation:
$\mathrm{distance}(x_{\text{"I"}}, x_{\text{"am"}}) = 1$ and $\mathrm{distance}(x_{\text{"am"}}, x_{\text{"McGuire"}}) = 1$
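The artificial-bias argument can be checked numerically. The sketch below uses the hypothetical indices from the slide and Euclidean distance; for one-hot vectors, the point is that the distance between any two distinct words is the same constant (√2 under the Euclidean norm), while index distances differ arbitrarily.

```python
import numpy as np

# Hypothetical index representation from the slide.
idx = {"I": 1, "am": 2, "James": 4, "McGuire": 7}

# One-hot representation over an 8-word vocabulary.
def one_hot(i, size=8):
    v = np.zeros(size)
    v[i] = 1.0
    return v

oh = {w: one_hot(i) for w, i in idx.items()}

# Index distances differ for no good reason ...
print(abs(idx["I"] - idx["am"]))                 # 1
print(abs(idx["am"] - idx["McGuire"]))           # 5

# ... while one-hot distances are identical for every pair of distinct words.
print(np.linalg.norm(oh["I"] - oh["am"]))        # sqrt(2)
print(np.linalg.norm(oh["am"] - oh["McGuire"]))  # sqrt(2)
```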
Memory
A representation of the past
A memory must project information at timestep $t$ onto a latent space $c_t$ using parameters $\theta$
Then, re-use the projected information from $t$ at $t+1$:
$c_{t+1} = h(x_{t+1}, c_t; \theta)$
Memory parameters $\theta$ are shared for all timesteps $t = 0, \dots$:
$c_{t+1} = h(x_{t+1}, h(x_t, h(x_{t-1}, \dots h(x_1, c_0; \theta); \theta); \theta); \theta)$
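A minimal sketch of this recursion; the tanh form of the update $h$ is an illustrative assumption (it matches the RNN cell defined later), the key point being that the same $\theta$ is reused at every timestep.

```python
import numpy as np

def h(x, c, theta):
    """One memory update c_{t+1} = h(x_{t+1}, c_t; theta)."""
    U, W = theta
    return np.tanh(U @ x + W @ c)

def run(xs, c0, theta):
    """Apply h over the whole sequence; theta is shared across timesteps."""
    c = c0
    for x in xs:
        c = h(x, c, theta)
    return c
```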
Memory as a Graph
Simplest model
- Input with parameters $U$
- Memory embedding with parameters $W$
- Output with parameters $V$
[Figure: at each timestep the input $x_t$ enters through $U$, the memory $c_t$ connects to $c_{t+1}$ through $W$, and the output $y_t$ leaves through $V$; the same $U$, $V$, $W$ are repeated for $t$, $t+1$, $t+2$, $t+3$.]
Folding the memory
[Figure: the unrolled/unfolded network (copies of $U$, $V$, $W$ at timesteps $t$, $t+1$, $t+2$) next to the equivalent folded network: a single cell $c_t(c_{t-1})$ with a recurrent connection through $W$.]
Recurrent Neural Network (RNN)
Only two equations
[Figure: the folded RNN cell with input $x_t$, state $c_t(c_{t-1})$, output $y_t$, and parameters $U$, $V$, $W$.]
$c_t = \tanh(U x_t + W c_{t-1})$
$y_t = \mathrm{softmax}(V c_t)$
RNN Example
Vocabulary of 5 words
A memory of 3 units
- Hyperparameter that we choose, like layer size
- $c_t$: $[3 \times 1]$, $W$: $[3 \times 3]$
An input projection of 3 dimensions
- $U$: $[3 \times 5]$
An output projection of 10 dimensions
- $V$: $[10 \times 3]$
$c_t = \tanh(U x_t + W c_{t-1})$
$y_t = \mathrm{softmax}(V c_t)$
[Figure: $U \cdot x_{t=4} = \dots$]
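A sketch of one forward step with the sizes stated on this slide (vocabulary 5, memory 3, output 10); the random parameters and the choice of active word are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(3, 5))      # input projection:  memory x vocabulary
W = rng.normal(size=(3, 3))      # memory-to-memory
V = rng.normal(size=(10, 3))     # output projection: output x memory

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def rnn_step(x_t, c_prev):
    """c_t = tanh(U x_t + W c_{t-1}),  y_t = softmax(V c_t)."""
    c_t = np.tanh(U @ x_t + W @ c_prev)
    y_t = softmax(V @ c_t)
    return c_t, y_t

x_t = np.zeros(5); x_t[3] = 1.0   # one-hot input, e.g. the 4th vocabulary word
c_prev = np.zeros(3)              # initial memory
c_t, y_t = rnn_step(x_t, c_prev)
print(c_t.shape, y_t.shape)       # (3,) (10,)
# Note: with a one-hot x_t, U @ x_t simply selects the corresponding column of U.
```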
Rolled Network vs. MLP?
What is really different?
- Steps instead of layers
- Step parameters are shared in a Recurrent Network
- In a Multi-Layer Network the parameters are different
- Sometimes intermediate outputs are not even needed
- Removing them, we almost end up with a standard Neural Network
[Figure: a 3-gram unrolled recurrent network ("Layer/Step" 1-3 with shared $U$, $W$, $V$, states $c_1, c_2, c_3$, and a final output $y$) side by side with a 3-layer neural network (Layers 1-3, each with its own parameters).]
Training Recurrent Networks
Cross-entropy loss
$P = \prod_{t,k} y_{t,k}^{\,l_{t,k}} \;\Rightarrow\; \mathcal{L} = -\log P = \sum_t \mathcal{L}_t = -\frac{1}{T}\sum_t l_t \log y_t$
Backpropagation Through Time (BPTT)
- Again, chain rule
- Only difference: gradients survive over the time steps
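A sketch of the summed cross-entropy over timesteps, following the (reconstructed) formula above; the softmax outputs and one-hot targets are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K = 6, 5                               # timesteps, classes
y = rng.dirichlet(np.ones(K), size=T)     # stand-in softmax outputs y_t (rows sum to 1)
labels = rng.integers(0, K, size=T)
l = np.eye(K)[labels]                     # one-hot targets l_t

# L = sum_t L_t = -(1/T) * sum_t l_t . log y_t
loss = -(l * np.log(y)).sum() / T
print(loss)
```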
Training RNNs is hard
Vanishing gradients
- After a few time steps the gradients become almost 0
Exploding gradients
- After a few time steps the gradients become huge
Can't capture long-term dependencies
Alternative formulation for RNNs
An alternative formulation to derive conclusions and intuitions
$c_t = W \cdot \tanh(c_{t-1}) + U \cdot x_t + b$
$\mathcal{L} = \sum_t \mathcal{L}_t(c_t)$
Another look at the gradients
$\mathcal{L} = L(c_T(c_{T-1}(\dots c_1(x_1, c_0; \theta); \theta); \theta); \theta)$
$\dfrac{\partial \mathcal{L}_t}{\partial \theta} = \sum_{\tau=1}^{t} \dfrac{\partial \mathcal{L}_t}{\partial c_t}\, \dfrac{\partial c_t}{\partial c_\tau}\, \dfrac{\partial c_\tau}{\partial \theta}$
$\dfrac{\partial \mathcal{L}}{\partial c_t}\, \dfrac{\partial c_t}{\partial c_\tau} = \dfrac{\partial \mathcal{L}}{\partial c_t} \cdot \dfrac{\partial c_t}{\partial c_{t-1}} \cdot \dfrac{\partial c_{t-1}}{\partial c_{t-2}} \cdot \dots \cdot \dfrac{\partial c_{\tau+1}}{\partial c_\tau} \le \eta^{\,t-\tau}\, \dfrac{\partial \mathcal{L}_t}{\partial c_t}$
The RNN gradient is a recursive product of $\dfrac{\partial c_t}{\partial c_{t-1}}$
$\tau \approx t$: short-term factors; $t \gg \tau$: long-term factors
Exploding/Vanishing gradients
$\dfrac{\partial \mathcal{L}}{\partial c_t} = \dfrac{\partial \mathcal{L}}{\partial c_T} \cdot \dfrac{\partial c_T}{\partial c_{T-1}} \cdot \dfrac{\partial c_{T-1}}{\partial c_{T-2}} \cdot \dots \cdot \dfrac{\partial c_{t+1}}{\partial c_t}$
If each factor is $< 1$: $\dfrac{\partial \mathcal{L}}{\partial c_t} \ll 1 \Rightarrow$ vanishing gradient
If each factor is $> 1$: $\dfrac{\partial \mathcal{L}}{\partial c_t} \gg 1 \Rightarrow$ exploding gradient
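A quick numeric check of why a long recursive product vanishes or explodes; the per-step factors 0.9 and 1.1 are arbitrary illustrative values.

```python
import numpy as np

steps = np.array([10, 50, 100])
print(0.9 ** steps)   # factors < 1: ~0.35, ~0.005, ~2.7e-5  (vanishing)
print(1.1 ** steps)   # factors > 1: ~2.6,  ~117,   ~13780   (exploding)
```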
Vanishing gradients
The gradient of the error w.r.t. an intermediate cell:
$\dfrac{\partial \mathcal{L}_t}{\partial \theta} = \sum_{\tau=1}^{t} \dfrac{\partial \mathcal{L}_t}{\partial y_t}\, \dfrac{\partial y_t}{\partial c_t}\, \dfrac{\partial c_t}{\partial c_\tau}\, \dfrac{\partial c_\tau}{\partial \theta}$, where $\dfrac{\partial c_t}{\partial c_\tau} = \prod_{t \ge k \ge \tau} \dfrac{\partial c_k}{\partial c_{k-1}} = \prod_{t \ge k \ge \tau} W \cdot \partial \tanh(c_{k-1})$
Long-term dependencies get exponentially smaller weights
Gradient clipping for exploding gradients
Scale the gradients when their norm exceeds a threshold
Pseudocode:
1. $g \leftarrow \dfrac{\partial \mathcal{L}}{\partial \theta}$
2. if $\lVert g \rVert > \theta_0$: $g \leftarrow \dfrac{\theta_0}{\lVert g \rVert}\, g$, else: do nothing
Simple, but works!
[Figure: the gradient $g$ is rescaled so that its norm is at most $\theta_0$.]
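A direct NumPy translation of the pseudocode; theta_0 is the chosen threshold (deep learning frameworks provide equivalent norm-based clipping utilities).

```python
import numpy as np

def clip_gradient(g, theta_0):
    """Rescale g to norm theta_0 if its norm exceeds theta_0, else leave it unchanged."""
    norm = np.linalg.norm(g)
    if norm > theta_0:
        g = (theta_0 / norm) * g
    return g

g = np.array([3.0, 4.0])               # norm 5
print(clip_gradient(g, theta_0=1.0))   # rescaled to norm 1 -> [0.6, 0.8]
print(clip_gradient(g, theta_0=10.0))  # unchanged
```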
Rescaling vanishing gradients?
Not a good solution
Weights are shared between timesteps, and the loss is summed over timesteps:
$\mathcal{L} = \sum_t \mathcal{L}_t \;\Rightarrow\; \dfrac{\partial \mathcal{L}}{\partial \theta} = \sum_t \dfrac{\partial \mathcal{L}_t}{\partial \theta}$
$\dfrac{\partial \mathcal{L}_t}{\partial \theta} = \sum_{\tau=1}^{t} \dfrac{\partial \mathcal{L}_t}{\partial c_\tau}\, \dfrac{\partial c_\tau}{\partial \theta} = \sum_{\tau=1}^{t} \dfrac{\partial \mathcal{L}_t}{\partial c_t}\, \dfrac{\partial c_t}{\partial c_\tau}\, \dfrac{\partial c_\tau}{\partial \theta}$
Rescaling for one timestep ($\partial \mathcal{L}_t / \partial \theta$) affects all timesteps
- The rescaling factor for one timestep does not work for another
More intuitively
$\dfrac{\partial \mathcal{L}}{\partial \theta} = \dfrac{\partial \mathcal{L}_1}{\partial \theta} + \dfrac{\partial \mathcal{L}_2}{\partial \theta} + \dfrac{\partial \mathcal{L}_3}{\partial \theta} + \dfrac{\partial \mathcal{L}_4}{\partial \theta} + \dfrac{\partial \mathcal{L}_5}{\partial \theta}$
[Figure: the five per-timestep gradient terms contributing to the total gradient.]
More intuitively
Let's say $\dfrac{\partial \mathcal{L}_1}{\partial \theta} \propto 1$, $\dfrac{\partial \mathcal{L}_2}{\partial \theta} \propto 1/10$, $\dfrac{\partial \mathcal{L}_3}{\partial \theta} \propto 1/100$, $\dfrac{\partial \mathcal{L}_4}{\partial \theta} \propto 1/1000$, $\dfrac{\partial \mathcal{L}_5}{\partial \theta} \propto 1/10000$
$\dfrac{\partial \mathcal{L}}{\partial \theta} = \sum_\tau \dfrac{\partial \mathcal{L}_\tau}{\partial \theta} = 1.1111$
If $\dfrac{\partial \mathcal{L}}{\partial \theta}$ is rescaled to 1, then $\dfrac{\partial \mathcal{L}_5}{\partial \theta} \propto 10^{-5}$
Longer-term dependencies negligible
- Weak recurrent modelling
- Learning focuses on the short-term only
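The arithmetic of this example, checked numerically:

```python
import numpy as np

terms = np.array([1, 1/10, 1/100, 1/1000, 1/10000])  # per-timestep gradient magnitudes
total = terms.sum()
print(total)               # 1.1111
rescaled = terms / total   # rescale so the total gradient has magnitude 1
print(rescaled[-1])        # ~9e-5: the long-term contribution stays negligible
```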
Recurrent networks → chaotic systems
Fixing vanishing gradients
Regularization on the recurrent weights
- Force the error signal not to vanish
$\Omega = \sum_t \Omega_t = \sum_t \left( \dfrac{\left\lVert \dfrac{\partial \mathcal{L}}{\partial c_{t+1}} \dfrac{\partial c_{t+1}}{\partial c_t} \right\rVert}{\left\lVert \dfrac{\partial \mathcal{L}}{\partial c_{t+1}} \right\rVert} - 1 \right)^2$
Advanced recurrent modules
Long Short-Term Memory (LSTM) module
Gated Recurrent Unit (GRU) module
Pascanu et al., On the difficulty of training Recurrent Neural Networks, 2013
Advanced Recurrent Nets
[Figure: the LSTM cell. Inputs $x_t$ and $m_{t-1}$ pass through three sigmoid gates and a tanh; the cell state flows from $c_{t-1}$ to $c_t$ through an elementwise multiplication and an addition, and the output $m_t$ is produced through a final tanh.]
How to fix the vanishing gradients?
The error signal over time must have a norm that is neither too large nor too small
Let's have a look at the loss function:
$\dfrac{\partial \mathcal{L}_t}{\partial \theta} = \sum_{\tau=1}^{t} \dfrac{\partial \mathcal{L}_t}{\partial y_t}\, \dfrac{\partial y_t}{\partial c_t}\, \dfrac{\partial c_t}{\partial c_\tau}\, \dfrac{\partial c_\tau}{\partial \theta}$, where $\dfrac{\partial c_t}{\partial c_\tau} = \prod_{t \ge k \ge \tau} \dfrac{\partial c_k}{\partial c_{k-1}}$
Solution: have an activation function with gradient equal to 1
- The identity function has a gradient equal to 1
By doing so, gradients become neither too small nor too large
Long Short-Term Memory (LSTM): a beefed-up RNN
$i = \sigma(x_t U^{(i)} + m_{t-1} W^{(i)})$
$f = \sigma(x_t U^{(f)} + m_{t-1} W^{(f)})$
$o = \sigma(x_t U^{(o)} + m_{t-1} W^{(o)})$
$\tilde{c}_t = \tanh(x_t U^{(c)} + m_{t-1} W^{(c)})$
$c_t = c_{t-1} \odot f + \tilde{c}_t \odot i$
$m_t = \tanh(c_t) \odot o$
[Figure: the LSTM cell diagram with gates $f_t$, $i_t$, $o_t$, candidate memory $\tilde{c}_t$, the cell-state line from $c_{t-1}$ to $c_t$, and output $m_t$.]
More info at: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Compare with the simple RNN: $c_t = W \cdot \tanh(c_{t-1}) + U \cdot x_t + b$
In the LSTM:
- The previous state $c_{t-1}$ is connected to the new $c_t$ with no nonlinearity (identity function)
- The only other factor is the forget gate $f$, which rescales the previous LSTM state
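A minimal NumPy sketch of one LSTM step following the equations above; the dimensions and random parameters are placeholders chosen for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, m_prev, c_prev, params):
    """One LSTM step: gates i, f, o, candidate c_tilde, new cell c_t, new memory m_t."""
    Ui, Wi, Uf, Wf, Uo, Wo, Uc, Wc = params
    i = sigmoid(x_t @ Ui + m_prev @ Wi)
    f = sigmoid(x_t @ Uf + m_prev @ Wf)
    o = sigmoid(x_t @ Uo + m_prev @ Wo)
    c_tilde = np.tanh(x_t @ Uc + m_prev @ Wc)
    c_t = c_prev * f + c_tilde * i          # elementwise (Hadamard) products
    m_t = np.tanh(c_t) * o
    return m_t, c_t

d_in, d_hid = 5, 3                          # made-up sizes
rng = np.random.default_rng(0)
params = [rng.normal(size=s) for s in [(d_in, d_hid), (d_hid, d_hid)] * 4]
m, c = lstm_step(rng.normal(size=d_in), np.zeros(d_hid), np.zeros(d_hid), params)
print(m.shape, c.shape)                     # (3,) (3,)
```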
Cell state
The cell state carries the essential information over time
[Figure: the cell-state line running across the top of the LSTM cell, from $c_{t-1}$ to $c_t$, modified only by the forget-gate multiplication and the input addition.]
LSTM nonlinearities
$\sigma \in (0, 1)$: control gate, something like a switch
$\tanh \in (-1, 1)$: recurrent nonlinearity
LSTM Step-by-Step: Step (1)
E.g. LSTM on "Yesterday she slapped me. Today she loves me."
Decide what to forget and what to remember for the new memory
- Sigmoid 1: remember everything
- Sigmoid 0: forget everything
$f_t = \sigma(x_t U^{(f)} + m_{t-1} W^{(f)})$
LSTM Step-by-Step: Step (2)
Decide what new information is relevant from the new input and should be added to the new memory
- Modulate the input: $i_t = \sigma(x_t U^{(i)} + m_{t-1} W^{(i)})$
- Generate candidate memories: $\tilde{c}_t = \tanh(x_t U^{(c)} + m_{t-1} W^{(c)})$
LSTM Step-by-Step: Step (3)
Compute and update the current cell state $c_t$, which depends on:
- The previous cell state
- What we decide to forget
- What inputs we allow
- The candidate memories
$c_t = c_{t-1} \odot f_t + \tilde{c}_t \odot i_t$
LSTM Step-by-Step: Step (4)
Modulate the output
- Is the new cell state relevant? If so, sigmoid 1
- If not, sigmoid 0
Generate the new memory
$o_t = \sigma(x_t U^{(o)} + m_{t-1} W^{(o)})$
$m_t = \tanh(c_t) \odot o_t$
LSTM Unrolled Network
Macroscopically very similar to standard RNNs
The engine is a bit different (more complicated)
- Because of their gates, LSTMs capture long- and short-term dependencies
[Figure: three LSTM cells unrolled over time, each with its gates and tanh units, chained through the cell state and the output.]
Beyond RNN & LSTM
LSTM with peephole connections
- Gates also have access to the previous cell state $c_{t-1}$ (not only the memories)
Coupled forget and input gates: $c_t = f_t \odot c_{t-1} + (1 - f_t) \odot \tilde{c}_t$
Bi-directional recurrent networks
Gated Recurrent Units (GRU) (a sketch follows this list)
Deep recurrent architectures
Recursive neural networks
- Tree-structured
Multiplicative interactions
Generative recurrent architectures
[Figure: a deep recurrent architecture with two stacked LSTM layers, LSTM (1) and LSTM (2), unrolled over three timesteps.]
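For comparison, a sketch of one Gated Recurrent Unit step using the standard GRU equations (not taken from these slides); the dimensions and random parameters are placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, params):
    """One GRU step: update gate z, reset gate r, candidate h_tilde, new state h_t."""
    Uz, Wz, Ur, Wr, Uh, Wh = params
    z = sigmoid(x_t @ Uz + h_prev @ Wz)
    r = sigmoid(x_t @ Ur + h_prev @ Wr)
    h_tilde = np.tanh(x_t @ Uh + (r * h_prev) @ Wh)
    return (1 - z) * h_prev + z * h_tilde   # convex combination, akin to coupled gates

d_in, d_hid = 5, 3
rng = np.random.default_rng(0)
params = [rng.normal(size=s) for s in [(d_in, d_hid), (d_hid, d_hid)] * 3]
h = gru_step(rng.normal(size=d_in), np.zeros(d_hid), params)
print(h.shape)                              # (3,)
```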
Take-away message
Recurrent Neural Networks (RNN) for sequences
Backpropagation Through Time
Vanishing and Exploding Gradients and Remedies
RNNs using Long Short-Term Memory (LSTM)
Applications of Recurrent Neural Networks
Thank you!