Why Does LSTM Outperform RNN?
Xintao Huan
Outline
• Quick Review of RNN
• Investigate RNN
• Key Issue
• Solution - LSTM
Quick Review of RNN
A Recurrent Neural Network (RNN) and the unfolding in time of the computation involved in its forward pass.
Investigate RNN
Whh: weight between hidden layers
Wxh: weight between input layer and hidden layer
Why: weight between hidden layer and output layer
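With these names, the forward computation can be written as follows (a standard vanilla-RNN formulation; the slides do not show the equations explicitly, and bias terms are omitted):

```latex
h_t = \tanh\left( W_{xh}\, x_t + W_{hh}\, h_{t-1} \right), \qquad
y_t = W_{hy}\, h_t
```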
Investigate RNN
In the computation of the hidden layer, the weight matrix W is shared across all time-steps!
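A minimal NumPy sketch of the unrolled forward pass (illustrative only; the tanh nonlinearity and shapes are standard assumptions, not taken from the slides). Note that the same three matrices are reused at every time-step:

```python
import numpy as np

def rnn_forward(xs, h0, Wxh, Whh, Why):
    """Unroll a vanilla RNN over a sequence.

    xs : list of input vectors, one per time-step
    h0 : initial hidden state
    The SAME Wxh, Whh, Why are used at every step -- this is the
    weight sharing highlighted on this slide.
    """
    h, hs, ys = h0, [], []
    for x in xs:
        h = np.tanh(Wxh @ x + Whh @ h)  # hidden update, shared weights
        hs.append(h)
        ys.append(Why @ h)              # per-step output
    return hs, ys
```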
Investigate RNN
In TRAINING, we compare the output y_t of each time-step with the reference result to obtain the per-step loss L_t,
and the sequence loss L is back-propagated from the last time-step to the first. Next,
we employ Stochastic Gradient Descent (SGD) to minimize the loss and update the parameters in W.
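In symbols (a standard formulation; the learning rate η is an added symbol, not from the slides, and ŷ_t denotes the reference result):

```latex
L = \sum_{t} L_t\!\left(y_t, \hat{y}_t\right), \qquad
W \leftarrow W - \eta\, \frac{\partial L}{\partial W}
```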
Investigate RNN
Forward through the entire sequence to compute the loss. Backward through the entire sequence to compute the gradients, then update the parameters in W.
Backpropagation Through Time (BPTT)
Key Issue
In every backpropagation step from h_t to h_{t-1}, the gradient is multiplied by W!
e.g. computing the gradient of h_0 involves many factors of W!
If the sequence is long enough and the largest singular value of W is greater than 1: exploding gradients! Smaller than 1: vanishing gradients!
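Concretely, applying the chain rule through the shared recurrence (assuming the tanh update above, with a_k = W_{xh} x_k + W_{hh} h_{k-1}):

```latex
\frac{\partial h_t}{\partial h_0}
  = \prod_{k=1}^{t} \frac{\partial h_k}{\partial h_{k-1}}
  = \prod_{k=1}^{t} \operatorname{diag}\!\left(\tanh'(a_k)\right) W_{hh}
```

The product contains t copies of W_{hh}, so the gradient norm grows or shrinks roughly like the t-th power of its largest singular value.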
Key Issue
Exploding gradients:
Employ gradient clipping to rescale the gradient when its norm exceeds a threshold (see the sketch below).
Vanishing gradients:
Change the RNN architecture!
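A minimal sketch of gradient clipping by global norm (a common variant; the slides only say the gradient is scaled or cut, so the norm-based rescaling and the threshold value are assumptions):

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so that their global L2 norm
    does not exceed max_norm (the threshold is an illustrative choice)."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads
```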
Solution - LSTM
With the cell state, information runs straight down the entire chain, with only some minor linear interactions (elementwise operations, NOT matrix multiplications as in the RNN).
It’s very easy for information to just flow along it unchanged.
• input gate
• forget gate
• output gate
• update gate
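A minimal sketch of one LSTM time-step (the standard formulation from Understanding LSTMs, reference 1; the sigmoid/tanh choices and variable names are the usual ones, not spelled out on the slide). The cell state c is changed only by elementwise multiply and add, never by a matrix multiplication:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time-step. W maps [h_prev; x] to the four gate
    pre-activations stacked together; b is the matching bias."""
    z = W @ np.concatenate([h_prev, x]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input/forget/output gates
    g = np.tanh(g)                                # candidate update
    c = f * c_prev + i * g   # cell state: only elementwise ops, no matmul
    h = o * np.tanh(c)       # hidden state passed to the next time-step
    return h, c
```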
Solution - LSTM
[Figure: gradient flow through an RNN vs. through an LSTM's cell state]
Vanishing gradients SOLVED!
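The reason, in one line: the cell-state update c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t is additive, so (treating the gate values as constants) the step-to-step Jacobian along the cell state is

```latex
\frac{\partial c_t}{\partial c_{t-1}} = \operatorname{diag}\!\left(f_t\right)
```

an elementwise rescaling the network can learn to keep near 1, instead of a repeated multiplication by the same W.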
References
1. Understanding LSTM http://colah.github.io/posts/2015-08-Understanding-LSTMs/
2. The Unreasonable Effectiveness of Recurrent Neural Networks http://karpathy.github.io/2015/05/21/rnn-effectiveness/
3. An Empirical Exploration of Recurrent Network Architectures http://proceedings.mlr.press/v37/jozefowicz15.pdf
4. A Critical Review of Recurrent Neural Networks for Sequence Learning https://arxiv.org/pdf/1506.00019.pdf
5. https://www.zhihu.com/question/44895610
6. https://my.oschina.net/u/2719468/blog/662099
7. https://yq.aliyun.com/articles/574218
8. https://juejin.im/post/59ae29d36fb9a024966cac99
Thanks for your attention!