Topics in AI (CPSC 532S) · lsigal/532S_2018W2/Lecture9.pdf · 2019-02-11
Lecture 9: RNNs (part 2) + Applications
Topics in AI (CPSC 532S): Multimodal Learning with Vision, Language and Sound
BackProp Through Time
Forward through the entire sequence to compute the loss, then backward through the entire sequence to compute the gradient
* slide from Fei-Fei Li, Justin Johnson, Serena Yeung, cs231n Stanford
Truncated BackProp Through Time
Run backwards and forwards through (fixed-length) chunks of the sequence, instead of the whole sequence
Truncated BackProp Through Time
Run backwards and forwards through (fixed-length) chunks of the sequence, instead of the whole sequence. Carry hidden states forward in time, but only backprop through some smaller number of steps.
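The chunking pattern can be sketched in a few lines of numpy (a minimal sketch; `rnn_step`, `truncated_chunks`, and all shapes are our illustrative choices, not from the slides): the hidden state crosses chunk boundaries, while backprop would run only within each chunk.

```python
import numpy as np

def rnn_step(h, x, Whh, Wxh):
    """One vanilla RNN step (biases omitted for brevity)."""
    return np.tanh(Whh @ h + Wxh @ x)

def truncated_chunks(seq_len, chunk):
    """Yield (start, end) pairs covering the sequence in fixed-length chunks."""
    for s in range(0, seq_len, chunk):
        yield s, min(s + chunk, seq_len)

rng = np.random.default_rng(0)
H, D, T = 4, 3, 10
Whh = rng.normal(size=(H, H)) * 0.1
Wxh = rng.normal(size=(H, D)) * 0.1
xs = rng.normal(size=(T, D))

h = np.zeros(H)
for s, e in truncated_chunks(T, chunk=4):
    # forward through one chunk; backprop would run only over steps s..e,
    # treating the incoming h as a constant ("detached")
    for t in range(s, e):
        h = rnn_step(h, xs[t], Whh, Wxh)
```

In a framework this corresponds to detaching the hidden state from the computation graph at each chunk boundary.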
Truncated BackProp Through Time
Run backwards and forwards through (fixed-length) chunks of the sequence, instead of the whole sequence
Implementation: Relatively Easy
… you will have a chance to experience this in Assignment 3
Learning to Write Like Shakespeare
x → RNN → y
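The x → RNN → y loop on this slide is a character-level language model sampled one character at a time. A toy numpy sketch (random, untrained weights, so the output is gibberish by design; all names and sizes are ours):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(0)
V, H = 4, 8                       # tiny vocabulary and hidden size
Wxh = rng.normal(size=(H, V)) * 0.1
Whh = rng.normal(size=(H, H)) * 0.1
Why = rng.normal(size=(V, H)) * 0.1

h = np.zeros(H)
idx = 0                           # start from an arbitrary character index
out = []
for _ in range(10):
    x = np.eye(V)[idx]            # one-hot encode the previous character
    h = np.tanh(Wxh @ x + Whh @ h)
    p = softmax(Why @ h)          # distribution over the next character
    idx = rng.choice(V, p=p)      # sample, then feed back in
    out.append(int(idx))
```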
Learning to Write Like Shakespeare … at first → train more → train more → train more → after training a bit
Learning to Write Like Shakespeare … after training
Learning Code
Trained on the entire source code of the Linux kernel
DopeLearning: Computational Approach to Rap Lyrics
[ Malmi et al., KDD 2016 ]
Sunspring: First movie generated by AI
Multilayer RNNs
(diagram: layers stacked in depth, unrolled in time)
Vanilla RNN Gradient Flow
(diagram: h_{t-1} and x_t are stacked, multiplied by W, and passed through tanh to produce h_t)
[ Bengio et al., 1994 ]
[ Pascanu et al., ICML 2013 ]
Vanilla RNN Gradient Flow
(diagram: h_{t-1} and x_t are stacked, multiplied by W, and passed through tanh to produce h_t)
[ Bengio et al., 1994 ]
[ Pascanu et al., ICML 2013 ]
Backpropagation from h_t to h_{t-1} multiplies by W (actually W_hh^T)
Vanilla RNN Gradient Flow [ Bengio et al., 1994 ]
[ Pascanu et al., ICML 2013 ]
(diagram: the chain h0 → h1 → h2 → h3 → h4; each step stacks h and x, multiplies by W, and applies tanh)
Computing gradient of h0 involves many factors of W (and repeated tanh)
Vanilla RNN Gradient Flow [ Bengio et al., 1994 ]
[ Pascanu et al., ICML 2013 ]
(diagram: the chain h0 → h1 → h2 → h3 → h4; each step stacks h and x, multiplies by W, and applies tanh)
Largest singular value > 1: Exploding gradients
Largest singular value < 1: Vanishing gradients
Computing gradient of h0 involves many factors of W (and repeated tanh)
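The singular-value condition is easy to verify numerically (an illustrative sketch, not from the slides): repeatedly multiplying a gradient by the same matrix shrinks it toward zero or blows it up, depending on whether the largest singular value is below or above 1.

```python
import numpy as np

def grad_norm_after(W, steps):
    """Norm of a gradient after being multiplied by W^T at every backward step."""
    g = np.ones(W.shape[0])
    for _ in range(steps):
        g = W.T @ g
    return np.linalg.norm(g)

shrink = 0.5 * np.eye(3)              # largest singular value 0.5 < 1
grow = 1.5 * np.eye(3)                # largest singular value 1.5 > 1

small = grad_norm_after(shrink, 20)   # vanishes toward zero
big = grad_norm_after(grow, 20)       # explodes
```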
Vanilla RNN Gradient Flow [ Bengio et al., 1994 ]
[ Pascanu et al., ICML 2013 ]
(diagram: the chain h0 → h1 → h2 → h3 → h4; each step stacks h and x, multiplies by W, and applies tanh)
Largest singular value > 1: Exploding gradients
Largest singular value < 1: Vanishing gradients
Computing gradient of h0 involves many factors of W (and repeated tanh)
Gradient clipping: Scale gradient if its norm is too big
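Gradient clipping by global norm can be sketched as follows (the helper name `clip_by_global_norm` is ours; frameworks ship equivalents):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by a common factor if their global norm exceeds max_norm."""
    total = np.sqrt(sum(np.sum(g * g) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

grads = [np.full(4, 3.0), np.full(4, 4.0)]        # global norm = sqrt(36 + 64) = 10
clipped, norm = clip_by_global_norm(grads, max_norm=5.0)
```

Here the global norm is 10, so with `max_norm=5` every gradient is scaled by 0.5; gradients under the threshold pass through unchanged.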
Vanilla RNN Gradient Flow [ Bengio et al., 1994 ]
[ Pascanu et al., ICML 2013 ]
(diagram: the chain h0 → h1 → h2 → h3 → h4; each step stacks h and x, multiplies by W, and applies tanh)
Largest singular value > 1: Exploding gradients
Largest singular value < 1: Vanishing gradients
Computing gradient of h0 involves many factors of W (and repeated tanh)
Solution: change the RNN architecture
Long Short-Term Memory (LSTM)
Vanilla RNN vs. LSTM
[ Hochreiter and Schmidhuber, NC 1997 ]
Long Short-Term Memory (LSTM)
Image Credit: Christopher Olah (http://colah.github.io/posts/2015-08-Understanding-LSTMs/) * slide from Dhruv Batra
Long Short-Term Memory (LSTM)
Cell state / memory
LSTM Intuition: Forget Gate
Should we continue to remember this “bit” of information or not?
Intuition: the memory and the forget-gate output multiply elementwise; the forget gate's output can be thought of as binary (0 or 1)
anything × 0 = 0 (forget)
anything × 1 = anything (remember)
LSTM Intuition: Input Gate
Should we update this “bit” of information or not? If yes, then what should we remember?
LSTM Intuition: Memory Update
Forget what needs to be forgotten + memorize what needs to be remembered
LSTM Intuition: Output Gate
Should we output this bit of information (e.g., to “deeper” LSTM layers)?
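Putting the forget, input, memory-update, and output gates together, one LSTM step can be sketched in numpy (the stacked weight layout and all names are our illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM step; W is (4H, D+H), rows stacked as [input, forget, output, candidate]."""
    H = h.shape[0]
    z = W @ np.concatenate([x, h]) + b
    i = sigmoid(z[0:H])           # input gate: should we update this "bit"?
    f = sigmoid(z[H:2 * H])       # forget gate: should we keep remembering it?
    o = sigmoid(z[2 * H:3 * H])   # output gate: should we expose it?
    g = np.tanh(z[3 * H:4 * H])   # candidate memory content
    c_new = f * c + i * g         # additive memory update
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
D, H = 3, 4
W = rng.normal(size=(4 * H, D + H)) * 0.1
b = np.zeros(4 * H)
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, b)
```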
LSTM Intuition: Additive Updates
Backpropagation from c_t to c_{t-1} involves only elementwise multiplication by f, no matrix multiply by W
Uninterrupted gradient flow!
LSTM Intuition: Additive Updates
Uninterrupted gradient flow!
LSTM Intuition: Additive Updates
(diagram: a ResNet-style stack, Input → 7x7 conv, 64 / 2 → Pool → repeated 3x3 conv layers with 64 and then 128 filters → Pool → FC 1000 → Softmax, with skip connections)
Similar to ResNet
LSTM Variants: with Peephole Connections
Lets the gates see the cell state / memory
LSTM Variants: with Coupled Gates
Only memorize new information when you’re forgetting old
Gated Recurrent Unit (GRU)
No explicit memory; memory = hidden output
z = memorize new and forget old
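A matching sketch of one GRU step (standard equations; weight names are ours), showing that there is no separate cell state and that the update gate z interpolates between the old hidden state and the candidate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h, Wz, Wr, Wh):
    xh = np.concatenate([x, h])
    z = sigmoid(Wz @ xh)                                # update gate: memorize new vs. keep old
    r = sigmoid(Wr @ xh)                                # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([x, r * h]))  # candidate state
    return (1 - z) * h + z * h_tilde                    # memory = hidden output

rng = np.random.default_rng(0)
D, H = 3, 4
Wz, Wr, Wh = [rng.normal(size=(H, D + H)) * 0.1 for _ in range(3)]
h = gru_step(rng.normal(size=D), np.zeros(H), Wz, Wr, Wh)
```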
RNNs: Review
Key Enablers:
— Parameter sharing in computational graphs
— “Unrolling” in computational graphs
— Allows modeling arbitrary length sequences!
RNNs: Review
x → RNN → y
Vanilla RNN
Key Enablers: — Parameter sharing in computational graphs
— “Unrolling” in computational graphs
— Allows modeling arbitrary length sequences!
RNNs: Review
x → RNN → y
Vanilla RNN: Vanishing or Exploding Gradients
Key Enablers: — Parameter sharing in computational graphs
— “Unrolling” in computational graphs
— Allows modeling arbitrary length sequences!
RNNs: Review
x → RNN → y
Vanilla RNN: Vanishing or Exploding Gradients
Long Short-Term Memory (LSTM): Uninterrupted gradient flow!
Key Enablers: — Parameter sharing in computational graphs
— “Unrolling” in computational graphs
— Allows modeling arbitrary length sequences!
RNNs: Review
Key Enablers:
— Parameter sharing in computational graphs
— “Unrolling” in computational graphs
— Allows modeling arbitrary length sequences!
x → RNN → y
Vanilla RNN: Vanishing or Exploding Gradients
Long Short-Term Memory (LSTM): Uninterrupted gradient flow!
Loss functions: often cross-entropy (for classification); could be max-margin (as in SVMs) or squared loss (regression)
LSTM/RNN Challenges
— LSTM can remember some history, but not too long
— LSTM assumes data is regularly sampled
Phased LSTM
Gates are controlled by phased (periodic) oscillations
[ Neil et al., 2016 ]
Bi-directional RNNs/LSTMs
$h_t = f_W(h_{t-1}, x_t)$
$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$
$y_t = W_{hy} h_t + b_y$
Bi-directional RNNs/LSTMs
$\overrightarrow{h}_t = f_{\overrightarrow{W}}(\overrightarrow{h}_{t-1}, x_t), \quad \overrightarrow{h}_t = \tanh(\overrightarrow{W}_{hh}\,\overrightarrow{h}_{t-1} + \overrightarrow{W}_{xh}\,x_t + \overrightarrow{b}_h)$
$\overleftarrow{h}_t = f_{\overleftarrow{W}}(\overleftarrow{h}_{t+1}, x_t), \quad \overleftarrow{h}_t = \tanh(\overleftarrow{W}_{hh}\,\overleftarrow{h}_{t+1} + \overleftarrow{W}_{xh}\,x_t + \overleftarrow{b}_h)$
$y_t = W_{hy}\,[\overrightarrow{h}_t, \overleftarrow{h}_t]^T + b_y$
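A runnable sketch of the bi-directional equations on this slide (all names and sizes are ours): run one vanilla RNN left-to-right, another right-to-left, and concatenate the two states per timestep before the output projection.

```python
import numpy as np

def run_direction(xs, Whh, Wxh, bh):
    """Run a vanilla RNN over xs in the given order, returning all hidden states."""
    h = np.zeros(Whh.shape[0])
    hs = []
    for x in xs:
        h = np.tanh(Whh @ h + Wxh @ x + bh)
        hs.append(h)
    return hs

rng = np.random.default_rng(0)
D, H, T = 3, 4, 5
fW, fX, fb = rng.normal(size=(H, H)) * 0.1, rng.normal(size=(H, D)) * 0.1, np.zeros(H)
bW, bX, bb = rng.normal(size=(H, H)) * 0.1, rng.normal(size=(H, D)) * 0.1, np.zeros(H)
xs = [rng.normal(size=D) for _ in range(T)]

fwd = run_direction(xs, fW, fX, fb)              # forward states: depend on x_1..x_t
bwd = run_direction(xs[::-1], bW, bX, bb)[::-1]  # backward states: depend on x_t..x_T
concat = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]  # input to the W_hy projection
```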
Attention Mechanisms and RNNs
Consider a translation task: this is one of the first neural translation models
English Encoder → Summary Vector → French Decoder
https://devblogs.nvidia.com/introduction-neural-machine-translation-gpus-part-3/
Attention Mechanisms and RNNs
Consider a translation task with a bi-directional encoder of the source language
English Encoder
Attention Mechanisms and RNNs
Consider a translation task with a bi-directional encoder of the source language
English Encoder → French Decoder
Attention Mechanisms and RNNs
Consider a translation task with a bi-directional encoder of the source language
French Decoder
Build a small neural network (one layer) with softmax output that takes (1) everything decoded so far (encoded by the previous decoder state z_i) and (2) the encoding of the current source word (the encoder hidden state h_j), and predicts the relevance of every source word to the next translated word
[ Cho et al., 2015 ]
Attention Mechanisms and RNNs
Consider a translation task with a bi-directional encoder of the source language
French Decoder
Build a small neural network (one layer) with softmax output that takes (1) everything decoded so far (encoded by the previous decoder state z_i) and (2) the encoding of the current source word (the encoder hidden state h_j), and predicts the relevance of every source word to the next translated word
$c_i = \sum_{j=1}^{T} \alpha_j h_j$
[ Cho et al., 2015 ]
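A sketch of this attention step in numpy (a one-layer additive scorer; the weight names Wa, Ua, va are our illustrative choices): score every encoder state h_j against the previous decoder state z, softmax the scores into relevances alpha, and form the context c as the weighted sum c = sum_j alpha_j h_j.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def attention_context(z, hs, Wa, Ua, va):
    # score each encoder state h_j against the previous decoder state z
    scores = np.array([va @ np.tanh(Wa @ z + Ua @ h) for h in hs])
    alpha = softmax(scores)                    # relevance of each source word
    c = sum(a * h for a, h in zip(alpha, hs))  # weighted summary vector
    return c, alpha

rng = np.random.default_rng(0)
Hd, He, A, T = 4, 6, 5, 3    # decoder dim, encoder dim, attention dim, source length
Wa = rng.normal(size=(A, Hd))
Ua = rng.normal(size=(A, He))
va = rng.normal(size=A)
z = rng.normal(size=Hd)
hs = [rng.normal(size=He) for _ in range(T)]
c, alpha = attention_context(z, hs, Wa, Ua, va)
```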
Attention Mechanisms and RNNs [ Cho et al., 2015 ]
Regularization in RNNs
Standard dropout in recurrent layers does not work because it causes loss of long term memory!
* slide from Marco Pedersoli and Thomas Lucas
Regularization in RNNs
Standard dropout in recurrent layers does not work because it causes loss of long term memory!
— Dropout in input-to-hidden or hidden-to-output layers
— Apply dropout at sequence level (same zeroed units for the entire sequence)
— Dropout only at the cell update (for LSTM and GRU units)
— Enforce the norm of the hidden state to be similar along time
— Zoneout some hidden units (copy their state to the next timestep)
[ Zaremba et al., 2014 ]
[ Gal, 2016 ]
[ Semeniuta et al., 2016 ]
[ Krueger & Memisevic, 2016 ]
[ Krueger et al., 2016 ]
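The sequence-level dropout variant above can be sketched in two lines (an illustrative sketch): sample one mask per sequence and reuse it at every timestep, instead of resampling per step as standard dropout would.

```python
import numpy as np

rng = np.random.default_rng(0)
H, T, p = 6, 4, 0.5
mask = (rng.random(H) > p) / (1 - p)   # one mask for the whole sequence (inverted dropout scaling)

hs = [np.ones(H) for _ in range(T)]
dropped = [h * mask for h in hs]       # same zeroed units at every timestep
```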
Teacher Forcing
(diagram: softmax outputs at each timestep; sampled characters “e” “l” “l” “o”)
Training Objective: Predict the next word (cross-entropy loss)
Testing: Sample the full sequence
Teacher Forcing
(diagram: softmax outputs at each timestep; sampled characters “e” “l” “l” “o”)
Training Objective: Predict the next word (cross-entropy loss)
Testing: Sample the full sequence
Training and testing objectives are not consistent!
Teacher Forcing
Slowly move from Teacher Forcing to Sampling
Probability of sampling from the ground truth
[ Bengio et al., 2015 ]
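Scheduled sampling can be sketched as follows (the inverse-sigmoid schedule matches the decay named on the next slide; the decay constant k and the function names are ours): early in training the ground-truth token is fed almost always, and later the model's own samples take over.

```python
import numpy as np

def inverse_sigmoid_decay(step, k=100.0):
    """Probability of feeding the ground-truth token at a given training step."""
    return k / (k + np.exp(step / k))

rng = np.random.default_rng(0)

def choose_input(gt_token, model_token, step):
    """Teacher forcing with probability eps(step); otherwise use the model's own sample."""
    eps = inverse_sigmoid_decay(step)
    return gt_token if rng.random() < eps else model_token
```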
Baseline: Google NIC captioning model
Baseline with Dropout: Regularized RNN version
Always sampling: Use sampling from the beginning of training
Scheduled sampling: Sampling with inverse Sigmoid decay
Uniform scheduled sampling: Scheduled sampling but with a uniform schedule
Teacher Forcing
Sequence Level Training
During training the objective is different than at test time:
— Training: generate the next word given the previous
— Test: generate the entire sequence given an initial state
Directly optimize the evaluation metric (e.g., BLEU score for sentence generation)
Cast the problem as Reinforcement Learning:
— RNN is an Agent
— Policy is defined by the learned parameters
— Action is the selection of the next word based on the policy
— Reward is the evaluation metric
[ Ranzato et al., 2016 ]
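A toy sketch of this RL framing (all names are ours; the constant reward stands in for a real metric such as BLEU): the RNN's softmax defines the policy, sampling a word is an action, and the REINFORCE gradient scales each log-probability gradient by the sequence reward.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(0)
vocab, T = 5, 3
logits = rng.normal(size=(T, vocab))   # stand-in for per-step RNN outputs

# policy: sample an action (next word) at each step
probs = [softmax(l) for l in logits]
sample = [int(rng.choice(vocab, p=p)) for p in probs]

reward = 1.0  # sequence-level metric of the completed sample (stand-in)
# REINFORCE: gradient of log pi(a_t) w.r.t. logits, scaled by the reward
grads = []
for p, a in zip(probs, sample):
    g = -p.copy()
    g[a] += 1.0          # d log softmax / d logits = onehot(a) - p
    grads.append(reward * g)
```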