[Page 1]
Deep Learning Tutorial
李宏毅 Hung-yi Lee
[Page 2]
Deep learning attracts lots of attention.
• I believe you have seen lots of exciting results before.
This talk focuses on the basic techniques.
Deep learning trends at Google. Source: SIGMOD/Jeff Dean
[Page 3]
Outline
Lecture I: Introduction of Deep Learning
Lecture II: Tips for Training Deep Neural Network
Lecture III: Variants of Neural Network
Lecture IV: Next Wave
[Page 4]
Lecture I: Introduction of Deep Learning
[Page 5]
Three Steps for Deep Learning
Step 1: Network Structure
Step 2: Learning Target
Step 3: Learn!
(The brain you are born with — http://onepiece1234567890.blogspot.tw/2013/12/blog-post_8.html)
[Page 6]
Three Steps for Deep Learning
Step 1: Network Structure → a set of functions
Step 2: Learning Target → define the goodness of a function, based on training data
Step 3: Learn! → pick the best function f*
[Page 7]
Three Steps for Deep Learning
Step 1: Network Structure
Step 2: Learning Target
Step 3: Learn!
The learned function f* can do many tasks:
• Handwritten Recognition: f*( image ) = "2"
• Speech Recognition: f*( audio ) = "你好" (hello)
• Playing Go: f*( board ) = "5-5" (next step)
• Dialogue System: f*( "Hello" ) = "Hi" (what the user said → system response)
[Page 8]
Three Steps for Deep Learning
Step 1: Network Structure
Step 2: Learning Target
Step 3: Learn!
Deep Learning is so simple ……
[Page 9]
Three Steps for Deep Learning
Step 1: Network Structure → a set of functions
Step 2: Learning Target → define the goodness of a function
Step 3: Learn! → pick the best function f*
[Page 10]
Human Brains
[Page 11]
Neural Network
A neuron is a simple function. Given inputs a_1, …, a_k, …, a_K, weights w_1, …, w_k, …, w_K, and a bias b, it computes
z = a_1 w_1 + ⋯ + a_k w_k + ⋯ + a_K w_K + b,  a = σ(z),
where σ(·) is the activation function and a is the neuron's output.
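To make this concrete, here is a minimal NumPy sketch of a single sigmoid neuron; the numbers reuse the worked example on the following pages (inputs (1, −1), weights (1, −2), bias 1):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def neuron(a, w, b):
    # z = a_1 w_1 + ... + a_K w_K + b, output = sigma(z)
    return sigmoid(np.dot(a, w) + b)

print(neuron(np.array([1.0, -1.0]), np.array([1.0, -2.0]), 1.0))  # sigma(4) ~ 0.98
```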
[Page 12]
Neural Network
A worked example of one neuron, using the sigmoid function σ(z) = 1/(1 + e^(−z)) as the activation function:
inputs (1, −1), weights (1, −2), bias 1 → z = 1×1 + (−1)×(−2) + 1 = 4 → σ(4) ≈ 0.98.
[Page 13]
Neural Network
Different connections lead to different network structures.
Each neuron can have different values for its weights and bias.
The weights and biases together are the network parameters θ.
[Page 14]
Fully Connected Feedforward Network
Sigmoid activation: σ(z) = 1/(1 + e^(−z)).
Example with input (1, −1):
• neuron 1: weights (1, −2), bias 1 → z = 1×1 + (−1)×(−2) + 1 = 4 → σ(4) ≈ 0.98
• neuron 2: weights (−1, 1), bias 0 → z = 1×(−1) + (−1)×1 + 0 = −2 → σ(−2) ≈ 0.12
[Page 15]
Fully Connected Feedforward Network
Carrying the same computation through the following layers, the input (1, −1) becomes (0.98, 0.12) after layer 1, then (0.86, 0.11), and finally (0.62, 0.83) at the output.
[Page 16]
Fully Connected Feedforward Network
With input (0, 0) the same network gives (0.73, 0.50) after layer 1, then (0.72, 0.12), and (0.51, 0.85) at the output.
A network is a function: input vector → output vector.
f( (1, −1) ) = (0.62, 0.83)   f( (0, 0) ) = (0.51, 0.85)
Given the parameters θ, the network defines a function; given only the network structure, it defines a function set.
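As a sanity check, a short NumPy sketch of layer 1 of the worked example (only layer 1 is reproduced, since its weights — (1, −2) with bias 1, and (−1, 1) with bias 0 — are the ones the slides state explicitly):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = np.array([[1.0, -2.0],    # weights of neuron 1
               [-1.0, 1.0]])   # weights of neuron 2
b1 = np.array([1.0, 0.0])      # biases

def layer1(x):
    return sigmoid(W1 @ x + b1)

print(layer1(np.array([1.0, -1.0])))  # ~[0.98, 0.12]
print(layer1(np.array([0.0, 0.0])))   # ~[0.73, 0.50]
```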
[Page 17]
Fully Connected Feedforward Network
Input Layer (x_1, x_2, …, x_N) → Hidden Layers (Layer 1, Layer 2, …, Layer L) → Output Layer (y_1, y_2, …, y_M)
Each node is a neuron. "Deep" means many hidden layers.
[Page 18]
Ultra Deep Network
AlexNet (2012): 8 layers, 16.4% error
VGG (2014): 19 layers, 7.3% error
GoogleNet (2014): 22 layers, 6.7% error
http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf
[Page 19]
Ultra Deep Network
AlexNet (2012): 8 layers, 16.4% · VGG (2014): 19 layers, 7.3% · GoogleNet (2014): 22 layers, 6.7%
Residual Net (2015): 152 layers, 3.57% — taller than Taipei 101 (101 stories), for scale.
These ultra deep networks have a special structure. (Lecture IV)
[Page 20]
Fully Connected Feedforward Network — Matrix Operation
Each layer (x_1, …, x_N through to y_1, …, y_M) is a matrix operation:
a^1 = σ(W^1 x + b^1)
a^2 = σ(W^2 a^1 + b^2)
…
y = σ(W^L a^(L−1) + b^L)
[Page 21]
Fully Connected Feedforward Network — Matrix Operation
The whole network is a chain of matrix operations:
y = f(x) = σ(W^L … σ(W^2 σ(W^1 x + b^1) + b^2) … + b^L)
Use parallel computing techniques (e.g. GPU) to speed up the matrix operations.
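A sketch of this chain of matrix operations in NumPy (the layer sizes here are arbitrary placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # y = sigma(W^L ... sigma(W^2 sigma(W^1 x + b^1) + b^2) ... + b^L)
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

# Hypothetical 4 -> 3 -> 2 network, just to exercise the function
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 4)), rng.standard_normal((2, 3))]
biases = [np.zeros(3), np.zeros(2)]
print(forward(np.ones(4), weights, biases))
```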
[Page 22]
Output Layer (Option)
In an ordinary layer, each output is y_i = σ(z_i) independently. In general the outputs of the network can be any values, which may not be easy to interpret.
A softmax layer is therefore used as the output layer.
[Page 23]
Output Layer (Option)
• Softmax layer as the output layer:
y_i = e^(z_i) / Σ_{j=1}^{3} e^(z_j)
Worked example: z = (3, 1, −3) → e^z ≈ (20, 2.7, 0.05) → y ≈ (0.88, 0.12, ≈0).
The outputs can be interpreted as probabilities: 0 < y_i < 1 and Σ_i y_i = 1.
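A minimal softmax in NumPy reproducing the example above (subtracting the max is a standard numerical-stability trick, not shown on the slide; it does not change the result):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # stability trick; mathematically equivalent
    return e / e.sum()

print(softmax(np.array([3.0, 1.0, -3.0])))  # ~[0.88, 0.12, 0.00]
```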
[Page 24]
FAQ
• Q: How many layers? How many neurons for each layer?
• Q: Can the structure be automatically determined?
[Page 25]
Three Steps for Deep Learning
Step 1: Network Structure → a set of functions
Step 2: Learning Target → define the goodness of a function
Step 3: Learn! → pick the best function
[Page 26]
Example Application
Input: a 16 × 16 image, flattened to 256 values x_1, x_2, …, x_256 (ink → 1, no ink → 0).
Output: y_1, y_2, …, y_10; each dimension represents the confidence of a digit ("is 1", "is 2", …, "is 0").
Example output (0.1, 0.7, 0.2, …): the image is "2".
[Page 27]
Example Application
• Handwriting Digit Recognition: Machine( image ) = "2"
What is needed is a function with input a 256-dim vector and output a 10-dim vector.
[Page 28]
Fully Connected Feedforward Network
Input Layer (x_1, …, x_256) → Hidden Layers (Layer 1, …, Layer L) → Softmax Output Layer (y_1 "is 1", y_2 "is 2", …, y_10 "is 0") → "2"
The network structure defines a function set containing the candidates for handwriting digit recognition.
[Page 29]
Fully Connected Feedforward Network
Step 1 gives a function set containing the candidates for handwriting digit recognition.
Step 2: define the goodness of a function based on training data.
Step 3: pick the best function.
[Page 30]
Training Data
• Preparing training data: images and their labels, e.g. "5", "0", "4", "1", "9", "2", "1", "3".
The learning target is defined on the training data.
[Page 31]
Learning Target
Input: a 16 × 16 image (256 values, ink → 1, no ink → 0) fed through the network to a softmax layer with outputs y_1 ("is 1"), y_2 ("is 2"), …, y_10 ("is 0").
The learning target is:
• input image "1" → y_1 has the maximum value
• input image "2" → y_2 has the maximum value
[Page 32]
Loss
For a training image "1", the target is the one-hot vector (1, 0, 0, …). The loss ℓ measures how far the network output (after softmax) is from the target — it should be as close as possible. The loss can be the square error or the cross entropy between the network output and the target. A good function should make the loss on all examples as small as possible.
[Page 33]
Total Loss
For all R training examples x^1, …, x^R with losses ℓ^1, …, ℓ^R, the total loss is
L = Σ_{r=1}^{R} ℓ^r
Find the network parameters θ* that minimize the total loss L — that is, find the function in the function set that minimizes L, making it as small as possible.
[Page 34]
Three Steps for Deep Learning
Step 1: Network Structure → a set of functions
Step 2: Learning Target → define the goodness of a function
Step 3: Learn! → pick the best function
[Page 35]
How to pick the best function
Find the network parameters θ* that minimize the total loss L.
Network parameters θ = {w_1, w_2, w_3, …, b_1, b_2, b_3, …}
Enumerate all possible values? Impossible: e.g. for speech recognition, a network with 8 layers and 1000 neurons per layer has 1000 × 1000 = 10^6 weights between each pair of consecutive layers — millions of parameters in total.
[Page 36]
Gradient Descent
Find the network parameters θ* that minimize the total loss L; θ = {w_1, w_2, …, b_1, b_2, …}.
Consider a single parameter w. Pick an initial value for w: random, or pre-trained (e.g. by RBM) — random is usually good enough.
[Page 37]
Gradient Descent
Compute ∂L/∂w, the slope of the tangent line (切線斜率) at w:
• positive slope → decrease w
• negative slope → increase w
http://chico386.pixnet.net/album/photo/171572850
[Page 38]
Gradient Descent
Move w by −η·∂L/∂w and repeat:
w ← w − η·∂L/∂w
η is called the "learning rate".
[Page 39]
Gradient Descent
Repeat w ← w − η·∂L/∂w until ∂L/∂w is approximately zero (when the update becomes negligible).
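A sketch of this loop on a toy one-parameter loss (the quadratic L(w) = (w − 3)² and the learning rate are assumptions for illustration):

```python
eta = 0.1   # learning rate
w = 0.0     # pick an initial value for w

for step in range(1000):
    grad = 2 * (w - 3)      # dL/dw for L(w) = (w - 3)^2
    if abs(grad) < 1e-6:    # stop when the update is negligible
        break
    w = w - eta * grad      # w <- w - eta * dL/dw

print(w)  # ~3.0, the minimizer of L
```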
[Page 40]
Gradient Descent
All parameters are updated together. Collect the partial derivatives into the gradient
∇L = ( ∂L/∂w_1, ∂L/∂w_2, …, ∂L/∂b_1, … )ᵀ
Example: θ = (w_1, w_2, b_1, …) = (0.2, −0.1, 0.3, …); adding (−η·∂L/∂w_1, −η·∂L/∂w_2, −η·∂L/∂b_1, …) gives (0.15, 0.05, 0.2, …).
[Page 41]
Gradient Descent
The update is repeated, each time computing the current ∂L/∂w and adding −η·∂L/∂w:
w_1: 0.2 → 0.15 → 0.09; w_2: −0.1 → 0.05 → 0.15; b_1: 0.3 → 0.2 → 0.10; …
[Page 42]
Gradient Descent
In the (w_1, w_2) plane (color: value of the total loss L), randomly pick a starting point.
[Page 43]
Gradient Descent
Compute (∂L/∂w_1, ∂L/∂w_2) and move by (−η·∂L/∂w_1, −η·∂L/∂w_2). Hopefully, we reach a minimum… (color: value of the total loss L)
[Page 44]
Gradient Descent
• When considering multiple parameters together, do you see any problem?
[Page 45]
Local Minima
• Gradient descent never guarantees the global minimum.
Different initial points can reach different minima, so different results.
"Who is Afraid of Non-Convex Loss Functions?" http://videolectures.net/eml07_lecun_wia/
[Page 46]
Gradient Descent
Imagine you are playing Age of Empires: the places you have not explored are covered by the fog of war. Likewise, we only see the local slope — compute (∂L/∂w_1, ∂L/∂w_2) and move by (−η·∂L/∂w_1, −η·∂L/∂w_2).
[Page 47]
Gradient Descent
This is the "learning" of machines in deep learning …… Even AlphaGo uses this approach.
Everyone thinks Learning is …… but actually Learning is just ……
I hope you are not too disappointed :p
[Page 48]
Backpropagation
• Backpropagation: an efficient way to compute ∂L/∂w.
• Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html
Don't worry about ∂L/∂w — the toolkits will handle it, e.g. libdnn (developed by NTU student 周伯威).
[Page 49]
Three Steps for Deep Learning
Step 1: Network Structure → a set of functions
Step 2: Learning Target → define the goodness of a function
Step 3: Learn! → pick the best function
With these three steps, you can do lots of different things.
[Page 50]
For example, you can do …
• Image Recognition: Network( image ) → "monkey" / "cat" / "dog"
[Page 51]
For example, you can do …
• Spam filtering: Network( e-mail features, e.g. whether "free" or "Talk" appears in the e-mail ) → 1 (Yes, spam) / 0 (No)
(http://spam-filter-review.toptenreviews.com/)
[Page 52]
• Document Classification: Network( document features, e.g. whether "president" or "stock" appears in the document ) → politics (政治) / sports (體育) / finance (財經)
http://top-breaking-news.com/
[Page 53]
Playing Go
Network( 19 × 19 vector: black → 1, white → −1, none → 0 ) → next move (one of the 19 × 19 positions)
A fully connected feedforward network can be used, but a CNN performs much better — the board is really a 19 × 19 image (matrix). (Lecture III)
[Page 54]
Playing Go
Training: collect a pile of game records. For each board position, the target is a one-hot 19 × 19 vector: e.g. tengen (天元, the center point) = 1 and everything else 0, or the 5-5 point (五之5) = 1 and everything else 0.
[Page 55]
Concluding Remarks
• Deep Learning is simple & powerful!
But Deep Learning is like Thor's hammer: it cannot be lifted easily ……
http://ent.ltn.com.tw/news/breakingnews/1144545
[Page 56]
Lecture II: Tips for Training DNN
[Page 57]
Outline of Lecture II
“Hello World” for Deep Learning
Recipe of Deep Learning
[Page 58]
Keras
Keras is an interface of TensorFlow or Theano.
• TensorFlow / Theano: very flexible, but need some effort to learn.
• Keras: easy to learn and use (still has some flexibility); you can modify it if you can write TensorFlow or Theano.
If you want to learn Theano:
http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Theano%20DNN.ecm.mp4/index.html
http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/RNN%20training%20(v6).ecm.mp4/index.html
[Page 59]
Keras
• François Chollet is the author of Keras. He currently works for Google as a deep learning engineer and researcher.
• Keras means horn in Greek.
• Documentation: http://keras.io/
• Examples: https://github.com/fchollet/keras/tree/master/examples
[Page 60]
Impressions of using Keras (使用 Keras 心得)
Thanks to classmate 沈昇勳 for providing the figure.
[Page 61]
Example Application
• Handwriting Digit Recognition: Machine( 28 × 28 image ) = "1"
The "hello world" of deep learning. MNIST data: http://yann.lecun.com/exdb/mnist/
Keras provides a data set loading function: http://keras.io/datasets/
[Page 62]
Keras
Network structure for MNIST: input 28 × 28 = 784 → fully connected hidden layer with 500 neurons → another hidden layer with 500 neurons → softmax output layer y_1, …, y_10.
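Pages 63–64 of the original deck show this model as code screenshots; below is a sketch of the equivalent model in the current Keras API (the slide-era Keras syntax used output_dim instead of units, but the structure is the same):

```python
from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(500, input_dim=28 * 28))  # 784 -> 500
model.add(Activation('sigmoid'))
model.add(Dense(500))                     # 500 -> 500
model.add(Activation('sigmoid'))
model.add(Dense(10))                      # 500 -> 10
model.add(Activation('softmax'))
```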
[Page 63]
Keras
[Page 64]
Keras
Step 3.1: Configuration — choose the loss, the optimizer, and the learning rate, e.g. w ← w − η·∂L/∂w with η = 0.1 (more on this later).
Step 3.2: Find the optimal network parameters from the training data (images) and labels (digits).
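A sketch of these two steps (the loss and optimizer choices follow the recipe discussed later in this lecture; x_train and y_train are the arrays described on the next page):

```python
# Step 3.1: configuration - loss, optimizer (and its learning rate)
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# Step 3.2: find the optimal network parameters from images and labels
model.fit(x_train, y_train, batch_size=100, epochs=20)
```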
[Page 65]
Keras
Step 3.2: Find the optimal network parameters.
x_train: a numpy array of shape (number of training examples) × (28 × 28 = 784).
y_train: a numpy array of shape (number of training examples) × 10 (one-hot labels).
https://www.tensorflow.org/versions/r0.8/tutorials/mnist/beginners/index.html
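A sketch of preparing those two arrays with the Keras MNIST loader mentioned earlier (scaling pixels to [0, 1] is a common convention, not stated on the slide; in older Keras, to_categorical lived in keras.utils.np_utils):

```python
from keras.datasets import mnist
from keras.utils import to_categorical

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Flatten each 28x28 image into a 784-dim vector, scale to [0, 1]
x_train = x_train.reshape(-1, 28 * 28).astype('float32') / 255
x_test = x_test.reshape(-1, 28 * 28).astype('float32') / 255

# One-hot encode the digit labels: shape (number of examples, 10)
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)
```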
[Page 66]
Keras
Save and load models: http://keras.io/getting-started/faq/#how-can-i-save-a-keras-model
How to use the neural network (testing):
• case 1: labeled test data → evaluate the loss and accuracy
• case 2: new inputs → predict the outputs
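The slide shows these cases as screenshots; a reconstruction with the standard Keras calls:

```python
# case 1: labeled test data -> total loss and accuracy
score = model.evaluate(x_test, y_test)
print('Loss on testing set:', score[0])
print('Accuracy on testing set:', score[1])

# case 2: unlabeled inputs -> predicted outputs (10 confidences per image)
result = model.predict(x_test)

# Save and load a trained model
model.save('mnist_model.h5')
from keras.models import load_model
model = load_model('mnist_model.h5')
```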
[Page 67]
Keras
• Using a GPU to speed up training
• Way 1: THEANO_FLAGS=device=gpu0 python YourCode.py
• Way 2 (in your code, before importing theano/keras):
  import os
  os.environ["THEANO_FLAGS"] = "device=gpu0"
[Page 68]
Outline of Lecture II
“Hello World” for Deep Learning
Recipe of Deep Learning
[Page 69]
Recipe of Deep Learning
After Step 1 (Network Structure), Step 2 (Learning Target), and Step 3 (Learn!), check:
• Good results on training data? NO → go back and adjust the three steps.
• YES → Good results on testing data? NO → overfitting! YES → done.
Other methods do not emphasize checking the training-data results first.
[Page 70]
Do not always blame Overfitting
Deep Residual Learning for Image Recognition: http://arxiv.org/abs/1512.03385
A deeper network can be worse than a shallower one on the testing data and on the training data — that is not overfitting; the deep network is simply not well trained.
[Page 71]
Recipe of Deep Learning
Different approaches for different problems: e.g. dropout is for good results on testing data, not training data. Other methods do not emphasize this distinction.
[Page 72]
Recipe of Deep Learning
For good results on training data:
• Choosing proper loss
• Mini-batch
• New activation function
• Adaptive learning rate
• Momentum
[Page 73]
Choosing Proper Loss
For a training image "1" with one-hot target ŷ = (1, 0, 0, …, 0), compare:
• Square Error: Σ_{i=1}^{10} (y_i − ŷ_i)²  (= 0 at the target)
• Cross Entropy: −Σ_{i=1}^{10} ŷ_i ln y_i  (= 0 at the target)
Which one is better?
[Page 74]
Choosing Proper Loss
On the total-loss surface over (w_1, w_2), cross entropy gives much steeper gradients far from the target than square error, whose surface is flat there. When using a softmax output layer, choose cross entropy.
http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf
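A small NumPy comparison of the two losses (the toy predictions are assumptions for illustration); cross entropy punishes a confident wrong answer far more than square error does, which is what produces the steeper, easier-to-descend surface:

```python
import numpy as np

def square_error(y, y_hat):
    return np.sum((y - y_hat) ** 2)

def cross_entropy(y, y_hat):
    return -np.sum(y_hat * np.log(y))

y_hat = np.array([1.0, 0.0, 0.0])     # one-hot target
wrong = np.array([0.01, 0.98, 0.01])  # confident but wrong output
right = np.array([0.98, 0.01, 0.01])  # confident and correct output

print(square_error(wrong, y_hat), cross_entropy(wrong, y_hat))  # ~1.94 vs ~4.61
print(square_error(right, y_hat), cross_entropy(right, y_hat))  # ~0.0006 vs ~0.02
```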
[Page 75]
Let's try it: train the MNIST network with square error vs. cross entropy.
[Page 76]
Let's try it
Training curves: cross entropy keeps improving while square error gets stuck.
Testing accuracy: Square Error 0.11, Cross Entropy 0.84.
[Page 77]
Recipe of Deep Learning — next tip for good results on training data: mini-batch.
[Page 78]
Mini-batch
• Randomly initialize the network parameters.
• Pick the 1st batch (e.g. x^1, x^31, …): compute its loss L′ = ℓ^1 + ℓ^31 + ⋯ and update the parameters once.
• Pick the 2nd batch (e.g. x^2, x^16, …): compute its loss L″ = ℓ^2 + ℓ^16 + ⋯ and update the parameters once.
• Continue until all mini-batches have been picked — that is one epoch. Then repeat the above process.
We do not really minimize the total loss!
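A plain-NumPy sketch of one epoch of mini-batch training (grad_fn, the batch size, and the learning rate are placeholders; Keras does all of this inside model.fit):

```python
import numpy as np

def run_epoch(params, x, y, grad_fn, eta=0.1, batch_size=100):
    indices = np.random.permutation(len(x))       # shuffle the examples
    for start in range(0, len(x), batch_size):
        batch = indices[start:start + batch_size]
        # gradient of the batch loss L' = sum of this batch's per-example losses
        g = grad_fn(params, x[batch], y[batch])
        params = params - eta * g                 # update once per mini-batch
    return params
```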
[Page 79]
Mini-batch
With 100 examples in a mini-batch, picking batch after batch until all have been seen is one epoch; repeat for 20 epochs.
[Page 80]
Mini-batch
The loss L′, L″, … being minimized is different each time we update the parameters — the target keeps changing?!
We do not really minimize the total loss!
[Page 81]
Mini-batch
Original gradient descent follows the total-loss surface smoothly; with mini-batch the trajectory is noisy — unstable!!! (The colors represent the total loss.)
[Page 82]
Mini-batch is Faster
Original gradient descent: update once per epoch, after seeing all examples. Mini-batch: if there are 20 batches, update 20 times in one epoch, seeing only one batch per update. With parallel computing (and a not-super-large data set), one pass over the data can take about the same wall-clock time either way — so mini-batch gets more updates in the same time, and it has better performance! ("Seeing one batch is faster" is not always true with parallel computing.)
[Page 83]
Mini-batch is Better!
Training accuracy vs. epoch: mini-batch climbs much faster than no batch.
Testing accuracy: Mini-batch 0.84, No batch 0.12.
[Page 84]
Shuffle the training examples for each epoch, so the mini-batches are composed differently in epoch 1 and epoch 2.
Don't worry — this is the default in Keras.
[Page 85]
Recipe of Deep Learning — next tip for good results on training data: new activation function.
[Page 86]
Hard to get the power of Deep …
Deeper usually does not imply better.
[Page 87]
Let's try it
Training accuracy: the 3-layer network learns; the 9-layer network is stuck.
Testing accuracy: 3 layers 0.84, 9 layers 0.11.
[Page 88]
Vanishing Gradient Problem
In a deep sigmoid network (x_1, …, x_N → … → y_1, …, y_M), the layers near the input have smaller gradients and learn very slowly — they stay almost random. The layers near the output have larger gradients and learn very fast — they converge quickly, but based on the almost-random features below them!?
[Page 89]
Vanishing Gradient Problem
An intuitive way to compute the derivatives: ∂ℓ/∂w ≈ Δℓ/Δw — perturb a weight by +Δw and watch the loss change by +Δℓ. For a weight near the input, the perturbation passes through many sigmoids, and each sigmoid squashes a large change at its input into a small change at its output, so Δℓ is tiny: smaller gradients.
[Page 90]
Hard to get the power of Deep …
In 2006, people used RBM pre-training. In 2015, people use ReLU.
[Page 91]
ReLU
• Rectified Linear Unit (ReLU): a = z for z > 0, a = 0 for z ≤ 0.
Reasons:
1. Fast to compute
2. Biological reason
3. Equivalent to an infinite number of sigmoids σ(z) with different biases
4. Handles the vanishing gradient problem
[Xavier Glorot, AISTATS'11] [Andrew L. Maas, ICML'13] [Kaiming He, arXiv'15]
[Page 92]
ReLU
With ReLU (a = z or a = 0), a neuron whose input z is negative outputs exactly 0 and contributes nothing downstream.
[Page 93]
ReLU
Removing the zero-output neurons leaves a thinner linear network from x_1, x_2 to y_1, y_2 — and a linear network does not have the smaller-and-smaller gradients problem.
[Page 94]
Let's try it
[Page 95]
Let's try it
• 9 layers: with sigmoid, training accuracy stays low; with ReLU, the network trains.
Testing accuracy (9 layers): Sigmoid 0.11, ReLU 0.96.
[Page 96]
ReLU — variants
• Leaky ReLU: a = z for z > 0, a = 0.01·z otherwise.
• Parametric ReLU: a = z for z > 0, a = α·z otherwise, where α is also learned by gradient descent.
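A minimal NumPy sketch of ReLU and the two variants above:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z):
    return np.where(z > 0, z, 0.01 * z)

def parametric_relu(z, alpha):
    # alpha is a learnable parameter, trained by gradient descent like the weights
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z), leaky_relu(z), parametric_relu(z, alpha=0.1))
```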
[Page 97]
Maxout
• Learnable activation function [Ian J. Goodfellow, ICML'13]
Group the linear units' values and take the max of each group. E.g. with inputs x_1, x_2, four linear units give the values (5, 7) and (−1, 1); the max of each pair outputs 7 and 1. At the next layer, (1, 2) and (4, 3) give 2 and 4.
ReLU is a special case of Maxout. You can have more than 2 elements in a group.
[Page 98]
Maxout
• Learnable activation function [Ian J. Goodfellow, ICML'13]
• The activation function of a maxout network can be any piecewise-linear convex function.
• The number of pieces depends on the number of elements in a group (2 elements → 2 pieces, 3 elements → 3 pieces).
• ReLU is a special case of Maxout.
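A sketch of the max-within-a-group operation (the flat grouping layout is an assumption; a real maxout layer also learns the linear units' weights):

```python
import numpy as np

def maxout(z, group_size=2):
    # z holds the linear units' values; output the max of each group
    z = np.asarray(z, dtype=float).reshape(-1, group_size)
    return z.max(axis=1)

print(maxout([5, 7, -1, 1]))  # [7, 1], as in the example above
```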
[Page 99]
Recipe of Deep Learning — next tip for good results on training data: adaptive learning rate.
[Page 100]
Learning Rates
Set the learning rate η carefully. If the learning rate is too large, the total loss may not decrease after each update.
[Page 101]
Learning Rates
Set the learning rate η carefully. If the learning rate is too large, the total loss may not decrease after each update; if it is too small, training will be too slow.
[Page 102]
Learning Rates
• Popular & simple idea: reduce the learning rate by some factor every few epochs.
  • At the beginning, we are far from the destination, so we use a larger learning rate.
  • After several epochs, we are close to the destination, so we reduce the learning rate.
  • E.g. 1/t decay: η_t = η / (t + 1)
• The learning rate cannot be one-size-fits-all: give different parameters different learning rates.
[Page 103]
Adagrad
Parameter-dependent learning rate:
Original: w ← w − η · ∂L/∂w
Adagrad: w ← w − η_w · ∂L/∂w, with η_w = η / √( Σ_{i=0}^{t} (g^i)² )
where η is a constant and g^i is the ∂L/∂w obtained at the i-th update; the denominator is the summation of the squares of the previous derivatives.
[Page 104]
Adagrad
Worked example for two parameters w_1 and w_2 with η_w = η / √( Σ_{i=0}^{t} (g^i)² ):
• w_1: g^0 = 0.1, g^1 = 0.2, … → learning rate η/√(0.1²) = η/0.1, then η/√(0.1² + 0.2²) = η/0.22, …
• w_2: g^0 = 20.0, g^1 = 10.0, … → learning rate η/√(20²) = η/20, then η/√(20² + 10²) = η/22, …
Observations: 1. the learning rate is smaller and smaller for all parameters; 2. smaller derivatives, larger learning rate, and vice versa. Why?
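A minimal Adagrad step matching the formula above (toy gradients; real implementations add a tiny ε to avoid division by zero):

```python
import numpy as np

def adagrad_update(w, grad, sum_sq, eta=0.1, eps=1e-8):
    sum_sq += grad ** 2                        # accumulate (g^i)^2
    w -= eta / (np.sqrt(sum_sq) + eps) * grad  # eta_w = eta / sqrt(sum of squares)
    return w, sum_sq

w, s = 0.0, 0.0
for g in [0.1, 0.2]:   # g^0, g^1 from the w_1 example: eta/0.1, then eta/0.22
    w, s = adagrad_update(w, g, s)
```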
[Page 105]
Why? Smaller derivatives come from gentler slopes, which call for a larger learning rate to cross in reasonable time; larger derivatives come from steeper slopes, which call for a smaller learning rate — so the step size is balanced across parameters.
[Page 106]
Not the whole story ……
• Adagrad [John Duchi, JMLR'11]
• RMSprop: https://www.youtube.com/watch?v=O3sxAc4hxZU
• Adadelta [Matthew D. Zeiler, arXiv'12]
• "No more pesky learning rates" [Tom Schaul, arXiv'12]
• AdaSecant [Caglar Gulcehre, arXiv'14]
• Adam [Diederik P. Kingma, ICLR'15]
• Nadam: http://cs229.stanford.edu/proj2015/054_report.pdf
[Page 107]
Recipe of Deep Learning — next tip for good results on training data: momentum.
[Page 108]
Hard to find optimal network parameters
On the total-loss curve over a parameter w there are trouble spots:
• plateaus, where ∂L/∂w ≈ 0 and progress is very slow;
• saddle points, where ∂L/∂w = 0;
• local minima, where ∂L/∂w = 0 and we get stuck.
[Page 109]
In the physical world ……
• Momentum: a moving object keeps moving even where the slope vanishes. How about putting this phenomenon into gradient descent?
[Page 110]
Movement = negative of ∂L/∂w + momentum
The real movement combines the negative gradient with the momentum from previous steps, so even where ∂L/∂w = 0 on the cost curve, the momentum can carry us over. Still no guarantee of reaching the global minimum, but it gives some hope ……
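A sketch of the update with momentum (the decay factor 0.9 is a common default, assumed here):

```python
def momentum_step(w, v, grad, eta=0.1, beta=0.9):
    # movement = momentum (a fraction of the last movement) + negative gradient
    v = beta * v - eta * grad
    return w + v, v

w, v = 0.0, 0.0
for g in [1.0, 0.5, 0.0]:   # even when the gradient hits 0, w keeps moving
    w, v = momentum_step(w, v, g)
print(w, v)
```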
[Page 111]
Adam = RMSProp (an advanced Adagrad) + Momentum
[Page 112]
Let's try it
• ReLU, 3 layers: Adam trains slightly faster than the original optimizer.
Testing accuracy: Original 0.96, Adam 0.97.
[Page 113]
Recipe of Deep Learning
For good results on testing data:
• Early stopping
• Regularization
• Dropout
• Network structure
[Page 114]
Why Overfitting?
• Training data and testing data can be different.
The learning target is defined by the training data, so the parameters achieving the learning target do not necessarily give good results on the testing data.
[Page 115]
Panacea for Overfitting
• Have more training data
• Create more training data (?) — e.g. for handwriting recognition, shift each original training image by 15° to create new training examples.
[Page 116]
Why Overfitting?
• For the experiments below, we added some noise to the testing data.
[Page 117]
Why Overfitting?
• With noise added to the testing data, training is not influenced, but testing accuracy drops: Clean 0.97, Noisy 0.50.
[Page 118]
Recipe of Deep Learning
For good results on testing data — next: early stopping.
• Early stopping
• Weight decay
• Dropout
• Network structure
[Page 119]
Early Stopping
As the epochs go by, the total loss on the training set keeps decreasing, but the loss on the testing set eventually rises. Use a validation set to decide where to stop.
Keras: http://keras.io/getting-started/faq/#how-can-i-interrupt-training-when-the-validation-loss-isnt-decreasing-anymore
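A sketch using the Keras EarlyStopping callback from that FAQ entry (the patience value and validation split are assumptions):

```python
from keras.callbacks import EarlyStopping

# Stop when the validation loss has not improved for 3 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', patience=3)
model.fit(x_train, y_train, batch_size=100, epochs=20,
          validation_split=0.1, callbacks=[early_stop])
```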
[Page 120]
Recipe of Deep Learning
For good results on testing data — next: weight decay.
• Early stopping
• Weight decay
• Dropout
• Network structure
![Page 121: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/121.jpg)
Weight Decay
• Our brain prunes the useless links between neurons.
Doing the same thing to the machine's brain improves the performance.
![Page 122: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/122.jpg)
Weight Decay
Useless
Close to zero (it withers away)
Weight decay is one kind of regularization
![Page 123: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/123.jpg)
Weight Decay
• Implementation
Keras: http://keras.io/regularizers/
Original: w ← w − η ∂L/∂w
Weight Decay: w ← (1 − λ)w − η ∂L/∂w, with λ = 0.01
Each update multiplies the weight by 0.99, so weights that stop receiving gradient become smaller and smaller.
![Page 124: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/124.jpg)
Good Results on Testing Data?
Good Results on Training Data?
YES
YES
Recipe of Deep Learning
Early Stopping
Weight Decay
Dropout
Network Structure
![Page 125: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/125.jpg)
Dropout
Training:
Each time before updating the parameters, each neuron has p% chance to drop out.
![Page 126: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/126.jpg)
Dropout
Training:
Each time before updating the parameters, each neuron has p% chance to drop out.
The structure of the network is changed: use the new, thinner network for training.
For each mini-batch, we resample the dropout neurons.
![Page 127: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/127.jpg)
Dropout
Testing:
No dropout
If the dropout rate at training is p%, multiply all the weights by (1-p)%.
Assume the dropout rate is 50%: if a weight w = 1 after training, set w = 0.5 for testing.
![Page 128: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/128.jpg)
Dropout - Intuitive Reason
When people work in a team, if everyone expects the partner to do the work, nothing gets done in the end.
However, if you know your partner will drop out, you will do better.
(My partner will slack off, so I have to work hard.)
When testing, no one actually drops out, so good results are obtained in the end.
![Page 129: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/129.jpg)
Dropout - Intuitive Reason
• Why should the weights be multiplied by (1-p)% (p% = dropout rate) when testing?
Training of Dropout: z = w1 x1 + w2 x2 + w3 x3 + w4 x4, with each input dropped with 50% chance (assume the dropout rate is 50%).
Testing of Dropout: no dropout; with the raw weights from training, z′ ≈ 2z.
If each weight is multiplied by (1-p)% = 0.5, then z′ ≈ z.
![Page 130: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/130.jpg)
Dropout is a kind of ensemble.
Ensemble
Network1
Network2
Network3
Network4
Train a bunch of networks with different structures
Training Set
Set 1 Set 2 Set 3 Set 4
![Page 131: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/131.jpg)
Dropout is a kind of ensemble.
Ensemble
y1
Network1
Network2
Network3
Network4
Testing data x
y2 y3 y4
average
![Page 132: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/132.jpg)
Dropout is a kind of ensemble.
Training of Dropout
minibatch1
……
Using one mini-batch to train one network. Some parameters in the network are shared.
minibatch2
minibatch3
minibatch4
M neurons
2^M possible networks
![Page 133: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/133.jpg)
Dropout is a kind of ensemble.
Testing of Dropout
testing data x
……
average
y1 y2 y3
All the weights multiply (1-p)%
≈ y (mysteriously!)
![Page 134: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/134.jpg)
More about dropout
• More references for dropout: [Nitish Srivastava, JMLR'14] [Pierre Baldi, NIPS'13] [Geoffrey E. Hinton, arXiv'12]
• Dropout works better with Maxout [Ian J. Goodfellow, ICML'13]
• Dropconnect [Li Wan, ICML'13]
• Dropout deletes neurons; Dropconnect deletes the connections between neurons
• Annealed dropout [S.J. Rennie, SLT'14]
• Dropout rate decreases over epochs
• Standout [J. Ba, NIPS'13]
• Each neuron has a different dropout rate
![Page 135: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/135.jpg)
Let’s try it
y1 y2 y10
……
……
……
……
Softmax
500
500
model.add( Dropout(0.8) )
model.add( Dropout(0.8) )
![Page 136: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/136.jpg)
Let’s try it
(Figure: training accuracy vs. epoch, with Dropout vs. No Dropout.)
Testing accuracy: Noisy 0.50 → Noisy + dropout 0.63
![Page 137: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/137.jpg)
Good Results on Testing Data?
Good Results on Training Data?
YES
YES
Recipe of Deep Learning
Early Stopping
Regularization
Dropout
Network Structure
CNN is a very good example! (next lecture)
![Page 138: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/138.jpg)
Concluding Remarks of Lecture II
![Page 139: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/139.jpg)
Recipe of Deep Learning
Neural Network
Good Results on Testing Data?
Good Results on Training Data?
Step 3: Learn!
Step 2: Learning Target
Step 1: Network Structure
YES
YES
NO
NO
![Page 140: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/140.jpg)
Lecture III: Variants of Neural Networks
![Page 141: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/141.jpg)
Variants of Neural Networks
Convolutional Neural Network (CNN)
Recurrent Neural Network (RNN)
Considering the properties of images
![Page 142: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/142.jpg)
Why CNN for Image?
• When processing an image, the first layer of a fully connected network would be very large.
Example: a 100 x 100 x 3 image fully connected to a layer of 1000 neurons (before the softmax output) already needs 3 x 10^7 weights.
Can the fully connected network be simplified by considering the properties of image recognition?
![Page 143: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/143.jpg)
Why CNN for Image
• Some patterns are much smaller than the whole image
A neuron does not have to see the whole image to discover the pattern.
(This neuron detects whether there is a beak.)
(This is a beak.)
Connecting to small region with less parameters
![Page 144: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/144.jpg)
Why CNN for Image
• The same patterns appear in different regions.
(I check whether there is a beak in the upper left.)
(I check whether there is a beak in the middle.)
(Too redundant …… they do almost the same thing.)
![Page 145: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/145.jpg)
Why CNN for Image
• The same patterns appear in different regions.
(I check whether there is a beak anywhere in the image.)
(One neuron is enough ……)
![Page 146: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/146.jpg)
Why CNN for Image
• Subsampling the pixels will not change the object
subsampling
bird → bird
(unless the subsampling is overdone)
We can subsample the pixels to make image smaller
Less parameters for the network to process the image
![Page 147: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/147.jpg)
Step 1: Neural Network
Step 2: Learning Target
Step 3: Learn!
Convolutional Neural Network
(the innate brain)
![Page 148: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/148.jpg)
The whole CNN
Fully Connected Feedforward network
cat dog ……
Convolution
Max Pooling
Convolution
Max Pooling
Flatten
Can repeat many times
![Page 149: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/149.jpg)
The whole CNN
Convolution
Max Pooling
Convolution
Max Pooling
Flatten
Can repeat many times
Some patterns are much smaller than the whole image
The same patterns appear in different regions.
Subsampling the pixels will not change the object
Property 1
Property 2
Property 3
![Page 150: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/150.jpg)
The whole CNN
Fully Connected Feedforward network
cat dog ……
Convolution
Max Pooling
Convolution
Max Pooling
Flatten
Can repeat many times
![Page 151: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/151.jpg)
CNN – Convolution
6 x 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
Filter 1:
1 -1 -1
-1 1 -1
-1 -1 1
Filter 2:
-1 1 -1
-1 1 -1
-1 1 -1
……
Those are the network parameters to be learned.
Matrix
Matrix
Each filter detects a small pattern (3 x 3). Property 1
![Page 152: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/152.jpg)
CNN – Convolution
(The 6 x 6 image and Filter 1 as above.)
stride=1: the first two results are 3 and -1.
![Page 153: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/153.jpg)
CNN – Convolution
(The 6 x 6 image and Filter 1 as above.)
If stride=2, the first two results are 3 and -3.
We set stride=1 below.
![Page 154: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/154.jpg)
CNN – Convolution
(The 6 x 6 image and Filter 1 as above.)
Convolving over the whole image gives:
3 -1 -3 -1
-3 1 0 -3
-3 -3 0 1
3 -2 -2 -1
stride=1
Property 2
![Page 155: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/155.jpg)
CNN – Convolution
(The 6 x 6 image as above.)
6 x 6 image
3 -1 -3 -1
-3 1 0 -3
-3 -3 0 1
3 -2 -2 -1
Filter 2 (as above) gives:
-1 -1 -1 -1
-1 -1 -2 1
-1 -1 -2 1
-1 0 -4 3
Do the same process for every filter
stride=1
4 x 4 image
Feature Map
![Page 156: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/156.jpg)
CNN – Zero Padding
(The 6 x 6 image and Filter 1 as above.)
Zero padding: surround the border of the image with 0s before convolving; you then get another 6 x 6 image in this way.
![Page 157: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/157.jpg)
CNN – Colorful image
(Three 6 x 6 matrices: one per color channel of the image.)
Filter 1 and Filter 2 each become 3 x 3 x 3: one 3 x 3 weight matrix per channel.
Colorful image
![Page 158: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/158.jpg)
The whole CNN
Fully Connected Feedforward network
cat dog ……
Convolution
Max Pooling
Convolution
Max Pooling
Flatten
Can repeat many times
![Page 159: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/159.jpg)
CNN – Max Pooling
Filter 1 (as above) gives:
3 -1 -3 -1
-3 1 0 -3
-3 -3 0 1
3 -2 -2 -1
Filter 2 (as above) gives:
-1 -1 -1 -1
-1 -1 -2 1
-1 -1 -2 1
-1 0 -4 3
![Page 160: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/160.jpg)
CNN – Max Pooling
(The 6 x 6 image as above.)
Conv, then Max Pooling:
Filter 1 map pools to:
3 0
3 1
Filter 2 map pools to:
-1 1
0 3
The result is a 2 x 2 image: each filter is a channel. A new image, but smaller.
![Page 161: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/161.jpg)
The whole CNN
Convolution
Max Pooling
Convolution
Max Pooling
Can repeat many times
A new image
The number of the channel is the number of filters
Smaller than the original image
3 0
3 1
-1 1
0 3
![Page 162: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/162.jpg)
The whole CNN
Fully Connected Feedforward network
cat dog ……
Convolution
Max Pooling
Convolution
Max Pooling
Flatten
A new image
A new image
![Page 163: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/163.jpg)
Flatten
3 0
3 1
-1 1
0 3
Flatten →
3
0
3
1
-1
1
0
3
Fully Connected Feedforward network
![Page 164: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/164.jpg)
The whole CNN
Convolution
Max Pooling
Convolution
Max Pooling
Can repeat many times
![Page 165: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/165.jpg)
(Diagram: the convolution + max pooling pipeline drawn as a network. The 6 x 6 image as above feeds inputs x1, x2, …; sparsely connected linear units produce values such as 5, 7, -1, 1, and each max-pooling unit outputs the larger one: max(5, 7) = 7, max(-1, 1) = 1.)
(Ignoring the non-linear activation function after the convolution.)
![Page 166: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/166.jpg)
(The 6 x 6 image and Filter 1 as above, with the 36 inputs numbered 1, 2, 3, … 36.)
The neuron producing the first output value 3 connects only to 9 inputs (pixels 1, 2, 3, 7, 8, 9, 13, 14, 15), not fully connected.
Less parameters!
![Page 167: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/167.jpg)
(The 6 x 6 image and Filter 1 as above.)
The neurons producing the outputs 3 and -1 use the same 9 weights (shared weights); they just connect to different input pixels.
Less parameters!
Even less parameters!
![Page 168: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/168.jpg)
(The same convolution + max pooling diagram as above.)
(Ignoring the non-linear activation function after the convolution.)
![Page 169: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/169.jpg)
The 4 x 4 map from Filter 1:
3 -1 -3 -1
-3 1 0 -3
-3 -3 0 1
3 -2 -2 -1
max-pools to:
3 0
3 1
(Diagram: in the network view, each pooling output is a max unit over its block, e.g. max(5, 7) = 7, max(-1, 1) = 1.)
![Page 170: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/170.jpg)
(Diagram: the 6 x 6 image as above goes through convolution with the two filters, then max pooling.)
The convolution layer has only 9 x 2 = 18 parameters.
A fully connected layer between the same dimensions (input dim = 6 x 6 = 36, output dim = 4 x 4 x 2 = 32) would need 36 x 32 = 1152 parameters.
![Page 171: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/171.jpg)
Step 1: Neural Network
Step 2: Learning Target
Step 3: Learn!
Convolutional Neural Network
Learning: nothing special, just gradient descent ……
CNN (convolution, max pooling, fully connected) outputs scores for "monkey", "cat", "dog", ……; for a monkey image the target is 1 for "monkey" and 0 for the others.
![Page 172: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/172.jpg)
CNN in Keras: only the network structure is modified.
Code: https://github.com/fchollet/keras/blob/master/examples/cifar10_cnn.py
![Page 173: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/173.jpg)
Why CNN for playing Go?
• Some patterns are much smaller than the whole image
• The same patterns appear in different regions.
Alpha Go uses 5 x 5 for first layer
![Page 174: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/174.jpg)
Why CNN for playing Go?
• Subsampling the pixels will not change the object
Alpha Go does not use Max Pooling ……
Max Pooling: how to explain this???
![Page 175: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/175.jpg)
Variants of Neural Networks
Convolutional Neural Network (CNN)
Recurrent Neural Network (RNN) Neural Network with Memory
![Page 176: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/176.jpg)
Example Application
• Slot Filling
I would like to arrive Taipei on November 2nd.
(flight booking system)
Destination:
time of arrival:
Taipei
November 2nd
Slot
![Page 177: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/177.jpg)
Example Application
x1 x2
y1 y2
Taipei
Input: a word (each word is represented as a vector)
Solving slot filling by Feedforward network?
![Page 178: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/178.jpg)
1-of-N encoding
Each dimension corresponds to a word in the lexicon
The dimension for the word is 1, and the others are 0.
lexicon = {apple, bag, cat, dog, elephant}
apple = [ 1 0 0 0 0 ]
bag = [ 0 1 0 0 0 ]
cat = [ 0 0 1 0 0 ]
dog = [ 0 0 0 1 0 ]
elephant = [ 0 0 0 0 1 ]
The vector has the same size as the lexicon.
1-of-N Encoding
How to represent each word as a vector?
![Page 179: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/179.jpg)
Beyond 1-of-N encoding
w = "apple"
Word hashing: one dimension per letter trigram (a-a-a, a-a-b, …, 26 X 26 X 26 of them); for "apple" the dimensions for a-p-p, p-p-l, p-l-e are 1 and the others are 0.
Dimension for "Other": words not in the lexicon {apple, bag, cat, dog, elephant}, e.g. w = "Sauron" or w = "Gandalf", get 0 in every lexicon dimension and 1 in the "other" dimension.
![Page 180: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/180.jpg)
Example Application
x1 x2
y1 y2
Taipei
dest
time of departure
Input: a word(Each word is represented as a vector)
Output: probability distribution over the slots for the input word
Solving slot filling by Feedforward network?
![Page 181: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/181.jpg)
Example Application
x1 x2
y1 y2
Taipei
arrive Taipei on November 2nd → other dest other time time
leave Taipei on November 2nd → here "Taipei" should be place of departure
The same word "Taipei" needs different slots (dest vs. place of departure), but a feedforward network gives one fixed output per word. Problem? The neural network needs memory!
![Page 182: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/182.jpg)
Step 1: Neural Network
Step 2: Learning Target
Step 3: Learn!
Recurrent Neural Network
(the innate brain)
Recurrent Neural Network (RNN)
http://onepiece1234567890.blogspot.tw/2013/12/blog-post_8.html
![Page 183: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/183.jpg)
Recurrent Neural Network (RNN)
x1 x2
y1 y2
a1 a2
Memory can be considered as another input.
The outputs of the hidden layer are stored in the memory.
store
![Page 184: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/184.jpg)
RNN
store store
x1 x2 x3
y1 y2 y3
a1
a1 a2 a2 a3
The same network is used again and again.
arrive Taipei on November 2nd
Probability of “arrive” in each slot
Probability of “Taipei” in each slot
Probability of “on” in each slot
![Page 185: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/185.jpg)
RNN
store
x1 x2
y1 y2
a1
a1 a2
……
……
……
store
x1 x2
y1 y2
a1
a1 a2
……
……
……
leave Taipei
Prob of “leave” in each slot
Prob of “Taipei” in each slot
Prob of “arrive” in each slot
Prob of “Taipei” in each slot
arrive Taipei
Different
The values stored in the memory are different.
![Page 186: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/186.jpg)
Of course it can be deep …
…… ……
xt xt+1 xt+2
……
……
yt
…… ……
yt+1…
…yt+2
……
……
![Page 187: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/187.jpg)
Bidirectional RNN
yt+1
…… ……
…………yt+2yt
xt xt+1 xt+2
xt xt+1 xt+2
![Page 188: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/188.jpg)
Memory Cell
Long Short-term Memory (LSTM)
Input Gate
Output Gate
Signal control the input gate
Signal control the output gate
Forget Gate
Signal control the forget gate
Other part of the network
Other part of the network
(Other part of the network)
(Other part of the network)
(Other part of the network)
LSTM
Special neuron: 4 inputs, 1 output
![Page 189: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/189.jpg)
The four inputs: z (the signal to the cell input), zi (input gate signal), zf (forget gate signal), zo (output gate signal).
The gate activation function f is usually a sigmoid function, between 0 and 1, mimicking an open or closed gate.
New memory: c' = g(z) f(zi) + c f(zf)
Output: a = h(c') f(zo)
![Page 190: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/190.jpg)
Worked example: the memory holds c = 7 and the input is z = 3. The input gate signal is 10, so f(10) ≈ 1 and the 3 enters; the forget gate signal is 10, so the old memory is kept; the new memory is 3 × 1 + 7 × 1 = 10. The output gate signal is -10, so f(-10) ≈ 0 and the output is ≈ 0.
![Page 191: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/191.jpg)
Next example: the memory holds c = 7 and the input is z = -3. The input gate is open (f(10) ≈ 1) and the forget gate is closed (f(-10) ≈ 0), so the old memory is wiped: the new memory is -3 × 1 + 7 × 0 = -3. The output gate is open (f(10) ≈ 1), so the output is ≈ -3.
![Page 192: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/192.jpg)
LSTM
ct-1 is a vector (the values in all the memory cells).
xt is transformed into 4 vectors: z, zi, zf, zo.
![Page 193: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/193.jpg)
LSTM
(Diagram: ct = g(z) f(zi) + ct-1 f(zf); yt = h(ct) f(zo).)
![Page 194: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/194.jpg)
LSTM
(Diagram: the same cell repeated at time steps t and t+1, producing yt and yt+1; the memory flows ct-1 → ct → ct+1 and the output ht is fed to the next step.)
Extension: "peephole": besides xt and ht-1, the memory ct-1 is also fed into the four transformations.
![Page 195: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/195.jpg)
Multiple-layer LSTM
This is quite standard now.
https://img.komicolle.org/2015-09-20/src/14426967627131.gif
Don't worry if you cannot understand this. Keras can handle it.
Keras supports “LSTM”, “GRU”, “SimpleRNN” layers
![Page 196: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/196.jpg)
Step 1: Neural Network
Step 2: Learning Target
Step 3: Learn!
Recurrent Neural Network
(the innate brain) http://onepiece1234567890.blogspot.tw/2013/12/blog-post_8.html
![Page 197: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/197.jpg)
(The unrolled RNN as above: x1, x2, x3 → y1, y2, y3, with the stored values copied forward and the same weights W used at every step.)
Training sentences: arrive Taipei on November 2nd
Learning target: other dest other time time. At each step, the reference is the 1-of-N vector of the correct slot (1 for the correct slot, 0 elsewhere).
![Page 198: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/198.jpg)
Step 1: Neural Network
Step 2: Learning Target
Step 3: Learn!
Recurrent Neural Network
(the innate brain) http://onepiece1234567890.blogspot.tw/2013/12/blog-post_8.html
![Page 199: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/199.jpg)
Learning
RNN Learning is very difficult in practice.
Backpropagation through time (BPTT)
w ← w − η ∂L/∂w
(The same weight w is copied at every time step of the unrolled network.)
![Page 200: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/200.jpg)
Unfortunately ……
• RNN-based network is not always easy to learn
(Thanks to 曾柏翔 for providing the experimental results.)
Real experiments on language modeling (figure: total loss vs. epoch). Sometimes you get lucky.
![Page 201: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/201.jpg)
The error surface is rough.
(Figure: the total loss surface over parameters w1 and w2.)
The error surface is either very flat or very steep.
Clipping: when the gradient is too large, clip it to a threshold. [Razvan Pascanu, ICML'13]
![Page 202: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/202.jpg)
Why?
Toy Example: a 1000-step linear RNN with a single recurrent weight w; the input is 1 at the first step and 0 afterwards, so y1000 = w^999.
w = 1 → y1000 = 1; w = 1.01 → y1000 ≈ 20000. Large ∂L/∂w → small learning rate?
w = 0.99 → y1000 ≈ 0; w = 0.01 → y1000 ≈ 0. Small ∂L/∂w → large learning rate?
![Page 203: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/203.jpg)
Helpful Techniques
• Advanced momentum method: Nesterov's Accelerated Gradient (NAG)
Source: http://www.cs.toronto.edu/~fritz/absps/momentum.pdf
Methods: gradient descent, momentum, Nesterov's Accelerated Gradient (NAG)
Valley
![Page 204: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/204.jpg)
Helpful Techniques
• Long Short-term Memory (LSTM)
• Can deal with gradient vanishing (not gradient explosion)
Memory and input are added, so the influence never disappears unless the forget gate is closed: no gradient vanishing (if the forget gate is opened).
![Page 205: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/205.jpg)
Helpful Techniques
• Gated Recurrent Unit (GRU) [Cho, EMNLP'14]: a simplified LSTM with fewer parameters than LSTM.
The update is gated by a and 1 − a: storing the new value means clearing the old one ("if the old doesn't go, the new won't come").
![Page 206: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/206.jpg)
Helpful Techniques
Vanilla RNN Initialized with Identity matrix + ReLU activation function [Quoc V. Le, arXiv’15]
Outperforms or is comparable with LSTM on 4 different tasks
[Jan Koutnik, JMLR’14]
Clockwise RNN
[Tomas Mikolov, ICLR’15]
Structurally Constrained Recurrent Network (SCRN)
![Page 207: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/207.jpg)
More Applications ……
store store
x1 x2 x3
y1 y2 y3
a1
a1 a2 a2 a3
arrive Taipei on November 2nd
Probability of “arrive” in each slot
Probability of “Taipei” in each slot
Probability of “on” in each slot
Input and output are both sequences with the same length
RNN can do more than that!
![Page 208: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/208.jpg)
Many to one
• Input is a vector sequence, but output is only one vector
Sentiment Analysis
……
(The RNN reads a sentence word by word, e.g. 我 覺 得 太 糟 了 ("I think it's too awful"), and outputs a single vector for the whole sequence.)
Rating classes: 超好雷 (very positive), 好雷 (positive), 普雷 (so-so), 負雷 (negative), 超負雷 (very negative)
Training examples:
看了這部電影覺得很高興 ……. ("Watching this movie made me very happy ……") → Positive (正雷)
這部電影太糟了……. ("This movie is awful ……") → Negative (負雷)
這部電影很棒 ……. ("This movie is great ……") → Positive (正雷)
……
Keras Example: https://github.com/fchollet/keras/blob/master/examples/imdb_lstm.py
![Page 209: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/209.jpg)
Many to Many (Output is shorter)
• Both input and output are sequences, but the output is shorter.
• E.g. Speech Recognition
Input (vector sequence): acoustic frames recognized as 好 好 好 棒 棒 棒 棒 棒
Trimming the repeated characters gives the output (character sequence): "好棒"
Problem? Why can't the output be "好棒棒"?
![Page 210: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/210.jpg)
Many to Many (Output is shorter)
• Both input and output are sequences, but the output is shorter.
• Connectionist Temporal Classification (CTC) [Alex Graves, ICML’06][Alex Graves, ICML’14][Haşim Sak, Interspeech’15][Jie Li, Interspeech’15][Andrew Senior, ASRU’15]
Add an extra symbol "φ" representing "null":
好 φ φ 棒 φ φ φ φ → "好棒";  好 φ φ 棒 φ 棒 φ φ → "好棒棒"
![Page 211: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/211.jpg)
Many to Many (No Limitation)
• Both input and output are sequences, with different lengths → sequence-to-sequence learning
• E.g. Machine Translation (machine learning→機器學習)
Containing all information about the input sequence
learning
machine
![Page 212: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/212.jpg)
learning
Many to Many (No Limitation)
• Both input and output are sequences, with different lengths → sequence-to-sequence learning
• E.g. Machine Translation (machine learning→機器學習)
machine
機 器 學 習 …… and the decoder keeps generating ……
It doesn't know when to stop (inertia).
![Page 213: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/213.jpg)
Many to Many (No Limitation)
推 tlkagk: =========斷========== ("斷" = "break")
Ref:http://zh.pttpedia.wikia.com/wiki/%E6%8E%A5%E9%BE%8D%E6%8E%A8%E6%96%87 (鄉民百科)
![Page 214: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/214.jpg)
learning
Many to Many (No Limitation)
• Both input and output are sequences, with different lengths → sequence-to-sequence learning
• E.g. Machine Translation (machine learning→機器學習)
machine
機 器 學 習
Add a symbol "===" (斷, "break") to mark the end
[Ilya Sutskever, NIPS’14][Dzmitry Bahdanau, arXiv’15]
===
![Page 215: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/215.jpg)
One to Many
• Input an image, but output a sequence of words
Input image
a woman is
……
===
CNN
A vector for the whole image
[Kelvin Xu, arXiv’15][Li Yao, ICCV’15]
Caption Generation
![Page 216: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/216.jpg)
Video Caption Generation
A girl is ……
===
Video frames
![Page 217: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/217.jpg)
Concluding Remarks
Convolutional Neural Network (CNN)
Recurrent Neural Network (RNN)
![Page 218: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/218.jpg)
Lecture IV: Next Wave
![Page 219: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/219.jpg)
Outline
Ultra Deep Network
Attention Model
Reinforcement Learning
Realizing what the World Looks Like
Understanding the Meaning of Words
Why Deep?
![Page 220: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/220.jpg)
Ultra Deep Network
AlexNet (2012): 16.4%
VGG (2014): 7.3%
GoogleNet (2014): 6.7%
Residual Net (2015): 3.57%, 152 layers
These ultra deep networks have special structure.
Worry about overfitting?
Worry about achieving the target first!
![Page 221: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/221.jpg)
Ultra Deep Network
• An ultra deep network is an ensemble of many networks with different depths.
Residual Networks are Exponential Ensembles of Relatively Shallow Networks (https://arxiv.org/abs/1605.06431)
6 layers4 layers2 layers
Ensemble
![Page 222: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/222.jpg)
Ultra Deep Network
• FractalNet (FractalNet: Ultra-Deep Neural Networks without Residuals, https://arxiv.org/abs/1605.07648)
• Resnet in Resnet (Resnet in Resnet: Generalizing Residual Architectures, https://arxiv.org/abs/1603.08029)
• Good initialization? (All you need is a good init, http://arxiv.org/pdf/1511.06422v7.pdf)
![Page 223: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/223.jpg)
Ultra Deep Network
• Residual Network • Highway Network
Deep Residual Learning for Image Recognition (http://arxiv.org/abs/1512.03385)
Training Very Deep Networks (https://arxiv.org/pdf/1507.06228v2.pdf)
(Diagram: in a Residual Network the input is copied and added (+) to the block's output; in a Highway Network a gate controller decides how much of the input to copy.)
![Page 224: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/224.jpg)
Input layer
output layer
Input layer
output layer
Input layer
output layer
Highway Network automatically determines the layers needed!
![Page 225: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/225.jpg)
Outline
Ultra Deep Network
Attention Model
Reinforcement Learning
Realizing what the World Looks Like
Understanding the Meaning of Words
Why Deep?
![Page 226: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/226.jpg)
Attention-based Model
http://henrylo1605.blogspot.tw/2015/05/blog-post_56.html
(Memories: what was learned today, what to have for lunch today, the summer of the second year of junior high ……)
What is deep learning? → attend to the relevant memory and organize the answer.
![Page 227: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/227.jpg)
Attention-based Model
Reading Head Controller
Input
Reading Head
output
…… ……
Machine’s Memory
DNN/RNN
Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Attain%20(v3).ecm.mp4/index.html
![Page 228: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/228.jpg)
Reading Comprehension
Query
Each sentence becomes a vector.
……
DNN/RNN
Reading Head Controller
……
answer
Semantic Analysis
![Page 229: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/229.jpg)
Reading Comprehension
• End-To-End Memory Networks. S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus. NIPS, 2015.
The position of reading head:
Keras has example: https://github.com/fchollet/keras/blob/master/examples/babi_memnn.py
![Page 230: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/230.jpg)
Visual Question Answering
source: http://visualqa.org/
![Page 231: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/231.jpg)
Visual Question Answering
Query DNN/RNN
Reading Head Controller
answer
CNN: a vector for each region
![Page 232: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/232.jpg)
Visual Question Answering
• Huijuan Xu, Kate Saenko. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering. arXiv Pre-Print, 2015
![Page 233: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/233.jpg)
Video Caption Generation
RNN
Reading Head Controller
(The DNN/RNN receives <START> and outputs "A".)
Memory: video frames. Output: video description.
Video:
![Page 234: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/234.jpg)
Video Caption Generation
RNN
(Given "A", the DNN/RNN outputs "Girl".)
Memory: video frames. Output: video description.
Video:
Reading Head Controller
![Page 235: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/235.jpg)
Video Caption Generation
• Demo by 曾柏翔, 盧宏宗, 吳柏瑜
![Page 236: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/236.jpg)
Outline
Ultra Deep Network
Attention Model
Reinforcement Learning
Realizing what the World Looks Like
Understanding the Meaning of Words
Why Deep?
![Page 237: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/237.jpg)
Only supervised learning until now ……
• The network is a function; in supervised learning, the input-output pairs are given in the training data.
Neural network: for a monkey image the target is "monkey" = 1, "cat" = 0, "dog" = 0, ……; for a dog image the target is "monkey" = 0, "cat" = 0, "dog" = 1, ……
![Page 238: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/238.jpg)
Supervised v.s. Reinforcement
• Example: Dialogue Agent for a ticket booking system
I would like to leave Taipei on November 2nd.
(flight booking system)
Place of departure: Taipei
Time of arrival: November 2nd
Slot
What is your destination?
![Page 239: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/239.jpg)
Supervised v.s. Reinforcement
• Example: Dialogue Agent for a ticket booking system
The DNN reads the current dialogue state and scores each possible action (Greetings, Confirm, Inquire database, ……); take the action with the max score.
![Page 240: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/240.jpg)
Supervised vs. Reinforcement
• Supervised: learning from a teacher.
  Hearing "Hello", you have to say "greeting"; hearing "%$#$%#", you have to "confirm".
• Reinforcement: the agent carries out a whole conversation with a person …… and only receives the overall feedback at the end: "Bad".
![Page 241: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/241.jpg)
Supervised vs. Reinforcement
• Playing Go
  • Supervised: learn from records of expert games. For each board position, the teacher gives the next move: "5-5"; next move: "3-3"; ……
  • Reinforcement learning: play whole games. First move at tengen (the center point) …… hundreds of moves later …… Win!
AlphaGo uses supervised learning + reinforcement learning.
![Page 242: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/242.jpg)
To learn deep reinforcement learning ……
• Lectures of David Silver (10 lectures, 1.5 hours each)
  • http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html
• Deep Reinforcement Learning
  • http://videolectures.net/rldm2015_silver_reinforcement_learning/
![Page 243: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/243.jpg)
Outline
Ultra Deep Network
Attention Model
Reinforcement Learning
Realizing what the World Looks Like
Understanding the Meaning of Words
Why Deep?
![Page 244: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/244.jpg)
Does the machine know what the world looks like?
Draw something!
Ref: https://openai.com/blog/generative-models/
![Page 245: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/245.jpg)
Deep Dream
• Given a photo, the machine adds what it sees ……
http://deepdreamgenerator.com/
![Page 246: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/246.jpg)
Deep Dream
• Given a photo, the machine adds what it sees ……
http://deepdreamgenerator.com/
![Page 247: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/247.jpg)
Deep Style
• Given a photo, repaint it in the style of famous paintings
https://dreamscopeapp.com/
![Page 248: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/248.jpg)
Deep Style
• Given a photo, repaint it in the style of famous paintings
https://dreamscopeapp.com/
![Page 249: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/249.jpg)
Deep Style
One CNN captures the content of the photo and another captures the style of the painting; the method then searches for the image (?) whose CNN content matches the photo and whose CNN style matches the painting.
![Page 250: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/250.jpg)
Generating Images (creating something from nothing)
• Training a decoder to generate images is unsupervised
A neural network maps a code to an image: code → Neural Network → ?
The training data is a lot of images.
![Page 251: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/251.jpg)
Auto-encoder
An NN Encoder compresses the input, layer by layer, down to a code at a bottleneck layer; an NN Decoder expands the code back, layer by layer, into the output. Encoder and decoder are learned together so that the output is as close as possible to the input.
(This is not the state-of-the-art approach.)
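A minimal auto-encoder sketch, assuming Keras as the framework and a 784-dimensional input such as flattened MNIST (neither is specified in the slides):

```python
# Encoder and decoder learned together: output as close as possible to input.
import numpy as np
from tensorflow.keras import layers, Model

inp = layers.Input(shape=(784,))
code = layers.Dense(32, activation="relu")(inp)      # bottleneck code
out = layers.Dense(784, activation="sigmoid")(code)  # reconstruction
autoencoder = Model(inp, out)
autoencoder.compile(optimizer="adam", loss="mse")    # "as close as possible"

x = np.random.rand(1000, 784).astype("float32")      # stand-in for real images
autoencoder.fit(x, x, epochs=3, batch_size=64)       # target == input: unsupervised
```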
![Page 252: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/252.jpg)
Generating Images
• Training a decoder to generate images is unsupervised
• Variational Auto-encoder (VAE)
  • Ref: Auto-Encoding Variational Bayes, https://arxiv.org/abs/1312.6114
• Generative Adversarial Network (GAN)
  • Ref: Generative Adversarial Networks, http://arxiv.org/abs/1406.2661
(Generation: code → NN Decoder → image.)
![Page 253: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/253.jpg)
Which one is machine-generated?
Ref: https://openai.com/blog/generative-models/
![Page 254: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/254.jpg)
Generating Images by RNN
An RNN generates the image pixel by pixel: from the color of the 1st pixel it predicts the color of the 2nd; from the 2nd, the 3rd; from the 3rd, the 4th; and so on.
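A toy sketch of that generation loop, added for illustration (untrained random weights; it only shows the recurrence, not a trained Pixel RNN):

```python
# Each step feeds the previous pixel's color into the RNN cell and
# reads out the next pixel's color; an 8x8 RGB image is built row by row.
import numpy as np

rng = np.random.default_rng(0)
H = 32                                     # hidden state size
Wxh = rng.normal(scale=0.1, size=(3, H))   # pixel -> hidden
Whh = rng.normal(scale=0.1, size=(H, H))   # hidden -> hidden
Who = rng.normal(scale=0.1, size=(H, 3))   # hidden -> RGB

h, pixel, pixels = np.zeros(H), np.zeros(3), []
for _ in range(8 * 8):
    h = np.tanh(pixel @ Wxh + h @ Whh)     # update the RNN state
    pixel = 1 / (1 + np.exp(-(h @ Who)))   # next pixel's RGB in [0, 1]
    pixels.append(pixel)
image = np.array(pixels).reshape(8, 8, 3)
print(image.shape)
```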
![Page 255: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/255.jpg)
Generating Images by RNN
• Pixel Recurrent Neural Networks
  • https://arxiv.org/abs/1601.06759
(The slide compares real-world images with completions generated by the model.)
![Page 256: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/256.jpg)
Outline
Ultra Deep Network
Attention Model
Reinforcement Learning
Realizing what the World Looks Like
Understanding the Meaning of Words
Why Deep?
![Page 257: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/257.jpg)
Machine Reading
(Image source: http://top-breaking-news.com/)
• Machines learn the meaning of words by reading a lot of documents, without supervision
![Page 258: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/258.jpg)
Machine Reading
• Machines learn the meaning of words by reading a lot of documents, without supervision
dog
cat
rabbit
jumprun
flower
tree
Word Vector / Embedding
![Page 259: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/259.jpg)
Machine Reading
• Generating Word Vector/Embedding is unsupervised
A neural network maps each word to its vector: Apple → Neural Network → ?
The training data is just a lot of text.
(Image source: https://garavato.files.wordpress.com/2011/11/stacksdocuments.jpg?w=490)
![Page 260: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/260.jpg)
Machine Reading
• Machines learn the meaning of words by reading a lot of documents, without supervision
• A word can be understood by its context
  蔡英文 was sworn into office on May 20.
  馬英九 was sworn into office on May 20.
  → 蔡英文 and 馬英九 are something very similar.
You shall know a word by the company it keeps
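A sketch of learning such context-based vectors with gensim's word2vec (the toolkit is an assumption; with this toy two-sentence corpus the vectors are meaningless, but on real text, words sharing contexts end up with similar vectors):

```python
# Unsupervised: the only training data is raw text, split into sentences.
from gensim.models import Word2Vec

sentences = [["蔡英文", "was", "sworn", "into", "office", "on", "May", "20"],
             ["馬英九", "was", "sworn", "into", "office", "on", "May", "20"]]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)
print(model.wv.similarity("蔡英文", "馬英九"))  # similar contexts -> similar vectors
```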
![Page 261: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/261.jpg)
Word Vector
Source: http://www.slideshare.net/hustwj/cikm-keynotenov201444
![Page 262: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/262.jpg)
Word Vector
• Characteristics
• Solving analogies
$V(\text{hotter}) - V(\text{hot}) \approx V(\text{bigger}) - V(\text{big})$
$V(\text{Rome}) - V(\text{Italy}) \approx V(\text{Berlin}) - V(\text{Germany})$
$V(\text{king}) - V(\text{queen}) \approx V(\text{uncle}) - V(\text{aunt})$
Rome : Italy = Berlin : ?
$V(\text{Germany}) \approx V(\text{Berlin}) - V(\text{Rome}) + V(\text{Italy})$
Compute $V(\text{Berlin}) - V(\text{Rome}) + V(\text{Italy})$ and find the word $w$ with the closest $V(w)$.
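A sketch of that analogy computation with pre-trained vectors via gensim (an assumed toolkit; the path to the vector file is hypothetical):

```python
from gensim.models import KeyedVectors

# Hypothetical path to pre-trained vectors in word2vec format.
wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# Rome : Italy = Berlin : ?   i.e.  V(Berlin) - V(Rome) + V(Italy)
print(wv.most_similar(positive=["Berlin", "Italy"], negative=["Rome"], topn=1))
# expected: [("Germany", <similarity>)]
```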
![Page 263: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/263.jpg)
Machine Reading
• Machines learn the meaning of words by reading a lot of documents, without supervision
The machine learns to understand the slang of PTT users (鄉民用語) by reading the posts on PTT.
![Page 264: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/264.jpg)
Demo
• The model used in the demo was provided by 陳仰德
• Part of a project by 陳仰德 and 林資偉
• TA: 劉元銘
• The training data is from PTT (collected by 葉青峰)
![Page 265: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/265.jpg)
Outline
Ultra Deep Network
Attention Model
Reinforcement Learning
Realizing what the World Looks Like
Understanding the Meaning of Words
Why Deep?
![Page 266: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/266.jpg)
Deeper is Better?

| Layer × Size | Word Error Rate (%) | Layer × Size | Word Error Rate (%) |
|---|---|---|---|
| 1 × 2k | 24.2 |  |  |
| 2 × 2k | 20.4 |  |  |
| 3 × 2k | 18.4 |  |  |
| 4 × 2k | 17.8 |  |  |
| 5 × 2k | 17.2 | 1 × 3772 | 22.5 |
| 7 × 2k | 17.1 | 1 × 4634 | 22.6 |
|  |  | 1 × 16k | 22.1 |
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
Not surprising: more parameters, better performance.
![Page 267: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/267.jpg)
Universality Theorem
Reference for the reason: http://neuralnetworksanddeeplearning.com/chap4.html
Any continuous function $f: \mathbb{R}^N \rightarrow \mathbb{R}^M$ can be realized by a network with one hidden layer (given enough hidden neurons).
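The theorem in action, as a quick sketch (an added illustration using scikit-learn; the target function and layer width are arbitrary choices):

```python
# Approximate f(x) = sin(2x) on [-3, 3] with a single hidden layer.
import numpy as np
from sklearn.neural_network import MLPRegressor

x = np.linspace(-3, 3, 500).reshape(-1, 1)
y = np.sin(2 * x).ravel()                  # a continuous target f: R -> R

net = MLPRegressor(hidden_layer_sizes=(200,), activation="tanh",
                   max_iter=5000, random_state=0)
net.fit(x, y)
print("max abs error:", np.abs(net.predict(x) - y).max())
```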
Why a "deep" neural network, and not a "fat" one?
![Page 268: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/268.jpg)
Fat + Short vs. Thin + Tall

Shallow: one wide hidden layer over inputs $x_1, x_2, \ldots, x_N$.
Deep: many narrow hidden layers stacked over the same inputs.
Which one is better, given the same number of parameters?
![Page 269: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/269.jpg)
Fat + Short vs. Thin + Tall
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.
| Layer × Size | Word Error Rate (%) | Layer × Size | Word Error Rate (%) |
|---|---|---|---|
| 1 × 2k | 24.2 |  |  |
| 2 × 2k | 20.4 |  |  |
| 3 × 2k | 18.4 |  |  |
| 4 × 2k | 17.8 |  |  |
| 5 × 2k | 17.2 | 1 × 3772 | 22.5 |
| 7 × 2k | 17.1 | 1 × 4634 | 22.6 |
|  |  | 1 × 16k | 22.1 |
Why?
![Page 270: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/270.jpg)
Analogy
This slide is for those with an EE background.

| Logic circuits | Neural network |
|---|---|
| Logic circuits consist of gates. | A neural network consists of neurons. |
| Two layers of logic gates can represent any Boolean function. | A network with one hidden layer can represent any continuous function. |
| Building some functions with multiple layers of gates is much simpler: fewer gates needed. | Representing some functions with multiple layers of neurons is much simpler: fewer parameters, and perhaps less data? |
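The classic concrete case behind this analogy is the parity function (an added illustration, not from the slides): with only two layers you need about $2^{n-1}$ product terms, but with depth, $n-1$ XOR gates suffice.

```python
# Parity of n bits: exponential as a two-layer sum-of-products,
# linear as a deep chain of XOR gates.
from itertools import product

def parity_deep(bits):
    acc = 0
    for b in bits:          # n-1 XOR "gates", depth ~ n
        acc ^= b
    return acc

def parity_two_layer(bits):
    n = len(bits)
    # one AND term per odd-weight input pattern: 2^(n-1) terms at depth 2
    minterms = [m for m in product([0, 1], repeat=n) if sum(m) % 2 == 1]
    return int(any(m == tuple(bits) for m in minterms))

for x in product([0, 1], repeat=4):
    assert parity_deep(x) == parity_two_layer(x)
print("4-bit parity: 8 minterms (depth 2) vs 3 XOR gates (deep)")
```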
![Page 271: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/271.jpg)
Modularization
• Deep → Modularization
  Example: four classifiers over images:
  • Classifier 1: girls with long hair (plenty of examples)
  • Classifier 2: boys with long hair (few examples, so the classifier is weak)
  • Classifier 3: girls with short hair (plenty of examples)
  • Classifier 4: boys with short hair (plenty of examples)
![Page 272: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/272.jpg)
Modularization
• Deep → Modularization
  Instead, first train basic classifiers for the attributes:
  • Long or short hair?
  • Boy or girl?
  Unlike the four fine-grained classes, each basic classifier can have sufficient training examples.
![Page 273: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/273.jpg)
Modularization
• Deep → Modularization
  The attribute classifiers (long or short? boy or girl?) are shared as modules by the following classifiers:
  • Classifier 1: girls with long hair
  • Classifier 2: boys with long hair
  • Classifier 3: girls with short hair
  • Classifier 4: boys with short hair
  Because each final classifier just combines the modules' outputs, even Classifier 2 can be trained with little data and still work fine.
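A sketch of this shared-module idea, assuming Keras (the framework, input size, and layer widths are all assumptions for illustration):

```python
# Two attribute heads sharing one "basic classifier" layer; the shared
# layer sees every training image, even when one final class is rare.
from tensorflow.keras import layers, Model

img = layers.Input(shape=(64, 64, 3))
h = layers.Flatten()(img)
basic = layers.Dense(128, activation="relu")(h)        # shared module
gender = layers.Dense(2, activation="softmax", name="boy_or_girl")(basic)
hair = layers.Dense(2, activation="softmax", name="long_or_short")(basic)

model = Model(img, [gender, hair])
model.compile(optimizer="adam",
              loss=["sparse_categorical_crossentropy"] * 2)
model.summary()
```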
![Page 274: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/274.jpg)
Modularization
• Deep → Modularization
  In a deep network over inputs $x_1, x_2, \ldots, x_N$:
  • The 1st layer learns the most basic classifiers.
  • The 2nd layer uses the 1st layer as modules to build more complex classifiers.
  • Each later layer uses the previous layer as modules ……
  The modularization is automatically learned from data. → Less training data?
![Page 275: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/275.jpg)
Modularization
• Deep → Modularization
  (As on the previous slide: the 1st layer learns the most basic classifiers; the 2nd layer uses the 1st layer as modules to build classifiers; and so on.)
Reference: Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014 (pp. 818-833)
![Page 276: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/276.jpg)
Hand-crafted kernel vs. learnable kernel

• SVM: apply a hand-crafted kernel function, then a simple classifier.
  (Source of image: http://www.gipsa-lab.grenoble-inp.fr/transfert/seminaire/455_Kadri2013Gipsa-lab.pdf)
• Deep Learning: the hidden layers map the input $x = (x_1, \ldots, x_N)$ into a learned feature transform $\phi(x)$, a learnable kernel; the output layer $(y_1, \ldots, y_M)$ is then a simple classifier.
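A quick sketch of the contrast, added for illustration with scikit-learn (the dataset and model sizes are arbitrary): a fixed RBF kernel versus hidden layers that learn the feature map.

```python
# Fixed, hand-crafted kernel (SVM) vs. learned feature transform (MLP).
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
svm = SVC(kernel="rbf").fit(X, y)                     # phi(x) is fixed
mlp = MLPClassifier(hidden_layer_sizes=(32, 32),
                    max_iter=2000, random_state=0).fit(X, y)  # phi(x) learned
print(svm.score(X, y), mlp.score(X, y))
```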
![Page 277: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/277.jpg)
Boosting vs. Deep Learning

• Boosting: the input $x$ goes to many weak classifiers, whose outputs are combined into one boosted weak classifier (a single layer of weak learners).
• Deep Learning: each layer behaves like a set of boosted weak classifiers whose outputs feed the next layer, which boosts them again, layer after layer ……
![Page 278: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/278.jpg)
More Reasons
• Do Deep Nets Really Need To Be Deep? (by Rich Caruana)
  • http://research.microsoft.com/apps/video/default.aspx?id=232373&r=1
  • Keynote of Rich Caruana at ASRU 2015
![Page 279: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/279.jpg)
Concluding Remarks
![Page 280: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/280.jpg)
Today’s Lecture
Lecture IV: Next Wave
Lecture III: Variants of Neural Network
Lecture II: Tips for Training Deep Neural Network
Lecture I: Introduction of Deep Learning
![Page 281: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/281.jpg)
Some Opinions
• Also learn other machine learning methods
(Don't be the person who "doesn't want to know any method besides deep learning.")
![Page 282: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/282.jpg)
Some Opinions
• In some situations, simpler machine learning methods can be very powerful.
![Page 283: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/283.jpg)
Some Opinions
• The Cambrian explosion: some of those creatures have since gone extinct.
http://www.baike.com/gwiki/%E5%AF%92%E6%AD%A6%E7%BA%AA%E5%A4%A7%E7%88%86%E5%8F%91
![Page 284: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/284.jpg)
Some Opinions
• Deep learning is still in its "Shennong tasting a hundred herbs" (trial-and-error) phase
• Lots of questions still do not have answers
• However, that probably also makes the field easy to enter
http://orchid.shu.edu.tw/upload/article/20110927181605_1_pic.png
![Page 285: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/285.jpg)
If you want to "deep learn" deep learning ……
• "Neural Networks and Deep Learning"
  • Written by Michael Nielsen
  • http://neuralnetworksanddeeplearning.com/
• "Deep Learning"
  • Written by Yoshua Bengio, Ian J. Goodfellow and Aaron Courville
  • http://www.iro.umontreal.ca/~bengioy/dlbook/
• Course: Machine Learning and Having It Deep and Structured
  • http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLSD15_2.html
![Page 286: Deep Learning Tutorial - GitHub Pages learning attracts lots of attention. • I believe you have seen lots of exciting results before. This talk focuses on the basic techniques](https://reader031.vdocuments.us/reader031/viewer/2022030419/5aa5f2e47f8b9afa758de64d/html5/thumbnails/286.jpg)
For Data Science Enthusiasts
• NTU's Department of Electrical Engineering has established a "Data Science and Intelligent Networking" group (資料科學與智慧網路組) in the Graduate Institute of Communication Engineering, and has begun recruiting master's and PhD students
• Applications open this fall