Lecture 5 - 東京工業大学 (Tokyo Institute of Technology): Speech and Language Processing, Neural network based acoustic and language models
TRANSCRIPT
Speech and Language Processing, Lecture 5
Neural network based acoustic and language models
Information and Communications Engineering Course
Takahiro Shinozaki
2019/11/5
Lecture Plan (Shinozaki's part)
1. 10/18 (remote) Speech recognition based on GMM, HMM, and N-gram
2. 10/25 (remote) Maximum likelihood estimation and EM algorithm
3. 11/4 (@TAIST) Bayesian network and Bayesian inference
4. 11/4 (@TAIST) Variational inference and sampling
5. 11/5 (@TAIST) Neural network based acoustic and language models
6. 11/5 (@TAIST) Weighted finite state transducer (WFST) and speech decoding

I give the first six lectures, which are about speech recognition. Through these lectures, the backbone of the latest speech recognition techniques is explained.
Today's Topic
• Answers for the previous exercises
• Neural network based acoustic and language models
Answers for the Previous Exercises
Exercise 4.1
• When p(x) and y = f(x) are given as follows, obtain the distribution q(y):
  p(x) = 1, x ∈ (0, 1),  y = −log(1 − x)

Answer:
  x = 1 − exp(−y),  dx/dy = exp(−y)
  x = 0 → y = 0,  x = 1 → y = ∞
  q(y) = p(x) |dx/dy| = exp(−y),  y ∈ (0, ∞)

(Figures: histogram of x and histogram of y)
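The change of variables above can be checked by simulation; the following sketch (not from the slides) draws uniform samples, applies y = −log(1 − x), and confirms that y behaves like the Exponential(1) distribution q(y) = exp(−y), whose mean and variance are both 1.

```python
# Monte Carlo check of Exercise 4.1: if x ~ Uniform(0,1) and
# y = -log(1-x), then q(y) = exp(-y), i.e. Exponential(1).
import math
import random

random.seed(0)
ys = [-math.log(1.0 - random.random()) for _ in range(200_000)]
mean = sum(ys) / len(ys)
var = sum((y - mean) ** 2 for y in ys) / len(ys)
print(round(mean, 2), round(var, 2))  # both should be close to 1.0
```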
Exercise 4.2
• When p(x) and y = f(x) are given as follows, obtain the distribution q(y):
  p(x) = (1/√(2π)) exp(−x²/2) = N(x|0, 1),  x ∈ (−∞, ∞),  y = 3x + 4

Answer:
  x = (y − 4)/3,  dx/dy = 1/3
  q(y) = p(x) |dx/dy| = (1/(3√(2π))) exp(−(1/2)((y − 4)/3)²) = N(y|4, 3²)

(Figures: histogram of x and histogram of y)
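As with the previous exercise, the result can be verified by sampling; this sketch (not from the slides) checks that y = 3x + 4 with x ~ N(0, 1) has mean 4 and standard deviation 3, matching N(y|4, 3²).

```python
# Monte Carlo check of Exercise 4.2: if x ~ N(0,1) and y = 3x + 4,
# then q(y) = N(y | 4, 3^2).
import random

random.seed(0)
ys = [3.0 * random.gauss(0.0, 1.0) + 4.0 for _ in range(200_000)]
mean = sum(ys) / len(ys)
std = (sum((y - mean) ** 2 for y in ys) / len(ys)) ** 0.5
print(round(mean, 1), round(std, 1))  # close to 4.0 and 3.0
```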
Exercise 4.3
• Show that N(xA|xB, 1) = N(xB|xA, 1), where N(x|m, v) is the Gaussian distribution with mean m and variance v

Answer:
  N(xA|xB, 1) = (1/√(2π)) exp(−(xA − xB)²/2)
              = (1/√(2π)) exp(−(xB − xA)²/2)
              = N(xB|xA, 1)
Neural Network
Multi Layer Perceptron (MLP)
• Unit of MLP:
  y = h( Σ_i w_i x_i + b )
  h: activation function, w: weight, b: bias
• An MLP consists of multiple layers of these units: an input layer (x1, x2, …, xn), hidden layers, and an output layer (y1, …, ym)
Activation Functions
• Linear function: h(x) = x
• Unit step function: h(x) = 1 if x ≥ 0, 0 otherwise
• Sigmoid function: h(x) = 1/(1 + exp(−x))
• Hinge function (also called ReLU): h(x) = max{0, x}
Softmax Function
• For N variables z_i, the softmax function is:
  h(z_i) = exp(z_i) / Σ_j exp(z_j)
• Properties of softmax:
  • Positive: 0 < h(z_i)
  • Sum is one: Σ_{i=1..N} h(z_i) = 1.0
  → Expresses a probability distribution
• Examples:
  Z = (z1, z2, z3) = (−1, 2, 1) → (h(z1), h(z2), h(z3)) = (0.0351, 0.7054, 0.2595)
  Z = (z1, z2, z3) = (12, 8, 16) → (h(z1), h(z2), h(z3)) = (0.0180, 0.0003, 0.9817)
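The two example vectors above can be reproduced with a few lines of code (a sketch, not from the slides); subtracting max(z) before exponentiating is a standard trick that avoids overflow without changing the result.

```python
# Numerically check the softmax examples on this slide.
import math

def softmax(z):
    m = max(z)                      # stabilization: softmax is shift-invariant
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

p1 = softmax([-1, 2, 1])
p2 = softmax([12, 8, 16])
print([round(p, 4) for p in p1])  # [0.0351, 0.7054, 0.2595]
print([round(p, 4) for p in p2])  # [0.018, 0.0003, 0.9817]
```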
Exercise 5.1
• Let h be a softmax function having inputs z1, z2, …, zN:
  h(z_i) = exp(z_i) / Σ_j exp(z_j)
• Prove that Σ_{i=1..N} h(z_i) = 1.0

Answer:
  Σ_{i=1..N} h(z_i) = Σ_{i=1..N} exp(z_i) / Σ_j exp(z_j) = (Σ_i exp(z_i)) / (Σ_j exp(z_j)) = 1
Forward Propagation
• Compute the output of the MLP step by step from the input layer to the output layer:
  Input vector → e.g. sigmoid layer → e.g. sigmoid layer → e.g. softmax layer
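The layer stack above can be sketched directly in code; the weights and the input vector below are made-up illustration values, not from the slides.

```python
# Forward propagation through sigmoid -> sigmoid -> softmax layers.
import math

def sigmoid_layer(x, W, b):
    # each row of W holds the weights of one unit
    return [1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(row, x)) + bi)))
            for row, bi in zip(W, b)]

def softmax_layer(x, W, b):
    z = [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

x = [0.5, -1.0]                                            # input vector (assumed)
h1 = sigmoid_layer(x, [[1.0, -0.5], [0.3, 0.8]], [0.0, 0.1])
h2 = sigmoid_layer(h1, [[0.2, 0.7], [-0.4, 0.5]], [0.0, 0.0])
y = softmax_layer(h2, [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0])
print(y)  # a probability distribution: positive values summing to 1
```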
Parameters of Neural Network
• The weights and the bias of each unit need to be trained before the network is used:
  y = h( Σ_i w_i x_i + b ) = h(w · x)
  h: activation function
  w: weight vector, w = (w1, w2, …, wN, b)
  x: input vector, x = (x1, x2, …, xN, 1)
• The bias b can be regarded as one of the weights, whose input takes a constant value 1.0
Principle of NN Training
• Training set: pairs of an input vector and a reference output vector
• Adjust the parameters of the MLP so as to minimize the error between the output by the MLP and the reference output vector
Definitions of Errors
• Sum of square error (used when the output layer uses linear functions):
  E(W) = (1/2) Σ_n ( y(X_n, W) − t_n )²
• Cross-entropy (used when the output layer is a softmax):
  E(W) = − Σ_n Σ_k t_nk ln y_k(X_n, W)
  W: set of weights in the MLP
  n: index of training samples
  X_n: vector of a training sample (input)
  t_n: vector of a training sample (output)
  k: index of output units
  t_nk: reference output (takes 1 if unit k corresponds to the correct output, 0 otherwise)
Gradient Descent
• An iterative optimization method:
  x_(t+1) = x_t − ε ∂f(x)/∂x evaluated at x = x_t
  ε: learning rate (small positive value)
• Starting from an initial value x0, the sequence x0, x1, x2, …, xN moves downhill on f(x)
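The update rule above can be run on a concrete one-dimensional example; f(x) = (x − 3)² below is a made-up illustration whose minimum is at x = 3.

```python
# Gradient-descent sketch for the update x_(t+1) = x_t - eps * df/dx.
def grad(x):          # df/dx = 2(x - 3) for f(x) = (x - 3)^2
    return 2.0 * (x - 3.0)

x = 0.0               # initial value x0
eps = 0.1             # learning rate (small positive value)
for _ in range(100):
    x = x - eps * grad(x)
print(round(x, 4))  # 3.0
```

Because the error shrinks by a constant factor each step, 100 iterations are more than enough to reach the minimum to four decimal places.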
MLP Training by Gradient Descent
• Define an error measure E(W) for the training samples, e.g.
  E(W) = (1/2) Σ_n ( y(X_n, W) − t_n )²
• Initialize the parameters W = {w1, w2, …, wM}
• Repeatedly update the parameter set using gradient descent:
  w_i(t+1) = w_i(t) − ε ∂E(W)/∂w_i evaluated at w_i = w_i(t)
Chain Rule of Differentiation
• When z = f(y) and y = g(x)  (x → g → y → f → z):
• When x, y, z are scalars:
  ∂z/∂x = (∂z/∂y)(∂y/∂x)
• When x, y, z are vectors, e.g. x = (x1, x2, x3), y = (y1, y2), z = (z1, z2):
  The same rule holds using Jacobian matrices,
  ∂z/∂x = (∂z/∂y)(∂y/∂x)
  where ∂z/∂x = [∂z_i/∂x_j] (2×3), ∂z/∂y = [∂z_i/∂y_j] (2×2), and ∂y/∂x = [∂y_i/∂x_j] (2×3) are the Jacobian matrices
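The matrix form of the chain rule can be checked numerically; the functions g and f below are made-up examples with x ∈ R³, y ∈ R², z ∈ R², and the Jacobians are estimated by finite differences.

```python
# Numeric check: Jacobian of f(g(x)) equals (dz/dy)(dy/dx).
def g(x):  # y = g(x), maps R^3 -> R^2
    return [x[0] + 2 * x[1], x[1] * x[2]]

def f(y):  # z = f(y), maps R^2 -> R^2
    return [y[0] * y[1], y[0] - y[1]]

def jac(fun, v, eps=1e-6):
    # finite-difference Jacobian [d fun_i / d v_j]
    base = fun(v)
    J = []
    for i in range(len(base)):
        row = []
        for j in range(len(v)):
            vp = list(v)
            vp[j] += eps
            row.append((fun(vp)[i] - base[i]) / eps)
        J.append(row)
    return J

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

x = [1.0, 2.0, 3.0]
J_direct = jac(lambda v: f(g(v)), x)       # dz/dx computed directly
J_chain = matmul(jac(f, g(x)), jac(g, x))  # (dz/dy)(dy/dx)
print(J_direct)
print(J_chain)  # the two matrices agree up to finite-difference error
```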
When There Are Branches
• When z = f(y1, y2) with y1 = g1(x) and y2 = g2(x):
  ∂z/∂x = (∂z/∂y1)(∂y1/∂x) + (∂z/∂y2)(∂y2/∂x)
• Variations:
  • g1(x) = x (identity): ∂z/∂x = ∂z/∂y1 + (∂z/∂y2)(∂y2/∂x)
  • g2(x) = C (independent of x): ∂z/∂x = (∂z/∂y1)(∂y1/∂x)
Back Propagation (BP)
• The network is a chain: y1 = f1(x, w1), y2 = f2(y1, w2), y3 = f3(y2, w3), y4 = f4(y3, w4), Err = E(y4, r)
  Ex.: y3 = sigmoid(w3 · y2), y4 = softmax(w4 · y3)
  x: input, r: reference output
• ① Obtain the value of each node by forward propagation
• ② Obtain the derivatives by backward propagation:
  ∂Err/∂f4 from the error function, then
  ∂Err/∂f3 = (∂Err/∂f4)(∂f4/∂f3),  ∂Err/∂f2 = (∂Err/∂f3)(∂f3/∂f2),  ∂Err/∂f1 = (∂Err/∂f2)(∂f2/∂f1)
  and for each layer i, ∂Err/∂w_i = (∂Err/∂f_i)(∂f_i/∂w_i)
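The two-pass procedure above can be made concrete on a tiny chain of scalar sigmoid "layers" y_i = sigmoid(w_i · y_{i−1}) with squared error; the weights, input, and reference value below are made-up illustration values.

```python
# Back-propagation sketch: forward pass stores node values, backward
# pass accumulates dErr/dw_i by the chain rule; a finite-difference
# check confirms the gradient.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w = [0.5, -1.2, 0.8]
x, r = 1.0, 0.3

# (1) forward propagation: obtain the value of each node
ys = [x]
for wi in w:
    ys.append(sigmoid(wi * ys[-1]))
err = 0.5 * (ys[-1] - r) ** 2

# (2) backward propagation: dErr/dy flows backwards through the chain
d_err_dy = ys[-1] - r
grads = [0.0] * len(w)
for i in reversed(range(len(w))):
    local = ys[i + 1] * (1.0 - ys[i + 1])   # sigmoid'(z) = y(1 - y)
    grads[i] = d_err_dy * local * ys[i]     # dErr/dw_i
    d_err_dy = d_err_dy * local * w[i]      # dErr/dy_(i-1)

# finite-difference check of dErr/dw_0
eps = 1e-6
y = x
for wi in [w[0] + eps] + w[1:]:
    y = sigmoid(wi * y)
fd = (0.5 * (y - r) ** 2 - err) / eps
print(abs(grads[0] - fd) < 1e-5)  # True: backprop matches finite differences
```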
Feed-Forward Neural Network
• When the network structure is a DAG (directed acyclic graph), it is called a feed-forward network
• The nodes can be ordered in a line (a topological order) so that all connections point in the same direction
• Forward/backward propagation can then be applied efficiently
Exercise 5.2
When h(y) and y(x) are given as follows, obtain ∂h/∂x:
  h(y) = 1/(1 + exp(−y)),  y = ax + b

Answer:
  ∂h/∂x = (∂h/∂y)(∂y/∂x)
        = [exp(−y)/(1 + exp(−y))²] · a
        = a · exp(−(ax + b)) / (1 + exp(−(ax + b)))²
        = a · h(ax + b) · (1 − h(ax + b))
Recurrent Neural Network (RNN)
• Neural network having a feedback• Expected to be more powerful modeling performance
than feed-forward MLP, but the training is more difficult
24
Delay
Input
Output
Input layer
Output layer
Hidden layers
Unfolding of RNN Along the Time Axis
• Unfold through time: the delayed feedback is expanded into one copy of the network per time step, connected to the input feature sequence and the reference vector sequence
Training of RNN by BP Through Time (BPTT)
• Regard the input sequence (x1, x2, x3, x4) as one input and the output sequence (y1, y2, y3, y4) as one output
• Apply BP to the unfolded network: back-propagation runs backwards through the hidden states h4 → h3 → h2 → h1
Long Short-Term Memory (LSTM)
• A type of RNN addressing the gradient vanishing problem
• Inputs at time t: x_t, plus the previous output y_(t−1) and cell state c_(t−1) through delay units; outputs: y_t and the new cell state c_t
• Three gates, each a sigmoid layer with affine transform (σ): the forget gate, the input gate, and the output gate
• A tanh layer with affine transform proposes new cell content; pointwise multiplication (⊗) applies the gates, and a sum (⊕) updates the cell state
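A single LSTM step as drawn above can be sketched for scalar values; all weights here are made-up illustration values, and a real implementation would use vectors and weight matrices.

```python
# Minimal scalar LSTM cell: forget/input/output gates plus a tanh
# content layer, combined exactly as in the diagram.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, y_prev, c_prev, W):
    # each gate is a "sigmoid layer with affine transform" of [x_t, y_prev]
    f = sigmoid(W["f"][0] * x_t + W["f"][1] * y_prev + W["f"][2])    # forget gate
    i = sigmoid(W["i"][0] * x_t + W["i"][1] * y_prev + W["i"][2])    # input gate
    o = sigmoid(W["o"][0] * x_t + W["o"][1] * y_prev + W["o"][2])    # output gate
    g = math.tanh(W["g"][0] * x_t + W["g"][1] * y_prev + W["g"][2])  # new content
    c_t = f * c_prev + i * g        # pointwise multiply (⊗) and sum (⊕)
    y_t = o * math.tanh(c_t)        # output gate applied to squashed cell state
    return y_t, c_t

W = {"f": [0.5, 0.1, 1.0], "i": [0.6, -0.2, 0.0],
     "o": [1.0, 0.3, 0.0], "g": [0.9, 0.4, 0.0]}
y, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:         # run three time steps
    y, c = lstm_step(x, y, c, W)
print(y, c)  # |y| < 1 by construction, since y = o * tanh(c)
```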
Convolutional Neural Network (CNN)
• A type of feed-forward neural network with parameter sharing and connection constraints
• A filter is shifted and applied at different positions of the input; each filter (1) … (N) produces one activation map
• Convolution layer → pooling layer → next convolution layer etc.

(Figure: input rows 1 3 3 4 2 1 / 3 5 2 1 3 5, filters (1)–(N), activation maps (1)–(N), and a pooled output 5 4 5)
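The convolution-then-pooling pattern above can be sketched in one dimension; the input row, filter, and window size below are made-up illustration values rather than the exact ones in the figure.

```python
# 1-D convolution (one shared, shifted filter) followed by max pooling.
def conv1d(x, filt):
    # shift the filter across the input (valid positions only, stride 1)
    k = len(filt)
    return [sum(f * v for f, v in zip(filt, x[i:i + k]))
            for i in range(len(x) - k + 1)]

def max_pool(x, size):
    # non-overlapping max pooling
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]

x = [1, 3, 3, 4, 2, 1]          # input row
filt = [1, 0, -1]               # one shared filter (parameter sharing)
amap = conv1d(x, filt)          # activation map
print(amap)                     # [-2, -1, 1, 3]
print(max_pool(amap, 2))        # [-1, 3]
```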
Neural Network Based Acoustic Model
Frame Level Vowel Recognition Using MLP
• Input: speech feature vector (e.g. MFCC)
• Sigmoid layer → sigmoid layer → softmax layer
• Output: posterior probabilities of the five vowels, e.g. p(/a/) = 0.1, p(/i/) = 0.4, p(/u/) = 0.2, p(/e/) = 0.15, p(/o/) = 0.15
Exercise 5.3
Obtain the recognition result (yes or no). You may use a calculator.
• Inputs: 2.5 and −4.0
• Network: a sigmoid hidden layer followed by a softmax output layer giving P(yes) and P(no)
• Edge weights shown in the figure: 1.5, −2, 1, −1, 2, −2, −2.5, 3
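The exercise can be solved mechanically with a short script. Note: the transcript does not preserve which weight attaches to which edge, so the weight assignment below is a made-up placeholder; substitute the values as they actually appear in the figure before reading off the answer.

```python
# Sketch for evaluating Exercise 5.3's network: two inputs, two
# sigmoid hidden units, and a softmax layer giving P(yes), P(no).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

x = [2.5, -4.0]                       # inputs from the slide
Wh = [[1.5, -2.0], [1.0, -1.0]]       # hidden weights (PLACEHOLDER assignment)
Wo = [[2.0, -2.0], [-2.5, 3.0]]       # output weights (PLACEHOLDER assignment)

h = [sigmoid(sum(w * v for w, v in zip(row, x))) for row in Wh]
p_yes, p_no = softmax([sum(w * v for w, v in zip(row, h)) for row in Wo])
print("yes" if p_yes > p_no else "no")
```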
Combination of HMM and MLP
• GMM-HMM: each state s (s0 … s4) has a GMM output distribution:
  p(X|s) = GMM_s(X)
• MLP-HMM: the softmax layer of an MLP gives the state posterior p(s|X), which is converted to a scaled likelihood:
  p(X|s) ∝ p(s|X)/p(s) = MLP(s|X)/p(s)
MLP-HMM Based Phone Recognizer
• Input speech feature → sigmoid layer → sigmoid layer → softmax layer
• The softmax outputs feed the states of phone HMMs (/a/, /i/, …, /N/) connected between a start node and an end node
Neural Network Based Language Model
Word Vector
• One-of-K representation of a word for a fixed vocabulary:

  word        ID   1-of-K
  Apple       1    <1,0,0,0,0,0,0>
  Banana      2    <0,1,0,0,0,0,0>
  Cherry      3    <0,0,1,0,0,0,0>
  Durian      4    <0,0,0,1,0,0,0>
  Orange      5    <0,0,0,0,1,0,0>
  Pineapple   6    <0,0,0,0,0,1,0>
  Strawberry  7    <0,0,0,0,0,0,1>
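The one-of-K encoding of the table above is straightforward to implement (a sketch): a vector with a single 1 at the word's position.

```python
# One-of-K (one-hot) word vectors for the fixed vocabulary above.
vocab = ["Apple", "Banana", "Cherry", "Durian",
         "Orange", "Pineapple", "Strawberry"]

def one_of_k(word):
    # 1 at the word's position, 0 elsewhere
    idx = vocab.index(word)
    return [1 if i == idx else 0 for i in range(len(vocab))]

print(one_of_k("Cherry"))  # [0, 0, 1, 0, 0, 0, 0]
```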
Word Prediction Using RNN
• Input: the one-of-K vector of word_(t−1), e.g. <0, 0, 0, 1, 0, 0, 0>
• Output: a distribution over word_t, e.g. <0.02, 0.65, 0.14, 0.11, 0.05, 0.01, 0.02>
• The delayed hidden state (D) carries the history of the preceding words
RNN Language Model (Unfolded)
• The unfolded RNN predicts each next word given the words so far, from <s> through "Delicious Big Red Apple" to </s>
• The product of these step-wise predictions gives P(<s>, Delicious, Big, Red, Apple, </s>)
Dialogue System Using Seq2Seq Network
• Encoder network: reads the input "What is your name"
• Decoder network: starting from <s>, generates the output "My name is TS-800 </s>", sampling each word from the posterior
Evolution of Compute Hardware
• 2002: Earth Simulator, 40.96 TFLOPS
• 2017: GeForce GTX 1080 Ti, 10.609 TFLOPS, 699 USD
(Pictures from Wikipedia and Nvidia.com)