Lecture 5 - 東京工業大学 (Tokyo Institute of Technology): Speech and Language Processing, Neural network based acoustic and language models
TRANSCRIPT
Speech and Language Processing, Lecture 5
Neural network based acoustic and language models
Information and Communications Engineering Course
Takahiro Shinozaki
2019/11/5
Lecture Plan (Shinozaki's part)
1. 10/18 (remote) Speech recognition based on GMM, HMM, and N-gram
2. 10/25 (remote) Maximum likelihood estimation and EM algorithm
3. 11/4 (@TAIST) Bayesian network and Bayesian inference
4. 11/4 (@TAIST) Variational inference and sampling
5. 11/5 (@TAIST) Neural network based acoustic and language models
6. 11/5 (@TAIST) Weighted finite state transducer (WFST) and speech decoding

I give the first six lectures, which are about speech recognition. Through these lectures, the backbone of the latest speech recognition techniques is explained.
Today's Topic
• Answers for the previous exercises
• Neural network based acoustic and language models
Answers for the Previous Exercises
Exercise 4.1
• When p(x) and y = f(x) are given as follows, obtain the distribution q(y):
  p(x) = 1, x ∈ (0, 1),  y = −log(1 − x)

Answer:
  x = 1 − exp(−y),  dx/dy = exp(−y)
  x = 0 → y = 0,  x = 1 → y = ∞
  q(y) = p(x) |dx/dy| = exp(−y),  y ∈ (0, ∞)

(Figures: histogram of x and histogram of y)
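The change of variables above can be checked by simulation; the following sketch (not from the slides) draws uniform samples, applies y = −log(1 − x), and confirms that y behaves like the Exponential(1) distribution q(y) = exp(−y), whose mean and variance are both 1.

```python
# Monte Carlo check of Exercise 4.1: if x ~ Uniform(0,1) and
# y = -log(1-x), then q(y) = exp(-y), i.e. Exponential(1).
import math
import random

random.seed(0)
ys = [-math.log(1.0 - random.random()) for _ in range(200_000)]
mean = sum(ys) / len(ys)
var = sum((y - mean) ** 2 for y in ys) / len(ys)
print(round(mean, 2), round(var, 2))  # both should be close to 1.0
```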
Exercise 4.2
• When p(x) and y = f(x) are given as follows, obtain the distribution q(y):
  p(x) = (1/√(2π)) exp(−x²/2) = N(x|0, 1),  x ∈ (−∞, ∞),  y = 3x + 4

Answer:
  x = (y − 4)/3,  dx/dy = 1/3
  q(y) = p(x) |dx/dy| = (1/(3√(2π))) exp(−(1/2)((y − 4)/3)²) = N(y|4, 3²)

(Figures: histogram of x and histogram of y)
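As with the previous exercise, the result can be verified by sampling; this sketch (not from the slides) checks that y = 3x + 4 with x ~ N(0, 1) has mean 4 and standard deviation 3, matching N(y|4, 3²).

```python
# Monte Carlo check of Exercise 4.2: if x ~ N(0,1) and y = 3x + 4,
# then q(y) = N(y | 4, 3^2).
import random

random.seed(0)
ys = [3.0 * random.gauss(0.0, 1.0) + 4.0 for _ in range(200_000)]
mean = sum(ys) / len(ys)
std = (sum((y - mean) ** 2 for y in ys) / len(ys)) ** 0.5
print(round(mean, 1), round(std, 1))  # close to 4.0 and 3.0
```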
Exercise 4.3
• Show that N(xA|xB, 1) = N(xB|xA, 1), where N(x|m, v) is the Gaussian distribution with mean m and variance v

Answer:
  N(xA|xB, 1) = (1/√(2π)) exp(−(xA − xB)²/2)
              = (1/√(2π)) exp(−(xB − xA)²/2)
              = N(xB|xA, 1)
Neural Network
Multi Layer Perceptron (MLP)
• Unit of MLP:
  y = h( Σ_i w_i x_i + b )
  h: activation function, w: weight, b: bias
• An MLP consists of multiple layers of these units: an input layer (x1, x2, …, xn), hidden layers, and an output layer (y1, …, ym)
Activation Functions
• Linear function: h(x) = x
• Unit step function: h(x) = 1 if x ≥ 0, 0 otherwise
• Sigmoid function: h(x) = 1/(1 + exp(−x))
• Hinge function (also called ReLU): h(x) = max{0, x}
Softmax Function
• For N variables z_i, the softmax function is:
  h(z_i) = exp(z_i) / Σ_j exp(z_j)
• Properties of softmax:
  • Positive: 0 < h(z_i)
  • Sum is one: Σ_{i=1..N} h(z_i) = 1.0
  → Expresses a probability distribution
• Examples:
  Z = (z1, z2, z3) = (−1, 2, 1) → (h(z1), h(z2), h(z3)) = (0.0351, 0.7054, 0.2595)
  Z = (z1, z2, z3) = (12, 8, 16) → (h(z1), h(z2), h(z3)) = (0.0180, 0.0003, 0.9817)
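The two example vectors above can be reproduced with a few lines of code (a sketch, not from the slides); subtracting max(z) before exponentiating is a standard trick that avoids overflow without changing the result.

```python
# Numerically check the softmax examples on this slide.
import math

def softmax(z):
    m = max(z)                      # stabilization: softmax is shift-invariant
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

p1 = softmax([-1, 2, 1])
p2 = softmax([12, 8, 16])
print([round(p, 4) for p in p1])  # [0.0351, 0.7054, 0.2595]
print([round(p, 4) for p in p2])  # [0.018, 0.0003, 0.9817]
```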
Exercise 5.1
• Let h be a softmax function having inputs z1, z2, …, zN:
  h(z_i) = exp(z_i) / Σ_j exp(z_j)
• Prove that Σ_{i=1..N} h(z_i) = 1.0

Answer:
  Σ_{i=1..N} h(z_i) = Σ_{i=1..N} exp(z_i) / Σ_j exp(z_j) = (Σ_i exp(z_i)) / (Σ_j exp(z_j)) = 1
Forward Propagation
• Compute the output of the MLP step by step from the input layer to the output layer:
  Input vector → e.g. sigmoid layer → e.g. sigmoid layer → e.g. softmax layer
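The layer stack above can be sketched directly in code; the weights and the input vector below are made-up illustration values, not from the slides.

```python
# Forward propagation through sigmoid -> sigmoid -> softmax layers.
import math

def sigmoid_layer(x, W, b):
    # each row of W holds the weights of one unit
    return [1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(row, x)) + bi)))
            for row, bi in zip(W, b)]

def softmax_layer(x, W, b):
    z = [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

x = [0.5, -1.0]                                            # input vector (assumed)
h1 = sigmoid_layer(x, [[1.0, -0.5], [0.3, 0.8]], [0.0, 0.1])
h2 = sigmoid_layer(h1, [[0.2, 0.7], [-0.4, 0.5]], [0.0, 0.0])
y = softmax_layer(h2, [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0])
print(y)  # a probability distribution: positive values summing to 1
```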
Parameters of Neural Network
• The weights and the bias of each unit need to be trained before the network is used:
  y = h( Σ_i w_i x_i + b ) = h(w · x)
  h: activation function
  w: weight vector, w = (w1, w2, …, wN, b)
  x: input vector, x = (x1, x2, …, xN, 1)
• The bias b can be regarded as one of the weights, whose input takes a constant value 1.0
Principle of NN Training
• Training set: pairs of an input vector and a reference output vector
• Adjust the parameters of the MLP so as to minimize the error between the output by the MLP and the reference output vector
Definitions of Errors
• Sum of square error (used when the output layer uses linear functions):
  E(W) = (1/2) Σ_n ( y(X_n, W) − t_n )²
• Cross-entropy (used when the output layer is a softmax):
  E(W) = − Σ_n Σ_k t_nk ln y_k(X_n, W)
  W: set of weights in the MLP
  n: index of training samples
  X_n: vector of a training sample (input)
  t_n: vector of a training sample (output)
  k: index of output units
  t_nk: reference output (takes 1 if unit k corresponds to the correct output, 0 otherwise)
Gradient Descent
• An iterative optimization method:
  x_(t+1) = x_t − ε ∂f(x)/∂x evaluated at x = x_t
  ε: learning rate (small positive value)
• Starting from an initial value x0, the sequence x0, x1, x2, …, xN moves downhill on f(x)
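The update rule above can be run on a concrete one-dimensional example; f(x) = (x − 3)² below is a made-up illustration whose minimum is at x = 3.

```python
# Gradient-descent sketch for the update x_(t+1) = x_t - eps * df/dx.
def grad(x):          # df/dx = 2(x - 3) for f(x) = (x - 3)^2
    return 2.0 * (x - 3.0)

x = 0.0               # initial value x0
eps = 0.1             # learning rate (small positive value)
for _ in range(100):
    x = x - eps * grad(x)
print(round(x, 4))  # 3.0
```

Because the error shrinks by a constant factor each step, 100 iterations are more than enough to reach the minimum to four decimal places.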
MLP Training by Gradient Descent
• Define an error measure E(W) for the training samples, e.g.
  E(W) = (1/2) Σ_n ( y(X_n, W) − t_n )²
• Initialize the parameters W = {w1, w2, …, wM}
• Repeatedly update the parameter set using gradient descent:
  w_i(t+1) = w_i(t) − ε ∂E(W)/∂w_i evaluated at w_i = w_i(t)
Chain Rule of Differentiation
• When z = f(y) and y = g(x)  (x → g → y → f → z):
• When x, y, z are scalars:
  ∂z/∂x = (∂z/∂y)(∂y/∂x)
• When x, y, z are vectors, e.g. x = (x1, x2, x3), y = (y1, y2), z = (z1, z2):
  The same rule holds using Jacobian matrices,
  ∂z/∂x = (∂z/∂y)(∂y/∂x)
  where ∂z/∂x = [∂z_i/∂x_j] (2×3), ∂z/∂y = [∂z_i/∂y_j] (2×2), and ∂y/∂x = [∂y_i/∂x_j] (2×3) are the Jacobian matrices
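The matrix form of the chain rule can be checked numerically; the functions g and f below are made-up examples with x ∈ R³, y ∈ R², z ∈ R², and the Jacobians are estimated by finite differences.

```python
# Numeric check: Jacobian of f(g(x)) equals (dz/dy)(dy/dx).
def g(x):  # y = g(x), maps R^3 -> R^2
    return [x[0] + 2 * x[1], x[1] * x[2]]

def f(y):  # z = f(y), maps R^2 -> R^2
    return [y[0] * y[1], y[0] - y[1]]

def jac(fun, v, eps=1e-6):
    # finite-difference Jacobian [d fun_i / d v_j]
    base = fun(v)
    J = []
    for i in range(len(base)):
        row = []
        for j in range(len(v)):
            vp = list(v)
            vp[j] += eps
            row.append((fun(vp)[i] - base[i]) / eps)
        J.append(row)
    return J

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

x = [1.0, 2.0, 3.0]
J_direct = jac(lambda v: f(g(v)), x)       # dz/dx computed directly
J_chain = matmul(jac(f, g(x)), jac(g, x))  # (dz/dy)(dy/dx)
print(J_direct)
print(J_chain)  # the two matrices agree up to finite-difference error
```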
When There Are Branches
• When z = f(y1, y2) with y1 = g1(x) and y2 = g2(x):
  ∂z/∂x = (∂z/∂y1)(∂y1/∂x) + (∂z/∂y2)(∂y2/∂x)
• Variations:
  • g1(x) = x (identity): ∂z/∂x = ∂z/∂y1 + (∂z/∂y2)(∂y2/∂x)
  • g2(x) = C (independent of x): ∂z/∂x = (∂z/∂y1)(∂y1/∂x)
Back Propagation (BP)
• The network is a chain: y1 = f1(x, w1), y2 = f2(y1, w2), y3 = f3(y2, w3), y4 = f4(y3, w4), Err = E(y4, r)
  Ex.: y3 = sigmoid(w3 · y2), y4 = softmax(w4 · y3)
  x: input, r: reference output
• ① Obtain the value of each node by forward propagation
• ② Obtain the derivatives by backward propagation:
  ∂Err/∂f4 from the error function, then
  ∂Err/∂f3 = (∂Err/∂f4)(∂f4/∂f3),  ∂Err/∂f2 = (∂Err/∂f3)(∂f3/∂f2),  ∂Err/∂f1 = (∂Err/∂f2)(∂f2/∂f1)
  and for each layer i, ∂Err/∂w_i = (∂Err/∂f_i)(∂f_i/∂w_i)
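The two-pass procedure above can be made concrete on a tiny chain of scalar sigmoid "layers" y_i = sigmoid(w_i · y_{i−1}) with squared error; the weights, input, and reference value below are made-up illustration values.

```python
# Back-propagation sketch: forward pass stores node values, backward
# pass accumulates dErr/dw_i by the chain rule; a finite-difference
# check confirms the gradient.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w = [0.5, -1.2, 0.8]
x, r = 1.0, 0.3

# (1) forward propagation: obtain the value of each node
ys = [x]
for wi in w:
    ys.append(sigmoid(wi * ys[-1]))
err = 0.5 * (ys[-1] - r) ** 2

# (2) backward propagation: dErr/dy flows backwards through the chain
d_err_dy = ys[-1] - r
grads = [0.0] * len(w)
for i in reversed(range(len(w))):
    local = ys[i + 1] * (1.0 - ys[i + 1])   # sigmoid'(z) = y(1 - y)
    grads[i] = d_err_dy * local * ys[i]     # dErr/dw_i
    d_err_dy = d_err_dy * local * w[i]      # dErr/dy_(i-1)

# finite-difference check of dErr/dw_0
eps = 1e-6
y = x
for wi in [w[0] + eps] + w[1:]:
    y = sigmoid(wi * y)
fd = (0.5 * (y - r) ** 2 - err) / eps
print(abs(grads[0] - fd) < 1e-5)  # True: backprop matches finite differences
```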
Feed-Forward Neural Network
• When the network structure is a DAG (directed acyclic graph), it is called a feed-forward network
• The nodes can be ordered in a line (a topological order) so that all connections point in the same direction
• Forward/backward propagation can then be applied efficiently
Exercise 5.2
When h(y) and y(x) are given as follows, obtain ∂h/∂x:
  h(y) = 1/(1 + exp(−y)),  y = ax + b

Answer:
  ∂h/∂x = (∂h/∂y)(∂y/∂x)
        = [exp(−y)/(1 + exp(−y))²] · a
        = a · exp(−(ax + b)) / (1 + exp(−(ax + b)))²
        = a · h(ax + b) · (1 − h(ax + b))
Recurrent Neural Network (RNN)
• Neural network having a feedback• Expected to be more powerful modeling performance
than feed-forward MLP, but the training is more difficult
24
Delay
Input
Output
Input layer
Output layer
Hidden layers
Unfolding of RNN Along the Time Axis
• Unfold through time: the delayed feedback is expanded into one copy of the network per time step, connected to the input feature sequence and the reference vector sequence
Training of RNN by BP Through Time (BPTT)
• Regard the input sequence (x1, x2, x3, x4) as one input and the output sequence (y1, y2, y3, y4) as one output
• Apply BP to the unfolded network: back-propagation runs backwards through the hidden states h4 → h3 → h2 → h1
Long Short-Term Memory (LSTM)
• A type of RNN addressing the gradient vanishing problem
• Inputs at time t: x_t, plus the previous output y_(t−1) and cell state c_(t−1) through delay units; outputs: y_t and the new cell state c_t
• Three gates, each a sigmoid layer with affine transform (σ): the forget gate, the input gate, and the output gate
• A tanh layer with affine transform proposes new cell content; pointwise multiplication (⊗) applies the gates, and a sum (⊕) updates the cell state
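A single LSTM step as drawn above can be sketched for scalar values; all weights here are made-up illustration values, and a real implementation would use vectors and weight matrices.

```python
# Minimal scalar LSTM cell: forget/input/output gates plus a tanh
# content layer, combined exactly as in the diagram.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, y_prev, c_prev, W):
    # each gate is a "sigmoid layer with affine transform" of [x_t, y_prev]
    f = sigmoid(W["f"][0] * x_t + W["f"][1] * y_prev + W["f"][2])    # forget gate
    i = sigmoid(W["i"][0] * x_t + W["i"][1] * y_prev + W["i"][2])    # input gate
    o = sigmoid(W["o"][0] * x_t + W["o"][1] * y_prev + W["o"][2])    # output gate
    g = math.tanh(W["g"][0] * x_t + W["g"][1] * y_prev + W["g"][2])  # new content
    c_t = f * c_prev + i * g        # pointwise multiply (⊗) and sum (⊕)
    y_t = o * math.tanh(c_t)        # output gate applied to squashed cell state
    return y_t, c_t

W = {"f": [0.5, 0.1, 1.0], "i": [0.6, -0.2, 0.0],
     "o": [1.0, 0.3, 0.0], "g": [0.9, 0.4, 0.0]}
y, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:         # run three time steps
    y, c = lstm_step(x, y, c, W)
print(y, c)  # |y| < 1 by construction, since y = o * tanh(c)
```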
Convolutional Neural Network (CNN)
• A type of feed-forward neural network with parameter sharing and connection constraints
• A filter is shifted and applied at different positions of the input; each filter (1) … (N) produces one activation map
• Convolution layer → pooling layer → next convolution layer etc.

(Figure: input rows 1 3 3 4 2 1 / 3 5 2 1 3 5, filters (1)–(N), activation maps (1)–(N), and a pooled output 5 4 5)
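The convolution-then-pooling pattern above can be sketched in one dimension; the input row, filter, and window size below are made-up illustration values rather than the exact ones in the figure.

```python
# 1-D convolution (one shared, shifted filter) followed by max pooling.
def conv1d(x, filt):
    # shift the filter across the input (valid positions only, stride 1)
    k = len(filt)
    return [sum(f * v for f, v in zip(filt, x[i:i + k]))
            for i in range(len(x) - k + 1)]

def max_pool(x, size):
    # non-overlapping max pooling
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]

x = [1, 3, 3, 4, 2, 1]          # input row
filt = [1, 0, -1]               # one shared filter (parameter sharing)
amap = conv1d(x, filt)          # activation map
print(amap)                     # [-2, -1, 1, 3]
print(max_pool(amap, 2))        # [-1, 3]
```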
Neural Network Based Acoustic Model
Frame Level Vowel Recognition Using MLP
• Input: speech feature vector (e.g. MFCC)
• Sigmoid layer → sigmoid layer → softmax layer
• Output: posterior probabilities of the five vowels, e.g. p(/a/) = 0.1, p(/i/) = 0.4, p(/u/) = 0.2, p(/e/) = 0.15, p(/o/) = 0.15
Exercise 5.3
Obtain the recognition result (yes or no). You may use a calculator.
• Inputs: 2.5 and −4.0
• Network: a sigmoid hidden layer followed by a softmax output layer giving P(yes) and P(no)
• Edge weights shown in the figure: 1.5, −2, 1, −1, 2, −2, −2.5, 3
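The exercise can be solved mechanically with a short script. Note: the transcript does not preserve which weight attaches to which edge, so the weight assignment below is a made-up placeholder; substitute the values as they actually appear in the figure before reading off the answer.

```python
# Sketch for evaluating Exercise 5.3's network: two inputs, two
# sigmoid hidden units, and a softmax layer giving P(yes), P(no).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

x = [2.5, -4.0]                       # inputs from the slide
Wh = [[1.5, -2.0], [1.0, -1.0]]       # hidden weights (PLACEHOLDER assignment)
Wo = [[2.0, -2.0], [-2.5, 3.0]]       # output weights (PLACEHOLDER assignment)

h = [sigmoid(sum(w * v for w, v in zip(row, x))) for row in Wh]
p_yes, p_no = softmax([sum(w * v for w, v in zip(row, h)) for row in Wo])
print("yes" if p_yes > p_no else "no")
```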
Combination of HMM and MLP
• GMM-HMM: each state s (s0 … s4) has a GMM output distribution:
  p(X|s) = GMM_s(X)
• MLP-HMM: the softmax layer of an MLP gives the state posterior p(s|X), which is converted to a scaled likelihood:
  p(X|s) ∝ p(s|X)/p(s) = MLP(s|X)/p(s)
MLP-HMM Based Phone Recognizer
• Input speech feature → sigmoid layer → sigmoid layer → softmax layer
• The softmax outputs feed the states of phone HMMs (/a/, /i/, …, /N/) connected between a start node and an end node
Neural Network Based Language Model
Word Vector
• One-of-K representation of a word for a fixed vocabulary:

  word        ID   1-of-K
  Apple       1    <1,0,0,0,0,0,0>
  Banana      2    <0,1,0,0,0,0,0>
  Cherry      3    <0,0,1,0,0,0,0>
  Durian      4    <0,0,0,1,0,0,0>
  Orange      5    <0,0,0,0,1,0,0>
  Pineapple   6    <0,0,0,0,0,1,0>
  Strawberry  7    <0,0,0,0,0,0,1>
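The one-of-K encoding of the table above is straightforward to implement (a sketch): a vector with a single 1 at the word's position.

```python
# One-of-K (one-hot) word vectors for the fixed vocabulary above.
vocab = ["Apple", "Banana", "Cherry", "Durian",
         "Orange", "Pineapple", "Strawberry"]

def one_of_k(word):
    # 1 at the word's position, 0 elsewhere
    idx = vocab.index(word)
    return [1 if i == idx else 0 for i in range(len(vocab))]

print(one_of_k("Cherry"))  # [0, 0, 1, 0, 0, 0, 0]
```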
Word Prediction Using RNN
• Input: the one-of-K vector of word_(t−1), e.g. <0, 0, 0, 1, 0, 0, 0>
• Output: a distribution over word_t, e.g. <0.02, 0.65, 0.14, 0.11, 0.05, 0.01, 0.02>
• The delayed hidden state (D) carries the history of the preceding words
RNN Language Model (Unfolded)
• The unfolded RNN predicts each next word given the words so far, from <s> through "Delicious Big Red Apple" to </s>
• The product of these step-wise predictions gives P(<s>, Delicious, Big, Red, Apple, </s>)
Dialogue System Using Seq2Seq Network
• Encoder network: reads the input "What is your name"
• Decoder network: starting from <s>, generates the output "My name is TS-800 </s>", sampling each word from the posterior
Evolution of Compute Hardware
• 2002: Earth Simulator, 40.96 TFLOPS
• 2017: GeForce GTX 1080 Ti, 10.609 TFLOPS, 699 USD
(Pictures from Wikipedia and Nvidia.com)