EN. 601.467/667 Introduction to Human Language Technology: Deep Learning I. Shinji Watanabe


Page 1:

EN. 601.467/667 Introduction to Human Language Technology: Deep Learning I

Shinji Watanabe


Page 2:

Today’s agenda

• Introduction to deep neural networks
• Basics of neural networks

Page 3:

Short bio

• Research interests
  • Automatic speech recognition (ASR), speech enhancement, and the application of machine learning to speech processing

• Around 20 years of ASR experience since 2001

Page 4:

Speech recognition evaluation metric

• Word error rate (WER)
  • Computed with the edit distance, word by word:

Reference) I want to go to the Johns Hopkins campus
Recognition result) I want to go to the 10 top kids campus

  • # insertion errors = 1, # substitution errors = 2, # deletion errors = 0 ➡ edit distance = 3
  • Word error rate (%): edit distance (= 3) / # reference words (= 9) × 100 = 33.3%
• How do we compute WERs for languages that do not have word boundaries?
  • Chunking, or use the character error rate
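As a concrete illustration, here is a minimal Python sketch of this WER computation; the edit_distance helper is illustrative, not taken from any particular toolkit:

```python
# Minimal sketch: word error rate (WER) via Levenshtein (edit) distance.
def edit_distance(ref, hyp):
    # Classic dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub,              # substitution (or match)
                          d[i - 1][j] + 1,  # deletion
                          d[i][j - 1] + 1)  # insertion
    return d[len(ref)][len(hyp)]

ref = "I want to go to the Johns Hopkins campus".split()
hyp = "I want to go to the 10 top kids campus".split()
wer = 100.0 * edit_distance(ref, hyp) / len(ref)
print(f"WER = {wer:.1f}%")  # 3 errors / 9 reference words = 33.3%
```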

Page 5:

2001: when I started speech recognition….

[Figure: word error rate (WER) on the Switchboard task (telephone conversation speech), plotted on a log scale from 1% to 100% over 1995–2016 (Pallett'03, Saon'15, Xiong'16)]

Page 6:

Really bad age….

• No application
• No breakthrough technologies
• Everyone outside speech research criticized it…
• The general public didn't know what speech recognition was

Page 7:

Page 8:

Now we are at

[Figure: the same Switchboard WER plot; with deep learning, the WER drops to 5.9% by 2016 (Pallett'03, Saon'15, Xiong'16)]

Page 9:

Everything changed

• No application
• No breakthrough technologies
• Everyone outside speech research criticized it…
• The general public didn't know what speech recognition was

Page 10:

Everything changed

• No application → voice search, smart speakers
• No breakthrough technologies
• Everyone outside speech research criticized it…
• The general public didn't know what speech recognition was

Page 11:

Everything changed

• No application → voice search, smart speakers
• No breakthrough technologies → deep neural networks
• Everyone outside speech research criticized it…
• The general public didn't know what speech recognition was

Page 12:

Everything changed

• No application → voice search, smart speakers
• No breakthrough technologies → deep neural networks
• Everyone outside speech research criticized it… → many people outside speech research know and respect it
• The general public didn't know what speech recognition was

Page 13:

Everything changed

• No application → voice search, smart speakers
• No breakthrough technologies → deep neural networks
• Everyone outside speech research criticized it… → many people outside speech research know and respect it
• The general public didn't know what speech recognition was → now my wife knows what I'm doing

Page 14:

Page 15:

Acoustic model

• From Bayes decision theory to the acoustic model:

  argmax_W p(W|O) = argmax_W Σ_L p(W, L|O)
                  = argmax_W Σ_L p(O|L, W) p(L, W)
                  = argmax_W Σ_L p(O|L) p(L|W) p(W)

where p(O|L) is the acoustic model, p(W) is the language model, and L is the hidden phone/HMM state sequence.

Page 16:

GMM/HMM

• Given HMM state j, we can represent the likelihood function as a Gaussian mixture model:

  p(o_t | s_t = j) = Σ_k w_{jk} N(o_t ; μ_{jk}, Σ_{jk})

• A deep neural network acoustic model only replaces this GMM representation with a neural network

Page 17:

Problem

• Input MFCC vector: o_t
• Output phoneme (or HMM state): s_t ∈ {/a/, /k/, …}
• How do we find the probability distribution p(s_t | o_t)?
• We use a large amount of paired data {(o_t, s_t)}_{t=1}^{T} to train the model parameters of the distribution

Page 18:

Very easy case

• We can use a linear classifier
  • /a/: a_1 o_1 + a_2 o_2 + b ≥ 0
  • /k/: a_1 o_1 + a_2 o_2 + b < 0
• We can also turn this into a probability with the sigmoid function σ():
  • p(/a/ | o) = σ(a_1 o_1 + a_2 o_2 + b)
  • p(/k/ | o) = 1 − σ(a_1 o_1 + a_2 o_2 + b)
• Sigmoid function:

  σ(x) = 1 / (1 + e^{−x})

[Figure: the two classes /a/ and /k/ in the (o_1, o_2) plane, separated by the decision boundary a_1 o_1 + a_2 o_2 + b = 0]

from http://cs.jhu.edu/~kevinduh/a/deep2014/140114-ResearchSeminar.pdf
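A minimal sketch of this two-class sigmoid classifier; the weights a1, a2 and bias b are hypothetical values, not taken from the slides:

```python
# Sketch: linear classifier turned into a probability with the sigmoid.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

a1, a2, b = 1.5, -0.8, 0.2        # hypothetical parameters
o = (0.6, 0.3)                    # one 2-D feature vector (o1, o2)

score = a1 * o[0] + a2 * o[1] + b
p_a = sigmoid(score)              # p(/a/ | o)
p_k = 1.0 - p_a                   # p(/k/ | o)
print(f"p(/a/|o) = {p_a:.3f}, p(/k/|o) = {p_k:.3f}")
```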

Page 19:

Very easy case

• We can also use a GMM (although it is not so suitable):

  p(o | /a/) = Σ_k ω_k N(o | μ_k, Σ_k)
  p(o | /k/) = Σ_k ω′_k N(o | μ′_k, Σ′_k)

• p(/a/ | o) ≈ p(o | /a/) / (p(o | /a/) + p(o | /k/))

[Figure: the same two-class /a/ versus /k/ example in the (o_1, o_2) plane, with each class modeled by a GMM]
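A rough sketch of the GMM-based posterior above, using SciPy with toy parameters chosen only for illustration:

```python
# Sketch: class posterior from two class-conditional GMMs (toy parameters).
import numpy as np
from scipy.stats import multivariate_normal

def gmm_likelihood(o, weights, means, covs):
    # p(o | class) = sum_k w_k N(o; mu_k, Sigma_k)
    return sum(w * multivariate_normal.pdf(o, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

o = np.array([0.5, 0.2])
p_o_a = gmm_likelihood(o, [0.6, 0.4], [np.zeros(2), np.ones(2)],
                       [np.eye(2), 0.5 * np.eye(2)])
p_o_k = gmm_likelihood(o, [0.5, 0.5], [-np.ones(2), np.array([2.0, -1.0])],
                       [np.eye(2), np.eye(2)])

# p(/a/ | o) ~ p(o|/a/) / (p(o|/a/) + p(o|/k/)), assuming equal priors
p_a_o = p_o_a / (p_o_a + p_o_k)
print(f"p(/a/|o) = {p_a_o:.3f}")
```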

Page 20:

Getting more difficult with the GMM classifier or linear classifier

[Figure: a harder /a/ versus /k/ arrangement in the (o_1, o_2) plane that a single linear boundary cannot separate]

Page 21:

Getting more difficult with the GMM classifier or linear classifier

[Figure: an even more complicated /a/ versus /k/ arrangement in the (o_1, o_2) plane]

Page 22:

Neural network


• Combination of linear classifiers to classify complicated patterns

• More layers, more complicated patterns

Page 23:

Neural network used in speech recognition

• A very large combination of linear classifiers

[Figure: a deep feed-forward network for ASR. Input speech features: log mel filterbank + 11 context frames; about 7 hidden layers with 2048 units each; output: HMM states or phonemes (/a/, /i/, /u/, …, /w/, /N/), 30 ~ 10,000 units]
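A sketch of a network with roughly these dimensions in PyTorch (the course's toolkit); the exact layer sizes and target count here are assumptions for illustration, not the recipe on the slide:

```python
# Sketch of a DNN acoustic model: spliced filterbank features in, HMM states out.
import torch
import torch.nn as nn

feat_dim = 40            # log mel filterbank bins (assumed)
context = 11             # frames of context concatenated at the input
num_states = 3000        # number of HMM state targets (assumed)

layers = []
in_dim = feat_dim * context
for _ in range(7):       # ~7 hidden layers, 2048 units each
    layers += [nn.Linear(in_dim, 2048), nn.Sigmoid()]
    in_dim = 2048
layers += [nn.Linear(in_dim, num_states)]     # produces logits
model = nn.Sequential(*layers)

x = torch.randn(8, feat_dim * context)        # a minibatch of spliced frames
log_post = torch.log_softmax(model(x), dim=-1)  # log p(state | o_t)
print(log_post.shape)                         # torch.Size([8, 3000])
```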

Page 24:

Why neural networks did not get attention earlier

1. Very difficult to train
  • Batch? Online? Minibatch?
  • Stochastic gradient descent
  • Learning rate? Scheduling?
  • What kind of topologies?
  • Large computational cost
2. The amount of training data is very critical
3. CPU → GPU

[Figure: the same DNN acoustic model diagram as on the previous slide]

Page 25:

Before deep learning (2002 – 2009)

• The successes of neural networks were from a much older period
• People believed that GMMs were better
• Neural networks gave only a very small gain over standard GMMs

Page 26:

[Image from https://en.wikipedia.org/wiki/Geoffrey_Hinton]

Page 27:

When I noticed deep learning (2010)

• A. Mohamed, G. E. Dahl, and G. E. Hinton, “Deep belief networks for phone recognition,” in NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, 2009.

• This still did not fully convince me (I introduced it at NTT’s reading group)

• Using a deep belief network for pre-training
• Fine-tuning the deep neural network → provides stable estimation

Page 28:

Pre-training and fine-tuning

• First, train neural-network-like parameters with a deep belief network or an autoencoder
• Then, continue with deep neural network training (fine-tuning)

[Figure: network diagram with an output layer of 1,000 ~ 10,000 units (/a/, /i/, /u/, …, /w/, /N/)]

Page 29:

Interspeech 2011 at Florence

• The following three papers convinced me:
  • Feature extraction: Valente, Fabio / Magimai-Doss, Mathew / Wang, Wen (2011): "Analysis and comparison of recent MLP features for LVCSR systems", INTERSPEECH 2011, 1245-1248.
  • Acoustic model: Seide, Frank / Li, Gang / Yu, Dong (2011): "Conversational speech transcription using context-dependent deep neural networks", INTERSPEECH 2011, 437-440.
  • Language model: Mikolov, Tomáš / Deoras, Anoop / Kombrink, Stefan / Burget, Lukáš / Černocký, Jan (2011): "Empirical evaluation and combination of advanced language modeling techniques", INTERSPEECH 2011, 605-608.

• I discussed this potential with my NLP colleagues at NTT, but they did not believe it (they stuck with SVMs and log-linear models)

Page 30:

Late 2012

• My first deep learning experience (Kaldi nnet)
• Kaldi has supported DNNs since 2012 (mainly developed by Karel Vesely)
  • Deep belief network based pre-training
  • Feed-forward neural network
  • Sequence-discriminative training

Word error rates (%):

                                            Hub5 '00 (SWB)   WSJ
GMM                                              18.6        5.6
DNN                                              14.2        3.6
DNN with sequence-discriminative training        12.6        3.2

Page 31:


Page 32:

Build speech recognition with public tools and resources

• TED-LIUM (~100 hours)
• LibriSpeech (~1,000 hours)
• We can use Kaldi + DNN + TED-LIUM to build an English speech recognition system on a single machine (a GPU plus a many-core machine)
• Before this, only big companies could do it.

Page 33:

The same thing happened in computer vision

Page 34:

[Image from https://en.wikipedia.org/wiki/Geoffrey_Hinton]

Page 35:

ImageNet challenge (large-scale data)

L. Fei-Fei and O. Russakovsky, Analysis of Large-Scale Visual Recognition, Bay Area Vision Meeting, October 2013

Page 36:

ImageNet challenge: AlexNet, GoogLeNet, VGG, ResNet, …

[Figure: ImageNet classification error rate by year; deep learning arrives in 2012]

Year (team)        Error rate (%)
2010 (NEC)              28.2
2011 (Xerox)            25.8
2012 (Toronto)          16.4
2013 (Clarifai)         11.7
2014 (Google)            6.7

Page 37:

The same thing happened in text processing

• Recurrent neural network language model (RNNLM) [Mikolov+ (2010)]

[Figure: an RNNLM reads "<s> I want to go to" word by word and predicts the next words "I want to go to my …"]

Perplexity:
  N-gram (conventional): 336
  RNNLM: 156

Page 38:

Word embedding example (https://www.tensorflow.org/tutorials/word2vec)

Page 39:

Neural machine translation (New York Times, December 2016)


Page 40:

from https://ai.googleblog.com/2016/09/a-neural-network-for-machine.html

Page 41:

Deep neural network toolkits (2013-)

• Theano
• Caffe
• Torch
• CNTK
• Chainer
• Keras

Later:
• TensorFlow
• PyTorch (used in this course)
• MXNet
• etc.

Page 42:

Summary

• Before 2011
  • GMM/HMM; limited performance; a bit boring
  • This is because they are linear models…
• After 2011
  • DNN/HMM
  • Toolkits
  • Public large-scale data
  • GPUs
  • NLP and image/vision also moved to DNNs
  • Always something exciting

Page 43:

Today’s agenda

• Introduction to deep neural networks
• Basics of neural networks

Page 44:

Feed-forward neural network for the acoustic model

• Configurations
  • Input features
  • Context expansion
  • Output class
  • Softmax function
  • Training criterion
  • Number of layers
  • Number of hidden states
  • Type of non-linear activations

[Figure: the DNN acoustic model diagram again: input log mel filterbank + 11 context frames, ~7 hidden layers with 2048 units, output HMM states (30 ~ 10,000 units; /a/, /i/, /u/, …, /w/, /N/)]

Page 45:

Input feature

• GMM/HMM formulation
  • Lots of conditional independence assumptions and Markov assumptions
  • Many of our trials are about how to break these assumptions
• In a GMM, we always have to care about feature correlation
  • Delta features, linear discriminant analysis, semi-tied covariance
• In a DNN, we don't have to care :)
  • We can simply concatenate the left and right context frames and just throw them in!
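A small sketch of this context expansion (frame splicing); the splice helper and the feature sizes are illustrative assumptions:

```python
# Sketch: concatenate each frame with its left/right neighbors before the DNN.
import numpy as np

def splice(feats, left=5, right=5):
    """feats: (T, D) frame-level features -> (T, D*(left+1+right))."""
    T, D = feats.shape
    padded = np.pad(feats, ((left, right), (0, 0)), mode="edge")
    return np.stack([padded[t:t + left + 1 + right].reshape(-1)
                     for t in range(T)])

feats = np.random.randn(100, 40)          # 100 frames of 40-dim log mel fbank
spliced = splice(feats, left=5, right=5)  # 11 frames of context in total
print(spliced.shape)                      # (100, 440)
```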

Page 46:

Output

• A phoneme or HMM state ID is used as the output
• We need paired input and output data at each frame t
  • First, use the Viterbi alignment to obtain the state sequence
  • Then, we get the input/output pairs
• Treat the acoustic model as a multiclass classification problem: predict the HMM state ID given the observation
• No constraints (e.g., the left-to-right structure) are considered at this stage; they are handled by the HMM during recognition
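A tiny sketch of turning a frame-level Viterbi alignment into (o_t, s_t) training pairs; the alignment format assumed here (one state ID per frame) is illustrative:

```python
# Sketch: frame-level (feature, state) pairs from a Viterbi alignment.
import numpy as np

feats = np.random.randn(6, 40)            # 6 frames of features
alignment = [12, 12, 12, 57, 57, 103]     # Viterbi state ID for each frame

pairs = list(zip(feats, alignment))       # one (o_t, s_t) pair per frame
for o_t, s_t in pairs[:3]:
    print(o_t.shape, "-> state", s_t)     # (40,) -> state 12, ...
```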

Page 47:

Feed-forward neural networks

• An affine transformation followed by a non-linear activation function (the sigmoid function):

  h^{(l)} = σ(W^{(l)} h^{(l−1)} + b^{(l)})

• Apply the above transformation L times
• A softmax operation on the final layer gives the probability distribution:

  p(j | o_t) = softmax_j(W^{(L)} h^{(L−1)} + b^{(L)})
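A minimal NumPy sketch of this forward pass (affine + sigmoid layers followed by softmax), with illustrative dimensions:

```python
# Sketch: explicit forward pass of a small feed-forward network.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
dims = [440, 256, 256, 50]      # input, two hidden layers, output classes
Ws = [rng.standard_normal((dims[i + 1], dims[i])) * 0.1 for i in range(3)]
bs = [np.zeros(dims[i + 1]) for i in range(3)]

h = rng.standard_normal(dims[0])            # one spliced input frame
for W, b in zip(Ws[:-1], bs[:-1]):
    h = sigmoid(W @ h + b)                  # hidden layers
p = softmax(Ws[-1] @ h + bs[-1])            # p(j | o_t)
print(p.sum())                              # 1.0
```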

Page 48:

Linear operation

• Transforms a D^{(l−1)}-dimensional input into a D^{(l)}-dimensional output:

  f(h^{(l−1)}) = W^{(l)} h^{(l−1)} + b^{(l)}

  • W^{(l)} ∈ ℝ^{D^{(l)} × D^{(l−1)}}: linear transformation matrix
  • b^{(l)} ∈ ℝ^{D^{(l)}}: bias vector

• Derivatives
  • ∂(Σ_j w_{ij} h_j + b_i) / ∂b_{i′} = δ(i, i′)
  • ∂(Σ_j w_{ij} h_j + b_i) / ∂w_{i′j′} = δ(i, i′) h_{j′}
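A quick numerical check of these derivatives by finite differences; all sizes and values are arbitrary examples:

```python
# Sketch: verify the affine-layer derivatives numerically.
import numpy as np

rng = np.random.default_rng(1)
D_in, D_out = 4, 3
W = rng.standard_normal((D_out, D_in))
b = rng.standard_normal(D_out)
h = rng.standard_normal(D_in)

def f(W, b):
    return W @ h + b              # the affine transformation f(h) = W h + b

# d f_i / d b_{i'} should be delta(i, i')
i, i_prime, eps = 0, 0, 1e-6
b_pert = b.copy(); b_pert[i_prime] += eps
num_grad = (f(W, b_pert)[i] - f(W, b)[i]) / eps
print(num_grad)                   # ~1.0 when i == i', ~0.0 otherwise

# d f_i / d w_{i'j'} should be delta(i, i') * h_{j'}
j_prime = 2
W_pert = W.copy(); W_pert[i_prime, j_prime] += eps
num_grad = (f(W_pert, b)[i] - f(W, b)[i]) / eps
print(num_grad, h[j_prime])       # the two values match when i == i'
```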

Page 49:

Sigmoid function

• Sigmoid function:

  σ(x) = 1 / (1 + e^{−x})

  • Converts the domain from ℝ to [0, 1]
  • Applied elementwise
  • No trainable parameters in general
• Derivative:

  dσ(x)/dx = σ(x)(1 − σ(x))
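A quick numerical check of this derivative:

```python
# Sketch: verify d sigma/dx = sigma(x) * (1 - sigma(x)) numerically.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x, eps = 0.7, 1e-6
analytic = sigmoid(x) * (1.0 - sigmoid(x))
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(analytic, numeric)   # the two values agree closely
```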

Page 50:

Softmax function

• Softmax function:

  p(j | h) = exp(h_j) / Σ_{j′=1}^{J} exp(h_{j′})

  • Converts the domain from ℝ^J to [0, 1]^J (makes a multinomial distribution → classification)
  • Satisfies the sum-to-one condition, i.e., Σ_{j=1}^{J} p(j | h) = 1
  • For J = 2 it reduces to the sigmoid function
• Derivatives
  • For i = j: ∂p(j|h)/∂h_i = p(j|h)(1 − p(j|h))
  • For i ≠ j: ∂p(j|h)/∂h_i = −p(i|h) p(j|h)
  • Or, compactly: ∂p(j|h)/∂h_i = p(j|h)(δ(i, j) − p(i|h)), where δ(i, j) is Kronecker's delta
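A small numerical check of the compact derivative formula:

```python
# Sketch: softmax and a finite-difference check of
# d p(j|h) / d h_i = p(j|h) * (delta(i, j) - p(i|h)).
import numpy as np

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

h = np.array([0.5, -1.2, 2.0])
p = softmax(h)
i, j, eps = 0, 2, 1e-6

analytic = p[j] * ((1.0 if i == j else 0.0) - p[i])
h_pert = h.copy(); h_pert[i] += eps
numeric = (softmax(h_pert)[j] - p[j]) / eps
print(analytic, numeric)   # the two values agree closely
```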

Page 51:

Which functions/operations can we not use?

• Functions/operations whose derivative we cannot take, including some discrete operations
  • argmax_W p(W|O): the basic ASR decoding operation, but we cannot take its derivative…
  • Discretization
  • Etc.

Page 52:

Objective function design

• We usually use the cross entropy as the objective function:

  E = − Σ_t Σ_j δ(j, s_t) log p(j | o_t)

• Since the Viterbi sequence is a hard assignment, the summation over states simplifies to

  E = − Σ_t log p(s_t | o_t)
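A sketch of this frame-level cross entropy with hard Viterbi targets in PyTorch; tensor shapes are illustrative:

```python
# Sketch: cross-entropy objective with hard (Viterbi) state targets.
import torch
import torch.nn as nn

num_states = 3000
logits = torch.randn(8, num_states)          # network outputs for 8 frames
targets = torch.randint(num_states, (8,))    # Viterbi state ID per frame

# CrossEntropyLoss applies log-softmax internally, so this computes
# -log p(s_t | o_t) averaged over the minibatch.
loss = nn.CrossEntropyLoss()(logits, targets)
print(loss.item())
```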

Page 53:

Other objective functions

• Square error between the target and the network output:

  ‖h_target − h‖²

• We could also use a p-norm, e.g., the L1 norm
• Binary cross entropy
  • Again, this is a special case of the cross entropy when the number of classes is two
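The same objectives expressed with standard PyTorch losses; the shapes and values are illustrative:

```python
# Sketch: square error, L1, and binary cross entropy objectives.
import torch
import torch.nn as nn

h = torch.sigmoid(torch.randn(8, 10))     # network outputs in [0, 1]
h_target = torch.rand(8, 10)              # illustrative targets

mse = nn.MSELoss()(h, h_target)           # square error
l1 = nn.L1Loss()(h, h_target)             # L1 norm
bce = nn.BCELoss()(h, h_target)           # binary cross entropy
print(mse.item(), l1.item(), bce.item())
```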

Page 54:

Building blocks

• Input: o_t ∈ ℝ^D
• Output: s_t ∈ {1, …, J}
• Blocks: linear transformation, sigmoid activation, softmax activation, and elementwise operations (+, −, exp, log, etc.)

Page 55:

Building blocks

(Same building-blocks diagram as on the previous page.)

Page 56:

Building blocks

(Same building-blocks diagram, with the linear transformation and sigmoid activation blocks shown twice.)

Page 57:

Building blocks

(Same building-blocks diagram, including the elementwise operations +, −, exp, log, etc.)

Page 58:

Building blocks

(Same building-blocks diagram as on the previous pages.)

Page 59:

How to optimize? Gradient descent and its variants

• Take the derivative and update the parameters with it:

  Θ ← Θ − η ∂E/∂Θ   (η: learning rate)

• Chain rule

Page 60:

Deep neural network: nested function

• Use the chain rule to get the derivative recursively
• Each transformation (affine, sigmoid, and softmax) has an analytical derivative, and we just combine these derivatives
• We can obtain the derivative with the back-propagation algorithm

[Figure: the nested function as a chain of blocks: linear transformation → sigmoid activation → linear transformation → softmax activation]
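A minimal PyTorch sketch: autograd applies the chain rule through this nested function for us (the layer sizes are illustrative):

```python
# Sketch: back-propagation through linear -> sigmoid -> linear (+ softmax in the loss).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(40, 64), nn.Sigmoid(), nn.Linear(64, 10))
x = torch.randn(4, 40)
target = torch.randint(10, (4,))

loss = nn.CrossEntropyLoss()(model(x), target)   # softmax + cross entropy
loss.backward()                                  # back-propagation
print(model[0].weight.grad.shape)                # torch.Size([64, 40])
```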

Page 61:

Minibatch processing

• Batch processing
  • Slow convergence
  • Efficient computation
• Online processing
  • Fast convergence
  • Very inefficient computation
• Minibatch processing
  • Something between batch and online processing

[Figure: the whole data (batch) is split into minibatches, and the parameters Θ are updated after each minibatch (Θ_old → Θ_new)]
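A minibatch SGD sketch in PyTorch; the dataset, model, and sizes are illustrative:

```python
# Sketch: split the whole dataset into minibatches and update after each one.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

feats = torch.randn(1000, 40)               # whole data (batch)
states = torch.randint(10, (1000,))
loader = DataLoader(TensorDataset(feats, states), batch_size=64, shuffle=True)

model = nn.Sequential(nn.Linear(40, 64), nn.Sigmoid(), nn.Linear(64, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for x, y in loader:                         # one parameter update per minibatch
    opt.zero_grad()
    loss = nn.CrossEntropyLoss()(model(x), y)
    loss.backward()
    opt.step()
```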

Page 62:

Summary of today’s talk

• Deep learning changes the world
• A lot of human language technologies are boosted by deep learning
• Deep neural network basics
  • Input
  • Output
  • Functions
  • Back-propagation