Recent advances in automatic speech recognition— A brief overview
Liang Lu, University of Edinburgh
Liang Lu (liang.lu@ed.ac.uk), Heriot-Watt University, Feb, 2014. RTSC
This talk
- What is happening in ASR?
- Background – speech recognition and its applications
- (Recent) advances in system representation
  - Weighted finite state transducer
- Recent advances in language modelling
  - Recurrent neural network language model
- Recent advances in acoustic modelling
  - Deep neural network acoustic model
- Summary
Background
- Speech is one of the most natural ways of communicating information
- ASR is a central component of voice-driven information processing systems

X. He and L. Deng, "Speech-Centric Information Processing: An Optimization-Oriented Approach", in Proceedings of the IEEE, 2013
Background
- What does ASR do, and how does it do it?

[Figure: Speech -> ASR -> Text]

- It can be expressed mathematically as

    W* = argmax_W P(W | X)                      (1)
       = argmax_W p(X | W) · P(W)               (2)
                  (likelihood)  (prior)

  where X is a sequence of acoustic feature vectors and W is a word sequence.
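Equation (2) amounts to scoring each candidate word sequence by its acoustic likelihood times its language-model prior and keeping the best one. A minimal sketch of that decision rule, computed in log space for numerical stability; the hypotheses and all scores below are made up for illustration:

```python
import math

# Hypothetical word sequences with made-up acoustic log-likelihoods
# log p(X|W) and language-model log-priors log P(W).
hypotheses = {
    "i was thinking about my sweet time": (-120.0, math.log(1e-4)),
    "i'm what i'm thinking while my sweet time": (-118.0, math.log(1e-7)),
}

# argmax_W p(X|W) * P(W): in log space the product becomes a sum.
best = max(hypotheses, key=lambda w: sum(hypotheses[w]))
print(best)  # i was thinking about my sweet time
```

Here the second hypothesis has the better acoustic score, but its much smaller prior lets the first hypothesis win overall, which is exactly the interplay equation (2) describes.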
It is still hard, let's decompose it further ...

[Figure: the decomposition into components:
- LM (language model): scores word sequences, e.g. an n-best list
    i'm i'm what i'm thinking while my sweet time i'm very   0.0324
    i had what i'm thinking while my sweet time i'm very     0.0127
    i i was thinking about my sweet time i'm very            0.0046
- PM (pronunciation model): a lexicon mapping words to phone sequences with probabilities, e.g.
    abide      ax b ay d            1.0
    abiding    ax b ay d ih ng      1.0
    abilities  ax b ih l ih t iy z  0.666666
    abilities  ey b ih l ih t iy z  0.333333
    ability    ax b ih l ih t iy    1.0
    able       ax b ax l            0.413349
    able       ey b ax l            0.553356
- CD (context dependency): expands phones into context-dependent triphones, e.g.
    ax b ay d ---> sil-ax-b ax-b-ay b-ay-d ay-d-sil  1.0
- HMMs: each context-dependent phone is modelled by a hidden Markov model over states ... j-1, j, j+1 ...]
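The PM and CD steps above are mechanical table lookups and context expansions. A small sketch using the lexicon fragment from the slide; the helper name `to_triphones` is hypothetical, not from any toolkit:

```python
# A fragment of the pronunciation lexicon from the slide: word -> [(phones, prob), ...]
lexicon = {
    "abide": [("ax b ay d".split(), 1.0)],
    "able":  [("ax b ax l".split(), 0.413349), ("ey b ax l".split(), 0.553356)],
}

def to_triphones(phones):
    """Expand a phone sequence into context-dependent triphones,
    padding with silence ('sil') at the utterance boundaries."""
    padded = ["sil"] + phones + ["sil"]
    return [f"{padded[i-1]}-{padded[i]}-{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

phones, prob = lexicon["abide"][0]
print(to_triphones(phones))  # ['sil-ax-b', 'ax-b-ay', 'b-ay-d', 'ay-d-sil']
```

Each resulting triphone then indexes an HMM, which is how the word-level model finally connects to frame-level acoustics.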
System training - a generative process
[Figure: a word sequence ("i was thinking about my sweet time i'm very") is expanded through the pronunciation lexicon into phones, then into context-dependent triphones (e.g. ax b ay d ---> sil-ax-b ax-b-ay b-ay-d ay-d-sil), then into HMM states, which generate the acoustic features.]
Decoding - a search problem
[Figure: decoding reverses the generative process, searching the composed models for the word sequences that best explain the observed acoustics and producing an n-best list:
    i'm i'm what i'm thinking while my sweet time i'm very   0.0324
    i had what i'm thinking while my sweet time i'm very     0.0127
    i i was thinking about my sweet time i'm very            0.0046]
(Recent) advances in system representation
[Figure: the full pipeline of components (language model, pronunciation lexicon, context dependency, HMMs), raising the question: how can all of these be represented in a single framework?]
(Recent) advances in system representation
WFST – Weighted finite state transducer
- Input vocabulary i ∈ Φ1
- Output vocabulary o ∈ Φ2
- Weight w ∈ R
- ⊕ operation (combining alternative paths)
- ⊗ operation (extending a path)

[Figure: a toy WFST with arc 0 -> 1 labelled brad:brad/2 and arc 1 -> 2 labelled pitt:pitt/5]

M. Mohri and F. Pereira, "Weighted finite-state transducers in speech recognition", in CSL, 2002
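In ASR the weights are usually negative log probabilities in the tropical semiring, where ⊕ is min and ⊗ is +. A minimal sketch of the two operations (this is plain illustrative Python, not the OpenFst API):

```python
# Tropical semiring over weights w in R (negative log probabilities):
# a ⊕ b = min(a, b)  -- choose the better of two alternative paths
# a ⊗ b = a + b      -- extend a path by another arc
def oplus(a, b):
    return min(a, b)

def otimes(a, b):
    return a + b

# The toy transducer from the slide: arc brad:brad/2 followed by
# arc pitt:pitt/5 gives a path of total weight 2 ⊗ 5 = 7.
path = otimes(2, 5)
print(path)  # 7

# If a second path reached the same state with weight 9,
# the combined weight at that state would be 7 ⊕ 9 = 7.
print(oplus(path, 9))  # 7
```

With min and + the "best" path is the one with the smallest total weight, i.e. the highest probability.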
(Recent) advances in system representation
WFST for language model and pronunciation model
M. Mohri and F. Pereira, "Weighted finite-state transducers in speech recognition", in CSL, 2002
(Recent) advances in system representation
- WFST can integrate all the components of an ASR system into a joint graph, with additional optimisation
- If we define
  - H - HMMs
  - C - context dependency transducer
  - L - pronunciation model
  - G - language model

  then the task of ASR can be simply represented as

    W* = best_path(H ∘ C ∘ L ∘ G)    (3)

  given the acoustic signals.
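The composition in equation (3) matches the output labels of one transducer against the input labels of the next, multiplying (⊗, i.e. adding) weights along the way. A toy sketch of composition and best-path selection in the tropical semiring; the arc representation and all labels/weights are invented for illustration, and real implementations (e.g. OpenFst) also handle epsilon labels and determinisation, which this sketch omits:

```python
# Each transducer is a list of arcs: (src, dst, in_label, out_label, weight),
# with weights as negative log probabilities (tropical semiring).
def compose(a_arcs, b_arcs):
    """Compose A with B: pair arcs where A's output label equals B's
    input label; states become pairs and weights add (⊗)."""
    arcs = []
    for (s1, d1, i1, o1, w1) in a_arcs:
        for (s2, d2, i2, o2, w2) in b_arcs:
            if o1 == i2:
                arcs.append(((s1, s2), (d1, d2), i1, o2, w1 + w2))
    return arcs

A = [(0, 1, "a", "x", 1.0), (0, 1, "a", "y", 2.0)]
B = [(0, 1, "x", "X", 0.5), (0, 1, "y", "Y", 0.1)]

C = compose(A, B)
# Best path under the tropical semiring: minimal total weight.
best = min(C, key=lambda arc: arc[4])
print(best[3], best[4])  # X 1.5
```

Note that the globally best path ("a" -> "X", weight 1.5) is not the one containing B's cheapest arc, which is why the search must consider whole paths rather than greedy arc choices.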
(Recent) advances in system representation
- WFST provides an elegant interface for downstream applications
- An example of spoken language understanding (ASR + NLU):

[Figure: a transducer mapping words to semantic tags, one arc per word:
    0 -> 1  Show:O
    1 -> 2  me:O
    2 -> 3  movies:B-movie_type
    3 -> 4  with:O
    4 -> 5  brad:B-movie_star
    5 -> 6  pitt:I-movie_star]

A. Deoras et al, "Joint Discriminative Decoding of Words and Semantic Tags for Spoken Language Understanding", in IEEE TASLP 2013
(Recent) advances in system representation
- An example of speech-to-speech translation (ASR + MT)

X. He, L. Deng and A. Acero, "Why Word Error Rate is not a Good Metric for Speech Recognizer Training for the Speech Translation Task?", in ICASSP 2011
(Recent) advances in system representation
- Common practice – coupling ASR and MT with WFSTs

B. Zhou et al, "Folsom: A Fast and Memory-Efficient Phrase-based Approach to Statistical Machine Translation", in SLT 2006
B. Zhou et al, "On efficient coupling of ASR and SMT for speech translation", in ICASSP 2007
Recent advances in ASR

[Figure: the decomposition again, LM (language model), PM (pronunciation model), CD (context dependency), and HMMs, with the language model and the acoustic model highlighted as areas of active research where the recent advances come from neural networks.]
Neural networks in language modelling
- The n-gram language model has defined the state of the art for almost 40 years [L. R. Bahl, 1978]
- There has been a long struggle to move beyond n-grams with various statistical models:
  - Random forest language model [P. Xu, 2004]
  - Class-based language model, e.g. IBM Model M [S. F. Chen, 2009]
  - Nonparametric language model [Y. W. Teh, 2006]
  - Discriminative language model [B. Roark, 2006]
  - ...
- That move may finally be happening, thanks to the recurrent neural network language model (RNNLM)

T. Mikolov, et al, "Recurrent neural network based language model", in Interspeech 2010
Neural networks in language modelling
- The aim of a language model is very simple:

    P(w_n | w_{n-1}, ..., w_1) ≈ P(w_n | w_{n-1}, ..., w_{n-k})    (4)

  but estimating it is very difficult for k > 3 on a large-vocabulary task: e.g. what is the value of 60,000^3?
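The 60,000^3 figure is the number of distinct three-word histories a 4-gram model over a 60,000-word vocabulary would have to condition on; most are never seen in any corpus. The count, and the maximum-likelihood n-gram estimate it defeats, can be sketched directly (the toy corpus is made up):

```python
from collections import Counter

# Number of distinct 3-word histories for a 60,000-word vocabulary:
V = 60_000
print(V ** 3)  # 216000000000000 -- about 2.16e14 contexts

# Maximum-likelihood trigram estimate on a toy corpus:
# P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2)
corpus = "i was thinking about my sweet time".split()
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))
p = trigrams[("i", "was", "thinking")] / bigrams[("i", "was")]
print(p)  # 1.0
```

With so few observations per context, raw counts like the 1.0 above are badly overconfident, which is why n-gram systems lean on smoothing and why neural models, which share parameters across contexts, are attractive.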
Neural networks in language modelling
- The neural network language model is not new [Y. Bengio, 2003]

[Figure: feedforward NNLM, where each context word w_{n-k}, ..., w_{n-1} enters the input layer as a one-hot vector (e.g. 0 0 0 1 0 0 ...), passes through a shared projection layer and a hidden layer, and the output layer gives P(w_n | w_{n-1}, ..., w_{n-k}).]

Y. Bengio, et al, "A neural probabilistic language model", in JMLR 2003
Neural networks in language modelling
- The RNNLM differs in that a recurrent hidden layer is used to capture longer contextual information

T. Mikolov, et al, "Recurrent neural network based language model", in Interspeech 2010
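One time step of an Elman-style RNNLM can be written in a few lines: the hidden state combines the current word (as a one-hot input) with the previous hidden state, and a softmax over the vocabulary gives the next-word distribution. The dimensions and random weights below are toy values, not a trained model:

```python
import math, random

random.seed(0)
V, H = 5, 3  # toy vocabulary size and hidden-layer size

# Toy weights: input-to-hidden U, recurrent W, hidden-to-output O
U = [[random.uniform(-0.1, 0.1) for _ in range(V)] for _ in range(H)]
W = [[random.uniform(-0.1, 0.1) for _ in range(H)] for _ in range(H)]
O = [[random.uniform(-0.1, 0.1) for _ in range(H)] for _ in range(V)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def step(word_id, h_prev):
    """One RNNLM time step: mix the current word (one-hot, so U[:, word_id])
    with the previous hidden state, then softmax over the vocabulary."""
    h = [sigmoid(U[j][word_id] + sum(W[j][k] * h_prev[k] for k in range(H)))
         for j in range(H)]
    logits = [sum(O[i][j] * h[j] for j in range(H)) for i in range(V)]
    z = [math.exp(l) for l in logits]
    probs = [x / sum(z) for x in z]
    return h, probs

h = [0.0] * H
for w in [0, 3, 1]:          # a toy word-id sequence
    h, probs = step(w, h)

print(abs(sum(probs) - 1.0) < 1e-9)  # True: a valid distribution
```

Because `h` is carried forward from step to step, the prediction at each position depends, in principle, on the entire history rather than on a fixed k-word window.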
Neural networks in language modelling
- The RNNLM achieves significant reductions in both perplexity and word error rate (results on Wall Street Journal)

T. Mikolov, et al, "Recurrent neural network based language model", in Interspeech 2010
Neural networks in language modelling
Not limited to language modelling
- RNN for spoken language understanding [K. Yao, et al, 2013]

K. Yao, et al, "Recurrent neural networks for language understanding", in Interspeech 2013
Neural networks in language modelling
Not limited to language modelling
- RNN for machine translation [N. Kalchbrenner, P. Blunsom, 2013]

N. Kalchbrenner, P. Blunsom, "Recurrent continuous translation models", in EMNLP 2013
Neural networks in acoustic modelling
- GMM-HMM has defined the state of the art for over 20 years

[Figure: an HMM with states ... j-1, j, j+1 ...]

- Pros:
  - Efficient and parallel training algorithms
  - Clear physical meaning (Gaussian means, variances, etc.)
  - Efficient adaptation algorithms (MLLR, fMLLR, etc.)
- Cons:
  - Inefficient in learning feature correlations
  - Hard to take advantage of longer context windows
  - Generative rather than discriminative model
Neural networks in acoustic modelling
- Moving beyond GMM-HMM?
  - Conditional random fields (CRF), e.g. segmental CRF [G. Zweig, 2010], augmented CRF [Y. Hifny, 2009], hidden CRF [A. Gunawardana, 2005]
  - Support vector machines (SVM), e.g. [N. Smith, 2002]
  - Template-based acoustic models, e.g. [M. De Wachter, 2007]
  - ...
- Deep neural networks for acoustic modelling [G. Dahl, 2012]

G. Zweig, P. Nguyen, "A segmental conditional random fields toolkit for speech recognition", in Interspeech 2010.
Y. Hifny, S. Renals, "Speech recognition using augmented conditional random fields", IEEE TASLP, 2009.
A. Gunawardana, et al, "Hidden conditional random fields for phone classification", in Interspeech 2005.
N. Smith, M. Gales, "Speech recognition using SVMs", in NIPS, 2002.
M. De Wachter, et al, "Template based continuous speech recognition", IEEE TASLP, 2007.
G. Dahl, et al, "Context-dependent pre-trained deep neural networks for large vocabulary speech recognition", IEEE TASLP 2012.
Deep neural networks for acoustic modelling
- Neural networks for speech recognition were extensively studied in the early 1990s
- New ingredients in deep neural networks (DNN):
  - Pre-training using restricted Boltzmann machines (RBMs)
  - More hidden layers (≥ 4)
  - Wider output (on the order of 10^3 units vs. fewer than 10^2 previously in speech)

[Figure: a shallow neural network (input, one hidden layer, output) vs. a deep neural network (input, several hidden layers, output)]
Deep neural networks for acoustic modelling
The DNN is still combined with an HMM, which was already the practice in the early 1990s.

G. Dahl, et al, "Context-dependent pre-trained deep neural networks for large vocabulary speech recognition", IEEE TASLP 2012
Deep neural networks for acoustic modelling
- Why can the new ingredients make a difference?
  - A deep neural network is difficult to train, since it can easily be trapped in a local optimum
    - Pre-training helps (in some cases)
  - A shallow network cannot efficiently learn complex functions
    - More hidden layers help
  - For ASR, a context-dependent model normally has several thousand output states
    - A wide output layer helps
- Additionally, GPUs provide the computational power
Deep neural networks for acoustic modelling
- How to train a deep neural network (for acoustic modelling)?
  - Step 1: Train the restricted Boltzmann machines (RBMs)
  - Step 2: Stack the RBMs
  - Step 3: Put a softmax layer on top and refine the weights using back-propagation

G. Hinton, et al, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups", IEEE Signal Processing Magazine, 2012
Deep neural networks for acoustic modelling
Restricted Boltzmann machine
- Only has visible-hidden connections
- Learning by maximising the log-likelihood

    P(v) = (1/Z) exp(-F(v))                                    (5)
    F(v) = -log Σ_h exp(-E(v, h))          free energy         (6)
    E(v, h) = -b^T v - c^T h - v^T W h     energy function     (7)
    Z = Σ_{v,h} exp(-E(v, h))              partition function  (8)

[Figure: an RBM with visible layer v, hidden layer h, and weight matrix W]
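Equations (5)-(8) can be checked by brute force on an RBM small enough to enumerate every state. The sizes and parameter values below are made up; the point is only that the free-energy form of P(v) in (5) really is the marginal of the Boltzmann distribution, i.e. it sums to 1 over all visible configurations:

```python
import itertools, math

# Tiny RBM: 2 visible units, 2 hidden units, made-up parameters
b = [0.1, -0.2]          # visible biases
c = [0.05, 0.0]          # hidden biases
W = [[0.3, -0.1],        # weight matrix: W[i][j] connects v_i and h_j
     [0.2, 0.4]]

def energy(v, h):
    """E(v, h) = -b^T v - c^T h - v^T W h   (equation 7)"""
    return -(sum(b[i] * v[i] for i in range(2))
             + sum(c[j] * h[j] for j in range(2))
             + sum(v[i] * W[i][j] * h[j] for i in range(2) for j in range(2)))

states = list(itertools.product([0, 1], repeat=2))

# Partition function Z = sum over all (v, h) of exp(-E)   (equation 8)
Z = sum(math.exp(-energy(v, h)) for v in states for h in states)

def free_energy(v):
    """F(v) = -log sum_h exp(-E(v, h))   (equation 6)"""
    return -math.log(sum(math.exp(-energy(v, h)) for h in states))

# P(v) = exp(-F(v)) / Z   (equation 5); the marginals must sum to 1
total = sum(math.exp(-free_energy(v)) / Z for v in states)
print(round(total, 10))  # 1.0
```

In practice Z is intractable for realistic layer sizes, which is why RBM training relies on approximations such as contrastive divergence rather than the exact likelihood computed here.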
Deep neural networks for acoustic modelling
- Performance for ASR — DNNs significantly improve the state of the art

[Figure: word error rates on Switchboard using 300 hours of training data: GMM (25.3), +SAT (21.2), +DT (18.6), DNN+SAT (14.2), +DT (12.6)]

K. Vesely, et al, "Sequence-discriminative training of deep neural networks", in Interspeech 2013
Deep neural networks for acoustic modelling
- Current research activities in DNNs for ASR:
  - New types of neural networks, e.g. tensor networks, convolutional networks
  - Learning new acoustic feature representations, i.e. moving beyond MFCCs
  - Distributed optimisation to speed up training
  - Adaptation algorithms for speakers or domains
  - ...
Summary
- A brief overview of recent advances in ASR
  - System representation using WFST
  - Language modelling using RNNs
  - Acoustic modelling using DNNs
- Practise by yourself using the open-source toolkits:
  - OpenFst - http://www.openfst.org
  - RNNLM - http://www.fit.vutbr.cz/~imikolov/rnnlm
  - DNN for ASR - http://kaldi.sourceforge.net
Thanks!