Recent advances in automatic speech recognition— A brief overview
Liang Lu, University of Edinburgh
Liang Lu (liang.lu@ed.ac.uk), Heriot-Watt University, Feb, 2014. RTSC
This talk
- What is happening in ASR?
- Background – speech recognition and its applications
- (Recent) advances in system representation
  - Weighted finite state transducer
- Recent advances in language modelling
  - Recurrent neural network language model
- Recent advances in acoustic modelling
  - Deep neural network acoustic model
- Summary
Background
- Speech is one of the most natural ways of communicating information
- ASR is a central component of voice-driven information processing systems

X. He and L. Deng, "Speech-Centric Information Processing: An Optimization-Oriented Approach", in Proceedings of the IEEE, 2013
Background
- What does ASR do, and how does it do it?

[Figure: Speech -> ASR -> Text]

- It can be expressed mathematically as

    W* = argmax_W P(W | X)                      (1)
       = argmax_W p(X | W) · P(W)               (2)
                  (likelihood)  (prior)

  where X is a sequence of acoustic feature vectors and W is a word sequence.
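Equation (2) amounts to scoring each candidate word sequence by its acoustic likelihood times its language-model prior and keeping the best one. A minimal sketch of that decision rule, computed in log space for numerical stability; the hypotheses and all scores below are made up for illustration:

```python
import math

# Hypothetical word sequences with made-up acoustic log-likelihoods
# log p(X|W) and language-model log-priors log P(W).
hypotheses = {
    "i was thinking about my sweet time": (-120.0, math.log(1e-4)),
    "i'm what i'm thinking while my sweet time": (-118.0, math.log(1e-7)),
}

# argmax_W p(X|W) * P(W): in log space the product becomes a sum.
best = max(hypotheses, key=lambda w: sum(hypotheses[w]))
print(best)  # i was thinking about my sweet time
```

Here the second hypothesis has the better acoustic score, but its much smaller prior lets the first hypothesis win overall, which is exactly the interplay equation (2) describes.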
It is still hard, let's decompose it further ...

[Figure: the decomposition into components:
- LM (language model): scores word sequences, e.g. an n-best list
    i'm i'm what i'm thinking while my sweet time i'm very   0.0324
    i had what i'm thinking while my sweet time i'm very     0.0127
    i i was thinking about my sweet time i'm very            0.0046
- PM (pronunciation model): a lexicon mapping words to phone sequences with probabilities, e.g.
    abide      ax b ay d            1.0
    abiding    ax b ay d ih ng      1.0
    abilities  ax b ih l ih t iy z  0.666666
    abilities  ey b ih l ih t iy z  0.333333
    ability    ax b ih l ih t iy    1.0
    able       ax b ax l            0.413349
    able       ey b ax l            0.553356
- CD (context dependency): expands phones into context-dependent triphones, e.g.
    ax b ay d ---> sil-ax-b ax-b-ay b-ay-d ay-d-sil  1.0
- HMMs: each context-dependent phone is modelled by a hidden Markov model over states ... j-1, j, j+1 ...]
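The PM and CD steps above are mechanical table lookups and context expansions. A small sketch using the lexicon fragment from the slide; the helper name `to_triphones` is hypothetical, not from any toolkit:

```python
# A fragment of the pronunciation lexicon from the slide: word -> [(phones, prob), ...]
lexicon = {
    "abide": [("ax b ay d".split(), 1.0)],
    "able":  [("ax b ax l".split(), 0.413349), ("ey b ax l".split(), 0.553356)],
}

def to_triphones(phones):
    """Expand a phone sequence into context-dependent triphones,
    padding with silence ('sil') at the utterance boundaries."""
    padded = ["sil"] + phones + ["sil"]
    return [f"{padded[i-1]}-{padded[i]}-{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

phones, prob = lexicon["abide"][0]
print(to_triphones(phones))  # ['sil-ax-b', 'ax-b-ay', 'b-ay-d', 'ay-d-sil']
```

Each resulting triphone then indexes an HMM, which is how the word-level model finally connects to frame-level acoustics.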
System training - a generative process
[Figure: a word sequence ("i was thinking about my sweet time i'm very") is expanded through the pronunciation lexicon into phones, then into context-dependent triphones (e.g. ax b ay d ---> sil-ax-b ax-b-ay b-ay-d ay-d-sil), then into HMM states, which generate the acoustic features.]
Decoding - a search problem
[Figure: decoding reverses the generative process, searching the composed models for the word sequences that best explain the observed acoustics and producing an n-best list:
    i'm i'm what i'm thinking while my sweet time i'm very   0.0324
    i had what i'm thinking while my sweet time i'm very     0.0127
    i i was thinking about my sweet time i'm very            0.0046]
(Recent) advances in system representation
[Figure: the full pipeline of components (language model, pronunciation lexicon, context dependency, HMMs), raising the question: how can all of these be represented in a single framework?]
(Recent) advances in system representation
WFST – Weighted finite state transducer
- Input vocabulary i ∈ Φ1
- Output vocabulary o ∈ Φ2
- Weight w ∈ R
- ⊕ operation (combining alternative paths)
- ⊗ operation (extending a path)

[Figure: a toy WFST with arc 0 -> 1 labelled brad:brad/2 and arc 1 -> 2 labelled pitt:pitt/5]

M. Mohri and F. Pereira, "Weighted finite-state transducers in speech recognition", in CSL, 2002
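In ASR the weights are usually negative log probabilities in the tropical semiring, where ⊕ is min and ⊗ is +. A minimal sketch of the two operations (this is plain illustrative Python, not the OpenFst API):

```python
# Tropical semiring over weights w in R (negative log probabilities):
# a ⊕ b = min(a, b)  -- choose the better of two alternative paths
# a ⊗ b = a + b      -- extend a path by another arc
def oplus(a, b):
    return min(a, b)

def otimes(a, b):
    return a + b

# The toy transducer from the slide: arc brad:brad/2 followed by
# arc pitt:pitt/5 gives a path of total weight 2 ⊗ 5 = 7.
path = otimes(2, 5)
print(path)  # 7

# If a second path reached the same state with weight 9,
# the combined weight at that state would be 7 ⊕ 9 = 7.
print(oplus(path, 9))  # 7
```

With min and + the "best" path is the one with the smallest total weight, i.e. the highest probability.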
(Recent) advances in system representation
WFST for language model and pronunciation model
M. Mohri and F. Pereira, "Weighted finite-state transducers in speech recognition", in CSL, 2002
(Recent) advances in system representation
- WFST can integrate all the components of an ASR system into a joint graph, with additional optimisation
- If we define
  - H - HMMs
  - C - context dependency transducer
  - L - pronunciation model
  - G - language model

  then the task of ASR can be simply represented as

    W* = best_path(H ∘ C ∘ L ∘ G)    (3)

  given the acoustic signals.
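The composition in equation (3) matches the output labels of one transducer against the input labels of the next, multiplying (⊗, i.e. adding) weights along the way. A toy sketch of composition and best-path selection in the tropical semiring; the arc representation and all labels/weights are invented for illustration, and real implementations (e.g. OpenFst) also handle epsilon labels and determinisation, which this sketch omits:

```python
# Each transducer is a list of arcs: (src, dst, in_label, out_label, weight),
# with weights as negative log probabilities (tropical semiring).
def compose(a_arcs, b_arcs):
    """Compose A with B: pair arcs where A's output label equals B's
    input label; states become pairs and weights add (⊗)."""
    arcs = []
    for (s1, d1, i1, o1, w1) in a_arcs:
        for (s2, d2, i2, o2, w2) in b_arcs:
            if o1 == i2:
                arcs.append(((s1, s2), (d1, d2), i1, o2, w1 + w2))
    return arcs

A = [(0, 1, "a", "x", 1.0), (0, 1, "a", "y", 2.0)]
B = [(0, 1, "x", "X", 0.5), (0, 1, "y", "Y", 0.1)]

C = compose(A, B)
# Best path under the tropical semiring: minimal total weight.
best = min(C, key=lambda arc: arc[4])
print(best[3], best[4])  # X 1.5
```

Note that the globally best path ("a" -> "X", weight 1.5) is not the one containing B's cheapest arc, which is why the search must consider whole paths rather than greedy arc choices.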
(Recent) advances in system representation
- WFST provides an elegant interface for downstream applications
- An example of spoken language understanding (ASR + NLU):

[Figure: a transducer mapping words to semantic tags, one arc per word:
    0 -> 1  Show:O
    1 -> 2  me:O
    2 -> 3  movies:B-movie_type
    3 -> 4  with:O
    4 -> 5  brad:B-movie_star
    5 -> 6  pitt:I-movie_star]

A. Deoras et al, "Joint Discriminative Decoding of Words and Semantic Tags for Spoken Language Understanding", in IEEE TASLP 2013
(Recent) advances in system representation
- An example of speech-to-speech translation (ASR + MT)

X. He, L. Deng and A. Acero, "Why Word Error Rate is not a Good Metric for Speech Recognizer Training for the Speech Translation Task?", in ICASSP 2011
(Recent) advances in system representation
- Common practice – coupling ASR and MT with WFSTs

B. Zhou et al, "Folsom: A Fast and Memory-Efficient Phrase-based Approach to Statistical Machine Translation", in SLT 2006
B. Zhou et al, "On efficient coupling of ASR and SMT for speech translation", in ICASSP 2007
Recent advances in ASR

[Figure: the decomposition again, LM (language model), PM (pronunciation model), CD (context dependency), and HMMs, with the language model and the acoustic model highlighted as areas of active research where the recent advances come from neural networks.]
Neural networks in language modelling
- The n-gram language model has defined the state of the art for almost 40 years [L. R. Bahl, 1978]
- There has been a long struggle to move beyond n-grams with various statistical models:
  - Random forest language model [P. Xu, 2004]
  - Class-based language model, e.g. IBM Model M [S. F. Chen, 2009]
  - Nonparametric language model [Y. W. Teh, 2006]
  - Discriminative language model [B. Roark, 2006]
  - ...
- That move may finally be happening, thanks to the recurrent neural network language model (RNNLM)

T. Mikolov, et al, "Recurrent neural network based language model", in Interspeech 2010
Neural networks in language modelling
- The aim of a language model is very simple:

    P(w_n | w_{n-1}, ..., w_1) ≈ P(w_n | w_{n-1}, ..., w_{n-k})    (4)

  but estimating it is very difficult for k > 3 on a large-vocabulary task: e.g. what is the value of 60,000^3?
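The 60,000^3 figure is the number of distinct three-word histories a 4-gram model over a 60,000-word vocabulary would have to condition on; most are never seen in any corpus. The count, and the maximum-likelihood n-gram estimate it defeats, can be sketched directly (the toy corpus is made up):

```python
from collections import Counter

# Number of distinct 3-word histories for a 60,000-word vocabulary:
V = 60_000
print(V ** 3)  # 216000000000000 -- about 2.16e14 contexts

# Maximum-likelihood trigram estimate on a toy corpus:
# P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2)
corpus = "i was thinking about my sweet time".split()
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))
p = trigrams[("i", "was", "thinking")] / bigrams[("i", "was")]
print(p)  # 1.0
```

With so few observations per context, raw counts like the 1.0 above are badly overconfident, which is why n-gram systems lean on smoothing and why neural models, which share parameters across contexts, are attractive.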
Neural networks in language modelling
- The neural network language model is not new [Y. Bengio, 2003]

[Figure: feedforward NNLM, where each context word w_{n-k}, ..., w_{n-1} enters the input layer as a one-hot vector (e.g. 0 0 0 1 0 0 ...), passes through a shared projection layer and a hidden layer, and the output layer gives P(w_n | w_{n-1}, ..., w_{n-k}).]

Y. Bengio, et al, "A neural probabilistic language model", in JMLR 2003
Neural networks in language modelling
- The RNNLM differs in that a recurrent hidden layer is used to capture longer contextual information

T. Mikolov, et al, "Recurrent neural network based language model", in Interspeech 2010
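One time step of an Elman-style RNNLM can be written in a few lines: the hidden state combines the current word (as a one-hot input) with the previous hidden state, and a softmax over the vocabulary gives the next-word distribution. The dimensions and random weights below are toy values, not a trained model:

```python
import math, random

random.seed(0)
V, H = 5, 3  # toy vocabulary size and hidden-layer size

# Toy weights: input-to-hidden U, recurrent W, hidden-to-output O
U = [[random.uniform(-0.1, 0.1) for _ in range(V)] for _ in range(H)]
W = [[random.uniform(-0.1, 0.1) for _ in range(H)] for _ in range(H)]
O = [[random.uniform(-0.1, 0.1) for _ in range(H)] for _ in range(V)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def step(word_id, h_prev):
    """One RNNLM time step: mix the current word (one-hot, so U[:, word_id])
    with the previous hidden state, then softmax over the vocabulary."""
    h = [sigmoid(U[j][word_id] + sum(W[j][k] * h_prev[k] for k in range(H)))
         for j in range(H)]
    logits = [sum(O[i][j] * h[j] for j in range(H)) for i in range(V)]
    z = [math.exp(l) for l in logits]
    probs = [x / sum(z) for x in z]
    return h, probs

h = [0.0] * H
for w in [0, 3, 1]:          # a toy word-id sequence
    h, probs = step(w, h)

print(abs(sum(probs) - 1.0) < 1e-9)  # True: a valid distribution
```

Because `h` is carried forward from step to step, the prediction at each position depends, in principle, on the entire history rather than on a fixed k-word window.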
Neural networks in language modelling
- The RNNLM achieves significant reductions in both perplexity and word error rate (results on Wall Street Journal)

T. Mikolov, et al, "Recurrent neural network based language model", in Interspeech 2010
Neural networks in language modelling
Not limited to language modelling
- RNN for spoken language understanding [K. Yao, et al, 2013]

K. Yao, et al, "Recurrent neural networks for language understanding", in Interspeech 2013
Neural networks in language modelling
Not limited to language modelling
- RNN for machine translation [N. Kalchbrenner, P. Blunsom, 2013]

N. Kalchbrenner, P. Blunsom, "Recurrent continuous translation models", in EMNLP 2013
Neural networks in acoustic modelling
- GMM-HMM has defined the state of the art for over 20 years

[Figure: an HMM with states ... j-1, j, j+1 ...]

- Pros:
  - Efficient and parallel training algorithms
  - Clear physical meaning (Gaussian means, variances, etc.)
  - Efficient adaptation algorithms (MLLR, fMLLR, etc.)
- Cons:
  - Inefficient in learning feature correlations
  - Hard to take advantage of longer context windows
  - Generative rather than discriminative model
Neural networks in acoustic modelling
- Moving beyond GMM-HMM?
  - Conditional random fields (CRF), e.g. segmental CRF [G. Zweig, 2010], augmented CRF [Y. Hifny, 2009], hidden CRF [A. Gunawardana, 2005]
  - Support vector machines (SVM), e.g. [N. Smith, 2002]
  - Template-based acoustic models, e.g. [M. De Wachter, 2007]
  - ...
- Deep neural networks for acoustic modelling [G. Dahl, 2012]

G. Zweig, P. Nguyen, "A segmental conditional random fields toolkit for speech recognition", in Interspeech 2010.
Y. Hifny, S. Renals, "Speech recognition using augmented conditional random fields", IEEE TASLP, 2009.
A. Gunawardana, et al, "Hidden conditional random fields for phone classification", in Interspeech 2005.
N. Smith, M. Gales, "Speech recognition using SVMs", in NIPS, 2002.
M. De Wachter, et al, "Template based continuous speech recognition", IEEE TASLP, 2007.
G. Dahl, et al, "Context-dependent pre-trained deep neural networks for large vocabulary speech recognition", IEEE TASLP 2012.
Deep neural networks for acoustic modelling
- Neural networks for speech recognition were extensively studied in the early 1990s
- New ingredients in deep neural networks (DNN):
  - Pre-training using restricted Boltzmann machines (RBMs)
  - More hidden layers (≥ 4)
  - Wider output (on the order of 10^3 units vs. fewer than 10^2 previously in speech)

[Figure: a shallow neural network (input, one hidden layer, output) vs. a deep neural network (input, several hidden layers, output)]
Deep neural networks for acoustic modelling
The DNN is still combined with an HMM, which was already the practice in the early 1990s.

G. Dahl, et al, "Context-dependent pre-trained deep neural networks for large vocabulary speech recognition", IEEE TASLP 2012
Deep neural networks for acoustic modelling
- Why can the new ingredients make a difference?
  - A deep neural network is difficult to train, since it can easily be trapped in a local optimum
    - Pre-training helps (in some cases)
  - A shallow network cannot efficiently learn complex functions
    - More hidden layers help
  - For ASR, a context-dependent model normally has several thousand output states
    - A wide output layer helps
- Additionally, GPUs provide the computational power
Deep neural networks for acoustic modelling
- How to train a deep neural network (for acoustic modelling)?
  - Step 1: Train the restricted Boltzmann machines (RBMs)
  - Step 2: Stack the RBMs
  - Step 3: Put a softmax layer on top and refine the weights using back-propagation

G. Hinton, et al, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups", IEEE Signal Processing Magazine, 2012
Deep neural networks for acoustic modelling
Restricted Boltzmann machine
- Only has visible-hidden connections
- Learning by maximising the log-likelihood

    P(v) = (1/Z) exp(-F(v))                                    (5)
    F(v) = -log Σ_h exp(-E(v, h))          free energy         (6)
    E(v, h) = -b^T v - c^T h - v^T W h     energy function     (7)
    Z = Σ_{v,h} exp(-E(v, h))              partition function  (8)

[Figure: an RBM with visible layer v, hidden layer h, and weight matrix W]
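Equations (5)-(8) can be checked by brute force on an RBM small enough to enumerate every state. The sizes and parameter values below are made up; the point is only that the free-energy form of P(v) in (5) really is the marginal of the Boltzmann distribution, i.e. it sums to 1 over all visible configurations:

```python
import itertools, math

# Tiny RBM: 2 visible units, 2 hidden units, made-up parameters
b = [0.1, -0.2]          # visible biases
c = [0.05, 0.0]          # hidden biases
W = [[0.3, -0.1],        # weight matrix: W[i][j] connects v_i and h_j
     [0.2, 0.4]]

def energy(v, h):
    """E(v, h) = -b^T v - c^T h - v^T W h   (equation 7)"""
    return -(sum(b[i] * v[i] for i in range(2))
             + sum(c[j] * h[j] for j in range(2))
             + sum(v[i] * W[i][j] * h[j] for i in range(2) for j in range(2)))

states = list(itertools.product([0, 1], repeat=2))

# Partition function Z = sum over all (v, h) of exp(-E)   (equation 8)
Z = sum(math.exp(-energy(v, h)) for v in states for h in states)

def free_energy(v):
    """F(v) = -log sum_h exp(-E(v, h))   (equation 6)"""
    return -math.log(sum(math.exp(-energy(v, h)) for h in states))

# P(v) = exp(-F(v)) / Z   (equation 5); the marginals must sum to 1
total = sum(math.exp(-free_energy(v)) / Z for v in states)
print(round(total, 10))  # 1.0
```

In practice Z is intractable for realistic layer sizes, which is why RBM training relies on approximations such as contrastive divergence rather than the exact likelihood computed here.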
Deep neural networks for acoustic modelling
- Performance for ASR — DNNs significantly improve the state of the art

[Figure: word error rates on Switchboard using 300 hours of training data: GMM (25.3), +SAT (21.2), +DT (18.6), DNN+SAT (14.2), +DT (12.6)]

K. Vesely, et al, "Sequence-discriminative training of deep neural networks", in Interspeech 2013
Deep neural networks for acoustic modelling
- Current research activities in DNNs for ASR:
  - New types of neural networks, e.g. tensor networks, convolutional networks
  - Learning new acoustic feature representations, i.e. moving beyond MFCCs
  - Distributed optimisation to speed up training
  - Adaptation algorithms for speakers or domains
  - ...
Summary
- A brief overview of recent advances in ASR
  - System representation using WFST
  - Language modelling using RNNs
  - Acoustic modelling using DNNs
- Practise by yourself using the open-source toolkits:
  - OpenFst - http://www.openfst.org
  - RNNLM - http://www.fit.vutbr.cz/~imikolov/rnnlm
  - DNN for ASR - http://kaldi.sourceforge.net
Thanks!