Source: mohri/asr12/lecture_1.pdf
TRANSCRIPT
Speech Recognition
Lecture 1: Introduction
Mehryar Mohri, Courant Institute and Google Research
Mehryar Mohri - Speech Recognition, Courant Institute, NYU
Logistics
Prerequisites: basics in analysis of algorithms and probability. No specific knowledge about signal processing.
Workload: 2-3 homework assignments, 1 project (your choice).
Textbooks: there is no single textbook covering the material presented in this course. Lecture slides are available electronically.
Objectives
Computer science view of automatic speech recognition (ASR) (no signal processing).
Essential algorithms for large-vocabulary speech recognition.
But, emphasis on general algorithms:
• automata and transducer algorithms.
• statistical learning algorithms.
Topics
introduction, formulation, components, features.
weighted transducer software library.
weighted automata algorithms.
statistical language modeling software library.
ngram models.
maximum entropy models.
pronunciation models, decision trees, context-dependent models.
Topics
search algorithms, transducer optimizations, Viterbi decoder.
search algorithms, N-best algorithms, lattice generation, rescoring.
structured prediction algorithms.
adaptation.
active learning.
semi-supervised learning.
Speech recognition problem
Statistical formulation
Acoustic features
This Lecture
Speech Recognition Problem
Definition: find accurate written transcription of spoken utterances.
• transcriptions may be in words, phonemes, syllables, or other units.
Accuracy: typically measured in terms of the edit-distance between reference transcription and sequence output by the model.
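As a concrete illustration, the edit-distance between a reference and a hypothesis word sequence, and the derived word error rate, can be computed with standard dynamic programming. A minimal sketch; the function names are illustrative, not from any particular toolkit:

```python
def edit_distance(ref, hyp):
    """Dynamic-programming (Levenshtein) edit distance over word tokens."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def word_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    return edit_distance(ref, hyp) / len(ref)

ref = "the cat sat on the mat".split()
hyp = "the cat sat on mat".split()
print(edit_distance(ref, hyp))    # 1 (one deleted word)
print(word_error_rate(ref, hyp))  # 1/6
```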
Other Related Problems
Speaker verification.
Speaker identification.
Spoken-dialog systems.
Detection of voice features, e.g., gender, age, dialect, emotion, height, weight!
Speech synthesis.
Speech Recognition Is Difficult
Highly variable: the same words pronounced by the same person in the same conditions typically lead to different waveforms.
• source variation: speaking rate, volume, accent, dialect, pitch, coarticulation.
• channel variation: microphone (type, position), noise (background, distortion).
Key problem: robustness to such variations.
ASR Characteristics
Vocabulary size: small (digit recognition, 10), medium (Resource Management, 1,000), large (Broadcast News, 100,000), very large (1M+).
Speaker-dependent or speaker-independent.
Domain-specific or unconstrained, e.g., travel reservation, modern spoken-dialog systems.
Isolated (pause between units) or continuous.
Read or spontaneous, e.g., dictation, news broadcast, conversational speech.
History
1922: Radio Rex, toy, single-word recognizer (rex).
1939: voder and vocoder (mechanical synthesizer), Dudley (Bell Labs).
1952: isolated digit recognition, single speaker (Bell Labs).
1950s: 10 syllables of single speaker, Olson and Belar, (RCA Labs).
1950s: speaker-independent 10-vowel recognizer (MIT).
See (Juang and Rabiner, 1995)
History
1960s: Linear Predictive Coding (LPC), Atal and Itakura.
1969: John Pierce’s negative comments about ASR (Bell Labs).
1970s: Advanced Research Projects Agency (ARPA) funds speech understanding program. CMU’s Harpy system based on automata had reasonable accuracy for 1,000 words.
History
1980s: n-gram models. ARPA Resource Management, Wall Street Journal, and ATIS tasks. Delta/delta-delta cepstra, mel cepstra.
mid-1980s: Hidden Markov models (HMMs) become the preferred technique for speech recognition.
1990s: Discriminative training, vocal tract normalization, speaker adaptation. Very large-vocabulary speech recognition, e.g., 1M names recognizer (Bell Labs), 500,000 words North American Business News (NAB) recognizer.
History
mid-1990s: FSM library; weighted transducers become a major component of almost all modern speech recognition and understanding systems. SVMs, kernel methods. Dictation systems (Dragon; IBM speaker-dependent system).
2000s: Broadcast News, conversational speech, e.g., Switchboard, Call Home, real-time large-vocabulary systems, unconstrained spoken-dialog systems, e.g., HMIHY.
History
[Figure: Milestones in Speech Recognition and Understanding Technology over the past 40 years. Timeline, roughly 1962–2002:
• Isolated words (small vocabulary, acoustic-phonetics-based): filterbank analysis, time normalization, dynamic programming.
• Isolated words, connected digits, continuous speech (medium vocabulary, template-based): pattern recognition, LPC analysis, clustering algorithms, level building.
• Continuous speech, speech understanding (large vocabulary; syntax, semantics): stochastic language understanding, finite-state machines, statistical learning.
• Connected words, continuous speech (large vocabulary, statistical-based): hidden Markov models, stochastic language modeling.
• Spoken dialog, multiple modalities (very large vocabulary; semantics, multimodal dialog, TTS): concatenative synthesis, machine learning, mixed-initiative dialog.]
(Juang and Rabiner, 1995)
Unconstrained Spoken-Dialog Systems
Speech recognition problem
Statistical formulation
Acoustic features
This Lecture
Speech recognition problem
Statistical formulation
• Maximum likelihood and maximum a posteriori
• Statistical formulation of speech recognition
• Components of a speech recognizer
Acoustic features
This Lecture
Problem
Data: sample drawn i.i.d. from a set $X$ according to some distribution $D$:
$$x_1, \ldots, x_m \in X.$$
Problem: find a distribution $p$ out of a set $P$ that best estimates $D$.
Maximum Likelihood
Likelihood: probability of observing the sample under distribution $p \in P$, which, given the independence assumption, is
$$\Pr[x_1, \ldots, x_m] = \prod_{i=1}^{m} p(x_i).$$
Principle: select the distribution maximizing the sample probability,
$$p^\star = \operatorname*{argmax}_{p \in P} \prod_{i=1}^{m} p(x_i) \quad \text{or} \quad p^\star = \operatorname*{argmax}_{p \in P} \sum_{i=1}^{m} \log p(x_i).$$
Example: Bernoulli Trials
Problem: find the most likely Bernoulli distribution, given a sequence of coin flips
$$H, T, T, H, T, H, T, H, H, H, T, T, \ldots, H.$$
Bernoulli distribution: $p(H) = \theta$, $p(T) = 1 - \theta$.
Likelihood:
$$l(p) = \log \theta^{N(H)} (1 - \theta)^{N(T)} = N(H) \log \theta + N(T) \log(1 - \theta).$$
Solution: $l$ is differentiable and concave;
$$\frac{dl(p)}{d\theta} = \frac{N(H)}{\theta} - \frac{N(T)}{1 - \theta} = 0 \iff \theta = \frac{N(H)}{N(H) + N(T)}.$$
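The closed-form estimate can be sanity-checked numerically; the flip sequence below is made up for illustration:

```python
import math

flips = list("HTTHTHTHHHTT")  # invented sample of coin flips
n_h = flips.count("H")
n_t = flips.count("T")
theta = n_h / (n_h + n_t)     # closed-form ML estimate from the slide

def log_likelihood(t):
    return n_h * math.log(t) + n_t * math.log(1 - t)

# The closed-form estimate beats a grid of alternative parameters:
assert all(log_likelihood(theta) >= log_likelihood(t)
           for t in (0.1, 0.3, 0.5, 0.7, 0.9))
print(theta)  # 0.5 for this sample (6 heads, 6 tails)
```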
Example: Gaussian Distribution
Problem: find the most likely Gaussian distribution, given a sequence of real-valued observations
$$3.18,\ 2.35,\ .95,\ 1.175,\ \ldots$$
Normal distribution:
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).$$
Likelihood:
$$l(p) = -\frac{1}{2}\, m \log(2\pi\sigma^2) - \sum_{i=1}^{m} \frac{(x_i - \mu)^2}{2\sigma^2}.$$
Solution: $l$ is differentiable and concave;
$$\frac{\partial l(p)}{\partial \mu} = 0 \iff \mu = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \frac{\partial l(p)}{\partial \sigma^2} = 0 \iff \sigma^2 = \frac{1}{m} \sum_{i=1}^{m} x_i^2 - \mu^2.$$
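A quick numerical check of the closed-form estimates; the observations beyond the slide's first few values are invented:

```python
xs = [3.18, 2.35, 0.95, 1.175, 2.0]  # last value invented to round out the sample
m = len(xs)

mu = sum(xs) / m                             # ML mean
sigma2 = sum(x * x for x in xs) / m - mu * mu  # ML variance (biased form)

# Same as the usual "mean of squared deviations" form:
sigma2_alt = sum((x - mu) ** 2 for x in xs) / m
assert abs(sigma2 - sigma2_alt) < 1e-12
print(mu, sigma2)
```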
Properties
Problems:
• the underlying distribution may not be among those searched.
• overfitting: number of examples too small wrt number of parameters.
Maximum A Posteriori (MAP)
Principle: select the most likely hypothesis $h \in H$ given the sample $S$, with some prior distribution $\Pr[h]$ over the hypotheses:
$$h^\star = \operatorname*{argmax}_{h \in H} \Pr[h \mid S] = \operatorname*{argmax}_{h \in H} \frac{\Pr[S \mid h]\, \Pr[h]}{\Pr[S]} = \operatorname*{argmax}_{h \in H} \Pr[S \mid h]\, \Pr[h].$$
Note: for a uniform prior, MAP coincides with maximum likelihood.
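A toy illustration of how a prior changes the selected hypothesis relative to maximum likelihood on the coin example; the candidate biases and prior values are invented:

```python
import math

n_h, n_t = 7, 3                           # observed counts (invented)
hypotheses = [0.3, 0.5, 0.7]              # candidate biases theta
prior = {0.3: 0.1, 0.5: 0.8, 0.7: 0.1}    # strong prior belief in a fair coin

def log_lik(theta):
    return n_h * math.log(theta) + n_t * math.log(1 - theta)

# ML ignores the prior; MAP adds log Pr[h]:
ml = max(hypotheses, key=log_lik)
map_ = max(hypotheses, key=lambda t: log_lik(t) + math.log(prior[t]))

print(ml)    # 0.7: best fit to 7 heads out of 10
print(map_)  # 0.5: the prior pulls the choice back to the fair coin
```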
Speech recognition problem
Statistical formulation
• Maximum likelihood and maximum a posteriori
• Statistical formulation of speech recognition
• Components of a speech recognizer
Acoustic features
This Lecture
General Ideas
Probabilistic formulation: given a spoken utterance, find the most likely transcription.
Decomposition: mapping from spoken utterances to word sequences decomposed into intermediate units.
[Figure: alignment of the intermediate levels for one utterance:
word seq. $w_1\, w_2\, w_3\, w_4$;
phoneme seq. $p_1 \ldots p_{10}$;
CD phone seq. $c_1 \ldots c_{14}$;
observ. seq. $o_1 \ldots o_{16}$.]
Statistical Formulation
Observation sequence produced by the signal processing system: $o = o_1 \ldots o_m$.
Sequence of words over the alphabet $\Sigma$: $w = w_1 \ldots w_k$.
Formulation (maximum a posteriori decoding):
$$w = \operatorname*{argmax}_{w \in \Sigma^*} \Pr[w \mid o] = \operatorname*{argmax}_{w \in \Sigma^*} \frac{\Pr[o \mid w]\, \Pr[w]}{\Pr[o]} = \operatorname*{argmax}_{w \in \Sigma^*} \underbrace{\Pr[o \mid w]}_{\text{acoustic \& pronunciation model}}\, \underbrace{\Pr[w]}_{\text{language model}}.$$
(Bahl, Jelinek, and Mercer, 1983)
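A toy illustration of the decoding rule: among a few candidate transcriptions, pick the argmax of $\log \Pr[o \mid w] + \log \Pr[w]$. All scores below are invented; in a real recognizer they come from the acoustic/pronunciation and language models:

```python
candidates = {
    # w: (acoustic log-prob  log Pr[o|w],  LM log-prob  log Pr[w])
    "recognize speech":   (-12.0, -4.0),
    "wreck a nice beach": (-11.5, -9.0),
}

# MAP decoding over the candidate set: argmax of the summed log-scores.
best = max(candidates, key=lambda w: candidates[w][0] + candidates[w][1])
print(best)  # "recognize speech": the language model decides
```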
Fred Jelinek
18 November 1932 - 14 September 2010
Components
Acoustic and pronunciation model:
$$\Pr(o \mid w) = \sum_{d,c,p} \Pr(o \mid d)\, \Pr(d \mid c)\, \Pr(c \mid p)\, \Pr(p \mid w).$$
• $\Pr(o \mid d)$: observation seq. → distribution seq. (acoustic model).
• $\Pr(d \mid c)$: distribution seq. → CD phone seq. (acoustic model).
• $\Pr(c \mid p)$: CD phone seq. → phoneme seq.
• $\Pr(p \mid w)$: phoneme seq. → word seq.
Language model: $\Pr(w)$, distribution over word seq.
Notes
Formulation does not match the way speech recognition errors are typically measured: edit-distance between hypothesis and reference transcription.
Speech recognition problem
Statistical formulation
• Maximum likelihood and maximum a posteriori
• Statistical formulation of speech recognition
• Components of a speech recognizer
Acoustic features
This Lecture
Acoustic Observations
Discretization
• time: local spectral analysis of the speech waveform at regular intervals, $t = t_1, \ldots, t_m$, with $t_{i+1} - t_i = 10\,\mathrm{ms}$ (typically).
Parameter vectors
• magnitude: $o = o_1 \ldots o_m$, with $o_i \in \mathbb{R}^N$, $N = 39$ (typically).
Note: other perceptual information, e.g., visual information, is ignored.
Acoustic Model
Three-state hidden Markov models (HMMs):
[Figure: three-state left-to-right HMM for the CD phone $\text{ae}_{b,d}$, with self-loops and forward transitions labeled $d_0{:}\epsilon$, $d_1{:}\epsilon$, $d_2{:}\text{ae}_{b,d}$ over states 0–3.]
Distributions:
• Full covariance multivariate Gaussians:
$$\Pr[o] = \frac{1}{(2\pi)^{N/2}\,|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(o - \mu)^T \Sigma^{-1} (o - \mu)}.$$
• Diagonal covariance Gaussian mixtures.
• Semi-continuous, tied mixtures.
(Rabiner and Juang, 1993)
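For concreteness, the full-covariance Gaussian log-density above can be evaluated directly. A dependency-free sketch for the two-dimensional case, with an explicit 2x2 inverse and determinant; all numbers are illustrative:

```python
import math

def gaussian_log_density_2d(o, mu, sigma):
    """log N(o; mu, sigma) for a 2-dimensional observation o,
    using an explicit 2x2 inverse/determinant to stay dependency-free."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    diff = [o[0] - mu[0], o[1] - mu[1]]
    # Mahalanobis term (o - mu)^T Sigma^{-1} (o - mu):
    maha = sum(diff[i] * inv[i][j] * diff[j]
               for i in range(2) for j in range(2))
    return -0.5 * (2 * math.log(2 * math.pi) + math.log(det) + maha)

val = gaussian_log_density_2d([0.5, -0.2], [0.0, 0.0],
                              [[1.0, 0.0], [0.0, 1.0]])
print(val)  # about -1.9829 for the standard normal
```

Real systems typically prefer diagonal-covariance mixtures, trading modeling power for much cheaper density evaluation.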
Context-Dependent Model
Idea:
• phoneme pronunciation depends on environment (allophones, co-articulation).
• modeling a phone in context gives better accuracy.
Context-dependent rules:
• Context-dependent units: $\text{ae}/b\_d \to \text{ae}_{b,d}$.
• Allophonic rules: $t/V\_V \to dx$ (e.g., flapping of $t$ between vowels).
• Complex contexts: regular expressions.
(Lee, 1990; Young et al., 1994)
Pronunciation Dictionary
Phonemic transcription
• Example: the word data in American English:

  data  D ey dx ax  0.32
  data  D ey t ax   0.08
  data  D ae dx ax  0.48
  data  D ae t ax   0.12

Representation
[Figure: weighted transducer mapping phoneme sequences to the word data, with arcs d:ε/1.0, ey:ε/0.4, ae:ε/0.6, dx:ε/0.8, t:ε/0.2, and final arc ax:data/1.0.]
Language Model
Definition: probabilistic model for sequences of words $w = w_1 \ldots w_k$.
• By the chain rule,
$$\Pr[w] = \prod_{i=1}^{k} \Pr[w_i \mid w_1 \ldots w_{i-1}].$$
Modeling simplifications:
• Clustering of histories: $(w_1, \ldots, w_{i-1}) \mapsto c(w_1, \ldots, w_{i-1})$.
• Example: $n$th-order Markov assumption,
$$\forall i,\ \Pr[w_i \mid w_1 \ldots w_{i-1}] = \Pr[w_i \mid h_i], \quad |h_i| \le n - 1.$$
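A minimal maximum-likelihood bigram model ($n = 2$) over a toy corpus illustrates the clustered-history idea. Smoothing, a topic of later lectures, is deliberately omitted, so unseen bigrams get probability zero:

```python
from collections import Counter

corpus = ["<s> the cat sat </s>",
          "<s> the dog sat </s>",
          "<s> the cat ran </s>"]  # invented toy corpus

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = sent.split()
    unigrams.update(words[:-1])            # history counts
    bigrams.update(zip(words, words[1:]))  # adjacent word pairs

def bigram_prob(history, word):
    """ML estimate Pr[word | history] = count(history word) / count(history)."""
    return bigrams[(history, word)] / unigrams[history]

print(bigram_prob("the", "cat"))  # 2/3
print(bigram_prob("cat", "sat"))  # 1/2
```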
Recognition Cascade
Combination of components:
observ. seq. → [HMM] → CD phone seq. → [CD Model] → phoneme seq. → [Pron. Model] → word seq. → [Lang. Model] → word seq.
Viterbi approximation:
$$w = \operatorname*{argmax}_{w} \sum_{d,c,p} \Pr[o \mid d]\, \Pr[d \mid c]\, \Pr[c \mid p]\, \Pr[p \mid w]\, \Pr[w] \approx \operatorname*{argmax}_{w}\, \max_{d,c,p} \Pr[o \mid d]\, \Pr[d \mid c]\, \Pr[c \mid p]\, \Pr[p \mid w]\, \Pr[w].$$
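The max-over-paths computation underlying the Viterbi approximation can be sketched on a toy HMM. All states, transition and emission probabilities below are invented for illustration:

```python
import math

states = ["s1", "s2"]
log_init = {"s1": math.log(0.6), "s2": math.log(0.4)}
log_trans = {("s1", "s1"): math.log(0.7), ("s1", "s2"): math.log(0.3),
             ("s2", "s1"): math.log(0.4), ("s2", "s2"): math.log(0.6)}
log_emit = {("s1", "a"): math.log(0.9), ("s1", "b"): math.log(0.1),
            ("s2", "a"): math.log(0.2), ("s2", "b"): math.log(0.8)}

def viterbi(obs):
    """Most likely state sequence: max over paths instead of a sum."""
    v = {s: log_init[s] + log_emit[(s, obs[0])] for s in states}
    back = []
    for o in obs[1:]:
        v_new, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda r: v[r] + log_trans[(r, s)])
            v_new[s] = v[prev] + log_trans[(prev, s)] + log_emit[(s, o)]
            ptr[s] = prev
        v, back = v_new, back + [ptr]
    # Follow back-pointers from the best final state:
    last = max(states, key=lambda s: v[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["a", "a", "b"]))  # ['s1', 's1', 's2']
```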
Speech Recognition Problems
Learning: how to create accurate models for each component?
Search: how to efficiently combine models and determine best transcription?
Representation: compact data structure for the computational representation of the models.
common representation and algorithmic framework based on weighted transducers (next lectures).
Speech recognition problem
Statistical formulation
Acoustic features
This Lecture
Feature Selection
Short-time Fourier analysis:
$$\log \left| \int x(t)\, w(t - \tau)\, e^{-i\omega t}\, dt \right|$$
Idea: find smooth approximation eliminating large variations over short frequency intervals.
[Figure: short-time (25 ms Hamming window) spectrum of /ae/, power (dB) vs. frequency (Hz).]
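A sketch of one frame of short-time analysis: a Hamming window applied to a 25 ms frame, followed by a DFT magnitude spectrum. The pure-Python DFT is only for self-containment; real front ends use an FFT:

```python
import math

sample_rate = 16000
# 25 ms frame (400 samples) of an invented 440 Hz test tone:
frame = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(400)]

n = len(frame)
hamming = [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
windowed = [x * w for x, w in zip(frame, hamming)]

def dft_magnitude(x, k):
    """|X[k]| for bin k of the discrete Fourier transform."""
    re = sum(xi * math.cos(-2 * math.pi * k * i / len(x)) for i, xi in enumerate(x))
    im = sum(xi * math.sin(-2 * math.pi * k * i / len(x)) for i, xi in enumerate(x))
    return math.hypot(re, im)

# The tone falls exactly on bin 440 / (16000 / 400) = 11:
peak = max(range(n // 2), key=lambda k: dft_magnitude(windowed, k))
print(peak)  # 11
```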
Cepstral Coefficients
Let $x(\omega)$ denote the Fourier transform of the signal.
Definition: the 13 cepstral coefficients are the energy and the first 12 coefficients of the expansion
$$\log |x(\omega)| = \sum_{n=-\infty}^{\infty} c_n e^{-in\omega}.$$
Other coefficients: 13 first-order (delta-cepstra) and 13 second-order (delta-delta cepstra) differentials.
Mel Frequency Cepstral Coefficients
Refinement: non-linear scale approximating human perception of the distance between frequencies, e.g., the mel frequency scale:
$$f_{\text{mel}} = 2595 \log_{10}(1 + f/700).$$
MFCCs:
• signal first transformed using the mel frequency bands.
• extraction of cepstral coefficients.
(Stevens and Volkman, 1940)
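The mel scale formula is easy to evaluate directly; it is roughly linear below 1 kHz and logarithmic above (by construction, 1000 Hz maps to about 1000 mel):

```python
import math

def hz_to_mel(f):
    """Mel scale from the slide: f_mel = 2595 log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

print(hz_to_mel(0))     # 0.0
print(hz_to_mel(1000))  # close to 1000 mel
```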
Other Refinements
Speaker/Channel adaptation:
• mean cepstral subtraction.
• vocal tract normalization.
• linear transformations.
References• Bahl, L. R., Jelinek, F., and Mercer, R. (1983). A Maximum Likelihood Approach to
Continuous Speech Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 5(2), 179-190.
• Biing-Hwang Juang and Lawrence R. Rabiner. Automatic Speech Recognition - A Brief History of the Technology. Elsevier Encyclopedia of Language and Linguistics, Second Edition, 2005.
• Frederick Jelinek. Statistical Methods for Speech Recognition. MIT Press, Cambridge, MA, 1998.
• Kai-Fu Lee. Context-Dependent Phonetic Hidden Markov Models for Continuous Speech Recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(4):599-609, 1990.
• Lawrence Rabiner and Biing-Hwang Juang. Fundamentals of Speech Recognition. Prentice Hall, 1993.
References• S.S. Stevens and J. Volkman. The relation of pitch to frequency. American Journal of
Psychology, 53:329, 1940.
• Steve Young, J. Odell, and Phil Woodland. Tree-Based State-Tying for High Accuracy Acoustic Modelling. In Proceedings of ARPA Human Language Technology Workshop, Morgan Kaufmann, San Francisco, 1994.