
Speech Recognition
Lecture 1: Introduction

Mehryar Mohri
Courant Institute and Google Research

[email protected]

Mehryar Mohri - Speech Recognition, Courant Institute, NYU

Logistics

Prerequisites: basics in analysis of algorithms and probability. No specific knowledge of signal processing is required.

Workload: 2-3 homework assignments, 1 project (your choice).

Textbooks: no single textbook covers the material presented in this course. Lecture slides are available electronically.


Objectives

Computer science view of automatic speech recognition (ASR) (no signal processing).

Essential algorithms for large-vocabulary speech recognition.

But the emphasis is on general algorithms:

• automata and transducer algorithms.

• statistical learning algorithms.


Topics

introduction, formulation, components, features.

weighted transducer software library.

weighted automata algorithms.

statistical language modeling software library.

ngram models.

maximum entropy models.

pronunciation models, decision trees, context-dependent models.


Topics

search algorithms, transducer optimizations, Viterbi decoder.

search algorithms, N-best algorithms, lattice generation, rescoring.

structured prediction algorithms.

adaptation.

active learning.

semi-supervised learning.


Speech recognition problem

Statistical formulation

Acoustic features

This Lecture


Speech Recognition Problem

Definition: find accurate written transcription of spoken utterances.

• transcriptions may be in words, phonemes, syllables, or other units.

Accuracy: typically measured in terms of the edit-distance between reference transcription and sequence output by the model.
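Since accuracy is measured by edit-distance, the standard word error rate (WER) metric follows directly. A minimal sketch in Python (function names are illustrative; whitespace tokenization is assumed):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences: the minimum
    number of substitutions, insertions, and deletions."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))  # distances for the previous row of the DP table
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution / match
        prev = cur
    return prev[n]

def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

With one deleted word out of six reference words, the WER here is 1/6.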


Other Related Problems

Speaker verification.

Speaker identification.

Spoken-dialog systems.

Detection of voice features, e.g., gender, age, dialect, emotion, height, weight!

Speech synthesis.


Speech Spectrogram


Speech Recognition Is Difficult

Highly variable: the same words pronounced by the same person in the same conditions typically lead to different waveforms.

• source variation: speaking rate, volume, accent, dialect, pitch, coarticulation.

• channel variation: microphone (type, position), noise (background, distortion).

Key problem: robustness to such variations.


ASR Characteristics

Vocabulary size: small (digit recognition, 10 words), medium (Resource Management, 1,000), large (Broadcast News, 100,000), very large (1M+).

Speaker-dependent or speaker-independent.

Domain-specific or unconstrained, e.g., travel reservation, modern spoken-dialog systems.

Isolated (pause between units) or continuous.

Read or spontaneous, e.g., dictation, news broadcast, conversational speech.


Example - Broadcast News


History

1922: Radio Rex, toy, single-word recognizer (rex).

1939: voder and vocoder (mechanical synthesizer), Dudley (Bell Labs).

1952: isolated digit recognition, single speaker (Bell Labs).

1950s: 10 syllables of single speaker, Olson and Belar, (RCA Labs).

1950s: speaker-independent 10-vowel recognizer (MIT).

See (Juang and Rabiner, 1995)


History

1960s: Linear Predictive Coding (LPC), Atal and Itakura.

1969: John Pierce’s negative comments about ASR (Bell Labs).

1970s: Advanced Research Projects Agency (ARPA) funds speech understanding program. CMU’s Harpy system based on automata had reasonable accuracy for 1,000 words.


History

1980s: n-gram models. ARPA Resource Management, Wall Street Journal, and ATIS tasks. Delta/delta-delta cepstra, mel cepstra.

mid-1980s: Hidden Markov models (HMMs) become the preferred technique for speech recognition.

1990s: Discriminative training, vocal tract normalization, speaker adaptation. Very large-vocabulary speech recognition, e.g., 1M names recognizer (Bell Labs), 500,000 words North American Business News (NAB) recognizer.


History

mid-1990s: FSM library. Weighted transducers become a major component of almost all modern speech recognition and understanding systems. SVMs, kernel methods. Dictation systems: Dragon, IBM speaker-dependent system.

2000s: Broadcast News, conversational speech, e.g., Switchboard, Call Home, real-time large-vocabulary systems, unconstrained spoken-dialog systems, e.g., HMIHY.


History

[Figure: "Milestones in Speech and Multimodal Technology Research" timeline, showing milestones in speech recognition and understanding technology over the past 40 years.]

(Juang and Rabiner, 1995)


Unconstrained Spoken-Dialog Systems


Speech recognition problem

Statistical formulation

Acoustic features

This Lecture


Speech recognition problem

Statistical formulation

• Maximum likelihood and maximum a posteriori

• Statistical formulation of speech recognition

• Components of a speech recognizer

Acoustic features

This Lecture


Problem

Data: sample x₁, ..., x_m ∈ X drawn i.i.d. from a set X according to some distribution D.

Problem: find the distribution p out of a set P that best estimates D.


Maximum Likelihood

Likelihood: probability of observing the sample (x₁, ..., x_m) under distribution p ∈ P, which, given the independence assumption, is

  Pr[x₁, ..., x_m] = ∏_{i=1}^{m} p(xᵢ).

Principle: select the distribution maximizing the sample probability,

  p* = argmax_{p ∈ P} ∏_{i=1}^{m} p(xᵢ),  or  p* = argmax_{p ∈ P} ∑_{i=1}^{m} log p(xᵢ).


Example: Bernoulli Trials

Problem: find the most likely Bernoulli distribution, given a sequence of coin flips

  H, T, T, H, T, H, T, H, H, H, T, T, ..., H.

Bernoulli distribution: p(H) = θ, p(T) = 1 − θ.

Likelihood:

  l(p) = log θ^{N(H)} (1 − θ)^{N(T)} = N(H) log θ + N(T) log(1 − θ).

Solution: l is differentiable and concave;

  dl(p)/dθ = N(H)/θ − N(T)/(1 − θ) = 0 ⟺ θ = N(H) / (N(H) + N(T)).
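The closed-form estimate θ = N(H)/(N(H)+N(T)) can be checked numerically against a grid search over θ; a small illustrative sketch, using the first twelve flips of a sequence like the one above:

```python
import math

def log_likelihood(theta, flips):
    """l(p) = N(H) log(theta) + N(T) log(1 - theta)."""
    n_h = flips.count("H")
    n_t = len(flips) - n_h
    return n_h * math.log(theta) + n_t * math.log(1 - theta)

flips = list("HTTHTHTHHHTT")
theta_hat = flips.count("H") / len(flips)   # closed form: N(H)/(N(H)+N(T))

# no theta on a fine grid should beat the closed-form maximizer
grid = [k / 1000 for k in range(1, 1000)]
best = max(grid, key=lambda t: log_likelihood(t, flips))
print(theta_hat, best)   # both 0.5 for this 6-heads / 6-tails sample
```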


Example: Gaussian Distribution

Problem: find the most likely Gaussian distribution, given a sequence of real-valued observations

  3.18, 2.35, .95, 1.175, ...

Normal distribution:

  p(x) = 1/√(2πσ²) exp(−(x − µ)² / (2σ²)).

Likelihood:

  l(p) = −(m/2) log(2πσ²) − ∑_{i=1}^{m} (xᵢ − µ)² / (2σ²).

Solution: l is differentiable and concave;

  ∂l(p)/∂µ = 0 ⟺ µ = (1/m) ∑_{i=1}^{m} xᵢ,
  ∂l(p)/∂σ² = 0 ⟺ σ² = (1/m) ∑_{i=1}^{m} xᵢ² − µ².
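The resulting estimators are the sample mean and the (biased) sample variance; a small sketch checking them on the observations above (function name is illustrative):

```python
def gaussian_mle(xs):
    """ML estimates: mu = sample mean, sigma^2 = (1/m) sum x_i^2 - mu^2."""
    m = len(xs)
    mu = sum(xs) / m
    sigma2 = sum(x * x for x in xs) / m - mu * mu
    return mu, sigma2

xs = [3.18, 2.35, 0.95, 1.175]   # observations from the slide
mu, sigma2 = gaussian_mle(xs)
print(mu, sigma2)
```

Note that (1/m) ∑ xᵢ² − µ² equals (1/m) ∑ (xᵢ − µ)², the biased variance estimate.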


Properties

Problems:

• the underlying distribution may not be among those searched.

• overfitting: the number of examples is too small relative to the number of parameters.


Maximum A Posteriori (MAP)

Principle: select the most likely hypothesis h ∈ H given the sample S, assuming some prior distribution Pr[h] over the hypotheses:

  h* = argmax_{h ∈ H} Pr[h | S]
     = argmax_{h ∈ H} Pr[S | h] Pr[h] / Pr[S]
     = argmax_{h ∈ H} Pr[S | h] Pr[h].

Note: for a uniform prior, MAP coincides with maximum likelihood.
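A toy numeric illustration of the difference between the two principles, with made-up hypotheses and priors: a non-uniform prior can overturn the maximum-likelihood choice.

```python
import math

# Two hypotheses for a coin and a prior favoring "fair"; all numbers
# are invented for illustration.
theta = {"fair": 0.5, "biased": 0.8}   # Pr[H] under each hypothesis
prior = {"fair": 0.9, "biased": 0.1}   # prior Pr[h]

def log_score(h, flips, use_prior=True):
    """log Pr[S | h], plus log Pr[h] for MAP."""
    n_h = flips.count("H")
    n_t = len(flips) - n_h
    ll = n_h * math.log(theta[h]) + n_t * math.log(1 - theta[h])
    return ll + (math.log(prior[h]) if use_prior else 0.0)

S = list("HHHHT")
ml_h = max(theta, key=lambda h: log_score(h, S, use_prior=False))
map_h = max(theta, key=lambda h: log_score(h, S))
print(ml_h, map_h)   # prints: biased fair
```

Replacing the prior with 0.5/0.5 makes `map_h` equal `ml_h`, as the note above states.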


Speech recognition problem

Statistical formulation

• Maximum likelihood and maximum a posteriori

• Statistical formulation of speech recognition

• Components of a speech recognizer

Acoustic features

This Lecture


General Ideas

Probabilistic formulation: given a spoken utterance, find the most likely transcription.

Decomposition: mapping from spoken utterances to word sequences decomposed into intermediate units.

  word seq.      w1 w2 w3 w4
  phoneme seq.   p1 p2 p3 p4 p5 p6 p7 p8 p9 p10
  CD phone seq.  c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14
  observ. seq.   o1 o2 o3 o4 o5 o6 o7 o8 o9 o10 o11 o12 o13 o14 o15 o16


Statistical Formulation

Observation sequence produced by the signal processing system: o = o₁ ... o_m.

Sequence of words over alphabet Σ: w = w₁ ... w_k.

Formulation (maximum a posteriori decoding):

  ŵ = argmax_{w ∈ Σ*} Pr[w | o]
     = argmax_{w ∈ Σ*} Pr[o | w] Pr[w] / Pr[o]
     = argmax_{w ∈ Σ*} Pr[o | w] Pr[w],

where Pr[o | w] is the acoustic & pronunciation model and Pr[w] the language model.

(Bahl, Jelinek, and Mercer, 1983)


Fred Jelinek


18 November 1932 - 14 September 2010


Components

Acoustic and pronunciation model:

  Pr(o | w) = ∑_{d,c,p} Pr(o | d) Pr(d | c) Pr(c | p) Pr(p | w).

• Pr(o | d): observation seq. given distribution seq.

• Pr(d | c): distribution seq. given CD phone seq.

• Pr(c | p): CD phone seq. given phoneme seq.

• Pr(p | w): phoneme seq. given word seq.

The first factors constitute the acoustic model.

Language model: Pr(w), distribution over word sequences.


Notes

Formulation does not match the way speech recognition errors are typically measured: edit-distance between hypothesis and reference transcription.


Speech recognition problem

Statistical formulation

• Maximum likelihood and maximum a posteriori

• Statistical formulation of speech recognition

• Components of a speech recognizer

Acoustic features

This Lecture


Acoustic Observations

Discretization

• time: local spectral analysis of the speech waveform at regular intervals, t = t₁, ..., t_m, with t_{i+1} − tᵢ = 10 ms (typically).

Parameter vectors

• magnitude: o = o₁ ... o_m, with oᵢ ∈ ℝᴺ, N = 39 (typically).

Note: other perceptual information, e.g., visual information, is ignored.
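The time discretization amounts to slicing the waveform into overlapping analysis frames; a sketch, assuming an illustrative 16 kHz sampling rate and the 25 ms window length used later for spectral analysis:

```python
def frames(signal, rate=16000, win_ms=25, shift_ms=10):
    """Slice a waveform into overlapping frames: one frame every
    shift_ms, each win_ms long."""
    win = int(rate * win_ms / 1000)      # samples per window (400 here)
    shift = int(rate * shift_ms / 1000)  # samples between frame starts (160)
    return [signal[i:i + win]
            for i in range(0, len(signal) - win + 1, shift)]

one_second = [0.0] * 16000
f = frames(one_second)
print(len(f), len(f[0]))   # 98 frames of 400 samples each
```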


Acoustic Model

Three-state hidden Markov models (HMMs)

[Figure: three-state HMM transducer, with self-loop distributions d0, d1, d2 and output label ae_{b,d}.]

Distributions:

• Full covariance multivariate Gaussians:

  Pr[x] = 1/((2π)^{N/2} |Σ|^{1/2}) exp(−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ)).

• Diagonal covariance Gaussian mixtures.

• Semi-continuous, tied mixtures.

(Rabiner and Juang, 1993)
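For the diagonal-covariance case listed above, the log-density has a simple closed form; a minimal sketch (function name is illustrative):

```python
import math

def diag_gaussian_logpdf(x, mu, var):
    """log N(x; mu, diag(var)): the diagonal-covariance special case of
    the multivariate Gaussian density above."""
    n = len(x)
    log_det = sum(math.log(v) for v in var)                       # log |Sigma|
    quad = sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mu, var))
    return -0.5 * (n * math.log(2 * math.pi) + log_det + quad)

# 1-D sanity check: density at the mean of a unit-variance Gaussian
lp = diag_gaussian_logpdf([0.0], [0.0], [1.0])
print(math.exp(lp))   # 1/sqrt(2*pi)
```

Working in the log domain avoids underflow when many frame likelihoods are multiplied.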


Context-Dependent Model

Idea:

• phoneme pronunciation depends on its environment (allophones, co-articulation).

• modeling phones in context → better accuracy.

Context-dependent rules:

• Context-dependent units: ae/b_d → ae_{b,d}.

• Allophonic rules: t/V_V → dx (flapping).

• Complex contexts: regular expressions.

(Lee, 1990; Young et al., 1994)


Pronunciation Dictionary

Phonemic transcription

• Example: word data in American English.

Representation

data D ey dx ax 0.32

data D ey t ax 0.08

data D ae dx ax 0.48

data D ae t ax 0.12

[Figure: weighted transducer representation, with transitions d:ε/1.0; then ey:ε/0.4 or ae:ε/0.6; then dx:ε/0.8 or t:ε/0.2; then ax:data/1.0.]


Language Model

Definition: probabilistic model for sequences of words w = w₁ ... w_k.

• By the chain rule,

  Pr[w] = ∏_{i=1}^{k} Pr[wᵢ | w₁ ... w_{i−1}].

Modeling simplifications:

• Clustering of histories: (w₁, ..., w_{i−1}) ↦ c(w₁, ..., w_{i−1}).

• Example: nth-order Markov assumption,

  ∀i, Pr[wᵢ | w₁ ... w_{i−1}] = Pr[wᵢ | hᵢ], with |hᵢ| ≤ n − 1.
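Under a first-order Markov (bigram) assumption, the maximum-likelihood estimates are simple count ratios; an illustrative sketch on a toy corpus:

```python
from collections import Counter

def bigram_mle(sentences):
    """ML bigram estimates Pr[w_i | w_{i-1}] = c(w_{i-1} w_i) / c(w_{i-1}),
    padding each sentence with a start symbol <s>."""
    history, pair = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.split()
        for prev, cur in zip(words, words[1:]):
            history[prev] += 1
            pair[(prev, cur)] += 1
    return {bg: c / history[bg[0]] for bg, c in pair.items()}

model = bigram_mle(["the cat sat", "the cat ran"])
print(model[("the", "cat")], model[("cat", "sat")])   # 1.0 0.5
```

Unseen bigrams get zero probability under pure ML estimation, which is why smoothing (covered with n-gram models) matters.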


Recognition Cascade

Combination of components:

  observ. seq. →[HMM]→ CD phone seq. →[CD Model]→ phoneme seq. →[Pron. Model]→ word seq. →[Lang. Model]→ word seq.

Viterbi approximation:

  ŵ = argmax_w ∑_{d,c,p} Pr[o | d] Pr[d | c] Pr[c | p] Pr[p | w] Pr[w]
    ≈ argmax_w max_{d,c,p} Pr[o | d] Pr[d | c] Pr[c | p] Pr[p | w] Pr[w].
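The gap between exact MAP decoding (sum over hidden paths) and the Viterbi approximation (single best path) can be seen on a toy example with made-up path scores:

```python
# Per-path joint scores for two candidate word sequences; the numbers
# are invented purely to illustrate the approximation.
paths = {
    "w1": [0.20, 0.20, 0.20],   # several mediocre paths: sum 0.60, max 0.20
    "w2": [0.50, 0.01],         # one dominant path:      sum 0.51, max 0.50
}

exact = max(paths, key=lambda w: sum(paths[w]))    # sum over hidden paths
viterbi = max(paths, key=lambda w: max(paths[w]))  # best single path only
print(exact, viterbi)   # prints: w1 w2 -- the two criteria disagree here
```

In practice the approximation works well because one path typically dominates the sum, and maximization is far cheaper than summation during search.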


Speech Recognition Problems

Learning: how to create accurate models for each component?

Search: how to efficiently combine models and determine best transcription?

Representation: compact data structure for the computational representation of the models.


A common representation and algorithmic framework based on weighted transducers addresses all three problems (next lectures).


Speech recognition problem

Statistical formulation

Acoustic features

This Lecture


Feature Selection

Short-time Fourier analysis:

  log |∫ x(t) w(t − τ) e^{−iωt} dt|.

Idea: find a smooth approximation eliminating large variations over short frequency intervals.

[Figure: short-time (25 ms Hamming window) spectrum of /ae/; power (dB) vs. frequency (Hz).]


Cepstral Coefficients

Let x(ω) denote the Fourier transform of the signal.

Definition: the 13 cepstral coefficients are the energy and the first 12 coefficients of the expansion

  log |x(ω)| = ∑_{n=−∞}^{+∞} c_n e^{−inω}.

Other coefficients: 13 first-order (delta-cepstra) and 13 second-order (delta-delta cepstra) differentials.
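The expansion above can be computed directly: take the Fourier transform, take the log magnitude, and expand that back in Fourier coefficients. A naive O(n²) sketch for illustration only (real front-ends use an FFT, windowing, and mel filterbanks):

```python
import cmath
import math

def dft(xs):
    """Naive O(n^2) discrete Fourier transform (illustration only)."""
    n = len(xs)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(xs))
            for k in range(n)]

def cepstral_coefficients(signal, n_coeffs=13):
    """First cepstral coefficients: Fourier coefficients of log|x(omega)|."""
    n = len(signal)
    log_mag = [math.log(abs(v) + 1e-12) for v in dft(signal)]  # avoid log 0
    ceps = [sum(lm * cmath.exp(2j * math.pi * k * t / n)
                for k, lm in enumerate(log_mag)).real / n
            for t in range(n)]
    return ceps[:n_coeffs]

frame = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
c = cepstral_coefficients(frame)
print(len(c))   # 13 coefficients
```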


Mel Frequency Cepstral Coefficients

Refinement: non-linear frequency scale approximating human perception of the distance between frequencies, e.g., the mel frequency scale:

  f_mel = 2595 log₁₀(1 + f/700).

MFCCs:

• signal first transformed using the mel frequency bands.

• extraction of cepstral coefficients.

(Stevens and Volkman, 1940)
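The mel conversion and its inverse are one-liners; a sketch (the inverse is just the algebraic inversion of the formula above):

```python
import math

def hz_to_mel(f):
    """f_mel = 2595 log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(hz_to_mel(1000.0))             # close to 1000 mel by design
print(mel_to_hz(hz_to_mel(440.0)))   # round-trips to 440 Hz
```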


Other Refinements

Speaker/Channel adaptation:

• mean cepstral subtraction.

• vocal tract normalization.

• linear transformations.


References

• Bahl, L. R., Jelinek, F., and Mercer, R. (1983). A Maximum Likelihood Approach to Continuous Speech Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 5(2):179-190.

• Juang, B.-H. and Rabiner, L. R. Automatic Speech Recognition - A Brief History of the Technology. Elsevier Encyclopedia of Language and Linguistics, Second Edition, 2005.

• Jelinek, F. Statistical Methods for Speech Recognition. MIT Press, Cambridge, MA, 1998.

• Lee, K.-F. Context-Dependent Phonetic Hidden Markov Models for Continuous Speech Recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(4):599-609, 1990.

• Rabiner, L. and Juang, B.-H. Fundamentals of Speech Recognition. Prentice Hall, 1993.


References

• Stevens, S. S. and Volkman, J. The relation of pitch to frequency. American Journal of Psychology, 53:329, 1940.

• Young, S., Odell, J., and Woodland, P. Tree-Based State-Tying for High Accuracy Acoustic Modelling. In Proceedings of the ARPA Human Language Technology Workshop, Morgan Kaufmann, San Francisco, 1994.
