
Speech Recognition
Lecture 1: Introduction

Mehryar Mohri
Courant Institute and Google Research

[email protected]

Mehryar Mohri - Speech Recognition, Courant Institute, NYU

Logistics

Prerequisites: basics in analysis of algorithms and probability. No specific knowledge of signal processing is required.

Workload: 2-3 homework assignments, 1 project (your choice).

Textbooks: no single textbook covers the material presented in this course. Lecture slides are available electronically.


Objectives

Computer science view of automatic speech recognition (ASR) (no signal processing).

Essential algorithms for large-vocabulary speech recognition.

But the emphasis is on general algorithms:

• automata and transducer algorithms.

• statistical learning algorithms.


Topics

introduction, formulation, components, features.

weighted transducer software library.

weighted automata algorithms.

statistical language modeling software library.

ngram models.

maximum entropy models.

pronunciation models, decision trees, context-dependent models.


Topics

search algorithms, transducer optimizations, Viterbi decoder.

search algorithms, N-best algorithms, lattice generation, rescoring.

structured prediction algorithms.

adaptation.

active learning.

semi-supervised learning.


Speech recognition problem

Statistical formulation

Acoustic features

This Lecture


Speech Recognition Problem

Definition: find accurate written transcription of spoken utterances.

• transcriptions may be in words, phonemes, syllables, or other units.

Accuracy: typically measured in terms of the edit-distance between reference transcription and sequence output by the model.
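Since accuracy is measured by edit-distance, the standard word error rate (WER) metric follows directly. A minimal sketch in Python (function names are illustrative; whitespace tokenization is assumed):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences: the minimum
    number of substitutions, insertions, and deletions."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))  # distances for the previous row of the DP table
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution / match
        prev = cur
    return prev[n]

def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

With one deleted word out of six reference words, the WER here is 1/6.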


Other Related Problems

Speaker verification.

Speaker identification.

Spoken-dialog systems.

Detection of voice features, e.g., gender, age, dialect, emotion, height, weight!

Speech synthesis.


Speech Spectrogram


Speech Recognition Is Difficult

Highly variable: the same words pronounced by the same person in the same conditions typically lead to different waveforms.

• source variation: speaking rate, volume, accent, dialect, pitch, coarticulation.

• channel variation: microphone (type, position), noise (background, distortion).

Key problem: robustness to such variations.


ASR Characteristics

Vocabulary size: small (digit recognition, 10 words), medium (Resource Management, 1,000), large (Broadcast News, 100,000), very large (1M+).

Speaker-dependent or speaker-independent.

Domain-specific or unconstrained, e.g., travel reservation, modern spoken-dialog systems.

Isolated (pause between units) or continuous.

Read or spontaneous, e.g., dictation, news broadcast, conversational speech.


Example - Broadcast News


History

1922: Radio Rex, toy, single-word recognizer (rex).

1939: voder and vocoder (mechanical synthesizer), Dudley (Bell Labs).

1952: isolated digit recognition, single speaker (Bell Labs).

1950s: 10 syllables of single speaker, Olson and Belar, (RCA Labs).

1950s: speaker-independent 10-vowel recognizer (MIT).

See (Juang and Rabiner, 1995)


History

1960s: Linear Predictive Coding (LPC), Atal and Itakura.

1969: John Pierce’s negative comments about ASR (Bell Labs).

1970s: Advanced Research Projects Agency (ARPA) funds speech understanding program. CMU’s Harpy system based on automata had reasonable accuracy for 1,000 words.


History

1980s: n-gram models. ARPA Resource Management, Wall Street Journal, and ATIS tasks. Delta/delta-delta cepstra, mel cepstra.

mid-1980s: Hidden Markov models (HMMs) become the preferred technique for speech recognition.

1990s: Discriminative training, vocal tract normalization, speaker adaptation. Very large-vocabulary speech recognition, e.g., 1M names recognizer (Bell Labs), 500,000 words North American Business News (NAB) recognizer.


History

mid-1990s: FSM library. Weighted transducers become a major component of almost all modern speech recognition and understanding systems. SVMs, kernel methods. Dictation systems: Dragon, IBM speaker-dependent system.

2000s: Broadcast News, conversational speech, e.g., Switchboard, Call Home, real-time large-vocabulary systems, unconstrained spoken-dialog systems, e.g., HMIHY.


History

[Figure: "Milestones in Speech and Multimodal Technology Research" timeline, showing milestones in speech recognition and understanding technology over the past 40 years.]

(Juang and Rabiner, 1995)


Unconstrained Spoken-Dialog Systems


Speech recognition problem

Statistical formulation

Acoustic features

This Lecture


Speech recognition problem

Statistical formulation

• Maximum likelihood and maximum a posteriori

• Statistical formulation of speech recognition

• Components of a speech recognizer

Acoustic features

This Lecture


Problem

Data: sample x₁, ..., x_m ∈ X drawn i.i.d. from a set X according to some distribution D.

Problem: find the distribution p out of a set P that best estimates D.


Maximum Likelihood

Likelihood: probability of observing the sample (x₁, ..., x_m) under distribution p ∈ P, which, given the independence assumption, is

  Pr[x₁, ..., x_m] = ∏_{i=1}^{m} p(xᵢ).

Principle: select the distribution maximizing the sample probability,

  p* = argmax_{p ∈ P} ∏_{i=1}^{m} p(xᵢ),  or  p* = argmax_{p ∈ P} ∑_{i=1}^{m} log p(xᵢ).


Example: Bernoulli Trials

Problem: find the most likely Bernoulli distribution, given a sequence of coin flips

  H, T, T, H, T, H, T, H, H, H, T, T, ..., H.

Bernoulli distribution: p(H) = θ, p(T) = 1 − θ.

Likelihood:

  l(p) = log θ^{N(H)} (1 − θ)^{N(T)} = N(H) log θ + N(T) log(1 − θ).

Solution: l is differentiable and concave;

  dl(p)/dθ = N(H)/θ − N(T)/(1 − θ) = 0 ⟺ θ = N(H) / (N(H) + N(T)).
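The closed-form estimate θ = N(H)/(N(H)+N(T)) can be checked numerically against a grid search over θ; a small illustrative sketch, using the first twelve flips of a sequence like the one above:

```python
import math

def log_likelihood(theta, flips):
    """l(p) = N(H) log(theta) + N(T) log(1 - theta)."""
    n_h = flips.count("H")
    n_t = len(flips) - n_h
    return n_h * math.log(theta) + n_t * math.log(1 - theta)

flips = list("HTTHTHTHHHTT")
theta_hat = flips.count("H") / len(flips)   # closed form: N(H)/(N(H)+N(T))

# no theta on a fine grid should beat the closed-form maximizer
grid = [k / 1000 for k in range(1, 1000)]
best = max(grid, key=lambda t: log_likelihood(t, flips))
print(theta_hat, best)   # both 0.5 for this 6-heads / 6-tails sample
```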


Example: Gaussian Distribution

Problem: find the most likely Gaussian distribution, given a sequence of real-valued observations

  3.18, 2.35, .95, 1.175, ...

Normal distribution:

  p(x) = 1/√(2πσ²) exp(−(x − µ)² / (2σ²)).

Likelihood:

  l(p) = −(m/2) log(2πσ²) − ∑_{i=1}^{m} (xᵢ − µ)² / (2σ²).

Solution: l is differentiable and concave;

  ∂l(p)/∂µ = 0 ⟺ µ = (1/m) ∑_{i=1}^{m} xᵢ,
  ∂l(p)/∂σ² = 0 ⟺ σ² = (1/m) ∑_{i=1}^{m} xᵢ² − µ².
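The resulting estimators are the sample mean and the (biased) sample variance; a small sketch checking them on the observations above (function name is illustrative):

```python
def gaussian_mle(xs):
    """ML estimates: mu = sample mean, sigma^2 = (1/m) sum x_i^2 - mu^2."""
    m = len(xs)
    mu = sum(xs) / m
    sigma2 = sum(x * x for x in xs) / m - mu * mu
    return mu, sigma2

xs = [3.18, 2.35, 0.95, 1.175]   # observations from the slide
mu, sigma2 = gaussian_mle(xs)
print(mu, sigma2)
```

Note that (1/m) ∑ xᵢ² − µ² equals (1/m) ∑ (xᵢ − µ)², the biased variance estimate.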


Properties

Problems:

• the underlying distribution may not be among those searched.

• overfitting: the number of examples is too small relative to the number of parameters.


Maximum A Posteriori (MAP)

Principle: select the most likely hypothesis h ∈ H given the sample S, assuming some prior distribution Pr[h] over the hypotheses:

  h* = argmax_{h ∈ H} Pr[h | S]
     = argmax_{h ∈ H} Pr[S | h] Pr[h] / Pr[S]
     = argmax_{h ∈ H} Pr[S | h] Pr[h].

Note: for a uniform prior, MAP coincides with maximum likelihood.
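A toy numeric illustration of the difference between the two principles, with made-up hypotheses and priors: a non-uniform prior can overturn the maximum-likelihood choice.

```python
import math

# Two hypotheses for a coin and a prior favoring "fair"; all numbers
# are invented for illustration.
theta = {"fair": 0.5, "biased": 0.8}   # Pr[H] under each hypothesis
prior = {"fair": 0.9, "biased": 0.1}   # prior Pr[h]

def log_score(h, flips, use_prior=True):
    """log Pr[S | h], plus log Pr[h] for MAP."""
    n_h = flips.count("H")
    n_t = len(flips) - n_h
    ll = n_h * math.log(theta[h]) + n_t * math.log(1 - theta[h])
    return ll + (math.log(prior[h]) if use_prior else 0.0)

S = list("HHHHT")
ml_h = max(theta, key=lambda h: log_score(h, S, use_prior=False))
map_h = max(theta, key=lambda h: log_score(h, S))
print(ml_h, map_h)   # prints: biased fair
```

Replacing the prior with 0.5/0.5 makes `map_h` equal `ml_h`, as the note above states.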


Speech recognition problem

Statistical formulation

• Maximum likelihood and maximum a posteriori

• Statistical formulation of speech recognition

• Components of a speech recognizer

Acoustic features

This Lecture


General Ideas

Probabilistic formulation: given a spoken utterance, find the most likely transcription.

Decomposition: mapping from spoken utterances to word sequences decomposed into intermediate units.

  word seq.      w1 w2 w3 w4
  phoneme seq.   p1 p2 p3 p4 p5 p6 p7 p8 p9 p10
  CD phone seq.  c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14
  observ. seq.   o1 o2 o3 o4 o5 o6 o7 o8 o9 o10 o11 o12 o13 o14 o15 o16


Statistical Formulation

Observation sequence produced by the signal processing system: o = o₁ ... o_m.

Sequence of words over alphabet Σ: w = w₁ ... w_k.

Formulation (maximum a posteriori decoding):

  ŵ = argmax_{w ∈ Σ*} Pr[w | o]
     = argmax_{w ∈ Σ*} Pr[o | w] Pr[w] / Pr[o]
     = argmax_{w ∈ Σ*} Pr[o | w] Pr[w],

where Pr[o | w] is the acoustic & pronunciation model and Pr[w] the language model.

(Bahl, Jelinek, and Mercer, 1983)


Fred Jelinek


18 November 1932 - 14 September 2010


Components

Acoustic and pronunciation model:

  Pr(o | w) = ∑_{d,c,p} Pr(o | d) Pr(d | c) Pr(c | p) Pr(p | w).

• Pr(o | d): observation seq. given distribution seq.

• Pr(d | c): distribution seq. given CD phone seq.

• Pr(c | p): CD phone seq. given phoneme seq.

• Pr(p | w): phoneme seq. given word seq.

The first factors constitute the acoustic model.

Language model: Pr(w), distribution over word sequences.


Notes

Formulation does not match the way speech recognition errors are typically measured: edit-distance between hypothesis and reference transcription.


Speech recognition problem

Statistical formulation

• Maximum likelihood and maximum a posteriori

• Statistical formulation of speech recognition

• Components of a speech recognizer

Acoustic features

This Lecture


Acoustic Observations

Discretization

• time: local spectral analysis of the speech waveform at regular intervals, t = t₁, ..., t_m, with t_{i+1} − tᵢ = 10 ms (typically).

Parameter vectors

• magnitude: o = o₁ ... o_m, with oᵢ ∈ ℝᴺ, N = 39 (typically).

Note: other perceptual information, e.g., visual information, is ignored.
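The time discretization amounts to slicing the waveform into overlapping analysis frames; a sketch, assuming an illustrative 16 kHz sampling rate and the 25 ms window length used later for spectral analysis:

```python
def frames(signal, rate=16000, win_ms=25, shift_ms=10):
    """Slice a waveform into overlapping frames: one frame every
    shift_ms, each win_ms long."""
    win = int(rate * win_ms / 1000)      # samples per window (400 here)
    shift = int(rate * shift_ms / 1000)  # samples between frame starts (160)
    return [signal[i:i + win]
            for i in range(0, len(signal) - win + 1, shift)]

one_second = [0.0] * 16000
f = frames(one_second)
print(len(f), len(f[0]))   # 98 frames of 400 samples each
```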


Acoustic Model

Three-state hidden Markov models (HMMs)

[Figure: three-state HMM transducer, with self-loop distributions d0, d1, d2 and output label ae_{b,d}.]

Distributions:

• Full covariance multivariate Gaussians:

  Pr[x] = 1/((2π)^{N/2} |Σ|^{1/2}) exp(−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ)).

• Diagonal covariance Gaussian mixtures.

• Semi-continuous, tied mixtures.

(Rabiner and Juang, 1993)
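For the diagonal-covariance case listed above, the log-density has a simple closed form; a minimal sketch (function name is illustrative):

```python
import math

def diag_gaussian_logpdf(x, mu, var):
    """log N(x; mu, diag(var)): the diagonal-covariance special case of
    the multivariate Gaussian density above."""
    n = len(x)
    log_det = sum(math.log(v) for v in var)                       # log |Sigma|
    quad = sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mu, var))
    return -0.5 * (n * math.log(2 * math.pi) + log_det + quad)

# 1-D sanity check: density at the mean of a unit-variance Gaussian
lp = diag_gaussian_logpdf([0.0], [0.0], [1.0])
print(math.exp(lp))   # 1/sqrt(2*pi)
```

Working in the log domain avoids underflow when many frame likelihoods are multiplied.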


Context-Dependent Model

Idea:

• phoneme pronunciation depends on its environment (allophones, co-articulation).

• modeling phones in context → better accuracy.

Context-dependent rules:

• Context-dependent units: ae/b_d → ae_{b,d}.

• Allophonic rules: t/V_V → dx (flapping).

• Complex contexts: regular expressions.

(Lee, 1990; Young et al., 1994)


Pronunciation Dictionary

Phonemic transcription

• Example: word data in American English.

Representation

data D ey dx ax 0.32

data D ey t ax 0.08

data D ae dx ax 0.48

data D ae t ax 0.12

[Figure: weighted transducer representation, with transitions d:ε/1.0; then ey:ε/0.4 or ae:ε/0.6; then dx:ε/0.8 or t:ε/0.2; then ax:data/1.0.]


Language Model

Definition: probabilistic model for sequences of words w = w₁ ... w_k.

• By the chain rule,

  Pr[w] = ∏_{i=1}^{k} Pr[wᵢ | w₁ ... w_{i−1}].

Modeling simplifications:

• Clustering of histories: (w₁, ..., w_{i−1}) ↦ c(w₁, ..., w_{i−1}).

• Example: nth-order Markov assumption,

  ∀i, Pr[wᵢ | w₁ ... w_{i−1}] = Pr[wᵢ | hᵢ], with |hᵢ| ≤ n − 1.
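Under a first-order Markov (bigram) assumption, the maximum-likelihood estimates are simple count ratios; an illustrative sketch on a toy corpus:

```python
from collections import Counter

def bigram_mle(sentences):
    """ML bigram estimates Pr[w_i | w_{i-1}] = c(w_{i-1} w_i) / c(w_{i-1}),
    padding each sentence with a start symbol <s>."""
    history, pair = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.split()
        for prev, cur in zip(words, words[1:]):
            history[prev] += 1
            pair[(prev, cur)] += 1
    return {bg: c / history[bg[0]] for bg, c in pair.items()}

model = bigram_mle(["the cat sat", "the cat ran"])
print(model[("the", "cat")], model[("cat", "sat")])   # 1.0 0.5
```

Unseen bigrams get zero probability under pure ML estimation, which is why smoothing (covered with n-gram models) matters.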


Recognition Cascade

Combination of components:

  observ. seq. →[HMM]→ CD phone seq. →[CD Model]→ phoneme seq. →[Pron. Model]→ word seq. →[Lang. Model]→ word seq.

Viterbi approximation:

  ŵ = argmax_w ∑_{d,c,p} Pr[o | d] Pr[d | c] Pr[c | p] Pr[p | w] Pr[w]
    ≈ argmax_w max_{d,c,p} Pr[o | d] Pr[d | c] Pr[c | p] Pr[p | w] Pr[w].
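The gap between exact MAP decoding (sum over hidden paths) and the Viterbi approximation (single best path) can be seen on a toy example with made-up path scores:

```python
# Per-path joint scores for two candidate word sequences; the numbers
# are invented purely to illustrate the approximation.
paths = {
    "w1": [0.20, 0.20, 0.20],   # several mediocre paths: sum 0.60, max 0.20
    "w2": [0.50, 0.01],         # one dominant path:      sum 0.51, max 0.50
}

exact = max(paths, key=lambda w: sum(paths[w]))    # sum over hidden paths
viterbi = max(paths, key=lambda w: max(paths[w]))  # best single path only
print(exact, viterbi)   # prints: w1 w2 -- the two criteria disagree here
```

In practice the approximation works well because one path typically dominates the sum, and maximization is far cheaper than summation during search.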


Speech Recognition Problems

Learning: how to create accurate models for each component?

Search: how to efficiently combine models and determine best transcription?

Representation: compact data structure for the computational representation of the models.


A common representation and algorithmic framework based on weighted transducers addresses all three problems (next lectures).


Speech recognition problem

Statistical formulation

Acoustic features

This Lecture


Feature Selection

Short-time Fourier analysis:

  log |∫ x(t) w(t − τ) e^{−iωt} dt|.

Idea: find a smooth approximation eliminating large variations over short frequency intervals.

[Figure: short-time (25 ms Hamming window) spectrum of /ae/; power (dB) vs. frequency (Hz).]


Cepstral Coefficients

Let x(ω) denote the Fourier transform of the signal.

Definition: the 13 cepstral coefficients are the energy and the first 12 coefficients of the expansion

  log |x(ω)| = ∑_{n=−∞}^{+∞} c_n e^{−inω}.

Other coefficients: 13 first-order (delta-cepstra) and 13 second-order (delta-delta cepstra) differentials.
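The expansion above can be computed directly: take the Fourier transform, take the log magnitude, and expand that back in Fourier coefficients. A naive O(n²) sketch for illustration only (real front-ends use an FFT, windowing, and mel filterbanks):

```python
import cmath
import math

def dft(xs):
    """Naive O(n^2) discrete Fourier transform (illustration only)."""
    n = len(xs)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(xs))
            for k in range(n)]

def cepstral_coefficients(signal, n_coeffs=13):
    """First cepstral coefficients: Fourier coefficients of log|x(omega)|."""
    n = len(signal)
    log_mag = [math.log(abs(v) + 1e-12) for v in dft(signal)]  # avoid log 0
    ceps = [sum(lm * cmath.exp(2j * math.pi * k * t / n)
                for k, lm in enumerate(log_mag)).real / n
            for t in range(n)]
    return ceps[:n_coeffs]

frame = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
c = cepstral_coefficients(frame)
print(len(c))   # 13 coefficients
```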


Mel Frequency Cepstral Coefficients

Refinement: non-linear frequency scale approximating human perception of the distance between frequencies, e.g., the mel frequency scale:

  f_mel = 2595 log₁₀(1 + f/700).

MFCCs:

• signal first transformed using the mel frequency bands.

• extraction of cepstral coefficients.

(Stevens and Volkman, 1940)
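The mel conversion and its inverse are one-liners; a sketch (the inverse is just the algebraic inversion of the formula above):

```python
import math

def hz_to_mel(f):
    """f_mel = 2595 log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(hz_to_mel(1000.0))             # close to 1000 mel by design
print(mel_to_hz(hz_to_mel(440.0)))   # round-trips to 440 Hz
```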


Other Refinements

Speaker/Channel adaptation:

• mean cepstral subtraction.

• vocal tract normalization.

• linear transformations.


References

• Bahl, L. R., Jelinek, F., and Mercer, R. (1983). A Maximum Likelihood Approach to Continuous Speech Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 5(2):179-190.

• Juang, B.-H. and Rabiner, L. R. Automatic Speech Recognition - A Brief History of the Technology. Elsevier Encyclopedia of Language and Linguistics, Second Edition, 2005.

• Jelinek, F. Statistical Methods for Speech Recognition. MIT Press, Cambridge, MA, 1998.

• Lee, K.-F. Context-Dependent Phonetic Hidden Markov Models for Continuous Speech Recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(4):599-609, 1990.

• Rabiner, L. and Juang, B.-H. Fundamentals of Speech Recognition. Prentice Hall, 1993.


References

• Stevens, S. S. and Volkman, J. The relation of pitch to frequency. American Journal of Psychology, 53:329, 1940.

• Young, S., Odell, J., and Woodland, P. Tree-Based State-Tying for High Accuracy Acoustic Modelling. In Proceedings of the ARPA Human Language Technology Workshop, Morgan Kaufmann, San Francisco, 1994.
