
Page 1

Speech Recognition
Lecture 1: Introduction

Mehryar Mohri
Courant Institute of Mathematical Sciences

[email protected]

Page 2

Logistics

Prerequisites: basics of analysis of algorithms and probability. No specific knowledge of signal processing is required.

Workload: 3 homework assignments, 1 project (your choice).

Textbooks: no single textbook covers the material presented in this course. Lecture slides are available electronically.

Page 3

Objectives

A computer science view of automatic speech recognition (ASR); no signal processing.

Essential algorithms for large-vocabulary speech recognition.

But the emphasis is on general algorithms:

• automata and transducer algorithms.

• statistical learning algorithms.

Page 4

Topics

introduction, formulation, components, features.

weighted transducer software library.

weighted automata algorithms.

statistical language modeling software library.

n-gram models.

maximum entropy models.

pronunciation models, decision trees, context-dependent models.

Page 5

Topics

search algorithms, transducer optimizations, Viterbi decoder.

search algorithms, N-best algorithms, lattice generation, rescoring.

structured prediction algorithms.

adaptation.

active learning.

semi-supervised learning.

Page 6

This Lecture

Speech recognition problem

Statistical formulation

Acoustic features

Page 7

Speech Recognition Problem

Definition: find an accurate written transcription of spoken utterances.

• transcriptions may be in words, phonemes, syllables, or other units.

Accuracy: typically measured by the edit-distance between the reference transcription and the sequence output by the recognizer.
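Since accuracy is reported in terms of edit-distance, a minimal word-level edit-distance (word error rate) computation is sketched below; the function name and the example sentences are illustrative, not from the lecture.

```python
# Minimal sketch: word-level edit distance (substitutions, insertions, deletions)
# between a reference transcription and a recognizer hypothesis.
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[m][n]

ref = "the cat sat on the mat".split()
hyp = "the cat sat on mat".split()
print(edit_distance(ref, hyp) / len(ref))  # word error rate, here 1/6
```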

Page 8

Other Related Problems

Speaker verification.

Speaker identification.

Spoken-dialog systems.

Detection of voice features, e.g., gender, age, dialect, emotion, height, weight!

Speech synthesis.

Page 9

Speech Spectrogram

Page 10

Speech Recognition Is Difficult

Highly variable: the same words pronounced by the same person in the same conditions typically lead to different waveforms.

• source variation: speaking rate, volume, accent, dialect, pitch, coarticulation.

• channel variation: microphone (type, position), noise (background, distortion).

Key problem: robustness to such variations.

Page 11

ASR Characteristics

Vocabulary size: small (digit recognition, ~10 words), medium (Resource Management, ~1,000 words), large (Broadcast News, ~100,000 words), very large (1M+ words).

Speaker-dependent or speaker-independent.

Domain-specific or unconstrained, e.g., travel reservation, modern spoken-dialog systems.

Isolated (pause between units) or continuous.

Read or spontaneous, e.g., dictation, news broadcast, conversational speech.

Page 12

Example - Broadcast News

Page 13

History

1922: Radio Rex, a toy single-word recognizer ("Rex").

1939: Voder and Vocoder (mechanical synthesizer), Dudley (Bell Labs).

1952: isolated digit recognition, single speaker (Bell Labs).

1950s: recognition of 10 syllables from a single speaker, Olson and Belar (RCA Labs).

1950s: speaker-independent 10-vowel recognizer (MIT).

See (Juang and Rabiner, 1995)

Page 14

History

1960s: Linear Predictive Coding (LPC), Atal and Itakura.

1969: John Pierce’s negative comments about ASR (Bell Labs).

1970s: Advanced Research Projects Agency (ARPA) funds a speech understanding program. CMU's Harpy system, based on finite automata, achieved reasonable accuracy on a 1,000-word vocabulary.

Page 15

History

1980s: n-gram models. ARPA Resource Management, Wall Street Journal, and ATIS tasks. Delta/delta-delta cepstra, mel cepstra.

mid-1980s: Hidden Markov models (HMMs) become the preferred technique for speech recognition.

1990s: Discriminative training, vocal tract normalization, speaker adaptation. Very large-vocabulary speech recognition, e.g., 1M names recognizer (Bell Labs), 500,000 words North American Business News (NAB) recognizer.

Page 16

History

mid-1990s: FSM library. Weighted transducers become a major component of almost all modern speech recognition and understanding systems. SVMs, kernel methods. Dictation systems: Dragon, IBM speaker-dependent systems.

2000s: Broadcast News; conversational speech, e.g., Switchboard, CallHome; real-time large-vocabulary systems; unconstrained spoken-dialog systems, e.g., HMIHY.

Page 17

History

[Figure: Milestones in Speech Recognition and Understanding Technology (from Juang and Rabiner), tracing the progression from small-vocabulary, isolated-word systems to very-large-vocabulary, multimodal spoken-dialog systems.]

(Juang and Rabiner, 1995)

Page 18

This Lecture

Speech recognition problem

Statistical formulation

Acoustic features

Page 19

This Lecture

Speech recognition problem

Statistical formulation

• Maximum likelihood and maximum a posteriori

• Statistical formulation of speech recognition

• Components of a speech recognizer

Acoustic features

Page 20

Problem

Data: sample $x_1, \ldots, x_m \in X$ drawn i.i.d. from a set $X$ according to some distribution $D$.

Problem: find a distribution $p$ out of a set $P$ that best estimates $D$.

Page 21

Maximum Likelihood

Likelihood: probability of observing the sample under distribution $p \in P$, which, given the independence assumption, is
$$\Pr[x_1, \ldots, x_m] = \prod_{i=1}^{m} p(x_i).$$

Principle: select the distribution maximizing the sample probability,
$$p^\star = \operatorname*{argmax}_{p \in P} \prod_{i=1}^{m} p(x_i), \qquad \text{or} \qquad p^\star = \operatorname*{argmax}_{p \in P} \sum_{i=1}^{m} \log p(x_i).$$

Page 22

Relative Entropy Formulation

Empirical distribution: the distribution $\hat{p}$ that assigns to each point the frequency of its occurrence in the sample.

Lemma: $p^\star$ has maximum likelihood iff $p^\star = \operatorname*{argmin}_{p \in P} D(\hat{p} \parallel p)$.

Proof: with $N_0$ the size of the sample and $l(p)$ the log-likelihood,
$$
\begin{aligned}
D(\hat{p} \parallel p) &= \sum_{z_i \text{ observed}} \hat{p}(z_i) \log \hat{p}(z_i) \;-\; \sum_{z_i \text{ observed}} \hat{p}(z_i) \log p(z_i) \\
&= -H(\hat{p}) - \sum_{z_i} \frac{\mathrm{count}(z_i)}{N_0} \log p(z_i) \\
&= -H(\hat{p}) - \frac{1}{N_0} \log \prod_{z_i} p(z_i)^{\mathrm{count}(z_i)} \\
&= -H(\hat{p}) - \frac{1}{N_0}\, l(p).
\end{aligned}
$$
Since $H(\hat{p})$ does not depend on $p$, minimizing $D(\hat{p} \parallel p)$ over $p \in P$ is equivalent to maximizing $l(p)$.

Page 23

Example: Bernoulli Trials

Problem: find the most likely Bernoulli distribution, given a sequence of coin flips
$$H, T, T, H, T, H, T, H, H, H, T, T, \ldots, H.$$

Bernoulli distribution: $p(H) = \theta$, $p(T) = 1 - \theta$.

Likelihood:
$$l(p) = \log \theta^{N(H)} (1 - \theta)^{N(T)} = N(H) \log \theta + N(T) \log(1 - \theta).$$

Solution: $l$ is differentiable and concave;
$$\frac{dl(p)}{d\theta} = \frac{N(H)}{\theta} - \frac{N(T)}{1 - \theta} = 0 \;\Longleftrightarrow\; \theta = \frac{N(H)}{N(H) + N(T)}.$$

Page 24

Example: Gaussian Distribution

Problem: find the most likely Gaussian distribution, given a sequence of real-valued observations
$$3.18, 2.35, .95, 1.175, \ldots$$

Normal distribution:
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).$$

Likelihood:
$$l(p) = -\frac{1}{2}\, m \log(2\pi\sigma^2) - \sum_{i=1}^{m} \frac{(x_i - \mu)^2}{2\sigma^2}.$$

Solution: $l$ is differentiable and concave;
$$\frac{\partial l(p)}{\partial \mu} = 0 \;\Longleftrightarrow\; \mu = \frac{1}{m} \sum_{i=1}^{m} x_i,
\qquad
\frac{\partial l(p)}{\partial \sigma^2} = 0 \;\Longleftrightarrow\; \sigma^2 = \frac{1}{m} \sum_{i=1}^{m} x_i^2 - \mu^2.$$
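The closed-form Gaussian estimates can be checked numerically; a sketch using the (illustrative) observations from the slide.

```python
# Maximum-likelihood estimates for a univariate Gaussian:
# mu_hat = sample mean, sigma2_hat = E[x^2] - mu_hat^2 (the biased variance).
xs = [3.18, 2.35, 0.95, 1.175]        # illustrative observations
m = len(xs)
mu_hat = sum(xs) / m
sigma2_hat = sum(x * x for x in xs) / m - mu_hat ** 2
print(mu_hat, sigma2_hat)
```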

Page 25

Properties

Problems:

• the underlying distribution may not be among those searched.

• overfitting: the number of examples is too small relative to the number of parameters.

Page 26

Maximum A Posteriori (MAP)

Principle: select the most likely hypothesis $h \in H$ given the sample $S$, with some prior distribution $\Pr[h]$ over the hypotheses:
$$
h^\star = \operatorname*{argmax}_{h \in H} \Pr[h \mid S]
= \operatorname*{argmax}_{h \in H} \frac{\Pr[S \mid h]\, \Pr[h]}{\Pr[S]}
= \operatorname*{argmax}_{h \in H} \Pr[S \mid h]\, \Pr[h].
$$

Note: for a uniform prior, MAP coincides with maximum likelihood.
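As an illustration of MAP versus maximum likelihood (not from the slides): with a Beta(α, β) prior over the Bernoulli parameter, the MAP estimate shifts the ML estimate toward the prior. The formula below is the standard Beta–Bernoulli posterior mode, assuming α, β > 1.

```python
# MAP estimate of a coin bias theta with a Beta(alpha, beta) prior:
# posterior mode = (N(H) + alpha - 1) / (N(H) + N(T) + alpha + beta - 2).
def map_theta(n_heads, n_tails, alpha=2.0, beta=2.0):
    return (n_heads + alpha - 1) / (n_heads + n_tails + alpha + beta - 2)

print(map_theta(6, 6))   # 0.5: with a symmetric prior, same as ML here
print(map_theta(3, 0))   # 0.8: pulled below the ML estimate of 1.0
# With alpha = beta = 1 (uniform prior), MAP reduces to maximum likelihood.
```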

Page 27

This Lecture

Speech recognition problem

Statistical formulation

• Maximum likelihood and maximum a posteriori

• Statistical formulation of speech recognition

• Components of a speech recognizer

Acoustic features

Page 28

General Ideas

Probabilistic formulation: given a spoken utterance, find the most likely transcription.

Decomposition: the mapping from spoken utterances to word sequences is decomposed into intermediate units.

[Diagram: decomposition of one utterance]

    word seq.       w1 w2 w3 w4
    phoneme seq.    p1 p2 p3 p4 p5 p6 p7 p8 p9 p10
    CD phone seq.   c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14
    observ. seq.    o1 o2 o3 o4 o5 o6 o7 o8 o9 o10 o11 o12 o13 o14 o15 o16

Page 29

Statistical Formulation

Observation sequence produced by the signal processing system: $o = o_1 \ldots o_m$.

Sequence of words over the alphabet $\Sigma$: $w = w_1 \ldots w_k$.

Formulation (maximum a posteriori decoding):
$$
\hat{w} = \operatorname*{argmax}_{w \in \Sigma^*} \Pr[w \mid o]
= \operatorname*{argmax}_{w \in \Sigma^*} \frac{\Pr[o \mid w]\, \Pr[w]}{\Pr[o]}
= \operatorname*{argmax}_{w \in \Sigma^*} \underbrace{\Pr[o \mid w]}_{\text{acoustic \& pronunciation model}} \; \underbrace{\Pr[w]}_{\text{language model}}.
$$

(Bahl, Jelinek, and Mercer, 1983)
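A toy sketch of maximum a posteriori decoding over an explicit candidate list, combining acoustic and language-model scores in log space; the candidate sentences and their scores are made up for illustration.

```python
import math

# Toy MAP decoding: pick w maximizing log Pr[o | w] + log Pr[w]
# over an explicit (tiny) list of candidate transcriptions.
candidates = {
    "recognize speech":   {"acoustic": -12.1, "lm": -4.2},
    "wreck a nice beach": {"acoustic": -11.8, "lm": -7.9},
}

def score(c):
    s = candidates[c]
    return s["acoustic"] + s["lm"]    # log Pr[o | w] + log Pr[w]

best = max(candidates, key=score)
print(best, math.exp(score(best)))    # most likely transcription and its joint probability
```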

Page 30

Components

Acoustic and pronunciation model:
$$\Pr[o \mid w] = \sum_{d, c, p} \Pr[o \mid d]\, \Pr[d \mid c]\, \Pr[c \mid p]\, \Pr[p \mid w].$$

• $\Pr[o \mid d]$: observation seq. given distribution seq.

• $\Pr[d \mid c]$: distribution seq. given CD phone seq.

• $\Pr[c \mid p]$: CD phone seq. given phoneme seq.

• $\Pr[p \mid w]$: phoneme seq. given word seq.

Language model: $\Pr[w]$, a distribution over word sequences.

Page 31

Notes

The formulation does not match the way speech recognition errors are typically measured: the edit-distance between the hypothesis and the reference transcription.

Page 32

This Lecture

Speech recognition problem

Statistical formulation

• Maximum likelihood and maximum a posteriori

• Statistical formulation of speech recognition

• Components of a speech recognizer

Acoustic features

Page 33

Acoustic Observations

Discretization

• time: local spectral analysis of the speech waveform at regular intervals,
$$t = t_1, \ldots, t_m, \qquad t_{i+1} - t_i = 10\ \text{ms (typically)}.$$

Parameter vectors

• magnitude: $o = o_1 \ldots o_m$, with $o_i \in \mathbb{R}^N$, $N = 39$ (typically).

• Note: other perceptual information, e.g., visual information, is ignored.
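A sketch of the discretization step: analysis frames taken every 10 ms over a waveform, each frame later mapped to a feature vector in R^39. The sampling rate, window length, and synthetic signal are typical assumed values, not fixed by the lecture.

```python
import numpy as np

# Discretization sketch: 25 ms analysis windows every 10 ms over a waveform,
# each window eventually yielding one feature vector o_i in R^39.
sample_rate = 16000                       # assumed sampling rate (Hz)
hop, win = int(0.010 * sample_rate), int(0.025 * sample_rate)
signal = np.random.randn(sample_rate)     # 1 second of fake audio

starts = range(0, len(signal) - win + 1, hop)
frames = np.stack([signal[s:s + win] for s in starts])
print(frames.shape)                       # (number of frames m, samples per frame)
```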

Page 34

Acoustic Model

Three-state hidden Markov models (HMMs).

[Diagram: a three-state HMM transducer with self-loops, transitions labeled with distributions d0, d1, d2, and the context-dependent phone ae_{b,d} output on the final transition.]

Distributions:

• Full-covariance multivariate Gaussians:
$$\Pr[o] = \frac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}}\; e^{-\frac{1}{2} (o - \mu)^T \Sigma^{-1} (o - \mu)}.$$

• Diagonal-covariance Gaussian mixtures.

• Semi-continuous, tied mixtures.

(Rabiner and Juang, 1993)
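A sketch of evaluating the full-covariance Gaussian density above for one observation vector, done in log space for numerical stability; the dimension and parameters are made up.

```python
import numpy as np

def gaussian_log_density(o, mu, sigma):
    """Log of the full-covariance multivariate Gaussian density above."""
    n = len(mu)
    diff = o - mu
    # log Pr[o] = -1/2 * ( n*log(2*pi) + log|Sigma| + diff^T Sigma^{-1} diff )
    _, logdet = np.linalg.slogdet(sigma)
    maha = diff @ np.linalg.solve(sigma, diff)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + maha)

n = 39                                   # typical feature dimension
rng = np.random.default_rng(0)
mu = rng.normal(size=n)
a = rng.normal(size=(n, n))
sigma = a @ a.T + n * np.eye(n)          # some positive-definite covariance
o = rng.normal(size=n)
print(gaussian_log_density(o, mu, sigma))
```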

Page 35

Context-Dependent Model

Idea:

• phoneme pronunciation depends on the environment (allophones, coarticulation).

• modeling phones in context gives better accuracy.

Context-dependent rules:

• Context-dependent units: ae/b_d → ae_{b,d}.

• Allophonic rules: t/V_V → dx.

• Complex contexts: regular expressions.

(Lee, 1990; Young et al., 1994)

Page 36

Pronunciation Dictionary

Phonemic transcription

• Example: the word data in American English.

    data    D ey dx ax    0.32
    data    D ey t  ax    0.08
    data    D ae dx ax    0.48
    data    D ae t  ax    0.12

Representation

[Diagram: the corresponding weighted finite-state transducer, with arcs d:ε/1.0, ey:ε/0.4, ae:ε/0.6, dx:ε/0.8, t:ε/0.2, and ax:data/1.0.]
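The dictionary can be thought of as a weighted mapping from words to phoneme strings. Below is a flat-dictionary sketch using the probabilities from the table above; the representation actually used in this course is a weighted transducer, which factors shared prefixes.

```python
# Pronunciation dictionary as a weighted map: word -> [(phoneme sequence, probability)].
# A flat sketch only; the weighted-transducer representation shares common prefixes.
pron_dict = {
    "data": [
        (("D", "ey", "dx", "ax"), 0.32),
        (("D", "ey", "t",  "ax"), 0.08),
        (("D", "ae", "dx", "ax"), 0.48),
        (("D", "ae", "t",  "ax"), 0.12),
    ],
}

best_pron, prob = max(pron_dict["data"], key=lambda e: e[1])
print(best_pron, prob)   # ('D', 'ae', 'dx', 'ax') 0.48
```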

Page 37

Language Model

Definition: probabilistic model for sequences of words $w = w_1 \ldots w_k$.

• By the chain rule,
$$\Pr[w] = \prod_{i=1}^{k} \Pr[w_i \mid w_1 \ldots w_{i-1}].$$

Modeling simplifications:

• Clustering of histories: $(w_1, \ldots, w_{i-1}) \mapsto c(w_1, \ldots, w_{i-1})$.

• Example: $n$th-order Markov assumption,
$$\forall i, \quad \Pr[w_i \mid w_1 \ldots w_{i-1}] = \Pr[w_i \mid h_i], \qquad |h_i| \leq n - 1.$$
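A bigram (n = 2) instance of the Markov assumption above: the probability of a sentence factors into conditional probabilities of each word given the previous one. A sketch with relative-frequency estimates and no smoothing; the toy corpus is made up.

```python
from collections import Counter
import math

# Bigram language model: relative-frequency estimates, no smoothing
# (unseen bigrams would get probability zero).
corpus = [["<s>", "the", "cat", "sat", "</s>"],
          ["<s>", "the", "dog", "sat", "</s>"]]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    for i in range(len(sent) - 1):
        unigrams[sent[i]] += 1
        bigrams[(sent[i], sent[i + 1])] += 1

def log_prob(sentence):
    return sum(math.log(bigrams[(u, v)] / unigrams[u])
               for u, v in zip(sentence, sentence[1:]))

print(math.exp(log_prob(["<s>", "the", "cat", "sat", "</s>"])))  # 0.5
```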

Page 38

Recognition Cascade

Combination of components:

observ. seq. → [HMM] → CD phone seq. → [CD model] → phoneme seq. → [pron. model] → word seq. → [lang. model] → word seq.

Viterbi approximation:
$$
\hat{w} = \operatorname*{argmax}_{w} \sum_{d, c, p} \Pr[o \mid d]\, \Pr[d \mid c]\, \Pr[c \mid p]\, \Pr[p \mid w]\, \Pr[w]
\;\approx\; \operatorname*{argmax}_{w} \max_{d, c, p} \Pr[o \mid d]\, \Pr[d \mid c]\, \Pr[c \mid p]\, \Pr[p \mid w]\, \Pr[w].
$$
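The Viterbi approximation replaces the sum over hidden sequences with a max. The sketch below contrasts the two on a toy HMM (made-up transition and emission probabilities), scoring one observation sequence with the forward (sum) and Viterbi (max) recursions.

```python
import numpy as np

# Toy HMM: compare summing over state paths (exact likelihood) with keeping
# only the best path (Viterbi approximation), as in the formula above.
A = np.array([[0.7, 0.3],      # transition probabilities (made up)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],      # emission probabilities per state (made up)
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])
obs = [0, 1, 1, 0]

alpha_sum = pi * B[:, obs[0]]
alpha_max = pi * B[:, obs[0]]
for o in obs[1:]:
    alpha_sum = (alpha_sum @ A) * B[:, o]                        # forward: sum over paths
    alpha_max = (alpha_max[:, None] * A).max(axis=0) * B[:, o]   # Viterbi: best path only
print(alpha_sum.sum(), alpha_max.max())  # exact likelihood vs. best-path score
```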

Page 39

Speech Recognition Problems

Learning: how to create accurate models for each component?

Search: how to efficiently combine models and determine best transcription?

Representation: which compact data structures should be used to represent the models computationally?

Page 40

This Lecture

Speech recognition problem

Statistical formulation

Acoustic features

Page 41

Feature Selection

Short-time spectral analysis:
$$\log \left| \int g(\tau)\, x(t + \tau)\, e^{-i 2\pi f \tau}\, d\tau \right|.$$

Idea: find a smooth approximation eliminating large variations over short frequency intervals.

[Figure: short-time (25 ms Hamming window) spectrum of /ae/; power (dB) vs. frequency (Hz).]
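A sketch of the short-time analysis above: a 25 ms Hamming window applied to one frame, followed by the log magnitude of its Fourier transform. The sampling rate and the synthetic frame are assumptions.

```python
import numpy as np

# Short-time log-magnitude spectrum of one 25 ms frame (Hamming window),
# i.e., log | FFT( g(tau) * x(t + tau) ) | on a discrete frequency grid.
sample_rate = 16000                              # assumed
t = np.arange(int(0.025 * sample_rate)) / sample_rate
frame = np.sin(2 * np.pi * 700 * t) + 0.3 * np.random.randn(t.size)  # fake voiced frame

window = np.hamming(t.size)
spectrum = np.fft.rfft(frame * window)
log_mag = np.log(np.abs(spectrum) + 1e-10)       # avoid log(0)

freqs = np.fft.rfftfreq(t.size, d=1.0 / sample_rate)
print(freqs[np.argmax(log_mag)])                 # peak frequency, near 700 Hz
```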

Page 42

Cepstral Coefficients

Let $x(\omega)$ denote the Fourier transform of the signal.

Definition: the 13 cepstral coefficients are the energy and the first 12 coefficients of the expansion
$$\log |x(\omega)| = \sum_{n = -\infty}^{\infty} c_n e^{-i n \omega}.$$

Other coefficients: 13 first-order (delta-cepstra) and 13 second-order (delta-delta cepstra) differentials.
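On a discrete grid, the coefficients $c_n$ of the log-magnitude spectrum can be obtained with an inverse Fourier transform; a sketch continuing from a windowed frame, keeping the first 13 values. The frame and parameters are assumptions.

```python
import numpy as np

# Cepstral coefficients: inverse Fourier transform of the log-magnitude spectrum,
# truncated to the first 13 terms (c_0, ..., c_12).
sample_rate = 16000                                       # assumed
frame = np.random.randn(int(0.025 * sample_rate))         # stand-in analysis frame
log_mag = np.log(np.abs(np.fft.rfft(frame * np.hamming(frame.size))) + 1e-10)
cepstrum = np.fft.irfft(log_mag)
print(cepstrum[:13])                                      # 13 cepstral coefficients

# Delta and delta-delta coefficients are then simple differences across frames, e.g.
# delta[t] = (c[t + 1] - c[t - 1]) / 2.
```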

Page 43

Mel Frequency Cepstral Coefficients

Refinement: a non-linear frequency scale approximating human perception of the distance between frequencies, e.g., the mel frequency scale:
$$f_{\text{mel}} = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right).$$

MFCCs:

• the signal is first transformed using the mel frequency bands.

• cepstral coefficients are then extracted.

(Stevens and Volkman, 1940)
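The mel-scale formula is easy to sanity-check numerically; a tiny sketch with its inverse.

```python
import math

def hz_to_mel(f):
    """Mel frequency scale: f_mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of the mapping above."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(hz_to_mel(1000))              # roughly 1000 mel, by construction of the scale
print(mel_to_hz(hz_to_mel(4000)))   # round trip back to 4000 Hz
```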

Page 44

Other Refinements

Speaker/Channel adaptation:

• mean cepstral subtraction (see the sketch after this list).

• vocal tract normalization.

• linear transformations.
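Mean cepstral subtraction, the first item above, removes a per-utterance (roughly channel-dependent) offset from every cepstral vector; a short numpy sketch on a made-up feature matrix.

```python
import numpy as np

# Cepstral mean subtraction: remove the per-utterance mean of each coefficient,
# compensating for a stationary channel/speaker offset.
features = np.random.randn(98, 39)            # made-up utterance: 98 frames x 39 features
normalized = features - features.mean(axis=0, keepdims=True)
print(normalized.mean(axis=0).round(6))       # now (numerically) zero-mean per coefficient
```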

Page 45

References

• Lalit R. Bahl, Frederick Jelinek, and Robert L. Mercer. A Maximum Likelihood Approach to Continuous Speech Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 5(2):179-190, 1983.

• Biing-Hwang Juang and Lawrence R. Rabiner. Automatic Speech Recognition - A Brief History of the Technology. In Elsevier Encyclopedia of Language and Linguistics, second edition, 2005.

• Frederick Jelinek. Statistical Methods for Speech Recognition. MIT Press, Cambridge, MA, 1998.

• Kai-Fu Lee. Context-Dependent Phonetic Hidden Markov Models for Continuous Speech Recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(4):599-609, 1990.

• Lawrence Rabiner and Biing-Hwang Juang. Fundamentals of Speech Recognition. Prentice Hall, 1993.

Page 46

References

• S.S. Stevens and J. Volkman. The relation of pitch to frequency. American Journal of Psychology, 53:329, 1940.

• Steve Young, J. Odell, and Phil Woodland. Tree-Based State-Tying for High Accuracy Acoustic Modelling. In Proceedings of ARPA Human Language Technology Workshop, Morgan Kaufmann, San Francisco, 1994.
