
Page 1: CS 552/652 Speech Recognition with Hidden Markov Models Winter 2011


CS 552/652 Speech Recognition with Hidden Markov Models

Winter 2011

Oregon Health & Science University
Center for Spoken Language Understanding

John-Paul Hosom

Lecture 6, January 24

HMMs for speech; review anatomy/framework of HMM; start Viterbi search

Page 2

HMMs for Speech

• Speech is the output of an HMM; the problem is to find the most likely model for a given speech observation sequence.

• Speech is divided into a sequence of 10-msec frames, with one frame per state transition (for faster processing). Assume speech can be recognized from 10-msec chunks.

• Each vertical line in the figure delineates one observation, ot; here T = 80.

Page 3

HMMs for Speech

Page 4

HMMs for Speech

• Each state can be associated with a sub-phoneme, a phoneme, or a sub-word

• Usually, sub-phonemes or sub-words are used, to account for spectral dynamics (coarticulation).

• One HMM corresponds to one phoneme or word

• For each HMM, determine the probability of the best state sequence that results in the observed speech.

• Choose HMM with best match (probability) to observed speech.

• Given the most likely HMM and state sequence, we can then determine the corresponding phoneme and word sequence.

Page 5

HMMs for Speech

• Example of states for word model:

[Figure: 3-state word model for “cat” (states k → ae → t, with self-loop probabilities 0.9, 0.5, 0.7 and exit probabilities 0.1, 0.5, 0.3), and a 5-state word model for “cat” with <null> states at each end, entered and exited with probability 1.0.]

Page 6

HMMs for Speech

• Example of states for word model:

[Figure: 7-state word model for “cat” with null states: <null> → k → ae1 → ae2 → tcl → t → <null>.]

• Null states do not emit observations, and are entered and exited at the same time t. Theoretically, they are unnecessary. Practically, they can make implementation easier.

• States don’t have to correspond directly to phonemes, but are commonly labeled using phonemes.

Page 7

HMMs for Speech

• Example of using an HMM for the word “yes” on an utterance (states sil → y → eh → s → sil, with self-loop probabilities 0.6, 0.3, 0.5, 0.8, 1.0 and exit probabilities 0.4, 0.7, 0.5, 0.2):

bsil(o1)·0.6 · bsil(o2)·0.6 · bsil(o3)·0.6 · bsil(o4)·0.4 · by(o5)·0.3 · by(o6)·0.3 · by(o7)·0.7 …

observations o1 o2 o3 o4 o5 o6 o7 o8 … o29, each aligned to one state

Page 8

HMMs for Speech

• Example of using an HMM for the word “no” on the same utterance (states sil → n → ow → sil, with self-loop probabilities 0.6, 0.2, 0.9, 1.0 and exit probabilities 0.4, 0.8, 0.1):

bsil(o1)·0.6 · bsil(o2)·0.6 · bsil(o3)·0.4 · bn(o4)·0.8 · bow(o5)·0.9 · bow(o6)·0.9 …

observations o1 o2 o3 o4 o5 o6 o7 o8 … o29

Page 9

HMMs for Speech

• Because of coarticulation, states are sometimes made dependent on preceding and/or following phonemes (context dependent).

ae (monophone model)
k-ae+t (triphone model)
k-ae (diphone model)
ae+t (diphone model)

• Constructing words requires matching the contexts:

• “cat”:

sil-k+ae k-ae+t ae-t+sil
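
As a small illustration of the context matching above, a hypothetical helper (the name `to_triphones` is mine, not from the lecture) that expands a phoneme string into triphone names of the form left-center+right:

```python
def to_triphones(phones):
    """Expand a context-independent phoneme string (with surrounding
    context, e.g. silence) into triphone names like sil-k+ae."""
    return [f"{phones[i-1]}-{phones[i]}+{phones[i+1]}"
            for i in range(1, len(phones) - 1)]

# "cat" surrounded by silence, matching the example above:
print(to_triphones(["sil", "k", "ae", "t", "sil"]))
# ['sil-k+ae', 'k-ae+t', 'ae-t+sil']
```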

Page 10

HMMs for Speech

• This permits several different models for each phoneme, depending on surrounding phonemes (context sensitive)

k-ae+t p-ae+t k-ae+p

• Probability of an “illegal” state sequence is zero (never used):

sil-k+ae followed by p-ae+t (transition probability 0.0)

• Much larger number of states to train on… (50 vs. 125,000 for a full set of phonemes; 39 vs. 59,319 for a reduced set).

Page 11

HMMs for Speech

• Example of a 3-state, triphone HMM (expanded from the previous “yes” example):

[Figure: the single states y (self-loop 0.3, exit 0.7) and eh (self-loop 0.5, exit 0.5) each expand into three context-dependent states: sil-y+eh sil-y+eh sil-y+eh (self-loops 0.2, 0.3, 0.2; forward transitions 0.8, 0.7, 0.8) followed by y-eh+s y-eh+s y-eh+s, each with its own self-loop and transition probabilities.]

Page 12

• 1-state monophone (context independent)

• 3-state monophone (context independent)

• 1-state triphone (context dependent)

• 3-state triphone (context dependent)

HMMs for Speech

[Figure: the 1-state monophone y (self-loop 0.3, exit 0.7); the 3-state triphone sil-y+eh sil-y+eh sil-y+eh (self-loops 0.2, 0.3, 0.2; forward transitions 0.8, 0.7, 0.8); and a 3-state model with states y1, y2, y3 and the same topology.]

• What about a context-independent triphone?? (i.e., a 3-state monophone with states y1, y2, y3: the same topology as the 3-state triphone, but without context dependence)

Page 13

HMMs for Speech

• Typically, one HMM = one word or phoneme

• Join HMMs to form sequence of phonemes = word-level HMM

• Join words to form sentences = sentence-level HMM

• Use <null> states at ends of HMM to simplify implementation

[Figure: word models for “cat” (k ae t) and “sat” (s ae t), each with <null> states at both ends; the final <null> state of one word connects to the initial <null> state of the next with probability 1.0 via an instantaneous transition (i.t.).]

Page 14

HMMs for Speech

• Reminder of big picture:

feature computation at each frame (cepstral features)

(from Encyclopedia of Information Systems, 2002)

Page 15

HMMs for Speech

Notes:

• Assume that speech observation is stationary for 1 frame

• If frame is small enough, and enough states are used, we can approximate dynamics of speech:

• The use of context-dependent states accounts (somewhat) for context-dependent nature of speech.

[Figure: spectrogram of /ay/ divided into states s1 s2 s3 s4 s5, with frame size = 4 msec.]

Page 16

HMMs for Word Recognition

Different Topologies are Possible:

• “standard”: states A1 → A2 → A3, each with a self-loop

• “short phoneme”: as above, plus a skip transition from A1 directly to A3, so a short phoneme can be covered by fewer frames

• “left-to-right”: a longer chain A1 → A2 → A3 → A4 → A5, each state with a self-loop and no skips

Page 17

Anatomy of an HMM

HMMs for speech:

• first-order HMM

• one HMM per phoneme or word

• 3 states per phoneme-level HMM, more for word-level HMM

• sequential series of states, each with self-loop

• link HMMs together to form words and sentences

• GMM: many Gaussian components per state (e.g., 16)

• context-dependent HMMs: (phoneme-level) HMMs can be linked together only if their contexts correspond

Page 18

Anatomy of an HMM

HMMs for speech (cont’d):

• speech signal divided into 10-msec quanta

• 1 HMM state per 10-msec quantum (frame)

• use self-loops for speech units that last longer than N frames

• trace through an HMM to determine probability of utterance and state sequence.

Page 19

Anatomy of an HMM

• Diagram of one HMM, /y/, in the context of preceding silence, followed by /eh/:

states sil-y+eh sil-y+eh sil-y+eh, entered with probability 0.5, with self-loop probabilities 0.2, 0.3, 0.2 and forward-transition probabilities 0.8, 0.7, 0.8

each state j has a GMM output distribution with mean vectors μj1 μj2 μj3, covariance matrices Σj1 Σj2 Σj3, and scalar mixture weights cj1 cj2 cj3 (vector: μ; matrix: Σ; scalar: c)

Page 20

Framework for HMMs

• N = number of states; 3 per phoneme, more than 3 per word

• S = states {S1, S2, S3, …, SN}; even though any state can output (any) observation, we associate the most likely output with the state name. Often context-dependent phonetic states (triphones) are used: {sil-y+eh, y-eh+s, eh-s+sil, …}

• T = final time of output; t = {1, 2, …, T}

• O = observations {o1 o2 … oT}; the actual output generated by the HMM: features (cepstral, LPC, MFCC, PLP, etc.) of a speech signal

Page 21

Framework for HMMs

• M = number of observation symbols per state = number of codewords for a discrete HMM; “infinite” for a continuous HMM

• v = symbols {v1 v2 … vM}: “codebook indices” generated by a discrete (VQ) HMM; for speech, indices point to locations in feature space. There is no direct correspondence for a continuous HMM; the output of a continuous HMM is a sequence of observations {speech vector 1, speech vector 2, …}, and the output can be any point in continuous n-dimensional space.

• A = matrix of transition probabilities {aij}, where aij = P(qt = j | qt−1 = i); ergodic HMM: all aij > 0

• B = set of parameters for determining probabilities bj(ot):
bj(ot) = P(ot = vk | qt = j) (discrete: codebook)
bj(ot) = P(ot | qt = j) (continuous: GMM)
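
The elements above can be collected into one container; a minimal sketch for the discrete (VQ) case, with class and method names of my own choosing (not from the lecture):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DiscreteHMM:
    """lambda = (A, B, pi) for an N-state, M-symbol discrete HMM."""
    A: np.ndarray    # N x N, A[i, j] = a_ij = P(q_t = j | q_{t-1} = i)
    B: np.ndarray    # N x M, B[j, k] = b_j(v_k) = P(o_t = v_k | q_t = j)
    pi: np.ndarray   # length N, pi[i] = P(q_1 = i)

    def validate(self):
        # each row of A and B, and pi itself, must be a probability distribution
        assert np.allclose(self.A.sum(axis=1), 1.0)
        assert np.allclose(self.B.sum(axis=1), 1.0)
        assert np.isclose(self.pi.sum(), 1.0)

# a toy 2-state, 2-symbol model
model = DiscreteHMM(A=np.array([[0.7, 0.3], [0.4, 0.6]]),
                    B=np.array([[0.9, 0.1], [0.2, 0.8]]),
                    pi=np.array([0.5, 0.5]))
model.validate()
```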

Page 22

Framework for HMMs

• π = initial state distribution {πi}, where πi = P(q1 = i)

• λ = the entire model: λ = (A, B, π)

Page 23

Framework for HMMs

• Example: “hi”

model: states sil-h+ay and h-ay+sil, with self-loop probabilities 0.3 and 0.4 and exit probabilities 0.7 and 0.6

• observed features: o1 = {0.8}, o2 = {0.8}, o3 = {0.2}

• what is the probability of O given the state sequence {sil-h+ay, h-ay+sil, h-ay+sil}, i.e. {1, 2, 2}?

[Figure: output probability functions b1 and b2, plotted over feature values from 0.0 to 1.0.]

Page 24

P = π1 · b1(o1) · a12 · b2(o2) · a22 · b2(o3)

P = 1.0 · 0.76 · 0.7 · 0.27 · 0.4 · 0.82

P = 0.0471

Framework for HMMs

observations o1 = 0.8, o2 = 0.8, o3 = 0.2; state sequence q1 = 1, q2 = 2, q3 = 2

• Example: “hi”
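
The product on this slide can be checked directly; the values 0.76, 0.27, 0.82 are the b values read from the slide:

```python
pi_1 = 1.0              # P(q1 = 1)
a12, a22 = 0.7, 0.4     # transition probabilities from the "hi" model
b = [0.76, 0.27, 0.82]  # b1(o1), b2(o2), b2(o3), as given on the slide

# P(O, q | lambda) for the state sequence {1, 2, 2}
P = pi_1 * b[0] * a12 * b[1] * a22 * b[2]
print(round(P, 4))  # 0.0471
```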

Page 25

Framework for HMMs

• What is probability of an observation sequence and state sequence, given the model?

P(O, q | ) = P(O | q, ) P(q | )

What is the “best” valid state sequence from time 1 to time T, given the model and the observations?

• At every time t, can connect to up to N states

There are up to N^T possible state sequences (for one second of speech with 3 states, N^T = 3^100 ≈ 10^47 sequences)

infeasible!!

Page 26

Viterbi Search: Formula

• Use inductive procedure (see first part of Lecture 2)

• Best sequence (highest probability) up to time t, ending in state i, is defined as:

δt(i) = max over q1 q2 … qt−1 of P[q1 q2 … qt−1, qt = i, o1 o2 … ot | λ]

• First iteration (t = 1):

δ1(i) = P[q1 = i, o1 | λ]
δ1(i) = P[q1 = i | λ] · P[o1 | q1 = i, λ]
δ1(i) = πi · bi(o1)

(using P(A, B) = P(A) · P(B | A))

• Question 1: What is best score along a single path, up to time t,ending in state i?

Page 27

Viterbi Search: Formula

• Second iteration (t=2)

δt(i) = max over q1 … qt−1 of P[q1 q2 … qt−1, qt = i, o1 o2 … ot | λ]

δ2(i) = max over q1 of P[q1 = k, q2 = i, o1 o2 | λ]

δ2(i) = max over q1 of P[q1 = k, o1 | λ] · P[q2 = i, o2 | q1 = k, o1, λ]

δ2(i) = max over q1 of δ1(k) · P[q2 = i | q1 = k, o1, λ] · P[o2 | q2 = i, q1 = k, o1, λ]

(writing k for the value of q1)

Page 28

Viterbi Search: Formula

• Second iteration (t=2) (continued…)

δ2(i) = max over k of δ1(k) · P[q2 = i | q1 = k, o1, λ] · P[o2 | q2 = i, q1 = k, o1, λ]

since P(o2) is independent of o1 and q1 (given q2), and P(q2) is independent of o1:

δ2(i) = max over k of δ1(k) · P[q2 = i | q1 = k, λ] · P[o2 | q2 = i, λ]

δ2(i) = max over k of [δ1(k) · aki] · bi(o2)

Page 29

Viterbi Search: Formula

• In general, for any value of t:

δt(i) = max over q1 … qt−1 of P[q1 q2 … qt−1, qt = i, o1 o2 … ot | λ]

δt(i) = max over q1 … qt−1 of P[q1 … qt−1, o1 … ot−1 | λ] · P[qt = i, ot | q1 … qt−1, o1 … ot−1, λ]

change notation to say that we call state qt−1 by the variable name “k”:

δt(i) = max over q1 … qt−2, k of P[q1 … qt−2, qt−1 = k, o1 … ot−1 | λ] · P[qt = i, ot | q1 … qt−2, qt−1 = k, o1 … ot−1, λ]

after maximizing over q1 … qt−2, the first term now equals δt−1(k):

δt(i) = max over k of δt−1(k) · P[qt = i | qt−1 = k, o1 … ot−1, λ] · P[ot | qt = i, qt−1 = k, o1 … ot−1, λ]

Page 30

Viterbi Search: Formula

• In general, for any value of t: (continued…)

δt(i) = max over k of δt−1(k) · P[qt = i | qt−1 = k, o1 … ot−1, λ] · P[ot | qt = i, qt−1 = k, o1 … ot−1, λ]

q1 through qt−2 have been removed from the equation (they are implicit in δt−1(k))

now make the 1st-order Markov assumption, and the assumption that p(ot) depends only on the current state i and the model λ:

δt(i) = max over k of δt−1(k) · P[qt = i | qt−1 = k, λ] · P[ot | qt = i, λ]

δt(i) = max over k of [δt−1(k) · aki] · bi(ot)

Page 31

• We have shown that if we can compute the highest probabilityfor all states at time t-1, then we can compute the highest probability for any state j at time t.

• We have also shown that we can compute the highest probabilityfor any state j (or all states) at time 1.

• Therefore, our inductive proof shows that we can compute thehighest probability of an observation sequence (making theassumptions noted above) for any state j up to time t.

Viterbi Search: Formula

• In general, for any value of t:

δt(j) = max over i of [δt−1(i) · aij] · bj(ot)

• Best path from {1, 2, … t} is not dependent on future times {t+1, t+2, … T} (from definition of model)

• Best path from {1, 2, … t} is not necessarily the same as the best path from {1, 2, … (t−1)} concatenated with the best single step from (t−1) to t

Page 32

Viterbi Search: Formula

• Keep in memory only t-1(i) for all i.

• For each time t and state j, need (N multiplies and compares) + (1 multiply)

• For each time t, need N · ((N multiplies and compares) + (1 multiply))

• To find the best path, need O(N²T) operations.

• This is much better than N^T possible paths, especially for large T!
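
The recursion δt(j) = max over i of [δt−1(i) · aij] · bj(ot), with backpointers for recovering the best state sequence, can be sketched as follows (a generic implementation, not code from the course):

```python
import numpy as np

def viterbi(pi, A, b):
    """Find the single best state sequence through an HMM.

    pi: (N,)   initial state probabilities, pi[i] = P(q1 = i)
    A:  (N, N) transition matrix, A[i, j] = a_ij
    b:  (T, N) observation likelihoods, b[t, j] = b_j(o_{t+1})
    Returns (probability of best path, best state sequence).
    """
    T, N = b.shape
    delta = pi * b[0]                  # delta_1(i) = pi_i * b_i(o1)
    psi = np.zeros((T, N), dtype=int)  # backpointers
    for t in range(1, T):
        scores = delta[:, None] * A    # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) * b[t]
    # backtrace from the most probable final state
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return float(delta.max()), path[::-1]
```

The loop does exactly the N² multiply-and-compare work per frame counted above, so the total cost is O(N²T).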

Page 33

Viterbi Search: Comparison with DTW

Note the similarities to DTW:

• best path to an end time is computed using only previous data points (i.e. in DTW, points in lower-left quadrant; in Viterbi search, previous time values)

• best path for entire utterance is computed from best path when time t=T.

• DTW cost D for a point (x,y) is computed using cumulative cost for previous points, transition cost (path weights), and local cost for current point (x,y).

Viterbi probability for a time t and state j is computed using cumulative probability for previous time points and states, transition probabilities, and local observation probability for current time point and state.

Page 34

Viterbi Search: Comparison with DTW

“Hybrid” between DTW and Viterbi: Use multiple templates

1. Collect N templates. Use DTW to find template n which has lowest D with all other templates. Use DTW to align all other templates with template n, creating warped templates.

2. At each frame in template n, compute average feature value and standard deviation of feature values over all warped templates.

3. When performing DTW, don’t use Euclidean distance to get d value between input at frame t (ot) and template at frame u, but d(t,u) = negative log probability of ot (input at t) given mean and standard deviation of template at frame u, assuming Normal distribution. (If template data at frame u are not Normally distributed, can use GMM instead.)

This can be viewed as an HMM with the number of states equal to the number of frames in template n, and (possibly a second-order) Markov process with transition probabilities associated with only local states (frames).
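
Step 3's local cost d(t, u) can be sketched as the negative log of a Normal density; the function name and the per-dimension independence (i.e., a diagonal covariance) are my assumptions:

```python
import math

def neg_log_gaussian(o, mu, sigma):
    """d(t, u): negative log probability of input frame o given the
    warped templates' mean mu and standard deviation sigma at template
    frame u, assuming a Normal distribution with independent dimensions."""
    cost = 0.0
    for x, m, s in zip(o, mu, sigma):
        cost += 0.5 * math.log(2 * math.pi * s * s) + (x - m) ** 2 / (2 * s * s)
    return cost
```

Plugging this cost into the usual DTW recursion in place of Euclidean distance gives the hybrid described above; a GMM density could be substituted when the template data at frame u are not Normally distributed.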

Page 35

Viterbi Search: Comparison with DTW

Other uses of DTW

1. Aligning Phoneme Sequences:

   words: “this is easy” → TIMIT phonemes /dh ih s I z .pau iy z iy/, Worldbet phonemes /D I s I z .pau i: z i:/
   “this was easy” → TIMIT phonemes /dh ih s .pau w ah z iy z iy/, Worldbet phonemes /D I s .pau w ^ z i: z i:/

   Define phonemes in a multi-dimensional feature space such as {Voicing, Manner, Place, Height}: /iy/ = [1.0 1.0 1.0 4.0], /z/ = [3.0 6.0 3.0 5.0], /s/ = [4.0 6.0 3.0 5.0]

2. Automatic Dialogue Replacement (ADR):

Actor gives a performance for movie. There is background noise, room reverberation, wind, making the audio of low quality. Later, the same actor goes into a studio and records the same lines in an acoustically-controlled environment. But then small timing differences need to be corrected. DTW is used in state-of-the-art ADR.
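
For use 1, the phoneme-space distance and the DTW cumulative cost D might be sketched like this; only the three feature vectors given above are filled in, and the rest of the table is assumed to be defined the same way:

```python
import math

# phonemes as points in a {Voicing, Manner, Place, Height} space;
# these three vectors are the ones given above
FEAT = {
    "i:": (1.0, 1.0, 1.0, 4.0),
    "z":  (3.0, 6.0, 3.0, 5.0),
    "s":  (4.0, 6.0, 3.0, 5.0),
}

def dist(a, b):
    """Euclidean distance between two phonemes in feature space."""
    return math.dist(FEAT[a], FEAT[b])

def dtw(seq1, seq2):
    """Cumulative DTW cost D between two phoneme sequences."""
    n, m = len(seq1), len(seq2)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq1[i - 1], seq2[j - 1])
            D[i][j] = d + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# /s/ and /z/ differ only in voicing, so they are close in this space:
print(dist("s", "z"))   # 1.0
print(dtw(["s", "i:", "z"], ["s", "z"]))
```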

Page 36

• Prior segmentation of speech into phonetic regions is not required before performing recognition.

This provides robustness over other methods that first segment and then classify, because any attempt to do prior segmentation will yield errors.

• As we move through an HMM to determine most likely sequence, we get segmentation.

• The first-order and independence assumptions are correct for some phenomena, but not for speech; they do, however, make the math easier.

HMMs for Speech

Page 37

Viterbi Search: Example

Speech/Non-Speech Segmentation (frame rate 100 msec):

Speech = state A, Non-Speech = state B

t:    1    2    3    4    5
p(A): 0.1  0.5  0.9  0.1  0.7
p(B): 0.8  0.6  0.2  0.4  0.2

πA = 0.2, πB = 0.8

transition probabilities: aAA = 0.8, aAB = 0.2, aBA = 0.3, aBB = 0.7
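
The example can be checked by running the Viterbi recursion directly; the resulting best path is my computation, not stated on the slide:

```python
# states: A = Speech, B = Non-Speech
pi = {"A": 0.2, "B": 0.8}
a = {("A", "A"): 0.8, ("A", "B"): 0.2,
     ("B", "A"): 0.3, ("B", "B"): 0.7}
b = {"A": [0.1, 0.5, 0.9, 0.1, 0.7],   # p(A) row of the table
     "B": [0.8, 0.6, 0.2, 0.4, 0.2]}   # p(B) row of the table

delta = {s: pi[s] * b[s][0] for s in "AB"}   # t = 1
back = []
for t in range(1, 5):                        # t = 2 .. 5
    prev = dict(delta)
    back.append({s: max("AB", key=lambda k: prev[k] * a[(k, s)]) for s in "AB"})
    delta = {s: max(prev[k] * a[(k, s)] for k in "AB") * b[s][t] for s in "AB"}

# backtrace from the most probable final state
path = [max("AB", key=lambda s: delta[s])]
for pointers in reversed(back):
    path.append(pointers[path[-1]])
path.reverse()
print(path)   # ['B', 'B', 'A', 'A', 'A']
```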