CS 552/652 Speech Recognition with Hidden Markov Models, Winter 2011
Oregon Health & Science University, Center for Spoken Language Understanding
John-Paul Hosom. Lecture 6, January 24: HMMs for speech; review anatomy/framework of HMM; start Viterbi search.
1
CS 552/652 Speech Recognition with Hidden Markov Models
Winter 2011
Oregon Health & Science University, Center for Spoken Language Understanding
John-Paul Hosom
Lecture 6, January 24
HMMs for speech; review anatomy/framework of HMM; start Viterbi search
2
HMMs for Speech
• Speech is the output of an HMM; problem is to find most likely model for a given speech observation sequence.
• Speech is divided into sequence of 10-msec frames, one frame per state transition (faster processing). Assume speech can be recognized using 10-msec chunks.
• Each vertical line in the figure delineates one observation, ot (T = 80 frames in this example)
3
HMMs for Speech
4
HMMs for Speech
• Each state can be associated with a sub-phoneme, a phoneme, or a sub-word
• Usually, sub-phonemes or sub-words are used, to account for spectral dynamics (coarticulation).
• One HMM corresponds to one phoneme or word
• For each HMM, determine the probability of the best state sequence that results in the observed speech.
• Choose HMM with best match (probability) to observed speech.
• Given most likely HMM and state sequence, maybe determine the corresponding phoneme and word sequence.
5
HMMs for Speech
• Example of states for word model:

  (figure: 3-state word model for “cat”: states k → ae → t, with self-loop probabilities 0.9, 0.5, 0.7 and forward transition probabilities 0.1, 0.5, 0.3.

  5-state word model for “cat” with null states: <null> → k → ae → t → <null>, with transition probability 1.0 into and out of the null states.)
6
HMMs for Speech
• Example of states for word model:

  (figure: 7-state word model for “cat” with null states: <null> → k → ae1 → ae2 → tcl → t → <null>; each emitting state has a self-loop, and the null states are entered and exited with probability 1.0.)
• Null states do not emit observations, and are entered and exited at the same time t. Theoretically, they are unnecessary. Practically, they can make implementation easier.
• States don’t have to correspond directly to phonemes, but are commonly labeled using phonemes.
7
HMMs for Speech
• Example of using HMM for word “yes” on an utterance:

  (figure: states sil → y → eh → s → sil, with self-loop probabilities 0.6, 0.3, 0.5, 0.8, forward transitions 0.4, 0.7, 0.5, 0.2, and final sil self-loop 1.0; observations o1 o2 o3 o4 o5 o6 o7 o8 … o29 are each aligned with one state.)

  bsil(o1)·0.6·bsil(o2)·0.6·bsil(o3)·0.6·bsil(o4)·0.4·by(o5)·0.3·by(o6)·0.3·by(o7)·0.7 ...
8
HMMs for Speech
• Example of using HMM for word “no” on the same utterance:

  (figure: states sil → n → ow → sil, with self-loop probabilities 0.6, 0.2, 0.9, forward transitions 0.4, 0.8, 0.1, and final sil self-loop 1.0; observations o1 o2 o3 o4 o5 o6 o7 o8 … o29.)

  bsil(o1)·0.6·bsil(o2)·0.6·bsil(o3)·0.4·bn(o4)·0.8·bow(o5)·0.9·bow(o6)·0.9 ...
9
HMMs for Speech
• Because of coarticulation, states are sometimes made dependent on preceding and/or following phonemes (context dependent).
  ae (monophone model)
  k-ae+t (triphone model)
  k-ae (diphone model)
  ae+t (diphone model)
• Constructing words requires matching the contexts:
• “cat”:
sil-k+ae k-ae+t ae-t+sil
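Context matching like this can be illustrated with a small helper (a hypothetical function, not part of the lecture material) that expands a monophone sequence into triphones:

```python
def to_triphones(phones, context="sil"):
    """Expand a monophone sequence into context-dependent triphones
    of the form left-phone+right, padding both ends with silence."""
    padded = [context] + list(phones) + [context]
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

# "cat" = /k ae t/ with silence context on both sides:
print(to_triphones(["k", "ae", "t"]))
# ['sil-k+ae', 'k-ae+t', 'ae-t+sil']
```

Adjacent triphones then automatically satisfy the matching constraint: each model's right context equals the next model's center phone.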
10
HMMs for Speech
• This permits several different models for each phoneme, depending on the surrounding phonemes (context sensitive):
  k-ae+t   p-ae+t   k-ae+p
• Probability of an “illegal” state sequence is zero (never used): e.g., sil-k+ae followed by p-ae+t has transition probability 0.0.
• Much larger number of states to train: 50 monophones vs. 125,000 triphones for a full set of phonemes; 39 vs. 59,319 for a reduced set.
11
HMMs for Speech
• Example of 3-state, triphone HMM (expanded from the previous example):

  (figure: the single monophone states y and eh, each with a self-loop, are expanded into three triphone states apiece: sil-y+eh sil-y+eh sil-y+eh followed by y-eh+s y-eh+s y-eh+s, each state with its own self-loop and forward transition probability.)
12
HMMs for Speech

• 1-state monophone (context independent)
• 3-state monophone (context independent)
• 1-state triphone (context dependent)
• 3-state triphone (context dependent)
• what about a context-independent triphone??

  (figures: a 1-state monophone y; a 3-state triphone sil-y+eh sil-y+eh sil-y+eh; a 1-state triphone sil-y+eh; and a 3-state model with states labeled y1 y2 y3, i.e. a “context-independent triphone”, which is a 3-state monophone with position-dependent states.)
13
HMMs for Speech
• Typically, one HMM = one word or phoneme
• Join HMMs to form sequence of phonemes = word-level HMM
• Join words to form sentences = sentence-level HMM
• Use <null> states at ends of HMM to simplify implementation
  (figure: word models for “cat” (k ae t) and “sat” (s ae t), each with <null> states at both ends; the exit null of one word model connects to the entry null of the next with an instantaneous transition (i.t.) of probability 1.0.)
14
HMMs for Speech
• Reminder of big picture:
  feature computation at each frame (cepstral features)
  (figure from Encyclopedia of Information Systems, 2002)
15
HMMs for Speech
Notes:
• Assume that speech observation is stationary for 1 frame
• If frame is small enough, and enough states are used, we can approximate dynamics of speech:
• The use of context-dependent states accounts (somewhat) for context-dependent nature of speech.
  (figure: states s1 s2 s3 s4 s5 approximating the spectral dynamics of /ay/, with frame size = 4 msec)
16
HMMs for Word Recognition
Different Topologies are Possible:
• “standard”: states A1, A2, A3, each with a self-loop and a single forward transition
• “short phoneme”: the standard topology plus a skip transition (A1 → A3, probability 0.2), so a very short phoneme can be traversed in fewer frames than states
• “left-to-right”: five states A1 … A5, with self-loops and forward transitions only

  (figure: transition diagrams for the three topologies)
17
Anatomy of an HMM
HMMs for speech:
• first-order HMM
• one HMM per phoneme or word
• 3 states per phoneme-level HMM, more for word-level HMM
• sequential series of states, each with self-loop
• link HMMs together to form words and sentences
• GMM: many Gaussian components per state (e.g., 16)
• context-dependent HMMs: (phoneme-level) HMMs can be linked together only if their contexts correspond
18
Anatomy of an HMM
HMMs for speech (cont’d):
• speech signal divided into 10-msec quanta (frames)
• 1 HMM state transition per 10-msec frame
• use self-loops for speech units that span more frames than the model has states
• trace through an HMM to determine the probability of an utterance and its state sequence.
19
Anatomy of an HMM
• Diagram of one HMM, /y/, in the context of preceding silence, followed by /eh/:

  (figure: three states sil-y+eh with entry probability 0.5, self-loop probabilities 0.2, 0.3, 0.2, and forward transitions 0.8, 0.7, 0.8. Each state j carries GMM output parameters for mixture components k = 1…3: a mean vector μjk, a covariance matrix Σjk, and a scalar mixture weight cjk.)
20
Framework for HMMs
• N = number of states: 3 per phoneme, >3 per word
• S = states {S1, S2, S3, … , SN}. Even though any state can output (any) observation, we associate the most likely output with the state name. Often we use context-dependent phonetic states (triphones): {sil-y+eh y-eh+s eh-s+sil …}
• T = final time of output; t = {1, 2, … T}
• O = observations {o1 o2 … oT}: the actual output generated by the HMM; features (cepstral, LPC, MFCC, PLP, etc.) of a speech signal
21
Framework for HMMs

• M = number of observation symbols per state = number of codewords for a discrete HMM; “infinite” for a continuous HMM
• v = symbols {v1 v2 … vM}: “codebook indices” generated by a discrete (VQ) HMM; for speech, indices point to locations in feature space. There is no direct correspondence for a continuous HMM; the output of a continuous HMM is a sequence of observations {speech vector 1, speech vector 2, …}, and the output can be any point in continuous n-dimensional space.
• A = matrix of transition probabilities {aij}, where aij = P(qt = j | qt-1 = i); ergodic HMM: all aij > 0
• B = set of parameters for determining probabilities bj(ot):
  bj(ot) = P(ot = vk | qt = j)   (discrete: codebook)
  bj(ot) = P(ot | qt = j)        (continuous: GMM)
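As an illustration of the continuous case, bj(ot) for a state modeled by a diagonal-covariance GMM could be sketched as follows (hypothetical helper and parameter names, not the lecture's code):

```python
import math

def gmm_log_b(o, weights, means, variances):
    """log b_j(o_t) for one state with a diagonal-covariance GMM:
    b_j(o) = sum_k c_jk * N(o; mu_jk, sigma_jk^2)."""
    total = 0.0
    for c, mu, var in zip(weights, means, variances):
        # log-density of one diagonal Gaussian component,
        # summed over feature dimensions
        log_n = sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
                    for x, m, v in zip(o, mu, var))
        total += c * math.exp(log_n)
    return math.log(total)
```

For a single unit-variance component evaluated at its own mean in 2 dimensions, this returns -log(2π), the log of the peak density.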
22
Framework for HMMs
• π = initial state distribution {πi}, where πi = P(q1 = i)
• λ = the entire model: λ = (A, B, π)
23
Framework for HMMs
• Example: “hi”

  (figure: two states, sil-h+ay and h-ay+sil, with self-loop probabilities 0.3 and 0.4 and forward transitions 0.7 and 0.6; initial probabilities π1 = 1.0, π2 = 0.0; the output distributions b1 and b2 are plotted over the 1-dimensional feature value.)

• observed features: o1 = {0.8}, o2 = {0.8}, o3 = {0.2}
• what is the probability of O given the state sequence {sil-h+ay h-ay+sil h-ay+sil} = {1 2 2}?
24
Framework for HMMs

• Example: “hi” (continued); o1 = 0.8, o2 = 0.8, o3 = 0.2, state sequence {q1 q2 q2}, with b1(0.8) = 0.76, b2(0.8) = 0.27, b2(0.2) = 0.82 read from the output distributions:

  P = π1 · b1(o1) · a12 · b2(o2) · a22 · b2(o3)
  P = 1.0 · 0.76 · 0.7 · 0.27 · 0.4 · 0.82
  P = 0.0471
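The arithmetic above can be verified in a few lines (probability values taken from the slide):

```python
# P = pi_1 * b_1(o1) * a_12 * b_2(o2) * a_22 * b_2(o3) for the "hi" model:
# pi_1 = 1.0, a_12 = 0.7, a_22 = 0.4,
# b_1(0.8) = 0.76, b_2(0.8) = 0.27, b_2(0.2) = 0.82.
pi_1 = 1.0
a_12, a_22 = 0.7, 0.4
b1_o1, b2_o2, b2_o3 = 0.76, 0.27, 0.82

P = pi_1 * b1_o1 * a_12 * b2_o2 * a_22 * b2_o3
print(round(P, 4))  # 0.0471
```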
25
Framework for HMMs
• What is probability of an observation sequence and state sequence, given the model?
P(O, q | λ) = P(O | q, λ) · P(q | λ)

• What is the “best” valid state sequence from time 1 to time T, given the model?
• At every time t, we can connect to up to N states, so there are up to N^T possible state sequences (for one second of speech with 3 states, N^T = 3^100 ≈ 10^47 sequences): infeasible!!
26
Viterbi Search: Formula
• Use inductive procedure (see first part of Lecture 2)
• Best sequence (highest probability) up to time t ending in state i is defined as:
]|...,,...[,...,max)( 21121
121 ttt
tt iqqqqPqqqi ooo
• First iteration (t=1):
]|,[)( 111 oiqPi
],|[]|[)( 1111 iqPiqPi o
)()( 11 oii bi
)|()()( ABPAPBAP
• Question 1: What is best score along a single path, up to time t,ending in state i?
27
Viterbi Search: Formula
• Second iteration (t = 2), writing q1 = k:

  δ2(i) = max over k of  P[q1 = k, q2 = i, o1 o2 | λ]
  δ2(i) = max over k of  P[q1 = k, o1 | λ] · P[q2 = i, o2 | q1 = k, o1, λ]
  δ2(i) = max over k of  δ1(k) · P[q2 = i | q1 = k, o1, λ] · P[o2 | q2 = i, q1 = k, o1, λ]
28
Viterbi Search: Formula
• Second iteration (t = 2) (continued…):

  δ2(i) = max over k of  δ1(k) · P[q2 = i | q1 = k, o1, λ] · P[o2 | q2 = i, q1 = k, o1, λ]

  P(o2) is independent of o1 and q1, and P(q2) is independent of o1, so:

  δ2(i) = max over k of  δ1(k) · P[q2 = i | q1 = k, λ] · P[o2 | q2 = i, λ]
  δ2(i) = max over k of  [δ1(k) · aki] · bi(o2)
29
Viterbi Search: Formula
• In general, for any value of t:

  δt(i) = max over q1, …, qt-1 of  P[q1 q2 … qt-1, qt = i, o1 o2 … ot | λ]

  δt(i) = max over q1, …, qt-1 of  P[q1 … qt-1, o1 … ot-1 | λ] · P[qt = i, ot | q1 … qt-1, o1 … ot-1, λ]

  change notation to say that we call state qt-1 by the variable name “k”:

  δt(i) = max over q1, …, qt-2, k of  P[q1 … qt-2, qt-1 = k, o1 … ot-1 | λ] · P[qt = i, ot | q1 … qt-2, qt-1 = k, o1 … ot-1, λ]

  the first term now equals δt-1(k):

  δt(i) = max over q1, …, qt-2, k of  δt-1(k) · P[qt = i | q1 … qt-2, qt-1 = k, o1 … ot-1, λ] · P[ot | qt = i, q1 … qt-2, qt-1 = k, o1 … ot-1, λ]
30
Viterbi Search: Formula
• In general, for any value of t (continued…):

  δt(i) = max over q1, …, qt-2, k of  δt-1(k) · P[qt = i | q1 … qt-2, qt-1 = k, o1 … ot-1, λ] · P[ot | qt = i, q1 … qt-2, qt-1 = k, o1 … ot-1, λ]

  q1 through qt-2 have been removed from the equation (they are implicit in δt-1(k)):

  δt(i) = max over k of  δt-1(k) · P[qt = i | qt-1 = k, o1 … ot-1, λ] · P[ot | qt = i, qt-1 = k, o1 … ot-1, λ]

  now make the 1st-order Markov assumption, and the assumption that P(ot) depends only on the current state i and the model λ:

  δt(i) = max over k of  δt-1(k) · P[qt = i | qt-1 = k, λ] · P[ot | qt = i, λ]
  δt(i) = max over k of  [δt-1(k) · aki] · bi(ot)
31
Viterbi Search: Formula

• In general, for any value of t:

  δt(j) = max over i of  [δt-1(i) · aij] · bj(ot)

• We have shown that if we can compute the highest probability for all states at time t-1, then we can compute the highest probability for any state j at time t.
• We have also shown that we can compute the highest probability for any state j (or all states) at time 1.
• Therefore, our inductive proof shows that we can compute the highest probability of an observation sequence (making the assumptions noted above) for any state j up to time t.
• Best path from {1, 2, … t} is not dependent on future times {t+1, t+2, … T} (from the definition of the model).
• Best path from {1, 2, … t} is not necessarily the same as the best path from {1, 2, … (t-1)} concatenated with the best path {(t-1) → t}.
32
Viterbi Search: Formula
• Keep in memory only δt-1(i) for all i.
• For each time t and state j, we need (N multiplies and compares) + (1 multiply).
• For each time t, we therefore need N · ((N multiplies and compares) + (1 multiply)).
• To find the best path, we need O(N²T) operations.
• This is much better than the N^T possible paths, especially for large T!
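The δ recursion and the O(N²T) bookkeeping can be sketched as follows (a minimal illustration, not the lecture's code; log probabilities are used to avoid numerical underflow):

```python
import math

def viterbi(pi, A, B):
    """Minimal Viterbi search: pi[i] = initial probability of state i,
    A[i][j] = transition probability i -> j, B[t][i] = b_i(o_t).
    Returns (best log-probability, best state path); O(N^2 T) time,
    keeping only delta_{t-1} in memory plus backpointers."""
    log = lambda p: math.log(p) if p > 0 else float("-inf")
    N, T = len(pi), len(B)
    delta = [log(pi[i]) + log(B[0][i]) for i in range(N)]   # t = 1
    backptr = []
    for t in range(1, T):
        prev = delta
        delta, psi = [], []
        for j in range(N):
            # delta_t(j) = max_i [delta_{t-1}(i) * a_ij] * b_j(o_t)
            i_best = max(range(N), key=lambda i: prev[i] + log(A[i][j]))
            delta.append(prev[i_best] + log(A[i_best][j]) + log(B[t][j]))
            psi.append(i_best)
        backptr.append(psi)
    # backtrace from the best final state
    state = max(range(N), key=lambda i: delta[i])
    path = [state]
    for psi in reversed(backptr):
        path.append(psi[path[-1]])
    path.reverse()
    return delta[state], path
```

For example, viterbi([1.0, 0.0], [[0.5, 0.5], [0.5, 0.5]], [[0.9, 0.1], [0.1, 0.9]]) recovers the path [0, 1].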
33
Viterbi Search: Comparison with DTW
Note the similarities to DTW:
• best path to an end time is computed using only previous data points (i.e. in DTW, points in lower-left quadrant; in Viterbi search, previous time values)
• best path for entire utterance is computed from best path when time t=T.
• DTW cost D for a point (x,y) is computed using cumulative cost for previous points, transition cost (path weights), and local cost for current point (x,y).
• Viterbi probability for a time t and state j is computed using the cumulative probability for previous time points and states, transition probabilities, and the local observation probability for the current time point and state.
34
Viterbi Search: Comparison with DTW
“Hybrid” between DTW and Viterbi: Use multiple templates
1. Collect N templates. Use DTW to find template n which has lowest D with all other templates. Use DTW to align all other templates with template n, creating warped templates.
2. At each frame in template n, compute average feature value and standard deviation of feature values over all warped templates.
3. When performing DTW, don’t use Euclidean distance to get d value between input at frame t (ot) and template at frame u, but d(t,u) = negative log probability of ot (input at t) given mean and standard deviation of template at frame u, assuming Normal distribution. (If template data at frame u are not Normally distributed, can use GMM instead.)
This can be viewed as an HMM with the number of states equal to the number of frames in template n, and (possibly a second-order) Markov process with transition probabilities associated with only local states (frames).
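Step 3's local distance can be sketched as a negative log Gaussian density (an illustrative helper with hypothetical parameter names; diagonal covariance assumed):

```python
import math

def local_cost(o_t, mu_u, sigma_u):
    """d(t, u) = -log N(o_t; mu_u, sigma_u^2): negative log probability
    of the input frame o_t under the template's per-frame Normal
    distribution, summed over feature dimensions."""
    return sum(0.5 * (math.log(2 * math.pi * s * s) + ((x - m) / s) ** 2)
               for x, m, s in zip(o_t, mu_u, sigma_u))
```

A frame near the template mean gets a low cost; frames far from the mean (measured in standard deviations) are penalized quadratically.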
35
Viterbi Search: Comparison with DTW
Other uses of DTW
1. Aligning Phoneme Sequences:

   words            TIMIT phonemes                  Worldbet phonemes
   “this is easy”   /dh ih s I z .pau iy z iy/      /D I s I z .pau i: z i:/
   “this was easy”  /dh ih s .pau w ah z iy z iy/   /D I s .pau w ^ z i: z i:/

   Define phonemes in a multi-dimensional feature space such as {Voicing, Manner, Place, Height}: /iy/ = [1.0 1.0 1.0 4.0], /z/ = [3.0 6.0 3.0 5.0], /s/ = [4.0 6.0 3.0 5.0]
2. Automatic Dialogue Replacement (ADR):
Actor gives a performance for movie. There is background noise, room reverberation, wind, making the audio of low quality. Later, the same actor goes into a studio and records the same lines in an acoustically-controlled environment. But then small timing differences need to be corrected. DTW is used in state-of-the-art ADR.
36
HMMs for Speech

• Prior segmentation of speech into phonetic regions is not required before performing recognition. This provides robustness over methods that first segment and then classify, because any attempt at prior segmentation will introduce errors.
• As we move through an HMM to determine the most likely sequence, we get the segmentation for free.
• The first-order and independence assumptions are correct for some phenomena, but not for speech; however, they make the math much easier.
37
Viterbi Search: Example
Speech/Non-Speech Segmentation (frame rate 100 msec):

Speech = state A, Non-Speech = state B

   t:     1    2    3    4    5
   p(A):  0.1  0.5  0.9  0.1  0.7
   p(B):  0.8  0.6  0.2  0.4  0.2

  (figure: two-state HMM with initial probabilities πA = 0.2, πB = 0.8 and transition probabilities on the arcs between A and B.)
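A sketch of the Viterbi search applied to this example, assuming the transition probabilities aAA = 0.8, aAB = 0.2, aBA = 0.3, aBB = 0.7 (my reading of the figure) and the initial probabilities πA = 0.2, πB = 0.8 from the slide:

```python
# Viterbi on the 2-state speech/non-speech example; transition values
# a_AA = 0.8, a_AB = 0.2, a_BA = 0.3, a_BB = 0.7 are an assumed
# reading of the figure, not confirmed by the transcript.
states = ["A", "B"]                      # A = speech, B = non-speech
pi = {"A": 0.2, "B": 0.8}
a = {("A", "A"): 0.8, ("A", "B"): 0.2,
     ("B", "A"): 0.3, ("B", "B"): 0.7}
b = [{"A": 0.1, "B": 0.8}, {"A": 0.5, "B": 0.6}, {"A": 0.9, "B": 0.2},
     {"A": 0.1, "B": 0.4}, {"A": 0.7, "B": 0.2}]   # p(A), p(B) per frame

delta = {s: pi[s] * b[0][s] for s in states}        # t = 1
back = []
for t in range(1, len(b)):
    prev, delta, psi = delta, {}, {}
    for j in states:
        # delta_t(j) = max_i [delta_{t-1}(i) * a_ij] * b_j(o_t)
        best = max(states, key=lambda i: prev[i] * a[(i, j)])
        delta[j] = prev[best] * a[(best, j)] * b[t][j]
        psi[j] = best
    back.append(psi)

# backtrace from the best final state
last = max(states, key=lambda s: delta[s])
path = [last]
for psi in reversed(back):
    path.append(psi[path[-1]])
path.reverse()
print(path)   # with the assumed transitions: ['B', 'B', 'A', 'A', 'A']
```

Under these assumed transitions the best path labels the first two frames non-speech and the last three speech, matching the intuition from the per-frame probabilities.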