CS 552/652 Speech Recognition with Hidden Markov Models, Winter 2011
Oregon Health & Science University, Center for Spoken Language Understanding
John-Paul Hosom. Lecture 6, January 24: HMMs for speech; review anatomy/framework of HMM; start Viterbi search.
1
CS 552/652 Speech Recognition with Hidden Markov Models
Winter 2011
Oregon Health & Science University, Center for Spoken Language Understanding
John-Paul Hosom
Lecture 6, January 24
HMMs for speech; review anatomy/framework of HMM; start Viterbi search
2
HMMs for Speech
• Speech is the output of an HMM; problem is to find most likely model for a given speech observation sequence.
• Speech is divided into sequence of 10-msec frames, one frame per state transition (faster processing). Assume speech can be recognized using 10-msec chunks.
• Each vertical line in the figure delineates one observation, ot (T = 80 frames in this example)
3
HMMs for Speech
4
HMMs for Speech
• Each state can be associated with a sub-phoneme, a phoneme, or a sub-word
• Usually, sub-phonemes or sub-words are used, to account for spectral dynamics (coarticulation).
• One HMM corresponds to one phoneme or word
• For each HMM, determine the probability of the best state sequence that results in the observed speech.
• Choose HMM with best match (probability) to observed speech.
• Given most likely HMM and state sequence, maybe determine the corresponding phoneme and word sequence.
5
HMMs for Speech
• Example of states for word model:

  (figure: 3-state word model for “cat”: states k → ae → t, with self-loop probabilities 0.9, 0.5, 0.7 and forward transition probabilities 0.1, 0.5, 0.3.

  5-state word model for “cat” with null states: <null> → k → ae → t → <null>, with transition probability 1.0 into and out of the null states.)
6
HMMs for Speech
• Example of states for word model:

  (figure: 7-state word model for “cat” with null states: <null> → k → ae1 → ae2 → tcl → t → <null>; each emitting state has a self-loop, and the null states are entered and exited with probability 1.0.)
• Null states do not emit observations, and are entered and exited at the same time t. Theoretically, they are unnecessary. Practically, they can make implementation easier.
• States don’t have to correspond directly to phonemes, but are commonly labeled using phonemes.
7
HMMs for Speech
• Example of using HMM for word “yes” on an utterance:

  (figure: states sil → y → eh → s → sil, with self-loop probabilities 0.6, 0.3, 0.5, 0.8, forward transitions 0.4, 0.7, 0.5, 0.2, and final sil self-loop 1.0; observations o1 o2 o3 o4 o5 o6 o7 o8 … o29 are each aligned with one state.)

  bsil(o1)·0.6·bsil(o2)·0.6·bsil(o3)·0.6·bsil(o4)·0.4·by(o5)·0.3·by(o6)·0.3·by(o7)·0.7 ...
8
HMMs for Speech
• Example of using HMM for word “no” on the same utterance:

  (figure: states sil → n → ow → sil, with self-loop probabilities 0.6, 0.2, 0.9, forward transitions 0.4, 0.8, 0.1, and final sil self-loop 1.0; observations o1 o2 o3 o4 o5 o6 o7 o8 … o29.)

  bsil(o1)·0.6·bsil(o2)·0.6·bsil(o3)·0.4·bn(o4)·0.8·bow(o5)·0.9·bow(o6)·0.9 ...
9
HMMs for Speech
• Because of coarticulation, states are sometimes made dependent on preceding and/or following phonemes (context dependent).
  ae (monophone model)
  k-ae+t (triphone model)
  k-ae (diphone model)
  ae+t (diphone model)
• Constructing words requires matching the contexts:
• “cat”:
sil-k+ae k-ae+t ae-t+sil
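Context matching like this can be illustrated with a small helper (a hypothetical function, not part of the lecture material) that expands a monophone sequence into triphones:

```python
def to_triphones(phones, context="sil"):
    """Expand a monophone sequence into context-dependent triphones
    of the form left-phone+right, padding both ends with silence."""
    padded = [context] + list(phones) + [context]
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

# "cat" = /k ae t/ with silence context on both sides:
print(to_triphones(["k", "ae", "t"]))
# ['sil-k+ae', 'k-ae+t', 'ae-t+sil']
```

Adjacent triphones then automatically satisfy the matching constraint: each model's right context equals the next model's center phone.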
10
HMMs for Speech
• This permits several different models for each phoneme, depending on the surrounding phonemes (context sensitive):
  k-ae+t   p-ae+t   k-ae+p
• Probability of an “illegal” state sequence is zero (never used): e.g., sil-k+ae followed by p-ae+t has transition probability 0.0.
• Much larger number of states to train: 50 monophones vs. 125,000 triphones for a full set of phonemes; 39 vs. 59,319 for a reduced set.
11
HMMs for Speech
• Example of 3-state, triphone HMM (expanded from the previous example):

  (figure: the single monophone states y and eh, each with a self-loop, are expanded into three triphone states apiece: sil-y+eh sil-y+eh sil-y+eh followed by y-eh+s y-eh+s y-eh+s, each state with its own self-loop and forward transition probability.)
12
HMMs for Speech

• 1-state monophone (context independent)
• 3-state monophone (context independent)
• 1-state triphone (context dependent)
• 3-state triphone (context dependent)
• what about a context-independent triphone??

  (figures: a 1-state monophone y; a 3-state triphone sil-y+eh sil-y+eh sil-y+eh; a 1-state triphone sil-y+eh; and a 3-state model with states labeled y1 y2 y3, i.e. a “context-independent triphone”, which is a 3-state monophone with position-dependent states.)
13
HMMs for Speech
• Typically, one HMM = one word or phoneme
• Join HMMs to form sequence of phonemes = word-level HMM
• Join words to form sentences = sentence-level HMM
• Use <null> states at ends of HMM to simplify implementation
  (figure: word models for “cat” (k ae t) and “sat” (s ae t), each with <null> states at both ends; the exit null of one word model connects to the entry null of the next with an instantaneous transition (i.t.) of probability 1.0.)
14
HMMs for Speech
• Reminder of big picture:
  feature computation at each frame (cepstral features)
  (figure from Encyclopedia of Information Systems, 2002)
15
HMMs for Speech
Notes:
• Assume that speech observation is stationary for 1 frame
• If frame is small enough, and enough states are used, we can approximate dynamics of speech:
• The use of context-dependent states accounts (somewhat) for context-dependent nature of speech.
  (figure: states s1 s2 s3 s4 s5 approximating the spectral dynamics of /ay/, with frame size = 4 msec)
16
HMMs for Word Recognition
Different Topologies are Possible:
• “standard”: states A1, A2, A3, each with a self-loop and a single forward transition
• “short phoneme”: the standard topology plus a skip transition (A1 → A3, probability 0.2), so a very short phoneme can be traversed in fewer frames than states
• “left-to-right”: five states A1 … A5, with self-loops and forward transitions only

  (figure: transition diagrams for the three topologies)
17
Anatomy of an HMM
HMMs for speech:
• first-order HMM
• one HMM per phoneme or word
• 3 states per phoneme-level HMM, more for word-level HMM
• sequential series of states, each with self-loop
• link HMMs together to form words and sentences
• GMM: many Gaussian components per state (e.g., 16)
• context-dependent HMMs: (phoneme-level) HMMs can be linked together only if their contexts correspond
18
Anatomy of an HMM
HMMs for speech (cont’d):
• speech signal divided into 10-msec quanta (frames)
• 1 HMM state transition per 10-msec frame
• use self-loops for speech units that span more frames than the model has states
• trace through an HMM to determine the probability of an utterance and its state sequence.
19
Anatomy of an HMM
• Diagram of one HMM, /y/, in the context of preceding silence, followed by /eh/:

  (figure: three states sil-y+eh with entry probability 0.5, self-loop probabilities 0.2, 0.3, 0.2, and forward transitions 0.8, 0.7, 0.8. Each state j carries GMM output parameters for mixture components k = 1…3: a mean vector μjk, a covariance matrix Σjk, and a scalar mixture weight cjk.)
20
Framework for HMMs
• N = number of states: 3 per phoneme, >3 per word
• S = states {S1, S2, S3, … , SN}. Even though any state can output (any) observation, we associate the most likely output with the state name. Often we use context-dependent phonetic states (triphones): {sil-y+eh y-eh+s eh-s+sil …}
• T = final time of output; t = {1, 2, … T}
• O = observations {o1 o2 … oT}: the actual output generated by the HMM; features (cepstral, LPC, MFCC, PLP, etc.) of a speech signal
21
Framework for HMMs

• M = number of observation symbols per state = number of codewords for a discrete HMM; “infinite” for a continuous HMM
• v = symbols {v1 v2 … vM}: “codebook indices” generated by a discrete (VQ) HMM; for speech, indices point to locations in feature space. There is no direct correspondence for a continuous HMM; the output of a continuous HMM is a sequence of observations {speech vector 1, speech vector 2, …}, and the output can be any point in continuous n-dimensional space.
• A = matrix of transition probabilities {aij}, where aij = P(qt = j | qt-1 = i); ergodic HMM: all aij > 0
• B = set of parameters for determining probabilities bj(ot):
  bj(ot) = P(ot = vk | qt = j)   (discrete: codebook)
  bj(ot) = P(ot | qt = j)        (continuous: GMM)
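As an illustration of the continuous case, bj(ot) for a state modeled by a diagonal-covariance GMM could be sketched as follows (hypothetical helper and parameter names, not the lecture's code):

```python
import math

def gmm_log_b(o, weights, means, variances):
    """log b_j(o_t) for one state with a diagonal-covariance GMM:
    b_j(o) = sum_k c_jk * N(o; mu_jk, sigma_jk^2)."""
    total = 0.0
    for c, mu, var in zip(weights, means, variances):
        # log-density of one diagonal Gaussian component,
        # summed over feature dimensions
        log_n = sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
                    for x, m, v in zip(o, mu, var))
        total += c * math.exp(log_n)
    return math.log(total)
```

For a single unit-variance component evaluated at its own mean in 2 dimensions, this returns -log(2π), the log of the peak density.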
22
Framework for HMMs
• π = initial state distribution {πi}, where πi = P(q1 = i)
• λ = the entire model: λ = (A, B, π)
23
Framework for HMMs
• Example: “hi”

  (figure: two states, sil-h+ay and h-ay+sil, with self-loop probabilities 0.3 and 0.4 and forward transitions 0.7 and 0.6; initial probabilities π1 = 1.0, π2 = 0.0; the output distributions b1 and b2 are plotted over the 1-dimensional feature value.)

• observed features: o1 = {0.8}, o2 = {0.8}, o3 = {0.2}
• what is the probability of O given the state sequence {sil-h+ay h-ay+sil h-ay+sil} = {1 2 2}?
24
Framework for HMMs

• Example: “hi” (continued); o1 = 0.8, o2 = 0.8, o3 = 0.2, state sequence {q1 q2 q2}, with b1(0.8) = 0.76, b2(0.8) = 0.27, b2(0.2) = 0.82 read from the output distributions:

  P = π1 · b1(o1) · a12 · b2(o2) · a22 · b2(o3)
  P = 1.0 · 0.76 · 0.7 · 0.27 · 0.4 · 0.82
  P = 0.0471
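The arithmetic above can be verified in a few lines (probability values taken from the slide):

```python
# P = pi_1 * b_1(o1) * a_12 * b_2(o2) * a_22 * b_2(o3) for the "hi" model:
# pi_1 = 1.0, a_12 = 0.7, a_22 = 0.4,
# b_1(0.8) = 0.76, b_2(0.8) = 0.27, b_2(0.2) = 0.82.
pi_1 = 1.0
a_12, a_22 = 0.7, 0.4
b1_o1, b2_o2, b2_o3 = 0.76, 0.27, 0.82

P = pi_1 * b1_o1 * a_12 * b2_o2 * a_22 * b2_o3
print(round(P, 4))  # 0.0471
```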
25
Framework for HMMs
• What is probability of an observation sequence and state sequence, given the model?
P(O, q | λ) = P(O | q, λ) · P(q | λ)

• What is the “best” valid state sequence from time 1 to time T, given the model?
• At every time t, we can connect to up to N states, so there are up to N^T possible state sequences (for one second of speech with 3 states, N^T = 3^100 ≈ 10^47 sequences): infeasible!!
26
Viterbi Search: Formula
• Use inductive procedure (see first part of Lecture 2)
• Best sequence (highest probability) up to time t ending in state i is defined as:
]|...,,...[,...,max)( 21121
121 ttt
tt iqqqqPqqqi ooo
• First iteration (t=1):
]|,[)( 111 oiqPi
],|[]|[)( 1111 iqPiqPi o
)()( 11 oii bi
)|()()( ABPAPBAP
• Question 1: What is best score along a single path, up to time t,ending in state i?
27
Viterbi Search: Formula
• Second iteration (t = 2), writing q1 = k:

  δ2(i) = max over k of  P[q1 = k, q2 = i, o1 o2 | λ]
  δ2(i) = max over k of  P[q1 = k, o1 | λ] · P[q2 = i, o2 | q1 = k, o1, λ]
  δ2(i) = max over k of  δ1(k) · P[q2 = i | q1 = k, o1, λ] · P[o2 | q2 = i, q1 = k, o1, λ]
28
Viterbi Search: Formula
• Second iteration (t = 2) (continued…):

  δ2(i) = max over k of  δ1(k) · P[q2 = i | q1 = k, o1, λ] · P[o2 | q2 = i, q1 = k, o1, λ]

  P(o2) is independent of o1 and q1, and P(q2) is independent of o1, so:

  δ2(i) = max over k of  δ1(k) · P[q2 = i | q1 = k, λ] · P[o2 | q2 = i, λ]
  δ2(i) = max over k of  [δ1(k) · aki] · bi(o2)
29
Viterbi Search: Formula
• In general, for any value of t:

  δt(i) = max over q1, …, qt-1 of  P[q1 q2 … qt-1, qt = i, o1 o2 … ot | λ]

  δt(i) = max over q1, …, qt-1 of  P[q1 … qt-1, o1 … ot-1 | λ] · P[qt = i, ot | q1 … qt-1, o1 … ot-1, λ]

  change notation to say that we call state qt-1 by the variable name “k”:

  δt(i) = max over q1, …, qt-2, k of  P[q1 … qt-2, qt-1 = k, o1 … ot-1 | λ] · P[qt = i, ot | q1 … qt-2, qt-1 = k, o1 … ot-1, λ]

  the first term now equals δt-1(k):

  δt(i) = max over q1, …, qt-2, k of  δt-1(k) · P[qt = i | q1 … qt-2, qt-1 = k, o1 … ot-1, λ] · P[ot | qt = i, q1 … qt-2, qt-1 = k, o1 … ot-1, λ]
30
Viterbi Search: Formula
• In general, for any value of t (continued…):

  δt(i) = max over q1, …, qt-2, k of  δt-1(k) · P[qt = i | q1 … qt-2, qt-1 = k, o1 … ot-1, λ] · P[ot | qt = i, q1 … qt-2, qt-1 = k, o1 … ot-1, λ]

  q1 through qt-2 have been removed from the equation (they are implicit in δt-1(k)):

  δt(i) = max over k of  δt-1(k) · P[qt = i | qt-1 = k, o1 … ot-1, λ] · P[ot | qt = i, qt-1 = k, o1 … ot-1, λ]

  now make the 1st-order Markov assumption, and the assumption that P(ot) depends only on the current state i and the model λ:

  δt(i) = max over k of  δt-1(k) · P[qt = i | qt-1 = k, λ] · P[ot | qt = i, λ]
  δt(i) = max over k of  [δt-1(k) · aki] · bi(ot)
31
Viterbi Search: Formula

• In general, for any value of t:

  δt(j) = max over i of  [δt-1(i) · aij] · bj(ot)

• We have shown that if we can compute the highest probability for all states at time t-1, then we can compute the highest probability for any state j at time t.
• We have also shown that we can compute the highest probability for any state j (or all states) at time 1.
• Therefore, our inductive proof shows that we can compute the highest probability of an observation sequence (making the assumptions noted above) for any state j up to time t.
• Best path from {1, 2, … t} is not dependent on future times {t+1, t+2, … T} (from the definition of the model).
• Best path from {1, 2, … t} is not necessarily the same as the best path from {1, 2, … (t-1)} concatenated with the best path {(t-1) → t}.
32
Viterbi Search: Formula
• Keep in memory only δt-1(i) for all i.
• For each time t and state j, we need (N multiplies and compares) + (1 multiply).
• For each time t, we therefore need N · ((N multiplies and compares) + (1 multiply)).
• To find the best path, we need O(N²T) operations.
• This is much better than the N^T possible paths, especially for large T!
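The δ recursion and the O(N²T) bookkeeping can be sketched as follows (a minimal illustration, not the lecture's code; log probabilities are used to avoid numerical underflow):

```python
import math

def viterbi(pi, A, B):
    """Minimal Viterbi search: pi[i] = initial probability of state i,
    A[i][j] = transition probability i -> j, B[t][i] = b_i(o_t).
    Returns (best log-probability, best state path); O(N^2 T) time,
    keeping only delta_{t-1} in memory plus backpointers."""
    log = lambda p: math.log(p) if p > 0 else float("-inf")
    N, T = len(pi), len(B)
    delta = [log(pi[i]) + log(B[0][i]) for i in range(N)]   # t = 1
    backptr = []
    for t in range(1, T):
        prev = delta
        delta, psi = [], []
        for j in range(N):
            # delta_t(j) = max_i [delta_{t-1}(i) * a_ij] * b_j(o_t)
            i_best = max(range(N), key=lambda i: prev[i] + log(A[i][j]))
            delta.append(prev[i_best] + log(A[i_best][j]) + log(B[t][j]))
            psi.append(i_best)
        backptr.append(psi)
    # backtrace from the best final state
    state = max(range(N), key=lambda i: delta[i])
    path = [state]
    for psi in reversed(backptr):
        path.append(psi[path[-1]])
    path.reverse()
    return delta[state], path
```

For example, viterbi([1.0, 0.0], [[0.5, 0.5], [0.5, 0.5]], [[0.9, 0.1], [0.1, 0.9]]) recovers the path [0, 1].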
33
Viterbi Search: Comparison with DTW
Note the similarities to DTW:
• best path to an end time is computed using only previous data points (i.e. in DTW, points in lower-left quadrant; in Viterbi search, previous time values)
• best path for entire utterance is computed from best path when time t=T.
• DTW cost D for a point (x,y) is computed using cumulative cost for previous points, transition cost (path weights), and local cost for current point (x,y).
• Viterbi probability for a time t and state j is computed using the cumulative probability for previous time points and states, transition probabilities, and the local observation probability for the current time point and state.
34
Viterbi Search: Comparison with DTW
“Hybrid” between DTW and Viterbi: Use multiple templates
1. Collect N templates. Use DTW to find template n which has lowest D with all other templates. Use DTW to align all other templates with template n, creating warped templates.
2. At each frame in template n, compute average feature value and standard deviation of feature values over all warped templates.
3. When performing DTW, don’t use Euclidean distance to get d value between input at frame t (ot) and template at frame u, but d(t,u) = negative log probability of ot (input at t) given mean and standard deviation of template at frame u, assuming Normal distribution. (If template data at frame u are not Normally distributed, can use GMM instead.)
This can be viewed as an HMM with the number of states equal to the number of frames in template n, and (possibly a second-order) Markov process with transition probabilities associated with only local states (frames).
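Step 3's local distance can be sketched as a negative log Gaussian density (an illustrative helper with hypothetical parameter names; diagonal covariance assumed):

```python
import math

def local_cost(o_t, mu_u, sigma_u):
    """d(t, u) = -log N(o_t; mu_u, sigma_u^2): negative log probability
    of the input frame o_t under the template's per-frame Normal
    distribution, summed over feature dimensions."""
    return sum(0.5 * (math.log(2 * math.pi * s * s) + ((x - m) / s) ** 2)
               for x, m, s in zip(o_t, mu_u, sigma_u))
```

A frame near the template mean gets a low cost; frames far from the mean (measured in standard deviations) are penalized quadratically.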
35
Viterbi Search: Comparison with DTW
Other uses of DTW
1. Aligning Phoneme Sequences:

   words            TIMIT phonemes                  Worldbet phonemes
   “this is easy”   /dh ih s I z .pau iy z iy/      /D I s I z .pau i: z i:/
   “this was easy”  /dh ih s .pau w ah z iy z iy/   /D I s .pau w ^ z i: z i:/

   Define phonemes in a multi-dimensional feature space such as {Voicing, Manner, Place, Height}: /iy/ = [1.0 1.0 1.0 4.0], /z/ = [3.0 6.0 3.0 5.0], /s/ = [4.0 6.0 3.0 5.0]
2. Automatic Dialogue Replacement (ADR):
Actor gives a performance for movie. There is background noise, room reverberation, wind, making the audio of low quality. Later, the same actor goes into a studio and records the same lines in an acoustically-controlled environment. But then small timing differences need to be corrected. DTW is used in state-of-the-art ADR.
36
HMMs for Speech

• Prior segmentation of speech into phonetic regions is not required before performing recognition. This provides robustness over methods that first segment and then classify, because any attempt at prior segmentation will introduce errors.
• As we move through an HMM to determine the most likely sequence, we get the segmentation for free.
• The first-order and independence assumptions are correct for some phenomena, but not for speech; however, they make the math much easier.
37
Viterbi Search: Example
Speech/Non-Speech Segmentation (frame rate 100 msec):

Speech = state A, Non-Speech = state B

   t:     1    2    3    4    5
   p(A):  0.1  0.5  0.9  0.1  0.7
   p(B):  0.8  0.6  0.2  0.4  0.2

  (figure: two-state HMM with initial probabilities πA = 0.2, πB = 0.8 and transition probabilities on the arcs between A and B.)
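A sketch of the Viterbi search applied to this example, assuming the transition probabilities aAA = 0.8, aAB = 0.2, aBA = 0.3, aBB = 0.7 (my reading of the figure) and the initial probabilities πA = 0.2, πB = 0.8 from the slide:

```python
# Viterbi on the 2-state speech/non-speech example; transition values
# a_AA = 0.8, a_AB = 0.2, a_BA = 0.3, a_BB = 0.7 are an assumed
# reading of the figure, not confirmed by the transcript.
states = ["A", "B"]                      # A = speech, B = non-speech
pi = {"A": 0.2, "B": 0.8}
a = {("A", "A"): 0.8, ("A", "B"): 0.2,
     ("B", "A"): 0.3, ("B", "B"): 0.7}
b = [{"A": 0.1, "B": 0.8}, {"A": 0.5, "B": 0.6}, {"A": 0.9, "B": 0.2},
     {"A": 0.1, "B": 0.4}, {"A": 0.7, "B": 0.2}]   # p(A), p(B) per frame

delta = {s: pi[s] * b[0][s] for s in states}        # t = 1
back = []
for t in range(1, len(b)):
    prev, delta, psi = delta, {}, {}
    for j in states:
        # delta_t(j) = max_i [delta_{t-1}(i) * a_ij] * b_j(o_t)
        best = max(states, key=lambda i: prev[i] * a[(i, j)])
        delta[j] = prev[best] * a[(best, j)] * b[t][j]
        psi[j] = best
    back.append(psi)

# backtrace from the best final state
last = max(states, key=lambda s: delta[s])
path = [last]
for psi in reversed(back):
    path.append(psi[path[-1]])
path.reverse()
print(path)   # with the assumed transitions: ['B', 'B', 'A', 'A', 'A']
```

Under these assumed transitions the best path labels the first two frames non-speech and the last three speech, matching the intuition from the per-frame probabilities.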