Lecture 5: The Big Picture / Language Modeling
Michael Picheny, Stanley F. Chen, Ellen Eide
IBM T.J. Watson Research Center
Yorktown Heights, NY, USA
{picheny,stanchen,eeide}@us.ibm.com
October 6, 2005
IBM ELEN E6884: Advanced Speech Recognition
Administrivia
Outline
■ The Big Picture (review)
● the story so far, plus situating Lab 1 and Lab 2
■ language modeling
● why?
● how? (grammars, n-grams)
● parameter estimation (smoothing)
● evaluation
Pattern Classification
■ given training samples (x1, ω1), ..., (xN, ωN)
● feature vector x = (x1, ..., xT)
● class labels ω
■ guess class label of unseen feature vectors x (i.e., from the test set)
■ example: guess gender of person given height + weight
● feature vector x = (height, weight)
● class labels ω ∈ {M, F}
Speech Recognition is Pattern Classification
■ e.g., isolated digit recognition
● feature vector x = (x1, ..., xT) is acoustic waveform
● T = 20000 if 1 sec sample, 20 kHz sampling rate
● class labels ω ∈ {ONE, TWO, ..., NINE, ZERO, OH}
Nearest Neighbor Classification
■ to classify a test vector x
● find nearest sample xi in training set
● select associated class ωi
[Figure: scatter plot of training samples, height (cm) vs. weight (kg), male/female]
Nearest Neighbor Classification
Let’s write down some equations
■ for each class ω, we have a model Mω
■ to classify a test vector xtest, find nearest model Mω
class(xtest) = arg min_ω distance(xtest, Mω)
■ nearest neighbor classification (see the sketch below)
● let Mω = {training samples x labeled with class ω}
● distance(xtest, Mω) = min_{x∈Mω} distance(xtest, x)
● e.g., distance(xtest, x) = Euclidean distance
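A minimal sketch of this rule in Python, assuming `models` maps each class ω to its collection of training samples and taking Euclidean distance as the sample distance:

```python
import numpy as np

def nearest_neighbor_classify(x_test, models):
    """class(xtest) = arg min_ω distance(xtest, Mω)."""
    def model_distance(samples):
        # distance to a class = distance to its nearest training sample
        return min(np.linalg.norm(x_test - x) for x in samples)
    return min(models, key=lambda w: model_distance(models[w]))
```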
Speech Recognition as Pattern Classification
Problem 1: the acoustic waveform is not a good representation for nearest neighbor classification
■ consider two equal-length acoustic waveforms
x = (x1, ..., x20000)
x′ = (x′1, ..., x′20000)
■ do we believe that distance(x, x′) ...
● will be smaller if class(x) = class(x′) ...
● than if class(x) ≠ class(x′)?
■ Lab 1: windowing only; computing error rate with DTW
● windows are constant-length acoustic waveforms
● performance is near random
Speech Recognition as Pattern Classification
Finding a transformation F for samples: F(xraw) ⇒ xnew
■ want samples xnew in same class to be close to each other
■ want samples xnew in different classes to be far apart
■ ⇒ signal processing/front end processing
Feature Transformation
[Figure: two scatter plots; left, fractional parts of height and weight (×100), where male/female samples overlap; right, height (cm) vs. weight (kg), where the classes separate]
What Did We Learn in Lab 1?
■ a smoothed frequency-domain representation works well
● FFT computes energy at each frequency (fine-grained)
● mel-binning does coarse-grained grouping (93.6% accuracy)
● log + DCT, Hamming window help
● not completely unlike the human auditory system
■ accuracy on “PCM” (pulse code modulation): 89.1%
● DTW is powerful; speaker-dependent only?
■ DTW is OK for speaker-dependent small-vocabulary isolated word ASR
● >95% accuracy in clean conditions
■ data reduction (see the front end sketch below)
● instead of (x1, ..., x20000) for 1 sec sample
● (~x1, ..., ~x100), dim(~xt) = 12
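A toy sketch of the kind of front end this summarizes: window the waveform, take an FFT, bin the energies coarsely, then log + DCT. All parameter values (frame length, hop, number of bins/coefficients) are illustrative assumptions, and the linear binning is a crude stand-in for true mel-spaced triangular filters:

```python
import numpy as np

def front_end(waveform, frame_len=400, hop=200, n_bins=20, n_ceps=12):
    """Window -> FFT -> coarse binning -> log -> DCT; a toy sketch."""
    window = np.hamming(frame_len)
    feats = []
    for start in range(0, len(waveform) - frame_len + 1, hop):
        frame = waveform[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2        # fine-grained energies
        edges = np.linspace(0, len(power), n_bins + 1, dtype=int)
        logbin = np.log(1e-10 + np.array(
            [power[a:b].sum() for a, b in zip(edges[:-1], edges[1:])]))
        k = np.arange(n_bins)                          # DCT-II, keep n_ceps coeffs
        feats.append([np.sum(logbin * np.cos(np.pi * i * (k + 0.5) / n_bins))
                      for i in range(n_ceps)])
    return np.array(feats)   # ~100 frames of dim 12 for 1 sec at 20 kHz
```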
Speech Recognition as Pattern Classification
Problem 2: acoustic waveforms have different lengths
■ e.g., how to compute:
distance{(~x1, ..., ~x100), (~x′1, ..., ~x′150)}
■ answer: dynamic time warping (asymmetric)
● introduce concept of alignment A = a1 ··· aT
● map each test frame t to a template frame at
Alignments
■ take minimum distance over all possible alignments A

distance(x, x′) = min_A distance_A(x, x′)
■ distance given an alignment is ...
● sum of distances between each pair of aligned frames

distance_A{(~x1, ..., ~xT), (~x′1, ..., ~x′T′)} = Σ_{t=1..T} distance(~xt, ~x′_at)
■ dynamic programming computes min_A distance_A(·, ·) efficiently (see the sketch below)
● search exponential number of alignments in polynomial time
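A minimal dynamic-programming sketch of this computation, assuming Euclidean frame distances, local steps of 0/1/2 template frames per test frame, and an alignment pinned to the first and last template frames:

```python
import numpy as np

def dtw_distance(test, template):
    """min over alignments A of distance_A(test, template)."""
    T, S = len(test), len(template)
    D = np.full((T, S), np.inf)  # D[t, s]: best cost with test frame t on template frame s
    D[0, 0] = np.linalg.norm(test[0] - template[0])
    for t in range(1, T):
        for s in range(S):
            prev = D[t - 1, max(s - 2, 0):s + 1].min()   # stay / advance / skip
            D[t, s] = prev + np.linalg.norm(test[t] - template[s])
    return D[-1, -1]
```

The table D has T·S entries and each takes O(1) work, which is how an exponential number of alignments gets searched in polynomial time.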
DTW as Nearest Neighbor Classification
Lab 1: collect only one training sample xω per class ω
■ variation: keep many/all training samples
[Figure: height (cm) vs. weight (kg) scatter plot, male/female, many samples per class]
Long Live Nearest Neighbor?
Issue 1: outliers
[Figure: height (cm) vs. weight (kg) scatter plot, male/female, with outlier samples]
■ k-nearest neighbors? (see the sketch below)
● find k > 1 nearest training samples; pick most frequent class
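A sketch of that vote, assuming Euclidean distance and array-like samples/labels:

```python
from collections import Counter
import numpy as np

def knn_classify(x_test, samples, labels, k=3):
    """Vote among the k nearest training samples; ties broken arbitrarily."""
    order = np.argsort([np.linalg.norm(x_test - x) for x in samples])
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]   # most frequent class among the k
```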
Long Live k-Nearest Neighbors?
Issue 2: different sample densities
[Figure: height (cm) vs. weight (kg) scatter plot illustrating classes with different sample densities]
■ averaging all samples in a class to find “canonical” sample?
Long Live Canonical Samples?
Issue 3: bimodal distributions
[Figure: height (cm) vs. weight (kg) scatter plot illustrating a bimodal class distribution]
■ what if we try to compute several “canonical” samples?
Is There a Better Way?
■ using “distance” is too restrictive?
■ more generally, want some sort of “score”: scoreω(x)
● reflects how strongly point x belongs to class ω
■ to classify a sample x, pick class with highest score
Is There a Better Way?
■ is there a framework for assigning scores that ...
● can handle arbitrarily weird sample distributions?
● outliers, varying sample densities, bimodal, etc.
● provides guidance on how to find models that perform well at classification?
● provides tools for finding these models efficiently?
■ ⇒ probabilistic modeling!
Probability 101
■ probabilities P(x) are nonnegative scores that sum to 1

Σ_x P(x) = 1,  P(x) ≥ 0

● semantics: P(x) ∼ frequency of event x
■ joint probabilities: P(x, y)

Σ_{x,y} P(x, y) = 1,  P(x, y) ≥ 0
Probability 101
■ marginal probabilities

P(x) = Σ_y P(x, y),  P(y) = Σ_x P(x, y)

● probability that I trip is equal to probability I trip and fall plus the probability I trip and don’t fall
■ conditional probabilities
P(x, y) = P(x) P(y|x) = P(y) P(x|y)

● probability that I trip and fall is equal to probability that I trip times the probability that I fall given that I trip
Probabilistic Modeling as Nearest Neighbor
■ for each class ω, we have a model Pω(·)
● Pω(x) is a probability distribution over samples x
● proportional to frequency that samples from class ω are located at/around x
■ to classify a test vector xtest, find nearest model Pω(·)

class(xtest) = arg min_ω distance(xtest, Pω(·))

■ define distance as negative log probability (see the sketch below)
● distance(xtest, Pω(·)) ≡ − log Pω(xtest)
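A tiny sketch of this definition; `class_log_probs` is a hypothetical mapping from each class ω to a function computing log Pω(x):

```python
def classify(x_test, class_log_probs):
    """arg min_ω of -log Pω(xtest), i.e., arg max_ω of log Pω(xtest)."""
    return min(class_log_probs,
               key=lambda w: -class_log_probs[w](x_test))
```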
What Does Probabilistic Modeling Give Us?
■ if each of our class probability distributions Pω(x) is (perfectly) accurate ...
● we can perform classification optimally!
■ e.g., two-class classification ω ∈ {M, F} (equally frequent)
● choose M if PM(x) > PF(x)
● choose F if PF(x) > PM(x)
● this is the best you can do!
What Does Probabilistic Modeling Give Us?
■ consider a simple (1-D) probability distribution (Gaussian)

Pω(x) = (1 / (√(2π) σ)) exp[ −(1/2) ((x − µ)/σ)² ]

● parameters µ, σ
● assume distribution of samples from class ω is truly Gaussian
■ maximum likelihood estimate (µMLE, σMLE) (see the sketch below)
● choose parameter values that maximize likelihood/probability of training data (x1, ω1), ..., (xN, ωN)

(µMLE, σMLE) = arg max_{(µ,σ)} Π_{i=1..N} Pωi(xi)
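For the Gaussian this arg max has a closed form: the sample mean and the 1/N-normalized sample standard deviation. A minimal sketch:

```python
import numpy as np

def fit_gaussian_mle(samples):
    """MLE for a 1-D Gaussian: sample mean and 1/N-normalized std."""
    mu = np.mean(samples)
    sigma = np.sqrt(np.mean((samples - mu) ** 2))   # 1/N, not 1/(N-1)
    return mu, sigma

def log_gaussian(x, mu, sigma):
    """log Pω(x) for the Gaussian on this slide."""
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - 0.5 * ((x - mu) / sigma) ** 2
```

Fitting one Gaussian per class and plugging `log_gaussian` into the `classify` sketch above yields the probabilistic classifier of the previous slide.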
What Does Probabilistic Modeling Give Us?
■ in the presence of infinite training samples
● maximum likelihood estimation (MLE) approaches the “true” parameter estimates
● for most models, MLE is asymptotically consistent, unbiased, and efficient
■ maximum likelihood estimation is easy for a wide class of models
● count and normalize
Long Live Probabilistic Modeling!
■ if we choose the form of Pω(x) correctly and have lots of data ...
● we win!
■ ASR moved to probabilistic modeling in the mid-’70s
■ in ASR, no serious challenges to this framework since then
■ by and large, probabilistic modeling ...
● does as well as or better than all other techniques ...
● on all classification problems
● (not everyone agrees)
What’s the Catch?
■ guarantees only hold if we ...
● know the form of the probability distribution that the data comes from
● MLE tells us how to find parameters only if we know the form of the model
● can actually find the MLE
● in training HMM’s with Forward-Backward, can only find a local optimum
● MLE solution for GMM’s is pathological
● have an infinite training set
● “there’s no data like more data”
■ these are the challenges!
From DTW to Probabilistic Modeling
■ HMM’s are a natural way to express DTW as a probabilistic model
■ replace the template for each word with an HMM
● each frame in template corresponds to state in HMM
[Diagram: left-to-right HMM with states 1–7; each state i has a self-loop and an arc to state i+1, both with output distribution P(x|ai)]
■ alignment A = a1 ··· aT
● DTW: map each test frame t to template frame at
● HMM: map each test frame t to arc at in HMM
● each alignment corresponds to path through HMM
■ can design HMM such that (see Holmes, Sec. 9.13, p. 155)
distance_DTW(xtest, xω) ≈ − log P_ω^HMM(xtest)
DTW vs. HMM’s
DTW                        HMM
acoustic template          HMM
frame in template          state in HMM
DTW alignment              path through HMM
move cost                  transition (log)prob
distance between frames    output (log)prob
DTW search                 Forward/Viterbi algorithm
Key Points for HMM’s
■ an HMM describes a probability distribution over observation sequences x (i.e., acoustic feature vectors): P(x)
● separate HMM for each word ω, describing likely acoustic feature vector sequences for that word: Pω(x)
■ a path through the HMM
● alignment A = a1 ··· aT between arcs at and observations (frames) xt
■ to calculate probability Pω(x), sum (or max) probability for each alignment A
● Forward algorithm: Pω(x) = Σ_A Pω(x, A)
● Viterbi algorithm: Pω(x) ≈ max_A Pω(x, A)
Key Points for HMM’s
■ probability of alignment and observations

Pω(x, A) = Pω(A) Pω(x|A)

● alignment probability Pω(A): product of transition probabilities along path

Pω(A) = Π_{t=1..T} Ptrans(at)

● observation probability Pω(x|A): product of observation probabilities along path

Pω(x|A) = Π_{t=1..T} Pobs(~xt|at)

(a sketch of Viterbi/Forward over this decomposition follows below)
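A minimal sketch of the Viterbi recursion over this decomposition for a left-to-right HMM. `log_trans[s, s']` and `log_obs[t, s]` are assumed precomputed log Ptrans and log Pobs tables, and the initial transition probability is ignored for simplicity; replacing `max` with `np.logaddexp` turns it into the Forward algorithm:

```python
import numpy as np

def viterbi_log_prob(log_trans, log_obs):
    """max_A log Pω(x, A) for a left-to-right HMM (arcs 0..S-1)."""
    T, S = log_obs.shape
    V = np.full((T, S), -np.inf)
    V[0, 0] = log_obs[0, 0]                  # alignment starts on the first arc
    for t in range(1, T):
        for s in range(S):
            stay = V[t - 1, s] + log_trans[s, s]
            advance = V[t - 1, s - 1] + log_trans[s - 1, s] if s > 0 else -np.inf
            V[t, s] = max(stay, advance) + log_obs[t, s]
    return V[-1, -1]                         # alignment ends on the last arc
```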
Key Points for HMM’s
■ each observation probability modeled as a Gaussian
● diagonal covariance

Pobs(~xt|at) = Π_{d=1..D} (1 / (√(2π) σ_{at,d})) exp[ −(1/2) ((x_{t,d} − µ_{at,d}) / σ_{at,d})² ] = Π_{d=1..D} N(x_{t,d}; µ_{at,d}, σ_{at,d})
■ or mixture of Gaussians (see the sketch below)

Pobs(~xt|at) = Σ_{m=1..M} λ_{at,m} Π_{d=1..D} N(x_{t,d}; µ_{at,m,d}, σ_{at,m,d})
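A sketch of both output models in log space; `weights` (the λ’s) has shape (M,), `mus`/`sigmas` have shape (M, D), and all parameters are assumed already trained:

```python
import numpy as np

def log_diag_gaussian(x, mu, sigma):
    """log Π_d N(x_d; µ_d, σ_d) with diagonal covariance."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - 0.5 * ((x - mu) / sigma) ** 2)

def log_gmm(x, weights, mus, sigmas):
    """log Σ_m λ_m Π_d N(x_d; µ_{m,d}, σ_{m,d}), computed stably."""
    comp = np.array([np.log(w) + log_diag_gaussian(x, m, s)
                     for w, m, s in zip(weights, mus, sigmas)])
    hi = comp.max()                      # logsumexp trick avoids underflow
    return hi + np.log(np.sum(np.exp(comp - hi)))
```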
Why Gaussians and GMM’s?
■ a Gaussian is good at modeling a single cluster of samples
[Figure: training samples for some arc in HMM, forming a single cluster]
Why Gaussians and GMM’s?
■ if you sum enough Gaussians, you can approximate any distribution arbitrarily well
[Figure: training samples for some arc in HMM, spread over several clusters]
Optimizing HMM’s for Isolated Word Recognition
■ each frame in DTW template for word ω corresponds to state in HMM?
■ consecutive frames may be very similar to each other
● no point having consecutive states in HMM with nearly identical output distributions
[Diagram: left-to-right HMM with states 1–7; each state i has a self-loop and an arc to state i+1, both with output distribution P(x|ai)]
■ how many states do we really need?
● rule of thumb: three states per phoneme
● one for start, middle, and end of phoneme
Word HMM’s
■ word TWO is composed of phonemes T UW
■ two phonemes ⇒ six HMM states
[Diagram: left-to-right HMM with six emitting states for T UW; each state i has a self-loop and a forward arc with output distribution P(x|ai)]
■ convention: same output distribution for all arcs exiting a state
● or place output distributions on states
■ parameters: 12 transition probabilities, 6 output distributions
Isolated Word Recognition with HMM’s
■ training
● for each class (word) ω, collect all training samples x with that label
● to train HMM for word ω, run FB on these samples
■ decoding a test sample xtest (see the sketch below)
● calculate (Viterbi or Forward) likelihood Pω(xtest) of the sample with each word model Pω(·), pick best

class(xtest) = arg max_ω Pω(xtest)
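A sketch of that decoding loop, reusing `viterbi_log_prob` from the earlier sketch; `word_hmms` is a hypothetical mapping from each word to its (log_trans, log_obs) tables for the test sample:

```python
import numpy as np

def recognize_isolated_word(word_hmms):
    """class(xtest) = arg max_ω of the Viterbi log-likelihood."""
    best_word, best_score = None, -np.inf
    for word, (log_trans, log_obs) in word_hmms.items():
        score = viterbi_log_prob(log_trans, log_obs)
        if score > best_score:
            best_word, best_score = word, score
    return best_word
```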
Control Flow for Training
For each iteration of Forward-Backward training
■ initialize training counts to 0
■ for each utterance (xi, ωi) in training data
● calculate acoustic feature vectors from xi using front end
● construct HMM representing word sequence ωi
● run FB algorithm to update training counts for all transitions present in HMM
■ reestimate HMM parameters using training counts
Decoding with HMM’s
■ instead of separate HMM for each word ω
● merge each word HMM into single big HMM
[Diagram: single decoding graph from state 1 to state 2, with a parallel branch per word: ONE, TWO, ...]
■ use Viterbi algorithm (not Forward algorithm, why?)
● in backtrace, collect all words along the way
■ why one big graph?
● when we move to word sequences, can’t have HMM per sequence
● pruning during search
● graph minimization
Continuous Speech Recognition with HMM’s
■ e.g., ASR for digit sequences of arbitrary length
● set of classes ω for classification is infinite
■ instead of separate HMM parameters for each digit string ω
● have one HMM for each digit as before
● glue together to make HMM for a digit string
Continuous Speech Recognition with HMM’s
■ training
● e.g., reference transcript: ONE TWO
[Diagram: training graph for reference transcript ONE TWO: state 1 →ONE→ state 2 →TWO→ state 3]
● for each training sample, update counts for each word HMM in transcript
■ decoding
● want HMM that represents all digit strings
[Diagram: decoding graph looping at state 1 with one branch per digit: ONE, TWO, THREE, ...]
Silence is Golden
■ what about all those pauses in speech?
● at begin/end of utterance; in between words
■ add silence word ~SIL to vocabulary
● three states? (skip arcs?)
■ allow optional silence at beginning/end of utterances and in between words
Silence is Golden
■ e.g., HMM for training with reference transcript: ONE TWO
[Diagram: training graph for ONE TWO with optional ~SIL at the start, between the words, and at the end]
■ in practice, often precompute best word sequence using some other model
● e.g., bootstrap models without using optional silences
■ Lab 2: all these graphs are automatically constructed for you
● silence is also used in isolated word training/decoding
Wreck a Nice Beach?
The story so far:
■ to decode a test sample xtest, find best word sequence ω

class(xtest) = arg max_ω Pω(xtest) ≡ arg max_ω P(xtest|ω)

● find word sequence ω maximizing likelihood P(xtest|ω)
● maximum likelihood classification
■ well, you know:
THIS IS OUR ROOM FOR A FOUR HOUR PERIOD .
THIS IS HOUR ROOM FOUR A FOR OUR . PERIOD