MATH 829: Introduction to Data Mining and Analysis - Hidden Markov Models
TRANSCRIPT
MATH 829: Introduction to Data Mining andAnalysis
Hidden Markov Models
Dominique Guillot
Departments of Mathematical Sciences
University of Delaware
May 11, 2016
Hidden Markov Models

Recall: a (discrete-time homogeneous) Markov chain (X_n)_{n≥0} is a process that satisfies:

P(X_{n+1} = j | X_0 = i_0, ..., X_{n−1} = i_{n−1}, X_n = i) = P(X_{n+1} = j | X_n = i)
                                                            = P(X_1 = j | X_0 = i)
                                                            =: p(i, j).

A Hidden Markov Model has two components:
1. A Markov chain that describes the state of the system and is unobserved.
2. An observed process where each output depends on the state of the chain.
Hidden Markov Models (cont.)

More precisely, a Hidden Markov Model consists of:
1. A Markov chain (Z_t : t = 1, ..., T) with states S := {s_1, ..., s_{|S|}}, say:

   P(Z_{t+1} = s_j | Z_t = s_i) = A_{ij}.

2. An observation process (X_t : t = 1, ..., T) taking values in V := {v_1, ..., v_{|V|}} such that

   P(X_t = v_j | Z_t = s_i) = B_{ij}.

Remarks:
1. The observed variable X_t depends only on Z_t, the state of the Markov chain at time t.
2. The output is a random function of the current state.
Examples

An HMM with states S = {x_1, x_2, x_3} and possible observations V = {y_1, y_2, y_3, y_4}.

[Figure: state-transition diagram of the HMM. Source: Wikipedia.]

The a's are the state transition probabilities.
The b's are the output probabilities.
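For concreteness, a model of this shape can be written down directly as a pair of matrices. The following sketch builds a toy HMM with |S| = 3 states and |V| = 4 outputs; the numerical probabilities are hypothetical placeholders (the figure's actual values are not reproduced here), chosen only so that each row of A and B is a valid distribution.

```python
import numpy as np

# Hypothetical transition matrix: A[i, j] = P(Z_{t+1} = s_j | Z_t = s_i).
A = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.2, 0.7]])

# Hypothetical emission matrix: B[i, j] = P(X_t = v_j | Z_t = s_i).
B = np.array([[0.5, 0.2, 0.2, 0.1],
              [0.1, 0.4, 0.4, 0.1],
              [0.3, 0.3, 0.1, 0.3]])

# Every row of A and of B must be a probability distribution.
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
```

Row i of A describes where the chain goes from state s_i; row i of B describes what it emits while in s_i.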
Examples (cont.)

Examples of applications:
- Recognizing human facial expressions from sequences of images (see e.g. Schmidt et al., 2010).
- Speech recognition systems (see e.g. Gales and Young, 2007).
- Longitudinal comparisons in medical studies (see e.g. Scott et al., 2005).
- Many applications in finance (e.g. pricing options, valuation of life insurance policies, credit risk modeling, etc.).
Three problems

Three (closely related) important problems naturally arise when working with HMMs:
1. What is the probability of a given observed sequence?
2. What is the most likely sequence of states that generated a given observed sequence?
3. What are the state transition probabilities and the observation probabilities of the model (i.e., how can we estimate the parameters of the model)?
Probability of an observed sequence

Suppose the parameters of the model are known. Let x = (x_1, ..., x_T) ∈ V^T be a given observed sequence. What is P(x; A, B)?

Conditioning on the hidden states, we obtain:

P(x; A, B) = ∑_{z ∈ S^T} P(x | z; A, B) P(z; A, B)
           = ∑_{z ∈ S^T} ∏_{i=1}^T P(x_i | z_i; B) · ∏_{i=1}^T P(z_i | z_{i−1}; A)
           = ∑_{z ∈ S^T} ∏_{i=1}^T B_{z_i, x_i} · ∏_{i=1}^T A_{z_{i−1}, z_i}.

Problem: Although the previous expression is simple, it involves summing over a set of size |S|^T, which is generally too computationally intensive.
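The sum above can be evaluated literally, which makes the |S|^T blow-up concrete. Below is a minimal sketch (a hypothetical helper, with the initial distribution passed as an explicit vector `pi` rather than folded into A) that enumerates every hidden path:

```python
import itertools
import numpy as np

def sequence_prob_naive(x, A, B, pi):
    """P(x; A, B) by brute force: sum over all |S|^T hidden paths z."""
    S, T = A.shape[0], len(x)
    total = 0.0
    for z in itertools.product(range(S), repeat=T):
        p = pi[z[0]] * B[z[0], x[0]]                 # start in z_1, emit x_1
        for t in range(1, T):
            p *= A[z[t - 1], z[t]] * B[z[t], x[t]]   # transition, then emit
        total += p
    return total

# Small hypothetical 2-state, 2-output model for illustration.
A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
print(sequence_prob_naive([0, 1], A, B, pi))  # ≈ 0.1768 (4 paths summed)
```

Already for |S| = 2 and T = 20 the loop visits 2^20 ≈ 10^6 paths; the cost grows exponentially in T.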
Probability of an observed sequence (cont.)

We can compute P(x; A, B) efficiently using dynamic programming. Idea: avoid computing the same quantities multiple times!

Let α_i(t) := P(x_1, x_2, ..., x_t, z_t = s_i; A, B).

The Forward Procedure for computing α_i(t):
1. Initialize α_i(0) := A_{0,i}, i = 1, ..., |S|.
2. Recursion: α_j(t) := ∑_{i=1}^{|S|} α_i(t−1) A_{ij} B_{j, x_t}, for j = 1, ..., |S| and t = 1, ..., T.

Now,

P(x; A, B) = P(x_1, ..., x_T; A, B)
           = ∑_{i=1}^{|S|} P(x_1, ..., x_T, z_T = s_i; A, B)
           = ∑_{i=1}^{|S|} α_i(T).

The complexity is now O(|S|² · T) instead of O(|S|^T)!
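In code, the forward recursion amounts to one matrix-vector product per time step. A minimal sketch, again taking an explicit initial distribution `pi` in place of the slides' row A_{0,·} (an assumption about how the model is parameterized):

```python
import numpy as np

def forward(x, A, B, pi):
    """Forward procedure: returns P(x; A, B) for observation indices x."""
    T = len(x)
    alpha = pi * B[:, x[0]]               # alpha_i(1) = pi_i * B_{i, x_1}
    for t in range(1, T):
        # alpha_j(t) = sum_i alpha_i(t-1) * A_ij * B_{j, x_t}
        alpha = (alpha @ A) * B[:, x[t]]
    return alpha.sum()                    # P(x) = sum_i alpha_i(T)

# Same small hypothetical 2-state, 2-output model as before.
A = np.array([[0.8, 0.2], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
print(forward([0, 1], A, B, pi))  # ≈ 0.1768, the same value the |S|^T sum gives
```

Each step touches every (i, j) pair once, so the total work is O(|S|² · T) rather than exponential in T.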
Inferring the hidden states

One of the most natural questions one can ask about an HMM is: what are the most likely states that generated the observations? In other words, we would like to compute:

argmax_{z ∈ S^T} P(z | x; A, B).

Using Bayes' theorem:

argmax_{z ∈ S^T} P(z | x; A, B) = argmax_{z ∈ S^T} P(x | z; A, B) P(z; A) / P(x; A, B)
                                = argmax_{z ∈ S^T} P(x | z; A, B) P(z; A),

since the denominator does not depend on z.

Note: There are |S|^T possibilities for z, so in practice there is no hope of examining all of them to pick the optimal one.
The Viterbi algorithm

The Viterbi algorithm is a dynamic programming algorithm that can be used to efficiently compute the most likely path of states, given a sequence of observations x ∈ V^T.

Let v_i(t) denote the probability of the most probable path that ends in state s_i at time t:

v_i(t) := max_{z_1, ..., z_{t−1}} P(z_1, ..., z_{t−1}, z_t = s_i, x_1, ..., x_t; A, B).

Key observation: We have

v_j(t) = max_{1 ≤ i ≤ |S|} v_i(t−1) A_{ij} B_{j, x_t}.

In other words:

(Best path at t that ends at j) = max_{1 ≤ i ≤ |S|} (Best path at t−1 that ends at i)
                                  × (Go from i to j)
                                  × (Observe x_t in state s_j).
The Viterbi algorithm

The Viterbi algorithm:

1 Initialize v_i(1) := π_i B_{i, x_1}, i = 1, . . . , |S|, where π = (π_i) is the
initial distribution of the Markov chain.
2 Compute v_i(t) recursively for i = 1, . . . , |S| and t = 2, . . . , T,
recording the maximizing index i at each step.
3 Finally, the most probable path is the one achieving
max_{1 ≤ i ≤ |S|} v_i(T); it is recovered by backtracking through the
recorded indices.

11/12
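The three steps above can be put together as a minimal NumPy sketch (not the lecture's reference implementation); the two-state HMM in the usage example is made up for illustration, and in practice one would work in log space to avoid underflow on long sequences:

```python
import numpy as np

def viterbi(pi, A, B, x):
    """Most probable state path for observations x under the HMM (A, B, pi).

    pi[i] = P(Z_1 = s_i); A[i, j] = P(Z_{t+1} = s_j | Z_t = s_i);
    B[j, k] = P(X_t = k | Z_t = s_j).
    """
    T, S = len(x), len(pi)
    v = np.zeros((T, S))          # v[t, j] holds v_j(t+1)
    back = np.zeros((T, S), int)  # maximizing predecessor index

    v[0] = pi * B[:, x[0]]                  # step 1: initialization
    for t in range(1, T):                   # step 2: recursion
        scores = v[t - 1][:, None] * A      # scores[i, j] = v_i(t-1) A[i, j]
        back[t] = scores.argmax(axis=0)
        v[t] = scores.max(axis=0) * B[:, x[t]]

    # step 3: best final state, then backtrack through recorded indices
    path = [int(v[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Usage (made-up two-state HMM: states {0, 1}, symbols {0, 1}):
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi(pi, A, B, [0, 0, 1, 1]))  # → [0, 0, 1, 1]
```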
Estimating A, B, and π

So far, we assumed the parameters A, B, and π of the HMM
were known. We now turn to the estimation of these parameters.

Let θ := (A, B, π). We know how to compute:

1 P(x|θ): forward algorithm.
2 argmax_z P(z|x; θ): Viterbi algorithm.

We now want

argmax_θ P(x|θ),

i.e., the set of parameters for which the observed values are most
likely to be obtained.

Note: if we could observe z, then we could easily compute
A, B, π (by counting initial states, transitions, and emissions).

We solve the problem using the EM algorithm.

12/12
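The forward algorithm in item 1 can be sketched similarly to Viterbi: it replaces the max over predecessors by a sum. The two-state HMM below is made up for illustration, and a scaled (or log-space) version would be used in practice to avoid underflow:

```python
import numpy as np

def forward(pi, A, B, x):
    """P(x | theta), theta = (A, B, pi), via the forward algorithm.

    alpha[i] after step t equals P(x_1, ..., x_t, Z_t = s_i):
    sum (rather than max, as in Viterbi) over predecessors.
    """
    alpha = pi * B[:, x[0]]
    for t in range(1, len(x)):
        alpha = (alpha @ A) * B[:, x[t]]
    return alpha.sum()

# Usage with a made-up two-state HMM:
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(forward(pi, A, B, [0, 1]))  # P(x = [0, 1] | theta) ≈ 0.209
```

This is the quantity the EM (Baum–Welch) iterations increase at each step when maximizing over θ.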