
Page 1: Hidden Markov Models - CS Department

Hidden Markov Models
CAP5610: Machine Learning

Instructor: Guo-Jun Qi

Page 2:

Dynamic Model vs. Static Model

• So far, we have learned a number of static machine learning models.
• Given an input feature vector X, we learn a static model from the training set to predict its label.
• In a static model, given a sequence {X_i | i = 1, 2, …, n} of feature vectors, the predictions on them are made independently.
• A decision on a previous feature vector does not affect how we predict on the next feature vector.
• A static Bayes model (class-conditional model):

P(X_{1:n} | Y_{1:n}) = ∏_{i=1}^{n} P(X_i | Y_i)

Page 3:

Dynamic Model vs. Static Model

• In the real world, many decisions/predictions are made sequentially, and they often depend on the context (e.g., the previous and/or next decisions/predictions).
• In speech recognition, recognizing a word is not independent of the previously recognized word:

"Please tell the people falling asleep what we have discussed in class."

• Exploiting the dependency on the context can be very informative and important for many machine learning tasks.

Page 4:

Dynamic Model

• A joint decision is made given a dynamic sequence of input feature vectors:

P(X_{1:n}, Y_{1:n}) = ∏_{i=1}^{n} P(X_i | Y_i) P(Y_i | Y_{context(i)})

where context(i) denotes the context in which decision Y_i is made.

Page 5:

A graphical representation of static and dynamic models

• Static model
• Each decision is made depending only on the current feature vector (observation).

[Figure: independent pairs (Y_1, X_1), (Y_2, X_2), …, (Y_i, X_i), …, (Y_n, X_n); each label Y_i generates only its own observation X_i, with no edges between labels.]

P(X_{1:n} | Y_{1:n}) = ∏_{i=1}^{n} P(X_i | Y_i)

Page 6:

A graphical representation of static and dynamic models

• Dynamic model, where context(i) includes {i-2, i-1}.

[Figure: a chain Y_1, Y_2, …, Y_{i-2}, Y_{i-1}, Y_i, Y_{i+1}, …, Y_n in which each Y_i depends on Y_{i-1} and Y_{i-2}, and each Y_i emits an observation X_i.]

P(X_{1:n}, Y_{1:n}) = ∏_{i=1}^{n} P(X_i | Y_i) P(Y_i | Y_{context(i)})

Page 7:

Markov model

• Joint distribution of n decision variables, by the chain rule of probability:

P(Y_{1:n}) = P(Y_1) P(Y_2 | Y_1) P(Y_3 | Y_1, Y_2) ⋯ P(Y_i | Y_{1:i-1}) ⋯ P(Y_n | Y_{1:n-1})

Here Y_{1:i-1} is shorthand for {Y_j | j = 1, 2, …, i-1}.

• The Markov assumption simplifies the dependence:

P(Y_i | Y_{1:i-1}) = P(Y_i | Y_{i-1})

i.e., each decision depends only on its immediate predecessor.

Page 8:

Markov Assumption

• Dynamic model, where context(i) includes {i-1}.

[Figure: a chain Y_1 → Y_2 → … → Y_{i-1} → Y_i → Y_{i+1} → … → Y_n, where each Y_i emits an observation X_i.]

P(X_{1:n}, Y_{1:n}) = ∏_{i=1}^{n} P(X_i | Y_i) P(Y_i | Y_{context(i)})

Page 9:

Hidden Markov Model

• Decision variables are hidden variables to be inferred.
• The Markov dependence is imposed on the hidden variables.
• The sequence of feature vectors constitutes the observed variables.
• Dependence between the observed variables arises implicitly via the hidden variables, so the observations are not independent of each other.

[Figure: a hidden chain Y_1 → Y_2 → … → Y_n, with each hidden state Y_i emitting an observation X_i.]

Page 10:

2nd-order Markov Assumption

• Each decision depends on its two immediate predecessors,
• where context(i) includes {i-1, i-2}.

[Figure: a chain in which each Y_i depends on Y_{i-1} and Y_{i-2}, and each Y_i emits an observation X_i.]

P(X_{1:n}, Y_{1:n}) = ∏_{i=1}^{n} P(X_i | Y_i) P(Y_i | Y_{i-1}, Y_{i-2})

Page 11:

m-th order Markov Assumption

• Probabilistic model:

P(X_{1:n}, Y_{1:n}) = ∏_{i=1}^{n} P(X_i | Y_i) P(Y_i | Y_{i-1}, Y_{i-2}, …, Y_{i-m})

• Each decision depends on the previous m decisions.

Page 12:

Parameterization – Hidden Markov Model

• HMM:
  • State space S = {1, 2, …, I}, from which Y_i samples its value:
    • representing a particular label that denotes the class of X_i;
    • a decision made based on the observations up through i.
  • Initial state distribution: P(Y_1 = s) = π_s
  • Transition distribution: P(Y_i = l | Y_{i-1} = k) = b_{kl}
  • Stationary assumption: the transition distribution does not change over time, i.e., it is independent of i.

P(X_{1:n}, Y_{1:n}) = P(Y_1) P(X_1 | Y_1) ∏_{i=2}^{n} P(X_i | Y_i) P(Y_i | Y_{i-1})

Page 13:

Parameterization – Hidden Markov Model

• HMM:
  • Emission distribution P(X_i | Y_i): given the current state Y_i, the distribution that generates X_i.
  • Stationary assumption: the emission distribution does not change over time; it depends only on the current state.
  • If the observation takes a discrete value from O = {1, 2, …, K}:
    P(X_i = o | Y_i = s) = q_{so}
  • If the observation is a feature vector in a high-dimensional feature space:
    P(X_i = o | Y_i = s) = N(o; m_s, Σ_s)

P(X_{1:n}, Y_{1:n}) = P(Y_1) P(X_1 | Y_1) ∏_{i=2}^{n} P(X_i | Y_i) P(Y_i | Y_{i-1})
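Putting the parameterization together, the joint probability is a product of initial, emission, and transition terms. A minimal sketch with hypothetical values for π_s, b_{kl}, and a discrete emission table q_{so} (2 states, 3 observation symbols, all numbers illustrative):

```python
import numpy as np

# Hypothetical HMM with I = 2 states and K = 3 discrete observation symbols.
pi = np.array([0.5, 0.5])              # pi[s] = P(Y_1 = s)
B = np.array([[0.9, 0.1],              # B[k, l] = P(Y_i = l | Y_{i-1} = k)
              [0.2, 0.8]])
Q = np.array([[0.6, 0.3, 0.1],         # Q[s, o] = P(X_i = o | Y_i = s)
              [0.1, 0.3, 0.6]])

def hmm_joint(states, obs):
    """P(X_{1:n}, Y_{1:n}) = P(Y_1) P(X_1|Y_1) * prod_{i=2}^n P(X_i|Y_i) P(Y_i|Y_{i-1})."""
    p = pi[states[0]] * Q[states[0], obs[0]]
    for i in range(1, len(states)):
        p *= B[states[i - 1], states[i]] * Q[states[i], obs[i]]
    return p

print(hmm_joint([0, 1], [0, 2]))  # = 0.5 * 0.6 * 0.1 * 0.6
```

Note how the stationary assumption shows up in the code: the same matrices `B` and `Q` are reused at every time step i.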

Page 14:

An example

• The dishonest casino (Poczos & Singh)
• A casino has two dice:
  • A fair die, denoted by F:
    P(X = x | Y = F) = 1/6, for x = 1, 2, 3, 4, 5, 6
  • A loaded die, denoted by L:
    P(X = x | Y = L) = 1/10, for x = 1, 2, 3, 4, 5
    P(X = 6 | Y = L) = 1/2
• The casino player switches back and forth between the fair and loaded dice with 5% probability.
• Transition distribution:
  • P(Y_i = F | Y_{i-1} = L) = P(Y_i = L | Y_{i-1} = F) = 0.05
  • P(Y_i = F | Y_{i-1} = F) = P(Y_i = L | Y_{i-1} = L) = 0.95
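The casino model can be simulated directly from these distributions. A sketch, assuming the player starts with the fair die (the slides do not specify an initial state distribution):

```python
import random

# Dishonest casino HMM from the slides: states F (fair) and L (loaded).
TRANS = {'F': {'F': 0.95, 'L': 0.05},
         'L': {'F': 0.05, 'L': 0.95}}
EMIT = {'F': {x: 1 / 6 for x in range(1, 7)},
        'L': {**{x: 1 / 10 for x in range(1, 6)}, 6: 1 / 2}}

def sample_rolls(n, start='F', seed=0):
    """Sample n (state, roll) pairs from the casino HMM."""
    rng = random.Random(seed)
    state, out = start, []
    for _ in range(n):
        faces, probs = zip(*EMIT[state].items())
        roll = rng.choices(faces, weights=probs)[0]
        out.append((state, roll))
        # Switch dice with 5% probability, stay with 95%.
        nxt, w = zip(*TRANS[state].items())
        state = rng.choices(nxt, weights=w)[0]
    return out

rolls = sample_rolls(10)
```

In a long simulated run, stretches generated by the loaded die show a visibly inflated fraction of sixes, which is exactly the signal the decoding problem tries to recover.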

Page 15:

HMM Problems

• A sequence of rolls by a casino player
• We wonder:
  • Which rolls are generated by the fair/loaded die?
    • Decoding problem: inferring the state associated with each observation.
  • How likely is it that this sequence was generated by the model of this casino player?
    • Evaluation problem: computing the likelihood given the model parameters (initial, transition, and emission distributions).
  • How "loaded" is the loaded die? How fair is the fair die? How often does the casino player switch between the two dice?
    • Learning problem: learning the parameters of the casino player's model.

Page 16:

HMM examples

Page 17:

HMM parameters

• Transition probability between the two dice

• Model parameters

Page 18:

Formalizing three HMM problems

• Evaluation: Given HMM parameters θ and an observation sequence O = {O_t | t = 1, 2, …, T}, compute the probability of observing O, i.e.,

P({O_t}_{t=1}^T | θ) = Σ_S P({O_t}_{t=1}^T, {S_t}_{t=1}^T | θ)

• Decoding: Given HMM parameters θ and an observation sequence O = {O_t | t = 1, 2, …, T}, find the most likely sequence of states, i.e.,

S* = argmax_S P({S_t}_{t=1}^T | {O_t}_{t=1}^T, θ)

Page 19:

Formalizing three HMM problems

• Learning: Given a sequence of observations O = {O_t | t = 1, 2, …, T}, find the HMM parameters that maximize the likelihood of observing O:

θ* = argmax_θ P({O_t}_{t=1}^T | θ)

• We will apply the EM algorithm to solve the learning problem.
  • E-step: estimate P({S_t}_{t=1}^T | {O_t}_{t=1}^T, θ^(j)) (similar to decoding)
  • M-step: maximize the expected log-likelihood of the complete data sequence:

Q(θ, θ^(j)) = Σ_S P({S_t}_{t=1}^T | {O_t}_{t=1}^T, θ^(j)) log P({S_t}_{t=1}^T, {O_t}_{t=1}^T | θ)

Page 20:

Evaluation Problem

• Given HMM parameters θ and an observation sequence O = {O_t | t = 1, 2, …, T},
• compute the probability of observing O, i.e.,

P({O_t}_{t=1}^T | θ) = Σ_S P({O_t}_{t=1}^T, {S_t}_{t=1}^T | θ)
  = Σ_S P({O_t}_{t=1}^T | {S_t}_{t=1}^T, θ) P({S_t}_{t=1}^T | θ)    (by Bayes rule)
  = Σ_S [∏_{t=1}^T P(O_t | S_t)] P({S_t}_{t=1}^T | θ)    (by independence of observations given the states)
  = Σ_S [∏_{t=1}^T P(O_t | S_t)] P(S_1 | θ) ∏_{t=2}^T P(S_t | S_{t-1})    (by the Markov assumption)

Enumerating all possible assignments of S is exponential, leading to computational complexity O(K^T).
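The exponential cost is easy to see in code: brute-force evaluation sums the joint over all K^T state sequences. The small HMM below (2 states, 2 symbols, all parameter values) is a hypothetical example for illustration only:

```python
import itertools
import numpy as np

# Hypothetical small HMM for illustration (K = 2 states, 2 symbols).
pi = np.array([0.5, 0.5])                # pi[k] = P(S_1 = k)
A = np.array([[0.9, 0.1], [0.2, 0.8]])   # A[k', k] = P(S_t = k | S_{t-1} = k')
E = np.array([[0.7, 0.3], [0.4, 0.6]])   # E[k, o] = P(O_t = o | S_t = k)

def likelihood_naive(obs):
    """Sum P(O, S) over all K^T state sequences S -- exponential cost."""
    K, T = len(pi), len(obs)
    total = 0.0
    for seq in itertools.product(range(K), repeat=T):
        # P(O, S) = P(S_1) P(O_1|S_1) * prod_{t=2}^T P(S_t|S_{t-1}) P(O_t|S_t)
        p = pi[seq[0]] * E[seq[0], obs[0]]
        for t in range(1, T):
            p *= A[seq[t - 1], seq[t]] * E[seq[t], obs[t]]
        total += p
    return total
```

Even at K = 2 the inner loop runs 2^T times; the forward algorithm on the following slides collapses this to a polynomial-time recursion.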

Page 21:

Forward Probability

• A dynamic programming algorithm.
• Let α_T^k = P({O_t}_{t=1}^T, S_T = k); then P({O_t}_{t=1}^T) = Σ_{k=1}^K α_T^k.
• Recursively compute α_T^k in a forward fashion:

α_T^k = P({O_t}_{t=1}^T, S_T = k)
  = Σ_{k'} P({O_t}_{t=1}^{T-1}, O_T, S_{T-1} = k', S_T = k)    (marginalizing over S_{T-1})
  = Σ_{k'} P(O_T, S_T = k | {O_t}_{t=1}^{T-1}, S_{T-1} = k') P({O_t}_{t=1}^{T-1}, S_{T-1} = k')    (by Bayes rule)
  = Σ_{k'} P(S_T = k | S_{T-1} = k') P(O_T | S_T = k) P({O_t}_{t=1}^{T-1}, S_{T-1} = k')    (by the Markov assumption)
  = Σ_{k'} P(S_T = k | S_{T-1} = k') P(O_T | S_T = k) α_{T-1}^{k'}    (by the recursive definition)

Page 22:

Computational complexity

β€’ π›Όπ›Όπ‘‡π‘‡π‘˜π‘˜ = βˆ‘π‘˜π‘˜β€² 𝑃𝑃 𝑆𝑆𝑇𝑇 = π‘˜π‘˜|π‘†π‘†π‘‡π‘‡βˆ’1 = π‘˜π‘˜β€² 𝑃𝑃 𝑂𝑂𝑇𝑇|𝑆𝑆𝑇𝑇 = π‘˜π‘˜ π›Όπ›Όπ‘‡π‘‡βˆ’1π‘˜π‘˜β€²

β€’ Each recursive definition incurs O(K) of summations over the state space.

β€’ We have KT of such π›Όπ›Όπ‘‡π‘‡π‘˜π‘˜, so the computational complexity is about O(K2T) – a polynomial computational cost.

Page 23:

Forward Algorithm

• Initialize: α_1^k = P(O_1, S_1 = k) = P(S_1 = k) P(O_1 | S_1 = k)
• Recursive steps: for t = 2, …, T,

α_t^k = P(O_t | S_t = k) Σ_{k'} P(S_t = k | S_{t-1} = k') α_{t-1}^{k'}

• Termination:

P({O_t}_{t=1}^T | θ) = Σ_k α_T^k

Page 24:

Summary

• Defined the Hidden Markov Model to capture a sequence of observations:
  • Hidden states: Markov assumption
  • Observations: each depends only on its current state
• Three main problems:
  • Evaluation, decoding, learning
• Forward algorithm: solves evaluation
• Next lecture: decoding and learning