
Page 1: Hidden Markov Models - CS Department

Hidden Markov Models
CAP5610: Machine Learning

Instructor: Guo-Jun Qi

Page 2:

Dynamic Model vs. Static Model

• So far, we have learned a number of static machine learning models.
• Given an input feature vector X, we learn a static model from the training set to predict its label.
• In a static model, given a sequence {X_i | i = 1, 2, …, n} of feature vectors, the predictions on them are made independently.
• A decision on a previous feature vector does not affect how we predict on the next feature vector.
• A static Bayes model (class-conditional model):

P(X_{1:n} | Y_{1:n}) = ∏_{i=1}^{n} P(X_i | Y_i)

Page 3:

Dynamic Model vs. Static Model

• In the real world, many decisions/predictions are made sequentially, and they often depend on the context (e.g., the previous and/or next decisions/predictions).
• In speech recognition, recognizing a word is not independent of the previously recognized word:

"Please tell the people falling asleep what we have discussed in class."

• Exploiting the dependency on the context can be very informative and important for many machine learning tasks.

Page 4:

Dynamic Model

• A joint decision is made given a dynamic sequence of input feature vectors:

P(X_{1:n}, Y_{1:n}) = ∏_{i=1}^{n} P(X_i | Y_i) P(Y_i | Y_{context(i)})

where context(i) denotes the context in which decision Y_i is made.

Page 5:

A graphical representation of static and dynamic models

• Static model
• Each decision is made depending only on the current feature vector (observation).

[Figure: independent pairs (Y_1, X_1), (Y_2, X_2), …, (Y_i, X_i), …, (Y_n, X_n); each label Y_i generates only its own observation X_i, with no edges between labels.]

P(X_{1:n} | Y_{1:n}) = ∏_{i=1}^{n} P(X_i | Y_i)

Page 6:

A graphical representation of static and dynamic models

• Dynamic model, where context(i) includes {i-2, i-1}.

[Figure: a chain Y_1, Y_2, …, Y_{i-2}, Y_{i-1}, Y_i, Y_{i+1}, …, Y_n in which each Y_i depends on Y_{i-1} and Y_{i-2}, and each Y_i emits an observation X_i.]

P(X_{1:n}, Y_{1:n}) = ∏_{i=1}^{n} P(X_i | Y_i) P(Y_i | Y_{context(i)})

Page 7:

Markov model

• Joint distribution of n decision variables, by the chain rule of probability:

P(Y_{1:n}) = P(Y_1) P(Y_2 | Y_1) P(Y_3 | Y_1, Y_2) ⋯ P(Y_i | Y_{1:i-1}) ⋯ P(Y_n | Y_{1:n-1})

Here Y_{1:i-1} is shorthand for {Y_j | j = 1, 2, …, i-1}.

• The Markov assumption simplifies the dependence:

P(Y_i | Y_{1:i-1}) = P(Y_i | Y_{i-1})

i.e., each decision depends only on its immediate predecessor.

Page 8:

Markov Assumption

• Dynamic model, where context(i) includes {i-1}.

[Figure: a chain Y_1 → Y_2 → … → Y_{i-1} → Y_i → Y_{i+1} → … → Y_n, where each Y_i emits an observation X_i.]

P(X_{1:n}, Y_{1:n}) = ∏_{i=1}^{n} P(X_i | Y_i) P(Y_i | Y_{context(i)})

Page 9:

Hidden Markov Model

• Decision variables are hidden variables to be inferred.
• The Markov dependence is imposed on the hidden variables.
• The sequence of feature vectors constitutes the observed variables.
• Dependence between the observed variables arises implicitly via the hidden variables, so the observations are not independent of each other.

[Figure: a hidden chain Y_1 → Y_2 → … → Y_n, with each hidden state Y_i emitting an observation X_i.]

Page 10:

2nd-order Markov Assumption

• Each decision depends on its two immediate predecessors,
• where context(i) includes {i-1, i-2}.

[Figure: a chain in which each Y_i depends on Y_{i-1} and Y_{i-2}, and each Y_i emits an observation X_i.]

P(X_{1:n}, Y_{1:n}) = ∏_{i=1}^{n} P(X_i | Y_i) P(Y_i | Y_{i-1}, Y_{i-2})

Page 11:

m-th order Markov Assumption

• Probabilistic model:

P(X_{1:n}, Y_{1:n}) = ∏_{i=1}^{n} P(X_i | Y_i) P(Y_i | Y_{i-1}, Y_{i-2}, …, Y_{i-m})

• Each decision depends on the previous m decisions.

Page 12:

Parameterization – Hidden Markov Model

• HMM:
  • State space S = {1, 2, …, I}, from which Y_i samples its value:
    • representing a particular label that denotes the class of X_i;
    • a decision made based on the observations up through i.
  • Initial state distribution: P(Y_1 = s) = π_s
  • Transition distribution: P(Y_i = l | Y_{i-1} = k) = b_{kl}
  • Stationary assumption: the transition distribution does not change over time, i.e., it is independent of i.

P(X_{1:n}, Y_{1:n}) = P(Y_1) P(X_1 | Y_1) ∏_{i=2}^{n} P(X_i | Y_i) P(Y_i | Y_{i-1})

Page 13:

Parameterization – Hidden Markov Model

• HMM:
  • Emission distribution P(X_i | Y_i): given the current state Y_i, the distribution that generates X_i.
  • Stationary assumption: the emission distribution does not change over time; it depends only on the current state.
  • If the observation takes a discrete value from O = {1, 2, …, K}:
    P(X_i = o | Y_i = s) = q_{so}
  • If the observation is a feature vector in a high-dimensional feature space:
    P(X_i = o | Y_i = s) = N(o; m_s, Σ_s)

P(X_{1:n}, Y_{1:n}) = P(Y_1) P(X_1 | Y_1) ∏_{i=2}^{n} P(X_i | Y_i) P(Y_i | Y_{i-1})
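Putting the parameterization together, the joint probability is a product of initial, emission, and transition terms. A minimal sketch with hypothetical values for π_s, b_{kl}, and a discrete emission table q_{so} (2 states, 3 observation symbols, all numbers illustrative):

```python
import numpy as np

# Hypothetical HMM with I = 2 states and K = 3 discrete observation symbols.
pi = np.array([0.5, 0.5])              # pi[s] = P(Y_1 = s)
B = np.array([[0.9, 0.1],              # B[k, l] = P(Y_i = l | Y_{i-1} = k)
              [0.2, 0.8]])
Q = np.array([[0.6, 0.3, 0.1],         # Q[s, o] = P(X_i = o | Y_i = s)
              [0.1, 0.3, 0.6]])

def hmm_joint(states, obs):
    """P(X_{1:n}, Y_{1:n}) = P(Y_1) P(X_1|Y_1) * prod_{i=2}^n P(X_i|Y_i) P(Y_i|Y_{i-1})."""
    p = pi[states[0]] * Q[states[0], obs[0]]
    for i in range(1, len(states)):
        p *= B[states[i - 1], states[i]] * Q[states[i], obs[i]]
    return p

print(hmm_joint([0, 1], [0, 2]))  # = 0.5 * 0.6 * 0.1 * 0.6
```

Note how the stationary assumption shows up in the code: the same matrices `B` and `Q` are reused at every time step i.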

Page 14:

An example

• The dishonest casino (Poczos & Singh)
• A casino has two dice:
  • A fair die, denoted by F:
    P(X = x | Y = F) = 1/6, for x = 1, 2, 3, 4, 5, 6
  • A loaded die, denoted by L:
    P(X = x | Y = L) = 1/10, for x = 1, 2, 3, 4, 5
    P(X = 6 | Y = L) = 1/2
• The casino player switches back and forth between the fair and loaded dice with 5% probability.
• Transition distribution:
  • P(Y_i = F | Y_{i-1} = L) = P(Y_i = L | Y_{i-1} = F) = 0.05
  • P(Y_i = F | Y_{i-1} = F) = P(Y_i = L | Y_{i-1} = L) = 0.95
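The casino model can be simulated directly from these distributions. A sketch, assuming the player starts with the fair die (the slides do not specify an initial state distribution):

```python
import random

# Dishonest casino HMM from the slides: states F (fair) and L (loaded).
TRANS = {'F': {'F': 0.95, 'L': 0.05},
         'L': {'F': 0.05, 'L': 0.95}}
EMIT = {'F': {x: 1 / 6 for x in range(1, 7)},
        'L': {**{x: 1 / 10 for x in range(1, 6)}, 6: 1 / 2}}

def sample_rolls(n, start='F', seed=0):
    """Sample n (state, roll) pairs from the casino HMM."""
    rng = random.Random(seed)
    state, out = start, []
    for _ in range(n):
        faces, probs = zip(*EMIT[state].items())
        roll = rng.choices(faces, weights=probs)[0]
        out.append((state, roll))
        # Switch dice with 5% probability, stay with 95%.
        nxt, w = zip(*TRANS[state].items())
        state = rng.choices(nxt, weights=w)[0]
    return out

rolls = sample_rolls(10)
```

In a long simulated run, stretches generated by the loaded die show a visibly inflated fraction of sixes, which is exactly the signal the decoding problem tries to recover.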

Page 15:

HMM Problems

• A sequence of rolls by a casino player
• We wonder:
  • Which rolls are generated by the fair/loaded die?
    • Decoding problem: inferring the state associated with each observation.
  • How likely is it that this sequence was generated by the model of this casino player?
    • Evaluation problem: computing the likelihood given the model parameters (initial, transition, and emission distributions).
  • How "loaded" is the loaded die? How fair is the fair die? How often does the casino player switch between the two dice?
    • Learning problem: learning the parameters of the casino player's model.

Page 16:

HMM examples

Page 17:

HMM parameters

• Transition probability between the two dice

• Model parameters

Page 18:

Formalizing three HMM problems

• Evaluation: Given HMM parameters θ and an observation sequence O = {O_t | t = 1, 2, …, T}, compute the probability of observing O, i.e.,

P({O_t}_{t=1}^T | θ) = Σ_S P({O_t}_{t=1}^T, {S_t}_{t=1}^T | θ)

• Decoding: Given HMM parameters θ and an observation sequence O = {O_t | t = 1, 2, …, T}, find the most likely sequence of states, i.e.,

S* = argmax_S P({S_t}_{t=1}^T | {O_t}_{t=1}^T, θ)

Page 19:

Formalizing three HMM problems

• Learning: Given a sequence of observations O = {O_t | t = 1, 2, …, T}, find the HMM parameters that maximize the likelihood of observing O:

θ* = argmax_θ P({O_t}_{t=1}^T | θ)

• We will apply the EM algorithm to solve the learning problem.
  • E-step: estimate P({S_t}_{t=1}^T | {O_t}_{t=1}^T, θ^(j)) (similar to decoding)
  • M-step: maximize the expected log-likelihood of the complete data sequence:

Q(θ, θ^(j)) = Σ_S P({S_t}_{t=1}^T | {O_t}_{t=1}^T, θ^(j)) log P({S_t}_{t=1}^T, {O_t}_{t=1}^T | θ)

Page 20:

Evaluation Problem

• Given HMM parameters θ and an observation sequence O = {O_t | t = 1, 2, …, T},
• compute the probability of observing O, i.e.,

P({O_t}_{t=1}^T | θ) = Σ_S P({O_t}_{t=1}^T, {S_t}_{t=1}^T | θ)
  = Σ_S P({O_t}_{t=1}^T | {S_t}_{t=1}^T, θ) P({S_t}_{t=1}^T | θ)    (by Bayes rule)
  = Σ_S [∏_{t=1}^T P(O_t | S_t)] P({S_t}_{t=1}^T | θ)    (by independence of observations given the states)
  = Σ_S [∏_{t=1}^T P(O_t | S_t)] P(S_1 | θ) ∏_{t=2}^T P(S_t | S_{t-1})    (by the Markov assumption)

Enumerating all possible assignments of S is exponential, leading to computational complexity O(K^T).
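The exponential cost is easy to see in code: brute-force evaluation sums the joint over all K^T state sequences. The small HMM below (2 states, 2 symbols, all parameter values) is a hypothetical example for illustration only:

```python
import itertools
import numpy as np

# Hypothetical small HMM for illustration (K = 2 states, 2 symbols).
pi = np.array([0.5, 0.5])                # pi[k] = P(S_1 = k)
A = np.array([[0.9, 0.1], [0.2, 0.8]])   # A[k', k] = P(S_t = k | S_{t-1} = k')
E = np.array([[0.7, 0.3], [0.4, 0.6]])   # E[k, o] = P(O_t = o | S_t = k)

def likelihood_naive(obs):
    """Sum P(O, S) over all K^T state sequences S -- exponential cost."""
    K, T = len(pi), len(obs)
    total = 0.0
    for seq in itertools.product(range(K), repeat=T):
        # P(O, S) = P(S_1) P(O_1|S_1) * prod_{t=2}^T P(S_t|S_{t-1}) P(O_t|S_t)
        p = pi[seq[0]] * E[seq[0], obs[0]]
        for t in range(1, T):
            p *= A[seq[t - 1], seq[t]] * E[seq[t], obs[t]]
        total += p
    return total
```

Even at K = 2 the inner loop runs 2^T times; the forward algorithm on the following slides collapses this to a polynomial-time recursion.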

Page 21:

Forward Probability

• A dynamic programming algorithm.
• Let α_T^k = P({O_t}_{t=1}^T, S_T = k); then P({O_t}_{t=1}^T) = Σ_{k=1}^K α_T^k.
• Recursively compute α_T^k in a forward fashion:

α_T^k = P({O_t}_{t=1}^T, S_T = k)
  = Σ_{k'} P({O_t}_{t=1}^{T-1}, O_T, S_{T-1} = k', S_T = k)    (marginalizing over S_{T-1})
  = Σ_{k'} P(O_T, S_T = k | {O_t}_{t=1}^{T-1}, S_{T-1} = k') P({O_t}_{t=1}^{T-1}, S_{T-1} = k')    (by Bayes rule)
  = Σ_{k'} P(S_T = k | S_{T-1} = k') P(O_T | S_T = k) P({O_t}_{t=1}^{T-1}, S_{T-1} = k')    (by the Markov assumption)
  = Σ_{k'} P(S_T = k | S_{T-1} = k') P(O_T | S_T = k) α_{T-1}^{k'}    (by the recursive definition)

Page 22:

Computational complexity

β€’ π›Όπ›Όπ‘‡π‘‡π‘˜π‘˜ = βˆ‘π‘˜π‘˜β€² 𝑃𝑃 𝑆𝑆𝑇𝑇 = π‘˜π‘˜|π‘†π‘†π‘‡π‘‡βˆ’1 = π‘˜π‘˜β€² 𝑃𝑃 𝑂𝑂𝑇𝑇|𝑆𝑆𝑇𝑇 = π‘˜π‘˜ π›Όπ›Όπ‘‡π‘‡βˆ’1π‘˜π‘˜β€²

β€’ Each recursive definition incurs O(K) of summations over the state space.

β€’ We have KT of such π›Όπ›Όπ‘‡π‘‡π‘˜π‘˜, so the computational complexity is about O(K2T) – a polynomial computational cost.

Page 23:

Forward Algorithm

• Initialize: α_1^k = P(O_1, S_1 = k) = P(S_1 = k) P(O_1 | S_1 = k)
• Recursive steps: for t = 2, …, T,

α_t^k = P(O_t | S_t = k) Σ_{k'} P(S_t = k | S_{t-1} = k') α_{t-1}^{k'}

• Termination:

P({O_t}_{t=1}^T | θ) = Σ_k α_T^k

Page 24:

Summary

• Defined the Hidden Markov Model to capture a sequence of observations:
  • Hidden states: Markov assumption
  • Observations: each depends only on its current state
• Three main problems:
  • Evaluation, decoding, learning
• Forward algorithm: solves evaluation
• Next lecture: decoding and learning