
Probabilistic Graphical Models: Principles and Applications

Chapter 5: HIDDEN MARKOV MODELS

L. Enrique Sucar, INAOE


Outline

1 Introduction

2 Markov Chains: Basic Questions, Parameter Estimation, Convergence

3 Hidden Markov Models: Basic Questions, Learning, Extensions

4 Applications

5 References

Introduction

• Markov chains are another class of PGMs that represent dynamic processes

• For instance, consider that we are modeling how the weather in a particular place changes over time

• A simple weather model is a Markov chain in which there is a state variable per day, with 3 possible values: sunny, cloudy, raining; these variables are linked in a chain

Markov Chain

This implies what is known as the Markov property: the state of the weather for the next day, $S_{t+1}$, is independent of all previous days given the present weather, $S_t$; i.e., $P(S_{t+1} \mid S_t, S_{t-1}, \ldots) = P(S_{t+1} \mid S_t)$

Hidden Markov Models

• The previous model assumes that we can measure the weather with precision each day, that is, the state is observable

• In many applications we cannot observe the state of the process directly, so we have what is called a Hidden Markov Model, where the state is hidden

• In addition to the probability of the next state given the current state, there is another parameter which models the uncertainty about the state, represented as the probability of the observation given the state, $P(O_t \mid S_t)$

Markov Chains

Definition

• A Markov chain (MC) is a state machine that has a discrete number of states, $q_1, q_2, \ldots, q_n$, and the transitions between states are non-deterministic

• Formally, a Markov chain is defined by:

  Set of states: $Q = \{q_1, q_2, \ldots, q_n\}$

  Vector of prior probabilities: $\Pi = \{\pi_1, \pi_2, \ldots, \pi_n\}$, where $\pi_i = P(S_0 = q_i)$

  Matrix of transition probabilities: $A = \{a_{ij}\}$, $i = [1..n]$, $j = [1..n]$, where $a_{ij} = P(S_t = q_j \mid S_{t-1} = q_i)$

• In a compact way, a MC is represented as $\lambda = \{A, \Pi\}$

Properties

1 Probability axioms: $\sum_i \pi_i = 1$ and $\sum_j a_{ij} = 1$

2 Markov property: $P(S_t = q_j \mid S_{t-1} = q_i, S_{t-2} = q_k, \ldots) = P(S_t = q_j \mid S_{t-1} = q_i)$

Example - simple weather model

          sunny (q1)   cloudy (q2)   raining (q3)
          0.2          0.5           0.3

Table: Prior probabilities.

          sunny   cloudy   raining
sunny     0.8     0.1      0.1
cloudy    0.2     0.6      0.2
raining   0.3     0.3      0.4

Table: Transition probabilities.

State Transition Diagram

• This diagram is a directed graph, where each node is a state and the arcs represent possible transitions between states

Basic Questions

Given a Markov chain model, there are three basic questions that we can ask:

• What is the probability of a certain state sequence?

• What is the probability that the chain remains in a certain state for a period of time?

• What is the expected time that the chain will remain in a certain state?

Probability of a state sequence

• The probability of a sequence of states given the model is basically the product of the transition probabilities of the sequence of states:

  $P(q_i, q_j, q_k, \ldots) = a_{0i} a_{ij} a_{jk} \cdots$   (1)

  where $a_{0i} = \pi_i$ is the prior probability of the first state.

• For example, in the weather model, we might want to know the probability of the following sequence of states: Q = sunny, sunny, rainy, rainy, sunny, cloudy, sunny.
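To make Eq. (1) concrete, here is a minimal Python sketch (the state encoding and function name are ours) that evaluates the example sequence against the prior and transition tables given earlier:

```python
# Probability of a state sequence under the weather Markov chain (Eq. 1).
# States encoded as 0 = sunny, 1 = cloudy, 2 = raining.
Pi = [0.2, 0.5, 0.3]
A = [[0.8, 0.1, 0.1],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]

def sequence_probability(states, Pi, A):
    """P(q_i, q_j, q_k, ...) = pi_i * a_ij * a_jk * ..."""
    p = Pi[states[0]]
    for s, s_next in zip(states, states[1:]):
        p *= A[s][s_next]
    return p

# Q = sunny, sunny, rainy, rainy, sunny, cloudy, sunny
Q = [0, 0, 2, 2, 0, 1, 0]
print(sequence_probability(Q, Pi, A))
# 0.2 * 0.8 * 0.1 * 0.4 * 0.3 * 0.1 * 0.2 = 3.84e-05
```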

Probability of remaining in a state

• The probability of staying $d$ time steps in a certain state, $q_i$, is equivalent to the probability of remaining in this state for $d - 1$ time steps and then transiting to a different state:

  $P(d_i) = a_{ii}^{d-1}(1 - a_{ii})$   (2)

• Considering the weather model, what is the probability of 3 cloudy days?
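A worked answer (our computation, reading $a_{22} = 0.6$ from the transition table above): $P(d_2 = 3) = a_{22}^{3-1}(1 - a_{22}) = 0.6^2 \times 0.4 = 0.144$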

Average duration

• The average duration of a state sequence in a certain state is the expected value of the number of stages in that state, that is, $E(d_i) = \sum_d d \, P(d_i)$:

  $E(d_i) = \sum_d d \, a_{ii}^{d-1}(1 - a_{ii})$   (3)

• Which can be written in a compact form as:

  $E(d_i) = 1/(1 - a_{ii})$   (4)

• What is the expected number of days that the weather will remain cloudy?
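A worked answer (our computation, again with $a_{22} = 0.6$ for cloudy): $E(d_2) = 1/(1 - a_{22}) = 1/0.4 = 2.5$ days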

Parameter Estimation

• The parameters can be estimated simply by counting the number of times that the sequence is in a certain state, $i$, and the number of times there is a transition from a state $i$ to a state $j$:

  Initial probabilities: $\pi_i = \gamma_{0i}/N$   (5)

  Transition probabilities: $a_{ij} = \gamma_{ij}/\gamma_i$   (6)

• $\gamma_{0i}$ is the number of times that state $i$ is the initial state in a sequence, $N$ is the number of sequences, $\gamma_i$ is the number of times that we observe state $i$ (counting only positions followed by a transition), and $\gamma_{ij}$ is the number of times that we observe a transition from state $i$ to state $j$

Weather Example - data

• Consider that for the weather example we have the following 4 observation sequences:

  q2, q2, q3, q3, q3, q3, q1
  q1, q3, q2, q3, q3, q3, q3
  q3, q3, q2, q2
  q2, q1, q2, q2, q1, q3, q1

Weather Example - parameters

          sunny   cloudy   raining
          0.25    0.5      0.25

Table: Calculated prior probabilities for the weather example.

          sunny   cloudy   raining
sunny     0       0.33     0.67
cloudy    0.285   0.43     0.285
raining   0.18    0.18     0.64

Table: Calculated transition probabilities for the weather example.
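These tables can be reproduced by direct counting; below is a minimal Python sketch (encoding and names are ours) of Eqs. (5) and (6) applied to the four sequences:

```python
# Estimate Markov chain parameters by counting (Eqs. 5 and 6).
from collections import Counter

sequences = [[2, 2, 3, 3, 3, 3, 1],
             [1, 3, 2, 3, 3, 3, 3],
             [3, 3, 2, 2],
             [2, 1, 2, 2, 1, 3, 1]]          # states q1..q3 encoded as 1..3

init = Counter(seq[0] for seq in sequences)              # gamma_0i
trans = Counter(pair for seq in sequences
                for pair in zip(seq, seq[1:]))           # gamma_ij
out = Counter(s for seq in sequences for s in seq[:-1])  # gamma_i (outgoing)

pi = {i: init[i] / len(sequences) for i in (1, 2, 3)}
a = {(i, j): trans[(i, j)] / out[i] for i in (1, 2, 3) for j in (1, 2, 3)}
print(pi)         # {1: 0.25, 2: 0.5, 3: 0.25}
print(a[(2, 2)])  # 3/7 ~ 0.43, as in the table
```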

Convergence

• If a sequence transits from one state to another a large number of times, $M$, what is the probability in the limit (as $M \to \infty$) of each state, $q_i$?

• Given an initial probability vector, $\Pi$, and transition matrix, $A$, the probability of each state, $P = \{p_1, p_2, \ldots, p_n\}$, after $M$ iterations is:

  $P = \Pi A^M$   (7)

• The solution is given by the Perron-Frobenius theorem

Perron-Frobenius theorem

• Conditions:

  1 Irreducibility: from every state $i$ there is a probability $a_{ij} > 0$ of transiting to any state $j$.

  2 Aperiodicity: the chain does not form cycles (a subset of states in which the chain remains once it transits to one of these states).

• Then as $M \to \infty$, the chain converges to an invariant distribution $P$, such that $P \times A = P$

• The rate of convergence is determined by the second eigenvalue of matrix $A$

Example

• Consider a MC with three states and the following transition probability matrix:

      | 0.9    0.075   0.025 |
  A = | 0.15   0.8     0.05  |
      | 0.25   0.25    0.5   |

• It can be shown that in this case the steady state probabilities converge to $P = \{0.625, 0.3125, 0.0625\}$

• An interesting application of this convergence property of Markov chains is for ranking web pages
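A quick way to check this (a sketch, ours) is to apply Eq. (7) by power iteration:

```python
# Steady-state distribution of the example chain by power iteration (Eq. 7).
import numpy as np

A = np.array([[0.9,  0.075, 0.025],
              [0.15, 0.8,   0.05],
              [0.25, 0.25,  0.5]])

P = np.full(3, 1 / 3)        # any valid initial distribution works here
for _ in range(200):         # iterate P <- P A until it stops changing
    P = P @ A
print(P.round(4))            # [0.625  0.3125 0.0625]
```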

Hidden Markov Models

HMM

• A Hidden Markov Model (HMM) is a Markov chain where the states are not directly observable.

• An HMM is a double stochastic process: (i) a hidden stochastic process that we cannot directly observe, and (ii) a second stochastic process that produces the sequence of observations given the first process.

• For instance, consider that we have two unfair or “biased” coins, M1 and M2. M1 has a higher probability of heads, while M2 has a higher probability of tails. Someone sequentially flips these two coins; however, we do not know which one. We can only observe the outcome, heads or tails

Example - two unfair coins

Aside from the prior and transition probabilities for the states (as with a MC), in an HMM we need to specify the observation probabilities:

  Π:        M1     M2
            0.5    0.5

  A:        M1     M2
      M1    0.5    0.5
      M2    0.5    0.5

  B:        M1     M2
      H     0.8    0.2
      T     0.2    0.8

Table: The prior probabilities (Π), transition probabilities (A) and observation probabilities (B) for the unfair coins example.
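As a minimal sketch of how this model could be encoded (the array layout and names are ours), the three parameter sets fit naturally into arrays; later code sketches reuse this encoding, with row i of B holding the observation distribution of state i:

```python
# The two-coin HMM, lambda = {A, B, Pi}, as numpy arrays.
import numpy as np

Pi = np.array([0.5, 0.5])       # prior over states {M1, M2}
A = np.array([[0.5, 0.5],       # transition probabilities
              [0.5, 0.5]])
B = np.array([[0.8, 0.2],       # rows: states M1, M2; columns: H, T
              [0.2, 0.8]])
```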

Coins example - state diagram

Definition

Set of states: $Q = \{q_1, q_2, \ldots, q_n\}$

Set of observations: $O = \{o_1, o_2, \ldots, o_m\}$

Vector of prior probabilities: $\Pi = \{\pi_1, \pi_2, \ldots, \pi_n\}$, where $\pi_i = P(S_0 = q_i)$

Matrix of transition probabilities: $A = \{a_{ij}\}$, $i = [1..n]$, $j = [1..n]$, where $a_{ij} = P(S_t = q_j \mid S_{t-1} = q_i)$

Matrix of observation probabilities: $B = \{b_{ik}\}$, $i = [1..n]$, $k = [1..m]$, where $b_{ik} = P(O_t = o_k \mid S_t = q_i)$

Compactly, an HMM is represented as $\lambda = \{A, B, \Pi\}$

Properties

Markov property: $P(S_t = q_j \mid S_{t-1} = q_i, S_{t-2} = q_k, \ldots) = P(S_t = q_j \mid S_{t-1} = q_i)$

Stationary process: $P(S_{t-1} = q_j \mid S_{t-2} = q_i) = P(S_t = q_j \mid S_{t-1} = q_i)$ and $P(O_{t-1} = o_k \mid S_{t-1} = q_i) = P(O_t = o_k \mid S_t = q_i)$, $\forall t$

Independence of observations: $P(O_t = o_k \mid S_t = q_i, S_{t-1} = q_j, \ldots) = P(O_t = o_k \mid S_t = q_i)$

HMM: Graphical Model

Basic Questions

1 Evaluation: given a model, estimate the probability of a sequence of observations.

2 Optimal Sequence: given a model and a particular observation sequence, estimate the most probable state sequence that produced the observations.

3 Parameter learning: given a number of observation sequences, adjust the parameters of the model.


Evaluation - Direct Method

• Evaluation consists in determining the probability of an observation sequence, $O = \{o_1, o_2, o_3, \ldots\}$, given a model, $\lambda$; that is, estimating $P(O \mid \lambda)$

• A sequence of observations, $O = \{o_1, o_2, o_3, \ldots, o_T\}$, can be generated by different state sequences

• To calculate the probability of an observation sequence, we can estimate it for a certain state sequence, and then add the probabilities for all the possible state sequences:

  $P(O \mid \lambda) = \sum_i P(O, Q_i \mid \lambda)$   (8)

• Where:

  $P(O, Q_i \mid \lambda) = \pi_1 b_1(o_1) a_{12} b_2(o_2) \cdots a_{(T-1)T} b_T(o_T)$   (9)

Direct Method

• Thus, the probability of O is given by a summation over all the possible state sequences, Q:

  $P(O \mid \lambda) = \sum_Q \pi_1 b_1(o_1) a_{12} b_2(o_2) \cdots a_{(T-1)T} b_T(o_T)$   (10)

• For a model with N states and an observation length of T, there are $N^T$ possible state sequences. Each term in the summation requires 2T operations. As a result, the evaluation requires a number of operations in the order of $2T \times N^T$

• A more efficient method is required!

Evaluation - iterative method

• The basic idea of the iterative method, also known as Forward, is to estimate the probabilities of the states/observations per time step

• Calculate the probability of a partial sequence of observations until time t, and based on this partial result, calculate it for time t + 1, and so on ...

• Until the last stage is reached and the probability of the complete sequence is obtained.

Iterative method

• Define an auxiliary variable called forward:

  $\alpha_t(i) = P(o_1, o_2, \ldots, o_t, S_t = q_i \mid \lambda)$   (11)

• The iterative algorithm consists of three main parts:

  • Initialization – the $\alpha$ variables for all states at the initial time are obtained: $\alpha_1(i) = P(O_1, S_1 = q_i) = \pi_i b_i(O_1)$

  • Induction – calculate $\alpha_t(j)$ in terms of $\alpha_{t-1}(i)$: $\alpha_t(j) = [\sum_i \alpha_{t-1}(i) a_{ij}] b_j(O_t)$

  • Termination – $P(O \mid \lambda)$ is obtained by adding all the $\alpha_T$: $P(O) = \sum_i \alpha_T(i)$
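A compact Python sketch of this recursion (function name ours), run on the two-coin HMM with observations encoded as 0 = H, 1 = T:

```python
# Forward algorithm: P(O | lambda) in O(N^2 T).
import numpy as np

def forward(obs, A, B, Pi):
    alpha = Pi * B[:, obs[0]]        # initialization: alpha_1(i)
    for o in obs[1:]:                # induction over t = 2..T
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()               # termination: sum_i alpha_T(i)

Pi = np.array([0.5, 0.5])
A = np.array([[0.5, 0.5], [0.5, 0.5]])
B = np.array([[0.8, 0.2], [0.2, 0.8]])
print(forward([0, 0, 1], A, B, Pi))
# 0.125: with uniform Pi and A each flip is H or T with probability 0.5
```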

Complexity

• Each iteration requires approximately N multiplications and N additions per state, so for the N states and T iterations the number of operations is in the order of $N^2 \times T$

• The time complexity is reduced from exponential in T for the direct method to linear in T and quadratic in N for the iterative method

State Estimation

• Finding the most probable sequence of states for an observation sequence, $O = \{o_1, o_2, o_3, \ldots\}$, can be interpreted in two ways: (i) obtaining the most probable state, $S_t$, at each time step t, or (ii) obtaining the most probable sequence of states, $s_0, s_1, \ldots, s_T$

• First we solve the problem of finding the most probable or optimum state for a certain time t, and then the problem of finding the optimum sequence

Auxiliary variables

• The backward variable is analogous to the forward one, but in this case we start from the end of the sequence, that is:

  $\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid S_t = q_i, \lambda)$   (12)

• In a similar way to $\alpha$, $\beta_t(i)$ can be calculated iteratively, but now backwards:

  $\beta_t(i) = \sum_j \beta_{t+1}(j) a_{ij} b_j(o_{t+1})$   (13)

  The $\beta$ variables for T are defined as $\beta_T(j) = 1$

• So $P(O \mid \lambda)$ can be obtained in terms of $\beta$ or a combination of $\alpha$ and $\beta$

Most probable state

• Define $\gamma$, the conditional probability of being in a certain state $q_i$ given the observation sequence:

  $\gamma_t(i) = P(s_t = q_i \mid O, \lambda) = P(s_t = q_i, O \mid \lambda)/P(O)$   (14)

• Which can be written in terms of $\alpha$ and $\beta$ as:

  $\gamma_t(i) = \alpha_t(i)\beta_t(i) / \sum_i \alpha_t(i)\beta_t(i)$   (15)

• This variable, $\gamma$, provides the answer to the first subproblem, the most probable state (MPS) at a time t; we just need to find the state for which it has the maximum value:

  $MPS(t) = \mathrm{ArgMax}_i \, \gamma_t(i)$   (16)
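Putting Eqs. (13), (15) and (16) together, a sketch (names ours) that computes $\beta$, $\gamma$ and the most probable state per time step for the two-coin model:

```python
# Forward-backward smoothing: gamma_t(i) and MPS(t).
import numpy as np

def forward_all(obs, A, B, Pi):
    alphas = [Pi * B[:, obs[0]]]
    for o in obs[1:]:
        alphas.append((alphas[-1] @ A) * B[:, o])
    return np.array(alphas)

def backward_all(obs, A, B):
    betas = [np.ones(len(A))]              # beta_T(j) = 1
    for o in obs[:0:-1]:                   # o_{t+1} for t = T-1 .. 1
        betas.append(A @ (B[:, o] * betas[-1]))
    return np.array(betas[::-1])

Pi = np.array([0.5, 0.5])
A = np.array([[0.5, 0.5], [0.5, 0.5]])
B = np.array([[0.8, 0.2], [0.2, 0.8]])
obs = [0, 0, 1]                            # H, H, T

alpha, beta = forward_all(obs, A, B, Pi), backward_all(obs, A, B)
gamma = alpha * beta / (alpha * beta).sum(axis=1, keepdims=True)   # Eq. (15)
print(gamma.argmax(axis=1))                # MPS(t) for each t (Eq. 16)
```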

Most probable sequence

• We now look for the most probable state sequence Q given the observation sequence O; that is, we want to maximize $P(Q \mid O, \lambda)$

• By Bayes rule: $P(Q \mid O, \lambda) = P(Q, O \mid \lambda)/P(O)$. Given that $P(O)$ does not depend on Q, this is equivalent to maximizing $P(Q, O \mid \lambda)$

• The method for obtaining the optimum state sequence is known as the Viterbi algorithm

Viterbi Algorithm

• $\delta$ gives the maximum value of the probability of a subsequence of states and observations until time t, being at state $q_i$ at time t; that is:

  $\delta_t(i) = \mathrm{MAX}[P(s_1, s_2, \ldots, s_t = q_i, o_1, o_2, \ldots, o_t \mid \lambda)]$   (17)

• Which can also be obtained in an iterative way:

  $\delta_{t+1}(j) = [\mathrm{MAX}_i \, \delta_t(i) a_{ij}] b_j(o_{t+1})$   (18)

• The Viterbi algorithm requires four phases: initialization, recursion, termination and backtracking. It requires an additional variable, $\psi_t(i)$, that stores for each state i at each time step t the previous state that gave the maximum probability; it is used to reconstruct the sequence by backtracking

Algorithm

FOR i = 1 to N (Initialization)
  • $\delta_1(i) = \pi_i b_i(O_1)$
  • $\psi_1(i) = 0$

FOR t = 2 to T (Recursion)
  FOR j = 1 to N
    • $\delta_t(j) = \mathrm{MAX}_i[\delta_{t-1}(i) a_{ij}] b_j(O_t)$
    • $\psi_t(j) = \mathrm{ARGMAX}_i[\delta_{t-1}(i) a_{ij}]$

$P^* = \mathrm{MAX}_i[\delta_T(i)]$ (Termination)
$q_T^* = \mathrm{ARGMAX}_i[\delta_T(i)]$

FOR t = T to 2 (Backtracking)
  • $q_{t-1}^* = \psi_t(q_t^*)$
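A runnable version of this algorithm (0-indexed, names ours), again on the two-coin model; for long sequences, practical implementations work with log probabilities to avoid numerical underflow:

```python
# Viterbi: most probable state sequence and its probability.
import numpy as np

def viterbi(obs, A, B, Pi):
    T, N = len(obs), len(Pi)
    delta = Pi * B[:, obs[0]]              # initialization
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):                  # recursion
        scores = delta[:, None] * A        # scores[i, j] = delta_{t-1}(i) a_ij
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) * B[:, obs[t]]
    q = [int(delta.argmax())]              # termination: q*_T
    for t in range(T - 1, 0, -1):          # backtracking
        q.append(int(psi[t][q[-1]]))
    return q[::-1], delta.max()

Pi = np.array([0.5, 0.5])
A = np.array([[0.5, 0.5], [0.5, 0.5]])
B = np.array([[0.8, 0.2], [0.2, 0.8]])
print(viterbi([0, 0, 1], A, B, Pi))        # ([0, 0, 1], 0.064): M1, M1, M2
```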

Parameter Learning

• This method assumes that the structure of the model is known: the number of states and observations is previously defined; therefore it only estimates the parameters

• The Baum-Welch algorithm determines the parameters of an HMM, $\lambda = \{A, B, \Pi\}$, given a number of observation sequences, $O = O_1, O_2, \ldots, O_K$

• It maximizes the probability of the observations given the model: $P(O \mid \lambda)$

Auxiliary Variables

• Define $\xi$, the probability of a transition from a state i at time t to a state j at time t + 1 given an observation sequence O:

  $\xi_t(i, j) = P(s_t = q_i, s_{t+1} = q_j \mid O, \lambda)$   (19)

  $\xi_t(i, j) = P(s_t = q_i, s_{t+1} = q_j, O \mid \lambda)/P(O)$   (20)

• In terms of $\alpha$ and $\beta$:

  $\xi_t(i, j) = \alpha_t(i) a_{ij} b_j(o_{t+1}) \beta_{t+1}(j)/P(O)$   (21)

• $\gamma$ can also be written in terms of $\xi$: $\gamma_t(i) = \sum_j \xi_t(i, j)$

• By adding $\gamma_t(i)$ for all time steps, $\sum_t \gamma_t(i)$, we obtain an estimate of the number of times that the chain is in state i; and by accumulating $\xi_t(i, j)$ over t, $\sum_t \xi_t(i, j)$, we estimate the number of transitions from state i to state j

Baum-Welch Algorithm

1 Estimate the prior probabilities – the probability of being in state i at the start of the sequence:
  $\pi_i = \gamma_1(i)$

2 Estimate the transition probabilities – the number of transitions from state i to j over the number of times in state i:
  $a_{ij} = \sum_t \xi_t(i, j) / \sum_t \gamma_t(i)$

3 Estimate the observation probabilities – the number of times being in state j and observing k over the number of times in state j:
  $b_{jk} = \sum_{t, O_t = k} \gamma_t(j) / \sum_t \gamma_t(j)$

Expectation-Maximization

• Notice that the calculation of the $\gamma$ and $\xi$ variables is done in terms of $\alpha$ and $\beta$, which require the parameters of the HMM, $\Pi, A, B$. So we have encountered a “chicken and egg” problem!

• The solution to this problem is based on the EM (expectation-maximization) principle

• The idea is to start with some initial parameters for the model, $\lambda = \{A, B, \Pi\}$, which can be initialized randomly or based on some domain knowledge

• Then the expected counts, $\gamma$ and $\xi$, are computed with the current parameters (E-step), and via the Baum-Welch formulas the parameters are re-estimated (M-step)

• This cycle is repeated until convergence
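A compact sketch of this EM loop for a single observation sequence (variable names ours; it omits the per-step scaling of $\alpha$ and $\beta$ that practical implementations use to avoid numerical underflow):

```python
# Baum-Welch (EM) for one observation sequence.
import numpy as np

def baum_welch(obs, A, B, Pi, n_iter=50):
    obs = np.asarray(obs)
    T, N = len(obs), len(Pi)
    for _ in range(n_iter):
        # E-step: forward (alpha) and backward (beta) passes
        alpha = np.zeros((T, N))
        beta = np.ones((T, N))
        alpha[0] = Pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        p_obs = alpha[-1].sum()                        # P(O | lambda)
        gamma = alpha * beta / p_obs                   # Eq. (15)
        xi = (alpha[:-1, :, None] * A[None, :, :] *    # Eq. (21)
              (B[:, obs[1:]].T * beta[1:])[:, None, :]) / p_obs
        # M-step: the re-estimation formulas above
        Pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = np.stack([gamma[obs == k].sum(axis=0)
                      for k in range(B.shape[1])],
                     axis=1) / gamma.sum(axis=0)[:, None]
    return A, B, Pi

obs = [0, 0, 1, 0, 1, 1]                               # e.g. H, H, T, H, T, T
A0 = np.array([[0.6, 0.4], [0.4, 0.6]])
B0 = np.array([[0.7, 0.3], [0.3, 0.7]])
A_hat, B_hat, Pi_hat = baum_welch(obs, A0, B0, np.array([0.5, 0.5]))
```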


    Extensions to the basic HMM model

• Several extensions of the basic HMM have been proposed

• Some of these are listed below


(a) Basic model. (b) Parametric HMMs. (c) Coupled HMMs. (d) Input-Output HMMs. (e) Parallel HMMs. (f) Hierarchical HMMs. (g) Mixed-state dynamic Bayesian networks. (h) Hidden semi-Markov models

Applications

• Markov chains for ordering web pages with the PageRank algorithm

• Application of HMMs in gesture recognition

WWW as a Markov Chain

• We can think of the World Wide Web (WWW) as a very large Markov chain, such that each web page is a state and the hyperlinks between web pages correspond to state transitions

• Each outgoing link can be selected with equal probability; if a page $w_i$ has m outgoing hyperlinks, the transition probability from $w_i$ to any of the pages $w_j$ that it links to is $A_{ij} = 1/m$

PageRank

• Given the transition probability matrix of the WWW, we can obtain the convergence probabilities for each state (web page) according to the Perron-Frobenius theorem

• The convergence probability of a certain web page can be thought of as the probability that a person who is navigating the WWW visits this web page.

• Based on the previous ideas, L. Page et al. developed the PageRank algorithm, which is the basis of how web pages are ordered when we make a search in Google
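A toy sketch of the underlying computation (the link graph is invented; note that the actual PageRank algorithm also mixes in a "damping" factor so that the Perron-Frobenius conditions hold for any link graph):

```python
# Toy page ranking: steady state of the uniform random-surfer chain.
import numpy as np

links = {0: [1, 2], 1: [2], 2: [0, 1, 2]}   # page -> pages it links to
n = len(links)
A = np.zeros((n, n))
for i, outgoing in links.items():
    A[i, outgoing] = 1.0 / len(outgoing)    # A_ij = 1/m for m outgoing links

rank = np.full(n, 1.0 / n)
for _ in range(100):                        # power iteration to the steady state
    rank = rank @ A
print(rank)                                 # higher value = better-ranked page
```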

Gestures

Gestures are essential for human-human communication, so they are also important for human-computer interaction. For example, we can use gestures to command a service robot

Gesture Recognition

• For recognizing gestures, a powerful option is a hidden Markov model

• Before we can apply HMMs to model and recognize gestures, the images in the video sequence need to be processed and a set of features extracted from them; these will constitute the observations for the HMM

• To recognize N different gestures, we need to train N HMMs, one for each gesture, using the Baum-Welch algorithm

• For recognition, the features are extracted from the video sequence. The probability of the observation sequence given each model, $P(O \mid \lambda_i)$, is obtained using the Forward algorithm. The model with the highest probability, $\lambda_k^*$, is selected as the recognized gesture
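A sketch of this recognition step (the helper names are ours, and each gesture's $\lambda_i$ is assumed to have been trained already with Baum-Welch):

```python
# Classify an observation sequence by maximum likelihood over gesture HMMs.
import numpy as np

def forward(obs, A, B, Pi):
    """P(O | lambda) via the Forward algorithm."""
    alpha = Pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

def recognize(obs, models):
    """models: dict gesture_name -> (A, B, Pi); returns the best gesture."""
    scores = {name: forward(obs, A, B, Pi)
              for name, (A, B, Pi) in models.items()}
    return max(scores, key=scores.get)      # ArgMax_i P(O | lambda_i)
```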



    References

    Book

Sucar, L. E.: Probabilistic Graphical Models: Principles and Applications. Springer (2015) – Chapter 5

Additional Reading

Aviles, H., Sucar, L.E., Mendoza, C.E., Pineda, L.A.: A Comparison of Dynamic Naive Bayesian Classifiers and HMMs for Gesture Recognition. JART 9(1) (2011)

Kanungo, T.: Hidden Markov Models Software. http://www.kanungo.com/

Page, L., Brin, S., Motwani, R., Winograd, T.: The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Libraries Working Paper (1998)

Rabiner, L.R.: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. In: Waibel, A., Lee, K. (eds.) Readings in Speech Recognition, Morgan Kaufmann, 267-296 (1990)

Rabiner, L., Juang, B.H.: Fundamentals of Speech Recognition. Prentice-Hall, New Jersey (1993)
