University of Oslo, Department of Informatics

Hidden Markov Models and Dynamic Programming

Stephan Oepen, Jonathon Read

Date: October 14, 2011
Venue: INF4820, Department of Informatics, University of Oslo




Topics

Last week
- Parts-of-speech (POS)
- A symbolic approach to POS tagging
- Stochastic POS tagging

Today
- Hidden Markov Models
- Computing likelihoods using the Forward algorithm
- Decoding hidden states using the Viterbi algorithm



Markov chains

Definition
Q = q_1 q_2 … q_N, a set of N states
q_0, q_F, special start and final states

A = \begin{pmatrix} a_{01} & \cdots & a_{0N} \\ \vdots & \ddots & \vdots \\ a_{N1} & \cdots & a_{NN} \end{pmatrix}

a transition probability matrix, where a_{ij} is the probability of moving from state i to state j.



Hidden Markov models (HMMs)

Markov chains are useful when computing the probability of observable sequences. However, we are often interested in events that are hidden.

Definition
Q = q_1 q_2 … q_N, a set of N states
q_0, q_F, special start and final states

A = \begin{pmatrix} a_{01} & \cdots & a_{0N} \\ \vdots & \ddots & \vdots \\ a_{N1} & \cdots & a_{NN} \end{pmatrix}

a transition probability matrix, where a_{ij} is the probability of moving from state i to state j

O = o_1 o_2 … o_T, a sequence of observations
B = b_i(o_t), a sequence of observation likelihoods


Ice cream and global warming

Missing records of the weather in Baltimore for Summer 2007:
- likelihood of hot/cold weather given yesterday's weather
- Jason's diary, listing how many ice creams he ate each day
- number of ice creams he tends to eat, given the weather



Computing likelihoods

Task
Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ).

Compute the sum over all possible state sequences:

P(O) = \sum_{Q} P(O, Q) = \sum_{Q} P(O | Q) P(Q)

For example, the ice cream sequence 3 1 3:

P(3 1 3) = P(3 1 3, cold cold cold)
         + P(3 1 3, cold cold hot)
         + P(3 1 3, hot hot cold) + … ⇒ O(N^T · T)
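The sum over state sequences can be computed naively by enumerating all N^T of them. A minimal Python sketch of that brute force follows; the two-state weather HMM and all its probability values are illustrative assumptions, not figures from the lecture.

```python
# Brute-force P(O) = sum over Q of P(Q) * P(O | Q), enumerating all N^T
# state sequences.  All probabilities below are illustrative assumptions.
from itertools import product

STATES = ("hot", "cold")
START = {"hot": 0.8, "cold": 0.2}              # a_{0j}: start transitions
TRANS = {"hot": {"hot": 0.7, "cold": 0.3},     # a_{ij}: state transitions
         "cold": {"hot": 0.4, "cold": 0.6}}
EMIT = {"hot": {1: 0.2, 2: 0.4, 3: 0.4},       # b_j(o): emission likelihoods
        "cold": {1: 0.5, 2: 0.4, 3: 0.1}}

def brute_force_likelihood(obs):
    """Sum the joint probability P(O, Q) over every state sequence Q."""
    total = 0.0
    for q in product(STATES, repeat=len(obs)):  # all N^T sequences
        p = START[q[0]] * EMIT[q[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= TRANS[q[t - 1]][q[t]] * EMIT[q[t]][obs[t]]
        total += p
    return total
```

With N = 2 states and T = 3 observations this already touches 8 sequences, and the count grows exponentially in T, which is exactly what the Forward algorithm's dynamic programming avoids.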



The Forward algorithm

Employs dynamic programming: storing and reusing the results of partial computations in a trellis α.

Each cell in the trellis stores the probability of being in state q_j after seeing the first t observations:

α_t(j) = P(o_1 … o_t, q_t = j) = \sum_{i=1}^{N} α_{t−1}(i) a_{ij} b_j(o_t)



Calculating a single element in the trellis


Pseudocode for the Forward algorithm

Input: observations of length T, state-graph of length N
Output: forward-probability

create a probability matrix forward[N + 2, T]
foreach state s from 1 to N do
    forward[s, 1] ← a_{0,s} × b_s(o_1)
end
foreach time step t from 2 to T do
    foreach state s from 1 to N do
        forward[s, t] ← Σ_{s′=1}^{N} forward[s′, t − 1] × a_{s′,s} × b_s(o_t)
    end
end
forward[q_F, T] ← Σ_{s=1}^{N} forward[s, T] × a_{s,q_F}
return forward[q_F, T]
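The pseudocode above can be sketched directly in Python. The two-state weather HMM and its probabilities are illustrative assumptions, and this sketch omits the distinct final state q_F, simply summing the last trellis column instead; adding the a_{s,q_F} factors would be a one-line change.

```python
# Forward algorithm: alpha_t(j) = P(o_1..o_t, q_t = j), filled left to right.
# All HMM parameters are illustrative assumptions.
START = {"hot": 0.8, "cold": 0.2}              # a_{0j}
TRANS = {"hot": {"hot": 0.7, "cold": 0.3},     # a_{ij}
         "cold": {"hot": 0.4, "cold": 0.6}}
EMIT = {"hot": {1: 0.2, 2: 0.4, 3: 0.4},       # b_j(o)
        "cold": {1: 0.5, 2: 0.4, 3: 0.1}}

def forward_likelihood(obs):
    """Return P(O) by summing the final column of the trellis."""
    # Initialisation: alpha_1(j) = a_{0j} * b_j(o_1)
    alpha = {s: START[s] * EMIT[s][obs[0]] for s in START}
    # Recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * a_{ij} * b_j(o_t)
    for o in obs[1:]:
        alpha = {j: sum(alpha[i] * TRANS[i][j] for i in alpha) * EMIT[j][o]
                 for j in alpha}
    # Termination: sum over the last column (no distinct final state here)
    return sum(alpha.values())
```

Each time step only needs the previous column, so the trellis is kept as a single dictionary that is rebuilt per observation, giving O(N²T) time and O(N) space.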


An example of the Forward algorithm


The ice cream HMM


Decoding hidden states

Task
Given an HMM λ = (A, B) and a sequence of observations O = o_1, o_2, …, o_T, find the most probable corresponding sequence of hidden states Q = q_1, q_2, …, q_T.

v_t(j) = max_{q_1,…,q_{t−1}} P(o_1 … o_t, q_1 … q_{t−1}, q_t = j) = max_{i=1}^{N} v_{t−1}(i) a_{ij} b_j(o_t)

and additionally keep a backtrace to whichever state led to the most probable path to the current state:

β_t(j) = argmax_{i=1}^{N} v_{t−1}(i) a_{ij}



Pseudocode for the Viterbi algorithm

Input: observations of length T, state-graph of length N
Output: best-path

create a path probability matrix viterbi[N + 2, T]
create a path backpointer matrix backpointer[N + 2, T]
foreach state s from 1 to N do
    viterbi[s, 1] ← a_{0,s} × b_s(o_1)
    backpointer[s, 1] ← 0
end
foreach time step t from 2 to T do
    foreach state s from 1 to N do
        viterbi[s, t] ← max_{s′=1}^{N} viterbi[s′, t − 1] × a_{s′,s} × b_s(o_t)
        backpointer[s, t] ← argmax_{s′=1}^{N} viterbi[s′, t − 1] × a_{s′,s}
    end
end
viterbi[q_F, T] ← max_{s=1}^{N} viterbi[s, T] × a_{s,q_F}
backpointer[q_F, T] ← argmax_{s=1}^{N} viterbi[s, T] × a_{s,q_F}
return the path by following backpointers from backpointer[q_F, T]


An example of the Viterbi algorithm


The ice cream HMM


(A Practical Tip)

- When multiplying many small probabilities, we risk getting values that are too close to zero to be represented: underflow.
- It is often helpful to work in "log-space": log(max f) = max(log f)
- Reduces multiplication to addition: log ∏_i P_i = Σ_i log P_i
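A small sketch of why log-space helps: a long product of probabilities underflows to zero in double precision, while the equivalent sum of logs stays comfortably representable.

```python
import math

# 400 factors of 0.1: the true product is 1e-400, far below the smallest
# positive double (about 5e-324), so the naive product underflows to 0.0.
probs = [0.1] * 400

naive = 1.0
for p in probs:
    naive *= p                       # eventually rounds to exactly 0.0

# log prod_i P_i = sum_i log P_i: multiplication becomes addition
log_prob = sum(math.log(p) for p in probs)
```

For Viterbi, replacing products with sums of logs is enough, since max commutes with the monotonic log; the Forward algorithm also sums probabilities, which in log-space requires a log-sum-exp step.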



Evaluation

- Using a manually labeled test set as our gold standard, we can compute the accuracy of our model: the percentage of tags in the test set that the tagger gets right.
- Compare the accuracy to some reference models: an upper bound and a baseline.
- An upper-bound ceiling can be based on e.g. how well humans would do on the task, or by assuming an "oracle".
- A lower-bound baseline can be based on the accuracy expected from e.g. random choice, always picking the tag with the highest frequency, or applying a unigram model.
- Standard hypothesis tests can be applied to test the statistical significance of any differences.
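The accuracy computation and a most-frequent-tag baseline can be sketched in a few lines; the toy tag sequences below are invented for illustration.

```python
# Token-level accuracy against a gold standard, plus a majority-tag baseline.
# The tag sequences are invented illustrative data, not a real test set.
from collections import Counter

gold      = ["NOUN", "VERB", "NOUN", "DET", "NOUN", "VERB"]
predicted = ["NOUN", "VERB", "VERB", "DET", "NOUN", "NOUN"]

def accuracy(pred, gold_tags):
    """Fraction of positions where the prediction matches the gold tag."""
    return sum(p == g for p, g in zip(pred, gold_tags)) / len(gold_tags)

# Lower-bound baseline: always predict the most frequent gold tag
majority_tag = Counter(gold).most_common(1)[0][0]
baseline = [majority_tag] * len(gold)
```

A tagger is only interesting insofar as its accuracy clearly beats such a baseline and approaches the chosen upper bound.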
