Hidden Markov Models CBB 231 / COMPSCI 261


Page 1

Hidden Markov Models

CBB 231 / COMPSCI 261

Page 2

An HMM is a stochastic machine M=(Q, Σ, Pt, Pe) consisting of the following:

• a finite set of states, Q={q0, q1, ..., qm}
• a finite alphabet Σ={s0, s1, ..., sn}
• a transition distribution Pt : Q×Q → [0,1], i.e., Pt(qj | qi)
• an emission distribution Pe : Q×Σ → [0,1], i.e., Pe(sj | qi)

[State diagram: q0 --100%--> q1; q1: self-loop 80%, --15%--> q2, --5%--> q0; q2: self-loop 70%, --30%--> q1. Emissions: q1: Y=100%, R=0%; q2: R=100%, Y=0%.]

What is an HMM?

M1=({q0,q1,q2},{Y,R},Pt,Pe)

Pt={(q0,q1,1), (q1,q1,0.8), (q1,q2,0.15), (q1,q0,0.05), (q2,q2,0.7), (q2,q1,0.3)}

Pe={(q1,Y,1), (q1,R,0), (q2,Y,0), (q2,R,1)}

An Example

Page 3

[Same state diagram as on the previous slide.]

P(YRYRY|M1) = a0→1 b1,Y a1→2 b2,R a2→1 b1,Y a1→2 b2,R a2→1 b1,Y a1→0
            = 1 × 1 × 0.15 × 1 × 0.3 × 1 × 0.15 × 1 × 0.3 × 1 × 0.05
            = 0.00010125

Probability of a Sequence
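As a sanity check, the product above can be reproduced in a few lines of Python. The dictionary encoding of M1 is my own (only nonzero entries are listed); the slides themselves do not specify a data structure:

```python
# Sketch of the example machine M1: Pt maps (current, next) to a transition
# probability, Pe maps (state, symbol) to an emission probability.
Pt = {(0,1): 1.0, (1,1): 0.8, (1,2): 0.15, (1,0): 0.05, (2,2): 0.7, (2,1): 0.3}
Pe = {(1,'Y'): 1.0, (1,'R'): 0.0, (2,'Y'): 0.0, (2,'R'): 1.0}

def path_probability(states, symbols):
    """P(sequence, path): product of transition and emission probabilities.
    `states` includes the silent begin/end state q0 at both ends."""
    p = 1.0
    for i, sym in enumerate(symbols):
        p *= Pt.get((states[i], states[i+1]), 0.0)   # enter the next state
        p *= Pe.get((states[i+1], sym), 0.0)         # emit the symbol there
    p *= Pt.get((states[-2], states[-1]), 0.0)       # final return to q0
    return p

# The only path that can emit YRYRY is q0 1 2 1 2 1 q0:
p = path_probability([0, 1, 2, 1, 2, 1, 0], "YRYRY")
print(p)  # ≈ 0.00010125
```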

Page 4

Another Example

M2 = (Q, Σ, Pt, Pe)

Q = {q0, q1, q2, q3, q4}

Σ = {A, C, G, T}

[State diagram: transitions among q0...q4 (probabilities include 100%, 80%, 65%, 50%, 35%, 20%); emission distributions for the four emitting states: A=10%/T=30%/C=40%/G=20%; A=11%/T=17%/C=43%/G=29%; A=35%/T=25%/C=15%/G=25%; A=27%/T=14%/C=22%/G=37%.]

Page 5

[Same state diagram as on the previous slide.]

Finding the Most Probable Path

Example: C A T T A A T A G

The most probable path is:
 States:   122222224
 Sequence: CATTAATAG

resulting in this parse:
 feature 1: C
 feature 2: ATTAATA
 feature 3: G

(probability of the top path: 7.0×10⁻⁷; of the bottom path: 2.8×10⁻⁹)

Page 6

Decoding with an HMM

φmax = argmaxφ P(φ|S) = argmaxφ P(S∧φ)/P(S) = argmaxφ P(S∧φ) = argmaxφ P(S|φ) P(φ)

P(φ) = ∏i=0..L Pt(yi+1 | yi)

P(S|φ) = ∏i=0..L−1 Pe(xi | yi+1)

φmax = argmaxφ Pt(q0 | yL) ∏i=0..L−1 Pe(xi | yi+1) Pt(yi+1 | yi)
                                     (emission prob.) (transition prob.)

Page 7

The Best Partial Parse

φi,k = the best partial parse ending in state qi at position k

φi,k = { φj*,k−1 + qi, where j* = argmaxj P(φj,k−1, x0...xk−1) Pt(qi|qj) Pe(xk|qi),   if k > 0
       { q0 qi,                                                                      if k = 0

P(φi,k, x0...xk) = { maxj P(φj,k−1, x0...xk−1) Pt(qi|qj) Pe(xk|qi),   if k > 0
                   { Pt(qi|q0) Pe(x0|qi),                             if k = 0

Page 8

The Viterbi Algorithm

V(i,k) = { maxj V(j,k−1) Pt(qi|qj) Pe(xk|qi),   if k > 0
         { Pt(qi|q0) Pe(x0|qi),                 if k = 0

φmax = argmaxφi,L−1 V(i, L−1) Pt(q0|qi)

[DP matrix: states × sequence positions; cell (i,k) is filled from column k−1.]

Page 9

Viterbi: Traceback

V(i,k) = { maxj V(j,k−1) Pt(qi|qj) Pe(xk|qi),   if k > 0,
         { Pt(qi|q0) Pe(x0|qi),                 if k = 0.

T(i,k) = { argmaxj V(j,k−1) Pt(qi|qj) Pe(xk|qi),   if k > 0,
         { 0,                                       if k = 0.

The traceback follows the pointers back to the begin state:

T( T( T( ... T( T(i, L−1), L−2) ..., 2), 1), 0) = 0

Page 10

Viterbi Algorithm in Pseudocode

trans[qi] = {qj | Pt(qi|qj) > 0}
emit[s] = {qi | Pe(s|qi) > 0}

• initialization
• fill out main part of DP matrix
• choose best state from last column in DP matrix
• traceback
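The steps above can be sketched concretely for the toy machine M1 from the earlier slides. The data structures and names below are mine, not from the pseudocode slide; state 0 is the silent begin/end state:

```python
# Viterbi decoding for M1: V[k][i] is the probability of the best partial
# path that emits seq[:k+1] and ends in state i; T holds traceback pointers.
Pt = {(0,1): 1.0, (1,1): 0.8, (1,2): 0.15, (1,0): 0.05, (2,2): 0.7, (2,1): 0.3}
Pe = {(1,'Y'): 1.0, (2,'R'): 1.0}          # zero entries omitted
STATES = [1, 2]                             # emitting states

def viterbi(seq):
    # initialization: first column from the begin state q0
    V = [{i: Pt.get((0,i), 0.0) * Pe.get((i, seq[0]), 0.0) for i in STATES}]
    T = [{i: 0 for i in STATES}]
    # fill out the main part of the DP matrix
    for x in seq[1:]:
        col, ptr = {}, {}
        for i in STATES:
            j = max(STATES, key=lambda j: V[-1][j] * Pt.get((j,i), 0.0))
            col[i] = V[-1][j] * Pt.get((j,i), 0.0) * Pe.get((i,x), 0.0)
            ptr[i] = j
        V.append(col); T.append(ptr)
    # choose best state from the last column (including the return to q0)
    best = max(STATES, key=lambda i: V[-1][i] * Pt.get((i,0), 0.0))
    # traceback
    path = [best]
    for ptr in reversed(T[1:]):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi("YRYRY"))  # [1, 2, 1, 2, 1]
```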

Page 11

The Forward Algorithm: Probability of a Sequence

F(i,k) = { 1,                                              if k = 0, i = 0
         { 0,                                              if k > 0, i = 0
         { 0,                                              if k = 0, i > 0
         { Σj=0..|Q|−1 F(j,k−1) Pt(qi|qj) Pe(xk−1|qi),     for 1 ≤ k ≤ |S|, 1 ≤ i < |Q|

P(S|M) = Σi=0..|Q|−1 F(i,|S|) Pt(q0|qi)

F(i,k) represents the probability that the machine emits the subsequence x0...xk−1 by any path ending in state qi, i.e., so that symbol xk−1 is emitted by state qi.

Page 12

The Forward Algorithm: Probability of a Sequence

Viterbi:  maxj V(j,k−1) Pt(qi|qj) Pe(xk|qi)          — the single most probable path

Forward:  Σj=0..|Q|−1 F(j,k−1) Pt(qi|qj) Pe(xk−1|qi) — sum over all paths, i.e., Σall paths φ P(S,φ)

[DP matrix: states × sequence positions; cell (i,k) is filled from column k−1.]

Page 13

The Forward Algorithm in Pseudocode

• fill out the DP matrix
• sum over the final column to get P(S)
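The two steps can be sketched for the toy machine M1 from the earlier slides (the encoding is my own; state 0 is the silent begin/end state):

```python
# Forward algorithm for M1: F[i] holds the probability that the machine
# emits the prefix seen so far by some path ending in state i.
Pt = {(0,1): 1.0, (1,1): 0.8, (1,2): 0.15, (1,0): 0.05, (2,2): 0.7, (2,1): 0.3}
Pe = {(1,'Y'): 1.0, (2,'R'): 1.0}          # zero entries omitted
STATES = [1, 2]

def forward(seq):
    # first column: enter from q0 and emit the first symbol
    F = {i: Pt.get((0,i), 0.0) * Pe.get((i, seq[0]), 0.0) for i in STATES}
    # fill out the DP matrix column by column, summing over predecessors
    for x in seq[1:]:
        F = {i: sum(F[j] * Pt.get((j,i), 0.0) for j in STATES)
                 * Pe.get((i,x), 0.0)
             for i in STATES}
    # sum over the final column, including the transition back to q0
    return sum(F[i] * Pt.get((i,0), 0.0) for i in STATES)

print(forward("YRYRY"))  # ≈ 0.00010125 — equals the single-path probability,
                         # since only one path in M1 can emit YRYRY
```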

Page 14

Training an HMM from Labeled Sequences

Labeled training data (each symbol tagged with the state that emitted it):

Sequence: CGATATTCGATTCTACGCGCGTATACTAGCTTATCTGATCCGATATTCGATTCTACGCGCGTATACTAGCTTATCTGATC
States:   001111111111111122222222222222111111111111222222221111111111111122222222111111111100

transition counts:

from \ to    0         1          2
0            0 (0%)    1 (100%)   0 (0%)
1            1 (4%)    21 (84%)   3 (12%)
2            0 (0%)    3 (20%)    12 (80%)

emission counts:

in state     A         C         G         T
1            6 (24%)   7 (28%)   5 (20%)   7 (28%)
2            3 (20%)   3 (20%)   2 (13%)   7 (47%)

transitions:  ai,j = Ai,j / Σh=0..|Q|−1 Ai,h

emissions:    ei,k = Ei,k / Σh=0..|Σ|−1 Ei,h
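The maximum-likelihood estimates ai,j and ei,k are just normalized rows of the count tables. A minimal sketch using the counts from this slide (variable names are mine):

```python
# Labeled-sequence training: a_{i,j} = A_{i,j} / sum_h A_{i,h}, and the
# analogous normalization for emissions.
A = {0: {0: 0, 1: 1, 2: 0},                 # transition counts A[i][j]
     1: {0: 1, 1: 21, 2: 3},
     2: {0: 0, 1: 3, 2: 12}}
E = {1: {'A': 6, 'C': 7, 'G': 5, 'T': 7},   # emission counts E[i][s]
     2: {'A': 3, 'C': 3, 'G': 2, 'T': 7}}

def normalize(counts):
    """Turn one row of counts into a probability distribution."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

a = {i: normalize(row) for i, row in A.items()}   # Pt(qj | qi)
e = {i: normalize(row) for i, row in E.items()}   # Pe(s | qi)

print(a[1][1])    # 21/25 = 0.84, matching the 84% in the table
print(e[2]['T'])  # 7/15 ≈ 0.467, matching the 47% in the table
```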

Page 15

Recall: Eukaryotic Gene Structure

[Figure: a gene with three exons and two introns. Signals along the sequence: start codon (ATG), donor site (GT), acceptor site (AG), donor site (GT), acceptor site (AG), stop codon (TGA). The exons concatenate to form the coding segment of the complete mRNA.]

Page 16

Using an HMM for Gene Prediction

the input sequence:
AGCTAGCAGTATGTCATGGCATGTTCGGAGGTAGTACGTAGAGGTAGCTAGTATAGGTCGATAGTACGCGA

the Markov model: [states q0, Intergenic, Start codon, Exon, Donor, Intron, Acceptor, Stop codon]

the most probable path: [path through the model states above the sequence]

the gene prediction: exon 1, exon 2, exon 3

Page 17

Higher Order Markovian Eukaryotic Recognizer (HOMER)

versions: H3, H5, H17, H27, H77, H95

Page 18

HOMER, version H3

N = intergenic state, E = exon state, I = intron state

[Model: q0 plus Intergenic, Start codon, Exon, Donor, Intron, Acceptor, Stop codon states]

tested on 500 Arabidopsis genes:

          nucleotides    splice sites   start/stop    exons          genes
          Sn   Sp   F    Sn   Sp        Sn   Sp       Sn   Sp   F    Sn   #
baseline  100  28   44   0    0         0    0        0    0    0    0    0
H3        53   88   66   0    0         0    0        0    0    0    0    0

Page 19

Recall: Sensitivity and Specificity

Sn = TP / (TP + FN)

Sp = TP / (TP + FP)

F = 2×Sn×Sp / (Sn + Sp)
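These three measures are simple ratios; a minimal sketch (function names and the example counts are mine, not from the slides):

```python
# Accuracy measures from the slide: sensitivity, specificity, and their
# harmonic mean F. "Specificity" here is TP/(TP+FP), as defined above.
def sensitivity(tp, fn):
    return tp / (tp + fn)

def specificity(tp, fp):
    return tp / (tp + fp)

def f_measure(sn, sp):
    return 2 * sn * sp / (sn + sp)

# Hypothetical example: 80 true positives, 20 false negatives, 120 false positives.
sn = sensitivity(tp=80, fn=20)    # 0.8
sp = specificity(tp=80, fp=120)   # 0.4
print(f_measure(sn, sp))          # harmonic mean of 0.8 and 0.4
```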

Page 20

HOMER, version H5

three exon states, for the three codon positions

          nucleotides    splice sites   start/stop    exons          genes
          Sn   Sp   F    Sn   Sp        Sn   Sp       Sn   Sp   F    Sn   #
H3        53   88   66   0    0         0    0        0    0    0    0    0
H5        65   91   76   1    3         3    3        0    0    0    0    0

Page 21

HOMER, version H17

models added for the donor site, acceptor site, start codon, and stop codon

[Model: q0 plus Intergenic, Start codon, Exon, Donor, Intron, Acceptor, Stop codon states]

          nucleotides    splice sites   start/stop    exons          genes
          Sn   Sp   F    Sn   Sp        Sn   Sp       Sn   Sp   F    Sn   #
H5        65   91   76   1    3         3    3        0    0    0    0    0
H17       81   93   87   34   48        43   37       19   24   21   7    35

Page 22

Maintaining Phase Across an Intron

sequence:    GTATGCGATAGTCAAGAGTGATCGCTAGACC
phase:       01201201 201201201201201201 2012012012
coordinates: 0    5    10   15   20   25   30

Page 23

HOMER, version H27

three separate intron models

          nucleotides    splice sites   start/stop    exons          genes
          Sn   Sp   F    Sn   Sp        Sn   Sp       Sn   Sp   F    Sn   #
H17       81   93   87   34   48        43   37       19   24   21   7    35
H27       83   93   88   40   49        41   36       23   27   25   8    38

Page 24

Recall: Weight Matrices

ATG            (start codons)
TGA, TAA, TAG  (stop codons)
GT             (donor splice sites)
AG             (acceptor splice sites)

Page 25

HOMER, version H77

positional biases near splice sites

          nucleotides    splice sites   start/stop    exons          genes
          Sn   Sp   F    Sn   Sp        Sn   Sp       Sn   Sp   F    Sn   #
H27       83   93   88   40   49        41   36       23   27   25   8    38
H77       88   96   92   66   67        51   46       47   46   46   13   65

Page 26

HOMER, version H95

          nucleotides    splice sites   start/stop    exons          genes
          Sn   Sp   F    Sn   Sp        Sn   Sp       Sn   Sp   F    Sn   #
H77       88   96   92   66   67        51   46       47   46   46   13   65
H95       92   97   94   79   76        57   53       62   59   60   19   93

Page 27

Summary of HOMER Results

          nucleotides    splice sites   start/stop    exons          genes
          Sn   Sp   F    Sn   Sp        Sn   Sp       Sn   Sp   F    Sn   #
baseline  100  28   44   0    0         0    0        0    0    0    0    0
H3        53   88   66   0    0         0    0        0    0    0    0    0
H5        65   91   76   1    3         3    3        0    0    0    0    0
H17       81   93   87   34   48        43   37       19   24   21   7    35
H27       83   93   88   40   49        41   36       23   27   25   8    38
H77       88   96   92   66   67        51   46       47   46   46   13   65
H95       92   97   94   79   76        57   53       62   59   60   19   93

Page 28

Higher-order Markov Models

context: A C [G] C T A

0th order: P(G)
1st order: P(G|C)
2nd order: P(G|AC)

Pe(gn | g0...gn−1, qj) ≈ C(g0...gn, qj) / Σs∈Σ C(g0...gn−1s, qj)
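The count-based estimate above can be sketched directly from (n+1)-mer counts. For brevity this drops the per-state conditioning on qj; the training string and names are my own:

```python
# Order-k emission estimate: P(g_n | g_0..g_{n-1}) = count of the full
# (k+1)-mer divided by counts of the context followed by any symbol.
from collections import Counter

ALPHABET = "ACGT"

def order_k_model(training_seq, k):
    counts = Counter(training_seq[i:i+k+1]
                     for i in range(len(training_seq) - k))
    def pe(sym, context):            # context = the k previous symbols
        denom = sum(counts[context + s] for s in ALPHABET)
        return counts[context + sym] / denom if denom else 0.0
    return pe

pe = order_k_model("ACGCTAACGCTTACGA", k=2)
print(pe("G", "AC"))   # P(G | AC) estimated from 3-mer counts
```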

Page 29

Higher-order Markov Models

       order  nucleotides    splice sites   starts/stops   exons          genes
              Sn   Sp   F    Sn   Sp        Sn   Sp        Sn   Sp   F    Sn   #
H95    0      92   97   94   79   76        57   53        62   59   60   19   93
H95    1      95   98   97   87   81        64   61        72   68   70   25   127
H95    2      98   98   98   91   82        65   62        76   69   72   27   136
H95    3      98   98   98   91   82        67   63        76   69   72   28   140
H95    4      98   97   98   90   81        69   64        76   68   72   29   143
H95    5      98   97   98   90   81        66   62        74   67   70   27   137

Page 30

Variable-Order Markov Models

Pe_backoff(gn | g0...gn−1, qj) =
  { C(g0...gn, qj) / Σs∈Σ C(g0...gn−1s, qj),   if C(g0...gn−1, qj) ≥ K or n = 0
  { Pe_backoff(gn | g1...gn−1, qj),            otherwise

Pe_IMM(s | g0...gk−1) =
  { λkG Pe(s | g0...gk−1) + (1 − λkG) Pe_IMM(s | g1...gk−1),   if k > 0
  { Pe(s),                                                     if k = 0

λkG =
  { 1,                               if m ≥ 400
  { 0,                               if m < 400 and c < 0.5
  { (c/400) Σx∈Σ C(g0...gk−1x),      otherwise

where m = Σx∈Σ C(g0...gk−1x) is the number of times the context g0...gk−1 was observed and c is a confidence value for the full-context estimate.
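The interpolation scheme above can be sketched as follows. This is a minimal illustration, not the full estimator: the 400 threshold and the m/c roles follow the λ formula on this slide, while the `pe` and `lam` callables stand in for estimates a real model would supply:

```python
# GLIMMER-style interpolation weight: trust the full-context estimate when
# the context is seen often (m >= 400) or with high confidence (c >= 0.5).
def interpolation_weight(m, c):
    """m: number of observations of the context; c: confidence in [0,1]."""
    if m >= 400:
        return 1.0
    if c < 0.5:
        return 0.0
    return c * m / 400.0

def pe_imm(s, context, pe, lam):
    """Pe_IMM(s | context): blend the full-context estimate with the
    recursively shortened context, per the recurrence above."""
    if not context:
        return pe(s, "")                       # k = 0 base case: Pe(s)
    return (lam(context) * pe(s, context)
            + (1 - lam(context)) * pe_imm(s, context[1:], pe, lam))
```

For example, with a constant estimate `pe(s, ctx) = 0.25` and any weights, the blend stays 0.25, since interpolation averages the estimates it is given.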

Page 31

Interpolation Results

Page 32

Summary

• An HMM is a stochastic generative model which emits sequences

• Parsing with an HMM can be accomplished using a decoding algorithm (such as Viterbi) to find the most probable (MAP) path generating the input sequence

• Training of unambiguous HMMs can be accomplished using labeled sequence training

• Training of ambiguous HMMs can be accomplished using Viterbi training or the Baum-Welch algorithm (next lesson...)

• Posterior decoding can be used to estimate the probability that a given symbol or substring was generated by a particular state (next lesson...)