Generalized Hidden Markov Models
Generalized Hidden Markov Models
CBB 231 / COMPSCI 261
for Eukaryotic gene prediction
W.H. Majoros
[Figure: eukaryotic gene syntax. A complete mRNA contains a coding segment in which exons alternate with introns, delimited by the start codon (ATG), donor sites (GT), acceptor sites (AG), and the stop codon (TGA).]
Recall: Gene Syntax
An HMM is a stochastic machine M=(Q, Σ, Pt, Pe) consisting of the following:

• a finite set of states, Q={q0, q1, ..., qm}
• a finite alphabet Σ={s0, s1, ..., sn}
• a transition distribution Pt : Q×Q → [0,1], i.e., Pt(qj | qi)
• an emission distribution Pe : Q×Σ → [0,1], i.e., Pe(sj | qi)
[Figure: the example machine M1. State q0 transitions to q1 with probability 100%. State q1 self-transitions with 80%, goes to q2 with 15%, and returns to q0 with 5%; it emits Y 100% of the time (R: 0%). State q2 self-transitions with 70% and goes to q1 with 30%; it emits R 100% of the time (Y: 0%).]
Recall: “Pure” HMMs
M1=({q0,q1,q2},{Y,R},Pt,Pe)
Pt={(q0,q1,1), (q1,q1,0.8), (q1,q2,0.15), (q1,q0,0.05), (q2,q2,0.7), (q2,q1,0.3)}
Pe={(q1,Y,1), (q1,R,0), (q2,Y,0), (q2,R,1)}
An Example
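The example machine M1 can be simulated directly. The following Python sketch (illustrative only; the dictionary encoding of Pt and Pe is an assumption of this example, not part of the slides) samples one sequence from M1:

```python
import random

# The example HMM M1 defined above: states q0, q1, q2; alphabet {Y, R}.
Pt = {"q0": [("q1", 1.0)],
      "q1": [("q1", 0.8), ("q2", 0.15), ("q0", 0.05)],
      "q2": [("q2", 0.7), ("q1", 0.3)]}
Pe = {"q1": [("Y", 1.0)], "q2": [("R", 1.0)]}

def sample(seed=0):
    """Generate one sequence from M1: start in q0 and emit one symbol per
    state visit until the machine returns to the silent state q0."""
    rng = random.Random(seed)
    state, seq = "q0", []
    while True:
        nxt, probs = zip(*Pt[state])
        state = rng.choices(nxt, probs)[0]
        if state == "q0":           # q0 is silent; reaching it ends the run
            return "".join(seq)
        syms, eprobs = zip(*Pe[state])
        seq.append(rng.choices(syms, eprobs)[0])

print(sample())
```

Because q1 emits only Y and q2 emits only R, every sampled sequence is a series of Y-runs and R-runs.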
A GHMM is a stochastic machine M=(Q, Σ, Pt, Pe, Pd) consisting of the following:

• a finite set of states, Q={q0, q1, ..., qm}
• a finite alphabet Σ={s0, s1, ..., sn}
• a transition distribution Pt : Q×Q → [0,1], i.e., Pt(qj | qi)
• an emission distribution Pe : Q×Σ*×ℕ → [0,1], i.e., Pe(sj | qi, dj)
• a duration distribution Pd : Q×ℕ → [0,1], i.e., Pd(dj | qi)
Generalized HMMs
• each state now emits an entire subsequence rather than just one symbol
• feature lengths are now explicitly modeled, rather than implicitly geometric
• emission probabilities can now be modeled by any arbitrary probabilistic model
• there tend to be far fewer states => simplicity & ease of modification
Key Differences
Ref: Kulp D, Haussler D, Reese M, Eeckman F (1996) A generalized hidden Markov model for the recognition of human genes in DNA. ISMB '96.
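The first two differences can be made concrete with a small sketch. Below, a hypothetical GHMM "exon" state first draws an explicit (non-geometric) duration and then emits a whole subsequence from an arbitrary content model, whereas a pure HMM state emits a single symbol per visit. All distributions here are invented for illustration:

```python
import random

rng = random.Random(42)

def emit_hmm_state(symbol_probs):
    """Pure HMM: one visit to a state emits exactly one symbol."""
    symbols, probs = zip(*symbol_probs.items())
    return rng.choices(symbols, probs)[0]

def emit_ghmm_state(duration_probs, content_model):
    """GHMM: a state visit first draws a duration d ~ Pd, then emits a
    whole subsequence of length d from an arbitrary content model."""
    durations, probs = zip(*duration_probs.items())
    d = rng.choices(durations, probs)[0]
    return "".join(content_model() for _ in range(d))

# Hypothetical exon state: explicit length distribution + 0th-order content model.
exon_lengths = {90: 0.2, 120: 0.5, 150: 0.3}        # explicit, non-geometric
exon_content = lambda: rng.choices("ACGT", [0.2, 0.3, 0.3, 0.2])[0]

subseq = emit_ghmm_state(exon_lengths, exon_content)
print(len(subseq))   # one of 90, 120, 150
```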
If an HMM state emits one symbol per visit and self-transitions with probability p, then the probability of it emitting a feature x0...xd−1 of length d is:

P(x0...xd−1 | θ) = ( ∏_{i=0}^{d−1} Pe(xi | θ) ) p^{d−1} (1−p)

i.e., the implied feature-length distribution is geometric.

[Figure: observed exon lengths vs. a geometric distribution.]
HMMs & Geometric Feature Lengths
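That the self-transition construction really yields geometric lengths can be checked empirically. This sketch (illustrative, not from the slides) samples state-visit durations for p = 0.9 and compares the sample mean with the geometric mean 1/(1−p) = 10:

```python
import random

def sample_duration(p, rng):
    """Length of one state visit when the self-transition probability is p:
    stay with probability p, leave with probability 1-p."""
    d = 1
    while rng.random() < p:
        d += 1
    return d

rng = random.Random(0)
p = 0.9
samples = [sample_duration(p, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)
print(round(mean, 2))   # geometric mean duration is 1/(1-p) = 10
```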
Advantages:
• Submodel abstraction
• Architectural simplicity
• State duration modeling

Disadvantages:
• Decoding complexity
Model Abstraction in GHMMs
Fixed-length states are called signal states (diamonds).
Variable-length states are called content states (ovals).
Each state has a separate submodel or sensor.
Typical GHMM Topology for Gene Finding
Sparse: the in-degree of every state is bounded by some small constant.
1. WMM (Weight Matrix):   P(S) = ∏_{i=0}^{L−1} Pi(xi)

2. Nth-order Markov Chain (MC):   P(S) = ∏_{i=0}^{n−1} P(xi | x0...xi−1) · ∏_{i=n}^{L−1} P(xi | xi−n...xi−1)

3. Three-Periodic Markov Chain (3PMC):   P(S) = ∏_{i=0}^{L−1} P_{(f+i) mod 3}(xi)

5. Codon Bias:   P(S) = ∏_{i=0}^{n−1} P(x3i x3i+1 x3i+2)

6. MDD (Maximal Dependence Decomposition)

7. Interpolated Markov Model (IMM):

   Pe^{IMM}(s | g0...gk−1) = { λk^G · Pe(s | g0...gk−1) + (1 − λk^G) · Pe^{IMM}(s | g1...gk−1),   if k > 0
                             { Pe(s),                                                             if k = 0

Some GHMM Submodel Types
Ref: Burge C (1997) Identification of complete gene structures in human genomic DNA. PhD thesis. Stanford University.
Ref: Salzberg SL, Delcher AL, Kasif S, White O (1998) Microbial gene identification using interpolated Markov models. Nucleic Acids Research 26:544-548.
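The WMM and Markov-chain sensors above translate directly into code. The following sketch scores in log space; all parameters are toy values invented for illustration:

```python
import math

def wmm_score(wmm, s):
    """WMM: log P(S) = sum_i log P_i(x_i) -- one independent distribution
    per position of a fixed-length signal."""
    return sum(math.log(wmm[i][c]) for i, c in enumerate(s))

def mc_score(cond, n, s, prior):
    """nth-order Markov chain: score the first n symbols with `prior`,
    the rest with P(x_i | x_{i-n}..x_{i-1})."""
    lp = sum(math.log(prior[c]) for c in s[:n])
    for i in range(n, len(s)):
        lp += math.log(cond[s[i - n:i]][s[i]])
    return lp

# Toy, hypothetical parameters: a uniform 1st-order chain.
uniform = {c: 0.25 for c in "ACGT"}
cond = {a: uniform for a in "ACGT"}
print(mc_score(cond, 1, "ACGT", uniform))   # 4 * log(0.25)
```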
φmax = argmax_φ P(φ | S)
     = argmax_φ P(S ∧ φ) / P(S)
     = argmax_φ P(S ∧ φ)
     = argmax_φ P(S | φ) P(φ)

where:

P(φ) = ∏_{i=0}^{L} Pt(yi+1 | yi)            (transition probabilities)

P(S | φ) = ∏_{i=0}^{L−1} Pe(xi | yi+1)      (emission probabilities)

so that:

φmax = argmax_φ Pt(q0 | yL) ∏_{i=0}^{L−1} Pe(xi | yi+1) Pt(yi+1 | yi)
Recall: Decoding with an HMM
φmax = argmax_φ P(φ | S)
     = argmax_φ P(S ∧ φ) / P(S)
     = argmax_φ P(S ∧ φ)
     = argmax_φ P(S | φ) P(φ)

where:

P(φ) = ∏_{i=0}^{|φ|−2} Pt(yi+1 | yi) Pd(di | yi)      (transition & duration probabilities)

P(S | φ) = ∏_{i=1}^{|φ|−2} Pe(Si | yi, di)            (emission probabilities)

so that:

φmax = argmax_φ ∏_{i=0}^{|φ|−2} Pe(Si | yi, di) Pt(yi+1 | yi) Pd(di | yi)
Decoding with a GHMM
V(i,k) = { max_j V(j, k−1) Pt(qi | qj) Pe(xk | qi),   if k > 0
         { Pt(qi | q0) Pe(x0 | qi),                   if k = 0

[Figure: the Viterbi trellis: rows are states, columns are sequence positions; cell (i,k) is filled from the best-scoring cell in column k−1.]
Recall: Viterbi Decoding for HMMs
run time: O(L×|Q|²)
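The recurrence above can be sketched as follows (log-space Python, illustrative; applied here to the example machine M1 from the earlier slide):

```python
import math

def _log(p):
    return math.log(p) if p > 0 else -math.inf

def viterbi(x, states, Pt, Pe, q0="q0"):
    """Viterbi decoding for a pure HMM, directly from the recurrence above:
    V(i,0) = Pt(qi|q0) Pe(x0|qi);  V(i,k) = max_j V(j,k-1) Pt(qi|qj) Pe(xk|qi).
    Computed in log space; O(L * |Q|^2) time."""
    V = [{q: _log(Pt[q0].get(q, 0)) + _log(Pe[q].get(x[0], 0)) for q in states}]
    back = []
    for k in range(1, len(x)):
        row, ptr = {}, {}
        for q in states:
            j = max(states, key=lambda s: V[-1][s] + _log(Pt[s].get(q, 0)))
            row[q] = V[-1][j] + _log(Pt[j].get(q, 0)) + _log(Pe[q].get(x[k], 0))
            ptr[q] = j
        V.append(row)
        back.append(ptr)
    # traceback from the best final state
    q = max(states, key=lambda s: V[-1][s])
    path = [q]
    for ptr in reversed(back):
        q = ptr[q]
        path.append(q)
    return path[::-1]

# The example machine M1 from earlier slides: q1 emits only Y, q2 only R.
Pt = {"q0": {"q1": 1.0}, "q1": {"q1": 0.8, "q2": 0.15, "q0": 0.05},
      "q2": {"q2": 0.7, "q1": 0.3}}
Pe = {"q1": {"Y": 1.0}, "q2": {"R": 1.0}}
print(viterbi("YYRRY", ["q1", "q2"], Pt, Pe))   # ['q1', 'q1', 'q2', 'q2', 'q1']
```

Since each emission here is deterministic, the decoded path simply tracks the Y/R runs.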
Naive GHMM Decoding
run time: O(L³×|Q|²)
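A sketch of the naive algorithm makes the L³ clear: for every feature end, state, duration, and predecessor, the candidate feature is re-scored from scratch. The toy model below is entirely hypothetical:

```python
import math

def naive_ghmm_viterbi(x, states, Pt, Pe, Pd, q0="q0"):
    """Naive GHMM decoding: V[k][q] = best log score of a parse of x[0..k]
    whose last feature is emitted by state q.  Every end position k, state q,
    duration d, and predecessor j is tried, and each candidate feature is
    re-scored from scratch (the extra O(d) factor) -- hence O(L^3 * |Q|^2)."""
    L = len(x)
    V = [{q: -math.inf for q in states} for _ in range(L)]
    for k in range(L):                       # feature end (inclusive)
        for q in states:
            for d in range(1, k + 2):        # feature duration
                s = k - d + 1                # feature start
                score = Pe(q, x[s:k + 1], d) + Pd(q, d)
                if s == 0:
                    V[k][q] = max(V[k][q], Pt(q, q0) + score)
                else:
                    for j in states:         # predecessor state
                        V[k][q] = max(V[k][q], V[s - 1][j] + Pt(q, j) + score)
    return max(V[L - 1].values())

# Toy model (hypothetical): uniform transitions/emissions, geometric durations.
Pt = lambda q, j: math.log(0.5)
Pe = lambda q, s, d: d * math.log(0.25)
Pd = lambda q, d: (d - 1) * math.log(0.9) + math.log(0.1)
best = naive_ghmm_viterbi("ACGT", ["exon", "intron"], Pt, Pe, Pd)
print(best)   # best single-feature parse: log(0.5 * 0.25^4 * 0.9^3 * 0.1)
```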
(1) The emission functions Pe for variable-length features are factorable in the sense that they can be expressed as a product of terms evaluated at each position in a putative feature—i.e.,
factorable(Pe) ⇔ ∃f such that Pe(S) = ∏_{i=0}^{|S|−1} f(S, i) for any putative feature S.
(2) The lengths of noncoding features in genomes are geometrically distributed.
(3) The model contains no transitions between variable-length states.
Assumptions in GHMM Decoding for Gene Finding
Assumption #3
Variable-length states cannot transition directly to each other.
Each signal state has a signal sensor:
[Figure: putative signals along the sequence, each scored by its sensor: .0045, .0032, .0031, .0021, .002, .0072, .0023, .0034, .0082]
The “trellis” or “ORF graph”:
Efficient Decoding via Signal Sensors
Efficient Decoding via Signal Sensors
sequence: GCTATCGATTCTCTAATCGTCTATCGATCGTGGTATCGTACGTTCATTACTGACT...

Signal sensors (sensor 1 for ATG's, sensor 2 for GT's, ..., sensor n for AG's) detect putative signals during a left-to-right pass over the sequence and insert them into type-specific signal queues:

...ATG.........ATG......ATG..................GT

[Figure: the elements of the "ATG" queue are joined by trellis links to the newly detected GT signal.]
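The queue-building pass can be sketched as a single left-to-right scan (illustrative only; a real gene finder would score each detection with the corresponding signal sensor rather than match exact strings):

```python
def scan_signals(seq):
    """Single left-to-right pass: detect putative signals and append each
    to its type-specific queue (here just start codons, donors, acceptors)."""
    queues = {"ATG": [], "GT": [], "AG": []}
    for i in range(len(seq)):
        for signal in queues:
            if seq.startswith(signal, i):
                queues[signal].append(i)   # in practice, scored by a signal sensor
    return queues

q = scan_signals("ATGGTACGTAAGATG")
print(q["ATG"])   # [0, 12]
```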
Decoding with an ORF Graph
The Notion of “Eclipsing”
ATGGATGCTACTTGACGTACTTAACTTACCGATCTCT
0120120120120120120120120120120120120   (reading-frame phase of each base)

The TGA at positions 12–14 lies in phase 0: an in-frame stop codon for the ATG at position 0.
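Eclipsing can be checked with a simple in-frame scan. The following sketch (illustrative) uses the slide's example sequence:

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def first_inframe_stop(seq, atg_pos):
    """Scan codon-by-codon in the reading frame of the ATG at atg_pos and
    return the position of the first in-frame stop codon (None if absent).
    Any later signal in this frame is 'eclipsed': no coding feature starting
    at this ATG can extend past the stop."""
    for i in range(atg_pos + 3, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None

# The slide's example sequence: the ATG at position 0 hits TGA at position 12.
seq = "ATGGATGCTACTTGACGTACTTAACTTACCGATCTCT"
print(first_inframe_stop(seq, 0))   # 12
print(first_inframe_stop(seq, 4))   # None -- the second ATG's frame is open here
```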
Bounding the Number of Coding Predecessors
Geometric Noncoding Lengths (Assumption #2)
Prefix Sum Arrays for Emission Probabilities (Assumption #1: factorable content sensors)
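Under the factorability assumption, the emission score of any candidate feature becomes an O(1) difference of prefix sums. A sketch, using a hypothetical 0th-order content sensor:

```python
import math

def build_psa(seq, f):
    """Prefix-sum array of per-position log scores: psa[i] = sum of
    log f(seq, j) for j < i.  Requires the factorability assumption:
    Pe(S) = prod_i f(S, i)."""
    psa = [0.0]
    for i in range(len(seq)):
        psa.append(psa[-1] + math.log(f(seq, i)))
    return psa

def interval_log_score(psa, b, e):
    """log Pe(seq[b:e]) in O(1), for any candidate feature [b, e)."""
    return psa[e] - psa[b]

# Hypothetical 0th-order content sensor.
probs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
f = lambda s, i: probs[s[i]]
psa = build_psa("ACGTACGT", f)
print(round(interval_log_score(psa, 2, 6), 4))   # log(0.2 * 0.3 * 0.3 * 0.2)
```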
Bounding the Number of Noncoding Predecessors
Ref: Burge C (1997) Identification of complete gene structures in human genomic DNA. PhD thesis. Stanford University.
PSA Decoding
run time: O(L×|Q|) (assuming a sparse GHMM topology)
Traceback for GHMMs
DSP Decoding
Ref: Majoros WM, Pertea M, Delcher AL, Salzberg SL (2005) Efficient decoding algorithms for generalized hidden Markov model gene finders. BMC Bioinformatics 6:16.
DSP = Dynamic Score Propagation: we incrementally propagate the scores of all possible partial parses up to the current point in the sequence, during a single left-to-right pass over the sequence.
DSP Decoding
run time: O(L×|Q|) (assuming a sparse GHMM topology)
PSA vs. DSP Decoding
Space complexity:
PSA: O(L×|Q|)
DSP: O(L+|Q|)
Modeling Isochores
“inhomogeneous GHMM”
Ref: Allen JE, Majoros WH, Pertea M, Salzberg SL (2006) JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the features of human genes in the ENCODE regions. Genome Biology 7(Suppl 1):S9.
Explicit Modeling of Noncoding Lengths
still O(L×|Q|), but slower by a constant factor
Ref: Stanke M, Waack S (2003) Gene prediction with a hidden Markov model and a new intron submodel. Bioinformatics 19:II215-II225.
θMLE = argmax_θ ∏_{(S,φ)∈T} P(S, φ)

     = argmax_θ ∏_{(S,φ)∈T} ∏_{yi∈φ} Pe(Si | yi, di) Pt(yi | yi−1) Pd(di | yi)

     = argmax_θ ∏_{(S,φ)∈T} ∏_{yi∈φ} Pt(yi | yi−1) Pd(di | yi) ∏_{j=0}^{|Si|−1} Pe(xj | yi)

MLE Training for GHMMs

Pt and Pe are estimated from labeled training data, just as in an HMM:

a_{i,j} = A_{i,j} / ∑_{h=0}^{|Q|−1} A_{i,h}          e_{i,k} = E_{i,k} / ∑_{h=0}^{|Σ|−1} E_{i,h}

Pd is estimated by constructing a histogram of observed feature lengths.
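The count-and-normalize estimate for Pt can be sketched as follows (the training parses here are hypothetical):

```python
from collections import Counter

def mle_transitions(paths):
    """Estimate Pt(q_j | q_i) = A_ij / sum_h A_ih from labeled parses,
    as in the formula above (A_ij = observed i->j transition counts)."""
    A = Counter()
    for path in paths:
        for a, b in zip(path, path[1:]):
            A[(a, b)] += 1
    totals = Counter()
    for (a, _), n in A.items():
        totals[a] += n
    return {(a, b): n / totals[a] for (a, b), n in A.items()}

# Labeled training parses (hypothetical state paths).
paths = [["q1", "q1", "q2"], ["q1", "q2", "q2"]]
Pt = mle_transitions(paths)
print(Pt[("q1", "q2")])   # 2 of the 3 transitions out of q1 go to q2
```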
θCML = argmax_θ ∏_{(S,φ)∈T} P(φ | S)

     = argmax_θ ∏_{(S,φ)∈T} [ ∏_{yi∈φ} Pe(Si | yi, di) Pt(yi | yi−1) Pd(di | yi) ] / P(S)

Maximum Likelihood vs. Conditional Maximum Likelihood

θMLE = argmax_θ ∏_{(S,φ)∈T} P(S, φ)

     = argmax_θ ∏_{(S,φ)∈T} ∏_{yi∈φ} Pe(Si | yi, di) Pt(yi | yi−1) Pd(di | yi)

Maximizing Pe, Pt, Pd separately maximizes the entire MLE expression, but DOES NOT maximize the CML expression (because of the P(S) denominator). Hence MLE ≠ CML.
Discriminative Training of GHMMs
θdiscrim = argmax_θ (accuracy on training set)
Discriminative Training of GHMMs

Ref: Majoros WM, Salzberg SL (2004) An empirical analysis of training protocols for probabilistic gene finders. BMC Bioinformatics 5:206.
• GHMMs generalize HMMs by allowing each state to emit a subsequence rather than just a single symbol
• Whereas HMMs model all feature lengths using a geometric distribution, coding features can be modeled using an arbitrary length distribution in a GHMM
• Emission models within a GHMM can be any arbitrary probabilistic model (“submodel abstraction”), such as a neural network or decision tree
• GHMMs tend to have many fewer states => simplicity & modularity
• When used for parsing sequences, HMMs and GHMMs should be trained discriminatively, rather than via MLE, to achieve optimal predictive accuracy
Summary