Introduction to Profile Hidden Markov Models
![Page 1: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/1.jpg)
PHMM 1
Introduction to Profile Hidden Markov Models
Mark Stamp
![Page 2: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/2.jpg)
PHMM 2
Hidden Markov Models
- Here, we assume you know about HMMs
  - If not, see “A revealing introduction to hidden Markov models”
- Executive summary of HMMs
  - HMM is a machine learning technique
  - Also, a discrete hill climb technique
  - Train model based on observation sequence
  - Score given sequence to see how closely it matches the model
  - Efficient algorithms, many useful applications
![Page 3: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/3.jpg)
PHMM 3
HMM Notation
- Recall, HMM model denoted λ = (A,B,π)
- Observation sequence is O
- Notation:
![Page 4: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/4.jpg)
PHMM 4
Hidden Markov Models
- Among the many uses for HMMs…
  - Speech analysis
  - Music search engine
  - Malware detection
  - Intrusion detection systems (IDS)
  - Many more, and more all the time
![Page 5: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/5.jpg)
PHMM 5
Limitations of HMMs
- Positional information not considered
  - HMM has no “memory”
  - Higher order models have some memory
  - But no explicit use of positional information
- Does not handle insertions or deletions
- These limitations are serious problems in some applications
  - In bioinformatics string comparison, sequence alignment is critical
  - Also, insertions and deletions occur
![Page 6: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/6.jpg)
PHMM 6
Profile HMM
- Profile HMM (PHMM) designed to overcome limitations on previous slide
  - In some ways, PHMM easier than HMM
  - In some ways, PHMM more complex
- The basic idea of PHMM
  - Define multiple B matrices
  - Almost like having an HMM for each position in sequence
![Page 7: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/7.jpg)
PHMM 7
PHMM
- In bioinformatics, begin by aligning multiple related sequences
  - Multiple sequence alignment (MSA)
  - This is like training phase for HMM
- Generate PHMM based on given MSA
  - Easy, once MSA is known
  - Hard part is generating MSA
- Then can score sequences using PHMM
  - Use forward algorithm, like HMM
![Page 8: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/8.jpg)
PHMM 8
Generic View of PHMM
- Circles are Delete states
- Diamonds are Insert states
- Rectangles are Match states
  - Match states correspond to HMM states
- Arrows are possible transitions
  - Each transition has associated probability
- Transition probabilities are A matrix
- Emission probabilities are B matrices
  - In PHMM, observations are emissions
  - Match and insert states have emissions
![Page 9: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/9.jpg)
PHMM 9
Generic View of PHMM
Circles are Delete states, diamonds are Insert states, rectangles are Match states
Also, begin and end states
![Page 10: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/10.jpg)
PHMM 10
PHMM Notation
- Notation
![Page 11: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/11.jpg)
PHMM 11
PHMM
- Match state probabilities easily determined from MSA, that is
  - $a_{M_i,M_{i+1}}$: transitions between match states
  - $e_{M_i}(k)$: emission probability at match state
- Note: there are other transition probabilities
  - For example, $a_{M_i,I_i}$ and $a_{M_i,D_{i+1}}$
- Emissions at all match & insert states
  - Remember, emission == observation
![Page 12: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/12.jpg)
PHMM 12
MSA
- First we show MSA construction
  - This is the difficult part
  - Lots of ways to do this
  - “Best” way depends on specific problem
- Then construct PHMM from MSA
  - The easy part
  - Standard algorithm for this
- How to score a sequence?
  - Forward algorithm, similar to HMM
![Page 13: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/13.jpg)
PHMM 13
MSA
- How to construct MSA?
  - Construct pairwise alignments
  - Combine pairwise alignments to obtain MSA
- Allow gaps to be inserted
  - Makes better matches
- But gaps tend to weaken scoring
  - So there is a tradeoff
![Page 14: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/14.jpg)
PHMM 14
Global vs Local Alignment
- In these pairwise alignment examples
  - “-” is gap
  - “|” are aligned
  - “*” omitted beginning and ending symbols
![Page 15: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/15.jpg)
PHMM 15
Global vs Local Alignment
- Global alignment is lossless
  - But gaps tend to proliferate
  - And gaps increase when we do MSA
  - More gaps imply more sequences match
  - So, result is less useful for scoring
- We usually only consider local alignment
  - That is, omit ends for better alignment
- For simplicity, we assume global alignment here
![Page 16: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/16.jpg)
PHMM 16
Pairwise Alignment
- We allow gaps when aligning
- How to score an alignment?
  - Based on n x n substitution matrix S
  - Where n is number of symbols
- What algorithm(s) to align sequences?
  - Usually, dynamic programming
  - Sometimes, HMM is used
  - Other?
- Local alignment: more issues
![Page 17: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/17.jpg)
PHMM 17
Pairwise Alignment
Example
- Note gaps vs misaligned elements
  - Depends on S and gap penalty
![Page 18: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/18.jpg)
PHMM 18
Substitution Matrix
- Masquerade detection
  - Detect imposter using an account
- Consider 4 different operations
  - E == send email
  - G == play games
  - C == C programming
  - J == Java programming
- How similar are these to each other?
![Page 19: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/19.jpg)
PHMM 19
Substitution Matrix
- Consider 4 different operations: E, G, C, J
- Possible substitution matrix:
- Diagonal: matches
  - High positive scores
- Which others most similar?
  - J and C, so substituting C for J is a high score
- Game playing/programming, very different
  - So substituting G for C is a negative score
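The reasoning on this slide can be made concrete with a small table. The sketch below is illustrative only: the numeric scores are hypothetical values chosen to follow the slide's reasoning (high on the diagonal, high for C vs. J, negative for G vs. C), and the names `SYMBOLS`, `S`, and `sub_score` are assumptions, not part of the original deck.

```python
# Hypothetical substitution matrix for the four operations E, G, C, J.
# The numbers are illustrative assumptions, not values from the slides.
SYMBOLS = ["E", "G", "C", "J"]
S = [
    [9, -4, 2, 2],    # E: send email
    [-4, 9, -5, -5],  # G: play games
    [2, -5, 10, 7],   # C: C programming
    [2, -5, 7, 10],   # J: Java programming
]
INDEX = {symbol: i for i, symbol in enumerate(SYMBOLS)}


def sub_score(x: str, y: str) -> int:
    """Look up the score for aligning symbol x with symbol y."""
    return S[INDEX[x]][INDEX[y]]


if __name__ == "__main__":
    print(sub_score("C", "J"))  # similar activities: positive
    print(sub_score("G", "C"))  # dissimilar activities: negative
```

The matrix is kept symmetric, so substituting C for J scores the same as substituting J for C.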
![Page 20: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/20.jpg)
PHMM 20
Substitution Matrix
- Depending on problem, might be easy or very difficult to get useful S matrix
- Consider masquerade detection based on UNIX commands
  - Sometimes difficult to say how “close” 2 commands are
- Suppose aligning DNA sequences
  - Biological rationale for closeness of symbols
![Page 21: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/21.jpg)
PHMM 21
Gap Penalty
- Generally must allow gaps to be inserted
- But gaps make alignment more generic
  - So, less useful for scoring
- Therefore, we penalize gaps
- How to penalize gaps? (Here g is the length of the gap)
  - Linear gap penalty function: f(g) = dg (i.e., constant penalty d per gap position)
  - Affine gap penalty function: f(g) = a + e(g – 1)
    - Gap opening penalty a, then constant factor of e for each extension
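The two penalty functions can be written directly from their formulas. This is a minimal sketch; the function names are assumptions, and treating a zero-length gap as costing nothing in the affine case is a convention chosen here, not something the slide states.

```python
def linear_gap_penalty(g: int, d: float) -> float:
    """f(g) = d*g: every position in a gap of length g costs d."""
    return d * g


def affine_gap_penalty(g: int, a: float, e: float) -> float:
    """f(g) = a + e*(g - 1): opening a gap costs a, each extension costs e."""
    if g <= 0:
        return 0.0  # assumption: no gap means no penalty
    return a + e * (g - 1)


if __name__ == "__main__":
    for g in range(1, 5):
        print(g, linear_gap_penalty(g, 3), affine_gap_penalty(g, 5, 1))
```

With an opening penalty larger than the extension penalty (a > e), the affine function favors one long gap over several short ones of the same total length.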
![Page 22: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/22.jpg)
PHMM 22
Pairwise Alignment Algorithm
- We use dynamic programming
  - Based on S matrix, gap penalty function
- Notation:
![Page 23: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/23.jpg)
PHMM 23
Pairwise Alignment DP
Initialization:
Recursion:
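The initialization and recursion formulas from this slide did not survive extraction. As a stand-in, here is a sketch of the standard global-alignment dynamic program (Needleman-Wunsch style) with a linear gap penalty. It is an assumption that this matches the slide's formulation; `global_align` and `simple_score` are hypothetical names, and `simple_score` is a toy substitution function.

```python
def global_align(x, y, score, d):
    """Global alignment of x and y with a linear gap penalty d.

    score(a, b) is the substitution score for symbols a and b.
    Returns (best score, aligned x, aligned y), using '-' for gaps.
    """
    n, m = len(x), len(y)
    # F[i][j] is the best score for aligning x[:i] with y[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Initialization: a prefix aligned against nothing is all gaps.
    for i in range(1, n + 1):
        F[i][0] = -i * d
    for j in range(1, m + 1):
        F[0][j] = -j * d
    # Recursion: substitute, or place a gap in y, or place a gap in x.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
    # Traceback from the bottom-right corner to recover one best alignment.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


def simple_score(a, b):
    """Toy substitution score: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


if __name__ == "__main__":
    print(global_align("EGCJ", "EGJ", simple_score, 1))
```

Integer scores are used so that the equality checks in the traceback are exact; with floating-point scores it would be safer to store back-pointers during the fill step.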
![Page 24: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/24.jpg)
PHMM 24
MSA from Pairwise Alignments
- Given pairwise alignments…
- …how to construct MSA?
- Generic approach is “progressive alignment”
  - Select one pairwise alignment
  - Select another and combine with first
  - Continue to add more until all are combined
- Relatively easy (good)
- Gaps may proliferate, unstable (bad)
![Page 25: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/25.jpg)
PHMM 25
MSA from Pairwise Alignments
- Lots of ways to improve on generic progressive alignment
  - Here, we mention one such approach
  - Not necessarily “best” or most popular
- Feng-Doolittle progressive alignment
  - Compute scores for all pairs of n sequences
  - Select n-1 alignments that a) “connect” all sequences and b) maximize pairwise scores
  - Then generate a minimum spanning tree
  - For MSA, add sequences in the order that they appear in the spanning tree
![Page 26: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/26.jpg)
PHMM 26
MSA Construction
- Create pairwise alignments
  - Generate substitution matrix
  - Dynamic program for pairwise alignments
- Use pairwise alignments to make MSA
  - Use pairwise alignments to construct spanning tree (e.g., Prim’s Algorithm)
  - Add sequences to MSA in spanning tree order (from highest score, insert gaps as needed)
  - Note: gap penalty is used
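The spanning-tree step can be sketched with a Prim-style greedy loop. This sketch assumes that higher pairwise scores are better, so it grows the tree by repeatedly taking the highest-scoring edge that reaches a new sequence, starting from the best-scoring pair. The function name and the example scores are hypothetical.

```python
def spanning_tree_order(scores):
    """Return the order in which pairs are merged into the MSA.

    scores maps a pair (u, v) of sequence ids to its pairwise alignment score.
    Prim-style: start at an endpoint of the best pair, then repeatedly add the
    highest-scoring edge that connects the tree to a sequence not yet in it.
    """
    nodes = {n for pair in scores for n in pair}
    first_u, _ = max(scores, key=scores.get)
    in_tree = {first_u}
    order = []
    while in_tree != nodes:
        # Edges with exactly one endpoint in the tree, as (score, inside, outside).
        candidates = [
            (s, u, v) if u in in_tree else (s, v, u)
            for (u, v), s in scores.items()
            if (u in in_tree) != (v in in_tree)
        ]
        if not candidates:
            raise ValueError("pairwise scores do not connect all sequences")
        _, inside, outside = max(candidates)
        order.append((inside, outside))
        in_tree.add(outside)
    return order


if __name__ == "__main__":
    example_scores = {
        (1, 2): 85, (1, 3): 63, (1, 4): 50,
        (2, 3): 79, (2, 4): 58, (3, 4): 91,
    }
    print(spanning_tree_order(example_scores))
```

Each returned pair names a sequence already in the MSA and the new sequence to align against it, which is the order in which gaps would be inserted as needed.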
![Page 27: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/27.jpg)
PHMM 27
MSA Example
- Suppose 10 sequences, with the following pairwise alignment scores:
![Page 28: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/28.jpg)
PHMM 28
MSA Example: Spanning Tree
Spanning tree based on scores
So process pairs in following order: (5,4), (5,8), (8,3), (3,2), (2,7), (2,1), (1,6), (6,10), (10,9)
![Page 29: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/29.jpg)
PHMM 29
MSA Snapshot
- Intermediate step and final
- Use “+” for neutral symbol
- Then “-” for gaps in MSA
- Note increase in gaps
![Page 30: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/30.jpg)
PHMM 30
PHMM from MSA
- For PHMM, must determine match and insert states & probabilities from MSA
- “Conservative” columns are match states
  - Half or less of symbols are gaps
- Other columns are insert states
  - Majority of symbols are gaps
- Delete states are a separate issue
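The column rule above is simple enough to sketch in a few lines. The MSA below is a made-up example (not the one on the slides), and `classify_columns` is a hypothetical helper name.

```python
def classify_columns(msa):
    """Label each MSA column 'M' (match) or 'I' (insert).

    A column is a match column when half or less of its symbols are gaps,
    and an insert column when the majority of its symbols are gaps.
    """
    n_rows = len(msa)
    labels = []
    for column in zip(*msa):  # iterate over columns of equal-length rows
        gaps = column.count("-")
        labels.append("M" if 2 * gaps <= n_rows else "I")
    return "".join(labels)


if __name__ == "__main__":
    # Hypothetical 5-sequence MSA over the symbols E, G, C, J.
    msa = [
        "EC---J",
        "EG-C-J",
        "E-J--C",
        "-C--GJ",
        "EC---G",
    ]
    print(classify_columns(msa))
```

Consecutive insert columns would then be merged into a single insert state sitting between the neighboring match states.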
![Page 31: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/31.jpg)
PHMM 31
PHMM States from MSA
- Consider a simpler MSA…
- Columns 1,2,6 are match states 1,2,3, respectively
  - Since less than half gaps
- Columns 3,4,5 are combined to form insert state 2
  - Since more than half gaps
  - Match states between insert
![Page 32: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/32.jpg)
PHMM 32
PHMM Probabilities from MSA
- Emission probabilities
  - Based on symbol distribution in match and insert states
- State transition probs
  - Based on transitions in the MSA
![Page 33: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/33.jpg)
PHMM 33
PHMM Probabilities from MSA
- Emission probabilities:
- But 0 probabilities are bad
  - Model “overfits” the data
  - So, use “add one” rule
  - Add one to each numerator, add total to denominators
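The add-one rule can be sketched as follows. This reads "add total to denominators" as adding the number of symbols in the alphabet to each denominator, which is an interpretation that keeps the probabilities summing to one. The function name and the example column are hypothetical.

```python
from collections import Counter


def emission_probs(symbols_in_state, alphabet):
    """Estimate emission probabilities for one state with the add-one rule.

    symbols_in_state: symbols observed in the MSA column(s) for this state;
    gap characters '-' are skipped because gaps are not emissions.
    """
    counts = Counter(s for s in symbols_in_state if s != "-")
    # Add one to each numerator and the alphabet size to the denominator.
    total = sum(counts.values()) + len(alphabet)
    return {a: (counts[a] + 1) / total for a in alphabet}


if __name__ == "__main__":
    # Hypothetical match-state column: four E's and one gap.
    print(emission_probs("EEE-E", "EGCJ"))
```

Without the adjustment, G, C and J would each get probability 0 in this column, and any sequence emitting one of them at this state would score zero.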
![Page 34: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/34.jpg)
PHMM 34
PHMM Probabilities from MSA
- More emission probabilities:
- But 0 probabilities are bad
  - Model “overfits” the data
  - Again, use “add one” rule
  - Add one to each numerator, add total to denominators
![Page 35: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/35.jpg)
PHMM 35
PHMM Probabilities from MSA
- Transition probabilities:
  - We look at some examples
  - Note that “-” is delete state
- First, consider begin state:
- Again, use add one rule
![Page 36: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/36.jpg)
PHMM 36
PHMM Probabilities from MSA
- Transition probabilities
  - When no information in MSA, set probs to uniform
  - For example I1 does not appear in MSA, so
![Page 37: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/37.jpg)
PHMM 37
PHMM Probabilities from MSA
Transition probabilities, another example
What about transitions from state D1?
Can only go to M2, so
Again, use add one rule:
![Page 38: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/38.jpg)
PHMM 38
PHMM Emission Probabilities
- Emission probabilities for the given MSA
  - Using add-one rule
![Page 39: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/39.jpg)
PHMM 39
PHMM Transition Probabilities
- Transition probabilities for the given MSA
  - Using add-one rule
![Page 40: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/40.jpg)
PHMM 40
PHMM Summary
- Construct pairwise alignments
  - Usually, use dynamic programming
- Use these to construct MSA
  - Lots of ways to do this
- Using MSA, determine probabilities
  - Emission probabilities
  - State transition probabilities
- In effect, we have trained a PHMM
  - Now what???
![Page 41: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/41.jpg)
PHMM 41
PHMM Scoring
- Want to score sequences to see how closely they match PHMM
- How did we score sequences with HMM?
  - Forward algorithm
- How to score sequences with PHMM?
  - Forward algorithm
- But, algorithm is a little more complex
  - Due to complex state transitions
![Page 42: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/42.jpg)
PHMM 42
Forward Algorithm
- Notation
  - Indices i and j are columns in MSA
  - $x_i$ is ith observation symbol
  - $q_{x_i}$ is distribution of $x_i$ in “random model”
  - Base case is $F^M_0(0) = 0$
  - $F^M_j(i)$ is score of $x_1,\ldots,x_i$ up to state j (note that in PHMM, i and j may not agree)
- Some states undefined
  - Undefined states ignored in calculation
![Page 43: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/43.jpg)
PHMM 43
Forward Algorithm
- Compute P(X|λ) recursively
- Note that the match-state score $F^M_j(i)$ depends on $F^M_{j-1}(i-1)$, $F^I_{j-1}(i-1)$ and $F^D_{j-1}(i-1)$
  - And corresponding state transition probs
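The recursion formulas on this slide were lost in extraction. For reference, the forward recursion for a PHMM is commonly written in the log-odds style of Durbin et al. The sketch below assumes that form and the notation $F^M_j(i)$, $F^I_j(i)$, $F^D_j(i)$ for the scores of $x_1,\ldots,x_i$ ending in match, insert and delete state j; the exact notation on the original slide may differ. Terms for transitions that a given model does not allow simply drop out.

```latex
\begin{aligned}
F^M_j(i) &= \log\frac{e_{M_j}(x_i)}{q_{x_i}}
  + \log\Big[ a_{M_{j-1},M_j}\, e^{F^M_{j-1}(i-1)}
            + a_{I_{j-1},M_j}\, e^{F^I_{j-1}(i-1)}
            + a_{D_{j-1},M_j}\, e^{F^D_{j-1}(i-1)} \Big] \\
F^I_j(i) &= \log\frac{e_{I_j}(x_i)}{q_{x_i}}
  + \log\Big[ a_{M_j,I_j}\, e^{F^M_j(i-1)}
            + a_{I_j,I_j}\, e^{F^I_j(i-1)}
            + a_{D_j,I_j}\, e^{F^D_j(i-1)} \Big] \\
F^D_j(i) &= \log\Big[ a_{M_{j-1},D_j}\, e^{F^M_{j-1}(i)}
            + a_{I_{j-1},D_j}\, e^{F^I_{j-1}(i)}
            + a_{D_{j-1},D_j}\, e^{F^D_{j-1}(i)} \Big]
\end{aligned}
```

Match and insert states consume the observation $x_i$, so they look back to index i-1 and include an emission term. Delete states emit nothing, so they stay at index i and have no emission term.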
![Page 44: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/44.jpg)
PHMM 44
PHMM
- We will see examples of PHMM later
- In particular,
  - Malware detection based on opcodes
  - Masquerade detection based on UNIX commands
![Page 45: Introduction to Profile Hidden Markov Models](https://reader036.vdocuments.us/reader036/viewer/2022081418/5681396d550346895da103c3/html5/thumbnails/45.jpg)
PHMM 45
References
- Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Durbin, et al
- Masquerade detection using profile hidden Markov models, L. Huang and M. Stamp, to appear in Computers and Security
- Profile hidden Markov models for metamorphic virus detection, S. Attaluri, S. McGhee and M. Stamp, Journal in Computer Virology, Vol. 5, No. 2, May 2009, pp. 151-169