PROLOGUE: Pitfalls of standard alignments
A: ALAEVLIRLITKLYP
B: ASAKHLNRLITELYP
Score(A, B) = Σ_i s(A_i, B_i)
Blosum62
Scoring a pairwise alignment
Alignment of a family (globins)
Different positions are not equivalent
http://weblogo.berkeley.edu/cache/file5h2DWc.png
The substitution score IN A FAMILY should depend on the position (the same for gaps)
Sequence logos
For modelling families we need more flexible tools
Probabilistic Models for Biological Sequences
•What are they?
Generative definition:
•Objects producing different outcomes (sequences) with different probabilities
•The probability distribution over the sequences space determines the model specificity
Probabilistic models for sequences
[Figure: model M and the probability distribution it defines over the sequence space]
M generates s_i with probability P(s_i | M).
E.g.: M is the representation of the family of globins.
Associative definition:
•Objects that, given an outcome (sequence), compute a probability value
Probabilistic models for sequences
[Figure: model M and the probability distribution it defines over the sequence space]
M associates the probability P(s_i | M) to s_i.
E.g.: M is the representation of the family of globins.
We don't need a generator of new biological sequences: the generative definition is useful as an operative definition.
Probabilistic models for sequences
[Figure: a probability distribution over the sequence space]
The most useful probabilistic models are trainable systems.
The probability density function over the sequence space is estimated from known examples by means of a learning algorithm
[Figure: known examples in the sequence space and the estimated pdf (generalization)]
e.g.: Writing a generic representation of the sequences of globins starting from a set of known globins
Probabilistic Models for Biological Sequences
•What are they?
•Why use them?
Modelling a protein family
Probabilistic model
Seq1 Seq2 Seq3 Seq4 Seq5 Seq6
0.98 0.21 0.12 0.89 0.47 0.78
Given a protein class (e.g. Globins), a probabilistic model trained on this family can compute a probability value for each new sequence
This value measures the similarity between the new sequence and the family described by the model
Probabilistic Models for Biological Sequences
•What are they?
•Why use them?
•Which probabilities do they compute?
A model M associates to a sequence s the probability P( s | M )
This probability answers the question:
What is the probability that a model describing the globins generates the sequence s?
The question we want to answer is:
Given a sequence s, is it a Globin?
We need to compute P( M | s ) !!
P( s | M ) or P( M | s ) ?
P(X,Y) = P(X | Y) P(Y) = P(Y | X) P(X) Joint probability
So:

P(Y | X) = P(X | Y) P(Y) / P(X)

Bayes Theorem

P(M | s) = P(s | M) P(M) / P(s)   (P(M) and P(s) are the a priori probabilities)

P(M | s): evidence s, conclusion M
P(s | M): evidence M, conclusion s
The A priori probabilities
P(M) is the probability of the model (i.e. of the class described by the model) BEFORE we know the sequence:
can be estimated as the abundance of the class
P(s) is the probability of the sequence in the sequence space.
Cannot be reliably estimated!!
Comparison between models
We can overcome the problem by comparing the probabilities of generating s with different models:

P(M1 | s) / P(M2 | s) = [P(s | M1) P(M1) / P(s)] / [P(s | M2) P(M2) / P(s)] = [P(s | M1) P(M1)] / [P(s | M2) P(M2)]

P(M1) / P(M2) is the ratio between the abundances of the two classes.
Null model
Otherwise we can score a sequence for a model M comparing it to a Null Model: a model that generates ALL the possible sequences with probabilities depending ONLY on the statistical amino acid abundance
S(M, s) = log [ P(s | M) / P(s | N) ]
In this case we need a threshold and a statistic for evaluating the significance (E-value, P-value)
[Figure: distribution of the score S(M, s) for sequences belonging to model M and for sequences NOT belonging to model M]
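As an illustration (a sketch, not part of the slides), the log-odds score against a null model that emits residues independently with fixed background frequencies can be computed like this; the background values below are an assumed uniform distribution:

import math

def null_prob(seq, background):
    # P(s | N): product of position-independent background frequencies
    p = 1.0
    for c in seq:
        p *= background[c]
    return p

def log_odds_score(p_s_given_m, seq, background):
    # S(M, s) = log [ P(s|M) / P(s|N) ]
    return math.log(p_s_given_m / null_prob(seq, background))

background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed uniform
print(log_odds_score(1.0e-6, "GATTACA", background))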
The simplest probabilistic models: Markov Models
•Definition
[Diagram: four states] C: Clouds, R: Rain, F: Fog, S: Sun
Markov Models Example: Weather
Register the weather conditions day by day:
as a first hypothesis the weather condition in a day depends ONLY on the weather conditions in the day before.
Define the conditional probabilities
P(C|C), P(C|R),…. P(R|C)…..
The probability for the 5-days registration CRRCS
P(CRRCS) = P(C)·P(R|C) ·P(R|R) ·P(C|R) ·P(S|C)
Stochastic generator of sequences in which the probability of state in position i depends ONLY on the state in position i-1
Markov Model
Given an alphabet C = {c1, c2, c3, …, cN}, a Markov model is described by N×(N+2) parameters {a_rt, a_BEGIN,t, a_r,END ; r, t ∈ C}:

a_rt = P(s_i = t | s_{i-1} = r)
a_BEGIN,t = P(s_1 = t)
a_r,END = P(END | s_T = r)

Normalisation: Σ_t a_rt + a_r,END = 1 ∀r;   Σ_t a_BEGIN,t = 1
[Diagram: states c1, c2, c3, c4, …, cN fully connected, with BEGIN and END states]
Markov Models
Given the sequence s = s1 s2 s3 s4 s5 … sT, with s_i ∈ C = {c1, c2, c3, …, cN}:

P(s | M) = P(s1) · Π_{i=2}^T P(s_i | s_{i-1}) = a_BEGIN,s1 · [Π_{i=2}^T a_{s_{i-1} s_i}] · a_{s_T,END}

P("ALKALI") = a_BEGIN,A · a_AL · a_LK · a_KA · a_AL · a_LI · a_I,END
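A minimal Python sketch of this scoring formula; the numeric values below are made-up placeholders (the slides give no parameters for this alphabet), chosen only so that each row is properly normalised:

def markov_prob(seq, a_begin, a, a_end):
    # P(s|M) = a_BEGIN,s1 * prod_{i=2}^T a_{s_{i-1} s_i} * a_{s_T,END}
    p = a_begin[seq[0]]
    for prev, curr in zip(seq, seq[1:]):
        p *= a[(prev, curr)]
    return p * a_end[seq[-1]]

alphabet = "ALKI"                                  # hypothetical mini-alphabet
a_begin = {c: 0.25 for c in alphabet}              # sums to 1
a_end = {c: 0.10 for c in alphabet}
a = {(r, t): 0.225 for r in alphabet for t in alphabet}  # 4*0.225 + 0.10 = 1 per row

print(markov_prob("ALKALI", a_begin, a, a_end))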
Markov Models: Exercise
[Diagrams: two four-state weather models (C, R, F, S) with transition probabilities on the arrows; some values are left as ??]
1) Fill in the undefined values of the transition probabilities.
Markov Models: Exercise
[Diagrams: the two weather models with all transition probabilities filled in]
2) Which model better describes the weather in summer? Which one describes the weather in winter?
Markov Models: Exercise
[Diagrams: the two weather models, labelled Winter and Summer]
3) Given the sequence CSSSCFS, which model gives the higher probability? [Consider the starting probabilities: P(X|BEGIN) = 0.25]
Markov Models: Exercise
P(CSSSCFS | Winter) = 0.25 × 0.1 × 0.2 × 0.2 × 0.3 × 0.2 × 0.2 = 1.2 × 10^-5
P(CSSSCFS | Summer) = 0.25 × 0.4 × 0.8 × 0.8 × 0.1 × 0.1 × 1.0 = 6.4 × 10^-4
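The two products can be checked with a few lines of Python; the transition values below are read off the worked computation above (e.g. a(C,S) = 0.1 in Winter), not independently from the diagrams:

winter = {("C", "S"): 0.1, ("S", "S"): 0.2, ("S", "C"): 0.3,
          ("C", "F"): 0.2, ("F", "S"): 0.2}
summer = {("C", "S"): 0.4, ("S", "S"): 0.8, ("S", "C"): 0.1,
          ("C", "F"): 0.1, ("F", "S"): 1.0}

def likelihood(seq, a, p_start=0.25):
    # P(X|BEGIN) = 0.25 for every state, as stated in the exercise
    p = p_start
    for prev, curr in zip(seq, seq[1:]):
        p *= a[(prev, curr)]
    return p

print(likelihood("CSSSCFS", winter))   # 1.2e-05
print(likelihood("CSSSCFS", summer))   # 6.4e-04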
4) Can we conclude that the observation sequence refers to a summer week?
[Diagrams: the Winter and Summer models as above]
Markov Models: Exercise
P(Seq | Winter) = 1.2 × 10^-5
P(Seq | Summer) = 6.4 × 10^-4
[Diagrams: the Winter and Summer models as above]

P(Summer | Seq) / P(Winter | Seq) = [P(Seq | Summer) P(Summer)] / [P(Seq | Winter) P(Winter)]
[Diagram: four states A, C, G, T fully connected]
DNA: C = {Adenine, Cytosine, Guanine, Thymine}
16 transition probabilities (12 of which independent) + 4 BEGIN probabilities + 4 END probabilities.
Simple Markov Model for DNA sequences
The parameters of the model are different in different zones of DNA
They describe the overall composition and the recurrence of nucleotide pairs.
Example of Markov Models: GpC Islands
[Diagrams: two DNA models, GpC Islands and Non-GpC Islands]
In the Markov model of GpC Islands, a_GC is higher than in the Markov model of Non-GpC Islands.
Given a sequence s we can evaluate
P(GpC | s) = [P(s | GpC) · P(GpC)] / [P(s | GpC) · P(GpC) + P(s | nonGpC) · P(nonGpC)]
GATGCGTCGC
CTACGCAGCG
The simplest probabilistic models: Markov Models
•Definition
•Training
Probabilistic training of a parametric method
Generally speaking, a parametric model M aims to reproduce a set of known data.
[Diagram: model M with parameters θ produces modelled data, to be compared with the real data D]
How to compare them?
Let θ be the set of parameters of model M. During the training phase, the parameters θ are estimated from the set of known data D.

Maximum Likelihood Estimation (ML):
θ_ML = argmax_θ P(D | M, θ)

Maximum A Posteriori Estimation (MAP):
θ_MAP = argmax_θ P(θ | M, D) = argmax_θ [P(D | M, θ) P(θ)]

Training of Markov Models: it can be proved that the ML estimate is

a_ik = n_ik / Σ_j n_ij

i.e. the frequency of occurrence as counted in the data set D.
Example (coin-tossing)
Given N tossing of a coin (our data D), the outcomes are h heads and t tails (N=t+h)
ASSUME the model:

P(D | M) = p^h (1 - p)^t

Computing the maximum of the likelihood P(D | M):

dP(D | M)/dp = p^(h-1) (1 - p)^(t-1) [h(1 - p) - tp] = 0

we obtain the estimate

p = h / (h + t) = h / N
Example (Error measure)
Suppose you think that your data are affected by a Gaussian error, so that they are distributed according to

F(x_i) = A · exp[-(x_i - μ)² / (2σ²)],   with A = 1/(σ√(2π))

If your measures are independent, the data likelihood is

P(Data | model) = Π_i F(x_i)

Find the μ and σ that maximise P(Data | model).
Maximum Likelihood training: Proof

Given a sequence s contained in D, s = s1 s2 s3 s4 s5 … sT:

P(s | M) = a_BEGIN,s1 · [Π_{i=2}^T a_{s_{i-1} s_i}] · a_{s_T,END}

We can count the number of transitions between any two states j and k: n_jk. Then

P(s | M) = Π_{j=0}^{N} Π_{k=1}^{N+1} a_jk^{n_jk}

where states 0 and N+1 are BEGIN and END.

Maximising log P(s | M) = Σ_{j,k} n_jk log a_jk under the normalisation constraints Σ_k a_jk = 1 (taken into account using the Lagrange multipliers λ_j) gives

a_jk = n_jk / Σ_{k'} n_jk'
Hidden Markov Models
•Preliminary examples
Given a sequence:
4156266656321636543662152611536264162364261664616263
We don’t know the sequence of dice that generated it.
RRRRRLRLRRRRRRRLRRRRRRRRRRRRLRLRRRRRRRRLRRRRLRRRRLRR
Loaded dice
We have 99 regular dice (R) and 1 loaded die (L).
     P(1)  P(2)  P(3)  P(4)  P(5)  P(6)
R    1/6   1/6   1/6   1/6   1/6   1/6
L    1/10  1/10  1/10  1/10  1/10  1/2
Hypothesis:
We choose a die for each roll.
Two stochastic processes give origin to the sequence of observations.
1) Choosing the die (R or L). 2) Rolling the die.
The sequence of dice is hidden
The first process is assumed to be Markovian (in this case a 0-order MM)
The outcome of the second process depends only on the state reached in the first process (that is the chosen die)
Loaded dice
Model
Each state (R and L) generates a character of the alphabet
C = {1, 2, 3, 4, 5, 6 }
The emission probabilities depend only on the state.
The transition probabilities describe a Markov model that generates a state path: the hidden sequence (π).
The observation sequence (s) is generated by two concomitant stochastic processes.
[Diagram: states R and L; a_RL = a_LR = 0.01, a_RR = a_LL = 0.99]
Casino
The observation sequence (s) is generated by two concomitant stochastic processes.
[Diagram: states R and L; a_RL = a_LR = 0.01, a_RR = a_LL = 0.99]
Choose the state: R (probability 0.99)
Choose the symbol: 1 (probability 1/6, given R)
4156266656321636543662152611
RRRRRLRLRRRRRRRLRRRRRRRRRRRR
The observation sequence (s) is generated by two concomitant stochastic processes.
[Diagram: states R and L as above]
Choose the state: L (probability 0.01)
Choose the symbol: 5 (probability 1/10, given L)
41562666563216365436621526115
RRRRRLRLRRRRRRRLRRRRRRRRRRRRL
Model
Each state (R and L) generates a character of the alphabet C = {1, 2, 3, 4, 5, 6}.
The emission probabilities depend only on the state.
The transition probabilities describe a Markov model that generates a state path: the hidden sequence (π).
The observation sequence (s) is generated by two concomitant stochastic processes.
[Diagram: states R and L; a_RL = a_LR = 0.01, a_RR = a_LL = 0.99]
Loaded dice
Some not-so-serious examples
1) DEMOGRAPHY
Observable: number of births and deaths in a year in a village. Hidden variable: economic conditions (as a first approximation we can consider success in business as a random variable and, by consequence, wealth as a Markov variable).
---> Can we deduce the economic conditions of a village during a century by means of the register of births and deaths?
2) THE METEOROPATHIC TEACHER
Observable: average of the marks that a meteoropathic teacher gives to the students during a day. Hidden variable: weather conditions.
---> Can we deduce the weather conditions during a year by means of the class register?
To be more serious
1) SECONDARY STRUCTURE Observable: protein sequence Hidden variable: secondary structure
---> can we deduce (predict) the secondary structure of a protein given its amino acid sequence?
2) ALIGNMENT Observable: protein sequence Hidden variable: position of each residue along the alignment of a protein family
---> can we align a protein to a family, starting from its amino acid sequence?
Hidden Markov Models
•Preliminary examples
•Formal definition
Formal definition of Hidden Markov Models
A HMM is a stochastic generator of sequences characterised by:
•N states
•a set of transition probabilities between states {a_kj}: a_kj = P(π(i) = j | π(i-1) = k)
•a set of starting probabilities {a_0k}: a_0k = P(π(1) = k)
•a set of ending probabilities {a_k0}: a_k0 = P(π(i) = END | π(i-1) = k)
•an alphabet C with M characters
•a set of emission probabilities for each state {e_k(c)}: e_k(c) = P(s_i = c | π(i) = k)

Constraints:
Σ_k a_0k = 1;   a_k0 + Σ_j a_kj = 1 ∀k;   Σ_{c∈C} e_k(c) = 1 ∀k
s: sequence; π: path through the states
Generating a sequence with a HMM:
1. Choose the initial state π(1) following the probabilities a_0k; set i = 1.
2. Choose the character s_i from the alphabet C following the probabilities e_k(c).
3. Choose the next state following the probabilities a_kj and a_k0.
4. Is the END state chosen? If yes, end; if no, set i ← i + 1 and go to step 2.
s: AGCGCGTAATCTG
π: YYYYYYYNNNNNN
P(s, π | M) can be easily computed.
GpC Island, simple model
[Diagram: two states, Y (GpC Island) and N (Non-GpC Island)]
BEGIN: a_0Y = 0.2, a_0N = 0.8
END: a_Y0 = 0.1, a_N0 = 0.1
Transitions: a_YY = 0.7, a_YN = 0.2, a_NY = 0.1, a_NN = 0.8
Emissions: e_Y(A) = 0.1, e_Y(C) = 0.4, e_Y(G) = 0.4, e_Y(T) = 0.1
           e_N(A) = e_N(C) = e_N(G) = e_N(T) = 0.25
P(s, π | M) can be easily computed.
[The GpC-island model parameters as above]

s:          A   G   C   G   C   G   T   A    A    T    C    T    G
π:          Y   Y   Y   Y   Y   Y   Y   N    N    N    N    N    N
Emission:   0.1 0.4 0.4 0.4 0.4 0.4 0.1 0.25 0.25 0.25 0.25 0.25 0.25
Transition: 0.2 0.7 0.7 0.7 0.7 0.7 0.7 0.2  0.8  0.8  0.8  0.8  0.8  0.1

Multiplying all the probabilities gives the probability of having the sequence AND the path through the states.
Evaluation of the joint probability of the sequence and the path:

P(s, π | M) = P(s | π, M) · P(π | M)

P(π | M) = a_{0,π(1)} · [Π_{i=2}^T a_{π(i-1),π(i)}] · a_{π(T),0}

P(s | π, M) = Π_{i=1}^T e_{π(i)}(s_i)

P(s, π | M) = a_{0,π(1)} · Π_{i=1}^T e_{π(i)}(s_i) · a_{π(i),π(i+1)}   (with a_{π(T),π(T+1)} ≡ a_{π(T),0})
Hidden Markov Models
•Preliminary examples
•Formal definition
•Three questions
s: AGCGCGTAATCTG
π: ?????????????
P(s, π | M) can be easily computed. How to evaluate P(s | M)?
GpC Island, simple model
[The GpC-island model parameters as above]
How to evaluate P ( s | M )?
[The GpC-island model parameters as above]
s: A G C G C G T A A T C T G
π: Y Y Y Y Y Y Y Y Y Y Y Y Y
   Y Y Y Y Y Y Y Y Y Y Y Y N
   Y Y Y Y Y Y Y Y Y Y Y N Y
   Y Y Y Y Y Y Y Y Y Y Y N N
   Y Y Y Y Y Y Y Y Y Y N Y Y
   …
There are 2^13 different paths. Summing over all the paths gives the probability of having the sequence:

P(s | M) = Σ_π P(s, π | M)
s: AGCGCGTAATCTG
π: ?????????????
P(s, π | M) can be easily computed. How to evaluate P(s | M)? Can we show the hidden path?
Resumé: GpC Island, simple model
[The GpC-island model parameters as above]
Can we show the hidden path?
[The GpC-island model parameters as above]
s: A G C G C G T A A T C T G has 2^13 different paths.
Viterbi path: the path that gives the best joint probability:

π* = argmax_π [ P(π | s, M) ] = argmax_π [ P(π, s | M) ]
A Posteriori decoding
For each position i choose the state: π'(i) = argmax_k [ P(π(i) = k | s, M) ]
The contribution to this probability derives from all the paths that go through the state k at position i.
The A posteriori path can be a non-sense path (it may not be a legitimate path if some transitions are not permitted in the model)
Can we show the hidden path?
s: AGCGCGTAATCTG
π: YYYYYYYNNNNNN
P(s, π | M) can be easily computed. How to evaluate P(s | M)? Can we show the hidden path? Can we evaluate the parameters starting from known examples?
GpC Island, simple model
[Diagram: the two-state GpC-island model with all parameters unknown: a_0Y, a_0N, a_YY, a_YN, a_NY, a_NN, a_Y0, a_N0 and all emission probabilities e_Y(·), e_N(·) marked "?"]
Can we evaluate the parameters starting from known examples?
[The model with unknown parameters, as above]
s:          A      G      C      G      C      G      T      A      A      T      C      T      G
π:          Y      Y      Y      Y      Y      Y      Y      N      N      N      N      N      N
Emission:   e_Y(A) e_Y(G) e_Y(C) e_Y(G) e_Y(C) e_Y(G) e_Y(T) e_N(A) e_N(A) e_N(T) e_N(C) e_N(T) e_N(G)
Transition: a_0Y a_YY a_YY a_YY a_YY a_YY a_YY a_YN a_NN a_NN a_NN a_NN a_NN a_N0

How to find the parameters e and a that maximise this probability? And what if we don't know the path?
Hidden Markov Models: Algorithms
•Resumé
•Evaluating P(s | M): Forward Algorithm
Computing P(s, π | M) for each path is a redundant operation:

s: A G C G C G T A A T C T G
π: Y Y Y Y Y Y Y Y Y Y Y Y Y
Emission:   0.1 0.4 0.4 0.4 0.4 0.4 0.1 0.1 0.1 0.1 0.4 0.1 0.4
Transition: 0.2 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.1
[The GpC-island model parameters as above]
s: A G C G C G T A A T C T G
π: Y Y Y Y Y Y Y Y Y Y Y Y N
Emission:   0.1 0.4 0.4 0.4 0.4 0.4 0.1 0.1 0.1 0.1 0.4 0.1 0.25
Transition: 0.2 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.2 0.1
Summing over all the possible paths
[The GpC-island model parameters as above]
First two characters:
π = Y Y: emission 0.1, 0.4; transition 0.2, 0.7 → 0.2·0.1·0.7·0.4 = 0.0056
π = Y N: emission 0.1, 0.25; transition 0.2, 0.2 → 0.2·0.1·0.2·0.25 = 0.0010
π = N Y: emission 0.25, 0.4; transition 0.8, 0.1 → 0.8·0.25·0.1·0.4 = 0.0080
π = N N: emission 0.25, 0.25; transition 0.8, 0.8 → 0.8·0.25·0.8·0.25 = 0.0400

Sum over the paths ending in the same state:
s: A G, π = X Y: 0.0056 + 0.0080 = 0.0136
s: A G, π = X N: 0.0010 + 0.0400 = 0.0410
Summing over all the possible paths, extending to the third character:
s: A G C, π = X Y Y: 0.0136 · 0.7 · 0.4
s: A G C, π = X Y N: 0.0136 · 0.2 · 0.25
s: A G C, π = X N Y: 0.0410 · 0.1 · 0.4
s: A G C, π = X N N: 0.0410 · 0.8 · 0.25

Sum over the paths ending in the same state:
s: A G C, π = X X Y: 0.0136·0.7·0.4 + 0.0410·0.1·0.4 = 0.005448
s: A G C, π = X X N: 0.0136·0.2·0.25 + 0.0410·0.8·0.25 = 0.008880
[The GpC-island model parameters as above]
Summing over all the possible paths, iterating until the last position of the sequence:
A G C G C G T A A T C T G with π = X X X X X X X X X X X X Y, times a_Y0 = 0.1
A G C G C G T A A T C T G with π = X X X X X X X X X X X X N, times a_N0 = 0.1
The sum of the two terms gives P(s | M).
On the basis of the preceding observations, the computation of P(s | M) can be decomposed into simpler problems. For each state k and each position i in the sequence, we compute:

F_k(i) = P(s1 s2 s3 … s_i, π(i) = k | M)

Initialisation: F_BEGIN(0) = 1;   F_k(0) = 0 ∀k ≠ BEGIN

Recurrence: F_l(i+1) = P(s1 s2 … s_i s_{i+1}, π(i+1) = l)
          = Σ_k P(s1 s2 … s_i, π(i) = k) · a_kl · e_l(s_{i+1})
          = e_l(s_{i+1}) · Σ_k F_k(i) · a_kl

Termination: P(s) = P(s1 s2 s3 … s_T, π(T+1) = END) = Σ_k F_k(T) · a_k0

Forward Algorithm
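A direct Python transcription of the recurrence, using the GpC-island parameters of the slides (a sketch; F_k includes the emission of the current character, as in the definition above):

states = ("Y", "N")
a0 = {"Y": 0.2, "N": 0.8}                # BEGIN -> k
a_end = {"Y": 0.1, "N": 0.1}             # k -> END
a = {("Y", "Y"): 0.7, ("Y", "N"): 0.2, ("N", "Y"): 0.1, ("N", "N"): 0.8}
e = {"Y": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
     "N": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}

def forward(s):
    # F_k(1) = a_0k * e_k(s_1)
    F = {k: a0[k] * e[k][s[0]] for k in states}
    for c in s[1:]:
        # F_l(i+1) = e_l(s_{i+1}) * sum_k F_k(i) * a_kl
        F = {l: e[l][c] * sum(F[k] * a[(k, l)] for k in states) for l in states}
    # P(s) = sum_k F_k(T) * a_k0
    return sum(F[k] * a_end[k] for k in states)

# For "AG", the internal values are F_Y(2) = 0.0136 and F_N(2) = 0.041,
# matching the worked example above.
print(forward("AGCGCGTAATCTG"))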
(The conditioning on M will be understood and omitted from the notation.)
Computing P(s, π | M) for each path is a redundant operation.
[Diagram: trellis of states (Begin, R, L, End) versus iterations 0, 1, 2, 3, …, T-1]
Consider two paths π1 and π2 that differ only in their final state:

P(s, π1 | M) = a_{0,π(1)} e_{π(1)}(s1) · [Π_{t=2}^{T-1} a_{π(t-1),π(t)} e_{π(t)}(s_t)] · a_{π(T-1),π1(T)} e_{π1(T)}(s_T) a_{π1(T),0}
P(s, π2 | M) = a_{0,π(1)} e_{π(1)}(s1) · [Π_{t=2}^{T-1} a_{π(t-1),π(t)} e_{π(t)}(s_t)] · a_{π(T-1),π2(T)} e_{π2(T)}(s_T) a_{π2(T),0}

If we compute the common part only once, we save 2·(T-1) operations.
[Diagram: trellis of states (Begin, R, L, End) versus iterations 0, 1, 2, 3, …, T-1]
If we know the probabilities of emitting the first two characters of the sequence, ending the path in states R and L respectively,

F_R(2) = P(s1, s2, π(2) = R | M) and F_L(2) = P(s1, s2, π(2) = L | M),

then we can compute:

P(s1, s2, s3, π(3) = R | M) = F_R(2) · a_RR · e_R(s3) + F_L(2) · a_LR · e_R(s3)
Summing over all the possible paths
[Diagram: forward trellis over states (BEGIN, A, B, END) and iterations 0, 1, 2, …, T, T+1; F_B(2) = e_B(s2) · Σ_i F_i(1) · a_iB; at iteration T+1 the sum gives P(s | M)]
Forward Algorithm
Forward algorithm: computational complexity
Naïve method: P(s | M) = Σ_π P(s, π | M). There are N^T possible paths, and each path requires about 2T operations, so the computation time is O(T · N^T).
Forward algorithm
T positions, N values for each position; each element requires about 2N products and 1 sum. The computation time is O(T · N²).
Forward algorithm: computational complexity
[Plot: number of operations versus sequence length T for the naïve method and the forward algorithm]
Hidden Markov Models: Algorithms
•Resumé
•Evaluating P(s | M): Forward Algorithm
•Evaluating P(s | M): Backward Algorithm
Backward Algorithm
Similar to the Forward algorithm: it computes P(s | M) reconstructing the sequence from the end. For each state k and each position i in the sequence, we compute:

B_k(i) = P(s_{i+1} s_{i+2} s_{i+3} … s_T | π(i) = k)

Initialisation: B_k(T) = P(π(T+1) = END | π(T) = k) = a_k0

Recurrence: B_l(i-1) = P(s_i s_{i+1} … s_T | π(i-1) = l)
          = Σ_k a_lk · e_k(s_i) · P(s_{i+1} … s_T | π(i) = k)
          = Σ_k a_lk · e_k(s_i) · B_k(i)

Termination: P(s) = P(s1 s2 s3 … s_T | π(0) = BEGIN) = Σ_k a_0k · e_k(s1) · B_k(1)
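The backward recurrence in the same style (a sketch; model parameters repeated from the forward example so the snippet runs on its own):

states = ("Y", "N")
a0 = {"Y": 0.2, "N": 0.8}; a_end = {"Y": 0.1, "N": 0.1}
a = {("Y", "Y"): 0.7, ("Y", "N"): 0.2, ("N", "Y"): 0.1, ("N", "N"): 0.8}
e = {"Y": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
     "N": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}

def backward(s):
    B = {k: a_end[k] for k in states}        # B_k(T) = a_k0
    for c in reversed(s[1:]):
        # B_l(i-1) = sum_k a_lk * e_k(s_i) * B_k(i)
        B = {l: sum(a[(l, k)] * e[k][c] * B[k] for k in states) for l in states}
    # P(s) = sum_k a_0k * e_k(s_1) * B_k(1)
    return sum(a0[k] * e[k][s[0]] * B[k] for k in states)

print(backward("AGCGCGTAATCTG"))  # same value as the forward algorithm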
Backward Algorithm
[Diagram: backward trellis over states (BEGIN, A, B, END) and iterations 0, 1, 2, …, T, T+1; B_B(T-1) = Σ_k a_Bk · e_k(s_T) · B_k(T); iterating back to BEGIN yields P(s | M)]
Hidden Markov Models: Algorithms
•Resumé
•Evaluating P(s | M): Forward Algorithm
•Evaluating P(s | M): Backward Algorithm
•Showing the path: Viterbi decoding
Finding the best path
[The GpC-island model parameters as above]
First two characters:
π = Y Y: emission 0.1, 0.4; transition 0.2, 0.7 → 0.2·0.1·0.7·0.4 = 0.0056
π = Y N: emission 0.1, 0.25; transition 0.2, 0.2 → 0.2·0.1·0.2·0.25 = 0.0010
π = N Y: emission 0.25, 0.4; transition 0.8, 0.1 → 0.8·0.25·0.1·0.4 = 0.0080
π = N N: emission 0.25, 0.25; transition 0.8, 0.8 → 0.8·0.25·0.8·0.25 = 0.0400

Taking the maximum (instead of the sum) over the paths ending in the same state:
s: A G, π = N Y: 0.0080
s: A G, π = N N: 0.0400
Finding the best path, extending to the third character:
s: A G C, π = N Y Y: 0.008 · 0.7 · 0.4 = 0.00224
s: A G C, π = N Y N: 0.008 · 0.2 · 0.25 = 0.0004
s: A G C, π = N N Y: 0.040 · 0.1 · 0.4 = 0.0016
s: A G C, π = N N N: 0.040 · 0.8 · 0.25 = 0.0080
[The GpC-island model parameters as above]

Taking the maximum over the paths ending in the same state:
s: A G C, π = N Y Y: 0.00224
s: A G C, π = N N N: 0.00800
[The GpC-island model parameters as above]
Iterating until the last position of the sequence:
A G C G C G T A A T C T G with π = N Y Y Y Y Y Y N N N Y Y Y, times a_Y0 = 0.1
A G C G C G T A A T C T G with π = N N N N N N N N N N N N N, times a_N0 = 0.1
Choose the maximum.
Finding the best path
Viterbi Algorithm
π* = argmax_π [ P(π, s | M) ]. The computation of P(s, π* | M) can be decomposed into simpler problems.

Let V_k(i) be the probability of the most probable path generating the subsequence s1 s2 s3 … s_i and ending in state k at iteration i.

Initialisation: V_BEGIN(0) = 1;   V_k(0) = 0 ∀k ≠ BEGIN

Recurrence: V_l(i+1) = e_l(s_{i+1}) · max_k (V_k(i) · a_kl)
           ptr_{i+1}(l) = argmax_k (V_k(i) · a_kl)

Termination: P(s, π*) = max_k (V_k(T) · a_k0)
            π*(T) = argmax_k (V_k(T) · a_k0)

Traceback: π*(i-1) = ptr_i(π*(i))
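Viterbi in the same style (a sketch with the GpC-island parameters); it returns both P(s, π* | M) and the decoded path:

states = ("Y", "N")
a0 = {"Y": 0.2, "N": 0.8}; a_end = {"Y": 0.1, "N": 0.1}
a = {("Y", "Y"): 0.7, ("Y", "N"): 0.2, ("N", "Y"): 0.1, ("N", "N"): 0.8}
e = {"Y": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
     "N": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}

def viterbi(s):
    V = {k: a0[k] * e[k][s[0]] for k in states}
    ptrs = []
    for c in s[1:]:
        # ptr_{i+1}(l) = argmax_k V_k(i) * a_kl
        best = {l: max(states, key=lambda k: V[k] * a[(k, l)]) for l in states}
        # V_l(i+1) = e_l(s_{i+1}) * V_{best}(i) * a_{best,l}
        V = {l: e[l][c] * V[best[l]] * a[(best[l], l)] for l in states}
        ptrs.append(best)
    last = max(states, key=lambda k: V[k] * a_end[k])
    path = [last]
    for best in reversed(ptrs):              # traceback
        path.append(best[path[-1]])
    return V[last] * a_end[last], "".join(reversed(path))

print(viterbi("AGCGCGTAATCTG"))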
Viterbi Algorithm
[Diagram: Viterbi trellis; V_B(2) = e_B(s2) · max_i V_i(1) · a_iB, with ptr_2(B) recording the best predecessor; at iteration T+1 the maximum over states gives P(s, π* | M)]
[Diagram: the Viterbi path traced back through the trellis from END to BEGIN]
Different paths can have the same probability
Hidden Markov Models: Algorithms
•Resumé
•Evaluating P(s | M): Forward Algorithm
•Evaluating P(s | M): Backward Algorithm
•Showing the path: Viterbi decoding
•Showing the path: A posteriori decoding
•Training a model: EM algorithm
If we know the path generating the training sequence
[The GpC-island model with all parameters unknown, as above]
s:          A      G      C      G      C      G      T      A      A      T      C      T      G
π:          Y      Y      Y      Y      Y      Y      Y      N      N      N      N      N      N
Emission:   e_Y(A) e_Y(G) e_Y(C) e_Y(G) e_Y(C) e_Y(G) e_Y(T) e_N(A) e_N(A) e_N(T) e_N(C) e_N(T) e_N(G)
Transition: a_0Y a_YY a_YY a_YY a_YY a_YY a_YY a_YN a_NN a_NN a_NN a_NN a_NN a_N0

Just count! Example: a_YY = n_YY / (n_YY + n_YN) = 6/7
e_Y(A) = n_Y(A) / [n_Y(A) + n_Y(C) + n_Y(G) + n_Y(T)] = 1/7
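When the path is known, training is literally counting; a tiny sketch reproducing the two numbers above:

from collections import Counter

s = "AGCGCGTAATCTG"
pi = "YYYYYYYNNNNNN"

trans = Counter(zip(pi, pi[1:]))        # n_kl: transition counts
emit = Counter(zip(pi, s))              # n_k(c): emission counts

n_from_Y = sum(n for (k, _), n in trans.items() if k == "Y")
print(trans[("Y", "Y")] / n_from_Y)     # a_YY = 6/7

y_total = sum(n for (k, _), n in emit.items() if k == "Y")
print(emit[("Y", "A")] / y_total)       # e_Y(A) = 1/7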
If we DO NOT know the path generating the training sequence
[The GpC-island model with all parameters unknown, as above]
s: A G C G C G T A A T C T G
π: ? ? ? ? ? ? ? ? ? ? ? ? ?
Emission: e_?(A) e_?(G) e_?(C) e_?(G) e_?(C) e_?(G) e_?(T) e_?(A) e_?(A) e_?(T) e_?(C) e_?(T) e_?(G)
Transition: a_0? a_?? a_?? a_?? a_?? a_?? a_?? a_?? a_?? a_?? a_?? a_?? a_?? a_?0

We need, in some sense, to average over all the possible paths.
No exact algorithm is available. The iterative Baum-Welch algorithm is based on Expectation-Maximisation.

Baum-Welch algorithm (simple discussion)
Given a path π we can count:
•the number of transitions between states k and l: A_kl(π)
•the number of emissions of character c from state k: E_k(c, π)
[All the 2^13 paths π for s = AGCGCGTAATCTG contribute to the sums]

We can compute the expected values over all the paths, given initial parameters θ0:

A_kl = Σ_π P(π | s, θ0) · A_kl(π)
E_k(c) = Σ_π P(π | s, θ0) · E_k(c, π)

The updated parameters are:

a_kl = A_kl / Σ_{m=1}^N A_km
e_k(c) = E_k(c) / Σ_{c'} E_k(c')

Then we can iterate…
Expectation-Maximisation algorithm
We need to estimate the Maximum Likelihood parameters when the paths generating the training sequences are unknown:

θ_ML = argmax_θ [ P(s | θ, M) ]

Given a model with parameters θ0, the EM algorithm finds new parameters θ that increase the likelihood of the model:

P(s | θ) > P(s | θ0), or equivalently log P(s | θ) > log P(s | θ0)
Expectation-Maximisation algorithm
From P(s, π | θ) = P(π | s, θ) · P(s | θ):

log P(s | θ) = log P(s, π | θ) - log P(π | s, θ)

Multiplying by P(π | s, θ0) and summing over all the possible paths:

log P(s | θ) = Σ_π P(π | s, θ0) · log P(s, π | θ) - Σ_π P(π | s, θ0) · log P(π | s, θ)

Q(θ | θ0) = Σ_π P(π | s, θ0) · log P(s, π | θ) is the expectation value of log P(s, π | θ) over the "current" paths. Then:

log P(s | θ) - log P(s | θ0) = Q(θ | θ0) - Q(θ0 | θ0) + Σ_π P(π | s, θ0) · log [ P(π | s, θ0) / P(π | s, θ) ]

The last sum is a relative entropy and is never negative, so any θ with Q(θ | θ0) ≥ Q(θ0 | θ0) does not decrease the likelihood.
Expectation-Maximisation algorithm
The EM algorithm is an iterative process; each iteration performs two steps:

E-step: evaluation of Q(θ | θ0) = Σ_π P(π | s, θ0) · log P(s, π | θ)
M-step: maximisation of Q(θ | θ0) over all θ

It is NOT guaranteed to converge to the GLOBAL maximum likelihood.
E-step:

Q(θ | θ0) = Σ_π P(π | s, θ0) · log P(s, π | θ)

P(s, π | θ) = a_{0,π(1)} · Π_{i=1}^T a_{π(i),π(i+1)} · e_{π(i)}(s_i)
            = Π_{k=0}^N Π_{l=1}^{N+1} a_kl^{A_kl(π)} · Π_{k=1}^N Π_{c∈C} e_k(c)^{E_k(c,π)}

where A_kl(π) is the number of transitions between states k and l in path π, and E_k(c, π) is the number of emissions of character c in path π. With the expected counts

A_kl = Σ_π P(π | s, θ0) · A_kl(π)
E_k(c) = Σ_π P(π | s, θ0) · E_k(c, π)

we obtain:

Q(θ | θ0) = Σ_{k=0}^N Σ_{l=1}^{N+1} A_kl · log a_kl + Σ_{k=1}^N Σ_{c∈C} E_k(c) · log e_k(c)
Baum-Welch implementation of the EM algorithm
Expected values over all the "current" paths give the updates:

a_kl = A_kl / Σ_{m=1}^N A_km
e_k(c) = E_k(c) / Σ_{c'∈C} E_k(c')
Baum-Welch implementation of the EM algorithm
M-step: set ∂Q/∂a_kl = 0 for any pair of states k and l, with Σ_l a_kl = 1, and ∂Q/∂e_k(c) = 0 for any state k and character c, with Σ_c e_k(c) = 1.
By means of the Lagrange multipliers technique we can solve the system; the solution is the update rule above.
How to compute the expected numbers of transitions and emissions over all the paths: with

F_k(i) = P(s1 s2 s3 … s_i, π(i) = k)
B_k(i) = P(s_{i+1} s_{i+2} s_{i+3} … s_T | π(i) = k)

we have

A_kl = Σ_i P(π(i) = k, π(i+1) = l | s, θ) = Σ_i F_k(i) · a_kl · e_l(s_{i+1}) · B_l(i+1) / P(s)
E_k(c) = Σ_i P(s_i = c, π(i) = k | s, θ) = Σ_{i: s_i = c} F_k(i) · B_k(i) / P(s)

Baum-Welch implementation of the EM algorithm
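A compact sketch of these expected counts, computed with full Forward and Backward matrices on the toy GpC-island model (one training sequence; BEGIN/END handled as above):

states = ("Y", "N")
a0 = {"Y": 0.2, "N": 0.8}; a_end = {"Y": 0.1, "N": 0.1}
a = {("Y", "Y"): 0.7, ("Y", "N"): 0.2, ("N", "Y"): 0.1, ("N", "N"): 0.8}
e = {"Y": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
     "N": {c: 0.25 for c in "ACGT"}}

def matrices(s):
    # F[i][k] = F_k(i+1); B[i][k] = B_k(i+1) (0-based lists)
    F = [{k: a0[k] * e[k][s[0]] for k in states}]
    for c in s[1:]:
        F.append({l: e[l][c] * sum(F[-1][k] * a[(k, l)] for k in states)
                  for l in states})
    B = [{k: a_end[k] for k in states}]
    for c in reversed(s[1:]):
        B.insert(0, {l: sum(a[(l, k)] * e[k][c] * B[0][k] for k in states)
                     for l in states})
    return F, B

s = "AGCGCGTAATCTG"
F, B = matrices(s)
Ps = sum(F[-1][k] * a_end[k] for k in states)

# A_kl = sum_i F_k(i) a_kl e_l(s_{i+1}) B_l(i+1) / P(s)
A = {(k, l): sum(F[i][k] * a[(k, l)] * e[l][s[i + 1]] * B[i + 1][l]
                 for i in range(len(s) - 1)) / Ps
     for k in states for l in states}
# E_k(c) = sum_{i: s_i = c} F_k(i) B_k(i) / P(s)
E = {(k, c): sum(F[i][k] * B[i][k] for i in range(len(s)) if s[i] == c) / Ps
     for k in states for c in "ACGT"}
print(A[("Y", "N")], E[("Y", "G")])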
Baum-Welch implementation of the EM algorithm
Algorithm:
1. Start with random parameters.
2. Compute the Forward and Backward matrices on the known sequences.
3. Compute A_kl and E_k(c), the expected numbers of transitions and emissions.
4. Update a_kl ← A_kl (normalised) and e_k(c) ← E_k(c) (normalised).
5. Has P(s | M) increased? If yes, go back to step 2; if no, stop.
Profile HMMs
•HMMs for alignments
How to align? Each state M0 … M5 represents a position in the alignment:

A C G G T A → M0 M1 M2 M3 M4 M5
A C G A T C → M0 M1 M2 M3 M4 M5
A T G T T C → M0 M1 M2 M3 M4 M5

Each position has a peculiar composition:
A C G G T A
A C G A T C
A T G T T C
Given a set of sequences..
..we can train a model..
     M0  M1    M2  M3    M4  M5
A    1   0     0   0.33  0   0.33
C    0   0.66  0   0     0   0.66
G    0   0     1   0.33  0   0
T    0   0.33  0   0.33  1   0
..estimating the emission probabilities.
M0 M1 M2 M3 M4 M5
Given a trained model, we can align a new sequence, computing the probability of generating it:
[Emission table as above]
A C G A T C
P(s | M) = 1 × 0.66 × 1 × 0.33 × 1 × 0.66
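The same product as a few lines of Python (a sketch; the table rows are the per-position emission probabilities above):

profile = {
    "A": [1, 0,    0, 0.33, 0, 0.33],
    "C": [0, 0.66, 0, 0,    0, 0.66],
    "G": [0, 0,    1, 0.33, 0, 0],
    "T": [0, 0.33, 0, 0.33, 1, 0],
}

def profile_prob(seq):
    # P(s|M) = prod_j e_{Mj}(s_j) for an ungapped profile model
    p = 1.0
    for j, c in enumerate(seq):
        p *= profile[c][j]
    return p

print(profile_prob("ACGATC"))  # 1 * 0.66 * 1 * 0.33 * 1 * 0.66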
And for the sequence AGATC?
A G A T C → M0 M2 M3 M4 M5 (position M1 is skipped)
[Emission table as above]
We need a way to introduce gaps.
Silent states
Red transitions allow gaps: (N-1)! transitions.
To reduce the number of parameters we can use states that do not emit any character: 4N-8 transitions.
[Diagram: profile HMM with match states M0 … M5, insert states I0 … I4 and delete states D1 … D4]
Profile HMMs
Delete states, insert states, match states:

A C G G T A → M0 M1 M2 M3 M4 M5
A C G C A G T C → M0 I0 I0 M1 M2 M3 M4 M5
A G A T C → M0 D1 M2 M3 M4 M5
Example of alignment
Sequence 1: A S T R A L, Viterbi path: M0 M1 M2 M3 M4 M5
Sequence 2: A S T A I L, Viterbi path: M0 M1 M2 D3 M4 I4 M5
Sequence 3: A R T I, Viterbi path: M0 M1 M2 D3 D4 M5
[Diagram: the profile HMM with match, insert and delete states]
Example of alignment
Grouping by vertical layers 0 1 2 3 4 5:
s1: A S T R A L (M0 M1 M2 M3 M4 M5)
s2: A S T A I L (M0 M1 M2 D3 M4 I4 M5)
s3: A R T I (M0 M1 M2 D3 D4 M5)

Alignment:
ASTRA-L
AST-AIL
ART---I

-log P(s | M) is an alignment score.
Searching for a structural/functional pattern in protein sequence
Zn binding loop:
C H C I C R I C
C H C L C K I C
C H C I C S L C
D H C L C T I C
C H C I D S I C
C H C L C K I C
Cysteines can be replaced by an Aspartic Acid, but only ONCE for each sequence
Searching for a structural/functional pattern in protein sequences
..ALCPCHCLCRICPLIY..
obtains a higher probability than
..WERWDHCIDSICLKDE..
[Diagram: profile HMM with match states M0 … M7 and the corresponding insert and delete states]
..because M0 and M4 have a low emission probability for aspartic acid, and for the second sequence we would multiply them twice.
Profile HMMs
•HMMs for alignments
•Example on globins
Structural alignment of globins
Structural alignment of globins
Bashford D, Chothia C & Lesk AM (1987) Determinants of a protein fold: unique features of the globin amino acid sequence. J. Mol. Biol. 196, 199-216
Alignment of globins reconstructed with profile HMMs
Krogh A, Brown M, Mian IS, Sjolander K & Haussler D (1994) Hidden Markov Models in computational biology: applications to protein modelling. J.Mol.Biol. 235, 1501-1531
Discrimination power of profile HMMs
Z-score = [log P(s | M) - <log P(s | M)>] / σ(log P(s | M))
Krogh A, Brown M, Mian IS, Sjolander K & Haussler D (1994) Hidden Markov Models in computational biology: applications to protein modelling. J.Mol.Biol. 235, 1501-1531
Profile HMMs
•HMMs for alignments
•Example on globins
•Other applications
Finding a domain
[Diagram: BEGIN and END connected through two flanking insert states with self-loops, surrounding a profile HMM specific for the considered domain]
Clustering subfamilies
[Diagram: BEGIN connected in parallel to HMM 1, HMM 2, HMM 3, …, HMM n, all converging to END]
Each sequence s contributes to update HMM i with a weight equal to P ( s | Mi )
Profile HMMs
•HMMs for alignments
•Example on globins
•Other applications
•Available codes and servers
HMMER at WUSTL: http://hmmer.wustl.edu/
Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14:755-763
HMMER
Alignment of a protein family → hmmbuild → trained profile-HMM → hmmcalibrate → HMM calibrated with accurate E-value statistics

hmmbuild takes the aligned sequences, checks for redundancy, and sets the emission and transition probabilities of a HMM.
hmmcalibrate takes a trained HMM, generates a great number of random sequences, scores them, and fits the Extreme Value Distribution to the computed scores.
HMMER
hmmalign: aligns a set of sequences to the model.
hmmsearch: given a trained HMM and a set of sequences, produces the list of sequences that match the HMM (sorted by E-value).
hmmpfam: given a sequence and a set of HMMs, produces the list of HMMs that match the sequence.
!!AA_MULTIPLE_ALIGNMENT 1.0
PileUp of: *.pep
Symbol comparison table: GenRunData:blosum62.cmp CompCheck: 6430
GapWeight: 12 GapLengthWeight: 4
pileup.msf MSF: 308 Type: P August 16, 1999 09:09 Check: 9858 ..
Name: lgb1_pea Len: 308 Check: 2200 Weight: 1.00
Name: lgb1_vicfa Len: 308 Check: 214 Weight: 1.00
Name: myg_escgi Len: 308 Check: 3961 Weight: 1.00
Name: myg_horse Len: 308 Check: 5619 Weight: 1.00
Name: myg_progu Len: 308 Check: 6401 Weight: 1.00
Name: myg_saisc Len: 308 Check: 6606 Weight: 1.00
//
           1                                                50
lgb1_pea   ~~~~~~~~~G FTDKQEALVN SSSE.FKQNL PGYSILFYTI VLEKAPAAKG
lgb1_vicfa ~~~~~~~~~G FTEKQEALVN SSSQLFKQNP SNYSVLFYTI ILQKAPTAKA
myg_escgi  ~~~~~~~~~V LSDAEWQLVL NIWAKVEADV AGHGQDILIR LFKGHPETLE
myg_horse  ~~~~~~~~~G LSDGEWQQVL NVWGKVEADI AGHGQEVLIR LFTGHPETLE
myg_progu  ~~~~~~~~~G LSDGEWQLVL NVWGKVEGDL SGHGQEVLIR LFKGHPETLE
myg_saisc  ~~~~~~~~~G LSDGEWQLVL NIWGKVEADI PSHGQEVLIS LFKGHPETLE
MSF Format: globins50.msf
Alignment of a protein family → hmmbuild → trained profile-HMM
hmmbuild globin.hmm globins50.msf
All the transition and emission parameters are estimated by means of the Expectation Maximisation algorithm on the aligned sequences.
In principle we could use also NON aligned sequences to train the model. Nevertheless it is more efficient to build the starting alignment using, for example, CLUSTALW
hmmcalibrate [--num N] --histfile globin.histo globin.hmm
Trained profile-HMM → hmmcalibrate → HMM calibrated with accurate E-value statistics
A number of N (default 5000) random sequences are generated and scored with the model.
[Histogram: scores log P(s|M)/P(s|N) of the random sequences, with the range for true globin sequences indicated]
E-value(S): expected number of random sequences with a score > S.
Trained model (globin.hmm):

HMMER2.0 [2.3.2]
NAME globins50
LENG 143
ALPH Amino
RF no
CS no
MAP yes
COM /home/gigi/bin/hmmbuild globin.hmm globins50.msf
COM /home/gigi/bin/hmmcalibrate --histfile globin.histo globin.hmm
NSEQ 50
DATE Sun May 29 19:03:18 2005
CKSUM 9858
XT -8455 -4 -1000 -1000 -8455 -4 -8455 -4
NULT -4 -8455
NULE 595 -1558 85 338 -294 453 -1158 197 249 902 -1085 -142 -21 -313 45 531 201 384 -1998 -644
EVD -38.893742 0.243153
HMM   A C D E F G H I K L M N P Q R S T V W Y
      m->m m->i m->d i->m i->i d->m d->d b->m m->e
      -450 * -1900
1     591 -1587 159 1351 -1874 -201 151 -1600 998 -1591 -693 389 -1272 595 42 -31 27 -693 -1797 -1134 14
-     -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249
-     -23 -6528 -7571 -894 -1115 -701 -1378 -450 *
2     -926 -2616 2221 2269 -2845 -1178 -325 -2678 -300 -2596 -1810 220 -1592 939 -974 -671 -939 -2204 -2785 -1925 15
-     -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249
-     -23 -6528 -7571 -894 -1115 -701 -1378 * *
3     -638 -1715 -680 497 -2043 -1540 23 -1671 2380 -1641 -840 -222 -1595 437 1040 -564 -523 -1363 2124 -1313 16
-     -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249
-     -23 -6528 -7571 -894 -1115 -701 -1378 * *
[one block like these per model position]

Notes on the file:
•NULT and NULE define the null model (null transitions and null emissions).
•Score = INT[1000 · log2(prob / null_prob)], with null_prob = 1 for transitions and the natural abundance of each residue for emissions.
•Each model position carries one row of emission scores (columns A … Y) and one row of transition scores (columns m->m, m->i, m->d, i->m, i->i, d->m, d->d, b->m, m->e).
hmmemit [-n N] globin.hmm
Trained profile-HMM → hmmemit → sequences generated by the model
The parameters of the model are used to generate new sequences.
hmmsearch globin.hmm Artemia.fa > Artemia.globin
Trained profile-HMM + set of sequences → hmmsearch → list of sequences that match the HMM (sorted by E-value)
Search results (Artemia.globin):

Sequence  Description  Score  E-value   N
--------  -----------  -----  -------   ---
S13421    S13421       474.3  1.7e-143  9

Parsed for domains:
Sequence  Domain  seq-f  seq-t    hmm-f  hmm-t   score  E-value
--------  ------  -----  -----    -----  -----   -----  -------
S13421    7/9     932    1075 ..  1      143 []  76.9   7.3e-24
S13421    2/9     153    293  ..  1      143 []  63.7   6.8e-20
S13421    3/9     307    450  ..  1      143 []  59.8   9.8e-19
S13421    8/9     1089   1234 ..  1      143 []  57.6   4.5e-18
S13421    9/9     1248   1390 ..  1      143 []  52.3   1.8e-16
S13421    1/9     1      143  [.  1      143 []  51.2   4e-16
S13421    4/9     464    607  ..  1      143 []  46.7   8.6e-15
S13421    6/9     775    918  ..  1      143 []  42.2   2e-13
S13421    5/9     623    762  ..  1      143 []  23.9   6.6e-08

Domains are sorted by E-value; seq-f and seq-t give the start and end of each match in the sequence; N is the number of domains.

Alignments of top-scoring domains (consensus sequence on top, matched sequence below).
S13421: domain 7 of 9, from 932 to 1075: score 76.9, E = 7.3e-24

*->eekalvksvwgkveknveevGaeaLerllvvyPetkryFpkFkdLss
   +e a vk+ w+ v+ ++ vG +++ l++ +P+ +++FpkF d+
S13421 932 REVAVVKQTWNLVKPDLMGVGMRIFKSLFEAFPAYQAVFPKFSDVPL 978

adavkgsakvkahgkkVltalgdavkkldd...lkgalakLselHaqklr
d++++++ v +h V t+l++ ++ ld++ +l+ ++L+e H+ lr
S13421 979 -DKLEDTPAVGKHSISVTTKLDELIQTLDEpanLALLARQLGEDHIV-LR 1026

vdpenfkllsevllvvlaeklgkeftpevqaalekllaavataLaakYk<
v+ fk +++vl+ l++ lg+ f+ ++ +++k+++++++ +++ +
S13421 1027 VNKPMFKSFGKVLVRLLENDLGQRFSSFASRSWHKAYDVIVEYIEEGLQ 1075
hmmalign globin.hmm globins630.fa
Set of sequences + trained profile-HMM → hmmalign → alignment of all sequences to the model

Insertions are marked by dots, gaps by dashes:

BAHG_VITSP QAG-..VAAAHYPIV.GQELLGAIKEV.L.G.D.AATDDILDAWGKAYGV
GLB1_ANABR TR-K..ISAAEFGKI.NGPIKKVLAS-.-.-.K.NFGDKYANAWAKLVAV
GLB1_ARTSX NRGT..-DRSFVEYL.KESL-----GD.S.V.D.EFT------VQSFGEV
GLB1_CALSO TRGI..TNMELFAFA.LADLVAYMGTT.I.S.-.-FTAAQKASWTAVNDV
GLB1_CHITH -KSR..ASPAQLDNF.RKSLVVYLKGA.-.-.T.KWDSAVESSWAPVLDF
GLB1_GLYDI GNKH..IKAQYFEPL.GASLLSAMEHR.I.G.G.KMNAAAKDAWAAAYAD
GLB1_LUMTE ER-N..LKPEFFDIF.LKHLLHVLGDR.L.G.T.HFDF---GAWHDCVDQ
GLB1_MORMR QSFY..VDRQYFKVL.AGII-------.-.-.A.DTTAPGDAGFEKLMSM
GLB1_PARCH DLNK..VGPAHYDLF.AKVLMEALQAE.L.G.S.DFNQKTRDSWAKAFSI
GLB1_PETMA KSFQ..VDPQYFKVL.AAVI-------.-.-.V.DTVLPGDAGLEKLMSM
GLB1_PHESE QHTErgTKPEYFDLFrGTQLFDILGDKnLiGlTmHFD---QAAWRDCYAV
HMMER applications: PFAM
http://www.sanger.ac.uk/Software/Pfam/
PFAM Exercise
Generate with hmmemit a sequence from the globin model and search it in the PFAM database.
Retrieve from the SwissProt database the sequences CG301_HUMAN and Q9H5F4_HUMAN:
1) search them in the PFAM database;
2) launch PSI-BLAST searches. Is it possible to annotate the sequences by means of the BLAST results?
SAM at UCSC: http://www.soe.ucsc.edu/research/compbio/sam.html
Krogh A, Brown M, Mian IS, Sjolander K & Haussler D (1994) Hidden Markov Models in computational biology: applications to protein modelling. J.Mol.Biol. 235, 1501-1531
SAM applications:http://www.cse.ucsc.edu/research/compbio/HMM-apps/T02-query.html
HMMPRO: http://www.netid.com/html/hmmpro.html
Pierre Baldi, Net-ID
HMMs for Mapping problems
•Mapping problems in protein prediction
Covalent structure (Nt → Ct):
TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
3D structure
Secondary structure:
EEEE..HHHHHHHHHHHH....HHHHHHHH.EEEE...........
Topography: position of the transmembrane segments along the sequence
Topology of membrane proteins
[Figure: Porin (Rhodobacter capsulatus), a β-barrel in the outer membrane, and Bacteriorhodopsin (Halobacterium salinarum), α-helices in the inner membrane, both spanning the bilayer]
ALALMLCMLTYRHKELKLKLKK (an example sequence, with its transmembrane segment marked)
HMMs for Mapping problems
•Mapping problems in protein prediction
•Labelled HMMs
HMM for secondary structure prediction
Simplest model
Introducing a grammar
[Diagram: a coil state c connected to a chain of three states (1, 2, 3) of one conformation and a chain of two states (1, 2) of the other]

Labels
The states 1, 2 and 3 share the same label, and so do the other states 1 and 2. Decoding the Viterbi path for emitting a sequence s defines a mapping between the sequence s and a sequence of labels y:

s:    S A L K M N Y T R E I M V A S N Q   (sequence)
π:    c c c c c … c c                     (path)
Y(π): c c c c c … c c                     (labels)
Computing P(s, y | M)

P(s, y | M) = Σ_{π: Y(π) = y} P(s, π | M)

Only the paths whose labelling is y have to be considered in the sum. In the Forward and Backward algorithms this means setting

F_k(i) = 0 and B_k(i) = 0 whenever Y(k) ≠ y_i

s: S A L K M N Y T R E I M V A S N Q (sequence); y: c c c c c … (labels)
[Diagram: states and their labelling]
Baum-Welch training algorithm for labelled HMMs
Given a set of known labelled sequences (e.g. amino acid sequences and their native secondary structure) we want to find the parameters of the model, without knowing the generating paths:
θ_ML = argmax_θ [ P(s, y | M, θ) ]
The algorithm is the same as in the non-labelled case if we use the Forward and Backward matrices defined in the last slide.
Supervised learning of the mapping
HMMs for Mapping problems
•Mapping problems in protein prediction
•Labelled HMMs
•Duration modelling
Self-loops and geometric decay
[Diagram: Begin → state with self-loop probability p → End, leaving with probability 1-p]

P(l) = p^(l-1) · (1 - p)
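A quick numerical check of this geometric distribution (a sketch; the mean duration is 1/(1-p)):

def duration_pmf(p, lmax=20):
    # P(l) = p**(l-1) * (1-p): probability of staying exactly l steps
    return [(l, p ** (l - 1) * (1 - p)) for l in range(1, lmax + 1)]

for l, prob in duration_pmf(0.9, 5):
    print(l, round(prob, 4))
# expected length: sum_l l * P(l) = 1 / (1 - p) = 10 for p = 0.9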
[Plot: P(l) versus segment length l for p = 0.9, 0.5 and 0.1]
The length distribution of the generated segments is always exponential-like.
[Plot: an arbitrary target length distribution P(1) … P(8)]
How can we model other length distributions?
Limited case
[Diagram: Begin → 1 → 2 → 3 → 4 → … → N → End; each state i also has a direct exit with probability p_i]

This topology can model any length distribution between 1 and N:

P(1) = p_1
P(2) = (1 - p_1) p_2
P(3) = (1 - p_1)(1 - p_2) p_3
…
P(N) = Π_{i=1}^{N-1} (1 - p_i) · p_N

and, inverting,

p_1 = P(1)
p_2 = P(2) / (1 - p_1)
…
p_k = P(k) / Π_{i=1}^{k-1} (1 - p_i)
How can we model other length distributions? Non-limited case:
[Diagram: Begin → 1 → 2 → 3 → 4 → … → N → End, with a self-loop on the last state]
This topology can model any length distribution between 1 and N-1, and a geometric decay from N to infinity.
Secondary structure: length statistics
[Plot: frequency versus segment length (residues) for helix, strand and coil]
Secondary structure: model
[Diagram: chains of labelled states implementing duration modelling for each conformation, plus coil states c]
Do we use the same emission probabilities for states sharing the same label?
HMMs for Mapping problems
•Mapping problems in protein prediction
•Labelled HMMs
•Duration modelling
•Models for membrane proteins
[Figure: Porin (β-barrel, outer membrane) and Bacteriorhodopsin (α-helices, inner membrane), as above]
Topography: position of the transmembrane segments along the sequence. Topology of membrane proteins.
[Figure: as above, with the example sequence ALALMLCMLTYRHKELKLKLKK annotated]
A generic model for membrane proteins (TMHMM)
[Diagram: Begin and End connected to three sub-models: transmembrane, inner side, outer side]
Model of β-barrel membrane proteins
[Diagram: Begin and End connected to transmembrane, inner-side and outer-side sub-models]
Labels: transmembrane states and loop states.
Length of transmembrane β-strands: minimum 6 residues, maximum unbounded.
Six different sets of emission parameters: outer loop, inner loop, long globular domains, TM strand edges, TM strand core.
Model of α-helix membrane proteins (HMM1)
[Diagram: transmembrane helices modelled by chains of states (×10, ×12, ×13), connected to inner-side and outer-side sub-models]
Model of α-helix membrane proteins (HMM2)
[Diagram: an alternative topology with ×10 state chains for the transmembrane helices]
TMS probability
[Plot: per-residue TMS probability (0.0 to 1.0) along the sequence of 1A0S, residues 1 to 451]
Dynamic programming filtering procedure
[Plot: TMS probability along the 1A0S sequence, with the predicted TM segments marked]
Maximum-scoring subsequences with constrained segment length and number (the dynamic programming filtering procedure).
[Plot: TMS probability along the 1A0S sequence, with observed and predicted TM segments marked]
www.cbs.dtu.dk/services/TMHMM
Predictors of alpha-transmembrane topology
Hybrid systems: Basics
•Sequence profile based HMMs
1  Y K D Y H S - D K K K G E L - -
2  Y R D Y Q T - D Q K K G D L - -
3  Y R D Y Q S - D H K K G E L - -
4  Y R D Y V S - D H K K G E L - -
5  Y R D Y Q F - D Q K K G S L - -
6  Y K D Y N T - H Q K K N E S - -
7  Y R D Y Q T - D H K K A D L - -
8  G Y G F G - - L I K N T E T T K
9  T K G Y G F G L I K N T E T T K
10 T K G Y G F G L I K N T E T T K

A 0 0 0 0 0 0 0 0 0 0 0 10 0 0 0 0
C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
D 0 0 70 0 0 0 0 60 0 0 0 0 20 0 0 0
E 0 0 0 0 0 0 0 0 0 0 0 0 70 0 0 0
F 0 0 0 10 0 33 0 0 0 0 0 0 0 0 0 0
G 10 0 30 0 30 0 100 0 0 0 0 50 0 0 0 0
H 0 0 0 0 10 0 0 10 30 0 0 0 0 0 0 0
K 0 40 0 0 0 0 0 0 10 100 70 0 0 0 0 100
I 0 0 0 0 0 0 0 0 30 0 0 0 0 0 0 0
L 0 0 0 0 0 0 0 30 0 0 0 0 0 0 0 0
M 0 0 0 0 0 0 0 0 0 0 0 0 0 60 0 0
N 0 0 0 0 10 0 0 0 0 0 30 10 0 0 0 0
P 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Q 0 0 0 0 40 0 0 0 30 0 0 0 0 0 0 0
R 0 50 0 0 0 0 0 0 0 0 0 0 0 0 0 0
S 0 0 0 0 0 33 0 0 0 0 0 0 10 10 0 0
T 20 0 0 0 0 33 0 0 0 0 0 30 0 30 100 0
V 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0 0
W 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Y 70 0 0 90 0 0 0 0 0 0 0 0 0 0 0 0

(rows: residues; columns: sequence positions; a sequence profile computed from the MSA)
Sequence profiles
Sequence-profile-based HMM
[Figure: at each position t, the input is a column of the sequence profile rather than a single character]

Instead of a sequence of characters s_t (.. A C L P R P E T ...), the input becomes a sequence of M-dimensional vectors v_t, with the constraints

0 ≤ v_t(n) ≤ S_t ∀n    and    Σ_{k=1}^M v_t(k) = S_t

For proteins, M = 20.
Sequence-profile-based HMM
For a sequence of characters, the probability of emission from state k is P(s_t | k) = e_k(s_t).
For a sequence of M-dimensional vectors:

P(v_t | k) = (1/Z) · Π_{n=1}^M e_k(n)^{v_t(n)}

with the constraint ∫ P(v_t | k) d^M v_t = 1. If Σ_{n=1}^M e_k(n) = 1, the normalisation factor Z is independent of the state. Algorithms for training and probability computation can be derived.
Hybrid systems: Basics
•Sequence profile based HMMs
•Membrane protein topology
Scoring the prediction
1) Accuracy: Q2 = P/N, where P is the total number of correctly predicted residues and N is the total number of residues.
2) Correlation coefficient C:
C(s) = [p(s)·n(s) - u(s)·o(s)] / {[p(s)+u(s)][p(s)+o(s)][n(s)+u(s)][n(s)+o(s)]}^(1/2)
where, for each class s, p(s) and n(s) are respectively the total number of correct predictions and correctly rejected assignments, while u(s) and o(s) are the numbers of under- and over-predictions.
3) Accuracy for each discriminated structure s: Q(s) = p(s) / [p(s) + u(s)], where p(s) and u(s) are as in equation 2.
4) Probability of correct predictions: P(s) = p(s) / [p(s) + o(s)], where p(s) and o(s) are as in equation 2.
5) Segment-based measure (Sov).
Topology of β-barrel membrane proteins: performance of the sequence-profile-based HMM

Method                                 Q2   QTMS  Qloop  PTMS  Ploop  Corr  Sov
HMM based on multiple seq. alignment   83%  83%   82%    79%   85%    0.65  0.83
NN based on multiple seq. alignment    78%  74%   82%    81%   76%    0.56  0.79
Standard HMM based on single sequence  76%  77%   76%    72%   80%    0.53  0.64

Martelli PL, Fariselli P, Krogh A, Casadio R. A sequence-profile-based HMM for predicting and discriminating beta barrel membrane proteins. Bioinformatics 18: S46-S53 (2002)
Topology of β-barrel membrane proteins: discriminative power of the profile-based HMM
[Plot: percentage of chains versus I(s | M) = -(1/L) · log P(s | M), comparing beta-barrel membrane proteins (145), globular proteins (1239) and all-helical membrane proteins (188)]
The Bologna predictor for the topology of all membrane proteins
[Flowchart: sequence → sequence profiles → NN, HMM1, HMM2 → MaxSubSeq → topography → von Heijne rule → topology → prediction]
Method      Q2 %  Corr   QTM %  QLoop %  PTM %  PLoop %  Qtopography   Qtopology    SOV
NN°         85.8  0.714  84.1   87.3     84.7   86.8     49/59 (83%)   38/59 (64%)  0.908
HMM1°       84.4  0.692  88.6   80.9     79.5   89.4     48/59 (81%)   38/59 (64%)  0.896
HMM2°       82.4  0.658  88.0   78.1     77.1   88.1     48/59 (81%)   39/59 (66%)  0.872
Jury°       85.3  0.708  86.9   84.1     82.1   88.4     53/59 (90%)   42/59 (71%)  0.926
TMHMM 2.0   82.3  0.661  70.9   92.9     89.3   79.2     42/59 (71%)   32/59 (54%)  0.840
MEMSAT      83.6  0.672  70.6   93.9     90.5   75.7     35/49 (71%)   24/49 (49%)  0.823
PHD         80.1  0.614  63.6   94.0     89.9   75.5     43/59 (73%)   30/59 (51%)  0.847
HMMTOP      81.2  0.627  68.5   91.8     87.5   77.7     45/59 (76%)   35/59 (59%)  0.862
Nir Ben-Tal                                              46/59 (78%)
KD                                                       43/59 (73%)
° Test

Topology of all-α membrane proteins: performance
Martelli PL, Fariselli P, Casadio R. An ENSEMBLE machine learning approach for the prediction of all-alpha membrane proteins. Bioinformatics (in press, 2003)
HMM: Application in gene finding
•Basics
Eukaryotic gene structure
A. Krogh
Simple model for coding regions
A. Krogh
Simple model for unspliced gene
A. Krogh
Simple model for spliced gene
A. Krogh