Prologue: Pitfalls of Standard Alignments
-
PROLOGUE:
Pitfalls of standard alignments
-
Scoring a pairwise alignment (Blosum62)
A: ALAEVLIRLITKLYP
B: ASAKHLNRLITELYP
-
Alignment of a family (globins)
Different positions are not equivalent
-
Sequence logos
http://weblogo.berkeley.edu/cache/file5h2DWc.png
The substitution score IN A FAMILY should depend on the position (and the same holds for gaps).
For modelling families we need more flexible tools.
-
Probabilistic Models for Biological Sequences
What are they?
-
Generative definition:
Objects producing different outcomes (sequences) with different probabilities
The probability distribution over the sequence space determines the model specificity
Probabilistic models for sequences
[Diagram: model M mapping the sequence space to a probability distribution]
M generates s_i with probability P(s_i | M)
e.g.: M is the representation of the family of globins
-
Associative definition:
Objects that, given an outcome (sequence), compute a probability value
Probabilistic models for sequences
[Diagram: model M mapping the sequence space to a probability distribution]
M associates the probability P(s_i | M) to s_i
e.g.: M is the representation of the family of globins
We don't need a generator of new biological sequences:
the generative definition is useful as an operative definition.
-
Probabilistic models for sequences
Most useful probabilistic models are trainable systems:
the probability density function over the sequence space is estimated from known examples by means of a learning algorithm.
[Diagram: known examples → pdf estimate (generalization) over the sequence space]
e.g.: writing a generic representation of the sequences of globins starting from a set of known globins.
-
Probabilistic Models for Biological Sequences
What are they?
Why use them?
-
Modelling a protein family
Given a protein class (e.g. globins), a probabilistic model trained on this family can compute a probability value for each new sequence.
This value measures the similarity between the new sequence and the family described by the model.
-
Probabilistic Models for Biological Sequences
What are they?
Why use them?
Which probabilities do they compute?
-
P( s | M ) or P( M | s ) ?
A model M associates to a sequence s the probability P( s | M ).
This probability answers the question:
What is the probability that a model describing the globins generates the sequence s ?
The question we want to answer, however, is:
Given a sequence s, is it a globin?
We need to compute P( M | s ) !
-
Bayes' Theorem
Joint probability: P(X,Y) = P(X | Y) P(Y) = P(Y | X) P(X)
So: P(Y | X) = P(X | Y) P(Y) / P(X)
P(M | s) = P(s | M) P(M) / P(s)   [P(M), P(s): a priori probabilities]
P(M | s): evidence s → conclusion M
P(s | M): evidence M → conclusion s
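The theorem can be sketched numerically (a minimal Python illustration; all three probability values below are hypothetical placeholders, not estimates for real globins):

```python
# Bayes' theorem: P(M | s) = P(s | M) * P(M) / P(s).
# The three input values are hypothetical, chosen only to exercise the formula.
def posterior(likelihood, prior, evidence):
    """Posterior probability of the model given the sequence."""
    return likelihood * prior / evidence

p_s_given_m = 1e-60   # P(s | M): likelihood of the sequence under the model
p_m = 0.01            # P(M): a priori abundance of the class
p_s = 1e-58           # P(s): probability of the sequence (hard to estimate!)

p_m_given_s = posterior(p_s_given_m, p_m, p_s)
```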
-
The a priori probabilities
P(M | s) = P(s | M) P(M) / P(s)
P(M) is the probability of the model (i.e. of the class described by the model) BEFORE we know the sequence:
it can be estimated as the abundance of the class.
P(s) is the probability of the sequence in the sequence space:
it cannot be reliably estimated!
-
Comparison between models
We can overcome the problem by comparing the probability of generating s from different models:
P(M1 | s) / P(M2 | s) = [ P(s | M1) P(M1) ] / [ P(s | M2) P(M2) ]
P(s) cancels out; P(M1)/P(M2) is the ratio between the abundances of the classes.
-
Null model
Otherwise we can score a sequence for a model M by comparing it to a Null Model N: a model that generates ALL the possible sequences with probabilities depending ONLY on the statistical amino acid abundance.
S(M, s) = log [ P(s | M) / P(s | N) ]
In this case we need a threshold and a statistic for evaluating the significance (E-value, P-value) of the score S(M, s), separating sequences belonging to model M from sequences NOT belonging to model M.
-
The simplest probabilistic models: Markov Models
Definition
-
Markov Models. Example: Weather
States: C: Clouds, R: Rain, F: Fog, S: Sun
Register the weather conditions day by day:
as a first hypothesis, the weather condition in a day depends ONLY on the weather conditions in the day before.
Define the conditional probabilities:
P(C|C), P(C|R), … P(R|C), …
The probability for the 5-day registration CRRCS is
P(CRRCS) = P(C) P(R|C) P(R|R) P(C|R) P(S|C)
-
Markov Model
A stochastic generator of sequences in which the probability of the state in position i depends ONLY on the state in position i-1.
Given an alphabet C = {c1, c2, c3, …, cN},
a Markov model is described with N(N+2) parameters {a_rt, a_BEGIN,t, a_r,END ; r, t ∈ C}:
a_rq = P( s_i = q | s_i-1 = r )
a_BEGIN,q = P( s_1 = q )
a_r,END = P( s_T = END | s_T-1 = r )
Constraints: Σ_t a_rt + a_r,END = 1 for each r;  Σ_t a_BEGIN,t = 1
[Diagram: states c1 … cN with BEGIN and END]
-
Markov Models
Given the sequence s = s1 s2 s3 … sT, with s_i ∈ C = {c1, c2, c3, …, cN}:
P( s | M ) = P( s1 ) Π_{i=2..T} P( s_i | s_i-1 )
Example: P(ALKALI) = a_BEGIN,A a_AL a_LK a_KA a_AL a_LI a_I,END
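The product above can be sketched in code (Python, with a toy two-letter alphabet; the transition values are invented for illustration, chosen so that each state satisfies Σ_t a_rt + a_r,END = 1):

```python
# P(s | M) = a_BEGIN,s1 * prod_{i=2..T} a_{s(i-1) s(i)} * a_{sT,END}
# for a first-order Markov model. Alphabet and numbers are hypothetical.
def markov_prob(seq, begin, trans, end):
    p = begin[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= trans[(prev, cur)]
    return p * end[seq[-1]]

begin = {'A': 0.6, 'L': 0.4}
trans = {('A', 'A'): 0.2, ('A', 'L'): 0.7, ('L', 'A'): 0.5, ('L', 'L'): 0.4}
end   = {'A': 0.1, 'L': 0.1}

p = markov_prob("ALLA", begin, trans, end)   # 0.6 * 0.7 * 0.4 * 0.5 * 0.1
```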
-
Markov Models: Exercise
1) Fill in the undefined values of the transition probabilities.
-
Markov Models: Exercise
2) Which model better describes the weather in summer? Which one describes the weather in winter?
-
Markov Models: Exercise
3) Given the sequence CSSSCFS, which model (Winter or Summer) gives the higher probability?
[Consider the starting probabilities: P(X|BEGIN) = 0.25]
-
Markov Models: Exercise
P(CSSSCFS | Winter) = 0.25 × 0.1 × 0.2 × 0.2 × 0.3 × 0.2 × 0.2 = 1.2 × 10^-5
P(CSSSCFS | Summer) = 0.25 × 0.4 × 0.8 × 0.8 × 0.1 × 0.1 × 1.0 = 6.4 × 10^-4
4) Can we conclude that the observation sequence refers to a summer week?
-
Markov Models: Exercise
P(Seq | Winter) = 1.2 × 10^-5
P(Seq | Summer) = 6.4 × 10^-4
To answer we should compare the posterior probabilities:
P(Summer | Seq) / P(Winter | Seq) = [ P(Seq | Summer) P(Summer) ] / [ P(Seq | Winter) P(Winter) ]
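The two likelihoods of the exercise can be checked directly (Python sketch; only the transitions actually used by CSSSCFS are listed, with the values shown in the computation above, and P(X|BEGIN) = 0.25):

```python
# Likelihood of the observation sequence CSSSCFS under the two weather
# models, multiplying the start probability and the used transitions.
def seq_prob(seq, trans, begin=0.25):
    p = begin
    for prev, cur in zip(seq, seq[1:]):
        p *= trans[(prev, cur)]
    return p

winter = {('C', 'S'): 0.1, ('S', 'S'): 0.2, ('S', 'C'): 0.3,
          ('C', 'F'): 0.2, ('F', 'S'): 0.2}
summer = {('C', 'S'): 0.4, ('S', 'S'): 0.8, ('S', 'C'): 0.1,
          ('C', 'F'): 0.1, ('F', 'S'): 1.0}

p_winter = seq_prob("CSSSCFS", winter)   # ≈ 1.2e-05
p_summer = seq_prob("CSSSCFS", summer)   # ≈ 6.4e-04
```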
-
Simple Markov Model for DNA sequences
DNA: C = {Adenine, Cytosine, Guanine, Thymine}
16 transition probabilities (12 of which independent) + 4 Begin probabilities + 4 End probabilities.
The parameters of the model are different in different zones of DNA:
they describe the overall composition and the dinucleotide recurrences.
-
Example of Markov Models: GpC Islands
In the Markov Model of GpC Islands, a_GC is higher than in the Markov Model of Non-GpC Islands.
Given a sequence s, e.g.
GATGCGTCGC
CTACGCAGCG
we can evaluate P(s | GpC Islands) versus P(s | Non-GpC Islands).
-
The simplest probabilistic models: Markov Models
Definition
Training
-
Probabilistic training of a parametric method
Generally speaking, a parametric model M aims to reproduce a set of known data.
[Diagram: model M with parameters θ → modelled data, compared with the real data D]
How to compare them?
-
Training of Markov Models
Let θ_M be the set of parameters of model M.
During the training phase, the parameters θ_M are estimated from the set of known data D.
Maximum Likelihood Estimation (ML):
θ_ML = argmax_θ P( D | M, θ )
Maximum A Posteriori Estimation (MAP):
θ_MAP = argmax_θ P( θ | M, D ) = argmax_θ [ P( D | M, θ ) P( θ ) ]
It can be proved that the ML estimate of each parameter is its frequency of occurrence as counted in the data set D.
-
Example (coin tossing)
Given N tosses of a coin (our data D), the outcomes are h heads and t tails (N = t + h).
ASSUME the model: P(D|M) = p^h (1 - p)^t
Computing the maximum of the likelihood P(D|M), we obtain the estimate p = h / (h + t) = h / N.
-
Example (error measure)
Suppose your data are affected by a Gaussian error, so that they are distributed according to
F(x_i) = A exp[ -(x_i - m)^2 / 2s^2 ], with A = 1 / (s √(2π)).
If your measures are independent, the data likelihood is P(Data | model) = Π_i F(x_i).
Find the m and s that maximize P(Data | model).
-
Maximum Likelihood training: proof
Given a sequence s contained in D, s = s1 s2 s3 … sT, we can count the number of transitions n_jk between any two states j and k, where states 0 and N+1 are BEGIN and END.
Normalisation constraints are taken into account using the Lagrange multipliers λ_k.
-
Hidden Markov Models
Preliminary examples
-
Loaded dice
We have 99 regular dice (R) and 1 loaded die (L):

       P(1)   P(2)   P(3)   P(4)   P(5)   P(6)
  R    1/6    1/6    1/6    1/6    1/6    1/6
  L    1/10   1/10   1/10   1/10   1/10   1/2

Given a sequence:
4156266656321636543662152611536264162364261664616263
we don't know the sequence of dice that generated it:
RRRRRLRLRRRRRRRLRRRRRRRRRRRRLRLRRRRRRRRLRRRRLRRRRLRR
-
Loaded dice. Hypothesis:
we choose a (possibly different) die for each roll.
Two stochastic processes give origin to the sequence of observations:
1) choosing the die (R or L); 2) rolling the die.
The sequence of dice is hidden.
The first process is assumed to be Markovian (in this case a 0-order MM);
the outcome of the second process depends only on the state reached in the first process (that is, the chosen die).
-
Model
Each state (R and L) generates a character of the alphabet C = {1, 2, 3, 4, 5, 6}.
The emission probabilities depend only on the state.
The transition probabilities describe a Markov model that generates a state path: the hidden sequence (π).
The observation sequence (s) is generated by two concomitant stochastic processes.
[Diagram: casino model — states R and L, self-transitions 0.99, cross-transitions 0.01]
-
The observation sequence (s) is generated by two concomitant stochastic processes.
[Diagram: states R and L, self-transitions 0.99, cross-transitions 0.01]
Choose the state: R (probability 0.99);
choose the symbol: 1 (probability 1/6, given R).
s: 4156266656321636543662152611…
π: RRRRRLRLRRRRRRRLRRRRRRRRRRRR…
-
The observation sequence (s) is generated by two concomitant stochastic processes.
[Diagram: states R and L, self-transitions 0.99, cross-transitions 0.01]
Choose the state: L (probability 0.01);
choose the symbol: 5 (probability 1/10, given L).
s: 41562666563216365436621526115…
π: RRRRRLRLRRRRRRRLRRRRRRRRRRRRL…
-
Model
Each state (R and L) generates a character of the alphabet C = {1, 2, 3, 4, 5, 6}.
The emission probabilities depend only on the state.
The transition probabilities describe a Markov model that generates a state path: the hidden sequence (π).
The observation sequence (s) is generated by two concomitant stochastic processes.
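The two concomitant processes can be simulated directly (a Python sketch of the dice model from the slides: first the next die is drawn from the Markov chain, then a symbol is rolled with that die):

```python
import random

# The dice HMM: a Markov chain over the hidden states {R, L} with
# self-transitions 0.99, plus state-dependent emission of a symbol 1..6.
TRANS = {'R': {'R': 0.99, 'L': 0.01}, 'L': {'R': 0.01, 'L': 0.99}}
EMIT  = {'R': [1 / 6] * 6, 'L': [1 / 10] * 5 + [1 / 2]}   # P(1)..P(6) per state

def generate(length, start='R'):
    state, seq, path = start, [], []
    for _ in range(length):
        path.append(state)                                # hidden sequence
        seq.append(random.choices('123456', weights=EMIT[state])[0])
        state = random.choices(list(TRANS[state]),
                               weights=list(TRANS[state].values()))[0]
    return ''.join(seq), ''.join(path)

seq, path = generate(50)
```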
-
Some not-so-serious examples
1) DEMOGRAPHY
Observable: number of births and deaths in a year in a village. Hidden variable: economic conditions (as a first approximation we can consider the success in business as a random variable and, by consequence, the wealth as a Markov variable).
→ Can we deduce the economic conditions of a village during a century by means of the register of births and deaths?
2) THE METEOROPATHIC TEACHER
Observable: average of the marks that a meteoropathic teacher gives to the students during a day. Hidden variable: weather conditions.
→ Can we deduce the weather conditions during a year by means of the class register?
-
To be more serious
1) SECONDARY STRUCTURE. Observable: protein sequence. Hidden variable: secondary structure.
→ Can we deduce (predict) the secondary structure of a protein given its amino acid sequence?
2) ALIGNMENT. Observable: protein sequence. Hidden variable: position of each residue along the alignment of a protein family.
→ Can we align a protein to a family, starting from its amino acid sequence?
-
Hidden Markov Models
Preliminary examples
Formal definition
-
Formal definition of Hidden Markov Models
A HMM is a stochastic generator of sequences characterised by:
- N states
- a set of transition probabilities between two states {a_kj}: a_kj = P( π(i) = j | π(i-1) = k )
- a set of starting probabilities {a_0k}: a_0k = P( π(1) = k )
- a set of ending probabilities {a_k0}: a_k0 = P( π(i) = END | π(i-1) = k )
- an alphabet C with M characters
- a set of emission probabilities for each state {e_k(c)}: e_k(c) = P( s_i = c | π(i) = k )
Constraints:
Σ_k a_0k = 1;  a_k0 + Σ_j a_kj = 1 for each k;  Σ_{c∈C} e_k(c) = 1 for each k
s: sequence; π: path through the states
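The definition maps onto a small container class (a Python sketch; the emission values are the dice model of the preceding slides, while the begin/end probabilities are invented so that the constraints hold):

```python
from math import isclose

# An HMM per the formal definition, with the three normalisation
# constraints checked at construction time.
class HMM:
    def __init__(self, start, trans, end, emit):
        self.start, self.trans, self.end, self.emit = start, trans, end, emit
        assert isclose(sum(start.values()), 1.0)                 # sum_k a_0k = 1
        for k in trans:                                          # a_k0 + sum_j a_kj = 1
            assert isclose(end[k] + sum(trans[k].values()), 1.0)
        for k in emit:                                           # sum_c e_k(c) = 1
            assert isclose(sum(emit[k].values()), 1.0)

dice = HMM(
    start={'R': 0.99, 'L': 0.01},
    trans={'R': {'R': 0.98, 'L': 0.01}, 'L': {'R': 0.01, 'L': 0.98}},
    end={'R': 0.01, 'L': 0.01},
    emit={'R': {str(i): 1 / 6 for i in range(1, 7)},
          'L': {**{str(i): 1 / 10 for i in range(1, 6)}, '6': 1 / 2}},
)
```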
-
Generating a sequence with a HMM
-
GpC Islands, simple model
s: AGCGCGTAATCTG
π: YYYYYYYNNNNNN
P( s, π | M ) can be easily computed.
-
P( s, π | M ) can be easily computed:
s:          A    G    C    G    C    G    T    A    A    T    C    T    G
π:          Y    Y    Y    Y    Y    Y    Y    N    N    N    N    N    N
Emission:   0.1  0.4  0.4  0.4  0.4  0.4  0.1  0.25 0.25 0.25 0.25 0.25 0.25
Transition: 0.2  0.7  0.7  0.7  0.7  0.7  0.7  0.2  0.8  0.8  0.8  0.8  0.8  0.1
Multiplying all the probabilities gives the probability of having the sequence AND the path through the states.
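The product can be written out in code (Python sketch of the toy GpC-Island model; the parameter values are the ones listed on the slides, with a_NY = 0.1 taken from the later forward-algorithm example):

```python
# P(s, pi | M): one begin transition, then an emission and a transition
# per position, then one end transition.
START = {'Y': 0.2, 'N': 0.8}
TRANS = {'Y': {'Y': 0.7, 'N': 0.2}, 'N': {'Y': 0.1, 'N': 0.8}}
END   = {'Y': 0.1, 'N': 0.1}
EMIT  = {'Y': {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1},
         'N': {c: 0.25 for c in 'ACGT'}}

def joint_prob(s, pi):
    p = START[pi[0]] * EMIT[pi[0]][s[0]]
    for i in range(1, len(s)):
        p *= TRANS[pi[i - 1]][pi[i]] * EMIT[pi[i]][s[i]]
    return p * END[pi[-1]]

p = joint_prob("AGCGCGTAATCTG", "YYYYYYYNNNNNN")
```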
-
Evaluation of the joint probability of the sequence and the path
-
Hidden Markov Models
Preliminary examples
Formal definition
Three questions
-
s :AGCGCGTAATCTGp:?????????????
P( s, p | M ) can be easily computed How to evaluate P ( s | M )?
GpC Island, simple model
-
How to evaluate P( s | M )?
s:  A G C G C G T A A T C T G
π1: Y Y Y Y Y Y Y Y Y Y Y Y Y
π2: Y Y Y Y Y Y Y Y Y Y Y Y N
π3: Y Y Y Y Y Y Y Y Y Y Y N Y
π4: Y Y Y Y Y Y Y Y Y Y Y N N
π5: Y Y Y Y Y Y Y Y Y Y N Y Y
…
There are 2^13 different paths. Summing over all the paths gives the probability of the sequence:
P( s | M ) = Σ_π P( s, π | M )
-
GpC Islands, simple model (resume)
s: AGCGCGTAATCTG
π: ?????????????
P( s, π | M ) can be easily computed. How to evaluate P( s | M )? Can we show the hidden path?
-
Can we show the hidden path?
s:  A G C G C G T A A T C T G
π1: Y Y Y Y Y Y Y Y Y Y Y Y Y
π2: Y Y Y Y Y Y Y Y Y Y Y Y N
… (2^13 different paths)
Viterbi path: the path that gives the best joint probability:
π* = argmax_π [ P( π | s, M ) ] = argmax_π [ P( π, s | M ) ]
-
Can we show the hidden path? A posteriori decoding
For each position i choose the state
π̂(i) = argmax_k [ P( π(i) = k | s, M ) ]
The contribution to this probability derives from all the paths that go through state k at position i.
The a posteriori path can be a nonsense path (it may not be a legitimate path, if some transitions are not permitted in the model).
-
GpC Islands, simple model
s: AGCGCGTAATCTG
π: YYYYYYYNNNNNN
P( s, π | M ) can be easily computed. How to evaluate P( s | M )? Can we show the hidden path? Can we evaluate the parameters starting from known examples?
-
Can we evaluate the parameters starting from known examples?
s: A G C G C G T A A T C T G
π: Y Y Y Y Y Y Y N N N N N N
Emission:   eY(A) eY(G) eY(C) eY(G) eY(C) eY(G) eY(T) eN(A) eN(A) eN(T) eN(C) eN(T) eN(G)
Transition: a0Y aYY aYY aYY aYY aYY aYY aYN aNN aNN aNN aNN aNN aN0
How to find the parameters e and a that maximise this probability?
And what if we don't know the path?
-
Hidden Markov Models: Algorithms (resume)
Evaluating P(s | M): Forward Algorithm
-
Computing P( s, π | M ) for each path is a redundant operation:
s: A G C G C G T A A T C T G
π: Y Y Y Y Y Y Y Y Y Y Y Y Y
Emission:   0.1 0.4 0.4 0.4 0.4 0.4 0.1 0.1 0.1 0.1 0.4 0.1 0.4
Transition: 0.2 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.1
s: A G C G C G T A A T C T G
π: Y Y Y Y Y Y Y Y Y Y Y Y N
Emission:   0.1 0.4 0.4 0.4 0.4 0.4 0.1 0.1 0.1 0.1 0.4 0.1 0.25
Transition: 0.2 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.2 0.1
-
Computing P( s, π | M ) for each path is a redundant operation.
If we compute the common part only once, we gain 2(T-1) operations.
-
Summing over all the possible paths:
s: A G   π: Y Y   Emission: 0.1  0.4    Transition: 0.2 0.7   → 0.0056
s: A G   π: Y N   Emission: 0.1  0.25   Transition: 0.2 0.2   → 0.001
s: A G   π: N Y   Emission: 0.25 0.4    Transition: 0.8 0.1   → 0.008
s: A G   π: N N   Emission: 0.25 0.25   Transition: 0.8 0.8   → 0.04
-
Summing over all the possible paths:
F_Y(2) = 0.0056 + 0.008 = 0.0136;  F_N(2) = 0.001 + 0.04 = 0.041
s: A G C   π: X Y Y   0.0136 × 0.7 × 0.4
s: A G C   π: X Y N   0.0136 × 0.2 × 0.25
s: A G C   π: X N Y   0.041 × 0.1 × 0.4
s: A G C   π: X N N   0.041 × 0.8 × 0.25
-
Summing over all the possible paths. Iterating until the last position of the sequence:
s: A G C G C G T A A T C T G
π: X X X X X X X X X X X X Y   → F_Y(T) × 0.1 (a_Y0)
π: X X X X X X X X X X X X N   → F_N(T) × 0.1 (a_N0)
Summing the two terms gives P(s | M).
-
Summing over all the possible paths.
If we know the probabilities of emitting the first two characters of the sequence, ending the path in states R and L respectively:
F_R(2) ≡ P(s1, s2, π(2) = R | M)  and  F_L(2) ≡ P(s1, s2, π(2) = L | M),
then we can compute:
P(s1, s2, s3, π(3) = R | M) = F_R(2) a_RR e_R(s3) + F_L(2) a_LR e_R(s3)
-
Forward Algorithm
On the basis of the preceding observations, the computation of P(s | M) can be decomposed into simpler problems.
For each state k and each position i in the sequence, we compute:
F_k(i) = P( s1 s2 s3 … s_i, π(i) = k | M )
Initialisation:
F_BEGIN(0) = 1;  F_i(0) = 0 for i ≠ BEGIN
Recurrence:
F_l(i+1) = P( s1 s2 … s_i s_i+1, π(i+1) = l ) = Σ_k P( s1 s2 … s_i, π(i) = k ) a_kl e_l(s_i+1) = e_l(s_i+1) Σ_k F_k(i) a_kl
Termination:
P( s ) = P( s1 s2 s3 … sT, π(T+1) = END ) = Σ_k P( s1 s2 … sT, π(T) = k ) a_k0 = Σ_k F_k(T) a_k0
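The three steps translate directly into code (Python sketch on the toy GpC-Island model of the previous slides; the naive sum over all 2^T paths is kept only as a cross-check):

```python
from itertools import product

START = {'Y': 0.2, 'N': 0.8}
TRANS = {'Y': {'Y': 0.7, 'N': 0.2}, 'N': {'Y': 0.1, 'N': 0.8}}
END   = {'Y': 0.1, 'N': 0.1}
EMIT  = {'Y': {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1},
         'N': {c: 0.25 for c in 'ACGT'}}

def forward(s):
    F = {k: START[k] * EMIT[k][s[0]] for k in START}       # initialisation
    for c in s[1:]:                                        # recurrence
        F = {l: EMIT[l][c] * sum(F[k] * TRANS[k][l] for k in F) for l in F}
    return sum(F[k] * END[k] for k in F)                   # termination

def naive(s):
    # P(s | M) = sum over all N^T paths of P(s, pi | M): O(T N^T).
    total = 0.0
    for pi in product('YN', repeat=len(s)):
        p = START[pi[0]] * EMIT[pi[0]][s[0]]
        for i in range(1, len(s)):
            p *= TRANS[pi[i - 1]][pi[i]] * EMIT[pi[i]][s[i]]
        total += p * END[pi[-1]]
    return total
```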
-
Forward Algorithm
[Diagram: trellis of states × iterations from BEGIN (0) to END (T+1); e.g. F_B(2) = e_B(s2) Σ_i F_i(1) a_iB; the last column gives P(s | M)]
-
Forward algorithm: computational complexity
Naïve method: P( s | M ) = Σ_π P( s, π | M ).
There are N^T possible paths, and each path requires about 2T operations:
the time for the computation is O( T N^T ).
-
Forward algorithm: computational complexity
F_Y(3) = 0.0136 × 0.7 × 0.4 + 0.041 × 0.1 × 0.4 = 0.005448
F_N(3) = 0.0136 × 0.2 × 0.25 + 0.041 × 0.8 × 0.25 = 0.00888
Forward algorithm: T positions, N values for each position; each element requires about 2N products and 1 sum.
The time for the computation is O( T N^2 ).
-
Forward algorithm: computational complexity
Number of operations as a function of T (for N = 2):

  T    Naïve (T·N^T)   Forward (T·N²)
  1          2               4
  2          8               8
  3         24              12
  4         64              16
  5        160              20
  6        384              24
  7        896              28
  8       2048              32
  9       4608              36
 10      10240              40
-
Hidden Markov Models: Algorithms (resume)
Evaluating P(s | M): Forward Algorithm
Evaluating P(s | M): Backward Algorithm
-
Backward Algorithm
Similar to the Forward algorithm: it computes P( s | M ), reconstructing the sequence from the end.
For each state k and each position i in the sequence, we compute:
B_k(i) = P( s_i+1 s_i+2 s_i+3 … sT | π(i) = k )
Initialisation:
B_k(T) = P( π(T+1) = END | π(T) = k ) = a_k0
Recurrence:
B_l(i-1) = P( s_i s_i+1 … sT | π(i-1) = l ) = Σ_k P( s_i+1 s_i+2 … sT | π(i) = k ) a_lk e_k(s_i) = Σ_k B_k(i) e_k(s_i) a_lk
Termination:
P( s ) = P( s1 s2 s3 … sT | π(0) = BEGIN ) = Σ_k P( s2 … sT | π(1) = k ) a_0k e_k(s1) = Σ_k B_k(1) a_0k e_k(s1)
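The same toy GpC-Island model can be used to check that the backward termination reproduces the forward result (Python sketch):

```python
# Backward algorithm: B_k(i) is filled from the end of the sequence;
# the termination must give the same P(s | M) as the forward algorithm.
START = {'Y': 0.2, 'N': 0.8}
TRANS = {'Y': {'Y': 0.7, 'N': 0.2}, 'N': {'Y': 0.1, 'N': 0.8}}
END   = {'Y': 0.1, 'N': 0.1}
EMIT  = {'Y': {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1},
         'N': {c: 0.25 for c in 'ACGT'}}

def forward_p(s):
    F = {k: START[k] * EMIT[k][s[0]] for k in START}
    for c in s[1:]:
        F = {l: EMIT[l][c] * sum(F[k] * TRANS[k][l] for k in F) for l in F}
    return sum(F[k] * END[k] for k in F)

def backward_p(s):
    B = {k: END[k] for k in END}                            # B_k(T) = a_k0
    for c in reversed(s[1:]):                               # recurrence
        B = {l: sum(TRANS[l][k] * EMIT[k][c] * B[k] for k in B) for l in B}
    return sum(START[k] * EMIT[k][s[0]] * B[k] for k in B)  # termination
```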
-
Backward Algorithm
[Diagram: trellis of states × iterations; e.g. B_B(T-1) = Σ_k a_Bk e_k(sT) B_k(T); the first column gives P(s | M)]
-
Hidden Markov Models: Algorithms (resume)
Evaluating P(s | M): Forward Algorithm
Evaluating P(s | M): Backward Algorithm
Showing the path: Viterbi decoding
-
Finding the best path:
s: A G   π: Y Y   Emission: 0.1  0.4    Transition: 0.2 0.7   → 0.0056
s: A G   π: Y N   Emission: 0.1  0.25   Transition: 0.2 0.2   → 0.001
s: A G   π: N Y   Emission: 0.25 0.4    Transition: 0.8 0.1   → 0.008
s: A G   π: N N   Emission: 0.25 0.25   Transition: 0.8 0.8   → 0.04
-
Finding the best path:
V_Y(2) = max(0.0056, 0.008) = 0.008;  V_N(2) = max(0.001, 0.04) = 0.04
s: A G C   π: N Y Y   0.008 × 0.7 × 0.4 = 0.00224
s: A G C   π: N Y N   0.008 × 0.2 × 0.25 = 0.0004
s: A G C   π: N N Y   0.04 × 0.1 × 0.4 = 0.0016
s: A G C   π: N N N   0.04 × 0.8 × 0.25 = 0.008
-
Finding the best path. Iterating until the last position of the sequence:
s: A G C G C G T A A T C T G
π: N Y Y Y Y Y Y N N N Y Y Y   → V_Y(T) × 0.1 (a_Y0)
π: N N N N N N N N N N N N N   → V_N(T) × 0.1 (a_N0)
Choose the maximum.
-
Viterbi Algorithm
π* = argmax_π [ P( π, s | M ) ]
The computation of P(s, π* | M) can be decomposed into simpler problems.
Let V_k(i) be the probability of the most probable path generating the subsequence s1 s2 s3 … s_i and ending in state k at iteration i.
Initialisation:
V_BEGIN(0) = 1;  V_i(0) = 0 for i ≠ BEGIN
Recurrence:
V_l(i+1) = e_l(s_i+1) max_k ( V_k(i) a_kl )
ptr_i(l) = argmax_k ( V_k(i) a_kl )
Termination:
P( s, π* ) = max_k ( V_k(T) a_k0 )
π*(T) = argmax_k ( V_k(T) a_k0 )
Traceback:
π*(i-1) = ptr_i( π*(i) )
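The recurrence, termination and traceback can be sketched in code (Python, on the toy GpC-Island model; exhaustive enumeration of all paths is used as a cross-check on the best joint probability):

```python
from itertools import product

# Viterbi algorithm with back-pointers and traceback.
START = {'Y': 0.2, 'N': 0.8}
TRANS = {'Y': {'Y': 0.7, 'N': 0.2}, 'N': {'Y': 0.1, 'N': 0.8}}
END   = {'Y': 0.1, 'N': 0.1}
EMIT  = {'Y': {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1},
         'N': {c: 0.25 for c in 'ACGT'}}

def joint(s, pi):
    p = START[pi[0]] * EMIT[pi[0]][s[0]]
    for i in range(1, len(s)):
        p *= TRANS[pi[i - 1]][pi[i]] * EMIT[pi[i]][s[i]]
    return p * END[pi[-1]]

def viterbi(s):
    V, ptrs = {k: START[k] * EMIT[k][s[0]] for k in START}, []
    for c in s[1:]:                                  # recurrence
        newV, ptr = {}, {}
        for l in V:
            k_best = max(V, key=lambda k: V[k] * TRANS[k][l])
            ptr[l] = k_best
            newV[l] = EMIT[l][c] * V[k_best] * TRANS[k_best][l]
        ptrs.append(ptr)
        V = newV
    last = max(V, key=lambda k: V[k] * END[k])       # termination
    path = [last]
    for ptr in reversed(ptrs):                       # traceback
        path.append(ptr[path[-1]])
    return ''.join(reversed(path)), V[last] * END[last]

S = "AGCGCGTAATCTG"
best_path, best_p = viterbi(S)
```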
-
Viterbi Algorithm
[Diagram: trellis of states × iterations; e.g. V_B(2) = e_B(s2) max_i V_i(1) a_iB, with back-pointer ptr2(B); the last column gives P(s, π* | M)]
-
Viterbi Algorithm
[Diagram: trellis with the Viterbi path highlighted]
Different paths can have the same probability.
-
Hidden Markov Models: Algorithms (resume)
Evaluating P(s | M): Forward Algorithm
Evaluating P(s | M): Backward Algorithm
Showing the path: Viterbi decoding
Showing the path: A posteriori decoding
Training a model: EM algorithm
-
If we know the path generating the training sequence:
s: A G C G C G T A A T C T G
π: Y Y Y Y Y Y Y N N N N N N
Emission:   eY(A) eY(G) eY(C) eY(G) eY(C) eY(G) eY(T) eN(A) eN(A) eN(T) eN(C) eN(T) eN(G)
Transition: a0Y aYY aYY aYY aYY aYY aYY aYN aNN aNN aNN aNN aNN aN0
Just count! Example:
aYY = nYY / (nYY + nYN) = 6/7
eY(A) = nY(A) / [ nY(A) + nY(C) + nY(G) + nY(T) ] = 1/7
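"Just count" is one-liner territory (Python sketch using exact fractions; end transitions such as a_N0 are left out, exactly as in the slide's aYY formula):

```python
from collections import Counter
from fractions import Fraction

# ML training when the path is known: normalised counts of transitions
# and emissions along (s, pi).
s, pi = "AGCGCGTAATCTG", "YYYYYYYNNNNNN"
n_trans = Counter(zip(pi, pi[1:]))      # n_kl
n_emit  = Counter(zip(pi, s))           # n_k(c)

def a(k, l):
    return Fraction(n_trans[k, l],
                    sum(n for (k2, _), n in n_trans.items() if k2 == k))

def e(k, c):
    return Fraction(n_emit[k, c],
                    sum(n for (k2, _), n in n_emit.items() if k2 == k))
```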
-
Expectation-Maximisation algorithm
We need to estimate the Maximum Likelihood parameters when the paths generating the training sequences are unknown:
θ_ML = argmax_θ [ P( s | θ, M ) ]
Given a model with parameters θ0, the EM algorithm finds new parameters θ that increase the likelihood of the model:
P( s | θ ) > P( s | θ0 )
-
Expectation-Maximisation algorithm
Given a path π we can count:
- the number of transitions between states k and l: A_kl(π)
- the number of emissions of character c from state k: E_k(c, π)
(s: AGCGCGTAATCTG; π1: YYYYYYYYYYYYY; π2: YYYYYYYYYYYYN; π3: YYYYYYYYYYYNY; π4: YYYYYYYYYYYNN; …)
We can compute the expected values over all the paths:
A_kl = Σ_π P(π | s, θ0) A_kl(π)
E_k(c) = Σ_π P(π | s, θ0) E_k(c, π)
The updated parameters a_kl and e_k(c) are obtained by normalising A_kl and E_k(c).
-
Expectation-Maximisation algorithm
We need to estimate the Maximum Likelihood parameters when the paths generating the training sequences are unknown:
θ_ML = argmax_θ [ P( s | θ, M ) ]
Given a model with parameters θ0, the EM algorithm finds new parameters θ that increase the likelihood of the model:
P( s | θ ) > P( s | θ0 )
or, equivalently,
log P( s | θ ) > log P( s | θ0 )
-
Expectation-Maximisation algorithm
log P( s | θ ) = log P( s, π | θ ) - log P( π | s, θ )
Multiplying by P( π | s, θ0 ) and summing over all the possible paths:
log P( s | θ ) = Σ_π P(π | s, θ0) log P(s, π | θ) - Σ_π P(π | s, θ0) log P(π | s, θ)
Q(θ | θ0) ≡ Σ_π P(π | s, θ0) log P(s, π | θ): expectation value of log P(s, π | θ) over the current paths.
Then:
log P( s | θ ) - log P( s | θ0 ) = Q(θ | θ0) - Q(θ0 | θ0) + [a relative-entropy term ≥ 0]
so any θ with Q(θ | θ0) - Q(θ0 | θ0) ≥ 0 does not decrease the likelihood.
-
Expectation-Maximisation algorithm
The EM algorithm is an iterative process; each iteration performs two steps:
E-step: evaluation of Q(θ | θ0) = Σ_π P(π | s, θ0) log P(s, π | θ)
M-step: maximisation of Q(θ | θ0) over all θ
It is NOT guaranteed to converge to the GLOBAL maximum likelihood.
-
Baum-Welch implementation of the EM algorithm
E-step: Q(θ | θ0) = Σ_π P(π | s, θ0) log P(s, π | θ)
P(s, π | θ) = a_0,π(1) Π_{i=1..T} a_π(i),π(i+1) e_π(i)(s_i)
            = Π_{k=0..N} Π_{l=1..N} a_kl ^ A_kl(π)  ×  Π_{k=1..N} Π_{c∈C} e_k(c) ^ E_k(c,π)
A_kl = Σ_π P(π | s, θ0) A_kl(π);  E_k(c) = Σ_π P(π | s, θ0) E_k(c, π)
(expected values over all the current paths)
-
Baum-Welch implementation of the EM algorithm
M-step: for any state k and character c, with Σ_c e_k(c) = 1, by means of the Lagrange multipliers technique we can solve the system, obtaining:
a_kl = A_kl / Σ_l' A_kl'
e_k(c) = E_k(c) / Σ_c' E_k(c')
-
Baum-Welch implementation of the EM algorithm
How to compute the expected numbers of transitions and emissions over all the paths, using
F_k(i) = P( s1 s2 s3 … s_i, π(i) = k )
B_k(i) = P( s_i+1 s_i+2 s_i+3 … sT | π(i) = k ):
A_kl = Σ_i P( π(i) = k, π(i+1) = l | s, θ ) = Σ_i F_k(i) a_kl e_l(s_i+1) B_l(i+1) / P(s)
E_k(c) = Σ_i P( s_i = c, π(i) = k | s, θ ) = Σ_{i : s_i = c} F_k(i) B_k(i) / P(s)
-
Baum-Welch implementation of the EM algorithm
Algorithm:
1. Start with random parameters.
2. Compute the Forward and Backward matrices on the known sequences.
3. Compute A_kl and E_k(c), the expected numbers of transitions and emissions.
4. Update: a_kl ← A_kl (normalised), e_k(c) ← E_k(c) (normalised).
5. Has P(s | M) increased? If yes, go back to step 2; if no, end.
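The loop can be sketched end-to-end (Python, on the toy GpC-Island alphabet; the starting parameters below are arbitrary strictly positive values in place of the random initialisation of step 1, so that the run is reproducible; the defining property checked is that P(s | M) never decreases):

```python
# One Baum-Welch iteration: Forward + Backward matrices, expected counts
# A (transitions) and E (emissions), then normalisation.
STATES, ALPHABET = ('Y', 'N'), 'ACGT'

def forward(s, m):
    F = [{k: m['start'][k] * m['emit'][k][s[0]] for k in STATES}]
    for c in s[1:]:
        F.append({l: m['emit'][l][c] * sum(F[-1][k] * m['trans'][k][l]
                                           for k in STATES) for l in STATES})
    return F, sum(F[-1][k] * m['end'][k] for k in STATES)

def backward(s, m):
    B = [None] * len(s)
    B[-1] = {k: m['end'][k] for k in STATES}
    for i in range(len(s) - 2, -1, -1):
        B[i] = {l: sum(m['trans'][l][k] * m['emit'][k][s[i + 1]] * B[i + 1][k]
                       for k in STATES) for l in STATES}
    return B

def em_step(s, m):
    F, P = forward(s, m)
    B = backward(s, m)
    T = len(s)
    A  = {k: {l: sum(F[i][k] * m['trans'][k][l] * m['emit'][l][s[i + 1]] * B[i + 1][l]
                     for i in range(T - 1)) / P for l in STATES} for k in STATES}
    A0 = {k: m['start'][k] * m['emit'][k][s[0]] * B[0][k] / P for k in STATES}
    Ae = {k: F[-1][k] * m['end'][k] / P for k in STATES}     # expected END counts
    E  = {k: {c: sum(F[i][k] * B[i][k] for i in range(T) if s[i] == c) / P
              for c in ALPHABET} for k in STATES}
    new = {'start': {k: A0[k] / sum(A0.values()) for k in STATES},
           'trans': {k: {l: A[k][l] / (sum(A[k].values()) + Ae[k]) for l in STATES}
                     for k in STATES},
           'end':   {k: Ae[k] / (sum(A[k].values()) + Ae[k]) for k in STATES},
           'emit':  {k: {c: E[k][c] / sum(E[k].values()) for c in ALPHABET}
                     for k in STATES}}
    return new, P

m = {'start': {'Y': 0.3, 'N': 0.7},
     'trans': {'Y': {'Y': 0.6, 'N': 0.3}, 'N': {'Y': 0.2, 'N': 0.7}},
     'end':   {'Y': 0.1, 'N': 0.1},
     'emit':  {'Y': {'A': 0.2, 'C': 0.3, 'G': 0.3, 'T': 0.2},
               'N': {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}}}

history = []
for _ in range(8):
    m, P = em_step("AGCGCGTAATCTG", m)
    history.append(P)
```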
-
Profile HMMs
HMMs for alignments
-
How to align?
Each state represents a position in the alignment; each position has a peculiar composition.
ACGGTA → M0 M1 M2 M3 M4 M5
ACGATC → M0 M1 M2 M3 M4 M5
ATGTTC → M0 M1 M2 M3 M4 M5
-
Given a set of sequences (ACGGTA, ACGATC, ATGTTC) we can train a model, estimating the emission probabilities:

       M0    M1    M2    M3    M4    M5
  A    1     0     0     0.33  0     0.33
  C    0     0.66  0     0     0     0.66
  G    0     0     1     0.33  0     0
  T    0     0.33  0     0.33  1     0
-
Given a trained model, we can align a new sequence, computing the emission probability. For ACGATC:
P(s | M) = 1 × 0.66 × 1 × 0.33 × 1 × 0.66
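The column-by-column product can be sketched as (Python; the distributions are the table's values, rounded to two decimals exactly as on the slide):

```python
from math import prod

# Emission probability of a sequence along the match states M0..M5
# of the toy trained profile.
PROFILE = [
    {'A': 1.0},                              # M0
    {'C': 0.66, 'T': 0.33},                  # M1
    {'G': 1.0},                              # M2
    {'A': 0.33, 'G': 0.33, 'T': 0.33},       # M3
    {'T': 1.0},                              # M4
    {'A': 0.33, 'C': 0.66},                  # M5
]

def profile_prob(seq):
    return prod(col.get(c, 0.0) for col, c in zip(PROFILE, seq))

p = profile_prob("ACGATC")   # 1 * 0.66 * 1 * 0.33 * 1 * 0.66
```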
-
And for the sequence AGATC?
AGATC → M0 ? M2 M3 M4 M5 (the position M1 is skipped)
We need a way to introduce gaps.
-
Silent states
Red transitions allow gaps ((N-1)! transitions).
To reduce the number of parameters we can use silent states, i.e. states that don't emit any character (4N-8 transitions).
-
Profile HMMs: match, delete and insert states
A C G G T A           →  M0 M1 M2 M3 M4 M5          (match states)
A C G C A G T C       →  M0 I0 I0 M1 M2 M3 M4 M5    (insert states)
A G A T C             →  M0 D1 M2 M3 M4 M5          (delete states)
-
Example of alignment
Sequence 1: A S T R A L;  Viterbi path: M0 M1 M2 M3 M4 M5
Sequence 2: A S T A I L;  Viterbi path: M0 M1 M2 D3 M4 I4 M5
Sequence 3: A R T I;      Viterbi path: M0 M1 M2 D3 D4 M5
-
Example of alignment
-log P(s | M) is an alignment score.
Grouping by vertical layers (match positions 0-5):

        0   1   2   3   4     5
  s1    A   S   T   R   A     L
  s2    A   S   T   -   A I   L
  s3    A   R   T   -   -     I

Alignment:
ASTRA-L
AST-AIL
ART---I
-
Searching for a structural/functional pattern in protein sequences
Zn binding loop:
C H C I C R I C
C H C L C K I C
C H C I C S L C
D H C L C T I C
C H C I D S I C
C H C L C K I C
Cysteines can be replaced by an Aspartic Acid, but only ONCE in each sequence.
-
Searching for a structural/functional pattern in protein sequences
..ALCPCHCLCRICPLIY.. obtains a higher probability than ..WERWDHCIDSICLKDE..
because the second sequence replaces Cysteine with Aspartic Acid twice: M0 and M4 have a low emission probability for Aspartic Acid, and we would multiply both of them.
-
Profile HMMs
HMMs for alignments
Example on globins
-
Structural alignment of globins
-
Structural alignment of globins
Bashford D, Chothia C & Lesk AM (1987) Determinants of a protein fold: unique features of the globin amino acid sequences. J. Mol. Biol. 196, 199-216
-
Alignment of globins reconstructed with profile HMMs
Krogh A, Brown M, Mian IS, Sjolander K & Haussler D (1994) Hidden Markov Models in computational biology: applications to protein modelling. J. Mol. Biol. 235, 1501-1531
-
Discrimination power of profile HMMs
Krogh A, Brown M, Mian IS, Sjolander K & Haussler D (1994) Hidden Markov Models in computational biology: applications to protein modelling. J. Mol. Biol. 235, 1501-1531
-
Profile HMMs
HMMs for alignments
Example on globins
Other applications
-
Finding a domain
[Diagram: Begin → profile HMM specific for the considered domain → End]
-
Clustering subfamilies
[Diagram: BEGIN → {HMM 1, HMM 2, HMM 3, …, HMM n} → END]
Each sequence s contributes to update HMM i with a weight equal to P( s | Mi ).
-
Profile HMMs
HMMs for alignments
Example on globins
Other applications
Available codes and servers
-
HMMER at WUSTL: http://hmmer.wustl.edu/
Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14:755-763
-
HMMER
hmmbuild: takes the alignment of a protein family, checks for redundancy and sets the emission and transition probabilities of a HMM → trained profile-HMM.
hmmcalibrate: takes a trained HMM, generates a great number of random sequences, scores them and fits the Extreme Value Distribution to the computed scores → HMM calibrated with accurate E-value statistics.
-
HMMER
hmmalign: HMM + set of sequences → alignment of all the sequences to the model.
hmmsearch: HMM + set of sequences → list of sequences that match the HMM (sorted by E-value).
hmmpfam: set of HMMs + sequence → list of HMMs that match the sequence.
-
MSF Format: globins50.msf
!!AA_MULTIPLE_ALIGNMENT 1.0
PileUp of: *.pep
Symbol comparison table: GenRunData:blosum62.cmp CompCheck: 6430
GapWeight: 12
GapLengthWeight: 4
pileup.msf MSF: 308 Type: P August 16, 1999 09:09 Check: 9858 ..
Name: lgb1_pea Len: 308 Check: 2200 Weight: 1.00
Name: lgb1_vicfa Len: 308 Check: 214 Weight: 1.00
Name: myg_escgi Len: 308 Check: 3961 Weight: 1.00
Name: myg_horse Len: 308 Check: 5619 Weight: 1.00
Name: myg_progu Len: 308 Check: 6401 Weight: 1.00
Name: myg_saisc Len: 308 Check: 6606 Weight: 1.00
//
1 50
lgb1_pea ~~~~~~~~~G FTDKQEALVN SSSE.FKQNL PGYSILFYTI VLEKAPAAKG
lgb1_vicfa ~~~~~~~~~G FTEKQEALVN SSSQLFKQNP SNYSVLFYTI ILQKAPTAKA
myg_escgi ~~~~~~~~~V LSDAEWQLVL NIWAKVEADV AGHGQDILIR LFKGHPETLE
myg_horse ~~~~~~~~~G LSDGEWQQVL NVWGKVEADI AGHGQEVLIR LFTGHPETLE
myg_progu ~~~~~~~~~G LSDGEWQLVL NVWGKVEGDL SGHGQEVLIR LFKGHPETLE
myg_saisc ~~~~~~~~~G LSDGEWQLVL NIWGKVEADI PSHGQEVLIS LFKGHPETLE
-
hmmbuild globin.hmm globins50.msf
Alignment of a protein family → hmmbuild → trained profile-HMM.
All the transition and emission parameters are estimated by means of the Expectation-Maximisation algorithm on the aligned sequences.
In principle we could also use NON-aligned sequences to train the model; nevertheless, it is more efficient to build the starting alignment using, for example, CLUSTALW.
-
hmmcalibrate [--num N] --histfile globin.histo globin.hmm
Trained profile-HMM → hmmcalibrate → HMM calibrated with accurate E-value statistics.
A number N (default 5000) of random sequences are generated and scored with the model, with score log [ P(s|M) / P(s|N) ].
[Histogram: score distribution of the random sequences versus the range for globin sequences]
E-value(S): expected number of random sequences with a score > S.
-
Trained model (globin.hmm):

HMMER2.0  [2.3.2]
NAME  globins50
LENG  143
ALPH  Amino
RF    no
CS    no
MAP   yes
COM   /home/gigi/bin/hmmbuild globin.hmm globins50.msf
COM   /home/gigi/bin/hmmcalibrate --histfile globin.histo globin.hmm
NSEQ  50
DATE  Sun May 29 19:03:18 2005
CKSUM 9858
XT    -8455 -4 -1000 -1000 -8455 -4 -8455 -4
NULT  -4 -8455
NULE  595 -1558 85 338 -294 453 -1158 197 249 902 -1085 -142 -21 -313 45 531 201 384 -1998 -644
EVD   -38.893742 0.243153
HMM      A    C    D    E    F    G    H    I    K    L    M    N    P    Q    R    S    T    V    W    Y
       m->m m->i m->d i->m i->i d->m d->d b->m m->e
       -450    * -1900
  1   591 -1587  159 1351 -1874 -201  151 -1600  998 -1591 -693  389 -1272  595   42  -31   27 -693 -1797 -1134   14
  -  -149 -500  233   43 -381  399  106 -626  210 -466 -720  275  394   45   96  359  117 -369 -294 -249
  -   -23 -6528 -7571 -894 -1115 -701 -1378 -450    *
  2  -926 -2616 2221 2269 -2845 -1178 -325 -2678 -300 -2596 -1810  220 -1592  939 -974 -671 -939 -2204 -2785 -1925   15
  -  -149 -500  233   43 -381  399  106 -626  210 -466 -720  275  394   45   96  359  117 -369 -294 -249
  -   -23 -6528 -7571 -894 -1115 -701 -1378    *    *
  3  -638 -1715 -680  497 -2043 -1540   23 -1671 2380 -1641 -840 -222 -1595  437 1040 -564 -523 -1363 2124 -1313   16
  -  -149 -500  233   43 -381  399  106 -626  210 -466 -720  275  394   45   96  359  117 -369 -294 -249
  -   -23 -6528 -7571 -894 -1115 -701 -1378    *    *
… (nodes 4-143 follow)
-
Trained model (globin.hmm)null modelHMMER2.0 [2.3.2]NAME globins50LENG 143ALPH AminoRF noCS noMAP yesCOM /home/gigi/bin/hmmbuild globin.hmm globins50.msfCOM /home/gigi/bin/hmmcalibrate --histfile globin.histo globin.hmmNSEQ 50DATE Sun May 29 19:03:18 2005CKSUM 9858XT -8455 -4 -1000 -1000 -8455 -4 -8455 -4 NULT -4 -8455NULE 595 -1558 85 338 -294 453 -1158 197 249 902 -1085 -142 -21 -313 45 531 201 384 -1998 -644 EVD -38.893742 0.243153HMM A C D E F G H I K L M N P Q R S T V W Y m->m m->i m->d i->m i->i d->m d->d b->m m->e -450 * -1900 1 591 -1587 159 1351 -1874 -201 151 -1600 998 -1591 -693 389 -1272 595 42 -31 27 -693 -1797 -1134 14 - -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249 - -23 -6528 -7571 -894 -1115 -701 -1378 -450 * 2 -926 -2616 2221 2269 -2845 -1178 -325 -2678 -300 -2596 -1810 220 -1592 939 -974 -671 -939 -2204 -2785 -1925 15 - -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249 - -23 -6528 -7571 -894 -1115 -701 -1378 * * 3 -638 -1715 -680 497 -2043 -1540 23 -1671 2380 -1641 -840 -222 -1595 437 1040 -564 -523 -1363 2124 -1313 16 - -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249 - -23 -6528 -7571 -894 -1115 -701 -1378 * *
Score = INT [1000 log2(prob/null_prob)]null_prob = 1 for transitions
-
Trained model (globin.hmm)null modelHMMER2.0 [2.3.2]NAME globins50LENG 143ALPH AminoRF noCS noMAP yesCOM /home/gigi/bin/hmmbuild globin.hmm globins50.msfCOM /home/gigi/bin/hmmcalibrate --histfile globin.histo globin.hmmNSEQ 50DATE Sun May 29 19:03:18 2005CKSUM 9858XT -8455 -4 -1000 -1000 -8455 -4 -8455 -4 NULT -4 -8455NULE 595 -1558 85 338 -294 453 -1158 197 249 902 -1085 -142 -21 -313 45 531 201 384 -1998 -644 EVD -38.893742 0.243153HMM A C D E F G H I K L M N P Q R S T V W Y m->m m->i m->d i->m i->i d->m d->d b->m m->e -450 * -1900 1 591 -1587 159 1351 -1874 -201 151 -1600 998 -1591 -693 389 -1272 595 42 -31 27 -693 -1797 -1134 14 - -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249 - -23 -6528 -7571 -894 -1115 -701 -1378 -450 * 2 -926 -2616 2221 2269 -2845 -1178 -325 -2678 -300 -2596 -1810 220 -1592 939 -974 -671 -939 -2204 -2785 -1925 15 - -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249 - -23 -6528 -7571 -894 -1115 -701 -1378 * * 3 -638 -1715 -680 497 -2043 -1540 23 -1671 2380 -1641 -840 -222 -1595 437 1040 -564 -523 -1363 2124 -1313 16 - -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249 - -23 -6528 -7571 -894 -1115 -701 -1378 * *
Score = INT[1000 log2(prob / null_prob)]; null_prob = the residue's natural abundance for emissions
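The score convention above can be made concrete with a small sketch (the helper names are hypothetical, and INT is assumed here to round to the nearest integer):

```python
import math

def hmmer2_score(prob, null_prob=1.0):
    """Integer score as stored in the HMMER2 file:
    INT[1000 * log2(prob / null_prob)] (rounding assumed)."""
    return int(round(1000 * math.log2(prob / null_prob)))

def hmmer2_prob(score, null_prob=1.0):
    """Invert a stored score back to a probability."""
    return null_prob * 2.0 ** (score / 1000.0)

# For transitions null_prob = 1: a probability of 0.5 scores -1000.
print(hmmer2_score(0.5))               # -1000
# For emissions null_prob is the residue's natural abundance,
# so a score of 0 means "emitted at exactly the background frequency".
print(hmmer2_prob(0, null_prob=0.05))  # 0.05
```

The `*` entries in the file stand for impossible events (probability 0, i.e. score minus infinity), which is why they cannot be written as finite integers.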
-
Trained model (globin.hmm): Transitions (the m->m ... m->e score lines of the HMM file above)
-
Trained model (globin.hmm): Emissions (the per-state amino acid score lines of the HMM file above)
-
hmmemit [-n N] globin.hmm
Trained profile-HMM → hmmemit → sequences generated by the model. The parameters of the model are used to generate new sequences.
-
hmmsearch globin.hmm Artemia.fa > Artemia.globin
Trained profile-HMM + set of sequences → hmmsearch → list of sequences that match the HMM (sorted by E-value).
-
Search results (Artemia.globin)
Sequence Description Score   E-value   N
-------- ----------- -----   --------  ---
S13421   S13421      474.3   1.7e-143  9
Parsed for domains:
Sequence Domain  seq-f  seq-t    hmm-f hmm-t    score  E-value
-------- ------  -----  -----    ----- -----    -----  -------
S13421    7/9      932   1075 ..     1   143 []  76.9  7.3e-24
S13421    2/9      153    293 ..     1   143 []  63.7  6.8e-20
S13421    3/9      307    450 ..     1   143 []  59.8  9.8e-19
S13421    8/9     1089   1234 ..     1   143 []  57.6  4.5e-18
S13421    9/9     1248   1390 ..     1   143 []  52.3  1.8e-16
S13421    1/9        1    143 [.     1   143 []  51.2  4e-16
S13421    4/9      464    607 ..     1   143 []  46.7  8.6e-15
S13421    6/9      775    918 ..     1   143 []  42.2  2e-13
S13421    5/9      623    762 ..     1   143 []  23.9  6.6e-08
Alignments of top-scoring domains:
S13421: domain 7 of 9, from 932 to 1075: score 76.9, E = 7.3e-24

  *->eekalvksvwgkveknveevGaeaLerllvvyPetkryFpkFkdLss
     +e a vk+ w+ v+ ++ vG +++ l++ +P+ +++FpkF d+
  S13421   932  REVAVVKQTWNLVKPDLMGVGMRIFKSLFEAFPAYQAVFPKFSDVPL   978

     adavkgsakvkahgkkVltalgdavkkldd...lkgalakLselHaqklr
     d++++++ v +h V t+l++ ++ ld++   +l+ ++L+e H+ lr
  S13421   979  -DKLEDTPAVGKHSISVTTKLDELIQTLDEpanLALLARQLGEDHIV-LR  1026

     vdpenfkllsevllvvlaeklgkeftpevqaalekllaavataLaakYk<
     v+   fk +++vl+ l++ lg+ f+ ++ +++k+++++++ +++ +
  S13421  1027  VNKPMFKSFGKVLVRLLENDLGQRFSSFASRSWHKAYDVIVEYIEEGLQ   1075
Figure callouts: number of domains (N); domains sorted by E-value; start/end positions; consensus sequence; target sequence.
-
hmmalign globin.hmm globins630.fa
Trained profile-HMM + set of sequences → hmmalign → alignment of all sequences to the model.
Insertions (dotted / lower-case columns):
BAHG_VITSP QAG-..VAAAHYPIV.GQELLGAIKEV.L.G.D.AATDDILDAWGKAYGV
GLB1_ANABR TR-K..ISAAEFGKI.NGPIKKVLAS-.-.-.K.NFGDKYANAWAKLVAV
GLB1_ARTSX NRGT..-DRSFVEYL.KESL-----GD.S.V.D.EFT------VQSFGEV
GLB1_CALSO TRGI..TNMELFAFA.LADLVAYMGTT.I.S.-.-FTAAQKASWTAVNDV
GLB1_CHITH -KSR..ASPAQLDNF.RKSLVVYLKGA.-.-.T.KWDSAVESSWAPVLDF
GLB1_GLYDI GNKH..IKAQYFEPL.GASLLSAMEHR.I.G.G.KMNAAAKDAWAAAYAD
GLB1_LUMTE ER-N..LKPEFFDIF.LKHLLHVLGDR.L.G.T.HFDF---GAWHDCVDQ
GLB1_MORMR QSFY..VDRQYFKVL.AGII-------.-.-.A.DTTAPGDAGFEKLMSM
GLB1_PARCH DLNK..VGPAHYDLF.AKVLMEALQAE.L.G.S.DFNQKTRDSWAKAFSI
GLB1_PETMA KSFQ..VDPQYFKVL.AAVI-------.-.-.V.DTVLPGDAGLEKLMSM
GLB1_PHESE QHTErgTKPEYFDLFrGTQLFDILGDKnLiGlTmHFD---QAAWRDCYAV
Gaps
-
HMMER applications: PFAM, http://www.sanger.ac.uk/Software/Pfam/
-
PFAM Exercise
Generate a sequence from the globin model with hmmemit and search for it in the PFAM database.
-
PFAM Exercise
Retrieve from the SwissProt database the sequences CG301_HUMAN and Q9H5F4_HUMAN, then:
1) search them in the PFAM database;
2) launch PSI-BLAST searches. Is it possible to annotate the sequences by means of the BLAST results?
-
SAM at UCSC: http://www.soe.ucsc.edu/research/compbio/sam.html
Krogh A, Brown M, Mian IS, Sjolander K & Haussler D (1994) Hidden Markov Models in computational biology: applications to protein modelling. J. Mol. Biol. 235, 1501-1531.
-
SAM applications: http://www.cse.ucsc.edu/research/compbio/HMM-apps/T02-query.html
-
HMMPRO: http://www.netid.com/html/hmmpro.html (Pierre Baldi, Net-ID)
-
HMMs for Mapping problems
Mapping problems in protein prediction
-
Covalent structure:
TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
→ Secondary structure
-
Topography and topology of membrane proteins: position of transmembrane segments along the sequence.
ALALMLCMLTYRHKELKLKLKK
-
HMMs for Mapping problems
Mapping problems in protein prediction
Labelled HMMs
-
HMM for secondary structure prediction
Simplest model. Introducing a grammar: states a1, a2, a3 (label a), b1, b2 (label b) and c (label c).
-
HMM for secondary structure prediction: Labels
The states a1, a2 and a3 share the same label, as do states b1 and b2. Decoding the Viterbi path for a sequence s defines a mapping between the sequence s and a sequence of labels y.
s: Sequence   S  A  L  K  M  N  Y  T  R  E  I  M  V  A  S  N  Q
p: Path       c  a1 a2 a3 a4 a4 a4 c  c  c  c  b1 b2 b2 b2 c  c
Y(p): Labels  c  a  a  a  a  a  a  c  c  c  c  b  b  b  b  c  c
-
Computing P(s, y | M)
Only the paths whose labelling is y have to be considered in the sum. In the Forward and Backward algorithms this means setting:
Fk(i) = 0,  Bk(i) = 0   if   Y(k) ≠ yi
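This restriction can be sketched with a toy model (the matrices, the uniform start distribution, and the two-state/two-symbol encoding below are illustrative assumptions, not taken from the slides):

```python
import numpy as np
from itertools import product

def labelled_forward(A, E, labels, seq, y):
    """Forward algorithm restricted to paths whose labelling is y:
    F_k(i) is forced to 0 whenever Y(k) != y_i."""
    K, L = len(labels), len(seq)
    F = np.zeros((L, K))
    for k in range(K):                 # uniform start distribution (assumption)
        if labels[k] == y[0]:
            F[0, k] = E[k, seq[0]] / K
    for i in range(1, L):
        for k in range(K):
            if labels[k] != y[i]:
                continue               # Y(k) != y_i  =>  F_k(i) = 0
            F[i, k] = E[k, seq[i]] * (F[i - 1] @ A[:, k])
    return F[-1].sum()                 # P(s, y | M)

A = np.array([[0.9, 0.1], [0.2, 0.8]])   # toy transition matrix
E = np.array([[0.7, 0.3], [0.4, 0.6]])   # toy emissions over 2 symbols
labels, seq = ['a', 'b'], [0, 1]
# Summing P(s, y | M) over all labellings y recovers the plain P(s | M),
# since every path has exactly one labelling.
total = sum(labelled_forward(A, E, labels, seq, y)
            for y in product('ab', repeat=len(seq)))
print(round(total, 4))  # 0.2235
```

The same zeroing applied to the Backward matrix gives everything Baum-Welch needs for the supervised (labelled) case.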
-
Baum-Welch training algorithm for labelled HMMs
Given a set of known labelled sequences (e.g. amino acid sequences and their native secondary structure) we want to find the parameters of the model, without knowing the generating paths:
θML = argmaxθ [ P(s, y | θ, M) ]
The algorithm is the same as in the non-labelled case if we use the Forward and Backward matrices defined in the last slide.
Supervised learning of the mapping
-
HMMs for Mapping problems
Mapping problems in protein prediction
Labelled HMMs
Duration modelling
-
Self-loops and geometric decay
P(l) = p^(l-1) (1 - p)
The length distribution of the generated segments is always exponential-like (geometric).
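The geometric decay can be checked numerically (p = 0.9 is chosen arbitrarily for the demo):

```python
# P(l) = p**(l-1) * (1-p): segment-length pmf induced by a single state
# with self-loop probability p; the expected segment length is 1/(1-p).
def self_loop_pmf(p, l):
    return p ** (l - 1) * (1 - p)

p = 0.9
pmf = [self_loop_pmf(p, l) for l in range(1, 1000)]
print(round(sum(pmf), 6))                                  # ~1.0: proper pmf
print(round(sum(l * v for l, v in enumerate(pmf, 1)), 3))  # mean ~ 1/(1-p) = 10
```

Whatever p is chosen, the mode of this distribution is always l = 1, which is exactly why real segment-length statistics (e.g. helices peaking at several residues) cannot be captured by a single self-looping state.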
-
How can we model other length distributions?
Limited case: this topology can model any length distribution between 1 and N.
-
How can we model other length distributions?
Non-limited case: this topology can model any length distribution between 1 and N-1, with a geometric decay from N onwards.
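One way to see why the chain topology is this flexible: give each state in the row an exit probability and solve for the exits that reproduce a target distribution (the target pmf below is a hypothetical toy example):

```python
# Toy illustration: states 1..N in a row, where state i either "exits"
# (producing a segment of length i) with probability e_i or moves on to
# state i+1.  Solving for the e_i reproduces ANY target pmf on 1..N.
def chain_exit_probs(pmf):
    exits, survive = [], 1.0
    for p in pmf:
        exits.append(p / survive if survive > 0 else 1.0)
        survive -= p
    return exits

target = [0.1, 0.4, 0.3, 0.2]            # desired P(l) for l = 1..4
e = chain_exit_probs(target)
# Reconstruct the pmf implied by the chain to verify the fit.
survive, rebuilt = 1.0, []
for ei in e:
    rebuilt.append(survive * ei)
    survive *= 1.0 - ei
print([round(v, 6) for v in rebuilt])    # matches the target pmf
```

Adding a self-loop on the last state then appends the geometric tail of the non-limited case.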
-
Secondary structure: length statistics
[Chart: length distributions of helix, strand and coil segments in the database (frequency vs. segment length in residues). All three distributions peak at short lengths (helix ≈ 3, strand ≈ 4, coil ≈ 2 residues) and decay towards zero beyond a few tens of residues.]
-
Secondary structure: model (helix states a1-a14, strand states b1-b5, coil states c1-c4).
Do we use the same emission probabilities for states sharing the same label?
-
HMMs for Mapping problems
Mapping problems in protein prediction
Labelled HMMs
Duration modelling
Models for membrane proteins
-
Porin (Rhodobacter capsulatus): β-barrel, outer membrane.
Bacteriorhodopsin (Halobacterium salinarum): α-helices, inner membrane.
(Both embedded in the lipid bilayer.)
-
Topography and topology of membrane proteins: position of transmembrane segments along the sequence.
ALALMLCMLTYRHKELKLKLKK
-
A generic model for membrane proteins (TMHMM): transmembrane, inner-side and outer-side states between Begin and End.
-
Model of β-barrel membrane proteins
-
Labels:
Transmembrane states
Loop states
Model of β-barrel membrane proteins: transmembrane, inner-side and outer-side states between Begin and End.
-
Length of transmembrane β-strands:
Minimum: 6 residues
Maximum: unbounded
Model of β-barrel membrane proteins
-
Six different sets of emission parameters:
Outer loop; inner loop; long globular domains
TM strand edges; TM strand core
Model of β-barrel membrane proteins
-
Model of α-helix membrane proteins (HMM1): transmembrane, inner-side and outer-side states.
-
Model of α-helix membrane proteins (HMM2): transmembrane (chains of ×10 states), inner-side and outer-side states.
-
Dynamic programming filtering procedure
[Chart: posterior transmembrane-segment (TMS) probability along the sequence of porin 1A0S. Probability peaks approaching 1 mark the predicted β-strands, separated by low-probability loop regions.]
-
Dynamic programming filtering procedure: maximum-scoring subsequences with constrained segment length and number.
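The selection step can be sketched as a simple dynamic program. This is only an illustration of the idea, with the per-residue score chosen here as s_i = p_i - 0.5 and only a minimum-length constraint; the actual filtering procedure also bounds the maximum segment length and the number of segments:

```python
# Minimal sketch: choose non-overlapping segments, each at least Lmin
# residues long, maximising the summed per-residue score s_i = p_i - 0.5
# (positive exactly where the TMS probability exceeds 1/2).
def best_segments(probs, Lmin=6):
    s = [p - 0.5 for p in probs]
    n = len(s)
    prefix = [0.0]
    for v in s:
        prefix.append(prefix[-1] + v)
    best = [0.0] * (n + 1)        # best[i]: optimum over the first i residues
    back = [(-1, -1)] * (n + 1)   # segment ending at i, or (-1, -1)
    for i in range(1, n + 1):
        best[i], back[i] = best[i - 1], (-1, -1)
        for j in range(i - Lmin + 1):          # segment covers residues j..i-1
            cand = best[j] + prefix[i] - prefix[j]
            if cand > best[i]:
                best[i], back[i] = cand, (j, i)
    segs, i = [], n               # backtrack the chosen segments
    while i > 0:
        j, e = back[i]
        if e == i:
            segs.append((j, e))
            i = j
        else:
            i -= 1
    return best[n], segs[::-1]

probs = [0.1] * 5 + [0.9] * 8 + [0.1] * 5   # one high-probability stretch
score, segs = best_segments(probs, Lmin=6)
print(segs)   # [(5, 13)]: the 8-residue stretch, as a half-open interval
```

Because including a low-probability residue costs score, the DP trims segments to the probability peaks, which is the qualitative behaviour visible in the filtered chart.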
[Chart: the same 1A0S TMS-probability profile, with the segments selected by the filtering procedure marked as a second series.]
0.544102
0.6596140.4
0.9199840.4
0.9530690.4
0.9748230.4
0.9762020.4
0.9704140.4
0.9631630.4
0.952890.4
0.8960970.4
0.7775760.4
0.6621560.4
0.47229
0.38383
0.205591
0.178812
0.128526
0.103349
0.0888854
0.0940384
0.106731
0.15551
0.205257
0.24824
0.302063
0.351648
0.411679
0.529721
0.6529490.4
0.821790.4
0.8345660.4
0.8513020.4
0.8607770.4
0.8622230.4
0.8629470.4
0.8646750.4
0.8772070.4
0.8684680.4
0.8388610.4
0.7215140.4
0.655336
0.554857
0.529138
0.328663
0.371278
0.598594
0.6384510.4
0.7272190.4
0.728370.4
0.7551850.4
0.8039270.4
0.8444410.4
0.8719940.4
0.8930230.4
0.896330.4
0.8864040.4
0.8714640.4
0.849750.4
0.7949270.4
0.52352
0.490371
0.294861
0.267727
0.180829
0.165128
0.156195
0.154401
0.150078
0.125137
0.11442
0.105883
0.0969399
0.0768029
0.0684502
0.0650629
0.0684616
0.0723199
0.0740659
0.140617
0.206073
0.347735
0.467618
0.634927
0.8669760.4
0.9679740.4
0.9894650.4
0.9929420.4
0.9946010.4
0.9947440.4
0.9949460.4
0.9874790.4
0.9565310.4
0.8711110.4
0.7584670.4
0.344668
TMS probability
Predicted TMS
Sequence (1A0S)
TMS probability
1a0spTOT.SumPostHmm
1.75E-050000
2.36E-050000
4.15E-050000
5.53E-050000