PROLOGUE: Pitfalls of standard alignments

Upload: valdemar-astrid

Posted on 30-Dec-2015

TRANSCRIPT

  • PROLOGUE:

    Pitfalls of standard alignments

  • Scoring a pairwise alignment (Blosum62):

    A: ALAEVLIRLITKLYP
    B: ASAKHLNRLITELYP

  • Alignment of a family (globins): different positions are not equivalent

  • Sequence logos (http://weblogo.berkeley.edu/cache/file5h2DWc.png): the substitution score IN A FAMILY should depend on the position (and the same holds for gaps). For modelling families we need more flexible tools.

  • Probabilistic Models for Biological Sequences: What are they?

  • Generative definition:

    Objects producing different outcomes (sequences) with different probabilities.

    The probability distribution over the sequence space determines the model specificity.

    A model M generates sequence si with probability P(si | M); e.g.: M is the representation of the family of globins.

  • Associative definition:

    Objects that, given an outcome (sequence), compute a probability value.

    A model M associates the probability P(si | M) to the sequence si; e.g.: M is the representation of the family of globins.

    We don't need a generator of new biological sequences: the generative definition is useful as an operative definition.

  • Most useful probabilistic models are trainable systems:

    the probability density function over the sequence space is estimated from known examples by means of a learning algorithm (pdf estimate, i.e. generalization). E.g.: writing a generic representation of the sequences of globins starting from a set of known globins.

  • Probabilistic Models for Biological Sequences: What are they? Why use them?

  • Modelling a protein family: given a protein class (e.g. globins), a probabilistic model trained on this family can compute a probability value for each new sequence. This value measures the similarity between the new sequence and the family described by the model.

  • Probabilistic Models for Biological Sequences: What are they? Why use them? Which probabilities do they compute?

  • A model M associates to a sequence s the probability P( s | M )

    This probability answers the question:

    What is the probability that a model describing the globins generates the sequence s ?

    The question we want to answer, however, is:

    Given a sequence s, is it a globin?

    P( s | M ) or P( M | s ) ? We need to compute P( M | s ) !

  • Bayes' theorem. Joint probability:

    P(X,Y) = P(X | Y) P(Y) = P(Y | X) P(X)

    So:

    P(Y | X) = P(X | Y) P(Y) / P(X)

    P(M | s) = P(s | M) P(M) / P(s)

    where P(M) and P(s) are a priori probabilities.

    P(M | s): evidence s, conclusion M. P(s | M): evidence M, conclusion s.

  • The a priori probabilities. P(M) is the probability of the model (i.e. of the class described by the model) BEFORE we know the sequence:

    it can be estimated as the abundance of the class.

    P(s) is the probability of the sequence in the sequence space:

    it cannot be reliably estimated!

  • Comparison between models: we can overcome the problem by comparing the probability of generating s from different models, since P(s) cancels in the ratio:

    P(M1 | s) / P(M2 | s) = [ P(s | M1) P(M1) ] / [ P(s | M2) P(M2) ]

    where P(M1) / P(M2) is the ratio between the abundances of the classes.

  • Null model. Otherwise we can score a sequence for a model M comparing it to a Null Model: a model that generates ALL the possible sequences with probabilities depending ONLY on the statistical amino acid abundance:

    S(M, s) = log [ P(s | M) / P(s | N) ]

    In this case we need a threshold and a statistic for evaluating the significance (E-value, P-value): sequences belonging to model M should score above the threshold, sequences NOT belonging to it below.
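A minimal numeric sketch of the log-odds score above. The per-residue probabilities are made-up illustration values, not trained parameters; the model emits each character independently, and the null model is the plain background composition:

```python
import math

# Hypothetical model M (prefers G/C) and null model N (uniform background).
model = {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1}
null  = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}

def score(s):
    # S(M, s) = log [P(s|M) / P(s|N)] = sum over positions of log [P(c|M)/P(c|N)]
    return sum(math.log(model[c] / null[c]) for c in s)

print(score("GCGCGC"))   # positive: looks like the model
print(score("ATATAT"))   # negative: looks like background
```

Sequences resembling the model score above 0 and sequences resembling the background score below it; in practice the threshold is set from the score statistics (E-value, P-value).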

  • The simplest probabilistic models: Markov Models. Definition.

  • Markov Models. Example: weather. Register the weather conditions day by day, with states C: Clouds, R: Rain, F: Fog, S: Sun:

    as a first hypothesis, the weather condition on a day depends ONLY on the weather condition of the day before.

    Define the conditional probabilities

    P(C|C), P(C|R), … P(R|C) …

    The probability for the 5-day registration CRRCS is

    P(CRRCS) = P(C) P(R|C) P(R|R) P(C|R) P(S|C)

  • Markov Model: a stochastic generator of sequences in which the probability of the state at position i depends ONLY on the state at position i-1.

    Given an alphabet C = {c1, c2, c3, …, cN},

    a Markov model is described by N(N+2) parameters {a_rt, a_BEGIN t, a_r END ; r, t ∈ C}:

    a_rq = P( s_i = q | s_i-1 = r ), a_BEGIN q = P( s_1 = q ), a_r END = P( s_T = END | s_T-1 = r )

    with the normalisation constraints Σ_t a_rt + a_r END = 1 for every r, and Σ_t a_BEGIN t = 1.

  • Given the sequence s = s1s2s3…sT with s_i ∈ C = {c1, c2, c3, …, cN}:

    P( s | M ) = P( s1 ) Π_{i=2..T} P( s_i | s_i-1 )

    e.g.: P(ALKALI) = a_BEGIN A · a_AL · a_LK · a_KA · a_AL · a_LI · a_I END
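The product above can be sketched as a short function. The two-letter alphabet and all probability values below are made-up illustration numbers (the slides do not give numeric values for this model); each row, including the END transition, sums to 1:

```python
# First-order Markov chain with explicit BEGIN and END states.
begin = {'A': 0.6, 'L': 0.4}                      # a_BEGIN,t
trans = {'A': {'A': 0.2, 'L': 0.7, 'END': 0.1},   # a_rt and a_r,END
         'L': {'A': 0.5, 'L': 0.4, 'END': 0.1}}

def prob(s):
    p = begin[s[0]]                                # a_BEGIN,s1
    for prev, cur in zip(s, s[1:]):
        p *= trans[prev][cur]                      # a_{s_{i-1}, s_i}
    return p * trans[s[-1]]['END']                 # a_{s_T, END}

# P(ALLA) = 0.6 * 0.7 * 0.4 * 0.5 * 0.1
print(prob("ALLA"))
```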

  • Markov Models, exercise. 1) Fill in the undefined values of the transition probabilities.

  • Markov Models, exercise. 2) Which model better describes the weather in summer? Which one describes the weather in winter?

  • Markov Models, exercise. 3) Given the sequence CSSSCFS,

    which model (Winter or Summer) gives the higher probability? [Consider the starting probabilities: P(X|BEGIN) = 0.25]

  • Markov Models, exercise:

    P(CSSSCFS | Winter) = 0.25 × 0.1 × 0.2 × 0.2 × 0.3 × 0.2 × 0.2 = 1.2 × 10^-5

    P(CSSSCFS | Summer) = 0.25 × 0.4 × 0.8 × 0.8 × 0.1 × 0.1 × 1.0 = 6.4 × 10^-4

    4) Can we conclude that the observation sequence refers to a summer week?

  • Markov Models, exercise:

    P(Seq | Winter) = 1.2 × 10^-5

    P(Seq | Summer) = 6.4 × 10^-4

    P(Summer | Seq) / P(Winter | Seq) = [ P(Seq | Summer) P(Summer) ] / [ P(Seq | Winter) P(Winter) ]

    so the conclusion also depends on the a priori probabilities P(Summer) and P(Winter).
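The two numbers of the exercise can be reproduced directly; the factors below are the starting probability 0.25 followed by the transition probabilities used in the slide's computation:

```python
winter_factors = [0.25, 0.1, 0.2, 0.2, 0.3, 0.2, 0.2]
summer_factors = [0.25, 0.4, 0.8, 0.8, 0.1, 0.1, 1.0]

def product(fs):
    p = 1.0
    for f in fs:
        p *= f
    return p

p_winter = product(winter_factors)   # P(CSSSCFS | Winter) = 1.2e-5
p_summer = product(summer_factors)   # P(CSSSCFS | Summer) = 6.4e-4
print(p_winter, p_summer, p_summer / p_winter)
```

The likelihood ratio favours Summer by a factor of about 50, but to answer question 4 it must still be multiplied by the prior ratio P(Summer)/P(Winter).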

  • Simple Markov model for DNA sequences. DNA: C = {Adenine, Cytosine, Guanine, Thymine}

    16 transition probabilities (12 of which independent) + 4 Begin probabilities + 4 End probabilities. The parameters of the model are different in different zones of DNA:

    they describe the overall composition and the dinucleotide recurrences.

  • Example of Markov Models: CpG islands. In the Markov model of CpG islands the transition a_CG is higher than in the Markov model of non-CpG islands. Given a sequence s we can evaluate which of the two models is more likely to have generated it, e.g. for the two strands:

    GATGCGTCGC

    CTACGCAGCG

  • The simplest probabilistic models: Markov Models. Definition; Training.

  • Probabilistic training of a parametric method. Generally speaking, a parametric model M, with parameters θ, aims to reproduce a set of known data: how to compare the modelled data with the real data (D)?

  • Training of Markov Models. Let θ be the set of parameters of model M.

    During the training phase, the parameters θ are estimated from the set of known data D.

    Maximum Likelihood Estimation (ML):

    θ_ML = argmax_θ P( D | M, θ )

    Maximum A Posteriori Estimation (MAP):

    θ_MAP = argmax_θ P( θ | M, D ) = argmax_θ [ P( D | M, θ ) P( θ ) ]

    For Markov models it can be proved that the ML estimate of each transition probability is its frequency of occurrence as counted in the data set D.

  • Example (coin tossing). Given N tosses of a coin (our data D), the outcomes are h heads and t tails (N = t + h). ASSUME the model

    P(D|M) = p^h (1 - p)^t

    Maximising the likelihood P(D|M) we obtain the estimate p = h / (h+t) = h / N.
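The maximisation implied by the slide is a one-line derivation, filled in here for completeness:

```latex
\log P(D|M) = h\log p + t\log(1-p), \qquad
\frac{d}{dp}\log P(D|M) = \frac{h}{p}-\frac{t}{1-p}=0
\;\Rightarrow\; h(1-p)=tp \;\Rightarrow\; p=\frac{h}{h+t}=\frac{h}{N}
```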

  • Example (error measure). Suppose your data are affected by a Gaussian error, so that they are distributed according to

    F(x_i) = A exp[ -(x_i - m)^2 / 2s^2 ], with A = 1 / (s sqrt(2π))

    If your measures are independent, the data likelihood is P(Data | model) = Π_i F(x_i). Find the m and s that maximise P(Data | model).

  • Maximum Likelihood training: proof.

    Given a sequence s contained in D, s = s1s2s3…sT, we can count the number of transitions n_jk between any two states j and k (where states 0 and N+1 are BEGIN and END). The normalisation constraints are taken into account using the Lagrange multipliers λ_k.
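Carrying out the constrained maximisation sketched above (a standard step, written out here): with one multiplier λ_j per state,

```latex
\mathcal{L}=\sum_{j,k} n_{jk}\log a_{jk}
-\sum_j \lambda_j\Big(\sum_k a_{jk}-1\Big), \qquad
\frac{\partial\mathcal{L}}{\partial a_{jk}}=\frac{n_{jk}}{a_{jk}}-\lambda_j=0
\;\Rightarrow\;
a_{jk}=\frac{n_{jk}}{\sum_{l} n_{jl}}
```

which is exactly the frequency of occurrence counted in the data set D.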

  • Hidden Markov Models. Preliminary examples.

  • Loaded dice. Given a sequence:

    4156266656321636543662152611536264162364261664616263

    we don't know the sequence of dice that generated it:

    RRRRRLRLRRRRRRRLRRRRRRRRRRRRLRLRRRRRRRRLRRRRLRRRRLRR

    We have 99 regular dice (R) and 1 loaded die (L), with emission probabilities:

         P(1)  P(2)  P(3)  P(4)  P(5)  P(6)
    R    1/6   1/6   1/6   1/6   1/6   1/6
    L    1/10  1/10  1/10  1/10  1/10  1/2

  • Loaded dice. Hypothesis:

    We choose a (possibly different) die for each roll.

    Two stochastic processes give origin to the sequence of observations:

    1) choosing the die (R or L); 2) rolling the die.

    The sequence of dice is hidden.

    The first process is assumed to be Markovian (in this case a 0-order MM).

    The outcome of the second process depends only on the state reached in the first process (that is, the chosen die).

  • Model

    Each state (R and L) generates a character of the alphabet C = {1, 2, 3, 4, 5, 6}

    The emission probabilities depend only on the state.

    The transition probabilities describe a Markov model that generates a state path: the hidden sequence (π). Here a_RR = a_LL = 0.99 and a_RL = a_LR = 0.01.

    The observation sequence (s) is generated by two concomitant stochastic processes.

  • The observation sequence (s) is generated by two concomitant stochastic processes (a_RR = a_LL = 0.99, a_RL = a_LR = 0.01):

    Choose the state: R (probability 0.99)

    Choose the symbol: 1 (probability 1/6, given R)

  • The observation sequence (s) is generated by two concomitant stochastic processes:

    Choose the state: L (probability 0.01, switching from R)

    Choose the symbol: 5 (probability 1/10, given L)


  • Some not-so-serious examples. 1) DEMOGRAPHY

    Observable: number of births and deaths in a year in a village. Hidden variable: economic conditions (as a first approximation we can consider the success in business as a random variable and, by consequence, the wealth as a Markov variable).

    ---> can we deduce the economic conditions of a village during a century by means of the register of births and deaths?

    2) THE METEOROPATHIC TEACHER

    Observable: average of the marks that a meteoropathic teacher gives to their students during a day. Hidden variable: weather conditions.

    ---> can we deduce the weather conditions during a year by means of the class register?

  • To be more serious. 1) SECONDARY STRUCTURE. Observable: protein sequence. Hidden variable: secondary structure.

    ---> can we deduce (predict) the secondary structure of a protein given its amino acid sequence?

    2) ALIGNMENT. Observable: protein sequence. Hidden variable: position of each residue along the alignment of a protein family.

    ---> can we align a protein to a family, starting from its amino acid sequence?

  • Hidden Markov Models. Preliminary examples; Formal definition.

  • Formal definition of Hidden Markov Models. An HMM is a stochastic generator of sequences characterised by:

    N states;

    a set of transition probabilities between states {a_kj}: a_kj = P( π(i) = j | π(i-1) = k );

    a set of starting probabilities {a_0k}: a_0k = P( π(1) = k );

    a set of ending probabilities {a_k0}: a_k0 = P( π(i) = END | π(i-1) = k );

    an alphabet C with M characters;

    a set of emission probabilities for each state {e_k(c)}: e_k(c) = P( s_i = c | π(i) = k ).

    Constraints: Σ_k a_0k = 1; a_k0 + Σ_j a_kj = 1 for every k; Σ_{c∈C} e_k(c) = 1 for every k.

    Notation: s is the sequence, π the path through the states.

  • Generating a sequence with an HMM

  • CpG island, simple model. Given:

    s: AGCGCGTAATCTG
    π: YYYYYYYNNNNNN

    P( s, π | M ) can be easily computed.

  • P( s, π | M ) can be easily computed:

    s:          A    G    C    G    C    G    T    A     A     T     C     T     G
    π:          Y    Y    Y    Y    Y    Y    Y    N     N     N     N     N     N
    Emission:   0.1  0.4  0.4  0.4  0.4  0.4  0.1  0.25  0.25  0.25  0.25  0.25  0.25
    Transition: 0.2  0.7  0.7  0.7  0.7  0.7  0.7  0.2   0.8   0.8   0.8   0.8   0.8   0.1

    Multiplying all the probabilities gives the probability of having the sequence AND the path through the states.
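A minimal sketch of this product for the two-state island model (state Y inside the island, N outside), with the emission and transition values listed in the worked example:

```python
emis = {'Y': {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1},
        'N': {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}}
trans = {'0': {'Y': 0.2, 'N': 0.8},            # '0' stands for BEGIN/END
         'Y': {'Y': 0.7, 'N': 0.2, '0': 0.1},
         'N': {'Y': 0.1, 'N': 0.8, '0': 0.1}}

def joint(s, pi):
    p = trans['0'][pi[0]]                       # starting transition a_0,pi(1)
    for i, (c, k) in enumerate(zip(s, pi)):
        p *= emis[k][c]                         # emission e_k(s_i)
        if i + 1 < len(s):
            p *= trans[k][pi[i + 1]]            # transition a_{pi(i), pi(i+1)}
    return p * trans[pi[-1]]['0']               # ending transition a_{pi(T), END}

print(joint("AGCGCGTAATCTG", "YYYYYYYNNNNNN"))
```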

  • Evaluation of the joint probability of the sequence and the path

  • Hidden Markov Models. Preliminary examples; Formal definition; Three questions.

  • s :AGCGCGTAATCTGp:?????????????

    P( s, p | M ) can be easily computed How to evaluate P ( s | M )?

    GpC Island, simple model

  • How to evaluate P( s | M )?

    s:  A G C G C G T A A T C T G
    π1: Y Y Y Y Y Y Y Y Y Y Y Y Y
    π2: Y Y Y Y Y Y Y Y Y Y Y Y N
    π3: Y Y Y Y Y Y Y Y Y Y Y N Y
    π4: Y Y Y Y Y Y Y Y Y Y Y N N
    π5: Y Y Y Y Y Y Y Y Y Y N Y Y
    …

    There are 2^13 different paths. Summing over all the paths gives the probability of having the sequence:

    P( s | M ) = Σ_π P( s, π | M )

  • Recap: CpG island, simple model. Given:

    s: AGCGCGTAATCTG
    π: ?????????????

    P( s, π | M ) can be easily computed. How to evaluate P( s | M )? Can we show the hidden path?

  • Can we show the hidden path?

    s:  A G C G C G T A A T C T G
    π1: Y Y Y Y Y Y Y Y Y Y Y Y Y
    π2: Y Y Y Y Y Y Y Y Y Y Y Y N
    …

    Among the 2^13 different paths, the Viterbi path is the path that gives the best joint probability:

    π* = argmax_π [ P( π | s, M ) ] = argmax_π [ P( π, s | M ) ]

  • A posteriori decoding

    For each position i choose the state π^(i) = argmax_k [ P( π(i) = k | s, M ) ]

    The contribution to this probability derives from all the paths that go through the state k at position i.

    The a posteriori path can be a nonsense path (it may not be a legitimate path, if some transitions are not permitted in the model).

    Can we show the hidden path?
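A posteriori decoding can be sketched with the Forward and Backward quantities that the later slides derive (P(π(i) = k | s, M) ∝ F_k(i) B_k(i)). A self-contained sketch on the two-state island model of the worked example:

```python
emis = {'Y': {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1},
        'N': {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}}
a0, aend = {'Y': 0.2, 'N': 0.8}, {'Y': 0.1, 'N': 0.1}
a = {'Y': {'Y': 0.7, 'N': 0.2}, 'N': {'Y': 0.1, 'N': 0.8}}
states = ['Y', 'N']

def forward(s):
    F = [{k: a0[k] * emis[k][s[0]] for k in states}]
    for c in s[1:]:
        F.append({l: emis[l][c] * sum(F[-1][k] * a[k][l] for k in states)
                  for l in states})
    return F

def backward(s):
    B = [dict(aend)]
    for c in reversed(s[1:]):
        B.insert(0, {k: sum(a[k][l] * emis[l][c] * B[0][l] for l in states)
                     for k in states})
    return B

def posterior_path(s):
    F, B = forward(s), backward(s)
    # argmax_k F_k(i) B_k(i); the normalising P(s) is the same for all k
    return "".join(max(states, key=lambda k: F[i][k] * B[i][k])
                   for i in range(len(s)))

print(posterior_path("AGCGCGTAATCTG"))
```

Note that for every position i, Σ_k F_k(i) B_k(i) = P(s | M), which is a useful sanity check on an implementation.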

  • CpG island, simple model. Given:

    s: AGCGCGTAATCTG
    π: YYYYYYYNNNNNN

    P( s, π | M ) can be easily computed. How to evaluate P( s | M )? Can we show the hidden path? Can we evaluate the parameters starting from known examples?

  • Can we evaluate the parameters starting from known examples?

    s:          A      G      C      G      C      G      T      A      A      T      C      T      G
    π:          Y      Y      Y      Y      Y      Y      Y      N      N      N      N      N      N
    Emission:   eY(A)  eY(G)  eY(C)  eY(G)  eY(C)  eY(G)  eY(T)  eN(A)  eN(A)  eN(T)  eN(C)  eN(T)  eN(G)
    Transition: a0Y    aYY    aYY    aYY    aYY    aYY    aYY    aYN    aNN    aNN    aNN    aNN    aNN    aN0

    How to find the parameters e and a that maximise this probability? And what if we don't know the path?

  • Hidden Markov Models: Algorithms. Recap; Evaluating P(s | M): Forward Algorithm.

  • Computing P( s, π | M ) for each path is a redundant operation:

    s:          A    G    C    G    C    G    T    A    A    T    C    T    G
    π:          Y    Y    Y    Y    Y    Y    Y    Y    Y    Y    Y    Y    Y
    Emission:   0.1  0.4  0.4  0.4  0.4  0.4  0.1  0.1  0.1  0.1  0.4  0.1  0.4
    Transition: 0.2  0.7  0.7  0.7  0.7  0.7  0.7  0.7  0.7  0.7  0.7  0.7  0.7  0.1

    s:          A    G    C    G    C    G    T    A    A    T    C    T    G
    π:          Y    Y    Y    Y    Y    Y    Y    Y    Y    Y    Y    Y    N
    Emission:   0.1  0.4  0.4  0.4  0.4  0.4  0.1  0.1  0.1  0.1  0.4  0.1  0.25
    Transition: 0.2  0.7  0.7  0.7  0.7  0.7  0.7  0.7  0.7  0.7  0.7  0.7  0.2   0.1

  • Computing P( s, π | M ) for each path is a redundant operation. If we compute the common part only once, we save 2(T-1) operations per shared prefix.

  • Summing over all the possible paths (first two characters, AG):

    π = Y Y: emission 0.1, 0.4;  transition 0.2, 0.7 → 0.2 · 0.1 · 0.7 · 0.4  = 0.0056
    π = Y N: emission 0.1, 0.25; transition 0.2, 0.2 → 0.2 · 0.1 · 0.2 · 0.25 = 0.001
    π = N Y: emission 0.25, 0.4; transition 0.8, 0.1 → 0.8 · 0.25 · 0.1 · 0.4 = 0.008
    π = N N: emission 0.25, 0.25; transition 0.8, 0.8 → 0.8 · 0.25 · 0.8 · 0.25 = 0.04

  • Summing over all the possible paths (third character, C). Grouping the paths by their last state, 0.0056 + 0.008 = 0.0136 ends in Y and 0.001 + 0.04 = 0.041 ends in N, so:

    π = X Y Y: 0.0136 · 0.7 · 0.4
    π = X Y N: 0.0136 · 0.2 · 0.25
    π = X N Y: 0.041 · 0.1 · 0.4
    π = X N N: 0.041 · 0.8 · 0.25

  • Iterating until the last position of the sequence:

    s: A G C G C G T A A T C T G, π: X X X X X X X X X X X X Y → multiply by 0.1 (aY0)
    s: A G C G C G T A A T C T G, π: X X X X X X X X X X X X N → multiply by 0.1 (aN0)

    and summing the two terms gives P(s | M).

  • Summing over all the possible paths. If we know the probabilities of emitting the first two characters of the sequence, ending the path in states R and L respectively:

    F_R(2) ≡ P(s1, s2, π(2) = R | M) and F_L(2) ≡ P(s1, s2, π(2) = L | M)

    then we can compute:

    P(s1, s2, s3, π(3) = R | M) = F_R(2) a_RR e_R(s3) + F_L(2) a_LR e_R(s3)

  • Forward Algorithm. On the basis of the preceding observations, the computation of P(s | M) can be decomposed into simpler problems.

    For each state k and each position i in the sequence, we compute:

    F_k(i) = P( s1s2s3…s_i, π(i) = k | M )

    Initialisation: F_BEGIN(0) = 1; F_k(0) = 0 for every k ≠ BEGIN

    Recurrence: F_l(i+1) = P( s1s2…s_i s_i+1, π(i+1) = l ) = Σ_k P( s1s2…s_i, π(i) = k ) a_kl e_l(s_i+1) = e_l(s_i+1) Σ_k F_k(i) a_kl

    Termination: P( s ) = P( s1s2s3…sT, π(T+1) = END ) = Σ_k P( s1s2…sT, π(T) = k ) a_k0 = Σ_k F_k(T) a_k0

    (conditioning on M is understood)

  • [Figure: Forward Algorithm as a matrix with one row per state and one column per iteration 0…T+1; each cell is filled from the previous column, e.g. F_B(2) = e_B(s2) Σ_i F_i(1) a_iB, and the END cell contains P(s | M).]
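The recurrence can be checked against the naïve sum over all paths. A self-contained sketch on the two-state island model, with the emission and transition values of the worked example:

```python
from itertools import product

emis = {'Y': {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1},
        'N': {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}}
a0, aend = {'Y': 0.2, 'N': 0.8}, {'Y': 0.1, 'N': 0.1}
a = {'Y': {'Y': 0.7, 'N': 0.2}, 'N': {'Y': 0.1, 'N': 0.8}}
states = ['Y', 'N']

def forward(s):
    # F_k(1) = a_0k e_k(s1); F_l(i+1) = e_l(s_{i+1}) sum_k F_k(i) a_kl
    F = {k: a0[k] * emis[k][s[0]] for k in states}
    for c in s[1:]:
        F = {l: emis[l][c] * sum(F[k] * a[k][l] for k in states) for l in states}
    return sum(F[k] * aend[k] for k in states)      # termination

def brute_force(s):
    # naive method: explicit sum of P(s, pi | M) over all N^T paths
    total = 0.0
    for pi in product(states, repeat=len(s)):
        p = a0[pi[0]]
        for i, c in enumerate(s):
            p *= emis[pi[i]][c]
            if i + 1 < len(s):
                p *= a[pi[i]][pi[i + 1]]
        total += p * aend[pi[-1]]
    return total

s = "AGCGCGTAATCTG"
print(forward(s), brute_force(s))   # the two agree
```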

  • Forward algorithm: computational complexity. Naïve method: P( s | M ) = Σ_π P( s, π | M ). There are N^T possible paths, and each path requires about 2T operations: the time for the computation is O( T N^T ).

  • Forward algorithm: computational complexity. From the previous example:

    F_Y(3) = 0.0136 · 0.7 · 0.4 + 0.041 · 0.1 · 0.4 = 0.005448
    F_N(3) = 0.0136 · 0.2 · 0.25 + 0.041 · 0.8 · 0.25 = 0.00888

    T positions, N values for each position; each element requires about 2N products and 1 sum: the time for the computation is O(T N^2).

  • Forward algorithm: computational complexity. Number of operations, naïve method (T · N^T) vs forward algorithm (T · N^2), for N = 2:

    T        1  2  3   4   5    6    7    8     9     10
    naïve    2  8  24  64  160  384  896  2048  4608  10240
    forward  4  8  12  16  20   24   28   32    36    40

  • Hidden Markov Models: Algorithms. Recap; Evaluating P(s | M): Forward Algorithm; Evaluating P(s | M): Backward Algorithm.

  • Backward Algorithm. Similar to the Forward algorithm: it computes P( s | M ), reconstructing the sequence from the end.

    For each state k and each position i in the sequence, we compute:

    B_k(i) = P( s_i+1 s_i+2 s_i+3 … sT | π(i) = k )

    Initialisation: B_k(T) = P( π(T+1) = END | π(T) = k ) = a_k0

    Recurrence: B_l(i-1) = P( s_i s_i+1 … sT | π(i-1) = l ) = Σ_k P( s_i+1 s_i+2 … sT | π(i) = k ) a_lk e_k(s_i) = Σ_k B_k(i) e_k(s_i) a_lk

    Termination: P( s ) = P( s1s2s3…sT | π(0) = BEGIN ) = Σ_k P( s2…sT | π(1) = k ) a_0k e_k(s1) = Σ_k B_k(1) a_0k e_k(s1)

  • [Figure: Backward Algorithm as a matrix with one row per state and one column per iteration 0…T+1, filled from column T backwards; e.g. B_B(T-1) = Σ_k a_Bk e_k(sT) B_k(T), and the BEGIN cell contains P(s | M).]
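A self-contained sketch of the backward pass on the same two-state island model; as a check, its termination must reproduce the P(s | M) obtained by the forward pass:

```python
emis = {'Y': {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1},
        'N': {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}}
a0, aend = {'Y': 0.2, 'N': 0.8}, {'Y': 0.1, 'N': 0.1}
a = {'Y': {'Y': 0.7, 'N': 0.2}, 'N': {'Y': 0.1, 'N': 0.8}}
states = ['Y', 'N']

def forward_prob(s):
    F = {k: a0[k] * emis[k][s[0]] for k in states}
    for c in s[1:]:
        F = {l: emis[l][c] * sum(F[k] * a[k][l] for k in states) for l in states}
    return sum(F[k] * aend[k] for k in states)

def backward_prob(s):
    B = dict(aend)                         # initialisation: B_k(T) = a_k0
    for c in reversed(s[1:]):              # recurrence, from position T down to 2
        B = {l: sum(a[l][k] * emis[k][c] * B[k] for k in states) for l in states}
    return sum(a0[k] * emis[k][s[0]] * B[k] for k in states)   # termination

s = "AGCGCGTAATCTG"
print(forward_prob(s), backward_prob(s))   # the two agree
```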

  • Hidden Markov Models: Algorithms. Recap; Evaluating P(s | M): Forward Algorithm; Evaluating P(s | M): Backward Algorithm; Showing the path: Viterbi decoding.

  • Finding the best path (first two characters, AG):

    π = Y Y: emission 0.1, 0.4;  transition 0.2, 0.7 → 0.0056
    π = Y N: emission 0.1, 0.25; transition 0.2, 0.2 → 0.001
    π = N Y: emission 0.25, 0.4; transition 0.8, 0.1 → 0.008
    π = N N: emission 0.25, 0.25; transition 0.8, 0.8 → 0.04

  • Finding the best path (third character, C): keep, for each state, only the best predecessor (Y: 0.008 via N; N: 0.04 via N):

    π = N Y Y: 0.008 · 0.7 · 0.4  = 0.00224
    π = N Y N: 0.008 · 0.2 · 0.25 = 0.0004
    π = N N Y: 0.04 · 0.1 · 0.4   = 0.0016
    π = N N N: 0.04 · 0.8 · 0.25  = 0.008

  • Finding the best path. Iterating until the last position of the sequence:

    s: A G C G C G T A A T C T G, π: N Y Y Y Y Y Y N N N Y Y Y → multiply by 0.1 (aY0)
    s: A G C G C G T A A T C T G, π: N N N N N N N N N N N N N → multiply by 0.1 (aN0)

    and choosing the maximum gives the best path.

  • Viterbi Algorithm: π* = argmax_π [ P( π, s | M ) ]. The computation of P(s, π* | M) can be decomposed into simpler problems.

    Let V_k(i) be the probability of the most probable path generating the subsequence s1s2s3…s_i and ending in the state k at iteration i.

    Initialisation: V_BEGIN(0) = 1; V_k(0) = 0 for every k ≠ BEGIN

    Recurrence: V_l(i+1) = e_l(s_i+1) max_k [ V_k(i) a_kl ]; ptr_i+1(l) = argmax_k [ V_k(i) a_kl ]

    Termination: P( s, π* ) = max_k [ V_k(T) a_k0 ]; π*(T) = argmax_k [ V_k(T) a_k0 ]

    Traceback: π*(i-1) = ptr_i( π*(i) )

  • [Figure: the Viterbi matrix, analogous to the forward matrix but with MAX in place of the sum; e.g. V_B(2) = e_B(s2) max_i V_i(1) a_iB, storing the pointer ptr_2(B); the END cell contains P(s, π* | M).]

  • Viterbi Algorithm, Viterbi path: note that different paths can have the same probability.
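A self-contained sketch of the Viterbi recurrence with traceback, again on the two-state island model of the worked examples:

```python
emis = {'Y': {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1},
        'N': {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}}
a0, aend = {'Y': 0.2, 'N': 0.8}, {'Y': 0.1, 'N': 0.1}
a = {'Y': {'Y': 0.7, 'N': 0.2}, 'N': {'Y': 0.1, 'N': 0.8}}
states = ['Y', 'N']

def viterbi(s):
    # V_k(1) = a_0k e_k(s1); V_l(i+1) = e_l(s_{i+1}) max_k V_k(i) a_kl
    V = {k: a0[k] * emis[k][s[0]] for k in states}
    ptr = []
    for c in s[1:]:
        prev = V
        ptr.append({l: max(states, key=lambda k: prev[k] * a[k][l])
                    for l in states})
        V = {l: emis[l][c] * max(prev[k] * a[k][l] for k in states)
             for l in states}
    last = max(states, key=lambda k: V[k] * aend[k])    # termination
    best = V[last] * aend[last]
    path = [last]
    for back in reversed(ptr):            # traceback: pi*(i-1) = ptr_i(pi*(i))
        path.append(back[path[-1]])
    return "".join(reversed(path)), best

print(viterbi("AGCGCGTAATCTG"))
```

On the two-character example above, the best path for AG is NN with joint probability 0.04 · 0.1 = 0.004 once the ending transition is included.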

  • Hidden Markov Models: Algorithms. Recap; Evaluating P(s | M): Forward Algorithm; Evaluating P(s | M): Backward Algorithm; Showing the path: Viterbi decoding; Showing the path: A posteriori decoding; Training a model: EM algorithm.

  • If we know the path generating the training sequence:

    s:          A      G      C      G      C      G      T      A      A      T      C      T      G
    π:          Y      Y      Y      Y      Y      Y      Y      N      N      N      N      N      N
    Emission:   eY(A)  eY(G)  eY(C)  eY(G)  eY(C)  eY(G)  eY(T)  eN(A)  eN(A)  eN(T)  eN(C)  eN(T)  eN(G)
    Transition: a0Y    aYY    aYY    aYY    aYY    aYY    aYY    aYN    aNN    aNN    aNN    aNN    aNN    aN0

    Just count! Example:

    aYY = nYY / (nYY + nYN) = 6/7

    eY(A) = nY(A) / [ nY(A) + nY(C) + nY(G) + nY(T) ] = 1/7

  • Expectation-Maximisation algorithm. We need to estimate the Maximum Likelihood parameters when the paths generating the training sequences are unknown:

    θ_ML = argmax_θ [ P( s | θ, M ) ]

    Given a model with parameters θ0, the EM algorithm finds new parameters θ that increase the likelihood of the model:

    P( s | θ ) > P( s | θ0 )

  • Expectation-Maximisation algorithm. Given a path π we can count

    the number of transitions between states k and l: A_kl(π)

    the number of emissions of character c from state k: E_k(c, π)

    We can compute the expected values over all the paths (π1 = YYYYYYYYYYYYY, π2 = YYYYYYYYYYYYN, π3 = YYYYYYYYYYYNY, π4 = YYYYYYYYYYYNN, …):

    A_kl = Σ_π P(π | s, θ0) A_kl(π)
    E_k(c) = Σ_π P(π | s, θ0) E_k(c, π)

    The updated parameters are:

    a_kl = A_kl / Σ_l' A_kl'
    e_k(c) = E_k(c) / Σ_c' E_k(c')

  • Expectation-Maximisation algorithm. We need to estimate the Maximum Likelihood parameters when the paths generating the training sequences are unknown:

    θ_ML = argmax_θ [ P( s | θ, M ) ]

    Given a model with parameters θ0, the EM algorithm finds new parameters θ that increase the likelihood of the model:

    P( s | θ ) > P( s | θ0 )

    or, equivalently,

    log P( s | θ ) > log P( s | θ0 )

  • Expectation-Maximisation algorithm.

    log P( s | θ ) = log P( s, π | θ ) - log P( π | s, θ )

    Multiplying by P(π | s, θ0) and summing over all the possible paths:

    log P( s | θ ) = Σ_π P(π | s, θ0) log P( s, π | θ ) - Σ_π P(π | s, θ0) log P( π | s, θ )

    Defining Q(θ | θ0) = Σ_π P(π | s, θ0) log P( s, π | θ ), the expectation value of log P(s, π | θ) over all the current paths:

    log P( s | θ ) - log P( s | θ0 ) = Q(θ | θ0) - Q(θ0 | θ0) + Σ_π P(π | s, θ0) log [ P(π | s, θ0) / P(π | s, θ) ]

    The last term is a relative entropy and is always ≥ 0, so Q(θ | θ0) - Q(θ0 | θ0) ≥ 0 implies log P( s | θ ) ≥ log P( s | θ0 ).

  • Expectation-Maximisation algorithm. The EM algorithm is an iterative process; each iteration performs two steps:

    E-step: evaluation of Q(θ | θ0) = Σ_π P(π | s, θ0) log P( s, π | θ )

    M-step: maximisation of Q(θ | θ0) over all θ

    It is NOT guaranteed to converge to the GLOBAL Maximum Likelihood.

  • Baum-Welch implementation of the EM algorithm. E-step:

    Q(θ | θ0) = Σ_π P(π | s, θ0) log P( s, π | θ )

    P( s, π | θ ) = a_0,π(1) Π_{i=1..T} a_π(i),π(i+1) e_π(i)(s_i) = Π_{k,l} a_kl^A_kl(π) · Π_k Π_{c∈C} e_k(c)^E_k(c,π)

    so Q depends only on the expected values over all the actual paths:

    A_kl = Σ_π P(π | s, θ0) A_kl(π)
    E_k(c) = Σ_π P(π | s, θ0) E_k(c, π)

  • Baum-Welch implementation of the EM algorithm. M-step: maximising Q over the parameters, with the constraints Σ_l a_kl = 1 and Σ_c e_k(c) = 1 for any state k and character c, by means of the Lagrange multipliers technique, we can solve the system and obtain

    a_kl = A_kl / Σ_l' A_kl'
    e_k(c) = E_k(c) / Σ_c' E_k(c')

  • Baum-Welch implementation of the EM algorithm. How to compute the expected numbers of transitions and emissions over all the paths, using the forward and backward variables:

    F_k(i) = P( s1s2s3…s_i, π(i) = k )
    B_k(i) = P( s_i+1 s_i+2 s_i+3 … sT | π(i) = k )

    A_kl = Σ_i P( π(i) = k, π(i+1) = l | s, θ ) = Σ_i F_k(i) a_kl e_l(s_i+1) B_l(i+1) / P(s)

    E_k(c) = Σ_{i: s_i = c} P( π(i) = k | s, θ ) = Σ_{i: s_i = c} F_k(i) B_k(i) / P(s)

  • Baum-Welch implementation of the EM algorithm:

    1. Start with random parameters
    2. Compute the Forward and Backward matrices on the known sequences
    3. Compute A_kl and E_k(c), the expected numbers of transitions and emissions
    4. Update the parameters: a_kl ∝ A_kl, e_k(c) ∝ E_k(c)
    5. Has P(s | M) increased? Yes: go back to step 2. No: end.
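A compact sketch of one Baum-Welch iteration on the toy two-state island model; the starting parameters are the values of the earlier worked example (any starting point would do), and the property the EM theory guarantees is that the likelihood does not decrease:

```python
states = ['Y', 'N']
alph = 'ACGT'
emis = {'Y': {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1},
        'N': {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}}
a0, aend = {'Y': 0.2, 'N': 0.8}, {'Y': 0.1, 'N': 0.1}
a = {'Y': {'Y': 0.7, 'N': 0.2}, 'N': {'Y': 0.1, 'N': 0.8}}

def forward(s):
    F = [{k: a0[k] * emis[k][s[0]] for k in states}]
    for c in s[1:]:
        F.append({l: emis[l][c] * sum(F[-1][k] * a[k][l] for k in states)
                  for l in states})
    return F

def backward(s):
    B = [dict(aend)]
    for c in reversed(s[1:]):
        front = {k: sum(a[k][l] * emis[l][c] * B[0][l] for l in states)
                 for k in states}
        B.insert(0, front)
    return B

def likelihood(s):
    return sum(forward(s)[-1][k] * aend[k] for k in states)

def baum_welch_step(s):
    global emis, a0, aend, a
    F, B = forward(s), backward(s)
    Ps = sum(F[-1][k] * aend[k] for k in states)
    T = len(s)
    # E-step: expected transition and emission counts from Forward/Backward
    A = {k: {l: sum(F[i][k] * a[k][l] * emis[l][s[i + 1]] * B[i + 1][l]
                    for i in range(T - 1)) / Ps for l in states} for k in states}
    A0   = {k: a0[k] * emis[k][s[0]] * B[0][k] / Ps for k in states}
    Aend = {k: F[-1][k] * aend[k] / Ps for k in states}
    E = {k: {c: sum(F[i][k] * B[i][k] for i in range(T) if s[i] == c) / Ps
             for c in alph} for k in states}
    # M-step: renormalise the expected counts
    a0 = {k: A0[k] / sum(A0.values()) for k in states}
    for k in states:
        tot = sum(A[k].values()) + Aend[k]
        a[k] = {l: A[k][l] / tot for l in states}
        aend[k] = Aend[k] / tot
        emis[k] = {c: E[k][c] / sum(E[k].values()) for c in alph}

s = "AGCGCGTAATCTG"
before = likelihood(s)
baum_welch_step(s)
after = likelihood(s)
print(before, after)   # 'after' is never smaller than 'before'
```

In a real implementation the step is repeated until P(s | M) stops increasing, and the products are kept in log space (or rescaled) to avoid underflow on long sequences.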

  • Profile HMMs. HMMs for alignments.

  • How to align? Each state represents a position in the alignment, and each position has a peculiar composition:

    ACGGTA → M0 M1 M2 M3 M4 M5
    ACGATC → M0 M1 M2 M3 M4 M5
    ATGTTC → M0 M1 M2 M3 M4 M5

  • Given a set of sequences (ACGGTA, ACGATC, ATGTTC) we can train a model, estimating the emission probabilities of states M0…M5:

        M0  M1    M2  M3    M4  M5
    A   1   0     0   0.33  0   0.33
    C   0   0.66  0   0     0   0.66
    G   0   0     1   0.33  0   0
    T   0   0.33  0   0.33  1   0

  • Given a trained model, we can align a new sequence, ACGATC, computing the emission probability:

    P(s | M) = 1 × 0.66 × 1 × 0.33 × 1 × 0.66
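This position-by-position product can be sketched directly from the emission table above:

```python
# Emission table of the trained profile (zeros are omitted: characters not
# seen at a position in the tiny training set get probability 0 here).
profile = [{'A': 1.0},                          # M0
           {'C': 0.66, 'T': 0.33},              # M1
           {'G': 1.0},                          # M2
           {'A': 0.33, 'G': 0.33, 'T': 0.33},   # M3
           {'T': 1.0},                          # M4
           {'A': 0.33, 'C': 0.66}]              # M5

def profile_prob(s):
    p = 1.0
    for pos, c in zip(profile, s):
        p *= pos.get(c, 0.0)
    return p

print(profile_prob("ACGATC"))   # 1 * 0.66 * 1 * 0.33 * 1 * 0.66
```

Note the hard zeros: any sequence with an unseen character at some position gets probability 0, which is why real profile HMMs are trained with pseudocounts.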

  • And for the sequence AGATC?

    AGATC → M0 M2 M3 M4 M5 (M1 is skipped)

    We need a way to introduce gaps.

  • Silent states. The extra ("red") transitions allow gaps: connecting every pair of match states requires on the order of N(N-1)/2 transitions. To reduce the number of parameters we can use silent states, which don't emit any character: 4N-8 transitions suffice.

  • Profile HMMs: Match states, Insert states, Delete states.

    A C G G T A     → M0 M1 M2 M3 M4 M5
    A C G C A G T C → M0 I0 I0 M1 M2 M3 M4 M5
    A G A T C       → M0 D1 M2 M3 M4 M5

  • Example of alignment.

    Sequence 1: A S T R A L; Viterbi path: M0 M1 M2 M3 M4 M5
    Sequence 2: A S T A I L; Viterbi path: M0 M1 M2 D3 M4 I4 M5
    Sequence 3: A R T I;     Viterbi path: M0 M1 M2 D3 D4 M5

  • Example of alignment. -log P(s | M) is an alignment score.

    Grouping by vertical layers:

          0  1  2  3  4   5
    s1    A  S  T  R  A   L
    s2    A  S  T  -  AI  L
    s3    A  R  T  -  -   I

    Alignment:

    ASTRA-L
    AST-AIL
    ART---I

  • Searching for a structural/functional pattern in protein sequences. Zn binding loop:

    C H C I C R I C
    C H C L C K I C
    C H C I C S L C
    D H C L C T I C
    C H C I D S I C
    C H C L C K I C

    Cysteines can be replaced by an aspartic acid (D), but only ONCE in each sequence.

  • Searching for a structural/functional pattern in protein sequences:

    ..ALCPCHCLCRICPLIY.. obtains a higher probability than ..WERWDHCIDSICLKDE.., because M0 and M4 have low emission probability for aspartic acid, and for the second sequence we would multiply them twice.

  • Profile HMMs. HMMs for alignments; Example on globins.

  • Structural alignment of globins

  • Structural alignment of globins. Bashford D, Chothia C & Lesk AM (1987) Determinants of a protein fold: unique features of the globin amino acid sequences. J. Mol. Biol. 196, 199-216.

  • Alignment of globins reconstructed with profile HMMsKrogh A, Brown M, Mian IS, Sjolander K & Haussler D (1994) Hidden Markov Models in computational biology: applications to protein modelling. J.Mol.Biol. 235, 1501-1531

  • Discrimination power of profile HMMsKrogh A, Brown M, Mian IS, Sjolander K & Haussler D (1994) Hidden Markov Models in computational biology: applications to protein modelling. J.Mol.Biol. 235, 1501-1531

  • Profile HMMs. HMMs for alignments; Example on globins; Other applications.

  • Finding a domain: Begin → [profile HMM specific for the considered domain] → End, so that the domain can be located anywhere within a longer sequence.

  • Clustering subfamilies: BEGIN → (HMM 1 | HMM 2 | HMM 3 | … | HMM n) → END. Each sequence s contributes to the update of HMM i with a weight equal to P( s | Mi ).

  • Profile HMMs. HMMs for alignments; Example on globins; Other applications; Available codes and servers.

  • HMMER at WUSTL: http://hmmer.wustl.edu/Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14:755-763

  • HMMER:

    hmmbuild: takes the alignment of a protein family, checks for redundancy, and sets the emission and transition probabilities of a trained profile HMM.

    hmmcalibrate: takes a trained HMM, generates a great number of random sequences, scores them, and fits the Extreme Value Distribution to the computed scores, producing an HMM calibrated with accurate E-value statistics.

  • HMMER:

    hmmalign: aligns a set of sequences to the model.

    hmmsearch: given an HMM, returns the list of sequences that match it (sorted by E-value).

    hmmpfam: given a sequence and a set of HMMs, returns the list of HMMs that match the sequence.

  • MSF Format: globins50.msf

    !!AA_MULTIPLE_ALIGNMENT 1.0

    PileUp of: *.pep

    Symbol comparison table: GenRunData:blosum62.cmp CompCheck: 6430

    GapWeight: 12

    GapLengthWeight: 4

    pileup.msf MSF: 308 Type: P August 16, 1999 09:09 Check: 9858 ..

    Name: lgb1_pea Len: 308 Check: 2200 Weight: 1.00

    Name: lgb1_vicfa Len: 308 Check: 214 Weight: 1.00

    Name: myg_escgi Len: 308 Check: 3961 Weight: 1.00

    Name: myg_horse Len: 308 Check: 5619 Weight: 1.00

    Name: myg_progu Len: 308 Check: 6401 Weight: 1.00

    Name: myg_saisc Len: 308 Check: 6606 Weight: 1.00

    //

    1 50

    lgb1_pea ~~~~~~~~~G FTDKQEALVN SSSE.FKQNL PGYSILFYTI VLEKAPAAKG

    lgb1_vicfa ~~~~~~~~~G FTEKQEALVN SSSQLFKQNP SNYSVLFYTI ILQKAPTAKA

    myg_escgi ~~~~~~~~~V LSDAEWQLVL NIWAKVEADV AGHGQDILIR LFKGHPETLE

    myg_horse ~~~~~~~~~G LSDGEWQQVL NVWGKVEADI AGHGQEVLIR LFTGHPETLE

    myg_progu ~~~~~~~~~G LSDGEWQLVL NVWGKVEGDL SGHGQEVLIR LFKGHPETLE

    myg_saisc ~~~~~~~~~G LSDGEWQLVL NIWGKVEADI PSHGQEVLIS LFKGHPETLE

  • hmmbuild globin.hmm globins50.msf

    From the alignment of the protein family, hmmbuild produces the trained profile HMM. All the transition and emission parameters are estimated by means of the Expectation-Maximisation algorithm on the aligned sequences. In principle we could use also NON-aligned sequences to train the model; nevertheless, it is more efficient to build the starting alignment using, for example, CLUSTALW.

  • hmmcalibrate [--num N] --histfile globin.histo globin.hmm

    From the trained profile HMM, hmmcalibrate produces an HMM calibrated with accurate E-value statistics. A number N (default 5000) of random sequences are generated and scored with log P(s|M)/P(s|N); the range of scores of true globin sequences lies above that of the random ones. E-value(S): expected number of random sequences with a score > S.

  • Trained model (globin.hmm):

    HMMER2.0 [2.3.2]
    NAME  globins50
    LENG  143
    ALPH  Amino
    RF    no
    CS    no
    MAP   yes
    COM   /home/gigi/bin/hmmbuild globin.hmm globins50.msf
    COM   /home/gigi/bin/hmmcalibrate --histfile globin.histo globin.hmm
    NSEQ  50
    DATE  Sun May 29 19:03:18 2005
    CKSUM 9858
    XT    -8455 -4 -1000 -1000 -8455 -4 -8455 -4
    NULT  -4 -8455
    NULE  595 -1558 85 338 -294 453 -1158 197 249 902 -1085 -142 -21 -313 45 531 201 384 -1998 -644
    EVD   -38.893742 0.243153
    HMM   A C D E F G H I K L M N P Q R S T V W Y
          m->m m->i m->d i->m i->i d->m d->d b->m m->e
          -450 * -1900
    1     591 -1587 159 1351 -1874 -201 151 -1600 998 -1591 -693 389 -1272 595 42 -31 27 -693 -1797 -1134 14
    -     -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249
    -     -23 -6528 -7571 -894 -1115 -701 -1378 -450 *
    2     -926 -2616 2221 2269 -2845 -1178 -325 -2678 -300 -2596 -1810 220 -1592 939 -974 -671 -939 -2204 -2785 -1925 15
    -     -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249
    -     -23 -6528 -7571 -894 -1115 -701 -1378 * *
    3     -638 -1715 -680 497 -2043 -1540 23 -1671 2380 -1641 -840 -222 -1595 437 1040 -564 -523 -1363 2124 -1313 16
    -     -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249
    -     -23 -6528 -7571 -894 -1115 -701 -1378 * *

  • Trained model (globin.hmm), null model. In the file above, the NULE line contains the null model emission scores, and all values are stored as

    Score = INT [ 1000 log2( prob / null_prob ) ]

    where null_prob = 1 for transitions and the natural abundance of the amino acid for emissions.

  • HMMER2.0 [2.3.2]NAME globins50LENG 143ALPH AminoRF noCS noMAP yesCOM /home/gigi/bin/hmmbuild globin.hmm globins50.msfCOM /home/gigi/bin/hmmcalibrate --histfile globin.histo globin.hmmNSEQ 50DATE Sun May 29 19:03:18 2005CKSUM 9858XT -8455 -4 -1000 -1000 -8455 -4 -8455 -4 NULT -4 -8455NULE 595 -1558 85 338 -294 453 -1158 197 249 902 -1085 -142 -21 -313 45 531 201 384 -1998 -644 EVD -38.893742 0.243153HMM A C D E F G H I K L M N P Q R S T V W Y m->m m->i m->d i->m i->i d->m d->d b->m m->e -450 * -1900 1 591 -1587 159 1351 -1874 -201 151 -1600 998 -1591 -693 389 -1272 595 42 -31 27 -693 -1797 -1134 14 - -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249 - -23 -6528 -7571 -894 -1115 -701 -1378 -450 * 2 -926 -2616 2221 2269 -2845 -1178 -325 -2678 -300 -2596 -1810 220 -1592 939 -974 -671 -939 -2204 -2785 -1925 15 - -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249 - -23 -6528 -7571 -894 -1115 -701 -1378 * * 3 -638 -1715 -680 497 -2043 -1540 23 -1671 2380 -1641 -840 -222 -1595 437 1040 -564 -523 -1363 2124 -1313 16 - -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249 - -23 -6528 -7571 -894 -1115 -701 -1378 * *

    Trained model (globin.hmm)Transitions

  • HMMER2.0 [2.3.2]NAME globins50LENG 143ALPH AminoRF noCS noMAP yesCOM /home/gigi/bin/hmmbuild globin.hmm globins50.msfCOM /home/gigi/bin/hmmcalibrate --histfile globin.histo globin.hmmNSEQ 50DATE Sun May 29 19:03:18 2005CKSUM 9858XT -8455 -4 -1000 -1000 -8455 -4 -8455 -4 NULT -4 -8455NULE 595 -1558 85 338 -294 453 -1158 197 249 902 -1085 -142 -21 -313 45 531 201 384 -1998 -644 EVD -38.893742 0.243153HMM A C D E F G H I K L M N P Q R S T V W Y m->m m->i m->d i->m i->i d->m d->d b->m m->e -450 * -1900 1 591 -1587 159 1351 -1874 -201 151 -1600 998 -1591 -693 389 -1272 595 42 -31 27 -693 -1797 -1134 14 - -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249 - -23 -6528 -7571 -894 -1115 -701 -1378 -450 * 2 -926 -2616 2221 2269 -2845 -1178 -325 -2678 -300 -2596 -1810 220 -1592 939 -974 -671 -939 -2204 -2785 -1925 15 - -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249 - -23 -6528 -7571 -894 -1115 -701 -1378 * * 3 -638 -1715 -680 497 -2043 -1540 23 -1671 2380 -1641 -840 -222 -1595 437 1040 -564 -523 -1363 2124 -1313 16 - -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249 - -23 -6528 -7571 -894 -1115 -701 -1378 * *

    Trained model (globin.hmm)Emissions

  • HMMER2.0 [2.3.2]NAME globins50LENG 143ALPH AminoRF noCS noMAP yesCOM /home/gigi/bin/hmmbuild globin.hmm globins50.msfCOM /home/gigi/bin/hmmcalibrate --histfile globin.histo globin.hmmNSEQ 50DATE Sun May 29 19:03:18 2005CKSUM 9858XT -8455 -4 -1000 -1000 -8455 -4 -8455 -4 NULT -4 -8455NULE 595 -1558 85 338 -294 453 -1158 197 249 902 -1085 -142 -21 -313 45 531 201 384 -1998 -644 EVD -38.893742 0.243153HMM A C D E F G H I K L M N P Q R S T V W Y m->m m->i m->d i->m i->i d->m d->d b->m m->e -450 * -1900 1 591 -1587 159 1351 -1874 -201 151 -1600 998 -1591 -693 389 -1272 595 42 -31 27 -693 -1797 -1134 14 - -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249 - -23 -6528 -7571 -894 -1115 -701 -1378 -450 * 2 -926 -2616 2221 2269 -2845 -1178 -325 -2678 -300 -2596 -1810 220 -1592 939 -974 -671 -939 -2204 -2785 -1925 15 - -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249 - -23 -6528 -7571 -894 -1115 -701 -1378 * * 3 -638 -1715 -680 497 -2043 -1540 23 -1671 2380 -1641 -840 -222 -1595 437 1040 -564 -523 -1363 2124 -1313 16 - -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249 - -23 -6528 -7571 -894 -1115 -701 -1378 * *

    Trained model (globin.hmm)Emissions

  • HMMER2.0 [2.3.2]NAME globins50LENG 143ALPH AminoRF noCS noMAP yesCOM /home/gigi/bin/hmmbuild globin.hmm globins50.msfCOM /home/gigi/bin/hmmcalibrate --histfile globin.histo globin.hmmNSEQ 50DATE Sun May 29 19:03:18 2005CKSUM 9858XT -8455 -4 -1000 -1000 -8455 -4 -8455 -4 NULT -4 -8455NULE 595 -1558 85 338 -294 453 -1158 197 249 902 -1085 -142 -21 -313 45 531 201 384 -1998 -644 EVD -38.893742 0.243153HMM A C D E F G H I K L M N P Q R S T V W Y m->m m->i m->d i->m i->i d->m d->d b->m m->e -450 * -1900 1 591 -1587 159 1351 -1874 -201 151 -1600 998 -1591 -693 389 -1272 595 42 -31 27 -693 -1797 -1134 14 - -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249 - -23 -6528 -7571 -894 -1115 -701 -1378 -450 * 2 -926 -2616 2221 2269 -2845 -1178 -325 -2678 -300 -2596 -1810 220 -1592 939 -974 -671 -939 -2204 -2785 -1925 15 - -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249 - -23 -6528 -7571 -894 -1115 -701 -1378 * * 3 -638 -1715 -680 497 -2043 -1540 23 -1671 2380 -1641 -840 -222 -1595 437 1040 -564 -523 -1363 2124 -1313 16 - -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249 - -23 -6528 -7571 -894 -1115 -701 -1378 * *

    Trained model (globin.hmm)Emissions

  • hmmemit [-n N] globin.hmm
    Trained profile-HMM → hmmemit → sequences generated by the model.
    The parameters of the model are used to generate new sequences.
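The generative use can be sketched with a toy, match-states-only emitter (an illustration only; real hmmemit also walks insert and delete states):

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def toy_emit(match_emissions, seed=0):
    """Sample one sequence from a profile HMM that has only match states.
    match_emissions: one list of 20 residue probabilities per position."""
    rng = random.Random(seed)
    return "".join(rng.choices(ALPHABET, weights=w)[0] for w in match_emissions)

# Hypothetical 3-position profile: the first position strongly prefers E.
profile = [[0.01] * 3 + [0.81] + [0.01] * 16, [0.05] * 20, [0.05] * 20]
seq = toy_emit(profile)
```

Different seeds give different sequences, with frequencies following the model's emission probabilities.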

  • hmmsearch globin.hmm Artemia.fa > Artemia.globin
    Trained profile-HMM + set of sequences → hmmsearch → list of sequences that match the HMM (sorted by E-value).

  • Search results (Artemia.globin)

    Sequence Description                  Score    E-value  N
    -------- -----------                  -----    -------  ---
    S13421   S13421                       474.3   1.7e-143  9

    Parsed for domains:
    Sequence Domain  seq-f seq-t    hmm-f hmm-t      score  E-value
    -------- ------- ----- -----    ----- -----      -----  -------
    S13421     7/9     932  1075 ..     1   143 []    76.9  7.3e-24
    S13421     2/9     153   293 ..     1   143 []    63.7  6.8e-20
    S13421     3/9     307   450 ..     1   143 []    59.8  9.8e-19
    S13421     8/9    1089  1234 ..     1   143 []    57.6  4.5e-18
    S13421     9/9    1248  1390 ..     1   143 []    52.3  1.8e-16
    S13421     1/9       1   143 [.     1   143 []    51.2    4e-16
    S13421     4/9     464   607 ..     1   143 []    46.7  8.6e-15
    S13421     6/9     775   918 ..     1   143 []    42.2    2e-13
    S13421     5/9     623   762 ..     1   143 []    23.9  6.6e-08

    Alignments of top-scoring domains:
    S13421: domain 7 of 9, from 932 to 1075: score 76.9, E = 7.3e-24
             *->eekalvksvwgkveknveevGaeaLerllvvyPetkryFpkFkdLss
                +e a vk+ w+ v+ ++ vG +++ l++ +P+ +++FpkF d+
    S13421 932  REVAVVKQTWNLVKPDLMGVGMRIFKSLFEAFPAYQAVFPKFSDVPL 978

                adavkgsakvkahgkkVltalgdavkkldd...lkgalakLselHaqklr
                d++++++ v +h    V t+l++ ++ ld++   +l+ ++L+e H+  lr
    S13421 979  -DKLEDTPAVGKHSISVTTKLDELIQTLDEpanLALLARQLGEDHIV-LR 1026

                vdpenfkllsevllvvlaeklgkeftpevqaalekllaavataLaakYk<
                v+   fk +++vl+ l++ lg+ f+  ++ +++k+++++++ +++  +
    S13421 1027 VNKPMFKSFGKVLVRLLENDLGQRFSSFASRSWHKAYDVIVEYIEEGLQ 1075

    Slide callouts: number of domains (N); domains sorted by E-value; start/end positions in the sequence (seq-f, seq-t) and in the model (hmm-f, hmm-t); consensus sequence (top line of each alignment block) vs. target sequence (bottom line).
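The "Parsed for domains" table is plain whitespace-separated text; a minimal parser (my own sketch, assuming the HMMER2 column layout shown above):

```python
def parse_domain_line(line):
    """Parse one row of hmmsearch's 'Parsed for domains' table (HMMER2 layout)."""
    f = line.split()
    return {"seq": f[0], "domain": f[1],
            "seq_from": int(f[2]), "seq_to": int(f[3]),
            "hmm_from": int(f[5]), "hmm_to": int(f[6]),
            "score": float(f[8]), "evalue": float(f[9])}

row = parse_domain_line("S13421  7/9  932 1075 ..  1 143 []  76.9  7.3e-24")
# row["seq_from"] == 932, row["evalue"] == 7.3e-24
```

Sorting the parsed rows by the "evalue" key reproduces the order hmmsearch prints them in.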

  • hmmalign globin.hmm globins630.fa
    Trained profile-HMM + set of sequences → hmmalign → alignment of all sequences to the model.

    BAHG_VITSP QAG-..VAAAHYPIV.GQELLGAIKEV.L.G.D.AATDDILDAWGKAYGV
    GLB1_ANABR TR-K..ISAAEFGKI.NGPIKKVLAS-.-.-.K.NFGDKYANAWAKLVAV
    GLB1_ARTSX NRGT..-DRSFVEYL.KESL-----GD.S.V.D.EFT------VQSFGEV
    GLB1_CALSO TRGI..TNMELFAFA.LADLVAYMGTT.I.S.-.-FTAAQKASWTAVNDV
    GLB1_CHITH -KSR..ASPAQLDNF.RKSLVVYLKGA.-.-.T.KWDSAVESSWAPVLDF
    GLB1_GLYDI GNKH..IKAQYFEPL.GASLLSAMEHR.I.G.G.KMNAAAKDAWAAAYAD
    GLB1_LUMTE ER-N..LKPEFFDIF.LKHLLHVLGDR.L.G.T.HFDF---GAWHDCVDQ
    GLB1_MORMR QSFY..VDRQYFKVL.AGII-------.-.-.A.DTTAPGDAGFEKLMSM
    GLB1_PARCH DLNK..VGPAHYDLF.AKVLMEALQAE.L.G.S.DFNQKTRDSWAKAFSI
    GLB1_PETMA KSFQ..VDPQYFKVL.AAVI-------.-.-.V.DTVLPGDAGLEKLMSM
    GLB1_PHESE QHTErgTKPEYFDLFrGTQLFDILGDKnLiGlTmHFD---QAAWRDCYAV

    Slide callouts: insertions relative to the model (lower-case columns, e.g. in GLB1_PHESE) and gaps (dashes).

  • HMMER applications: PFAM
    http://www.sanger.ac.uk/Software/Pfam/

  • PFAM Exercise
    Generate a sequence from the globin model with hmmemit and search for it in the PFAM database.

  • PFAM Exercise
    Search the SwissProt database for the sequences CG301_HUMAN and Q9H5F4_HUMAN, then:
    1) search them in the PFAM database;
    2) launch PSI-BLAST searches. Is it possible to annotate the sequences by means of the BLAST results?

  • SAM at UCSC: http://www.soe.ucsc.edu/research/compbio/sam.html
    Krogh A, Brown M, Mian IS, Sjolander K & Haussler D (1994) Hidden Markov Models in computational biology: applications to protein modelling. J. Mol. Biol. 235, 1501-1531.

  • SAM applications:
    http://www.cse.ucsc.edu/research/compbio/HMM-apps/T02-query.html

  • HMMPRO: http://www.netid.com/html/hmmpro.html
    Pierre Baldi, Net-ID

  • HMMs for mapping problems
    - Mapping problems in protein prediction

  • Covalent structure
    TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
    Secondary structure

  • Topology of membrane proteins
    Topography: position of the transmembrane segments along the sequence.

    ALALMLCMLTYRHKELKLKLKK   ALALMLCMLTYRHKELKLKLKK   ALALMLCMLTYRHKELKLKLKK

  • HMMs for mapping problems
    - Mapping problems in protein prediction
    - Labelled HMMs

  • HMM for secondary structure prediction

    Simplest model / Introducing a grammar
    [State diagram: states a1, a2, a3, b1, b2, c]

  • HMM for secondary structure prediction
    Labels

    The states a1, a2 and a3 share the same label, and so do states b1 and b2. Decoding the Viterbi path for an emitted sequence s yields a mapping between the sequence s and a sequence of labels y.

    s:    S  A  L  K  M  N  Y  T  R  E  I  M  V  A  S  N  Q   (sequence)
    p:    c  a1 a2 a3 a4 a4 a4 c  c  c  c  b1 b2 b2 b2 c  c   (path)
    Y(p): c  a  a  a  a  a  a  c  c  c  c  b  b  b  b  c  c   (labels)
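The path-to-label mapping Y(p) is just a per-state lookup; a minimal sketch (the state-to-label table below matches the slide's grammar):

```python
# Hypothetical state -> label table for the grammar on the slide.
LABEL = {"a1": "a", "a2": "a", "a3": "a", "a4": "a",
         "b1": "b", "b2": "b", "c": "c"}

def labels_of_path(path):
    """Map a Viterbi state path p to its label sequence Y(p)."""
    return "".join(LABEL[state] for state in path)

p = ["c", "a1", "a2", "a3", "a4", "a4", "a4", "c", "c", "c", "c",
     "b1", "b2", "b2", "b2", "c", "c"]
print(labels_of_path(p))  # -> "caaaaaaccccbbbbcc"
```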

  • Computing P(s, y | M)
    Only the paths whose labelling is y have to be considered in the sum.
    In the Forward and Backward algorithms this means setting

    Fk(i) = 0,  Bk(i) = 0   if  Y(k) ≠ yi

  • Baum-Welch training algorithm for labelled HMMsGiven a set of known labelled sequences (e.g. amino acid sequences and their native secondary structure) we want to find the parameters of the model, without knowing the generating paths:

    θML = argmaxθ [ P(s, y | θ, M) ]

    The algorithm is the same as in the non-labelled case if we use the Forward and Backward matrices defined in the last slide.

    Supervised learning of the mapping

  • HMMs for mapping problems
    - Mapping problems in protein prediction
    - Labelled HMMs
    - Duration modelling

  • Self-loops and geometric decay
    P(l) = p^(l-1) · (1-p)
    The length distribution of segments generated by a self-loop is always exponential-like (geometric).
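A quick numerical check of the geometric duration law (illustrative only): the distribution sums to 1 and its mean segment length is 1/(1-p).

```python
def p_len(l, p):
    """Duration of a state with self-loop probability p: P(l) = p**(l-1) * (1-p)."""
    return p ** (l - 1) * (1 - p)

p = 0.9
total = sum(p_len(l, p) for l in range(1, 2000))       # truncated sum, tail is negligible
mean = sum(l * p_len(l, p) for l in range(1, 2000))
print(round(total, 6), round(mean, 3))  # -> 1.0 10.0  (mean = 1/(1-p))
```

With p = 0.9 the most probable length is still l = 1, which is exactly why a single self-loop cannot fit the observed helix/strand length statistics.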

  • How can we model other length distributions?
    Limited case: this topology can model any length distribution between 1 and N.

  • How can we model other length distributions?
    Non-limited case: this topology can model any length distribution between 1 and N-1, and a geometric decay from N onwards.

  • Secondary structure: length statistic

    [Chart: length distribution (frequency vs. length in residues) of Helix, Strand and Coil segments; data from database 8rxna.odc.txt]

  • Secondary structure: model
    [State diagram: helix states a1–a14, strand states b1–b5, coil states c1–c4]
    Do we use the same emission probabilities for states sharing the same label?

  • HMMs for mapping problems
    - Mapping problems in protein prediction
    - Labelled HMMs
    - Duration modelling
    - Models for membrane proteins

  • Porin (Rhodobacter capsulatus): β-barrel, outer membrane
    Bacteriorhodopsin (Halobacterium salinarum): α-helices, inner membrane


  • A generic model for membrane proteins (TMHMM)
    States: Transmembrane, Inner Side, Outer Side; Begin and End.

  • Model of β-barrel membrane proteins

  • Model of β-barrel membrane proteins
    Labels:
    - Transmembrane states
    - Loop states (inner side, outer side)

  • Model of β-barrel membrane proteins
    Length of transmembrane β-strands:
    - Minimum: 6 residues
    - Maximum: unbounded

  • Model of β-barrel membrane proteins
    Six different sets of emission parameters:
    - Outer loop
    - Inner loop
    - Long globular domains
    - TM strand edges
    - TM strand core

  • Model of α-helix membrane proteins (HMM1)
    States: Transmembrane, Inner Side, Outer Side

  • Model of α-helix membrane proteins (HMM2)
    States: Transmembrane (×10 state chains), Inner Side, Outer Side

  • Dynamic programming filtering procedure

    [Chart: per-residue transmembrane-segment (TMS) probability along the sequence of porin 1A0S (data file 1a0spTOT.SumPostHmm); probability peaks mark the predicted transmembrane segments, shown against the annotated ones]

    0.4053230000

    0.3486510000

    0.2424030000

    0.2221580000

    0.219150000

    0.2481450000

    0.239810000

    0.2232530000

    0.2188810000

    0.2080170000

    0.2079410000

    0.2051190000

    0.2163560000

    0.236040000

    0.2402960000

    0.2461390000

    0.2883730000

    0.3173330000

    0.3739950000

    0.4779860000

    0.5683140.50.5010

    0.6201880.50.5010

    0.7110020.50.40.50.411

    0.764140.50.40.50.411

    0.7846370.50.40.50.411

    0.7833170.50.40.50.411

    0.7577150.50.40.50.411

    0.6887260.50.40.50.411

    0.6692110.50.40.50.411

    0.6764520.50.40.50.411

    0.6475440.50.40.50.411

    0.6692770.400.401

    0.7571380.400.401

    0.7918510.400.401

    0.7874560.400.401

    0.752780.400.401

    0.7330170.400.401

    0.7068120.400.401

    0.5916330000

    0.5853450000

    0.5924160.50.5010

    0.6332830.50.5010

    0.6685520.50.40.50.411

    0.7349970.50.40.50.411

    0.6900040.50.40.50.411

    0.7266730.50.40.50.411

    0.7246540.50.40.50.411

    0.7281180.50.40.50.411

    0.6945090.50.40.50.411

    0.6629540.50.40.50.411

    0.6295240.50.5010

    0.5810390000

    0.5620010000

    0.5216940000

    0.4792810000

    0.3772030000

    0.3272180000

    0.2129580000

    0.166570000

    0.1191690000

    0.1131590000

    0.1082450000

    0.1145160000

    0.1128220000

    0.1277940000

    0.1455780000

    0.1761290000

    0.2100410000

    0.337460000

    0.5086190.50.5010

    0.6590390.50.40.50.411

    0.8366910.50.40.50.411

    0.8781030.50.40.50.411

    0.8988960.50.40.50.411

    0.906660.50.40.50.411

    0.91370.50.40.50.411

    0.9214210.50.40.50.411

    0.8918980.50.40.50.411

    0.8780430.50.40.50.411

    0.8684520.400.401

    0.747940.400.401

    0.5675080000

    0.3265620000

    0.2446620000

    0.3027950000

    0.603960.50.5010

    0.7198970.50.40.50.411

    0.8430560.50.40.50.411

    0.907010.50.40.50.411

    0.9354060.50.40.50.411

    0.9429540.50.40.50.411

    0.9386520.50.40.50.411

    0.9347150.50.40.50.411

    0.9179070.50.40.50.411

    0.8682890.50.40.50.411

    0.7708570.50.40.50.411

    0.6421340.50.5010

    0.4129030.50.5010

    0.2843980000

    0.1328740000

    0.06860250000

    0.05088970000

    0.05725340000

    0.1076110000

    0.1945760000

    0.3876050.50.5010

    0.6497290.50.5010

    0.8271490.50.40.50.411

    0.9353070.50.40.50.411

    0.9714410.50.40.50.411

    0.9846690.50.40.50.411

    0.9888860.50.40.50.411

    0.9881970.50.40.50.411

    0.9884220.50.40.50.411

    0.9867960.50.40.50.411

    0.9799640.50.40.50.411

    0.9724330.50.40.50.411

    0.9207440.50.40.50.411

    0.7866180.50.40.50.411

    0.5235560000

    0.3874330000

    0.1665550000

    0.2546060000

    0.5441020.50.5010

    0.6596140.50.40.50.411

    0.9199840.50.40.50.411

    0.9530690.50.40.50.411

    0.9748230.50.40.50.411

    0.9762020.50.40.50.411

    0.9704140.50.40.50.411

    0.9631630.50.40.50.411

    0.952890.50.40.50.411

    0.8960970.50.40.50.411

    0.7775760.50.40.50.411

    0.6621560.50.40.50.411

    0.472290.50.5010

    0.383830.50.5010

    0.2055910.50.5010

    0.1788120000

    0.1285260000

    0.1033490000

    0.08888540000

    0.09403840000

    0.1067310000

    0.155510000

    0.2052570000

    0.248240000

    0.3020630.50.5010

    0.3516480.50.5010

    0.4116790.50.5010

    0.5297210.50.5010

    0.6529490.50.40.50.411

    0.821790.50.40.50.411

    0.8345660.50.40.50.411

    0.8513020.50.40.50.411

    0.8607770.50.40.50.411

    0.8622230.50.40.50.411

    0.8629470.50.40.50.411

    0.8646750.50.40.50.411

    0.8772070.50.40.50.411

    0.8684680.50.40.50.411

    0.8388610.400.401

    0.7215140.400.401

    0.6553360000

    0.5548570000

    0.5291380000

    0.3286630000

    0.3712780000

    0.5985940000

    0.6384510.400.401

    0.7272190.400.401

    0.728370.400.401

    0.7551850.400.401

    0.8039270.50.40.50.411

    0.8444410.50.40.50.411

    0.8719940.50.40.50.411

    0.8930230.50.40.50.411

    0.896330.50.40.50.411

    0.8864040.50.40.50.411

    0.8714640.50.40.50.411

    0.849750.50.40.50.411

    0.7949270.50.40.50.411

    0.523520.50.5010

    0.4903710.50.5010

    0.2948610000

    0.2677270000

    0.1808290000

    0.1651280000

    0.1561950000

    0.1544010000

    0.1500780000

    0.1251370000

    0.114420000

    0.1058830000

    0.09693990000

    0.07680290000

    0.06845020000

    0.06506290000

    0.06846160000

    0.07231990000

    0.07406590000

    0.1406170000

    0.2060730000

    0.3477350000

    0.4676180000

    0.6349270000

    0.8669760.50.40.50.411

    0.9679740.50.40.50.411

    0.9894650.50.40.50.411

    0.9929420.50.40.50.411

    0.9946010.50.40.50.411

    0.9947440.50.40.50.411

    0.9949460.50.40.50.411

    0.9874790.50.40.50.411

    0.9565310.50.40.50.411

    0.8711110.50.40.50.411

    0.7584670.50.40.50.411

    0.3446680000

• Dynamic programming filtering procedure: maximum-scoring subsequences with constrained segment length and number
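The filtering step above can be illustrated for the simplest case, a single segment: given per-position scores (e.g. posterior log-odds of being in a TMS state), find the contiguous stretch whose total score is maximal subject to a minimum and maximum segment length. A minimal sketch, assuming the function name and the prefix-sum/sliding-minimum formulation (not taken from the slides):

```python
from collections import deque

def best_segment(scores, lmin, lmax):
    """Highest-scoring contiguous segment with lmin <= length <= lmax.

    Returns (start, end_exclusive, score). Runs in O(n) using prefix
    sums and a sliding-window minimum over candidate segment starts.
    """
    n = len(scores)
    prefix = [0.0]
    for s in scores:
        prefix.append(prefix[-1] + s)

    best = (None, None, float("-inf"))
    dq = deque()  # candidate start indices k, with prefix[k] increasing
    for j in range(lmin, n + 1):          # j = segment end (exclusive)
        k = j - lmin                       # newly eligible start index
        while dq and prefix[dq[-1]] >= prefix[k]:
            dq.pop()                       # dominated starts can never win
        dq.append(k)
        while dq and dq[0] < j - lmax:
            dq.popleft()                   # too far back for length <= lmax
        score = prefix[j] - prefix[dq[0]]  # best segment ending at j
        if score > best[2]:
            best = (dq[0], j, score)
    return best

# Example: scores favour positions 2-3; with lengths in [2, 3]
# the best segment is scores[2:4] with total score 7.
print(best_segment([1, -2, 3, 4, -1, 2], 2, 3))
```

Extending this to a fixed number of non-overlapping segments adds one DP dimension (segment count), which is the constrained variant the slide refers to.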

[Chart (Grafico1): per-position TMS probability along the sequence of 1A0S, with the predicted TMS segments overlaid (data series 1a0spTOT.SumPostHmm); raw chart values omitted]