PROLOGUE: Pitfalls of standard alignments
A: ALAEVLIRLITKLYP
B: ASAKHLNRLITELYP
Score(A, B) = Σ_i s(A_i, B_i)
Blosum62
Scoring a pairwise alignment
Alignment of a family (globins)
Different positions are not equivalent
http://weblogo.berkeley.edu/cache/file5h2DWc.png
The substitution score IN A FAMILY should depend on the position (the same for gaps)
Sequence logos
For modelling families we need more flexible tools
Probabilistic Models for Biological Sequences
•What are they?
Generative definition:
•Objects producing different outcomes (sequences) with different probabilities
•The probability distribution over the sequences space determines the model specificity
Probabilistic models for sequences
[Figure: model M and the probability distribution it defines over the sequence space]
M generates s_i with probability P(s_i | M).
E.g.: M is the representation of the family of globins.
Associative definition:
•Objects that, given an outcome (sequence), compute a probability value
Probabilistic models for sequences
[Figure: model M and the probability distribution it defines over the sequence space]
M associates the probability P(s_i | M) to s_i.
E.g.: M is the representation of the family of globins.
We don't need a generator of new biological sequences: the generative definition is useful as an operative definition.
Probabilistic models for sequences
[Figure: a probability distribution over the sequence space]
The most useful probabilistic models are trainable systems.
The probability density function over the sequence space is estimated from known examples by means of a learning algorithm
[Figure: known examples in the sequence space and the estimated pdf (generalization)]
e.g.: Writing a generic representation of the sequences of globins starting from a set of known globins
Probabilistic Models for Biological Sequences
•What are they?
•Why use them?
Modelling a protein family
Probabilistic model
Seq1 Seq2 Seq3 Seq4 Seq5 Seq6
0.98 0.21 0.12 0.89 0.47 0.78
Given a protein class (e.g. Globins), a probabilistic model trained on this family can compute a probability value for each new sequence
This value measures the similarity between the new sequence and the family described by the model
Probabilistic Models for Biological Sequences
•What are they?
•Why use them?
•Which probabilities do they compute?
A model M associates to a sequence s the probability P( s | M )
This probability answers the question:
What is the probability that a model describing the globins generates the sequence s?
The question we want to answer is:
Given a sequence s, is it a Globin?
We need to compute P( M | s ) !!
P( s | M ) or P( M | s ) ?
P(X,Y) = P(X | Y) P(Y) = P(Y | X) P(X) Joint probability
So:

P(Y | X) = P(X | Y) P(Y) / P(X)

Bayes Theorem

P(M | s) = P(s | M) P(M) / P(s)   (P(M) and P(s) are the a priori probabilities)

P(M | s): evidence s, conclusion M
P(s | M): evidence M, conclusion s
The A priori probabilities
P(M) is the probability of the model (i.e. of the class described by the model) BEFORE we know the sequence:
can be estimated as the abundance of the class
P(s) is the probability of the sequence in the sequence space.
Cannot be reliably estimated!!
Comparison between models
We can overcome the problem by comparing the probabilities of generating s with different models:

P(M1 | s) / P(M2 | s) = [P(s | M1) P(M1) / P(s)] / [P(s | M2) P(M2) / P(s)] = [P(s | M1) P(M1)] / [P(s | M2) P(M2)]

P(M1) / P(M2) is the ratio between the abundances of the two classes.
Null model
Otherwise we can score a sequence for a model M comparing it to a Null Model: a model that generates ALL the possible sequences with probabilities depending ONLY on the statistical amino acid abundance
S(M, s) = log [ P(s | M) / P(s | N) ]
In this case we need a threshold and a statistic for evaluating the significance (E-value, P-value)
[Figure: distribution of the score S(M, s) for sequences belonging to model M and for sequences NOT belonging to model M]
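As an illustration (a sketch, not part of the slides), the log-odds score against a null model that emits residues independently with fixed background frequencies can be computed like this; the background values below are an assumed uniform distribution:

import math

def null_prob(seq, background):
    # P(s | N): product of position-independent background frequencies
    p = 1.0
    for c in seq:
        p *= background[c]
    return p

def log_odds_score(p_s_given_m, seq, background):
    # S(M, s) = log [ P(s|M) / P(s|N) ]
    return math.log(p_s_given_m / null_prob(seq, background))

background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed uniform
print(log_odds_score(1.0e-6, "GATTACA", background))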
The simplest probabilistic models: Markov Models
•Definition
[Diagram: four states] C: Clouds, R: Rain, F: Fog, S: Sun
Markov Models Example: Weather
Register the weather conditions day by day:
as a first hypothesis the weather condition in a day depends ONLY on the weather conditions in the day before.
Define the conditional probabilities
P(C|C), P(C|R),…. P(R|C)…..
The probability for the 5-days registration CRRCS
P(CRRCS) = P(C)·P(R|C) ·P(R|R) ·P(C|R) ·P(S|C)
Stochastic generator of sequences in which the probability of state in position i depends ONLY on the state in position i-1
Markov Model
Given an alphabet C = {c1, c2, c3, …, cN}, a Markov model is described by N×(N+2) parameters {a_rt, a_BEGIN,t, a_r,END ; r, t ∈ C}:

a_rt = P(s_i = t | s_{i-1} = r)
a_BEGIN,t = P(s_1 = t)
a_r,END = P(END | s_T = r)

Normalisation: Σ_t a_rt + a_r,END = 1 ∀r;   Σ_t a_BEGIN,t = 1
[Diagram: states c1, c2, c3, c4, …, cN fully connected, with BEGIN and END states]
Markov Models
Given the sequence s = s1 s2 s3 s4 s5 … sT, with s_i ∈ C = {c1, c2, c3, …, cN}:

P(s | M) = P(s1) · Π_{i=2}^T P(s_i | s_{i-1}) = a_BEGIN,s1 · [Π_{i=2}^T a_{s_{i-1} s_i}] · a_{s_T,END}

P("ALKALI") = a_BEGIN,A · a_AL · a_LK · a_KA · a_AL · a_LI · a_I,END
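A minimal Python sketch of this scoring formula; the numeric values below are made-up placeholders (the slides give no parameters for this alphabet), chosen only so that each row is properly normalised:

def markov_prob(seq, a_begin, a, a_end):
    # P(s|M) = a_BEGIN,s1 * prod_{i=2}^T a_{s_{i-1} s_i} * a_{s_T,END}
    p = a_begin[seq[0]]
    for prev, curr in zip(seq, seq[1:]):
        p *= a[(prev, curr)]
    return p * a_end[seq[-1]]

alphabet = "ALKI"                                  # hypothetical mini-alphabet
a_begin = {c: 0.25 for c in alphabet}              # sums to 1
a_end = {c: 0.10 for c in alphabet}
a = {(r, t): 0.225 for r in alphabet for t in alphabet}  # 4*0.225 + 0.10 = 1 per row

print(markov_prob("ALKALI", a_begin, a, a_end))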
Markov Models: Exercise
[Diagrams: two four-state weather models (C, R, F, S) with transition probabilities on the arrows; some values are left as ??]
1) Fill in the undefined values of the transition probabilities.
Markov Models: Exercise
[Diagrams: the two weather models with all transition probabilities filled in]
2) Which model better describes the weather in summer? Which one describes the weather in winter?
Markov Models: Exercise
[Diagrams: the two weather models, labelled Winter and Summer]
3) Given the sequence CSSSCFS, which model gives the higher probability? [Consider the starting probabilities: P(X|BEGIN) = 0.25]
Markov Models: Exercise
P(CSSSCFS | Winter) = 0.25 × 0.1 × 0.2 × 0.2 × 0.3 × 0.2 × 0.2 = 1.2 × 10^-5
P(CSSSCFS | Summer) = 0.25 × 0.4 × 0.8 × 0.8 × 0.1 × 0.1 × 1.0 = 6.4 × 10^-4
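The two products can be checked with a few lines of Python; the transition values below are read off the worked computation above (e.g. a(C,S) = 0.1 in Winter), not independently from the diagrams:

winter = {("C", "S"): 0.1, ("S", "S"): 0.2, ("S", "C"): 0.3,
          ("C", "F"): 0.2, ("F", "S"): 0.2}
summer = {("C", "S"): 0.4, ("S", "S"): 0.8, ("S", "C"): 0.1,
          ("C", "F"): 0.1, ("F", "S"): 1.0}

def likelihood(seq, a, p_start=0.25):
    # P(X|BEGIN) = 0.25 for every state, as stated in the exercise
    p = p_start
    for prev, curr in zip(seq, seq[1:]):
        p *= a[(prev, curr)]
    return p

print(likelihood("CSSSCFS", winter))   # 1.2e-05
print(likelihood("CSSSCFS", summer))   # 6.4e-04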
4) Can we conclude that the observation sequence refers to a summer week?
[Diagrams: the Winter and Summer models as above]
Markov Models: Exercise
P(Seq | Winter) = 1.2 × 10^-5
P(Seq | Summer) = 6.4 × 10^-4
[Diagrams: the Winter and Summer models as above]

P(Summer | Seq) / P(Winter | Seq) = [P(Seq | Summer) P(Summer)] / [P(Seq | Winter) P(Winter)]
[Diagram: four states A, C, G, T fully connected]
DNA: C = {Adenine, Cytosine, Guanine, Thymine}
16 transition probabilities (12 of which independent) + 4 BEGIN probabilities + 4 END probabilities.
Simple Markov Model for DNA sequences
The parameters of the model are different in different zones of DNA
They describe the overall composition and the recurrence of nucleotide pairs.
Example of Markov Models: GpC Islands
[Diagrams: two DNA models, GpC Islands and Non-GpC Islands]
In the Markov model of GpC Islands, a_GC is higher than in the Markov model of Non-GpC Islands.
Given a sequence s we can evaluate
P(GpC | s) = [P(s | GpC) · P(GpC)] / [P(s | GpC) · P(GpC) + P(s | nonGpC) · P(nonGpC)]
GATGCGTCGC
CTACGCAGCG
The simplest probabilistic models: Markov Models
•Definition
•Training
Probabilistic training of a parametric method
Generally speaking, a parametric model M aims to reproduce a set of known data.
[Diagram: model M with parameters θ produces modelled data, to be compared with the real data D]
How to compare them?
Let θ be the set of parameters of model M. During the training phase, the parameters θ are estimated from the set of known data D.

Maximum Likelihood Estimation (ML):
θ_ML = argmax_θ P(D | M, θ)

Maximum A Posteriori Estimation (MAP):
θ_MAP = argmax_θ P(θ | M, D) = argmax_θ [P(D | M, θ) P(θ)]

Training of Markov Models: it can be proved that the ML estimate is

a_ik = n_ik / Σ_j n_ij

i.e. the frequency of occurrence as counted in the data set D.
Example (coin-tossing)
Given N tossing of a coin (our data D), the outcomes are h heads and t tails (N=t+h)
ASSUME the model:

P(D | M) = p^h (1 - p)^t

Computing the maximum of the likelihood P(D | M):

dP(D | M)/dp = p^(h-1) (1 - p)^(t-1) [h(1 - p) - tp] = 0

we obtain the estimate

p = h / (h + t) = h / N
Example (Error measure)
Suppose you think that your data are affected by a Gaussian error, so that they are distributed according to

F(x_i) = A · exp[-(x_i - μ)² / (2σ²)],   with A = 1/(σ√(2π))

If your measures are independent, the data likelihood is

P(Data | model) = Π_i F(x_i)

Find the μ and σ that maximise P(Data | model).
Maximum Likelihood training: Proof

Given a sequence s contained in D, s = s1 s2 s3 s4 s5 … sT:

P(s | M) = a_BEGIN,s1 · [Π_{i=2}^T a_{s_{i-1} s_i}] · a_{s_T,END}

We can count the number of transitions between any two states j and k: n_jk. Then

P(s | M) = Π_{j=0}^{N} Π_{k=1}^{N+1} a_jk^{n_jk}

where states 0 and N+1 are BEGIN and END.

Maximising log P(s | M) = Σ_{j,k} n_jk log a_jk under the normalisation constraints Σ_k a_jk = 1 (taken into account using the Lagrange multipliers λ_j) gives

a_jk = n_jk / Σ_{k'} n_jk'
Hidden Markov Models
•Preliminary examples
Given a sequence:
4156266656321636543662152611536264162364261664616263
We don’t know the sequence of dice that generated it.
RRRRRLRLRRRRRRRLRRRRRRRRRRRRLRLRRRRRRRRLRRRRLRRRRLRR
Loaded dice
We have 99 regular dice (R) and 1 loaded die (L).
     P(1)  P(2)  P(3)  P(4)  P(5)  P(6)
R    1/6   1/6   1/6   1/6   1/6   1/6
L    1/10  1/10  1/10  1/10  1/10  1/2
Hypothesis:
We choose a die for each roll.
Two stochastic processes give origin to the sequence of observations.
1) Choosing the die (R or L). 2) Rolling the die.
The sequence of dice is hidden
The first process is assumed to be Markovian (in this case a 0-order MM)
The outcome of the second process depends only on the state reached in the first process (that is the chosen die)
Loaded dice
Model
Each state (R and L) generates a character of the alphabet
C = {1, 2, 3, 4, 5, 6 }
The emission probabilities depend only on the state.
The transition probabilities describe a Markov model that generates a state path: the hidden sequence (π).
The observation sequence (s) is generated by two concomitant stochastic processes.
[Diagram: states R and L; a_RL = a_LR = 0.01, a_RR = a_LL = 0.99]
Casino
The observation sequence (s) is generated by two concomitant stochastic processes.
[Diagram: states R and L; a_RL = a_LR = 0.01, a_RR = a_LL = 0.99]
Choose the state: R (probability 0.99)
Choose the symbol: 1 (probability 1/6, given R)
4156266656321636543662152611
RRRRRLRLRRRRRRRLRRRRRRRRRRRR
The observation sequence (s) is generated by two concomitant stochastic processes.
[Diagram: states R and L as above]
Choose the state: L (probability 0.01)
Choose the symbol: 5 (probability 1/10, given L)
41562666563216365436621526115
RRRRRLRLRRRRRRRLRRRRRRRRRRRRL
Model
Each state (R and L) generates a character of the alphabet C = {1, 2, 3, 4, 5, 6}.
The emission probabilities depend only on the state.
The transition probabilities describe a Markov model that generates a state path: the hidden sequence (π).
The observation sequence (s) is generated by two concomitant stochastic processes.
[Diagram: states R and L; a_RL = a_LR = 0.01, a_RR = a_LL = 0.99]
Loaded dice
Some not-so-serious examples
1) DEMOGRAPHY
Observable: number of births and deaths in a year in a village. Hidden variable: economic conditions (as a first approximation we can consider success in business as a random variable and, by consequence, wealth as a Markov variable).
---> Can we deduce the economic conditions of a village during a century by means of the register of births and deaths?
2) THE METEOROPATHIC TEACHER
Observable: average of the marks that a meteoropathic teacher gives to the students during a day. Hidden variable: weather conditions.
---> Can we deduce the weather conditions during a year by means of the class register?
To be more serious
1) SECONDARY STRUCTURE Observable: protein sequence Hidden variable: secondary structure
---> can we deduce (predict) the secondary structure of a protein given its amino acid sequence?
2) ALIGNMENT Observable: protein sequence Hidden variable: position of each residue along the alignment of a protein family
---> can we align a protein to a family, starting from its amino acid sequence?
Hidden Markov Models
•Preliminary examples
•Formal definition
Formal definition of Hidden Markov Models
A HMM is a stochastic generator of sequences characterised by:
•N states
•a set of transition probabilities between states {a_kj}: a_kj = P(π(i) = j | π(i-1) = k)
•a set of starting probabilities {a_0k}: a_0k = P(π(1) = k)
•a set of ending probabilities {a_k0}: a_k0 = P(π(i) = END | π(i-1) = k)
•an alphabet C with M characters
•a set of emission probabilities for each state {e_k(c)}: e_k(c) = P(s_i = c | π(i) = k)

Constraints:
Σ_k a_0k = 1;   a_k0 + Σ_j a_kj = 1 ∀k;   Σ_{c∈C} e_k(c) = 1 ∀k
s: sequence; π: path through the states
Generating a sequence with a HMM:
1. Choose the initial state π(1) following the probabilities a_0k; set i = 1.
2. Choose the character s_i from the alphabet C following the probabilities e_k(c).
3. Choose the next state following the probabilities a_kj and a_k0.
4. Is the END state chosen? If yes, end; if no, set i ← i + 1 and go to step 2.
s: AGCGCGTAATCTG
π: YYYYYYYNNNNNN
P(s, π | M) can be easily computed.
GpC Island, simple model
[Diagram: two states, Y (GpC Island) and N (Non-GpC Island)]
BEGIN: a_0Y = 0.2, a_0N = 0.8
END: a_Y0 = 0.1, a_N0 = 0.1
Transitions: a_YY = 0.7, a_YN = 0.2, a_NY = 0.1, a_NN = 0.8
Emissions: e_Y(A) = 0.1, e_Y(C) = 0.4, e_Y(G) = 0.4, e_Y(T) = 0.1
           e_N(A) = e_N(C) = e_N(G) = e_N(T) = 0.25
P(s, π | M) can be easily computed.
[The GpC-island model parameters as above]

s:          A   G   C   G   C   G   T   A    A    T    C    T    G
π:          Y   Y   Y   Y   Y   Y   Y   N    N    N    N    N    N
Emission:   0.1 0.4 0.4 0.4 0.4 0.4 0.1 0.25 0.25 0.25 0.25 0.25 0.25
Transition: 0.2 0.7 0.7 0.7 0.7 0.7 0.7 0.2  0.8  0.8  0.8  0.8  0.8  0.1

Multiplying all the probabilities gives the probability of having the sequence AND the path through the states.
Evaluation of the joint probability of the sequence and the path:

P(s, π | M) = P(s | π, M) · P(π | M)

P(π | M) = a_{0,π(1)} · [Π_{i=2}^T a_{π(i-1),π(i)}] · a_{π(T),0}

P(s | π, M) = Π_{i=1}^T e_{π(i)}(s_i)

P(s, π | M) = a_{0,π(1)} · Π_{i=1}^T e_{π(i)}(s_i) · a_{π(i),π(i+1)}   (with a_{π(T),π(T+1)} ≡ a_{π(T),0})
Hidden Markov Models
•Preliminary examples
•Formal definition
•Three questions
s: AGCGCGTAATCTG
π: ?????????????
P(s, π | M) can be easily computed. How to evaluate P(s | M)?
GpC Island, simple model
[The GpC-island model parameters as above]
How to evaluate P ( s | M )?
[The GpC-island model parameters as above]
s: A G C G C G T A A T C T G
π: Y Y Y Y Y Y Y Y Y Y Y Y Y
   Y Y Y Y Y Y Y Y Y Y Y Y N
   Y Y Y Y Y Y Y Y Y Y Y N Y
   Y Y Y Y Y Y Y Y Y Y Y N N
   Y Y Y Y Y Y Y Y Y Y N Y Y
   …
There are 2^13 different paths. Summing over all the paths gives the probability of having the sequence:

P(s | M) = Σ_π P(s, π | M)
s: AGCGCGTAATCTG
π: ?????????????
P(s, π | M) can be easily computed. How to evaluate P(s | M)? Can we show the hidden path?
Resumé: GpC Island, simple model
[The GpC-island model parameters as above]
Can we show the hidden path?
[The GpC-island model parameters as above]
s: A G C G C G T A A T C T G has 2^13 different paths.
Viterbi path: the path that gives the best joint probability:

π* = argmax_π [ P(π | s, M) ] = argmax_π [ P(π, s | M) ]
A Posteriori decoding
For each position i choose the state: π'(i) = argmax_k [ P(π(i) = k | s, M) ]
The contribution to this probability derives from all the paths that go through the state k at position i.
The A posteriori path can be a non-sense path (it may not be a legitimate path if some transitions are not permitted in the model)
Can we show the hidden path?
s: AGCGCGTAATCTG
π: YYYYYYYNNNNNN
P(s, π | M) can be easily computed. How to evaluate P(s | M)? Can we show the hidden path? Can we evaluate the parameters starting from known examples?
GpC Island, simple model
[Diagram: the two-state GpC-island model with all parameters unknown: a_0Y, a_0N, a_YY, a_YN, a_NY, a_NN, a_Y0, a_N0 and all emission probabilities e_Y(·), e_N(·) marked "?"]
Can we evaluate the parameters starting from known examples?
[The model with unknown parameters, as above]
s:          A      G      C      G      C      G      T      A      A      T      C      T      G
π:          Y      Y      Y      Y      Y      Y      Y      N      N      N      N      N      N
Emission:   e_Y(A) e_Y(G) e_Y(C) e_Y(G) e_Y(C) e_Y(G) e_Y(T) e_N(A) e_N(A) e_N(T) e_N(C) e_N(T) e_N(G)
Transition: a_0Y a_YY a_YY a_YY a_YY a_YY a_YY a_YN a_NN a_NN a_NN a_NN a_NN a_N0

How to find the parameters e and a that maximise this probability? And what if we don't know the path?
Hidden Markov Models: Algorithms
•Resumé
•Evaluating P(s | M): Forward Algorithm
Computing P(s, π | M) for each path is a redundant operation:

s: A G C G C G T A A T C T G
π: Y Y Y Y Y Y Y Y Y Y Y Y Y
Emission:   0.1 0.4 0.4 0.4 0.4 0.4 0.1 0.1 0.1 0.1 0.4 0.1 0.4
Transition: 0.2 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.1
[The GpC-island model parameters as above]
s: A G C G C G T A A T C T G
π: Y Y Y Y Y Y Y Y Y Y Y Y N
Emission:   0.1 0.4 0.4 0.4 0.4 0.4 0.1 0.1 0.1 0.1 0.4 0.1 0.25
Transition: 0.2 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.2 0.1
Summing over all the possible paths
[The GpC-island model parameters as above]
First two characters:
π = Y Y: emission 0.1, 0.4; transition 0.2, 0.7 → 0.2·0.1·0.7·0.4 = 0.0056
π = Y N: emission 0.1, 0.25; transition 0.2, 0.2 → 0.2·0.1·0.2·0.25 = 0.0010
π = N Y: emission 0.25, 0.4; transition 0.8, 0.1 → 0.8·0.25·0.1·0.4 = 0.0080
π = N N: emission 0.25, 0.25; transition 0.8, 0.8 → 0.8·0.25·0.8·0.25 = 0.0400

Sum over the paths ending in the same state:
s: A G, π = X Y: 0.0056 + 0.0080 = 0.0136
s: A G, π = X N: 0.0010 + 0.0400 = 0.0410
Summing over all the possible paths, extending to the third character:
s: A G C, π = X Y Y: 0.0136 · 0.7 · 0.4
s: A G C, π = X Y N: 0.0136 · 0.2 · 0.25
s: A G C, π = X N Y: 0.0410 · 0.1 · 0.4
s: A G C, π = X N N: 0.0410 · 0.8 · 0.25

Sum over the paths ending in the same state:
s: A G C, π = X X Y: 0.0136·0.7·0.4 + 0.0410·0.1·0.4 = 0.005448
s: A G C, π = X X N: 0.0136·0.2·0.25 + 0.0410·0.8·0.25 = 0.008880
[The GpC-island model parameters as above]
Summing over all the possible paths, iterating until the last position of the sequence:
A G C G C G T A A T C T G with π = X X X X X X X X X X X X Y, times a_Y0 = 0.1
A G C G C G T A A T C T G with π = X X X X X X X X X X X X N, times a_N0 = 0.1
The sum of the two terms gives P(s | M).
On the basis of the preceding observations, the computation of P(s | M) can be decomposed into simpler problems. For each state k and each position i in the sequence, we compute:

F_k(i) = P(s1 s2 s3 … s_i, π(i) = k | M)

Initialisation: F_BEGIN(0) = 1;   F_k(0) = 0 ∀k ≠ BEGIN

Recurrence: F_l(i+1) = P(s1 s2 … s_i s_{i+1}, π(i+1) = l)
          = Σ_k P(s1 s2 … s_i, π(i) = k) · a_kl · e_l(s_{i+1})
          = e_l(s_{i+1}) · Σ_k F_k(i) · a_kl

Termination: P(s) = P(s1 s2 s3 … s_T, π(T+1) = END) = Σ_k F_k(T) · a_k0

Forward Algorithm
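A direct Python transcription of the recurrence, using the GpC-island parameters of the slides (a sketch; F_k includes the emission of the current character, as in the definition above):

states = ("Y", "N")
a0 = {"Y": 0.2, "N": 0.8}                # BEGIN -> k
a_end = {"Y": 0.1, "N": 0.1}             # k -> END
a = {("Y", "Y"): 0.7, ("Y", "N"): 0.2, ("N", "Y"): 0.1, ("N", "N"): 0.8}
e = {"Y": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
     "N": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}

def forward(s):
    # F_k(1) = a_0k * e_k(s_1)
    F = {k: a0[k] * e[k][s[0]] for k in states}
    for c in s[1:]:
        # F_l(i+1) = e_l(s_{i+1}) * sum_k F_k(i) * a_kl
        F = {l: e[l][c] * sum(F[k] * a[(k, l)] for k in states) for l in states}
    # P(s) = sum_k F_k(T) * a_k0
    return sum(F[k] * a_end[k] for k in states)

# For "AG", the internal values are F_Y(2) = 0.0136 and F_N(2) = 0.041,
# matching the worked example above.
print(forward("AGCGCGTAATCTG"))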
(The conditioning on M will be understood and omitted from the notation.)
Computing P(s, π | M) for each path is a redundant operation.
[Diagram: trellis of states (Begin, R, L, End) versus iterations 0, 1, 2, 3, …, T-1]
Consider two paths π1 and π2 that differ only in their final state:

P(s, π1 | M) = a_{0,π(1)} e_{π(1)}(s1) · [Π_{t=2}^{T-1} a_{π(t-1),π(t)} e_{π(t)}(s_t)] · a_{π(T-1),π1(T)} e_{π1(T)}(s_T) a_{π1(T),0}
P(s, π2 | M) = a_{0,π(1)} e_{π(1)}(s1) · [Π_{t=2}^{T-1} a_{π(t-1),π(t)} e_{π(t)}(s_t)] · a_{π(T-1),π2(T)} e_{π2(T)}(s_T) a_{π2(T),0}

If we compute the common part only once, we save 2·(T-1) operations.
[Diagram: trellis of states (Begin, R, L, End) versus iterations 0, 1, 2, 3, …, T-1]
If we know the probabilities of emitting the first two characters of the sequence, ending the path in states R and L respectively,

F_R(2) = P(s1, s2, π(2) = R | M) and F_L(2) = P(s1, s2, π(2) = L | M),

then we can compute:

P(s1, s2, s3, π(3) = R | M) = F_R(2) · a_RR · e_R(s3) + F_L(2) · a_LR · e_R(s3)
Summing over all the possible paths
[Diagram: forward trellis over states (BEGIN, A, B, END) and iterations 0, 1, 2, …, T, T+1; F_B(2) = e_B(s2) · Σ_i F_i(1) · a_iB; at iteration T+1 the sum gives P(s | M)]
Forward Algorithm
Forward algorithm: computational complexity
Naïve method: P(s | M) = Σ_π P(s, π | M). There are N^T possible paths, and each path requires about 2T operations, so the computation time is O(T · N^T).
Forward algorithm
T positions, N values for each position; each element requires about 2N products and 1 sum. The computation time is O(T · N²).
Forward algorithm: computational complexity
[Plot: number of operations versus sequence length T for the naïve method and the forward algorithm]
Hidden Markov Models: Algorithms
•Resumé
•Evaluating P(s | M): Forward Algorithm
•Evaluating P(s | M): Backward Algorithm
Backward Algorithm
Similar to the Forward algorithm: it computes P(s | M) reconstructing the sequence from the end. For each state k and each position i in the sequence, we compute:

B_k(i) = P(s_{i+1} s_{i+2} s_{i+3} … s_T | π(i) = k)

Initialisation: B_k(T) = P(π(T+1) = END | π(T) = k) = a_k0

Recurrence: B_l(i-1) = P(s_i s_{i+1} … s_T | π(i-1) = l)
          = Σ_k a_lk · e_k(s_i) · P(s_{i+1} … s_T | π(i) = k)
          = Σ_k a_lk · e_k(s_i) · B_k(i)

Termination: P(s) = P(s1 s2 s3 … s_T | π(0) = BEGIN) = Σ_k a_0k · e_k(s1) · B_k(1)
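The backward recurrence in the same style (a sketch; model parameters repeated from the forward example so the snippet runs on its own):

states = ("Y", "N")
a0 = {"Y": 0.2, "N": 0.8}; a_end = {"Y": 0.1, "N": 0.1}
a = {("Y", "Y"): 0.7, ("Y", "N"): 0.2, ("N", "Y"): 0.1, ("N", "N"): 0.8}
e = {"Y": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
     "N": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}

def backward(s):
    B = {k: a_end[k] for k in states}        # B_k(T) = a_k0
    for c in reversed(s[1:]):
        # B_l(i-1) = sum_k a_lk * e_k(s_i) * B_k(i)
        B = {l: sum(a[(l, k)] * e[k][c] * B[k] for k in states) for l in states}
    # P(s) = sum_k a_0k * e_k(s_1) * B_k(1)
    return sum(a0[k] * e[k][s[0]] * B[k] for k in states)

print(backward("AGCGCGTAATCTG"))  # same value as the forward algorithm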
Backward Algorithm
[Diagram: backward trellis over states (BEGIN, A, B, END) and iterations 0, 1, 2, …, T, T+1; B_B(T-1) = Σ_k a_Bk · e_k(s_T) · B_k(T); iterating back to BEGIN yields P(s | M)]
Hidden Markov Models: Algorithms
•Resumé
•Evaluating P(s | M): Forward Algorithm
•Evaluating P(s | M): Backward Algorithm
•Showing the path: Viterbi decoding
Finding the best path
[The GpC-island model parameters as above]
First two characters:
π = Y Y: emission 0.1, 0.4; transition 0.2, 0.7 → 0.2·0.1·0.7·0.4 = 0.0056
π = Y N: emission 0.1, 0.25; transition 0.2, 0.2 → 0.2·0.1·0.2·0.25 = 0.0010
π = N Y: emission 0.25, 0.4; transition 0.8, 0.1 → 0.8·0.25·0.1·0.4 = 0.0080
π = N N: emission 0.25, 0.25; transition 0.8, 0.8 → 0.8·0.25·0.8·0.25 = 0.0400

Taking the maximum (instead of the sum) over the paths ending in the same state:
s: A G, π = N Y: 0.0080
s: A G, π = N N: 0.0400
Finding the best path, extending to the third character:
s: A G C, π = N Y Y: 0.008 · 0.7 · 0.4 = 0.00224
s: A G C, π = N Y N: 0.008 · 0.2 · 0.25 = 0.0004
s: A G C, π = N N Y: 0.040 · 0.1 · 0.4 = 0.0016
s: A G C, π = N N N: 0.040 · 0.8 · 0.25 = 0.0080
[The GpC-island model parameters as above]

Taking the maximum over the paths ending in the same state:
s: A G C, π = N Y Y: 0.00224
s: A G C, π = N N N: 0.00800
[The GpC-island model parameters as above]
Iterating until the last position of the sequence:
A G C G C G T A A T C T G with π = N Y Y Y Y Y Y N N N Y Y Y, times a_Y0 = 0.1
A G C G C G T A A T C T G with π = N N N N N N N N N N N N N, times a_N0 = 0.1
Choose the maximum.
Finding the best path
Viterbi Algorithm
π* = argmax_π [ P(π, s | M) ]. The computation of P(s, π* | M) can be decomposed into simpler problems.

Let V_k(i) be the probability of the most probable path generating the subsequence s1 s2 s3 … s_i and ending in state k at iteration i.

Initialisation: V_BEGIN(0) = 1;   V_k(0) = 0 ∀k ≠ BEGIN

Recurrence: V_l(i+1) = e_l(s_{i+1}) · max_k (V_k(i) · a_kl)
           ptr_{i+1}(l) = argmax_k (V_k(i) · a_kl)

Termination: P(s, π*) = max_k (V_k(T) · a_k0)
            π*(T) = argmax_k (V_k(T) · a_k0)

Traceback: π*(i-1) = ptr_i(π*(i))
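Viterbi in the same style (a sketch with the GpC-island parameters); it returns both P(s, π* | M) and the decoded path:

states = ("Y", "N")
a0 = {"Y": 0.2, "N": 0.8}; a_end = {"Y": 0.1, "N": 0.1}
a = {("Y", "Y"): 0.7, ("Y", "N"): 0.2, ("N", "Y"): 0.1, ("N", "N"): 0.8}
e = {"Y": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
     "N": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}

def viterbi(s):
    V = {k: a0[k] * e[k][s[0]] for k in states}
    ptrs = []
    for c in s[1:]:
        # ptr_{i+1}(l) = argmax_k V_k(i) * a_kl
        best = {l: max(states, key=lambda k: V[k] * a[(k, l)]) for l in states}
        # V_l(i+1) = e_l(s_{i+1}) * V_{best}(i) * a_{best,l}
        V = {l: e[l][c] * V[best[l]] * a[(best[l], l)] for l in states}
        ptrs.append(best)
    last = max(states, key=lambda k: V[k] * a_end[k])
    path = [last]
    for best in reversed(ptrs):              # traceback
        path.append(best[path[-1]])
    return V[last] * a_end[last], "".join(reversed(path))

print(viterbi("AGCGCGTAATCTG"))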
Viterbi Algorithm
[Diagram: Viterbi trellis; V_B(2) = e_B(s2) · max_i V_i(1) · a_iB, with ptr_2(B) recording the best predecessor; at iteration T+1 the maximum over states gives P(s, π* | M)]
[Diagram: the Viterbi path traced back through the trellis from END to BEGIN]
Different paths can have the same probability
Hidden Markov Models: Algorithms
•Resumé
•Evaluating P(s | M): Forward Algorithm
•Evaluating P(s | M): Backward Algorithm
•Showing the path: Viterbi decoding
•Showing the path: A posteriori decoding
•Training a model: EM algorithm
If we know the path generating the training sequence
[The GpC-island model with all parameters unknown, as above]
s:          A      G      C      G      C      G      T      A      A      T      C      T      G
π:          Y      Y      Y      Y      Y      Y      Y      N      N      N      N      N      N
Emission:   e_Y(A) e_Y(G) e_Y(C) e_Y(G) e_Y(C) e_Y(G) e_Y(T) e_N(A) e_N(A) e_N(T) e_N(C) e_N(T) e_N(G)
Transition: a_0Y a_YY a_YY a_YY a_YY a_YY a_YY a_YN a_NN a_NN a_NN a_NN a_NN a_N0

Just count! Example: a_YY = n_YY / (n_YY + n_YN) = 6/7
e_Y(A) = n_Y(A) / [n_Y(A) + n_Y(C) + n_Y(G) + n_Y(T)] = 1/7
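When the path is known, training is literally counting; a tiny sketch reproducing the two numbers above:

from collections import Counter

s = "AGCGCGTAATCTG"
pi = "YYYYYYYNNNNNN"

trans = Counter(zip(pi, pi[1:]))        # n_kl: transition counts
emit = Counter(zip(pi, s))              # n_k(c): emission counts

n_from_Y = sum(n for (k, _), n in trans.items() if k == "Y")
print(trans[("Y", "Y")] / n_from_Y)     # a_YY = 6/7

y_total = sum(n for (k, _), n in emit.items() if k == "Y")
print(emit[("Y", "A")] / y_total)       # e_Y(A) = 1/7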
If we DO NOT know the path generating the training sequence
[The GpC-island model with all parameters unknown, as above]
s: A G C G C G T A A T C T G
π: ? ? ? ? ? ? ? ? ? ? ? ? ?
Emission: e_?(A) e_?(G) e_?(C) e_?(G) e_?(C) e_?(G) e_?(T) e_?(A) e_?(A) e_?(T) e_?(C) e_?(T) e_?(G)
Transition: a_0? a_?? a_?? a_?? a_?? a_?? a_?? a_?? a_?? a_?? a_?? a_?? a_?? a_?0

We need, in some sense, to average over all the possible paths.
No exact algorithm is available. The iterative Baum-Welch algorithm is based on Expectation-Maximisation.

Baum-Welch algorithm (simple discussion)
Given a path π we can count:
•the number of transitions between states k and l: A_kl(π)
•the number of emissions of character c from state k: E_k(c, π)
[All the 2^13 paths π for s = AGCGCGTAATCTG contribute to the sums]

We can compute the expected values over all the paths, given initial parameters θ0:

A_kl = Σ_π P(π | s, θ0) · A_kl(π)
E_k(c) = Σ_π P(π | s, θ0) · E_k(c, π)

The updated parameters are:

a_kl = A_kl / Σ_{m=1}^N A_km
e_k(c) = E_k(c) / Σ_{c'} E_k(c')

Then we can iterate…
Expectation-Maximisation algorithm
We need to estimate the Maximum Likelihood parameters when the paths generating the training sequences are unknown:

θ_ML = argmax_θ [ P(s | θ, M) ]

Given a model with parameters θ0, the EM algorithm finds new parameters θ that increase the likelihood of the model:

P(s | θ) > P(s | θ0), or equivalently log P(s | θ) > log P(s | θ0)
Expectation-Maximisation algorithm
From P(s, π | θ) = P(π | s, θ) · P(s | θ):

log P(s | θ) = log P(s, π | θ) - log P(π | s, θ)

Multiplying by P(π | s, θ0) and summing over all the possible paths:

log P(s | θ) = Σ_π P(π | s, θ0) · log P(s, π | θ) - Σ_π P(π | s, θ0) · log P(π | s, θ)

Q(θ | θ0) = Σ_π P(π | s, θ0) · log P(s, π | θ) is the expectation value of log P(s, π | θ) over the "current" paths. Then:

log P(s | θ) - log P(s | θ0) = Q(θ | θ0) - Q(θ0 | θ0) + Σ_π P(π | s, θ0) · log [ P(π | s, θ0) / P(π | s, θ) ]

The last sum is a relative entropy and is never negative, so any θ with Q(θ | θ0) ≥ Q(θ0 | θ0) does not decrease the likelihood.
Expectation-Maximisation algorithm
The EM algorithm is an iterative process; each iteration performs two steps:

E-step: evaluation of Q(θ | θ0) = Σ_π P(π | s, θ0) · log P(s, π | θ)
M-step: maximisation of Q(θ | θ0) over all θ

It is NOT guaranteed to converge to the GLOBAL maximum likelihood.
E-step:

Q(θ | θ0) = Σ_π P(π | s, θ0) · log P(s, π | θ)

P(s, π | θ) = a_{0,π(1)} · Π_{i=1}^T a_{π(i),π(i+1)} · e_{π(i)}(s_i)
            = Π_{k=0}^N Π_{l=1}^{N+1} a_kl^{A_kl(π)} · Π_{k=1}^N Π_{c∈C} e_k(c)^{E_k(c,π)}

where A_kl(π) is the number of transitions between states k and l in path π, and E_k(c, π) is the number of emissions of character c in path π. With the expected counts

A_kl = Σ_π P(π | s, θ0) · A_kl(π)
E_k(c) = Σ_π P(π | s, θ0) · E_k(c, π)

we obtain:

Q(θ | θ0) = Σ_{k=0}^N Σ_{l=1}^{N+1} A_kl · log a_kl + Σ_{k=1}^N Σ_{c∈C} E_k(c) · log e_k(c)
Baum-Welch implementation of the EM algorithm
Expected values over all the "current" paths give the updates:

a_kl = A_kl / Σ_{m=1}^N A_km
e_k(c) = E_k(c) / Σ_{c'∈C} E_k(c')
Baum-Welch implementation of the EM algorithm
M-step: set ∂Q/∂a_kl = 0 for any pair of states k and l, with Σ_l a_kl = 1, and ∂Q/∂e_k(c) = 0 for any state k and character c, with Σ_c e_k(c) = 1.
By means of the Lagrange multipliers technique we can solve the system; the solution is the update rule above.
How to compute the expected numbers of transitions and emissions over all the paths: with

F_k(i) = P(s1 s2 s3 … s_i, π(i) = k)
B_k(i) = P(s_{i+1} s_{i+2} s_{i+3} … s_T | π(i) = k)

we have

A_kl = Σ_i P(π(i) = k, π(i+1) = l | s, θ) = Σ_i F_k(i) · a_kl · e_l(s_{i+1}) · B_l(i+1) / P(s)
E_k(c) = Σ_i P(s_i = c, π(i) = k | s, θ) = Σ_{i: s_i = c} F_k(i) · B_k(i) / P(s)

Baum-Welch implementation of the EM algorithm
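A compact sketch of these expected counts, computed with full Forward and Backward matrices on the toy GpC-island model (one training sequence; BEGIN/END handled as above):

states = ("Y", "N")
a0 = {"Y": 0.2, "N": 0.8}; a_end = {"Y": 0.1, "N": 0.1}
a = {("Y", "Y"): 0.7, ("Y", "N"): 0.2, ("N", "Y"): 0.1, ("N", "N"): 0.8}
e = {"Y": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
     "N": {c: 0.25 for c in "ACGT"}}

def matrices(s):
    # F[i][k] = F_k(i+1); B[i][k] = B_k(i+1) (0-based lists)
    F = [{k: a0[k] * e[k][s[0]] for k in states}]
    for c in s[1:]:
        F.append({l: e[l][c] * sum(F[-1][k] * a[(k, l)] for k in states)
                  for l in states})
    B = [{k: a_end[k] for k in states}]
    for c in reversed(s[1:]):
        B.insert(0, {l: sum(a[(l, k)] * e[k][c] * B[0][k] for k in states)
                     for l in states})
    return F, B

s = "AGCGCGTAATCTG"
F, B = matrices(s)
Ps = sum(F[-1][k] * a_end[k] for k in states)

# A_kl = sum_i F_k(i) a_kl e_l(s_{i+1}) B_l(i+1) / P(s)
A = {(k, l): sum(F[i][k] * a[(k, l)] * e[l][s[i + 1]] * B[i + 1][l]
                 for i in range(len(s) - 1)) / Ps
     for k in states for l in states}
# E_k(c) = sum_{i: s_i = c} F_k(i) B_k(i) / P(s)
E = {(k, c): sum(F[i][k] * B[i][k] for i in range(len(s)) if s[i] == c) / Ps
     for k in states for c in "ACGT"}
print(A[("Y", "N")], E[("Y", "G")])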
Baum-Welch implementation of the EM algorithm
Algorithm:
1. Start with random parameters.
2. Compute the Forward and Backward matrices on the known sequences.
3. Compute A_kl and E_k(c), the expected numbers of transitions and emissions.
4. Update a_kl ← A_kl (normalised) and e_k(c) ← E_k(c) (normalised).
5. Has P(s | M) increased? If yes, go back to step 2; if no, stop.
Profile HMMs
•HMMs for alignments
How to align? Each state M0 … M5 represents a position in the alignment:

A C G G T A → M0 M1 M2 M3 M4 M5
A C G A T C → M0 M1 M2 M3 M4 M5
A T G T T C → M0 M1 M2 M3 M4 M5

Each position has a peculiar composition:
A C G G T A
A C G A T C
A T G T T C
Given a set of sequences..
..we can train a model..
     M0  M1    M2  M3    M4  M5
A    1   0     0   0.33  0   0.33
C    0   0.66  0   0     0   0.66
G    0   0     1   0.33  0   0
T    0   0.33  0   0.33  1   0
..estimating the emission probabilities.
M0 M1 M2 M3 M4 M5
Given a trained model, we can align a new sequence, computing the probability of generating it:
[Emission table as above]
A C G A T C
P(s | M) = 1 × 0.66 × 1 × 0.33 × 1 × 0.66
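The same product as a few lines of Python (a sketch; the table rows are the per-position emission probabilities above):

profile = {
    "A": [1, 0,    0, 0.33, 0, 0.33],
    "C": [0, 0.66, 0, 0,    0, 0.66],
    "G": [0, 0,    1, 0.33, 0, 0],
    "T": [0, 0.33, 0, 0.33, 1, 0],
}

def profile_prob(seq):
    # P(s|M) = prod_j e_{Mj}(s_j) for an ungapped profile model
    p = 1.0
    for j, c in enumerate(seq):
        p *= profile[c][j]
    return p

print(profile_prob("ACGATC"))  # 1 * 0.66 * 1 * 0.33 * 1 * 0.66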
And for the sequence AGATC?
A G A T C → M0 M2 M3 M4 M5 (position M1 is skipped)
[Emission table as above]
We need a way to introduce gaps.
Silent states
Red transitions allow gaps: (N-1)! transitions.
To reduce the number of parameters we can use states that do not emit any character: 4N-8 transitions.
[Diagram: profile HMM with match states M0 … M5, insert states I0 … I4 and delete states D1 … D4]
Profile HMMs
Delete states, insert states, match states:

A C G G T A → M0 M1 M2 M3 M4 M5
A C G C A G T C → M0 I0 I0 M1 M2 M3 M4 M5
A G A T C → M0 D1 M2 M3 M4 M5
Example of alignment
Sequence 1: A S T R A L, Viterbi path: M0 M1 M2 M3 M4 M5
Sequence 2: A S T A I L, Viterbi path: M0 M1 M2 D3 M4 I4 M5
Sequence 3: A R T I, Viterbi path: M0 M1 M2 D3 D4 M5
[Diagram: the profile HMM with match, insert and delete states]
Example of alignment
Grouping by vertical layers 0 1 2 3 4 5:
s1: A S T R A L (M0 M1 M2 M3 M4 M5)
s2: A S T A I L (M0 M1 M2 D3 M4 I4 M5)
s3: A R T I (M0 M1 M2 D3 D4 M5)

Alignment:
ASTRA-L
AST-AIL
ART---I

-log P(s | M) is an alignment score.
Searching for a structural/functional pattern in protein sequence
Zn binding loop:
C H C I C R I C
C H C L C K I C
C H C I C S L C
D H C L C T I C
C H C I D S I C
C H C L C K I C
Cysteines can be replaced by an Aspartic Acid, but only ONCE for each sequence
Searching for a structural/functional pattern in protein sequences
..ALCPCHCLCRICPLIY..
obtains a higher probability than
..WERWDHCIDSICLKDE..
[Diagram: profile HMM with match states M0 … M7 and the corresponding insert and delete states]
..because M0 and M4 have a low emission probability for aspartic acid, and for the second sequence we would multiply them twice.
Profile HMMs
•HMMs for alignments
•Example on globins
Structural alignment of globins
Structural alignment of globins
Bashford D, Chothia C & Lesk AM (1987) Determinants of a protein fold: unique features of the globin amino acid sequence. J. Mol. Biol. 196, 199-216
Alignment of globins reconstructed with profile HMMs
Krogh A, Brown M, Mian IS, Sjolander K & Haussler D (1994) Hidden Markov Models in computational biology: applications to protein modelling. J.Mol.Biol. 235, 1501-1531
Discrimination power of profile HMMs
Z-score = [log P(s | M) - <log P(s | M)>] / σ(log P(s | M))
Krogh A, Brown M, Mian IS, Sjolander K & Haussler D (1994) Hidden Markov Models in computational biology: applications to protein modelling. J.Mol.Biol. 235, 1501-1531
Profile HMMs
•HMMs for alignments
•Example on globins
•Other applications
Finding a domain
[Diagram: BEGIN and END connected through two flanking insert states with self-loops, surrounding a profile HMM specific for the considered domain]
Clustering subfamilies
[Diagram: BEGIN connected in parallel to HMM 1, HMM 2, HMM 3, …, HMM n, all converging to END]
Each sequence s contributes to update HMM i with a weight equal to P ( s | Mi )
Profile HMMs
•HMMs for alignments
•Example on globins
•Other applications
•Available codes and servers
HMMER at WUSTL: http://hmmer.wustl.edu/
Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14:755-763
HMMER
Alignment of a protein family → hmmbuild → trained profile-HMM → hmmcalibrate → HMM calibrated with accurate E-value statistics

hmmbuild takes the aligned sequences, checks for redundancy, and sets the emission and transition probabilities of a HMM.
hmmcalibrate takes a trained HMM, generates a great number of random sequences, scores them, and fits the Extreme Value Distribution to the computed scores.
HMMER
hmmalign: aligns a set of sequences to the model.
hmmsearch: given a trained HMM and a set of sequences, produces the list of sequences that match the HMM (sorted by E-value).
hmmpfam: given a sequence and a set of HMMs, produces the list of HMMs that match the sequence.
!!AA_MULTIPLE_ALIGNMENT 1.0
PileUp of: *.pep
Symbol comparison table: GenRunData:blosum62.cmp CompCheck: 6430
GapWeight: 12 GapLengthWeight: 4
pileup.msf MSF: 308 Type: P August 16, 1999 09:09 Check: 9858 ..
Name: lgb1_pea Len: 308 Check: 2200 Weight: 1.00
Name: lgb1_vicfa Len: 308 Check: 214 Weight: 1.00
Name: myg_escgi Len: 308 Check: 3961 Weight: 1.00
Name: myg_horse Len: 308 Check: 5619 Weight: 1.00
Name: myg_progu Len: 308 Check: 6401 Weight: 1.00
Name: myg_saisc Len: 308 Check: 6606 Weight: 1.00
//
           1                                                50
lgb1_pea   ~~~~~~~~~G FTDKQEALVN SSSE.FKQNL PGYSILFYTI VLEKAPAAKG
lgb1_vicfa ~~~~~~~~~G FTEKQEALVN SSSQLFKQNP SNYSVLFYTI ILQKAPTAKA
myg_escgi  ~~~~~~~~~V LSDAEWQLVL NIWAKVEADV AGHGQDILIR LFKGHPETLE
myg_horse  ~~~~~~~~~G LSDGEWQQVL NVWGKVEADI AGHGQEVLIR LFTGHPETLE
myg_progu  ~~~~~~~~~G LSDGEWQLVL NVWGKVEGDL SGHGQEVLIR LFKGHPETLE
myg_saisc  ~~~~~~~~~G LSDGEWQLVL NIWGKVEADI PSHGQEVLIS LFKGHPETLE
MSF Format: globins50.msf
Alignment of a protein family → hmmbuild → trained profile-HMM
hmmbuild globin.hmm globins50.msf
All the transition and emission parameters are estimated by means of the Expectation Maximisation algorithm on the aligned sequences.
In principle we could use also NON aligned sequences to train the model. Nevertheless it is more efficient to build the starting alignment using, for example, CLUSTALW
hmmcalibrate [--num N] --histfile globin.histo globin.hmm
Trained profile-HMM → hmmcalibrate → HMM calibrated with accurate E-value statistics
A number of N (default 5000) random sequences are generated and scored with the model.
[Histogram: scores log P(s|M)/P(s|N) of the random sequences, with the range for true globin sequences indicated]
E-value(S): expected number of random sequences with a score > S.
Trained model (globin.hmm):

HMMER2.0 [2.3.2]
NAME globins50
LENG 143
ALPH Amino
RF no
CS no
MAP yes
COM /home/gigi/bin/hmmbuild globin.hmm globins50.msf
COM /home/gigi/bin/hmmcalibrate --histfile globin.histo globin.hmm
NSEQ 50
DATE Sun May 29 19:03:18 2005
CKSUM 9858
XT -8455 -4 -1000 -1000 -8455 -4 -8455 -4
NULT -4 -8455
NULE 595 -1558 85 338 -294 453 -1158 197 249 902 -1085 -142 -21 -313 45 531 201 384 -1998 -644
EVD -38.893742 0.243153
HMM   A C D E F G H I K L M N P Q R S T V W Y
      m->m m->i m->d i->m i->i d->m d->d b->m m->e
      -450 * -1900
1     591 -1587 159 1351 -1874 -201 151 -1600 998 -1591 -693 389 -1272 595 42 -31 27 -693 -1797 -1134 14
-     -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249
-     -23 -6528 -7571 -894 -1115 -701 -1378 -450 *
2     -926 -2616 2221 2269 -2845 -1178 -325 -2678 -300 -2596 -1810 220 -1592 939 -974 -671 -939 -2204 -2785 -1925 15
-     -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249
-     -23 -6528 -7571 -894 -1115 -701 -1378 * *
3     -638 -1715 -680 497 -2043 -1540 23 -1671 2380 -1641 -840 -222 -1595 437 1040 -564 -523 -1363 2124 -1313 16
-     -149 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249
-     -23 -6528 -7571 -894 -1115 -701 -1378 * *
[one block like these per model position]

Notes on the file:
•NULT and NULE define the null model (null transitions and null emissions).
•Score = INT[1000 · log2(prob / null_prob)], with null_prob = 1 for transitions and the natural abundance of each residue for emissions.
•Each model position carries one row of emission scores (columns A … Y) and one row of transition scores (columns m->m, m->i, m->d, i->m, i->i, d->m, d->d, b->m, m->e).
hmmemit [-n N] globin.hmm
Trained profile-HMM → hmmemit → sequences generated by the model
The parameters of the model are used to generate new sequences.
hmmsearch globin.hmm Artemia.fa > Artemia.globin
Trained profile-HMM + set of sequences → hmmsearch → list of sequences that match the HMM (sorted by E-value)
Search results (Artemia.globin):

Sequence  Description  Score  E-value   N
--------  -----------  -----  -------   ---
S13421    S13421       474.3  1.7e-143  9

Parsed for domains:
Sequence  Domain  seq-f  seq-t    hmm-f  hmm-t   score  E-value
--------  ------  -----  -----    -----  -----   -----  -------
S13421    7/9     932    1075 ..  1      143 []  76.9   7.3e-24
S13421    2/9     153    293  ..  1      143 []  63.7   6.8e-20
S13421    3/9     307    450  ..  1      143 []  59.8   9.8e-19
S13421    8/9     1089   1234 ..  1      143 []  57.6   4.5e-18
S13421    9/9     1248   1390 ..  1      143 []  52.3   1.8e-16
S13421    1/9     1      143  [.  1      143 []  51.2   4e-16
S13421    4/9     464    607  ..  1      143 []  46.7   8.6e-15
S13421    6/9     775    918  ..  1      143 []  42.2   2e-13
S13421    5/9     623    762  ..  1      143 []  23.9   6.6e-08

Domains are sorted by E-value; seq-f and seq-t give the start and end of each match in the sequence; N is the number of domains.

Alignments of top-scoring domains (consensus sequence on top, matched sequence below).
S13421: domain 7 of 9, from 932 to 1075: score 76.9, E = 7.3e-24

*->eekalvksvwgkveknveevGaeaLerllvvyPetkryFpkFkdLss
   +e a vk+ w+ v+ ++ vG +++ l++ +P+ +++FpkF d+
S13421 932 REVAVVKQTWNLVKPDLMGVGMRIFKSLFEAFPAYQAVFPKFSDVPL 978

adavkgsakvkahgkkVltalgdavkkldd...lkgalakLselHaqklr
d++++++ v +h V t+l++ ++ ld++ +l+ ++L+e H+ lr
S13421 979 -DKLEDTPAVGKHSISVTTKLDELIQTLDEpanLALLARQLGEDHIV-LR 1026

vdpenfkllsevllvvlaeklgkeftpevqaalekllaavataLaakYk<
v+ fk +++vl+ l++ lg+ f+ ++ +++k+++++++ +++ +
S13421 1027 VNKPMFKSFGKVLVRLLENDLGQRFSSFASRSWHKAYDVIVEYIEEGLQ 1075
hmmalign globin.hmm globins630.fa
Set of sequences + trained profile-HMM → hmmalign → alignment of all sequences to the model

Insertions are marked by dots, gaps by dashes:

BAHG_VITSP QAG-..VAAAHYPIV.GQELLGAIKEV.L.G.D.AATDDILDAWGKAYGV
GLB1_ANABR TR-K..ISAAEFGKI.NGPIKKVLAS-.-.-.K.NFGDKYANAWAKLVAV
GLB1_ARTSX NRGT..-DRSFVEYL.KESL-----GD.S.V.D.EFT------VQSFGEV
GLB1_CALSO TRGI..TNMELFAFA.LADLVAYMGTT.I.S.-.-FTAAQKASWTAVNDV
GLB1_CHITH -KSR..ASPAQLDNF.RKSLVVYLKGA.-.-.T.KWDSAVESSWAPVLDF
GLB1_GLYDI GNKH..IKAQYFEPL.GASLLSAMEHR.I.G.G.KMNAAAKDAWAAAYAD
GLB1_LUMTE ER-N..LKPEFFDIF.LKHLLHVLGDR.L.G.T.HFDF---GAWHDCVDQ
GLB1_MORMR QSFY..VDRQYFKVL.AGII-------.-.-.A.DTTAPGDAGFEKLMSM
GLB1_PARCH DLNK..VGPAHYDLF.AKVLMEALQAE.L.G.S.DFNQKTRDSWAKAFSI
GLB1_PETMA KSFQ..VDPQYFKVL.AAVI-------.-.-.V.DTVLPGDAGLEKLMSM
GLB1_PHESE QHTErgTKPEYFDLFrGTQLFDILGDKnLiGlTmHFD---QAAWRDCYAV
HMMER applications: PFAM
http://www.sanger.ac.uk/Software/Pfam/
PFAM Exercise
Generate with hmmemit a sequence from the globin model and search it in the PFAM database.
Retrieve from the SwissProt database the sequences CG301_HUMAN and Q9H5F4_HUMAN:
1) search them in the PFAM database;
2) launch PSI-BLAST searches. Is it possible to annotate the sequences by means of the BLAST results?
SAM at UCSC: http://www.soe.ucsc.edu/research/compbio/sam.html
Krogh A, Brown M, Mian IS, Sjolander K & Haussler D (1994) Hidden Markov Models in computational biology: applications to protein modelling. J.Mol.Biol. 235, 1501-1531
SAM applications:http://www.cse.ucsc.edu/research/compbio/HMM-apps/T02-query.html
HMMPRO: http://www.netid.com/html/hmmpro.html
Pierre Baldi, Net-ID
HMMs for Mapping problems
•Mapping problems in protein prediction
Covalent structure (Nt → Ct):
TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
3D structure
Secondary structure:
EEEE..HHHHHHHHHHHH....HHHHHHHH.EEEE...........
Topography: position of the transmembrane segments along the sequence
Topology of membrane proteins
[Figure: Porin (Rhodobacter capsulatus), a β-barrel in the outer membrane, and Bacteriorhodopsin (Halobacterium salinarum), α-helices in the inner membrane, both spanning the bilayer]
ALALMLCMLTYRHKELKLKLKK (an example sequence, with its transmembrane segment marked)
HMMs for Mapping problems
•Mapping problems in protein prediction
•Labelled HMMs
HMM for secondary structure prediction
Simplest model
Introducing a grammar
[Diagram: a coil state c connected to a chain of three states (1, 2, 3) of one conformation and a chain of two states (1, 2) of the other]

Labels
The states 1, 2 and 3 share the same label, and so do the other states 1 and 2. Decoding the Viterbi path for emitting a sequence s defines a mapping between the sequence s and a sequence of labels y:

s:    S A L K M N Y T R E I M V A S N Q   (sequence)
π:    c c c c c … c c                     (path)
Y(π): c c c c c … c c                     (labels)
Computing P(s, y | M)

P(s, y | M) = Σ_{π: Y(π) = y} P(s, π | M)

Only the paths whose labelling is y have to be considered in the sum. In the Forward and Backward algorithms this means setting

F_k(i) = 0 and B_k(i) = 0 whenever Y(k) ≠ y_i

s: S A L K M N Y T R E I M V A S N Q (sequence); y: c c c c c … (labels)
[Diagram: states and their labelling]
Baum-Welch training algorithm for labelled HMMs
Given a set of known labelled sequences (e.g. amino acid sequences and their native secondary structure) we want to find the parameters of the model, without knowing the generating paths:
θ_ML = argmax_θ [ P(s, y | M, θ) ]
The algorithm is the same as in the non-labelled case if we use the Forward and Backward matrices defined in the last slide.
Supervised learning of the mapping
HMMs for Mapping problems
•Mapping problems in protein prediction
•Labelled HMMs
•Duration modelling
Self-loops and geometric decay
[Diagram: Begin → state with self-loop probability p → End, leaving with probability 1-p]

P(l) = p^(l-1) · (1 - p)
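A quick numerical check of this geometric distribution (a sketch; the mean duration is 1/(1-p)):

def duration_pmf(p, lmax=20):
    # P(l) = p**(l-1) * (1-p): probability of staying exactly l steps
    return [(l, p ** (l - 1) * (1 - p)) for l in range(1, lmax + 1)]

for l, prob in duration_pmf(0.9, 5):
    print(l, round(prob, 4))
# expected length: sum_l l * P(l) = 1 / (1 - p) = 10 for p = 0.9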
[Plot: P(l) versus segment length l for p = 0.9, 0.5 and 0.1]
The length distribution of the generated segments is always exponential-like.
[Plot: an arbitrary target length distribution P(1) … P(8)]
How can we model other length distributions?
Limited case
[Diagram: Begin → 1 → 2 → 3 → 4 → … → N → End; each state i also has a direct exit with probability p_i]

This topology can model any length distribution between 1 and N:

P(1) = p_1
P(2) = (1 - p_1) p_2
P(3) = (1 - p_1)(1 - p_2) p_3
…
P(N) = Π_{i=1}^{N-1} (1 - p_i) · p_N

and, inverting,

p_1 = P(1)
p_2 = P(2) / (1 - p_1)
…
p_k = P(k) / Π_{i=1}^{k-1} (1 - p_i)
How can we model other length distributions? Non-limited case:
[Diagram: Begin → 1 → 2 → 3 → 4 → … → N → End, with a self-loop on the last state]
This topology can model any length distribution between 1 and N-1, and a geometric decay from N to infinity.
Secondary structure: length statistics
[Plot: frequency versus segment length (residues) for helix, strand and coil]
Secondary structure: model
[Diagram: chains of labelled states implementing duration modelling for each conformation, plus coil states c]
Do we use the same emission probabilities for states sharing the same label?
HMMs for Mapping problems
•Mapping problems in protein prediction
•Labelled HMMs
•Duration modelling
•Models for membrane proteins
[Figure: Porin (β-barrel, outer membrane) and Bacteriorhodopsin (α-helices, inner membrane), as above]
Topography: position of the transmembrane segments along the sequence. Topology of membrane proteins.
[Figure: as above, with the example sequence ALALMLCMLTYRHKELKLKLKK annotated]
A generic model for membrane proteins (TMHMM)
[Diagram: Begin and End connected to three sub-models: transmembrane, inner side, outer side]
Model of β-barrel membrane proteins
[Diagram: Begin and End connected to transmembrane, inner-side and outer-side sub-models]
Labels: transmembrane states and loop states.
Length of transmembrane β-strands: minimum 6 residues, maximum unbounded.
Six different sets of emission parameters: outer loop, inner loop, long globular domains, TM strand edges, TM strand core.
Model of α-helix membrane proteins (HMM1)
[Diagram: transmembrane helices modelled by chains of states (×10, ×12, ×13), connected to inner-side and outer-side sub-models]
Model of α-helix membrane proteins (HMM2)
[Diagram: an alternative topology with ×10 state chains for the transmembrane helices]
TMS probability
[Plot: per-residue TMS probability (0.0 to 1.0) along the sequence of 1A0S, residues 1 to 451]
Dynamic programming filtering procedure
[Plot: TMS probability along the 1A0S sequence, with the predicted TM segments marked]
Maximum-scoring subsequences with constrained segment length and number (the dynamic programming filtering procedure).
[Plot: TMS probability along the 1A0S sequence, with observed and predicted TM segments marked]
www.cbs.dtu.dk/services/TMHMM
Predictors of alpha-transmembrane topology
Hybrid systems: Basics
•Sequence profile based HMMs
1  Y K D Y H S - D K K K G E L - -
2  Y R D Y Q T - D Q K K G D L - -
3  Y R D Y Q S - D H K K G E L - -
4  Y R D Y V S - D H K K G E L - -
5  Y R D Y Q F - D Q K K G S L - -
6  Y K D Y N T - H Q K K N E S - -
7  Y R D Y Q T - D H K K A D L - -
8  G Y G F G - - L I K N T E T T K
9  T K G Y G F G L I K N T E T T K
10 T K G Y G F G L I K N T E T T K

A 0 0 0 0 0 0 0 0 0 0 0 10 0 0 0 0
C 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
D 0 0 70 0 0 0 0 60 0 0 0 0 20 0 0 0
E 0 0 0 0 0 0 0 0 0 0 0 0 70 0 0 0
F 0 0 0 10 0 33 0 0 0 0 0 0 0 0 0 0
G 10 0 30 0 30 0 100 0 0 0 0 50 0 0 0 0
H 0 0 0 0 10 0 0 10 30 0 0 0 0 0 0 0
K 0 40 0 0 0 0 0 0 10 100 70 0 0 0 0 100
I 0 0 0 0 0 0 0 0 30 0 0 0 0 0 0 0
L 0 0 0 0 0 0 0 30 0 0 0 0 0 0 0 0
M 0 0 0 0 0 0 0 0 0 0 0 0 0 60 0 0
N 0 0 0 0 10 0 0 0 0 0 30 10 0 0 0 0
P 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Q 0 0 0 0 40 0 0 0 30 0 0 0 0 0 0 0
R 0 50 0 0 0 0 0 0 0 0 0 0 0 0 0 0
S 0 0 0 0 0 33 0 0 0 0 0 0 10 10 0 0
T 20 0 0 0 0 33 0 0 0 0 0 30 0 30 100 0
V 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0 0
W 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Y 70 0 0 90 0 0 0 0 0 0 0 0 0 0 0 0

(rows: residues; columns: sequence positions; a sequence profile computed from the MSA)
Sequence profiles
Sequence-profile-based HMM
[Figure: at each position t, the input is a column of the sequence profile rather than a single character]

Instead of a sequence of characters s_t (.. A C L P R P E T ...), the input becomes a sequence of M-dimensional vectors v_t, with the constraints

0 ≤ v_t(n) ≤ S_t ∀n    and    Σ_{k=1}^M v_t(k) = S_t

For proteins, M = 20.
Sequence-profile-based HMM
For a sequence of characters, the probability of emission from state k is P(s_t | k) = e_k(s_t).
For a sequence of M-dimensional vectors:

P(v_t | k) = (1/Z) · Π_{n=1}^M e_k(n)^{v_t(n)}

with the constraint ∫ P(v_t | k) d^M v_t = 1. If Σ_{n=1}^M e_k(n) = 1, the normalisation factor Z is independent of the state. Algorithms for training and probability computation can be derived.
Hybrid systems: Basics
•Sequence profile based HMMs
•Membrane protein topology
Scoring the prediction
1) Accuracy: Q2 = P/N, where P is the total number of correctly predicted residues and N is the total number of residues.
2) Correlation coefficient C:
C(s) = [p(s)·n(s) - u(s)·o(s)] / {[p(s)+u(s)][p(s)+o(s)][n(s)+u(s)][n(s)+o(s)]}^(1/2)
where, for each class s, p(s) and n(s) are respectively the total number of correct predictions and correctly rejected assignments, while u(s) and o(s) are the numbers of under- and over-predictions.
3) Accuracy for each discriminated structure s: Q(s) = p(s) / [p(s) + u(s)], where p(s) and u(s) are as in equation 2.
4) Probability of correct predictions: P(s) = p(s) / [p(s) + o(s)], where p(s) and o(s) are as in equation 2.
5) Segment-based measure (Sov).
Topology of β-barrel membrane proteins: performance of the sequence-profile-based HMM

Method                                 Q2   QTMS  Qloop  PTMS  Ploop  Corr  Sov
HMM based on multiple seq. alignment   83%  83%   82%    79%   85%    0.65  0.83
NN based on multiple seq. alignment    78%  74%   82%    81%   76%    0.56  0.79
Standard HMM based on single sequence  76%  77%   76%    72%   80%    0.53  0.64

Martelli PL, Fariselli P, Krogh A, Casadio R. A sequence-profile-based HMM for predicting and discriminating beta barrel membrane proteins. Bioinformatics 18: S46-S53 (2002)
Topology of β-barrel membrane proteins: discriminative power of the profile-based HMM
[Plot: percentage of chains versus I(s | M) = -(1/L) · log P(s | M), comparing beta-barrel membrane proteins (145), globular proteins (1239) and all-helical membrane proteins (188)]
The Bologna predictor for the topology of all membrane proteins
[Flowchart: sequence → sequence profiles → NN, HMM1, HMM2 → MaxSubSeq → topography → von Heijne rule → topology → prediction]
Method      Q2 %  Corr   QTM %  QLoop %  PTM %  PLoop %  Qtopography   Qtopology    SOV
NN°         85.8  0.714  84.1   87.3     84.7   86.8     49/59 (83%)   38/59 (64%)  0.908
HMM1°       84.4  0.692  88.6   80.9     79.5   89.4     48/59 (81%)   38/59 (64%)  0.896
HMM2°       82.4  0.658  88.0   78.1     77.1   88.1     48/59 (81%)   39/59 (66%)  0.872
Jury°       85.3  0.708  86.9   84.1     82.1   88.4     53/59 (90%)   42/59 (71%)  0.926
TMHMM 2.0   82.3  0.661  70.9   92.9     89.3   79.2     42/59 (71%)   32/59 (54%)  0.840
MEMSAT      83.6  0.672  70.6   93.9     90.5   75.7     35/49 (71%)   24/49 (49%)  0.823
PHD         80.1  0.614  63.6   94.0     89.9   75.5     43/59 (73%)   30/59 (51%)  0.847
HMMTOP      81.2  0.627  68.5   91.8     87.5   77.7     45/59 (76%)   35/59 (59%)  0.862
Nir Ben-Tal                                              46/59 (78%)
KD                                                       43/59 (73%)
° Test

Topology of all-α membrane proteins: performance
Martelli PL, Fariselli P, Casadio R. An ENSEMBLE machine learning approach for the prediction of all-alpha membrane proteins. Bioinformatics (in press, 2003)
HMM: Application in gene finding
•Basics
Eukaryotic gene structure
A. Krogh
Simple model for coding regions
A. Krogh
Simple model for unspliced gene
A. Krogh
Simple model for spliced gene
A. Krogh