PROLOGUE: Pitfalls of standard alignments


Page 1: PROLOGUE: Pitfalls of standard alignments

PROLOGUE:

Pitfalls of standard alignments

Page 2: PROLOGUE: Pitfalls of standard alignments

A: ALAEVLIRLITKLYP
B: ASAKHLNRLITELYP

Score(A, B) = Σ_i s(A_i, B_i)

Blosum62

Scoring a pairwise alignment
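A minimal sketch of this per-position scoring formula in Python, assuming Biopython is available for the BLOSUM62 table (the two sequences are the ones on this slide):

# Score(A, B) = sum_i s(A_i, B_i) for a gapless pairwise alignment.
from Bio.Align import substitution_matrices

blosum62 = substitution_matrices.load("BLOSUM62")

A = "ALAEVLIRLITKLYP"
B = "ASAKHLNRLITELYP"

# Add up the substitution score at every aligned column.
score = sum(blosum62[a, b] for a, b in zip(A, B))
print(score)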

Page 3: PROLOGUE: Pitfalls of standard alignments

Alignment of a family (globins)


Different positions are not equivalent

Page 4: PROLOGUE: Pitfalls of standard alignments

http://weblogo.berkeley.edu/cache/file5h2DWc.png

The substitution score IN A FAMILY should depend on the position (the same for gaps)

Sequence logos

For modelling families we need more flexible tools

Page 5: PROLOGUE: Pitfalls of standard alignments

Probabilistic Models for Biological Sequences

•What are they?

Page 6: PROLOGUE: Pitfalls of standard alignments

Generative definition:

•Objects producing different outcomes (sequences) with different probabilities

•The probability distribution over the sequences space determines the model specificity

Probabilistic models for sequences

[Figure: the model M defines a probability distribution over the sequence space]

Generates s_i with probability P(s_i | M). E.g.: M is the representation of the family of globins.

Page 7: PROLOGUE: Pitfalls of standard alignments

Associative definition:

•Objects that, given an outcome (sequence), compute a probability value

Probabilistic models for sequences

[Figure: the model M defines a probability distribution over the sequence space]

Associates the probability P(s_i | M) to s_i

e.g.: M is the representation of the family of globins

We don't need a generator of new biological sequences: the generative definition is useful as an operative definition.

Page 8: PROLOGUE: Pitfalls of standard alignments

Probabilistic models for sequences

The most useful probabilistic models are trainable systems.

The probability density function over the sequence space is estimated from known examples by means of a learning algorithm.

[Figure: known examples in the sequence space, and the pdf estimate obtained from them (generalization)]

e.g.: Writing a generic representation of the sequences of globins starting from a set of known globins

Page 9: PROLOGUE: Pitfalls of standard alignments

Probabilistic Models for Biological Sequences

• What are they?
• Why use them?

Page 10: PROLOGUE: Pitfalls of standard alignments

Modelling a protein family

Probabilistic model

Seq1  Seq2  Seq3  Seq4  Seq5  Seq6
0.98  0.21  0.12  0.89  0.47  0.78

Given a protein class (e.g. Globins), a probabilistic model trained on this family can compute a probability value for each new sequence

This value measures the similarity between the new sequence and the family described by the model

Page 11: PROLOGUE: Pitfalls of standard alignments

Probabilistic Models for Biological Sequences

• What are they?
• Why use them?
• Which probabilities do they compute?

Page 12: PROLOGUE: Pitfalls of standard alignments

A model M associates with each sequence s the probability P( s | M ).

This probability answers the question:

What is the probability that a model describing the globins generates the sequence s?

The question we want to answer is:

Given a sequence s, is it a Globin?

We need to compute P( M | s ) !!

P( s | M ) or P( M | s ) ?

Page 13: PROLOGUE: Pitfalls of standard alignments

Joint probability: P(X, Y) = P(X | Y) P(Y) = P(Y | X) P(X)

So:

P(Y | X) = P(X | Y) P(Y) / P(X)

Bayes Theorem:

P(M | s) = P(s | M) P(M) / P(s)

where P(M) and P(s) are the a priori probabilities.

In P(M | s) the evidence is s and the conclusion is M; in P(s | M) the evidence is M and the conclusion is s.

Page 14: PROLOGUE: Pitfalls of standard alignments

P(M | s) = P(s | M) P(M) / P(s), where P(M) and P(s) are the a priori probabilities.

The A priori probabilities

P(M) is the probability of the model (i.e. of the class described by the model) BEFORE we know the sequence:

can be estimated as the abundance of the class

P(s) is the probability of the sequence in the sequence space.

Cannot be reliably estimated!!

Page 15: PROLOGUE: Pitfalls of standard alignments

Comparison between models

P(M1 | s) / P(M2 | s) = [ P(s | M1) P(M1) / P(s) ] / [ P(s | M2) P(M2) / P(s) ] = P(s | M1) P(M1) / P(s | M2) P(M2)

We can overcome the problem by comparing the probabilities of generating s from different models.

P(M1) / P(M2) is the ratio between the abundances of the two classes.

Page 16: PROLOGUE: Pitfalls of standard alignments

Null model

Otherwise we can score a sequence for a model M by comparing it to a Null Model: a model that generates ALL the possible sequences with probabilities that depend ONLY on the statistical amino acid abundance.

S(M, s) = log [ P(s | M) / P(s | N) ]

In this case we need a threshold and a statistic for evaluating the significance (E-value, P-value).

[Figure: distribution of S(M, s) for sequences belonging to model M and for sequences NOT belonging to model M]
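A small sketch of this log-odds score; the numbers below (background composition and P(s | M)) are illustrative placeholders, not values taken from the slides:

import math

def null_prob(seq, background):
    # P(s | N): positions independent, background composition only.
    p = 1.0
    for c in seq:
        p *= background[c]
    return p

# Illustrative toy values only:
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
seq = "GATGCGTCGC"
p_model = 2.0e-5            # assume P(s | M) was computed from a trained model

S = math.log(p_model / null_prob(seq, background))
print(S)                    # accept the sequence if S exceeds a chosen threshold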

Page 17: PROLOGUE: Pitfalls of standard alignments

The simplest probabilistic models: Markov Models

•Definition

Page 18: PROLOGUE: Pitfalls of standard alignments

[State diagram with four states]

C: Clouds, R: Rain, F: Fog, S: Sun

Markov Models Example: Weather

Register the weather conditions day by day:

as a first hypothesis the weather condition in a day depends ONLY on the weather conditions in the day before.

Define the conditional probabilities

P(C|C), P(C|R),…. P(R|C)…..

The probability of the 5-day registration CRRCS is

P(CRRCS) = P(C) · P(R|C) · P(R|R) · P(C|R) · P(S|C)

Page 19: PROLOGUE: Pitfalls of standard alignments

Markov Model

A Markov model is a stochastic generator of sequences in which the probability of the state in position i depends ONLY on the state in position i-1.

Given an alphabet C = {c1, c2, c3, …, cN}, a Markov model is described by N×(N+2) parameters {a_rt, a_BEGIN,t, a_r,END ; r, t ∈ C}:

a_rq = P( s_i = q | s_i-1 = r )
a_BEGIN,q = P( s_1 = q )
a_r,END = P( END | s_T = r )

with the normalisation constraints

Σ_t a_rt + a_r,END = 1  for every r
Σ_t a_BEGIN,t = 1

[State diagram: BEGIN and END states connected to the states c1, c2, …, cN]

Page 20: PROLOGUE: Pitfalls of standard alignments

Markov Models

Given the sequence s = s1 s2 s3 s4 … sT, with s_i ∈ C = {c1, c2, c3, …, cN}:

P( s | M ) = P( s1 ) · Π_{i=2..T} P( s_i | s_i-1 ) = a_BEGIN,s1 · Π_{i=2..T} a_s(i-1),s(i) · a_sT,END

P("ALKALI") = a_BEGIN,A · a_A,L · a_L,K · a_K,A · a_A,L · a_L,I · a_I,END
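A sketch of this product in Python; the transition values below are hypothetical placeholders, only the formula follows the slide:

def markov_prob(seq, a_begin, a, a_end):
    # P(s|M) = a_{BEGIN,s1} * prod_i a_{s(i-1),s(i)} * a_{sT,END}
    p = a_begin[seq[0]]
    for prev, curr in zip(seq, seq[1:]):
        p *= a[prev][curr]
    return p * a_end[seq[-1]]

# Hypothetical 4-letter toy model, applied to the word "ALKALI" of the slide:
states = "ALKI"
a_begin = {c: 0.25 for c in states}
a_end   = {c: 0.1 for c in states}
a = {r: {t: 0.9 / len(states) for t in states} for r in states}  # each row sums to 0.9 (+0.1 to END)

print(markov_prob("ALKALI", a_begin, a, a_end))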

Page 21: PROLOGUE: Pitfalls of standard alignments

Markov Models: Exercise

[Figure: two weather Markov models over the states C, R, F, S, with the transition probabilities written on the arrows; some of the values are left as ??]

1) Fill in the undefined values of the transition probabilities.

Page 22: PROLOGUE: Pitfalls of standard alignments

Markov Models: Exercise

[Figure: the same two models, now with all transition probabilities filled in]

2) Which model better describes the weather in summer? Which one describes the weather in winter?

Page 23: PROLOGUE: Pitfalls of standard alignments

Markov Models: Exercise

[Figure: the two models, labelled Winter and Summer, with their transition probabilities]

3) Given the sequence CSSSCFS, which model gives the higher probability? [Consider the starting probabilities: P(X|BEGIN) = 0.25]

Page 24: PROLOGUE: Pitfalls of standard alignments

Markov Models: Exercise

P(CSSSCFS | Winter) = 0.25 × 0.1 × 0.2 × 0.2 × 0.3 × 0.2 × 0.2 = 1.2 × 10^-5

P(CSSSCFS | Summer) = 0.25 × 0.4 × 0.8 × 0.8 × 0.1 × 0.1 × 1.0 = 6.4 × 10^-4
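The same products as a short Python sketch; the transition dictionaries below contain only the values that the worked example actually uses (read off the factors above), the rest of the two models is omitted:

def seq_prob(seq, start, a):
    p = start
    for prev, curr in zip(seq, seq[1:]):
        p *= a[(prev, curr)]
    return p

winter = {("C","S"): 0.1, ("S","S"): 0.2, ("S","C"): 0.3, ("C","F"): 0.2, ("F","S"): 0.2}
summer = {("C","S"): 0.4, ("S","S"): 0.8, ("S","C"): 0.1, ("C","F"): 0.1, ("F","S"): 1.0}

print(seq_prob("CSSSCFS", 0.25, winter))   # 1.2e-5
print(seq_prob("CSSSCFS", 0.25, summer))   # 6.4e-4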

4) Can we conclude that the observation sequence refers to a summer week?


Page 25: PROLOGUE: Pitfalls of standard alignments

Markov Models: Exercise

P(Seq | Winter) = 1.2 × 10^-5
P(Seq | Summer) = 6.4 × 10^-4

P(Summer | Seq) / P(Winter | Seq) = [ P(Seq | Summer) P(Summer) ] / [ P(Seq | Winter) P(Winter) ]

Page 26: PROLOGUE: Pitfalls of standard alignments

[State diagram over the four nucleotides A, C, G, T]

DNA: C = {Adenine, Cytosine, Guanine, Thymine}

16 transition probabilities (12 of which independent) + 4 Begin probabilities + 4 End probabilities.

Simple Markov Model for DNA sequences

The parameters of the model are different in different zones of DNA.

They describe the overall composition and the dinucleotide (couple) recurrences.

Page 27: PROLOGUE: Pitfalls of standard alignments

Example of Markov Models: GpC Islands

[Two state diagrams over A, C, G, T: one for GpC Islands, one for Non-GpC Islands]

In the Markov Model of GpC Islands, a_GC is higher than in the Markov Model of Non-GpC Islands.

Given a sequence s we can evaluate

P( GpC | s ) = P( s | GpC ) · P(GpC) / [ P( s | GpC ) · P(GpC) + P( s | nonGpC ) · P(nonGpC) ]

GATGCGTCGC

CTACGCAGCG

Page 28: PROLOGUE: Pitfalls of standard alignments

The simplest probabilistic models: Markov Models

• Definition
• Training

Page 29: PROLOGUE: Pitfalls of standard alignments

Probabilistic training of a parametric method

Generally speaking, a parametric model M aims to reproduce a set of known data.

Model M with parameters θ

Modelled data vs. real data (D)

How to compare them?

Page 30: PROLOGUE: Pitfalls of standard alignments

Training of Markov Models

Let θ be the set of parameters of model M. During the training phase, the parameters θ are estimated from the set of known data D.

Maximum Likelihood Estimation (ML):
θ^ML = argmax_θ P( D | M, θ )

Maximum A Posteriori Estimation (MAP):
θ^MAP = argmax_θ P( θ | M, D ) = argmax_θ [ P( D | M, θ ) P( θ ) ]

It can be proved that, for ML,

a_ik = n_ik / Σ_j n_ij

i.e. the frequency of occurrence as counted in the data set D.
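A minimal sketch of this counting estimate (first-order transitions only, BEGIN/END ignored); the three training strings are made up for illustration:

from collections import defaultdict

def ml_transitions(sequences):
    # a_ik = n_ik / sum_j n_ij, counted over the training sequences.
    counts = defaultdict(lambda: defaultdict(int))
    for s in sequences:
        for prev, curr in zip(s, s[1:]):
            counts[prev][curr] += 1
    return {r: {t: n / sum(row.values()) for t, n in row.items()}
            for r, row in counts.items()}

print(ml_transitions(["CRRCS", "CSSRC", "RRCSS"]))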

Page 31: PROLOGUE: Pitfalls of standard alignments

Example (coin-tossing)

Given N tosses of a coin (our data D), the outcomes are h heads and t tails (N = t + h).

ASSUME the model

P(D|M) = p^h (1 - p)^t

Computing the maximum of the likelihood P(D|M):

dP(D|M)/dp = p^(h-1) (1 - p)^(t-1) [ h(1 - p) - tp ] = 0

We obtain that our estimate of p is

p = h / (h+t) = h / N
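A quick numerical check of this result, assuming for illustration h = 7 heads and t = 3 tails:

# Grid search over p: the likelihood p^h (1-p)^t peaks at p = h / (h + t).
h, t = 7, 3
grid = [i / 1000 for i in range(1, 1000)]
best_p = max(grid, key=lambda p: p**h * (1 - p)**t)
print(best_p, h / (h + t))   # both close to 0.7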

Page 32: PROLOGUE: Pitfalls of standard alignments

Example (Error measure)

Suppose you think that your data are affected by a Gaussian error, so that they are distributed according to

F(x_i) = A · exp[ -(x_i - μ)² / 2σ² ]

with A = 1 / sqrt(2πσ²).

If your measures are independent, the data likelihood is

P(Data | model) = Π_i F(x_i)

Find μ and σ that maximize P(Data | model).
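For this Gaussian case the maximisation has a closed-form solution: μ is the sample mean and σ the biased (divide-by-N) sample standard deviation. A short NumPy check on made-up data:

import numpy as np

x = np.array([1.2, 0.9, 1.5, 1.1, 0.8])     # illustrative measurements
mu_ml = x.mean()
sigma_ml = np.sqrt(((x - mu_ml) ** 2).mean())   # note: divide by N, not N-1
print(mu_ml, sigma_ml)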

Page 33: PROLOGUE: Pitfalls of standard alignments

Maximum Likelihood training: Proof

Given a sequence s contained in D: s = s1 s2 s3 … sT,

P( s | M ) = a_BEGIN,s1 · Π_{i=2..T} a_s(i-1),s(i) · a_sT,END

We can count the number of transitions between any two states j and k: n_jk. Then

P( s | M ) = Π_{j=0..N} Π_{k=1..N+1} a_jk^(n_jk)

where states 0 and N+1 are BEGIN and END.

Maximising P( s | M ) with respect to the a_jk, subject to the normalisation constraints Σ_k a_jk = 1 (taken into account using the Lagrange multipliers λ_j), i.e. solving

∂/∂a_jk [ log P( s | M ) - Σ_j λ_j ( Σ_k a_jk - 1 ) ] = 0

gives

a_jk = n_jk / Σ_k' n_jk'

Page 34: PROLOGUE: Pitfalls of standard alignments

Hidden Markov Models

•Preliminary examples

Page 35: PROLOGUE: Pitfalls of standard alignments

Given a sequence:

4156266656321636543662152611536264162364261664616263

We don’t know the sequence of dice that generated it.

RRRRRLRLRRRRRRRLRRRRRRRRRRRRLRLRRRRRRRRLRRRRLRRRRLRR

Loaded dice

We have 99 regular dice (R) and 1 loaded die (L).

     P(1)   P(2)   P(3)   P(4)   P(5)   P(6)
R    1/6    1/6    1/6    1/6    1/6    1/6
L    1/10   1/10   1/10   1/10   1/10   1/2

Page 36: PROLOGUE: Pitfalls of standard alignments

Hypothesis:

We chose a different die for each roll

Two stochastic processes give origin to the sequence of observations.

1) Choosing the die (R or L). 2) Rolling the die.

The sequence of dice is hidden

The first process is assumed to be Markovian (in this case a 0-order MM)

The outcome of the second process depends only on the state reached in the first process (that is the chosen die)

Loaded dice

Page 37: PROLOGUE: Pitfalls of standard alignments

Model

Each state (R and L) generates a character of the alphabet

C = {1, 2, 3, 4, 5, 6 }

The emission probabilities depend only on the state.

The transition probabilities describe a Markov model that generates a state path: the hidden sequence (π)

The observations sequence (s) is generated by two concomitant stochastic processes

[State diagram: states R and L, with a_RR = a_LL = 0.99 and a_RL = a_LR = 0.01]

Casino

Page 38: PROLOGUE: Pitfalls of standard alignments

The observations sequence (s) is generated by two concomitant stochastic processes

[The same two-state model as above]

Choose the State: R (probability = 0.99)
Choose the Symbol: 1 (probability = 1/6, given R)

4156266656321636543662152611
RRRRRLRLRRRRRRRLRRRRRRRRRRRR

Page 39: PROLOGUE: Pitfalls of standard alignments

The observations sequence (s) is generated by two concomitant stochastic processes

[The same two-state model as above]

Choose the State: L (probability = 0.01)
Choose the Symbol: 5 (probability = 1/10, given L)

4156266656321636543662152611
RRRRRLRLRRRRRRRLRRRRRRRRRRRR

41562666563216365436621526115
RRRRRLRLRRRRRRRLRRRRRRRRRRRRL

Page 40: PROLOGUE: Pitfalls of standard alignments

Model

Each state (R and L) generates a character of the alphabet C = {1, 2, 3, 4, 5, 6 }

The emission probabilities depend only on the state.

The transition probabilities describe a Markov model that generates a state path: the hidden sequence ()

The observations sequence (s) is generated by two concomitant stochastic processes

[The same two-state model as above: a_RR = a_LL = 0.99, a_RL = a_LR = 0.01]

Loaded dice

Page 41: PROLOGUE: Pitfalls of standard alignments

Some not-so-serious examples

1) DEMOGRAPHY

Observable: number of births and deaths in a year in a village. Hidden variable: economic conditions (as a first approximation we can consider the success in business as a random variable and, by consequence, the wealth as a Markov variable).

---> can we deduce the economic conditions of a village during a century by means of the register of births and deaths?

2) THE METEOROPATHIC TEACHER

Observable: average of the marks that a meteoropathic teacher gives to their students during a day. Hidden variable: weather conditions.

---> can we deduce the weather conditions during a year by means of the class register?

Page 42: PROLOGUE: Pitfalls of standard alignments

To be more serious

1) SECONDARY STRUCTURE Observable: protein sequence Hidden variable: secondary structure

---> can we deduce (predict) the secondary structure of a protein given its amino acid sequence?

2) ALIGNMENT Observable: protein sequence Hidden variable: position of each residue along the alignment of a protein family

---> can we align a protein to a family, starting from its amino acid sequence?

Page 43: PROLOGUE: Pitfalls of standard alignments

Hidden Markov Models

• Preliminary examples
• Formal definition

Page 44: PROLOGUE: Pitfalls of standard alignments

Formal definition of Hidden Markov Models

A HMM is a stochastic generator of sequences characterised by:

N states

A set of transition probabilities between two states {a_kj}:
a_kj = P( π(i) = j | π(i-1) = k )

A set of starting probabilities {a_0k}:
a_0k = P( π(1) = k )

A set of ending probabilities {a_k0}:
a_k0 = P( π(i) = END | π(i-1) = k )

An alphabet C with M characters and, for each state, a set of emission probabilities {e_k(c)}:
e_k(c) = P( s_i = c | π(i) = k )

Constraints:
Σ_k a_0k = 1
a_k0 + Σ_j a_kj = 1  for every k
Σ_{c∈C} e_k(c) = 1  for every k

s: sequence; π: path through the states

Page 45: PROLOGUE: Pitfalls of standard alignments

Generating a sequence with a HMM

Choose the initial state π(1) following the probabilities a_0k; set i = 1.

Choose the character s_i from the alphabet C following the probabilities e_k(c).

Choose the next state following the probabilities a_kj and a_k0.

Is the END state chosen? If yes, end; if no, set i ← i + 1 and repeat.
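A sketch of this generation loop, using the GpC-island toy parameters introduced on the following slides:

import random

a0 = {"Y": 0.2, "N": 0.8}
a  = {"Y": {"Y": 0.7, "N": 0.2, "END": 0.1},
      "N": {"Y": 0.1, "N": 0.8, "END": 0.1}}
e  = {"Y": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
      "N": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}

def pick(dist):
    # Sample one key of the dictionary, weighted by its values.
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate():
    seq, path = [], []
    state = pick(a0)                 # initial state, probabilities a_0k
    while state != "END":
        path.append(state)
        seq.append(pick(e[state]))   # emit a character with e_k(c)
        state = pick(a[state])       # move with a_kj / a_k0
    return "".join(seq), "".join(path)

print(*generate())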

Page 46: PROLOGUE: Pitfalls of standard alignments

s: A G C G C G T A A T C T G
π: Y Y Y Y Y Y Y N N N N N N

P( s, π | M ) can be easily computed.

GpC Island, simple model (state Y: GpC Island, state N: Non-GpC Island):

Transitions:
a_YY = 0.7   a_YN = 0.2   a_Y0 = 0.1
a_NN = 0.8   a_NY = 0.1   a_N0 = 0.1
a_0Y = 0.2   a_0N = 0.8

Emissions:
e_Y(A) = 0.1   e_Y(C) = 0.4   e_Y(G) = 0.4   e_Y(T) = 0.1
e_N(A) = 0.25  e_N(C) = 0.25  e_N(G) = 0.25  e_N(T) = 0.25

Page 47: PROLOGUE: Pitfalls of standard alignments

P( s, π | M ) can be easily computed.

[GpC Island model as above]

s:          A    G    C    G    C    G    T    A     A     T     C     T     G
π:          Y    Y    Y    Y    Y    Y    Y    N     N     N     N     N     N
Emission:   0.1  0.4  0.4  0.4  0.4  0.4  0.1  0.25  0.25  0.25  0.25  0.25  0.25
Transition: 0.2  0.7  0.7  0.7  0.7  0.7  0.7  0.2   0.8   0.8   0.8   0.8   0.8   0.1

Multiplying all the probabilities gives the probability of having the sequence AND the path through the states

Page 48: PROLOGUE: Pitfalls of standard alignments

Evaluation of the joint probability of the sequence and the path:

P( s, π | M ) = P( s | π, M ) · P( π | M )

P( π | M ) = a_0,π(1) · Π_{i=2..T} a_π(i-1),π(i) · a_π(T),0

P( s | π, M ) = Π_{i=1..T} e_π(i)( s_i )

P( s, π | M ) = a_0,π(1) · Π_{i=1..T} [ e_π(i)( s_i ) · a_π(i),π(i+1) ]   with a_π(T),π(T+1) ≡ a_π(T),0
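A sketch of this joint probability for the GpC-island toy model; it reproduces the product of emissions and transitions shown on the previous slide:

a0   = {"Y": 0.2, "N": 0.8}
aEnd = {"Y": 0.1, "N": 0.1}
a    = {"Y": {"Y": 0.7, "N": 0.2}, "N": {"Y": 0.1, "N": 0.8}}
e    = {"Y": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "N": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}

def joint_prob(s, pi):
    # P(s, pi | M) = a_{0,pi(1)} * prod_i e_{pi(i)}(s_i) a_{pi(i),pi(i+1)} * a_{pi(T),0}
    p = a0[pi[0]]
    for i, (c, k) in enumerate(zip(s, pi)):
        p *= e[k][c]
        if i + 1 < len(pi):
            p *= a[k][pi[i + 1]]
    return p * aEnd[pi[-1]]

print(joint_prob("AGCGCGTAATCTG", "YYYYYYYNNNNNN"))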

Page 49: PROLOGUE: Pitfalls of standard alignments

Hidden Markov Models

• Preliminary examples
• Formal definition
• Three questions

Page 50: PROLOGUE: Pitfalls of standard alignments

s: A G C G C G T A A T C T G
π: ? ? ? ? ? ? ? ? ? ? ? ? ?

P( s, π | M ) can be easily computed. How do we evaluate P( s | M )?

GpC Island, simple model

[GpC Island model as above]

Page 51: PROLOGUE: Pitfalls of standard alignments

How to evaluate P ( s | M )?

[GpC Island model as above]

s: A G C G C G T A A T C T G
π: Y Y Y Y Y Y Y Y Y Y Y Y Y
   Y Y Y Y Y Y Y Y Y Y Y Y N
   Y Y Y Y Y Y Y Y Y Y Y N Y
   Y Y Y Y Y Y Y Y Y Y Y N N
   Y Y Y Y Y Y Y Y Y Y N Y Y
   …

There are 2^13 different paths. Summing over all the paths gives the probability of having the sequence:

P( s | M ) = Σ_π P( s, π | M )

Page 52: PROLOGUE: Pitfalls of standard alignments

s: A G C G C G T A A T C T G
π: ? ? ? ? ? ? ? ? ? ? ? ? ?

P( s, π | M ) can be easily computed. How to evaluate P( s | M )? Can we show the hidden path?

Resumé: GpC Island, simple model

[GpC Island model as above]

Page 53: PROLOGUE: Pitfalls of standard alignments

Can we show the hidden path?

[GpC Island model as above]

s: A G C G C G T A A T C T G   (2^13 different paths, as before)

Viterbi path: the path that gives the best joint probability:

π* = argmax_π [ P( π | s, M ) ] = argmax_π [ P( π, s | M ) ]

Page 54: PROLOGUE: Pitfalls of standard alignments

A Posteriori decoding

For each position i choose the state π(i) = argmax_k [ P( π(i) = k | s, M ) ]

The contribution to this probability derives from all the paths that go through the state k at position i.

The A posteriori path can be a non-sense path (it may not be a legitimate path if some transitions are not permitted in the model)

Can we show the hidden path?

Page 55: PROLOGUE: Pitfalls of standard alignments

s: A G C G C G T A A T C T G
π: Y Y Y Y Y Y Y N N N N N N

P( s, π | M ) can be easily computed. How to evaluate P( s | M )? Can we show the hidden path? Can we evaluate the parameters starting from known examples?

GpC Island, simple model

[GpC Island model as above, but with all transition and emission probabilities unknown (?)]

Page 56: PROLOGUE: Pitfalls of standard alignments

Can we evaluate the parameters starting from known examples?

[GpC Island model as above, with all transition and emission probabilities unknown (?)]

s:          A      G      C      G      C      G      T      A      A      T      C      T      G
π:          Y      Y      Y      Y      Y      Y      Y      N      N      N      N      N      N
Emission:   eY(A)  eY(G)  eY(C)  eY(G)  eY(C)  eY(G)  eY(T)  eN(A)  eN(A)  eN(T)  eN(C)  eN(T)  eN(G)
Transition: a0Y  aYY  aYY  aYY  aYY  aYY  aYY  aYN  aNN  aNN  aNN  aNN  aNN  aN0

How do we find the parameters e and a that maximise this probability? And what if we don't know the path?

Page 57: PROLOGUE: Pitfalls of standard alignments

Hidden Markov Models: Algorithms

• Resumé
• Evaluating P(s | M): Forward Algorithm

Page 58: PROLOGUE: Pitfalls of standard alignments

Computing P( s, π | M ) separately for each path is a redundant operation:

s:          A   G   C   G   C   G   T   A   A   T   C   T   G
π:          Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   Y
Emission:   0.1 0.4 0.4 0.4 0.4 0.4 0.1 0.1 0.1 0.1 0.4 0.1 0.4
Transition: 0.2 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.1

s:          A   G   C   G   C   G   T   A   A   T   C   T   G
π:          Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   Y   N
Emission:   0.1 0.4 0.4 0.4 0.4 0.4 0.1 0.1 0.1 0.1 0.4 0.1 0.25
Transition: 0.2 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.2 0.1

[GpC Island model as above]

Page 59: PROLOGUE: Pitfalls of standard alignments

Summing over all the possible paths

[GpC Island model as above]

s: A G   π: Y Y   Emission: 0.1, 0.4    Transition: 0.2, 0.7   ->  0.0056
s: A G   π: N Y   Emission: 0.25, 0.4   Transition: 0.8, 0.1   ->  0.008
s: A G   π: Y N   Emission: 0.1, 0.25   Transition: 0.2, 0.2   ->  0.001
s: A G   π: N N   Emission: 0.25, 0.25  Transition: 0.8, 0.8   ->  0.04

Summing the paths that end in the same state:
s: A G   π: X Y   ->  0.0056 + 0.008 = 0.0136
s: A G   π: X N   ->  0.001 + 0.04 = 0.041

Page 60: PROLOGUE: Pitfalls of standard alignments

Summing over all the possible paths

[GpC Island model as above]

s: A G C   π: X Y Y   ->  0.0136 × 0.7 × 0.4
s: A G C   π: X N Y   ->  0.041 × 0.1 × 0.4
s: A G C   π: X Y N   ->  0.0136 × 0.2 × 0.25
s: A G C   π: X N N   ->  0.041 × 0.8 × 0.25

Summing the paths that end in the same state:
s: A G C   π: X X Y   ->  0.0136 × 0.7 × 0.4 + 0.041 × 0.1 × 0.4 = 0.005448
s: A G C   π: X X N   ->  0.0136 × 0.2 × 0.25 + 0.041 × 0.8 × 0.25 = 0.00888

Page 61: PROLOGUE: Pitfalls of standard alignments

[GpC Island model as above]

Summing over all the possible paths, iterating until the last position of the sequence:

s: A G C G C G T A A T C T G   π: X X X X X X X X X X X X Y   ->  multiply by 0.1 (a_Y0)
s: A G C G C G T A A T C T G   π: X X X X X X X X X X X X N   ->  multiply by 0.1 (a_N0)

Adding the two terms gives P( s | M ).

Page 62: PROLOGUE: Pitfalls of standard alignments

On the basis of the preceding observations, the computation of P( s | M ) can be decomposed into simpler problems.

Forward Algorithm

For each state k and each position i in the sequence we compute:

F_k(i) = P( s1 s2 s3 … s_i, π(i) = k | M )

Initialisation:  F_BEGIN(0) = 1,  F_i(0) = 0 for i ≠ BEGIN

Recurrence:  F_l(i+1) = P( s1 s2 … s_i s_i+1, π(i+1) = l )
                      = Σ_k P( s1 s2 … s_i, π(i) = k ) a_kl e_l( s_i+1 )
                      = e_l( s_i+1 ) Σ_k F_k(i) a_kl

Termination:  P( s ) = P( s1 s2 s3 … s_T, π(T+1) = END )
                     = Σ_k P( s1 s2 … s_T, π(T) = k ) a_k0
                     = Σ_k F_k(T) a_k0

(The conditioning on M is understood.)
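A dictionary-based sketch of the forward recursion, again on the GpC-island toy model (the emission of the first character is folded into the initialisation):

a0   = {"Y": 0.2, "N": 0.8}
aEnd = {"Y": 0.1, "N": 0.1}
a    = {"Y": {"Y": 0.7, "N": 0.2}, "N": {"Y": 0.1, "N": 0.8}}
e    = {"Y": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "N": {c: 0.25 for c in "ACGT"}}
states = ["Y", "N"]

def forward(s):
    F = [{k: a0[k] * e[k][s[0]] for k in states}]                 # initialisation
    for c in s[1:]:
        # F_l(i+1) = e_l(s_{i+1}) * sum_k F_k(i) a_kl
        F.append({l: e[l][c] * sum(F[-1][k] * a[k][l] for k in states)
                  for l in states})
    return sum(F[-1][k] * aEnd[k] for k in states)                # termination

print(forward("AGCGCGTAATCTG"))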

Page 63: PROLOGUE: Pitfalls of standard alignments

Computing P( s, π | M ) separately for each path is a redundant operation.

[Trellis: states Begin, R, L, End against iterations 0, 1, 2, 3, …, T]

P( s, π1 | M ) = a_0,π1(1) · Π_{t=1..T-1} [ e_π1(t)( s_t ) · a_π1(t),π1(t+1) ] · e_π1(T)( s_T ) · a_π1(T),0

P( s, π2 | M ) = a_0,π2(1) · Π_{t=1..T-1} [ e_π2(t)( s_t ) · a_π2(t),π2(t+1) ] · e_π2(T)( s_T ) · a_π2(T),0

If we compute the common part only once we gain 2·(T-1) operations

Page 64: PROLOGUE: Pitfalls of standard alignments

[Trellis: states Begin, R, L, End against iterations 0, 1, 2, 3, …, T]

If we know the probabilities of emitting the two first characters of the sequence, ending the path in states R and L respectively:

F_R(2) ≡ P( s1, s2, π(2) = R | M )   and   F_L(2) ≡ P( s1, s2, π(2) = L | M )

then we can compute:

P( s1, s2, s3, π(3) = R | M ) = F_R(2) · a_RR · e_R(s3) + F_L(2) · a_LR · e_R(s3)

Summing over all the possible paths

Page 65: PROLOGUE: Pitfalls of standard alignments

Forward Algorithm

[Trellis: states BEGIN, A, B, END against iterations 0, 1, 2, …, T, T+1. The value F_B(2) is obtained as e_B(s2) · Σ_i F_i(1) · a_iB; the END column at iteration T+1 gives P( s | M ).]

Page 66: PROLOGUE: Pitfalls of standard alignments

Forward algorithm: computational complexity

Naïve method:

P( s | M ) = Σ_π P( s, π | M )

There are N^T possible paths, and each path requires about 2T operations, so the time for the computation is O( T · N^T ).

[The two example paths of the previous slides, which differ only in their last state]

Page 67: PROLOGUE: Pitfalls of standard alignments

[The summation step of the previous slides]

Forward algorithm: computational complexity

Forward algorithm: T positions, with N values for each position. Each element requires about 2N products and one sum, so the time for the computation is O( T · N² ).

Page 68: PROLOGUE: Pitfalls of standard alignments

Forward algorithm: computational complexity

[Plot: number of operations versus sequence length T for the naïve method and for the forward algorithm]

Page 69: PROLOGUE: Pitfalls of standard alignments

Hidden Markov Models: Algorithms

• Resumé
• Evaluating P(s | M): Forward Algorithm
• Evaluating P(s | M): Backward Algorithm

Page 70: PROLOGUE: Pitfalls of standard alignments

Backward Algorithm: similar to the Forward algorithm, it computes P( s | M ) reconstructing the sequence from the end.

For each state k and each position i in the sequence we compute:

B_k(i) = P( s_i+1 s_i+2 s_i+3 … s_T | π(i) = k )

Initialisation:  B_k(T) = P( π(T+1) = END | π(T) = k ) = a_k0

Recurrence:  B_l(i-1) = P( s_i s_i+1 … s_T | π(i-1) = l )
                      = Σ_k P( s_i+1 s_i+2 … s_T | π(i) = k ) a_lk e_k( s_i )
                      = Σ_k B_k(i) e_k( s_i ) a_lk

Termination:  P( s ) = P( s1 s2 s3 … s_T | π(0) = BEGIN )
                     = Σ_k P( s2 … s_T | π(1) = k ) a_0k e_k( s_1 )
                     = Σ_k B_k(1) a_0k e_k( s_1 )
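The same toy model scored with the backward recursion; it returns the same P( s | M ) as the forward sketch above:

a0   = {"Y": 0.2, "N": 0.8}
aEnd = {"Y": 0.1, "N": 0.1}
a    = {"Y": {"Y": 0.7, "N": 0.2}, "N": {"Y": 0.1, "N": 0.8}}
e    = {"Y": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "N": {c: 0.25 for c in "ACGT"}}
states = ["Y", "N"]

def backward(s):
    B = {k: aEnd[k] for k in states}                    # initialisation at i = T
    for c in reversed(s[1:]):                           # i = T, T-1, ..., 2
        # B_l(i-1) = sum_k a_lk e_k(s_i) B_k(i)
        B = {l: sum(a[l][k] * e[k][c] * B[k] for k in states) for l in states}
    return sum(a0[k] * e[k][s[0]] * B[k] for k in states)   # termination

print(backward("AGCGCGTAATCTG"))   # same value as the forward algorithm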

Page 71: PROLOGUE: Pitfalls of standard alignments

Backward Algorithm

[Trellis: states BEGIN, A, B, END against iterations 0, 1, 2, …, T, T+1. The value B_B(T-1) is obtained as Σ_k a_Bk · e_k( s_T ) · B_k(T); iterating back to position 1 gives P( s | M ).]

Page 72: PROLOGUE: Pitfalls of standard alignments

Hidden Markov Models: Algorithms

• Resumé
• Evaluating P(s | M): Forward Algorithm
• Evaluating P(s | M): Backward Algorithm
• Showing the path: Viterbi decoding

Page 73: PROLOGUE: Pitfalls of standard alignments

Finding the best path

[GpC Island model as above]

s: A G   π: Y Y   Emission: 0.1, 0.4    Transition: 0.2, 0.7   ->  0.0056
s: A G   π: N Y   Emission: 0.25, 0.4   Transition: 0.8, 0.1   ->  0.008
s: A G   π: Y N   Emission: 0.1, 0.25   Transition: 0.2, 0.2   ->  0.001
s: A G   π: N N   Emission: 0.25, 0.25  Transition: 0.8, 0.8   ->  0.04

Taking the maximum over the paths that end in the same state:
ending in Y:  max(0.0056, 0.008) = 0.008   (best prefix: N Y)
ending in N:  max(0.001, 0.04) = 0.04      (best prefix: N N)

Page 74: PROLOGUE: Pitfalls of standard alignments

Finding the best path

[GpC Island model as above]

s: A G C   π: N Y Y   ->  0.008 × 0.7 × 0.4 = 0.00224
s: A G C   π: N N Y   ->  0.04 × 0.1 × 0.4 = 0.0016
s: A G C   π: N Y N   ->  0.008 × 0.2 × 0.25 = 0.0004
s: A G C   π: N N N   ->  0.04 × 0.8 × 0.25 = 0.008

Taking the maximum over the paths that end in the same state:
ending in Y:  best path N Y Y, probability 0.00224
ending in N:  best path N N N, probability 0.008

Page 75: PROLOGUE: Pitfalls of standard alignments

Finding the best path

[GpC Island model as above]

s: A G C G C G T A A T C T G   π: N Y Y Y Y Y Y N N N Y Y Y   ->  multiply by 0.1 (a_Y0)
s: A G C G C G T A A T C T G   π: N N N N N N N N N N N N N   ->  multiply by 0.1 (a_N0)

Iterating until the last position of the sequence and choosing the maximum gives the best path.

Page 76: PROLOGUE: Pitfalls of standard alignments

Viterbi Algorithm

π* = argmax_π [ P( π, s | M ) ]. The computation of P( s, π* | M ) can be decomposed into simpler problems.

Let V_k(i) be the probability of the most probable path generating the subsequence s1 s2 s3 … s_i and ending in state k at iteration i.

Initialisation:  V_BEGIN(0) = 1,  V_i(0) = 0 for i ≠ BEGIN

Recurrence:  V_l(i+1) = e_l( s_i+1 ) · max_k ( V_k(i) a_kl )
             ptr_i(l) = argmax_k ( V_k(i) a_kl )

Termination:  P( s, π* ) = max_k ( V_k(T) a_k0 )
              π*(T) = argmax_k ( V_k(T) a_k0 )

Traceback:  π*(i-1) = ptr_i( π*(i) )
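A sketch of the Viterbi recursion with back-pointers, on the same GpC-island toy model:

a0   = {"Y": 0.2, "N": 0.8}
aEnd = {"Y": 0.1, "N": 0.1}
a    = {"Y": {"Y": 0.7, "N": 0.2}, "N": {"Y": 0.1, "N": 0.8}}
e    = {"Y": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "N": {c: 0.25 for c in "ACGT"}}
states = ["Y", "N"]

def viterbi(s):
    V = [{k: a0[k] * e[k][s[0]] for k in states}]       # initialisation
    ptr = []
    for c in s[1:]:
        row, back = {}, {}
        for l in states:
            best = max(states, key=lambda k: V[-1][k] * a[k][l])
            back[l] = best                               # ptr_i(l)
            row[l] = e[l][c] * V[-1][best] * a[best][l]  # V_l(i+1)
        V.append(row)
        ptr.append(back)
    last = max(states, key=lambda k: V[-1][k] * aEnd[k])  # termination
    path = [last]
    for back in reversed(ptr):                            # traceback
        path.append(back[path[-1]])
    return V[-1][last] * aEnd[last], "".join(reversed(path))

print(viterbi("AGCGCGTAATCTG"))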

Page 77: PROLOGUE: Pitfalls of standard alignments

Viterbi Algorithm

[Trellis: states BEGIN, A, B, END against iterations 0, 1, 2, …, T, T+1. The value V_B(2) is obtained as e_B(s2) · MAX_i V_i(1) · a_iB, storing the back-pointer ptr_2(B); the END column gives P( s, π* | M ).]

Page 78: PROLOGUE: Pitfalls of standard alignments

Viterbi Algorithm

[Trellis: the Viterbi path traced back through the states from iteration T+1 to 0]

Different paths can have the same probability.

Page 79: PROLOGUE: Pitfalls of standard alignments

Hidden Markov Models: Algorithms

• Resumé
• Evaluating P(s | M): Forward Algorithm
• Evaluating P(s | M): Backward Algorithm
• Showing the path: Viterbi decoding
• Showing the path: A posteriori decoding
• Training a model: EM algorithm

Page 80: PROLOGUE: Pitfalls of standard alignments

If we know the path generating the training sequence

[GpC Island model as above, with all parameters unknown (?)]

s:          A      G      C      G      C      G      T      A      A      T      C      T      G
π:          Y      Y      Y      Y      Y      Y      Y      N      N      N      N      N      N
Emission:   eY(A)  eY(G)  eY(C)  eY(G)  eY(C)  eY(G)  eY(T)  eN(A)  eN(A)  eN(T)  eN(C)  eN(T)  eN(G)
Transition: a0Y  aYY  aYY  aYY  aYY  aYY  aYY  aYN  aNN  aNN  aNN  aNN  aNN  aN0

Just count! Example:

aYY = nYY / ( nYY + nYN ) = 6/7

eY(A) = nY(A) / [ nY(A) + nY(C) + nY(G) + nY(T) ] = 1/7

Page 81: PROLOGUE: Pitfalls of standard alignments

If we DO NOT know the path generating the training sequence

[GpC Island model as above, with all parameters unknown (?)]

s:          A     G     C     G     C     G     T     A     A     T     C     T     G
π:          ?     ?     ?     ?     ?     ?     ?     ?     ?     ?     ?     ?     ?
Emission:   e?(A) e?(G) e?(C) e?(G) e?(C) e?(G) e?(T) e?(A) e?(A) e?(T) e?(C) e?(T) e?(G)
Transition: a0?  a??  a??  a??  a??  a??  a??  a??  a??  a??  a??  a??  a??  a?0

We need, in some sense, to average over all the possible paths.

No exact algorithm is available: the iterative Baum-Welch algorithm, based on Expectation-Maximisation, is used.

Page 82: PROLOGUE: Pitfalls of standard alignments

Baum-Welch algorithm (simple discussion)

Given a path π we can count
the number of transitions between states k and l: A_kl(π)
the number of emissions of character c from state k: E_k(c, π)

s: A G C G C G T A A T C T G, with paths Y Y Y Y Y Y Y Y Y Y Y Y Y / Y Y Y Y Y Y Y Y Y Y Y Y N / Y Y Y Y Y Y Y Y Y Y Y N Y / … (all the possible paths)

We can compute the expected values over all the paths, given the initial parameters θ0:

A_kl = Σ_π P( π | s, θ0 ) · A_kl(π)
E_k(c) = Σ_π P( π | s, θ0 ) · E_k(c, π)

The updated parameters are:

a_kl = A_kl / Σ_{m=1..N} A_km
e_k(c) = E_k(c) / Σ_c' E_k(c')

Then we can iterate…

Page 83: PROLOGUE: Pitfalls of standard alignments

Expectation-Maximisation algorithm

We need to estimate the Maximum Likelihood parameters when the paths generating the training sequences are unknown:

θ^ML = argmax_θ [ P( s | M, θ ) ]

Given a model with parameters θ0, the EM algorithm finds new parameters θ that increase the likelihood of the model:

P( s | θ ) > P( s | θ0 )

Page 84: PROLOGUE: Pitfalls of standard alignments

Expectation-Maximisation algorithm

We need to estimate the Maximum Likelihood parameters when the paths generating the training sequences are unknown:

θ^ML = argmax_θ [ P( s | M, θ ) ]

Given a model with parameters θ0, the EM algorithm finds new parameters θ that increase the likelihood of the model:

P( s | θ ) > P( s | θ0 ),  or equivalently  log P( s | θ ) > log P( s | θ0 )

Page 85: PROLOGUE: Pitfalls of standard alignments

Expectation-Maximisation algorithm

log P( s | θ ) = log P( s, π | θ ) - log P( π | s, θ )

Multiplying by P( π | s, θ0 ) and summing over all the possible paths:

log P( s | θ ) = Σ_π P( π | s, θ0 ) · log P( s, π | θ ) - Σ_π P( π | s, θ0 ) · log P( π | s, θ )

Q( θ | θ0 ): expectation value of log P( s, π | θ ) over all the "current" paths.

log P( s | θ ) - log P( s | θ0 ) = Q( θ | θ0 ) - Q( θ0 | θ0 ) + Σ_π P( π | s, θ0 ) · log [ P( π | s, θ0 ) / P( π | s, θ ) ]
                                 ≥ Q( θ | θ0 ) - Q( θ0 | θ0 )

(the last sum is a relative entropy and is never negative)

Page 86: PROLOGUE: Pitfalls of standard alignments

Expectation-Maximisation algorithm

The EM algorithm is an iterative process

Each iteration performs two steps:

E-step: evaluation of Q( θ | θ0 ) = Σ_π P( π | s, θ0 ) · log P( s, π | θ )

M-step: maximisation of Q( θ | θ0 ) over all θ

It is NOT guaranteed to converge to the GLOBAL Maximum Likelihood.

Page 87: PROLOGUE: Pitfalls of standard alignments

Baum-Welch implementation of the EM algorithm

E-step:

Q( θ | θ0 ) = Σ_π P( π | s, θ0 ) · log P( s, π | θ )

P( s, π | θ ) = a_0,π(1) · Π_{i=1..T} a_π(i),π(i+1) · e_π(i)( s_i )
             = Π_{k=0..N} Π_{l=1..N} a_kl^( A_kl(π) ) · Π_{k=1..N} Π_{c∈C} e_k(c)^( E_k(c,π) )

A_kl(π): number of transitions between the states k and l in path π
E_k(c, π): number of emissions of character c from state k in path π

A_kl = Σ_π P( π | s, θ0 ) · A_kl(π)
E_k(c) = Σ_π P( π | s, θ0 ) · E_k(c, π)   (expected values over all the "current" paths)

So:

Q( θ | θ0 ) = Σ_{k=0..N} Σ_{l=1..N} A_kl · log a_kl + Σ_{k=1..N} Σ_{c∈C} E_k(c) · log e_k(c)

Page 88: PROLOGUE: Pitfalls of standard alignments

Baum-Welch implementation of the EM algorithm

M-step: maximise Q( θ | θ0 ) with respect to the parameters,

∂Q/∂a_kl = 0   for any pair of states k and l, with Σ_l a_kl = 1
∂Q/∂e_k(c) = 0 for any state k and character c, with Σ_c e_k(c) = 1

By means of the Lagrange multipliers technique, we can solve the system:

a_kl = A_kl / Σ_{m=1..N} A_km

e_k(c) = E_k(c) / Σ_c' E_k(c')

Page 89: PROLOGUE: Pitfalls of standard alignments

Baum-Welch implementation of the EM algorithm

How to compute the expected numbers of transitions and emissions over all the paths:

F_k(i) = P( s1 s2 s3 … s_i, π(i) = k )
B_k(i) = P( s_i+1 s_i+2 s_i+3 … s_T | π(i) = k )

A_kl = Σ_i P( π(i) = k, π(i+1) = l | s, θ ) = Σ_i F_k(i) a_kl e_l( s_i+1 ) B_l(i+1) / P(s)

E_k(c) = Σ_i P( s_i = c, π(i) = k | s, θ ) = Σ_{i : s_i = c} F_k(i) B_k(i) / P(s)
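A sketch of these expected-count formulas: the forward and backward matrices are rebuilt for the GpC-island toy model of the earlier slides, and P(s) is taken from the forward termination step:

a0   = {"Y": 0.2, "N": 0.8}
aEnd = {"Y": 0.1, "N": 0.1}
a    = {"Y": {"Y": 0.7, "N": 0.2}, "N": {"Y": 0.1, "N": 0.8}}
e    = {"Y": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "N": {c: 0.25 for c in "ACGT"}}
states = ["Y", "N"]

def matrices(s):
    F = [{k: a0[k] * e[k][s[0]] for k in states}]
    for c in s[1:]:
        F.append({l: e[l][c] * sum(F[-1][k] * a[k][l] for k in states) for l in states})
    B = [{k: aEnd[k] for k in states}]
    for c in reversed(s[1:]):
        B.insert(0, {l: sum(a[l][k] * e[k][c] * B[0][k] for k in states) for l in states})
    p_s = sum(F[-1][k] * aEnd[k] for k in states)
    return F, B, p_s

def expected_counts(s):
    F, B, p_s = matrices(s)
    # A_kl = sum_i F_k(i) a_kl e_l(s_{i+1}) B_l(i+1) / P(s)
    A = {k: {l: sum(F[i][k] * a[k][l] * e[l][s[i + 1]] * B[i + 1][l]
                    for i in range(len(s) - 1)) / p_s for l in states} for k in states}
    # E_k(c) = sum_{i: s_i = c} F_k(i) B_k(i) / P(s)
    E = {k: {c: sum(F[i][k] * B[i][k] for i in range(len(s)) if s[i] == c) / p_s
             for c in "ACGT"} for k in states}
    return A, E

print(expected_counts("AGCGCGTAATCTG"))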

Page 90: PROLOGUE: Pitfalls of standard alignments

Baum-Welch implementation of the EM algorithm

Algorithm:
Start with random parameters.
Compute the Forward and Backward matrices on the known sequences.
Compute A_kl and E_k(c), the expected numbers of transitions and emissions.
Update a_kl from A_kl and e_k(c) from E_k(c).
Has P( s | M ) increased? If yes, iterate again; if no, end.

Page 91: PROLOGUE: Pitfalls of standard alignments

Profile HMMs

• HMMs for alignments

Page 92: PROLOGUE: Pitfalls of standard alignments

Profile HMMs

• HMMs for alignments

Page 93: PROLOGUE: Pitfalls of standard alignments

[Linear profile HMM with states M0 M1 M2 M3 M4 M5]

How to align? Each state represents a position in the alignment.

A  C  G  G  T  A
M0 M1 M2 M3 M4 M5

A  C  G  A  T  C
M0 M1 M2 M3 M4 M5

A  T  G  T  T  C
M0 M1 M2 M3 M4 M5

Each position has a peculiar composition.

Page 94: PROLOGUE: Pitfalls of standard alignments

[Linear profile HMM with states M0 … M5]

Given a set of sequences…

A C G G T A
A C G A T C
A T G T T C

…we can train a model, estimating the emission probabilities:

     M0    M1    M2    M3    M4    M5
A    1     0     0     0.33  0     0.33
C    0     0.66  0     0     0     0.66
G    0     0     1     0.33  0     0
T    0     0.33  0     0.33  1     0

Page 95: PROLOGUE: Pitfalls of standard alignments

[Linear profile HMM with states M0 … M5]

Given a trained model (the emission table above), we can align a new sequence

A C G A T C

computing the probability of generating it:

P(s|M) = 1 × 0.66 × 1 × 0.33 × 1 × 0.66
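A sketch of this position-specific (ungapped) score, with the emission table of the previous slide:

# One dictionary per match state M0 ... M5, copied from the table above.
profile = [
    {"A": 1.0,  "C": 0.0,  "G": 0.0,  "T": 0.0},   # M0
    {"A": 0.0,  "C": 0.66, "G": 0.0,  "T": 0.33},  # M1
    {"A": 0.0,  "C": 0.0,  "G": 1.0,  "T": 0.0},   # M2
    {"A": 0.33, "C": 0.0,  "G": 0.33, "T": 0.33},  # M3
    {"A": 0.0,  "C": 0.0,  "G": 0.0,  "T": 1.0},   # M4
    {"A": 0.33, "C": 0.66, "G": 0.0,  "T": 0.0},   # M5
]

def profile_prob(seq):
    p = 1.0
    for pos, c in zip(profile, seq):
        p *= pos[c]
    return p

print(profile_prob("ACGATC"))   # 1 * 0.66 * 1 * 0.33 * 1 * 0.66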

Page 96: PROLOGUE: Pitfalls of standard alignments

[Linear profile HMM with states M0 … M5]

And for the sequence AGATC?

A  G  A  T  C
M0 M2 M3 M4 M5

We need a way to introduce gaps.

Page 97: PROLOGUE: Pitfalls of standard alignments

Silent states

Red transitions allow gaps: (N-1)! transitions.

To reduce the number of parameters we can use states that do not emit any character: 4N-8 transitions.

Page 98: PROLOGUE: Pitfalls of standard alignments

Profile HMMs

[Architecture: match states M0 … M5 in a chain, insert states I0 … I4, delete states D1 … D4]

Match states, Insert states, Delete states

A  C  G  G  T  A
M0 M1 M2 M3 M4 M5

A  C  G  C  A  G  T  C
M0 I0 I0 M1 M2 M3 M4 M5

A  G  A  T  C
M0 D1 M2 M3 M4 M5

Page 99: PROLOGUE: Pitfalls of standard alignments

Example of alignment

Sequence 1: A S T R A L
Viterbi path: M0 M1 M2 M3 M4 M5

Sequence 2: A S T A I L
Viterbi path: M0 M1 M2 D3 M4 I4 M5

Sequence 3: A R T I
Viterbi path: M0 M1 M2 D3 D4 M5

[Profile HMM architecture as above]

Page 100: PROLOGUE: Pitfalls of standard alignments

Example of alignment

Grouping by vertical layers (positions 0 1 2 3 4 5):

s1: A S T R A L    (M0 M1 M2 M3 M4 M5)
s2: A S T A I L    (M0 M1 M2 D3 M4 I4 M5)
s3: A R T I        (M0 M1 M2 D3 D4 M5)

Alignment:
ASTRA-L
AST-AIL
ART---I

-log P( s | M ) is an alignment score.

Page 101: PROLOGUE: Pitfalls of standard alignments

Searching for a structural/functional pattern in protein sequence

Zn binding loop:

C H C I C R I C
C H C L C K I C
C H C I C S L C
D H C L C T I C
C H C I D S I C
C H C L C K I C

Cysteines can be replaced by an Aspartic Acid, but only ONCE for each sequence

Page 102: PROLOGUE: Pitfalls of standard alignments

Searching for a structural/functional pattern in protein sequences

..ALCPCHCLCRICPLIY..

..WERWDHCIDSICLKDE..

[Profile HMM for the loop, with match states M0 … M7 and the corresponding insert and delete states]

The first sequence obtains a higher probability than the second, because M0 and M4 have a low emission probability for Aspartic Acid and in the second sequence that low probability is multiplied twice.

Page 103: PROLOGUE: Pitfalls of standard alignments

Profile HMMs

• HMMs for alignments
• Example on globins

Page 104: PROLOGUE: Pitfalls of standard alignments

Structural alignment of globins

Page 105: PROLOGUE: Pitfalls of standard alignments

Structural alignment of globins

Bashford D, Chothia C & Lesk AM (1987) Determinants of a protein fold: unique features of the globin amino acid sequences. J. Mol. Biol. 196, 199-216

Page 106: PROLOGUE: Pitfalls of standard alignments

Alignment of globins reconstructed with profile HMMs

Krogh A, Brown M, Mian IS, Sjolander K & Haussler D (1994) Hidden Markov Models in computational biology: applications to protein modelling. J.Mol.Biol. 235, 1501-1531

Page 107: PROLOGUE: Pitfalls of standard alignments

Discrimination power of profile HMMs

Z-score = [ Log(P(s | M)) - <Log(P(s | M))> ] / σ( Log(P(s | M)) )

Krogh A, Brown M, Mian IS, Sjolander K & Haussler D (1994) Hidden Markov Models in computational biology: applications to protein modelling. J.Mol.Biol. 235, 1501-1531

Page 108: PROLOGUE: Pitfalls of standard alignments

Profile HMMs

• HMMs for alignments
• Example on globins
• Other applications

Page 109: PROLOGUE: Pitfalls of standard alignments

Finding a domain

[Architecture: Begin, a self-looping insert state, the profile HMM specific for the considered domain, another self-looping insert state, End]

Page 110: PROLOGUE: Pitfalls of standard alignments

Clustering subfamilies

[Architecture: BEGIN branches into HMM 1, HMM 2, HMM 3, …, HMM n, which all join into END]

Each sequence s contributes to the update of HMM i with a weight equal to P( s | Mi ).

Page 111: PROLOGUE: Pitfalls of standard alignments

Profile HMMs

• HMMs for alignments
• Example on globins
• Other applications
• Available codes and servers

Page 112: PROLOGUE: Pitfalls of standard alignments

HMMER at WUSTL: http://hmmer.wustl.edu/

Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14:755-763

Page 113: PROLOGUE: Pitfalls of standard alignments

HMMER

Alignment of a protein family -> hmmbuild -> trained profile-HMM -> hmmcalibrate -> HMM calibrated with accurate E-value statistics

hmmbuild takes the aligned sequences, checks for redundancy and sets the emission and transition probabilities of a HMM.

hmmcalibrate takes a trained HMM, generates a great number of random sequences, scores them and fits the Extreme Value Distribution to the computed scores.

Page 114: PROLOGUE: Pitfalls of standard alignments

HMMER

hmmalign: set of sequences + HMM -> alignment of all the sequences to the model

hmmsearch: set of sequences + HMM -> list of sequences that match the HMM (sorted by E-value)

hmmpfam: set of HMMs + sequence -> list of HMMs that match the sequence

Page 115: PROLOGUE: Pitfalls of standard alignments

MSF Format: globins50.msf

!!AA_MULTIPLE_ALIGNMENT 1.0
PileUp of: *.pep

Symbol comparison table: GenRunData:blosum62.cmp  CompCheck: 6430

GapWeight: 12
GapLengthWeight: 4

pileup.msf  MSF: 308  Type: P  August 16, 1999 09:09  Check: 9858 ..

Name: lgb1_pea    Len: 308  Check: 2200  Weight: 1.00
Name: lgb1_vicfa  Len: 308  Check:  214  Weight: 1.00
Name: myg_escgi   Len: 308  Check: 3961  Weight: 1.00
Name: myg_horse   Len: 308  Check: 5619  Weight: 1.00
Name: myg_progu   Len: 308  Check: 6401  Weight: 1.00
Name: myg_saisc   Len: 308  Check: 6606  Weight: 1.00

//

                  1                                                  50
lgb1_pea    ~~~~~~~~~G FTDKQEALVN SSSE.FKQNL PGYSILFYTI VLEKAPAAKG
lgb1_vicfa  ~~~~~~~~~G FTEKQEALVN SSSQLFKQNP SNYSVLFYTI ILQKAPTAKA
myg_escgi   ~~~~~~~~~V LSDAEWQLVL NIWAKVEADV AGHGQDILIR LFKGHPETLE
myg_horse   ~~~~~~~~~G LSDGEWQQVL NVWGKVEADI AGHGQEVLIR LFTGHPETLE
myg_progu   ~~~~~~~~~G LSDGEWQLVL NVWGKVEGDL SGHGQEVLIR LFKGHPETLE
myg_saisc   ~~~~~~~~~G LSDGEWQLVL NIWGKVEADI PSHGQEVLIS LFKGHPETLE

Page 116: PROLOGUE: Pitfalls of standard alignments

Alignment of a protein family

hmmbuild

Trained profile-HMM

hmmbuild globin.hmm globins50.msf

All the transition and emission parameters are estimated by means of the Expectation Maximisation algorithm on the aligned sequences.

In principle we could also use NON-aligned sequences to train the model. Nevertheless, it is more efficient to build the starting alignment using, for example, CLUSTALW.

Page 117: PROLOGUE: Pitfalls of standard alignments

hmmcalibrate [--num N] --histfile globin.histo globin.hmm

Trained profile-HMM -> hmmcalibrate -> HMM calibrated with accurate E-value statistics

A number N of random sequences (default 5000) is generated and scored with the model.

[Histogram of the scores Log P(s|M)/P(s|N) of the random sequences, with the range for globin sequences marked]

E-value(S): expected number of random sequences with a score > S

Page 118: PROLOGUE: Pitfalls of standard alignments

Trained model (globin.hmm)

HMMER2.0 [2.3.2]
NAME  globins50
LENG  143
ALPH  Amino
RF    no
CS    no
MAP   yes
COM   /home/gigi/bin/hmmbuild globin.hmm globins50.msf
COM   /home/gigi/bin/hmmcalibrate --histfile globin.histo globin.hmm
NSEQ  50
DATE  Sun May 29 19:03:18 2005
CKSUM 9858
XT    -8455 -4 -1000 -1000 -8455 -4 -8455 -4
NULT  -4 -8455
NULE  595 -1558 85 338 -294 453 -1158 197 249 902 -1085 -142 -21 -313 45 531 201 384 -1998 -644
EVD   -38.893742 0.243153
HMM      A     C     D     E     F     G     H     I     K     L     M     N     P     Q     R     S     T     V     W     Y
       m->m  m->i  m->d  i->m  i->i  d->m  d->d  b->m  m->e
       -450     *  -1900
  1    591 -1587   159  1351 -1874  -201   151 -1600   998 -1591  -693   389 -1272   595    42   -31    27  -693 -1797 -1134    14
  -   -149  -500   233    43  -381   399   106  -626   210  -466  -720   275   394    45    96   359   117  -369  -294  -249
  -    -23 -6528 -7571  -894 -1115  -701 -1378  -450     *
  2   -926 -2616  2221  2269 -2845 -1178  -325 -2678  -300 -2596 -1810   220 -1592   939  -974  -671  -939 -2204 -2785 -1925    15
  -   -149  -500   233    43  -381   399   106  -626   210  -466  -720   275   394    45    96   359   117  -369  -294  -249
  -    -23 -6528 -7571  -894 -1115  -701 -1378     *     *
  3   -638 -1715  -680   497 -2043 -1540    23 -1671  2380 -1641  -840  -222 -1595   437  1040  -564  -523 -1363  2124 -1313    16
  -   -149  -500   233    43  -381   399   106  -626   210  -466  -720   275   394    45    96   359   117  -369  -294  -249
  -    -23 -6528 -7571  -894 -1115  -701 -1378     *     *

Page 119: PROLOGUE: Pitfalls of standard alignments

Trained model (globin.hmm): null model

Score = INT [ 1000 · log2( prob / null_prob ) ]

For transitions, null_prob = 1.

Page 120: PROLOGUE: Pitfalls of standard alignments

Trained model (globin.hmm): null model

Score = INT [ 1000 · log2( prob / null_prob ) ]

For emissions, null_prob is the natural abundance of the amino acid.

Page 121: PROLOGUE: Pitfalls of standard alignments

Trained model (globin.hmm): transition scores

Page 122: PROLOGUE: Pitfalls of standard alignments

Trained model (globin.hmm): emission scores

Page 123: PROLOGUE: Pitfalls of standard alignments

Trained model (globin.hmm): emission scores

Page 124: PROLOGUE: Pitfalls of standard alignments

Trained model (globin.hmm): emission scores

Page 125: PROLOGUE: Pitfalls of standard alignments

hmmemit [-n N] globin.hmm

Trained profile-HMM

hmmemit

Sequences generated by the model

The parameters of the model are used to generate new sequences

Page 126: PROLOGUE: Pitfalls of standard alignments

hmmsearch globin.hmm Artemia.fa > Artemia.globin

Set of sequences

List of sequences that match the HMM

(sorted by E-value)

hmmsearch

Trained profile-HMM

Page 127: PROLOGUE: Pitfalls of standard alignments

Search results (Artemia.globin)

Sequence  Description  Score   E-value    N
--------  -----------  -----   --------   ---
S13421    S13421       474.3   1.7e-143   9

Parsed for domains:
Sequence  Domain   seq-f  seq-t      hmm-f  hmm-t     score  E-value
--------  -------  -----  -----      -----  -----     -----  -------
S13421    7/9        932   1075 ..       1    143 []   76.9  7.3e-24
S13421    2/9        153    293 ..       1    143 []   63.7  6.8e-20
S13421    3/9        307    450 ..       1    143 []   59.8  9.8e-19
S13421    8/9       1089   1234 ..       1    143 []   57.6  4.5e-18
S13421    9/9       1248   1390 ..       1    143 []   52.3  1.8e-16
S13421    1/9          1    143 [.       1    143 []   51.2  4e-16
S13421    4/9        464    607 ..       1    143 []   46.7  8.6e-15
S13421    6/9        775    918 ..       1    143 []   42.2  2e-13
S13421    5/9        623    762 ..       1    143 []   23.9  6.6e-08

Alignments of top-scoring domains:S13421: domain 7 of 9, from 932 to 1075: score 76.9, E = 7.3e-24 *->eekalvksvwgkveknveevGaeaLerllvvyPetkryFpkFkdLss +e a vk+ w+ v+ ++ vG +++ l++ +P+ +++FpkF d+ S13421 932 REVAVVKQTWNLVKPDLMGVGMRIFKSLFEAFPAYQAVFPKFSDVPL 978

adavkgsakvkahgkkVltalgdavkkldd...lkgalakLselHaqklr d++++++ v +h V t+l++ ++ ld++ +l+ ++L+e H+ lr S13421 979 -DKLEDTPAVGKHSISVTTKLDELIQTLDEpanLALLARQLGEDHIV-LR 1026

vdpenfkllsevllvvlaeklgkeftpevqaalekllaavataLaakYk< v+ fk +++vl+ l++ lg+ f+ ++ +++k+++++++ +++ + S13421 1027 VNKPMFKSFGKVLVRLLENDLGQRFSSFASRSWHKAYDVIVEYIEEGLQ 1075

Callouts on the output: number of domains; domains sorted by E-value; start and end positions; consensus sequence; target sequence.

Page 128: PROLOGUE: Pitfalls of standard alignments

hmmalign globin.hmm globins630.fa

Set of sequences

hmmalign

Trained profile-HMM + set of sequences -> hmmalign -> alignment of all the sequences to the model

Insertions:

BAHG_VITSP  QAG-..VAAAHYPIV.GQELLGAIKEV.L.G.D.AATDDILDAWGKAYGV
GLB1_ANABR  TR-K..ISAAEFGKI.NGPIKKVLAS-.-.-.K.NFGDKYANAWAKLVAV
GLB1_ARTSX  NRGT..-DRSFVEYL.KESL-----GD.S.V.D.EFT------VQSFGEV
GLB1_CALSO  TRGI..TNMELFAFA.LADLVAYMGTT.I.S.-.-FTAAQKASWTAVNDV
GLB1_CHITH  -KSR..ASPAQLDNF.RKSLVVYLKGA.-.-.T.KWDSAVESSWAPVLDF
GLB1_GLYDI  GNKH..IKAQYFEPL.GASLLSAMEHR.I.G.G.KMNAAAKDAWAAAYAD
GLB1_LUMTE  ER-N..LKPEFFDIF.LKHLLHVLGDR.L.G.T.HFDF---GAWHDCVDQ
GLB1_MORMR  QSFY..VDRQYFKVL.AGII-------.-.-.A.DTTAPGDAGFEKLMSM
GLB1_PARCH  DLNK..VGPAHYDLF.AKVLMEALQAE.L.G.S.DFNQKTRDSWAKAFSI
GLB1_PETMA  KSFQ..VDPQYFKVL.AAVI-------.-.-.V.DTVLPGDAGLEKLMSM
GLB1_PHESE  QHTErgTKPEYFDLFrGTQLFDILGDKnLiGlTmHFD---QAAWRDCYAV

Gaps

Page 129: PROLOGUE: Pitfalls of standard alignments

HMMER applications: PFAM, http://www.sanger.ac.uk/Software/Pfam/

Page 130: PROLOGUE: Pitfalls of standard alignments
Page 131: PROLOGUE: Pitfalls of standard alignments
Page 132: PROLOGUE: Pitfalls of standard alignments

PFAM Exercise

Generate with hmmemit a sequence from the globin model and search it in PFAM database

Page 133: PROLOGUE: Pitfalls of standard alignments

Search in the SwissProt database the sequences CG301_HUMAN and Q9H5F4_HUMAN.

1) Search them in the PFAM database.
2) Launch PSI-BLAST searches. Is it possible to annotate the sequences by means of the BLAST results?

PFAM Exercise

Page 134: PROLOGUE: Pitfalls of standard alignments

SAM at UCSC: http://www.soe.ucsc.edu/research/compbio/sam.html

Krogh A, Brown M, Mian IS, Sjolander K & Haussler D (1994) Hidden Markov Models in computational biology: applications to protein modelling. J.Mol.Biol. 235, 1501-1531

Page 135: PROLOGUE: Pitfalls of standard alignments

SAM applications: http://www.cse.ucsc.edu/research/compbio/HMM-apps/T02-query.html

Page 136: PROLOGUE: Pitfalls of standard alignments

HMMPRO: http://www.netid.com/html/hmmpro.html (Pierre Baldi, Net-ID)

Page 137: PROLOGUE: Pitfalls of standard alignments

HMMs for Mapping problems

•Mapping problems in protein prediction

Page 138: PROLOGUE: Pitfalls of standard alignments

Covalent structure: TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN

Nt (amino terminus), Ct (carboxy terminus)

3D structure

Secondary structure: EEEE..HHHHHHHHHHHH....HHHHHHHH.EEEE...........

Page 139: PROLOGUE: Pitfalls of standard alignments

Topography: position of the transmembrane segments along the sequence

Topology of membrane proteins

Porin (Rhodobacter capsulatus): a β-barrel of the outer membrane. Bacteriorhodopsin (Halobacterium salinarum): an α-helical bundle of the inner membrane.

[Figure: the two structures embedded in the lipid bilayer; the example sequence ALALMLCMLTYRHKELKLKLKK is shown with its transmembrane segment highlighted]

Page 140: PROLOGUE: Pitfalls of standard alignments

HMMs for Mapping problems

•Mapping problems in protein prediction
•Labelled HMMs

Page 141: PROLOGUE: Pitfalls of standard alignments

HMM for secondary structure prediction

Simplest model: a single state for each secondary-structure class (helix, strand and coil c).

Introducing a grammar: the helix class is modelled by a chain of states (α1, α2, α3), the strand class by a chain of states (β1, β2), plus the coil state c; the chained states impose a grammar (for example a minimum segment length) on the predictions.

Page 142: PROLOGUE: Pitfalls of standard alignments

HMM for secondary structure prediction

Labels

States α1, α2 and α3 share the same label (helix), and so do states β1 and β2 (strand). Decoding the Viterbi path π for an emitted sequence s therefore defines a mapping between the sequence s and a sequence of labels Y(π):

s:     S A L K M N Y T R E I M V A S N Q     sequence
π:     e.g.  c c c c c α1 α2 α3 α3 α3 α3 α3 α3 α3 α3 c c     path
Y(π):  e.g.  c c c c c α  α  α  α  α  α  α  α  α  α  c c     labels

[State diagram: helix states α1-α3, strand states β1-β2 and coil state c]

Page 143: PROLOGUE: Pitfalls of standard alignments

Computing P(s, y | M)

P(s, y | M) = Σ_{π : Y(π) = y} P(s, π | M)

Only the paths whose labelling is y have to be considered in the sum. In the Forward and Backward algorithms this amounts to setting

F_k(i) = 0 and B_k(i) = 0 whenever Y(k) ≠ y_i

s:  S A L K M N Y T R E I M V A S N Q     sequence
y:  c c c c c α α α α α α α α α α c c     labels

Each state k carries a label Y(k) (states → labels).
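A minimal sketch (not from the slides) of the labelled Forward recursion: states whose label disagrees with the observed label y_i are simply zeroed out, exactly as stated above. The state names, labels and toy probabilities below are illustrative.

import numpy as np

def labelled_forward(obs, labels, start, trans, emis, state_label):
    """Forward algorithm restricted to paths whose labelling matches `labels`.

    obs         : list of symbol indices (length T)
    labels      : list of label indices, one per position (length T)
    start       : (K,) initial state probabilities
    trans       : (K, K) transition probabilities, trans[i, j] = P(j | i)
    emis        : (K, A) emission probabilities
    state_label : (K,) label attached to each state
    Returns P(s, y | M).
    """
    K, T = len(start), len(obs)
    F = np.zeros((T, K))
    mask0 = (state_label == labels[0])           # Y(k) must equal y_1
    F[0] = start * emis[:, obs[0]] * mask0
    for t in range(1, T):
        mask = (state_label == labels[t])        # zero out states with the wrong label
        F[t] = (F[t - 1] @ trans) * emis[:, obs[t]] * mask
    return F[-1].sum()

# Toy example: 2 states carrying labels 0 ("coil") and 1 ("helix"), 2-symbol alphabet
start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1], [0.2, 0.8]])
emis  = np.array([[0.7, 0.3], [0.2, 0.8]])
state_label = np.array([0, 1])
print(labelled_forward([0, 1, 1], [0, 1, 1], start, trans, emis, state_label))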

Page 144: PROLOGUE: Pitfalls of standard alignments

Baum-Welch training algorithm for labelled HMMs

Given a set of known labelled sequences (e.g. amino acid sequences and their native secondary structure) we want to find the parameters of the model, without knowing the generating paths:

M_ML = argmax_M P(s, y | M)     (in practice, the product over all training examples)

The algorithm is the same as in the non-labelled case if we use the Forward and Backward matrices defined in the last slide.

Supervised learning of the mapping

Page 145: PROLOGUE: Pitfalls of standard alignments

HMMs for Mapping problems

•Mapping problems in protein prediction
•Labelled HMMs
•Duration modelling

Page 146: PROLOGUE: Pitfalls of standard alignments

Self loops and geometric decay

[A single state with a self-loop of probability p; the state is entered from Begin and left towards End with probability 1 - p]

P(l) = p^(l-1) · (1 - p)

[Plot: P(l) versus the segment length l (1-20) for p = 0.9, 0.5 and 0.1]

The length distribution of the segments generated by a self-loop is always an exponential-like (geometric) decay, whatever the value of p.
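A short numeric check of the formula above (illustrative Python, not part of the slides):

def self_loop_length_dist(p, max_len=20):
    """P(l) = p**(l-1) * (1-p): length distribution produced by a single
    state with self-loop probability p."""
    return [p ** (l - 1) * (1 - p) for l in range(1, max_len + 1)]

for p in (0.9, 0.5, 0.1):
    dist = self_loop_length_dist(p)
    # the decay is always geometric; only its rate changes with p
    print(p, [round(x, 3) for x in dist[:5]])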

Page 147: PROLOGUE: Pitfalls of standard alignments

[Plot: an arbitrary target length distribution P(1), P(2), ..., P(8)]

How can we model other length distributions?

Limited case

[Chain topology: Begin → 1 → 2 → 3 → 4 → ... → N → End, with forward transition probabilities p1, p2, p3, p4, ..., pN; each state can also exit directly to End]

This topology can model any length distribution between 1 and N.

P(1) = 1 - p1
P(2) = p1 (1 - p2)
P(3) = p1 p2 (1 - p3)
...
P(N) = p1 p2 ... p(N-1)

and, inverting, the transition probabilities that realize a given length distribution P are

p1 = 1 - P(1)
p2 = 1 - P(2) / (1 - P(1))
...
pk = 1 - P(k) / (p1 p2 ... p(k-1))
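The inversion above can be coded directly. A minimal sketch (illustrative, with a made-up target distribution), consistent with the reconstruction of the formulas given here:

def chain_transitions(P):
    """Given a target length distribution P(1..N) (list summing to 1),
    return the forward transition probabilities p1..p(N-1) of the chain:
        pk = 1 - P(k) / (p1 * p2 * ... * p(k-1))
    (the last state always exits, so no pN is needed)."""
    p, survive = [], 1.0                 # survive = product of the pi chosen so far
    for k in range(len(P) - 1):
        pk = 1.0 - P[k] / survive
        p.append(pk)
        survive *= pk
    return p

# Example: a peaked distribution over lengths 1..5
P = [0.1, 0.2, 0.4, 0.2, 0.1]
p = chain_transitions(P)

# Check: regenerate the length distribution from the chain
survive, regen = 1.0, []
for pk in p:
    regen.append(survive * (1 - pk))
    survive *= pk
regen.append(survive)                    # P(N) = product of all pi
print([round(x, 3) for x in regen])      # -> [0.1, 0.2, 0.4, 0.2, 0.1]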

Page 148: PROLOGUE: Pitfalls of standard alignments

How can we model other length distributions?

Non-limited case

[Chain topology: Begin → 1 → 2 → 3 → 4 → ... → N → End, as above, but with an additional self-loop on the last state]

This topology can model any length distribution between 1 and N-1, plus a geometric decay from N onwards.

Page 149: PROLOGUE: Pitfalls of standard alignments

Secondary structure: length statistics

[Plot: frequency of segment lengths (0-34 residues) observed for helix, strand and coil]

Page 150: PROLOGUE: Pitfalls of standard alignments

Secondary structure: model

[State diagram of the secondary-structure model, with several states per label so as to reproduce the length statistics above]

Do we use the same emission probabilities for states sharing the same label?

Page 151: PROLOGUE: Pitfalls of standard alignments

HMMs for Mapping problems

•Mapping problems in protein prediction
•Labelled HMMs
•Duration modelling
•Models for membrane proteins

Page 152: PROLOGUE: Pitfalls of standard alignments

Porin (Rhodobacter capsulatus): a β-barrel of the outer membrane. Bacteriorhodopsin (Halobacterium salinarum): an α-helical bundle of the inner membrane.

[Figure: the two structures embedded in the lipid bilayer]

Page 153: PROLOGUE: Pitfalls of standard alignments

Topography: position of the transmembrane segments along the sequence

Topology of membrane proteins

[Figure: porin (Rhodobacter capsulatus, β-barrel, outer membrane) and bacteriorhodopsin (Halobacterium salinarum, α-helices, inner membrane) in the lipid bilayer; the example sequence ALALMLCMLTYRHKELKLKLKK is shown with its transmembrane segment highlighted]

Page 154: PROLOGUE: Pitfalls of standard alignments

A generic model for membrane proteins (TMHMM)

[State diagram: Begin and End connected to three groups of states: transmembrane, inner side and outer side]

Page 155: PROLOGUE: Pitfalls of standard alignments

Model of β-barrel membrane proteins

[State diagram: Begin/End with transmembrane, inner-side and outer-side state groups]

Page 156: PROLOGUE: Pitfalls of standard alignments

Model of β-barrel membrane proteins

Labels: transmembrane states and loop states.

[Same state diagram, with each state carrying one of the two labels]

Page 157: PROLOGUE: Pitfalls of standard alignments

Model of β-barrel membrane proteins

Length of the transmembrane β-strands: minimum 6 residues, maximum unbounded.

[Same state diagram]

Page 158: PROLOGUE: Pitfalls of standard alignments

Model of β-barrel membrane proteins

Six different sets of emission parameters: outer loop, inner loop, long globular domains, edges of the TM strands, core of the TM strands.

[Same state diagram, with states grouped by shared emission parameters]

Page 159: PROLOGUE: Pitfalls of standard alignments

Model of α-helix membrane proteins (HMM1)

[State diagram: transmembrane, inner-side and outer-side state groups; the regions are modelled by chains of repeated states (×10, ×12, ×13)]

Page 160: PROLOGUE: Pitfalls of standard alignments

Model of α-helix membrane proteins (HMM2)

[Alternative state diagram: transmembrane, inner-side and outer-side state groups, with chains of ×10 repeated states]

Page 161: PROLOGUE: Pitfalls of standard alignments

TMS probability

[Plot: per-residue probability of lying in a transmembrane segment (TMS) along the sequence of 1A0S]

Dynamic programming filtering procedure

Page 162: PROLOGUE: Pitfalls of standard alignments

Dynamic programming filtering procedure

[Plot: the same TMS probability profile for 1A0S, with the predicted TMS marked]

Maximum-scoring subsequences with constrained segment length and number
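The actual procedure of the slides is a dynamic-programming search for the maximum-scoring set of segments under length and number constraints. As a much simpler illustration of the idea (not the authors' algorithm), one can threshold the per-residue TMS probability and keep only runs that satisfy a minimum length:

def naive_segments(prob, threshold=0.5, min_len=6):
    """Very simplified stand-in for the DP filtering step: return (start, end)
    runs (0-based, inclusive) where prob >= threshold for at least min_len
    consecutive residues."""
    segments, start = [], None
    for i, p in enumerate(prob + [0.0]):         # sentinel closes the last run
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            if i - start >= min_len:
                segments.append((start, i - 1))
            start = None
    return segments

# prob would be the per-residue TMS probability computed from the HMM posteriors
print(naive_segments([0.1, 0.8, 0.9, 0.9, 0.8, 0.9, 0.95, 0.2], min_len=3))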

Page 163: PROLOGUE: Pitfalls of standard alignments

[Plot: TMS probability profile for 1A0S with both the observed and the predicted TMS marked]

Dynamic programming filtering procedure

Maximum-scoring subsequences with constrained segment length and number

Page 164: PROLOGUE: Pitfalls of standard alignments

www.cbs.dtu.dk/services/TMHMM

Predictors of alpha-transmembrane topology

Page 165: PROLOGUE: Pitfalls of standard alignments

Hybrid systems: Basics

•Sequence profile based HMMs

Page 166: PROLOGUE: Pitfalls of standard alignments

MSA (rows: sequences; columns: sequence position):

 1  Y K D Y H S - D K K K G E L - -
 2  Y R D Y Q T - D Q K K G D L - -
 3  Y R D Y Q S - D H K K G E L - -
 4  Y R D Y V S - D H K K G E L - -
 5  Y R D Y Q F - D Q K K G S L - -
 6  Y K D Y N T - H Q K K N E S - -
 7  Y R D Y Q T - D H K K A D L - -
 8  G Y G F G - - L I K N T E T T K
 9  T K G Y G F G L I K N T E T T K
10  T K G Y G F G L I K N T E T T K

Sequence profile (percentage of each residue at each position):

A   0  0  0  0  0  0   0  0  0   0  0 10  0  0   0   0
C   0  0  0  0  0  0   0  0  0   0  0  0  0  0   0   0
D   0  0 70  0  0  0   0 60  0   0  0  0 20  0   0   0
E   0  0  0  0  0  0   0  0  0   0  0  0 70  0   0   0
F   0  0  0 10  0 33   0  0  0   0  0  0  0  0   0   0
G  10  0 30  0 30  0 100  0  0   0  0 50  0  0   0   0
H   0  0  0  0 10  0   0 10 30   0  0  0  0  0   0   0
K   0 40  0  0  0  0   0  0 10 100 70  0  0  0   0 100
I   0  0  0  0  0  0   0  0 30   0  0  0  0  0   0   0
L   0  0  0  0  0  0   0 30  0   0  0  0  0  0   0   0
M   0  0  0  0  0  0   0  0  0   0  0  0  0 60   0   0
N   0  0  0  0 10  0   0  0  0   0 30 10  0  0   0   0
P   0  0  0  0  0  0   0  0  0   0  0  0  0  0   0   0
Q   0  0  0  0 40  0   0  0 30   0  0  0  0  0   0   0
R   0 50  0  0  0  0   0  0  0   0  0  0  0  0   0   0
S   0  0  0  0  0 33   0  0  0   0  0  0 10 10   0   0
T  20  0  0  0  0 33   0  0  0   0  0 30  0 30 100   0
V   0  0  0  0 10  0   0  0  0   0  0  0  0  0   0   0
W   0 10  0  0  0  0   0  0  0   0  0  0  0  0   0   0
Y  70  0  0 90  0  0   0  0  0   0  0  0  0  0   0   0

Sequence profiles
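A small sketch (illustrative Python, not from the slides) of how a percentage profile like the one above is obtained from the alignment. Here each column's residue counts are turned into percentages and gaps are simply not counted, which is one common convention; the slide may treat gaps differently.

def sequence_profile(msa, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Return, for every alignment column, the percentage of each residue."""
    ncol = len(msa[0])
    profile = []
    for j in range(ncol):
        column = [seq[j] for seq in msa if seq[j] != "-"]   # ignore gaps
        counts = {aa: 0 for aa in alphabet}
        for aa in column:
            counts[aa] += 1
        total = len(column) or 1
        profile.append({aa: 100.0 * counts[aa] / total for aa in alphabet})
    return profile

msa = ["YKDYHS-DKKKGEL--",
       "YRDYQT-DQKKGDL--",
       "YRDYQS-DHKKGEL--"]            # first three rows of the alignment above
prof = sequence_profile(msa)
print(round(prof[0]["Y"]), round(prof[1]["R"]))   # 100 67 for this 3-sequence toy input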

Page 167: PROLOGUE: Pitfalls of standard alignments

Sequence-profile-based HMM

Instead of a sequence of characters s_t, the model now reads a sequence of M-dimensional vectors v_t: at every position t of the sequence (e.g. ... A C L P R P E T ...) the single residue is replaced by the corresponding column of the sequence profile.

[Figure: a stretch of profile columns aligned to the sequence ... A C L P R P E T ...]

Constraints on the profile vectors:

0 ≤ v_t(n) ≤ S_t for every n,      Σ_{n=1..M} v_t(n) = S_t

where S_t is the total of the profile column at position t. For proteins M = 20.

Page 168: PROLOGUE: Pitfalls of standard alignments

Sequence-profile-based HMM

Sequence of characters s_t: probability of emission from state k

P(s_t | k) = e_k(s_t)

Sequence of M-dimensional vectors v_t:

P(v_t | k) = (1/Z) Π_{n=1..M} e_k(n)^{v_t(n)}

with the normalization constraint ∫ P(v_t | k) d^M v_t = 1. If Σ_{n=1..M} e_k(n) = 1, the normalization factor Z turns out to be independent of the state k.

Because Z does not depend on the state, the algorithms for training and probability computation can be derived exactly as in the standard case.
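A minimal sketch (illustrative; the names are made up) of the vector emission defined above, computed in log space. The state-independent term -log Z is dropped because it cancels whenever states are compared.

import numpy as np

def log_emission_profile(e_k, v_t):
    """log P(v_t | k) up to the state-independent constant -log Z:
    sum over n of v_t(n) * log e_k(n)."""
    e_k = np.asarray(e_k, dtype=float)
    v_t = np.asarray(v_t, dtype=float)
    return float(np.sum(v_t * np.log(e_k + 1e-300)))   # guard against log(0)

# Toy 4-letter alphabet: emission parameters of one state and one profile column
e_k = [0.7, 0.1, 0.1, 0.1]
v_t = [8, 1, 1, 0]             # counts observed in the alignment column
print(log_emission_profile(e_k, v_t))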

Page 169: PROLOGUE: Pitfalls of standard alignments

Hybrid systems: Basics

•Sequence profile based HMMs
•Membrane protein topology

Page 170: PROLOGUE: Pitfalls of standard alignments

1) Accuracy: Q2=P/N  where P is the total number of correctly predicted residues and N is the total number of residues

2) Correlation coefficient C: C(s) = [p(s)·n(s) - u(s)·o(s)] / [(p(s)+u(s))(p(s)+o(s))(n(s)+u(s))(n(s)+o(s))]^(1/2)

where, for each class s, p(s) and n(s) are respectively the total number of correct predictions and correctly rejected assignments while u(s) and o(s) are the numbers of under and over predictions

3) Accuracy for each discriminated structure s: Q(s)=p(s)/[p(s)+u(s)]where p(s) and u(s) are the same as in equation 2

4) Probability of correct predictions P(s) : P(s)=p(s)/[p(s)+o(s)] where p(s) and o(s) are the same as in equation 2

5) Segment-based measure (Sov):

Sov = (1/N) Σ_S [ len(S1 ∩ S2) / len(S1 ∪ S2) ] · len(S1)

where S1 and S2 are an observed and a predicted segment that overlap, the sum runs over all such pairs, and N is the normalizing total length.

Scoring the prediction
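For concreteness (illustrative code, not from the slides), measures 1-4 above can be computed from an observed and a predicted label string as follows; the segment-based Sov needs the segment bookkeeping described above and is omitted here.

def per_class_scores(observed, predicted, classes):
    """Q2 and, for each class s, Q(s) = p/(p+u), P(s) = p/(p+o) and the
    correlation coefficient C(s), following definitions 1-4 above."""
    assert len(observed) == len(predicted)
    N = len(observed)
    q2 = sum(o == p for o, p in zip(observed, predicted)) / N
    scores = {}
    for s in classes:
        tp = sum(o == s and p == s for o, p in zip(observed, predicted))   # p(s)
        tn = sum(o != s and p != s for o, p in zip(observed, predicted))   # n(s)
        un = sum(o == s and p != s for o, p in zip(observed, predicted))   # u(s)
        ov = sum(o != s and p == s for o, p in zip(observed, predicted))   # o(s)
        denom = ((tp + un) * (tp + ov) * (tn + un) * (tn + ov)) ** 0.5
        scores[s] = {
            "Q": tp / (tp + un) if tp + un else 0.0,
            "P": tp / (tp + ov) if tp + ov else 0.0,
            "C": (tp * tn - un * ov) / denom if denom else 0.0,
        }
    return q2, scores

# Toy example with labels T (transmembrane) and L (loop)
q2, per_class = per_class_scores("TTTLLLLTTTT", "TTLLLLLTTTT", classes="TL")
print(round(q2, 2), {s: round(v["C"], 2) for s, v in per_class.items()})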

Page 171: PROLOGUE: Pitfalls of standard alignments

Topology of β-barrel membrane proteins: performance of the sequence-profile-based HMM

                                           Q2    QTMS  Qloop  PTMS  Ploop  Corr  Sov
HMM based on Multiple Sequence Alignment   83%   83%   82%    79%   85%    0.65  0.83
NN based on Multiple Sequence Alignment    78%   74%   82%    81%   76%    0.56  0.79
Standard HMM based on Single Sequence      76%   77%   76%    72%   80%    0.53  0.64

Martelli PL, Fariselli P, Krogh A, Casadio R. A sequence-profile-based HMM for predicting and discriminating beta-barrel membrane proteins. Bioinformatics 18: S46-S53 (2002)

Page 172: PROLOGUE: Pitfalls of standard alignments

Topology of β-barrel membrane proteins: discriminative power of the profile-based HMM

The discrimination score is the per-residue negative log-probability of the sequence given the model:

I(s | M) = -(1/L) log P(s | M)

[Plot: distribution of I(s | M) (x-axis roughly 2.75-2.95) for β-barrel membrane proteins (145), globular proteins (1239) and all-helical membrane proteins (188)]

Page 173: PROLOGUE: Pitfalls of standard alignments

The Bologna predictor for the topology of all membrane proteins

[Pipeline: sequence → sequence profiles → NN / HMM1 / HMM2 → MaxSubSeq → predicted topography → Von Heijne rule → predicted topology]

Page 174: PROLOGUE: Pitfalls of standard alignments

Method       Q2(%)  Corr   QTM(%)  QLoop(%)  PTM(%)  PLoop(%)  Qtopography   Qtopology    SOV

NN° 85.8 0.714 84.1 87.3 84.7 86.8 49/59 (83%) 38/59 (64%) 0.908

HMM1° 84.4 0.692 88.6 80.9 79.5 89.4 48/59 (81%) 38/59 (64%) 0.896

HMM2° 82.4 0.658 88.0 78.1 77.1 88.1 48/59 (81%) 39/59 (66%) 0.872

Jury° 85.3 0.708 86.9 84.1 82.1 88.4 53/59 (90%) 42/59 (71%) 0.926

TMHMM 2.0 82.3 0.661 70.9 92.9 89.3 79.2 42/59 (71%) 32/59 (54%) 0.840

MEMSAT 83.6 0.672 70.6 93.9 90.5 75.7 35/49 (71%) 24/49 (49%) 0.823

PHD 80.1 0.614 63.6 94.0 89.9 75.5 43/59 (73%) 30/59 (51%) 0.847

HMMTOP 81.2 0.627 68.5 91.8 87.5 77.7 45/59 (76%) 35/59 (59%) 0.862

Nir Ben-Tal 46/59 (78%)

KD 43/59 (73%)

° ° Test

Topology of all-α membrane proteins: performance

Martelli PL, Fariselli P, Casadio R. An ENSEMBLE machine learning approach for the prediction of all-alpha membrane proteins. Bioinformatics (in press, 2003)

Page 175: PROLOGUE: Pitfalls of standard alignments

HMM: Application in gene finding

•Basics

Page 176: PROLOGUE: Pitfalls of standard alignments

Eukaryotic gene structure

A. Krogh

Page 177: PROLOGUE: Pitfalls of standard alignments

Simple model for coding regions

A. Krogh

Page 178: PROLOGUE: Pitfalls of standard alignments

Simple model for unspliced gene

A. Krogh

Page 179: PROLOGUE: Pitfalls of standard alignments

Simple model for spliced gene

A. Krogh