defining scoring functions, multiple sequence alignment lecture #4 this class has been edited from...

.

Defining Scoring Functions, Multiple Sequence Alignment

Lecture #4

This class has been edited from Nir Friedman’s lecture. Changes made by Dan Geiger and Shlomo Moran.

Background Readings: Chapter 6 in Biological Sequence Analysis, Durbin et al., 2001.Chapter 3.4, 3.5 in Introduction to Computational Molecular Biology, Setubal and Meidanis, 1997.

taken from lecture4 of winter03-4 (Danny geiger), with some changes inserted.fall 05-6: next 4 slides added from previous lecture (25-28), and majorly modified the PAM1 definition+analysis, and the "relevance test" in multiple alignment. presentation was slow - 16 slides 1st hour, hardly completed the "relevance test". should be OK if a bit faster

2

Reminder: A Probabilistic Model for Defining the Scoring Function

Assume alignment without indels. Assume each position (nucleotide /amino-acid) is

independent of other positions. Consider two options:

M: the sequences are Matched (related)

R: the sequences are Random (unrelated)

3

R: Unrelated Sequences

Our random model of unrelated sequences is simple Each position is sampled independently from a

distribution over the alphabet We assume there is a distribution q() that

describes the probability of letters in such positions.

Then:

i

itqisqRntnsP ])[(])[()|]..1[],..1[(

4

M: Related Sequences

We assume that each pair of aligned positions (s[i],t[i]) evolved from a common ancestor

Let p(a,b) be a distribution over pairs of letters. p(a,b) is the probability that some ancestral letter

evolved into this particular pair of letters.

i

itispMntnsP ])[],[()|]..1[],..1[(

5

Odds-Ratio Test for Alignment

i

i

i

itqisq

itisp

itqisq

itisp

RtsP

MtsPQ

])[(])[(

])[],[(

])[(])[(

])[],[(

)|,(

)|,(

If Q > 1, then the two strings s and t are more likely tobe related (M) than unrelated (R).

If Q < 1, then the two strings s and t are more likely tobe unrelated (R) than related (M).

6

Score(s[i],t[i])

Log Odds-Ratio Test for AlignmentTaking logarithm of Q yields

])[(])[(

])[],[(log

])[(])[(

])[],[(log

)|,Pr(

)|,Pr(log

itqisq

itisp

itqisq

itisp

Rts

Mts

ii

If log Q > 0, then s and t are more likely to be related.If log Q < 0, then they are more likely to be unrelated.

Next we estimate Pr((a,b)|M)=P(a,b) for amino acids

7

We need to define for each pair (a,b) of amino acids (possibly a=b) the probability p(a,b) that a is aligned with b given that the sequences are related. p(a,b) should be determined by the data we have. The next slides present one way to define p(a,b) .

Estimating p(·,·) for proteins(Setubel et al, p. 81-84)

next slides modified drastically in fall 05-6

8

First, we define for each letter a a probability qa, which is the relative frequency of the amino-acid a in a collection of proteins which is considered reliable for this purpose.

Specifically, qa = na/n where na is the number of occurrences of letter a and n is the total number of letters in the collection, so n = ana.

Estimating p(·,·) for proteins

9

The collection of accepted mutations contains mutations in alignments of closely related protein sequences. For example, Hemoglobin alpha chain in humans and other organisms (homologous proteins).


For ab, we use as a data a collection of accepted mutations: Formally, it is a multiset of the form

{(a,b): a,b are amino acids and ab}.We consider the mutations as unordered pairs.

10

Estimating p(·,·) for proteinsUsing the collection of accepted mutations, which we consider as unordered pairs of amino acids. Since the pairs are unordered, we denote them by {a,b}.Some parameters:

Mutation countsfab=fba is the number of mutations {a, b}: ab in the above collection. f is the number of mutations (unordered pairs) in the collection.

fab /f is the conditional probability:

is a mutation)

= ).

Pr({ , } { , } | { , }

Pr({ , } { , } |

x y a b x y

x y a b x y

fall05-6: after the lecture I changed notations: from (a,b) to {a,b}

11


We still have to define the relative mutability, Pr(xy|M). This is the probability that in the “match” model M, a random pair (x,y) is a mutation. The relative mutability is a user defined parameter, denoted r.

in the model for is given by

)

= )

( , ) :

( , ) Pr({ , } { , } |

Pr({ , } { , } |

Pr( | )

Pr( | )

.ab

p a b M a b

p a b x y a b M

x y a b x y

fx y M

f

x y M

12

In PAM-1 matrix it is assumed that the relative mutability r is 1%. That is: 1/100 of the amino acids are mutated. This is modeled by letting n, the total number of pairs of amino acids, be 100f. Using this, we now can compute faa , the number of {a,a} pairs.

PAM-1 matrices[1-Percent Accepted Mutation].

)100

( , ) Pr({ , } { , } | .ab abf fp a b x y a b M

n f

[HW: prove that, by our assumptions, 99 of these pairs

are of the form , and hence ]2

{ : }{ , } .abb a b

aa a

f

f

x x f nq

In this model, p(a,b) is given as the relative frequency of the (unordered) pairs {a,b}:

explanation here motivated by Setubel et., p. 82.proof of hw:f_aa =[#(a in the colloections)-#(a in mutations)]/2

13

The log odds-ratio for PAM-1:

100

100

/(( , ) | )( , )

(( , )) | )ab ab

a b a b

f f fP a b MQ a b

P a b R q q fq q

The entries in PAM-1 matrix are the logarithms of the odd-ratio:

1100

( , | )[ ] log[ ] log[ ]

( , | )ab

aba b

fp a b MPAM

p a b R fq q

The actual values are 10 times the logarithm, rounded to integer.

14

Evolutionary distance

The choice that r=1% of amino acids change is quite arbitrary. It could fit specific set of proteins whose evolutionary distance is such that indeed 1% of the letters have mutated.

This is a unit of evolutionary change, not time because evolution acts differently on distinct sequence types.

What is the substitution matrix for k units of evolutionary time ?

15

Model of Evolution

We make some assumptions:

1. Each position changes independently of the rest

2. The probability of mutations is the same in each position

3. Evolution does not “remember”

Timet t+ t+2 t+3 t+4

A A C C GT T T C G

16

Model of Evolution

How do we model such a process? This process is called a Markov Chain

A chain is defined by the transition probability P(Xt+ =b|Xt=a) - the probability that the next

state is b given that the current state is a We often describe these probabilities by a matrix:

M[]ab = P(Xt+ =b|Xt=a)

17

Multi-Step Changes

Thus P[2] = P[]P[]

By induction (HMW exercise): P[n] = P[] n

c

ttttt

tt

aXcXPaXcXbXP

aXbXP

)|(),|(

)|(2

2

Based on Pab, we can compute the probabilities of changes over two time periods

2( | ) ( | )t t t t

c

ac cbc

P X b X c P X c X a

P P

Using Conditional independence (No memory)

18

More on Markov Chains – next week

19

Comments regarding PAM

Historically researchers use PAM-250. (The only one published in the original paper.)

Original PAM matrices were based on small number of proteins (circa 1978). Later versions use many more examples.

Used to be the most popular scoring rule, but there are some problems with PAM matrices.

20

Degrees of freedom in PAM definition

With r=1/100 the 1-PAM matrix is obtained from the matrix

100[ ] ab

aba b

fM

f q q

With r=1/50 the basic matrix is different, namely:

250

'[ ] abab

a b

fM

f q q

Use the PAM-1 matrix to the fourth power: M[4] = M[] 4

OrUse the K=50 matrix to the second power: M[4] = M’[2] 2

Thus we have two different ways to estimate the matrix M[4] :

21

Problems in building distance matrices

How do we find pairs of aligned sequences? How far is the ancestor ?

earlier divergence low sequence similarity later divergence high sequence similarity

E.g., M[250] is known not to reflect well long period changes (see p. 43 at Durbin et al).

Does one letter mutate to the other or are they both mutations of a third letter ?

22

BLOSUM Outline(Durbin et. al, p. 43-44)

PAM matrices does not capture the difference between short and long time mutations. BLOck-SUbstitution-Matrices (1992) attempt to model long time mutations.

Idea: use aligned ungapped regions of protein families.These are assumed to have a common ancestor. Similar ideas but better statistics and modeling.

Procedure: Cluster together sequences in a family whenever more than L%

identical residues are shared. Count number of substitutions across different clusters (in the

same family). Estimate frequencies using the counts.

Practice: Blosum50 (ie L=50) and Blosum62 are wildly used. (See page 43-44 in Durbin et al).

שינויים קצרי טווח נובעים משינוי בסיס יחיד בשלישיה של קודון, בעוד שבשינויים ארוכי טווח מופיעים כל השינויים האפשריים.

23

Multiple Sequence Alignment

S1=AGGTC

S2=GTTCG

S3=TGAACPossible alignment

A-T

GGG

G--

TTA

-TA

CCC

-G-

Possible alignment

AG-

GTT

GTG

T-A

--A

CCA

-GC

24


Definition: Given strings S1, S2, …,Sk a multiple (global) alignment map them to strings S’1, S’2, …,S’k that may contain gaps, where:

1. |S’1|= |S’2|=…= |S’k|

2. The removal of gaps from S’i leaves Si

Aligning more than two sequences.

25

Multiple alignmentsWe use a matrix to represent the alignment of k sequences,

K=(x1,...,xk). We assume no columns consists solely of gaps.

M Q _ I L L L

M L R - L L -

M K _ I L L L

M P P V L I L

The common scoring functions give a score to each column, and set: score(K)= ∑i score(column(i))

The scoring function is symmetric - the order of arguments need not matter: score(I,_,I,V) = score(_,I,I,V). Still, for k=10, a scoring function has > 2k -1 > 1000 entries to specify.

x1

x2

x3

x4

The exact bound is [L+k) over k] - 1, where L is the number of letters in the alphabet (= [number of ways to distribute k identical balls in L+1 cells] minus one).

26

SUM OF PAIRS

M Q _ I L L L

M L R - L L -

M K _ I L L L

M P P V L I L

A common scoring function is SP – sum of scores of the projected pairwise alignments: SPscore(K)=∑i<j score(xi,xj).

In order for this score to be written as ∑i score(column(i)),we set score(-,-) = 0. Why ?

Because these entries appear in the sum of columns but not in the sum of projected pairwise alignments (lines).

Note that we need to specify the score(-,-) because a column may have several blanks (as long as not all entries are blanks).

27

SUM OF PAIRS

M Q _ I L L L

M L R - L L -

M K _ I L L L

M P P V L I L

Definition: The sum-of-pairs (SP) value for a multiple global

alignment A of k strings is the sum of the values of all projected

pairwise alignments induced by A where the pairwise alignment

function score(xi,xj) is additive.

2

k

28

Example

Consider the following alignment:

a c - c d b -

- c - a d b d

a - b c d a d

Using the edit distance and for ,

this alignment has a SP value of

0, xx 1, yx yx

33 +43 + 4 + 5 = 12

29


Given k strings of length n, there is a natural generalization of the dynamic programming algorithm that finds an alignment that maximizes

SP-score(K) = ∑i<j score(xi,xj).

Instead of a 2-dimensional table, we now have a k-dimensional table to fill.

For each cell i =(i1,..,ik), compute an optimal multiple alignment for the k prefix sequences x1(1,..,i1),...,xk(1,..,ik).

The adjacent cells are those that differ in their indices by one or zero. Each cell depends on 2k-1 adjacent cells.

30

The idea via K=2

])[,(],[

)],[(],[

])[],[(],[

max],[

1jtj1iV

1is1jiV

1jt1isjiV

1j1iV

])..[],..[(],[ j1ti1sdjiV

V[i,j] V[i+1,j]

V[i,j+1] V[i+1,j+1] Note that the new cell index (i+1,j+1) differs from previous indices by one of 2k-1 non-zero binary vectors (1,1), (1,0), (0,1).

Recall the notation:

and the following recurrence for V:

31

The idea for arbitrary k

Order the cells i =(i1,..,ik) by increasing order of the sum ∑ij. Set s(0,..,0)=0, and for i > (0,...,0):

( ) max{ ( ) ( ( , ))}s s score columnb

i i b i b

• b ranges over all non-zero binary k-tuples. •The cell i-b is the non-negative difference of i and b. • column(i,b) is the k-tuple (c1,…,ck) where: cj= xj(ij) if bi=1, and cj= ‘-’ otherwise.(Reflecting that bj=1 iff the sequence xj was extended by a letter in the corresponding case).

Where

32

Complexity of the DP approach

Number of cells nk.

Number of adjacent cells O(2k).Computation of SP score for each column(i,b) is O(k2)

Total run time is O(k22knk) which is utterly unacceptable !

Not much hope for a polynomial algorithm because the problem has been shown to be NP complete.

Need heuristic to reduce time.

33

Time saving heuristics: Relevance testsHeuristic: Avoid computing score(i) for irrelevant cells.

M Q _ I L L L

M L R - L L -

M K _ I L L L

M P P V L I L

x1

x2

x3

x4

Let L be a lower bound on the optimal SP score of a multiple alignment of the k sequences (L can be obtained from an arbitrary multiple alignment, computed in any way).

34

Time saving heuristics: Relevance testsHeuristic: Avoid computing score(i) for irrelevant cells.

Relevance test for cell i=(..iu,..iv…): For each two indices 1

u < v k, compute the optimal score of an alignment of s=xu and t=xv, which goes via cell i. If the sum of these scores over all pairs (u,v) is smaller than L, then i is irrelevant.

An alignment via s=xu and t=xv, which goes via cell i is an alignment of s and t which has a prefix which aligns

(s[1],…s[iu]) with (t[1],..,t[iv]).How do we find an optimal such alignment?

35

Recall the Linear Space algorithm V[i,j] = d(s[1..i],t[1..j]) B[i,j] = d(s[i+1..n],t[j+1..m]) F[i,j] + B[i,j] = score of best alignment through

(i,j)t

s

These computations done in linear space.

),( vuxxiia vuBuild such a table

for every two sequences s=xu and t=xv, 1 u, v k. This entry encodes the optimum through (iu,iv).

36

Time saving heuristics:Relevance test

1

1

1

Let be a lower bound on SP- ( ,.., ).

We say that cell ( ,.., ) is irrelevant if the

score of any alignment through is less than .

Definition: ( ) ( , )

Claim 1: If L> (

u v

k

k

u vx xu v k

L score x x

i i

L

L a i i

L

ii

i

i

1) then ( ,.., ) is irrelevant.ki ii

But can we go over all cells determine if they are relevant or not ?No. Start with (0,…,0) and add to the list relevant entries until reaching (n1,…,nk)

37

Multiple Sequence Alignment – Approximation Algorithm

In tutorial time you saw an O(k2n2) multiple alignment algorithm for the SP-score that errs by a factor of at most 2(1-1/k) < 2.

38

Star Alignments

Rather then summing up all pairwise alignments, select a fixed sequence x0 as a center, and set

Star-score(K) = ∑j>0score(x0,xj).

The algorithm to find optimal alignment: at each step, add another sequence aligned with x0, keeping old gaps and possibly adding new ones.

39

Tree Alignments

Assume that there is a tree T=(V,E) whose leaves are the sequences. • Associate a sequence in each internal node.• Tree-score(K) = ∑(i,j)Escore(xi,xj).

Finding the optimal assignment of sequences to the internal nodes is NP Hard.

We will meet again this problem in the study ofPhylogenetic trees

defining scoring functions, multiple sequence alignment lecture #4 this class has been edited from...

Documents