Multiple Sequence Alignment Dynamic Programming

Upload: darlene-booton

Post on 31-Mar-2015


Page 1: Multiple Sequence Alignment Dynamic Programming. Multiple Sequence Alignment VTISCTGSSSNIGAG  NHVKWYQQLPG VTISCTGTSSNIGS  ITVNWYQQLPG LRLSCSSSGFIFSS

Multiple Sequence Alignment

Dynamic Programming

Multiple Sequence Alignment

VTISCTGSSSNIGAGNHVKWYQQLPG
VTISCTGTSSNIGSITVNWYQQLPG
LRLSCSSSGFIFSSYAMYWVRQAPG
LSLTCTVSGTSFDDYYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG
ATLVCLISDFYPGAVTVAWKADS
ATLVCLISDFYPGAVTVAWKADS
AALGCLVKDYFPEPVTVSWNSG-
VSLTCLVKGFYPSDIAVEWESNG-

• Goal: Bring the greatest number of similar characters into the same column of the alignment.

• Similar to the alignment of two sequences.

CLUSTALW MSA

MSA of four oxidoreductase NAD binding domain protein sequences. Red: AVFPMILW. Blue: DE. Magenta: RHK. Green: STYHCNGQ. Grey: all others. Residue ranges are shown after sequence names.

Chenna et al. Nucleic Acids Research, 2003, Vol. 31, No. 13 3497-3500

Multiple Sequence Alignment: Motivation

• Correspondence: find out which parts “do the same thing”
  – Similar genes are conserved across widely divergent species, often performing similar functions
• Structure prediction
  – Use knowledge of the structure of one or more members of a protein MSA to predict the structure of other members
  – Structure is more conserved than sequence
• Create “profiles” for protein families
  – Allow us to search for other members of the family
• Genome assembly: automated reconstruction of “contig” maps of genomic fragments such as ESTs
• MSA is the starting point for phylogenetic analysis

Multiple Sequence Alignment: Approaches

• Optimal global alignments: dynamic programming
  – Generalization of Needleman-Wunsch
  – Find the alignment that maximizes a score function
  – Computationally expensive: time grows as the product of the sequence lengths
• Global progressive alignments: match closely related sequences first using a guide tree
• Global iterative alignments: multiple re-building attempts to find the best alignment
• Local alignments
  – Profiles, Blocks, Patterns

Scoring a multiple alignment

[Figure: three schemes for scoring an alignment column containing the residues A, A, A, C, C. Panels: Sum of pairs, Star, Tree.]

Sum of Pairs

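Alignment of five sequences (three columns):

AAA
AAA
AAA
AAC
ACC

Column scores (α for each matching pair, -β for each mismatched pair):

Column 1 (A A A A A):   10α
Column 2 (A A A A C): + (6α - 4β)
Column 3 (A A A C C): + (4α - 6β)

Total                 = 20α - 10β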
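The column-by-column count above can be checked with a short script. This is a minimal sketch, assuming a gap-free alignment and a simple match/mismatch scheme; the function names `sp_pair_counts` and `sp_score` are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def sp_pair_counts(rows):
    """Count matching and mismatched residue pairs over all columns.

    Assumes the alignment rows have equal length and contain no gaps.
    """
    matches = mismatches = 0
    for column in zip(*rows):                    # walk the alignment column by column
        for a, b in combinations(column, 2):     # every unordered pair in the column
            if a == b:
                matches += 1
            else:
                mismatches += 1
    return matches, mismatches

def sp_score(rows, alpha, beta):
    """Sum-of-pairs score: +alpha per matching pair, -beta per mismatched pair."""
    matches, mismatches = sp_pair_counts(rows)
    return alpha * matches - beta * mismatches

rows = ["AAA", "AAA", "AAA", "AAC", "ACC"]
print(sp_pair_counts(rows))   # (20, 10), i.e. 20α - 10β
```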

Sum-of-Pairs Scoring Function

Score of multiple alignment = ∑ {score(Si, Sj) : i < j}

where score(Si, Sj) is the score of the induced pairwise alignment of Si and Sj

Induced Pairwise Alignment

S1  S - T I S C T G - S - N I
S2  L - T I - C N G S S - N I
S3  L R T I S C S G F S Q N I

Induced pairwise alignment of S1, S2:

S1  S T I S C T G - S N I
S2  L T I - C N G S S N I

MSA: Dynamic Programming

• The two-sequence alignment algorithm can be generalized to any number of sequences.
• E.g., for three sequences X, Y, W define C[i,j,k] = score of the optimum alignment among X[1..i], Y[1..j], W[1..k]
• As for two sequences, divide the possible alignments into different classes, depending on how they end.
  – Use these classes to devise recurrence relations for C[i,j,k]
  – C[i,j,k] is the maximum out of all possibilities

MSA: 7 ways alignment can end for 3 sequences

An alignment of X1 . . . Xi, Y1 . . . Yj, W1 . . . Wk ends in one of these columns:

      1    2    3    4    5    6    7
X     Xi   -    Xi   Xi   -    -    Xi
Y     Yj   Yj   -    Yj   -    Yj   -
W     Wk   Wk   Wk   -    Wk   -    -

Dynamic programming for three sequences

[Figure: three-dimensional dynamic programming matrix with the sequences V S N S, S N A, and A S along its axes; one corner is labeled Start.]

Each alignment is a path through the dynamic programming matrix

Dynamic Programming for Three Sequences

For 3 sequences of length n, time is proportional to n^3.

There are 7 ways to get to C[i,j,k], one from each predecessor cell:

C[i-1,j-1,k-1]
C[i-1,j-1,k]   C[i-1,j,k-1]   C[i,j-1,k-1]
C[i-1,j,k]     C[i,j-1,k]     C[i,j,k-1]

Enumerate all possibilities and choose the best one.
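The recurrence can be written out directly. The following is a minimal sketch that computes only the optimal score, not the alignment itself. It assumes sum-of-pairs column scoring with illustrative values (match +1, mismatch -1, residue against gap -1, gap against gap 0); these values and the names `pair_score` and `align3_score` are assumptions, not taken from the slides.

```python
from itertools import combinations, product

GAP = "-"

def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score one pair of symbols in a column (assumed scoring scheme)."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch

def align3_score(x, y, w):
    """Optimal sum-of-pairs score for aligning three sequences.

    C[i, j, k] is the best score for the prefixes x[:i], y[:j], w[:k].
    """
    # The 7 non-empty ways a column can consume symbols from x, y, w.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    C = {(0, 0, 0): 0}
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            for k in range(len(w) + 1):
                if (i, j, k) == (0, 0, 0):
                    continue
                best = None
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if min(pi, pj, pk) < 0:
                        continue          # predecessor lies outside the table
                    column = (x[pi] if di else GAP,
                              y[pj] if dj else GAP,
                              w[pk] if dk else GAP)
                    score = C[pi, pj, pk] + sum(
                        pair_score(a, b) for a, b in combinations(column, 2))
                    if best is None or score > best:
                        best = score
                C[i, j, k] = best
    return C[len(x), len(y), len(w)]

print(align3_score("VSNS", "SNA", "AS"))
```

The three nested loops visit every cell once and try at most 7 predecessors per cell, which is where the n^3 running time comes from.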

Dynamic Programming MSA: General Case

• For k sequences of length n, the dynamic programming algorithm does (2^k - 1) n^k operations
  – Example: 6 sequences of length 100 require (2^6 - 1) × 100^6 ≈ 6.3 × 10^13 calculations
• Space for the table is n^k
• Implementations (e.g., WashU MSA 2.1) use tricks and only search a subset of the dynamic programming table
  – Even this is expensive. E.g., the Baylor College of Medicine Search Launcher limits MSA to 8 sequences of 800 characters and 10 minutes of processing time
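To get a feel for how quickly these numbers grow, the formulas can be evaluated directly. This is a small sketch; the function name `msa_dp_cost` is hypothetical.

```python
def msa_dp_cost(k, n):
    """Return (operations, table cells) for exact DP on k sequences of length n.

    Each of the n**k cells is reached from 2**k - 1 predecessor cells.
    """
    cells = n ** k
    operations = (2 ** k - 1) * cells
    return operations, cells

for k in (2, 3, 6):
    ops, cells = msa_dp_cost(k, 100)
    print(f"k={k}: {ops:.1e} operations, {cells:.1e} cells")
```

For 6 sequences of length 100 this gives 63 × 10^12 operations, on the order of 6 × 10^13, over a table of 10^12 cells.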

Problems with SP scoring

• Pair-wise comparisons can over-score evolutionarily distant pairs.

• Reason: For 3 or more sequences, SP scoring does not correspond to any evolutionary tree


Overcoming problems with SP scoring

• Use weights to incorporate evolution in sum-of-pairs scoring:
  – Some pair-wise alignments are more important than others
    • E.g., it is more important to have a good alignment between mouse and human sequences than between mouse and bird
  – Assign different weights to different pair-wise alignments
    • Weight decreases with evolutionary distance
• Use the star tree approach
  – One sequence is assigned as the ancestor and all others are compared against it
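Weighted sum-of-pairs scoring multiplies each induced pairwise score by a weight for that pair. The sketch below assumes a simple match/mismatch/gap scheme (+1, -1, -1) and a symmetric weight matrix supplied by the caller; how the weights are derived from evolutionary distance is not specified in the slides, so the values used here are purely illustrative, as are the function names.

```python
from itertools import combinations

def pair_alignment_score(row_a, row_b, match=1, mismatch=-1, gap=-1):
    """Score the pairwise alignment induced by two rows (assumed scoring)."""
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue                      # column absent from the induced alignment
        if a == "-" or b == "-":
            total += gap
        else:
            total += match if a == b else mismatch
    return total

def weighted_sp_score(rows, weight):
    """Sum over pairs i < j of weight[i][j] * score(Si, Sj)."""
    return sum(weight[i][j] * pair_alignment_score(rows[i], rows[j])
               for i, j in combinations(range(len(rows)), 2))

rows = ["AAA", "AAC", "ACC"]
# Illustrative weights: the closely related pair (0, 1) counts double,
# the distant pair (0, 2) counts half.
weight = [[0, 2, 0.5],
          [2, 0, 1],
          [0.5, 1, 0]]
print(weighted_sp_score(rows, weight))   # 2.5
```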

Star Alignments

• Construct multiple alignments using pair-wise alignment relative to a fixed sequence

• Out of a set S = {S1, S2, . . . , Sr} of sequences, pick sequence Sc that maximizes

star_score(c) = ∑ {sim(Sc, Si) : 1 ≤ i ≤ r, i ≠ c}

where sim(Si, Sj) is the optimal score of a pair-wise alignment between Si and Sj

Algorithm

1. Compute sim(Si, Sj) for every pair (i,j)
2. Compute star_score(i) for every i
3. Choose the index c that maximizes star_score(c) and make it the center of the star
4. Produce a multiple alignment M such that, for every i, the induced pairwise alignment of Sc and Si is the same as the optimum alignment of Sc and Si.
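Steps 1 to 3 can be sketched as follows. Here sim is taken to be a Needleman-Wunsch global alignment score with assumed parameters (match +1, mismatch -1, linear gap penalty -1); the slides do not fix a scoring scheme, and the function names are hypothetical.

```python
def sim(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global pairwise alignment score (Needleman-Wunsch, linear gaps)."""
    prev = [j * gap for j in range(len(b) + 1)]      # row 0 of the DP table
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def star_score(seqs, c):
    """Sum of sim(Sc, Si) over all i != c."""
    return sum(sim(seqs[c], s) for i, s in enumerate(seqs) if i != c)

def pick_center(seqs):
    """Index of the sequence with the largest star score."""
    return max(range(len(seqs)), key=lambda c: star_score(seqs, c))

seqs = ["AAAA", "AAAT", "TTTT"]
print(pick_center(seqs))   # 1: AAAT is the most similar to the other two
```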

Step 4: Detail

Sc AA--CCTT

S1 AATGCC--

Sc A-ACC-TT

S2 AGACCGT-

Sc A-A--CC-TT

S1 A-ATGCC---

S2 AGA--CCGT-
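One common way to carry out this merge is to add the pairwise alignments one at a time. A gap already placed in the center row is kept, and any new gap in the center opens a gap column in every row merged so far. The sketch below follows that idea; the function name `merge_star` is hypothetical, and it assumes every pairwise alignment contains the same ungapped center sequence.

```python
GAP = "-"

def merge_star(pairwise):
    """Merge (center_row, other_row) pairwise alignments into one alignment.

    Returns the rows as strings, with the center sequence first.
    """
    center, first = pairwise[0]
    rows = [list(center), list(first)]        # rows[0] is always the center
    for c_aln, s_aln in pairwise[1:]:
        merged = [[] for _ in rows]           # existing rows, rebuilt column by column
        new_row = []
        i = j = 0
        width = len(rows[0])
        while i < width or j < len(c_aln):
            msa_gap = i < width and rows[0][i] == GAP
            new_gap = j < len(c_aln) and c_aln[j] == GAP
            if msa_gap and not new_gap:
                # Gap already in the merged center: pad the new sequence.
                for out, row in zip(merged, rows):
                    out.append(row[i])
                new_row.append(GAP)
                i += 1
            elif new_gap and not msa_gap:
                # New gap in the center: open a gap column in all existing rows.
                for out in merged:
                    out.append(GAP)
                new_row.append(s_aln[j])
                j += 1
            else:
                # Same center residue (or a gap in both): columns line up.
                for out, row in zip(merged, rows):
                    out.append(row[i])
                new_row.append(s_aln[j])
                i += 1
                j += 1
        rows = merged + [new_row]
    return ["".join(r) for r in rows]

for row in merge_star([("AA--CCTT", "AATGCC--"), ("A-ACC-TT", "AGACCGT-")]):
    print(row)
# A-A--CC-TT
# A-ATGCC---
# AGA--CCGT-
```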