multiple sequence alignment - tel aviv universityrshamir/algmb/presentations/multiple...

61
1 Multiple Sequence Alignment Some slides from: Jones, Pevzner, USC Intro to Bioinformatics Algorithms http://www.bioalgorithms.info/ S. Batzoglu, Stanford http://ai.stanford.edu/~serafim/CS262_2006/ Geiger, Wexler, Technion http://www.cs.technion.ac.il/~cs236522/ Ruzzo, Tompa U. Washington CSE 590bi Poch, Strasbourg www.inra.fr/internet/Projets/agroBI/PHYLO/Poch.ppt A. Drummond, Auckland, NZ Tandy Warnow, Illinois tandy.cs.illinois.edu/Jordan-april14.pptx CG © Ron Shamir

Upload: vohanh

Post on 25-Apr-2018

228 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

1

Multiple Sequence Alignment

Some slides from: • Jones, Pevzner, USC Intro to Bioinformatics Algorithms

http://www.bioalgorithms.info/ • S. Batzoglu, Stanford http://ai.stanford.edu/~serafim/CS262_2006/

• Geiger, Wexler, Technion http://www.cs.technion.ac.il/~cs236522/

• Ruzzo, Tompa U. Washington CSE 590bi • Poch, Strasbourg www.inra.fr/internet/Projets/agroBI/PHYLO/Poch.ppt

• A. Drummond, Auckland, NZ • Tandy Warnow, Illinois tandy.cs.illinois.edu/Jordan-april14.pptx

CG © Ron Shamir

Page 3: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

3

Multiple Alignment Definition Input: Sequences S1 , S2 ,…, Sk over the same alphabet Output: Gapped sequences S’1 , S’2 ,…, S’k of equal length

1. |S’1|= |S’2|=…= |S’k|

2. Removal of spaces from S’i gives Si for all i

CG © Ron Shamir

Page 4: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

4

Example

S1=AGGTC

S2=GTTCG

S3=TGAAC Possible alignment

A - T

G G G

G - -

T T A

- T A

C C C

- G -

Possible alignment

A G -

G T T

G T G

T - A

- - A

C C A

- G C

CG © Ron Shamir

Page 5: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

5

CG © Ron Shamir

Page 6: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

6 CG © Ron Shamir

Page 7: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

7

Example

Multiple sequence alignment of 7 neuroglobins using clustalx

CG © Ron Shamir

Page 9: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

9

Protein Phylogenies – Example

Kinase domain

CG © Ron Shamir

Page 10: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

10

Motivation again • Common structure, function or origin may

be only weakly reflected in sequence – multiple comparisons may highlight weak signals

• Major uses: – Identify and represent protein families – Identify and represent conserved seq. elements

(e.g. domains) – Deduce evolutionary history – PCR primer design – Transcription factor binding sites

CG © Ron Shamir

Page 11: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

Structure comparison, modelling

Interaction networks

Hierarchical function annotation: homologs, domains, motifs

Phylogenetic studies

Human genetics, SNPs

Therapeutics, drug discovery

Therapeutics, drug design DBD

LBD

insertion domain

binding sites / mutations

Gene identification, validation

RNA sequence, structure, function

Comparative genomics

MACS

MSA : central role in biology

CG © Ron Shamir

Page 12: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

12

Scoring alignments

• Given input seqs. S1 , S2 ,…, Sk find a multiple alignment of optimal score

• Scores preview: – Sum of pairs – Consensus – Tree

• Varying methods - no single winner

CG © Ron Shamir

Page 13: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

13

Sum of Pairs score Def: Induced pairwise alignment A pairwise alignment induced by the multiple

alignment Example: x: AC-GCGG-C y: AC-GC-GAG z: GCCGC-GAG Induces: x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAG y: ACGC-GAG; z: GCCGC-GAG; z: GCCGCGAG

S(M) = Σk<l σ(S’k, S’l) CG © Ron Shamir

Page 14: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

14

Consider the following alignment:

AC-CDB- -C-ADBD A-BCDAD

SOP Score Example

Scoring scheme: match - 0 mismatch/indel - -1 SP score: -3 -5 -4 =-12

CG © Ron Shamir

Page 15: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

15

Alignments = Paths

• Align 3 sequences: ATGC, AATC,ATGC

A A T -- C

A -- T G C

-- A T G C

CG © Ron Shamir

Page 16: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

16

Alignment Paths 0 1 1 2 3 4

A A T -- C

A -- T G C

-- A T G C

x coordinate

CG © Ron Shamir

Page 17: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

17

Alignment Paths • Align 3 sequences: ATGC, AATC,ATGC 0 1 1 2 3 4

0 1 2 3 3 4

A A T -- C

A -- T G C

-- A T G C

x coordinate

y coordinate

CG © Ron Shamir

Page 18: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

18

Alignment Paths

0 1 1 2 3 4

0 1 2 3 3 4

A A T -- C

A -- T G C

0 0 1 2 3 4

-- A T G C

• Resulting path in (x,y,z) space: (0,0,0)→(1,1,0)→(1,2,1) →(2,3,2) →(3,3,3) →(4,4,4)

x coordinate

y coordinate

z coordinate

CG © Ron Shamir

Page 19: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

19

Aligning Three Sequences • Same strategy as

aligning two sequences

• Use a 3-D “Manhattan Cube”, with each axis representing a sequence to align

• For global alignments, go from source to sink

source

sink

CG © Ron Shamir

Page 20: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

20

2-D vs 3-D Alignment Grid

V

W

2-D edit graph

3-D edit graph

CG © Ron Shamir

Page 21: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

21

2-D cell versus 2-D Alignment Cell

In 3-D, 7 edges in each unit cube

In 2-D, 3 edges in each unit square

CG © Ron Shamir

Page 22: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

22

Architecture of 3-D Alignment Cell

(i-1,j-1,k-1)

(i,j-1,k-1)

(i,j-1,k)

(i-1,j-1,k) (i-1,j,k)

(i,j,k)

(i-1,j,k-1)

(i,j,k-1)

CG © Ron Shamir

Page 23: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

23

Multiple Alignment: Dynamic Programming

• si,j,k = max

• δ(x, y, z) is an entry in the 3-D scoring matrix (e.g., using SOP score)

si-1,j-1,k-1 + δ(vi, wj, uk) si-1,j-1,k + δ (vi, wj, _ ) si-1,j,k-1 + δ (vi, _, uk) si,j-1,k-1 + δ (_, wj, uk) si-1,j,k + δ (vi, _ , _) si,j-1,k + δ (_, wj, _) si,j,k-1 + δ (_, _, uk)

cube diagonal: no indels

face diagonal: one indel

edge diagonal: two indels

CG © Ron Shamir

Page 24: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

24

Running Time

• For 3 sequences of length n, the run time is O(n3)

• For k sequences, build a k-dimensional cube, with run time O(2knk)

• Want n, k 10’s to 100’s (sometimes much more)

• Impractical CG © Ron Shamir

Page 25: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

25

Scoring gaps • Our alg assumed a linear gap score (const.

per space). • For affine gap scores: need 2k – 1 states,

one per combination of gapped/ungapped sequences

• Running time: O(2k × 2k × nk) = O(4k nk)

CG © Ron Shamir

Page 26: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

26

NP- Hardness of Multiple Alignment

• For fixed k, the problem is polynomial • The general problem for SOP:

• Wang & Jiang 94: special type matrix • Bonizzoni & Vedova 01: more general

matrix, binary alphabet • Elias 03: binary or larger alphabet, any

matrix • Similar results for other scoring

functions

CG © Ron Shamir

Page 27: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

27

Forward Dynamic Programming • An alternative approach to DP for pairwise

(and multiple) alignment: • D(v) – opt value of path sourcev • p(w) – best-yet solution of path sourcew • When D(v) is computed, send its value

forward on all the arcs (v,w): p(w)=min{p(w),D(v)+cost(v,w)}

• Once p(w) has been updated by all incoming edges – that value is optimal; set as D(w)

CG © Ron Shamir

Page 28: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

28

Forward Dynamic Programming (2)

• Maintain a queue of nodes whose D is not set yet

• For the node w at the head of the queue: Set D(w)p(w) and remove

• Update p for each out-neighbor x of w – if x is not in the queue – add it at the end*

• Same complexity as the regular (backwards) DP

• *breaking ties by lexicographically!

CG © Ron Shamir

Page 29: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

29

Faster DP Algorithm for MultiAlign Carillo-Lipman 88

• Use forward DP. • We’ll demonstrate on three sequences • f12(i,j) = opt pairwise alignment score of

suffixes S1(i,..n1), S2(j,..n2), etc. • Key idea: if ∃ a known soln of cost z, if D(i,j,k)+ f12(i,j) + f13(i,k) +f23(j,k) > z Do not place (i,j,k) in the queue • Extra work: DP for all seq pairs - O(k2n2) • Guarantees opt soln – no improved time

bound, but often saves a lot in practice.

CG © Ron Shamir

Page 30: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

30

http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/cl2.gif CG © Ron Shamir

Page 31: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

31

Approximation Algorithm – SOP

We use min cost instead of max score

Find alignment of minimal cost

Assumption: the cost function δ is a distance function

• δ(x,x) = 0

• δ(x,y) = δ(y,x) ≥ 0

• δ(x,y) + δ(y,z) ≥ δ(x,z) (triangle inequality) (e.g. cost of MM ≤ cost of two indels)

D(S,T) - cost of minimum global alignment between S and T

CG © Ron Shamir

Page 32: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

32

Input: Γ - set of k strings S1, …,Sk.

1. Find the string S*∈ Γ (center) that minimizes 2. Denote S1=S* and the rest of the strings as S2, …,Sk

3. Iteratively add S2, …,Sk to the alignment as follows:

a. Suppose S1, …,Si-1 are already aligned as S’1, …,S’i-1

b. Optimally align Si to S’1 to produce S’i and S’’1 aligned

c. Adjust S’2, …,S’i-1 by adding spaces where spaces were added to S’’1 d. Replace S’1 by S’’1

( ){ }

∑Γ∈ *\

*,SS

SSD

The Center Star algorithm

CG © Ron Shamir

Page 33: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

33

• Choosing S1 – execute DP for all sequence-pairs - O(k2n2)

• Adding Si to the alignment - execute DP for Si , S’1 - O(i·n2).

(In the ith stage the length of S’1 can be up-to i· n)

( ) ( )∑−

=

=⋅1

1

222k

inkOniO

Running time

total complexity

CG © Ron Shamir

Page 34: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

34

For all i: d(1,i)=D(S1,Si) (we perform optimal alignment between S’1 and Si and δ(-,-) = 0 )

Approximation ratio

• M* - An optimal alignment • M - The alignment produced by this algorithm • d(i,j) - The distance M induces on the pair Si,Sj •

•recall D(S,T) – min cost of alignment between S and T

( ) ( ) ( )∑∑∑<=

≠=

==ji

k

i

k

ijj

jidjidMv ,2,1 1

CG © Ron Shamir

Page 35: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

35

Approximation ratio (2)

( )∑=

−=k

llSSDk

21,)1(2

( )∑=

=k

jjSSDk

21,

2)1(2)()(

* ≤−

≤k

kMvMv

( ) ( )∑∑=

≠=

=k

i

k

ijj

jidMv1 1

, ( ) ( )( )∑∑=

≠=

+≤k

i

k

ijj

jdid1 1

,1,1

( )∑=

−=k

lldk

2,1)1(2

( ) ( )∑∑=

≠=

=k

i

k

ijj

jidMv1 1

** , ( ) ≥≥ ∑∑=

≠=

k

i

k

ijj

ji SSD1 1

,

( )∑∑= =

≥k

i

k

jjSSD

1 21 ,

Definition of S1:

( ) ( )∑∑≠==

≤∀k

ijj

ji

k

jj SSDSSDi

121 ,,:

Triangle inequality + symmetry

CG © Ron Shamir

Page 36: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

36

Theorem (Gusfield 93)

• We have proved: • The center star algorithm is a

polynomial algorithm that guarantees a solution at most twice the optimum.

• “a 2-approximation” • “an approximation ratio of 2”

CG © Ron Shamir

Page 37: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

37

Steiner/consensus string • Input: Γ - set of k strings S1, …,Sk. • D(X,Y) – score of aligning X, Y.

• The consensus error of S relative to Γ:

E(S)= Σi≤k D(S, Si) • Goal: find S* that minimizes E(S) • S* is called an optimal Steiner string for Γ

• New objective – k terms and not k2

• No direct relation to multialign! (for now)

CG © Ron Shamir

Page 38: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

38

Thm: Assume D satisfies triangle ineq. Then ∃S∈Γ that guarantees an approximation ratio 2.

∑ ≠=

iSS iSSDSE ),()( ( ) ( )( )∑ ≠+≤

iSS iSSDSSD *,*,

∑ ≠++−=

SS ii

SSDSSDSSDk )*,(*),(*),()2(

*)(*),()2( SESSDk +−=Pick S ∈Γ closest to S* (not constructively)

∑ Γ∈=

iS iSSDSE )*,(*)( *),( SSDk ⋅≥

Pf: Pick S ∈Γ

Approximation

21)2()()(

* <+−

≤k

kSESE

CG © Ron Shamir

Page 39: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

40

Consensus string from MA • The consensus character in column i of MA is the char

that minimizes the sum of distances to it from all the chars in column i

• The consensus string of a MA is obtained by taking the consensus character in each position.

• Ex: match =0, mismatch/space=1 consensus=majority char • S*: AC-GC-GAG • x: AC-GCGG-C • y: AC-GC-GAG • z: GCCGA-GAG • u: AC-T-GGCA • v: -CAGT-GAG • w: AC-GC-GAG

Alignment error: S(M) = Σk σ(S’k, S*)

The opt consensus MA: one with least alignment error

Thm: opt consensus MA = Steiner string

(up to spaces)

CG © Ron Shamir

Page 40: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

41

Tree MA • Input: Tree T, a string for each leaf • Phylogenetic alignment for T: Assignment of

a string to each internal node

• Score – (weighted) sum of scores along edges • Goal: find phyl. alignment of optimal score • Consensus = phyl. Alignment where T is a star

CTGG

CCGG

GTTC CTTG

GTTG

GTTG

CTGG

CG © Ron Shamir

Page 41: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

42

Tree MA – complexity

• Also NP-hard • Poly time approximations:

– 2-approximation – Better approximation with more time

(PTAS)

CG © Ron Shamir

Page 42: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

43

Profile Representation of MA - A G G C T A T C A C C T G T A G – C T A C C A - - - G C A G – C T A C C A - - - G C A G – C T A T C A C – G G C A G – C T A T C G C – G G A 1 1 .8 C .6 1 .4 1 .6 .2 G 1 .2 .2 .4 1 T .2 1 .6 .2 - .2 .8 .4 .8 .4

• Alternatively, use log odds: • pi(a) = fraction of a’s in col i • p(a) = fraction of a’s overall • log pi(a)/p(a)

CG © Ron Shamir

Page 43: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

44

CG © Ron Shamir

Page 44: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

45

Aligning a sequence to a profile

• Key in pairwise alignment is scoring two positions x,y: σ(x,y)

• For a letter x and a column y in a profile, σ(x,y)=value of x in col. Y

• Invent a score for σ(x,-) • Run the DP alg for pairwise alignment

CG © Ron Shamir

Page 45: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

46

Aligning alignments

• Given two alignments, how can we align them?

• Hint: use DP on the corresponding profiles.

x GGGCACTGCAT y GGTTACGTC-- Alignment 1 z GGGAACTGCAG

w GGACGTACC-- Alignment 2 v GGACCT-----

x GGGCACTGCAT y GGTTACGTC-- z GGGAACTGCAG w GGACGTACC-- v GGACCT-----

CG © Ron Shamir

Page 46: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

47

Multiple Alignment: Greedy Heuristic

• Choose most similar pair of sequences and combine into a profile , thereby reducing alignment of k sequences to an alignment of of k-1 sequences/profiles. Repeat

u1= ACGTACGTACGT…

u2 = TTAATTAATTAA…

u3 = ACTACTACTACT…

uk = CCGGCCGGCCGG

u1= ACg/tTACg/tTACg/cT…

u2 = TTAATTAATTAA…

uk = CCGGCCGGCCGG… k

k-1

CG © Ron Shamir

Page 47: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

48

Progressive Alignment • A variation of greedy algorithm with a

somewhat more intelligent strategy for choosing the order of alignments.

CG © Ron Shamir

Page 48: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

49

Progressive alignment

Align sequences (pairwise) in some (greedy) order

Decisions (1) Order of alignments (2) Alignment of sequence to group (only), or allow group to group (3) Method of alignment, and scoring function

CG © Ron Shamir

Page 49: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

50

Guide tree A

B

C

D

E

A

B

C

D

F

this ?

or this ?

E

CG © Ron Shamir

Page 50: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

64

ClustalW Thompson, Higgins, Gibson 94

• Popular multiple alignment tool today • ‘W’ = ‘weighted’ (different parts of

alignment are weighted differently). • Three-step process

1.) Construct pairwise alignments 2.) Build Guide Tree 3.) Progressive alignment guided by

the tree

CG © Ron Shamir

Page 51: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

65

Step 1: Pairwise Alignment • Aligns each sequence against each

other giving a similarity matrix • Similarity = exact matches / sequence

length (percent identity) v1 v2 v3 v4 v1 - v2 .17 - v3 .87 .28 - v4 .59 .33 .62 -

(.17 means 17 % identical)

CG © Ron Shamir

Page 52: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

66

Step 2: Guide Tree • Use the similarity method to create a

Guide Tree by applying some clustering method*

• Guide tree roughly reflects

evolutionary relations

• *ClustalW uses the neighbor-joining method (to be described later in the course)

CG © Ron Shamir

Page 53: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

67

Step 2: Guide Tree (cont’d)

v1 v3 v4 v2

Calculate: v1,3 = alignment (v1, v3) v1,3,4 = alignment((v1,3),v4) v1,2,3,4 = alignment((v1,3,4),v2)

v1 v2 v3 v4 v1 - v2 .17 - v3 .87 .28 - v4 .59 .33 .62 -

CG © Ron Shamir

Page 54: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

68

Step 3: Progressive Alignment • Start by aligning the two most similar

sequences • Using the guide tree, add in the most similar

pair (seq-seq, seq-prof or prof-prof) • Insert gaps as necessary • Many ad-hoc rules: weighting, different

matrices, special gap scores…. FOS_RAT PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD FOS_MOUSE PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD FOS_CHICK SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD FOSB_MOUSE PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ FOSB_HUMAN PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ . . : ** . :.. *:.* * . * **:

Dots and stars show how well-conserved a column is. CG © Ron Shamir

Page 55: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

MAFFT – MultiAlign via FFT Katoh et al, NAR 02

• Overview: – Homology identification using FFT – Alignment Scoring/Selection

• Faster computation due to

– O(NlogN) homology detection using FFT – Simpler-to-compute scoring function

69

Page 56: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

Signal and pairwise correlation • For amino acid sequences

– [volume v(a), polarity p(a)] per AA a – Each standardized (mean 0, std 1) of all AAs

• Correlation between seqs 1,2 at offset k:

• Direct computation for all k O(N2) for seqs of length ~N. With FFT O(NlogN)

• c(k) = cv(k) + cp(k)

70

Page 57: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

Pairwise alignment • High c(k) gives offset, not location

• Use sliding windows to find 20 highest peaks, merging if necessary.

• Result: ≤20 homologous segments

71

Page 58: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

Pairwise alignment (2) • Use regular DP to

find highest scoring chain of segments

• Extend to full alignment constrained to contain the chain (within the white

72

Page 59: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

Group-group alignment For column n in group1, weight the

sequences to get average values of v, p Apply the same procedure to get group1-

group2 alignment

73

Page 60: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

Overall alg • Compute pairwise distances (quickly) • Form guide tree T1 • Progressive alignment of sequences in

bottom up order of T1 • Infer a new guide tree T2 from the

multialignment • Realign each input seq along T2 • …. • x100 faster than prev algs, same

accuracy 74

Page 61: Multiple Sequence Alignment - Tel Aviv Universityrshamir/algmb/presentations/Multiple Alignment.pdf · serafim/CS262_2006/ •Geiger, ... Multiple sequence alignment of 7 neuroglobins

78

Multiple Alignment: History 1 975 Sankoff Formulated multiple alignment problem

and gave dynamic programming solution 1 988 Carrillo- Lipman Branch and Bound approach for MSA 1 990 Feng- Doolittle Progressive alignment 1 994 Thompson- Higgins- Gibson- ClustalW Most popular multiple alignment program 1 998 Morgenstern et al. - DIALIGN Segment-based multiple alignment 2000 Notredame- Higgins- Heringa- T- coffee Using the library of pairwise alignments 2002 MAFFT 2004 MUSCLE 2005 ProbCons

…… Still a lot to be done! CG © Ron Shamir