
Multiple Sequence Alignment

Some slides from:

• Jones, Pevzner, USC, Intro to Bioinformatics Algorithms http://www.bioalgorithms.info/

• S. Batzoglou, Stanford http://ai.stanford.edu/~serafim/CS262_2006/

• Geiger, Wexler, Technion http://www.cs.technion.ac.il/~cs236522/

• Ruzzo, Tompa, U. Washington CSE 590bi

• Poch, Strasbourg www.inra.fr/internet/Projets/agroBI/PHYLO/Poch.ppt

• A. Drummond, Auckland, NZ

• Tandy Warnow, Illinois tandy.cs.illinois.edu/Jordan-april14.pptx

CG © Ron Shamir

Multiple Alignment vs. Pairwise Alignment

• Up until now we have only tried to align two sequences.

• What about more than two? And what for?

• A faint similarity between two sequences becomes significant if it is present in many sequences

• Multiple alignments can reveal subtle similarities that pairwise alignments do not reveal

Multiple Alignment Definition

Input: Sequences S1, S2, …, Sk over the same alphabet

Output: Gapped sequences S’1, S’2, …, S’k of equal length, such that

1. |S’1| = |S’2| = … = |S’k|

2. Removal of spaces from S’i gives Si for all i
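The two conditions of the definition can be checked mechanically. The sketch below is only an illustration: the function name `is_valid_msa` and the candidate alignment are assumptions, not taken from the slides.

```python
def is_valid_msa(sequences, aligned, gap="-"):
    """Check the two conditions of the definition for gapped rows S'_1..S'_k."""
    if len(sequences) != len(aligned):
        return False
    # Condition 1: all gapped rows have the same length.
    if len({len(row) for row in aligned}) > 1:
        return False
    # Condition 2: deleting the spaces from S'_i gives back S_i.
    return all(row.replace(gap, "") == seq
               for seq, row in zip(sequences, aligned))

# The three sequences of the example slide, with a made-up candidate alignment.
seqs = ["AGGTC", "GTTCG", "TGAAC"]
candidate = ["AGGTC-", "-GTTCG", "TGAAC-"]
print(is_valid_msa(seqs, candidate))
```

Any set of rows that passes this check is a legal multiple alignment; the scoring functions discussed later decide which legal alignment is preferred.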

Example

S1=AGGTC

S2=GTTCG

S3=TGAAC

Possible alignments: [two alignment figures]

Example

Multiple sequence alignment of 7 neuroglobins using clustalx


Human-centric beta globin Multiple Alignment

http://globin.cse.psu.edu/

Protein Phylogenies – Example

Kinase domain


Motivation again

• Common structure, function or origin may be only weakly reflected in sequence – multiple comparisons may highlight weak signals

• Major uses:
– Identify and represent protein families
– Identify and represent conserved seq. elements (e.g. domains)
– Deduce evolutionary history
– PCR primer design
– Transcription factor binding sites

MSA: central role in biology

• Structure comparison, modelling
• Interaction networks
• Hierarchical function annotation: homologs, domains, motifs
• Phylogenetic studies
• Human genetics, SNPs
• Therapeutics, drug discovery
• Therapeutics, drug design
• Gene identification, validation
• RNA sequence, structure, function
• Comparative genomics

[Figure labels: DBD, insertion domain, binding sites / mutations]

Scoring alignments

• Given input seqs. S1, S2, …, Sk, find a multiple alignment of optimal score

• Scores preview:
– Sum of pairs
– Consensus
– Tree

• Varying methods - no single winner

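Sum of Pairs score

Def: Induced pairwise alignment: a pairwise alignment induced by the multiple alignment (keep the two rows and delete every column in which both rows have a space)

Example:

x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces:

x: ACGCGG-C     x: AC-GCGG-C     y: AC-GCGAG
y: ACGC-GAG     z: GCCGC-GAG     z: GCCGCGAG

Sum of pairs score: S(M) = Σk<l σ(S’k, S’l), where σ(S’k, S’l) is the score of the pairwise alignment induced on rows k and l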

SOP Score Example

Consider the following alignment:

AC-CDB-
-C-ADBD
A-BCDAD

Scoring scheme: match: 0, mismatch/indel: -1

SP score: (-3) + (-5) + (-4) = -12
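The SP score of the example can be recomputed with a few lines of Python. This is a minimal sketch under the slide's scoring scheme (match 0, mismatch/indel -1), assuming that a column with a space in both rows is dropped from the induced alignment; the function names are illustrative.

```python
from itertools import combinations

def pair_score(a, b, match=0, penalty=-1, gap="-"):
    """Score of the pairwise alignment induced on two rows."""
    total = 0
    for x, y in zip(a, b):
        if x == gap and y == gap:
            continue  # column vanishes in the induced alignment
        total += match if x == y else penalty
    return total

def sp_score(rows):
    """Sum-of-pairs score: sum of all induced pairwise scores."""
    return sum(pair_score(a, b) for a, b in combinations(rows, 2))

rows = ["AC-CDB-", "-C-ADBD", "A-BCDAD"]
for (i, a), (j, b) in combinations(enumerate(rows, start=1), 2):
    print(f"rows {i},{j}: {pair_score(a, b)}")
print("SP score:", sp_score(rows))
```

The three pairwise terms are -3, -4 and -5 (for row pairs 1-2, 1-3 and 2-3), which sum to the -12 shown above.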

Alignments = Paths

• Align 3 sequences: ATGC, AATC, ATGC

A  -  T  G  C

A  A  T  -  C

-  A  T  G  C


Alignment Paths

x coordinate:  0  1  1  2  3  4
                  A  -  T  G  C

y coordinate:  0  1  2  3  3  4
                  A  A  T  -  C

z coordinate:  0  0  1  2  3  4
                  -  A  T  G  C

• Resulting path in (x,y,z) space: (0,0,0)→(1,1,0)→(1,2,1)→(2,3,2)→(3,3,3)→(4,4,4)
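The correspondence between an alignment and a grid path can be sketched in code: each column advances the coordinate of every row that holds a letter rather than a space. The function name `alignment_to_path` is an assumption made for this illustration.

```python
def alignment_to_path(rows, gap="-"):
    """Return the list of grid points visited by an alignment, one per column."""
    position = [0] * len(rows)
    path = [tuple(position)]
    for column in zip(*rows):
        for axis, symbol in enumerate(column):
            if symbol != gap:
                position[axis] += 1  # this sequence consumes one letter
        path.append(tuple(position))
    return path

# Rows for ATGC (x), AATC (y), ATGC (z), as in the slide.
rows = ["A-TGC", "AAT-C", "-ATGC"]
print(alignment_to_path(rows))
```

Run on the slide's alignment, this reproduces the path from (0,0,0) to (4,4,4) listed above.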

Aligning Three Sequences

• Same strategy as aligning two sequences

• Use a 3-D “Manhattan Cube”, with each axis representing a sequence to align

• For global alignments, go from source to sink

[Figure: 3-D grid with the source corner marked]

2-D vs 3-D Alignment Grid

2-D edit graph

3-D edit graph


2-D cell versus 3-D Alignment Cell

In 2-D, 3 edges in each unit square

In 3-D, 7 edges in each unit cube

Architecture of 3-D Alignment Cell

[Figure: unit cube whose corner (i,j,k) is reached from the seven other corners (i-1,j-1,k-1), (i-1,j-1,k), (i-1,j,k-1), (i,j-1,k-1), (i-1,j,k), (i,j-1,k), (i,j,k-1)]

Multiple Alignment: Dynamic Programming

\[
s_{i,j,k} = \max
\begin{cases}
s_{i-1,j-1,k-1} + \delta(v_i, w_j, u_k) & \text{cube diagonal: no indels} \\
s_{i-1,j-1,k} + \delta(v_i, w_j, \_) & \text{face diagonal: one indel} \\
s_{i-1,j,k-1} + \delta(v_i, \_, u_k) & \\
s_{i,j-1,k-1} + \delta(\_, w_j, u_k) & \\
s_{i-1,j,k} + \delta(v_i, \_, \_) & \text{edge diagonal: two indels} \\
s_{i,j-1,k} + \delta(\_, w_j, \_) & \\
s_{i,j,k-1} + \delta(\_, \_, u_k) &
\end{cases}
\]

• δ(x, y, z) is an entry in the 3-D scoring matrix (e.g., using SOP score)
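The recurrence can be turned into a short program for three sequences. The sketch below assumes SP column scoring with match 0 and mismatch/indel -1 (a pair of two spaces contributes 0); it returns only the optimal score, not the alignment, and the names `column_score` and `align3_score` are illustrative.

```python
from itertools import product

GAP = "-"

def column_score(column, match=0, penalty=-1):
    """SP score of one column; a pair of two spaces contributes 0."""
    score = 0
    for a in range(len(column)):
        for b in range(a + 1, len(column)):
            x, y = column[a], column[b]
            if x == GAP and y == GAP:
                continue
            score += match if x == y else penalty
    return score

def align3_score(v, w, u):
    """Optimal SP score of a global alignment of three sequences."""
    n1, n2, n3 = len(v), len(w), len(u)
    neg_inf = float("-inf")
    s = [[[neg_inf] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    s[0][0][0] = 0
    # The 7 incoming edges of a cell: every non-zero 0/1 step vector.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    for i, j, k in product(range(n1 + 1), range(n2 + 1), range(n3 + 1)):
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            column = (v[i - 1] if di else GAP,
                      w[j - 1] if dj else GAP,
                      u[k - 1] if dk else GAP)
            s[i][j][k] = max(s[i][j][k], s[pi][pj][pk] + column_score(column))
    return s[n1][n2][n3]

print(align3_score("ATGC", "AATC", "ATGC"))
```

Each cell inspects 7 predecessors, which matches the "7 edges in each unit cube" count and gives the cubic running time discussed next.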

Running Time

• For 3 sequences of length n, the run time is O(n^3)

• For k sequences, build a k-dimensional cube, with run time O(2^k n^k)

• Want n, k in the 10’s to 100’s (sometimes much more)

• Impractical

Scoring gaps

• Our alg assumed a linear gap score (const. per space).

• For affine gap scores: need 2^k – 1 states, one per combination of gapped/ungapped sequences

• Running time: O(2^k × 2^k × n^k) = O(4^k n^k)

NP- Hardness of Multiple Alignment

• For fixed k, the problem is polynomial

• The general problem for SOP is NP-hard:
– Wang & Jiang 94: special type matrix
– Bonizzoni & Vedova 01: more general matrix, binary alphabet
– Elias 03: binary or larger alphabet, any matrix

• Similar results for other scoring functions

Forward Dynamic Programming

• An alternative approach to DP for pairwise (and multiple) alignment:

• D(v) – opt value of a path source→v

• p(w) – best-yet solution of a path source→w

• When D(v) is computed, send its value forward on all the arcs (v,w): p(w) = min{p(w), D(v) + cost(v,w)}

• Once p(w) has been updated by all incoming edges, that value is optimal; set it as D(w)

Forward Dynamic Programming (2)

• Maintain a queue of nodes whose D is not set yet

• For the node w at the head of the queue: set D(w) ← p(w) and remove it

• Update p for each out-neighbor x of w; if x is not in the queue, add it at the end*

• Same complexity as the regular (backwards) DP

• *breaking ties lexicographically!
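A minimal sketch of forward DP for the pairwise case, using unit costs (edit distance) as an assumed cost function. Instead of a plain queue with tie-breaking, this illustration keeps the pending nodes in a heap ordered lexicographically, which guarantees that a node is finalized only after all its predecessors; the function name is hypothetical.

```python
import heapq

def forward_dp_edit_distance(s, t):
    """Min-cost path from (0,0) to (len(s),len(t)) in the edit graph, pushing values forward."""
    n, m = len(s), len(t)
    p = {(0, 0): 0}   # p(w): best-yet cost of a path source -> w
    D = {}            # D(v): final optimal cost of a path source -> v
    heap = [(0, 0)]   # pending nodes, popped in lexicographic order
    while heap:
        v = heapq.heappop(heap)
        D[v] = p[v]   # every predecessor of v was already processed
        i, j = v
        for di, dj in ((0, 1), (1, 0), (1, 1)):
            w = (i + di, j + dj)
            if w[0] > n or w[1] > m:
                continue
            cost = (0 if s[i] == t[j] else 1) if (di and dj) else 1
            if w not in p:
                p[w] = float("inf")
                heapq.heappush(heap, w)
            p[w] = min(p[w], D[v] + cost)  # send D(v) forward along (v, w)
    return D[(n, m)]

print(forward_dp_edit_distance("kitten", "sitting"))
```

The result is the same as with the usual backward recurrence; the benefit of the forward form is that nodes that are never pushed are never visited, which the pruning idea of the next slide exploits.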

Faster DP Algorithm for MultiAlign (Carrillo-Lipman 88)

• Use forward DP.

• We’ll demonstrate on three sequences

• f12(i,j) = opt pairwise alignment score of suffixes S1(i,..n1), S2(j,..n2), etc.

• Key idea: if ∃ a known soln of cost z, and D(i,j,k) + f12(i,j) + f13(i,k) + f23(j,k) > z, do not place (i,j,k) in the queue
– The three f terms are a lower bound on the cost of completing any alignment from (i,j,k), so such a node cannot lie on a solution better than z

• Extra work: DP for all seq pairs - O(k^2 n^2)

• Guarantees opt soln – no improved time bound, but often saves a lot in practice.

http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/cl2.gif
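The pruning rule can be sketched for three sequences as follows. This is an illustration under stated assumptions, not the published implementation: costs are unit costs (0 for equal symbols, 1 otherwise, 0 for two spaces), indices are 0-based so that node (i,j,k) means "i, j, k letters consumed", the upper bound z comes from a trivial left-justified alignment (`padded_cost`), and the test is applied to each candidate value as it is pushed forward. All names are hypothetical.

```python
import heapq
from itertools import product

def suffix_costs(a, b):
    """f[i][j] = optimal unit-cost alignment cost of the suffixes a[i:], b[j:]."""
    n, m = len(a), len(b)
    f = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            best = float("inf")
            if i < n:
                best = min(best, f[i + 1][j] + 1)
            if j < m:
                best = min(best, f[i][j + 1] + 1)
            if i < n and j < m:
                best = min(best, f[i + 1][j + 1] + (a[i] != b[j]))
            f[i][j] = best
    return f

def column_cost(col):
    """SP cost of a column: 1 per unequal pair, 0 for a pair of two spaces."""
    return sum(x != y for a, x in enumerate(col) for y in col[a + 1:])

def padded_cost(seqs):
    """Cost z of a known (trivial) solution: pad every sequence with spaces on the right."""
    width = max(map(len, seqs))
    rows = [s.ljust(width, "-") for s in seqs]
    return sum(column_cost(col) for col in zip(*rows))

def carrillo_lipman(s1, s2, s3, z):
    """Forward DP on the 3-D grid; returns (optimal cost, number of nodes expanded)."""
    seqs = (s1, s2, s3)
    ends = tuple(len(s) for s in seqs)
    f12, f13, f23 = suffix_costs(s1, s2), suffix_costs(s1, s3), suffix_costs(s2, s3)
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    p = {(0, 0, 0): 0}
    heap = [(0, 0, 0)]
    expanded = 0
    while heap:
        v = heapq.heappop(heap)   # lexicographic order = topological order
        d = p[v]
        expanded += 1
        if v == ends:
            return d, expanded
        for move in moves:
            w = tuple(c + step for c, step in zip(v, move))
            if any(c > e for c, e in zip(w, ends)):
                continue
            col = tuple(s[c] if step else "-" for s, c, step in zip(seqs, v, move))
            cand = d + column_cost(col)
            i, j, k = w
            if cand + f12[i][j] + f13[i][k] + f23[j][k] > z:
                continue          # cannot beat the known solution: do not enqueue
            if w not in p:
                p[w] = cand
                heapq.heappush(heap, w)
            elif cand < p[w]:
                p[w] = cand
    return None                   # only if z was not a valid upper bound

seqs = ("ATGC", "AATC", "ATGC")
z = padded_cost(seqs)
print(carrillo_lipman(*seqs, z), carrillo_lipman(*seqs, 100))
```

With a tight z the optimum is unchanged but fewer nodes are expanded than with a loose bound such as 100, which is the practical saving the slide refers to.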

Approximation Algorithm – SOP

We use min cost instead of max score

Find alignment of minimal cost

Assumption: the cost function δ is a distance function

• δ(x,x) = 0

• δ(x,y) = δ(y,x) ≥ 0

• δ(x,y) + δ(y,z) ≥ δ(x,z) (triangle inequality) (e.g. cost of MM ≤ cost of two indels)

D(S,T) - cost of minimum global alignment between S and T


The Center Star algorithm

Input: Γ - set of k strings S1, …, Sk.

1. Find the string S* ∈ Γ (center) that minimizes
\[ \sum_{S \in \Gamma \setminus \{S^*\}} D(S^*, S) \]

2. Denote S1 = S* and the rest of the strings as S2, …, Sk

3. Iteratively add S2, …, Sk to the alignment as follows:

a. Suppose S1, …, Si-1 are already aligned as S’1, …, S’i-1

b. Optimally align Si to S’1 to produce S’i and S’’1 aligned

c. Adjust S’2, …, S’i-1 by adding spaces where spaces were added to S’’1

d. Replace S’1 by S’’1
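The steps above can be sketched as follows, assuming unit costs (match 0, mismatch/indel 1, and 0 for a space against a space). The helper `align_pair` and the function `center_star` are illustrative names; the helper also reports which columns are spaces newly added to the center so that step c can be applied to the earlier rows.

```python
def align_pair(a, b, gap="-"):
    """Unit-cost global alignment of a (possibly gapped) with b.
    Returns (cost, a', b', fresh) where fresh[c] is True for spaces newly added to a."""
    n, m = len(a), len(b)
    sub = lambda x, y: 0 if x == y else 1   # an old space against a new space costs 0
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + sub(a[i - 1], gap)
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(cost[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                             cost[i - 1][j] + sub(a[i - 1], gap),
                             cost[i][j - 1] + 1)
    ra, rb, fresh = [], [], []
    i, j = n, m
    while i > 0 or j > 0:                    # traceback from the sink
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            ra.append(a[i - 1]); rb.append(b[j - 1]); fresh.append(False); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + sub(a[i - 1], gap):
            ra.append(a[i - 1]); rb.append(gap); fresh.append(False); i -= 1
        else:
            ra.append(gap); rb.append(b[j - 1]); fresh.append(True); j -= 1
    return cost[n][m], "".join(reversed(ra)), "".join(reversed(rb)), fresh[::-1]

def center_star(strings, gap="-"):
    k = len(strings)
    # Step 1: the center minimizes the sum of pairwise distances to the others.
    dist = [[align_pair(s, t)[0] for t in strings] for s in strings]
    c = min(range(k), key=lambda i: sum(dist[i]))
    aligned = {c: strings[c]}
    for idx in (i for i in range(k) if i != c):
        # Step 3b: align the next string to the current (gapped) center.
        _, new_center, new_row, fresh = align_pair(aligned[c], strings[idx])
        # Step 3c: add spaces to earlier rows where spaces were added to the center.
        for prev in aligned:
            if prev != c:
                old = iter(aligned[prev])
                aligned[prev] = "".join(gap if is_new else next(old) for is_new in fresh)
        aligned[c] = new_center          # Step 3d
        aligned[idx] = new_row
    return [aligned[i] for i in range(k)]

strings = ["ATGC", "AATC", "ATGC", "TTGCA"]
for row in center_star(strings):
    print(row)
```

The output rows are returned in the input order, all of equal length, and removing the spaces from each row gives back the corresponding input string.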

Running time

• Choosing S1 – execute DP for all sequence-pairs - O(k^2 n^2)

• Adding Si to the alignment - execute DP for Si, S’1 - O(i·n^2). (In the ith stage the length of S’1 can be up to i·n)

• Total complexity:
\[ \sum_{i=1}^{k-1} O(i \cdot n^2) = O(k^2 n^2) \]

Approximation ratio

• M* - An optimal alignment

• M - The alignment produced by this algorithm

• d(i,j) - The distance M induces on the pair Si, Sj

• v(M) - twice the SP cost of M (each pair is counted in both orders):
\[ v(M) = 2 \sum_{i<j} d(i,j) = \sum_{i=1}^{k} \sum_{j=1, j \ne i}^{k} d(i,j) \]

• Recall D(S,T) – min cost of alignment between S and T

For all i: d(1,i) = D(S1,Si) (we perform an optimal alignment between S’1 and Si, and δ(-,-) = 0)

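Approximation ratio (2)

Triangle inequality + symmetry:
\[ v(M) = \sum_{i=1}^{k} \sum_{j=1, j \ne i}^{k} d(i,j) \le \sum_{i=1}^{k} \sum_{j=1, j \ne i}^{k} \bigl( d(i,1) + d(1,j) \bigr) = 2(k-1) \sum_{l=2}^{k} d(1,l) = 2(k-1) \sum_{l=2}^{k} D(S_1, S_l) \]

Definition of S1:
\[ \forall i: \quad \sum_{j=2}^{k} D(S_1, S_j) \le \sum_{j=1, j \ne i}^{k} D(S_i, S_j) \]

With d*(i,j) the distance M* induces on the pair Si, Sj:
\[ v(M^*) = \sum_{i=1}^{k} \sum_{j=1, j \ne i}^{k} d^*(i,j) \ge \sum_{i=1}^{k} \sum_{j=1, j \ne i}^{k} D(S_i, S_j) \ge \sum_{i=1}^{k} \sum_{j=2}^{k} D(S_1, S_j) = k \sum_{j=2}^{k} D(S_1, S_j) \]

Therefore:
\[ \frac{v(M)}{v(M^*)} \le \frac{2(k-1) \sum_{l=2}^{k} D(S_1, S_l)}{k \sum_{j=2}^{k} D(S_1, S_j)} = \frac{2(k-1)}{k} < 2 \]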

Theorem (Gusfield 93)

• We have proved: the center star algorithm is a polynomial algorithm that guarantees a solution at most twice the optimum.

• “a 2-approximation”

• “an approximation ratio of 2”

Steiner/consensus string

• Input: Γ - set of k strings S1, …, Sk.

• D(X,Y) – score of aligning X, Y.

• The consensus error of S relative to Γ: E(S) = Σi≤k D(S, Si)

• Goal: find S* that minimizes E(S)

• S* is called an optimal Steiner string for Γ

• New objective – k terms and not k^2

• No direct relation to multialign! (for now)

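Approximation

Thm: Assume D satisfies the triangle ineq. Then ∃ S ∈ Γ that guarantees an approximation ratio 2.

Pf: Pick S ∈ Γ closest to S* (not constructively). Then
\[ E(S) = \sum_{S_i \ne S} D(S, S_i) \le \sum_{S_i \ne S} \bigl( D(S, S^*) + D(S^*, S_i) \bigr) \]
\[ = (k-2) D(S, S^*) + D(S, S^*) + \sum_{S_i \ne S} D(S^*, S_i) = (k-2) D(S, S^*) + E(S^*) \]

Since S is the string of Γ closest to S*:
\[ E(S^*) = \sum_{S_i \in \Gamma} D(S^*, S_i) \ge k \cdot D(S, S^*) \]

Therefore:
\[ \frac{E(S)}{E(S^*)} \le \frac{k-2}{k} + 1 < 2 \]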

Consensus string from MA

• The consensus character in column i of MA is the char that minimizes the sum of distances to it from all the chars in column i

• The consensus string of a MA is obtained by taking the consensus character in each position.

• Ex: match = 0, mismatch/space = 1, so consensus = majority char

S*: AC-GC-GAG
x:  AC-GCGG-C
y:  AC-GC-GAG
z:  GCCGA-GAG
u:  AC-T-GGCA
v:  -CAGT-GAG
w:  AC-GC-GAG

• Alignment error: S(M) = Σk σ(S’k, S*)

• The opt consensus MA: one with least alignment error

Thm: opt consensus MA = Steiner string (up to spaces)
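For the unit-cost scheme of the example, the consensus string and the alignment error can be computed directly. This is a small sketch; the function names are assumptions, and ties between equally frequent characters (none occur here) would be broken by first appearance in the column.

```python
from collections import Counter

def consensus(rows):
    """Majority character per column (optimal for match=0, mismatch/space=1)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def alignment_error(rows, cons):
    """Sum over all rows of the number of columns that differ from the consensus."""
    return sum(x != c for row in rows for x, c in zip(row, cons))

rows = ["AC-GCGG-C",   # x
        "AC-GC-GAG",   # y
        "GCCGA-GAG",   # z
        "AC-T-GGCA",   # u
        "-CAGT-GAG",   # v
        "AC-GC-GAG"]   # w
s_star = consensus(rows)
print(s_star, alignment_error(rows, s_star))
```

The computed consensus is the S* row of the slide, and the alignment error of this MA is 14 (3 + 0 + 3 + 5 + 3 + 0 over rows x to w).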

Tree MA

• Input: Tree T, a string for each leaf

• Phylogenetic alignment for T: assignment of a string to each internal node

• Score – (weighted) sum of scores along edges

• Goal: find phyl. alignment of optimal score

• Consensus = phyl. alignment where T is a star

[Figure: tree with string labels, e.g. GTTC, CTTG]

Tree MA – complexity

• Also NP-hard

• Poly time approximations:
– 2-approximation
– Better approximation with more time (PTAS)

Profile Representation of MA

    -   A   G   G   C   T   A   T   C   A   C   C   T   G
    T   A   G   -   C   T   A   C   C   A   -   -   -   G
    C   A   G   -   C   T   A   C   C   A   -   -   -   G
    C   A   G   -   C   T   A   T   C   A   C   -   G   G
    C   A   G   -   C   T   A   T   C   G   C   -   G   G

A       1                   1          .8
C  .6               1          .4   1      .6  .2
G           1  .2                      .2          .4   1
T  .2                   1      .6                  .2
-  .2          .8                          .4  .8  .4

• Alternatively, use log odds:
– pi(a) = fraction of a’s in col i
– p(a) = fraction of a’s overall
– score: log( pi(a) / p(a) )
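The frequency profile above can be recomputed from the five aligned rows. A minimal sketch, assuming the plain hyphen as the space symbol and a dictionary layout (character to list of column frequencies) chosen for this illustration:

```python
def profile(rows, alphabet="ACGT-"):
    """Fraction of each character in every column of the alignment."""
    k = len(rows)
    return {a: [col.count(a) / k for col in zip(*rows)] for a in alphabet}

rows = ["-AGGCTATCACCTG",
        "TAG-CTACCA---G",
        "CAG-CTACCA---G",
        "CAG-CTATCAC-GG",
        "CAG-CTATCGC-GG"]
prof = profile(rows)
for a, freqs in prof.items():
    print(a, " ".join(f"{x:.1f}" for x in freqs))
```

Zero entries correspond to the blank cells of the table; for log odds, each entry would be divided by the overall fraction p(a) before taking the logarithm.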

Aligning a sequence to a profile

• Key in pairwise alignment is scoring two positions x,y: σ(x,y)

• For a letter x and a column y in a profile, σ(x,y) = value of x in col. y

• Invent a score for σ(x,-)

• Run the DP alg for pairwise alignment
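A sketch of the sequence-to-profile DP under two labelled assumptions: a letter placed against a brand-new all-space column gets an invented constant `gap_score`, and skipping a profile column is scored by the frequency of the space symbol in that column. The profile is a dictionary from character to column frequencies, and `align_to_profile` is a hypothetical name; only the optimal score is returned.

```python
def align_to_profile(seq, prof, gap_score=-1.0):
    """Best global score of aligning seq to a profile (char -> list of column values)."""
    n = len(seq)
    m = len(next(iter(prof.values())))

    def sigma(x, col):
        # sigma(x, y) = value of x in column y; unseen letters score 0.
        return prof.get(x, [0.0] * m)[col]

    s = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + gap_score           # letter against no column
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + sigma("-", j - 1)   # space in seq against column j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j - 1] + sigma(seq[i - 1], j - 1),
                          s[i - 1][j] + gap_score,
                          s[i][j - 1] + sigma("-", j - 1))
    return s[n][m]

# Tiny profile built from the two-row alignment AC / AC.
prof = {"A": [1.0, 0.0], "C": [0.0, 1.0], "-": [0.0, 0.0]}
print(align_to_profile("AC", prof), align_to_profile("A", prof))
```

The recurrence is the ordinary pairwise one; only the scoring of a letter against a column changes, which is why the same DP also serves for aligning two profiles.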

Aligning alignments

• Given two alignments, how can we align them?

• Hint: use DP on the corresponding profiles.

Alignment 1:
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG

Alignment 2:
w GGACGTACC--
v GGACCT-----

Combined alignment:
x GGGCACTGCAT
y GGTTACGTC--
z GGGAACTGCAG
w GGACGTACC--
v GGACCT-----

Multiple Alignment: Greedy Heuristic

• Choose the most similar pair of sequences and combine them into a profile, thereby reducing the alignment of k sequences to an alignment of k-1 sequences/profiles. Repeat.

Before (k sequences):
u1 = ACGTACGTACGT…
u2 = TTAATTAATTAA…
u3 = ACTACTACTACT…
…
uk = CCGGCCGGCCGG

After combining the most similar pair (here u1 and u3) into a profile (k-1 sequences/profiles):
u1 = ACg/tTACg/tTACg/cT…
u2 = TTAATTAATTAA…
…
uk = CCGGCCGGCCGG…

Progressive Alignment

• A variation of the greedy algorithm with a somewhat more intelligent strategy for choosing the order of alignments.

• Align sequences (pairwise) in some (greedy) order

Decisions:
(1) Order of alignments
(2) Alignment of sequence to group (only), or allow group to group
(3) Method of alignment, and scoring function

[Figure: two alternative guide trees, labelled "this ?" and "or this ?"]

ClustalW Thompson, Higgins, Gibson 94

• Popular multiple alignment tool today

• ‘W’ = ‘weighted’ (different parts of alignment are weighted differently).

• Three-step process:
1.) Construct pairwise alignments
2.) Build Guide Tree
3.) Progressive alignment guided by the tree

Step 1: Pairwise Alignment

• Aligns each sequence against each other, giving a similarity matrix

• Similarity = exact matches / sequence length (percent identity)

       v1    v2    v3    v4
v1      -
v2    .17     -
v3    .87   .28     -
v4    .59   .33   .62     -

(.17 means 17% identical)

Step 2: Guide Tree

• Use the similarity matrix to create a Guide Tree by applying some clustering method*

• Guide tree roughly reflects evolutionary relations

• *ClustalW uses the neighbor-joining method (to be described later in the course)

Step 2: Guide Tree (cont’d)

[Figure: guide tree with leaf order v1, v3, v4, v2, next to the similarity matrix from Step 1]

Calculate:
v1,3 = alignment(v1, v3)
v1,3,4 = alignment((v1,3), v4)
v1,2,3,4 = alignment((v1,3,4), v2)
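To illustrate how a merge order follows from the similarity matrix, the sketch below uses greedy average-linkage clustering. This is a simplified stand-in chosen for brevity and is not the neighbor-joining method that ClustalW uses; the function name and the data layout (similarities keyed by unordered pairs) are assumptions.

```python
def guide_tree(names, sim):
    """Greedy average-linkage clustering on a similarity matrix; returns the merge order."""
    clusters = {name: (name,) for name in names}   # cluster label -> member leaves

    def avg(a, b):
        pairs = [(x, y) for x in clusters[a] for y in clusters[b]]
        return sum(sim[frozenset(p)] for p in pairs) / len(pairs)

    merges = []
    while len(clusters) > 1:
        labels = list(clusters)
        # Pick the pair of clusters with the highest average similarity.
        a, b = max(((x, y) for i, x in enumerate(labels) for y in labels[i + 1:]),
                   key=lambda pair: avg(*pair))
        members = clusters.pop(a) + clusters.pop(b)
        clusters[(a, b)] = members
        merges.append((a, b))
    return merges

# Similarity matrix from Step 1.
sim = {frozenset(("v1", "v2")): .17, frozenset(("v1", "v3")): .87,
       frozenset(("v2", "v3")): .28, frozenset(("v1", "v4")): .59,
       frozenset(("v2", "v4")): .33, frozenset(("v3", "v4")): .62}
for step in guide_tree(["v1", "v2", "v3", "v4"], sim):
    print(step)
```

On this matrix the merges come out as (v1, v3), then v4 with that pair, then v2, which is the same order as the "Calculate" steps above.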

Step 3: Progressive Alignment

• Start by aligning the two most similar sequences

• Using the guide tree, add in the most similar pair (seq-seq, seq-prof or prof-prof)

• Insert gaps as necessary

• Many ad-hoc rules: weighting, different matrices, special gap scores…

FOS_RAT     PEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKNISNMELKAEPFD
FOS_MOUSE   PEEMSVAS-LDLTGGLPEASTPESEEAFTLPLLNDPEPK-PSLEPVKSISNVELKAEPFD
FOS_CHICK   SEELAAATALDLG----APSPAAAEEAFALPLMTEAPPAVPPKEPSG--SGLELKAEPFD
FOSB_MOUSE  PGPGPLAEVRDLPG-----STSAKEDGFGWLLPPPPPPP-----------------LPFQ
FOSB_HUMAN  PGPGPLAEVRDLPG-----SAPAKEDGFSWLLPPPPPPP-----------------LPFQ

Conservation symbols (column spacing lost): . . : ** . :.. *:.* * . * **:

MAFFT – MultiAlign via FFT Katoh et al, NAR 02

• Overview:
– Homology identification using FFT
– Alignment Scoring/Selection

• Faster computation due to:
– O(N log N) homology detection using FFT
– Simpler-to-compute scoring function

Signal and pairwise correlation

• For amino acid sequences:
– [volume v(a), polarity p(a)] per AA a
– Each standardized (mean 0, std 1) over all AAs

• Correlation between seqs 1,2 at offset k:

• Direct computation for all k takes O(N^2) for seqs of length ~N. With FFT: O(N log N)

• c(k) = cv(k) + cp(k)
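The slide's formula for the correlation is not preserved in this text. The sketch below assumes the usual sliding dot product, c(k) = Σn s1(n)·s2(n+k), and computes it directly in O(N^2); an FFT-based version would return the same values in O(N log N). The numeric signals are made-up stand-ins for standardized volume or polarity values, and the function name is illustrative.

```python
def correlation(sig1, sig2):
    """c(k) = sum over n of sig1[n] * sig2[n + k], for every offset k with some overlap."""
    n1, n2 = len(sig1), len(sig2)
    c = {}
    for k in range(-(n1 - 1), n2):
        c[k] = sum(sig1[n] * sig2[n + k]
                   for n in range(n1) if 0 <= n + k < n2)
    return c

# The pattern 1,2,3 sits at positions 2..4 in the first signal and 0..2 in the second.
s1 = [0, 0, 1, 2, 3, 0]
s2 = [1, 2, 3, 0, 0, 0]
c = correlation(s1, s2)
best_offset = max(c, key=c.get)
print(best_offset, c[best_offset])
```

The peak at offset -2 says that the second signal must be shifted by two positions to line up with the first, but not where along the sequences the similar segment lies.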

Pairwise alignment • High c(k) gives offset, not location

• Use sliding windows to find 20 highest peaks, merging if necessary.

• Result: ≤20 homologous segments

Pairwise alignment (2)

• Use regular DP to find highest scoring chain of segments

• Extend to full alignment constrained to contain the chain (within the white

Group-group alignment

• For column n in group1, weight the sequences to get average values of v, p

• Apply the same procedure to get group1-group2 alignment

Overall alg

• Compute pairwise distances (quickly)

• Form guide tree T1

• Progressive alignment of sequences in bottom-up order of T1

• Infer a new guide tree T2 from the multialignment

• Realign each input seq along T2

• …

• x100 faster than prev algs, same accuracy

Multiple Alignment: History

1975 Sankoff: formulated multiple alignment problem and gave dynamic programming solution
1988 Carrillo-Lipman: branch and bound approach for MSA
1990 Feng-Doolittle: progressive alignment
1994 Thompson-Higgins-Gibson, ClustalW: most popular multiple alignment program
1998 Morgenstern et al., DIALIGN: segment-based multiple alignment
2000 Notredame-Higgins-Heringa, T-Coffee: using the library of pairwise alignments
2002 MAFFT
2004 MUSCLE
2005 ProbCons
