multiple sequence alignment. multiple alignment versus pairwise alignment up until now we have only...

22
Multiple Sequence Alignment

Upload: lorraine-hall

Post on 15-Jan-2016

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Multiple Sequence Alignment. Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences. What about more than

Multiple Sequence Alignment

Page 2: Multiple Sequence Alignment. Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences. What about more than

Multiple Alignment versus Pairwise Alignment

• Up until now we have only tried to align two sequences.

• What about more than two? And what for?

• A faint similarity between two sequences becomes significant if present in many

• Multiple alignments can reveal subtle similarities that pairwise alignments do not reveal

Page 3: Multiple Sequence Alignment. Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences. What about more than

Generalizing the Notion of Pairwise Alignment

• Alignment of 2 sequences is represented as a 2-row matrix• In a similar way, we represent alignment of 3 sequences

as a 3-row matrix

A T - G C G - A _ C G T - A A T C A C - A

• Score: more conserved columns, better alignment

Page 4: Multiple Sequence Alignment. Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences. What about more than

Aligning Three Sequences• Same strategy as

aligning two sequences

• Use a 3-D “Manhattan Cube”, with each axis representing a sequence to align

• For global alignments, go from source to sink

source

sink

Page 5: Multiple Sequence Alignment. Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences. What about more than

2-D cell versus 2-D Alignment Cell

In 3-D, 7 edges in each unit cube

In 2-D, 3 edges in each unit square

Page 6: Multiple Sequence Alignment. Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences. What about more than

Architecture of 3-D Alignment Cell(i-1,j-1,k-1)

(i,j-1,k-1)

(i,j-1,k)

(i-1,j-1,k) (i-1,j,k)

(i,j,k)

(i-1,j,k-1)

(i,j,k-1)

Page 7: Multiple Sequence Alignment. Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences. What about more than

Multiple Alignment: Dynamic Programming

• si,j,k = max

(x, y, z) is an entry in the 3-D scoring matrix

si-1,j-1,k-1 + (vi, wj, uk)si-1,j-1,k + (vi, wj, _ )si-1,j,k-1 + (vi, _, uk)si,j-1,k-1 + (_, wj, uk)si-1,j,k + (vi, _ , _)si,j-1,k + (_, wj, _)si,j,k-1 + (_, _, uk)

cube diagonal: no indels

face diagonal: one indel

edge diagonal: two indels

Page 8: Multiple Sequence Alignment. Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences. What about more than

Multiple Alignment: Running Time

• For 3 sequences of length n, the run time is 7n3; O(n3)

• For k sequences, build a k-dimensional Manhattan, with run time (2k-1)(nk); O(2knk)

• Conclusion: dynamic programming approach for alignment between two sequences is easily extended to k sequences but it is impractical due to exponential running time

Page 9: Multiple Sequence Alignment. Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences. What about more than

Practically speaking

• Multiple alignment is a hard problem• Yet, it is of extreme practical importance

– Comparing several species is a very common task– Doing this for the entire genome is an increasingly

common demand!

• Several heuristic-based algorithms have been developed that employ greedy, divide-and-conquer, dynamic programming approaches in various combinations

• The algorithms we will see today are actually used by current multiple aligners

Page 10: Multiple Sequence Alignment. Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences. What about more than

Multiple Alignment Induces Pairwise Alignments

Every multiple alignment induces pairwise alignments

x: AC-GCGG-C y: AC-GC-GAG z: GCCGC-GAG

Induces:

x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAGy: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAG

Page 11: Multiple Sequence Alignment. Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences. What about more than

Reverse Problem: Constructing Multiple Alignment from Pairwise Alignments

Given 3 arbitrary pairwise alignments:

x: ACGCTGG-C; x: AC-GCTGG-C; y: AC-GC-GAGy: ACGC--GAC; z: GCCGCA-GAG; z: GCCGCAGAG

can we construct a multiple alignment that inducesthem?

Page 12: Multiple Sequence Alignment. Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences. What about more than

Reverse Problem: Constructing Multiple Alignment from Pairwise Alignments

Given 3 arbitrary pairwise alignments:

x: ACGCTGG-C; x: AC-GCTGG-C; y: AC-GC-GAGy: ACGC--GAC; z: GCCGCA-GAG; z: GCCGCAGAG

can we construct a multiple alignment that inducesthem? NOT ALWAYS

Pairwise alignments may be inconsistent

Page 13: Multiple Sequence Alignment. Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences. What about more than

Multiple Alignment: Greedy Approach

• Choose most similar pair of strings and combine into a profile , thereby reducing alignment of k sequences to an alignment of of k-1 sequences/profiles. Repeat

• This is a heuristic greedy methodu1= ACGTACGTACGT…

u2 = TTAATTAATTAA…

u3 = ACTACTACTACT…

uk = CCGGCCGGCCGG

u1= ACg/tTACg/tTACg/cT…

u2 = TTAATTAATTAA…

uk = CCGGCCGGCCGG…

k

k-1

Page 14: Multiple Sequence Alignment. Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences. What about more than

Progressive Alignment

• Progressive alignment is a variation of greedy algorithm with a somewhat more intelligent strategy for choosing the order of alignments.

Page 15: Multiple Sequence Alignment. Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences. What about more than

Clustal

• Popular multiple alignment tool today

• Uses “progressive alignment”

• Three-step process

1.) Construct pairwise alignments

2.) Build Guide Tree

3.) Progressive Alignment guided by the tree

Page 16: Multiple Sequence Alignment. Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences. What about more than

Step 1: Pairwise Alignment

• Aligns each sequence again each other giving a similarity matrix

• Similarity = exact matches / sequence length (percent identity)

v1 v2 v3 v4

v1 -v2 .17 -v3 .87 .28 -v4 .59 .33 .62 -

Page 17: Multiple Sequence Alignment. Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences. What about more than

Step 2: Guide Tree

• Create Guide Tree using the similarity matrix

• ClustalW uses the “neighbor-joining method”

• Guide tree roughly reflects evolutionary relations

Page 18: Multiple Sequence Alignment. Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences. What about more than

Step 2: Guide Tree (cont’d)

v1

v3

v4 v2

Calculate:v1,3 = alignment (v1, v3)v1,3,4 = alignment((v1,3),v4)v1,2,3,4 = alignment((v1,3,4),v2)

v1 v2 v3 v4

v1 -v2 .17 -v3 .87 .28 -v4 .59 .33 .62 -

Page 19: Multiple Sequence Alignment. Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences. What about more than

Step 3: Progressive Alignment

• Start by aligning the two most similar sequences

• Following the guide tree, add in the next sequences, aligning to the existing alignment

• An alignment is stored as a “consensus sequence”, to be aligned with other sequences or alignments later

• Consensus sequence: Residue a if 75% of aligned sequences have an a at that position.Otherwise “X”.

Page 20: Multiple Sequence Alignment. Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences. What about more than

Evaluation

• Evaluating alignment programs is very difficult

• What is a benchmark here ?

• We haven’t witnessed the process of evolution, so we cannot say for certain what the true alignment of “extant” sequences should be

• One approach: “simulate” evolution

Page 21: Multiple Sequence Alignment. Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences. What about more than

Simulating evolution

• Generate a random sequence and introduce realistic evolutionary changes to it, along branches of an assumed phylogeny

• Substitutions, insertions, deletions, insertion & deletion rates, duplications, introduction of repeat elements, etc.

Page 22: Multiple Sequence Alignment. Multiple Alignment versus Pairwise Alignment Up until now we have only tried to align two sequences. What about more than

Evaluating alignment

• Once simulation done, take all the sequences at the leaf nodes of the phylogeny (started with root)

• Align these sequences using software

• Compare computed alignment and known (“true”) alignment– sensitivity and specificity