multiple sequence alignment. sequence families most sequences are members of large families, some...

Multiple Sequence Alignment

Sequence Families• Most sequences are members of large families, some with the same

function and others with different functions.– Members of different families are thought to not have a common ancestor.

• The basic idea of multiple alignment is to line up a group of related proteins so that the amino acids in each column:

– Have the same position in the 3-dimensional structure– Are derived from the same amino acid in their common ancestor– Have the same function in the protein

• These three goals are not completely compatible: structures change over evolutionary time, mutations compensate for each other, etc.

– An interesting hypothesis: there are only a few thousand different types of protein fold in existence, and all proteins are made up of one or more of these folds.

• X-ray crystallography seems to be confirming this idea, but it is far from being widely accepted.

• Another point of view: a multiple alignment is an attempt to reproduce the sequence of evolutionary events that occurred in moving from the common ancestor to all of the present-day sequences being aligned.

Multiple Alignments• Most multiple alignments are global, not local.

– This means that it is necessary for the proteins being aligned to all have the same basic domain structure.

• Aligning 2 sequences is easy and fairly precise (given a substitution matrix and set of gap penalties).

– Most multiple alignment programs start with pairwise alignments.

• But, multiple alignments are still an active area of research, and the best alignments are still refined by hand.

– There are many multiple alignment programs available, and there are always new ones coming out.

• Not to mention algorithms that aren’t fully developed into new user-friendly web-based programs.– Lots of heuristics and general ad hoc decisions that improve results without any underlying

theory.

• Closely related sequences can be aligned more unambiguously that distant sequences.

– Thus, most multiple alignment schemes start by joining the most similar sequences– This is a case of the general rule “do the easy ones first”

Scoring a Multiple Alignment• How do you determine which alignment is best?• “Looks pretty good to me” (i.e. hand refinement) may work when experts in that specific protein

are involved, but in general it is an invitation to introducing prejudices– There are several sets of test case data, with alignments refined by hand using 3-D structures. For

example, BaliBase

• “star tree” scoring: score all other sequences relative to a single common (canonical) one. – This is effectively what happens when one sequence is used as a BLAST or PSI-BLAST query and all

other sequences are aligned to it.– Not good idea: the original query is not the center of evolutionary events

• We are trying to re-create evolutionary events, so the ideal method would be a phylogenetic scoring method, that would take evolutionary descent into account.

– Detect and count each evolutionary event that distinguishes between sequences.– Phylogeny is often inferred during the course of a multiple alignment, but it is not independent from the

alignment itself, so it can’t legitimately be used for scoring.– The best phylogenetic methods are quite slow.

More on Scoring

• Most common scoring = “sum of pairs”. Once the multiple alignment is made, use a substitution matrix to score all possible pairs of sequences, then add up the total.– Sum of pairs ignores the principal that closely related sequences

are easier to align properly than distant ones.– Sum of pairs also overcounts more distant evolutionary events.

For example, the first divergence from the common ancestor is counted as if it occurred multiple independent times, not just once.

– Some alignment methods use weighting schemes to overcome this objection.

• A more recent method (slower, unfortunately), is “consistency-based scoring”. – The basic concept is that the final multiple alignment should

match as many pairwise alignments as possible.

Progressive Alignment• The first thought: why not just extend Smith-Waterman to multiple

sequences? i.e. simultaneous alignment.– Huge multi-dimensional matrices and exponential scaling ( O(2N), where

N = number of sequences ) make it impractical except for small numbers of sequences.

• Progressive alignment: start by aligning the most similar pair, then add other sequences one by one. – Alignments are most likely to be correct if the sequences are closely

related.

• The big problem is: the order in which the sequences are added affects the final alignment significantly.– Once 2 sequences are aligned, that alignment can’t be changed, so

early mistakes in alignment are propagated.• This property is called “greediness”

– Modern alignment schemes try to re-test and re-align later additions. This is called “iterative alignment”, meaning that the multiple alignment goes through several cycles (iterations) of retesting and re-aligning.

ClustalW• ClustalW is the most popular multiple alignment program used today.

– Despite the fact that no improvements have been made since 1994– Despite the fact that it suffers horribly from the order-of-addition problem: it is a

progressive method, not iterative.• On the web: http://www.ebi.ac.uk/Tools/clustalw2/ • Basic algorithm:

– Do all possible pairwise alignments and score them.– Using cluster analysis, create a relationship tree (a guide tree) based on the

similarities between pairs.– Combine alignments by introducing gaps

• Some heuristic attempts to concentrate gaps into the same alignment columns

http://www.ebi.ac.uk/Tools/clustalw2/

Cluster Analysis

• How to create the guide tree– Based on a matrix of distances between

aligned pairs of sequences– Several ways of converting this to a tree. We

will look at a simple method called UPGMA– There are better, more accurate ways of

creating a phylogenetic tree, but UPGMA is fast and so fits the needs of a multiple alignment program

Distance Matrix• To create a UPGMA tree, we

start with a distance matrix.• Start by scoring all pairs of

aligned sequences with BLOSUM62 (for example).

• Convert the scores to similarities, on a 0-1 scale.

• Distances are just 1 – similarity. (also a 0-1 scale)

• The main diagonal is the distance between a sequence and itself, which is always 0.

• The matrix is symmetrical: the distance between sequence A and sequence B is the same as between B and A.

A B C D E

A 0 0.10 0.20 0.34 0.38

B 0.10 0 0.24 0.36 0.40

C 0.20 0.24 0 0.32 0.34

D 0.34 0.36 0.32 0 0.20

E 0.38 0.40 0.34 0.20 0

UPGMA

• = Unweighted Pair Group Method with Arithmetic Mean.

– UPGMA is a simple and intuitive clustering method

– It produces a rooted tree (dendrogram)• Algorithm:

– Start by finding the closest pair of sequences: A and B are 0.10 apart.

– Join them. The branch lengths are the distances.

– Combine their columns by averaging distances to all other sequences

– Repeat until all sequences have been joined into a single tree.

• To start: A and B are the closest: 0.10 apart.

A+B C D E

A+B 0 0.22 0.35 0.39

C 0.22 0 0.32 0.34

D 0.35 0.32 0 0.20

E 0.39 0.34 0.20 0

A B C D E

A 0 0.10 0.20 0.34 0.38

B 0.10 0 0.24 0.36 0.40

C 0.20 0.24 0 0.32 0.34

D 0.34 0.36 0.32 0 0.20

E 0.38 0.40 0.34 0.20 0

More UPGMA

• Next, D and E are the closest on the revised distance matrix: 0.20.

• The branch length is proportional to the distance: 0.1 for A-B and 0.2 for D-E

A+B C D E

A+B 0 0.22 0.35 0.39

C 0.22 0 0.32 0.34

D 0.35 0.32 0 0.20

E 0.39 0.34 0.20 0

A+B C D+E

A+B 0 0.22 0.37

C 0.22 0 0.33

D+E 0.37 0.33 0

More UPGMA

• Next C is closest to A+B (0.22)• Note that to get the distances

from A+B+C to D+E, we are using twice as much contribution from A+B as we are from C, because A+B represents 2 sequences.

A+B C D+E

A+B 0 0.22 0.37

C 0.22 0 0.33

D+E 0.37 0.33 0

A+B+C D+E

A+B+C 0 0.36

D+E 0.36 0

End of UPGMA

• The last join in A+B+C with D+E.

• When Clustal-W uses this guide tree for alignments, it joins A and B, then separately joins D and E, then adds C to A+B, and finally joins the two groups A+B+C and D+E.

MUSCLE• “MUSCLE” is probably an acronym for something, but I am not a fan of

cutesy acronyms.• It gets high marks in recent surveys of multiple alignment methods for being

fast, accurate, and capable of dealing with a wide range of sequences. • MUSCLE as a web-based service at EBI (European Bioinformatics

Institute): http://www.ebi.ac.uk/Tools/muscle/index.html• MUSCLE is an iterative method:

– it starts out doing a progressive alignment of sequences using a guide tree, starting with the most similar.

– Then, it attempts to test and re-align various sequences, trying to avoid freezing in early bad decisions.

• This is an instance of a more general statistical technique called “jacknifing”. Jacknifing involves repeating an analysis many times on a subset of the data: leaving out random halves of the data, or leaving out single random observations, etc. It is used (along with bootstrapping) to estimate the robustness of one’s conclusions.

• It also illustrates the general principle of “annealing”: slowly converging on a solution by allowing changes but gradually raising the score needed to accept a change.

http://www.ebi.ac.uk/Tools/muscle/index.html

http://www.ebi.ac.uk/Tools/muscle/index.html

MUSCLE Algorithm• After some preliminary work, MUSCLE

estimates a tree using UPGMA and the Kimura method of estimating the evolutionary distance between sequences (which we will deal with later in the semester).

• The iterative procedure:– Cut the tree into 2 subtrees at random

locations – re-align each subtree– Align the subtrees together– Score the new alignment with sum-of-pairs

• If the new score is better than the old one, discard the old tree and repeat the cutting and realignment procedure with the new tree.

• If the new score is worse than the old one, discard it and repeat the procedure with the old tree.

– Repeat until no further changes appear (convergence) or a user-set limit is reached.

T-COFFEE• T-Coffee is also a cute acronym for something• For our purposes, it illustrates a scoring method more refined than sum-of-

pairs.– It is probably more accurate than other methods, but considerably slower.

• Web: http://www.ebi.ac.uk/Tools/t-coffee/• Consistency-based scoring.• Algorithm:

1. Create a library of all pairwise alignments• You can generate them externally if you like. For instance, using alignments of 3-

dimensional structures.• You can use several alignments for the same pair of sequences, even if they are not

consistent.• The information is stored residue-by-residue: it is a list of entries like “amino acid 23 in

sequence X aligns with amino acid 27 in sequence Y”2. Weight alignments by the percentage of identical residues.

• If there is more than one alignment for a given pair of sequences, create a weighted average of them.

http://www.ebi.ac.uk/Tools/t-coffee/

Library Extension

3. The next step, library extension, involves examining all triplets of sequences.– Residues that pair up in all three have their weights increased

by the lesser of the two pair scores.– Residues that match differently in the two pairs being joined to

form a triplet do not have their weights increased.– Thus, residues that consistently match up end up with very high

weights. This is what is meant by consistency-based scoring.

4. After this, the weighted set of aligned residues is subjected to a progressive alignment equivalent to ClustalW.– Once a gap is created, it remains in all further alignments– Separate gap penalties aren’t needed: already incorporated into

the library extension process.

Library Extension Example

The scores are pairwise:the percentageof identical amino acids forall positions with an amino acid in both sequences

Example• Pairwise scores. Every aligned pair of amino acids gets an initial score equal to the

percentage of identity between the aligned sequences.– Sequences x and y aligned as a pair gives 88% identity.– Sequences x and v aligned as pairs gives 77% identity (because LAST and VERY are

aligned but not identical).– Sequences v and y are 100% identical, because VERY in sequence v is a gap in sequence

y.– Sequences x and z, and y and z, are 100% identical, although mostly gaps.

• Sequences x and y are then aligned through all other sequences: v and z in this case• When the x-v-y triplet is assembled, every aligned pair in x and y gets its weight

increased by the minimum of the two pair scores (77 here).– This includes GARFIELD THE FAT CAT, but not VERY or the S in “FAST” because they

aren’t aligned.• Similarly, when x, z, and y are aligned together, all matching pairs in x and y are

increased by 100 (the minimum of the x-z and y-z scores). – This only increases the score for THE FAT CAT

• Note that LAST FAT in sequence x matches FAST CAT in y, but it matches VERY FAST in v, and ---- FAT in z. This results in two possible matches for these residues, and lower scores for both

– These matches are thus less reliable than other parts of the x-y alignment.

Some Numbers

x T H E L A S T F A - T C A T

y T H E F A S T C A - T - - -

x-y score

88 88 88 88 88 88 88 88 88 0 88 0 0 0


V T H E V E R Y F A S T C A T

Y T H E - - - - F A S T C A T

New score

165 165 165 88 88 88 88 77 77 0 77 77 77 77


z T H E - - - - F A - T C A T

y T H E - - - - F A S T C A T

Final score

265 265 265 0 0 0 0 177 177 0 177 177 177 177

Further...

• There are many more multiple alignment algorithms.

multiple sequence alignment. sequence families most sequences are members of large families, some...

Documents