
Introduction to Computational Biology A. Shehu – CS444

Multiple Sequence Alignment (MSA)

bioalgorithms.info, cnx.org, and instructors across the country


Multiple Sequence Alignment: Learning Goals

▆ Pairwise Sequence Alignment

• Methods: Needleman-Wunsch (global), Smith-Waterman (local)

• Scoring: substitution matrices, gap models

• Pairwise Alignment in BLAST

▆ Multiple Sequence Alignment

• Methods: naïve method, guide tree, divide-and-conquer method

• Scoring: entropy, sum of pairs


Multiple Sequence Alignment

▆ Up to now, we have only considered aligning 2 sequences

▆ In general, the alignment of multiple sequences provides a more reliable assessment of similarity than a pairwise alignment

• Ambiguities in a pairwise comparison can often be resolved when further sequences are compared


Multiple Sequence Alignment

▆ Motivation:

• Find interesting patterns characteristic of specific protein families

• Build phylogenetic trees

• Detect homology between new sequences and existing families

• Aid secondary/tertiary structure prediction

• Illustrate sequence conservation throughout the aligned sequences


Naïve Method

▆ Generalize the notion of pairwise alignment

▆ The alignment of n sequences is represented by an n-row matrix

A T _ G C G _

A _ C G T _ A

A T C A C _ A

▆ What is the effect on the similarity score?


Naïve Method

▆ Same strategy as aligning two sequences

▆ Each axis represents a sequence to align

▆ Global alignments go from Source to Sink

[Figure: alignment graph with the Source and Sink nodes labeled]


Naïve Method

Q: Does this make sense?

A: No! k sequences of length n => O(2^k n^k) time
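To make the bound concrete, here is a rough sketch of where the two factors come from. It assumes the usual k-dimensional dynamic programming table: one cell per combination of prefix lengths, and each cell looking back along every non-empty subset of the k axes. The function name `naive_msa_cost` is made up for this illustration.

```python
def naive_msa_cost(k: int, n: int) -> int:
    """Rough count of predecessor look-ups for full k-dimensional DP."""
    cells = (n + 1) ** k        # one cell per combination of prefix lengths: ~n^k
    predecessors = 2 ** k - 1   # every non-empty subset of axes can step back: ~2^k
    return cells * predecessors


# The count explodes quickly as more sequences are added.
for k in (2, 3, 5, 10):
    print(k, naive_msa_cost(k, n=100))
```

For k = 2 this reduces to the familiar three predecessors per cell of pairwise alignment.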


Comment

▆ Basis for aligning pairs of sequences is still dynamic programming...

▆ But we must come up with other techniques to combine these together into a multiple alignment

▆ Optimal MSA is NP-complete


Divide-and-Conquer Method

▆ Sequences are divided into two parts by cutting them near their midpoints

▆ This process is repeated until the lengths of the sequences fall below a predefined threshold

▆ Align the resulting short segments

▆ Merge the aligned segments

The problem of aligning multiple sequences is divided into several smaller alignment tasks

Q: Is this optimal?


Hierarchical Method

1. Compare all sequences in a pairwise fashion => similarity matrix (similarity = percent identity)

2. Perform cluster analysis on the pairwise data to generate a hierarchy of sequences in order of their similarity (guide tree)

3. Build a multiple alignment based on the guide tree:

a. Align the most similar sequences

b. Following the guide tree, add in the next sequences, aligning to the existing alignment (insert gaps as needed)

c. Repeat until all sequences have been aligned
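The first two steps (similarity matrix, then guide tree) can be sketched as below. This is only an illustration under stated assumptions: the pairwise aligner is a plain Needleman-Wunsch with made-up scores (match 1, mismatch -1, linear gap -2), and the clustering is simple average linkage, which the slides do not specify. The names `global_align`, `percent_identity`, and `guide_tree` are hypothetical.

```python
from itertools import combinations


def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch with a linear gap penalty; returns both gapped strings."""
    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    score = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        score[i][0] = i * gap
    for j in range(1, len(b) + 1):
        score[0][j] = j * gap
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the sink (bottom-right) to the source (top-left).
    row_a, row_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("_")
            i -= 1
        else:
            row_a.append("_")
            row_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b))


def percent_identity(a, b):
    """Percent of alignment columns in which both rows carry the same residue."""
    x, y = global_align(a, b)
    same = sum(1 for p, q in zip(x, y) if p == q and p != "_")
    return 100.0 * same / len(x)


def guide_tree(seqs):
    """Average-linkage clustering on percent identity; returns nested tuples of names."""
    sim = {frozenset(pair): percent_identity(seqs[pair[0]], seqs[pair[1]])
           for pair in combinations(seqs, 2)}
    clusters = [(name, [name]) for name in seqs]   # (subtree, member names)

    def avg_sim(c1, c2):
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(sim[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two most similar clusters first.
        c1, c2 = max(combinations(clusters, 2), key=lambda p: avg_sim(*p))
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]


# Toy input: A and B are identical, so they are joined before C.
print(guide_tree({"A": "ATGCG", "B": "ATGCG", "C": "TTTTT"}))
```

The nested tuples give the order in which step 3 would add sequences to the growing alignment: innermost pairs are aligned first.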


Hierarchical Method

1. Pairwise alignment

2. Guide tree


Hierarchical Method

▆ Suggested by Feng and Doolittle

▆ Different implementations differ in:

• Order of alignments

• Whether the progression involves only alignment of sequences to a single growing alignment

• Actual procedure used to do and score the alignments


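Scoring MSA

▆ We compute the score of each induced pairwise alignment (the two rows that a pair of sequences occupies in the MSA)

▆ We sum over the pairs to obtain the Sum-of-Pairs (SP) score

▆ The next algorithm (Center-Star Alignment) uses SP scores to build an MSA; it is a heuristic and does not guarantee an optimal alignment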
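As a small sketch of the SP score, the code below scores the three-row example alignment from the Naïve Method slide. The scoring values (match 1, mismatch -1, residue against gap -2, gap against gap 0) are assumptions chosen for illustration; a real scorer would use a substitution matrix and a gap model. The names `pair_score` and `sp_score` are hypothetical.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score one column of an induced pairwise alignment."""
    if x == "_" and y == "_":
        return 0                  # gap against gap: ignored (assumption)
    if x == "_" or y == "_":
        return gap
    return match if x == y else mismatch


def sp_score(msa, **scoring):
    """Sum the induced pairwise alignment scores over every pair of rows."""
    total = 0
    for row1, row2 in combinations(msa, 2):
        total += sum(pair_score(x, y, **scoring) for x, y in zip(row1, row2))
    return total


msa = ["AT_GCG_",
       "A_CGT_A",
       "ATCACJA".replace("J", "_")]   # third row: A T C A C _ A
print(sp_score(msa))
```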


Center-Star Alignment

▆ Choose as center the sequence with the best SP score against all the others (the highest, when S is a similarity score)

▆ If A: SP(A) = S(A,B) + S(A,C) + …

▆ If B: SP(B) = S(B,A) + S(B,C) + …

▆ Compare SP(A) to SP(B) to SP(C)

• In this example, SP(A) is the highest; A is chosen as center

▆ The different pairwise alignments with A impose different gaps on A. How do we make this uniform to obtain an MSA?

• Consensus sequence needed
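Choosing the center can be sketched in a few lines. The pairwise scores below are invented for illustration and are treated as similarities, so the largest total wins; with a distance measure the `max` would become `min`. The names `choose_center` and `S` are hypothetical.

```python
def choose_center(names, score):
    """Return the sequence with the highest summed pairwise score, plus all totals."""
    totals = {a: sum(score(a, b) for b in names if b != a) for a in names}
    return max(totals, key=totals.get), totals


# Made-up pairwise alignment scores S(x, y); symmetric by construction.
pair_scores = {("A", "B"): 5, ("A", "C"): 4, ("B", "C"): 1}


def S(x, y):
    return pair_scores[(x, y)] if (x, y) in pair_scores else pair_scores[(y, x)]


center, totals = choose_center(["A", "B", "C"], S)
print(center, totals)
```

Only the choice of center is shown here; reconciling the different gaps that each pairwise alignment imposes on the center is a separate step.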


Consensus Sequence

▆ The consensus string S_M derived from multiple alignment M is the concatenation of the consensus characters for each column of M

▆ The consensus character for column i is the character with the optimal summed distance from all characters in column i

▆ Distance is measured using the substitution matrix
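A minimal sketch of the column-by-column definition follows, again using the example alignment from the Naïve Method slide. It assumes a toy similarity score in place of a substitution matrix (match 1, mismatch -1, residue against gap -2), so "optimal" here means the largest summed score, and it assumes the consensus character is always drawn from the residue alphabet, never a gap. The names `simple_score` and `consensus` are hypothetical.

```python
def simple_score(c, x, match=1, mismatch=-1, gap=-2):
    """Toy stand-in for a substitution matrix lookup."""
    if x == "_":
        return gap
    return match if c == x else mismatch


def consensus(msa, score, alphabet="ACGT"):
    """For each column, pick the character with the best summed score against it."""
    out = []
    for column in zip(*msa):
        best = max(alphabet, key=lambda c: sum(score(c, x) for x in column))
        out.append(best)
    return "".join(out)


msa = ["AT_GCG_",
       "A_CGT_A",
       "ATCAC_A"]
print(consensus(msa, simple_score))
```

Ties are broken by alphabet order because `max` returns the first best candidate; a real implementation would need an explicit tie-breaking rule.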


ClustalW

▆ Basic outline of the algorithm:

• Calculate the C(k,2) [k choose 2] pairwise alignment scores

• Use a neighbor-joining algorithm to build a tree based on the distances

http://www.ebi.ac.uk/Tools/clustalw2/index.html