Fast algorithms for large scale genome alignment and comparison

Algorithms for Computational Molecular Biology 2007/05/28 Fast algorithms for large scale genome alignment and comparison Davide Eynard [email protected] Dipartimento di Elettronica e Informazione Politecnico di Milano

Upload: davide-eynard

Post on 11-Nov-2014

DESCRIPTION

A presentation of the three articles about MUMmer by A.L. Delcher et al., made for the PhD course "Algorithms for Computational Molecular Biology" at Politecnico di Milano, May 2007

TRANSCRIPT

Page 1: Fast algorithms for large scale genome alignment and comparison

Algorithms for Computational Molecular Biology

2007/05/28

Fast algorithms for large scale genome alignment and comparison

Davide Eynard [email protected]

Dipartimento di Elettronica e Informazione, Politecnico di Milano

Page 2: Fast algorithms for large scale genome alignment and comparison

ACMBp. 2 2007/05/28

The article(s)

A.L. Delcher, S. Kasif, R.D. Fleischmann, J. Peterson, O. White, S.L. Salzberg: “Alignment of whole genomes”, 1999

A.L. Delcher, A. Philippy, J. Carlton, S.L. Salzberg: “Fast algorithms for large-scale genome alignment and comparison”, 2002

S. Kurtz, A. Philippy, A.L. Delcher, M. Smoot, M. Shumway, C. Antonescu, S.L. Salzberg: “Versatile and open software for comparing large genomes”, 2004

Page 3: Fast algorithms for large scale genome alignment and comparison

The problem

When the genome sequences of two closely related organisms become available, one of the first questions researchers want to ask is how the two genomes align

Aligning (very) long sequences
• Single gene sequences may be as long as tens of thousands of nucleotides
• Whole genomes are usually millions of nucleotides or larger!

Page 4: Fast algorithms for large scale genome alignment and comparison

The challenge

Naïve
• O(n^2) space and time

Hashing
• faster, but still partly O(n^2)

Dynamic Programming
• O(n) space, takes more time

MUMmer
• Suffix trees: O(n) space and time
• LIS: O(k log k), where k is the number of MUMs

Page 5: Fast algorithms for large scale genome alignment and comparison

The algorithm

1) Perform a Maximal Unique Match (MUM) decomposition of the two genomes

2) Sort the matches found in the MUM decomposition, and extract the LIS (Longest Increasing Subsequence) of matches that occur in the same order in both genomes

3) Close the gaps in the alignment, performing local identification of large inserts, repeats, small mutated regions, tandem repeats and SNPs

4) Output the alignment

Page 6: Fast algorithms for large scale genome alignment and comparison

MUM: the suffix tree
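
To make the definition of a MUM concrete, here is a minimal sketch that finds maximal unique matches between two short strings. It is not the suffix-tree implementation from the articles: it swaps in a naively built suffix array over the concatenated sequences, which is far slower than O(n) but exposes the same three conditions (the match occurs once in each sequence and cannot be extended to the left or to the right). The function name `find_mums`, the `min_len` parameter and the separator characters are assumptions made for illustration.

```python
def find_mums(a: str, b: str, min_len: int = 1):
    """Return sorted (pos_a, pos_b, length) triples of maximal unique matches."""
    # '#' and '$' are assumed not to occur in either sequence.
    text = a + "#" + b + "$"
    n = len(text)
    # Naive suffix array: sort suffix start positions by the suffix text.
    sa = sorted(range(n), key=lambda i: text[i:])

    def lcp(i: int, j: int) -> int:
        """Length of the longest common prefix of the suffixes at i and j."""
        k = 0
        while i + k < n and j + k < n and text[i + k] == text[j + k]:
            k += 1
        return k

    # lcps[r] = common prefix length of the suffixes ranked r-1 and r,
    # padded with 0 at both ends so neighbours can be read without bounds checks.
    lcps = [0] + [lcp(sa[r - 1], sa[r]) for r in range(1, n)] + [0]

    mums = []
    for r in range(1, n):
        x, y = sorted((sa[r - 1], sa[r]))
        length = lcps[r]
        # Skip short matches and pairs that come from the same sequence.
        if length < min_len or not (x < len(a) < y):
            continue
        # Unique: no other suffix shares this prefix, so both neighbours
        # in the suffix array must have a strictly shorter common prefix.
        if lcps[r - 1] >= length or lcps[r + 1] >= length:
            continue
        # Left-maximal: the characters before the two occurrences differ.
        # Right-maximality holds by construction, since `length` is the full LCP.
        if x > 0 and text[x - 1] == text[y - 1]:
            continue
        mums.append((x, y - len(a) - 1, length))
    return sorted(mums)


if __name__ == "__main__":
    print(find_mums("TTACGTGG", "CCACGTAA", min_len=3))
```

A real suffix tree finds the same matches in linear time; the quadratic construction here only keeps the example short.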

Page 7: Fast algorithms for large scale genome alignment and comparison

Longest Increasing Subsequence
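
The O(k log k) bound quoted earlier for the LIS step can be illustrated with the classic binary-search formulation. The sketch below assumes the MUMs are already sorted by their position in the first genome and receives only their positions in the second genome. It is unweighted, so it ignores match lengths and overlaps, which a production variant would likely need to account for. The function name is hypothetical.

```python
from bisect import bisect_left


def longest_increasing_subsequence(values):
    """Return the indices of one longest strictly increasing subsequence."""
    tails = []       # tails[k]: smallest last value of an increasing run of length k+1
    tails_idx = []   # index in `values` of each entry of `tails`
    prev = []        # prev[i]: index of the element preceding values[i] in its run
    for i, v in enumerate(values):
        k = bisect_left(tails, v)            # binary search: O(log k) per element
        prev.append(tails_idx[k - 1] if k > 0 else -1)
        if k == len(tails):
            tails.append(v)
            tails_idx.append(i)
        else:
            tails[k] = v
            tails_idx[k] = i
    # Walk the predecessor links back from the end of the longest run.
    chain = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        chain.append(i)
        i = prev[i]
    return chain[::-1]


if __name__ == "__main__":
    positions_in_b = [1, 5, 2, 3, 9, 4]   # MUM positions in genome B, ordered by genome A
    keep = longest_increasing_subsequence(positions_in_b)
    print([positions_in_b[i] for i in keep])
```

The MUMs whose indices are returned occur in the same order in both genomes; the others are left for the gap-closing step.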

Page 8: Fast algorithms for large scale genome alignment and comparison

Closing the gaps

Page 9: Fast algorithms for large scale genome alignment and comparison

MUMmer v2.0

Relaxes the uniqueness constraint

Faster, takes less space

Algorithmic improvements
• memory
• streaming query
• new module to cluster matches

Able to align not only simple DNA sequences, but also human chromosomes

Able to align incomplete genomes and protein sequences

Page 10: Fast algorithms for large scale genome alignment and comparison

Time-space improvements

The amount of memory used in the suffix tree has been reduced
• from at most 37 bytes/bp to at most 20 bytes/bp

Speed has increased
• E. coli vs. V. cholerae: from 74 sec, 293 MB to 27 sec, 100 MB

Suffix tree is used to store only one sequence, while the second one (query) is streamed against the suffix tree
• once the suffix tree has been built, multiple queries can be streamed
• quick way to find the next match
• matches are maximal on the right hand side
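
The idea of indexing only the reference and streaming queries against it can be sketched as follows. This is not MUMmer's suffix tree: as a stand-in, the sketch builds a suffix automaton of the reference, which supports the same kind of left-to-right scan with suffix links. The function names and the `min_len` threshold are assumptions, and the sketch reports only query coordinates, not reference positions or uniqueness.

```python
def build_suffix_automaton(ref: str):
    """Index the reference once; the result can be reused for many queries."""
    nxt, link, length = [{}], [-1], [0]
    last = 0
    for ch in ref:
        cur = len(nxt)
        nxt.append({}); link.append(-1); length.append(length[last] + 1)
        p = last
        while p != -1 and ch not in nxt[p]:
            nxt[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split state q so that substring lengths stay consistent.
                clone = len(nxt)
                nxt.append(dict(nxt[q])); link.append(link[q]); length.append(length[p] + 1)
                while p != -1 and nxt[p].get(ch) == q:
                    nxt[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return nxt, link, length


def stream_query(index, query: str, min_len: int):
    """Yield (query_start, length) for matches that cannot be extended to the right."""
    nxt, link, length = index
    state, matched = 0, 0
    for end, ch in enumerate(query):
        # On a mismatch, follow suffix links to the longest suffix that can continue.
        while state != 0 and ch not in nxt[state]:
            state = link[state]
            matched = length[state]
        if ch in nxt[state]:
            state = nxt[state][ch]
            matched += 1
        else:
            matched = 0
        nxt_ch = query[end + 1] if end + 1 < len(query) else None
        # Report the match only when the next query character cannot extend it.
        if matched >= min_len and (nxt_ch is None or nxt_ch not in nxt[state]):
            yield (end - matched + 1, matched)


if __name__ == "__main__":
    index = build_suffix_automaton("GATTACA")      # built once
    for q in ("TTTACAGATT", "CATTAC"):             # several queries streamed
        print(q, list(stream_query(index, q, min_len=3)))
```

Because the index is built from the reference alone, its memory footprint does not grow with the number or size of the queries.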

Page 11: Fast algorithms for large scale genome alignment and comparison

Streaming queries

Page 12: Fast algorithms for large scale genome alignment and comparison

Clustering of matches

Old version computed a single longest alignment between the sequences

New version works as follows:
• first, the system outputs a series of separate, independent alignment regions
• clustering is performed by finding pairs of matches that are sufficiently close
• finally, a LIS computation is done within each component to yield the most consistent sequence of matches in the cluster
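
The clustering steps above can be sketched with a union-find over pairs of nearby matches, followed by an unweighted LIS inside each connected component. The closeness test used here (a bounded gap in the first sequence plus a bounded difference between diagonals) and the default thresholds `max_gap` and `max_diag_diff` are assumptions chosen for illustration, not values taken from the slides.

```python
from bisect import bisect_left


def consistent_chain(cluster):
    """Longest run of matches (sorted by A position) whose B positions increase."""
    tails, tails_idx, prev = [], [], []
    for i, (_, b, _) in enumerate(cluster):
        k = bisect_left(tails, b)
        prev.append(tails_idx[k - 1] if k > 0 else -1)
        if k == len(tails):
            tails.append(b)
            tails_idx.append(i)
        else:
            tails[k], tails_idx[k] = b, i
    chain = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        chain.append(cluster[i])
        i = prev[i]
    return chain[::-1]


def cluster_matches(matches, max_gap=50, max_diag_diff=5):
    """Group (pos_a, pos_b, length) matches into clusters and chain each cluster."""
    matches = sorted(matches)
    parent = list(range(len(matches)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i, (a1, b1, l1) in enumerate(matches):
        for j in range(i + 1, len(matches)):
            a2, b2, _ = matches[j]
            if a2 - (a1 + l1) > max_gap:
                break                        # later matches are even further away
            if abs((a2 - b2) - (a1 - b1)) <= max_diag_diff:
                parent[find(j)] = find(i)    # close and on a similar diagonal

    clusters = {}
    for i, m in enumerate(matches):
        clusters.setdefault(find(i), []).append(m)
    return [consistent_chain(c) for c in clusters.values()]


if __name__ == "__main__":
    ms = [(0, 0, 10), (15, 16, 10), (30, 31, 10),
          (20, 300, 5), (500, 100, 10), (515, 115, 10)]
    for chain in cluster_matches(ms):
        print(chain)
```

Each returned chain corresponds to one independent alignment region, so rearranged segments end up in separate clusters instead of being discarded by a single global LIS.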

Page 13: Fast algorithms for large scale genome alignment and comparison

Alignment of incomplete genomes

In typical whole-genome shotgun sequencing, the genome is broken up into millions of pieces
• If the reads are generated at random, then >99% of a genome will be covered by sequencing enough reads to cover the genome eight times
• The result of assembly is usually a collection of large, unordered DNA sequences called contigs
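
The ">99% at eightfold coverage" figure can be checked with a short calculation. The sketch assumes the usual Poisson (Lander-Waterman style) model, in which reads start at uniformly random positions and edge effects are ignored; under that assumption the probability that a given base is missed at coverage c is e^(-c). The function name is hypothetical.

```python
import math


def expected_fraction_covered(coverage: float) -> float:
    """Expected fraction of bases hit by at least one read, under a Poisson model."""
    # P(base is covered by zero reads) = exp(-coverage)
    return 1.0 - math.exp(-coverage)


if __name__ == "__main__":
    for c in (1, 2, 4, 8):
        print(f"{c}x coverage: {expected_fraction_covered(c):.4%} of the genome")
```

Under this model, eightfold coverage leaves only a few bases in ten thousand uncovered, which is consistent with the claim on the slide.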

NUCmer (nucleotide MUMmer) is a multiple-contig alignment program that uses MUMmer 2 as its core alignment engine

Page 14: Fast algorithms for large scale genome alignment and comparison

Alignment of incomplete genomes

1) NUCmer input: two multi-fasta files representing partial or complete assemblies

2) Create a map of all contig positions within each file

3) Concatenate files separately and run MUMmer to find exact matches

4) Map matches to separate contigs

5) MUMs are clustered together if they are separated by no more than a user-specified distance

6) Dynamic programming is used to align sequences between the MUMs
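
Steps 2 to 4 can be sketched as follows: contigs are concatenated with a separator, their start offsets are recorded, and a binary search maps a position in the concatenated text back to a contig and a local offset. The separator character, the function names and the in-memory `(name, sequence)` input format are assumptions; the actual NUCmer file handling is not shown in the slides.

```python
from bisect import bisect_right


def concatenate_contigs(contigs, separator="#"):
    """contigs: list of (name, sequence). Returns (text, starts, names)."""
    starts, names, offset = [], [], 0
    for name, seq in contigs:
        starts.append(offset)            # map: where this contig begins in the text
        names.append(name)
        offset += len(seq) + len(separator)
    text = separator.join(seq for _, seq in contigs)
    return text, starts, names


def locate(position, starts, names):
    """Map a position in the concatenated text to (contig name, local offset)."""
    # Assumes `position` lies inside a contig, never on a separator.
    k = bisect_right(starts, position) - 1
    return names[k], position - starts[k]


if __name__ == "__main__":
    text, starts, names = concatenate_contigs(
        [("contig1", "ACGT"), ("contig2", "GG"), ("contig3", "TTTAA")]
    )
    print(text)                          # ACGT#GG#TTTAA
    print(locate(6, starts, names))      # ('contig2', 1)
```

Since exact matches cannot contain the separator, every match reported on the concatenated text falls inside a single contig and can be mapped back with one lookup.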

Page 15: Fast algorithms for large scale genome alignment and comparison

NUCmer

Page 16: Fast algorithms for large scale genome alignment and comparison

PROmer

1) Given two multi-fasta files, PROmer translates the DNA to amino acids

2) An index is created that maps all protein sequences and lengths to the source DNA

3) Pseudo-proteomes (amino acid sequences) are passed to MUMmer

4) The index is used to translate the matches back to the original DNA input

5) Clustering step
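
Steps 1, 2 and 4 can be sketched as a translation plus a coordinate mapping back to the DNA. The sketch assumes a six-frame translation with the standard genetic code, which the slides do not spell out, and uses hypothetical function names. Codons containing characters other than A, C, G and T are translated to "X".

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame_translation(dna: str):
    """Yield (frame, protein) for frames +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    for strand, seq in ((1, dna), (-1, reverse)):
        for offset in range(3):
            protein = "".join(
                CODON_TABLE.get(seq[i:i + 3], "X")
                for i in range(offset, len(seq) - 2, 3)
            )
            yield strand * (offset + 1), protein


def protein_to_dna(frame: int, aa_pos: int, dna_len: int) -> int:
    """0-based forward-strand start of the codon for amino acid `aa_pos` in `frame`."""
    start = (abs(frame) - 1) + 3 * aa_pos
    if frame > 0:
        return start
    # On the reverse strand the codon covers forward positions
    # dna_len - start - 3 .. dna_len - start - 1.
    return dna_len - start - 3


if __name__ == "__main__":
    dna = "ATGGCCTAA"
    for frame, protein in six_frame_translation(dna):
        print(frame, protein)
    print(protein_to_dna(-1, 0, len(dna)))
```

With such a mapping, a match found between two translated sequences can be reported in the coordinates of the original DNA input.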

Page 17: Fast algorithms for large scale genome alignment and comparison

MUMmer v3.0

New improvements in code
• slightly faster than 2.0, 25% less memory

More modular and configurable
• possibility to build hybrid systems

Ability to run a multi-contig query against a multi-contig reference

Non-unique maximal matches

Speed-up of NUCmer and PROmer modules (approx. 10-fold)

Graphical viewers

Page 18: Fast algorithms for large scale genome alignment and comparison

Graphical interfaces

Page 19: Fast algorithms for large scale genome alignment and comparison

Graphical interfaces

Page 20: Fast algorithms for large scale genome alignment and comparison

Graphical interfaces

Page 21: Fast algorithms for large scale genome alignment and comparison

That's All, Folks

Thank you!

Questions are welcome