grappa: large-scale whole genome phylogenies based upon gene order evolution

30
GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution Tandy Warnow, UT-Austin Department of Computer Sciences Institute for Cellular and Molecular Biology Program in Evolution, Ecology, and Behavior Center for Computational Biology and Bioinformatics

Upload: danica

Post on 13-Jan-2016

21 views

Category:

Documents


1 download

DESCRIPTION

GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution. Tandy Warnow, UT-Austin Department of Computer Sciences Institute for Cellular and Molecular Biology Program in Evolution, Ecology, and Behavior Center for Computational Biology and Bioinformatics. A. C. A. D. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution

GRAPPA: Large-scale whole genome phylogenies based upon

gene order evolution

Tandy Warnow, UT-AustinDepartment of Computer Sciences

Institute for Cellular and Molecular Biology

Program in Evolution, Ecology, and Behavior

Center for Computational Biology and Bioinformatics

Page 2: GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution

Whole-Genome Phylogenetics

A

B

C

D

E

F

X

Y

ZW

A

B

C

D

E

F

Page 3: GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution

Genomes As Signed Permutations

1 –5 3 4 -2 -6or

6 2 -4 –3 5 –1etc.

Page 4: GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution

Genomes Evolve by Rearrangements

• Inverted Transposition

1 2 3 9 -8 –7 –6 –5 –4 10

1 2 3 4 5 6 7 8 9 10

• Inversion (Reversal)

1 2 3 –8 –7 –6 –5 -4 9 10

• Transposition

1 2 3 9 4 5 6 7 8 10

Page 5: GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution

Genome Rearrangement Has A Huge State Space

• DNA sequences : 4 states per site• Signed circular genomes with n genes:

states, 1 site

• Circular genomes (1 site)

– with 37 genes: states

– with 120 genes: states

)!1(2 1 −− nn

521056.2 ×2321070.3 ×

Page 6: GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution

Why use gene orders?

• “Rare genomic changes”: huge state space and relative infrequency of events (compared to site substitutions) could make the inference of deep evolution easier, or more accurate.

• Our research shows this is true, but accurate analysis of gene order data is computationally very intensive!

Page 7: GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution

Phylogeny reconstruction from gene orders

• Distance-based reconstruction: estimate pairwise distances, and apply methods like Neighbor-Joining or Weighbor

• “Maximum Parsimony”: find tree with the minimum length (inversions, transpositions, or other edit distances)

• Maximum Likelihood: find tree and parameters of evolution most likely to generate the observed data

Page 8: GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution

This talk

• Distance-based methods: we show how to estimate genomic distances appropriately for the Generalized Nadeau-Taylor model

• Parsimony-style methods: we can find very good solutions to NP-hard problems (inversion and breakpoint phylogeny) quite quickly

• Validation of these approaches on real and simulated data

Page 9: GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution

Distance-based methods

Page 10: GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution

Genomic Distance Estimators

• Standard: – Breakpoint distance– (Minimum) Inversion distance

• Our estimators: We attempt to estimate the actual number of events (the “true evolutionary distance”):– EDE [Moret et al, ISMB’01]

– Approx-IEBP [Wang and Warnow, STOC’01]

– Exact-IEBP [Wang, WABI’01]

Page 11: GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution

Breakpoint Distance

• Breakpoint distance=5

1 2 3 4 5 6 7 8 9 10

1 –3 –2 4 5 9 6 7 8 10

Page 12: GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution

Minimum Inversion Distance

1 2 3 4 5 6 7 8 9 10

1 2 3 –8 –7 –6 –5 –4 9 10

1 8 –3 –2 –7 –6 –5 –4 9 10

1 8 –3 7 2 –6 –5 –4 9 10

• Inversion distance=3

Page 13: GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution

Measured Distance vs. Actual Number of Events

Breakpoint Distance Inversion Distance

120 genes, inversion-only evolution

Page 14: GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution

Generalized Nadeau-Taylor Model

• Three types of events: – Inversions

– Transpositions

– Inverted Transpositions

• Events of the same type are equiprobable• Probability of the three types have fixed ratio:

Inv : Trp : Inv.Trp = (1--)::

Page 15: GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution

Estimating True Evolutionary Distances for Genomes

Given fixed probabilities for each type of event, we estimate the expected breakpoint distance after k random events, or the expected inversion distance after k random events. Inverting these functions gives us a better estimate of true evolutionary distances.

• Approx-IEBP [Wang and Warnow 2001]

• Exact-IEBP [Wang 2001]

• EDE [Moret et al, ISMB 2001] (inversion based)

Page 16: GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution

Estimating True Evolutionary Distances for Genomes (cont.)

Estimating the expected Inversion distance:

EDE [Moret, Wang, Warnow, Wyman 2001]

– Closed-form formula based upon an empirical estimation of the expected inversion distance after k random events (based upon 120 genes and inversion only, but robust to errors in the model) .

– Polynomial time, fastest of the three.

Page 17: GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution

Absolute Difference

•120 genes•Inversion only evolution (Similar relative performance under other models)

Page 18: GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution

Accuracy of Neighbor Joining Using Distance Estimators

•120 genes•All three event types equiprobable•10, 20, 40, 80, and 160 genomes•Similar relative performance under other models

Page 19: GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution

Summary of Distance-based Reconstruction Methods

• Statistically-based estimation of genomic distances improves NJ analyses.

• Our IEBP estimators assume knowledge of the probabilities of each type of event, but are robust to model violations.

• EDE is based upon an inversion-only evolutionary model, but is robust.

• Best performing method: Weighbor(EDE); second best is NJ(EDE); both are robust to model violations.

• Worst performing is NJ(BP).• Accuracy is very good, except when very close to saturation.

Page 20: GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution

Maximum Parsimony on Rearranged Genomes (MPRG)

• The leaves are rearranged genomes.• Find the tree that minimizes the total number of rearrangement events

A

B

C

D

3 6

2

3

4

A

B

C

D

EF

Total length= 18

Page 21: GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution

Optimization problems for gene order phylogeny

• Breakpoint phylogeny: find the phylogeny which minimizes the total number of breakpoints (NP-hard, even to find the median of three genomes)

• Inversion phylogeny: find the phylogeny which minimizes the sum of inversion distances on the edges (NP-hard, even to find the median of three genomes)

Page 22: GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution

Inversion and Breakpoint phylogenies

Phylogenetic trees

MP score

Global optimum

Local optimum

• When the data are close to saturated, even Weighbor(EDE) analyses are insufficiently accurate. In these cases, our initial investigations suggest that the inversion and breakpoint phylogeny approaches may be superior.

• Problem: finding the best trees is enormously hard, since even the “point estimation” problem is hard (worse than estimating branch lengths in ML).

Page 23: GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution

GRAPPA (Genome Rearrangement Analysis under Parsimony and other

Phylogenetic Algorithms)http://www.cs.unm.edu/~moret/GRAPPA/• Heuristics for NP-hard optimization problems• Fast polynomial time distance-based methods• Contributors: U. New Mexico,U. Texas at

Austin, Universitá di Bologna, Italy• Freely available in source code at this site.• Project leader: Bernard Moret (UNM)

([email protected])

Page 24: GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution

Benchmark gene order dataset: Campanulaceae

• 12 genomes + 1 outgroup (Tobacco), 105 gene segments• NP-hard optimization problems: breakpoint and inversion

phylogenies (techniques score every tree)

1997: BPAnalysis (Blanchette and Sankoff): 200 years (est.)

Page 25: GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution

Benchmark gene order dataset: Campanulaceae

• 12 genomes + 1 outgroup (Tobacco), 105 gene segments• NP-hard optimization problems: breakpoint and inversion

phylogenies (techniques score every tree)

1997: BPAnalysis (Blanchette and Sankoff): 200 years (est.)2000: Using GRAPPA v1.1 on the 512-processor Los Lobos

Supercluster machine: 2 minutes (200,000-fold speedup per processor)

Page 26: GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution

Benchmark gene order dataset: Campanulaceae

• 12 genomes + 1 outgroup (Tobacco), 105 gene segments• NP-hard optimization problems: breakpoint and inversion phylogenies

(techniques score every tree)

1997: BPAnalysis (Blanchette and Sankoff): 200 years (est.)2000: Using GRAPPA v1.1 on the 512-processor Los Lobos Supercluster

machine: 2 minutes (200,000-fold speedup per processor)2003: Using latest version of GRAPPA: 2 minutes on a single processor (1-

billion-fold speedup per processor)

Page 27: GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution

Summary

• Weighbor(EDE) and NJ(EDE) are highly accurate polynomial time distance-based reconstructions, except when datasets are close to saturated

• GRAPPA (inversion phylogeny or breakpoint phylogeny) produces highly accurate estimates of trees, and even of ancestral gene orders, acceptably fast

Page 28: GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution

Limitations and ongoing research

• Current methods limited to single chromosomes with equal gene content (or very small amounts of deletions and duplications) -- we are working on developing reliable techniques for genomes with unequal gene content

Page 29: GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution

Acknowledgements

• Funding: The David and Lucile Packard Foundation, and The National Science Foundation.• Collaborators: Bernard Moret (UNM), David Bader (UNM), Bob

Jansen (UT), Linda Raubeson (CWU)• Students: Li-San Wang (now postdoc of Junhyong

Kim at Penn), Jijun Tang (UNM), and others at UNM

Page 30: GRAPPA: Large-scale whole genome phylogenies based upon gene order evolution

Phylolab, U. TexasPlease visit us athttp://www.cs.utexas.edu/users/phylo/