

Crimson: A Data Management System to Support Evaluating Phylogenetic Tree Reconstruction Algorithms

Yifeng Zheng, Stephen Fisher, Shirley Cohen, Sheng Guo, Junhyong Kim and Susan B. Davidson

Background

• Phylogenetics: the science of identifying and understanding evolutionary relationships between different species.
• Cyberinfrastructure for Phylogenetic Research (CIPRES) project
  – Design efficient data storage and query capabilities for managing phylogenetic trees.
  – Evaluate existing phylogenetic tree reconstruction algorithms.
    • Build “gold standards” by simulating very large phylogenetic trees, as well as sequences for each species in the tree, according to models that are carefully curated by experts.
  – ...
• The Crimson system focuses on providing data management support for CIPRES simulation.

Our Solution

• Data storage and index strategy: an extension of the Dewey labeling scheme
• Query evaluation algorithms that achieve high performance
• A user-friendly data management system: the Crimson system

Technical Challenges

• Phylogenetic trees may contain millions of species, each associated with sequences of thousands of characters. Managing and querying this data efficiently is important.
• Data management strategies developed for XML are not suitable for phylogenetic tree management.
  – Unlike the XML documents used in web and commercial applications, which are relatively shallow, phylogenetic trees can be very deep.
    • According to a survey of 200,000 XML documents by Mignet, Barbosa and Veltri (WWW 2003), the average depth of XML documents was 4 and the deepest was 135.
    • Simulated phylogenetic trees have an average depth greater than 1000, and the deepest can exceed 1 million.
  – Queries used with phylogenetic trees are also very different from the path-oriented or restructuring queries supported by XPath and XQuery.
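One way to see why depth matters for path-based labels such as Dewey: in the plain Dewey scheme, a node's label is the list of child positions on the path from the root, so a node at depth d carries d components. The sketch below is only an illustration under that assumption. The two tree shapes and the function names are hypothetical and are not taken from the Crimson system.

```python
def caterpillar_leaf_depths(n):
    """Leaf depths of a fully unbalanced binary tree with n >= 2 leaves.

    One leaf hangs off each spine node; the deepest spine node has two leaves.
    """
    return list(range(1, n)) + [n - 1]


def balanced_leaf_depths(k):
    """Leaf depths of a complete binary tree with 2**k leaves."""
    return [k] * (2 ** k)


def total_label_components(leaf_depths):
    """Components stored if every leaf carries a plain Dewey label.

    A plain Dewey label has one component per level, so its length is the depth.
    """
    return sum(leaf_depths)


# Same order of magnitude of leaves, very different label sizes.
deep = caterpillar_leaf_depths(1024)
shallow = balanced_leaf_depths(10)
print("deep tree:    max label length", max(deep),
      "total components", total_label_components(deep))
print("shallow tree: max label length", max(shallow),
      "total components", total_label_components(shallow))
```

With roughly the same number of leaves, the unbalanced tree needs far more label storage than the balanced one, which suggests why a labeling scheme tuned for shallow XML documents may not carry over to deep trees unchanged.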

System Architecture

[Figure: system architecture diagram. Component labels: Tree Projector, Sampling, Benchmark Manager, Projection Tree, Tree Repository, Species Repository, Query Repository, Repository Manager, GUI Manager, Input Query, Tree Viewer, Query History, Data Loader. Data labels: Simulation Tree, Sampling Species with Sequences, Sampling Strategy, Phylogenetic Trees, Phylogenetic Queries.]

• The phylogenetic reconstruction problem is NP-hard, so current algorithms can handle only a relatively small input set. To benchmark these reconstruction algorithms, we must therefore be able to efficiently sample a subset of species according to various criteria, and project the tree pattern induced by the sample in the simulation tree.
  – Sampling a set of species according to a given time
    • Guarantees that the sampling results are derived from an evolutionary time period.
    • Given a tree T whose edge weights represent time, sampling a set of species according to a given time t returns a subset of T’s leaf set such that all species whose evolutionary time (the weighted distance from the root to the species) is t have the same number of descendant species sampled.
  – Tree projection
    • Determines the relationship among a set of species by appealing to an authoritative tree.
    • Given a tree T and a subset S of its leaves, the tree projection of T over S is a “subtree” T’ in which each edge is a subpath of a path from the root of T to a node in S and each internal node has at least two children.
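The two operations defined above can be sketched in a few lines of Python. This is a naive illustration, not the Crimson query evaluation algorithm. The tree encoding (a `parent` map and a `weight` map keyed by child node), the function names, and the reading of "species whose time is t" as "lineages whose incoming edge crosses time t" are all assumptions made for the example.

```python
import random


def project(parent, sample):
    """Project the tree `parent` (node -> parent, root -> None) over `sample`."""
    sample = set(sample)
    kept_children = {}  # node -> children lying on some root-to-sample path
    seen = set()
    for leaf in sample:
        node = leaf
        while node is not None and node not in seen:
            seen.add(node)
            if parent[node] is not None:
                kept_children.setdefault(parent[node], []).append(node)
            node = parent[node]
    # A node survives if it is sampled or still branches after pruning.
    survives = {n for n in seen
                if n in sample or len(kept_children.get(n, [])) >= 2}
    projected = {}
    for node in survives:
        anc = parent[node]
        while anc is not None and anc not in survives:
            anc = parent[anc]  # skip contracted single-child nodes
        projected[node] = anc
    return projected


def sample_by_time(parent, weight, t, k, rng=random):
    """Pick k leaves below every lineage alive at time t (assumes t > 0).

    Raises ValueError if some lineage has fewer than k descendant leaves.
    """
    children = {}
    for node, p in parent.items():
        if p is not None:
            children.setdefault(p, []).append(node)
    root = next(n for n, p in parent.items() if p is None)
    time = {root: 0.0}  # weighted distance from the root
    frontier, stack = [], [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            time[child] = time[node] + weight[child]
            if time[node] < t <= time[child]:
                frontier.append(child)  # this edge crosses time t
            elif time[child] < t:
                stack.append(child)
    chosen = []
    for top in frontier:
        leaves, todo = [], [top]
        while todo:  # collect the leaves below `top` without recursion
            node = todo.pop()
            kids = children.get(node, [])
            if kids:
                todo.extend(kids)
            else:
                leaves.append(node)
        chosen.extend(rng.sample(leaves, k))
    return chosen


# Hypothetical toy tree: root -> (a -> (s1, b -> (s2, s3)), c -> (s4, s5)).
parent = {"root": None, "a": "root", "c": "root", "s1": "a", "b": "a",
          "s2": "b", "s3": "b", "s4": "c", "s5": "c"}
weight = {"a": 1, "c": 2, "s1": 3, "b": 1, "s2": 2, "s3": 2, "s4": 2, "s5": 2}

picked = sample_by_time(parent, weight, t=1.5, k=1, rng=random.Random(0))
print(sorted(picked))
print(project(parent, picked))
```

Both functions avoid recursion on purpose, since the trees described here can be far deeper than Python's default recursion limit. The ancestor walk in `project` is linear in depth per node, so a system handling trees of this size would presumably need an index to do better.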

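Extended Dewey Labeling

[Figure: example tree illustrating the extended Dewey labeling scheme. The node labels are not recoverable from the transcript.]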
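For readers unfamiliar with Dewey labels, here is a minimal sketch of the basic scheme that the poster's approach extends. It shows the plain version only; the extension itself is not described in this transcript, so nothing below should be read as the Crimson design. The `children` map and the function names are assumptions for the example.

```python
def dewey_labels(children, root):
    """Assign plain Dewey labels: the root gets (), the i-th child appends i."""
    labels = {root: ()}
    stack = [root]
    while stack:  # iterative, so very deep trees do not hit the recursion limit
        node = stack.pop()
        for i, child in enumerate(children.get(node, []), start=1):
            labels[child] = labels[node] + (i,)
            stack.append(child)
    return labels


def is_ancestor(a, b):
    """True if the node labelled `a` is a proper ancestor of the node labelled `b`."""
    return len(a) < len(b) and b[:len(a)] == a


def lca_label(a, b):
    """Label of the least common ancestor: the longest common prefix."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


# Hypothetical toy tree: root -> (a -> (s1, b -> (s2, s3)), c -> (s4, s5)).
children = {"root": ["a", "c"], "a": ["s1", "b"],
            "b": ["s2", "s3"], "c": ["s4", "s5"]}
labels = dewey_labels(children, "root")
for name in sorted(labels, key=labels.get):
    print(name, labels[name])
print("LCA of s1 and s3:", lca_label(labels["s1"], labels["s3"]))
```

The appeal of the scheme is that structural questions such as ancestry and least common ancestor reduce to prefix comparisons on labels, without touching the tree itself. The cost is that label length equals node depth.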

Performance Results

[Figure: "Time to generate the tree and store it given a 20 leaf node set". X-axis: leaf nodes in the file (×100), 1 to 6. Y-axis: time (seconds), 0 to 0.18.]

[Figure: "Time to generate and store a subtree from the selected leaves of a phylogenetic tree with 2000 leaves". X-axis: number of randomly selected leaves (×10), 1 to 20. Y-axis: time (seconds), 0 to 3.]

References:

• Cyberinfrastructure for Phylogenetic Research (CIPRES) project (www.phylo.org)

• Susan B. Davidson, Junhyong Kim, Yifeng Zheng: Efficiently Supporting Structure Queries on Phylogenetic Trees. SSDBM 2005: 93-102