Grid enabling phylogenetic inference on virus sequences using BEAST - a possibility?
EUAsiaGrid Workshop
4-6 May 2010
Chanditha Hapuarachchi
Environmental Health Institute
National Environment Agency
Outline
Work scope
Analytical approach
Current limitations
What is expected from Grid-enabling?
Work scope
Understanding the molecular epidemiology of vector-borne infectious diseases in Singapore, with a view to using the information in disease control operations
Objectives
To determine the routes of pathogen migration (mainly Dengue and Chikungunya viruses)
To understand the evolutionary dynamics of pathogens
To understand the outbreak potential of pathogens within the country
Molecular epidemiology of DENV & CHIKV
Phylogenetic relationships (trees): BEAST, MEGA
Evolutionary dynamics (evolutionary rates, selection pressure, recombination etc.): BEAST, HYPHY etc.
Population dynamics (Bayesian skyline plots): BEAST
Temporo-spatial distribution of viruses: BEAST, NETWORK
What phylogenetic inferences are made?
BEAST is a multi-task software package
Figure: CHIKV whole genome tree with spatial model (locations: India, Sri Lanka, Singapore, Malaysia, Indian Ocean Islands, Kenya; axis: Time (yrs))
Figure: Spatial distribution of different lineages of DENV in Singapore
However...
BEAST analysis is time-consuming & requires substantial computing power
Limitations of the BEAST approach?
Size of dataset
Length of sequences
No. of sequences
E.g. Analyzing a dataset of ~90 whole genomes of CHIKV (11.8 kb) takes several days depending on the available computing power
Analytical parameters
A basic analysis takes ~0.3 hrs per million states (Core 2 Duo, 2.1 GHz, 4 GB RAM, >50% CPU)
A general run involves a chain of at least 100 million states (~30 hrs)
The duration increases substantially with changing parameters
Incorporation of a spatial model (7 states) alone increases the runtime to ~0.4 hrs per million states
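The runtime figures on this slide amount to a linear estimate: chain length in millions of states multiplied by the cost per million states. The sketch below only reproduces that arithmetic. The function name is made up for illustration, and the per-million costs are the ones quoted above for the stated hardware, so they should not be expected to hold on other machines or datasets.

```python
def estimated_hours(chain_length_states: int, hours_per_million: float) -> float:
    """Estimate wall-clock hours as (chain length in millions) x (hours per million states)."""
    return chain_length_states / 1_000_000 * hours_per_million


CHAIN_LENGTH = 100_000_000  # 100 million states, the general run size quoted on the slide

basic_hours = estimated_hours(CHAIN_LENGTH, 0.3)    # basic analysis
spatial_hours = estimated_hours(CHAIN_LENGTH, 0.4)  # with the 7-state spatial model

print(f"basic analysis: ~{basic_hours:.0f} hrs")
print(f"with spatial model: ~{spatial_hours:.0f} hrs")
```

On these figures, adding the spatial model alone extends a 100 million state run by roughly 10 hours.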
The ultimate duration depends on Effective Sample Size (ESS) values (general requirement >200)
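To illustrate why ESS drives the run length, the sketch below estimates ESS from a trace of sampled values with a generic autocorrelation-based estimator: the number of samples divided by the integrated autocorrelation time, truncated at the first non-positive autocorrelation. This is an assumption made for illustration and is not claimed to be the calculation Tracer performs. The function name and the synthetic traces are hypothetical.

```python
import random


def effective_sample_size(trace):
    """Generic ESS estimate: n / (1 + 2 * sum of positive-lag autocorrelations),
    stopping the sum at the first non-positive autocorrelation."""
    n = len(trace)
    mean = sum(trace) / n
    centered = [x - mean for x in trace]
    variance = sum(c * c for c in centered) / n
    if variance == 0:
        return float(n)
    tau = 1.0  # integrated autocorrelation time
    for lag in range(1, n):
        autocov = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / n
        rho = autocov / variance
        if rho <= 0:
            break
        tau += 2 * rho
    return n / tau


ESS_THRESHOLD = 200  # general requirement quoted on the slide

random.seed(1)
n_samples = 2000

# Synthetic trace 1: independent draws (well-mixed chain).
independent = [random.gauss(0, 1) for _ in range(n_samples)]

# Synthetic trace 2: strongly autocorrelated draws (poorly mixed chain).
correlated = [0.0]
for _ in range(n_samples - 1):
    correlated.append(0.9 * correlated[-1] + random.gauss(0, 1))

for name, trace in [("independent", independent), ("correlated", correlated)]:
    ess = effective_sample_size(trace)
    print(f"{name}: ESS ~ {ess:.0f}, meets threshold: {ess > ESS_THRESHOLD}")
```

Both traces have the same number of samples, but the autocorrelated one carries far fewer effective samples, so a poorly mixing chain has to run longer to pass the same threshold.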
Limitations…
BEAST Tracer output window
Limitations…
Number of parallel runs & users
↑ runs & users → ↓ analytical efficiency
Single run takes up >50% of CPU power
Why Grid-enable BEAST?
Enables efficient data analysis
parallel runs
multiple users
expanded datasets
Enhances data interpretation
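One way to picture the parallel runs listed above is as independent replicate chains of the same analysis, each started with a different random seed in its own working directory. The sketch below is a local stand-in under several assumptions: the executable is assumed to be on the PATH as `beast`, `chikv.xml` is a hypothetical input file, and a thread pool on one machine takes the place of real Grid job submission, which would go through whatever middleware the Grid provides. Only the `-seed` option is used, and it should be checked against `beast -help` for the installed version.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_jobs(xml_file, n_replicates, base_seed=1000):
    """Return (working directory, command) pairs, one per replicate, each with a unique seed."""
    xml_path = Path(xml_file).resolve()
    jobs = []
    for i in range(n_replicates):
        workdir = Path(f"run_{i + 1}")  # separate directory so log files do not collide
        command = ["beast", "-seed", str(base_seed + i), str(xml_path)]
        jobs.append((workdir, command))
    return jobs


def run_job(job):
    """Run one replicate in its own directory and return the process exit code."""
    workdir, command = job
    workdir.mkdir(parents=True, exist_ok=True)
    return subprocess.run(command, cwd=workdir).returncode


def run_all(jobs, max_parallel=2):
    """Run replicates concurrently; max_parallel should not exceed the available cores."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run_job, jobs))


jobs = build_jobs("chikv.xml", n_replicates=4)
for workdir, command in jobs:
    print(workdir, " ".join(command))

# run_all(jobs) would launch the replicates; it is not called here
# because it requires BEAST to be installed and the input file to exist.
```

Logs from replicate chains of the same model are commonly merged afterwards with LogCombiner, which ships with BEAST, so that the combined sample can reach the ESS requirement sooner than a single long chain on one workstation.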
Can Grid-enabling help to improve the existing performance?