Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees
Mark A. Miller, San Diego Supercomputer Center
Systematics is the study of the diversification of life on the planet Earth, both past and present, and the relationships among living things through time.
Evolutionary relationships can (for the most part) be represented as a rooted graph.
Originally, evolutionary relationships were inferred from morphology alone:
Morphological characters are scored “by hand” to create matrices of characters.
Scoring occurs via low-volume, low-throughput methodologies.
Even though tree inference is NP-hard, matrices created using morphological characters alone are typically relatively small, so computations are relatively tractable (with heuristics developed by the community).
Evolutionary relationships are also inferred from DNA sequence comparisons:
Unlike morphological characters, DNA sequence determination is now fully automated.
The increase in DNA sequences with time is faster than Moore’s law.
There are at least 10^7 species, each with 3,000 - 30,000 genes, so the need for computational power and new tools will continue to grow.
Tree inference is NP-hard, so even with heuristics, computational power frequently limits the analysis.
Analyses often involve thousands of species and thousands of characters, creating very large matrices.
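To give a sense of why heuristics are unavoidable: the number of distinct rooted, fully bifurcating trees on n labeled taxa is (2n - 3)!!, a standard combinatorial result that the slides do not state. A minimal sketch of that count (the function name is illustrative):

```python
def rooted_tree_count(n_taxa: int) -> int:
    """Number of rooted binary trees on n labeled taxa: (2n - 3)!!."""
    if n_taxa < 1:
        raise ValueError("n_taxa must be at least 1")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3 (empty product for n <= 2).
    for odd in range(3, 2 * n_taxa - 2, 2):
        count *= odd
    return count


if __name__ == "__main__":
    for n in (4, 10, 50):
        print(n, rooted_tree_count(n))
```

Ten taxa already allow more than 34 million rooted trees, so exhaustive search is out of reach long before matrices reach thousands of species.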
The CIPRES Project was created to support this new age of large phylogenetic data sets. The project had as its principal goals:
1. Developing heuristics and tools for analyzing the large DNA data sets that are available.
2. Improving researcher access to computational resources.
The CIPRES Portal was created as part of Goal 2, improving researcher access to computational resources
The CIPRES Portal was designed to be a flexible web application that allows users to run analyses of large sequence data sets using community codes on a significant computational resource.
User requirements:
• Provide login-protected personal user space for storing results indefinitely.
• Provide access to most or all native command line options for each code.
• Support addition of new tools and new versions as needed.
The CIPRES Portal was built on a generic portal software package called The Workbench Framework
<?xml version="1.0" encoding="ISO-8859-1" ?><!DOCTYPE pise SYSTEM "http://www.phylo.org/dev/rami/PARSER/pise.dtd" [<!ENTITY nucdbs SYSTEM "http://www.phylo.org/dev/rami/XMLDIR/nucdbs.xml"><!ENTITY protdbs SYSTEM "http://www.phylo.org/dev/rami/XMLDIR/protdbs.xml"><!ENTITY blastDBpath SYSTEM "http://www.phylo.org/dev/rami/XMLDIR/blastDBpath.xml"><!ENTITY fastaDBpath SYSTEM "http://www.phylo.org/dev/rami/XMLDIR/fastaDBpath.xml"><!ENTITY blocksDBpath SYSTEM "http://www.phylo.org/dev/rami/XMLDIR/blocksDBpath.xml"><!ENTITY nucDBfasta SYSTEM "http://www.phylo.org/dev/rami/XMLDIR/nucDBfasta.xml"><!ENTITY protDBfasta SYSTEM "http://www.phylo.org/dev/rami/XMLDIR/protDBfasta.xml">]><pise>
<head> <title>TFASTY</title> <version>34t10d3</version> <description>Compare PS to Translated NS Or NS-DB</description> <authors>W. Pearson</authors> <reference>Pearson, W. R. (1999) Flexible sequence similarity searching with the FASTA3 program package. Methods in Molecular Biology</reference> <reference>W. R. Pearson and D. J. Lipman (1988), Improved Tools for Biological Sequence Analysis, PNAS 85:2444-2448</reference> <reference> W. R. Pearson (1998) Empirical statistical estimates for sequence similarity searches. In J. Mol. Biol. 276:71-84</reference> <reference>Pearson, W. R. (1996) Effective protein sequence comparison. In Meth. Enz., R. F. Doolittle, ed. (San Diego: Academic Press) 266:227-258</reference><category>Protein Sequence</category>
To expose command line tools quickly, the Workbench Framework uses the PISE XML standard.
All command line parameters can be set.
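As a rough illustration of how a tool description like the TFASTY one can be read programmatically, the sketch below parses a shortened excerpt of its header with Python's standard library. The excerpt omits the DOCTYPE and entity declarations and supplies its own closing tags; the function name is illustrative, and how the Workbench Framework actually consumes PISE files is not shown in the talk.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the TFASTY description; closing tags added for the sketch.
PISE_EXCERPT = """
<pise>
  <head>
    <title>TFASTY</title>
    <version>34t10d3</version>
    <description>Compare PS to Translated NS Or NS-DB</description>
    <authors>W. Pearson</authors>
    <category>Protein Sequence</category>
  </head>
</pise>
"""


def read_tool_header(xml_text: str) -> dict:
    """Return the simple text fields found under <head>."""
    head = ET.fromstring(xml_text).find("head")
    return {child.tag: (child.text or "").strip() for child in head}


if __name__ == "__main__":
    print(read_tool_header(PISE_EXCERPT))
```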
Usage Statistics for CIPRES Portal 5/2007 – 11/2009
47,500 total jobs in 30 months
Limitations of the original CIPRES Portal
• all jobs were run serially (efficient, but no gain in wall time)
• the cluster was modest (16 X 8-way dual core nodes)
• runs were limited to 72 hours
• the cluster was at the end of its useful lifetime
• funding for the project was ending
• demand for job runs was increasing
This is not a scalable, sustainable solution!
The solution: make community codes available on scalable, sustainable resources (e.g. TeraGrid).
[Diagram labels: Workbench Framework; TeraGrid; CIPRES Cluster; Triton; parallel codes; serial codes]
Greater than 90% of all computational time is used for three tree inference codes: MrBayes, RAxML, and GARLI.
Implement parallel versions of these codes on the TeraGrid machines Abe and Lonestar, using Globus/GRAM.
Work with community developers to improve the speed-up available through the parallel codes offered by CSG.
Keep other serial codes on local SDSC resources that provide the project with fee-for-service cycles.
The Workbench Framework design made it possible to deploy jobs on TeraGrid resources fairly easily. A Science Gateway development allocation allowed us to accomplish the initial setup.
CIPRES Science Gateway parallel code profiles

Code      Type                Max cores   Speed-up           Efficiency
MrBayes   Hybrid MPI/OpenMP   32          2.4 X (4 nodes)    ~60%
RAxML     Hybrid MPI/OpenMP   40          3.0 X (5 nodes)    ~60%
GARLI     MPI                 100         77 X (100 nodes)   77-94%
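The efficiency figures appear consistent with speed-up divided by the number of nodes (for example, 2.4 / 4 = 60%). That reading is an assumption, since the slide does not define the term. A small sketch of the arithmetic, with an illustrative function name:

```python
def parallel_efficiency(speedup: float, units: int) -> float:
    """Efficiency as speed-up divided by the number of parallel units."""
    if units <= 0:
        raise ValueError("units must be positive")
    return speedup / units


# (code, speed-up, nodes) as listed in the code profiles.
PROFILES = [("MrBayes", 2.4, 4), ("RAxML", 3.0, 5), ("GARLI", 77.0, 100)]

if __name__ == "__main__":
    for code, speedup, nodes in PROFILES:
        print(f"{code}: {parallel_efficiency(speedup, nodes):.0%}")
```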
[Figure: CIPRES Science Gateway Usage, Dec 2009 – Oct 2010. Users per month by month, plotted for all users, new users, and repeat users.]
1575 new TG users
[Figure: CIPRES Science Gateway Usage. Jobs per month and SUs per month (in thousands), by month.]
Intellectual Merit: 90+ publications enabled by the CIPRES Science Gateway
Broad Impact: CIPRES Science Gateway used to deliver curriculum by at least 21 instructors
What happens if you build it and too many people come???
[Figure: SUs (thousands) by month, with the upper range marked "?!?"]
• make an explicit fair resource use policy
• make sure resource use delivers impact
• make sure resource use is efficient
• then expand resource base as required
SUs / month       Number of users   % total SU   % per user
< 100             536 (75.3%)       19           0.04
100 - 500         86 (12.1%)        9            0.10
500 - 2000        53 (7.5%)         10           0.19
2000 - 5000       24 (3.4%)         17           0.71
5,000 - 10,000    9 (1.3%)          14           1.56
> 10,000          3 (0.4%)          31           10.33
This level of resource use requires additional justification…
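The "% per user" column appears to be each tier's share of total SUs divided by the number of users in that tier. This is an inference from the numbers, not something stated on the slide. A sketch that recomputes it, with illustrative names:

```python
# (SUs/month tier, number of users, % of total SUs) from the usage table.
TIERS = [
    ("< 100", 536, 19),
    ("100 - 500", 86, 9),
    ("500 - 2000", 53, 10),
    ("2000 - 5000", 24, 17),
    ("5,000 - 10,000", 9, 14),
    ("> 10,000", 3, 31),
]


def percent_per_user(percent_total: float, users: int) -> float:
    """Average share of total SUs consumed by one user in a tier."""
    if users <= 0:
        raise ValueError("users must be positive")
    return percent_total / users


if __name__ == "__main__":
    for label, users, pct in TIERS:
        print(f"{label:>15}: {percent_per_user(pct, users):.2f}% per user")
```

Read this way, each of the three heaviest users consumes roughly 10% of the community allocation on their own.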
CIPRES SG Fair Use Policy

Tier   % Community allocation used   Account Status
1      < 2% of monthly allocation    Open access
2      2% of monthly allocation      Request personal TG allocation
3      3% of monthly allocation      Use of community allocation blocked
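The policy can be sketched as a simple threshold check. The slide lists only the thresholds, so the sketch assumes that tier 2 covers usage from 2% up to (but not including) 3% of the monthly community allocation and that tier 3 starts at 3%. The function name and status strings are illustrative.

```python
def account_status(used_sus: float, monthly_allocation_sus: float) -> str:
    """Map an account's monthly SU usage to a fair-use status."""
    if monthly_allocation_sus <= 0:
        raise ValueError("monthly allocation must be positive")
    share = 100.0 * used_sus / monthly_allocation_sus
    if share < 2.0:
        return "open access"  # tier 1
    if share < 3.0:
        return "request personal TG allocation"  # tier 2 (assumed range)
    return "use of community allocation blocked"  # tier 3


if __name__ == "__main__":
    for used in (100, 250, 300):
        print(used, account_status(used, 10_000))
```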
High-end users must acquire a personal allocation from the TRAC committee. This will ensure that very heavy users of the resource are supporting peer-reviewed research (and have a US institutional affiliation).
Tools required to implement the CIPRES SG Fair Use Policy:
• ability to halt submissions from a given user account
• ability to charge to a user’s personal TG allocation
• ability to monitor usage by each account
Job Attrition on the CIPRES Science Gateway
Error Impact analysis

Error type            CPU time   User   Staff
Input error           low        high   low
Machine error         0          high   low
Communication error   high       high   high
Unknown error         high       high   low
Communication Errors
• occur when there is a break in communication between the web application and the TG resource.
• kill the job monitoring process for all running jobs.
• jobs continue to execute and consume SUs, but the web application can no longer return job results.
• the user must ask for the results, and a staff member must fetch them manually.
CONCLUSION: Time to refactor the job monitoring system!
Normal Operation
[Diagram: User Submission sends the command line and files to the TG Machine. Components shown: Running Task Table, Globus "gsissh", User DB.]
LoadResults Daemon: detects results, fetches via GridFTP, puts results in the user DB.
Abnormal Operation:
[Diagram components: User Submission, checkJobs Daemon, TG Machine, Running Task Table, Globus "gsissh", User DB.]
If normal notification fails because of:
• Machine outage
• Job timeout
• CIPRES Application down
LoadResults Daemon: detects results w/ delivery error; fetches via GridFTP; puts results in the user DB.
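The recovery path can be sketched as a polling pass over the running task table. Everything below is hypothetical: the `RunningTask` record, the `remote_job_finished` and `fetch_results` callables (stand-ins for a gsissh status check and a GridFTP transfer), and the dict standing in for the user DB. It is a sketch of the idea, not the CIPRES implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RunningTask:
    job_id: str
    user: str
    done: bool = False


def check_jobs(
    tasks: List[RunningTask],
    remote_job_finished: Callable[[str], bool],
    fetch_results: Callable[[str], bytes],
    user_db: Dict[str, bytes],
) -> List[str]:
    """One polling pass: recover results for jobs whose notification was lost."""
    recovered = []
    for task in tasks:
        if task.done:
            continue
        # Ask the remote machine directly instead of waiting for a callback.
        if remote_job_finished(task.job_id):
            user_db[task.job_id] = fetch_results(task.job_id)
            task.done = True
            recovered.append(task.job_id)
    return recovered


if __name__ == "__main__":
    tasks = [RunningTask("job-1", "alice"), RunningTask("job-2", "bob")]
    db: Dict[str, bytes] = {}
    finished = {"job-1"}
    print(check_jobs(tasks, lambda j: j in finished, lambda j: b"tree", db))
```

Because the pass reads state from the task table rather than from a live monitoring process, it can be rerun after an outage without losing track of jobs that finished in the meantime.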
JOBS SAVED BY THE GSISSH / TASK TABLE SYSTEM

          SEPT   OCT
MrBayes   39     71
RAxML     106    172
GARLI     14     23
Total     159*   266*

* 7% of all submitted jobs
Future plans to address other sources of attrition
User errors (15%): pre-check uploaded files for valid format.
Machine errors (8%): establish redundancy in where codes can run; use tools to check machine availability and queue depth.
Unknown errors (3%): TBD
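For the user-error item, a pre-check of an uploaded file might look like the sketch below. FASTA is used only as an example format, and the accepted alphabet and function name are assumptions; the talk does not say which formats or rules the gateway would check.

```python
import re

# Assumed alphabet: letters plus gap/unknown symbols commonly seen in alignments.
_SEQ_LINE = re.compile(r"^[A-Za-z\-\*\?\.]+$")


def fasta_errors(text: str) -> list:
    """Return a list of problems found; an empty list means the file looks valid."""
    errors = []
    seen_header = False
    residues_since_header = 0
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen_header and residues_since_header == 0:
                errors.append(f"line {number}: previous record has no sequence")
            seen_header = True
            residues_since_header = 0
        elif not seen_header:
            errors.append(f"line {number}: sequence data before first '>' header")
        elif not _SEQ_LINE.match(line):
            errors.append(f"line {number}: unexpected characters in sequence")
        else:
            residues_since_header += len(line)
    if not seen_header:
        errors.append("no '>' header found")
    elif residues_since_header == 0:
        errors.append("last record has no sequence")
    return errors
```

Rejecting a malformed file at upload time costs no SUs, whereas the same error discovered on the remote machine wastes queue time and a job submission.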
Other possible resource opportunities:
New TG machines (e.g. Trestles machine at SDSC)
The Open Science Grid (OSG)
The NSF FutureGrid project (adapting these HPC applications to a cloud environment).
Acknowledgements:
CIPRES Science Gateway: Terri Liebowitz
TeraGrid Hybrid Code Development: Wayne Pfeiffer, Alexandros Stamatakis
TeraGrid Implementation Support: Nancy Wilkins-Diehr, Doru Marcusiu
Workbench Framework: Paul Hoover, Lucie Chan