Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

Post on 16-Jan-2016

Page 1: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees

Mark A. Miller
San Diego Supercomputer Center

Page 2: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

Systematics is the study of the diversification of life on the planet Earth, both past and present, and the relationships among living things through time.

Page 3: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

Evolutionary relationships can (for the most part) be represented as a rooted graph.

Page 4: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

Originally, evolutionary relationships were inferred from morphology alone:

Page 5: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

Originally, evolutionary relationships were inferred from morphology alone:

Morphological characters are scored “by hand” to create matrices of characters.

Scoring occurs via low-volume, low-throughput methodologies.

Even though tree inference is NP-hard, matrices created using morphological characters alone are typically relatively small, so computations are relatively tractable (with heuristics developed by the community).

Page 6: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

Evolutionary relationships are also inferred from DNA sequence comparisons:

Page 7: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

Evolutionary relationships are also inferred from DNA sequence comparisons:

Unlike morphological characters, DNA sequence determination is now fully automated.

The increase in DNA sequences with time is faster than Moore’s law.

There are at least 10^7 species, each with 3000 - 30,000 genes, so the need for computational power and new tools will continue to grow.

Tree inference is NP-hard, so even with heuristics, computational power frequently limits the analysis.

Analyses often involve thousands of species and thousands of characters, creating very large matrices.
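To give a sense of the scale behind these figures, the sketch below counts the possible tree topologies for a given number of taxa and multiplies out the gene estimate. The closed form (2n-3)!! for rooted, bifurcating trees on n labeled taxa is a standard combinatorial result and does not come from the slides; the function names are illustrative only.

```python
def rooted_tree_count(n_taxa: int) -> int:
    """Number of distinct rooted, bifurcating trees on n labeled taxa: (2n-3)!!."""
    if n_taxa < 2:
        return 1
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-3.
    for odd in range(3, 2 * n_taxa - 2, 2):
        count *= odd
    return count


def gene_count_range(n_species: int = 10**7,
                     genes_low: int = 3_000,
                     genes_high: int = 30_000) -> tuple:
    """Lower and upper bound on total genes, using the figures quoted on the slide."""
    return n_species * genes_low, n_species * genes_high


if __name__ == "__main__":
    for n in (4, 10, 20, 50):
        print(f"{n:>3} taxa: {rooted_tree_count(n):.3e} rooted trees")
    low, high = gene_count_range()
    print(f"total genes: {low:.1e} to {high:.1e}")
```

Even at 50 taxa the number of candidate trees is far too large to enumerate, which is why the analyses described here rely on heuristics rather than exhaustive search.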

Page 8: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

The CIPRES Project was created to support this new age of large phylogenetic data sets. The project had as its principal goals:

1. Developing heuristics and tools for analyzing the large DNA data sets that are available.

2. Improving researcher access to computational resources.

Page 9: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

The CIPRES Portal was created as part of Goal 2, improving researcher access to computational resources.

The CIPRES Portal was designed to be a flexible web application that allows users to run analyses of large sequence data sets using community codes on a significant computational resource.

Page 10: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

User requirements:

• Provide login-protected personal user space for storing results indefinitely.

• Provide access to most or all native command line options for each code.

• Support addition of new tools and new versions as needed.

Page 11: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

The CIPRES Portal was built on a generic portal software package called the Workbench Framework.

Page 12: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

To expose Command Line Tools quickly, the Workbench Framework uses the PISE XML standard. The slide shows the beginning of the description for TFASTY:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE pise SYSTEM "http://www.phylo.org/dev/rami/PARSER/pise.dtd" [
<!ENTITY nucdbs SYSTEM "http://www.phylo.org/dev/rami/XMLDIR/nucdbs.xml">
<!ENTITY protdbs SYSTEM "http://www.phylo.org/dev/rami/XMLDIR/protdbs.xml">
<!ENTITY blastDBpath SYSTEM "http://www.phylo.org/dev/rami/XMLDIR/blastDBpath.xml">
<!ENTITY fastaDBpath SYSTEM "http://www.phylo.org/dev/rami/XMLDIR/fastaDBpath.xml">
<!ENTITY blocksDBpath SYSTEM "http://www.phylo.org/dev/rami/XMLDIR/blocksDBpath.xml">
<!ENTITY nucDBfasta SYSTEM "http://www.phylo.org/dev/rami/XMLDIR/nucDBfasta.xml">
<!ENTITY protDBfasta SYSTEM "http://www.phylo.org/dev/rami/XMLDIR/protDBfasta.xml">
]>
<pise>
<head>
  <title>TFASTY</title>
  <version>34t10d3</version>
  <description>Compare PS to Translated NS Or NS-DB</description>
  <authors>W. Pearson</authors>
  <reference>Pearson, W. R. (1999) Flexible sequence similarity searching with the FASTA3 program package. Methods in Molecular Biology</reference>
  <reference>W. R. Pearson and D. J. Lipman (1988), Improved Tools for Biological Sequence Analysis, PNAS 85:2444-2448</reference>
  <reference>W. R. Pearson (1998) Empirical statistical estimates for sequence similarity searches. In J. Mol. Biol. 276:71-84</reference>
  <reference>Pearson, W. R. (1996) Effective protein sequence comparison. In Meth. Enz., R. F. Doolittle, ed. (San Diego: Academic Press) 266:227-258</reference>
  <category>Protein Sequence</category>
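As a rough illustration of how an XML tool description can drive a command line, the sketch below reads a small description and assembles an argument list from user-supplied values. The XML layout, the tool name `exampletool`, and its flags are all hypothetical and much simpler than real PISE documents, which follow pise.dtd; this is not the Workbench Framework's implementation.

```python
import shlex
import xml.etree.ElementTree as ET

# Hypothetical, simplified tool description (NOT the real PISE schema).
TOOL_XML = """
<tool>
  <command>exampletool</command>
  <parameter name="seed" flag="--seed" type="value" default="1"/>
  <parameter name="verbose" flag="--verbose" type="switch" default="false"/>
  <parameter name="infile" type="positional"/>
</tool>
"""


def build_command(xml_text: str, values: dict) -> list:
    """Turn a tool description plus user-supplied values into an argv list."""
    root = ET.fromstring(xml_text)
    argv = [root.findtext("command")]
    for param in root.iter("parameter"):
        name = param.get("name")
        kind = param.get("type")
        # Fall back to the declared default when the user gave no value.
        value = values.get(name, param.get("default"))
        if value is None:
            raise ValueError(f"missing required parameter: {name}")
        if kind == "switch":
            # A switch contributes its flag only when turned on.
            if str(value).lower() == "true":
                argv.append(param.get("flag"))
        elif kind == "positional":
            argv.append(str(value))
        else:
            argv.extend([param.get("flag"), str(value)])
    return argv


if __name__ == "__main__":
    argv = build_command(TOOL_XML, {"infile": "matrix.phy", "verbose": True})
    print(shlex.join(argv))
```

Because the web form and the command line are both generated from the same description, adding a tool or a new version amounts to adding or editing one XML file.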

Page 13: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

All command line parameters can be set.

Page 15: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

Usage Statistics for CIPRES Portal 5/2007 – 11/2009

47,500 total jobs in 30 months

Page 16: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

Limitations of the original CIPRES Portal

• all jobs were run serially (efficient, but no gain in wall time)

• the cluster was modest (16 X 8-way dual core nodes)

• runs were limited to 72 hours

• the cluster was at the end of its useful lifetime

• funding for the project was ending

• demand for job runs was increasing

This is not a scalable, sustainable solution!

Page 17: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

The solution: make community codes available on scalable, sustainable resources (e.g. TeraGrid).

[Diagram labels: Workbench Framework; TeraGrid; CIPRES Cluster; Triton; Parallel codes; Serial codes]

Page 18: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

Greater than 90% of all computational time is used for three tree inference codes: MrBayes, RAxML, and GARLI.

Implement parallel versions of these codes on TeraGrid machines Abe and Lonestar, using Globus/GRAM.

Work with community developers to improve the speed-up available through the parallel codes offered by the CIPRES Science Gateway (CSG).

Keep other serial codes on local SDSC resources that provide the project with fee-for-service cycles.

Page 19: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

The Workbench Framework design made it possible to deploy jobs on TeraGrid resources fairly easily. A Science Gateway development allocation allowed us to accomplish the initial setup.

Page 20: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

CIPRES Science Gateway parallel code profiles

Code      Type                Max cores   Speed-up            Efficiency
MrBayes   Hybrid MPI/OpenMP   32          2.4 X (4 nodes)     ~60%
RAxML     Hybrid MPI/OpenMP   40          3.0 X (5 nodes)     ~60%
GARLI     MPI                 100         77 X (100 nodes)    77-94%
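The efficiency figures in this profile are consistent with the usual definition of parallel efficiency, speed-up divided by the number of units used (here, the node counts given in parentheses). The slides do not state the formula, so treating it this way is an assumption; the sketch below simply reproduces the arithmetic.

```python
def parallel_efficiency(speedup: float, n_units: int) -> float:
    """Parallel efficiency = observed speed-up / number of parallel units."""
    if n_units <= 0:
        raise ValueError("n_units must be positive")
    return speedup / n_units


# (speed-up, nodes) as quoted in the profile table.
PROFILES = {
    "MrBayes": (2.4, 4),
    "RAxML": (3.0, 5),
    "GARLI": (77.0, 100),
}

if __name__ == "__main__":
    for code, (speedup, nodes) in PROFILES.items():
        eff = parallel_efficiency(speedup, nodes)
        print(f"{code:<8} {speedup:>5.1f}x on {nodes:>3} nodes -> {eff:.0%}")
```

For GARLI the computed value matches the lower end of the quoted 77-94% range; the slides do not say which configuration produced the upper end.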

Page 21: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

CIPRES Science Gateway Usage Dec 2009 – Oct 2010

[Chart: users per month (axis 100 to 400) by month, plotted separately for all users, new users, and repeat users.]

Page 22: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

CIPRES Science Gateway Usage Dec 2009 – Oct 2010

[Chart: the same users-per-month plot (all users, new users, and repeat users), annotated "1575 new TG users".]

Page 23: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

CIPRES Science Gateway Usage

[Chart: jobs per month (axis 1000 to 4000) and SUs per month in thousands (axis 200 to 800), by month.]

Page 24: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

Intellectual Merit: 90+ publications enabled by the CIPRES Science Gateway

Broad Impact: CIPRES Science Gateway used to deliver curriculum by at least 21 instructors

Page 25: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

What happens if you build it and too many people come???

[Chart: SUs in thousands (axis 500 to 4000) by month, with the trend marked "?!?".]

Page 26: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

What happens if you build it and too many people come???

• make an explicit fair resource use policy

• make sure resource use delivers impact

• make sure resource use is efficient

• then expand resource base as required

Page 28: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

What happens if you build it and too many people come???

SUs / month       Number of users   % total SU   % per user
< 100             536 (75.3%)       19           0.04
100 - 500         86 (12.1%)        9            0.10
500 - 2000        53 ( 7.5%)        10           0.19
2000 - 5000       24 ( 3.4%)        17           0.71
5,000 - 10,000    9 ( 1.3%)         14           1.56
> 10,000          3 ( 0.4%)         31           10.33

This level of resource use requires additional justification…
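The "% per user" column in the usage table can be reproduced by dividing each bracket's share of total SUs by the number of users in that bracket. The sketch below does this with the figures from the table; the variable and function names are illustrative.

```python
# (SUs/month bracket, number of users, % of total SUs) from the usage table.
BRACKETS = [
    ("< 100", 536, 19),
    ("100 - 500", 86, 9),
    ("500 - 2000", 53, 10),
    ("2000 - 5000", 24, 17),
    ("5,000 - 10,000", 9, 14),
    ("> 10,000", 3, 31),
]


def per_user_share(pct_total: float, n_users: int) -> float:
    """Average share of the total SUs consumed by one user in a bracket."""
    return pct_total / n_users


if __name__ == "__main__":
    for label, users, pct in BRACKETS:
        print(f"{label:<15} {users:>4} users  {pct:>3}%  "
              f"{per_user_share(pct, users):.2f}% per user")
```

Seen this way, each of the three heaviest users consumes on average more than 250 times the share of a user in the lowest bracket, which is what motivates a tiered policy.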

Page 31: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

What happens if you build it and too many people come???

CIPRES SG Fair Use Policy

Tier   % Community allocation used   Account status
1      < 2% of monthly allocation    Open access
2      2% of monthly allocation      Request personal TG allocation
3      3% of monthly allocation      Use of community allocation blocked
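A tiered policy like this one can be expressed as a small classification function. The sketch below is an assumption-laden reading of the table: it treats tier 2 as "at least 2% but under 3%" and tier 3 as "3% or more" of the monthly community allocation, boundaries the slide does not spell out. Function and variable names are hypothetical.

```python
# Account status per tier, as listed in the policy table.
STATUS = {
    1: "Open access",
    2: "Request personal TG allocation",
    3: "Use of community allocation blocked",
}


def fair_use_tier(user_sus: float, monthly_allocation_sus: float) -> int:
    """Classify a user by the share of the monthly community allocation consumed.

    Assumed boundaries (not stated on the slide): tier 1 below 2%,
    tier 2 from 2% up to but not including 3%, tier 3 at 3% and above.
    """
    if monthly_allocation_sus <= 0:
        raise ValueError("monthly allocation must be positive")
    share = 100.0 * user_sus / monthly_allocation_sus
    if share < 2.0:
        return 1
    if share < 3.0:
        return 2
    return 3


if __name__ == "__main__":
    allocation = 100_000  # hypothetical monthly allocation, in SUs
    for used in (500, 2_000, 2_999, 3_000):
        tier = fair_use_tier(used, allocation)
        print(f"{used:>6} SUs -> tier {tier}: {STATUS[tier]}")
```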

Page 32: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

What happens if you build it and too many people come???

• make an explicit fair resource use policy

• make sure resource use delivers impact

• make sure resource use is efficient

• then expand resource base as required

Page 33: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

What happens if you build it and too many people come???

High-end users must acquire a personal allocation from the TRAC committee. This will ensure that very heavy users of the resource are supporting peer-reviewed research (and have a US institutional affiliation).

Page 34: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

What happens if you build it and too many people come???

Tools required to implement the CIPRES SG Fair Use Policy:

• ability to halt submissions from a given user account

• ability to charge to a user’s personal TG allocation

• ability to monitor usage by each account
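The three capabilities listed above map naturally onto a small per-account ledger. The following is a minimal sketch of that idea, not the gateway's actual code; the class names, the `COMMUNITY` label, and the example allocation ID are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional

COMMUNITY = "community-allocation"  # hypothetical label for the shared allocation


@dataclass
class Account:
    user: str
    sus_used: float = 0.0                      # monitored usage this period
    halted: bool = False                       # True blocks new submissions
    personal_allocation: Optional[str] = None  # set once the user has their own


class UsageLedger:
    """Tracks usage per account and decides where each job is charged."""

    def __init__(self) -> None:
        self.accounts: Dict[str, Account] = {}

    def account(self, user: str) -> Account:
        return self.accounts.setdefault(user, Account(user))

    def record_usage(self, user: str, sus: float) -> None:
        self.account(user).sus_used += sus

    def halt(self, user: str) -> None:
        self.account(user).halted = True

    def can_submit(self, user: str) -> bool:
        return not self.account(user).halted

    def charge_to(self, user: str) -> str:
        # Prefer the user's personal allocation when one is registered.
        return self.account(user).personal_allocation or COMMUNITY


if __name__ == "__main__":
    ledger = UsageLedger()
    ledger.record_usage("alice", 120.0)
    ledger.account("alice").personal_allocation = "TG-EXAMPLE123"  # hypothetical ID
    print(ledger.charge_to("alice"), ledger.can_submit("alice"))
```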

Page 35: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

What happens if you build it and too many people come???

• make an explicit fair resource use policy

• make sure resource use delivers impact

• make sure resource use is efficient

• then expand resource base as required

Page 36: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

Job Attrition on the CIPRES Science Gateway

Page 37: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

Error Impact analysis

Error type            CPU time   User   Staff
Input error           low        high   low
Machine error         0          high   low
Communication error   high       high   high
Unknown error         high       high   low

Page 38: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

Communication Errors

• occur when there is a break in communication between the web application and the TG resource.

• kill the job monitoring process for all running jobs.

• jobs continue to execute and consume SUs, but the web application can no longer return job results.

• the user must ask for the results, and a staff member must fetch them manually.

Page 39: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

CONCLUSION: Time to refactor the job monitoring system!

Page 40: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

Normal Operation

[Diagram labels: User Submission; Command line, files; Globus “gsissh”; TG Machine; Running Task Table; User DB]

LoadResults Daemon: detects results, fetches them via GridFTP, and puts the results in the user DB.

Page 41: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

Abnormal Operation:

[Diagram labels: User Submission; checkJobs Daemon; Globus “gsissh”; TG Machine; Running Task Table; User DB]

Applies if normal notification fails because of: machine outage; job timeout; CIPRES application down.

LoadResults Daemon: detects results with a delivery error, fetches them via GridFTP, and puts the results in the user DB.
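The recovery path described on these two slides can be sketched as two polling passes over a running task table: one that asks the remote machine whether a job has finished, and one that loads finished results into the user database. This is a simplified model under stated assumptions, not the CIPRES implementation. The remote status check and the file transfer are passed in as plain callables standing in for the gsissh and GridFTP steps, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    job_id: str
    state: str = "RUNNING"  # RUNNING -> FINISHED -> LOADED


def check_jobs(table: Dict[str, Task],
               remote_is_done: Callable[[str], bool]) -> List[str]:
    """Poll the remote resource for every task still marked RUNNING.

    Because the table is persistent, this works even if the process that
    originally submitted the job has died in the meantime.
    """
    finished = []
    for task in table.values():
        if task.state == "RUNNING" and remote_is_done(task.job_id):
            task.state = "FINISHED"
            finished.append(task.job_id)
    return finished


def load_results(table: Dict[str, Task],
                 fetch: Callable[[str], str],
                 user_db: Dict[str, str]) -> List[str]:
    """Fetch results for FINISHED tasks and store them in the user DB."""
    loaded = []
    for task in table.values():
        if task.state == "FINISHED":
            user_db[task.job_id] = fetch(task.job_id)
            task.state = "LOADED"
            loaded.append(task.job_id)
    return loaded


if __name__ == "__main__":
    table = {"job-1": Task("job-1"), "job-2": Task("job-2")}
    user_db: Dict[str, str] = {}
    # Stand-ins for the remote status check and the file transfer.
    check_jobs(table, remote_is_done=lambda job_id: job_id == "job-1")
    load_results(table, fetch=lambda job_id: f"results for {job_id}", user_db=user_db)
    print(user_db, {j: t.state for j, t in table.items()})
```

The design point is that job state lives in the table rather than in a long-running monitor process, so a communication break delays result delivery instead of losing it.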

Page 42: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

JOBS SAVED BY THE GSISSH / TASK TABLE SYSTEM

Code      SEPT   OCT
MrBayes   39     71
RAxML     106    172
GARLI     14     23
Total     159*   266*

* 7% of all submitted jobs

Page 43: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

Future plans to address other sources of attrition

User errors (15%): pre-check uploaded files for valid format.

Machine errors (8%): establish redundancy in where codes can run; use tools to check machine availability and queue depth.

Unknown errors (3%): TBD
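A pre-check for uploaded files could be as simple as a format validator that runs before any SUs are spent. The slides do not say which formats would be checked or how; the sketch below uses nucleotide FASTA purely as an illustration, and the accepted character set (IUPAC nucleotide codes plus gap and missing-data symbols) is an assumption.

```python
# IUPAC nucleotide codes plus gap ("-") and missing-data ("?") symbols.
VALID_CHARS = set("ACGTUNRYKMSWBDHV-?")


def fasta_problems(text: str) -> list:
    """Return a list of human-readable problems; an empty list means the file passed."""
    problems = []
    names = set()
    current = None  # name of the record being read
    residues = 0    # sequence characters seen for the current record
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None and residues == 0:
                problems.append(f"record '{current}' has no sequence")
            fields = line[1:].split()
            name = fields[0] if fields else ""
            if not name:
                problems.append(f"line {lineno}: empty record name")
            elif name in names:
                problems.append(f"line {lineno}: duplicate record name '{name}'")
            names.add(name)
            current, residues = name, 0
        elif current is None:
            problems.append(f"line {lineno}: sequence data before first '>' header")
        else:
            bad = set(line.upper()) - VALID_CHARS
            if bad:
                problems.append(f"line {lineno}: unexpected characters {sorted(bad)}")
            residues += len(line)
    if current is None:
        problems.append("no records found")
    elif residues == 0:
        problems.append(f"record '{current}' has no sequence")
    return problems


if __name__ == "__main__":
    print(fasta_problems(">taxon1\nACGT\n>taxon1\nAXGT\n"))
```

Rejecting a malformed upload at submission time costs the user a few seconds, whereas the same error discovered on the compute resource costs queue time as well.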

Page 44: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

What happens if you build it and too many people come???

• make an explicit fair resource use policy

• make sure resource use delivers impact

• make sure resource use is efficient

• then expand resource base as required

Page 45: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

What happens if you build it and too many people come???

[Chart: SUs in thousands (axis 500 to 4000) by month, with the trend marked "?!?".]

Page 46: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

What happens if you build it and too many people come???

[Chart: SUs in thousands (axis 500 to 4000) by month.]

Page 47: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

Other possible resource opportunities:

New TG machines (e.g. Trestles machine at SDSC)

The Open Science Grid (OSG)

The NSF FutureGrid project (adapting these HPC applications to a cloud environment).

Page 48: Creating the CIPRES Science Gateway for Inference of Large Phylogenetic Trees Mark A. Miller San Diego Supercomputer Center

Acknowledgements:

CIPRES Science Gateway: Terri Liebowitz

TeraGrid Hybrid Code Development: Wayne Pfeiffer, Alexandros Stamatakis

TeraGrid Implementation Support: Nancy Wilkins-Diehr, Doru Marcusiu

Workbench Framework: Paul Hoover, Lucie Chan