
Post on 24-Dec-2015


Page 1: Building Data-intensive Pipelines Ravi K Madduri Argonne National Lab University of Chicago

Building Data-intensive Pipelines

Ravi K Madduri, Argonne National Lab, University of Chicago


Recap from other talks on genomics

• FBIRN combining imaging, clinical and genetics data

• CIDR: providing better value to end users
  – Globus Online helps CIDR reliably transfer large sequencing data sets to end users

• Ivo and Fabio presented various challenges in building pipelines in genomics
  – Large data volumes
  – Multiple, complex analytical tools

• In this talk we will focus on how we can provide workflow capabilities to end users in a way that is both easy to use and scalable


Enter Galaxy

• A free (for everyone) web service integrating a wealth of tools, compute resources, terabytes of reference data and permanent storage

• Open source software that makes it easy to integrate your own tools and data and customize your own site

• Flexible architecture -> Customizable
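
To make "integrate your own tools" a little more concrete: Galaxy describes each tool with a small XML wrapper that names the command to run, its inputs, and its outputs. The sketch below builds such a wrapper in Python. It is only an illustration under stated assumptions: the helper `build_tool_wrapper`, the tool id `line_count`, and the `wc -l` command are hypothetical and do not come from the talk, and a real wrapper would normally be written by hand as an XML file and would carry more detail.

```python
import xml.etree.ElementTree as ET


def build_tool_wrapper(tool_id, name, command, input_format, output_format):
    """Return a minimal Galaxy-style tool wrapper as an XML string (hypothetical helper)."""
    tool = ET.Element("tool", id=tool_id, name=name, version="0.1.0")
    # Command line to run; $input and $output are placeholders filled in per job.
    ET.SubElement(tool, "command").text = command
    # One input dataset chosen by the user.
    inputs = ET.SubElement(tool, "inputs")
    ET.SubElement(inputs, "param", name="input", type="data",
                  format=input_format, label="Input dataset")
    # One output dataset produced by the command.
    outputs = ET.SubElement(tool, "outputs")
    ET.SubElement(outputs, "data", name="output", format=output_format)
    ET.SubElement(tool, "help").text = "Hypothetical example tool."
    return ET.tostring(tool, encoding="unicode")


wrapper_xml = build_tool_wrapper(
    tool_id="line_count",  # hypothetical tool id
    name="Count lines",
    command="wc -l < '$input' > '$output'",
    input_format="txt",
    output_format="txt",
)
print(wrapper_xml)
```

The point of the sketch is that adding a tool amounts to describing an existing command-line program; the analysis code itself does not change.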


Galaxy Adoption

• ~50 deployments of Galaxy
  – Galaxy for microarray analysis, machine learning, drug discovery, etc.

• ~130,000 jobs a month and growing on the public instance of Galaxy

• 1 TB/week in user uploads
  – 60 TB from China

• 150+ attendees at the Galaxy users conference
  – From 6 continents

• Adoption driven primarily by
  – Ease of use
  – Software as a service
  – Responsive to user needs
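
As a back-of-the-envelope reading of the usage figures above, the short calculation below converts them into sustained rates. It assumes a 30-day month and decimal terabytes (1 TB = 10^12 bytes); neither assumption is stated in the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60

jobs_per_month = 130_000        # figure from the slide
upload_bytes_per_week = 1e12    # 1 TB/week, assuming decimal terabytes
days_per_month = 30             # assumption, not from the slide

jobs_per_day = jobs_per_month / days_per_month
jobs_per_minute = jobs_per_day / (24 * 60)
upload_mb_per_second = upload_bytes_per_week / (7 * SECONDS_PER_DAY) / 1e6

print(f"{jobs_per_day:.0f} jobs/day, {jobs_per_minute:.1f} jobs/minute")
print(f"{upload_mb_per_second:.2f} MB/s sustained upload")
```

Under these assumptions the public instance handles roughly 4,300 jobs a day (about 3 per minute) and absorbs uploads at about 1.65 MB/s around the clock.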


Opportunities for BIRN collaborators

• Galaxy for biomedical informatics
  – Researchers can discover and download interesting and useful datasets provided by BIRN
  – Analyze data using various BIRN tools
  – Create and share pipelines with other researchers
  – Create virtual collaborations by leveraging flexible, secure user and group management


Use case: CVRG-Galaxy

• Created a Galaxy instance for the CVRG community

• Integrated it with Globus Online file transfer capabilities so researchers can get data for analysis

• Created a CVRG Toolbox in Galaxy with Bioconductor tools from CRData.org

• Investigating how individual PIs can contribute their own compute and storage


CVRG CRData Galaxy
