jsm madduri-august-2015

Finding Needles in a Haystack – Big Data Management and Analysis using Globus. Ravi Madduri, JSM 2015, Seattle, Washington. globus.org/genomics

Upload: ravi-madduri

Post on 12-Apr-2017

TRANSCRIPT

Page 1: Jsm madduri-august-2015

globus.org/genomics

Finding Needles in a Haystack – Big Data Management and Analysis using Globus

Ravi Madduri

JSM 2015, Seattle, Washington

Page 2: Jsm madduri-august-2015

Who We Are

• Globus Genomics is developed, operated, and supported by researchers, developers, and bioinformaticians at the Computation Institute – University of Chicago/Argonne National Lab
• We are a non-profit organization building solutions for non-profit researchers
• Our goal is to support the advancement of science by bringing together our strengths and capabilities to help meet the unique needs of researchers and research institutions

Page 3: Jsm madduri-august-2015

Finding needles in haystacks

Stages shown in the discovery-cycle diagram:
• Pose question
• Hypothesize explanation
• Design experiment
• Collect data
• Analyze data
• Identify patterns
• Test hypothesis
• Publish results

Page 4: Jsm madduri-august-2015

Imagine if a researcher, when tackling a problem, could easily:

• Assemble, integrate, and interpret all relevant data within a knowledge network
• Be informed of anomalies, patterns, gaps
• Formulate & apply computational models
• Outsource tasks if local expertise is lacking
• Launch automated processes to test hypotheses, expand knowledge network
• Pay for all this by taking on other tasks

Page 5: Jsm madduri-august-2015

We will cover

• Accelerating Scientific Discovery Process by providing Science as a Service
  – Research Data Management
  – Analyzing Research Data
    • Interactive Analysis
    • Large-scale Analysis
  – Publishing Results so others can
    • Discover
    • Validate
    • Reproduce/Use

Page 6: Jsm madduri-august-2015

90% of cancer patients carry a mutation that may be responsive to a known drug

Mark Rubin (Weill Cornell Medical College and NewYork-Presbyterian Hospital, New York), in Nature, April 2015

Page 7: Jsm madduri-august-2015

Trying to find a single causative gene for diseases with a complex genetic background is like looking for the proverbial needle in a haystack

– Nancy Cox (Vanderbilt)

Page 8: Jsm madduri-august-2015

Higgs discovery “only possible because of the extraordinary achievements of … grid computing”

Rolf Heuer, CERN DG

10s of PB, 100s of institutions, 1000s of scientists, 100Ks of CPUs, billions of tasks

Page 9: Jsm madduri-august-2015


How do we accelerate discovery without requiring that every lab acquire a haystack-sorting machine?

Clayton & Shuttleworth thresher, 1910: Museum Victoria, Australia

Page 10: Jsm madduri-august-2015

Managing big data with Globus

Locations shown in the diagram: Light Source, Compute Facility, Publication Repository, Personal Computer

1. PI initiates transfer request; or requested automatically by script, science gateway
2. Globus transfers files reliably, securely
3. PI selects files to share, selects user or group, and sets access permissions
4. Globus controls access to shared files on existing storage; no need to move files to cloud storage!
5. Researcher logs in to Globus and accesses shared files; no local account required; download via Globus
6. Researcher assembles data set; describes it using metadata (Dublin core and domain-specific)
7. Curator reviews and approves; data set published on campus or other system
8. Peers, collaborators search and discover datasets; transfer and share using Globus

• SaaS: only a web browser required
• Access using your campus credentials
• Globus monitors and informs throughout
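
The sharing steps on this slide (the PI selects files, picks a user or group, sets permissions, and access is then controlled on the existing storage) can be pictured with a small toy model. This is only an illustrative sketch in plain Python: the class `SharedCollection`, its methods, and the user, group, and path names are hypothetical and are not the Globus API.

```python
from dataclasses import dataclass, field


@dataclass
class SharedCollection:
    """Toy model of files shared in place on existing storage (hypothetical)."""

    # Maps a path prefix to {principal: permission}; permission is "r" or "rw".
    rules: dict[str, dict[str, str]] = field(default_factory=dict)
    # Maps a group name to the user names that belong to it.
    groups: dict[str, set[str]] = field(default_factory=dict)

    def grant(self, prefix: str, principal: str, permission: str) -> None:
        """Give a user or group access to everything under a path prefix."""
        if permission not in ("r", "rw"):
            raise ValueError("permission must be 'r' or 'rw'")
        self.rules.setdefault(prefix, {})[principal] = permission

    def can_read(self, user: str, path: str) -> bool:
        """A user may read a path if they, or a group they are in, hold a rule."""
        principals = {user} | {
            name for name, members in self.groups.items() if user in members
        }
        return any(
            path.startswith(prefix) and not principals.isdisjoint(rule)
            for prefix, rule in self.rules.items()
        )


# Hypothetical example: share one run directory with a lab group.
# Prefixes end in "/" so "/project/run42/" does not also match "/project/run420/".
collection = SharedCollection(groups={"example-lab": {"alice"}})
collection.grant("/project/run42/", "example-lab", "r")

print(collection.can_read("alice", "/project/run42/sample.fastq"))
print(collection.can_read("bob", "/project/run42/sample.fastq"))
```

The point of the sketch is that the files never move: only the access rules change, which matches the slide's "no need to move files to cloud storage".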

Page 11: Jsm madduri-august-2015

Globus Platform-as-a-Service

• Identity, Group, Profile Management Services
• Sharing Service
• Transfer Service
• Globus Toolkit
• Globus APIs
• Globus Connect

Page 12: Jsm madduri-august-2015

Globus Adoption and Usage

• 166,449 active Globus endpoints
• 27,961 users registered
• Biggest transfer: 500.42TB
• Longest running transfer: 182 days
• Fastest transfer: 58.5Gbps (average)
• 55TB moved per day, on average, since the service was launched in November 2010
• Average throughput: 637.7Mbps (since service launch)
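
To put these figures on a common scale, the short calculation below converts the daily volume into a sustained rate and estimates how long the biggest transfer would take at the quoted average throughput. It is a back-of-the-envelope sketch: it assumes decimal units (1 TB = 10^12 bytes), and the second number is purely hypothetical, since the slide does not say what rate the 500.42TB transfer actually achieved.

```python
TB = 10**12  # assumption: decimal terabytes (10^12 bytes)
SECONDS_PER_DAY = 86_400


def average_rate_gbps(bytes_moved: float, seconds: float) -> float:
    """Sustained rate in gigabits per second for a volume moved in a time span."""
    return bytes_moved * 8 / seconds / 1e9


def transfer_days(bytes_moved: float, rate_mbps: float) -> float:
    """Days needed to move a volume at a constant rate in megabits per second."""
    return bytes_moved * 8 / (rate_mbps * 1e6) / SECONDS_PER_DAY


# 55 TB per day, expressed as a sustained aggregate rate.
daily_rate = average_rate_gbps(55 * TB, SECONDS_PER_DAY)

# Hypothetical: the biggest transfer (500.42 TB) at the average throughput (637.7 Mbps).
days_for_biggest = transfer_days(500.42 * TB, 637.7)

print(f"55 TB/day is about {daily_rate:.2f} Gbps sustained")
print(f"500.42 TB at 637.7 Mbps would take about {days_for_biggest:.1f} days")
```

Under these assumptions the daily volume corresponds to roughly 5 Gbps sustained across the service, and a transfer the size of the largest one would take on the order of ten weeks at the average per-transfer rate.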

Page 13: Jsm madduri-august-2015

Analyzing Big Data using Globus Galaxies

Data management
• Data sources and endpoints shown: Sequencing Centers, Public Data, Storage, Local Cluster/Cloud, Seq Center, Research Lab
• Transfer protocols shown: FTP, SCP, HTTP, others
• Globus provides high-performance, fault-tolerant, secure file transfer between all data endpoints

Data analysis
• Globus Genomics, with Galaxy Data Libraries
• Pipeline shown: Fastq, Ref Genome; Alignment; Variant Calling (tools: Picard, GATK)
• Galaxy-based workflow management
  – Globus integrated within Galaxy
  – Web-based UI
  – Drag-Drop workflow creations
  – Easily modify workflows with new tools
• Globus Genomics on Amazon EC2
  – Analytical tools are automatically run on the scalable compute resources when possible

Page 14: Jsm madduri-august-2015

Our Science Stack

• Galaxy (SaaS)
  – Interactive execution, iPython, R
  – Creation, Execution, Sharing, Discovering Workflows
• Globus (PaaS)
  – Data management
  – Identity Management
• AWS (IaaS)
  – HTCondor, Chef, EC2, EBS, S3, SNS
  – Spot, Route 53, Cloud Formation

Page 15: Jsm madduri-august-2015


Examples of what researchers have done

Page 16: Jsm madduri-august-2015

Cox lab, UChicago

• 134 samples and 4 workflows
• 4 TB data initially
• 2200 core hours in 6 days

Page 17: Jsm madduri-august-2015


Consensus Caller

Page 18: Jsm madduri-august-2015

Figure panels:
• Rediscovery of previously observed variants
• Transition/Transversion Ratio
• Genotype Mendel Error Rate
• Distributions of Mendel Error Counts per Trio

Page 19: Jsm madduri-august-2015


Contaminated Samples

Page 20: Jsm madduri-august-2015

Olopade lab, UChicago

A profile of inherited predisposition to breast cancer among Nigerian women

Y. Zheng, T. Walsh, F. Yoshimatsu, M. Lee, S. Gulsuner, S. Casadei, A. Rodriguez, T. Ogundiran, C. Babalola, O. Ojengbede, D. Sighoko, R. Madduri, M.-C. King, O. Olopade

• 200 targeted exomes
• 200 GB data initially
• 76,920 core hours in 1.25 days
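
The core-hour figures quoted for the two labs translate directly into an average number of cores kept busy. The sketch below does that arithmetic, assuming that "in N days" means wall-clock time from start to finish; the function name is illustrative.

```python
def average_cores(core_hours: float, wall_days: float) -> float:
    """Average number of cores in use, given total core hours and wall-clock days."""
    return core_hours / (wall_days * 24)


# Figures from the slides; "days" is assumed to mean elapsed wall-clock time.
cox_lab = average_cores(2200, 6)
olopade_lab = average_cores(76_920, 1.25)

print(f"Cox lab: about {cox_lab:.1f} cores on average")
print(f"Olopade lab: about {olopade_lab:.0f} cores on average")
```

Read this way, the Cox lab run averaged about 15 cores over six days, while the Olopade lab run averaged about 2,564 cores over 30 hours, which illustrates how much the provisioned capacity can vary between analyses.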

Page 21: Jsm madduri-august-2015

Expanding Consensus Genotyper – SNVs, Indels, SVs

Diagram: RAW FASTQs are processed by several callers, each producing a VCF; the Consensus Genotyper combines these into a single VCF.

Callers shown:
• GATK Pipeline/HC
• GATK Pipeline/UG
• FreeBayes
• SAMtools mpileup
• Atlas2
• Delly/Contra

Page 22: Jsm madduri-august-2015

Preliminary Results are very encouraging

14 deleterious SNVs and 11 damaging Indels (BRCA1: 15, BRCA2: 4, PALB2: 2, BRIP1: 1, CHEK2: 1, NBN: 1, TP53: 1) were found in 29 subjects, and they were all confidently detected among 5 callers. Identified SNVs and Indels were all confirmed by Sanger sequencing.
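
The slides do not describe how the Consensus Genotyper combines the per-caller VCFs. One simple way to picture a consensus step is a vote count: keep a variant only if at least a chosen number of callers report it. The sketch below shows that idea under stated assumptions; it is not the Consensus Genotyper's actual algorithm, and the variant coordinates are made up for illustration.

```python
from collections import Counter
from typing import Iterable

# (chromosome, position, reference allele, alternate allele)
Variant = tuple[str, int, str, str]


def consensus_calls(
    call_sets: dict[str, Iterable[Variant]], min_callers: int
) -> set[Variant]:
    """Return the variants reported by at least `min_callers` different callers."""
    votes: Counter = Counter()
    for variants in call_sets.values():
        votes.update(set(variants))  # each caller votes at most once per variant
    return {variant for variant, count in votes.items() if count >= min_callers}


# Hypothetical call sets; positions and alleles are placeholders, not real results.
shared = ("chr17", 100, "A", "G")
partial = ("chr13", 200, "C", "T")
call_sets = {
    "GATK-HC": [shared, partial],
    "GATK-UG": [shared, partial],
    "FreeBayes": [shared],
    "SAMtools": [shared, partial],
    "Atlas2": [shared],
}

unanimous = consensus_calls(call_sets, min_callers=5)
majority = consensus_calls(call_sets, min_callers=3)
print(sorted(unanimous))
print(sorted(majority))
```

Raising `min_callers` trades sensitivity for confidence: requiring all five callers keeps only the variant every tool agrees on, while a threshold of three also keeps the variant seen by a majority.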

Page 23: Jsm madduri-august-2015

Combining Data management and Analysis

Steps shown in the diagram (numbered as on the slide):
1. Query and discover data
2. Transfer bags
3. Execute parallel alignment workflow on dynamically provisioned cloud resources
3. Publish bags
4. Discover published data and execute comparison workflow

Other labels in the diagram:
• Data sources: PPMI, ADNI
• Samples: Adenocarcinoma, Adrenal, Brain
• Workflow stages: QC, Alignment, Alignment QC, Feature count, Differential expression
• Outputs: Alignment Files
• Services: ERMrest, BDDS Collection
• Links: http://bit.ly/1M0h6Yx, http://bit.ly/A10R89y
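
The "bags" that are transferred and published here are presumably packages in the BagIt style, where payload files sit under `data/` and a manifest lists a checksum for each one; the slide does not say so explicitly, so treat that as an assumption. The sketch below builds and verifies a minimal bag of that shape with the standard library only. The function names and the file name are hypothetical, and this is not the tooling used in the project.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir: Path, payload: dict[str, bytes]) -> None:
    """Write payload files under data/ plus a SHA-256 manifest and bagit.txt."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        lines.append(f"{hashlib.sha256(content).hexdigest()}  data/{name}\n")
    (bag_dir / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )


def verify_bag(bag_dir: Path) -> bool:
    """Recompute each payload checksum and compare it with the manifest."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp) / "alignment-bag"
    make_bag(bag, {"sample1.bam": b"placeholder alignment bytes"})
    intact = verify_bag(bag)  # checksums match right after creation

    # Simulate corruption in transit, then verify again.
    (bag / "data" / "sample1.bam").write_bytes(b"corrupted")
    damaged = verify_bag(bag)

print(intact, damaged)
```

Carrying the manifest with the data is what lets the receiving side of a transfer, or a later consumer of a published collection, check that the files it has are the ones that were packaged.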

Page 24: Jsm madduri-august-2015


Gene Expression Results

Page 25: Jsm madduri-august-2015

Globus Genomics at a glance

• 30 institutions, groups, labs
• 10s of millions of core hours
• 2 PBs raw sequences analyzed
• >1500 analysis tools
• 1000s genomes processed
• >50 workflows
• 99% uptime over the past two years
• 1 PB largest single transfer to do
• 5 days longest running workflow
• 100s different species

Page 26: Jsm madduri-august-2015

Other Globus Genomics users

• Dobyns Lab
• Cox Lab
• Volchenboum Lab
• Olopade Lab
• Nagarajan Lab

Page 27: Jsm madduri-august-2015

Costs are remarkably low

Pricing includes
• Estimated compute
• Storage (one month)
• Globus Genomics platform usage
• Support

Page 28: Jsm madduri-august-2015


Globus Genomics – Making it routine to find needles in NGS haystacks

www.globus.org/genomics

Page 29: Jsm madduri-august-2015

Other Examples of Science as a Service

• PDACS - Portal for data analysis services for cosmological simulations
• CVRG Galaxy – Large-scale ECG Data Analysis
• Globus Proteomics
• eMatter – Material Science Simulations
• FACE-IT - Framework to Advance Climate, Economic, and Impact Investigations with Information Technology (usefaceit.org)

Page 30: Jsm madduri-august-2015

• More information on Globus Genomics: www.globus.org/genomics
• More information on Globus: www.globus.org

Page 31: Jsm madduri-august-2015

Our work is supported by: U.S. Department of Energy

Page 32: Jsm madduri-august-2015


Thank you!

@madduri