“Creating a High Performance Cyberinfrastructure to Support Analysis of Illumina Metagenomic Data” DNA Day Department of Computer Science and Engineering University of California, San Diego September 16, 2015 Dr. Larry Smarr Director, California Institute for Telecommunications and Information Technology Harry E. Gruber Professor, Dept. of Computer Science and Engineering Jacobs School of Engineering, UCSD http://lsmarr.calit2.net

Upload: shauna-warner

Post on 12-Jan-2016


TRANSCRIPT

Page 1

“Creating a High Performance Cyberinfrastructure to Support Analysis of Illumina Metagenomic Data”

DNA Day

Department of Computer Science and Engineering

University of California, San Diego

September 16, 2015

Dr. Larry Smarr

Director, California Institute for Telecommunications and Information Technology

Harry E. Gruber Professor,

Dept. of Computer Science and Engineering

Jacobs School of Engineering, UCSD

http://lsmarr.calit2.net

Page 2

The National Science Foundation Has Funded Over 100 Campuses to Build “Big Data Freeways”

134 awards, 128 projects - All but 4 states - 120+ institutions

Page 3

Creating a “Big Data” Plane on Campus: NSF-Funded Prism@UCSD and CHERuB

Prism@UCSD: Phil Papadopoulos, SDSC, Calit2, PI

CHERuB: Mike Norman, SDSC, PI

Page 4

SDSC Big Data Compute/Storage Facility - Interconnected at Over 1 Tbps

[Figure labels: SDSC Supercomputers: Comet (VM SC, 2 PF) and Gordon (Big Data SC); Oasis Data Store; 6000 TB, > 800 Gbps; Arista Router Can Switch 576 10 Gbps Light Paths; # of Parallel 10 Gbps Optical Light Paths: 128 (shown three times); 128 x 10 Gbps = 1.3 Tbps]

Source: Philip Papadopoulos, SDSC/Calit2
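The aggregate rate quoted on this slide is the number of parallel light paths multiplied by the per-path rate. A minimal Python sketch of that arithmetic follows; the function name `aggregate_tbps` is an illustrative choice and does not come from the slides. For 128 paths it gives 1.28 Tbps, which the slide rounds to 1.3 Tbps.

```python
def aggregate_tbps(num_paths: int, gbps_per_path: float = 10.0) -> float:
    """Aggregate bandwidth in Tbps for a bundle of parallel light paths."""
    return num_paths * gbps_per_path / 1000.0  # 1 Tbps = 1000 Gbps


# 128 parallel 10 Gbps paths, as labeled on the slide
sdsc_link_tbps = aggregate_tbps(128)

# 576 switchable 10 Gbps paths on the Arista router, same calculation
router_tbps = aggregate_tbps(576)

print(f"128 x 10 Gbps = {sdsc_link_tbps:.2f} Tbps")
print(f"576 x 10 Gbps = {router_tbps:.2f} Tbps")
```

These are peak link rates; sustained file-transfer rates depend on storage and protocol overheads that the slide does not quantify.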

Page 5

Prism@UCSD Will Link Computational Mass Spectrometry and Genome Sequencing Cores to the Big Data Freeway

ProteoSAFe: Compute-intensive discovery MS at the click of a button

MassIVE: repository and identification platform for all MS data in the world

Source: proteomics.ucsd.edu

Page 6

IDI Enhanced Cyberinfrastructure Supporting Knight Lab

[Figure labels: FIONA (12 Cores/GPU, 128 GB RAM, 3.5 TB SSD, 48 TB Disk, 10 Gbps NIC); Knight Lab; 10 Gbps; Gordon; Prism@UCSD; Data Oasis (7.5 PB, 100 GB/s); Knight 1024 Cluster in SDSC Co-Lo; CHERuB 100 Gbps; Emperor & Other Vis Tools; 64 Mpixel Data Analysis Wall; 120 Gbps; 40 Gbps]

Page 7

The Pacific Wave Platform Creates a Regional Science-Driven “Big Data Freeway System”

Source: John Hess, CENIC

Funded by NSF $5M Oct 2015-2020

Flash Disk to Flash Disk File Transfer Rate

Page 8

Coupling Supercomputing to Illumina Metagenomics Sequencing

Inflammatory Bowel Disease (IBD) Patients
– 5 Ileal Crohn’s Patients, 3 Points in Time
– 2 Ulcerative Colitis Patients, 6 Points in Time

“Healthy” Individuals
– 250 Subjects, 1 Point in Time

Larry Smarr (Colonic Crohn’s)
– 7 Points in Time

Each Sample Has 100-200 Million Illumina Short Reads (100 bases)

Total of 27 Billion Reads or 2.7 Trillion Bases

Source: Jerry Sheehan, Calit2; Weizhong Li, Sitao Wu, CRBS, UCSD
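The totals on this slide follow from read count times read length. The Python sketch below checks that 27 billion reads of 100 bases give 2.7 trillion bases, and shows the per-sample range implied by 100-200 million reads per sample. The constant and function names are illustrative and not taken from the slides.

```python
READ_LENGTH_BASES = 100          # Illumina short-read length stated on the slide
TOTAL_READS = 27_000_000_000     # 27 billion reads across all samples


def total_bases(reads: int, read_length: int = READ_LENGTH_BASES) -> int:
    """Number of sequenced bases for a given number of fixed-length reads."""
    return reads * read_length


all_bases = total_bases(TOTAL_READS)
per_sample_low = total_bases(100_000_000)    # 100 million reads per sample
per_sample_high = total_bases(200_000_000)   # 200 million reads per sample

print(f"Total: {all_bases / 1e12:.1f} trillion bases")
print(f"Per sample: {per_sample_low / 1e9:.0f}-{per_sample_high / 1e9:.0f} billion bases")
```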

Page 9

We Created a Reference Database of Known Gut Genomes

• NCBI April 2013
– 2471 Complete + 5543 Draft Bacteria & Archaea Genomes
– 2399 Complete Virus Genomes
– 26 Complete Fungi Genomes
– 309 HMP Eukaryote Reference Genomes

• Total 10,741 genomes, ~30 GB of sequences

Now to Align Our 27 Billion Reads Against the Reference Database

Source: Weizhong Li, Sitao Wu, CRBS, UCSD

Page 10

Computational NextGen Sequencing Pipeline: From Sequence to Taxonomy and Function

PI: Weizhong Li, CRBS, UCSD; NIH R01HG005978 (2010-2013, $1.1M)

Page 11

Mapping Out the Dynamics of Autoimmune Microbiome Ecology Couples Next Generation Genome Sequencers to Big Data Supercomputers

Source: Weizhong Li, UCSD

Our Team Used 25 CPU-years to Compute Comparative Gut Microbiomes Starting From 2.7 Trillion DNA Bases of My Samples and Healthy and IBD Controls

Illumina HiSeq 2000 at JCVI

SDSC Gordon Data Supercomputer

Page 12

Next Step: Programmability, Scalability and Reproducibility Using bioKepler

www.kepler-project.org

www.biokepler.org

National Resources

(Gordon) (Comet) (Stampede) (Lonestar)

Cloud Resources

Optimized

Local Cluster Resources

Source: Ilkay Altintas, SDSC

Page 13

We Found Major State Shifts in Microbial Ecology Phyla Between Healthy and Two Forms of IBD

Most Common Microbial Phyla

[Figure panels: Average HE; Average Ulcerative Colitis; Average LS; Average Crohn’s Disease]

Collapse of Bacteroidetes

Explosion of Actinobacteria

Explosion of Proteobacteria

Hybrid of UC and CD

High Level of Archaea

Page 14

Our Relative Abundance Results Across ~300 People Reveal Potential Diagnostic Species

UC 100x Healthy

UC 100x CD

We Produced Similar Results for ~2500 Microbial Species

Healthy 100x CD

Page 15

Dell Analytics Separates the 4 Patient Types in Our Data Using Our Microbiome Species Data

Source: Thomas Hill, Ph.D., Executive Director Analytics, Dell | Information Management Group, Dell Software

Healthy

Ulcerative Colitis

Colonic Crohn’s

Ileal Crohn’s

Page 16

I Built on Dell Analytics to Show Dynamic Evolution of My Microbiome Toward and Away from Healthy State – Colonic Crohn’s

Healthy

Ileal Crohn’s

Seven Time Samples Over 1.5 Years

Colonic Crohn’s

Page 17

Time Series Reveals Oscillations in Immune Biomarkers Associated with Time Progression of Autoimmune Disease

[Figure rows: Immune & Inflammation Variables; Weekly Symptoms; Pharma Therapies; Stool Samples. Time axis: 2009 2010 2011 2012 2013 2014 2015]

Page 18

UC San Diego Will Be Carrying Out a Major Clinical Study of IBD Using These Techniques

Inflammatory Bowel Disease Biobank for Healthy and Disease Patients

Drs. William J. Sandborn, John Chang, & Brigid Boland, UCSD School of Medicine, Division of Gastroenterology

Over 200 Enrolled

Announced November 7, 2014

Page 19

Next Step: Knight/Smarr Lab Collaboration

• Smarr Gut Microbiome Time Series
– From 7 to 50 Times Over Four Years

• Healthy Human Microbiome
– Use 255+ Raw Reads from NIH Human Microbiome Project

• IBD Patients: From 5 Crohn’s Disease and 2 Ulcerative Colitis Patients to ~100
– 50 Carefully Phenotyped Patients Drawn from Sandborn BioBank
– 43 Metagenomes from the RISK Cohort of Newly Diagnosed IBD Patients

• Illumina Reagent Grant Key
– Enables Deep Metagenomic (and 16S) Sequencing at IGM of Smarr + Sandborn Samples

• New Software Suite from Knight Lab
– Major Re-annotation of Reference Genomes, Functional and Taxonomic Variations
– Novel Assembly Algorithms from Pavel Pevzner (Very Computationally Intensive)
– See Talk Later This Morning

• Supercomputer Grant on SDSC Comet (Awarded from XSEDE)
– From 25 Gordon to 100 Comet Core-Years
– Each Comet Core 40 GF Peak = 2x Gordon Core: 8x Increase in Compute
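The 8x figure combines two factors stated on the slide: four times as many core-years (25 on Gordon to 100 on Comet) and a Comet core with twice the peak rate of a Gordon core. The Python sketch below works through that arithmetic. The function and constant names are illustrative, and the 20 GF Gordon per-core peak is not stated on the slide; it is inferred here from "40 GF Peak = 2x Gordon Core".

```python
GORDON_CORE_YEARS = 25            # allocation used for the first analysis
COMET_CORE_YEARS = 100            # new allocation on Comet
COMET_PEAK_GF = 40.0              # peak GFLOPS per Comet core, from the slide
COMET_TO_GORDON_CORE_RATIO = 2.0  # "2x Gordon Core", from the slide


def compute_increase(old_core_years: float, new_core_years: float,
                     per_core_speedup: float) -> float:
    """Ratio of new to old compute, measured in old-core-equivalent years."""
    return (new_core_years / old_core_years) * per_core_speedup


increase = compute_increase(GORDON_CORE_YEARS, COMET_CORE_YEARS,
                            COMET_TO_GORDON_CORE_RATIO)

# Inferred, not stated on the slide: Gordon per-core peak in GFLOPS
gordon_peak_gf = COMET_PEAK_GF / COMET_TO_GORDON_CORE_RATIO

print(f"Compute increase: {increase:.0f}x")
print(f"Inferred Gordon core peak: {gordon_peak_gf:.0f} GF")
```

This is a peak-rate comparison; the delivered speedup for an alignment or assembly workload would depend on memory and I/O behavior that the slide does not cover.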