“Creating a High Performance Cyberinfrastructure to Support Analysis of Illumina Metagenomic Data”
DNA Day
Department of Computer Science and Engineering
University of California, San Diego
September 16, 2015
Dr. Larry Smarr
Director, California Institute for Telecommunications and Information Technology
Harry E. Gruber Professor,
Dept. of Computer Science and Engineering
Jacobs School of Engineering, UCSD
http://lsmarr.calit2.net
The National Science Foundation Has Funded Over 100 Campuses to Build “Big Data Freeways”
134 awards, 128 projects - All but 4 states - 120+ institutions
Creating a “Big Data” Plane on Campus: NSF Funded Prism@UCSD and CHERuB
Prism@UCSD, Phil Papadopoulos, SDSC, Calit2, PI
CHERuB, Mike Norman, SDSC PI
SDSC Big Data Compute/Storage Facility - Interconnected at Over 1 Tbps
SDSC Supercomputers: Comet (VM SC, 2 PF), Gordon (Big Data SC)
Oasis Data Store: 6000 TB, > 800 Gbps
Arista Router Can Switch 576 10Gbps Light Paths
# of Parallel 10Gbps Optical Light Paths: 128
128 x 10Gbps = 1.3Tbps
Source: Philip Papadopoulos, SDSC/Calit2
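The aggregate bandwidth quoted on this slide follows from simple multiplication. A minimal sketch of that arithmetic is below; the function and variable names are chosen here for illustration and do not come from the slides.

```python
def aggregate_gbps(n_paths: int, gbps_per_path: float) -> float:
    """Total capacity of n parallel optical light paths, in Gbps."""
    return n_paths * gbps_per_path

total_gbps = aggregate_gbps(128, 10)   # 128 parallel 10 Gbps light paths
total_tbps = total_gbps / 1000         # decimal units: 1 Tbps = 1000 Gbps
print(f"{total_gbps:.0f} Gbps = {total_tbps:.2f} Tbps")  # 1280 Gbps = 1.28 Tbps
```

The exact product is 1.28 Tbps, which the slide rounds to 1.3 Tbps.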
Prism@UCSD Will Link Computational Mass Spectrometry and Genome Sequencing Cores to the Big Data Freeway
ProteoSAFe: Compute-intensive discovery MS at the click of a button
MassIVE: repository and identification platform for all MS data in the world
Source: proteomics.ucsd.edu
IDI Enhanced Cyberinfrastructure Supporting Knight Lab
FIONA: 12 Cores/GPU, 128 GB RAM, 3.5 TB SSD, 48 TB Disk, 10Gbps NIC
Knight Lab
Gordon
Prism@UCSD
Data Oasis: 7.5 PB, 100 GB/s
Knight 1024 Cluster in SDSC Co-Lo
CHERuB: 100Gbps
Emperor & Other Vis Tools
64Mpixel Data Analysis Wall
Link Speeds Shown in Diagram: 10Gbps, 40Gbps, 120Gbps
The Pacific Wave Platform Creates a Regional Science-Driven “Big Data Freeway System”
Source: John Hess, CENIC
Funded by NSF $5M Oct 2015-2020
Flash Disk to Flash Disk File Transfer Rate
Coupling Supercomputing to Illumina Metagenomics Sequencing
“Healthy” Individuals: 250 Subjects, 1 Point in Time
Larry Smarr (Colonic Crohn’s): 7 Points in Time
Inflammatory Bowel Disease (IBD) Patients:
5 Ileal Crohn’s Patients, 3 Points in Time
2 Ulcerative Colitis Patients, 6 Points in Time
Each Sample Has 100-200 Million Illumina Short Reads (100 bases)
Total of 27 Billion Reads or 2.7 Trillion Bases
Source: Jerry Sheehan, Calit2; Weizhong Li, Sitao Wu, CRBS, UCSD
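The two totals on this slide are related by the read length. As a quick check, a minimal sketch of the conversion from reads to bases, using only the figures stated on the slide (the constant names are illustrative):

```python
READS_TOTAL = 27_000_000_000   # 27 billion reads (from the slide)
READ_LENGTH = 100              # bases per Illumina short read (from the slide)

total_bases = READS_TOTAL * READ_LENGTH
print(f"{total_bases / 1e12:.1f} trillion bases")  # 2.7 trillion bases
```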
We Created a Reference Database of Known Gut Genomes
• NCBI April 2013
– 2471 Complete + 5543 Draft Bacteria & Archaea Genomes
– 2399 Complete Virus Genomes
– 26 Complete Fungi Genomes
– 309 HMP Eukaryote Reference Genomes
• Total 10,741 genomes, ~30 GB of sequences
Now to Align Our 27 Billion Reads Against the Reference Database
Source: Weizhong Li, Sitao Wu, CRBS, UCSD
Computational NextGen Sequencing Pipeline: From Sequence to Taxonomy and Function
PI: Weizhong Li, CRBS, UCSD; NIH R01HG005978 (2010-2013, $1.1M)
Mapping Out the Dynamics of Autoimmune Microbiome Ecology Couples Next Generation Genome Sequencers to Big Data Supercomputers
Source: Weizhong Li, UCSD
Our Team Used 25 CPU-years to Compute Comparative Gut Microbiomes Starting From 2.7 Trillion DNA Bases of My Samples and Healthy and IBD Controls
Illumina HiSeq 2000 at JCVI
SDSC Gordon Data Supercomputer
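To give a sense of scale for 25 CPU-years, the sketch below converts it to core-hours and to an idealized wall-clock time. It assumes a 365-day year and perfect parallel scaling; the 1024-core count in the usage line is a hypothetical value picked for illustration, not a figure the slides give for this run.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: 365-day year

def cpu_years_to_core_hours(cpu_years: float) -> float:
    """Convert CPU-years to core-hours."""
    return cpu_years * HOURS_PER_YEAR

def ideal_wall_clock_days(cpu_years: float, n_cores: int) -> float:
    """Wall-clock days if the work split perfectly across n_cores (an idealization)."""
    return cpu_years_to_core_hours(cpu_years) / n_cores / 24

core_hours = cpu_years_to_core_hours(25)
print(f"{core_hours:,.0f} core-hours")  # 219,000 core-hours
print(f"{ideal_wall_clock_days(25, 1024):.1f} days on 1024 cores (hypothetical)")
```

Real pipelines rarely scale perfectly, so the wall-clock figure is a lower bound under these assumptions.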
Next Step: Programmability, Scalability and Reproducibility Using bioKepler
www.kepler-project.org
www.biokepler.org
National Resources
(Gordon) (Comet) (Stampede) (Lonestar)
Cloud Resources
Optimized
Local Cluster Resources
Source: Ilkay Altintas, SDSC
We Found Major State Shifts in Microbial Ecology Phyla Between Healthy and Two Forms of IBD
Most Common Microbial Phyla
Average HE
Average Ulcerative Colitis
Average LS
Average Crohn’s Disease
Collapse of Bacteroidetes
Explosion of Actinobacteria
Explosion of Proteobacteria
Hybrid of UC and CD
High Level of Archaea
Our Relative Abundance Results Across ~300 People Reveal Potential Diagnostic Species
UC 100x Healthy
UC 100x CD
We Produced Similar Results for ~2500 Microbial Species
Healthy 100x CD
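The "100x" labels compare a species' relative abundance between groups. The slides do not say how the abundances were computed or normalized, so the following is only a sketch of one simple approach: the fraction of mapped reads per species, followed by a ratio between groups. All species names and read counts are hypothetical toy values, not data from the study.

```python
def relative_abundance(counts: dict[str, int]) -> dict[str, float]:
    """Fraction of mapped reads assigned to each species in one group."""
    total = sum(counts.values())
    return {species: n / total for species, n in counts.items()}

def fold_change(group_a: dict[str, float], group_b: dict[str, float], species: str) -> float:
    """How many times more abundant a species is in group_a than in group_b."""
    return group_a[species] / group_b[species]

# Hypothetical mapped-read counts (toy values for illustration only)
healthy_counts = {"species_A": 5, "species_B": 995}
uc_counts = {"species_A": 500, "species_B": 500}

healthy = relative_abundance(healthy_counts)
uc = relative_abundance(uc_counts)
print(f"species_A: UC is {fold_change(uc, healthy, 'species_A'):.0f}x Healthy")
```

A production analysis would typically also account for genome length and sequencing depth, which this sketch ignores.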
Dell Analytics Separates The 4 Patient Types in Our Data Using Our Microbiome Species Data
Source: Thomas Hill, Ph.D., Executive Director Analytics, Dell | Information Management Group, Dell Software
Healthy
Ulcerative Colitis
Colonic Crohn’s
Ileal Crohn’s
I Built on Dell Analytics to Show Dynamic Evolution of My Microbiome Toward and Away from Healthy State – Colonic Crohn’s
Healthy
Ileal Crohn’s
Seven Time Samples Over 1.5 Years
Colonic Crohn’s
Time Series Reveals Oscillations in Immune Biomarkers Associated with Time Progression of Autoimmune Disease
Immune & Inflammation Variables
Weekly Symptoms
Pharma Therapies
Stool Samples
Time axis: 2009 2010 2011 2012 2013 2014 2015
UC San Diego Will Be Carrying Out a Major Clinical Study of IBD Using These Techniques
Inflammatory Bowel Disease Biobank For Healthy and Disease Patients
Drs. William J. Sandborn, John Chang, & Brigid Boland, UCSD School of Medicine, Division of Gastroenterology
Over 200 Enrolled
Announced November 7, 2014
Next Step: Knight/Smarr Lab Collaboration
• Smarr Gut Microbiome Time Series
– From 7 to 50 Times Over Four Years
• Healthy Human Microbiome
– Use 255+ Raw Reads from NIH Human Microbiome Project
• IBD Patients: From 5 Crohn’s Disease and 2 Ulcerative Colitis Patients to ~100
– 50 Carefully Phenotyped Patients Drawn from Sandborn BioBank
– 43 Metagenomes from the RISK Cohort of Newly Diagnosed IBD Patients
• Illumina Reagent Grant Key
– Enables Deep Metagenomic (and 16S) Sequencing at IGM of Smarr + Sandborn Samples
• New Software Suite from Knight Lab
– Major Re-annotation of Reference Genomes, Functional and Taxonomic Variations
– Novel Assembly Algorithms from Pavel Pevzner (Very Computationally Intensive)
– See Talk Later This Morning
• Supercomputer Grant On SDSC Comet (Awarded from XSEDE)
– From 25 Gordon to 100 Comet Core-Years
– Each Comet Core 40GF Peak = 2x Gordon Core: 8X Increase in Compute
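The 8X figure for the Comet grant combines the larger allocation with the faster cores. A minimal sketch of that arithmetic, using only the numbers stated on the slide (variable names are illustrative):

```python
gordon_core_years = 25      # earlier Gordon allocation
comet_core_years = 100      # new Comet allocation
comet_vs_gordon_core = 2    # each Comet core ~2x a Gordon core at peak (per the slide)

# Express the Comet allocation in Gordon-equivalent core-years, then compare.
gordon_equivalent = comet_core_years * comet_vs_gordon_core
increase = gordon_equivalent / gordon_core_years
print(f"{increase:.0f}x increase in compute")  # 8x increase in compute
```

This compares peak rates only; sustained throughput on a real pipeline may differ.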