a workflow-driven discovery and training ecosystem for distributed analysis of biomedical big data
TRANSCRIPT
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data
İlkay ALTINTAŞ, Ph.D.
Chief Data Science Officer, San Diego Supercomputer Center
Founder and Director, Workflows for Data Science Center of Excellence
SAN DIEGO SUPERCOMPUTER CENTER at UC San Diego
Providing Cyberinfrastructure for Research and Education
• Established as a national supercomputer resource center in 1985 by NSF
• A world leader in HPC, data-intensive computing, and scientific data management
• Current strategic focus on “Big Data”, “versatile computing”, and “life sciences applications”
1985
today
Two discoveries in drug design from 1987 and 1991.
Ross Walker Group
SDSC continues to be a leader in scientific computing and big data!
Gordon: First Flash-based Supercomputer for Data-intensive Apps
Comet: Serving the Long Tail of Science
27 standard racks = 1944 nodes = 46,656 cores = 249 TB DRAM = 622 TB SSD
~ 2 Pflop/s
• 36 GPU nodes
• 4 Large Memory nodes
• 7 PB Lustre storage
• High performance virtualization
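The Comet totals above can be turned into rough per-node averages. This is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 1000 GB) and spreads the totals evenly over all 1944 nodes, which may not match the actual configuration of the GPU or large-memory nodes.

```python
# Totals quoted on the slide.
nodes = 1944
cores = 46_656
dram_tb = 249
ssd_tb = 622

# Assumption: decimal units (1 TB = 1000 GB) and an even split across nodes.
cores_per_node = cores // nodes
dram_gb_per_node = dram_tb * 1000 / nodes
ssd_gb_per_node = ssd_tb * 1000 / nodes

print(f"cores per node: {cores_per_node}")
print(f"DRAM per node:  {dram_gb_per_node:.0f} GB")
print(f"SSD per node:   {ssd_gb_per_node:.0f} GB")
```

Under these assumptions the averages come out to 24 cores, about 128 GB of DRAM, and about 320 GB of SSD per node.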
SDSC Data Science Office
-- Expertise, Systems and Training for Data Science Applications --
SDSC Data Science Office (DSO)
SDSC DSO is a collaborative virtual organization at SDSC for collective lasting innovation in data science research, development and education.
DSO
SDSC Expertise and Strengths
Big Data Platforms
Training
Industry
Applications
Life Sciences is an ongoing strategic application thrust at SDSC…
Genomic Analysis is a Big Data and Big Compute Problem
BIG DATA
COMPUTING AT SCALE
Enables dynamic data-driven applications
Computer-Aided Drug Discovery
Personalized Precision Medicine
Requires:
• Data management
• Data-driven methods
• Scalable tools for dynamic coordination and resource optimization
• Skilled interdisciplinary workforce
Team work and process management
Vaccine Development
Metagenomics
…
New era of data science!
Needs and Trends for the New Era Data Science
-- the Big Data Era Goals --
• More data-driven
• More dynamic
• More process-driven
• More collaborative
• More accountable
• More reproducible
• More interactive
• More heterogeneous
Volume: Scalable batch processing
Velocity: Stream processing
Variety: Extensible data storage, access and integration
Genomic Data Management and Processing in the Big Data Era has Unique Challenges!
HBase, Hive, Pig, Zookeeper, Giraph, Storm, Spark, MapReduce, YARN, MongoDB, Cassandra, HDFS, Flink
Lower levels: Storage and scheduling
Higher levels: Interactivity
These challenges push for new tools to tackle them.
COORDINATION AND WORKFLOW MANAGEMENT
DATA INTEGRATION AND PROCESSING
DATA MANAGEMENT AND STORAGE
How do we use these new tools and combine them with existing domain-specific solutions in scientific computing and data science?
Layer 1: Data Management and Storage
Layer 2: Data Integration and Processing
HBase, Hive, Pig, Zookeeper, Giraph, Storm, Spark, MapReduce, YARN, MongoDB, Cassandra, HDFS, Flink + application-specific libraries
Most of the time, more than one analysis needs to take place…
And each analysis has multiple steps to integrate!
Pipelining is a way to put the steps together.
Source: http://www.slideshare.net/BigDataCloud/big-data-analytics-with-google-cloud-platform
Source: https://www.mapr.com/blog/distributed-stream-and-graph-processing-apache-flink
Source: https://www.computer.org/csdl/mags/so/2016/02/mso2016020060.html
Source: http://www.slideshare.net/ThoughtWorks/big-data-pipeline-with-scala
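The idea of pipelining analysis steps, as described above, can be sketched in a few lines of plain Python. This is a minimal illustration, not the Kepler or bioKepler API: the `pipeline` helper, the three step functions, and the toy reads are all hypothetical names and data chosen for the example.

```python
from functools import reduce
from typing import Callable, List

Step = Callable[[list], list]


def pipeline(*steps: Step) -> Step:
    """Chain steps so the output of one becomes the input of the next."""
    def run(data: list) -> list:
        return reduce(lambda acc, step: step(acc), steps, data)
    return run


# Hypothetical analysis steps operating on toy sequence reads.
def drop_short_reads(reads: List[str]) -> List[str]:
    return [r for r in reads if len(r) >= 4]


def to_upper(reads: List[str]) -> List[str]:
    return [r.upper() for r in reads]


def gc_content(reads: List[str]) -> list:
    return [(r, (r.count("G") + r.count("C")) / len(r)) for r in reads]


analyze = pipeline(drop_short_reads, to_upper, gc_content)
result = analyze(["acgt", "gg", "ggcc"])
print(result)
```

Each step only needs to agree with its neighbors on the data it passes along, so steps can be reordered, swapped, or reused in other pipelines.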
Layer 3: Coordination and Workflow Management
COORDINATION AND WORKFLOW MANAGEMENT
ACQUIRE, PREPARE, ANALYZE, REPORT, ACT …
kepler-project.org
Workflows for Data Science Center of Excellence at SDSC
Building functional, operational and reproducible solution
architectures using big data and HPC tools is what we do.
Focus on the question, not the technology!
• Access and query data
• Scale computational analysis
• Increase reuse
• Save time, energy and money
• Formalize and standardize
Real-Time Hazards Management: wifire.ucsd.edu
Data-Parallel Bioinformatics: bioKepler.org
Scalable Automated Molecular Dynamics and Drug Discovery: nbcr.ucsd.edu
WorDS.sdsc.edu
bioKepler: A Kepler Module for Bio Big Data Analysis
Example from 2013: Inflammatory Bowel Disease (IBD)
Source: Larry Smarr, Calit2
• Metagenomic Sequencing
  • JCVI Produced
    • ~150 Billion DNA Bases From Seven of LS Stool Samples Over 1.5 Years
  • ~3 Trillion DNA Bases From NIH Human Microbiome Program Data Base
    • 255 Healthy People, 21 with IBD
Illumina HiSeq 2000 at JCVI
SDSC Gordon Data Supercomputer
• Supercomputing (W. Li, JCVI/HLI/UCSD):
  • ~20 CPU-Years on SDSC’s Gordon
  • ~4 CPU-Years on Dell’s HPC Cloud
• Produced Relative Abundance of
  • ~10K Bacteria, Archaea, Viruses in ~300 People
  • ~3 Million Filled Spreadsheet Cells
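To make the "relative abundance" output concrete, here is a minimal sketch of how such a table could be computed from per-taxon read counts. It is not the method used in the study above: the `relative_abundance` function, the sample IDs, and the read counts are hypothetical, chosen only to illustrate the calculation and the size of the resulting table.

```python
from typing import Dict


def relative_abundance(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert raw read counts per taxon into fractions of the sample total."""
    total = sum(counts.values())
    if total == 0:
        return {taxon: 0.0 for taxon in counts}
    return {taxon: n / total for taxon, n in counts.items()}


# Hypothetical read counts for two samples (illustrative numbers only).
samples = {
    "person_A": {"Bacteroides": 600, "Escherichia": 300, "Methanobrevibacter": 100},
    "person_B": {"Bacteroides": 50, "Escherichia": 150, "Methanobrevibacter": 0},
}

# One row per person, one column per taxon: the "spreadsheet".
table = {person: relative_abundance(counts) for person, counts in samples.items()}
print(table["person_A"])

# Size of the full table quoted on the slide: ~10K taxa x ~300 people.
approx_cells = 10_000 * 300
print(approx_cells)
```

The last two lines show where the "~3 million cells" figure comes from: roughly 10,000 taxa times roughly 300 people.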
Ongoing Research: Optimization of Heterogeneous Resource Utilization using bioKepler
National Resources: (Gordon) (Comet) (Stampede) (Lonestar)
Cloud Resources
Optimized
Local Cluster Resources
Uses existing genomics tools and computing systems!
Computing is just one part of it…
…new methods needed!
Needs of a Dynamic Ecosystem of Genomic Discovery
• Exploratory methods to see temporal changes and patterns in sequence data
• Efficient updates to analysis as quickly as new sequence data gets generated
• Regular reruns of annotations as reference databases evolve
• Integration of genomic data with other types of data, e.g., image, environmental, social graphs
• Dynamic ability to check quality and provenance of data and analysis
• Transparent support for computing platforms designed for genomic discovery and pattern analysis
• Workflow coordination and system integration
• People and culture to make it happen collaboratively!
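Two of the needs listed above, updating only for new sequence data and rerunning annotations when a reference database changes, can be sketched as a small version-aware cache. This is an illustrative assumption, not an existing tool: `AnnotationCache`, `toy_annotate`, and the version strings are hypothetical, and the sketch assumes a sample's sequence does not change once stored.

```python
from typing import Callable, Dict, Tuple


class AnnotationCache:
    """Re-annotate a sample only when it is new or the reference version changed."""

    def __init__(self, annotate: Callable[[str, str], str]) -> None:
        self._annotate = annotate
        # sample_id -> (reference database version, annotation result)
        self._results: Dict[str, Tuple[str, str]] = {}
        self.runs = 0  # number of times the annotation step actually ran

    def get(self, sample_id: str, sequence: str, db_version: str) -> str:
        cached = self._results.get(sample_id)
        if cached is not None and cached[0] == db_version:
            return cached[1]  # still valid for this reference version
        result = self._annotate(sequence, db_version)
        self.runs += 1
        self._results[sample_id] = (db_version, result)
        return result


# Hypothetical stand-in for a real annotation tool.
def toy_annotate(sequence: str, db_version: str) -> str:
    return f"{len(sequence)}bp@{db_version}"


cache = AnnotationCache(toy_annotate)
first = cache.get("s1", "ACGT", "2016-01")   # new sample: runs
again = cache.get("s1", "ACGT", "2016-01")   # same version: cached
rerun = cache.get("s1", "ACGT", "2016-06")   # reference evolved: reruns
print(first, again, rerun, cache.runs)
```

Recording the reference version next to each result also supports the provenance checks mentioned in the same list.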
Examples from 2016: Apache Big Data Technologies in Life Sciences
• Lightning Fast Genomics with ADAM
  • Goal
    • Study genetic variations in populations at scale (e.g., 1000 Genomes Project)
  • Technology stack
    • Apache Avro (data serialization, schema definition)
    • Apache Parquet (compact columnar storage)
    • Apache Spark (distributed parallel processing)
    • Spark MLlib (machine learning, clustering)
  • Source: AMPLab, UC Berkeley (http://bdgenomics.org/)
• Compressive Structural Bioinformatics using MMTF
  • Goal
    • 100+ speedup of large-scale 3D structural analysis of the Protein Data Bank (PDB)
  • Technology stack
    • MMTF (Macromolecular Transmission Format, compact storage in Hadoop Sequence Files)
    • Apache Spark (in-memory, parallel distributed workflows using compressed data)
    • Spark ML (clustering)
  • Source: SDSC, UC San Diego (http://mmtf.rcsb.org/)
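Both stacks above rely on compact storage layouts. The sketch below illustrates, in plain Python, why a columnar layout can be compact and fast to scan. It does not use the Parquet, ADAM, or MMTF APIs; the variant records and the `to_columns` and `run_length_encode` helpers are hypothetical examples.

```python
from itertools import groupby
from typing import Dict, List, Tuple

# Hypothetical variant records in a row-oriented layout.
rows = [
    {"sample": "s1", "chrom": "1", "pos": 1000, "alt": "A"},
    {"sample": "s1", "chrom": "1", "pos": 2500, "alt": "T"},
    {"sample": "s2", "chrom": "1", "pos": 1000, "alt": "A"},
    {"sample": "s2", "chrom": "2", "pos": 300, "alt": "G"},
]


def to_columns(records: List[dict]) -> Dict[str, list]:
    """Pivot row records into one list per field (a columnar layout)."""
    columns: Dict[str, list] = {key: [] for key in records[0]}
    for rec in records:
        for key in columns:
            columns[key].append(rec[key])
    return columns


def run_length_encode(values: list) -> List[Tuple[object, int]]:
    """Collapse runs of repeated values, which are common within a column."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# A query over one field touches only that column, not whole records.
max_pos = max(columns["pos"])

# Repetitive columns compress well once values of one field sit together.
encoded_chrom = run_length_encode(columns["chrom"])
print(max_pos, encoded_chrom)
```

Real columnar formats add typed encodings, compression, and metadata on top of this basic idea.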
Development of tools and technologies that enable models to bridge across diverse scales of biological organization, while leveraging all types and sources of data
NBCR Example: Distilling Medical Image Data for Biomedical Action nbcr.ucsd.edu
Identify gaps in multiscale modeling capabilities and develop new methods and tools that allow us to bridge across these gaps
Spatial and Temporal Scales
Levels: Molecular & Macromolecular, Sub-Cellular, Cell, Tissue, Organ
Spatial: Å; nm - μm; 0.1 mm - mm; cm
Temporal: fs - μs; μs - ms; ms - s; s - lifespan
Driving Biomedical Projects propel technology development across multi-scale modeling capability gaps, from simulation to data assembly & integration
A challenge: Data Integration
Challenge to bridge across diverse scales of biological organization, to understand emergent behavior, and the molecular mechanisms underlying biological function & disease
Integrated Multi-Scale Modeling Toolkits in NBCR
User Interface
NBCR Products
Battling complexity while facilitating collaboration and increasing reproducibility.
Cyberinfrastructure Innovation Based on User Needs
Domain-specific tools, workflows, data and computing infrastructure.
Components for Multi-Scale Modeling
A handful of customizable and extensible tools, workflows, user interfaces and publishable research objects.
NBCR Products: Workflows, Scientific Tools, Past Experiments
• UI generation
• Logical workflow generation
• Uncertainty quantification
• Workflow execution
• Provenance tracking
• System integration
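Workflow execution and provenance tracking, two of the capabilities listed above, can be sketched together: run each step in order and record what went in and what came out. This is a minimal illustration and not how Kepler or the NBCR tools implement provenance. The `fingerprint` and `run_with_provenance` helpers are hypothetical, and the `minimize` and `equilibrate` steps are placeholders that only set a flag rather than run any simulation.

```python
import hashlib
import json
import time
from typing import Callable, List, Tuple


def fingerprint(obj: dict) -> str:
    """Short, stable hash of a JSON-serializable state."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def run_with_provenance(
    steps: List[Callable[[dict], dict]], state: dict
) -> Tuple[dict, List[dict]]:
    """Run steps in order, logging input and output fingerprints per step."""
    log = []
    for step in steps:
        before = fingerprint(state)
        state = step(state)
        log.append({
            "step": step.__name__,
            "input": before,
            "output": fingerprint(state),
            "finished_at": time.time(),
        })
    return state, log


# Placeholder steps: they only mark the state, no real computation happens.
def minimize(state: dict) -> dict:
    return {**state, "minimized": True}


def equilibrate(state: dict) -> dict:
    return {**state, "equilibrated": True}


final_state, provenance = run_with_provenance(
    [minimize, equilibrate], {"system": "toy"}
)
for record in provenance:
    print(record["step"], record["input"], "->", record["output"])
```

Because each record's output fingerprint matches the next record's input, the log can later be used to check that a result really came from the stated chain of steps.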
[Figure: bar charts of cell proliferation (y-axis, roughly 0 to 1.4) by compound, with panels for cells with no p53 and for cancer cells with the p53-R175H mutant. The x-axis lists a control (medium / no compound), Prima-1, Stictic acid, and candidate compounds such as 35ZWF, 25KKL, 22LSV and 32CTM.]
15 new reactivation compounds
reactivation compounds kill cells with p53 cancer mutant
BENEFITS:
• Increase reuse
• Reproducibility
• Scale execution, problem & solution
• Compare methods
• Train students
Minimization Actor Equilibration Actor
AMBER GPU MD Workbench
LEADERSHIP TEAM
Rommie Amaro, PI, UCSD: Computational chemistry, biophysics
Andrew McCammon, UCSD: Computational chemistry, biophysics, chemical physics
Mark Ellisman, UCSD: Molecular & cellular biology
Andrew McCulloch, UCSD: Bioengineering, biophysics
Michel Sanner, TSRI: Drug discovery & molecular visualization
Phil Papadopoulos, UCSD/SDSC: Computer engineering, cyberinfrastructure technology
Ilkay Altintas, UCSD/SDSC: Workflows, provenance
Michael Holst, UCSD: Math, physics
Arthur Olson, TSRI: Computational chemistry, drug discovery, visualization
Training at the interface
Challenge: how do we build the next generation of interdisciplinary scientists?
Data-to-Structural-Models Simulation-Based Drug Discovery
Biomedical Big Data Training Collaboratory
http://biobigdata.ucsd.edu
• BBDTC website is up and evolving!
• BBDTC contains seven full, open biomedical training courses
• Four-course biomedical big data series is planned for Winter 2017
Working with Industry Partners at SDSC
SDSC Provides a Range of Strategies for Engaging with Industry
• Sponsored research agreements
• Service agreements for use of systems & consulting
• Focused centers of excellence (Big Data Systems, Predictive Analytics, Workflow Technologies)
• Training programs in Data Science & Analytics
• Industry Partners Program for “jumpstarting” collaborations
Working with industry helps companies be more competitive, drives innovation, and fosters a healthy ecosystem between the research and private sector.
Example for Industrial Collaboration: Janssen R&D Rheumatoid Arthritis Study
• Janssen was interested in correlating genomic profile with response to TNFα inhibitor golimumab
• Sequenced 438 patients (full genome)
• SDSC assisted with re-alignment and variant calling using new/improved algorithms
• Needed analysis done in a reasonable timeframe (a few weeks)
Questions?
Ilkay Altintas, Ph.D.
Email: ialtintas@ucsd.edu