A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

İlkay ALTINTAŞ, Ph.D.
Chief Data Science Officer, San Diego Supercomputer Center
Founder and Director, Workflows for Data Science Center of Excellence

SAN DIEGO SUPERCOMPUTER CENTER at UC San Diego
Providing Cyberinfrastructure for Research and Education

• Established as a national supercomputer resource center in 1985 by NSF
• A world leader in HPC, data-intensive computing, and scientific data management
• Current strategic focus on “Big Data”, “versatile computing”, and “life sciences applications”

1985

today

Two discoveries in drug design from 1987 and 1991.

Ross Walker Group

SDSC continues to be a leader in scientific computing and big data!

Gordon: First Flash-based Supercomputer for Data-intensive Apps

Comet: Serving the Long Tail of Science

27 standard racks = 1,944 nodes = 46,656 cores = 249 TB DRAM = 622 TB SSD

~2 Pflop/s

• 36 GPU nodes
• 4 Large Memory nodes
• 7 PB Lustre storage
• High performance virtualization
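As a quick sanity check on the Comet totals above, the short sketch below derives per-rack and per-node figures from them. The per-node values are computed, not quoted from the slide, and decimal units (1 TB = 1000 GB) are an assumption.

```python
# Totals as stated on the slide.
RACKS = 27
NODES = 1944
CORES = 46_656
DRAM_TB = 249
SSD_TB = 622


def per_node_summary():
    """Derive per-rack and per-node figures from the stated totals.

    Assumes decimal units (1 TB = 1000 GB); the slide does not say
    which convention it uses.
    """
    return {
        "nodes_per_rack": NODES / RACKS,
        "cores_per_node": CORES / NODES,
        "dram_gb_per_node": DRAM_TB * 1000 / NODES,
        "ssd_gb_per_node": SSD_TB * 1000 / NODES,
    }


if __name__ == "__main__":
    for name, value in per_node_summary().items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions the totals work out to 72 nodes per rack and 24 cores per node, with roughly 128 GB of DRAM and 320 GB of SSD per node.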

SDSC Data Science Office
-- Expertise, Systems and Training for Data Science Applications --

SDSC Data Science Office (DSO)

SDSC DSO is a collaborative virtual organization at SDSC for collective lasting innovation in data science research, development and education.

DSO

SDSC Expertise and Strengths

Big Data Platforms

Training

Industry

Applications

Life Sciences is an ongoing strategic application thrust at SDSC…

Genomic Analysis is a Big Data and Big Compute Problem

BIG DATA

COMPUTING AT SCALE

Enables dynamic data-driven applications:
• Computer-Aided Drug Discovery
• Personalized Precision Medicine
• Vaccine Development
• Metagenomics
• …

Requires:
• Data management
• Data-driven methods
• Scalable tools for dynamic coordination and resource optimization
• Skilled interdisciplinary workforce

Team work and process management

New era of data science!

Needs and Trends for the New Era Data Science

-- the Big Data Era Goals --
• More data-driven
• More dynamic
• More process-driven
• More collaborative
• More accountable
• More reproducible
• More interactive
• More heterogeneous

Volume: Scalable batch processing

Velocity: Stream processing

Variety: Extensible data storage, access and integration

Genomic Data Management and Processing in the Big Data Era has Unique Challenges!

HBase, Hive, Pig, Zookeeper, Giraph, Storm, Spark, MapReduce, YARN, MongoDB, Cassandra, HDFS, Flink

Lower levels: Storage and scheduling

Higher levels: Interactivity

These challenges push for new tools to tackle them.

COORDINATION AND WORKFLOW MANAGEMENT

DATA INTEGRATION AND PROCESSING

DATA MANAGEMENT AND STORAGE

How do we use these new tools and combine them with existing domain-specific solutions in scientific computing and data science?




Layer 1: Data Management and Storage




Layer 2: Data Integration and Processing

HBase, Hive, Pig, Zookeeper, Giraph, Storm, Spark, MapReduce, YARN, MongoDB, Cassandra, HDFS, Flink

+ Application-specific libraries

Most of the time, more than one analysis needs to take place…

And each analysis has multiple steps to integrate!

Pipelining is a way to put the steps together.
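A minimal sketch of the pipelining idea in plain Python: each analysis step is a function, and a pipeline is their left-to-right composition. The `pipeline` helper, the step names, and the toy reads are hypothetical illustrations; they are not part of Kepler or any other tool named in this talk.

```python
from functools import reduce
from typing import Callable

Step = Callable[[object], object]


def pipeline(*steps: Step) -> Step:
    """Compose steps left to right: the output of one feeds the next."""
    def run(data):
        return reduce(lambda value, step: step(value), steps, data)
    return run


# Hypothetical steps operating on a toy list of DNA reads.
def drop_short_reads(reads, min_length=4):
    """Keep only reads with at least min_length bases."""
    return [r for r in reads if len(r) >= min_length]


def gc_fraction(reads):
    """Fraction of G and C bases in each read."""
    return [(r.count("G") + r.count("C")) / len(r) for r in reads]


def mean(values):
    """Arithmetic mean, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


# Three steps put together into one reusable analysis.
mean_gc = pipeline(drop_short_reads, gc_fraction, mean)

if __name__ == "__main__":
    print(mean_gc(["ACGT", "GG", "GGCC", "ATAT"]))
```

Keeping each step as an independent function is what lets steps be reused, swapped, or scaled separately, which is the role a workflow system plays at larger scale.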

Source: http://www.slideshare.net/BigDataCloud/big-data-analytics-with-google-cloud-platform

Source: https://www.mapr.com/blog/distributed-stream-and-graph-processing-apache-flink

Source: https://www.computer.org/csdl/mags/so/2016/02/mso2016020060.html

Source: http://www.slideshare.net/ThoughtWorks/big-data-pipeline-with-scala




Layer 3: Coordination and Workflow Management


ACQUIRE  PREPARE  ANALYZE  REPORT  ACT  …

kepler-project.org

Workflows for Data Science Center of Excellence at SDSC

Building functional, operational and reproducible solution architectures using big data and HPC tools is what we do.

Focus on the question, not the technology!

• Access and query data
• Scale computational analysis
• Increase reuse
• Save time, energy and money
• Formalize and standardize

Real-Time Hazards Management
wifire.ucsd.edu

Data-Parallel Bioinformatics
bioKepler.org

Scalable Automated Molecular Dynamics and Drug Discovery
nbcr.ucsd.edu

WorDS.sdsc.edu

bioKepler: A Kepler Module for Bio Big Data Analysis

Source: Larry Smarr, Calit2

Example from 2013: Inflammatory Bowel Disease (IBD)
• Metagenomic Sequencing
  • JCVI Produced
    • ~150 Billion DNA Bases From Seven of LS Stool Samples Over 1.5 Years
  • ~3 Trillion DNA Bases From NIH Human Microbiome Program Data Base
    • 255 Healthy People, 21 with IBD
• Supercomputing (W. Li, JCVI/HLI/UCSD):
  • ~20 CPU-Years on SDSC’s Gordon
  • ~4 CPU-Years on Dell’s HPC Cloud
• Produced Relative Abundance of
  • ~10K Bacteria, Archaea, Viruses in ~300 People
  • ~3 Million Filled Spreadsheet Cells

Illumina HiSeq 2000 at JCVI

SDSC Gordon Data Supercomputer
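The "~3 million filled spreadsheet cells" figure is consistent with a table of roughly 10K taxa by 300 people. The sketch below checks that product and shows one common way to turn read counts into relative abundance, by normalizing each sample's counts to sum to 1. The talk does not describe the method actually used, so this normalization is an assumption, and the function names and example counts are hypothetical.

```python
def table_cells(n_taxa, n_people):
    """Number of cells in a taxa-by-people abundance table."""
    return n_taxa * n_people


def relative_abundance(counts):
    """Normalize one sample's read counts so they sum to 1.

    counts: dict mapping taxon name -> read count (illustrative input;
    not the data format used in the study).
    """
    total = sum(counts.values())
    if total == 0:
        return {taxon: 0.0 for taxon in counts}
    return {taxon: n / total for taxon, n in counts.items()}


if __name__ == "__main__":
    # Approximate figures from the slide: ~10K taxa, ~300 people.
    print(table_cells(10_000, 300))
    # Made-up counts for a single sample.
    print(relative_abundance({"taxon_a": 1, "taxon_b": 3}))
```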

Ongoing Research: Optimization of Heterogeneous Resource Utilization using bioKepler

National Resources: (Gordon) (Comet) (Stampede) (Lonestar)

Cloud Resources

Local Cluster Resources

Optimized

Uses existing genomics tools and computing systems!

Computing is just one part of it…

…new methods needed!

Needs of a Dynamic Ecosystem of Genomic Discovery
• Exploratory methods to see temporal changes and patterns in sequence data
• Efficient updates to analysis as quickly as new sequence data gets generated
• Regular reruns of annotations as reference databases evolve
• Integration of genomic data with other types of data, e.g., image, environmental, social graphs
• Dynamic ability to check quality and provenance of data and analysis
• Transparent support for computing platforms designed for genomic discovery and pattern analysis
• Workflow coordination and system integration
• People and culture to make it happen collaboratively!
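One of the needs listed above, regular reruns of annotations as reference databases evolve, can be sketched as a simple staleness check: record which reference version produced each annotation, then rerun only the samples that lag behind. The `Annotation` record, the `stale_samples` function, and the version strings are all hypothetical, intended only to illustrate the kind of provenance bookkeeping involved.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    sample_id: str
    reference_version: str  # version of the reference database used


def stale_samples(annotations, current_version):
    """Return IDs of samples annotated against an older reference."""
    return [
        a.sample_id
        for a in annotations
        if a.reference_version != current_version
    ]


if __name__ == "__main__":
    # Made-up provenance records for three samples.
    records = [
        Annotation("sample-1", "2016-01"),
        Annotation("sample-2", "2016-06"),
        Annotation("sample-3", "2016-01"),
    ]
    print(stale_samples(records, current_version="2016-06"))
```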

Examples from 2016: Apache Big Data Technologies in Life Sciences
• Lightning Fast Genomics with ADAM
  • Goal
    • Study genetic variations in populations at scale (e.g., 1000 Genomes Project)
  • Technology stack
    • Apache Avro (data serialization, schema definition)
    • Apache Parquet (compact columnar storage)
    • Apache Spark (distributed parallel processing)
    • Spark MLlib (machine learning, clustering)
  • Source: AMPLab, UC Berkeley (http://bdgenomics.org/)
• Compressive Structural Bioinformatics using MMTF
  • Goal
    • 100+ speedup of large-scale 3D structural analysis of the Protein Data Bank (PDB)
  • Technology stack
    • MMTF (Macromolecular Transmission Format, compact storage in Hadoop Sequence Files)
    • Apache Spark (in-memory, parallel distributed workflows using compressed data)
    • Spark ML (clustering)
  • Source: SDSC, UC San Diego (http://mmtf.rcsb.org/)

Development of tools and technologies that enable models to bridge across diverse scales of biological organization, while leveraging all types and sources of data

NBCR Example: Distilling Medical Image Data for Biomedical Action
nbcr.ucsd.edu

Identify gaps in multiscale modeling capabilities and develop new methods and tools that allow us to bridge across these gaps

Spatial and Temporal Scales
Levels of organization: Molecular & Macromolecular, Sub-Cellular, Cell, Tissue, Organ
Spatial scales: Å, nm - μm, 0.1 mm - mm, cm
Temporal scales: fs - μs, μs - ms, ms - s, s - lifespan

Driving Biomedical Projects propel technology development across multi-scale modeling capability gaps, from simulation to data assembly & integration

A challenge: Data Integration

Challenge to bridge across diverse scales of biological organization, to understand emergent behavior, and the molecular mechanisms underlying biological function & disease

Integrated Multi-Scale Modeling Toolkits in NBCR

User Interface

NBCR Products

Battling complexity while facilitating collaboration and increasing reproducibility.

Cyberinfrastructure Innovation Based on User Needs

Domain-specific tools, workflows, data and computing infrastructure.

Components for Multi-Scale Modeling

A handful of customizable and extensible tools, workflows, user interfaces and publishable research objects.

NBCR Products

Workflows

Scientific Tools

Past Experiments

• UI generation
• Logical workflow generation
• Uncertainty quantification
• Workflow execution
• Provenance tracking
• System integration

Figure: bar charts of cell proliferation (axis range 0 to 1.4), with panels labeled “no p53” and “cancer cell with p53-R175H mutant”. Treatments shown: medium / no compound, Prima-1, stictic acid, and compounds 35ZWF, 25KKL, 22LSV, 32CTM, 26RQZ, 27WT9, 33AG6, 33BAZ, 28NZ6, 27TGR, 27VFS, 35LWZ, 36EB5, 27UDP, 32LDE, 25PWS, 24MLP, 26YYG, 24MNR, 22KTV, 24MY4, 24LBC, 24NPU, 24NW3.

15 new reactivation compounds

reactivation compounds kill cells with p53 cancer mutant

BENEFITS:
• Increase reuse
• Reproducibility
• Scale execution, problem & solution
• Compare methods
• Train students

Minimization Actor Equilibration Actor

AMBER GPU MD Workbench

LEADERSHIP TEAM

Rommie Amaro, PI, UCSD: Computational chemistry, biophysics

Andrew McCammon, UCSD: Computational chemistry, biophysics, chemical physics

Mark Ellisman, UCSD: Molecular & cellular biology

Andrew McCulloch, UCSD: Bioengineering, biophysics

Michel Sanner, TSRI: Drug discovery & molecular visualization

Phil Papadopoulos, UCSD/SDSC: Computer engineering, cyberinfrastructure technology

Ilkay Altintas, UCSD/SDSC: Workflows, provenance

Michael Holst, UCSD: Math, physics

Arthur Olson, TSRI: Computational chemistry, drug discovery, visualization

Training at the interface

Challenge: how do we build the next generation of interdisciplinary scientists?

Data-to-Structural-Models Simulation-Based Drug Discovery

Biomedical Big Data Training Collaboratory
http://biobigdata.ucsd.edu

• BBDTC website is up and evolving!
• BBDTC contains seven full, open biomedical training courses
• Four-course biomedical big data series is planned for Winter 2017

Working with Industry Partners at SDSC

SDSC Provides a Range of Strategies for Engaging with Industry

• Sponsored research agreements
• Service agreements for use of systems & consulting
• Focused centers of excellence (Big Data Systems, Predictive Analytics, Workflow Technologies)
• Training programs in Data Science & Analytics
• Industry Partners Program for “jumpstarting” collaborations

Working with industry helps companies be more competitive, drives innovation, and fosters a healthy ecosystem between the research and private sectors.

Example for Industrial Collaboration: Janssen R&D Rheumatoid Arthritis Study
• Janssen was interested in correlating genomic profile with response to TNFα inhibitor golimumab
• Sequenced 438 patients (full genome)
• SDSC assisted with re-alignment and variant calling using new/improved algorithms
• Needed analysis done in a reasonable timeframe (a few weeks)

Questions?

Ilkay Altintas, Ph.D.
Email: ialtintas@ucsd.edu
