A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

İlkay ALTINTAŞ, Ph.D.
Chief Data Science Officer, San Diego Supercomputer Center
Founder and Director, Workflows for Data Science Center of Excellence

Post on 15-Feb-2017

TRANSCRIPT

Page 1: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

İlkay ALTINTAŞ, Ph.D.
Chief Data Science Officer, San Diego Supercomputer Center
Founder and Director, Workflows for Data Science Center of Excellence

Page 2: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

SAN DIEGO SUPERCOMPUTER CENTER at UC San Diego
Providing Cyberinfrastructure for Research and Education

• Established as a national supercomputer resource center in 1985 by NSF
• A world leader in HPC, data-intensive computing, and scientific data management
• Current strategic focus on “Big Data”, “versatile computing”, and “life sciences applications”

1985

today

Two discoveries in drug design from 1987 and 1991.

Page 3: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

Ross Walker Group

SDSC continues to be a leader in scientific computing and big data!

Gordon: First Flash-based Supercomputer for Data-intensive Apps

Comet: Serving the Long Tail of Science

27 standard racks = 1,944 nodes = 46,656 cores = 249 TB DRAM = 622 TB SSD

~ 2 Pflop/s

• 36 GPU nodes
• 4 Large Memory nodes
• 7 PB Lustre storage
• High performance virtualization
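The totals quoted for Comet imply per-rack and per-node figures that can be checked with a little arithmetic. The sketch below assumes decimal units (1 TB = 1000 GB) and that all standard nodes are identical; the derived per-node values are inferences from the slide's totals, not numbers stated on the slide.

```python
# Totals as stated on the Comet slide.
RACKS = 27
NODES = 1944
CORES = 46_656
DRAM_TB = 249
SSD_TB = 622


def per_node_breakdown():
    """Derive per-rack and per-node figures from the slide's totals.

    Assumption: every standard node is identical and 1 TB = 1000 GB.
    """
    return {
        "nodes_per_rack": NODES // RACKS,
        "cores_per_node": CORES // NODES,
        "dram_gb_per_node": round(DRAM_TB * 1000 / NODES),
        "ssd_gb_per_node": round(SSD_TB * 1000 / NODES),
    }


if __name__ == "__main__":
    for name, value in per_node_breakdown().items():
        print(f"{name}: {value}")
```

Under these assumptions the totals are mutually consistent: the node and core counts divide evenly, and the memory and flash totals round to whole per-node sizes.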

Page 4: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

SDSC Data Science Office
-- Expertise, Systems and Training for Data Science Applications --

SDSC Data Science Office (DSO)

SDSC DSO is a collaborative virtual organization at SDSC for collective lasting innovation in data science research, development and education.

DSO

SDSC Expertise and Strengths

Big Data Platforms

Training

Industry

Applications

Page 5: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

Life Sciences is an ongoing strategic application thrust at SDSC…

Page 6: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

Genomic Analysis is a Big Data and Big Compute Problem

BIG DATA

COMPUTING AT SCALE

Enables dynamic data-driven applications

Computer-Aided Drug Discovery

Personalized Precision Medicine

Requires:
• Data management
• Data-driven methods
• Scalable tools for dynamic coordination and resource optimization
• Skilled interdisciplinary workforce

Team work and process management

Vaccine Development

Metagenomics

Page 7: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

New era of data science!

Needs and Trends for the New Era Data Science

-- the Big Data Era Goals --
• More data-driven
• More dynamic
• More process-driven
• More collaborative
• More accountable
• More reproducible
• More interactive
• More heterogeneous

Page 8: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

Velocity

Variety

Volume Scalable batch processing

Stream processing

Extensible data storage, access and integration

Genomic Data Management and Processing in the Big Data Era has Unique Challenges!

Page 9: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

HBase

Hive Pig

Zookeeper Giraph

Storm

Spark

MapReduce

YARN

MongoDB

Cassandra

HDFS

Flink

Lower levels: Storage and scheduling

Higher levels: Interactivity

These challenges push for new tools to tackle them.

Page 10: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

COORDINATION AND WORKFLOW MANAGEMENT

DATA INTEGRATION AND PROCESSING

DATA MANAGEMENT AND STORAGE

How do we use these new tools

and combine them with existing

domain-specific solutions in

scientific computing and data science?

Page 11: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

COORDINATION AND WORKFLOW MANAGEMENT

DATA INTEGRATION AND PROCESSING

DATA MANAGEMENT AND STORAGE

Layer 1: Data Management and Storage

Page 12: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

COORDINATION AND WORKFLOW MANAGEMENT

DATA INTEGRATION AND PROCESSING

DATA MANAGEMENT AND STORAGE

Layer 2: Data Integration and Processing

HBase

Hive Pig

Zookeeper Giraph

Storm

Spark

MapReduce

YARN

MongoDB

Cassandra

HDFS

Flink

+ Application-specific libraries

Page 13: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

Most of the time, more than one analysis needs to take place…

And each analysis has multiple steps to integrate!

Page 14: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

Pipelining is a way to put the steps together.

Source: http://www.slideshare.net/BigDataCloud/big-data-analytics-with-google-cloud-platform

Source: https://www.mapr.com/blog/distributed-stream-and-graph-processing-apache-flink

Source: https://www.computer.org/csdl/mags/so/2016/02/mso2016020060.html

Source: http://www.slideshare.net/ThoughtWorks/big-data-pipeline-with-scala
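As a rough illustration of pipelining, the sketch below chains independent steps so that the output of one becomes the input of the next. The step names (`drop_empty`, `to_upper`, `keep_valid`) and the toy sequence data are hypothetical and chosen only for illustration; they are not taken from any of the systems shown on this slide.

```python
from functools import reduce
from typing import Callable, List

# A step takes a list of sequence strings and returns a new list.
Step = Callable[[List[str]], List[str]]


def make_pipeline(*steps: Step) -> Step:
    """Compose steps left to right into a single callable."""
    def run(data: List[str]) -> List[str]:
        return reduce(lambda acc, step: step(acc), steps, data)
    return run


# Hypothetical steps for a toy read-cleaning pipeline.
def drop_empty(reads: List[str]) -> List[str]:
    return [r for r in reads if r]


def to_upper(reads: List[str]) -> List[str]:
    return [r.upper() for r in reads]


def keep_valid(reads: List[str]) -> List[str]:
    # Keep only reads made of nucleotide letters (N = unknown base).
    return [r for r in reads if set(r) <= set("ACGTN")]


clean_reads = make_pipeline(drop_empty, to_upper, keep_valid)

if __name__ == "__main__":
    print(clean_reads(["acgt", "", "ggnta", "xyz"]))
```

Because each step has the same input and output type, steps can be reordered, replaced, or reused in other pipelines without changing the composition code.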

Page 15: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

COORDINATION AND WORKFLOW MANAGEMENT

DATA INTEGRATION AND PROCESSING

DATA MANAGEMENT AND STORAGE

Layer 3: Coordination and Workflow Management

Page 16: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

COORDINATION AND WORKFLOW MANAGEMENT

ACQUIRE PREPARE ANALYZE REPORT ACT …

kepler-project.org

Page 17: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

Workflows for Data Science Center of Excellence at SDSC

Building functional, operational and reproducible solution architectures using big data and HPC tools is what we do.

Focus on the question, not the technology!

• Access and query data
• Scale computational analysis
• Increase reuse
• Save time, energy and money
• Formalize and standardize

Real-Time Hazards Management: wifire.ucsd.edu

Data-Parallel Bioinformatics: bioKepler.org

Scalable Automated Molecular Dynamics and Drug Discovery: nbcr.ucsd.edu

WorDS.sdsc.edu

Page 18: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

bioKepler: A Kepler Module for Bio Big Data Analysis

Data-Parallel Bioinformatics: bioKepler.org

Page 19: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

Source: Larry Smarr, Calit2

• Metagenomic Sequencing
• JCVI Produced
• ~150 Billion DNA Bases From Seven of LS Stool Samples Over 1.5 Years
• ~3 Trillion DNA Bases From NIH Human Microbiome Program Data Base
• 255 Healthy People, 21 with IBD

Illumina HiSeq 2000 at JCVI

SDSC Gordon Data Supercomputer

Example from 2013: Inflammatory Bowel Disease (IBD)

• Supercomputing (W. Li, JCVI/HLI/UCSD):
• ~20 CPU-Years on SDSC’s Gordon
• ~4 CPU-Years on Dell’s HPC Cloud
• Produced Relative Abundance of
• ~10K Bacteria, Archaea, Viruses in ~300 People
• ~3 Million Filled Spreadsheet Cells
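To make the "relative abundance" output concrete, here is a minimal sketch. It assumes relative abundance means each taxon's share of the total reads assigned in one sample, which is a common definition but is not spelled out on the slide; the taxon names and counts are invented for illustration. The last lines check that roughly 10K taxa across roughly 300 people gives the ~3 million table cells quoted above.

```python
from typing import Dict


def relative_abundance(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert per-taxon read counts for one sample into fractions.

    Assumption: abundance = taxon count / total count in the sample.
    """
    total = sum(counts.values())
    if total == 0:
        # No reads assigned: report zero abundance for every taxon.
        return {taxon: 0.0 for taxon in counts}
    return {taxon: n / total for taxon, n in counts.items()}


# Hypothetical counts for a single sample (not real data).
sample = {"Bacteroides": 600, "Escherichia": 300, "Methanobrevibacter": 100}
abundance = relative_abundance(sample)

# Size of the full result table: one cell per (taxon, person) pair.
TAXA = 10_000
PEOPLE = 300
cells = TAXA * PEOPLE

if __name__ == "__main__":
    print(abundance)
    print(f"table cells: {cells:,}")
```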

Page 20: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

Ongoing Research: Optimization of Heterogeneous Resource Utilization using bioKepler

National Resources

(Gordon) (Comet)

(Stampede) (Lonestar)

Cloud Resources

Optimized

Local Cluster Resources

Uses existing genomics tools and computing systems!

Computing is just one part of it…

…new methods needed!

Page 21: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

Needs of a Dynamic Ecosystem of Genomic Discovery
• Exploratory methods to see temporal changes and patterns in sequence data
• Efficient updates to analysis as quickly as new sequence data gets generated
• Regular reruns of annotations as reference databases evolve
• Integration of genomic data with other types of data, e.g., image, environmental, social graphs
• Dynamic ability to check quality and provenance of data and analysis
• Transparent support for computing platforms designed for genomic discovery and pattern analysis
• Workflow coordination and system integration
• People and culture to make it happen collaboratively!
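Two of the needs above, rerunning annotations as reference databases evolve and checking the provenance of an analysis, can be illustrated by recording which version of a reference produced each result. This is only a sketch under stated assumptions: `ProvenanceRecord`, `fingerprint`, and `needs_rerun` are hypothetical names, the reference content is a toy example, and this is not a description of how Kepler or bioKepler track provenance.

```python
import hashlib
from dataclasses import dataclass


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 hex digest identifying one version of a reference."""
    return hashlib.sha256(content).hexdigest()


@dataclass(frozen=True)
class ProvenanceRecord:
    """Minimal provenance for one annotation run (hypothetical schema)."""
    sample_id: str
    reference_fingerprint: str
    tool: str


def needs_rerun(record: ProvenanceRecord, current_reference: bytes) -> bool:
    """True when the reference has changed since the annotation was made."""
    return record.reference_fingerprint != fingerprint(current_reference)


# Toy reference database contents (illustrative only).
old_ref = b">geneA\nACGT\n"
new_ref = old_ref + b">geneB\nGGCC\n"

record = ProvenanceRecord("sample-001", fingerprint(old_ref), "annotator-v1")

if __name__ == "__main__":
    print("rerun against old reference:", needs_rerun(record, old_ref))
    print("rerun against new reference:", needs_rerun(record, new_ref))
```

Storing a content fingerprint rather than a version label means a rerun is triggered by any change to the reference, including unannounced ones.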

Page 22: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

Examples from 2016: Apache Big Data Technologies in Life Sciences

• Lightning Fast Genomics with ADAM
  • Goal
    • Study genetic variations in populations at scale (e.g., 1000 Genomes Project)
  • Technology stack
    • Apache Avro (data serialization, schema definition)
    • Apache Parquet (compact columnar storage)
    • Apache Spark (distributed parallel processing)
    • Spark MLlib (machine learning, clustering)
  • Source: AMPLab, UC Berkeley (http://bdgenomics.org/)
• Compressive Structural Bioinformatics using MMTF
  • Goal
    • 100+ speedup of large-scale 3D structural analysis of the Protein Data Bank (PDB)
  • Technology stack
    • MMTF (Macromolecular Transmission Format, compact storage in Hadoop Sequence Files)
    • Apache Spark (in-memory, parallel distributed workflows using compressed data)
    • Spark ML (clustering)
  • Source: SDSC, UC San Diego (http://mmtf.rcsb.org/)

Page 23: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

Development of tools and technologies that enable models to bridge across diverse scales of biological organization, while leveraging all types and sources of data

NBCR Example: Distilling Medical Image Data for Biomedical Action nbcr.ucsd.edu

Page 24: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

Identify gaps in multiscale modeling capabilities and develop new methods and tools that allow us to bridge across these gaps

[Figure: Spatial and Temporal Scales]
Spatial scales: Å; nm – μm; 0.1 mm – mm; cm
Temporal scales: fs – μs; μs – ms; ms – s; s – lifespan
Levels of biological organization: Molecular & Macromolecular; Sub-Cellular; Cell; Tissue; Organ

Driving Biomedical Projects propel technology development across multi-scale modeling capability gaps, from simulation to data assembly & integration

Page 25: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

A challenge: Data Integration

Challenge to bridge across diverse scales of biological organization, to understand emergent behavior, and the molecular mechanisms underlying biological function & disease

Page 26: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

Integrated Multi-Scale Modeling Toolkits in NBCR

User Interface

NBCR Products

Battling complexity while facilitating collaboration and increasing reproducibility.

Cyberinfrastructure Innovation Based on User Needs

Domain-specific tools, workflows, data and computing infrastructure.

Components for Multi-Scale Modeling

A handful of customizable and extensible tools, workflows, user interfaces and publishable research objects.

NBCR Products

Workflows

Scientific Tools

Past Experiments

• UI generation
• Logical workflow generation
• Uncertainty quantification
• Workflow execution
• Provenance tracking
• System integration

Page 27: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

[Figure: bar charts of cell proliferation (axis range 0 to 1.4) by compound. Panel titles: "nop53" and "cancer cell with p53-R175H mutant". Bars compare medium or no compound, Prima-1, stictic acid, and candidate compounds such as 35ZWF, 25KKL, 22LSV and 32CTM.]

15 new reactivation compounds

reactivation compounds kill cells with p53 cancer mutant

BENEFITS:
• Increase reuse
• Reproducibility
• Scale execution, problem & solution
• Compare methods
• Train students

Page 28: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

Minimization Actor Equilibration Actor

AMBER GPU MD Workbench

Page 29: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

LEADERSHIP TEAM

Rommie Amaro, PI, UCSD: Computational chemistry, biophysics
Andrew McCammon, UCSD: Computational chemistry, biophysics, chemical physics
Mark Ellisman, UCSD: Molecular & cellular biology
Andrew McCulloch, UCSD: Bioengineering, biophysics
Michel Sanner, TSRI: Drug discovery & molecular visualization
Phil Papadopoulos, UCSD/SDSC: Computer engineering, cyberinfrastructure technology
Ilkay Altintas, UCSD/SDSC: Workflows, provenance
Michael Holst, UCSD: Math, physics
Arthur Olson, TSRI: Computational chemistry, drug discovery, visualization

Page 30: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

Training at the interface

Challenge: how do we build the next generation of interdisciplinary scientists?

Data-to-Structural-Models Simulation-Based Drug Discovery

Page 31: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

Biomedical Big Data Training Collaboratory
http://biobigdata.ucsd.edu

• BBDTC website is up and evolving!
• BBDTC contains seven full, open biomedical training courses
• Four-course biomedical big data series is planned for Winter 2017

Page 32: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

Working with Industry Partners at SDSC

Page 33: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

SDSC Provides a Range of Strategies for Engaging with Industry

• Sponsored research agreements
• Service agreements for use of systems & consulting
• Focused centers of excellence (Big Data Systems, Predictive Analytics, Workflow Technologies)
• Training programs in Data Science & Analytics
• Industry Partners Program for “jumpstarting” collaborations

Working with industry helps companies be more competitive, drives innovation, and fosters a healthy ecosystem between the research and private sectors.

Page 34: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

Example for Industrial Collaboration: Janssen R&D Rheumatoid Arthritis Study
• Janssen was interested in correlating genomic profile with response to TNFα inhibitor golimumab
• Sequenced 438 patients (full genome)
• SDSC assisted with re-alignment and variant calling using new/improved algorithms
• Needed analysis done in a reasonable time frame (a few weeks)

Page 35: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

Questions?

Ilkay Altintas, Ph.D.
Email: ialtintas@ucsd.edu