
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data

İlkay ALTINTAŞ, Ph.D.
Chief Data Science Officer, San Diego Supercomputer Center
Founder and Director, Workflows for Data Science Center of Excellence

SAN DIEGO SUPERCOMPUTER CENTER at UC San Diego
Providing Cyberinfrastructure for Research and Education

• Established as a national supercomputer resource center in 1985 by NSF

• A world leader in HPC, data-intensive computing, and scientific data management

• Current strategic focus on “Big Data”, “versatile computing”, and “life sciences applications”

Two discoveries in drug design from 1987 and 1991.

Ross Walker Group

SDSC continues to be a leader in scientific computing and big data!

Gordon: First Flash-based Supercomputer for Data-intensive Apps

Comet: Serving the Long Tail of Science

27 standard racks = 1,944 nodes = 46,656 cores = 249 TB DRAM = 622 TB SSD

~ 2 Pflop/s

• 36 GPU nodes
• 4 Large Memory nodes
• 7 PB Lustre storage
• High performance virtualization
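The aggregate Comet figures above can be broken down per rack and per node with simple division. The sketch below does only that arithmetic; it assumes the slide's TB are decimal terabytes (1 TB = 1000 GB), which the slide does not state, so the per-node memory and SSD values are approximate.

```python
# Aggregate Comet figures as quoted on the slide.
RACKS = 27
NODES = 1944
CORES = 46_656
DRAM_TB = 249
SSD_TB = 622

GB_PER_TB = 1000  # assumption: decimal units; the slide does not say


def per_node_gb(total_tb: float, nodes: int = NODES) -> float:
    """Convert a machine-wide total in TB to an average per-node value in GB."""
    return total_tb * GB_PER_TB / nodes


nodes_per_rack = NODES // RACKS            # 72 nodes in each rack
cores_per_node = CORES // NODES            # 24 cores in each node
dram_gb_per_node = per_node_gb(DRAM_TB)    # roughly 128 GB of DRAM per node
ssd_gb_per_node = per_node_gb(SSD_TB)      # roughly 320 GB of SSD per node

if __name__ == "__main__":
    print(f"{nodes_per_rack} nodes/rack, {cores_per_node} cores/node")
    print(f"~{dram_gb_per_node:.0f} GB DRAM/node, ~{ssd_gb_per_node:.0f} GB SSD/node")
```

The rack and core figures divide exactly; the memory and SSD figures are rounded totals, so the per-node values are averages rather than hardware specifications.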

SDSC Data Science Office: Expertise, Systems and Training for Data Science Applications

SDSC Data Science Office (DSO)

SDSC DSO is a collaborative virtual organization at SDSC for collective lasting innovation in data science research, development and education.

SDSC Expertise and Strengths

Life Sciences is an ongoing strategic application thrust at SDSC…

Genomic Analysis is a Big Data and Big Compute Problem

BIG DATA COMPUTING AT

Enables dynamic data-driven applications

Computer-Aided Drug Discovery

Personalized Precision Medicine

Requires:
• Data management
• Data-driven methods
• Scalable tools for dynamic coordination and resource optimization
• Skilled interdisciplinary workforce

Team work and process management

Vaccine Development

Metagenomics

New era of data science!

Needs and Trends for the New Era Data Science

The Big Data Era Goals:
• More data-driven
• More dynamic
• More process-driven
• More collaborative
• More accountable
• More reproducible
• More interactive
• More heterogeneous

Velocity

Variety

Volume

Scalable batch processing

Stream processing

Extensible data storage, access and integration

Genomic Data Management and Processing in the Big Data Era has Unique Challenges!

[Figure: big data tool stack (Hive, Pig, Zookeeper, Giraph, MapReduce, MongoDB, Cassandra, HDFS, Flink), ranging from lower levels (storage and scheduling) to higher levels (interactivity)]

These challenges push for new tools to tackle them.

COORDINATION AND WORKFLOW MANAGEMENT

DATA INTEGRATION AND PROCESSING

DATA MANAGEMENT AND STORAGE

How do we use these new tools and combine them with existing domain-specific solutions in scientific computing and data science?

Layer 1: Data Management and Storage

Layer 2: Data Integration and Processing

Hive, Pig, Zookeeper, Giraph, MapReduce, MongoDB, Cassandra, Flink + application-specific libraries

Most of the time, more than one analysis needs to take place…

And each analysis has multiple steps to integrate!

Pipelining is a way to put the steps together.

Source: http://www.slideshare.net/BigDataCloud/big-data-analytics-with-google-cloud-platform

Source: https://www.mapr.com/blog/distributed-stream-and-graph-processing-apache-flink

Source: https://www.computer.org/csdl/mags/so/2016/02/mso2016020060.html

Source: http://www.slideshare.net/ThoughtWorks/big-data-pipeline-with-scala
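The pipelining idea above, putting the steps of an analysis together so the output of one feeds the next, can be sketched in a few lines. This is a minimal illustration, not how Kepler or any tool named in these slides implements it; the step functions (`drop_short_reads`, `to_upper`, `gc_content`) and the toy reads are hypothetical.

```python
from functools import reduce
from typing import Callable, List

Step = Callable[[list], list]


def pipeline(*steps: Step) -> Step:
    """Compose steps left to right into a single callable."""
    def run(data: list) -> list:
        # Feed the output of each step into the next one.
        return reduce(lambda acc, step: step(acc), steps, data)
    return run


# Hypothetical analysis steps operating on short DNA strings.
def drop_short_reads(reads: List[str]) -> List[str]:
    """Keep only reads with at least four bases."""
    return [r for r in reads if len(r) >= 4]


def to_upper(reads: List[str]) -> List[str]:
    """Normalize base letters to upper case."""
    return [r.upper() for r in reads]


def gc_content(reads: List[str]) -> list:
    """Pair each read with its fraction of G and C bases."""
    return [(r, (r.count("G") + r.count("C")) / len(r)) for r in reads]


analyze = pipeline(drop_short_reads, to_upper, gc_content)
result = analyze(["acgt", "gg", "ggcc", "atat"])

if __name__ == "__main__":
    for read, gc in result:
        print(read, gc)
```

Because each step has the same list-in, list-out shape, steps can be reordered, reused in another pipeline, or swapped for a distributed implementation without changing the rest.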

Layer 3: Coordination and Workflow Management

ACQUIRE, PREPARE, ANALYZE, REPORT, ACT…

kepler-project.org

Workflows for Data Science Center of Excellence at SDSC

Building functional, operational and reproducible solution architectures using big data and HPC tools is what we do.

Focus on the question, not the technology!

• Access and query data
• Scale computational analysis
• Increase reuse
• Save time, energy and money
• Formalize and standardize

Real-Time Hazards Management: wifire.ucsd.edu

Data-Parallel Bioinformatics: bioKepler.org

Scalable Automated Molecular Dynamics and Drug Discovery: nbcr.ucsd.edu

WorDS.sdsc.edu

bioKepler: A Kepler Module for Bio Big Data Analysis

Data-Parallel Bioinformatics: bioKepler.org

Source: Larry Smarr, Calit2

Example from 2013: Inflammatory Bowel Disease (IBD)

• Metagenomic Sequencing (JCVI produced)
  • ~150 billion DNA bases from seven of LS stool samples over 1.5 years
  • ~3 trillion DNA bases from NIH Human Microbiome Program database
  • 255 healthy people, 21 with IBD
• Supercomputing (W. Li, JCVI/HLI/UCSD):
  • ~20 CPU-years on SDSC’s Gordon
  • ~4 CPU-years on Dell’s HPC Cloud
• Produced relative abundance of
  • ~10K bacteria, archaea, viruses in ~300 people
  • ~3 million filled spreadsheet cells

[Figures: Illumina HiSeq 2000 at JCVI; SDSC Gordon Data Supercomputer]
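The relative-abundance table described above can be illustrated with a short sketch. It assumes relative abundance means each taxon's share of the reads assigned in a sample; the slide does not define the measure, and the sample names and read counts below are hypothetical. The last lines only check that roughly 10K taxa across roughly 300 people gives the quoted ~3 million cells.

```python
from typing import Dict


def relative_abundance(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each taxon's fraction of the total assigned reads in one sample."""
    total = sum(counts.values())
    if total == 0:
        # Avoid dividing by zero for a sample with no assigned reads.
        return {taxon: 0.0 for taxon in counts}
    return {taxon: n / total for taxon, n in counts.items()}


# Hypothetical read counts per taxon for two samples.
samples = {
    "person_1": {"Bacteroides": 60, "Escherichia": 30, "Methanobrevibacter": 10},
    "person_2": {"Bacteroides": 20, "Escherichia": 20, "Methanobrevibacter": 0},
}

# One row per person, one column per taxon: the "spreadsheet" layout.
table = {person: relative_abundance(counts) for person, counts in samples.items()}

# Size check using the slide's approximate figures.
TAXA = 10_000
PEOPLE = 300
cells = TAXA * PEOPLE  # about 3 million cells

if __name__ == "__main__":
    for person, row in table.items():
        print(person, row)
    print(f"{cells:,} cells")
```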

Ongoing Research: Optimization of Heterogeneous Resource Utilization using bioKepler

National Resources: Gordon, Comet, Stampede, Lonestar

Cloud Resources

Local Cluster Resources

Optimized

Uses existing genomics tools and computing systems!

Computing is just one part of it…

…new methods needed!

Needs of a Dynamic Ecosystem of Genomic Discovery
• Exploratory methods to see temporal changes and patterns in sequence data
• Efficient updates to analysis as quickly as new sequence data is generated
• Regular reruns of annotations as reference databases evolve
• Integration of genomic data with other types of data, e.g., image, environmental, social graphs
• Dynamic ability to check quality and provenance of data and analysis
• Transparent support for computing platforms designed for genomic discovery and pattern analysis
• Workflow coordination and system integration
• People and culture to make it happen collaboratively!

Examples from 2016: Apache Big Data Technologies in Life Sciences
• Lightning Fast Genomics with ADAM
  • Goal
    • Study genetic variations in populations at scale (e.g., 1000 Genomes Project)
  • Technology stack
    • Apache Avro (data serialization, schema definition)
    • Apache Parquet (compact columnar storage)
    • Apache Spark (distributed parallel processing)
    • Spark MLlib (machine learning, clustering)
  • Source: AMPLab, UC Berkeley (http://bdgenomics.org/)
• Compressive Structural Bioinformatics using MMTF
  • Goal
    • 100+ speedup of large-scale 3D structural analysis of the Protein Data Bank (PDB)
  • Technology stack
    • MMTF (Macromolecular Transmission Format, compact storage in Hadoop Sequence Files)
    • Apache Spark (in-memory, parallel distributed workflows using compressed data)
    • Spark ML (clustering)
  • Source: SDSC, UC San Diego (http://mmtf.rcsb.org/)
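Both stacks above rely on compact storage, and the ADAM stack names columnar storage explicitly. The sketch below shows the general idea in plain Python: pivoting records into columns lets a query read only the field it needs, and repetitive columns compress well. It is not Parquet's or MMTF's actual file format or API; the variant records are hypothetical, and run-length encoding is used here only as one common columnar compression technique.

```python
from typing import Any, Dict, List

# Hypothetical variant records in row-oriented form.
rows = [
    {"chrom": "1", "pos": 100, "ref": "A", "alt": "G", "qual": 50.0},
    {"chrom": "1", "pos": 250, "ref": "C", "alt": "T", "qual": 12.5},
    {"chrom": "2", "pos": 75, "ref": "G", "alt": "A", "qual": 99.0},
]


def to_columns(records: List[Dict[str, Any]]) -> Dict[str, list]:
    """Pivot row dicts into one list per field (assumes all rows share the same fields)."""
    columns: Dict[str, list] = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values: list) -> List[list]:
    """Collapse consecutive repeats into [value, count] pairs."""
    encoded: List[list] = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = to_columns(rows)

# A query over one field reads a single list instead of every record.
mean_qual = sum(columns["qual"]) / len(columns["qual"])

# Sorted, repetitive columns such as the chromosome name shrink under RLE.
chrom_rle = run_length_encode(columns["chrom"])

if __name__ == "__main__":
    print(mean_qual)
    print(chrom_rle)
```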

Development of tools and technologies that enable models to bridge across diverse scales of biological organization, while leveraging all types and sources of data

NBCR Example: Distilling Medical Image Data for Biomedical Action nbcr.ucsd.edu

Identify gaps in multiscale modeling capabilities and develop new methods and tools that allow us to bridge across these gaps

Length scales: Å; nm – μm; 0.1 mm – mm; cm

Time scales: fs – μs; μs – ms; ms – s; s – lifespan

Levels of organization: Molecular & Macromolecular; Sub-Cellular; Cell; Tissue; Organ

Driving Biomedical Projects propel technology development across multi-scale modeling capability gaps, from simulation to data assembly & integration

A challenge: Data Integration

Challenge to bridge across diverse scales of biological organization, to understand emergent behavior, and the molecular mechanisms underlying biological function & disease

Integrated Multi-Scale Modeling Toolkits in NBCR

User Interface

NBCR Products

Battling complexity while facilitating collaboration and increasing reproducibility.

Cyberinfrastructure Innovation Based on User Needs

Domain-specific tools, workflows, data and computing infrastructure.

Components for Multi-Scale Modeling

A handful of customizable and extensible tools, workflows, user interfaces and publishable research objects.

NBCR Products

Workflows

Scientific Tools

Past Experiments

• UI generation
• Logical workflow generation
• Uncertainty quantification
• Workflow execution
• Provenance tracking
• System integration
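Of the capabilities listed above, provenance tracking is the easiest to show in miniature: record, for every executed step, what went in and what came out. The sketch below is a hypothetical illustration using a decorator and content hashes; it is not the provenance mechanism of Kepler or the NBCR tools, and `minimize` and `equilibrate` are toy stand-ins, not real molecular dynamics steps.

```python
import hashlib
import json
from functools import wraps
from typing import Callable, Dict, List

# In-memory provenance log; a real system would persist this (assumption).
PROVENANCE: List[Dict[str, str]] = []


def fingerprint(value) -> str:
    """Short, stable content hash of a JSON-serializable value."""
    payload = json.dumps(value, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def tracked(step: Callable) -> Callable:
    """Wrap a workflow step so each call appends a provenance record."""
    @wraps(step)
    def wrapper(data):
        result = step(data)
        PROVENANCE.append({
            "step": step.__name__,
            "input": fingerprint(data),
            "output": fingerprint(result),
        })
        return result
    return wrapper


@tracked
def minimize(energies: List[float]) -> List[float]:
    """Toy step: halve every value."""
    return [round(e * 0.5, 3) for e in energies]


@tracked
def equilibrate(energies: List[float]) -> List[float]:
    """Toy step: shift every value by one."""
    return [round(e + 1.0, 3) for e in energies]


final = equilibrate(minimize([10.0, 4.0]))

if __name__ == "__main__":
    print(final)
    for record in PROVENANCE:
        print(record)
```

Because the output hash of one record matches the input hash of the next, the log can be chained back from a final result to the original input, which is what makes a rerun checkable.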

[Figure: bar charts comparing no compound, Prima-1 and Stictic; panels: no p53, and cancer cell with p53-R175H mutant]

15 new reactivation compounds

reactivation compounds kill cells with p53 cancer mutant

BENEFITS:
• Increase reuse
• Reproducibility
• Scale execution, problem & solution
• Compare methods
• Train students

Minimization Actor Equilibration Actor

AMBER GPU MD Workbench

LEADERSHIP TEAM
• Rommie Amaro, PI, UCSD: Computational chemistry, biophysics
• Andrew McCammon, UCSD: Computational chemistry, biophysics, chemical physics
• Mark Ellisman, UCSD: Molecular & cellular biology
• Andrew McCulloch, UCSD: Bioengineering, biophysics
• Michel Sanner, TSRI: Drug discovery & molecular visualization
• Phil Papadopoulos, UCSD/SDSC: Computer engineering, cyberinfrastructure technology
• Ilkay Altintas, UCSD/SDSC: Workflows, provenance
• Michael Holst, UCSD: Math, physics
• Arthur Olson, TSRI: Computational chemistry, drug discovery, visualization

Training at the interface

Challenge: how do we build the next generation of interdisciplinary scientists?

Data-to-Structural-Models Simulation-Based Drug Discovery

Biomedical Big Data Training Collaboratory

http://biobigdata.ucsd.edu

• BBDTC website is up and evolving!
• BBDTC contains seven full, open biomedical training courses
• Four-course biomedical big data series is planned for Winter 2017

Working with Industry Partners at SDSC

SDSC Provides a Range of Strategies for Engaging with Industry

• Sponsored research agreements
• Service agreements for use of systems & consulting
• Focused centers of excellence (Big Data Systems, Predictive Analytics, Workflow Technologies)
• Training programs in Data Science & Analytics
• Industry Partners Program for “jumpstarting” collaborations

Working with industry helps companies be more competitive, drives innovation, and fosters a healthy ecosystem between the research and private sector.

Example for Industrial Collaboration: Janssen R&D Rheumatoid Arthritis Study
• Janssen was interested in correlating genomic profile with response to TNFα inhibitor golimumab
• Sequenced 438 patients (full genome)
• SDSC assisted with re-alignment and variant calling using new/improved algorithms
• Needed analysis done in a reasonable timeframe (a few weeks)
