Will Biomedical Research Fundamentally Change in the Era of Big Data?

Philip E. Bourne PhD, FACMI
Stephenson Chair of Data Science
Director, Data Science Institute
Professor of Biomedical Engineering
[email protected]
https://www.slideshare.net/pebourne

6/6/17 UNC 1

Upload: philip-bourne

Post on 23-Jan-2018



My Bias in Addressing this Question

• Research in computational biology and big data

• Open science zealot

• AVC for Innovation UCSD

• Maintained biological data resources for 15 years (PDB, IEDB)

• Chief Data Officer of the NIH for 3 years (federal view)

• DSI Director 1 month (state view)


The short answer…

I don’t really know, but if what is happening in some sectors is any

indication, there will at a minimum be a significant perturbation


How Significant?

One extreme is the 6D’s

[Figure: the 6D’s plotted against time: Digitization, Deception, Disruption, Demonetization, Dematerialization, Democratization]

Events shown along the time axis:

• Digital camera invented by Kodak but shelved

• Megapixels & quality improve slowly; Kodak slow to react

• Film market collapses; Kodak goes bankrupt

• Phones replace cameras

• Instagram, Flickr become the value proposition

• Digital media becomes bona fide form of communication

Surely the 6D’s is an extreme view?

After all, biomedical research has always been data driven. So what is different?

So What is Different?

• The scope, variety, complexity, and volume of data that can be collected and accessed

• The need for new methods and tools for large-scale, distributed analysis

• The need to sustain high value resources in more cost-effective ways

• The need for a more open process

• The need for a larger trained and experienced workforce to support cutting-edge research

As presented to the NIH leadership November 2016

How Much Data?

• Big Data

– Total data from NIH-funded research currently estimated at 650 PB*

– 20 PB of that is in NCBI/NLM (3%) and it is expected to grow by 10 PB this year

• Dark Data

– Only 12% of data described in published papers is in recognized archives; 88% is dark data^

• Cost

– 2007-2014: NIH spent ~$1.2Bn extramurally on maintaining data archives

* In 2012 Library of Congress was 3 PB

^ http://www.ncbi.nlm.nih.gov/pubmed/26207759
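The percentages on this slide follow from the quoted figures. A quick check in Python, using only the numbers stated above (the variable names are illustrative and not from the talk):

```python
# Figures quoted on the slide.
total_nih_pb = 650     # estimated total data from NIH-funded research (PB)
ncbi_pb = 20           # portion held at NCBI/NLM (PB)
ncbi_growth_pb = 10    # expected growth at NCBI/NLM this year (PB)
archived_pct = 12      # % of published data found in recognized archives

# Derived quantities.
ncbi_share_pct = 100 * ncbi_pb / total_nih_pb
ncbi_after_growth_pb = ncbi_pb + ncbi_growth_pb
dark_pct = 100 - archived_pct

print(f"NCBI/NLM share of total: {ncbi_share_pct:.1f}%")
print(f"NCBI/NLM holdings after this year's growth: {ncbi_after_growth_pb} PB")
print(f"Dark data: {dark_pct}%")
```

The NCBI/NLM share rounds to the 3% shown on the slide, and the dark-data figure is simply the complement of the archived 12%.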

Why a More Open Process?

Use case: Diffuse Intrinsic Pontine Gliomas (DIPG)

• Occur in 1:100,000 individuals

• Peak incidence 6-8 years of age

• Median survival 9-12 months

• Surgery is not an option

• Chemotherapy is ineffective and radiotherapy gives only transient benefit

From Adam Resnick

Timeline of genomic studies in DIPG

• Landmark studies identify histone mutations as recurrent driver mutations in DIPG ~2012

• Almost 3 years later, in largely the same datasets, but partially expanded, the same two groups and 2 others identify ACVR1 mutations as a secondary, co-occurring mutation

From Adam Resnick

What do we need to do differently to reveal ACVR1?

• ACVR1 is a targetable kinase

• Inhibition of ACVR1 inhibited tumor progression in vitro

• ~300 DIPG patients a year

• ~60 are predicted to have ACVR1 mutations

• If only large-scale data sets had been integrated with TCGA and/or rare disease data in 2012, ACVR1 mutations would have been identified

• 60 patients/year × 3 years = 180 children’s lives (who likely succumbed to the disease during that time) could have been impacted if only data were FAIR

From Adam Resnick
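The estimate on this slide is simple arithmetic. The sketch below reproduces it from the slide's own approximate figures; the variable names are illustrative, and the 3-year delay is the gap between the ~2012 studies and the later ACVR1 finding described on the previous slide.

```python
# Approximate figures taken from the slide.
dipg_patients_per_year = 300    # ~300 DIPG patients a year
acvr1_patients_per_year = 60    # ~60 predicted to carry ACVR1 mutations
delay_years = 3                 # years between the two sets of studies

# Implied fraction of DIPG patients with ACVR1 mutations.
acvr1_fraction = acvr1_patients_per_year / dipg_patients_per_year

# Patients diagnosed during the delay who might have been affected.
patients_affected = acvr1_patients_per_year * delay_years

print(f"Implied ACVR1 fraction: {acvr1_fraction:.0%}")
print(f"Patients over {delay_years} years: {patients_affected}")
```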

Before we get too down on ourselves, consider some positives…

Causation: Comorbidity Network for 6.2M Danes Over 14.9 Years

Jensen et al 2014 Nat Comm 5:4022

The Center for Predictive Computational Phenotyping

Projects:

• EHR-based phenotyping

• neuroimage-based phenotyping

• transcriptome-based phenotyping

• epigenome-based phenotyping

• phenotype models for breast cancer screening

Labs:

• stochastic modeling

• low-dimensional representations

• data management

• value of information

EHR-based phenotyping

[Figure: a time axis with "now" marked; inputs are genotype, demographics, and events in EHR (diagnoses, procedures, medications, labs, etc.)]

• retrospective phenotyping: identify subjects who have exhibited a phenotype of interest (i.e. identify cases and controls)

• prospective phenotyping: predict a phenotype of interest before it is exhibited

Can predict thousands of diagnoses months in advance of being recorded in an EHR

• ~1.5 million subjects from Marshfield Clinic

• models learned for all ICD-9 codes (~3500) for which 500 cases and controls identified
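To make the retrospective/prospective distinction concrete, here is a minimal sketch on toy data. It is not the Center's method: the `Event` record, the function names, the codes, and the 180-day horizon are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Event:
    patient_id: str
    when: date
    code: str  # stand-in for a diagnosis code; the values below are made up


def retrospective_cases(events, target_code):
    """Retrospective: patients who have already exhibited the phenotype."""
    return {e.patient_id for e in events if e.code == target_code}


def prospective_example(events, patient_id, target_code, now, horizon):
    """Prospective: features come from events up to `now`; the label says
    whether the target code is recorded within `horizon` after `now`."""
    history = [e for e in events if e.patient_id == patient_id]
    features = sorted({e.code for e in history if e.when <= now})
    label = any(
        e.code == target_code and now < e.when <= now + horizon
        for e in history
    )
    return features, label


events = [
    Event("p1", date(2010, 1, 5), "A"),
    Event("p1", date(2010, 6, 1), "B"),
    Event("p1", date(2011, 2, 1), "TARGET"),
    Event("p2", date(2010, 3, 1), "A"),
]

cases = retrospective_cases(events, "TARGET")
controls = {e.patient_id for e in events} - cases

features, label = prospective_example(
    events, "p1", "TARGET", now=date(2010, 12, 31), horizon=timedelta(days=180)
)
print(cases, controls, features, label)
```

A real pipeline would also exclude patients who already carry the target code before `now`, and would learn a model over many such (features, label) pairs rather than inspect one.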

Will we do this research in a different way?

Will it become more like Airbnb?


Bonazzi & Bourne 2017 PLOS Biology 15(4) e2001818

I am not crazy, hear me out

• Airbnb is a platform that supports a trusted relationship between consumer (renter) and supplier (host)

• The platform focuses on maximizing the exchange of services between supplier and consumer and maximizing the amount of trust associated with a given stakeholder

• It seems to be working:

– 60 million users searching 2 million listings in 192 countries

– Average of 500,000 stays per night

– Valuation of US $25bn

Bonazzi & Bourne 2017 PLOS Biology 15(4) e2001818

Is not biomedical research the same?

Why a comparison to Airbnb is not fair

• Airbnb was born digital

• The exchange of services on Airbnb is simple compared to what is required of a platform to support biomedical research

Nevertheless there is much to be learnt

Supplier / Consumer

• Paper Author / Paper Reader

• Data Provider / Data Consumer

• Employer / Employee

• Reagent Provider / Reagent Consumer

• Software Provider / Software Consumer

• Grant Writer / Grant Reviewer

• Educator / Student

Platform: MS Project, Google Drive, Coursera, Researchgate, Academia.edu, Open Science Framework, Synapse, F1000, Rio

Platforms – The situation today

In summary there is not currently a widely adopted single platform for the exchange of services in biomedical research. Either there is a platform per service or no platform at all. Why have we not done better and what are the impediments today?

Impediments to a biomedical platform

• Current work practices by all stakeholders

• Entrenched business models

• Size of the undertaking aka resources needed

• Trust

• Incentives to use the platform

http://www.forbes.com/sites/johnhall/2013/04/29/10-barriers-to-employee-innovation/#8bdbaa811133

To some degree work practices are changing …

More of a culture of sharing

[Timeline figure; years marked: 1999, 2003, 2004, 2007, 2008, 2012, 2014]

• Research Tools Policy

• NIH Data Sharing Policy

• Model Organism Policy

• Genome-wide Association (GWAS) Policy

• NIH Public Access Policy (Publications)

• Big Data to Knowledge (BD2K) Initiative

• Genomic Data Sharing (GDS) Policy

• Modernization of NIH Clinical Trials

• White House Initiative (2013 “Holdren Memo”)

Driving sharing and innovation: Open Science Prize

NIH, Wellcome Trust, HHMI

https://www.openscienceprize.org

• An international scientific challenge competition to encourage and support the prototyping and development of services, tools, or platforms that enable utilization of open content

• 96 submissions received

• Solvers from 45 countries, spanning 5 continents

• Timeline

– May 2016: Phase 1 winners announced at Health DataPalooza

– Dec 1, 2016: Presentations and public voting

– Feb 2017: Overall winner announced

The NIH, through the Big Data to Knowledge (BD2K) initiative, is experimenting with a platform, keeping in mind the need to overcome these impediments

Enter The Commons

https://en.wikipedia.org/wiki/Ealing_Common#/media/File:Ealing_Common_-_geograph.org.uk_-_17075.jpg

[Figure: the same Supplier / Consumer / Platform table shown earlier]

Commons – Initial focus is on integrating two layers of the scholarly workflow

Commons Topology

[Figure: layered diagram with these labels]

• Compute Platform: Cloud or HPC

• Services: APIs, Containers, Indexing

• Software: Services & Tools (scientific analysis tools/workflows)

• Data: “Reference” Data Sets; User defined data

• Digital Object Compliance

• App store/User Interface

• Service models marked on the figure: PaaS, SaaS, IaaS

https://datascience.nih.gov/commons

Commons Compliance

• Treat products of research – data, methods, papers etc. as digital objects

• These digital objects exist in a shared virtual space

• Digital object compliance through FAIR principles:

– Findable

– Accessible (and usable)

– Interoperable

– Reusable

https://commonfund.nih.gov/bd2k/commons
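As a rough illustration of what treating research products as digital objects checked against FAIR might look like, the sketch below tests a metadata record for one field per FAIR facet. This is an assumption-laden toy: the `DigitalObject` class, its field names, and the one-field-per-facet mapping are hypothetical and far simpler than the FAIR principles themselves or anything the Commons specifies.

```python
from dataclasses import dataclass


@dataclass
class DigitalObject:
    # Hypothetical metadata fields, one per FAIR facet.
    identifier: str = ""    # Findable: a persistent identifier
    access_url: str = ""    # Accessible: where the object can be retrieved
    data_format: str = ""   # Interoperable: a shared, documented format
    license: str = ""       # Reusable: terms under which it may be reused


def fair_gaps(obj):
    """Return the FAIR facets for which the record has no metadata."""
    checks = {
        "Findable": bool(obj.identifier),
        "Accessible": bool(obj.access_url),
        "Interoperable": bool(obj.data_format),
        "Reusable": bool(obj.license),
    }
    return [facet for facet, ok in checks.items() if not ok]


# Placeholder values; example.org and the DOI below are not real resources.
dataset = DigitalObject(
    identifier="doi:10.0000/example",
    access_url="https://example.org/data.csv",
    data_format="text/csv",
)
print(fair_gaps(dataset))
```

Here the record lacks a license, so only the Reusable facet is flagged. Presence of a field is a necessary condition at best; real compliance also depends on the quality of the metadata and the services behind it.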

Incentives

• Airbnb

– Monetize unutilized space

– Ease of use

– New vacation experience

• Commons

– Need to improve rigor and reproducibility

– Productivity

– Sustainability

– Education and training

– Opportunity to undertake elastic compute on large complex data

https://commonfund.nih.gov/bd2k/commons

Why? - Consider Current High Profile Examples

• Moonshot – Bringing together 5 petabytes of homogenized data within the Genomic Data Commons (GDC) to explore genotype-phenotype relationships

• MODs – Multiple high value high cost genomic resources

• Human Microbiome Project – microbe characterization and analysis

• TOPMed – Genomic, proteomic, metabolomic, image and EHR data

• Precision Medicine – Building a platform to support data on >1M individuals with extensive and constantly updated health profiles

• ECHO – Effects of Environmental Exposures on Child Health and Development – Integration of child health and environmental data

• BRAIN – Temporal and spatial analysis of neural circuits

What Data Science Do We Need to Get There?

• Moonshot – new ways to analyze genotype-phenotype associations

• MODs – new curation and integration tools

• Human Microbiome Project – new cloud based tools

• TOPMed – large scale storage and analysis; data harmonization

• Precision Medicine – security; analysis of sensor data; EHR integration

• ECHO – metadata descriptions of health and environmental data; application of geospatial methods

• BRAIN – methods for network analysis, visualization

All: Analytics, the Commons, FAIR, sustainability, workforce

What should institutions be doing?

Here is one example…

• Presidential Fellows in Data Science

• MSDS Capstone Projects

• Governor’s Data Internship Program

• Collaborative Research Grants

• Data Science Training

So let me summarize

• Will Biomedical Research Fundamentally Change in the Era of Big Data?

– Yes, to some unknown degree. Why?

– The research opportunities, both causal and predictive

– Recognition that the process needs to change and big data can facilitate change given the right incentives


Acknowledgements


The BD2K Team at NIH

My New Colleagues at UVA