Will Biomedical Research Fundamentally Change in the Era of
Big Data?
Philip E. Bourne PhD, FACMI
Stephenson Chair of Data Science
Director, Data Science Institute
Professor of Biomedical Engineering
[email protected]
www.slideshare.net/pebourne
6/6/17 UNC 1
My Bias in Addressing this Question
• Research in computational biology and big data
• Open science zealot
• AVC for Innovation UCSD
• Maintained biological data resources for 15 years (PDB, IEDB)
• Chief Data Officer of the NIH for 3 years (federal view)
• DSI Director 1 month (state view)
The short answer…
I don’t really know, but if what is happening in some sectors is any
indication, there will at a minimum be a significant perturbation
How Significant?
One extreme is the 6D’s
The 6D’s over time, with Kodak as the example:
• Digitization: Digital camera invented by Kodak but shelved
• Deception: Megapixels & quality improve slowly; Kodak slow to react
• Disruption: Film market collapses; Kodak goes bankrupt
• Demonetization: Phones replace cameras
• Dematerialization: Instagram, Flickr become the value proposition
• Democratization: Digital media becomes bona fide form of communication
Surely the 6D’s is an extreme view?
After all, biomedical research has always been data driven. So what is different?
So What is Different?
• The scope, variety, complexity, and volume of data that can be collected and accessed
• The need for new methods and tools for large-scale, distributed analysis
• The need to sustain high value resources in more cost-effective ways
• The need for a more open process
• The need for a larger trained and experienced workforce to support cutting-edge research
As presented to the NIH leadership November 2016
How Much Data?
• Big Data
– Total data from NIH-funded research currently estimated at 650 PB*
– 20 PB of that is in NCBI/NLM (3%) and it is expected to grow by 10 PB this year
• Dark Data
– Only 12% of data described in published papers is in recognized archives; 88% is dark data^
• Cost
– 2007-2014: NIH spent ~$1.2Bn extramurally on maintaining data archives
* In 2012 Library of Congress was 3 PB
^ http://www.ncbi.nlm.nih.gov/pubmed/26207759
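The proportions on this slide can be checked with a few lines of arithmetic. This is only an illustrative sketch: the variable names are assumptions, and the numbers are the ones quoted above.

```python
# Figures quoted on the slide, in petabytes; names are illustrative.
TOTAL_NIH_PB = 650          # estimated total data from NIH-funded research
NCBI_NLM_PB = 20            # portion held at NCBI/NLM
NCBI_GROWTH_PB = 10         # expected NCBI/NLM growth this year
LIBRARY_OF_CONGRESS_PB = 3  # Library of Congress in 2012
ARCHIVED_FRACTION = 0.12    # share of published data in recognized archives


def percent(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


ncbi_share = percent(NCBI_NLM_PB, TOTAL_NIH_PB)
ncbi_growth = percent(NCBI_GROWTH_PB, NCBI_NLM_PB)
dark_share = 100.0 * (1 - ARCHIVED_FRACTION)
loc_multiples = TOTAL_NIH_PB / LIBRARY_OF_CONGRESS_PB

print(f"NCBI/NLM share of NIH data: {ncbi_share:.1f}%")
print(f"Expected NCBI/NLM growth this year: {ncbi_growth:.0f}%")
print(f"Dark data share: {dark_share:.0f}%")
print(f"NIH data as multiples of the 2012 Library of Congress: {loc_multiples:.0f}x")
```

The growth figure is the point worth noticing: 10 PB on a 20 PB base means the NCBI/NLM holdings alone would grow by half in one year.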
Why a More Open Process?
Use case: Diffuse Intrinsic Pontine Gliomas (DIPG)
• Occur in 1:100,000 individuals
• Peak incidence 6-8 years of age
• Median survival 9-12 months
• Surgery is not an option
• Chemotherapy ineffective and radiotherapy only transient in effect
From Adam Resnick
Timeline of genomic studies in DIPG
• Landmark studies identify histone mutations as recurrent driver mutations in DIPG ~2012
• Almost 3 years later, in largely the same datasets, but partially expanded, the same two groups and 2 others identify ACVR1 mutations as a secondary, co-occurring mutation
From Adam Resnick
What do we need to do differently to reveal ACVR1?
• ACVR1 is a targetable kinase
• Inhibition of ACVR1 inhibited tumor progression in vitro
• ~300 DIPG patients a year
• ~60 are predicted to have ACVR1 mutations
• If only large-scale data sets had been integrated with TCGA and/or rare disease data in 2012, ACVR1 mutations would have been identified
• 60 patients/year × 3 years = 180 children’s lives (who likely succumbed to the disease during that time) could have been impacted if only data were FAIR
From Adam Resnick
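The estimate on this slide is a simple product, sketched here with the slide's own numbers. The variable names are assumptions made for illustration.

```python
# Numbers quoted on the slide; names are illustrative.
DIPG_PATIENTS_PER_YEAR = 300   # approximate DIPG patients per year
ACVR1_PATIENTS_PER_YEAR = 60   # predicted to carry an ACVR1 mutation
DISCOVERY_DELAY_YEARS = 3      # gap between the histone and ACVR1 findings

acvr1_fraction = ACVR1_PATIENTS_PER_YEAR / DIPG_PATIENTS_PER_YEAR
patients_during_delay = ACVR1_PATIENTS_PER_YEAR * DISCOVERY_DELAY_YEARS

print(f"Predicted ACVR1 fraction: {acvr1_fraction:.0%}")
print(f"Patients affected during the delay: {patients_during_delay}")
```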
Causation:
Comorbidity Network for 6.2M Danes Over 14.9 Years
Jensen et al 2014 Nat Comm 5:4022
The Center for Predictive Computational Phenotyping
Projects:
• EHR-based phenotyping
• neuroimage-based phenotyping
• transcriptome-based phenotyping
• epigenome-based phenotyping
• phenotype models for breast cancer screening
Labs:
• stochastic modeling
• low-dimensional representations
• data management
• value of information
EHR-based phenotyping
• Retrospective phenotyping: identify subjects who have exhibited a phenotype of interest (i.e. identify cases and controls)
• Prospective phenotyping: predict a phenotype of interest before it is exhibited
• Inputs: genotype; demographics; events in EHR (diagnoses, procedures, medications, labs, etc.)
[Figure: timeline marked "time" and "now", with "?" for the phenotype to be predicted]
Can predict thousands of diagnoses months in advance of being recorded in an EHR
• ~1.5 million subjects from Marshfield Clinic
• Models learned for all ICD-9 codes (~3500) for which 500 cases and controls were identified
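A minimal sketch of how training data for this kind of prospective phenotyping might be assembled, using toy records. Everything here is an assumption for illustration: the event layout, the function names, the 180-day lead time, and reading "500" as a minimum count. It is not the method used in the Marshfield Clinic study.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical toy EHR events: (subject_id, ICD-9 code, date recorded).
EVENTS = [
    ("s1", "250.00", date(2010, 1, 5)),
    ("s1", "401.9", date(2011, 3, 1)),
    ("s2", "401.9", date(2009, 7, 9)),
    ("s3", "250.00", date(2012, 2, 2)),
    ("s3", "272.4", date(2010, 6, 30)),
]

MIN_CASES_AND_CONTROLS = 500  # threshold from the slide, read as a minimum


def first_diagnosis_dates(events):
    """Map each code to {subject: earliest date the code was recorded}."""
    first = defaultdict(dict)
    for subject, code, when in events:
        seen = first[code].get(subject)
        if seen is None or when < seen:
            first[code][subject] = when
    return first


def eligible_codes(first, all_subjects, min_count):
    """Keep codes with at least min_count cases and min_count controls."""
    keep = []
    for code, cases in first.items():
        n_cases = len(cases)
        n_controls = len(all_subjects) - n_cases
        if n_cases >= min_count and n_controls >= min_count:
            keep.append(code)
    return sorted(keep)


def history_before(events, subject, cutoff):
    """Codes recorded for a subject strictly before the cutoff date."""
    return sorted({c for s, c, d in events if s == subject and d < cutoff})


def prospective_examples(events, code, lead=timedelta(days=180)):
    """Build (subject, prior codes, label) rows for one target code.

    Cases keep only events recorded more than `lead` before their first
    diagnosis, so a model trained on them has to predict in advance.
    Controls keep their full history.
    """
    first = first_diagnosis_dates(events)
    subjects = sorted({s for s, _, _ in events})
    rows = []
    for subject in subjects:
        onset = first[code].get(subject)
        if onset is not None:
            prior = history_before(events, subject, onset - lead)
            rows.append((subject, prior, 1))
        else:
            rows.append((subject, history_before(events, subject, date.max), 0))
    return rows


subjects = sorted({s for s, _, _ in EVENTS})
first = first_diagnosis_dates(EVENTS)
# The toy data is tiny, so a threshold of 1 stands in for the slide's 500.
print(eligible_codes(first, subjects, min_count=1))
print(prospective_examples(EVENTS, "250.00"))
```

The rows produced this way would feed one classifier per eligible code; truncating each case's history at the lead time is what makes the task predictive rather than retrospective.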
Will we do this research in a different way?
Will it become more like Airbnb?
Bonazzi & Bourne 2017 PLOS Biology 15(4) e2001818
I am not crazy, hear me out
• Airbnb is a platform that supports a trusted relationship between consumer (renter) and supplier (host)
• The platform focuses on maximizing the exchange of services between supplier and consumer and maximizing the amount of trust associated with a given stakeholder
• It seems to be working:
– 60 million users searching 2 million listings in 192 countries
– Average of 500,000 stays per night
– Valuation of US $25bn
Bonazzi & Bourne 2017 PLOS Biology 15(4) e2001818
Why a comparison to Airbnb is not fair
• Airbnb was born digital
• The exchange of services on Airbnb is simple compared to what is required of a platform to support biomedical research
Nevertheless, there is much to be learnt
Platforms – The situation today
Supplier / Consumer:
• Paper Author / Paper Reader
• Data Provider / Data Consumer
• Employer / Employee
• Reagent Provider / Reagent Consumer
• Software Provider / Software Consumer
• Grant Writer / Grant Reviewer
• Educator / Student
Platform: MS Project, Google Drive, Coursera, Researchgate, Academia.edu, Open Science Framework, Synapse, F1000, Rio
In summary, there is not currently a widely adopted single platform for the exchange of services in biomedical research. Either there is a platform per service or no platform at all.
Why have we not done better and what are the impediments today?
Impediments to a biomedical platform
• Current work practices by all stakeholders
• Entrenched business models
• Size of the undertaking aka resources needed
• Trust
• Incentives to use the platform
http://www.forbes.com/sites/johnhall/2013/04/29/10-barriers-to-employee-innovation/#8bdbaa811133
More of a culture of sharing
• 1999: Research Tools Policy
• 2003: NIH Data Sharing Policy
• 2004: Model Organism Policy
• 2007: Genome-wide Association (GWAS) Policy
• 2008: NIH Public Access Policy (Publications)
• 2012: Big Data to Knowledge (BD2K) Initiative
• 2014: Genomic Data Sharing (GDS) Policy
• Modernization of NIH Clinical Trials
• White House Initiative (2013 “Holdren Memo”)
Driving sharing and innovation: Open Science Prize
NIH, Wellcome Trust, HHMI
https://www.openscienceprize.org
• An international scientific challenge competition to encourage and support the prototyping and development of services, tools, or platforms that enable utilization of open content
• 96 submissions received
• Solvers from 45 countries, spanning 5 continents
• Timeline
– May 2016: Phase 1 winners announced at Health Datapalooza
– Dec 1, 2016: Presentations and public voting
– Feb 2017: Overall winner announced
The NIH, through the Big Data to Knowledge (BD2K) initiative, is experimenting with a platform, keeping in mind the need to overcome these impediments
Enter The Commons
https://en.wikipedia.org/wiki/Ealing_Common#/media/File:Ealing_Common_-_geograph.org.uk_-_17075.jpg
Commons – Initial focus is on integrating two layers of the scholarly workflow
[Figure: the supplier, consumer, and platform diagram from "Platforms – The situation today", repeated]
Commons Topology
• Compute Platform: Cloud or HPC
• Services: APIs, Containers, Indexing
• Software: Services & Tools (scientific analysis tools/workflows)
• Data: “Reference” Data Sets; User defined data
• Digital Object Compliance
• App store/User Interface
• Service models shown alongside the stack: PaaS, SaaS, IaaS
https://datascience.nih.gov/commons
Commons Compliance
• Treat products of research – data, methods, papers etc. as digital objects
• These digital objects exist in a shared virtual space
• Digital object compliance through FAIR principles:
– Findable
– Accessible (and usable)
– Interoperable
– Reusable
https://commonfund.nih.gov/bd2k/commons
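To make "digital object compliance" concrete, here is a rough sketch of a per-principle check on a metadata record. The record fields, the function names, and the pass/fail rules are all hypothetical simplifications; they are not the Commons specification or any NIH API.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """Hypothetical metadata record for a research product."""
    identifier: str = ""   # persistent ID such as a DOI (Findable)
    access_url: str = ""   # where the object can be retrieved (Accessible)
    media_type: str = ""   # standard format label (Interoperable)
    license: str = ""      # reuse terms (Reusable)
    keywords: list = field(default_factory=list)  # descriptive metadata (Findable)


def fair_report(obj):
    """Return a crude pass/fail flag for each FAIR principle."""
    return {
        "findable": bool(obj.identifier) and bool(obj.keywords),
        "accessible": obj.access_url.startswith(("http://", "https://")),
        "interoperable": bool(obj.media_type),
        "reusable": bool(obj.license),
    }


def is_compliant(obj):
    """An object passes only if every principle's flag is set."""
    return all(fair_report(obj).values())


# Placeholder values; example.org and the DOI are not real resources.
dataset = DigitalObject(
    identifier="doi:10.0000/example",
    access_url="https://example.org/data.csv",
    media_type="text/csv",
    license="CC-BY-4.0",
    keywords=["DIPG", "genomics"],
)
unlicensed = DigitalObject(
    identifier="doi:10.0000/example2",
    access_url="https://example.org/other.csv",
    media_type="text/csv",
    keywords=["EHR"],
)

print(fair_report(dataset))
print(is_compliant(unlicensed))
```

Real FAIR assessment involves far richer criteria (metadata standards, vocabularies, provenance); the sketch only shows that compliance can be treated as a testable property of each object.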
Incentives
• Airbnb
– Monetize unutilized space
– Ease of use
– New vacation experience
• Commons
– Need to improve rigor and reproducibility
– Productivity
– Sustainability
– Education and training
– Opportunity to undertake elastic compute on large complex data
https://commonfund.nih.gov/bd2k/commons
Why? - Consider Current High Profile Examples
• Moonshot – Bringing together 5 petabytes of homogenized data within the Genomic Data Commons (GDC) to explore genotype-phenotype relationships
• MODs – Multiple high value high cost genomic resources
• Human Microbiome Project – microbe characterization and analysis
• TOPMed – Genomic, proteomic, metabolomic, image and EHR data
• Precision Medicine – Building a platform to support data on >1M individuals with extensive and constantly updated health profiles
• ECHO – Effects of Environmental Exposures on Child Health and Development: integration of child health and environmental data
• BRAIN – Temporal and spatial analysis of neural circuits
What Data Science Do We Need to Get There?
• Moonshot – new ways to analyze genotype-phenotype associations
• MODs – new curation and integration tools
• Human Microbiome Project – new cloud based tools
• TOPMed – large scale storage and analysis; data harmonization
• Precision Medicine – security; analysis of sensor data; EHR integration
• ECHO – metadata descriptions of health and environmental data; application of geospatial methods
• BRAIN – methods for network analysis, visualization
All: Analytics, the Commons, FAIR, sustainability, workforce
• Presidential Fellows in Data Science
• MSDS Capstone Projects
• Governor’s Data Internship Program
• Collaborative Research Grants
• Data Science Training
So let me summarize
• Will Biomedical Research Fundamentally Change in the Era of Big Data?
– Yes, to some unknown degree. Why?
– The research opportunities, both causal and predictive
– Recognition that the process needs to change and big data can facilitate change given the right incentives