Data Commons & Data Science Workshop
TRANSCRIPT
![Page 1: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/1.jpg)
Data Sharing, FAIR and the NCI
June 7th, 2016
Warren Kibbe, PhD
NCI Center for Biomedical Informatics
@wakibbe
The views expressed are my views and not those of NCI
![Page 2: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/2.jpg)
Key points
• Support open science
• Support data reusability
• Assess the impact of data, software, and annotation sharing
• Next generation of pre-clinical models
• Predictive modeling
Reduce the risk, improve early detection, outcomes and survivorship in cancer
![Page 3: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/3.jpg)
![Page 4: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/4.jpg)
![Page 5: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/5.jpg)
![Page 6: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/6.jpg)
![Page 7: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/7.jpg)
To develop the knowledge base that will lessen the burden of
cancer in the United States and around the world.
NCI Mission
![Page 8: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/8.jpg)
US National Cancer Program
![Page 9: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/9.jpg)
FAIR:
Making data Findable, Accessible, Attributable, Interoperable, and Reusable, and providing Recognition
Force11 white paper: https://www.force11.org/group/fairgroup/fairprinciples
![Page 10: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/10.jpg)
NIH Genomic Data Sharing Policy
https://gds.nih.gov/
Went into effect January 25, 2015
NCI guidance: http://www.cancer.gov/grants-training/grants-management/nci-policies/genomic-data
Requires public sharing of genomic data sets
![Page 11: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/11.jpg)
NIH and Nature, November 2015
![Page 12: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/12.jpg)
NEJM, January 2016
![Page 13: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/13.jpg)
Science, March 2016
![Page 14: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/14.jpg)
Changing the conversation around data sharing
How do we find data, software, standards?
How can we make data, software, metadata accessible?
How do we reuse data standards?
How do we make more data machine readable?
Assumption:
Data sharing enhances reusability and reproducibility
![Page 15: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/15.jpg)
![Page 16: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/16.jpg)
Vice President’s Cancer Initiative (aka Moonshot)
How do we enable meaningful, patient-centered and patient-level
data sharing for cancer?
![Page 17: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/17.jpg)
Vice President Biden’s Cancer Initiative, January 2016
Federal Government Cancer Task Force – how can the federal agencies work together to:
Accelerate our understanding of cancer and its prevention, early detection, treatment, and cure
Improve patient access and care
Support greater access to new research, data, and computational capabilities
Encourage development of cancer treatments
Identify and address any unnecessary regulatory barriers and consider ways to expedite administrative reforms
Ensure optimal investment of federal resources
Identify opportunities to develop public–private partnerships and increase coordination of the federal government’s efforts with the private sector, as appropriate
![Page 18: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/18.jpg)
Vice President Biden’s Cancer Initiative, January 2016
Scientific Objectives of the Vice President’s Cancer Initiative Blue Ribbon Panel
Prevention and Cancer Vaccine Development
Early Cancer Detection
Cancer Immunotherapy and Combination Therapy
Genomic Analysis of Tumor and Surrounding Cells
Enhanced Data Sharing
Oncology Center of Excellence
Pediatric Cancer
Exceptional Scientific Opportunities in Cancer Research
http://www.cancer.gov/research/key-initiatives/moonshot-cancer-initiative/get-involved
![Page 19: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/19.jpg)
Data Commons
Data commons co-locate data, storage and computing infrastructure, and commonly used tools for analyzing and sharing data to create an interoperable resource for the research community.*
*Robert L. Grossman, Allison Heath, Mark Murphy, Maria Patterson, A Case for Data Commons: Towards Data Science as a Service, to appear.
Source of image: Interior of one of Google’s data centers, www.google.com/about/datacenters/.
![Page 20: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/20.jpg)
The Cancer Genomic Data Commons (GDC) is an existing effort to standardize and simplify submission of genomic data to NCI and follow the principles of FAIR – Findable, Accessible, Interoperable, Reusable.
The GDC is part of the NIH Big Data to Knowledge (BD2K) initiative and an example of the NIH Commons
Genomic Data Commons
Microattribution, nanopublications, and tracking the use of data, annotations, and algorithms support the data/software/metadata life cycle, to provide credit and to analyze the impact of data, software, analytics, algorithm, curation and knowledge sharing.
![Page 21: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/21.jpg)
NCI Genomic Data Commons
The GDC will go live with approximately 4.1 PB of data. This includes:
• 2.6 PB of legacy data and 1.5 PB of “harmonized” data
• 577,878 files about 14,194 cases (patients), in 42 cancer types, across 29 primary sites
• 10 major data types, ranging from Raw Sequencing Data and Raw Microarray Data to Copy Number Variation, Simple Nucleotide Variation and Gene Expression
• Data derived from 17 different experimental strategies, the major ones being RNA-Seq, WXS, WGS, miRNA-Seq, Genotyping Array and Expression Array
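As a quick sanity check on the launch figures above, the short sketch below adds the legacy and harmonized volumes and derives two rough averages. It assumes decimal petabytes (1 PB = 10^15 bytes), which the slide does not specify, and the function name `launch_summary` is purely illustrative.

```python
# Figures quoted on the slide.
LEGACY_PB = 2.6
HARMONIZED_PB = 1.5
FILES = 577_878
CASES = 14_194

# Assumption: decimal units; the slide does not say whether PB means 10**15 or 2**50 bytes.
PB = 10**15
GB = 10**9


def launch_summary():
    """Return the total volume plus rough per-file and per-case averages."""
    total_pb = LEGACY_PB + HARMONIZED_PB
    return {
        "total_pb": round(total_pb, 1),
        "avg_file_gb": total_pb * PB / FILES / GB,
        "files_per_case": FILES / CASES,
    }


if __name__ == "__main__":
    for key, value in launch_summary().items():
        print(f"{key}: {value:.2f}")
```

The two components do sum to the quoted 4.1 PB. The averages are only indicative, since raw sequencing files are far larger than derived files such as variant calls.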
![Page 22: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/22.jpg)
Genomic Data Commons Data Portal
![Page 23: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/23.jpg)
The NCI Genomic Data Commons User Interface: Home Page
![Page 24: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/24.jpg)
The NCI Genomic Data Commons User Interface: Sample Browser
![Page 25: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/25.jpg)
The NCI Genomic Data Commons User Interface: Sample Selection
![Page 26: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/26.jpg)
Clinical data Biospecimen data
Molecular data Files uploaded
The NCI Genomic Data Commons User Interface: Data Submission Dashboard
![Page 27: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/27.jpg)
Development of the NCI Genomic Data Commons (GDC) to Foster the Molecular Diagnosis and Treatment of Cancer
GDC
Bob Grossman, PI
Univ. of Chicago
Ontario Inst. Cancer Res.
Leidos
![Page 28: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/28.jpg)
Support the Precision Medicine Initiative
• Integrate GDC with Cloud
• Expand data model to include other data (e.g. imaging and proteomics)
The Genomic Data Commons and Cloud Pilots
![Page 29: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/29.jpg)
NCI GDC Datasets
GDC datasets from the NCI Office of Cancer Genomics (OCG) include:
• The Cancer Genome Atlas (TCGA) (today)
• TARGET (today)
• CGCI: Cancer Genome Characterization Initiative (today)
• Exceptional Responders (soon)
• ALCHEMIST (Adjuvant Lung Cancer Enrichment Marker)
• CTD2: Cancer Target Discovery and Development
Approximate size of the aggregate data:
2016: 2 PB
2017: 5 PB
2018: ? PB
![Page 30: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/30.jpg)
NCI GDC and NCI Cloud Pilots
Source: Jean Claude Zenklusen, Ph.D., Director, TCGA Program Office, National Cancer Institute
![Page 31: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/31.jpg)
The NCI Cancer Genomics Cloud Pilots
Understanding how to meet the research community’s need to analyze large-scale cancer
genomic and clinical data
![Page 32: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/32.jpg)
Support the Precision Medicine Initiative
• Expand data model to include other data (e.g. imaging and proteomics)
• Allow easy publication of persistent links to data, annotations, algorithms, tools, workflows
• Measure usage and impact
• Change incentives for public contributions
The Genomic Data Commons and Cloud Pilots
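To make "persistent links" and "measure usage and impact" a little more concrete, here is a minimal sketch of how usage events keyed by a persistent identifier could be tallied per resource and per contributor. Everything in it is hypothetical: the `UsageEvent` record, its fields, and the `example:` identifiers are illustrations, not part of the GDC or the Cloud Pilots.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageEvent:
    """One recorded use of a shared resource (hypothetical record layout)."""
    resource_id: str   # persistent identifier of a data set, tool, or workflow
    contributor: str   # person or group who should receive credit
    kind: str          # e.g. "download", "analysis", "citation"


def usage_by_resource(events):
    """Count how often each persistent identifier was used."""
    return Counter(e.resource_id for e in events)


def impact_by_contributor(events):
    """Count uses credited to each contributor across all their resources."""
    return Counter(e.contributor for e in events)


if __name__ == "__main__":
    log = [
        UsageEvent("example:dataset/1", "lab-a", "download"),
        UsageEvent("example:dataset/1", "lab-a", "analysis"),
        UsageEvent("example:workflow/7", "lab-b", "citation"),
    ]
    print(usage_by_resource(log))
    print(impact_by_contributor(log))
```

Keying every event on a stable identifier is what allows credit to flow back to contributors of data, annotations, and tools, not only to authors of papers.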
![Page 33: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/33.jpg)
PMI – Oncology, the GDC and the Cloud Pilots Goals
• Support precision medicine-focused clinical research
• Enable researchers to deposit well-annotated (Interoperable) genomic data sets with the GDC
• Provide a single source (and single dbGaP access request!) to Find and Access these data
• Enable effective analysis and meta-analysis of these data without requiring local downloads: data Reuse
![Page 34: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/34.jpg)
PMI – Oncology, the GDC and the Cloud Pilots Goals
• Provide a data integration platform to allow multiple data types, multi-scalar data, temporal data from cancer models and patients through open APIs
• Work with the Global Alliance for Genomics and Health (GA4GH) to define the next generation of secure, flexible, meaningful, interoperable, lightweight interfaces: open APIs
• Engage the cancer research community in evaluating the open APIs for ease of use and effectiveness
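As one illustration of what a request to an open API might look like, the sketch below builds (but does not send) a query URL in the style of the public GDC REST API. The base URL, the `/cases` endpoint, the filter syntax, and the field names are my understanding of that API and are not stated in the slides; the helper name `build_cases_query` is hypothetical, and nothing here represents the GA4GH interfaces mentioned above.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: base URL of the public GDC REST API (not given in the slides).
GDC_API = "https://api.gdc.cancer.gov"


def build_cases_query(project_id, fields=("case_id", "primary_site"), size=10):
    """Build a URL asking for cases that belong to one project.

    No network call is made; the function only assembles the query string.
    """
    filters = {
        "op": "in",
        "content": {"field": "project.project_id", "value": [project_id]},
    }
    params = {
        "filters": json.dumps(filters),  # filter expression is sent as JSON text
        "fields": ",".join(fields),
        "format": "JSON",
        "size": str(size),
    }
    return f"{GDC_API}/cases?{urlencode(params)}"


if __name__ == "__main__":
    url = build_cases_query("TCGA-BRCA")
    print(url)
    # Decode the query string again to show the filter that would be sent.
    query = parse_qs(urlsplit(url).query)
    print(json.loads(query["filters"][0]))
```

Keeping the request a plain URL with a JSON filter is one way to stay "lightweight": any HTTP client can issue it, which makes it easy for the community to evaluate the interface.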
![Page 35: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/35.jpg)
Three Cancer Genomics Cloud Pilots
Broad Institute
• PI: Gad Getz
• Google Cloud
• Firehose in the cloud
• http://firecloud.org
Institute for Systems Biology
• PI: Ilya Shmulevich
• Google Cloud
• Interactive visualization and analysis
• http://cgc.systemsbiology.net/
Seven Bridges Genomics
• PI: Deniz Kural
• Amazon Web Services
• > 30 public pipelines
• http://www.cancergenomicscloud.org
![Page 36: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/36.jpg)
Cancer Genomics Project Teams
CGC Pilot Team Principal Investigators
• Gad Getz, Ph.D. - Broad Institute - http://firecloud.org
• Ilya Shmulevich, Ph.D. - ISB - http://cgc.systemsbiology.net/
• Deniz Kural, Ph.D. - Seven Bridges - http://www.cancergenomicscloud.org
NCI Project Officer & CORs
• Anthony Kerlavage, Ph.D. - Project Officer
• Juli Klemm, Ph.D. - COR, Broad Institute
• Tanja Davidsen, Ph.D. - COR, Institute for Systems Biology
• Ishwar Chandramouliswaran, MS, MBA - COR, Seven Bridges Genomics
GDC Principal Investigator
• Robert Grossman, Ph.D. - University of Chicago
NCI Leadership Team
• Doug Lowy, M.D.
• Lou Staudt, M.D., Ph.D.
• Stephen Chanock, M.D.
• George Komatsoulis, Ph.D.
• Warren Kibbe, Ph.D.
Center for Cancer Genomics Partners
• JC Zenklusen, Ph.D.
• Daniela Gerhard, Ph.D.
• Zhining Wang, Ph.D.
• Liming Yang, Ph.D.
• Martin Ferguson, Ph.D.
![Page 37: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/37.jpg)
Cloud Pilots went public in January
Sign up now!
Apply for TCGA dbGaP access if you do not already have it!
![Page 39: Data Commons & Data Science Workshop](https://reader036.vdocuments.us/reader036/viewer/2022062503/58ef4cff1a28ab487a8b462d/html5/thumbnails/39.jpg)
www.cancer.gov www.cancer.gov/espanol