data commons · data commons vs collection of datasets collections of datasets (ala nih data...
TRANSCRIPT
![Page 1: Data Commons · Data Commons vs collection of datasets Collections of datasets (ala NIH Data Commons). Solves the problem of finding the dataset. But the remaining problems --- cleaning,](https://reader034.vdocuments.us/reader034/viewer/2022042417/5f33a913d13d9324de187ef7/html5/thumbnails/1.jpg)
Data Commons
![Page 2: Data Commons · Data Commons vs collection of datasets Collections of datasets (ala NIH Data Commons). Solves the problem of finding the dataset. But the remaining problems --- cleaning,](https://reader034.vdocuments.us/reader034/viewer/2022042417/5f33a913d13d9324de187ef7/html5/thumbnails/2.jpg)
Context
Data powers everything
- policy, - journalism,- health,- science
How do we make it easier?
![Page 3: Data Commons · Data Commons vs collection of datasets Collections of datasets (ala NIH Data Commons). Solves the problem of finding the dataset. But the remaining problems --- cleaning,](https://reader034.vdocuments.us/reader034/viewer/2022042417/5f33a913d13d9324de187ef7/html5/thumbnails/3.jpg)
Problem is not a shortage of data
Demographics: ACS, Housing Survey, Community Survey,
Economics: BLS, BEA, World Bank, OECD, …Health: CDC Wonder, CDC Diabetes, County Health, …Climate: NOAA, Hurricanes, Flooding, …Genomics: NCBI, ENCODE, ...
Too many formats, schemas, ...
![Page 4: Data Commons · Data Commons vs collection of datasets Collections of datasets (ala NIH Data Commons). Solves the problem of finding the dataset. But the remaining problems --- cleaning,](https://reader034.vdocuments.us/reader034/viewer/2022042417/5f33a913d13d9324de187ef7/html5/thumbnails/4.jpg)
Using Data: current model
Forage for data sources, track down assumptions/provenance
Clean it up, compile data sources, figure out storage, …
High upfront costs, sparse ecosystem, few tools, …
![Page 5: Data Commons · Data Commons vs collection of datasets Collections of datasets (ala NIH Data Commons). Solves the problem of finding the dataset. But the remaining problems --- cleaning,](https://reader034.vdocuments.us/reader034/viewer/2022042417/5f33a913d13d9324de187ef7/html5/thumbnails/5.jpg)
Google maps made satellite imagery part of everyone’s life
![Page 6: Data Commons · Data Commons vs collection of datasets Collections of datasets (ala NIH Data Commons). Solves the problem of finding the dataset. But the remaining problems --- cleaning,](https://reader034.vdocuments.us/reader034/viewer/2022042417/5f33a913d13d9324de187ef7/html5/thumbnails/6.jpg)
We want to do that for data
From search for dataset, download, clean, normalize,
join …
to
Just ask Google
![Page 7: Data Commons · Data Commons vs collection of datasets Collections of datasets (ala NIH Data Commons). Solves the problem of finding the dataset. But the remaining problems --- cleaning,](https://reader034.vdocuments.us/reader034/viewer/2022042417/5f33a913d13d9324de187ef7/html5/thumbnails/7.jpg)
Census(ACS)BLSFBI
NOAA
CDCSequence data
Wikidata Landsat
Aggregated Knowledge Graph
Grid
Medicare
EPA
APIS
JournalistsStudents ResearchersSearch, News Cloud
![Page 8: Data Commons · Data Commons vs collection of datasets Collections of datasets (ala NIH Data Commons). Solves the problem of finding the dataset. But the remaining problems --- cleaning,](https://reader034.vdocuments.us/reader034/viewer/2022042417/5f33a913d13d9324de187ef7/html5/thumbnails/8.jpg)
Data Commons vs collection of datasets
Collections of datasets (ala NIH Data Commons). Solves the problem of finding the dataset.
But the remaining problems --- cleaning, joining … remain
Datacommons --- a single KG built by cleaning, normalizing, joining all these datasets
![Page 9: Data Commons · Data Commons vs collection of datasets Collections of datasets (ala NIH Data Commons). Solves the problem of finding the dataset. But the remaining problems --- cleaning,](https://reader034.vdocuments.us/reader034/viewer/2022042417/5f33a913d13d9324de187ef7/html5/thumbnails/9.jpg)
Data Commons v 0.9
- People, places, …: Integrated view of Census (ACS), CDC, BLS, BEA, FEC, NOAA, DEA, DOL,… on average 5k variables for every state, county, city, zip, school district, congressional district, …
- Education: College Scoreboard, NCES, ...- Disasters: earthquakes, hurricanes, floods, fires
- Scientific Collections: Bronx Botanical Gardens, ENCODE, parts of NCBI, GTEx
![Page 10: Data Commons · Data Commons vs collection of datasets Collections of datasets (ala NIH Data Commons). Solves the problem of finding the dataset. But the remaining problems --- cleaning,](https://reader034.vdocuments.us/reader034/viewer/2022042417/5f33a913d13d9324de187ef7/html5/thumbnails/10.jpg)
Data Commons v 0.9
Applications for 4 categories
- Researchers- Students- Journalists- Google Users
![Page 11: Data Commons · Data Commons vs collection of datasets Collections of datasets (ala NIH Data Commons). Solves the problem of finding the dataset. But the remaining problems --- cleaning,](https://reader034.vdocuments.us/reader034/viewer/2022042417/5f33a913d13d9324de187ef7/html5/thumbnails/11.jpg)
APIs
APIs in
- REST- Python / Python Notebooks- SPARQL- SQL against Big Query- Google Sheets
![Page 12: Data Commons · Data Commons vs collection of datasets Collections of datasets (ala NIH Data Commons). Solves the problem of finding the dataset. But the remaining problems --- cleaning,](https://reader034.vdocuments.us/reader034/viewer/2022042417/5f33a913d13d9324de187ef7/html5/thumbnails/12.jpg)
Python Notebooksfor students &researchers
![Page 13: Data Commons · Data Commons vs collection of datasets Collections of datasets (ala NIH Data Commons). Solves the problem of finding the dataset. But the remaining problems --- cleaning,](https://reader034.vdocuments.us/reader034/viewer/2022042417/5f33a913d13d9324de187ef7/html5/thumbnails/13.jpg)
Simple Charting Tool
![Page 14: Data Commons · Data Commons vs collection of datasets Collections of datasets (ala NIH Data Commons). Solves the problem of finding the dataset. But the remaining problems --- cleaning,](https://reader034.vdocuments.us/reader034/viewer/2022042417/5f33a913d13d9324de187ef7/html5/thumbnails/14.jpg)
Simple CorrelationTool
![Page 15: Data Commons · Data Commons vs collection of datasets Collections of datasets (ala NIH Data Commons). Solves the problem of finding the dataset. But the remaining problems --- cleaning,](https://reader034.vdocuments.us/reader034/viewer/2022042417/5f33a913d13d9324de187ef7/html5/thumbnails/15.jpg)
Google Sheets API
![Page 16: Data Commons · Data Commons vs collection of datasets Collections of datasets (ala NIH Data Commons). Solves the problem of finding the dataset. But the remaining problems --- cleaning,](https://reader034.vdocuments.us/reader034/viewer/2022042417/5f33a913d13d9324de187ef7/html5/thumbnails/16.jpg)
Biomedical Data Commons
![Page 17: Data Commons · Data Commons vs collection of datasets Collections of datasets (ala NIH Data Commons). Solves the problem of finding the dataset. But the remaining problems --- cleaning,](https://reader034.vdocuments.us/reader034/viewer/2022042417/5f33a913d13d9324de187ef7/html5/thumbnails/17.jpg)
Census(ACS)UniProtCheMBL
dbSNP
CDCSequence data
Entrez Gene GTEx
Aggregated Knowledge Graph
MeSH
Medicare
WHO
APIS
CitizenScientists
Students ResearchersDoctors Cloud
![Page 18: Data Commons · Data Commons vs collection of datasets Collections of datasets (ala NIH Data Commons). Solves the problem of finding the dataset. But the remaining problems --- cleaning,](https://reader034.vdocuments.us/reader034/viewer/2022042417/5f33a913d13d9324de187ef7/html5/thumbnails/18.jpg)
Biomedical Data Commons Datasets
- CDC - 500 Cities- CDC - Diabetes
Atlas- CDC - Wonder- ChEMBL*- ClinVar- dbSNP- Dartmouth
Medicare Atlas
- Disease Ontology*- FDA -
Pharmacologic Class*
- GTEx- ENCODE- Entrez Gene- MeSH*- NY Times -
COVID-19 Cases + Deaths
- SIDER*- SPOKE*- UCSC Genome
Browser- US Census - SAHIE- UniProt*- WHO - COVID-19
Cases + Deaths- WHO - ICD-10
Codes*Source: UCSF SPOKE
![Page 19: Data Commons · Data Commons vs collection of datasets Collections of datasets (ala NIH Data Commons). Solves the problem of finding the dataset. But the remaining problems --- cleaning,](https://reader034.vdocuments.us/reader034/viewer/2022042417/5f33a913d13d9324de187ef7/html5/thumbnails/19.jpg)
Eg: Extract 3 data points on given variants
Given a list of genetic variants
Find:
1. Clinical Significance2. Functional Category3. Significant Gene Associations in Whole Blood
![Page 20: Data Commons · Data Commons vs collection of datasets Collections of datasets (ala NIH Data Commons). Solves the problem of finding the dataset. But the remaining problems --- cleaning,](https://reader034.vdocuments.us/reader034/viewer/2022042417/5f33a913d13d9324de187ef7/html5/thumbnails/20.jpg)
Current Methods to find information
Option 1:Site Search
Use needed database browser and search all variants individually
Option 2:Download and Analyze Data
Download data, read it into memory, and parse it for the needed info programmatically
Option 3:Data Commons
Run 1 query on Data Commons
![Page 21: Data Commons · Data Commons vs collection of datasets Collections of datasets (ala NIH Data Commons). Solves the problem of finding the dataset. But the remaining problems --- cleaning,](https://reader034.vdocuments.us/reader034/viewer/2022042417/5f33a913d13d9324de187ef7/html5/thumbnails/21.jpg)
More on Option 2
- I want to find out more about a list of genetic variants- Download ClinVar and analyze clinical significance of the
variants - a few hours- Download dbSNP and identify the type of genomic region
these SNPs are located - a couple of days- Download GTEx and identify which genes a genetic variant is
significantly associated in a given tissue - a week
![Page 22: Data Commons · Data Commons vs collection of datasets Collections of datasets (ala NIH Data Commons). Solves the problem of finding the dataset. But the remaining problems --- cleaning,](https://reader034.vdocuments.us/reader034/viewer/2022042417/5f33a913d13d9324de187ef7/html5/thumbnails/22.jpg)
Example Data Commons Query
Issue Query to Data Commons
Time to run query: 21 seconds
![Page 23: Data Commons · Data Commons vs collection of datasets Collections of datasets (ala NIH Data Commons). Solves the problem of finding the dataset. But the remaining problems --- cleaning,](https://reader034.vdocuments.us/reader034/viewer/2022042417/5f33a913d13d9324de187ef7/html5/thumbnails/23.jpg)
Many Other Commons ...
- Economics- Covid- Energy
….
![Page 24: Data Commons · Data Commons vs collection of datasets Collections of datasets (ala NIH Data Commons). Solves the problem of finding the dataset. But the remaining problems --- cleaning,](https://reader034.vdocuments.us/reader034/viewer/2022042417/5f33a913d13d9324de187ef7/html5/thumbnails/24.jpg)
Technical Challenges
- Representation- Infrastructure- Inference
- General thoughts on KGs ...
![Page 25: Data Commons · Data Commons vs collection of datasets Collections of datasets (ala NIH Data Commons). Solves the problem of finding the dataset. But the remaining problems --- cleaning,](https://reader034.vdocuments.us/reader034/viewer/2022042417/5f33a913d13d9324de187ef7/html5/thumbnails/25.jpg)
Representational Issues
Representing time, geo, measures, provenance, ...
Stat. aggregates exacerbate the problem of long predicates: maleLatinoPopulationUnder...
Triples are easy to understand, relatively widely understood
But real systems need more expressiveness
![Page 26: Data Commons · Data Commons vs collection of datasets Collections of datasets (ala NIH Data Commons). Solves the problem of finding the dataset. But the remaining problems --- cleaning,](https://reader034.vdocuments.us/reader034/viewer/2022042417/5f33a913d13d9324de187ef7/html5/thumbnails/26.jpg)
Infrastructure
Data Commons has about 50B triples
Bio part is another 100B triples
Wide range of queries and latency requirements
- simple queries that require millisecond responses- complex sparql queries- everything in between
![Page 27: Data Commons · Data Commons vs collection of datasets Collections of datasets (ala NIH Data Commons). Solves the problem of finding the dataset. But the remaining problems --- cleaning,](https://reader034.vdocuments.us/reader034/viewer/2022042417/5f33a913d13d9324de187ef7/html5/thumbnails/27.jpg)
Data Commons approach
Multiple ‘backing’ stores, that support different classes of queries with different latencies
- Big Query … in memory hash tables- KG is a view on top of more native encodings- DLG + function terms + contexts + simple kinds of inference
is the real ‘language’ --- serves as the ‘epistemological/knowledge level’
![Page 28: Data Commons · Data Commons vs collection of datasets Collections of datasets (ala NIH Data Commons). Solves the problem of finding the dataset. But the remaining problems --- cleaning,](https://reader034.vdocuments.us/reader034/viewer/2022042417/5f33a913d13d9324de187ef7/html5/thumbnails/28.jpg)
PopulationsObservationsPlacesTriples
Translator
KG Schema <-> Relational Schema maps
Datalog (Sparql, Graphql, …)
SQL
Big Query
Cache Builder
Cache
Graph Nav APIs, Time Series, etc.
Weather
Mixer
Serving System
![Page 29: Data Commons · Data Commons vs collection of datasets Collections of datasets (ala NIH Data Commons). Solves the problem of finding the dataset. But the remaining problems --- cleaning,](https://reader034.vdocuments.us/reader034/viewer/2022042417/5f33a913d13d9324de187ef7/html5/thumbnails/29.jpg)
Ending thoughts
- Vocabulary creep: Google KG, Cyc, Wikidata all have many tens of thousands of ‘schema’ terms. Human language manages with few thousand terms. How do we bring the compositionality of NL to KR?
- The problems of old AI haven’t been solved. They just can’t be expressed clearly in today’s ML formalisms. e.g., World’s tallest mountain yesterday?