![Page 1: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/1.jpg)
biomedical and healthCAre Data Discovery Index Ecosystem
Enabling the Big Data Commons through indexing of data and their interactions
2nd BD2K all-hands meeting, Bethesda, 11/12/15
![Page 2: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/2.jpg)
Aims
1. Help users find accessible data
2. Assist data producers on how to publish data for maximal discoverability
3. Build a prototype/platform to dock related products
PubMed of Data = DataMed
![Page 3: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/3.jpg)
Outline
• The data ecosystem
  • Data, derived data, metadata
  • Stakeholders
• Nuts and Bolts
  • Components: metadata, search tool
  • Plan and timelines
• How to participate
  • Working groups
  • Pilots
  • Collaborations
![Page 4: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/4.jpg)
What does it take to use big data?
• Find the data (across various resources)
• Find the tools that operate on the data
• Find the appropriate computational environment
• Access the data
• Access the tools (software/systems)
• Access the computational environment
![Page 5: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/5.jpg)
data ecosystem
[Diagram labels: software tools; pre-curated data; metadata; standards; curated data; prep for analysis; pre-processed data]
![Page 6: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/6.jpg)
data ecosystem
Characterizing data implies describing the software tools that helped produce them
[Diagram labels: software tools; pre-curated data; metadata; standards; curated data; prep for analysis; pre-processed data]
![Page 7: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/7.jpg)
data ecosystem
one woman’s pre-processed data are another woman’s “raw” data
[Diagram labels: software tools; pre-curated data; metadata; curated data; prep for analysis; pre-processed data]
![Page 8: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/8.jpg)
data ecosystem
understanding which computational environment is best for the combination of data and relevant tools is important (e.g., HPC, GPU)
[Diagram labels: software tools; hosting in a cloud, cluster, server; pre-curated data; metadata; standards; curated data; prep for analysis; pre-processed data]
![Page 9: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/9.jpg)
data ecosystem
selecting the right computational environment for the right type of data is important
understanding the conditions of accessibility is also important
[Diagram labels, two groups. Group 1: software tools; pre-curated data; metadata; curated data; pre-processed data. Group 2, Protected Health Information (PHI), hosting in a HIPAA cloud, cluster, server: software tools; PHI pre-curated data; PHI pre-processed curated data; PHI curated data; standards; metadata]
![Page 10: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/10.jpg)
data ecosystem
big data analytics depend on
1. merging data from several different sources (e.g., reference databases, molecular data repositories, clinical repositories),
2. proper software, and the
3. proper computational environment
[Diagram labels, two groups joined by "+". Group 1: software tools; pre-curated data; metadata; standards; curated data; pre-processed data. Group 2 (PHI): software tools; PHI pre-curated data; PHI pre-processed curated data; PHI curated data. Repositories]
![Page 11: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/11.jpg)
data ecosystem
big data projects use several types of digital objects and they are inter-related
[Diagram labels. Group 1: software tools; pre-curated data; metadata; standards; curated data; prep for analysis; pre-processed data. "+" Group 2 (PHI): software tools; PHI pre-curated data; PHI pre-processed curated data; prep for analysis; PHI curated data. join. Group 3: software tools; pre-curated data; metadata; curated data; prep for analysis; pre-processed data. Repositories; Centers; Big Data projects]
![Page 12: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/12.jpg)
data ecosystem
results of analyses are data too…
[Diagram labels. Group 1: software tools; pre-curated data; metadata; standards; curated data; prep for analysis; pre-processed data. "+" Group 2 (PHI): software tools; PHI pre-curated data; PHI pre-processed curated data; prep for analysis; PHI curated data. join. Group 3: software tools; pre-curated data; metadata; curated data; prep for analysis; pre-processed data. analysis; post-processed data (results); selection; published results; journal. Repositories; Centers, Big data projects]
![Page 13: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/13.jpg)
data ecosystem
new types of “publications” have emerged
[Diagram labels. Group 1: software tools; pre-curated data; metadata; standards; curated data; prep for analysis; pre-processed data. "+" Group 2 (PHI): software tools; PHI pre-curated data; PHI pre-processed curated data; prep for analysis; PHI curated data. join. Group 3: software tools; pre-curated data; metadata; curated data; prep for analysis; pre-processed data. analysis; post-processed data (results); selection; published results; journal. Repositories; Centers, Big data projects]
![Page 14: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/14.jpg)
data ecosystem
Stakeholders
stakeholders have different responsibilities and interests
[Diagram labels: big data; funder; producer; curator; host; manager; user; owner]
![Page 15: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/15.jpg)
data ecosystem
Stakeholders
stakeholders have different abilities in indexing different types of data
searching across different resources is time consuming because no one is an expert in all resources
[Diagram labels: big data; funder; producer; curator; host; manager; user; owner]
![Page 16: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/16.jpg)
searching across indices and repositories
existing indices can interoperate with the cross-aggregator index
best indexers for data are those who use it all the time, but they may not know as much about other resources
data
![Page 17: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/17.jpg)
“find data on Kawasaki disease”
platform and portal
A, B, C: mapping of metadata, standards, links to aggregators, passing of queries
aggregators: various indices whose metadata are or can be mapped into Commons metadata
[Diagram labels: data; digital objects]
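The mapping of aggregator metadata into a shared model, followed by a query over the mapped records, can be sketched roughly as below. This is only an illustration: the aggregator names, field names, and sample records are hypothetical and are not taken from the bioCADDIE or Commons metadata.

```python
# Hypothetical per-aggregator field maps: source field name -> common field name.
FIELD_MAPS = {
    "aggregator_a": {"name": "title", "summary": "description", "acc": "identifier"},
    "aggregator_b": {"dataset_title": "title", "abstract": "description", "id": "identifier"},
}


def to_common(source, record):
    """Map one source record onto the shared (hypothetical) field names.

    Fields without a mapping are dropped, so only common fields reach the index.
    """
    field_map = FIELD_MAPS[source]
    return {common: record[src] for src, common in field_map.items() if src in record}


def search(records, term):
    """Case-insensitive substring search over the common title and description."""
    term = term.lower()
    return [
        r for r in records
        if term in r.get("title", "").lower() or term in r.get("description", "").lower()
    ]


# Made-up records, used only to exercise the mapping.
index = [
    to_common("aggregator_a", {"name": "Kawasaki disease cohort", "summary": "Clinical data", "acc": "A-1"}),
    to_common("aggregator_b", {"dataset_title": "Expression profiles", "abstract": "Kawasaki disease samples", "id": "B-7"}),
    to_common("aggregator_b", {"dataset_title": "Unrelated set", "abstract": "Other", "id": "B-8"}),
]
hits = search(index, "kawasaki disease")
```

Once every record uses the same field names, a single query can run across all sources without the searcher knowing each aggregator's schema.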
![Page 18: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/18.jpg)
Metadata Model
A set of metadata specifications, future-proofed for progressive extensions, to support the intended capability of the Data Discovery Index prototype
Created using
![Page 19: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/19.jpg)
BioSharing: Content Standards and Databases
Supported by the NIH grant 1U24 AI117966-01 to the University of California, San Diego
![Page 20: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/20.jpg)
Data Identifiers
Define a set of best practices and operating procedures for identifiers that support the intended capability of the NIH BD2K Data Discovery Index (DDI) prototype, being designed by the bioCADDIE Core Development Team.
Check document at biocaddie.org
Attend breakout session on Identifiers
![Page 21: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/21.jpg)
bioCADDIE Prototype
[Architecture diagram labels. Data Sources: Repositories; Online datasets; Publishers; Funding Agencies; Data producers. Components: Metadata Ingestion; ElasticSearch; Terminology server; User Interface. Stages: Ingestion; Indexing]
![Page 22: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/22.jpg)
Data Indexing Pipeline
Data Source
1. Configuration file developed by curator
2. Extraction of metadata/data from data resource or dataset via ingestion module
  • Cache information for further processing
3. Process metadata/data via a set of modules
  • e.g. ID conversion, keyword extraction, data normalization
4. Mapping of metadata/data to metadata model(s)
5. Export to target endpoint(s)
6. Search via ElasticSearch APIs
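Steps 1 to 5 of the pipeline can be sketched as a chain of small functions. This is a minimal illustration under stated assumptions, not the bioCADDIE implementation: the configuration keys, field names, and the sample record are hypothetical, and step 6 (search through the ElasticSearch APIs) is left out.

```python
import json
import re

# Step 1: a curator-written configuration (hypothetical keys and field names).
CONFIG = {
    "source": "example_repository",
    "field_map": {"name": "title", "desc": "description", "acc": "identifier"},
}


def extract(raw_records):
    """Step 2: pull records from the source and cache them for later steps."""
    return [dict(r) for r in raw_records]  # the copy stands in for a cache


def normalize(record):
    """Step 3: example processing modules (whitespace cleanup, keyword extraction)."""
    cleaned = {k: " ".join(str(v).split()) for k, v in record.items()}
    words = re.findall(r"[a-z]{4,}", cleaned.get("desc", "").lower())
    cleaned["keywords"] = sorted(set(words))
    return cleaned


def map_to_model(record, field_map):
    """Step 4: rename source fields to the target metadata model's fields."""
    mapped = {target: record[src] for src, target in field_map.items() if src in record}
    mapped["keywords"] = record.get("keywords", [])
    return mapped


def export(documents):
    """Step 5: serialize documents for a target endpoint (here, JSON lines)."""
    return "\n".join(json.dumps(d, sort_keys=True) for d in documents)


def run_pipeline(raw_records, config):
    """Run extraction, processing, and mapping in order."""
    cached = extract(raw_records)
    processed = [normalize(r) for r in cached]
    return [map_to_model(r, config["field_map"]) for r in processed]


# A made-up record, used only to exercise the steps.
docs = run_pipeline(
    [{"name": "  Sample   dataset ", "desc": "Gene expression in mouse liver", "acc": "X1"}],
    CONFIG,
)
payload = export(docs)
```

Keeping each step a separate function mirrors the modular layout on the slide, so a module such as ID conversion could be added to step 3 without touching the others.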
![Page 23: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/23.jpg)
User Interface Workflow
[Workflow diagram labels: Query Entry; Terminology server; Entity Identification; Expansion; Query Execution; ElasticSearch; bioCADDIE backend; Organize results; Facets; Visualization; Presentation; Advanced filters]
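The front half of this workflow (entity identification, expansion, and query construction with a facet) can be sketched as follows. It is an assumption-laden illustration: the in-memory `TERMINOLOGY` table stands in for the terminology server, and the index field names (`title`, `description`, `repository`) are hypothetical. The request body uses standard Elasticsearch query DSL (`bool`, `multi_match`, and a `terms` aggregation) but is only built here, not sent to a server.

```python
# Hypothetical stand-in for a terminology server: preferred term -> synonyms.
TERMINOLOGY = {
    "kawasaki disease": ["mucocutaneous lymph node syndrome"],
}


def identify_entities(query):
    """Return known terms found in the free-text query (simple substring match)."""
    text = query.lower()
    return [term for term in TERMINOLOGY if term in text]


def expand(entities):
    """Add synonyms for each recognized entity, preserving order without duplicates."""
    expanded = []
    for term in entities:
        for candidate in [term, *TERMINOLOGY[term]]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


def build_query(query, fields=("title", "description"), facet_field="repository"):
    """Build an Elasticsearch-style request body with one facet aggregation."""
    # Fall back to the raw query text when no known entity is recognized.
    terms = expand(identify_entities(query)) or [query]
    return {
        "query": {
            "bool": {
                "should": [
                    {"multi_match": {"query": t, "fields": list(fields)}} for t in terms
                ],
                "minimum_should_match": 1,
            }
        },
        "aggs": {"by_" + facet_field: {"terms": {"field": facet_field}}},
    }


body = build_query("find data on Kawasaki disease")
```

The aggregation result would feed the facet display, while the `should` clauses let a dataset match on either the entered term or one of its synonyms.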
![Page 24: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/24.jpg)
![Page 25: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/25.jpg)
Demo and Posters
Location – Room E1 2-4pm
Data and Software Indexing: E1
Commons & Interoperability: E2
![Page 26: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/26.jpg)
Core Development Roadmap
Version 0.1, September 2015
Search function
• Implement the function for 3 repositories
Feedback collection
• GitHub
RFA for pilot on Harvester for DDI schema
• RFA announced
• Review, selection and award
Data identifier
• Implement Data identifier into the DDI
Data indexing
• Set up indexing using metadata from WG 3.0
Wrap up of Y1 pilot projects
• Literature/dataset link: Advanced search
• Recommender System: Ranking results
• iSEE/DELVE: Innovative visualization
• PDB citation pipelines
Data ingestion
• Determine datasets
• Decide on scalable data/metadata input routes
• Metadata mapping
DDI architecture
• Set up website for searching for datasets
• Set up infrastructure for web portal
![Page 27: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/27.jpg)
Core Development Roadmap
Version 0.2, November 2015 (We are here)
Dataset result display
• Sort datasets
• Group metadata
Terminology server
• Import ontology
• Integrate to Scigraph API
• Integrate autocomplete feature to prototype
Interface design
• New interface for prototype v 0.2
• Global statistics
Usability study
• UI Analysis
Ranking algorithm
• Results from pilot project
Search function
• Expand the function to 7 repositories
• Find similar datasets
• Search history
Architecture
• Code refactoring
![Page 28: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/28.jpg)
Core Development Roadmap
Version 0.5, February 2016
Version 1.0, June 2016
Personalized search
• Share/save search results
• User account
Link dataset to external resources
• PubMed
• Grants
Search algorithm
• Boolean/advanced search
• Data repository search function
Usability study
• User study
• Track user’s action
Ranking algorithm
• Refine search results based on user’s selection
• Report from WG 8 on Ranking
Data duplication problem
Metadata management
![Page 29: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/29.jpg)
Participation
• Working groups
  • Participate or follow
• Prototype
  • Using and providing feedback on the prototype search engine
• Interoperate with the prototype
  • Link your favorite index
    • Use or map to metadata
    • Collaborate on APIs
• Recommend data/repositories for inclusion
  • New working group
![Page 30: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/30.jpg)
Pilots
Newly awarded – Metadata Discovery
• Distributed data discovery using gym: github, yaml and markdown
  Chris Mungall, Lawrence Berkeley National Laboratory
• Feasibility study of indexing clinical research data using HL7 FHIR
  Guoqian Jiang, Mayo Clinic College of Medicine
• Metadata discovery and integration to support repurposing of heterogeneous data using the Openfurther platform
  Ram Gouripeddi and Julio Facelli, University of Utah
![Page 31: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/31.jpg)
Working Groups
![Page 32: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/32.jpg)
Working Groups
![Page 33: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/33.jpg)
Acknowledgements
• 93 working group members
• 12 steering committee members
• 8 pilot application reviewers
• staff and trainees
• collaborators
Supported by the NIH grant 1U24 AI117966-01 to the University of California, San Diego
![Page 34: Enabling the Big Data Commons through indexing of data and ... · Enabling the Big Data Commons through indexing of data and their interactions. 2. nd. BD2K all-hands meeting Bethesda](https://reader033.vdocuments.us/reader033/viewer/2022050315/5f7764426f73d11c8b493cc0/html5/thumbnails/34.jpg)
a mouse model for data science
[Diagram labels: pre-curated data; curated data; pre-processed data; metadata; prep for analysis; software tools; published data; result data; pre-curated PHI data; PHI curated data; pre-processed PHI curated data; +; join; analysis; selection]