W3C HCLS Dataset Description Guidelines
Describing Scientific Datasets: The HCLS Community Profile
@micheldumontier::CEDAR:Jan 2015
Michel Dumontier, Ph.D.
Associate Professor of Medicine (Biomedical Informatics), Stanford University
World Wide Web Consortium (W3C)
• The W3C is the main international standards organization for the World Wide Web.
• The W3C comprises over 400 member organizations that work together to develop standards for the World Wide Web.
The Semantic Web is the new global web of knowledge
It involves standards for publishing, sharing and querying facts, expert knowledge and services
It is a scalable approach to the discovery of independently formulated and distributed knowledge
Resource Description Framework
• It’s a language to represent knowledge
– Logic-based formalism -> automated reasoning
– Graph-like properties -> data analysis
• Good for
– Describing in terms of type, attributes, relations
– Integrating data from different sources
– Sharing the data (W3C standard)
– Reusing what is available, developing what you need, and contributing back to the web of data.
[Figure: RDF graph linking diclofenac to its target]
drugbank:DB00586 rdf:type drugbank_vocabulary:Drug
drugbank:DB00586 rdfs:label "Diclofenac [drugbank:DB00586]"
drugbank:DB00586 drugbank_vocabulary:targets drugbank:290
drugbank:290 rdf:type drugbank_vocabulary:Target
drugbank:290 rdfs:label "Prostaglandin G/H synthase 2 [drugbank_target:290]"
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX drugbank: <http://bio2rdf.org/drugbank:>
PREFIX drugbank_vocabulary: <http://bio2rdf.org/drugbank_vocabulary:>
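The diclofenac graph can be sketched in plain Python to show how prefixed names expand into full IRIs. This is only an illustration: the triples are read off the slide's diagram, the prefix map is copied from the PREFIX declarations, and the helper names `expand` and `to_ntriples` are hypothetical, not part of any library.

```python
# Prefix map copied from the PREFIX declarations on the slide.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "drugbank": "http://bio2rdf.org/drugbank:",
    "drugbank_vocabulary": "http://bio2rdf.org/drugbank_vocabulary:",
}

# (subject, predicate, object) triples as read off the diagram.
# Objects wrapped in double quotes are literals; the rest are prefixed names.
TRIPLES = [
    ("drugbank:DB00586", "rdf:type", "drugbank_vocabulary:Drug"),
    ("drugbank:DB00586", "rdfs:label", '"Diclofenac [drugbank:DB00586]"'),
    ("drugbank:DB00586", "drugbank_vocabulary:targets", "drugbank:290"),
    ("drugbank:290", "rdf:type", "drugbank_vocabulary:Target"),
    ("drugbank:290", "rdfs:label",
     '"Prostaglandin G/H synthase 2 [drugbank_target:290]"'),
]


def expand(term: str) -> str:
    """Expand a prefixed name to a full IRI; leave quoted literals untouched."""
    if term.startswith('"'):
        return term
    prefix, _, local = term.partition(":")
    return PREFIXES[prefix] + local


def to_ntriples(triples):
    """Render the triples as N-Triples-style lines with full IRIs."""
    lines = []
    for s, p, o in triples:
        o_term = expand(o)
        obj = o_term if o_term.startswith('"') else f"<{o_term}>"
        lines.append(f"<{expand(s)}> <{expand(p)}> {obj} .")
    return lines


if __name__ == "__main__":
    print("\n".join(to_ntriples(TRIPLES)))
```

Keeping the data as plain tuples avoids a dependency on an RDF library; a real pipeline would more likely parse and serialize with a dedicated toolkit.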
The linked data network expands with every reference
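[Figure: the PharmGKB and DrugBank records for diclofenac, joined by a cross-reference]
DrugBank:
drugbank:DB00586 rdf:type drugbank_vocabulary:Drug
drugbank:DB00586 rdfs:label "diclofenac [drugbank:DB00586]"
PharmGKB:
pharmgkb:PA449293 rdf:type pharmgkb_vocabulary:Drug
pharmgkb:PA449293 rdfs:label "diclofenac [pharmgkb:PA449293]"
pharmgkb:PA449293 pharmgkb_vocabulary:x-drugbank drugbank:DB00586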
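A small sketch of why a shared identifier matters: once the PharmGKB and DrugBank statements are merged, the cross-reference can be followed from one record to the other. The two triple sets below are a reading of the slide's diagram, and `objects` and `linked_labels` are hypothetical helper names introduced only for this example.

```python
# Statements attributed to each source, as read off the diagram.
DRUGBANK = {
    ("drugbank:DB00586", "rdf:type", "drugbank_vocabulary:Drug"),
    ("drugbank:DB00586", "rdfs:label", "diclofenac [drugbank:DB00586]"),
}
PHARMGKB = {
    ("pharmgkb:PA449293", "rdf:type", "pharmgkb_vocabulary:Drug"),
    ("pharmgkb:PA449293", "rdfs:label", "diclofenac [pharmgkb:PA449293]"),
    ("pharmgkb:PA449293", "pharmgkb_vocabulary:x-drugbank", "drugbank:DB00586"),
}


def objects(graph, subject, predicate):
    """Return all objects of (subject, predicate, ?) in sorted order."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


def linked_labels(graph, subject, link):
    """Follow `link` from `subject` and collect the labels of what it points to."""
    labels = []
    for target in objects(graph, subject, link):
        labels.extend(objects(graph, target, "rdfs:label"))
    return labels


# Merging two RDF graphs is a set union of their triples.
merged = DRUGBANK | PHARMGKB

if __name__ == "__main__":
    print(linked_labels(merged, "pharmgkb:PA449293",
                        "pharmgkb_vocabulary:x-drugbank"))
```

Within PharmGKB alone the link points at an identifier with no known label; only the merged graph can resolve it, which is the sense in which the network expands with every reference.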
We are building a massive network of linked open data
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
Linked Data for the Life Sciences
• Free and open source
• Leverages Semantic Web standards
• 10B+ interlinked statements from 30+ conventional and high value datasets
• Partnerships with EBI, SIB, NCBI, DBCLS, NCBO, OpenPHACTS, and many others
Chemicals/drugs/formulations, genomes/genes/proteins, domains
Interactions, complexes & pathways
Animal models and phenotypes
Disease, genetic markers, treatments
Terminologies & publications
Alison Callahan, Jose Cruz-Toledo, Peter Ansell, Michel Dumontier:
Bio2RDF Release 2: Improved Coverage, Interoperability and Provenance of Life Science Linked Data. ESWC 2013: 200-212
Semantic Web for Health Care and Life Sciences Interest Group (HCLS)
• Mission: to develop, advocate for, and support the use of Semantic Web technologies across health care, life sciences, clinical research and translational medicine.
• Since 2001. 86 members from 29 organizations.
• Chairs: Michel Dumontier and Charlie Mead
• Objectives:
– Develop high level and architectural vocabularies.
– Implement proof-of-concept demonstrations and industry-ready code.
– Document guidelines to accelerate the adoption of the technology.
– Disseminate information about the group's work at government, industry, and academic events and by participating in community initiatives.
Challenge: Working with Web Data
• Datasets often have inadequate descriptions, so we don’t know what they are about or how they were constructed.
• Datasets change over time, but often don’t come with versioning information.
• Datasets may have been constructed using other data, but it’s not clear which version of that data was used or whether it was modified.
• Data may be available in a variety of formats.
• There may be multiple copies of data from different providers, but it’s unclear whether they are exact copies or derivatives.
Data registries aren’t in sync
– Identifiers.org, Bio2RDF.org, BioSharing.org, etc.
– May cover only some data elements, i.e. be incomplete
– May be out-of-date, and there is no easy way to exchange data descriptions
– May contain conflicting information, with no indication of the sources used
No single vocabulary provides all key metadata fields
Key Use Cases
1. Dataset Identification, Description, Licensing and Provenance
2. Dataset Discovery (via Catalog)
3. Exchange of Dataset Descriptions
4. Dataset Linking
5. Content Summary
6. Monitoring of Dataset Changes
Objective
• Develop a guidance note for reusing existing vocabularies to describe datasets with RDF
– Mandatory, recommended, optional descriptors
– Identifiers
– Versioning
– Attribution
– Provenance
– Content summarization
• Recommend vocabulary-linked attributes and value sets
• Provide reference editor and validation
Dublin Core Metadata Initiative
Widely used
Broadly applicable
– Documents
– Datasets
✗ Generic terms
✗ Not comprehensive
✗ No required properties
“Date: A point or period of time associated with an event in the lifecycle of the resource.”
DCAT: Data Catalog
Separates Dataset and Distribution
✗ No versioning
✗ No prescribed properties
VoID: Vocabulary of Interlinked Datasets
Metadata carried with data
– Directly embedded: void:inDataset
✗ No versioning
✗ No checklist of requisite fields
✗ Only for RDF data
We compiled a list of metadata fields used across the community
and then surveyed over 20 vocabularies to see if they provided relevant metadata elements or value sets
to produce a big spreadsheet that maps metadata needs to existing vocabularies
Dataset
“A collection of data, available for access or download in one or more formats”
– DCAT
Included Vocabularies
Three Component Metadata Model: description – version – distribution
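One way to picture the three components is as nested records that flatten into triples. The sketch below is an assumption-laden illustration: the class names, the `http://example.org/` IRIs, and the choice of linking properties (dct:isVersionOf, pav:version, dcat:distribution, dct:format, dcat:downloadURL) are picked for the example and should be checked against the HCLS note before being relied on.

```python
from dataclasses import dataclass, field


@dataclass
class Distribution:
    """A concrete file or endpoint for one version (hypothetical record)."""
    iri: str
    media_type: str
    download_url: str


@dataclass
class Version:
    """A specific release of the dataset."""
    iri: str
    version: str
    distributions: list = field(default_factory=list)


@dataclass
class Summary:
    """The version-independent description of the dataset."""
    iri: str
    title: str
    versions: list = field(default_factory=list)


def triples(summary):
    """Flatten the three levels into (subject, predicate, object) tuples."""
    out = [(summary.iri, "dct:title", summary.title)]
    for v in summary.versions:
        out.append((v.iri, "dct:isVersionOf", summary.iri))
        out.append((v.iri, "pav:version", v.version))
        for d in v.distributions:
            out.append((v.iri, "dcat:distribution", d.iri))
            out.append((d.iri, "dct:format", d.media_type))
            out.append((d.iri, "dcat:downloadURL", d.download_url))
    return out


# Hypothetical dataset with one version and one distribution.
example = Summary(
    iri="http://example.org/dataset",
    title="Example dataset",
    versions=[
        Version(
            iri="http://example.org/dataset/v2",
            version="2",
            distributions=[
                Distribution(
                    iri="http://example.org/dataset/v2/rdf",
                    media_type="text/turtle",
                    download_url="http://example.org/dataset/v2/data.ttl",
                )
            ],
        )
    ],
)

if __name__ == "__main__":
    for t in triples(example):
        print(t)
```

Separating the levels means a version number or a download URL never has to be attached to the version-independent description, which is what lets several releases and formats coexist.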
Example of Use
61 metadata elements
Metadata element, description, and example of use
Metadata Specification: constrained property:value pairs
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].
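A rough sketch of how RFC 2119 levels could drive a check of constrained property:value pairs. The requirement table below is hypothetical and chosen only to exercise each level; it is not the table from the HCLS note, and `validate` is an illustrative helper, not the group's validator.

```python
# Hypothetical requirement levels, for illustration only.
REQUIREMENTS = {
    "dct:title": "MUST",
    "dct:description": "MUST",
    "dct:license": "SHOULD",
    "dcat:keyword": "MAY",
}


def validate(description):
    """Return (errors, warnings) for a dict of property -> value.

    A missing MUST property is an error, a missing SHOULD property is a
    warning, and a missing MAY property is ignored.
    """
    errors, warnings = [], []
    for prop, level in REQUIREMENTS.items():
        if description.get(prop):
            continue  # property is present with a non-empty value
        if level == "MUST":
            errors.append(prop)
        elif level == "SHOULD":
            warnings.append(prop)
    return errors, warnings


if __name__ == "__main__":
    # A description that has a title but nothing else.
    print(validate({"dct:title": "Example dataset"}))
```

Reporting SHOULD violations as warnings rather than errors mirrors the RFC 2119 distinction between an absolute requirement and a recommendation that may be set aside for good reason.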
Description
• Identifiers
• Title
• Description
• Homepage
• License
• Language
• Keywords
• Concepts and vocabularies used
• Standards
• Publication
Attribution
• Simple Model
– Individuals are related to roles using specific properties, e.g. dct:creator, pav:createdBy, pav:curatedBy
• Expandable Model
– Individuals are related to roles and dates by associated object
– PROV, ViVo
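The difference between the two attribution models can be sketched as the triples each one produces. The simple model uses one of the properties named on the slide; the expandable model here follows the PROV-O qualified attribution pattern (prov:qualifiedAttribution, prov:agent, prov:hadRole). Attaching the date with dct:date, the blank node label, and the `http://example.org/` IRIs are assumptions made for this example.

```python
def simple_attribution(dataset, person, prop="dct:creator"):
    """Simple model: the property itself carries the role."""
    return [(dataset, prop, person)]


def expandable_attribution(dataset, person, role, date, node="_:attribution1"):
    """Expandable model: an intermediate object holds agent, role, and date.

    dct:date on the attribution node is an assumption of this sketch.
    """
    return [
        (dataset, "prov:qualifiedAttribution", node),
        (node, "rdf:type", "prov:Attribution"),
        (node, "prov:agent", person),
        (node, "prov:hadRole", role),
        (node, "dct:date", date),
    ]


if __name__ == "__main__":
    ds = "http://example.org/dataset"
    who = "http://example.org/people/jane"
    print(simple_attribution(ds, who, "pav:curatedBy"))
    for t in expandable_attribution(ds, who, "http://example.org/roles/curator",
                                    "2015-01-01"):
        print(t)
```

The simple model costs one triple per contribution but needs a dedicated property for each role; the expandable model costs several triples but can express any role and attach further detail to the same node.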
Provenance and Change
• Version number
• Source
• Provenance: retrieved from, derived from, created with
• Frequency of change
Availability
• Format
• Download URL
• Landing page
• SPARQL endpoint
RDF Dataset Statistics
Basic Statistics
• # of triples
• # of typed entities
• # of distinct subjects
• # of distinct predicates
• # of distinct objects
• # of classes
• # of literals
Enhanced Statistics
• Classes + #
• Properties + triples
• Subject Types + # Property + triples
• Object Types + # Property + triples
• Literals + # Property + triples
• Dataset-Dataset links
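The basic statistics can be computed directly from a set of triples. The sketch below is a plain Python stand-in for the counts a triple store would normally produce with queries; the function name and the convention that quoted strings are literals are assumptions of this example, as is the choice to count literals separately from other distinct objects.

```python
RDF_TYPE = "rdf:type"


def is_literal(term):
    """In this sketch, literals are the terms wrapped in double quotes."""
    return term.startswith('"')


def basic_statistics(triples):
    """Compute the basic counts listed above for an iterable of triples."""
    triples = set(triples)  # an RDF graph is a set, so drop duplicates
    return {
        "triples": len(triples),
        "typed_entities": len({s for s, p, o in triples if p == RDF_TYPE}),
        "distinct_subjects": len({s for s, _, _ in triples}),
        "distinct_predicates": len({p for _, p, _ in triples}),
        # Assumption: literals are excluded here and counted on their own.
        "distinct_objects": len({o for _, _, o in triples if not is_literal(o)}),
        "classes": len({o for s, p, o in triples if p == RDF_TYPE}),
        "literals": len({o for _, _, o in triples if is_literal(o)}),
    }


# The diclofenac example from earlier in the deck.
SAMPLE = [
    ("drugbank:DB00586", "rdf:type", "drugbank_vocabulary:Drug"),
    ("drugbank:DB00586", "rdfs:label", '"Diclofenac [drugbank:DB00586]"'),
    ("drugbank:DB00586", "drugbank_vocabulary:targets", "drugbank:290"),
    ("drugbank:290", "rdf:type", "drugbank_vocabulary:Target"),
    ("drugbank:290", "rdfs:label",
     '"Prostaglandin G/H synthase 2 [drugbank_target:290]"'),
]

if __name__ == "__main__":
    for name, count in basic_statistics(SAMPLE).items():
        print(f"{name}: {count}")
```

The enhanced statistics follow the same pattern with grouping added, for example counting instances per class or triples per property.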
Application scenarios
VoID Editor
Validator
New version using ShEx in development
Towards Semantic Interoperability
dumontierlab.com
Website: http://dumontierlab.com
Presentations: http://slideshare.com/micheldumontier
HCLS: http://www.w3.org/blog/hcls/
Mailing list: http://lists.w3.org/Archives/Public/public-semweb-lifesci/
Editors’ Draft: http://tiny.cc/hcls-datadesc-ed
W3C Interest Group Note: http://tiny.cc/hcls-datadesc
Special thanks to Alasdair Gray, Scott Marshall, Joachim Baran
Thanks to all other contributors to the HCLS note