semantic web anno 1999 - keio...

38
How data sharing leads to knowledge M. Scott Marshall, Ph.D. W3C HCLS IG co-chair Leiden University Medical Center University of Amsterdam http://staff.science.uva.nl/~marshall Semantic Web Anno 1999 DIY (Do It Yourself) - 別3 -

Upload: others

Post on 12-Jul-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

1

How data sharing leads to knowledge

M. Scott Marshall, Ph.D.W3C HCLS IG co-chair

Leiden University Medical CenterUniversity of Amsterdam

http://staff.science.uva.nl/~marshall

Semantic Web Anno 1999

DIY (Do It Yourself)

- 別3 -

Page 2: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

2

Reasoning across a Federation(Semantic Web 2012?)

The Semantic Web is the New Global

Web of Knowledge

It is about standards for publishing, sharing and querying

knowledge drawn from diverse sources

It makes possible the answering

sophisticated questions using

background knowledge

Source: Susie Stephens

- 別4 -

Page 3: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

3

Motivation

Science is based on knowledge:

knowledge capture, knowledge sharing, i.e. communication of findings.

Semantic Web provides a basis for knowledge sharing through machine-readable and reason-able annotation of resources.

Biology in a nutshell: Bigger isn’t better

• DNA Dogma

– Transcription = DNA -> mRNA -> Protein

• Molecular pathways allow biologists to ‘connect’ one molecule to another.

• Huntington’s mutation mapped in 1993 yet there is still no understanding of the mechanism that causes the neurodegeneration.

• Semantic models are necessary to create a ‘systems view’ of biology.

- 別5 -

Page 4: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

4

Where is biomedical knowledge?

Can be extracted from:

• People

• Literature

• Diagrams

• Clinical reports

• Databases

• Excel sheets

• …

Most of these sources of biomedical knowledge are

not machine-readable

Examples

1. “find all images with a big lesion in the right frontal lobe » ?

2. Provide the neurosurgeon with landmarksHmm, but what’s

nearby?

Source: Christine Golbreich

- 別6 -

Page 5: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

5

Many tasks are still a challenge!

With existing Web and Health IT:

• Find, share and integrate information

– “Although a plethora of resources (tools, databases, materials) for neuroscientists is now available on the web, finding these resources among the billions of possible web pages continues to be a challenge.” [M. Martone, NCBO Seminar

Series, 4 Nov 2009]

• Make multiple inferences based on background knowledge

– to obtain more complete answers

– to discover knowledge

Source: Christine Golbreich

Examples– in a medical record system

“find all patients which radiology exhibits a fracture of femur”

– in genomic data

“find all genes annotated with a molecular function or any of its descendants and which is associated with anyform of a given disease (see genes associated to muscular dystrophy [Sahoo et al. 2007])

– find, share, annotate images

Source: Christine Golbreich

- 別7 -

Page 6: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

6

How ?

• Adding semantic annotations from ontologies

“ … The vocabulary for such markup will be defined by ontologies, sets of statements which describe the structure of the domain of interest and give meaning to the terms therein. ”

[I. Horrocks, An Ontology Language for the Semantic Web, IEEE Intelligent Systems, 2002]

Source: Christine Golbreich

Biological and medical ontologies• Medical domain is *very* lucky

a large number of terminologies and reference ontologies, E.g., FMA, NCI, GO, SNOMED-CT, etc.

• Web Portals

– Bioportal library contains ~200 ontologies in different languages: OBO, Protégé Frames, RDF, OWL http://bioportal.bioontology.org/

– Bioportal now provides SPARQL access to ontologies: http://sparql.bioontology.org

– Open Biomedical Ontologies (OBO) Foundry, http://obofoundry.org/

• Nearly every day a new ontology is created.

• BUT …

– Does a term mean the same thing for everyone ?

– Different ontologies use different terms for the same

– How to connect ontologies? How to import terms?

– Is the vocabulary consistent ?

Source: Christine Golbreich

- 別8 -

Page 7: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

7

Advantages of OWL 2?

• Exploits results of 15+ years of research in Knowledge Representation (KR)

• OWL 2 based on description logics

• A logical language provides

– Semantics (meaning)

– Reasoning support and tools

Source: Christine Golbreich

Source: Christine Golbreich

Large ontologies in OWL and applications

• Many biomedical ontologies being moved to OWL 2

– e.g., FMA

– e.g., OBO to OWL [Golbreich et al. ISWC 2007]

– e.g., SNOMED-CT in OWL 2

• Peter Hendler's slides at HCLS W3C F2F

• Application example: Semantic Annotations of Brain MRI

– An hybrid system for semantic annotation of brainanatomy in MRI images based on an OWL ontologyof sulci-gyri anatomy extended by SWRL rules[Mechouche & al. 2009]

- 別9 -

Page 8: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

8

What is metadata (in this talk)?

• Metadata: data about data

• Metadata can be syntactic such as a data type, e.g. Integer.

• Metadata can be semantic such as chromosome number.

• Note: not always ontology, but metadata can be stored in OWL

What is knowledge ?

“data”, “information”, “facts”, “knowledge”

Knowledge is a statement

that can be tested for truth.

(by a machine)

Otherwise, computing can’t add much

- 別10 -

Page 9: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

9

RDF : a web format for knowledge

RDF is a W3C language to express

statements.

RDF Triple:

Subject Predicate Object

Graph of Knowledge:

Node Edge Node

Round trip knowledge engineering (Vision)

In a linked open data environment:

– How will I find the resources?

– How will I integrate them into an application?

– How will I leverage existing knowledge in my analysis?

– How will I publish the new knowledge resulting from my analysis?

– And link to the evidence (data) from the new knowledge?

- 別11 -

Page 10: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

10

Historical perspective: ARPANET

From ARPANET to Semantic Web

• ARPANET = TCP/IP, C language, Unix

• Internet = e-mail, gopher, ftp

• Web = http, html + Mosaic

• ProgrammableWeb = XML, SOAP, REST, JSON

• Semantic Web = OWL/RDF, URI’s, SPARQL,

- 別12 -

Page 11: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

11

Programmable Web:A few thousand APIs

http://www.programmableweb.com/

SPARQL Interoperability:A single API

• a query language for RDF that matches graph patterns

• SPARQL endpoint is comparable to a database connector for triples

• SPARQL endpoint enables data layer interoperability

- 別13 -

Page 12: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

12

Background of the HCLS IG

• Originally chartered in 2005– Chairs: Eric Neumann and Tonya Hongsermeier

• Re-chartered in 2008– Chairs: Scott Marshall and Susie Stephens

– Team contact: Eric Prud’hommeaux

• Broad industry participation– Over 100 members

– Mailing list of over 600

• Background Information– http://www.w3.org/blog/hcls

– http://esw.w3.org/topic/HCLSIG

Mission of HCLS IG

•The mission of HCLS is to develop, advocate for, and support the use of Semantic Web technologies for

– Biological science

– Translational medicine

– Health care

•These domains stand to gain tremendous benefit by adoption of Semantic Web technologies, as they depend on the interoperability of information from many domains and processes for efficient decision support

- 別14 -

Page 13: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

13

How does the HCLS IG work?

• Task forces

• Regular teleconferences using teleconference bridge (Zakim), IRC, minutes

• Face2Face (F2F) twice a year

• Procedures for publishing W3C notes

Current Task Forces

• BioRDF – federating (neuroscience) knowledge bases– M. Scott Marshall (Leiden University Medical Center / University of Amsterdam)

• Clinical Observations Interoperability – patient recruitment in trials– Vipul Kashyap (Cigna Healthcare)

• Linking Open Drug Data – aggregation of Web-based drug data – Susie Stephens (Johnson & Johnson)

• Translational Medicine Ontology – high level patient-centric ontology– Michel Dumontier (Carleton University)

• Scientific Discourse – building communities through networking– Tim Clark (Harvard University)

• Terminology – Semantic Web representation of existing resources– John Madden (Duke University)

- 別15 -

Page 14: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

14

COI: Translating across domains

Microarray

MRI PubMed

AlzForum

EHR

COI: Use Case

Pharmaceutical companies pay a lot to test drugs

Pharmaceutical companies express protocol in CDISC

-- precipitous gap –

Hospitals exchange information in HL7/RIM

Hospitals have relational databases

Source: Eric Prud’hommeaux

- 別16 -

Page 15: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

15

• Type 2 diabetes on diet and exercise therapy or

• monotherapy with metformin, insulin

• secretagogue, or alpha-glucosidase inhibitors, or

• a low-dose combination of these at 50%

• maximal dose. Dosing is stable for 8 weeks prior

• to randomization.

• …

• ?patient takes metformin .

Inclusion Criteria

Source: Holger Stenzhorn

Exclusion Criteria

Use of warfarin (Coumadin), clopidogrel

(Plavix) or other anticoagulants.

?patient doesNotTake anticoagulant .

Source: Holger Stenzhorn

- 別17 -

Page 16: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

16

?medication1 sdtm:subject ?patient ;spl:activeIngredient ?ingredient1 .

?ingredient1 spl:classCode 6809 . #metformin

OPTIONAL {

?medication2 sdtm:subject ?patient ;spl:activeIngredient ?ingredient2 .

?ingredient2 spl:classCode 11289 .#anticoagulant

} FILTER (!BOUND(?medication2))

Criteria in SPARQL

Source: Holger Stenzhorn

TMO: Translating across domains

Microarray

MRI PubMed

AlzForum

EHR

- 別18 -

Page 17: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

17

Questions & ProblemsThe Drug Development Pipeline

• The road is long, and costly.

• How do we contain costs and develop better drugs?

“A virtual space odyssey”, Cath O'Driscoll (2004)

http://www.nature.com/horizon/chemicalspace/background/odyssey.html

Source: Elgar Pichler

Translational Medicine OntologyMission

• Focuses on the development of a high level patient-centric ontology for the pharmaceutical industry. The ontology should enable data integration across discovery research, hypothesis management, experimental studies, compounds, formulation, drug development, market size, competitive data, population data, etc. This would enable scientists to answer new questions, and to answer existing scientific questions more quickly.

• This will help pharmaceutical companies to model patient-centric information, which is essential for the tailoring of drugs, and for early detection of compounds that may have sub-optimal safety profiles. The ontology should link to existing publicly available domain ontologies.

- 別19 -

Page 18: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

18

TMO Structure

Source: Susie Stephens

Translational Medicine KB

Source: Susie Stephens

- 別20 -

Page 19: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

19

TMO Query

How many patients experienced side effects while taking Donepezil?

Source: Susie Stephens

TMO DevelopmentConcept Identification via Use Cases

• Example

• (see http://esw.w3.org/topic/HCLSIG/PharmaOntology/UseCases):

• 1. Patient [OBI:0000093, patient role] (and family members [NCI:Patient_Family_Member_or_Friend]) report symptoms [IDO:0000048, Symptom] to physician/clinician [NCIt:Physician]. Physician/clinician enters reported symptoms into eHR.

• 2. Physician [NCIt: Physician] makes a list of differential diagnoses, with a working diagnosis [OBI:0000075] of Alzheimer Disease [DOID:10652]. (Data Source: Physician's head).

• 3. Physician [NCIt:Physician] arranges for patient [OBI:0000093, patient role] to have a basic biochemical/haematological, and SNP [SO:0000694, SNP] profile undertaken. Biochemistry, Haematology, and SNP requests are input by respective departments directly into patient’s eHR [HL7:EHR, UMLS:C1555708, HID:20081] from laboratory (Data Source: eRecord). Preliminary SNP and genetic data will be submitted directly to the NIH Pharmacogenetics Research Network (PGRN).

• [...]

- 別21 -

Page 20: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

20

Terminology: Translating across domains

Microarray

MRI PubMed

AlzForum

EHR

Terminology

• Representation of medical documents (OWL and SKOS)

• SKOS enables ‘fuzzier’ statements, with ‘broader’ and ‘narrower’

• Mammogram: Represent both radiology and pathology report to discover discrepancies

• Use Translational Medicine Ontology, RadLex, SNOMED in the RDF

- 別22 -

Page 21: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

21

SciDisc: Translating across domains

Microarray

MRI PubMed

AlzForum

EHR

SWAN / CiTO alignment

Source: Paolo Ciccarese

- 別23 -

Page 22: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

22

BioRDF: Translating across domains

Microarray

MRI PubMed

AlzForum

EHR

BioRDF: Looking for Targets for Alzheimer’s

• Signal transduction pathways are considered to be rich in “druggable” targets

• CA1 Pyramidal Neurons are known to be particularly damaged in Alzheimer’s disease

• Casting a wide net, can we find candidate genes known to be involved in signal transduction and active in Pyramidal Neurons?

Source: Alan Ruttenberg

- 別24 -

Page 23: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

23

NeuronDB

BAMS

Literature

Homologene

SWAN

Entrez

Gene

Gene

Ontology

Mammalian

Phenotype

PDSPki

BrainPharm

AlzGene

Antibodies

PubChem

MESH

Reactome

Allen Brain

Atlas

BioRDF: Integrating Heterogeneous Data

Source: Susie Stephens

“find me genes involved in signal transduction that are related to pyramidal neurons”

- 別25 -

Page 24: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

24

BioRDF: Results: Genes, Processes

•DRD1, 1812 adenylate cyclase activation•ADRB2, 154 adenylate cyclase activation•ADRB2, 154 arrestin mediated desensitization of G-protein coupled receptor protein signaling pathway•DRD1IP, 50632 dopamine receptor signaling pathway•DRD1, 1812 dopamine receptor, adenylate cyclase activating pathway•DRD2, 1813 dopamine receptor, adenylate cyclase inhibiting pathway•GRM7, 2917 G-protein coupled receptor protein signaling pathway•GNG3, 2785 G-protein coupled receptor protein signaling pathway•GNG12, 55970 G-protein coupled receptor protein signaling pathway•DRD2, 1813 G-protein coupled receptor protein signaling pathway•ADRB2, 154 G-protein coupled receptor protein signaling pathway•CALM3, 808 G-protein coupled receptor protein signaling pathway•HTR2A, 3356 G-protein coupled receptor protein signaling pathway•DRD1, 1812 G-protein signaling, coupled to cyclic nucleotide second messenger•SSTR5, 6755 G-protein signaling, coupled to cyclic nucleotide second messenger•MTNR1A, 4543 G-protein signaling, coupled to cyclic nucleotide second messenger•CNR2, 1269 G-protein signaling, coupled to cyclic nucleotide second messenger•HTR6, 3362 G-protein signaling, coupled to cyclic nucleotide second messenger•GRIK2, 2898 glutamate signaling pathway•GRIN1, 2902 glutamate signaling pathway•GRIN2A, 2903 glutamate signaling pathway•GRIN2B, 2904 glutamate signaling pathway•ADAM10, 102 integrin-mediated signaling pathway•GRM7, 2917 negative regulation of adenylate cyclase activity•LRP1, 4035 negative regulation of Wnt receptor signaling pathway•ADAM10, 102 Notch receptor processing•ASCL1, 429 Notch signaling pathway•HTR2A, 3356 serotonin receptor signaling pathway•ADRB2, 154 transmembrane receptor protein tyrosine kinase activation (dimerization)•PTPRG, 5793 ransmembrane receptor protein tyrosine kinase signaling pathway•EPHA4, 2043 transmembrane receptor protein tyrosine kinase signaling pathway•NRTN, 4902 transmembrane receptor protein tyrosine kinase signaling pathway•CTNND1, 1500 Wnt receptor signaling pathway

Many of the genes

are related to AD

through gamma

secretase

(presenilin) activity

Source: Alan Ruttenberg

Improving HCLS KB

• Shared Identifiers – Use the same names

• Federate – Where will we find answers to our scientific questions?

• Provenance – What will we do with what we find?

- 別26 -

Page 25: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

25

Automatic query decomposition

• Decompose query into components

• Locate sources of information

• Distribute queries across information sources

• Aggregate results

A SPARQL query for processes involved in pyramidal neurons

prefix go: <http://purl.org/obo/owl/GO#>prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>prefix owl: <http://www.w3.org/2002/07/owl#>prefix mesh: <http://purl.org/commons/record/mesh/>prefix sc: <http://purl.org/science/owl/sciencecommons/>prefix ro: <http://www.obofoundry.org/ro/ro.owl#>

select ?genename ?processnamewhere{ graph <http://purl.org/commons/hcls/pubmesh>

{ ?paper ?p mesh:D017966 .?article sc:identified_by_pmid ?paper.?gene sc:describes_gene_or_gene_product_mentioned_by ?article.

}graph <http://purl.org/commons/hcls/goa>{ ?protein rdfs:subClassOf ?res.?res owl:onProperty ro:has_function.?res owl:someValuesFrom ?res2.?res2 owl:onProperty ro:realized_as.?res2 owl:someValuesFrom ?process.

graph <http://purl.org/commons/hcls/20070416/classrelations>{{?process <http://purl.org/obo/owl/obo#part_of> go:GO_0007166} union

{?process rdfs:subClassOf go:GO_0007166 }} ?protein rdfs:subClassOf ?parent.?parent owl:equivalentClass ?res3.?res3 owl:hasValue ?gene.

}graph <http://purl.org/commons/hcls/gene>{ ?gene rdfs:label ?genename }

graph <http://purl.org/commons/hcls/20070416>{ ?process rdfs:label ?processname}

}

MeSH: Pyramidal Neurons

Pubmed: Journal Articles

Entrez Gene: Genes

GO: Signal Transduction

Inference required

- 別27 -

Page 26: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

26

Recipe for a Semantic Web

• Follow Linked Open Data principles

• Attempt to use Shared Names (same URI’s)

• Query rewriting to map from:

– SPARQL -> (query language)

– SPARQL (term1) -> SPARQL (term2)

• Add federated query support to SPARQL engine implementations

The Classic Web

B C

HTML HTMLHTML

Web

Browsers

Search

Engines

hyper-

links

• Single information space

• HTML describes presentation

• Built on URIs– globally unique IDs

– retrieval mechanism

• Built on Hyperlinks– are the glue that holds

everything together

A

hyper-

links

Source: Chris Bizer

- 別28 -

Page 27: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

27

Linked Data

B C

Thing

typed

links

A D E

typed

links

typed

links

typed

links

Thing

Thing

Thing

Thing

Thing Thing

Thing

Thing

Thing

Search

Engines

Linked Data

Mashups

Linked Data

Browsers

Use Semantic Web technologies to publish structured data on the Web and set

links between data from one data source and data from another data sources

Source: Chris Bizer

Linked Data Principles

1. Use URIs as names for things.

2. Use HTTP URIs so that people can look up those names.

3. When someone looks up a URI, provide useful RDF information.

4. Include RDF statements that link to other URIs so that they can discover related things.

• Tim Berners-Lee 2007

• http://www.w3.org/DesignIssues/LinkedData.html

- 別29 -

Page 28: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

28

The Linked Data Cloud

Source: Chris Bizer

Interlinking in LODD

http://esw.w3.org/HCLSIG/LODD/Interlinking

- 別30 -

Page 29: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

29

LOD cloud – a few challenges

• It’s a (nice) data warehouse in RDF

– http://lod.openlinksw.com/

• Ad hoc identifiers

– Prefer authoritative, persistent, and shared

• Redundancy

• Lacks provenance – Who created a given RDF rendering? Using what method? Version?

So, what’s missing?

• Data stewardship – serve (annotated) data back to community

• Plug-n-play SPARQL access to relational stores

• Best practice for data federation of SPARQL endpoints

• Best practice for important data sources: microarray data, variety of image data

• Shared Identifiers

- 別31 -

Page 30: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

30

Provenance

• Represent knowledge so that others can discover where a fact (or triple) came from and evaluate how to use it – link facts to data as evidence

• Named graphs – talk about what’s in them

• Example use: “Find the latest RDF version of DrugBank” from SPARQL endpoints accessible to me.

Semantic Web Foundation

Source: Tim Berners-Lee, Director W3C

- 別32 -

Page 31: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

31

Provenance for Federationor “What’s in a graph?”

• Minimum information about a graph (MIAG)

• Is it an ontology or RDF data?

• Version, License, ..

• Who converted it to RDF?

• Rendering: “I’m looking for SNOMED in SKOS (not OWL)”

• What is the most populated class?

• Graph statistics

Translating across domains

Microarray

MRI PubMed

AlzForum

EHR

- 別33 -

Page 32: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

32

Provenance types are perspectives on the data

Source: Helena Deus

A Bottom-up Approach

Provenance

modelsWorkflow,

experimental designDomain ontologies

(DO, GO…)Community

models

Raw Data

Results

Questions

Which genes are markers for neurodegenerative

diseases?

Was gene ALG2 differentially expressed in

multiple experiments?

Provenance of

Microarray

experiment

What software was used to analyse the data?

How can the experiment be replicated?

Source: Helena Deus

- 別34 -

Page 33: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

33

BioRDF: finding latest versionFILTER (?date_issued2 > ?date_issued)

}

FILTER (!BOUND(?srvc2))

}

#Get associated diseases from most recently updated Diseasomeserver.

SERVICE ?srvc2 {

?diseasomeGene rdfs:label ?geneLabel .

?disease diseasome:associatedGene ?diseasomeGene.

?disease rdfs:label ?diseaseName .

}

Myth busting

• Creating a Semantic Web application does not necessarily require converting and storing all data in RDF

• “Leave the data where it lives”

• Create views of the data accessible via RDF export or (BETTER) a SPARQL endpoint

- 別35 -

Page 34: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

34

SWObjects

• Creator: Eric Prud’hommeaux

• SWObjects creates a SPARQL endpoint

• Query rewriting to SQL and SPARQL

• Uses SPARQL Constructs as a mapping language

• JNI interface

• http://en.wikipedia.org/wiki/SWObjects

• http://precedings.nature.com/documents/5538/

Semantic Web is a framework

that provides common languages for defining vocabularies and querying across data that uses them.

However, common identifiers must be established if we are to link data without a million mapping files.

- 別36 -

Page 35: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

35

Shared Identifiers

• Must use common URI’s in order to link data

• Provenance related identifiers still needed:

– Identifiers for people (researchers)

– Identifiers for diseases

– Identifiers for terms (Terminology servers)

– Identifiers for programs, processes, workflows

– Identifiers for chemical compounds

• Shared Names http://sharednames.org

- 別37 -

Page 36: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

36

Emerging Best Practices for RDF

1. Convert unstructured source data into a structured data format, such as relational tables or XML

2. Identify persistent URLs (PURLS) for concepts in existing ontologies and create a map from the structured data into an RDF view

3. Customize the map manually if necessary

4. Map concepts in the new RDF mapping to concepts in other RDF data sources relying as much as possible on ontologies

5. Publish the RDF data through a SPARQL endpoint or as Linked Data

6. Alternatively, if data is in a relational format, apply a Semantic Web toolkit such as SWObjects [Prud’hommeaux et al. 2010] that enables SPARQL queries over the relational schema

7. Create Semantic Web applications using the published data

Create RDF views that a stranger can use

• Use a mapping language to create an RDF view of the data when possible, rather than customized code.

• When possible, use vocabularies that are openly available from an authoritative server. For the life sciences, we have OBO and NCBO.

• Use rdfs:label generously to provide information to user interfaces

- 別38 -

Page 37: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

37

Publish RDF so that it can be discovered

• Publish open access data whenever possible, as well as any associated software.

• Publish a URL to the software and mappings that you used to create the RDF.

• Register your data at CKAN. If it’s biomedical, register it in BioCatalogue.

• Assign a graph URI to the RDF graph and add provenance and metadata about the graph URI to the graph itself. This practice makes it possible for visitors and crawlers to find out what is in the graph.

Acknowledgements• W3C Health Care and Life Science Interest Group,

http://www.w3.org/blog/hcls

• National Center for Biomedical Ontologies

• Concept Web Alliance

• Authors of all contributed slides

- 別39 -

Page 38: Semantic Web Anno 1999 - Keio Universitys-web.sfc.keio.ac.jp/conference2011/0102-marshall.pdfavailable on the web, finding these resources among the billions of possible web pages

3/3/2011

38

The End

“Science is built up of facts, as a house is built of stones; but an accumulation of facts is no more a science than a heap of stones is a house.”

– Henri Poincaré,

Science and Hypothesis, 1905

- 別40 -