Linked Data Best Practices (and Abuses) Lessons Learned in IBM Rational Arthur Ryman 2014-04-15


Linked Data Best Practices (and Abuses)

Lessons Learned in IBM Rational

Arthur Ryman
2014-04-15

2

Best Practices

• Publishing vocabularies
• Data model customization
• Real-world things
• JSON and RDF
• Multi-valued and optional properties
• Provenance and inverse properties
• Ontologies and constraints

3

PUBLISHING VOCABULARIES

4

Publishing vocabularies

• We should use established vocabularies if they exist
  – W3C, Dublin Core, OSLC, …
• Any new terms we define should be described in vocabulary documents rooted at http://jazz.net/ns
  – Propose generally useful terms to OSLC
• When you look up an RDF term, you should get its vocabulary document
  – HTML for web browsers
  – RDF for programs, e.g. query builders
  – e.g. http://jazz.net/ns/qm/rqm#Category
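Serving HTML to browsers and RDF to programs from the same term URI is usually done with HTTP content negotiation. The following is a minimal sketch of the server-side choice, not a description of how jazz.net is implemented; the file names and the fallback to HTML are assumptions made for illustration. Note that the fragment (#Category) is never sent to the server, so the decision is made for the whole vocabulary document.

```python
# Hypothetical media types and file names for one vocabulary document.
REPRESENTATIONS = {
    "text/html": "rqm.html",
    "text/turtle": "rqm.ttl",
    "application/rdf+xml": "rqm.rdf",
}


def parse_accept(header):
    """Return (media_type, q) pairs from an Accept header, best first."""
    ranges = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        ranges.append((media_type, q))
    # sorted() is stable, so equal q values keep the client's order.
    return sorted(ranges, key=lambda r: r[1], reverse=True)


def choose_representation(accept_header, default="text/html"):
    """Pick the media type to serve for a vocabulary lookup."""
    for media_type, q in parse_accept(accept_header or "*/*"):
        if q <= 0:
            continue
        if media_type in REPRESENTATIONS:
            return media_type
        if media_type == "*/*":
            return default
        if media_type.endswith("/*"):
            prefix = media_type[:-1]  # e.g. "text/"
            for candidate in REPRESENTATIONS:
                if candidate.startswith(prefix):
                    return candidate
    # Assumption: fall back to HTML rather than answering 406 Not Acceptable.
    return default
```

A query builder would send `Accept: text/turtle`; a browser's default header resolves to HTML.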

5

Vocabulary page for http://jazz.net/ns/qm/rqm#Category

6

How to publish a vocabulary

• We have a new public wiki!
  – https://jazz.net/wiki/bin/view/LinkedData
• Read the guidelines
• Create a wiki page and attach the HTML, Turtle, and RDF/XML files
• Request a review from Nelson
  – Allow dev time to address issues
• Arthur will redirect jazz.net/ns to the wiki

7

LinkedData wiki

8

Abuses

• You published your vocabulary but skimped on the content
  – e.g. minimal or cryptic comments
• You published your vocabulary but didn’t keep it up to date
  – e.g. Focal Point 227292
• You created some new terms but didn’t publish your vocabulary
  – e.g. JLIP Tracked Resource Set 306919

9

DATA MODEL CUSTOMIZATION

10

Data model customization

• Many of our tools allow customization
  – e.g. RTC work items
• We need to expose the custom data elements as RDF
• Tools should allow users to map custom data elements to externally defined RDF terms
  – industry standards
  – corporate standards
• When no mapping is specified, tools should generate local RDF terms and vocabularies
  – vocabularies are needed by query authors
  – tools must host the vocabularies they generate

11

Abuses

• Your tool generates a cryptic URI for local RDF terms
  – Obfuscates meaning
  – Forces humans to access vocabulary document
• Your tool does not generate a vocabulary document for local RDF terms
  – e.g. RTC 304143
  – see following case study
• When the mapping to RDF is changed, your tool does not create TRS change events for just the affected resources

12

Case study: RTC Work Items

• Some attributes are built-in
• Some are defined by OSLC CM 2.0
• Some are user defined
• Consider Priority

13

Project area editor allows customization

14

Enumerated values should specify RDF URIs (External Value)

15

Priority values are enumerated

16

Get the resource URL

17

Look for priority in the RDF representation of Task 224727

19

Object of priority is not an RDF vocabulary term

20

Problems

• The priority predicate comes from a non-existent vocabulary (bad)
  – http://open-services.net/ns/cm-x#
  – RDF vocabularies should be dereferenceable
  – OSLC should publish it, tagged as archaic
• The object is a dereferenceable URI (good), but not a vocabulary term (ugly)
  – Need rdfs:label, rdfs:comment for query authors
• Result: no easy way to write queries based on priority

21

Best Practice for external vocabularies

• RTC project template should refer to external vocabularies for standard terms
  – OSLC CM V3 defines priority and 4 values
• Teach and enable clients to create corporate standard vocabularies for reuse of common terms (UA)
  – Needed for cross-project queries
• Provide export/import UI to manage vocabularies
  – e.g. Focal Point uses simple spreadsheet format

22

Best Practice for local vocabularies

• RTC (and all other tools) should generate a local RDF vocabulary for all user-defined terms
  – Include rdfs:label, rdfs:comment for query authors (and other consumers)
• LQE admin should load user-defined vocabularies into LQE to make them available to queries
  – provide programmatic integration, e.g. a special purpose vocabulary TRS
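As an illustration of generating a local vocabulary with readable term URIs plus rdfs:label and rdfs:comment, here is a small sketch. It is not how RTC does it; the namespace, the camel-case naming rule, and the choice of rdf:Property for every term are assumptions.

```python
import re


def term_name(label):
    """Derive a readable local name from a label, e.g. 'Risk Level' -> 'riskLevel'."""
    words = re.findall(r"[A-Za-z0-9]+", label)
    if not words:
        raise ValueError("label has no usable characters: %r" % label)
    first = words[0].lower()
    rest = "".join(w[:1].upper() + w[1:].lower() for w in words[1:])
    return first + rest


def escape_literal(text):
    """Escape a string for use inside a double-quoted Turtle literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def vocabulary_turtle(namespace, attributes):
    """Render a Turtle vocabulary for (label, comment) pairs.

    `namespace` is assumed to end in '#', so every term resolves to the
    same vocabulary document.
    """
    lines = [
        "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
    ]
    vocab = namespace.rstrip("#")
    for label, comment in attributes:
        lines += [
            "<%s%s> a rdf:Property ;" % (namespace, term_name(label)),
            '    rdfs:label "%s" ;' % escape_literal(label),
            '    rdfs:comment "%s" ;' % escape_literal(comment),
            "    rdfs:isDefinedBy <%s> ." % vocab,
            "",
        ]
    return "\n".join(lines)
```

The readable local name addresses the "cryptic URI" abuse; the label and comment give query authors something to read without guessing.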

23

Best Practice for all vocabularies

• When an administrator changes the RDF representation of a set of resources, corresponding change events MUST be added to the TRS change log
  – Add/remove custom attributes and values
  – Modify mapping to RDF URIs
• Allow the administrator to make multiple representation changes and then manually trigger the generation of change events
  – Batch multiple representation changes together to minimize re-indexing time and server load
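The batching idea can be sketched as follows: representation changes are recorded as they happen, and one modification event per affected resource is emitted only when the administrator triggers it. The class and field names are hypothetical, and the events are plain dictionaries standing in for TRS change events (a trs:Modification with a changed resource and an order number).

```python
def affected_resources(resource_attributes, changed_attributes):
    """Return URIs of resources that use at least one changed attribute.

    resource_attributes maps a resource URI to the set of custom
    attribute names that appear in its RDF representation.
    """
    changed = set(changed_attributes)
    return sorted(uri for uri, attrs in resource_attributes.items()
                  if changed & set(attrs))


class PendingChanges:
    """Collects representation changes until the administrator publishes them."""

    def __init__(self, last_order=0):
        self.changed_attributes = set()
        self.last_order = last_order

    def record(self, attribute):
        """Note that the RDF mapping of one attribute has changed."""
        self.changed_attributes.add(attribute)

    def publish(self, resource_attributes):
        """Emit one Modification event per affected resource, then reset."""
        events = []
        for uri in affected_resources(resource_attributes, self.changed_attributes):
            self.last_order += 1
            events.append({
                "type": "trs:Modification",
                "changed": uri,
                "order": self.last_order,
            })
        self.changed_attributes.clear()
        return events
```

A resource touched by several changes in the batch is re-indexed once, and resources that use none of the changed attributes produce no events.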

24

REAL-WORLD THINGS

25

La Trahison des Images

"The famous pipe. How people reproached me for it! And yet, could you stuff my pipe? No, it's just a representation, is it not? So if I had written on my picture 'This is a pipe', I'd have been lying!"
- René Magritte

26

Real-world things

• Linked Data differentiates between two kinds of thing
  – Information, e.g. a document on the web
  – Real-world, e.g. a person
• Both kinds should be identified with HTTP URIs
• Looking up a real-world URI should result in an information resource that contains information about the real-world thing
  – URI-references (hash URIs)
  – HTTP redirect: 303 See Other (303 URIs)
• Refer to Cool URIs for the Semantic Web
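The two lookup mechanisms can be sketched in a few lines. With a hash URI the client strips the fragment before making the request, so the document is returned directly; with a 303 URI the server answers the real-world URI with a redirect to a describing document. The paths used here are hypothetical.

```python
from urllib.parse import urldefrag


def document_url(thing_uri):
    """Return the URL a client actually requests for a hash URI."""
    return urldefrag(thing_uri).url


def lookup(path, real_world_paths):
    """Return (status, headers) for a GET on `path`.

    real_world_paths maps the path of a real-world thing to the path of
    the document that describes it (both hypothetical).
    """
    if path in real_world_paths:
        # 303 See Other: "this is not a document, but here is one about it".
        return 303, {"Location": real_world_paths[path]}
    return 200, {}
```

Either way, the person and the document about the person end up with different URIs, which is what keeps statements such as creation dates attached to the right thing.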

27

Example foaf:Person

• Suppose you create a document, http://people.org/johnsmith, about John Smith on 2013-09-17
• The following is nonsense because John Smith was not created on 2013-09-17:
    <http://people.org/johnsmith> a foaf:Person .
    <http://people.org/johnsmith> dcterms:created "2013-09-17"^^xsd:date .
• The following makes sense:
    <http://people.org/johnsmith#me> a foaf:Person .
    <http://people.org/johnsmith> dcterms:created "2013-09-17"^^xsd:date .

28

Abuses

• Failure to differentiate between a person and an account owned by a person
  – Leads to nonsense triples
  – Focal Point Defect 234212
  – JTS Defect 307861
  – See following JTS users case study
• NOTE: email address is the preferred way to identify people across tools

29

Work items refer to people

30

JTS Users

• OSLC Core specifies that the object of dcterms:creator, dcterms:contributor, oslc:modifiedBy should be a resource of class foaf:Agent or foaf:Person (real-world)
• RTC implements OSLC CM and has triples like:
    <https://jazz.net/jazz02/resource/...WorkItem/72226>
        dcterms:creator <https://jazz.net/jts04/users/ryman> ;
        dcterms:contributor <https://jazz.net/jts04/users/retchles> .

31

RDF representation of person contains nonsense

32

Best Practice

• The property j.1:archived applies to the user account (information resource), not the person (real-world)
• Solution 1: use hash URIs for people:
    <https://jazz.net/jts04/users/ryman#me>
• Solution 2: use 303 URIs for accounts (preferred by Philippe):
    <https://jazz.net/jts04/accounts/ryman>

33

303 URI Solution

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix jfs: <http://jazz.net/xmlns/prod/jazz/jfs/1.0/> .

<https://jazz.net/jts04/accounts/ryman>
    a foaf:OnlineAccount ;
    jfs:archived false .

<https://jazz.net/jts04/users/ryman>
    a foaf:Person ;
    foaf:account <https://jazz.net/jts04/accounts/ryman> ;
    foaf:img <https://jazz.net/jts04/users/photo/ryman> ;
    foaf:mbox <mailto:[email protected]> ;
    foaf:name "Arthur Ryman" ;
    foaf:nick "ryman" .

34

JSON AND RDF

35

JSON

• Familiar to OO and Web developers
• Popularity fueled by Cloud
• e.g. Amazon uses JSON as the payload in AWS REST APIs as an alternative to SOAP and XML
  – Simpler/faster for web clients to handle
• Use is spreading across the stack
  – MongoDB, CouchDB/Cloudant
  – node.js

36

JSON and RDF

• Some developers are saying: “JSON is simpler and more popular than RDF. Let’s use JSON instead of RDF.”
  – This is a false dichotomy
• JSON is just as problematic as XML for data integration
  – JSON and XML are message formats
• Linked Data is our integration strategy
  – RDF expresses semantics
• Use JSON-LD, now a W3C standard
  – OSLC and Rational should publish standard contexts
• See following LQE Security Context case study

37

Initial JSON design

• Simple, but no explicit semantics
• Use of UUIDs instead of HTTP URIs

[
  {
    "security_context_id": "urn:uuid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6",
    "name": "Resources for Alpha project"
  },
  {
    "security_context_id": "urn:uuid:g92e5gbf-8efd-22e1-b876-11b1d02f7cg7",
    "name": "Resources for Beta project"
  }
]

38

Equivalent JSON-LD design

{
  "@context": {
    "@base": "https://example.com/sc",
    "dcterms": "http://purl.org/dc/terms/"
  },
  "@graph": [
    {
      "@id": "#1",
      "dcterms:title": "Resources for Alpha project"
    },
    {
      "@id": "#2",
      "dcterms:title": "Resources for Beta project"
    }
  ]
}

39

Final JSON-LD design with type info

{
  "@graph": [
    {
      "@id": "https://example.com/sc",
      "@type": "http://open-services.net/ns/core/sc#SecurityContextList"
    },
    {
      "@id": "https://example.com/sc#1",
      "@type": "http://open-services.net/ns/core/sc#SecurityContext",
      "http://purl.org/dc/terms/title": "Resources for Alpha project"
    },
    {
      "@id": "https://example.com/sc#2",
      "@type": "http://open-services.net/ns/core/sc#SecurityContext",
      "http://purl.org/dc/terms/title": "Resources for Beta project"
    }
  ]
}
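The step from the initial JSON design to the typed JSON-LD design is mechanical, which the sketch below tries to show. It only uses the standard json module; the function name is hypothetical, and assigning fragment ids by list position (dropping the UUIDs) is an assumption made to match the shape of the slide's output.

```python
import json

DCTERMS_TITLE = "http://purl.org/dc/terms/title"
SC_NS = "http://open-services.net/ns/core/sc#"  # namespace as shown on the slide


def to_json_ld(records, list_url):
    """Convert the initial plain-JSON records to JSON-LD with type info."""
    graph = [{"@id": list_url, "@type": SC_NS + "SecurityContextList"}]
    for index, record in enumerate(records, start=1):
        graph.append({
            # Assumption: fragment ids are assigned by position in the list.
            "@id": "%s#%d" % (list_url, index),
            "@type": SC_NS + "SecurityContext",
            DCTERMS_TITLE: record["name"],
        })
    return {"@graph": graph}


initial = [
    {"security_context_id": "urn:uuid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6",
     "name": "Resources for Alpha project"},
]
print(json.dumps(to_json_ld(initial, "https://example.com/sc"), indent=2))
```

The result is still ordinary JSON for a web client, but every key and identifier is now an HTTP URI that an RDF consumer can interpret.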

40

MULTI-VALUED AND OPTIONAL PROPERTIES

41

Multi-valued and optional properties

• RDF documents contain sets of triples
• Model multi-valued properties by a set of triples that share a common subject and predicate
• Model the absence of an optional property by an empty set of triples
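A tiny model of this, treating a document as a Python set of (subject, predicate, object) tuples. The helper names and the ex: terms are hypothetical.

```python
def add_values(triples, subject, predicate, values):
    """Add one triple per value; an empty list of values adds nothing."""
    for value in values:
        triples.add((subject, predicate, value))


def values_of(triples, subject, predicate):
    """Return all values of a property, or an empty list if it is absent."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


triples = set()
add_values(triples, "ex:item1", "dcterms:subject", ["oslc", "reporting"])
add_values(triples, "ex:item1", "ex:estimate", [])  # optional, not set
```

Because the document is a set, repeating a value does not create a duplicate triple, and an unset optional property simply has no triples to scan.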

42

Abuses

• Model multiple values of a property by concatenating the values into a single object
  – Defeats database indexing
  – Slows queries since substring matching must be used
• Model the absence of an optional value using the presence of an empty string
  – Adds many unnecessary triples
  – Slows queries (longer scans)
  – Sometimes an empty string is a meaningful value
  – Sometimes an empty string is lexically invalid
• See following RTC tag case study Defect 271867

43

“Tags” is multi-valued
“Estimate” is optional

44

RDF representation

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rtc_cm: <http://jazz.net/xmlns/prod/jazz/rtc/cm/1.0/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

@base <https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/> .

<271867>
    dcterms:subject "datagap, oslc, next_release_candidate, data_gap, reporting-gap"^^xsd:string ;
    …
    rtc_cm:estimate ""^^xsd:long .

Syntax validated OK. There were warnings: Typed literal has an invalid lexical value: Input string was not in the correct format: s.Length==0.: ""^^<http://www.w3.org/2001/XMLSchema#long>.
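A sketch of what a corrected mapping for this work item could look like: one dcterms:subject triple per tag, and no rtc_cm:estimate triple when the estimate is unset. This is not the actual RTC fix; the function name and the plain-tuple triple form are assumptions.

```python
def work_item_triples(subject, tags_field, estimate_field):
    """Map the tool's raw field values to triples.

    tags_field is the comma-separated string the tool stores;
    estimate_field is the raw estimate text, possibly empty.
    """
    triples = []
    for tag in tags_field.split(","):
        tag = tag.strip()
        if tag:
            # One triple per tag, so each value can be indexed and matched exactly.
            triples.append((subject, "dcterms:subject", tag))
    estimate = estimate_field.strip()
    if estimate:
        # Only emit the optional property when it has a valid integer value.
        triples.append((subject, "rtc_cm:estimate", int(estimate)))
    return triples
```

With the field values from the slide this yields five dcterms:subject triples and no estimate triple, so the invalid-lexical-value warning does not arise.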

45

PROVENANCE AND INVERSE PROPERTIES

46

Provenance: Where did the triple come from?

• A statement is represented by a triple
• Triples from multiple documents may be merged and queried
  – Default graph is a triple store
• When storing RDF documents, the document URL is often used as the name of a graph (e.g. in LQE)
  – triple + graph name = quad
  – triple stores are really quad stores
• Provenance of triples is important in several use cases
  – Updating a document
  – Access control
  – VVC (which version)
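The "triple + graph name = quad" idea can be sketched with a dictionary keyed by document URL. This is an illustration only, not how LQE stores data; the class and its methods are hypothetical. It shows why provenance matters for two of the use cases above: updating one document and restricting a query to readable graphs.

```python
class QuadStore:
    """Stores triples per named graph, keyed by source document URL."""

    def __init__(self):
        self._graphs = {}  # document URL -> set of (s, p, o)

    def replace_document(self, document_url, triples):
        """Updating a document swaps out only that document's triples."""
        self._graphs[document_url] = set(triples)

    def match(self, predicate, readable_graphs=None):
        """Return sorted (s, p, o, graph) quads for a predicate.

        readable_graphs, if given, limits results to graphs the caller
        may read (a stand-in for access control).
        """
        quads = []
        for graph, triples in self._graphs.items():
            if readable_graphs is not None and graph not in readable_graphs:
                continue
            for s, p, o in triples:
                if p == predicate:
                    quads.append((s, p, o, graph))
        return sorted(quads)
```

Without the graph name there would be no way to tell which stored triples belong to the old version of a document when a new version arrives.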

47

Provenance and authority

• The authority (trust) of a triple depends on the author of the document that contains the triple
• Triples should be placed in the document that the author is authorized to modify
  – When creating a link from A to B, put the link in the document that the author is editing, not necessarily A or B or both
  – Document C may contain links from A to B

48

Inverse properties

• Directed relations between resources (links) may be stated in two equivalent ways, e.g.
  – Testcase1 validates Requirement2 .
  – Requirement2 isValidatedBy Testcase1 .
• There is no benefit to having mutual inverse pairs of properties
• The existence of mutual inverse pairs of properties makes query authoring more complex, and query execution more expensive
• A triple should be put in the document that the author of the triple is editing (provenance)
  – There is no special significance attached to being the subject of a triple
• See OSLC guidance on preferred direction of properties
  – Direction should be from downstream to upstream, e.g. test case validates requirement

49

Abuses

• OSLC domain specs define many pairs of mutual inverse predicates
• Recommendation
  – Deprecate one member of each pair
  – Replace deprecated property in all RDF representations and queries
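Replacing a deprecated inverse property amounts to swapping subject and object and substituting the preferred predicate. A minimal sketch follows; the ex: predicate names mirror the earlier validates / isValidatedBy example and are hypothetical, not real OSLC terms.

```python
# Hypothetical mapping from a deprecated predicate to its preferred inverse.
INVERSE_OF = {"ex:isValidatedBy": "ex:validates"}


def normalize(triples, inverse_of=INVERSE_OF):
    """Rewrite deprecated inverse predicates into the preferred direction.

    Only meaningful for links between resources; a triple whose object
    is a literal cannot be inverted.
    """
    result = set()
    for s, p, o in triples:
        if p in inverse_of:
            result.add((o, inverse_of[p], s))
        else:
            result.add((s, p, o))
    return result
```

After normalization a query only needs to match one predicate in one direction, which is the simplification the recommendation is after.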

50

ONTOLOGIES AND CONSTRAINTS

51

Vocabularies and Ontologies

• A vocabulary defines the meaning of terms
  – Use RDFS: rdfs:label, rdfs:comment, rdfs:isDefinedBy, …
• An ontology defines inference rules
  – Given a set of triples, infer more triples
  – Use RDFS: rdfs:domain, rdfs:range, rdfs:subClassOf, …
  – Use OWL for more complex inference rules

52

Ontologies and Constraints

• Ontologies are not designed to define integrity constraints
  – See Linked Data Interfaces for examples
• An RDFS or OWL reasoner will add triples to create a model for the ontology
• A reasoner will report an inconsistency if it cannot create a model
  – However, this mechanism cannot in practice be used to check for typical integrity constraints

53

Best Practice: Ontologies

• Your triples may end up in a reasoner one day, so only add inference rules when they produce the intended results

• If you define generic properties, such as “uses”, then you probably SHOULD NOT define rdfs:domain and rdfs:range

• If you define type-specific properties, such as “usesTestCase” then rdfs:domain and rdfs:range MAY make sense

• e.g. If you intend to infer that the object of oslc_qm:usesTestCase is an oslc_qm:TestCase then include the following triple in an ontology:

oslc_qm:usesTestCase rdfs:range oslc_qm:TestCase .
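To make the effect of that triple concrete, here is a sketch of the single RDFS rule for rdfs:range applied to a set of tuples. It is not a full reasoner, and the ex: resources are hypothetical.

```python
RDF_TYPE = "rdf:type"
RDFS_RANGE = "rdfs:range"


def infer_range_types(triples):
    """Apply the rdfs:range rule once and return only the new triples.

    If (p, rdfs:range, C) and (s, p, o) are present, infer (o, rdf:type, C).
    """
    ranges = {}
    for s, p, o in triples:
        if p == RDFS_RANGE:
            ranges.setdefault(s, set()).add(o)
    inferred = set()
    for s, p, o in triples:
        for cls in ranges.get(p, ()):
            inferred.add((o, RDF_TYPE, cls))
    return inferred - set(triples)


data = {
    ("oslc_qm:usesTestCase", RDFS_RANGE, "oslc_qm:TestCase"),
    ("ex:plan1", "oslc_qm:usesTestCase", "ex:req7"),  # points at a requirement by mistake
    ("ex:req7", RDF_TYPE, "oslc_rm:Requirement"),
}
new_triples = infer_range_types(data)
```

Note what happens with the mistaken link: the rule does not reject it, it concludes that ex:req7 is also a test case. That is why range declarations behave as inference rules and not as integrity constraints.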

54

Best Practice: Constraints

• W3C is starting an activity on RDF validation
  – See W3C workshop
• We have submitted the OSLC Resource Shape specification to W3C
  – See Resource Shape 2.0
• Use Resource Shape 2.0 to describe integrity constraints on RDF documents
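A constraint check, in contrast to inference, reports violations instead of adding triples. The sketch below checks only cardinality, using the four oslc:occurs values from Resource Shape. A real shape is itself an RDF resource; representing it as a Python dictionary from predicate to occurs value is a simplification assumed here.

```python
# (minimum, maximum) counts for each oslc:occurs value; None means unbounded.
OCCURS = {
    "oslc:Exactly-one": (1, 1),
    "oslc:Zero-or-one": (0, 1),
    "oslc:One-or-many": (1, None),
    "oslc:Zero-or-many": (0, None),
}


def check_shape(triples, subject, shape):
    """Return (predicate, occurs, actual_count) for each cardinality violation.

    shape maps a predicate to one of the OCCURS keys.
    """
    problems = []
    for predicate, occurs in sorted(shape.items()):
        low, high = OCCURS[occurs]
        count = sum(1 for s, p, o in triples
                    if s == subject and p == predicate)
        if count < low or (high is not None and count > high):
            problems.append((predicate, occurs, count))
    return problems
```

An empty result means the resource satisfies the cardinalities in the shape; nothing is inferred or added to the data.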

55

Other topics

• Blank nodes
  – Mean "there exists" or "some"
  – Use fragment ids for internal resources
• Containers
  – Avoid Seq, Bag, List
  – Use Linked Data Platform containers
• Consuming external vocabularies
  – Tools should gracefully degrade when external resources are unreachable
  – Be a well-behaved HTTP client wrt caching, etc.
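Graceful degradation for external vocabularies can be sketched as a cache that falls back to the last good copy when the fetch fails. The class is hypothetical, and the fetch function is injected so the sketch stays independent of any HTTP library. A well-behaved client would also honor Cache-Control and use conditional requests (ETag, If-None-Match), which this sketch does not implement.

```python
import time


class VocabularyCache:
    """Caches vocabulary documents and serves stale copies on failure."""

    def __init__(self, fetch, max_age=3600, clock=time.monotonic):
        self._fetch = fetch      # callable: url -> body, raises OSError on failure
        self._max_age = max_age  # seconds before a cached copy is refetched
        self._clock = clock
        self._entries = {}       # url -> (fetched_at, body)

    def get(self, url):
        """Return (body, status): 'fresh', 'cached', 'stale', or 'missing'."""
        entry = self._entries.get(url)
        now = self._clock()
        if entry and now - entry[0] < self._max_age:
            return entry[1], "cached"
        try:
            body = self._fetch(url)
        except OSError:
            if entry:
                # Unreachable: degrade to the last copy we saw.
                return entry[1], "stale"
            return None, "missing"
        self._entries[url] = (now, body)
        return body, "fresh"
```

A tool receiving "stale" or "missing" can still run, for example by showing raw URIs in place of labels, instead of failing the whole operation.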