Shaping the Big Ball of Data Mud
W3C's Shapes Constraint Language (SHACL)
Richard Cyganiak
Lotico Berlin Semantic Web Meetup, 17 November 2016
Semantic Web
[Diagram: RDF, SPARQL, OWL, RDFS]
Strengths
• Flexible can-say-anything data model
• Merging data is trivial
• Shared, explicit meaning thanks to URIs
• Mixing and matching of schemas; partial understanding
• Painstakingly developed vocabularies
• “Neutral ground” for modelling
• SPARQL

Weaknesses
• Overgeneralisation: works for anything, but great at nothing
• “RDF tax”
• Logic foundations and web foundations can be baggage
• Maps poorly to common programming language data structures
• Schemaless nature makes optimisation difficult
• Not good at semi-structured
Application Areas
• Knowledge graphs
• Publishing
• Life sciences
• Fraud detection & identity management
• Data integration & analysis
The V’s of Big Data: Volume, Velocity, Variety
https://www.w3.org/blog/2010/05/linked-data-its-is-not-like-th/
[Diagram: RDF, SPARQL, OWL, RDFS]
Validation? Constraint checking?
RDF is supposedly self-describing.
RDF
Schema.org
Simple Knowledge Organization System (SKOS)
Dublin Core
Data Cube Vocabulary
R2RML
Linked Data Platform (LDP)
Why is RDFS not enough?
• RDF “Schema”: and schemas are for validation, right?
• It’s a misnomer; should be “RDF Vocabulary Definition Language”
• Very limited expressivity
• Not the right semantics for validation
• ex:capital rdfs:range ex:City. ex:Berlin ex:capital ex:Germany. => …?
  => ex:Germany a ex:City
• Invalid data -> infer more invalid data
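The range example can be replayed in a few lines of Python. This is a minimal sketch, assuming triples are modelled as plain string tuples; the helper name apply_range_rule is made up for illustration, and only the single RDFS range rule is implemented, not a full RDFS reasoner.

```python
# Triples are modelled as (subject, predicate, object) tuples of strings.
# This encoding and the helper name are illustrative assumptions.
RDF_TYPE = "rdf:type"
RDFS_RANGE = "rdfs:range"


def apply_range_rule(triples):
    """Return the rdf:type triples entailed by rdfs:range declarations."""
    ranges = {(s, o) for (s, p, o) in triples if p == RDFS_RANGE}
    inferred = set()
    for prop, cls in ranges:
        for (s, p, o) in triples:
            if p == prop:
                # The object of the property is inferred to be in the range class.
                inferred.add((o, RDF_TYPE, cls))
    return inferred


data = {
    ("ex:capital", RDFS_RANGE, "ex:City"),
    # Subject and object are the wrong way round: this is the invalid statement.
    ("ex:Berlin", "ex:capital", "ex:Germany"),
}

inferred = apply_range_rule(data)
print(inferred)
```

Nothing is rejected: the swapped statement is accepted and the rule adds one more wrong statement about ex:Germany on top of it.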
Why is OWL not enough?
• De facto a constraint language: logical contradiction => invalid
• Very expressive
• But targeted at logic modelling, not validity constraints
• Not the right semantics for validation
• ex:Dublin ex:inCountry ex:Ireland, ex:USA => …?
  => ex:Ireland owl:sameAs ex:USA
• Open world assumption
• No unique name assumption
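The Dublin example can be sketched the same way. The slide does not say how ex:inCountry is declared; the sketch below assumes it is a functional property (at most one value per subject), which is one way to arrive at the owl:sameAs conclusion. The tuple encoding and the helper name apply_functional_rule are illustrative, and this is not a full OWL reasoner.

```python
# Triples are modelled as (subject, predicate, object) tuples of strings.
# Assumption: ex:inCountry is treated as a functional property.
OWL_SAME_AS = "owl:sameAs"


def apply_functional_rule(triples, functional_properties):
    """Infer owl:sameAs between distinct values of a functional property."""
    values = {}
    for (s, p, o) in triples:
        if p in functional_properties:
            values.setdefault((s, p), set()).add(o)
    inferred = set()
    for objects in values.values():
        ordered = sorted(objects)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                # No unique name assumption: two names may denote one thing,
                # so a second value is not an error but an equality.
                inferred.add((first, OWL_SAME_AS, second))
    return inferred


data = {
    ("ex:Dublin", "ex:inCountry", "ex:Ireland"),
    ("ex:Dublin", "ex:inCountry", "ex:USA"),
}

inferred = apply_functional_rule(data, {"ex:inCountry"})
print(inferred)
```

A validator would want to flag the second country; the reasoner instead concludes that the two countries are the same individual.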
ICV: OWL closed-world semantics in Stardog
Why is SPARQL not enough?
http://spinrdf.org/
• SPARQL ASK seems ideal for constraint validation
• Very expressive
• Efficient implementations
• But writing even simple constraints can be tedious
Other proposals
ShEx — Shape Expressions
http://shex.io/
So, something new?
[Diagram: RDF, SPARQL, OWL, RDFS]
Validation? Constraint checking?
SHACL
Shapes Constraint Language
SHACL Overview
• A language for “checking RDF graphs against conditions”
• Produced by W3C Data Shapes Working Group
• Work in progress, some features at risk
• 4th Working Draft: August 2016
• Should be done by June 2017
• Like RDFS and OWL, SHACL constraints are themselves written in RDF
• SPARQL underneath (for evaluation semantics and extensibility)
ex:PersonShape
    a sh:Shape ;
    sh:targetClass ex:Person ;
    sh:property [
        sh:predicate ex:ssn ;
        sh:maxCount 1 ;
        sh:datatype xsd:string ;
        sh:pattern "^\\d{3}-\\d{2}-\\d{4}$" ;
    ] ;
    sh:property [
        sh:predicate ex:child ;
        sh:class ex:Person ;
        sh:nodeKind sh:IRI ;
    ] ;
    sh:property [
        sh:path [ sh:inversePath ex:child ] ;
        sh:name "parent" ;
        sh:maxCount 2 ;
    ] .
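To make the ex:ssn part of the shape concrete, here is a rough Python sketch of what checking those three constraints amounts to. It is not how a SHACL engine is implemented: the tuple encoding of triples and literals, the function name check_ssn, and the sample people are all assumptions made for illustration.

```python
import re

XSD_STRING = "xsd:string"
# Same regular expression as in the shape, without the Turtle string escaping.
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")


def check_ssn(focus_node, triples):
    """Check maxCount 1, datatype xsd:string and the pattern for ex:ssn values.

    Literals are modelled as (lexical_form, datatype) tuples (an assumption).
    Returns the names of the constraints that are violated.
    """
    values = [o for (s, p, o) in triples if s == focus_node and p == "ex:ssn"]
    problems = []
    if len(values) > 1:
        problems.append("maxCount")
    for lexical_form, datatype in values:
        if datatype != XSD_STRING:
            problems.append("datatype")
        if not SSN_PATTERN.search(lexical_form):
            problems.append("pattern")
    return problems


# Hypothetical data graph.
data = [
    ("ex:Alice", "rdf:type", "ex:Person"),
    ("ex:Alice", "ex:ssn", ("987-65-4321", XSD_STRING)),
    ("ex:Bob", "rdf:type", "ex:Person"),
    ("ex:Bob", "ex:ssn", ("123-45-6789", XSD_STRING)),
    ("ex:Bob", "ex:ssn", ("124-35-6789", XSD_STRING)),
    ("ex:Calvin", "rdf:type", "ex:Person"),
    ("ex:Calvin", "ex:ssn", ("987-65-432A", XSD_STRING)),
]

for person in ("ex:Alice", "ex:Bob", "ex:Calvin"):
    print(person, check_ssn(person, data))
```

Note that a person without any ex:ssn passes: the shape sets a maximum count but no minimum.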
How a Shape works
Diagram: Dimitris Kontokostas
Targets: Initial selection of focus nodes
• Node target
• Class instance target
• Subjects-of target
• Objects-of target
• SPARQL-based selection (advanced)
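The first four target types can be read as simple selections over the data graph. The following is a minimal sketch, assuming triples as string tuples; the function name focus_nodes and the kind labels "node", "class", "subjectsOf" and "objectsOf" are made up here. Subclass handling for class targets and SPARQL-based selection are left out.

```python
def focus_nodes(triples, target_kind, target):
    """Select focus nodes for one target declaration (simplified)."""
    if target_kind == "node":
        # A node target names the focus node directly.
        return {target}
    if target_kind == "class":
        # Instances of the class (subclasses are ignored in this sketch).
        return {s for (s, p, o) in triples if p == "rdf:type" and o == target}
    if target_kind == "subjectsOf":
        # Every subject that has the given property.
        return {s for (s, p, o) in triples if p == target}
    if target_kind == "objectsOf":
        # Every value of the given property.
        return {o for (s, p, o) in triples if p == target}
    raise ValueError(f"unknown target kind: {target_kind}")


# Hypothetical data graph.
data = {
    ("ex:Alice", "rdf:type", "ex:Person"),
    ("ex:Alice", "ex:child", "ex:Bob"),
    ("ex:Bob", "rdf:type", "ex:Person"),
}

print(focus_nodes(data, "class", "ex:Person"))
print(focus_nodes(data, "subjectsOf", "ex:child"))
print(focus_nodes(data, "objectsOf", "ex:child"))
```

The constraints of a shape are then checked once for each selected focus node.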
Node constraints
Constraints about the focus node itself:
• Node kind (IRI, blank, literal)
• IRI stem (namespace)
• IRI regex
• SPARQL query constraint (advanced)
Property constraints
Constraints about a certain outgoing or incoming property of the focus node(s):
• Cardinality
• Class
• Datatype
• Node kind (IRI, blank node, literal)
• String min/max length, string regex
• Numeric min/max
• Value must match another shape
• Value must not match another shape
Other features
• Combine constraints with logical OR/any (default: AND/all)
• Property-pair comparison (=, <, >)
• Severities (Violation, Warning, Info)
• Annotations (name, description, grouping, order)
• Define additional types of constraints based on SPARQL (advanced)
Violation reports can be produced in RDF

ex:ExampleConstraintViolation
    a sh:ValidationResult ;
    sh:severity sh:Violation ;
    sh:focusNode ex:Bob ;
    sh:path ex:age ;
    sh:value "twenty two" ;
    sh:message "ex:age must be literal of datatype xsd:integer." ;
    sh:sourceConstraintComponent sh:DatatypeConstraintComponent ;
    sh:sourceShape ex:PersonShape .
Relationship to Rules
• Rules: “If someone says this, then I say that.”
• SHACL can’t do this.
• Does not replace SWRL, Jena Rules, RIF, SPIN Rules
Uses and implementations
SHACL in TopBraid Composer: Shapes + Constraints
SHACL support is available in the TopBraid Composer Free Edition. http://www.topquadrant.com/downloads/
SHACL in TopBraid Composer: SPARQL-based constraints
SHACL in TopQuadrant’s web products (EVN, EDG)
SHACL Protégé Plugin
http://me-at-big.blogspot.de/2015/07/shacl4p-shapes-constraint-language.html
Repairing SKOS taxonomies with SHACL
Validation of SKOS with SHACL, and extension of SHACL with specification of repair strategies.
Christian Mader and Monika Solanki, http://ceur-ws.org/Vol-1666/paper-06.pdf
Validating the “bag of crisps”…
• Validation is often not about correct/incorrect or valid/invalid
• Constraints-first (e.g., SQL)
• Well-formed vs valid (e.g., XML Schema)
• Validation is often about completeness and correctness for a specific purpose: “This is what I produce”; “This is what I understand”
• Assumption is that there may be other statements
• Different consumers may apply different constraints
• SHACL should work well in this flexible, multi-source, multi-consumer world.
“Anyone can say anything about anything”
[Diagram: RDF, SPARQL, OWL, RDFS]
Statements: What is being said?
What words do we have?
What makes logical sense to say?
What did you say about XYZ?
OWL SHACL
Is that word used correctly?
What do you need to know from me?
You can’t say that here!
I’d never say that!
2017
Backup slides