Copyright 2009 Digital Enterprise Research Institute. All rights reserved.
Digital Enterprise Research Institute www.deri.ie
Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering
Mini-Viva
Aidan Hogan
12th February, 2010
Overview
Fig 1: RDF Web Dataset
explicit data
implicit data
Topic of today’s talk: How to exploit implicit data for “query answering”
Query Answering…
…over RDF Web data… (Linked Data if you prefer)
Search engines such as SWSE, Sindice, Falcons, Swoogle, Watson etc.
SPARQL endpoints over Web data such as YARS2, Virtuoso, etc.
ex:Aidan ex:presented ex:RR2009Talk .
deri:Aidan ex:presented deri:FridayTalk .
Problem: Synonymous Omissions
Query: Give me all talks presented by Aidan
ex:Aidan ex:presented ?talk .
[Diagram labels: IMPLICIT / EXPLICIT]
ex:Aidan ex:presented deri:FridayTalk .
ex:Aidan ex:presented ex:FridayTalk .
Problem: Synonymous Duplicates
Query: Give me all talks presented by Aidan
ex:Aidan ex:presented ?talk .
[Diagram labels: IMPLICIT / EXPLICIT]
ex:RR2009Talk ex:presentedBy ex:Aidan .
Problem: Incomplete Answers
Query: Give me all talks presented by Aidan
ex:Aidan ex:presented ?talk .
[Diagram labels: IMPLICIT / EXPLICIT]
Solution: Publish Complete Data?
Query: Give me all talks presented by Aidan
ex:Aidan ex:presented ?talk .
[Diagram labels: IMPLICIT / EXPLICIT]
ex:RR2009Talk ex:presentedBy ex:Aidan .
ex:Aidan ex:presented ex:RR2009Talk .
ex:RR2009Talk ex:presentedBy ex:Aidan .
Solution: Write Query in many ways?
Query: Give me all talks presented by Aidan
ex:Aidan ex:presented ?talk .
?talk ex:presentedBy ex:Aidan .
[Diagram labels: IMPLICIT / EXPLICIT]
deri:Aidan ex:presented deri:FridayTalk .
deri:Aidan owl:sameAs ex:Aidan .
ex:RR2009Talk ex:presentedBy ex:Aidan .
ex:presentedBy owl:inverseOf ex:presented .
ex:Aidan ex:presented deri:FridayTalk .
ex:Aidan ex:presented ex:RR2009Talk .
Solution: Exploit OWL and RDFS…
Query: Give me all talks presented by Aidan
ex:Aidan ex:presented ?talk .
[Diagram labels: IMPLICIT / EXPLICIT]
OWL / RDFS
(loosely) Define the semantics of classes and properties… (define relationships between terms)
ex:presentedBy owl:inverseOf ex:presented .
ex:presentedBy rdfs:domain ex:Talk .
ex:presentedBy rdfs:range ex:Person .
Define equivalence between individuals (owl:sameAs)
ex:Aidan owl:sameAs deri:Aidan .
Give machines an insight into the meaning of data
Allows for reasoning
Reasoning
(Loosely) Use the semantics of classes and properties—defined in RDFS and OWL—to make implicit knowledge explicit
One approach is using rules: IF condition THEN consequent
?p1 owl:inverseOf ?p2 . ?s ?p1 ?o . => ?o ?p2 ?s .
  ex:presentedBy owl:inverseOf ex:presented . ex:FridayTalk ex:presentedBy ex:Aidan . => ex:Aidan ex:presented ex:FridayTalk .
?p rdfs:domain ?c . ?s ?p ?o . => ?s rdf:type ?c .
  ex:presentedBy rdfs:domain ex:Talk . ex:FridayTalk ex:presentedBy ex:Aidan . => ex:FridayTalk rdf:type ex:Talk .
?p rdfs:range ?c . ?s ?p ?o . => ?o rdf:type ?c .
  ex:presentedBy rdfs:range ex:Person . ex:FridayTalk ex:presentedBy ex:Aidan . => ex:Aidan rdf:type ex:Person .
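The three rules above can be sketched as a tiny forward-chaining loop. This is only an illustration: triples are plain Python tuples of prefixed names, and the function names (apply_rules, materialise) are made up for the example, not taken from any reasoner.

```python
# Triples are (subject, predicate, object) tuples of prefixed names.
INVERSE_OF = "owl:inverseOf"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"
TYPE = "rdf:type"


def apply_rules(triples):
    """Return the new triples produced by one pass of the three rules."""
    inferred = set()
    for (p1, rel, target) in triples:
        if rel not in (INVERSE_OF, DOMAIN, RANGE):
            continue
        # Match the schema triple against every triple that uses property p1.
        for (s, p, o) in triples:
            if p != p1:
                continue
            if rel == INVERSE_OF:
                inferred.add((o, target, s))
            elif rel == DOMAIN:
                inferred.add((s, TYPE, target))
            else:
                inferred.add((o, TYPE, target))
    return inferred - triples


def materialise(triples):
    """Repeat the rules until nothing new is inferred (a fixpoint)."""
    closure = set(triples)
    while True:
        new = apply_rules(closure)
        if not new:
            return closure
        closure |= new


data = {
    ("ex:presentedBy", "owl:inverseOf", "ex:presented"),
    ("ex:presentedBy", "rdfs:domain", "ex:Talk"),
    ("ex:presentedBy", "rdfs:range", "ex:Person"),
    ("ex:FridayTalk", "ex:presentedBy", "ex:Aidan"),
}
closure = materialise(data)
for triple in sorted(closure - data):
    print(*triple, ".")
```

The nested loop is the naive way to do the join; it is fine for four triples and hopeless for billions, which is the point of the later slides.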
deri:Aidan ex:presented deri:FridayTalk .
deri:Aidan owl:sameAs ex:Aidan .
ex:RR2009Talk ex:presentedBy ex:Aidan .
ex:presentedBy owl:inverseOf ex:presented .
Reasoning: Make Explicit the Implicit
Query: Give me all talks presented by Aidan
ex:Aidan ex:presented ?talk .
ex:Aidan ex:presented deri:FridayTalk .
ex:Aidan ex:presented ex:RR2009Talk .
[Diagram labels: IMPLICIT / EXPLICIT]
Web Reasoning: Challenges
Scalability
– Billions or tens of billions of statements (for the moment)
– Near linear scale!!!
Noisy data
– Inconsistencies galore
– Publishing errors
– “Ontology hijacking”
Web Reasoning: Challenges
Challenges (Semantic Web Wikipedia article): Some of the challenges for the Semantic Web include vastness, vagueness, uncertainty, inconsistency and deceit. Automated reasoning systems will have to deal with all of these issues in order to deliver on the promise of the Semantic Web.
Vastness: The World Wide Web contains at least 48 billion pages as of this writing (August 2, 2009). The SNOMED CT medical terminology ontology contains 370,000 class names, and existing technology has not yet been able to eliminate all semantically duplicated terms. Any automated reasoning system will have to deal with truly huge inputs.
Vagueness: These are imprecise concepts like "young" or "tall". This arises from the vagueness of user queries, of concepts represented by content providers, of matching query terms to provider terms and of trying to combine different knowledge bases with overlapping but subtly different concepts. Fuzzy logic is the most common technique for dealing with vagueness.
Uncertainty: These are precise concepts with uncertain values. For example, a patient might present a set of symptoms which correspond to a number of different distinct diagnoses each with a different probability. Probabilistic reasoning techniques are generally employed to address uncertainty.
Inconsistency: These are logical contradictions which will inevitably arise during the development of large ontologies, and when ontologies from separate sources are combined. Deductive reasoning fails catastrophically when faced with inconsistency, because "anything follows from a contradiction". Defeasible reasoning and paraconsistent reasoning are two techniques which can be employed to deal with inconsistency.
Deceit: This is when the producer of the information is intentionally misleading the consumer of the information. Cryptography techniques are currently utilized to ameliorate this threat.
Noisy Data: Omnipotent Being
Proposition 1 Web data is noisy.
Proof: 08445a31a78661b5c746feff39a9db6e4e2cc5cf
The sha1-sum of ‘mailto:’ is a common value for foaf:mbox_sha1sum
foaf:mbox_sha1sum is an inverse-functional (uniquely identifying) property!!! Any two people who share the same value will be considered the same
Q.E.D.
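A quick way to see how that value ends up in so many profiles: if an exporter hashes the mailbox URI even when the email field is blank, every such profile gets the hash of the bare ‘mailto:’ prefix. The sketch below assumes that behaviour; the helper name mbox_sha1sum and the two profiles are made up for illustration.

```python
import hashlib

SLIDE_VALUE = "08445a31a78661b5c746feff39a9db6e4e2cc5cf"  # value quoted on the slide


def mbox_sha1sum(email):
    """Hash a mailbox as the sha1 of its 'mailto:' URI."""
    return hashlib.sha1(("mailto:" + email).encode("utf-8")).hexdigest()


# Two unrelated (hypothetical) profiles whose exporter left the email blank.
alice = mbox_sha1sum("")
bob = mbox_sha1sum("")

print(alice == bob)          # both collapse to one "uniquely identifying" value
print(alice == SLIDE_VALUE)  # compare against the hash quoted on the slide
```

Under inverse-functional semantics, alice and bob are now the same person, along with everybody else who left the field empty.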
More Proof:
From http://www.eiao.net/rdf/1.0
<owl:Property rdf:about="http://www.w3.org/1999/02/22-rdf-syntax-ns#type">
<rdfs:label xml:lang="en">type</rdfs:label>
<rdfs:comment xml:lang="en">Type of resource</rdfs:comment>
<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#testRun"/>
<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#pageSurvey"/>
<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#siteSurvey"/>
<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#scenario"/>
<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#rangeLocation"/>
<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#startPointer"/>
<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#endPointer"/>
<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#header"/>
<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#runs"/>
</owl:Property>
Ontology hijacking!!
Noisy Data: Redefining Everything …and home in time for tea
The Web… forecast is for muck
(Briefly) Why use a rule-based approach?
…as opposed to a Description Logics based approach
Massive A-Box (i.e., instance data)
Inconsistencies galore
Publishing errors / Messy data
Popular Web ontologies are fairly inexpressive
Web Reasoning: Use Rules!
Forward Chaining materialisation:
Avoid runtime expense of backward-chaining
– Users taught impatience by Google
Pre-compute answers for quick retrieval
Web-scale systems should be scalable!
– More data = more disk space AND/OR more machines
Web Reasoning: Forward Chaining!
“Standard”
– RDFS
– OWL 2 RL (W3C Rec: 27 Oct. 2009)
“Non-standard”
– DLP
– pD* (OWL Horst)
– OWL⁻
OWL 2 RL: first standard OWL rule-expressible “fragment”!
– More inclusive than previous non-standard OWL rule fragments
– Includes RDFS rules
– Includes rule support for new OWL 2 constructs
  …although I don’t know of any OWL 2 data on the Web
What rules?
Okay, so let’s do forward-chaining OWL 2 RL on billions of triples collected from the Web…
foaf:mbox_sha1sum a owl:InverseFunctionalProperty .
?x foaf:mbox_sha1sum 08445a31a78661b5c746feff39a9db6e4e2cc5cf .
OWL 2 RL rule prp-ifp: ?p a owl:InverseFunctionalProperty . ?x1 ?p ?z . ?x2 ?p ?z . ⇒ ?x1 owl:sameAs ?x2 .
10^6 ?x1/?x2 bindings in body
10^12 inferred pair-wise and reflexive owl:sameAs statements
…or in simpler terms:
pow!
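The size of the “pow!” is just the square of the number of subjects sharing the value. A small sketch, with a made-up helper prp_ifp and made-up subject names, that materialises the pairs for four subjects and only counts them for a million:

```python
from itertools import product


def prp_ifp(subjects):
    """All owl:sameAs pairs for subjects sharing one inverse-functional value."""
    return {(x1, "owl:sameAs", x2) for x1, x2 in product(subjects, repeat=2)}


small = prp_ifp([f"ex:person{i}" for i in range(4)])
print(len(small))  # 16 statements from 4 subjects, reflexive pairs included

# For a million subjects, count the pairs rather than materialise them.
n = 10**6
print(n * n)  # 1000000000000
```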
Okay, so let’s do forward-chaining OWL 2 RL on billions of triples collected from the Web…
OWL 2 RL rule eq-ref: ?s ?p ?o . ⇒ ?s owl:sameAs ?s . ?p owl:sameAs ?p . ?o owl:sameAs ?o .
Adds |T| triples, where T is the set of RDF terms in the data
Could be easily supported by backward-chaining/query rewriting
Boring
SAOR: Scalable Authoritative OWL Reasoner
Goals:
Scalability
– Separate T-Box (schema) data: incomplete reasoning!
– Reduced output: incomplete reasoning!
Web tolerance
– Consider provenance of Web data: incomplete reasoning!
Scalable Reasoning: In-mem T-Box
Main optimisation: Store T-Box in memory
T-Box: (loosely) data describing classes and properties
– By far the most commonly accessed segment of data for reasoning
– Quite small (1-2%)
e.g. from a 100M statement Web crawl
– A-Box: 3,753,791 × ?s foaf:name ?o .
– vs. T-Box: <20 × foaf:name ?p ?o . + ?s ?p foaf:name .
Scan 1: Scan input data, separate T-Box statements, load T-Box statements into memory
Scan 2: Scan all on-disk data, join with in-memory T-Box.
Scalable Reasoning: Scans
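The two scans can be sketched roughly as follows. Several things here are assumptions made for the example: a statement counts as T-Box if its predicate is one of four schema terms, the “disk” is an in-memory text buffer with one triple per line, and the sample triples and function names are invented.

```python
import io

# Assumption: a statement is terminological if its predicate is a schema term.
TBOX_PREDICATES = {"rdfs:domain", "rdfs:range", "rdfs:subClassOf", "owl:inverseOf"}

# Stand-in for an on-disk file: one whitespace-separated triple per line.
DATA = """\
ex:presented owl:inverseOf ex:presentedBy
ex:presented rdfs:domain foaf:Person
foaf:Person rdfs:subClassOf foaf:Agent
ex:me ex:presented ex:FridayTalk
ex:you ex:presented ex:MondayTalk
"""


def read_triples(source):
    """Stream triples from a file-like object, one line at a time."""
    source.seek(0)
    for line in source:
        if line.strip():
            yield tuple(line.split())


def scan_tbox(source):
    """Scan 1: pull the schema statements into an in-memory set."""
    return {t for t in read_triples(source) if t[1] in TBOX_PREDICATES}


def scan_abox(source, tbox):
    """Scan 2: stream instance data, pairing each triple with matching schema."""
    by_term = {}
    for s, p, o in tbox:
        by_term.setdefault(s, []).append((p, o))
    for s, p, o in read_triples(source):
        if p not in TBOX_PREDICATES:
            yield (s, p, o), by_term.get(p, [])


disk = io.StringIO(DATA)
tbox = scan_tbox(disk)
joined = list(scan_abox(disk, tbox))
print(len(tbox), len(joined))  # 3 schema triples, 2 instance triples
```

Only the small schema set is held in memory; the instance data is read twice but never loaded.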
[Diagram: on-disk A-Box, in-memory T-Box, on-disk output]
ON-DISK A-BOX
... ex:me ex:presented ex:FridayTalk . ...
IN-MEM T-BOX (terms shown: ex:presented, ex:presentedBy, owl:inverseOf, foaf:Person, rdfs:domain, foaf:Agent, rdfs:subClassOf)
ON-DISK OUTPUT
... ex:FridayTalk ex:presentedBy ex:me .
ex:me rdf:type foaf:Person .
ex:me rdf:type foaf:Agent . ...
Execution of three rules:
OWL 2 RL rule prp-inv1: ?p1 owl:inverseOf ?p2 . ?x ?p1 ?y . ⇒ ?y ?p2 ?x .
OWL 2 RL rule prp-dom: ?p rdfs:domain ?c . ?x ?p ?y . ⇒ ?x a ?c .
OWL 2 RL rule cax-sco: ?c1 rdfs:subClassOf ?c2 . ?x a ?c1 . ⇒ ?x a ?c2 .
Scalable Reasoning: No A-Box Joins
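A rough sketch of how those three rules run with no A-Box joins: each instance triple is handled in isolation, and only the in-memory T-Box is consulted. The T-Box edges below are a guess at what the diagram depicts (chosen because they yield the three output triples shown), and the function names are made up.

```python
# Assumed T-Box, indexed per rule; the edges are a guess at the slide's diagram.
INVERSE_OF = {"ex:presented": {"ex:presentedBy"}}
DOMAIN = {"ex:presented": {"foaf:Person"}}
SUBCLASS_OF = {"foaf:Person": {"foaf:Agent"}}


def infer_from(triple):
    """Apply prp-inv1, prp-dom and cax-sco to one triple and its consequences."""
    seen = set()
    todo = [triple]
    while todo:
        s, p, o = todo.pop()
        new = set()
        for p2 in INVERSE_OF.get(p, ()):
            new.add((o, p2, s))               # prp-inv1
        for c in DOMAIN.get(p, ()):
            new.add((s, "rdf:type", c))       # prp-dom
        if p == "rdf:type":
            for c2 in SUBCLASS_OF.get(o, ()):
                new.add((s, "rdf:type", c2))  # cax-sco
        for t in new - seen:
            if t != triple:
                seen.add(t)
                todo.append(t)  # inferred triples are fed back through the rules
    return seen


def reason(abox):
    """Stream the A-Box once; no triple is ever matched against another."""
    for triple in abox:
        yield from infer_from(triple)


output = set(reason([("ex:me", "ex:presented", "ex:FridayTalk")]))
for t in sorted(output):
    print(*t, ".")
```

Because the work for one triple never depends on any other instance triple, the A-Box can be read sequentially from disk.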
However: some rules do require A-Box joins
?p a owl:TransitiveProperty . ?x ?p ?y . ?y ?p ?z . ⇒ ?x ?p ?z .
Difficult to engineer a scalable solution (which reaches a fixpoint)
No A-Box joins for SAOR reasoning over 1B statements
~99% of inferences over Web data possible without A-Box joins
48/76 OWL 2 RL rules don’t require A-Box joins
Side note: No RDFS rule requires A-Box joins
And these rules cover most of what current Web ontologies use
Scalable Reasoning: A-Box joins?
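To see why the transitivity rule is harder, here is a naive sketch for a single transitive property. Every round has to match instance triples against other instance triples, and rounds repeat until a fixpoint. The function name and the chain of ex: terms are invented for the example.

```python
def transitive_closure(pairs):
    """Fixpoint of the transitivity rule for one property, given its (x, y) pairs."""
    closure = set(pairs)
    rounds = 0
    while True:
        # A-Box join: the object of one triple must equal the subject of another.
        new = {(x, z) for (x, y1) in closure for (y2, z) in closure if y1 == y2}
        new -= closure
        if not new:
            return closure, rounds
        closure |= new
        rounds += 1


chain = [("ex:a", "ex:b"), ("ex:b", "ex:c"), ("ex:c", "ex:d"), ("ex:d", "ex:e")]
closure, rounds = transitive_closure(chain)
print(len(closure), rounds)
```

Unlike the per-triple rules, this needs the whole extension of the property to be available for matching, and the output from one round becomes input to the next.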
T-Box only!
Document D authoritative for class/property X iff:
– X is a blank-node
– OR De-referenced URI of X coincides with or redirects to D
FOAF spec authoritative for foaf:Person ✓
MY spec not authoritative for foaf:Person ✘
Only allow extension in authoritative documents
– my:Person rdfs:subClassOf foaf:Person . (MY spec) ✓
BUT: Reduce obscure memberships
– foaf:Person rdfs:subClassOf my:Person . (MY spec) ✘
ALSO: Protect specifications
– foaf:mbox rdf:type owl:SymmetricProperty . (MY spec) ✘
Similarly for other T-Box statements.
In-memory T-Box stores authoritative information for rule execution
Greatly reduces output size!!!
Compatible with FOAF, SIOC, DC (common Web vocabulary etiquette)
Authoritative Reasoning
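The authority check can be sketched along these lines. The redirect table is hypothetical and stands in for real HTTP lookups, the MY spec lives at an example.org address invented for the illustration, and only the rdfs:subClassOf case from the slide is covered.

```python
from urllib.parse import urldefrag

# Hypothetical redirect table standing in for real HTTP lookups.
REDIRECTS = {"http://xmlns.com/foaf/0.1/Person": "http://xmlns.com/foaf/spec/"}
FOAF_SPEC = "http://xmlns.com/foaf/spec/"
MY_SPEC = "http://example.org/my/spec"  # made-up document location

FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
MY_PERSON = MY_SPEC + "#Person"


def dereference(uri):
    """Strip the fragment, then follow any recorded redirects."""
    doc = urldefrag(uri).url
    seen = set()
    while doc in REDIRECTS and doc not in seen:
        seen.add(doc)
        doc = REDIRECTS[doc]
    return doc


def is_authoritative(document, term):
    """Blank nodes are always local; URIs must dereference to the document."""
    return term.startswith("_:") or dereference(term) == document


def accept_subclass_axiom(document, triple):
    """Keep an rdfs:subClassOf axiom only if the document owns the subclass."""
    s, p, o = triple
    return p == "rdfs:subClassOf" and is_authoritative(document, s)


print(accept_subclass_axiom(MY_SPEC, (MY_PERSON, "rdfs:subClassOf", FOAF_PERSON)))
print(accept_subclass_axiom(MY_SPEC, (FOAF_PERSON, "rdfs:subClassOf", MY_PERSON)))
```

The first axiom extends FOAF from a document that owns my:Person and is kept; the second tries to redefine foaf:Person from outside and is dropped.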
More Proof:
From http://www.eiao.net/rdf/1.0
<owl:Property rdf:about="http://www.w3.org/1999/02/22-rdf-syntax-ns#type">
<rdfs:label xml:lang="en">type</rdfs:label>
<rdfs:comment xml:lang="en">Type of resource</rdfs:comment>
<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#testRun"/>
<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#pageSurvey"/>
<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#siteSurvey"/>
<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#scenario"/>
<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#rangeLocation"/>
<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#startPointer"/>
<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#endPointer"/>
<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#header"/>
<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#runs"/>
</owl:Property>
Ontology hijacking!!
Noisy Data: Redefining Everything …revisited
Not authoritative!!!!
Distributed Reasoning
More recently performed reasoning over a cluster of commodity hardware
Ran the “easy” OWL 2 RL rules (no A-Box joins)
Duplicate the T-Box to all machines… A-Box can be arbitrarily distributed…
Authoritative (of course)
Eight machines, 4GB main memory, 2.2 GHz
1.192b input statements crawled last month, pre-distributed over the machines
Reasoning in 113 minutes
– Extract T-Box: 16 mins
– Aggregate, perform authoritative analysis, broadcast T-Box: 14 mins
– Reasoning over A-Box: 83 mins
Output 570m inferences
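Why the A-Box can be split arbitrarily: with no A-Box joins, each machine only needs its own slice plus a copy of the T-Box, so the union of the per-machine outputs equals the single-machine output. A simulated sketch with an invented T-Box, invented data and plain loops in place of real machines:

```python
import random

# Assumed T-Box, copied to every (simulated) machine.
TBOX = {
    "inverse": {"ex:presented": "ex:presentedBy"},
    "domain": {"ex:presented": "foaf:Person"},
}


def reason_locally(tbox, partition):
    """One machine's work: join its A-Box slice with its T-Box copy only."""
    out = set()
    for s, p, o in partition:
        if p in tbox["inverse"]:
            out.add((o, tbox["inverse"][p], s))
        if p in tbox["domain"]:
            out.add((s, "rdf:type", tbox["domain"][p]))
    return out


def distribute(abox, machines, seed):
    """Assign each triple to an arbitrary machine."""
    rng = random.Random(seed)
    slices = [[] for _ in range(machines)]
    for triple in abox:
        slices[rng.randrange(machines)].append(triple)
    return slices


abox = [(f"ex:person{i}", "ex:presented", f"ex:talk{i}") for i in range(100)]
single = reason_locally(TBOX, abox)
clustered = set()
for part in distribute(abox, machines=8, seed=1):
    clustered |= reason_locally(TBOX, part)
print(single == clustered)
```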
ex:RR2009Talk ex:presentedBy ex:Aidan .
ex:presentedBy owl:inverseOf ex:presented .
…and back again
Query: Give me all talks presented by Aidan
ex:Aidan ex:presented ?talk .
[Diagram labels: IMPLICIT / EXPLICIT]
ex:Aidan ex:presented ex:RR2009Talk .
ex:Aidan ex:presented ex:RR2009Talk .
deri:Aidan ex:presented deri:FridayTalk .
…but what about…
Query: Give me all talks presented by Aidan
ex:Aidan ex:presented ?talk .
[Diagram labels: IMPLICIT / EXPLICIT]
ex:Aidan ex:presented deri:FridayTalk .
ex:Aidan ex:presented ex:FridayTalk .
…and…
Query: Give me all talks presented by Aidan
ex:Aidan ex:presented ?talk .
[Diagram labels: IMPLICIT / EXPLICIT]
Equality Reasoning: Standard (e.g. OWL 2 RL) rules for OWL equality not great:
– Saw noisy data earlier
– Quadratic explosion of inferences for equivalences, with high duplication
– Do not solve “synonymous duplicates” query answering problem
Entity Consolidation: Instead use “canonical” identifiers: one term to represent set of equivalent individuals
Equality Reasoning/Entity Consolidation
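Consolidation with canonical identifiers can be sketched using a union-find structure. This is one possible way to do it, not necessarily the method used in the work presented. The choice of the lexicographically smallest term as canonical is an assumption made here, as is the owl:sameAs link between the two FridayTalk identifiers.

```python
def consolidate(triples):
    """Rewrite subjects and objects to one canonical term per owl:sameAs set."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == "owl:sameAs":
            a, b = find(s), find(o)
            if a != b:
                # Assumption: the lexicographically smallest term is canonical.
                parent[max(a, b)] = min(a, b)

    return {
        (find(s), p, find(o))
        for s, p, o in triples
        if p != "owl:sameAs"
    }


triples = [
    ("deri:Aidan", "owl:sameAs", "ex:Aidan"),
    ("deri:FridayTalk", "owl:sameAs", "ex:FridayTalk"),  # assumed link
    ("ex:Aidan", "ex:presented", "deri:FridayTalk"),
    ("ex:Aidan", "ex:presented", "ex:FridayTalk"),
    ("deri:Aidan", "ex:presented", "ex:RR2009Talk"),
]
result = consolidate(triples)
for t in sorted(result):
    print(*t, ".")
```

The two FridayTalk statements collapse into one, so the query no longer returns synonymous duplicates, and the output grows with the size of the data rather than quadratically with the size of each equivalence set.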
Need owl:sameAs relations
– Explicit owl:sameAs statements: good precision / poor recall
– Inverse-functional properties: reasonable recall / poor precision
  Publishers not aware of inverse-functional semantics of such properties
– Other rules: very poor recall / ? high precision
Been there, done that…
Most recently, consolidation using explicit owl:sameAs statements
– 8 machines (as before)
– 61 mins for 1.193bn statements
Probabilistic/statistical approach to equality reasoning
Identify resources with high probability of being equivalent:
– use semantics in the data (equality reasoning);
– use statistics derived from the data; e.g.:
  – Two people have same birthday, name and share co-authors…
Identify resources with high probability of being different:
– use semantics in the data (inconsistencies?);
– again use statistics derived from the data; e.g.:
  – Two people have different dates-of-birth and names
Perform “fuzzy” reasoning
Leverage links-based analysis from input-data to give inferences “scores” of trustworthiness
Depending on results, could be used to, e.g.:
– identify “interesting” inferences for partial-materialisation;
– identify “trustworthy” inferences for bypassing noise in Web data.
Must be domain-agnostic/scalable/distributable/give good results for noisy and heterogeneous Web data
Future Work
1. Web data is messy
2. Reasoning over Web data is difficult
3. Need incomplete, albeit inclusive reasoning
4. Rule execution optimisations possible through special treatment of terminological data
5. Need to consider the provenance of data
6. OWL 2 RL not immediately suitable to application over Web data
7. Incomplete OWL 2 RL support can be offered using existing technologies, in a scalable and tolerant way
8. Busy year ahead
Conclusion