Copyright 2009 Digital Enterprise Research Institute. All rights reserved.
Digital Enterprise Research Institute www.deri.ie
Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora
Aidan Hogan
PhD Viva
Cold Open
Figure 1: Web of Data
explicit data
implicit data
Topic of thesis:
How can consumers tap into the implicit data?
PRELUDE
The Area…
The Problem…
The Hypothesis…
The Area…
…Linked Data / Linking Open Data
Bottom-up Approach to Semantic Web
Individual publishers should:
1. Use URIs to name things (not just documents)
2. Use HTTP URIs that can be looked up
3. Return information in a common structured data model (RDF)
4. Use external URIs in your data so as to link to related data
…the micro… Linked Data Principles
…the macro… A Web of Data
Images from: http://richard.cyganiak.de/2007/10/lod/; Cyganiak, Jentzsch
Snapshots: August 2007, November 2007, February 2008, March 2008, September 2008, March 2009, July 2009, September 2010
…so what’s The Problem?…
…heterogeneity
Take Query Answering…
SPARQL endpoints over Web data such as YARS2, Virtuoso, FactForge, etc.
Search engines such as SWSE, Sindice, Falcons, Swoogle, Watson, etc.
Take Query Answering…
Gimme webpages relating to Tim Berners-Lee
timbl:i foaf:page ?pages .
Heterogeneity in terminology…
webpage: properties
foaf:page
foaf:homepage
foaf:isPrimaryTopicOf
foaf:weblog
doap:homepage
foaf:topic
foaf:primaryTopic
mo:musicBrainz
mo:myspace
…
= rdfs:subPropertyOf
= owl:inverseOf
Linked Data, RDFS and OWL: Linked Vocabularies
Image from http://blog.dbtune.org/public/.081005_lod_constellation_m.jpg; Giasson, Bergman
Heterogeneity in naming…
Tim Berners-Lee: URIs
…
timbl:i
dblp:100007
identica:45563
adv:timbl
fb:en.tim_berners-lee
db:Tim-Berners_Lee
= owl:sameAs
Returning to our Query…
Gimme webpages relating to Tim Berners-Lee
timbl:i foaf:page ?pages .
Properties: foaf:page, foaf:homepage, foaf:isPrimaryTopicOf, doap:homepage, foaf:topic, foaf:primaryTopic, mo:myspace
Identifiers: timbl:i, dblp:100007, identica:45563, adv:timbl, fb:en.tim_berners-lee, db:Tim-Berners_Lee
... 7 x 6 = 42 possible patterns
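To make the 7 x 6 count concrete, here is a minimal Python sketch that enumerates the triple patterns a consumer would have to issue without any reasoning. The property and identifier lists are copied from the slide. Treating foaf:topic and foaf:primaryTopic as pointing from the page to the person is an assumption based on the FOAF vocabulary, and `enumerate_patterns` is a hypothetical helper name.

```python
from itertools import product

# Properties and identifiers as listed on the slide.
PROPERTIES = [
    "foaf:page", "foaf:homepage", "foaf:isPrimaryTopicOf", "doap:homepage",
    "foaf:topic", "foaf:primaryTopic", "mo:myspace",
]
IDENTIFIERS = [
    "timbl:i", "dblp:100007", "identica:45563",
    "adv:timbl", "fb:en.tim_berners-lee", "db:Tim-Berners_Lee",
]
# Assumption: these two properties point from the page to the person,
# so the variable goes in the subject position.
INVERSE_DIRECTION = {"foaf:topic", "foaf:primaryTopic"}


def enumerate_patterns(properties, identifiers):
    """Return one triple pattern per (property, identifier) pair."""
    patterns = []
    for prop, ident in product(properties, identifiers):
        if prop in INVERSE_DIRECTION:
            patterns.append(f"?pages {prop} {ident} .")
        else:
            patterns.append(f"{ident} {prop} ?pages .")
    return patterns


patterns = enumerate_patterns(PROPERTIES, IDENTIFIERS)
print(len(patterns))  # 42
```

Each extra equivalent property or identifier multiplies the number of patterns, which is why the query gets unwieldy so quickly.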
…The Hypothesis?…
…we can use the OWL and RDFS inherent in Linked Data to attenuate the problem of heterogeneity for consumers
Scenario…
…take a static corpus crawled from Linked Data…
…about a billion triples or so…
…and tackle the problem(s) of heterogeneity
…(without domain-specific “cheats”).
Setup…
hardware
…9 machines
…~6 years old… 4 GB RAM, 2.2 GHz, Ethernet
Setup…
corpus
…crawl (9 machines: 52.5 hr)
…took random seed URIs from Billion Triple Challenge 2009 dataset
…crawled ~4 million RDF/XML documents
…from arbitrary domains (e.g., dbpedia.org)
– Only found 785 domains providing RDF/XML
…1.118 billion quadruples
…947 million unique triples
Setup…
ranking (9 machines: 30.3 hr)
…applied PageRank over interlinked source docs.
– …source A links to source B if A uses a URI which “dereferences” (points) to B
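As a rough illustration of the ranking step, the sketch below builds the source-level link graph from the dereferencing relation and runs a plain power-iteration PageRank over it. This is not the distributed implementation used for the corpus. The damping factor, iteration count, handling of self-links, and the toy inputs are all assumptions, and `build_links` and `pagerank` are hypothetical names.

```python
def build_links(uris_used, dereferences_to):
    """Source A links to source B if A uses a URI that dereferences to B."""
    links = {}
    for doc, uris in uris_used.items():
        targets = {dereferences_to[u] for u in uris if u in dereferences_to}
        targets.discard(doc)  # assumption: self-links are ignored
        links[doc] = targets
    return links


def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict: source -> set of linked sources."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by sources without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy input: two documents that both use a FOAF term.
uris_used = {
    "http://example.org/a": {"foaf:Person"},
    "http://example.org/b": {"foaf:Person"},
    "http://xmlns.com/foaf/spec/": set(),
}
dereferences_to = {"foaf:Person": "http://xmlns.com/foaf/spec/"}
ranks = pagerank(build_links(uris_used, dereferences_to))
```

In this toy graph the vocabulary document ends up with the highest rank, since both data documents link to it by using one of its terms.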
Challenges…
…what (OWL) reasoning is feasible for Linked Data?
Linked Data Reasoning: Challenges
CORE
1. Reasoning…
2. Annotated Reasoning…
3. Consolidation…
1. Reasoning
High Level Approach…
…apply a subset of OWL 2 RL/RDF rules over the data
Forward Chaining materialisation:
Avoid runtime expense of backward-chaining
– Users taught impatience by Google
Pre-compute answers for quick retrieval
Web-scale systems should be scalable!
– More data = more disk-space/machines
Web Reasoning: Forward Chaining!
Scalable Authoritative OWL Reasoner
Our Approach
Our Approach…
INPUT:
• Flat file of triples (quads)
OUTPUT:
• Flat file of (partial) inferred triples (quads)
Scalable Reasoning: In-mem T-Box
Main optimisation: Store T-Box in memory
T-Box: (loosely) data describing classes and properties.
Aka. schemata/vocabularies/ontologies/terminologies. E.g.,
– foaf:topic owl:inverseOf foaf:page .
– sioc:UserAccount rdfs:subClassOf foaf:OnlineAccount .
Most commonly accessed data for reasoning
Quite small (~0.1% for our Linked Data corpus)
High selectivity (if you prefer)
A-Box: Lots ?s foaf:page ?o .
vs. T-Box: Few foaf:page ?p ?o . + ?s ?p foaf:page .
Scan 1: Scan input data and separate out T-Box statements
Load T-Box statements into memory
Do T-Box level reasoning if required (semi-naïve)
Scan 2: Scan all on-disk data, join with in-memory T-Box.
Scalable Reasoning: Two Scans
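The two scans can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the SAOR implementation: the set of T-Box predicates is an assumed subset, the T-Box-level (semi-naïve) closure is omitted, only a subclass join is shown in the second scan, and the input is an in-memory list standing in for a flat file read twice. The names `scan_tbox` and `scan_abox` are hypothetical.

```python
# Assumption: a small subset of the predicates that mark T-Box statements.
TBOX_PREDICATES = {
    "rdfs:subClassOf", "rdfs:subPropertyOf", "rdfs:domain",
    "rdfs:range", "owl:inverseOf",
}


def scan_tbox(triples):
    """Scan 1: keep only T-Box statements, indexed by predicate then subject."""
    tbox = {}
    for s, p, o in triples:
        if p in TBOX_PREDICATES:
            tbox.setdefault(p, {}).setdefault(s, set()).add(o)
    return tbox


def scan_abox(triples, tbox):
    """Scan 2: stream every triple once and join it with the in-memory T-Box."""
    superclasses = tbox.get("rdfs:subClassOf", {})
    for s, p, o in triples:
        if p == "rdf:type":
            # One dictionary lookup per triple; no triple-to-triple A-Box join.
            for parent in superclasses.get(o, ()):
                yield (s, "rdf:type", parent)


data = [
    ("sioc:UserAccount", "rdfs:subClassOf", "foaf:OnlineAccount"),
    ("ex:acct", "rdf:type", "sioc:UserAccount"),  # ex:acct is a made-up instance
]
inferred = list(scan_abox(data, scan_tbox(data)))
```

Because the T-Box lives in memory, the second scan never needs to look at more than one on-disk triple at a time.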
IN-MEM T-BOX
foaf:homepage rdfs:range foaf:Document .
foaf:homepage rdfs:subPropertyOf foaf:page .
foaf:page owl:inverseOf foaf:topic .
ON-DISK A-BOX
... ex:me foaf:homepage ex:hp . ...
ON-DISK OUTPUT
... ex:hp rdf:type foaf:Document .
ex:me foaf:page ex:hp .
ex:hp foaf:topic ex:me . ...
Execution of three rules:
OWL 2 RL rule prp-inv1
?p1 owl:inverseOf ?p2 .
?x ?p1 ?y .
⇒ ?y ?p2 ?x .
OWL 2 RL rule prp-rng
?p rdfs:range ?c .
?x ?p ?y .
⇒ ?y a ?c .
OWL 2 RL rule prp-spo1
?p1 rdfs:subPropertyOf ?p2 .
?x ?p1 ?y .
⇒ ?x ?p2 ?y .
Scalable Reasoning: No A-Box Joins
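The execution of the three rules on the example triple can be traced with a short Python sketch. The T-Box contents below are an assumption, chosen so that the three rules reproduce the output shown on the slide. The dictionary layout and the `infer` function are hypothetical, not taken from the actual reasoner.

```python
# Assumed in-memory T-Box, keyed by rule-relevant relation.
tbox = {
    "inverseOf": {"foaf:page": {"foaf:topic"}},
    "range": {"foaf:homepage": {"foaf:Document"}},
    "subPropertyOf": {"foaf:homepage": {"foaf:page"}},
}


def infer(triple, tbox):
    """Apply prp-inv1, prp-rng and prp-spo1 to one A-Box triple until fixpoint."""
    seen = {triple}
    queue = [triple]
    while queue:
        s, p, o = queue.pop()
        derived = []
        for p2 in tbox["inverseOf"].get(p, ()):
            derived.append((o, p2, s))            # prp-inv1
        for c in tbox["range"].get(p, ()):
            derived.append((o, "rdf:type", c))    # prp-rng
        for p2 in tbox["subPropertyOf"].get(p, ()):
            derived.append((s, p2, o))            # prp-spo1
        for t in derived:
            if t not in seen:
                seen.add(t)
                queue.append(t)
    seen.discard(triple)  # return only the novel inferences
    return seen


output = infer(("ex:me", "foaf:homepage", "ex:hp"), tbox)
```

Every inference here is derived from a single A-Box triple plus the T-Box, which is what allows the A-Box to be processed as a stream.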
However: some rules do require A-Box joins
?p a owl:TransitiveProperty . ?x ?p ?y . ?y ?p ?z .
⇒ ?x ?p ?z .
Difficult to engineer a scalable solution (which reaches a fixpoint) for Linked Data(?)
Can lead to quadratic inferences
A lot of useful reasoning still possible without A-Box joins…
Scalable Reasoning: A-Box joins?
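A small worked example shows where the quadratic growth comes from. The sketch below computes a naive fixpoint of the transitivity rule over a chain of n nodes. The chain input and the function names are illustrative assumptions, and the naive loop is deliberately unoptimised.

```python
def transitive_closure(pairs):
    """Naive fixpoint for: ?x ?p ?y . ?y ?p ?z . => ?x ?p ?z ."""
    closure = set(pairs)
    while True:
        # The A-Box join: match the object of one triple to the subject of another.
        new = {(x, z) for (x, y) in closure for (y2, z) in closure if y == y2}
        if new <= closure:
            return closure
        closure |= new


def chain(n):
    """n nodes linked in a line: 0 -> 1 -> ... -> n-1."""
    return [(i, i + 1) for i in range(n - 1)]


for n in (10, 20, 40):
    explicit = chain(n)
    inferred = transitive_closure(explicit) - set(explicit)
    print(n, len(explicit), len(inferred))
```

For a chain of n nodes there are n - 1 explicit triples but (n - 1)(n - 2) / 2 inferred ones, so doubling the input roughly quadruples the output.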
Consider source of T-Box (schemata) data
Class/property URIs dereference to their authoritative document
FOAF spec authoritative for foaf:Person ✓
MY spec not authoritative for foaf:Person ✘
Allow “extension” in authoritative documents
my:Person rdfs:subClassOf foaf:Person . (MY spec) ✓
BUT: Reduce obscure memberships
foaf:Person rdfs:subClassOf my:Person . (MY spec) ✘
ALSO: Protect specifications
foaf:knows a owl:SymmetricProperty . (MY spec) ✘
Authoritative Reasoning
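The three examples on the slide can be reproduced with a deliberately simplified check: accept a T-Box axiom only if the document serving it is the one that the axiom's subject term dereferences to. This subject-only test is an assumption made for illustration, and the precise per-rule definition in the thesis may differ. The `dereferences_to` table and both function names are hypothetical.

```python
def is_authoritative(document, term, dereferences_to):
    """A document is authoritative for a term if the term's URI dereferences to it."""
    return dereferences_to.get(term) == document


def accept_axiom(axiom, document, dereferences_to):
    """Simplified check: the serving document must be authoritative for the subject."""
    subject, _predicate, _object = axiom
    return is_authoritative(document, subject, dereferences_to)


# Hypothetical dereferencing table for the slide's examples.
dereferences_to = {
    "foaf:Person": "FOAF spec",
    "foaf:knows": "FOAF spec",
    "my:Person": "MY spec",
}

extension = accept_axiom(
    ("my:Person", "rdfs:subClassOf", "foaf:Person"), "MY spec", dereferences_to)
obscure = accept_axiom(
    ("foaf:Person", "rdfs:subClassOf", "my:Person"), "MY spec", dereferences_to)
hijack = accept_axiom(
    ("foaf:knows", "rdf:type", "owl:SymmetricProperty"), "MY spec", dereferences_to)
```

Under this check the extension is accepted, while the two axioms that would change the meaning of FOAF terms from a third-party document are rejected.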
Survey of terminology: counts
Looked at use of RDFS and OWL in our corpus
1. rdfs:subClassOf ~307k axioms ~51k docs ✓
2. owl:equivalentClass ~23k axioms ~23k docs ✓
3. rdfs:domain ~16k axioms 623 docs ✓
4. rdfs:range ~14k axioms 717 docs ✓
5. owl:unionOf ~13k axioms 109 docs ✓
6. rdfs:subPropertyOf ~9k axioms 227 docs ✓
7. owl:inverseOf ~1k axioms 98 docs ✓
8. owl:disjointWith 917 axioms 60 docs ✘
9. owl:someValuesFrom 465 axioms 48 docs ✓
10. owl:intersectionOf 325 axioms 12 docs ✓/✘
…
...summary please?
Our “cheap rules” cover 99% of RDFS/OWL axioms in our corpus
82.3% of such axioms have an authoritative version
- 78.3% of all non-authoritative axioms come from one doc
- (without which, ~96% of axioms have auth. version)
9.1% of documents have non-authoritative axioms
Authoritative reasoning for cheap rules fully supports 90.6% of the “vocabulary documents”
Survey of terminology: counts
Survey of terminology: ranks
Looked at use of RDFS and OWL wrt. ranks of documents…
1. rdfs:subClassOf 0.295 ✓
2. rdfs:range 0.294 ✓
3. rdfs:domain 0.292 ✓
4. rdfs:subPropertyOf 0.090 ✓
5. owl:FunctionalProperty 0.063 ✘
6. owl:disjointWith 0.049 ✘
7. owl:inverseOf 0.047 ✓
8. owl:unionOf 0.035 ✓
9. owl:SymmetricProperty 0.033 ✓
10. owl:equivalentClass 0.021 ✓
11. owl:InverseFunctionalProperty 0.030 ✘
12. owl:equivalentProperty 0.030 ✓
13. owl:someValuesFrom 0.030 ✓/✘
...summary please?
Adding up the ranks of all vocabularies our rules fully support gives 77% of the total rank of all vocabularies
Adding up the ranks of all vocabularies our authoritative rules fully support gives 70% of the total rank of all vocabularies
The highest ranked document our rules do not fully support was 5th overall: SKOS
The highest ranked document with non-authoritative axioms was 7th overall: FOAF
Survey of terminology: ranks
...let’s stick to the simple rules
Scalable Distributed Reasoning
[Diagram: five machines, each holding a different partition of the A-Box]
EXTRACT T-BOX (on each machine)
COLLECT T-BOX (on each machine)
SAME T-BOX (on each machine):
ex:presented rdfs:domain foaf:Person .
ex:presented rdfs:range ex:Talk .
DIFF. A-BOX (on each machine):
... ex:me ex:presented ex:ThisTalk . ...
LOCAL OUTPUT (on each machine):
... ex:me rdf:type ex:Awesome . ...
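The distribution strategy in the diagram can be simulated in a single process. The sketch below is a minimal illustration, assuming only rdfs:domain and rdfs:range axioms and using plain lists as stand-ins for machines. The function names and the partition contents (other than the slide's ex:presented example) are hypothetical.

```python
TBOX_PREDICATES = {"rdfs:domain", "rdfs:range"}  # assumption: only these two


def extract_tbox(partition):
    """EXTRACT T-BOX: each machine pulls T-Box statements from its local data."""
    return {t for t in partition if t[1] in TBOX_PREDICATES}


def local_reason(partition, tbox):
    """Each machine joins its own A-Box partition with the shared T-Box."""
    domains, ranges = {}, {}
    for s, p, o in tbox:
        target = domains if p == "rdfs:domain" else ranges
        target.setdefault(s, set()).add(o)
    output = set()
    for s, p, o in partition:
        for c in domains.get(p, ()):
            output.add((s, "rdf:type", c))
        for c in ranges.get(p, ()):
            output.add((o, "rdf:type", c))
    return output


def distributed_reasoning(partitions):
    # COLLECT T-BOX: gather all local T-Boxes so every machine has the SAME T-BOX.
    shared = set()
    for partition in partitions:
        shared |= extract_tbox(partition)
    # LOCAL OUTPUT: no further communication is needed between machines.
    return [local_reason(partition, shared) for partition in partitions]


partitions = [
    [("ex:presented", "rdfs:domain", "foaf:Person"),
     ("ex:a", "ex:knows", "ex:b")],
    [("ex:presented", "rdfs:range", "ex:Talk"),
     ("ex:me", "ex:presented", "ex:ThisTalk")],
]
outputs = distributed_reasoning(partitions)
```

Because the rules need no A-Box joins, the only data exchanged between machines is the small T-Box.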
Reasoning Performance (1 machine)
Reasoning Performance: Distrib.
9 machines: Total 3.35 hours
Reasoning: Results
962 million unique/novel triples
947 million unique triples
2. Annotated Reasoning
Annotated Reasoning
Let’s try to track some meta-information during the reasoning process
Annotate input triples with information
Use annotated reasoning framework for transforming annotations on input triples into annotations on output triples
Each input triple is assigned the sum of the ranks of the documents in which it appears…
foaf:Person rdfs:subClassOf foaf:Agent 0.3 .
timbl:i rdf:type foaf:Person 0.04 .
aidan:me rdf:type foaf:Person 0.0001 .
Annotated Reasoning: ranks
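The rank assignment can be sketched as a group-by over quads. In this minimal Python illustration the document names and their rank values are made up, and counting each document once per triple is an assumption. `annotate` is a hypothetical name.

```python
from collections import defaultdict


def annotate(quads, doc_rank):
    """Sum, per triple, the ranks of the distinct documents it appears in."""
    docs_per_triple = defaultdict(set)
    for s, p, o, doc in quads:
        docs_per_triple[(s, p, o)].add(doc)
    return {
        triple: sum(doc_rank[doc] for doc in docs)
        for triple, docs in docs_per_triple.items()
    }


# Hypothetical document ranks and quads (triple + source document).
doc_rank = {"docA": 0.03, "docB": 0.01}
quads = [
    ("timbl:i", "rdf:type", "foaf:Person", "docA"),
    ("timbl:i", "rdf:type", "foaf:Person", "docB"),
    ("aidan:me", "rdf:type", "foaf:Person", "docB"),
]
annotated = annotate(quads, doc_rank)
```

A triple stated in several well-ranked documents therefore ends up with a higher annotation than one stated in a single obscure document.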
During reasoning, inferences are assigned the rank of the least-trustworthy triple involved in their “proof”
foaf:Person rdfs:subClassOf foaf:Agent 0.3 .
timbl:i rdf:type foaf:Person 0.04 .
⇒timbl:i rdf:type foaf:Agent 0.04 .
Annotated Reasoning
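The propagation of annotations can be sketched for the subclass rule (cax-sco in OWL 2 RL/RDF). Each inference takes the minimum rank found in its proof. Keeping the highest value when the same triple is derived in several ways is an assumption about what counts as the optimal annotation, and the function name is hypothetical.

```python
def annotated_subclass_inference(annotated):
    """cax-sco with annotations: an inference gets the minimum rank in its proof."""
    subclass_axioms = [
        (s, o, rank) for (s, p, o), rank in annotated.items()
        if p == "rdfs:subClassOf"
    ]
    inferred = {}
    for (s, p, o), rank in annotated.items():
        if p != "rdf:type":
            continue
        for c1, c2, axiom_rank in subclass_axioms:
            if o == c1:
                triple = (s, "rdf:type", c2)
                value = min(rank, axiom_rank)
                # Assumption: keep the best annotation if derived several ways.
                inferred[triple] = max(value, inferred.get(triple, 0.0))
    return inferred


annotated = {
    ("foaf:Person", "rdfs:subClassOf", "foaf:Agent"): 0.3,
    ("timbl:i", "rdf:type", "foaf:Person"): 0.04,
    ("aidan:me", "rdf:type", "foaf:Person"): 0.0001,
}
inferred = annotated_subclass_inference(annotated)
```

With the slide's values, timbl:i rdf:type foaf:Agent is annotated with 0.04, since the membership triple is weaker than the axiom.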
1. Can do top-k materialisation
Only give me inferences above a certain rank threshold
Only give me top-k inferences
2. Can fix inconsistencies in the data…
…aka. logical contradictions
…interpreting the rank values as denoting “trustworthy” data
Why?
foaf:Person owl:disjointWith foaf:Document .
Inconsistencies: aka. Contradictions
?c1 owl:disjointWith ?c2 .
?x rdf:type ?c1 .
?x rdf:type ?c2 .
⇒ false
foaf:Person owl:disjointWith foaf:Document .
ex:sleepygirl rdf:type foaf:Person .
ex:sleepygirl rdf:type foaf:Document .
⇒ false
Cannot compute…
Considered two approaches:
1. Find the “consistency threshold” of the input + inferred data:
The largest rank such that all data above that rank are consistent
Unfortunately, the 22nd ranked document had an ill-typed literal, and so was inconsistent…
So we would keep the data of ~22 documents
And throw away the data of nearly four million
Fixing inconsistencies
Time for Plan B:
2. Perform a “granular” repair of the data
Remove the weakest triple causing each contradiction
foaf:Person owl:disjointWith foaf:Document 0.3 .
ex:sleepygirl rdf:type foaf:Person 0.007 .
ex:sleepygirl rdf:type foaf:Document 0.002 .
Fixing inconsistencies
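The granular repair can be sketched for the disjoint-class case. This Python illustration works over input triples only, and it ignores the fact that a full repair would also have to trace inferred triples back to their premises. The function names are hypothetical.

```python
def find_disjointness_violations(annotated):
    """Return, per contradiction, the triples that jointly cause it."""
    disjoint = [(s, o) for (s, p, o) in annotated if p == "owl:disjointWith"]
    types = {}
    for (s, p, o) in annotated:
        if p == "rdf:type":
            types.setdefault(s, set()).add(o)
    violations = []
    for c1, c2 in disjoint:
        for x, classes in types.items():
            if c1 in classes and c2 in classes:
                violations.append([
                    (c1, "owl:disjointWith", c2),
                    (x, "rdf:type", c1),
                    (x, "rdf:type", c2),
                ])
    return violations


def repair(annotated):
    """Remove the lowest-ranked triple involved in each contradiction."""
    repaired = dict(annotated)
    for violation in find_disjointness_violations(annotated):
        weakest = min(violation, key=lambda triple: annotated[triple])
        repaired.pop(weakest, None)
    return repaired


annotated = {
    ("foaf:Person", "owl:disjointWith", "foaf:Document"): 0.3,
    ("ex:sleepygirl", "rdf:type", "foaf:Person"): 0.007,
    ("ex:sleepygirl", "rdf:type", "foaf:Document"): 0.002,
}
repaired = repair(annotated)
```

With the slide's ranks, the membership in foaf:Document (0.002) is the weakest triple and is the one removed.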
~294k ill-typed datatypes
~7k members of disjoint classes
Inconsistencies found
Performance
9 machines
Annotated Reasoning: 14.6 hrs (vs. 3.35 hrs w/o annotations: need to do a distributed sort to remove non-optimal triples)
Detect/Extract Inconsistencies: 2.9 hrs
Diagnosis/Repair: 2.8 hrs
Total ~20.3 hours
3. Consolidation
Consolidation for Linked Data
Baseline Approach…
…use the explicit owl:sameAs relations given in the data…
Consolidation: Baseline
Scan the data and extract all owl:sameAs triples
timbl:i owl:sameAs identica:45563 .
dbpedia:Berners-Lee owl:sameAs identica:45563 .
Load into memory
Use a map to store equivalences:
timbl:i -> { timbl:i, identica:45563, dbpedia:Berners-Lee }
identica:45563 -> { timbl:i, identica:45563, dbpedia:Berners-Lee }
dbpedia:Berners-Lee -> { timbl:i, identica:45563, dbpedia:Berners-Lee }
For each set of equivalent identifiers, choose a canonical term
Consolidation: Baseline
timbl:i
identica:45563
dbpedia:Berners-Lee
Canonicalisation
Scan data a second time:
Rewrite identifiers to their canonical version
Skip predicates and values of rdf:type
Before:
timbl:i rdf:type foaf:Person .
identica:48404 foaf:knows identica:45563 .
dbpedia:Berners-Lee dpo:birthDate "1955-06-08"^^xsd:date .
After:
dbpedia:Berners-Lee rdf:type foaf:Person .
identica:48404 foaf:knows dbpedia:Berners-Lee .
dbpedia:Berners-Lee dpo:birthDate "1955-06-08"^^xsd:date .
Equivalent identifiers: timbl:i, identica:45563, dbpedia:Berners-Lee
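The baseline consolidation steps (extract owl:sameAs, build the equivalence map, pick a canonical term, rewrite) can be sketched as follows. A union-find structure stands in for the in-memory map, and choosing the lexicographically smallest identifier as canonical is an assumption, since the slides do not say how the canonical term is picked. Both function names are hypothetical.

```python
def build_canonical_map(same_as_pairs):
    """Map every identifier to the canonical term of its equivalence set."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Assumption: the lexicographically smallest term is canonical.
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a
    return {term: find(term) for term in parent}


def canonicalise(triples, canonical):
    """Second scan: rewrite subjects, and objects unless they are rdf:type values."""
    for s, p, o in triples:
        s = canonical.get(s, s)
        if p != "rdf:type":
            o = canonical.get(o, o)
        yield (s, p, o)  # predicates are never rewritten


same_as = [
    ("timbl:i", "identica:45563"),
    ("dbpedia:Berners-Lee", "identica:45563"),
]
canonical = build_canonical_map(same_as)
data = [
    ("timbl:i", "rdf:type", "foaf:Person"),
    ("identica:48404", "foaf:knows", "identica:45563"),
]
rewritten = list(canonicalise(data, canonical))
```

With these inputs all three identifiers map to dbpedia:Berners-Lee, matching the rewritten triples on the slide.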
Baseline Consolidation: Performance
9 machines
1. Extract owl:sameAs: 0.2 hr
2. Gather owl:sameAs: 0.1 hr
3. Canonicalise data: 0.7 hr
Total ~1.1 hours
Applied over raw input data
~12 million owl:sameAs triples
~2.2 million sets of equivalent identifiers
~5.8 million identifiers involved
~2.65 identifiers per set
~99.99% of terms were URIs
~6.25% of all URIs
Baseline Consolidation: Results
Extended Approach…
…use the owl:sameAs relations inferable through reasoning…
Infer owl:sameAs through reasoning (OWL 2 RL/RDF)
1. explicit owl:sameAs (again)
2. owl:InverseFunctionalProperty
3. owl:FunctionalProperty
4. owl:cardinality 1 / owl:maxCardinality 1
foaf:homepage a owl:InverseFunctionalProperty .
timbl:i foaf:homepage w3c:timblhomepage .
adv:timbl foaf:homepage w3c:timblhomepage .
⇒ timbl:i owl:sameAs adv:timbl .
…then apply consolidation as before
Extended Consolidation
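The inverse-functional-property case can be sketched as a group-by on (property, value). The version below runs in memory for brevity, whereas the approach described in the talk derives owl:sameAs on disk. Linking each group to its alphabetically first member is an assumption that leaves symmetry and transitivity to the consolidation step, and the function name is hypothetical.

```python
from collections import defaultdict


def same_as_from_ifp(triples, inverse_functional):
    """Subjects sharing a value for an inverse-functional property are the same."""
    subjects_by_value = defaultdict(set)
    for s, p, o in triples:
        if p in inverse_functional:
            subjects_by_value[(p, o)].add(s)
    inferred = set()
    for subjects in subjects_by_value.values():
        ordered = sorted(subjects)
        # Link every member to the first one; the equivalence closure
        # is left to the consolidation step.
        for other in ordered[1:]:
            inferred.add((ordered[0], "owl:sameAs", other))
    return inferred


triples = [
    ("timbl:i", "foaf:homepage", "w3c:timblhomepage"),
    ("adv:timbl", "foaf:homepage", "w3c:timblhomepage"),
    ("ex:other", "foaf:homepage", "ex:otherhomepage"),  # made-up, no match
]
inferred = same_as_from_ifp(triples, {"foaf:homepage"})
```

The result links adv:timbl and timbl:i, which is the same equivalence as on the slide because owl:sameAs is symmetric.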
OWL 2 RL/RDF consolidation rules require A-Box joins!
Might not be able to fit owl:sameAs index in memory (4 GB)!
⇒ Use on-disk batch-processing
Distributed sorts, scans and merge-joins
Derive owl:sameAs on-disk
Extended Consolidation: Performance
9 machines
1. Inferring owl:sameAs ~7.4 hr
2. Canonicalise data ~4.9 hr
Total ~12.3 hours (11x baseline)
~12 million explicit owl:sameAs triples (as before)
~8.7 million thru. owl:InverseFunctionalProperty
~106 thousand thru. owl:FunctionalProperty
none thru. owl:cardinality/owl:maxCardinality
~2.8 million sets of equivalent identifiers (1.31x baseline)
~14.86 million identifiers involved (2.58x baseline)
~5.8 million URIs (1.014x baseline)
Extended Consolidation: Results
CONCLUSION
timbl:i foaf:page ?pages .
timbl:i
identica:45563
dbpedia:Berners-Lee
dbpedia:Berners-Lee foaf:page ?pages .
Heterogeneity poses a significant problem for consuming Linked Data
1. Lightweight reasoning can go a long way
Simple/authoritative rules have reasonable coverage
2. Deceit/Noise ≠ End Of World
3. Inconsistency ≠ End Of World
Useful for finding noise in fact!
4. Explicit owl:sameAs vs. extended consolidation:
Extended consolidation mostly for consolidating blank-nodes from older FOAF exporters
Conclusions