data interlinking
TRANSCRIPT
Data interlinking
Similarity-based approach
Key-based interlinking
Ontology matching & data interlinking
Tools
The problem: RDF data interlinking
〈http://data.bnf.fr/12144801/edgar allan poe the gold bug/, dc:title, “The gold bug”〉
[Figure: two RDF graphs describing the same literary works. In the first graph, a Work with title "The gold bug" (lang: en) has a creator of type Writer, with firstname and lastname "E. Poe". In the second graph, Books (b) such as "The raven" are connected through orig, author and translator to Persons (a1, a2) named Baudelaire and Mallarmé. Correspondences between the two graphs are labelled ≈, ≥ and ≤.]
Jerome Euzenat, Data interlinking
Goal of the lecture
- Provide an overview of the problem of data interlinking
- Describe broad categories of solutions
- Point to useful tools for generating links
The lecture is mostly about generating links, not about finding how to generate them.
Outline
Data interlinking
Similarity-based approach
Key-based interlinking
Ontology matching & data interlinking
Tools
Data interlinking
I use (with the same meaning):
- instance matching
- entity linking
- data interlinking
I do not use:
- record linkage
- data deduplication
- entity reconciliation
- coreference resolution
The data interlinking problem
Data interlinking is the task of finding the same entities within different datasets (RDF graphs).
[Figure: Data source 1 and Data source 2 are connected by interlinking, which produces owl:sameAs links.]
The data interlinking process
[Figure: two data sources feed the interlinking step, which also takes sample links, parameters and resources as input and outputs the resulting links.]
The data interlinking process (2)
[Figure: from the datasets d and d′, an extraction step produces a linkage specification; a generation step then applies it to d and d′ to produce the set of links l. Together the two steps make up interlinking.]
Approaches to data interlinking
There are two main approaches to data interlinking:
- similarity-based: resources are compared through a similarity measure and, if they are similar enough, they are considered the same;
- key-based: sufficient conditions for two resources to be the same are induced and used to find the same entities.
Classification of similarities
Data interlinking techniques may be based on:
- Data IDs (URIs)
- Data keys
- External relations: (explicit or implicit) links to other resources
- Data description (content)
Manual resource matching
[Figure: URI1 and URI2 are linked by owl:sameAs through manual observation.]
- This does not scale.
- But it may be good for a first sample or reference.
- Crowdsourcing?
URI matching
[Figure: URI1 and URI2 are linked by owl:sameAs through URI transformation.]
http://dbpedia.org/resource/Johann_Sebastian_Bach owl:sameAs http://www.lastfm.fr/music/Johann+Sebastian+Bach
http://rdf.insee.fr/geo/regions-2011.rdf#REG_11 ? http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/FR10
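A URI transformation of this kind can be sketched in a few lines. This is only an illustration: the function name `dbpedia_to_lastfm` is hypothetical, and the sketch assumes that DBpedia local names separate words with underscores while the Last.fm pattern uses `+`, as in the Bach example. It only proposes a candidate URI; whether the target resource actually exists still has to be checked.

```python
DBPEDIA_PREFIX = "http://dbpedia.org/resource/"
LASTFM_PREFIX = "http://www.lastfm.fr/music/"


def dbpedia_to_lastfm(uri: str) -> str:
    """Rewrite a DBpedia resource URI into the Last.fm URI pattern (hypothetical helper)."""
    if not uri.startswith(DBPEDIA_PREFIX):
        raise ValueError(f"not a DBpedia resource URI: {uri}")
    local_name = uri[len(DBPEDIA_PREFIX):]
    # Assumption: words are separated by '_' on the DBpedia side and by '+' on the Last.fm side.
    return LASTFM_PREFIX + local_name.replace("_", "+")


candidate = dbpedia_to_lastfm("http://dbpedia.org/resource/Johann_Sebastian_Bach")
print(candidate)
```

The INSEE/NUTS pair shows the limit of the approach: no string rewriting turns REG_11 into FR10, so other techniques are needed there.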
Id matching
[Figure: two ids are linked by owl:sameAs through finding the same ids.]
Identifiers of this kind include:
- Social security numbers
- ISBN, DOI, MAC addresses, etc.
- Authorities: ISO (countries, languages), IATA (airports)
Most databases are built on such identifiers... but they are often local to the database.
Context-based similarity
[Figure: URI1 and URI2 are both connected to VIAF; context-based "similarity" yields owl:sameAs.]
Process:
- Project your data into another resource (DBpedia, GeoNames, VIAF, etc.)
- Assess the relations between the considered terms
- Import the relation into the dataset
This harnesses the power of links!
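The three-step process can be sketched as a join on the shared external resource. The sketch assumes that each dataset has already been projected into the same hub (for example an authority file); all identifiers below are hypothetical.

```python
from collections import defaultdict


def links_via_hub(source_to_hub, target_to_hub):
    """Link resources of two datasets that point to the same hub identifier."""
    by_hub = defaultdict(list)
    for target, hub_id in target_to_hub.items():
        by_hub[hub_id].append(target)
    return sorted(
        (source, target)
        for source, hub_id in source_to_hub.items()
        for target in by_hub.get(hub_id, [])
    )


# Hypothetical projections of two library datasets into a shared authority file.
source_to_hub = {"lib1:poe": "hub:0001", "lib1:baudelaire": "hub:0002"}
target_to_hub = {"lib2:p42": "hub:0001", "lib2:p77": "hub:0003"}

links = links_via_hub(source_to_hub, target_to_hub)
print(links)
```

Only resources whose projections meet in the hub are linked; the quality of the result depends entirely on the quality of the projections.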
Content-based similarity
[Figure: the two RDF graphs about E. Poe's works are compared by computing their similarity in order to produce owl:sameAs links. The first graph has a Work titled "The gold bug" whose creator is a Writer with firstname and lastname "E. Poe". The second has Books (b) titled "Le corbeau" and "Le scarabée d'or", with orig, author and translator links to Persons (a1, a2) named Baudelaire and Poe.]
Two main approaches:
- bag of text
- structured similarity
Term-based similarity
[Figure: the same two graphs, in which the text values ("The gold bug", "E. Poe"; "Le corbeau", "Le scarabée d'or", "Baudelaire", "Poe") are used to compute a "bag of words" similarity that produces owl:sameAs links.]
Various tools:
- Normalisation (stemmers, tokenizers)
- Use of linguistic resources (WordNet)
- Translation
- Many similarity measures, especially from information retrieval
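A minimal sketch of a "bag of words" comparison, assuming lowercasing and a simple regular-expression tokenizer as the only normalisation and raw term frequency with cosine as the measure. The two strings are illustrative virtual documents built from the values in the figure, not data from the lecture.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(bag_a, bag_b):
    """Cosine similarity between two term-frequency bags."""
    dot = sum(bag_a[term] * bag_b[term] for term in bag_a.keys() & bag_b.keys())
    norm_a = math.sqrt(sum(count * count for count in bag_a.values()))
    norm_b = math.sqrt(sum(count * count for count in bag_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


doc_en = bag_of_words("The gold bug E. Poe")
doc_fr = bag_of_words("Le scarabe d'or Poe Baudelaire")
similarity = cosine(doc_en, doc_fr)
print(round(similarity, 3))
```

Only the token "poe" is shared, so the similarity stays low: without translation, the titles in two languages contribute nothing, which motivates the cross-lingual techniques.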
Structure similarity
[Figure: the same two graphs reduced to their structure (properties title, creator, firstname, lastname, type on one side; orig, name, title, author, translator, type on the other), compared by computing a structure similarity that produces owl:sameAs links.]
Techniques:
- Based on graph matching techniques
- Can be used to learn weights on properties (but this needs matching)
- Problem: scalability
Cross-lingual RDF data interlinking
[Figure: the resource http://a.org/Mus999 is described in French with the properties nom, lieu, adresse, ville, rue and zip and the values "Musee du Louvre", France, Paris, "99, rue de Rivoli" and 75001. The resource http://bb.cn/盧浮宮 is described in Chinese with the properties 稱號 and 位於 and the values 盧浮宮 and 法國巴黎. Question: owl:sameAs?]
Similarity-based data interlinking
Hypothesis: ↑ similarity ⇒ ↑ probability that it is the same object
[Figure: four successive views of the comparison, each ending with the question owl:sameAs?:
1. RESOURCE vs. RESOURCE similarity;
2. DOCUMENT vs. DOCUMENT similarity (virtual documents);
3. DOCUMENT(zh) vs. DOCUMENT(en) similarity, computed after translating each document into the other language;
4. BabelNet(IDs) vs. BabelNet(IDs) similarity.]
Yuzhong Qu, Wei Hu, Gong Cheng: Constructing virtual documents for ontology matching. WWW 2006: 23-31.
General cross-lingual interlinking framework
1. Virtual documents
2. Language normalization
3. Similarity computation
4. Link generation
Building virtual documents by levels
[Figure: virtual document for http://dbpedia.org/resource/Charles_Perrault. Level 1 contains the label "Charles Perrault" and the linked resource dbpedia:France. Level 2 adds the description of that resource: "France is a sovereign country in Western Europe that includes overseas regions and territories..."]
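One way to read "levels" is as the number of hops followed from the resource when collecting text. The sketch below makes that assumption explicit: level 1 gathers the literals attached to the resource, level 2 also gathers the literals of the resources it points to. The toy graph, the property names and the function `virtual_document` are all hypothetical.

```python
def virtual_document(graph, resource, level):
    """Concatenate literal values reachable from `resource` within `level` hops."""
    texts, frontier, seen = [], [resource], {resource}
    for _ in range(level):
        next_frontier = []
        for node in frontier:
            for _prop, value in graph.get(node, []):
                if value in graph:
                    # The value is another described resource: visit it at the next level.
                    if value not in seen:
                        seen.add(value)
                        next_frontier.append(value)
                else:
                    # The value is a literal: add it to the document.
                    texts.append(value)
        frontier = next_frontier
    return " ".join(texts)


# Toy graph: resource -> list of (property, value); property names are illustrative.
graph = {
    "dbpedia:Charles_Perrault": [
        ("rdfs:label", "Charles Perrault"),
        ("dbo:birthPlace", "dbpedia:France"),
    ],
    "dbpedia:France": [
        ("rdfs:comment", "France is a sovereign country in Western Europe"),
    ],
}

level1 = virtual_document(graph, "dbpedia:Charles_Perrault", 1)
level2 = virtual_document(graph, "dbpedia:Charles_Perrault", 2)
print(level1)
print(level2)
```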
Machine translation: parameters
1. Virtual documents: Level 1, Level 2
2.1 Machine translation: ZH→EN
2.2 NLP preprocessing: Lowercase + Tokenize; + Filter stop words; + Stemming (Porter); + Bigrams (terms)
3. Similarity computation: TF + cosine, TF*IDF + cosine
4. Link generation: Greedy, Hungarian
32 settings have been explored in total.
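The total of 32 is consistent with combining every option of every step, assuming that the four preprocessing entries are four cumulative pipelines (this reading of the slide is an assumption). A short enumeration makes the arithmetic explicit; the labels are illustrative.

```python
from itertools import product

levels = ["Level 1", "Level 2"]
preprocessing = [
    "lowercase+tokenize",
    "+ stop word filtering",
    "+ stemming (Porter)",
    "+ bigrams",
]
weighting = ["TF+cosine", "TF*IDF+cosine"]
link_generation = ["greedy", "Hungarian"]

# One setting = one choice per step: 2 x 4 x 2 x 2 combinations.
settings = list(product(levels, preprocessing, weighting, link_generation))
print(len(settings))  # 32
```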
Lowercase + tokenization with TF*IDF at Level 1
[Figure: plot whose legend bins the values into 0 - 0.11, 0.11 - 0.15, 0.15 - 0.25, 0.25 - 0.35, 0.35 - 0.45 and 0.45 - 1.]
Adding noise
BabelNet method: parameters
1. Virtual documents: Level 1, Level 2
2. Multilingual KB mapping
3. Similarity computation: TF + cosine, TF*IDF + cosine
4. Link generation: Greedy, Hungarian
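The greedy link generation step can be sketched as follows, under the assumption that it means "repeatedly take the most similar remaining pair and never reuse a resource". The similarity values are invented for illustration. The Hungarian method instead looks for the one-to-one assignment with the best total similarity, which greedy selection can miss, as the example shows.

```python
def greedy_links(similarity, threshold=0.0):
    """Select one-to-one links by decreasing similarity (greedy strategy)."""
    ranked = sorted(similarity.items(), key=lambda item: item[1], reverse=True)
    used_sources, used_targets, links = set(), set(), []
    for (source, target), score in ranked:
        if score <= threshold:
            break  # every remaining pair is at or below the threshold
        if source not in used_sources and target not in used_targets:
            links.append((source, target, score))
            used_sources.add(source)
            used_targets.add(target)
    return links


# Invented similarity values between two sources and two targets.
similarity = {
    ("a1", "b1"): 0.9,
    ("a1", "b2"): 0.8,
    ("a2", "b1"): 0.85,
    ("a2", "b2"): 0.1,
}
links = greedy_links(similarity)
print(links)
```

Here the greedy choice keeps (a1, b1) and is left with (a2, b2), for a total of 1.0, whereas the assignment (a1, b2), (a2, b1) totals 1.65.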
Database keys
- A key is a set of attributes which uniquely identifies the elements of a relation
- e.g., Book: isbn; People: firstname, lastname, birthplace, birthdate
- Keys are usually given and used to check integrity
They may be used for identifying the same entities across two databases, but they require alignments.
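A minimal sketch of key-based matching across two databases. The alignment is represented by the two attribute lists `key1` and `key2`, whose positions correspond; the records, attribute names and ISBN placeholders are hypothetical.

```python
def link_by_key(records1, records2, key1, key2):
    """Link records whose aligned key attributes carry the same values."""
    index = {}
    for id2, record in records2.items():
        key_value = tuple(record.get(attr) for attr in key2)
        if None not in key_value:  # a record missing a key attribute cannot be matched
            index.setdefault(key_value, []).append(id2)
    links = []
    for id1, record in records1.items():
        key_value = tuple(record.get(attr) for attr in key1)
        if None in key_value:
            continue
        links.extend((id1, id2) for id2 in index.get(key_value, []))
    return links


# Hypothetical book tables; "isbn" in the first is aligned with "ISBN13" in the second.
books1 = {"b1": {"isbn": "isbn-A", "title": "The gold bug"},
          "b2": {"isbn": "isbn-B", "title": "The raven"}}
books2 = {"x9": {"ISBN13": "isbn-B", "titre": "Le corbeau"},
          "x7": {"titre": "Sans ISBN"}}

links = link_by_key(books1, books2, key1=["isbn"], key2=["ISBN13"])
print(links)
```

Without the alignment between `isbn` and `ISBN13`, the key of one database says nothing about the records of the other.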
Example of interlinking with keys and alignments
Are the resources bnf:cb118949856 and bne:XX1721208 the same?
- if the BNF ontology states foaf:Person owl:hasKey {foaf:name, dc:dates}
- and we have the following alignment
[Figure: bnf:cb118949856, of type foaf:Person, has foaf:name "Albert Camus", rda:dateOfBirth 07-11-1913, rda:dateOfDeath 04-01-1960, rda:biographicalInformation "Romancier, dramaturge et essayiste", rda:countryAssociatedWithThePerson http://id.loc.gov/vocabulary/countries/fr, rda:placeOfBirth "Mondovi (Algerie)" and dc:dates 1913-1960. bne:XX1721208, of type frbrer:C1005, has frbrer:P3039 "Camus, Albert", frbrer:P3040 1913-1960 and rda:sourceConsulted "Aut [...]1980". The alignment relates the classes and properties of the two ontologies with correspondences labelled ⊒ and ≡, and the values are compared with ≈. The question is whether owl:sameAs holds between the two resources.]
Key-based interlinking methods
Database keys allow for identifying entities: if they are aligned, this can be used for linking.
- Advantages
  - They are logically grounded
  - They allow minimizing the number of properties to compare (if we use minimal keys)
- Drawbacks
  - They require an alignment between properties and classes
  - Very few key axioms are available, and they are not necessarily useful for interlinking
We overcome these drawbacks by introducing link keys.
Link key
A link key
〈{〈p1, q1〉, . . . , 〈pn, qn〉}{〈p′1, q′1〉, . . . , 〈p′m, q′m〉} linkkey 〈c, d〉〉
holds iff, for all pairs of instances a and b belonging respectively to classes c and d of ontologies O and O′,
if a and b share at least one value (object) for each pair of properties pi and qi respectively,
and a and b share all their values (objects) for each pair of properties p′i and q′i respectively,
then they are the same (〈a, owl:sameAs, b〉).
Example:
〈{〈foaf:name, frbr:P3039〉}{〈dc:dates, frbr:P3040〉} linkkey 〈foaf:Person, frbr:C1005〉〉
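The definition can be turned into a small link generator. In this sketch each dataset maps an instance of the relevant class to its properties, and each property to a set of values. Two choices are assumptions, not part of the definition above: "share all their values" is read as equal, non-empty value sets, and the property values are simplified so that the names match exactly. The instance bnf:other is hypothetical.

```python
def satisfies(a_props, b_props, in_pairs, eq_pairs):
    """Check the link key condition for one pair of instances."""
    for p, q in in_pairs:
        # At least one shared value for each pair <p, q>.
        if not (a_props.get(p, set()) & b_props.get(q, set())):
            return False
    for p, q in eq_pairs:
        # All values shared for each pair <p', q'> (assumed: equal and non-empty sets).
        values_a, values_b = a_props.get(p, set()), b_props.get(q, set())
        if not values_a or values_a != values_b:
            return False
    return True


def generate_links(d1, d2, in_pairs, eq_pairs):
    """Return the owl:sameAs candidates generated by the link key."""
    return {(a, b) for a, pa in d1.items() for b, pb in d2.items()
            if satisfies(pa, pb, in_pairs, eq_pairs)}


d1 = {"bnf:cb118949856": {"foaf:name": {"Albert Camus"}, "dc:dates": {"1913-1960"}},
      "bnf:other": {"foaf:name": {"Albert Camus"}, "dc:dates": {"1900-1950"}}}
d2 = {"bne:XX1721208": {"frbr:P3039": {"Albert Camus", "Camus, Albert"},
                        "frbr:P3040": {"1913-1960"}}}

links = generate_links(d1, d2,
                       in_pairs=[("foaf:name", "frbr:P3039")],
                       eq_pairs=[("dc:dates", "frbr:P3040")])
print(links)
```

The nested loop compares every pair of instances, which is fine for a sketch but is exactly the cost that blocking is meant to reduce.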
Link key extraction
Problem: How to induce such link keys from data?
The number of sets of pairs of properties is exponential.
Our approach:
- discover only candidate link keys;
- evaluate them in order to select only the “good” ones.
Candidate link key
A candidate link key is a set of property pairs {〈p1, q1〉, . . . , 〈pk, qk〉} that
1. would generate at least one link if used as a link key, and
2. is maximal for at least one link, or is the intersection of several candidate link keys.
Supervised selection measures
If a sample of reference links is available:
- Positive examples (L+): a set of owl:sameAs links
- Negative examples (L−): a set of owl:differentFrom links
Idea: approximate precision and recall on that sample.
Definition (Relative precision and recall)
$$\mathrm{precision}(K, L^+, L^-) = \frac{|L^+ \cap L_{D,D'}(K)|}{|(L^+ \cup L^-) \cap L_{D,D'}(K)|}$$
$$\mathrm{recall}(K, L^+) = \frac{|L^+ \cap L_{D,D'}(K)|}{|L^+|}$$
Here $L_{D,D'}(K)$ denotes the set of links that the candidate link key K generates between the datasets D and D′.
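These two measures translate directly into set operations. In the sketch below, `links` stands for the set of links generated by a candidate link key, and `positives` and `negatives` for the reference samples; returning 0.0 when a denominator is empty is a convention chosen here, not part of the definition. The sample data is invented.

```python
def relative_precision(links, positives, negatives):
    """Share of correct links among the generated links that the sample can judge."""
    judged = (positives | negatives) & links
    return len(positives & links) / len(judged) if judged else 0.0


def relative_recall(links, positives):
    """Share of the positive examples found among the generated links."""
    return len(positives & links) / len(positives) if positives else 0.0


# Invented sample: (a3, b9) is generated but appears in neither L+ nor L-.
links = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}
positives = {("a1", "b1"), ("a2", "b2"), ("a4", "b4")}
negatives = {("a3", "b3")}

print(relative_precision(links, positives, negatives))
print(relative_recall(links, positives))
```

The unjudged link (a3, b9) does not lower the precision: the measure is relative to what the sample covers.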
Unsupervised selection measures
When no reference link is available.
Idea: measure how close the extracted links would be to a one-to-one and total mapping.
Definition (Discriminability)
$$\mathrm{disc}(K, D, D') = \frac{\min(|\{a : \langle a, b\rangle \in L_{D,D'}(K)\}|,\ |\{b : \langle a, b\rangle \in L_{D,D'}(K)\}|)}{|L_{D,D'}(K)|}$$
Definition (Coverage)
$$\mathrm{cov}(K, D, D') = \frac{|\{a : \langle a, b\rangle \in L_{D,D'}(K)\} \cup \{b : \langle a, b\rangle \in L_{D,D'}(K)\}|}{|\{a : c(a) \in D\} \cup \{b : d(b) \in D'\}|}$$
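Both measures only need the generated links and the instances of the two classes. The sketch below assumes that the two datasets use disjoint identifiers, so that the unions count each instance once; returning 0.0 on empty input is a convention chosen here. The data is invented.

```python
def discriminability(links):
    """1.0 when the links are one-to-one, lower when resources are linked several times."""
    if not links:
        return 0.0
    sources = {a for a, _ in links}
    targets = {b for _, b in links}
    return min(len(sources), len(targets)) / len(links)


def coverage(links, instances1, instances2):
    """Share of the instances of both classes that take part in at least one link."""
    linked = {a for a, _ in links} | {b for _, b in links}
    universe = instances1 | instances2
    return len(linked) / len(universe) if universe else 0.0


links = {("a1", "b1"), ("a2", "b1"), ("a3", "b3")}  # b1 is linked twice
instances1 = {"a1", "a2", "a3", "a4"}
instances2 = {"b1", "b2", "b3", "b4"}

disc = discriminability(links)
cov = coverage(links, instances1, instances2)
h_mean = 2 * disc * cov / (disc + cov)  # harmonic mean of the two measures
print(disc, cov, h_mean)
```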
Experimental evaluation
These selection measures were evaluated on public datasets.
Finding links between French municipalities described in two different datasets:
- Insee dataset: 36700 instances;
- Geonames dataset: 36552 instances.
The reference link set is composed of:
- Positive links: 36552 owl:sameAs statements;
- Negative links: owl:differentFrom links derived from the owl:sameAs links (closed world assumption).
Evaluation
The algorithm extracted 11 candidate link keys:
[Figure: lattice of the candidate link keys, positioned by coverage and discriminability, with regions marked bad (F-measure ≈ 0), good (F-measure ≈ 0.89) and high (F-measure ≈ .99).]
{1} {2} {3, 4} {5, 6}
{7, 1} {2, 1} {3, 4, 1} {3, 2, 4}
{3, 7, 4, 1} {3, 2, 4, 1}
{3, 7, 2, 4, 1}
where:
1 = 〈nom, name〉
2 = 〈nom, alternateName〉
3 = 〈subdivisionDe, parentFeature〉
4 = 〈subdivisionDe, parentADM3〉
5 = 〈codeINSEE, population〉
6 = 〈codeCommune, population〉
7 = 〈nom, officialName〉
Evaluation
Correlation between the harmonic mean of discriminability and coverage and the F-measure:
[Figure: the same lattice of 11 candidate link keys, positioned by coverage and discriminability. The regions of high (F-measure ≈ .99), good (F-measure ≈ 0.89) and bad (F-measure ≈ 0) candidate link keys are annotated with h-mean(disc., cov) ≈ .99, ≈ .89 and ≈ 0.]
Why use ontologies?
Because it is obvious that we must compare the instances of equivalent classes based on equivalent properties.
More precisely:
- For reducing the search space for finding link keys and similarities
- For reducing the scope of linkage specifications
- Because the same linkage rules do not work for all classes
- Because classes and properties are, like other features, hints of the similarity between resources
Ex. With similarity and with keys
Data interlinking through a common ontology
[Figure: URI1 and URI2 belong to datasets described by the same ontology o; resource matching between these datasets yields owl:sameAs.]
Matching with a common ontology
+ Focus the search: only match instances of the same class;
– Not sufficient: it remains to identify corresponding entities
  + If keys are defined (OWL 2), this is done;
  + At least we know which properties to compare;
  – Inferring secondary keys may be useful;
  – Correcting discrepancies: record linkage.
Record linkage
      | Record 1   | Record 2
Name  | Johann     | Johannes
Date  | 1665-03-21 | 31/03/1665
Place | Munchen    | Monaco di Bavaria
Having a common ontology does not solve all problems.
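The Date row illustrates why comparing raw values is not enough. A minimal sketch, assuming only the two date formats shown in the example: once both values are parsed, they can be compared, and they turn out to differ by ten days, so even after normalisation an exact-match rule would fail and a tolerance or a conversion rule is needed. The function name and the list of formats are illustrative.

```python
from datetime import datetime

# Formats assumed from the two records above.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_date(text):
    """Parse a date written in one of the known formats."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unknown date format: {text}")


first = parse_date("1665-03-21")
second = parse_date("31/03/1665")
gap_in_days = abs((second - first).days)
print(first, second, gap_in_days)
```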
Different types of mismatch
- Different domains, connected (BIM, energy demand) ⇒ few correspondences, any type
- Same domain, different models (engineer, policy maker) ⇒ many correspondences, mostly equivalence
- Same domain, different granularity (city management, building design) ⇒ many correspondences, mostly subsumption
Data interlinking with different ontologies (implicit alignment)
[Figure: URI1 belongs to a dataset described by ontology o and URI2 to a dataset described by ontology o′; resource matching between datasets described by different ontologies yields owl:sameAs.]
Data interlinking with different ontologies (explicit alignment)
[Figure: URI1 belongs to a dataset described by ontology o and URI2 to a dataset described by ontology o′; an alignment A relates o and o′, and resource matching between the datasets described by different ontologies yields owl:sameAs.]
Ontology matching for data interlinking
[Figure: ontology matching between o and o′ produces the alignment A; data interlinking between URI1 and URI2 then uses A to produce owl:sameAs.]
Heterogeneity problem
Resources being expressed in different ways must be reconciled before being used.
Mismatch between formalized knowledge can occur when:
- different languages are used (OWL vs. Topic maps);
- different terminologies are used:
  - English vs. Chinese;
  - Book vs. Monograph.
- different models are used:
  - different classes: Autobiography vs. Paperback;
  - classes vs. property: Essay vs. literarygenre;
  - classes vs. instances: One physical book as an instance vs. one work as an instance.
- different scopes and granularity are used:
  - Only books vs. cultural items vs. any product;
  - Books detailed to the print and translation level vs. books as works.
Ontology alignment
[Figure: two ontologies and the correspondences between them. The first contains the classes Item, DVD, Book, Paperback, Hardcover, CD and Person, the properties price, title, doi, creator, pp and author, and the datatypes integer, string and uri. The second contains the classes Monograph, Essay, Literary critics, Politics, Biography, Autobiography, Literature, Human and Writer, and the properties pages, isbn, author, title and subject. Correspondences between entities of the two ontologies are labelled ≥ and ≤.]
Expressive alignments (EDOAL)
[Figure: Book restricted to topic = author is related by = to Autobiography; Volume restricted to size ≤ 14 is related by ⊑ to Pocket.]
∀x, Pocket(x) ⇐ Volume(x) ∧ size(x, y) ∧ y ≤ 14
∀x, Book(x) ∧ author(x, y) ∧ topic(x, y) ≡ Autobiography(x)
Example: INSEE dataset
Region table:
code | nom               | chef-lieu
11   | Ile-de-France     | 75056
21   | Champagne-Ardenne | 51108
22   | Picardie          | 80021
Sous-region table:
region | departement
11     | 75
11     | 77
11     | 78
11     | 91
11     | 92
11     | 93
Example: Administrative ontology
[Figure: ontology with the classes Territoire_FR, Pays, Region, Departement, Arrondissement and Commune, the properties code, nom, chef-lieu and subdivision, and the datatypes integer and string.]
Example: NUTS dataset
NUTSRegion table:
level | code  | name          | hasParentRegion
0     | FR    | FRANCE        |
1     | FR1   | ILE DE FRANCE | FR
2     | FR10  | Ile de France | FR1
3     | FR101 | Paris         | FR10
3     | FR104 | Essonne       | FR10
Example: Linking INSEE and NUTS
NUTS: Nomenclature of territorial units for statistics
#INSEE | INSEE name     | NUTS Level | #NUTS
1      | Pays           | 0          | 34
       |                | 1          | 142
26     | Region         | 2          | 344
100    | Departement    | 3          | 1488
342    | Arrondissement |            |
4036   | Canton         | 4          |
52422  | Commune        | 5          |
Example: Linking INSEE and NUTS
[Figure: on the INSEE side, the classes Territoire_FR, Pays, Region, Departement and Commune with the instances PAYS_FR, REG_11, DEP_75, DEP_77, DEP_78 and COM_75056. On the NUTS side, the classes Region, Country, NUTSRegion and LAURegion with the instances FR, UK, FR1, FR10, FR101, FR102 and FR103. Five owl:sameAs links connect instances of the two datasets.]
Example: Linksets
Linksets are specific datasets containing links between URIs.
<http://www.example.org/linkset/INSEE-NUTS>
  a void:Linkset ;
  void:target <http://rdf.insee.fr/geo/regions-2011.rdf> ;
  void:target <http://nuts.psi.enakting.org/id/> .
insee:PAYS_FR owl:sameAs nuts:FR .
insee:REG_11 owl:sameAs nuts:FR10 .
insee:DEP_75 owl:sameAs nuts:FR101 .
insee:DEP_77 owl:sameAs nuts:FR102 .
insee:DEP_78 owl:sameAs nuts:FR103 .
Example: interesting sets
[Figure: interlinked datasets of interest: nuts, ons, ordnance s., ign, insee, geonames, dbpedia, freebase.]
A simple algorithm
- Find matching concepts [concept matching];
- For each of them, determine matching properties based on the similarity between their values in both datasets [property matching];
- From them, find property combinations identifying corresponding entities [key extraction];
- Link corresponding entities [link generation].
For instance, the values of nom for Region in INSEE are included in the values of name for NUTSRegion in NUTS (nom/Region ⊆ name/NUTSRegion) and, moreover, they are unambiguous.
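The property matching step and the unambiguity check can be sketched as follows. Each dataset maps an instance to its property values; the value spellings are normalised and partly invented for the illustration (identifiers such as FR21 and FR22 are hypothetical), and the function names are not taken from any tool.

```python
def values(dataset, prop):
    """Set of values taken by a property in a dataset."""
    return {record[prop] for record in dataset.values() if prop in record}


def inclusion(dataset1, prop1, dataset2, prop2):
    """Share of the values of prop1 that also occur as values of prop2."""
    values1, values2 = values(dataset1, prop1), values(dataset2, prop2)
    return len(values1 & values2) / len(values1) if values1 else 0.0


def is_unambiguous(dataset, prop):
    """True when no two instances share a value for the property."""
    occurrences = [record[prop] for record in dataset.values() if prop in record]
    return len(occurrences) == len(set(occurrences))


insee_regions = {"REG_11": {"nom": "Ile-de-France"},
                 "REG_21": {"nom": "Champagne-Ardenne"}}
nuts_regions = {"FR10": {"name": "Ile-de-France"},
                "FR21": {"name": "Champagne-Ardenne"},
                "FR22": {"name": "Picardie"}}

print(inclusion(insee_regions, "nom", nuts_regions, "name"))
print(is_unambiguous(insee_regions, "nom"), is_unambiguous(nuts_regions, "name"))
```

A pair of properties with full inclusion and unambiguous values, as here, is a natural candidate for the key extraction step.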
INSEE and NUTS: ontology alignment
[Figure: on the INSEE side, the classes Territoire_FR, Pays, Region, Departement, Arrondissement, Canton and Commune, the properties code, nom, chef-lieu and subdivision, and the datatypes integer and string. On the NUTS side, the classes Region, Country, NUTSRegion and LAURegion and the properties name, level, code and hasSubRegion. Correspondences between the two ontologies are labelled = and ≤.]
Simple alignments are not sufficient
[Figure: on the INSEE side, the classes Territoire_FR, Region, Departement and Commune, the property nom and the instances DEP_75 and COM_75056. On the NUTS side, the classes Region and NUTSRegion, the property name and the instance FR101. The value Paris appears as a name. Correspondences between classes, properties and values are labelled = and ≤.]
Expressive alignments are necessary
[Figure: Region = NUTSRegion restricted to level = 2 and hasParentRegion = FR; subdivision = hasSubRegion; nom = name.]
What does this mean?
- Ontology alignments are schema-level expressions of correspondences;
- They are useful for focusing the search;
- Expressive alignments are necessary;
- They can be turned into SPARQL-based link generators.
But it is also necessary to express instance-level constraints:
- for converting data (e.g., mph vs. m/s);
- for expressing matching constraints on data (e.g., similarity).
Data interlinking and ontology matching
[Figure: the datasets d and d′ are described by the ontologies o and o′. A Matcher takes o and o′ and produces the alignment A; a Generator takes d, d′ and A and produces the links l.]
Tools for data interlinking
Linkage spec | extraction   | generation
similarity   | LIMES        | Silk, LIMES, OpenRefine
key          | LinkKeyDisco | SPARQL
Silk
Silk is a robust piece of software for interlinking datasets.
It relies on an expressive specification of linking conditions:
- Declare data sources (DataSource);
- Circumscribe entities to compare (Source/TargetDataset);
- Describe how to compare them (LinkageRule):
  - Select properties to compare through paths (Input);
  - Compute distances between them (Compare+threshold);
  - Aggregate all comparisons (Aggregate);
- Select those pairs of entities to be linked (Filter);
- Generate links (Output+thresholds).
A Silk script
Consider a linking script between INSEE and NUTS:
<Silk>
<Prefix id="nuts"
namespace="http://ec.europa.eu/.../geographic.rdf#" />
<Prefix id="insee"
namespace="http://rdf.insee.fr/geo/" />
<DataSource id="nuts2008"
type="sparqlEndpoint">
<Param name="endpointURI"
value="http://localhost:9091/.../internal"/>
<Param name="graph"
value="http://localhost:9091/.../nuts2008-complete-1"/>
</DataSource>
<DataSource id="insee2010"
type="sparqlEndpoint">
<Param name="endpointURI"
value="http://localhost:9091/.../internal"/>
<Param name="graph"
value="http://localhost:9091/.../source/regions-2010-1"/>
</DataSource>
<Thresholds accept="0.9" verify="0.7" />
<Outputs>
<Output type="sparul">
<Param name="graphUri"
value="http://localhost:9091/.../source/insee-nuts-silk"/>
<Param name="uri"
value="http://localhost:9091/.../lifted/"/>
<Param name="parameter" value="update"/>
</Output>
</Outputs>
<Interlinks>
<Interlink id="linkingNUTS">
<LinkType>owl:sameAs</LinkType>
<SourceDataset dataSource="nuts2008" var="s">
<RestrictTo>?s rdf:type nuts:NUTSRegion.
?s nuts:level 2.
</RestrictTo>
</SourceDataset>
<TargetDataset dataSource="insee2010" var="ss">
<RestrictTo>?ss rdf:type insee:Region</RestrictTo>
</TargetDataset>
<LinkageRule>
<Aggregate type="max">
<Compare metric="levenshteinDistance"
threshold=".2">
<Input path="?s/nuts:name"/>
<Input path="?ss/insee:nom"/>
</Compare>
</Aggregate>
</LinkageRule>
</Interlink>
</Interlinks>
</Silk>
Silk: prefix and sources
<Silk>
<Prefix id="nuts" namespace="http://ec.europa.eu/.../geographic.rdf#" />
<Prefix id="insee" namespace="http://rdf.insee.fr/geo/" />
<DataSource id="nuts2008" type="sparqlEndpoint">
<Param name="endpointURI" value="http://localhost:9091/.../internal"/>
<Param name="graph" value="http://localhost:9091/.../nuts2008-complete-1"/>
</DataSource>
<DataSource id="id1" type="file">
<Param name="file" value="/Skratch/TutoLinking/admin/regions-2010.rdf"/>
<Param name="format" value="RDF/XML" />
</DataSource>
Sources can be files or SPARQL endpoints.
Silk rules
<Interlinks>
<Interlink id="linkingNUTS">
<LinkType>owl:sameAs</LinkType>
<SourceDataset dataSource="nuts2008" var="s">
<RestrictTo>?s rdf:type nuts:NUTSRegion.
?s nuts:level 2.
</RestrictTo>
</SourceDataset>
<TargetDataset dataSource="insee2010" var="ss">
<RestrictTo>?ss rdf:type insee:Region</RestrictTo>
</TargetDataset>
<Thresholds accept="0.9" verify="0.7" />
<Outputs>
<Output type="sparul">
<Param name="graphUri" value="http://localhost:9091/.../source/insee-nuts-silk"/>
<Param name="uri" value="http://localhost:9091/.../lifted/"/>
<Param name="parameter" value="update"/>
</Output>
</Outputs>
Restrictions are given as SPARQL graph patterns.
Output can be a file (in various formats, including that of the Alignment API) or a SPARQL endpoint.
Outputs can be made dependent on thresholds.
Silk rules (cont'd)
<LinkageRule>
<Aggregate type="max">
<Compare metric="levenshteinDistance" threshold=".2">
<Input path="?s/nuts:name"/>
<Input path="?ss/insee:nom"/>
</Compare>
</Aggregate>
</LinkageRule>
</Interlink>
</Interlinks>
</Silk>
Linkage rules can:
- transform the data (lowercase, tokenize, to integers, etc.),
- use comparison metrics (equality, Levenshtein, Jaro-Winkler, etc.), and
- aggregate their values (average, min, max, etc.).
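To make the comparison step concrete, here is a plain Python sketch of an edit-distance comparison followed by a threshold test. It does not reproduce Silk's implementation: dividing the distance by the length of the longer string and accepting pairs at or below 0.2 is an assumption made for the illustration, and how Silk itself interprets the threshold of a given metric should be checked in its documentation.

```python
def levenshtein(a, b):
    """Edit distance: minimal number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Edit distance divided by the length of the longer string (assumed normalisation)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


THRESHOLD = 0.2  # mirrors threshold=".2" in the rule above, under the stated assumption
distance = normalized_distance("Ile de France", "Ile-de-France")
print(distance, distance <= THRESHOLD)
```

The two spellings differ by two substitutions over 13 characters, so the pair passes the threshold, while an equality metric would have rejected it.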
Silk workbench
EDOAL Alignments
<Cell>
<entity1><e:Class rdf:about="&insee;Region"/></entity1>
<entity2>
<e:Class>
<e:and rdf:parseType="Collection">
<e:Class rdf:about="&nuts;NUTSRegion"/>
<e:AttributeValueRestriction>
<e:onAttribute><e:Property rdf:about="&nuts;level"/></e:onAttribute>
<e:comparator rdf:resource="&edoal;equals"/>
<e:value><e:Literal e:type="&xsd;integer" e:string="2" /></e:value>
</e:AttributeValueRestriction>
<e:AttributeValueRestriction>
<e:onAttribute>
<e:Relation rdf:about="&nuts;hasParentRegion" />
</e:onAttribute>
<e:comparator rdf:resource="&edoal;equals"/>
<e:value><e:Instance rdf:about="&esdata;FR" /></e:value>
</e:AttributeValueRestriction>
</e:and>
</e:Class>
</entity2>
<relation>equivalence</relation>
<measure>1.0</measure>
...
</Cell>
Link keys in the Alignment API
<e:linkkey>
<e:Linkkey>
<e:binding>
<e:Intersects>
<e:property1><e:Property rdf:about="&insee;nom" /></e:property1>
<e:property2><e:Property rdf:about="&nuts;name" /></e:property2>
</e:Intersects>
<e:Equals>
<e:property1>
<e:Property>
<e:inverse><e:Property rdf:about="&insee;subdivision" /></e:inverse>
</e:Property>
</e:property1>
<e:property2><e:Property rdf:about="&nuts;hasParentRegion" /></e:property2>
</e:Equals>
</e:binding>
</e:Linkkey>
</e:linkkey>
Query generation
PREFIX insee: <http://rdf.insee.fr/ontologie-geo-2006.rdf#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?r
FROM <http://rdf.insee.fr/geo/regions-2011.rdf>
WHERE {?r rdf:type insee:Region .
}
PREFIX nuts: <http://ec.europa.eu/eurostat/ramon/ontologies/geographic.rdf#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?n
FROM <http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/>
WHERE {?n rdf:type nuts:NUTSRegion .
?n nuts:level "2"^^xsd:integer .
?n nuts:hasParentRegion nuts:FR .
}
Data transformation
PREFIX insee: <http://rdf.insee.fr/ontologie-geo-2006.rdf#>
PREFIX nuts: <http://ec.europa.eu/eurostat/ramon/ontologies/geographic.rdf#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
CONSTRUCT {?r rdf:type nuts:NUTSRegion .
?r nuts:level "2"^^xsd:integer .
?r nuts:hasParentRegion nuts:FR .
}
FROM <http://rdf.insee.fr/geo/regions-2011.rdf>
WHERE {?r rdf:type insee:Region .
}
SameAs link generation
PREFIX insee: <http://rdf.insee.fr/ontologie-geo-2006.rdf#>
PREFIX nuts: <http://ec.europa.eu/eurostat/ramon/ontologies/geographic.rdf#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
CONSTRUCT { ?r owl:sameAs ?n . }
FROM <http://rdf.insee.fr/geo/regions-2011.rdf>
FROM <http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/>
WHERE {?r rdf:type insee:Region .
?r insee:nom ?l .
?n rdf:type nuts:NUTSRegion .
?n nuts:name ?l .
?n nuts:level "2"^^xsd:integer .
?n nuts:hasParentRegion nuts:FR .
}
Other issue: performance
$$n \times m \quad\text{vs.}\quad \frac{n}{3} \times \frac{m}{3} + \frac{n}{3} \times \frac{m}{3} + \frac{n}{3} \times \frac{m}{3}$$
10 × 10 = 100
1000 × 1000 = 1000000
100000 × 100000 = 10000000000
Other issue: performance
Blocking: index + cluster
[Figure: the resources of Dataset 1 and Dataset 2 are grouped into blocks, and only resources falling in the same block are compared.]
Other issue: performance
Blocks can be obtained from:
- clustering values in an index
- predefined blocks (based on equality)
- classes in an ontology (blocks are defined as class expressions)
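The effect of blocking on the number of comparisons can be sketched with predefined blocks based on equality. The blocking key (first letter of the name) and the records are chosen only for the illustration; a real setting would use a more meaningful key or a class expression.

```python
from collections import defaultdict


def block_by(records, key):
    """Group record identifiers by the value of a blocking key."""
    blocks = defaultdict(list)
    for identifier, record in records.items():
        blocks[key(record)].append(identifier)
    return blocks


def candidate_pairs(records1, records2, key):
    """Only pairs of records falling in the same block are candidates for comparison."""
    blocks1, blocks2 = block_by(records1, key), block_by(records2, key)
    return [(a, b)
            for block in blocks1.keys() & blocks2.keys()
            for a in blocks1[block]
            for b in blocks2[block]]


names1 = ["Paris", "Pau", "Lyon", "Lille", "Nice", "Nantes"]
names2 = ["Paris", "Perpignan", "Lyon", "Limoges", "Nice", "Nimes"]
records1 = {f"i{k}": {"name": name} for k, name in enumerate(names1)}
records2 = {f"g{k}": {"name": name} for k, name in enumerate(names2)}

pairs = candidate_pairs(records1, records2, key=lambda record: record["name"][0])
print(len(records1) * len(records2), len(pairs))
```

With three blocks of two records on each side, 12 comparisons replace the 36 of the exhaustive approach, at the risk of missing links between records that land in different blocks.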
Other issue: evaluation
[Figure: the datasets d and d′ go through interlinking, which produces the links l; the links are evaluated against reference links, yielding precision, recall and F-measure.]
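The three measures follow from comparing the generated links with the reference links. A minimal sketch with invented links, using the balanced F-measure (harmonic mean of precision and recall) and returning 0.0 for empty denominators as a convention:

```python
def evaluate(links, reference):
    """Precision, recall and F-measure of a link set against reference links."""
    correct = len(links & reference)
    precision = correct / len(links) if links else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Invented example: 4 generated links, 3 of them correct, 5 reference links.
links = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b9")}
reference = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4"), ("a5", "b5")}

precision, recall, f_measure = evaluate(links, reference)
print(precision, recall, round(f_measure, 3))
```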
Other issue: learning
[Figure: the datasets d and d′, together with training links, feed the interlinking step, which produces the links l; these links are then evaluated.]
Conclusion
- Data interlinking is one of the most critical tasks in linked data
  - ... but not only there, e.g. smart cities
- It faces many problems due to:
  - heterogeneity (formats, languages, conventions)
  - size
- Interlinking can be based on similarities or keys
- There is active work to infer such interlinking patterns
Further reading
- T. Heath, C. Bizer, Linked Data: Evolving the Web into a Global Data Space, Morgan & Claypool (US), 2011 http://linkeddatabook.com/
- J. Euzenat, P. Shvaiko, Ontology matching, 2nd ed., Springer, Heidelberg (DE), 2013 http://book.ontologymatching.org
- K. Stefanidis, V. Efthymiou, M. Herschel, V. Christophides, Entity Resolution in the Web of Data, Tutorial, WWW conference, Seoul (KR), 2014 http://www.csd.uoc.gr/~vefthym/er/
Silk http://silk-framework.com/
Alignment API http://alignapi.gforge.inria.fr
Al 4 SC http://al4sc.inrialpes.fr
Thanks
- To my colleagues Manuel Atencia, Jerome David, Nicolas Guillouet and Francois Scharffe
- The Datalift and Lindicle projects
- The Ready4SmartCities project