workincollaborationwithfrançoisscharffeandothers ... · linked data methods for data interlinking...

64
Interlinking the web of data: challenges and solutions Work in collaboration with François Scharffe and others Jérôme Euzenat June 8, 2011

Upload: dohuong

Post on 02-May-2018

231 views

Category:

Documents


1 download

TRANSCRIPT

Interlinking the web of data: challenges and solutionsWork in collaboration with François Scharffe and others

Jérôme Euzenat

June 8, 2011

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Outline

Linked data

Methods for data interlinking

General framework for data interlinking

Conclusions

Jérôme Euzenat Interlinking the web of data: challenges and solutions 2 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Outline

Linked data

Methods for data interlinking

General framework for data interlinking

Conclusions

Jérôme Euzenat Interlinking the web of data: challenges and solutions 3 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Linked (open) data cloud

Jérôme Euzenat Interlinking the web of data: challenges and solutions 4 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Publishing data on the web

Four publication principles1. Use URIs for identifying resources2. Use dereferencable URIs3. When a URI is dereferenced, a description of the identified resource is

returned4. Published datasets must be interconnected to other datasets

If possible using semantic web technologies: URI, HTTP, RDF, OWL

Jérôme Euzenat Interlinking the web of data: challenges and solutions 5 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

What for?

I If you are an authority publishing data (gov), this allows you to put yourdata in the context other authority data and the others to reuse yourdata;

I If you are a link producer, i.e., someone who knows how to add value bycross-linking data, you have standard references to work with;

I If you are an application developer, e.g., seevl.net, you can takeadvantage of this data always on the web.

and

I the data is constantly up-to-date;I the data can be freely linked;I Data is browsable.

Jérôme Euzenat Interlinking the web of data: challenges and solutions 6 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Example: INSEE dataset

Région table:code nom chef-lieu11 Île-de-France 7505621 Champagne-Ardenne 5110822 Picardie 80021

Sous-région table:

région département11 7511 7711 7811 9111 9211 93

Jérôme Euzenat Interlinking the web of data: challenges and solutions 7 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Example: Administrative ontology

Territoire_FR

Pays

Region

Departement

Arrondissement

Commune

codenom

chef-lieusubdivision

integer

string

Jérôme Euzenat Interlinking the web of data: challenges and solutions 8 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Example: RDF

REG_11

Region

rdf:typ

e

11

i:code

Île-de-Francei:nom

PAYS_FR

Paysrdf:typ

e

FRi:co

de_iso

Francei:nom

i:sub

DEP_75DEP_77DEP_78

Departement Commune

COM_75056

i:chef-lieu

rdf:typerdf:typerdf:type

rdf:type

i:subi:su

bi:sub

Parisi:nom

i:nom

Jérôme Euzenat Interlinking the web of data: challenges and solutions 9 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Example: Linking INSEE and NUTS

NUTS: Nomenclature of territorial units for statistics

#INSEE INSEE name NUTS Level #NUTS1 Pays 0 34

1 14226 Région 2 344

100 Département 3 1488342 Arrondissement4036 Canton 4

52422 Commune 5

Å vs. Saint-Rémy-en-Bouzemont-Saint-Genest-et-Issonor Montbonnot Saint-Martin

Jérôme Euzenat Interlinking the web of data: challenges and solutions 10 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Example: Linking INSEE and NUTS

NUTS: Nomenclature of territorial units for statistics

#INSEE INSEE name NUTS Level #NUTS1 Pays 0 34

1 14226 Région 2 344

100 Département 3 1488342 Arrondissement4036 Canton 4

52422 Commune 5

Å vs. Saint-Rémy-en-Bouzemont-Saint-Genest-et-Issonor Montbonnot Saint-Martin

Jérôme Euzenat Interlinking the web of data: challenges and solutions 10 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Example: Linking INSEE and NUTS

Territoire_FR

Pays

Region

Departement

Commune

PAYS_FR

REG_11

DEP_75

DEP_77

DEP_78

COM_75056

Region

Country

NUTSRegion

LAURegion

FR1

UK1

FR10

FR101

FR102

FR103

owl:sameAs

owl:sameAs

owl:sameAs

owl:sameAs

owl:sameAs

Jérôme Euzenat Interlinking the web of data: challenges and solutions 11 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Outline

Linked data

Methods for data interlinking

General framework for data interlinking

Conclusions

Jérôme Euzenat Interlinking the web of data: challenges and solutions 12 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

The data interlinking problem

URI1 URI2

o o ′

Data interlinking

owl:sameAs

Jérôme Euzenat Interlinking the web of data: challenges and solutions 13 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

The data interlinking problem

URI1 URI2

o o ′

Data interlinking

owl:sameAs

Jérôme Euzenat Interlinking the web of data: challenges and solutions 13 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

The data interlinking problem

URI1 URI2

o o ′

Data interlinking

owl:sameAs

Jérôme Euzenat Interlinking the web of data: challenges and solutions 13 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

The data interlinking problem

URI1 URI2

o o ′

Data interlinking

owl:sameAs

Jérôme Euzenat Interlinking the web of data: challenges and solutions 13 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Manual resource matching

URI1 URI2

Manual observation

owl:sameAs

This does not scale.

Jérôme Euzenat Interlinking the web of data: challenges and solutions 14 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Manual resource matching

URI1 URI2

Manual observation

owl:sameAs

This does not scale.

Jérôme Euzenat Interlinking the web of data: challenges and solutions 14 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

URI matching

URI1 URI2

URI transformation

owl:sameAs

http://dbpedia.org/resource/Johann_Sebastian_Bach owl:sameAshttp://www.lastfm.fr/music/Johann+Sebastian+Bach

http://rdf.insee.fr/geo/regions-2011.rdf#REG_11 ?http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/FR10

Jérôme Euzenat Interlinking the web of data: challenges and solutions 15 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

URI matching

URI1 URI2

URI transformation

owl:sameAs

http://dbpedia.org/resource/Johann_Sebastian_Bach owl:sameAshttp://www.lastfm.fr/music/Johann+Sebastian+Bach

http://rdf.insee.fr/geo/regions-2011.rdf#REG_11 ?http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/FR10

Jérôme Euzenat Interlinking the web of data: challenges and solutions 15 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

URI matching

URI1 URI2

URI transformation

owl:sameAs

http://dbpedia.org/resource/Johann_Sebastian_Bach owl:sameAshttp://www.lastfm.fr/music/Johann+Sebastian+Bach

http://rdf.insee.fr/geo/regions-2011.rdf#REG_11 ?http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/FR10

Jérôme Euzenat Interlinking the web of data: challenges and solutions 15 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Data matching through a commonontology

o

URI1 URI2

Resource matchingof datasets

described by thesame ontology

owl:sameAs

Jérôme Euzenat Interlinking the web of data: challenges and solutions 16 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Data matching through a commonontology

o

URI1 URI2

Resource matchingof datasets

described by thesame ontology

owl:sameAs

Jérôme Euzenat Interlinking the web of data: challenges and solutions 16 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Data matching through a commonontology

o

URI1 URI2

Resource matchingof datasets

described by thesame ontology

owl:sameAs

Jérôme Euzenat Interlinking the web of data: challenges and solutions 16 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Matching with a common ontology

+ Focus the search: only match instances of the same class;– Not sufficient: it remains to identify corresponding entities

+ If keys are defined (OWL 2), this is done;+ At least we know which properties to compare;– Inferring secondary keys may be useful;– Correcting discrepancies: record linkage.

Jérôme Euzenat Interlinking the web of data: challenges and solutions 17 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Record linkage

Name Johann

Date 1665

Place München

NameJohannes

Date1667

PlaceMonaco di Bavaria

Having a common ontology does not solve the problem.

Jérôme Euzenat Interlinking the web of data: challenges and solutions 18 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Data matching with different ontologies(implicit alignment)

o o ′

URI1 URI2

Resource matchingof datasetsdescribed by

different ontologies

owl:sameAs

Jérôme Euzenat Interlinking the web of data: challenges and solutions 19 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Data matching with different ontologies(implicit alignment)

o o ′

URI1 URI2

Resource matchingof datasetsdescribed by

different ontologies

owl:sameAs

Jérôme Euzenat Interlinking the web of data: challenges and solutions 19 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Data matching with different ontologies(implicit alignment)

o o ′

URI1 URI2

Resource matchingof datasetsdescribed by

different ontologies

owl:sameAs

Jérôme Euzenat Interlinking the web of data: challenges and solutions 19 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Independent key computation

I Different span requires different key (France is not a key for INSEE);I Differences in schema and depths makes difference in what is a key

("Paris" is both a department name (DEP_75) and a municipalityname (COM_75056) for INSEE while the region name may be a key forNUTS)

I Keys are often meaningless: they are data-independent anddatabase-dependent, hence they cannot be used for matching entities(REG_11 vs. FR_10).

I rdf:type and insee:nom are keys for INSEE (Region);I nuts:level and nuts:name are keys for NUTS (NUTSRegion);I insee:nom corresponds to nuts:name; there exists a correspondence

between rdf:type in INSEE and nuts:level.

Jérôme Euzenat Interlinking the web of data: challenges and solutions 20 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Independent key computation

I Different span requires different key (France is not a key for INSEE);I Differences in schema and depths makes difference in what is a key

("Paris" is both a department name (DEP_75) and a municipalityname (COM_75056) for INSEE while the region name may be a key forNUTS)

I Keys are often meaningless: they are data-independent anddatabase-dependent, hence they cannot be used for matching entities(REG_11 vs. FR_10).

I rdf:type and insee:nom are keys for INSEE (Region);I nuts:level and nuts:name are keys for NUTS (NUTSRegion);I insee:nom corresponds to nuts:name; there exists a correspondence

between rdf:type in INSEE and nuts:level.

Jérôme Euzenat Interlinking the web of data: challenges and solutions 20 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Data matching with different ontologies(explicit alignment)

o o ′

URI1 URI2

A

Resource matchingof datasetsdescribed by

different ontologies

owl:sameAs

Jérôme Euzenat Interlinking the web of data: challenges and solutions 21 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Data matching with different ontologies(explicit alignment)

o o ′

URI1 URI2

A

Resource matchingof datasetsdescribed by

different ontologies

owl:sameAs

Jérôme Euzenat Interlinking the web of data: challenges and solutions 21 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Data matching with different ontologies(explicit alignment)

o o ′

URI1 URI2

A

Resource matchingof datasetsdescribed by

different ontologies

owl:sameAs

Jérôme Euzenat Interlinking the web of data: challenges and solutions 21 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Data matching with different ontologies(explicit alignment)

o o ′

URI1 URI2

A

Resource matchingof datasetsdescribed by

different ontologies

owl:sameAs

Jérôme Euzenat Interlinking the web of data: challenges and solutions 21 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

INSEE and NUTS: ontology alignment

Territoire_FR

Pays

Region

Departement

Arrondissement

Commune

codenom

chef-lieusubdivision

integer

string

Region

Country

NUTSRegion

LAURegion

name

level

codehasSubRegion

=

≤≤

=

Jérôme Euzenat Interlinking the web of data: challenges and solutions 22 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

INSEE and NUTS: ontology alignment

Territoire_FR

Pays

Region

Departement

Arrondissement

Commune

codenom

chef-lieusubdivision

integer

string

Region

Country

NUTSRegion

LAURegion

name

level

codehasSubRegion

=

≤≤

=

Jérôme Euzenat Interlinking the web of data: challenges and solutions 22 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Ontology alignment

+ Ontology alignment restore the benefits of common ontology: it helpsfocussing the search;

– It is not exact science! (but alignments may be available);+ Ontology alignment and data linking reinforce each others.

Jérôme Euzenat Interlinking the web of data: challenges and solutions 23 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

A simple algorithm

I Find matching concepts [concept matching];I For each of them, determine matching properties based on the similarity

between their values in both datasets [property matching];I From them find property combinations identifying corresponding entities

[key extraction];I Link corresponding entities [link generation].

For instance, nom/RegionINSEE ⊆ name/NUTSRegionNUTS and moreoverthey are unambiguous.

Jérôme Euzenat Interlinking the web of data: challenges and solutions 24 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Simple alignments are not sufficient

Territoire_FR

Region

Departement

Commune

nom

DEP_75

nom

COM_75056

nom

Region

NUTSRegion

name

FR101name

Paris

=

=

=

≤≤

=

=

=

Jérôme Euzenat Interlinking the web of data: challenges and solutions 25 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Simple alignments are not sufficient

Territoire_FR

Region

Departement

Commune

nom

DEP_75

nom

COM_75056

nom

Region

NUTSRegion

name

FR101name

Paris

=

=

=

≤≤

=

=

=

Jérôme Euzenat Interlinking the web of data: challenges and solutions 25 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Simple alignments are not sufficient

Territoire_FR

Region

Departement

Commune

nom

DEP_75

nom

COM_75056

nom

Region

NUTSRegion

name

FR101name

Paris

=

=

=

≤≤

=

=

=

Jérôme Euzenat Interlinking the web of data: challenges and solutions 25 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Expressive alignments are necessary

Region

NUTSRegion

level

hasParentRegion

2 =

FR1=

=

subdivision hasSubRegion=

nom name=

Jérôme Euzenat Interlinking the web of data: challenges and solutions 26 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Expressive alignments are necessary

Region

NUTSRegion

level

hasParentRegion

2 =

FR1=

=

subdivision hasSubRegion=

nom name=

Jérôme Euzenat Interlinking the web of data: challenges and solutions 26 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Expressive alignments are necessary

Region

NUTSRegion

level

hasParentRegion

2 =

FR1=

=

subdivision hasSubRegion=

nom name=

Jérôme Euzenat Interlinking the web of data: challenges and solutions 26 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Query generation

SELECT ?rPREFIX insee: <http://rdf.insee.fr/ontologie-geo-2006.rdf#>PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>FROM <http://rdf.insee.fr/geo/regions-2011.rdf>WHERE {

?r rdf:type insee:Region .}

SELECT ?nPREFIX nuts: <http://ec.europa.eu/eurostat/ramon/ontologies/geographic.rdf#>PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>FROM <http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/>WHERE {

?n rdf:type nuts:NUTSRegion .?n nuts:level 2ˆˆxsd:int .?n nuts:hasParentRegion nuts:FR1 .

}

Jérôme Euzenat Interlinking the web of data: challenges and solutions 27 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Query generation

CONSTRUCT { ?r owl:sameAs ?n . }PREFIX insee: <http://rdf.insee.fr/ontologie-geo-2006.rdf#>PREFIX nuts: <http://ec.europa.eu/eurostat/ramon/ontologies/geographic.rdf#>PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>FROM <http://rdf.insee.fr/geo/regions-2011.rdf>FROM <http://ec.europa.eu/eurostat/ramon/rdfdata/nuts2008/>WHERE {

?r rdf:type insee:Region .?r insee:nom ?l .?n rdf:type nuts:NUTSRegion .?n nuts:name ?l .?n nuts:level 2ˆˆxsd:int .?n nuts:hasParentRegion nuts:FR1 .

}

Jérôme Euzenat Interlinking the web of data: challenges and solutions 28 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Outline

Linked data

Methods for data interlinking

General framework for data interlinking

Conclusions

Jérôme Euzenat Interlinking the web of data: challenges and solutions 29 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

What does this mean?

I Ontology alignments are schema-level expression of correspondences;I They are useful for focussing the search;I Expressive alignments are necessary;I They can be turned into SPARQL-based link generators.

but it is also necessary to express instance level constraints:I for converting data (e.g., mph vs. m/s);I for expressing matching constraint on data (e.g., similarity).

Jérôme Euzenat Interlinking the web of data: challenges and solutions 30 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

General framework

o o ′

URI1 URI2

Ontology matching

A

Data interlinking

owl:sameAs

Jérôme Euzenat Interlinking the web of data: challenges and solutions 31 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

General framework

o o ′

URI1 URI2

Ontology matching

A

Data interlinking

owl:sameAs

Jérôme Euzenat Interlinking the web of data: challenges and solutions 31 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

General framework

o o ′

URI1 URI2

Ontology matching

A

Data interlinking

owl:sameAs

Jérôme Euzenat Interlinking the web of data: challenges and solutions 31 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

General framework

o o ′

URI1 URI2

Ontology matching

A

Data interlinking

owl:sameAs

Jérôme Euzenat Interlinking the web of data: challenges and solutions 31 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

General framework

o o ′

URI1 URI2

Ontology matching

A

Data interlinking

owl:sameAs

Jérôme Euzenat Interlinking the web of data: challenges and solutions 31 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Six interlinking tools

RKB-CRS Co-reference resolution for the RKB knowledge base.LD-mapper Linking tool for the Music Ontology.ODD Linker SQL-based linker.

RDF-AI Dataset linker and merger.Silk Linking script engine based on explicit linking description.

KnoFuss Alignment based linker.

and others: ObjectCoRef, LN2R, LIMES. . .

http://melinda.inrialpes.fr

Jérôme Euzenat Interlinking the web of data: challenges and solutions 32 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Six interlinking tools

o o ′

URI1 URI2

owl:sameAs

Data interlinkingODD-Linker, RKB-CRS, LD-Mapper

ASilk, RDF-AI

Ontology matchingKnoFuss

Jérôme Euzenat Interlinking the web of data: challenges and solutions 33 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Six interlinking tools

o o ′

URI1 URI2

owl:sameAs

Data interlinkingODD-Linker, RKB-CRS, LD-Mapper

ASilk, RDF-AI

Ontology matchingKnoFuss

Jérôme Euzenat Interlinking the web of data: challenges and solutions 33 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Six interlinking tools

o o ′

URI1 URI2

owl:sameAs

Data interlinkingODD-Linker, RKB-CRS, LD-Mapper

ASilk, RDF-AI

Ontology matchingKnoFuss

Jérôme Euzenat Interlinking the web of data: challenges and solutions 33 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Six interlinking tools

o o ′

URI1 URI2

owl:sameAs

Data interlinkingODD-Linker, RKB-CRS, LD-Mapper

ASilk, RDF-AI

Ontology matchingKnoFuss

Jérôme Euzenat Interlinking the web of data: challenges and solutions 33 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Directions

I Combination between ontology matching and data interlinking;I Explicit use of alignments or interlinking scripts;I Data-driven algorithms (database key computation).I Domain-specific techniques (geographic data).

Jérôme Euzenat Interlinking the web of data: challenges and solutions 34 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Outline

Linked data

Methods for data interlinking

General framework for data interlinking

Conclusions

Jérôme Euzenat Interlinking the web of data: challenges and solutions 35 / 38

Linked dataMethods for data interlinking

General framework for data interlinkingConclusions

Conclusions

I A large part of linked data added value is in links;I They may not be easy to find;I Many techniques are available for automating interlinking;I Having a general framework may help integrating them.

Jérôme Euzenat Interlinking the web of data: challenges and solutions 36 / 38

Questions?

http://exmo.inrialpes.fr

Jerome . Euzenat @ inria . fr