revealing the conceptual schemas of rdf datasets

30
Revealing the Conceptual Schemas of RDF Datasets Subhi Issa, Pierre-Henri Paris, Fayçal Hamdi, Samira Si-Said Cherfi CEDRIC Lab Conservatoire National des Arts et Métiers, Paris, France June 6th, 2019

Upload: others

Post on 24-Mar-2022

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Revealing the Conceptual Schemas of RDF Datasets

Revealing the Conceptual Schemas of RDFDatasets

Subhi Issa, Pierre-Henri Paris, Fayçal Hamdi, Samira Si-Said Cherfi

CEDRIC LabConservatoire National des Arts et Métiers, Paris, France

June 6th, 2019

Page 2: Revealing the Conceptual Schemas of RDF Datasets

Introduction Approach Prototype Conclusion

Context

Several datasets publishedaccording to the Linked Dataprinciples

DBpediaYAGOWikidata

Data is from a variety of sources and of varying quality andthus not easy to trust

There is a real need for developing suitable methods andtechniques to better exploit web dataConceptual modeling is widely recognized as a mean forabstraction and semantics highlighting

S. Cherfi (CNAM) Revealing the Conceptual Schemas of RDF Datasets June 6th, 2019 2 / 25

Page 3: Revealing the Conceptual Schemas of RDF Datasets

Introduction Approach Prototype Conclusion

Conceptual modeling and Linked Open Data

Conceptual ModelingInitiated by business needsControlled by modelersExplicit ”user”requirementsQuality is controlled bythe design process

Linked Open Data contextInitiated by user’s moodsDriven by dataInexistant or hidenintentionsQuality is not a subject

S. Cherfi (CNAM) Revealing the Conceptual Schemas of RDF Datasets June 6th, 2019 3 / 25

Page 4: Revealing the Conceptual Schemas of RDF Datasets

Introduction Approach Prototype Conclusion

Motivating Example

Example

SELECT * WHERE {?actor rdf:type dbo:Scientist .?actor foaf:name ?name .?actor dbo:birthDate ?birthDate .?actor dbo:birthPlace ?birthPlace .}

Needs to know properties of Scientist (Schema)Brings only scientists having values for all properties(completeness)

S. Cherfi (CNAM) Revealing the Conceptual Schemas of RDF Datasets June 6th, 2019 4 / 25

Page 5: Revealing the Conceptual Schemas of RDF Datasets

Introduction Approach Prototype Conclusion

What is completeness?

In relational databases : percentage of non null values

Consequently, we need a reference schema

A reference scientist schema (ontology) could be:

Scientist_Schema = {Properties on Scientist} ∪{Properties on Person} ∪ {Properties on Agent} ∪{Properties on Thing}

such that: Scientist v Person v Agent v Thing

S. Cherfi (CNAM) Revealing the Conceptual Schemas of RDF Datasets June 6th, 2019 5 / 25

Page 6: Revealing the Conceptual Schemas of RDF Datasets

Introduction Approach Prototype Conclusion

What is completeness?

In relational databases : percentage of non null valuesConsequently, we need a reference schema

A reference scientist schema (ontology) could be:

Scientist_Schema = {Properties on Scientist} ∪{Properties on Person} ∪ {Properties on Agent} ∪{Properties on Thing}

such that: Scientist v Person v Agent v Thing

S. Cherfi (CNAM) Revealing the Conceptual Schemas of RDF Datasets June 6th, 2019 5 / 25

Page 7: Revealing the Conceptual Schemas of RDF Datasets

Introduction Approach Prototype Conclusion

What is completeness?

In relational databases : percentage of non null valuesConsequently, we need a reference schema

A reference scientist schema (ontology) could be:

Scientist_Schema = {Properties on Scientist} ∪{Properties on Person} ∪ {Properties on Agent} ∪{Properties on Thing}

such that: Scientist v Person v Agent v Thing

S. Cherfi (CNAM) Revealing the Conceptual Schemas of RDF Datasets June 6th, 2019 5 / 25

Page 8: Revealing the Conceptual Schemas of RDF Datasets

Introduction Approach Prototype Conclusion

What is completeness?

In relational databases : percentage of non null valuesConsequently, we need a reference schema

A reference scientist schema (ontology) could be:

Scientist_Schema = {Properties on Scientist} ∪{Properties on Person} ∪ {Properties on Agent} ∪{Properties on Thing}

such that: Scientist v Person v Agent v Thing

S. Cherfi (CNAM) Revealing the Conceptual Schemas of RDF Datasets June 6th, 2019 5 / 25

Page 9: Revealing the Conceptual Schemas of RDF Datasets

Introduction Approach Prototype Conclusion

Comp(Albert_Einstein) =|Properties on Albert_Einstein|

|Scientist_Schema|

=21664

= 4, 21%

Is this schema relevant?

S. Cherfi (CNAM) Revealing the Conceptual Schemas of RDF Datasets June 6th, 2019 6 / 25

Page 10: Revealing the Conceptual Schemas of RDF Datasets

Introduction Approach Prototype Conclusion

Comp(Albert_Einstein) =|Properties on Albert_Einstein|

|Scientist_Schema|

=21664

= 4, 21%

Is this schema relevant?

S. Cherfi (CNAM) Revealing the Conceptual Schemas of RDF Datasets June 6th, 2019 6 / 25

Page 11: Revealing the Conceptual Schemas of RDF Datasets

Introduction Approach Prototype Conclusion

The approach overview: a scratch card like

Goal

An approach guiding the mining of a schema meeting userrequirements

A completeness based explorationAn incremental processConsidering user requirements (the human in the loop)

S. Cherfi (CNAM) Revealing the Conceptual Schemas of RDF Datasets June 6th, 2019 7 / 25

Page 12: Revealing the Conceptual Schemas of RDF Datasets

Introduction Approach Prototype Conclusion

Conceptual schemas derivation

S. Cherfi (CNAM) Revealing the Conceptual Schemas of RDF Datasets June 6th, 2019 8 / 25

Page 13: Revealing the Conceptual Schemas of RDF Datasets

Introduction Approach Prototype Conclusion

Completeness Specification

D a triple (C , IC ,P), where:C the set of categories (e.g., Actor, City)IC the set of instances for categories in CP = {p1, p2, ..., pn} the set of properties (e.gresidence(Person,Place))

T = {t1, t2, ..., tm} a set of transactions with:∀k , 1 ≤ k ≤ m : tk ⊆ P vector of transactions over PE (tk) be the set of items in transaction tkEach transaction is a set of properties used in the descriptionof the instances of the subset I ′ = {i1, i2, ..., im} with I ′ ⊆ IC

Completeness

We consider CP the completeness of I ′ against properties used inthe description of each of its instances

S. Cherfi (CNAM) Revealing the Conceptual Schemas of RDF Datasets June 6th, 2019 9 / 25

Page 14: Revealing the Conceptual Schemas of RDF Datasets

Introduction Approach Prototype Conclusion

Example

Subject Predicate ObjectThe_Godfather director Francis_Ford_CoppolaThe_Godfather musicComposer Nino_Rota

Goodfellas director Martin_ScorseseGoodfellas editing Thelma_SchoonmakerTrue_Lies director James_CameronTrue_Lies editing Conrad_Buff_IVTrue_Lies musicComposer Brad_Fiedel

Instance TransactionThe_Godfather director, musicComposerGoodfellas director, editingTrue_Lies director, editing, musicComposer

S. Cherfi (CNAM) Revealing the Conceptual Schemas of RDF Datasets June 6th, 2019 10 / 25

Page 15: Revealing the Conceptual Schemas of RDF Datasets

Introduction Approach Prototype Conclusion Properties mining Completeness calculation

The Mining-based Approach

The Mining-based Approach includes two steps:

1 Properties mining: Apply the well known FP-growthalgorithm for mining maximal frequent itemsetsMFP

2 Completeness calculation: Use the apparition frequency ofitems (properties) inMFP, to give each of them a weightand calculate the completeness of each transaction (regardingthe presence or absence of properties)

S. Cherfi (CNAM) Revealing the Conceptual Schemas of RDF Datasets June 6th, 2019 11 / 25

Page 16: Revealing the Conceptual Schemas of RDF Datasets

Introduction Approach Prototype Conclusion Properties mining Completeness calculation

Properties mining

Given D(C , IC ,P) and I ′ a subset of instances with I ′ ⊆ IC1 Initialize T = φ,MFP = φ

2 For each i ∈ I ′ we generate a transaction t

3 Compute the set of frequent patterns FP from the transactionvector T .

Definition (Pattern)

Let T be a set of transactions. A pattern P̂ is a sequence ofproperties shared by one or several transactions t in T .

S. Cherfi (CNAM) Revealing the Conceptual Schemas of RDF Datasets June 6th, 2019 12 / 25

Page 17: Revealing the Conceptual Schemas of RDF Datasets

Introduction Approach Prototype Conclusion Properties mining Completeness calculation

4 Use the FP-growth algorithm to generate the frequent patternsFP

5 Reduce the set of FP by generating a subset containing only"Maximal" patterns

Definition (MFP)

Let P̂ be a frequent pattern. P̂ is maximal if none of its propersuperset is frequent. We define the set of Maximal FrequentPatternsMFP as:

MFP = {P̂ ∈ FP | ∀P̂ ′ ) P̂ :|T (P̂ ′)||T |

< ξ}

S. Cherfi (CNAM) Revealing the Conceptual Schemas of RDF Datasets June 6th, 2019 13 / 25

Page 18: Revealing the Conceptual Schemas of RDF Datasets

Introduction Approach Prototype Conclusion Properties mining Completeness calculation

Example

Instance TransactionThe_Godfather director, musicComposerGoodfellas director, editingTrue_Lies director, editing, musicComposer

Let ξ = 60% and the set of frequent patterns

FP = {{director}, {musicComposer}, {editing}, {director ,musicComposer},{director , editing}}

The MFP set would be:

MFP = {{director ,musicComposer}, {director , editing}}

S. Cherfi (CNAM) Revealing the Conceptual Schemas of RDF Datasets June 6th, 2019 14 / 25

Page 19: Revealing the Conceptual Schemas of RDF Datasets

Introduction Approach Prototype Conclusion Properties mining Completeness calculation

Completeness calculation

Given theMFP set:

Definition (Completeness CP)Let I ′ be a subset of instances, T the set of transactions constructedfrom I ′, andMFP a set of maximal frequent pattern. Thecompleteness of I ′ corresponds to the completeness of its transactionvector T obtained by calculating the average of the completeness of Tregarding each pattern inMFP. Therefore, we define the completenessCP of a subset of instances I ′ as follows:

CP(I ′) = 1|T |

|T |∑k=1

|MFP|∑j=1

δ(E (tk), P̂j)

|MFP|(1)

such that: P̂j ∈MFP, and

δ(E (tk), P̂j) =

{1 if P̂j ⊂ E (tk)0 otherwise

S. Cherfi (CNAM) Revealing the Conceptual Schemas of RDF Datasets June 6th, 2019 15 / 25

Page 20: Revealing the Conceptual Schemas of RDF Datasets

Introduction Approach Prototype Conclusion Properties mining Completeness calculation

Example

LetMFP = {{director ,musicComposer}, {director , editing}}where both itemsets have a support of 60%.The The completeness would be:

CP(I ′) = (2 ∗ (1/2) + (2/2))/3 = 0.67

we have 3 transactions The_Godfather, Goodfellas andTrue_Lies and 2 MFPmore frequent properties have a higer wheight (director)only co-occurente properties are considered

S. Cherfi (CNAM) Revealing the Conceptual Schemas of RDF Datasets June 6th, 2019 16 / 25

Page 21: Revealing the Conceptual Schemas of RDF Datasets

Introduction Approach Prototype Conclusion Properties mining Completeness calculation

Algorithm 1 Completeness calculation

Input: D, I ′, ξOutput: CP(I ′)for each

doti =

∣∣p1 p2 . . . pn∣∣

T = T + ti. Properties miningMFP = Maximal(FP-growth(T , ξ)). Using equation 1return CP(I ′) = CalculateCompleteness(I ′, T ,MFP)

S. Cherfi (CNAM) Revealing the Conceptual Schemas of RDF Datasets June 6th, 2019 17 / 25

Page 22: Revealing the Conceptual Schemas of RDF Datasets

Introduction Approach Prototype Conclusion

LOD-CM web service

Browse dataset without examining data in detailChoose the dataset that will be most suitable for its intendeduseFacilitate data browsing (based on user requirements):

Inheritance relationshipRelations between classesCompleteness value of each property

S. Cherfi (CNAM) Revealing the Conceptual Schemas of RDF Datasets June 6th, 2019 18 / 25

Page 23: Revealing the Conceptual Schemas of RDF Datasets

Introduction Approach Prototype Conclusion

Experimental setup

DBpedia version 2016-10English edition1.1 billion RDF triples468 classes1378 properties

Data HDT dumpsImplemented in JAVA using Jena libraryPlantUML tool to visualize diagrams

S. Cherfi (CNAM) Revealing the Conceptual Schemas of RDF Datasets June 6th, 2019 19 / 25

Page 24: Revealing the Conceptual Schemas of RDF Datasets

Introduction Approach Prototype Conclusion

LOD-CMUser interface: http://cedric.cnam.fr/lod-cm

S. Cherfi (CNAM) Revealing the Conceptual Schemas of RDF Datasets June 6th, 2019 20 / 25

Page 25: Revealing the Conceptual Schemas of RDF Datasets

Introduction Approach Prototype Conclusion

LOD-CMExample: Class name: Film, Completeness: 50%

S. Cherfi (CNAM) Revealing the Conceptual Schemas of RDF Datasets June 6th, 2019 21 / 25

Page 26: Revealing the Conceptual Schemas of RDF Datasets

Introduction Approach Prototype Conclusion

LOD-CMExample: Class name: Film, Completeness: 50% - First iteration

Properties and relationshipswith Completeness > 50%For the next step: zoom inon Artist

S. Cherfi (CNAM) Revealing the Conceptual Schemas of RDF Datasets June 6th, 2019 22 / 25

Page 27: Revealing the Conceptual Schemas of RDF Datasets

Introduction Approach Prototype Conclusion

LOD-CMExample: Class name: Film, Completeness: 50% - First iteration

Properties and relationshipswith Completeness > 50%For the next step: zoom inon Artist

S. Cherfi (CNAM) Revealing the Conceptual Schemas of RDF Datasets June 6th, 2019 22 / 25

Page 28: Revealing the Conceptual Schemas of RDF Datasets

Introduction Approach Prototype Conclusion

LOD-CMExample: Class name: Artist, Completeness: 50%

S. Cherfi (CNAM) Revealing the Conceptual Schemas of RDF Datasets June 6th, 2019 23 / 25

Page 29: Revealing the Conceptual Schemas of RDF Datasets

Introduction Approach Prototype Conclusion

Conclusion

Reveals conceptual schemas from RDF data sourcesConsiders user-specified threshold and exploration requirementsProvides enriched schemas with completeness values

PerspectivesInvestigate the effectiveness of the approach against additionalLinked Open Data datasets such as YAGO, Wikidata, etc.Allow the user to compare conceptual schemas from differentdatasetsExtend user requirements : extra quality criteria, user queries...

S. Cherfi (CNAM) Revealing the Conceptual Schemas of RDF Datasets June 6th, 2019 24 / 25

Page 30: Revealing the Conceptual Schemas of RDF Datasets

Introduction Approach Prototype Conclusion

Thank You!Questions?

S. Cherfi (CNAM) Revealing the Conceptual Schemas of RDF Datasets June 6th, 2019 25 / 25