Learning to Assess Linked Data Relationships Using Genetic Programming

Ilaria Tiddi, Mathieu d’Aquin, Enrico Motta (@IlaTiddi)
15th International Semantic Web Conference (ISWC 2016), 20.10.2016


Research Problem
Automatically discover what makes a strong relationship between two entities in (the Web of) Linked Data.

• relationship: a semantic path between two entities
• automatically: through graph search techniques

[Figure: example graph around ASongOfIceAndFire(novel) and GoT. Other nodes: GeorgeRRMartin, UnitedStates, ASongOfIceAndFire(topic), Fantasy. Edge labels: :author, :born, :airedIn, dc:subject. Example path: ASongOfIceAndFire(novel) dc:subject ASongOfIceAndFire(topic) dc:subject GoT]

Research Problem
Problem
• Entities/properties in a path might come from a number of different, unknown data sources

Solution (the easy one)
• indexing & preprocessing of a portion of Linked Data
• requires a priori knowledge and computational resources

Research Problem
Solution
• Find paths between entities through Link Traversal
• Incremental and agnostic graph exploration
• Perform uninformed (or blind) search over Linked Data

[Figure: step-by-step traversal starting from ASongOfIceAndFire(novel) and GoT, progressively discovering GeorgeRRMartin (:author), ASongOfIceAndFire(topic) and Fantasy (dc:subject), and UnitedStates (:born, :airedIn)]
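The link-traversal idea can be illustrated with a minimal sketch. It assumes a tiny in-memory edge list in place of the Web of Linked Data; the exact edges are a guess based on the labels in the slide figure, and `dereference` and `find_paths` are hypothetical helper names, not part of the authors' system. In a real setting, `dereference` would fetch an entity's description over HTTP at the moment the search reaches it.

```python
from collections import deque

# Assumed toy edges (subject, property, object); the real graph is the Web of Linked Data.
EDGES = [
    ("ASongOfIceAndFire(novel)", ":author", "GeorgeRRMartin"),
    ("ASongOfIceAndFire(novel)", "dc:subject", "ASongOfIceAndFire(topic)"),
    ("ASongOfIceAndFire(novel)", "dc:subject", "Fantasy"),
    ("GeorgeRRMartin", ":born", "UnitedStates"),
    ("GoT", "dc:subject", "ASongOfIceAndFire(topic)"),
    ("GoT", "dc:subject", "Fantasy"),
    ("GoT", ":airedIn", "UnitedStates"),
]

def dereference(entity):
    """Hypothetical lookup: (property, neighbour) pairs for one entity, in both directions."""
    links = []
    for s, p, o in EDGES:
        if s == entity:
            links.append((p, o))
        elif o == entity:
            links.append((p, s))
    return links

def find_paths(source, target, max_hops=3):
    """Blind breadth-first enumeration of simple paths with at most max_hops edges."""
    found = []
    queue = deque([[source]])  # a path alternates entity, property, entity, ...
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            found.append(path)
            continue
        if len(path) // 2 >= max_hops:  # number of edges already on this path
            continue
        for prop, neighbour in dereference(path[-1]):
            if neighbour not in path[::2]:  # never revisit an entity
                queue.append(path + [prop, neighbour])
    return found

paths = find_paths("ASongOfIceAndFire(novel)", "GoT")
for p in paths:
    print(" ".join(p))
```

Nothing is indexed in advance: the graph is explored incrementally, one lookup per visited entity. The search returns every alternative path it finds, which is why a cost-function is needed to rank them.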

Research Hypothesis
Problem
Uninformed searches require a cost-function to explore the graph following the most promising paths

Hypothesis
Linked Data information can drive a cost-function that detects strong relationships between entities

Research Questions
What makes a path strong?
• Which topological or semantic features of nodes/edges?
  ✗ e.g. length of a path? Entities of different datasets are connected by many paths of similar length

How can we use Linked Data to assess strong relationships?
• Which information do we need?
• Can we use structural features of the graph?

Challenges
• find topological/semantic features to detect strong relationships
• combine these features in a cost-function
• perform an effective blind search

Proposed Approach
Automatically learn a cost-function to detect strong relationships between Linked Data entities using a supervised method (Genetic Programming)

Given
• a set of topological/semantic characteristics of the Linked Data graph
• a benchmark of human-evaluated relationship paths

Identify the cost-function for a blind search that performs best in ranking sets of alternative relationship paths

Proposed Approach
Genetic Programming: why?
• Flexible learning process
• Suitable for wide search spaces (such as Linked Data)
• Results assessed with a fitness (scores vs. functions)
• Human-understandable results
• Easy to integrate in a graph search

Genetic Programming
Programs (solutions for a problem)
• trees of primitives
• functions: internal nodes (mathematical or logical operations)
• terminals: leaf nodes (constants or variables)

Fitness function (evaluation)
• how well the program solves the problem

Genetic operations (evolution)
• reproduction
• crossover from two parents
• mutation from one parent

Termination condition
• maximum number of evolutions
• a desired fitness
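To make the notions of program tree, crossover, and mutation concrete, here is a minimal sketch. It assumes programs are nested tuples such as `("add", "x", 2.0)`. The function set, the terminal names `x` and `y`, and every helper name are illustrative choices, not the representation used in the talk. Division and log are "protected" (they return a fallback value instead of failing), a common convention so that any randomly generated tree can be evaluated.

```python
import math
import random

# Function set: name -> (arity, implementation).
FUNCTIONS = {
    "add": (2, lambda a, b: a + b),
    "mul": (2, lambda a, b: a * b),
    "div": (2, lambda a, b: a / b if b != 0 else 1.0),   # protected division
    "log": (1, lambda a: math.log(a) if a > 0 else 0.0),  # protected log
}
TERMINALS = ["x", "y", 1.0, 2.0]  # variables and constants (illustrative)

def random_tree(rng, depth=3):
    """Grow a random program: a terminal, or (function_name, child, ...)."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    name = rng.choice(sorted(FUNCTIONS))
    arity = FUNCTIONS[name][0]
    return (name,) + tuple(random_tree(rng, depth - 1) for _ in range(arity))

def evaluate(tree, env):
    """Internal nodes apply a function; leaves are variables (looked up in env) or constants."""
    if isinstance(tree, tuple):
        fn = FUNCTIONS[tree[0]][1]
        return fn(*(evaluate(child, env) for child in tree[1:]))
    return env[tree] if isinstance(tree, str) else tree

def subtrees(tree, path=()):
    """Yield (path, subtree) for every node; a path is a tuple of child indexes."""
    yield path, tree
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtrees(child, path + (i,))

def replace(tree, path, new):
    """Return a copy of tree with the node at path swapped for new."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace(tree[i], path[1:], new),) + tree[i + 1:]

def crossover(rng, mother, father):
    """Two parents: graft a random subtree of father onto a random point of mother."""
    point, _ = rng.choice(list(subtrees(mother)))
    _, donor = rng.choice(list(subtrees(father)))
    return replace(mother, point, donor)

def mutate(rng, parent):
    """One parent: replace a random subtree with a freshly grown one."""
    point, _ = rng.choice(list(subtrees(parent)))
    return replace(parent, point, random_tree(rng, depth=2))

program = ("div", ("log", "x"), ("add", "y", 2.0))  # log(x) / (y + 2)
value = evaluate(program, {"x": math.e, "y": 2.0})
```

Because programs are immutable tuples, crossover and mutation return new trees and leave the parents untouched, which keeps reproduction (plain copying) trivial.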

Genetic Programming
Procedure
• Create a random population of programs based on the primitives
• Evolve the population until an ideal situation is met

[Figure: Italian-food analogy, with candidate dishes (canned spaghetti, meatballs spaghetti, tomato sauced penne, tomato sauced spaghetti) marked ✗ or ✔]
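The create-then-evolve procedure can be sketched as a generic loop. To keep it short, the individuals below are pairs `(a, b)` read as the program `a * x + b` rather than full program trees, and the selection scheme (keep the fitter half, always copy the best) and the 0.7 crossover rate are arbitrary assumptions. None of these settings come from the talk.

```python
import random

def evolve(population, fitness, crossover, mutate, rng,
           max_generations=50, target_fitness=None):
    """Evolve until a generation cap or a desired fitness is reached."""
    for _ in range(max_generations):
        ranked = sorted(population, key=fitness, reverse=True)
        if target_fitness is not None and fitness(ranked[0]) >= target_fitness:
            break
        parents = ranked[: max(2, len(ranked) // 2)]  # keep the fitter half
        offspring = [ranked[0]]                       # reproduction: copy the best as-is
        while len(offspring) < len(population):
            if rng.random() < 0.7:
                mother, father = rng.sample(parents, 2)
                offspring.append(crossover(rng, mother, father))
            else:
                offspring.append(mutate(rng, rng.choice(parents)))
        population = offspring
    return max(population, key=fitness)

# Toy problem: recover y = 3x + 2 from samples.
SAMPLES = [(x, 3 * x + 2) for x in range(5)]

def fitness(individual):
    a, b = individual
    return -sum((a * x + b - y) ** 2 for x, y in SAMPLES)  # 0 is a perfect fit

def crossover(rng, mother, father):
    return (mother[0], father[1])  # slope from one parent, intercept from the other

def mutate(rng, parent):
    a, b = parent
    return (a + rng.uniform(-1, 1), b + rng.uniform(-1, 1))

rng = random.Random(42)
start = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(20)]
best = evolve(start, fitness, crossover, mutate, rng, max_generations=100)
```

Copying the best individual into each new generation means the best fitness never decreases. The two stopping rules match the termination conditions listed earlier: a maximum number of generations, or a desired fitness.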

Genetic Programming
Given
• a starting population of randomly generated cost-functions
• sets of alternative paths between two Linked Data entities, ranked by humans

Determine how good each cost-function is at ranking paths compared to the human evaluators

Genetic Programming
Primitives

Constant terminals
• Z = {0, 1000}

Aggregated terminals
• Topological edge weights: indegree, outdegree, constant weight
• Semantic edge weights: usage of namespaces, taxonomies, vocabularies
• Aggregators along the path: sum, avg, min, max

Functions (combining different information)
• Math operations: addition, multiplication, division, log
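As a rough illustration of how aggregated terminals might be computed, the sketch below takes per-edge weights along one path and applies each aggregator to each weight. The weight names `indegree` and `outdegree`, the sample values, and the final combination are all invented for the example. Only the aggregators (sum, avg, min, max) and the math operations come from the slide.

```python
import math
from statistics import mean

AGGREGATORS = {"sum": sum, "avg": mean, "min": min, "max": max}

def aggregated_terminals(edge_weights):
    """Map {weight_name: [value per edge]} to {"<aggregator>.<weight_name>": value}."""
    return {
        f"{agg}.{name}": fn(values)
        for name, values in edge_weights.items()
        for agg, fn in AGGREGATORS.items()
    }

# Hypothetical path with three edges and two topological weights per edge.
path_weights = {"indegree": [12, 3, 40], "outdegree": [5, 7, 2]}
terminals = aggregated_terminals(path_weights)

# A made-up program combining terminals with log, division and addition.
score = math.log(terminals["max.indegree"]) / (terminals["avg.outdegree"] + 1)
```

Each path, whatever its length, is thereby reduced to the same fixed set of named values, so one evolved expression can score paths of different lengths.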

Genetic Programming
Fitness
Normalised Discounted Cumulative Gain (nDCG)
• (IR) quality of rankings provided by search engines based on the graded relevance of the returned documents
• how good a program is at ranking paths based on human ranks
• avg(nDCG) across the dataset
• length penalty

Genetic operations
• Reproduction
• Crossover
• Mutation

Learning
• Training set + test set
• Keep the fittest program for each run on the training set
• Test them (discard inconsistent)
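The nDCG-based fitness can be sketched as follows. It assumes the common DCG form, rel / log2(rank + 1), and that a higher program score means a stronger path. The talk does not say which DCG variant was used, and the length penalty is left out because its form is not given. `ndcg` and `average_ndcg` are hypothetical helper names.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of relevance grades listed in ranked order."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(scores, human_relevance):
    """Rank paths by program score (highest first) and compare with the ideal ranking."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [human_relevance[i] for i in order]
    ideal = dcg(sorted(human_relevance, reverse=True))
    return dcg(ranked) / ideal if ideal > 0 else 0.0

def average_ndcg(program, dataset):
    """Fitness: mean nDCG over (paths, human_relevance) pairs; program scores one path."""
    values = [ndcg([program(p) for p in paths], rels) for paths, rels in dataset]
    return sum(values) / len(values)

# Three alternative paths judged 2 (highly relevant), 1 and 0 (not relevant).
perfect = ndcg([0.9, 0.5, 0.1], [2, 1, 0])   # program agrees with the judges
reversed_ = ndcg([0.1, 0.5, 0.9], [2, 1, 0])  # program ranks them backwards
```

A program that orders the paths exactly as the judges did scores 1.0; any other ordering scores lower, with mistakes near the top of the ranking costing the most.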

Experiments
Dataset

Entities (random types from different sources)
• 12,630 events from Yago
• 8,185 people from the VIAF dataset
• 999 movies from the LMDB
• 1,174 countries/capitals from Geonames/the UNESCO dataset

Paths (a set of possible paths between them)
• select a random pair
• bidirectional breadth-first search

Assessment
• 100 pairs (~10 possible paths per pair)
• 8 judges
• from (2) highly relevant to (0) not relevant

[Figure: example path through db:Dina-Korzun, viaf:Dina-Korzun, gn:Europe, gn:United-Kingdom and lmdb:TheSkinGame, with edge labels owl:sameAs, dbo:citizenship, gno:parentFeature and foaf:based_near]

Experiments
Results

Different runs (fitness on training set/test set)
(T) Topological primitives only
(S) Topological + semantic primitives
(N) Topological + namespaces primitives

Run   Best program                                   Fitness TR   Fitness TS
T1    log(log(min.cd × min.cd))/max.cd               0.79         0.79
T2    log(min.cd)/(avg.cd + 87)                      0.77         0.78
T3    min.cd × (min.cd/max.cd)                       0.78         0.72
N1    (log((max.ns/max.cd))/avg.ns) + min.ns         0.82         0.81
N2    ((min.dg/sum.cd)/sum.ou) + min.ns              0.79         0.77
N3    min.ns/(log(max.cd)/avg.ns)                    0.83         0.75
S1    min.ns + (sum.ns/log(log(sum.si)))             0.88         0.83
S2    min.ns + (min.cd/log(log(sum.si)))             0.88         0.86
S3    min.ns + (log(max.in)/log(log(sum.si)))        0.87         0.86

Experiments
Results

Lower performance for T-runs and N-runs

Recurrent terminals
• conditional degree (node degree depending on the RDF triple)
• namespace variety
• number of topic properties (dc:subject/skos:broader/foaf:primaryTopic)

Experiments
Comparative evaluation
Best programs
• automatically learnt
vs. literature functions
• RECAP, RelFinder, Everything Is Connected Engine, Moore et al.
• ad-hoc / handcrafted information theoretical measures

Experiments
Which cost-function?

Interpretation
• pass through nodes with rich node descriptions: higher min_namespaces = higher path score
• not high-level entities / few topic categories: few incoming topic categories = higher path score
• more specific entities (not hubs) for paths with few topic categories: ratio conditional_degree / inTopicCategories

Specific paths are privileged over general paths

Conclusions
Contributions
A measure to detect strong relationships in Linked Data
• can be integrated in uninformed searches over Linked Data (vs. indexing/pre-processing techniques)
• derived empirically through Genetic Programming (vs. domain-specific / handcrafted measures)
• what is important in Linked Data: topological features + little knowledge about the edge vocabulary

Future work
• Integrate the measure in the blind-search process
• Explore more characteristics
• Improve the measure

THANK YOU VERY MUCH (AND DO NOT MESS WITH ITALIAN FOOD)

Questions?

@IlaTiddi