idmesh: graph-based disambiguation of linked data · social websites exported (linked) data...

idMesh: Graph-Based Disambiguation of Linked Data Philippe Cudré-Mauroux -- MIT joint work with Parisa Haghani, Michael Jost, Karl Aberer (EPFL) and Hermann de Meer (U. Passau) April 24, 2009 World Wide Web Conference



idMesh: Graph-Based Disambiguation of Linked Data

Philippe Cudré-Mauroux -- MIT

joint work with Parisa Haghani, Michael Jost, Karl Aberer (EPFL) and Hermann de Meer (U. Passau)

April 24, 2009
World Wide Web Conference


Overview
■ A Web of Resources
  ■ Distributed Naming Game
  ■ Entity Consolidation
■ idMesh Constructs
■ Link-Analysis Framework
■ System Architecture
■ Performance
■ Conclusions & Future Work


A Web of Resources

• Increasingly, the world is modeled as a collection of (interlinked) identifiers
■ Linked Data
■ Semantic Web
■ RESTful services
■ ...

http://data.semanticweb.org/person/philippe-cudre-mauroux
  foaf:made
http://data.semanticweb.org/conference/www/2009/paper/60


Naming & Decentralization

• The great thing about unique identifiers is that there are so many to choose from
■ Decentralized naming game
■ Soaring dimensions in Web 2.0 / 3.0 contexts
  ■ Social websites
  ■ Exported (linked) data
  ■ Automated mash-ups

http://semanticweb.org/id/Philippe_Cudre-Mauroux
http://data.semanticweb.org/person/philippe-cudre-mauroux
http://people.csail.mit.edu/pcm/i
http://lsirpeople.epfl.ch/pcudre/i
http://semanticweb.org/wiki/Special:ExportRDF/Philippe_Cudr%C3%A9-Mauroux
http://tw.rpi.edu/wiki/Special:ExportRDF/Philippe_Cudr%C3%A9-Mauroux
http://wiki.ontoworld.org/index.php/Special:ExportRDF/Philippe_Cudr%C3%A9-Mauroux
http://korrekt.org/index.php/Special:ExportRDF/Philippe_Cudr%C3%A9-Mauroux
http://prauw.cs.vu.nl:8080/flink/graph?profile=http%3A%2F%2Fwww.cs.vu.nl%2F%7Epmika%2Fsocionet%23Philippe%2BCudre-Mauroux
http://www.zoominfo.com/PersonID=402960578
http://www.flickr.com/photos/28735...@N00/
http://www.facebook.com/profile.php?id=1251943...
...

ID Jungle


Entity Consolidation (i)

• A few constructs are increasingly used to consolidate Web identifiers
■ owl:sameAs, XFN rel="me", pipes, etc.

http://data.semanticweb.org/person/philippe-cudre-mauroux
  Same As
http://semanticweb.org/id/Philippe_Cudre-Mauroux


Entity Consolidation (ii)

• Online entity consolidation is a complex game
■ Simple binary constructs are often insufficient
  ■ Social contexts (e.g., professional vs personal entities)
  ■ Granularity (e.g., out-of-date entities)
  ■ Uncertainty (e.g., automatically-generated entities)

http://people.csail.mit.edu/pcm/i ??? http://www.facebook.com/id=1251943...
http://people.csail.mit.edu/pcm/i ??? http://lsirpeople.epfl.ch/pcudre/i
http://people.csail.mit.edu/pcm/i ??? http://www.zoominfo.com/PersonID=402960578


New Twist on an Old Problem

• Well-known problem known as Entity Disambiguation or Resolution
■ Large body of related work
  ■ see paper
• New context
■ Unprecedented scale
■ Networked game
■ Social dimension

➡ central problem impeding all automated, large-scale online data processing endeavors


The idMesh Approach

• idMesh suggests a radically different approach to online entity consolidation that is
■ User-driven
■ Best-effort (probabilistic)
■ Decentralized
• Link-analysis framework based on transitive closures of relationships
■ Emergent semantics
  ■ semantics of data derived through network
  ■ the sum is greater than the parts


idMesh Constructs...

<!-- Base class and base property: every idMesh link relates two entities -->
<rdfs:Class rdf:ID="Entity"/>

<rdf:Property rdf:ID="idMeshProperty">
  <rdfs:domain rdf:resource="#Entity" />
  <rdfs:range rdf:resource="#Entity" />
</rdf:Property>

<!-- Confidence value attached to a (reified) link statement -->
<rdf:Property rdf:ID="LinkConfidence">
  <rdfs:domain rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Statement" />
  <rdfs:range rdf:resource="&xsd;decimal" />
</rdf:Property>

<!-- First level of granularity: entity disambiguation -->
<rdf:Property rdf:ID="EquivalentTo">
  <rdfs:subPropertyOf rdf:resource="#idMeshProperty" />
</rdf:Property>

<rdf:Property rdf:ID="NotEquivalentTo">
  <rdfs:subPropertyOf rdf:resource="#idMeshProperty" />
</rdf:Property>

<!-- Second level of granularity: temporal discrimination -->
<rdf:Property rdf:ID="Predates">
  <rdfs:subPropertyOf rdf:resource="#EquivalentTo" />
</rdf:Property>

<rdf:Property rdf:ID="Postdates">
  <rdfs:subPropertyOf rdf:resource="#EquivalentTo" />
</rdf:Property>

<rdf:Property rdf:ID="Equidates">
  <rdfs:subPropertyOf rdf:resource="#EquivalentTo" />
</rdf:Property>

<!-- Example links; rdf:ID names each statement so it can be annotated -->
<rdf:Description rdf:about="http://www.epfl.ch/">
  <idMesh:NotEquivalentTo rdf:ID="link0001" rdf:resource="http://www.ethz.ch"/>
</rdf:Description>

<rdf:Description rdf:about="http://www.epfl.ch/">
  <idMesh:EquivalentTo rdf:ID="link0002" rdf:resource="http://en.wikipedia.org/wiki/EPFL"/>
</rdf:Description>

<rdf:Description rdf:about="#link0002">
  <idMesh:LinkConfidence rdf:datatype="&xsd;decimal">0.9</idMesh:LinkConfidence>
</rdf:Description>

• Two levels of granularity
• Entity disambiguation
• Temporal discrimination
• Confidence values
• Can encompass previous constructs
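
The sub-property hierarchy of the constructs can be mirrored in a few lines of Python. This is only an illustrative sketch: the `Link` class, its field names, the `implies` helper, and the default confidence of 1.0 are assumptions made here, not part of idMesh. Only the property names and the two example links come from the slide.

```python
from dataclasses import dataclass
from typing import Optional

# Sub-property hierarchy transcribed from the schema on this slide.
SUB_PROPERTY_OF = {
    "EquivalentTo": "idMeshProperty",
    "NotEquivalentTo": "idMeshProperty",
    "Predates": "EquivalentTo",
    "Postdates": "EquivalentTo",
    "Equidates": "EquivalentTo",
}


@dataclass(frozen=True)
class Link:
    """One idMesh statement (hypothetical class and field names)."""
    subject: str
    prop: str
    obj: str
    confidence: float = 1.0  # assumption: links without LinkConfidence count as 1.0


def implies(prop: str, ancestor: str) -> bool:
    """Return True if prop is ancestor or a transitive sub-property of it."""
    current: Optional[str] = prop
    while current is not None:
        if current == ancestor:
            return True
        current = SUB_PROPERTY_OF.get(current)
    return False


# The two example links from the slide.
links = [
    Link("http://www.epfl.ch/", "NotEquivalentTo", "http://www.ethz.ch"),
    Link("http://www.epfl.ch/", "EquivalentTo", "http://en.wikipedia.org/wiki/EPFL", 0.9),
]

# A temporal link is also an equivalence link; a non-equivalence link is not.
print(implies("Predates", "EquivalentTo"))
print(implies("NotEquivalentTo", "EquivalentTo"))
```

Walking the hierarchy this way is what lets the temporal properties be treated as ordinary equivalence links when only entity disambiguation is needed.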


Problem Definition

• Input: series of statements defining a weighted graph of interrelated identifiers
■ no associated contents, attributes, or properties...
• Output: clusters of equivalent identifiers
■ probabilistic, a posteriori network equivalence
■ equivalence based on probabilistic threshold

[Figure: graph over identifiers i1, i2, i3, i4 with links eq1-2, eq1-3, eq1-4, eq2-4, eq3-4, defined by the following statements]

< i1 ≡ 0.9 i2 >
< i1 ≡ 0.9 i3 >
< i1 ≡ 0.9 i4 >
< i2 ≡ 1.0 i4 >
< i3 ≡ 1.0 i4 >
...
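
To make the input/output contract concrete, here is a deliberately naive sketch in Python. It thresholds the stated confidences directly and takes connected components with a union-find. This stands in for, and is much simpler than, the a posteriori probabilistic inference idMesh performs. The function name `cluster` and the threshold values are assumptions chosen for illustration; the five statements are the ones on the slide.

```python
from collections import defaultdict

# Statements from the slide: (identifier, identifier, confidence of equivalence).
statements = [
    ("i1", "i2", 0.9),
    ("i1", "i3", 0.9),
    ("i1", "i4", 0.9),
    ("i2", "i4", 1.0),
    ("i3", "i4", 1.0),
]


def cluster(statements, threshold):
    """Group identifiers joined by links whose confidence reaches threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, confidence in statements:
        root_a, root_b = find(a), find(b)  # also registers both identifiers
        if confidence >= threshold and root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted(sorted(group) for group in groups.values())


print(cluster(statements, 0.8))   # [['i1', 'i2', 'i3', 'i4']]
print(cluster(statements, 0.95))  # [['i1'], ['i2', 'i3', 'i4']]
```

Raising the threshold above 0.9 detaches i1, which shows why the choice of probabilistic threshold determines the clusters that come out.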

[Figure, next build of the same slide: the graph above with individual links marked ✓ or ✗.]


Probabilistic Disambiguation (i)

Definition of two graphs

[Figure: Entity Graph (entities e1, e2, e3, e4 joined by links lk1-2, lk1-3, lk1-4, lk2-4, lk3-4) and Source Graph (sources s1, s2 attached to the links lk1-2, lk1-3, lk1-4, lk2-4, lk3-4 through confidence values c1, c2, c3, c4, c5, c6)]

Trusted Source s1
< e1 ≡ c1 e2 >
< e1 ≡ c2 e3 >
< e1 ≢ c3 e4 >
< e2 ≢ c4 e4 >

Unknown Source s2
< e2 ≡ c5 e4 >
< e3 ≡ c6 e4 >


Probabilistic Disambiguation (ii)

Definition of conditional probability functions relating links & sources

• Transitive closures of link properties (entity graph)
■ ID Equivalence is
  ■ symmetric
  ■ transitive

[Figure: triangle of identifiers ID 1, ID 2, ID 3 whose three links are labeled eq 90%, eq 95%, and non-eq 15%]
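
A small worked example may help show how transitivity couples the three links of such a triangle. The Python sketch below enumerates all truth assignments by brute force and discards those that violate transitivity. This is a toy stand-in for the distributed inference presented later in the talk, not the actual algorithm. Two things are assumptions: which pair of identifiers each label belongs to, and the reading of "non-eq 15%" as a 0.85 prior for equivalence.

```python
from itertools import product

# Prior probability that each link is an equivalence (see assumptions above).
priors = {("ID1", "ID2"): 0.90, ("ID2", "ID3"): 0.95, ("ID1", "ID3"): 0.85}


def posteriors(priors):
    """Exact posteriors for a three-link triangle under transitivity."""
    links = list(priors)
    assert len(links) == 3, "this sketch only handles a single triangle"
    totals = {link: 0.0 for link in links}
    normalizer = 0.0
    for assignment in product([True, False], repeat=3):
        # Exactly two equivalences is inconsistent: transitivity would
        # force the third link to be an equivalence as well.
        if sum(assignment) == 2:
            continue
        weight = 1.0
        for link, is_equivalent in zip(links, assignment):
            weight *= priors[link] if is_equivalent else 1.0 - priors[link]
        normalizer += weight
        for link, is_equivalent in zip(links, assignment):
            if is_equivalent:
                totals[link] += weight
    return {link: totals[link] / normalizer for link in links}


for link, value in posteriors(priors).items():
    print(link, round(value, 3))
```

Under these assumptions all three posteriors end up above their priors (roughly 0.97, 0.98 and 0.97), because the only high-weight consistent assignment is the one where all three identifiers are equivalent.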


Probabilistic Disambiguation (iii)

Definition of conditional probability functions relating links & sources

• Source discrimination (source graph)
■ Through internet domains / authentication mechanisms
  ■ openid, foaf-ssl, etc.
■ High confidence values for well-known + well-behaved sources

[Figure: source 1 (well-known, well-behaved) VS source 2 (unknown, conflicting)]


Probabilistic Disambiguation (iv)

[Figure: combined graph in three parts, i), ii), iii), stacking the Entity Graph (entities e1, e2, e3, e4; links lk1-2, lk1-3, lk1-4, lk2-4, lk3-4) and the Source Graph (sources s1, s2; confidence values c1, c2, c3, c4, c5, c6).
Labels in the figure: Trust Values for Sources; Initial Link Values; Combined Value Functions / Priors for Links; Trust Constraint; Inferred Link Values; Graph Constraints; Reputation-Based Trust Management; Constraint-Satisfaction.
Symbols in the figure: ts1, ts2; vc1, vc2, vc3, vc4, vc5, vc6; cv1( ), cv2( ), cv3( ), cv4( ), cv5( ); tc1( ); gc1( ), gc2( ), gc3( ).]

Probabilistic inference on *combined* graph
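
As a rough illustration of how trust values for sources and initial link values might be combined into a prior for one link, consider the Python sketch below. The combination rule (a trust-weighted mean), the function name `link_prior`, and every number are assumptions made for this example. The slides do not spell out the actual value functions, so this should not be read as the idMesh formula.

```python
# Hypothetical trust values for the two sources (the ts1, ts2 of the figure).
trust = {"s1": 0.9, "s2": 0.3}

# Claims about link lk2-4: (source, claims_equivalence, confidence).
# The numeric confidences stand in for c4 and c5 and are made up.
claims = [("s1", False, 0.8), ("s2", True, 0.9)]


def link_prior(claims, trust):
    """Trust-weighted mean of each source's implied probability of equivalence."""
    weighted = 0.0
    total_trust = 0.0
    for source, says_equivalent, confidence in claims:
        # A non-equivalence claim with confidence c implies P(equivalent) = 1 - c.
        p_equivalent = confidence if says_equivalent else 1.0 - confidence
        weighted += trust[source] * p_equivalent
        total_trust += trust[source]
    # With no trusted information at all, fall back to an uninformative 0.5.
    return weighted / total_trust if total_trust > 0 else 0.5


print(round(link_prior(claims, trust), 3))  # 0.375
```

With these made-up numbers the trusted source's non-equivalence claim dominates, so the prior for lk2-4 falls below 0.5 even though the unknown source asserts equivalence with high confidence.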


Scalability

• Problem: both source / entity graphs can become very large in practice
■ Unbounded number of sources
  ■ peer production
■ Cheap production of (uncertain) links
  ■ automated matching algorithms

➡ inference should in itself be decentralized


Distributed, P2P Architecture

[Figure: three-layer architecture, bottom to top]
Internet Layer: IP Network, subnets (Internet)
Overlay Layer (Jupp + GridVine): peers addressed by binary keys (DHT)
Entity Management Layer (idMesh): Message Passing

Operations shown in the figure: Insert(tuple), Update(tuple), Retrieve(query), GetEquivalent(id), GetPosterior(id)


Summary of Results

• Efficient, distributed computations
■ Parallelized sums & products only
■ Quasi-instantaneous on a local machine
■ Naturally scales up in networked environments
  ■ Seconds to disambiguate 8’000 entities interlinked by 24’000 links on 400 machines
• High discriminative power in practice
■ 90%+ accuracy with well-behaved but uncertain sources
■ 75% accuracy with 90% malign sources


Conclusions & Future Work (i)

• idMesh: a ...
■ user-driven
■ probabilistic
■ decentralized
... link-analysis approach to disambiguate linked data.
• Can be combined with previous approaches
■ Previous constructs
■ Automated matching / content-based disambiguation
■ Reputation-based trust mechanisms


Conclusions & Future Work (ii)

• Could be extended to encompass further types of links
■ subsumption
■ relatedness
• Should be extended to support personalized disambiguation capabilities
■ context-sensitive


idMesh: Graph-Based Disambiguation of Linked Data

Philippe Cudré-Mauroux -- MIT
pcm@csail.mit.edu