SCALABLE AND DISTRIBUTED METHODS FOR ENTITY MATCHING, CONSOLIDATION AND DISAMBIGUATION OVER LINKED DATA CORPORA
Aidan Hogan, Antoine Zimmermann, Jürgen Umbrich, Axel Polleres, Stefan Decker
Presented by Joseph Park

Post on 23-Feb-2016

46 views

Category:

Documents


0 download

DESCRIPTION

Scalable and Distributed Methods for Entity Matching, Consolidation and Disambiguation over Linked Data Corpora. Aidan Hogan, Antoine Zimmermann, Jürgen Umbrich , Axel Polleres , Stefan Decker Presented by Joseph Park. Introduction. Linked Data best practices: - PowerPoint PPT Presentation

TRANSCRIPT


Page 2: INTRODUCTION

Linked Data best practices:
- Use URIs as names for things (not just documents)
- Make those URIs dereferenceable via HTTP
- Return useful and relevant RDF content upon lookup of those URIs
- Include links to other datasets

Linked Open Data project
- Goal of providing dereferenceable, machine-readable data in RDF
- Emphasis on reuse of URIs and inter-linkage between remote datasets

Web of Data
- 30 billion published RDF triples

Page 3: AIMS & GOALS

Focus on finding equivalent entities
- E.g. people, places, musicians, proteins
- Two entities are equivalent if they are coreferent

Interest in identifying coreferences and merging the knowledge contributions provided by distinct parties (consolidation)

Page 4: OWL:SAMEAS

owl:sameAs
- A core OWL property that defines equivalences between individuals
- Two individuals related by owl:sameAs are coreferent

Inferring new owl:sameAs relations:
- Inverse-functional properties (e.g. :biologicalMotherOf)
- Functional properties (e.g. :hasBiologicalMother)
- Cardinality and max-cardinality restrictions
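
The first two inference routes listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' distributed implementation: the function name `infer_same_as`, the `ex:` identifiers, and the tiny in-memory triple list are all hypothetical, and cardinality restrictions are left out.

```python
from collections import defaultdict
from itertools import combinations


def infer_same_as(triples, functional, inverse_functional):
    """Return unordered pairs of terms inferred to be owl:sameAs.

    functional: properties where one subject has at most one value, so two
        values given for the same subject must denote the same individual.
    inverse_functional: properties where one value identifies one subject,
        so two subjects sharing a value must denote the same individual.
    """
    values_of = defaultdict(set)    # (property, subject) -> objects
    subjects_of = defaultdict(set)  # (property, object)  -> subjects
    for s, p, o in triples:
        if p in functional:
            values_of[(p, s)].add(o)
        if p in inverse_functional:
            subjects_of[(p, o)].add(s)

    pairs = set()
    for group in list(values_of.values()) + list(subjects_of.values()):
        # Every pair inside a group is coreferent; sort for a stable order.
        for a, b in combinations(sorted(group), 2):
            pairs.add((a, b))
    return pairs


# Hypothetical data: three identifiers for Tom's mother.
triples = [
    ("ex:Ann", ":biologicalMotherOf", "ex:Tom"),
    ("ex:AnnSmith", ":biologicalMotherOf", "ex:Tom"),
    ("ex:Tom", ":hasBiologicalMother", "ex:Ann"),
    ("ex:Tom", ":hasBiologicalMother", "ex:A_Smith"),
    ("ex:Tom", ":knows", "ex:Bob"),
]
same = infer_same_as(
    triples,
    functional={":hasBiologicalMother"},
    inverse_functional={":biologicalMotherOf"},
)
print(sorted(same))
```

The sketch only emits directly inferred pairs; closing them under symmetry and transitivity (so that ex:AnnSmith and ex:A_Smith are also linked) would be a separate step.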

Page 5: CONSTRAINTS TO OWL:SAMEAS

Page 6: EXPERIMENT

1.118 billion quadruples
- Crawled from 3.985 million web documents
- 1.106 billion are unique
- 947 million are unique triples

9 machines linked by Gigabit Ethernet

Page 7: BASELINE – OWL:SAMEAS

Extracted 11.93 million raw owl:sameAs quadruples
- Only 3.77 million unique triples

1000 randomly chosen pairs hand-checked
- Trivially same (661 times)
- Same (301 times)
- Different (28 times)
- Unclear (10 times)

Page 8: CONSTRAINT COUNTS

No documents used owl:maxQualifiedCardinality

434 functional properties
57 inverse-functional properties
109 cardinality restrictions with a value of 1

52.93 million memberships of inverse-functional properties
- 22.14 million asserted

11.09 million memberships of functional properties
- 1.17 million asserted

2.56 million cardinality triples
- 533 thousand asserted

Page 9: REASONING USING CONSTRAINTS

Zero owl:sameAs inferences through cardinality rules

106.8 thousand owl:sameAs through functional-property reasoning

8.7 million owl:sameAs through inverse-functional-property reasoning

Resulted in a total of 12.03 million owl:sameAs statements

Page 10: RESULTS FROM CONSTRAINTS

From the 12.03 million owl:sameAs quadruples

1000 randomly chosen and hand-checked:
- Trivially same (145 times)
- Same (823 times)
- Different (23 times)
- Unclear (9 times)
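
To compare the two hand-checked samples, the counts from the baseline and constraint slides can be turned into proportions. The sketch below assumes, as a reading of the slides rather than something they state, that "trivially same" and "same" both count as correct coreference; the function name `coreferent_share` is hypothetical.

```python
def coreferent_share(counts):
    """Fraction of sampled pairs judged 'trivially same' or 'same'."""
    total = sum(counts.values())
    return (counts["trivially same"] + counts["same"]) / total


# Counts copied from the two hand-checked samples of 1000 pairs each.
baseline = {"trivially same": 661, "same": 301, "different": 28, "unclear": 10}
constraints = {"trivially same": 145, "same": 823, "different": 23, "unclear": 9}

for name, counts in (("baseline", baseline), ("constraints", constraints)):
    trivial = counts["trivially same"] / sum(counts.values())
    print(f"{name}: {coreferent_share(counts):.1%} coreferent, "
          f"{trivial:.1%} trivially so")
```

Under that reading the two samples are about equally accurate (962 versus 968 pairs out of 1000), while the share of trivially identical pairs drops from 661 to 145.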

Page 11: STATISTICAL CONCURRENCE

Entity concurrence: sharing of outlinks, inlinks, and attribute values

Higher score means more discriminating shared characteristics

Page 12: RUNNING EXAMPLE

Page 13: QUANTIFYING CONCURRENCE

Observed cardinality (e.g. Card_G_ex(foaf:maker, dblp:AliceB10) = 2)

Observed inverse-cardinality (e.g. ICard_G_ex(foaf:gender, "female") = 2)

Average inverse-cardinality (e.g. AIC_G_ex(foaf:gender) = 1.5)
- Can also be viewed as the average of the non-zero inverse-cardinalities
- For example, for foaf:gender: 1 for "male" and 2 for "female", so (1 + 2) / 2 = 1.5
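
The three quantities above can be computed directly over a set of triples. The running-example graph itself is not part of this transcript, so the graph below is a hypothetical stand-in chosen only to reproduce the three values quoted on the slide; the function names and the `ex:` identifiers are likewise assumptions.

```python
def card(graph, p, s):
    """Observed cardinality: number of distinct values of p for subject s."""
    return len({o for (s2, p2, o) in graph if s2 == s and p2 == p})


def icard(graph, p, o):
    """Observed inverse-cardinality: number of distinct subjects with value o for p."""
    return len({s for (s, p2, o2) in graph if p2 == p and o2 == o})


def aic(graph, p):
    """Average inverse-cardinality: distinct (subject, value) pairs for p
    divided by the number of distinct values, i.e. the mean of the non-zero
    inverse-cardinalities."""
    pairs = {(s, o) for (s, p2, o) in graph if p2 == p}
    values = {o for (_, o) in pairs}
    return len(pairs) / len(values) if values else 0.0


# Hypothetical stand-in for the running example graph G_ex.
g_ex = {
    ("dblp:AliceB10", "foaf:maker", "ex:Alice"),
    ("dblp:AliceB10", "foaf:maker", "ex:Bob"),
    ("ex:Alice", "foaf:gender", "female"),
    ("ex:Claire", "foaf:gender", "female"),
    ("ex:Bob", "foaf:gender", "male"),
}

print(card(g_ex, "foaf:maker", "dblp:AliceB10"))  # 2
print(icard(g_ex, "foaf:gender", "female"))       # 2
print(aic(g_ex, "foaf:gender"))                   # 1.5
```

A low average inverse-cardinality means few subjects share any given value, so sharing such a value is a more discriminating signal than sharing a value of a property like foaf:gender.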

Page 14: ADJUSTED AVERAGE INVERSE-CARDINALITY

Page 15: CONCURRENCE COEFFICIENTS

Page 16: COEFFICIENT EXAMPLE

Page 17: AGGREGATED CONCURRENCE SCORE

Same process as determining the probability that at least one of two independent events occurs (each leading to the same outcome event)
- P(A ∨ B) = P(A) + P(B) - P(A)·P(B)
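
The combination rule above extends to any number of coefficients by folding them in one at a time. The sketch below assumes each coefficient lies in [0, 1] and is treated as independent; how the individual coefficients are derived is not captured in this transcript, so the input values and the function name `aggregate` are purely illustrative.

```python
from functools import reduce


def aggregate(coefficients):
    """Fold coefficients with the rule P(A or B) = P(A) + P(B) - P(A) * P(B).

    Starts from 0.0 (no shared evidence); each coefficient can only raise
    the score, and the result never exceeds 1.0 for inputs in [0, 1].
    """
    return reduce(lambda acc, c: acc + c - acc * c, coefficients, 0.0)


print(aggregate([0.5, 0.5]))       # 0.75
print(aggregate([0.2, 0.5, 0.1]))
```

The fold is equivalent to one minus the product of (1 - c) over all coefficients, so the result does not depend on the order in which shared characteristics are processed.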

Page 18: RESULTS FROM CONCURRENCE

Average cardinality of about 1.5

Average inverse-cardinality of about 2.64

Total of 636.9 million weighted concurrence pairs

Mean concurrence weight of about 0.0159

Highly concurring entities were in many cases not coreferent

Page 19: EXAMPLE OF CONCURRENCE