![Page 1: Towards Scalable Information Integration with Instance Coreferences](https://reader033.vdocuments.us/reader033/viewer/2022051019/5681596b550346895dc6ac70/html5/thumbnails/1.jpg)
Towards Scalable Information Integration with Instance Coreferences
Abir Qasem¹, Dimitre Dimitrov², Jeff Heflin¹
¹ Lehigh University, ² Tech-X Corporation
07/11/09
U.S. Department of Energy DE-FG02-05ER84171 SBIR grant
The Semantic Web
• Definition
  – "The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation." (Berners-Lee et al., Scientific American, May 2001)
• Ontology
  – a key component of the Semantic Web
  – ontologies define the semantics of the terms used in semi-structured web pages
    • identify context, provide shared definitions
    • have a formal syntax and unambiguous semantics
  – can be used to describe alignments between heterogeneous schemas
A Web of Ontologies
[Figure: a graph of ontologies (Foaf, Dublin Core, Region, DBLP, Citeseer, Congress, AIGP, NSF Awards) connected by "extends" links, together with data sources S1 to S7, each of which "commits to" one of the ontologies.]
The answer to a user's query might require the combination of data from S1, S2, S3, and S4.
Semantic Web Standards
• RDF(S) (1999, revised 2004)
  – essentially semantic networks with URIs
  – XML serialization syntax
• OWL (2004)
  – extends RDF with more semantic primitives
  – based on description logics (DLs)
  – has a model-theoretic semantics
World Wide Web Consortium (W3C) Recommendations
[Figure: example RDF graph with the nodes u:Chair, "John Smith", g:Person, g:name, rdfs:Class and rdf:Property, connected by rdf:type, g:name, rdfs:subClassOf and rdfs:domain edges.]
<owl:Class rdf:ID="Band">
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#hasMember" />
      <owl:allValuesFrom rdf:resource="#Musician" />
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>
A Band is a subset of the groups that have only Musicians as members.
Integrating RDF Sources
AIGP - http://aigp.eecs.umich.edu/
DBLP - http://www.informatik.uni-trier.de/~ley/db/
QUERY: Find all academic papers written by Marvin Minsky's advisees.
[Figure: two RDF graphs. AIGP: the researchers aigp:researcher/show/93 and aigp:researcher/show/21, with aigp:name values "Eugene Charniak" and "Marvin Minsky", linked by aigp:advisorOf. DBLP: dblp:c/Charniak:Eugene with dblp:name "Eugene Charniak", and dblp:jrnl/aim/Charniak97 with dblp:title "Statistical Techniques for Natural Language Parsing" and a dblp:hasAuthor link. A "=?" link asks whether the AIGP and DBLP resources named "Eugene Charniak" denote the same individual.]
Coreference Information
• owl:sameAs
  – states that two URIs denote the same individual
• Linking Open Data initiative
  – ~100 sources with over 4 billion triples (i.e., facts)
  – >100 million explicit owl:sameAs statements
• Many RDF users publish owl:sameAs statements with their data
• Can use automated coreference resolution techniques to find others
  – allow for the possibility of human correction
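To make the role of owl:sameAs concrete, here is a minimal Python sketch of the Marvin Minsky query over the two sources. The triples are toy data modeled on the slides. Which AIGP identifier belongs to which researcher, the direction of aigp:advisorOf, and the single owl:sameAs link are assumptions made for illustration, and only direct owl:sameAs links are followed (no transitive closure).

```python
# Toy triples modeled on the AIGP and DBLP fragments (assumed, not real data).
AIGP = [
    ("aigp:researcher/show/21", "aigp:name", "Marvin Minsky"),
    ("aigp:researcher/show/93", "aigp:name", "Eugene Charniak"),
    ("aigp:researcher/show/21", "aigp:advisorOf", "aigp:researcher/show/93"),
]
DBLP = [
    ("dblp:c/Charniak:Eugene", "dblp:name", "Eugene Charniak"),
    ("dblp:jrnl/aim/Charniak97", "dblp:hasAuthor", "dblp:c/Charniak:Eugene"),
    ("dblp:jrnl/aim/Charniak97", "dblp:title",
     "Statistical Techniques for Natural Language Parsing"),
]
# Hypothetical coreference statement linking the two sources.
SAME_AS = [("aigp:researcher/show/93", "owl:sameAs", "dblp:c/Charniak:Eugene")]


def aliases(uri, triples):
    """URIs directly linked to `uri` by owl:sameAs, in either direction."""
    found = {uri}
    for s, p, o in triples:
        if p == "owl:sameAs":
            if s == uri:
                found.add(o)
            if o == uri:
                found.add(s)
    return found


def papers_by_advisees(advisor_name, triples):
    """Titles of papers whose author is coreferent with an advisee."""
    advisors = {s for s, p, o in triples
                if p == "aigp:name" and o == advisor_name}
    advisees = {o for s, p, o in triples
                if p == "aigp:advisorOf" and s in advisors}
    authors = set()
    for advisee in advisees:
        authors |= aliases(advisee, triples)
    papers = {s for s, p, o in triples
              if p == "dblp:hasAuthor" and o in authors}
    return sorted(o for s, p, o in triples
                  if p == "dblp:title" and s in papers)


without_links = papers_by_advisees("Marvin Minsky", AIGP + DBLP)
with_links = papers_by_advisees("Marvin Minsky", AIGP + DBLP + SAME_AS)
print(without_links)
print(with_links)
```

Without the coreference statement the join between the two sources finds nothing; with it, the DBLP paper becomes reachable from the AIGP advisee.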
Scaling
• AIGP and DBLP have about 4000 coreferent instances
• Marvin Minsky has about 20 advisees
• Only a small fragment of coreference information is relevant to any given query
  – Need to be selective about what information to use
• Quantity of coreference information
  – 80K between DBPedia and Geonames
  – 100K between CIA factbook and Geonames
OBII
[Figure: OBII architecture, in two phases.
System startup or periodic update: the domain ontologies (O1 … On), the OWLII map ontologies (Om1 … Omn) and the REL set (R1 … Rn) are processed. Rcs and Rps are converted to LAV/GAV rules and stored in the IndexKB (LAV, GAV; REL statements are LAV + URL of data source). Rs go into the indexed equivalence closure (EQKB).
Query phase: a SPARQL query is passed to GNS, which gets LAV/GAV matches from the IndexKB and issues "is ?" and "Get All" calls to the EQKB. The potentially relevant sources found at the leaves are retrieved via http calls from the data sources (S1 … Sn) in the Semantic Web space and loaded into a reasoner, which produces the result.]
Potential Relevance
• A summary of a source's content that allows us to ignore sources that cannot possibly contribute to a query
• Unless we look inside the source there is no way to guarantee its relevance
• REL statements have three forms, one for each of the three kinds of assertion a source can have. In the following, d is the URL of a data source, Cs is a class, CE is a class expression, Ps and Pq are property names, and {u1 … un} is a set of URIs.
  – For classes (Rc) the form is REL(d, Cs, CE)
  – For properties (Rp) the form is REL(d, Ps, Pq)
  – For owl:sameAs assertions (R) the form is REL(d, {u1 … un})
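The three REL forms can be modeled as plain records. The following Python sketch is only an illustration: the class and field names are invented here, class expressions are kept as strings, and reading REL(d, {u1 … un}) as "source d holds owl:sameAs assertions about these URIs" is an assumption. The first statement mirrors the MtnBike example from the backup slides; the other two are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Union


@dataclass(frozen=True)
class ClassRel:
    """Rc: REL(d, Cs, CE)."""
    source: str      # d, URL of the data source
    cls: str         # Cs, a class
    class_expr: str  # CE, a class expression (a plain string in this sketch)


@dataclass(frozen=True)
class PropertyRel:
    """Rp: REL(d, Ps, Pq)."""
    source: str  # d
    prop_s: str  # Ps, a property name
    prop_q: str  # Pq, a property name


@dataclass(frozen=True)
class SameAsRel:
    """R: REL(d, {u1 ... un})."""
    source: str           # d
    uris: FrozenSet[str]  # {u1 ... un}


RelStatement = Union[ClassRel, PropertyRel, SameAsRel]


def sources_with_coreferences(rels: List[RelStatement], uri: str) -> List[str]:
    """Sources whose owl:sameAs REL statement lists the given URI."""
    return sorted({r.source for r in rels
                   if isinstance(r, SameAsRel) and uri in r.uris})


rels = [
    ClassRel("http://U2", "O1:MtnBike", "O1:GreenTransport"),
    # Hypothetical statements, for illustration only.
    PropertyRel("http://U5", "O2:greenRating", "O1:rating"),
    SameAsRel("http://U9", frozenset({"aigp:researcher/show/93",
                                      "dblp:c/Charniak:Eugene"})),
]
print(sources_with_coreferences(rels, "dblp:c/Charniak:Eugene"))
```

A lookup of this kind is what lets a system skip coreference sources that mention none of the URIs in a query.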
Information Integration vs. Source Selection
| | Information Integration | Source Selection |
|---|---|---|
| Data sources | Queryable | Lightweight |
| Query reformulation process | Match and expansion of rules with query atoms | Match and expansion of rules with query atoms |
| Query reformulation result | Conjunctive queries over sources | A set of "potentially relevant" sources |
| Obtaining the answer | Issue the queries and union the results | Load the atomic sources into a reasoning engine and issue the original query |
Equivalence KB
• Implementation is a variation of the disjoint-set forest algorithm [Cormen et al. 01]
  – standard operations: union(x,y) and find-set(x)
• Also supports isEquivalent and getAllEquivalent methods
• The index is built by an update algorithm (with a set of seed URLs)
• Uses an inverted document index for equivalence relevance information
[Cormen et al. 01] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Second Edition. The MIT Press, Cambridge, MA, 2001.
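As a rough illustration of the data structure named above, here is a small disjoint-set forest in Python with union by rank and path compression. It is a sketch, not the authors' implementation: the snake_case method names stand in for find-set, isEquivalent and getAllEquivalent, and keeping a member set per root is one assumed way to answer getAllEquivalent quickly. The inverted index and the update algorithm are not modeled.

```python
class EquivalenceKB:
    """Disjoint-set forest over URIs (illustrative sketch)."""

    def __init__(self):
        self._parent = {}   # uri -> parent uri (roots point to themselves)
        self._rank = {}     # root -> upper bound on tree height
        self._members = {}  # root -> all URIs in that equivalence class

    def _add(self, uri):
        if uri not in self._parent:
            self._parent[uri] = uri
            self._rank[uri] = 0
            self._members[uri] = {uri}

    def find_set(self, uri):
        """Return the representative of uri's class, compressing the path."""
        self._add(uri)
        root = uri
        while self._parent[root] != root:
            root = self._parent[root]
        while self._parent[uri] != root:
            self._parent[uri], uri = root, self._parent[uri]
        return root

    def union(self, x, y):
        """Record that x and y denote the same individual."""
        rx, ry = self.find_set(x), self.find_set(y)
        if rx == ry:
            return
        if self._rank[rx] < self._rank[ry]:
            rx, ry = ry, rx
        self._parent[ry] = rx
        if self._rank[rx] == self._rank[ry]:
            self._rank[rx] += 1
        self._members[rx] |= self._members.pop(ry)

    def is_equivalent(self, x, y):
        if x == y:
            return True
        if x not in self._parent or y not in self._parent:
            return False
        return self.find_set(x) == self.find_set(y)

    def get_all_equivalent(self, uri):
        if uri not in self._parent:
            return {uri}
        return set(self._members[self.find_set(uri)])


kb = EquivalenceKB()
kb.union("aigp:researcher/show/93", "dblp:c/Charniak:Eugene")
kb.union("dblp:c/Charniak:Eugene", "citeseer:charniak")  # hypothetical URI
print(kb.is_equivalent("aigp:researcher/show/93", "citeseer:charniak"))
print(sorted(kb.get_all_equivalent("dblp:c/Charniak:Eugene")))
```

Because union merges whole classes, the transitive closure of owl:sameAs is maintained implicitly and never has to be enumerated pair by pair.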
Update of EquivalenceKB
Preliminary Tests
• We have used
  – 202,383 owl:sameAs statements that align data from the AIGP, DBLP and Citeseer data sources
  – Part of the Hawkeye Project
    • http://swat.cse.lehigh.edu/resources/index.html
    • 166 million facts and several "integration resources"
• PC with 3 GB
  – EquivalenceKB is 7 MB
  – Buildup time 3 seconds
  – 1000 calls to getAllEquivalents return in less than half a second
Query Answering
Needs equivalence information
GNS Extension
Needs equivalence information
GNS Extension
• contains is used before expansion to avoid cyclic expansion
• To avoid redundancy, we consider syntactic query containment
  – E.g., CONTAINS(cl, P(x,a)) is true if P(x,y) is in cl
• Equivalence information is relevant
  – With author(X, GNS) in the closed list we should not expand author(X, GOAL-NODE-SEARCH), assuming GNS = GOAL-NODE-SEARCH
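One plausible reading of the containment check is sketched below in Python. Atoms are (predicate, arguments) pairs, variables are wrapped in a hypothetical Var class, and is_equivalent stands in for a lookup in the Equivalence KB. These representations are assumptions for illustration, not the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """Marks an argument as a variable (hypothetical helper)."""
    name: str


def covers(general, specific, is_equivalent):
    """True if atom `general` syntactically subsumes atom `specific`."""
    (g_pred, g_args), (s_pred, s_args) = general, specific
    if g_pred != s_pred or len(g_args) != len(s_args):
        return False
    binding = {}
    for g, s in zip(g_args, s_args):
        if isinstance(g, Var):
            # A variable must map to the same term everywhere it occurs.
            if binding.setdefault(g, s) != s:
                return False
        elif isinstance(s, Var) or not (g == s or is_equivalent(g, s)):
            # A constant covers only itself or a coreferent constant.
            return False
    return True


def contains(closed_list, atom, is_equivalent=lambda a, b: False):
    """True if some atom already in the closed list covers `atom`."""
    return any(covers(seen, atom, is_equivalent) for seen in closed_list)


def same(a, b):
    """Toy equivalence oracle: GNS = GOAL-NODE-SEARCH."""
    return {a, b} == {"GNS", "GOAL-NODE-SEARCH"}


closed = [("P", (Var("x"), Var("y"))), ("author", (Var("X"), "GNS"))]
print(contains(closed, ("P", (Var("x"), "a"))))
print(contains(closed, ("author", (Var("X"), "GOAL-NODE-SEARCH"))))
print(contains(closed, ("author", (Var("X"), "GOAL-NODE-SEARCH")), same))
```

The second and third calls differ only in the equivalence oracle, which is the point of the slide: without coreference information the atom would be expanded again under a different name.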
GNS Extension
• unifyEQ is like regular unify except it accounts for coreferences
• When matching two constants we use isEqual of Equivalence KB
• livesIn(X, DC) and livesIn(X, WashingtonDC) will not unify unless we know DC = WashingtonDC
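The idea behind unifyEQ can be sketched for flat atoms (no nested terms) as follows. This Python version is an assumption-laden illustration: the Var class, the tuple representation of atoms and the is_equal callback (standing in for the Equivalence KB) are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """Marks an argument as a variable (hypothetical helper)."""
    name: str


def walk(term, subst):
    """Follow variable bindings until an unbound variable or a constant."""
    while isinstance(term, Var) and term in subst:
        term = subst[term]
    return term


def unify_eq(atom_a, atom_b, is_equal):
    """Return a substitution unifying two flat atoms, or None on failure.

    Two distinct constants unify only if is_equal says they are coreferent.
    """
    (pred_a, args_a), (pred_b, args_b) = atom_a, atom_b
    if pred_a != pred_b or len(args_a) != len(args_b):
        return None
    subst = {}
    for x, y in zip(args_a, args_b):
        x, y = walk(x, subst), walk(y, subst)
        if x == y:
            continue
        if isinstance(x, Var):
            subst[x] = y
        elif isinstance(y, Var):
            subst[y] = x
        elif not is_equal(x, y):
            return None
    return subst


def no_coref(a, b):
    return False


def dc_coref(a, b):
    """Toy equivalence oracle: DC = WashingtonDC."""
    return {a, b} == {"DC", "WashingtonDC"}


left = ("livesIn", (Var("X"), "DC"))
right = ("livesIn", (Var("X"), "WashingtonDC"))
print(unify_eq(left, right, no_coref))
print(unify_eq(left, right, dc_coref))
```

Note that a successful unification may return an empty substitution, so callers should compare the result with None and not rely on truthiness.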
Conclusion and Future Work
• Scalable instance coreference handling is an important issue
• Initial work shows promise
• Two important issues
  – Avoid pre-computation of equivalence closure and make the system more dynamic
  – Disk-based implementation of EquivalenceKB
• We are currently fine-tuning a dynamic algorithm
  – UpdateEqualKB is not seeded with all URIs but rather with URIs from a query
  – Equivalence information is updated as new URIs are discovered due to rule expansion
  – Coming soon to a conference near you
Backups
OWLII in OWL/RDF
| Axiom type | Subject (left-hand side) | Object (right-hand side) |
|---|---|---|
| owl:equivalentClass | Named classes, owl:intersectionOf, owl:someValuesFrom, owl:hasValue | Named classes, owl:intersectionOf, owl:someValuesFrom, owl:hasValue |
| rdfs:subClassOf | All of the above + owl:unionOf | All of the above + owl:allValuesFrom |
| owl:equivalentProperty, rdfs:subPropertyOf | named properties, owl:inverseOf | named properties, owl:inverseOf |
| owl:inverseOf | named properties | named properties |
Map example
O1:GreenTransport(X) :- O2:Transport(X), O2:greenRating(X, good)
<owl:Class rdf:about="http://O1#GreenTransport">
  <rdfs:subClassOf rdf:resource="http://O2#Transport"/>
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="http://O2#greenRating"/>
      <owl:hasValue rdf:resource="http://uri#good"/>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>
REL example
<meta:RelStatement>
  <meta:source rdf:resource="http://U2"/>
  <meta:contained>
    <owl:Class rdf:about="http://O1#MtnBike" />
  </meta:contained>
  <meta:container>
    <owl:Class rdf:about="http://O1#GreenTransport" />
  </meta:container>
</meta:RelStatement>
R4: O1:MtnBike(X) ⊑ O1:GreenTransport(X), U2