Towards Scalable Information Integration with Instance Coreferences
Abir Qasem (1), Dimitre Dimitrov (2), Jeff Heflin (1)
(1) Lehigh University, (2) Tech-X Corporation
07/11/09
U.S. Department of Energy DE-FG02-05ER84171 SBIR grant
The Semantic Web
• Definition
  – The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation. (Berners-Lee et al., Scientific American, May 2001)
• Ontology
  – a key component of the Semantic Web
  – ontologies define the semantics of the terms used in semi-structured web pages
    • identify context, provide shared definitions
    • has a formal syntax and unambiguous semantics
  – can be used to describe alignments between heterogeneous schemas
A Web of Ontologies
[Figure: a graph of ontologies (Foaf, DBLP, Congress, Citeseer, AIGP, NSF Awards, Region, Dublin Core) linked to one another by "extends" edges, with data sources S1 to S7 each linked to an ontology by a "commits to" edge.]
The answer to a user’s query might require the combination of data from S1, S2, S3, and S4.
Semantic Web Standards
• RDF(S) (1999, revised 2004)
  – essentially semantic networks with URIs
  – XML serialization syntax
• OWL (2004)
  – extends RDF with more semantic primitives
  – based on description logics (DLs)
  – has a model theoretic semantics
World Wide Web Consortium (W3C) Recommendations
[Figure: example RDF graph with the nodes u:Chair, g:Person, g:name, rdfs:Class, rdf:Property and the literal "John Smith", connected by rdf:type, g:name, rdfs:subClassOf and rdfs:domain edges.]
<owl:Class rdf:ID="Band">
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="#hasMember" />
      <owl:allValuesFrom rdf:resource="#Musician" />
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>
A Band is a subset of the groups which only have Musicians as members
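Integrating RDF Sources
AIGP - http://aigp.eecs.umich.edu/
DBLP - http://www.informatik.uni-trier.de/~ley/db/
[Figure: two RDF graphs. AIGP: the nodes aigp:researcher/show/93 and aigp:researcher/show/21 carry the aigp:name values “Eugene Charniak” and “Marvin Minsky” and are linked by aigp:advisorOf. DBLP: dblp:jrnl/aim/Charniak97 has the dblp:title “Statistical Techniques for Natural Language Parsing” and a dblp:hasAuthor edge to dblp:c/Charniak:Eugene, whose dblp:name is “Eugene Charniak”. A “=?” link between the two graphs asks whether the AIGP and DBLP URIs for Eugene Charniak denote the same individual.]
QUERY: Find all academic papers written by Marvin Minsky’s advisees.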
Coreference Information
• owl:sameAs
  – states that two URIs denote the same individual
• Linking Open Data initiative
  – ~100 sources with over 4 billion triples (i.e., facts)
  – >100 million explicit owl:sameAs statements
• Many RDF users publish owl:sameAs statements with their data
• Can use automated coreference resolution techniques to find others
  – allow for the possibility of human correction
Scaling
• AIGP and DBLP have about 4000 coreferent instances
• Marvin Minsky has about 20 advisees
• Only a small fragment of coreference information is relevant to any given query
  – Need to be selective about what information to use
• Quantity of coreference information
  – 80K between DBPedia and Geonames
  – 100K between CIA factbook and Geonames
OBII
[Figure: OBII architecture.
System startup or periodic update: the domain ontologies (O1 ... On), the OWLII map ontologies (Om1 ... Omn) and the REL set (R1 ... Rn) feed two knowledge bases. Rcs and Rps go to the LAV/GAV IndexKB (REL statements are LAV + URL of data source); Rs go to the indexed equivalence closure in the EQKB.
Query phase: a SPARQL query is passed to GNS, which obtains LAV/GAV matches from the IndexKB and sends "is ?" and "Get All" requests to the EQKB. The potentially relevant sources found at the leaves are retrieved from the Semantic Web space (data sources S1 ... Sn) via http calls and loaded into a reasoner, which returns the result.]
Potential Relevance
• A summary of a source’s content that allows us to ignore sources that cannot possibly contribute to a query
• Unless we look inside the source there is no way to guarantee its relevance
• REL statements have three forms, one for each kind of assertion a source can contain. (In the following, d is the URL of a data source, Cs is a class, CE is a class expression, Ps and Pq are property names, and {u1, ..., un} is a set of URIs.)
  – For classes (Rc), the form is REL(d, Cs, CE)
  – For properties (Rp), the form is REL(d, Ps, Pq)
  – For owl:sameAs assertions (R), the form is REL(d, {u1, ..., un})
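The three REL forms can be pictured as three small record types. The following is only a sketch: the class names, the field names and the exact-name matching in potentially_relevant are assumptions made for illustration. The slides describe matching by rule expansion against query atoms, which this sketch does not attempt.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set, Union


@dataclass(frozen=True)
class ClassRel:
    """Rc form: REL(d, Cs, CE)."""
    source: str      # d, URL of the data source
    cls: str         # Cs, class used by the source
    class_expr: str  # CE, kept as an opaque string in this sketch


@dataclass(frozen=True)
class PropertyRel:
    """Rp form: REL(d, Ps, Pq)."""
    source: str      # d
    prop: str        # Ps
    query_prop: str  # Pq


@dataclass(frozen=True)
class SameAsRel:
    """R form: REL(d, {u1, ..., un})."""
    source: str            # d
    uris: FrozenSet[str]   # URIs the source makes owl:sameAs assertions about


RelStatement = Union[ClassRel, PropertyRel, SameAsRel]


def potentially_relevant(rels: Iterable[RelStatement],
                         query_terms: Set[str],
                         query_uris: Set[str]) -> Set[str]:
    """Return the URLs of sources that some REL statement ties to the query.

    Simplification (assumption): a statement matches when its CE or Pq
    appears verbatim among the query terms, or when it shares a URI
    with the query.
    """
    sources = set()
    for rel in rels:
        if isinstance(rel, ClassRel) and rel.class_expr in query_terms:
            sources.add(rel.source)
        elif isinstance(rel, PropertyRel) and rel.query_prop in query_terms:
            sources.add(rel.source)
        elif isinstance(rel, SameAsRel) and rel.uris & query_uris:
            sources.add(rel.source)
    return sources
```

Every source that is not returned can be skipped without being opened, which is the point of the summary. A returned source is only potentially relevant until it has been loaded and queried.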
Information Integration vs. Source Selection
 | Information Integration | Source Selection
Data sources | Queryable | Lightweight
Query reformulation process | Match and expansion of rules with query atoms | Match and expansion of rules with query atoms
Query reformulation result | Conjunctive queries over sources | A set of “potentially relevant” sources
Obtaining the answer | Issue the queries and union the results | Load the atomic sources into a reasoning engine and issue the original query
Equivalence KB
• Implementation is a variation of the disjoint-set forest algorithm [Cormen et al. 01]
  – standard operations: union(x,y) and find-set(x)
• Also supports isEquivalent and getAllEquivalent methods
• The index is built by an update algorithm (with a set of seed URLs)
• Uses an inverted document index for equivalence relevance information
[Cormen et al. 01] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Second Edition. The MIT Press, Cambridge, MA, 2001.
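A minimal in-memory sketch of such a structure is shown below. It uses the textbook disjoint-set forest with union by rank and path compression. Keeping a member list at each root so that every equivalent URI can be returned is an assumption about how the "variation" might work, not the authors' implementation, and the Python method names are adapted from the names on the slide.

```python
class EquivalenceKB:
    """Disjoint-set forest over URIs, extended with a member list per set."""

    def __init__(self):
        self._parent = {}   # URI -> parent URI (a root is its own parent)
        self._rank = {}     # root URI -> upper bound on tree height
        self._members = {}  # root URI -> list of all URIs in its set

    def _add(self, uri):
        if uri not in self._parent:
            self._parent[uri] = uri
            self._rank[uri] = 0
            self._members[uri] = [uri]

    def find_set(self, uri):
        """Return the representative of uri's set, compressing the path."""
        self._add(uri)
        root = uri
        while self._parent[root] != root:
            root = self._parent[root]
        while self._parent[uri] != root:
            # Re-point uri at the root, then continue with its old parent.
            self._parent[uri], uri = root, self._parent[uri]
        return root

    def union(self, x, y):
        """Record that x and y denote the same individual (owl:sameAs)."""
        rx, ry = self.find_set(x), self.find_set(y)
        if rx == ry:
            return
        if self._rank[rx] < self._rank[ry]:
            rx, ry = ry, rx
        self._parent[ry] = rx
        if self._rank[rx] == self._rank[ry]:
            self._rank[rx] += 1
        self._members[rx].extend(self._members.pop(ry))

    def is_equivalent(self, x, y):
        if x == y:
            return True
        if x not in self._parent or y not in self._parent:
            return False  # unknown URIs are only equivalent to themselves
        return self.find_set(x) == self.find_set(y)

    def get_all_equivalent(self, uri):
        if uri not in self._parent:
            return {uri}
        return set(self._members[self.find_set(uri)])
```

Because owl:sameAs is transitive, calling union("ex:a", "ex:b") and then union("ex:b", "ex:c") makes is_equivalent("ex:a", "ex:c") true without a direct statement between the two; the ex: URIs here are hypothetical.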
Update of EquivalenceKB
Preliminary Tests
• We have used
  – 202,383 owl:sameAs statements that align data from the AIGP, DBLP and Citeseer data sources
  – Part of the Hawkeye Project
    • http://swat.cse.lehigh.edu/resources/index.html
    • 166 million facts and several “integration resources”
• PC with 3GB
  – EquivalenceKB is 7 MB
  – Buildup time: 3 seconds
  – 1000 calls to getAllEquivalents return in less than half a second
Query Answering
Needs equivalence information
GNS Extension
Needs equivalence information
• contains is used before expansion to avoid cyclic expansion
• To avoid redundancy, we consider syntactic query containment
  – E.g., CONTAINS(cl, P(x,a)) is true if P(x,y) is in cl
• Equivalence information is relevant
  – With author(X, GNS) in the closed list, we should not expand author(X, GOAL-NODE-SEARCH), assuming GNS = GOAL-NODE-SEARCH
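The containment check described above might look like the following sketch. The atom representation (a predicate name plus a tuple of terms), the "?" prefix marking variables, and the helper names make_same, subsumes and contains are all assumptions made for illustration; only the behaviour on the two slide examples is taken from the source.

```python
def is_var(term):
    """Assumed convention: variables are strings starting with '?'."""
    return term.startswith("?")


def make_same(groups):
    """Build a constant-equality test from groups of coreferent names."""
    def same(a, b):
        return a == b or any(a in g and b in g for g in groups)
    return same


def subsumes(general, specific, same=lambda a, b: a == b):
    """True if `general` covers `specific`, e.g. P(?x, ?y) covers P(?x, a)."""
    g_pred, g_args = general
    s_pred, s_args = specific
    if g_pred != s_pred or len(g_args) != len(s_args):
        return False
    binding = {}
    for g, s in zip(g_args, s_args):
        if is_var(g):
            prev = binding.setdefault(g, s)
            # A repeated variable must map to the same (or a coreferent) term.
            if prev != s and (is_var(prev) or is_var(s) or not same(prev, s)):
                return False
        elif is_var(s) or not same(g, s):
            # A constant in the closed list cannot cover a variable,
            # and two constants must be equal or known coreferents.
            return False
    return True


def contains(closed_list, atom, same=lambda a, b: a == b):
    """True if some atom already in the closed list covers `atom`."""
    return any(subsumes(c, atom, same) for c in closed_list)
```

With closed_list = [("author", ("?x", "GNS"))] and same = make_same([{"GNS", "GOAL-NODE-SEARCH"}]), the call contains(closed_list, ("author", ("?x", "GOAL-NODE-SEARCH")), same) is true, so the atom would not be expanded again. Without the equivalence test the same call is false.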
GNS Extension
• unifyEQ is like regular unify except that it accounts for coreferences
• When matching two constants we use isEqual of the Equivalence KB
• livesIn(X, DC) and livesIn(X, WashingtonDC) will not unify unless we know DC = WashingtonDC
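As a rough illustration of the idea, the sketch below unifies two function-free atoms and falls back to a pluggable equality test whenever two different constants meet. The atom representation, the "?" variable prefix and the is_equal callback are assumptions; in the system described here the callback would be backed by the Equivalence KB.

```python
def is_var(term):
    """Assumed convention: variables are strings starting with '?'."""
    return term.startswith("?")


def walk(term, subst):
    """Follow variable bindings until an unbound variable or a constant."""
    while is_var(term) and term in subst:
        term = subst[term]
    return term


def unify_eq(atom1, atom2, is_equal=lambda a, b: a == b):
    """Unify two atoms; return a substitution dict, or None on failure.

    Atoms are (predicate, args) pairs without function symbols, so no
    occurs check is needed. Two constants unify when is_equal says so.
    """
    pred1, args1 = atom1
    pred2, args2 = atom2
    if pred1 != pred2 or len(args1) != len(args2):
        return None
    subst = {}
    for a, b in zip(args1, args2):
        a, b = walk(a, subst), walk(b, subst)
        if a == b:
            continue
        if is_var(a):
            subst[a] = b
        elif is_var(b):
            subst[b] = a
        elif not is_equal(a, b):
            return None
    return subst
```

A successful unification can return an empty dict, so callers should compare the result with None rather than test its truthiness. For the slide's example, unify_eq(("livesIn", ("?X", "DC")), ("livesIn", ("?X", "WashingtonDC"))) fails with the default equality and succeeds once is_equal reports DC and WashingtonDC as coreferent.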
Conclusion and Future Work
• Scalable Instance Coreference Handling is an important issue
• Initial work shows promise
• Two important issues
  – Avoid pre-computation of equivalence closure and make the system more dynamic
  – Disk-based implementation of EquivalenceKB
• We are currently fine-tuning a dynamic algorithm
  – UpdateEqualKB is not seeded with all URIs but rather with URIs from a query
  – Equivalence information is updated as new URIs are discovered due to rule expansion
  – Coming soon to a conference near you
Backups
OWLII in OWL/RDF
Axiom type | Subject (left-hand side) | Object (right-hand side)
owl:equivalentClass | Named classes, owl:intersectionOf, owl:someValuesFrom, owl:hasValue | Named classes, owl:intersectionOf, owl:someValuesFrom, owl:hasValue
rdfs:subClassOf | All of the above + owl:unionOf | All of the above + owl:allValuesFrom
owl:equivalentProperty, rdfs:subPropertyOf | named properties, owl:inverseOf | named properties, owl:inverseOf
owl:inverseOf | named properties | named properties
Map example
O1:GreenTransport(X) :- O2:Transport(X), O2:greenRating(X, good)
<owl:Class rdf:about="http://O1#GreenTransport">
  <rdfs:subClassOf rdf:resource="http://O2#Transport"/>
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="http://O2#greenRating"/>
      <owl:hasValue rdf:resource="http://uri#good"/>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>
REL example
<meta:RelStatement>
  <meta:source rdf:resource="http://U2"/>
  <meta:contained>
    <owl:Class rdf:about="http://O1#MtnBike" />
  </meta:contained>
  <meta:container>
    <owl:Class rdf:about="http://O1#GreenTransport" />
  </meta:container>
</meta:RelStatement>
R4: O1:MtnBike(X) ⊑ O1:GreenTransport(X), U2