
Fusing automatically extracted annotations for the Semantic Web

Andriy Nikolov
Knowledge Media Institute, The Open University

2

Outline

• Motivation
• Handling fusion subtasks
  – problem-solving method approach
• Processing inconsistencies
  – applying uncertainty reasoning
• Overcoming schema heterogeneity
  – Linked Data scenario

3

Database scenario

• Classical scenario (database domain)
  – Merging information from datasets containing partially overlapping information

Name         Year of birth   E-mail                Address
H. Schmidt   1972            h.schmidt@gmail.com   …
J. Smith     1983            j.smith@yahoo.com     …

Name            Year of birth   E-Mail                Job position
Wen, Zhao       1980            wenzh@gmail.com       …
Schmidt, Hans   1973            h.schmidt@gmail.com   …

4

Database scenario

• Coreference resolution (record linkage)
  – Resolving ambiguous identities

Name         Year of birth   E-mail                Address
H. Schmidt   1972            h.schmidt@gmail.com   …
J. Smith     1983            j.smith@yahoo.com     …

Name            Year of birth   E-Mail                Job position
Wen, Zhao       1980            wenzh@gmail.com       …
Schmidt, Hans   1973            h.schmidt@gmail.com   …

5

Database scenario

• Inconsistency resolution
  – Handling contradictory pieces of data

Name         Year of birth   E-mail                Address
H. Schmidt   1972            h.schmidt@gmail.com   …
J. Smith     1983            j.smith@yahoo.com     …

Name            Year of birth   E-Mail                Job position
Wen, Zhao       1980            wenzh@gmail.com       …
Schmidt, Hans   1973            h.schmidt@gmail.com   …

6

Semantic data scenario

• Database domain:
  – A record belongs to a single table
  – Table structure defines relevant attributes
  – Inconsistency of values
• Semantic data:
  – Classes are organised into hierarchies
  – One individual may belong to several classes
  – Available properties depend on the level of granularity
  – Other types of inconsistencies are possible
    • E.g., class disjointness

[Schema diagram: classes foaf:Person, sweto:Person, sweto:Researcher, sweto:Place, sweto:Organization; properties foaf:name and foaf:mbox (xsd:string), sweto:lives_in, sweto:affiliated_with, some:has_degree (xsd:string)]

7

Motivating scenario – X-Media

[Architecture diagram: text, images and other data from internal corporate reports (Intranet) and pre-defined public sources (WWW) are annotated into RDF and fused by KnoFuss, guided by the domain ontology, to populate the knowledge base]

8

Outline

• Motivation
• Handling fusion subtasks
  – problem-solving method approach
• Processing inconsistencies
  – applying uncertainty reasoning
• Overcoming schema heterogeneity
  – Linked Data scenario

9

Handling fusion subtasks

• For each subtask, several available methods exist
• Example: coreference resolution
  – Aggregated attribute similarity
    • [Fellegi and Sunter 1969]
  – String similarity
    • Levenshtein, Jaro, Jaro-Winkler
  – Machine learning
    • Clustering
    • Classification
  – Rule-based

10

Handling fusion subtasks

• All methods have their pros and cons
  – Rule-based
    • High precision
    • Restricted to a specific domain
  – Machine learning
    • Requires sufficient training data
  – String similarity
    • Lower precision
    • Still needs configuration (e.g., distance metric, threshold, set of attributes to include)
• Trade-off between the quality of results and applicability range – better precision requires more domain-specific knowledge (see the string-similarity sketch below)
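To make the configuration burden concrete, below is a minimal sketch of a string-similarity coreference test, assuming a single name attribute and a hand-picked threshold (the function names and threshold value are illustrative, not part of KnoFuss):

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def name_similarity(a, b):
    """Normalized similarity in [0, 1] based on edit distance."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / max(len(a), len(b))

THRESHOLD = 0.85   # illustrative value; choosing it is part of the configuration effort
print(name_similarity("Schmid, Hans", "Schmidt, Hans") >= THRESHOLD)   # True  (minor typo)
print(name_similarity("Schmidt, Hans", "H. Schmidt") >= THRESHOLD)     # False (same person, different format)

The last call shows why metric and attribute choice matter: a purely generic configuration misses name-format variations that a domain-specific rule would catch.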

11

Problem-solving method approach

• Fusion task is decomposed into subtasks
• Algorithms defined as methods solving a particular task
• Each method is formally described using the fusion ontology
  – Task handled by the method
  – Applicability criteria
  – Domain knowledge required
  – Reliability of output
• Methods are selected based on their capabilities

12

KnoFuss architecture

[Architecture diagram: new data and the main KB are processed by KnoFuss; its method library holds coreference resolution, conflict detection and conflict resolution methods, intermediate data is kept in a fusion KB, and the fusion ontology describes the methods]

• Method library
  – Contains an implementation of each technique for specific subtasks (problem-solving methods [Motta 1999])
• Fusion ontology
  – Describes method capabilities
  – Defines intermediate structures (mappings, conflict sets, etc.)

13

Task decomposition

[Task decomposition diagram: knowledge fusion splits into coreference resolution (model configuration, link discovery) and knowledge base updating (dependency identification, dependency resolution); the inputs are a source KB and a target KB, the output is the fused target KB]

14

Method selection

• Example: adaptive learning matcher with descriptors for three application contexts

  Generic:
    rdf:type owl:Thing; datatypeProperty ?x
    reliability = 0.4
  Application context: Publication
    rdf:type sweto:Publication; rdfs:label ?x; sweto:year ?y
    reliability = 0.8
  Application context: Journal Article
    rdf:type sweto:Article; rdfs:label ?x; sweto:year ?y; sweto:journal ?z; sweto:volume ?a
    reliability = 0.9

• Depends on:
  – Range of applicability
  – Reliability
• Configuration parameters
  – Generic (for individuals of unknown types)
  – Context-dependent
  (a minimal selection sketch follows below)
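A minimal sketch of capability-based selection, assuming simplified method descriptors rather than the actual fusion ontology (all names and structures below are illustrative):

from dataclasses import dataclass

@dataclass
class MethodDescriptor:
    name: str
    application_context: str   # class the method is configured for ("owl:Thing" = generic)
    reliability: float         # expected quality of its output

LIBRARY = [
    MethodDescriptor("adaptive-matcher-generic", "owl:Thing", 0.4),
    MethodDescriptor("adaptive-matcher-publication", "sweto:Publication", 0.8),
    MethodDescriptor("adaptive-matcher-article", "sweto:Article", 0.9),
]

def select_method(instance_classes, superclasses):
    """Pick the most reliable method whose application context covers the instance.
    `superclasses` maps each class to its (transitive) superclasses."""
    applicable = [m for m in LIBRARY
                  if m.application_context in instance_classes
                  or any(m.application_context in superclasses.get(c, ())
                         for c in instance_classes)]
    return max(applicable, key=lambda m: m.reliability, default=None)

hierarchy = {"sweto:Article": {"sweto:Publication", "owl:Thing"},
             "sweto:Publication": {"owl:Thing"}}
print(select_method({"sweto:Article"}, hierarchy).name)                      # adaptive-matcher-article (0.9)
print(select_method({"foaf:Person"}, {"foaf:Person": {"owl:Thing"}}).name)   # generic fallback (0.4)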

15

Using class hierarchy

• Configuring machine-learning methods:
  – Using training instances for a subclass to learn generic models for superclasses

[Class hierarchy diagram: owl:Thing with subclasses foaf:Person and foaf:Document, and sweto:Publication with subclasses sweto:Article and sweto:Article_in_Proceedings; attributes include name, label, year, journal_name, volume, book_title. Training instances Ind1–Ind3 described by {label, year, book_title} are reduced to {label, year} at the sweto:Publication level and to {label} at the owl:Thing level]

16

Using class hierarchy

• Configuring machine-learning methods:
  – Combining training instances for subclasses to learn a generic model for a superclass (see the sketch below)

[Diagram: sweto:Publication (label, year) with subclasses sweto:Article (journal_name, volume) and sweto:Article_in_Proceedings (book_title); training instances Ind1–Ind3 {label, year, book_title} and Ind4–Ind6 {label, year, journal_name, volume} are pooled as Ind1–Ind6 {label, year} at the sweto:Publication level]
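A rough sketch of that pooling step: training instances of the subclasses are projected onto the attributes declared for the superclass and merged into one training set (the attribute sets and dictionaries are illustrative):

SUPERCLASS_ATTRS = {"sweto:Publication": {"label", "year"}}

def project(instance, superclass):
    """Keep only the attributes declared for the superclass."""
    keep = SUPERCLASS_ATTRS[superclass]
    return {k: v for k, v in instance.items() if k in keep}

def pooled_training_set(training_by_subclass, superclass):
    """Combine training instances of all subclasses into one generic set."""
    pooled = []
    for instances in training_by_subclass.values():
        pooled.extend(project(i, superclass) for i in instances)
    return pooled

articles = [{"label": "Paper A", "year": 2007, "journal_name": "JWS", "volume": 5}]
in_proc  = [{"label": "Paper B", "year": 2006, "book_title": "Proc. of ESWC"}]
print(pooled_training_set({"sweto:Article": articles,
                           "sweto:Article_in_Proceedings": in_proc},
                          "sweto:Publication"))
# [{'label': 'Paper A', 'year': 2007}, {'label': 'Paper B', 'year': 2006}]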

17

Outline

• Motivation
• Handling fusion subtasks
  – problem-solving method approach
• Processing inconsistencies
  – applying uncertainty reasoning
• Overcoming schema heterogeneity
  – Linked Data scenario

18

Data quality problems

• Causes of inconsistency
  – Data errors
    • Obsolete data
    • Mistakes of manual annotators
    • Errors of information extraction algorithms
  – Coreference resolution errors
    • Automatic methods not 100% reliable
• Applying uncertainty reasoning
  – Estimated reliability of separate pieces of data
  – Domain knowledge defined in the ontology

19

Refining fused data

• Additional evidence:
  – Ontological schema restrictions
    • Disjointness
    • Cardinality
    • …
  – Neighborhood graph
    • Mappings between related entities
  – Provenance
    • Uncertainty of candidate mappings
    • Uncertainty of data statements
    • “Cleanness” of data sources

20

Dempster-Shafer theory of evidence

• Bayesian probability theory: assigning probabilities to atomic alternatives
  – p(true) = 0.6 → p(false) = 0.4
  – Sometimes hard to assign
  – Negative bias: extraction confidence below 0.5 is treated as negative evidence rather than insufficient evidence
• Dempster-Shafer theory: assigning confidence degrees (masses) to sets of alternatives
  – m({true}) = 0.6
  – m({false}) = 0.1
  – m({true; false}) = 0.3
  – support ≤ probability ≤ plausibility (see the sketch below)
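The support (belief) and plausibility bounds can be computed directly from a mass assignment; a minimal illustration over the boolean frame used on this slide:

def belief(masses, hypothesis):
    """Support: total mass of non-empty subsets of the hypothesis."""
    return sum(m for s, m in masses.items() if s and s <= hypothesis)

def plausibility(masses, hypothesis):
    """Plausibility: total mass of sets intersecting the hypothesis."""
    return sum(m for s, m in masses.items() if s & hypothesis)

masses = {frozenset({True}): 0.6,          # m({true})        = 0.6
          frozenset({False}): 0.1,         # m({false})       = 0.1
          frozenset({True, False}): 0.3}   # m({true; false}) = 0.3

print(round(belief(masses, frozenset({True})), 2))        # 0.6 - support for "true"
print(round(plausibility(masses, frozenset({True})), 2))  # 0.9 - plausibility of "true"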

21

Dependency detection

• Identifying and localizing conflicts
  – Using formal diagnosis [Reiter 1987] in combination with standard ontological reasoning

[Conflict graph: Paper_10 has rdf:type Article and rdf:type Proceedings, although Article owl:disjointWith Proceedings; the owl:FunctionalProperty hasYear has two values, 2006 and 2007; hasAuthor links Paper_10 to E. Motta and V.S. Uren]

22

Belief networks

• Valuation networks [Shenoy and Shafer 1990]
• Network nodes correspond to OWL axioms
  – Variable nodes
    • ABox statements (I ∈ X, R(I1, I2))
    • One variable – the statement itself
  – Valuation nodes
    • TBox axioms (X ⊔ Y)
    • Mass distribution between several variables (I ∈ X, I ∈ Y, I ∈ X ⊔ Y)

23

Belief networks (cont)

• Belief network construction
  – Using translation rules
  – Rule antecedents:
    • Existence of specific OWL axioms (one rule per OWL construct)
    • Existence of network nodes
  – Example rules (see the sketch below):
    • Explicit ABox statements:
      IF I ∈ X THEN CREATE N1(I ∈ X)
    • TBox inferencing:
      IF Trans(R) AND EXIST N1(R(I1, I2)) AND EXIST N2(R(I2, I3))
      THEN CREATE N3(Trans(R)) AND CREATE N4(R(I1, I3))
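A minimal sketch of applying the transitivity rule above over a set of explicit ABox statements (the triple representation and function name are illustrative, not the actual KnoFuss rule engine):

def apply_transitivity_rule(transitive_props, abox_statements):
    """For each chain R(I1, I2), R(I2, I3) of a transitive property R, create
    the valuation node Trans(R) and the inferred variable node R(I1, I3)."""
    created = set()
    for r in transitive_props:
        pairs = {(s, o) for (p, s, o) in abox_statements if p == r}
        for (i1, i2) in pairs:
            for (j, i3) in pairs:
                if i2 == j and i1 != i3:
                    created.add(("Trans", r))    # valuation node N3
                    created.add((r, i1, i3))     # inferred variable node N4
    return created

abox = {("partOf", "a", "b"), ("partOf", "b", "c")}
print(apply_transitivity_rule({"partOf"}, abox))
# contains ('Trans', 'partOf') and ('partOf', 'a', 'c')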

24

Example

[Diagram: #Paper_10 with rdf:type Article and rdf:type Proceedings; Article owl:disjointWith Proceedings]

25

Example

[Same diagram; variable nodes added for the statements #Paper_10 ∈ Article and #Paper_10 ∈ Proceedings]

26

Example

[Same diagram; a valuation node added for the axiom Article ⊑ ¬Proceedings, connected to the variable nodes #Paper_10 ∈ Article and #Paper_10 ∈ Proceedings]

27

Example

Initial masses at the variable nodes:
  #Paper_10 ∈ Article:      m({true}) = 0.8, m({false}) = 0, m({true; false}) = 0.2
  #Paper_10 ∈ Proceedings:  m({true}) = 0.6, m({false}) = 0, m({true; false}) = 0.4

Valuation node for Article ⊑ ¬Proceedings, over joint states (#Paper_10 ∈ Article, #Paper_10 ∈ Proceedings):
  {(false, false), (false, true), (true, false)}   m = 1.0
  {(true, true)}                                   m = 0.0

28

Example

Combining the masses at the valuation node with Dempster’s rule (reproduced in the sketch below):
  #Paper_10 ∈ Article:      m({true}) = 0.8, m({false}) = 0, m({true; false}) = 0.2
  #Paper_10 ∈ Proceedings:  m({true}) = 0.6, m({false}) = 0, m({true; false}) = 0.4

Resulting joint masses over (#Paper_10 ∈ Article, #Paper_10 ∈ Proceedings):
  {(true, false)}                                  m = 0.62
  {(false, true)}                                  m = 0.23
  {(false, false), (false, true), (true, false)}   m = 0.15
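The combined masses above can be reproduced with a small sketch of Dempster's rule over the joint frame of the two statements (illustrative code, not the KnoFuss valuation-network implementation; tuples are ordered as (Article, Proceedings)):

from itertools import product

FRAME = frozenset(product([True, False], repeat=2))   # all (Article, Proceedings) states

def combine(m1, m2):
    """Dempster's rule of combination over subsets of the joint frame."""
    joint, conflict = {}, 0.0
    for s1, v1 in m1.items():
        for s2, v2 in m2.items():
            inter = s1 & s2
            if inter:
                joint[inter] = joint.get(inter, 0.0) + v1 * v2
            else:
                conflict += v1 * v2
    return {s: v / (1.0 - conflict) for s, v in joint.items()}   # normalize away conflict

m_article  = {frozenset({(True, True), (True, False)}): 0.8, FRAME: 0.2}   # Paper_10 in Article
m_proc     = {frozenset({(True, True), (False, True)}): 0.6, FRAME: 0.4}   # Paper_10 in Proceedings
m_disjoint = {FRAME - {(True, True)}: 1.0}                                 # disjointness rules out (true, true)

result = combine(combine(m_article, m_proc), m_disjoint)
for s, v in sorted(result.items(), key=lambda x: -x[1]):
    print(sorted(s), round(v, 2))
# [(True, False)] 0.62                                 -> Article only
# [(False, True)] 0.23                                 -> Proceedings only
# [(False, False), (False, True), (True, False)] 0.15  -> still uncommitted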

29

Example

Marginalizing the combined masses back to the variable nodes:
  #Paper_10 ∈ Article:      m({true}) = 0.62, m({false}) = 0.23, m({true; false}) = 0.15
  #Paper_10 ∈ Proceedings:  m({true}) = 0.23, m({false}) = 0.62, m({true; false}) = 0.15

Joint masses at the valuation node for Article ⊑ ¬Proceedings:
  {(true, false)}                                  m = 0.62
  {(false, true)}                                  m = 0.23
  {(false, false), (false, true), (true, false)}   m = 0.15

30

Belief propagation

• Translating a subontology into a belief network
  – Using provenance and confidence values of data statements
  – Coreferencing algorithm precision for owl:sameAs mappings
• Data refinement:
  – Detecting spurious mappings
  – Removing unreliable data statements

[Belief network example: variable nodes Article(Ind1), in_Proc(Ind1), in_Proc(Ind2), Ind1 = Ind2, year(Ind1, 2006), year(Ind1, 2007), year(Ind2, 2007); valuation nodes Article ⊑ ¬in_Proc and Functional(year); each statement carries a pair of belief intervals, e.g. (0.99; 1.0) / (0.97; 0.98), (0.9; 1.0) / (0.74; 0.82), (0.92; 1.0) / (0.2; 0.21), (0.85; 1.0) / (0.72; 0.85), (0.95; 1.0) / (0.91; 0.96)]

31

Neighbourhood graph

• Non-functional relations: varying impact

[Example 1: Paper_10 (rdf:type Proceedings) hasAuthor “H. Schmidt”; Paper_11 hasAuthor “Schmidt, Hans” (rdf:type Person); owl:sameAs between the papers has confidence 0.9, between the two author individuals 0.3]

[Example 2: “H. Schmidt” and “Schmidt, Hans” (rdf:type Person) are each citizen_of a Germany individual (rdf:type Country); owl:sameAs between the two Germany individuals has confidence 1.0, between the two persons 0.3]

32

Neighborhood graph

• Implicit relations: set co-membership

[Example: the candidate mappings Person11 = Person12 and Person21 = Person22 are connected through the coauthorship statements Coauthor(Person12, Person22), Coauthor(Person11, Person22) and Coauthor(Person21, Person22); e.g. “Bard, J.B.L.” = “Jonathan Bard” with 0.84 / (0.86; 1.0) and “Webber, B.L.” = “Bonnie L. Webber” with 0.16 / (0.83; 1.0), supported by coauthorship evidence with 1.0 / (1.0; 1.0)]

33

Provenance

• Initial belief assignments (see the sketch below):
  – Data statements (source AND/OR extractor confidence)
  – Candidate mappings (precision of attribute similarity algorithms)
  – Source “cleanness” – contains duplicates or not

[Example: “Arlington” has candidate mappings Arlington = Arl_Va (“Arlington, Virginia”, 0.95 / (0.65; 0.69)) and Arlington = Arl_Tx (“Arlington, Texas”, 0.9 / (0.31; 0.35)); a duplicate-free source makes Arl_Va ≠ Arl_Tx certain, 1.0 / (1.0; 1.0)]
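A rough sketch of turning such provenance information into initial mass assignments (the heuristics and numbers below are illustrative, mirroring the Arlington example above):

def initial_masses(kind, confidence=None):
    """Map provenance to (m(true), m({true; false})) for a statement or mapping.

    kind: 'statement'    - extractor and/or source confidence
          'mapping'      - precision of the attribute-similarity matcher used
          'distinctness' - two individuals of a source known to contain no duplicates"""
    if kind == "distinctness":
        return (1.0, 0.0)                               # e.g. Arl_Va != Arl_Tx above
    return (confidence, round(1.0 - confidence, 3))     # remaining mass stays uncommitted

print(initial_masses("statement", 0.9))       # (0.9, 0.1)   - extracted data statement
print(initial_masses("mapping", 0.95))        # (0.95, 0.05) - e.g. Arlington = Arl_Va
print(initial_masses("distinctness"))         # (1.0, 0.0)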

34

Experiments

• Datasets:
  – Publications
    • AKT
    • Rexa
    • SWETO-DBLP
  – Cora
    • database community benchmark
    • translated into RDF
    • 2 versions used
      – different structure
      – different gold standard

Experiments

Publication individuals (two Prec./Recall/F1 groups: before and after data refinement):

Dataset     No  Matcher            Prec.   Recall  F1      Prec.   Recall  F1
AKT/Rexa    1   Jaro-Winkler       0.950   0.833   0.887   0.969   0.832   0.895
AKT/Rexa    2   L2 Jaro-Winkler    0.879   0.956   0.916   0.923   0.956   0.939
AKT/DBLP    3   Jaro-Winkler       0.922   0.952   0.937   0.992   0.952   0.971
AKT/DBLP    4   L2 Jaro-Winkler    0.389   0.984   0.558   0.838   0.983   0.905
Rexa/DBLP   5   Jaro-Winkler       0.899   0.933   0.916   0.944   0.932   0.938
Rexa/DBLP   6   L2 Jaro-Winkler    0.546   0.982   0.702   0.823   0.981   0.895
Cora (I)    7   Monge-Elkan        0.735   0.931   0.821   0.939   0.836   0.884
Cora (II)   8   Monge-Elkan        0.698   0.986   0.817   0.958   0.956   0.957

• Publication individuals
  – Ontological restrictions mainly influence precision

36

Experiments

Person individuals (two Prec./Recall/F1 groups: before and after data refinement):

Dataset     No  Matcher            Prec.   Recall  F1      Prec.   Recall  F1
AKT/Rexa    7   L2 Jaro-Winkler    0.738   0.888   0.806   0.788   0.935   0.855
AKT/DBLP    8   L2 Jaro-Winkler    0.532   0.746   0.621   0.583   0.921   0.714
Rexa/DBLP   9   Jaro-Winkler       0.965   0.755   0.846   0.968   0.876   0.920
Cora (I)    10  L2 Jaro-Winkler    0.983   0.879   0.928   0.981   0.895   0.936
Cora (II)   11  L2 Jaro-Winkler    0.999   0.994   0.997   0.999   0.994   0.997

• Person individuals
  – Evidence coming from the neighborhood graph
  – Mainly influences recall

Outline

• Motivation
• Handling fusion subtasks
  – problem-solving method approach
• Processing inconsistencies
  – applying uncertainty reasoning
• Overcoming schema heterogeneity
  – Linked Data scenario

38

Advanced scenario

• Linked Data cloud: network of public RDF repositories [Bizer et al. 2009]

• Added value: coreference links (owl:sameAs)

39

Data linking: current state

• Automatic instance matching algorithms
  – SILK, ODDLinker, KnoFuss, …
• Pairwise matching of datasets
  – Requires significant configuration effort
• Transitive closure of links
  – Use of “reference” datasets

Reference datasets

41

Problems

• Transitive closures are often incomplete
  – Reference dataset is incomplete
  – Missing intermediate links
  – Direct comparison of the relevant datasets is desirable
• Schema heterogeneity
  – Which instances to compare?
  – Which properties are relevant?

[Diagram: datasets A and B linked only indirectly via a Reference dataset]

42

Schema matching

• Interpretation mismatches
  – dbpedia:Actor = professional actor
  – movie:actor = anybody who participated in a movie
• Class interpretation “as used” vs “as designed”
  – FOAF: foaf:Person = any person
  – DBLP: foaf:Person = computer scientist
• Instance-based ontology matching

                 Repository   Richard Nixon   David Garrick
dbpedia:Actor    DBPedia      -               +
movie:Actor      LinkedMDB    +               -

KnoFuss - enhanced

[Extended task decomposition: knowledge fusion now comprises ontology integration (ontology matching, SPARQL query translation, instance transformation) alongside knowledge base integration (coreference resolution, dependency resolution), applied between a source KB and a target KB]

44

Schema matching

• Step 1: inferring schema mappings from pre-existing instance mappings

• Step 2: utilizing schema mappings to produce new instance mappings

[Diagram: Ontology 1 and Ontology 2 over Dataset 1 and Dataset 2; step 1 lifts instance mappings between the datasets to schema mappings between the ontologies, step 2 pushes the schema mappings back down to produce new instance mappings]

45

Overview

• Background knowledge:
  – Data-level (intermediate repositories)
  – Schema-level (datasets with more fine-grained schemas)

46

Algorithm

• Step 1: obtaining the transitive closure of existing mappings

[Example: LinkedMDB movie:music_contributor/2490 = MusicBrainz music:artist/a16…9fdf = DBPedia dbpedia:Ennio_Morricone]

47

Algorithm

• Step 2: inferring class and property mappings (see the sketch below)
  – ClassOverlap and PropertyOverlap mappings
  – Confidence(classes A, B) = |c(A) ∩ c(B)| / min(|c(A)|, |c(B)|) (overlap coefficient), where c(·) is the set of identity clusters covering the instances of a class
  – Confidence(properties r1, r2) = |X| / |Y|
    • X – identity clusters with equivalent values of r1 and r2
    • Y – all identity clusters which have values for both r1 and r2

[Example: the identity cluster {movie:music_contributor/2490 (LinkedMDB), music:artist/a16…9fdf (MusicBrainz), dbpedia:Ennio_Morricone (DBPedia)} contributes to the overlap between movie:music_contributor and dbpedia:Artist]
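A minimal sketch of the class-overlap computation over identity clusters (toy data; the extra URIs movie:music_contributor/17, movie:music_contributor/99 and dbpedia:John_Williams are invented for illustration):

def class_overlap(instances_a, instances_b, cluster_of):
    """Overlap coefficient between two classes, computed over identity clusters.
    `cluster_of` maps an instance URI to its identity-cluster id (itself if unmapped)."""
    c_a = {cluster_of.get(i, i) for i in instances_a}
    c_b = {cluster_of.get(i, i) for i in instances_b}
    return len(c_a & c_b) / min(len(c_a), len(c_b))

def property_overlap(clusters_with_both_values, clusters_with_equal_values):
    """|X| / |Y|: X - clusters where r1 and r2 have equivalent values,
    Y - clusters having values for both r1 and r2."""
    return len(clusters_with_equal_values) / len(clusters_with_both_values)

cluster_of = {"movie:music_contributor/2490": "c1", "dbpedia:Ennio_Morricone": "c1",
              "movie:music_contributor/17": "c2", "dbpedia:John_Williams": "c2"}
lmdb_contributors = {"movie:music_contributor/2490", "movie:music_contributor/17",
                     "movie:music_contributor/99"}
dbpedia_artists   = {"dbpedia:Ennio_Morricone", "dbpedia:John_Williams"}

print(class_overlap(lmdb_contributors, dbpedia_artists, cluster_of))   # 2 / min(3, 2) = 1.0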

48

Algorithm

• Step 3: Inferring data patterns
• Functionality restrictions
• IF 2 equivalent movies do not have overlapping actors AND have different release dates THEN break the equivalence link (see the sketch below)
• Note:
  – Only usable if not taken into account at the initial instance matching stage
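A sketch of applying that pattern as a post-filter over candidate owl:sameAs links (the record structure is hypothetical):

def keep_link(movie_a, movie_b):
    """Break the equivalence link if the two movies share no actors
    AND have different release dates (the data pattern above)."""
    no_common_actors = not (set(movie_a["actors"]) & set(movie_b["actors"]))
    different_dates = movie_a["release_date"] != movie_b["release_date"]
    return not (no_common_actors and different_dates)

a = {"actors": ["Actor One", "Actor Two"], "release_date": "1966"}
b = {"actors": ["Actor Three"],            "release_date": "1968"}
print(keep_link(a, b))   # False - the candidate mapping is flagged as spurious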

49

Algorithm

• Step 4: utilizing mappings and patterns (example translated queries and an execution sketch below)
  – Run instance-level matching for individuals of strongly overlapping classes
  – Use patterns to filter out existing mappings

• LinkedMDB:
    SELECT ?uri
    WHERE {
      ?uri rdf:type movie:music_contributor .
    }

• DBPedia:
    SELECT ?uri
    WHERE {
      ?uri rdf:type dbpedia:Artist .
    }
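One possible way to execute such translated queries and collect candidate individuals for instance matching, using the SPARQLWrapper library (the endpoint URL, prefix bindings and LIMIT are assumptions added for illustration, not part of the original slides):

from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbpedia: <http://dbpedia.org/ontology/>   # assumed binding for the slide's prefix
SELECT ?uri WHERE { ?uri rdf:type dbpedia:Artist . } LIMIT 100
"""

def fetch_candidates(endpoint, query):
    """Run a translated SELECT query and return the bound ?uri values."""
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["uri"]["value"] for b in results["results"]["bindings"]]

print(fetch_candidates("https://dbpedia.org/sparql", QUERY)[:5])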

50

Results

• Class mappings:
  – Improvement in recall
    • Previously omitted mappings were discovered after direct comparison of instances
• Data patterns
  – Improved precision
    • Filtered out spurious mappings
    • Identified 140 mappings between movies as “potentially spurious”
    • 132 identified correctly

[Charts: precision, recall and F1-measure of existing links, KnoFuss (only), and the combined approach for the DBPedia/DBLP, DBPedia/LinkedMDB and DBPedia/BookMashup dataset pairs]

51

Future work

• From the pairwise scenario to the network of repositories
• Combining schema and data integration in an efficient way
• Evaluating data sources
  – Which data source(s) to link to?
  – Which data source(s) to select data from?

52

Questions?

Thanks for your attention

53

References

[Shenoy and Shafer 1990] P. Shenoy, G. Shafer. Axioms for probability and belief-function propagation. In: Readings in Uncertain Reasoning. San Francisco: Morgan Kaufmann, pp. 575-610, 1990.

[Motta 1999] E. Motta. Reusable Components for Knowledge Modelling. Amsterdam: IOS Press, 1999.

[Bizer et al. 2009] C. Bizer, T. Heath, T. Berners-Lee. Linked Data - the story so far. International Journal on Semantic Web and Information Systems 5(3), pp. 1-22, 2009.

[Fellegi and Sunter 1969] I. P. Fellegi, A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association 64(328), pp. 1183-1210, 1969.

[Reiter 1987] R. Reiter. A theory of diagnosis from first principles. Artificial Intelligence 32(1), pp. 57-95, 1987.
