
Evaluation of Instance Matching Tools: The Experience of OAEI

A. Ferrara (a), A. Nikolov (b), J. Noessner (c), F. Scharffe (d)

(a) DICo, Università degli Studi di Milano, Via Comelico 39, 20135 Milano, Italy
(b) Knowledge Media Institute, The Open University, Walton Hall, Milton Keynes, MK7 6AA, United Kingdom;
    fluid Operations AG, Altrottstrasse 31, 69190 Walldorf, Germany
(c) KR & KM Research Group, University of Mannheim, B6 26, 68159 Mannheim, Germany
(d) LIRMM, University of Montpellier, 161 rue Ada, 34095 Montpellier Cedex 5, France

Abstract

Nowadays, the availability of large collections of data requires techniques and tools capable of linking data together, by retrieving potentially useful relations among them and helping to associate data representing the same or similar real objects. One of the main problems in developing data linking techniques and tools is to understand the quality of the results produced by the matching process. In this paper, we describe the experience of instance matching and data linking evaluation in the context of the Ontology Alignment Evaluation Initiative (IM@OAEI). Our goal is to be able to validate different proposed methods, identify the most promising techniques and directions for improvement, and, subsequently, guide further research in the area as well as the development of robust tools for real-world tasks.

Keywords: instance matching evaluation, data linking, semantic web

1. Introduction

Problems concerning the automatic matching of data and ontology instances, such as data linking and instance matching, are becoming crucial for the future directions of the Semantic Web and the Web in general. The availability of large collections of data requires techniques and tools capable of linking data together, by retrieving potentially useful relations among them and helping to associate data representing the same or similar real objects. One of the main problems in developing these kinds of techniques and tools is to have a methodology and a set of benchmarks for understanding the quality of the results produced by the matching process. Moreover, matching tool developers need a framework in which their tools can be compared with other similar tools on the same data, in order to understand where improvements and new solutions are possible and needed. We addressed these needs by organizing the instance matching track of the Ontology Alignment Evaluation Initiative (IM@OAEI).

Email addresses: [email protected] (A. Ferrara), [email protected] (A. Nikolov), [email protected] (J. Noessner), [email protected] (F. Scharffe)


In the remainder of this section, we first give a short background outlining the instance matching problem in the context of semantic data. Then, we present the requirements for evaluating instance matching and data linking approaches.

1.1. Instance matching problem

The instance matching problem can be informally defined as an operation which takes two collections of data as input and produces a set of mappings between entities of the two collections as output. Mappings denote binary relations between entities which are considered equivalent to one another. The following can serve as a high-level formal definition:

Preprint submitted to Elsevier October 7, 2012


Definition: Let D1 and D2 represent two datasets, each containing a set of data individuals Ii and structured according to a schema Oi. Each individual Iij ∈ Ii describes some entity ωij. Two individuals are said to be equivalent, Iij ≡ Ikl, if they describe the same entity, ωij = ωkl, according to a chosen identity criterion. The goal of the entity resolution task is to discover all pairs of individuals {(I1i, I2j) | I1i ∈ I1, I2j ∈ I2} such that ω1i = ω2j.

The actual format of the data individuals depends on the format of the datasets Di. If the Di represent relational databases, then database records serve as the individuals Ii and are identified by their primary key values. In the context of Semantic Web data, the datasets Di are represented by RDF graphs. Individuals Iij ∈ Ii are identified by URIs and described using the classification schema and properties defined in the corresponding ontology Oi.

The task can be approached using different types of available evidence. Based on this, existing techniques for instance matching can be classified into three main categories [16]:

• Value matching. These techniques usually serve as basic building blocks of data linking tools and focus on identifying equivalence between property values of instances. Typical examples of these techniques are string similarity metrics like edit distance or Jaro-Winkler.

• Individual matching. These techniques decide whether two individuals represent the same real-world object. They operate on the descriptions of a pair of individuals, which may contain multiple attributes, and utilise an aggregation of the similarities between corresponding property values (see the sketch after this list).

• Dataset matching. These techniques take into account all individuals in two datasets and try to construct an optimal alignment between these whole sets of individuals. They rely on the results of individual matching and can further refine them, utilizing methods such as similarity propagation, optimization algorithms, logical reasoning, etc.
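To make the first two categories concrete, the following minimal Python sketch (an illustration, not code from any surveyed tool) implements value matching as a normalized Levenshtein similarity and individual matching as a plain average of value similarities over shared properties; the property names and values are invented for the example.

    def edit_distance(a: str, b: str) -> int:
        """Levenshtein distance via dynamic programming."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                # deletion
                               cur[j - 1] + 1,             # insertion
                               prev[j - 1] + (ca != cb)))  # substitution
            prev = cur
        return prev[-1]

    def value_similarity(v1: str, v2: str) -> float:
        """Value matching: normalized string similarity in [0, 1]."""
        longest = max(len(v1), len(v2)) or 1
        return 1.0 - edit_distance(v1, v2) / longest

    def individual_similarity(ind1: dict, ind2: dict) -> float:
        """Individual matching: aggregate (here, average) the value
        similarities over the properties both descriptions share."""
        shared = ind1.keys() & ind2.keys()
        if not shared:
            return 0.0
        return sum(value_similarity(ind1[p], ind2[p]) for p in shared) / len(shared)

    # Invented property values for illustration.
    paper1 = {"title": "Evaluation of Instance Matching Tools", "year": "2012"}
    paper2 = {"title": "Evaluation of Instance Matching Tool",  "year": "2012"}
    print(individual_similarity(paper1, paper2))  # close to 1.0

Dataset matching techniques would then post-process such pairwise similarities, e.g., by enforcing a one-to-one assignment or propagating similarity along links between individuals.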

Existing tools normally combine different techniques from several categories and aim to specify an instance matching workflow which yields high-quality results. Detailed surveys of the instance matching techniques and tools created in the database community can be found in [12] and [25], while [16] lists the approaches applied to the Semantic Web domain and linked data.

1.2. Requirements for the evaluation of instance matching and data linking approaches

The main goal of developing common evaluation approaches for data linking is to be able to validate different proposed methods, identify the most promising techniques and directions for improvement, and, subsequently, guide further research in the area as well as the development of robust tools for real-world tasks. In order to achieve this, the evaluation procedure must satisfy several requirements, which present non-trivial challenges.

The first subset of these requirements relates to the representative capabilities of the evaluation approach. Evaluation results must provide useful information about the expected performance of the evaluated tools and techniques when they are applied to other real-world tasks, and must make it possible to compare different methods and choose the best-suited ones. The research performed in the area of data linking primarily builds on top of two research areas: database record linkage [14] and ontology matching [13, 17]. In both these areas, the primary evaluation approach which fits these requirements is benchmarking, where a set of pre-defined tests is used on which the results produced by the methods can be measured with respect to a well-defined scale. However, real-world tasks can differ with respect to many parameters: for example, the domain of the processed datasets, the richness of the information represented in these datasets, the dataset size, the chosen data format, and the availability of background knowledge which can be utilized. These parameters present different challenges to the data linking tools. Given the multitude of possible combinations of these parameters, it is impossible to devise a single benchmarking test which would approximate all the variety of real-world tasks. The set of benchmarking tests used for the evaluation of data linking tools must therefore aim at two diverse goals:

• being comprehensive: including as many challenges occurring in real-world matching tasks as possible (e.g., diversity of data formats, attributes, schemas);

• being illustrative: reflecting a distribution of data features similar to the most likely parameters of real-world tasks, i.e., a data feature which rarely occurs in real-world tasks should not dominate the test data.

Besides the parameters of the test data, the evaluation procedure must utilize appropriate evaluation criteria. In existing research, the most important criterion corresponds to the quality of the data linking results. This is normally measured in terms of precision (the proportion of correct mappings among the method's results) and recall (the proportion of correct mappings identified by the tool among all actual mappings). However, other evaluation criteria can be important as well: for example, given the large volumes of information which must be processed on the web of data, the computation time required by the linking tool becomes an orthogonal evaluation dimension.
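As an illustration, the following hedged Python sketch computes precision, recall, and the F-measure used later in the paper from a set of found mappings and a gold standard; the mappings are represented as pairs of invented entity identifiers.

    def precision_recall_f1(found: set, gold: set):
        """Precision, recall, and their harmonic mean (F-measure)."""
        correct = found & gold
        precision = len(correct) / len(found) if found else 0.0
        recall = len(correct) / len(gold) if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    gold  = {("d1:i1", "d2:a"), ("d1:i2", "d2:b"), ("d1:i3", "d2:c")}
    found = {("d1:i1", "d2:a"), ("d1:i2", "d2:x")}
    print(precision_recall_f1(found, gold))  # (0.5, 0.33..., 0.4)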

The second kind of requirements is pragmatic and relates to the evaluation procedure itself. In particular, the evaluation procedure should not require extensive effort, as this would discourage both tool developers and users from adopting it. Moreover, the chosen approach should ensure that the evaluation results are correctly measured with respect to the chosen criteria. This is not always trivial to achieve, especially with real-world data: e.g., constructing the gold standard alignments is problematic for large-scale datasets.

Since these requirements are to some extent mutually contradictory, it is hardly possible to satisfy all of them to the full extent. However, when devising a reusable evaluation method, it is necessary to consider all of them and aim at a reasonable compromise. Significant relevant work on the evaluation of both record linkage and ontology matching tools has been performed in the corresponding domains. In order to determine suitable evaluation methods for instance matching tools in the semantic web domain, it is necessary to consider these approaches first and analyze the possibilities for reuse.

The paper is organized as follows: in Section 2, we discuss the main contributions in the literature on the problem of evaluating instance matching tools and techniques. In Section 3, we present the experience of instance matching evaluation in the context of OAEI. In Section 4, we present our real-data benchmark, while in Section 5 we present our approach for the automatic generation of benchmarks. In Section 6, we discuss the main open issues and future directions in the field of instance matching evaluation. Finally, in Section 7, we give our concluding remarks.

2. State of the art

The task of data linking is closely related both to the record linkage problem studied in the database community and to the ontology matching task. Given that both these tasks require approximate methods, their evaluation requires estimating the output quality by performing experiments with realistic benchmarks. The evaluation initiatives performed to evaluate tools developed in these areas are relevant to the data linking domain both because they have to deal with similar requirements and because they can be partially reused.

2.1. Evaluation initiatives in the database community

In parallel with the development of record linkage algorithms, work on their evaluation has been conducted in the database community for a long time. Evaluation test sets used to validate these methods can generally be classified into two types:

• Real-world data sources. Usually, such a benchmark includes two or more publicly available datasets which originate from different sources but describe the same domain. Gold standard mappings between records in these datasets are either created manually or validated manually after an initial automatic generation of candidates.

• Artificially generated datasets. An artificially generated benchmark is normally created by taking one reference dataset and introducing artificial distortions into it in a controlled way: e.g., by removing/adding attributes and changing their values randomly. In this way, the creation of the gold standard set of mappings is straightforward: it includes all mappings between an original record and its distorted version.

Both approaches have their advantages and disadvantages. Taking real-world data sources allows the matching techniques to be evaluated in realistic conditions. In particular, this concerns the presence of heterogeneity problems, such as specific format differences, missing and incorrect attribute values, as well as the distribution of these problems in the dataset. However, this approach also has its disadvantages: given that the parameters of the datasets depend on the domain, it is difficult to generalize the results obtained on these datasets to other application cases. Moreover, creating the set of gold standard mappings for such datasets is problematic. Ideally, all mappings have to be checked manually, which is difficult to achieve for large-scale datasets. On the other hand, artificially generated datasets provide fully controlled test conditions, in which matching challenges can be added at will. However, the matching problems in artificially generated datasets usually do not cover domain-specific features, and the distribution of these problems can be unrealistic.

During the earlier stages of research, public databases describing the domain of scientific publications were particularly popular as sources of evaluation data. This was caused both by the fact that citation matching is a particularly important application domain in the academic environment and by the public availability of these datasets. In particular, Cora1 is one of the commonly used evaluation datasets in the database domain (e.g., used in [33], [32], [9]). It includes citations collected from the Web which reference a set of pre-selected academic papers. Another dataset, ACM-DBLP, was used in [24] and [36]; it was created by matching references to the same publications mentioned in the ACM2 and DBLP3 web repositories. Several other datasets, used by different researchers to evaluate their tools separately, were collected within the RIDDLE repository4.

These datasets have also been adapted to the semantic web standards and used to evaluate instance matching algorithms in the semantic web domain: e.g., Cora was used in [9], [29], and [19], while the Restaurants dataset from the RIDDLE repository was used by [34]. The advantage of reusing these is the possibility to compare with the techniques developed in the database community, despite the differences in the format of the processed data. However, these benchmark datasets are not fully representative of the challenges of the linked data environment. In particular, the data do not utilize specific semantic web features such as, e.g., class and property hierarchies or heterogeneous schema ontologies.

The second problem commonly occurring with these datasets is the lack of version consistency. Sometimes, different versions of the same dataset exist, as researchers can introduce minor modifications into a dataset in order to conduct specific experiments and report them; a modified version is then re-used for other experiments. In the process, it becomes difficult for users to track the provenance of the datasets reported in different publications, which can lead to results obtained on different versions being compared with each other. For instance, the popular Cora dataset exists in at least four different versions5.

1 http://www.cs.umass.edu/~mccallum/data.html
2 http://dl.acm.org
3 http://www.informatik.uni-trier.de/~ley/db/
4 http://www.cs.utexas.edu/users/ml/riddle/

In order to avoid these problems when evaluating data linking techniques, there is a need to create benchmarks which represent realistic matching challenges occurring in linked data sources and to maintain "canonical" versions of the benchmark datasets. These requirements were the primary motivations for the instance matching evaluation initiative within the OAEI evaluation campaign.

2.2. Evaluation of ontology matching tools

In the area of ontological schema matching, evaluation efforts have been joined within the Ontology Alignment Evaluation Initiative (OAEI) [10] to produce a comprehensive set of benchmark tests which test different aspects of existing matching tools. Since 2005, several datasets have been included in the evaluation campaign in order to evaluate various aspects of the ontology matching task.

5 Experiments with different Cora versions are reported in [9], [32], [33], and [38].

Originally, the OAEI evaluation campaign featured a single artificial benchmark (known as the benchmark dataset) that includes various tests illustrating different challenges occurring in real-world matching tasks: controlled modifications concern both the level of atomic elements (e.g., random modifications of elements/labels) and the structure level (e.g., deleting/inserting classes in the hierarchy). This serves well the purpose of checking the capabilities of schema matching tools to deal with the presence or absence of different features occurring in the ontologies. However, as evaluation experience has shown [10], this artificial benchmark is less suited for comparing the overall performance of tools: each test focuses on a specific type of situation while not providing a realistic test as a whole. To deal with this problem, the evaluation campaign was extended to include several realistic benchmarks involving real-world ontologies covering the same topics. These formed the basis of other benchmarks, in particular:

• Conference, which consists of a set of 16 ontologies dedicated to the topic of conference organisation and developed within the OntoFarm project6.

• Anatomy, which includes detailed ontologies describing human and mouse anatomy7.

Thus, the experience of ontology matching evaluation led to the conclusion that, to achieve effective evaluation of the tools, benchmark tests have to utilise both artificial and real-world datasets.

While some ontologies used in the ontology matching benchmarks contain instance data as well, they are not fully suitable for reuse in evaluating data linking methods, due to important differences between these tasks. These differences involve the following aspects.

Larger datasets. One of the main problems in matching instances, with respect to the problem of matching ontology concepts, is that instance datasets are usually much larger than ontologies in terms of the number of entities, properties, and values. This puts more emphasis on the problem of matching tool performance and complexity.

Large number of literal data values. Even if literals are often used in ontologies as labels, comments, and values of property restrictions, in instance datasets the number of properties which have literals as values is usually larger. This puts more emphasis on the problem of having specific data matching functions for strings, dates, and numbers.

Identity and similarity: the same-as problem. In ontologies, the semantics of equivalence and subsumption is well defined and can be checked through standard reasoning techniques. On the contrary, in instance datasets the semantics of same-as relations is ambiguous and, in fact, same-as relations are often used as links with different meanings. This puts emphasis on the problem of providing a formal interpretation of the links resulting from the matching process.

6 http://nb.vse.cz/~svatek/ontofarm.html
7 http://oaei.ontologymatching.org/2011/anatomy/index.html

Different role of names and property values. Both in ontology and in instance matching we deal with labels, used as names for concepts and instances, and with concrete values of properties, used to represent features of concepts and instances. Many matching tools work with names and property values under the assumption that they are useful to determine the meaning of concepts and the real-world objects denoted by instances. However, names and values have a different relevance when dealing with ontologies or instances. In ontologies, concept names are usually meaningful, although the semantics of concepts basically depends on the mutual concept relations and constraints involving concepts, while the concrete values of properties are limited to labels and comments, which are less useful for matching purposes. On the contrary, the names of instances are only a limited portion of the information available about instances, since instances are associated with several concrete values denoting their features. As an example, we can take the concept Person from the DBpedia ontology and the individual Aristotle from the DBpedia dataset. Person is a subclass of Agent and is associated with seven different concrete values of the property label. However, all these labels report the term "Person" in different languages, which is only partially useful to determine the concept meaning. On the contrary, Aristotle is denoted by many property values, providing information about his birth and death dates and locations, his philosophical interests and works, and so on. Thus, property values play a more relevant role in instance matching than in ontology matching, in that we can use some properties to determine the identity of each instance, especially those properties which contain unique values. This makes instance datasets more similar to relational database records and puts emphasis on the use of record linkage techniques for matching instances.

Different kinds of data heterogeneities. When matching labels and strings, the main problem of ontology matching is due to the fact that labels are used to describe the concept meaning in human-understandable terms. Thus, the heterogeneity is mainly due to the fact that often the same label is used to denote different concepts or, on the contrary, the same concept is labeled with different strings. These are language heterogeneities associated with the use of common words (nouns, verbs, adjectives). This happens also in instance matching but, in this case, there are other problems, more related to the representation of named entities and enumerated items. In fact, data heterogeneity is often due to errors in the data or to different conventions about name abbreviations, acronyms, and other string/date/number format heterogeneities. This requires more sophisticated techniques for syntactic data value matching. Moreover, in generic ontology matching, diversity between the descriptions of matching concepts and properties can be caused by different factors [23]: in one case, usage of synonyms; in another, rephrasing of descriptions; in a third, a slight semantic difference; etc. In instance matching tasks, it is more common that the same factor occurs for many pairs of instances and is specific to a particular pair of datasets. Thus, while for an ontology matching tool it is important to be able to apply a wide range of techniques to recognize a mapping between a pair of entities, for an instance matching method the focus is more on discovering the most appropriate set of techniques and applying them in a consistent way. The capability to adapt to the specific types of data at hand is thus particularly valuable.

Structural differences between ontologies and instances as graphs. Graph-based matching techniques are crucial and much used both in ontology and in instance matching. However, when seen as graphs, there are some differences between ontologies and instance datasets. Usually, the density of instance graphs is higher than that of ontologies, and the average number of other entities connected to a given instance is also higher. This leads to more complex graph matching problems, where the identity of an instance often depends on the meaning of other instances connected with it. In addition, while the ontology schema usually represents a single graph, at the data level information is commonly described using a set of homogeneous subgraphs. This makes it important to be able to extract and analyse the relevant subgraphs which carry the instance identity.

Relations between datasets and the real world. Typically, ontologies are used to represent a shared vocabulary of concepts described by common nouns. On the contrary, instances are usually intended to represent real-world objects or digital documents identified by named entities. This implies that thesauri and top ontologies, which are useful resources for ontology matching, are often less useful for instance matching. On the other hand, instance matching may rely on external data sources of interest, so that it is possible to use an existing collection of objects as a reference for the validation of matching results: for example, a publicly available repository such as Wikipedia or Geonames can be used to check whether a particular name describes a single entity (e.g., the city "Dnepropetrovsk") or several entities (e.g., "St Petersburg", which may refer to places in Russia and the USA).

Mutual relations between ontology and instance matching. Ontology matching may use instance matching as part of the matching process, in that instances characterize the meaning of concepts: they serve as the main type of evidence for instance-based ontology matching techniques. On the other hand, ontological heterogeneity presents an additional challenge for instance matching, as corresponding sets of instances and the properties relevant to instance identity can be represented in different ways. In these cases, the ability to perform partial ontology matching is important for an instance matching tool.

These differences have motivated the need to develop a different set of benchmarks specifically for the task of instance matching in the semantic web domain. For this reason, it was decided to establish instance matching evaluation as a separate sub-track within the OAEI evaluation campaign, which by now has been run three times (in 2009, 2010, and 2011). In the following sections, we describe the organization of the campaign, the proposed benchmarks, as well as the results of the tests.

3. Instance Matching at the Ontology Alignment Evaluation Initiative

The Ontology Alignment Evaluation Initiative (OAEI) [10] aims to evaluate the performance of ontology matching systems on a variety of problems related to ontology matching. The instance matching track was organized over the last three editions of the OAEI.8 Given the requirements and issues presented in Sections 1 and 2, organizing the evaluation required making decisions at different levels, both technical (the choice of benchmark datasets and evaluation metrics) and pragmatic (campaign organization). In this section, we describe the general choices we made for the evaluation campaign as a whole: Section 3.1 outlines the main stages of the evaluation campaign, while Section 3.2 discusses the choice of evaluation metrics. Then, Sections 4 and 5 describe the two chosen types of datasets: natural and synthetic.

8 Results of all OAEI campaigns are available at http://oaei.ontologymatching.org.

3.1. Organization

The evaluation is performed as follows:

preparation phase Datasets to be matched and reference alignments are provided in advance. This gives potential participants the occasion to send observations, bug corrections, remarks, and other test cases to the organizers. The goal of this preparatory period is to ensure that the delivered tests make sense to the participants. The final test base is then released after a month. The datasets do not evolve after this period.

execution phase During the execution phase, participants use their systems to automatically match the instance data from the test cases. Participants are asked to use one algorithm and the same set of parameters for all tests in all tracks. It is fair to select the set of parameters that provides the best results. Besides parameters, the input of the algorithms must be the two datasets to be matched plus any general-purpose resource available to everyone, i.e., no resource especially designed for the test. In all cases the datasets are serialized in RDF, and in some cases they contain the ontology declarations. The expected alignments are provided in the Alignment format expressed in RDF/XML [8]. Participants also provide papers, published in the Ontology Matching Workshop proceedings, and a link to their systems and their configuration parameters.

evaluation phase The organizers evaluate the alignments provided by the participants and return comparisons of these results. In order to ensure that the provided results can be processed automatically, participants are requested to provide (preliminary) results after two months. The standard evaluation measures are precision and recall, computed against the reference alignments. For the aggregation of these measures, weighted harmonic means are used, the weights being the sizes of the sets of true positives; this clearly helps in the case of empty alignments (a sketch of this aggregation follows). Another technique that has been used is the computation of precision/recall graphs, so participants were advised to provide their results with a weight attached to each correspondence they found. New measures addressing some limitations of precision and recall have also been used for testing purposes, as well as measures compensating for the lack of complete reference alignments.
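A minimal sketch of this aggregation, under the stated assumption that per-test-case scores are combined as a harmonic mean weighted by true-positive counts; the handling of degenerate cases here is our own choice, not necessarily that of the OAEI tooling.

    def weighted_harmonic_mean(scores, weights):
        """Harmonic mean of per-test-case scores, weighted by the number
        of true positives; weight-0 cases (e.g., empty alignments) drop
        out instead of breaking the aggregation."""
        pairs = [(s, w) for s, w in zip(scores, weights) if w > 0]
        if any(s == 0 for s, w in pairs):
            return 0.0  # a zero score forces the harmonic mean to zero
        return sum(w for _, w in pairs) / sum(w / s for s, w in pairs)

    precisions = [0.9, 0.8, 0.5]  # per-test-case precision
    true_pos   = [90, 40, 10]     # weights: true positives per test case
    print(weighted_harmonic_mean(precisions, true_pos))  # ~0.82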

3.2. Evaluation metrics

In addition to the choice of the test datasets, the evaluation methodology must include the choice of valid quantitative evaluation measures. A variety of evaluation measures has been used to validate record linkage algorithms:

• Maximum F-measure: the harmonic mean between pairwise precision and recall, achieved with the optimal settings of an algorithm [2, 6].

• Pairwise accuracy for the optimal number of pairs [35].

• Percentage of correct equivalence classes (sets of equivalent instances obtained by computing the transitive closure over the obtained mappings) [26].

• Proportions of true matching pairs at different error rate levels [39].

• Precision-recall curves, which visualize an algorithm's performance over the whole range of possible threshold values [3].

These evaluation measures highlight different performance aspects and have their own advantages and disadvantages. F-measure combines precision and recall in one balanced metric, but it does not take true negative matches into account. Pairwise accuracy, on the contrary, counts both correctly identified positive and negative matches. However, the disadvantage is that in the case of a predominance of negative examples in the gold standard (which is normal for an instance matching task), the metric becomes non-discriminative: a simple matching algorithm which rejects all possible mappings between two datasets would already obtain high accuracy. Counting equivalence classes instead of atomic mappings does not consider errors within each cluster. Error rate thresholds assume a specific matching methodology [14]. Finally, precision-recall curves can visualize complex behavior patterns of algorithms, but assume that these algorithms use a threshold-based cut-off, which is not always the case.
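The following sketch shows how a precision-recall curve and the maximum F-measure can be computed when each found correspondence carries a confidence weight; it is an illustrative reconstruction, not the OAEI evaluation code.

    def pr_curve(scored, gold):
        """One (threshold, precision, recall) point per distinct confidence;
        scored maps each correspondence to its confidence weight."""
        points = []
        for threshold in sorted(set(scored.values())):
            kept = {m for m, s in scored.items() if s >= threshold}
            correct = kept & gold
            p = len(correct) / len(kept) if kept else 0.0
            r = len(correct) / len(gold) if gold else 0.0
            points.append((threshold, p, r))
        return points

    def max_f_measure(scored, gold):
        return max((2 * p * r / (p + r)) if p + r else 0.0
                   for _, p, r in pr_curve(scored, gold))

    gold   = {("i1", "a"), ("i2", "b")}
    scored = {("i1", "a"): 0.9, ("i2", "x"): 0.6, ("i2", "b"): 0.4}
    print(max_f_measure(scored, gold))  # 0.8, reached at threshold 0.4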

In the area of ontology matching, novel evaluation metrics have been proposed to improve on standard precision/recall: in particular, [11] proposes several extended metrics, such as symmetric proximity, correction effort, and relaxed precision/recall. These metrics take into account the semantics of mappings: e.g., a class-subclass mapping between two classes which are equivalent in the gold standard will still contribute to the final score, although with a lower weight. This provides the flexibility needed to deal with relationships between ontological concepts which are not always clear-cut. Although more appropriate to the schema matching environment, these measures can be useful for instance matching as well, e.g., if the datasets to be matched model instances at different levels of granularity.

Given these pros and cons, for the evaluation of Semantic Web data linking tools we found the precision and recall measures the most informative: given the large volumes of data containing largely distinct individuals, considering correctly identified non-matches when measuring the performance of the tools does not add valuable information. To show the balance between these metrics, we used the maximum F-measure as a single quantitative indicator and precision-recall curves as a more fine-grained means of illustration. While more involved metrics, such as those defined in [11], were not suitable for the benchmark datasets chosen so far, we are studying the possibility of using them in future iterations.

In the rest of the paper we describe the benchmarks we developed for evaluating instance matching tools in this context, which include both real-world (Section 4) and automatically generated (Section 5) datasets, provide an overview of our experience with using these datasets to evaluate proposed tools, and discuss the lessons learnt from this experience (Section 6).

4. The real-data benchmark

In this section we describe three benchmarks proposed to the OAEI instance matching participants. These benchmarks are based on datasets actually available as web data and describing data used in applications. We also give the evaluation results for each benchmark.

4.1. OAEI 2009: interlinking scientific publications data

Starting from 2009, we proposed to the OAEI participants a track focusing on instance matching.

4.1.1. Benchmark

In the 2009 edition, the test bed was made of three datasets in the domain of scientific publications:

• AKT EPrints archive, containing information about papers produced within the AKT research project.9

• Rexa dataset, extracted from the Rexa search server10, which was constructed at the University of Massachusetts using automatic information extraction algorithms.

• SWETO-DBLP dataset11, a publicly available dataset listing publications from the computer science domain.

The SWETO-DBLP dataset was originally represented in RDF. The two other datasets (AKT EPrints and Rexa) were extracted from the HTML sources using specially constructed wrappers and structured according to the SWETO-DBLP ontology. This heterogeneity resulted in many non-trivial cases of data mismatches. Sometimes the sources contained misrepresented data fields, missing values, and even incorrect values (e.g., incorrect publication dates and missing authors).

The ontology describes information about scientific publications and their authors and extends the commonly used FOAF ontology. Authors are represented as individuals of the foaf:Person class, and a special class sweto:Publication is defined for publications, with two subclasses, sweto:Article and sweto:Article_in_Proceedings, for journal and conference publications respectively. The participants were invited to produce alignments for each pair of datasets (AKT/Rexa, AKT/DBLP, and Rexa/DBLP).

9 http://www.aktors.org/
10 http://rexa.info/
11 http://lsdis.cs.uga.edu/projects/semdis/swetodblp/

This benchmark has the advantage of containing three datasets about the same domain. The sets of matching instances overlap across the three datasets, giving the possibility to validate some matching results using transitivity (see the sketch below). The publications domain is also familiar to the Ontology Alignment Evaluation Initiative.
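A hedged sketch of this transitivity-based validation, with invented identifiers: composing the AKT/Rexa and Rexa/DBLP alignments yields pairs that are expected to appear in the AKT/DBLP alignment, and missing compositions flag candidates for manual review.

    def transitivity_violations(akt_rexa, rexa_dblp, akt_dblp):
        """Compose two alignments; compositions absent from the third
        are flagged as candidate errors (not proven ones)."""
        composed = {(i, k)
                    for (i, j) in akt_rexa
                    for (j2, k) in rexa_dblp
                    if j == j2}
        return composed - akt_dblp

    akt_rexa  = {("akt:p1", "rexa:p9")}
    rexa_dblp = {("rexa:p9", "dblp:p42")}
    akt_dblp  = set()  # the composed link (akt:p1, dblp:p42) is missing
    print(transitivity_violations(akt_rexa, rexa_dblp, akt_dblp))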

4.1.2. Results

Five systems participated in the evaluation: DSSim [28], RiMOM [27], OKKAM [34], HMatch [5], and ASMOV [21]. In this first instance matching track, 4 systems out of 5 were generic ontology matching tools which include instance matching as part of their functionality, while only one (OKKAM) was specifically aimed at resolving data-level coreferences.

Table 1 shows the results. The AKT/Rexa test scenario was the only one for which the results for ASMOV were available, and the only one for which all the systems provided alignments for both the foaf:Person and sweto:Publication classes. OKKAM for the AKT/DBLP test case and RiMOM for the Rexa/DBLP test case only produced alignments for Publication instances, which reduced their overall recall. For the class Publication, the best F-measure in all three cases was achieved by HMatch, with RiMOM being second. OKKAM, which specifically focused on precision, achieved the highest precision in all three cases at the expense of recall. It is interesting to see the difference between the systems in the Rexa/DBLP scenario, where many distinct individuals had identical titles (e.g., "Editorial." or "Minitrack Introduction."): this primarily affected precision in the case of HMatch and RiMOM, but reduced recall for OKKAM.

The performance of all systems was lower for the class Person, where ambiguous personal names and different label formats reduced the performance of string similarity techniques. The highest F-measure was achieved by RiMOM for the AKT/Rexa scenario and by HMatch for the AKT/DBLP and Rexa/DBLP cases. Again, it is interesting to note the difference between HMatch and OKKAM in the Rexa/DBLP case, where the first system focused on F-measure and the second one on precision. This distinction of approaches can be an important criterion when a tool has to be selected for a real-world use case: in some cases the cost of an erroneous correspondence is much higher than the cost of a missed one (e.g., a large-scale entity naming service such as OKKAM), while in other scenarios this might not be true (e.g., assisting a user who performs manual alignment of datasets). In contrast, in the AKT/Rexa scenario the performance of OKKAM was lower than the performance of the other systems in terms of both precision and recall. This was caused by the different label formats used by the AKT and Rexa datasets ("FirstName LastName" vs. "LastName, FirstName"), which affected OKKAM most.

In this first evaluation track, the benchmark provided a representative set of realistic challenges occurring in real-world data, which gave interesting insights into the capabilities of different algorithms. However, in itself this benchmark was insufficient, as it did not cover some important aspects, most notably large scale and ontological heterogeneity. Two of the three datasets were relatively small: this meant that in each pairwise matching task one of the datasets was small, which reduced the complexity of the matching task. Moreover, all three datasets were structured using the same ontology (SWETO), which also simplified the challenges the tools had to tackle.

4.2. OAEI 2010: interlinking health-care data

In OAEI 2010, participants were asked to interlink four datasets, selected for their potential to be interlinked, for the availability of curated interlinks between them, and for their size.

4.2.1. Benchmark

All datasets are in the health-care domain and all of them contain information about drugs (see [22] for more details on the datasets):

dailymed is published by the US National Library of Medicine and covers marketed drugs. Dailymed contains information on the chemical structure, mechanism of action, indication, usage, contraindications, and adverse reactions of the drugs.

diseasome contains information about 4,300 disorders and genes.

drugbank is a repository of more than 5,000 drugs approved by the US Food and Drug Administration. It contains chemical, pharmaceutical, and pharmacological data along with the drug data.

sider contains information on marketed drugs and their recorded adverse reactions. It was originally published as flat files before being converted to linked data through a relational database.


Table 1: Results of the real-data benchmark subtrack (OAEI 2009).

                  sweto:Publication        foaf:Person              Overall
System            Prec.  FMeas.  Rec.      Prec.  FMeas.  Rec.      Prec.  FMeas.  Rec.

AKT/Rexa
  DSSim           0.15   0.16    0.16      0.81   0.43    0.30      0.60   0.38    0.28
  RiMOM           0.96   0.96    0.97      0.93   0.79    0.70      0.94   0.80    0.59
  OKKAM           0.99   0.76    0.61      0.73   0.03    0.02      0.94   0.18    0.10
  HMatch          0.97   0.93    0.89      0.94   0.56    0.39      0.95   0.62    0.46
  ASMOV           0.32   0.46    0.79      0.76   0.37    0.24      0.52   0.39    0.32

AKT/DBLP
  DSSim           0      0       0         0.15   0.19    0.11      0.17   0.13    0.15
  RiMOM           1.0    0.84    0.72      0.92   0.79    0.70      0.93   0.73    0.70
  OKKAM           0.98   0.88    0.80      0      0       0         0.98   0.28    0.16
  HMatch          0.93   0.95    0.97      0.58   0.57    0.57      0.65   0.65    0.65

Rexa/DBLP
  DSSim           0      0       0         0      0       0         0      0       0
  RiMOM           0.94   0.94    0.95      0.76   0.71    0.66      0.80   0.76    0.72
  OKKAM           0.98   0.26    0.15      1.00   0.20    0.11      0.99   0.21    0.12
  HMatch          0.45   0.61    0.96      0.40   0.37    0.34      0.42   0.45    0.48

These datasets were semi-automatically interlinked using Silk [4] and ODD Linker [18], providing the reference alignments for this task, and participants were asked to retrieve these links using an automatic method.

As for the 2009 edition, this benchmark provides four interlinked datasets on the same domain. Reference links were available, provided by the Linked Open Drug Data working group [22]. On the other hand, the datasets are of relatively small size, and thus do not evaluate the ability of the systems to scale. The fact that the datasets are all about the same domain is both positive, as it allows systems to use links between one pair of datasets to validate (or invalidate) links between another pair of datasets, and negative, as systems can be tuned to be more efficient for this particular domain, for example by using domain-specific background knowledge. Another inconvenience of this benchmark lies in the fact that the matching task was rather easy, because the domain has strong naming conventions, so that two drugs will very likely have the same name across all drug datasets.

4.2.2. Results

Two systems participated in the data interlinking task: ObjectCoref [18] and RiMOM [27]. Table 3 shows the results.

Table 2: Health-care benchmark composition.

             Dailymed   Diseasome   Drugbank   Sider
Dailymed     0          0           0          1,592
Diseasome    0          0           0          238
Drugbank     0          0           0          283
Sider        1,986      0           1,140      0

The results are very different for the two systems, with ObjectCoref being better in precision and RiMOM being better in recall. A difficult task with interlinking real data is to understand whether the results are due to a weakness of the matching system or because the links themselves are not very reliable. In any case, what we can conclude from this experiment with linked data is that a lot of work is still required in three directions: i) providing a reliable mechanism for system evaluation; ii) improving the performance of matching systems in terms of both precision and recall; iii) working on the scalability of matching techniques, in order to make the task of matching large collections of real data affordable. We have thus built the OAEI 2011 instance matching track with these challenges in mind.

Table 3: Results of the real-data benchmark subtrack (OAEI 2010).

ObjectCoref
Dataset      Prec.   FMeas.   Rec.
dailymed     0.55    0.09     0.05
diseasome    0.84    0.10     0.05
drugbank     0.30    0.05     0.03
sider        0.00    NaN      0.00
H-mean       0.50    0.08     0.04

RiMOM
Dataset      Prec.   FMeas.   Rec.
dailymed     0.08    0.13     0.30
diseasome    0.08    0.13     0.47
drugbank     0.05    0.08     0.42
sider        0.62    0.53     0.47
H-mean       0.08    0.13     0.35

4.3. OAEI 2011: interlinking the New York Times data

While using subsets of publicly available linked data repositories has shown its value, in particular to illustrate matching challenges occurring in real-world domains, the evaluation procedure in OAEI 2010 also identified some issues. In particular, the topic of the datasets was restricted to the medical domain, which biased the overall benchmark towards the specific features of this domain. Second, due to the need to support large-scale matching tasks, the set of gold standard mappings could not be constructed manually, and evaluation had to rely on pre-existing mappings between datasets. These pre-existing mappings were constructed by semi-automated tools, which themselves did not always produce results with 100% quality.

4.3.1. Benchmark

To deal with these issues, the instance matching track in OAEI 2011 included a set of tests involving the New York Times (NYT) linked data12. The NYT repository includes three subsets describing different types of entities mentioned in New York Times articles: people, organisations, and places. These three subsets were linked to three commonly used semantic web data repositories: DBpedia13, Freebase14, and Geonames15. These links were provided by the data publishers, which improved the gold standard quality.

12 http://data.nytimes.com/
13 http://dbpedia.org
14 http://www.freebase.com/
15 http://www.geonames.org/

The data in the NYT datasets is structured using the commonly used SKOS vocabulary16: individuals are modelled as instances of the class skos:Concept, and instance labels use the property skos:label rather than the generic rdfs:label. Other vocabularies are used to represent domain-specific properties, such as the number of relevant NYT articles and geo-coordinates (for locations).

While the links provided in the benchmark were mostly correct, some participants found that they were not complete. Also, the size of the benchmark is still not large enough to really evaluate the scaling capabilities of the participating systems.

4.3.2. Results

In the OAEI 2011 evaluation initiative, test results on the NYT benchmark dataset were produced using three instance matching tools: AgreementMaker, SERIMI, and Zhishi.links [7, 1, 30]. Table 5 shows an overview of the precision, recall, and F1-measure results per dataset for these tools, while Figure 1 shows the precision-recall graph for their results. In the experiments, Zhishi.links managed to produce high-quality mappings consistently over all datasets: it obtained the highest scores on 4 tests out of 7, and the highest average scores. SERIMI performed particularly well on the Freebase datasets (it outperformed the other tools on 2 tests), while AgreementMaker was particularly successful at linking people between NYT and Freebase.

[Figure 1: Precision/recall of tools participating in the DI subtrack. Precision (y-axis) plotted against recall (x-axis), both from 0 to 1, for AgreementMaker, SERIMI, and Zhishi.links.]

However, the results also highlighted some important issues. The first one is still the quality of the gold standard mappings.

16 http://www.w3.org/TR/skos-reference/


Table 4: The New York Times benchmark composition.

Facet           # Concepts   Links to Freebase   Links to DBPedia   Links to Geonames
People          4,979        4,979               4,977              0
Organizations   3,044        3,044               1,965              0
Locations       1,920        1,920               1,920              1,920

Table 5: Results of the real-data benchmark subtrack (OAEI 2011).

                       AgreementMaker           SERIMI                   Zhishi.links
Dataset                Prec.  FMeas.  Rec.      Prec.  FMeas.  Rec.      Prec.  FMeas.  Rec.
DI-nyt-dbpedia-loc.    0.79   0.69    0.61      0.69   0.68    0.67      0.92   0.92    0.91
DI-nyt-dbpedia-org.    0.84   0.74    0.67      0.89   0.88    0.87      0.90   0.91    0.93
DI-nyt-dbpedia-peo.    0.98   0.88    0.80      0.94   0.94    0.94      0.97   0.97    0.97
DI-nyt-freebase-loc.   0.88   0.85    0.81      0.92   0.91    0.90      0.90   0.88    0.86
DI-nyt-freebase-org.   0.87   0.80    0.74      0.92   0.91    0.89      0.89   0.87    0.85
DI-nyt-freebase-peo.   0.97   0.96    0.95      0.93   0.92    0.91      0.93   0.93    0.92
DI-nyt-geonames        0.90   0.85    0.80      0.79   0.80    0.81      0.94   0.91    0.88
H-mean                 0.92   0.85    0.80      0.89   0.89    0.88      0.93   0.92    0.92

Despite the fact that the links were checked by the data publisher at the time of their generation, errors in the gold standard still occurred. These errors were caused by several factors: the evolution of the datasets since the time of interlinking, omitted mappings (false negatives), which are particularly difficult to discover manually, as well as the presence of ambiguous and duplicate instances in the target repositories. The second issue concerns the use of specific techniques by the matching systems, in particular the use of domain-specific knowledge such as common abbreviations. There are different points of view on the use of domain knowledge: for example, it is explicitly forbidden in the OAEI schema matching tests, as schema matching tools must be able to deal with generic knowledge models. This can be seen as too restrictive for instance matching tools, since many cases of heterogeneity in instance matching tasks cannot be resolved without domain knowledge; nevertheless, cases where a tool is specifically targeted at solving the benchmark tasks have to be prevented, to ensure that the evaluation results can be generalized. Establishing rules concerning the use of domain knowledge constitutes an important challenge for the OAEI instance matching track, and for future instance matching initiatives in general, as it influences the quality of comparative evaluation.

5. The automatically generated benchmark

The automatically generated benchmark (called IIMB) is based on the idea of automatically acquiring a potentially large set of data from an existing data source and representing it in the form of an OWL ABox, serialized in RDF (both in 2010 and in 2011, the dataset was extracted from Freebase17). Then, starting from the initial set of data, we programmatically introduce several kinds of data transformations, with the goal of producing a final set of ABoxes in a controlled way. Participants are then required to match each of the transformed ABoxes against the initial one, trying to find the correct mappings between the original entities and the transformed ones. The main advantage of such an approach is that we have control over the type and strength of each transformation, which means that it is possible to analytically evaluate the results produced by each tool, highlighting its potential points of strength and weakness.

5.1. Creation of the benchmark

17 http://www.freebase.com

The benchmark is created using the SWING approach (Semantic Web INstance Generation) [15], a disciplined approach to the semi-automatic generation of benchmarks to be used for the evaluation of matching applications. The SWING approach has been implemented as a Java application and is available at http://code.google.com/p/swing.

The SWING approach is articulated in three phases, as shown in Figure 2. These phases are briefly described in Sections 5.2-5.4, which summarize [15].

5.2. Data acquisition techniques.

The SWING data acquisition phase is based on the idea of acquiring data from a linked data repository in a controlled way, using a set of predefined queries, and then enriching the structural and semantic complexity of the acquired data from the description logic ALE(D) up to ALCHI(D). The main reason for the enrichment step is that linked data typically feature a limited level of semantic complexity, while we are interested in providing a benchmark suitable also for evaluating the logical and reasoning capabilities of the matching tools at hand. To briefly summarize the extensions added during the enrichment step, we report the main operations supported by SWING:

• Add super classes and super properties.

• Convert attributes to class assertions.

• Determine disjointness restrictions.

• Enrich with inverse properties.

• Specify domain and range restrictions.

All these operations are semi-automatically performed by focusing on data features like property values and class relations. The benchmark designer can choose which operations have to be applied to the data in order to control the semantic complexity of the final ABox; a sketch of one such operation follows.
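As an illustration of the second operation in the list ("convert attributes to class assertions"), the following Python sketch works over a toy triple list; the property names and the class-name derivation are assumptions made for the example and do not reflect the actual SWING implementation, which operates on OWL ontologies.

    def attribute_to_class_assertions(triples, attribute):
        """Replace (s, attribute, "some value") triples with
        (s, rdf:type, :SomeValue) class assertions."""
        out = []
        for s, p, o in triples:
            if p == attribute:
                class_name = "".join(w.capitalize() for w in o.split())
                out.append((s, "rdf:type", ":" + class_name))
            else:
                out.append((s, p, o))
        return out

    # Toy data echoing the film example of Figure 2.
    abox = [(":film1", ":genre", "science fiction"),
            (":film1", ":name", "star wars iv a new hope")]
    print(attribute_to_class_assertions(abox, ":genre"))
    # [(':film1', 'rdf:type', ':ScienceFiction'), (':film1', ':name', ...)]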

5.3. Data transformation techniques.

In the subsequent data transformation activity, the initial ABox is modified in several ways, generating a set of new ABoxes called test cases. Each test case is produced by transforming the individual descriptions in the reference ABox into new individual descriptions that are inserted into the test case at hand. In particular, the SWING approach supports the following automatic transformation techniques. In our implementation, the evaluation designer has control over these techniques through an easily understandable parameter file.

Deletion/addition of individuals. The SWING approach allows the evaluation designer to select a portion of individuals to be deleted and/or duplicated in the new ontology. The reason behind this functionality is to obtain a new ontology where each original individual can have none, one, or more matching counterparts. The goal is to add some noise to the expected mappings, in such a way that the resulting benchmark contains both test cases where each original instance has only one matching counterpart (i.e., one-to-one mappings) and test cases where an original instance may have more than one matching counterpart (i.e., one-to-many mappings).
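A minimal sketch of how such a deletion/duplication step might work follows; the ABox representation, the ratios, and the naming scheme are our own illustrative assumptions, not SWING's actual implementation:

```python
import random

def delete_and_duplicate(abox, delete_ratio=0.1, duplicate_ratio=0.1, seed=42):
    """abox: dict mapping individual IDs to property/value dicts."""
    rng = random.Random(seed)
    individuals = list(abox)
    deleted = set(rng.sample(individuals, int(len(individuals) * delete_ratio)))
    new_abox, mappings = {}, []        # mappings: (original_id, transformed_id)
    for ind in individuals:
        if ind in deleted:
            continue                   # this individual has no counterpart
        new_abox[ind] = dict(abox[ind])
        mappings.append((ind, ind))
        if rng.random() < duplicate_ratio:
            dup = ind + "_dup"         # one-to-many counterpart
            new_abox[dup] = dict(abox[ind])
            mappings.append((ind, dup))
    return new_abox, mappings

abox = {f"i{n}": {"name": f"entity {n}"} for n in range(100)}
test_case, expected = delete_and_duplicate(abox)
```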

Data value transformation. These operations work on the concrete values of data properties and on their datatypes, when available. The output is a new concrete value. The standard transformation simulates typos; special value transformations exist for dates, names, gender attributes, and numbers such as integers and floats. Furthermore, the synonym transformation extracts synonyms from WordNet (e.g., "Jackson has won multiple awards" is transformed into "Jackson has gained several prizes"). Examples are shown in Table 6.

Table 6: Examples of data transformation operations

Operation                Original value                           Transformed value
Standard transformation  Luke Skywalker                           L4kd Skiwaldek
Date format              1948-12-21                               December 21, 1948
Name format              Samuel L. Jackson                        Jackson, S.L.
Gender format            Male                                     M
Synonyms                 Jackson has won multiple awards [...]    Jackson has gained several prizes [...]
Integer                  10                                       110
Float                    1.3                                      1.30
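The following sketch illustrates two of the operators in Table 6, typo simulation and date reformatting; the error model and parameters are illustrative assumptions, not the exact functions used in SWING:

```python
import random
from datetime import datetime

def simulate_typos(value, n_errors=2, seed=0):
    """Randomly replace, swap, or delete characters to simulate typos."""
    rng = random.Random(seed)
    chars = list(value)
    for _ in range(n_errors):
        pos = rng.randrange(len(chars))
        op = rng.choice(["replace", "swap", "delete"])
        if op == "replace":
            chars[pos] = rng.choice("abcdefghijklmnopqrstuvwxyz")
        elif op == "swap" and pos + 1 < len(chars):
            chars[pos], chars[pos + 1] = chars[pos + 1], chars[pos]
        elif op == "delete" and len(chars) > 1:
            del chars[pos]
    return "".join(chars)

def reformat_date(value):
    """'1948-12-21' -> 'December 21, 1948'."""
    return datetime.strptime(value, "%Y-%m-%d").strftime("%B %d, %Y")

print(simulate_typos("Luke Skywalker"))
print(reformat_date("1948-12-21"))
```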

Data structure transformation. These operations change the way data values are connected to individuals in the original ontology graph, and change the type and number of properties associated with a given individual. A comprehensive example of data structure transformation is shown in Table 7, where an initial set of assertions A is transformed into the corresponding set of assertions A′ by applying the property type transformation, property assertion deletion/addition, and property assertion splitting.


Figure 2: The SWING approach. The figure shows the three phases (Data Acquisition, Data Transformation, Data Evaluation), their activities (data selection and data enrichment; data value, data structure, and data semantic transformation; definition of the expected results), and their outputs (the reference OWL ABox, the transformed OWL ABoxes used as test cases, and the reference alignment), illustrated with an IIMB 2010 example drawn from Freebase.

Table 7: Example of data structure transformations

Original ABox                    Transformed ABox
name(n, "Natalie Portman")       name(n, "Natalie")
born_in(n, m)                    name(n, "Portman")
name(m, "Jerusalem")             born_in(n, m)
gender(n, "Female")              name(m, "Jerusalem")
date_of_birth(n, "1981-06-09")   name(m, "Auckland")
                                 obj_gender(n, y)
                                 has_value(y, "Female")

Data semantic transformation. These operations are based on the idea of changing the way individuals are classified and described in the original ontology. For the sake of brevity, we illustrate the main semantic transformation operations by means of the following example, taking into account the portion of the TBox and the assertion sets A and A′ shown in Table 8.

The challenge introduced by this kind of transformation is twofold: on one side, we expect the matching tool to be able to conclude that the individual b of the original ABox cannot be the same individual b that appears in the transformed ABox. This task is quite trivial if the matching tool enforces a reasoning process: the concepts Creature and Country are disjoint and, thus, the assertions Creature(b) and Country(b) are incompatible, which leads to the conclusion that b in A and b in A′ must denote different individuals. But, if the

Table 8: Example of data semantic transformations

TBox: Character ⊑ Creature, created_by ≡ creates⁻, acted_by ⊑ featuring, Creature ⊓ Country ⊑ ⊥

Original ABox                 Transformed ABox
Character(k)                  Creature(k)
Creature(b)                   Country(b)
Creature(r)                   ⊤(r)
created_by(k, b)              creates(b, k)
acted_by(k, r)                featuring(k, r)
name(k, "Luke Skywalker")     name(k, "Luke Skywalker")
name(b, "George Lucas")       name(b, "George Lucas")
name(r, "Mark Hamill")        name(r, "Mark Hamill")

matching tool does not enforce reasoning, the many similarities between the two descriptions may lead the tool to the wrong conclusion. On the other side, the two individuals k are expected to be considered as matching; but, again, reasoning is needed to conclude that the two assertions Character(k) and Creature(k) are fully compatible.
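To make the role of reasoning concrete, the following toy sketch implements just the disjointness check discussed above over a hand-coded class hierarchy; it is not a real OWL reasoner:

```python
SUBCLASS = {"Character": {"Creature"}}           # direct superclasses
DISJOINT = {frozenset(("Creature", "Country"))}  # declared disjoint pairs

def superclasses(cls):
    """All (transitive) superclasses of cls, including cls itself."""
    seen, todo = {cls}, [cls]
    while todo:
        for sup in SUBCLASS.get(todo.pop(), ()):
            if sup not in seen:
                seen.add(sup); todo.append(sup)
    return seen

def compatible(cls1, cls2):
    """True unless some pair of superclasses is declared disjoint."""
    return not any(frozenset((a, b)) in DISJOINT
                   for a in superclasses(cls1)
                   for b in superclasses(cls2))

print(compatible("Character", "Creature"))  # True  -> the two k may match
print(compatible("Creature", "Country"))    # False -> the two b cannot match
```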

5.4. Data evaluation techniques.

Finally, in the data evaluation activity, we automatically create a ground truth as a reference alignment for each test case. A reference alignment contains the mappings (in some contexts called "links") between the reference ABox individuals and the corresponding transformed individuals in the test


case. These mappings are what an instance matching application is expected to find between the original ABox and the test case.
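The following sketch shows how such a reference alignment is typically used to score a submitted alignment with standard precision, recall, and F-measure; mappings are modeled as simple ID pairs, and the data is invented:

```python
def evaluate(submitted, reference):
    """Precision, recall, and F-measure of submitted vs. reference mappings."""
    submitted, reference = set(submitted), set(reference)
    tp = len(submitted & reference)                 # correctly found mappings
    precision = tp / len(submitted) if submitted else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

reference = {("i1", "i1"), ("i2", "i2"), ("i2", "i2_dup")}  # one-to-many case
submitted = {("i1", "i1"), ("i2", "i2"), ("i3", "i9")}
print(evaluate(submitted, reference))  # approx. (0.67, 0.67, 0.67)
```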

5.5. The benchmarks for OAEI 2010 and 2011

For the OAEI 2010 campaign, two datasets of different sizes were used: one small dataset containing about 400 individuals and one larger dataset with about 1,400 individuals. In the OAEI 2011 campaign we increased the size of the dataset to 12,333 individuals, since we were interested in a more realistic benchmark size. Table 9 summarizes the characteristics of the datasets.

Table 9: Characteristics of the automatically generated benchmarks

                    2010 small   2010 large   2011
Individuals         363          1,416        12,333
Classes             29           81           163
Object properties   32           32           45
Data properties     13           13           13
DL expressivity     ALCHI(D)     ALCHI(D)     ALCHI(D)

There are 80 test cases for each of the datasets, divided into 4 sets of 20 test cases each. The first three sets are different implementations of data value, data structure, and data semantic transformations, respectively, while the fourth set is obtained by combining the three kinds of transformations.

5.6. Results

The IIMB benchmark has been used for the evaluation of matching systems in the 2010 and 2011 campaigns. In 2010, three systems, ASMOV [20], CODI [31], and RiMOM [37], participated on both datasets, the small and the large version of IIMB.

Figure 3 shows the results for the large version. All the systems obtained very good results when dealing with data value transformations and logical transformations, both in terms of precision and in terms of recall. Instead, in the case of structural transformations (e.g., property value deletion or addition, property hierarchy modification) and of the combination of different kinds of transformations, the results are worse, especially concerning recall. Looking at the results, it seems that the combination of different kinds of heterogeneity in data descriptions is still an open problem for instance matching systems. When comparing the overall f-measures of the participating systems, all systems are comparable: CODI reached the highest f-measure of 0.87, RiMOM obtained 0.84, and ASMOV 0.82.

The systems produced similar results on the small IIMB benchmark, although the average f-measure of every participating system was slightly better on the small benchmark than on the large one: CODI obtained 0.89, RiMOM 0.88, and ASMOV 0.84.

In the 2011 campaign we increased the size of the benchmark in order to create a more realistic testing scenario. With the IIMB 2011 dataset it was no longer feasible to calculate similarity values for every possible individual correspondence, because this would result in approximately 12,333 · 12,333 = 152,102,889 similarity comparisons. Unfortunately, only CODI could cope with this large dataset, achieving an average f-measure of 0.60. This score is not comparable with the IIMB 2010 benchmarks, since the transformation complexity was heavily increased in the 2011 benchmark.
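A common way to avoid such a quadratic comparison space is blocking: comparing only candidate pairs that share some key. The sketch below indexes individuals by name tokens; it is a generic illustration, not what any particular participant implemented:

```python
from collections import defaultdict

def candidate_pairs(source, target):
    """source/target map individual IDs to names; return candidate ID pairs."""
    index = defaultdict(list)                # token -> target IDs
    for tid, name in target.items():
        for token in name.lower().split():
            index[token].append(tid)
    pairs = set()
    for sid, name in source.items():         # only pairs sharing a token
        for token in name.lower().split():
            pairs.update((sid, tid) for tid in index[token])
    return pairs

source = {"a": "Luke Skywalker", "b": "George Lucas"}
target = {"x": "Luke Skywalker", "y": "Mark Hamill"}
print(candidate_pairs(source, target))       # {('a', 'x')}
```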

From these two years we can conclude that current matching systems perform quite well on small to medium sized benchmarks, but have difficulties with large datasets. Scalability, however, is an important requirement for linked open data, where vast numbers of instances exist.

6. Current issues and open problems

The benchmarks proposed for instance matching at OAEI so far have been shown to be adequate for the evaluation of instance matching algorithms and tools. This approach has been derived from the work done at OAEI on ontology matching [10]. The main effort of the instance matching track has been devoted to providing specific datasets for instance matching and to introducing the idea of using artificially generated data, with the goal of analytically controlling the performance of instance matching tools in different situations. However, our experience at OAEI has taught us that several issues in the evaluation of instance matching technologies, as compared with ontology matching tools, remain open.

The experience of the first editions of the instance matching track in the context of OAEI has provided materials, suggestions, and tools for improving the contest and for addressing the main issues still


Figure 3: Results of IIMB 2010, large version

open in future editions. In particular, some of the open issues discussed in the previous section have already been partially addressed, while others require new actions or new tools. In Table 10, we summarize what we already have and what is still required in the future.

In more detail, possible future actions for IM@OAEI are the following:

Larger datasets. We already have large datasets, especially as concerns the real-data benchmark. However, in previous editions the contest focused only on precision and recall. In future editions, we will focus more on the time performance of matching tools, trying to understand the scalability of matching techniques and the capability of matching tools to incrementally evaluate matchings as the number of entities in a dataset grows. To this end, we will provide datasets with different numbers of entities, relations, and properties, in order to relate the time efficiency of tools to the number of elements involved in the matching process.

Identity and similarity. One of the main problems in evaluating the capability of matching tools to identify different kinds of "same-as" relations between instances is that a reference set of expected relations among data is usually missing. To address this problem, one possible solution is to implement, in the artificially generated benchmark tool, a functionality for generating an expected set of mappings including several kinds

of possible relations, ranging from strict identity to simple similarity. As an example, consider an instance I featuring two properties: the first property is sufficient to identify the object represented by the instance (e.g., an ID code), while the second property has a more generic value (e.g., a date or the gender). In such a case, we would generate two different instances, I′ and I′′. The first instance I′ features both properties, with values transformed by means of one of the string/date transformation functions already available in our transformation tools. In generating the second instance I′′, we apply transformation functions but we also delete the ID property. Then, we generate two mappings for I. The first mapping I ↔ I′ denotes the identity between I and I′, since both instances refer to the same real-world object. The second mapping I ↔ I′′ denotes a generic similarity between the two instances, since we do not have enough information to conclude the identity of I′′. Finally, participants will be required to discover both mappings with the correct meaning. In particular, in order to avoid obliging the tools to produce a non-standard output, we envisage three different solutions. In the first one, tools are requested to exploit the "relation" property foreseen in the RDF alignment mapping format (http://alignapi.gforge.inria.fr/format.html) in order to state the difference between identity and similarity mappings. The second solution is based on the idea of considering similarity as a sort of "identity degree". In this case, we stress the role of


Table 10: Current and future actions with respect to the open issues in the evaluation of instance matching tools

Issue                                                             Current actions/material           Future directions
Larger datasets                                                   Real-data benchmark                More focus on the time performance of matching tools
Identity and similarity                                           Artificially generated benchmark   Generate different kinds of possible mappings according to different definitions of "same-as"; participants should be able to discriminate among the different mappings
Different role of names and property values                       Artificially generated benchmark   Generate new transformations on the basis of the property value distribution
Different kinds of data heterogeneities                           Artificially generated benchmark   Improve the transformation functions, including more standard transformations (e.g., acronyms)
Structural differences between ontology and instances as graphs  Artificially generated benchmark   Add the graph density as a parameter for transformation functions
Relations between datasets and the real world                     Real-data benchmark                Include the usage of external sources as a reference for matching tools
Mutual relations between ontology and instance matching           Artificially generated benchmark   Improve transformations based on the logical structure of the reference TBox

weights associated with mappings: the higher the weight, the stronger the similarity relation. In this approach, identity is represented as a similarity relation associated with a weight equal to 1.0. Finally, a third solution is to run two different tests, one for identity and the other for similarity. This last solution is preferable, because it gives participants the opportunity to change their tool configuration parameters according to the task at hand.
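A minimal sketch of the second solution, where a mapping carries a weight and identity is the limit case of weight 1.0 (the class and field names are invented):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mapping:
    source: str
    target: str
    weight: float            # 1.0 = identity, below 1.0 = mere similarity

    def is_identity(self):
        return self.weight == 1.0

m1 = Mapping("I", "I_prime", 1.0)          # same real-world object
m2 = Mapping("I", "I_doubleprime", 0.7)    # similar, identity unknown
print(m1.is_identity(), m2.is_identity())  # True False
```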

Different role of names and property values. This issue is already partially addressed in the artificially generated benchmark: we are already able to generate new property values by applying arbitrary transformation functions. However, the transformation process does not take into account the value distribution of each property. In other terms, a property with a limited number of distinct values across many individuals, such as a property gender with only two possible values (i.e., Male and Female), is transformed by applying the same rules used for properties with many different values across individuals, such as a property name. Our plan is to include such a parameter, making it possible for the benchmark designer to choose whether transformations should be applied only to properties featuring a limited range of possible values (i.e., non-identifying properties) and/or also to properties with many possible values (i.e., highly identifying properties). The idea here is that the second kind of property is more crucial in determining the identity of individuals and their mappings.
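As a sketch of how identifying and non-identifying properties could be told apart, the following uses the ratio of distinct values as a crude distinctiveness measure; the threshold is a hypothetical parameter:

```python
def distinctiveness(abox, prop):
    """Ratio of distinct values over all values asserted for `prop`."""
    values = [props[prop] for props in abox.values() if prop in props]
    return len(set(values)) / len(values) if values else 0.0

abox = {
    "i1": {"name": "Luke Skywalker", "gender": "Male"},
    "i2": {"name": "Leia Organa",    "gender": "Female"},
    "i3": {"name": "Han Solo",       "gender": "Male"},
    "i4": {"name": "Mark Hamill",    "gender": "Male"},
}
for prop in ("name", "gender"):
    identifying = distinctiveness(abox, prop) > 0.5  # hypothetical threshold
    print(prop, "identifying" if identifying else "non-identifying")
```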

Different kinds of data heterogeneities. This issue, too, is already partially addressed in the artificially generated benchmark. However, the majority of the string transformation functions used in the transformation tool are based on random string transformations. In order to make transformations more realistic, we will implement more transformation functions for standard data formats, such as acronyms.

Structural differences between ontology and instances as graphs. Currently, in the automatically generated benchmark it is possible to transform a property which has another instance as its value into a property with a concrete value, and vice versa. As an example, suppose we have an instance representing a movie, featuring a property director whose value is another instance D representing the director John Smith; D features a property name with the string "John Smith" as its value. In our


benchmark, it is possible to transform the movie instance by deleting the instance D and changing the property director so that it has the string "John Smith" as its value. Conversely, it is also possible to transform a string into a new individual. This benchmark generation capability helps the designer control the density of the resulting graph of the test cases. However, this procedure cannot be iterated to obtain large graphs with a very high number of individuals and properties. In order to increase the complexity of the graphs resulting from transformations, we will add a parameter which makes it possible to control the number of individuals and new properties added to the graph by this specific transformation.
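The director example can be sketched as follows, assuming a toy triple-based ABox; the function and property names are illustrative:

```python
def inline_individual(triples, prop="director", name_prop="name"):
    """Replace object values of `prop` with the name of the deleted node."""
    names = {s: o for s, p, o in triples if p == name_prop}
    targets = {o for _, p, o in triples if p == prop}
    out = []
    for s, p, o in triples:
        if p == prop and o in names:
            out.append((s, prop, names[o]))   # object -> concrete value
        elif s in targets:
            continue                          # drop the deleted individual
        else:
            out.append((s, p, o))
    return out

abox = [("movie1", "director", "D"), ("D", "name", "John Smith")]
print(inline_individual(abox))  # [('movie1', 'director', 'John Smith')]
```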

Relations between datasets and the real world. Our real-data benchmark already contains real-world data. However, until now we have not explored the role played by external data sources of reference in the instance matching process. A possible approach to that end is to provide a specific subset of data which includes additional data to be used as a support for correctly determining the mappings between elements. We will then require participants to execute a first run of matching using only the dataset without the additional data, which will instead be available only for a second run. The idea is to compare evaluation results after the first run against evaluation results after the second run, in order to observe the impact of using additional data in the matching process.

Mutual relations between ontology and instance matching. One section of the test cases produced in the artificially generated benchmark is devoted to transformations based on the TBox structure, which is transformed as well. To address this issue, we plan to improve the transformations based on the logical structure of the original dataset by including new ontological transformations while at the same time reducing the information provided by property values, especially concrete values. In this way, the information about instances derived from the TBox constraints will become more crucial for finding the correct mappings. The goal of this approach is to determine whether, and to what extent, matching tools are capable of exploiting the ontology during the instance matching process.

7. Concluding remarks

In this paper, we have presented the experience of IM@OAEI, an initiative to promote the evaluation of instance matching and data linking techniques and tools in the context of OAEI, the Ontology Alignment Evaluation Initiative. In particular, we have presented our approach, which is based on the idea of combining real data and automatically generated data for the evaluation, in order to provide, on one side, a realistic context for instance matching tools and, on the other side, a framework where we can reproduce different causes of data heterogeneity so as to analytically and programmatically verify the points of strength and weakness of each evaluated tool. Our future work in the next editions of IM@OAEI will be devoted to the study of new evaluation measures beyond classical precision and recall, as well as to the improvement of our benchmarks, with the goal of evaluating the behavior of instance matching tools with respect to some crucial open problems in the field, such as the semantics of instance mappings and the efficiency of matching tools when dealing with large collections of data.

References

[1] Samur Araujo, Arjen de Vries, and Daniel Schwabe. SERIMI results for OAEI 2011. In CEUR-WS Vol-814, 2011.

[2] Mikhail Bilenko and Raymond J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), pages 39–48, Washington DC, 2003.

[3] Mikhail Bilenko and Raymond J. Mooney. On evaluation and training set construction for duplicate detection. In ACM SIGKDD-03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation, Washington DC, 2003.

[4] Christian Bizer, Julius Volz, Georgi Kobilarov, and Martin Gaedke. Silk - a link discovery framework for the web of data. In 18th International World Wide Web Conference, April 2009.

[5] Silvana Castano, Alfio Ferrara, Davide Lorusso, Tobias Henrik Näth, and Ralf Möller. Mapping validation by probabilistic reasoning. In Proceedings of the 5th European Semantic Web Conference (ESWC'08), pages 170–184, Berlin, Heidelberg, 2008. Springer-Verlag.

[6] William W. Cohen and Jacob Richman. Learning to match and cluster large high-dimensional data sets for data integration. In 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002), 2002.

[7] Isabel F. Cruz, Cosmin Stroe, Federico Caimi, Alessio Fabiani, Catia Pesquita, Francisco M. Couto, and Matteo Palmonari. Using AgreementMaker to align ontologies for OAEI 2011. In CEUR-WS Vol-814, 2011.

[8] Jérôme David, Jérôme Euzenat, François Scharffe, and Cássia Trojahn dos Santos. The Alignment API 4.0. Semantic Web Journal, 2(1):3–10, 2011.

[9] Xin Dong, Alon Halevy, and Jayant Madhavan. Reference reconciliation in complex information spaces. In SIGMOD '05: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pages 85–96, New York, NY, USA, 2005. ACM.

[10] Jérôme Euzenat, Christian Meilicke, Heiner Stuckenschmidt, Pavel Shvaiko, and Cássia Trojahn. Ontology alignment evaluation initiative: Six years of experience. Journal on Data Semantics, XV:158–192, 2011.

[11] Marc Ehrig and Jérôme Euzenat. Relaxed precision and recall for ontology matching. In Proc. of the K-Cap Workshop on Integrating Ontologies, pages 25–32, Banff (Canada), 2005.

[12] Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1):1–16, 2007.

[13] Jérôme Euzenat and Pavel Shvaiko. Ontology Matching. Springer-Verlag, Heidelberg (DE), 2007.

[14] Ivan P. Fellegi and Alan B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183–1210, 1969.

[15] A. Ferrara, S. Montanelli, J. Noessner, and H. Stuckenschmidt. Benchmarking matching applications on the semantic web. In The Semantic Web: Research and Applications (ESWC 2011), Lecture Notes in Computer Science Vol. 6644, pages 108–122, 2011.

[16] A. Ferrara, A. Nikolov, and F. Scharffe. Data linking for the semantic web. International Journal on Semantic Web and Information Systems, 7(3):46–76, 2011.

[17] S. Castano, A. Ferrara, S. Montanelli, and G. Varese. Ontology and instance matching. In Knowledge-Driven Multimedia Information Extraction and Ontology Evolution, Springer Berlin/Heidelberg, 2011.

[18] Oktie Hassanzadeh, Lipyeow Lim, Anastasios Kementsietsidis, and Min Wang. A declarative framework for semantic link discovery over relational data. In WWW '09: Proceedings of the 18th International Conference on World Wide Web, pages 1101–1102, New York, NY, USA, 2009. ACM.

[19] Robert Isele and Christian Bizer. Learning linkage rules using genetic programming. In The 6th International Workshop on Ontology Matching (OM 2011), 10th International Semantic Web Conference (ISWC 2011), Bonn, Germany, 2011.

[20] Y.R. Jean-Mary, E.P. Shironoshita, and M.R. Kabuka. ASMOV: Results for OAEI 2010. Ontology Matching, 126, 2010.

[21] Yves R. Jean-Mary, E. Patrick Shironoshita, and Mansur R. Kabuka. Ontology matching with semantic verification. Web Semantics: Science, Services and Agents on the World Wide Web, 7, 2009.

[22] Anja Jentzsch, Jun Zhao, Oktie Hassanzadeh, Kei-Hoi Cheung, Matthias Samwald, and Bo Andersson. Linking open drug data. In Linking Open Data Triplification Challenge at I-Semantics, 2009.

[23] Michel Klein. Change Management for Distributed Ontologies. PhD thesis, Vrije Universiteit Amsterdam, 2004.

[24] Hanna Köpcke, Andreas Thor, and Erhard Rahm. Comparative evaluation of entity resolution approaches with FEVER. In 35th Intl. Conference on Very Large Databases (VLDB), Lyon, France, 2009.

[25] Hanna Köpcke and Erhard Rahm. Frameworks for entity matching: A comparison. Data & Knowledge Engineering, 69(2):197–210, 2010.

[26] Steve Lawrence, C. Lee Giles, and Kurt D. Bollacker. Autonomous citation matching. In 3rd International Conference on Autonomous Agents, New York, USA, 1999.

[27] Juanzi Li, Jie Tang, Yi Li, and Qiong Luo. RiMOM: A dynamic multistrategy ontology alignment framework. IEEE Transactions on Knowledge and Data Engineering, 21(8):1218–1232, 2009.

[28] M. Nagy, M. Vargas-Vera, and P. Stolarski. DSSim results for OAEI 2009. In The 4th Workshop on Ontology Matching (OM 2009), 8th International Semantic Web Conference (ISWC 2009), 2009.

[29] Andriy Nikolov, Victoria Uren, Enrico Motta, and Anne de Roeck. Refining instance coreferencing results using belief propagation. In 3rd Asian Semantic Web Conference (ASWC 2008), pages 405–419, Bangkok, Thailand, 2008.

[30] Xing Niu, Shu Rong, Yunlong Zhang, and Haofen Wang. Zhishi.links results for OAEI 2011. In CEUR-WS Vol-814, 2011.

[31] J. Noessner and M. Niepert. CODI: Combinatorial optimization for data integration - results for OAEI 2010. Ontology Matching, page 142, 2010.

[32] Parag and Pedro Domingos. Multi-relational record linkage. In KDD Workshop on Multi-Relational Data Mining, pages 31–48, Seattle, CA, USA, 2004. ACM Press.

[33] Steffen Rendle and Lars Schmidt-Thieme. Object identification with constraints. In 6th IEEE International Conference on Data Mining (ICDM '06), 2006.

[34] Heiko Stoermer, Nataliya Rassadko, and Nachiket Vaidya. Feature-based entity matching: The FBEM model, implementation, evaluation. In Proceedings of CAiSE 2010, pages 180–193, 2010.

[35] Sheila Tejada, Craig A. Knoblock, and Steven Minton. Learning object identification rules for information integration. Information Systems Journal, 26:635–656, 2001.

[36] Andreas Thor and Erhard Rahm. MOMA - a mapping-based object matching system. In 3rd Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, 2007.

[37] Z. Wang, X. Zhang, L. Hou, Y. Zhao, J. Li, Y. Qi, and J. Tang. RiMOM results for OAEI 2010. Ontology Matching, 195, 2010.

[38] Melanie Weis and Felix Naumann. Relationship-based duplicate detection. Technical Report HU-IB-206, Humboldt University of Berlin, 2006.

[39] William E. Winkler. Methods for record linkage and Bayesian networks. Technical report, US Bureau of the Census, Washington, DC 20233, 2002.
