

Web-scale Blocking, Iterative and Progressive Entity Resolution

Kostas Stefanidis
University of Tampere
Finland
kostas.stefanidis@uta.fi

Vassilis Christophides
INRIA-Paris & Univ. of Crete
France & Greece
[email protected]

Vasilis Efthymiou
ICS-FORTH & Univ. of Crete
Greece
[email protected]

Abstract—Entity resolution aims to identify descriptions of the same entity within or across knowledge bases. In this work, we provide a comprehensive and cohesive overview of the key research results in the area of entity resolution. We are interested in frameworks addressing the new challenges in entity resolution posed by the Web of data, in which real-world entities are described by interlinked data rather than documents. Since such descriptions are usually partial, overlapping and sometimes evolving, entity resolution emerges as a central problem both to increase dataset linking and to search the Web of data for entities and their relations. We focus on Web-scale blocking, iterative and progressive solutions for entity resolution. Specifically, to reduce the required number of comparisons, blocking is performed to place similar descriptions into blocks, and comparisons to identify matches are executed only between descriptions within the same block. To minimize the number of missed matches, an iterative entity resolution process can exploit any intermediate results of blocking and matching, discovering new candidate description pairs for resolution. Finally, we overview works on progressive entity resolution, which attempt to discover as many matches as possible given a limited computing budget, by estimating the matching likelihood of yet unresolved descriptions based on the matches found so far.

I. INTRODUCTION

Over the past decade, numerous knowledge bases (KBs) have been built to power large-scale knowledge sharing, but also an entity-centric Web search, mixing both structured data and text querying. These KBs offer comprehensive, machine-readable descriptions of a large variety of real-world entities (e.g., persons, places, products, events) published on the Web as Linked Data (LD). Traditionally, KBs are manually crafted by a dedicated team of knowledge engineers, such as the pioneering projects Wordnet and Cyc. Today, more and more KBs are built from existing Web content using information extraction tools. Such an automated approach offers an unprecedented opportunity to scale up KB construction and leverage existing knowledge published in HTML documents. Although KBs (e.g., DBpedia, Freebase) may be derived from the same data source (e.g., a Wikipedia entry), they may provide multiple, non-identical descriptions of the same real-world entities. This is mainly due to the different information extraction tools and curation policies employed by KBs, resulting in complementary and sometimes conflicting entity descriptions.

Entity resolution (ER) aims to identify descriptions that refer to the same real-world entity appearing either within or across KBs [8], [9]. Compared to data warehouses, the new ER challenges stem from the openness of the Web of data in describing entities by an unbounded number of KBs, the semantic and structural diversity of the descriptions provided across domains even for the same real-world entities, as well as the autonomy of KBs in terms of adopted processes for creating and curating entity descriptions. In particular:

• The number of KBs (aka RDF datasets) in the Linking Open Data (LOD) cloud¹ has roughly tripled between 2011 and 2014 (from 295 to 1014), while KB interlinking dropped by 30%. The main reason is that with more KBs available, it becomes more difficult for data publishers to identify relations between the data they publish and the data already published. Thus, the majority of KBs are sparsely linked, while their popularity in links is heavily skewed. Sparsely interlinked KBs appear in the periphery of the LOD cloud (e.g., Open Food Facts, Bio2RDF), while heavily interlinked ones lie at the center (e.g., DBpedia, GeoNames). Encyclopaedic KBs, such as DBpedia, or widely used georeferencing KBs, such as GeoNames, are interlinked with the largest number of KBs [25].

• The descriptions contained in these KBs present a high degree of semantic and structural diversity, even for the same entity types. Despite the Linked Data principles, multiple names (e.g., URIs) can be used to refer to the same real-world entity. The majority (58.24%) of the 649 vocabularies currently used by KBs are proprietary, i.e., they are used by only one KB, while diverse sets of properties are commonly used to describe the entities, both in terms of types and number of occurrences, even in the same KB. YAGO alone contains 350K different types of entities, while Google's Knowledge Graph contains 35K properties, used to describe 600M entities.

The two core ER problems, i.e., how can we (a) effectively compute the similarity of entity descriptions and (b) efficiently resolve sets of entities within or across sources, are challenged by the large scale (both in terms of the number of sources and of entity descriptions), the high diversity (both in terms of the number of entity types and properties) and the importance of relationships among entity descriptions (not committing to a particular schema defined in advance). In particular, in addition to highly similar descriptions that feature many common tokens in values of semantically related attributes, typically met in the center of the LOD cloud and heavily interlinked mostly using owl:sameAs predicates, we encounter somehow similar descriptions with significantly fewer common tokens in attributes that are not always semantically related, which usually appear in the periphery of the LOD cloud and are sparsely interlinked with various kinds of predicates. Plainly, the emergence of highly and somehow similar semi-structured entity descriptions requires solutions that go beyond those applicable to duplicate detection. More precisely, deduplication techniques [6] are essentially ER techniques for descriptions from a single relation that are highly similar in structure but of low similarity in content; record linkage for structured [6] or semi-structured Web data [15] targets descriptions from two relations that are highly similar in content but of low similarity in structure; while in the Web of data, descriptions hosted in a network of KBs exhibit low similarity both in content and structure (i.e., are somehow similar).

¹ http://lod-cloud.net

Fig. 1. The ER Framework.

A promising area of research in this respect is cross-domain similarity search and mining (e.g., [29]), aiming to exploit the similarity of objects described by different modalities (i.e., text, image) and contexts (i.e., facets) and to support research by analogy. Such techniques could also be beneficial for matching highly heterogeneous entity descriptions and thus support ER at the Web scale.

In this tutorial, we present how to resolve entities described by linked data in the Web (e.g., in RDF). Figure 1 illustrates the general steps involved in our process.

II. BLOCKING AND META-BLOCKING FOR ENTITY RESOLUTION

Grouping entity descriptions in blocks before comparing them for matching is an important pre-processing step for pruning the quadratic number of comparisons required to resolve a collection of entity descriptions [7]. The main objective of algorithms for entity blocking is to achieve a reasonable compromise between the number of comparisons suggested and the number of missed entity matches. In this tutorial, we briefly present traditional blocking algorithms proposed for relational records and explain why they cannot be used in the Web of data. Then, we detail a family of algorithms that relies on a simple inverted index of entity descriptions, built from the tokens of their attribute values. Hence, two descriptions are placed into the same block if they share at least one common token (e.g., [21], [20]).
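To make this concrete, the following is a minimal sketch of token blocking along these lines; the attribute names, the toy descriptions and the helper `tokenize` are ours, purely illustrative.

```python
import re
from collections import defaultdict

def tokenize(description):
    """Collect lowercased word tokens from all attribute values of a description."""
    tokens = set()
    for value in description.values():
        tokens.update(re.findall(r"\w+", str(value).lower()))
    return tokens

def token_blocking(descriptions):
    """Build one block per token: two descriptions share a block iff they share that token.

    descriptions: dict mapping an entity id to a dict of attribute-value pairs.
    Returns a dict mapping each token (block key) to the set of entity ids in its block.
    """
    blocks = defaultdict(set)
    for entity_id, description in descriptions.items():
        for token in tokenize(description):
            blocks[token].add(entity_id)
    # Singleton blocks suggest no comparisons, so they can be dropped.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}

# Toy example: attribute names differ across descriptions, but shared tokens still group them.
descriptions = {
    "e1": {"name": "Eiffel Tower", "city": "Paris"},
    "e2": {"label": "Tour Eiffel", "location": "Paris, France"},
    "e3": {"name": "Statue of Liberty", "city": "New York"},
}
print(token_blocking(descriptions))  # blocks 'eiffel' and 'paris' each contain {'e1', 'e2'}
```

Note that, unlike schema-based blocking for relational records, no attribute alignment is assumed: any token from any attribute value can act as a block key.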

An alternative approach for blocking is to consider string-similarity join algorithms. In a nutshell, such algorithms construct blocks by identifying all pairs of descriptions whose string value similarities are above a certain threshold, and potentially some pairs whose string value similarities are below that threshold. To achieve that without computing the similarity of all pairs of descriptions, string-similarity join algorithms (e.g., [5], [28]) build an inverted index from the tokens of the descriptions' values. A method to reduce the number of compared descriptions consists of building blocks for sets of tokens that appear together in many entity descriptions. Several variations of this problem have been proposed. For example, [19] studies the problem of scalably finding frequent sets of tokens that appear in specific sequences. Finally, [17] proposed the concept of multidimensional overlapping blocks. Specifically, this method first constructs a collection of blocks for each similarity function, and then all blocking collections are aggregated into a single multidimensional collection, taking into account the similarities of descriptions that share blocks.
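As a simplified, self-contained illustration of the inverted-index idea behind such joins, the sketch below generates candidate pairs from shared tokens and verifies them with plain Jaccard similarity; the prefix-filtering and other optimizations of [5], [28] are deliberately omitted, and the threshold and toy values are our own assumptions.

```python
import re
from collections import defaultdict
from itertools import combinations

def tokens(text):
    return set(re.findall(r"\w+", text.lower()))

def similarity_join(values, threshold):
    """Return the pairs of ids whose token sets have Jaccard similarity >= threshold.

    values: dict mapping an id to a string value (e.g., an entity name).
    Candidate pairs are generated from an inverted index on tokens, so only
    pairs that share at least one token are ever verified.
    """
    token_sets = {i: tokens(v) for i, v in values.items()}
    index = defaultdict(set)
    for i, toks in token_sets.items():
        for t in toks:
            index[t].add(i)
    candidates = set()
    for ids in index.values():
        candidates.update(combinations(sorted(ids), 2))
    results = []
    for a, b in candidates:
        overlap = len(token_sets[a] & token_sets[b])
        union = len(token_sets[a] | token_sets[b])
        if union and overlap / union >= threshold:
            results.append((a, b))
    return results

names = {"e1": "Eiffel Tower Paris", "e2": "Tour Eiffel Paris", "e3": "Statue of Liberty"}
print(similarity_join(names, threshold=0.5))  # [('e1', 'e2')]
```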

Blocking, even as a pre-processing step for entity resolution, is a computationally heavy task. Thus, several approaches exploit the MapReduce programming model for parallelizing blocking algorithms (e.g., [18], [12], [13]). Abstractly, in all cases, a collection of entity descriptions, given as input to a MapReduce job, is split into smaller chunks, which are then processed in parallel. A map function, emitting intermediate (key, value) pairs for each input split, and a reduce function, processing the list of values of an intermediate key coming from all the mappers, should be defined.
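For token blocking, for instance, the map function can emit one (token, entity id) pair per token, and the reduce function can turn each token's list of entity ids into a block. The sketch below illustrates these two functions and simulates the shuffle phase locally; it is our own minimal rendering of the idea, not the implementation of any of the cited systems.

```python
import re
from collections import defaultdict

def map_fn(entity_id, description):
    """Map: emit one (token, entity_id) pair per token of the description."""
    for value in description.values():
        for token in re.findall(r"\w+", str(value).lower()):
            yield token, entity_id

def reduce_fn(token, entity_ids):
    """Reduce: all descriptions sharing the token form one block."""
    block = set(entity_ids)
    if len(block) > 1:  # singleton blocks yield no comparisons
        yield token, block

def run_locally(descriptions):
    """Simulate the grouping (shuffle) that a MapReduce engine would perform."""
    grouped = defaultdict(list)
    for entity_id, description in descriptions.items():
        for token, eid in map_fn(entity_id, description):
            grouped[token].append(eid)
    blocks = {}
    for token, ids in grouped.items():
        for key, block in reduce_fn(token, ids):
            blocks[key] = block
    return blocks
```

In an actual deployment, the input splits would be processed by different mapper tasks and each token's values would be routed to a reducer task, but the two user-defined functions remain the same.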

To further improve the efficiency of blocking, several approaches (e.g., [26], [21], [20]) focus on different ways of discarding comparisons that do not lead to matches. More recently, [22], [10], [11] propose to reconstruct the blocks of a given blocking collection in order to more drastically discard redundant comparisons, as well as comparisons between descriptions that are unlikely to match. Meta-blocking essentially transforms a given blocking collection B into a blocking graph. Its nodes correspond to the descriptions in B, while its undirected edges connect the co-occurring descriptions. No parallel edges are allowed, thus eliminating all redundant comparisons. Every edge is associated with a weight, representing the likelihood that the adjacent entities are matching candidates. The low-weighted edges are pruned, so as to discard comparisons between unlikely-to-match descriptions.
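As an illustration, the following minimal meta-blocking sketch builds the blocking graph from a block collection, weights each edge by the number of blocks the two descriptions share, and prunes edges below the average edge weight; this is only one simple choice of weighting and pruning scheme, assumed here for brevity, and the schemes studied in [22], [10], [11] are more elaborate.

```python
from collections import defaultdict
from itertools import combinations

def meta_blocking(blocks):
    """blocks: dict mapping a block key to a set of entity ids.
    Returns the comparisons retained after pruning, as a set of entity-id pairs."""
    # Blocking graph: one undirected edge per co-occurring pair of descriptions,
    # weighted by the number of blocks the pair shares (no parallel edges).
    weights = defaultdict(int)
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            weights[(a, b)] += 1
    if not weights:
        return set()
    # Prune low-weighted edges; here, those below the average edge weight.
    average = sum(weights.values()) / len(weights)
    return {pair for pair, weight in weights.items() if weight >= average}
```

The retained pairs are exactly the comparisons passed on to the matching phase, each pair appearing once regardless of how many blocks it co-occurred in.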

III. ITERATIVE ENTITY RESOLUTION

The inherent distribution of entity descriptions in different knowledge bases, along with their significant semantic and structural diversity, yields incomplete evidence regarding the similarity of descriptions published in the Web of data. Iterative ER approaches aim to tackle this problem by exploiting any partial result of the ER process in order to generate new candidate pairs of descriptions not considered in a previous step, or even to revise a previous matching decision.

Abstractly, new matches can be found by exploiting merged descriptions of previously identified matches or relationships between descriptions. We call the iterative ER approaches that build their iterations on the merging of descriptions merging-based, and those that use entity relationships for their iteration step relationship-based. Intuitively, merging-based approaches deal with descriptions of the same type, e.g., descriptions of persons, while relationship-based approaches rely on the relationships between different types of entities. Interestingly, there is an important difference between the two families of iterative approaches, namely what triggers a new iteration in each family. In our tutorial, we highlight this fact by exploring the general framework of iterative entity resolution approaches, which are typically composed of an initialization phase and an iterative phase [16]. The goal of the initialization phase is to create a queue capturing the pairs of entity descriptions that will be compared, or even the order for comparing these pairs. For example, such a queue can be constructed automatically, by exhaustively computing the initial similarity of all pairs of descriptions, or manually, by domain experts who specify, for instance, which entities will be compared. In the iterative phase, we get a pair of descriptions from the queue, compute the similarity of this pair to decide if it is a match, and, with respect to this decision, potentially update the queue. The updates in this phase trigger the next iteration of entity resolution.

In merging-based approaches (e.g., [2], [14]), when two matching descriptions are merged, the pairs in the queue in which these descriptions participate are updated, replacing the initial descriptions with the results of their merging. New pairs may be added to the queue as well, suggesting the comparison of the new merged description to other descriptions. In relationship-based approaches (e.g., [3], [24], [4]), when related descriptions are identified as matches, new pairs can be added to the queue, e.g., a pair of building descriptions is added to the queue when their architects match, or even existing pairs can be re-ordered. Ideally, the iterative phase terminates when the queue becomes empty.
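A rough sketch of this queue-based skeleton for the merging-based case is given below, under strong simplifying assumptions: a naive exhaustive initialization (in practice blocking would prune the queue), and user-supplied `match` and `merge` functions that we leave abstract.

```python
from itertools import combinations

def iterative_er(descriptions, match, merge):
    """descriptions: dict of entity id -> description (dict of attribute-value pairs).
    match(d1, d2): predicate deciding whether two descriptions refer to the same entity.
    merge(d1, d2): function producing the merged description of two matches.
    Returns the resolved collection of descriptions."""
    entities = dict(descriptions)
    # Initialization phase: naively enqueue every pair (blocking would prune this).
    queue = list(combinations(entities.keys(), 2))
    # Iterative phase: pop a pair, decide, and update the queue on a merge.
    while queue:
        a, b = queue.pop(0)
        if a not in entities or b not in entities:
            continue  # one of the two was merged away in an earlier iteration
        if match(entities[a], entities[b]):
            entities[a] = merge(entities[a], entities[b])
            del entities[b]
            # The merged description may now match descriptions it did not match before.
            queue.extend((a, other) for other in entities if other != a)
    return entities
```

A relationship-based variant would differ only in the update step: instead of re-enqueuing pairs involving the merged description, it would enqueue (or re-order) pairs of descriptions related to the newly matched ones.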

Recent works have proposed using an iterative ER process interleaved with blocking. The intuition is that the ER results of a processed block may help identify more matches in another block. Specifically, iterative blocking [27], applied on the results of blocking, examines one block at a time looking for matches. When a match is found in a block, the resulting merging of the descriptions that match is propagated to all other blocks, replacing the initial matching descriptions. This way, redundant comparisons between the same pair of descriptions in different blocks are saved and, in addition, more matches can be identified efficiently. The same block may be processed multiple times, until no new matches are found. Inherently, iterative blocking follows a sequential execution model, since the results of ER in one block directly affect other blocks.
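A minimal sketch of this block-by-block loop is shown below; it assumes the same kind of user-supplied `match` and `merge` functions as in the previous sketch, propagates merges lazily through an id-redirection map, and keeps re-scanning the blocks until a full pass finds no new match. It is an illustrative rendering of the idea rather than the algorithm of [27].

```python
from itertools import combinations

def iterative_blocking(blocks, descriptions, match, merge):
    """blocks: dict of block key -> set of entity ids;
    descriptions: dict of entity id -> description;
    match/merge: as in the iterative ER sketch above."""
    entities = dict(descriptions)
    merged_into = {}  # id of a merged-away description -> id it was merged into

    def resolve(eid):
        while eid in merged_into:
            eid = merged_into[eid]
        return eid

    changed = True
    while changed:  # the same block may be processed several times
        changed = False
        for key in blocks:
            ids = {resolve(eid) for eid in blocks[key]}  # propagate earlier merges
            for x, y in combinations(sorted(ids), 2):
                a, b = resolve(x), resolve(y)
                if a != b and match(entities[a], entities[b]):
                    entities[a] = merge(entities[a], entities[b])
                    merged_into[b] = a
                    del entities[b]
                    changed = True
            blocks[key] = {resolve(eid) for eid in ids}  # replace matched descriptions with their merge result
    return entities
```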

IV. PROGRESSIVE ENTITY RESOLUTION

Progressive works focus on maximizing the reported matches, given a limited computing budget, by potentially exploiting the partial matching results obtained so far. Essentially, they extend the typical ER workflow with a scheduling phase, which is responsible for selecting which of the description pairs that have resulted from blocking will be compared in the entity matching phase, and in what order. The goal of this new phase is to favor more promising comparisons, i.e., those that are more likely to result in matches. This way, those comparisons are executed before less promising ones and thus more matches are identified early on in the process. The optional update phase (Fig. 1) propagates the results of matching, such that a new scheduling phase will promote the comparison of pairs that were influenced by the previous matches. This iterative process continues until the pre-defined computing budget is consumed.

In a progressive ER setting, [1] uses a graph in which nodes are description pairs, i.e., candidate matches, and an edge indicates that the resolution of a node influences the resolution of another node. For the scheduling phase, it divides the total cost budget into several windows of equal cost. For each window, a comparison schedule is generated by choosing, among the schedules whose cost does not exceed the current window, the one with the highest expected benefit. The cost of a schedule is computed by considering the cost of finding the description pairs and the cost of resolving them. Its benefit is determined by how many matches are expected to be found by this schedule, and by how useful declaring those nodes as matches will be in identifying more matches within the cost budget. In the update phase, after a schedule is executed, the matching decisions are propagated to all the influenced nodes, whose expected benefit then increases and which thus have a higher chance of being chosen by the next schedule. The algorithm terminates when the cost budget has been reached.
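A highly simplified sketch of this per-window loop follows, assuming unit resolution costs, a plain numeric benefit estimate per candidate pair, and a fixed benefit boost for influenced neighbours; the actual cost and benefit models of [1] are considerably richer, so this is only meant to convey the scheduling/update interplay.

```python
def progressive_resolution(pairs, influences, resolve, total_budget, window=5, boost=1.0):
    """pairs: dict mapping a candidate pair (node) to its initial expected benefit.
    influences: dict mapping a pair to the pairs whose resolution it influences.
    resolve(pair): returns True if the pair is a match; every resolution has unit cost here.
    Returns the matches found before the budget is exhausted."""
    benefit = dict(pairs)
    matches, spent = [], 0
    while spent < total_budget and benefit:
        # Scheduling phase: fill the current window with the highest-benefit pairs.
        window_size = min(window, total_budget - spent)
        schedule = sorted(benefit, key=benefit.get, reverse=True)[:window_size]
        for pair in schedule:
            del benefit[pair]
            spent += 1
            if resolve(pair):
                matches.append(pair)
                # Update phase: influenced, still-unresolved pairs become more promising.
                for other in influences.get(pair, ()):
                    if other in benefit:
                        benefit[other] += boost
    return matches
```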

Among other heuristics, [26] uses a hierarchy of description partitions. This hierarchy is built by applying a different distance threshold for each partitioning: highly similar descriptions are placed in the lower levels, while somehow similar descriptions in higher ones. By traversing this hierarchy bottom-up until the cost budget has been reached, we favor the resolution of highly similar descriptions. Another heuristic, also presented in [26], relies on a list of descriptions sorted by a blocking key, as in sorted neighborhood, and then iteratively considers sliding windows of increasing size, comparing descriptions of increasing distance. Starting from a window of size 2, this heuristic favors comparisons of descriptions with more similar values on their blocking keys. [23] extends this heuristic to capture cases in which matches appear in dense areas of the initial sorting, by adding a scheduling phase that performs a local lookahead: if the descriptions at positions (i, j) are found to match, then the descriptions at (i+1, j) and (i, j+1) are immediately compared, since they have a high chance of matching.
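The following small sketch combines the sorted-list heuristic with such a local lookahead, assuming unit comparison costs, a user-supplied `match` predicate and an already sorted list of entity ids; it is an illustrative simplification, not the exact procedures of [26] or [23].

```python
def progressive_sorted_neighborhood(sorted_ids, descriptions, match, budget):
    """sorted_ids: entity ids sorted by their blocking key (as in sorted neighborhood).
    Compares pairs at increasing distance, spending at most `budget` comparisons,
    and looks ahead around every match found (positions (i+1, j) and (i, j+1))."""
    compared, matches = set(), []
    spent = 0

    def compare(i, j):
        nonlocal spent
        if not (0 <= i < j < len(sorted_ids)) or (i, j) in compared or spent >= budget:
            return
        compared.add((i, j))
        spent += 1
        if match(descriptions[sorted_ids[i]], descriptions[sorted_ids[j]]):
            matches.append((sorted_ids[i], sorted_ids[j]))
            compare(i + 1, j)  # local lookahead: neighbours of a match
            compare(i, j + 1)  # are likely to match as well

    distance = 1  # a window of size 2 compares adjacent descriptions first
    while spent < budget and distance < len(sorted_ids):
        for i in range(len(sorted_ids) - distance):
            compare(i, i + distance)
            if spent >= budget:
                break
        distance += 1
    return matches
```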

PRESENTERS

Kostas Stefanidis is an Associate Professor at the University of Tampere, Finland. Previously, he worked as a research scientist at ICS-FORTH, Greece, and as a post-doctoral researcher at NTNU, Norway, and at CUHK, Hong Kong. His research interests lie in the intersection of databases, Web and information retrieval, and include personalized and context-aware data management, recommender systems, and information extraction, resolution and integration. He has co-authored more than 50 papers in peer-reviewed conferences and journals, including ACM SIGMOD, IEEE ICDE and ACM TODS. He is the General co-Chair of the Workshop on Exploratory Search in Databases and the Web (ExploreDB), and served as the Web & Information Chair of SIGMOD/PODS 2016, and the Proceedings Chair of EDBT/ICDT 2016. He has also co-authored a book on entity resolution in the Web of data.

Vassilis Christophides is a Professor of Computer Science at the University of Crete. He has recently been appointed to an advanced research position at INRIA Paris-Rocquencourt. Previously, he worked as a Distinguished Scientist at Technicolor, R&I Center in Paris. His main research interests include Databases and Web Information Systems, as well as Big Data Processing and Analysis. He has published over 130 articles in high-quality international conferences and journals. He has been scientific coordinator of a number of research projects funded by the European Union and the Greek State. He has received the 2004 SIGMOD Test of Time Award and 2 Best Paper Awards (ISWC 2003 and 2007). He served as General Chair of the EDBT/ICDT Conference in 2014 and as Area Chair for the ICDE Semi-structured, Web, and Linked Data Management track in 2016. He has also co-authored a book on entity resolution in the Web of data.

Vasilis Efthymiou is a Ph.D. candidate at the University of Crete, Greece, and a member of the Information Systems Laboratory of the Institute of Computer Science at FORTH. The topic of his Ph.D. research is entity resolution in the Web of data. He has been an intern at IBM Thomas J. Watson Research Center, and he has received undergraduate and postgraduate scholarships from FORTH, working in the areas of Semantic Web, non-monotonic reasoning, and Ambient Intelligence. He has also co-authored a book on entity resolution in the Web of data.

REFERENCES

[1] Y. Altowim, D. V. Kalashnikov, and S. Mehrotra. Progressive approach to relational entity resolution. PVLDB, 7(11):999–1010, 2014.
[2] O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S. E. Whang, and J. Widom. Swoosh: a generic approach to entity resolution. VLDB J., 18(1):255–276, 2009.
[3] I. Bhattacharya and L. Getoor. Collective entity resolution in relational data. TKDD, 1(1), 2007.
[4] C. Bohm, G. de Melo, F. Naumann, and G. Weikum. LINDA: distributed web-of-data-scale entity matching. In CIKM, 2012.
[5] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006.
[6] P. Christen. Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Data-Centric Systems and Applications. Springer, 2012.
[7] P. Christen. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng., 24(9):1537–1555, 2012.
[8] V. Christophides, V. Efthymiou, and K. Stefanidis. Entity Resolution in the Web of Data. Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan & Claypool Publishers, 2015.
[9] X. L. Dong and D. Srivastava. Big Data Integration. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2015.
[10] V. Efthymiou, G. Papadakis, G. Papastefanatos, K. Stefanidis, and T. Palpanas. Parallel meta-blocking: Realizing scalable entity resolution over large, heterogeneous data. In IEEE Big Data, 2015.
[11] V. Efthymiou, G. Papadakis, G. Papastefanatos, K. Stefanidis, and T. Palpanas. Parallel meta-blocking for scaling entity resolution over big heterogeneous data. Inf. Syst., 65:137–157, 2017.
[12] V. Efthymiou, K. Stefanidis, and V. Christophides. Big data entity resolution: From highly to somehow similar entity descriptions in the Web. In IEEE Big Data, 2015.
[13] V. Efthymiou, K. Stefanidis, and V. Christophides. Benchmarking blocking algorithms for web entities. IEEE Trans. Big Data, 3, 2017.
[14] L. Galarraga, G. Heitz, K. Murphy, and F. M. Suchanek. Canonicalizing open knowledge bases. In CIKM, 2014.
[15] O. Hassanzadeh. Record Linkage for Web Data. PhD thesis, 2013.
[16] M. Herschel, F. Naumann, S. Szott, and M. Taubert. Scalable iterative graph duplicate detection. IEEE Trans. Knowl. Data Eng., 24(11):2094–2108, 2012.
[17] R. Isele, A. Jentzsch, and C. Bizer. Efficient multidimensional blocking for link discovery without losing recall. In WebDB, 2011.
[18] L. Kolb, A. Thor, and E. Rahm. Dedoop: Efficient deduplication with Hadoop. PVLDB, 5(12):1878–1881, 2012.
[19] I. Miliaraki, K. Berberich, R. Gemulla, and S. Zoupanos. Mind the gap: large-scale frequent sequence mining. In SIGMOD, 2013.
[20] G. Papadakis, E. Ioannou, C. Niederee, T. Palpanas, and W. Nejdl. Beyond 100 million entities: large-scale blocking-based resolution for heterogeneous data. In WSDM, 2012.
[21] G. Papadakis, E. Ioannou, T. Palpanas, C. Niederee, and W. Nejdl. A blocking framework for entity resolution in highly heterogeneous information spaces. IEEE Trans. Knowl. Data Eng., 25(12):2665–2682, 2013.
[22] G. Papadakis, G. Koutrika, T. Palpanas, and W. Nejdl. Meta-blocking: Taking entity resolution to the next level. IEEE Trans. Knowl. Data Eng., 26(8):1946–1960, 2014.
[23] T. Papenbrock, A. Heise, and F. Naumann. Progressive duplicate detection. IEEE Trans. Knowl. Data Eng., 27(5):1316–1329, 2015.
[24] V. Rastogi, N. N. Dalvi, and M. N. Garofalakis. Large-scale collective entity matching. PVLDB, 4(4):208–218, 2011.
[25] M. Schmachtenberg, C. Bizer, and H. Paulheim. Adoption of the linked data best practices in different topical domains. In ISWC, pages 245–260, 2014.
[26] S. E. Whang, D. Marmaros, and H. Garcia-Molina. Pay-as-you-go entity resolution. IEEE Trans. Knowl. Data Eng., 25(5):1111–1124, 2013.
[27] S. E. Whang, D. Menestrina, G. Koutrika, M. Theobald, and H. Garcia-Molina. Entity resolution with iterative blocking. In SIGMOD, 2009.
[28] C. Xiao, W. Wang, X. Lin, J. X. Yu, and G. Wang. Efficient similarity joins for near-duplicate detection. ACM Trans. Database Syst., 36(3):15, 2011.
[29] Y. Zhen, P. Rai, H. Zha, and L. Carin. Cross-modal similarity learning via pairs, preferences, and active supervision. In AAAI, pages 3203–3209, 2015.