concepts and techniques for record linkage, entity resolution, and duplicate detection by peter...

CONCEPTS AND TECHNIQUES FOR RECORD LINKAGE, ENTITY RESOLUTION, AND DUPLICATE DETECTION

BY PETER CHRISTEN

PRESENTED BY JOSEPH PARK

Data Matching

Introduction

“Data matching is the task of identifying, matching, and merging records that correspond to the same entities from several databases”

Also known as: record or data linkage, entity resolution, object identification, field matching

Aims & Challenges

Three tasks: schema matching, data matching, data fusion

Challenges:
Lack of unique entity identifiers, and data quality
Computational complexity
Lack of training data (e.g. gold standards)
Privacy and confidentiality (health informatics & data mining)

Overview of Data Matching

Five major steps: data pre-processing, indexing, record pair comparison, classification, evaluation

Diagram

Data Pre-processing

Remove unwanted characters and words
Expand abbreviations and correct misspellings
Segment attributes into well-defined and consistent output attributes
Verify the correctness of attribute values

Example of Data Pre-processing
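The slide's original example did not survive. As a stand-in, here is a minimal Python sketch of the first three pre-processing steps applied to a street address. The abbreviation table, the function names `clean` and `segment_address`, and the three output attributes are assumptions made for illustration, not taken from the presentation. Real systems use curated lookup tables and would also verify values against reference data.

```python
import re

# Hypothetical abbreviation table (illustrative only).
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue", "apt": "apartment"}


def clean(value: str) -> str:
    """Lower-case, remove unwanted characters, and expand abbreviations."""
    value = value.lower()
    # Simplification: anything other than ASCII letters, digits, and
    # whitespace is treated as an unwanted character.
    value = re.sub(r"[^a-z0-9\s]", " ", value)
    tokens = value.split()  # also collapses repeated whitespace
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)


def segment_address(value: str) -> dict:
    """Segment a raw address into number, street name, and street type."""
    tokens = clean(value).split()
    number = tokens[0] if tokens and tokens[0].isdigit() else ""
    rest = tokens[1:] if number else tokens
    street_type = rest[-1] if rest and rest[-1] in ABBREVIATIONS.values() else ""
    name = rest[:-1] if street_type else rest
    return {"number": number, "name": " ".join(name), "type": street_type}


cleaned = clean("42 Main St.")
segmented = segment_address("42 Main St.")
```

Here `cleaned` is `"42 main street"` and `segmented` splits it into number `"42"`, name `"main"`, and type `"street"`.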

Indexing

Reduces computational complexity
Generates candidate record pairs
Common technique: blocking

Example of Blocking
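A minimal sketch of standard blocking follows: records are grouped by a blocking key, and only records that share a key become candidate pairs. The key used here (first letter of surname plus postcode), the record layout, and the function names are assumptions for illustration. For brevity the sketch deduplicates a single list rather than linking two databases.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record: dict) -> str:
    """Hypothetical blocking key: first surname letter plus postcode."""
    return record["surname"][:1].lower() + record["postcode"]


def candidate_pairs(records: list) -> set:
    """Return pairs of record ids that fall into the same block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        # Only compare records within the same block.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = [
    {"id": 1, "surname": "Smith", "postcode": "2600"},
    {"id": 2, "surname": "Smyth", "postcode": "2600"},
    {"id": 3, "surname": "Jones", "postcode": "2600"},
    {"id": 4, "surname": "Smith", "postcode": "3000"},
]
pairs = candidate_pairs(records)
```

With four records, a full comparison needs six pairs. This key leaves a single candidate pair, records 1 and 2. The trade-off is that a poorly chosen key can separate true matches into different blocks.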

Record Pair Comparison

Comparison vector – vector of numerical similarity values

Example of Record Pair Comparison
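As a sketch of how a comparison vector might be built, the Python below applies one similarity function per attribute to a pair of records. The attribute names and the choice of comparators are assumptions. The approximate string comparator uses `difflib.SequenceMatcher` from the standard library as a convenient stand-in, not the Jaro or Winkler comparators discussed in the presentation.

```python
from difflib import SequenceMatcher


def string_sim(a: str, b: str) -> float:
    """Approximate string similarity in [0, 1] (stand-in comparator)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_sim(a: str, b: str) -> float:
    """Exact comparison: 1.0 on agreement, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


# Hypothetical attribute-to-comparator assignment.
COMPARATORS = [
    ("given_name", string_sim),
    ("surname", string_sim),
    ("postcode", exact_sim),
]


def comparison_vector(rec_a: dict, rec_b: dict) -> list:
    """One numerical similarity value per compared attribute."""
    return [sim(rec_a[attr], rec_b[attr]) for attr, sim in COMPARATORS]


a = {"given_name": "Peter", "surname": "Smith", "postcode": "2600"}
b = {"given_name": "peter", "surname": "Smyth", "postcode": "2600"}
vector = comparison_vector(a, b)
```

The resulting vector has full agreement on given name and postcode and a partial similarity below 1.0 on surname.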

Jaro and Winkler String Comparison

Jaro: Combines edit distance and q-gram based comparison

Winkler: Increases Jaro similarity for up to four agreeing initial characters
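The two comparators can be sketched in Python as below. This follows the commonly published definitions of Jaro and Jaro-Winkler similarity. The prefix scaling factor `p = 0.1` is the conventional default and is an assumption here, since the slide does not state a value.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: common characters within a window, minus transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as common only if they are at most `window` apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count common characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Boost the Jaro similarity for up to four agreeing initial characters."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * p * (1.0 - sim)
```

For the classic pair `"martha"` and `"marhta"`, all six characters are common with one transposition, giving a Jaro similarity of about 0.944. The three agreeing initial characters raise the Winkler value to about 0.961.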

Record Pair Classification

Two-class classification: match or non-match
Three-class classification: match, non-match, or potential match (requires clerical review)
Supervised and unsupervised
Active learning

Example of Record Pair Classification
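One simple way to illustrate three-class classification is to sum a comparison vector and compare the total against two thresholds. The sketch below does that. The threshold values and the function name `classify` are illustrative assumptions, not values from the presentation.

```python
def classify(vector: list, lower: float = 1.5, upper: float = 2.5) -> str:
    """Three-class decision on the summed similarities of a comparison vector.

    The thresholds are hypothetical; in practice they are tuned per data set.
    """
    total = sum(vector)
    if total >= upper:
        return "match"
    if total <= lower:
        return "non-match"
    return "potential match"  # sent for clerical review


examples = {
    "high": classify([1.0, 0.8, 1.0]),
    "low": classify([0.2, 0.1, 0.0]),
    "middle": classify([1.0, 0.5, 0.5]),
}
```

Setting `lower` equal to `upper` collapses this into the two-class case, with no pairs sent for clerical review.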

Unsupervised Classification

Threshold-based classification
Probabilistic classification
Cost-based classification
Rule-based classification
Clustering-based classification

Probabilistic Classification

Three-class based
Different weights assigned to different attributes
Newcombe & Kennedy: cardinalities
Comparison vectors, binary comparison
Conditionally independent attributes assumed

Formulae
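The formulae from this slide did not survive. A standard formulation consistent with the points above (binary comparison vectors, conditionally independent attributes, three classes) is the Fellegi-Sunter model. It is assumed here to be what the slide showed. The symbols $m_i$, $u_i$, $w_i$, $W$, $t_l$ and $t_u$ are notation chosen for this sketch, and the `cases` environment needs `amsmath`.

```latex
% gamma_i = 1 if attribute i agrees, 0 otherwise; M = matches, U = non-matches
\[
m_i = P(\gamma_i = 1 \mid M), \qquad u_i = P(\gamma_i = 1 \mid U)
\]
% Per-attribute weight
\[
w_i =
\begin{cases}
\log_2 \dfrac{m_i}{u_i} & \text{if } \gamma_i = 1 \\[2ex]
\log_2 \dfrac{1 - m_i}{1 - u_i} & \text{if } \gamma_i = 0
\end{cases}
\]
% Under conditional independence the weights add up to the log-likelihood ratio
\[
W = \sum_{i=1}^{n} w_i = \log_2 \frac{P(\gamma \mid M)}{P(\gamma \mid U)}
\]
% Three-class decision with lower and upper thresholds t_l < t_u
\[
\text{class} =
\begin{cases}
\text{match} & \text{if } W \ge t_u \\
\text{potential match} & \text{if } t_l < W < t_u \\
\text{non-match} & \text{if } W \le t_l
\end{cases}
\]
```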

Example of Probabilistic Classification
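The slide's worked example is not recoverable. The Python sketch below shows the general idea under the Fellegi-Sunter formulation, which is assumed here. Each attribute contributes an agreement or disagreement weight derived from its m- and u-probabilities, and the summed weight is compared against two thresholds. All probabilities, thresholds, and names are invented for illustration.

```python
import math

# Hypothetical (m, u) probabilities per attribute:
#   m = P(attribute agrees | records are a match)
#   u = P(attribute agrees | records are a non-match)
M_U = {
    "given_name": (0.95, 0.01),
    "surname": (0.95, 0.005),
    "postcode": (0.90, 0.05),
}


def match_weight(agreements: dict) -> float:
    """Sum per-attribute weights, assuming conditional independence."""
    total = 0.0
    for attr, agrees in agreements.items():
        m, u = M_U[attr]
        if agrees:
            total += math.log2(m / u)              # agreement weight (positive)
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return total


def classify_weight(weight: float, lower: float = 0.0, upper: float = 10.0) -> str:
    """Three-class decision; the thresholds are illustrative assumptions."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "potential match"


all_agree = match_weight({"given_name": True, "surname": True, "postcode": True})
none_agree = match_weight({"given_name": False, "surname": False, "postcode": False})
mixed = match_weight({"given_name": True, "surname": False, "postcode": True})
```

With these numbers, full agreement scores well above the upper threshold and full disagreement well below the lower one. The mixed case, where only the surname disagrees, lands between the thresholds and would go to clerical review. Attributes with a small u-probability, meaning many distinct values, contribute larger agreement weights.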

Active Learning

Trains a model with a small set of seed data
Classifies comparison vectors not in the training set as matches or non-matches
Asks users for help on the most difficult to classify
Adds the manually classified vectors to the training data set
Trains the next, improved, classification model
Repeats until stopping criteria are met
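The loop above can be sketched in Python as follows. To keep the example self-contained, the "model" is deliberately trivial: a single threshold on the summed similarities. The user is simulated by a hypothetical `oracle` function. The uncertainty measure (distance to the threshold), the batch size, and the stopping criterion (a fixed number of rounds) are all assumptions for illustration.

```python
def train(labelled: list) -> float:
    """Fit a threshold midway between the mean sums of matches and non-matches.

    Assumes the labelled data contains at least one match and one non-match.
    """
    match_sums = [sum(v) for v, is_match in labelled if is_match]
    non_sums = [sum(v) for v, is_match in labelled if not is_match]
    return (sum(match_sums) / len(match_sums) + sum(non_sums) / len(non_sums)) / 2


def active_learning(seed, unlabelled, oracle, batch=1, max_rounds=5):
    labelled = list(seed)
    pool = list(unlabelled)
    threshold = train(labelled)                  # model from seed data
    for _ in range(max_rounds):                  # stopping criterion: round limit
        if not pool:
            break
        # Most difficult to classify = closest to the decision threshold.
        pool.sort(key=lambda v: abs(sum(v) - threshold))
        hardest, pool = pool[:batch], pool[batch:]
        # Ask the (simulated) user and add the answers to the training set.
        labelled.extend((v, oracle(v)) for v in hardest)
        threshold = train(labelled)              # next, improved model
    predictions = [(v, sum(v) >= threshold) for v in pool]
    return threshold, predictions


seed = [([1.0, 1.0, 1.0], True), ([0.0, 0.1, 0.0], False)]
unlabelled = [[0.9, 0.8, 1.0], [0.5, 0.6, 0.5], [0.2, 0.0, 0.1], [0.7, 0.4, 0.6]]


def oracle(vector):
    """Hypothetical stand-in for a human reviewer."""
    return sum(vector) >= 1.65


threshold, predictions = active_learning(seed, unlabelled, oracle, max_rounds=2)
```

In this run the two borderline vectors are sent to the oracle, and the remaining two are classified automatically by the final model. A real system would replace `train` with a proper supervised classifier and stop on a budget or convergence test.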
