CONCEPTS AND TECHNIQUES FOR RECORD LINKAGE, ENTITY RESOLUTION, AND
DUPLICATE DETECTION
BY PETER CHRISTEN
PRESENTED BY JOSEPH PARK
Data Matching
Introduction
“Data matching is the task of identifying, matching, and merging records that correspond to the same entities from several databases”
Also known as: record or data linkage, entity resolution, object identification, field matching
Aims & Challenges
Three tasks: schema matching, data matching, data fusion
Challenges: lack of unique entity identifiers and poor data quality; computational complexity; lack of training data (e.g. gold standards); privacy and confidentiality (health informatics and data mining)
Overview of Data Matching
Five major steps: data pre-processing, indexing, record pair comparison, classification, evaluation
Diagram
Data Pre-processing
Remove unwanted characters and words
Expand abbreviations and correct misspellings
Segment attributes into well-defined and consistent output attributes
Verify the correctness of attribute values
Example of Data Pre-processing
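A minimal sketch of the first three pre-processing steps, assuming a street-address attribute. The abbreviation table, the function names `clean` and `segment_address`, and the output attribute names are hypothetical and chosen only for illustration; real systems typically rely on much larger lookup tables and do not segment this naively.

```python
import re

# Hypothetical lookup table; assumed for illustration only.
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}


def clean(value):
    """Lowercase, drop unwanted characters, and expand known abbreviations."""
    value = re.sub(r"[^a-z0-9\s]", " ", value.lower())
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in value.split()]
    return " ".join(tokens)


def segment_address(address):
    """Split a cleaned address into number, name, and type (naive rules)."""
    tokens = clean(address).split()
    number = tokens[0] if tokens and tokens[0].isdigit() else ""
    rest = tokens[1:] if number else tokens
    street_type = rest[-1] if rest and rest[-1] in ABBREVIATIONS.values() else ""
    name = rest[:-1] if street_type else rest
    return {"number": number, "name": " ".join(name), "type": street_type}


print(segment_address("42 Main St."))
```

Segmenting into separate output attributes lets later steps compare like with like, for example street name against street name rather than whole address strings.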
Indexing
Reduces computational complexity
Generates candidate record pairs
Common technique: blocking
Example of Blocking
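A minimal sketch of blocking for deduplicating a single database. The blocking key used here (first three letters of the surname plus the postcode) and the field names `id`, `surname`, and `postcode` are assumptions for illustration; any key that groups likely matches together would serve.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Assumed key: first three letters of surname + postcode.
    return record["surname"][:3].lower() + record["postcode"]


def candidate_pairs(records):
    """Return pairs of record ids that share a blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        # Only records inside the same block are compared with each other.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = [
    {"id": 1, "surname": "Smith", "postcode": "2000"},
    {"id": 2, "surname": "Smithe", "postcode": "2000"},
    {"id": 3, "surname": "Jones", "postcode": "2000"},
    {"id": 4, "surname": "Smyth", "postcode": "2000"},
]
print(candidate_pairs(records))
```

Four records would give six pairs under an exhaustive comparison; with this key only one pair survives. The trade-off is visible in record 4: "Smyth" lands in a different block from "Smith", so a true match can be missed when the key is too strict.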
Record Pair Comparison
Comparison vector – vector of numerical similarity values
Example of Record Pair Comparison
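A minimal sketch of building a comparison vector for one candidate pair. The attribute names and the choice of comparators are assumptions; `difflib.SequenceMatcher` from the Python standard library stands in for whatever approximate string comparison a real system would use.

```python
from difflib import SequenceMatcher


def string_sim(a, b):
    """Approximate string similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def exact_sim(a, b):
    """1.0 on exact agreement, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


# Assumed attributes and comparators, one similarity value per attribute.
COMPARATORS = {
    "given_name": string_sim,
    "surname": string_sim,
    "postcode": exact_sim,
}


def comparison_vector(rec_a, rec_b):
    return [compare(rec_a[attr], rec_b[attr]) for attr, compare in COMPARATORS.items()]


a = {"given_name": "peter", "surname": "christen", "postcode": "2600"}
b = {"given_name": "pete", "surname": "christen", "postcode": "2601"}
print(comparison_vector(a, b))
```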
Jaro and Winkler String Comparison
Jaro: Combines edit distance and q-gram based comparison
Winkler: Increases the Jaro similarity when up to four initial characters agree
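A minimal sketch of both comparators in their commonly cited form: Jaro counts characters that agree within a search window and penalises transpositions, and Winkler adds a bonus for a shared prefix. The scaling factor `p = 0.1` is the conventional default and is an assumption here, as are the function names.

```python
def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as common only if they are at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: common characters that appear in a different order.
    common1 = [c for c, m in zip(s1, matched1) if m]
    common2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(common1, common2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    # Boost the similarity in proportion to the agreeing prefix length.
    return sim + prefix * p * (1 - sim)


print(round(jaro("MARTHA", "MARHTA"), 4))          # 0.9444
print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

The prefix bonus reflects the observation that typing errors are less common at the start of a name, so agreement there is stronger evidence of a match.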
Record Pair Classification
Two-class classification: match or non-match
Three-class classification: match, non-match, or potential match (requires clerical review)
Supervised and unsupervised
Active learning
Example of Record Pair Classification
Unsupervised Classification
Threshold-based classification
Probabilistic classification
Cost-based classification
Rule-based classification
Clustering-based classification
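A minimal sketch of the first approach in the list, threshold-based classification, in its three-class form. The similarities in a comparison vector are averaged into a single score and compared against two thresholds; the threshold values and the function name `classify_by_threshold` are assumptions for illustration.

```python
def classify_by_threshold(vector, lower=0.5, upper=0.8):
    """Three-class decision on the mean similarity of a comparison vector."""
    score = sum(vector) / len(vector)
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    # Scores between the thresholds are left for clerical review.
    return "potential match"


print(classify_by_threshold([1.0, 1.0, 1.0]))
print(classify_by_threshold([0.9, 0.6, 0.5]))
print(classify_by_threshold([0.1, 0.2, 0.0]))
```

Setting `lower` equal to `upper` collapses this into a two-class classifier. A plain average treats all attributes as equally informative, which is the limitation that attribute-specific weights address.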
Probabilistic Classification
Three-class based
Different weights assigned to different attributes
Newcombe & Kennedy – cardinalities
Comparison vectors, binary comparison
Conditionally independent attributes assumed
Formulae
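A sketch of the standard formulation that fits the assumptions listed for probabilistic classification (binary comparison vectors, conditionally independent attributes, three classes), usually attributed to Fellegi and Sunter. The notation is an assumption: \gamma is the binary comparison vector over n attributes, M and U are the sets of true matches and true non-matches, and t_l < t_u are the lower and upper thresholds.

```latex
\begin{align*}
m_i &= P(\gamma_i = 1 \mid M), \qquad u_i = P(\gamma_i = 1 \mid U) \\[4pt]
R(\gamma) &= \frac{P(\gamma \mid M)}{P(\gamma \mid U)}
           = \prod_{i=1}^{n} \frac{P(\gamma_i \mid M)}{P(\gamma_i \mid U)} \\[4pt]
w_i &= \begin{cases}
         \log_2 \dfrac{m_i}{u_i}         & \text{if } \gamma_i = 1 \\[8pt]
         \log_2 \dfrac{1 - m_i}{1 - u_i} & \text{if } \gamma_i = 0
       \end{cases}
\qquad W(\gamma) = \sum_{i=1}^{n} w_i = \log_2 R(\gamma) \\[4pt]
\text{decision} &= \begin{cases}
         \text{match}           & \text{if } W(\gamma) \ge t_u \\
         \text{potential match} & \text{if } t_l < W(\gamma) < t_u \\
         \text{non-match}       & \text{if } W(\gamma) \le t_l
       \end{cases}
\end{align*}
```

The product form of R relies on the conditional independence assumption; without it the joint probabilities do not factor per attribute.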
Example of Probabilistic Classification
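A minimal sketch of probabilistic classification with binary agreement and conditionally independent attributes, using log-likelihood-ratio weights in the style usually attributed to Fellegi and Sunter. The m- and u-probabilities, the thresholds, and all names below are hypothetical values chosen for illustration, not estimates from real data.

```python
import math

# Assumed probabilities that an attribute agrees, given a true match (m)
# or a true non-match (u). Hypothetical values.
M_PROB = {"surname": 0.95, "given_name": 0.90, "postcode": 0.85}
U_PROB = {"surname": 0.01, "given_name": 0.05, "postcode": 0.10}


def match_weight(agreement):
    """Sum per-attribute weights for a binary agreement pattern."""
    total = 0.0
    for attr, agrees in agreement.items():
        m, u = M_PROB[attr], U_PROB[attr]
        if agrees:
            total += math.log2(m / u)              # agreement weight (positive)
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return total


def classify_by_weight(weight, lower=0.0, upper=8.0):
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "potential match"


all_agree = {"surname": True, "given_name": True, "postcode": True}
mixed = {"surname": True, "given_name": False, "postcode": False}
none_agree = {"surname": False, "given_name": False, "postcode": False}

for pattern in (all_agree, mixed, none_agree):
    w = match_weight(pattern)
    print(round(w, 2), classify_by_weight(w))
```

Agreement on surname carries the largest weight here because its assumed u-probability is the smallest: the less likely two unrelated records are to agree on an attribute by chance, the more evidence agreement provides.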
Active Learning
Trains a model with a small set of seed data
Classifies comparison vectors not in the training set as matches or non-matches
Asks users for help on the most difficult to classify
Adds the manually classified vectors to the training data set
Trains the next, improved classification model
Repeats until a stopping criterion is met
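The loop above can be sketched as follows, under strong simplifying assumptions: the "model" is a single threshold on the mean similarity, "most difficult" means closest to that threshold, the user is simulated by an `oracle` function, and the stopping criterion is a fixed number of rounds or an exhausted pool. All names and parameter values are hypothetical.

```python
def score(vector):
    return sum(vector) / len(vector)


def train(labeled):
    """Threshold halfway between the mean match and mean non-match score."""
    match_scores = [score(v) for v, is_match in labeled if is_match]
    non_match_scores = [score(v) for v, is_match in labeled if not is_match]
    return (sum(match_scores) / len(match_scores)
            + sum(non_match_scores) / len(non_match_scores)) / 2


def active_learning(seed, unlabeled, oracle, batch_size=2, max_rounds=5):
    # The seed set must contain at least one match and one non-match.
    labeled = list(seed)
    pool = list(unlabeled)
    threshold = train(labeled)
    for _ in range(max_rounds):
        if not pool:
            break
        # Vectors nearest the threshold are the hardest to classify.
        pool.sort(key=lambda v: abs(score(v) - threshold))
        queries, pool = pool[:batch_size], pool[batch_size:]
        # Ask the user (here: the oracle) and grow the training set.
        labeled += [(v, oracle(v)) for v in queries]
        threshold = train(labeled)
    return threshold, labeled


seed = [([1.0, 1.0], True), ([0.0, 0.0], False)]
unlabeled = [[0.9, 0.8], [0.2, 0.1], [0.6, 0.5], [0.4, 0.5]]
threshold, labeled = active_learning(seed, unlabeled, lambda v: score(v) >= 0.5)
print(threshold, len(labeled))
```

Querying only the uncertain vectors keeps manual effort low: clear matches and clear non-matches are classified by the model without ever being shown to the user.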