CONCEPTS AND TECHNIQUES FOR RECORD LINKAGE, ENTITY RESOLUTION, AND DUPLICATE DETECTION BY PETER CHRISTEN PRESENTED BY JOSEPH PARK Data Matching

Post on 04-Jan-2016

TRANSCRIPT

Page 1

CONCEPTS AND TECHNIQUES FOR RECORD LINKAGE, ENTITY RESOLUTION, AND DUPLICATE DETECTION

BY PETER CHRISTEN

PRESENTED BY JOSEPH PARK

Data Matching

Page 2

Introduction

“Data matching is the task of identifying, matching, and merging records that correspond to the same entities from several databases”

Also known as:
- Record or data linkage
- Entity resolution
- Object identification
- Field matching

Page 3

Aims & Challenges

Three tasks:
- Schema matching
- Data matching
- Data fusion

Challenges:
- Lack of unique entity identifiers and poor data quality
- Computational complexity
- Lack of training data (e.g. gold standards)
- Privacy and confidentiality (health informatics & data mining)

Page 4

Overview of Data Matching

Five major steps:
1. Data pre-processing
2. Indexing
3. Record pair comparison
4. Classification
5. Evaluation

Page 5

Diagram

Page 6

Data Pre-processing

- Remove unwanted characters and words
- Expand abbreviations and correct misspellings
- Segment attributes into well-defined and consistent output attributes
- Verify the correctness of attribute values
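
The first three pre-processing tasks above can be sketched in a few lines of Python. This is only an illustrative sketch: the `ABBREVIATIONS` table, the function names, and the naive "number then street" address layout are assumptions made for the example, not part of the presentation.

```python
import re

# Hypothetical lookup table; a real one would be much larger and domain-specific.
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}


def clean(value: str) -> str:
    """Lowercase, drop unwanted characters, and expand known abbreviations."""
    value = value.lower()
    value = re.sub(r"[^a-z0-9\s]", " ", value)  # remove unwanted characters
    tokens = value.split()                      # also collapses repeated spaces
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)


def segment_address(address: str) -> dict:
    """Split an address into two output attributes (assumes 'number street')."""
    tokens = clean(address).split()
    number = tokens[0] if tokens and tokens[0].isdigit() else ""
    street = " ".join(tokens[1:] if number else tokens)
    return {"street_number": number, "street_name": street}
```

For instance, `segment_address("42  Main St.")` yields a street number of `"42"` and a street name of `"main street"`. Verifying attribute values (the fourth task) would need an external reference such as a postcode table and is left out here.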

Page 7

Example of Data Pre-processing

Page 8

Indexing

- Reduces computational complexity
- Generates candidate record pairs
- Common technique: blocking
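
A minimal sketch of blocking, assuming records are dictionaries with `surname` and `postcode` fields and that the blocking key is the first two letters of the surname plus the postcode. Both the field names and the key definition are hypothetical choices for illustration.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record: dict) -> str:
    # Assumed key: first two letters of the surname plus the postcode.
    return record["surname"][:2].lower() + record["postcode"]


def candidate_pairs(records: dict) -> set:
    """Group record ids by blocking key and pair up ids within each block."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[blocking_key(rec)].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        # Only records that share a block are ever compared.
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

With four records there are six possible pairs; if only two of them share a key, a single candidate pair is generated. The trade-off is that true matches whose keys differ (for example through a typo in the first two letters) are never compared.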

Page 9

Example of Blocking

Page 10

Record Pair Comparison

Comparison vector: a vector of numerical similarity values, one per compared attribute
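
As a sketch, a comparison vector can be built by applying one similarity function per attribute. The field names are hypothetical, and `difflib.SequenceMatcher` from the Python standard library is used here only as a stand-in approximate string comparator; it is not the comparator used in the presentation.

```python
from difflib import SequenceMatcher


def string_sim(a: str, b: str) -> float:
    """Approximate string similarity in [0, 1] (stand-in comparator)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_sim(a: str, b: str) -> float:
    """1.0 if the two values are identical, otherwise 0.0."""
    return 1.0 if a == b else 0.0


# Assumed schema: which comparator is applied to which attribute.
COMPARATORS = [
    ("given_name", string_sim),
    ("surname", string_sim),
    ("postcode", exact_sim),
]


def comparison_vector(rec_a: dict, rec_b: dict) -> list:
    """Return one similarity value per compared attribute."""
    return [sim(rec_a[field], rec_b[field]) for field, sim in COMPARATORS]
```

Two records that agree on given name and postcode but differ slightly in surname produce a vector whose first and last entries are 1.0 and whose middle entry lies strictly between 0 and 1.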

Page 11

Example of Record Pair Comparison

Page 12

Jaro and Winkler String Comparison

- Jaro: combines edit distance and q-gram based comparison
- Winkler: increases the Jaro similarity for up to four agreeing initial characters
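
The two comparators can be sketched as follows. This follows the commonly published definitions of Jaro and Jaro-Winkler similarity; the scaling factor `p=0.1` is the value usually quoted for Winkler's modification and is an assumption here, since the slide does not state it.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: counts common characters within a window and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Boost the Jaro similarity for up to `max_prefix` agreeing initial characters."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * p * (1.0 - sim)
```

For the often-used pair "MARTHA" and "MARHTA", this gives a Jaro similarity of about 0.944 and a Jaro-Winkler similarity of about 0.961, because the three-character common prefix "MAR" raises the score.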

Page 13

Record Pair Classification

- Two-class or three-class classification:
  - Match or non-match
  - Match, non-match, or potential match (requires clerical review)
- Supervised and unsupervised
- Active learning
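
One simple way to illustrate the three-class scheme is to sum the similarity values of a comparison vector and compare the total against two thresholds. The threshold values below are hypothetical and would have to be tuned for real data.

```python
def classify(vector: list, lower: float = 1.5, upper: float = 2.5) -> str:
    """Three-class decision on the summed similarities of a comparison vector.

    `lower` and `upper` are illustrative thresholds, not recommended values.
    """
    total = sum(vector)
    if total >= upper:
        return "match"
    if total <= lower:
        return "non-match"
    return "potential match"  # left for clerical review
```

Setting `lower` equal to `upper` collapses this into a two-class classifier with no clerical review region.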

Page 14

Example of Record Pair Classification

Page 15

Unsupervised Classification

- Threshold-based classification
- Probabilistic classification
- Cost-based classification
- Rule-based classification
- Clustering-based classification

Page 16

Probabilistic Classification

- Three-class based
- Different weights assigned to different attributes
  - Newcombe & Kennedy: cardinalities
- Comparison vectors, binary comparison
- Conditionally independent attributes assumed
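
A sketch of how such attribute weights can be combined, using the usual log-ratio form of probabilistic record linkage: for each attribute, `m` is the probability that it agrees for a true match and `u` the probability that it agrees for a non-match. The `m`/`u` values, field names, and thresholds below are invented for illustration only.

```python
import math

# Hypothetical (m, u) probabilities per attribute:
#   m = P(attribute agrees | records match)
#   u = P(attribute agrees | records do not match)
M_U = {
    "surname": (0.95, 0.01),
    "given_name": (0.90, 0.05),
    "postcode": (0.85, 0.10),
}


def match_weight(agreement: dict) -> float:
    """Sum per-attribute weights for a binary comparison vector.

    Summing is justified by the conditional independence assumption.
    """
    total = 0.0
    for field, agrees in agreement.items():
        m, u = M_U[field]
        if agrees:
            total += math.log2(m / u)              # agreement weight (positive)
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return total


def classify_weight(weight: float, lower: float = 0.0, upper: float = 8.0) -> str:
    """Three-class decision; the two thresholds are illustrative assumptions."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "potential match"
```

An attribute with a small `u`, such as a surname with many distinct values, contributes a large agreement weight, which is how different attributes end up carrying different weights.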

Page 17

Formulae

Page 18

Example of Probabilistic Classification

Page 19

Active Learning

1. Trains a model with a small set of seed data
2. Classifies comparison vectors not in the training set as matches or non-matches
3. Asks users for help on the vectors that are most difficult to classify
4. Adds the manually classified vectors to the training data set
5. Trains the next, improved, classification model
6. Repeats until the stopping criteria are met
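
The loop above can be sketched with a deliberately simple model. Everything here is an assumption made for illustration: the classifier is a nearest-centroid rule, "most difficult" is taken to mean closest to equidistant from both centroids, the user is represented by an `oracle` callable, and the stopping criterion is a fixed number of rounds. The seed set is assumed to contain at least one match and one non-match.

```python
def centroid(vectors: list) -> list:
    """Component-wise mean of a non-empty list of comparison vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def distance(a: list, b: list) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def train(labelled: list) -> tuple:
    """labelled: list of (vector, is_match). Returns (match, non-match) centroids."""
    matches = [v for v, y in labelled if y]
    non_matches = [v for v, y in labelled if not y]
    return centroid(matches), centroid(non_matches)


def margin(model: tuple, vector: list) -> float:
    """Positive: closer to the match centroid. Near zero: hard to classify."""
    match_c, non_c = model
    return distance(vector, non_c) - distance(vector, match_c)


def active_learning(seed, unlabelled, oracle, batch_size=1, max_rounds=5):
    labelled = list(seed)
    pool = list(unlabelled)
    model = train(labelled)                  # step 1: train on the seed data
    for _ in range(max_rounds):              # step 6: fixed budget as stopping rule
        if not pool:
            break
        # Steps 2-3: score the pool and pick the hardest vectors.
        pool.sort(key=lambda v: abs(margin(model, v)))
        queries, pool = pool[:batch_size], pool[batch_size:]
        # Step 4: the oracle (a human reviewer in practice) labels them.
        labelled += [(v, oracle(v)) for v in queries]
        model = train(labelled)              # step 5: retrain
    return model, labelled
```

A call such as `active_learning(seed, pool, oracle, max_rounds=2)` returns the final model together with the enlarged training set; `margin(model, v) > 0` then classifies a vector `v` as a match.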