error tolerant record matching pverconf_may2011
DESCRIPTION
May 2011 Personal Validation and Entity Resolution Conference. Presenter: Surajit Chaudhuri, Microsoft Research
TRANSCRIPT
Error Tolerant Record Matching
Surajit Chaudhuri
Microsoft Research
04/10/2023 [email protected] 2
Key Contributors
Sanjay Agrawal
Arvind Arasu
Zhimin Chen
Kris Ganjam
Venky Ganti
Raghav Kaushik
Christian Konig
Rajeev Motwani (Stanford)
Vivek Narasayya
Dong Xin
Data Warehousing & Business Intelligence
Data Warehouse
Extract - Transform - Load
External Source
Analysis Services
Query / Reporting
Data Mining
Bing Maps
Bing Shopping
Our Approach to Data Cleaning
Record Matching
De-duplication
Parsing
Core Operators
Design Tools
Address Matching
Product De-duplication
Local Live
Windows Live Products
Focus of this talk
Challenge: Record Matching over Large Data Sets
Reference table of addresses
Prairie Crosing Dr W Chicago IL 60185
Large Table (~10M Rows)
Efficient Indexing is Needed
Prairie Crosing Dr W Chicago IL 60185
Large Table (~10M Rows)
• Needed for Efficiency & Scalability
• Specific to similarity function
Find all rows s_j such that Sim(r, s_j) ≥ θ
Reference table
Outline
Introduction and Motivation
Two Challenges in Record Matching
Concluding Remarks
Challenge 1: Too Many Similarity Functions
Methodology
• Choose similarity function f appropriate for the domain
• Choose best implementation of f with support for indexing
Can we get away with a common foundation and simulate these variations?
Challenge 2: Lack of Customizability
Abbreviations: USA ≈ United States of America, St ≈ Street, NE ≈ North East
Name variations: Mike ≈ Michael, Bill ≈ William
Aliases: One ≈ 1, First ≈ 1st
Can we inject customizability without loss of efficiency?
Challenge 1: Too Many Similarity Functions
Jaccard Similarity
Statistical measure
Originally defined over sets
String = set of words
Range of values: [0,1]
$\mathrm{Jaccard}(s_1, s_2) = \frac{|s_1 \cap s_2|}{|s_1 \cup s_2|}$
Seeking a common Foundation: Jaccard Similarity
148th Ave NE, Redmond, WA
140th Ave NE, Redmond, WA
$\mathrm{Jaccard} = \frac{4}{4+2} \approx 0.67$
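The address example above can be reproduced with a minimal sketch, assuming that a string is tokenized into a set of lowercase words and that commas are ignored. The helper names `tokenize` and `jaccard` are illustrative and do not come from the talk.

```python
def tokenize(text):
    """Split a string into a set of lowercase word tokens, dropping commas."""
    return set(text.replace(",", " ").lower().split())


def jaccard(s1, s2):
    """Jaccard similarity |s1 ∩ s2| / |s1 ∪ s2| of two sets."""
    if not s1 and not s2:
        return 1.0  # convention chosen here for two empty sets
    return len(s1 & s2) / len(s1 | s2)


a = tokenize("148th Ave NE, Redmond, WA")
b = tokenize("140th Ave NE, Redmond, WA")
score = jaccard(a, b)  # 4 shared tokens out of 6 distinct tokens
print(round(score, 2))  # 0.67
```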
Using Jaccard Similarity to Implement f
[Figure: lookup on f. The query string and each reference-table string are converted to sets; candidates are retrieved with Jacc. Sim. ≥ θ′ and then verified with Check f ≥ θ.]
f ≥ θ ⇒ Jacc. Sim. ≥ θ′
Edit Similarity → Set Similarity
Crossing → C,r,o,s,s,i,n,g
Crosing → C,r,o,s,i,n,g
Jaccard Similarity: 7/8
If strlen(r) ≥ strlen(s): Edit Distance(r,s) ≤ k ⇒ Jacc. Sim(1-gram(r), 1-gram(s)) ≥ (strlen(r) - k)/(strlen(r) + k)
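A minimal sketch of the filter-then-verify idea, assuming the 1-grams are treated as multisets (bags), which is what makes the Crossing / Crosing score come out to 7/8. The function names are illustrative, and the edit distance is the textbook dynamic program rather than anything specific to the talk.

```python
from collections import Counter


def bag_jaccard(r, s):
    """Jaccard similarity over multisets (bags) of 1-grams."""
    cr, cs = Counter(r), Counter(s)
    inter = sum((cr & cs).values())  # min of the per-character counts
    union = sum((cr | cs).values())  # max of the per-character counts
    return inter / union if union else 1.0


def edit_distance(r, s):
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(s) + 1))
    for i, cr in enumerate(r, start=1):
        cur = [i]
        for j, cs in enumerate(s, start=1):
            cost = 0 if cr == cs else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def within_edit_distance(r, s, k):
    """Filter with the Jaccard bound, then verify with the exact distance."""
    if len(r) < len(s):
        r, s = s, r  # the bound assumes strlen(r) >= strlen(s)
    bound = (len(r) - k) / (len(r) + k)
    if bag_jaccard(r, s) < bound:
        return False  # cheap filter: the pair cannot be within k edits
    return edit_distance(r, s) <= k


print(bag_jaccard("Crossing", "Crosing"))  # 0.875, i.e. 7/8
print(within_edit_distance("Crossing", "Crosing", 1))  # True
```

The filter never rejects a true match because the bound is implied by Edit Distance(r,s) ≤ k; pairs that pass it still need the exact check.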
Inverted Index Based Approach
[Figure: inverted index mapping each token (100, Prairie, Crossing, Dr, Drive, Chicago) to its rid list; rid-list sizes shown range from 2 rows to 0.5M rows]
Query: 100 Prairie Crossing Dr Chicago
≥ 0.5M comparisons
Prefix Filter
[Figure: Venn diagram of r and s: 4 shared tokens, 1 token only in r, 1 token only in s]
Any size 2 subset of r has non-empty overlap with s
100 Prairie Crossing Dr Chicago
100 Prairie Crossing Drive Chicago
Inverted Index Based Approach
[Figure: inverted index mapping each token (100, Prairie, Crossing, Dr, Drive, Chicago) to its rid list; rid-list sizes shown range from 2 rows to 0.5M rows]
Query: 100 Prairie Crossing Dr Chicago
Use 100 and Prairie
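A minimal sketch of how the prefix filter limits index probes, assuming an overlap threshold of 4 tokens and a "rarest tokens first" ordering. The reference rows below are hypothetical, and `build_index` and `candidates` are illustrative names. With 5 query tokens and a required overlap of 4, only 5 - 4 + 1 = 2 rid lists need to be read.

```python
from collections import defaultdict


def build_index(rows):
    """Map each token to the list of row ids (rid list) containing it."""
    index = defaultdict(list)
    for rid, text in rows.items():
        for token in set(text.split()):
            index[token].append(rid)
    return index


def candidates(query, index, min_overlap):
    """Probe only the rarest |q| - min_overlap + 1 tokens of the query."""
    tokens = set(query.split())
    prefix_len = len(tokens) - min_overlap + 1
    # Rarest tokens first; ties broken alphabetically for determinism.
    ordered = sorted(tokens, key=lambda t: (len(index.get(t, [])), t))
    prefix = ordered[:prefix_len]
    found = set()
    for token in prefix:
        found.update(index.get(token, []))
    return prefix, found


rows = {  # hypothetical reference rows
    1: "100 Prairie Crossing Drive Chicago",
    2: "200 Lake Crossing Dr Chicago",
    3: "300 Michigan Dr Chicago",
    4: "400 State St Chicago",
}
index = build_index(rows)
prefix, found = candidates("100 Prairie Crossing Dr Chicago", index, min_overlap=4)
print(prefix, found)  # ['100', 'Prairie'] {1}
```

The long rid list for Chicago is never read. The returned rows are only candidates and still have to be verified against the real similarity threshold.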
Signature based Indexing
• Use signature-based scheme to further reduce cost of indexing and index lookup
• Property: If two strings have high JC, then signatures must intersect
• LSH signatures work well
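The slide does not say which LSH scheme is used. As one possible illustration, the sketch below assumes min-hash signatures grouped into bands, where each band acts as an index key. The hash construction, `num_hashes`, and `rows_per_band` are all assumptions, and with this scheme the intersection property holds with high probability rather than with certainty.

```python
import hashlib


def minhash_signature(tokens, num_hashes=16):
    """One min-hash value per seeded hash function."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in tokens
        ))
    return signature


def band_keys(signature, rows_per_band=2):
    """Group the signature into bands; each (band number, values) pair is a key."""
    return {
        (i // rows_per_band, tuple(signature[i:i + rows_per_band]))
        for i in range(0, len(signature), rows_per_band)
    }


r = set("100 Prairie Crossing Dr Chicago".split())
s = set("100 Prairie Crossing Drive Chicago".split())
shared = band_keys(minhash_signature(r)) & band_keys(minhash_signature(s))
print(len(shared))  # number of index keys the two strings have in common
```

Two strings become candidates when they share at least one key, so a lookup touches only the buckets for the query's keys.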
Challenge 2: Lack of Customizability
Normalization?
Rule: Alan → A
A Turing → A Turing
Alan Turing → A Turing
Jaccard Similarity: 1.0
Normalization?
Rules: Alan → A, Aaron → A
Aaron Turing → A Turing
Alan Turing → A Turing
Jaccard Similarity: 1.0
04/10/2023 BYU Talk 25
Transformations
Transformation Rules
Xing → Crossing
W → West
Dr → Drive
Programmable Similarity
Set Similarity
Semantics of Programmable Similarity
Transformation Rules
Prairie Crossing Dr Chicago
Prairie Xing Dr Chicago
Xing → Crossing
W → West
Dr → Drive
Programmable Similarity
Set Similarity
Semantics: Example
Transformation Rules
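Prairie Crossing Dr Chicago
Prairie Xing Dr Chicago
Xing → Crossing
W → West
Dr → Drive
Strings derived from "Prairie Crossing Dr Chicago":
Prairie Crossing Dr Chicago
Prairie Crossing Drive Chicago
Strings derived from "Prairie Xing Dr Chicago":
Prairie Xing Drive Chicago
Prairie Xing Dr Chicago
Prairie Crossing Dr Chicago
Prairie Crossing Drive Chicago
Maximum set similarity over all pairs: 1.0
Programmable Similarity
Set Similarity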
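The semantics in this example can be sketched as follows, assuming single-token rules that may each be applied or skipped, and Jaccard over word sets as the underlying set similarity. `variants` and `programmable_similarity` are illustrative names; multi-token rules such as USA ≈ United States of America are not handled by this sketch.

```python
from itertools import product

RULES = {"Xing": "Crossing", "W": "West", "Dr": "Drive"}


def variants(text, rules=RULES):
    """All strings obtained by optionally rewriting each token with a rule."""
    choices = [
        (tok, rules[tok]) if tok in rules else (tok,)
        for tok in text.split()
    ]
    return {" ".join(combo) for combo in product(*choices)}


def jaccard(a, b):
    """Jaccard similarity of the word sets of two non-empty strings."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)


def programmable_similarity(r, s, rules=RULES):
    """Maximum set similarity over all pairs of generated strings."""
    return max(jaccard(x, y) for x in variants(r, rules) for y in variants(s, rules))


r = "Prairie Crossing Dr Chicago"
s = "Prairie Xing Dr Chicago"
print(len(variants(r)), len(variants(s)))  # 2 4
print(jaccard(r, s))                       # 0.6 without transformations
print(programmable_similarity(r, s))       # 1.0 with transformations
```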
Source of Transformations
• Domain-specific authorities: ~200,000 rules from USPS for address matching
  Hard to capture using a black-box similarity function
• Web: Wikipedia redirects
• Program: First → 1st, Second → 2nd
Computational Challenge: Blowup
ATT Corp., 100 Prairie Xing Dr Chicago, IL, USA
ATT: 1. ATT 2. American Telephone and Telegraph
Corp: 1. Corp 2. Corporation
100: 1. 100 2. One Hundred 3. Hundred 4. Door 100
Xing: 1. Xing 2. Crossing
Dr: 1. Dr 2. Drive
IL: 1. IL 2. Illinois
USA: 1. USA 2. United States 3. United States of America
384 variations!
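The count of 384 is the product of the number of alternatives per token. A short check, assuming the tokens without listed alternatives (Prairie, Chicago) each contribute a factor of 1:

```python
from math import prod

# Number of alternatives per token, as listed on the slide.
alternatives = {
    "ATT": 2, "Corp": 2, "100": 4, "Xing": 2, "Dr": 2, "IL": 2, "USA": 3,
}
total = prod(alternatives.values())
print(total)  # 384
```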
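Similarity With Transformations: Bipartite Matching
[Figure: bipartite graph between the tokens of two address strings (tokens recovered: Prairie, Dr, Chicago and Prairie, Crossing, Drive, Chicago); an edge joins tokens that are equal or related by a transformation rule]
Xing → Crossing
W → West
Dr → Drive
Max Intersection = Max Matching = 4
Max Jaccard = Max Intersection / (8 - Max Intersection) = 4/4 = 1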
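A minimal sketch of computing the maximum intersection as a bipartite matching instead of enumerating every variant. It assumes single-token rules, treats a rule as linking its two tokens in either direction, and takes "Prairie Xing Dr Chicago" from the earlier slides as the first string. The augmenting-path routine is a textbook method chosen for brevity, not necessarily the one used in the talk.

```python
RULES = {"Xing": "Crossing", "W": "West", "Dr": "Drive"}


def compatible(a, b, rules=RULES):
    """Tokens match if equal or if a rule rewrites one into the other."""
    return a == b or rules.get(a) == b or rules.get(b) == a


def max_matching(left, right, rules=RULES):
    """Size of a maximum bipartite matching via augmenting paths."""
    match_of_right = {}  # right index -> left index

    def augment(i, seen):
        for j, tok in enumerate(right):
            if j in seen or not compatible(left[i], tok, rules):
                continue
            seen.add(j)
            if j not in match_of_right or augment(match_of_right[j], seen):
                match_of_right[j] = i
                return True
        return False

    return sum(augment(i, set()) for i in range(len(left)))


def max_jaccard(r, s, rules=RULES):
    """Max Intersection / (|r| + |s| - Max Intersection)."""
    left, right = r.split(), s.split()
    m = max_matching(left, right, rules)
    return m / (len(left) + len(right) - m)


print(max_jaccard("Prairie Xing Dr Chicago", "Prairie Crossing Drive Chicago"))  # 1.0
```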
Extensions to Signature based Indexing
Use same LSH signature-based scheme to reduce cost of indexing and index lookup
Two Properties:
• If two strings have high JC, then signatures must intersect
• All LSH signatures corresponding to generated strings can be obtained efficiently without materializing
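Similarity Join
Challenge of Setting Thresholds
[Figure: operator tree. R (Address) is parsed by Parse Address into R (St, City, State, Zip), which is joined with S (St, City, State, Zip) by two similarity joins, one on (St, City) and one on (St, State, Zip); the results are combined by Union. Thresholds shown: 0.9 and 0.7.]
What are the “right” thresholds?
Transformation rules: Xing → Crossing, W → West, Dr → Drive
Transformation rules: WA → Washington, WI → Wisconsin, FL → Florida, Xing → Crossing, W → West, Dr → Drive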
Learning From Examples
Input
• A set of examples: matches & non-matches
• An operator tree invoking (multiple) Sim Join operations
Goal
• Set the thresholds such that (Number of thresholds = no. of join columns)
  - Precision Threshold: the number of false positives is less than B
  - Recall is maximized: Number of correctly classified matching pairs
• Can be generalized to also choose joining columns and similarity functions
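To make the optimization goal concrete, here is a minimal sketch that assumes a union of two similarity joins (a pair is reported if either join accepts it) and searches a small grid of thresholds exhaustively. The labeled examples and the grid are hypothetical, and exhaustive search is only a stand-in for whatever algorithm the cited work uses.

```python
from itertools import product


def learn_thresholds(examples, max_false_positives, grid):
    """Pick (t1, t2) maximizing true positives with false positives < B."""
    best = None
    for t1, t2 in product(grid, repeat=2):
        tp = fp = 0
        for sim1, sim2, is_match in examples:
            if sim1 >= t1 or sim2 >= t2:  # union of two similarity joins
                if is_match:
                    tp += 1
                else:
                    fp += 1
        if fp < max_false_positives and (best is None or tp > best[0]):
            best = (tp, t1, t2)
    return best


examples = [  # hypothetical (sim on join 1, sim on join 2, is a true match)
    (0.95, 0.60, True),
    (0.92, 0.80, True),
    (0.50, 0.75, True),
    (0.40, 0.30, True),
    (0.85, 0.20, False),
    (0.30, 0.65, False),
    (0.10, 0.10, False),
]
best = learn_thresholds(examples, max_false_positives=1, grid=[0.5, 0.6, 0.7, 0.8, 0.9])
print(best)  # (3, 0.9, 0.7)
```

With B = 1 no false positive is allowed, so the thresholds must sit above the non-matching scores 0.85 and 0.65, and one true match is missed.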
Outline
Introduction and Motivation
Two Challenges in Record Matching
Concluding Remarks
Real-World Record Matching Task
Katrina: Given evacuee lists…
First Name Last Name Address Phone Father Mother
John Doe 3 Third St 345-6789
Martin Johnson 123-4567 Donald Johnson
Thomas 5 Main St Lenny
John Doe Third Street
M Johnson 123-4567 D Johnson
match against enquiries
Beyond Enterprise Data
Dictionary
Canon Rebel XTi SLR Digital Camera
Lenovo ThinkPad X61 Tablet
Sony Handycam DCR SR42 Digital Camcorder
…
Documents
The Canon EOS Rebel XTi remains a very good first dSLR…
The EOS Digital Rebel XTi is the product of Canon's extensive in-house development…
New ThinkPad X61 Tablet models are available with Intel® Centrino® Pro processor…
Challenge: Pairwise Matching
Final Thoughts
Goal: Make Application building easier
• Customizability; Efficiency
Internal Impact of MSR’s Record Matching
• SQL Server Integration Services; Relationship Discovery in Excel PowerPivot
• Bing Maps, Bing Shopping
Open Issues
• Design Studio for Record Matching
• Record Matching for Web Scale Problems
• Broader use of Feature engineering techniques
Questions?
References
Arvind Arasu, Surajit Chaudhuri, and Raghav Kaushik, Learning String Transformations from Examples, in VLDB, Very Large Data Bases Endowment Inc., August 2009
Surajit Chaudhuri and Raghav Kaushik, Extending Autocompletion to Tolerate Errors, in ACM SIGMOD, June 2009
Arvind Arasu, Christopher Re, and Dan Suciu, Large-Scale Deduplication with Constraints using Dedupalog, in IEEE ICDE 2009
Arvind Arasu, Surajit Chaudhuri, and Raghav Kaushik, Transformation-based Framework for Record Matching, in IEEE ICDE 2008
Surajit Chaudhuri, Bee Chung Chen, Venkatesh Ganti, and Raghav Kaushik, Example Driven Design of Efficient Record Matching Queries, in VLDB 2007
Surajit Chaudhuri, Anish Das Sarma, Venkatesh Ganti, and Raghav Kaushik, Leveraging Aggregate Constraints for Deduplication, in SIGMOD 2007
Surajit Chaudhuri, Venkatesh Ganti, and Rajeev Motwani, Robust Identification of Fuzzy Duplicates, in IEEE ICDE 2005: 865-876
Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, and Rajeev Motwani, Robust and efficient fuzzy match for online data cleaning, in SIGMOD 2003
Appendix:
“Robust Identification of Fuzzy Duplicates” (IEEE Data Engineering, 2005)
Deduplication
Given a relation R, the goal is to partition R into groups such that each group consists of “duplicates” (of the same entity)
• Also called reference reconciliation, entity resolution, merge/purge
Record matching, record linkage: identify record pairs (across relations) which are duplicates
• Important sub-goals of deduplication
Previous Techniques
Distance functions to abstract closeness between tuples
• E.g., edit distance, cosine similarity, etc.
Approach 1: clustering
• Hard to determine number of clusters
Approach 2: partition into “valid” groups
• Global threshold g
• All pairs of tuples whose distance < g are considered duplicates
• Partitioning: Connected components in the threshold graph
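Approach 2 can be sketched with a small union-find, assuming tuples are stand-in numbers and distance is the absolute difference. The values and thresholds below are illustrative. A single global threshold can chain unrelated tuples into one group, which the second call shows.

```python
def threshold_groups(values, g):
    """Connected components of the graph linking pairs with distance < g."""
    parent = {v: v for v in values}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for i, a in enumerate(values):
        for b in values[i + 1:]:
            if abs(a - b) < g:
                parent[find(a)] = find(b)

    groups = {}
    for v in values:
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())


values = [101, 102, 104, 201, 202, 301, 302]
print(threshold_groups(values, 3))    # [[101, 102, 104], [201, 202], [301, 302]]
print(threshold_groups(values, 100))  # one group containing all seven values
```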
Our Approach
• Local structural properties are important for identifying sets of duplicates
• Identify two criteria to characterize local structural properties
• Formalize the duplicate elimination problem based upon these criteria
  Unique solution, rich space of solutions, impact of distance transformations, etc.
• Propose an algorithm for solving the problem
Compact Set (CS) Criterion
Duplicates are closer to each other than to other tuples
A group is compact if it consists of all mutual nearest neighbors
In {1,2,3,6,7,10,11,12}: {1,2,3}, {6,7}, {10,11,12} are compact groups
Good distance functions for duplicate identification have the characteristic that sets of duplicates form compact sets
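A minimal check of the compact set criterion on the numeric example, assuming one reading of the definition: every member of the group is strictly closer to all other members than to any tuple outside the group. Distance is the absolute difference, and `is_compact` is an illustrative name.

```python
def is_compact(group, relation):
    """True if every member is closer to all other members than to any outsider."""
    members = set(group)
    outsiders = [u for u in relation if u not in members]
    for v in members:
        farthest_inside = max((abs(v - u) for u in members if u != v), default=0)
        nearest_outside = min((abs(v - u) for u in outsiders), default=float("inf"))
        if farthest_inside >= nearest_outside:
            return False
    return True


R = [1, 2, 3, 6, 7, 10, 11, 12]
print([is_compact(g, R) for g in ([1, 2, 3], [6, 7], [10, 11, 12])])  # [True, True, True]
print(is_compact([1, 2], R))  # False: 2 is as close to 3 as it is to 1
```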
Sparse Neighborhood (SN) Criterion
Duplicate tuples are well-separated from other tuples
Neighborhood is “sparse”
• ng(v) = #tuples in larger sphere / #tuples in smaller sphere around v
• ng(set S of tuples) = AGG{ng(v) of each v in S}
• S is sparse if ng(S) < c
[Figure: growth spheres of radius nn(v) and 2∙nn(v) around v]
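The neighborhood growth can be sketched as below. The assumptions are that both spheres are closed and include v itself, that AGG is max, that tuples are distinct numbers, and that distance is the absolute difference; none of these choices is fixed by the slide.

```python
def nn_distance(v, relation):
    """Distance from v to its nearest neighbor, nn(v)."""
    return min(abs(v - u) for u in relation if u != v)


def neighborhood_growth(v, relation):
    """Ratio of tuple counts in the 2*nn(v) sphere and the nn(v) sphere."""
    radius = nn_distance(v, relation)
    small = sum(1 for u in relation if abs(v - u) <= radius)
    large = sum(1 for u in relation if abs(v - u) <= 2 * radius)
    return large / small


def is_sparse(group, relation, c, agg=max):
    """S is sparse if AGG of the members' growth values is below c."""
    return agg(neighborhood_growth(v, relation) for v in group) < c


R = [1, 2, 3, 6, 7, 10, 11, 12]
print(neighborhood_growth(1, R))     # 1.5: 3 tuples within distance 2, 2 within distance 1
print(is_sparse([1, 2, 3], R, c=2))  # True
```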
Other Constraints
Goal: Partition R into the minimum number of groups {G1,…,Gm} such that for all 1 ≤ i ≤ m
• Gi is a compact set and Gi is an SN group
Can lead to unintuitive solutions
• {101, 102, 104, 201, 202, 301, 302}: 1 group!
Size constraint: size of a group of duplicates is less than K
Diameter constraint: diameter of a group of duplicates is less than θ
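The two constraints are simple to state in code. The sketch below assumes numeric tuples with absolute difference as the distance, and the values K = 4 and θ = 5 are hypothetical.

```python
def diameter(group):
    """Largest pairwise distance within the group (0 for a singleton)."""
    return max(group) - min(group)


def satisfies_constraints(group, max_size, max_diameter):
    """Size constraint (|G| < K) and diameter constraint (diameter < θ)."""
    return len(group) < max_size and diameter(group) < max_diameter


everything = [101, 102, 104, 201, 202, 301, 302]
print(satisfies_constraints(everything, max_size=4, max_diameter=5))       # False
print(satisfies_constraints([101, 102, 104], max_size=4, max_diameter=5))  # True
```

Either constraint alone is enough to rule out the single seven-element group from the example above.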