error tolerant record matching pverconf_may2011

Error Tolerant Record Matching

Surajit Chaudhuri, Microsoft Research

04/10/2023 [email protected] 2

Key Contributors

Sanjay Agrawal

Arvind Arasu

Zhimin Chen

Kris Ganjam

Venky Ganti

Raghav Kaushik

Christian Konig

Rajeev Motwani (Stanford)

Vivek Narasayya

Dong Xin


Data Warehousing & Business Intelligence

Data Warehouse

Extract - Transform - Load

External Source

Analysis Services

Query / Reporting

Data Mining


Bing Maps


Bing Shopping


OBJECTIVE: Reduce Cost of building a data cleaning application


Our Approach to Data Cleaning

Record Matching

De-duplication

Parsing

Core Operators

Design Tools

Address Matching

Product De-duplication

Local Live

Windows Live Products


Focus of this talk


Challenge: Record Matching over Large Data Sets

Reference table of addresses

Prairie Crosing Dr W Chicago IL 60185

Large Table(~10M Rows)


Efficient Indexing is Needed

Prairie Crosing Dr W Chicago IL 60185

Large Table(~10M Rows)

• Needed for Efficiency & Scalability

• Specific to similarity function

Find all rows s_j such that Sim(r, s_j) ≥ θ

Reference table


Outline

Introduction and Motivation

Two Challenges in Record Matching

Concluding Remarks


Challenge 1: Too Many Similarity Functions

Methodology

Choose similarity function f appropriate for the domain

Choose best implementation of f with support for indexing

Can we get away with a common foundation and simulate these variations?


Challenge 2: Lack of Customizability

Abbreviations: USA ≈ United States of America, St ≈ Street, NE ≈ North East

Name variations: Mike ≈ Michael, Bill ≈ William

Aliases: One ≈ 1, First ≈ 1st

Can we inject customizability without loss of efficiency?


Challenge 1: Too Many Similarity Functions


Jaccard Similarity

Statistical measure

Originally defined over sets

String = set of words

Range of values: [0,1]

$$\mathrm{Jaccard}(s_1, s_2) = \frac{|s_1 \cap s_2|}{|s_1 \cup s_2|}$$


Seeking a common Foundation: Jaccard Similarity

148th Ave NE, Redmond, WA

140th Ave NE, Redmond, WA

$$\mathrm{Jaccard} = \frac{4}{4 + 2} \approx 0.67$$
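The address example can be checked with a few lines of Python. This is a minimal sketch that treats each string as a set of words; the function name `jaccard` and the comma handling are assumptions made for illustration, not part of the talk.

```python
def jaccard(s1: str, s2: str) -> float:
    """Jaccard similarity of two strings viewed as sets of words."""
    # Assumption: commas are separators, and words are split on whitespace.
    a = set(s1.replace(",", " ").split())
    b = set(s2.replace(",", " ").split())
    if not a and not b:
        return 1.0  # convention chosen here for two empty strings
    return len(a & b) / len(a | b)

score = jaccard("148th Ave NE, Redmond, WA", "140th Ave NE, Redmond, WA")
print(score)  # 4 shared words out of 6 distinct words
```

The two addresses share Ave, NE, Redmond, and WA, and differ only in the street number, which gives 4 / (4 + 2).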


Using Jaccard Similarity to Implement f

[Figure: lookup on f. The query and each reference-table row are converted from string to set; rows with Jacc. Sim. ≥ θ’ are retrieved, then checked for f ≥ θ.]

f ≥ θ ⇒ Jacc. Sim. ≥ θ’


Edit Similarity ⇒ Set Similarity

Crossing, Crosing

1-grams (as bags): C,r,o,s,s,i,n,g and C,r,o,s,i,n,g

Jaccard Similarity = 7/8

If strlen(r) ≥ strlen(s): Edit Distance(r, s) ≤ k ⇒ Jacc. Sim(1-gram(r), 1-gram(s)) ≥ (strlen(r) − k) / (strlen(r) + k)
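The bound relating edit distance to Jaccard similarity can be checked on the Crossing / Crosing pair. The sketch below assumes 1-grams are compared as bags (multisets), which is what yields 7/8 for this pair; `edit_distance` and `bag_jaccard` are hypothetical helper names.

```python
from collections import Counter

def edit_distance(r: str, s: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(s) + 1))
    for i, cr in enumerate(r, start=1):
        curr = [i]
        for j, cs in enumerate(s, start=1):
            cost = 0 if cr == cs else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def bag_jaccard(r: str, s: str) -> float:
    """Jaccard similarity over bags of characters (1-grams)."""
    a, b = Counter(r), Counter(s)
    inter = sum((a & b).values())  # multiset intersection
    union = sum((a | b).values())  # multiset union
    return inter / union if union else 1.0

r, s = "Crossing", "Crosing"
k = edit_distance(r, s)
bound = (len(r) - k) / (len(r) + k)
print(k, bag_jaccard(r, s), bound)
```

Here k = 1, so the bound is 7/9, and the observed similarity of 7/8 sits above it as the implication requires.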


Inverted Index Based Approach

[Figure: inverted index with rid lists for the tokens 100, Prairie, Crossing, Dr, Drive, and Chicago; a rid list can reach 0.5 M rows]

100 Prairie Crossing Dr Chicago

≥ 0.5 M comparisons


Prefix Filter

[Figure: sets r and s with |r ∩ s| = 4 and one element each outside the overlap]

Any size 2 subset of r has non-empty overlap with s


100 Prairie Crossing Drive Chicago
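The prefix-filter idea can be sketched as follows: if a matching row must share at least `min_overlap` tokens with the query r, then any |r| − `min_overlap` + 1 tokens of r must hit that row, so only that many rid lists need to be read. The function names, the tiny reference table, and the choice to probe the shortest rid lists are assumptions for illustration.

```python
from collections import defaultdict

def build_index(rows):
    """Inverted index: token -> set of row ids (rid list)."""
    index = defaultdict(set)
    for rid, text in enumerate(rows):
        for token in set(text.split()):
            index[token].add(rid)
    return index

def lookup(query, rows, index, min_overlap):
    tokens = set(query.split())
    probe_count = len(tokens) - min_overlap + 1
    if probe_count <= 0:
        return []  # the query has too few tokens to reach the overlap
    # Probe only the shortest rid lists; any probe_count tokens would do.
    probes = sorted(tokens, key=lambda t: len(index.get(t, ())))[:probe_count]
    candidates = set().union(*(index.get(t, set()) for t in probes))
    # Verify each candidate against the full overlap condition.
    return [rid for rid in sorted(candidates)
            if len(tokens & set(rows[rid].split())) >= min_overlap]

rows = [
    "100 Prairie Crossing Dr Chicago",
    "100 Main St Chicago",
    "7 Prairie Crossing Dr Chicago",
    "200 Lake Dr Evanston",
]
index = build_index(rows)
query = "100 Prairie Crossing Drive Chicago"
print(lookup(query, rows, index, min_overlap=4))
```

With five query tokens and a required overlap of 4, only two rid lists are read, matching the "size 2 subset" statement on the slide.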


Inverted Index Based Approach

[Figure: inverted index with rid lists for the tokens 100, Prairie, Crossing, Dr, Drive, and Chicago; a rid list can reach 0.5 M rows]

Use 100 and Prairie


Signature based Indexing

Use signature-based scheme to further reduce cost of indexing and index lookup

Property: If two strings have high JC, then signatures must intersect

LSH signatures work well
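As one illustration of signature-based lookup, MinHash is a well-known LSH family for Jaccard similarity. The talk does not say which LSH scheme is used, so everything below is an assumption: `minhash_signature`, `estimate_jaccard`, `num_hashes`, and the SHA-1 based hash functions are hypothetical choices. With MinHash, strings with high Jaccard similarity agree on many signature positions with high probability.

```python
import hashlib

def minhash_signature(tokens, num_hashes=64):
    """One minimum hash value per seeded hash function (tokens must be non-empty)."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{tok}".encode()).digest()[:8], "big")
            for tok in tokens
        ))
    return signature

def estimate_jaccard(sig_a, sig_b):
    """Fraction of agreeing positions, an estimate of the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

r = set("100 Prairie Crossing Dr Chicago".split())
s = set("100 Prairie Crossing Drive Chicago".split())
sig_r, sig_s = minhash_signature(r), minhash_signature(s)
print(estimate_jaccard(sig_r, sig_s))  # estimate of the true value 4/6
```

An index built on signature positions rather than raw tokens only needs to compare rows that agree with the query on at least one position.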


Challenge 2: Lack of Customizability


Normalization?

[Figure: the rule Alan → A normalizes Alan Turing to A Turing, which then matches A Turing with Jaccard Similarity 1.0]



Normalization?

[Figure: with the rules Alan → A and Aaron → A, both Alan Turing and Aaron Turing normalize to A Turing, so the two match with Jaccard Similarity 1.0]


04/10/2023 BYU Talk 25

Transformations

Transformation Rules

Xing → Crossing

W → West

Dr → Drive

Programmable Similarity

SetSimilarity


Semantics of Programmable Similarity


Prairie Crossing Dr Chicago

Prairie Xing Dr Chicago



SetSimilarity


Semantics: Example




Prairie Crossing Dr Chicago


SetSimilarity


Semantics: Example




Prairie Crossing Dr Chicago

Prairie Crossing Drive Chicago


SetSimilarity


Semantics: Example





Prairie Crossing Drive Chicago

Prairie Xing Drive Chicago


SetSimilarity


Semantics: Example







Prairie Xing Dr Chicago


SetSimilarity


Semantics: Example







Prairie Xing Dr Chicago

Prairie Crossing Dr Chicago



SetSimilarity


Semantics: Example









SetSimilarity


Semantics: Example








1.0


SetSimilarity
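The semantics walked through in the example slides can be sketched by brute force: rewrite each input string in every way the rules allow, then take the best set similarity over all pairs of rewritten strings. The sketch assumes single-token rules applied left to right and Jaccard as the set similarity; `RULES`, `variants`, and `programmable_similarity` are hypothetical names.

```python
from itertools import product

RULES = {"Xing": "Crossing", "W": "West", "Dr": "Drive"}

def variants(text, rules=RULES):
    """All token sets obtained by keeping or rewriting each token."""
    options = [(tok, rules[tok]) if tok in rules else (tok,)
               for tok in text.split()]
    return {frozenset(choice) for choice in product(*options)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 1.0

def programmable_similarity(r, s, rules=RULES):
    """Best Jaccard similarity over all pairs of rewritten strings."""
    return max(jaccard(a, b)
               for a in variants(r, rules)
               for b in variants(s, rules))

r = "Prairie Crossing Dr Chicago"
s = "Prairie Xing Dr Chicago"
plain = jaccard(frozenset(r.split()), frozenset(s.split()))
print(plain, programmable_similarity(r, s))
```

Without rules the pair scores 3/5; rewriting Xing to Crossing makes the two token sets identical, which is the 1.0 on the final example slide. Enumerating variants this way grows quickly with the number of applicable rules.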


Source of Transformations

Domain-specific authorities: ~200,000 rules from USPS for address matching

Hard to capture using a black-box similarity function

Web: Wikipedia redirects

Program: First → 1st, Second → 2nd


Computational Challenge: Blowup

ATT Corp., 100 Prairie Xing Dr Chicago, IL, USA

1. ATT 2. American Telephone and Telegraph

1. Corp 2. Corporation

1. 100 2. One Hundred 3. Hundred 4. Door 100

1. Xing 2. Crossing

1. Dr 2. Drive

1. IL 2. Illinois

1. USA 2. United States 3. United States of America

384 variations!
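The count of 384 is the product of the number of alternatives per token. A small sketch, with the dictionary name `alternatives` chosen for illustration:

```python
from math import prod

# Alternatives listed on the slide; Prairie and Chicago have no rewrites.
alternatives = {
    "ATT": ["ATT", "American Telephone and Telegraph"],
    "Corp": ["Corp", "Corporation"],
    "100": ["100", "One Hundred", "Hundred", "Door 100"],
    "Prairie": ["Prairie"],
    "Xing": ["Xing", "Crossing"],
    "Dr": ["Dr", "Drive"],
    "Chicago": ["Chicago"],
    "IL": ["IL", "Illinois"],
    "USA": ["USA", "United States", "United States of America"],
}

count = prod(len(options) for options in alternatives.values())
print(count)  # 2 * 2 * 4 * 2 * 2 * 2 * 3
```

Each extra token with a rewrite multiplies the total, which is why generating every variation does not scale.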


Similarity With Transformations: Bipartite Matching

Prairie Xing Dr Chicago

Prairie Crossing Drive Chicago

Rules: Xing → Crossing, W → West, Dr → Drive

Max Intersection = Max Matching = 4

Max Jaccard = Max Intersection / (8 − Max Intersection) = 4/4 = 1
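The matching view avoids enumerating variations: connect a token of one string to a token of the other when they are equal or related by a rule, and compute a maximum bipartite matching. The sketch below uses the standard augmenting-path method and assumes rules may be applied in either direction; `RULES`, `compatible`, and `max_matching` are hypothetical names.

```python
RULES = {("Xing", "Crossing"), ("W", "West"), ("Dr", "Drive")}

def compatible(a, b):
    """Tokens match if equal or related by a rule (either direction assumed)."""
    return a == b or (a, b) in RULES or (b, a) in RULES

def max_matching(left, right):
    """Size of a maximum bipartite matching, found with augmenting paths."""
    owner = {}  # index in right -> index in left currently matched to it

    def augment(i, seen):
        for j, tok in enumerate(right):
            if j in seen or not compatible(left[i], tok):
                continue
            seen.add(j)
            # Take j if it is free, or if its current owner can move elsewhere.
            if j not in owner or augment(owner[j], seen):
                owner[j] = i
                return True
        return False

    return sum(augment(i, set()) for i in range(len(left)))

r = "Prairie Xing Dr Chicago".split()
s = "Prairie Crossing Drive Chicago".split()
m = max_matching(r, s)
max_jaccard = m / (len(r) + len(s) - m)
print(m, max_jaccard)
```

Here every token finds a partner, so the matching has size 4 and the maximum Jaccard similarity is 4 / (8 − 4) = 1.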


Extensions to Signature based Indexing

Use same LSH signature-based scheme to reduce cost of indexing and index lookup

Two Properties:

If two strings have high JC, then signatures must intersect

All LSH signatures corresponding to generated strings can be obtained efficiently without materializing the strings


Similarity Join

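[Figure: operator tree. R (Address) → Parse Address → R (St,City,State,Zip), joined with S (St,City,State,Zip) by two Similarity Joins, one on (St, City) and one on (St, State, Zip), whose results are combined by Union.]

Challenge of Setting Thresholds

[Figure: the two Similarity Joins shown with thresholds 0.9 and 0.7]

What are the “right” thresholds?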


WA → Washington, WI → Wisconsin, FL → Florida, Xing → Crossing, W → West, Dr → Drive


Learning From Examples

Input

A set of examples: matches & non-matches

An operator tree invoking (multiple) Sim Join operations

Goal

Set the thresholds such that (Number of thresholds = no. of join columns)

Precision Threshold: the number of false positives is less than B

Recall is maximized: Number of correctly classified matching pairs

Can be generalized to also choose joining columns and similarity functions
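For a single similarity join with one threshold, the goal stated above reduces to a simple scan. This is a minimal sketch under that simplifying assumption, not the multi-threshold method of the talk; `pick_threshold` and the labeled examples are hypothetical.

```python
def pick_threshold(examples, max_false_positives):
    """examples: list of (similarity, is_match) pairs.

    Returns (threshold, true_positives) for the threshold that keeps the
    number of false positives below max_false_positives while classifying
    as many matching pairs correctly as possible, or None if none qualifies.
    """
    best = None
    # Lowering the threshold never loses matches, so scan from high to low
    # and stop as soon as the false-positive budget is exhausted.
    for theta in sorted({sim for sim, _ in examples}, reverse=True):
        fp = sum(1 for sim, match in examples if sim >= theta and not match)
        if fp >= max_false_positives:
            break
        tp = sum(1 for sim, match in examples if sim >= theta and match)
        best = (theta, tp)
    return best

examples = [(0.95, True), (0.9, True), (0.8, False),
            (0.75, True), (0.6, False), (0.4, False)]
result = pick_threshold(examples, max_false_positives=2)
print(result)
```

With several join columns the search space becomes a grid of thresholds, one per column, which is what makes the general problem harder than this one-dimensional scan.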


Outline

Introduction and Motivation

Two Challenges in Record Matching

Concluding Remarks

Real-World Record Matching Task

Katrina: Given evacuee lists…

First Name | Last Name | Address | Phone | Father | Mother

John Doe 3 Third St 345-6789

Martin Johnson 123-4567 Donald Johnson

Thomas 5 Main St Lenny

John Doe Third Street

M Johnson 123-4567 D Johnson

match against enquiries


Beyond Enterprise Data

Documents

Dictionary

Canon Rebel XTi SLR Digital Camera

Lenovo ThinkPad X61 Tablet

Sony Handycam DCR SR42 Digital Camcorder

…

The Canon EOS Rebel XTi remains a very good first dSLR…

The EOS Digital Rebel XTi is the product of Canon's extensive in-house development…

New ThinkPad X61 Tablet models are available with Intel® Centrino® Pro processor…

Challenge: Pairwise Matching



Final Thoughts

Goal: Make Application building easier

Customizability; Efficiency

Internal Impact of MSR’s Record Matching

SQL Server Integration Services; Relationship Discovery in Excel PowerPivot

Bing Maps, Bing Shopping

Open Issues

Design Studio for Record Matching

Record Matching for Web Scale Problems

Broader use of Feature engineering techniques


Questions?


References

Arvind Arasu, Surajit Chaudhuri, and Raghav Kaushik, Learning String Transformations from Examples, in VLDB, Very Large Data Bases Endowment Inc., August 2009

Surajit Chaudhuri and Raghav Kaushik, Extending Autocompletion to Tolerate Errors, in ACM SIGMOD, June 2009

Arvind Arasu, Christopher Re, and Dan Suciu, Large-Scale Deduplication with Constraints using Dedupalog, in IEEE ICDE 2009

Arvind Arasu, Surajit Chaudhuri, and Raghav Kaushik, Transformation-based Framework for Record Matching, in IEEE ICDE 2008

Surajit Chaudhuri, Bee Chung Chen, Venkatesh Ganti, and Raghav Kaushik, Example Driven Design of Efficient Record Matching Queries, in VLDB 2007

Surajit Chaudhuri, Anish Das Sarma, Venkatesh Ganti, and Raghav Kaushik, Leveraging Aggregate Constraints for Deduplication, in SIGMOD 2007

Surajit Chaudhuri, Venkatesh Ganti, and Rajeev Motwani, Robust Identification of Fuzzy Duplicates, in IEEE ICDE 2005: 865-876

Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, and Rajeev Motwani, Robust and efficient fuzzy match for online data cleaning, in SIGMOD 2003


Appendix:

“Robust Identification of Fuzzy Duplicates” (IEEE Data Engineering, 2005)


Deduplication

Given a relation R, the goal is to partition R into groups such that each group consists of “duplicates” (of the same entity)

Also called reference reconciliation, entity resolution, merge/purge

Record matching, record linkage: identify record pairs (across relations) which are duplicates

Important sub-goals of deduplication


Previous Techniques

Distance functions to abstract closeness between tuples

E.g., edit distance, cosine similarity, etc.

Approach 1: clustering

Hard to determine number of clusters

Approach 2: partition into “valid” groups

Global threshold g

All pairs of tuples whose distance < g are considered duplicates

Partitioning: Connected components in the threshold graph
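The global-threshold approach can be sketched with a union-find structure: link every pair of tuples closer than g, then report the connected components. The sketch assumes one-dimensional numeric tuples with absolute difference as the distance; `threshold_groups`, the sample values, and g = 2 are illustrative assumptions.

```python
def threshold_groups(values, g):
    """Connected components of the graph linking values closer than g."""
    parent = list(range(len(values)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if abs(values[i] - values[j]) < g:
                parent[find(i)] = find(j)  # merge the two components

    groups = {}
    for i, v in enumerate(values):
        groups.setdefault(find(i), []).append(v)
    return sorted(groups.values())

groups = threshold_groups([1, 2, 3, 6, 7, 10, 11, 12], g=2)
print(groups)
```

Because components are transitive, a chain of close pairs can pull distant tuples into one group, and a single global g has to suit every region of the data.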


Our Approach

Local structural properties are important for identifying sets of duplicates

Identify two criteria to characterize local structural properties

Formalize the duplicate elimination problem based upon these criteria

Unique solution, rich space of solutions, impact of distance transformations, etc.

Propose an algorithm for solving the problem


Compact Set (CS) Criterion

Duplicates are closer to each other than to other tuples

A group is compact if it consists of all mutual nearest neighbors

In {1,2,3,6,7,10,11,12}: {1,2,3}, {6,7}, {10,11,12} are compact groups

Good distance functions for duplicate identification have the characteristic that sets of duplicates form compact sets
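One reading of the CS criterion is that every member of a group is strictly closer to all other members than to any tuple outside the group. The sketch below checks that reading on the numeric example, again assuming absolute difference as the distance; `is_compact` is a hypothetical name and the strict inequality is an assumption.

```python
def is_compact(group, relation):
    """True if each member is closer to all members than to any outsider."""
    group = set(group)
    outside = [t for t in relation if t not in group]
    for v in group:
        farthest_inside = max((abs(v - u) for u in group if u != v), default=0)
        nearest_outside = min((abs(v - t) for t in outside),
                              default=float("inf"))
        if farthest_inside >= nearest_outside:
            return False
    return True

R = [1, 2, 3, 6, 7, 10, 11, 12]
for candidate in ([1, 2, 3], [6, 7], [10, 11, 12], [3, 6]):
    print(candidate, is_compact(candidate, R))
```

The three groups named on the slide pass the check, while {3, 6} fails because 3 is closer to 2 than to 6.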


Sparse Neighborhood (SN) Criterion

Duplicate tuples are well-separated from other tuples

Neighborhood is “sparse”

ng(v) = #tuples in larger sphere / #tuples in smaller sphere around v

ng(set S of tuples) = AGG{ng(v) of each v in S}

S is sparse if ng(S) < c

[Figure: growth spheres of radius nn(v) and 2∙nn(v) around v]


Other Constraints

Goal: Partition R into the minimum number of groups {G1,…,Gm} such that for all 1 ≤ i ≤ m, Gi is a compact set and Gi is an SN group

Can lead to unintuitive solutions: {101, 102, 104, 201, 202, 301, 302} – 1 group!

Size constraint: size of a group of duplicates is less than K

Diameter constraint: diameter of a group of duplicates is less than θ