error tolerant record matching pverconf_may2011

Error Tolerant Record Matching Surajit Chaudhuri Microsoft Research

Upload: norc-at-the-university-of-chicago

Post on 14-Dec-2014

DESCRIPTION

May 2011 Personal Validation and Entity Resolution Conference. Presenter: Surajit Chaudhuri, Microsoft Research

TRANSCRIPT

Page 1: Error Tolerant Record Matching PVERConf_May2011

Error Tolerant Record Matching

Surajit Chaudhuri, Microsoft Research

Page 2: Error Tolerant Record Matching PVERConf_May2011

04/10/2023 [email protected] 2

Key Contributors

Sanjay Agrawal

Arvind Arasu

Zhimin Chen

Kris Ganjam

Venky Ganti

Raghav Kaushik

Christian Konig

Rajeev Motwani (Stanford)

Vivek Narasayya

Dong Xin

Page 3: Error Tolerant Record Matching PVERConf_May2011

Data Warehousing & Business Intelligence

Data Warehouse

Extract - Transform - Load

External Source

Analysis Services

Query / Reporting

Data Mining

Page 4: Error Tolerant Record Matching PVERConf_May2011

Bing Maps

Page 5: Error Tolerant Record Matching PVERConf_May2011

Bing Shopping

Page 6: Error Tolerant Record Matching PVERConf_May2011

Objective: Reduce the cost of building a data cleaning application

Page 7: Error Tolerant Record Matching PVERConf_May2011

Our Approach to Data Cleaning

Record Matching

De-duplication

Parsing

Core Operators

Design Tools

Address Matching

Product De-duplication

Local Live

Windows Live Products

Focus of this talk

Page 8: Error Tolerant Record Matching PVERConf_May2011

Challenge: Record Matching over Large Data Sets

Reference table of addresses

Prairie Crosing Dr W Chicago IL 60185

Large Table (~10M Rows)

Page 9: Error Tolerant Record Matching PVERConf_May2011

Efficient Indexing is Needed

Prairie Crosing Dr W Chicago IL 60185

Large Table (~10M Rows)

• Needed for efficiency and scalability

• Specific to the similarity function

Find all rows s_j such that Sim(r, s_j) ≥ θ

Reference table

Page 10: Error Tolerant Record Matching PVERConf_May2011

Outline

Introduction and Motivation

Two Challenges in Record Matching

Concluding Remarks

Page 11: Error Tolerant Record Matching PVERConf_May2011

Challenge 1: Too Many Similarity Functions

Methodology:

• Choose a similarity function f appropriate for the domain

• Choose the best implementation of f with support for indexing

Can we get away with a common foundation and simulate these variations?

Page 12: Error Tolerant Record Matching PVERConf_May2011

Challenge 2: Lack of Customizability

• Abbreviations: USA ≈ United States of America, St ≈ Street, NE ≈ North East

• Name variations: Mike ≈ Michael, Bill ≈ William

• Aliases: One ≈ 1, First ≈ 1st

Can we inject customizability without loss of efficiency?

Page 13: Error Tolerant Record Matching PVERConf_May2011

Challenge 1: Too Many Similarity Functions

Page 14: Error Tolerant Record Matching PVERConf_May2011

Jaccard Similarity

Statistical measure

Originally defined over sets

String = set of words

Range of values: [0,1]

$$\mathrm{Jaccard}(s_1, s_2) = \frac{|s_1 \cap s_2|}{|s_1 \cup s_2|}$$

Page 15: Error Tolerant Record Matching PVERConf_May2011

Seeking a common Foundation: Jaccard Similarity

148th Ave NE, Redmond, WA

140th Ave NE, Redmond, WA

$$\mathrm{Jaccard} = \frac{4}{4 + 2} \approx 0.67$$
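The worked example above can be checked with a few lines of Python. This is a minimal sketch, assuming that a string is tokenized into a set of lowercase words with punctuation dropped; the helper names `tokens` and `jaccard` are illustrative and do not come from the talk.

```python
import re

def tokens(text):
    """Split a string into a set of lowercase word tokens (punctuation dropped)."""
    return set(re.findall(r"[A-Za-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two sets; 1.0 if both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

r = tokens("148th Ave NE, Redmond, WA")
s = tokens("140th Ave NE, Redmond, WA")

# Four shared words (ave, ne, redmond, wa) out of six distinct words overall.
print(len(r & s), len(r | s), round(jaccard(r, s), 2))  # 4 6 0.67
```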

Page 16: Error Tolerant Record Matching PVERConf_May2011

Using Jaccard Similarity to Implement f

Property: f ≥ θ ⇒ Jacc. Sim. ≥ θ’

Lookup on f:

• Convert the query string and each reference table string to a set

• Retrieve the reference rows with Jacc. Sim. ≥ θ’

• Check f ≥ θ on the retrieved rows

Page 17: Error Tolerant Record Matching PVERConf_May2011

Edit Similarity → Set Similarity

Crossing vs. Crosing

As 1-grams: C,r,o,s,s,i,n,g and C,r,o,s,i,n,g

Jaccard Similarity = 7/8

If strlen(r) ≥ strlen(s): Edit Distance(r, s) ≤ k ⇒ Jacc. Sim(1-gram(r), 1-gram(s)) ≥ (strlen(r) - k) / (strlen(r) + k)
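The bound on this slide can be checked numerically. The sketch below assumes that the 1-grams are treated as a multiset (bag) of characters, which is what makes the slide's 7/8 come out for Crossing vs. Crosing; `edit_distance` is the standard Levenshtein dynamic program and `bag_jaccard` is an illustrative helper name.

```python
from collections import Counter

def edit_distance(r, s):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(s) + 1))
    for i, cr in enumerate(r, start=1):
        cur = [i]
        for j, cs in enumerate(s, start=1):
            cur.append(min(prev[j] + 1,                # delete cr
                           cur[j - 1] + 1,             # insert cs
                           prev[j - 1] + (cr != cs)))  # substitute or match
        prev = cur
    return prev[-1]

def bag_jaccard(r, s):
    """Jaccard over multisets of characters (1-grams), so repeated letters count."""
    a, b = Counter(r), Counter(s)
    inter = sum((a & b).values())  # multiset intersection: minimum counts
    union = sum((a | b).values())  # multiset union: maximum counts
    return inter / union if union else 1.0

r, s = "Crossing", "Crosing"
k = edit_distance(r, s)
bound = (len(r) - k) / (len(r) + k)

print(k, bag_jaccard(r, s), bound)  # edit distance 1, similarity 7/8, bound 7/9
```

Each edit operation removes at most one character from the bag intersection, which is why a small edit distance forces a high bag Jaccard similarity.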

Page 18: Error Tolerant Record Matching PVERConf_May2011

Inverted Index Based Approach

[Figure: inverted index over a reference table of 0.5 M rows. Each token (100, Prairie, Crossing, Dr, Drive, Chicago) points to a rid list. Query: 100 Prairie Crossing Dr Chicago. Result: ≥ 0.5 M comparisons.]

Page 19: Error Tolerant Record Matching PVERConf_May2011

Prefix Filter

If |r ∩ s| ≥ 4 and |r| = 5, then any size-2 subset of r has a non-empty overlap with s

100 Prairie Crossing Dr Chicago

100 Prairie Crossing Drive Chicago
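The prefix-filter property can be illustrated on the two strings above. This is a sketch under stated assumptions: the required overlap is 4, and tokens are ranked by a fixed global order (plain lexicographic order here, purely for illustration; the talk does not say which order is used). The helper name `prefix` is hypothetical.

```python
from itertools import combinations

def prefix(token_set, min_overlap, order):
    """First |r| - min_overlap + 1 tokens of the set under a fixed global order."""
    ranked = sorted(token_set, key=order)
    return set(ranked[:len(token_set) - min_overlap + 1])

r = set("100 Prairie Crossing Dr Chicago".split())
s = set("100 Prairie Crossing Drive Chicago".split())
min_overlap = 4  # r and s share 4 of their 5 tokens

# Assumed global order: lexicographic. Any fixed order works for correctness.
p = prefix(r, min_overlap, order=lambda tok: tok)
print(sorted(p), bool(p & s))

# The underlying property: every size-2 subset of r overlaps s.
every_pair_overlaps = all(set(pair) & s for pair in combinations(r, 2))
print(every_pair_overlaps)
```

Only the rid lists of the prefix tokens need to be read to find every candidate, which is the point of the filter.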

Page 20: Error Tolerant Record Matching PVERConf_May2011

Inverted Index Based Approach

[Figure: inverted index over a reference table of 0.5 M rows, with rid lists for the tokens 100, Prairie, Crossing, Dr, Drive, Chicago. Query: 100 Prairie Crossing Dr Chicago.]

Use 100 and Prairie

Page 21: Error Tolerant Record Matching PVERConf_May2011

Signature based Indexing

• Use a signature-based scheme to further reduce the cost of indexing and index lookup

• Property: if two strings have high JC, then their signatures must intersect

• LSH signatures work well

Page 22: Error Tolerant Record Matching PVERConf_May2011

Challenge 2: Lack of Customizability

Page 23: Error Tolerant Record Matching PVERConf_May2011

Normalization?

Rule: Alan → A

A Turing stays A Turing; Alan Turing becomes A Turing

Jaccard Similarity = 1.0

Page 24: Error Tolerant Record Matching PVERConf_May2011

Normalization?

Rules: Alan → A, Aaron → A

Aaron Turing becomes A Turing; Alan Turing becomes A Turing

Jaccard Similarity = 1.0, although the two names differ

Page 25: Error Tolerant Record Matching PVERConf_May2011

04/10/2023 BYU Talk 25

Transformations

Transformation Rules: Xing → Crossing, W → West, Dr → Drive

Programmable Similarity

Set Similarity

Page 26: Error Tolerant Record Matching PVERConf_May2011

Semantics of Programmable Similarity

Input strings: Prairie Crossing Dr Chicago; Prairie Xing Dr Chicago

Transformation Rules: Xing → Crossing, W → West, Dr → Drive

Programmable Similarity

Set Similarity

Page 33: Error Tolerant Record Matching PVERConf_May2011

Semantics: Example

Transformation Rules: Xing → Crossing, W → West, Dr → Drive

Strings generated from Prairie Crossing Dr Chicago:

• Prairie Crossing Dr Chicago

• Prairie Crossing Drive Chicago

Strings generated from Prairie Xing Dr Chicago:

• Prairie Xing Dr Chicago

• Prairie Xing Drive Chicago

• Prairie Crossing Dr Chicago

• Prairie Crossing Drive Chicago

Programmable Similarity = maximum Set Similarity over pairs of generated strings = 1.0
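The semantics in this example can be sketched by brute force: generate every variant of each string, then take the best set similarity over all pairs. The code assumes single-token rules applied independently to each token and word-level Jaccard as the set similarity; `variants` and `programmable_similarity` are illustrative names, not the talk's API.

```python
from itertools import product

# Rules from the slide; each maps one token to its alternative forms.
RULES = {"Xing": ["Crossing"], "W": ["West"], "Dr": ["Drive"]}

def variants(text, rules=RULES):
    """All strings obtained by keeping or rewriting each token that a rule matches."""
    choices = [[tok] + rules.get(tok, []) for tok in text.split()]
    return {" ".join(combo) for combo in product(*choices)}

def jaccard(x, y):
    """Word-level Jaccard similarity of two non-empty strings."""
    a, b = set(x.split()), set(y.split())
    return len(a & b) / len(a | b)

def programmable_similarity(r, s):
    """Maximum set similarity over all pairs of generated strings."""
    return max(jaccard(x, y) for x in variants(r) for y in variants(s))

r = "Prairie Crossing Dr Chicago"
s = "Prairie Xing Dr Chicago"
print(len(variants(r)), len(variants(s)))        # 2 4
print(jaccard(r, s), programmable_similarity(r, s))  # 0.6 without rules, 1.0 with rules
```

Enumeration is only workable for short strings with few applicable rules, since the number of variants grows multiplicatively.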

Page 34: Error Tolerant Record Matching PVERConf_May2011

Source of Transformations

• Domain-specific authorities: ~200,000 rules from USPS for address matching; hard to capture using a black-box similarity function

• Web: Wikipedia redirects

• Program: First → 1st, Second → 2nd

Page 35: Error Tolerant Record Matching PVERConf_May2011

Computational Challenge: Blowup

ATT Corp., 100 Prairie Xing Dr Chicago, IL, USA

• ATT: 1. ATT, 2. American Telephone and Telegraph

• Corp: 1. Corp, 2. Corporation

• 100: 1. 100, 2. One Hundred, 3. Hundred, 4. Door 100

• Xing: 1. Xing, 2. Crossing

• Dr: 1. Dr, 2. Drive

• IL: 1. IL, 2. Illinois

• USA: 1. USA, 2. United States, 3. United States of America

384 variations!
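The count of 384 is the product of the number of alternatives per token. A short check, using only the alternatives listed on the slide (tokens without rules, such as Prairie and Chicago, contribute a factor of 1 and are omitted):

```python
from math import prod

# Alternatives per token, as listed on the slide.
alternatives = {
    "ATT": ["ATT", "American Telephone and Telegraph"],
    "Corp": ["Corp", "Corporation"],
    "100": ["100", "One Hundred", "Hundred", "Door 100"],
    "Xing": ["Xing", "Crossing"],
    "Dr": ["Dr", "Drive"],
    "IL": ["IL", "Illinois"],
    "USA": ["USA", "United States", "United States of America"],
}

# 2 * 2 * 4 * 2 * 2 * 2 * 3
total = prod(len(options) for options in alternatives.values())
print(total)  # 384
```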

Page 36: Error Tolerant Record Matching PVERConf_May2011

Similarity With Transformations: Bipartite Matching

Transformation Rules: Xing → Crossing, W → West, Dr → Drive

[Figure: bipartite graph between the tokens Prairie, Xing, Dr, Chicago and the tokens Prairie, Crossing, Drive, Chicago.]

Max Intersection = Max Matching = 4

Max Jaccard = Max Intersection / (8 - Max Intersection) = 4/4 = 1
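The matching computation on this slide can be sketched without enumerating variants. The code assumes single-token rules, treats a rule as linking its two tokens in either direction, and assumes the tokens within each string are distinct; it uses a simple augmenting-path matching, which is one standard way to compute a maximum bipartite matching and is not necessarily the algorithm used in the talk's system.

```python
# Assumption: single-token rules, usable in either direction.
RULES = {("Xing", "Crossing"), ("W", "West"), ("Dr", "Drive")}

def compatible(a, b):
    """Two tokens can be matched if they are equal or linked by a rule."""
    return a == b or (a, b) in RULES or (b, a) in RULES

def max_matching(left, right):
    """Size of a maximum bipartite matching, found with augmenting paths."""
    owner = {}  # index in `right` -> index in `left` currently matched to it

    def augment(i, seen):
        for j, tok in enumerate(right):
            if j in seen or not compatible(left[i], tok):
                continue
            seen.add(j)
            # Take j if it is free, or if its current owner can move elsewhere.
            if j not in owner or augment(owner[j], seen):
                owner[j] = i
                return True
        return False

    return sum(augment(i, set()) for i in range(len(left)))

def max_jaccard(r, s):
    """Max Intersection / (|r| + |s| - Max Intersection)."""
    left, right = r.split(), s.split()
    m = max_matching(left, right)
    return m / (len(left) + len(right) - m)

print(max_jaccard("Prairie Xing Dr Chicago", "Prairie Crossing Drive Chicago"))  # 1.0
```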

Page 37: Error Tolerant Record Matching PVERConf_May2011

Extensions to Signature based Indexing

Use the same LSH signature-based scheme to reduce the cost of indexing and index lookup. Two properties:

• If two strings have high JC, then their signatures must intersect

• All LSH signatures corresponding to generated strings can be obtained efficiently without materializing them

Page 38: Error Tolerant Record Matching PVERConf_May2011

Challenge of Setting Thresholds

[Figure: operator tree. R (Address) → Parse Address → R (St, City, State, Zip). Two Similarity Joins against S (St, City, State, Zip), one on (St, City) and one on (St, State, Zip), feed a Union. Thresholds shown on the joins: 0.9 and 0.7. Transformation rules attached to the joins: Xing → Crossing, W → West, Dr → Drive; WA → Washington, WI → Wisconsin, FL → Florida.]

What are the “right” thresholds?

Page 39: Error Tolerant Record Matching PVERConf_May2011

Learning From Examples

Input:

• A set of examples: matches and non-matches

• An operator tree invoking (multiple) Sim Join operations

Goal: set the thresholds (number of thresholds = number of join columns) such that

• Precision threshold: the number of false positives is less than B

• Recall is maximized: the number of correctly classified matching pairs

Can be generalized to also choose joining columns and similarity functions
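The optimization goal above can be made concrete with a brute-force sketch. It assumes the simplest operator tree, a single join that accepts a pair when every column similarity meets its threshold, and it tries only thresholds that occur in the examples. The function name, the example data, and the exhaustive search are all illustrative assumptions; the talk does not describe its learning algorithm at this level.

```python
from itertools import product

def learn_thresholds(examples, max_false_positives):
    """Pick one threshold per join column.

    examples: list of (similarities, is_match), similarities being a tuple
    with one value per join column. Returns (thresholds, true_positives)
    maximizing true positives with false positives < max_false_positives.
    """
    n_cols = len(examples[0][0])
    # Candidate thresholds: the similarity values observed in each column.
    candidates = [sorted({sims[c] for sims, _ in examples}) for c in range(n_cols)]
    best, best_recall = None, -1
    for thetas in product(*candidates):
        predicted = [all(x >= t for x, t in zip(sims, thetas)) for sims, _ in examples]
        fp = sum(p and not m for p, (_, m) in zip(predicted, examples))
        tp = sum(p and m for p, (_, m) in zip(predicted, examples))
        if fp < max_false_positives and tp > best_recall:
            best, best_recall = thetas, tp
    return best, best_recall

# Hypothetical labeled pairs: (similarity on column 1, similarity on column 2).
examples = [
    ((0.95, 0.90), True),
    ((0.80, 0.85), True),
    ((0.90, 0.60), True),
    ((0.75, 0.40), False),
    ((0.60, 0.95), False),
]
thetas, recalled = learn_thresholds(examples, max_false_positives=1)
print(thetas, recalled)  # (0.75, 0.6) 3
```

The search is exponential in the number of columns, so it only serves to show the objective, not a scalable method.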

Page 40: Error Tolerant Record Matching PVERConf_May2011

Outline

Introduction and Motivation

Two Challenges in Record Matching

Concluding Remarks

Page 41: Error Tolerant Record Matching PVERConf_May2011

Real-World Record Matching Task

Katrina: Given evacuee lists…

Columns: First Name, Last Name, Address, Phone, Father, Mother

John Doe 3 Third St 345-6789

Martin Johnson 123-4567 Donald Johnson

Thomas 5 Main St Lenny

John Doe Third Street

M Johnson 123-4567 D Johnson

match against enquiries

Page 42: Error Tolerant Record Matching PVERConf_May2011

Beyond Enterprise Data

Documents

• The Canon EOS Rebel XTi remains a very good first dSLR…

• The EOS Digital Rebel XTi is the product of Canon's extensive in-house development…

• New ThinkPad X61 Tablet models are available with Intel® Centrino® Pro processor…

Dictionary

• Canon Rebel XTi SLR Digital Camera

• Lenovo ThinkPad X61 Tablet

• Sony Handycam DCR SR42 Digital Camcorder

Challenge: Pairwise Matching

Page 43: Error Tolerant Record Matching PVERConf_May2011

Final Thoughts

• Goal: make application building easier (customizability; efficiency)

• Internal impact of MSR’s Record Matching: SQL Server Integration Services; Relationship Discovery in Excel PowerPivot; Bing Maps, Bing Shopping

• Open issues: Design Studio for Record Matching; Record Matching for Web Scale Problems; broader use of feature engineering techniques

Page 44: Error Tolerant Record Matching PVERConf_May2011

Questions?

Page 45: Error Tolerant Record Matching PVERConf_May2011

References

• Arvind Arasu, Surajit Chaudhuri, and Raghav Kaushik. Learning String Transformations from Examples. VLDB, August 2009.

• Surajit Chaudhuri and Raghav Kaushik. Extending Autocompletion to Tolerate Errors. ACM SIGMOD, June 2009.

• Arvind Arasu, Christopher Re, and Dan Suciu. Large-Scale Deduplication with Constraints using Dedupalog. IEEE ICDE 2009.

• Arvind Arasu, Surajit Chaudhuri, and Raghav Kaushik. Transformation-based Framework for Record Matching. IEEE ICDE 2008.

• Surajit Chaudhuri, Bee Chung Chen, Venkatesh Ganti, and Raghav Kaushik. Example Driven Design of Efficient Record Matching Queries. VLDB 2007.

• Surajit Chaudhuri, Anish Das Sarma, Venkatesh Ganti, and Raghav Kaushik. Leveraging Aggregate Constraints for Deduplication. SIGMOD 2007.

• Surajit Chaudhuri, Venkatesh Ganti, and Rajeev Motwani. Robust Identification of Fuzzy Duplicates. IEEE ICDE 2005: 865-876.

• Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, and Rajeev Motwani. Robust and efficient fuzzy match for online data cleaning. SIGMOD 2003.

Page 46: Error Tolerant Record Matching PVERConf_May2011

Appendix:

“Robust Identification of Fuzzy Duplicates” (IEEE Data Engineering, 2005)

Page 47: Error Tolerant Record Matching PVERConf_May2011

Deduplication

• Given a relation R, the goal is to partition R into groups such that each group consists of “duplicates” (of the same entity)

• Also called reference reconciliation, entity resolution, merge/purge

• Record matching, record linkage: identify record pairs (across relations) which are duplicates; important sub-goals of deduplication

Page 48: Error Tolerant Record Matching PVERConf_May2011

Previous Techniques

• Distance functions to abstract closeness between tuples, e.g., edit distance, cosine similarity, etc.

• Approach 1: clustering. Hard to determine the number of clusters

• Approach 2: partition into “valid” groups using a global threshold g. All pairs of tuples whose distance < g are considered duplicates

• Partitioning: connected components in the threshold graph

Page 49: Error Tolerant Record Matching PVERConf_May2011

Our Approach

• Local structural properties are important for identifying sets of duplicates

• Identify two criteria to characterize local structural properties

• Formalize the duplicate elimination problem based upon these criteria: unique solution, rich space of solutions, impact of distance transformations, etc.

• Propose an algorithm for solving the problem

Page 50: Error Tolerant Record Matching PVERConf_May2011

Compact Set (CS) Criterion

Duplicates are closer to each other than to other tuples

A group is compact if it consists of all mutual nearest neighbors

In {1,2,3,6,7,10,11,12}: {1,2,3}, {6,7}, {10,11,12} are compact groups

Good distance functions for duplicate identification have the characteristic that sets of duplicates form compact sets
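The compact set criterion can be checked on the slide's numeric example. The sketch below reads "mutual nearest neighbors" as: every member of the group is strictly closer to every other member than to any tuple outside the group. The strict inequality (how ties are treated) and the function name `is_compact` are assumptions made for illustration.

```python
def is_compact(group, tuples, dist):
    """True if each member is strictly closer to all other members than to any outsider."""
    group = set(group)
    outside = [t for t in tuples if t not in group]
    for v in group:
        inside = [dist(v, u) for u in group if u != v]
        if inside and outside and max(inside) >= min(dist(v, w) for w in outside):
            return False
    return True

R = [1, 2, 3, 6, 7, 10, 11, 12]

def d(a, b):
    """Distance between two numeric tuples: absolute difference."""
    return abs(a - b)

for g in ({1, 2, 3}, {6, 7}, {10, 11, 12}, {3, 6}):
    print(sorted(g), is_compact(g, R, d))
# The first three groups are compact; {3, 6} is not, because 3 is closer to 2 than to 6.
```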

Page 51: Error Tolerant Record Matching PVERConf_May2011

Sparse Neighborhood (SN) Criterion

• Duplicate tuples are well-separated from other tuples: the neighborhood is “sparse”

• ng(v) = #tuples in larger sphere / #tuples in smaller sphere around v

• Growth spheres: the smaller sphere has radius nn(v), the larger sphere has radius 2∙nn(v)

• ng(set S of tuples) = AGG{ng(v) of each v in S}

• S is sparse if ng(S) < c
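A small sketch of the neighborhood growth computation follows. Several details are assumptions, since the slide leaves them open: nn(v) is taken to be the distance from v to its nearest other tuple, both spheres are closed and include v itself, all tuples are distinct, AGG is taken to be max, and the constant c = 2 is a hypothetical value.

```python
def neighborhood_growth(v, tuples, dist):
    """ng(v): tuples within 2*nn(v) of v divided by tuples within nn(v) of v."""
    nn = min(dist(v, t) for t in tuples if t != v)  # nearest-neighbor distance
    small = sum(dist(v, t) <= nn for t in tuples)   # smaller sphere, includes v
    large = sum(dist(v, t) <= 2 * nn for t in tuples)
    return large / small

def is_sparse(group, tuples, dist, c, agg=max):
    """S is sparse if AGG of ng(v) over v in S is below c (AGG = max is an assumption)."""
    return agg(neighborhood_growth(v, tuples, dist) for v in group) < c

R = [1, 2, 3, 6, 7, 10, 11, 12]

def d(a, b):
    return abs(a - b)

print([neighborhood_growth(v, R, d) for v in (1, 2, 3)])  # [1.5, 1.0, 1.5]
print(is_sparse({1, 2, 3}, R, d, c=2))                    # True
```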

Page 52: Error Tolerant Record Matching PVERConf_May2011

Other Constraints

• Goal: partition R into the minimum number of groups {G1,…,Gm} such that for all 1 ≤ i ≤ m, Gi is a compact set and Gi is an SN group

• Can lead to unintuitive solutions: {101, 102, 104, 201, 202, 301, 302} forms 1 group!

• Size constraint: size of a group of duplicates is less than K

• Diameter constraint: diameter of a group of duplicates is less than θ