generic entity resolution: identifying real-world entities in large data sets hector garcia-molina...

44
Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su, Jennifer Widom, Tyson Condie, Johnson Gong, Nicolas Pombourcq, David Menestrina, Steven Whang

Upload: harriet-jefferson

Post on 17-Jan-2016

222 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

Generic Entity Resolution:Identifying Real-World Entities in

Large Data Sets

Hector Garcia-Molina

Stanford University

Work with: Omar Benjelloun, Qi Su,

Jennifer Widom, Tyson Condie, Johnson Gong,

Nicolas Pombourcq, David Menestrina, Steven Whang

Page 2: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

2

Entity Resolution

N: a A: b CC#: c Ph: e

e1

N: a Exp: d Ph: e

e2

Page 3: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

3

Applications

• comparison shopping

• mailing lists

• classified ads

• customer files

• counter-terrorism

N: a A: b CC#: c Ph: e

e1

N: a Exp: d Ph: e

e2

Page 4: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

4

Outline

• Why is ER challenging?

• How is ER done?

• Some ER work at Stanford

• Confidences

Page 5: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

5

Challenges (1)

•No keys!

•Value matching– “Kaddafi”, “Qaddafi”, “Kadafi”, “Kaddaffi”...

•Record matching

Nm: TomAd: 123 Main StPh: (650) 555-1212Ph: (650) 777-7777

Nm: ThomasAd: 132 Main StPh: (650) 555-1212

Page 6: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

6

Challenges (2)

•Merging recordsNm: TomAd: 123 Main StPh: (650) 555-1212Ph: (650) 777-7777

Nm: ThomasAd: 132 Main StPh: (650) 555-1212Zp: 94305

Nm: TomNm: ThomasAd: 123 Main StPh: (650) 555-1212Ph: (650) 777-7777Zp: 94305

Page 7: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

7

Challenges (3)

•ChainingNm: TomWk: IBMOc: laywerSal: 500K

Nm: TomAd: 123 MainBD: Jan 1, 85Wk: IBM

Nm: ThomasAd: 123 MaimOc: lawyer

Nm: TomAd: 123 MainBD: Jan 1, 85Wk: IBMOc: lawyer

Nm: TomAd: 123 MainBD: Jan 1, 85Wk: IBMOc: lawyerSal: 500K

Page 8: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

8

Challenges (4)

•Un-merging

Nm: TomAd: 123 MainBD: Jan 1, 85Wk: IBMOc: lawyerSal: 500K

too young to make500K at IBM!!

Page 9: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

9

Challenges (5)

• Confidences in dataNm: Tom (0.9)Ad: 123 Main St (1.0)Ph: (650) 555-1212 (0.6)Ph: (650) 777-7777 (0.8)

(0.8)

• In value matching, match rules, merge:

conf = ?

Page 10: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

10

Taxonomy

•Pairwise snaps vs. clustering

•De-duplication vs. fidelity enhancement

•Schema differences

•Relationships

•Exact vs. approximate

•Generic vs application specific

•Confidences

Page 11: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

11

Schema Differences

Name: TomAddress: 123 Main StPh: (650) 555-1212Ph: (650) 777-7777

FirstName: TomStreetName: Main StStreetNumber: 123Tel: (650) 777-7777

Page 12: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

12

Pair-Wise Snaps vs. Clustering

r1

r2

r3

r4

r5

r6

s9

s8

s7

s10

r1

r2

r3

r4r5

r6

r9

r8

r7

r10

Page 13: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

13

De-Duplication vs. Fidelity Enhancement

R S

B S

N

Page 14: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

14

Relationships

r1 r2

r5

r7brother

father

businessbusiness

Page 15: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

15

Using Relationships

p1

p2

p5

p7

a1

a2

a3

a5

a4

authors papers

same??

Page 16: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

16

Exact vs Approximate ER

products

camerasresolvedcameras

CDs

books

...

resolvedCDs

resolvedbooks

...

ER

ER

ER

Page 17: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

17

Exact vs Approximate ER

terrorists terroristssortby age

Widom 30match against

ages 25-35

Page 18: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

18

Generic vs Application Specific

• Match function M(r, s)

• Merge function <r, s> => t

Page 19: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

19

Taxonomy

•Pairwise snaps vs. clustering

•De-duplication vs. fidelity enhancement

•Schema differences

•Relationships

•Exact vs. approximate

•Generic vs application specific

•Confidences

Page 20: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

20

Outline

• Why is ER challenging?

• How is ER done?

• Some ER work at Stanford

• Confidences

Page 21: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

21

Taxonomy

•Pairwise snaps vs. clustering

•De-duplication vs. fidelity enhancement

•Schema differences No

•Relationships No

•Exact vs. approximate

•Generic vs application specific

•Confidences ... later on

Page 22: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

22

Model

Nm: TomWk: IBMOc: laywerSal: 500K

Nm: TomAd: 123 MainBD: Jan 1, 85Wk: IBM

Nm: ThomasAd: 123 MaimOc: lawyer

Nm: TomAd: 123 MainBD: Jan 1, 85Wk: IBMOc: lawyer

Nm: TomAd: 123 MainBD: Jan 1, 85Wk: IBMOc: lawyerSal: 500K

r1 r3r2

r4:<r1, r2><r4, r3>

M(r1, r2) M(r4, r3)

Page 23: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

23

Correct Answer

r1

r2

r3

r4

r5

r6

s9

s8

s7

s10

ER(R) = All derivable records.....

Minus “dominated” records

Page 24: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

24

Question

• What is best sequence of match, merge calls that give us right answer?

Page 25: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

25

Brute Force Algorithm

• Input R:– r1 = [a:1, b:2]

– r2 = [a:1, c: 4, e:5]

– r3 = [b:2, c:4, f:6]

– r4 = [a:7, e:5, f:6]

Page 26: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

26

Brute Force Algorithm

• Input R:– r1 = [a:1, b:2]

– r2 = [a:1, c: 4, e:5]

– r3 = [b:2, c:4, f:6]

– r4 = [a:7, e:5, f:6]

• Match all pairs:– r1 = [a:1, b:2]

– r2 = [a:1, c: 4, e:5]

– r3 = [b:2, c:4, f:6]

– r4 = [a:7, e:5, f:6]

– r12 = [a:1, b:2, c:4, e:5]

Page 27: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

27

Brute Force Algorithm

• Match all pairs:– r1 = [a:1, b:2]

– r2 = [a:1, c: 4, e:5]

– r3 = [b:2, c:4, f:6]

– r4 = [a:7, e:5, f:6]

– r12 = [a:1, b:2, c:4, e:5]

• Repeat:– r1 = [a:1, b:2]

– r2 = [a:1, c: 4, e:5]

– r3 = [b:2, c:4, f:6]

– r4 = [a:7, e:5, f:6]

– r12 = [a:1, b:2, c:4, e:5]

– r123 =[a:1, b:2, c:4, e:5, f:6]

Page 28: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

28

Question # 1

Brute Force Algorithm

• Input R:– r1 = [a:1, b:2]

– r2 = [a:1, c: 4, e:5]

– r3 = [b:2, c:4, f:6]

– r4 = [a:7, e:5, f:6]

• Match all pairs:– r1 = [a:1, b:2]

– r2 = [a:1, c: 4, e:5]

– r3 = [b:2, c:4, f:6]

– r4 = [a:7, e:5, f:6]

– r12 = [a:1, b:2, c:4, e:5]

Can we deleter1, r2?

Page 29: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

29

Question # 2

Brute Force Algorithm

• Match all pairs:– r1 = [a:1, b:2]

– r2 = [a:1, c: 4, e:5]

– r3 = [b:2, c:4, f:6]

– r4 = [a:7, e:5, f:6]

– r12 = [a:1, b:2, c:4, e:5]

• Repeat:– r1 = [a:1, b:2]

– r2 = [a:1, c: 4, e:5]

– r3 = [b:2, c:4, f:6]

– r4 = [a:7, e:5, f:6]

– r12 = [a:1, b:2, c:4, e:5]

– r123 =[a:1, b:2, c:4, e:5, f:6]

Can we avoidcomparisons?

Page 30: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

30

ICAR Properties

• Idempotence:– M(r1, r1) = true; <r1, r1> = r1

•Commutativity:– M(r1, r2) = M(r2, r1)

– <r1, r2> = <r2, r1>

•Associativity– <r1, <r2, r3>> = <<r1, r2>, r3>

Page 31: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

31

More Properties

•Representativity– If <r1, r2> = r3, then

for any r4 such that M(r1, r4) is true we also have M(r3, r4) = true.

r1

r2

r3

r4

Page 32: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

32

ICAR Properties Efficiency

• Commutativity

• Idempotence

• Associativity

• Representativity

• Can discard records• ER result independent

of processing order

Page 33: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

33

Swoosh Algorithms

•Record Swoosh•Merges records as soon as they match•Optimal in terms of record comparisons

•Feature Swoosh•Remembers values seen for each feature•Avoids redundant value comparisons

Page 34: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

34

Swoosh Performance

Page 35: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

35

If ICAR Properties Do Not Hold?

r1: [Joe Sr., 123 Main, DL:X]

r23: [Joe Jr., 123 Main, Ph: 123, DL:Y]

r12: [Joe Sr., 123 Main, Ph: 123, DL:X]

r2: [Joe, 123 Main, Ph:123]

r3: [Joe Jr., 123 Main, DL:Y]

Page 36: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

36

If ICAR Properties Do Not Hold?

r1: [Joe Sr., 123 Main, DL:X]

r23: [Joe Jr., 123 Main, Ph: 123, DL:Y]

r12: [Joe Sr., 123 Main, Ph: 123, DL:X]

r2: [Joe, 123 Main, Ph:123]

r3: [Joe Jr., 123 Main, DL:Y]

Full Answer: ER(R) = {r12, r23, r1, r2, r3}Minus Dominated: ER(R) = {r12, r23}

Page 37: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

37

If ICAR Properties Do Not Hold?

r1: [Joe Sr., 123 Main, DL:X]

r23: [Joe Jr., 123 Main, Ph: 123, DL:Y]

r12: [Joe Sr., 123 Main, Ph: 123, DL:X]

r2: [Joe, 123 Main, Ph:123]

r3: [Joe Jr., 123 Main, DL:Y]

Full Answer: ER(R) = {r12, r23, r1, r2, r3}Minus Dominated: ER(R) = {r12, r23}R-Swoosh Yields: ER(R) = {r12, r3} or {r1, r23}

Page 38: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

38

Swoosh Without ICAR Properties

Page 39: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

39

Distributed Swoosh

P1 P2 P3

r1r2r3r4r5r6...

Page 40: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

40

Distributed Swoosh

P1 P2 P3

r1

r3r4

r6...

r1r2

r4r5

...

r2r3

r5r6...

Page 41: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

41

DSwoosh Performance

Page 42: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

42

Outline

• Why is ER challenging?

• How is ER done?

• Some ER work at Stanford

• Confidences

Page 43: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

43

Conclusion

• ER is old and important problem

• Our approach: generic

• Confidences– challenging

– two ways to tame:• thresholds

• packages

Page 44: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,

44

Thanks.