![Page 1: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets Hector Garcia-Molina Stanford University Work with: Omar Benjelloun, Qi Su,](https://reader034.vdocuments.us/reader034/viewer/2022051018/5697bf8e1a28abf838c8cae3/html5/thumbnails/1.jpg)
Generic Entity Resolution: Identifying Real-World Entities in
Large Data Sets
Hector Garcia-Molina
Stanford University
Work with: Omar Benjelloun, Qi Su,
Jennifer Widom, Tyson Condie, Johnson Gong,
Nicolas Pombourcq, David Menestrina, Steven Whang
Entity Resolution
e1: N: a; A: b; CC#: c; Ph: e
e2: N: a; Exp: d; Ph: e
Applications
• comparison shopping
• mailing lists
• classified ads
• customer files
• counter-terrorism
Outline
• Why is ER challenging?
• How is ER done?
• Some ER work at Stanford
• Confidences
Challenges (1)
• No keys!
• Value matching
– “Kaddafi”, “Qaddafi”, “Kadafi”, “Kaddaffi”...
• Record matching
Nm: Tom; Ad: 123 Main St; Ph: (650) 555-1212; Ph: (650) 777-7777
Nm: Thomas; Ad: 132 Main St; Ph: (650) 555-1212
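
Value matching of this kind is often approximated with a string-similarity score. The sketch below is only an illustration: the slides do not name a similarity measure, so the use of Python's standard `difflib`, the helper name `similar`, and the 0.8 threshold are all assumptions.

```python
from difflib import SequenceMatcher

def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Return True when two spellings are close enough to count as a value match."""
    # ratio() is 2*M/T: M = matched characters, T = combined length of both strings.
    score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return score >= threshold  # the threshold is an illustrative assumption

spellings = ["Kaddafi", "Qaddafi", "Kadafi", "Kaddaffi"]

# Compare every variant against the first spelling.
matches = [s for s in spellings if similar("Kaddafi", s)]
print(matches)
```

Under this threshold all four spellings match each other's reference spelling, while a pair such as "Tom" and "Thomas" does not, which suggests why record matching needs more than per-value string similarity.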
Challenges (2)
• Merging records
Nm: Tom; Ad: 123 Main St; Ph: (650) 555-1212; Ph: (650) 777-7777
Nm: Thomas; Ad: 132 Main St; Ph: (650) 555-1212; Zp: 94305
Nm: Tom; Nm: Thomas; Ad: 123 Main St; Ph: (650) 555-1212; Ph: (650) 777-7777; Zp: 94305
Challenges (3)
• Chaining
Nm: Tom; Wk: IBM; Oc: laywer; Sal: 500K
Nm: Tom; Ad: 123 Main; BD: Jan 1, 85; Wk: IBM
Nm: Thomas; Ad: 123 Maim; Oc: lawyer
Nm: Tom; Ad: 123 Main; BD: Jan 1, 85; Wk: IBM; Oc: lawyer
Nm: Tom; Ad: 123 Main; BD: Jan 1, 85; Wk: IBM; Oc: lawyer; Sal: 500K
Challenges (4)
• Un-merging
Nm: Tom; Ad: 123 Main; BD: Jan 1, 85; Wk: IBM; Oc: lawyer; Sal: 500K
too young to make 500K at IBM!!
Challenges (5)
• Confidences in data
Nm: Tom (0.9); Ad: 123 Main St (1.0); Ph: (650) 555-1212 (0.6); Ph: (650) 777-7777 (0.8)
(0.8)
• In value matching, match rules, merge:
conf = ?
Taxonomy
•Pairwise snaps vs. clustering
•De-duplication vs. fidelity enhancement
•Schema differences
•Relationships
•Exact vs. approximate
•Generic vs application specific
•Confidences
Schema Differences
Name: Tom; Address: 123 Main St; Ph: (650) 555-1212; Ph: (650) 777-7777
FirstName: Tom; StreetName: Main St; StreetNumber: 123; Tel: (650) 777-7777
Pair-Wise Snaps vs. Clustering
[Diagram: two panels of record nodes, the first labeled r1 to r6 and s7 to s10, the second labeled r1 to r10]
De-Duplication vs. Fidelity Enhancement
[Diagram: record sets labeled R, S, B, S, and N]
Relationships
[Diagram: records r1, r2, r5, r7 connected by edges labeled brother, father, business, business]
Using Relationships
[Diagram: papers p1, p2, p5, p7 and authors a1, a2, a3, a4, a5, with one pair of nodes marked “same??”]
Exact vs Approximate ER
[Diagram: products are split into cameras, CDs, books, ...; a separate ER run over each category yields resolved cameras, resolved CDs, resolved books, ...]
Exact vs Approximate ER
[Diagram: the terrorists list is sorted by age; the record “Widom 30” is matched only against ages 25-35]
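
The sort-then-match idea on this slide can be sketched as a window lookup over an age-sorted list. This is a minimal illustration, not the method used in the talk: the function name `candidates_by_age`, the window of 5 years, and the sample list are all made up here.

```python
from bisect import bisect_left, bisect_right

def candidates_by_age(sorted_records, age, window=5):
    """Return the records whose age lies within +/- window of the query age."""
    ages = [r["age"] for r in sorted_records]
    lo = bisect_left(ages, age - window)    # first record with age >= age - window
    hi = bisect_right(ages, age + window)   # one past the last record with age <= age + window
    return sorted_records[lo:hi]

# Hypothetical list, sorted by age once up front.
people = [
    {"name": "A", "age": 22},
    {"name": "B", "age": 25},
    {"name": "C", "age": 31},
    {"name": "D", "age": 35},
    {"name": "E", "age": 48},
]
people.sort(key=lambda r: r["age"])

query = {"name": "Widom", "age": 30}
shortlist = candidates_by_age(people, query["age"])
print([r["name"] for r in shortlist])
```

Only the shortlist is passed to the expensive match function. The result is approximate because a true match whose recorded age falls outside the window is never compared.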
Generic vs Application Specific
• Match function M(r, s)
• Merge function <r, s> => t
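
In the generic setting the two functions are black boxes supplied by the application. The sketch below shows one possible shape for that interface; the type aliases, `same_phone`, `union_merge`, and `resolve_pair` are hypothetical names, and the phone-based rule is only an example of an application-specific M(r, s).

```python
from typing import Callable, Dict, FrozenSet, Optional

Record = Dict[str, FrozenSet[str]]             # attribute -> set of values
MatchFn = Callable[[Record, Record], bool]     # M(r, s)
MergeFn = Callable[[Record, Record], Record]   # <r, s> => t

def same_phone(r: Record, s: Record) -> bool:
    """Example M(r, s): the records share at least one phone number."""
    return bool(r.get("Ph", frozenset()) & s.get("Ph", frozenset()))

def union_merge(r: Record, s: Record) -> Record:
    """Example <r, s>: keep every value of every attribute."""
    return {k: r.get(k, frozenset()) | s.get(k, frozenset()) for k in r.keys() | s.keys()}

def resolve_pair(r: Record, s: Record, match: MatchFn, merge: MergeFn) -> Optional[Record]:
    """Generic step: the ER logic only ever calls the two black boxes."""
    return merge(r, s) if match(r, s) else None

tom = {
    "Nm": frozenset({"Tom"}),
    "Ad": frozenset({"123 Main St"}),
    "Ph": frozenset({"(650) 555-1212", "(650) 777-7777"}),
}
thomas = {
    "Nm": frozenset({"Thomas"}),
    "Ad": frozenset({"132 Main St"}),
    "Ph": frozenset({"(650) 555-1212"}),
    "Zp": frozenset({"94305"}),
}
merged = resolve_pair(tom, thomas, same_phone, union_merge)
```

Note that `union_merge` keeps both addresses, whereas the merged record on the earlier "Merging records" slide keeps only one; a different merge function could be plugged in without touching `resolve_pair`.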
Taxonomy
•Pairwise snaps vs. clustering
•De-duplication vs. fidelity enhancement
•Schema differences: No
•Relationships: No
•Exact vs. approximate
•Generic vs application specific
•Confidences ... later on
Model
Nm: Tom; Wk: IBM; Oc: laywer; Sal: 500K
Nm: Tom; Ad: 123 Main; BD: Jan 1, 85; Wk: IBM
Nm: Thomas; Ad: 123 Maim; Oc: lawyer
Nm: Tom; Ad: 123 Main; BD: Jan 1, 85; Wk: IBM; Oc: lawyer
Nm: Tom; Ad: 123 Main; BD: Jan 1, 85; Wk: IBM; Oc: lawyer; Sal: 500K
r1 r3 r2
r4: <r1, r2>; <r4, r3>
M(r1, r2); M(r4, r3)
Correct Answer
[Diagram: records r1 to r6 and s7 to s10]
ER(R) = all derivable records, minus “dominated” records
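
Removing dominated records can be sketched as a filter over the derived set. The slide does not define domination formally, so this sketch assumes a record is dominated when another record contains all of its (attribute, value) pairs and more; the records and the helper names are illustrative.

```python
def dominated(r, s):
    """r is dominated by s when s holds everything r holds, and more."""
    return r < s  # proper-subset test on frozensets of (attribute, value) pairs

def minus_dominated(records):
    """Drop every record that some other record in the set dominates."""
    return {r for r in records if not any(dominated(r, s) for s in records)}

# Illustrative records, not taken from the slide.
x = frozenset({("a", 1), ("b", 2)})
y = frozenset({("a", 1), ("c", 4)})
xy = x | y                      # a derived record containing both x and y
z = frozenset({("a", 7)})       # unrelated record, dominated by nothing

result = minus_dominated({x, y, xy, z})
```

Here `x` and `y` are dropped because `xy` contains them, so `result` holds only `xy` and `z`.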
Question
• What is the best sequence of match and merge calls that gives us the right answer?
Brute Force Algorithm
• Input R:
– r1 = [a:1, b:2]
– r2 = [a:1, c:4, e:5]
– r3 = [b:2, c:4, f:6]
– r4 = [a:7, e:5, f:6]
• Match all pairs:
– r1 = [a:1, b:2]
– r2 = [a:1, c:4, e:5]
– r3 = [b:2, c:4, f:6]
– r4 = [a:7, e:5, f:6]
– r12 = [a:1, b:2, c:4, e:5]
• Repeat:
– r1 = [a:1, b:2]
– r2 = [a:1, c:4, e:5]
– r3 = [b:2, c:4, f:6]
– r4 = [a:7, e:5, f:6]
– r12 = [a:1, b:2, c:4, e:5]
– r123 = [a:1, b:2, c:4, e:5, f:6]
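
The brute-force loop can be sketched as a fixed-point computation: compare all pairs, add every merge, and repeat until nothing new appears. The slides do not state the match rule, so the rule below (same value for `a`, or same values for both `b` and `c`) is an assumption chosen only because it reproduces r12 and r123 from the slide; merge is assumed to be the union of values.

```python
from itertools import combinations

def record(**attrs):
    """A record is a frozenset of (attribute, value) pairs."""
    return frozenset(attrs.items())

def match(r, s):
    # Assumed rule: same value for a, or same values for both b and c.
    shared = {attr for attr, _ in r & s}
    return "a" in shared or {"b", "c"} <= shared

def merge(r, s):
    # Assumed merge: union of all (attribute, value) pairs.
    return r | s

def brute_force(records):
    """Compare all pairs, add the merges, and repeat until no new record appears."""
    known = set(records)
    while True:
        derived = {merge(r, s) for r, s in combinations(known, 2) if match(r, s)}
        if derived <= known:
            return known
        known |= derived

r1 = record(a=1, b=2)
r2 = record(a=1, c=4, e=5)
r3 = record(b=2, c=4, f=6)
r4 = record(a=7, e=5, f=6)

closure = brute_force([r1, r2, r3, r4])
```

With these assumptions `closure` contains the six records listed under "Repeat": the four inputs, r12, and r123. Every round re-compares pairs that were already compared, which is the cost the following questions target.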
Question # 1
• Once matching all pairs has produced r12, can we delete r1, r2?
Question # 2
• In the repeat step, can we avoid comparisons?
ICAR Properties
• Idempotence:
– M(r1, r1) = true; <r1, r1> = r1
• Commutativity:
– M(r1, r2) = M(r2, r1)
– <r1, r2> = <r2, r1>
• Associativity:
– <r1, <r2, r3>> = <<r1, r2>, r3>
More Properties
• Representativity:
– If <r1, r2> = r3, then for any r4 such that M(r1, r4) is true we also have M(r3, r4) = true.
[Diagram: records r1, r2, r3, r4]
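
The four properties can be checked exhaustively on a small sample of records. This is a testing aid sketched under assumptions, not a proof technique from the talk: the match rule, the union merge, and the helper name `check_icar` are illustrative, and passing on a sample does not show the properties hold in general.

```python
from itertools import product

def match(r, s):
    # Assumed rule: same value for a, or same values for both b and c.
    shared = {attr for attr, _ in r & s}
    return "a" in shared or {"b", "c"} <= shared

def merge(r, s):
    # Assumed merge: union of all (attribute, value) pairs.
    return r | s

def check_icar(records, match, merge):
    """Test each property on every pair or triple drawn from the sample."""
    pairs = list(product(records, repeat=2))
    triples = list(product(records, repeat=3))
    return {
        "idempotence": all(match(r, r) and merge(r, r) == r for r in records),
        "commutativity": all(
            match(r, s) == match(s, r) and merge(r, s) == merge(s, r) for r, s in pairs
        ),
        "associativity": all(
            merge(r, merge(s, t)) == merge(merge(r, s), t) for r, s, t in triples
        ),
        # If r matches s and r matches t, the merge <r, s> must still match t.
        "representativity": all(
            match(merge(r, s), t) for r, s, t in triples if match(r, s) and match(r, t)
        ),
    }

sample = [
    frozenset({("a", 1), ("b", 2)}),
    frozenset({("a", 1), ("c", 4), ("e", 5)}),
    frozenset({("b", 2), ("c", 4), ("f", 6)}),
    frozenset({("a", 7), ("e", 5), ("f", 6)}),
]

report = check_icar(sample, match, merge)

# A merge that simply keeps its first argument is not commutative.
keep_first = lambda r, s: r
broken = check_icar(sample, match, keep_first)
```

On this sample the union merge passes all four checks, while `keep_first` fails commutativity.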
ICAR Properties Efficiency
• Commutativity
• Idempotence
• Associativity
• Representativity
• Can discard records
• ER result independent of processing order
Swoosh Algorithms
• Record Swoosh
– Merges records as soon as they match
– Optimal in terms of record comparisons
• Feature Swoosh
– Remembers values seen for each feature
– Avoids redundant value comparisons
Swoosh Performance
If ICAR Properties Do Not Hold?
r1: [Joe Sr., 123 Main, DL:X]
r23: [Joe Jr., 123 Main, Ph: 123, DL:Y]
r12: [Joe Sr., 123 Main, Ph: 123, DL:X]
r2: [Joe, 123 Main, Ph:123]
r3: [Joe Jr., 123 Main, DL:Y]
Full Answer: ER(R) = {r12, r23, r1, r2, r3}
Minus Dominated: ER(R) = {r12, r23}
R-Swoosh Yields: ER(R) = {r12, r3} or {r1, r23}
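
The order dependence on this slide can be reproduced with a small sketch. The match and merge rules below are assumptions invented to fit the example (names are compatible when equal or when one is the bare "Joe"; merge keeps the longer name), and `r_swoosh` is a simplified merge-as-you-go loop rather than the authors' code.

```python
def match(r, s):
    """Names must be compatible (equal, or one is the bare "Joe") and a fact must be shared."""
    (name_r, facts_r), (name_s, facts_s) = r, s
    compatible = name_r == name_s or "Joe" in (name_r, name_s)
    return compatible and bool(facts_r & facts_s)

def merge(r, s):
    """Keep the more specific (longer) name and the union of the other facts."""
    return (max(r[0], s[0], key=len), r[1] | s[1])

def r_swoosh(records):
    pending, resolved = list(records), []
    while pending:
        r = pending.pop(0)
        partner = next((s for s in resolved if match(r, s)), None)
        if partner is None:
            resolved.append(r)
        else:
            resolved.remove(partner)
            pending.append(merge(r, partner))
    return resolved

r1 = ("Joe Sr.", frozenset({"123 Main", "DL:X"}))
r2 = ("Joe", frozenset({"123 Main", "Ph:123"}))
r3 = ("Joe Jr.", frozenset({"123 Main", "DL:Y"}))

forward = r_swoosh([r1, r2, r3])    # r2 is absorbed into r1 first
backward = r_swoosh([r3, r2, r1])   # r2 is absorbed into r3 first
```

Representativity fails here: r2 matches r3, but once r2 is merged into r12 the result no longer matches r3. The forward order therefore yields {r12, r3} and the backward order yields {r1, r23}, and neither run produces both r12 and r23.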
Swoosh Without ICAR Properties
Distributed Swoosh
[Diagram: processors P1, P2, P3 and input records r1, r2, r3, r4, r5, r6, ...]
P1: r1, r3, r4, r6, ...
P2: r1, r2, r4, r5, ...
P3: r2, r3, r5, r6, ...
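
The placement shown above gives each processor two thirds of the records while still letting every pair of records meet on at least one processor. The sketch below reproduces only that initial placement for three processors; the modulo rule is inferred from the record lists on the slide, the function names are hypothetical, and how merged records are routed afterwards is not covered.

```python
from itertools import combinations

def assign(record_ids):
    """Processor p (0, 1, 2) skips the ids with id % 3 == (p + 2) % 3."""
    return {
        f"P{p + 1}": [i for i in record_ids if i % 3 != (p + 2) % 3]
        for p in range(3)
    }

def all_pairs_covered(record_ids, placement):
    """Check that every pair of records is co-located on at least one processor."""
    return all(
        any(a in ids and b in ids for ids in placement.values())
        for a, b in combinations(record_ids, 2)
    )

ids = list(range(1, 7))          # r1 .. r6
placement = assign(ids)
covered = all_pairs_covered(ids, placement)
```

Any two records fall into at most two of the three residue classes, so the processor that skips the remaining class holds both of them.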
DSwoosh Performance
Conclusion
• ER is an old and important problem
• Our approach: generic
• Confidences
– challenging
– two ways to tame:
• thresholds
• packages
Thanks.