linking records with value diversity
DESCRIPTION
Linking Records with Value Diversity. Pei Li, University of Milan – Bicocca. Advisor: Andrea Maurino. Supervisors @ AT&T Labs - Research: Xin Luna Dong, Divesh Srivastava. October 2012. Some Statistics from DBLP: How many Wei Wang’s are there? What are their authoring histories?
![Page 1: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/1.jpg)
Linking Records with Value Diversity
Pei Li
University of Milan – Bicocca
Advisor: Andrea Maurino
Supervisors @ AT&T Labs - Research: Xin Luna Dong, Divesh Srivastava
October, 2012
![Page 2: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/2.jpg)
Some Statistics from DBLP
- How many Wei Wang’s are there?
- What are their authoring histories?
••• 2
![Page 3: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/3.jpg)
Some Statistics from YellowPages
••• 3
- Are there any business chains?
- If yes, which businesses are their members?
![Page 4: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/4.jpg)
Record Linkage
• What is record linkage (entity resolution)?
  • Input: a set of records
  • Output: clustering of records
• A critical problem in data integration and data cleaning
• “A reputation for world-class quality is profitable, a ‘business maker’.” – William E. Winkler
• Current work (surveyed in [Elmagarmid, 07], [Koudas, 06]):
  • assume that records of the same entities are consistent
  • often focus on different representations of the same value
    • e.g., “IBM” and “International Business Machines”
••• 4
![Page 5: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/5.jpg)
New Challenges
• In reality, we observe value diversity of entities
  • Values can evolve over time
    • Catholic Healthcare (1986 - 2012), Dignity Health (2012 -)
  • Different records of the same group can have “local” values
  • Some sources may provide erroneous values

| ID | Name | Address | Phone | URL |
|---|---|---|---|---|
| 001 | F.B. Insurance | Vernon 76384 TX | 877 635-4684 | txfb-ins.com |
| 002 | F.B. Insurance #1 | Lufkin 75901 TX | 936 634-7285 | txfb.org |
| 003 | F.B. Insurance #5 | Cibolo 78108 TX | 877 635-4684 | |

| ID | Name | URL | Source |
|---|---|---|---|
| 001 | Meekhof Tire Sales & Service Inc | www.meekhoftire.com | Src. 1 |
| 002 | Meekhof Tire Sales & Service Inc | www.napaautocare.com | Src. 2 |

••• 5
![Page 6: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/6.jpg)
My Goal
• To improve the linkage quality of integrated data with fairly high diversity
  • linking temporal records [VLDB ’11] [VLDB ’12 demo] [FCS Journal ’12]
  • linking records of the same group [Under preparation for SIGMOD ’13]
••• 6
![Page 7: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/7.jpg)
Outline
• Motivation
• Linking temporal records
  • Decay
  • Temporal clustering
  • Demo
• Linking records of the same group
• Related work
• Conclusions & Future work
••• 7
![Page 8: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/8.jpg)
[Figure: timeline (1991, 2004 to 2011) of twelve DBLP records]
r1: Xin Dong, R. Polytechnic Institute
r2: Xin Dong, University of Washington
r3: Xin Dong, University of Washington
r4: Xin Luna Dong, University of Washington
r5: Xin Luna Dong, AT&T Labs-Research
r6: Xin Luna Dong, AT&T Labs-Research
r7: Dong Xin, University of Illinois
r8: Dong Xin, University of Illinois
r9: Dong Xin, Microsoft Research
r10: Dong Xin, University of Illinois
r11: Dong Xin, Microsoft Research
r12: Dong Xin, Microsoft Research
- How many authors?
- What are their authoring histories?
••• 8
![Page 9: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/9.jpg)
[Figure: the same timeline of records r1 to r12 (1991, 2004 to 2011)]
- Ground truth
3 authors
••• 9
![Page 10: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/10.jpg)
[Figure: the same timeline of records r1 to r12 (1991, 2004 to 2011)]
- Solution 1: requiring high value consistency
5 authors: false negative
••• 10
![Page 11: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/11.jpg)
[Figure: the same timeline of records r1 to r12 (1991, 2004 to 2011)]
- Solution 2: matching records w. similar names
2 authors: false positive
••• 11
![Page 12: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/12.jpg)
Opportunities
| ID | Name | Affiliation | Co-authors | Year |
|---|---|---|---|---|
| r1 | Xin Dong | R. Polytechnic Institute | Wozny | 1991 |
| r2 | Xin Dong | University of Washington | Halevy, Tatarinov | 2004 |
| r7 | Dong Xin | University of Illinois | Han, Wah | 2004 |
| r3 | Xin Dong | University of Washington | Halevy | 2005 |
| r4 | Xin Luna Dong | University of Washington | Halevy, Yu | 2007 |
| r8 | Dong Xin | University of Illinois | Wah | 2007 |
| r9 | Dong Xin | Microsoft Research | Wu, Han | 2008 |
| r10 | Dong Xin | University of Illinois | Ling, He | 2009 |
| r11 | Dong Xin | Microsoft Research | Chaudhuri, Ganti | 2009 |
| r5 | Xin Luna Dong | AT&T Labs-Research | Das Sarma, Halevy | 2009 |
| r6 | Xin Luna Dong | AT&T Labs-Research | Naumann | 2010 |
| r12 | Dong Xin | Microsoft Research | He | 2011 |

Smooth transition
Seldom erratic changes
Continuity of history
••• 12
![Page 13: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/13.jpg)
Intuitions
[Table: the same records r1 to r12 as on the “Opportunities” slide]
Less penalty on different values over time
Less reward on the same value over time
Consider records in time order for clustering
••• 13
![Page 14: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/14.jpg)
Outline
• Motivation
• Linking temporal records
  • Decay
  • Temporal clustering
  • Demo
• Linking records of the same group
• Related work
• Conclusions & Future work
••• 14
![Page 15: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/15.jpg)
Disagreement Decay
• Intuition: different values over a long time are not a strong indicator of referring to different entities.
  • University of Washington (01-07)
  • AT&T Labs-Research (07-date)
• Definition (Disagreement decay)
  • Disagreement decay of attribute A over time ∆t is the probability that an entity changes its A-value within time ∆t.
••• 15
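The definition of disagreement decay can be turned into a rough empirical estimate when labeled entity histories are available. The sketch below is an illustration only, not the estimator used in the talk: the helper names `life_spans` and `disagreement_decay`, the treatment of the last (still open) value span as "partial", and the toy histories are all assumptions made for this example.

```python
def life_spans(history):
    """Split one entity's (year, value) observations into value life spans.

    Returns (full, partial): lengths of spans that ended with an observed
    change of value, and the length of the final span whose end is unknown.
    """
    history = sorted(history)
    full = []
    start, current = history[0]
    for year, value in history[1:]:
        if value != current:
            full.append(year - start)   # the value changed after this many years
            start, current = year, value
    partial = [history[-1][0] - start]  # last value: no change observed yet
    return full, partial


def disagreement_decay(histories, delta_t):
    """Estimate P(an entity changes its value within delta_t years).

    Assumed estimator: changed spans of length <= delta_t, divided by all
    changed spans plus open spans that have lasted at least delta_t.
    """
    full, partial = [], []
    for history in histories:
        f, p = life_spans(history)
        full += f
        partial += p
    changed = sum(1 for length in full if length <= delta_t)
    at_risk = len(full) + sum(1 for length in partial if length >= delta_t)
    return changed / at_risk if at_risk else 0.0


# Hypothetical affiliation histories for two entities.
histories = [
    [(2001, "UW"), (2004, "UW"), (2007, "AT&T"), (2010, "AT&T")],
    [(2004, "UI"), (2008, "MSR"), (2011, "MSR")],
]
decay_at_5 = disagreement_decay(histories, 5)
```

With these toy histories, one of the two observed affiliation changes happens within 5 years, so the estimate at ∆t=5 is one half; a longer ∆t gives a larger value, matching the intuition that disagreement becomes weaker evidence as the time gap grows.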
![Page 16: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/16.jpg)
Agreement Decay
• Intuition: the same value over a long time is not a strong indicator of referring to the same entity.
  • Adam Smith (1723-1790), Adam Smith (1965-)
• Definition (Agreement decay)
  • Agreement decay of attribute A over time ∆t is the probability that different entities share the same A-value within time ∆t.
••• 16
![Page 17: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/17.jpg)
Decay Curves
• Decay curves of address learnt from European Patent data
[Figure: disagreement decay and agreement decay curves; x-axis: ∆ Year (0 to 25), y-axis: Decay (0 to 1)]
Patent records: 1871
Real-world inventors: 359
In years: 1978 - 2003
••• 17
![Page 18: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/18.jpg)
Applying Decay
• E.g.
  • r1 <Xin Dong, Uni. of Washington, 2004>
  • r2 <Xin Dong, AT&T Labs-Research, 2009>
• No decayed similarity:
  • w(name)=w(affi.)=.5
  • sim(r1, r2)=.5*1+.5*0=.5 (Un-match)
• Decayed similarity
  • w(name, ∆t=5)=1-dagree(name, ∆t=5)=.95
  • w(affi., ∆t=5)=1-ddisagree(affi., ∆t=5)=.1
  • sim(r1, r2)=(.95*1+.1*0)/(.95+.1)=.9 (Match)
••• 18
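The arithmetic on this slide can be reproduced with a few lines of Python. This is a minimal sketch: the function name `weighted_similarity` is made up for the example, and the decay values 0.05 (agreement decay of name) and 0.9 (disagreement decay of affiliation) are simply the values implied by the weights .95 and .1 shown above.

```python
def weighted_similarity(attr_sims, weights):
    """Weighted average of per-attribute similarities."""
    total = sum(weights.values())
    return sum(weights[a] * attr_sims[a] for a in attr_sims) / total


# r1 and r2 agree on name (similarity 1) and disagree on affiliation (similarity 0).
attr_sims = {"name": 1.0, "affiliation": 0.0}

# Without decay: fixed, equal weights.
plain = weighted_similarity(attr_sims, {"name": 0.5, "affiliation": 0.5})

# With decay at delta_t = 5 years: an agreeing attribute is weighted by
# 1 - agreement decay, a disagreeing attribute by 1 - disagreement decay.
d_agree_name = 0.05
d_disagree_affiliation = 0.9
decayed = weighted_similarity(
    attr_sims,
    {"name": 1 - d_agree_name, "affiliation": 1 - d_disagree_affiliation},
)

print(round(plain, 2), round(decayed, 2))
```

The decayed weights shrink the influence of the affiliation mismatch, which is why the pair moves from an un-match at .5 to a match at about .9.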
![Page 19: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/19.jpg)
Applying Decay
••• 19
[Table: the same records r1 to r12 as on the “Opportunities” slide]
All records are merged into the same cluster!!
Able to detect changes!
![Page 20: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/20.jpg)
Decayed Similarity & Traditional Clustering
••• 20
[Figure: bar chart of F-1, Precision, and Recall (0 to 1) for PARTITION, CENTER, MERGE, and DECAY]
Decay improves recall over baselines by 23-67%
Patent records: 1871
Real-world inventors: 359
In years: 1978 - 2003
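The slides do not say how precision, recall, and F-1 are computed. A common choice for clustering results is the pairwise variant, sketched below as an assumption rather than as the measure used in these experiments. The ground-truth grouping (r1 alone, r2 to r6, r7 to r12) is read off the ordering of the later "Adjusted Binding" table and is also an assumption; the prediction is the "all records merged into one cluster" outcome mentioned on the previous slide.

```python
from itertools import combinations


def same_cluster_pairs(clusters):
    """All unordered record pairs that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}


def pairwise_prf(predicted, truth):
    """Pairwise precision, recall, and F-1 of a predicted clustering."""
    pred, gold = same_cluster_pairs(predicted), same_cluster_pairs(truth)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 1.0
    recall = tp / len(gold) if gold else 1.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1


# Assumed ground truth: three authors.
truth = [
    {"r1"},
    {"r2", "r3", "r4", "r5", "r6"},
    {"r7", "r8", "r9", "r10", "r11", "r12"},
]
# Prediction where every record ends up in a single cluster.
merged = [{f"r{i}" for i in range(1, 13)}]

precision, recall, f1 = pairwise_prf(merged, truth)
```

Merging everything finds every true pair (perfect recall) but pays for it with many wrong pairs, which is the trade-off the bar chart measures.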
![Page 21: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/21.jpg)
Outline
• Motivation
• Linking temporal records
  • Decay
  • Temporal clustering
  • Demo
• Linking records of the same group
• Related work
• Conclusions & Future work
••• 21
![Page 22: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/22.jpg)
Early Binding
• Compare a new record with existing clusters
• Make eager merging decision for each record
• Maintain the earliest/latest timestamp for its last value
••• 22
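A minimal sketch of the early-binding loop described above: records are visited in time order and each one is merged eagerly into the most similar existing cluster, or starts a new cluster. The function `early_binding`, the `threshold` parameter, the record layout, and the placeholder similarity `same_name` are assumptions for illustration; in the talk the record-cluster similarity would use decay.

```python
def early_binding(records, similarity, threshold):
    """Greedy clustering: process records by year, decide once per record."""
    clusters = []
    for record in sorted(records, key=lambda r: r["year"]):
        best, best_sim = None, threshold
        for cluster in clusters:
            s = similarity(record, cluster)
            if s >= best_sim:           # keep the most similar cluster above the threshold
                best, best_sim = cluster, s
        if best is None:
            clusters.append([record])   # no cluster is similar enough: open a new one
        else:
            best.append(record)         # eager decision, never revisited
    return clusters


def same_name(record, cluster):
    """Placeholder similarity: compare with the latest record in the cluster."""
    return 1.0 if record["name"] == cluster[-1]["name"] else 0.0


records = [
    {"id": "r1", "name": "Xin Dong", "year": 1991},
    {"id": "r2", "name": "Xin Dong", "year": 2004},
    {"id": "r7", "name": "Dong Xin", "year": 2004},
    {"id": "r8", "name": "Dong Xin", "year": 2007},
]
clusters = early_binding(records, same_name, threshold=0.5)
cluster_ids = [[r["id"] for r in c] for c in clusters]
```

Because each decision is final, a wrong early merge stays in the result, which is the weakness the next slides address.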
![Page 23: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/23.jpg)
Early Binding

| ID | Name | Affiliation | Co-authors | From | To |
|---|---|---|---|---|---|
| r2 | Xin Dong | Univ. of Washington | Halevy, Tatarinov | 2004 | 2004 |
| r3 | Xin Dong | Univ. of Washington | Halevy | 2004 | 2005 |
| r1 | Xin Dong | R. P. Institute | Wozny | 1991 | 1991 |
| r7 | Dong Xin | University of Illinois | Han, Wah | 2004 | 2004 |
| r8 | Dong Xin | University of Illinois | Wah | 2004 | 2007 |
| r4 | Xin Luna Dong | Univ. of Washington | Halevy, Yu | 2004 | 2007 |
| r9 | Dong Xin | Microsoft Research | Wu, Han | 2008 | 2008 |
| r10 | Dong Xin | University of Illinois | Ling, He | 2009 | 2009 |
| r5 | Xin Luna Dong | AT&T Labs-Research | Das Sarma, Halevy | 2009 | 2009 |
| r11 | Dong Xin | Microsoft Research | Chaudhuri, Ganti | 2008 | 2009 |
| r6 | Xin Luna Dong | AT&T Labs-Research | Naumann | 2009 | 2010 |
| r12 | Dong Xin | Microsoft Research | He | 2008 | 2011 |

Clusters shown: C1, C2, C3
Earlier mistakes prevent later merging!!
Avoid a lot of false positives!
••• 23
![Page 24: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/24.jpg)
Adjusted Binding
• Compare earlier records with clusters created later
• Proceed in EM-style
  1. Initialization: Start with the result of initialized clustering
  2. Estimation: Compute record-cluster similarity
  3. Maximization: Choose the optimal clustering
  4. Termination: Repeat until the results converge or oscillate
••• 24
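The four steps can be sketched as a simple reassignment loop. This is a simplified illustration under stated assumptions, not the algorithm from the talk: "choose the optimal clustering" is approximated by moving each record to its single most similar cluster, oscillation is detected by remembering earlier assignments, and `toy_similarity`, `RECORDS`, and `max_rounds` are hypothetical. The toy similarity multiplies a consistency term by a continuity term in the spirit of sim(r, C) = cont(r, C) * cons(r, C), but both terms are invented here.

```python
RECORDS = {  # hypothetical records: id -> (affiliation, year)
    "a": ("UI", 2004), "b": ("UI", 2007), "c": ("MSR", 2008),
    "d": ("UI", 2009), "e": ("MSR", 2009),
}


def toy_similarity(rec, members):
    """Invented record-cluster similarity: continuity * consistency."""
    if not members:
        return 0.0
    affiliation, year = RECORDS[rec]
    cons = sum(RECORDS[m][0] == affiliation for m in members) / len(members)
    cont = 1.0 / (1 + min(abs(RECORDS[m][1] - year) for m in members))
    return cont * cons


def adjusted_binding(initial, similarity, max_rounds=20):
    """Iteratively reassign records until the assignment repeats."""
    assignment = dict(initial)                      # 1. initialization
    seen = {frozenset(assignment.items())}
    for _ in range(max_rounds):
        cluster_ids = sorted(set(assignment.values()))
        updated = {}
        for rec in sorted(assignment):
            def score(cid):                         # 2. estimation
                members = [r for r, c in assignment.items() if c == cid and r != rec]
                return similarity(rec, members)
            updated[rec] = max(cluster_ids, key=score)  # 3. maximization (simplified)
        assignment = updated
        state = frozenset(assignment.items())
        if state in seen:                           # 4. converged or oscillating
            break
        seen.add(state)
    return assignment


# Record "d" starts in the wrong cluster and is moved in the first round.
initial = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1}
final = adjusted_binding(initial, toy_similarity)
```

Unlike early binding, a record placed in the wrong cluster at the start can still move later, once clusters created afterwards offer a better fit.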
![Page 25: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/25.jpg)
Adjusted Binding
• Compute similarity by
  • Consistency: consistency in evolution of values
  • Continuity: continuity of records in time
[Figure: four cases (Case 1 to Case 4) of where the record time stamp r.t falls relative to the cluster time stamps C.early and C.late]
sim(r, C)=cont(r, C)*cons(r, C)
••• 25
![Page 26: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/26.jpg)
••• 26
Adjusted Binding
r7: DongXin@UI, 2004
r8: DongXin@UI, 2007
r9: DongXin@MSR, 2008
r10: DongXin@UI, 2009
r11: DongXin@MSR, 2009
r12: DongXin@MSR, 2011
Clusters shown: C3, C4, C5
• r10 has higher continuity with C4
• r8 has higher continuity with C4
• Once r8 is merged to C4, r7 has higher continuity with C4
![Page 27: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/27.jpg)
Adjusted Binding
[Table: records r1 to r12 with Name, Affiliation, Co-authors, and Year, listed in the order r1, r2 to r6, r7 to r12, with cluster labels C1, C2, C3]
Correctly cluster all records
••• 27
![Page 28: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/28.jpg)
Temporal Clustering
••• 28
Patent records: 1871
Real-world inventors: 359
In years: 1978 - 2003
[Figure: bar chart of F-1, Precision, and Recall (0 to 1) for PARTITION, CENTER, MERGE, DECAY, ADJUST, and FULL ALGO.]
Full algorithm has the best result
Adjusted Clustering improves recall without reducing precision much
![Page 29: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/29.jpg)
Experimental Results
• Data sets:

| Data set | #Records | #Entities | Years |
|---|---|---|---|
| Patent | 1871 | 359 | 1978-2003 |
| DBLP-XD | 72 | 8 | 1991-2010 |
| DBLP-WW | 738 | 18+potpourri | 1992-2011 |

[Figure (a): Results of XD data. F-1, Precision, and Recall (0 to 1) for PARTITION, CENTER, MERGE, and FULL ALGO.]
[Figure (b): Results of WW data. F-1, Precision, and Recall (0 to 1) for PARTITION, CENTER, MERGE, and FULL ALGO.]
••• 29
![Page 30: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/30.jpg)
Demonstration
• CHRONOS: Facilitating History Discovery by Linking Temporal Records
••• ITIS Lab ••• http://www.itis.disco.unimib.it ••• 30
![Page 31: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/31.jpg)
Outline
• Motivation
• Linking temporal records
  • Decay
  • Temporal clustering
  • Demo
• Linking records of the same group
• Related work
• Conclusions & Future work
••• 31
![Page 32: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/32.jpg)
32
- Are there any business chains?
- If yes, which businesses are their members?
![Page 33: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/33.jpg)
33
-Ground Truth
2 chains
![Page 34: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/34.jpg)
34
- Solution 1: require high value consistency
0 chains
![Page 35: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/35.jpg)
35
- Solution 2: match records w. same name
1 chain
![Page 36: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/36.jpg)
Challenges
| ID | name | phone | state | URL domain |
|---|---|---|---|---|
| r1 | Taco Casa | | AL | tacocasa.com |
| r2 | Taco Casa | 900 | AL | tacocasa.com |
| r3 | Taco Casa | 900 | AL | tacocasa.com, tacocasatexas.com |
| r4 | Taco Casa | 900 | AL | |
| r5 | Taco Casa | 900 | AL | |
| r6 | Taco Casa | 701 | TX | tacocasatexas.com |
| r7 | Taco Casa | 702 | TX | tacocasatexas.com |
| r8 | Taco Casa | 703 | TX | tacocasatexas.com |
| r9 | Taco Casa | 704 | TX | |
| r10 | Elva’s Taco Casa | | TX | tacodemar.com |

Erroneous values
Different local values
Scalability: 6.8M Records
••• 36
![Page 37: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/37.jpg)
Two-Stage Linkage – Stage I
• Stage I: Identify cores containing listings very likely to belong to the same chain
  • Require strong robustness in presence of possibly erroneous values (graph theory)
  • High scalability
••• 37
[Table: the same ten Taco Casa records r1 to r10 as on the “Challenges” slide]
![Page 38: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/38.jpg)
Two-Stage Linkage – Stage II
• Stage II: Cluster cores and remaining records into chains.
  • Collect strong evidence from cores and leverage in clustering
  • No penalty on local values
••• 38
[Table: the same ten Taco Casa records r1 to r10 as on the “Challenges” slide]
Reward strong evidence
![Page 39: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/39.jpg)
Two-Stage Linkage – Stage II
• Stage II: Cluster cores and remaining records into chains.
  • Collect strong evidence from cores and leverage in clustering
  • No penalty on local values
••• 39
[Table: the same ten Taco Casa records r1 to r10 as on the “Challenges” slide]
Reward strong evidence
![Page 40: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/40.jpg)
Two-Stage Linkage – Stage II
• Stage II: Cluster cores and remaining records into chains.
  • Collect strong evidence from cores and leverage in clustering
  • No penalty on local values
••• 40
[Table: the same ten Taco Casa records r1 to r10 as on the “Challenges” slide]
Apply weak evidence
![Page 41: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/41.jpg)
Two-Stage Linkage – Stage II
• Stage II: Cluster cores and remaining records into chains.
  • Collect strong evidence from cores and leverage in clustering
  • No penalty on local values
••• 41
[Table: the same ten Taco Casa records r1 to r10 as on the “Challenges” slide]
No penalty on local values
![Page 42: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/42.jpg)
Experimental Evaluation
• Data set
  • 6.8M records from YellowPages.com
• Effectiveness:
  • Precision / Recall / F-measure (avg.): .96 / .96 / .96
• Efficiency:
  • 6.9 hrs for single-machine solution
  • 40 mins for Hadoop solution
• 80K chains and 1M records in chains
••• 42

| Chain name | # Stores |
|---|---|
| USPS - United States Post Office | 12,776 |
| SUBWAY | 11,278 |
| State Farm Insurance | 8,711 |
| McDonald's | 7,450 |
| Edward Jones | 6,781 |
![Page 43: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/43.jpg)
Experimental Evaluation II
••• ITIS Lab ••• http://www.itis.disco.unimib.it ••• 43
| Sample | #Records | #Chains | Chain size | #Single-biz records |
|---|---|---|---|---|
| Random | 2062 | 30 | [2, 308] | 503 |
| AI | 2446 | 1 | 2446 | 0 |
| UB | 322 | 7 | [2, 275] | 5 |
| FBIns | 1149 | 14 | [33, 269] | 0 |
![Page 44: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/44.jpg)
Related Work
• Record similarity:
  • Probabilistic linkage
    • Classification-based approaches: classify records by a probabilistic model [Fellegi, ’69]
  • Deterministic linkage
    • Distance-based approaches: apply a distance metric to compute the similarity of each attribute, and take the weighted sum as record similarity [Dey,08]
    • Rule-based approaches: apply domain knowledge to match records [Hernandez,98]
• Record clustering
  • Transitive rule [Hernandez,98]
  • Optimization problem [Wijaya,09]
  • …
••• 44
![Page 45: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/45.jpg)
Conclusions
• In some applications, record linkage needs to be tolerant of value diversity
• When linking temporal records, time decay allows tolerance of evolving values
• When linking group members, two-stage linkage leverages strong evidence and tolerates different local values
••• 45
![Page 46: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/46.jpg)
Future Work
••• 46
Data Integration
Temporal Database
Data Quality
![Page 47: Linking Records with Value Diversity](https://reader035.vdocuments.us/reader035/viewer/2022062814/56816855550346895dde6f8b/html5/thumbnails/47.jpg)
Thanks!
••• 47