Record Linkage: A 10-Year Retrospective
Chen Li and Sharad Mehrotra
UC Irvine
Efficient Record Linkage in Large Data Sets
Liang Jin, Chen Li, Sharad Mehrotra
University of California, Irvine
DASFAA, Kyoto, Japan, March 2003
How was the paper written?
Two faculty members working in different areas, plus a 1st-year PhD student
Chen’s Story: 2001 …
Data Integration Problems?
Talking to medical doctors…
Example
Table R
Name           SSN           Addr
Jack Lemmon    430-871-8294  Maple St
Harrison Ford  292-918-2913  Culver Blvd
Tom Hanks      234-762-1234  Main St
…              …             …
Table S
Name           SSN           Addr
Ton Hanks      234-162-1234  Main Street
Kevin Spacey   928-184-2813  Frost Blvd
Jack Lemon     430-817-8294  Maple Street
…              …             …
Q: Find records from different datasets that could be the same entity
Sharad’s research
Liang’s story
1st-year PhD student at UC Irvine
Challenges
How to define good similarity functions?
How to do matching efficiently?
Nested-loop?
Not desirable for large data sets
5 hours for 30K strings!
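To make the cost concrete, here is a minimal sketch of a nested-loop similarity join, assuming edit distance as the similarity function and a threshold `k`. The function names and the sample names (taken from the example tables) are illustrative choices, not code from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def nested_loop_join(r, s, k):
    """Compare every string in r with every string in s.

    This makes len(r) * len(s) distance computations, which is what
    makes the approach impractical for large inputs.
    """
    return [(x, y) for x in r for y in s if edit_distance(x, y) <= k]


names_r = ["Jack Lemmon", "Harrison Ford", "Tom Hanks"]
names_s = ["Ton Hanks", "Kevin Spacey", "Jack Lemon"]
matches = nested_loop_join(names_r, names_s, k=1)
print(matches)
```

The number of distance computations grows with the product of the two input sizes, and each computation is itself quadratic in the string lengths.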
Our 2-step approach
Step 1: map strings (in a metric space) to objects in a Euclidean space
Step 2: do a similarity join in the Euclidean space
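The two steps can be sketched as follows. This is an illustration under stated assumptions, not the mapping used in the paper: it swaps in a simple pivot-based embedding (each coordinate is the edit distance to a hypothetical pivot string), uses a brute-force scan for the join where a real system would use a multidimensional index, and verifies surviving pairs with edit distance. The names `embed`, `two_step_join`, and the pivot choice are all assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def embed(s, pivots):
    """Step 1 (assumed mapping): one coordinate per pivot string."""
    return [edit_distance(s, p) for p in pivots]


def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def two_step_join(r, s, k, pivots):
    vec_r = {x: embed(x, pivots) for x in r}
    vec_s = {y: embed(y, pivots) for y in s}
    # By the triangle inequality, strings within edit distance k differ by
    # at most k in every coordinate, so their squared Euclidean distance is
    # at most d * k^2. Integer arithmetic keeps the comparison exact.
    bound = len(pivots) * k * k
    # Step 2: similarity join in the Euclidean space (brute force here).
    candidates = [(x, y) for x in r for y in s
                  if squared_euclidean(vec_r[x], vec_s[y]) <= bound]
    # Verify candidates with the original metric to drop false positives.
    return [(x, y) for x, y in candidates if edit_distance(x, y) <= k]


names_r = ["Jack Lemmon", "Harrison Ford", "Tom Hanks"]
names_s = ["Ton Hanks", "Kevin Spacey", "Jack Lemon"]
pivots = ["Jack Lemmon", "Tom Hanks"]  # hypothetical pivot choice
result = two_step_join(names_r, names_s, k=1, pivots=pivots)
print(result)
```

With this particular embedding the Euclidean filter never discards a true match, so the output equals the nested-loop result; other mappings may trade that guarantee for better distance preservation.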
Advantages
Applicable to many metric similarity functions
- E.g.: Edit distance
Open to existing algorithms
- Mapping techniques
- Join techniques
Step 1
Map strings into a high-dimensional Euclidean space
[Figure labels: Metric Space, Euclidean Space]
Can it preserve distances?
Use data set 1 (54K names) as an example
k=2, d=20
- Use k’=5.2 to differentiate similar and dissimilar pairs.
Multi-attribute linkage
Example: title + name + year
Different attributes have different similarity functions and thresholds
Consider merge rules in disjunctive format:
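The slide's actual merge rule is not preserved in this transcript, so the sketch below uses a made-up rule purely to show the shape of a disjunctive rule: a list of conjunctions, each giving per-attribute distance thresholds, where two records match if any conjunction holds. The thresholds, the `year_gap` function, and the sample records are all hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def year_gap(a: int, b: int) -> int:
    return abs(a - b)


# Each attribute gets its own distance function (assumed choices).
DISTANCE = {"title": edit_distance, "name": edit_distance, "year": year_gap}

# Hypothetical rule: (title <= 2 AND name <= 1) OR (title <= 1 AND year <= 0)
RULE = [
    {"title": 2, "name": 1},
    {"title": 1, "year": 0},
]


def satisfies(rec1, rec2, rule):
    """True if at least one conjunction has all its thresholds met."""
    return any(
        all(DISTANCE[attr](rec1[attr], rec2[attr]) <= limit
            for attr, limit in conjunction.items())
        for conjunction in rule
    )


# Made-up sample records for illustration only.
a = {"title": "Efficient Record Linkage", "name": "Chen Li", "year": 2003}
b = {"title": "Eficient Record Linkage", "name": "Chen Lee", "year": 2003}
c = {"title": "Query Processing", "name": "Chen Li", "year": 2003}

print(satisfies(a, b, RULE))  # second conjunction holds
print(satisfies(a, c, RULE))  # titles are too far apart for either conjunction
```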
Secret of the paper …
Work since then …
Chen: efficiency
Sharad: quality
Chen’s Work on Efficiency
Gram-based algorithms
- Indexing
- Selection algorithms
- Join algorithms
- Variable-length grams
- Selectivity estimation
Trie-based algorithms
- Instant search
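As a rough illustration of the gram-based idea, the sketch below builds an inverted index on fixed-length q-grams and answers an approximate selection query with the standard count filter: two strings within edit distance k share at least max(|s|, |t|) - q + 1 - k*q grams. It is a simplified sketch, not the algorithms from the work listed above; the function names are assumptions, and it still loops over all string ids after probing the index so that it stays correct when the bound drops to zero or below.

```python
from collections import Counter, defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def grams(s, q):
    """All overlapping substrings of length q (no padding)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q):
    """Inverted index: gram -> list of (string id, occurrences in that string)."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g, count in Counter(grams(s, q)).items():
            index[g].append((sid, count))
    return index


def search(query, strings, index, q, k):
    # Probe the index to count grams shared with each string (multiset overlap).
    overlap = Counter()
    for g, count in Counter(grams(query, q)).items():
        for sid, other in index.get(g, ()):
            overlap[sid] += min(count, other)
    results = []
    for sid, s in enumerate(strings):
        needed = max(len(query), len(s)) - q + 1 - k * q  # count-filter bound
        # Only strings that pass the filter pay for an edit-distance call.
        if overlap[sid] >= needed and edit_distance(query, s) <= k:
            results.append(s)
    return results


names = ["Jack Lemmon", "Harrison Ford", "Tom Hanks"]
index = build_index(names, q=2)
print(search("Ton Hanks", names, index, q=2, k=1))
```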
Follow-up work in the community
Significant amount of work on approximate string queries
- Selection
- Join
Make an impact?
UCI People Search
Psearch (2008): 2 stories
Fuzzy search
Research commercialization
Lesson learned: hands-on experience is important!