![Page 1: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/1.jpg)
Class Website
CX4242:
Data Integration
Mahdi Roozbahani
Lecturer, Computational Science and
Engineering, Georgia Tech
![Page 2: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/2.jpg)
What is Data Integration?
Combining data from multiple sources to provide the user with a unified view.
Why is it Important?
Think about the apps, websites, and services that you use every day.
![Page 3: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/3.jpg)
Businesses derive value
through data integration.
![Page 4: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/4.jpg)
![Page 5: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/5.jpg)
![Page 6: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/6.jpg)
![Page 7: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/7.jpg)
More Examples?
• Social media (data from users, businesses)
• Facebook: your posts, advertisements, reviews
• Search engines: Google, Bing, Yahoo, etc.
• Smart assistants: Siri, Cortana, Alexa, Google
• Price comparison: Kayak
• Uber, Lyft: drivers, traffic data, customers
• Google Maps: users, restaurants, traffic, …
![Page 8: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/8.jpg)
How to do data integration?
![Page 9: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/9.jpg)
“Low” Effort Approaches
1. Use a database’s “Join”! (e.g., SQLite)
When does this approach work? (Or, when does it NOT work?)
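
Joined table:

| id | name | salary |
|-----|---------|------|
| 111 | Smith | $40k |
| 222 | Johnson | $60k |
| 333 | Lee | $50k |

Source tables:

| id | name |
|-----|---------|
| 111 | Smith |
| 222 | Johnson |
| 333 | Lee |

| id | salary |
|-----|------|
| 111 | $40k |
| 222 | $60k |
| 333 | $50k |
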
2. OpenRefine: http://openrefine.org (Video #3 “Reconcile and Match Data”)
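
The join in approach 1 can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not part of the original slides; the table names `names` and `salaries` are hypothetical, and the rows are the ones shown in the example tables.

```python
import sqlite3

# In-memory database holding the two source tables from the example.
# The table names "names" and "salaries" are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE salaries (id INTEGER PRIMARY KEY, salary TEXT)")
conn.executemany(
    "INSERT INTO names VALUES (?, ?)",
    [(111, "Smith"), (222, "Johnson"), (333, "Lee")],
)
conn.executemany(
    "INSERT INTO salaries VALUES (?, ?)",
    [(111, "$40k"), (222, "$60k"), (333, "$50k")],
)

# An inner join keeps only the ids that appear in both tables.
rows = conn.execute(
    "SELECT n.id, n.name, s.salary "
    "FROM names AS n JOIN salaries AS s ON n.id = s.id "
    "ORDER BY n.id"
).fetchall()

for row in rows:
    print(row)
```

The join works here because both tables share the same clean `id` column. A record whose id is missing from one table, or written differently in it, would silently drop out of an inner join.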
![Page 10: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/10.jpg)
IDs are really important, and
can simplify data integration!
But who creates the IDs?
![Page 11: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/11.jpg)
Crowd-sourcing Approaches: Freebase
Freebase intro video: https://youtu.be/TJfrNo3Z-DU
Learn more about Freebase at https://en.wikipedia.org/wiki/Freebase
![Page 12: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/12.jpg)
Freebase (a graph of entities)
“…a large collaborative knowledge base consisting of metadata composed mainly by its community members…”
(Wikipedia)
![Page 13: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/13.jpg)
So what? What can you do with the
Freebase knowledge graph?
Hint: Google acquired it in 2010.
![Page 14: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/14.jpg)
Google Knowledge Graph video: https://youtu.be/mmQl6VGvX-c
![Page 15: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/15.jpg)
Freebase replaced by Google Knowledge Graph API
Example: What does Google know about Taylor Swift?
https://developers.google.com/knowledge-graph/
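
As a rough sketch of what such a lookup might look like, the snippet below builds the request URL for the Knowledge Graph Search API's entity search endpoint. It is not from the original slides: the helper name `build_kg_url` is hypothetical, `YOUR_API_KEY` is a placeholder, and no request is actually sent.

```python
from urllib.parse import urlencode

# Entity search endpoint of the Knowledge Graph Search API.
KG_ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"


def build_kg_url(query: str, api_key: str, limit: int = 1) -> str:
    """Return the GET URL for an entity search (hypothetical helper)."""
    params = {"query": query, "key": api_key, "limit": limit}
    return f"{KG_ENDPOINT}?{urlencode(params)}"


# YOUR_API_KEY is a placeholder; a real key is needed to send the request.
url = build_kg_url("Taylor Swift", api_key="YOUR_API_KEY")
print(url)
```

Fetching this URL with a valid key should return a JSON list of matching entities, each with its own identifier, which is the kind of shared ID that makes integration easier.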
![Page 16: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/16.jpg)
What does Google know about Taylor Swift?
https://developers.google.com/knowledge-graph/
![Page 17: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/17.jpg)
What if we don’t have the luxury of having IDs?
(Screenshot from Freebase video)
A common problem in academia:
Polo Chau
Duen Horng Chau
Duen Chau
D. Chau
![Page 18: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/18.jpg)
Then you need to do…
Entity Resolution (A hard problem in data integration)
![Page 19: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/19.jpg)
Why is entity resolution
so difficult?
Let’s understand it through
shopping for an iPhone on
Apple, Amazon and eBay
![Page 20: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/20.jpg)
![Page 21: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/21.jpg)
![Page 22: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/22.jpg)
![Page 23: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/23.jpg)
D-Dupe: Interactive Data Deduplication and Integration
TVCG 2008
University of Maryland
Bilgic, Licamele, Getoor, Kang, Shneiderman
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4035746
https://linqspub.soe.ucsc.edu/basilic/web/Publications/2006/bilgic:vast06/
![Page 24: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/24.jpg)
![Page 25: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/25.jpg)
[Figure: nodes labeled Mahdi, Madhi, Alice, Bob, Carol, Dave]
![Page 26: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/26.jpg)
Core components: Similarity functions
Determine how similar two entities are.
D-Dupe’s approach:
Similarity score for a pair of entities = attribute similarity + relational similarity
![Page 27: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/27.jpg)
Attribute similarity (a weighted sum)
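
The idea of a weighted-sum attribute similarity, combined with a relational similarity, can be sketched as follows. This is an illustration under stated assumptions, not D-Dupe's actual implementation: the per-attribute measure (`difflib.SequenceMatcher`), the weights, the attribute names, and the co-author sets are all made up for the example.

```python
from difflib import SequenceMatcher


def string_sim(a: str, b: str) -> float:
    """Similarity in [0, 1] from difflib (a stand-in, not D-Dupe's measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def attribute_similarity(e1: dict, e2: dict, weights: dict) -> float:
    """Weighted sum of per-attribute similarities, normalised by total weight."""
    total = sum(weights.values())
    weighted = sum(w * string_sim(e1[attr], e2[attr]) for attr, w in weights.items())
    return weighted / total


def relational_similarity(n1: set, n2: set) -> float:
    """Jaccard overlap of the two entities' neighbor sets (0.0 if both empty)."""
    union = n1 | n2
    return len(n1 & n2) / len(union) if union else 0.0


# Hypothetical records and weights, chosen only for illustration.
a = {"name": "Mahdi", "affiliation": "Georgia Tech"}
b = {"name": "Madhi", "affiliation": "Georgia Tech"}
weights = {"name": 0.7, "affiliation": 0.3}

attr = attribute_similarity(a, b, weights)
rel = relational_similarity({"Alice", "Bob", "Carol"}, {"Bob", "Carol", "Dave"})
score = attr + rel  # combined similarity score for the pair
print(attr, rel, score)
```

Two records with slightly different names but many shared neighbors score higher than the names alone would suggest, which is the motivation for adding the relational term.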
![Page 28: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/28.jpg)
Properties of Similarity Function
![Page 29: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/29.jpg)
Distance Functions for Vectors
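
Two common distance functions for vectors can be written in a few lines of Python. This is a sketch: the slide's own formulas did not survive extraction, so the choice of Euclidean and Manhattan distance here is an assumption.

```python
import math


def euclidean(x, y):
    """L2 distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """L1 (taxicab) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


x, y = (0.0, 0.0), (3.0, 4.0)
print(euclidean(x, y))  # 5.0
print(manhattan(x, y))  # 7.0
```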
![Page 30: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/30.jpg)
Example
![Page 31: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/31.jpg)
Some problems with Euclidean distance
[Figure: points x, y, and z, with radius r]
d(x,y) and d(x,z)?
Curse of dimensionality
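
One way to see the curse of dimensionality is to measure how the nearest and farthest distances from a query point converge as the dimension grows. The experiment below is an illustrative sketch, not taken from the slides; the function name `distance_contrast`, the uniform random data, the point count, and the seed are all assumptions.

```python
import math
import random


def distance_contrast(dim: int, n_points: int = 200, seed: int = 0) -> float:
    """Ratio of nearest to farthest Euclidean distance from one query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return min(dists) / max(dists)


low = distance_contrast(2)      # 2 dimensions
high = distance_contrast(1000)  # 1000 dimensions
print(low, high)
```

A ratio close to 1 means the nearest and farthest points are almost equally far away, so Euclidean distance says little about which points are actually similar.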
![Page 32: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/32.jpg)
Numerous similarity functions
• Euclidean distance (Euclidean norm / L2 norm)
• Taxicab/Manhattan distance
• Jaccard similarity (e.g., used with w-shingles), e.g., overlap of nodes’ neighbors
• String edit distance, e.g., “Mahdi” vs “Madhi”
Excellent read: http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf
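
The two set- and string-based measures in the list above can be sketched in plain Python. This is a minimal illustration: word-level shingles and the Levenshtein variant of edit distance (insert, delete, substitute) are assumptions about which variants are meant, and the example sentences are made up.

```python
def shingles(text: str, w: int = 2) -> set:
    """Set of w-shingles: every run of w consecutive words."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection over size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


doc_sim = jaccard(shingles("the quick brown fox"), shingles("the quick brown dog"))
name_dist = edit_distance("Mahdi", "Madhi")
print(doc_sim)    # 0.5
print(name_dist)  # 2
```

Plain Levenshtein distance counts the swapped "hd"/"dh" as two edits. A variant that allows transpositions would count it as one.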
![Page 33: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/33.jpg)
https://reference.wolfram.com/language/guide/DistanceAndSimilarityMeasures.html
![Page 34: Class Website CX4242 - poloclub.github.io … · Class Website CX4242: Data Integration Mahdi Roozbahani Lecturer, Computational Science and Engineering, Georgia Tech](https://reader034.vdocuments.us/reader034/viewer/2022042922/5f6d77cdb58af3515f38fbc9/html5/thumbnails/34.jpg)
Excellent Tutorial on Entity Resolution
http://www.umiacs.umd.edu/~getoor/Tutorials/ER_KDD2013.pdf
by Lise Getoor and Ashwin Machanavajjhala