dataengconf sf16 - entity resolution in data pipelines using spark
TRANSCRIPT
http://www.umiacs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf
Entity Resolution
Talk StructureLayer 1: Naive ER
Layer 2: Graphical ER
Layer 3: Big Data ER
Layer 4: Temporal ER
Layer 5: Learned ER
Naive ER
Entity Resolution
ID Name Website GeoA Facebook facebook.com MenloPark,CAB FB facebook.com CAC Joe'sCookies joescookies.com SanFrancisco,CA
Suppose we have the following data:
Entity ResolutionSuppose we have the following data:
ID Name Website GeoA Facebook facebook.com MenloPark,CAB FB facebook.com CAC Joe'sCookies joescookies.com SanFrancisco,CAD JoesCookies facebook.com SanFrancisco,CA
Entity ResolutionSuppose we have the following data:
ID Name Website GeoA Facebook facebook.com MenloPark,CAB FB facebook.com CAC Joe'sCookies joescookies.com SanFrancisco,CAD JoesCookies facebook.com SanFrancisco,CAE JoesCookies NULL NewYork,NY
Fundamental Concept
Match entities on the similarity of their properties
Example: Company Similarity
Example: Company Similarity
Problems
• What about when match arity != 2
• Entities can’t duplicate across matches
• O(N^2) isn’t great either
Graphical ER
Think Like a Graph
A B
E C
D
ID Name Website Geo
A Facebook facebook.com MenloPark,CA
B FB facebook.com CA
C Joe'sCookies joescookies.com SanFrancisco,CA
D JoesCookies facebook.com SanFrancisco,CA
E JoesCookies NULL NewYork,NY
Think Like a Graph
A B
E C
D
150
50
-100 -100
50 50
50 50
-150-150
Think Like a Graph
A B
E C
D
150
50
-100 -100
50 50
50 50
-150-150
Key Concept: Cliques
Think Like a Clique
A B
E C
D
150
50
-100 -100
50 50
50 50
-150-150
{A}{B}{C}{D}{E}{E, A}{E, B}{E, C}{E, D}{A, B}{A, C}{A, D}{B, C}{B, D}{C, D}{E, A, B}{E, A, C}{E, A, D}{E, B, C}{E, B, D}{E, C, D}{A, B, C}{A, B, D}{A, C, D}{B, C, D}{E, A, B, C}{E, A, B, D}{E, A, C, D}{E, B, C, D}{A, B, C, D}{E, A, B, C, D}
possible cliques =>
Recurring Theme:Powerset
Scoring Cliques
from above
Overlapping Cliques
A B
E C
D
A B
E C
D
A = 0.75 B = 0.55
Overlapping Cliques
An entity can’t belong to more than one clique.
When we choose a clique, we must ensure no other cliques
use any of those entities
Clique Choosing
Clique Choosing
Recap• Given a dataset of entities…
• Take the powerset of those entities => every possible clique
• Score all the cliques
• In sorted order, choose the best cliques when no elements have been touched
ER on Bigger Data
• Get potential matches on the same machine
• Avoid using powerset(n) for large n
Challenges
Locality-Sensitive Hashing (LSH)
Basic Idea: Use Map Reduce to get likely matches onto the same machines
“Johnathon”
“Sequoia Capital, LLC”
[37.773972, -122.431297]
“John”
“Sequoia”
[37.73, -122.43]
“app.example.com” “example.com”
Locality-Sensitive Hashing
Locality-Sensitive Hashing
Problems
• What if our entities have missing properties?
Locality-Sensitive Hashing
Joe’s CookiesJoe’s Cookie’sjoescookies.com joescookies.com
A B C
“Joe Cookie” “Joe Cookie” “”
LSH on “name”
Multilevel LSH
• Basic Idea: Use LSH multiple times on converging cliques
Joe’s CookiesJoe’s Cookie’sjoescookies.com joescookies.com
A B C
“Joe Cookie” “Joe Cookie” “”
LSN on “name”
Joe’s Cookie’sjoescookies.com joescookies.com
Clique #3
Clique #2
“joescookies.com” “joescookies.com”
LSN on “website”
Clique #1
Clique Choosing• We now have all potential cliques, spread across
the cluster
• We now need to choose the best cliques?
• Remember: But choosing one clique invalidates others
• Fundamentally a Serial Algorithm!
Clique ChoosingRDD[T].toLocalIterator() : Iterator[T]
• Produces an iterator on the Driver that seamlessly iterates every partition
Clique Choosing
Clique Choosing
uh oh
Challenge
• We need to keep track of which entities we’ve “touched”
• But using a HashSet means we will start eating a lot memory
Primer: Bloom Filters
BloomFilter { def mightContain(T obj) def put(T obj)}
example: 1 MB @ 0.5% error => 130 KB
Clique Choosing w/ Bloom Filters
Clique Choosing w/ Bloom Filters
Recap
• Challenge: Get data to the right machine. Solution: Use Locality-Sensitive-Hashing
• Challenge: Choose the best cliques. Solution: Use serial iterator and bloom-filters to keep memory low
Temporal ER
Temporal Entity Resolution
T1 T2
Ms Sally Smith Mrs Sally Doe
thefacebook.com facebook.com
Zen Payroll Gusto
Temporal Entity Resolution
A B
Zen Payrollzenpayroll.com
+100
C
Zen Payroll <=> Gusto zenpayroll.com <=> gusto.com
Gustogusto.com
+100
-1000
Iterative Poison Pills
• Basic Idea: Use ER techniques we’ve already established
• Introduce “poison pills” that can break up cliques if temporal properties don’t match
• Iteratively use the poison pills to match on increasingly temporally-aware entities
gusto.com (Payroll)2016
Perform Regular ER
gusto.com (Travel)2010
gusto.com
< 2015
gusto.com zenpayroll.com
> 2015
zenpayroll.com(Payroll)2014
A B C D E
A, C, D, E B, E
Kick Out Entities ThatDon’t Match TemporalRequirements
A, Dgusto.com < 2015
B, Egusto.com > 2015zenpayroll < 2014
C, Egusto,2016
Perform Regular ER(now with more temporal fields available)
A, C, D B, C, E
Temporal Poison Pills
Temporal Entity Resolution
• Very Computational Expensive
• Requires Significant Tuning & Tweaking to Keep Tractable
• Considered one of the Holy Grails of ER
Learned ER
Recap
• Gorilla in the room: All of our scoring has been manual
Supervised Learning ER
• Basic Idea: Use a training set to learn the weights in our scoring functions
• Disclaimer: Only proceed with this if you have very complex scoring properties
Supervised Learning ER
Supervised Learning ER
More Learning Opts
• Gradient Descent: What if we viewed the system as having overall “error”? We can then use Gradient Descent to find optimal solution.
• Very very computationally intense
Questions?Thanks!