dataengconf sf16 - entity resolution in data pipelines using spark

Slides: http://www.jakequist.com/go/dataengconf

http://www.umiacs.umd.edu/~getoor/Tutorials/ER_VLDB2012.pdf

Entity Resolution

Talk Structure

Layer 1: Naive ER

Layer 2: Graphical ER

Layer 3: Big Data ER

Layer 4: Temporal ER

Layer 5: Learned ER

Naive ER

Entity Resolution

Suppose we have the following data:

ID  Name           Website          Geo
A   Facebook       facebook.com     Menlo Park, CA
B   FB             facebook.com     CA
C   Joe's Cookies  joescookies.com  San Francisco, CA
D   Joes Cookies   facebook.com     San Francisco, CA
E   Joes Cookies   NULL             New York, NY

Fundamental Concept

Match entities on the similarity of their properties

Example: Company Similarity
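A minimal sketch of what a pairwise company similarity function could look like. The field names, the `normalize` helper and every weight below are assumptions chosen for illustration; they are not the scoring used in the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Company:
    id: str
    name: str
    website: Optional[str]
    geo: Optional[str]


def normalize(name: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def similarity(a: Company, b: Company) -> int:
    """Sum hand-picked (hypothetical) weights over the shared properties."""
    score = 0
    if normalize(a.name) == normalize(b.name):
        score += 100
    # A missing property is neither evidence for nor against a match.
    if a.website and b.website:
        score += 50 if a.website == b.website else -100
    if a.geo and b.geo:
        score += 50 if a.geo == b.geo else -50
    return score
```

With these weights, two records that share a normalized name and a location but disagree on the website still end up with a positive score, which is the kind of ambiguity the rest of the talk deals with.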

Problems

• What happens when a match involves more than two entities (match arity != 2)?

• An entity can’t appear in more than one match

• O(N^2) pairwise comparisons don’t scale either

Graphical ER

Think Like a Graph

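[Figure: the five entities A, B, C, D and E drawn as the nodes of a graph]

ID  Name           Website          Geo
A   Facebook       facebook.com     Menlo Park, CA
B   FB             facebook.com     CA
C   Joe's Cookies  joescookies.com  San Francisco, CA
D   Joes Cookies   facebook.com     San Francisco, CA
E   Joes Cookies   NULL             New York, NY

[Figure: the same graph with every pair of entities joined by an edge weighted by its similarity score. The ten edge weights shown are 150, 50, 50, 50, 50, 50, -100, -100, -150 and -150.]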

Key Concept: Cliques

Think Like a Clique

[Figure: the same weighted graph on entities A, B, C, D and E]

Possible cliques =>

{A} {B} {C} {D} {E}
{E, A} {E, B} {E, C} {E, D} {A, B} {A, C} {A, D} {B, C} {B, D} {C, D}
{E, A, B} {E, A, C} {E, A, D} {E, B, C} {E, B, D} {E, C, D} {A, B, C} {A, B, D} {A, C, D} {B, C, D}
{E, A, B, C} {E, A, B, D} {E, A, C, D} {E, B, C, D} {A, B, C, D}
{E, A, B, C, D}

Recurring Theme: Powerset
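As a small illustration, the candidate cliques for five entities can be enumerated as the non-empty subsets of the entity set. The function name `candidate_cliques` is made up for this sketch.

```python
from itertools import chain, combinations


def candidate_cliques(entities):
    """Return every non-empty subset of `entities` as a frozenset."""
    items = list(entities)
    subsets = chain.from_iterable(
        combinations(items, size) for size in range(1, len(items) + 1)
    )
    return [frozenset(subset) for subset in subsets]


cliques = candidate_cliques("ABCDE")
print(len(cliques))  # 31, i.e. 2**5 - 1 non-empty subsets
```

The count doubles with every added entity, which is why the later slides avoid taking the powerset of large groups.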

Scoring Cliques

from above
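The scoring code from this slide did not survive extraction. One plausible way to score a candidate clique, and this is an assumption rather than the talk's method, is to average the pairwise scores of its members. The `EDGE_SCORES` values below are invented for the example.

```python
from itertools import combinations

# Hypothetical pairwise scores; a missing pair counts as 0.
EDGE_SCORES = {
    frozenset("AB"): 150,
    frozenset("AC"): -100,
    frozenset("BC"): -100,
    frozenset("CD"): 50,
}


def pair_score(a, b):
    return EDGE_SCORES.get(frozenset((a, b)), 0)


def clique_score(clique, score_fn=pair_score):
    """Mean pairwise score of the clique; singletons score 0.0 by convention."""
    members = sorted(clique)
    pairs = list(combinations(members, 2))
    if not pairs:
        return 0.0
    return sum(score_fn(a, b) for a, b in pairs) / len(pairs)
```

Averaging keeps large and small cliques on a comparable scale; summing instead would favor large cliques and is an equally defensible choice.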

Overlapping Cliques

[Figure: two overlapping candidate cliques drawn on the graph of entities A, B, C, D and E, with the scores A = 0.75 and B = 0.55]

Overlapping Cliques

An entity can’t belong to more than one clique.

When we choose a clique, we must ensure that no other chosen clique uses any of those entities.

Clique Choosing

Recap

• Given a dataset of entities…

• Take the powerset of those entities => every possible clique

• Score all the cliques

• In descending score order, choose each clique only if none of its entities has already been touched
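The last step of the recap can be sketched as a greedy pass over the scored cliques. The input format, a list of `(score, clique)` pairs, is an assumption of this sketch.

```python
def choose_cliques(scored_cliques):
    """Greedily pick the best-scoring cliques that share no entities.

    `scored_cliques` is an iterable of (score, clique) pairs, where each
    clique is a set or frozenset of entity ids.
    """
    touched = set()
    chosen = []
    # Sort on the score only, so cliques themselves are never compared.
    for score, clique in sorted(scored_cliques, key=lambda sc: sc[0], reverse=True):
        if touched.isdisjoint(clique):
            chosen.append(clique)
            touched.update(clique)
    return chosen


scored = [
    (0.55, frozenset("BC")),
    (0.75, frozenset("AB")),
    (0.40, frozenset("CD")),
    (0.00, frozenset("E")),
]
print(choose_cliques(scored))
```

Here {B, C} is skipped because B was already taken by the higher-scoring {A, B}, so the result is {A, B}, {C, D} and {E}.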

ER on Bigger Data

Challenges

• Get potential matches onto the same machine

• Avoid using powerset(n) for large n

Locality-Sensitive Hashing (LSH)

Basic Idea: Use MapReduce to get likely matches onto the same machines

“Johnathon” => “John”

“Sequoia Capital, LLC” => “Sequoia”

[37.773972, -122.431297] => [37.73, -122.43]

“app.example.com” => “example.com”
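A rough sketch of key functions in the spirit of the examples above: each one coarsens a property so that likely matches share a key, and records are then grouped by key. All function names, the 4-character prefix, the suffix list and the 2-decimal truncation are assumptions of this sketch. On Spark the grouping step would typically be a key-by followed by a group-by-key; plain dictionaries stand in for that here.

```python
import math
from collections import defaultdict

LEGAL_SUFFIXES = {"llc", "inc", "corp", "ltd"}  # hypothetical stop list


def person_key(name):
    """First four letters, lowercased: 'Johnathon' and 'John' collide."""
    return name.strip().lower()[:4]


def company_key(name):
    """First word that is not a legal suffix, lowercased."""
    words = [w.strip(",.").lower() for w in name.split()]
    words = [w for w in words if w and w not in LEGAL_SUFFIXES]
    return words[0] if words else ""


def geo_key(lat, lon, places=2):
    """Truncate coordinates so that nearby points share a grid cell."""
    factor = 10 ** places
    return (math.trunc(lat * factor) / factor, math.trunc(lon * factor) / factor)


def domain_key(host):
    """Last two labels of a hostname (naive: ignores suffixes like co.uk)."""
    return ".".join(host.lower().split(".")[-2:])


def block(records, key_fn):
    """Group records by key, as a stand-in for the shuffle in MapReduce."""
    buckets = defaultdict(list)
    for record in records:
        buckets[key_fn(record)].append(record)
    return dict(buckets)


print(block(["app.example.com", "example.com", "other.org"], domain_key))
```

Only records that land in the same bucket need to be compared, which is what keeps the pairwise work away from O(N^2) over the whole dataset.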

Locality-Sensitive Hashing

Problems

• What if our entities have missing properties?

Locality-Sensitive Hashing

ID  Name            Website          LSH on “name”
A   Joe’s Cookies   (missing)        “Joe Cookie”
B   Joe’s Cookie’s  joescookies.com  “Joe Cookie”
C   (missing)       joescookies.com  “”


Multilevel LSH

• Basic Idea: Use LSH multiple times on converging cliques
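One way to read "converging" is that buckets produced by different properties are merged whenever they share an entity. That reading, the crude `name_key` stemmer and the use of union-find are assumptions of this sketch, not details confirmed by the slides.

```python
from collections import defaultdict


def name_key(entity):
    """Crude stemmer: drop possessive 's and plural s from each word."""
    name = entity.get("name")
    if not name:
        return ""
    words = []
    for word in name.replace("’", "'").split():
        if word.endswith("'s"):
            word = word[:-2]
        elif word.endswith("s"):
            word = word[:-1]
        words.append(word)
    return " ".join(words)


def website_key(entity):
    return entity.get("website") or ""


def multilevel_groups(entities, key_fns):
    """Bucket by each key function in turn and merge overlapping buckets."""
    parent = {entity_id: entity_id for entity_id in entities}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for key_fn in key_fns:
        buckets = defaultdict(list)
        for entity_id, entity in entities.items():
            key = key_fn(entity)
            if key:  # a missing property must not bucket everything under ""
                buckets[key].append(entity_id)
        for ids in buckets.values():
            for other in ids[1:]:
                parent[find(other)] = find(ids[0])

    groups = defaultdict(set)
    for entity_id in entities:
        groups[find(entity_id)].add(entity_id)
    return sorted(sorted(group) for group in groups.values())


entities = {
    "A": {"name": "Joe’s Cookies", "website": None},
    "B": {"name": "Joe’s Cookie’s", "website": "joescookies.com"},
    "C": {"name": None, "website": "joescookies.com"},
}
print(multilevel_groups(entities, [name_key]))
print(multilevel_groups(entities, [name_key, website_key]))
```

Hashing on the name alone leaves C on its own because it has no name; adding the website pass links C to B and therefore to A. The merged groups are only candidates and still need to be scored.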

[Figure: the same three entities. LSH on “name” gives A and B the key “Joe Cookie” and C the key “”. LSH on “website” gives the key “joescookies.com” twice, for the two entities that have that website. The resulting groups are labeled Clique #1, Clique #2 and Clique #3.]




Clique Choosing

• We now have all potential cliques, spread across the cluster

• We now need to choose the best cliques

• Remember: choosing one clique invalidates others

• Fundamentally a serial algorithm!

Clique Choosing

RDD[T].toLocalIterator() : Iterator[T]

• Produces an iterator on the driver that walks every partition in turn, fetching one partition at a time, so the driver never has to hold the whole RDD in memory


uh oh

Challenge

• We need to keep track of which entities we’ve “touched”

• But using a HashSet means we will start eating a lot of memory

Primer: Bloom Filters

BloomFilter {
  def mightContain(T obj)
  def put(T obj)
}

example: 1 MB @ 0.5% error => 130 KB
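For a sense of scale, the standard Bloom filter sizing formulas give the number of bits and hash functions for a target false-positive rate. The item count of 100,000 below is an arbitrary choice for illustration and is not taken from the slide.

```python
import math


def bloom_size(n_items, fp_rate):
    """Return (bits, hash_count) for the standard optimal Bloom filter sizing.

    bits   = -n * ln(p) / (ln 2)^2
    hashes = (bits / n) * ln 2
    """
    bits = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / n_items * math.log(2)))
    return bits, hashes


bits, hashes = bloom_size(100_000, 0.005)
print(bits / 100_000)   # roughly 11 bits per item at a 0.5% error rate
print(bits / 8 / 1024)  # filter size in KiB
print(hashes)
```

The cost per item depends only on the error rate, not on how large each entity id is, which is where the savings over a HashSet come from.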

Clique Choosing w/ Bloom Filters
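A sketch of how the serial pass might look when the set of touched entities is a Bloom filter. The tiny filter below is a stand-in for a library implementation, and its method names follow the primer in Python style (`might_contain`, `put`); the input is assumed to be an iterator of cliques already sorted by descending score, for example one obtained from `toLocalIterator()`.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, obj):
        digest = hashlib.sha256(repr(obj).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step non-zero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def put(self, obj):
        for pos in self._positions(obj):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, obj):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(obj)
        )


def choose_cliques(sorted_cliques, touched):
    """Stream cliques in score order, yielding those with no touched entity."""
    for clique in sorted_cliques:
        if any(touched.might_contain(entity) for entity in clique):
            continue  # possibly a false positive; the clique is skipped anyway
        for entity in clique:
            touched.put(entity)
        yield clique


stream = iter([("A", "B"), ("B", "C"), ("C", "D"), ("E",)])
print(list(choose_cliques(stream, BloomFilter(num_bits=1 << 16, num_hashes=7))))
```

A false positive can only cause a valid clique to be skipped; it can never place one entity in two cliques, so the error rate trades a little recall for memory.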

Recap

• Challenge: Get data to the right machine. Solution: Use Locality-Sensitive Hashing

• Challenge: Choose the best cliques. Solution: Use a serial iterator and Bloom filters to keep memory low

Temporal ER

Temporal Entity Resolution

T1               T2
Ms Sally Smith   Mrs Sally Doe
thefacebook.com  facebook.com
Zen Payroll      Gusto


[Figure: entity A (Zen Payroll, zenpayroll.com) and entity B (Gusto, gusto.com) joined by an edge scored -1000]

[Figure: a third entity C (Zen Payroll <=> Gusto, zenpayroll.com <=> gusto.com) is added. The three edges are scored +100, +100 and -1000.]

Iterative Poison Pills

• Basic Idea: Use ER techniques we’ve already established

• Introduce “poison pills” that can break up cliques if temporal properties don’t match

• Iteratively use the poison pills to match on increasingly temporally-aware entities
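A heavily simplified sketch of a single "kick out" step. It assumes a poison pill can be modeled as a map from website to the half-open year range in which that website belongs to the entity; the `Record` type, the record ids and the year ranges are all made up for illustration and are not the talk's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

YearRange = Tuple[Optional[int], Optional[int]]  # [start, end); None = open


@dataclass(frozen=True)
class Record:
    id: str
    website: str
    year: int


def in_range(year: int, year_range: YearRange) -> bool:
    start, end = year_range
    return (start is None or year >= start) and (end is None or year < end)


def apply_poison_pill(
    clique: List[Record], pill: Dict[str, YearRange]
) -> Tuple[List[Record], List[Record]]:
    """Split a clique into members that satisfy the pill and members kicked out."""
    kept, kicked = [], []
    for record in clique:
        ok = record.website in pill and in_range(record.year, pill[record.website])
        (kept if ok else kicked).append(record)
    return kept, kicked


travel = Record("travel-2010", "gusto.com", 2010)
payroll = Record("payroll-2016", "gusto.com", 2016)
zen = Record("zen-2014", "zenpayroll.com", 2014)

before_2015 = {"gusto.com": (None, 2015)}
from_2015 = {"gusto.com": (2015, None), "zenpayroll.com": (None, None)}

print(apply_poison_pill([travel, payroll], before_2015))
print(apply_poison_pill([travel, payroll, zen], from_2015))
```

Regular ER on the website alone would merge both gusto.com records; the first pill keeps only the 2010 record, and the second keeps the 2016 record together with the zenpayroll.com one. Kicked-out records would go back into the next ER iteration.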

[Figure: Temporal Poison Pills, worked example]

Records: gusto.com (Payroll) 2016; gusto.com (Travel) 2010; zenpayroll.com (Payroll) 2014

Poison pills: gusto.com < 2015; gusto.com, zenpayroll.com > 2015

Step 1. Perform regular ER on entities A, B, C, D, E: {A, C, D, E} {B, E}

Step 2. Kick out entities that don’t match temporal requirements: {A, D} gusto.com < 2015; {B, E} gusto.com > 2015, zenpayroll < 2014; {C, E} gusto, 2016

Step 3. Perform regular ER (now with more temporal fields available): {A, C, D} {B, C, E}


• Very computationally expensive

• Requires significant tuning and tweaking to keep tractable

• Considered one of the Holy Grails of ER

Learned ER

Recap

• Gorilla in the room: All of our scoring has been manual

Supervised Learning ER

• Basic Idea: Use a training set to learn the weights in our scoring functions

• Disclaimer: Only proceed with this if you have very complex scoring properties
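To make the idea concrete, here is a minimal sketch that learns scoring weights from labeled pairs with logistic regression fitted by batch gradient descent. The choice of model, the three binary features (name, website and geo agreement) and the toy training set are all assumptions for illustration; the talk does not say which learner it used.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Probability that the pair described by feature vector x is a match."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(features, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    m = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = predict(weights, bias, x) - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / m for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / m
    return weights, bias


# Toy training set: [name_match, website_match, geo_match] -> is_match.
# Invented labels: a pair is a match when at least two properties agree.
X = [[1, 1, 1], [1, 1, 0], [0, 1, 1], [1, 0, 1],
     [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = train(X, y)
print(weights, bias)
```

The learned weights play the role of the hand-tuned constants in a manual scoring function, and the predicted probability can be used directly as the pairwise score.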


More Learning Options

• Gradient Descent: What if we viewed the system as having an overall “error”? We can then use Gradient Descent to find an optimal solution.

• Very, very computationally intense

Questions? Thanks!
