entity matching for semistructured data in the cloud

Entity Matching for Semistructured Datain the Cloud

Marcus ParadiesACM SAC 2012 - CC Track

March 27, 2012

1 / 19Entity Matching for Semistructured Data in the CloudMarcus Paradies

N

Outline

1 Motivation

2 ChuQL

3 Entity Matching

4 MAXIM: Entity Matching in the Cloud

5 Summary


N

Motivation

Enriching/Improving Wikipedia

References from Wikipedia article Hash join


N

Motivation


Lookup in the CiteSeer database


N

Motivation


Lookup in Google


N

Motivation

Wikipedia in a nutshell

Characteristics

3.7 Mio articles (english Wikipedia database)

Dataset size about 30GB of XML (without history)

3.6 Mio references

References are categorized into books, journals, websites, etc.

Challenges

Articles in Wikipedia are incomplete

Articles in Wikipedia are inaccurate

Articles in Wikipedia are subjective


N

Motivation

Problem Statement

Definition

Given two datasets of records, R and S , a set of attributesa1, . . . , an, a set of similarity functions sima1 , . . . , siman and asimilarity threshold τ , the task between R and S is defined asfinding and combining all pairs of records from R and S where∑n

i=1 simai (R.ai , S .ai ) ≥ τ

{{Cite book | last = Mumford | first = David | authorlink = David Mumford | title = The Red Book of Varieties and Schemes | publisher = [[Springer]] | location = Berlin | date = 1999 | page = 198 | doi = 10.1007/b62130 | isbn = 354063293X }}

{{Cite book | last = Mumford | first = David | authorlink = David Mumford | title = The Red Book of Varieties and Schemes | publisher = [[Springer]] | location = Berlin | date = 1999 | page = 198 | doi = 10.1007/b62130 | isbn = 354063293X }}

<record id=”6627383”> <author>David Mumford</author> <title>The red book of Varieties and Schemes</title> <publisher>Springer</publisher> <year>1999</year> <doi>10.1007/b62130</doi></record>

<record id=”6627383”> <author>David Mumford</author> <title>The red book of Varieties and Schemes</title> <publisher>Springer</publisher> <year>1999</year> <doi>10.1007/b62130</doi></record>

Wikipedia Data set CiteSeer Data set


N

ChuQL


N

ChuQL

ChuQL by example

Wordcount in ChuQL

1 mapreduce {

2 input { fn:collection ("hdfs :// wiki /") }

3 rr { for $rev in $hxml:in// revision

4 return {"key": fn:data($x// username|$x//ip),

5 "val": $x//title } }

6 map { $hxml:in }

7 reduce { {"key": $hxml:in=>"key", "value": fn:count($hxml:in=>"val ")} }

8 rw { <author name ="{ $hxml:in=>"key"}" count ="{ $hxml:in=>"val"}"/> }

9 output { fn:put("hdfs :// outputdir /") }

10 }


N

Entity Matching


N

Entity Matching

What is Entity Matching?

Challenges

Entity Matching has quadratic runtime behavior

Entity Matching has high CPU- and memory demands

The definition of “what is similar” is domain-dependent


N

Entity Matching

Entity Matching Architecture

Data Source

S1

Data Source

S1

Data Source

S2

Data Source

S2

BlockingBlocking

b1b1

b2b2

b3b3

bnbn

...

MatchingMatching Match Result

R

Match Result

R


N

Entity Matching


Data Source

S1

Data Source

S1

Data Source

S2

Data Source

S2

BlockingBlocking

b1b1

b2b2

b3b3

bnbn

...


R

Match Result

R

How can we improve the runtime of an EM task?


N

Entity Matching


Data Source

S1

Data Source

S1

Data Source

S2

Data Source

S2

BlockingBlocking

b1b1

b2b2

b3b3

bnbn

...


R

Match Result

R

Distributed Blocking


N

Entity Matching


Data Source

S1

Data Source

S1

Data Source

S2

Data Source

S2

BlockingBlocking

b1b1

b2b2

b3b3

bnbn

...


R

Match Result

R

Distributed Blocking Parallel Matching


N

MAXIM: Entity Matchingin the Cloud


N

MAXIM: Entity Matching in the Cloud

Requirements and Approach

Requirements

Efficient processing of semistructured data

Scalability to large datasets

Independency from specific similarity functions

Ability to easily add new similarity functions

Main Idea

Use MapReduce and ChuQL to process semistructured data

Use a search-based blocking to generate candidate pairs

Apply similarity functions to candidate pairs within a block


N


Architecture

Node 1

Task Tracker

ChuQL Engine

Data Node

Had

oop

Full-textIndex

Search Engine

Node 2

Task Tracker

ChuQL Engine

Data Node

Had

oop

Full-textIndex

SearchEngine

Node N

Task Tracker

ChuQL Engine

Data Node

Had

oop

Full-textIndex

SearchEngine

...

HDFSHDFS

Architecture

Hadoop cluster with up to 40 nodes

Each node runs a search engine and an attached full-text index

Each node runs an in-memory XQuery processor

Semistructured data is partitioned and placed on HDFS


N


Processing Stages

Search EnginesSearch Engines

HDFSHDFS

Three Stages

Preparation Stage

Blocking Stage

Matching Stage


N


Processing Stages

Extract Wikipedia references


Index CiteSeerX records



HDFSHDFS

Extract references

Store references

Transform into full-text index XML

Build index

Preparation Stage

Stage 1: Preparation Stage

Extracts references from Wikipedia

Reads and transforms records from CiteSeerX

Sends CiteSeerX data to local full-text index


N


Processing Stages





Generate SemanticBlock



HDFSHDFS

Extract references

Store references


Build index

Retrieve references

Generate query

Get query response

Store blocks

Preparation Stage Blocking Stage

Stage 2: Blocking Stage

Reads extracted references from HDFS

Probes full-text index to retrieve candidate publications

Assign candidate publications to block(s)


N


Processing Stages






Generate SemanticBlock Record pair generationRecord pair generation


HDFSHDFS

Extract references

Store references


Build index

Retrieve references

Generate query

Get query response

Store blocks

Verify candidatepairs

Store record pairs

Preparation Stage Blocking Stage Matching Stage

Stage 3: Matching Stage

Read blocks from HDFS

Generate candidate pairs and apply similarity functions

Store matching pairs and their similarity


N



Extracting References

Indexing Publications


N




{{cite journal | author1 = Hansjörg Zeller | author2 = Jim Gray | title = An Adaptive Hash Join Algorithm for Multi-User Environments | journal = Proceedings of the 16th VLDB conference | year = 1990 | pages = 186–197}}

Extraction



N





<reference type=“journal“> <author1>Hansjörg Zeller</author1> <author2>Jim Gray</author2> <title>An Adaptive Hash Join Algorithm for Multi-User Environments</title> <journal>Proceedings of the 16th VLDB conference</journal> <year>1990</year> <pages>186–197</pages></reference>

Extraction

Transformation



N






Extraction

Transformation


HDFS


N






Extraction

Transformation


<doc> <field name="id">10.1.1.49.2550</field> <field name="title">Selecting Tense, Aspect, and Connecting Words In Language Generation</field> <field name="author">Bonnie Dorr</field> <field name="description">Generating language ...</field> </doc>

HDFS

Read and Transformation


N






Extraction

Transformation


<doc> <field name="id">10.1.1.49.2550</field> <field name="title">Selecting Tense, Aspect, and Connecting Words In Language Generation</field> <field name="author">Bonnie Dorr</field> <field name="description">Generating language ...</field> </doc>

LuceneIndex

LuceneIndex

HDFS

Read and Transformation

Indexing


N



Block generation

Each reference generates a set of candidate publications

Each candidate publication is inserted into all blocks, which arelisted in reference


N



Block generation

Each reference generates a set of candidate publications

Each candidate publication is inserted into all blocks, which arelisted in reference

Example

<citation> <id>26334893</id> <cat>Search engine optimization</cat> <cat>Internet search algorithms</cat> <cat>Link analysis</cat> <ref> <type>journal</type> <author>Taher Haveliwala</author> <year>2003</year> <pages>56-70</pages> <title>The Second Eigenvalue

of the Google Matrix</title> <journal>Stanford University

Technical Report</journal> </ref></citation>

<citation> <id>26334893</id> <cat>Search engine optimization</cat> <cat>Internet search algorithms</cat> <cat>Link analysis</cat> <ref> <type>journal</type> <author>Taher Haveliwala</author> <year>2003</year> <pages>56-70</pages> <title>The Second Eigenvalue

of the Google Matrix</title> <journal>Stanford University

Technical Report</journal> </ref></citation>

<citation> <id>26334893</id> <cat>Hashing</cat> <cat>Join algorithms</cat> <ref> <type>journal</type> <author>Hansjörg Zeller</author> <author>Jim Gray</author> <year>1990</year> <pages>186-197</pages> <title>An Adaptive Hash Join Algorithm for Multiuser Environments</title> <journal>Proceedings of the 16th VLDB

conference</journal> </ref></citation>

Full-Text Index

Search Engine

Hashing

Join algorithms

10.0.1.1.124

10.0.1.1.124

10.0.7.23.14

10.0.1.11.23

send result

send result

send query


N



Distributed Search in MAXIM

(a)(a

)(a)

(a)

(b) (b) (b) (b)

Node 2

Task Tracker

ChuQL Engine

Data Node

Had

oop

Full-textIndex

SearchEngine

Node 3

Task Tracker

ChuQL Engine

Data Node

Had

oop

Full-textIndex

SearchEngine

Node 4

Task Tracker

ChuQL Engine

Data NodeH

adoo

p

Full-textIndex

SearchEngine

Node 5

Task Tracker

ChuQL Engine

Data Node

Had

oop

Full-textIndex

Search Engine

Node 1

Task Tracker

ChuQL Engine

Data Node

Had

oop

Full-textIndex

SearchEngine

(c) Collect partial results

(b) HTTP response (partial result)

(a) Send HTTP request (query) (c)


N


Stage 3: Matching Stage

Applies user-defined similarity functions to candidate pairs

Each attribute can be evaluated by a specific similarity function

Number of candidate pairs

CP =n∑

i=1

Ci ∗ Ri (1)

n - # of blocks in B1, . . . ,Bn

Ri - # of references in block Bi

Ci - # of candidate publications in block Bi

CP - # of candidate pairs to verify


N

Summary

Summary

Wikipedia provides many opportunities for research

Need for efficiently processing semistructured data is increasing

Entity Matching is critical for data integration and data cleaning

Entity Matching is difficult to parallelize due to unbalanced datapartitions

MAXIM parallelizes EM by building blocks of similar records in aclassification fashion

MAXIM allows to define own similarity functions and computationfunctions without changing the algorithm


N

“Everything that can be invented has been invented.”

(Charles H. Duell, Commissioner, U.S. Office of Patents, 1899)


N

Experiments

Scaleup and Speedup

1

2

3

4

5

6

7

8

9

5 10 20 40

Spe

edup

= B

ase

Tim

e / N

ew T

ime

Number of nodes

IdealINDEXING-2000

EXTRACTING-2000BLOCKINGMATCHING

(a) Speedup for all stages

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

5 10 20 40S

cale

up =

Bas

e T

ime

/ New

Tim

e

Number of nodes

IdealEXTRACTING-2000

INDEXING-2000

(b) Scaleup for preparation stage


N

Experiments

Query Performance

0

100

200

300

400

500

600

700

800

900

5 10 20 40

Avg

. Que

ry R

espo

nse

Tim

e (m

s)

Number of Nodes

RESULTCOUNT-50RESULTCOUNT-100RESULTCOUNT-150RESULTCOUNT-200

Figure: Query Performance for different result set sizes and cluster sizes.


N

Experiments

Blocking Accuracy

0.5

0.6

0.7

0.8

0.9

1

1.1

1.2

0 0.25 0.5 0.75 1.0

Acc

urac

y

Variance

IdealWRONG-ORDER

MISPLACED-ENDMISPLACED-ANY

MISSING

Figure: Blocking accuracy for different typographical error classes.


N

Experiments

Number of Candidate Pairs

0

500000

1e+006

1.5e+006

2e+006

2.5e+006

3e+006

3.5e+006

4e+006

4.5e+006

5e+006

5.5e+006

0.0 0.1 0.25 0.5 0.75 1.0

Num

ber

of c

andi

date

pai

rs

Variance

RSCOUNT-50RSCOUNT-100RSCOUNT-150RSCOUNT-200

Figure: Number of candidate pair verifications in the matching stage.


N

entity matching for semistructured data in the cloud

Technology

challengesentity matching

entity matchingwhat

semistructured datain

googlemarcus paradies

chuqlmarcus paradies

david mumford authorlink

david authorlink

x title