Entity Matching for Semistructured Data in the Cloud
Marcus Paradies
ACM SAC 2012 - CC Track
March 27, 2012
Outline
1 Motivation
2 ChuQL
3 Entity Matching
4 MAXIM: Entity Matching in the Cloud
5 Summary
Motivation
Enriching/Improving Wikipedia
References from Wikipedia article Hash join
Lookup in the CiteSeer database
Lookup in Google
Wikipedia in a nutshell
Characteristics
3.7 million articles (English Wikipedia database)
Dataset size about 30 GB of XML (without history)
3.6 million references
References are categorized into books, journals, websites, etc.
Challenges
Articles in Wikipedia are incomplete
Articles in Wikipedia are inaccurate
Articles in Wikipedia are subjective
Problem Statement
Definition
Given two datasets of records, $R$ and $S$, a set of attributes $a_1, \ldots, a_n$, a set of similarity functions $\mathrm{sim}_{a_1}, \ldots, \mathrm{sim}_{a_n}$, and a similarity threshold $\tau$, the entity matching task between $R$ and $S$ is defined as finding and combining all pairs of records from $R$ and $S$ where
$$\sum_{i=1}^{n} \mathrm{sim}_{a_i}(R.a_i, S.a_i) \geq \tau$$
Wikipedia data set:
{{Cite book | last = Mumford | first = David | authorlink = David Mumford | title = The Red Book of Varieties and Schemes | publisher = [[Springer]] | location = Berlin | date = 1999 | page = 198 | doi = 10.1007/b62130 | isbn = 354063293X }}
CiteSeer data set:
<record id="6627383">
  <author>David Mumford</author>
  <title>The red book of Varieties and Schemes</title>
  <publisher>Springer</publisher>
  <year>1999</year>
  <doi>10.1007/b62130</doi>
</record>
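The definition can be illustrated with a minimal Python sketch. It is not the implementation from the talk: the per-attribute similarity functions (`string_sim`, `exact_sim`), the attribute selection, and the threshold value are assumptions chosen for illustration only.

```python
from difflib import SequenceMatcher


def string_sim(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_sim(a: str, b: str) -> float:
    """1.0 if both values are non-empty and equal, else 0.0."""
    return 1.0 if a and a == b else 0.0


# Hypothetical choice of one similarity function per attribute a_i.
SIMILARITY = {"title": string_sim, "author": string_sim, "year": exact_sim}


def score(r: dict, s: dict) -> float:
    """Sum of the per-attribute similarities of records r and s."""
    return sum(sim(r.get(attr, ""), s.get(attr, ""))
               for attr, sim in SIMILARITY.items())


def match(R: list, S: list, tau: float) -> list:
    """Return all pairs (r, s, score) whose summed similarity reaches tau."""
    pairs = []
    for r in R:
        for s in S:
            total = score(r, s)
            if total >= tau:
                pairs.append((r, s, total))
    return pairs


wiki = [{"title": "The Red Book of Varieties and Schemes",
         "author": "David Mumford", "year": "1999"}]
citeseer = [
    {"title": "The red book of Varieties and Schemes",
     "author": "David Mumford", "year": "1999"},
    {"title": "Selecting Tense, Aspect, and Connecting Words In Language Generation",
     "author": "Bonnie Dorr"},
]

matches = match(wiki, citeseer, tau=2.5)
for r, s, total in matches:
    print(r["title"], "<->", s["title"], round(total, 2))
```

The nested loop compares every record of R with every record of S, which is the quadratic behavior that blocking is meant to avoid.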
ChuQL
ChuQL by example
Wordcount in ChuQL
mapreduce {
  input  { fn:collection("hdfs://wiki/") }
  (: record reader: emit one key/value pair per revision :)
  rr     { for $rev in $hxml:in//revision
           return {"key": fn:data($rev//username|$rev//ip),
                   "val": $rev//title} }
  (: identity map :)
  map    { $hxml:in }
  (: count the values grouped under each contributor key :)
  reduce { {"key": $hxml:in=>"key", "val": fn:count($hxml:in=>"val")} }
  rw     { <author name="{$hxml:in=>"key"}" count="{$hxml:in=>"val"}"/> }
  output { fn:put("hdfs://outputdir/") }
}
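For readers unfamiliar with ChuQL, the phases of the job above (record reader, map, reduce, record writer) can be mimicked in memory with plain Python. This is only a sketch of the data flow, not ChuQL semantics: the sample XML fragment is made up, the `shuffle` helper stands in for Hadoop's grouping step, and the title is read from the enclosing page element, which is an assumption about the dump layout.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical miniature input; contributor names and titles are invented.
SAMPLE = """
<mediawiki>
  <page><title>Hash join</title>
    <revision><contributor><username>alice</username></contributor></revision>
    <revision><contributor><ip>192.0.2.1</ip></contributor></revision>
  </page>
  <page><title>Sort-merge join</title>
    <revision><contributor><username>alice</username></contributor></revision>
  </page>
</mediawiki>
"""


def record_reader(root):
    """rr: emit one (contributor, title) pair per revision."""
    for page in root.iter("page"):
        title = page.findtext("title")
        for rev in page.iter("revision"):
            who = rev.findtext(".//username") or rev.findtext(".//ip")
            yield who, title


def map_phase(pairs):
    """map: identity, as in the ChuQL example."""
    yield from pairs


def shuffle(pairs):
    """Group values by key (done by the MapReduce framework itself)."""
    groups = defaultdict(list)
    for key, val in pairs:
        groups[key].append(val)
    return groups


def reduce_phase(groups):
    """reduce: count the values collected for each key."""
    return {key: len(vals) for key, vals in groups.items()}


def record_writer(counts):
    """rw: serialize each result as an <author/> element."""
    return [f'<author name="{key}" count="{n}"/>'
            for key, n in sorted(counts.items())]


root = ET.fromstring(SAMPLE)
lines = record_writer(reduce_phase(shuffle(map_phase(record_reader(root)))))
print("\n".join(lines))
```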
Entity Matching
What is Entity Matching?
Challenges
Entity Matching has quadratic runtime behavior
Entity Matching has high CPU and memory demands
The definition of “what is similar” is domain-dependent
Entity Matching Architecture
Figure: Data sources S1 and S2 feed a blocking step that produces blocks b1, b2, b3, ..., bn; a matching step then processes the blocks and produces the match result R.
How can we improve the runtime of an EM task?
Distributed blocking and parallel matching
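A minimal sketch of the blocking-then-matching idea, assuming a trivial blocking key (the publication year) and an exact, case-insensitive title comparison as the matcher. Both choices, as well as the record ids, are hypothetical; the talk itself uses search-based blocking. Threads are used only to show that blocks can be matched independently of each other.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def blocking_key(record: dict) -> str:
    """Hypothetical blocking key: records can only match within one year."""
    return record["year"]


def build_blocks(s1: list, s2: list) -> dict:
    """Partition both sources into blocks b1..bn by blocking key."""
    blocks = defaultdict(lambda: ([], []))
    for rec in s1:
        blocks[blocking_key(rec)][0].append(rec)
    for rec in s2:
        blocks[blocking_key(rec)][1].append(rec)
    return blocks


def match_block(block: tuple) -> list:
    """Compare only the record pairs that share a block."""
    left, right = block
    return [(a["id"], b["id"]) for a, b in product(left, right)
            if a["title"].lower() == b["title"].lower()]


def entity_match(s1: list, s2: list, workers: int = 4) -> list:
    """Match every block independently and merge the partial results."""
    blocks = build_blocks(s1, s2)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(match_block, blocks.values()))
    return sorted(pair for part in partial_results for pair in part)


s1 = [
    {"id": "w1", "title": "The Red Book of Varieties and Schemes", "year": "1999"},
    {"id": "w2", "title": "An Adaptive Hash Join Algorithm for Multi-User Environments", "year": "1990"},
]
s2 = [
    {"id": "c1", "title": "The red book of Varieties and Schemes", "year": "1999"},
    {"id": "c2", "title": "An Adaptive Hash Join Algorithm for Multi-User Environments", "year": "1990"},
    {"id": "c3", "title": "The Second Eigenvalue of the Google Matrix", "year": "2003"},
]

result = entity_match(s1, s2)
print(result)
```

With these inputs only two comparisons are made (one per non-empty block) instead of the six a full cross product would need.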
MAXIM: Entity Matching in the Cloud
Requirements and Approach
Requirements
Efficient processing of semistructured data
Scalability to large datasets
Independence from specific similarity functions
Ability to easily add new similarity functions
Main Idea
Use MapReduce and ChuQL to process semistructured data
Use search-based blocking to generate candidate pairs
Apply similarity functions to candidate pairs within a block
Architecture
Figure: Nodes 1 to N each run a Hadoop Task Tracker and Data Node, a ChuQL engine, and a search engine with a full-text index; all nodes share HDFS.
Hadoop cluster with up to 40 nodes
Each node runs a search engine and an attached full-text index
Each node runs an in-memory XQuery processor
Semistructured data is partitioned and placed on HDFS
Processing Stages
Three Stages
Preparation Stage
Blocking Stage
Matching Stage
Stage 1: Preparation Stage
Extracts references from Wikipedia
Reads and transforms records from CiteSeerX
Sends CiteSeerX data to local full-text index
Stage 2: Blocking Stage
Reads extracted references from HDFS
Probes full-text index to retrieve candidate publications
Assigns candidate publications to block(s)
Figure: Processing stages, connected through HDFS and the search engines. Preparation stage: extract Wikipedia references (extract references, store references) and index CiteSeerX records (transform into full-text index XML, build index). Blocking stage: generate semantic blocks (retrieve references, generate query, get query response, store blocks). Matching stage: record pair generation (verify candidate pairs, store record pairs).
Stage 3: Matching Stage
Reads blocks from HDFS
Generates candidate pairs and applies similarity functions
Stores matching pairs and their similarity
Stage 1: Preparation Stage
Extracting References
Extraction:
{{cite journal | author1 = Hansjörg Zeller | author2 = Jim Gray | title = An Adaptive Hash Join Algorithm for Multi-User Environments | journal = Proceedings of the 16th VLDB conference | year = 1990 | pages = 186–197}}
Transformation:
<reference type="journal">
  <author1>Hansjörg Zeller</author1>
  <author2>Jim Gray</author2>
  <title>An Adaptive Hash Join Algorithm for Multi-User Environments</title>
  <journal>Proceedings of the 16th VLDB conference</journal>
  <year>1990</year>
  <pages>186–197</pages>
</reference>
Indexing Publications
Read and transformation (HDFS):
<doc>
  <field name="id">10.1.1.49.2550</field>
  <field name="title">Selecting Tense, Aspect, and Connecting Words In Language Generation</field>
  <field name="author">Bonnie Dorr</field>
  <field name="description">Generating language ...</field>
</doc>
Indexing (Lucene index)
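The extraction and transformation steps for references might look roughly like the following Python sketch. It is an illustration under simplifying assumptions, not the ChuQL code used in MAXIM: the regular expression and the `parse_cite` helper are hypothetical, and splitting on `|` would mis-handle piped wiki links such as `[[A|B]]` inside a value.

```python
import re
import xml.etree.ElementTree as ET

# Matches "{{cite <type> | key = value | ... }}" (assumed template shape).
TEMPLATE_RE = re.compile(r"\{\{\s*cite\s+(\w+)\s*\|(.*?)\}\}",
                         re.IGNORECASE | re.DOTALL)


def parse_cite(template: str):
    """Turn one cite template into a <reference> element, or None."""
    m = TEMPLATE_RE.search(template)
    if m is None:
        return None
    ref = ET.Element("reference", {"type": m.group(1).lower()})
    for part in m.group(2).split("|"):
        name, sep, value = part.partition("=")
        if not sep:
            continue  # skip positional parameters without "key = value"
        child = ET.SubElement(ref, name.strip())
        child.text = value.strip()
    return ref


SAMPLE = ("{{cite journal | author1 = Hansjörg Zeller | author2 = Jim Gray "
          "| title = An Adaptive Hash Join Algorithm for Multi-User Environments "
          "| journal = Proceedings of the 16th VLDB conference "
          "| year = 1990 | pages = 186–197}}")

ref = parse_cite(SAMPLE)
print(ET.tostring(ref, encoding="unicode"))
```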
Stage 2: Blocking Stage
Block generation
Each reference generates a set of candidate publications
Each candidate publication is inserted into all blocks that are listed in the reference
Example
<citation>
  <id>26334893</id>
  <cat>Search engine optimization</cat>
  <cat>Internet search algorithms</cat>
  <cat>Link analysis</cat>
  <ref>
    <type>journal</type>
    <author>Taher Haveliwala</author>
    <year>2003</year>
    <pages>56-70</pages>
    <title>The Second Eigenvalue of the Google Matrix</title>
    <journal>Stanford University Technical Report</journal>
  </ref>
</citation>
<citation>
  <id>26334893</id>
  <cat>Hashing</cat>
  <cat>Join algorithms</cat>
  <ref>
    <type>journal</type>
    <author>Hansjörg Zeller</author>
    <author>Jim Gray</author>
    <year>1990</year>
    <pages>186-197</pages>
    <title>An Adaptive Hash Join Algorithm for Multiuser Environments</title>
    <journal>Proceedings of the 16th VLDB conference</journal>
  </ref>
</citation>
Figure: The reference is sent as a query to the search engine and its full-text index (send query, send result). The returned candidate publications (for example 10.0.1.1.124, 10.0.7.23.14, 10.0.1.11.23) are placed into the blocks Hashing and Join algorithms.
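The block generation rule can be sketched in Python as follows. The `search_index` function is a toy stand-in for the full-text search engine (it ranks by the number of shared title words), and the publication ids and the second indexed title are invented for the example.

```python
from collections import defaultdict


def search_index(index: dict, query_terms: set, limit: int = 200) -> list:
    """Toy search: rank publications by the number of shared title words."""
    scored = []
    for pub_id, title in index.items():
        overlap = len(query_terms & set(title.lower().split()))
        if overlap:
            scored.append((overlap, pub_id))
    scored.sort(key=lambda hit: (-hit[0], hit[1]))
    return [pub_id for _, pub_id in scored[:limit]]


def generate_blocks(citations: list, index: dict) -> dict:
    """Insert each reference's candidates into every block (category) it lists."""
    blocks = defaultdict(lambda: {"references": [], "candidates": set()})
    for cit in citations:
        candidates = search_index(index, set(cit["title"].lower().split()))
        for category in cit["categories"]:
            blocks[category]["references"].append(cit["id"])
            blocks[category]["candidates"].update(candidates)
    return blocks


# Hypothetical index contents; ids and the second title are made up.
index = {
    "pub-1": "An Adaptive Hash Join Algorithm for Multiuser Environments",
    "pub-2": "Hash join in main memory",
    "pub-3": "The Second Eigenvalue of the Google Matrix",
}
citations = [{
    "id": "ref-1",
    "categories": ["Hashing", "Join algorithms"],
    "title": "An Adaptive Hash Join Algorithm for Multiuser Environments",
}]

blocks = generate_blocks(citations, index)
for name, block in blocks.items():
    print(name, block["references"], sorted(block["candidates"]))
```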
Distributed Search in MAXIM
Figure: Distributed search across five nodes, each running a Hadoop Task Tracker and Data Node, a ChuQL engine, and a search engine with a full-text index.
(a) Send HTTP request (query)
(b) HTTP response (partial result)
(c) Collect partial results
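Steps (a) to (c) follow a scatter-gather pattern, sketched below in Python. Real HTTP calls are replaced by the in-process function `search_shard`, and the shard contents, scoring, and names are all assumptions made for illustration.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard: dict, query: str, k: int) -> list:
    """Stand-in for one node's search engine: return its top-k (score, id) hits."""
    terms = set(query.lower().split())
    hits = []
    for doc_id, title in shard.items():
        score = len(terms & set(title.lower().split()))
        if score:
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)


def distributed_search(shards: list, query: str, k: int) -> list:
    """(a) send the query to every shard, (b) receive partial results,
    (c) collect and merge them into one global top-k list."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, query, k), shards))
    merged = heapq.nlargest(k, (hit for part in partials for hit in part))
    return [doc_id for _, doc_id in merged]


# Hypothetical per-node index partitions.
shards = [
    {"a1": "adaptive hash join algorithm", "a2": "google matrix eigenvalue"},
    {"b1": "hash join", "b2": "language generation"},
    {"c1": "join ordering"},
]

top = distributed_search(shards, "adaptive hash join", k=2)
print(top)
```

Each shard returns at most k hits, so the collecting node merges at most k hits per node regardless of index size.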
Stage 3: Matching Stage
Applies user-defined similarity functions to candidate pairs
Each attribute can be evaluated by a specific similarity function
Number of candidate pairs
$$CP = \sum_{i=1}^{n} C_i \cdot R_i \qquad (1)$$
$n$: number of blocks $B_1, \ldots, B_n$
$R_i$: number of references in block $B_i$
$C_i$: number of candidate publications in block $B_i$
$CP$: number of candidate pairs to verify
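As a small worked example of equation (1), the following Python snippet computes CP for three blocks. The block sizes are made-up numbers, not measurements from the experiments.

```python
def candidate_pairs(blocks: list) -> int:
    """CP = sum over all blocks of C_i * R_i, given (C_i, R_i) tuples."""
    return sum(c_i * r_i for c_i, r_i in blocks)


# Hypothetical blocks as (candidate publications C_i, references R_i).
example_blocks = [(200, 3), (150, 1), (50, 4)]

cp = candidate_pairs(example_blocks)
print(cp)  # 200*3 + 150*1 + 50*4 = 950
```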
Summary
Wikipedia provides many opportunities for research
Need for efficiently processing semistructured data is increasing
Entity Matching is critical for data integration and data cleaning
Entity Matching is difficult to parallelize due to unbalanced data partitions
MAXIM parallelizes EM by building blocks of similar records in a classification fashion
MAXIM lets users define their own similarity functions and computation functions without changing the algorithm
“Everything that can be invented has been invented.”
(Charles H. Duell, Commissioner, U.S. Office of Patents, 1899)
Experiments
Scaleup and Speedup
Figure (a): Speedup for all stages. X-axis: number of nodes (5, 10, 20, 40). Y-axis: speedup = base time / new time. Series: Ideal, INDEXING-2000, EXTRACTING-2000, BLOCKING, MATCHING.
Figure (b): Scaleup for preparation stage. X-axis: number of nodes (5, 10, 20, 40). Y-axis: scaleup = base time / new time. Series: Ideal, EXTRACTING-2000, INDEXING-2000.
Query Performance
Figure: Query performance for different result set sizes and cluster sizes. X-axis: number of nodes (5, 10, 20, 40). Y-axis: average query response time (ms). Series: RESULTCOUNT-50, RESULTCOUNT-100, RESULTCOUNT-150, RESULTCOUNT-200.
Blocking Accuracy
Figure: Blocking accuracy for different typographical error classes. X-axis: variance (0 to 1.0). Y-axis: accuracy. Series: Ideal, WRONG-ORDER, MISPLACED-END, MISPLACED-ANY, MISSING.
Number of Candidate Pairs
Figure: Number of candidate pair verifications in the matching stage. X-axis: variance (0.0 to 1.0). Y-axis: number of candidate pairs. Series: RSCOUNT-50, RSCOUNT-100, RSCOUNT-150, RSCOUNT-200.