record matching
TRANSCRIPT
RECORD MATCHING OVER MULTIPLE QUERY RESULTS
SUBMITTED BY, DIVYA.K.P NEETHU.P.N NISHNA.M.A SWATHI.S
04
/13
/23
1
NE
HR
U C
OL
LE
GE
OF
EN
GIN
EE
RIN
G
CONTENTS INTRODUCTION PROBLEM DEFINITION EXISTING SYSTEM LITERATURE SURVEY PROPOSED SYSTEM CONCLUSION REFERENCES
04
/13
/23
2
NE
HR
U C
OL
LE
GE
OF
EN
GIN
EE
RIN
G
INTRODUCTION
RECORD MATCHINGIt identifies the records that represent the same
real world entity . It is an important step for data integration.It is not applicable for the Web database
scenario, where the records to match are query results dynamically generated on the-fly.
04
/13
/23
3
NE
HR
U C
OL
LE
GE
OF
EN
GIN
EE
RIN
G
UDD Unsupervised Duplicate Detection
Implemented to address the problem of record
matching in the Web database scenario
For a given query, can effectively identify
duplicates from the query result records of
multiple Web databases.
CONTD…
04
/13
/23
4
NE
HR
U C
OL
LE
GE
OF
EN
GIN
EE
RIN
G
CONTD…
Two cooperating classifiers are used They are: A weighted component similarity summing classifier(WCSS)
An SVM classifier
04
/13
/23
5
NE
HR
U C
OL
LE
GE
OF
EN
GIN
EE
RIN
G
PROBLEM DEFINITION
Most previous works are based on
predefined matching rules hand-
coded by domain Such approaches work well in a
traditional database environment
04
/13
/23
6
NE
HR
U C
OL
LE
GE
OF
EN
GIN
EE
RIN
G
CONTD… Our focus is on Web databases from the same
domain which provide the same type of records in response to user queries.
Suppose there are s records in data source A and there are t records in data source B, with each record having a set of fields or attributes. Each of the t records in data source B can potentially be a duplicate of each of the s records in data source A.
The goal of duplicate detection is to determine the matching status of these by sXt record pairs.
04
/13
/23
7
NE
HR
U C
OL
LE
GE
OF
EN
GIN
EE
RIN
G
EXISTING SYSTEM Previous record matching work is targeted at
matching a single type of record It is based on predefined matching
rules hand-coded by domain experts. In the Web database scenario, the
records to match are highly query-dependent, since they can only be obtained through online queries.
04
/13
/23
8
NE
HR
U C
OL
LE
GE
OF
EN
GIN
EE
RIN
G
CONTD… Recent work has addressed the matching of
multiple types of records with rich associations between the records.
The matching complexity increases rapidly with the number of record types
Today, more and more databases that dynamically generate Web pages in response to user queries are available on the Web.
04
/13
/23
9
NE
HR
U C
OL
LE
GE
OF
EN
GIN
EE
RIN
G
LITERATURE SURVEY Most record matching methods adopt
a framework that uses two major steps. .
Identifying a similarity function and the other is matching records.
The first method uses training examples. In the second method, the composite similarity function is used to calculate the similarity between the candidate pairs.
04
/13
/23
10
NE
HR
U C
OL
LE
GE
OF
EN
GIN
EE
RIN
G
CONTD... Several methods have been proposed for duplicate
detection purpose including standard blocking, sorted nighbourhood method, Bigram Indexing, and record clustering.
The record matching works most closely related to UDD are Christen’s method and PEBL.
In UDD, any similarity measure, or its combination can be easily incorporated.
04
/13
/23
11
NE
HR
U C
OL
LE
GE
OF
EN
GIN
EE
RIN
G
04
/13
/23
NE
HR
U C
OL
LE
GE
OF
EN
GIN
EE
RIN
G
12
UDD the weights are adjusted dynamically.
UDD uses two classifiers.
UDD’s performance is not very sensitive to
false negative cases.
The weights used in Christen’s method.
Both Christen’s method and PEBL use only one
classifier during the iterations of the convergence
stage
CONTD... Hand-coding or offline-learning approaches
are not appropriate for two reasons: The full data set is not available
beforehand, and therefore, good representative data for training are hard to obtain.
Even if good representative data are found and labeled for learning, the rules learned on the representatives of a full data set may not work well on a partial and biased part of that data set.
04
/13
/23
13
NE
HR
U C
OL
LE
GE
OF
EN
GIN
EE
RIN
G
PROPOSED SYSTEM
This system studies and solves the online duplicate detection problem for the Web database scenario.
This work takes advantage of the dissimilarity among records from the same Web database for record matching.
Our work focuses on studying and addressing the field weight assignment issue rather than on the similarity measure.
04
/13
/23
14
NE
HR
U C
OL
LE
GE
OF
EN
GIN
EE
RIN
G
CONCLUSION Duplicate detection is an important
step in data integration and most state-of-the-art methods are based on offline learning techniques, which require training data.
In the Web database scenario, where records to match are greatly query-dependent which is not applicable.
UDD is implemented to overcome all the problems for detecting duplicates over the query results of multiple Web databases.
04
/13
/23
15
NE
HR
U C
OL
LE
GE
OF
EN
GIN
EE
RIN
G
CONTD…
Two classifiers WCSS and SVM, are used cooperatively in the convergence step of record matching.
Experimental results show that our approach is comparable to previous work that requires training examples for identifying duplicates from the query results of multiple Web databases.
04
/13
/23
16
NE
HR
U C
OL
LE
GE
OF
EN
GIN
EE
RIN
G
REFERENCES
M. Bilenko and R.J. Mooney, “Adaptive Duplicate Detection Using Learnable String Similarity Measures,” Proc. ACM SIGKDD, pp. 39-48, 2003.
Y. Thibaudeau, “The Discrimination Power of Dependency Structures in Record Linkage,” Survey Methodology, vol. 19, pp. 31-38, 1993.
R. Baeza-Yates and B. Ribeiro-Neto. Modern InformationRetrieval. ACM Press, New York, 1999.
04
/13
/23
17
NE
HR
U C
OL
LE
GE
OF
EN
GIN
EE
RIN
G
04
/13
/23
18
NE
HR
U C
OL
LE
GE
OF
EN
GIN
EE
RIN
G
04
/13
/23
19
NE
HR
U C
OL
LE
GE
OF
EN
GIN
EE
RIN
G