record matching

RECORD MATCHING OVER MULTIPLE QUERY RESULTS

SUBMITTED BY: DIVYA.K.P, NEETHU.P.N, NISHNA.M.A, SWATHI.S

06/11/22 | NEHRU COLLEGE OF ENGINEERING

Upload: nishna-ma

Post on 28-Jun-2015


TRANSCRIPT

Page 1: Record matching

RECORD MATCHING OVER MULTIPLE QUERY RESULTS

SUBMITTED BY: DIVYA.K.P, NEETHU.P.N, NISHNA.M.A, SWATHI.S

04/13/23 | NEHRU COLLEGE OF ENGINEERING

Page 2: Record matching

CONTENTS

INTRODUCTION
PROBLEM DEFINITION
EXISTING SYSTEM
LITERATURE SURVEY
PROPOSED SYSTEM
CONCLUSION
REFERENCES

Page 3: Record matching

INTRODUCTION

RECORD MATCHING

Record matching identifies the records that represent the same real-world entity.

It is an important step in data integration.

Conventional record matching methods, which rely on offline learning, are not applicable to the Web database scenario, where the records to match are query results generated dynamically on the fly.

Page 4: Record matching

CONTD…

UDD (Unsupervised Duplicate Detection)

UDD was developed to address the problem of record matching in the Web database scenario.

For a given query, it can effectively identify duplicates among the query result records of multiple Web databases.

Page 5: Record matching

CONTD…

UDD uses two cooperating classifiers:

A weighted component similarity summing (WCSS) classifier

An SVM classifier

Page 6: Record matching

PROBLEM DEFINITION

Most previous work is based on predefined matching rules hand-coded by domain experts.

Such approaches work well in a traditional database environment.

Page 7: Record matching

CONTD…

Our focus is on Web databases from the same domain, which provide the same type of records in response to user queries.

Suppose there are s records in data source A and there are t records in data source B, with each record having a set of fields or attributes. Each of the t records in data source B can potentially be a duplicate of each of the s records in data source A.

The goal of duplicate detection is to determine the matching status of these s × t record pairs.
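
As a rough illustration of the s × t pair space described above, the sketch below enumerates every candidate pair between two small result sets. The record contents, field names, and the helper name `candidate_pairs` are hypothetical and chosen only for the example.

```python
from itertools import product

# Hypothetical query results from two Web databases (illustrative data only).
source_a = [
    {"title": "Data Integration", "author": "A. Smith"},
    {"title": "Web Databases", "author": "B. Jones"},
    {"title": "Record Linkage", "author": "C. Lee"},
]
source_b = [
    {"title": "Web Databases", "author": "B. Jones"},
    {"title": "Query Processing", "author": "D. Kim"},
]


def candidate_pairs(records_a, records_b):
    """Return every (a, b) pair; each one must be labelled duplicate or not."""
    return list(product(records_a, records_b))


pairs = candidate_pairs(source_a, source_b)
print(len(pairs))  # s * t = 3 * 2 = 6
```

Because the number of pairs grows with the product of the two result sizes, even modest query results can produce many pairs to classify.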

Page 8: Record matching

EXISTING SYSTEM

Previous record matching work targets matching a single type of record.

It is based on predefined matching rules hand-coded by domain experts.

In the Web database scenario, the records to match are highly query-dependent, since they can only be obtained through online queries.

Page 9: Record matching

CONTD…

Recent work has addressed the matching of multiple types of records with rich associations between the records.

The matching complexity increases rapidly with the number of record types.

Today, more and more databases that dynamically generate Web pages in response to user queries are available on the Web.

Page 10: Record matching

LITERATURE SURVEY

Most record matching methods adopt a framework with two major steps: identifying a similarity function, and matching records.

The first step uses training examples to identify the similarity function. In the second step, the composite similarity function is used to calculate the similarity between the candidate pairs.
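
A minimal sketch of the second step, assuming a composite similarity function is already available. Here the composite is a weighted sum of per-field string similarities computed with Python's `difflib`; the field names, weights, and the 0.85 threshold are illustrative assumptions, not values taken from any of the surveyed methods.

```python
from difflib import SequenceMatcher


def field_similarity(x: str, y: str) -> float:
    """Similarity in [0, 1] between two field values, ignoring case."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()


def composite_similarity(rec_a, rec_b, weights):
    """Weighted sum of per-field similarities (weights assumed to sum to 1)."""
    return sum(w * field_similarity(rec_a[f], rec_b[f]) for f, w in weights.items())


def is_match(rec_a, rec_b, weights, threshold=0.85):
    """Declare a candidate pair a duplicate when its score reaches the threshold."""
    return composite_similarity(rec_a, rec_b, weights) >= threshold


# Assumed weights and hypothetical records, for illustration only.
WEIGHTS = {"title": 0.6, "author": 0.4}
a = {"title": "Database Systems", "author": "J. Smith"}
b = {"title": "database systems", "author": "J. Smith"}
c = {"title": "Compilers", "author": "A. Kumar"}

print(is_match(a, b, WEIGHTS))  # True: the records differ only in letter case
print(is_match(a, c, WEIGHTS))  # False: both fields differ substantially
```

In a supervised method the weights and threshold would come from the training examples of the first step; here they are fixed by hand to keep the example short.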

Page 11: Record matching

CONTD...

Several methods have been proposed for duplicate detection, including standard blocking, the sorted neighbourhood method, bigram indexing, and record clustering.

The record matching works most closely related to UDD are Christen’s method and PEBL.

In UDD, any similarity measure, or a combination of measures, can be easily incorporated.
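
To give a feel for one of the methods listed above, here is a small sketch of the sorted neighbourhood idea: sort the records by a key, then compare each record only with the records inside a sliding window. The function name, the sorting key, and the window size are assumptions made for the example.

```python
def sorted_neighbourhood_pairs(records, key, window=3):
    """Sort by key, then pair each record with the next (window - 1) records."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, rec in enumerate(ordered):
        # Only records inside the sliding window become candidate pairs.
        for other in ordered[i + 1 : i + window]:
            pairs.append((rec, other))
    return pairs


# Hypothetical surname values; the value itself serves as the sorting key.
names = ["smith", "smyth", "jones", "jonas", "brown"]
candidates = sorted_neighbourhood_pairs(names, key=lambda n: n, window=2)
print(candidates)
```

With five records an exhaustive comparison needs 10 pairs, while a window of 2 yields only 4, at the cost of missing duplicates whose keys do not sort close together.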

Page 12: Record matching

In UDD, the weights are adjusted dynamically, in contrast to the weights used in Christen’s method.

UDD uses two classifiers, whereas both Christen’s method and PEBL use only one classifier during the iterations of the convergence stage.

UDD’s performance is not very sensitive to false negative cases.

Page 13: Record matching

CONTD...

Hand-coding or offline-learning approaches are not appropriate for two reasons:

The full data set is not available beforehand, and therefore good representative data for training are hard to obtain.

Even if good representative data are found and labeled for learning, the rules learned on the representatives of a full data set may not work well on a partial and biased part of that data set.

Page 14: Record matching

PROPOSED SYSTEM

This system studies and solves the online duplicate detection problem for the Web database scenario.

This work takes advantage of the dissimilarity among records from the same Web database for record matching.

Our work focuses on studying and addressing the field weight assignment issue rather than on the similarity measure.
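
The slides do not give the weight assignment formula, so the following is only one plausible sketch of the idea. It assumes that pairs of records from the same Web database are non-duplicates, and it gives a field more weight when those non-duplicate pairs look dissimilar on it. The function names `field_weights` and `wcss_score`, the field order, and the similarity vectors are all hypothetical.

```python
def field_weights(non_duplicate_vectors):
    """Weight each field by its mean dissimilarity over assumed non-duplicates.

    This scheme is an assumption for illustration, not the formula from the slides.
    """
    n_fields = len(non_duplicate_vectors[0])
    dissim = [
        sum(1.0 - v[i] for v in non_duplicate_vectors) / len(non_duplicate_vectors)
        for i in range(n_fields)
    ]
    total = sum(dissim)
    if total == 0:
        # No field separates the records; fall back to equal weights.
        return [1.0 / n_fields] * n_fields
    return [d / total for d in dissim]


def wcss_score(similarity_vector, weights):
    """Weighted component similarity sum for one record pair."""
    return sum(w * s for w, s in zip(weights, similarity_vector))


# Per-field similarities (title, author, format) for pairs from the same database.
non_duplicates = [(0.1, 0.2, 1.0), (0.3, 0.0, 1.0), (0.2, 0.1, 0.9)]
weights = field_weights(non_duplicates)
print(weights)
print(wcss_score((0.95, 0.9, 1.0), weights))
```

In this example the format field is nearly identical across non-duplicates, so it receives a small weight and contributes little to the final score.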

Page 15: Record matching

CONCLUSION

Duplicate detection is an important step in data integration, and most state-of-the-art methods are based on offline learning techniques, which require training data.

In the Web database scenario, where the records to match are highly query-dependent, such methods are not applicable.

UDD was developed to overcome these problems when detecting duplicates over the query results of multiple Web databases.

Page 16: Record matching

CONTD…

Two classifiers, WCSS and SVM, are used cooperatively in the convergence step of record matching.

Experimental results show that our approach is comparable to previous work that requires training examples for identifying duplicates from the query results of multiple Web databases.

Page 17: Record matching

REFERENCES

M. Bilenko and R.J. Mooney, “Adaptive Duplicate Detection Using Learnable String Similarity Measures,” Proc. ACM SIGKDD, pp. 39-48, 2003.

Y. Thibaudeau, “The Discrimination Power of Dependency Structures in Record Linkage,” Survey Methodology, vol. 19, pp. 31-38, 1993.

R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. ACM Press, New York, 1999.
