
Graduate School ETD Form 9 (Revised 12/07)

PURDUE UNIVERSITY GRADUATE SCHOOL

Thesis/Dissertation Acceptance

This is to certify that the thesis/dissertation prepared

By: Mohamed Ahmed Mohamed Ahmed Yakout

Entitled: Guided Data Cleaning

For the degree of: Doctor of Philosophy

Is approved by the final examining committee:

Ahmed K. Elmagarmid, Chair
Walid G. Aref
Luo Si
Jennifer Neville

To the best of my knowledge and as understood by the student in the Research Integrity and Copyright Disclaimer (Graduate School Form 20), this thesis/dissertation adheres to the provisions of Purdue University’s “Policy on Integrity in Research” and the use of copyrighted material.

Approved by Major Professor(s): Ahmed K. Elmagarmid

Approved by: Sunil K. Parbhakar / William J. Gorman, Head of the Graduate Program

Date: 06/15/2012

Graduate School Form 20 (Revised 9/10)

PURDUE UNIVERSITY GRADUATE SCHOOL

Research Integrity and Copyright Disclaimer

Title of Thesis/Dissertation: Guided Data Cleaning

For the degree of: Doctor of Philosophy

I certify that in the preparation of this thesis, I have observed the provisions of Purdue University Executive Memorandum No. C-22, September 6, 1991, Policy on Integrity in Research.*

Further, I certify that this work is free of plagiarism and all materials appearing in this thesis/dissertation have been properly quoted and attributed.

I certify that all copyrighted material incorporated into this thesis/dissertation is in compliance with the United States’ copyright law and that I have received written permission from the copyright owners for my use of their work, which is beyond the scope of the law. I agree to indemnify and save harmless Purdue University from any and all claims that may be asserted or that may arise from any copyright violation.

Printed Name and Signature of Candidate: Mohamed Yakout

Date (month/day/year): 06/23/2012

*Located at http://www.purdue.edu/policies/pages/teach_res_outreach/c_22.html


GUIDED DATA CLEANING

A Dissertation

Submitted to the Faculty

of

Purdue University

by

Mohamed A. Yakout

In Partial Fulfillment of the

Requirements for the Degree

of

Doctor of Philosophy

August 2012

Purdue University

West Lafayette, Indiana


To my parents, Princisa and Ahmed, for a life of sacrifice and inspiration.

To my wife, Walaa, for years of love, dedication and support.

To my children, Jasmine and Zeyad, the light of my eyes.


ACKNOWLEDGMENTS

Coming to the end of this long journey, it is my pleasure to express my gratitude to the large number of people who have contributed, in many different ways, to making my success a part of their own.

First, I wish to express my deepest gratitude to my supervisor, Prof. Ahmed Elmagarmid. I am deeply indebted to him for his continuous encouragement, efforts and invaluable advice. He was a wonderful advisor, a great leader and a close friend. I learned from Ahmed how to do high-quality research, how to transform my fledgling ideas into crisp research endeavors, and how to present and sell my ideas. I also learned from him how to think “out of the box” and see the value of a proposition. After all, I was truly fortunate to have Ahmed as my advisor, and I am delighted and honored to have been his student.

I will always be grateful to Prof. Walid Aref for the thoughtful discussions with him on both professional and personal levels. Whenever I needed advice or was stuck on a decision, Walid was always there with his experience and invaluable comments. I am also grateful to Dr. Mourad Ouzzani for the countless hours he spent with me on multiple research projects. He treated me as his brother and never made me feel that he was a research faculty member while I was only a graduate student. I would like to thank Prof. Mikhail Atallah for his support and encouragement; I learned from him how vital teamwork is to solving problems efficiently. Special thanks to Prof. Jennifer Neville for her help and continuous support. I learned from her a great deal about the importance of data mining, which enabled me to find plenty of room to incorporate data mining techniques into my solutions.

I would also like to thank Prof. Luo Si for serving on my exam committee; he was always kind and supportive, in addition to offering insightful comments. I also want to thank Prof. Chris Clifton for his continuous help and for his useful comments on my prelim. I would also like to thank Dr. William Gorman and Renate Mallus for their dedication to students and for helping me. Dr. Gorman was always there to fill my advisor’s absence during his leave.

During my summer internships with Microsoft Research and Google, I worked with wonderful, smart people. My internship at Microsoft was a truly unforgettable experience. My sincere thanks to Kris Ganjam for being a wonderful mentor and to Dr. Kaushik Chakrabarti for sharing his advice and experience with me. My discussions with them significantly shaped my way of thinking about and attacking real-world data problems. I also had a great experience during my internship at Google Inc. I am especially grateful to Dr. Moustafa Hammad for his wonderful mentorship. Moustafa was always ready to get into deep discussions with me on how to improve solution approaches, or even how to better implement them. On a personal level, I value my friendship with Moustafa to the greatest extent.

Special thanks are due to my friends and colleagues who made my graduate life easier. In particular, thanks to Dr. Hicham Elmongui for his continuous help and advice during my first few years of the PhD. I would also like to acknowledge Dr. Hazem Elmeleegy, Dr. Mohamed El Tabakh, Samer Barakat, Dr. Ahmed Amin, Ahmed Abdel-Gawad, Amr Ebeid and Amgad Madkour.

My sincere gratitude goes to my wife Walaa. Walaa’s love, dedication, perseverance, and belief in me were key factors in my success. Her support is infinite and her patience is endless. She could always be relied on to take care of anything that might keep me from studying. She gave the highest priority to me and to our kids, Jasmine and Zeyad.

My everlasting gratitude goes to my parents for their sacrifices, endless support, encouragement and continuous prayers for me. I cannot be grateful enough to them. They taught me the value of respect, hard work, good judgment and honesty. Thanks to my sisters Rabab and Rania for their support and advice.


Above all, I thank ALLAH, for only through ALLAH’s grace and blessing has this pursuit been possible. I pray for ALLAH’s support and guidance in the rest of my career and my life.


TABLE OF CONTENTS

Page

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi

ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii

1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1 Key Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.1.1 User’s Direct Interaction for Data Cleaning . . . . . . . . . . 3

1.1.2 Scalable Data Cleaning Techniques . . . . . . . . . . . . . . 3

1.1.3 User’s Indirect Interaction for Data Cleaning . . . . . . . . . 5

1.1.4 Leveraging the WWW for Data Cleaning . . . . . . . . . . . 7

1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.2.1 Constraint-based Data Cleaning . . . . . . . . . . . . . . . . 8

1.2.2 Machine Learning Techniques for Data Cleaning . . . . . . . 9

1.2.3 Involving Users in the Data Cleaning Process . . . . . . . . 10

1.2.4 WWW for Data Integration and Data Cleaning . . . . . . . 12

1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

1.4 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2 GUIDED DATA REPAIR . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.2 Problem Definition and Solution Overview . . . . . . . . . . . . . . 20

2.2.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . 20

2.2.2 Solution Overview . . . . . . . . . . . . . . . . . . . . . . . 22

2.3 Generating Candidate Updates . . . . . . . . . . . . . . . . . . . . 24

2.3.1 Dirty Tuples Identification and Updates Discovery: . . . . . 24

2.3.2 Updates Consistency Manager . . . . . . . . . . . . . . . . . 28


2.3.3 Grouping Updates . . . . . . . . . . . . . . . . . . . . . . . 31

2.4 Ranking and Displaying Suggested Updates . . . . . . . . . . . . . 31

2.4.1 VOI-based Ranking . . . . . . . . . . . . . . . . . . . . . . . 32

2.4.2 Active Learning Ordering . . . . . . . . . . . . . . . . . . . 36

2.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

2.5.1 VOI Ranking Evaluation . . . . . . . . . . . . . . . . . . . . 41

2.5.2 GDR Overall Evaluation . . . . . . . . . . . . . . . . . . . . 43

2.5.3 User Efforts vs. Repair Accuracy . . . . . . . . . . . . . . . 47

2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

3 SCALABLE APPROACH TO GENERATE DATA CLEANING UPDATES 49

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

3.2 Problem Definition and Solution Approach . . . . . . . . . . . . . . 52

3.2.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . 52

3.2.2 Solution Approach . . . . . . . . . . . . . . . . . . . . . . . 56

3.3 Modeling Dependencies and Predicting Updates . . . . . . . . . . . 57

3.3.1 Modeling Dependencies . . . . . . . . . . . . . . . . . . . . . 57

3.3.2 Predicting Updates . . . . . . . . . . . . . . . . . . . . . . . 59

3.4 Scaling Up the Maximal Likelihood Repairing Approach . . . . . . 62

3.4.1 Process Overview . . . . . . . . . . . . . . . . . . . . . . . . 63

3.4.2 Repair Generation Phase . . . . . . . . . . . . . . . . . . . . 64

3.4.3 Tuple Repair Selection Phase . . . . . . . . . . . . . . . . . 67

3.4.4 Approximate Solution for Tuple Repair Selection . . . . . . 72

3.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

3.5.1 Repair Quality Evaluation . . . . . . . . . . . . . . . . . . . 76

3.5.2 SCARE Scalability . . . . . . . . . . . . . . . . . . . . . . . 82

3.5.3 SCARE vs. ERACER to Predict Missing Values . . . . . . . 83

3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85


4 INDIRECT GUIDANCE FOR DEDUPLICATION (BEHAVIOR BASED RECORD LINKAGE) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

4.2 Behavior Based Approach . . . . . . . . . . . . . . . . . . . . . . . 88

4.2.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . 88

4.2.2 Approach Overview . . . . . . . . . . . . . . . . . . . . . . . 89

4.2.3 Pre-processing and Behavior Extraction . . . . . . . . . . . 90

4.2.4 Matching Strategy . . . . . . . . . . . . . . . . . . . . . . . 94

4.3 Candidate Generation Phase . . . . . . . . . . . . . . . . . . . . . . 96

4.4 Accurate Matching Phase . . . . . . . . . . . . . . . . . . . . . . . 100

4.4.1 Statistical Modeling Technique . . . . . . . . . . . . . . . . 100

4.4.2 Information Theoretic technique (Compressibility) . . . . . . 108

4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

4.5.1 Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

4.5.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

5 HOLISTIC MATCHING WITH WEB TABLES FOR ENTITIES AUGMENTATION AND FINDING MISSING VALUES . . . . . . . . . . . . 120

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

5.2 Holistic Matching Framework . . . . . . . . . . . . . . . . . . . . . 127

5.2.1 Data Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

5.2.2 General Augmentation Framework . . . . . . . . . . . . . . 128

5.2.3 Direct Match Approach . . . . . . . . . . . . . . . . . . . . 129

5.2.4 Holistic Match Approach . . . . . . . . . . . . . . . . . . . . 130

5.3 System architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 132

5.4 Building the SMW Graph and computing FPPR . . . . . . . . . . . 135

5.4.1 Building the SMW Graph . . . . . . . . . . . . . . . . . . . 135

5.4.2 Computing FPPR on SMW Graph . . . . . . . . . . . . . . 141

5.5 Supporting Core Operations . . . . . . . . . . . . . . . . . . . . . . 141


5.5.1 Augmentation-By-Attribute (ABA) . . . . . . . . . . . . . . 141

5.5.2 Augmentation-By-Example (ABE) . . . . . . . . . . . . . . 142

5.6 Handling n-ary Web Tables . . . . . . . . . . . . . . . . . . . . . . 143

5.7 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 145

5.7.1 Experimental Setting . . . . . . . . . . . . . . . . . . . . . . 145

5.7.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . 147

5.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

6 SUMMARY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

6.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . 155

6.2 Future Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157

6.2.1 User Centric Data Cleaning . . . . . . . . . . . . . . . . . . 158

6.2.2 Holistic Data Cleaning . . . . . . . . . . . . . . . . . . . . . 158

6.2.3 The WWW for Data Cleaning . . . . . . . . . . . . . . . . . 159

6.2.4 Private Data Cleaning . . . . . . . . . . . . . . . . . . . . . 159

LIST OF REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

VITA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168


LIST OF TABLES

Table Page

5.1 Web tables matching features as documents. . . . . . . . . . . . . . . . 139

5.2 Query entity domains and augmenting attributes . . . . . . . . . . . . 146


LIST OF FIGURES

Figure Page

2.1 Example data and rules . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.2 GDR Framework. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.3 Comparing VOI-based ranking in GDR (GDR-NoLearning) to other strategies against the amount of feedback. Feedback is reported as the percentage of the maximum number of verified updates required by an approach. Our application of the VOI concept shows superior performance compared to other naïve ranking strategies. . . . . . . . . . . . . . . . . . . . . . 42

2.4 Overall evaluation of GDR compared with other techniques. The combination of the VOI-based ranking with the active learning was very successful in efficiently involving the user. The user feedback is reported as a percentage of the initial number of the identified dirty tuples. . . . . . . . 45

2.5 Accuracy vs. user efforts. As the user spends more effort with GDR, the overall accuracy is improved. The user feedback is reported as a percentage of the initial number of the identified dirty tuples. . . . . . . . . . . 48

3.1 Illustrative example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

3.2 Generated predictions for tuple repairs with their corresponding prediction probabilities for tuple t4 in Figure 3.1. . . . . . . . . . . . . . . . . 67

3.3 Step-by-step demonstration of the SelectTupleRepair algorithm. At each iteration, the vertex with minimum weighted degree is removed as long as it is not the only vertex in its corresponding vertex set. . . . . . . . . . 70

3.4 Quality vs. the percentage of errors: SCARE maintains high precision by making the best use of δ, the allowed amount of changes. . . . . . . . . 76

3.5 δ controls the amount of changes to apply to the database: a small δ guarantees high precision at the cost of the recall and vice versa. . . . . 78

3.6 Using SCARE in an iterative way helps improve the recall and the overall quality of the updates. The decrease in the precision is small compared to the increase in the recall, achieving an overall high quality improvement demonstrated by the f-measure. . . . . . . . . . . . . . . . . . . . . . . 80


3.7 Increasing the number of partition functions |H| improves the accuracy of the predictions and hence increases the precision. The recall is not affected much because we use a fixed δ. . . . . . . . . . . . . . . . . . . 81

3.8 SCARE scalability when varying the database size. . . . . . . . . . . . 82

3.9 Comparison between SCARE and ERACER to predict missing values. Generally, both SCARE and ERACER show high accuracy in predicting the missing values. In this experiment SCARE uses a Naïve Bayes model, while ERACER leverages domain knowledge interpreted in a carefully designed Bayesian network. . . . . . . . . . . . . . . . . . . . . . . . . 84

4.1 Process for behavior-based record linkage. . . . . . . . . . . . . . . . . 89

4.2 Retail store running example. . . . . . . . . . . . . . . . . . . . . . . . 92

4.3 Action patterns in the complex plane and the effect on the magnitude. 98

4.4 Behavior linkage overall quality. . . . . . . . . . . . . . . . . . . . . . . 112

4.5 Improving the textual matching quality. . . . . . . . . . . . . . . . . . 114

4.6 Behavior linkage quality vs. different splitting probabilities and behavior exhaustiveness. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

4.7 Behavior linkage quality vs. behavior contiguousness and percentage of overlapping entities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

4.8 Behavior linkage performance. . . . . . . . . . . . . . . . . . . . . . . . 118

5.1 APIs of the core operations . . . . . . . . . . . . . . . . . . . . . . . . . 121

5.2 ABA operation using web tables . . . . . . . . . . . . . . . . . . . . . . . 123

5.3 InfoGather System Architecture . . . . . . . . . . . . . . . . . . . . . . 135

5.4 The distribution of the number of columns per web table and statistics about the relational web tables . . . . . . . . . . . . . . . . . . . . . . 144

5.5 Augmenting-By-Attribute (ABA) evaluation . . . . . . . . . . . . . . . . . 148

5.6 Sensitivity of the precision and coverage to the number of examples. The Holistic shows high precision and maintains high coverage in comparison to DMA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

5.7 Joint sensitivity analysis to the number of examples and the head vs. tail records in the web tables. The Holistic is robust in comparison to the DMA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

5.8 Web tables matching accuracy . . . . . . . . . . . . . . . . . . . . . . . . 152

5.9 Response time evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 153


ABSTRACT

Yakout, Mohamed A. Ph.D., Purdue University, August 2012. Guided Data Cleaning. Major Professor: Ahmed K. Elmagarmid.

Until recently, all data cleaning techniques have focused on providing fully automated solutions, which are risky to rely on, without efficiently and effectively considering collaboration with the data users and other available resources. This dissertation studies techniques to involve data users directly and indirectly, as well as leveraging the WWW, specifically web tables, for data cleaning tasks. In particular, the dissertation addresses four key challenges for guided data cleaning.

The first challenge relates to directly involving users in the data cleaning process. The goal is to efficiently combine the best of both worlds: the fidelity of user feedback in guiding the cleaning process and the ability of existing automatic cleaning techniques to suggest cleaning updates. For this purpose, we develop the necessary principles to reason about which questions to forward to the user, using a novel combination of decision theory and active learning.

The second challenge is scalability, as existing automatic cleaning techniques are not scalable. We introduce a new approach based on statistical machine learning techniques. We achieve scalability by introducing a robust mechanism to partition the database and then aggregate the final cleaning decisions from the several partitions.

The third challenge relates to involving users indirectly in a data cleaning task. We notice that users’ actions (or behavior), which can be found in system logs, can be useful evidence for the task of deduplicating the users themselves. We develop the necessary pattern detection and modeling algorithms for this purpose.


Finally, the fourth challenge relates to leveraging the WWW for data cleaning tasks. We address the problem of finding missing values (or entity augmentation) using web tables. Our solution relies on aggregating answers from several web tables that directly and indirectly match the user’s entities. We model this problem as topic-sensitive PageRank, which captures the holistic semantic match of a web table to the topic of the list of entities.

Our experimental evaluations using real-world datasets demonstrate the effectiveness and efficiency of our proposed approaches to improve the quality of dirty databases.


1. INTRODUCTION

This dissertation studies techniques to involve data users (or domain experts), in addition to leveraging data on the Web, in data cleaning tasks. The purpose is to achieve better data quality efficiently and effectively.

Data quality issues are of several kinds, e.g., inaccuracy, inconsistency, duplicates and incompleteness, and may arise for several reasons, e.g., misspelling, integration from heterogeneous sources, and software bugs. Poor data quality is a fact of life for most organizations and can have serious implications for their efficiency and effectiveness [1].

Data quality experts estimate that erroneous data can cost a business as much as 10 to 20% of its total system implementation budget [2]. They agree that as much as 40 to 50% of a project budget might be spent correcting data errors in time-consuming, labor-intensive and tedious processes. The proliferation of data also heightens the relevance of data cleaning and makes the problem more challenging: more sources and larger amounts of data imply a larger variety and intricacy of data quality problems and higher complexity for maintaining the quality of the data in a cost-effective way. Data quality is just as important in the health care domain: in such critical applications, incorrect information about patients in an Electronic Health Record (EHR) may lead to inconsistent treatments and prescriptions, which may consequently cause severe medical problems, including death. As a result, various computational procedures for data cleaning have been proposed by the database community to (semi-)automatically identify errors and, when possible, correct them.

Most existing approaches to cleaning dirty databases rely either on predefined data quality rules that should be satisfied by the database or on machine learning techniques. Most of these techniques focus on providing fully automated solutions using different heuristics, which can be risky, especially for critical data. To guarantee that the best desired quality updates are applied to the database, users (domain experts) should be involved to confirm updates. This highlights the increasing need for a technique that combines the best of both worlds.

There are other cases where involving users or relying on data cleaning rules will not be helpful, for example, when there are many missing values in the database. Consider the following scenario in an enterprise database, where we have a table containing a list of companies, but their location or contact information is missing. Neither rules nor correlations among the attributes are helpful in this case, and a user has to collect all this information manually. The WWW is usually helpful in such situations as it covers a large spectrum of domains. This highlights the need for techniques to automatically leverage the WWW for such data cleaning tasks.

In this chapter, we start by highlighting the key challenges we address in this dissertation in Section 1.1, and we then discuss the related work in Section 1.2. Section 1.3 summarizes the contributions and goals of this dissertation in view of the challenges presented in Section 1.1. Finally, Section 1.4 outlines the structure of this dissertation.

1.1 Key Challenges

In this section, we highlight some of the challenges in data cleaning. We focus on the four key challenges around which the contributions of this dissertation revolve. The challenges are related to efficiently involving the user in the data cleaning process, the scalability of the techniques that generate cleaning updates, leveraging user-generated logs for a data cleaning task, and finally, leveraging the WWW for data cleaning.


1.1.1 User’s Direct Interaction for Data Cleaning

Existing automated solutions for data cleaning can be used as generators of data cleaning updates. The data users can then be involved to inspect such updates and confirm the correct ones. However, involving the user can be very expensive because of the large number of possibilities to be verified. Since automated techniques for data cleaning produce far more updates than one can expect the user to handle, techniques for selecting the most useful updates for presentation to the user become very important.

The key challenge in involving users is to determine how and in what order suggested updates should be presented to them. This requires developing a set of principled measures that estimate the improvement in quality in order to reason about the selection of possible updates, as well as investigating machine learning techniques to minimize user effort. The goal is to achieve a good trade-off between high-quality data and minimal user involvement.

1.1.2 Scalable Data Cleaning Techniques

For inconsistent databases, we focus on solutions that provide cleaning updates in the form of value modifications. Most existing solutions follow constraint-based repairing approaches [3–5], which search for a minimal change of the database that satisfies a predefined set of constraints. While a variety of constraints (e.g., integrity constraints, conditional functional and inclusion dependencies) can detect the presence of errors, they are recognized to fall short of guiding the correction of these errors and, worse, may introduce new errors when repairing the data [6]. Moreover, despite the research conducted on integrity constraints to ensure the quality of the data, in practice, databases often contain a significant amount of non-trivial errors. These errors, both syntactic and semantic, are generally subtle mistakes that are difficult or even impossible to express using the general types of constraints available in modern DBMSs [7]. This highlights the need for different techniques to clean dirty databases.

Statistical machine learning (ML) techniques (e.g., decision trees, Bayesian networks) can capture dependencies, correlations, and outliers from datasets based on various analytic, predictive or computational models [8]. Existing efforts in data cleaning using ML techniques have mainly focused on data imputation (e.g., [7]) and deduplication (e.g., [9]). To the best of our knowledge, no approach has considered ML techniques for repairing databases by value modification.

Involving ML techniques in repairing erroneous data is not straightforward and raises four major challenges: (1) Several attribute values (of the same record) may be dirty. Therefore, the process is not as simple as predicting values for a single erroneous attribute; it requires accurate modeling of the correlations between the database attributes, assuming a subset is dirty and its complement is reliable. (2) An ML technique can predict an update for each tuple in the database, and the question is how to distinguish the predictions that should be applied. Therefore, a measure to quantify the quality of the predicted updates is required. (3) Over-fitting may occur when modeling a database with a large variety of dependencies that hold locally for data subsets but do not hold globally. (4) Finally, the process of learning a model from a very large database is expensive, and the prediction model itself may not fit in main memory. Despite the existence of scalable ML techniques for large datasets, they are either model dependent (i.e., limited to specific models, for example SVM [10]) or data dependent (e.g., limited to specific types of datasets such as scientific data and document repositories). Note that scalability is also an issue for the constraint-based repairing approaches [11].

Such limitations motivate the need for effective and scalable methods to accurately predict cleaning updates with statistical guarantees.


1.1.3 User’s Indirect Interaction for Data Cleaning

There are situations where the users’ interactions with a system are registered in a transaction log. This raises the following question: is it possible to leverage such a log for a data cleaning task? In this case, the users are indirectly involved in data cleaning, in contrast to the direct interaction we discussed in Section 1.1.1. Specifically, we focus on the task of deduplication, or record linkage.

Record linkage is the process of identifying records that refer to the same real-world entity. There has been a large body of research on this topic (refer to [9] for a recent survey). While most existing record linkage techniques focus on simple attribute similarities, more recent techniques consider richer information extracted from the raw data to enhance the matching process (e.g., [12–15]).

In contrast to most existing techniques, we consider entity behavior as a new source of information to enhance record linkage quality. We observe that by interpreting massive transactional datasets, for example transaction logs, we can discover behavior patterns and identify entities based on these patterns. Various applications, such as retail stores, web sites, and surveillance systems, maintain transaction logs that track the actions performed by entities over time. Entities in these applications usually perform actions, e.g., buying a specific quantity of milk at a specific point in time or browsing specific pages within a web site, which represent their behavior vis-à-vis the system.

To further motivate the importance of using behavior for record linkage, consider the following real-life example. Yahoo recently acquired a Jordanian Internet company called Maktoob, which, similar to Yahoo, provides a large number of Internet services to its customers in the region, such as e-mail, blogs, news, and online shopping. It was reported that with this acquisition, Yahoo will be able to add the 16 million Maktoob users to its 20 million users from the Middle East region¹. Clearly, Yahoo should expect that the overlap between these two groups of users can be quite significant, and hence the strong need for record linkage. However, the user profile information stored by both companies may not be reliable enough because of different languages, unreal information, etc. In this scenario, analyzing the users’ behavior, in terms of how they use the different Internet services, will be an invaluable source of information to identify potentially common users. Record linkage analysis based on entity behavior also has many other applications, for example, identifying common customers for stores that are considering a merger, tracking users accessing web sites from different IP addresses, and helping in crime investigations.

¹http://www.techcrunch.com/2009/08/25/confirmed-yahoo-acquires-arab-internet-portal-maktoob/

A seemingly straightforward strategy to match two entities is to measure the similarity between their behaviors. However, a closer examination shows that this strategy may not be useful, for the following reasons. It is usually the case that complete knowledge of an entity’s behavior is not available to both sources, since each source is only aware of the entity’s interaction with that same source. Hence, the comparison of entities’ “behaviors” will in reality be a comparison of their “partial behaviors”, which can easily be misleading. Moreover, even in the rare case when both sources have almost complete knowledge about the behavior of a given entity (e.g., a customer who did all his grocery shopping at Walmart for one year and then at Safeway for another year), the similarity strategy will still not help. The problem is that many entities do have very similar behaviors, and hence measuring the similarity can at best group entities with similar behavior together (e.g., [16–18]), but not find their unique matches.

The key challenge is to devise an alternative strategy to match entities using their behavior, because straightforward similarity is not the right way to address this problem. This raises further challenges: how to devise a matching function for entities’ behavior, how to represent and model entities’ behavior, and finally, how to design an efficient solution that handles the expected large volume of transactions.
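To make one possible alternative concrete before the full treatment in Chapter 4, the sketch below scores a candidate match by how much better two transaction logs compress together than apart, a rough proxy for shared behavior patterns. The use of zlib, the toy log format, and the function names are illustrative assumptions, not the dissertation’s actual algorithms.

import zlib

def compressed_size(log):
    """Bytes needed to store the serialized transaction log after zlib compression."""
    return len(zlib.compress("\n".join(log).encode("utf-8"), 9))

def merge_gain(log_a, log_b):
    """How much the merged log compresses beyond the two logs compressed separately.
    Intuition: if both logs come from the same entity, the merged log repeats the
    same behavior patterns and should compress comparatively better."""
    return (compressed_size(log_a) + compressed_size(log_b)
            - compressed_size(log_a + log_b))

# Toy usage with hypothetical retail transactions.
log_x = ["mon 09:00 milk x1", "mon 09:05 bread x2", "tue 09:00 milk x1"]
log_y = ["wed 09:01 milk x1", "thu 09:00 milk x1", "thu 09:06 bread x2"]
log_z = ["wed 21:30 soda x6", "fri 23:10 chips x3"]
print(merge_gain(log_x, log_y), merge_gain(log_x, log_z))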


1.1.4 Leveraging the WWW for Data Cleaning

As we mentioned earlier, there are cases where none of the automated cleaning techniques can be helpful and we have to rely on external data sources. The WWW is the richest data source, and the question here is how to effectively and efficiently leverage the WWW for data cleaning.

The Web contains a vast corpus of HTML tables. In this dissertation, we focus on one class of HTML tables: entity-attribute tables (also referred to as relational tables [19, 20] and 2-dimensional tables [21]). Such a table contains values of multiple entities on multiple attributes, each row corresponding to an entity and each column corresponding to an attribute. Cafarella et al. reported 154M such tables from a snapshot of Google’s crawl in 2008; we extracted 573M such tables from a recent crawl of the Microsoft Bing search engine. Henceforth, we refer to such tables simply as web tables.

Consider an enterprise database where we have a table about companies and all (or most) of their contact information is missing, or consider a product database with a table about digital cameras. In the cameras table, the camera model name is provided, but some other attributes, such as brand, resolution, price and optical zoom, have missing values. We call these attributes augmenting attributes and the process of finding the missing attribute values entity augmentation.

Such augmentation would be difficult to perform using an enterprise database or an ontology because the entities can be from any arbitrary domain. Today, users try to manually find the web sources containing this information and assemble the values. Assuming that this information is available, albeit scattered, in various web tables, we can save a lot of time and effort if we can perform this operation automatically. This requires discovering semantic matching relationships between the web tables, the result of which is a Semantic Matching Web tables (SMW) graph. Constructing and processing such a large graph is a big challenge.
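To give a flavor of how such a graph could be processed (the actual framework is developed in Chapter 5, modeled as topic-sensitive PageRank), the toy sketch below runs a plain power-iteration personalized PageRank over a tiny, hypothetical SMW graph, propagating relevance from tables that directly match a query to tables that only match indirectly. The graph, the edge weights, and the function itself are illustrative assumptions only.

def personalized_pagerank(graph, seeds, alpha=0.85, iters=50):
    """graph: {table: {neighbor_table: match_score}}; seeds: tables that directly match the query.
    Returns a relevance score for every table, biased toward the seed tables."""
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in graph}
    rank = dict(teleport)
    for _ in range(iters):
        nxt = {n: (1.0 - alpha) * teleport[n] for n in graph}
        for u, edges in graph.items():
            total = sum(edges.values()) or 1.0
            for v, w in edges.items():
                nxt[v] += alpha * rank[u] * (w / total)
        rank = nxt
    return rank

# Hypothetical SMW graph: nodes are web tables, edges carry pairwise semantic-match scores.
smw = {
    "cameras_prices": {"cameras_specs": 0.8, "gadgets_misc": 0.1},
    "cameras_specs":  {"cameras_prices": 0.8, "camera_reviews": 0.6},
    "camera_reviews": {"cameras_specs": 0.6},
    "gadgets_misc":   {"cameras_prices": 0.1},
}
print(personalized_pagerank(smw, seeds={"cameras_prices"}))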


The challenges and requirements for such an operation are: (i) high precision (#corraug/#aug) and high coverage (#aug/#entity), where #corraug, #aug and #entity denote the number of entities correctly augmented, the number of entities augmented and the total number of entities, respectively; (ii) fast (ideally interactive) response times; and (iii) applicability to entities of any arbitrary domain.
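Written as display equations, the two quality measures above are:

\[
\text{precision} = \frac{\#\text{corraug}}{\#\text{aug}},
\qquad
\text{coverage} = \frac{\#\text{aug}}{\#\text{entity}}.
\]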

1.2 Related Work

Improving data quality has been the focus of a large body of research for decades. Our work is closely related to the following research areas: (i) constraint-based data repair, (ii) statistical machine learning techniques for data cleaning, (iii) interactive systems for data cleaning and user modeling, and (iv) leveraging the WWW for data integration tasks.

1.2.1 Constraint-based Data Cleaning

This approach has two main steps: (1) identify a set of constraints that should be followed by the data, and then (2) use the constraints and the data to find another consistent database that minimally differs from the original database (e.g., [3–5, 22–26]). Most earlier work (except [3, 23, 24, 26, 27]) considers traditional full and denial dependencies, which subsume functional dependencies (FDs). The repair algorithm in [23] uses traditional FDs and inclusion dependencies (INDs) to derive repairs, while [4] is applicable to restricted denial constraints. The work in [23] uses equivalence classes to group the attribute values that are equivalent in obtaining a final consistent database instance. The repair approach in [3] uses CFDs for data repair and is considered a non-trivial extension of the repair algorithms described in [23]; the proposed algorithms are based on a cost-based greedy heuristic to decide upon a repair of errors. The work in [5] uses FDs to map the repairing problem to a hyper-graph optimization problem, where a heuristic vertex cover algorithm helps find the minimal number of attribute values to modify in order to obtain a database consistent with the FDs. The main drawback of these approaches is that the data should be covered by a set of constraints that have been specified or validated by domain experts, which may be an expensive manual process and may not be affordable for all data domains. Moreover, the constraints usually fall short of correctly identifying the right fixes [6].

In the literature, several classes of data quality rules have been introduced. For example, Conditional Functional Dependencies (CFDs) [28] extend standard functional dependencies (FDs) with conditional pattern tableaux that define the subset of tuples, or context, in which the underlying FD holds. Matching Dependencies (MDs) [11] are similar to FDs, but take into account similarity between values instead of exact matching. Record matching rules [29] and Dedupalog [30] are used to identify duplicate records.

CFDs have been studied extensively due to their usefulness as integrity constraints that summarize data semantics and identify data inconsistencies. Prior work has focused on consistency and implication analysis for CFDs [28], propagation of CFDs from source data to views in data integration [31], extensions of CFDs with disjunction and negation [32] or with ranges [33], and estimating CFD confidence [34]. Consequently, competing algorithms for discovering CFDs were introduced in [33, 35, 36].

1.2.2 Machine Learning Techniques for Data Cleaning

Data cleaning using ML techniques has mainly focused on deduplication (refer to [9] for a survey), data imputation (e.g., [7, 37]) and error detection (e.g., [8, 38]). To the best of our knowledge, the problem of using ML techniques for repairing dirty databases by value modification has not been addressed.

In data imputation, for example, [7] uses relational learning to learn the characteristics of the attribute relationships in a relational database. The learnt model is then used to infer the missing values. This technique requires a priori knowledge about the relationships between the attributes to construct the appropriate Bayesian network for learning. Most similar techniques for data imputation are limited to numerical or categorical attributes.

The main challenges for these techniques are (i) scalability, so that large databases can be modeled with all existing data correlations, and (ii) the accuracy of the predicted replacement values, since existing methods usually capture either local or global data relationships but do not combine both views. Despite the existence of scalable ML techniques for large datasets, they are either model dependent (i.e., limited to specific models, for example SVM [10]) or data dependent (e.g., limited to specific types of datasets such as scientific data and document repositories). Scalability is also an issue for the constraint-based repairing approaches [11].

1.2.3 Involving Users in the Data Cleaning Process

Most existing systems for data cleaning provide tools for data exploration and transformation without taking advantage of recent efforts on automatic data repair. Usually, the repair actions are “explicitly specified by the user”. For example, AJAX [39] proposes a declarative language to eliminate duplicates during data transformations. Potter’s Wheel [40] combines data transformations with the detection of errors in the form of irregularities. None of these systems efficiently leverages user feedback through ranking or learning mechanisms.

A recent approach to repairing critical data with quality guarantees was introduced in [6], where it is assumed that correct reference data exists and the user is required to specify certain attributes as correct across the entire dataset. Moreover, the proposed solution relies on a pre-specified set of editing rules.

Previous work on soliciting user feedback to improve data quality focuses mostly on two objectives: (i) identifying correctly matched references in large-scale integrated data (e.g., [41–44]) or for duplicate elimination (e.g., [45]); and (ii) improving the prediction quality of a learning model by taking into account data acquisition costs (e.g., cost-sensitive learning [46], utility-based learning [47], active learning [48], selective supervision [49], and selective repeated labeling [50]).

The work in [41] and [44] addresses incorporating user feedback into schema matching tasks. [42] introduced a framework that provides many users with candidate matches, without any ranking or selection mechanism, and then combines the responses to converge to a correct answer. In [43], a decision-theoretic framework was proposed to rank candidate reference matches to improve the quality of query responses in a dataspace. This framework is limited to soliciting user feedback to resolve candidate matches from a dataspace and cannot be applied in a constrained repair framework for relational databases. [45] introduced an active-learning-based approach to build a generic matching function for identifying duplicate records.

Selective supervision [49] combines decision theory with active learning. It uses a value-of-information approach for selecting unclassified cases for labeling. Selective repeated labeling [50] assumes that user feedback is unreliable and combines label uncertainty with active learning to select instances for repeated labeling. The overall goal of these approaches is to reduce the uncertainty in the predicted output without regard to how important those predictions are to the quality of the underlying database.

For approaches that leverage user or entity behavior for entity deduplication, a closely related area is user-adaptive systems for web navigation and information retrieval (e.g., [16–18]). Most of these techniques focus on statistically modeling user interactions to extract domain-specific features that capture user preferences. These models focus on the statistical significance of the extracted features and may take into account the sequence of user actions. However, they do not take into account the time dimension to determine repeated patterns of actions. They are better suited to determining groups of common behaviors and may be used to evaluate similarities between entities; however, they cannot help compute pairwise matches between entities based on their registered actions.


1.2.4 WWW for Data Integration and Data Cleaning

The most closely related work is the Octopus system developed by Cafarella et al. [20]. The Extend operator proposed by Octopus is similar to the operation of finding missing values using web tables. Octopus uses the web search API to retrieve matching tables, which does not have any well-defined semantics. Since web search is not meant for matching tables, in many cases the top 1000 returned URLs do not provide any matching tables. Moreover, Octopus needs to invoke the search API for each user record and perform clustering of web tables at query time, leading to prohibitive performance even for small databases. This highlights the need for a different approach that relies on well-defined semantic matching between the web tables and the user table (the table with missing values). Moreover, the approach needs to perform most of the “heavy lifting” in a preprocessing step so that we obtain fast response times at query time.

Researchers have developed techniques to annotate web tables with column names and names of relationships [51, 52]. These techniques can help build a better Semantic Matching Web tables (SMW) graph.

Building the SMW graph is related to the vast body of work on schema matching [53–55]. Most modern approaches use several base techniques, such as linguistic matching of attribute names and detecting overlap of data instances, and combine them to determine the final matchings; the base techniques as well as the combiner can be either machine-learning-based techniques or non-learning methods [41, 56]. In contrast to enterprise tables, web tables have more features that can be leveraged to obtain better semantic matching, for example, the context (i.e., the text in the web page) from which the table came.

There exists a rich body of work on leveraging HTML lists for set expansion and table augmentation [57, 58]. However, the focus is on discovering more entities rather than augmenting the provided entities.


1.3 Contributions

The central thesis of this dissertation can be stated as follows: A complete and effective solution for improving data quality is likely to depend on a close collaboration between humans, in the form of data users (or domain experts), and machines, in the form of automated solutions for cleaning dirty databases, in addition to leveraging other information resources such as the enormous amount of data on the Web.

The data users must be in the loop, because automatic cleaning techniques may cause undesired changes to the database, and the data quality may even get worse. Moreover, exploring other information resources is sometimes helpful; for example, it is common for users to refer to the WWW to search for accurate information to correct the database.

We claim the following list of contributions:

• We propose a novel interactive data cleaning framework for Guided Data Repair (GDR) that tackles the problem of data cleaning from a more realistic and pragmatic viewpoint. GDR interactively involves the user directly in guiding the cleaning process alongside existing automatic cleaning techniques. The goal is to involve users effectively so as to achieve better data quality as quickly as possible. The basic intuition is to continuously consult the user on the cleaning updates that are most beneficial in improving the data quality as we go.

• Since existing automatic data cleaning approaches are not scalable, we introduce a new approach based on machine learning techniques. The objective is to build a new generator of data cleaning updates to be used within GDR for large databases. Our approach relies on maximizing the data likelihood given the underlying data distribution, which can be modeled using ML techniques. We achieve scalability by introducing a mechanism for horizontal data partitioning that enables parallel processing of data blocks; various ML methods can be applied to provide “local” predictions that are then combined to obtain final accurate predictions (a small illustrative sketch of this partition-and-aggregate idea follows this list).


• We introduce a novel technique to involve the user indirectly in a data cleaning task: we leverage the transaction logs generated by users (or entities) to perform entity deduplication, or record linkage. We present the first formulation of this problem and introduce statistical techniques to model entity behavior. Our approach to matching entities using their behavior does not rely on measuring behavior similarities. Instead, we first merge the entities’ transactions (or behavior) and measure the gain in recognizing behavior patterns in the merged log. Since the transaction log is expected to be large, we introduce efficient, fast techniques that produce candidate matches by computing approximate summaries of the entities’ behaviors.

• To effectively and efficiently leverage the WWW for a data cleaning task, we propose a novel approach that relies on web tables to augment entities on attributes with missing values. The core of our approach is to match the dirty table (or query table) against the web tables to find the relevant matching web tables, which are then used to obtain the missing values. We develop a novel holistic matching framework based on topic-sensitive PageRank (TSP) over the SMW graph. We argue that by considering the query table as a topic and web tables as documents, we can efficiently model the holistic matching as TSP. We propose a system architecture that leverages preprocessing in MapReduce to achieve extremely fast (interactive) response times at query time. Finally, we present a machine-learning-based technique for building the SMW graph. Our key insight is that the text surrounding the web tables is important in determining whether two web tables match. We propose a novel set of features that leverage this insight.
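As flagged in the second contribution above, the following is a minimal sketch of the partition-and-aggregate idea (the full SCARE approach appears in Chapter 3). The round-robin partitioning, the simple per-block conditional-mode “model”, and the support-weighted voting are illustrative assumptions standing in for the mechanisms developed in the dissertation.

from collections import Counter, defaultdict

def partition(records, num_parts):
    """Horizontal partitioning: each block can be modeled independently (and in parallel)."""
    blocks = defaultdict(list)
    for i, rec in enumerate(records):
        blocks[i % num_parts].append(rec)
    return blocks.values()

def local_model(block):
    """Stand-in for any ML model: predict CT from ZIP by the block-local mode, with its support."""
    by_zip = defaultdict(Counter)
    for rec in block:
        by_zip[rec["ZIP"]][rec["CT"]] += 1
    return {z: counts.most_common(1)[0] for z, counts in by_zip.items()}

def repair_predictions(records, num_parts=4):
    """Aggregate the local predictions: keep, per ZIP, the value with the largest total support."""
    votes = defaultdict(Counter)
    for block in partition(records, num_parts):
        for z, (city, support) in local_model(block).items():
            votes[z][city] += support
    return {z: counts.most_common(1)[0][0] for z, counts in votes.items()}

# Toy usage on records resembling the Customer relation used in Chapter 2.
rows = [{"ZIP": "46825", "CT": "FORT WAYNE"}, {"ZIP": "46825", "CT": "FORT WYNE"},
        {"ZIP": "46825", "CT": "FORT WAYNE"}, {"ZIP": "46360", "CT": "MICHIGAN CITY"}]
print(repair_predictions(rows, num_parts=2))  # e.g. {'46825': 'FORT WAYNE', '46360': 'MICHIGAN CITY'}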

For each of our proposed techniques, we implemented a research prototype demonstrating its applicability. Moreover, we conducted experimental studies using realistic datasets to validate the effectiveness of our approaches.


1.4 Outline

The rest of the dissertation is organized as follows: Chapter 2 describes the GDR framework for guided data repair. In Chapter 3, we introduce our scalable automatic repair approach. Chapter 4 introduces our approach to leveraging users indirectly for the data cleaning task of deduplication. The approach that uses web tables for augmenting entities is described in Chapter 5. Finally, Chapter 6 concludes the dissertation and points out directions for future work.

Parts of this dissertation have been published in conferences. In particular, the work on guided data repair (Chapter 2) is described in a paper [59] in the Proceedings of the 2011 International Conference on Very Large Databases (PVLDB 2011), and our implemented system for GDR was accepted for demonstration [60] at the 2010 International Conference on Management of Data (SIGMOD 2010). The work on leveraging entity behavior for deduplication (Chapter 4) is described in a paper [61] in the Proceedings of the 2010 International Conference on Very Large Databases (PVLDB 2010). Finally, the work that leverages web tables for entity augmentation (Chapter 5) is described in a paper [62] in the 2012 International Conference on Management of Data (SIGMOD 2012).


2. GUIDED DATA REPAIR

In this chapter, we introduce GDR, a framework for guided data repair that efficiently involves the user directly in the data cleaning process. We describe the framework components and the principles upon which we rely to reason about the interaction with the data user. The objective is to converge faster to better data quality with minimal user involvement.

The chapter is organized as follows: Section 2.1 provides a motivating example for our approach, and Section 2.2 describes the problem and our solution approach. In Section 2.3, we discuss our mechanism for generating candidate cleaning updates. In Section 2.4, we develop a principled approach for ranking the questions presented to the user for feedback. We experimentally evaluate GDR in Section 2.5 and, finally, summarize the chapter in Section 2.6.

2.1 Introduction

A recent approach for repairing dirty databases is to use data quality rules in the form of database constraints to identify tuples with errors and inconsistencies and then use these rules to derive updates to these tuples. Most of the existing data repair approaches (e.g., [3, 4, 23, 25]) focus on providing fully automated solutions using different heuristics to select updates that would introduce minimal changes to the data, which could be risky, especially for critical data. To guarantee that the best desired quality updates are applied to the database, users (domain experts) should be involved to confirm updates. This highlights the increasing need for a framework that combines the best of both worlds: one that automatically suggests updates while efficiently involving users to guide the cleaning process.


     Name   SRC  STR            CT             STT  ZIP
t1:  Jim    H1   REDWOOD DR     MICHIGAN CITY  MI   46360
t2:  Tom    H2   REDWOOD DR     WESTVILLE      IN   46360
t3:  Jeff   H2   BIRCH PARKWAY  WESTVILLE      IN   46360
t4:  Rick   H2   BIRCH PARKWAY  WESTVILLE      IN   46360
t5:  Joe    H1   BELL AVENUE    FORT WAYNE     IN   46391
t6:  Mark   H1   BELL AVENUE    FORT WAYNE     IN   46825
t7:  Cady   H2   BELL AVENUE    FORT WAYNE     IN   46825
t8:  Sindy  H2   SHERDEN RD     FT WAYNE       IN   46774

(a) Data

ϕ1 : (ZIP → CT, STT, {46360 ∥ MichiganCity, IN})

ϕ2 : (ZIP → CT, STT, {46774 ∥ NewHaven, IN})

ϕ3 : (ZIP → CT, STT, {46825 ∥ FortWayne, IN})

ϕ4 : (ZIP → CT, STT, {46391 ∥ Westville, IN})

ϕ5 : (STR, CT → ZIP, { ,FortWayne ∥ })

(b) CFD Rules

Fig. 2.1.: Example data and rules

Motivating Example

Consider the following example. Let relation Customer(Name, SRC, STR, CT, STT,

ZIP) specify personal address information: Street (STR), City (CT), State (STT), and Zip

code (ZIP), in addition to the source (SRC) of the data or the data entry operator. An

instance of this relation is shown in Figure 2.1(a).

Data quality rules can be defined in the form of Conditional Functional Dependen-

cies (CFDs) as described in Figure 2.1(b). A CFD is a pair consisting of a standard

Functional Dependency (FD) and a pattern tableau that specifies the applicability of

the FD on parts of the data. For example, ϕ1 − ϕ4 state that the FD ZIP → CT, STT

(i.e., zip codes uniquely identify city and state) holds in the context where the ZIP is


46360, 46774, 46825 or 46391. Moreover, the pattern tableau enforces bindings be-

tween the attribute values, e.g., if ZIP= 46360, then CT= ‘Michigan City’. ϕ5 states

that the FD STR, CT → ZIP holds in the context where CT = ‘Fort Wayne’, i.e., street

names uniquely identify the zip codes whenever the city is ‘Fort Wayne’. Note that

all the tuples in Figure 2.1(a) have violations.

Typically, a repairing algorithm will use the rules and the current database in-

stance to find the best possible repair operations or updates. For example, t5 violates

ϕ4 and a possible update would be to either replace CT by ‘Westville’ or replace ZIP

by 46825, which would make t5 fall in the context of ϕ3 and ϕ5 but without violations.

To decide which update to apply, different heuristics can be used [4, 23].

However, automatic changes to data can be risky especially if the data is critical,

e.g., choosing the wrong value among the possible updates. On the other hand,

involving the user can be very expensive because of the large number of possibilities

to be verified. Since automated methods for data repair produce far more updates

than one can expect the user to handle, techniques for selecting the most useful

updates for presentation to the user become very important.

Moreover, to efficiently involve the user in guiding the cleaning process, it is

helpful if the suggested updates are presented in groups that share some contextual

information. This will make it easier for the user to provide feedback. For example,

the user can quickly inspect a group of tuples where the value ‘Michigan City’ is

suggested for the CT attribute. Similar grouping ideas have been explored in [45].

In the example in Figure 2.1, let us assume that a cleaning algorithm suggested

two groups of updates. In the first group, the updates suggest replacing the attribute

CT with the value ‘Michigan City’ for t2, t3, and t4 while in the second group they

suggest replacing the attribute ZIP with the value 46825 for t5 and t8. Let us assume

further that we were able to obtain the user feedback on the correct values for these

tuples; namely that the user has confirmed ‘Michigan City’ as a correct value of CT

for t2, t3, but as incorrect for t4, and 46825 as the correct value of ZIP for t5, but

as incorrect for t8. In this case, consulting the user on the first group, which has


more correct updates, is better and would allow for faster convergence to a cleaner

database instance as desired by the user. The second group will not lead to such fast

convergence.

Finally, in our example, we could recognize correlations between the attribute

values in a tuple and the correct updates. For example, when SRC = ‘H2’, the CT

attribute is incorrect most of the time, while the ZIP attribute is correct. This is

an example of recurrent mistakes that exist in real data. Patterns like that with

correlations between the original tuple and the correct updates, if captured by a

machine learning algorithm, can reduce user involvement.

The key challenge in involving users is to determine how and in what order sug-

gested updates should be presented to them. This requires developing a set of princi-

pled measures to estimate the improvement in quality to reason about the selection

process of possible updates as well as investigating machine learning techniques to

minimize user effort. The goal is to achieve a good trade-off between high quality

data and minimal user involvement.

In this chapter, we propose to tackle the problem of data cleaning from a more

realistic and pragmatic viewpoint. We present GDR, a framework for guided data

repair, that interactively involves the user in guiding the cleaning process alongside

existing automatic cleaning techniques. The goal is to effectively involve users in a

way to achieve better data quality as quickly as possible. The basic intuition is to

continuously consult the user for updates that are most beneficial in improving the

data quality as we go.

We use CFDs [63] as the data quality rules to derive candidate updates. CFDs

have proved to be very useful for data quality and have triggered several efforts (e.g., [33, 36])

for their automatic discovery, which makes them a practical choice for data repair

techniques.

We summarize the contributions of this chapter as follows:

• We introduce GDR, a framework for data repair, that selectively acquires user

feedback on suggested updates. User feedback is used to train the GDR machine


learning component that can take over the task of deciding the correctness of

these updates.

• We propose a novel ranking mechanism for suggested updates that applies a

combination of decision theory and active learning in the context of data quality

to reason about this task in a principled manner.

• We use the concept of value-of-information (VOI) [64] from decision theory to

develop a mechanism to estimate the update benefit from consulting the user on

a group of updates. We quantify the data quality loss by the degree of violations

to the rules. The benefit of a group of updates can be then computed by the

difference between the data quality loss before and after user feedback. Since we

do not know the user feedback beforehand, we develop a set of approximations

that allow efficient estimations.

• We apply active learning to order the updates within a group such that the

updates that can strengthen the prediction capabilities of the learned model the

most come first. To this end, we assign to each suggested update an uncertainty

score that quantifies the benefit to the prediction model, learning benefit, when

the update is labeled.

We conduct an extensive experimental evaluation on real datasets that shows the

effectiveness of GDR in allowing fast convergence to a better quality database with

minimal user intervention.

2.2 Problem Definition and Solution Overview

2.2.1 Problem Definition

We consider a database instance D with a relational schema S. Each relation

R ∈ S is defined over a set of attributes attr(R) and the domain of an attribute

A ∈ attr(R) is denoted by dom(A). We also consider a set of data quality rules Σ


that represent data integrity semantics. In this work, we consider rules in the form

of CFDs.

CFD Overview A CFD ϕ over R can be represented by ϕ : (X → Y, tp), where

X and Y ∈ attr(R), X → Y is a standard functional dependency (FD), referred to

as FD embedded in ϕ, and tp is a tuple pattern containing all attributes in X and Y .

For each A ∈ (X ∪ Y ), the value of the attribute A for the tuple pattern tp, tp[A], is

either a constant ’a’ ∈ dom(A), or ’−’ which represents a variable value. We denote

X as LHS(ϕ) (left hand side) and Y as RHS(ϕ) (right hand side). Examples for

CFD rules are provided in Figure 2.1.

To denote that a tuple t ∈ D matches a particular pattern tp, the symbol ≍ is

defined on data values and ’−’. We write t[X] ≍ tp[X] iff for each A ∈ X, either

t[A] = tp[A] or tp[A] = ’−’. For example, (Sherden RD, Fort Wayne, IN) ≍ (−,

Fort Wayne, −). We assume that CFDs are provided in the normal form [3], i.e.,

ϕ : (X → A, tp), A ∈ attr(R) and tp is a single pattern tuple.

A CFD ϕ : (X → A, tp) is said to be constant if tp[A] ≠ ’−’. Otherwise, ϕ is a

variable CFD. For example in Figure 2.1, ϕ1 is a constant CFD, while ϕ5 is a variable

CFD.

A database instance D satisfies the constant CFD ϕ = (X → A, tp), denoted by

D |= ϕ, iff for each tuple t ∈ D, if t[X] ≍ tp[X] then t[A] = tp[A]. If ϕ is a variable

CFD, then D |= ϕ iff for each pair of tuples t1, t2 ∈ D, if t1[X] = t2[X] ≍ tp[X] then

t1[A] = t2[A] ≍ tp[A]. This means that if t1[X] and t2[X] are equal and match the

pattern tp[X], then t1[A] and t2[A] must also be equal to each other. CFDs address

a single relation only. However, the repairing algorithm that uses CFDs is applicable

to general relational schemas by simply repairing each relation in isolation.
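
As an illustration of the two satisfaction conditions, the following is a minimal Python sketch; the relation-as-list-of-dicts layout, the ’−’ marker encoding, and the function names are illustrative assumptions, not part of GDR:

    def matches(row, tp, attrs):
        # t[X] matches tp[X]: each attribute equals the pattern value or the pattern is '-'
        return all(tp[a] == '-' or row[a] == tp[a] for a in attrs)

    def satisfies_constant_cfd(db, lhs, rhs, tp):
        # D |= phi iff every tuple matching tp on X also carries the constant tp[A]
        return all(t[rhs] == tp[rhs] for t in db if matches(t, tp, lhs))

    def satisfies_variable_cfd(db, lhs, rhs, tp):
        # D |= phi iff tuples that agree on X (and match tp[X]) also agree on A
        in_context = [t for t in db if matches(t, tp, lhs)]
        return all(t1[rhs] == t2[rhs]
                   for i, t1 in enumerate(in_context)
                   for t2 in in_context[i + 1:]
                   if all(t1[a] == t2[a] for a in lhs))

    db = [{'STR': 'BELL AVENUE', 'CT': 'FORT WAYNE', 'ZIP': '46391'},
          {'STR': 'BELL AVENUE', 'CT': 'FORT WAYNE', 'ZIP': '46825'}]
    tp5 = {'STR': '-', 'CT': 'FORT WAYNE', 'ZIP': '-'}
    print(satisfies_variable_cfd(db, ['STR', 'CT'], 'ZIP', tp5))  # False: the pair violates the rule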

We address the following problems:

• The use of the data quality rules Σ to generate candidate updates for the tuples

that are violating Σ. The rules can be either given or discovered by an automatic

discovery technique (e.g., [33,36]). Usually, the automatic discovery techniques

employ thresholds on the confidence of the discovered rules. In this setting, the

(Figure 2.2 depicts the GDR framework: dirty tuples identification and updates discovery over the input database D and the CFDs repository, grouping of the suggested updates, VOI-based ranking of the groups, the learning component trained by user feedback and ordering updates by uncertainty, and the updates consistency manager that applies user feedback and learner decisions to the database.)

Fig. 2.2.: GDR Framework.

user is the one to guide the repairing process and we assume that user decisions

are consistent with Σ.

• Deciding upon the best groups of updates—as mentioned in Section 2.1— to be

presented to the user during an interactive process for faster convergence and

higher data quality.

• Applying active learning to learn user feedback and use the learned models to

decide upon the correctness of the suggested updates without user’s involve-

ment.

2.2.2 Solution Overview

Figure 2.2 shows the GDR framework and the cleaning process is outlined in

Algorithm 2.1.

GDR guides the user to focus her efforts on providing feedback on the updates

that would improve quality faster, while the user guides the system to automatically

identify and apply updates on the data. This continuous feedback process, illustrated

in steps 3-10 of Algorithm 2.1, runs while there are dirty tuples and the user is available

to give feedback.


Algorithm 2.1 GDR Process(D dirty database, Σ DQRs)

1: Identify dirty tuples in D using Σ and generate and store initial suggested updates in

PossibleUpdates list.

2: Group the candidate updates appropriately.

3: while User is available and dirty tuples exist do

4: Rank groups of updates such that the most beneficial come first.

5: The user selects group c from the top.

6: Updates in c are labeled by learner predictions and the user interactively gives feed-

back on the suggested updates, until the user is satisfied with the learner predictions

or has verified all the updates within c.

7: User feedback and learner decisions are applied to the database.

8: Remove rejected updates from PossibleUpdates and replace as needed.

9: Check for new dirty tuples and generate updates.

10: end while

In Step 1, all dirty tuples that violate the rules are identified and a repairing

algorithm is used to generate candidate updates. In Step 2, we group the updates for

the user in a way that makes it easier for a batch inspection.

The interactive loop in steps 3-10 starts with ranking the groups of updates such

that groups that are more likely to move the database to a cleaner state faster come

first. The user will then pick one of the top groups (c) in the list and provide feedback

through an interactive active learning session (Step 6). (The ranking mechanism and

active learning are discussed in Section 2.4.)

In Step 7, all decisions on suggested updates, either made by the user or the

learner, are applied to the database. In Step 8, the list of candidate updates is

modified by replacing rejected updates and generating new ones for emerging dirty

tuples because of the applied updates.

After getting the user feedback, the violations are recomputed by the consistency

manager and new updates may be proposed. The assumption is that if the user

verifies all the database cells then the final database instance is consistent with the


rules. This guarantees that we are always making progress toward the final consistent

database and the process will terminate.

2.3 Generating Candidate Updates

In this section, we outline the different steps involved in suggesting updates, main-

taining their consistency when applied to the database, and grouping them for the

user.

2.3.1 Dirty Tuples Identification and Updates Discovery:

Once a set Σ of CFDs is defined, dirty tuples can be identified through violations

of Σ and stored in a DirtyTuples list. A tuple t is considered dirty if ∃ ϕ ∈ Σ such

that t ⊭ ϕ, i.e., t violates rule ϕ.

Resolving CFD Violations

A dirty tuple t may violate a CFD ϕ = (R : X → A, tp) in Σ following two possible

cases [3]:

• Case 1: ϕ is a constant CFD (i.e., tp[A] = a, where a is a constant) and t[X] ≍

tp[X] but t[A] ≠ a.

• Case 2: ϕ is a variable CFD, t[X] ≍ tp[X], and ∃t′ such that t′[X] = t[X] ≍

tp[X] but t[A] ≠ t′[A].

The latter case is similar to the violation of a standard FD. Accordingly, given a set Σ

of CFDs, the dirty tuples can be immediately identified and stored in the DirtyTuples

list.

To resolve a violation of a CFD ϕ = (R : X → A, tp) by a tuple t, we proceed as

follows: For case 1, we either modify the RHS(ϕ) attribute such that t[A] = tp[A]

or we change some of the attributes in LHS(ϕ) such that t[X] ≍ tp[X]. For case 2,


we either modify t[A] (resp. t′[A]) such that t[A] = t′[A] or we change some LHS(ϕ)

attributes t[X] (resp. t′[X]) such that t[X] = t′[X] or t[X] ≍ tp[X] (resp. t′[X] ≍

tp[X]).

Example 2.3.1 In Figure 2.1, the normal form of

ϕ1 : (ZIP → CT, STT, {46360 ∥ MichiganCity, IN}) would be

ϕ1,1 : (ZIP → CT, {46360 ∥ MichiganCity}) and

ϕ1,2 : (ZIP → STT, {46360 ∥ IN}).

t2 violates ϕ1,1 : (ZIP → CT, {46360 ∥ MichiganCity}) following case 1. Thus,

a suggested update by changing RHS(ϕ1,1) is to replace ‘Westville’ by ‘Michigan City’

in t2[CT], while another update by changing LHS(ϕ1,1) is to replace ‘46360’ by ‘46391’

in t2[ZIP], for example. t5, t6 both violate ϕ5 following case 2. A possible update is to

change RHS(ϕ5) by modifying t5[ZIP] to be ’46825’ instead of ’46391’. Yet, another

possible update is to make a change in LHS(ϕ5). For example, by changing t5[STR] or

t5[CT] to another value.

We implemented an on demand update discovery process based on the above

mechanism for resolving CFDs violations and generating candidate updates. This

process is triggered to suggest an update for t[A], the value of attribute A in tuple t.

Initially, the process is called for all dirty tuples and their attributes. Later during the

interactions with the user, it is triggered by the consistency manager as a consequence

of receiving user feedback.

The generated updates are tuples in the form rj = ⟨t, A, v, sj⟩ stored in the

PossibleUpdates list, where v is the suggested value in t[A] and sj is the update

score. sj ∈ [0..1] is assigned to each update rj by an update evaluation function

to reflect the certainty of the repairing technique about the suggested update. We

follow the same evaluation approach used in [23] and [3]. Given an update r to mod-

ify t[A] = v such that t[A] = v′, we compute the update evaluation score s as the

similarity between v and v′. This can be done based on the edit distance function

distA(v, v′) as follows


s(r) = sim(v, v′) = 1 − distA(v, v′) / max(|v|, |v′|).    (2.1)

where |v|, |v′| denote the sizes of v and v′, respectively. The intuition here is that the

more accurate v′ is, the closer it is to v. s(r) is in the range [0..1] and any domain

specific similarity function can be used for this purpose. Finally, the update can be

composed in the tuple form r = ⟨t, A, v′, s(r)⟩.
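
To make Eq. 2.1 concrete, here is a small sketch that uses a plain Levenshtein edit distance for distA; the function names and the example strings are only illustrative, and any domain-specific similarity could be plugged in instead:

    def edit_distance(a: str, b: str) -> int:
        # Standard dynamic-programming Levenshtein distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[len(b)]

    def update_score(v: str, v_prime: str) -> float:
        # s(r) = 1 - dist_A(v, v') / max(|v|, |v'|), in [0, 1].
        if max(len(v), len(v_prime)) == 0:
            return 1.0
        return 1.0 - edit_distance(v, v_prime) / max(len(v), len(v_prime))

    print(round(update_score('FT WAYNE', 'FORT WAYNE'), 2))  # 0.8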

Generating Updates

We now show how to use CFDs to generate updates for each potentially dirty

attribute B in t ∈ DirtyTuples. The generated updates are tuples in the form

⟨t, B, v, s⟩, where v is the suggested repair value for t[B] and s is the repair evaluation

score from Eq. 2.1.

The suggested updates correspond to attribute value modifications, which are

enough for CFDs violations [3]. For each dirty tuple t, we store the list of violated

rules in t.vioRuleList. Furthermore, for each pair ⟨t, B⟩, we keep a list of values

⟨t, B⟩.preventedList, which contains values for t[B] that are confirmed as wrong.

Thus, when searching a new suggestion for t[B], the values in ⟨t, B⟩.preventedList

are discarded. Also, we keep a flag ⟨t, B⟩.Changeable that is set to False when the

value in t[B] was confirmed to be correct.

Initially, we assume that each attribute value is incorrect for all t ∈ DirtyTuples

and proceed by searching for the best update value that provides the best

score according to Eq. 2.1. This can be performed by calling Algorithm 2.2,

UpdateAttributeTuple(t, B) for all t ∈ DirtyTuples and B ∈ attr(R).

UpdateAttributeTuple described in Algorithm 2.2 finds the best update value for

t[B] by exploring three possible scenarios:

1. B = A for some violated CFD ϕ = (X → A, tp) and tp[A] ≠ ’−’ (i.e., ϕ is a

constant CFD): This corresponds to case 1 of rule violations where t[X] ≍ tp[X]

and t[A] ≠ tp[A]. In this scenario, the value v = tp[A] is suggested (lines 4-6).


Algorithm 2.2 UpdateAttributeTuple (Tuple t, Attribute B)

1: if ⟨t, B⟩.Changeable = false then return;
2: best_s = 0; v = null
3: for all ϕ = (X → A, tp) ∈ t.vioRuleList do
4:   if B = A ∧ tp[A] ≠ ’−’ then
5:     cur_s = sim(t[A], tp[A]) {scenario 1}
6:     if cur_s > best_s then { best_s = cur_s; v = tp[A] }
7:   else if B = A ∧ tp[A] = ’−’ then
8:     ⟨best_s, v⟩ = getValueForRHS(ϕ, A, t, best_s) {scenario 2}
9:   end if
10: end for
11: if ∃ ϕ = (X → A, tp) ∈ t.vioRuleList s.t. B ∈ X then
12:   ⟨best_s, v⟩ = getValueForLHS(A, t, best_s) {scenario 3}
13: end if
14: if v ≠ null then
15:   PossibleUpdates = PossibleUpdates ∪ {⟨t, B, v, s = sim(t[B], v)⟩}
16: end if

2. B = A for some violated CFD ϕ = (X → A, tp) and tp[A] = ’−’ (i.e., ϕ is a

variable CFD): This corresponds to case 2 of rule violations where t[X] ≍ tp[X]

and there exists another tuple t′ that violates ϕ with t, i.e., t′[X] = t[X] but

t′[A] ≠ t[A]. In this scenario, the value v = t′[A] is suggested (lines 7-9).

3. B ∈ LHS(ϕ) for some violated CFD ϕ = (X → A, tp): This corresponds to

either case 1 or case 2 of rule violations. In this scenario, we look for a value

v that maximizes the repair evaluation score sim(t[B], v) (Eq. 2.1.) The aim is

to select semantically related values by first using the values in the CFDs, then

searching in the tuples identified by the pattern t[X ∪ A− {B}] (lines 11-13).

In each of the above scenarios, the suggested value v must not be in ⟨t, B⟩.preventedList.

Finally, a repair tuple ⟨t, B, v, s⟩ is composed and inserted into PossibleUpdates (lines 14-16).


Example 2.3.2 In Figure 2.1, t5 violates ϕ4 and when repairing the attribute CT ∈

RHS(ϕ4), a suggested update according to Scenario 1 will be ‘Westville’. Also, t5

violates ϕ5 and when repairing the attribute ZIP ∈ RHS(ϕ5), a suggested update will

be 46825 according to Scenario 2. When repairing the attribute STR ∈ LHS(ϕ5), a

suggested value from the domain dom(STR) can be ‘Sherden RD’ according to Sce-

nario 3.
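
The first two scenarios can be sketched compactly as follows; the relation layout, pattern encoding, and function names are simplified assumptions for illustration and do not reproduce Algorithm 2.2 in full (scenario 3 and the prevented-value bookkeeping are omitted):

    def matches(row, tp, attrs):
        return all(tp[a] == '-' or row[a] == tp[a] for a in attrs)

    def suggest_updates(db, lhs, rhs, tp):
        # Yield (tuple_index, attribute, suggested_value) for RHS repairs.
        for i, t in enumerate(db):
            if not matches(t, tp, lhs):
                continue
            if tp[rhs] != '-':                       # scenario 1: constant CFD
                if t[rhs] != tp[rhs]:
                    yield (i, rhs, tp[rhs])
            else:                                    # scenario 2: variable CFD
                for t2 in db:
                    if t2 is not t and matches(t2, tp, lhs) and \
                       all(t[a] == t2[a] for a in lhs) and t[rhs] != t2[rhs]:
                        yield (i, rhs, t2[rhs])

    db = [{'CT': 'WESTVILLE', 'STT': 'IN', 'ZIP': '46360'}]
    tp11 = {'ZIP': '46360', 'CT': 'MICHIGAN CITY'}           # phi_1,1 in normal form
    print(list(suggest_updates(db, ['ZIP'], 'CT', tp11)))    # [(0, 'CT', 'MICHIGAN CITY')]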

2.3.2 Updates Consistency Manager

Once an update r = ⟨t, A, v, s⟩ is confirmed to be correct, either by the user or

the learning component, it is immediately applied to the database, resulting in a

new database instance. Consequently, (i) new violations may arise and hence the

on demand update discovery process needs to be triggered for the new dirty tuples,

and (ii) some of the already suggested updates that are not verified yet may become

inconsistent since they were generated according to a different database instance. For

example, in Figure 2.1, two updates are proposed: r1 replaces t6[ZIP] with 46391 and r2

replaces t6[CT] with ’FT Wayne’. If feedback is received confirming r1, then r2 is not

consistent with the new database instance and the rules anymore since t6 will fall in

the context of ϕ4. The on demand process can then find a consistent update r′2 that

corresponds to replacing t6[CT] by ’Westville’, and r2 will be discarded in favor of r′2.

The consistency manager needs to maintain two invariants: (i) There is no tuple

t ∈ D such that t |= ϕ for any ϕ ∈ Σ, and t ∈ DirtyTuples. (ii) There is no update

r ∈ PossibleUpdates such that r depends on data values that have been modified

in the database. In the following, we provide the detailed steps of the consistency

manager procedure that we implemented in GDR. Given an update r = ⟨t, B, v, s⟩

along with the feedback ∈ {confirm, reject, retain}:

1. If the feedback is to retain the current value t[B], then we set

⟨t, B⟩.Changeable = false to stop looking for updates for t[B].


2. If the feedback is to reject the update, i.e., t[B] cannot be v, then v is added

immediately to the list ⟨t, B⟩.preventedList. This is followed by a call to

UpdateAttributeTuple(t, B) to find another update for t[B].

3. If the feedback confirms that t[B] must be v, then the update is applied to

the database immediately and we stop generating updates for t[B] by setting

⟨t, B⟩.Changeable = false. Afterward, we go through the rules that involve the

attribute B and update the necessary data structures to reflect the removed

violations as well as new emerging violations. Particularly, for each ϕ : (X →

A, tp) ∈ Σ where B ∈ (X ∪ A), we do the following:

(a) If t ⊭ ϕ, then we consider two cases:

i. ϕ is a constant CFD: If ⟨t, C⟩.Changeable = false, ∀C ∈ X, i.e., all at-

tributes in LHS(ϕ) have been confirmed as correct and are not change-

able values, then RHS(ϕ) should be applied; we apply t[A] = tp[A]

to the database directly, set ⟨t, A⟩.Changeable = false, and remove ϕ

from t.vioRuleList. If some of the LHS(ϕ) attribute values are change-

able in t, then ∀ C ∈ ({X ∪ A} \ {B}) we add ⟨t, C⟩ to the RevisitedList. ϕ

is added to t.vioRuleList, if it is not already there, and t is added to

the DirtyTuples as well.

ii. ϕ is a variable CFD: We add ϕ to t.vioRuleList and then identify

the tuples t′ that violate ϕ with t. Then for each t′, we add ϕ to

t′.vioRuleList and add t′ to the DirtyTuples. Also, we add ⟨t′, C⟩ to

the RevisitedList, ∀ C ∈ {X ∪ A} because this ϕ may be a new emerging

violation for t′ and all the attributes are candidates to be wrong for t′.

(b) If t |= ϕ while ϕ ∈ t.vioRuleList, then ϕ originally was violated by t before

applying this update. Therefore, we remove ϕ from t.vioRuleList. If ϕ is

a constant CFD, no further action is required. However, if ϕ is a variable

CFD, we need to check the other tuples t′, which were involved with t in

violating ϕ, and eventually update their vioRuleList. We remove ϕ from


t′.vioRuleList as long as ∄ t′′ s.t. t′ and t′′ violate ϕ, i.e., t′ is not involved in

violating ϕ with another tuple t′′.

4. Remove update r = ⟨t, C, v, s⟩ from the PossibleUpdates, if ⟨t, C⟩ ∈

RevisitedList or ⟨t, C⟩.Changeable = false.

5. For every element ⟨t, C⟩ ∈ RevisitedList, we call UpdateAttributeTuple(t, C)

to find another repair for t[C].

6. Remove t from DirtyTuples, if t.vioRuleList is empty.
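
A condensed sketch of how the three feedback types in steps 1-3 above could be dispatched; only the bookkeeping is shown (the rule re-checking of steps 3(a)-3(b) and the RevisitedList handling are omitted), and the state layout is an illustrative assumption:

    def handle_feedback(state, update, feedback):
        # update = (tuple_id, attribute, suggested_value, score)
        t, attr, value, _score = update
        key = (t, attr)
        if feedback == 'retain':                    # step 1: keep the current value
            state['changeable'][key] = False
        elif feedback == 'reject':                  # step 2: never suggest this value again
            state['prevented'].setdefault(key, set()).add(value)
        elif feedback == 'confirm':                 # step 3: apply the update and freeze the cell
            state['db'][t][attr] = value
            state['changeable'][key] = False
        return state

    state = {'db': {'t6': {'CT': 'FORT WAYNE', 'ZIP': '46825'}},
             'changeable': {}, 'prevented': {}}
    handle_feedback(state, ('t6', 'ZIP', '46391', 0.8), 'confirm')
    print(state['db']['t6'])   # {'CT': 'FORT WAYNE', 'ZIP': '46391'}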

Note that the first update consistency invariant is maintained because of the

following: A tuple t may become dirty if it is modified or another tuple t′ is modified

so that t and t′ violate some variable CFD ϕ ∈ Σ. For a tuple t and a CFD rule ϕ,

assuming that due to a database update t ⊭ ϕ, then t must be in DirtyTuples after

applying Step 3a.

If ϕ is a constant CFD, then Step 3(a)i should have been applied. If t continues to

violate ϕ it should be in DirtyTuples. If ϕ is a variable CFD, then Step 3(a)ii should

have been applied. There are two cases to consider: First, if t is the tuple being

repaired and t ⊭ ϕ, then it is added to DirtyTuples, if not already there. Second,

if t ⊭ ϕ because another tuple t′ was repaired (or modified), then Step 3(a)ii should

have been applied on t′. Thus all tuples involved with t′ in violating ϕ, including

t will be added to DirtyTuples. Following the same rationale, Step 3b maintains

that t.vioRuleList contains only rules that are being violated by t. Thus, Step 6

guarantees that the content of DirtyTuples corresponds to tuples involved in rules

violation.

The second update consistency invariant is maintained as well because of steps

3(a)i, 4, and 5. These steps maintain a local list, RevisitedList, to hold tuple-

attribute pairs, where their generated updates may depend on the applied update.

In Step 3(a)i, changing the value of t[B] may affect the update choice for the other

attributes of ϕ. For a variable CFD, Step 3(a)ii, all the tuples involved in the vio-

lations due to the modified value will need their attribute values to be revisited to


find updates. Step 4 removes the corresponding updates from the PossibleUpdates

and we proceed in Step 5 to get potentially new updates.

Note that Step 3 loops on the set of rules for the particular tuple t that was

updated. In Steps 3(a) and 3(b), we consider the immediate dependencies (conse-

quences) of updating t with respect to a single rule ϕ. Particularly in Step 3(a), we

check for new violations for ϕ that involve t, because it is the only change to the

database. In Step 3(b), we check for already resolved violations for ϕ due to updating

t. This local process to tuple t that considers only a single rule ϕ at a time guarantees

that the consistency manager will terminate and will not get into an infinite loop.

2.3.3 Grouping Updates

There are two reasons for the grouping: (i) Providing a coherent set of up-

dates with some common contextual information will be easier for the user to handle

and process. (ii) Providing a machine learning algorithm with a group of training

examples that have some correlations due to the grouping will increase the predic-

tion accuracy compared with just providing random, unrelated examples. Similar

grouping ideas have been explored in [45]. We use a grouping function where the

tuples with the same update value in a given attribute are grouped together. This

technique of grouping is in the flavor of the equivalence classes techniques described

in [23]. Another way to do the grouping is based on conflicting structures, a

concept introduced in [5]. The conflicting structures identify the groups of cells that

cannot stay unchanged in the final consistent database (i.e., for each group there

must be at least one database cell to change).
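
A minimal sketch of the grouping function used here, keying updates on the pair (attribute, suggested value); the tuple layout of an update is an assumption for illustration:

    from collections import defaultdict

    def group_updates(possible_updates):
        # Updates proposing the same value for the same attribute form one group.
        groups = defaultdict(list)
        for upd in possible_updates:          # upd = (tuple_id, attribute, value, score)
            _, attr, value, _ = upd
            groups[(attr, value)].append(upd)
        return dict(groups)

    updates = [('t2', 'CT', 'MICHIGAN CITY', 0.7),
               ('t3', 'CT', 'MICHIGAN CITY', 0.6),
               ('t5', 'ZIP', '46825', 0.8)]
    for key, members in group_updates(updates).items():
        print(key, [m[0] for m in members])
    # ('CT', 'MICHIGAN CITY') ['t2', 't3']
    # ('ZIP', '46825') ['t5']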

2.4 Ranking and Displaying Suggested Updates

In this section, we introduce the key concepts of GDR, namely the ranking and

learning components (Figure 2.2), which describe how GDR interacts with the user

to get feedback on suggested updates. The task of these components is to devise how


to best present the updates to the user, in a way that will provide the most benefit

for improving the quality of the data. To this end, we apply the concept of value

of information (VOI) [64] from decision theory, combined with an active learning

approach, to choose a ranking in a principled manner.

2.4.1 VOI-based Ranking

At any iteration of the process outlined in Algorithm 2.1, there will be several

possible suggested updates to forward to the user. As discussed in the previous

section, these updates are grouped into groups {c1, c2 . . . }.

VOI is a means of quantifying the potential benefit of determining the true value

of some unknown. At the core of VOI is a loss (or utility) function that quantifies the

desirability of a given level of database quality. To make a decision on which group

to forward first to the user, we compare data quality loss before and after the user

works on a group of updates. More specifically, we devise a data quality loss function,

L, based on the quantified violations to the rules Σ. Since the exact loss in quality

cannot be measured, as we do not know the correctness of the data, we develop a set

of approximations that allow for efficient estimation of this quality loss. Before we

proceed, we need first to introduce the notion of database violations.

Definition 2.4.1 Given a database D and a CFD ϕ, we define the tuple t violation

w.r.t ϕ, denoted vio(t, {ϕ}), as follows:

vio(t, {ϕ}) = 1, if ϕ is a constant CFD;

vio(t, {ϕ}) = the number of tuples t′ that violate ϕ with t, if ϕ is a variable CFD.

Consequently, the total violations for D with respect to Σ is:

vio(D, Σ) = ∑_{ϕ∈Σ} ∑_{t∈D} vio(t, {ϕ}).


The definition for the variable CFDs is equivalent to the pairwise counting of viola-

tions discussed in [3]. The violation can be scaled further using a weight attached to

the tuple denoting how important it is to the business that the tuple be clean.
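
Definition 2.4.1 translates directly into a counting routine; the sketch below reuses the simplified relation and pattern encoding assumed in the earlier sketches:

    def matches(row, tp, attrs):
        return all(tp[a] == '-' or row[a] == tp[a] for a in attrs)

    def tuple_violation(db, t, lhs, rhs, tp):
        # vio(t, {phi}) from Definition 2.4.1.
        if not matches(t, tp, lhs):
            return 0
        if tp[rhs] != '-':                                   # constant CFD
            return 1 if t[rhs] != tp[rhs] else 0
        # variable CFD: count the tuples t' that violate phi together with t
        return sum(1 for t2 in db
                   if t2 is not t and matches(t2, tp, lhs)
                   and all(t[a] == t2[a] for a in lhs) and t[rhs] != t2[rhs])

    def total_violations(db, rules):
        # vio(D, Sigma): sum over rules and tuples.
        return sum(tuple_violation(db, t, lhs, rhs, tp)
                   for (lhs, rhs, tp) in rules for t in db)

    db = [{'CT': 'MICHIGAN CITY', 'STT': 'MI', 'ZIP': '46360'},
          {'CT': 'WESTVILLE', 'STT': 'IN', 'ZIP': '46360'}]
    phi_1_1 = (['ZIP'], 'CT', {'ZIP': '46360', 'CT': 'MICHIGAN CITY'})
    print(total_violations(db, [phi_1_1]))   # 1: only the second tuple violates phi_1,1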

Update Benefit: Given a database instance D and a group c = {r1, . . . , rJ}, if the

system receives feedback from the user on rj, there are two possible cases: either the

user confirms rj to be applied or not. We denote the two corresponding database

instances as Drj (rj confirmed) and Dr̄j (rj rejected), respectively. Assuming that the

user will confirm rj with a probability pj, the expected data quality loss after consulting

the user on rj can be expressed by pj L(Drj) + (1 − pj) L(Dr̄j). If we further assume

that all the updates within the group c are independent, then the update benefit g (or

data quality gain) of acquiring user feedback for the entire group c can be expressed as:

g(c) = L(D|c) − ∑_{rj∈c} [ pj L(Drj) + (1 − pj) L(Dr̄j) ]    (2.2)

where L(D|c) is the current loss in data quality given that c is suggested. To simplify

our analysis, we assumed that these updates are independent. Taking into account

these dependencies would require modeling the full joint probabilities of the updates,

which would lead to a formulation that is computationally infeasible due to the expo-

nential number of possibilities.

Data Quality Loss (L): We define quality loss as inversely proportional to the

degree of satisfaction of the specified rules Σ. To compute L(D|c), we first need to

measure the quality loss with respect to ϕ ∈ Σ, namely ql(D|c, ϕ). Assuming that

Dopt is the clean database instance desired by the user, we can express ql by:

ql(D|c, ϕ) = 1 − |D |= ϕ| / |Dopt |= ϕ| = ( |Dopt |= ϕ| − |D |= ϕ| ) / |Dopt |= ϕ|    (2.3)

where |D |= ϕ| and |Dopt |= ϕ| are the numbers of tuples satisfying the rule ϕ in the

current database instance D and Dopt, respectively. Consequently, the data quality

loss, given c, can be computed for Eq. 2.2 as follows:

L(D|c) = ∑_{ϕi∈Σ} wi × ql(D|c, ϕi).    (2.4)


where wi is a user-defined weight for ϕi that expresses the business or domain value of

satisfying the rule ϕi. In our experiments, we used the values wi = |D(ϕi)| / |D|, where

|D(ϕi)| is the number of tuples that fall in the context of the rule ϕi. The intuition is

that the more tuples fall in the context of a rule, the more important it is to satisfy

this rule.
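
As a small illustration of these weights and of Eq. 2.4, assuming the ql values are already available (the numbers below are placeholders, not measurements):

    def rule_weight(db, lhs, tp):
        # w_i = |D(phi_i)| / |D|: fraction of tuples in the context of the rule.
        in_context = sum(1 for t in db
                         if all(tp[a] == '-' or t[a] == tp[a] for a in lhs))
        return in_context / len(db)

    def quality_loss(weights, ql_per_rule):
        # L(D|c) = sum_i w_i * ql(D|c, phi_i)
        return sum(w * ql for w, ql in zip(weights, ql_per_rule))

    db = [{'ZIP': '46360'}, {'ZIP': '46360'}, {'ZIP': '46825'}, {'ZIP': '46774'}]
    print(rule_weight(db, ['ZIP'], {'ZIP': '46360'}))        # 0.5
    print(quality_loss([0.5, 0.25, 0.25], [0.4, 0.0, 0.2]))  # 0.25 (illustrative numbers)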

To use this gain formulation, we are faced with two practical challenges: (1) we

do not know the probabilities pj for Eq. 2.2, since we do not know the correctness of

the update rj beforehand, and (2) we do not know the desired clean database Dopt

for computing Eq. 2.3, since that is the goal of the cleaning process in the first place.

User Model: To approximate pj, we learn and model the user as we obtain

his/her feedback for the suggested updates. pj is approximated by the prediction prob-

ability p̂j of having rj correct (learning user feedback is discussed in Section 2.4.2).

Since initially there is no feedback, we assign sj to p̂j, where sj ∈ [0, 1] is a score that

represents the repairing algorithm's certainty about the suggested update rj.

Estimating Update Benefit: To compute the overall quality loss L in Eq. 2.4,

we need to first compute the quality loss with respect to a particular rule ϕ, i.e.,

ql(D|c, ϕ) in Eq. 2.3. To this end, we approximate the numerator and denomina-

tor separately. The numerator expression, which represents the difference between

the numbers of tuples satisfying ϕ in Dopt and D, respectively, is approximated us-

ing D’s violations with respect to ϕ. Thus, we use the expression vio(D, {ϕ}) (cf.

Definition 2.4.1) as the numerator in Eq. 2.3.

The main approximation we made is to assume that the updates within a group

c are independent. Hence to approximate the denominator of Eq. 2.3, we assume

further that there is only one suggested update rj in c. The effect of this last as-

sumption is that we consider two possible clean desired databases—one in which rj

is correct, denoted by Drj, and another one in which rj is incorrect, denoted by

Dr̄j. Consequently, there are two possibilities for the denominator of Eq. 2.3, each

with a respective probability pj and (1− pj). Our evaluations show that despite our

approximations, our approach produces a good ranking of the groups of updates.


We apply this approximation independently for each rj ∈ c and estimate the

quality loss ql as follows:

E[ql(D|c, ϕ)] = ∑_{rj∈c} [ p̂j · vio(D, {ϕ}) / |Drj |= ϕ| + (1 − p̂j) · vio(D, {ϕ}) / |Dr̄j |= ϕ| ]    (2.5)

where we approximate pj with p̂j.

The expected loss in data quality for the database D, given the suggested group

of updates c, can be then approximated based on Eq. 2.4 by replacing ql with E[ql]

obtained from Eq. 2.5:

E[L(D|c)] = ∑_{ϕi∈Σ} wi ∑_{rj∈c} [ p̂j · vio(D, {ϕi}) / |Drj |= ϕi| + (1 − p̂j) · vio(D, {ϕi}) / |Dr̄j |= ϕi| ]    (2.6)

We can also compute the expected loss for Drj and Dr̄j using Eq. 2.4 and Eq. 2.6 as follows:

E[L(Drj)] = ∑_{ϕi∈Σ} wi · vio(Drj, {ϕi}) / |Drj |= ϕi|,

where we use p̂j = 1 since in Drj we know that rj is correct, and

E[L(Dr̄j)] = ∑_{ϕi∈Σ} wi · vio(Dr̄j, {ϕi}) / |Dr̄j |= ϕi|,

where we use p̂j = 0 since in Dr̄j we know that rj is incorrect.

Finally, using Eq. 2.2 and substituting L(D|c) with E[L(D|c)] from Eq. 2.6, we

compute an estimate for the data quality gain of acquiring feedback for the group c

as follows:

E[g(c)] = E[L(D|c)] − ∑_{rj∈c} [ p̂j E[L(Drj)] + (1 − p̂j) E[L(Dr̄j)] ]

        = ∑_{ϕi∈Σ} wi ∑_{rj∈c} [ p̂j · vio(D, {ϕi}) / |Drj |= ϕi| + (1 − p̂j) · vio(D, {ϕi}) / |Dr̄j |= ϕi| ]

          − ∑_{rj∈c} [ p̂j ∑_{ϕi∈Σ} wi · vio(Drj, {ϕi}) / |Drj |= ϕi| + (1 − p̂j) ∑_{ϕi∈Σ} wi · vio(Dr̄j, {ϕi}) / |Dr̄j |= ϕi| ]

Note that vio(D, {ϕi}) − vio(Dr̄j, {ϕi}) = 0 since Dr̄j is the database resulting from

rejecting the suggested update rj, which does not modify the database. Therefore, Dr̄j

is the same as D with the same violations. After a simple rearrangement, we obtain

the final formula to compute the estimated gain for c:

E[g(c)] = ∑_{ϕi∈Σ} wi ∑_{rj∈c} p̂j · ( vio(D, {ϕi}) − vio(Drj, {ϕi}) ) / |Drj |= ϕi|    (2.7)


The final formula in Eq. 2.7 is intuitive by itself and can be justified by the follow-

ing. The main objective to improve the quality is to reduce the number of violations

in the database. Therefore, the difference in the amount of database violations as

defined in Definition 2.4.1, before and after applying rj, is a major component to com-

pute the update benefit. This component is computed, under the first summation,

for every rule ϕi as a fraction of the number of tuples that would be satisfying ϕi, if

rj is applied. Since the correctness of the repair rj is unknown, we cannot use the term

under the first summation as a final benefit score. Instead, we compute the expected

update benefit by approximating our certainty about the benefit by the prediction

probability p̂j.

Example 2.4.1 For the example in Figure 2.1, assume that the repairing algorithm

generated 3 updates to replace the value of the CT attribute by ‘Michigan City’ in

t2, t3 and t4. Assume also that the probabilities, pj, for each of them are 0.9, 0.6,

and 0.6, respectively. The weights wi for each ϕi, i = 1, . . . , 5 are {4/8, 1/8, 2/8, 1/8, 3/8}.

Due to these modifications, only ϕ1 will have its violations affected. Then for this

group of updates, the estimated benefit can be computed as follows using Eq. 2.7:

4/8 × ( 0.9 × (4 − 3)/1 + 0.6 × (4 − 3)/1 + 0.6 × (4 − 3)/1 ) = 1.05.
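
The computation in Example 2.4.1 can be reproduced with a few lines; the per-update quantities below are exactly the numbers of the example, and the helper name is illustrative:

    def estimated_gain(w, per_update):
        # Eq. 2.7 for a single affected rule.
        # per_update: list of (p_hat, vio_before, vio_after, satisfying_after).
        return w * sum(p * (before - after) / satisfying
                       for p, before, after, satisfying in per_update)

    # Three updates setting CT = 'Michigan City' for t2, t3, t4; only phi_1 is affected.
    group = [(0.9, 4, 3, 1), (0.6, 4, 3, 1), (0.6, 4, 3, 1)]
    print(round(estimated_gain(4 / 8, group), 2))   # 1.05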

2.4.2 Active Learning Ordering

One way to reduce the cost of acquiring user feedback for verifying each update is

to relegate the task of providing feedback to a machine learning algorithm. The use of

a learning component in GDR is motivated by the existence of correlations between

the original data and the correct updates. If these correlations can be identified

and represented in a classification model, then the model can be trained to predict

the correctness of a suggested update and hence replace the user for similar (future)

situations.


As stated earlier, GDR provides groups of updates to the user for feedback. Here,

we discuss how the updates within a group will be ordered and displayed to the

user, such that user feedback for the top updates would strengthen the learning

component’s capability to replace the user for predicting the correctness for the rest

of the updates.

Interactive Active Learning Session: After ranking the groups of updates, the

user will pick a group c that has a high score E[g(c)]. The learner orders these updates

such that those whose labeling would most benefit the model, i.e., improve its prediction

accuracy, come first. The updates are displayed to the user along with their

learner predictions for the correctness of the update. The user will then give feedback

on the top ns updates that she is sure about, thereby correcting any mistakes

made by the learner. The ns newly labeled examples are added to the learner

training dataset Tr and the active learner is retrained. The learner then provides

new predictions and reorders the currently displayed updates based on the training

examples obtained so far. If the user is not satisfied with the learner predictions,

the user will then give feedback on another ns updates from c. This interactive

process continues until the user is either satisfied with the learner predictions, and

thus delegates the remaining decisions on the suggested updates in c to the learned

model, or the updates within c are all labeled, i.e., verified, by the user.

Active Learning: In the learning component, there is a machine learning algo-

rithm that constructs a classification model. Ideally, we would like to learn a model

to automatically identify correct updates without user intervention. Active learning

is an approach to learning models in situations where unlabeled examples (i.e.,

suggested updates) are plentiful but there is a cost to labeling examples (acquiring user

feedback) for training.

By delegating some decisions on suggested updates to the learned models, GDR is

allowing for “automatic” repairing. However, a guarantee of correctly repairing

the data is inherently provided by the active learning process, which learns accurate


classifiers to predict the correctness of the updates. The user is the one to decide

whether the classifiers are accurate while inspecting the suggestions.

Learning User Feedback: The learning component predicts for a suggested

update r = ⟨t, A, v, s⟩ one of the following predictions, which corresponds to the

expected user feedback. (i) confirm, the value of t[A] should be v. (ii) reject, v is not

a valid value for t[A] and GDR needs to find another update. (iii) retain, t[A] is a

correct value and there is no need to generate more updates for it. The user may also

suggest a new value v′ for t[A] and GDR will consider it as confirm feedback for the

repair r′ = ⟨t, A, v′, 1⟩.

In the learning component, we learn a set of classification models {MA1 , . . . ,MAn},

one for each attribute Ai ∈ attr(R). Given a suggested update for t[Ai], model MAi

is consulted to predict user feedback. The models are trained by examples acquired

incrementally from the user. We present here our choices for data representation

(input to the classifier), classification model, and learning benefit scores.

Data Representation: For a given update r = ⟨t, Ai, v, s⟩ and user feedback F ∈

{confirm, reject, retain}, we construct a training example for model MAi in the form

⟨t[A1], . . . , t[An], v,R(t[Ai], v),F⟩. Here, t[A1], . . . , t[An] are the original attributes’

values of tuple t and R(t[Ai], v)1 is a function that quantifies the relationship between

t[Ai] and its suggested value v.

Including the original dirty tuple along with the suggested update value enables

the classifier to model associations between original attribute values and suggested

values. Including the relationship function, R, enables the classifier to model asso-

ciations based on similarities that do not depend solely on the values in the original

database instance and the suggested updates.
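
A sketch of how such a training example could be assembled, with a simple character-set similarity standing in as the relationship function R (the actual R is a string similarity function, as noted in the footnote; the helper names are illustrative):

    def relationship(original: str, suggested: str) -> float:
        # Stand-in string similarity: Jaccard overlap of character sets.
        a, b = set(original.lower()), set(suggested.lower())
        return len(a & b) / max(len(a | b), 1)

    def training_example(tuple_row, attrs, target_attr, suggested_value, feedback):
        # <t[A1], ..., t[An], v, R(t[Ai], v)> with label F in {confirm, reject, retain}.
        features = [tuple_row[a] for a in attrs]
        features.append(suggested_value)
        features.append(relationship(tuple_row[target_attr], suggested_value))
        return features, feedback

    t8 = {'Name': 'Sindy', 'STR': 'SHERDEN RD', 'CT': 'FT WAYNE', 'ZIP': '46774'}
    x, y = training_example(t8, ['Name', 'STR', 'CT', 'ZIP'], 'CT', 'FORT WAYNE', 'confirm')
    print(x, y)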

Active Learning Using Model Uncertainty: Active learning starts with a

preliminary classifier learned from a small set of labeled training examples. The

classifier is applied to the unlabeled examples and a scoring mechanism is used to

estimate the most valuable example to label next and add to the training set. Many

1We use a string similarity function.


criteria have been proposed to determine the most valuable examples for labeling

(e.g, [65,66]) by focusing on selecting the examples whose predictions have the largest

uncertainty.

One way to derive the uncertainty of an example is by measuring the disagreement

amongst the predictions it gets from a committee of k classifiers [45]. The committee

is built so that the k classifiers are slightly different from each other, yet they all

have similar accuracy on the training data. For an update r that the committee is

certain about, all members would give it the same prediction F ∈ {confirm, reject, retain}.

The uncertain ones will get different labels from the committee, and by adding them

to the training set, the disagreement amongst the members will be lowered.

In our implementation, each model MAi is a random forest, which is an ensemble

of decision trees [67] that are built in a similar way to construct a committee of

classifiers. Random forest learns a set of k decision trees. Let the number of instances

in the training set be N and the number of attributes in the examples be M . Each

of the k trees is learned as follows: randomly sample with replacement a set S of

size N ′ < N from the original data, then learn a decision tree with the set S. The

random forest algorithm uses a standard decision-tree learning algorithm with the

exception that at each attribute split, the algorithm selects the best attribute from

a random subsample of M ′ < M attributes. We used the WEKA2 random forest

implementation with k = 10 and default values for N ′ and M ′.

Computing Learning Benefit Score: To classify an update r = ⟨t, Ai, v, s⟩

with the learned random forest MAi, each tree in the ensemble is applied separately

to obtain the predictions F1, . . . ,Fk for r, then the majority prediction from the set of

trees is used as the output classification for r. The learning benefit or the uncertainty

of predictions of a committee can be quantified by the entropy on the fraction of

committee members that predicted each of the class labels.

Example 2.4.2 Assume that r1, r2 are two candidate updates to change the CT at-

tribute to ‘Michigan City’ in tuples t2, t3. The model of the CT attribute, MCT, is a

2http://www.cs.waikato.ac.nz/ml/weka/


random forest with k = 5. By consulting the forest MCT, we obtain for r1, the predic-

tions {confirm, confirm, confirm, reject, retain}, and for r2, the predictions {confirm,

reject, reject, reject, reject}. In this case, the final prediction for r1 is ‘confirm’ with

an uncertainty score of 0.86 (= −(3/5) × log_3(3/5) − (1/5) × log_3(1/5) − (1/5) × log_3(1/5)) and for r2

the final prediction is ’reject’ with an uncertainty score of 0.45. In this case, r1 will

appear to the user before r2 because it has higher uncertainty.
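
The uncertainty score of Example 2.4.2 is the entropy, in base 3 (one unit per possible label), of the committee's vote fractions; a short sketch:

    import math
    from collections import Counter

    def uncertainty(votes, num_labels=3):
        # Entropy over the fraction of committee members voting for each label.
        counts = Counter(votes)
        total = len(votes)
        return -sum((c / total) * math.log(c / total, num_labels)
                    for c in counts.values())

    r1 = ['confirm', 'confirm', 'confirm', 'reject', 'retain']
    r2 = ['confirm', 'reject', 'reject', 'reject', 'reject']
    print(uncertainty(r1), uncertainty(r2))   # ~0.865 and ~0.455, the 0.86 and 0.45 of Example 2.4.2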

2.5 Experiments

In this section, we present a thorough evaluation of the GDR framework, which

has already been demonstrated in [60]. Specifically, we show that the proposed

ranking mechanism converges quickly to a better data quality state. Moreover, we

assess the trade-off between the user efforts and the resulting data quality.

Datasets. In our experiments, we used two datasets, denoted as Dataset 1

and 2 respectively. Dataset 1 is a real world dataset obtained by integrating

(anonymized) emergency room visits from 74 hospitals. Such patient data is used

to monitor naturally occurring disease outbreaks, biological attacks, and chemical

attacks. Since such data is coming from several sources, a myriad of data quality

issues arise due to the different information systems used by these hospitals and

the different data entry operators responsible for entering this data. For our ex-

periments, we selected a subset of the available patient attributes, namely Patient

ID, Age, Sex, Classification, Complaint, HospitalName, StreetAddress, City,

Zip, State, and VisitDate. For Dataset 2, we used the adult dataset from the

UCI repository (http://archive.ics.uci.edu/ml/). For our experiments, we used the

attributes education, hours per week, income, marital status, native country,

occupation, race, relationship, sex, and workclass.

Ground truth. To evaluate our technique against a ground-truth, we manually

repaired 20,000 patient records in Dataset 1. We used address and zip code lookup web

sites for this purpose. We assumed that Dataset 2, which is about 23,000 records, is


already clean and hence can be used as our ground truth. We synthetically introduced

errors in the attribute values as follows. We randomly picked a set of tuples, and then

for each tuple, we randomly picked a subset of the attributes to perturb by either

changing characters or replacing the attribute value with another value from the

attribute's domain. All experiments are reported when 30% of the tuples are

dirty.
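
The perturbation procedure can be sketched as follows; the 50/50 split between character edits and domain swaps, the fixed seed, and the toy records are illustrative assumptions:

    import random

    def inject_errors(db, dirty_fraction=0.3, seed=0):
        # Pick random tuples, then perturb a random subset of their attributes
        # either by editing a character or by swapping in another domain value.
        rng = random.Random(seed)
        attrs = list(db[0].keys())
        domains = {a: sorted({t[a] for t in db}) for a in attrs}
        for t in rng.sample(db, int(dirty_fraction * len(db))):
            for a in rng.sample(attrs, rng.randint(1, len(attrs))):
                if rng.random() < 0.5 and len(t[a]) > 1:      # change a character
                    i = rng.randrange(len(t[a]))
                    t[a] = t[a][:i] + 'X' + t[a][i + 1:]
                else:                                         # value from the domain
                    t[a] = rng.choice(domains[a])
        return db

    db = [{'education': 'Bachelors', 'sex': 'Male'},
          {'education': 'HS-grad', 'sex': 'Female'},
          {'education': 'Masters', 'sex': 'Male'},
          {'education': 'Doctorate', 'sex': 'Female'}]
    print(inject_errors(db))   # the same records, with some values perturbed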

Data Quality Rules. For Dataset 1, we used CFDs similar to what was

illustrated in Figure 2.1. The rules were identified while manually repairing the

tuples. For Dataset 2, we implemented the technique described in [36] to discover

CFDs and we used a support threshold of 5%.

User interaction simulation. We simulated user feedback to suggested updates

by providing answers as determined by the ground truth.

Data quality state metric. We report the improvement in data quality through

computing the loss (Eq. 2.4). We consider the ground truth as the desired clean

database Dopt.

Settings. All the experiments were conducted on a server with a 3 GHz pro-

cessor and 32 GB RAM running on Linux. We used Java to implement the proposed

techniques and MySQL to store and query the records.

2.5.1 VOI Ranking Evaluation

The objective here is to evaluate the effectiveness and quality of the VOI-based

ranking mechanism described in Section 2.4.1. In this experiment, we did not use the

learning component to replace the user; the user will need to evaluate each suggested

update. Recall that the grouping provides the user with related tuples and their

corresponding updates that could help in a quick batch inspection by the user.

We compare in this experiment the following techniques:

• GDR-NoLearning : The GDR framework of Figure 2.2 without the learning

component.


(Two panels, (a) Dataset 1 and (b) Dataset 2, plotting Data Quality Improvement against Feedback (User efforts) for the GDR-NoLearning, Greedy, and Random strategies.)

Fig. 2.3.: Comparing VOI-based ranking in GDR (GDR-NoLearning) to other strate-

gies against the amount of feedback. Feedback is reported as the percentage of the

maximum number of verified updates required by an approach. Our application of the

VOI concept shows superior performance compared to other naïve ranking strategies.

• Greedy : Here, we rank the groups according to their sizes. The rationale behind

this strategy is that groups that cover larger numbers of updates may have high

impact on the quality if most of the suggestions within them are correct.

• Random: The naïve strategy where we randomly order the groups; all update

groups are equally important.

In Figure 2.3, we show the progress in improving the quality against the number

of verified updates (i.e., the amount of feedback). The feedback is reported as a

percentage of the total number of suggested updates through the interaction process

to reach the desired clean database.

The ultimate objective of GDR is to minimize user effort while reaching better

quality quickly. In Figure 2.3, the slope of the curves in the first iterations with the

user is the key indicator: the steeper the curve, the better the ranking.

As illustrated for both datasets, the GDR-NoLearning approach performs well com-

pared to the Greedy and Random approaches. This is because the GDR-NoLearning


approach perfectly identifies the most beneficial groups that are more likely to have

correct updates. While the Greedy approach improves the quality, the content of the

groups sometimes consists mostly of incorrect updates, leading to wasted user effort. The Ran-

dom approach showed the worst performance in Dataset 1, while for Dataset 2, it was

comparable with the Greedy approach especially in the beginning of the curves. This

is because in Dataset 2, most of the group sizes were close to each other, mak-

ing the Random and Greedy approaches behave almost identically, while in Dataset 1

the group sizes vary widely, making the random choices ineffective. Finally, we no-

tice that GDR-NoLearning is much better for Dataset 1 than for Dataset 2, because

of two reasons related to the nature of Dataset 2: (i) most of the initially sug-

gested updates for Dataset 2 are correct, and (ii) the sizes of the groups in Dataset 2

are close to each other. The consequence is that any ranking strategy for Dataset 2

will not be far from the optimal.

The results reported above clearly justify the importance and effectiveness of the

GDR ranking component. The GDR-NoLearning approach is well suited for repairing

“very” critical data, where every suggested update has to be verified before applying

it to the database.

2.5.2 GDR Overall Evaluation

Here, we evaluate GDR’s performance when using the learning component to re-

duce user efforts. More precisely, we evaluate the VOI-based ranking when combined

with the active learning ordering. For this experiment, we evaluate the following

approaches:

• GDR: is the approach proposed in this chapter. In each interactive session, the user

provides feedback for the top ranked updates. The required amount of feedback

per group is inversely proportional to the benefit score of the group (Eq. 2.7):

the higher the benefit, the less effort from the user is needed, since most likely the

updates are correct and there are very few uncertain updates for the learned model


that would require user involvement. As such, we require that the user verifies di

updates for a group ci, di = E × (1 − g(ci)/gmax), where E is the initial number of dirty

tuples and gmax = max_{∀cj} {g(cj)}.

• GDR-S-Learning : Here, we eliminate the active learning from the system—the

updates are grouped and then ranked using VOI-based scoring alone. The user is

solicited for a random selection of updates within each group, instead of the updates being

ordered by uncertainty. However, all of the user feedback is used to train the

learning component, which then replaces the user on deciding for the remaining

updates in the group. GDR-S-Learning is included to assess the benefit of the active

learning aspect of our framework, compared with traditional passive learning.

• Active-Learning : In this approach, we eliminate the grouping and their ranking

from the GDR framework. In other words, we neither group the updates nor use

VOI-based scores for ranking. We only solicit user feedback for updates ordered

with the learner uncertainty scores. The user is required to provide feedback for

the top update and then the learning component is updated to reorder the updates

for the user in an iterative fashion. The resulting learned model is applied for pre-

dicting the remaining suggested updates and the database is updated accordingly.

We report the quality improvement for different amounts of feedback. This ap-

proach is included to assess the benefit of the grouping and the VOI-based ranking

mechanisms compared with using only an active learning approach.

• GDR-NoLearning : This approach is the one described in the previous experiment;

it provides a baseline to assess the utility of the machine learning aspect of GDR.

• Automatic-Heuristic: The BatchRepair method described in [3] for automatic data

repair using CFDs.

In Figure 2.4, we report the improvement in data quality as the amount of feedback

increases. Assuming that the user can afford verifying at most a number of updates

equal to the number of initially identified dirty tuples (6000 for Dataset 1 and 3000


(Two panels, (a) Dataset 1 and (b) Dataset 2, plotting Data Quality Improvement against Feedback (User efforts) for GDR, GDR-S-Learning, GDR-NoLearning, Active Learning, and Heuristic.)

Fig. 2.4.: Overall evaluation of GDR compared with other techniques. The com-

bination of the VOI-based ranking with the active learning was very successful in

efficiently involving the user. The user feedback is reported as a percentage of the

initial number of the identified dirty tuples.

for Dataset 2), we report the amount of feedback as a percentage of this number.

The results show that GDR achieves superior performance compared with the other

approaches: for Dataset 1, GDR gains about 90% improvement with 20% effort, or

verifying about 1000 updates. For Dataset 2, about 94% quality improvement was

gained with 30% effort, or verifying about 1000 updates.

In Dataset 1, Active Learning is comparable to GDR only in the beginning of the

curve until reaching about 70% quality improvement. GDR-S-Learning starts to out-

perform Active Learning after about 45% user effort. The Heuristic approach repairs

the database without user feedback, therefore, it produces a constant result. Note

that the quality improvement achieved by the Heuristic approach is attained by GDR

with about 10% user effort, i.e., giving feedback for updates numbering about 10% of

the initial set of dirty tuples in the database. The GDR-NoLearning approach does

improve the quality of the database, but not as quickly as any of the approaches that

use learning methods. In comparison to Figure 2.3, the final performance of GDR-


NoLearning is 100%, assuming all required feedback was obtained. GDR involves learning, which allows automatic updates to be applied and hence opens the door for some mistakes to occur. Thus, 100% accuracy may not be reached.

For Dataset 2, similar results were achieved. However, the Active Learning ap-

proach was not as successful as for Dataset 1. This is due to the random nature

of the errors in this dataset, which resulted in fewer correlations between these errors

that could be learned by the model. Due to the wider array of real-world dependen-

cies in Dataset 1, the machine learning methods were more successful and achieved

better performance. For example, some hospitals located on the boundary between

two zip codes have their zip attributes dirty; this is most likely due to a data entry

confusion on where they are really located.

The superior performance of GDR is justified by the following: for a single group

of updates, using the learner uncertainty to select updates can effectively strengthen

the learned model predictions as these “uncertain” updates are more important for

the model. In GDR-S-Learning, randomly inspecting updates from the groups pro-

vided by the VOI-based ranking does enhance the learned model. However, more

user effort is wasted in verifying less important updates according to the learning

benefit. For the Active Learning approach, it is apparent that having the user spend

more effort does not help the learned model due to the model overfitting problem.

This problem is avoided in both GDR and GDR-S-Learning approaches because of

the grouping provided by the GDR framework. The grouping provides the learned

model a mechanism to adapt locally to the current group, which in turn provides

the necessary guidance for the model to strongly learn the associations for a highly

beneficial group rather than just weakly learning the associations for a wide variety

of cases. This is also the reason that the GDR-S-Learning eventually outperforms

the Active Learning with an increase in user effort.

This experiment demonstrates the importance of the learning component for

achieving a faster convergence to a better quality. The results support our initial

hypothesis about the existence of correlations between the dirty and correct versions


of the tuples in real-world data. Also, the combination of VOI-based ranking with

active learning improves over the traditional active learning mechanism.

2.5.3 User Efforts vs. Repair Accuracy

We evaluate GDR’s ability to provide a trade-off between user effort and accurate

updates. We use the precision and recall, where precision is defined as the ratio of

the number of values that have been correctly updated to the total number of values

that were updated, while recall is defined as the ratio of the number of values that

have been correctly updated to the number of incorrect values in the entire database.

Since we know the correct data, we can compute these values.
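For concreteness, these two measures are computed directly from three counts, as in the following minimal Java sketch; the counts used in main are hypothetical and only illustrate the computation (this is not part of the GDR implementation):

    public class RepairAccuracy {
        // Precision: correctly updated values / all updated values.
        static double precision(int correctlyUpdated, int totalUpdated) {
            return totalUpdated == 0 ? 0.0 : (double) correctlyUpdated / totalUpdated;
        }

        // Recall: correctly updated values / all incorrect values in the database.
        static double recall(int correctlyUpdated, int totalIncorrect) {
            return totalIncorrect == 0 ? 0.0 : (double) correctlyUpdated / totalIncorrect;
        }

        public static void main(String[] args) {
            // Hypothetical counts, for illustration only.
            System.out.println(precision(850, 1000)); // 0.85
            System.out.println(recall(850, 1200));    // ~0.71
        }
    }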

The user in this experiment can afford to verify only F updates; GDR then decides about the rest of the updates automatically. GDR asks the user to verify di of the suggested updates in a group of repairs ci, until we reach F.

In Figure 2.5, we report the precision and recall values resulting from repairing

the database as we increase F (reported as % of dirty tuples). For both datasets the

precision and recall generally improve as F increases. However, for Dataset 1, the

precision is always higher than for Dataset 2. This is due to the lower accuracy of

the learning component for Dataset 2, which stems from the random nature of the

errors in Dataset 2. Overall, these results illustrate the benefit of user feedback—as

the user effort increases, the repair accuracy increases.

2.6 Summary

We presented GDR, a framework that combines constraint-based repair techniques

with user feedback through an interactive process. The main novelty of GDR is to

solicit user feedback for the most useful updates using a novel decision-theoretic mech-

anism combined with active learning. The aim is to move the quality of the database

to a better state as far as the data quality rules are concerned. Our experiments


Fig. 2.5.: Accuracy vs. user efforts: precision and recall vs. user feedback for (a) Dataset 1 and (b) Dataset 2. As the user spends more effort with GDR, the overall accuracy is improved. The user feedback is reported as a percentage of the initial number of the identified dirty tuples.

show very promising results in moving the data quality forward with minimal user

involvement.


3. SCALABLE APPROACH TO GENERATE DATA

CLEANING UPDATES

Existing automatic data cleaning techniques are not scalable; moreover, constraint-based cleaning techniques are recognized to fall short of identifying correct cleaning updates. In Chapter 2, GDR relies on such automatic cleaning techniques to generate candidate updates to the dirty database. To enable GDR to handle large databases, we introduce in this chapter a scalable data cleaning approach that is based on Machine Learning (ML) techniques. Involving ML helps in obtaining more accurate cleaning updates than the constraint-based methods.

The chapter is organized as follows: In Section 3.1, we highlight the need for

different data cleaning techniques and discuss the challenges. Section 3.2 defines the

problem and introduces the notion of maximal likelihood repair. Section 3.3 presents

our solution for modeling dependencies and predicting accurate replacement values.

Section 3.4 presents SCARE, our scalable solution to repair the data. We demon-

strate the validity of our approach and present experimental results on efficiency and scalability in Section 3.5, and finally summarize the chapter in Section 3.6.

3.1 Introduction

Most existing solutions that repair dirty databases by value modification follow constraint-based repairing approaches [3–5], which search for a minimal change to the database that satisfies a predefined set of constraints. While a variety of constraints (e.g., integrity constraints, conditional functional and inclusion dependencies) can detect the presence of errors, they are recognized to fall short of guiding the correction of those errors and, worse, may introduce new errors when repairing the data [6]. Moreover,

despite the research conducted on integrity constraints to ensure the quality of the


data, in practice, databases often contain a significant amount of non-trivial errors.

These errors, both syntactic and semantic, are generally subtle mistakes which are

difficult or even impossible to express using the general types of constraints available

in modern DBMSs [7]. This highlights the need for different techniques to clean dirty

databases.

In this chapter, we address the issues of scalability and accuracy of replacement values by leveraging Machine Learning (ML) techniques to predict better-quality updates for repairing dirty databases.

Statistical ML techniques (e.g., decision trees, Bayesian networks) can capture dependencies, correlations, and outliers from datasets based on various analytic, predictive or computational models [8]. Existing efforts in data cleaning using ML techniques have mainly focused on data imputation (e.g., [7]) and deduplication (e.g., [9]). To

the best of our knowledge, our work is the first approach to consider ML techniques

for repairing databases by value modification.

Involving ML techniques for repairing erroneous data is not straightforward and

it raises four major challenges: (1) Several attribute values (of the same record) may

be dirty. Therefore, the process is not as simple as predicting values for a single

erroneous attribute. This requires accurate modeling of correlations between the

database attributes, assuming a subset is dirty and its complement is reliable. (2) A

ML technique can predict an update for each tuple in the database; and the question

is how to distinguish the predictions that should be applied. Therefore, a measure

to quantify the quality of the predicted updates is required. (3) Overfitting may occur when modeling a database with a large variety of dependencies that hold locally for data subsets but do not hold globally. (4) Finally, the process of learning a model from a very large database is expensive, and the prediction model itself may not fit in main memory. Although scalable ML techniques for large datasets exist, they are either model dependent (i.e., limited to specific models,

for example SVM [10]) or data dependent (e.g., limited to specific types of datasets


such as scientific data and document repositories). Note that scalability is also an issue for the constraint-based repairing approaches [11].

Such limitations motivate the need for effective and scalable methods to accurately

predict cleaning updates with statistical guarantees. Specifically, in this chapter, our contributions can be summarized as follows:

• We formalize a novel data repairing approach that maximizes the likelihood of the data given the underlying data distribution, which can be modeled using statistical ML techniques. The objective is to apply selected database updates that (i) will best preserve the relationships among the data values, and (ii) will introduce a small amount of changes. This approach enables a variety of ML techniques to be involved for the purpose of accurately repairing dirty databases by value modification. This way, we eliminate the need to predefine database constraints, which requires expensive expert involvement. In contrast to the constraint-based data repair approaches, which find the minimum number of changes to satisfy a set of constraints, our likelihood-based repair approach finds a bounded amount of changes that maximizes the data likelihood.

• One of the challenges is that multiple attribute values may be considered dirty. Therefore, we introduce a technique that provides predictions for multiple attributes at a time, while taking into account two types of dependencies: (i) the dependency between the identified clean attributes and the dirty attributes, as well as (ii) the dependency among the dirty attributes themselves. We present our technique by introducing the probabilistic principles on which it relies.

• We propose SCARE (SCalable Automatic REpairing), a systematic framework for repairing erroneous values that follows the new repairing approach and, more importantly, is scalable to very large datasets. SCARE has a robust mechanism for horizontal data partitioning to ensure scalability and enable parallel processing of data blocks; various ML methods are applied to each data block to model correlations among attribute values and provide “local” predictions. We then provide a novel mechanism to combine the local predictions from several data partitions. The mechanism computes the validity of the predictions of the individual ML models and takes into account the models’ reliability, in terms of minimizing the risk of wrong predictions, as well as the significance of the partition sizes used in the learning stage. Finally, given several local predictions for repairing a tuple, we incorporate these predictions into a graph optimization problem, which captures the associations between the predicted values across the partitions, and obtain more accurate final tuple repair predictions.

• We present an extensive experimental evaluation to demonstrate the effective-

ness, efficiency, and scalability of our approach on very large real-world datasets.

3.2 Problem Definition and Solution Approach

In this section, we formalize our maximal likelihood repair problem and introduce

our solution approach.

3.2.1 Problem Definition

We consider a database instance D over a relation schema R with A denoting its

set of attributes. The domain of an attribute A ∈ A is denoted by dom(A).

In the relation R, a set F = {E1, . . . , EK} ⊆ A represents the flexible attributes,

which are allowed to be modified (in order to substitute the possibly erroneous values),

and the other attributes R = A − F = {C1, . . . CL} are called reliable with correct

values. Hence a database tuple t has two parts: the reliable part (t[R] = t[C1, . . . CL]),

and the flexible part (t[F ] = t[E1, . . . EK ]). For short we refer to t[R] and t[F ] as r

and f , respectively (i.e., t = rf). We assume that it is possible to identify a subset

Dc ⊂ D of clean (or correct) tuples and De = D − Dc represents the remaining

Fig. 3.1.: Illustrative example

possibly dirty tuples. This distinction does not have to be accurate in specifying the

dirty records, but it should be accurate in specifying the clean records. Our objective

is to learn from the correct tuples in Dc to predict accurate replacement values for

the possibly dirty tuples in De.

There are various techniques to distinguish Dc, as it is always possible to use reference data and existing statistical techniques (e.g., [8, 38]), as well as database constraints (if available), to provide a score Pe(t) ∈ [0, 1] for each database tuple t for being erroneous. By applying a conservative threshold on the score Pe(t) of each tuple, we can select high-quality records to be used for training.

Example 3.2.1 Consider the example relation in Figure 3.1 with a sample of 8 tu-

ples about some personal information: Name, Institution, area code AC, telephone

number Tel, in addition to address information: City, State and Zip.

This data is the result of integrating professional contact information with a lookup address database. Due to the integration process, we know that some of the address

attributes (City, State and Zip) may contain errors. Therefore, we call the address

attributes flexible attributes. After the integration process, we could separate high

quality records by consulting other reference data or verifying some widely known re-

lationships among the attributes. For this example, tuples t5, . . . , t8 ∈ Dc are identified

as correct ones, while we are not sure about tuples t1, . . . , t4 ∈ De.


We introduce the data repair likelihood given the data distribution as a technique to guide the selection of the updates to repair the dirty tuples. Our hypothesis is that the more an update makes the data follow the underlying data distribution at the least cost, the more likely the update is to be correct.

The likelihood of the database D is the product of the tuples’ probabilities given

a probability distribution for the tuples in the database. Given the identified clean

subset of the database Dc, we can model the probability distribution P (R,F ). Then,

the likelihood of the possibly erroneous subset De can be written (as a log-likelihood) as:

$L(D_e \mid D_c) = \sum_{t \in D_e} \log P(t \mid D_c) = \sum_{t = rf \in D_e} \log P(f \mid r)$   (3.1)

where we use P (t | Dc) = P (f | r), which we discuss in Section 3.3.

Assume that, for a given tuple t = rf, a ML technique predicts f′ instead of f. We say that the update u is predicted to replace f by f′. Applying u to the database will change the likelihood of the data; we call the amount of increase in the data likelihood given the data distribution the likelihood benefit of u.

Definition 3.2.1 Likelihood benefit of an update u (l(u)): Given a database

D = Dc ∪De, t = rf ∈ De and an update u to replace f by f ′, the likelihood benefit

of u is the increase in the database likelihood given the data distribution learned from

Dc, or $(L(D^u_e \mid D_c) - L(D_e \mid D_c))$, where $D^u_e$ refers to De when the update u is applied.

Using Eq. 3.1 we obtain

$l(u) = \log P(f' \mid r) - \log P(f \mid r).$   (3.2)

We also define the cost of an update as follows:

Definition 3.2.2 Cost of an update u (c(u)): For a given database tuple t = rf

and an update u to replace f by f ′, the cost of u is the distance between f and f ′,

$c(u) = \sum_{E \in F} d_E(f[E], f'[E])$   (3.3)


where dE(f [E], f ′[E]) is a distance function for the value domain of attribute E that

returns a score between 0 and 1. Examples of distance functions for string attributes

include the normalized Edit distance or Jaro coefficient; for numerical attributes,

the normalized distance can be used, e.g., $d_E(a, b) = \frac{|a - b|}{\max_E - \min_E}$, where $a$ and $b$ are two numbers in dom(E), and $\max_E, \min_E$ are the maximum and minimum values in dom(E), respectively.
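For illustration, the following minimal Java sketch computes the likelihood benefit of Eq. 3.2 and the cost of Eq. 3.3 for a predicted update. The class and method names are illustrative only and are not part of the SCARE implementation; the character-level distance is a crude placeholder for the normalized edit distance mentioned above, and we assume f and f′ share the same set of flexible attributes.

    import java.util.Map;

    public class UpdateScores {
        // Likelihood benefit (Eq. 3.2): log P(f' | r) - log P(f | r).
        // The logarithm base is immaterial as long as it is used consistently.
        static double likelihoodBenefit(double pPredicted, double pOriginal) {
            return Math.log10(pPredicted) - Math.log10(pOriginal);
        }

        // Cost (Eq. 3.3): sum of per-attribute distances between the original and
        // predicted flexible values; f and fPrime are assumed to have the same keys.
        static double cost(Map<String, String> f, Map<String, String> fPrime) {
            double c = 0.0;
            for (String attr : f.keySet()) {
                c += normalizedDistance(f.get(attr), fPrime.get(attr));
            }
            return c;
        }

        // Crude character-level stand-in for a normalized distance in [0, 1];
        // a real implementation would use normalized edit distance or Jaro.
        static double normalizedDistance(String a, String b) {
            if (a.equals(b)) return 0.0;
            int n = Math.max(a.length(), b.length()), diff = 0;
            for (int i = 0; i < n; i++) {
                char ca = i < a.length() ? a.charAt(i) : '\0';
                char cb = i < b.length() ? b.charAt(i) : '\0';
                if (ca != cb) diff++;
            }
            return (double) diff / n;
        }
    }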

Our objective is to modify the data to maximize its likelihood; however, similar to existing repairing approaches, we need to be conservative in modifying the data. Therefore, we bound the amount of changes introduced to the database by a parameter δ. Hence, the problem becomes: given an allowed amount of changes δ, how do we best select the cleaning updates from all the predicted updates? This is a constrained maximization problem where the objective is to find the updates that maximize the likelihood value under the constraint of a bounded amount of database changes, δ. We call this problem the “Maximal Likelihood Repair” problem.

Definition 3.2.3 Maximal Likelihood Repair: Given a scalar δ and a database

D = De ∪ Dc. The Maximal Likelihood Repair problem is to find another database

instance $D' = D'_e \cup D_c$, such that $L(D'_e \mid D_c)$ is maximum subject to the constraint $Dist(D, D') \leq \delta$.

where Dist is a distance function between the two database instances D and D′ before and after the repairing; it can be defined as $Dist(D, D') = \sum_{\forall t \in D,\, A \in \mathcal{A}} d_A(t[A], t'[A])$, where $t' \in D'$ is the repaired tuple corresponding to tuple $t \in D$.

Regarding the estimation of δ, it is possible to use the score Pe(t), which estimates how erroneous tuple t is, to estimate $\delta = \epsilon \sum_{t \in D_e} P_e(t)$, where $\epsilon \in [0, 1]$. The idea is that a possibly erroneous tuple is expected to be modified according to its score of being erroneous. ϵ can be chosen close to zero to be more conservative about the amount of introduced changes.


3.2.2 Solution Approach

For each tuple t = rf , we obtain the prediction f ′ that represents an update u

to t. We compute the likelihood benefit and cost of u. Finally, we need to find the

subset of updates that maximizes the overall likelihood subject to the constraint that

the total cost is not more than δ, i.e., Dist(D,D′) ≤ δ.

Formally, we are given a set U of updates and, for each update u, we compute l(u) and c(u) using Eq. 3.2 and 3.3, respectively. Our goal is to find the set of updates $U' \subseteq U$ such that $\sum_{\forall u \in U'} l(u)$ is maximum subject to:

$\sum_{\forall u \in U'} c(u) \leq \delta.$   (3.4)

This is typically a 0/1 knapsack problem setting, which implies that the maximal

likelihood repair problem is NP-complete.

Heuristic and quality measure: To solve the above problem, we use the well-known greedy heuristic for the 0/1 knapsack problem, processing the updates in decreasing order of the ratio $l(u)/c(u)$. This heuristic suggests that the “correctness measure”

of an update u is the ratio of the update’s likelihood benefit to the cost of applying

the update to the database (i.e., the higher the likelihood benefit with small cost, the

more likely the update to be correct). Empirically, this gives good predictions for the

updates as we will illustrate in our experiments.
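A minimal Java sketch of this greedy selection is shown below. It assumes each candidate update already carries its l(u) and c(u) with c(u) > 0 (i.e., f′ differs from f); the Update class is a hypothetical placeholder and not part of the SCARE implementation.

    import java.util.ArrayList;
    import java.util.List;

    public class GreedyRepairSelection {
        // A predicted update with its likelihood benefit l(u) and cost c(u) > 0.
        static class Update {
            final double l, c;
            Update(double l, double c) { this.l = l; this.c = c; }
        }

        // Greedy 0/1-knapsack heuristic: take updates in decreasing order of l(u)/c(u)
        // while the accumulated cost stays within the change budget delta.
        static List<Update> select(List<Update> candidates, double delta) {
            List<Update> sorted = new ArrayList<>(candidates);
            sorted.sort((a, b) -> Double.compare(b.l / b.c, a.l / a.c));
            List<Update> chosen = new ArrayList<>();
            double spent = 0.0;
            for (Update u : sorted) {
                if (spent + u.c <= delta) {
                    chosen.add(u);
                    spent += u.c;
                }
            }
            return chosen;
        }
    }

Candidates with zero cost correspond to f′ = f and are assumed to be filtered out before this selection step.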

Example 3.2.2 In Figure 3.1, assume that two updates were predicted to the

database: u1 updates t3 such that f′3 = {“Chicago”, “IL”, “60614”}, and u2 updates t4 such that f′4 = {“WLafayette”, “IN”, “47907”}. Assume also that u1 and u2 have the same likelihood benefit, l(u1) = l(u2). In this case, u2 incurs the lower cost, updating one character in each of the Zip and City attributes (the Zip from “47906” to “47907” and the City from “Lafayette” to “WLafayette”), while u1 costs updating 4 characters in the Zip, from “61801” to “60614”. Hence, for δ ≤ 2 characters, only

u2 will be applied to the database.


3.3 Modeling Dependencies and Predicting Updates

The key challenge when considering data repair using the data distribution is

that multiple attribute values may be dirty. In the case when a single attribute is erroneous, the problem is to model the conditional probability distribution of the erroneous attribute given the other attributes, and hence, a single classification model can be used to obtain the predicted values for the erroneous attribute. However, it is often the case that a set of attributes, rather than a single attribute, has low-quality values. Therefore, we need to model the probability distribution of the subset of

dirty attributes given the other attributes that have reliable values (i.e., most likely

to be correct) to achieve a better prediction of the replacement values.

Example 3.3.1 In Figure 3.1, assume we know that only the City attribute contains some errors. This is the simple case, because a ML model can be trained on the database tuples with City as the label to be predicted. However, in

practice, more than one attribute can be dirty at the same time, for example, all the

address attributes. In this case, we can use a ML technique to model the distribu-

tion of the combination (City, State, Zip)—taking into account their possible inter-

dependencies—given existing reliable values of attributes, e.g., (Name, Institution,

AC, Tel).

3.3.1 Modeling Dependencies

Let SR = dom(C1) × dom(C2) × · · · × dom(CL) denote the space of possible reliable parts of tuples t[C1 . . . CL] (with clean attribute values), and SF = dom(E1) × dom(E2) × · · · × dom(EK) denote the space of possible flexible parts of tuples t[E1 . . . EK]

(with possibly erroneous values). Assuming that the tuples of D are generated ran-

domly according to a probability distribution P (R,F ) on SR × SF , P (F |r) is the


conditional distribution of F given R = r and PEi(Ei|r) is the corresponding marginal

distribution of the values of attribute Ei,

$P_{E_i}(e_i \mid r) = \sum_{f \in S_F:\, f[E_i] = e_i} P(f \mid r)$

Note that the posterior probability distribution P (F | r) provides the means to an-

alyze the dependencies among the flexible attributes. The distribution informs about

the probability of each combination of values for the flexible attributes ⟨e1, . . . , eK⟩,

where e1 ∈ dom(E1), . . . , eK ∈ dom(EK).

Given a database tuple t = rf , the conditional probability of each combination of

the flexible attribute values f can be computed using the product rule:

$P(f \mid r) = P(f[E_1] \mid r) \prod_{i=2}^{K} P(f[E_i] \mid r, f[E_1 \ldots E_{i-1}]).$   (3.5)

Note that we assume a particular order in the dependencies among the flexible

attributes {E1, . . . , EK}. To obtain this order, we leverage an existing technique [68]

to construct a dependency network for the database attributes. The dependency

network is a graph with the database attributes as the vertices; and there is a directed

edge from Ai to Aj if the analysis determined that Aj depends on Ai. In our case,

there will be two sets of vertices: the reliable set R and the flexible set F. The first flexible attribute E1 in the order is the one that has the maximum number of reliable attributes as its parents in the graph. Subsequently, the next attribute in order i is the one with the maximum number of parents that are either reliable attributes

or flexible attributes with an assigned order. In our experiments, we followed this

procedure by analyzing a sample of the database to determine the dependency order

of the flexible attributes. An alternative method to compute the conditional probability P(f | r) without considering any particular order of the flexible attributes is to use Gibbs sampling [69]; however, it is very expensive to apply even to moderate-size databases. Please refer to [68] for further details.

In Section 3.3.2, we introduce an efficient way to obtain the predictions f ′ that is

desirable for our cleaning approach and compute the conditional probabilities P (f | r).


Algorithm 3.1 GetPredictions(Classification model Mi, input tuple ri = ⟨r, f[E1], . . . , f[Ei−1]⟩, probability P, database tuple t = rf)
1: if (i > K) then
2:    f′ = ri − r
3:    AllPredictions = AllPredictions ∪ {(f′, P)}
4:    return
5: end if
6: fEi = Mi(ri)
7: rs = ⟨ri, fEi⟩
8: Ps = P × P(fEi | ri)
9: GetPredictions(Mi+1, rs, Ps, t)
10: if fEi ≠ t[Ei] then
11:    r′s = ⟨ri, t[Ei]⟩  {adding the original value of Ei to the next input}
12:    P′s = P × P(t[Ei] | ri)
13:    GetPredictions(Mi+1, r′s, P′s, t)  {predicting attribute Ei+1’s value}
14: end if

3.3.2 Predicting Updates

We use a ML model M (as predictor) to model the above joint distribution in

Eq. 3.5. The model M is a mapping SR → SF that assigns (or predicts) the flexible

attributes values f ′ for a database tuple t = rf given the values r of the reliable

attributes R. The prediction takes the form:

M(r) = ⟨M1(r), . . . ,MK(r)⟩ = f ′.

To estimate the joint distribution of the flexible attribute values, P (f | r) in Eq.

3.5, we learn K classification models Mi(·) on the input space SR × dom(E1)× · · · ×

dom(Ei−1), (i.e., using all the reliable attributes and the flexible attributes up to

attribute Ei).

Mi : SR × dom(E1)× · · · × dom(Ei−1) → dom(Ei)


We assume that Mi is a probabilistic classifier (e.g., Naïve Bayesian) that will be

trained using Dc and produce a probability distribution over the values of the flexible

attribute Ei given ⟨r, f [E1], . . . , f [Ei−1]⟩.

One efficient greedy way to approximate the optimal prediction f ′ is to proceed as

follows: given a tuple t = rf , the classifier M1 is used to predict the value of attribute

E1 (i.e., f ′[E1]) given r. Then, M2 predicts the value for attribute E2 given r and

f ′[E1] as input. Proceeding in this way, Mi predicts the value of attribute Ei given

r and f′[E1] . . . f′[Ei−1]. This approach can be viewed as greedily searching for a path in a tree that has the possible values of f′ ∈ SF at its leaves. We call this tree the flexible attribute values search tree. Needless to say, this approach does not guarantee finding the prediction f′ with the highest probability.

For a tuple t, to find a better prediction that suits our cleaning approach, we follow the conservative assumption in updating the database of considering, and preferring, the original attribute values in the tuple. Under this assumption, the best prediction will be among the explored tree paths that involve the original values of the tuple t. Hence, we can explore, in addition to the greedy path, additional paths that take the original values in the tuple as the supposed predictions.

The algorithm to compute a set of predictions for the flexible attributes F for a given tuple t is described in Algorithm 3.1, GetPredictions. Basically, GetPredictions proceeds recursively in the flexible attribute values search tree. At each tree node at level i, two branches are considered when the prediction of attribute Ei is different from its original value in the tuple; otherwise, a single branch is considered. The initial call to Algorithm 3.1 to get predictions for tuple t = rf is GetPredictions(M1, r, 1.0, t = rf).

In Algorithm 3.1, Line 1 checks if we have reached past the prediction of the last flexible attribute EK; in this case, we add the flexible part f′ of the obtained final tuple to the AllPredictions list. In Line 6, we predict the value fEi of attribute Ei. Lines 7 and 8 compose the new input rs by adding fEi to ri and compute the prediction probability so far, Ps. We then proceed recursively to get the prediction for the next


flexible attribute Ei+1. Lines 11–13 are executed if the predicted value fEi is different from the original value t[Ei]. In this case, we compose another input r′s using the original value t[Ei], compute the prediction probability so far using P(t[Ei] | ri) from the model Mi, and finally proceed recursively to get a prediction for Ei+1.
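The recursion of Algorithm 3.1 can be rendered in Java roughly as follows. The ProbabilisticClassifier interface and the list-based tuple representation are hypothetical placeholders for the trained models M1, . . . , MK; this is a sketch, not the actual SCARE implementation.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    interface ProbabilisticClassifier {
        String predict(List<String> input);                   // most likely value of E_i
        double probability(String value, List<String> input); // P(value | input)
    }

    public class PredictionSearch {
        final List<ProbabilisticClassifier> models;  // M_1 ... M_K
        final List<String> originalFlexible;         // original values f[E_1 .. E_K]
        final List<Map.Entry<List<String>, Double>> allPredictions = new ArrayList<>();

        PredictionSearch(List<ProbabilisticClassifier> models, List<String> originalFlexible) {
            this.models = models;
            this.originalFlexible = originalFlexible;
        }

        // Mirrors Algorithm 3.1: follow the greedy branch and, whenever the prediction
        // differs from the original value, an extra branch that keeps the original value.
        void run(int i, List<String> input, double p) {
            int K = models.size();
            if (i > K) {                                       // past the last flexible attribute
                List<String> fPrime = new ArrayList<>(input.subList(input.size() - K, input.size()));
                allPredictions.add(Map.entry(fPrime, p));
                return;
            }
            ProbabilisticClassifier m = models.get(i - 1);
            String predicted = m.predict(input);
            run(i + 1, append(input, predicted), p * m.probability(predicted, input));
            String original = originalFlexible.get(i - 1);
            if (!predicted.equals(original)) {
                run(i + 1, append(input, original), p * m.probability(original, input));
            }
        }

        private static List<String> append(List<String> xs, String x) {
            List<String> out = new ArrayList<>(xs);
            out.add(x);
            return out;
        }
    }

The initial call for a tuple t = rf is run(1, r, 1.0), matching GetPredictions(M1, r, 1.0, t = rf).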

Example 3.3.2 Consider the example relation in Figure 3.1. Assume that tuple t4

was marked as erroneous and we want to obtain predictions for its flexible attributes.

In GetPredictions, ri is initially the reliable attribute values {“C. Clifton”, “Purdue Univ.”, “765”, “494-6005”}. In Line 6, the classifier M1, which was trained using only the set of reliable attributes to predict the first flexible attribute City, provides the prediction “WLafayette”. Then rs is composed to be the input to the next classifier M2, which was trained on the reliable attributes and the first flexible attribute City to predict the second flexible attribute State: rs = {“C. Clifton”, “Purdue Univ.”, “765”, “494-6005”, “WLafayette”}. Since the predicted City is different from the one in the table, we compose another input to the classifier M2 with the original City value, r′s = {“C. Clifton”, “Purdue Univ.”, “765”, “494-6005”, “Lafayette”}. We obtain the prediction for the State given the two inputs rs and r′s and proceed recursively until we use M3 to predict the Zip; finally, we extract f′ from each ri in

Line 2 to end up with the list of AllPredictions.

GetPredictions produces, for a given tuple t, at most $2^K$ predictions with their probabilities; however, in practice, the number of predictions is far smaller than $2^K$. We select the prediction f′ with the best benefit-cost ratio, i.e., the update u that replaces f with f′ and results in the highest $l(u)/c(u)$. Note that we need to compute l(u)

for only f ′ with P (f ′ | r) being greater than P (f | r), the probability of the original

values, otherwise the likelihood benefit of the predicted update will be negative. Note

also that P (f | r) is included as well in the output of GetPredictions.

In Section 3.4, we present the method to scale up the maximal likelihood repair

problem and get the predicted updates u along with their likelihood benefit l(u). The

cost c(u) is straightforward to compute.


3.4 Scaling Up the Maximal Likelihood Repairing Approach

One of the key challenges in repairing dirty databases is scalability [11]. For the case of maximal likelihood repair, the scalability issue is mainly due to learning a set of classification models to predict the flexible attribute values. Usually, this process is at least quadratic in the database size. Moreover, the learning process and the model itself may not fit in main memory. There are indeed efforts on learning from large-scale datasets, but most of these techniques are either limited to specific ML models (e.g., scalable learning using SVM [10]) or to specific types of datasets such as scientific data and document repositories.

In this section, we present a model-independent method to learn and predict up-

dates to the database that is based on horizontally partitioning the database. Each

database tuple will be a member of several partitions (or blocks). Each partition b

is processed to provide predictions for the erroneous tuples t ∈ b based on the data distribution learned from block b (i.e., local predictions). Finally, we present

a novel mechanism to combine the local predictions from the different partitions and

determine more accurate final predictions.

This method is in the spirit of ensemble learning [70] or committee-based approaches, where the task is to predict a single class attribute by partitioning the dataset into several smaller partitions; a model is then trained on each data partition. For a given tuple, each model provides a prediction for the class attribute, and the final prediction is the one with the highest aggregated prediction probability. In our case, however, we want to predict the values of multiple flexible attributes together; we are not limited to predicting a single attribute value. Hence, for a given tuple,

we obtain a prediction (a combination f ′ of the flexible attribute values) from each

data partition. We then propose a technique to combine the models’ predictions into

a graph optimization problem to find the final prediction for the flexible attributes.

Our main insight is that the final (combinations of) predicted values are those which

would maximize the associations among the predicted values across the partitions.


Our mechanism to collect and incorporate the predicted updates takes into account

the reliability of the learned classification models themselves, to minimize the risk that the predicted updates are wrong.

After obtaining the final predicted values with their likelihood benefit l(u), we use

them in the maximal likelihood repair problem (Eq. 3.4).

3.4.1 Process Overview

Algorithm 3.2 illustrates the main steps of the SCARE process to get the predicted

updates along with their likelihood benefit. The primary input to the framework is

a database instance D. The second input is a set of database partitioning functions

(or criteria) H = {h1, . . . , hJ}.

There are two main phases for SCARE: (1) Updates generation phase (lines 1-8),

and (2) Tuple repair selection phase (lines 9-13).

In Phase 1 (Line 1), each function hj ∈ H will partition D into blocks

{b1j, b2j, . . . }. Then, the loop in lines 2-8 processes each block bij as follows: (i)

Learn the set of classifiers Mij from the identified clean tuples in bij (Line 3); (ii) use Mij to predict the flexible attribute values for the possibly erroneous tuples in bij using Algorithm 3.1 (lines 4-7). For each tuple, the prediction is considered a candidate tuple repair and is stored in a temporary repair storage, denoted RS. Since each tuple will be a member of several data partitions, we will end up with a set of candidate tuple repairs for each possibly erroneous tuple. The details of the repair generation are provided in Section 3.4.2.

Phase 2 (lines 9-13) loops on each tuple t ∈ De and retrieves all its candidate tuple

repairs from the repair storage RS, then uses Algorithm 3.3 SelectTupleRepair to get

the final tuple repair (update) with its estimated likelihood benefit. The details of

the repair selection algorithm are provided in Section 3.4.3. Note that each iteration

in Phase 1 does not depend on other iterations (similarly for the iterations of Phase

2). Hence, SCARE can be efficiently parallelized.


Algorithm 3.2 SCARE(D dirty database, H = {h1, . . . , hJ} DB partitioning functions)
1: Given H, partition D into blocks bij.
2: for all block bij do
3:    Learn the models Mij.
4:    for all tuple t = rf ∈ bij ∧ t ∈ De do
5:       Use Mij to predict f′j and get Pij(f′j | r) and Pij(f | r)
6:       Store f′j, Pij(f′j | r) and Pij(f | r) in RS.  {store in the repair storage}
7:    end for
8: end for
9: for all tuple t = rf ∈ De do
10:    RS(t) = the candidate tuple repairs for t in RS.
11:    f′ = SelectTupleRepair(RS(t))
12:    For the update u changing f to f′, compute the likelihood benefit l(u) if f ≠ f′.
13: end for

3.4.2 Repair Generation Phase

In this phase, the data is partitioned, as we will explain shortly, for two main benefits: (i) scaling to large data by enabling independent processing of each partition, and (ii) more accurate and efficient learning of the classification models for the pre-

diction task. The first benefit is obvious and the second benefit is obtained because

of the following: When we train a classification model for prediction, ideally, we need

the model to provide high prediction accuracy, capturing all the possible dependencies in the data (we call this a model with a global view). All the statistically significant dependencies are considered the model's search space. However, if the space contains many weak dependencies, most likely the model will not be able to capture them. And if it does, the global view will not be accurate enough for prediction because of model overfitting. Partitioning the database helps to capture local correlations

that are significant within subsets of the database and that require a different degree

of “zooming” to be recognized. Each of the partition functions h ∈ H provides the

search space partitioned according to the criteria shared by the tuples within the


same block. If we train models on multiple blocks, we will have models with several

local views (or specialized models, sometimes called experts [71]) for portions of the

search space. Combining these local views results in better prediction accuracy.

Partitioning the database: Each partition function or criterion h(·) maps each

tuple to one of a set of partitions. Multiple criteria H = {h1, . . . , hJ} are used to partition the database in different ways. Each tuple t is mapped to a set of partitions, i.e., $H(t) = \bigcup_{\forall j} h_j(t)$.

A simple way to choose the partition criteria is Random (i.e., randomly parti-

tion the data many times). Another way to choose the criteria is Blocking, where

partitions are constructed under the assumption that similar tuples will fit in the

same block or inversely, tuples across different blocks are less likely to be similar.

Many blocking techniques have been introduced for the efficient detection of duplicate records (refer to [9] for a survey).

It is worth mentioning that increasing the number of partition functions will result

in a more accurate final prediction, because the variance in the predictions decreases

as we increase the number of ways (partition functions) to partition the data. We

found that partitioning the data using different blocking techniques provided more accurate predictions with fewer partition functions than random partitioning.

Example 3.4.1 Consider again the relation in Figure 3.1. In this example, one may

partition the database based on the Institution attribute (as a partition function)

to get the tuples partitioned as follows: {t1, t7}, {t2, t8}, {t3}, {t4, t5, t6}. The result

of the learning process from these data partitions will be expert models based on the

person’s institution. Another function may use the AC or a combination of attributes.

Partition functions can be designed based on a signature based scheme or clustering

as we elaborate in the experimental section.
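In practice, a partition function can be as simple as a mapping from a tuple to a block key. The following Java sketch shows two hypothetical criteria in the spirit of this example (blocking on Institution and on the area code AC), with tuples represented as attribute-value maps; it is an illustration, not the SCARE implementation.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.function.Function;

    public class Partitioning {
        // Each partition function maps a tuple (an attribute-value map) to the key of
        // the block it falls into; the two criteria below are hypothetical examples.
        static List<Function<Map<String, String>, String>> partitionFunctions() {
            Function<Map<String, String>, String> byInstitution = t -> "inst:" + t.get("Institution");
            Function<Map<String, String>, String> byAreaCode = t -> "ac:" + t.get("AC");
            return List.of(byInstitution, byAreaCode);
        }

        // Group the database under one criterion: block key -> tuples in that block.
        static Map<String, List<Map<String, String>>> partition(
                List<Map<String, String>> db, Function<Map<String, String>, String> h) {
            Map<String, List<Map<String, String>>> blocks = new HashMap<>();
            for (Map<String, String> t : db) {
                blocks.computeIfAbsent(h.apply(t), k -> new ArrayList<>()).add(t);
            }
            return blocks;
        }
    }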

Reliability measure and risk minimization: In order to be conservative in

considering the predictions from each block bij and its model Mij, we propose a


mechanism to measure the reliability of a model and adapt the obtained prediction probability accordingly, to either support or discount the model's predictions.

Two major components help us judge the reliability of a model Mij: (i) the model quality, which is classically quantified by its loss

$L(M_{ij}) = \frac{1}{|b_{ij}|} \sum_{t \in b_{ij},\, t \in D_c,\, E \in F} d_E(f[E], f'_{ij}[E]),$

where $|b_{ij}|$ is the number of tuples in partition $b_{ij}$, E is one of the flexible attributes F, $d_E$ is a distance function for the domain of attribute E, and $f'_{ij}$ is the prediction of model $M_{ij}$ on the flexible attributes F for the tuple $t \in b_{ij}$ with $t \in D_c$; (ii) the second component is the size of the block: the smaller the block, the less reliable the predictions. Hence, the reliability of model $M_{ij}$ can be written as:

$Re(M_{ij}) = \frac{|b_{ij}|}{|D|} \left(1 - L(M_{ij})\right).$   (3.6)

Finally, the prediction probabilities obtained from model Mij are scaled to be:

$P_{ij}(f' \mid r) = P_{ij}(f' \mid r) \times Re(M_{ij}).$
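The reliability adjustment of Eq. 3.6 and the scaling of the prediction probability amount to two one-line computations, sketched here in Java; the block size, database size, and loss value are assumed to be computed elsewhere as defined above.

    public class Reliability {
        // Eq. 3.6: Re(M_ij) = (|b_ij| / |D|) * (1 - L(M_ij)).
        static double reliability(int blockSize, int dbSize, double loss) {
            return ((double) blockSize / dbSize) * (1.0 - loss);
        }

        // Scale a raw prediction probability obtained from model M_ij by its reliability.
        static double scaledProbability(double rawProbability, double reliability) {
            return rawProbability * reliability;
        }
    }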

Aggregating Suggestions: As mentioned earlier, a tuple t = rf will be a

member of |H(t)| data partitions. From each partition, we get a candidate tuple repair for t, which is then stored in the repair storage RS with the following schema:

{t id, partition, E1, ..., EK , P (f ′|r), P (f |r)}, where t id is the original tuple identifier,

partition is the partition name, Ek ∈ F , P (f ′ | r) is the prediction probability of the

repairing update f ′, and P (f | r) is the probability of the original values in t. The

space required for RS is of O(|D| × |H|).
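In memory, each row of RS can be represented by a simple record matching this schema; the following Java (16+) record is a hypothetical illustration, not the actual storage layout used by SCARE.

    import java.util.Map;

    // One candidate tuple repair as stored in RS: the tuple identifier, the partition
    // that produced the prediction, the predicted flexible values f', and the two
    // probabilities P(f' | r) and P(f | r).
    record CandidateRepair(long tupleId,
                           String partition,
                           Map<String, String> predictedFlexibleValues,
                           double pPredicted,
                           double pOriginal) { }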

Example 3.4.2 Consider tuple t4 in Figure 3.1 and the flexible attributes are City

and State, Zip. Assume that we used 5 partition functions and hence t4 was a

member of 5 partitions, consequently, we obtain 5 possible candidate tuple repairs.

The table in Figure 3.2 illustrates the candidate tuple repairs of t4, RS(t4), with the

corresponding prediction probabilities obtained from each partition.

Fig. 3.2.: Generated predictions for tuple repairs with their corresponding prediction probabilities for tuple t4 in Figure 3.1.

3.4.3 Tuple Repair Selection Phase

Once the candidate tuple repairs are generated, we need a repair selection strategy

to pick the best one among the candidate set. One suggestion for a selection strategy

is majority voting. For a tuple t, majority voting can be done by selecting the most voted value from the partitions for each attribute Ei individually.

The majority voting strategy implies the assumption that each attribute was pre-

dicted independently from the others. For a tuple t = rf , we predict the combination

of the flexible attributes f ′ together. Thus, the independence assumption of the

attributes is not valid. Therefore, we propose a mechanism to vote for a final com-

bination of the flexible attributes that takes into account the dependencies between

the predicted values obtained from each partition.

Example 3.4.3 Consider the candidate tuple repairs of t4 in Figure 3.2. Note that

if we use the majority voting while using the prediction probability as the voter’s

certainty, the final prediction would be {“Lafayette”, “IN”, “47906”}. This solution

does not take into account the dependencies between the predicted values within the

same tuple repair. For example, there is a stronger association between “47907”

and “IN” than between “47906” and “IN”. This relationship is reflected on their

corresponding prediction probabilities. The values “47907” and “IN” were predicted

in f ′1, f

′4 with probabilities 0.7 and 0.8, while “47906” and “IN” were predicted in f ′

2, f′3


and their probabilities are smaller, 0.4 and 0.6. The same applies to “WLafayette” and “IN”, which have a stronger association than “Lafayette” and “IN”. A more desirable prediction would be {“WLafayette”, “IN”, “47907”}.

For a given database tuple t = rf , our goal is to find the final combination

$f'^* = \langle e^*_1, \ldots, e^*_K \rangle$ such that $\sum_{b_{ij},\, t \in b_{ij}} P(f'^* \mid r)$ is maximum. This is computationally

infeasible, because this requires the computation of the probability of each possible

combination of the flexible attributes in each data block. Instead, we can search for

the values that can maximize all the pairwise joint probabilities. In principle, if

we maximize the pairwise association between the predicted values, then this implies

maximizing the full association between the predicted values. Hence, the final update

is the one that would maximize the prediction probabilities for each pair of attribute

values. We formalize this problem as follows.

Definition 3.4.1 The Tuple Repair Selection Problem: Given a set of predicted combinations for the flexible attributes $RS(t) = \{f'_1, \ldots, f'_{|H|}\}$ of a database tuple $t = rf$, along with the prediction probability of each combination (i.e., for $f'_j \in RS(t)$ we have the corresponding prediction probability $P(f'_j \mid r)$), the tuple repair selection problem for tuple t is to find $f'^* = \langle e^*_1, \ldots, e^*_K \rangle$ such that the following sum is maximum:

$\sum_{\forall e^*_i, e^*_k,\, i \neq k} \;\; \sum_{\forall f' \in RS(t),\, e^*_i = f'[E_i],\, e^*_k = f'[E_k]} P(f' \mid r).$

To find a solution, we map this problem to a graph optimization problem for

finding the K-heaviest subgraph (KHS) [72] in a K-partite graph (KPG). The key

idea is to process each database tuple t individually and use its set of candidate

tuple repairs, RS(t), to construct a graph, where each vertex is an attribute value,

and an edge is added between a pair of vertices iff the corresponding values co-occur

in a prediction f ′ ∈RS(t). The edges will have a weight derived from the obtained

prediction probabilities. It is worth noting that this strategy is applied to each tuple separately; therefore, this phase can be efficiently parallelized.


Finding KHS in KPG: The K-heaviest subgraph (KHS) problem is an NP

optimization problem [72]. In an instance of the KHS problem, we are given a graph

G = (VG, EG), where VG is the set of vertices of size n, EG is the set of edges with

non-negative weights (Wwv denotes the weight on the edge between vertices w, v),

and a positive integer K < n.

The goal is to find $V' \subset V_G$, $|V'| = K$, where $\sum_{(w,v) \in E_G \cap (V' \times V')} W_{wv}$ is maximum.

In other words, the goal is to find a K-vertex subgraph with the maximum weight.

A graph G = (VG, EG) is said to be K-partite if we can divide VG into K subsets

{V1, . . . , VK}, such that two vertices in the same subset cannot be adjacent. We call the problem of finding a KHS in a K-partite graph such that the subgraph contains a vertex from each partite the KHS in KPG problem.

Definition 3.4.2 The KHS in KPG problem. Given a K-partite

graph $G = (V_1, \ldots, V_K, E_G)$, find $V' = \{v_1, \ldots, v_K\}$ such that $v_k \in V_k$ and $\sum_{(v_i,v_j) \in E_G \cap (V' \times V')} W_{v_i v_j}$ is maximum.

Lemma 3.4.1 The KHS in KPG is NP-Complete.

Proof This is a proof sketch. It is straightforward to see that we can reduce the problem of finding a K-Clique (clique of size K) in a K-partite graph to the KHS in KPG problem. The problem of K-Clique in a K-partite graph G is NP-complete by

reduction from the problem of (n − K)-vertex cover in the complement K-partite

graph G′, which is NP-Complete (see [73] for details).

Solving the Tuple Repair Selection Problem: Given a set of predic-

tions $RS(t) = \{f'_1, \ldots, f'_{|H|}\}$ for the flexible attributes of a tuple $t = rf$, where $f'_j = \langle e^{(j)}_1, \ldots, e^{(j)}_K \rangle$ and $K = |F|$ is the number of flexible attributes.

The repair selection problem can be mapped to the KHS in KPG problem using

the following steps:

1. Building vertex sets for each attribute Ek: For each attribute Ek, create a

vertex v for each distinct value in $\{e^{(1)}_k, \ldots, e^{(|H|)}_k\}$. Note that we have a set of

vertices for each attribute Ek (i.e., partite).


Fig. 3.3.: Step-by-step demonstration of the SelectTupleRepair algorithm: (a) constructed graph, (b) after removing I, (c) after removing F, (d) after removing 6, (e) after removing W. At each iteration, the vertex with minimum weighted degree is removed as long as it is not the only vertex in its corresponding vertex set.

2. Adding edges: Add an edge between vertices v, w when their corresponding

values co-occur in a candidate tuple repair. Note that v, w cannot belong to

the same vertex set.

3. Assign edge weights: For an edge between v, w, the weight is computed as follows. Let $f_{(v,w)} = \{f'_j \mid f'_j \text{ contains both } v, w\}$, i.e., the set of predictions that contain both values v and w. Then

$W_{vw} = \sum_{f'_j \in f_{(v,w)}} P_{ij}(f'_j \mid r),$

where $P_{ij}(f'_j \mid r)$ is the prediction probability of $f'_j$ obtained from partition $b_{ij}$.

The graph construction requires a single scan over the predictions $RS(t) = \{f'_1, \ldots, f'_{|H(t)|}\}$; hence, it is of $O(K\,|H|)$. The number of vertices is the number of distinct values in the candidate tuple repairs.
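A minimal Java sketch of this construction is shown below. It assumes the candidate tuple repairs of RS(t) are given as parallel lists of attribute-value maps and prediction probabilities, and it keys edges by sorted attribute-value pairs; this is illustrative only, not the SCARE implementation.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class RepairGraphBuilder {
        // One partite (vertex set) per flexible attribute, and weighted edges between
        // values that co-occur in some candidate repair f'_j, weighted by P_ij(f'_j | r).
        final Map<String, Set<String>> partites = new HashMap<>(); // attribute -> distinct values
        final Map<String, Double> edgeWeights = new HashMap<>();   // "A=v|B=w" -> summed weight

        void build(List<Map<String, String>> predictions, List<Double> probabilities) {
            for (int p = 0; p < predictions.size(); p++) {
                List<Map.Entry<String, String>> vals = new ArrayList<>(predictions.get(p).entrySet());
                vals.sort(Map.Entry.comparingByKey());           // stable edge keys
                for (Map.Entry<String, String> e : vals) {
                    partites.computeIfAbsent(e.getKey(), k -> new HashSet<>()).add(e.getValue());
                }
                for (int i = 0; i < vals.size(); i++) {          // every co-occurring pair
                    for (int j = i + 1; j < vals.size(); j++) {
                        String key = vals.get(i).getKey() + "=" + vals.get(i).getValue()
                                   + "|" + vals.get(j).getKey() + "=" + vals.get(j).getValue();
                        edgeWeights.merge(key, probabilities.get(p), Double::sum);
                    }
                }
            }
        }
    }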

Example 3.4.4 Figure 3.3(a) shows the constructed 3-partite graph from the pre-

dictions in Figure 3.2 for tuple t4 in the original relation of Figure 3.1. For each

attribute, there is a vertex set (or partite), e.g., the corresponding set of the Zip at-

tribute contains {“47906”, “47907”}. In the graph, we replaced the actual attribute values by character abbreviations to obtain a more compact graph, as follows: {“6”

→ “47906”, “7” → “47907”, “L” → “Lafayette”, “W” → “Wlafayette”, “F” →

“lafytte”, “N” → “IN”, “I” → “IL”}.


Note that there is an edge between “W” and “N” with edge weight of 1.1 (= 0.4

+ 0.7). This is because “WLafayette” and “IN” co-occur twice, in f′1 and f′3, and their

probabilities are 0.7 and 0.4 respectively. Also, there is an edge between “I” and “6”

with weight of 0.5, because “IL” and “47906” co-occur once in f ′5 with probability 0.5.

Similarly, the rest of the graph is constructed.

Finally, finding the KHS in the constructed KPG is a solution to the tuple repair

selection problem. The underlying idea is that the resulting K-subgraph G′(V ′, E ′)

will contain exactly a single vertex from each vertex set. This corresponds to selecting

a value for each flexible attribute. Moreover, the weight of the selected subgraph

corresponds to the result of maximizing the summation in Definition 3.4.1.

Computing the likelihood benefit: For a tuple t = rf , the solution of the

KHS in KPG problem is the final prediction f ′ for the flexible attributes. The final

prediction probability of f ′ is computed from the solution graph G′(V ′, E ′) by

$P(f' \mid r) = \frac{1}{|E'|} \sum_{e_{vw} \in E'} \frac{1}{|f_{(v,w)}|} \sum_{f'_j \in f_{(v,w)}} P_{ij}(f'_j \mid r).$

The inner summation averages the probability of each pair of attribute values (i.e.,

each edge in G′) in the final prediction f ′. The outer summation averages the prob-

ability over all the edges in the final graph G′.

The prediction probability of the original values in the flexible attribute f is

computed following the ensemble method by averaging the obtained probability from

each partition, i.e., $P(f \mid r) = \frac{1}{|H|} \sum_{b_{ij},\, t \in b_{ij}} P_{ij}(f \mid r)$.

Finally, for the update u changing f into f ′, we can compute the likelihood benefit

l(u) using Equation 3.2.

Example 3.4.5 Consider the constructed initial graph in Figure 3.3(a). Assume that the solution of KHS in KPG is the subgraph {“W”, “N”, “7”} shown in Figure

3.3(e). Now, we have an update u to change the original values in tuple t4 in Figure

3.1 from f ={“Lafayette”, “IN”, “47906”} to f ′ ={“WLafayette”, “IN”, “47907”}.

From the RS(t4) in Figure 3.2, we get P (f | r) = avg{0.6, 0.5, 0.3, 0.6, 0.5} = 0.5.


For the final prediction, $P(f' \mid r) = \frac{1}{3}\left[\frac{1}{2}(0.7 + 0.4) + \frac{1}{2}(0.7 + 0.8) + 0.7\right] = 0.66$. Finally, we can use Eq. 3.2 to compute $l(u) = 0.12$.
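The arithmetic of this example can be reproduced with the following short Java sketch, where the per-edge supporting probabilities and the per-partition probabilities of the original values are hard-coded from Figure 3.2; the structure and names are illustrative only.

    import java.util.List;

    public class FinalPredictionProbability {
        // P(f' | r): average over the edges of the solution subgraph G' of the average
        // prediction probability of the candidate repairs supporting each edge.
        static double finalProbability(List<double[]> perEdgeSupportingProbs) {
            double outer = 0.0;
            for (double[] probs : perEdgeSupportingProbs) {
                double inner = 0.0;
                for (double p : probs) inner += p;
                outer += inner / probs.length;
            }
            return outer / perEdgeSupportingProbs.size();
        }

        // P(f | r): plain ensemble average of the original values' probabilities.
        static double originalProbability(double[] perPartitionProbs) {
            double sum = 0.0;
            for (double p : perPartitionProbs) sum += p;
            return sum / perPartitionProbs.length;
        }

        public static void main(String[] args) {
            double pPrime = finalProbability(List.of(
                    new double[]{0.7, 0.4},   // supporting probabilities of the three edges
                    new double[]{0.7, 0.8},   // of the solution subgraph in Figure 3.3(e),
                    new double[]{0.7}));      // as used in the computation above
            double pOriginal = originalProbability(new double[]{0.6, 0.5, 0.3, 0.6, 0.5});
            System.out.println(pPrime);                                    // ~0.66
            System.out.println(Math.log10(pPrime) - Math.log10(pOriginal)); // ~0.12 (Eq. 3.2)
        }
    }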

3.4.4 Approximate Solution for Tuple Repair Selection

For the general problem of finding the KHS, many approximation algorithms have been introduced (e.g., [72, 74, 75]). For example, in [74] the authors model the problem

as a quadratic 0/1 program and apply random sampling and randomized rounding

techniques resulting in a polynomial-time approximation scheme, and in [75], the

algorithm is based on semi-definite programming relaxation.

If K is very small, then the optimal solution can be found by enumeration. For

the case where K is not very small, we provide here an approximate solution that

is inspired by the greedy heuristic discussed in [72]. For the general graph

problem, the heuristic repeatedly deletes a vertex with the least weighted degree

from the current graph until K vertices are left. The vertex weighted degree is the

sum of weights on the edges attached to it.

In the following, we apply the same heuristic to the case of a K-partite graph. However, we iteratively remove the vertex with the least weighted degree as long as it is not the only vertex left in its partite; otherwise, we find the vertex with the next least weighted degree. The algorithm is a greedy 2-approximation following the analysis

discussed in [72].

Algorithm 3.3 shows the main steps to find the final tuple repair. There are two

inputs to the algorithm: (i) the constructed graph G(VG, EG) from the predictions

and (ii) the sets of vertices S = {S1, . . . , SK}, where each Sk represents the predicted

values for attribute Ek. We store for each vertex v its current weighted degree in

$WeightedDegree(v) = \sum_{\forall e_{vw} \in E_G} W_{wv}$, which is the sum of the weights of the edges incident to v.

The algorithm proceeds iteratively in the loop illustrated in lines 1-9. The

loop stops when a solution is found, where there is only one vertex in each ver-


Algorithm 3.3 SelectTupleRepair(G(V, E) graph, S = {S1, . . . , SK})
1: while ∃S ∈ S s.t. |S| > 1 do
2:    v = GetMinWeightedDegreeVertex(G, S)
3:    if v = null then break
4:    for all vertex w ∈ V s.t. ewv ∈ E do
5:       Remove ewv from G.
6:       WeightedDegree(w) −= Wwv
7:    end for
8:    Remove v from G and from its corresponding set S.
9: end while

tex set, i.e., |S| = 1 ∀S ∈ S. In each loop iteration, we start (Line 2) by

finding the vertex v that has the minimum weighted degree using the procedure GetMinWeightedDegreeVertex(G, S). Then, we remove all the edges incident to v and update WeightedDegree(w) by subtracting Wwv, where w was connected to v by the removed edge ewv (Lines 4-7). Finally, vertex v is removed from G and

from its corresponding vertex set in Line 8.

GetMinWeightedDegreeVertex goes through the vertex sets that have more than

one vertex and returns the vertex that has the minimum weighted degree.
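The greedy removal loop, with GetMinWeightedDegreeVertex inlined, can be sketched in Java as follows. The adjacency representation, the "attribute=value" vertex identifiers, and the assumption that edge weights are stored symmetrically are illustrative simplifications, not the actual implementation.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    public class GreedyTupleRepairSelection {
        // Greedy heuristic: repeatedly remove the vertex with minimum weighted degree,
        // as long as it is not the last vertex left in its partite.
        // partites: attribute -> candidate values; adj: "attr=value" -> (neighbor -> weight).
        static Map<String, String> select(Map<String, Set<String>> partites,
                                          Map<String, Map<String, Double>> adj) {
            while (true) {
                String victimAttr = null, victimValue = null;
                double minDegree = Double.MAX_VALUE;
                for (Map.Entry<String, Set<String>> p : partites.entrySet()) {
                    if (p.getValue().size() <= 1) continue;      // never empty a partite
                    for (String value : p.getValue()) {
                        String id = p.getKey() + "=" + value;
                        double deg = adj.getOrDefault(id, Map.of()).values()
                                        .stream().mapToDouble(Double::doubleValue).sum();
                        if (deg < minDegree) {
                            minDegree = deg; victimAttr = p.getKey(); victimValue = value;
                        }
                    }
                }
                if (victimAttr == null) break;                   // each partite has one vertex left
                String victimId = victimAttr + "=" + victimValue;
                for (String w : adj.getOrDefault(victimId, Map.of()).keySet()) {
                    adj.get(w).remove(victimId);                 // update neighbors' degrees
                }
                adj.remove(victimId);
                partites.get(victimAttr).remove(victimValue);
            }
            Map<String, String> repair = new HashMap<>();        // one surviving value per attribute
            for (Map.Entry<String, Set<String>> p : partites.entrySet()) {
                repair.put(p.getKey(), p.getValue().iterator().next());
            }
            return repair;
        }
    }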

Analysis: Algorithm 3.3 requires, first, visiting all n vertices to remove all but K of them. For each vertex, each set S ∈ S of the K sets is visited to get its minimum-weighted-degree vertex. This requires $O(nK \log |S|)$, where $n \approx O(K|H|)$ and the worst case of $|S|$ is $O(|H|)$. Hence, visiting the vertices is of $O(K^2 |H| \log |H|)$. Second, removing the vertices requires visiting their edges, $O(|E_G|)$, which has a worst case of $O(K^2 |H|)$. Thus, the overall complexity of Algorithm 3.3 is $O(K^2 |H| \log |H|)$.

Example 3.4.6 The SelectTupleRepair algorithm is illustrated step-by-step in Figure 3.3. The algorithm looks for the vertex with the least weighted degree to be removed. The first such vertex is "I", which has a weighted degree equal to 1.0 = 0.5 + 0.5, corresponding to the two edges incident to "I". This leaves the vertex set of the State


attribute with only one vertex, “N”. Therefore, we do not consider removing the ver-

tex “N” in further iterations of the algorithm. The next vertex to remove is “F” to

get Figure 3.3(c), and so on.

Finally, we obtain the final solution in Figure 3.3(e), which corresponds to a subgraph with 3 vertices, one from each initial partite. This graph is the heaviest subgraph of size 3 (i.e., the sum of its edge weights is maximal) in which each vertex belongs to a different partite. It is worth mentioning that the final graph does not have to be fully connected. Thus, the final prediction is {"WLafayette", "IN", "47907"}.

3.5 Experiments

In this section, we evaluate our data repair approach; specifically, the objectives of

the experiments are as follows: (1) Evaluation of SCARE and the notion of maximal

likelihood repair in comparison with existing constraint-based approaches for data

repair, (2) Assessment of the scalability of SCARE.

Datasets: In our evaluations, we use three datasets: (i) Dataset 1 is the same

real-world dataset about patients discussed in Section 2.5. We selected a subset

of the available patient attributes, namely Patient ID, Age, Sex, Classification,

Complaint, HospitalName, StreetAddress, City, Zip, State and VisitDate. This

is in addition to the Longitude and Latitude of the address information. This

dataset is dirty and it is used as input to the repairing approaches. The flexible

attributes that we consider for repairing are (City, Zip, HospitalName, Longitude

and Latitude). (ii) Dataset 2 is the US Census Data (1990) Dataset1 containing

about 2 M tuples. It has been used only in the scalability experiments. (iii) Dataset

3 is the Intel Lab Data (ILD2) used to evaluate SCARE for predicting missing values

and to compare it to ERACER [7], a recent system that relies on relational learning for predicting missing data in relational databases.

1 http://archive.ics.uci.edu/ml/datasets/US+Census+Data+(1990)
2 http://db.csail.mit.edu/labdata/labdata.html


Parameters: In our evaluations, we study several parameters, listed here with their default values: (1) e: the percentage of erroneous tuples in the dataset (default 30%); (2) d: the dataset size (default 10,000 tuples); (3) δ: the maximum amount of changes SCARE is allowed to apply, expressed as a fraction of the dataset size d (default 0.1, i.e., 10% of d); (4) I: the number of iterations to run SCARE (default 1); (5) |H|: the number of partition functions (default 5).

All the experiments were conducted on a Linux server with 32 GB of RAM and two 3 GHz processors. We use MySQL to store and query the tuples. For the probabilistic classifiers, we use the Naïve Bayes classifier; specifically, we use the WEKA NBC implementation3 with its default parameter settings. Our approach is implemented in Java, and we use Java threads to benefit from the multiprocessor environment.

Regarding the partition functions using blocking, we repeat the following process |H| times: we randomly sample a small number of tuples from the dataset and cluster them into |D|/n_b clusters, where n_b is the average number of tuples per partition. Then, each tuple is assigned to the closest cluster, whose identifier becomes the tuple's partition name. This process yields different blocking functions because a different random sample of tuples is used in the clustering step of each iteration. Tuples assigned to the same partition have common or similar features, since they are assigned to the closest cluster. In all our quality experiments, we use blocking as the technique to partition the dataset. Another simple way to partition the dataset is to use random partitioning functions. In this case, given n_b, we assign each tuple to a partition name b_ij, where i is a random number from {1, . . . , |D|/n_b} and j = 0 initially. This process is repeated |H| times while incrementing j each time.

3NBC WEKA available at http://www.cs.waikato.ac.nz/ml/weka
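As an illustration of the random partitioning scheme described above, the following is a minimal Java sketch that assigns partition names b_{i,j}; the method signature and the array-based representation of tuples are illustrative assumptions, not part of the dissertation's implementation.

import java.util.Random;

// Minimal sketch of random partitioning: each of the |H| passes assigns every tuple a partition
// name b_{i,j}, where i is drawn uniformly from {1, ..., |D|/n_b} and j identifies the pass.
class RandomPartitioner {

    // partitions[j][t] holds the partition name assigned to tuple t by partition function j.
    static String[][] assign(int numTuples, int nb, int H, long seed) {
        Random rnd = new Random(seed);
        int numPartitions = Math.max(1, numTuples / nb);   // |D| / n_b
        String[][] partitions = new String[H][numTuples];
        for (int j = 0; j < H; j++) {                      // one pass per partition function
            for (int t = 0; t < numTuples; t++) {
                int i = 1 + rnd.nextInt(numPartitions);    // random partition index
                partitions[j][t] = "b_" + i + "_" + j;     // partition name b_{i,j}
            }
        }
        return partitions;
    }
}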


[Fig. 3.4(a): Precision vs. percent of errors (e), 10-50%; Fig. 3.4(b): Recall vs. percent of errors (e); curves for KHSinKPG, MV, SM, and ConstRepair.]

Fig. 3.4.: Quality vs. the percentage of errors: SCARE maintains high precision by

making the best use of δ, the allowed amount of changes.

3.5.1 Repair Quality Evaluation

To evaluate the quality of the automatic repair, we manually cleaned the dirty

datasets (e.g., for Dataset 1, using address web sites, external reference data sources, and visual inspection) to obtain the ground truth, and we compare the clean versions of the datasets (and each replacement value) with the repair output of our method.

In the following experiments, we use the standard precision and recall to measure

the quality of the applied updates for string attributes.

The precision is defined as the ratio of the number of values that have been

correctly updated to the total number of values that were updated, while the recall

is defined as the ratio of the number of values that have been correctly updated to

the number of incorrect values in the entire database. For numerical attributes, we

use the mean absolute error, MAE = \frac{1}{N}\sum_{i=1}^{N} |v_i - a_i|, where v_i is the value suggested by SCARE, a_i is the corresponding actual value in the original data, and N is the number of compared numerical values. We can compute these values since we know the ground truth for Dataset 1.
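For concreteness, the three quality measures can be computed as in the minimal Java sketch below; the flat arrays of dirty, repaired, and ground-truth values are a simplification of the actual evaluation setup.

// Minimal sketch of the evaluation measures: precision and recall over applied string updates,
// and mean absolute error (MAE) over repaired numerical values.
class RepairMetrics {

    // precision = correctly updated / all updated; recall = correctly updated / all incorrect.
    static double[] precisionRecall(String[] dirty, String[] repaired, String[] truth) {
        int updated = 0, correctUpdates = 0, incorrect = 0;
        for (int i = 0; i < truth.length; i++) {
            if (!dirty[i].equals(truth[i])) incorrect++;                 // value needed a repair
            if (!repaired[i].equals(dirty[i])) {                         // SCARE changed it
                updated++;
                if (repaired[i].equals(truth[i])) correctUpdates++;      // ...and got it right
            }
        }
        double precision = updated == 0 ? 0.0 : (double) correctUpdates / updated;
        double recall = incorrect == 0 ? 0.0 : (double) correctUpdates / incorrect;
        return new double[] {precision, recall};
    }

    // MAE = (1/N) * sum |v_i - a_i| over the N compared numerical values.
    static double meanAbsoluteError(double[] suggested, double[] actual) {
        double sum = 0.0;
        for (int i = 0; i < suggested.length; i++) sum += Math.abs(suggested[i] - actual[i]);
        return suggested.length == 0 ? 0.0 : sum / suggested.length;
    }
}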


We report the quality results for four approaches:

• KHSinKPG : This is SCARE with the described tuple repair selection strategy

as described in Section 3.4.3.

• MV : In this approach, SCARE directly uses majority voting to select an attribute value from the candidate tuple repairs. We include this approach in the evaluation to compare our tuple repair selection strategy to the straightforward majority voting strategy.

• SM : The single-model approach, where the whole dataset is considered as a single partition. Afterward, the likelihood benefit and cost are computed to select the best updates. This approach is included to show the advantages of combining several models with local views rather than using a single model with a global view of the data.

• ConstRepair : One of the recent constraint-based repair approaches. We implemented (to the best of our understanding) the technique described in [3], which uses CFDs as constraints to find dirty tuples and derive repairs. We manually compiled a list of CFDs during the process of cleaning Dataset 1. Moreover, we implemented a CFD discovery algorithm [36] whose output is used as input to the repairing algorithm. To obtain high-quality rules and be fair to this approach, we discovered the CFDs from the original "correct" data, which cannot be the case in a realistic situation, as we usually start with dirty data. We set the rule support threshold to 1%. The implementation for discovering the CFDs is very time-consuming; therefore, this approach does not appear in all of our plots.

Quality vs. the percentage of errors: We use Dataset 1 and report in Fig-

ure 3.4 the precision and recall results for the applied updates while changing e, the

percentage of tuples with erroneous values, from 10 to 50%. Generally, the approaches

that maximize the data likelihood substantially outperform the constraint-based approach in precision. Moreover, in recall, the likelihood approaches outperform


[Fig. 3.5(a): Precision vs. δ/|D| (%), 0-10%; Fig. 3.5(b): Recall vs. δ/|D| (%); curves for KHSinKPG, MV, and SM.]

Fig. 3.5.: δ controls the amount of changes to apply to the database: small δ guar-

antees high precision at the cost of the recall and vice versa.

the constraint-based approach when the error rate is up to 30%; beyond 30%, the recall of all the approaches is comparable.

SCARE with KHSinKPG shows the highest precision. For the three likelihood-based approaches, the precision increases as the amount of errors increases, but the recall decreases. For the Longitude and Latitude attributes, the SCARE-based approaches (KHSinKPG and MV) corrected these attributes with an error rate between 1% and 5%.

For the SCARE-based approaches, the precision increases because we use a fixed δ. When the amount of errors in the data (denoted e) is small (10 to 20%), the SCARE-based approaches modify more data, which allows for less accurate updates, resulting in lower precision and relatively high recall. As e increases, the data needs more updates to be corrected; however, SCARE applies fewer, yet more accurate, updates, and hence the precision increases while the recall decreases (Figure 3.4(b)). The KHSinKPG approach outperforms the MV approach because KHSinKPG takes into account further associations between the predicted values across the partitions. These associations are ignored in the MV approach. Both SCARE-based approaches that rely on data


partitioning (KHSinKPG and MV) show accuracy that is comparable to, and sometimes even better than, that of the SM approach.

ConstRepair is outperformed by all the likelihood-based approaches because it relies on the heuristic of finding replacement values that are close to the original data, without considering any information about the data distribution and relationships.

The recall of all the repairing approaches is in the range of 30 to 65%. However, for the likelihood-based approaches, the recall can be improved by running the approach again over the resulting database instance. We illustrate this improvement later in the experiment of Figure 3.6. In contrast, the ConstRepair approach achieves about 35% recall, which cannot be further improved given a fixed set of constraints.

To conclude, in comparison to the constraint-based repairing approach, which demonstrated both low precision and low recall, our likelihood-based approach produced accurate updates with high precision. Moreover, partitioning the data and combining the different predictions across the partitions provide more accurate predictions, because partitioning allows learning data relationships at different granularity levels (local and global).

Quality vs. the amount of changes δ: In this experiment, we study the effect of δ, the amount of tolerated changes, on the repair quality. We report in Figure 3.5 the resulting precision and recall when δ/|D| changes from 1% to 10% (e.g., if δ/|D| = 5%, then SCARE can change up to 10% of the tuples by replacing a selected attribute value v by v′ with distance d(v, v′) = 0.5). Generally, low values of δ guarantee high precision, and vice versa. Increasing δ gives SCARE more room to modify the data. Hence, the recall increases as δ increases, but lower-confidence updates may be applied, which explains the decrease in precision as δ grows.

The conclusion from this experiment is that a small δ guarantees high precision at the cost of recall. However, the recall can be improved by running further SCARE iterations over the data, as we will see in the next experiment.


[Fig. 3.6(a): Precision vs. SCARE iterations (I), 1-5, for KHSinKPG, MV, and SM; Fig. 3.6(b): Recall vs. SCARE iterations (I).]

Fig. 3.6.: Using SCARE in an iterative way helps improve the recall and the overall quality of the updates. The decrease in the precision is small compared to the increase in the recall, achieving an overall high quality improvement demonstrated by the f-measure.

Quality vs. the number of SCARE iterations: This experiment shows the effect on quality when we repeatedly execute SCARE over the dataset I times, I = {1, . . . , 5}. After each iteration, the repaired tuples are considered members of the clean subset of the database, which is used in training the ML models. We report the obtained precision and recall in Figure 3.6. For all the SCARE-based approaches, the recall substantially improves from about 35% to close to 70% as we increase the number of iterations, while the precision slightly decreases from the 90s to the 80s (in percent). This indicates that the overall quality improves as we run more SCARE iterations. KHSinKPG outperforms the other approaches in terms of both precision and recall.

In each iteration, SCARE tries to repair the data to maximize the data likelihood given the learnt classification models, subject to a constant amount of changes, δ. In the first few iterations, the discovered updates have a higher correctness measure (i.e., the ratio of the likelihood benefit to the cost of the update, l(u)/c(u)) than those discovered later. Therefore, the updates applied in the first iterations have higher



Fig. 3.7.: Increasing the number of partition functions |H| improves the accuracy of

the predictions and hence increases the precision. The recall is not affected much

because we use a fixed δ.

confidence, and hence the precision starts high and decreases in later iterations. The recall increases faster than the precision decreases, and therefore the overall quality improves. The main reason is that most of the applied updates are correct in the first few iterations, so the database instances obtained after each iteration are of higher quality to be learnt and modeled for predictions in the later iterations of SCARE. A stopping criterion for the iterations can be computed from the obtained overall likelihood benefit of the updates: if the benefit is not significant, then it is better to stop SCARE. In the GDR setting of Chapter 2, the user is involved interactively to inspect a very small number of the least beneficial updates after each iteration. If the least beneficial updates are correct, then most of the updates are correct according to the data distribution.
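The iterative use of SCARE with this stopping criterion can be sketched as follows. This is only an outline under strong assumptions: runScareIteration is a hypothetical placeholder standing for one full SCARE pass that applies its updates (bounded by δ) and reports their overall likelihood benefit; the dissertation does not define such an interface.

// Minimal sketch of iterating SCARE with a likelihood-benefit stopping criterion.
// 'Object database' and 'runScareIteration' are hypothetical placeholders: one call stands for a
// full SCARE pass that applies updates (bounded by delta) and returns their total likelihood benefit.
class IterativeScare {

    interface ScarePass {
        double runScareIteration(Object database, double delta);  // returns overall likelihood benefit
    }

    static int repairIteratively(Object database, double delta, int maxIterations,
                                 double minBenefit, ScarePass scare) {
        int iterations = 0;
        while (iterations < maxIterations) {
            double benefit = scare.runScareIteration(database, delta);
            iterations++;
            // Stop when the overall likelihood benefit of the applied updates is no longer significant.
            if (benefit < minBenefit) break;
        }
        return iterations;
    }
}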

Quality vs. the number of partition functions: Here, we study the sensi-

tivity of SCARE to the number of ways to partition the dataset, i.e., the number

of partition functions |H|. Each tuple is a member of |H| partitions. We report in

Figure 3.7 the precision and recall of the applied updates as we change |H| from 2 to

16. The SCARE-based approaches, KHSinKPG and MV , achieve higher precision


[Fig. 3.8(a): Time (min) vs. number of records in millions (M) for Dataset 2, split into Phase 1 and Phase 2; Fig. 3.8(b): Time (sec) vs. number of records in thousands (K) for Dataset 1, comparing Blocking and Random partitioning.]

Fig. 3.8.: SCARE scalability when varying the database size.

as we increase |H|, while SM shows a constant, lower precision since changing |H| does not affect SM's performance. Also, the recall is not much affected by |H|, as we are using δ, a fixed amount of changes to apply.

Increasing |H| means that each tuple belongs to a larger number of partitions. SCARE learns a model from a local view (i.e., from one partition)

and predicts the most accurate value for the tuple attributes. As a consequence, a

larger number of candidate tuple repairs is proposed when increasing the number of

partitions and the variance of the predictions decreases. The repair selection strategy

combines the predictions of the local view models, and this increases the chance to

obtain more accurate predictions. The strategy in KHSinKPG offers better results

as it improves the precision over the majority voting by taking into account the

dependencies obtained from different partitions.

3.5.2 SCARE Scalability

Scalability is one of the main advantages of SCARE, in addition to the quality of the updates demonstrated in the previous experiments. In the following, we assess the scalability of SCARE on large datasets.


In Figure 3.8(a), we report the scalability of SCARE on Dataset 2, along with the fraction of time spent in each of the two phases of SCARE. The reported time includes the time for learning the classification models. SCARE scales linearly because of its systematic processing of each data partition. Note also that SCARE finished processing 1 M tuples in less than 6 minutes. Phase 1, the update generation phase, takes 80-85% of the time because of the process of learning models and obtaining predictions.

In Figure 3.8(b), we report the overall time taken by SCARE to handle subsets of Dataset 1 with sizes from 5,000 to 50,000 tuples, using two different partitioning techniques: Random and Blocking. In general, SCARE still scales linearly with the dataset size. Moreover, partitioning the data by blocking makes SCARE more efficient: blocking produces partitions with less diversity in the domain values, which results in faster training of the classification models used for prediction.

Usually, statistical modeling and prediction tasks scale quadratically with the data size, but this is not the case with SCARE because of its partitioning mechanism and the strategy used to combine multiple predictions.

3.5.3 SCARE vs. ERACER to Predict Missing Values

In this experiment, we use Dataset 3, which includes a number of measurements

taken from 54 sensors once every 31 seconds. It contains only numerical attributes.

We include this dataset to evaluate SCARE for predicting missing values and to compare it to ERACER. We used the same dataset with the same introduced errors as reported in the ERACER evaluation of [7].

ERACER is a recent machine learning technique for data cleaning; it is specifically dedicated to replacing missing values, but it cannot be used for data repair by value modification. ERACER leverages domain expert knowledge about the dataset to



Fig. 3.9.: Comparison between SCARE and ERACER to predict missing values.

Generally, both SCARE and ERACER show high accuracy in predicting the missing values. In this experiment, SCARE uses a Naïve Bayes model, while ERACER leverages domain knowledge encoded in a carefully designed Bayesian network.

design a Bayesian network. Then, ERACER uses an underlying relational database

design to store all of the constructed model's parameters.

In Figure 3.9, we evaluate SCARE in the task of predicting the missing values

in comparison with ERACER. Here, we do not use δ as SCARE is not updating

an existing database value. Instead, we consider all the predictions obtained from

SCARE that fill a missing value.

Figure 3.9(a) reports the mean absolute error (MAE) when the errors in the dataset are in the form of missing values. Figure 3.9(b) also reports the MAE, but when the errors are in the form of corrupted values, e.g., randomly perturbed values. More details on how this data was corrupted are provided in [7].

There are two major numerical attributes in Dataset 3: humidity and temperature. In Figure 3.9(a), both SCARE and ERACER predict the missing values for humidity and temperature with almost the same low error percentage (2 to 4%); and in Figure 3.9(b), they also behave similarly when increasing the percentage


of corrupted data, and in this case, the error percentage in prediction is between 3

and 7 %.

These numbers show that both techniques provide highly accurate predictions. However, SCARE does not require an expensive domain expert to design a Bayesian network, as ERACER does. For this experiment, SCARE used Naïve Bayes for its statistical models. Naïve Bayes did well when plugged into SCARE in comparison to the Bayesian network that has to be carefully designed for ERACER. This is thanks to the partitioning technique used in SCARE, which enables several local views of the data that are combined at the end to obtain the most reliable global view for accurate predictions. Moreover, SCARE can benefit from carefully designed learning techniques, like those in ERACER, by plugging them into its learning step to obtain more accurate predictions.

3.6 Summary

In this chapter, we propose SCARE, a robust and scalable approach for accu-

rately repairing erroneous data values. Our solution offers several advantages over

previous methods for data repairing using database constraints: the accuracy of the

replacement values and the scalability of the method are guaranteed; the cost of the

repair is bounded by the amount of changes that the user is willing to tolerate; no

constraint or editing rule is needed, since SCARE analyzes the data, learns the correlations from the correct data, and takes advantage of them for predicting the most accurate replacement values. Finally, as shown in our extensive experiments, SCARE

outperforms existing methods on large databases with no limitation on the type of

data (i.e., string, numeric, ordinal, categorical) or on the type of errors (i.e., missing,

incomplete, outlying, or inconsistent values).


4. INDIRECT GUIDANCE FOR DEDUPLICATION

(BEHAVIOR BASED RECORD LINKAGE)

In this chapter, we give an example of leveraging the indirect interactions of the users (entities) to improve data quality. Specifically, we use the transaction logs generated by entities for the task of finding duplicate entities, or linking entity records. We assume in this chapter that the database is the result of integrating two data sources that are expected to have some entities in common.

The rest of the chapter is organized as follows: In Section 4.1, we start by motivat-

ing our approach and summarizing our contributions. Section 4.2 provides a general

overview of the approach and formalizes the problem through examples. A candidate gen-

eration phase for the matched entities is discussed in Section 4.3, and the accurate

techniques for matching entities using their behaviors (generated transactions) are

discussed in Section 4.4. The experimental evaluation is provided in Section 4.5, and

finally, we summarize the chapter in Section 4.6.

4.1 Introduction

Record linkage is the process of identifying records that refer to the same real world

entity. There has been a large body of research on this topic (refer to [9] for a recent

survey). While most existing record linkage techniques focus on simple attribute

similarities, more recent techniques are considering richer information extracted from

the raw data for enhancing the matching process (e.g. [12–15]).

In contrast to most existing techniques, we are considering entity behavior as a

new source of information to enhance the record linkage quality. We observe that

by interpreting massive transactional datasets, for example, transaction logs, we can

discover behavior patterns and identify entities based on these patterns.


A straightforward strategy to match two entities is to measure the similarity

between their behaviors. However, a closer examination shows that this strategy may

not be useful, for the following reasons. It is usually the case that the complete

knowledge of an entity’s behavior is not available to both sources, since each source is

only aware of the entity’s interaction with that same source. Hence, the comparison

of entities’ “behaviors” will in reality be a comparison of their “partial behaviors”,

which can easily be misleading. Moreover, even in the rare case when both sources

have almost complete knowledge about the behavior of a given entity (e.g., a customer

who did all his grocery shopping at Walmart for one year and then at Safeway for

another year), the similarity strategy still will not help. The problem is that many

entities do have very similar behaviors, and hence measuring the similarity can at

best group the entities with similar behavior together (e.g., [16–18]), but not find

their unique matches.

Fortunately, we developed an alternative strategy that works well even if complete

behavior knowledge is not known to both sources. The key to our proposed strategy

is that we merge the behavior information for each candidate pair of entities to be

matched. If the two behaviors seem to complete one another, in the sense that stronger

behavioral patterns become detectable after the merge, then this will be a strong

indication that the two entities are, in fact, the same. The problem of distinct entities

having similar overall behavior is also handled by the merge strategy, especially when

their behaviors are split across the two sources with different splitting patterns (e.g.,

20%-80% versus 60%-40%). In this case, two behaviors (from the first and second

sources) will complete each other if they indeed correspond to the same real world

entity, and not just two distinct entities who happen to share a similar behavior

(which is one of the shortcomings of the similarity strategy).

In this work, we develop principled computational algorithms to detect those

behavior patterns which correspond to latent unique entities in merged logs. We

compute the gain in recognizing a behavior before and after merging the entities

transactions and use this gain as a matching score. In our empirical studies with


real world data sets, the behavior merge strategy produced much better results

than the behavior similarity strategy in different scenarios of splitting the entities’

transactions among the data sources.

The contributions of this chapter can be summarized as follows:

• We present the first formulation of the record linkage problem using entity

behavior and solve the problem by detecting consistent repeated patterns in

merged transaction logs.

• To model entities’ behavior, we develop an accurate, principled detection ap-

proach that models the statistical variations in the repeated behavior patterns

and estimates them via expectation maximization [76].

• We present an alternative, more computationally efficient, detection technique

that is based on information theory which detects recognized patterns through

high compressibility.

• To speed up the linkage process, we propose a filtering procedure that produces

candidate matches for the above detection algorithms through a fast but inac-

curate matching. This filtering introduces a novel "winnowing" mechanism for zeroing in on a small set of candidate pairs with few false positives and almost no

false negatives.

• We conduct an extensive experimental study on real world datasets that demon-

strates the effectiveness of our approach to enhance the linkage quality.

4.2 Behavior Based Approach

4.2.1 Problem Statement

We are given two sets of entities {A_1, . . . , A_{N_1}} and {B_1, . . . , B_{N_2}}, where for each entity A we have a transaction log {T_1, . . . , T_{n_A}}, and each transaction T_i is a tuple


[Fig. 4.1: Process for behavior-based record linkage. Phase 0: Pre-processing and Behavior Extraction; Phase 1: Candidate Generation (quick and dirty matching); Phase 2: Accurate Matching (statistical technique or information-theoretic technique); Phase 3: Final Filtering and Conflict Resolution.]

in the form of ⟨ti, a, F id⟩ where ti represents the time of the transaction, a is the

action (or event) that took place, and F id refers to the set of features that describe

how action a was performed.

Our goal is to return the most likely matches between entities from the two sets

in the form of ⟨Ai, Bj, Sm(Ai, Bj)⟩, where Sm(Ai, Bj) is the matching function. Given

entities A,B (and their transactions), the matching function returns a score reflecting

to what extent the transactions of both A and B correspond to the same entity.

4.2.2 Approach Overview

We begin by giving an overview of our approach for record linkage, which can be

summarized by the process depicted in Figure 4.1.

Phase 0: In the initial pre-processing and behavior extraction phase, we transform

raw transaction logs from both sources into a standard format as shown on the right

side of Figure 4.2(a). Next, we extract the behavior data for each single entity in

each log. Behavior data is initially represented in a matrix format similar to those

given in Figure 4.2(b), which we refer to as Behavior Matrix (BM).

Phase 1: Similar to most record linkage techniques, we start with a candidate generation phase that uses a "quick and dirty" matching function. When matching a

pair of entities, we follow the merge strategy described in the introduction. Moreover,

in this phase, we map each row in the BM to a 2-dimensional point resulting in a very

compact representation for the behavior with some information loss. This mapping


allows for very fast computations on the behavior data of both the original and merged

entities. The mapping is discussed in Section 4.3 and we will show how we can use it to

generate a relatively small set of candidate matches (with almost no false negatives).

Phase 2: Once the candidate matches are generated, the following phase, which

is the core of our approach, is to perform the accurate (yet more expensive) matching

of entities. Accurate matching of the candidate pair of entities (A,B) is achieved

by first modeling the behavior of entities A, B, and AB using a statistical generative

model, where AB is the entity representing the merge of A, B. The estimated models’

parameters are then used to compute the matching score. The details are provided

in Section 4.4.1.

In addition to the above statistical modeling technique, we also propose an alter-

native heuristic technique that is based on information theoretic principles for the

accurate matching phase (See Section 4.4.2). This alternative technique relies on

measuring the increase in the level of compressibility as we merge the behavior data

of pairs of entities. While, to some extent, it is less accurate than the statistical

technique, it is computationally more efficient.

Phase 3: The final filtering and conflict resolution phase is where the final

matches are selected. In our experiments, a simple filtering threshold, tf , is applied

to exclude low-scoring matches.

In the remainder of this section, we will first examine the details of phase 0, and

then we will give an introduction to phases 1 and 2, whose detailed discussion will be

presented in the following two sections.

4.2.3 Pre-processing and Behavior Extraction

A transaction log, from any domain, would typically keep track of certain types

of information for each action an entity performs. This information includes: (1) the

time at which the action occurred, (2) the key object upon which the action was per-

formed (e.g., buying a Twix bar), and (3) additional detailed information describing


the object and how the action was performed (e.g., quantity, payment method, etc).

For simplicity, we will be referring to each action just by its key object. For example,

“Twix” can be used to refer to the action of buying a Twix bar.

The following example illustrates how we can transform a raw transaction log into

a standard format with such information. Although the example is from retail stores,

the same steps can be applied in other domains with the help of domain experts.

Example 4.2.1 An example of a raw log is shown in table “Raw log” in Figure 4.2(a)

which has four columns representing the time, the customer (the entity to be matched),

the ID of the item bought by the customer, and the quantity. Since the item name

may be too specific to be the key identifier for the customer’s buying behavior, an

alternative is to use the item category name as the identifier for the different actions.

This way, actions will correspond to buying Chocolate and Cola rather than Twix

and Coca Cola. The main reason behind this generalization is that, for instance,

buying a bar of Twix should not be considered as a completely different action from

buying a bar of Snickers, and so on. In general, these decisions can be made by

a domain expert to avoid over-fitting when modeling the behavior. In this case, the

specific item name, along with the quantity, will be considered as additional detailed

information, which we will refer to as the action features.

The next step is to assign an id, F id, for each combination of features occurring

with a specific action in “Raw Log”, as shown in the “Action Description” table. This

step ensures that even if we have multiple features, we can always reason about them

as a single object using F id. If there is only one feature, then it can be used directly

with no need for F id.

As a final step, we generate the “Processed Log” by scanning “Raw Log” and

registering the time, entity, action, and F id information for each line.

Behavior Extraction and Representation: Given the standardized log, we

extract the transactions of each entity and represent them in a matrix format, called

Behavior Matrix.


[Tables for Figure 4.2(a): a Raw Log with columns (Time, Customer, Item id, Qty); an Items lookup with columns (Item id, Category, Item Name), e.g., Twix, Snickers, and KitKat under Chocolate and Coca Cola and Pepsi Cola under Cola; an Action Description table mapping each action and feature combination (e.g., ⟨Qty=2, Desc=Twix⟩) to an F_id; and the resulting Processed Log with columns (Time, Entity, Action, F_id) for entities A, B, and C.]

(a) Raw log pre-processing example: The first step is to decide on the action identifiers and the features describing each action to create the "Action Description" table. The second step is to use the identified actions to re-write the log.

(b) Resulting Behavior Matrices from the processed log (time units 01-16):

Entity  Action      01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16
A       Chocolate    0  0  4  0  0  3  0  0  0  4  0  0  0  0  0  0
A       Cola         0  0  2  0  0  0  0  2  0  0  0  0  0  0  0  0
B       Chocolate    3  0  0  0  0  0  0  3  0  0  0  0  3  0  3  0
B       Cola         2  0  0  0  0  2  0  0  0  2  0  0  2  0  2  0
C       Chocolate    0  0  5  0  0  4  0  0  0  5  0  0  0  5  0  4
C       Cola         0  0  1  0  0  1  0  0  1  0  0  0  0  1  0  1
AB      Chocolate    3  0  4  0  0  3  0  3  0  4  0  0  3  0  3  0
AB      Cola         2  0  2  0  0  2  0  2  0  2  0  0  2  0  2  0
BC      Chocolate    3  0  5  0  0  4  0  3  0  5  0  0  3  5  3  4
BC      Cola         2  0  1  0  0  3  0  0  1  2  0  0  2  1  2  1

When merging A & B and then B & C, AB looks more consistent than BC, so AB is most likely the same customer.

Fig. 4.2.: Retail store running example.


Definition 4.2.1 Given a finite set of n actions performed over m time units by an entity A, the Behavior Matrix (BM) of entity A is an n × m matrix, such that:

BM_{i,j} = \begin{cases} F_{ij} & \text{if action } a_i \text{ is performed at time } j \\ 0 & \text{otherwise} \end{cases}

where F_{ij} ∈ F_i is the F_id value for the combination of features describing action a_i when performed at time j, F_i is the domain of all possible F_id values for action a_i, i = 1, . . . , n, and j = 1, . . . , m.

Example 4.2.2 The BMs for customers A, B and C are shown in Figure 4.2(b). A non-zero value indicates that the action was performed, and the value itself is the F_id that links to the description of the action at this time instant.

A more compact representation for the entities' behavior is derived from the Be-

havior Matrix representation, and is constructed and used during the accurate match-

ing phase. This second representation, which is based on the inter-arrival times,

considers each row in the BM as a stream or sequence of pairs {v_ij, F(v_ij)}, where v_ij is the inter-arrival time since the last time action a_i occurred, and F(v_ij) ∈ F_i is a feature that describes a_i from L_{a_i} possible descriptions, |F_i| = L_{a_i}. For example, in Figure 4.2(b), the row corresponding to action a_i = chocolate of entity C, BM_i = {0, 0, 5, 0, 0, 4, 0, 0, 0, 5, 0, 0, 0, 5, 0, 4}, will be represented as X_i = {{3, 5}, {3, 4}, {4, 5}, {4, 5}, {2, 4}}.
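The conversion from a Behavior Matrix row to this inter-arrival representation can be sketched in Java as follows; the int[] row and the int[]-pair output are illustrative choices of representation.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal sketch: convert one Behavior Matrix row into the sequence of
// {inter-arrival time, F_id} pairs used by the accurate matching phase.
class InterArrivalSequence {

    static List<int[]> fromRow(int[] bmRow) {
        List<int[]> pairs = new ArrayList<>();
        int lastTime = 0;                                  // time of the previous occurrence (0 = start)
        for (int t = 1; t <= bmRow.length; t++) {          // time units are 1-based
            int fid = bmRow[t - 1];
            if (fid != 0) {                                // action occurred at time t with feature id fid
                pairs.add(new int[] {t - lastTime, fid});  // {inter-arrival, F_id}
                lastTime = t;
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Chocolate row of entity C from Figure 4.2(b).
        int[] row = {0, 0, 5, 0, 0, 4, 0, 0, 0, 5, 0, 0, 0, 5, 0, 4};
        for (int[] p : fromRow(row)) System.out.println(Arrays.toString(p));
    }
}

Running the main method on the chocolate row of entity C prints the pairs {3,5}, {3,4}, {4,5}, {4,5}, {2,4} from the example above.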

The lossy behavior representation used in the candidate generation phase will be

described in Section 4.3.

It is worth mentioning that the actions, along with their level of details (e.g.,

buying chocolate vs. buying Twix) and their associated features, are assumed to be

homogeneous across the two sources. Otherwise, another pre-processing phase will be

required to match the actions, and thereby ensure the homogeneity. Needless to say,

the sources themselves must belong to the same domain (e.g., two grocery stores, two

news web sites, etc) for the behavior-based approach to be meaningful.


4.2.4 Matching Strategy

As we explained in Section 4.2.2, matching entities based on their extracted

behavior data is achieved in two consecutive phases: a candidate generation phase

followed by an accurate matching phase. In this section, we describe our general

matching strategy, which we apply in the two matching phases. Note that ultimately,

we need to assign a matching score, Sm, for each pair of entities (A,B) deemed as a

potential match, and then report the matches with the highest scores.

To compute Sm(A,B), we first compute a behavior recognition score, Sr, for each

entity (i.e., Sr(A) and Sr(B)). We then merge the behavior data of both A and B to

construct the behavior of some hypothetical entity AB, whose score, Sr(AB), is also

computed.

The next step is to check if this merge results in a more recognizable behavior

compared to either of the two individual behaviors. Hence, the overall matching score

should depend on the gain achieved for the recognition scores. More precisely, it can

be stated as follows:

S_m(A,B) = \frac{n_A [S_r(AB) - S_r(A)] + n_B [S_r(AB) - S_r(B)]}{n_A + n_B}    (4.1)

where nA and nB are the total number of transactions in the BMs of A and B

respectively. Note that the gains corresponding to the two entities are weighted

based on the density of their respective BMs.

Example 4.2.3 To better understand the intuition behind the behavior merge strat-

egy, we assume that entities A and C are from Source 1 and B is from Source 2 and

their processed log is shown in table “Processed Log” in Figure 4.2(a). To find the

best match for entity B, we first merge it with A, and then do the same with C. It is

apparent from the resulting BMs in Figure 4.2(b) that A is potentially a good match for B; entity AB is likely to be an entity that buys chocolate every 2 or 3 days and prefers 2 liters of Coca Cola with either 2 Twix bars or 4 Snickers bars. However, it is hard to discern a behavior for entity BC. Of course, in a real scenario we would deal with many more actions.


The key question now is: How to compute Sr(A)? In fact, the goal of the recog-

nition score, Sr, is to capture the consistency of an entity’s behavior along three

main components: (1) consistency in repeating actions, (2) stability in the features

describing the action, and (3) the association between actions. These three compo-

nents, which will be explained shortly, correspond to three score components of Sr;

i.e., Sr1, Sr2, and Sr3. Hence, we compute Sr(A) as their geometric mean as given

below.

S_r(A) = \sqrt[3]{S_{r1}(A) \times S_{r2}(A) \times S_{r3}(A)}    (4.2)
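Equations 4.1 and 4.2 combine as in the following minimal Java sketch; the three per-component recognition scores are taken as inputs here, since how they are computed depends on the matching technique used in each phase.

// Minimal sketch of the scoring combination in Eq. 4.1 and Eq. 4.2.
// The per-component recognition scores (Sr1, Sr2, Sr3) are assumed to be computed elsewhere
// by the candidate-generation or accurate-matching technique.
class BehaviorScores {

    // Eq. 4.2: overall recognition score as the geometric mean of the three components.
    static double recognitionScore(double sr1, double sr2, double sr3) {
        return Math.cbrt(sr1 * sr2 * sr3);
    }

    // Eq. 4.1: matching score as the density-weighted gain in recognition after merging A and B.
    static double matchingScore(double srA, double srB, double srAB, int nA, int nB) {
        return (nA * (srAB - srA) + nB * (srAB - srB)) / (double) (nA + nB);
    }
}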

The three behavior components we just mentioned, and which we would like to

capture in Sr, can be explained as follows.

1- Consistency in repeating actions: Entities tend to repeat specific actions

on a regular basis following almost consistent inter-arrival times. For example, a user

(entity) of a news web site may be checking the financial news (action) every morning

(pattern).

2- Stability in the features describing actions: When an entity performs an

action several times, almost the same features are expected to apply each time. For

example, when a customer buys chocolate, s/he mostly buys either 2 Twix bars or 1

Snickers bar, as opposed to buying a different type of chocolate each time and in

completely different quantities. The latter case is unlikely to occur in real scenarios.

3- Association between actions: Actions performed by entities are typically

associated, and the association patterns can be detected over time. For example,

a customer may be used to buying Twix chocolate and Pepsi cola every Sunday

afternoon, which implies an association between these two actions.

The distinction between each of the matching techniques that we will describe next

is in the method used to compute Sr1, Sr2, and Sr3. The candidate generation phase is

a special case as it only considers the first behavior component; i.e. Sr(A) = Sr1(A).

The matching strategy we have described so far can be referred to as the behavior

merge strategy, since it relies essentially on merging the entities’ behaviors and then


measuring the realized gain. This is to be contrasted to an alternative strategy, which

can be referred to as the behavior similarity strategy, where the matching score

can simply be a measure of the similarity between the two behaviors.

We will show in Section 4.4 how the behavior similarity strategy can be imple-

mented in the context of the statistical modeling technique for matching. In Section

4.5, we will experimentally show the superiority of the merge strategy over the simi-

larity strategy in all the scenarios we considered.

4.3 Candidate Generation Phase

To avoid examining all possible pairs of entities during the expensive phase of

accurate matching, we introduce a candidate generation phase, which quickly deter-

mines pairs of entities that are likely to be matched. This phase results in almost no

false negatives, at the expense of relatively low precision.

The high efficiency of this phase is primarily attributed to the use of a very

compact (yet lossy) behavior representation, which allows for fast computations. In

addition, only the first behavior component, i.e., consistency in repeating actions,

which is captured by Sr1, is considered in this phase. Note that because the other

components are ignored, binary BMs are used with 1’s replacing non-zero values.

Each row in the BM , which corresponds to an action, is considered as a binary

time sequence. For each such sequence, we compute the first element of its Discrete

Fourier Transform (DFT) [77], which is a 2-dimensional complex number. The

complex number corresponding to an action ai in the BM of an entity A is computed

by:

C_A^{(a_i)} = \sum_{j=0}^{m-1} BM_{i,j} \, e^{\frac{2 j \pi \sqrt{-1}}{m}}    (4.3)

An interesting aspect of this transformation is that the lower the magnitude of the

complex number the more consistent and regular the time sequence, and vice versa.

This can be explained as follows. Consider each of the elements in the time series

as a vector whose magnitude is either 0 or 1, and that their angles are uniformly


distributed along the unit circle (i.e., the angle of the j-th vector is 2jπ/m). The complex

number will then be the resultant of all these vectors. Now, if the time series was

consistent in terms of the inter-arrival times between the non-zero values, then their

corresponding vectors would be uniformly distributed along the unit circle, and hence

they would cancel each other out. Thus, the resultant’s magnitude will be close to

zero.

Another interesting aspect is that merging the two rows corresponding to an action a in the BMs of two entities A and B effectively reduces to adding two complex numbers, i.e., C_{AB}^{(a)} = C_A^{(a)} + C_B^{(a)}.
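A minimal Java sketch of Eq. 4.3 and of this additivity is shown below; it assumes a binary int[] row per action and that the two entities' occurrences fall on different time units (as when an entity's transactions are split across the two sources), so that merging rows corresponds to adding their complex numbers.

// Minimal sketch of Eq. 4.3: the first DFT element of a binary action row,
// and the merge of two rows as the sum of their complex numbers.
class FirstDftCoefficient {

    // Returns {Re, Im} of C_A^(a) for a binary row of length m.
    static double[] compute(int[] binaryRow) {
        int m = binaryRow.length;
        double re = 0.0, im = 0.0;
        for (int j = 0; j < m; j++) {
            if (binaryRow[j] != 0) {                  // each occurrence contributes a unit vector
                double angle = 2.0 * Math.PI * j / m; // angle of the j-th vector on the unit circle
                re += Math.cos(angle);
                im += Math.sin(angle);
            }
        }
        return new double[] {re, im};
    }

    static double magnitude(double[] c) {
        return Math.sqrt(c[0] * c[0] + c[1] * c[1]);
    }

    // Merging the rows of A and B (no colliding occurrences assumed): C_AB = C_A + C_B,
    // so only (Re, Im) per action need to be stored for each source.
    static double[] merge(double[] cA, double[] cB) {
        return new double[] {cA[0] + cB[0], cA[1] + cB[1]};
    }
}

Applied to the chocolate rows of A and B in Figure 4.2(b), this computation reproduces magnitudes of about 1.38 and 1.53 for the individual rows and about 0.19 after merging, the values used in Example 4.3.1.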

The following example shows how the candidate generation phase can distinguish

between “match” and “mismatch” candidates.

Example 4.3.1 Consider the example described in Figure 4.3. Let aA, aB and aC

be the rows of action a (chocolate) in the binary BMs of entities A, B and C from

Figure 4.2(b). At the left of Figure 4.3, when merging aA and aB, the magnitude

corresponding to the merged action, aAB equals 0.19, which is smaller than the original

magnitudes: 1.38 for aA and 1.53 for aB. The reduction in magnitude is because the

sequence aAB is more regular than either of aA and aB.

At the right of Figure 4.3, we apply the same process for aB and aC. The magni-

tudes we obtain are 2.03 for aBC, 1.54 for aB, and 0.09 for aC. In this case, merging

aB and aC resulted in an increase in magnitude because the sequence aBC looks less

regular than either of aB and aC.

Based on the above discussion, we can compute a recognition score, S_r(a_A), for each individual action a that belongs to entity A such that it is inversely proportional to the magnitude of the complex number C_A^{(a)}. In particular, S_r(a_A) = M - mag(C_A^{(a)}), where mag(C_A^{(a)}) is the magnitude of C_A^{(a)} and M is the maximum computed magnitude.



Fig. 4.3.: Action patterns in the complex plane and their effect on the magnitude.

To compute the overall S_r(A), we average the individual scores S_r(a_A), each weighted by the number of times its respective action was repeated (n_A^{(a)}). The formula for S_r(A) is thus given as follows:

S_r(A) = \frac{1}{n_A} \sum_{\forall a} n_A^{(a)} \cdot S_r(a_A)    (4.4)

Standard SQL To Compute Candidate Matches: In the following, we pro-

vide a derivation for a final formulation of the matching score in the candidate gener-

ation matching phase. At the end, we provide the corresponding SQL statement we

used for this computation.

After computing the complex-number representation for each action of an entity, we computed S_r(a_A) = M - mag(C_A^{(a)}), where M is the maximum computed magnitude. Then, we obtain

S_r(A) = \frac{1}{n_A} \sum_{\forall a} n_A^{(a)} \left(M - \mathrm{mag}(C_A^{(a)})\right)    (4.5)


By substituting Eq. 4.5 into Eq. 4.1, we obtain the matching score S_m(A,B):

S_m(A,B) = \frac{n_A}{n_A + n_B} \left[ \frac{1}{n_A + n_B} \sum_{\forall a} (n_A^{(a)} + n_B^{(a)}) \left(M - \mathrm{mag}(C_{AB}^{(a)})\right) - \frac{1}{n_A} \sum_{\forall a} n_A^{(a)} \left(M - \mathrm{mag}(C_A^{(a)})\right) \right]
         + \frac{n_B}{n_A + n_B} \left[ \frac{1}{n_A + n_B} \sum_{\forall a} (n_A^{(a)} + n_B^{(a)}) \left(M - \mathrm{mag}(C_{AB}^{(a)})\right) - \frac{1}{n_B} \sum_{\forall a} n_B^{(a)} \left(M - \mathrm{mag}(C_B^{(a)})\right) \right]

By a simple rearrangement to collect the terms related to mag(C_{AB}^{(a)}), we get

S_m(A,B) = \frac{1}{n_A + n_B} \sum_{\forall a} \left[ (n_A^{(a)} + n_B^{(a)}) M - (n_A^{(a)} + n_B^{(a)}) \,\mathrm{mag}(C_{AB}^{(a)}) - n_A^{(a)} M + n_A^{(a)} \,\mathrm{mag}(C_A^{(a)}) - n_B^{(a)} M + n_B^{(a)} \,\mathrm{mag}(C_B^{(a)}) \right]

Note that the terms involving M cancel out, and the final matching score is

S_m(A,B) = \frac{1}{n_A + n_B} \sum_{\forall a} \left[ n_A^{(a)} \,\mathrm{mag}(C_A^{(a)}) + n_B^{(a)} \,\mathrm{mag}(C_B^{(a)}) - (n_A^{(a)} + n_B^{(a)}) \,\mathrm{mag}(C_{AB}^{(a)}) \right]    (4.6)

We store the complex number information for each data source in a relation with the attributes (entity, action, Re, Im, mag, a_supp, e_supp), where there is a tuple for each entity and each of its actions. For each action of an entity, we store the real and imaginary components of the complex number (Re and Im), in addition to the


magnitude value (mag). a_supp is the number of transactions for that action within the entity's log, and e_supp is the total number of transactions for the entity, repeated with each tuple corresponding to an action. Thus, there are two such tables, src1 and src2, representing the two data sources.

To generate the candidates, we need to compute Eq. 4.6 for each pair of entities

and filter the result using the threshold tc on the resulting matching score. The

following SQL applies this computation and returns the candidate matches.

select
  c1.entity as e1,
  c2.entity as e2,
  sum( c1.a_supp * c1.mag                 -- n^(a)_A * mag(C^(a)_A)
     + c2.a_supp * c2.mag                 -- n^(a)_B * mag(C^(a)_B)
     - (c1.a_supp + c2.a_supp) *          -- n^(a)_AB *
       sqrt(                              --   mag(C^(a)_AB)
           (c1.Re + c2.Re) * (c1.Re + c2.Re)
         + (c1.Im + c2.Im) * (c1.Im + c2.Im) )
     ) / (max(c1.e_supp) + max(c2.e_supp))    -- n_A + n_B
  as gain_score
from src1 c1 inner join src2 c2
  on c1.action = c2.action
group by c1.entity, c2.entity
having gain_score > t_c

4.4 Accurate Matching Phase

4.4.1 Statistical Modeling Technique

Building the Statistical Model

Our goal is to build a statistical model for the behavior of an entity given its

observed actions. The two key variables defining an entity’s behavior with respect

to a specific action are (1) the inter-arrival time between the action occurrences,

and (2) the feature id associated with each occurrence, which represents the features


describing how the action was performed at that time, or in other words it reflects

the entity’s preferences when performing this action.

In general, we expect that a given entity will be biased to a narrow set of inter-

arrival times and feature ids which is what will distinguish the entity’s behavior.

When merging two behavior matrices of the same entity, the bias should be reinforced and made clearer. However, when the behavior matrices of two different entities

are merged, the bias will instead be weakened and made harder to recognize. The

statistical model that we build should enable us to measure these properties.

Our problem is similar to classifying a biological sequence as being a motif, i.e., a

sequence that mostly contains a recognized repeated pattern, or not. A key objective

in computational biology is to be able to discover motifs by separating them from

some background sequence that is mostly random. In our case, a motif corresponds

to a sequence of an action by the same entity. In view of this analogy, our statistical

modeling will have the same spirit as the methods commonly used in computational

biology. However, our model has to fit the specifics of our problem, which are as follows: (a) sequences have two variables (inter-arrival and feature id) rather than just one variable (the DNA character), and (b) for ordinal variables (such as the inter-arrival time), neighboring values need to be treated similarly.

Modeling the Behavior for an Action: We model the behavior of an entity A

with respect to a specific action a using a finite mixture model M = {M_1, . . . , M_K} with mixing coefficients λ^{(a_A)} = {λ_1^{(a_A)}, . . . , λ_K^{(a_A)}}, where M_k is its k-th component. Each component M_k is associated with two random variables: (i) the inter-arrival time, which is generated from a uniform distribution over a range of inter-arrival times r_k = [start_k, end_k];1 and (ii) the feature id, a discrete variable, which is modeled using a multinomial distribution with parameter θ_k^{(a_A)} = {f_{k1}^{(a_A)}, . . . , f_{kL}^{(a_A)}}, where L is the number of all possible feature ids and f_{kj}^{(a_A)} is the probability of describing the

1 The range size of r_k is user-configurable, as it depends on the application and on what values are considered close. In our experiments with retail store data from Walmart, we generated ranges by sliding, over the time period, a window of size 5 days with a step of 3 days (i.e., {[1,6], [4,9], [7,12], . . . }).


occurrence of action a using feature Fj, j = 1, . . . , L. In what follows, we omit the

superscript aA and assume that there is only one action in the system to simplify the

notations.

What we described so far is essentially a generative model in the sense that once

built, we can use it to generate new action occurrences for a given entity. For example,

using λ, we can select the component Mk to generate the next action occurrence,

which should occur after an inter-arrival time picked from the corresponding range r_k = [start_k, end_k], and we can describe the action by selecting a feature id using θ_k.

However, we do not use the model for this purpose. Instead, we use its estimated

parameters (λ and the vectors θk) to determine the level of recognizing repeated

patterns in the sequence corresponding to the action occurrences.

For the estimation of the model parameters, we use the Expectation-Maximization

(EM) algorithm to fit the mixture model for each specific action a of an entity A to

discover the optimal parameter values which maximize the likelihood function of the

observed behavior data.

Before we present the algorithm and the derivations used to estimate the model

parameters, we show an example of a behavior and the properties we desire for its

corresponding model parameters. We demonstrate the challenge in finding those

desired parameters, which we address by using the EM algorithm. We also show

through the example how we choose the initial parameter values required for the EM

algorithm.

Example 4.4.1 Consider that a customer’s behavior with respect

to the action of buying chocolate is represented by the sequence

{{6, s}, {15, l}, {6, s}, {8, s}, {15, l}, {14, l}, {13, l}}, where s denotes a small

quantity (e.g., 1-5 bars), and l denotes a large quantity (e.g., more than 5 bars). So

s/he bought a small quantity of chocolate after 6 days, a large quantity after 15 days,

and so on.

To characterize the inter-arrival times preferred by this customer, the best ranges

of size 2 to use are [6, 8] and [13, 15]. Their associated mixing coefficients (λk) should


be 3/7 and 4/7, because the two ranges cover 3 and 4, respectively, of the 7 observed data points.

However, since in general, the best ranges in a behavior sequence will not be as

clear as in this case, we need to systematically consider all the ranges of a given size

(2 in this case), and assign mixing coefficients to each of them. The possible ranges

for our example would be {[6, 8], [7, 9], [8, 10], . . . , [13, 15]}.

A straightforward approach to compute λk for each range is to compute the nor-

malized frequency of occurrence of the given range for all the observed data points.

For instance, the normalized frequencies for the ranges [6, 8], [12, 14], and [13, 15] are

3/12, 2/12, and 4/12 (or 1/4, 1/6, and 1/3), respectively, where 12 is the sum of frequencies over all

possible ranges. (Note that the same inter-arrival time may fall in multiple overlap-

ping ranges.) Clearly, these are not the desired values for λk. We would rather have

zero values for all ranges other than [6, 8] and [13, 15]. However, we still use these

normalized frequencies as the initial values for λk to be fed into the EM algorithm.

Similarly, to compute the initial values for the θk probabilities, we first consider

the data points covered by the range corresponding to component Mk only. Then, for

each possible value of the feature id, we compute its normalized frequency across these

data points. Clearly, in our example, the customer favors buying small quantities

when s/he shops at short intervals (6-8 days apart), and large quantities when s/he

shops at longer intervals (13-15 days apart).
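To make this initialization concrete, the following Python sketch computes the normalized-frequency initial values for the example above. It is an illustration only: the variable names, the explicit enumeration of candidate ranges, and the handling of empty ranges are our own assumptions, not part of the dissertation.

```python
# Initial parameter values for Example 4.4.1 (illustrative sketch).
sequence = [(6, 's'), (15, 'l'), (6, 's'), (8, 's'), (15, 'l'), (14, 'l'), (13, 'l')]
features = ['s', 'l']                                    # the L possible feature ids
ranges = [(start, start + 2) for start in range(6, 14)]  # candidate ranges of size 2: [6,8], ..., [13,15]

# c[k][j]: number of observations whose inter-arrival time falls in range k and
# whose feature id is features[j] (all responsibilities implicitly set to 1).
c = [[sum(1 for v, f in sequence if lo <= v <= hi and f == feat) for feat in features]
     for lo, hi in ranges]

total = sum(sum(row) for row in c)                       # 12 for this example
lambda0 = [sum(row) / total for row in c]                # initial mixing coefficients
theta0 = [[cnt / sum(row) if sum(row) else 1.0 / len(features) for cnt in row]
          for row in c]                                  # initial per-range feature distributions

for (lo, hi), lam, th in zip(ranges, lambda0, theta0):
    print(f"range [{lo},{hi}]: lambda0 = {lam:.3f}, theta0 = {th}")
```

Running this sketch reproduces the normalized frequencies discussed above (e.g., 3/12 for [6, 8] and 4/12 for [13, 15]).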

Behavior Model Parameters Estimation

In the following, we use the expectation maximization (EM) algorithm to fit the

finite mixture model of a given action sequence representing its occurrence and dis-

cover the parameters’ values of the overall model which was discussed earlier in this

section. To simplify the notations, we assume there is only one action in the system,

so we omit the superscript that link the entity and action names.


As we mentioned earlier, we use the expectation maximization (EM) for finite

mixture model to discover the parameters’ values of the overall model which would

maximize the likelihood of the data. The EM uses the concept of missing data and

follows an iterative procedure to find values for λ and θ, which maximize the likelihood

of the data given the model. In our case, the missing data is the knowledge of which

components produced X = {{v1,F (v1)}, . . . , {vN ,F (vN )}}. A finite mixture model

assumes that the sequence X arises from two or more components with different,

unknown parameters. Once we obtain these parameters, we use them to compute the

behavior scores along each of the three behavior components.

Let us now introduce a K-dimensional binary random variable Z with a 1-of-K

representation in which a particular zk is equal to 1 and all other elements are equal

to 0, i.e., zk ∈ {0, 1} and ∑_{k=1}^{K} zk = 1, such that the probability p(zk = 1) = λk. Every entry Xi in the sequence will be assigned Zi = {zi1, zi2, . . . , ziK}. We can easily show that the probability

$$p(X_i \mid \theta_1, \ldots, \theta_K) = \sum_{k=1}^{K} p(z_{ik} = 1)\, p(X_i \mid Z_i, \theta_1, \ldots, \theta_K) = \sum_{k=1}^{K} \lambda_k\, p(X_i \mid \theta_k)$$

Since we do not know zik, we consider the conditional probability γ(zik) = p(zik = 1 | Xi) of zik given Xi, which can be found using Bayes' theorem [78]:

$$\gamma(z_{ik}) = \frac{p(z_{ik} = 1)\, p(X_i \mid z_{ik} = 1)}{\sum_{k'=1}^{K} p(z_{ik'} = 1)\, p(X_i \mid z_{ik'} = 1)} = \frac{\lambda_k\, p(X_i \mid \theta_k)}{\sum_{k'=1}^{K} \lambda_{k'}\, p(X_i \mid \theta_{k'})} \qquad (4.7)$$

We shall view λk as the prior probability of zik = 1, and γ(zik) as the corresponding posterior probability once we have observed X. γ(zik) can also be viewed as the responsibility


that component Mk takes for explaining the observation Xi. Therefore, the likelihood of the data given the parameters can be written in log form as:

$$\ln p(X \mid \lambda, \theta) = \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma(z_{ik}) \ln\left[\lambda_k\, p(X_i \mid \theta_k)\right] = \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma(z_{ik}) \ln p(X_i \mid \theta_k) + \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma(z_{ik}) \ln \lambda_k \qquad (4.8)$$

The EM algorithm monotonically increases the log likelihood of the data until

convergence by iteratively computing the expected log likelihood of the complete

data (X, Z) in the E step and maximizing this expected log likelihood over the model

parameters λ and θ. We first choose some initial values for the parameters λ(0) and

θ(0). Then, we alternate between the E-step and M-step of the algorithm until it

converges.

In the E-step, to compute the expected log likelihood of the complete data, we need to calculate the required conditional distribution γ(0)(zik). We plug λ(0) and θ(0) into Eq. 4.7 to get γ(0)(zik), where we can compute p(Xi|θk) as follows:

$$p(X_i \mid \theta_k) = \prod_{j=1}^{L} f_{kj}^{\,I(j,k,\mathcal{F}(v_i))} \qquad (4.9)$$

where Xi = {vi, F(vi)} and I(j, k, F(vi)) is an indicator function equal to 1 if vi ∈ rk and F(vi) = Fj; otherwise it is 0. Recall that rk = [startk, endk] is the range identifying the component Mk.

The M-step of EM maximizes Eq. 4.8 over λ and θ in order to re-estimate new values λ(1) and θ(1). The maximization over λ involves only the second term in Eq. 4.8, $\arg\max_{\lambda} \sum_{i=1}^{N} \sum_{k=1}^{K} \gamma(z_{ik}) \ln \lambda_k$, which has the solution

$$\lambda_k^{(1)} = \frac{1}{N} \sum_{i=1}^{N} \gamma^{(0)}(z_{ik}), \qquad k = 1, \ldots, K. \qquad (4.10)$$


We can maximize over θ by maximizing the first term in Eq. 4.8 separately over each θk for k = 1, . . . , K. Computing $\arg\max_{\theta} E[\ln p(X, Z \mid \theta_1, \ldots, \theta_K)]$ is equivalent to maximizing the right-hand side of Eq. 4.11 over θk (only a piece of the parameter) for every k.

$$\theta_k = \arg\max_{\theta_k} \sum_{i=1}^{N} \gamma^{(0)}(z_{ik}) \ln p(X_i \mid \theta_k), \qquad (4.11)$$

To do this, for k = 1, . . . , K and j = 1, . . . , L, let

$$c_{kj} = \sum_{i=1}^{N} \gamma^{(0)}(z_{ik})\, I(j, k, \mathcal{F}(v_i)) \qquad (4.12)$$

Then ckj is in fact the expected number of times the action is described by Fj when its inter-arrival time falls in Mk's range rk. We re-estimate θk by substituting Eq. 4.9 into

Eq. 4.11 to get

$$\theta_k^{(1)} = \{f_{k1}, \ldots, f_{kL}\} = \arg\max_{\theta_k} \sum_{j=1}^{L} c_{kj} \ln f_{kj} \qquad (4.13)$$

Therefore,

$$f_{kj} = \frac{c_{kj}}{\sum_{j'=1}^{L} c_{kj'}} \qquad (4.14)$$

To find the initial parameters λ(0) and θ(0), we scan the sequence X once and use

Eq. 4.12 to get ckj by setting all γ(0) = 1. Afterward, we use Eq. 4.14 to get θ(0)k and

compute

$$\lambda_k^{(0)} = \frac{\sum_{j=1}^{L} c_{kj}}{\sum_{k'=1}^{K} \sum_{j=1}^{L} c_{k'j}}$$
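The derivation above can be condensed into a short EM loop. The following Python/NumPy sketch is a simplified illustration under our own naming assumptions (it is not the implementation used in the dissertation); it alternates the E-step of Eq. 4.7 (with the likelihood of Eq. 4.9) and the M-step of Eqs. 4.10 and 4.12-4.14, starting from the initialization described above, until the log likelihood of Eq. 4.8 stops improving.

```python
import numpy as np

def fit_behavior_mixture(sequence, ranges, features, tol=1e-6, max_iter=100):
    """Fit lambda (K,) and theta (K, L) for one action's sequence of
    (inter-arrival time, feature id) pairs, following Eqs. 4.7-4.14."""
    N, K, L = len(sequence), len(ranges), len(features)

    # Indicator I(j, k, F(v_i)): 1 if v_i falls in range r_k and its feature id is F_j.
    I = np.zeros((N, K, L))
    for i, (v, f) in enumerate(sequence):
        for k, (lo, hi) in enumerate(ranges):
            if lo <= v <= hi:
                I[i, k, features.index(f)] = 1.0

    # Initialization: set all responsibilities gamma to 1 (Section 4.4.1).
    c = I.sum(axis=0)                                          # c_{kj} with gamma = 1 (Eq. 4.12)
    lam = c.sum(axis=1) / max(c.sum(), 1e-12)                  # lambda^(0)
    theta = (c + 1e-12) / (c.sum(axis=1, keepdims=True) + 1e-12 * L)   # theta^(0) (Eq. 4.14)

    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: p(X_i | theta_k) via Eq. 4.9, then responsibilities via Eq. 4.7.
        p_x_given_k = np.prod(theta[None, :, :] ** I, axis=2)  # shape (N, K)
        weighted = lam[None, :] * p_x_given_k
        gamma = weighted / weighted.sum(axis=1, keepdims=True)

        # M-step: Eq. 4.10 for lambda, Eqs. 4.12-4.14 for theta.
        lam = gamma.sum(axis=0) / N
        c = np.einsum('ik,ikj->kj', gamma, I)
        theta = (c + 1e-12) / (c.sum(axis=1, keepdims=True) + 1e-12 * L)

        ll = np.sum(gamma * np.log(weighted + 1e-12))          # Eq. 4.8
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return lam, theta
```

Under the stated assumptions, it can be run directly on the ranges and sequence from the initialization sketch of Example 4.4.1.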

Computing Matching Scores

To this point, we succeeded in defining a model and estimating its parameters λ

and θ which can be used to re-generate the sequence {X1, . . . , XN} that represents

the occurrence of the action. Recall that our aim is to match two entities A and B, by

computing the gain Sm(A,B) in recognizing a behavior after merging A and B using

Eq. 4.1. This requires computing the scores Sr(A), Sr(B) and Sr(AB) using Eq. 4.2,

which in turn requires computing the behavior recognition scores corresponding to

the three behavior components, which, for entity A for example, are Sr1(A), Sr2(A),

and Sr3(A).


For the first behavior component, the consistency in repeating an action a is

equivalent to classifying its sequence as a motif. We quantify the pattern strength to

be inversely proportional to the uncertainty about selecting a model component using

λ(aA), i.e., action a’s sequence is a motif if the uncertainty about λ(aA) is low. Thus,

we can use the entropy to compute Sr1(aA) = log K − H(λ(aA)), where $H(\lambda^{(aA)}) = -\sum_{k=1}^{K} \lambda_k^{(aA)} \log \lambda_k^{(aA)}$, and the overall score Sr1(A) is then computed by a weighted

sum over all the actions according to their support, i.e., the number of times the

action was repeated.

$$S_{r1}(A) = \frac{1}{n_A} \sum_{\forall a} n_A^{(a)} \cdot S_{r1}(aA) \qquad (4.15)$$

For the second behavior component, the stability in describing the action (action features) is more recognizable when the uncertainty in picking the feature id values is low. The behavior score along this component can be evaluated by first computing $\theta'^{(aA)} = \{f'^{(aA)}_1, \ldots, f'^{(aA)}_L\}$, which is the overall parameter to pick a feature id value for action a using the multinomial distribution, such that the overall probability for entity A to describe its action a by feature Fj is $f'^{(aA)}_j$. Here, $f'^{(aA)}_j = \sum_{k=1}^{K} \lambda_k^{(aA)} f_{kj}^{(aA)}$, combined from all K components for j = 1, . . . , L, knowing that $\theta_k^{(aA)} = \{f_{k1}^{(aA)}, \ldots, f_{kL}^{(aA)}\}$. Using the entropy of $\theta'^{(aA)}$, we compute $S_{r2}(aA) = \log L - H(\theta'^{(aA)})$, where $H(\theta'^{(aA)}) = -\sum_{j=1}^{L} f'^{(aA)}_j \log f'^{(aA)}_j$. Similar to Eq. 4.15, we can compute the overall score Sr2(A) as the weighted sum of Sr2(aA) according to the actions' support.

For the third component, we look for evidence about the associations between

actions. We estimate, for every pair of actions, the probability that both are generated from

components with the same inter-arrival ranges. The association between actions can

be recognized when they occur close to each other. In other words, this can occur

when both of them tend to prefer the same model components to generate their

sequences. Consequently, the score for the third component can be computed over

all possible pairs of actions for the same entity as follows:

$$S_{r3}(A) = \sum_{\forall a,b} \sum_{k=1}^{K} \lambda_k^{(aA)}\, \lambda_k^{(bA)}$$
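The three recognition scores can then be read off the estimated parameters. The sketch below is our own illustration (the names and dictionary layout are assumptions); it assumes the per-action parameters λ(aA) and θ(aA) have already been estimated, e.g., with the EM sketch above, and applies Eq. 4.15 together with the two entropy-based scores and the pairwise association score.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def recognition_scores(actions):
    """actions: dict mapping action name -> (support n_a, lam of shape (K,), theta of shape (K, L))."""
    n_total = sum(n for n, _, _ in actions.values())
    any_lam, any_theta = next(iter(actions.values()))[1:]
    K, L = len(any_lam), any_theta.shape[1]

    # S_r1: consistency of inter-arrival times, log K - H(lambda), support-weighted (Eq. 4.15).
    s_r1 = sum(n * (np.log(K) - entropy(lam)) for n, lam, _ in actions.values()) / n_total

    # S_r2: stability of action features, log L - H(theta'), with theta'_j = sum_k lam_k * f_kj.
    s_r2 = sum(n * (np.log(L) - entropy(lam @ theta)) for n, lam, theta in actions.values()) / n_total

    # S_r3: association between actions, summed over pairs: sum_k lam_k^(a) * lam_k^(b).
    names = list(actions)
    s_r3 = sum(float(actions[a][1] @ actions[b][1])
               for i, a in enumerate(names) for b in names[i + 1:])
    return s_r1, s_r2, s_r3
```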


Computing Behavior Similarity Score:

The similarity between two behaviors can be simply quantified by the closeness

between the parameters of their corresponding behavior models according to the

Euclidean distance. For two entities A and B, we compute the behavior similarity as

follows:

$$BSim(A,B) = 1 - \frac{1}{n_A + n_B} \sum_{\forall a} \left(n_A^{a} + n_B^{a}\right) \sqrt{\sum_{k=1}^{K} \left[ \left(\lambda_k^{(aA)} - \lambda_k^{(aB)}\right)^2 + \sum_{j=1}^{L} \left(\lambda_k^{(aA)} f_{kj}^{(aA)} - \lambda_k^{(aB)} f_{kj}^{(aB)}\right)^2 \right]}.$$

Note that this method is preferred over directly comparing the BMs of the entities,

since the latter method would require some sort of alignment for the time dimension

of the BMs. In particular, deciding which cells to compare to which cells is not

obvious.
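For completeness, a minimal sketch of this behavior similarity score is given below, assuming both entities expose parameters for the same set of actions and the same K and L (the dictionary layout mirrors the recognition-score sketch above and is illustrative only).

```python
import numpy as np

def behavior_similarity(params_a, params_b):
    """BSim(A, B): params_x maps action -> (support n, lam (K,), theta (K, L))."""
    n_a = sum(n for n, _, _ in params_a.values())
    n_b = sum(n for n, _, _ in params_b.values())
    total = 0.0
    for action in params_a:
        na, lam_a, th_a = params_a[action]
        nb, lam_b, th_b = params_b[action]
        diff_lam = np.sum((lam_a - lam_b) ** 2)                          # (lambda_k^(aA) - lambda_k^(aB))^2
        diff_mix = np.sum((lam_a[:, None] * th_a - lam_b[:, None] * th_b) ** 2)
        total += (na + nb) * np.sqrt(diff_lam + diff_mix)
    return 1.0 - total / (n_a + n_b)
```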

4.4.2 Information-Theoretic Technique (Compressibility)

We here present an information theory-based technique for the computation of

the matching scores. It is not as accurate as the motif-based technique presented in

Section 4.4.1, but it is more computationally efficient. The underlying idea stems

from observing that if we represent the BM as an image, we will see horizontal

repeated blocks that become more pronounced when the behavior is well recognized. The

repeated blocks appear because of the repetition in the behavior patterns. Therefore,

we expect more regularity along the rows than along the columns of the BM . In fact,

the order of values in any of the columns depends on the order of the actions in the

BM , which is not expected to follow any recognizable patterns. For these reasons, we

compress the BM on a row by row basis, rather than compressing the entire matrix

as a whole.

Most existing compression techniques exploit data repetition and encode it in a

more compact representation. We thus introduce compressibility as a measure of


confidence to recognize behaviors. In our experiments, we compress the BM with the

DCT compression technique [79], which is one of the most commonly used compression

techniques in practice. We then use the compression ratios to compute the behavior

recognition scores. Significantly higher compression ratios imply a more recognizable

behavior.

Given the sequence representation of an action occurrence, i.e., {{vj, F(vj)}}, if an entity is consistent in repeating an action, the values vj will exhibit a certain level of correlation reflecting the action rate. Moreover, the feature values F(vj) will contain similar values describing how the action was performed. To compress an action sequence, we follow the same approach used in JPEG [80] for a one-dimensional sequence.

Our aim is to compute the three behavior recognition scores along the three behavior components (see Section 4.2.4). For the first behavior component, we compress the sequence $\{v_1, \ldots, v_{n_A^{(a)}}\}$, which represents the inter-arrival times for each action a. The behavior score Sr1(aA) for action a of entity A will be the resulting compression ratio; the higher the compression ratio, the more we can recognize a consistent inter-arrival time (motif). We then use Eq. 4.15 to compute the overall score Sr1(A). Similarly, for the second behavior component, we compress the sequence $\{\mathcal{F}(v_1), \ldots, \mathcal{F}(v_{n_A^{(a)}})\}$, which represents the feature values that describe the action a. Again, the score Sr2(aA) is the produced compression ratio; the higher the compression ratio, the more we can recognize stability in the action features. In the same way, we can compute the overall score Sr2(A).

Finally, for the third behavior component, which evaluates the relationship be-

tween the actions, we compress the concatenated sequences of inter-arrival times of

every possible pair of actions. Given two actions a and b, we concatenate and then

compress their inter-arrival times to get the compression ratio cra,b. If a and b are

closely related, they will have similar inter-arrival times which would allow for bet-

ter compressibility of the concatenated sequence. On the contrary, if they are not

related, the concatenated sequence will contain varying values. Thus, cra,b quanti-


fies the association between actions a and b. Hence, the overall pairwise association

is evidence of the strength of the relationship between the actions, which can be

computed by:

$$S_{r3}(A) = \sum_{\forall a,b} cr_{a,b}$$
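A rough illustration of this compressibility score is sketched below in Python. The dissertation does not spell out the exact DCT quantization it uses, so the threshold-based coefficient pruning and the particular ratio definition here are our own assumptions for a JPEG-like, one-dimensional scheme; they only convey the idea that more regular sequences compress better.

```python
import numpy as np
from scipy.fft import dct

def compression_ratio(sequence, rel_threshold=0.05):
    """JPEG-like 1-D compressibility: take the DCT of the sequence, discard
    coefficients that are small relative to the largest one, and return
    original_length / number_of_retained_coefficients (higher = more regular)."""
    x = np.asarray(sequence, dtype=float)
    coeffs = dct(x, norm='ortho')
    kept = int(np.sum(np.abs(coeffs) > rel_threshold * np.abs(coeffs).max()))
    return len(x) / max(kept, 1)

# S_r1 contribution of one action: compressibility of its inter-arrival times.
inter_arrivals_a = [6, 15, 6, 8, 15, 14, 13]
s_r1_a = compression_ratio(inter_arrivals_a)

# S_r3 contribution of a pair (a, b): compressibility of the concatenated sequences.
inter_arrivals_b = [7, 14, 7, 9, 14, 13]
cr_ab = compression_ratio(inter_arrivals_a + inter_arrivals_b)
```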

4.5 Experiments

The goals of our experimental study are:

• Evaluate the overall linkage quality of our behavior-based record linkage for various situations of splitting the entities' transactions in the log.

• Demonstrate the quality improvement when our technique is combined with a textual record linkage technique.

• Study the performance and scalability, as well as the effectiveness of the candidate generation phase on the overall performance. We also include in our evaluation the compressibility technique, which is discussed in Section 4.4.2.

In the following, we will refer to the statistical model technique as motif.

To the best of our knowledge, this is the first approach to leverage entity behavior

for record linkage. Consequently, there is no other technique to directly compare to.

Instead, we show how our technique can be combined with a textual record linkage

technique.

Dataset: We use a real-world transaction log from Walmart, which is representative of many similar scenarios in the retail industry. These transactions cover a period of

16 months. An entry in the log represents an item that has been bought at a given

time. Figure 4.2(a) shows a typical example for this log. We use the first level item

grouping as the actions which are described by the quantity feature. This feature was

grouped into {very low, low, medium, high, very high} for each individual item (high

quantity for oranges is different from high quantity of milk gallons).


Setup and Parameters: To simulate the existence of two data sources whose

customers (entities) need to be linked, a given entity’s log is partitioned into con-

tiguous blocks which are then randomly assigned to the sources. The log splitting

operation is controlled by the following parameters with their assigned default values

if not specified: (1) e: the percentage of overlapped entities between the two data

sources (default 50%). (2) d: the probability of assigning a log block to the first data

source (default 0.5). (3) b: the transactions block size as a percentage of the entity’s

log size. When b is very small, the log split is called a random split (default 1%),

and for higher values we call the split a block split (default 30%). The block split

represents the case where the customer alternates between stores in different places,

e.g., because of moving during the summer to a different place. When b is 50%, the

log is split into two equal contintigues halves. From the overlapping entities, 50%

have their transactions random split and the rest is block split. These parameters

allow us to test our techniques under various scenarios on how entities interact with

the two systems.
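The following Python sketch shows how such a split can be simulated for a single entity's log (our own illustration of the d and b parameters; it is not the exact code used for the experiments):

```python
import random

def split_log(transactions, d=0.5, b=0.30, seed=0):
    """Partition one entity's chronologically ordered transactions into two sources:
    cut the log into contiguous blocks of roughly b * len(log) transactions,
    then assign each block to the first source with probability d."""
    rng = random.Random(seed)
    block_size = max(1, int(round(b * len(transactions))))
    source1, source2 = [], []
    for start in range(0, len(transactions), block_size):
        block = transactions[start:start + block_size]
        (source1 if rng.random() < d else source2).extend(block)
    return source1, source2

log = list(range(100))                         # stand-in for one customer's transaction log
random_split = split_log(log, d=0.5, b=0.01)   # b near 0 approximates a random split
block_split = split_log(log, d=0.5, b=0.30)    # the default "block split"
```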

All the matching scores within a phase are scaled to be between 0 and 1 by subtracting the minimum score and dividing by the maximum score. All of the experiments

were conducted on a Linux box with a 3 GHz processor and 32 GB RAM. We im-

plemented the proposed techniques in Java and we used MySQL DBMS to store and

query the transactions and the intermediate results.

4.5.1 Quality

The matching quality of the proposed techniques is analyzed by reporting the

classical precision and recall. We also report the f-measure $= \frac{2 \times precision \times recall}{precision + recall}$, which corresponds to the harmonic mean of precision and recall. Since we control the number of overlapping entities, we know the actual unique entities to compute the precision and recall. In some cases, and to provide more readable plots, we only report the f-measure as an overall quality measure.


[Figure 4.4: Behavior linkage overall quality. (a) Candidate generation quality: precision, recall, and candidate-reduction percentage vs. the candidates threshold tc. (b) Accurate matching comparisons: precision and recall vs. the filtering threshold tf for the motif, sim, and compress techniques. (c) Accurate matching comparisons: f-measure vs. the filtering threshold tf.]

Overall Quality: In this experiment, we use a log of a group of 1000 customers.

For the candidate generation phase, we report in Figure 4.4(a) the recall, precision

and percentage of the reduction in the number of candidates against the candidate

matching score threshold tc. If the two data sources contain p and q entities and the

number of generated candidates is c pairs, the reduction percentage corresponds to

r = 100(pq − c)/pq.

We observe that high recall values close to 100% are achieved for tc ≤ 0.3. More-

over, the reduction in the number of candidates starts around 40% and quickly in-

creases close to 100% for tc ≥ 0.2. The precision starts at very low values close to

zero and increases with tc.


The main purpose of this phase is to reduce the number of candidates while

maintaining high recall using an approximate matching process. Therefore, low values

for tc should be used to relax the matching function and avoid false negatives. For

low values around tc = 0.2, the number of candidates is reduced almost completely with very few false negatives. This result was consistent across different datasets.

Figures 4.4(b) and 4.4(c) illustrate and compare the overall quality of the behavior merge techniques (motif and compressibility) and the behavior similarity technique. In this experiment, we used the candidates produced at tc = 0.2. The three techniques behave similarly with respect to tf, the Phase 2 filtering threshold.

High recall is achieved for low values of tf , while high precision is reached for high tf .

In Figure 4.4(c), the f-measure values show that the motif technique can reach an accuracy of over 80%, while the compressibility technique can hardly reach 65%. The behavior similarity technique was the worst, as it can hardly reach 45%.

The difference between the motif and compressibility techniques is expected as the

motif technique is based on an exhaustive statistical method that is more accurate,

while the compressibility is based on efficient mathematical computations. The be-

havior similarity technique did not perform well because when a customer’s shopping

behavior is split randomly, it will be difficult to accurately model his/her behavior

based on either source taken separately. Consequently, comparing the behavior model

can hardly help in matching the customers. In the case where the transactions are

block split, the behavior can be well modeled. However, since there are many distinct

customers who have similar shopping behaviors, the matching quality will drop.

For subsequent experiments and to be fair to the three behavior matching tech-

niques, we report the best achieved results when changing tf . For the candidate

generation phase, we used tc = 0.2.

Improving Quality with Textual Linkage: In this experiment, we consider a

situation that is similar to the Yahoo-Maktoob acquisition discussed in the introduc-

tion. We constructed a profile for each customer such that the textual information

is not very reliable. Basically, we synthetically perturbed the string values. To


[Figure 4.5: Improving the textual matching quality. f-measure vs. the string similarity threshold for motif, compress, sim, and string similarity alone.]

match the customers' textual information, we used the Levenshtein2 distance func-

tion [81]. In the experiment, we first match the customers using different string

similarity thresholds and we then pass the resulting matches to our behavior-based

matching techniques.

In Figure 4.5, the overall accuracy improvement is illustrated by reporting the

f-measure values. Generally, as we relax the matching using the textual informa-

tion by reducing the string similarity threshold, the behavior linkage approach gets more opportunity to improve the overall quality. Reducing the string similarity threshold resulted in very low precision and high recall, with an overall low f-measure. Leveraging the behavior in this case improves the precision and consequently improves the

f-measure. Although all the behavior matching techniques improved the matching

quality, the motif technique is more accurate.

Split Transactions with Different Probabilities: In this experiment, we

study the effect of changing the parameter d (i.e., we assess the quality when the entities' transactions are split between the data sources with different densities). In

Figure 4.6(a), we report and compare the f-measure when linking several datasets

each split with a different d value from 0.1 to 0.5.

2We used the implementation in the Second String library (http://secondstring.sourceforge.net/)


[Figure 4.6: Behavior linkage quality vs. different splitting probabilities and behavior exhaustiveness. (a) f-measure vs. the splitting probability d for motif, compress, and sim. (b) f-measure vs. the number of parts p.]

We observe that for d ≤ 0.2 (i.e., the first source contains less than 20% of the entity's transactions), the matching quality drops sharply. The drop happens because one of the data sources will contain customers with little information about their behaviors. When matching such small behaviors, they are likely to fit and produce a well-recognized behavior with many other customers. The behavior merge techniques consistently produce better results in situations where each of the participating data sources contains at least 20% of an entity's behavior. This is a reasonable result, especially when only the behavior information is used for linkage.

Split Incomplete Behaviors: This experiment evaluates the quality of the

behavior linkage when matching incomplete behaviors (i.e., non-exhaustive3). More

precisely, for a customer we split his/her transactions into p parts and proceed to

link the behaviors using only two parts. We report in Figure 4.6(b) the resulting

f-measure matching values when p = 2, . . . , 7.

As expected, as we increase p, the matching quality of all the behavior linkage approaches drops. Although the customers we used in the experiments may not have

their complete buying behavior in Walmart stores, the motif technique was able to

3An exhaustive behavior means that the customer does all of his/her purchases from the same store throughout the entire studied period.


[Figure 4.7: Behavior linkage quality vs. behavior contiguousness and percentage of overlapping entities. (a) f-measure vs. the block size b. (b) f-measure vs. the overlap percentage e.]

get more than 50% quality for customers having about 25% of their behavior split

between the two sources. The matching quality of the compressibility technique drops

below 50% immediately if we split the transactions into 3 parts and link two of them.

The behavior similarity technique was not helpful most of the time even when using

an even transaction split.

Split Contiguous Behavior Blocks: This experiment studies the effect of

changing b, the transactions block size, to split an entity’s transactions. In Figure

4.7(a), we report the matching quality using the f-measure as we change b. For

b = 50%, the transactions are split into two contiguous halves and for b = 1%, it is

almost as if we are randomly splitting the transactions. For low values of b, we get

the best quality matching using the behavior merge, then as we increase the block

size, b, the matching quality drops. For the behavior similarity, the quality starts low,

about 45%, for very low values of b, then the quality improves to about 65% for b

between 5% and 15%. After that the quality keeps decreasing with the increase of b.

The behavior merge techniques benefit from having the original entities’ behavior

more random; when merging the transactions, the behavior patterns emerge. This is

the main reason for having good accuracy for low values of b. Moreover, high values


of b close to 50% mean that the behaviors can be well recognized and therefore the

computed gain after merging the transaction will not be significant enough to distin-

guish between the entities. On the other hand, for the behavior similarity technique,

there are two observations affecting its results: (i) there are many customers sharing

the same buying behavior, and (ii) how much of the complete behavior can be recog-

nized. For low values of b, the behavior can not be well recognized leading to poorly

estimated behavior model parameters, and consequently low matching accuracy. For

high values of b, the behavior is well recognized, however, because there are many

customers sharing the same behavior, the matching quality drops.

Changing the Number of Overlapping Entities: In this experiment, we

study the effect of changing the overlapping percentage, e, from 10% to 100% (100%

means that all the customers use both stores) and the results are reported in Figure

4.7(b). We see that the overlapping percentage parameter does not significantly affect the matching quality of any of the techniques. This highlights the ability of our approach to provide good results even when the expected number of overlapping entities is small.

4.5.2 Performance

Our next set of experiments studies the execution time. We start by showing the

positive effect of the candidate generation phase on the overall linking time, then we

discuss the scalability of our approach.

Candidate Generation Phase Effectiveness: In this experiment, we used the

same dataset as in Figure 4.4(a). In Figure 4.8(a), we report the total execution

time of the motif, compressibility and similarity techniques against different values of

Phase 1 threshold, tc. Phase 1 took 45 sec; this execution time is not affected by tc

because all the pairs of entities should be compared anyway and then filtered based

on tc’s selected value. For each value of tc, the candidates are passed to the accurate

matching phase to produce the final matching results.


[Figure 4.8: Behavior linkage performance. (a) Candidate generation effectiveness: execution time (sec) vs. the candidate threshold tc for motif, compress, and sim. (b) Scalability: execution time (sec) vs. the number of entities, with and without Phase 1 (motif, compress, sim vs. motif-nP1, compress-nP1, sim-nP1).]

The time spent in the whole matching process decreases as tc increases because

the number of produced candidates drops dramatically. This was illustrated in Figure

4.4(a) in terms of the reduction percentage in the number of candidates. However, high values for tc result in many false negatives. As mentioned earlier, values around

tc = 0.2 produce good quality candidates.

When comparing the performance of the accurate matching techniques at tc =

0.2, the compressibility technique outperforms the motif technique by a factor of about 3. This

is because the compressibility uses a technique that does not require scanning the

data many times while the motif technique uses an expensive iterative statistical

method. The compressibility technique is thus more attractive for very large logs.

The similarity technique requires less time than the motif, because the motif computes

for each candidate pair 3 statistical models (two for the original two entities and one

for the resulting merged entity), while the similarity computes models for only two

entities.

Scalability: This experiment analyzes the scalability of the behavior linkage approach and compares the cases with and without the candidate generation phase.

The evaluation was conducted using a sequential implementation of the techniques

(i.e., no parallelization was introduced). In Figure 4.8(b), we report the overall linkage

time for the three behavior matching techniques when using Phase 1 (motif, compress


and sim) and without Phase 1 (motif-nP1, compress-nP1 and sim-nP1). When we

used Phase 1, tc = 0.2.

The behavior matching techniques require expensive computations and scale

poorly without the help of the candidate generation phase, which resulted in around

2 orders of magnitude speedup in the case of the motif technique. The processing time is governed by the number of candidates generated using the threshold tc, as discussed

in the previous experiment.

For very large scale data processing, generating the candidates can benefit from

standard database join performance. Moreover, the computations required for each candidate pair are independent of any other pair's computation and hence can be easily parallelized. Therefore, in a parallel environment, all the behavior matching computations can be distributed and scaled out.

4.6 Summary

In this chapter, we presented a technique to indirectly involve users in the data cleaning task. In particular, we presented a novel approach for record linkage that uses

entity behavior extracted from transaction logs. When matching two entities, we

measure the gain in recognizing a behavior in their merged logs. We proposed two

different techniques for behavior recognition: a statistical modeling technique and a

more computationally efficient technique that is based on information theory. To im-

prove efficiency, we introduced a quick candidate generation phase. Our experiments

demonstrate the high quality and performance of our approach.


5. HOLISTIC MATCHING WITH WEB TABLES FOR

ENTITIES AUGMENTATION AND FINDING MISSING

VALUES

This chapter introduces an approach to leverage the WWW for a data cleaning task.

In particular, we use extracted web tables in the task of finding missing values in a

database. We assume that we have a table with a list of entities (we call it the query

table) and we want to find the missing values of some attributes of these entities.

The approach requires finding the web tables that match the query table and then using the matched web tables to aggregate the missing values.

This chapter proposes a novel approach for matching web tables with the query table by modeling the problem as topic-sensitive pagerank, where the query table defines a topic on the web tables. We also propose a system architecture that performs most of the "heavy lifting" in a preprocessing step using MapReduce to produce a set of indexes, such that we get fast response times.

The chapter is organized as follows: We start with a motivating example for our approach in Section 5.1. Then, we present our holistic matching framework in Section 5.2. Section 5.3 describes our system architecture, and Section 5.4 discusses how we build the semantic matching graph among web tables. In Section 5.7, we evaluate

our approach, and finally, we summarize the chapter in Section 5.8.

5.1 Introduction

The Web contains a vast corpus of HTML tables. In this chapter, we focus

on one class of HTML tables: entity-attribute tables (also referred to as relational

tables [19,20] and 2-dimensional tables [21]). Such a table contains values of multiple

entities on multiple attributes, each row corresponding to an entity and each column


[Figure 5.1: APIs of the core operations. (a) Augmentation By Attribute Name: the input (query) table lists camera models (S80, A10, GX-1S, T1460) with an empty Brand column; the output table fills in the brands (Nikon, Canon, Samsung, Benq). (b) Augmentation By Example: the input provides brand values for a few models only, and the output completes the remaining ones.]

corresponding to an attribute. Cafarella et al. reported 154M such tables from a snapshot of Google's crawl in 2008; we extracted 573M such tables from a recent crawl of the Microsoft Bing search engine. Henceforth, we refer to such tables as simply

web tables.

Consider an enterprise database where we have a table about companies and all (or most) of their contact information is missing, or consider a product database with a table about digital cameras. In the cameras table, the camera model name is provided, but some other attributes such as brand, resolution, price and optical zoom have missing values. We refer to these attributes as augmenting attributes and to the process of finding the missing attribute values as entities augmentation. Gathering

information about the “entities” is a labor-intensive task. We propose to automate

this task using the extracted web tables.

Such augmentation would be difficult to perform using an enterprise database or

an ontology because the entities can be from any arbitrary domain. Today, users try

to manually find the web sources containing this information and assemble the values.

Assuming that this information is available, albeit scattered, in various web tables,

we can save a lot of time and effort if we can perform this operation automatically.


To find missing values, we support two core operations using web ta-

bles. The first operation is called “Augmentation By Attribute Name” (ABA), where

we have a list of entities and an attribute name to be augmented using web tables.

Figure 5.1(a) shows example input and output for this operation applied to camera

model entities with one augmenting attribute (brand). The second operation is called

“Augmentation By Example” (ABE). It is a variant of ABA, where we provide the

values on the augmenting attribute(s) for a few entities instead of providing the name

of the augmenting attribute(s). Figure 5.1(b) shows example input and output for this

operation applied to camera model entities and one augmenting attribute (brand).

The requirements for these core operations are: (i) high precision (#corraug/#aug) and high coverage (#aug/#entity), where #corraug, #aug and #entity denote the number of entities correctly augmented, the number of entities augmented and the number of entities, respectively; (ii) fast (ideally interactive) response times; and (iii) applicabil-

ity to entities of any arbitrary domain. The focus of the chapter is to perform these

operations using web tables such that the above requirements are satisfied.

Baseline Technique: We present the baseline technique and our insights in the

context of the ABA operation; they apply to ABE as discussed in Section 5.5. For

simplicity, we consider only one augmenting attribute. As shown in Figure 5.1(a),

the input can be viewed as a binary relation with the first column corresponding to

the entity name and the second corresponding to the augmenting attribute. The first

column is populated with the names of entities to be augmented while the second

column is empty. We refer to this table as the query table (or simply the query).

The baseline technique first identifies web tables that semantically "match" with

the query table using schema matching techniques (we consider simple 1:1 mappings

only) [55]. Subsequently, we look each entity up in those web tables to obtain its value

on the augmenting attribute. The state-of-the-art entity augmentation technique,

namely Octopus, implements a variant of this technique using the search engine API

[20].


[Figure 5.2: ABA operation using web tables.]

Example 5.1.1 Consider the query table Q in Figure 5.2. For simplicity, assume

that, like the query table, all the web tables are entity-attribute binary (EAB) relations

with the first column corresponding to the entity name and the second to an attribute

of the entity. Note that for both the query table and web table the first column is

approximately the key column. Using traditional schema matching techniques, a web

table matches Q iff (i) the data values in its first column overlap with those in the first column of Q and (ii) the name of its second column is identical to that of the augmenting attribute. We refer to such matches as "direct matches" and the approach as the "direct match approach" (DMA). In Figure 5.2, only web tables T1, T2 and T3 directly match

with Q (shown using solid arrows). A score can be associated with each direct match

based on the degree of value overlap and degree of column name match; such scores are

shown in Figure 5.2. We then look the entities up in T1, T2 and T3. For S80, both T1


and T3 contain it but the values are different (Nikon and Benq respectively). We can

either choose arbitrarily or choose the value from the web table with the higher score,

i.e., Benq from T3. For A10, we can choose either Canon from T2 or Innostream

from T3 (they have equal scores). For GX-1S, we get Samsung. We fail to augment

T1460 as none of the matched tables contains that entity.

DMA suffers from two problems:

(i) Low precision: In the above example, T3 contains models and brands of cell

phones, not cameras. The names of some of the cell phone models in T3 are identical

to those of the camera models in the query table; hence, T3 gets a high score. This results in 2 (out of 3) wrong augmentations: S80 and A10 (assuming we choose Innostream from T3 for A10). Hence, the precision is 33%. Such ambiguity of entity names exists in all domains, as validated by our experiments. Note that this can be mitigated by raising the "matching threshold", but this leads to poor coverage.

(ii) Low coverage: In the above example, we fail to augment T1460. Hence, the

coverage is 75%. This number is much lower in practice, especially for tail domains.

For example, the Octopus system (which implements a variant of DMA) reports

a coverage of 33%. This primarily happens because the tables that can provide the desired values either do not have column names or use a column name different from the augmenting attribute name provided by the user.

One way to address the coverage issue is to use synonyms of the augmenting

attribute [53, 82]. Traditionally, schema-matchers have used hand-crafted synonyms;

this is not feasible in our setting where the entities can be from any arbitrary domain.

Automatically generating attribute synonyms for arbitrary domains, as proposed in

[19], typically results in poor-quality synonyms. Our experiments show that these are

unusable without manual intervention.

Main Insights and Contributions: Our key insight is that many tables indirectly

match the query table, i.e., via other web tables. These tables, in conjunction with the

directly matching ones, can improve both coverage and precision. We first consider

coverage. Observe that in Figure 5.2, table T4 contains the desired attribute value of


T1460 (Benq) but we cannot “reach” it using direct match. Using schema matching

techniques, we can find that T4 matches with T1 (i.e., there is a 1:1 mapping between

the two attributes of the two relations) as well as T2 (as it has 2 records in common

with T1 and 1 in common with T2). Such schema matches among web tables are

denoted by dashed arrows; each such match has a score representing the degree of

match. Since T1 and/or T2 (approximately) matches with Q (using DMA) and T4

(approximately) matches with T1 and T2 (using schema matching among web tables),

we can conclude T4 (approximately) matches with Q. We refer to T4 as an indirectly

matching table; using it, we can correctly augment T1460. This improves coverage

from 75% to 100%.

Many of the indirectly matching tables are spurious matches; using these tables

to predict values would result in wrong predictions. The challenge is to be robust

to such spurious matches. We address this challenge in two ways. First, we perform

holistic matching. We observe that truly matching tables match with each other and

with the directly matching tables, either directly or indirectly while spurious ones do

not. For example, T1, T2 and T4 match directly with each other, while T3 only matches weakly with T2. If we compute the overall matching score of a table by aggregating

the direct match as well as all indirect matches, the true matching tables will get

higher scores; we refer to this as holistic matching1. In the above example, T1, T2

and T4 will get higher score compared with T3; this leads to correct augmentations

for S80 and A10 resulting in a precision of 100% (up from 33%). Second, for each

entity, we obtain predictions from multiple matched tables and “aggregate” them; we

then select the “top” one (or k) value(s) as the final predicted value(s).

This gives rise to additional technical challenges: (i) We need to compute schema

matches between pairs of web tables; we refer to this as the schema matching among web tables (SMW) graph. How do we build an accurate SMW graph over 573M ×

573M pairs of tables? (ii) How do we model the holistic matching? The model

should take into account the scores associated with the edges in the SMW graph as

1 This is different from holistic matching proposed in [83].


well as those associated with the direct matches. (iii) How do we augment the entities

efficiently at query time?

We have built the InfoGather system based on the above insights. Our contri-

butions can be summarized as follows:

• We develop a novel holistic matching framework based on topic-sensitive pagerank (TSP) over the SMW graph. We argue that by considering the query table

as a topic and web tables as documents, we can efficiently model the holistic

matching as TSP (details are in Section 5.2.4). To the best of our knowledge,

this is the first work to propose holistic matching with web tables.

• We present a novel architecture for the InfoGather system that leverages

preprocessing in MapReduce to achieve extremely fast (interactive) response

times at query time. Our architecture overcomes the limitations of the prior

architecture (viz., Octopus) that uses the search API: its inability to perform

indirect/holistic matches and its high response times.

• We present a machine learning-based technique for building the SMW graph.

Our key insight is that the text surrounding the web tables is important in

determining whether two web tables match or not. We propose a novel set

of features that leverage this insight. Furthermore, we develop MapReduce

techniques to compute these (pairwise) features that scale to 573M tables. Finally, we propose a novel approach to automatically generate training data for this learning task; this liberates the system designer from manually producing labeled data.

• We perform extensive experiments on six real-life query datasets and 573M

web tables. Our experiments show that our holistic matching framework has

significantly higher precision and coverage compared with both the direct matching approach and the state-of-the-art entity augmentation technique, Octopus. Furthermore, our technique has four orders of magnitude faster response times compared with Octopus.


5.2 Holistic Matching Framework

We present the data model, the general augmentation framework and its two

specializations: direct matching and holistic matching frameworks. We present them

in the context of the ABA operation. How we leverage these frameworks for the ABE

operation is discussed in Section 5.5.

5.2.1 Data Model

For the purpose of exposition, we assume that the query table is an entity-attribute

binary (EAB) relation, i.e., a query table Q is of the form Q(K,A), where K denotes

the entity name attribute and A is the augmenting attribute. Since Q.K is approx-

imately the key attribute, we refer to it as the query table key attribute and the

entities as keys. The key column is populated while the augmenting attribute column

is empty. An example of the query table satisfying the above properties is shown in

Figure 5.2.

We assume that all web tables are EAB relations as well. For each web table

T ∈ T , we have the following: (1) the EAB relation TR(K,B) where K denotes the

entity name attribute and B is an attribute of the entity; as in the query table, since

T.K is approximately the key attribute, we refer to it as the web table key attribute,

(2) the url TU of the web page from which it was extracted, and (3) its context TC

(i.e., the text surrounding the table) in the web page from which it was extracted. For

simplicity, we denote TR(K,B) as T (K,B) when it is clear from the context. Figure

5.2 shows four web tables (T1,T2,T3,T4) satisfying the EAB property.

The ABA problem can be stated as follows.

Definition 5.2.1 Augmentation By Attribute Name (ABA): Given a query table

Q(K,A) and a set of web tables ⟨T (K,B), TU , TC⟩ ∈ T , predict the value of each

query record q ∈ Q on attribute A.


In practice, not all web tables are EAB relations; we show how our framework can

be used for general, n-ary web tables in Section 5.6. Furthermore, the query table

can have more than one augmenting attribute; we assume that those attributes are

independent and perform predictions for one attribute at a time.

5.2.2 General Augmentation Framework

Our augmentation framework consists of two main steps: First, identify web tables

that “match” with the query table. Second, use each matched web table to provide

value predictions for the particular keys that happen to overlap between the query

and the web table; then aggregate these predictions and pick the top value as the

final predicted value. We describe the two steps in further detail.

• Identify Matching Tables: Intuitively, a web table T (K,B) matches the

query table Q(K,A) if Q.K and T.K refer to the same type of entities and Q.A

and Q.B refers to the same attribute of the entities. In this work, we consider

simple 1:1 mappings only. Each web table T will be assigned a score S(Q, T )

representing the matching score to the query table Q. Since Q is fixed, we omit

Q from the notation and simply denote it as S(T ). There are many ways to

obtain the matching scores between the query table and web tables; we consider

two such ways in the next two subsections.

• Predict Values: For each record q ∈ Q, we predict the value q[Q.A] of record

q on attribute Q.A from the matching web tables. This is done by joining

the query table Q(K,A) with each matched web table T (K,B) on the key

attribute K. If there exists a record t ∈ T such that q[Q.K] ≈ t[T.K] (where ≈

denotes either exact or approximate equality of values), then we say that the

web table T predicted the value v = t[T.B] for q[Q.A] with a prediction score

ST (v) = S(T ) and return (v, ST (v)).

After processing all the matched tables, we end up with a set Pq =

{(x1, ST1(x1)), (x2, ST2(x2)), . . . } of predicted values for q[Q.A] along with their cor-


responding prediction scores. We then perform fuzzy grouping [84] on the xi’s to get

the groups Gq = {g1, g2, . . . }, such that, ∀xi ∈ gk, xi ≈ vk, where vk is the centroid

or the representative of group gk. We compute the final prediction score for each

group representative v by aggregating the prediction scores of the group's members

as follows:

$$S(v) = \mathcal{F}_{(x_i,\, S_{T_i}(x_i)) \in P_q \mid x_i \approx v}\; S_{T_i}(x_i) \qquad (5.1)$$

where F is an aggregation function. Any aggregation function such as sum or max

can be used in this framework.

The final predicted value for q[Q.A] is the one with the highest final prediction

score:

$$q[Q.A] = \arg\max_{v} S(v) \qquad (5.2)$$

If the goal is to augment k values for an entity on an attribute (e.g., the entity is a

musical band and the goal is to augment it with all its albums), we simply pick the

k values with the highest final prediction scores.

Example 5.2.1 Consider the example in Figure 5.2. Using the table matching scores

shown, for the query record S80, Pq = {(Nikon, 0.25), (Benq, 0.5)} (predicted by

tables T1 and T3 respectively). The final predicted values are Nikon and Benq with

scores 0.25 and 0.5 respectively, so the predicted value is Benq.
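A minimal sketch of this prediction step is shown below, using exact key and value equality in place of the approximate matching and fuzzy grouping, and sum as the aggregation function F of Eq. 5.1 (the names and the toy tables are illustrative only):

```python
from collections import defaultdict

def augment(query_keys, matched_tables):
    """matched_tables: list of (matching score S(T), {key: value}) pairs.
    Returns {key: predicted value}, applying Eqs. 5.1 and 5.2 with F = sum."""
    predictions = {}
    for key in query_keys:
        group_scores = defaultdict(float)
        for score, rows in matched_tables:
            if key in rows:
                group_scores[rows[key]] += score           # aggregate prediction scores (Eq. 5.1)
        if group_scores:
            predictions[key] = max(group_scores, key=group_scores.get)  # Eq. 5.2
    return predictions

# Toy version of the tables in Figure 5.2.
tables = [(0.25, {'S80': 'Nikon'}),
          (0.50, {'A10': 'Canon', 'GX-1S': 'Samsung'}),
          (0.50, {'S80': 'Benq', 'A10': 'Innostream'})]
print(augment(['S80', 'A10', 'GX-1S', 'T1460'], tables))
# S80 -> Benq (0.5 beats 0.25); A10 ties at 0.5; T1460 cannot be augmented.
```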

5.2.3 Direct Match Approach

One way to compute the matching web tables and their scores is the direct match

approach (DMA) discussed in Section 5.1. The prediction step is identical to that in

the general augmentation framework. Using traditional schema matching techniques,

DMA considers a web table T to match with the query table Q iff (i) data values in

T.K overlap with those in Q.K and (ii) the attribute name T.B matches Q.A (denoted


by T.B ≈ Q.A). DMA computes the matching score S(T ) between Q and T , denoted

as SDMA(T ), as follows:

$$S_{DMA}(T) = \begin{cases} \dfrac{|T \cap_K Q|}{\min(|Q|,\,|T|)} & \text{if } Q.A \approx T.B \\ 0 & \text{otherwise.} \end{cases} \qquad (5.3)$$

where |T ∩K Q| = |{t | t ∈ T & ∃ q ∈ Q s.t. t[T.K] ≈ q[Q.K]}|. For example, in

Figure 5.2, the scores for T1, T2 and T3 are 1/4, 2/4 and 2/4 respectively, as they have 1, 2 and 2 matching keys respectively, min(|Q|, |T|) = 4, and Q.A ≈ T.B; the score for T4 is 0 because Q.A does not match T4.B.
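A small sketch of this direct-match score, with exact (case-insensitive) equality standing in for the approximate-match operator of Eq. 5.3 (illustrative names and data only):

```python
def dma_score(query_keys, query_attr, table_keys, table_attr):
    """S_DMA of Eq. 5.3 under exact-equality assumptions."""
    if query_attr.lower() != table_attr.lower():
        return 0.0
    overlap = len(set(query_keys) & set(table_keys))
    return overlap / min(len(query_keys), len(table_keys))

# A table with four keys, one of which overlaps the four-key query, scores 1/4.
query = ['S80', 'A10', 'GX-1S', 'T1460']
print(dma_score(query, 'Brand', ['S80', 'D40', 'D60', 'E500'], 'Brand'))   # 0.25
print(dma_score(query, 'Brand', ['S80', 'D40', 'D60', 'E500'], 'Price'))   # 0.0
```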

5.2.4 Holistic Match Approach

To overcome the limitations of the DMA approach as outlined in Section 5.1,

we study the holistic approach to compute matching tables and their scores. The

prediction step remains the same as above. We model the holistic matching using

TSP. We start by reviewing the definitions of personalized pagerank (PPR) and TSP;

and then make the link to our problem in Section 5.2.4.

Preliminaries: Personalized and Topic Sensitive Pagerank

Consider a weighted, directed graph G(V,E). We denote the weight on an edge

(u, v) ∈ E with αu,v. Pagerank is the stationary distribution of a random walk on G

that at each step, with a probability ϵ, usually called the teleport probability, jumps

to a random node, and with probability (1− ϵ) follows a random outgoing edge from

the current node. Personalized Pagerank (PPR) is the same as Pagerank, except all

the random jumps are done back to the same node, denoted as the “source” node,

for which we are personalizing the Pagerank.

Formally, the PPR of a node v, with respect to the source node u, denoted by

πu(v), is defined as the solution of the following equation:

$$\pi_u(v) = \epsilon\,\delta_u(v) + (1-\epsilon) \sum_{\{w \mid (w,v) \in E\}} \pi_u(w)\,\alpha_{w,v} \qquad (5.4)$$


where δu(v) = 1 iff u = v, and 0 otherwise. The PPR values πu(v) of all nodes v ∈ V

with respect to u are referred to as the PPR vector of u.

A “topic” is defined as a preference vector β inducing a probability distribution

over V . We denote the value of β for node v ∈ V as βv. Topic sensitive pagerank

(TSP) is the same as Pagerank, except all the random jumps are done back to one of

the nodes u with βu > 0, chosen with probability βu. Formally, the TSP of a node v

for a topic β is defined as the solution of the following equation [85]:

$$\pi_\beta(v) = \epsilon\,\beta_v + (1-\epsilon) \sum_{\{w \mid (w,v) \in E\}} \pi_\beta(w)\,\alpha_{w,v} \qquad (5.5)$$
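For intuition, Eqs. 5.4 and 5.5 can be solved by simple power iteration. The sketch below is a generic illustration (the node ordering, toy weights, and teleport value are our own assumptions); PPR is obtained by putting all of β's mass on a single source node.

```python
import numpy as np

def topic_sensitive_pagerank(alpha, beta, eps=0.15, tol=1e-10, max_iter=1000):
    """Solve Eq. 5.5 by power iteration.
    alpha: (n, n) edge-weight matrix with alpha[w, v] = weight of edge (w, v),
           each row summing to 1 over the outgoing edges of w.
    beta:  (n,) preference vector (a probability distribution over the nodes)."""
    n = len(beta)
    pi = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_pi = eps * beta + (1.0 - eps) * (alpha.T @ pi)
        if np.abs(new_pi - pi).sum() < tol:
            return new_pi
        pi = new_pi
    return pi

# Tiny three-node SMW-style graph; rows are outgoing edge weights.
alpha = np.array([[0.0, 0.7, 0.3],
                  [0.5, 0.0, 0.5],
                  [1.0, 0.0, 0.0]])
ppr_of_node0 = topic_sensitive_pagerank(alpha, beta=np.array([1.0, 0.0, 0.0]))  # Eq. 5.4
tsp_scores   = topic_sensitive_pagerank(alpha, beta=np.array([0.2, 0.5, 0.3]))  # Eq. 5.5
```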

Modeling Holistic Matching using TSP

First, we draw the connection between the PPR of a node with respect to a source

node and the holistic match between two web tables. Then, we show how the holistic

matching between the query table and a web table can be modeled with TSP.

Consider two nodes u and v of any weighted, directed graph G(V,E). The PPR

πu(v) of v with respect to u represents the holistic relationship of v to u where E

represents the direct, pairwise relationships, i.e., it considers all the paths, direct as

well as indirect, from u to v and “aggregates” their scores to compute the overall

score. PPR has been applied to different types of relationships. When the direct,

pairwise relationships are hyperlinks between web pages, πu(v) is the holistic impor-

tance conferral (via hyperlinking) of v from u; when the direct, pairwise relationships

are direct friendships in a social network, πu(v) is the holistic friendship of v from u.

In this work, we propose to use PPR to compute the holistic semantic match

between two web tables. Therefore, we build the weighted graph G(V,E), where each

node v ∈ V corresponds to a web table and each edge (u, v) ∈ E represents the direct

pair-wise match (using schema matching) between the web tables corresponding to u

and v. Each edge (u, v) ∈ E has a weight αu,v which represents the degree of match

between the web tables u and v (provided by the schema matching technique). We

discuss building this graph and computing the weights in detail in Section 5.4.1. We


refer to this graph as the schema matching graph among web tables (SMW graph).

Thus, the PPR πu(v) of v with respect to u over the SMW graph models the holistic

semantic match of v to u.

Suppose the query table Q is identical to a web table corresponding to the node

u, then the holistic match score SHol(T ) between Q and the web table T is πu(v),

where v is the node corresponding to T . However, the query table Q is typically not

identical to a web table. In this case, how can we model the holistic match of a web

table T to Q? Our key insight is to consider Q as a “topic” and model the match as

the TSP of the node v corresponding to T to the topic. In the web context where

the relationship is that of importance conferral, the most important pages on a topic

are used to model the topic (the ones included under that topic in Open Directory

Project); in our context where the relationship is semantic match, the top matching

tables should be used to model the topic of Q. We use the set of web tables S (referred

to as seed tables) that directly match with Q, i.e., S = {T | SDMA(T) > 0}, to model

it. Furthermore, we use the direct matching scores SDMA(T )|T ∈ S as the preference

values β:

$$\beta_v = \begin{cases} \dfrac{S_{DMA}(T)}{\sum_{T' \in S} S_{DMA}(T')} & \text{if } T \in S \\ 0 & \text{otherwise} \end{cases} \qquad (5.6)$$

where v corresponds to T. For example, the βv values are 0.25/1.25, 0.5/1.25 and 0.5/1.25 for T1, T2 and T3 respectively, and 0 for all other tables. Just as the TSP score of a web page represents the holistically computed importance of the page to the topic, πβ(v) over

the SMW graph models the holistic semantic match of v to Q. Thus, we propose to

use SHol(T ) = πβ(v) where v corresponds to T .

5.3 System Architecture

Suppose the SMW graph G has been built upfront. The naive way to compute

the holistic matching score SHol(T ) for each web table is to run the TSP computation

algorithm over G at augmentation time. This results in prohibitively high response

times. We leverage the following result to overcome this problem:


Theorem 5.3.1 (Linearity [85]) For any preference vector β, the following equality

holds:

$$\pi_\beta(v) = \sum_{u \in V} \beta_u \times \pi_u(v) \qquad (5.7)$$

If we can precompute the PPR πu(v) of every node v with respect to every other node

u (referred to as Full Personalized Pagerank (FPPR) computation) in the SMW graph,

we can compute the holistic matching score for any query table πβ(v) efficiently using

Eq. 5.7. This leads to very fast response times at query time.
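The following is a minimal Python sketch of how the linearity property is exploited at query time: the holistic score is a β-weighted sum of the precomputed PPR vectors of the seed tables only. The dictionary standing in for the T2PPV index and the function name are assumptions for illustration.

```python
from collections import defaultdict

def holistic_scores(beta, t2ppv):
    """beta: preference vector over the seed tables (Eq. 5.6).
    t2ppv: maps a seed table u to its precomputed PPR vector {v: pi_u(v)},
    with only non-zero entries stored (the T2PPV index)."""
    scores = defaultdict(float)
    for u, weight in beta.items():            # only seed tables have beta_u > 0
        for v, ppr in t2ppv[u].items():       # so only their vectors are fetched
            scores[v] += weight * ppr         # Eq. 5.7: sum_u beta_u * pi_u(v)
    return dict(scores)                       # scores[v] is S_Hol of table v
```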

The InfoGather architecture has two components, as shown in Figure 5.3. The first component performs offline preprocessing of the web crawl to extract the web

tables, build the SMW graph and compute the FPPR. For all these offline steps, our

techniques need to scale to hundreds of millions of tables. We propose to leverage the

MapReduce framework for this purpose. The second component concerns the query

time processing, where we compute the TSP scores for the web tables and aggregate

the predictions from the web tables. In the following, we give more details about each

component:

Preprocessing: There are four main processing steps in this component:

• P1: Extract the HTML web tables from the web crawl and use a classifier to

distinguish the entity-attribute tables from the other types of web tables (e.g.,

formatting tables, attribute value tables, etc.). Our approach is similar to the

one proposed in [86]; we do not discuss this step further as it is not the focus

of this work.

• P2: Index the web tables to facilitate faster identification of the seed tables. We

use three indexes: (i) An index on the web tables’ key attribute values (WIK).

Given a query table Q, WIK(Q) returns the set of web tables that overlap with Q on at least one of the keys. (ii) An index on the web tables' complete records, that is, key and value combined (WIKV). WIKV(Q) returns the set of web tables that contain at least one record from Q. (iii) An index on the web tables' attribute names (WIA), such that WIA(Q) returns the set of web tables {T | T.B ≈ Q.A}.

• P3: Build the SMW graph based on schema matching techniques as we describe

in Section 5.4.1.

• P4: Compute the FPPR and store the PPR vector for each web table (we store

only the non-zero entries). We refer to this as the T2PPV index. For any web

table T , T2PPV(T ) returns the PPR vector of T . We discuss the technique we

use to compute the FPPR in Section 5.4.2.

The indexes (WIK, WIKV, WIA and T2PPV) may either be disk-resident or

reside in memory for faster access.

Query Time Processing: The query time processing can be abstracted in three

main steps. The details of each step depend on the operation. We provide those

details for the ABE operation in Section 5.5.

• Q1: Identify the seed tables: We leverage the WIK, WIKV and WIA indexes to

identify the seed tables and compute their DMA scores.

• Q2: Compute the TSP scores: We compute the preference vector β by plugging

the DMA matching scores in Eq. 5.6. According to Theorem 5.3.1, we can

use β and the stored PPR vectors of each table to compute the TSP score

for each web table. Note that only the seed tables have non-zero entries in β.

Accordingly, we need to retrieve the PPR vectors of only the seed tables using

the T2PPV index. Furthermore, we do not need to compute TSP scores for all

web tables in the retrieved PPR vectors. We need to compute it only for the

tables that could be used in the aggregation step: the one that have at least one

key overlapping with the query table. We refer to them as relevant tables. These

can be identified efficiently by invoking WIK(Q). These two optimizations are

important to compute the TSP scores efficiently.

Fig. 5.3.: InfoGather system architecture: offline pre-processing (extract and identify relational web tables from the web crawl, build the web table indexes WIK, WIKV and WIA, build the web tables graph, and compute the FPPR / T2PPV index) and query time processing (query table, TSP computation, predictions).

• Q3: Aggregate and select values: In this step, we collect the predictions provided

by the relevant web tables T along with the scores SHol(T ). The predictions are

then processed, the scores are aggregated and the final predictions are selected

according to the operation.

5.4 Building the SMW Graph and computing FPPR

This section discusses the major preprocessing steps of the web tables, namely,

building the SMW graph (P3) and computing the FPPR (P4).

5.4.1 Building the SMW Graph

First, we give details on how we match a pair of web tables and then address the

scalability challenges in building the SMW graph.


Matching Web Tables

In the SMW graph, there is an edge between a pair (T(K,B), T′(K′,B′)) of web tables if T matches with T′, i.e., T.K and T′.K′ refer to the same type of entities and T.B and T′.B′ refer to the same attribute of those entities. Our problem can be

formally stated as follows:

Definition 5.4.1 Pairwise web tables matching problem: Given two web tables

⟨T(K,B), TU, TC⟩ and ⟨T′(K′,B′), T′U, T′C⟩, determine whether T matches with T′ and compute the score of the mapping T.K → T′.K′, T.B → T′.B′.

In schema matching [53, 55], the problem of matching two schemas S and S ′ is

normally framed as follows: Given the two schemas, for each attribute A of S, find

the best corresponding attribute A′ of S ′, possibly with an associated matching score.

The problems are similar enough that the techniques used for the standard schema matching problem can be used for ours as well. Schema matching techniques first identify information about each element of each schema that is relevant to discovering matches. For each pair of elements, one from each schema, they compute a set of “feature scores,” where each feature score represents a match between the pair on a

different aspect. Finally, they combine those feature scores into a single score based

on which they decide whether the element pair matches or not. The combination

module can either use machine learning-based techniques or non-learning methods

[41,56]; we use machine learning-based techniques in this work.

Traditionally, the focus is on schema level features (e.g., attribute names match-

ing) and instance level features (e.g., attribute data values matching). Specifically for

web tables, [20] suggested using two specific features: (i) the average column width similarity and (ii) the similarity of the text of the table content without considering the column and row structure.

But for web tables it may not be sufficient to rely only on this traditional schema- and instance-level information. For example, consider tables T2 and T3 in Figure 5.2.

At the schema level, they both share the same column names, and moreover, at the


instance level both share the Model A10 and share the Brand Samsung. Despite all

these similarities, these web tables are not a match, because T2 is about cameras

and T3 is about cell phones. On the other hand, consider web tables T2 and T4.

They share neither schema-level nor instance-level similarities. However, both T2 and T4 contain digital camera models with their brands and should have a high matching score. Furthermore, many web tables do not have column names [51]; this further

exacerbates the problem.

Our main insight is that there is additional information about the web tables that

can be leveraged to overcome the above limitations. We propose 4 novel feature scores

based on this insight:

• Context similarity: The context or the text around the web table in the web

page provides valuable information about the table. Suppose, the context for

T3 is “Mobile phones or cell phones, ranging from . . . ”, while that for T2 and T4

are “Ready made correction data for cameras and lenses” and “Camera Bags

Compatibility List” respectively. This indicates that T2 and T4 are probably

about cameras while T3 about phones. Clearly, sharing the term ‘cameras’

indicates that similarity between T2 and T4. We capture this intuition using a

context similarity feature which is computed using the tf-idf cosine similarity of

the text around the table.

• Table-to-Context similarity: The context of a table may contain keywords

that overlap with values inside another web table. This provides evidence that

the web pages containing the tables are about similar subjects, and hence, the

tables may be about similar subjects as well. We capture this intuition using

table-to-context similarity feature, which is computed using the tf-idf cosine

similarity of the text around the first table and the text inside the second table.

• URL similarity: The URL of the web page containing the table can help

in matching with another table. Sometimes, a web site lists the records from

the same original large table in several web pages. For example a web site may


list the movies in several web pages by year, by first letter, etc. In this case,

matching the URLs of the web pages is a good signal in the matching of the

web tables. We capture this intuition using a URL similarity feature, computed

using cosine similarity of the URL terms.

• Tuples similarity: The web tables that we consider are EAB relations and the

correspondences between the attributes are frozen (T.K → T ′.K ′ and T.B →

T ′.B′); we just need to determine the strength of the correspondence. Hence, the

number of tuples that overlap between the two tables will be strong evidence for deciding whether the tables match. Note that this is different from the instance level feature, which considers the data values of each attribute individually.

We use the above similarities as features in a classification model. Given the fea-

tures, the model predicts the match between two tables with a probability, which is

used as the weight on the edge between them. The set of features includes the newly

proposed features, namely, (1) Context similarity, (2) Table-to-Context similarity,

(3) URL similarity, and (4) Tuples similarity; in addition to the traditional schema

and instance level features, namely, (5) attribute names similarity, (6) column val-

ues similarity, (7) Table-to-Table similarity as a bag of words, (8) columns widths

similarity.
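As an illustration of this step, the sketch below assembles the eight feature scores into a vector and uses a classifier's match probability as the edge weight. Logistic regression is used only as a stand-in, since the chapter does not fix a specific learning model, and all names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Order of the eight per-pair feature scores assumed by this sketch.
FEATURES = ["context", "table_to_context", "url", "tuples",
            "attr_names", "column_values", "table_to_table", "col_widths"]

def train_edge_classifier(feature_rows, labels):
    """feature_rows: one 8-dimensional vector (ordered as FEATURES) per
    labeled pair of web tables; labels: 1 for a match, 0 for a non-match."""
    model = LogisticRegression(max_iter=1000)
    model.fit(np.array(feature_rows), np.array(labels))
    return model

def edge_weight(model, feature_row):
    # The predicted match probability is used as the weight alpha_{u,v}
    # on the SMW-graph edge between the two tables.
    return float(model.predict_proba(np.array([feature_row]))[0, 1])
```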

There are two major challenges in building the SMW graph: (i) computing the

pairwise features that scales to hundreds of millions of tables and, (ii) getting labeled

pairs of web tables to train a classifier. We address these challenges in the following

two subsections.

Scalable Computation of Pairwise Features

Note that we are computing these features for 573M × 573M web table pairs

and, obviously, we cannot do the cross product computation. Our key insight here

is that, for each of the mentioned features, the web table can be considered as a bag

of words (or a document). We can then leverage scalable techniques for computing

pairwise document similarities over a large document collection. Table 5.1 describes the mapping of a web table to a document for each feature.

Table 5.1: Web tables matching features as documents.

Feature name       Document
Context            Terms in the text around the web table, with idf weights
Table-to-Context   The table content as text and the context text, with idf weights
URL                The terms in the URL, with idf weights computed from the set of all URLs
Tuples             All the distinct table rows (key-value pairs) form the terms of a document, with equal weights
Attribute names    The terms mentioned in the column names, with equal weights
Column values      All the distinct values in a column form the terms of a document, with equal weights
Table-to-Table     The table content as text, with idf weights

We leverage the technique described in [87] to compute the document similarity

matrix of a large document set using MapReduce. The technique can be summarized

as follows: Each document d contains a set of terms and can be represented as a vector

Wd of term weights wt,d. The similarity between two documents is the inner product

of the term weights: $\mathrm{sim}(d_1, d_2) = \sum_{t \in d_1 \cup d_2} w_{t,d_1} \cdot w_{t,d_2}$. The key observation here

is that a term t will contribute to the similarity of two documents d1, d2 iff t ∈ d1

and t ∈ d2. If we have an inverted index I, we can easily get the documents I(t) that

contain a particular term t. For each pair of documents ⟨di, dj⟩ ∈ I(t)×I(t), sim(di, dj)

is incremented by (wt,di · wt,dj). By processing all the terms we have computed the

entire similarity matrix without the expensive cross-product computations.


This can be implemented directly as two MapReduce tasks: (1) Indexing : The

mapper processes each document d and emits for each term t ∈ d (key = t, value

= (d, wt,d)). The reducer outputs the term as the key and the list of documents

containing that key (key = t, value =I(t)). (2) Similarity computation: The mapper

processes each term with its list of documents, (t, I(t)), and emits for each pair of

documents ⟨di, dj⟩ ∈ I(t) × I(t) and i < j (key = ⟨di, dj⟩, value = wt,di · wt,dj).

Finally, the reducer does the summation to output the sim(di, dj) (key= ⟨di, dj⟩,

value = $\mathrm{sim}(d_i, d_j) = \sum_{t \in d_i \cap d_j} w_{t,d_i} \cdot w_{t,d_j}$). For more efficiency, a df-cut notion is

used to eliminate terms with high document frequency [87].
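The following single-machine Python sketch mirrors the two phases of the MapReduce computation described above (indexing, then per-term accumulation of partial products); it only illustrates the idea of [87], with the df-cut simplified to an absolute document-frequency threshold and all names hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def similarity_matrix(docs, max_df=None):
    """docs: dict doc_id -> {term: weight}. Returns {(di, dj): sim(di, dj)}."""
    # Phase 1 (indexing): postings I(t) = [(doc_id, w_{t,doc}), ...]
    index = defaultdict(list)
    for doc_id, weights in docs.items():
        for term, w in weights.items():
            index[term].append((doc_id, w))

    # Simplified df-cut: drop terms appearing in more than max_df documents.
    if max_df is not None:
        index = {t: p for t, p in index.items() if len(p) <= max_df}

    # Phase 2 (similarity): for each term, emit w_{t,di} * w_{t,dj} for every
    # document pair sharing t; the "reducer" sums the partial products.
    sims = defaultdict(float)
    for postings in index.values():
        for (di, wi), (dj, wj) in combinations(sorted(postings), 2):
            sims[(di, dj)] += wi * wj
    return dict(sims)
```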

Getting labeled pairs of web tables:

We mentioned earlier that we rely on a classification model to get the matching

score of two tables given their similarity features vector. The challenge here is to

obtain labeled examples to train the classifier. One way is to use a human to manually

label a random set of pairs of web tables. However, this is going to be painful and

time consuming. We propose an automatic way to obtain labeled pairs of web tables.

To label a pair of web tables (T , T ′) as a positive example, our hypothesis is that

T and T′ may not have records in common, but a third web table T′′ has some records in common with T and T′ individually (we call it a labeling web table). For example, consider tables T2 and T4 in Figure 5.2. T1 is found to be a labeling web table for them. T1 overlaps with T2 on one record (DSC W570, Sony), and it also overlaps with T4 on one record (Optio E60, Pentax).

We formalize our hypothesis as follows: a pair of tables Ti(Ki, Bi) and Tj(Kj, Bj) is a true example pair if ∃ a web table TL(KL, BL) (a labeling web table) such that (i) the overlapping records satisfy |TL ∩ Ti| ≥ θ and |TL ∩ Tj| ≥ θ, and (ii) for each record tL ∈ TL and each record ti ∈ Ti (tj ∈ Tj), if tL[KL] = ti[Ki] (tL[KL] = tj[Kj]), then tL[BL] = ti[Bi] (tL[BL] = tj[Bj]). The second condition guarantees that if Ti shares a key with TL, then the value of the other attribute must match. If we do not find such a labeling web table, then the web table pair is considered a negative example.

One might think that the web table labeling approach could be used to generate all the pairwise semantic matches to build the SMW graph, but this is too expensive

to be done for 573M × 573M pairs of web tables. However, using the labeling web

table approach to generate a few thousand examples to train a classifier is feasible.
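A minimal sketch of the labeling rule above, assuming web tables are represented as key-to-value dictionaries (EAB relations) and using exact matching; the function names and the default θ = 1 are ours.

```python
def consistent_overlap(t_l, t):
    """Record overlap between labeling table t_l and table t, or 0 if they
    disagree on the attribute value for any shared key (condition (ii))."""
    shared_keys = set(t_l) & set(t)
    if any(t_l[k] != t[k] for k in shared_keys):
        return 0
    return len(shared_keys)   # every shared key also agrees on the value

def is_positive_example(t_i, t_j, labeling_tables, theta=1):
    # Positive iff some labeling table overlaps both t_i and t_j in at least
    # `theta` records without contradicting either; otherwise negative.
    return any(consistent_overlap(t_l, t_i) >= theta and
               consistent_overlap(t_l, t_j) >= theta
               for t_l in labeling_tables)
```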

5.4.2 Computing FPPR on SMW Graph

Once the SMW graph is constructed, we compute the full personalized pagerank

matrix. There are two broad approaches to compute personalized pagerank. The

first approach is to use linear algebraic techniques, such as Power Iteration [88]. The

other approach is Monte Carlo, where the basic idea is to approximate Personalized

Pagerank by directly simulating the corresponding random walks and then estimating

the stationary distributions with the empirical distributions of the performed walks.

We use a recently proposed MapReduce algorithm [89] to compute the FPPR.

It is based on the Monte Carlo approach. The basic idea is to very efficiently compute

single random walks of a given length starting at each node in the graph. Then these

random walks are used to efficiently compute the PPR vector for each node.
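The sketch below illustrates the Monte Carlo estimation principle only (the actual FPPR computation uses the MapReduce algorithm of [89]); the graph encoding, restart probability and number of walks are assumptions for illustration.

```python
# Approximate pi_source(v) by the frequency with which walks started at
# `source` visit v, where each step terminates (teleports back to the
# source) with probability `restart`.

import random
from collections import Counter

def monte_carlo_ppr(neighbors, source, num_walks=10000, restart=0.15):
    """neighbors: dict node -> list of (successor, edge_weight) pairs."""
    visits = Counter()
    for _ in range(num_walks):
        node = source
        while True:
            visits[node] += 1
            if random.random() < restart or not neighbors.get(node):
                break  # walk ends; the next walk restarts at the source
            succs, weights = zip(*neighbors[node])
            node = random.choices(succs, weights=weights, k=1)[0]
    total = sum(visits.values())
    return {v: count / total for v, count in visits.items()}
```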

5.5 Supporting Core Operations

We discuss how we support the core operations using our holistic matching frame-

work. Note that for each operation, we re-define the DMA score.

5.5.1 Augmentation-By-Attribute (ABA)

We have already discussed the ABA operation in Section 5.2. Here, we give the details of the three query-time steps abstracted in Section 5.3. We present the pseudocode for the ABA operation in Algorithm 5.1.


Algorithm 5.1 ABA(Query table Q(K, A))
1: ∀ q ∈ Q, Pq = {}
2: R = WIK(Q)
3: R = R ∩ WIA(Q)  {Relevant web tables.}
4: for all T ∈ R do
5:   for all q ∈ Q and t ∈ T , s.t. q[Q.K] ≈ t[T.K] do
6:     Pq = Pq ∪ {(v = t[T.B], ST(v) = S(T))}
7:   end for
8: end for
9: ∀ q ∈ Q, fuzzy group Pq to get Gq
10: for all q ∈ Q do
11:   ∀ g ∈ Gq with v = centroid(g), S(v) = F_{(xi, STi(xi)) ∈ Pq, xi ≈ v} STi(xi)
12: end for
13: ∀ q ∈ Q, q[Q.A] = argmax_v S(v)

• Q1: Identifying the seed tables: The seed tables for Q(K,A) are identified using

the WIK and WIA indexes such that a web table T (K,B) is considered if there

is at least one key overlap and Q.A ≈ T.B. The DMA scores are computed

using Eq. 5.3.

• Q2: Computing the tables’ TSP scores: This step is identical to Step Q2 in Section 5.3.

• Q3: Aggregating and processing values: This step is identical to the predict

values step in the augmentation framework discussed in Section 5.2.
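To make the query-time flow concrete, here is a runnable simplification of Algorithm 5.1 that uses exact key matching in place of fuzzy matching and fuzzy grouping; the toy tables and scores are hypothetical, and F = sum is used as in our experiments.

```python
from collections import defaultdict

def aba(query_keys, web_tables, aggregate=sum):
    """web_tables: list of (records, score) pairs, where records maps a key
    to a candidate value and score stands in for S_Hol(T)."""
    predictions = {}
    for q in query_keys:
        votes = defaultdict(list)            # candidate value -> list of scores
        for records, score in web_tables:    # the relevant tables for Q
            if q in records:                 # exact key match (no fuzziness)
                votes[records[q]].append(score)
        if votes:                            # aggregate with F and take argmax
            predictions[q] = max(votes, key=lambda v: aggregate(votes[v]))
    return predictions

# Toy usage in the spirit of Figure 5.1: predict the brand for each model.
tables = [({"S80": "Nikon", "DSC W570": "Sony"}, 0.4),
          ({"S80": "Nikon"}, 0.3)]
print(aba(["S80", "DSC W570"], tables))  # {'S80': 'Nikon', 'DSC W570': 'Sony'}
```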

5.5.2 Augmentation-By-Example (ABE)

This is a variation of the ABA operation. Instead of providing the augmenting at-

tribute name, the user provides the query table with some complete records as exam-

ples, i.e., for some of the keys, she provides the values on the augmenting attribute

(e.g., Figure 5.1(b)).


Definition 5.5.1 Augmenting-By-Example (ABE): Given a query table Q(K,A) = Qc ∪ Qe, where Qc denotes the set of records {qc ∈ Q | qc[A] ≠ null} (referred to as example complete records) and Qe denotes the set of records {qe ∈ Q | qe[A] = null} (referred to as incomplete records), predict the value of each incomplete record qe ∈ Qe on attribute A.

The three query-time steps for the ABE operation are identical to those of the ABA operation, except for the way we identify the seed tables and compute the DMA scores. DMA considers a web table T to match the query table Q iff the records in Qc overlap with those in T. For example, in Figure 5.2, table T1 is considered a seed table

for the query table illustrated in Figure 5.1(b), because they overlap on the record

(S80, Nikon). Given the query table, we use the WIKV index to get the seed tables

efficiently.

Intuitively, a web table T should be assigned a high DMA score if, for each shared

key between T and Qc, the two tables agree on the value of the augmenting attribute

as well. Accordingly, we redefine the DMA matching score as the fraction of the

shared keys that agree on the value of the augmenting attribute;

SDMA(T ) =|Qc ∩KV T ||Qc ∩K T |

(5.8)

where |Qc ∩KV T | denotes the number of overlapping records between the complete

records of the query table Q and the web table T . Recall that |Qc ∩K T | denotes the

number of shared keys.
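A minimal sketch of Eq. 5.8 under exact matching; the dictionary representation of Qc and of a web table (key → augmenting-attribute value) is an assumption for illustration.

```python
def dma_score_abe(q_complete, web_table):
    """q_complete, web_table: dicts mapping a key to its attribute value."""
    shared_keys = set(q_complete) & set(web_table)               # |Qc ∩_K T|
    if not shared_keys:
        return 0.0                                               # not a seed table
    agreeing = sum(q_complete[k] == web_table[k] for k in shared_keys)  # |Qc ∩_KV T|
    return agreeing / len(shared_keys)

# T1 from Figure 5.2 shares the key "S80" with the query of Figure 5.1(b)
# and agrees on the value "Nikon", so its score is 1.0.
print(dma_score_abe({"S80": "Nikon"}, {"S80": "Nikon", "DSC W570": "Sony"}))
```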

5.6 Handling n-ary Web Tables

Throughout our discussion, we assume that the web tables are entity-attribute

binary (EAB) relations. The result is that we work with a simpler graph with a single score between the nodes, and this enables us to model the problem as a TSP problem.

If we consider n-ary web tables and use a single score among the nodes, a matching

score between the query table and a web table will not say which column of the web

table is the desired augmenting attribute.

Fig. 5.4.: The distribution of the number of columns per web table (a) and statistics about the relational web tables (b): 573M web tables, with 3.09 columns and 26.54 rows on average.

In practice, not all the tables on the web are binary relations. Fortunately, relational tables on the web are meant for human consumption and usually have a subject column [19, 51]. According to [19], there are effective heuristics to identify a web table's subject column. For example, the web search log can be used: the subject column name tends to have high search-query hits (i.e., it appears in the search queries that hit the web page containing the web table), and the subject column also usually appears as the left-most column. If the subject column can be identified, then we split the table into several EAB relations, i.e., pairing the subject column with each of the other columns yields a set of EAB relations. The main assumption that we make on the web table is that the subject appears in a single column; we do not consider multiple columns as subjects.

In this work, we do not make this assumption; in this case, we split an n-ary web table into (n − 1)² EAB relations, i.e., all possible pairs of columns are considered EAB relations. Our study shows the feasibility of doing so. Figure 5.4(a) shows the distribution of the number of columns per relational web table. The average is 3.1, about 54% are binary tables, and 70% are either binary or ternary relations.
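For illustration, the sketch below splits a table into EAB relations: when a subject column is identified it is paired with each other column, and otherwise every ordered pair of distinct columns is emitted as a simplification of the all-pairs splitting described above. The sample table is hypothetical.

```python
def split_into_eab(columns, rows, subject_col=None):
    """columns: list of column names; rows: list of value lists (same length)."""
    key_cols = [subject_col] if subject_col is not None else columns
    relations = {}
    for key_col in key_cols:
        k = columns.index(key_col)
        for attr_col in columns:
            if attr_col == key_col:
                continue
            a = columns.index(attr_col)
            relations[(key_col, attr_col)] = [(row[k], row[a]) for row in rows]
    return relations

# Hypothetical 3-column table with an identified subject column "Model".
eabs = split_into_eab(["Model", "Brand", "Zoom"],
                      [["S80", "Nikon", "3.5x"]],
                      subject_col="Model")
# eabs == {("Model", "Brand"): [("S80", "Nikon")],
#          ("Model", "Zoom"):  [("S80", "3.5x")]}
```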


5.7 Experimental Evaluation

We present an experimental evaluation of the techniques proposed in the chapter

for the ABA and ABE operations. The goals of the study are:

• To compare the holistic matching approach with DMA, DMA with attribute synonyms, and the state-of-the-art approach, Octopus [20], in terms of precision and coverage for the ABA operation

• To compare the holistic matching approach with DMA in terms of precision and coverage for the ABE operation

• To study the sensitivity of quality (precision and coverage) of the approaches

to “head” vs “tail” query entities

• To study the sensitivity of quality to the number of example complete records

for ABE operation

• To evaluate the (direct) impact of our novel features (context, table-to-context, URL, tuples similarities) on the quality of the SMW graph, as well as their (indirect) impact on the quality of the ABA operation

• To evaluate the holistic approach in terms of query response times and compare

with Octopus

5.7.1 Experimental Setting

Implementation: We implemented the InfoGather system described in Fig-

ure 5.3. In the offline preprocessing step, we extracted 573M entity-attribute HTML

tables from a recent snapshot (July 2011) of the Microsoft Bing search engine; such snapshots are available in the internal MapReduce clusters within Microsoft. We then built the WIK, WIKV and WIA indexes, built the SMW graph, and computed the T2PPV index and the T2Syn index. We performed all these steps in our MapReduce clusters as discussed in Section 5.3. To build the SMW graph, we used the solution described in [87] with a df-cut of 99.9%. We stored the indexes (WIK, WIKV, WIA and T2PPV) on a single machine for query processing. We used an Intel x64 machine with 8 2.66GHz Intel Xeon processors and 32GB RAM, running Windows 2008 Server for this purpose.

Table 5.2: Query entity domains and augmenting attributes

Dataset name   Entity (key attribute)    Augmenting attribute
cameras        Camera model              Brand
movies         Movie                     Director
baseball       Baseball team             Player
albums         Musical band              Album
uk-pm          UK parliament party       Member of parliament
us-gov         US state                  Governor

Datasets: We conducted experiments on 6 datasets shown in Table 5.2. For

example, for the cameras dataset: the ABA operation augments the brand given a

set of camera model names and the string “brand” and the ABE operation augments

the brands of a set of camera model names given a set of (model, brand) pairs. Toy examples of inputs and outputs for this dataset are shown in Figure 5.1.

We chose 4 datasets (baseball, albums, uk-pm, us-gov) that were also used to

evaluate Octopus. We compiled the complete ground truth for these datasets by

manually identifying a knowledgebase and extracting the desired information from it.

For example, for baseball, we got the “all-time roster” for a randomly chosen set of

12 baseball teams from Wikipedia; for albums, we got all the albums for a randomly

chosen set of 14 bands from Freebase. We chose two additional datasets (cameras,

movies) for which we had complete ground truth (from Microsoft Bing Shopping

product catalog and IMDB database, respectively). One distinguishing characteristic

of these two datasets is that the augmenting attribute has a 1:1 relationship with the key (as opposed to 1:n in the above 4 datasets). We generate a query table by

randomly selecting keys from the ground truth. For movies we use a query table of 6,000 keys, and for cameras, 1,000. All our results are averaged over 5 such query

tables. We use F = sum in Eq. 5.1 for all our experiments.

Measures: Since some of the datasets have 1:n relationships between the key and

augmenting attribute, we generalize the precision and coverage measures defined in

Section 5.1 as follows. We first compute the precision and coverage for each key as

follows:

\[
\text{precision} = \frac{\#\text{values correctly predicted}}{\#\text{values predicted}}, \qquad
\text{coverage} = \frac{\#\text{values predicted}}{\#\text{values in ground truth}}
\]

We average over all the keys in the query table Q to obtain the precision and coverage

for Q. Recall that if the ground truth has k values for a key on the augmenting

attribute, the augmentation framework selects the top-k values for that key.
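A minimal sketch of these per-key measures averaged over a query table; the dictionary-of-sets representation is an assumption, and precision is treated as 0 for keys with no predicted values.

```python
def precision_coverage(predicted, ground_truth):
    """predicted, ground_truth: dicts mapping a key to a set of values."""
    precisions, coverages = [], []
    for key, truth in ground_truth.items():
        preds = predicted.get(key, set())
        # precision = #values correctly predicted / #values predicted
        precisions.append(len(preds & truth) / len(preds) if preds else 0.0)
        # coverage = #values predicted / #values in ground truth
        coverages.append(len(preds) / len(truth))
    n = len(ground_truth)
    return sum(precisions) / n, sum(coverages) / n
```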

5.7.2 Experimental Results

Evaluating Augmentation-By-Attribute (ABA): We implemented four different approaches for ABA:

• Holistic: This is our approach using TSP.

• DMA: This is the direct matching approach.

• DMA with attribute synonyms: This is the DMA approach where we use a set

of synonyms for the augmenting attribute. A web table T (K,B) will be used

for prediction if its keys overlap with those in the query table and Q.A matches

with any of the synonyms of T.B. We use the state-of-the-art technique for obtaining the synonyms, namely the attribute correlation statistics database (ACSDb)

as described in [19]. The algorithm requires a context attribute name for each

dataset. We provide the key attribute name as the context. We refer to this

approach as DMA-ACSDbSyn.

Fig. 5.5.: Augmenting-By-Attribute (ABA) evaluation: (a) ABA precision; (b) ABA coverage.

• Octopus: This is the EXTEND operation using the MultiJoin algorithm introduced in [20], which is the state-of-the-art approach for performing ABA with web tables. Given a query table and an attribute name a, MultiJoin composes a web search query of the form “k a” for each key k in the query table. All the web tables in the web pages returned for these search queries are then obtained and clustered according to their schema similarity. Finally, the cluster that best covers the query table is selected, and each of its member tables is joined with the query table to augment the values.

Figure 5.5 reports the precision and coverage. The Holistic approach significantly

outperforms all other approaches both in terms of precision and coverage. The average

precision (over all 6 datasets) of Holistic is 0.79, compared with 0.65, 0.42 and 0.39 for DMA, DMA-ACSDbSyn and Octopus, respectively. The average coverage (over all 6 datasets) of Holistic is 0.97, compared with 0.36, 0.59 and 0.38 for DMA, DMA-ACSDbSyn and Octopus, respectively. This shows that considering

indirect matches and computing the scores holistically improves both precision and

coverage.

DMA demonstrates high precision with all the datasets except for cameras where

it was 60%; the main limitation of DMA is coverage as it does not consider indirectly

matching tables.

DMA-ACSDbSyn has lower precision compared to DMA, due to the quality of

the synonyms used. We manually inspected the synonyms we get from the ACSDb;

there were almost no meaningful synonyms in the top 20 for the cameras and movies

datasets. This is because DMA-ACSDbSyn uses only schema-level correlations to

compute synonyms; attribute names are often ambiguous (e.g., the attribute name

“name”) leading to many spurious synonyms.

Octopus demonstrates low precision as well as low coverage for all the datasets,

except for the cameras where the precision was high and for the us-gov dataset where

the coverage was high. On average the coverage is about 33%, which matches the

results reported in [20]. Octopus uses the web search API to retrieve matching ta-

bles; since web search is not meant for matching tables, in many cases, the top 1000

returned urls did not provide any matching tables. Furthermore, Octopus’s archi-

tecture does not consider indirectly matching tables and does not perform holistic

matching.

Evaluating Augmentation-By-Example (ABE): We study the sensitivity to the number of example complete records, as well as the sensitivity to the nature of the provided examples in terms of being famous (head) or rare (tail) examples. Head

(tail) examples are those records that show up in a high (low) number of web tables.

In Figure 5.6, we report the precision and coverage of the augmented values for

the query table using the Holistic and DMA approaches as we increase the number of

example complete records between 1 and 50. We report the results for the cameras

and movies datasets; for the other datasets, we observe quite similar results.

Fig. 5.6.: Sensitivity of the precision (a) and coverage (b) to the number of examples (0–50) for Holistic and DMA on the cameras and movies datasets. The Holistic shows high precision and maintains high coverage in comparison to DMA.

In Figure 5.6(a), the reported precision is high for both datasets using the two

approaches. However, in Figure 5.6(b), the Holistic significantly outperforms the

DMA in coverage when there are very few example complete records (between 1 and

10). The Holistic provides values for about 99% and 93% of the incomplete records

for the movies and cameras datasets, respectively, while the DMA provides coverage

in the range between 20% and 75% for up to 10 examples. This shows that even with a small number of example complete records, the Holistic approach can find enough tables to augment the query table, while the DMA cannot.

The number of example complete records is not the only factor impacting the cov-

erage. The frequency of the examples in the web tables also impacts the performance.

We note that the distribution of query records over the web tables follows a power law.

Hence, if the example complete records appear in a lot of web tables (head or famous

query records), then we will directly match a lot of web tables to increase the cover-

age. On the other hand, if the examples are tail records, there will be very few direct

matching web tables.

In Figure 5.7, we perform a joint sensitivity analysis of both the number of example complete records and the nature of the records (i.e., head, tail, or mid—records in the middle). We report the results for 2, 10, and 50 examples.

Fig. 5.7.: Joint sensitivity analysis to the number of examples and the head vs. tail records in the web tables. The Holistic is robust in comparison to the DMA.

The precision results were high and similar for all the datasets. The figure shows the

coverage for the movies dataset; we observed similar behavior for the other datasets.

The DMA is sensitive to both the number of example records and their nature. The coverage degrades as we decrease the number of examples, and it degrades further if the examples are tail records. On the other hand, the Holistic approach does well and maintains a coverage of 99% even in the hard situations with a small number of example tail records.

Impact of new features for building SMW graph: The objective of this exper-

iment is to evaluate the usefulness of our proposed set of features for matching web

tables in comparison to features previously used in the literature. In our evaluation, we compare four techniques: (1) SMW Graph: This represents our full proposed set of features. (2) Traditional: This represents the schema and instance level features; we use the features that represent the similarity between the attribute names, as well as the similarity between the values in each column. (3) WTCluster: This represents the features that were introduced in [20] for matching web tables, namely, the table

text similarity and the column widths similarity. (4) Traditional & WTCluster: This combines both of the previous two sets of features.

Fig. 5.8.: Web tables matching accuracy: (a) direct schema matching evaluation using the proposed feature set; (b) the impact of the schema matching quality on the ABA operation.

We first compare the above techniques on the quality of the SMW graph. In this

experiment, we randomly picked 500 web tables relevant to each of the datasets, and

then we computed the feature values and identified labels for each pair of web tables

as being a match or not (using our automatic labeling technique described in Section

5.4.1). We created a balanced set of examples (i.e., almost equal negative and positive

examples). We trained a classifier using the examples and reported in Figure 5.8(a)

the model's accuracy per dataset in addition to the overall accuracy. The displayed results are averages over 5 different runs of the above procedure.

Fig. 5.9.: Response time evaluation: query response time (seconds, log scale) versus the number of query records (in thousands) for InfoGather and Octopus.

In general, the web tables matching accuracy using our set of features, SMW

Graph, shows the best performance. SMW Graph yields about a 6% improvement over Traditional and about a 10% improvement over WTCluster. SMW Graph also outperforms the combined set of features, Traditional & WTCluster, by about 3%. These results demonstrate the importance of the newly introduced features for matching web tables.

The SMW Graph technique consistently outperforms the other techniques; the other techniques, however, are not consistently reliable. For example, in the uk-pm dataset, the Traditional technique performs better than WTCluster, while the situation is reversed in the us-gov dataset.

In Figure 5.8(b), we evaluate the impact of our features on the quality of the

ABA operation; that is, whether the improved quality of the SMW graph translates into better quality ABA (an indirect evaluation of the features). Here, we report only the precision, as we obtain the same coverage using each of the feature sets. Note the similarity between Figures 5.8(a) and 5.8(b): the SMW Graph features yield a better quality SMW graph and, hence, better precision.

Efficiency evaluation: We evaluate in Figure 5.9 the efficiency of our approach

and architecture in comparison with the Octopus approach for the ABA operation


for the cameras dataset. We obtained similar performance with the other datasets.

Our implementation of Octopus involves using a web search engine API. We report

the query response time as we increase the size of the query table. Our approach

takes milliseconds to respond and is 4 orders of magnitude faster than Octopus.

Our fast response time is due to the fast computations of the TSP scores using

the pre-computed PPR vectors and indexes that we introduce in Section 5.3. For

Octopus, most of the time is spent processing the web search queries, as it is based on SOAP request/response communication. As mentioned in [20], Octopus

can be implemented more efficiently if web search engines support Octopus-specific

operations; however, current search engines do not support such operations.

In summary, our experiments show that our holistic matching framework and pro-

posed system architecture can support the three core operations with high precision,

high coverage and interactive response times.

5.8 Summary

In this chapter, we present the InfoGather system to automate information

gathering tasks, like augmenting entities with attribute values and discovering at-

tributes, using web tables. Our experiments demonstrate the superiority of our tech-

niques compared to the state-of-the-art.


6. SUMMARY

This dissertation addresses the data cleaning problem from a practical and pragmatic viewpoint. The main goal is to introduce techniques that efficiently involve users and the WWW, as well as handle large-scale databases. Involving users in the data cleaning process is essential to guarantee accurate cleaning decisions. Automating the

process of consulting the WWW for data cleaning will save a tremendous amount of

time.

This thesis fills the gap between the theoretical research conducted on cleaning

algorithms and the practical systems for data cleaning, where quality and scalability are usually of great concern.

6.1 Summary of Contributions

This dissertation introduces four main contributions for guided data cleaning by

involving users or the WWW. First, we introduced GDR, a guided data repair frame-

work that combines the best of both: the fidelity of user feedback to guide the cleaning process and the existing automatic cleaning techniques to suggest cleaning updates. The user can help explore the search space of the possible cleaning updates if he/she is consulted before certain decisions are taken. Once the user confirms a decision, any further dependent decisions taken by the algorithm are guaranteed to be more accurate, hence achieving better data quality. The ultimate goal is to achieve better data quality with minimal user feedback. Therefore, to minimize the user's effort, GDR learns from user feedback to eventually replace the user. The

key novelty we proposed in GDR is the ranking of the questions that are forwarded

to the user. For this purpose, we introduced a principled mechanism that depends

on a combination of decision theory and active learning to quantify the utility and


impact of obtaining user feedback on the questions. GDR was accepted as a system demonstration [60] at SIGMOD 2010, and a research paper [59] was accepted for publication in PVLDB 2011.

The second contribution of this dissertation is proposing a new scalable data re-

pair approach that is based on machine learning techniques. The idea is to maximize

the data likelihood, given the learned data distributions, using a small amount of

changes to the database. We built a system (SCARE) which is a scalable framework

that follows our likelihood-based repair approach. SCARE has three advantages over

previous automatic cleaning approaches: (1) it is more accurate in identifying erro-

neous values and in finding the correct cleaning updates, (2) it scales well because

it relies on a robust mechanism to partition the database and then aggregate the final cleaning decisions from the several partitions, and (3) it does not require data quality rules; instead, it is based on modeling the data distributions using machine learning techniques. In comparison to quality rule discovery techniques, the machine learning techniques are more flexible and accurate in capturing the relationships between the database attributes and values. In contrast to constraint-based data repair approaches, which find the minimal changes that satisfy a set of data quality rules, our likelihood-based repair approach finds a constrained amount of changes that maximizes the data likelihood.

Our third contribution is a novel approach to involve users or entities indirectly

for a data cleaning task. In this approach, we noticed that the users' actions (or behavior), which can be found in the system's log, can be useful evidence for the task of deduplicating the users themselves if they have different representations in the system. For example, in retail stores the customers are usually identified by their credit cards, and sometimes the customers use several cards for their transactions. Also, the users of a web site are usually identified by their IP addresses, and these addresses change from time to time. The idea of our solution is to first merge the behavior

information (transaction log) for each candidate pair of entities to be matched. If the

two behaviors seem to complete one another, in the sense that stronger behavioral patterns become detectable after the merge, then this is a strong indication that the two entities are, in fact, the same. To this end, we developed the necessary pattern detection and modeling algorithms and computed the matching score as the

gain (or certainty increase) in the identification of the behavior patterns after merging

the entities’ behavior. Our approach was accepted for publication [61] in PVLDB 2010.

The last contribution is a new approach to leverage the WWW for data cleaning.

In this approach we focused on web tables, i.e., the relational tables that can be found in web pages. We investigated the use of web tables for the task of finding missing values in databases and augmenting entity attributes. For example, a user may have a list of camera models and, using our system, she can find the cameras' brands, optical zoom levels, and other relevant attributes. The main challenge in this work is the ambiguity of the entities and the dirtiness of the data on the web. Therefore, our solution relies on aggregating answers from several web tables that directly and indirectly match the user's list of entities. This required modeling the relationships between the web tables to accurately identify the tables relevant to the user's database. We modeled this problem as a topic-sensitive pagerank problem [85]. We introduced a system to extract, process, index and compute the pagerank of the web tables. This is a data-intensive application, and our solution relies on steps that are performed using MapReduce in an offline phase, so that the processing at query time is done in milliseconds. Our approach was accepted for publication [62] in SIGMOD 2012.

6.2 Future Extensions

This dissertation raises a number of research problems related to data cleaning. It

is motivated by our belief that data cleaning should be the result of close collaboration among multiple resources, including users and other information sources such as the WWW.

In this section, we give an overview of several directions for future research.


6.2.1 User Centric Data Cleaning

We introduced GDR to involve users directly in the data cleaning process. In our

proposed solution, we assumed that the user is always correct. However, there could

be uncertain answers from the users, or the user may even make mistakes. It is always useful to interact with experts who will provide feedback that is certainly trusted. However, experts are more expensive than the regular users of the data. Therefore, to involve less-expert users in cleaning the data, it is necessary to take into account the uncertainty of the user when providing feedback. This may require getting answers for the same question from a different user or giving another similar question to the same user. Revising the suggested updates while propagating uncertain feedback will also change, as it should take the user's uncertainty into account. The problem of involving direct user interaction for data cleaning is challenging by itself. Taking into account the uncertainty of the feedback adds another level of challenge; however, it leads to a more realistic setting.

Crowdsourcing has drawn a lot of attention recently and we believe that cleaning

and improving the quality of databases must benefit from such a model. Our initial work in guided data repair opens the door to involving the crowd in data cleaning tasks. However, we can identify several new challenges to be addressed in this model: for example, modeling the users and taking their uncertainty into account, identifying when feedback on the same questions needs to be aggregated to improve certainty, globally aggregating feedback while resolving conflicting answers, and the need for an economic model to address the trade-off between quality, cost and time.

6.2.2 Holistic Data Cleaning

Research in the data cleaning area has typically focused on a single dimension of data quality (e.g., inconsistency, deduplication, etc.). However, a single database may have a combination of these problems, and unfortunately, these problems interact with each other. For example, consider deduplication and the repair of an inconsistent database (i.e., one violating a set of constraints). If we resolve and merge duplicate records, we may end up fixing inconsistencies in their values (or introducing new ones), and fixing an inconsistent record may result in new duplicate records (or separate previously identified duplicates). Our system SCARE, which we propose in Chapter 3, focuses on cleaning dirty values and relies on blocking techniques for data partitioning. Blocking has usually been used to improve the efficiency of deduplication techniques. We believe that SCARE can be extended further to handle both problems: repairing dirty values as well as identifying and merging duplicate records.

6.2.3 The WWW for Data Cleaning

We introduced in Chapter 5 an approach to leverage web tables for data cleaning.

There are multiple directions to extend this work. First, more structured information

can be found on the web and can be leveraged as an additional source of information, for example, HTML lists, attribute-value tables and the deep web. The challenge here is how to correctly extract the information in a well-formatted form so that it can be easily processed and used. Secondly, in our work we focused only on the data cleaning task of finding missing values; however, other tasks need to be studied. For example, the data on the web can be leveraged for deduplication. Imagine a matching function on the web that, given a record of any schema and another record of another schema, tells whether the two records refer to the same entity or not. The information needed to decide whether the entities match is there on the web, but the problem needs to be studied.

6.2.4 Private Data Cleaning

Owners of data in the same domain can collaborate and help each other improve their data quality. For example, stores may collaborate to unify and repair their information about products. The challenge here is privacy: how can we develop a data cleaning solution that benefits from the union of the parties' data and gives suggestions to each data owner to clean his/her data? We have already made efforts [90, 91] in this direction and introduced solutions for performing record matching efficiently in a privacy-preserving setting. Matching values and records is an important and frequently used building block in data cleaning techniques. Therefore, our effort lays down the necessary infrastructure for revisiting existing data cleaning techniques in a private setting.


LIST OF REFERENCES

[1] C. Batini and M. Scannapieco, Data Quality: Concepts, Methodologies and Techniques (Data-Centric Systems and Applications). Springer, 2006.

[2] L. P. English, Information Quality Applied: Best Practices for Improving Business Information, Processes and Systems. Wiley, 2009.

[3] G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma, “Improving data quality: Consistency and accuracy,” in Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB ’07, pp. 315–326, 2007.

[4] A. Lopatenko and L. Bravo, “Efficient approximation algorithms for repairing inconsistent databases,” in IEEE 23rd International Conference on Data Engineering, ICDE ’07, pp. 216–225, April 2007.

[5] S. Kolahi and L. V. S. Lakshmanan, “On approximating optimum repairs for functional dependency violations,” in Proceedings of the 12th International Conference on Database Theory, ICDT ’09, pp. 53–62, 2009.

[6] W. Fan, J. Li, S. Ma, N. Tang, and W. Yu, “Towards certain fixes with editing rules and master data,” Proceedings of VLDB Endowment (PVLDB), vol. 3, pp. 173–184, September 2010.

[7] C. Mayfield, J. Neville, and S. Prabhakar, “ERACER: A database approach for statistical inference and data cleaning,” in Proceedings of the 2010 International Conference on Management of Data, SIGMOD ’10, pp. 75–86, 2010.

[8] X. Zhu and X. Wu, “Class noise vs. attribute noise: A quantitative study of their impacts,” Artificial Intelligence Review, vol. 22, pp. 177–210, November 2004.

[9] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, “Duplicate record detection: A survey,” IEEE Transactions on Knowledge and Data Engineering, vol. 19, pp. 1–16, January 2007.

[10] S. Sonnenburg, G. Ratsch, C. Schafer, and B. Scholkopf, “Large scale multiple kernel learning,” Journal of Machine Learning Research, vol. 7, pp. 1531–1565, December 2006.

[11] W. Fan, “Dependencies revisited for improving data quality,” in Proceedings of the 27th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’08, pp. 159–170, 2008.

[12] I. Bhattacharya and L. Getoor, “Iterative record linkage for cleaning and integration,” in Proceedings of the SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, DMKD ’04, pp. 11–18, 2004.

[13] D. V. Kalashnikov, S. Mehrotra, and Z. Chen, “Exploiting relationships for domain-independent data cleaning,” in SIAM International Conference on Data Mining, 2005.

[14] A. Doan, Y. Lu, Y. Lee, and J. Han, “Object matching for information integration: A profiler-based approach,” in Workshop on Information Integration on the Web, IIWeb ’03, 2003.

[15] S. Chaudhuri, A. Das Sarma, V. Ganti, and R. Kaushik, “Leveraging aggregate constraints for deduplication,” in Proceedings of the 2007 International Conference on Management of Data, SIGMOD ’07, pp. 437–448, 2007.

[16] F. Radlinski and T. Joachims, “Query chains: Learning to rank from implicit feedback,” in Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD ’05, pp. 239–248, 2005.

[17] S. Holland, M. Ester, and W. Kießling, “Preference mining: A novel approach on mining user preferences for personalized applications,” in Knowledge Discovery in Databases: PKDD 2003, 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, pp. 204–216, 2003.

[18] E. Agichtein, E. Brill, and S. Dumais, “Improving web search ranking by incorporating user behavior information,” in Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’06, pp. 19–26, 2006.

[19] M. J. Cafarella, A. Halevy, D. Z. Wang, E. Wu, and Y. Zhang, “Webtables: Exploring the power of tables on the web,” Proceedings of VLDB Endowment (PVLDB), vol. 1, pp. 538–549, August 2008.

[20] M. J. Cafarella, A. Halevy, and N. Khoussainova, “Data integration for the relational web,” Proceedings of VLDB Endowment (PVLDB), vol. 2, pp. 1090–1101, August 2009.

[21] X. Yin, W. Tan, and C. Liu, “Facto: A fact lookup engine based on web tables,” in Proceedings of the 20th International Conference on World Wide Web, WWW ’11, pp. 507–516, 2011.

[22] M. Arenas, L. Bertossi, and J. Chomicki, “Consistent query answers in inconsistent databases,” in Proceedings of the 18th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’99, pp. 68–79, 1999.

[23] P. Bohannon, W. Fan, M. Flaster, and R. Rastogi, “A cost-based model and effective heuristic for repairing constraints by value modification,” in Proceedings of the 2005 International Conference on Management of Data, SIGMOD ’05, pp. 143–154, 2005.

[24] R. Bruni and A. Sassano, “Errors detection and correction in large scale data collecting,” in Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis, IDA ’01, pp. 84–94, 2001.

[25] J. Chomicki and J. Marcinkowski, “Minimal-change integrity maintenance using tuple deletions,” Journal of Information and Computation, vol. 197, pp. 90–121, February 2005.

Page 180: Graduate School ETD Form 9 PURDUE UNIVERSITY GRADUATE SCHOOL

163

[26] E. Franconi, A. L. Palma, N. Leone, S. Perri, and F. Scarcello, “Census data re-pair: A challenging application of disjunctive logic programming,” in Proceedingsof the Artificial Intelligence on Logic for Programming, LPAR ’01, pp. 561–578,2001.

[27] J. Wijsen, “Condensed representation of database repairs for consistent query an-swering,” in Proceedings of the 9th International Conference on Database Theory,ICDT ’03, pp. 378–393, 2003.

[28] W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis, “Conditional functional de-pendencies for capturing data inconsistencies,” ACM Transaction on DatabaseSystems (TODS), vol. 33, pp. 6:1–6:48, June 2008.

[29] W. Fan, X. Jia, J. Li, and S. Ma, “Reasoning about record matching rules,”Proceedings of VLDB Endowment (PVLDB), vol. 2, pp. 407–418, August 2009.

[30] A. Arasu, C. Re, and D. Suciu, “Large-scale deduplication with constraints usingdedupalog,” in Proceedings of the 2009 IEEE International Conference on DataEngineering, ICDE ’09, pp. 952–963, 2009.

[31] W. Fan, S. Ma, Y. Hu, J. Liu, and Y. Wu, “Propagating functional dependencieswith conditions,” Proceedings of VLDB Endowment (PVLDB), vol. 1, pp. 391–407, August 2008.

[32] L. Bravo, W. Fan, F. Geerts, and S. Ma, “Increasing the expressivity of con-ditional functional dependencies without extra complexity,” in Proceedings ofthe 2008 IEEE 24th International Conference on Data Engineering, ICDE ’08,pp. 516–525, 2008.

[33] L. Golab, H. Karloff, F. Korn, D. Srivastava, and B. Yu, “On generating near-optimal tableaux for conditional functional dependencies,” Proceedings of VLDBEndowment (PVLDB), vol. 1, pp. 376–390, August 2008.

[34] G. Cormode, L. Golab, K. Flip, A. McGregor, D. Srivastava, and X. Zhang, “Es-timating the confidence of conditional functional dependencies,” in Proceedingsof the 2009 International Conference on Management of Data, SIGMOD ’09,pp. 469–482, 2009.

[35] F. Chiang and R. J. Miller, “Discovering data quality rules,” Proceedings ofVLDB Endowment (PVLDB), vol. 1, pp. 1166–1177, August 2008.

[36] W. Fan, F. Geerts, L. V. S. Lakshmanan, and M. Xiong, “Discovering conditional functional dependencies,” in Proceedings of the 2009 IEEE International Conference on Data Engineering, ICDE ’09, pp. 1231–1234, 2009.

[37] F. Chu, Y. Wang, D. S. Parker, and C. Zaniolo, “Data cleaning using belief propagation,” in Proceedings of the 2nd International Workshop on Information Quality in Information Systems, IQIS ’05, pp. 99–104, 2005.

[38] J. L. Y. Koh, M. L. Lee, W. Hsu, and K. T. Lam, “Correlation-based detection of attribute outliers,” in Proceedings of the 12th International Conference on Database Systems for Advanced Applications, DASFAA ’07, pp. 164–175, 2007.

[39] H. Galhardas, D. Florescu, D. Shasha, and E. Simon, “Ajax: An extensible data cleaning tool,” in Proceedings of the 2000 International Conference on Management of Data, SIGMOD ’00, p. 590, 2000.

[40] V. Raman and J. M. Hellerstein, “Potter’s wheel: An interactive data cleaning system,” in Proceedings of the 27th International Conference on Very Large Data Bases, VLDB ’01, pp. 381–390, 2001.

[41] A. Doan, P. Domingos, and A. Y. Halevy, “Reconciling schemas of disparate data sources: A machine-learning approach,” in Proceedings of the 2001 International Conference on Management of Data, SIGMOD ’01, pp. 509–520, 2001.

[42] A. Doan and R. McCann, “Building data integration systems: A mass collaboration approach,” in Workshop on Information Integration on the Web, IIWeb ’03, pp. 183–188, 2003.

[43] S. R. Jeffery, M. J. Franklin, and A. Y. Halevy, “Pay-as-you-go user feedback for dataspace systems,” in Proceedings of the 2008 International Conference on Management of Data, SIGMOD ’08, pp. 847–860, 2008.

[44] W. Wu, C. Yu, A. Doan, and W. Meng, “An interactive clustering-based approach to integrating source query interfaces on the deep web,” in Proceedings of the 2004 International Conference on Management of Data, SIGMOD ’04, pp. 95–106, 2004.

[45] S. Sarawagi and A. Bhamidipaty, “Interactive deduplication using active learning,” in Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’02, pp. 269–278, 2002.

[46] P. Turney, “Types of cost in inductive concept learning,” in Workshop on Cost-Sensitive Learning at the Seventeenth International Conference on Machine Learning, pp. 15–21, 2000.

[47] F. Provost, “Toward economic machine learning and utility-based data mining,” in Proceedings of the 1st International Workshop on Utility-Based Data Mining, UBDM ’05, p. 1, 2005.

[48] D. Cohn, L. Atlas, and R. Ladner, “Improving generalization with active learning,” Machine Learning, vol. 15, pp. 201–221, May 1994.

[49] A. Kapoor, E. Horvitz, and S. Basu, “Selective supervision: Guiding supervised learning with decision-theoretic active learning,” in Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI ’07, pp. 877–882, 2007.

[50] V. S. Sheng, F. Provost, and P. G. Ipeirotis, “Get another label? Improving data quality and data mining using multiple, noisy labelers,” in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’08, pp. 614–622, 2008.

[51] P. Venetis, A. Halevy, J. Madhavan, M. Pasca, W. Shen, F. Wu, G. Miao, and C. Wu, “Recovering semantics of tables on the web,” Proceedings of VLDB Endowment (PVLDB), vol. 4, pp. 528–538, June 2011.

[52] G. Limaye, S. Sarawagi, and S. Chakrabarti, “Annotating and searching web tables using entities, types and relationships,” Proceedings of VLDB Endowment (PVLDB), vol. 3, pp. 1338–1347, September 2010.

[53] E. Rahm and P. A. Bernstein, “A survey of approaches to automatic schema matching,” The VLDB Journal, vol. 10, pp. 334–350, December 2001.

[54] P. A. Bernstein, J. Madhavan, and E. Rahm, “Generic schema matching, ten years later,” Proceedings of VLDB Endowment (PVLDB), 2011 (VLDB 10-Year Best Paper Award).

[55] Z. Bellahsene, A. Bonifati, and E. Rahm, Schema Matching and Mapping. Springer, 1st ed., 2011.

[56] J. Madhavan, P. A. Bernstein, A. Doan, and A. Halevy, “Corpus-based schema matching,” in Proceedings of the 21st International Conference on Data Engineering, ICDE ’05, pp. 57–68, 2005.

[57] Y. He and D. Xin, “Seisa: Set expansion by iterative similarity aggregation,” in Proceedings of the 20th International Conference on World Wide Web, WWW ’11, pp. 427–436, 2011.

[58] R. Gupta and S. Sarawagi, “Answering table augmentation queries from unstructured lists on the web,” Proceedings of VLDB Endowment (PVLDB), vol. 2, pp. 289–300, August 2009.

[59] M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas, “Guided data repair,” Proceedings of VLDB Endowment (PVLDB), vol. 4, pp. 279–289, February 2011.

[60] M. Yakout, A. K. Elmagarmid, J. Neville, and M. Ouzzani, “GDR: A system for guided data repair,” in Proceedings of the 2010 International Conference on Management of Data, SIGMOD ’10, pp. 1223–1226, 2010.

[61] M. Yakout, A. K. Elmagarmid, H. Elmeleegy, M. Ouzzani, and A. Qi, “Behavior based record linkage,” Proceedings of VLDB Endowment (PVLDB), vol. 3, pp. 439–448, September 2010.

[62] M. Yakout, K. Ganjam, K. Chakrabarti, and S. Chaudhuri, “InfoGather: Entity augmentation and attribute discovery by holistic matching with web tables,” in Proceedings of the 2012 International Conference on Management of Data, SIGMOD ’12, 2012.

[63] P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis, “Conditional functional dependencies for data cleaning,” in Proceedings of the 23rd International Conference on Data Engineering, ICDE ’07, pp. 746–755, 2007.

[64] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Pearson Education, 2nd ed., 2003.

[65] S. Tong and D. Koller, “Support vector machine active learning with applications to text classification,” The Journal of Machine Learning Research, vol. 2, pp. 45–66, March 2002.

[66] B. Zadrozny and C. Elkan, “Learning and making decisions when costs and probabilities are both unknown,” in Proceedings of the 7th International Conference on Knowledge Discovery and Data Mining, KDD ’01, pp. 204–213, 2001.

[67] L. Breiman, “Random forests,” Machine Learning, vol. 45, pp. 5–32, 2001.

[68] D. Heckerman, D. M. Chickering, C. Meek, R. Rounthwaite, and C. Kadie, “Dependency networks for inference, collaborative filtering, and data visualization,” Journal of Machine Learning Research, vol. 1, pp. 49–75, September 2001.

[69] S. Geman and D. Geman, “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images,” in Neurocomputing: Foundations of Research, MIT Press, 1988.

[70] T. G. Dietterich, “Ensemble methods in machine learning,” in Proceedings of the 1st International Workshop on Multiple Classifier Systems, MCS ’00, pp. 1–15, 2000.

[71] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Computation, vol. 3, pp. 79–87, March 1991.

[72] Y. Asahiro, K. Iwama, H. Tamaki, and T. Tokuyama, “Greedily finding a dense subgraph,” Journal of Algorithms, vol. 34, pp. 203–221, February 2000.

[73] D. S. Hochbaum, “Efficient bounds for the stable set, vertex cover and set packing problems,” Discrete Applied Mathematics, vol. 6, pp. 243–254, 1983.

[74] S. Arora, D. Karger, and M. Karpinski, “Polynomial time approximation schemes for dense instances of NP-hard problems,” in Proceedings of the 27th Annual ACM Symposium on Theory of Computing, STOC ’95, pp. 284–293, 1995.

[75] U. Feige and M. Seltser, “On the densest k-subgraph problem,” Algorithmica, vol. 29, 2001.

[76] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1–38, 1977.

[77] S. W. Smith, The Scientist and Engineer’s Guide to Digital Signal Processing. California Technical Publishing, 1997.

[78] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 2006.

[79] N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete cosine transform,” IEEE Transactions on Computers, vol. 23, pp. 90–93, January 1974.

[80] G. K. Wallace, “The JPEG still picture compression standard,” Communications of the ACM, vol. 34, pp. 30–44, April 1991.

[81] V. Levenshtein, “Binary codes capable of correcting deletions, insertions and reversals,” Soviet Physics Doklady, vol. 10, p. 707, 1966.

[82] J. Madhavan, P. A. Bernstein, and E. Rahm, “Generic schema matching with Cupid,” in Proceedings of the 27th International Conference on Very Large Data Bases, VLDB ’01, pp. 49–58, 2001.

[83] B. He and K. C.-C. Chang, “Statistical schema matching across web query interfaces,” in Proceedings of the 2003 International Conference on Management of Data, SIGMOD ’03, pp. 217–228, 2003.

[84] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani, “Robust and efficient fuzzy match for online data cleaning,” in Proceedings of the 2003 International Conference on Management of Data, SIGMOD ’03, pp. 313–324, 2003.

[85] T. H. Haveliwala, “Topic-sensitive pagerank,” in Proceedings of the 11th International Conference on World Wide Web, WWW ’02, pp. 517–526, 2002.

[86] M. J. Cafarella, A. Y. Halevy, Y. Zhang, D. Z. Wang, and E. Wu, “Uncovering the relational web,” in Proceedings of the 11th International Workshop on the Web and Databases, WebDB ’08, 2008.

[87] T. Elsayed, J. Lin, and D. W. Oard, “Pairwise document similarity in large collections with mapreduce,” in Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, HLT-Short ’08, pp. 265–268, 2008.

[88] L. Page, S. Brin, R. Motwani, and T. Winograd, “The pagerank citation ranking: Bringing order to the web,” Technical Report 1999-66, Stanford InfoLab, November 1999 (previously numbered SIDL-WP-1999-0120).

[89] B. Bahmani, K. Chakrabarti, and D. Xin, “Fast personalized pagerank on mapreduce,” in Proceedings of the 2011 International Conference on Management of Data, SIGMOD ’11, pp. 973–984, 2011.

[90] M. Yakout, M. J. Atallah, and A. Elmagarmid, “Efficient private record linkage,” in Proceedings of the 2009 IEEE International Conference on Data Engineering, ICDE ’09, pp. 1283–1286, 2009.

[91] M. Yakout, M. J. Atallah, and A. Elmagarmid, “Efficient and practical approach for private record linkage,” Journal of Data and Information Quality, 2012.

VITA

Mohamed Yakout was born in the beautiful city of Alexandria, Egypt. His interest in technology and computer science started in high school. In 1996, Mohamed joined the Faculty of Engineering at Alexandria University. After a very competitive freshman year, he was ranked among the top students and joined the Computer Science Department. After graduating with a bachelor’s degree in Computer Science, Mohamed joined the ICT technical staff of the Bibliotheca Alexandrina (Library of Alexandria) in 2001. During his work at the Bibliotheca, he participated in major digital library projects; in particular, he led projects on the digitization of Egypt’s cultural heritage. In 2006, Mohamed earned an M.Sc. in Computer Science from Alexandria University, and afterward he began thinking about combining his industrial skills with more advanced research skills.

In 2007, he joined the graduate program at Purdue University, working with his advisor, Ahmed Elmagarmid. At Purdue, Mohamed learned to conduct world-class research. He worked on interesting problems in data quality and data cleaning, and his work was published in top data management conferences such as SIGMOD, VLDB, and ICDE. In the summer of 2009, Mohamed interned at Google Inc., where he got his first taste of a large US corporation. In the summers of 2010 and 2011, he interned with Microsoft Research, where he interacted with many world-class researchers and gained experience with large-scale, real-world data management. Mohamed’s main research interests focus on advancing technologies for improving data quality by involving external resources such as users and data on the Web. Mohamed graduated with a Ph.D. in Computer Science from Purdue University in August 2012. In 2012, he joined the technical staff of Microsoft Bing in Seattle, Washington.