![Page 1: A Grammar-based Entity Representation Framework for Data Cleaning Authors: Arvind Arasu Raghav Kaushik Presented by Rashmi Havaldar](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e495503460f94b3c934/html5/thumbnails/1.jpg)
A Grammar-based Entity Representation Framework for Data Cleaning
Authors: Arvind Arasu, Raghav Kaushik
Presented by Rashmi Havaldar
![Page 2: A Grammar-based Entity Representation Framework for Data Cleaning Authors: Arvind Arasu Raghav Kaushik Presented by Rashmi Havaldar](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e495503460f94b3c934/html5/thumbnails/2.jpg)
Problems
Poor data quality is often due to the lack of a unique representation for real-world entities.
E.g., California can be represented as California, Calif, CA, etc.
Although textually different, these 5 records correspond to just 2 authors.
![Page 3: A Grammar-based Entity Representation Framework for Data Cleaning Authors: Arvind Arasu Raghav Kaushik Presented by Rashmi Havaldar](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e495503460f94b3c934/html5/thumbnails/3.jpg)
Problem Definition
The main problem in data cleaning is to determine whether two representations are duplicates, i.e. whether they correspond to the same real-world entity.
Cosine similarity and edit distance measure textual similarity, which can be misleading.
Two representations of the same entity can be highly dissimilar.
Conversely, two representations that are textually very similar can correspond to different entities.
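To make the point concrete, here is a minimal sketch that computes edit distance with the standard dynamic-programming recurrence. The California/CA pair comes from the earlier slide; the two person names are hypothetical strings chosen only for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Same entity, textually far apart.
same_entity = edit_distance("california", "ca")
# Hypothetical names of two different people, textually close.
different_entities = edit_distance("John Smith", "Joan Smith")

print(same_entity, different_entities)
```

A threshold on this distance would separate the two spellings of the same state while merging the two different names, which is the failure mode the slide describes.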
![Page 4: A Grammar-based Entity Representation Framework for Data Cleaning Authors: Arvind Arasu Raghav Kaushik Presented by Rashmi Havaldar](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e495503460f94b3c934/html5/thumbnails/4.jpg)
Solution: Programmable Framework
![Page 5: A Grammar-based Entity Representation Framework for Data Cleaning Authors: Arvind Arasu Raghav Kaushik Presented by Rashmi Havaldar](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e495503460f94b3c934/html5/thumbnails/5.jpg)
Basic Definitions
A program is a collection of triples of the form <R,P,A>, where R is a grammar rule, P is a predicate and A is an action.
A grammar rule has a head and a body. The head is a single non-terminal, and the body is a sequence of non-terminals, terminals and variables.
Terminals are words and punctuation.
Non-terminals are written in angle brackets, terminals as single-quoted strings (e.g. 'Jeff') and variables as uppercase letters.
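One possible way to hold such a triple in code is sketched below. The class name `AugmentedRule`, the tagged-tuple encoding of body symbols, the treatment of the action as a function from variable bindings to an output string, and the one-row nickname table are all assumptions made for illustration; the slides do not prescribe a data structure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Body symbols are tagged tuples: ("nt", name), ("t", text) or ("var", name).
Symbol = Tuple[str, str]
Bindings = Dict[str, str]


@dataclass
class AugmentedRule:
    head: str                               # single non-terminal
    body: List[Symbol]                      # non-terminals, terminals, variables
    predicate: Callable[[Bindings], bool]   # P
    action: Callable[[Bindings], str]       # A (assumed to build output text)

    def render(self) -> str:
        """Print the rule using the notation from the slide."""
        parts = []
        for kind, value in self.body:
            if kind == "nt":
                parts.append(f"<{value}>")   # non-terminal in angle brackets
            elif kind == "t":
                parts.append(f"'{value}'")   # terminal in single quotes
            else:
                parts.append(value)          # variable as an uppercase letter
        return f"<{self.head}> -> " + " ".join(parts)


# Hypothetical reference table mapping nickname to full name.
NICKNAMES = {"Jeff": "Jeffrey"}

rule = AugmentedRule(
    head="first",
    body=[("var", "N")],
    predicate=lambda b: b["N"] in NICKNAMES,
    action=lambda b: NICKNAMES[b["N"]],
)
print(rule.render())
```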
![Page 6: A Grammar-based Entity Representation Framework for Data Cleaning Authors: Arvind Arasu Raghav Kaushik Presented by Rashmi Havaldar](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e495503460f94b3c934/html5/thumbnails/6.jpg)
Example: Framework program
![Page 7: A Grammar-based Entity Representation Framework for Data Cleaning Authors: Arvind Arasu Raghav Kaushik Presented by Rashmi Havaldar](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e495503460f94b3c934/html5/thumbnails/7.jpg)
Expanded program G’ for program G
The expanded program G’, like G, is a collection of augmented rules.
To construct G’, we consider each augmented rule <R,P,A> in G and enumerate all possible assignments of constant values to the variables in R such that predicate P evaluates to true. Each such assignment yields an augmented rule <R’, true, A’>.
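The expansion step can be sketched as a brute-force enumeration. Everything concrete here is an assumption for illustration: the function name `expand`, the tuple layout of an expanded rule, the nickname table and the constant domain.

```python
from itertools import product

# Hypothetical reference table mapping nickname to full name.
NICKNAMES = {"Andy": "Andrew", "Jeff": "Jeffrey"}


def expand(head, variables, domain, predicate, action):
    """Enumerate every assignment of constants to the rule's variables and
    keep those for which the predicate holds, as <R', true, A'> tuples."""
    expanded = []
    for values in product(domain, repeat=len(variables)):
        binding = dict(zip(variables, values))
        if predicate(binding):
            body = [binding[v] for v in variables]  # variables replaced by constants
            expanded.append((head, body, True, action(binding)))
    return expanded


# Rule <first> -> N with predicate "N is a known nickname".
rules = expand(
    head="first",
    variables=["N"],
    domain=["Andy", "Jeff", "Smith"],   # assumed set of candidate constants
    predicate=lambda b: b["N"] in NICKNAMES,
    action=lambda b: NICKNAMES[b["N"]],
)
for r in rules:
    print(r)
```

With a large reference table this enumeration produces one rule per qualifying row, which is why the size of G’ becomes a concern.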
![Page 8: A Grammar-based Entity Representation Framework for Data Cleaning Authors: Arvind Arasu Raghav Kaushik Presented by Rashmi Havaldar](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e495503460f94b3c934/html5/thumbnails/8.jpg)
Parse Tree:
The program handles variations in the order in which the first name and last name appear.
It also handles variations resulting from the use of nicknames.
![Page 9: A Grammar-based Entity Representation Framework for Data Cleaning Authors: Arvind Arasu Raghav Kaushik Presented by Rashmi Havaldar](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e495503460f94b3c934/html5/thumbnails/9.jpg)
Weights:
A non-negative real number is assigned to each augmented rule in G’.
The weight of an output record is the sum of the weights of the augmented rules involved in parsing that record.
Lower weights indicate higher confidence.
The programmer can use “loose” rules, i.e. rules that the programmer is not very confident about. Higher weights are assigned to “loose” rules.
If R’ is an augmented rule in the expanded program G’, the weight of R’ is the log of the number of rules in G’.
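A small sketch of how summed rule weights could rank alternative parses. The rule names, the weight values and the two candidate parses are hypothetical; only the "sum of rule weights, lower is better" scheme comes from the slide.

```python
from typing import Dict, List

# Hypothetical weights: exact rules cost nothing, the "loose" nickname rule costs more.
RULE_WEIGHTS: Dict[str, float] = {
    "name -> last first": 0.0,
    "name -> first last": 0.0,
    "first -> exact": 0.0,
    "first -> nickname": 1.5,   # loose rule, lower confidence
    "last -> exact": 0.0,
}


def parse_weight(rules_used: List[str]) -> float:
    """Weight of an output record: sum of the weights of the rules in its parse."""
    return sum(RULE_WEIGHTS[r] for r in rules_used)


# Two hypothetical parses of the same input record.
candidates = {
    "exact parse": ["name -> last first", "last -> exact", "first -> exact"],
    "nickname parse": ["name -> last first", "last -> exact", "first -> nickname"],
}

# Lower weight means higher confidence, so take the minimum.
best = min(candidates, key=lambda name: parse_weight(candidates[name]))
print(best, parse_weight(candidates[best]))
```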
![Page 10: A Grammar-based Entity Representation Framework for Data Cleaning Authors: Arvind Arasu Raghav Kaushik Presented by Rashmi Havaldar](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e495503460f94b3c934/html5/thumbnails/10.jpg)
Implementation
Given a program G, we can construct the expanded program G’.
Given an input record r, we can use a traditional parsing technique to parse r.
The main problem with this approach is that the expanded program G’ can be very large.
Instead, construct Gr’, a partially expanded program, at query time.
To construct Gr’, consider each augmented rule <R,P,A> and enumerate all possible assignments of constants to variables in R such that P evaluates to true.
Enforce an additional constraint: if variable X occurs in R, then the constant c assigned to X must be a substring of the record r.
Dictionary (X): P(X,.…)
E.g. for the record Smith Andy, J: Dictionary (N): Nicknames (I,N,F,G)
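The substring constraint can be sketched as a filter over the reference table at query time. The function name `partially_expand`, the two-column nickname table and the tuple layout are assumptions for illustration; the Nicknames table on the slide has more columns than shown here.

```python
# Hypothetical reference table mapping nickname to full name.
NICKNAMES = {"Andy": "Andrew", "Jeff": "Jeffrey", "Bob": "Robert"}


def partially_expand(record, head, table):
    """Build only the expanded rules whose constant occurs in the record.

    A constant that is not a substring of the record can never take part
    in a parse of that record, so its rule is skipped.
    """
    return [
        (head, [nick], True, full)
        for nick, full in table.items()
        if nick in record   # substring check against the input record
    ]


rules = partially_expand("Smith Andy, J", "first", NICKNAMES)
print(rules)
```

Only the row for Andy survives for this record, so the parser works with a handful of rules instead of one rule per table row.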
![Page 11: A Grammar-based Entity Representation Framework for Data Cleaning Authors: Arvind Arasu Raghav Kaushik Presented by Rashmi Havaldar](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e495503460f94b3c934/html5/thumbnails/11.jpg)
Case studies
1. UCD people data
![Page 12: A Grammar-based Entity Representation Framework for Data Cleaning Authors: Arvind Arasu Raghav Kaushik Presented by Rashmi Havaldar](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e495503460f94b3c934/html5/thumbnails/12.jpg)
Quality of record matching and Record matching
![Page 13: A Grammar-based Entity Representation Framework for Data Cleaning Authors: Arvind Arasu Raghav Kaushik Presented by Rashmi Havaldar](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e495503460f94b3c934/html5/thumbnails/13.jpg)
2. Author Affiliation Dataset
![Page 14: A Grammar-based Entity Representation Framework for Data Cleaning Authors: Arvind Arasu Raghav Kaushik Presented by Rashmi Havaldar](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e495503460f94b3c934/html5/thumbnails/14.jpg)
Program:
![Page 15: A Grammar-based Entity Representation Framework for Data Cleaning Authors: Arvind Arasu Raghav Kaushik Presented by Rashmi Havaldar](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e495503460f94b3c934/html5/thumbnails/15.jpg)
![Page 16: A Grammar-based Entity Representation Framework for Data Cleaning Authors: Arvind Arasu Raghav Kaushik Presented by Rashmi Havaldar](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e495503460f94b3c934/html5/thumbnails/16.jpg)
Discussion
Record matching:
Previous work on record matching focused on the design of similarity functions.
This framework indicates that, with the right preprocessing, the need for approximate equality when performing record matching is minimized and often eliminated.
However, string similarity joins are still needed to capture variations such as typos and misspellings.
This framework does not intend to replace this body of work.
![Page 17: A Grammar-based Entity Representation Framework for Data Cleaning Authors: Arvind Arasu Raghav Kaushik Presented by Rashmi Havaldar](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e495503460f94b3c934/html5/thumbnails/17.jpg)
Pay as you go:
The goal of this framework is not to clean the entire dataset, because doing so is difficult.
Instead, the framework takes a “pay as you go” approach: example reference tables that cover only part of the data are used to clean a subset of the data.
Lineage:
Parse trees constitute a natural notion of lineage that can be used to program on top of the module.
E.g., a data cleaning developer using this framework can choose not to use the rule weighting option and instead use if-then-else logic to capture parse tree preferences.
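A sketch of what such if-then-else logic over parse trees might look like. The dictionary representation of a parse tree, the rule names and the preference itself (avoid the nickname rule when possible) are hypothetical.

```python
def choose_parse(parse_trees):
    """Pick a parse by inspecting which rules it used (its lineage)."""
    without_nickname = [
        t for t in parse_trees if "first -> nickname" not in t["rules"]
    ]
    if without_nickname:
        # Prefer a parse that did not need the nickname rule.
        return without_nickname[0]
    elif parse_trees:
        # Otherwise fall back to the first available parse.
        return parse_trees[0]
    else:
        return None


# Two hypothetical parse trees for one input record.
trees = [
    {"output": "Andrew Smith", "rules": ["name -> last first", "first -> nickname"]},
    {"output": "Andy Smith", "rules": ["name -> last first", "first -> exact"]},
]
chosen = choose_parse(trees)
print(chosen["output"])
```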
![Page 18: A Grammar-based Entity Representation Framework for Data Cleaning Authors: Arvind Arasu Raghav Kaushik Presented by Rashmi Havaldar](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e495503460f94b3c934/html5/thumbnails/18.jpg)
Uncertainty:
The framework provides a tool to manage uncertainty in the data.
The framework incorporates “possible worlds”, and thus allows multiple possible variations of the same entity.
The framework also returns multiple parse trees for the same input record, each with an accompanying score.
![Page 19: A Grammar-based Entity Representation Framework for Data Cleaning Authors: Arvind Arasu Raghav Kaushik Presented by Rashmi Havaldar](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e495503460f94b3c934/html5/thumbnails/19.jpg)
Questions???
![Page 20: A Grammar-based Entity Representation Framework for Data Cleaning Authors: Arvind Arasu Raghav Kaushik Presented by Rashmi Havaldar](https://reader036.vdocuments.us/reader036/viewer/2022062516/56649e495503460f94b3c934/html5/thumbnails/20.jpg)
Thank you!