potter’s wheel: an interactive data cleaning system vijayshankar raman joseph m. hellerstein
TRANSCRIPT
![Page 1: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein](https://reader030.vdocuments.us/reader030/viewer/2022032702/56649cd75503460f949a01c4/html5/thumbnails/1.jpg)
Potter’s Wheel: An Interactive Data Cleaning
System
Vijayshankar Raman
Joseph M. Hellerstein
![Page 2: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein](https://reader030.vdocuments.us/reader030/viewer/2022032702/56649cd75503460f949a01c4/html5/thumbnails/2.jpg)
Outline
Background
Potter’s Wheel architecture
Discrepancy detection
Interactive transformation
Conclusions and Future Work
![Page 3: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein](https://reader030.vdocuments.us/reader030/viewer/2022032702/56649cd75503460f949a01c4/html5/thumbnails/3.jpg)
Motivation
Dirty data common E.g., in content integration, e-catalogs
Inter-organizational differences in data representation Home Depot: 60,000 suppliers!
Data often scraped off web pages, etc. E.g. in centralized systems
Data entry “errors”, poor integrity constraints
Cleansing a prereq for analysis, xactionsCleansing done by “content managers” Ease of use critical!
Standards can help a bit (e.g. UDDI) But graphical tools are the name of the game
![Page 4: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein](https://reader030.vdocuments.us/reader030/viewer/2022032702/56649cd75503460f949a01c4/html5/thumbnails/4.jpg)
Current solutions
Detect errors in data “eyeball” data in a spreadsheet data auditing tools domain-specific algorithms
Code up transforms to fix errors “ETL” (extract/transform/load) tools from warehousing world string together domain-specific cleansing rules scripting languages, custom code, etc.
Apply transforms on dataIterate special cases nested discrepancies, e.g. 19997/10/31
Code
Apply
Detect
![Page 5: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein](https://reader030.vdocuments.us/reader030/viewer/2022032702/56649cd75503460f949a01c4/html5/thumbnails/5.jpg)
Problems
Slow, batch tasks
Significant human effort! Specification of transforms
regular expressions, grammars, custom scripts, etc.
Discrepancy detection notion of discrepancy domain-dependent want a mix of custom and standard techniques want to apply on parts of the data values
Rebecca by Daphne du Maurier (Mass Market Paperback) $6.29 ****
Sonnet 19. Craig W.J., ed. 1914. The Oxford Shakespeare
The Big Four Agatha Christie, Mass market paperback 5.39 10%
(from bartleby.com, bn.com)
![Page 6: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein](https://reader030.vdocuments.us/reader030/viewer/2022032702/56649cd75503460f949a01c4/html5/thumbnails/6.jpg)
Outline
Background
Potter’s Wheel architecture
Discrepancy detection
Interactive Transformation
Conclusions and Future Work
![Page 7: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein](https://reader030.vdocuments.us/reader030/viewer/2022032702/56649cd75503460f949a01c4/html5/thumbnails/7.jpg)
Potter’s Wheel: Design Goals
Eliminate wait time during each step Even on big data! Use Online Reordering (VLDB ‘99), sampling Ensure transform results can be seen/undone instantly Compile/optimize sequence of transforms when happy
Eliminate programming, but keep user “in the loop” Semi-automatic, “direct manipulation” GUI Support & leverage “eyeball” detection, verification (human input) Point-and-click transformation “by example”
Unify detection and transformation Detection always runs online in the background Detection always runs on transformed “view” of data
Extensibility Domain experts (vendors) should be able to plug
in detectors/transforms
A mixed (“Systems!”) design challenge: Query Processing, HCI, Learning
Limited appreciationfor this kind of systems work
![Page 8: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein](https://reader030.vdocuments.us/reader030/viewer/2022032702/56649cd75503460f949a01c4/html5/thumbnails/8.jpg)
Potter’s Wheel UIData read so far
![Page 9: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein](https://reader030.vdocuments.us/reader030/viewer/2022032702/56649cd75503460f949a01c4/html5/thumbnails/9.jpg)
Dataflow in Potter’s Wheel
Transformationengine
Sp
rea
dsh
eet
dis
pla
y
Optimized program
Onlinereorderer
Data source
Discrepancy detector
compile
get page scrollbar pos.n
specify/undo transforms
scroll
check for errors
![Page 10: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein](https://reader030.vdocuments.us/reader030/viewer/2022032702/56649cd75503460f949a01c4/html5/thumbnails/10.jpg)
Outline
BackgroundPotter’s Wheel architectureDiscrepancy detection Domains in Potter’s Wheel Structure inference
Interactive TransformationConclusions and Future Work
![Page 11: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein](https://reader030.vdocuments.us/reader030/viewer/2022032702/56649cd75503460f949a01c4/html5/thumbnails/11.jpg)
Discrepancy Detection
Challenge: find discrepancies in a column Structure inference: Given:
A set of (possibly composite) data items, including discrepancies
A set of user-defined “domains” (atomic types) Choose a “structure” for the set
A string of domains (w/repetition) that best fits the data E.g. for “March 17, 2000”:
* alpha* digit*, digit* [Machr]* 17, int
Report rows that do not fit chosen domain
PS: Must be an online algorithm!
![Page 12: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein](https://reader030.vdocuments.us/reader030/viewer/2022032702/56649cd75503460f949a01c4/html5/thumbnails/12.jpg)
Extensible Domains
As in Object-Relational, keep domains opaque.
class Domain {// Required inclusion functionboolean match(char *value);
// Helps in structure extractionint cardinality(int length);
// For probabilistic discrepancy checkingfloat matchWithConfidence(char *value, int dataSetSize);void updateState(char *value);
// Helps in parsingboolean isRedundantAfter(Domain d);
}
e.g. integer, ispell word, money, standard part names
![Page 13: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein](https://reader030.vdocuments.us/reader030/viewer/2022032702/56649cd75503460f949a01c4/html5/thumbnails/13.jpg)
Evaluating Structure Fit
Three desired characteristics Recall
match as many values as possible
Precision flag as many real discrepancies as possible e.g. Month day, day over alpha* digit*, digit*
Conciseness avoid over-fitting examples, make use of the domains e.g. alpha* digit*, digit* over March 17, 2000
![Page 14: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein](https://reader030.vdocuments.us/reader030/viewer/2022032702/56649cd75503460f949a01c4/html5/thumbnails/14.jpg)
Evaluating Structure Fit, cont.
Given structure S = d1d2…dp, string vi, how good is S?Minimum Description Length (MDL) principle
Rissanen, ‘78, etc. DL(vi,S) = length of theory for S + length to encode string vi with S
Computing DL(v,S)1) Length of theory = p log (number of domains known)
2) If vi doesn’t match S, encode it explicitly
3) Else encode vi = wi,1 wi,2 …wi,p where wi,j dj
Encode length of each wi,j
Encode each wi,j among all dj’s of length j use cardinality function
DL = AVGi((1) + (2) + (3)) = AVGi (UnConciseness + UnPrecision + UnRecall)
Choose structure with minimum DL(v,S) Hard search problem; heuristics in paper
![Page 15: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein](https://reader030.vdocuments.us/reader030/viewer/2022032702/56649cd75503460f949a01c4/html5/thumbnails/15.jpg)
Potter’s Wheel UI
![Page 16: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein](https://reader030.vdocuments.us/reader030/viewer/2022032702/56649cd75503460f949a01c4/html5/thumbnails/16.jpg)
Outline
Background
Potter’s Wheel architecture
Discrepancy detection
Interactive Transformation transforms split-by-example
Conclusions and Future Work
![Page 17: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein](https://reader030.vdocuments.us/reader030/viewer/2022032702/56649cd75503460f949a01c4/html5/thumbnails/17.jpg)
Interactive transformation
Sequence of simple visual transforms rather than a single complex program
Each transform must be easy to specify immediately applicable on screen rows
Must be able to undo transforms compensatory transforms not always possible everything REDO-oriented at display-time
no need for UNDO!
![Page 18: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein](https://reader030.vdocuments.us/reader030/viewer/2022032702/56649cd75503460f949a01c4/html5/thumbnails/18.jpg)
Transforms in Potter’s Wheel
Value translation Format(value) – reg. expr. substitution, arithmetic ops,
…
One-to-one row mappings Add/Drop/Copy columns Merge,Split columns Divide column by predicate
One-to-many row mappings Fold columns
adapted from Fold of SchemaSQL[LSS’96] Resolve some higher-order differences
![Page 19: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein](https://reader030.vdocuments.us/reader030/viewer/2022032702/56649cd75503460f949a01c4/html5/thumbnails/19.jpg)
Example (1)
2 Merges
Format'(.*), (.*)' to '\2 \1'
Stewart,Bob
Dole,JerryDavis
Marsh
Anna
Joan
StewartAnna Davis
DoleJoan MarshJerry
Bob Bob
Jerry
Stewart
DoleAnna
Joan
Davis
Marsh
Split at ' '
Anna
Joan
Davis
Marsh
Bob Stewart
Jerry Dole
![Page 20: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein](https://reader030.vdocuments.us/reader030/viewer/2022032702/56649cd75503460f949a01c4/html5/thumbnails/20.jpg)
Example (2)
Divide (like ’.*,.*’)
Anna Davis
Joan Marsh
Stewart,Bob
Dole,Jerry
Stewart,BobAnna
Dole,JerryJoan
Davis
Marsh
![Page 21: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein](https://reader030.vdocuments.us/reader030/viewer/2022032702/56649cd75503460f949a01c4/html5/thumbnails/21.jpg)
Example (3)
Split
Fold
2 Formats(demotes) Ann
BobMath:43Math:96
Bio:78Bio:54
Name Math4396
AnnBob
7854
Bio
Ann
BobBob
AnnName
Math:96Bio:54
Math:43Bio:78
MathBio
MathBio
AnnAnnBobBob
Name43789654
Name
![Page 22: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein](https://reader030.vdocuments.us/reader030/viewer/2022032702/56649cd75503460f949a01c4/html5/thumbnails/22.jpg)
Power all one-to-{one,many} row mappings interactive many-to-{one,many} mappings hard to do interactively
must find/display companion rows for each row to transform higher-order transforms
Specification click on appropriate columns and choose transform but, Split is hard
important transform in screen-scraping/wrapping need to enter regular expressions not always unambiguous e.g.
want to leverage domains Taylor, Jane, $52,072
Tony Smith, 1,00,533
Transforms summary
![Page 23: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein](https://reader030.vdocuments.us/reader030/viewer/2022032702/56649cd75503460f949a01c4/html5/thumbnails/23.jpg)
Split by Example
User marks split positions on examplesSystem infers structure, then parses rest
Parsing must identify matching substrings for structures multiple alternate parses could work
search heuristics explored in paper DecreasingSpecificity seems good
Taylor, Jane|, $52,072
Tony Smith|, 1,00,533
infer structures < * >, <‘,’ Money>
![Page 24: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein](https://reader030.vdocuments.us/reader030/viewer/2022032702/56649cd75503460f949a01c4/html5/thumbnails/24.jpg)
Related Work
Transformation languages -- e.g. SchemaSQL, YATLData cleaning tools commercial -- ETL and auditing tools research -- e.g. AJAX, Lee/Lu/Ling/Ko ’99
Custom auditing algorithms de-duplication (e.g. Hernandez/Stolfo ’97) outlier detection (e.g. Ramaswamy/Rastogi/Shim ’00) dependency inference (e.g. Kivinen/Manilla ’95)
Structure extraction techniques e.g. XTRACT, DataMold, Brazma ‘94
Transformation tools text-processing tools – e.g. perl/awk/sed, LAPIS screen-scraping -- e.g. NoDoSE, XWRAP, OnDisplay, Cohera Connect,
Telegraph Screen Scraper (TeSS)
Middleware, schema mapping
![Page 25: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein](https://reader030.vdocuments.us/reader030/viewer/2022032702/56649cd75503460f949a01c4/html5/thumbnails/25.jpg)
Conclusions
Interactive data cleaning Couple transformation and discrepancy detection Perform both interactively
short, immediately applied steps specify visually, undo if needed contrast with declarative language
Parse values before discrepancy detection user-defined domains helpful
Software online (http://control.cs.berkeley.edu/abc)
![Page 26: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein](https://reader030.vdocuments.us/reader030/viewer/2022032702/56649cd75503460f949a01c4/html5/thumbnails/26.jpg)
Looking Ahead
Generalizing transform by exampleTransforming nested data (XML, HTML)More complex domain-expressionsExtend to generalized query processor client in Telegraph specify initial query refine by specifying transforms as results stream in dynamically choose transforms to be pushed into
server See Shankar’s upcoming thesis, Telegraph papers
![Page 27: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein](https://reader030.vdocuments.us/reader030/viewer/2022032702/56649cd75503460f949a01c4/html5/thumbnails/27.jpg)
Backup Slides
![Page 28: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein](https://reader030.vdocuments.us/reader030/viewer/2022032702/56649cd75503460f949a01c4/html5/thumbnails/28.jpg)
Optimization of Transform Sequences
In Potter’s Wheel system generates program at end hence opportunities for optimization
remove redundant operationsavoid expensive memory copies/allocations/deallocationsby careful pipeliningmaterialize intermediate strings only when necessaryup to 110% speedup for C programs C programs 10x faster than Perl scripts
![Page 29: Potter’s Wheel: An Interactive Data Cleaning System Vijayshankar Raman Joseph M. Hellerstein](https://reader030.vdocuments.us/reader030/viewer/2022032702/56649cd75503460f949a01c4/html5/thumbnails/29.jpg)
Example
vs