BigDansing presentation slides for SIGMOD 2015
TRANSCRIPT
BigDansing: A BigData Cleansing System
Zuhair Khayyat, Ihab F. Ilyas, Alekh Jindal, Samuel Madden,
Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Si Yin
Problem of Dirty Data
● “duplicate and dirty data costs the healthcare industry over $300 billion every year”
– Joe Fusaro (RingLead)
● “inaccurate data has a direct impact ... the average company losing 12% of its revenue”
– Ben Davis (Econsultancy)
The Process of Data Cleansing
One approach: Violation Detection using declarative rules
● Suggested Repairs
● Apply the Repairs
● Clean Dataset
● Side effect: new Violations
Related work
• Functional dependencies (FDs, CFDs)
• Inclusion dependencies (INDs, CINDs)
• Denial constraints (DCs)
• Matching dependencies (MDs)
• Entity resolution rules (ERs)
Limited quality rules support*
NADEEF**
• Easy-to-use
• Extensible
• Efficient
• Scalability
* On approximating optimum repairs for functional dependency violations, ICDT 2009
* Holistic data cleaning: Putting violations into context, ICDE 2013
* The LLUNATIC data-cleaning framework, VLDB 2013
** SIGMOD 2013
Data Cleansing is a Big Data problem
Dirty data
BigData Cleansing Requirements
● Scalable
● Fast
● Portable
Challenges of BigData Cleansing
● Abstraction and Scalability
● Ease-of-use vs. Efficiency
● Quality Rules vs. Inequalities
BigDansing
[Architecture figure, steps numbered (1) to (7)]
● Inputs: Dirty Dataset; Declarative rule + UDFs (operators: Scope, Block, Iterate, Detect, GenFix)
● Rule Engine: Rule Parser; Logical Layer (logical plans); Physical Layer (physical plans); Execution Layer (execution plans)
● Data Processing Framework
● Repair Algorithm: violations & possible repairs in, value updates out
● Apply Updates
BigDansing: Abstraction
Declarative Rules: FD, CFD, DC, ....
Logical Operators
● Scope
● Block
● Iterate
● Detect
● GenFix
FD: zipcode -> city
Easy to use and enables scalability!
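The five logical operators can be illustrated for the FD zipcode -> city with a small single-machine sketch. The slides do not show operator signatures, so the function names, record layout, and the shape of the suggested fix below are assumptions made for illustration, not the BigDansing API.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records; field names and values are illustrative only.
rows = [
    {"id": 1, "name": "Ann", "zipcode": "10001", "city": "New York"},
    {"id": 2, "name": "Bob", "zipcode": "10001", "city": "Newark"},
    {"id": 3, "name": "Cy", "zipcode": "60601", "city": "Chicago"},
]

def scope(row):
    """Scope: keep only the attributes the rule needs."""
    return {"id": row["id"], "zipcode": row["zipcode"], "city": row["city"]}

def block(row):
    """Block: only rows sharing a zipcode can violate the FD."""
    return row["zipcode"]

def iterate(group):
    """Iterate: enumerate candidate pairs inside one block."""
    return combinations(group, 2)

def detect(pair):
    """Detect: same zipcode (guaranteed by blocking) but different city."""
    t1, t2 = pair
    return [(t1, t2)] if t1["city"] != t2["city"] else []

def gen_fix(violation):
    """GenFix: suggest a possible repair, here 'make both city cells equal'."""
    t1, t2 = violation
    return (("city", t1["id"]), "=", ("city", t2["id"]))

def run(rows):
    """Chain the operators: Scope, Block, Iterate, Detect, GenFix."""
    blocks = defaultdict(list)
    for r in map(scope, rows):
        blocks[block(r)].append(r)
    fixes = []
    for group in blocks.values():
        for pair in iterate(group):
            for violation in detect(pair):
                fixes.append(gen_fix(violation))
    return fixes

fixes = run(rows)
```

Because each block is processed independently, the same operators could in principle be run per block in parallel, which is one way to read the "enables scalability" claim.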
BigDansing: Optimizations
● Shared Scans
● Fast Inequality Joins
● Shared Execution
Optimizations: Fast Inequality Joins
DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ∧ t1.Rate < t2.Rate)
● Partitioning (divide): partition 1, partition 2, …, partition n on rate
● Sorting (prepare): sort each partition on rate & salary
● Pruning (reduce): min-max values of each partition
● Joining (execute): partition 2, partition 3, …
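The partition, sort, prune, and join steps can be sketched on one machine for this DC. The slides name the steps but not their details, so the pruning test below (skip a pair of partitions whose min-max ranges cannot contain a violating pair) is one reading of "min-max values", and all function names are hypothetical.

```python
from itertools import product

def partition_on_rate(rows, n):
    """Divide + prepare: sort on (rate, salary), then cut into n ranges.

    Each chunk is a contiguous rate range and is already sorted.
    """
    ordered = sorted(rows, key=lambda r: (r["rate"], r["salary"]))
    size = max(1, -(-len(ordered) // n))  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def min_max(part):
    """Collect the min and max of salary and rate for one partition."""
    salaries = [r["salary"] for r in part]
    rates = [r["rate"] for r in part]
    return min(salaries), max(salaries), min(rates), max(rates)

def may_violate(p, q):
    """Reduce: can some t1 in p and t2 in q have
    t1.salary > t2.salary and t1.rate < t2.rate?"""
    _, p_max_salary, p_min_rate, _ = p
    q_min_salary, _, _, q_max_rate = q
    return p_max_salary > q_min_salary and p_min_rate < q_max_rate

def find_violations(rows, n=2):
    """Execute: join only the partition pairs that survive pruning."""
    parts = partition_on_rate(rows, n)
    stats = [min_max(p) for p in parts]
    violations = []
    for i, j in product(range(len(parts)), repeat=2):
        if not may_violate(stats[i], stats[j]):
            continue  # pruned: no tuple pair here can violate the DC
        for t1, t2 in product(parts[i], parts[j]):
            if t1["salary"] > t2["salary"] and t1["rate"] < t2["rate"]:
                violations.append((t1["id"], t2["id"]))
    return sorted(violations)

# Hypothetical sample data.
rows = [
    {"id": 1, "salary": 3000, "rate": 10},
    {"id": 2, "salary": 4000, "rate": 15},
    {"id": 3, "salary": 5000, "rate": 12},
    {"id": 4, "salary": 6000, "rate": 20},
]
violations = find_violations(rows, n=2)
```

With this data the pair (high-rate partition, low-rate partition) is pruned without comparing any tuples, since no tuple in the high-rate partition has a lower rate than one in the low-rate partition.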
BigDansing: Scalable Repair
Distributed Equivalence Class
FD: A → B
     A   B
t1   a1  b1
t2   a1  b2
t3   a1  b1
t4   a2  b4
t5   a2  b4
t6   a2  b3
EQ1: t1, t2, t3 (b2 → b1)
EQ2: t4, t5, t6 (b3 → b4)
EQ algorithm as a word count problem
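The word count analogy can be made concrete with the six tuples from the table: count each (A, B) pair, then repair every equivalence class to its most frequent B value. This is a single-machine sketch under the assumption that "most frequent value wins" is the repair policy. The slides show the outcome (b2 → b1, b3 → b4) but not the policy, and `repair_fd` is a hypothetical name.

```python
from collections import Counter, defaultdict

# The table from the slide: tuple id -> (A, B).
table = {
    "t1": ("a1", "b1"),
    "t2": ("a1", "b2"),
    "t3": ("a1", "b1"),
    "t4": ("a2", "b4"),
    "t5": ("a2", "b4"),
    "t6": ("a2", "b3"),
}

def repair_fd(table):
    """Suggest B updates so that FD A -> B holds."""
    # Word count step: how often does each (A, B) pair occur?
    counts = Counter(table.values())
    # Group the counts by A: one equivalence class per A value.
    by_a = defaultdict(dict)
    for (a, b), n in counts.items():
        by_a[a][b] = n
    # Assumed policy: the most frequent B in a class is the target value.
    # Ties go to the B value seen first.
    target = {a: max(bs, key=bs.get) for a, bs in by_a.items()}
    # Emit an update only for tuples whose B differs from the target.
    return {tid: target[a] for tid, (a, b) in table.items() if b != target[a]}

updates = repair_fd(table)
```

Counting pairs and grouping by key are the same two steps as a word count job, which is why the approach maps naturally onto a distributed data processing framework.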
Data Repair as a Black box
● data cells and data errors
● centralized data repair algorithm
● big connected components?
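One way to read this slide is that data errors link data cells into a graph, and an unmodified centralized repair algorithm is called once per connected component. The sketch below follows that reading. It is an assumption rather than something the transcript states, and `connected_components`, `repair_all`, and the `repair` callback are hypothetical names.

```python
from collections import defaultdict

def connected_components(errors):
    """Group data errors that share at least one data cell (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Each error is a tuple of cells; merge all its cells into one set.
    for cells in errors:
        root = find(cells[0])
        for cell in cells[1:]:
            parent[find(cell)] = root

    groups = defaultdict(list)
    for cells in errors:
        groups[find(cells[0])].append(cells)
    return list(groups.values())

def repair_all(errors, repair):
    """Call the black-box repair once per independent component."""
    return [repair(component) for component in connected_components(errors)]

# Hypothetical errors: each one names the cells (tuple id, attribute) involved.
errors = [
    (("t1", "B"), ("t2", "B")),
    (("t2", "B"), ("t3", "B")),
    (("t4", "B"), ("t6", "B")),
]
components = connected_components(errors)
sizes = repair_all(errors, len)  # stand-in "repair" that just counts errors
```

Components share no cells, so the calls could run in parallel. A very large component would still go to a single call, which this sketch does not address.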
BigDansing: Portability
Centralized Experiment
DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ∧ t1.Rate < t2.Rate)
Several orders of magnitude faster!
Parallel Experiment
DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ∧ t1.Rate < t2.Rate)
Only BigDansing finished!
Summary
● Ease-of-Use
● Scalability
● Portability
Experiments – Parallel FD
● TPCH dataset:
● FD: custkey → custAddress
Experiments – Scalability
● TPCH Dataset:
● FD: custkey → custAddress
● Dataset: 500M rows
Repair Quality for FDs and DC
● Φ6: FD: Zipcode → State
● Φ7: FD: PhoneNumber → Zipcode
● Φ8: FD: ProviderID → City, PhoneNumber
● ΦD: DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ∧ t1.Rate < t2.Rate)
[Figure: repair quality for the rule sets Φ6, Φ6 & Φ7, Φ6 - Φ8, and ΦD]