BigDansing presentation slides for SIGMOD 2015

BigDansing: A Big Data Cleansing System

Zuhair Khayyat, Ihab F. Ilyas, Alekh Jindal, Samuel Madden, Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Si Yin

Upload: zuhair-khayyat

Post on 10-Aug-2015


TRANSCRIPT

Page 1: BigDansing presentation slides for SIGMOD 2015

BigDansing: A BigData Cleansing System

Zuhair Khayyat, Ihab F. Ilyas, Alekh Jindal, Samuel Madden,

Mourad Ouzzani, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Si Yin

Page 2: BigDansing presentation slides for SIGMOD 2015

Problem of Dirty Data

● “duplicate and dirty data costs the healthcare industry over $300 billion every year”

– Joe Fusaro (RingLead)

● “inaccurate data has a direct impact ... the average company losing 12% of its revenue”

– Ben Davis (Econsultancy)

Page 9: BigDansing presentation slides for SIGMOD 2015

The Process of Data Cleansing

[Figure: data cleansing cycle; the dataset images are labelled "Stained"]

One approach: Violation Detection using declarative rules

Suggested Repairs

Apply the Repairs

Clean Dataset

Side effect: new Violations
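
The cycle on this slide (detect violations, suggest repairs, apply them, then check again because repairs can introduce new violations) can be sketched as a loop that runs until nothing is left to fix. This is a minimal illustration, not BigDansing's implementation: `detect`, `suggest_repairs`, `cleanse`, and `max_rounds` are hypothetical names, and the toy rule (the FD zipcode -> city with a most-frequent-value repair) is an assumption chosen for brevity.

```python
from collections import Counter, defaultdict


def detect(rows):
    """Return indices of rows that violate the FD zipcode -> city."""
    cities = defaultdict(set)
    for row in rows:
        cities[row["zipcode"]].add(row["city"])
    return [i for i, row in enumerate(rows) if len(cities[row["zipcode"]]) > 1]


def suggest_repairs(rows, violating):
    """Suggest, for each violating row, the most frequent city of its zipcode."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["zipcode"]][row["city"]] += 1
    return {i: counts[rows[i]["zipcode"]].most_common(1)[0][0] for i in violating}


def cleanse(rows, max_rounds=10):
    """Repeat detect -> suggest -> apply until no violations remain."""
    rows = [dict(r) for r in rows]  # work on a copy, leave the input untouched
    for _ in range(max_rounds):
        violating = detect(rows)
        if not violating:
            break
        for i, city in suggest_repairs(rows, violating).items():
            rows[i]["city"] = city
    return rows
```

The `max_rounds` bound is there because, in general, a repair for one rule may create violations of another, so termination should not be taken for granted.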

Page 12: BigDansing presentation slides for SIGMOD 2015

Related work

• Functional dependencies (FDs, CFDs)

• Inclusion dependencies (INDs, CINDs)

• Denial constraints (DCs)

• Matching dependencies (MDs)

• Entity resolution rules (ERs)

Limited quality rules support*

NADEEF**

• Easy-to-use

• Extensible

• Efficient

• Scalability

** SIGMOD 2013

* On approximating optimum repairs for functional dependency violations, ICDT 2009

* Holistic data cleaning: Putting violations into context, ICDE 2013

* The LLUNATIC data-cleaning framework, VLDB 2013

Page 13: BigDansing presentation slides for SIGMOD 2015

Data Cleansing is a Big Data problem

[Figure: three items, each labelled "Dirty data"]

Page 17: BigDansing presentation slides for SIGMOD 2015

BigData Cleansing Requirements

Scalable

Fast

Portable

Page 20: BigDansing presentation slides for SIGMOD 2015

Challenges of BigData Cleansing

Ease-of-use vs. Efficiency

Abstraction vs. Scalability

Quality Rules

Inequalities

Page 21: BigDansing presentation slides for SIGMOD 2015

BigDansing

[Architecture figure. Labels: Dirty Dataset + Declarative rule; UDFs (operators): Scope, Block, Iterate, Detect, GenFix; Rule Engine containing Rule Parser, Logical Layer (logical plans), Physical Layer (physical plans), Execution Layer (execution plans); Data Processing Framework; violations & possible repairs; Repair Algorithm; value updates; Apply Updates. Steps are numbered (1) to (7).]

Page 22: BigDansing presentation slides for SIGMOD 2015

BigDansing: Abstraction

[BigDansing architecture figure, as on Page 21]

Page 23: BigDansing presentation slides for SIGMOD 2015

BigDansing: Abstraction

Declarative Rules: FD, CFD, DC, ...

Logical Operators

Scope

Block

Iterate

Detect

GenFix

Page 24: BigDansing presentation slides for SIGMOD 2015

BigDansing: Abstraction

FD: zipcode -> city

Logical Operators

Scope

Block

Iterate

Detect

GenFix
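
The slides name the five logical operators, but the walk-through images for the FD zipcode -> city did not survive extraction. The sketch below shows one plausible reading of each operator for this rule: Scope keeps only the attributes the rule mentions, Block groups tuples by zipcode, Iterate enumerates candidate pairs inside a block, Detect tests a pair, and GenFix proposes a fix. The function signatures, the tuple layout, and the `run` helper are assumptions for illustration, not BigDansing's actual API.

```python
from collections import defaultdict
from itertools import combinations


def scope(rows):
    """Scope: keep only the attributes the rule mentions, plus a tuple id."""
    return [(i, r["zipcode"], r["city"]) for i, r in enumerate(rows)]


def block(units):
    """Block: group units that share the same zipcode."""
    blocks = defaultdict(list)
    for unit in units:
        blocks[unit[1]].append(unit)
    return blocks


def iterate(blocks):
    """Iterate: enumerate candidate pairs inside each block."""
    for members in blocks.values():
        yield from combinations(members, 2)


def detect(pair):
    """Detect: same zipcode (guaranteed by blocking) but different city."""
    (_, _, city1), (_, _, city2) = pair
    return city1 != city2


def gen_fix(pair):
    """GenFix: propose making the two city cells equal."""
    (i, _, _), (j, _, _) = pair
    return f"t{i}.city = t{j}.city"


def run(rows):
    """Chain the operators and return (id1, id2, suggested fix) triples."""
    violations = [p for p in iterate(block(scope(rows))) if detect(p)]
    return [(p[0][0], p[1][0], gen_fix(p)) for p in violations]
```

Blocking on zipcode is what keeps the pair enumeration small: tuples with different zipcodes can never violate this FD, so they are never compared.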

Page 29: BigDansing presentation slides for SIGMOD 2015

BigDansing: Abstraction

Logical Operators

Scope

Block

Iterate

Detect

GenFix

Easy to use and enables scalability!

Page 31: BigDansing presentation slides for SIGMOD 2015

BigDansing: Optimizations

[BigDansing architecture figure, as on Page 21]

Shared Scans

Fast Inequality Joins

Shared Execution

Page 36: BigDansing presentation slides for SIGMOD 2015

Optimizations: Fast Inequality Joins

DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ∧ t1.Rate < t2.Rate)

Partitioning (divide): partition 1, partition 2, …, partition n, each on rate

Sorting (prepare): sort partition 1, partition 2, …, partition n on rate & salary

Pruning (reduce): min-max values of partition 1, partition 2, …, partition n

Joining (execute): partition 2, partition 3, …
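
The four steps on this slide (partition, sort, prune, join) can be sketched on a single machine as follows. This is a simplified illustration under stated assumptions, not the system's algorithm: the chunk-based range partitioning, the exact min-max test, and all function names are hypothetical, and the final join is a plain nested loop over the partition pairs that survive pruning.

```python
from itertools import product


def partition_on_rate(rows, n):
    """Divide + prepare: order (id, salary, rate) tuples on rate, then salary,
    and cut the ordered list into at most n contiguous chunks."""
    ordered = sorted(
        ((i, s, r) for i, (s, r) in enumerate(rows)), key=lambda t: (t[2], t[1])
    )
    size = max(1, -(-len(ordered) // n))  # ceiling division
    return [ordered[k:k + size] for k in range(0, len(ordered), size)]


def min_max(part):
    """Summarise a partition by its salary and rate ranges."""
    salaries = [t[1] for t in part]
    rates = [t[2] for t in part]
    return min(salaries), max(salaries), min(rates), max(rates)


def may_violate(p, q):
    """Reduce: a pair (t1 in p, t2 in q) can only satisfy
    t1.salary > t2.salary and t1.rate < t2.rate if the ranges allow it."""
    _, p_salary_max, p_rate_min, _ = p
    q_salary_min, _, _, q_rate_max = q
    return p_salary_max > q_salary_min and p_rate_min < q_rate_max


def detect(rows, n=4):
    """Execute: join only the partition pairs that survive pruning."""
    parts = partition_on_rate(rows, n)
    stats = [min_max(p) for p in parts]
    found = set()
    for a, b in product(range(len(parts)), repeat=2):
        if not may_violate(stats[a], stats[b]):
            continue
        for (i, s1, r1), (j, s2, r2) in product(parts[a], parts[b]):
            if s1 > s2 and r1 < r2:
                found.add((i, j))
    return found


def detect_bruteforce(rows):
    """Reference: compare every ordered pair of tuples."""
    return {
        (i, j)
        for (i, (s1, r1)), (j, (s2, r2)) in product(enumerate(rows), repeat=2)
        if s1 > s2 and r1 < r2
    }
```

Here `rows` is a list of `(salary, rate)` pairs. The pruning test is conservative: it never discards a partition pair that contains a violation, so `detect` returns the same pairs as the brute-force reference while skipping pairs of partitions whose ranges rule a violation out.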

Page 37: BigDansing presentation slides for SIGMOD 2015

BigDansing: Scalable Repair

[BigDansing architecture figure, as on Page 21]

Page 41: BigDansing presentation slides for SIGMOD 2015

Scalable Repair

Distributed Equivalence Class

FD: A -> B

      A    B
t1    a1   b1
t2    a1   b2
t3    a1   b1
t4    a2   b4
t5    a2   b4
t6    a2   b3

EQ1: t1, t2, t3 (repair: b2 -> b1)

EQ2: t4, t5, t6 (repair: b3 -> b4)

EQ algorithm as a word count problem
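
The "word count" framing can be illustrated with the table from this slide: count how often each (A, B) combination occurs, then, within each equivalence class (one per A value), move every tuple to the most frequent B value. Choosing the most frequent value is an assumption that happens to reproduce the repairs shown (b2 -> b1, b3 -> b4); `repair_fd` and its data layout are hypothetical.

```python
from collections import Counter, defaultdict


def repair_fd(rows):
    """rows: {tuple id: (a, b)}. Returns {tuple id: (old_b, new_b)} for the
    cells that must change so that the FD A -> B holds."""
    # "Map" emits ((a, b), 1); "reduce" sums per key, exactly like word count.
    counts = Counter(rows.values())
    # Collect the counted B values of each equivalence class (one per A value).
    per_class = defaultdict(Counter)
    for (a, b), n in counts.items():
        per_class[a][b] = n
    # Use the most frequent B value of each class as the repair target.
    target = {a: c.most_common(1)[0][0] for a, c in per_class.items()}
    return {tid: (b, target[a]) for tid, (a, b) in rows.items() if b != target[a]}


table = {
    "t1": ("a1", "b1"), "t2": ("a1", "b2"), "t3": ("a1", "b1"),
    "t4": ("a2", "b4"), "t5": ("a2", "b4"), "t6": ("a2", "b3"),
}
print(repair_fd(table))  # {'t2': ('b2', 'b1'), 't6': ('b3', 'b4')}
```

Because the counting step is a keyed aggregation, it can be distributed the same way a word count is, which is presumably the point of the slide.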

Page 45: BigDansing presentation slides for SIGMOD 2015

Data Repair as a Black box

[Figure labels: data errors; data cells; centralized data repair algorithm]

big connected components?
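
The labels on this slide suggest the following reading, which is an assumption rather than something the transcript states: data cells that take part in the same data error are linked, each connected component of the resulting graph can be handed to an unmodified centralized repair algorithm, and very large components are the open question. A minimal union-find sketch of the splitting step, with hypothetical names throughout:

```python
from collections import defaultdict


def connected_components(errors):
    """errors: iterable of collections of cell ids, one collection per error.
    Returns the components as a sorted list of sorted lists of cell ids."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for cells in errors:
        roots = [find(c) for c in cells]
        for r in roots[1:]:
            parent[r] = roots[0]  # merge every cell of the error into one set

    groups = defaultdict(set)
    for cell in list(parent):
        groups[find(cell)].add(cell)
    return sorted(sorted(g) for g in groups.values())


def repair_per_component(errors, repair):
    """Call the black-box `repair` function once per connected component."""
    return [repair(component) for component in connected_components(errors)]
```

Since components share no cells, the calls to `repair` are independent of one another and could run in parallel; a single very large component would still land on one worker, which matches the question raised on the slide.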

Page 46: BigDansing presentation slides for SIGMOD 2015

BigDansing: Portability

[BigDansing architecture figure, as on Page 21]

Page 49: BigDansing presentation slides for SIGMOD 2015

Centralized Experiment

DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ∧ t1.Rate < t2.Rate)

Several orders of magnitude faster!

Page 51: BigDansing presentation slides for SIGMOD 2015

Parallel Experiment

DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ∧ t1.Rate < t2.Rate)

Only BigDansing finished!

Page 56: BigDansing presentation slides for SIGMOD 2015

Summary

[BigDansing architecture figure, as on Page 21]

Ease-of-Use

Scalability

Portability

Page 57: BigDansing presentation slides for SIGMOD 2015


Experiments – Parallel FD

● TPC-H dataset

● FD: custkey → custAddress

Page 58: BigDansing presentation slides for SIGMOD 2015


Experiments – Scalability

● TPC-H dataset

● FD: custkey → custAddress

● Dataset: 500M rows

Page 59: BigDansing presentation slides for SIGMOD 2015


Repair Quality for FDs and DC

● Φ6: FD: Zipcode → State

● Φ7: FD: PhoneNumber → Zipcode

● Φ8: FD: ProviderID → City, PhoneNumber

● ΦD: DC: ∀ t1, t2 ∈ D, ¬(t1.Salary > t2.Salary ∧ t1.Rate < t2.Rate)

[Figure: repair quality per rule set; x-axis groups: Φ6, Φ6 & Φ7, Φ6 - Φ8, ΦD]