a survey based seminar: data cleaning & uncertain data management speaker: shawn yang...

1

A Survey-Based Seminar: Data Cleaning & Uncertain Data Management

Speaker: Shawn Yang
Supervisor: Dr. Reynold Cheng
Prof. David Cheung

2011.4.29

2

Outline

• Introduction
• Traditional Data Quality and Cleaning
• Uncertainty Management in Traditional Data Cleaning
• Cleaning Uncertain Database
• Conclusion

3

Outline



4

Example
• Report of Bird Sightings

Observer  Bird-ID  Bird-Name    Prob
Mary      Bird-1   Finch        0.8
Mary      Bird-1   Toucan       0.2
Susan     Bird-1   Nightingale  0.7
Susan     Bird-1   Toucan       0.3
Another   Bird-1   Hummingbird  0.65
Another   Bird-1   Toucan       0.35

After cleaning:

Observer  Bird-ID  Bird-Name
Mary      Bird-1   Finch
Susan     Bird-1   Nightingale
Another   Bird-1   Hummingbird
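The cleaning step in this example can be sketched in a few lines. This is a minimal sketch, assuming that "cleaning" here means keeping the most probable bird name per observer, which is what the two tables suggest; the function name `clean_most_probable` is hypothetical.

```python
# Uncertain sightings from the slide: (observer, bird_id, bird_name, probability).
sightings = [
    ("Mary", "Bird-1", "Finch", 0.8),
    ("Mary", "Bird-1", "Toucan", 0.2),
    ("Susan", "Bird-1", "Nightingale", 0.7),
    ("Susan", "Bird-1", "Toucan", 0.3),
    ("Another", "Bird-1", "Hummingbird", 0.65),
    ("Another", "Bird-1", "Toucan", 0.35),
]

def clean_most_probable(rows):
    """Keep, for each (observer, bird_id), only the most probable bird name.

    Assumption: this is the cleaning rule behind the slide's second table.
    """
    best = {}
    for observer, bird_id, name, prob in rows:
        key = (observer, bird_id)
        if key not in best or prob > best[key][1]:
            best[key] = (name, prob)
    return [(obs, bid, name) for (obs, bid), (name, _) in best.items()]

cleaned = clean_most_probable(sightings)

# Names that were possible before cleaning but no longer appear afterwards.
dropped = {name for _, _, name, _ in sightings} - {name for _, _, name in cleaned}
```

Here `dropped` contains only Toucan: it is the one name every observer considered possible, yet it disappears from the cleaned table. That loss is what uncertain data management tries to avoid.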

5

Philosophy

• Data Cleaning
  – To remove dirty data
• Uncertain Data Management
  – To preserve more information

6

Outline



7

Data Quality Issues

                 Single Source    Multi-Source
Schema Level     Inconsistency
Instance Level

Area Code  City Name
010        Beijing
021        Shanghai
010        Shanghai
021        Beijing

• Constraint
• Dirty Data
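Detecting this kind of inconsistency can be sketched as follows. This is a minimal sketch, assuming the constraint is the functional dependency Area Code → City Name (the slide only says "Constraint"); the function name `fd_violations` is hypothetical.

```python
from collections import defaultdict

# Rows from the slide: (area_code, city_name).
rows = [
    ("010", "Beijing"),
    ("021", "Shanghai"),
    ("010", "Shanghai"),
    ("021", "Beijing"),
]

def fd_violations(rows):
    """Return the area codes that map to more than one city name.

    Assumption: the constraint is the functional dependency
    Area Code -> City Name.
    """
    cities_by_code = defaultdict(set)
    for code, city in rows:
        cities_by_code[code].add(city)
    return {
        code: sorted(cities)
        for code, cities in cities_by_code.items()
        if len(cities) > 1
    }

violations = fd_violations(rows)
```

Both area codes are reported as violating. The check alone does not say which row of each conflicting pair is the dirty one.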

8

Data Quality Issues


                 Single Source              Multi-Source
Schema Level     Inconsistency
Instance Level   Missing Values, Outliers

• Sensor Network
  o Temperature
• Census Data
  o Birth Year

9

Data Quality Issues


                 Single Source              Multi-Source
Schema Level     Inconsistency              Integration
Instance Level   Missing Values, Outliers   Duplication

10

Single Source & Schema Level

• Inconsistent Repairs
  – Example
      010 Shanghai
      021 Beijing
  – Solutions
    • To optimize some objective function
      – Minimize the number of changes
      – Cost function

[Diagram labels: Inconsistent Repairs, Objective Function, CertainFix]
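The "minimize the number of changes" objective can be illustrated by brute force on the four-row area-code table. This is a minimal sketch under stated assumptions: the constraint is the functional dependency Area Code → City Name, only city names may be updated, and candidate values come from the cities already present. The helper names are hypothetical, and the enumeration is exponential in the number of rows, so it is only suitable for toy inputs.

```python
from itertools import product

# (area_code, city_name) rows, including the two inconsistent ones.
rows = [
    ("010", "Beijing"),
    ("021", "Shanghai"),
    ("010", "Shanghai"),
    ("021", "Beijing"),
]

def is_consistent(candidate):
    """True if every area code maps to exactly one city name."""
    city_of = {}
    for code, city in candidate:
        if city_of.setdefault(code, city) != city:
            return False
    return True

def minimal_repairs(rows):
    """Enumerate all city reassignments and keep the consistent ones that
    change the fewest cells (the objective function in this sketch)."""
    cities = sorted({city for _, city in rows})
    best_cost, best = None, []
    for assignment in product(cities, repeat=len(rows)):
        candidate = [(code, city) for (code, _), city in zip(rows, assignment)]
        if not is_consistent(candidate):
            continue
        cost = sum(old != new for old, new in zip(rows, candidate))
        if best_cost is None or cost < best_cost:
            best_cost, best = cost, [candidate]
        elif cost == best_cost:
            best.append(candidate)
    return best_cost, best

cost, repairs = minimal_repairs(rows)
```

On this input the minimum cost is two changed cells, and four different repairs reach it. The objective function alone therefore does not pick a unique repair.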

11


• Inconsistent Repairs
  – Example
      010 Shanghai
      021 Beijing
  – Solutions
    • Certain Fix (VLDB'10)
      – Master Data
      – Certain Region
      – Some attribute values are asserted to be correct

[Diagram labels: Objective Function, CertainFix]

Area Code  City Name
010        Beijing
021        Shanghai
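The idea of fixing with master data can be sketched as a lookup. This is only an illustration, not the CertainFix algorithm itself. It assumes that the Area Code / City Name table on the slide plays the role of master data and that the area code is the attribute asserted to be correct. The function name `fix_with_master` is hypothetical.

```python
# Assumed master data: area_code -> city_name.
master = {"010": "Beijing", "021": "Shanghai"}

# The inconsistent tuples from the slide.
dirty = [("010", "Shanghai"), ("021", "Beijing")]

def fix_with_master(rows, master):
    """Overwrite the city name from master data when the (asserted-correct)
    area code is covered by it; leave the row untouched otherwise.

    Returns (fixed_row, is_certain) pairs, where is_certain records whether
    master data determined the value.
    """
    result = []
    for code, city in rows:
        if code in master:
            result.append(((code, master[code]), True))
        else:
            result.append(((code, city), False))
    return result

fixed = fix_with_master(dirty, master)
```

Rows whose area code is not covered by master data are left unchanged and flagged, rather than guessed at.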

12


• Cleaning Operations
  – Deletion & Insertion
  – Update attribute values
• Efficiency Issues
  – NP-Complete
  – Heuristic Methods

[Diagram labels: Objective Function, CertainFix, Deletion & Insertion, Update, Cleaning Operation, Efficiency Issues]

13

Others

• Single Source Instance Level
  – Infer missing values, detect and correct outliers with machine learning / statistical methods
• Multi-Source Schema Level
  – Schema Mapping
• Multi-Source Instance Level
  – Data Deduplication (Record Linkage)

14

Outline



15


• Cardinality-Set-Minimal Repair: a repair I' of I is cardinality-set-minimal iff there is no repair I'' of I such that Δ(I, I'') ⊂ Δ(I, I'), where Δ(I, I') is the set of cells whose values differ between I and I'

[Diagram labels: Objective Function, CertainFix, Deletion & Insertion, Update, Cleaning Operation, Efficiency Issues, Possible Repair, …]
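The cardinality-set-minimal condition can be checked mechanically on small examples. This is a minimal sketch, assuming the relation between the two Δ sets is read as proper-subset and that Δ is the set of changed cells. The definition quantifies over all repairs, while the code only compares against an explicit list of candidates. The value "Tianjin" and the helper names are hypothetical.

```python
def delta(original, repaired):
    """Set of (row, column) cells whose values differ between two instances."""
    return {
        (i, j)
        for i, (row_a, row_b) in enumerate(zip(original, repaired))
        for j, (a, b) in enumerate(zip(row_a, row_b))
        if a != b
    }

def is_cardinality_set_minimal(original, repair, candidate_repairs):
    """True if no candidate repair changes a proper subset of the cells
    changed by `repair`. Python's `<` on sets tests proper subset."""
    changed = delta(original, repair)
    return not any(delta(original, other) < changed for other in candidate_repairs)

original = [("010", "Beijing"), ("021", "Shanghai"),
            ("010", "Shanghai"), ("021", "Beijing")]

# Repair A changes only the city of rows 2 and 3.
repair_a = [("010", "Beijing"), ("021", "Shanghai"),
            ("010", "Beijing"), ("021", "Shanghai")]

# Repair B (hypothetical) also rewrites row 0, so it changes a superset of cells.
repair_b = [("010", "Tianjin"), ("021", "Shanghai"),
            ("010", "Tianjin"), ("021", "Shanghai")]

candidates = [repair_a, repair_b]
a_minimal = is_cardinality_set_minimal(original, repair_a, candidates)
b_minimal = is_cardinality_set_minimal(original, repair_b, candidates)
```

Repair B is rejected because repair A already fixes the instance while touching strictly fewer cells.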

16

Single Source & Instance Level
• Missing Value & Outliers
  – Census Database
• ERACER (SIGMOD'10)
  – User input dependency model
    • Death age
    • Parent age
  – Learn the parameters
  – Infer the missing value
    • Infer the missing birth year based on death year & death age distribution
    • Further infer the child's birth year
  – Repeat until the distribution converges
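The two inference steps above can be illustrated with discrete distributions. This is a sketch of the idea only, not ERACER's actual algorithm. It assumes the variables are independent, and all the numbers (death year, death-age and parent-age distributions) are hypothetical. The helper `combine` is also hypothetical.

```python
from collections import defaultdict

def combine(dist_a, dist_b, op):
    """Distribution of op(a, b) for independent discrete a and b."""
    out = defaultdict(float)
    for a, pa in dist_a.items():
        for b, pb in dist_b.items():
            out[op(a, b)] += pa * pb
    return dict(out)

# Hypothetical inputs: a known death year and two learned distributions.
death_year = {1900: 1.0}
death_age = {60: 0.25, 70: 0.5, 80: 0.25}   # age at death
parent_age = {20: 0.5, 30: 0.5}             # parent's age when the child is born

# Step 1: missing birth year = death year - death age.
birth_year = combine(death_year, death_age, lambda year, age: year - age)

# Step 2: child's birth year = parent's birth year + parent age.
child_birth_year = combine(birth_year, parent_age, lambda year, age: year + age)
```

In a full system the inferred distributions would feed back into neighbouring records and the process would repeat until the distributions stop changing. This sketch performs a single pass.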

17

Multi-Source

• Schema Level
  – Uncertain Schema Matching
• Instance Level
  – Possible Repairs in Data Deduplication (VLDB'09)

18

Outline



19

Cleaning Uncertain Database

• Applying Integrity Constraints
  – Exact Method
  – Sampling Method
• Quality of Uncertain Query Results
  – PWS-Quality
• Efficiency Issues

20

Integrity Constraints

• Difference with Traditional Database
  – Locate error in the original database
  – Locate error in possible worlds
• Difficulties
  – Exponential number of possible worlds
• Statistical Description
  – Posterior probabilities Prob[j=7|C]
• Approaches
  – Exact Method
  – Approximate Method

Name  SSN  Prob
John  1    0.2
      7    0.8
Bill  4    0.3
      7    0.7

Name  SSN
John  7
Bill  7

Constraint set (C): SSN is Unique
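The posterior Prob[j=7|C] can be computed directly by enumerating possible worlds. This is a minimal brute-force sketch, assuming John's and Bill's SSN alternatives are independent and read from the table as John: 1 (0.2) or 7 (0.8), Bill: 4 (0.3) or 7 (0.7). It is exponential in the number of tuples, which is exactly the difficulty the slide names. The function name `posterior` is hypothetical.

```python
from itertools import product

# Alternatives per person: SSN value -> prior probability.
ssn = {
    "John": {1: 0.2, 7: 0.8},
    "Bill": {4: 0.3, 7: 0.7},
}

def posterior(ssn, name, value):
    """Prob[name's SSN = value | all SSNs are unique], by enumerating worlds."""
    names = list(ssn)
    mass_constraint = 0.0   # total probability of worlds satisfying C
    mass_event = 0.0        # ... that also have name's SSN equal to value
    for world in product(*(ssn[n].items() for n in names)):
        values = [v for v, _ in world]
        prob = 1.0
        for _, p in world:
            prob *= p
        if len(set(values)) != len(values):
            continue  # this world violates uniqueness, so it is discarded
        mass_constraint += prob
        if values[names.index(name)] == value:
            mass_event += prob
    return mass_event / mass_constraint

p_john_7 = posterior(ssn, "John", 7)
```

The world where both SSNs are 7 has prior probability 0.56 and is removed. The remaining mass is 0.44, of which 0.24 has John's SSN equal to 7, so the posterior is 0.24 / 0.44 = 6/11.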

21

Exact Method (Christoph VLDB’08)

• Model the Constraints as Assignments.

• Compress the assignment into a tree structure

• Calculate the Posterior Probabilities

j = 1

j = 7, b = 4

…

22

Approximate Method (Haiquan Chen ICDE’10 Workshop)

• Aggregate Constraints

• Model the Constraints as Scoring Functions

• Get Posterior Probability by Sampling

Employee  Salary (k)  Confidence
Alice     60          0.4
          40          0.4
          20          0.2
Bob       30          0.7
          40          0.3
Charles   10          0.3
          20          0.3
          30          0.4

Constraints: Total Salary in [50k, 70k]
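Posterior probabilities under the aggregate constraint can be estimated by sampling. The sketch below uses plain rejection sampling, which is a simplification and not the method of the cited paper: the constraint is treated as a 0/1 scoring function, and worlds that score 0 are thrown away. It assumes the employees' salaries are independent. The function names, sample count, and seed are hypothetical choices.

```python
import random

# Salary alternatives (in k) with their confidences, from the slide.
salaries = {
    "Alice": {60: 0.4, 40: 0.4, 20: 0.2},
    "Bob": {30: 0.7, 40: 0.3},
    "Charles": {10: 0.3, 20: 0.3, 30: 0.4},
}

def satisfies(world):
    """0/1 scoring of the aggregate constraint: total salary in [50k, 70k]."""
    return 50 <= sum(world.values()) <= 70

def sample_posterior(salaries, n_samples=20000, seed=0):
    """Estimate Prob[employee = salary | constraint] by rejection sampling."""
    rng = random.Random(seed)
    counts = {name: {v: 0 for v in dist} for name, dist in salaries.items()}
    accepted = 0
    for _ in range(n_samples):
        # Draw one possible world from the prior.
        world = {
            name: rng.choices(list(dist), weights=list(dist.values()))[0]
            for name, dist in salaries.items()
        }
        if not satisfies(world):
            continue  # rejected: violates the constraint
        accepted += 1
        for name, value in world.items():
            counts[name][value] += 1
    if accepted == 0:
        raise ValueError("no sampled world satisfied the constraint")
    return {
        name: {v: c / accepted for v, c in vals.items()}
        for name, vals in counts.items()
    }

estimate = sample_posterior(salaries)
```

With these numbers only worlds where Alice earns 20k can satisfy the constraint, because 40 + 30 + 10 already exceeds 70. Her posterior therefore collapses to a single value, while Bob and Charles stay uncertain. Only about 10% of sampled worlds are accepted, which hints at why plain rejection sampling scales poorly when constraints are tight.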

…

23

Quality of Uncertain Query Results (Reynold VLDB'08)

• Different queries have different properties
  – Range Query: Independent
  – Min/Max Query: Otherwise
• Uniform metric for all uncertain queries
  – Quality on Possible World Answers
• Clean the uncertain tuples so as to improve query quality as much as possible
  – Oracle Assumption
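A quality score over possible-world answers can be sketched as follows. The slide does not give the formula, so this sketch assumes the metric is the negated entropy of the distribution over distinct query answers (0 is the best score, more negative means more ambiguous). The sensor readings, the max query, and the helper names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product

def answer_distribution(db, query):
    """Probability of each distinct query answer over all possible worlds.

    db maps a tuple id to {value: probability}; tuples are assumed independent.
    """
    names = list(db)
    dist = defaultdict(float)
    for combo in product(*(db[n].items() for n in names)):
        world = {n: value for n, (value, _) in zip(names, combo)}
        prob = math.prod(p for _, p in combo)
        dist[query(world)] += prob
    return dict(dist)

def pws_quality(dist):
    """Negated entropy (base 2) of the answer distribution (assumed metric)."""
    return sum(p * math.log2(p) for p in dist.values() if p > 0)

def max_query(world):
    return max(world.values())

# Hypothetical uncertain sensor readings.
db = {"s1": {20: 0.5, 25: 0.5}, "s2": {22: 1.0}}
quality_before = pws_quality(answer_distribution(db, max_query))

# Under the oracle assumption, cleaning s1 reveals its true value (say 25).
db_cleaned = {"s1": {25: 1.0}, "s2": {22: 1.0}}
quality_after = pws_quality(answer_distribution(db_cleaned, max_query))
```

Before cleaning, the max query has two equally likely answers and the score is -1. After cleaning s1 the answer is certain and the score rises to 0.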

24

Efficiency Issue

• A more "realistic" Oracle
  – Cleaning may fail
  – Even a successful cleaning cannot remove all false values
  – Cleaning may involve a cost
• Objective
  – Remove as much uncertainty as possible
  – With a limited number of cleaning operations
• Discussion
  – Instance Level Cleaning (clean a particular instance)
  – Schema Level Cleaning (clean the entire DB)

25

Conclusion

• Improve Data Quality
  – Data Cleaning -> Remove Errors
  – Uncertain Data Management -> Maintain Information
• 2 Directions

[Diagram labels: Traditional Database, Constraints, Repairs, Consistent Possible World(s), Uncertain Database]

26

Discussion

Thank You :)
