
1

A Survey Based Seminar: Data Cleaning & Uncertain Data Management

Speaker: Shawn Yang
Supervisor: Dr. Reynold Cheng
            Prof. David Cheung

2011.4.29

2

Outline

• Introduction
• Traditional Data Quality and Cleaning
• Uncertainty Management in Traditional Data Cleaning
• Cleaning Uncertain Database
• Conclusion


4

Example

• Report of Bird Sightings

Observer  Bird-ID  Bird-Name    Prob
Mary      Bird-1   Finch        0.8
Mary      Bird-1   Toucan       0.2
Susan     Bird-1   Nightingale  0.7
Susan     Bird-1   Toucan       0.3
Another   Bird-1   Hummingbird  0.65
Another   Bird-1   Toucan       0.35

Cleaning:

Observer  Bird-ID  Bird-Name
Mary      Bird-1   Finch
Susan     Bird-1   Nightingale
Another   Bird-1   Hummingbird
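The two tables are consistent with a cleaning step that keeps, for each observer, only the most probable bird name. The slide does not state the rule, so the sketch below is an assumption; the function name `clean_most_probable` is hypothetical.

```python
# Uncertain sightings copied from the slide: (observer, bird_id, bird_name, prob).
SIGHTINGS = [
    ("Mary", "Bird-1", "Finch", 0.8),
    ("Mary", "Bird-1", "Toucan", 0.2),
    ("Susan", "Bird-1", "Nightingale", 0.7),
    ("Susan", "Bird-1", "Toucan", 0.3),
    ("Another", "Bird-1", "Hummingbird", 0.65),
    ("Another", "Bird-1", "Toucan", 0.35),
]


def clean_most_probable(rows):
    """Keep, for each (observer, bird_id), the alternative with the highest probability.

    Assumed cleaning rule: the lower-probability alternatives are discarded.
    """
    best = {}
    for observer, bird_id, name, prob in rows:
        key = (observer, bird_id)
        if key not in best or prob > best[key][1]:
            best[key] = (name, prob)
    # Dicts preserve insertion order, so observers come out in their original order.
    return [(obs, bid, name) for (obs, bid), (name, _) in best.items()]


cleaned = clean_most_probable(SIGHTINGS)
```

Note what is lost: every observer gave Toucan some probability, and that shared evidence disappears once the table is cleaned.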

5

Philosophy

• Data Cleaning
  – To remove dirty data
• Uncertain Data Management
  – To preserve more information


7

Data Quality Issues

                 Single Source    Multi-Source
Schema Level     Inconsistency
Instance Level

Area Code   City Name
010         Beijing
021         Shanghai
010         Shanghai
021         Beijing

• Constraint
• Dirty Data
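The slide does not spell out the constraint. A natural reading, assumed here, is the functional dependency Area Code -> City Name. Under that assumption, a minimal sketch for spotting the dirty rows (the name `fd_violations` is hypothetical):

```python
from collections import defaultdict

# Rows copied from the slide: (area_code, city_name).
ROWS = [
    ("010", "Beijing"),
    ("021", "Shanghai"),
    ("010", "Shanghai"),
    ("021", "Beijing"),
]


def fd_violations(rows):
    """Return {area_code: sorted city names} for codes that map to more than one city.

    Assumes the constraint is the functional dependency area_code -> city_name.
    """
    cities = defaultdict(set)
    for code, city in rows:
        cities[code].add(city)
    return {code: sorted(names) for code, names in cities.items() if len(names) > 1}


violations = fd_violations(ROWS)
```

Detection only says that a group of rows conflicts. It does not say which row is wrong; deciding that is the repair problem.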

8

Data Quality Issues

                 Single Source              Multi-Source
Schema Level     Inconsistency
Instance Level   Missing Values, Outliers

• Sensor Network
  o Temperature
• Census Data
  o Birth Year

9

Data Quality Issues

                 Single Source              Multi-Source
Schema Level     Inconsistency              Integration
Instance Level   Missing Values, Outliers   Duplication

10

Single Source & Schema Level

• Inconsistent Repairs
  – Example
      010 Shanghai
      021 Beijing
  – Solutions
    • To optimize some objective function
      – Minimize the number of changes
      – Cost function

[Diagram labels: Inconsistent Repairs; Objective Function; CertainFix]
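To make "minimize the number of changes" concrete, here is a brute-force sketch over the four area-code rows from the Data Quality Issues slide. It assumes the constraint is Area Code -> City Name and that only City Name may be updated; both are assumptions for illustration, and the function names are hypothetical.

```python
from itertools import product

ROWS = [
    ("010", "Beijing"),
    ("021", "Shanghai"),
    ("010", "Shanghai"),
    ("021", "Beijing"),
]


def satisfies_fd(rows):
    """True if every area code maps to exactly one city."""
    seen = {}
    for code, city in rows:
        if seen.setdefault(code, city) != city:
            return False
    return True


def minimal_change_repairs(rows):
    """Try every city-only update; return (min_changes, all repairs achieving it).

    Exponential in the number of rows, so only usable on toy inputs.
    """
    domain = sorted({city for _, city in rows})
    best_cost, best = None, []
    for cities in product(domain, repeat=len(rows)):
        candidate = [(code, city) for (code, _), city in zip(rows, cities)]
        if not satisfies_fd(candidate):
            continue
        cost = sum(old != new for (_, old), new in zip(rows, cities))
        if best_cost is None or cost < best_cost:
            best_cost, best = cost, [candidate]
        elif cost == best_cost:
            best.append(candidate)
    return best_cost, best


cost, repairs = minimal_change_repairs(ROWS)
```

On this input every consistent assignment needs two updates, so four different repairs tie at the minimum. The objective function alone cannot tell which one is true.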

11

Single Source & Schema Level

• Inconsistent Repairs
  – Example
      010 Shanghai
      021 Beijing
  – Solutions
    • Certain Fix (VLDB’10)
      – Master Data
      – Certain Region
      – Some attribute values are asserted to be correct

[Diagram labels: Inconsistent Repairs; Objective Function; CertainFix]

Area Code   City Name
010         Beijing
021         Shanghai
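A toy illustration of the idea, not the CertainFix algorithm itself: assume the Area Code / City Name table above is the master data and that Area Code is the attribute asserted to be correct. City Name can then be fixed by lookup. `MASTER` and `fix_with_master` are hypothetical names.

```python
# Assumed master data: area_code -> city_name.
MASTER = {"010": "Beijing", "021": "Shanghai"}

DIRTY = [("010", "Shanghai"), ("021", "Beijing")]


def fix_with_master(rows, master):
    """Overwrite city_name from master data, trusting area_code as asserted correct.

    Rows whose area code is not in the master data are left unchanged,
    because no certain fix is available for them.
    """
    return [(code, master.get(code, city)) for code, city in rows]


fixed = fix_with_master(DIRTY, MASTER)
```

Unlike the objective-function approach, the fix here is unique because it relies on values known to be correct rather than on a cost heuristic.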

12

Single Source & Schema Level

• Cleaning Operations
  – Deletion & Insertion
  – Update attribute values
• Efficiency Issues
  – NP-Complete
  – Heuristic Methods

[Diagram labels: Inconsistent Repairs; Objective Function; CertainFix; Cleaning Operation (Deletion & Insertion, Update); Efficiency Issues]

13

Others

• Single Source Instance Level
  – Infer missing values, detect and correct outliers with machine learning / statistical methods
• Multi-Source Schema Level
  – Schema Mapping
• Multi-Source Instance Level
  – Data Deduplication (Record Linkage)


15

Single Source & Schema Level

• Cardinality-Set-Minimal Repair: a repair I’ of I is cardinality-set-minimal iff there is no repair I’’ of I such that Δ(I, I’’) ⊂ Δ(I, I’), where Δ(I, I’) is taken to be the set of cells whose values differ between I and I’

[Diagram labels: Inconsistent Repairs; Objective Function; CertainFix; Cleaning Operation (Deletion & Insertion, Update); Efficiency Issues; Possible Repair]

16

Single Source & Instance Level

• Missing Values & Outliers
  – Census Database
• ERACER (SIGMOD’10)
  – User-input dependency model
    • Death age
    • Parent age
  – Learn the parameters
  – Infer the missing values
    • Infer the missing birth year based on death year & death age distribution
    • Further infer the child’s birth year
  – Repeat until the distribution converges
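The two inference steps can be sketched with discrete distributions. This is only the arithmetic behind the bullets, not ERACER itself; the distributions in the usage lines are made-up numbers, the function names are hypothetical, and "parent age" is assumed to mean the parent's age when the child is born.

```python
def birth_year_distribution(death_year, death_age_dist):
    """Shift a death-age distribution {age: prob} into a birth-year distribution."""
    dist = {}
    for age, prob in death_age_dist.items():
        year = death_year - age
        dist[year] = dist.get(year, 0.0) + prob
    return dist


def child_birth_year_distribution(parent_birth_dist, parent_age_dist):
    """Combine parent birth year with parent age at the child's birth.

    Assumes the two quantities are independent.
    """
    dist = {}
    for year, p in parent_birth_dist.items():
        for age, q in parent_age_dist.items():
            dist[year + age] = dist.get(year + age, 0.0) + p * q
    return dist


# Hypothetical inputs: a person who died in 1900, with made-up distributions.
parent_birth = birth_year_distribution(1900, {60: 0.5, 70: 0.5})
child_birth = child_birth_year_distribution(parent_birth, {20: 0.5, 30: 0.5})
```

In the iterative scheme on the slide, estimates like `child_birth` would feed back into the model until the distributions stop changing.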

17

Multi-Source

• Schema Level
  – Uncertain Schema Matching
• Instance Level
  – Possible Repairs in Data Deduplication (VLDB’09)


19

Cleaning Uncertain Database

• Applying Integrity Constraints
  – Exact Method
  – Sampling Method
• Quality of Uncertain Query Results
  – PWS-Quality
• Efficiency Issues

20

Integrity Constraints

• Difference from Traditional Database
  – Locate error in the original database
  – Locate error in possible worlds
• Difficulties
  – Exponential number of possible worlds
• Statistical Description
  – Posterior probabilities: Prob[j=7|C]
• Approaches
  – Exact Method
  – Approximate Method

Name  SSN  Prob
John  1    0.2
      7    0.8
Bill  4    0.3
      7    0.7

Possible world violating C:

Name  SSN
John  7
Bill  7

Constraint set (C): SSN is Unique
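For a table this small, the posterior Prob[j=7|C] can be computed by enumerating every possible world and discarding those that violate C. The sketch assumes John's and Bill's SSNs are independent before the constraint is applied; `posterior` is a hypothetical helper, and enumeration is exactly the exponential approach the slide warns about.

```python
from itertools import product

# SSN alternatives read off the slide: name -> {ssn: prob}.
PERSONS = {
    "John": {1: 0.2, 7: 0.8},
    "Bill": {4: 0.3, 7: 0.7},
}


def posterior(persons, name, value):
    """Prob[name has SSN value | all SSNs are distinct], by full enumeration."""
    names = list(persons)
    valid_mass = 0.0
    hit_mass = 0.0
    for world in product(*(persons[n].items() for n in names)):
        ssns = [ssn for ssn, _ in world]
        if len(set(ssns)) != len(ssns):
            continue  # this world violates the uniqueness constraint
        prob = 1.0
        for _, p in world:
            prob *= p
        valid_mass += prob
        if ssns[names.index(name)] == value:
            hit_mass += prob
    if valid_mass == 0.0:
        raise ValueError("no possible world satisfies the constraint")
    return hit_mass / valid_mass


p_john_7 = posterior(PERSONS, "John", 7)
```

Three worlds survive, with masses 0.06, 0.14 and 0.24, so Prob[j=7|C] = 0.24 / 0.44, roughly 0.55, down from the prior 0.8.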

21

Exact Method (Christoph VLDB’08)

• Model the Constraints as Assignments.

• Compress the assignment into a tree structure

• Calculate the Posterior Probabilities

[Figure: assignment tree with branches j = 1 and j = 7, b = 4]

22

Approximate Method (Haiquan Chen ICDE’10 Workshop)

• Aggregate Constraints

• Model the Constraints as Scoring Functions

• Get Posterior Probability by Sampling

Employee  Salary (k)  Confidence
Alice     60          0.4
          40          0.4
          20          0.2
Bob       30          0.7
          40          0.3
Charles   10          0.3
          20          0.3
          30          0.4

Constraints: Total Salary in [50k, 70k]
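A rough sketch of getting posteriors by sampling, using the salary table above. It substitutes plain rejection sampling with a 0/1 scoring function for whatever scoring functions and sampler the cited work uses, and it assumes the employees' salaries are independent. All function names are hypothetical.

```python
import random

# Salary alternatives from the slide: name -> [(salary_k, confidence), ...].
SALARIES = {
    "Alice": [(60, 0.4), (40, 0.4), (20, 0.2)],
    "Bob": [(30, 0.7), (40, 0.3)],
    "Charles": [(10, 0.3), (20, 0.3), (30, 0.4)],
}


def score(world, low=50, high=70):
    """Indicator scoring function: 1 if the total salary lies in [low, high], else 0."""
    return 1 if low <= sum(world.values()) <= high else 0


def sample_world(table, rng):
    """Draw one salary per employee according to the confidences."""
    world = {}
    for name, alternatives in table.items():
        values = [v for v, _ in alternatives]
        weights = [w for _, w in alternatives]
        world[name] = rng.choices(values, weights=weights, k=1)[0]
    return world


def estimate_posterior(table, name, value, n_samples=20000, seed=0):
    """Estimate Prob[name = value | constraint] from the accepted samples."""
    rng = random.Random(seed)
    accepted = hits = 0
    for _ in range(n_samples):
        world = sample_world(table, rng)
        if score(world):
            accepted += 1
            hits += world[name] == value
    return hits / accepted if accepted else float("nan")


p_bob_30 = estimate_posterior(SALARIES, "Bob", 30)
```

Checking by hand, Alice at 60 or 40 already pushes the total above 70, so only Alice = 20 survives the constraint, and the exact posterior for Bob = 30 is 0.42 / 0.51. Only about one sample in ten is accepted, which shows why naive rejection gets expensive for tight constraints.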

23

Quality of Uncertain Query Results (Reynold VLDB’08)

• Different queries have different properties
  – Range Query: Independent
  – Min/Max Query: Otherwise
• Uniform metric for all uncertain queries
  – Quality on Possible World Answers
• Clean uncertain tuples so as to improve query quality as much as possible
  – Oracle Assumption
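One way to read "quality on possible world answers" is as the negated entropy of the distribution over distinct query answers, so a fully certain answer scores 0 and ambiguous answers score below 0. That form is an assumption here, not something the slide states. The sketch applies it to a max query over the salary table from the Approximate Method slide, ignoring the total-salary constraint; the function names are hypothetical.

```python
from itertools import product
from math import log2

SALARIES = {
    "Alice": [(60, 0.4), (40, 0.4), (20, 0.2)],
    "Bob": [(30, 0.7), (40, 0.3)],
    "Charles": [(10, 0.3), (20, 0.3), (30, 0.4)],
}


def answer_distribution(table, query):
    """Group possible worlds by query answer and sum their probabilities."""
    names = list(table)
    dist = {}
    for combo in product(*(table[n] for n in names)):
        world, prob = {}, 1.0
        for name, (value, p) in zip(names, combo):
            world[name] = value
            prob *= p
        answer = query(world)
        dist[answer] = dist.get(answer, 0.0) + prob
    return dist


def pws_quality(dist):
    """Assumed metric: sum of p * log2(p) over distinct answers (0 is best)."""
    return sum(p * log2(p) for p in dist.values() if p > 0)


def max_salary(world):
    return max(world.values())


before = pws_quality(answer_distribution(SALARIES, max_salary))

# Under the oracle assumption, cleaning Alice reveals her true salary, say 60.
cleaned = dict(SALARIES, Alice=[(60, 1.0)])
after = pws_quality(answer_distribution(cleaned, max_salary))
```

Before cleaning, the max is 60, 40 or 30 with probabilities 0.4, 0.46 and 0.14. After cleaning Alice to 60 the answer is certain and the score rises to 0.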

24

Efficiency Issue

• A more “realistic” Oracle
  – Cleaning may fail
  – Even a successful cleaning cannot remove all false values
  – Cleaning may involve a cost
• Objective
  – Remove as much uncertainty as possible
  – With a limited number of cleaning operations
• Discussion
  – Instance Level Cleaning (clean a particular instance)
  – Schema Level Cleaning (clean the entire DB)
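The objective can be illustrated with a simple greedy heuristic: rank each uncertain item by expected entropy removed per unit cost, then pick items until the budget runs out. This is a sketch of the trade-off only, not the method from the cited work. The costs and success probabilities are invented, and it optimistically assumes a successful cleaning removes all of an item's uncertainty.

```python
from math import log2

# Probabilities come from the salary table; cost and success values are hypothetical.
CANDIDATES = [
    {"name": "Alice", "probs": [0.4, 0.4, 0.2], "cost": 2, "success": 0.9},
    {"name": "Bob", "probs": [0.7, 0.3], "cost": 1, "success": 0.9},
    {"name": "Charles", "probs": [0.3, 0.3, 0.4], "cost": 1, "success": 0.4},
]


def entropy(probs):
    """Shannon entropy in bits, used here as the measure of uncertainty."""
    return -sum(p * log2(p) for p in probs if p > 0)


def expected_gain_per_cost(candidate):
    """Expected entropy removed per unit cost, if success removes all uncertainty."""
    return candidate["success"] * entropy(candidate["probs"]) / candidate["cost"]


def plan_cleaning(candidates, budget):
    """Greedy selection by gain per cost, skipping items that no longer fit the budget."""
    ranked = sorted(candidates, key=expected_gain_per_cost, reverse=True)
    chosen, spent = [], 0
    for candidate in ranked:
        if spent + candidate["cost"] <= budget:
            chosen.append(candidate["name"])
            spent += candidate["cost"]
    return chosen


plan = plan_cleaning(CANDIDATES, budget=2)
```

With a budget of 2 the plan is Bob then Charles: Alice has a better ratio than Charles, but her cost no longer fits after Bob is chosen. Greedy choices like this are not guaranteed to be optimal.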

25

Conclusion

• Improve Data Quality
  – Data Cleaning -> Remove Errors
  – Uncertain Data Management -> Maintain Information
• 2 Directions

[Diagram labels: Constraints; Traditional Database Repairs; Consistent Possible World(s); Uncertain Database]

26

Discussion

Thank You :)
