A Survey Based Seminar: Data Cleaning & Uncertain Data Management Speaker: Shawn Yang Supervisor: Dr. Reynold Cheng Prof. David Cheung 2011.4.29 1


A Survey Based Seminar: Data Cleaning & Uncertain Data Management

Speaker: Shawn Yang
Supervisor: Dr. Reynold Cheng, Prof. David Cheung

2011.4.29


Outline

• Introduction
• Traditional Data Quality and Cleaning
• Uncertainty Management in Traditional Data Cleaning
• Cleaning Uncertain Database
• Conclusion


Example

• Report of Bird Sightings

Observer  Bird-ID  Bird-Name    Prob
Mary      Bird-1   Finch        0.8
Mary      Bird-1   Toucan       0.2
Susan     Bird-1   Nightingale  0.7
Susan     Bird-1   Toucan       0.3
Another   Bird-1   Hummingbird  0.65
Another   Bird-1   Toucan       0.35

Cleaning:

Observer  Bird-ID  Bird-Name
Mary      Bird-1   Finch
Susan     Bird-1   Nightingale
Another   Bird-1   Hummingbird
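The cleaned table is consistent with keeping, for each observer, the alternative with the highest probability. The slide does not state the rule, so the sketch below assumes it; the function name `clean` is hypothetical.

```python
# Uncertain sightings from the slide: (observer, bird_id, bird_name, prob).
sightings = [
    ("Mary", "Bird-1", "Finch", 0.8),
    ("Mary", "Bird-1", "Toucan", 0.2),
    ("Susan", "Bird-1", "Nightingale", 0.7),
    ("Susan", "Bird-1", "Toucan", 0.3),
    ("Another", "Bird-1", "Hummingbird", 0.65),
    ("Another", "Bird-1", "Toucan", 0.35),
]

def clean(sightings):
    """Keep, for each (observer, bird_id), the most probable bird name (assumed rule)."""
    best = {}
    for observer, bird_id, name, prob in sightings:
        key = (observer, bird_id)
        if key not in best or prob > best[key][1]:
            best[key] = (name, prob)
    # Dicts preserve insertion order, so observers come out in input order.
    return [(obs, bid, name) for (obs, bid), (name, _) in best.items()]

cleaned = clean(sightings)
```

Note what is lost: every observer also reported Toucan with some probability, and that shared alternative disappears after cleaning.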


Philosophy

• Data Cleaning
  – To remove dirty data
• Uncertain Data Management
  – To preserve more information


Data Quality Issues

                 Single Source    Multi-Source
Schema Level     Inconsistency
Instance Level

Area Code  City Name
010        Beijing
021        Shanghai
010        Shanghai
021        Beijing

• Constraint
• Dirty Data
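The slide does not spell out the constraint. A natural reading, assumed here, is the functional dependency Area Code → City Name: one area code should map to one city. Under that assumption, a minimal sketch of detecting the inconsistency (the helper name `fd_violations` is hypothetical):

```python
from collections import defaultdict

# Rows from the slide: (area_code, city_name).
rows = [("010", "Beijing"), ("021", "Shanghai"),
        ("010", "Shanghai"), ("021", "Beijing")]

def fd_violations(rows):
    """Return area codes that map to more than one city (assumed FD: code -> city)."""
    cities = defaultdict(set)
    for code, city in rows:
        cities[code].add(city)
    return {code: sorted(names) for code, names in cities.items() if len(names) > 1}

violations = fd_violations(rows)
```

Detection only says that the rows for a code disagree. It cannot say which row is the dirty one; that takes extra information such as a cost model or trusted reference data.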


Data Quality Issues

                 Single Source               Multi-Source
Schema Level     Inconsistency
Instance Level   Missing Values, Outliers

• Sensor Network
  o Temperature
• Census Data
  o Birth Year


Data Quality Issues

                 Single Source               Multi-Source
Schema Level     Inconsistency               Integration
Instance Level   Missing Values, Outliers    Duplication


Single Source & Schema Level

• Inconsistent Repairs
  – Example
      010  Shanghai
      021  Beijing
  – Solutions
    • To optimize some objective function
      – Minimize the number of changes
      – Cost function

Diagram nodes: Inconsistent Repairs; Objective Function; CertainFix


Single Source & Schema Level

• Inconsistent Repairs
  – Example
      010  Shanghai
      021  Beijing
  – Solutions
    • Certain Fix (VLDB’10)
      – Master Data
      – Certain Region
      – Some attribute values are asserted to be correct

Diagram nodes: Inconsistent Repairs; Objective Function; CertainFix

Area Code  City Name
010        Beijing
021        Shanghai
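To make the master-data idea concrete, here is a minimal sketch. It assumes the Area Code / City Name table plays the role of master data and that the area code is the attribute asserted to be correct. It is a plain lookup for illustration, not the CertainFix algorithm from the cited paper; `fix_with_master` is a hypothetical name.

```python
# Master data (assumed role of the Area Code / City Name table on the slide).
master = {"010": "Beijing", "021": "Shanghai"}

# Input tuples from the slide's example: (area_code, city_name).
dirty = [("010", "Shanghai"), ("021", "Beijing")]

def fix_with_master(rows, master):
    """Overwrite the city using the master table, trusting the area code.

    Rows whose area code is absent from the master data are left unchanged,
    because no fix can be guaranteed for them.
    """
    return [(code, master.get(code, city)) for code, city in rows]

fixed = fix_with_master(dirty, master)
```

Had the city name been the asserted attribute instead, the same tuples would be repaired by changing the area code. Which attributes are trusted decides the fix.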


Single Source & Schema Level

• Cleaning Operations
  – Deletion & Insertion
  – Update attribute values
• Efficiency Issues
  – NP-Complete
  – Heuristic Methods

Diagram nodes: Inconsistent Repairs; Objective Function; CertainFix; Cleaning Operation (Deletion & Insertion, Update); Efficiency Issues


Others

• Single Source Instance Level
  – Infer missing values, detect and correct outliers with machine learning / statistical methods
• Multi-Source Schema Level
  – Schema Mapping
• Multi-Source Instance Level
  – Data Deduplication (Record Linkage)


Single Source & Schema Level

• Cardinality-Set-Minimal Repair: a repair I’ of I is cardinality-set-minimal iff there is no repair I’’ of I such that Δ(I, I’’) ⊂ Δ(I, I’)

Diagram nodes: Inconsistent Repairs; Objective Function; CertainFix; Cleaning Operation (Deletion & Insertion, Update); Efficiency Issues; Possible Repair
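The definition can be checked mechanically over a finite set of candidate repairs. The sketch below assumes Δ(I, I’) is the set of cells (row, column) whose values differ between I and I’; the slide does not define Δ. The instances and the city "Tianjin" are hypothetical illustration data.

```python
def changed_cells(original, repair):
    """Delta(I, I'): positions (row, col) whose value differs between the instances."""
    return {(r, c)
            for r, (old, new) in enumerate(zip(original, repair))
            for c, (a, b) in enumerate(zip(old, new)) if a != b}

def is_cardinality_set_minimal(original, repair, all_repairs):
    """True if no candidate repair changes a proper subset of this repair's cells."""
    delta = changed_cells(original, repair)
    # On Python sets, `<` is the proper-subset test.
    return not any(changed_cells(original, other) < delta for other in all_repairs)

# Hypothetical instance violating "one city per area code", and three repairs.
I  = [("010", "Beijing"), ("010", "Shanghai")]
r1 = [("010", "Beijing"), ("010", "Beijing")]    # changes cell (1, 1)
r2 = [("010", "Shanghai"), ("010", "Shanghai")]  # changes cell (0, 1)
r3 = [("010", "Tianjin"), ("010", "Tianjin")]    # changes cells (0, 1) and (1, 1)
repairs = [r1, r2, r3]
minimal = [r for r in repairs if is_cardinality_set_minimal(I, r, repairs)]
```

Here `r1` and `r2` are both minimal, while `r3` is not, since `r1` changes a proper subset of its cells. More than one minimal repair can exist, which is why the set of possible repairs is itself a form of uncertainty.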


Single Source & Instance Level

• Missing Value & Outliers
  – Census Database
• ERACER (SIGMOD’10)
  – User input dependency model
    • Death age
    • Parent age
  – Learn the parameters
  – Infer the missing value
    • Infer the missing birth year based on death year & death age distribution
    • Further infer the child’s birth year
  – Repeat until the distribution converges
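The two inference steps can be illustrated with discrete distributions. This is a single forward pass under invented parameters, not ERACER itself: the real system learns the parameters and iterates to convergence. The distributions `death_age` and `parent_age` and the year 1900 are hypothetical.

```python
from collections import defaultdict

# Hypothetical model parameters (ERACER would learn these from the data).
death_age = {60: 0.2, 70: 0.5, 80: 0.3}   # P(age at death)
parent_age = {25: 0.6, 30: 0.4}           # P(parent's age when the child is born)

def birth_from_death(death_year, death_age):
    """Distribution of birth year = death year - age at death."""
    return {death_year - age: p for age, p in death_age.items()}

def child_birth(parent_birth, parent_age):
    """Distribution of child's birth year = parent's birth year + parent's age."""
    out = defaultdict(float)
    for year, p in parent_birth.items():
        for age, q in parent_age.items():
            out[year + age] += p * q
    return dict(out)

parent = birth_from_death(1900, death_age)  # missing birth year of a person who died in 1900
child = child_birth(parent, parent_age)     # propagated to that person's child
```

The result is a probability distribution over years, not a single filled-in value, which is where traditional cleaning meets uncertain data.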


Multi-Source

• Schema Level
  – Uncertain Schema Matching
• Instance Level
  – Possible Repairs in Data Deduplication (VLDB’09)


Cleaning Uncertain Database

• Applying Integrity Constraints
  – Exact Method
  – Sampling Method
• Quality of Uncertain Query Results
  – PWS-Quality
• Efficiency Issues


Integrity Constraints

• Difference from Traditional Database
  – Locate error in the original database
  – Locate error in possible worlds
• Difficulties
  – Exponential number of possible worlds
• Statistical Description
  – Posterior probabilities Prob[j=7|C]
• Approaches
  – Exact Method
  – Approximate Method

Name  SSN  Prob
John  1    0.2
John  7    0.8
Bill  4    0.3
Bill  7    0.7

Possible world that violates C:

Name  SSN
John  7
Bill  7

Constraint set (C): SSN is Unique
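For this small example, Prob[j=7|C] can be computed by brute force. The sketch reads the table as John having SSN 1 (0.2) or 7 (0.8) and Bill having SSN 4 (0.3) or 7 (0.7). It also assumes the two choices are independent, which the slide does not state.

```python
from itertools import product

# Alternatives per person: SSN -> probability.
john = {1: 0.2, 7: 0.8}
bill = {4: 0.3, 7: 0.7}

def posterior_john_is_7(john, bill):
    """Prob[j=7 | C], where C requires distinct SSNs, by enumerating all worlds."""
    p_c = 0.0          # probability mass of worlds satisfying C
    p_j7_and_c = 0.0   # part of that mass where John's SSN is 7
    for (j, pj), (b, pb) in product(john.items(), bill.items()):
        if j != b:  # the uniqueness constraint holds in this world
            p_c += pj * pb
            if j == 7:
                p_j7_and_c += pj * pb
    return p_j7_and_c / p_c

p = posterior_john_is_7(john, bill)
```

Only the world (John 7, Bill 7) is removed, so P(C) = 1 - 0.8 × 0.7 = 0.44 and the posterior is 0.24 / 0.44 = 6/11. Enumeration is exponential in the number of uncertain tuples, which is the difficulty the slide points at.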


Exact Method (Christoph VLDB’08)

• Model the Constraints as Assignments
• Compress the assignments into a tree structure
• Calculate the Posterior Probabilities

Assignments for C:
  j = 1
  j = 7, b = 4
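The two assignments are mutually exclusive, so P(C) is a sum of products over them and no possible world has to be enumerated. The sketch below shows only that arithmetic, not the compression algorithm of the cited paper. It assumes j and b stand for John's and Bill's SSN, with John at 1 (0.2) or 7 (0.8) and Bill at 4 (0.3) or 7 (0.7), chosen independently.

```python
# Mutually exclusive assignments from the slide: {variable: value}.
branches = [{"j": 1}, {"j": 7, "b": 4}]

# Probability of each alternative (assumed independent across variables).
prob = {("j", 1): 0.2, ("j", 7): 0.8, ("b", 4): 0.3, ("b", 7): 0.7}

def branch_prob(branch):
    """Probability of one assignment: the product over the variables it fixes."""
    p = 1.0
    for var, val in branch.items():
        p *= prob[(var, val)]
    return p

p_c = sum(branch_prob(b) for b in branches)  # P(C) = 0.2 + 0.8 * 0.3
p_j7_given_c = sum(branch_prob(b) for b in branches if b.get("j") == 7) / p_c
```

The branch j = 1 leaves b unconstrained, so it covers two possible worlds at once. That is the saving a compact tree representation offers over listing worlds.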


Approximate Method (Haiquan Chen ICDE’10 Workshop)

• Aggregate Constraints
• Model the Constraints as Scoring Functions
• Get Posterior Probability by Sampling

Employee  Salary (k)  Confidence
Alice     60          0.4
          40          0.4
          20          0.2
Bob       30          0.7
          40          0.3
Charles   10          0.3
          20          0.3
          30          0.4

Constraints: Total Salary in [50k, 70k]
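A simple way to obtain posteriors by sampling is rejection sampling: draw possible worlds from the prior confidences and keep only those that satisfy the aggregate constraint. This is a stand-in for illustration. It treats the constraint as hard and does not use the scoring functions of the cited work; the three employees are assumed independent.

```python
import random

# Salary alternatives (in k) and confidences from the slide.
salaries = {
    "Alice":   {60: 0.4, 40: 0.4, 20: 0.2},
    "Bob":     {30: 0.7, 40: 0.3},
    "Charles": {10: 0.3, 20: 0.3, 30: 0.4},
}

def sample_posterior(salaries, low=50, high=70, n=20000, seed=0):
    """Estimate P(employee = salary | low <= total <= high) by rejection sampling."""
    rng = random.Random(seed)
    counts = {name: {s: 0 for s in alts} for name, alts in salaries.items()}
    accepted = 0
    for _ in range(n):
        # Draw one possible world: one salary per employee.
        world = {name: rng.choices(list(alts), weights=list(alts.values()))[0]
                 for name, alts in salaries.items()}
        if low <= sum(world.values()) <= high:
            accepted += 1
            for name, s in world.items():
                counts[name][s] += 1
    # Assumes at least one sample was accepted (about 10% are, for this data).
    return {name: {s: c / accepted for s, c in alts.items()}
            for name, alts in counts.items()}

posterior = sample_posterior(salaries)
```

With a hard constraint, every accepted world has Alice at 20k, because 60k or 40k plus Bob's minimum of 30k and Charles's minimum of 10k already exceeds 70k. The least confident alternative in the input becomes certain once the constraint is applied.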


Quality of Uncertain Query Results (Reynold VLDB’08)

• Different Queries have Different Properties
  – Range Query: Independent
  – Min/Max Query: Otherwise
• Uniform Metric for all Uncertain Queries
  – Quality on Possible World Answers
• Clean the uncertain tuples so as to improve query quality as much as possible
  – Oracle Assumption
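The slide does not give the formula for the metric. One entropy-based reading, assumed here, scores a query by the negated entropy of the distribution over its distinct possible-world answers: 0 for a single certain answer, and more negative as the answers become more ambiguous. The function name `pws_quality` is hypothetical.

```python
import math

def pws_quality(answer_probs):
    """Negated base-2 entropy of the distribution over distinct query answers.

    answer_probs: probabilities of the distinct answers, summing to 1.
    Returns 0.0 when one answer is certain; lower values mean lower quality.
    """
    return sum(p * math.log2(p) for p in answer_probs if p > 0)

before = pws_quality([0.8, 0.2])  # two possible answers, e.g. Finch vs Toucan
after = pws_quality([1.0])        # after an oracle confirms one of them
gain = after - before             # quality improvement from that cleaning step
```

A score of this kind depends only on the answer distribution, so the same metric applies to range queries and min/max queries alike.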


Efficiency Issue

• A more “realistic” Oracle
  – Cleaning may fail
  – Even a successful cleaning cannot remove all false values
  – Cleaning may involve a cost
• Objective
  – Remove as much uncertainty as possible
  – With a limited number of cleaning operations
• Discussion
  – Instance Level Cleaning (clean a particular instance)
  – Schema Level Cleaning (clean the entire DB)
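The objective can be read as a budgeted selection problem. The sketch below is a greedy heuristic under simplifying assumptions that are not from the slides: uncertainty is measured as entropy, a successful cleaning removes all of an object's uncertainty, and objects are independent. Success probabilities and costs are hypothetical.

```python
import math

def entropy(probs):
    """Base-2 entropy of one object's alternatives."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def pick_cleaning_targets(candidates, budget):
    """Greedy choice by expected entropy reduction per unit cost.

    candidates: name -> (alternative_probs, success_prob, cost).
    Returns the chosen names in the order they were picked.
    """
    ranked = sorted(
        candidates.items(),
        key=lambda kv: kv[1][1] * entropy(kv[1][0]) / kv[1][2],
        reverse=True,
    )
    chosen, spent = [], 0
    for name, (_, _, cost) in ranked:
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen

# Bird-sighting probabilities from the example; success 0.9 and cost 1 are made up.
candidates = {
    "Mary":    ([0.8, 0.2], 0.9, 1),
    "Susan":   ([0.7, 0.3], 0.9, 1),
    "Another": ([0.65, 0.35], 0.9, 1),
}
chosen = pick_cleaning_targets(candidates, budget=2)
```

With equal costs the heuristic simply picks the most ambiguous objects first. A greedy ratio rule is not guaranteed to be optimal when costs differ.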


Conclusion

• Improve Data Quality
  – Data Cleaning -> Remove Errors
  – Uncertain Data Management -> Maintain Information
• 2 Directions

Diagram nodes: Constraints; Traditional Database; Repairs; Consistent Possible World(s); Uncertain Database


Discussion

Thank You :)