A Survey Based Seminar: Data Cleaning & Uncertain Data Management
Speaker: Shawn Yang
Supervisor: Dr. Reynold Cheng, Prof. David Cheung
2011.4.29
Outline
• Introduction
• Traditional Data Quality and Cleaning
• Uncertainty Management in Traditional Data Cleaning
• Cleaning Uncertain Database
• Conclusion
Example
• Report of Bird Sightings

| Observer | Bird-ID | Bird-Name | Prob |
|---|---|---|---|
| Mary | Bird-1 | Finch | 0.8 |
| Mary | Bird-1 | Toucan | 0.2 |
| Susan | Bird-1 | Nightingale | 0.7 |
| Susan | Bird-1 | Toucan | 0.3 |
| Another | Bird-1 | Hummingbird | 0.65 |
| Another | Bird-1 | Toucan | 0.35 |

After cleaning:

| Observer | Bird-ID | Bird-Name |
|---|---|---|
| Mary | Bird-1 | Finch |
| Susan | Bird-1 | Nightingale |
| Another | Bird-1 | Hummingbird |
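One way to read the cleaning step in this example is that each observer's most probable alternative is kept and the rest are discarded. The sketch below follows that reading. The rule itself, the function name `clean_by_max_prob`, and the tuple layout are assumptions for illustration; the slide does not say how the cleaned table was produced.

```python
# Uncertain sightings from the slide: (observer, bird_id, bird_name, prob).
sightings = [
    ("Mary", "Bird-1", "Finch", 0.8),
    ("Mary", "Bird-1", "Toucan", 0.2),
    ("Susan", "Bird-1", "Nightingale", 0.7),
    ("Susan", "Bird-1", "Toucan", 0.3),
    ("Another", "Bird-1", "Hummingbird", 0.65),
    ("Another", "Bird-1", "Toucan", 0.35),
]


def clean_by_max_prob(rows):
    """Keep the most probable bird name per (observer, bird_id) pair (assumed rule)."""
    best = {}
    for observer, bird_id, name, prob in rows:
        key = (observer, bird_id)
        if key not in best or prob > best[key][1]:
            best[key] = (name, prob)
    # Drop the probability column: the cleaned table is deterministic.
    return [(obs, bid, name) for (obs, bid), (name, _) in best.items()]


cleaned = clean_by_max_prob(sightings)
```

Under this rule the Toucan alternatives disappear from the cleaned table, even though every observer listed Toucan as a possibility.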
Philosophy
• Data Cleaning
  – To remove dirty data
• Uncertain Data Management
  – To preserve more information
Data Quality Issues

| | Single Source | Multi-Source |
|---|---|---|
| Schema Level | Inconsistency | |
| Instance Level | | |

| Area Code | City Name |
|---|---|
| 010 | Beijing |
| 021 | Shanghai |
| 010 | Shanghai |
| 021 | Beijing |

• Constraint
• Dirty Data
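The inconsistency in the area-code table can be detected mechanically. The sketch below assumes that the constraint is the functional dependency Area Code -> City Name; the slide only says "Constraint", so that reading and the helper name `fd_violations` are assumptions.

```python
from collections import defaultdict

# Rows from the slide: (area_code, city_name).
rows = [
    ("010", "Beijing"),
    ("021", "Shanghai"),
    ("010", "Shanghai"),
    ("021", "Beijing"),
]


def fd_violations(rows):
    """Return {lhs: sorted rhs values} for each lhs value mapped to more than one rhs."""
    seen = defaultdict(set)
    for lhs, rhs in rows:
        seen[lhs].add(rhs)
    return {lhs: sorted(vals) for lhs, vals in seen.items() if len(vals) > 1}


violations = fd_violations(rows)
```

Detection only shows that each area code has two conflicting cities. It does not say which rows are the dirty ones, and that is the repair problem the following slides address.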
Data Quality Issues

| | Single Source | Multi-Source |
|---|---|---|
| Schema Level | Inconsistency | |
| Instance Level | Missing Values, Outliers | |

• Sensor Network
  o Temperature
• Census Data
  o Birth Year
Data Quality Issues

| | Single Source | Multi-Source |
|---|---|---|
| Schema Level | Inconsistency | Integration |
| Instance Level | Missing Values, Outliers | Duplication |
Single Source & Schema Level
• Inconsistent Repairs
  – Example
  – Solutions
    • To optimize some objective function
      – Minimize the number of changes
      – Cost function
[Diagram labels: Objective Function, CertainFix, Inconsistent Repairs]

| Area Code | City Name |
|---|---|
| 010 | Shanghai |
| 021 | Beijing |
Single Source & Schema Level
• Inconsistent Repairs
  – Example
  – Solutions
    • Certain Fix (VLDB’10)
      – Master Data
      – Certain Region
      – Some attribute values are asserted to be correct
[Diagram labels: Objective Function, CertainFix, Inconsistent Repairs]

Tuples to repair:

| Area Code | City Name |
|---|---|
| 010 | Shanghai |
| 021 | Beijing |

Master data:

| Area Code | City Name |
|---|---|
| 010 | Beijing |
| 021 | Shanghai |
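The idea of fixing tuples against master data can be sketched in a few lines. This is a simplified illustration, not the VLDB'10 algorithm. It assumes that the second table on the slide is the master data, that the area code is the attribute asserted to be correct, and that the city name is repaired from it. The function name `fix_with_master` is hypothetical.

```python
# Master data as read from the slide: area code -> city name.
master = {"010": "Beijing", "021": "Shanghai"}

# Tuples to repair, from the slide: (area_code, city_name).
dirty = [("010", "Shanghai"), ("021", "Beijing")]


def fix_with_master(rows, master):
    """Repair city_name from master data, assuming area_code is asserted correct.

    Rows whose area code is absent from the master data are left unchanged,
    because this simplified setting offers no certain fix for them.
    """
    return [(area_code, master.get(area_code, city)) for area_code, city in rows]


repaired = fix_with_master(dirty, master)
```

Which attribute is asserted correct matters: asserting the city name instead would leave the city untouched and change the area code.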
Single Source & Schema Level
• Cleaning Operations
  – Deletion & Insertion
  – Update attribute values
• Efficiency Issues
  – NP-Complete
  – Heuristic Methods
[Diagram labels: Objective Function, CertainFix, Inconsistent Repairs, Deletion & Insertion, Update, Cleaning Operation, Efficiency Issues]
Others
• Single Source Instance Level
  – Infer missing values, detect and correct outliers with machine learning / statistical methods
• Multi-Source Schema Level
  – Schema Mapping
• Multi-Source Instance Level
  – Data Deduplication (Record Linkage)
Single Source & Schema Level
• Cardinality-Set-Minimal Repair: a repair I’ of I is cardinality-set-minimal iff there is no repair I’’ of I such that Δ(I, I’’) ⊂ Δ(I, I’), where Δ(I, I’) is the set of cells on which I and I’ differ.
[Diagram labels: Objective Function, CertainFix, Inconsistent Repairs, Deletion & Insertion, Update, Cleaning Operation, Efficiency Issues, PossibleRepair]
…
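The minimality definition on this slide can be checked directly once a set of candidate repairs is available. The sketch below reads Δ as the set of changed cells and the relation as proper subset. It assumes every candidate already satisfies the constraints and does not attempt to enumerate repairs. The area code `011` and all names are hypothetical.

```python
def changed_cells(original, repair):
    """Delta(I, I'): the set of (row, column) cells whose values differ."""
    return {
        (r, c)
        for r, row in enumerate(original)
        for c, value in enumerate(row)
        if repair[r][c] != value
    }


def is_cardinality_set_minimal(original, repair, candidate_repairs):
    """True if no candidate changes a proper subset of the cells `repair` changes."""
    delta = changed_cells(original, repair)
    return not any(
        changed_cells(original, other) < delta for other in candidate_repairs
    )


# Instance violating Area Code -> City Name, and three hand-written repairs.
instance = [("010", "Beijing"), ("010", "Shanghai")]
repair_a = [("010", "Beijing"), ("010", "Beijing")]   # changes the city only
repair_b = [("010", "Beijing"), ("011", "Shanghai")]  # changes the area code only
repair_c = [("010", "Beijing"), ("011", "Beijing")]   # changes both cells
candidates = [repair_a, repair_b, repair_c]
```

Here `repair_a` and `repair_b` both qualify, since neither changed-cell set contains the other, while `repair_c` does not, because it changes a strict superset of what `repair_a` changes.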
Single Source & Instance Level
• Missing Value & Outliers
  – Census Database
• ERACER (SIGMOD’10)
  – User-input dependency model
    • Death age
    • Parent age
  – Learn the parameters
  – Infer the missing values
    • Infer the missing birth year based on the death year & death age distribution
    • Further infer the child’s birth year
  – Repeat until the distributions converge
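The two inference steps above amount to combining discrete distributions. The sketch below shows only that arithmetic, not ERACER itself: it does no parameter learning and no iteration to convergence, and it assumes the variables are independent. All numbers and names are hypothetical.

```python
from collections import defaultdict


def combine(dist_a, dist_b, op):
    """Distribution of op(a, b) for independent discrete a ~ dist_a and b ~ dist_b."""
    out = defaultdict(float)
    for a, pa in dist_a.items():
        for b, pb in dist_b.items():
            out[op(a, b)] += pa * pb
    return dict(out)


# Hypothetical inputs (not from the slides), each given as {value: probability}.
death_year = {1900: 1.0}          # recorded death year
death_age = {60: 0.5, 70: 0.5}    # learned death-age distribution
parent_age = {25: 0.5, 30: 0.5}   # learned parent age at the child's birth

# Step 1: birth year = death year - death age.
birth_year = combine(death_year, death_age, lambda year, age: year - age)

# Step 2: child's birth year = parent's birth year + parent's age at that birth.
child_birth_year = combine(birth_year, parent_age, lambda year, age: year + age)
```

With these inputs the parent's birth year is 1830 or 1840 with equal probability, and the child's birth year spreads over four values between 1855 and 1870.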
Multi-Source
• Schema Level
  – Uncertain Schema Matching
• Instance Level
  – Possible Repairs in Data Deduplication (VLDB’09)
Cleaning Uncertain Database
• Applying Integrity Constraints
  – Exact Method
  – Sampling Method
• Quality of Uncertain Query Results
  – PWS-Quality
• Efficiency Issues
Integrity Constraints
• Difference with Traditional Database
  – Locate error in the original database
  – Locate error in possible worlds
• Difficulties
  – Exponential number of possible worlds
• Statistical Description
  – Posterior probabilities Prob[j=7|C]
• Approaches
  – Exact Method
  – Approximate Method

| Name | SSN | Prob |
|---|---|---|
| John | 1 | 0.2 |
| John | 7 | 0.8 |
| Bill | 4 | 0.3 |
| Bill | 7 | 0.7 |

| Name | SSN |
|---|---|
| John | 7 |
| Bill | 7 |

Constraint set (C): SSN is Unique
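The posterior Prob[j=7|C] can be computed by brute force on this small example: enumerate the possible worlds, drop those that violate uniqueness, and renormalize. The sketch assumes the flattened table reads as John with SSN 1 (0.2) or 7 (0.8) and Bill with SSN 4 (0.3) or 7 (0.7), and that the two choices are independent. Enumeration is exponential in general, which is the difficulty the slide points out.

```python
from itertools import product

# Uncertain table as read from the slide (assumed reconstruction): name -> {ssn: prob}.
table = {"John": {1: 0.2, 7: 0.8}, "Bill": {4: 0.3, 7: 0.7}}


def posterior(table, name, ssn):
    """Prob[name has ssn | all SSNs are unique], by enumerating possible worlds."""
    names = list(table)
    consistent = 0.0
    matching = 0.0
    for world in product(*(table[n].items() for n in names)):
        values = [value for value, _ in world]
        if len(set(values)) != len(values):
            continue  # this world violates the uniqueness constraint
        prob = 1.0
        for _, p in world:
            prob *= p
        consistent += prob
        if values[names.index(name)] == ssn:
            matching += prob
    return matching / consistent


p_j7 = posterior(table, "John", 7)
```

Only the world where both have SSN 7 (prior 0.56) is excluded, so the consistent mass is 0.44 and Prob[j=7|C] is 0.24 / 0.44, roughly 0.545, down from the prior 0.8.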
Exact Method (Christoph VLDB’08)
• Model the Constraints as Assignments.
• Compress the assignment into a tree structure
• Calculate the Posterior Probabilities
j = 1
j = 7, b = 4
…
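The two assignments shown, j = 1 and j = 7, b = 4, are mutually exclusive and together cover every world in which the SSNs are unique. Posteriors then follow from sums of products, without enumerating worlds. The sketch below uses a flat list of assignments and does not reproduce the tree compression. The marginals are the assumed reading of the table on the previous slide.

```python
from math import prod

# Marginals as read from the previous slide's table (an assumed reconstruction).
marginals = {"j": {1: 0.2, 7: 0.8}, "b": {4: 0.3, 7: 0.7}}

# Mutually exclusive partial assignments that together describe "SSN is unique".
assignments = [{"j": 1}, {"j": 7, "b": 4}]


def assignment_prob(assignment):
    """Probability of a partial assignment over independent variables."""
    return prod(marginals[var][val] for var, val in assignment.items())


def posterior(var, val):
    """Prob[var = val | C], where C is the disjunction of `assignments`."""
    total = sum(assignment_prob(a) for a in assignments)
    hit = 0.0
    for a in assignments:
        if var in a:
            if a[var] == val:
                hit += assignment_prob(a)
        else:
            # The variable is unconstrained here, so weight by its marginal.
            hit += assignment_prob(a) * marginals[var][val]
    return hit / total


p_j7 = posterior("j", 7)
```

The constraint probability is 0.2 + 0.8 × 0.3 = 0.44, which gives the same posterior for j = 7 as a world-by-world enumeration would.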
Approximate Method (Haiquan Chen ICDE’10 Workshop)
• Aggregate Constraints
• Model the Constraints as Scoring Functions
• Get Posterior Probability by Sampling

| Employee | Salary (k) | Confidence |
|---|---|---|
| Alice | 60 | 0.4 |
| Alice | 40 | 0.4 |
| Alice | 20 | 0.2 |
| Bob | 30 | 0.7 |
| Bob | 40 | 0.3 |
| Charles | 10 | 0.3 |
| Charles | 20 | 0.3 |
| Charles | 30 | 0.4 |

Constraints: Total Salary in [50k, 70k]
…
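Posteriors under the aggregate constraint can be estimated by sampling. The sketch below uses plain rejection sampling: draw worlds from the prior and keep those whose total salary falls in [50k, 70k]. This is a simple stand-in, not the scoring-function method of the cited paper. It assumes the flattened table reads as three independent employees with the alternatives listed, and the function name, sample size and seed are illustrative.

```python
import random
from collections import Counter

# Salary alternatives (in k) as read from the slide: employee -> {salary: confidence}.
salaries = {
    "Alice": {60: 0.4, 40: 0.4, 20: 0.2},
    "Bob": {30: 0.7, 40: 0.3},
    "Charles": {10: 0.3, 20: 0.3, 30: 0.4},
}


def sample_posterior(salaries, low, high, n_samples=20000, seed=0):
    """Estimate Prob[employee = salary | low <= total <= high] by rejection sampling."""
    rng = random.Random(seed)
    counts = {name: Counter() for name in salaries}
    accepted = 0
    for _ in range(n_samples):
        world = {
            name: rng.choices(list(dist), weights=list(dist.values()))[0]
            for name, dist in salaries.items()
        }
        if low <= sum(world.values()) <= high:
            accepted += 1
            for name, value in world.items():
                counts[name][value] += 1
    if accepted == 0:
        raise ValueError("no sampled world satisfied the constraint")
    return {
        name: {value: c / accepted for value, c in counter.items()}
        for name, counter in counts.items()
    }


estimate = sample_posterior(salaries, 50, 70)
```

Under this reading only worlds with Alice at 20k can satisfy the constraint, since 40 + 30 + 10 already exceeds 70. The sampler therefore puts all of Alice's posterior mass on 20, and Bob's posterior for 30 comes out near 0.084 / 0.102, about 0.82.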
Quality of Uncertain Query Results (Reynold VLDB’08)
• Different queries have different properties
  – Range Query: Independent
  – Min/Max Query: Otherwise
• Uniform metric for all uncertain queries
  – Quality on Possible World Answers
• Clean the uncertain tuples so as to improve query quality as much as possible
  – Oracle Assumption
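A quality metric defined on possible-world answers can be illustrated with a small max query. The slide gives no formula, so the sketch assumes the score is the negated entropy of the distribution over distinct query answers: 0 when the answer is certain and more negative as it becomes more ambiguous. The sensor data and all names are hypothetical.

```python
from collections import defaultdict
from itertools import product
from math import log2


def pws_quality(db, query):
    """Sum of p * log2(p) over distinct query answers across possible worlds."""
    names = list(db)
    answer_prob = defaultdict(float)
    for world in product(*(db[n].items() for n in names)):
        prob = 1.0
        for _, p in world:
            prob *= p
        # Evaluate the query on this world and pool worlds with equal answers.
        answer = query({n: value for n, (value, _) in zip(names, world)})
        answer_prob[answer] += prob
    return sum(p * log2(p) for p in answer_prob.values() if p > 0)


# Hypothetical sensor readings (not from the slides): sensor -> {value: prob}.
db = {"s1": {20: 0.5, 25: 0.5}, "s2": {22: 1.0}}
score = pws_quality(db, lambda world: max(world.values()))
```

Here the maximum is 22 or 25 with equal probability, so the score is -1. Cleaning `s1` would make the answer certain and raise the score to 0.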
Efficiency Issue
• A more “realistic” Oracle
  – Cleaning may fail
  – Even a successful cleaning cannot remove all false values
  – Cleaning may involve a cost
• Objective
  – Remove as much uncertainty as possible
  – With a limited number of cleaning operations
• Discussion
  – Instance Level Cleaning (Clean particular instance)
  – Schema Level Cleaning (Clean the entire DB)
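The objective above can be approximated with a greedy heuristic: rank entities by expected uncertainty removed per unit cost and clean until the budget runs out. The sketch measures uncertainty by entropy and, as a simplification, assumes that a successful cleaning resolves an entity completely, which the slide warns is optimistic. All names and numbers are hypothetical, and greedy selection is not guaranteed to be optimal.

```python
from math import log2


def entropy(dist):
    """Shannon entropy in bits of a discrete distribution given as {value: prob}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)


def pick_cleaning_targets(db, cost, success_prob, budget):
    """Greedily choose entities by expected entropy removed per unit cost."""

    def gain_per_cost(name):
        return success_prob[name] * entropy(db[name]) / cost[name]

    chosen, spent = [], 0
    for name in sorted(db, key=gain_per_cost, reverse=True):
        if gain_per_cost(name) > 0 and spent + cost[name] <= budget:
            chosen.append(name)
            spent += cost[name]
    return chosen


# Hypothetical uncertain entities, cleaning costs and success probabilities.
db = {"a": {1: 0.5, 2: 0.5}, "b": {1: 0.9, 2: 0.1}, "c": {1: 1.0}}
cost = {"a": 2, "b": 1, "c": 1}
success_prob = {"a": 0.8, "b": 0.5, "c": 1.0}

targets = pick_cleaning_targets(db, cost, success_prob, budget=2)
```

Entity `c` is already certain, so it is never selected, and with a budget of 2 only `a` fits.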
Conclusion
• Improve Data Quality
  – Data Cleaning -> Remove Errors
  – Uncertain Data Management -> Maintain Information
• 2 Directions
[Diagram labels: Constraints, Traditional Database, Repairs, Consistent Possible World(s), Uncertain Database]
Discussion
Thank You :)