1
A Survey Based Seminar: Data Cleaning & Uncertain Data Management
Speaker: Shawn Yang
Supervisor: Dr. Reynold Cheng
            Prof. David Cheung
2011.4.29
2
Outline
• Introduction
• Traditional Data Quality and Cleaning
• Uncertainty Management in Traditional Data Cleaning
• Cleaning Uncertain Database
• Conclusion
3
Outline
• Introduction
• Traditional Data Quality and Cleaning
• Uncertainty Management in Traditional Data Cleaning
• Cleaning Uncertain Database
• Conclusion
4
Example
• Report of Bird Sightings

Uncertain report (with probabilities):
Observer  Bird-ID  Bird-Name    Prob
Mary      Bird-1   Finch        0.8
Mary      Bird-1   Toucan       0.2
Susan     Bird-1   Nightingale  0.7
Susan     Bird-1   Toucan       0.3
Another   Bird-1   Hummingbird  0.65
Another   Bird-1   Toucan       0.35

After cleaning:
Observer  Bird-ID  Bird-Name
Mary      Bird-1   Finch
Susan     Bird-1   Nightingale
Another   Bird-1   Hummingbird
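
The cleaned table on this slide matches what you get by keeping, for each observer, the alternative with the highest probability. The sketch below assumes that rule; the slide itself does not state how the cleaning is done, and the function name `keep_most_probable` is made up for illustration.

```python
# Assumption: "cleaning" keeps the most probable alternative per (observer, bird).
# The slide does not say which rule is used; this one reproduces its output.

def keep_most_probable(sightings):
    """sightings: list of (observer, bird_id, bird_name, prob) tuples."""
    best = {}
    for observer, bird_id, name, prob in sightings:
        key = (observer, bird_id)
        # Remember the alternative with the largest probability seen so far.
        if key not in best or prob > best[key][1]:
            best[key] = (name, prob)
    return [(obs, bid, name) for (obs, bid), (name, _) in best.items()]


sightings = [
    ("Mary", "Bird-1", "Finch", 0.8),
    ("Mary", "Bird-1", "Toucan", 0.2),
    ("Susan", "Bird-1", "Nightingale", 0.7),
    ("Susan", "Bird-1", "Toucan", 0.3),
    ("Another", "Bird-1", "Hummingbird", 0.65),
    ("Another", "Bird-1", "Toucan", 0.35),
]
cleaned = keep_most_probable(sightings)
```

Note what is lost: all three observers gave Toucan as a second guess, and that agreement disappears from the cleaned table.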
5
Philosophy
• Data Cleaning
  – To remove dirty data
• Uncertain Data Management
  – To preserve more information
6
Outline
• Introduction
• Traditional Data Quality and Cleaning
• Uncertainty Management in Traditional Data Cleaning
• Cleaning Uncertain Database
• Conclusion
7
Data Quality Issues
                  Single Source    Multi-Source
Schema Level      Inconsistency
Instance Level

Area Code  City Name
010        Beijing
021        Shanghai
010        Shanghai
021        Beijing

• Constraint
• Dirty Data
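
One way to read the table above is as a violation of a functional dependency, Area Code → City Name. The slide does not spell out the constraint, so that dependency is an assumption, and `fd_violations` is a hypothetical helper name.

```python
from collections import defaultdict

# Assumption: the constraint is the functional dependency area_code -> city.

def fd_violations(rows, lhs, rhs):
    """Return the lhs values that map to more than one rhs value."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: sorted(values) for key, values in seen.items() if len(values) > 1}


rows = [
    {"area_code": "010", "city": "Beijing"},
    {"area_code": "021", "city": "Shanghai"},
    {"area_code": "010", "city": "Shanghai"},
    {"area_code": "021", "city": "Beijing"},
]
violations = fd_violations(rows, "area_code", "city")
# Both area codes map to two different cities, so both are reported.
```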
8
Data Quality Issues
                  Single Source              Multi-Source
Schema Level      Inconsistency
Instance Level    Missing Values, Outliers

• Sensor Network
  o Temperature
• Census Data
  o Birth Year
9
Data Quality Issues
                  Single Source              Multi-Source
Schema Level      Inconsistency              Integration
Instance Level    Missing Values, Outliers   Duplication
10
Single Source & Schema Level
• Inconsistent Repairs
  – Example
  – Solutions
    • To optimize some objective function
      – Minimize the number of changes
      – Cost function

[Diagram: Inconsistent Repairs, with branches Objective Function and CertainFix]

Example tuples:
010  Shanghai
021  Beijing
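
A minimal sketch of the "minimize the number of changes" objective, under two assumptions that are not on the slide: there is a single functional dependency (area code → city), and only the right-hand-side cell may be updated. Under those assumptions, setting each group to its most frequent city changes the fewest cells. The input rows are hypothetical, chosen so that one group has a clear majority.

```python
from collections import Counter, defaultdict

# Assumptions: one FD (lhs -> rhs), and repairs may only update the rhs cell.

def repair_by_majority(rows, lhs, rhs):
    """Return (repaired_rows, number_of_changed_cells)."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1
    # Most frequent rhs value per lhs group; ties go to the value seen first.
    target = {key: counts.most_common(1)[0][0] for key, counts in groups.items()}
    repaired = [{**row, rhs: target[row[lhs]]} for row in rows]
    changes = sum(1 for old, new in zip(rows, repaired) if old[rhs] != new[rhs])
    return repaired, changes


# Hypothetical data, not from the slides.
rows = [
    {"area_code": "010", "city": "Beijing"},
    {"area_code": "010", "city": "Beijing"},
    {"area_code": "010", "city": "Shanghai"},
    {"area_code": "021", "city": "Shanghai"},
]
repaired, changes = repair_by_majority(rows, "area_code", "city")
# One cell changes: the third row's city becomes "Beijing".
```

With several interacting constraints this greedy rule no longer applies, which is where the efficiency issues on the later slide come in.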
11
Single Source & Schema Level
• Inconsistent Repairs
  – Example
  – Solutions
    • Certain Fix (VLDB’10)
      – Master Data
      – Certain Region
      – Some attribute values are asserted to be correct

[Diagram: Inconsistent Repairs, with branches Objective Function and CertainFix]

Example tuples:
010  Shanghai
021  Beijing

Area Code  City Name
010        Beijing
021        Shanghai
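
To illustrate the idea of fixing against master data, the sketch below treats the Area Code / City Name table as the master data and assumes the area code is the attribute asserted to be correct. Both are assumptions for illustration; this is a plain lookup, not the CertainFix algorithm from the VLDB’10 paper.

```python
# Assumptions: `master` is trusted reference data, and the `trusted` attribute
# of each dirty row has been asserted correct (e.g. by a user).

def fix_with_master(rows, master, trusted, target):
    """Overwrite `target` with the master value keyed by the trusted attribute."""
    fixed = []
    for row in rows:
        key = row[trusted]
        if key in master:
            fixed.append({**row, target: master[key]})
        else:
            # No master entry: leave the row unchanged rather than guess.
            fixed.append(dict(row))
    return fixed


master = {"010": "Beijing", "021": "Shanghai"}
dirty = [
    {"area_code": "010", "city": "Shanghai"},
    {"area_code": "021", "city": "Beijing"},
]
fixed = fix_with_master(dirty, master, trusted="area_code", target="city")
```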
12
Single Source & Schema Level
• Cleaning Operations
  – Deletion & Insertion
  – Update attribute values
• Efficiency Issues
  – NP-Complete
  – Heuristic Methods

[Diagram: Inconsistent Repairs, with branches Objective Function, CertainFix, Cleaning Operation (Deletion & Insertion, Update) and Efficiency Issues]
13
Others
• Single Source Instance Level
  – Infer missing values, detect and correct outliers with machine learning / statistical methods
• Multi-Source Schema Level
  – Schema Mapping
• Multi-Source Instance Level
  – Data Deduplication (Record Linkage)
14
Outline
• Introduction
• Traditional Data Quality and Cleaning
• Uncertainty Management in Traditional Data Cleaning
• Cleaning Uncertain Database
• Conclusion
15
Single Source & Schema Level
• Cardinality-Set-Minimal Repair: A repair I' of I is cardinality-set-minimal iff there is no repair I'' of I such that Δ(I, I'') ⊂ Δ(I, I')

[Diagram: Inconsistent Repairs, with branches Objective Function, CertainFix, Cleaning Operation (Deletion & Insertion, Update), Efficiency Issues, Possible Repair, …]
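
The definition can be checked mechanically on a small example. The sketch assumes Δ(I, I') is the set of cells whose values differ between I and I', and that the caller supplies candidates that are already valid repairs; the function does not verify consistency itself. All names and data are hypothetical.

```python
# Assumption: delta(I, I') = set of (row index, attribute) cells that differ.

def changed_cells(original, repair):
    return {
        (i, attr)
        for i, (old, new) in enumerate(zip(original, repair))
        for attr in old
        if old[attr] != new[attr]
    }


def is_cardinality_set_minimal(original, candidate, all_repairs):
    """True iff no other repair changes a proper subset of the candidate's cells."""
    delta = changed_cells(original, candidate)
    # For Python sets, `<` tests for a proper subset.
    return not any(changed_cells(original, other) < delta for other in all_repairs)


original = [{"code": "010", "city": "Beijing"}, {"code": "010", "city": "Shanghai"}]
repair_a = [{"code": "010", "city": "Beijing"}, {"code": "010", "city": "Beijing"}]
repair_b = [{"code": "010", "city": "Shanghai"}, {"code": "010", "city": "Shanghai"}]
repair_c = [{"code": "010", "city": "Tianjin"}, {"code": "010", "city": "Tianjin"}]
repairs = [repair_a, repair_b, repair_c]

minimal = [is_cardinality_set_minimal(original, r, repairs) for r in repairs]
# repair_c changes both city cells, a proper superset of repair_a's changes,
# so it is not minimal; repair_a and repair_b each change one cell and are.
```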
16
Single Source & Instance Level
• Missing Value & Outliers
  – Census Database
• ERACER (SIGMOD’10)
  – User input dependency model
    • Death age
    • Parent age
  – Learn the parameters
  – Infer the missing value
    • Infer the missing birth year based on death year & death age distribution
    • Further infer the child’s birth year
  – Repeat until the distribution converges
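
The two inference steps listed above can be pictured as shifting and convolving discrete distributions. This is only a sketch of that arithmetic, not ERACER itself: it skips parameter learning and the convergence loop, and every number and function name below is hypothetical.

```python
from collections import defaultdict

def birth_from_death(death_year, death_age_dist):
    """Birth-year distribution implied by a known death year and a death-age distribution."""
    return {death_year - age: p for age, p in death_age_dist.items()}


def add_offset(year_dist, offset_dist):
    """Distribution of year + offset, assuming the two are independent."""
    out = defaultdict(float)
    for year, p_year in year_dist.items():
        for offset, p_offset in offset_dist.items():
            out[year + offset] += p_year * p_offset
    return dict(out)


# Hypothetical model parameters (in ERACER these would be learned from the data).
death_age = {60: 0.25, 70: 0.5, 80: 0.25}
parent_age_at_birth = {20: 0.5, 30: 0.5}

# Step 1: parent died in 1900, birth year missing.
parent_birth = birth_from_death(1900, death_age)
# Step 2: propagate to the child's missing birth year.
child_birth = add_offset(parent_birth, parent_age_at_birth)
```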
17
Multi-Source
• Schema Level
  – Uncertain Schema Matching
• Instance Level
  – Possible Repairs in Data Deduplication (VLDB’09)
18
Outline
• Introduction
• Traditional Data Quality and Cleaning
• Uncertainty Management in Traditional Data Cleaning
• Cleaning Uncertain Database
• Conclusion
19
Cleaning Uncertain Database
• Applying Integrity Constraints
  – Exact Method
  – Sampling Method
• Quality of Uncertain Query Results
  – PWS-Quality
• Efficiency Issues
20
Integrity Constraints
• Difference from Traditional Database
  – Traditional: locate errors in the original database
  – Uncertain: locate errors in possible worlds
• Difficulties
  – Exponential number of possible worlds
• Statistical Description
  – Posterior probabilities: Prob[j=7|C]
• Approaches
  – Exact Method
  – Approximate Method

Name  SSN  Prob
John  1    0.2
      7    0.8
Bill  4    0.3
      7    0.7

A possible world that violates C:
Name  SSN
John  7
Bill  7

Constraint set (C): SSN is Unique
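
For a table this small, the posterior Prob[j=7|C] can be computed by brute force: enumerate every possible world, discard those that violate C, and renormalize. The sketch assumes the two tuples are independent, which the slide does not state. It is also exactly the exponential enumeration that the exact and approximate methods are meant to avoid.

```python
from fractions import Fraction
from itertools import product

# Assumption: tuples are independent, so a world's probability is a product.

def posterior(tuples, constraint, event):
    """P(event | constraint) by enumerating all possible worlds."""
    names = list(tuples)
    consistent = Fraction(0)
    hit = Fraction(0)
    for combo in product(*(tuples[n].items() for n in names)):
        world = {n: value for n, (value, _) in zip(names, combo)}
        prob = Fraction(1)
        for _, p in combo:
            prob *= p
        if constraint(world):
            consistent += prob
            if event(world):
                hit += prob
    return hit / consistent  # raises ZeroDivisionError if no world satisfies C


tuples = {
    "John": {1: Fraction(2, 10), 7: Fraction(8, 10)},
    "Bill": {4: Fraction(3, 10), 7: Fraction(7, 10)},
}

def ssn_unique(world):
    return len(set(world.values())) == len(world)

p_john_7 = posterior(tuples, ssn_unique, lambda w: w["John"] == 7)
# Consistent worlds: (1,4)=0.06, (1,7)=0.14, (7,4)=0.24; (7,7)=0.56 is dropped.
# So P[j=7|C] = 0.24 / 0.44 = 6/11, down from the prior 0.8.
```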
21
Exact Method (Christoph VLDB’08)
• Model the Constraints as Assignments.
• Compress the assignment into a tree structure
• Calculate the Posterior Probabilities
[Tree diagram branches: j = 1; j = 7, b = 4; …]
22
Approximate Method (Haiquan Chen ICDE’10 Workshop)
• Aggregate Constraints
• Model the Constraints as Scoring Functions
• Get Posterior Probability by Sampling
Employee  Salary (k)  Confidence
Alice     60          0.4
          40          0.4
          20          0.2
Bob       30          0.7
          40          0.3
Charles   10          0.3
          20          0.3
          30          0.4

Constraints: Total Salary in [50k, 70k]
…
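
A minimal sampling sketch for the salary example. It uses plain rejection sampling with a hard accept/reject test, which is a simplification standing in for the scoring-function approach named on the slide, not the method of the ICDE’10 workshop paper. Employees are assumed independent, and the function names are hypothetical.

```python
import random

# Assumptions: employees are independent; the constraint acts as a hard filter.

def sample_world(table, rng):
    """Draw one salary per employee according to its confidence values."""
    return {
        name: rng.choices(list(alts), weights=list(alts.values()))[0]
        for name, alts in table.items()
    }


def estimate_posterior(table, constraint, event, n_samples, seed=0):
    """Estimate P(event | constraint) from the samples that satisfy the constraint."""
    rng = random.Random(seed)
    accepted = hits = 0
    for _ in range(n_samples):
        world = sample_world(table, rng)
        if constraint(world):
            accepted += 1
            hits += event(world)
    return hits / accepted if accepted else None


salaries = {
    "Alice": {60: 0.4, 40: 0.4, 20: 0.2},
    "Bob": {30: 0.7, 40: 0.3},
    "Charles": {10: 0.3, 20: 0.3, 30: 0.4},
}

def total_in_range(world):
    return 50 <= sum(world.values()) <= 70

p_bob_30 = estimate_posterior(salaries, total_in_range, lambda w: w["Bob"] == 30, 20000)
p_alice_20 = estimate_posterior(salaries, total_in_range, lambda w: w["Alice"] == 20, 20000)
```

Only worlds with Alice at 20k can meet the constraint, since any other choice already pushes the total to at least 80k. The estimate for Alice is therefore exactly 1, and roughly 90% of samples are rejected, which hints at why a smarter sampler is worth the effort.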
23
Quality of Uncertain Query Results (Reynold VLDB’08)
• Different Queries have Different Properties
  – Range Query: Independent
  – Min/Max Query: Otherwise
• Uniform Metric for all Uncertain Queries
  – Quality on Possible World Answers
• Clean the uncertain tuples so as to improve query quality as much as possible
  – Oracle Assumption
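
The slide gives no formula for the metric. One natural reading of "quality on possible world answers", assumed here, is the negated entropy of the distribution over distinct query answers across possible worlds: 0 when the answer is certain, more negative as the answer becomes more ambiguous. The sensor data and function names are hypothetical, and tuples are assumed independent.

```python
import math
from itertools import product

# Assumption: quality = sum over distinct answers r of p(r) * log2 p(r).

def answer_distribution(tuples, query):
    """Probability of each distinct query answer over all possible worlds."""
    names = list(tuples)
    dist = {}
    for combo in product(*(tuples[n].items() for n in names)):
        world = {n: value for n, (value, _) in zip(names, combo)}
        prob = math.prod(p for _, p in combo)
        answer = query(world)
        dist[answer] = dist.get(answer, 0.0) + prob
    return dist


def quality(dist):
    return sum(p * math.log2(p) for p in dist.values() if p > 0)


def max_query(world):
    return max(world.values())

# Hypothetical uncertain temperature readings.
before = {"s1": {20: 0.5, 30: 0.5}, "s2": {25: 0.5, 35: 0.5}}
# Suppose an oracle cleans s2 and confirms the value 35.
after = {"s1": {20: 0.5, 30: 0.5}, "s2": {35: 1.0}}

q_before = quality(answer_distribution(before, max_query))  # -1.5
q_after = quality(answer_distribution(after, max_query))    # 0.0
```

Cleaning s2 removes all ambiguity from the MAX answer even though s1 is still uncertain, which shows why the choice of tuple to clean depends on the query.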
24
Efficiency Issue
• A more “realistic” Oracle
  – Cleaning may fail
  – Even a successful cleaning cannot remove all false values
  – Cleaning may involve a cost
• Objective
  – Remove as much uncertainty as possible
  – With a limited number of cleaning operations
• Discussion
  – Instance Level Cleaning (Clean particular instance)
  – Schema Level Cleaning (Clean the entire DB)
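
To make the objective concrete, here is a greedy heuristic sketch for spending a limited cleaning budget. It is not taken from the talk and simplifies the oracle: an attempt either fully resolves a tuple with some success probability or leaves it unchanged, and every attempt costs one unit. Tuples are ranked by expected entropy removed. All names and numbers are hypothetical.

```python
import math

# Assumptions: unit cost per attempt; success removes all of a tuple's
# uncertainty; failure changes nothing. A simplification of the slide's oracle.

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def pick_cleaning_targets(tuples, success_prob, budget):
    """Choose up to `budget` tuples with the largest expected entropy reduction."""
    gains = {name: success_prob[name] * entropy(dist) for name, dist in tuples.items()}
    ranked = sorted(gains, key=lambda name: gains[name], reverse=True)
    return ranked[:budget]


tuples = {
    "a": {1: 0.5, 2: 0.5},                       # 1 bit of uncertainty
    "b": {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25},   # 2 bits
    "c": {1: 1.0},                               # already certain
}
success_prob = {"a": 0.9, "b": 0.4, "c": 1.0}
targets = pick_cleaning_targets(tuples, success_prob, budget=2)
# Expected gains: a = 0.9, b = 0.8, c = 0.0, so "a" is picked before "b".
```

Tuple b is more uncertain than a, but its lower success probability makes a the better first choice under this model.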
25
Conclusion
• Improve Data Quality
  – Data Cleaning -> Remove Errors
  – Uncertain Data Management -> Maintain Information
• 2 Directions

[Diagram labels: Constraints, Traditional Database, Repairs, Uncertain Database, Consistent Possible World(s)]
26
Discussion
Thank You :)