Data Repairing
Giorgos Flouris, FORTH
December 11-12, 2012, Luxembourg
Slide 2 of x
PART I: Problem Statement and Proposed Solution (D2.2)
Slide 3 of x
Validity as a Quality Indicator
Validity is an important quality indicator
◦ Encodes context- or application-specific requirements
◦ Applications may be useless over invalid data
◦ Binary concept (valid/invalid)
Two steps to guarantee validity (repair process):
1. Identify invalid ontologies (diagnosis)
◦ Detect invalidities in an automated manner
◦ Subtask of Quality Assessment
2. Remove invalidities (repair)
◦ Repair invalidities in an automated manner
◦ Subtask of Quality Enhancement
Slide 4 of x
Diagnosis
Expressing validity using validity rules over an adequate relational schema
Examples:
◦ Properties must have a unique domain
∀p Prop(p) → ∃a Dom(p,a)
∀p,a,b Dom(p,a) ∧ Dom(p,b) → (a=b)
◦ Correct classification in property instances
∀x,y,p,a P_Inst(x,y,p) ∧ Dom(p,a) → C_Inst(x,a)
∀x,y,p,a P_Inst(x,y,p) ∧ Rng(p,a) → C_Inst(y,a)
Diagnosis reduced to relational queries
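Because these rules are universally quantified implications over relations, diagnosis amounts to querying for counterexamples: tuples that satisfy a rule's body but not its head. A minimal Python sketch of this idea over small in-memory relations (the function name and tuple encoding are illustrative assumptions, not the deliverable's implementation):

```python
# Diagnosis sketch: find violations of the "correct classification" rule
#   forall x,y,p,a: P_Inst(x,y,p) AND Dom(p,a) -> C_Inst(x,a)
# Relation contents below are illustrative.

p_inst = {("Item1", "ST1", "geo:location")}                    # P_Inst(x, y, p)
dom = {("geo:location", "Sensor")}                             # Dom(p, a)
c_inst = {("Item1", "Observation"), ("ST1", "SpatialThing")}   # C_Inst(x, a)

def diagnose():
    """Return the witnesses of a violation (the relational query's answers)."""
    return [(x, y, p, a)
            for (x, y, p) in p_inst
            for (q, a) in dom
            if p == q and (x, a) not in c_inst]

print(diagnose())  # -> [('Item1', 'ST1', 'geo:location', 'Sensor')]
```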
Slide 5 of x
Example
Ontology O0:
◦ Schema: Class(Sensor), Class(SpatialThing), Class(Observation), Prop(geo:location), Dom(geo:location,Sensor), Rng(geo:location,SpatialThing)
◦ Data: Inst(Item1), Inst(ST1), P_Inst(Item1,ST1,geo:location), C_Inst(Item1,Observation), C_Inst(ST1,SpatialThing)
Violated rule (correct classification in property instances):
∀x,y,p,a P_Inst(x,y,p) ∧ Dom(p,a) → C_Inst(x,a)
The violation: Item1 geo:location ST1, and Sensor is the domain of geo:location, but Item1 is not a Sensor.
Three repair options:
1. Remove P_Inst(Item1,ST1,geo:location)
2. Add C_Inst(Item1,Sensor)
3. Remove Dom(geo:location,Sensor)
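Each violating tuple can be resolved minimally by retracting one of the rule's premises or adding its conclusion, which yields exactly the three options above. A hypothetical sketch of such an enumerator (the encoding of facts as tagged tuples is an assumption):

```python
# For a violation (x, y, p, a) of  P_Inst(x,y,p) AND Dom(p,a) -> C_Inst(x,a),
# the minimal resolutions either remove one premise or add the conclusion.
def minimal_resolutions(x, y, p, a):
    return [
        ("remove", ("P_Inst", x, y, p)),  # drop the property instance
        ("add",    ("C_Inst", x, a)),     # classify x under the domain class
        ("remove", ("Dom", p, a)),        # drop the domain axiom
    ]

for op, fact in minimal_resolutions("Item1", "ST1", "geo:location", "Sensor"):
    print(op, fact)
```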
Slide 6 of x
Preferences for Repair
Which repairing option is best?
◦ The ontology engineer determines that via preferences
◦ Specified by the ontology engineer beforehand
◦ High-level “specifications” for the ideal repair
◦ Serve as “instructions” to determine the preferred (optimal) solution
Slide 7 of x
Preferences (On Ontologies)
[Figure: candidate repair results O1, O2, O3 of O0, assigned scores 3, 4 and 6; the result with the best score is preferred.]
Slide 8 of x
Preferences (On Deltas)
[Figure: candidate deltas leading from O0 to O1, O2, O3 (e.g., -P_Inst(Item1,ST1,geo:location), +C_Inst(Item1,Sensor), -Dom(geo:location,Sensor)), assigned scores 2, 4 and 5; the repair whose delta has the best score is preferred.]
Slide 9 of x
Preferences
Preferences on ontologies are result-oriented
◦ Consider the quality of the repair result
◦ Ignore the impact of repair
◦ Popular options: prefer newest/trustable information, prefer a specific ontological structure
Preferences on deltas are impact-oriented
◦ Consider the impact of repair
◦ Ignore the quality of the repair result
◦ Popular options: minimize schema changes, minimize addition/deletion of information, minimize delta size
Properties of preferences
◦ Preferences on ontologies/deltas are equivalent
◦ Quality metrics can be used for stating preferences
◦ Metadata on the data can be used (e.g., provenance)
◦ Can be qualitative or quantitative
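As an illustration of a quantitative, impact-oriented preference, "minimize delta size" with extra weight on schema changes can be stated as a scoring function over deltas; the delta with the minimum score is preferred. The weights and fact encoding below are assumptions for illustration:

```python
# Impact-oriented preference sketch: score deltas, prefer the minimum score.
# A delta is a list of ("add"/"remove", fact) operations; weights are illustrative.
def delta_score(delta, schema_weight=2, data_weight=1):
    """Smaller score = preferred. Schema facts (Dom/Rng/Prop/Class) cost more."""
    schema_preds = {"Dom", "Rng", "Prop", "Class", "C_Sub"}
    return sum(schema_weight if fact[0] in schema_preds else data_weight
               for _op, fact in delta)

d1 = [("remove", ("P_Inst", "Item1", "ST1", "geo:location"))]  # data change
d2 = [("remove", ("Dom", "geo:location", "Sensor"))]           # schema change
best = min([d1, d2], key=delta_score)
print(delta_score(d1), delta_score(d2))  # -> 1 2
```

Under this preference the data-level delta d1 wins, matching the "minimize schema changes" option mentioned above.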
Slide 10 of x
Generalizing the Approach
For one violated constraint:
1. Diagnose invalidity
2. Determine minimal ways to resolve it
3. Determine and return the preferred (optimal) resolution
For many violated constraints:
◦ Problem becomes more complicated
◦ More than one resolution step is required
Issues:
1. Resolution order
2. When and how to filter non-optimal solutions?
3. Constraint (and resolution) interdependencies
Slide 11 of x
Constraint Interdependencies
A given resolution may:
◦ Cause other violations (bad)
◦ Resolve other violations (good)
Optimal resolution unknown a priori
◦ Cannot predict a resolution’s ramifications
◦ Exhaustive, recursive search required (resolution tree)
Two ways to create the resolution tree
◦ Globally-optimal (GO) / locally-optimal (LO)
◦ Differ in when and how they filter non-optimal solutions
Slide 12 of x
Resolution Tree Creation (GO)
Find all minimal resolutions for all the violated constraints, then find the optimal ones
Globally-optimal (GO):
◦ Find all minimal resolutions for one violation
◦ Explore them all
◦ Repeat recursively until valid
◦ Return the optimal leaves
[Figure: full resolution tree; the optimal repairs at its leaves are returned.]
Slide 13 of x
Resolution Tree Creation (LO)
Find the minimal and optimal resolutions for one violated constraint, then repeat for the next
Locally-optimal (LO):
◦ Find all minimal resolutions for one violation
◦ Explore the optimal one(s)
◦ Repeat recursively until valid
◦ Return all remaining leaves
[Figure: pruned resolution tree; the repair at the remaining leaf is returned.]
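The two strategies differ only in which children of a tree node are explored and when non-optimal branches are filtered. A toy, self-contained sketch (the `violations`, `resolutions`, and `score` functions are abstract stand-ins for the actual constraint and preference machinery, not the deliverable's algorithms):

```python
# Resolution-tree sketch contrasting GO and LO on an abstract repair problem.
# An ontology is a frozenset of facts; all definitions below are toy stand-ins.

def violations(onto):
    """Toy constraint: facts starting with 'bad' are violations."""
    return sorted(f for f in onto if f.startswith("bad"))

def resolutions(v, onto):
    """Toy minimal resolutions: remove the fact, or replace it with a fix."""
    return [onto - {v}, (onto - {v}) | {"fix_" + v}]

def score(onto):
    """Toy preference: smaller ontologies are preferred."""
    return len(onto)

def repair(onto, greedy):
    vs = violations(onto)
    if not vs:
        return [onto]                        # valid leaf
    children = resolutions(vs[0], onto)
    if greedy:                               # LO: explore only the optimal child(ren)
        best = min(score(c) for c in children)
        children = [c for c in children if score(c) == best]
    leaves = [leaf for c in children for leaf in repair(c, greedy)]
    if greedy:
        return leaves                        # LO: return all remaining leaves
    best = min(score(l) for l in leaves)     # GO: keep only optimal leaves
    return [l for l in leaves if score(l) == best]

onto = frozenset({"ok1", "bad1", "bad2"})
print(repair(onto, greedy=False))  # GO: optimal repairs
print(repair(onto, greedy=True))   # LO: may be sub-optimal in general
```

On this toy instance both strategies agree; in general, as the comparison slide notes, LO's greedy pruning can miss the globally optimal repair.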
Slide 14 of x
Comparison (GO versus LO)
Characteristics of GO:
◦ Exhaustive
◦ Less efficient: large resolution trees
◦ Always returns optimal repairs
◦ Insensitive to constraint syntax
◦ Does not depend on resolution order
Characteristics of LO:
◦ Greedy
◦ More efficient: small resolution trees
◦ Does not always return optimal repairs
◦ Sensitive to constraint syntax
◦ Depends on resolution order
Slide 15 of x
PART II: Complexity Analysis and Performance Evaluation (D2.2)
Slide 16 of x
Algorithm and Complexity
Detailed complexity analysis for GO/LO and various different types of constraints and preferences
Inherently difficult problem
◦ Exponential complexity (in general)
◦ Exception: LO is polynomial (in special cases)
Theoretical complexity is misleading as to the actual performance of the algorithms
Slide 17 of x
Performance in Practice
Performance in practice:
◦ Linear with respect to ontology size
◦ Linear with respect to tree size
Tree size is determined by:
◦ Types of violated constraints (tree width)
◦ Number of violations (tree height): causes the exponential blowup
◦ Constraint interdependencies (tree height)
◦ Preference (for LO): affects pruning (tree width)
Further performance improvement:
◦ Use optimizations
◦ Use LO with a restrictive preference
Slide 18 of x
Evaluation Parameters
Evaluation:
1. Effect of ontology size (for GO/LO)
2. Effect of tree size (for GO/LO)
3. Effect of violations (for GO/LO)
4. Effect of preference (relevant for LO only)
5. Quality of LO repairs
Evaluation results support our claims:
◦ Linear with respect to ontology size
◦ Linear with respect to tree size
Slide 19 of x
Effect of Ontology Size
[Chart: execution time (msec, log scale) versus number of triples (x1000, log scale, up to 20,000) for Diagnosis, GO Repair (16 and 26 violations), and LO Repair (16 and 26 violations).]
Slide 20 of x
Effect of Tree Size (1/2)
[Chart: GO execution time (sec) versus number of nodes (x10^6), for ontologies of 1M, 5M, 10M, 15M and 20M triples.]
Slide 21 of x
Effect of Tree Size (2/2)
[Chart: LO execution time (sec) versus number of nodes, for ontologies of 1M, 5M, 10M, 15M and 20M triples.]
Slide 22 of x
Effect of Violations (1/2)
[Chart: GO execution time (sec) versus number of violations (0 to 28), for ontologies of 1M, 5M, 10M, 15M and 20M triples.]
Slide 23 of x
Effect of Violations (2/2)
[Chart: LO execution time (sec) versus number of violations (0 to 28), for ontologies of 1M, 5M, 10M, 15M and 20M triples.]
Slide 24 of x
Effect of Preference (LO)
Slide 25 of x
Quality of LO Repairs (1/2)
[Chart: number of preferred repairing deltas (GO∩LO, GO\LO, LO\GO) versus number of violations (0 to 21), on the CCD KB with the Max( ) preference.]
Slide 26 of x
Quality of LO Repairs (2/2)
[Chart: number of preferred repairing deltas (GO∩LO, GO\LO, LO\GO) versus number of violations (0 to 21), on the CCD KB with the Min( ) preference.]
Slide 27 of x
PART III: Application of Repairing in a Real Setting (D4.4)
Slide 28 of x
Objectives and Main Idea
Evaluate the repairing method in a real LOD setting
◦ Using resources from WP4
◦ Using provenance-related preferences
Validate the utility of WP4 resources for a data quality benchmark
Evaluate the usefulness of provenance, recency, etc. as metrics/preferences for quality assessment and repair
Slide 29 of x
Setting
User seeks information on Brazilian cities
◦ Fuses Wikipedia dumps from different languages (EN, PT, ES, FR, GE)
◦ Guarantees maximal coverage, but may lead to conflicts
 E.g., a city with two different population counts
Slide 30 of x
Main Tasks
Assess the quality of the resulting dataset
◦ Quality assessment framework
Repair the resulting dataset
◦ Using the aforementioned repairing method
◦ Evaluate the use of provenance-related preferences
 Prefer most recent information
 Prefer most trustworthy information
Slide 31 of x
Contributions
◦ Define 5 different metrics based on provenance
◦ Each metric is used as:
 Quality assessment metric (to assess quality)
 Repairing preference (to “guide” the repair)
◦ Evaluate them in a real setting
Slide 32 of x
Experiments (Setting)
Setting:
◦ Fused 5 Wikipedias: EN, PT, SP, GE, FR
◦ Distilled information about Brazilian cities
Properties considered:
◦ populationTotal
◦ areaTotal
◦ foundingDate
Validity rules: properties must be functional
◦ Repaired invalidities (using our metrics)
◦ Checked quality of result
◦ Dimensions: consistency, validity, conciseness, completeness and accuracy
Slide 33 of x
Metrics for Experiments (1/2)
1. PREFER_PT: select conflicting information based on its source (PT>EN>SP>GE>FR)
2. PREFER_RECENT: select conflicting information based on its recency (most recent is preferred)
3. PLAUSIBLE_PT: ignore “irrational” data (population<500, area<300km2, founding date<1500AD); otherwise use PREFER_PT
Slide 34 of x
Metrics for Experiments (2/2)
4. WEIGHTED_RECENT: select based on recency, but when the records are almost equally recent, use source reputation (if less than 3 months apart, use PREFER_PT; else use PREFER_RECENT)
5. CONDITIONAL_PT: define source trustworthiness depending on data values (prefer PT for small cities with population<500,000; prefer EN for the rest)
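To make the metrics concrete, the first and third ones can be phrased as functions that select a single value from a set of conflicting records. A sketch assuming records are (value, source) pairs (the encoding, function names, and field layout are illustrative assumptions):

```python
# Sketch of PREFER_PT and PLAUSIBLE_PT as conflict resolvers.
# A record is a (value, source) pair; source ranking as on the slide.
RANK = {"PT": 0, "EN": 1, "SP": 2, "GE": 3, "FR": 4}

def prefer_pt(records):
    """Keep the value from the most trusted source (PT>EN>SP>GE>FR)."""
    return min(records, key=lambda r: RANK[r[1]])[0]

def plausible_pt(records, prop):
    """Drop 'irrational' values first, then fall back to PREFER_PT."""
    lower = {"populationTotal": 500, "areaTotal": 300, "foundingDate": 1500}
    sane = [r for r in records if r[0] >= lower[prop]] or records
    return prefer_pt(sane)

conflict = [(69616, "EN"), (69159, "PT")]
print(prefer_pt(conflict))  # -> 69159
print(plausible_pt([(42, "PT"), (69616, "EN")], "populationTotal"))  # -> 69616
```

The second call shows the difference: PREFER_PT alone would keep the implausible PT value 42, while PLAUSIBLE_PT filters it out first.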
Slide 35 of x
Consistency, Validity
Consistency
◦ Lack of conflicting triples
◦ Guaranteed to be perfect (by the repairing algorithm), regardless of preference
Validity
◦ Lack of rule violations
◦ Coincides with consistency for this example
◦ Guaranteed to be perfect (by the repairing algorithm), regardless of preference
Slide 36 of x
Conciseness, Completeness
Conciseness
◦ No duplicates in the final result
◦ Guaranteed to be perfect (by the fusion process), regardless of preference
Completeness
◦ Coverage of information
◦ Improved by fusion
◦ Unaffected by our algorithm
◦ Input completeness = output completeness, regardless of preference
◦ Measured to be 77.02%
Slide 37 of x
Accuracy
Most important metric for this experiment
Accuracy:
◦ Closeness to the “actual state of affairs”
◦ Affected by the repairing choices
Compared repair with the Gold Standard
◦ Taken from an official and independent data source (IBGE)
Slide 38 of x
Accuracy Evaluation
[Figure: en.dbpedia, pt.dbpedia, fr.dbpedia, … are fused/repaired into integrated data (properties dbpedia:areaTotal, dbpedia:populationTotal, dbpedia:foundingDate), which is compared against a Gold Standard from the Instituto Brasileiro de Geografia e Estatística (IBGE) to measure accuracy.]
Slide 39 of x
Accuracy Examples
City of Aracati
◦ Population: 69159/69616 (conflicting)
◦ Record in Gold Standard: 69159
◦ Good choice: 69159
◦ Bad choice: 69616
City of Oiapoque
◦ Population: 20226/20426 (conflicting)
◦ Record in Gold Standard: 20509
◦ Optimal approximation choice: 20426
◦ Sub-optimal approximation choice: 20226
Slide 40 of x
Accuracy Results
Slide 41 of x
Accuracy of Input and Output
Slide 42 of x
Publications
◦ Yannis Roussakis, Giorgos Flouris, Vassilis Christophides. Declarative Repairing Policies for Curated KBs. In Proceedings of the 10th Hellenic Data Management Symposium (HDMS-11), 2011.
◦ Giorgos Flouris, Yannis Roussakis, Maria Poveda-Villalon, Pablo N. Mendes, Irini Fundulaki. Using Provenance for Quality Assessment and Repair in Linked Open Data. In Proceedings of the Joint Workshop on Knowledge Evolution and Ontology Dynamics (EvoDyn-12), 2012.
◦ Yannis Roussakis, Giorgos Flouris, Vassilis Christophides. Preference-Based Repairing of RDF(S) DBs. Under review in the TODS Journal.
Slide 43 of x
BACKUP SLIDES
Slide 44 of x
Repair
Removing invalidities by changing the ontology in an adequate manner
General concerns:
1. Return a valid ontology
 Strict requirement
2. Minimize the impact of repair upon the data
 Make minor, targeted modifications that repair the ontology without changing it too much
3. Return a “good” repair
 Emulate the changes that the ontology engineer would make when repairing the ontology
Slide 45 of x
Inference
Inference expressed using validity rules
Example:
◦ Transitivity of class subsumption
∀a,b,c C_Sub(a,b) ∧ C_Sub(b,c) → C_Sub(a,c)
In practice we use labeling algorithms
◦ Avoid explicitly storing the inferred knowledge
◦ Improve efficiency of reasoning
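Evaluated naively, the transitivity rule corresponds to computing a fixpoint (the transitive closure) of the C_Sub relation. A minimal sketch with invented class names; the labeling algorithms mentioned above exist precisely to avoid materializing this closure:

```python
# Naive fixpoint for  forall a,b,c: C_Sub(a,b) AND C_Sub(b,c) -> C_Sub(a,c).
def transitive_closure(c_sub):
    closure = set(c_sub)
    while True:
        # Join the relation with itself and add the derived pairs.
        derived = {(a, c) for (a, b) in closure for (b2, c) in closure if b == b2}
        if derived <= closure:
            return closure      # nothing new: fixpoint reached
        closure |= derived

# Class names below are illustrative, not from the slides.
c_sub = {("Sensor", "Device"), ("Device", "Artifact")}
print(sorted(transitive_closure(c_sub)))  # includes ('Sensor', 'Artifact')
```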
Slide 46 of x
Quality Assessment
Quality = “fitness for use”
◦ Multi-dimensional, multi-faceted, context-dependent
Methodology for quality assessment:
◦ Dimensions
 Aspects of quality
 Accuracy, completeness, timeliness, …
◦ Indicators
 Metadata values for measuring dimensions
 E.g., last modification date (related to timeliness)
◦ Scoring Functions
 Functions to quantify quality indicators
 E.g., days since last modification date
◦ Metrics
 Measures of dimensions (result of scoring function)
 Can be combined
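For example, the indicator "last modification date" can be quantified by the scoring function "days since last modification" and then normalized into a timeliness metric; the normalization horizon below is an assumption for illustration, not from the slides:

```python
from datetime import date

# Scoring-function sketch: indicator (last modification date) -> metric value.
def days_since(last_modified, today):
    """Raw scoring function: days since last modification (lower = fresher)."""
    return (today - last_modified).days

def timeliness(last_modified, today, horizon_days=365):
    """Normalize into [0, 1]: 1.0 = modified today, 0.0 = older than the horizon."""
    age = days_since(last_modified, today)
    return max(0.0, 1.0 - age / horizon_days)

print(timeliness(date(2012, 6, 11), date(2012, 12, 11)))
```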