Data Repairing Giorgos Flouris, FORTH December 11-12, 2012, Luxembourg


Page 1: Data Repairing

Data Repairing

Giorgos Flouris, FORTH
December 11-12, 2012, Luxembourg

Page 2: Data Repairing

Slide 2

Structure

Part I: Problem statement and proposed solution (D2.2)
◦ Sketch (also presented in the previous review)

Part II: Complexity analysis and performance evaluation (D2.2)
◦ Shows scalability and performance properties
◦ Improved, compared to D2.2

Part III: Application of repairing in a real setting (D4.4)
◦ Result of collaboration between partners/WPs
◦ Shows applicability; experimentation on real-world data in a real setting

Page 3: Data Repairing

Slide 3

PART I: Problem Statement and

Proposed Solution(D2.2)

Page 4: Data Repairing

Slide 4

Validity as a Quality Indicator

Validity is an important quality indicator:
◦ Encodes context- or application-specific requirements
◦ Applications may be useless over invalid data
◦ Binary concept (valid/invalid)

Two steps to guarantee validity:
1. Identify invalid ontologies (diagnosis)
◦ Detecting invalidities in an automated manner
◦ Subtask of Quality Assessment
2. Remove invalidities (repair)
◦ Repairing invalidities in an automated manner
◦ Subtask of Quality Enhancement

Page 5: Data Repairing

Slide 5

Main Idea

Expressing validity using validity rules over an adequate relational schema, e.g.:
◦ Properties must have a unique domain:
  ∀p Prop(p) → ∃a Dom(p,a)
  ∀p,a,b Dom(p,a) ∧ Dom(p,b) → (a=b)
◦ Correct classification in property instances:
  ∀x,y,p,a P_Inst(x,y,p) ∧ Dom(p,a) → C_Inst(x,a)
  ∀x,y,p,a P_Inst(x,y,p) ∧ Rng(p,a) → C_Inst(y,a)

Syntactic manipulations on rules allow:
◦ Diagnosis (reduced to relational queries)
◦ Repair (identify repairing options per violation)
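The reduction of diagnosis to relational queries can be illustrated with a minimal sketch. The relations mirror the slides' schema (Prop, Dom, P_Inst, C_Inst), but the concrete facts, including the duplicate Dom entry injected to trigger a violation, are invented for the example:

```python
# Illustrative sketch: rule-based diagnosis over a relational encoding
# of an ontology. Facts are sets of tuples; each validity rule becomes
# a query that returns its violations.

Prop = {("geo:location",)}
Dom = {("geo:location", "Sensor"),
       ("geo:location", "Device")}          # duplicate domain (invented)
P_Inst = {("Item1", "ST1", "geo:location")}
C_Inst = {("Item1", "Observation")}

def unique_domain_violations():
    """Violations of: forall p,a,b Dom(p,a) & Dom(p,b) -> a = b."""
    return [(p, a, b) for (p, a) in Dom for (q, b) in Dom
            if p == q and a != b]

def domain_classification_violations():
    """Violations of: forall x,y,p,a P_Inst(x,y,p) & Dom(p,a) -> C_Inst(x,a)."""
    return [(x, p, a) for (x, y, p) in P_Inst for (q, a) in Dom
            if p == q and (x, a) not in C_Inst]

print(unique_domain_violations())
print(domain_classification_violations())
```

Each violation tuple identifies the facts involved, which is exactly the input the repair step needs to enumerate resolution options.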

Page 6: Data Repairing

Slide 6

Preferences for Repair

Which repairing option is best?
◦ The ontology engineer determines that via preferences

Preferences:
◦ Specified by the ontology engineer beforehand
◦ High-level “specifications” for the ideal repair
◦ Serve as “instructions” to determine the preferred (optimal) solution

Page 7: Data Repairing

Slide 7

Preferences (On Ontologies)

[Diagram: candidate repaired ontologies O1, O2, O3 derived from the original ontology O0, each assigned a preference score (3, 4, 6).]

Page 8: Data Repairing

Slide 8

Preferences (On Deltas)

[Diagram: candidate repairing deltas from O0 to O1, O2, O3, with scores 2, 4 and 5. Example delta operations:
−P_Inst(Item1, ST1, geo:location)
+C_Inst(Item1, Sensor)
−Dom(geo:location, Sensor)]

Page 9: Data Repairing

Slide 9

Preferences

Preferences on ontologies are result-oriented:
◦ Consider the quality of the repair result
◦ Ignore the impact of repair
◦ Popular options: prefer newest/trustable information, prefer a specific ontological structure

Preferences on deltas are impact-oriented:
◦ Consider the impact of repair
◦ Ignore the quality of the repair result
◦ Popular options: minimize schema changes, minimize addition/deletion of information, minimize delta size

Properties of preferences:
◦ Preferences on ontologies/deltas are equivalent
◦ Quality metrics can be used for stating preferences
◦ Metadata on the data can be used (e.g., provenance)
◦ Can be qualitative or quantitative
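A quantitative, impact-oriented preference such as "minimize delta size" amounts to a scoring function over candidate deltas. A minimal sketch (the delta names and contents are invented for illustration):

```python
# Illustrative sketch: selecting the preferred repair via a quantitative
# preference on deltas. Each delta is a list of (+/-, fact) operations.

deltas = {
    "d1": [("-", "P_Inst(Item1,ST1,geo:location)")],
    "d2": [("+", "C_Inst(Item1,Sensor)")],
    "d3": [("-", "Dom(geo:location,Sensor)"),
           ("-", "Rng(geo:location,SpatialThing)")],
}

def delta_size(delta):
    """Impact-oriented score: the number of changes the repair makes."""
    return len(delta)

def preferred(deltas, score):
    """Return the name(s) of the delta(s) with the minimal score."""
    best = min(score(d) for d in deltas.values())
    return [name for name, d in deltas.items() if score(d) == best]

print(preferred(deltas, delta_size))
```

Swapping `delta_size` for another scoring function (e.g., one weighting schema changes more heavily) changes the preference without touching the selection machinery.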

Page 10: Data Repairing

Slide 10

Generalizing the Approach

For one violated rule:
1. Diagnose the invalidity
2. Determine minimal ways to resolve it
3. Determine and return the preferred (optimal) resolution

For many violated rules:
◦ Problem becomes more complicated
◦ More than one resolution step is required

Issues:
1. Resolution order
2. When and how to filter non-optimal solutions?
3. Rule (and resolution) interdependencies

Page 11: Data Repairing

Slide 11

Rule Interdependencies

A given resolution may:
◦ Cause other violations (bad)
◦ Resolve other violations (good)

The optimal resolution is unknown a priori:
◦ Cannot predict a resolution’s ramifications
◦ An exhaustive, recursive search is required (resolution tree)

Two ways to create the resolution tree:
◦ Globally-optimal (GO) / locally-optimal (LO)
◦ They differ in when and how non-optimal solutions are filtered

Page 12: Data Repairing

Slide 12

Resolution Tree Creation (GO)

Globally-optimal (GO): find all minimal resolutions for all the violated rules, then find the optimal ones:
◦ Find all minimal resolutions for one violation
◦ Explore them all
◦ Repeat recursively until valid
◦ Return the optimal leaves (the optimal repairs)

Page 13: Data Repairing

Slide 13

Resolution Tree Creation (LO)

Locally-optimal (LO): find the minimal and optimal resolutions for one violated rule, then repeat for the next:
◦ Find all minimal resolutions for one violation
◦ Explore only the optimal one(s)
◦ Repeat recursively until valid
◦ Return all remaining leaves
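The GO and LO strategies can be sketched as one recursive search that differs only in where pruning happens. This is a minimal illustration, not the D2.2 implementation: the `violations`, `resolutions` and `score` helpers, and the toy rule (facts prefixed with "!" are invalid), are invented for the example:

```python
# Illustrative sketch of GO/LO resolution-tree creation. An ontology is
# a frozenset of facts; violations(), resolutions() and score() are
# pluggable helpers (lower score = more preferred).

def build_repairs(ont, violations, resolutions, score, locally_optimal):
    """Recursively resolve violations and return candidate repairs."""
    vs = violations(ont)
    if not vs:
        return [ont]                      # valid ontology: a leaf
    options = resolutions(ont, vs[0])     # minimal resolutions, one violation
    if locally_optimal:                   # LO: keep only the best option(s) now
        best = min(score(o) for o in options)
        options = [o for o in options if score(o) == best]
    repairs = []
    for child in options:                 # GO explores every option
        repairs += build_repairs(child, violations, resolutions,
                                 score, locally_optimal)
    if not locally_optimal:               # GO: filter optimal leaves at the end
        best = min(score(r) for r in repairs)
        repairs = [r for r in repairs if score(r) == best]
    return repairs

# Toy rule, invented for the demo: "!"-prefixed facts are invalid; a
# violation is resolved by dropping the fact or replacing it.
def violations(o):
    return sorted(f for f in o if f.startswith("!"))

def resolutions(o, v):
    return [o - {v}, (o - {v}) | {v[1:]}]

def score(o):
    return -len(o)   # preference: keep as much information as possible

print(build_repairs(frozenset({"!a", "!b"}), violations, resolutions,
                    score, locally_optimal=False))
```

With this well-behaved toy rule both strategies agree; the slides' point is that with interdependent rules LO's early pruning can discard branches leading to the globally optimal repair.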

Page 14: Data Repairing

Slide 14

Comparison (GO versus LO)

Characteristics of GO:
◦ Exhaustive
◦ Less efficient: large resolution trees
◦ Always returns optimal repairs
◦ Insensitive to rule syntax
◦ Does not depend on resolution order

Characteristics of LO:
◦ Greedy
◦ More efficient: small resolution trees
◦ Does not always return optimal repairs
◦ Sensitive to rule syntax
◦ Depends on resolution order

Page 15: Data Repairing

Slide 15

PART II: Complexity Analysis andPerformance Evaluation

(D2.2)

Page 16: Data Repairing

Slide 16

Complexity Analysis

Detailed complexity analysis for GO/LO and various different types of rules and preferences.

Inherently difficult problem:
◦ Exponential complexity (in general)
◦ Exception: LO is polynomial (in special cases)

Theoretical complexity is misleading as to the actual performance of the algorithms.

Page 17: Data Repairing

Slide 17

Performance in Practice

◦ Linear with respect to ontology size
◦ Linear with respect to tree size

Factors shaping the resolution tree:
◦ Types of violated rules (tree width)
◦ Number of violations (tree height): causes the exponential blowup
◦ Rule interdependencies (tree height)
◦ Preference (for LO): affects pruning (tree width)

Further performance improvement:
◦ Use optimizations
◦ Use LO with a restrictive preference

Page 18: Data Repairing

Slide 18

Effect of Ontology Size

[Chart: execution time (sec, log scale) versus number of triples (×1000, log scale), for Diagnosis, GO Repair with 16 and 26 violations, and LO Repair with 16 and 26 violations.]

Page 19: Data Repairing

Slide 19

Effect of Tree Size (GO)

[Chart: GO execution time (sec) versus number of tree nodes (×10⁶), for ontologies of 1M, 5M, 10M, 15M and 20M triples.]

Page 20: Data Repairing

Slide 20

Effect of Tree Size (LO)

[Chart: LO execution time (sec) versus number of tree nodes, for ontologies of 1M, 5M, 10M, 15M and 20M triples.]

Page 21: Data Repairing

Slide 21

Effect of Violations (GO)

[Chart: GO execution time (sec) versus number of violations (0 to 28), for ontologies of 1M, 5M, 10M, 15M and 20M triples.]

Page 22: Data Repairing

Slide 22

Effect of Violations (LO)

[Chart: LO execution time (sec) versus number of violations (0 to 28), for ontologies of 1M, 5M, 10M, 15M and 20M triples.]

Page 23: Data Repairing

Slide 23

Effect of Preference (LO)

[Chart: execution time (sec, log scale) versus number of violations, comparing LO under preferences P0, P1, P2 and P3 against GO.]

Page 24: Data Repairing

Slide 24

Quality of LO Repairs

[Charts: number of preferred repairing deltas versus number of violations (0 to 21), under Max and Min preferences, comparing deltas in GO∩LO, GO\LO and LO\GO.]

Page 25: Data Repairing

Slide 25

PART III: Application of Repairing

in a Real Setting(D4.4)

Page 26: Data Repairing

Slide 26

Objectives and Main Idea

Repair real datasets using preferences based on metadata.

Purpose:
◦ WP2: evaluate repairing in a real LOD setting
◦ WP3: evaluate the usefulness of provenance, recency etc. as preferences for repair
◦ WP4: validate the utility of WP4 resources for a data quality benchmark

Page 27: Data Repairing

Slide 27

Motivating Scenario

A user seeks information on Brazilian cities:
◦ Fuses Wikipedia dumps from various languages (EN, PT, ES, FR, GE)

Fusion guarantees maximal coverage, but may lead to conflicts:
◦ E.g., cities with two different population counts

Use repair to eliminate such conflicts:
◦ Using our repairing method
◦ Using adequate preferences based on metadata

Page 28: Data Repairing

Slide 28

Experimental Setting

Input:
◦ Fused 5 Wikipedias: EN, PT, SP, GE, FR
◦ Distilled information about three properties of Brazilian cities: populationTotal, areaTotal, foundingDate

Repair parameters:
◦ Validity rules: all properties must be functional
◦ Preferences: 5 preferences based on metadata

Evaluation:
◦ Quality of result along 5 dimensions: consistency, validity, conciseness, completeness, accuracy
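The "all properties must be functional" rule amounts to ∀s,p,v1,v2 (s,p,v1) ∧ (s,p,v2) → v1=v2 over the fused triples. A hypothetical diagnosis sketch; the Aracati population conflict comes from a later slide, while the area value and source tags are invented:

```python
# Illustrative sketch: diagnosing functionality violations over fused
# triples. A conflict is any (subject, property) with more than one value.
from collections import defaultdict

triples = [  # (subject, property, value, source); partly invented data
    ("Aracati", "populationTotal", 69159, "pt"),
    ("Aracati", "populationTotal", 69616, "en"),
    ("Aracati", "areaTotal", 1228, "pt"),
]

def functionality_violations(triples):
    """Group values by (subject, property); >1 distinct value is a conflict."""
    values = defaultdict(set)
    for s, p, v, _src in triples:
        values[(s, p)].add(v)
    return {key: vs for key, vs in values.items() if len(vs) > 1}

print(functionality_violations(triples))
```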

Page 29: Data Repairing

Slide 29

Preferences (1/2)

1. PREFER_PT: resolve conflicts based on source (PT > EN > SP > GE > FR)
2. PREFER_RECENT: resolve conflicts based on recency (the most recent data is preferred)
3. PLAUSIBLE_PT: drop “irrational” data (population < 500, area < 300 km², founding date < 1500 AD); resolve remaining conflicts based on source

Page 30: Data Repairing

Slide 30

Preferences (2/2)

4. WEIGHTED_RECENT: resolve conflicts based on recency, but if the conflicting records are almost equally recent (less than 3 months apart), resolve based on source
5. CONDITIONAL_PT: resolve conflicts based on source, but change the order depending on the data (prefer PT for small cities with population < 500,000, prefer EN for the rest)
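A source-based preference like PREFER_PT can be read as a selection function over the conflicting records. A minimal sketch; the function name and the (value, source) layout are invented for illustration, and the sample values echo the Aracati conflict from a later slide:

```python
# Illustrative sketch of the PREFER_PT preference: among conflicting
# values for one (subject, property), keep the value whose source ranks
# highest in the order PT > EN > SP > GE > FR.

PRIORITY = {"pt": 0, "en": 1, "sp": 2, "ge": 3, "fr": 4}

def prefer_pt(conflicting):
    """conflicting: list of (value, source); return the preferred value."""
    value, _source = min(conflicting, key=lambda rec: PRIORITY[rec[1]])
    return value

print(prefer_pt([(69616, "en"), (69159, "pt")]))
```

PREFER_RECENT would replace the source ranking with a timestamp comparison; CONDITIONAL_PT would swap the priority table depending on the data itself.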

Page 31: Data Repairing

Slide 31

Consistency, Validity

Consistency:
◦ Lack of conflicting triples
◦ Guaranteed to be perfect (by the repairing algorithm), regardless of preference

Validity:
◦ Lack of rule violations
◦ Coincides with consistency for this example
◦ Guaranteed to be perfect (by the repairing algorithm), regardless of preference

Page 32: Data Repairing

Slide 32

Conciseness, Completeness

Conciseness:
◦ No duplicates in the final result
◦ Guaranteed to be perfect (by the fuse process), regardless of preference

Completeness:
◦ Coverage of information
◦ Improved by fusion
◦ Unaffected by the repairing algorithm
◦ Input completeness = output completeness, regardless of preference
◦ Measured to be 77.02%

Page 33: Data Repairing

Slide 33

Accuracy

Most important metric for this experiment.

Accuracy:
◦ Closeness to the “actual state of affairs”
◦ Affected by the repairing choices

Compared repair with the Gold Standard:
◦ Taken from an official and independent data source (IBGE)

Page 34: Data Repairing

Slide 34

Accuracy Examples

City of Aracati:
◦ Population: 69159/69616 (conflicting)
◦ Record in Gold Standard: 69159
◦ Good choice: 69159
◦ Bad choice: 69616

City of Oiapoque:
◦ Population: 20226/20426 (conflicting)
◦ Record in Gold Standard: 20509
◦ Optimal approximation choice: 20426
◦ Sub-optimal approximation choice: 20226

Page 35: Data Repairing

Slide 35

Accuracy Results

Page 36: Data Repairing

Slide 36

Accuracy of Input and Output

Page 37: Data Repairing

Slide 37

Publications

◦ Yannis Roussakis, Giorgos Flouris, Vassilis Christophides. Declarative Repairing Policies for Curated KBs. In Proceedings of the 10th Hellenic Data Management Symposium (HDMS-11), 2011.
◦ Giorgos Flouris, Yannis Roussakis, Maria Poveda-Villalon, Pablo N. Mendes, Irini Fundulaki. Using Provenance for Quality Assessment and Repair in Linked Open Data. In Proceedings of the Joint Workshop on Knowledge Evolution and Ontology Dynamics (EvoDyn-12), 2012.
◦ Yannis Roussakis, Giorgos Flouris, Vassilis Christophides. Preference-Based Repairing of RDF(S) DBs. Under review in the TODS journal.

Page 38: Data Repairing

Slide 38

BACKUP SLIDES

Page 39: Data Repairing

Slide 39

Repair

Removing invalidities by changing the ontology in an adequate manner.

General concerns:
1. Return a valid ontology
◦ Strict requirement
2. Minimize the impact of repair upon the data
◦ Make minor, targeted modifications that repair the ontology without changing it too much
3. Return a “good” repair
◦ Emulate the changes that the ontology engineer would make to repair the ontology

Page 40: Data Repairing

Slide 40

Inference

Inference is expressed using validity rules. Example:
◦ Transitivity of class subsumption:
  ∀a,b,c C_Sub(a,b) ∧ C_Sub(b,c) → C_Sub(a,c)

In practice we use labeling algorithms:
◦ Avoid explicitly storing the inferred knowledge
◦ Improve efficiency of reasoning
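The transitivity rule above can be materialized by a naive fixpoint computation; this sketch shows exactly the explicit storage of inferred knowledge that the labeling algorithms avoid (the class names are invented):

```python
# Illustrative sketch: materializing the rule
#   forall a,b,c: C_Sub(a,b) & C_Sub(b,c) -> C_Sub(a,c)
# by iterating to a fixpoint over a set of C_Sub facts.

def transitive_closure(c_sub):
    closure = set(c_sub)
    changed = True
    while changed:
        changed = False
        new = {(a, c) for (a, b) in closure for (b2, c) in closure
               if b == b2 and (a, c) not in closure}
        if new:
            closure |= new
            changed = True
    return closure

c_sub = {("Sensor", "Device"), ("Device", "Artifact")}  # invented classes
print(transitive_closure(c_sub))
```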

Page 41: Data Repairing

Slide 41

Example (Diagnosis/Repair)

Ontology O0:
Class(Sensor), Class(SpatialThing), Class(Observation)
Prop(geo:location)
Dom(geo:location,Sensor)
Rng(geo:location,SpatialThing)
Inst(Item1), Inst(ST1)
P_Inst(Item1,ST1,geo:location)
C_Inst(Item1,Observation), C_Inst(ST1,SpatialThing)

[Diagram: schema (Sensor, SpatialThing, Observation) and data (Item1, ST1) linked by geo:location.]

Violated rule (correct classification in property instances):
∀x,y,p,a P_Inst(x,y,p) ∧ Dom(p,a) → C_Inst(x,a)

Diagnosis: Item1 geo:location ST1; Sensor is the domain of geo:location; but Item1 is not a Sensor.

Repairing options:
◦ Remove P_Inst(Item1,ST1,geo:location)
◦ Add C_Inst(Item1,Sensor)
◦ Remove Dom(geo:location,Sensor)
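The three options above instantiate a general pattern: a violation of a rule body → head is minimally resolved by removing any one body fact or by adding the head fact. A sketch with invented helper names, instantiated on the example's violation:

```python
# Illustrative sketch: enumerating minimal repairing options for a
# violation of a rule of the form  body_1 & ... & body_n -> head.

def repairing_options(body_facts, head_fact):
    """Delete any one body fact, or add the head fact."""
    options = [("remove", f) for f in body_facts]
    options.append(("add", head_fact))
    return options

# The violation from the example slide:
violation_body = [("P_Inst", ("Item1", "ST1", "geo:location")),
                  ("Dom", ("geo:location", "Sensor"))]
head = ("C_Inst", ("Item1", "Sensor"))

for option in repairing_options(violation_body, head):
    print(option)
```

This yields exactly the slide's three options; the preference then picks among them.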

Page 42: Data Repairing

Slide 42

Quality Assessment

Quality = “fitness for use”:
◦ Multi-dimensional, multi-faceted, context-dependent

Methodology for quality assessment:
◦ Dimensions: aspects of quality (accuracy, completeness, timeliness, …)
◦ Indicators: metadata values for measuring dimensions (e.g., last modification date, related to timeliness)
◦ Scoring functions: functions to quantify quality indicators (e.g., days since last modification date)
◦ Metrics: measures of dimensions (the result of a scoring function); can be combined
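The indicator, scoring function and metric layers can be illustrated for timeliness; the linear decay and the 365-day horizon are invented for the example:

```python
# Illustrative sketch of the quality-assessment chain: the "last
# modification date" indicator, scored as "days since last modification",
# combined into a timeliness metric in [0, 1].
from datetime import date

def days_since(last_modified, today):
    """Scoring function over the last-modification-date indicator."""
    return (today - last_modified).days

def timeliness(last_modified, today, horizon_days=365):
    """Metric: 1.0 for fresh data, decaying linearly to 0 at the horizon."""
    age = days_since(last_modified, today)
    return max(0.0, 1.0 - age / horizon_days)

print(timeliness(date(2012, 6, 1), date(2012, 12, 11)))
```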

Page 43: Data Repairing

Slide 43

Accuracy Evaluation

[Diagram: dbpedia:areaTotal, dbpedia:populationTotal and dbpedia:foundingDate triples from en.dbpedia, pt.dbpedia, fr.dbpedia (and the other sources) are fused/repaired into integrated data, which is compared against the Gold Standard from the Instituto Brasileiro de Geografia e Estatística (IBGE) to measure accuracy.]