Quality and Repair
Pablo N. Mendes (Freie Universität Berlin)
Giorgos Flouris (FORTH)
1st year review
Luxembourg, December 2011
11/02/11
Work Plan View WP2
[Gantt chart: timeline axis in project months, ticks up to 48; partner labels shown: FUB, KIT]
Task 2.1 Data quality assessment and repair
Task 2.2 Temporal, spatial and social aspects of data
Task 2.3 Recommendations for enhancing best practices for data publishing
D2.1 Conceptual model and best practices for high-quality data publishing
D2.2 Methods for quality repair
D2.3 Modelling and processing contextual aspects of data
D2.4 Update of D2.1
D2.5 Proof-of-concept evaluation for modelling space and time
D2.6 Methods for assessing the quality of sensor data
D2.7 Recommendations for contextual data publishing
Upcoming deliverables
Quality Assessment: D2.1 - Conceptual model and best practices for high-quality metadata publishing
Quality Enhancement: D2.2 - Methods for quality repair
Outline
Overview of Quality
Data Quality Framework
Quality Assessment
Quality Enhancement (Repair)
Quality
“Fitness for use.”
Joseph Juran. The Quality Control Handbook. McGraw-Hill, New York, 3rd edition, 1974.
Data Quality
• Multifaceted
  • accurate = high quality?
  • availability?
  • timeliness?
• Subjective
  • for me, for vacation planning, weekly updates are ok.
• Task-dependent
  • task: weather forecast
  • vacation planning or aviation?
  • data is not good if it is not available for online query
Data Quality Dimensions
Category: Dimensions
Intrinsic Dimensions: Accuracy, Consistency, Objectivity, Timeliness
Contextual Dimensions: Validity, Believability, Completeness, Understandability, Relevancy, Reputation, Verifiability, Amount of Data
Representational Dimensions: Interpretability, Rep. Conciseness, Rep. Consistency
Accessibility Dimensions: Availability, Response Time, Security
Presentation order
Data Quality Framework
[Diagram: Quality Assessment, Quality Enhancement]
ACCESSIBILITY
Dereferenceability
• Indicator: Dereferenceable URIs
  • “Resources identified by URIs that respond with RDF to HTTP requests?”
• Metrics:
  • for datasets (d) and for resources (r)
  • deref(d) = count(r | deref(r))
  • ratioderef(d) = deref(d) / no-deref(r)
• Recommendation:
  • Your URIs should be dereferenceable.
  • Prefer reusing URIs that are dereferenceable.
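The two dereferenceability metrics can be sketched as follows. This is a minimal illustration, not the project's implementation: the denominator written on the slide is ambiguous, so the sketch assumes the ratio is taken over all resources of the dataset, and the `probe` table is a hypothetical stand-in for real HTTP requests that ask for RDF.

```python
from typing import Callable, Iterable


def deref_count(resources: Iterable[str], is_deref: Callable[[str], bool]) -> int:
    """deref(d): number of resources r in d for which deref(r) holds."""
    return sum(1 for r in resources if is_deref(r))


def deref_ratio(resources: Iterable[str], is_deref: Callable[[str], bool]) -> float:
    """Share of dereferenceable resources among all resources of d (assumed reading)."""
    items = list(resources)
    if not items:
        return 0.0
    return deref_count(items, is_deref) / len(items)


# Hypothetical probe results; a real check would issue HTTP requests.
probe = {
    "http://example.org/a": True,
    "http://example.org/b": False,
    "http://example.org/c": True,
    "http://example.org/d": True,
}
dataset = list(probe)
print(deref_count(dataset, probe.get))
print(deref_ratio(dataset, probe.get))
```

Passing the check as a function keeps the metric independent of how dereferencing is actually tested (GET or HEAD).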
Access methods
• Indicator: Access methods
  • “Data is accessible in varied and recommended ways.”
• Metrics:
  • sample(d): {0,1} “example resource available for d”
  • endpoint(d): {0,1} “SPARQL endpoint available for d”
  • dump(d): {0,1} “RDF dumps available for d”
• Recommendation:
  • Provide as many access methods as possible
  • A sample resource provides a quick view into the type of data you serve.
  • SPARQL endpoints for clients to obtain part of the data
  • Dumps are cheaper than alternatives when bulk access is needed
Availability
• Indicator: Availability
  • “Average availability in time interval”
• Metrics:
  • avail(d,hour) = ∑{1..24} deref(sample(d)) / 24
  • Alternatively, httphead() instead of deref()
• Recommendation:
  • the higher the better
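As a small worked example of the availability metric, the sketch below averages 24 hourly probe results for the sample resource of a dataset. The probe values are hypothetical, and the function only assumes that each probe has already been reduced to success or failure (via a dereference or an HTTP HEAD request).

```python
from typing import Sequence


def availability(hourly_probes: Sequence[bool]) -> float:
    """Fraction of successful hourly probes of the sample resource over one day."""
    if len(hourly_probes) != 24:
        raise ValueError("expected one probe result per hour (24 values)")
    return sum(hourly_probes) / 24


# Hypothetical day: the sample resource failed to dereference in 3 of 24 hours.
probes = [True] * 21 + [False] * 3
print(availability(probes))
```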
Accessibility Dimensions
Dimension: Examples
Dereferenceability: HTTP GET / HEAD
Availability: hourly derefs
Access methods: URI, Bulk, SPARQL
Response time: timed deref
Robustness: requests per minute
Reachability: LOD cloud inlinks
REPRESENTATIONAL
Representational: Interpretability
• Indicator: Human/Machine interpretability
  • “URI is dereferenceable to human and machine readable formats”
• Metrics:
  • format(deref(r,f)) in {Fh ∪ Fm} : {0,1}
  • Fh = HTML, XHTML+RDFa, ...: {0,1}
  • Fm = NT, RDF/XML, ...: {0,1}
• Recommendation:
  • Resources should dereference at least to human-readable HTML and one widely adopted RDF serialization.
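The interpretability check amounts to a set-membership test on the format returned when dereferencing a resource. In the sketch below, the media-type strings used to stand for the slide's Fh and Fm sets are an assumption made for illustration, and Turtle is added to Fm only as an example of the "..." in the slide.

```python
from typing import Set

# Assumed media types standing in for the slide's format sets Fh and Fm.
F_HUMAN = {"text/html", "application/xhtml+xml"}
F_MACHINE = {"application/n-triples", "application/rdf+xml", "text/turtle"}


def interpretable(returned_format: str) -> int:
    """1 if the returned format is in Fh ∪ Fm, else 0."""
    return int(returned_format in F_HUMAN | F_MACHINE)


def meets_recommendation(formats_served: Set[str]) -> bool:
    """True if at least one human-readable and one RDF format are served."""
    return bool(formats_served & F_HUMAN) and bool(formats_served & F_MACHINE)


print(interpretable("text/html"))
print(meets_recommendation({"text/html", "text/turtle"}))
print(meets_recommendation({"text/html"}))
```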
Vocabulary understandability
• Schema understandability
  • “Schema terms are familiar to existing agents.”
• Metrics:
  • vocab-underst(d) = triples(v,d) * triples(v,D) / triples(D)
  • Alt: Page Rank (prob. that random surfer has found v)
• Recommendation:
  • Reuse widely deployed vocabularies.
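A worked instance of the vocabulary understandability formula may help. The sketch assumes that triples(v,d) counts the triples of dataset d that use vocabulary v, triples(v,D) counts those across all datasets D, and triples(D) is the total number of triples in D; the counts below are hypothetical.

```python
def vocab_underst(triples_v_in_d: int, triples_v_in_all: int, triples_all: int) -> float:
    """triples(v,d) * triples(v,D) / triples(D), following the slide's formula."""
    if triples_all <= 0:
        raise ValueError("triples(D) must be positive")
    return triples_v_in_d * triples_v_in_all / triples_all


# Hypothetical counts: 200 triples in d use v, 50,000 triples across D use v,
# and D holds 1,000,000 triples in total.
print(vocab_underst(200, 50_000, 1_000_000))
```

Under this reading, the score grows both with how heavily d relies on v and with how widely v is deployed elsewhere.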
Representational Dimensions
Dimension: Examples
Human/Machine Interpretability: HTML, RDF
Vocabulary Understandability: Vocabulary usage stats
Representational Conciseness: Triples / Byte
Contextual Dimensions
Completeness: full set of objects and attributes with respect to a task
Conciseness: amount of duplicate entries and redundant attributes
Coherence: how well instance data conforms to the schema
Contextual Dimensions
Verifiability: how easy is it to check the data? Provenance information can be used.
Validity: encodes context- or application-specific requirements
Intrinsic Dimensions
Accuracy
usually estimated; may be available for sensors
Timeliness
can use last update
Consistency
two or more values do not conflict with each other
Objectivity
Can be traced via provenance
Example: AEMET
Metadata entry: http://thedatahub.org/dataset/aemet
Example item: http://aemet.linkeddata.es/page/resource/WeatherStation/id08001?output=ttl
Access methods: Example URI, SPARQL, Bulk
Availability:
  Example URI: available
  SPARQL Endpoint: 100%
Format Interpretability:
  TTL=OK
  RDF/XML=OK
Verifiability: Published by third party
http://www4.wiwiss.fu-berlin.de/lodcloud/ckan/validator/validate.php?package=aemet
Data Quality Framework
[Diagram: Quality Assessment, Quality Enhancement]
Validity as a Quality Indicator
• Validity is an important quality indicator
  • Encodes context- or application-specific requirements
  • Applications may be useless over invalid data
  • Binary concept (valid/invalid)
• Two steps to guarantee validity (repair process):
  1. Identifying invalid ontologies (diagnosis)
     • Detecting invalidities in an automated manner
     • Subtask of Quality Assessment
  2. Removing invalidities (repair)
     • Repairing invalidities in an automated manner
     • Subtask of Quality Enhancement
Diagnosis
• Expressing validity using validity rules over an adequate relational schema
• Examples:
  • Properties must have a unique domain
    ∀p Prop(p) → ∃a Dom(p,a)
    ∀p,a,b Dom(p,a) ∧ Dom(p,b) → (a=b)
  • Correct classification in property instances
    ∀x,y,p,a P_Inst(x,y,p) ∧ Dom(p,a) → C_Inst(x,a)
    ∀x,y,p,a P_Inst(x,y,p) ∧ Rng(p,a) → C_Inst(y,a)
• Diagnosis reduced to relational queries
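To make "diagnosis reduced to relational queries" concrete, the sketch below stores each relation as a set of tuples and expresses every validity rule as a query that returns its violations. The dictionary layout, the function names and the toy ontology are assumptions made for illustration; they are not taken from the authors' system.

```python
def props_without_domain(o):
    """Violations of: every property has some domain."""
    with_dom = {p for (p, _a) in o["Dom"]}
    return sorted(p for (p,) in o["Prop"] if p not in with_dom)


def props_with_several_domains(o):
    """Violations of: a property has at most one domain."""
    domains = {}
    for p, a in o["Dom"]:
        domains.setdefault(p, set()).add(a)
    return sorted(p for p, ds in domains.items() if len(ds) > 1)


def misclassified(o, schema_rel, position):
    """Violations of the classification rules.

    position 0 checks subjects against "Dom", position 1 objects against "Rng".
    """
    return sorted(
        (x, y, p, a)
        for (x, y, p) in o["P_Inst"]
        for (q, a) in o[schema_rel]
        if p == q and ((x, y)[position], a) not in o["C_Inst"]
    )


# Hypothetical toy ontology (all names are illustrative only).
toy = {
    "Prop": {("ex:knows",), ("ex:age",)},
    "Dom": {("ex:knows", "ex:Person"), ("ex:knows", "ex:Agent")},
    "Rng": {("ex:knows", "ex:Person")},
    "C_Inst": {("ex:alice", "ex:Person")},
    "P_Inst": {("ex:alice", "ex:bob", "ex:knows")},
}
print(props_without_domain(toy))
print(props_with_several_domains(toy))
print(misclassified(toy, "Dom", 0))
print(misclassified(toy, "Rng", 1))
```

Each function is the Python analogue of a join with an anti-join, which is the sense in which diagnosis can be delegated to a relational engine.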
Example
Ontology O0:
  Class(Sensor), Class(SpatialThing), Class(Observation)
  Prop(geo:location)
  Dom(geo:location,Sensor)
  Rng(geo:location,SpatialThing)
  Inst(Item1), Inst(ST1)
  P_Inst(Item1,ST1,geo:location)
  C_Inst(Item1,Observation), C_Inst(ST1,SpatialThing)
Rule: correct classification in property instances
  ∀x,y,p,a P_Inst(x,y,p) ∧ Dom(p,a) → C_Inst(x,a)
[Figure: schema level (Sensor, SpatialThing, Observation) and data level (Item1, ST1), with geo:location]
Violation:
  Item1 geo:location ST1
  Sensor is the domain of geo:location
  Item1 is not a Sensor
Resolutions:
  P_Inst(Item1,ST1,geo:location) ∈ O0: Remove P_Inst(Item1,ST1,geo:location)
  C_Inst(Item1,Sensor) ∉ O0: Add C_Inst(Item1,Sensor)
  Dom(geo:location,Sensor) ∈ O0: Remove Dom(geo:location,Sensor)
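The example can be replayed in a few lines: detect the violation in O0, enumerate the three minimal resolutions, and confirm that each one removes the violation. The encoding of O0 as a dictionary of tuple sets and the helper names are assumptions of this sketch, and only the domain-classification rule is checked here.

```python
import copy

# Ontology O0 from the example, one set of tuples per relation.
O0 = {
    "Class": {("Sensor",), ("SpatialThing",), ("Observation",)},
    "Prop": {("geo:location",)},
    "Dom": {("geo:location", "Sensor")},
    "Rng": {("geo:location", "SpatialThing")},
    "Inst": {("Item1",), ("ST1",)},
    "P_Inst": {("Item1", "ST1", "geo:location")},
    "C_Inst": {("Item1", "Observation"), ("ST1", "SpatialThing")},
}


def domain_violations(o):
    """Tuples (x, y, p, a) with P_Inst(x,y,p) and Dom(p,a) but no C_Inst(x,a)."""
    return sorted(
        (x, y, p, a)
        for (x, y, p) in o["P_Inst"]
        for (q, a) in o["Dom"]
        if q == p and (x, a) not in o["C_Inst"]
    )


def minimal_resolutions(violation):
    """One change per atom of the violated rule: drop a premise or add the conclusion."""
    x, y, p, a = violation
    return [("-", "P_Inst", (x, y, p)), ("+", "C_Inst", (x, a)), ("-", "Dom", (p, a))]


def apply_change(o, change):
    """Return a repaired copy of o; the input ontology is left untouched."""
    sign, relation, fact = change
    repaired = copy.deepcopy(o)
    if sign == "+":
        repaired[relation].add(fact)
    else:
        repaired[relation].discard(fact)
    return repaired


violation = domain_violations(O0)[0]
for change in minimal_resolutions(violation):
    assert domain_violations(apply_change(O0, change)) == []
    print(change)
```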
Preferences for Repair
• Which repairing option is best?
  • Ontology engineer determines that via preferences
• Specified by ontology engineer beforehand
  • High-level “specifications” for the ideal repair
  • Serve as “instructions” to determine the preferred solution
Preferences (On Ontologies)
[Figure: O0 and its three candidate repairs O1, O2, O3; each resulting ontology is assigned a score (Score: 3, Score: 4, Score: 6)]
Preferences (On Deltas)
[Figure: O0 and its three candidate repairs O1, O2, O3; each delta is assigned a score (Score: 2, Score: 4, Score: 5)]
Deltas:
  -P_Inst(Item1,ST1,geo:location)
  +C_Inst(Item1,Sensor)
  -Dom(geo:location,Sensor)
Preferences
• Preferences on ontologies are result-oriented
  • Consider the quality of the repair result
  • Ignore the impact of repair
  • Popular options: prefer newest information, prefer trustable information
• Preferences on deltas are more impact-oriented
  • Consider the impact of repair
  • Ignore the quality of the repair result
  • Popular options: minimize schema changes, minimize addition/deletion of information, minimize delta size
• Two sides of the same coin (equivalent options)
• Quality metrics can be used for stating preferences
  • Metadata on the data may be needed
  • Can be qualitative or quantitative
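A delta-based preference can be sketched as a cost function over deltas, with the preferred repairs being those of minimal cost. The two cost functions below ("minimize schema changes" and "minimize deletions") are illustrative choices, the split into schema and data relations is an assumption, and the costs are not the scores shown in the figures.

```python
# Assumed split: these relations are schema-level, everything else is data-level.
SCHEMA_RELATIONS = {"Class", "Prop", "Dom", "Rng"}


def schema_changes(delta):
    """Number of changes in the delta that touch the schema."""
    return sum(1 for (_sign, rel, _fact) in delta if rel in SCHEMA_RELATIONS)


def deletions(delta):
    """Number of facts the delta removes."""
    return sum(1 for (sign, _rel, _fact) in delta if sign == "-")


def preferred(deltas, cost):
    """All deltas whose cost is minimal under the given preference."""
    best = min(cost(d) for d in deltas)
    return [d for d in deltas if cost(d) == best]


# The three candidate deltas of the running example.
candidates = [
    [("-", "P_Inst", ("Item1", "ST1", "geo:location"))],
    [("+", "C_Inst", ("Item1", "Sensor"))],
    [("-", "Dom", ("geo:location", "Sensor"))],
]

by_schema = preferred(candidates, schema_changes)
by_schema_then_deletions = preferred(
    candidates, lambda d: (schema_changes(d), deletions(d))
)
print(by_schema)
print(by_schema_then_deletions)
```

Combining criteria as a tuple gives a lexicographic preference: schema changes are minimized first, and deletions only break ties.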
Generalizing the Approach
• For one violated constraint
  1. Diagnose invalidity
  2. Determine minimal ways to resolve it
  3. Determine and return preferred resolution
• For many violated constraints
  • Problem becomes more complicated
  • More than one resolution step is required
• Issues:
  1. Resolution order
  2. When and how to filter non-preferred solutions?
  3. Constraint (and resolution) interdependencies
Constraint Interdependencies
• A given resolution may:
  • Cause other violations (bad)
  • Resolve other violations (good)
• Cannot pre-determine the best resolution
  • Difficult to predict the ramifications of each one
  • Exhaustive search required
  • Recursive, tree-based search (resolution tree)
• Two ways to create the resolution tree
  • Globally-preferred (GP), locally-preferred (LP)
  • When and how to filter non-preferred solutions?
Resolution Tree Creation (GP)
• Find all minimal resolutions for all the violated constraints, then find the preferred ones
• Globally-preferred (GP)
  • Find all minimal resolutions for one violation
  • Explore them all
  • Repeat recursively until consistent
  • Return the preferred leaves
[Figure: resolution tree; the preferred repairs among the leaves are returned]
Resolution Tree Creation (LP)
• Find the minimal and preferred resolutions for one violated constraint, then repeat for the next
• Locally-preferred (LP)
  • Find all minimal resolutions for one violation
  • Explore the preferred one(s)
  • Repeat recursively until consistent
  • Return all remaining leaves
[Figure: resolution tree; the preferred repair is returned]
Comparison (GP versus LP)
• Characteristics of GP
  • Exhaustive
  • Less efficient: large resolution trees
  • Always returns most preferred repairs
  • Insensitive to constraint syntax
  • Does not depend on resolution order
• Characteristics of LP
  • Greedy
  • More efficient: small resolution trees
  • Does not always return most preferred repairs
  • Sensitive to constraint syntax
  • Depends on resolution order
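The difference between the two strategies can be reproduced on a toy case. The sketch below is a simplified illustration, not the authors' algorithm: it handles a single rule (domain classification), stores facts as tagged tuples, and uses an assumed cost of 3 per schema change and 2 per data change. GP expands every resolution and filters the leaves once; LP prunes non-preferred branches at every step.

```python
def violations(state):
    """Violations of P_Inst(x,y,p) ∧ Dom(p,a) → C_Inst(x,a), in a fixed order."""
    found = []
    for f in sorted(state):
        if f[0] != "P_Inst":
            continue
        _tag, x, _y, p = f
        for g in sorted(state):
            if g[0] == "Dom" and g[1] == p and ("C_Inst", x, g[2]) not in state:
                found.append((f, g, ("C_Inst", x, g[2])))
    return found


def apply(state, change):
    """Apply one ("+"/"-", fact) change and return the new state."""
    sign, fact = change
    return state | {fact} if sign == "+" else state - {fact}


def repair(state, cost, local):
    """Return the deltas kept by LP (local=True) or GP (local=False)."""
    def expand(s, delta):
        vs = violations(s)
        if not vs:
            return [delta]  # consistent: this delta is a leaf of the tree
        p_inst, dom, c_inst = vs[0]
        options = [delta + (c,) for c in (("-", p_inst), ("+", c_inst), ("-", dom))]
        if local:  # LP: keep only the locally preferred branches
            best = min(cost(d) for d in options)
            options = [d for d in options if cost(d) == best]
        return [leaf for d in options for leaf in expand(apply(s, d[-1]), d)]

    leaves = expand(state, ())
    if not local:  # GP: filter once, over all leaves
        best = min(cost(d) for d in leaves)
        leaves = [d for d in leaves if cost(d) == best]
    return leaves


def cost(delta):
    # Assumed preference on deltas: schema changes (Dom) cost 3, data changes cost 2.
    return sum(3 if fact[0] == "Dom" else 2 for _sign, fact in delta)


# Two property instances violate the same domain declaration.
start = frozenset({
    ("Dom", "p", "A"),
    ("P_Inst", "i1", "s", "p"),
    ("P_Inst", "i2", "s", "p"),
})
gp = repair(start, cost, local=False)
lp = repair(start, cost, local=True)
print(gp, cost(gp[0]))
print(len(lp), [cost(d) for d in lp])
```

In this toy case GP finds the single cheapest repair (dropping the domain declaration fixes both violations at once), while LP commits to a cheap data-level fix for the first violation and ends up with costlier repairs, which matches the trade-off listed above.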
Algorithm and Complexity
• Detailed complexity analysis for GP/LP and various different types of constraints and preferences
• Inherently difficult problem
  • Exponential complexity (in general)
  • Main exception: LP is polynomial (in special cases)
• Theoretical complexity is misleading as to the actual performance of the algorithms
Performance in Practice
• Performance in practice
  • Linear with respect to ontology size
  • Linear with respect to tree size
    • Types of violated constraints (tree width)
    • Number of violations (tree height): causes the exponential blowup
    • Constraint interdependencies (tree height)
    • Preference (for LP): affects pruning (tree width)
• Further performance improvement
  • Use optimizations
  • Use LP with restrictive preference
Evaluation Parameters
• Evaluation
  1. Effect of ontology size (for GP/LP)
  2. Effect of tree size (for GP/LP)
  3. Effect of violations (for GP/LP)
  4. Effect of preference (relevant for LP only)
  5. Quality of LP repairs
• Preliminary results support our claims:
  • Linear with respect to ontology size
  • Linear with respect to tree size
Publications
Yannis Roussakis, Giorgos Flouris, Vassilis Christophides. Declarative Repairing Policies for Curated KBs. In Proceedings of the 10th Hellenic Data Management Symposium (HDMS-11), 2011
Yannis Roussakis, Giorgos Flouris, Vassilis Christophides. Preference-Based Repairing of RDF/S DBs. Tentative title, to be submitted to PVLDB, January 2012
Outlook
• Continue refining model based on experience with data sets catalog
• Derive “best practices checks” from metrics
• Results of quality assessment to be added to next release of the catalog
• Collaboration with EU-funded LOD2 (FP7) towards Data Fusion based on the PlanetData Quality Framework
• Finalize experiments for Data Repair