On Provenance of Queries on Linked Web Data
Yannis Theoharis (1,2), Irini Fundulaki (2), Grigoris Karvounarakis (3,2) and Vassilis Christophides (1,2)
(1) Institute of Computer Science, FORTH
(2) Computer Science Department, University of Crete
(3) LogicBox, USA
What is “Linked Data”
W3C Linking Open Data
publish various open datasets as RDF on the Web
set RDF typed links between data items from different data sources.
Motivation: Linked Data Processing
Data is:
  fetched from heterogeneous sources
  integrated
  materialized in RDF
  made available via SPARQL
Range of computations:
  SPARQL queries
  Complex programs (logic or procedural)
Provenance Aware Applications
  Trust assessment: trustworthiness
  Access control: confidentiality level
  Data cleaning: validity
  Curated databases: source data origin
All these applications need to represent and store the relation between the input and the output of data processes.
With provenance, some of them gain efficiency; others are impossible without provenance.
Data Provenance Models
Annotation Models: annotation computation is coupled with a particular application and a particular assignment of source data annotations.
Example with boolean trust (t: trusted, f: untrusted):

R1
X  Y  Annot.
a  b  t
c  d  t

R2
Y  Z  Annot.
b  e  t

R1 ⋈ R2
X  Y  Z  Annot.
a  b  e  t Λ t = t

If the annotation of (b, e) changes from t to f, the annotation of (a, b, e) becomes t Λ f = f. Obtaining it requires query recomputation!

Abstract Provenance Models: abstract provenance tokens and operators are substituted by appropriate concrete tokens for a particular application and assignment.

R1
X  Y  Annot.
a  b  c1
c  d  c2

R2
Y  Z  Annot.
b  e  c3

R1 ⋈ R2
X  Y  Z  Annot.
a  b  e  c1 * c3
This Talk
“Can previous work on abstract provenance models be leveraged for SPARQL?”
NO: due to the OPTIONAL operator (similar to the SQL left outer join)
YES: for the positive fragment of SPARQL (without OPTIONAL)
We present our ongoing work on a SPARQL abstract provenance model.
Challenge: to capture the form of negation that OPTIONAL introduces
Outline
SPARQL algebra
Abstract Provenance Models for Positive SPARQL
Limitations of Previous Models
Towards a SPARQL Provenance Model
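SPARQL (1/2)
SPARQL: the W3C Recommendation language for querying RDF data.
Evaluation pipeline: triple patterns, e.g. (?x, ?y, e), are matched against the triple set and yield mappings, e.g. {(?x,d),(?y,b)} and {(?x,f),(?y,g)}; mappings are composed and filtered into new mappings; the query form (Select, Construct/Describe) produces the result.
In the triple pattern (?x, ?y, e), e is a constant and ?x, ?y are variables.

Triple Set
S  P  O
a  b  c
d  b  e
f  g  e

Ω1: mappings for (?x, ?y, e)
    ?x  ?y
μ1  d   b
μ2  f   g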
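The matching of the triple pattern (?x, ?y, e) against the triple set can be sketched as follows. This is a minimal sketch under assumptions made here: mappings are Python dicts, variables are strings starting with "?", and `match_pattern` is a hypothetical helper name.

```python
def match_pattern(pattern, triples):
    """Return the list (bag) of mappings obtained by matching one triple
    pattern against a collection of triples."""
    mappings = []
    for triple in triples:
        mapping = {}
        ok = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable that occurs twice must bind to the same value.
                if mapping.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                # A constant must match the triple component exactly.
                ok = False
                break
        if ok:
            mappings.append(mapping)
    return mappings

triples = [("a", "b", "c"), ("d", "b", "e"), ("f", "g", "e")]
omega1 = match_pattern(("?x", "?y", "e"), triples)
print(omega1)
```

The two mappings in `omega1` correspond to μ1 and μ2 in the example.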
SPARQL (2/2)
SPARQL algebra defines 5 operators on mapping bags
Unary ops: π (projection), σ (selection, also called filtering)
Binary ops: U (union), ⋈ (join), ⟕ (optional)

Ω
?x  ?y
a   b
a   c
d   e

π?x (Ω)
    ?x
μ1  a    card(μ1) = 2
μ2  d    card(μ2) = 1
Positive SPARQL (SPARQL+)

σ?x=a (Ω), for the Ω with mappings (a, b), (a, c), (d, e) over ?x, ?y
?x  ?y
a   b
a   c

Ω1
    ?x  ?y
μ1  a   b

Ω2
    ?x  ?z
μ2  c   d

Ω1 U Ω2
    ?x  ?y  ?z
μ1  a   b   -
μ2  c   -   d
?z is unbound in μ1
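Projection, selection and union on mapping bags can be sketched with `collections.Counter`, which keeps a cardinality per mapping. The representation (a mapping as a frozenset of variable/value pairs) and the helper names are assumptions of this sketch, not part of the talk.

```python
from collections import Counter

def freeze(mapping):
    """Hashable form of a mapping given as a dict of variable -> value."""
    return frozenset(mapping.items())

def project(bag, variables):
    """Keep only the given variables; cardinalities of merged mappings add up."""
    out = Counter()
    for m, n in bag.items():
        out[frozenset((v, x) for v, x in m if v in variables)] += n
    return out

def select(bag, predicate):
    """Keep the mappings that satisfy the filter predicate."""
    return Counter({m: n for m, n in bag.items() if predicate(dict(m))})

def union(bag1, bag2):
    """Bag union: cardinalities are added."""
    return bag1 + bag2

omega = Counter([freeze({"?x": "a", "?y": "b"}),
                 freeze({"?x": "a", "?y": "c"}),
                 freeze({"?x": "d", "?y": "e"})])

projected = project(omega, {"?x"})                    # card(a) = 2, card(d) = 1
selected = select(omega, lambda m: m["?x"] == "a")    # (a, b) and (a, c)
unioned = union(Counter([freeze({"?x": "a", "?y": "b"})]),
                Counter([freeze({"?x": "c", "?z": "d"})]))
```

An unbound variable is simply absent from the mapping, which is how the "-" entries of the union example are modelled here.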
Ω1
    ?x  ?y
μ1  a   b
μ2  c   d
μ3  e   -

Ω2
    ?y  ?z
μ4  b   f

Ω1 ⋈ Ω2
    ?x  ?y  ?z
μ5  a   b   f    μ5 = μ1 U μ4
μ6  e   b   f    μ6 = μ3 U μ4

μ and μ’ are compatible (μ ~ μ’) if they agree on their common variables
μ1 ~ μ4
μ3 ~ μ4
μ2 ≁ μ4 (they disagree on ?y)
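The compatibility test and the join of the example can be sketched as below. This is a minimal sketch assuming mappings are dicts in which an unbound variable is absent; `compatible` and `join` are names chosen here.

```python
def compatible(m1, m2):
    """m1 ~ m2: the mappings agree on every variable bound in both."""
    return all(m2[v] == x for v, x in m1.items() if v in m2)

def join(omega1, omega2):
    """Merge every compatible pair of mappings."""
    return [{**m1, **m2} for m1 in omega1 for m2 in omega2
            if compatible(m1, m2)]

mu1 = {"?x": "a", "?y": "b"}
mu2 = {"?x": "c", "?y": "d"}
mu3 = {"?x": "e"}               # ?y is unbound in mu3
mu4 = {"?y": "b", "?z": "f"}

joined = join([mu1, mu2, mu3], [mu4])
print(joined)
```

μ3 joins with μ4 because it has no binding for ?y, so the two mappings share no variable on which they could disagree.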
Ω1
    ?x  ?y
μ1  a   b
μ2  c   d

Ω2
    ?y  ?z
μ3  b   f

Ω1 ⟕ Ω2
    ?x  ?y  ?z
μ4  a   b   f    μ4 = μ1 U μ3, from Ω1 ⋈ Ω2
μ2  c   d   -    from Ω1 \ Ω2
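OPTIONAL can be sketched as the union of a join and a difference, following the identity Ω1 ⟕ Ω2 = (Ω1 ⋈ Ω2) U (Ω1 \ Ω2) used later in the talk. The dict representation and the function names are assumptions of this sketch.

```python
def compatible(m1, m2):
    """m1 ~ m2: the mappings agree on every variable bound in both."""
    return all(m2[v] == x for v, x in m1.items() if v in m2)

def join(omega1, omega2):
    return [{**m1, **m2} for m1 in omega1 for m2 in omega2
            if compatible(m1, m2)]

def difference(omega1, omega2):
    """Mappings of omega1 that are compatible with no mapping of omega2."""
    return [m1 for m1 in omega1
            if not any(compatible(m1, m2) for m2 in omega2)]

def optional(omega1, omega2):
    return join(omega1, omega2) + difference(omega1, omega2)

omega1 = [{"?x": "a", "?y": "b"}, {"?x": "c", "?y": "d"}]
omega2 = [{"?y": "b", "?z": "f"}]
result = optional(omega1, omega2)
print(result)
```

The `not any(...)` in `difference` is the form of negation that the rest of the talk is concerned with.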
Abstract Provenance Models
Abstract provenance models encode the query operators at different levels of detail
Expressiveness vs efficiency (annotation storage and computation time)
Provenance models, from most informative to least informative:
  How
  Trio
  Why
  Lineage
Abstract Provenance Models for SPARQL+
Previous models are defined for positive relational algebra
Positive relational operators are monotonic
The addition (removal) of a tuple can only result in additional (removed) tuples in the output
This also holds for SPARQL+ (projection, union, join)
Previous models suffice for SPARQL+
Boolean trust assessment (SPARQL)
Assumptions: boolean trust semantics, set semantics on trusted mappings

Ω1
    ?x  ?y
μ1  d   b
μ2  f   g

Ω2
    ?y  ?z
μ3  b   c
μ4  e   h

Trusted: μ1, μ2, μ3, μ4

Ω1 \ Ω2
    ?x  ?y
μ2  f   g

Ω1 ⟕ Ω2
    ?x  ?y  ?z
μ5  d   b   c
μ2  f   g   -

⟕ and \ are not monotonic: μ3 becomes untrusted
Trusted: μ1, μ2, μ4

Ω1 \ Ω2
    ?x  ?y  ?z
μ1  d   b   -
μ2  f   g   -

Ω1 ⟕ Ω2
    ?x  ?y  ?z
μ1  d   b   -
μ2  f   g   -

μ5 becomes untrusted and μ1 becomes trusted in Ω1 ⟕ Ω2
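The non-monotonic behaviour can be reproduced by evaluating OPTIONAL over the trusted mappings only, once with μ3 and once without it. This is a sketch under the assumptions stated on the slide (boolean trust, set semantics on trusted mappings); the dict representation and the `optional` helper are choices made here.

```python
def compatible(m1, m2):
    """m1 ~ m2: the mappings agree on every variable bound in both."""
    return all(m2[v] == x for v, x in m1.items() if v in m2)

def optional(omega1, omega2):
    """Left outer join on mappings: extend m1 with each compatible m2,
    or keep m1 unchanged when no mapping of omega2 is compatible."""
    out = []
    for m1 in omega1:
        partners = [m2 for m2 in omega2 if compatible(m1, m2)]
        if partners:
            out.extend({**m1, **m2} for m2 in partners)
        else:
            out.append(dict(m1))
    return out

mu1 = {"?x": "d", "?y": "b"}
mu2 = {"?x": "f", "?y": "g"}
mu3 = {"?y": "b", "?z": "c"}
mu4 = {"?y": "e", "?z": "h"}

before = optional([mu1, mu2], [mu3, mu4])   # all four mappings trusted
after = optional([mu1, mu2], [mu4])         # mu3 untrusted, so left out
print(before)
print(after)
```

Removing an input mapping (μ3) adds an output mapping (μ1), which a monotonic operator could never do.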
Perm

Ω1
    ?x  ?y
μ1  d   b
μ2  f   g

Ω2
    ?y  ?z
μ3  b   c
μ4  e   h

Ω1 \ Ω2
?x  ?y  ?y2  ?z2
f   g   b    c
f   g   e    h
Intuitively, (f, g) is in Ω1 \ Ω2 because it is compatible with neither μ3 nor μ4

Ω1 ⟕ Ω2
?x  ?y  ?z  ?x1  ?y1  ?y2  ?z2
d   b   c   d    b    b    c
f   g   -   f    g    b    c
f   g   -   f    g    e    h
(d, b, c) is in Ω1 ⟕ Ω2 due to the join between μ1 and μ3

If μ3 becomes untrusted, Perm infers that (d, b, c) becomes untrusted, but cannot infer that (d, b, -) should become trusted
RDF Meta Knowledge & M-semirings

Ω1
    ?x  ?y  Annot.
μ1  d   b   c1
μ2  f   g   c2

Ω2
    ?y  ?z  Annot.
μ3  b   c   c3
μ4  e   h   c4

Ω1 \ Ω2
    ?x  ?y  RDF MK             M-semirings
μ2  f   g   c2 Λ ¬(c3 V c4)    c2 - 0 = c2

Ω1 ⟕ Ω2
    ?x  ?y  ?z  RDF MK             M-semirings
μ5  d   b   c   c1 Λ c3            c1 * c3
μ2  f   g   -   c2 Λ ¬(c3 V c4)    c2

Like Perm, RDF Meta Knowledge and M-semirings infer that μ5 is untrusted when μ3 becomes untrusted, but cannot infer that μ1: (d, b, -) is trusted.
A Third Operation for Compatibility (1/2)
Take compatible mappings into account
Only one of μ1 and μ5 can appear in the result
Keep provenance information for both of them!

Ω1
    ?x  ?y  Annot.
μ1  d   b   c1
μ2  f   g   c2

Ω2
    ?y  ?z  Annot.
μ3  b   c   c3
μ4  e   h   c4

Ω1 ⟕ Ω2 = (Ω1 ⋈ Ω2) U (Ω1 \ Ω2)
    ?x  ?y  ?z  How      SPARQL Prov.
μ5  d   b   c   c1*c3    c1*c3
μ1  d   b   -   No Info  c1*A(μ1, μ3)
μ2  f   g   -   c2       c2

With boolean trust (t: trusted, f: untrusted):
A(μ1, μ3) = f, if μ1 ~ μ3 and c3 = t
            t, else
Boolean evaluation: (t Λ t) = t, (t Λ f) = f
A Third Operation for Compatibility (2/2)
A is a binary operator on mappings
It determines whether the mapping exists in the result or not
If yes, its provenance equals the positive provenance part, e.g. c1 for c1*A(μ1, μ3)

Ω1 ⟕ Ω2 = (Ω1 ⋈ Ω2) U (Ω1 \ Ω2)
    ?x  ?y  ?z  How      SPARQL Prov.
μ5  d   b   c   c1*c3    c1*c3
μ1  d   b   -   No Info  c1*A(μ1, μ3)
μ2  f   g   -   c2       c2

In general,
A(μ1, μ3) = 0, if μ1 ~ μ3 and c3 ≠ 0
            1, else
0: the neutral element for +
1: the neutral element for *
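The effect of A can be sketched for the boolean case, where 0 is False, 1 is True and * is logical AND. The function signature below (passing the annotation of the second mapping explicitly) is an assumption of this sketch, since the slides leave the lookup from a mapping to its annotation implicit.

```python
def compatible(m1, m2):
    """m1 ~ m2: the mappings agree on every variable bound in both."""
    return all(m2[v] == x for v, x in m1.items() if v in m2)

def A(m1, m2, c2, zero, one):
    """Compatibility operator as defined on the slide: 0 if m1 ~ m2 and
    the annotation c2 of m2 is not 0, otherwise 1."""
    return zero if compatible(m1, m2) and c2 != zero else one

mu1 = {"?x": "d", "?y": "b"}
mu3 = {"?y": "b", "?z": "c"}

def provenance(c1, c3, zero=False, one=True):
    """Evaluate both expressions of the example in the boolean semiring."""
    times = lambda x, y: x and y
    return {
        "(d, b, c)": times(c1, c3),                          # c1 * c3
        "(d, b, -)": times(c1, A(mu1, mu3, c3, zero, one)),  # c1 * A(mu1, mu3)
    }

mu3_trusted = provenance(c1=True, c3=True)
mu3_untrusted = provenance(c1=True, c3=False)
print(mu3_trusted)
print(mu3_untrusted)
```

Exactly one of (d, b, c) and (d, b, -) evaluates to trusted in each case, and switching c3 flips which one, with no query recomputation.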
SPARQL Provenance Operators
Two types of operators
on provenance tokens, i.e. + and * (for SPARQL+)
on mappings, i.e. A (for ⟕ and \)
Good news:
Every triple of the dataset is uniquely annotated.
Why not use annotations as mapping identifiers in A?
Due to the projection operator…
Enrich Tokens with Schema Information
Use tokens (c1, c2…) as mapping ids in A expressions
A(c1, c2) = 0, if μ1 ~ μ2 and c2 ≠ 0
            1, else
But μ1 ≁ μ2 might hold, while π?y,?z (μ1) ~ π?y,?z (μ2)
Tokens don’t suffice: keep token-schema pairs

Ω
    ?x  ?y  ?z  Prov.
μ1  a   b   c   (c1, {?x, ?y, ?z})
μ2  d   b   -   (c2, {?x, ?y, ?z})

π?y,?z (Ω)
?y  ?z  Prov.
b   c   (c1, {?y, ?z})
b   -   (c2, {?y, ?z})

A( (c1, S1), (c2, S2) ) = 0, if πS1 (μ1) ~ πS2 (μ2) and c2 ≠ 0
                          1, else
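The role of the schema component can be sketched as follows. The lookup tables `mappings` (token to source mapping) and `values` (token to concrete annotation) are hypothetical helpers introduced for this sketch; the talk does not say how tokens are resolved.

```python
def compatible(m1, m2):
    """m1 ~ m2: the mappings agree on every variable bound in both."""
    return all(m2[v] == x for v, x in m1.items() if v in m2)

def restrict(mapping, schema):
    """Projection of a single mapping onto the variables in schema."""
    return {v: x for v, x in mapping.items() if v in schema}

def A(pair1, pair2, mappings, values, zero=0, one=1):
    """A on token-schema pairs: compatibility is tested on the mappings
    restricted to the schema recorded with each token."""
    (c1, s1), (c2, s2) = pair1, pair2
    same = compatible(restrict(mappings[c1], s1), restrict(mappings[c2], s2))
    return zero if same and values[c2] != zero else one

mappings = {"c1": {"?x": "a", "?y": "b", "?z": "c"},
            "c2": {"?x": "d", "?y": "b"}}   # ?z is unbound in mu2
values = {"c1": 1, "c2": 1}

full = {"?x", "?y", "?z"}
proj = {"?y", "?z"}
before_projection = A(("c1", full), ("c2", full), mappings, values)
after_projection = A(("c1", proj), ("c2", proj), mappings, values)
print(before_projection, after_projection)
```

On the full schema the two mappings disagree on ?x, so A yields 1; after projecting on ?y, ?z they become compatible and A yields 0. A token alone cannot distinguish the two situations.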
Towards a SPARQL Provenance Model
Define an algebra on token-schema pairs
3 operations
2 for SPARQL operators
1 for compatibility
What if there is no projection (or projection is not allowed to be pushed down)?
annotations suffice (no need for schema information),
but the compatibility operator is still needed
What if there is no OPTIONAL?
previous models suffice, e.g. How
Future Work
SPARQL Provenance Model
Extend model expressiveness to capture other computations on Linked Data
Logic explanations
Implementation
Questions ?