On Provenance of Queries on Linked Web Data

Yannis Theoharis (1,2), Irini Fundulaki (2), Grigoris Karvounarakis (3,2) and Vassilis Christophides (1,2)

(1) Institute of Computer Science, FORTH

(2) Computer Science Department, University of Crete

(3) LogicBox, USA

What is “Linked Data”

W3C Linking Open Data

publish various open datasets as RDF on the Web

set RDF typed links between data items from different data sources.

Motivation: Linked Data Processing

Data is:

fetched from heterogeneous sources

integrated

materialized in RDF

made available via SPARQL

Range of computations

SPARQL queries

Complex programs (logic or procedural)

Provenance Aware Applications

Trust assessment: trustworthiness

Access control: confidentiality level

Data cleaning: validity

Curated databases: source data origin

All these applications need to represent and store the relation of the input with the output of data processes

gain efficiency

impossible without provenance

Data Provenance Models

Annotation Models: annotation computation is coupled with a particular application and a particular assignment of source data annotations

Example (t: trusted, f: untrusted)

R1:
X  Y  Annot.
a  b  t
c  d  t

R2:
Y  Z  Annot.
b  e  t (or f)

R1 ⋈ R2:
X  Y  Z  Annot.
a  b  e  t ∧ t = t (or t ∧ f = f)

With annotation models, a change of a source annotation (e.g. from t to f) requires query recomputation!

Abstract Provenance Models: abstract provenance tokens and operators are substituted by appropriate concrete tokens for a particular application and assignment

R1:
X  Y  Annot.
a  b  c1
c  d  c2

R2:
Y  Z  Annot.
b  e  c3

R1 ⋈ R2:
X  Y  Z  Annot.
a  b  e  c1 * c3

Substituting c1 = t, c3 = t gives t ∧ t = t; substituting c1 = t, c3 = f gives t ∧ f = f
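As a minimal sketch of why abstract provenance avoids recomputation, the expression c1 * c3 attached to (a, b, e) can be stored once and re-evaluated under different trust assignments. The tuple encoding of expressions and the names `evaluate`, `trust` and `PROV_ABE` are assumptions of this sketch, not part of any of the models discussed.

```python
def evaluate(expr, assignment, plus, times):
    """Evaluate a provenance expression under concrete token values.

    expr is either a token name (str) or a tuple (op, left, right)
    with op in {"+", "*"}; this encoding is an assumption of the sketch.
    """
    if isinstance(expr, str):
        return assignment[expr]
    op, left, right = expr
    lhs = evaluate(left, assignment, plus, times)
    rhs = evaluate(right, assignment, plus, times)
    return times(lhs, rhs) if op == "*" else plus(lhs, rhs)


def trust(expr, assignment):
    """Boolean trust: '+' is read as OR and '*' as AND."""
    return evaluate(expr, assignment,
                    plus=lambda x, y: x or y,
                    times=lambda x, y: x and y)


# Provenance of the output tuple (a, b, e) in the join of R1 and R2.
PROV_ABE = ("*", "c1", "c3")

all_trusted = {"c1": True, "c2": True, "c3": True}
c3_untrusted = {**all_trusted, "c3": False}

print(trust(PROV_ABE, all_trusted))   # True
print(trust(PROV_ABE, c3_untrusted))  # False
```

Only the stored expression is re-evaluated when c3 changes; the join itself is not run again.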

This Talk

“Can previous work on abstract provenance models be leveraged for SPARQL?”

NO: due to the OPTIONAL operator (similar to the SQL left outer join)

YES: for the positive fragment of SPARQL (without OPTIONAL)

We present our ongoing work on a SPARQL abstract provenance model.

Challenge: to capture the form of negation that OPTIONAL introduces

Outline

SPARQL algebra

Abstract Provenance Models for Positive SPARQL

Limitations of Previous Models

Towards a SPARQL Provenance Model

SPARQL (1/2)

SPARQL: the W3C Recommendation language for querying RDF data.

[Figure: SPARQL evaluation pipeline. Triple patterns, e.g. (?x, ?y, e), yield mappings, e.g. {(?x,d),(?y,b)} and {(?x,f),(?y,g)}; Compose and Filter yield mappings { … }; Select or Construct/Describe yields the result.]

Triple Set:
S  P  O
a  b  c
d  b  e
f  g  e

Triple pattern (?x, ?y, e): ?x and ?y are variables, e is a constant

Ω1:
     ?x  ?y
μ1   d   b
μ2   f   g
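The evaluation of a triple pattern against a triple set can be sketched as follows. This is only an illustration under simple assumptions: terms are plain strings, variables start with "?", and a mapping is a Python dict. The names `match`, `TRIPLES` and `OMEGA1` are hypothetical.

```python
def match(pattern, triples):
    """Return the mappings (dicts) that bind the pattern's variables."""
    mappings = []
    for triple in triples:
        mu = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable: bind it, or check an earlier binding.
                if mu.setdefault(term, value) != value:
                    break
            elif term != value:
                # A constant that does not match this triple.
                break
        else:
            mappings.append(mu)
    return mappings


TRIPLES = [("a", "b", "c"), ("d", "b", "e"), ("f", "g", "e")]
OMEGA1 = match(("?x", "?y", "e"), TRIPLES)
print(OMEGA1)  # [{'?x': 'd', '?y': 'b'}, {'?x': 'f', '?y': 'g'}]
```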

SPARQL (2/2)

SPARQL algebra defines 5 operators on mapping bags

Unary ops: π (projection), σ (selection, also called filtering)

Binary ops: ∪ (union), ⋈ (join), ⟕ (optional)

Ω:
?x  ?y
a   b
a   c
d   e

π?x (Ω):
     ?x
μ1   a    card(μ1) = 2
μ2   d    card(μ2) = 1
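Because the operators work on bags, projection has to add up the cardinalities of mappings that collapse to the same result. A small sketch, assuming mappings are encoded as tuples of (variable, value) pairs so that they can be used as `Counter` keys; the names `project` and `OMEGA` are hypothetical.

```python
from collections import Counter


def project(omega, variables):
    """Bag projection: restrict each mapping and sum the cardinalities."""
    result = Counter()
    for mu, card in omega.items():
        restricted = tuple((var, val) for var, val in mu if var in variables)
        result[restricted] += card
    return result


OMEGA = Counter({
    (("?x", "a"), ("?y", "b")): 1,
    (("?x", "a"), ("?y", "c")): 1,
    (("?x", "d"), ("?y", "e")): 1,
})

projected = project(OMEGA, {"?x"})
print(projected[(("?x", "a"),)])  # 2
print(projected[(("?x", "d"),)])  # 1
```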

Positive SPARQL (SPARQL+)

σ?x=a (Ω):
?x  ?y
a   b
a   c

Ω1:
     ?x  ?y
μ1   a   b

Ω2:
     ?x  ?z
μ2   c   d

Ω1 ∪ Ω2:
     ?x  ?y  ?z
μ1   a   b   -
μ2   c   -   d

?z is unbound in μ1

Ω1:
     ?x  ?y
μ1   a   b
μ2   c   d
μ3   e   -

Ω2:
     ?y  ?z
μ4   b   f

Ω1 ⋈ Ω2:
                 ?x  ?y  ?z
μ5 = μ1 ∪ μ4     a   b   f
μ6 = μ3 ∪ μ4     e   b   f

μ and μ’ are compatible (μ ~ μ’) if they agree on their common variables

μ1 ~ μ4

μ3 ~ μ4

μ2 ≁ μ4 (they disagree on ?y)
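The compatibility test and the join can be sketched in a few lines. The sketch assumes that a mapping is a dict and that an unbound variable is simply absent from it; the function names are hypothetical.

```python
def compatible(mu1, mu2):
    """True if the two mappings agree on every variable they share."""
    return all(mu2[var] == val for var, val in mu1.items() if var in mu2)


def join(omega1, omega2):
    """Merge every pair of compatible mappings."""
    return [{**mu1, **mu2}
            for mu1 in omega1
            for mu2 in omega2
            if compatible(mu1, mu2)]


OMEGA1 = [{"?x": "a", "?y": "b"},   # mu1
          {"?x": "c", "?y": "d"},   # mu2
          {"?x": "e"}]              # mu3, ?y is unbound
OMEGA2 = [{"?y": "b", "?z": "f"}]   # mu4

print(join(OMEGA1, OMEGA2))
# [{'?x': 'a', '?y': 'b', '?z': 'f'}, {'?x': 'e', '?y': 'b', '?z': 'f'}]
```

Note that the mapping with ?x = e joins with the single mapping of the second set because the two share no bound variable, so there is nothing to disagree on.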

Ω1:
     ?x  ?y
μ1   a   b
μ2   c   d

Ω2:
     ?y  ?z
μ3   b   f

Ω1 ⟕ Ω2:
                 ?x  ?y  ?z
μ4 = μ1 ∪ μ3     a   b   f    (from Ω1 ⋈ Ω2)
μ2               c   d   -    (from Ω1 \ Ω2)
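The optional operator can be read as the union of a join and a difference, where the difference keeps the mappings of the left operand that are compatible with no mapping of the right operand. A minimal set-style sketch, assuming dict mappings with unbound variables left out (cardinalities are ignored here); the function names are hypothetical.

```python
def compatible(mu1, mu2):
    """True if the two mappings agree on every variable they share."""
    return all(mu2[var] == val for var, val in mu1.items() if var in mu2)


def join(omega1, omega2):
    return [{**mu1, **mu2}
            for mu1 in omega1 for mu2 in omega2 if compatible(mu1, mu2)]


def difference(omega1, omega2):
    """Mappings of omega1 that are compatible with no mapping of omega2."""
    return [mu1 for mu1 in omega1
            if not any(compatible(mu1, mu2) for mu2 in omega2)]


def optional(omega1, omega2):
    """Left outer join: the join plus the unmatched left mappings."""
    return join(omega1, omega2) + difference(omega1, omega2)


OMEGA1 = [{"?x": "a", "?y": "b"}, {"?x": "c", "?y": "d"}]
OMEGA2 = [{"?y": "b", "?z": "f"}]

print(optional(OMEGA1, OMEGA2))
# [{'?x': 'a', '?y': 'b', '?z': 'f'}, {'?x': 'c', '?y': 'd'}]
```

The `not any(...)` in `difference` is the form of negation that makes the operator non-monotonic: removing a mapping from the right operand can add mappings to the result.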

Abstract Provenance Models

Abstract provenance models encode the query operators at different levels of detail

Expressiveness vs. efficiency (annotation storage and computation time)

[Figure: the SPARQL evaluation pipeline (triple patterns, mappings, Compose, Filter, Select), here annotated with provenance]

Provenance models, from most informative to least informative: How, Trio, Why, Lineage

Abstract Provenance Models for SPARQL+

Previous models are defined for positive relational algebra

Positive relational operators are monotonic

The addition (removal) of a tuple can only result in additional (removed) tuples in the output

This also holds for SPARQL+ (projection, union, join)

Previous models suffice for SPARQL+

boolean trust semantics

set semantics on trusted mappings

Boolean trust assessment (SPARQL)

Ω1:
     ?x  ?y
μ1   d   b
μ2   f   g

Ω2:
     ?y  ?z
μ3   b   c
μ4   e   h

Trusted: μ1, μ2, μ3, μ4

Ω1 \ Ω2:
     ?x  ?y
μ2   f   g

Ω1 ⟕ Ω2:
     ?x  ?y  ?z
μ5   d   b   c
μ2   f   g   -

⟕ and \ are not monotonic: μ3 becomes untrusted

Trusted: μ1, μ2, μ4

Ω1 \ Ω2:
     ?x  ?y  ?z
μ1   d   b   -
μ2   f   g   -

Ω1 ⟕ Ω2:
     ?x  ?y  ?z
μ1   d   b   -
μ2   f   g   -

μ5 becomes untrusted and μ1 becomes trusted in Ω1 ⟕ Ω2

Perm

Ω1:
     ?x  ?y
μ1   d   b
μ2   f   g

Ω2:
     ?y  ?z
μ3   b   c
μ4   e   h

Ω1 \ Ω2:
?x  ?y  ?y2  ?z2
f   g   b    c
f   g   e    h

Intuitively, (f, g) is in Ω1 \ Ω2 because it is compatible with neither μ3 nor μ4

Ω1 ⟕ Ω2:
?x  ?y  ?z  ?x1  ?y1  ?y2  ?z2
d   b   c   d    b    b    c
f   g   -   f    g    b    c
f   g   -   f    g    e    h

(d, b, c) is in Ω1 ⟕ Ω2 due to the join between μ1 and μ3

If μ3 becomes untrusted, Perm infers that (d, b, c) becomes untrusted, but cannot infer that (d, b, -) should become trusted

RDF Meta Knowledge & M-semirings

Ω1:
     ?x  ?y  Annot.
μ1   d   b   c1
μ2   f   g   c2

Ω2:
     ?y  ?z  Annot.
μ3   b   c   c3
μ4   e   h   c4

Ω1 \ Ω2:
     ?x  ?y  RDF MK             M-semirings
μ2   f   g   c2 ∧ ¬(c3 ∨ c4)    c2 − 0 = c2

Ω1 ⟕ Ω2:
     ?x  ?y  ?z  RDF MK             M-semirings
μ5   d   b   c   c1 ∧ c3            c1 * c3
μ2   f   g   -   c2 ∧ ¬(c3 ∨ c4)    c2

Like Perm, RDF Meta Knowledge and M-semirings infer that μ5 is untrusted, but cannot infer that μ1: (d, b, -) is trusted.

A Third Operation for Compatibility (1/2)

Take care of compatible mappings

Only one of μ1, μ5 can appear in the result

Keep provenance information for both of them!

Ω1:
     ?x  ?y  Annot.
μ1   d   b   c1
μ2   f   g   c2

Ω2:
     ?y  ?z  Annot.
μ3   b   c   c3
μ4   e   h   c4

Ω1 ⟕ Ω2 = (Ω1 ⋈ Ω2) ∪ (Ω1 \ Ω2):
     ?x  ?y  ?z  How      SPARQL Prov.
μ5   d   b   c   c1*c3    c1*c3
μ1   d   b   -   No Info  c1*A(μ1, μ3)
μ2   f   g   -   c2       c2

Boolean trust: c1*c3 evaluates to (t ∧ t) = t if c3 = t, and to (t ∧ f) = f if c3 = f

A(μ1, μ3) =

f, if μ1 ~ μ3 and c3 = t

t, else

A Third Operation for Compatibility (2/2)

A is a binary operator on mappings

It determines whether the mapping exists in the result or not

If yes, its provenance equals the positive provenance part, e.g. c1 for c1*A(μ1, μ3)

Ω1 ⟕ Ω2 = (Ω1 ⋈ Ω2) ∪ (Ω1 \ Ω2):
     ?x  ?y  ?z  How      SPARQL Prov.
μ5   d   b   c   c1*c3    c1*c3
μ1   d   b   -   No Info  c1*A(μ1, μ3)
μ2   f   g   -   c2       c2

In general,

A(μ1, μ3) =

0, if μ1 ~ μ3 and c3 ≠ 0

1, else

0: the neutral element for +

1: the neutral element for *
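To make the role of A concrete, the sketch below evaluates the two candidate results (d, b, c) and (d, b, -) under boolean trust, reading 0 as False, 1 as True and * as AND. The function signature, which passes the value of c3 and the two neutral elements explicitly, is an assumption of this sketch and not the model's definition.

```python
def compatible(mu1, mu2):
    """True if the two mappings agree on every variable they share."""
    return all(mu2[var] == val for var, val in mu1.items() if var in mu2)


def A(mu1, mu2, value2, zero, one):
    """zero if mu1 ~ mu2 and mu2's annotation is not zero, otherwise one."""
    return zero if compatible(mu1, mu2) and value2 != zero else one


MU1 = {"?x": "d", "?y": "b"}   # annotated with c1
MU3 = {"?y": "b", "?z": "c"}   # annotated with c3


def trust_of_results(c1, c3):
    """Trust of (d, b, c) and of (d, b, -) for given values of c1 and c3."""
    joined = c1 and c3                               # c1 * c3
    unmatched = c1 and A(MU1, MU3, c3, False, True)  # c1 * A(mu1, mu3)
    return joined, unmatched


print(trust_of_results(True, True))   # (True, False)
print(trust_of_results(True, False))  # (False, True)
```

Exactly one of the two results is trusted in each case, which is the behaviour that the previous models could not express.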

SPARQL Provenance Operators

Two types of operators

on provenance tokens, i.e. + and * (for SPARQL+)

on mappings, i.e. A (for ⟕ and \)

Good news:

Every triple of the dataset is uniquely annotated.

Why not use annotations as mapping identifiers in A?

Due to the projection operator…

Enrich Tokens with Schema Information

Use tokens (c1, c2…) as mapping ids in A expressions

A(c1, c2) =

0, if μ1 ~ μ2 and c2 ≠ 0

1, else

But μ1 ≁ μ2 might hold, while π?y,?z (μ1) ~ π?y,?z (μ2)

Ω:
     ?x  ?y  ?z
μ1   a   b   c
μ2   d   b   -

Tokens don’t suffice, keep token-schema pairs

Ω:
?x  ?y  ?z  Prov.
a   b   c   (c1, {?x, ?y, ?z})
d   b   -   (c2, {?x, ?y, ?z})

π?y,?z (Ω):
?y  ?z  Prov.
b   c   (c1, {?y, ?z})
b   -   (c2, {?y, ?z})

A( (c1, S1), (c2, S2) ) =

0, if πS1 (μ1) ~ πS2 (μ2) and c2 ≠ 0

1, else
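The effect of the schema component can be illustrated with the two mappings of the example. The sketch assumes that each token identifies the source mapping it annotates (looked up in the hypothetical `MAPPINGS` table) and that token values are taken from a counting-style assignment with 0 and 1; none of these names come from the model itself.

```python
def compatible(mu1, mu2):
    """True if the two mappings agree on every variable they share."""
    return all(mu2[var] == val for var, val in mu1.items() if var in mu2)


def project(mu, schema):
    """Restrict a mapping to the variables of the schema."""
    return {var: val for var, val in mu.items() if var in schema}


def A(pair1, pair2, mappings, values, zero=0, one=1):
    """A on token-schema pairs: compatibility is tested after projection."""
    (c1, s1), (c2, s2) = pair1, pair2
    mu1 = project(mappings[c1], s1)
    mu2 = project(mappings[c2], s2)
    return zero if compatible(mu1, mu2) and values[c2] != zero else one


# Hypothetical lookup from token to the mapping it annotates.
MAPPINGS = {"c1": {"?x": "a", "?y": "b", "?z": "c"},
            "c2": {"?x": "d", "?y": "b"}}          # ?z is unbound
VALUES = {"c1": 1, "c2": 1}

FULL = frozenset({"?x", "?y", "?z"})
PROJECTED = frozenset({"?y", "?z"})

print(A(("c1", FULL), ("c2", FULL), MAPPINGS, VALUES))            # 1
print(A(("c1", PROJECTED), ("c2", PROJECTED), MAPPINGS, VALUES))  # 0
```

With the full schema the two mappings disagree on ?x, so A yields 1; once ?x is projected away they become compatible and A yields 0, which a token alone could not distinguish.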


Define an algebra on token-schema pairs

3 operations

2 for SPARQL operators

1 for compatibility

What if there is no projection (or projection is not allowed to be pushed down)?

annotations suffice (no need for schema information), but the compatibility operator is still needed

What if there is no OPTIONAL?

previous models suffice, e.g. How

Future Work

SPARQL Provenance Model

Extend model expressiveness to capture other computations on Linked Data

Logic explanations

Implementation

Questions ?
