Download - WP3: Data Provenance and Access Control
WP3: Data Provenance and Access Control
Irini Fundulaki, FORTHDecember 11-12, 2012, Luxembourg
Slide 2 of 25
Extended example scenario
relational DBMS
stream DB stream DB RDF (stream)
DB
CSV twitter
C-SPARQL/SPARQL-STR/HTTP
registry
Qual
ity c
ontro
l
Provenance and access control
ontologiesSSN
GADM-RDF, Geospecies
Slide 3 of 25
First Year Review Comments Collaboration
◦ Provenance and Data Quality (with UPM, FUB for WP2)◦ Collaboration with KIT for a privacy aware access control
framework Real versus syntheric data
◦ PlanetData datasets (GADM-RDF1, GeoSpecies2 enhanced with links from DBPedia using LDSpider), GO3 and CIDOC4 ontologies
Scalability◦ Real datasets with up to 11M triples
1 GADM-RDF http://gadm.geovocab.org/2 GeoSpecies: http://lod.geospecies.org/3 Gene Ontology: http://www.geneontology.org/4 CIDOC-CRM: http://www.cidoc-crm.org/
Slide 4 of 25
Access Control Access Control:
◦ Refers to the ability to allow or deny the use of a particular resource by a particular entity
◦ Crucial for sensitive content since it ensures the selective exposure of information to different classes of users
Access Control Enforcement techniques for RDF data in the presence of RDFS inference:◦ Materialization (forward chaining)◦ Query Rewriting (backward chaining)◦ Static Analysis
Slide 5 of 25
Access Control ModelsStandard access control models associate a
concrete label to a triple to indicate whether the triple can be accessed or not
Built in rule: An implied RDF triple can be accessed if and only if all its implying triples can be accessed
s p o
&a type Student
label
yesStudent sc Person yes
&a type Person yes
s p o label
Slide 6 of 25
Access Control Models In the case of any kind of update, the implied triples &
their labels must be re-computed
An implied RDF triple can be accessed if and only if all its implying triples can be accessed
s p o
&a type Student
label
yesStudent sc Person yesno
&a type Person yes
s p o label
no
the overhead can be substantial in large datasetswhen updates occur frequently
Slide 8 of 25
Our Approach Fine-grained Abstract Access Control Model comprised of abstract
tokens and operators◦ focus at the RDF Triple level◦ focus on permissions for read-only queries◦ labeled triples are encoded as quadruples◦ supports RDFS inference and propagation of labels across the
RDFS hierarchies◦ encodes how the access label of an implied quadruple is
computed◦ concrete policies are used to assign specific values to the
abstract tokens and operators Implementation of a fine-grained, repository independent, portable
across platforms access control framework on top of the MonetDB column store
Slide 9 of 25
Annotating Triples with Access Tokens
s p o
&a type Student
label
l1Student sc Person l2
• l1 , l2 : abstract tokens
Person type l3class
q1:
q2:
q3:
&a type Persons p o label
l1 ⊙ l2 ⊙ : entailment operator records the triples used in the implication Student type class l2 ⊙ l3
q4:
q5:
&a type Person
s p o label l3 : propagation operator q6:
&a fname Alice
s p o label
: default access token q7:
Slide 10 of 25
Annotating Triples with Access TokensAuthorization Rules
A1: (construct {?x lname ?y} where {?x type Student},l1)
A2:(construct {?x sc ?y},l2)
A3:(construct {?x type Student},l3)
Authorizations(Query, abstract token)
A4:(construct {?x type class},l4)
A5:(construct {Student sc ?y},l5)
RDF quadruples
q1:
q2:
q3:
q4:
q5:
Student sc Person
Person sc Agent
&a type Student
&a fname Alice
Agent type class
l2
l2
l3
l4
q6: Student sc Person l5
s p o l
Slide 11 of 25
Computing the labels of implied triples
RDFS Inference: generate new knowledge
RDF quadruples
q1:
q2:
q3:
q5:
Student sc Person
Person sc Agent
&a type Student
Agent type class
l2
l2
l3
l4
q6: Student sc Person l5
s p o lq8:
q9:
q10:
q11:
q12:
Student sc Agent
Student sc Agent
&a type Person
&a type Agent
&a type Agent
l2 ⊙ l2
l5 ⊙ l2
l3 ⊙ l2
(l3 ⊙ l2) ⊙ l2
(l5 ⊙ l2) ⊙ l2
s p o l
(A1, sc, A2, l1) (A2, sc, A3, l2) (A1, sc, A3, l1 ⊙ l2)
(&r1, type, A1, l1)
(A1, sc, A2, l2) (&r1, type, A2, l1 ⊙ l2)
- subClassOf
- typeOf
Slide 12 of 25
Computing the labels of implied triples
RDFS Inference: generate new knowledge
RDF quadruples
q1:
q2:
q3:
q5:
Student sc Person
Person sc Agent
&a type Student
Agent type class
l2
l2
l3
l4
q6: Student sc Person l5
s p o l
q8:
q9:
q10:
q11:
q12:
Student sc Agent
Student sc Agent
&a type Person
&a type Agent
&a type Agent
l2 ⊙ l2
l5 ⊙ l2
l3 ⊙ l2
(l3 ⊙ l2) ⊙ l2
(l5 ⊙ l2) ⊙ l2
s p o l
(A1, sc, A2, l1) (A2, sc, A3, l2) (A1, sc, A3, l1 ⊙ l2)
(&r1, type, A1, l1)
(A1, sc, A2, l2) (&r1, type, A2, l1 ⊙ l2)
- subClassOf
- typeOf
Slide 13 of 25
Computing the propagated labels Propagating labels along the RDFS hierarchies:
no new knowledge created(A1, type, class, l1)
(A2, type, A1, (l1 ))
(A2, type, A1, l2)
RDF quadruples
q1:
q2:
q3:
q5:
Student sc Person
Person sc Agent
&a type Student
Agent type class
l2
l2
l3
l4
q6: Student sc Person l5
s p o lq8:
q9:
q10:
q11:
q12:
Student sc Agent
Student sc Agent
&a type Person
&a type Agent
&a type Agent
l2 ⊙ l2
l5 ⊙ l2
l3 ⊙ l2
(l3 ⊙ l2) ⊙ l2
(l5 ⊙ l2) ⊙ l2
q13: &a type Agent l4
s p o l
Slide 14 of 25
From Abstract Models to Concrete Policies Concrete Policies implement Abstract Model:
◦ Set of Concrete Tokens ◦ Mapping from abstract to concrete tokens◦ Set of concrete operators that implement the abstract
ones◦ Conflict resolution operator to resolve cases where the
same triple is assigned different labels◦ Access Function to decide when a triple is accessible
Slide 15 of 25
Concrete Policy C1 Concrete Tokens: LP = { true, false} Concrete Operators:
◦ Entailment Operator ⊙
◦ Propagation Operator : Identity◦ Conflict Resolution Operator
Access function: triples with label true are accessible, otherwise, inaccessible
l1 l2 if l1 and l2 are different from l1 if l2 is
if l1 , l2 are equal to l1 ⊙ l2 =
false if value false is in Strue if false does not belong in S
otherwise S =
Slide 16 of 25
Concrete Policy C1
q1:
q2:
q3:
q5:
Student sc Person
Person sc Agent
&a type Student
Agent type class
l2
l2
l3
l4
q6: Student sc Person l5
s p o l
q8:
q9:
Student sc Agent
Student sc Agent
l2 ⊙ l2
l5 ⊙ l2
q13: &a type Agent l4
Mapping: l2 true l4 true l5 false l3 true
q1:
q2:
q3:
q5:
Student sc Person
Person sc Agent
&a type Student
Agent type class
true
true
true
true
q6: Student sc Person false
s p o l
q8:
q9:
Student sc Agent
Student sc Agent
true
false
q13: &a type Agent true
Slide 17 of 25
Evaluation Implementation:
◦ Relational Schema: Authorization(auth_id, query, access_token) ExplicitTriples(qid, s, p, o, auth_id): stores explicit triples InferopTriples(qid, s, p, o): stores implicit quadruples PropOp(qid, s, p, o, l): stores the propagated quadruples LabelStore(qid, qid_uses): stores for each implied and
propagated quadruple, the explicit quadruples that are used in its implication;
◦ Query Engine: CWI’s MonetDB Open Source Column Store Stored Procedures are used for the computation of the access
labels
Slide 18 of 25
Benchmark Datasets:
◦ Synthetic Datasets: produced by PowerGen [1]◦ Real Datasets: CIDOC, GO, GeoSpecies and GADM Datasets
Authorizations:◦ 15 authorizations
Concrete Policy:◦ Access Tokens: {low, medium, high}
low < medium < high◦ Mapping:
same triple is assigned different labels and the same label is assigned to different triples ◦5 authorizations - assign to 35% of triples low label◦5 authorizations - assign to 55% of triples medium label◦5 authorizations - assign to 45% of triples high label
◦ Operators: Entailment, Conflict Resolution: min(), Propagation: identity
Slide 19 of 25
ExperimentsExperiment 1: Annotation Time
◦the time required to compute the implied quadruples and the propagated labels
Experiment 2: Evaluation Time ◦the time needed to compute for a concrete
policy, the concrete access label of a percentage of the RDF triples
Slide 20 of 25
Real DatasetsCharacteristics
Experiment 1: Annotation Time
Dataset C P CI PI SC SP Type Explicit Implied Totalgeospecies_plus-dbpedia_no-owl.nt111 40 88868 48145 97 14 127587 2591011 496028 3067278
go_3.5.11.nt 0 0 0 0 55169 0 35451 265355 395517 659054
gadm-rdf-0.94_plus-dbpedia-geo_no-owl.nt26 40 2915 41387 25 0 1856187 11303686 9 11303693
cidoc_crm_v5.0.2.nt 82 260 0 0 94 130 342 3282 451 3714
35451 35451 35451
Slide 21 of 25
Real Datasets: Experiment 2
Slide 22 of 25
Synthetic Datasets: Experiment 1
Annotation time does not depend on the number of explicit (i.e, initial triples in the dataset)
but on the number of produced implied quadruples
Slide 23 of 25
Synthetic Datasets: Experiment 2 Characteristics
Observations ◦ Dataset D2 has a larger complexity than D1 (due to the
larger depth) leading to more complex abstract expressions
Slide 24 of 25
Pros & Cons of Abstract Access Control Models Pros:
◦ Same application can experiment with different concrete policies over the same dataset e.g., liberal vs conservative policies for different classes
of users◦ Different applications can experiment with different
concrete policies for the same data◦ In the case of updates there is no need re-compute the
access labels of the inferred triples Cons:
◦ overhead in the required storage space expressions that describe how a label is computed can
become quite complex depending on the structure of the dataset
Slide 25 of 25
WP3: Work Plan View18 24 30 366 120
Task 3.1ProvenanceManagement
Task 3.2Privacy, DRM and Access Control
Task 3.3Trust management
D 3.4 Trust management and inference system
FORTHFORTH
42D 3.2 Provenance management and propagation through SPARQL queryand update languages
D 3.1 Access control specification language, reasoning and enforcement mechanisms
FORTHFORTH
EPFLEPFL
D 3.3 Access control system andprivacy-aware language
Slide 26 of 25
BackUp Slides