![Page 1: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/1.jpg)
An Annotation Management System for Relational Databases
Laura ChiticariuUniversity of California, Santa Cruz
Joint work with Deepavali Bhagwat, Wang-Chiew Tan, Gaurav Vijayvargiya
![Page 2: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/2.jpg)
2
A system that is able to propagate meta-data along with the data as the data is being moved around
Main motivation To trace the provenance and flow of data
Many other uses
Annotation Management System
transformationa2a1 a2
a1
b2
b1
b3
b2b1 b3a1 a2
a3
transformation step: a query, an ETL rule, etc.
transformation
![Page 3: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/3.jpg)
3
Restaurant Cost Type
Peacock Alley
Bull & Bear
PacificaSoho Kitchen & Bar
$$$ French
$$$ Seafood
$ Chinese$ American
Restaurant Cost Type
PacificaSoho Kitchen & Bar
$ Chinese$ American
All Restaurants Cheap Restaurants
Yummy chicken curry!!
NYRestaurants
Restaurant Cost Type
Peacock Alley
Bull & Bear
PacificaSoho Kitchen & Bar
Zip
$$$ French 10022
$$$ Seafood 10022
$ Chinese 10013$ American10022
Serves fine French Cuisine in elegant setting. Formal attire.
Extensive wine list!
Our Vision
![Page 4: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/4.jpg)
4
Other Applications
Keep information that cannot be otherwise stored in the current database design
Highlight wrong data Erroneous data may be copied around but the
comment that it is wrong goes along with it
Security and quality metric Annotate security or quality levels of data items
![Page 5: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/5.jpg)
5
Some Related Work
Idea is not new though propagation of annotations was never explicitly stated as provenance-based: Wang & Madnick [VLDB 90], Lee, Bressan & Madnick [WIDM 98], Bernstein & Bergstraesser [IEEE Data Eng. 99]
Superimposed Information. Maier and Delcambre [WebDB 99]
Annotations of Web documents Annotations on genomic sequences Why-Provenance
Cui, Widom, & Wiener [CWW00]
![Page 6: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/6.jpg)
6
Outline
pSQL queries Semantics
CUSTOM propagation schemeDEFAULT propagation schemeDEFAULT-ALL propagation scheme
ImplementationSystem architectureExperimental results
![Page 7: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/7.jpg)
7
pSQL – an extension of SQL
A pSQL fragment:
SELECT DISTINCT selectlist
FROM fromlist
WHERE wherelist
PROPAGATE DEFAULT
| DEFAULT-ALL
| r1.A1 TO B1, …, rn.An TO Bn
A pSQL query is a union of pSQL fragments
![Page 8: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/8.jpg)
8
The CUSTOM Scheme
SELECT DISTINCT AFROM R rPROPAGATE r.A TO A
UNION
SELECT DISTINCT A FROM R rPROPAGATE r.B TO A
A B
1 2
2 3
3 5
R
a
b
c
h
1
2
3
Result
Propagate annotations according to user specification
1
2
3
Result1
a
b
1
2
3
Result2
c
h
annotationUNION
a
b
c
h
![Page 9: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/9.jpg)
9
The DEFAULT Scheme
Propagate annotations according to where data is copied from
r.B TO B
s.B TO B
A B
1 2
2 3
3 5
R
a
b
c
h
2
3
4
5
Result
c
e
f
h
A B
4 2
5 3
6 4
S
d
f
g
e
g
SELECT DISTINCT BFROM R r PROPAGATE DEFAULT
UNION
SELECT DISTINCT BFROM S sPROPAGATE DEFAULT
natural semantics for tracing the provenance of data
![Page 10: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/10.jpg)
10
SELECT DISTINCT r.A, r.B, s.CFROM R r, S sWHERE r.B = s.BPROPAGATE DEFAULT
versus
SELECT DISTINCT *FROM R NATURAL JOIN SPROPAGATE DEFAULT
=a
Annotation Propagation under the DEFAULT Scheme
A B
1 2
R
a
1 2 3
Ans1
B C
2 3
S
b
a
1 2 3
Ans2
a b
equivalent queries,
but different
annotated output
Q1:
Q2:
![Page 11: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/11.jpg)
11
The DEFAULT-ALL scheme
Propagate annotations according to where data is copied from according to all equivalent formulations of the given query
User Query Q:
Compute the results of Q on a database D – idea: E(Q) denotes the set of all queries that are equivalent to Q
(more precisely, (*)). Execute each query in E(Q) on the database D under the
DEFAULT scheme, then combine the results under a.
SELECT DISTINCT r.A, s.B, s.CFROM R r, S sWHERE r.B = s.BPROPAGATE DEFAULT-ALL
(*) the SQL query corresponding to Q
![Page 12: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/12.jpg)
12
Computing the results of a DEFAULT-ALL query
Question:Given a pSQL query Q with DEFAULT-ALL propagation scheme and a database D, can we compute the result of Q(D)?
Problem: There are infinitely many queries in E(Q). It is therefore impossible to execute every query in E(Q) in order to obtain the result of Q(D).
Solution: Compute a finite basis of E(Q) first
![Page 13: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/13.jpg)
13
A Query Basis of Q
A query basis of Q, denoted as B(Q), is a finite set of pSQL queries (with default propagation scheme) such that:
Ua q(D) =a Ua q(D)
Given B(Q), we can execute each query in B(Q) and combine the results to obtain the result of Q(D).
Question: Given Q, does B(Q) always exist and how can we compute B(Q)?
qB(Q) qE(Q)
![Page 14: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/14.jpg)
14
Generating a Query Basis of Q Given R(A,B) and S(B,C) User query Q:
Representative Query Q0 :
Propagations under the default propagation scheme
Additional propagation due to the equality
r.B = s.B
Ans(x,y,z) :- R(x,y), S(y,z).
The representative query propagates annotations according to where data is copied from or equivalently copied from.
SELECT DISTINCT r.A, s.B, s.CFROM R r, S sWHERE r.B = s.BPROPAGATE DEFAULT-ALL
SELECT DISTINCT r.A, s.B, s.CFROM R r, S sWHERE r.B = s.BPROPAGATE r.A TO A, s.B TO B, s.C TO C, r.B TO B
![Page 15: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/15.jpg)
15
Generating a Query Basis of Q
Auxiliary Queries:
Q1:
Q2:
Ans(x,y,z) :- R(x,y), S(y,z), R(x,w).
SELECT DISTINCT r.A, s.B, s.CFROM R r, S s, R r’WHERE r.B = s.B, r’.A = r.APROPAGATE r.A TO A, s.B TO B, s.C TO C, r.B TO B, r’.A TO A
Ans(x,y,z) :- R(x,y), S(y,z), S(w,z).
SELECT DISTINCT r.A, s.B, s.CFROM R r, S s, S s’WHERE r.B = s.B, s’.C = s.CPROPAGATE r.A TO A, s.B TO B, s.C TO C, r.B TO B, s’.C TO C
![Page 16: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/16.jpg)
16
Generating a Query Basis of Q
Auxiliary Queries:
Q3:
Q4:
Ans(x,y,z) :- R(x,y), S(y,z), R(w,y).
SELECT DISTINCT r.A, s.B, s.CFROM R r, S s, R r’WHERE r.B = s.B, r’.B = r.BPROPAGATE r.A TO A, s.B TO B, s.C TO C, r.B TO B, r’.B TO B
Ans(x,y,z) :- R(x,y), S(y,z), S(y,w).
SELECT DISTINCT r.A, s.B, s.CFROM R r, S s, S s’WHERE r.B = s.B, s’.B = s.BPROPAGATE r.A TO A, s.B TO B, s.C TO C, r.B TO B, s’.B TO B
![Page 17: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/17.jpg)
17
Correctness of the Algorithm
For the example, a query basis of Q consists of Q0, Q1, Q2, Q3, and Q4.
Theorem:
Given a pSQL query Q with DEFAULT-ALL propagation scheme, the algorithm generates a query basis of Q.
Proof Idea: Every query in B(Q) is an equivalent query of Q
Every equivalent query of Q is annotation-contained in Ua q(D) qB(Q)
![Page 18: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/18.jpg)
18
Outline
pSQL queries
SemanticsCUSTOM propagation schemeDEFAULT propagation schemeDEFAULT-ALL propagation scheme
ImplementationSystem architectureExperimental results
![Page 19: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/19.jpg)
19
System Architecture
Translator Module Input: a pSQL query Q Output: an SQL query Q’ written against the naïve storage
scheme
Q’ is sent to the RDBMS and executed
Postprocessor Module Input: sorted tuples (returned by the RDBMS) Output: An annotated set of tuples.
Annotations for the same output location are collected together Duplicate tuples are removed
PostprocessorTranslatorUSER pSQL
querySQL
querysortedtuples
finalresultRDBMS
![Page 20: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/20.jpg)
20
For every attribute of every relation there is an additional attribute for storing the annotations
Conceivably, there are other possible storage schemes
A Naïve Storage Scheme
A B
1 2
3 4
a
b
cd
R A A’ B B’
1 a 2 c
1 d 2 -
3 b 4 -
R’
![Page 21: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/21.jpg)
21
The Translator module
Generate a Query Basis
pSQL querydefault-all scheme
set of pSQL querieswith custom scheme
Translatedefault pSQL
to custom pSQL
pSQL querydefault scheme
pSQL querycustom scheme
Translatecustom pSQL
to SQL
SQL query
SELECT DISTINCT r.A AS A, r.B AS BFROM R rPROPAGATE DEFAULT
SELECT DISTINCT r.A AS A, r.B AS BFROM R rPROPAGATE r.A TO A, r.B TO B
default pSQL query custom pSQL query
![Page 22: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/22.jpg)
22
Experiments
Goals
compare the performance of pSQL queries under different propagation schemes (DEFAULT, DEFAULT-ALL, or no propagation scheme)
compare the performance of pSQL queries when the number of annotations in a database is varied
![Page 23: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/23.jpg)
23
Experimental setup
Implemented on top of Oracle 9i Datasets
100MB, 500MB, 1GB TPCH database Unannotated database on original schema 30%, 60%, 100% annotations on naïve schema buffer size: 256Mb
Test queries SPJ queries Varied the number of joins (0 to 4 joins) Varied the number of selected attributes (1,3 or 5 attributes)
![Page 24: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/24.jpg)
24
100MB dataset – 100% annotated
Qi(j) denotes a query with i joins and j output attributes.
SQL vs. pSQL DEFAULT vs. pSQL DEFAULT-ALL
0.01
0.1
1
10
100
1000
Q0(1) Q1(1) Q2(1) Q3(1) Q4(1) Q0(3) Q1(3) Q2(3) Q3(3) Q4(3) Q0(5) Q1(5) Q2(5) Q3(5) Q4(5)
seco
nds
(log
sca
le)
SQL on Unannotated DB pSQL DEFAULT pSQL DEFAULT-ALL
![Page 25: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/25.jpg)
25
500MB dataset – 100% annotated
Qi(j) denotes a query with i joins and j output attributes.
SQL vs. pSQL DEFAULT vs. pSQL DEFAULT-ALL
0.1
1
10
100
1000
10000
100000
Q0(1) Q1(1) Q2(1) Q3(1) Q4(1) Q0(3) Q1(3) Q2(3) Q3(3) Q4(3) Q0(5) Q1(5) Q2(5) Q3(5) Q4(5)
seco
nds
(log
sca
le)
SQL on Unannotated DB pSQL DEFAULT pSQL DEFAULT-ALL
![Page 26: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/26.jpg)
26
1GB dataset – 100% annotated
Qi(j) denotes a query with i joins and j output attributes.
SQL vs. pSQL DEFAULT vs. pSQL DEFAULT-ALL
0.1
1
10
100
1000
10000
100000
Q0(1) Q1(1) Q2(1) Q3(1) Q4(1) Q0(3) Q1(3) Q2(3) Q3(3) Q4(3) Q0(5) Q1(5) Q2(5) Q3(5) Q4(5)
seco
nds
(log
sca
le)
SQL on Unannotated DB pSQL DEFAULT pSQL DEFAULT-ALL
![Page 27: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/27.jpg)
27
100MB dataset annotated in various degrees
pSQL Default on 30%, 60% 100% annotated DBs
0.01
0.1
1
10
100
1000
Q0(1) Q1(1) Q2(1) Q3(1) Q4(1) Q0(3) Q1(3) Q2(3) Q3(3) Q4(3) Q0(5) Q1(5) Q2(5) Q3(5) Q4(5)
seco
nds
(log
sca
le)
SQL on Unannotated BD 30% Annotated DB 60% Annotated DBt 100% Annotated DB
Qi(j) denotes a query with i joins and j output attributes.
![Page 28: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/28.jpg)
28
Contributions
an annotation management system for carrying annotations along as data is being transformed
based on provenance
pSQL query language for propagation annotations CUSTOM – user defined DEFAULT – where data was copied from? DEFAULT-ALL – invariant under equivalent queries
Generate-Query-Basis algorithm
an initial implementation
![Page 29: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/29.jpg)
29
Future work
Performance of our annotation management system on other storage schemes
pSQL extensions Aggregates Bag Queries
![Page 30: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/30.jpg)
30
END
![Page 31: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/31.jpg)
31
The CUSTOM Scheme - Example
SELECT DISTINCT BFROM R rPROPAGATE r.A TO B, r.B TO B
A B
1 2
2 3
3 5
R
a
b
c
h
2
3
5
Result
a
b
c
h
![Page 32: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/32.jpg)
32
Terminology
A location is a triple (R, t, A)
Definition:
A query Q1 is annotation contained in a query Q2 if:• Q1 Q2
• for every database D, the set of annotations attached to every output location in Q1(D) is a subset of the set of annotations associated with the same location in the output of Q2(D).
A B
1 2
R
aThe annotation “a”
is attached to the
location (R,(1,2),B)
![Page 33: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/33.jpg)
33
Ans(x,y,z) :- R(x,y), S(y’,z), y = y’. { x ! 1, y ! 2, y’ ! 2, z ! 3 }Ans(x,y,z) :- R(x,y), S(y,z). { x ! 1, y ! 2, z ! 3 }
Annotations of values that reside in different source locations but are bound to the same variable are unioned together.
Ans(y) :- R(x,y).Ans(y) :- S(y,z).
Ans(2 ).
Annotations that belong to the same output location are unioned together.
In a More Concise Notation
a b
a b
a b
![Page 34: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/34.jpg)
34
Containment vs. annotation-containment
A B C
1 2 3
1 4 5
1 8 4
8 9 5
R
a
b c
d
1 5
Ans1
c
1 5
Ans2
c d
a b
b
Q1
Ans(x,v) :- R(x,y,u), R(x,z,v), R(t,w,z). Q2
Ans(x,v) :- R(p,q,v), R(x,z,v), R(t,w,z).
Q1 Q2 but…Q1 a Q2 and Q2 a Q1
![Page 35: An Annotation Management System for Relational Databases Laura Chiticariu University of California, Santa Cruz Joint work with Deepavali Bhagwat, Wang-Chiew](https://reader036.vdocuments.us/reader036/viewer/2022062619/5518a66a550346c31f8b4abc/html5/thumbnails/35.jpg)
35
Translating a CUSTOM pSQL to SQL
Q1:SELECT r.A, NULL, s.B, s.B’, s.C, s.C’FROM R r, S sWHERE r.B = s.B
Q2:SELECT r.A, NULL, s.B, r.B’, s.C, NULLFROM R r, S sWHERE r.B = s.B
SELECT DISTINCT *FROM ( Q1 UNION Q2 ) tORDER BY t.A, t.B, t.C
SELECT DISTINCT r.A, s.B, s.CFROM R r, S sWHERE r.B = s.BPROPAGATE s.B TO B, s.C TO C, r.B TO B
custom pSQL query:
SQL query: