Language Design and Data Provenance 6/3/2019 1 GeCo Workshop, Como Val Tannen University of Pennsylvania


Page 1: Language Design and Data Provenance

Language Design and Data Provenance

6/3/2019, GeCo Workshop, Como

Val Tannen, University of Pennsylvania

Page 2: Language Design and Data Provenance

Collaborators

T of T award: TJ Green (RelationalAI), Grigoris Karvounarakis (RelationalAI)

G of PODS paper: TJ

ORCHESTRA: Zack Ives (University of Pennsylvania), TJ, Grigoris

Other core papers: Nate Foster (Cornell University), Yael Amsterdamer (Bar-Ilan University), Daniel Deutch (Tel Aviv University), Tova Milo (Tel Aviv University), Sudeepa Roy (Duke University), Yuval Moskovitch (Tel Aviv University)

Recent work: Erich Grädel (RWTH Aachen)

Much gratitude: Peter Buneman (University of Edinburgh)

Page 3: Language Design and Data Provenance

Provenance?

•  Provenance is about

–  trust: propagate it from inputs to outputs

–  diagnostics: faulty outputs come from where?

–  (repairs): fix inputs to fix outputs (reverse provenance analysis).

Page 4: Language Design and Data Provenance

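(Binary) Trust with Cat Victims

[Figure: Sue’s notes* contain the tuples (cat, mouse) and (cat, rat). Val’s notes* contain (mouse, gray), (mouse, red) and (rat, gray). Zack’s** computation derives the prey colors (cat, gray) and (cat, red). Every tuple carries a Yes/No trust label; the placement of the individual labels is not recoverable from this transcript.]

* Sue and Val are noted zoologists.

** Zack is a noted computational zoologist.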

Page 5: Language Design and Data Provenance

Confidence Scores (non-binary trust)

Sue’s notes, with confidence:
cat    mouse  0.9
cat    rat    0.9

Val’s notes, with confidence:
mouse  gray   0.6
mouse  red    0.1
rat    gray   0.8

Zack computation, with confidence:
cat    gray   0.72
cat    red    0.09

0.72 = max(0.9 × 0.8, 0.9 × 0.6)

0.09 = 0.9 × 0.1

Page 6: Language Design and Data Provenance

A Simple Model for Data Pricing

Sue’s notes, with price:
cat    mouse  $10
cat    rat    $10

Val’s notes, with price:
mouse  gray   $6
mouse  red    $1
rat    gray   $8

Zack computation, with price:
cat    gray   $16
cat    red    $11

16 = min(10 + 8, 10 + 6)

11 = 10 + 1

Page 7: Language Design and Data Provenance

Computation? Expressed in a Query Language

Sue’s notes:
cat    mouse
cat    rat

Val’s notes:
mouse  gray
mouse  red
rat    gray

Zack computation:
cat    gray
cat    red

Zack(x,z) :- Sue(x,y), Val(y,z)

Zack = PROJECT (JOIN (Sue, Val))

Zack = { (u.#pred, v.#color) | u ∈ Sue, v ∈ Val, u.#prey = v.#animal }
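A minimal Python sketch of the same query, assuming the two relations are plain sets of tuples. The variable names are illustrative and do not come from the slides; the comprehension mirrors the third formulation above.

```python
# Input relations as sets of tuples (attribute order follows the slide).
sue = {("cat", "mouse"), ("cat", "rat")}                      # (pred, prey)
val = {("mouse", "gray"), ("mouse", "red"), ("rat", "gray")}  # (animal, color)

# Zack = { (u.pred, v.color) | u in Sue, v in Val, u.prey == v.animal }
zack = {
    (pred, color)
    for (pred, prey) in sue
    for (animal, color) in val
    if prey == animal
}

print(sorted(zack))
```

The join condition and the projection are both visible in the comprehension, which is what makes it easy to track how each output tuple uses the inputs.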

Page 8: Language Design and Data Provenance

Do it once and use it repeatedly: provenance

Label (annotate) input items abstractly with provenance tokens.

Provenance tracking: propagate expressions (involving tokens) to annotate intermediate data and, finally, outputs.

Based on query language design, track two distinct ways of using data items by computation primitives:

•  jointly (this alone is basically like keeping a log)

•  alternatively (doing both is essential; think trust)

Input-output compositional; modular (in the primitives).

Later, we want to evaluate the provenance expressions to obtain binary trust, confidence scores, data prices, etc.

Page 9: Language Design and Data Provenance

Algebraic interpretation for RDB

Set X of provenance tokens.

Space of annotations, provenance expressions: Prov(X).

Prov(X)-relations: every tuple is annotated with some element from Prov(X).

Binary operations on Prov(X):

•  · corresponds to joint use (join, cartesian product)

•  + corresponds to alternative use (union and projection)

Special annotations:

•  “Absent” tuples are annotated with 0.

•  1 is a “neutral” annotation (data we do not track).

Page 10: Language Design and Data Provenance

K-Relational algebra

Algebraic laws of (Prov(X), +, ·, 0, 1)? More generally, for annotations from a structure (K, +, ·, 0, 1)?

K-relations. Generalize RA+ to (positive) K-relational algebra.

The desired optimization equivalences of K-relational algebra hold iff (K, +, ·, 0, 1) is a commutative semiring.

Generalizes SPJU or UCQ or non-rec. Datalog

•  set semantics (B, ∨, ∧, ⊥, ⊤)

•  bag semantics (N, +, ·, 0, 1)

•  c-table semantics [IL84] (BoolExp(X), ∨, ∧, ⊥, ⊤)

•  event table semantics [FR97, Z97] (P(Ω), ∪, ∩, ∅, Ω)

Page 11: Language Design and Data Provenance

What is a commutative semiring?

An algebraic structure (K, +, ·, 0, 1) where:

•  K is the domain

•  + is associative, commutative, with 0 identity

•  · is associative, with 1 identity

•  · distributes over +

•  a·0 = 0·a = 0

(the laws above define a semiring)

•  · is also commutative

Unlike a ring, no requirement for inverses to +.
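As an illustration of how a commutative semiring drives query evaluation, here is a small Python sketch. The `Semiring` class, the `join_project` helper and the dictionary encoding of K-relations are assumptions made for this example, not part of the slides, and the semiring laws are assumed rather than checked.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    """A commutative semiring (K, +, ·, 0, 1); the laws are assumed, not verified."""
    add: Callable[[Any, Any], Any]
    mul: Callable[[Any, Any], Any]
    zero: Any
    one: Any


BOOL = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)  # set semantics
NAT = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)             # bag semantics


def join_project(K, sue, val):
    """Zack(x,z) :- Sue(x,y), Val(y,z) over K-relations (dicts: tuple -> annotation)."""
    out = {}
    for (x, y), k1 in sue.items():
        for (y2, z), k2 in val.items():
            if y == y2:
                # Joint use multiplies; alternative derivations are added.
                out[(x, z)] = K.add(out.get((x, z), K.zero), K.mul(k1, k2))
    return out


sue = {("cat", "mouse"): 1, ("cat", "rat"): 1}
val = {("mouse", "gray"): 1, ("mouse", "red"): 1, ("rat", "gray"): 1}

# Under bag semantics the annotation counts derivations.
bag_result = join_project(NAT, sue, val)
print(bag_result)
```

Swapping `NAT` for `BOOL` (with `True` annotations on the inputs) runs the same query under set semantics; only the semiring changes, not the query code.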

Page 12: Language Design and Data Provenance

Provenance: abstract semiring annotation

Sue’s notes, with provenance token:
cat    mouse  p
cat    rat    q

Val’s notes, with provenance token:
mouse  gray   r
mouse  red    s
rat    gray   t

Zack, computed by Zack(x,z) :- Sue(x,y), Val(y,z), with provenance:
cat    gray   p·r + q·t
cat    red    p·s

Keep X = {p, q, r, s, t} abstract. Diagnostic for wrong answers; deletion propagation. E.g., r = s = 0.

Provenance polynomials: the semiring (N[X], +, ·, 0, 1)
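The annotations above can be reproduced with a small Python sketch. The encoding is an assumption made for illustration: a polynomial in N[X] is a `Counter` mapping each monomial (a sorted tuple of tokens, repeated for powers) to its coefficient, and `var`, `add`, `mul` and `delete` are hypothetical helper names.

```python
from collections import Counter


def var(name):
    """The polynomial consisting of the single token `name`."""
    return Counter({(name,): 1})


def add(a, b):
    """Polynomial addition: coefficients of equal monomials are summed."""
    return a + b


def mul(a, b):
    """Polynomial multiplication: monomials are concatenated, coefficients multiplied."""
    out = Counter()
    for m1, c1 in a.items():
        for m2, c2 in b.items():
            out[tuple(sorted(m1 + m2))] += c1 * c2
    return out


def delete(poly, deleted):
    """Deletion propagation: setting a token to 0 removes every monomial that uses it."""
    return Counter({m: c for m, c in poly.items() if not set(m) & deleted})


sue = {("cat", "mouse"): var("p"), ("cat", "rat"): var("q")}
val = {("mouse", "gray"): var("r"), ("mouse", "red"): var("s"), ("rat", "gray"): var("t")}

# Zack(x,z) :- Sue(x,y), Val(y,z), evaluated over N[X].
zack = {}
for (x, y), k1 in sue.items():
    for (y2, z), k2 in val.items():
        if y == y2:
            zack[(x, z)] = add(zack.get((x, z), Counter()), mul(k1, k2))

# With r = s = 0, (cat, gray) survives through q·t and (cat, red) disappears.
after_deletion = {t: delete(poly, {"r", "s"}) for t, poly in zack.items()}
```

An empty `Counter` plays the role of the annotation 0, i.e. the tuple is absent from the output.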

Page 13: Language Design and Data Provenance

Provenance propagation through language operations

Sue:
cat    mouse  p
cat    rat    q

Val:
mouse  gray   r
mouse  red    s
rat    gray   t

JOIN:
cat    mouse  gray   p·r
cat    mouse  red    p·s
cat    rat    gray   q·t

PROJECT:
cat    gray   p·r + q·t
cat    red    p·s

Page 14: Language Design and Data Provenance

Provenance polynomials

(N[X], +, ·, 0, 1) is the commutative semiring freely generated by X (universality property involving homomorphisms).

Provenance polynomials are PTIME-computable (data complexity). (Query complexity depends on language and representation.)

ORCHESTRA provenance (graph representation): about 30% overhead.

Monomials correspond to logical derivations (proof trees in non-rec. Datalog).

Provenance reading of polynomials:

output tuple has provenance 2r² + rs

three derivations of the tuple

-  two of them use r, twice

-  the third uses r and s, once each

Page 15: Language Design and Data Provenance

Specialize provenance for confidence scores

Sue’s notes, with token and confidence:
cat    mouse  p   0.9
cat    rat    q   0.9

Val’s notes, with token and confidence:
mouse  gray   r   0.6
mouse  red    s   0.1
rat    gray   t   0.8

Zack, computed by Zack(x,z) :- Sue(x,y), Val(y,z), with provenance and confidence:
cat    gray   pr + qt   0.72
cat    red    ps        0.09

V = ([0,1], max, ·, 0, 1), the Viterbi semiring

f: X → [0,1]    f(p) = f(q) = 0.9    f(r) = 0.6    f(s) = 0.1    f(t) = 0.8

eval(f): N[X] → V    eval(f)(pr + qt) = 0.72    eval(f)(ps) = 0.09
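The specialization step can be sketched in Python as follows. The list-of-monomials encoding and the `evaluate` helper are assumptions made for this illustration. The confidence valuation is the f given above; the price valuation reuses the numbers from the data pricing slide and evaluates the same polynomials with (min, +) instead of (max, ·).

```python
from functools import reduce

# Provenance polynomials from the slide as lists of monomials;
# each monomial is a list of tokens (all coefficients are 1 here).
prov = {
    ("cat", "gray"): [["p", "r"], ["q", "t"]],  # p·r + q·t
    ("cat", "red"): [["p", "s"]],               # p·s
}


def evaluate(poly, f, add, mul, zero, one):
    """Evaluate a polynomial under valuation f in the semiring (add, mul, zero, one)."""
    total = zero
    for monomial in poly:
        term = reduce(mul, (f[x] for x in monomial), one)
        total = add(total, term)
    return total


# Confidence scores: ([0,1], max, ·, 0, 1).
conf = {"p": 0.9, "q": 0.9, "r": 0.6, "s": 0.1, "t": 0.8}
scores = {
    t: evaluate(poly, conf, max, lambda a, b: a * b, 0.0, 1.0)
    for t, poly in prov.items()
}

# Data prices: min for alternative use, + for joint use.
price = {"p": 10, "q": 10, "r": 6, "s": 1, "t": 8}
costs = {
    t: evaluate(poly, price, min, lambda a, b: a + b, float("inf"), 0)
    for t, poly in prov.items()
}
```

The provenance is computed once; each application only supplies a valuation of the tokens and a target semiring.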

Page 16: Language Design and Data Provenance

Some application semirings

(B, ∧, ∨, ⊤, ⊥): binary trust

(N, +, ·, 0, 1): multiplicity (number of derivations)

(A, min, max, 0, Pub): access control

V = ([0,1], max, ·, 0, 1): Viterbi semiring (MPE), confidence scores

T = ([0,∞], min, +, ∞, 0): tropical semiring (shortest paths), data pricing

F = ([0,1], max, min, 0, 1): “fuzzy logic” semiring

Page 17: Language Design and Data Provenance

Two kinds of semirings in this framework

Provenance semirings, e.g.,

(N[X], +, ·, 0, 1): provenance polynomials [GKT07]

(Why(X), ∪, ⋓, ∅, {∅}): witness why-provenance [BKT01]

Application semirings, e.g.,

(A, min, max, 0, Pub): access control [FGT08]

V = ([0,1], max, ·, 0, 1): Viterbi semiring (MPE) [GKIT07]

Provenance specialization relies on

-  Provenance semirings are freely generated by provenance tokens

-  Query commutation with semiring homomorphisms

Page 18: Language Design and Data Provenance

Query commutation with homomorphisms

For a query in QL and a homomorphism h : K1 → K2, the diagram commutes:

K1-Rel  --query-->  K1-Rel
   |                   |
   h                   h
   v                   v
K2-Rel  --query-->  K2-Rel

QL = RA+, Datalog [GKT07] and extensions [FGT08, GP10, ADT11a, T13, DMT15, GUKFC16, T17]
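A quick Python check of the commutation property on one instance, offered as a sketch. It assumes the homomorphism h : N → B that sends n to n > 0, and the multiplicities in `sue` and `val` are made up for the example; `query` and `apply_h` are hypothetical helper names.

```python
def query(sue, val, add, mul, zero):
    """Zack(x,z) :- Sue(x,y), Val(y,z) over annotated relations (dicts)."""
    out = {}
    for (x, y), k1 in sue.items():
        for (y2, z), k2 in val.items():
            if y == y2:
                out[(x, z)] = add(out.get((x, z), zero), mul(k1, k2))
    return out


def apply_h(h, rel):
    """Apply a homomorphism to every annotation of a relation."""
    return {t: h(k) for t, k in rel.items()}


def h(n):
    """Homomorphism from (N, +, ·, 0, 1) to (B, or, and, False, True)."""
    return n > 0


# Hypothetical multiplicities; (cat, rat) has multiplicity 0, i.e. it is absent.
sue = {("cat", "mouse"): 2, ("cat", "rat"): 0}
val = {("mouse", "gray"): 1, ("mouse", "red"): 3, ("rat", "gray"): 5}

# Path 1: query over N, then apply h.
query_then_h = apply_h(h, query(sue, val, lambda a, b: a + b, lambda a, b: a * b, 0))

# Path 2: apply h, then query over B.
h_then_query = query(
    apply_h(h, sue), apply_h(h, val),
    lambda a, b: a or b, lambda a, b: a and b, False,
)

print(query_then_h == h_then_query)
```

One instance does not prove the theorem, but it shows what the two paths around the diagram compute.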

Page 19: Language Design and Data Provenance

K-Nested Relational Calculus

K-sets. Every element of the set is annotated with some k ∈ K, where (K, +, ·, 0, 1) is a commutative semiring.

Map f on S: { f(x) | x ∈ S }

If x is annotated by k then the annotation of f(x) is multiplied by k.

K-sets also form a commutative semiring. This gives annotations for

“FlatMap” g on S: ∪ { g(x) | x ∈ S }
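A rough Python sketch of the two operations on K-sets, assuming K = N and K-sets encoded as dictionaries from elements to annotations. The function names `kmap` and `flat_map` and the sample data are illustrative, not from the slides.

```python
def kmap(f, s, add, zero):
    """Map f over a K-set: the annotation k of x flows to f(x);
    elements with the same image have their annotations added."""
    out = {}
    for x, k in s.items():
        y = f(x)
        out[y] = add(out.get(y, zero), k)
    return out


def flat_map(g, s, add, mul, zero):
    """FlatMap g over a K-set: each g(x) is itself a K-set whose annotations
    are multiplied by the annotation of x, then the results are united."""
    out = {}
    for x, k in s.items():
        for y, k2 in g(x).items():
            out[y] = add(out.get(y, zero), mul(k, k2))
    return out


def plus(a, b):
    return a + b


def times(a, b):
    return a * b


s = {1: 2, 2: 3, 3: 4}  # an N-set: element -> multiplicity

parities = kmap(lambda x: x % 2, s, plus, 0)
neighbours = flat_map(lambda x: {x: 1, x + 1: 1}, s, plus, times, 0)
```

Here `kmap` is the special case of `flat_map` in which every g(x) is a singleton annotated with 1.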

Page 20: Language Design and Data Provenance

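A Hierarchy of Provenance Semirings [G09, DMRT14]

[Figure: a diagram of provenance semirings, from most informative (top) to least informative (bottom). Each arrow is a surjective semiring homomorphism that is the identity on X, labeled by the law it imposes: + idempotent, · idempotent, or absorption (ab + a = a). The application semirings A, T, V, N and B also appear in the figure.]

Example: 2x²y + xy + 5y² + xz

N[X]:        2x²y + xy + 5y² + xz
B[X]:        x²y + xy + y² + xz
Trio(X):     3xy + 5y + xz
Sorp(X):     xy + y² + xz
Why(X):      xy + y + xz
PosBool(X):  y + xz
Which(X):    xyz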
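The example polynomial can be pushed down the hierarchy with a short Python sketch. The encoding (monomials as tuples of (variable, exponent) pairs) and the function names are assumptions made for this illustration; each function simply forgets coefficients, exponents, or absorbed monomials.

```python
# 2x²y + xy + 5y² + xz as {monomial: coefficient}.
poly = {
    (("x", 2), ("y", 1)): 2,
    (("x", 1), ("y", 1)): 1,
    (("y", 2),): 5,
    (("x", 1), ("z", 1)): 1,
}


def to_bx(p):
    """B[X]: + is idempotent, so coefficients are dropped."""
    return set(p)


def to_trio(p):
    """Trio(X): · is idempotent, so exponents are dropped and coefficients merged."""
    out = {}
    for m, c in p.items():
        key = frozenset(v for v, _ in m)
        out[key] = out.get(key, 0) + c
    return out


def to_why(p):
    """Why(X): both + and · are idempotent."""
    return {frozenset(v for v, _ in m) for m in p}


def divides(m1, m2):
    """True if monomial m1 divides monomial m2."""
    exps = dict(m2)
    return all(exps.get(v, 0) >= e for v, e in m1)


def to_sorp(p):
    """Sorp(X): absorption (ab + a = a) removes monomials divisible by another one."""
    ms = set(p)
    return {m for m in ms if not any(o != m and divides(o, m) for o in ms)}


def to_posbool(p):
    """PosBool(X): keep only the minimal witness sets."""
    ws = to_why(p)
    return {w for w in ws if not any(o < w for o in ws)}


def to_which(p):
    """Which(X): the set of all contributing tokens."""
    return frozenset(v for m in p for v, _ in m)
```

For instance, `to_trio(poly)` merges 2x²y and xy into 3xy, and `to_posbool(poly)` drops xy because y alone is already a witness.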

Page 21: Language Design and Data Provenance

A menagerie of provenance semirings

(Which(X), ∪, ∪*, ∅, ∅*): sets of contributing tuples. “Lineage” (1) [CWW00]

(Why(X), ∪, ⋓, ∅, {∅}): sets of sets of … Witness why-provenance [BKT01]

(PosBool(X), ∧, ∨, ⊤, ⊥): minimal sets of sets of … Minimal witness why-provenance [BKT01], also “Lineage” (2) used in probabilistic dbs [SORK11]

(Trio(X), +, ·, 0, 1): bags of sets of … “Lineage” (3) [BDHTW08, G09]

(B[X], +, ·, 0, 1): sets of bags of … Boolean coeff. polynomials [G09]

(Sorp(X), +, ·, 0, 1): minimal sets of bags of … absorptive polynomials [DMRT14]

(N[X], +, ·, 0, 1): bags of bags of … universal provenance polynomials [GKT07]

Page 22: Language Design and Data Provenance

Further aspects of the framework

Extension to tree data (Nested Relational Calculus, structural recursion on trees, unordered XQuery) [FGT08]

Study of CQ/UCQ on provenance-annotated relations [G09]

Extension to aggregates (poly-size overhead) [ADT11a]

Poly-size provenance for Datalog (circuits; PosBool(X), Sorp(X) …) [DMRT14]

Extension to data-dependent finite state processes [DMT15]

Connections to semiring monad [FGT08, T13], to semimodules [ADT11a], to tensor products [ADT11a, DMT15]

Page 23: Language Design and Data Provenance

Provenance for aggregation

9/2/16, Simons Institute

DS, with provenance token:
a    20    x
a    10    y
b    15    q
b    10    r
b    25    s

DS-agg, computed by SUM S GROUP BY D, with provenance still to be determined:
a    20 + 10         ?
b    15 + 10 + 25    ?

Desiderata

1.  Compatibility with set/bag semantics

2.  Fundamental property (commutation with homomorphisms)

3.  Poly-size overhead! 1 + 2 + 4 + … + 2^(n-1) ⇒ 2^n results

Page 24: Language Design and Data Provenance

Solution inspired by (semi)linear algebra

DS, with provenance token:
a    20    x
a    10    y
b    15    q
b    10    r
b    25    s

DS-agg:
a    x 20 + y 10           ?
b    q 15 + r 10 + s 25    ?

(R, +, 0) is not a Prov(X)-semimodule, but…

(K-Rel, ∪, ∅) is a K-semimodule with the singletons as basis.

Relations are the result of ∪-aggregation! What if (R, +, 0) were a Prov(X)-semimodule?

Page 25: Language Design and Data Provenance

Tensor product construction

DS-agg, with provenance:
a    x⊗20 + y⊗10           x + y
b    q⊗15 + r⊗10 + s⊗25    q + r + s

Embed a commutative monoid M (for sum, max or min) into a K-semimodule K⊗M (new values!)

Consistency: embedding should be faithful.
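A simplified Python sketch of the idea, under stated assumptions: an element of K⊗M is encoded as a list of (token, value) pairs, and it is specialized by assigning each token a multiplicity in N, so that k⊗m evaluates to m added k times. The helper names `sum_group_by` and `specialize` are hypothetical.

```python
# The annotated input relation DS: (D, S, provenance token).
ds = [
    ("a", 20, "x"),
    ("a", 10, "y"),
    ("b", 15, "q"),
    ("b", 10, "r"),
    ("b", 25, "s"),
]


def sum_group_by(rows):
    """Per group, build the formal sum of token⊗value pairs and the group's
    provenance (the sum of its tokens, kept here as a list)."""
    out = {}
    for d, s, token in rows:
        tensors, tokens = out.setdefault(d, ([], []))
        tensors.append((token, s))
        tokens.append(token)
    return out


def specialize(tensors, multiplicity):
    """Evaluate a formal sum under token multiplicities in N (SUM aggregation)."""
    return sum(multiplicity[token] * value for token, value in tensors)


agg = sum_group_by(ds)

# Every tuple present once: ordinary SUM.
present = {t: 1 for t in "xyqrs"}
total_a = specialize(agg["a"][0], present)

# Deleting the tuple annotated x (x = 0) changes the aggregate without recomputation.
without_x = dict(present, x=0)
total_a_without_x = specialize(agg["a"][0], without_x)
```

The formal sum has one term per input tuple, which is what keeps the overhead polynomial rather than enumerating all subsets of the group.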

Page 26: Language Design and Data Provenance

Negative information; non-monotone operations (difference)

Boolean expressions [IL84]. Limited.

Add a binary operation corresponding to difference:

•  m-semirings (common gen. of set and bag difference) [GP10]

•  spm-semirings (OPTIONAL in SPARQL) [GUKFC16]

Encode difference by aggregation [ADT11a]

Different equational theories, different algebraic optimizations [ADT11b]

Still not clear how to track negative information. Useful: non-answers (why not?), insertion propagation.

Logical model checking (“provenance of … truth?”): negation as duality (NNFs), logical games; ongoing work with Grädel [T16, T17]

Page 27: Language Design and Data Provenance

Current targets

ANALYTICS COMPUTATIONS

“Fine-grained provenance for linear algebra operators”, Yan, T., Ives, TaPP 16

DISTRIBUTED SYSTEMS / NETWORK PROVENANCE

“Time-aware provenance for distributed systems”, Zhou, Ding, Haeberlen, Ives, Loo, TaPP 11

“Diagnosing missing events in distributed systems with negative provenance”, Wu, Zhao, Haeberlen, Zhou, Loo, SIGCOMM 14

STATIC ANALYSIS OF SOFTWARE

“On abstraction refinement for program analyses in Datalog”, Zhang, Mangal, Grigore, Naik, PLDI 14

Page 28: Language Design and Data Provenance

Framework references (I)

[GKT07] “Provenance semirings”, Green, Karvounarakis, Tannen, PODS 07.

[GKIT07] “Update exchange with mappings and provenance”, Green, Karvounarakis, Ives, Tannen, VLDB 07.

[FGT08] “Annotated XML: queries and provenance”, Foster, Green, Tannen, PODS 08.

[G09] “Containment of conjunctive queries on annotated relations”, Green, ICDT 09.

[GP10] “On database query languages for K-relations”, Geerts, Poggi, J. Appl. Logic 2010.

Page 29: Language Design and Data Provenance

Framework references (II)

[ADT11a] “Provenance for aggregate queries”, Amsterdamer, Deutch, Tannen, PODS 11.

[ADT11b] “On the limitations of provenance for queries with difference”, Amsterdamer, Deutch, Tannen, TaPP 11.

[T13] “Provenance propagation in complex queries”, Tannen, Buneman Festschrift 2013.

[DMRT14] “Circuits for Datalog provenance”, Deutch, Milo, Roy, Tannen, ICDT 14.

[DMT15] “Provenance-based analysis of data-centric processes”, Deutch, Moskovitch, Tannen, VLDB J. 2015.

Page 30: Language Design and Data Provenance

Framework references (III)

[GUKFC16] “Algebraic structures for capturing the provenance of SPARQL queries”, Geerts, Unger, Karvounarakis, Fundulaki, Christophides, JACM 2016.

[T16] “About the provenance of truth”, Tannen, Simons Inst. Website 16, https://simons.berkeley.edu/talks/val-tannen-2016-12-09

[T17] “Provenance analysis for FOL model checking”, Tannen, SIGLOG News 2017.

[GT17a] “The semiring framework for database provenance”, Green, Tannen, PODS 2017.

[GT17b] “Semiring provenance for first-order model checking”, Grädel, Tannen, CoRR abs/1712.01980 (2017).

Page 31: Language Design and Data Provenance

Other references

[IL84] “Incomplete information in relational databases”, Imieliński, Lipski, JACM 1984.

[FR97] “A probabilistic relational algebra”, Fuhr, Rölleke, TOIS 1997.

[Z97] “Query evaluation in probabilistic relational databases”, Zimányi, DDS 1997.

[CWW00] “Tracing the lineage of view data in a warehousing environment”, Cui, Widom, Wiener, TODS 2000.

[BKT01] “Why and where: a characterization of data provenance”, Buneman, Khanna, Tan, ICDT 2001.

[BDHTW08] “Databases with uncertainty and lineage”, Benjelloun, Das Sarma, Halevy, Theobald, Widom, VLDB J. 2008.

[SORK11] “Probabilistic databases”, Suciu, Olteanu, Ré, Koch, SLDM 2011.

Page 32: Language Design and Data Provenance

Thank you!