compsci516 data intensive computing systems - cs.duke.edu · duke cs, fall 2016 compsci 516: data...

Post on 17-Jun-2020

4 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

CompSci 516DataIntensiveComputingSystems

Lecture21Datalog

Instructor:Sudeepa Roy

1DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

Announcement

• HW3duenextWednesday:11/16

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 2

Today

• Datalog– forrecursion indatabasequeries

• AquicklookatIncrementalViewMaintenance(IVM)

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 3

ReadingMaterial:DatalogOptional:1. Thedatalog chaptersinthe“AliceBook”FoundationsofDatabasesAbiteboul-Hull-VianuAvailableonline:http://webdam.inria.fr/Alice/

2.Datalog tutorialSIGMOD2011“Datalog andEmergingApplications:AnInteractiveTutorial”

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 4

BriefHistoryofDatalog

• MotivatedbyProlog– startedbackin1970-80’s– thenquietforalongtime

• AlongargumentintheDatabasecommunitywhetherrecursionshouldbesupportedinquerylanguages– “Nopracticalapplicationsofrecursivequerytheory...havebeenfoundto

date”—MichaelStonebraker,1998ReadingsinDatabaseSystems,3rdEditionStonebraker andHellerstein,eds.

– RecentworkbyHellerstein etal.onDatalog-extensionstobuildnetworkingprotocolsanddistributedsystems.[Link]

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 5

Datalog isresurging!

• NumberofpapersandtutorialsinDBconferences

• Applicationsin– dataintegration,declarativenetworking,programanalysis,information

extraction,networkmonitoring,security,andcloudcomputing

• Systemssupportingdatalog inbothacademiaandindustry:– Lixto (informationextraction)– LogicBlox (enterprisedecisionautomation)– Semmle (programanalysis)– BOOM/Dedalus (Berlekey)– Coral– LDL++

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 6

RecallourdrinkerexampleinRC(Lecture4)

Find drinkers that frequent some bar that serves some beer they like.

Q(x) = $y. $z. Frequents(x, y)∧Serves(y,z)∧Likes(x,z)

Likes(drinker, beer)Frequents(drinker, bar)Serves(bar, beer)

7CompSci516:DataIntensiveComputingSystems

DukeCS,Fall2016

DrinkerexampleisfromslidesbyProfs.Balazinska andSuciuandthe[GUW]book

WriteitasaDatalog RuleFind drinkers that frequent some bar that serves some beer they like.

Likes(drinker, beer)Frequents(drinker, bar)Serves(bar, beer)

8CompSci516:DataIntensiveComputingSystems

DukeCS,Fall2016

RC:Q(x) = $y. $z. Frequents(x, y)∧Serves(y,z)∧Likes(x,z)

Datalog:Q(x) :- Frequents(x, y), Serves(y,z), Likes(x,z)

WriteitasaDatalog RuleFind drinkers that frequent some bar that serves some beer they like.

Likes(drinker, beer)Frequents(drinker, bar)Serves(bar, beer)

9CompSci516:DataIntensiveComputingSystems

DukeCS,Fall2016

Datalog:Q(x) :- Frequents(x, y), Serves(y,z), Likes(x,z)

RC:Q(x) = $y. $z. Frequents(x, y)∧Serves(y,z)∧Likes(x,z)

• Quickdifferences:– Uses“:-”not=– noneedfor$ (assumedbydefault)– Use“,”ontherighthandside(RHS)– AnythingonRHStheof:- isassumedtobecombinedwith∧ bydefault– ",Þ,notallowed– theyneedtousenegation¬– Standard“Datalog”doesnotallownegation– Negationallowedindatalog withnegation

• Howtospecifydisjunction(OR/⋁)?

Example:ORinDatalogFind drinkers that (a) either frequent some bar that serves some beer they like, (b) or like beer “BestBeer”

Likes(drinker, beer)Frequents(drinker, bar)Serves(bar, beer)

10CompSci516:DataIntensiveComputingSystems

DukeCS,Fall2016

Datalog:Q(x) :- Frequents(x, y), Serves(y,z), Likes(x,z)Q(x) :- Likes(x, “BestBeer”)

RC:Q(x)=[$y.$z.Frequents(x,y)∧Serves(y,z)∧Likes(x,z)]⋁ [Likes(x,“BestBeer”)]

Example:ORinDatalogFind drinkers that (a) either frequent some bar that serves some beer they like, (b) or like beer “BestBeer”, (c) or, frequent bars that “Joe” frequents

Likes(drinker, beer)Frequents(drinker, bar)Serves(bar, beer)

11CompSci516:DataIntensiveComputingSystems

DukeCS,Fall2016

Datalog:JoeFrequents(w):- Frequents(“Joe”,w)Q(x):- Frequents(x,y),Serves(y,z),Likes(x,z)Q(x):- Likes(x,“BestBeer”)Q(x):- Frequents(x,w),JoeFrequents(w)

RC:Q(x)=[$y.$z.Frequents(x,y)∧Serves(y,z)∧Likes(x,z)]⋁ [Likes(x,“BestBeer”)]

⋁ [$w Frequents(x,w)∧ Frequents(“Joe”,w)]

• Tospecify“OR”,writemultipleruleswiththesame“Head”• Next:terminologyforDatalog

• Each rule is of the form Head :- Body

• Each variable in the head of each rule must appear in the body of the rule

Likes(drinker, beer)Frequents(drinker, bar)Serves(bar, beer)

12CompSci516:DataIntensiveComputingSystems

DukeCS,Fall2016

JoeFrequents(w):- Frequents(“Joe”,w)Q(x):- Frequents(x,y),Serves(y,z),Likes(x,z)Q(x):- Likes(x,“BestBeer”)Q(x):- Frequents(x,w),JoeFrequents(w)

Fourrules

BodyHead

Datalog Rules

Atom

Variable

EDBsandIDBs

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

13

Likes(drinker, beer)Frequents(drinker, bar)Serves(bar, beer)

• Intensional DataBases (IDBs)– Relationsthatarederived– Canbeintermediateorfinaloutputtables– e.g.JoeFrequents,Q– CanbeontheLHSorRHS(e.g.JoeFrequents)

• ExtensionalDataBases (EDBs)– Inputrelationnames– e.g.Likes,Frequents,Serves– canonlybeontheRHSofarule

JoeFrequents(w):- Frequents(“Joe”,w)Q(x):- Frequents(x,y),Serves(y,z),Likes(x,z)Q(x):- Likes(x,“BestBeer”)Q(x):- Frequents(x,w),JoeFrequents(w)

TupleinanEDBoranIDB:aFACT

eitherbelongstoagivenEDBrelation,orisderivedinanIDB relation

GraphExample

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

14

a

b

c

d e

V1 V2

a c

b a

b d

c d

d a

d e

E(edgerelation)

Example1

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

15

E(edgerelation)

WriteaDatalog programtofindpathsoflengthtwo(outputstartandfinishvertices)

V1 V2

a c

b a

b d

c d

d a

d e

a

b

c

d e

Example1

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

16

E(edgerelation)

WriteaDatalog programtofindpathsoflengthtwo(outputstartandfinishvertices)

P2(x,y):- E(x,z),E(z,y)

V1 V2

a c

b a

b d

c d

d a

d e

a

b

c

d e

Example1:Execution

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

17

E(edgerelation)

WriteaDatalog programtofindpathsoflengthtwo(outputstartandfinishvertices)

P2(x,y):- E(x,z),E(z,y)

V1 V2

a c

b a

b d

c d

d a

d e

a

b

c

d e

V1 V2

a db cb ec ac ed c

P2

sameasE⨝E.V2=E.V1 E

Example2

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

18

E(edgerelation)

WriteaDatalog programtofindallpairsofvertices(u,v)suchthatvisreachablefromu

• CanyouwriteaSQL/RA/RCqueryforreachability?

V1 V2

a c

b a

b d

c d

d a

d e

a

b

c

d e

Example2

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

19

E(edgerelation)

• CanyouwriteaSQL/RA/RCqueryforreachability?• NO- SQL/RA/RCcannotexpressreachability

V1 V2

a c

b a

b d

c d

d a

d e

a

b

c

d e WriteaDatalog programtofindallpairsofvertices(u,v)suchthatvisreachablefromu

Example2

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

20

E(edgerelation)

R(x,y):- E(x,y)R(x,y):- E(x,z),R(z,y)

Option1

V1 V2

a c

b a

b d

c d

d a

d e

a

b

c

d e WriteaDatalog programtofindallpairsofvertices(u,v)suchthatvisreachablefromu

Example2

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

21

E(edgerelation)

R(x,y):- E(x,y)R(x,y):- E(x,z),R(z,y)

Option1 R(x,y):- E(x,y)R(x,y):- R(x,z),E(z,y)

Option2

R(x,y):- E(x,y)R(x,y):- R(x,z),R(z,y)

Option3

V1 V2

a c

b a

b d

c d

d a

d e

a

b

c

d e WriteaDatalog programtofindallpairsofvertices(u,v)suchthatvisreachablefromu

linear

non-linear

LinearDatalog

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

22

• Linearrule– atmostoneatominthebodythatisrecursivewiththeheadoftherule

– e.g.R(x,y):- E(x,z),R(z,y)• Lineardatalog program

– ifallrulesarelinear– likelinearrecursion

• Top-downandbottom-upevaluationarepossible– wewillfocusonbottom-up

Example2:Execution

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

23

E

R(x,y):- E(x,y)R(x,y):- E(x,z),R(z,y)

Option1

V1 V2

a cb ab dc dd ad e

Iteration1 R=E

V1 V2

a c

b a

b d

c d

d a

d e

a

b

c

d e

(verticesreachablein1-hopbyadirectedge)

Example2:Execution

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

24

E

R(x,y):- E(x,y)R(x,y):- E(x,z),R(z,y)

Option1

V1 V2

a cb ab dc dd ad ea db cb ec ac ed c

RIteration2

V1 V2

a c

b a

b d

c d

d a

d e

a

b

c

d e

(verticesreachablein2-hops)

Example2:Execution

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

25

E

R(x,y):- E(x,y)R(x,y):- E(x,z),R(z,y)

Option1

V1 V2

a cb ab dc dd ad ea db cb ec ac ed ca ea ac cd d

RIteration3

V1 V2

a c

b a

b d

c d

d a

d e

a

b

c

d e

(verticesreachablein3-hops)

Example2:Execution

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

26

E

R(x,y):- E(x,y)R(x,y):- E(x,z),R(z,y)

Option1

V1 V2

a cb ab dc dd ad ea db cb ec ac ed ca ea ac cd d

RIteration4

V1 V2

a c

b a

b d

c d

d a

d e

a

b

c

d e

Runchanged- stop

Examples3and4

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

27

E(edgerelation)

WriteaDatalog programtofindallverticesreachablefromb

R(x,y):- E(x,y)R(x,y):- E(x,z),R(z,y)QB(y):- R(b,y)

V1 V2

a c

b a

b d

c d

d a

d e

a

b

c

d e

WriteaDatalog programtofindallverticesureachablefromthemselvesR(u,u)

R(x,y):- E(x,y)R(x,y):- E(x,z),R(z,y)Q(x):- R(x,x)

TerminationofaDatalog Program

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

28

Q. ADatalog programalwaysterminates– why?

TerminationofaDatalog Program

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

29

Q. ADatalog programalwaysterminates– why?

• Becausethevaluesofthevariablesarecomingfromthe“activedomain”intheinputrelations(EDBs)

• Activedomain=(finite)valuesfromthe(possiblyinfinite)domainappearingintheinstanceofadatabase

– e.g.agecanbeanyinteger(infinite),butactivedomainisonlyfinitelymanyinR(id,name,age)

• ThereforethenumberofpossiblevaluesineachoftheIDBsisfinite

• e.g.inthereachabilityexampleR(x,y),thevaluesofxandycomefrom{a,b,c,d,e}

– atmost5x5=25tuplespossibleintheIDBR(x,y)– inanyiteration,atleastonenewtupleisaddedinatleastoneIDB– Muststopafterfinitesteps– e.g.themaximumnumberofiterationinthereachabilityexampleforanygraphwith

fiveverticesis25(itwasonly4inourexample)

Bottom-upEvaluationofaDatalog Program

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

30

• Naïveevaluation

• Semi-naïveevaluation

Naïveevaluation- 1

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

31

V1 V2

a cb ab dc dd ad e

V1 V2

a c

b a

b d

c d

d a

d e

Iteration1:R=E=R1(say)

E

a

b

c

d e

Inallsubsequentiteration,checkifanyoftherulescanbeapplied

DounionofalltheruleswiththesameheadIDB

Naïveevaluation- 2

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

32

V1 V2

a cb ab dc dd ad e

V1 V2

a c

b a

b d

c d

d a

d e

V1 V2

a cb ab dc dd ad ea db cb ec ac ed c

Iteration1:R=E=R1(say)

E

a

b

c

d e

Iteration2:R=E∪

E⨝ R1=R2(say)

R1≠R2socontinue

Naïveevaluation- 3

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

33

V1 V2

a cb ab dc dd ad e

V1 V2

a c

b a

b d

c d

d a

d e

V1 V2

a cb ab dc dd ad ea db cb ec ac ed c

V1 V2

a cb ab dc dd ad ea db cb ec ac ed ca ea ac cd d

Iteration1:R=E=R1(say)

E

a

b

c

d e

Iteration2:R=E∪

E⨝ R1=R2(say)

R1≠R2socontinue

Iteration3:R=E∪

E⨝ R2=R3(say)

R2≠R3socontinue

Naïveevaluation- 4

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

34

V1 V2

a cb ab dc dd ad e

V1 V2

a c

b a

b d

c d

d a

d e

V1 V2

a cb ab dc dd ad ea db cb ec ac ed c

V1 V2

a cb ab dc dd ad ea db cb ec ac ed ca ea ac cd d

Iteration1:R=E=R1(say)

E

a

b

c

d e

Iteration2:R=E∪

E⨝ R1=R2(say)

R1≠R2socontinue

Iteration3:R=E∪

E⨝ R2=R3(say)

R2≠R3socontinue

Iteration4:R=E∪

E⨝ R3=R4(say)

R3=R4soSTOP

ProblemwithNaïveEvaluation

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

35

• ThesameIDBfactsarediscoveredagainandagain– e.g.ineachiterationalledgesinEareincludedinR– Inthe2nd-4th iterations,thefirstsixtuplesinRarecomputed

repeatedly

• Solution:Semi-NaïveEvaluation

• Workonlywiththenewtuplesgeneratedinthepreviousiteration

Semi-Naïveevaluation- 1

DukeCS,Fall2016CompSci 516:DataIntensiveComputing

Systems 36

V1 V2

a cb ab dc dd ad e

V1 V2

a c

b a

b d

c d

d a

d e

Iteration1:R=E=R1(say)ΔR1=R1

E

a

b

c

d e

Initially:R=Φ

Semi-Naïveevaluation- 2

DukeCS,Fall2016CompSci 516:DataIntensiveComputing

Systems 37

V1 V2

a cb ab dc dd ad e

V1 V2

a c

b a

b d

c d

d a

d e

V1 V2

a cb ab dc dd ad ea db cb ec ac ed c

Iteration1:R=E=R1(say)ΔR1=R1

E

a

b

c

d e

Iteration2:R=R1∪

E⨝ ΔR1=R2(say)

ΔR2=R2– R1

ΔR2≠Φsocontinue

Initially:R=Φ

Semi-Naïveevaluation- 3

DukeCS,Fall2016CompSci 516:DataIntensiveComputing

Systems 38

V1 V2

a cb ab dc dd ad e

V1 V2

a c

b a

b d

c d

d a

d e

V1 V2

a cb ab dc dd ad ea db cb ec ac ed c

V1 V2

a cb ab dc dd ad ea db cb ec ac ed ca ea ac cd d

Iteration1:R=E=R1(say)ΔR1=R1

E

a

b

c

d e

Iteration2:R=R1∪

E⨝ ΔR1=R2(say)

ΔR2=R2– R1

ΔR2≠Φsocontinue

Iteration3:R=R2∪

E⨝ ΔR2=R3(say)

ΔR3=R3– R2

ΔR3≠Φsocontinue

Initially:R=Φ

Semi-Naïveevaluation- 4

DukeCS,Fall2016CompSci 516:DataIntensiveComputing

Systems 39

V1 V2

a cb ab dc dd ad e

V1 V2

a c

b a

b d

c d

d a

d e

V1 V2

a cb ab dc dd ad ea db cb ec ac ed c

V1 V2

a cb ab dc dd ad ea db cb ec ac ed ca ea ac cd d

Iteration1:R=E=R1(say)ΔR1=R1

E

a

b

c

d e

Iteration2:R=R1∪

E⨝ ΔR1=R2(say)

ΔR2=R2– R1

ΔR2≠Φsocontinue

Iteration3:R=R2∪

E⨝ ΔR2=R3(say)

ΔR3=R3– R2

ΔR3≠Φsocontinue

Iteration4:R=R3∪E⨝ ΔR3=R4(say)

ΔR4=R4– R3ΔR=Φ(CHECKJ)soSTOP

Initially:R=Φ

IncrementalViewMaintenance(IVM)

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

40

• Whydidthesemi-naïvealgorithmwork?• BecauseofthegenerictechniqueofIncrementalView

Maintenance(IVM)

• Supposeyouhave– adatabaseD=(R1,R2,R3)– aqueryQthatgivesanswerQ(D)– D=(R1,R2,R3)getsupdatedtoD’=(R1’,R2’,R3’)– e.g.R1’=R1∪ ΔR1(insertion),R2’=R2- ΔR1(deletion)etc.

IncrementalViewMaintenance(IVM)

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

41

• Whydidthesemi-naïvealgorithmwork?• BecauseofthegenerictechniqueofIncrementalView

Maintenance(IVM)

• Supposeyouhave– adatabaseD=(R1,R2,R3)– aqueryQthatgivesanswerQ(D)– D=(R1,R2,R3)getsupdatedtoD’=(R1’,R2’,R3’)– e.g.R1’=R1∪ ΔR1(insertion),R2’=R2- ΔR1(deletion)etc.

• IVM: CanyoucomputeQ(D’)usingQ(D)andΔR1,ΔR2,ΔR3withoutcomputingitfromscratch(i.e.donotrerunthequeryQ)?

IVM Example:Selection

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

42

V1 V2

a cb ad ac d

V1 V2

a cb ad ac db dd e

σV1=bR

R

V1 V2

b a

R’=R∪ ΔR

ΔR

σV1=bR’

V1 V2

b ab d

V1 V2

b a

V1 V2

b d

σV1=bR σV1=bΔR

• σV1=b(R∪ ΔR)=σV1=bR∪ σV1=bΔR• Itsufficestoapplytheselectionconditiononly onΔR

– andincludewiththeoriginalsolution

IVM Example:Projection

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

43

• πV1(R∪ ΔR)=πV1R∪ πV1ΔR• Itsufficestoapplytheprojectionconditiononly onΔR

– andincludewiththeoriginalsolution

V1 V2

a cb ad ad e

V1 V2

a cb ad ad eb dc d

πV1R

R

V1

abd

R’=R∪ ΔR

ΔR

πV1R’

V1

bc

πV1R

πV1ΔR

V1

abdc

V1

abd

IVM Example:Join

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

44

(R∪ ΔR)⨝ (S∪ ΔS)=(R⨝ S)∪ (R⨝ ΔS)∪ (ΔR⨝ S)∪ (ΔR⨝ ΔS)

A B

a1 b1a2 b2a3 b1

B C

b1 c1b2 c2

S’=S∪ ΔSR’=R∪ ΔR

ΔRΔS

A B

a1 b1B C

b1 c1A B C

a1 b1 c1⨝

=

A B

a1 b1B C

b1 c1⨝

A B

a1 b1⨝

B C

b2 c2

A B

a2 b2a3 b1

⨝B C

b1 c1

=

A B

a2 b2a3 b1

B C

b2 c2⨝

A B C

a1 b1 c1a3 b1 c1a2 b2 c2

= =

IVM forLinearDatalog Rule

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

45

(R∪ ΔR)⨝ (S∪ ΔS)=(R⨝ S)∪ (R⨝ ΔS)∪ (ΔR⨝ S)∪ (ΔR⨝ ΔS)

A B

a1 b1a2 b2a3 b1

B C

b1 c1

S’=S

R’=R∪ ΔR

ΔR

A B

a1 b1B C

b1 c1A B C

a1 b1 c1⨝

=

A B C

a1 b1 c1a3 b1 c1

=

• R(x,y):- E(x,z),R(z,y)– i.e.Rnew =E⨝ R

• ButEisEDB– ΔE=Φ

• Therefore,E⨝ (R∪ ΔR)=(E⨝ R)∪ (E⨝ ΔR)• Itsufficestojoinwiththedifference

ΔRandincludeintheresultinthepreviousroundE⨝ R

• Advantageofhaving“linearrule”

(Non-recursive)Datalog withNegation

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

46

• RecursionandnegationtogethermakeDatalog executioncomplicated– thereisanotioncalled“stratifiedsemantic”forthispurpose– computeIDBrelationsinstrata/layersbeforetakinganegation– notcoveredinthisclass

• Wewillonlydonegationfornon-recursiveDatalog

Unsafe/SafeDatalog Rules

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

47

• Whatistheproblemwiththisrule?• Whatshouldthisrulereturn?

– namesofalldrinkersintheworld?– namesofalldrinkersintheUSA?– namesofalldrinkersinDurham?

Find drinkers who like beer “BestBeer” Q(x) :- Likes(x, “BestBeer”)

Find drinkers who DO NOT like beer “BestBeer”

Q(x) :- ¬Likes(x, “BestBeer”)

Likes(drinker, beer)Frequents(drinker, bar)Serves(bar, beer)

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

48

Find drinkers who like beer “BestBeer” Q(x) :- Likes(x, “BestBeer”)

Find drinkers who DO NOT like beer “BestBeer”

Q(x) :- ¬Likes(x, “BestBeer”)

ProblemwithNegationinDatalog Rules

Likes(drinker, beer)Frequents(drinker, bar)Serves(bar, beer)

• Whatistheproblemwiththisrule?• Dependenton“domain”ofdrinkers

– domain-dependent– infiniteanswerspossibletoo..

• keepgenerating“names”

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

49

• Solution:• Restrictto“activedomain”ofdrinkersfromtheinputLikes (orFrequents)relation– “domain-independence”– samefiniteansweralways

• Becomesa“saferule”

Find drinkers who like beer “BestBeer” Q(x) :- Likes(x, “BestBeer”)

Find drinkers who DO NOT like beer “BestBeer”

Q(x) :- ¬Likes(x, “BestBeer”)

ProblemwithNegationinDatalog Rules

Likes(drinker, beer)Frequents(drinker, bar)Serves(bar, beer)

Q(x) :- Likes(x, y), ¬Likes(x, “BestBeer”)

Q(x) = $y. Likes(x, y)∧"z.(Serves(z,y) Þ Frequents(x,z))

Query: Find drinkers that like some beer (so much) that they frequent all bars that serve it

CompSci516:DataIntensiveComputingSystems

50

Likes(drinker, beer)Frequents(drinker, bar)Serves(bar, beer)

DukeCS,Fall2016

RC→Datalog withnegation→ SQL(1/8)

Ack:slidesbyProfs.Balazinska andSuciu

RevisitexamplefromLecture4

Q(x) = $y. Likes(x, y)∧"z.(Serves(z,y) Þ Frequents(x,z))

Query: Find drinkers that like some beer so much that they frequent all bars that serve it

Step 1: Replace " with $ using de Morgan’s Laws

Q(x) = $y. Likes(x, y)∧ ¬$z.(Serves(z,y) ∧ ¬Frequents(x,z))

CompSci516:DataIntensiveComputingSystems

51

Likes(drinker, beer)Frequents(drinker, bar)Serves(bar, beer)

"x P(x) same as¬$x ¬P(x)

¬(¬P∨Q) same asP∧ ¬ Q

º Q(x) = $y. Likes(x, y)∧"z.(¬ Serves(z,y) ∨ Frequents(x,z))

DukeCS,Fall2016

RC→Datalog withnegation→SQL(2/8)

Ack:slidesbyProfs.Balazinska andSuciu

RevisitexamplefromLecture4

P => Q same as ¬P∨Q

Q(x) = $y. Likes(x, y)∧"z.(Serves(z,y) Þ Frequents(x,z))

Query: Find drinkers that like some beer so much that they frequent all bars that serve it

Step 1: Replace " with $ using de Morgan’s Laws

Q(x) = $y. Likes(x, y)∧ ¬$z.(Serves(z,y) ∧ ¬Frequents(x,z))

(new) Step 2: Make all subqueries domain independent

Q(x) = $y. Likes(x, y) ∧ ¬$z.(Likes(x,y)∧Serves(z,y)∧¬Frequents(x,z))

CompSci516:DataIntensiveComputingSystems

52

Likes(drinker, beer)Frequents(drinker, bar)Serves(bar, beer)

DukeCS,Fall2016

RC→Datalog withnegation→ SQL(3/8)

Ack:slidesbyProfs.Balazinska andSuciu

RevisitexamplefromLecture4

(new) Step 3: Create a datalog rule for some subexpressions of the form$x $y…. R(….)∧ S(….)∧ T(….)∧….

Q(x) = $y. Likes(x, y) ∧¬ $z.(Likes(x,y)∧Serves(z,y)∧¬Frequents(x,z))

H(x,y) :- Likes(x,y),Serves(z,y), not Frequents(x,z)Q(x) :- Likes(x,y), not H(x,y)

H(x,y)

CompSci516:DataIntensiveComputingSystems

53

Likes(drinker, beer)Frequents(drinker, bar)Serves(bar, beer)

DukeCS,Fall2016

RC→Datalog withnegation→ SQL(4/8)

Ack:slidesbyProfs.Balazinska andSuciu

RevisitexamplefromLecture4

Step 4: Write it in SQL

SELECT DISTINCT L.drinker FROM Likes LWHERE ……

H(x,y) :- Likes(x,y),Serves(z,y), not Frequents(x,z)Q(x) :- Likes(x,y), not H(x,y)

54

Likes(drinker, beer)Frequents(drinker, bar)Serves(bar, beer)

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

RC→Datalog withnegation→ SQL(5/8)

Ack:slidesbyProfs.Balazinska andSuciu

RevisitexamplefromLecture4

Step 4: Write it in SQL

SELECT DISTINCT L.drinker FROM Likes LWHERE not exists

(SELECT * FROM Likes L2, Serves SWHERE … …)

H(x,y) :- Likes(x,y),Serves(z,y), not Frequents(x,z)Q(x) :- Likes(x,y), not H(x,y)

55

Likes(drinker, beer)Frequents(drinker, bar)Serves(bar, beer)

CompSci516:DataIntensiveComputingSystems

DukeCS,Fall2016

RC→Datalog withnegation→ SQL(6/8)

Ack:slidesbyProfs.Balazinska andSuciu

RevisitexamplefromLecture4

Step 4: Write it in SQL

SELECT DISTINCT L.drinker FROM Likes LWHERE not exists

(SELECT * FROM Likes L2, Serves SWHERE L2.drinker=L.drinker and L2.beer=L.beer

and L2.beer=S.beerand not exists (SELECT * FROM Frequents F

WHERE F.drinker=L2.drinkerand F.bar=S.bar))

H(x,y) :- Likes(x,y),Serves(z,y), not Frequents(x,z)Q(x) :- Likes(x,y), not H(x,y)

56

Likes(drinker, beer)Frequents(drinker, bar)Serves(bar, beer)

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

RC→Datalog withnegation→ SQL(7/8)

Ack:slidesbyProfs.Balazinska andSuciu

RevisitexamplefromLecture4

Sometimes can simplify the SQL query by using an unsafe datalog ruleCorrectness ensured by safe outermost rule

SELECT DISTINCT L.drinker FROM Likes LWHERE not exists

(SELECT * FROM Serves SWHERE L.beer=S.beer

and not exists (SELECT * FROM Frequents FWHERE F.drinker=L.drinker

and F.bar=S.bar))

H(x,y) :- Likes(x,y),Serves(z,y), not Frequents(x,z)Q(x) :- Likes(x,y), not H(x,y) Unsafe rule

CompSci516:DataIntensiveComputingSystems

57

Likes(drinker, beer)Frequents(drinker, bar)Serves(bar, beer)

DukeCS,Fall2016

RC→Datalog withnegation→SQL(8/8)

Ack:slidesbyProfs.Balazinska andSuciu

RevisitexamplefromLecture4

AnOverviewofDataProvenancewithAnnotations

Selected/adaptedslidesfromthekeynotebyProf.ValTannen,EDBT2010

(optionalmaterial:fullslidedeckisavailableonVal’swebpage)

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

58

Optional/additionalslides

Lineage

• [Cui-Widom-Wiener’00]• Lineage:

– Givenadataitemintheview– Determinethesourcedatathatproducedit– Theprocessbywhichitwasproduced

59

optionalslide

Applications

• OLAP/OLAM(mining)– originofanomalousdatatoverifyreliability

• ScientificDatabases– howanswerwasproducedfromrawdata

• OnlineNetworkmonitoringandDiagnosissystem– identifyfaultysensorfromnetworkmonitors

60

optionalslide

Applications

• Cleanseddatafeedback– Cleanrawdataandsendreporttosources

• Materializedviewschemaevolution– ifviewschemaischanged(newcolumnadded),recomputation maynotbenecessary

• Viewupdate– translateviewupdatestobasedataupdates

61

optionalslide

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 62

SlidebyValTannen,EDBT2010

optionalslide

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 63

SlidebyValTannen,EDBT2010

optionalslide

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 64

SlidebyValTannen,EDBT2010

optionalslide

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 65

SlidebyValTannen,EDBT2010

optionalslide

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 66

SlidebyValTannen,EDBT2010

optionalslide

HasAsthma

Ann

Bob

Friend

Ann Joe

Ann Tom

Bob Tom

Smoker

Joe

Tom

Booleanquery Q():- HasAsthma (x),Friend(x,y),Smoker(y)

• x,y,z∈ {0,1}• 1=present• 0 =absent• Therearethreealternativewaystoderivethe“true”answer

x1x2

z1z2

y1y2y3

67

ProvenanceExample

ProvenanceFQ,D =x1y1z1 +x1y2z2 +x2y3z2

optionalslide

HasAsthma

Ann

Bob

Friend

Ann Joe

Ann Tom

Bob Tom

Smoker

Joe

Tom

Booleanquery Q:$ x$ yHasAsthma (x)Ù Friend(x,y)Ù Smoker(y)

• x,y,z∈ {0,1}

• WhathappensifAnnisdeleted?– Doestheanswerchangetofalsefromtrue?

x1x2

z1z2

y1y2y3

68

Applicationto“DeletionPropagation”

ProvenanceFQ,D =x1y1z1 +x1y2z2 +x2y3z2 [Greenetal.’07]

optionalslide

HasAsthma

Ann

Bob

Friend

Ann Joe

Ann Tom

Bob Tom

Smoker

Joe

Tom

Booleanquery Q:$ x$ yHasAsthma (x)Ù Friend(x,y)Ù Smoker(y)

• x,y,z∈ {0,1}

• WhathappensifAnnisdeleted?– Doestheanswerchangetofalsefromtrue?

• Noneedtore-evaluatethequery– justpluginx1=0andevaluate

x1x2

z1z2

y1y2y3

69

Applicationto“DeletionPropagation”

ProvenanceFQ,D =x1y1z1 +x1y2z2 +x2y3z2

0

[Greenetal.’07]

optionalslide

top related