cs 4604: introduction to database management...

89
CS 4604: Introduction to Database Management Systems B. Aditya Prakash Lecture #10: Query Processing

Upload: others

Post on 11-Oct-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

CS4604:IntroductiontoDatabaseManagementSystems

B.AdityaPrakashLecture#10:QueryProcessing

Page 2: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 2

Outline

§  introduction§  selection§  projection§  join§  set&aggregateoperations

Page 3: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 3

Introduction

§  Today’stopic:QUERYPROCESSING§  SomedatabaseoperationsareEXPENSIVE§  Cangreatlyimproveperformancebybeing“smart”–  e.g.,canspeedup1,000,000xovernaïveapproach

Page 4: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 4

Introduction(cnt’d)

§ Mainweaponsare:– cleverimplementationtechniquesforoperators– exploiting“equivalencies”ofrelationaloperators– usingstatisticsandcostmodelstochooseamongthese.

Page 5: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 5

AReallyBadQueryOptimizer

§  ForeachSelect-From-Wherequeryblock– docartesianproductsfirst–  thendoselections–  etc,ie.:

•  GROUPBY;HAVING•  projections•  ORDERBY

§  Incrediblyinefficient– Hugeintermediateresults!

× σpredicates

tables …

Page 6: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 6

Cost-basedQuerySub-System

Query Parser

Query Optimizer

Plan Generator Plan Cost Estimator

Query Plan Evaluator

Catalog Manager

Usuallythereisaheuristics-basedrewritingstepbeforethecost-basedsteps.

Schema Statistics

Select * From Blah B Where B.blah = blah

Queries

Page 7: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 7

TheQueryOptimizationGame§  “Optimizer”isabitofamisnomer…§  Goalistopicka“good”(i.e.,lowexpectedcost)plan.–  Involveschoosingaccessmethods,physicaloperators,operatororders,…

– Notionofcostisbasedonanabstract“costmodel”

Page 8: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 8

RelationalOperations§  Wewillconsiderhowtoimplement:–  Selection(σ)Selectsasubsetofrowsfromrelation.–  Projection(π)Deletesunwantedcolumnsfromrelation.–  Join()Allowsustocombinetworelations.–  Set-difference(-)Tuplesinreln.1,butnotinreln.2.–  Union(∪)Tuplesinreln.1andinreln.2.–  Aggregation(SUM,MIN,etc.)andGROUPBY

§  Recall:opscanbecomposed!§  Later(afterspringbreak),we’llseehowtooptimizequerieswithmanyops

▹◃

Page 9: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 9

SchemaforExamples

§  Similartooldschema;rnameaddedforvariations.§  Sailors:–  Eachtupleis50byteslong,80tuplesperpage,500pages.–  N=500,pS=80.

§  Reserves:–  Eachtupleis40byteslong,100tuplesperpage,1000pages.– M=1000,pR=100.

Sailors(sid:integer,sname:string,rating:integer,age:real)Reserves(sid:integer,bid:integer,day:dates,rname:string)

Page 10: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 10

SimpleSelections

§  Oftheform§  Question:howbesttoperform?

SELECT*FROMReservesRWHERER.rname<‘C%’

σ R attr valueop R. ( )

Page 11: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 11

SimpleSelections

§  A:Dependson:– whatindexes/accesspathsareavailable– whatistheexpectedsizeoftheresult(intermsofnumberoftuplesand/ornumberofpages)

Page 12: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 12

SimpleSelections

§  Sizeofresultapproximatedas sizeofR*reductionfactor– “reductionfactor”isalsocalledselectivity.– estimateofreductionfactorsisbasedonstatistics–wewilldiscussshortly.

Page 13: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 13

AlternativesforSimpleSelections

§ Withnoindex,unsorted:– Mustessentiallyscanthewholerelation– costisM(#pagesinR).For“reserves”=1000I/Os.

Page 14: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 14

SimpleSelections(cnt’d)

§ Withnoindex,sorted:– costofbinarysearch+numberofpagescontainingresults.

– Forreserves=10I/Os+⎡selectivity*#pages⎤

Page 15: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 15

SimpleSelections(cnt’d)

§ Withanindexonselectionattribute:– Useindextofindqualifyingdataentries,–  thenretrievecorrespondingdatarecords.–  (Hashindexusefulonlyforequalityselections.)

Page 16: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 16

UsinganIndexforSelections

§  Costdependson#qualifyingtuples,andclustering.– Cost:•  findingqualifyingdataentries(typicallysmall)• pluscostofretrievingrecords(couldbelargew/oclustering).

Page 17: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 17

SelectionsusingIndex(cnt’d)

Index entries

Data entries

direct search for

(Index File) (Data file)

Data Records

data entries

Data entries

Data Records

CLUSTEREDUNCLUSTERED

Page 18: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 18

SelectionsusingIndex(cnt’d)–  Inexample“reserves”relation,if10%oftuplesqualify(100pages,10,000tuples).• Withaclusteredindex,costislittlemorethan100I/Os;•  ifunclustered,couldbeupto10,000I/Os!unless…

Page 19: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 19

SelectionsusingIndex(cnt’d)§  Importantrefinementforunclusteredindexes:

1.Findqualifyingdataentries.2.Sorttherid’softhedatarecordstoberetrieved.3.Fetchridsinorder.Thisensuresthateachdatapageislookedatjustonce(though#ofsuchpageslikelytobehigherthanwithclustering).

Page 20: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 20

GeneralSelectionConditions

§  Q:Whatwouldyoudo?

(day<8/9/94 AND rname=‘Paul’) OR bid=5 OR sid=3 SKIP

Page 21: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 21

GeneralSelectionConditions

§  Q:Whatwouldyoudo?§  A:trytofindaselective(clustering)index.Specifically:

(day<8/9/94 AND rname=‘Paul’) OR bid=5 OR sid=3 SKIP

Page 22: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 22

GeneralSelectionConditions

§  Converttoconjunctivenormalform(CNF):–  (day<8/9/94ORbid=5ORsid=3)AND(rname=‘Paul’ ORbid=5ORsid=3)

§  WeonlydiscussthecasewithnoORs(aconjunctionoftermsoftheformattropvalue).

(day<8/9/94 AND rname=‘Paul’) OR bid=5 OR sid=3 SKIP

Page 23: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 23

GeneralSelectionConditions

§  AB-treeindexmatches(aconjunctionof)termsthatinvolveonlyattributesinaprefixofthesearchkey.–  Indexon<a,b,c>matchesa=5ANDb=3,butnotb=3.

§  ForHashindex,musthaveallattributesinsearchkey

(day<8/9/94 AND rname=‘Paul’) OR bid=5 OR sid=3 SKIP

Page 24: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 24

TwoApproachestoGeneralSelections

§  Firstapproach:Findthecheapestaccesspath,retrievetuplesusingit,andapplyanyremainingtermsthatdon’tmatchtheindex

§  Secondapproach:getridsfromfirstindex;ridsfromsecondindex;intersectandfetch.

SKIP

Page 25: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 25

TwoApproachestoGeneralSelections

§  Firstapproach:Findthecheapestaccesspath,retrievetuplesusingit,andapplyanyremainingtermsthatdon’tmatchtheindex:– Cheapestaccesspath:AnindexorfilescanwithfewestI/Os.

– Termsthatmatchthisindexreducethenumberoftuplesretrieved;othertermshelpdiscardsomeretrievedtuples,butdonotaffectnumberoftuples/pagesfetched.

SKIP

Page 26: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 26

CheapestAccessPath-Example§  Considerday<8/9/94ANDbid=5ANDsid=3.

§  AB+treeindexondaycanbeused;– then,bid=5andsid=3mustbecheckedforeachretrievedtuple.

§  Similarly,ahashindexon<bid,sid>couldbeused;– Then,day<8/9/94mustbechecked.

SKIP

Page 27: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 27

CheapestAccessPath-cnt’d

§  Considerday<8/9/94ANDbid=5ANDsid=3.

§  HowaboutaB+treeon<rname,day>?§  HowaboutaB+treeon<day,rname>?§  HowaboutaHashindexon<day,rname>?

SKIP

Page 28: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 28

IntersectionofRIDs

§  Secondapproach:ifwehave2ormorematchingindexes(w/Alternatives(2)or(3)fordataentries):– Getsetsofridsofdatarecordsusingeachmatchingindex.

– Thenintersectthesesetsofrids.– Retrievetherecordsandapplyanyremainingterms.

SKIP

Page 29: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 29

IntersectionofRIDs(cnt’d)

§  EXAMPLE:Considerday<8/9/94ANDbid=5ANDsid=3.

§  WithaB+treeindexondayandanindexonsid,§  wecanretrieveridsofrecordssatisfyingday<8/9/94usingthefirst,

§  ridsofrecssatisfyingsid=3usingthesecond,§  intersect,§  retrieverecordsandcheckbid=5.

SKIP

Page 30: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 30

TheProjectionOperation

§  Issueisremovingduplicates.§  Basicapproach:sorting– 1.ScanR,extractonlytheneededattrs(why?)– 2.Sorttheresultingset– 3.RemoveadjacentduplicatesCost:Reserveswithsizeratio0.25=250pages.With20bufferpagescansortin2passes,so1000+250+2*2*250+250=2500I/Os

SELECTDISTINCTR.sid,R.bidFROMReservesR

Page 31: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 31

Projection

§  Canimprovebymodifyingexternalsortalgorithm(seechapter13):– ModifyPass0ofexternalsorttoeliminateunwantedfields.

– Modifymergingpassestoeliminateduplicates.Cost:forabovecase:read1000pages,writeout250inrunsof40pages,mergeruns=1000+250+250=1500.

SKIP

Page 32: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 32

DiscussionofProjection

§  Ifanindexontherelationcontainsallwantedattributesinitssearchkey,candoindex-onlyscan.– Applyprojectiontechniquestodataentries(muchsmaller!)

Page 33: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 33

DiscussionofProjection

§  Ifanordered(i.e.,tree)indexcontainsallwantedattributesasprefixofsearchkey,candoevenbetter:– Retrievedataentriesinorder(index-onlyscan),discardunwantedfields,compareadjacenttuplestocheckforduplicates.

AB-treeindexmatches(aconjunctionof)termsthatinvolveonlyattributesinaprefixofthesearchkey.–  Indexon<a,b,c>matchesa=5ANDb=3,butnotb=3.

ForHashindex,musthaveallattributesinsearchkey

Page 34: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 34

Joins

§  Joinsareverycommon.§  Joinscanbeveryexpensive(crossproductin

worstcase).§  Manyapproachestoreducejoincost.

Page 35: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 35

Joins

§  Jointechniqueswewillcover:– Nested-loopsjoin–  Index-nestedloopsjoin– Sort-mergejoin– Hashjoin

Page 36: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 36

EqualityJoinsWithOneJoinColumn

§  Inalgebra:RS.Common!Mustbecarefullyoptimized.R×Sislarge;so,R×Sfollowedbyaselectionisinefficient.

§  Remember,joinisassociativeandcommutative.

SELECT*FROMReservesR1,SailorsS1WHERER1.sid=S1.sid

▹◃

Page 37: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 37

EqualityJoins

§  Assume:– MpagesinR,pRtuplesperpage,mtuplestotal– NpagesinS,pStuplesperpage,ntuplestotal–  Inourexamples,RisReservesandSisSailors.

§ Wewillconsidermorecomplexjoinconditionslater.

§  Costmetric:#ofI/Os.Wewillignoreoutputcosts.

Page 38: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 38

Nestedloops

§  Algorithm#0:(naive)nestedloop(SLOW!)

R(A,..)

S(A, ......) m

n

Page 39: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 39

Nestedloops

§  Algorithm#0:(naive)nestedloop(SLOW!)foreachtuplerofR

foreachtuplesofSprint,iftheymatch

R(A,..)

S(A, ......) m

n

Page 40: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 40

Nestedloops

§  Algorithm#0:(naive)nestedloop(SLOW!)foreachtuplerofR

foreachtuplesofSprint,iftheymatch

R(A,..)

S(A, ......) m

n

outer relation

inner relation

Page 41: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 41

Nestedloops

§  Algorithm#0:whyisitbad?§  howmanydiskaccesses(‘M’and‘N’arethenumberofblocksfor‘R’and‘S’)?

R(A,..)

S(A, ......) M pages,

m tuples N pages,

n tuples

Page 42: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 42

Nestedloops

§  Algorithm#0:whyisitbad?§  howmanydiskaccesses(‘M’and‘N’arethenumberofblocksfor‘R’and‘S’)?M+m*N

R(A,..)

S(A, ......) M pages,

m tuples N pages,

n tuples

Page 43: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 43

SimpleNestedLoopsJoin

§  Actualnumber(pR*M)*N+M=100*1000*500+1000I/Os.– At10ms/IO,Total:???

§ Whatifsmallerrelation(S)wasouter?

§ Whatassumptionsarebeingmadehere?

Page 44: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 44

SimpleNestedLoopsJoin

§  Actualnumber§  (pR*M)*N+M=100*1000*500+1000I/Os.– At10ms/IO,Total:~6days(!)

§ Whatifsmallerrelation(S)wasouter?– slightlybetter

§ Whatassumptionsarebeingmadehere?– 1bufferforeachtable(and1foroutput)

Page 45: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 45

Nestedloops

R(A,..)

S(A, ......) M pages,

m tuples N pages,

n tuples

•  Algorithm #1: Blocked nested-loop join –  read in a block of R

•  read in a block of S –  print matching tuples COST?

Page 46: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 46

Nestedloops

R(A,..)

S(A, ......) M pages,

m tuples N pages,

n tuples

•  Algorithm #1: Blocked nested-loop join –  read in a block of R

•  read in a block of S –  print matching tuples COST= M+M*N

Page 47: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 47

Nestedloops

R(A,..)

S(A, ......) M pages,

m tuples N pages,

n tuples

•  Which one should be the outer relation?

COST= M+M*N

Page 48: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 48

Nestedloops

R(A,..)

S(A, ......) M pages,

m tuples N pages,

n tuples

•  Which one should be the outer relation? •  A: the smallest (page-wise)

COST= M+M*N

Page 49: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 49

Nestedloops

R(A,..)

S(A, ......) M pages,

m tuples N pages,

n tuples

•  M=1000, N=500 •  Cost = 1000 + 1000*500 = 501,000 •  = 5010 sec ~ 1.4h COST= M+M*N

Page 50: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 50

Nestedloops

R(A,..)

S(A, ......) M pages,

m tuples N pages,

n tuples

•  M=1000, N=500 - if smaller is outer: •  Cost = 500 + 1000*500 = 500,500 •  = 5005 sec ~ 1.4h COST= N+M*N

Page 51: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 51

Nestedloops

R(A,..)

S(A, ......) M pages,

m tuples N pages,

n tuples

•  What if we have B buffers available?

Page 52: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 52

Nestedloops

R(A,..)

S(A, ......) M pages,

m tuples N pages,

n tuples

•  What if we have B buffers available? •  A: give B-2 buffers to outer, 1 to inner, 1 for

output

Page 53: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 53

Nestedloops

R(A,..)

S(A, ......) M pages,

m tuples N pages,

n tuples

•  Algorithm #1: Blocked nested-loop join –  read in B-2 blocks of R

•  read in a block of S –  print matching tuples COST= ?

Page 54: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 54

Nestedloops

R(A,..)

S(A, ......) M pages,

m tuples N pages,

n tuples

•  Algorithm #1: Blocked nested-loop join –  read in B-2 blocks of R

•  read in a block of S –  print matching tuples COST= M+M/(B-2)*N

Page 55: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 55

Nestedloops

R(A,..)

S(A, ......) M pages,

m tuples N pages,

n tuples

•  and, actually: •  Cost = M + ceiling(M/(B-2)) * N

COST= M+M/(B-2)*N

Page 56: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 56

Nestedloops

R(A,..)

S(A, ......) M pages,

m tuples N pages,

n tuples

•  If smallest (outer) fits in memory •  (ie., B= N+2), •  Cost =? COST= N+N/(B-2)*M

Page 57: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 57

Nestedloops

R(A,..)

S(A, ......) M pages,

m tuples N pages,

n tuples

•  If smallest (outer) fits in memory •  (ie., B= N+2), •  Cost =N+M (minimum!) COST= N+N/(B-2)*M

Page 58: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 58

Nestedloops-guidelines

§  pickasouterthesmallesttable(=fewestpages)

§  fitasmuchofitinmemoryaspossible§  loopovertheinner

Page 59: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 59

§  useanexistingindex,orevenbuildoneonthefly

§  cost:M+m*c(c:look-upcost)

IndexNLjoin

R(A,..)

S(A, ......) M pages,

m tuples N pages,

n tuples

Page 60: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 60

§  cost:M+m*c(c:look-upcost)§  ‘c’dependswhethertheindexisclusteredornot.

IndexNLjoin

R(A,..)

S(A, ......) M pages,

m tuples N pages,

n tuples

Page 61: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 61

Joins

§  Jointechniqueswewillcover:– Nested-loopsjoin–  Index-nestedloopsjoin– Sort-mergejoin– Hashjoin

Page 62: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 62

Sort-mergejoin

R(A,..)

S(A, ......) M pages,

m tuples N pages,

n tuples

•  sort both on joining attributed •  scan each and merge •  Cost, given B buffers?

Page 63: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 63

Sort-mergejoin

R(A,..)

S(A, ......) M pages,

m tuples N pages,

n tuples

•  Cost, given B buffers? •  ~ 2*M*logM/logB + 2*N* logN/logB + M + N

Page 64: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 64

Sort-MergeJoin§  Usefulif

Page 65: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 65

Sort-MergeJoin§  Usefulif– oneorbothinputsarealreadysortedonjoinattribute(s)

– outputisrequiredtobesortedonjoinattributes(s)

§  “Merge”phasecanrequiresomebacktrackingifduplicatevaluesappearinjoincolumn

Page 66: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 66

ExampleofSort-MergeJoin

sid sname rating age22 dustin 7 45.028 yuppy 9 35.031 lubber 8 55.544 guppy 5 35.058 rusty 10 35.0

sid bid day rname28 103 12/4/96 guppy28 103 11/3/96 yuppy31 101 10/10/96 dustin31 102 10/12/96 lubber31 101 10/11/96 lubber58 103 11/12/96 dustin

Page 67: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 67

ExampleofSort-MergeJoin

§  With35,100or300bufferpages,bothReservesandSailorscanbesortedin2passes;totaljoincost:7500.

§  (while Block Nested Loop (BNL) cost: 2,500 to 15,000 I/Os)

Page 68: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 68

Sort-mergejoin

§ Worstcaseformergingphase?

§  Cost?

Page 69: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 69

Refinements

§  Alltherefinementsofexternalsorting§  plusoverlappingofthemergingofsortingwiththemergingofjoining.

Page 70: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 70

Joins

§  Jointechniqueswewillcover:– Nested-loopsjoin–  Index-nestedloopsjoin– Sort-mergejoin– Hashjoin

Page 71: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 71

§  hashjoin:usehashingfunctionh()– hash‘R’into(0,1,...,‘max’)buckets– hash‘S’intobuckets(samehashfunction)–  joineachpairofmatchingbuckets

Hashjoins

R(A, ...) S(A, ......) 0

1

max

Page 72: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 72

– howtojoineachpairofpartitionsHr-i,Hs-i?– A:buildanotherhashtableforHs-i,andprobeitwitheachtupleofHr-i

Hashjoin-details

R(A, ...) S(A, ......)

Hr-0

0

1

max

Hs-0

Page 73: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 73

Hashjoin-details

§  Inmoredetail:§  Choosethe(page-wise)smallest-ifitfitsinmemory,do~NL– and,actually,buildahashtable(withh2()!=h())– andprobeit,witheachtupleoftheother

Page 74: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 74

§  whatifHs-iistoolargetofitinmain-memory?

§  A:recursivepartitioning§  moredetails(overflows,hybridhashjoins):inbook

§  costofhashjoin?(ifwehaveenoughbuffers:)3(M+N)(why?Seenextslide)

Hashjoindetails

Page 75: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 75

CostofHash-Join

§  Inpartitioningphase,read+writebothrelns;2(M+N).Inmatchingphase,readbothrelns;M+NI/Os.

§  Inourrunningexample,thisisatotalof4500I/Os.

Page 76: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 76

§  [costofhashjoin?(ifwehaveenoughbuffers:)3(M+N)]

§ Whatis‘enough’?sqrt(N),orsqrt(M)?

Hashjoindetails

Page 77: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 77

§  [costofhashjoin?(ifwehaveenoughbuffers:)3(M+N)]

§  Whatis‘enough’?sqrt(N),orsqrt(M)?§  A:sqrt(smallest)(why?)–  BecauseyouonlyneedenoughmemorytoholdthehashtablepartitionsofthesmallertableinmemorysoB>sizeofsmaller/B-1èB~sqrt(size-of-smaller)

Hashjoindetails Details

Page 78: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 78

Sort-MergeJoinvs.HashJoin

§  Givenaminimumamountofmemorybothhaveacostof3(M+N)I/Os.

(min.memoryforsort-merge=sqrt(largertable)usingaggressiverefinements---intextbook)(min.memoryforhash=sqrt(smallertable)---seepreviousslides)

Page 79: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 79

Sort-MergevsHashjoin

§  HashJoinPros:– ??– ??– ??

§  Sort-MergeJoinPros:– ??

Page 80: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 80

Sort-MergevsHashjoin

§  HashJoinPros:– Superiorifrelationsizesdiffergreatly– Showntobehighlyparallelizable(beyondscopeofclass)

§  Sort-MergeJoinPros:– ??

Page 81: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 81

Sort-MergevsHashjoin

§  HashJoinPros:– Superiorifrelationsizesdiffergreatly– Showntobehighlyparallelizable(beyondscopeofclass)

§  Sort-MergeJoinPros:– Lesssensitivetodataskew– Resultissorted(mayhelp“upstream”operators)– goesfasterifoneorbothinputsalreadysorted

Page 82: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 82

GeneralJoinConditions

§  Equalitiesoverseveralattributes(e.g.,R.sid=S.sidANDR.rname=S.sname):– allpreviousmethodsapply,usingthecompositekey

Page 83: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 83

GeneralJoinConditions

§  Inequalityconditions(e.g.,R.rname<S.sname):§  whichmethodsstillapply?– NL–  indexNL– Sortmerge– Hashjoin

Page 84: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 84

GeneralJoinConditions

§  Inequalityconditions(e.g.,R.rname<S.sname):§  whichmethodsstillapply?– NL (probably,thebest!)–  indexNL (onlyifclusteredindex)– Sortmerge (doesnotapply!)(why?)– Hashjoin (doesnotapply!)(why?)

Page 85: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 85

SetOperations

§  Intersectionandcross-product:specialcasesofjoin

§  Union(Distinct)andExcept:similar;we’lldounion:

§  Effectively:concatenate;usesortingorhashing§  Sortingbasedapproachtounion:– Sortbothrelations(oncombinationofallattributes).– Scansortedrelationsandmergethem.– Alternative:MergerunsfromPass0forbothrelations.

SKIP

Page 86: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 86

SetOperations,cont’d

§  Hashbasedapproachtounion:– PartitionRandSusinghashfunctionh.– ForeachS-partition,buildin-memoryhashtable(usingh2),scancorrespondingR-partitionandaddtuplestotablewhilediscardingduplicates.

SKIP

Page 87: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 87

AggregateOperations(AVG,MIN,etc.)

§ Withoutgrouping:–  Ingeneral,requiresscanningtherelation.– GivenindexwhosesearchkeyincludesallattributesintheSELECTorWHEREclauses,candoindex-onlyscan.

SKIP

Page 88: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 88

Summary§  AvirtueofrelationalDBMSs:–  queriesarecomposedofafewbasicoperators–  Theimplementationoftheseoperatorscanbecarefullytuned

–  Importanttodothis!

§  Manyalternativeimplementationtechniquesforeachoperator–  Nouniversallysuperiortechniqueformostoperators.

“it depends” [Guy Lohman (IBM)]

Page 89: CS 4604: Introduction to Database Management Systemscourses.cs.vt.edu/~cs4604/Fall18/lectures/lecture-10.pdf§ Issue is removing duplicates. § Basic approach: sorting – 1. Scan

Prakash2018 VTCS4604 89

Summarycont’d

§  Mustconsideravailablealternativesforeachoperationinaqueryandchoosebestonebasedonsystemstatistics,etc.–  Partofthebroadertaskofoptimizingaquerycomposedofseveralops.