index selection and external sorting€¦ · understanding the workload • a workload is a mix of...

31
CompSci 516 Database Systems Lecture 9 Index Selection and External Sorting Instructor: Sudeepa Roy 1 Duke CS, Fall 2017 CompSci 516: Database Systems

Upload: others

Post on 14-May-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Index Selection and External Sorting€¦ · Understanding the Workload • A workload is a mix of queriesand updates • For each query in the workload: – Which relations does

CompSci 516DatabaseSystems

Lecture9IndexSelection

andExternalSorting

Instructor:Sudeepa Roy

1DukeCS,Fall2017 CompSci516:DatabaseSystems

Page 2: Index Selection and External Sorting€¦ · Understanding the Workload • A workload is a mix of queriesand updates • For each query in the workload: – Which relations does

Announcements• Privateprojectthreadscreatedonpiazza– Pleaseusethesethreads(andnotemails)forallcommunicationsonyourproject

• Projectproposaldeadlineextendeduntil9/28Thursday5pm

• HW2tobepostedsoon–WillreceiveemailwithyourCSemailIDforAWSaccountsfromKeping/Yilin

DukeCS,Fall2017 CompSci516:DatabaseSystems 2

Page 3: Index Selection and External Sorting€¦ · Understanding the Workload • A workload is a mix of queriesand updates • For each query in the workload: – Which relations does

Today

• Indexselection• Externalsort

DukeCS,Fall2017 CompSci516:DatabaseSystems 3

Page 4: Index Selection and External Sorting€¦ · Understanding the Workload • A workload is a mix of queriesand updates • For each query in the workload: – Which relations does

ReadingMaterial

• Index:asinLecture7/8

• Externalsorting:• [RG]– Externalsorting:Chapter13

• [GUW]– Chapter15.4.1

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 4

Acknowledgement:Thefollowingslideshavebeencreatedadaptingtheinstructormaterialofthe[RG]bookprovidedbytheauthorsDr.Ramakrishnan andDr.Gehrke.

Page 5: Index Selection and External Sorting€¦ · Understanding the Workload • A workload is a mix of queriesand updates • For each query in the workload: – Which relations does

SelectionofIndexes

DukeCS,Fall2017 CompSci516:DatabaseSystems 5

Page 6: Index Selection and External Sorting€¦ · Understanding the Workload • A workload is a mix of queriesand updates • For each query in the workload: – Which relations does

DifferentFileOrganizations

Searchkey=<age,sal>

Considerfollowingoptions:

• Heapfiles– randomorder;insertatend-of-file

• Sortedfiles– sortedon<age,sal>

• ClusteredB+treefile– searchkey<age,sal>

• Heapfilewithunclustered B+-treeindex– onsearchkey<age,sal>

• Heapfilewithunclustered hashindex– onsearchkey<age,sal>

DukeCS,Fall2017 CompSci516:DatabaseSystems 6

Weneedtounderstandtheimportanceofappropriatefileorganizationandindex

Page 7: Index Selection and External Sorting€¦ · Understanding the Workload • A workload is a mix of queriesand updates • For each query in the workload: – Which relations does

PossibleOperations

• Scan– Fetchallrecordsfromdisktobufferpool

• Equalitysearch– Findallemployeeswithage=23andsal =50– Fetchpagefromdisk,thenlocatequalifyingrecordinpage

• Rangeselection– Findallemployeeswithage>35

• Insertarecord– identifythepage,fetchthatpagefromdisk,insetrecord,writeback

todisk(possiblyotherpagesaswell)

• Deletearecord– similartoinsert

DukeCS,Fall2017 CompSci516:DatabaseSystems 7

Page 8: Index Selection and External Sorting€¦ · Understanding the Workload • A workload is a mix of queriesand updates • For each query in the workload: – Which relations does

UnderstandingtheWorkload• Aworkloadisamixofqueries andupdates

• Foreachqueryintheworkload:– Whichrelationsdoesitaccess?– Whichattributesareretrieved?– Whichattributesareinvolvedinselection/joinconditions?How

selectivearetheseconditionslikelytobe?

• Foreachupdateintheworkload:– Whichattributesareinvolvedinselection/joinconditions?How

selectivearetheseconditionslikelytobe?– Thetypeofupdate(INSERT/DELETE/UPDATE),andtheattributesthatare

affected

DukeCS,Fall2017 CompSci516:DatabaseSystems 8

Page 9: Index Selection and External Sorting€¦ · Understanding the Workload • A workload is a mix of queriesand updates • For each query in the workload: – Which relations does

ChoiceofIndexes

• Whatindexesshouldwecreate?– Whichrelationsshouldhaveindexes?Whatfield(s)shouldbethesearchkey?Shouldwebuildseveralindexes?

• Foreachindex,whatkindofanindexshoulditbe?– Clustered?Hash/tree?

DukeCS,Fall2017 CompSci516:DatabaseSystems 9

Page 10: Index Selection and External Sorting€¦ · Understanding the Workload • A workload is a mix of queriesand updates • For each query in the workload: – Which relations does

MoreonChoiceofIndexes• Oneapproach:– Considerthemostimportantqueries– Considerthebestplanusingthecurrentindexes– Seeifabetterplanispossiblewithanadditionalindex.– Ifso,createit.– Obviously,thisimpliesthatwemustunderstandhowaDBMSevaluatesqueriesandcreatesqueryevaluationplans

– Wewilllearnqueryexecutionandoptimizationlater- Fornow,wediscusssimple1-tablequeries.

• Beforecreatinganindex,mustalsoconsidertheimpactonupdatesintheworkload

DukeCS,Fall2017 CompSci516:DatabaseSystems 10

Page 11: Index Selection and External Sorting€¦ · Understanding the Workload • A workload is a mix of queriesand updates • For each query in the workload: – Which relations does

Trade-offsforIndexes• Indexescanmake– queriesgofaster– updatesslower

• Requirediskspace,too

DukeCS,Fall2017 CompSci516:DatabaseSystems 11

Page 12: Index Selection and External Sorting€¦ · Understanding the Workload • A workload is a mix of queriesand updates • For each query in the workload: – Which relations does

IndexSelectionGuidelines• AttributesinWHEREclausearecandidatesforindexkeys

– Exactmatchconditionsuggestshashindex– Rangequerysuggeststreeindex– Clusteringisespeciallyusefulforrangequeries

• canalsohelponequalityqueriesiftherearemanyduplicates

• Trytochooseindexesthatbenefitasmanyqueriesaspossible– Sinceonlyoneindexcanbeclusteredperrelation,chooseitbasedon

importantqueriesthatwouldbenefitthemostfromclustering

• Multi-attributesearchkeysshouldbeconsideredwhenaWHEREclausecontainsseveralconditions– Orderofattributesisimportantforrangequeries

• Note:clusteredindexshouldbeusedjudiciously– expensiveupdates,althoughcheaperthansortedfiles

DukeCS,Fall2017 CompSci516:DatabaseSystems 12

Page 13: Index Selection and External Sorting€¦ · Understanding the Workload • A workload is a mix of queriesand updates • For each query in the workload: – Which relations does

ExamplesofClusteredIndexes• B+treeindexonE.age canbeusedtogetqualifyingtuples

• Howselectiveisthecondition?– everyone>40,indexnotofmuchhelp,scanisasgood

– Suppose10%>40.Then?

• Dependsoniftheindexisclustered– otherwisecanbemoreexpensivethanalinearscan

– ifclustered,10%I/O(+indexpages)

SELECT E.dnoFROM Emp EWHERE E.age>40

DukeCS,Fall2017 CompSci516:DatabaseSystems 13

Whatisagoodindexingstrategy?

Whichattribute(s)?Clustered/Unclustered?B+tree/Hash?

Page 14: Index Selection and External Sorting€¦ · Understanding the Workload • A workload is a mix of queriesand updates • For each query in the workload: – Which relations does

ExamplesofClusteredIndexesGroup-Byquery

• UseE.age assearchkey?– BadIfmanytupleshaveE.age >10orifnot

clustered….– …usingE.age indexandsortingtheretrieved

tuplesbyE.dno maybecostly

• ClusteredE.dno indexmaybebetter– Firstgroupby,thencounttupleswithage>

10– goodwhenage>10isnottooselective

• Note:thefirstoptionisgoodwhentheWHEREconditionishighlyselective(fewtupleshaveage>10),thesecondisgoodwhennothighlyselective

SELECT E.dno, COUNT (*)FROM Emp EWHERE E.age>10GROUP BY E.dno

DukeCS,Fall2017 CompSci516:DatabaseSystems 14

Whatisagoodindexingstrategy?

Whichattribute(s)?Clustered/Unclustered?B+tree/Hash?

Page 15: Index Selection and External Sorting€¦ · Understanding the Workload • A workload is a mix of queriesand updates • For each query in the workload: – Which relations does

ExamplesofClusteredIndexes

Equalityqueriesandduplicates

• ClusteringonE.hobby helps– hobbynotacandidatekey,several

tuplespossible

• Doesclusteringhelpnow?– (eid =key)– Notmuch– atmostonetuplesatisfiesthe

condition

SELECT E.dnoFROM Emp EWHERE E.hobby=‘Stamps’

DukeCS,Fall2017 CompSci516:DatabaseSystems 15

SELECT E.dnoFROM Emp EWHERE E.eid=50

Whatisagoodindexingstrategy?

Whichattribute(s)?Clustered/Unclustered?B+tree/Hash?

Page 16: Index Selection and External Sorting€¦ · Understanding the Workload • A workload is a mix of queriesand updates • For each query in the workload: – Which relations does

IndexeswithCompositeSearchKeys• CompositeSearchKeys:Searchona

combinationoffields

• Equalityquery:Everyfieldvalueisequaltoaconstantvalue.E.g.wrt<sal,age>index:– age=20andsal =75

• Rangequery:Somefieldvalueisnotaconstant.E.g.:– sal >10– whichcombination(s)

wouldhelp?

– <age,sal>doesnothelp– B+tree on<sal>or<sal,age>helps– hastobeaprefix

sue 13 75

bobcaljoe 12

10

208011

12

name age sal

<sal, age>

<age, sal> <age>

<sal>

12,2012,1011,80

13,75

20,1210,12

75,1380,11

11121213

10207580

Data recordssorted by name

Data entries in indexsorted by <sal,age>

Data entriessorted by <sal>

Examples of composite keyindexes using lexicographic order.

Page 17: Index Selection and External Sorting€¦ · Understanding the Workload • A workload is a mix of queriesand updates • For each query in the workload: – Which relations does

CompositeSearchKeys• ToretrieveEmp recordswithage=30AND sal=4000,anindexon

<age,sal>wouldbebetterthananindexonage oranindexonsal– firstfindage=30,amongthemsearchsal =4000

• Ifconditionis:20<age<30AND 3000<sal<5000:– Clusteredtreeindexon<age,sal>or<sal,age>isbest.

• Ifconditionis:age=30AND 3000<sal<5000:– Clustered<age,sal>indexmuchbetterthan<sal,age>index– moreindexentriesareretrievedforthelatter

• Compositeindexesarelarger,updatedmoreoftenDukeCS,Fall2017 CompSci516:DatabaseSystems 17

Page 18: Index Selection and External Sorting€¦ · Understanding the Workload • A workload is a mix of queriesand updates • For each query in the workload: – Which relations does

Index-OnlyPlans• Anumberofqueriescanbeansweredwithoutretrievingany

tuplesfromoneormoreoftherelationsinvolvedifasuitableindexisavailable

SELECT E.dno, COUNT(*)FROM Emp EGROUP BY E.dno

SELECT E.dno, MIN(E.sal)FROM Emp EGROUP BY E.dno

SELECT AVG(E.sal)FROM Emp EWHERE E.age=25 AND

E.sal BETWEEN 3000 AND 5000

<E.dno>

<E.dno,E.sal>Tree index!

<E. age,E.sal>Tree index!

DukeCS,Fall2017 CompSci516:DatabaseSystems 18

• Forindex-onlystrategies,clusteringisnotimportant

Page 19: Index Selection and External Sorting€¦ · Understanding the Workload • A workload is a mix of queriesand updates • For each query in the workload: – Which relations does

ExternalSorting

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 19

Page 20: Index Selection and External Sorting€¦ · Understanding the Workload • A workload is a mix of queriesand updates • For each query in the workload: – Which relations does

quickreviewofmergesort onwhiteboard

WhySort?• Aclassicproblemincomputerscience• Datarequestedinsortedorder

– e.g.,findstudentsinincreasinggpa order

• SortingisfirststepinbulkloadingB+treeindex• Sortingusefulforeliminatingduplicatecopiesinacollectionofrecords

• Sort-mergejoinalgorithminvolvessorting• Problem:sort1Gbofdatawith1MbofRAM

– needtominimizethecostofdiskaccess

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 20

Page 21: Index Selection and External Sorting€¦ · Understanding the Workload • A workload is a mix of queriesand updates • For each query in the workload: – Which relations does

2-WaySort:Requires3Buffers• SupposeN=2k pagesinthefile• Pass0:Readapage,sortit,writeit.

– repeatforall2k pages– onlyonebufferpageisused

• Pass1:– Readtwopages,sort(merge)themusingoneoutputpage,writethemtodisk– repeat2k-1times– threebufferpagesused

• Pass2,3,4,…..continue

Main memory buffers

INPUT 1

INPUT 2

OUTPUT

DiskDiskDukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 21

Page 22: Index Selection and External Sorting€¦ · Understanding the Workload • A workload is a mix of queriesand updates • For each query in the workload: – Which relations does

Two-WayExternalMergeSort• Eachsortedsub-fileiscalledarun– eachruncancontainmultiplepages

• Eachpassweread+writeeachpageinfile.

• Npagesinthefile,• =>thenumberofpasses

• Sotoal costis:

• Nottoopractical,butusefultolearnbasicconceptsforexternalsorting

é ù= +log2 1N

é ù( )2 12N Nlog +

Input file

1-page runs

2-page runs

4-page runs

8-page runs

PASS0

PASS1

PASS2

PASS3

9

3,4 6,2 9,4 8,7 5,6 3,1 2

3,4 5,62,6 4,9 7,8 1,3 2

2,34,6

4,78,9

1,35,6 2

2,34,46,78,9

1,23,56

1,22,33,44,56,67,8

9

Page 23: Index Selection and External Sorting€¦ · Understanding the Workload • A workload is a mix of queriesand updates • For each query in the workload: – Which relations does

GeneralExternalMergeSort

• TosortafilewithNpagesusingBbufferpages:– Pass0:useBbufferpages:

• Produce⌈N/B⌉ sortedrunsofBpageseach.– Pass1,2,…,etc.:mergeB-1runs tooneoutputpage

• keepwritingtodiskoncetheoutputpageisfull

B Main memory buffers

INPUT 1

INPUT B-1

OUTPUT

DiskDisk

INPUT 2. . . . . .. . .

• Suppose we have more than 3 buffer pages. • How can we utilize them?

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 23

Page 24: Index Selection and External Sorting€¦ · Understanding the Workload • A workload is a mix of queriesand updates • For each query in the workload: – Which relations does

CostofExternalMergeSort

• Numberofpasses:1+⌈logB-1⌈N/B⌉⌉• Cost=2N*(#ofpasses)– why2times?• E.g.,with5bufferpages,tosort108pagefile:• Pass0:sorting5pagesatatime

– ⌈108/5⌉ =22sortedrunsof5pageseach(lastrunisonly3pages)

• Pass1:4-waymerge– ⌈22/4⌉ =6sortedrunsof20pageseach(lastrunisonly8pages)

• Pass2:4-waymerge– (but2-wayforthelasttworuns)– [6/4⌉ =2sortedruns,80pagesand28pages

• Pass3:2-waymerge(only2runsremaining)– Sortedfileof108pages

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 24

Page 25: Index Selection and External Sorting€¦ · Understanding the Workload • A workload is a mix of queriesand updates • For each query in the workload: – Which relations does

NumberofPassesofExternalSort

N B=3 B=5 B=9 B=17 B=129 B=257100 7 4 3 2 1 11,000 10 5 4 3 2 210,000 13 7 5 4 2 2100,000 17 9 6 5 3 31,000,000 20 10 7 5 3 310,000,000 23 12 8 6 4 3100,000,000 26 14 9 7 4 41,000,000,000 30 15 10 8 5 4

HighBisgood,althoughCPUcostincreases

12

Page 26: Index Selection and External Sorting€¦ · Understanding the Workload • A workload is a mix of queriesand updates • For each query in the workload: – Which relations does

I/OforExternalMergeSort

• If10bufferpages– eithermerge9runsatatimewithoneoutputbuffer– or8runswithtwooutputbuffers

• If#pageI/Oisthemetric– goalisminimizethe#passes– eachpageisreadandwrittenineachpass

• Ifwedecidetoreadablock ofbpagessequentially– Suggestsweshouldmakeeachbuffer(input/output)beablock ofpages

– Butthiswillreducefan-outduringmergepasses• i.e.notasmanyrunscanbemergedagainanymore

– Inpractice,mostfilesstillsortedin2-3passes

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 26

Page 27: Index Selection and External Sorting€¦ · Understanding the Workload • A workload is a mix of queriesand updates • For each query in the workload: – Which relations does

DoubleBuffering• ToreduceCPUwaittimeforI/Orequesttocomplete,canprefetch into`shadowblock’.

OUTPUT

OUTPUT'

Disk Disk

INPUT 1

INPUT k

INPUT 2

INPUT 1'

INPUT 2'

INPUT k'

block sizeb

B main memory buffers, k-way mergeDukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 27

Page 28: Index Selection and External Sorting€¦ · Understanding the Workload • A workload is a mix of queriesand updates • For each query in the workload: – Which relations does

UsingB+TreesforSorting

• Scenario:TabletobesortedhasB+treeindexonsortingcolumn(s)

• Idea:Canretrievedataentries(thenrecords)inorderbytraversingleafpages.

• Isthisagoodidea?• Casestoconsider:

– B+treeisclustered: Goodidea!– B+treeisnotclustered: Couldbeaverybadidea!

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 28

Page 29: Index Selection and External Sorting€¦ · Understanding the Workload • A workload is a mix of queriesand updates • For each query in the workload: – Which relations does

ClusteredB+TreeUsedforSorting

• Cost:roottotheleft-mostleaf,thenretrieveallleafpages(Alternative1)

• IfAlternative2isused?Additionalcostofretrievingdatarecords:eachpagefetchedjustonce

☛ Always better than external sorting!

(Directs search)

Data Records

Index

Data Entries("Sequenceset")

17

Page 30: Index Selection and External Sorting€¦ · Understanding the Workload • A workload is a mix of queriesand updates • For each query in the workload: – Which relations does

UnclusteredB+TreeUsedforSorting

• Alternative(2)fordataentries;eachdataentrycontainsrid ofadatarecord

• Ingeneral,oneI/Operdatarecord!

(Directs search)

Data Records

Index

Data Entries("Sequenceset")

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 30

Page 31: Index Selection and External Sorting€¦ · Understanding the Workload • A workload is a mix of queriesand updates • For each query in the workload: – Which relations does

Summary

• Externalsortingisimportant;DBMSmaydedicatepartofbufferpoolforsorting!

• ExternalmergesortminimizesdiskI/Ocost:– Pass0:ProducessortedrunsofsizeB(#bufferpages)– Laterpasses:mergeruns– #ofrunsmergedatatimedependsonB,andblocksize.– LargerblocksizemeanslessI/Ocostperpage.– Largerblocksizemeanssmaller#runsmerged.– Inpractice,#ofrunsrarelymorethan2or3

DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 31