index selection and external sorting€¦ · understanding the workload • a workload is a mix of...
TRANSCRIPT
CompSci 516DatabaseSystems
Lecture9IndexSelection
andExternalSorting
Instructor:Sudeepa Roy
1DukeCS,Fall2017 CompSci516:DatabaseSystems
Announcements• Privateprojectthreadscreatedonpiazza– Pleaseusethesethreads(andnotemails)forallcommunicationsonyourproject
• Projectproposaldeadlineextendeduntil9/28Thursday5pm
• HW2tobepostedsoon–WillreceiveemailwithyourCSemailIDforAWSaccountsfromKeping/Yilin
DukeCS,Fall2017 CompSci516:DatabaseSystems 2
Today
• Indexselection• Externalsort
DukeCS,Fall2017 CompSci516:DatabaseSystems 3
ReadingMaterial
• Index:asinLecture7/8
• Externalsorting:• [RG]– Externalsorting:Chapter13
• [GUW]– Chapter15.4.1
DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 4
Acknowledgement:Thefollowingslideshavebeencreatedadaptingtheinstructormaterialofthe[RG]bookprovidedbytheauthorsDr.Ramakrishnan andDr.Gehrke.
SelectionofIndexes
DukeCS,Fall2017 CompSci516:DatabaseSystems 5
DifferentFileOrganizations
Searchkey=<age,sal>
Considerfollowingoptions:
• Heapfiles– randomorder;insertatend-of-file
• Sortedfiles– sortedon<age,sal>
• ClusteredB+treefile– searchkey<age,sal>
• Heapfilewithunclustered B+-treeindex– onsearchkey<age,sal>
• Heapfilewithunclustered hashindex– onsearchkey<age,sal>
DukeCS,Fall2017 CompSci516:DatabaseSystems 6
Weneedtounderstandtheimportanceofappropriatefileorganizationandindex
PossibleOperations
• Scan– Fetchallrecordsfromdisktobufferpool
• Equalitysearch– Findallemployeeswithage=23andsal =50– Fetchpagefromdisk,thenlocatequalifyingrecordinpage
• Rangeselection– Findallemployeeswithage>35
• Insertarecord– identifythepage,fetchthatpagefromdisk,insetrecord,writeback
todisk(possiblyotherpagesaswell)
• Deletearecord– similartoinsert
DukeCS,Fall2017 CompSci516:DatabaseSystems 7
UnderstandingtheWorkload• Aworkloadisamixofqueries andupdates
• Foreachqueryintheworkload:– Whichrelationsdoesitaccess?– Whichattributesareretrieved?– Whichattributesareinvolvedinselection/joinconditions?How
selectivearetheseconditionslikelytobe?
• Foreachupdateintheworkload:– Whichattributesareinvolvedinselection/joinconditions?How
selectivearetheseconditionslikelytobe?– Thetypeofupdate(INSERT/DELETE/UPDATE),andtheattributesthatare
affected
DukeCS,Fall2017 CompSci516:DatabaseSystems 8
ChoiceofIndexes
• Whatindexesshouldwecreate?– Whichrelationsshouldhaveindexes?Whatfield(s)shouldbethesearchkey?Shouldwebuildseveralindexes?
• Foreachindex,whatkindofanindexshoulditbe?– Clustered?Hash/tree?
DukeCS,Fall2017 CompSci516:DatabaseSystems 9
MoreonChoiceofIndexes• Oneapproach:– Considerthemostimportantqueries– Considerthebestplanusingthecurrentindexes– Seeifabetterplanispossiblewithanadditionalindex.– Ifso,createit.– Obviously,thisimpliesthatwemustunderstandhowaDBMSevaluatesqueriesandcreatesqueryevaluationplans
– Wewilllearnqueryexecutionandoptimizationlater- Fornow,wediscusssimple1-tablequeries.
• Beforecreatinganindex,mustalsoconsidertheimpactonupdatesintheworkload
DukeCS,Fall2017 CompSci516:DatabaseSystems 10
Trade-offsforIndexes• Indexescanmake– queriesgofaster– updatesslower
• Requirediskspace,too
DukeCS,Fall2017 CompSci516:DatabaseSystems 11
IndexSelectionGuidelines• AttributesinWHEREclausearecandidatesforindexkeys
– Exactmatchconditionsuggestshashindex– Rangequerysuggeststreeindex– Clusteringisespeciallyusefulforrangequeries
• canalsohelponequalityqueriesiftherearemanyduplicates
• Trytochooseindexesthatbenefitasmanyqueriesaspossible– Sinceonlyoneindexcanbeclusteredperrelation,chooseitbasedon
importantqueriesthatwouldbenefitthemostfromclustering
• Multi-attributesearchkeysshouldbeconsideredwhenaWHEREclausecontainsseveralconditions– Orderofattributesisimportantforrangequeries
• Note:clusteredindexshouldbeusedjudiciously– expensiveupdates,althoughcheaperthansortedfiles
DukeCS,Fall2017 CompSci516:DatabaseSystems 12
ExamplesofClusteredIndexes• B+treeindexonE.age canbeusedtogetqualifyingtuples
• Howselectiveisthecondition?– everyone>40,indexnotofmuchhelp,scanisasgood
– Suppose10%>40.Then?
• Dependsoniftheindexisclustered– otherwisecanbemoreexpensivethanalinearscan
– ifclustered,10%I/O(+indexpages)
SELECT E.dnoFROM Emp EWHERE E.age>40
DukeCS,Fall2017 CompSci516:DatabaseSystems 13
Whatisagoodindexingstrategy?
Whichattribute(s)?Clustered/Unclustered?B+tree/Hash?
ExamplesofClusteredIndexesGroup-Byquery
• UseE.age assearchkey?– BadIfmanytupleshaveE.age >10orifnot
clustered….– …usingE.age indexandsortingtheretrieved
tuplesbyE.dno maybecostly
• ClusteredE.dno indexmaybebetter– Firstgroupby,thencounttupleswithage>
10– goodwhenage>10isnottooselective
• Note:thefirstoptionisgoodwhentheWHEREconditionishighlyselective(fewtupleshaveage>10),thesecondisgoodwhennothighlyselective
SELECT E.dno, COUNT (*)FROM Emp EWHERE E.age>10GROUP BY E.dno
DukeCS,Fall2017 CompSci516:DatabaseSystems 14
Whatisagoodindexingstrategy?
Whichattribute(s)?Clustered/Unclustered?B+tree/Hash?
ExamplesofClusteredIndexes
Equalityqueriesandduplicates
• ClusteringonE.hobby helps– hobbynotacandidatekey,several
tuplespossible
• Doesclusteringhelpnow?– (eid =key)– Notmuch– atmostonetuplesatisfiesthe
condition
SELECT E.dnoFROM Emp EWHERE E.hobby=‘Stamps’
DukeCS,Fall2017 CompSci516:DatabaseSystems 15
SELECT E.dnoFROM Emp EWHERE E.eid=50
Whatisagoodindexingstrategy?
Whichattribute(s)?Clustered/Unclustered?B+tree/Hash?
IndexeswithCompositeSearchKeys• CompositeSearchKeys:Searchona
combinationoffields
• Equalityquery:Everyfieldvalueisequaltoaconstantvalue.E.g.wrt<sal,age>index:– age=20andsal =75
• Rangequery:Somefieldvalueisnotaconstant.E.g.:– sal >10– whichcombination(s)
wouldhelp?
– <age,sal>doesnothelp– B+tree on<sal>or<sal,age>helps– hastobeaprefix
sue 13 75
bobcaljoe 12
10
208011
12
name age sal
<sal, age>
<age, sal> <age>
<sal>
12,2012,1011,80
13,75
20,1210,12
75,1380,11
11121213
10207580
Data recordssorted by name
Data entries in indexsorted by <sal,age>
Data entriessorted by <sal>
Examples of composite keyindexes using lexicographic order.
CompositeSearchKeys• ToretrieveEmp recordswithage=30AND sal=4000,anindexon
<age,sal>wouldbebetterthananindexonage oranindexonsal– firstfindage=30,amongthemsearchsal =4000
• Ifconditionis:20<age<30AND 3000<sal<5000:– Clusteredtreeindexon<age,sal>or<sal,age>isbest.
• Ifconditionis:age=30AND 3000<sal<5000:– Clustered<age,sal>indexmuchbetterthan<sal,age>index– moreindexentriesareretrievedforthelatter
• Compositeindexesarelarger,updatedmoreoftenDukeCS,Fall2017 CompSci516:DatabaseSystems 17
Index-OnlyPlans• Anumberofqueriescanbeansweredwithoutretrievingany
tuplesfromoneormoreoftherelationsinvolvedifasuitableindexisavailable
SELECT E.dno, COUNT(*)FROM Emp EGROUP BY E.dno
SELECT E.dno, MIN(E.sal)FROM Emp EGROUP BY E.dno
SELECT AVG(E.sal)FROM Emp EWHERE E.age=25 AND
E.sal BETWEEN 3000 AND 5000
<E.dno>
<E.dno,E.sal>Tree index!
<E. age,E.sal>Tree index!
DukeCS,Fall2017 CompSci516:DatabaseSystems 18
• Forindex-onlystrategies,clusteringisnotimportant
ExternalSorting
DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 19
quickreviewofmergesort onwhiteboard
WhySort?• Aclassicproblemincomputerscience• Datarequestedinsortedorder
– e.g.,findstudentsinincreasinggpa order
• SortingisfirststepinbulkloadingB+treeindex• Sortingusefulforeliminatingduplicatecopiesinacollectionofrecords
• Sort-mergejoinalgorithminvolvessorting• Problem:sort1Gbofdatawith1MbofRAM
– needtominimizethecostofdiskaccess
DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 20
2-WaySort:Requires3Buffers• SupposeN=2k pagesinthefile• Pass0:Readapage,sortit,writeit.
– repeatforall2k pages– onlyonebufferpageisused
• Pass1:– Readtwopages,sort(merge)themusingoneoutputpage,writethemtodisk– repeat2k-1times– threebufferpagesused
• Pass2,3,4,…..continue
Main memory buffers
INPUT 1
INPUT 2
OUTPUT
DiskDiskDukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 21
Two-WayExternalMergeSort• Eachsortedsub-fileiscalledarun– eachruncancontainmultiplepages
• Eachpassweread+writeeachpageinfile.
• Npagesinthefile,• =>thenumberofpasses
• Sotoal costis:
• Nottoopractical,butusefultolearnbasicconceptsforexternalsorting
é ù= +log2 1N
é ù( )2 12N Nlog +
Input file
1-page runs
2-page runs
4-page runs
8-page runs
PASS0
PASS1
PASS2
PASS3
9
3,4 6,2 9,4 8,7 5,6 3,1 2
3,4 5,62,6 4,9 7,8 1,3 2
2,34,6
4,78,9
1,35,6 2
2,34,46,78,9
1,23,56
1,22,33,44,56,67,8
9
GeneralExternalMergeSort
• TosortafilewithNpagesusingBbufferpages:– Pass0:useBbufferpages:
• Produce⌈N/B⌉ sortedrunsofBpageseach.– Pass1,2,…,etc.:mergeB-1runs tooneoutputpage
• keepwritingtodiskoncetheoutputpageisfull
B Main memory buffers
INPUT 1
INPUT B-1
OUTPUT
DiskDisk
INPUT 2. . . . . .. . .
• Suppose we have more than 3 buffer pages. • How can we utilize them?
DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 23
CostofExternalMergeSort
• Numberofpasses:1+⌈logB-1⌈N/B⌉⌉• Cost=2N*(#ofpasses)– why2times?• E.g.,with5bufferpages,tosort108pagefile:• Pass0:sorting5pagesatatime
– ⌈108/5⌉ =22sortedrunsof5pageseach(lastrunisonly3pages)
• Pass1:4-waymerge– ⌈22/4⌉ =6sortedrunsof20pageseach(lastrunisonly8pages)
• Pass2:4-waymerge– (but2-wayforthelasttworuns)– [6/4⌉ =2sortedruns,80pagesand28pages
• Pass3:2-waymerge(only2runsremaining)– Sortedfileof108pages
DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 24
NumberofPassesofExternalSort
N B=3 B=5 B=9 B=17 B=129 B=257100 7 4 3 2 1 11,000 10 5 4 3 2 210,000 13 7 5 4 2 2100,000 17 9 6 5 3 31,000,000 20 10 7 5 3 310,000,000 23 12 8 6 4 3100,000,000 26 14 9 7 4 41,000,000,000 30 15 10 8 5 4
HighBisgood,althoughCPUcostincreases
12
I/OforExternalMergeSort
• If10bufferpages– eithermerge9runsatatimewithoneoutputbuffer– or8runswithtwooutputbuffers
• If#pageI/Oisthemetric– goalisminimizethe#passes– eachpageisreadandwrittenineachpass
• Ifwedecidetoreadablock ofbpagessequentially– Suggestsweshouldmakeeachbuffer(input/output)beablock ofpages
– Butthiswillreducefan-outduringmergepasses• i.e.notasmanyrunscanbemergedagainanymore
– Inpractice,mostfilesstillsortedin2-3passes
DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 26
DoubleBuffering• ToreduceCPUwaittimeforI/Orequesttocomplete,canprefetch into`shadowblock’.
OUTPUT
OUTPUT'
Disk Disk
INPUT 1
INPUT k
INPUT 2
INPUT 1'
INPUT 2'
INPUT k'
block sizeb
B main memory buffers, k-way mergeDukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 27
UsingB+TreesforSorting
• Scenario:TabletobesortedhasB+treeindexonsortingcolumn(s)
• Idea:Canretrievedataentries(thenrecords)inorderbytraversingleafpages.
• Isthisagoodidea?• Casestoconsider:
– B+treeisclustered: Goodidea!– B+treeisnotclustered: Couldbeaverybadidea!
DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 28
ClusteredB+TreeUsedforSorting
• Cost:roottotheleft-mostleaf,thenretrieveallleafpages(Alternative1)
• IfAlternative2isused?Additionalcostofretrievingdatarecords:eachpagefetchedjustonce
☛ Always better than external sorting!
(Directs search)
Data Records
Index
Data Entries("Sequenceset")
17
UnclusteredB+TreeUsedforSorting
• Alternative(2)fordataentries;eachdataentrycontainsrid ofadatarecord
• Ingeneral,oneI/Operdatarecord!
(Directs search)
Data Records
Index
Data Entries("Sequenceset")
DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 30
Summary
• Externalsortingisimportant;DBMSmaydedicatepartofbufferpoolforsorting!
• ExternalmergesortminimizesdiskI/Ocost:– Pass0:ProducessortedrunsofsizeB(#bufferpages)– Laterpasses:mergeruns– #ofrunsmergedatatimedependsonB,andblocksize.– LargerblocksizemeanslessI/Ocostperpage.– Largerblocksizemeanssmaller#runsmerged.– Inpractice,#ofrunsrarelymorethan2or3
DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems 31