Transcript
Page 1: Lecture-22-DW-Cubedb.cs.duke.edu/courses/compsci516/cps216/compsci516/fall17/Lec… · instructor material of the [RG] book provided by the authors Dr. Ramakrishnan and Dr. Gehrke

11/14/17

1

CompSci 516DatabaseSystems

Lecture21DataWarehousing

andDataCube

Instructor:Sudeepa Roy

DukeCS,Fall2017 CompSci516:DatabaseSystems 1

ReadingMaterial

• [RG]– Chapter25

• Gray-Chaudhuri-Bosworth-Layman-Reichart-Venkatrao-Pellow-Pirahesh,ICDE1996“DataCube:ARelationalAggregationOperatorGeneralizingGroup-By,Cross-Tab,andSub-Totals”

• Harinarayan-Rajaraman-Ullman,SIGMOD1996“Implementingdatacubesefficiently”

DukeCS,Fall2017 CompSci516:DatabaseSystems 2

Acknowledgement:• Thefollowingslideshavebeencreatedadaptingtheinstructormaterialofthe[RG]bookprovidedbytheauthorsDr.RamakrishnanandDr.Gehrke.• SomeslideshavebeenpreparedbyProf.Shivnath Babu

DataWarehousing

DukeCS,Fall2017 CompSci516:DatabaseSystems 3 4

Warehousing

• Growingindustry:$8billionwaybackin1998• DatawarehousevendorlikeTeradata

• big“Petabytescale”customers• Apple,Walmart(2008-2.5PB),eBay(2013-primaryDW9.2PB,otherbigdata40PB,singletablewith1trillionrows),Verizon,AT&T,BankofAmerica

• supportsdataintoandoutofHadoop• Lotsofbuzzwords,hype– slice&dice,rollup,MOLAP,pivot,...

Ack:SlidebyProf.Shivnath Babu

https://gigaom.com/2013/03/27/why-apple-ebay-and-walmart-have-some-of-the-biggest-data-warehouses-youve-ever-seen/

5

MotivatingExamples

• Forecasting• Comparingperformanceofunits• Monitoring,detectingfraud• Visualization

Introduction• Organizationsanalyzecurrentandhistoricaldata

– toidentifyusefulpatterns– tosupportbusinessstrategies

• Emphasisisoncomplex,interactive,exploratoryanalysisofverylargedatasets

• Createdbyintegratingdatafromacrossallpartsofanenterprise

• Dataisfairlystatic

• Relevantonceagainfortherecent“BigDataanalysis”– tofigureoutwhatwecanreuse,whatwecannot

DukeCS,Fall2017 CompSci516:DatabaseSystems 6

Page 2: Lecture-22-DW-Cubedb.cs.duke.edu/courses/compsci516/cps216/compsci516/fall17/Lec… · instructor material of the [RG] book provided by the authors Dr. Ramakrishnan and Dr. Gehrke

11/14/17

2

ThreeComplementaryTrends

• DataWarehousing(DW):– Consolidatedatafrommanysourcesinonelargerepository– Loading,periodicsynchronizationofreplicas– Semanticintegration

• OLAP:– ComplexSQLqueriesandviews.– Queriesbasedonspreadsheet-styleoperationsand“multidimensional”viewofdata.

– Interactiveand“online”queries.

• DataMining:– Exploratorysearchforinterestingtrendsandanomalies– Nextlecture!

DukeCS,Fall2017 CompSci516:DatabaseSystems 7

DataWarehousing• Acollectionofdecisionsupporttechnologies• Toenablepeopleinindustry/organizationstomakebetter

decisions– SupportsOLAP(On-LineAnalyticalProcessing)

• Applicationsin– Manufacturing– Retail– Finance– Transportation– Healthcare– …

• Typicallymaintainedseparatelyfrom“OperationalDatabases”– OperationalDatabasessupportOLTP(On-LineTransactionProcessing)

8DukeCS,Fall2017 CompSci516:DatabaseSystems

9

WhyaWarehouse?

• TwoApproaches:– Query-Driven(Lazy)–Warehouse(Eager)

Source Source

?

10

AdvantagesofWarehousing

• Highqueryperformance• Queriesnotvisibleoutsidewarehouse• Localprocessingatsourcesunaffected• Canoperatewhensourcesunavailable• CanquerydatanotstoredinaDBMS• Extrainformationatwarehouse– Modify,summarize(storeaggregates)– Addhistoricalinformation

11

Query-DrivenApproach

Client Client

Wrapper Wrapper Wrapper

Mediator

Source Source Source

12

AdvantagesofQuery-Driven

• Noneedtocopydata– lessstorage– noneedtopurchasedata

• Moreup-to-datedata• Queryneedscanbeunknown• Onlyqueryinterfaceneededatsources• Maybelessdrainingonsources

Page 3: Lecture-22-DW-Cubedb.cs.duke.edu/courses/compsci516/cps216/compsci516/fall17/Lec… · instructor material of the [RG] book provided by the authors Dr. Ramakrishnan and Dr. Gehrke

11/14/17

3

OLTP DataWarehousing/OLAPMostlyupdates Mostlyreads

Applications:Orderentry,salesupdate,bankingtransactions

Applications:Decisionsupportinindustry/organization

Detailed,up-to-datedata Summarized, historicaldata(frommultipleoperationaldb,growsovertime)

Structured,repetitive,shorttasks Queryintensive,adhoc,complexqueries

Eachtransactionreads/updatesonlyafewtuples(tensof)

Eachquerycanaccesses manyrecords,andperformmanyjoins,scans,aggregates

MB-GBdata GB-TB data

Typicallyclericalusers Decisionmakers,analysts asusers

Important:Consistency,recoverability,Maximizingtr.throughput

Important:QuerythroughputResponse times

13DukeCS,Fall2017 CompSci516:DatabaseSystems

DataMarts

• smallerdatawarehouse• subsetsofdataonselectedsubjects• e.g.Marketingdatamartcanincludecustomer,product,sales

• Department-focused,noenterprise-wideconsensusneeded

• Butmayleadtocomplexintegrationproblemsinthelongrun

14DukeCS,Fall2017 CompSci516:DatabaseSystems

ROLAPandMOLAP

• RelationalOLAP(ROLAP)– OntopofstandardrelationalDBMS– DataisstoredinrelationalDBMS– SupportsextensionstoSQLtoaccessmulti-dimensionaldata

• MultidimensionalOLAP(MOLAP)– Directlystoresmultidimensionaldatainspecialdatastructures(e.g.arrays)

15DukeCS,Fall2017 CompSci516:DatabaseSystems

DataWarehousingtoMining

• Integrateddataspanninglongtimeperiods,oftenaugmentedwithsummaryinformation

• Severalgigabytestoterabytescommon

• Interactiveresponsetimesexpectedforcomplexqueries;ad-hocupdatesuncommon

EXTERNALDATASOURCES

EXTRACTTRANSFORM

LOADREFRESH

DATAWAREHOUSE

MetadataRepository

SUPPORTS

OLAPDATAMINING

DukeCS,Fall2017 CompSci516:DatabaseSystems 16

WarehousingIssues• SemanticIntegration: Whengettingdatafrom

multiplesources,musteliminatemismatches– e.g.,differentcurrencies,schemas

• HeterogeneousSources: Mustaccessdatafromavarietyofsourceformatsandrepositories– Replicationcapabilitiescanbeexploitedhere

• Load,Refresh,Purge: Mustloaddata,periodicallyrefreshit,andpurgetoo-olddata

• MetadataManagement: Mustkeeptrackofsource,loadingtime,andotherinformationforalldatainthewarehouse

DukeCS,Fall2017 CompSci516:DatabaseSystems 17

DWArchitecture

18

• ExtractdatafrommultipleoperationalDBandexternalsources

• Clean/integrate/transform/store• Refreshperiodically

– updatebaseandderiveddata– admindecideswhenandhowDukeCS,Fall2017 CompSci516:DatabaseSystems

• MainDWandseveraldatamarts(possibly)• Managedbyoneormoreserversand

frontendtools• Additionalmetadataand

monitoring/admintools

Page 4: Lecture-22-DW-Cubedb.cs.duke.edu/courses/compsci516/cps216/compsci516/fall17/Lec… · instructor material of the [RG] book provided by the authors Dr. Ramakrishnan and Dr. Gehrke

11/14/17

4

ROLAP:StarSchema

• Toreflectmulti-dimensionalviewsofdata

• Singlefacttable

• Singletableforeverydimension

• Eachtupleinthefacttableconsistsof– pointers(foreignkey)toeach

ofthedimensions(multi-dimensionalcoordinates)

– numericvalueforthosecoordinates

• Eachdimensiontablecontainsattributesofthatdimension 19

NosupportforattributehierarchiesDukeCS,Fall2017 CompSci516:DatabaseSystems

DimensionHierarchies

• Foreachdimension,thesetofvaluescanbeorganizedinahierarchy:

PRODUCT TIME LOCATION

categoryweekmonthstate

pnamedatecity

year

quartercountry

DukeCS,Fall2017 CompSci516:DatabaseSystems 20

ROLAP:SnowflakeSchema

21

• Refinesstar-schema• Dimensionalhierarchyis

explicitlyrepresented

• (+)Dimensiontableseasiertomaintain– supposethe“category

descriptionisbeingchanged

• (-)Needadditionaljoins

• FactConstellations– Multiplefacttablessharesome

dimensionaltables– e.g.ProjectedandActual

Expensesmaysharemanydimensions

DukeCS,Fall2017 CompSci516:DatabaseSystems

Motivation:OLAPQueries• Dataanalystsareinterestedinexploringtrendsandanomalies– Possiblybyvisualization(Excel)- 2Dor3Dplots– “DimensionalityReduction”bysummarizingdataandcomputing

aggregates– InfluencedbySQLandbyspreadsheets.– Acommonoperationistoaggregate ameasureoveroneormore

dimensions.

• Findtotalunitsalesforeach1. Model2. Model,brokenintoyears3. Year,brokenintocolors4. Year5. Model,brokenintocolor,….

22

Sales(Model,Year,Color,Units)

DukeCS,Fall2017 CompSci516:DatabaseSystems

OLAPand

DataCube

DukeCS,Fall2017 CompSci516:DatabaseSystems 23

HistogramsAtabulatedfrequencyofcomputedvalues

SELECT Year, COUNT(Units) as total

FROM Sales

GROUP BY Year

ORDER BY Year

MayrequireanestedSELECTtocompute

Sales(Model,Year,Color,Units)

Year->

total->

Page 5: Lecture-22-DW-Cubedb.cs.duke.edu/courses/compsci516/cps216/compsci516/fall17/Lec… · instructor material of the [RG] book provided by the authors Dr. Ramakrishnan and Dr. Gehrke

11/14/17

5

Roll-Ups• Analysisreportsstartatacoarselevel,

gotofinerlevels• Orderofattributematters• Notrelationaldata(emptycellsno

keys)

Model Year Color Model,Year,Color Model,Year Model

Chevy 1994 Black 50

Chevy 1994 White 40

90

Chevy 1995 Black 115

Chevy 1995 White 85

200

290

GROUPBY

Sales(Model,Year,Color,Units)

Roll-ups

Drill-downs

Roll-Ups• Anotherrepresentation(ChrisDate’96)• Relational,but

– longattributenames– hardtoexpressinSQLandrepetition

Model Year Color Model,Year,Color Model,Year Model

Chevy 1994 Black 50 90 290

Chevy 1994 White 40 90 290

Chevy 1995 Black 85 200 290

Chevy 1995 Black 115 200 290

GROUPBY

Sales(Model,Year,Color,Units)

‘ALL’ ConstructEasiertovisualizeroll-upifallowALLtofillinthesuper-aggregates

SELECT Model, Year, Color, SUM(Units)

FROM Sales

WHERE Model = ‘Chevy’

GROUP BY Model, Year, Color

UNION

SELECT Model, Year, ‘ALL’, SUM(Units)

FROM Sales

WHERE Model = ‘Chevy’

GROUP BY Model, Year

UNION

UNION

SELECT ‘ALL’, ‘ALL’, ‘ALL’, SUM(Units)

FROM Sales

WHERE Model = ‘Chevy’;

Model Year Color Units

Chevy 1994 Black 50

Chevy 1994 White 40

Chevy 1994 ‘ALL’ 90

Chevy 1995 Black 85

Chevy 1995 White 115

Chevy 1995 ‘ALL’ 200

Chevy ‘ALL’ ‘ALL’ 290

Sales(Model,Year,Color,Units)

Model Year Color Units

Chevy 1994 Black 50

Chevy 1994 White 40

Chevy 1994 ‘ALL’ 90

Chevy 1995 Black 85

Chevy 1995 White 115

Chevy 1995 ‘ALL’ 200

Chevy ‘ALL’ ‘ALL’ 290

Model Year Color Model,Year,Color Model,Year Model

Chevy 1994 Black 50

Chevy 1994 White 40

90

Chevy 1995 Black 115

Chevy 1995 White 85

200

290

TraditionalRoll-Up ‘ALL’Roll-Up

• Roll-upsareasymmetric

Sales(Model,Year,Color,Units)

CrossTabulation• Ifwemadetheroll-upsymmetric,wewouldgetacross-tabulation• Generalizestohigherdimensions

SELECT Model, ‘ALL’, Color, SUM(Units)

FROM Sales

WHERE Model = ‘Chevy’

GROUP BY Model, Color

Chevy 1994 1995 Total(ALL)

Black 50 85 135

White 40 115 155

Total(ALL) 90 200 290

IstheproblemsolvedwithCross-TabandGROUP-BYswith‘ALL’?• RequiresalotofGROUPBYs(64for6-dimension)• Toocomplextooptimize (64scans,64sort/hash,slow)

Sales(Model,Year,Color,Units)

NaïveApproach

Runanumberofqueries

SELECT sum(units)FROM Sales

SELECT Color, sum(units)FROM SalesGROUP BY Color

SELECT Year, sum(units)FROM SalesGROUP BY Year

SELECT Model, Year, sum(units)FROM SalesGROUP BY Model, Year….

• DatacubegeneralizesHistogram,Roll-Ups,Cross-Tabs

• MorecomplextodothesewithGROUP-BY

30

• Howmanysub-queries?• Howmanysub-queriesfor

8attributes?

Sales(Model,Year,Color,Units)

YearColor

Mod

el

TotalUnitsales

DukeCS,Fall2017 CompSci516:DatabaseSystems

Page 6: Lecture-22-DW-Cubedb.cs.duke.edu/courses/compsci516/cps216/compsci516/fall17/Lec… · instructor material of the [RG] book provided by the authors Dr. Ramakrishnan and Dr. Gehrke

11/14/17

6

DataCube:Intuition

SELECT ‘ALL’, ‘ALL’, ‘ALL’, sum(units)

FROM Sales

UNIONSELECT ‘ALL’, ‘ALL’, Color, sum(units)

FROM Sales

GROUP BY Color

UNION

SELECT ‘ALL’, Year, ‘ALL’, sum(units)

FROM Sales

GROUP BY Year

UNION SELECT Model, Year, ‘ALL’, sum(units)

FROM Sales

GROUP BY Model, Year

UNION ….

31

Sales(Model,Year,Color,Units)

YearColor

Mod

el

TotalUnitsales

DukeCS,Fall2017 CompSci516:DatabaseSystems

DataCube

Ack:fromslidesbyLaurelOrrandJeremyHyrkas,UW

DataCube• Computestheaggregateonallpossiblecombinationsof

groupbycolumns.

• IfthereareNattributes,thereare2N-1super-aggregates.

• IfthecardinalityoftheNattributesareC1,...,CN,thenthereareatotalof(C1+1)...(CN+1)valuesinthecube.

• ROLL-UPissimilarbutjustlooksatNaggregates

DataCube Syntax• SQLServer

SELECT Model, Year, Color, sum(units)

FROM Sales

GROUP BY Model, Year, Color

WITH CUBE

Sales(Model,Year,Color,Units)

TypesofAggregates• Distributive:inputcanbepartitionedintodisjointsets

andaggregatedseparatelyo COUNT,SUM,MIN

• Algebraic:canbecomposedofdistributiveaggregateso AVG

• Holistic:aggregatemustbecomputedovertheentireinputseto MEDIAN

• EfficientcomputationoftheCUBEoperatordependsonthetypeofaggregate

– DistributiveandAlgebraicaggregatesmotivateoptimizations

DukeCS,Fall2017 CompSci516:DatabaseSystems 36

ViewMaterializationandMaintenance[RG]Chapters25.8-25.10

Page 7: Lecture-22-DW-Cubedb.cs.duke.edu/courses/compsci516/cps216/compsci516/fall17/Lec… · instructor material of the [RG] book provided by the authors Dr. Ramakrishnan and Dr. Gehrke

11/14/17

7

Views(revisiting)

• Motivation(example)– Differentgroupsofanalystswithinanorganizationaretypicallyconcernedwithdifferentaspectsofabusiness

– Itisconvenienttodefine“views”thatgiveeachgroupinsightintotherelevantbusinessdetails

– Otherviewscanbedefinedorqueriescanbewrittenusingtheseviews

– ConvenientandEfficient

DukeCS,Fall2017 CompSci516:DatabaseSystems 37

ViewExample

CREATEVIEW RegionalSales(category,sales,state)ASSELECT P.category,S.sales,L.state

FROMProductsP,SalesS,LocationsLWHERE P.pid=S.pid AND S.locid=L.locid

SELECT R.category,R.state,SUM(R.sales)FROM RegionalSales AS RGROUPBY R.category,R.state

SELECT R.category,R.state,SUM(R.sales)FROM (SELECT P.category,S.sales,L.state

FROMProductsP,SalesS,LocationsLWHERE P.pid=S.pidAND S.locid=L.locid)AS R

GROUPBY R.category,R.state

View(salesofproductsbycategoryandstate)

Query(totalsalesforeachcategorybystate)

QueryModification(SQLdoesnotspecifyhowtoevaluatequeriesonviews,butcanconsideritasareplacement)

DukeCS,Fall2017 CompSci516:DatabaseSystems 38

ViewsandOLAP/Warehousing• OLAPqueriesaretypicallyaggregatequeries

– Precomputationisessentialforinteractiveresponsetimes– TheCUBEisinfactacollectionofaggregatequeries,and

precomputationisespeciallyimportant– lotsofworkonwhatisbesttoprecomputegivenalimitedamountof

spacetostoreprecomputedresults.

• Warehousescanbethoughtofasacollectionofasynchronouslyreplicatedtablesandperiodicallymaintainedviews– Factors:size,numberoftablesinvolved,manyarefromexternal

independentdatabases– Hasrenewedinterestin(asynchronous)viewmaintenance(morelater)

DukeCS,Fall2017 CompSci516:DatabaseSystems 39

ViewMaterialization• QueryModificationmaynotbeefficient

– whentheunderlyingviewiscomplex– evenwithsophisticatedoptimizationandevaluation– esp.whentheunderlyingtablesareinaremotedatabase(connectivity

andavailability)

• Alternative:ViewMaterialization– Precomputetheviewdefinitionandstoretheresult– Materializedviewscanbeusedasregularrelations– Providesfastaccess,likea(veryhigh-level)cache– Cancreateindexonviewstooforfurtherspeedup– Drawback:tomaintaintheconsistencyofthematerializedviewwhenthe

underlyingtable(s)areupdated(ViewMaintenance)– Ideally,wewantIncrementalViewMaintenancealgorithms(Lecture20)

DukeCS,Fall2017 CompSci516:DatabaseSystems 40

IndexonMaterializedViews:Examples

• SupposeweprecomputeRegionalSales andstoreitwithaclusteredB+treeindexon[category,state,sales].– Then,thequerycanbeansweredbyanindex-onlyscan.

SELECT R.state,SUM(R.sales)FROM RegionalSalesRWHERE R.category=“Laptop”GROUPBY R.state

SELECT R.state,SUM(R.sales)FROM RegionalSalesRWHERE R.state=“Wisconsin”GROUPBY R.category

Indexonprecomputedviewisgreat!

Indexislessuseful(mustscanentireleaflevel)

DukeCS,Fall2017 CompSci516:DatabaseSystems 41

CREATEVIEW RegionalSales(category,sales,state)ASSELECT P.category,S.sales,L.stateFROMProductsP,SalesS,LocationsLWHERE P.pid=S.pid AND S.locid=L.locid

SELECT R.category,R.state,SUM(R.sales)FROM RegionalSales AS RGROUPBY R.category,R.state

(Research)IssuesinViewMaterialization

1. Whatviewsshouldwematerialize,andwhatindexesshouldwebuildontheprecomputedresults?

2. Givenaqueryandasetofmaterializedviews,canweusethematerializedviewstoanswerthequery?– relatedtothefirstquestion(workloaddependent)– Trytomaterializeasmall,carefullychosensetofviewsthatcanbe

utilizedtoquicklyanswermostoftheimportantqueries

3. Howfrequentlyshouldwerefreshmaterializedviewstomakethemconsistentwiththeunderlyingtables?– Andhowcanwedothisincrementally?

DukeCS,Fall2017 CompSci516:DatabaseSystems 42

Page 8: Lecture-22-DW-Cubedb.cs.duke.edu/courses/compsci516/cps216/compsci516/fall17/Lec… · instructor material of the [RG] book provided by the authors Dr. Ramakrishnan and Dr. Gehrke

11/14/17

8

ViewMaintenance• Twosteps:– Propagate: Computechangestoviewwhendatachanges– Refresh: Applychangestothematerializedviewtable

• Maintenancepolicy: Controlswhenwedorefresh– Immediate: Aspartofthetransactionthatmodifiestheunderlyingdatatables• +Materializedviewisalwaysconsistent• - updatesareslowed

– Deferred: Sometimelater,inaseparatetransaction• - Viewbecomesinconsistent• + canscaletomaintainmanyviewswithoutslowingupdates

DukeCS,Fall2017 CompSci516:DatabaseSystems 43

TypesofDeferredMaintenanceThreeflavors:• Lazy:

– Delayrefreshuntilnextqueryonview;thenrefreshbeforeansweringthequery(slowsdownqueriesthanupdates)

• Periodic(Snapshot):– Refreshperiodically(e.g.onceinaday).Queriespossiblyanswered

usingoutdatedversionofviewtuples.Widelyused,especiallyforasynchronousreplicationindistributeddatabases,andforwarehouseapplications

• Event-basedorForced:– E.g.,Refreshafterafixednumberofupdatestounderlyingdatatables

• e.g.SnapshotinOracle7– periodicallyrefreshedbyentirelyrecomputing theview– Incremental”fastrefresh”or“simplesnapshots”forsimplerviews(no

aggregate,groupby,join,distinctetc.)

DukeCS,Fall2017 CompSci516:DatabaseSystems 44

ImplementingDataCube

DukeCS,Fall2017 CompSci516:DatabaseSystems 45

BasicIdeas

• Needtocomputeallgroup-by-s:– ABCD,ABC,ABD,BCD,AB,AC,AD,BC,BD,CD,A,B,C,D

• ComputeGROUP-BYsfrompreviouslycomputedGROUP-BYs– e.g.firstABCD– thenABCorACD– thenABorAC…

• WhichorderABCDissorted,mattersforsubsequentcomputations

– if(ABCD)isthesortedorder,ABCischeap,ACDorBCDisexpensive

Notations

• ABCD– group-byonattributesA,B,C,D– noguaranteeontheorderoftuples

• (ABCD)– sortedaccordingtoA->B->C->D

• ABCDand(ABCD)and(BCDA)– allcontainthesameresults– butindifferentsortedorder

Optimization1:SmallestParent

• ComputeGROUP-BYfromthesmallest(size)previouslycomputedGROUP-BYasaparent

– ABcanbecomputedfromABC,ABD,orABCD

– ABCorABDbetterthanABCD– EvenABCorABDmayhave

differentsizes,trytochoosethesmallerparent

48

ABCD

ABC ABD ACD BCD

AC AD BC BD CDAB

A B C D

all

DukeCS,Fall2017 CompSci516:DatabaseSystems

LATTICESTRUCTUREofdatacube

Page 9: Lecture-22-DW-Cubedb.cs.duke.edu/courses/compsci516/cps216/compsci516/fall17/Lec… · instructor material of the [RG] book provided by the authors Dr. Ramakrishnan and Dr. Gehrke

11/14/17

9

Optimization2:CacheResults

• CacheresultofoneGROUP-BYinmemorytoreducediskI/O– ComputeABfromABCwhile

ABCisstillinmemory

49

ABCD

ABC ABD ACD BCD

AC AD BC BD CDAB

A B C D

all

DukeCS,Fall2017 CompSci516:DatabaseSystems

Optimization3:AmortizeDiskScans

• AmortizediskreadsformultipleGROUP-BYs– SupposetheresultforABCD

isstoredondisk– ComputeallofABC,ABD,

ACD,BCDsimultaneouslyinonescanofABCD

50

ABCD

ABC ABD ACD BCD

AC AD BC BD CDAB

A B C D

all

DukeCS,Fall2017 CompSci516:DatabaseSystems

Optimization4,5(next)

• 4.Share-sort– forsort-basedalgorithms– pipe-sortalgorithm– coveredinclass

• 5.Shared-partition– forhash-basedalgorithms– pipe-hashalgorithm

• UseshashtablestocomputesmallerGROUP-Bys

• IfthehashtablesforABandACfitinmemory,computebothinonescanofABC

• OtherwisecanpartitiononA,andcancomputeHTsofABandACindifferentpartitions

– notcovered(seepaper)

51

ABCD

ABC ABD ACD BCD

AC AD BC BD CDAB

A B C D

all

DukeCS,Fall2017 CompSci516:DatabaseSystems

PipeSort:Idea

• Combinestwooptimizations:“shared-sorts”and“smallest-parent”

• Alsoincludes“cache-results”and“amortized-scans”

PipeSort:Share-sortoptimization• Datasortedinoneorder• ComputeallGROUP-BYsprefixedinthatorder• ComputeonetupleofABCD,propagateupwardinthe

pipelinebyasinglescan• Example:

– GROUP-BY overattributesABCD– Sortrawdataby(ABCD)– Compute(ABCD)->(ABC)->(AB)->(A) inpipelinedfashion– Noadditionalsortneeded

• BUT,mayhaveaconflictwith“smallest-parent”optimization– (ABD)->(AB)couldbeabetterchoice– Figureoutthebestparentchoicebyrunningaweighted-matching

algorithmlayerbylayer

(ABCD)

(ABC)

(AB)

(A)

SearchLattice• Directededge=>oneattributeless

andpossiblecomputation• Levelkcontainskattributes

– all=0attribute

• Twopossiblecostsforeachedgeeij=i --->j

• A(eij):i issortedforj– (BCA)->(BC)

• S(eij):i isNOTsortedforj– e.g.ABC->(BCA)->(BC)orhash

ABCD

ABC ABD ACD BCD

AC AD BC BD CDAB

A B C D

all Level0

Level1

Level2

Level3

Level4

A B C suma1 b1 c1 5a1 b1 c2 10a1 b2 c3 8a2 b2 c1 2a2 b2 c3 11

NotSortedA B C suma2 b2 c3 11a1 b1 c2 10a2 b2 c1 2a1 b1 c1 5a1 b2 c3 8

Sorted

A B suma1 b1 15a1 b2 8a2 b2 13

• Noparenthesis:orderoftuplescanbearbitrary

Page 10: Lecture-22-DW-Cubedb.cs.duke.edu/courses/compsci516/cps216/compsci516/fall17/Lec… · instructor material of the [RG] book provided by the authors Dr. Ramakrishnan and Dr. Gehrke

11/14/17

10

PipeSort Output

• OutputsasubgraphO– eachnodehasasingleparent– eachnodehasasortedorderofattributes

• ifparent’ssortedorderisaprefix,cost=A(eij),elseS(eij)

– MarkbyAorS– AtmostoneA-out-edge– Note:forsomenodes,theremaybenogreenA-out-edge

ACBD

ACB ABD ACD BDC

AC AD BC BD CDAB

A B C D

all Level0

Level1

Level2

Level3

Level4

Sorted(A)Not-Sorted(S)

Goal:FindOwithmintotalcost

Outline:PipeSort Algorithm(1)• Gofromlevel0toN-1

– hereN=4

• Foreachlevelk,findthebestwaytoconstructitfromlevelk+1

• uses“min-costweightedbipartitematching”

• createsknewcopiesofnodesatlevelk+1

• edgesfromoriginalcopy– costA(eij)

• edgesfromnewcopies– costS(eij)

ABCD

ABC ABD ACD BCD

AC AD BC BD CDAB

A B C D

all Level0

Level1

Level2

Level3

Level4

Outline:PipeSort Algorithm(2)• Illustrationwithasmallerexample• Levelk=1fromlevelk+1=2

– onenewcopy(dottededges)– oneexistingcopy(solidedge)

• Assumptionforsimplicity– samecostforalloutgoingedges– A(eij)=A(eij’)forallj,j’– S(eij)=S(eij’)foralli,i’

ABC

AC BCAB

A B C

all Level0

Level1

Level2

Level3

2,10 5,12 13,20

Aftercomputingthe plan,executeallpipelines

Outline:PipeSort Algorithm(3)

1.Firstpipelineisexecutedbyonescanofthedata

2.Sort(CBAD)->(BADC),computethesecondpipeline

3.…..

A,S costs

SeepaperforanotherPipeHash algorithm

ACBD

ACB ABD ACD BDC

AC AD BC BD CD

A B C D

all

AB


Top Related