lecture-22-dw-cubedb.cs.duke.edu/courses/compsci516/cps216/compsci516/fall17/lec… · instructor...

10
11/14/17 1 CompSci 516 Database Systems Lecture 21 Data Warehousing and Data Cube Instructor: Sudeepa Roy Duke CS, Fall 2017 CompSci 516: Database Systems 1 Reading Material [RG] Chapter 25 Gray-Chaudhuri-Bosworth-Layman-Reichart-Venkatrao-Pellow-Pirahesh, ICDE 1996 “Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals” Harinarayan-Rajaraman-Ullman, SIGMOD 1996 “Implementing data cubes efficiently” Duke CS, Fall 2017 CompSci 516: Database Systems 2 Acknowledgement: The following slides have been created adapting the instructor material of the [RG] book provided by the authors Dr. Ramakrishnan and Dr. Gehrke. Some slides have been prepared by Prof. Shivnath Babu Data Warehousing Duke CS, Fall 2017 CompSci 516: Database Systems 3 4 Warehousing Growing industry: $8 billion way back in 1998 Data warehouse vendor like Teradata big “Petabyte scale”customers Apple, Walmart (2008-2.5PB), eBay (2013-primary DW 9.2 PB, other big data 40PB, single table with 1 trillion rows), Verizon, AT&T, Bank of America supports data into and out of Hadoop Lots of buzzwords, hype slice & dice, rollup, MOLAP, pivot, ... Ack: Slide by Prof. Shivnath Babu https://gigaom.com/2013/03/27/why-apple-ebay-and-walmart- have-some-of-the-biggest-data-warehouses-youve-ever-seen/ 5 Motivating Examples Forecasting Comparing performance of units Monitoring, detecting fraud Visualization Introduction Organizations analyze current and historical data to identify useful patterns to support business strategies Emphasis is on complex, interactive, exploratory analysis of very large datasets Created by integrating data from across all parts of an enterprise Data is fairly static Relevant once again for the recent “Big Data analysisto figure out what we can reuse, what we cannot Duke CS, Fall 2017 CompSci 516: Database Systems 6

Upload: others

Post on 10-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Lecture-22-DW-Cubedb.cs.duke.edu/courses/compsci516/cps216/compsci516/fall17/Lec… · instructor material of the [RG] book provided by the authors Dr. Ramakrishnan and Dr. Gehrke

11/14/17

1

CompSci 516DatabaseSystems

Lecture21DataWarehousing

andDataCube

Instructor:Sudeepa Roy

DukeCS,Fall2017 CompSci516:DatabaseSystems 1

ReadingMaterial

• [RG]– Chapter25

• Gray-Chaudhuri-Bosworth-Layman-Reichart-Venkatrao-Pellow-Pirahesh,ICDE1996“DataCube:ARelationalAggregationOperatorGeneralizingGroup-By,Cross-Tab,andSub-Totals”

• Harinarayan-Rajaraman-Ullman,SIGMOD1996“Implementingdatacubesefficiently”

DukeCS,Fall2017 CompSci516:DatabaseSystems 2

Acknowledgement:• Thefollowingslideshavebeencreatedadaptingtheinstructormaterialofthe[RG]bookprovidedbytheauthorsDr.RamakrishnanandDr.Gehrke.• SomeslideshavebeenpreparedbyProf.Shivnath Babu

DataWarehousing

DukeCS,Fall2017 CompSci516:DatabaseSystems 3 4

Warehousing

• Growingindustry:$8billionwaybackin1998• DatawarehousevendorlikeTeradata

• big“Petabytescale”customers• Apple,Walmart(2008-2.5PB),eBay(2013-primaryDW9.2PB,otherbigdata40PB,singletablewith1trillionrows),Verizon,AT&T,BankofAmerica

• supportsdataintoandoutofHadoop• Lotsofbuzzwords,hype– slice&dice,rollup,MOLAP,pivot,...

Ack:SlidebyProf.Shivnath Babu

https://gigaom.com/2013/03/27/why-apple-ebay-and-walmart-have-some-of-the-biggest-data-warehouses-youve-ever-seen/

5

MotivatingExamples

• Forecasting• Comparingperformanceofunits• Monitoring,detectingfraud• Visualization

Introduction• Organizationsanalyzecurrentandhistoricaldata

– toidentifyusefulpatterns– tosupportbusinessstrategies

• Emphasisisoncomplex,interactive,exploratoryanalysisofverylargedatasets

• Createdbyintegratingdatafromacrossallpartsofanenterprise

• Dataisfairlystatic

• Relevantonceagainfortherecent“BigDataanalysis”– tofigureoutwhatwecanreuse,whatwecannot

DukeCS,Fall2017 CompSci516:DatabaseSystems 6

Page 2: Lecture-22-DW-Cubedb.cs.duke.edu/courses/compsci516/cps216/compsci516/fall17/Lec… · instructor material of the [RG] book provided by the authors Dr. Ramakrishnan and Dr. Gehrke

11/14/17

2

ThreeComplementaryTrends

• DataWarehousing(DW):– Consolidatedatafrommanysourcesinonelargerepository– Loading,periodicsynchronizationofreplicas– Semanticintegration

• OLAP:– ComplexSQLqueriesandviews.– Queriesbasedonspreadsheet-styleoperationsand“multidimensional”viewofdata.

– Interactiveand“online”queries.

• DataMining:– Exploratorysearchforinterestingtrendsandanomalies– Nextlecture!

DukeCS,Fall2017 CompSci516:DatabaseSystems 7

DataWarehousing• Acollectionofdecisionsupporttechnologies• Toenablepeopleinindustry/organizationstomakebetter

decisions– SupportsOLAP(On-LineAnalyticalProcessing)

• Applicationsin– Manufacturing– Retail– Finance– Transportation– Healthcare– …

• Typicallymaintainedseparatelyfrom“OperationalDatabases”– OperationalDatabasessupportOLTP(On-LineTransactionProcessing)

8DukeCS,Fall2017 CompSci516:DatabaseSystems

9

WhyaWarehouse?

• TwoApproaches:– Query-Driven(Lazy)–Warehouse(Eager)

Source Source

?

10

AdvantagesofWarehousing

• Highqueryperformance• Queriesnotvisibleoutsidewarehouse• Localprocessingatsourcesunaffected• Canoperatewhensourcesunavailable• CanquerydatanotstoredinaDBMS• Extrainformationatwarehouse– Modify,summarize(storeaggregates)– Addhistoricalinformation

11

Query-DrivenApproach

Client Client

Wrapper Wrapper Wrapper

Mediator

Source Source Source

12

AdvantagesofQuery-Driven

• Noneedtocopydata– lessstorage– noneedtopurchasedata

• Moreup-to-datedata• Queryneedscanbeunknown• Onlyqueryinterfaceneededatsources• Maybelessdrainingonsources

Page 3: Lecture-22-DW-Cubedb.cs.duke.edu/courses/compsci516/cps216/compsci516/fall17/Lec… · instructor material of the [RG] book provided by the authors Dr. Ramakrishnan and Dr. Gehrke

11/14/17

3

OLTP DataWarehousing/OLAPMostlyupdates Mostlyreads

Applications:Orderentry,salesupdate,bankingtransactions

Applications:Decisionsupportinindustry/organization

Detailed,up-to-datedata Summarized, historicaldata(frommultipleoperationaldb,growsovertime)

Structured,repetitive,shorttasks Queryintensive,adhoc,complexqueries

Eachtransactionreads/updatesonlyafewtuples(tensof)

Eachquerycanaccesses manyrecords,andperformmanyjoins,scans,aggregates

MB-GBdata GB-TB data

Typicallyclericalusers Decisionmakers,analysts asusers

Important:Consistency,recoverability,Maximizingtr.throughput

Important:QuerythroughputResponse times

13DukeCS,Fall2017 CompSci516:DatabaseSystems

DataMarts

• smallerdatawarehouse• subsetsofdataonselectedsubjects• e.g.Marketingdatamartcanincludecustomer,product,sales

• Department-focused,noenterprise-wideconsensusneeded

• Butmayleadtocomplexintegrationproblemsinthelongrun

14DukeCS,Fall2017 CompSci516:DatabaseSystems

ROLAPandMOLAP

• RelationalOLAP(ROLAP)– OntopofstandardrelationalDBMS– DataisstoredinrelationalDBMS– SupportsextensionstoSQLtoaccessmulti-dimensionaldata

• MultidimensionalOLAP(MOLAP)– Directlystoresmultidimensionaldatainspecialdatastructures(e.g.arrays)

15DukeCS,Fall2017 CompSci516:DatabaseSystems

DataWarehousingtoMining

• Integrateddataspanninglongtimeperiods,oftenaugmentedwithsummaryinformation

• Severalgigabytestoterabytescommon

• Interactiveresponsetimesexpectedforcomplexqueries;ad-hocupdatesuncommon

EXTERNALDATASOURCES

EXTRACTTRANSFORM

LOADREFRESH

DATAWAREHOUSE

MetadataRepository

SUPPORTS

OLAPDATAMINING

DukeCS,Fall2017 CompSci516:DatabaseSystems 16

WarehousingIssues• SemanticIntegration: Whengettingdatafrom

multiplesources,musteliminatemismatches– e.g.,differentcurrencies,schemas

• HeterogeneousSources: Mustaccessdatafromavarietyofsourceformatsandrepositories– Replicationcapabilitiescanbeexploitedhere

• Load,Refresh,Purge: Mustloaddata,periodicallyrefreshit,andpurgetoo-olddata

• MetadataManagement: Mustkeeptrackofsource,loadingtime,andotherinformationforalldatainthewarehouse

DukeCS,Fall2017 CompSci516:DatabaseSystems 17

DWArchitecture

18

• ExtractdatafrommultipleoperationalDBandexternalsources

• Clean/integrate/transform/store• Refreshperiodically

– updatebaseandderiveddata– admindecideswhenandhowDukeCS,Fall2017 CompSci516:DatabaseSystems

• MainDWandseveraldatamarts(possibly)• Managedbyoneormoreserversand

frontendtools• Additionalmetadataand

monitoring/admintools

Page 4: Lecture-22-DW-Cubedb.cs.duke.edu/courses/compsci516/cps216/compsci516/fall17/Lec… · instructor material of the [RG] book provided by the authors Dr. Ramakrishnan and Dr. Gehrke

11/14/17

4

ROLAP:StarSchema

• Toreflectmulti-dimensionalviewsofdata

• Singlefacttable

• Singletableforeverydimension

• Eachtupleinthefacttableconsistsof– pointers(foreignkey)toeach

ofthedimensions(multi-dimensionalcoordinates)

– numericvalueforthosecoordinates

• Eachdimensiontablecontainsattributesofthatdimension 19

NosupportforattributehierarchiesDukeCS,Fall2017 CompSci516:DatabaseSystems

DimensionHierarchies

• Foreachdimension,thesetofvaluescanbeorganizedinahierarchy:

PRODUCT TIME LOCATION

categoryweekmonthstate

pnamedatecity

year

quartercountry

DukeCS,Fall2017 CompSci516:DatabaseSystems 20

ROLAP:SnowflakeSchema

21

• Refinesstar-schema• Dimensionalhierarchyis

explicitlyrepresented

• (+)Dimensiontableseasiertomaintain– supposethe“category

descriptionisbeingchanged

• (-)Needadditionaljoins

• FactConstellations– Multiplefacttablessharesome

dimensionaltables– e.g.ProjectedandActual

Expensesmaysharemanydimensions

DukeCS,Fall2017 CompSci516:DatabaseSystems

Motivation:OLAPQueries• Dataanalystsareinterestedinexploringtrendsandanomalies– Possiblybyvisualization(Excel)- 2Dor3Dplots– “DimensionalityReduction”bysummarizingdataandcomputing

aggregates– InfluencedbySQLandbyspreadsheets.– Acommonoperationistoaggregate ameasureoveroneormore

dimensions.

• Findtotalunitsalesforeach1. Model2. Model,brokenintoyears3. Year,brokenintocolors4. Year5. Model,brokenintocolor,….

22

Sales(Model,Year,Color,Units)

DukeCS,Fall2017 CompSci516:DatabaseSystems

OLAPand

DataCube

DukeCS,Fall2017 CompSci516:DatabaseSystems 23

HistogramsAtabulatedfrequencyofcomputedvalues

SELECT Year, COUNT(Units) as total

FROM Sales

GROUP BY Year

ORDER BY Year

MayrequireanestedSELECTtocompute

Sales(Model,Year,Color,Units)

Year->

total->

Page 5: Lecture-22-DW-Cubedb.cs.duke.edu/courses/compsci516/cps216/compsci516/fall17/Lec… · instructor material of the [RG] book provided by the authors Dr. Ramakrishnan and Dr. Gehrke

11/14/17

5

Roll-Ups• Analysisreportsstartatacoarselevel,

gotofinerlevels• Orderofattributematters• Notrelationaldata(emptycellsno

keys)

Model Year Color Model,Year,Color Model,Year Model

Chevy 1994 Black 50

Chevy 1994 White 40

90

Chevy 1995 Black 115

Chevy 1995 White 85

200

290

GROUPBY

Sales(Model,Year,Color,Units)

Roll-ups

Drill-downs

Roll-Ups• Anotherrepresentation(ChrisDate’96)• Relational,but

– longattributenames– hardtoexpressinSQLandrepetition

Model Year Color Model,Year,Color Model,Year Model

Chevy 1994 Black 50 90 290

Chevy 1994 White 40 90 290

Chevy 1995 Black 85 200 290

Chevy 1995 Black 115 200 290

GROUPBY

Sales(Model,Year,Color,Units)

‘ALL’ ConstructEasiertovisualizeroll-upifallowALLtofillinthesuper-aggregates

SELECT Model, Year, Color, SUM(Units)

FROM Sales

WHERE Model = ‘Chevy’

GROUP BY Model, Year, Color

UNION

SELECT Model, Year, ‘ALL’, SUM(Units)

FROM Sales

WHERE Model = ‘Chevy’

GROUP BY Model, Year

UNION

UNION

SELECT ‘ALL’, ‘ALL’, ‘ALL’, SUM(Units)

FROM Sales

WHERE Model = ‘Chevy’;

Model Year Color Units

Chevy 1994 Black 50

Chevy 1994 White 40

Chevy 1994 ‘ALL’ 90

Chevy 1995 Black 85

Chevy 1995 White 115

Chevy 1995 ‘ALL’ 200

Chevy ‘ALL’ ‘ALL’ 290

Sales(Model,Year,Color,Units)

Model Year Color Units

Chevy 1994 Black 50

Chevy 1994 White 40

Chevy 1994 ‘ALL’ 90

Chevy 1995 Black 85

Chevy 1995 White 115

Chevy 1995 ‘ALL’ 200

Chevy ‘ALL’ ‘ALL’ 290

Model Year Color Model,Year,Color Model,Year Model

Chevy 1994 Black 50

Chevy 1994 White 40

90

Chevy 1995 Black 115

Chevy 1995 White 85

200

290

TraditionalRoll-Up ‘ALL’Roll-Up

• Roll-upsareasymmetric

Sales(Model,Year,Color,Units)

CrossTabulation• Ifwemadetheroll-upsymmetric,wewouldgetacross-tabulation• Generalizestohigherdimensions

SELECT Model, ‘ALL’, Color, SUM(Units)

FROM Sales

WHERE Model = ‘Chevy’

GROUP BY Model, Color

Chevy 1994 1995 Total(ALL)

Black 50 85 135

White 40 115 155

Total(ALL) 90 200 290

IstheproblemsolvedwithCross-TabandGROUP-BYswith‘ALL’?• RequiresalotofGROUPBYs(64for6-dimension)• Toocomplextooptimize (64scans,64sort/hash,slow)

Sales(Model,Year,Color,Units)

NaïveApproach

Runanumberofqueries

SELECT sum(units)FROM Sales

SELECT Color, sum(units)FROM SalesGROUP BY Color

SELECT Year, sum(units)FROM SalesGROUP BY Year

SELECT Model, Year, sum(units)FROM SalesGROUP BY Model, Year….

• DatacubegeneralizesHistogram,Roll-Ups,Cross-Tabs

• MorecomplextodothesewithGROUP-BY

30

• Howmanysub-queries?• Howmanysub-queriesfor

8attributes?

Sales(Model,Year,Color,Units)

YearColor

Mod

el

TotalUnitsales

DukeCS,Fall2017 CompSci516:DatabaseSystems

Page 6: Lecture-22-DW-Cubedb.cs.duke.edu/courses/compsci516/cps216/compsci516/fall17/Lec… · instructor material of the [RG] book provided by the authors Dr. Ramakrishnan and Dr. Gehrke

11/14/17

6

DataCube:Intuition

SELECT ‘ALL’, ‘ALL’, ‘ALL’, sum(units)

FROM Sales

UNIONSELECT ‘ALL’, ‘ALL’, Color, sum(units)

FROM Sales

GROUP BY Color

UNION

SELECT ‘ALL’, Year, ‘ALL’, sum(units)

FROM Sales

GROUP BY Year

UNION SELECT Model, Year, ‘ALL’, sum(units)

FROM Sales

GROUP BY Model, Year

UNION ….

31

Sales(Model,Year,Color,Units)

YearColor

Mod

el

TotalUnitsales

DukeCS,Fall2017 CompSci516:DatabaseSystems

DataCube

Ack:fromslidesbyLaurelOrrandJeremyHyrkas,UW

DataCube• Computestheaggregateonallpossiblecombinationsof

groupbycolumns.

• IfthereareNattributes,thereare2N-1super-aggregates.

• IfthecardinalityoftheNattributesareC1,...,CN,thenthereareatotalof(C1+1)...(CN+1)valuesinthecube.

• ROLL-UPissimilarbutjustlooksatNaggregates

DataCube Syntax• SQLServer

SELECT Model, Year, Color, sum(units)

FROM Sales

GROUP BY Model, Year, Color

WITH CUBE

Sales(Model,Year,Color,Units)

TypesofAggregates• Distributive:inputcanbepartitionedintodisjointsets

andaggregatedseparatelyo COUNT,SUM,MIN

• Algebraic:canbecomposedofdistributiveaggregateso AVG

• Holistic:aggregatemustbecomputedovertheentireinputseto MEDIAN

• EfficientcomputationoftheCUBEoperatordependsonthetypeofaggregate

– DistributiveandAlgebraicaggregatesmotivateoptimizations

DukeCS,Fall2017 CompSci516:DatabaseSystems 36

ViewMaterializationandMaintenance[RG]Chapters25.8-25.10

Page 7: Lecture-22-DW-Cubedb.cs.duke.edu/courses/compsci516/cps216/compsci516/fall17/Lec… · instructor material of the [RG] book provided by the authors Dr. Ramakrishnan and Dr. Gehrke

11/14/17

7

Views(revisiting)

• Motivation(example)– Differentgroupsofanalystswithinanorganizationaretypicallyconcernedwithdifferentaspectsofabusiness

– Itisconvenienttodefine“views”thatgiveeachgroupinsightintotherelevantbusinessdetails

– Otherviewscanbedefinedorqueriescanbewrittenusingtheseviews

– ConvenientandEfficient

DukeCS,Fall2017 CompSci516:DatabaseSystems 37

ViewExample

CREATEVIEW RegionalSales(category,sales,state)ASSELECT P.category,S.sales,L.state

FROMProductsP,SalesS,LocationsLWHERE P.pid=S.pid AND S.locid=L.locid

SELECT R.category,R.state,SUM(R.sales)FROM RegionalSales AS RGROUPBY R.category,R.state

SELECT R.category,R.state,SUM(R.sales)FROM (SELECT P.category,S.sales,L.state

FROMProductsP,SalesS,LocationsLWHERE P.pid=S.pidAND S.locid=L.locid)AS R

GROUPBY R.category,R.state

View(salesofproductsbycategoryandstate)

Query(totalsalesforeachcategorybystate)

QueryModification(SQLdoesnotspecifyhowtoevaluatequeriesonviews,butcanconsideritasareplacement)

DukeCS,Fall2017 CompSci516:DatabaseSystems 38

ViewsandOLAP/Warehousing• OLAPqueriesaretypicallyaggregatequeries

– Precomputationisessentialforinteractiveresponsetimes– TheCUBEisinfactacollectionofaggregatequeries,and

precomputationisespeciallyimportant– lotsofworkonwhatisbesttoprecomputegivenalimitedamountof

spacetostoreprecomputedresults.

• Warehousescanbethoughtofasacollectionofasynchronouslyreplicatedtablesandperiodicallymaintainedviews– Factors:size,numberoftablesinvolved,manyarefromexternal

independentdatabases– Hasrenewedinterestin(asynchronous)viewmaintenance(morelater)

DukeCS,Fall2017 CompSci516:DatabaseSystems 39

ViewMaterialization• QueryModificationmaynotbeefficient

– whentheunderlyingviewiscomplex– evenwithsophisticatedoptimizationandevaluation– esp.whentheunderlyingtablesareinaremotedatabase(connectivity

andavailability)

• Alternative:ViewMaterialization– Precomputetheviewdefinitionandstoretheresult– Materializedviewscanbeusedasregularrelations– Providesfastaccess,likea(veryhigh-level)cache– Cancreateindexonviewstooforfurtherspeedup– Drawback:tomaintaintheconsistencyofthematerializedviewwhenthe

underlyingtable(s)areupdated(ViewMaintenance)– Ideally,wewantIncrementalViewMaintenancealgorithms(Lecture20)

DukeCS,Fall2017 CompSci516:DatabaseSystems 40

IndexonMaterializedViews:Examples

• SupposeweprecomputeRegionalSales andstoreitwithaclusteredB+treeindexon[category,state,sales].– Then,thequerycanbeansweredbyanindex-onlyscan.

SELECT R.state,SUM(R.sales)FROM RegionalSalesRWHERE R.category=“Laptop”GROUPBY R.state

SELECT R.state,SUM(R.sales)FROM RegionalSalesRWHERE R.state=“Wisconsin”GROUPBY R.category

Indexonprecomputedviewisgreat!

Indexislessuseful(mustscanentireleaflevel)

DukeCS,Fall2017 CompSci516:DatabaseSystems 41

CREATEVIEW RegionalSales(category,sales,state)ASSELECT P.category,S.sales,L.stateFROMProductsP,SalesS,LocationsLWHERE P.pid=S.pid AND S.locid=L.locid

SELECT R.category,R.state,SUM(R.sales)FROM RegionalSales AS RGROUPBY R.category,R.state

(Research)IssuesinViewMaterialization

1. Whatviewsshouldwematerialize,andwhatindexesshouldwebuildontheprecomputedresults?

2. Givenaqueryandasetofmaterializedviews,canweusethematerializedviewstoanswerthequery?– relatedtothefirstquestion(workloaddependent)– Trytomaterializeasmall,carefullychosensetofviewsthatcanbe

utilizedtoquicklyanswermostoftheimportantqueries

3. Howfrequentlyshouldwerefreshmaterializedviewstomakethemconsistentwiththeunderlyingtables?– Andhowcanwedothisincrementally?

DukeCS,Fall2017 CompSci516:DatabaseSystems 42

Page 8: Lecture-22-DW-Cubedb.cs.duke.edu/courses/compsci516/cps216/compsci516/fall17/Lec… · instructor material of the [RG] book provided by the authors Dr. Ramakrishnan and Dr. Gehrke

11/14/17

8

ViewMaintenance• Twosteps:– Propagate: Computechangestoviewwhendatachanges– Refresh: Applychangestothematerializedviewtable

• Maintenancepolicy: Controlswhenwedorefresh– Immediate: Aspartofthetransactionthatmodifiestheunderlyingdatatables• +Materializedviewisalwaysconsistent• - updatesareslowed

– Deferred: Sometimelater,inaseparatetransaction• - Viewbecomesinconsistent• + canscaletomaintainmanyviewswithoutslowingupdates

DukeCS,Fall2017 CompSci516:DatabaseSystems 43

TypesofDeferredMaintenanceThreeflavors:• Lazy:

– Delayrefreshuntilnextqueryonview;thenrefreshbeforeansweringthequery(slowsdownqueriesthanupdates)

• Periodic(Snapshot):– Refreshperiodically(e.g.onceinaday).Queriespossiblyanswered

usingoutdatedversionofviewtuples.Widelyused,especiallyforasynchronousreplicationindistributeddatabases,andforwarehouseapplications

• Event-basedorForced:– E.g.,Refreshafterafixednumberofupdatestounderlyingdatatables

• e.g.SnapshotinOracle7– periodicallyrefreshedbyentirelyrecomputing theview– Incremental”fastrefresh”or“simplesnapshots”forsimplerviews(no

aggregate,groupby,join,distinctetc.)

DukeCS,Fall2017 CompSci516:DatabaseSystems 44

ImplementingDataCube

DukeCS,Fall2017 CompSci516:DatabaseSystems 45

BasicIdeas

• Needtocomputeallgroup-by-s:– ABCD,ABC,ABD,BCD,AB,AC,AD,BC,BD,CD,A,B,C,D

• ComputeGROUP-BYsfrompreviouslycomputedGROUP-BYs– e.g.firstABCD– thenABCorACD– thenABorAC…

• WhichorderABCDissorted,mattersforsubsequentcomputations

– if(ABCD)isthesortedorder,ABCischeap,ACDorBCDisexpensive

Notations

• ABCD– group-byonattributesA,B,C,D– noguaranteeontheorderoftuples

• (ABCD)– sortedaccordingtoA->B->C->D

• ABCDand(ABCD)and(BCDA)– allcontainthesameresults– butindifferentsortedorder

Optimization1:SmallestParent

• ComputeGROUP-BYfromthesmallest(size)previouslycomputedGROUP-BYasaparent

– ABcanbecomputedfromABC,ABD,orABCD

– ABCorABDbetterthanABCD– EvenABCorABDmayhave

differentsizes,trytochoosethesmallerparent

48

ABCD

ABC ABD ACD BCD

AC AD BC BD CDAB

A B C D

all

DukeCS,Fall2017 CompSci516:DatabaseSystems

LATTICESTRUCTUREofdatacube

Page 9: Lecture-22-DW-Cubedb.cs.duke.edu/courses/compsci516/cps216/compsci516/fall17/Lec… · instructor material of the [RG] book provided by the authors Dr. Ramakrishnan and Dr. Gehrke

11/14/17

9

Optimization2:CacheResults

• CacheresultofoneGROUP-BYinmemorytoreducediskI/O– ComputeABfromABCwhile

ABCisstillinmemory

49

ABCD

ABC ABD ACD BCD

AC AD BC BD CDAB

A B C D

all

DukeCS,Fall2017 CompSci516:DatabaseSystems

Optimization3:AmortizeDiskScans

• AmortizediskreadsformultipleGROUP-BYs– SupposetheresultforABCD

isstoredondisk– ComputeallofABC,ABD,

ACD,BCDsimultaneouslyinonescanofABCD

50

ABCD

ABC ABD ACD BCD

AC AD BC BD CDAB

A B C D

all

DukeCS,Fall2017 CompSci516:DatabaseSystems

Optimization4,5(next)

• 4.Share-sort– forsort-basedalgorithms– pipe-sortalgorithm– coveredinclass

• 5.Shared-partition– forhash-basedalgorithms– pipe-hashalgorithm

• UseshashtablestocomputesmallerGROUP-Bys

• IfthehashtablesforABandACfitinmemory,computebothinonescanofABC

• OtherwisecanpartitiononA,andcancomputeHTsofABandACindifferentpartitions

– notcovered(seepaper)

51

ABCD

ABC ABD ACD BCD

AC AD BC BD CDAB

A B C D

all

DukeCS,Fall2017 CompSci516:DatabaseSystems

PipeSort:Idea

• Combinestwooptimizations:“shared-sorts”and“smallest-parent”

• Alsoincludes“cache-results”and“amortized-scans”

PipeSort:Share-sortoptimization• Datasortedinoneorder• ComputeallGROUP-BYsprefixedinthatorder• ComputeonetupleofABCD,propagateupwardinthe

pipelinebyasinglescan• Example:

– GROUP-BY overattributesABCD– Sortrawdataby(ABCD)– Compute(ABCD)->(ABC)->(AB)->(A) inpipelinedfashion– Noadditionalsortneeded

• BUT,mayhaveaconflictwith“smallest-parent”optimization– (ABD)->(AB)couldbeabetterchoice– Figureoutthebestparentchoicebyrunningaweighted-matching

algorithmlayerbylayer

(ABCD)

(ABC)

(AB)

(A)

SearchLattice• Directededge=>oneattributeless

andpossiblecomputation• Levelkcontainskattributes

– all=0attribute

• Twopossiblecostsforeachedgeeij=i --->j

• A(eij):i issortedforj– (BCA)->(BC)

• S(eij):i isNOTsortedforj– e.g.ABC->(BCA)->(BC)orhash

ABCD

ABC ABD ACD BCD

AC AD BC BD CDAB

A B C D

all Level0

Level1

Level2

Level3

Level4

A B C suma1 b1 c1 5a1 b1 c2 10a1 b2 c3 8a2 b2 c1 2a2 b2 c3 11

NotSortedA B C suma2 b2 c3 11a1 b1 c2 10a2 b2 c1 2a1 b1 c1 5a1 b2 c3 8

Sorted

A B suma1 b1 15a1 b2 8a2 b2 13

• Noparenthesis:orderoftuplescanbearbitrary

Page 10: Lecture-22-DW-Cubedb.cs.duke.edu/courses/compsci516/cps216/compsci516/fall17/Lec… · instructor material of the [RG] book provided by the authors Dr. Ramakrishnan and Dr. Gehrke

11/14/17

10

PipeSort Output

• OutputsasubgraphO– eachnodehasasingleparent– eachnodehasasortedorderofattributes

• ifparent’ssortedorderisaprefix,cost=A(eij),elseS(eij)

– MarkbyAorS– AtmostoneA-out-edge– Note:forsomenodes,theremaybenogreenA-out-edge

ACBD

ACB ABD ACD BDC

AC AD BC BD CDAB

A B C D

all Level0

Level1

Level2

Level3

Level4

Sorted(A)Not-Sorted(S)

Goal:FindOwithmintotalcost

Outline:PipeSort Algorithm(1)• Gofromlevel0toN-1

– hereN=4

• Foreachlevelk,findthebestwaytoconstructitfromlevelk+1

• uses“min-costweightedbipartitematching”

• createsknewcopiesofnodesatlevelk+1

• edgesfromoriginalcopy– costA(eij)

• edgesfromnewcopies– costS(eij)

ABCD

ABC ABD ACD BCD

AC AD BC BD CDAB

A B C D

all Level0

Level1

Level2

Level3

Level4

Outline:PipeSort Algorithm(2)• Illustrationwithasmallerexample• Levelk=1fromlevelk+1=2

– onenewcopy(dottededges)– oneexistingcopy(solidedge)

• Assumptionforsimplicity– samecostforalloutgoingedges– A(eij)=A(eij’)forallj,j’– S(eij)=S(eij’)foralli,i’

ABC

AC BCAB

A B C

all Level0

Level1

Level2

Level3

2,10 5,12 13,20

Aftercomputingthe plan,executeallpipelines

Outline:PipeSort Algorithm(3)

1.Firstpipelineisexecutedbyonescanofthedata

2.Sort(CBAD)->(BADC),computethesecondpipeline

3.…..

A,S costs

SeepaperforanotherPipeHash algorithm

ACBD

ACB ABD ACD BDC

AC AD BC BD CD

A B C D

all

AB