data warehousing and data cube - duke university · –complex sql queries and views. ... group by...
TRANSCRIPT
![Page 1: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/1.jpg)
CompSci 516DatabaseSystems
Lecture22DataWarehousing
andDataCube
Instructor:Sudeepa Roy
DukeCS,Fall2017 CompSci516:DatabaseSystems 1
![Page 2: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/2.jpg)
ReadingMaterial
• [RG]– Chapter25
• Gray-Chaudhuri-Bosworth-Layman-Reichart-Venkatrao-Pellow-Pirahesh,ICDE1996“DataCube:ARelationalAggregationOperatorGeneralizingGroup-By,Cross-Tab,andSub-Totals”
• Harinarayan-Rajaraman-Ullman,SIGMOD1996“Implementingdatacubesefficiently”
DukeCS,Fall2017 CompSci516:DatabaseSystems 2
Acknowledgement:• Thefollowingslideshavebeencreatedadaptingtheinstructormaterialofthe[RG]bookprovidedbytheauthorsDr.RamakrishnanandDr.Gehrke.• SomeslideshavebeenpreparedbyProf.Shivnath Babu
![Page 3: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/3.jpg)
DataWarehousing
DukeCS,Fall2017 CompSci516:DatabaseSystems 3
![Page 4: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/4.jpg)
4
Warehousing
• Growingindustry:$8billionwaybackin1998• DatawarehousevendorlikeTeradata
• big“Petabytescale”customers• Apple,Walmart(2008-2.5PB),eBay(2013-primaryDW9.2PB,otherbigdata40PB,singletablewith1trillionrows),Verizon,AT&T,BankofAmerica
• supportsdataintoandoutofHadoop
• Lotsofbuzzwords,hype– slice&dice,rollup,MOLAP,pivot,...
Ack:SlidebyProf.Shivnath Babu
https://gigaom.com/2013/03/27/why-apple-ebay-and-walmart-have-some-of-the-biggest-data-warehouses-youve-ever-seen/
![Page 5: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/5.jpg)
5
MotivatingExamples
• Forecasting• Comparingperformanceofunits• Monitoring,detectingfraud• Visualization
![Page 6: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/6.jpg)
Introduction• Organizationsanalyzecurrentandhistoricaldata
– toidentifyusefulpatterns– tosupportbusinessstrategies
• Emphasisisoncomplex,interactive,exploratoryanalysisofverylargedatasets
• Createdbyintegratingdatafromacrossallpartsofanenterprise
• Dataisfairlystatic
• Relevantonceagainfortherecent“BigDataanalysis”– tofigureoutwhatwecanreuse,whatwecannot
DukeCS,Fall2017 CompSci516:DatabaseSystems 6
![Page 7: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/7.jpg)
ThreeComplementaryTrends
• DataWarehousing(DW):– Consolidatedatafrommanysourcesinonelargerepository– Loading,periodicsynchronizationofreplicas– Semanticintegration
• OLAP:– ComplexSQLqueriesandviews.– Queriesbasedonspreadsheet-styleoperationsand
“multidimensional”viewofdata.– Interactiveand“online”queries.
• DataMining:– Exploratorysearchforinterestingtrendsandanomalies– Nextlecture!
DukeCS,Fall2017 CompSci516:DatabaseSystems 7
![Page 8: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/8.jpg)
DataWarehousing• Acollectionofdecisionsupporttechnologies• Toenablepeopleinindustry/organizationstomakebetter
decisions– SupportsOLAP(On-LineAnalyticalProcessing)
• Applicationsin– Manufacturing– Retail– Finance– Transportation– Healthcare– …
• Typicallymaintainedseparatelyfrom“OperationalDatabases”– OperationalDatabasessupportOLTP(On-LineTransactionProcessing)
8DukeCS,Fall2017 CompSci516:DatabaseSystems
![Page 9: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/9.jpg)
9
WhyaWarehouse?
• TwoApproaches:– Query-Driven(Lazy)–Warehouse(Eager)
Source Source
?
![Page 10: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/10.jpg)
10
AdvantagesofWarehousing
• Highqueryperformance• Queriesnotvisibleoutsidewarehouse• Localprocessingatsourcesunaffected• Canoperatewhensourcesunavailable• CanquerydatanotstoredinaDBMS• Extrainformationatwarehouse–Modify,summarize(storeaggregates)– Addhistoricalinformation
![Page 11: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/11.jpg)
11
Query-DrivenApproach
Client Client
Wrapper Wrapper Wrapper
Mediator
Source Source Source
![Page 12: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/12.jpg)
12
AdvantagesofQuery-Driven
• Noneedtocopydata– lessstorage– noneedtopurchasedata
• Moreup-to-datedata• Queryneedscanbeunknown• Onlyqueryinterfaceneededatsources• Maybelessdrainingonsources
![Page 13: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/13.jpg)
OLTP DataWarehousing/OLAPMostlyupdates Mostlyreads
Applications:Orderentry,salesupdate,bankingtransactions
Applications:Decisionsupportinindustry/organization
Detailed,up-to-datedata Summarized, historicaldata(frommultipleoperationaldb,growsovertime)
Structured,repetitive,shorttasks Queryintensive,adhoc,complexqueries
Eachtransactionreads/updatesonlyafewtuples(tensof)
Eachquerycanaccesses manyrecords,andperformmanyjoins,scans,aggregates
MB-GBdata GB-TB data
Typicallyclericalusers Decisionmakers,analysts asusers
Important:Consistency,recoverability,Maximizingtr.throughput
Important:QuerythroughputResponse times
13DukeCS,Fall2017 CompSci516:DatabaseSystems
![Page 14: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/14.jpg)
DataMarts
• smallerdatawarehouse• subsetsofdataonselectedsubjects• e.g.Marketingdatamartcanincludecustomer,product,sales
• Department-focused,noenterprise-wideconsensusneeded
• Butmayleadtocomplexintegrationproblemsinthelongrun
14DukeCS,Fall2017 CompSci516:DatabaseSystems
![Page 15: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/15.jpg)
ROLAPandMOLAP
• RelationalOLAP(ROLAP)– OntopofstandardrelationalDBMS– DataisstoredinrelationalDBMS– SupportsextensionstoSQLtoaccessmulti-dimensionaldata
• MultidimensionalOLAP(MOLAP)– Directlystoresmultidimensionaldatainspecialdatastructures(e.g.arrays)
15DukeCS,Fall2017 CompSci516:DatabaseSystems
![Page 16: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/16.jpg)
DataWarehousingtoMining
• Integrateddataspanninglongtimeperiods,oftenaugmentedwithsummaryinformation
• Severalgigabytestoterabytescommon
• Interactiveresponsetimesexpectedforcomplexqueries;ad-hocupdatesuncommon
EXTERNALDATASOURCES
EXTRACTTRANSFORM
LOADREFRESH
DATAWAREHOUSE
MetadataRepository
SUPPORTS
OLAPDATAMINING
DukeCS,Fall2017 CompSci516:DatabaseSystems 16
![Page 17: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/17.jpg)
WarehousingIssues• SemanticIntegration: Whengettingdatafrommultiplesources,musteliminatemismatches– e.g.,differentcurrencies,schemas
• HeterogeneousSources: Mustaccessdatafromavarietyofsourceformatsandrepositories– Replicationcapabilitiescanbeexploitedhere
• Load,Refresh,Purge: Mustloaddata,periodicallyrefreshit,andpurgetoo-olddata
• MetadataManagement: Mustkeeptrackofsource,loadingtime,andotherinformationforalldatainthewarehouse
DukeCS,Fall2017 CompSci516:DatabaseSystems 17
![Page 18: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/18.jpg)
DWArchitecture
18
• ExtractdatafrommultipleoperationalDBandexternalsources
• Clean/integrate/transform/store• Refreshperiodically
– updatebaseandderiveddata– admindecideswhenandhowDukeCS,Fall2017 CompSci516:DatabaseSystems
• MainDWandseveraldatamarts(possibly)• Managedbyoneormoreserversand
frontendtools• Additionalmetadataand
monitoring/admintools
![Page 19: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/19.jpg)
ROLAP:StarSchema
• Toreflectmulti-dimensionalviewsofdata
• Singlefacttable
• Singletableforeverydimension
• Eachtupleinthefacttableconsistsof– pointers(foreignkey)toeach
ofthedimensions(multi-dimensionalcoordinates)
– numericvalueforthosecoordinates
• Eachdimensiontablecontainsattributesofthatdimension
19
NosupportforattributehierarchiesDukeCS,Fall2017 CompSci516:DatabaseSystems
![Page 20: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/20.jpg)
DimensionHierarchies
• Foreachdimension,thesetofvaluescanbeorganizedinahierarchy:
PRODUCT TIME LOCATION
categoryweekmonthstate
pnamedatecity
year
quartercountry
DukeCS,Fall2017 CompSci516:DatabaseSystems 20
![Page 21: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/21.jpg)
ROLAP:SnowflakeSchema
21
• Refinesstar-schema• Dimensionalhierarchyis
explicitlyrepresented
• (+)Dimensiontableseasiertomaintain– supposethe“category
descriptionisbeingchanged
• (-)Needadditionaljoins
• FactConstellations– Multiplefacttablessharesome
dimensionaltables– e.g.ProjectedandActual
Expensesmaysharemanydimensions
DukeCS,Fall2017 CompSci516:DatabaseSystems
![Page 22: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/22.jpg)
Motivation:OLAPQueries• Dataanalystsareinterestedinexploringtrendsandanomalies– Possiblybyvisualization(Excel)- 2Dor3Dplots– “DimensionalityReduction”bysummarizingdataandcomputing
aggregates– InfluencedbySQLandbyspreadsheets.– Acommonoperationistoaggregate ameasureoveroneormore
dimensions.
• Findtotalunitsalesforeach1. Model2. Model,brokenintoyears3. Year,brokenintocolors4. Year5. Model,brokenintocolor,….
22
Sales(Model,Year,Color,Units)
DukeCS,Fall2017 CompSci516:DatabaseSystems
![Page 23: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/23.jpg)
OLAPand
DataCube
DukeCS,Fall2017 CompSci516:DatabaseSystems 23
![Page 24: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/24.jpg)
HistogramsAtabulatedfrequencyofcomputedvalues
SELECT Year, COUNT(Units) as total
FROM Sales
GROUP BY Year
ORDER BY Year
MayrequireanestedSELECTtocompute
Sales(Model,Year,Color,Units)
Year->
total->
![Page 25: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/25.jpg)
Roll-Ups• Analysisreportsstartatacoarselevel,
gotofinerlevels• Orderofattributematters• Notrelationaldata(emptycellsno
keys)
Model Year Color Model,Year,Color Model,Year Model
Chevy 1994 Black 50
Chevy 1994 White 40
90
Chevy 1995 Black 115
Chevy 1995 White 85
200
290
GROUPBY
Sales(Model,Year,Color,Units)
Roll-ups
Drill-downs
![Page 26: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/26.jpg)
Roll-Ups• Anotherrepresentation(ChrisDate’96)• Relational,but
– longattributenames– hardtoexpressinSQLandrepetition
Model Year Color Model,Year,Color Model,Year Model
Chevy 1994 Black 50 90 290
Chevy 1994 White 40 90 290
Chevy 1995 Black 85 200 290
Chevy 1995 Black 115 200 290
GROUPBY
Sales(Model,Year,Color,Units)
![Page 27: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/27.jpg)
‘ALL’ ConstructEasiertovisualizeroll-upifallowALLtofillinthesuper-aggregates
SELECT Model, Year, Color, SUM(Units)
FROM Sales
WHERE Model = ‘Chevy’
GROUP BY Model, Year, Color
UNION
SELECT Model, Year, ‘ALL’, SUM(Units)
FROM Sales
WHERE Model = ‘Chevy’
GROUP BY Model, Year
UNION
…
UNION
SELECT ‘ALL’, ‘ALL’, ‘ALL’, SUM(Units)
FROM Sales
WHERE Model = ‘Chevy’;
Model Year Color Units
Chevy 1994 Black 50
Chevy 1994 White 40
Chevy 1994 ‘ALL’ 90
Chevy 1995 Black 85
Chevy 1995 White 115
Chevy 1995 ‘ALL’ 200
Chevy ‘ALL’ ‘ALL’ 290
Sales(Model,Year,Color,Units)
![Page 28: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/28.jpg)
Model Year Color Units
Chevy 1994 Black 50
Chevy 1994 White 40
Chevy 1994 ‘ALL’ 90
Chevy 1995 Black 85
Chevy 1995 White 115
Chevy 1995 ‘ALL’ 200
Chevy ‘ALL’ ‘ALL’ 290
Model Year Color Model,Year,Color Model,Year Model
Chevy 1994 Black 50
Chevy 1994 White 40
90
Chevy 1995 Black 115
Chevy 1995 White 85
200
290
TraditionalRoll-Up ‘ALL’Roll-Up
• Roll-upsareasymmetric
Sales(Model,Year,Color,Units)
![Page 29: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/29.jpg)
CrossTabulation• Ifwemadetheroll-upsymmetric,wewouldgetacross-tabulation• Generalizestohigherdimensions
SELECT Model, ‘ALL’, Color, SUM(Units)
FROM Sales
WHERE Model = ‘Chevy’
GROUP BY Model, Color
Chevy 1994 1995 Total(ALL)
Black 50 85 135
White 40 115 155
Total(ALL) 90 200 290
IstheproblemsolvedwithCross-TabandGROUP-BYswith‘ALL’?• RequiresalotofGROUPBYs(64for6-dimension)• Toocomplextooptimize (64scans,64sort/hash,slow)
Sales(Model,Year,Color,Units)
![Page 30: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/30.jpg)
NaïveApproach
Runanumberofqueries
SELECT sum(units)FROM Sales
SELECT Color, sum(units)FROM SalesGROUP BY Color
SELECT Year, sum(units)FROM SalesGROUP BY Year
SELECT Model, Year, sum(units)FROM SalesGROUP BY Model, Year….
• DatacubegeneralizesHistogram,Roll-Ups,Cross-Tabs
• MorecomplextodothesewithGROUP-BY
30
• Howmanysub-queries?• Howmanysub-queriesfor
8attributes?
Sales(Model,Year,Color,Units)
YearColor
Mod
el
TotalUnitsales
DukeCS,Fall2017 CompSci516:DatabaseSystems
![Page 31: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/31.jpg)
DataCube:Intuition
SELECT ‘ALL’, ‘ALL’, ‘ALL’, sum(units)
FROM SalesUNIONSELECT ‘ALL’, ‘ALL’, Color, sum(units)
FROM SalesGROUP BY Color
UNIONSELECT ‘ALL’, Year, ‘ALL’, sum(units)FROM Sales
GROUP BY YearUNION SELECT Model, Year, ‘ALL’, sum(units)FROM SalesGROUP BY Model, Year
UNION ….
31
Sales(Model,Year,Color,Units)
YearColor
Mod
el
TotalUnitsales
DukeCS,Fall2017 CompSci516:DatabaseSystems
![Page 32: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/32.jpg)
DataCube
Ack:fromslidesbyLaurelOrrandJeremyHyrkas,UW
![Page 33: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/33.jpg)
DataCube• Computestheaggregateonallpossiblecombinationsof
groupbycolumns.
• IfthereareNattributes,thereare2N-1super-aggregates.
• IfthecardinalityoftheNattributesareC1,...,CN,thenthereareatotalof(C1+1)...(CN+1)valuesinthecube.
• ROLL-UPissimilarbutjustlooksatNaggregates
![Page 34: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/34.jpg)
DataCube Syntax• SQLServer
SELECT Model, Year, Color, sum(units)FROM Sales
GROUP BY Model, Year, Color
WITH CUBE
Sales(Model,Year,Color,Units)
![Page 35: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/35.jpg)
TypesofAggregates• Distributive:inputcanbepartitionedintodisjointsets
andaggregatedseparatelyo COUNT,SUM,MIN
• Algebraic:canbecomposedofdistributiveaggregateso AVG
• Holistic:aggregatemustbecomputedovertheentireinputseto MEDIAN
• EfficientcomputationoftheCUBEoperatordependsonthetypeofaggregate
– DistributiveandAlgebraicaggregatesmotivateoptimizations
![Page 36: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/36.jpg)
ImplementingDataCube
DukeCS,Fall2017 CompSci516:DatabaseSystems 36
![Page 37: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/37.jpg)
BasicIdeas
• Needtocomputeallgroup-by-s:– ABCD,ABC,ABD,BCD,AB,AC,AD,BC,BD,CD,A,B,C,D
• ComputeGROUP-BYsfrompreviouslycomputedGROUP-BYs– e.g.firstABCD– thenABCorACD– thenABorAC…
• WhichorderABCDissorted,mattersforsubsequentcomputations
– if(ABCD)isthesortedorder,ABCischeap,ACDorBCDisexpensive
![Page 38: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/38.jpg)
Notations
• ABCD– group-byonattributesA,B,C,D– noguaranteeontheorderoftuples
• (ABCD)– sortedaccordingtoA->B->C->D
• ABCDand(ABCD)and(BCDA)– allcontainthesameresults– butindifferentsortedorder
![Page 39: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/39.jpg)
Optimization1:SmallestParent
• ComputeGROUP-BYfromthesmallest(size)previouslycomputedGROUP-BYasaparent
– ABcanbecomputedfromABC,ABD,orABCD
– ABCorABDbetterthanABCD– EvenABCorABDmayhave
differentsizes,trytochoosethesmallerparent
39
ABCD
ABC ABD ACD BCD
AC AD BC BD CDAB
A B C D
all
DukeCS,Fall2017 CompSci516:DatabaseSystems
LATTICESTRUCTUREofdatacube
![Page 40: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/40.jpg)
Optimization2:CacheResults
• CacheresultofoneGROUP-BYinmemorytoreducediskI/O– ComputeABfromABCwhile
ABCisstillinmemory
40
ABCD
ABC ABD ACD BCD
AC AD BC BD CDAB
A B C D
all
DukeCS,Fall2017 CompSci516:DatabaseSystems
![Page 41: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/41.jpg)
Optimization3:AmortizeDiskScans
• AmortizediskreadsformultipleGROUP-BYs– SupposetheresultforABCD
isstoredondisk– ComputeallofABC,ABD,
ACD,BCDsimultaneouslyinonescanofABCD
41
ABCD
ABC ABD ACD BCD
AC AD BC BD CDAB
A B C D
all
DukeCS,Fall2017 CompSci516:DatabaseSystems
![Page 42: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/42.jpg)
Optimization4,5(next)
• 4.Share-sort– forsort-basedalgorithms– pipe-sortalgorithm– coveredinclass
• 5.Shared-partition– forhash-basedalgorithms– pipe-hashalgorithm
• UseshashtablestocomputesmallerGROUP-Bys
• IfthehashtablesforABandACfitinmemory,computebothinonescanofABC
• OtherwisecanpartitiononA,andcancomputeHTsofABandACindifferentpartitions
– notcovered(seepaper)
42
ABCD
ABC ABD ACD BCD
AC AD BC BD CDAB
A B C D
all
DukeCS,Fall2017 CompSci516:DatabaseSystems
![Page 43: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/43.jpg)
PipeSort:Idea
• Combinestwooptimizations:“shared-sorts”and“smallest-parent”
• Alsoincludes“cache-results”and“amortized-scans”
![Page 44: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/44.jpg)
PipeSort:Share-sortoptimization• Datasortedinoneorder• ComputeallGROUP-BYsprefixedinthatorder• ComputeonetupleofABCD,propagateupwardinthe
pipelinebyasinglescan
• Example:– GROUP-BY overattributesABCD– Sortrawdataby(ABCD)– Compute(ABCD)->(ABC)->(AB)->(A) inpipelinedfashion– Noadditionalsortneeded
• BUT,mayhaveaconflictwith“smallest-parent”optimization– (ABD)->(AB)couldbeabetterchoice– Figureoutthebestparentchoicebyrunningaweighted-matching
algorithmlayerbylayer
(ABCD)
(ABC)
(AB)
(A)
![Page 45: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/45.jpg)
SearchLattice• Directededge=>oneattributeless
andpossiblecomputation• Levelkcontainskattributes
– all=0attribute
• Twopossiblecostsforeachedgeeij=i --->j
• A(eij):i issortedforj– (BCA)->(BC)
• S(eij):i isNOTsortedforj– e.g.ABC->(BCA)->(BC)orhash
ABCD
ABC ABD ACD BCD
AC AD BC BD CDAB
A B C D
all Level0
Level1
Level2
Level3
Level4
A B C suma1 b1 c1 5a1 b1 c2 10a1 b2 c3 8a2 b2 c1 2a2 b2 c3 11
NotSortedA B C suma2 b2 c3 11a1 b1 c2 10a2 b2 c1 2a1 b1 c1 5a1 b2 c3 8
Sorted
A B suma1 b1 15a1 b2 8a2 b2 13
• Noparenthesis:orderoftuplescanbearbitrary
![Page 46: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/46.jpg)
PipeSort Output
• OutputsasubgraphO– eachnodehasasingleparent– eachnodehasasortedorderofattributes
• ifparent’ssortedorderisaprefix,cost=A(eij),elseS(eij)
– MarkbyAorS– AtmostoneA-out-edge– Note:forsomenodes,theremaybenogreenA-out-edge
ACBD
ACB ABD ACD BDC
AC AD BC BD CDAB
A B C D
all Level0
Level1
Level2
Level3
Level4
Sorted(A)Not-Sorted(S)
Goal:FindOwithmintotalcost
![Page 47: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/47.jpg)
Outline:PipeSort Algorithm(1)• Gofromlevel0toN-1
– hereN=4
• Foreachlevelk,findthebestwaytoconstructitfromlevelk+1
• uses“min-costweightedbipartitematching”
• createsknewcopiesofnodesatlevelk+1
• edgesfromoriginalcopy– costA(eij)
• edgesfromnewcopies– costS(eij)
ABCD
ABC ABD ACD BCD
AC AD BC BD CDAB
A B C D
all Level0
Level1
Level2
Level3
Level4
![Page 48: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/48.jpg)
Outline:PipeSort Algorithm(2)• Illustrationwithasmallerexample• Levelk=1fromlevelk+1=2
– onenewcopy(dottededges)– oneexistingcopy(solidedge)
• Assumptionforsimplicity– samecostforalloutgoingedges– A(eij)=A(eij’)forallj,j’– S(eij)=S(eij’)foralli,i’
ABC
AC BCAB
A B C
all Level0
Level1
Level2
Level3
2,10 5,12 13,20
![Page 49: Data Warehousing and Data Cube - Duke University · –Complex SQL queries and views. ... group by columns. • If there are N attributes, there are 2N-1 super-aggregates. • If](https://reader033.vdocuments.us/reader033/viewer/2022041621/5e3f298d8828960b2625cf1e/html5/thumbnails/49.jpg)
Aftercomputingthe plan,executeallpipelines
Outline:PipeSort Algorithm(3)
1.Firstpipelineisexecutedbyonescanofthedata
2.Sort(CBAD)->(BADC),computethesecondpipeline
3.…..
A,S costs
SeepaperforanotherPipeHash algorithm
ACBD
ACB ABD ACD BDC
AC AD BC BD CD
A B C D
all
AB