online analytical processing (olap) an overview kian win ong, nicola onose mar 3 rd 2006
TRANSCRIPT
Online Analytical Processing (OLAP)
An Overview
Kian Win Ong, Nicola OnoseMar 3rd 2006
Overview• Motivation
• Multi-Dimensional Data Model
• Research Areas
• Optimizations– Materializing multiple aggregates simultaneously– Materialization strategy
Motivation• Aggregation, summarization and exploration• Of historical data• To help management make informed decisions
Different Goal• Aggregation, summarization and exploration• Of historical data• To help management make informed decisions
Product Branch Time Price
Coke (0.5 gallon) Convoy Street 2006-03-01 09:00:01 $1.00
Pepsi (0.5 gallon) UTC 2006-03-01 09:00:01 $1.03
Coke (1 gallon) UTC 2006-03-01 09:00:02 $1.50
Altoids Costa Verde 2006-03-01 09:01:33 $0.30
...
• Find the total sales for each product and month• Find the percentage change in the total monthly
sales for each product
Different Requirements
OLTP OLAP
Tasks Day to day operation High level decision support
Size of database Gigabytes Terabytes
Time span Recent, up-to-date Spanning over months / years
Size of working set Tens of records, accessed through primary keys
Consolidated data from multiple databases
Workload Structured / repetitive Ad-hoc, exploratory queries
Performance Transaction throughput Query latency
• OLTP – On-Line Transaction Processing• OLAP – On-Line Analytical Processing
Overview• Motivation
• Multi-Dimensional Data Model
• Research Areas
• Optimizations– Materializing multiple aggregates simultaneously– Materialization strategy
Query Language Extensions
• In the real world, data is stored in RDBs.
Query Language Extensions
• In the real world, data is stored in RDBs.
• How to express N-dimensional problems using 2D tables?
Query Language Extensions
• In the real world, data is stored in RDBs.
• How to express N-dimensional problems using 2D tables?
• Can we combine OLAP and SQL queries?
• Jim Gray et al:Data Cube: A Relational Aggregation Operator1997
Query Language Extensions
1.histograms
Problems with GROUP BY
SELECT sales, prod_name, population FROM sales_history GROUP BY Population(City, State) as population
Query Language Extensions
1.histograms
2.rollup/drilldown
Problems with GROUP BY
ProductCategory
Product Name
Month Sales Sales by Cat.,
by Name
Salesby Cat.
Drinks Coke Feb 30.3
Mar 93.9 124.2
Heineken Feb 34.8
Mar 123.8 158.6 282.8
non relational representation
Query Language Extensions
1.histograms
2.rollup/drilldown
Problems with GROUP BY
Product Category
Product Name
Month Sales Sales by Cat.,
by Name
Salesby Cat.
Drinks Coke Feb 30.3 124.2 282.8
Drinks Coke Mar 93.9 124.2 282.8
Drinks Heineken Feb 34.8 158.6 282.8
Drinks Heineken Mar 123.8 158.6 282.8
relational, but the rollup is huge
Query Language Extensions
1.histograms
2.rollup/drilldown
3.cross tabulations
Problems with GROUP BY
2-D aggregation is more compact and more natural:
Drinks Feb Mar Total
Coke 30.3 93.9 124.2
Heineken 34.8 123.8 158.6
Total 65.1 217.7 282.8
Query Language ExtensionsReducing the number of attributes
Product Category
Product Name
Month Sales
Drinks Coke Feb 30.3
Drinks Coke Mar 93.9
Drinks Coke ALL 124.2
Drinks Heineken Feb 34.8
Drinks Heineken Mar 123.8
Drinks Heineken ALL 158.6
Drinks ALL ALL 282.8
Drinks ALL Feb 65.1
Drinks ALL Mar 217.7
Query Language Extensions
• introduce a new value: “ALL”
Reducing the number of attributes
“ALL” = the set over which we aggregate
Drinks Feb Mar Total (ALL)
Coke 30.3 93.9 124.2
Heineken 34.8 123.8 158.6
Total (ALL) 65.1 217.7 282.8
Query Language Extensions
• GROUP BY (1D)
General approach
Sales by Product Name
Feb Mar
Coke 30.3 93.9
Heineken 34.8 123.8
SUM 65.1 217.7
Query Language Extensions
• GROUP BY (1D)
• Cross Tab (2D)
General approach
Drinks Feb Mar ALL
Coke 30.3 93.9 124.2
Heineken 34.8 123.8 158.6
ALL 65.1 217.7 282.8
Product Category
Product Name
Month Sales
Drinks Coke Feb 30.3
Drinks Coke Mar 93.9
Drinks Coke ALL 124.2
Drinks Heineken Feb 34.8
Drinks Heineken Mar 123.8
Drinks Heineken ALL 158.6
Drinks ALL Feb 65.1
Drinks ALL Mar 217.7
Drinks ALL ALL 282.8
the corresponding relation:
Query Language Extensions
• GROUP BY (1D)
• Cross Tab (2D)
• Cube (3D)
General approachProduct
CategoryProduct Name
Month Sales
Drinks Coke Feb 30.3
Drinks Coke Mar 93.9
Drinks Coke ALL 124.2
… … … …
Snacks Doritos Feb 123.8
Snacks Doritos Mar 158.6
Snacks Doritos ALL 65.1
… … … …
ALL ALL ALL 964.0
By cat.andmonth
By cat. andname (does it make sense?)
By monthand name
Query Language Extensions
• GROUP BY (1D)
• Cross Tab (2D)
• Cube (3D)
• Any hypercube can be represented as a relation!
General approach
Query Language Extensions
• a CUBE relation, with aggregation function f(.)(x1, x2, …, xn-1, xn, f() )
……………………………
(x1, xn-1, …, xn, ALL, f() )
……………………………
(x1, x2, …, ALL, xn, f() )
……………………………
• after ROLLUP , reduce to a linear # of tuples(x1, x2, …, xn-1, xn, f() )
…………………………………
(x1, xn-1, …, xn, ALL, f() )
…………………………………
(x1, x2, …, ALL, ALL, f() )
…………………………………
(ALL, ALL, …, ALL, ALL, f() )
General approach
Query Language ExtensionsThe new operators: CUBE, ROLLUP
SELECT prod_category, prod_name, month, SUM(sales) AS sales FROM sales_history GROUP BY CUBE prod_category, prod_name, month
Product Category
Product Name
Month Sales
Drinks Coke Feb 30.3
Drinks Coke Mar 93.9
Drinks Coke ALL 124.2
… … … …
Drinks ALL Feb 99.8
… … … …
ALL ALL ALL 964.0
Idea:Group by the CUBE list.Union the aggregates.Introduce the ALL values.
Query Language ExtensionsThe new operators: CUBE, ROLLUP
SELECT prod_category, month, day, state, prod_name, SUM(sales) AS sales FROM sales_history GROUP BY prod_category ROLLUP month, day CUBE city, state
Product Category
Month Day State Product Name
Sales
Drinks Feb 26 CA Coke 12.3
Feb 26 CA Heineken 5.4
… … … … …
Feb 26 CA ALL 30.4
Feb 26 ALL Coke …
… … … …
Snacks Feb 26 CA Doritos 12.0
… … … …
Overview• Motivation
• Multi-Dimensional Data Model
• Research Areas
• Optimizations– Materializing multiple aggregates simultaneously– Materialization strategy
Research Areas• SQL language extensions
• Server architecture
• Parallel processing
• Index structures
• Materialized views
Overview• Motivation
• Multi-Dimensional Data Model
• Research Areas
• Optimizations– Materializing multiple aggregates simultaneously– Materialization strategy
Simultaneous
AggregatesMulti-Dimensional
• Optimization to calculate multiple aggregates simultaneously
• Useful for materialization of aggregate views
• Y. Zhao, P. Deshpande, J. NaughtonAn Array-Based Algorithm for Simultaneous Multidimensional AggregatesSIGMOD 1997
Multiple Aggregates
Month /Product
Feb Mar Total
Altoids 36 131 167
Coke 37 138 175
Doritos 21 136 157
Heineken 44 110 154
Pepsi 31 122 153
Pringles 37 126 164
Total 206 764 970
Product City Month Sales
Coke San Diego Feb 06 12
Pepsi Los Angeles Feb 06 13
Doritos San Diego Mar 06 72
Altoids San Diego Mar 06 65
...
Aggregate on…
Multiple Aggregates
City /Product
San Diego Los Angeles Total
Altoids 90 77 167
Coke 89 86 175
Doritos 74 83 157
Heineken 74 80 154
Pepsi 68 85 153
Pringles 73 90 164
Total 469 501 970
Month /City
Feb Mar Total
Los Angeles 112 358 469
San Diego 95 407 501
Total 206 764 970
Month /Product
Feb Mar Total
Altoids 36 131 167
Coke 37 138 175
Doritos 21 136 157
Heineken 44 110 154
Pepsi 31 122 153
Pringles 37 126 164
Total 206 764 970
Product City Month Sales
Coke San Diego Feb 06 12
Pepsi Los Angeles Feb 06 13
Doritos San Diego Mar 06 72
Altoids San Diego Mar 06 65
...
Aggregate on…
Multiple Aggregates
Product City Month Sales
Coke San Diego Feb 06 12
Pepsi Los Angeles Feb 06 13
Doritos San Diego Mar 06 72
Altoids San Diego Mar 06 65
...
1. Sales by Product / City2. Sales by Product / Month3. Sales by Month / City4. Sales by Product5. Sales by City6. Sales by Month7. Sales (Total)
Is it possible to• make a single pass over the transactional table?• calculate multiple aggregates simultaneously?
Aggregate on…
Chunking
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
64
20
36
42Dimension B
Dimension A
Dimension C
1
Product City Month Sales
Coke San Diego Feb 06 12
12
Array Chunk
Product
City
Month
Partition transactional data into array chunks
Naïve Algorithm
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
64
20
36
42
Dimension A
Dimension C
Pivot on ABaggregate on all C
Dimension A
Dimension B
Naïve Algorithm
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
64
20
36
42
Dimension A
Dimension C
Pivot on ABaggregate on all C
Pivot on ACaggregate on all B
Pivot on BCaggregate on all A
Dimension B
Single Pass Algorithm
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
64
20
36
42
Dimension A
Dimension C
B
1 2 3 4
1 2 3 4
1 2 3 4
AB
AC
BC
Make a single pass over data
Single Pass Algorithm
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
64
20
36
42
Dimension A
Dimension C
B
13
9 10 11 12
5 6 7 8
1 2 3 4
1 5 9 13
2 6 10 3 7 11 4 5 12
13
9 10 11 12
5 6 7 8
1 2 3 4
AB
AC
BC
Simultaneously maintain multiple aggregates
Single Pass Algorithm
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
64
20
36
42
Dimension A
Dimension C
B
13
9 10 11 12
5 6 7 8
1 2 3 4
1 5 9 13
2 6 10 3 7 11 4 5 12
13
9 10 11 12
5 6 7 8
1 2 3 4
AB
AC
BC
Write out completed aggregates
Single Pass Algorithm
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
64
20
36
42
Dimension A
Dimension C
B
13
9 10 11 12
5 6 7 8
1 2 3 4
1 5 9 13
2 6 10 3 7 11 4 5 12
13
AB
AC
BC
Only allocate memory that is necessary
Single Pass Algorithm13
9 10 11 12
5 6 7 8
1 2 3 4
1 5 9 13
2 6 10 3 7 11 4 5 12
13
AB
AC
BC
Array Chunk
ABC4 x 4 x 4
AB16 x 4 x 4
AC4 x 4 x 4
BC4 x 4
A4 x 4
B4
C4
all1
Minimum memory spanning tree
Multi Pass Algorithm
ABCD
ABC ABD ACD BCD
AB AC BC AD BD CD
A B C D
all
Recursively aggregate
Overview• Motivation
• Multi-Dimensional Data Model
• Research Areas
• Optimizations– Materializing multiple aggregates simultaneously– Materialization strategy
Implementing Data Cubes
• Biggest problem for data warehouses: the size
• Space / time trade-off:accelerate queries by materializing the cube
Implementing Data Cubes
• Biggest problem for data warehouses: the size
• Space / time trade-off:accelerate queries by materializing the cube
• The size of the relations gets even bigger!
Implementing Data Cubes
• Biggest problem for data warehouses: the size
• Space / time trade-off:accelerate queries by materializing the cube
• The size of the relations gets even bigger!
• M(ultidimensional)OLAP: good query performance, but bad scalability
• R(elational)OLAP: very scalable; query performance improved by materializing (partial) results
Implementing Data Cubes
• V. Harinarayan, A. Rajaraman, J.D. Ullman:Implementing Data Cubes EfficientlySIGMOD 1996
Presents a materialization strategy for the cells of the cube.
Implementing Data Cubes
Time Id
City Id
Product Id
Sales
Day
Month
WeekCity Id
City
State
Product Id
Name
Category
Week
Month
Year Year
Category Id
Category Name
Implementing Data Cubes
• casted as particular case of the rewriting using views problem
• what cells to materialize what SQL views to materialize
Implementing Data Cubes
• casted as particular case of the rewriting using views problem
• what cells to materialize what SQL views to materialize
p = productt = timec = city
• simple idea: Q1 depends on Q2 (Q1≤Q2) if Q1 can be fully answered using the results of Q2
ptc
pt tc pc
t p c
none
Implementing Data Cubes
• but cube dimensions are usually hierarchical
product_name
product_category
none
day
week month
year
none
city
state
none
X X
• direct-product latticep = productt = timec = city
ptc
pt tc pc
ptspwc
pyc
pmc
ps
pcatt
… ……
…
…
Implementing Data Cubes
• Def. cost of answering Q = # of rows in the table of ancestor(Q)
• It can be estimated w/o materializing the views
• Assume that all queries are identical to some view in the lattice
Implementing Data Cubes
• For a set S and a view vB(v,S) = ∑w≤v, (w not in S) max{cost(w)-cost(v), 0}
• Greedy algorithm for selecting k views to materialize from the lattice:
1. S := {top view}
2. For i=1 to k, add v to S s.t. B(v,S) is maximized
• The greedy algorithm is an (e-1)/e ≈ 0.63 approx. of the optimum.
Discussion
• Questions from the audience…