1 olap and decision support chapter 25, part a. 2 introduction increasingly, organizations are...
Post on 19-Dec-2015
216 views
TRANSCRIPT
1
OLAP and Decision Support
Chapter 25, Part A
2
Introduction
Increasingly, organizations are analyzing current and historical data to identify useful patterns and support business strategies.
Emphasis is on complex, interactive, exploratory analysis of very large datasets created by integrating data from across all parts of an enterprise; data is fairly static.
Solving modern business problems such as market analysis requires query-centric database schemas that are multidimensional.
3
Three Complementary Trends
Data Warehousing: Consolidate data from many sources in one large repository. Loading, periodic synchronization of replicas. Semantic integration.
OLAP: Complex SQL queries and views. Queries based on spreadsheet-style operations and
“multidimensional” view of data. Interactive and “online” queries.
Data Mining: Exploratory search for interesting trends and anomalies. (Another lecture!)
4
OLAP versus OLTP OLTP( Online Transaction Proccess ) is used to
run the business operations of a company such as inventory management and order processing. On the other hand, OLAP works with data that is geared towards decision making, esp. long-term strategic decisions. OLTP OLAPContains a snapshot of the current data( 6 – 24 months )
Requires a history of transactions spread over many years(5 - 20 years)
Updated continuously Static Data
Can have errors or missing data Validated and complete data
Processes millions of transactions daily
Updated priodically through batch processing – usually once per day
Uses Entity – Relationship Diagram
Uses multi – dimensional model
5
Multidimensional Data Model Fact
Table -> Collection of numeric measures,
which depend on a set of dimensions. E.g., measure Sales, dimensions
Product (key: pid), Location (locid), and Time (timeid).
8 10 10
30 20 50
25 8 15
1 2 3 timeid
p
id11
12
13
11 1 1 25
11 2 1 8
11 3 1 15
12 1 1 30
12 2 1 20
12 3 1 50
13 1 1 8
13 2 1 10
13 3 1 10
11 1 2 35
pid
tim
eid
locid
sale
s
locid
Slice locid=1is shown:
6
Why Multidimensional Data Model
Traditional relational databases, as well as spreadsheets , are based on two dimensional model such as # of customers by region.
However, a telecommunication company analyst may not be interested just determining the number of the phone customers in a single state such as PA.
7
Why Multidimensional Data Model For ex, the analyst might like to determine
the number of customers who subscribed to both home and wireless service in the past year.
The scenario may become more complicated as more dimensions are added.
The analyst would have to access data in different tables and perform complex table joins, a task would be beyond the capabilities of ordinary user.
8
Why Multidimensional Data Model
Reg
ion
Prod
uct
Time
Three dimensional model
9
What is OLAP?
OLAP is an analytical technique that combines data access tools with an analytical database engine. In contrast to the simple rows and columns structure of relational databases, OLAP uses a multi-dimensional view of data. OLAP uses calculations and transformations to perform its analytical tasks. There are two types of OLAP systems architectures ;
10
MOLAP vs ROLAP
In Multidimensional OLAP ( MOLAP ), data is stored in a special OLAP database server, after being extracted from various sources, in pre-aggregated cubic format. In contrast to this approach, Relational OLAP ( ROLAP ) does not use an intermediate server because it can work directly against the relational database.
11
MOLAP vs ROLAP
12
MOLAP vs ROLAP
MOLAP performs well with 10 dimensions while ROLAP can scale considerably higher. ROLAP ‘s advantage is that it can work directly relational database and it is not limited by 10 dimensions but it places a heavy load on the server which makes it much slower and ROLAP is expensive to maintain.
13
Dimension and Fact tables
price
category
pname
pid country
statecitylocid
sales
locidtimeid
pid
holiday_flag
weekdate
timeid month
quarter
year
(Fact table)SALES
TIMES
PRODUCTS LOCATIONS
(Dimension)
(Dimension table)
(Dimension table)• The main relation, which relates dimensions to
a measure, is called the fact table. Each dimension can have additional attributes and an associated dimension table.
E.g., Products(pid, pname, category, price) Fact tables are much larger than dimensional tables.
14
Dimension Hierarchies
For each dimension, the set of values can be organized in a hierarchy:
PRODUCT TIME LOCATION
category week month state
pname date city
year
quarter country
15
OLAP Queries
Influenced by SQL and by spreadsheets. A common operation is to aggregate a
measure over one or more dimensions. Find total sales. Find total sales for each city, or for each state. Find top five products ranked by total sales.
Roll-up: Aggregating at different levels of a dimension hierarchy. E.g., Given total sales by city, we can roll-up to
get sales by state.
16
OLAP Queries Drill-down: The inverse of roll-up.
E.g., Given total sales by state, can drill-down to get total sales by city.
E.g., Can also drill-down on different dimension to get total sales by product for each state.
Pivoting: Aggregation on selected dimensions. E.g., Pivoting on Location and Time
yields this cross-tabulation:63 81 144
38 107 145
75 35 110
WI CA Total
1995
1996
1997
176 223 339Total
Slicing and Dicing: Equality and range selections on one or more dimensions.
17
OLAP Queries
DiceSlice
Slicing and dicing basically are used for viewing different range of different dimension selection.
18
Comparison with SQL Queries The cross-tabulation obtained by pivoting can also
be computed using a collection of SQL queries:
SELECT SUM(S.sales)FROM Sales S, Times T, Locations LWHERE S.timeid=T.timeid AND S.timeid=L.timeidGROUP BY T.year, L.state
This query generates the entries in the body of the chart( outlined by the dark lines )
63 81 144
38 107 145
75 35 110
WI CA Total
1995
1996
1997
176 223 339Total
19
Comparison with SQL Queries
SELECT SUM(S.sales)FROM Sales S, Location LWHERE S.timeid=L.timeidGROUP BY L.state
SELECT SUM(S.sales)FROM Sales S, Times TWHERE S.timeid=T.timeidGROUP BY T.year
SELECT SUM(S.sales)FROM Sales S, Location LWHERE S.locid=L.locid
The summary column on the right is generated by this query
The summary row at the bottom is generated by this query
The cumulative sum in the bottom corner of the chart.
20
The CUBE Operator
Generalizing the previous example, if there are k dimensions, we have 2^k possible SQL GROUP BY queries that can be generated through pivoting on a subset of dimensions.
GROUP BY CUBESELECT T.year, L.state, SUM (S.sales)FROM Sales S, Times T, Locations LWHERE S.timeid = T.timeid AND S.locid=L.locidGROUP BY CUBE( T.year, L.state )
21
The CUBE Operator
T.Year L.state SUM(S.sales)
1995 WI 63
1995 CA 81
1995 Null 144
1996 WI 38
1996 CA 107
1996 Null 145
1997 WI 75
1997 CA 35
1997 Null 110
Null WI 176
Null CA 223
Null Null 399
The Result of GROUP BY CUBE on Sales
22
The CUBE Operator
CUBE pid, locid, timeid BY SUM Sales
This query rolls up the table Sales on all eight subsets of the set {pid, locid, timeid}. It is equivalent to eight queries of the form:
SELECT SUM(S.sales)FROM Sales SGROUP BY grouping list grouping list is some set of the set {pid, locid,
timeid}
23
Implementation Issues New indexing techniques: Bitmap indexes,
Join indexes, array representations, compression, precomputation of aggregations, etc.
E.g., Bitmap index:
10100110
112 Joe M 3115 Ram M 5
119 Sue F 5
112 Woo M 4
00100000010000100010
sex custid name sex rating ratingBit-vector:1 bit for eachpossible value.Many queries canbe answered usingbit-vector ops!
MF
24
Join Indexes Consider the join of Sales, Products, Times, and
Locations, possibly with additional selection conditions (e.g., country=“USA”). A join index can be constructed to speed up such joins.
The index contains [s,p,t,l] if there are tuples (with sid) s in Sales, p in Products, t in Times and l in Locations that satisfy the join (and selection) conditions.
Problem: Number of join indexes can grow rapidly. A variation addresses this problem: For each column
with an additional selection (e.g., country), build an index with [c,s] in it if a dimension table tuple with value c in the selection column joins with a Sales tuple with sid s; if indexes are bitmaps, called bitmapped join index.
25
Bitmapped Join Index
Consider a query with conditions price=10 and country=“USA”. Suppose tuple (with sid) s in Sales joins with a tuple p with price=10 and a tuple l with country =“USA”. There are two join indexes; one containing [10,s] and the other [USA,s].
Intersecting these indexes tells us which tuples in Sales are in the join and satisfy the given selection.
price
category
pname
pid country
statecitylocid
sales
locidtimeid
pid
holiday_flag
week
date
timeid
month
quarter
year
(Fact table)SALES
TIMES
PRODUCTS LOCATIONS
26
Querying Sequences in SQL:1999 Trend analysis is difficult to do in SQL-92:
Find the % change in monthly sales Find the top 5 product by total sales Find the trailing n-day moving average of sales The first two queries can be expressed with
difficulty, but the third cannot even be expressed in SQL-92 if n is a parameter of the query.
The WINDOW clause in SQL:1999 allows us to write such queries over a table viewed as a sequence (implicitly, based on user-specified sort keys)
27
SQL Versions
Year Names Comments
1986
SQL-86 First published by ANSI. Ratified by ISO.
1989
SQL-89 Minor version.
1992
SQL-92 Major Version( ISO 9075 ).
1999
SQL:1999 Added regular expression matching, recursive queries, triggers, non-scholar types.
2003
SQL:2003 Introduced XML features, window functions, standardized sequences.
2006
SQL:2006 ISO/IEC 9075-14:2006 defines ways in which SQL can be used in conjunction with XML.
28
The WINDOW Clause
This example shows moving average of sales over 3 months
Let the result of the FROM and WHERE clauses be “Temp”.
(Conceptually) Temp is partitioned according to the PARTITION BY clause. Similar to GROUP BY, but the answer has one row
for each row in a partition, not one row per partition!
SELECT L.state, T.month, AVG(S.sales) OVER W AS movavgFROM Sales S, Times T, Locations LWHERE S.timeid=T.timeid AND S.locid=L.locidWINDOW W AS (PARTITION BY L.state
ORDER BY T.monthRANGE BETWEEN INTERVAL `1’ MONTH PRECEDINGAND INTERVAL `1’ MONTH FOLLOWING)
29
The Window Clause
Each partition is sorted according to the ORDER BY clause.
For each row in a partition, the WINDOW clause creates a “window” of nearby (preceding or succeeding) tuples. Can be value-based, as in example, using
RANGE Can be based on number of rows to include in
the window, using ROWS clause The aggregate function is evaluated for each row
in the partition using the corresponding window.
30
The Window Clause
In the example, the window for a row includes the row itself plus all rows whose “month” value is within a month value is within a month before or after; therefore, a row whose month value is June 2002 has a window containing all rows with month equal to May, June, and July 2002.
31
Top N Queries
If you want to find the 10 (or so) cheapest cars, it would be nice if the DB could avoid computing the costs of all cars before sorting to determine the 10 cheapest. Idea: Guess at a cost c such that the 10 cheapest all
cost less than c, and that not too many more cost less. Then add the selection cost<c and evaluate the query.
• If the guess is right, great, we avoid computation for cars that cost more than c.
• If the guess is wrong, need to reset the selection and recompute the original query.
32
Top N Queries
SELECT P.pid, P.pname, S.salesFROM Sales S, Products PWHERE S.pid=P.pid AND S.locid=1 AND S.timeid=3ORDER BY S.sales DESCOPTIMIZE FOR 10 ROWS
OPTIMIZE FOR construct is not in SQL:1999! Cut-off value c is chosen by optimizer.
SELECT P.pid, P.pname, S.salesFROM Sales S, Products PWHERE S.pid=P.pid AND S.locid=1 AND S.timeid=3
AND S.sales > cORDER BY S.sales DESC
33
Online Aggregation Consider an aggregate query, e.g., finding
the average sales by state. Can we provide the user with some information before the exact average is computed for all states? Can show the current “running average” for
each state as the computation proceeds. Even better, if we use statistical techniques and
sample tuples to aggregate instead of simply scanning the aggregated table, we can provide bounds such as “the average sales for Wisconsin is 2000$102 with 95% probability.• Should also use nonblocking algorithms!
34
Online Aggregation
Status
Prioritize State AVG(Sales) Confidence
%90 Selected Alabama 5,232.5 %97
%20 UnSelected
Alaska 2,832.5 %93
%92 Selected Arizona 6,432.5 %98
… … … … …
%35 UnSelected
Wyoming
4,243.5 %92
An algorithm is said to block if it does not produce output tuples until it has consumed all its input tuples. For example, the sort-merge join algorithm blocks because sorting requires all input tables before determining the first output tuple. Hash join is preferable instead of sort-merge join for online aggregation.
35
Summary Decision support is an emerging, rapidly
growing subarea of databases. Involves the creation of large, consolidated
data repositories called data warehouses. Warehouses exploited using sophisticated
analysis techniques: complex SQL queries and OLAP “multidimensional” queries (influenced by both SQL and spreadsheets).
New techniques for database design, indexing, view maintenance, and interactive querying need to be supported.