1 olap and decision support chapter 25, part a. 2 introduction increasingly, organizations are...

1

OLAP and Decision Support

Chapter 25, Part A

2

Introduction

Increasingly, organizations are analyzing current and historical data to identify useful patterns and support business strategies.

Emphasis is on complex, interactive, exploratory analysis of very large datasets created by integrating data from across all parts of an enterprise; data is fairly static.

Solving modern business problems such as market analysis requires query-centric database schemas that are multidimensional.

3

Three Complementary Trends

Data Warehousing: Consolidate data from many sources in one large repository. Loading, periodic synchronization of replicas. Semantic integration.

OLAP: Complex SQL queries and views. Queries based on spreadsheet-style operations and

“multidimensional” view of data. Interactive and “online” queries.

Data Mining: Exploratory search for interesting trends and anomalies. (Another lecture!)

4

OLAP versus OLTP OLTP( Online Transaction Proccess ) is used to

run the business operations of a company such as inventory management and order processing. On the other hand, OLAP works with data that is geared towards decision making, esp. long-term strategic decisions. OLTP OLAPContains a snapshot of the current data( 6 – 24 months )

Requires a history of transactions spread over many years(5 - 20 years)

Updated continuously Static Data

Can have errors or missing data Validated and complete data

Processes millions of transactions daily

Updated priodically through batch processing – usually once per day

Uses Entity – Relationship Diagram

Uses multi – dimensional model

5

Multidimensional Data Model Fact

Table -> Collection of numeric measures,

which depend on a set of dimensions. E.g., measure Sales, dimensions

Product (key: pid), Location (locid), and Time (timeid).

8 10 10

30 20 50

25 8 15

1 2 3 timeid

p

id11

12

13

11 1 1 25

11 2 1 8

11 3 1 15

12 1 1 30

12 2 1 20

12 3 1 50

13 1 1 8

13 2 1 10

13 3 1 10

11 1 2 35

pid

tim

eid

locid

sale

s

locid

Slice locid=1is shown:

6

Why Multidimensional Data Model

Traditional relational databases, as well as spreadsheets , are based on two dimensional model such as # of customers by region.

However, a telecommunication company analyst may not be interested just determining the number of the phone customers in a single state such as PA.

7

Why Multidimensional Data Model For ex, the analyst might like to determine

the number of customers who subscribed to both home and wireless service in the past year.

The scenario may become more complicated as more dimensions are added.

The analyst would have to access data in different tables and perform complex table joins, a task would be beyond the capabilities of ordinary user.

8

Why Multidimensional Data Model

Reg

ion

Prod

uct

Time

Three dimensional model

9

What is OLAP?

OLAP is an analytical technique that combines data access tools with an analytical database engine. In contrast to the simple rows and columns structure of relational databases, OLAP uses a multi-dimensional view of data. OLAP uses calculations and transformations to perform its analytical tasks. There are two types of OLAP systems architectures ;

10

MOLAP vs ROLAP

In Multidimensional OLAP ( MOLAP ), data is stored in a special OLAP database server, after being extracted from various sources, in pre-aggregated cubic format. In contrast to this approach, Relational OLAP ( ROLAP ) does not use an intermediate server because it can work directly against the relational database.

11

MOLAP vs ROLAP

12

MOLAP vs ROLAP

MOLAP performs well with 10 dimensions while ROLAP can scale considerably higher. ROLAP ‘s advantage is that it can work directly relational database and it is not limited by 10 dimensions but it places a heavy load on the server which makes it much slower and ROLAP is expensive to maintain.

13

Dimension and Fact tables

price

category

pname

pid country

statecitylocid

sales

locidtimeid

pid

holiday_flag

weekdate

timeid month

quarter

year

(Fact table)SALES

TIMES

PRODUCTS LOCATIONS

(Dimension)

(Dimension table)

(Dimension table)• The main relation, which relates dimensions to

a measure, is called the fact table. Each dimension can have additional attributes and an associated dimension table.

E.g., Products(pid, pname, category, price) Fact tables are much larger than dimensional tables.

14

Dimension Hierarchies

For each dimension, the set of values can be organized in a hierarchy:

PRODUCT TIME LOCATION

category week month state

pname date city

year

quarter country

15

OLAP Queries

Influenced by SQL and by spreadsheets. A common operation is to aggregate a

measure over one or more dimensions. Find total sales. Find total sales for each city, or for each state. Find top five products ranked by total sales.

Roll-up: Aggregating at different levels of a dimension hierarchy. E.g., Given total sales by city, we can roll-up to

get sales by state.

16

OLAP Queries Drill-down: The inverse of roll-up.

E.g., Given total sales by state, can drill-down to get total sales by city.

E.g., Can also drill-down on different dimension to get total sales by product for each state.

Pivoting: Aggregation on selected dimensions. E.g., Pivoting on Location and Time

yields this cross-tabulation:63 81 144

38 107 145

75 35 110

WI CA Total

1995

1996

1997

176 223 339Total

Slicing and Dicing: Equality and range selections on one or more dimensions.

17

OLAP Queries

DiceSlice

Slicing and dicing basically are used for viewing different range of different dimension selection.

18

Comparison with SQL Queries The cross-tabulation obtained by pivoting can also

be computed using a collection of SQL queries:

SELECT SUM(S.sales)FROM Sales S, Times T, Locations LWHERE S.timeid=T.timeid AND S.timeid=L.timeidGROUP BY T.year, L.state

This query generates the entries in the body of the chart( outlined by the dark lines )

63 81 144

38 107 145

75 35 110

WI CA Total

1995

1996

1997

176 223 339Total

19

Comparison with SQL Queries

SELECT SUM(S.sales)FROM Sales S, Location LWHERE S.timeid=L.timeidGROUP BY L.state

SELECT SUM(S.sales)FROM Sales S, Times TWHERE S.timeid=T.timeidGROUP BY T.year

SELECT SUM(S.sales)FROM Sales S, Location LWHERE S.locid=L.locid

The summary column on the right is generated by this query

The summary row at the bottom is generated by this query

The cumulative sum in the bottom corner of the chart.

20

The CUBE Operator

Generalizing the previous example, if there are k dimensions, we have 2^k possible SQL GROUP BY queries that can be generated through pivoting on a subset of dimensions.

GROUP BY CUBESELECT T.year, L.state, SUM (S.sales)FROM Sales S, Times T, Locations LWHERE S.timeid = T.timeid AND S.locid=L.locidGROUP BY CUBE( T.year, L.state )

21

The CUBE Operator

T.Year L.state SUM(S.sales)

1995 WI 63

1995 CA 81

1995 Null 144

1996 WI 38

1996 CA 107

1996 Null 145

1997 WI 75

1997 CA 35

1997 Null 110

Null WI 176

Null CA 223

Null Null 399

The Result of GROUP BY CUBE on Sales

22

The CUBE Operator

CUBE pid, locid, timeid BY SUM Sales

This query rolls up the table Sales on all eight subsets of the set {pid, locid, timeid}. It is equivalent to eight queries of the form:

SELECT SUM(S.sales)FROM Sales SGROUP BY grouping list grouping list is some set of the set {pid, locid,

timeid}

23

Implementation Issues New indexing techniques: Bitmap indexes,

Join indexes, array representations, compression, precomputation of aggregations, etc.

E.g., Bitmap index:

10100110

112 Joe M 3115 Ram M 5

119 Sue F 5

112 Woo M 4

00100000010000100010

sex custid name sex rating ratingBit-vector:1 bit for eachpossible value.Many queries canbe answered usingbit-vector ops!

MF

24

Join Indexes Consider the join of Sales, Products, Times, and

Locations, possibly with additional selection conditions (e.g., country=“USA”). A join index can be constructed to speed up such joins.

The index contains [s,p,t,l] if there are tuples (with sid) s in Sales, p in Products, t in Times and l in Locations that satisfy the join (and selection) conditions.

Problem: Number of join indexes can grow rapidly. A variation addresses this problem: For each column

with an additional selection (e.g., country), build an index with [c,s] in it if a dimension table tuple with value c in the selection column joins with a Sales tuple with sid s; if indexes are bitmaps, called bitmapped join index.

25

Bitmapped Join Index

Consider a query with conditions price=10 and country=“USA”. Suppose tuple (with sid) s in Sales joins with a tuple p with price=10 and a tuple l with country =“USA”. There are two join indexes; one containing [10,s] and the other [USA,s].

Intersecting these indexes tells us which tuples in Sales are in the join and satisfy the given selection.

price

category

pname

pid country

statecitylocid

sales

locidtimeid

pid

holiday_flag

week

date

timeid

month

quarter

year

(Fact table)SALES

TIMES

PRODUCTS LOCATIONS

26

Querying Sequences in SQL:1999 Trend analysis is difficult to do in SQL-92:

Find the % change in monthly sales Find the top 5 product by total sales Find the trailing n-day moving average of sales The first two queries can be expressed with

difficulty, but the third cannot even be expressed in SQL-92 if n is a parameter of the query.

The WINDOW clause in SQL:1999 allows us to write such queries over a table viewed as a sequence (implicitly, based on user-specified sort keys)

27

SQL Versions

Year Names Comments

1986

SQL-86 First published by ANSI. Ratified by ISO.

1989

SQL-89 Minor version.

1992

SQL-92 Major Version( ISO 9075 ).

1999

SQL:1999 Added regular expression matching, recursive queries, triggers, non-scholar types.

2003

SQL:2003 Introduced XML features, window functions, standardized sequences.

2006

SQL:2006 ISO/IEC 9075-14:2006 defines ways in which SQL can be used in conjunction with XML.

28

The WINDOW Clause

This example shows moving average of sales over 3 months

Let the result of the FROM and WHERE clauses be “Temp”.

(Conceptually) Temp is partitioned according to the PARTITION BY clause. Similar to GROUP BY, but the answer has one row

for each row in a partition, not one row per partition!

SELECT L.state, T.month, AVG(S.sales) OVER W AS movavgFROM Sales S, Times T, Locations LWHERE S.timeid=T.timeid AND S.locid=L.locidWINDOW W AS (PARTITION BY L.state

ORDER BY T.monthRANGE BETWEEN INTERVAL `1’ MONTH PRECEDINGAND INTERVAL `1’ MONTH FOLLOWING)

29

The Window Clause

Each partition is sorted according to the ORDER BY clause.

For each row in a partition, the WINDOW clause creates a “window” of nearby (preceding or succeeding) tuples. Can be value-based, as in example, using

RANGE Can be based on number of rows to include in

the window, using ROWS clause The aggregate function is evaluated for each row

in the partition using the corresponding window.

30

The Window Clause

In the example, the window for a row includes the row itself plus all rows whose “month” value is within a month value is within a month before or after; therefore, a row whose month value is June 2002 has a window containing all rows with month equal to May, June, and July 2002.

31

Top N Queries

If you want to find the 10 (or so) cheapest cars, it would be nice if the DB could avoid computing the costs of all cars before sorting to determine the 10 cheapest. Idea: Guess at a cost c such that the 10 cheapest all

cost less than c, and that not too many more cost less. Then add the selection cost<c and evaluate the query.

• If the guess is right, great, we avoid computation for cars that cost more than c.

• If the guess is wrong, need to reset the selection and recompute the original query.

32

Top N Queries

SELECT P.pid, P.pname, S.salesFROM Sales S, Products PWHERE S.pid=P.pid AND S.locid=1 AND S.timeid=3ORDER BY S.sales DESCOPTIMIZE FOR 10 ROWS

OPTIMIZE FOR construct is not in SQL:1999! Cut-off value c is chosen by optimizer.

SELECT P.pid, P.pname, S.salesFROM Sales S, Products PWHERE S.pid=P.pid AND S.locid=1 AND S.timeid=3

AND S.sales > cORDER BY S.sales DESC

33

Online Aggregation Consider an aggregate query, e.g., finding

the average sales by state. Can we provide the user with some information before the exact average is computed for all states? Can show the current “running average” for

each state as the computation proceeds. Even better, if we use statistical techniques and

sample tuples to aggregate instead of simply scanning the aggregated table, we can provide bounds such as “the average sales for Wisconsin is 2000$102 with 95% probability.• Should also use nonblocking algorithms!

34

Online Aggregation

Status

Prioritize State AVG(Sales) Confidence

%90 Selected Alabama 5,232.5 %97

%20 UnSelected

Alaska 2,832.5 %93

%92 Selected Arizona 6,432.5 %98

… … … … …

%35 UnSelected

Wyoming

4,243.5 %92

An algorithm is said to block if it does not produce output tuples until it has consumed all its input tuples. For example, the sort-merge join algorithm blocks because sorting requires all input tables before determining the first output tuple. Hash join is preferable instead of sort-merge join for online aggregation.

35

Summary Decision support is an emerging, rapidly

growing subarea of databases. Involves the creation of large, consolidated

data repositories called data warehouses. Warehouses exploited using sophisticated

analysis techniques: complex SQL queries and OLAP “multidimensional” queries (influenced by both SQL and spreadsheets).

New techniques for database design, indexing, view maintenance, and interactive querying need to be supported.

1 olap and decision support chapter 25, part a. 2 introduction increasingly, organizations are...

Documents