data warehouse & olap technologyredwood.cs.ttu.edu/~rahewett/teach/datamining/2-data...4 13 dw...

1

CS 5331 by Rattikorn Hewett

Texas Tech University 1

Data Warehouse &

OLAP Technology

2

Outline

What is a data warehouse?

A multi-dimensional data model

Data warehouse design & architecture

Data warehouse implementation

From data warehousing to data mining

3

Motivation

In business domains,

Large amount of business data

Data are in different formats and locations

(heterogeneous vs. distributed)

Decision makers need fast accesses of

summarized information (usually) on a

single site

4

Data Warehouses

Loosely speaking, a data warehouse is a data store that integrates data from multiple

heterogeneous sources to support decision making

maintained separately from operational databases

A data warehouse system provides a solid

platform of consolidated data for on-line analytical

processing (OLAP) and data analysis

Data warehousing is a process of

constructing and using data warehouses

2

5

Other Related terms

Data Warehouse: enterprise-based

Concerns with decision subjects of the whole

enterprise or organization

Data Mart: department-based

Specialized single line of business

warehouses e.g., within departments or

groups of people

6


More precisely,

A data warehouse (DW) is a

subject-oriented

integrated

time-variant, and

non-volatile (i.e., stable)

collection of data in support of management’s

decision-making process [Inmon, 96]

7

Subject-oriented

A data warehouse:

Is organized around subjects of interest

e.g., customer, product, sales.

Gives a simple and concise view of relevant issues

by excluding non-useful data to the decision support

Focuses on the modeling/analysis of data for

decision makers,

not on daily operations or transaction processing.

8

Integrated

A data warehouse integrates multiple heterogeneous data sources: relational databases, on-line transaction records

Moving (or converting) data to the warehouse requires

data cleaning and data integration techniques to ensure consistency among different data sources in

naming conventions

encoding structures, attribute measures, etc.

E.g., Hotel price: currency, tax, breakfast covered, etc.

3

9

Time-variant

Key structures In data warehouses

contain “time” information explicitly or implicitly

Not necessary in operational databases

give historical perspective (e.g., past 5-10 years)

Operational time period is shorter (e.g., current data)

10

Non-volatile

A data warehouse

is a physically separate store

of data transformed from the operational environment.

does not require transaction processing, recovery,

and concurrency control mechanisms

requires only two operations in data accessing:

initial loading of data

access of data

Non-volatile

11

Data Warehouse vs. Heterogeneous DB

Traditional heterogeneous DB integration: Query-driven

Builds wrappers/mediators on top of heterogeneous DBs

requires complex information filtering and integration processes

Data Warehouse: Update-driven information from multiple heterogeneous sources is integrated in

advance and stored for query

more efficient than heterogeneous DB

but data (stored over longer period) are not as current

Client

site MetadataDictionary

Local query Processor 1Heter.DB Site 1

Local query Processor n

mediator

query

Global answer

Heter.DB Site n

12

DW systems vs. Operational DBMS

OLTP (on-line transaction processing) Major task of operational (traditional relational) DBMS

Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.

OLTP Operations: Are structured, repetitive, short, atomic and isolated transactions

Require detailed and up-to-date data

OLAP (on-line analytical processing) Major task of data warehouse system

Serve knowledge workers: Data analysis and decision making

OLAP Operations: View data flexibly from different perspectives and abstractions

Require historical data with time elements

4

13

DW systems vs. Operational DBMS

Distinct features OLTP vs. OLAP

User/system orientation: customer vs. market

Data contents: current, detailed vs. historical, consolidated

Design: ER & vs. star &

app-oriented subj-oriented

View: current, local vs. evolutionary, integrated

Access patterns: update vs. read-only but complex queries

14

OLTP vs. OLAP

OLTP OLAP

users clerk, IT professional knowledge worker

function day to day operations decision support

DB design application-oriented subject-oriented

data current, up-to-date

detailed, flat relational

isolated

historical,

summarized, multidimensional

integrated, consolidated

usage repetitive ad-hoc

access read/write

index/hash on prim. key

lots of scans

unit of work short, simple transaction complex query

# records accessed tens millions

#users thousands hundreds

DB size 100MB-GB 100GB-TB

metric transaction throughput query throughput, response

15

Why Separate Data Warehouse?

High performance is desired in both DBMS and DW systems DBMS – tuned for OLTP: indexing, searching certain records, querying “canned” queries

DW – tuned for OLAP: complex OLAP queries, use of special organization, multidimensional view, consolidation.

OLAP necessitates special data organization that supports multidimensional views

If mixed - OLAP queries would degrade operational DBMS

Different functions, contents and uses in both systems DBMS – Transactional operations, requires concurrency controls (to

ensure data consistency)

OLAP is read only and does not need concurrency control.

Concurrency control on OLAP’s operations would jeopardize execution of concurrent transactions in DBMS

DW – Decision support, requires consolidation of data from heterogeneous sources (for aggregation etc.) and historical data

DBMS does not maintain historical data nor facilitates data consolidation16

Outline






5

17

Data Cubes

A data warehouse is based on a multi-dimensional data model which

views data in the form of a data cube

Example: A data cube with three dimensions: item (type)

time (quarter)

location (city)

with measure: Dollars_sold

Q1

Q2

Q3

Q4

TV VCR oven phone

400Dollars_sold (in thousands) of

(phone, Q1, Vancouver) = 400

Dollars_sold of (TV, Q1) = 650

Tim

e (

qu

art

er)

Item (type)

100200

30050

Dollars_sold of (Q3)

18

Lattice of Cuboids

The lattice of cuboids forms a data cube.

Each cuboid represents a different degree of summarization

The bottom most n-D base cube is called a base cuboid

The top most 0-D cuboid is called the apex cuboid - holds the

highest-level of summarization – denoted by all

(item)(location)

()

(time)

(location, item) (location, time) (item, time)

(location, item, time)

0-D (apex) cuboid

1-D cuboids

2-D cuboids

3-D (base) cuboid

19

Cuboids of 4-D data cubeall

time item location supplier

time,item

time,location

time,supplier

item,location

item,supplier

location,supplier

time,item,locationtime,item,supplier

time,location,supplier

item,location,supplier

time, item, location, supplier

0-D(apex) cuboid

1-D cuboids

2-D cuboids

3-D cuboids

4-D(base) cuboid

20

Three data model schemas

Star schema: A fact table in the middle connected to a

set of dimension tables

Snowflake schema: A refinement of star schema where

some dimensional hierarchy is normalized into additonal

smaller dimension tables, forming a shape similar to

snowflake

Fact constellation schema: Multiple fact tables share

dimension tables, viewed as a collection of stars,

therefore called galaxy schema or fact constellation

6

21

Example of Star Schema

time_key

day

day_of_the_week

month

quarter

year

Time: dimension table

location_key

street

city

state_or_province

country

Location: dimension table

Sales: Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_key

item_name

brand

type

supplier_type

Item: dimension table

branch_key

branch_name

branch_type

Branch: dimension table

22

Star Snowflake

Star introduces redundancy

E.g., (203 Pine St, Abilene, TX, USA)

(205 14th St, Abilene, TX, USA)

Snowflake normalizes this dimension table to

location_key

street

city

state_or_province

country


location_key

street

city_keycity_key

state_or_province

country

(203 Pine St, Abilene_key)

(205 14th St, Abilene_key)

(Abilene_key, TX, USA)

23

Example of Snowflake Schema

time_key

day

day_of_the_week

month

quarter

year


location_key

street

city

state_or_province

country


Sales: Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_key

item_name

brand

type

supplier_type


branch_key

branch_name

branch_type


key

-key

Supplier_keySupplier_type

city_keyState_or_provincecountry

24

dollars_sold

avg_sales

Example of Constellation Schema

time_key

day

day_of_the_week

month

quarter

year


location_key

street

city

state_or_province

country


Sales: Fact Table

time_key

item_key

branch_key

location_key

units_sold

Measures

item_key

item_name

brand

type

supplier_type


branch_key

branch_name

branch_type

Branch:dimension table

item_key

time_key

shipper_key

from_location

to_location

dollars_cost

units_shipped

shipper_key

shipper_name

location_key

shipper_type

Shipping: Fact Table

Shipper: dimension table

7

25

Which design schema to use?

Benchmarking performance can be used to select the best

schema

Snowflake vs. Star

Snowflake: easier to maintain dimension tables when

they are very large (reduce space)

Star: more effective for browsing (less joins for query)

Data Warehouse: uses “fact constellation” since it can model

multiple, interrelated subjects

Data Mart: uses “star” (more popular) or “snowflake”

26

DMQL: A Data Mining Query Language

Cube Definition (Fact Table)define cube <cube_name> [<dimension_list>]:

<measure_list>

Dimension Definition ( Dimension Table )define dimension <dimension_name> as

(<attribute_or_subdimension_list>)

Special Case (Shared Dimension Tables)

First time as “cube definition”

define dimension <dimension_name> as<dimension_name_first_time> in cube<cube_name_first_time>

27

Example of Star Schema

time_key

day

day_of_the_week

month

quarter

year


location_key

street

city

state_or_province

country


Sales: Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_key

item_name

brand

type

supplier_type


branch_key

branch_name

branch_type


28

A Star Schema in DMQL

define cube sales [time, item, branch, location]:dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),

units_sold = count(*)

define dimension time as (time_key, day, day_of_week,

month, quarter, year)

define dimension item as (item_key, item_name, brand, type,

supplier_type)

define dimension branch as (branch_key, branch_name,

branch_type)

define dimension location as (location_key, street, city,

province_or_state, country)

8

29

Example of Snowflake Schema

time_key

day

day_of_the_week

month

quarter

year


location_key

street

City_key


Sales: Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_key

item_name

brand

type

supplier_key


branch_key

branch_name

branch_type


Supplier_keySupplier_type

city_keyState_or_provincecountry

30

A Snowflake Schema in DMQL

define cube sales [time, item, branch, location]:dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),

units_sold = count(*)

define dimension time as (time_key, day, day_of_week,

month, quarter, year)

define dimension item as (item_key, item_name, brand, type,

supplier (supplier-key, supplier_type))

define dimension branch as (branch_key, branch_name,

branch_type)

define dimension location as (location_key, street, city (city-

key, province_or_state, country))

31

dollars_sold

avg_sales

Example of Constellation Schema

time_key

day

day_of_the_week

month

quarter

year


location_key

street

city

state_or_province

country


Sales: Fact Table

time_key

item_key

branch_key

location_key

units_sold

Measures

item_key

item_name

brand

type

supplier_type


branch_key

branch_name

branch_type

Branch:dimension table

item_key

time_key

shipper_key

from_location

to_location

dollars_cost

units_shipped

shipper_key

shipper_name

location_key

shipper_type

Shipping: Fact Table

Shipping: dimension table32

A fact constellation Schema in DMQLdefine cube sales [time, item, branch, location]:

dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier_type)

define dimension branch as (branch_key, branch_name, branch_type)define dimension location as (location_key, street, city, province_or_state,

country)define cube shipping [time, item, shipper, from_location, to_location]:

dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)define dimension time as time in cube salesdefine dimension item as item in cube salesdefine dimension shipper as (shipper_key, shipper_name, location as

location in cube sales, shipper_type)define dimension from_location as location in cube salesdefine dimension to_location as location in cube sales

9

33

Measures

Quantities computed by aggregating values in data cube cells in various ways:

distributive: if the result can be derived by aggregating results from partitioned data.

E.g., count(), sum(), min(), max().

algebraic: if it can be computed by an algebraic function with a finite number of arguments, each of which is obtained by applying a distributive function.

E.g., avg() = sum()/count(), standard_deviation().

holistic: if there is no constant bound on the storage size needed to describe a subaggregate.

E.g., median(), mode(), rank().

34

Concept Hierarchy

Example: Hierarchy for the dimension “location”

all

Europe North_America

MexicoCanadaSpainGermany

Vancouver

M. WindL. Chan

...

......

... ...

...

TorontoFrankfurt

all

region

office

country

city

35

Specifications of Hierarchies

Schema hierarchy:

Office < City < Country < Region

Day < {month<quarter; week} < year

Set_grouping hierarchy

(1..10] < inexpensive

Region Year

Country Quarter

City Month Week

Office Day

Dimensions: Location Time

36

Hierarchy and Warehouse View

DBMiner:

• Importing data

• Table browsing

• Dimension

creation/ browsing

• Cube building/browsing

10

37

Building cube: Sales volume as a function of product, month, and region

Time (quarter)

Locati

on

(cit

y)

sum

sumTV

VCRPC

1Qtr 2Qtr 3Qtr 4Qtr

U.S.A

Canada

Mexico

sum

Total annual sales of

TVs in USA

ALL = Total sales 38

Browsing a Data Cube

Visualization

OLAP capabilities

Interactive manipulation

39

Typical OLAP operations

Roll up (increase the level of abstraction):

Summarize data by climbing up hierarchy or by

dimension reduction

Drill down (decrease the level of abstraction):

Reverse of roll-up from higher level summary to lower

level summary or detailed data, or introducing new

dimensions

40

Examples: Roll-up & drill-down

Q1

Q2

Q3

Q4

TV VCR oven phone

400

100200

300

50

Jan

Feb

Mar

TV VCR oven phone

100

200

100

Drill-down

On time for Q1

USA

Canada

Q1

Q2

Q3

Q4

TV VCR oven phone

300350

Roll-up

on location

11

41

Typical OLAP operations (cont)

Slice and dice(project and select):

Pivot (rotate):

reorient the cube, visualization, 3D to series

of 2D planes.

Other operations

drill across (links to the raw data): involving

(across) more than one fact table

drill through: through the bottom level of the

cube to its back-end relational tables (using

SQL)42

Examples: dice, slice, pivot

Q1

Q2

Q3

Q4

TV VCR oven phone

400

100200

30050

150 250

Toronto

Vancouver

Q1

Q2

TV VCR

100200

Dice half top

left corner

Pivot

TV

VCR

oven

phone

Va

nco

uve

r

To

ron

to

Ne

w Y

ork

Ch

ica

go

100

150

250

400

Chicago

New York

Toronto

Vancouver100 150 250 400

TV VCR oven phone Slice for Q1

43

OLAP functions

OLAP offers functions to:

Calculate e.g., ratios, variance, etc.

Generate e.g., summarizations, aggregations and hierarchies at

each level and every dimension intersection

Support forecasting functions e.g., trend and stat analysis

OLAPs and SDBs (statistical Databases) are similar

SDBs focus on socioeconomic applications, and are sensitive to

privacy issues when drilling down

OLAPs are designed to handle HUGE amounts of data

OLAP’s querying is based on a starnet model –

representing abstraction levels in each dimension

44

A Star-Net Query ModelShipping Method

AIR-EXPRESS

TRUCKORDER

Customer Orders

CONTRACTS

Customer

Product

PRODUCT GROUP

PRODUCT LINE

PRODUCT ITEM

SALES PERSON

DISTRICT

DIVISION

OrganizationPromotion

CITY

COUNTRY

REGION

Location

DAILYQTRLYANNUALYTime

Each circle is called a footprint

12

45

Outline






46

A Data Warehouse Design

Four views to be considered: Top-down view: allows selection of the relevant

information necessary for the data warehouse

Data source view: exposes the information being

captured, stored, and managed by operational systems

Data warehouse view: consists of fact tables and

dimension tables

Business query view: perspectives of data in the

warehouse from the view of end-users

47

Design approaches

Top-down: Starts with overall design and planning

good for mature technology and well-defined problems

Systematic and min. integration problem

expensive and lack flexibility

Bottom-up: Starts with experiments and prototypes

Rapid returns, low cost

Flexible but hard to integrate

Combined

48

Design process

Typical steps in data warehouse design process:

Select

business process to model

(e.g., orders, invoices, etc.)

grain (atomic level of data) of the business process

(e.g., individual transactions, daily-snapshots etc.)

dimensions that will apply to each fact table record

measure that will populate each fact table record

13

49

Three data warehouse models

Enterprise warehouse collects all of the information about subjects spanning the entire

organization

Data Mart a subset of corporate-wide data that is of value to a specific

groups of users. Its scope is confined to specific, selected

groups, such as marketing data mart

Independent vs. dependent (directly from warehouse) data

mart

Virtual warehouse A set of views over operational databases

Only some of the possible summary views may be materialized

50

Data warehouse development

As in Software Engineering, two development methods:

Waterfall: do structured and systematic analysis at each

step before proceeding to the next

Spiral: rapid generation of increasingly functional systems

short turn around time

Recommended for the development of data warehouses

and data marts

51

A recommended approach

For data warehouse development

Define a high-level corporate data model

Data

Mart

Data

Mart

Distributed

Data Marts

Multi-Tier Data Warehouse

Enterprise

Data

Warehouse

Model refinementModel

refinement

52

DataWarehouse

Extract

Transform

Load

Refresh

OLAP Engine

•Analysis• Query

Reports

• Data mining

Monitor&

Integrator

Metadata

Data Sources Front-End (Client) Tools

Serve

Data Marts

Operational DBs

Externalsources

Data Warehouse Server

OLAP Server

Data Warehouse Architecture

Bottom tier Middle tier Top tier

14

53

Data sources

Data sources are

Often the operational DBMS

Designed for operational use, not for decision

support

Multiple data sources

Are often from different systems using different

hardware and customized software

Create problems e.g., semantic conflicts

54

Back-End Tools & Utilities

Data Cleaning: eliminates noise and errors. Three classes of tools:

Migration: simple data transformation

Scrubbing: by using domain-specific knowledge

Auditing: detects outliers

Data Extraction: by a program interface called gateway

Data Transformation: convert data from legacy or host format to

warehouse format

Load: sort, summarize, consolidate, compute views, check integrity

constraints, and build indices and partitions

Refresh: propagates updates from data sources to the data store When to refresh? Depends on usage and types of data

How to refresh? Data shipping (by triggers) and

Transaction shipping (by updating transaction logs)

55

The Bottom tier: warehouse server

Monitor:

Detect changes that are of interest

Propagate the change to the integrator

Integrator:

Integrate changes into the warehouse

Make the data conform to conceptual schema in the warehouse

Resolve possible update anomalies

Metadata Repository: data about data (warehouse objects)

Administrative metadata: source DBs and their contents, security

Business metadata: business terms, ownership of data

Operational metadata:history of migrated data and transformations applied

56

The Middle tier: OLAP server

Implemented by:

ROLAP (Relational OLAP) e.g., microstrategy’s DSS Server, Informix’s Metacube

Uses relational or extended-relational DBMS (often in a star schema)

Includes optimization of DBMS backend tools

Supports extended SQL (Sequence Query Language)

A cell in a multi-dimensional structure is represented by a tuple

Advantage: scalable (no empty cell for sparse cube)

Disadvantage: no direct access to cells

15

57

The Middle tier: OLAP server (cont)

MOLAP (Multidimensional OLAP) e.g., Arbor’s Essbase

Stores data in a multi-dimensional data structure (MDDS)

Uses Array-based multidimensional storage engine (sparse matrix

techniques)

Advantage: fast indexing to pre-computed aggregations.

Disadvantage: not very scalable and sparse

HOLAP (Hybrid OLAP) Scalability of ROLAP + efficiency of MOLAP

User flexibility, e.g., low level: relational, high-level: array-based

Specialized SQL servers specialized support for SQL queries over star/snowflake schemas

58

Outline






59

Efficient Cube Computation

Data cube can be viewed as a lattice of cuboids The bottom-most cuboid is the base cuboid

The top-most cuboid (apex) contains an empty cell

Suppose that at most one level of each dimension appears in a cuboid.

How many cuboids in an n-dimensional cube with L levels?

Ex: A data cube of 3 dimensions: time, location, item

The total number of cuboids = 2x2x2

Suppose now time has 3 levels of hierarchy: month < quarter < year

The total number of cuboids = 4x2x2 The cuboid contains month or

quarter or year or none

The cuboid contains time or none

)11( +Õ

==n

iiLT 10-D cube with 4 levels gives

T = 510 ~ 9.8 x 106 !!!!

60

Partial Materialization

Pre-computing and materializing all possible cuboids of a data cube is

not feasible when there are large number of them

Given a base cuboid, there are three choices

Full materialization – pre-compute every cuboid

No materialization – do not pre-compute any “nonbase” cuboid

Partial materialization – select subset of cuboids to compute

Selection of which cuboids to materialize depends on the queries in the

workload, their frequencies, and access costs etc.

E.g., OLAP product uses heuristic – “Prefer cuboids that are most

referred”

16

61

Full materialization

To ensure fast on-line processing, full materialization may

be necessary Efficient cube computation

ROLAP cube computation

Optimization ideas:

Reorder or cluster related tuples

Partial grouping of subaggregates and use them to speedup

computation of other subaggregates and aggregates

MOLAP cube computation

Multi-way array aggregation

(can’t use ROLAP’s approach since it is based on different data structure and

data access/index from ROLAP)

62

Multi-way Array Aggregation

Partition arrays into chunks (a small subcube which fits in memory).

Compress chunks to remove wasted space -- handle sparse cubes

Compute aggregates in “multiway” by visiting cube cells in the order

which minimizes the # of times to visit each cell, and reduces

memory access and storage cost.

What is the best

traversing order

to do multi-way

aggregation?

A

B

1 2 3 4

5

9

13 14 15 16

a1a0

b3

b2

b1

b0

a2 a3

C

B

63


A

B

29 30 31 32

1 2 3 4

5

9

13 14 15 16

6463626148474645

a1a0

c3c2

c1c 0

b3

b2

b1

b0

a2 a3

C

4428

5640

2452

3620

60

B

64


A

B

29 30 31 32

1 2 3 4

5

9

13 14 15 16

6463626148474645

a1a0

c3c2

c1c 0

b3

b2

b1

b0

a2 a3

C

4428

5640

2452

3620

60

B

17

65


A

B

29 30 31 32

1 2 3 4

5

9

13 14 15 16

6463626148474645

a1a0

c3c2

c1c 0

b3

b2

b1

b0

a2 a3

C

4428

5640

2452

3620

60

B

66


A

B

29 30 31 32

1 2 3 4

5

9

13 14 15 16

6463626148474645

a1a0

c3c2

c1c 0

b3

b2

b1

b0

a2 a3

C

4428

5640

2452

3620

60

B

67


Method: the planes should be sorted and computed from size small to large (see Han and Kamber, Example 2.12 pp. 75-78)

Idea: keep the smallest plane in the main memory, fetch

and compute only one chunk at a time for the largest

plane

Limitation: efficient for a small number of dimensions

If there is a large number of dimensions,

bottom-up computation [Beyer & Ramakrishnan, SIGMOD’99]

iceberg cube computation

68

Indexing OLAP data

Bitmap Indexing Index on a particular column (attribute)

Each value in the column corresponds to a bit vector: bit-op is fast

The length of the bit vector: # of records in the base table

The i-th bit is set if the i-th row of the base table has the value for the

indexed column

not suitable for high cardinality domains

Cust Region Type

C1 Asia Retail

C2 Europe Dealer

C3 Asia Dealer

C4 America Retail

C5 Europe Dealer

Base tableIndex on Region Index on Type

RecID Asia Europe America

1 1 0 0

2 0 1 0

3 1 0 0

4 0 0 1

5 0 1 0

RecID Retail Dealer

1 1 0

2 0 1

3 0 1

4 1 0

5 0 1

18

69

Indexing OLAP data

Join index Allows records to identify joinable tuples without performing join

operation (which is expensive) - used to identify subcubes of interest

E.g., in star schema data warehouse

Fact table, “sale” and dimensions: location, item

sale_k Loc_k Br_k Item_kTime_k Loc_k Str City State country Item_k Name Brand type

sale_k Loc_k sale_k Item_k sale_k Item_k Loc_k

Join indices

71

Example: Selecting cuboids

Which cuboid to process query on {brand, state} with “year = 2000” ?

& cuboid1: {item_name, city, year}

cuboid2: {brand, country, year}

cuboid3: {brand, state, year}

cuboid4: {item_name, state} where year = 2000

Eliminate cuboid2

since “country” can’t be generalized to “state” (lower level)

By generalization rule, the rest of cuboids will work

But which is better?

cuboid1: requires two generalizations cost most

if not many year values associated with brand but each brand has several

item_names couboid3 is less costly than cuboid4

Country

State Type

City Brand

Street Item_name

73

Outline






74

Data Warehouse Usage

Three kinds of data warehouse applications

Information processing

supports querying, basic statistical analysis, and reporting using

crosstabs, tables, charts and graphs

Analytical processing

multidimensional analysis of data warehouse data

supports basic OLAP operations, slice-dice, drilling, pivoting

Data mining

knowledge discovery from hidden patterns

supports associations, constructing analytical models, performing

classification and prediction, and presenting the mining results using

visualization tools.

Differences among the three tasks

19

75

From OLAP to OLAM

Why OLAM (online analytical mining)?High quality of data in data warehouses (DW)

DW contains integrated, consistent, cleaned data

Available information processing structure surrounding data warehouses ODBC, OLEDB, Web accessing, service facilities, reporting

and OLAP tools

OLAP gives a good framework for data exploration “interactive” mining with drilling, dicing, pivoting, etc.

On-line selection of data mining functions e.g., integration and swapping of multiple mining functions, algorithms, and tasks.

76

OLAM Architecture

Data Warehouse

Meta Data

MDDB

OLAM

Engine

OLAP

Engine

User GUI API

Data Cube API

Database API

Data cleaning

Data integration

Filtering&Integration Filtering

Databases

Mining query Mining result

Layer3

OLAP/OLAM

Layer2

MDDB (Multi-dimensional DB)

Layer1

Data

Repository

Layer4

User Interface

77

Discovery-Driven Exploration

Data cube exploration:

Hypothesis-driven exploration by user, huge search space, can overlook

Discovery-driven (Sarawagi, et al.’98) Effective navigation of large OLAP data cubes

pre-compute measures indicating exceptions, guide user in the data analysis, at all levels of aggregation

Exception: significantly different from the value anticipated, based on a statistical model

Visual cues such as background color are used to reflect the degree of exception of each cell

78

Example

20

79

“Exception” and its computation

Parameters

SelfExp: surprise of cell relative to other cells at same level of aggregation

InExp: surprise beneath the cell

PathExp: surprise beneath cell for each drill-down path

Computation of exception indicator (model fitting and computing SelfExp, InExp, and PathExp values) can be overlapped with cube construction

Exception themselves can be stored, indexed and retrieved like pre-computed aggregates

80

Data Mining & Warehouses

Construction of data warehouses requires data cleaning

and integration

important preprocessing in data mining

Data mining tools can interface with the OLAP engine

facilitates interactive data exploration, including

data integration, aggregation and navigation

OLAP-based mining OLAM

81

Summary

Data warehouse

A multi-dimensional model of a data warehouseRepresentation: A data cube consists of dimensions & measure

Schema: Star, snowflake, and fact constellations

OLAP operations: drilling, rolling, slicing, dicing and pivoting

OLAP servers: ROLAP, MOLAP, HOLAP

Efficient computation of data cubes Partial vs. full vs. no materialization Multi-way array aggregation Bitmap index and join index implementations

Data mining and data warehouse Discovery-driven in data warehouse From OLAP to OLAM (on-line analytical mining)

82

For more, see …

S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S.

Sarawagi. On the computation of multidimensional aggregates. VLDB’96

D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient view maintenance in data warehouses.

SIGMOD’97.

R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. ICDE’97

K. Beyer and R. Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg CUBEs..

SIGMOD’99.

S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. ACM

SIGMOD Record, 26:65-74, 1997.

OLAP council. MDAPI specification version 2.0. In http://www.olapcouncil.org/research/apily.htm,

1998.

G. Dong, J. Han, J. Lam, J. Pei, K. Wang. Mining Multi-dimensional Constrained Gradients in Data

Cubes. VLDB’2001

J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H.

Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-

totals. Data Mining and Knowledge Discovery, 1:29-54, 1997.

21

83

For more, see …

J. Han, J. Pei, G. Dong, K. Wang. Efficient Computation of Iceberg Cubes With Complex Measures.

SIGMOD’01

V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. SIGMOD’96

Microsoft. OLEDB for OLAP programmer's reference version 1.0. In

http://www.microsoft.com/data/oledb/olap, 1998.

K. Ross and D. Srivastava. Fast computation of sparse datacubes. VLDB’97.

K. A. Ross, D. Srivastava, and D. Chatziantoniou. Complex aggregation at multiple granularities.

EDBT'98.

S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes.

EDBT'98.

E. Thomsen. OLAP Solutions: Building Multidimensional Information Systems. John Wiley & Sons,

1997.

W. Wang, H. Lu, J. Feng, J. X. Yu, Condensed Cube: An Effective Approach to Reducing Data

Cube Size. ICDE’02.

Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous

multidimensional aggregates. SIGMOD’97.

data warehouse & olap technologyredwood.cs.ttu.edu/~rahewett/teach/datamining/2-data...4 13 dw...

Documents