dimensional modeling mis2502 data analytics. so we know… relational databases are good for storing...

22
Dimensional modeling MIS2502 Data Analytics

Upload: anissa-benson

Post on 13-Jan-2016

215 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: DIMENSIONAL MODELING MIS2502 Data Analytics. So we know… Relational databases are good for storing transactional data But bad for analytical data What

Dimensional modelingMIS2502

Data Analytics

Page 2: DIMENSIONAL MODELING MIS2502 Data Analytics. So we know… Relational databases are good for storing transactional data But bad for analytical data What

So we know…

• Relational databases are good for storing transactional data• But bad for analytical data

• What we can do is design an analytical data store based on the operational data store

• That architecture gives us the advantages of both• Relational database for operational use• Analytical database for analysis (Online Analytical Processing)

Page 3: DIMENSIONAL MODELING MIS2502 Data Analytics. So we know… Relational databases are good for storing transactional data But bad for analytical data What

Why have a separate ADS?

• Issue 1: Performance• The structure is built to handle analysis• You keep the load off the operational data store

• Issue 2: Usability• We can structure the data in an intuitive way• You keep the load off of your IT department

Page 4: DIMENSIONAL MODELING MIS2502 Data Analytics. So we know… Relational databases are good for storing transactional data But bad for analytical data What

Some terminology

Page 5: DIMENSIONAL MODELING MIS2502 Data Analytics. So we know… Relational databases are good for storing transactional data But bad for analytical data What

How they all relate

We’ll start here.

Page 6: DIMENSIONAL MODELING MIS2502 Data Analytics. So we know… Relational databases are good for storing transactional data But bad for analytical data What

The Data Cube• Core component of

Online Analytical Processing and Multidimensional Data Analysis

• Made up of “facts” and “dimensions”

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

Product

Sto

re

Tim

e

M&Ms DietCoke

DoritosFamousAmos

Ardmore, PA

TempleMain

Cherry Hill,NJ

King of Prussia, PA

Jan. 2011

Feb. 2011

Mar. 2011

Quantity sold and total price are measured facts.Why isn’t product price a measured fact?

Page 7: DIMENSIONAL MODELING MIS2502 Data Analytics. So we know… Relational databases are good for storing transactional data But bad for analytical data What

The Data Cube

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

Product

Sto

re

Tim

e

M&Ms DietCoke

Doritos FamousAmos

Ardmore, PA

TempleMain

Cherry Hill,NJ

King of Prussia, PA

Jan. 2011

Feb. 2011

Mar. 2011

The highlighted element represents all the M&Ms sold in Ardmore, PA in

January, 2011

A single summary record representing

a business event (monthly sales).

Page 8: DIMENSIONAL MODELING MIS2502 Data Analytics. So we know… Relational databases are good for storing transactional data But bad for analytical data What

The Data Cube

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

Product

Sto

re

Tim

e

M&Ms DietCoke

Doritos FamousAmos

Ardmore, PA

TempleMain

Cherry Hill,NJ

King of Prussia, PA

Jan. 2011

Feb. 2011

Mar. 2011

The highlighted elements represent

Famous Amos cookies sold on Temple’s Main campus from

January to March, 2011

This is called “slicing the data.”

Page 9: DIMENSIONAL MODELING MIS2502 Data Analytics. So we know… Relational databases are good for storing transactional data But bad for analytical data What

The Data Cube

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

quantity

& total price

Product

Sto

re

Tim

e

M&Ms DietCoke

Doritos FamousAmos

Ardmore, PA

TempleMain

Cherry Hill,NJ

King of Prussia, PA

Jan. 2011

Feb. 2011

Mar. 2011

What do the orange highlighted elements

represent?

What do the orange highlighted elements

represent?

What do the blue highlighted elements

represent?

Page 10: DIMENSIONAL MODELING MIS2502 Data Analytics. So we know… Relational databases are good for storing transactional data But bad for analytical data What

The n-dimensional cube

• Could you have a data mart with five dimensions?• If so, give an example

• Then why does our cube example (and most others you will see) only have three?

Page 11: DIMENSIONAL MODELING MIS2502 Data Analytics. So we know… Relational databases are good for storing transactional data But bad for analytical data What

Designing the Cube: The Star Schema

• We can’t reasonably store the original data as a single table• Summarization would

be too slow• A lot of redundancy

• So it is stored as a star schema

SalesSales_ID

Product_IDStore_IDTime_ID

Quantity SoldTotal Price

ProductProduct_ID

Product_NameProduct_Price

Product_Weight

StoreStore_ID

Store_AddressStore_City

Store_StateStore_Type

TimeTime_ID

DayMonthYear

Fact

Dim

en

sion

Dim

en

sion

Dim

en

sion

Page 12: DIMENSIONAL MODELING MIS2502 Data Analytics. So we know… Relational databases are good for storing transactional data But bad for analytical data What

A join to make the cube?

SalesID

Qty. Sold

Total Price

Prod.ID

Prod.Name

Prod.Price

Prod.Weight

StoreID

StoreAddress

StoreCity

StoreState

StoreType

TimeID

Day Month Year

1000

1001

1002

Storing the entire join would generate many,

many rows!

Product Dimension Store Dimension Time DimensionSales Fact

Page 13: DIMENSIONAL MODELING MIS2502 Data Analytics. So we know… Relational databases are good for storing transactional data But bad for analytical data What

It adds up fast

Page 14: DIMENSIONAL MODELING MIS2502 Data Analytics. So we know… Relational databases are good for storing transactional data But bad for analytical data What

From Star Schema to Cubes

quantity& total price

quantity& total price

quantity& total price

quantity& total price

quantity& total price

quantity& total price

quantity& total price

quantity& total price

quantity& total price

quantity& total price

quantity& total price

quantity& total price

quantity& total price

quantity& total price

quantity& total price

quantity& total price

Product

Store

M&Ms DietCoke

Doritos FamousAmos

Ardmore, PA

TempleMain

Cherry Hill,NJ

King of Prussia, PA

Jan. 2011

Feb. 2011

Mar. 2011

Page 15: DIMENSIONAL MODELING MIS2502 Data Analytics. So we know… Relational databases are good for storing transactional data But bad for analytical data What

Demo – Foodmart• A pre-created data cube that can be read in Excel• Summaries are already created

Page 16: DIMENSIONAL MODELING MIS2502 Data Analytics. So we know… Relational databases are good for storing transactional data But bad for analytical data What

Updating the cube• Data marts are non-volatile (i.e., they can’t be changed)• Logically: It’s a record of what has happened• Practically: Would require constant re-computation of the

cube

• The cube is refreshed periodically from the transactional database• Overnight• Daily• Weekly

Page 17: DIMENSIONAL MODELING MIS2502 Data Analytics. So we know… Relational databases are good for storing transactional data But bad for analytical data What

Designing the Star Schema

• Kimball’s Four Step Process for Data Cube Design (Kimball et al., 2008)

• Choose the business process• Decide on the level of granularity• Identify the dimensions• Identify the fact

• From where do you get the data?

Page 18: DIMENSIONAL MODELING MIS2502 Data Analytics. So we know… Relational databases are good for storing transactional data But bad for analytical data What

Choose the business process• What your data cube is “about”• Determined by the questions you want to answer about your organization

Question Business Process

Who is my best customer? Sales

What are my highest selling products? Sales

Which teachers have the best student performance?

Standardized testing

Which supplier is offering us the best deals?

Purchasing

Note that a “business process” is not always about business.

Page 19: DIMENSIONAL MODELING MIS2502 Data Analytics. So we know… Relational databases are good for storing transactional data But bad for analytical data What

Decide on the level of granularity• Level of detail for each event (row in the table)• Will determine the data in the dimensions

• Example: Who is my best customer?• The “event” is a sales transaction• Choices for time: yearly, quarterly, monthly, daily• Choices for store: store, city, state

How would you select the right granularity?

Page 20: DIMENSIONAL MODELING MIS2502 Data Analytics. So we know… Relational databases are good for storing transactional data But bad for analytical data What

Identify the dimensions• Determined by the business process

• Refined by the level of granularity

• The key elements of the process needed to answer to the question

• Example: Sales transaction• Our example schema defines a “sale” as

taking place for a particular product, in a particular store, at a particular time

• Could this data mart tell you• The best selling product?• The best customer? Try it for the “student

performance” example.

Page 21: DIMENSIONAL MODELING MIS2502 Data Analytics. So we know… Relational databases are good for storing transactional data But bad for analytical data What

Identify the fact

• The data associated with the business event

Try it for the “student performance” example.

Page 22: DIMENSIONAL MODELING MIS2502 Data Analytics. So we know… Relational databases are good for storing transactional data But bad for analytical data What

Data cube caveats• You have to choose your aggregations in advance

• So choose wisely!

• Consider a sales data cube with product, store, time, salesperson• If quantity_sold and total_price are the facts, you can’t figure out

the average number of people working in a store • All people might not have sold all products and therefore wouldn’t

be in the joined table

• Granularity is also an issue• Can’t track daily sales if “date” is monthly (pre-aggregated)• So why not include every single sale and do no aggregation

beforehand?