dimensional modeling mis2502 data analytics. so we know… relational databases are good for storing...
TRANSCRIPT
Dimensional modelingMIS2502
Data Analytics
So we know…
• Relational databases are good for storing transactional data• But bad for analytical data
• What we can do is design an analytical data store based on the operational data store
• That architecture gives us the advantages of both• Relational database for operational use• Analytical database for analysis (Online Analytical Processing)
Why have a separate ADS?
• Issue 1: Performance• The structure is built to handle analysis• You keep the load off the operational data store
• Issue 2: Usability• We can structure the data in an intuitive way• You keep the load off of your IT department
Some terminology
How they all relate
We’ll start here.
The Data Cube• Core component of
Online Analytical Processing and Multidimensional Data Analysis
• Made up of “facts” and “dimensions”
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
Product
Sto
re
Tim
e
M&Ms DietCoke
DoritosFamousAmos
Ardmore, PA
TempleMain
Cherry Hill,NJ
King of Prussia, PA
Jan. 2011
Feb. 2011
Mar. 2011
Quantity sold and total price are measured facts.Why isn’t product price a measured fact?
The Data Cube
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
Product
Sto
re
Tim
e
M&Ms DietCoke
Doritos FamousAmos
Ardmore, PA
TempleMain
Cherry Hill,NJ
King of Prussia, PA
Jan. 2011
Feb. 2011
Mar. 2011
The highlighted element represents all the M&Ms sold in Ardmore, PA in
January, 2011
A single summary record representing
a business event (monthly sales).
The Data Cube
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
Product
Sto
re
Tim
e
M&Ms DietCoke
Doritos FamousAmos
Ardmore, PA
TempleMain
Cherry Hill,NJ
King of Prussia, PA
Jan. 2011
Feb. 2011
Mar. 2011
The highlighted elements represent
Famous Amos cookies sold on Temple’s Main campus from
January to March, 2011
This is called “slicing the data.”
The Data Cube
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
quantity
& total price
Product
Sto
re
Tim
e
M&Ms DietCoke
Doritos FamousAmos
Ardmore, PA
TempleMain
Cherry Hill,NJ
King of Prussia, PA
Jan. 2011
Feb. 2011
Mar. 2011
What do the orange highlighted elements
represent?
What do the orange highlighted elements
represent?
What do the blue highlighted elements
represent?
The n-dimensional cube
• Could you have a data mart with five dimensions?• If so, give an example
• Then why does our cube example (and most others you will see) only have three?
Designing the Cube: The Star Schema
• We can’t reasonably store the original data as a single table• Summarization would
be too slow• A lot of redundancy
• So it is stored as a star schema
SalesSales_ID
Product_IDStore_IDTime_ID
Quantity SoldTotal Price
ProductProduct_ID
Product_NameProduct_Price
Product_Weight
StoreStore_ID
Store_AddressStore_City
Store_StateStore_Type
TimeTime_ID
DayMonthYear
Fact
Dim
en
sion
Dim
en
sion
Dim
en
sion
A join to make the cube?
SalesID
Qty. Sold
Total Price
Prod.ID
Prod.Name
Prod.Price
Prod.Weight
StoreID
StoreAddress
StoreCity
StoreState
StoreType
TimeID
Day Month Year
1000
1001
1002
Storing the entire join would generate many,
many rows!
Product Dimension Store Dimension Time DimensionSales Fact
It adds up fast
From Star Schema to Cubes
quantity& total price
quantity& total price
quantity& total price
quantity& total price
quantity& total price
quantity& total price
quantity& total price
quantity& total price
quantity& total price
quantity& total price
quantity& total price
quantity& total price
quantity& total price
quantity& total price
quantity& total price
quantity& total price
Product
Store
M&Ms DietCoke
Doritos FamousAmos
Ardmore, PA
TempleMain
Cherry Hill,NJ
King of Prussia, PA
Jan. 2011
Feb. 2011
Mar. 2011
Demo – Foodmart• A pre-created data cube that can be read in Excel• Summaries are already created
Updating the cube• Data marts are non-volatile (i.e., they can’t be changed)• Logically: It’s a record of what has happened• Practically: Would require constant re-computation of the
cube
• The cube is refreshed periodically from the transactional database• Overnight• Daily• Weekly
Designing the Star Schema
• Kimball’s Four Step Process for Data Cube Design (Kimball et al., 2008)
• Choose the business process• Decide on the level of granularity• Identify the dimensions• Identify the fact
• From where do you get the data?
Choose the business process• What your data cube is “about”• Determined by the questions you want to answer about your organization
Question Business Process
Who is my best customer? Sales
What are my highest selling products? Sales
Which teachers have the best student performance?
Standardized testing
Which supplier is offering us the best deals?
Purchasing
Note that a “business process” is not always about business.
Decide on the level of granularity• Level of detail for each event (row in the table)• Will determine the data in the dimensions
• Example: Who is my best customer?• The “event” is a sales transaction• Choices for time: yearly, quarterly, monthly, daily• Choices for store: store, city, state
How would you select the right granularity?
Identify the dimensions• Determined by the business process
• Refined by the level of granularity
• The key elements of the process needed to answer to the question
• Example: Sales transaction• Our example schema defines a “sale” as
taking place for a particular product, in a particular store, at a particular time
• Could this data mart tell you• The best selling product?• The best customer? Try it for the “student
performance” example.
Identify the fact
• The data associated with the business event
Try it for the “student performance” example.
Data cube caveats• You have to choose your aggregations in advance
• So choose wisely!
• Consider a sales data cube with product, store, time, salesperson• If quantity_sold and total_price are the facts, you can’t figure out
the average number of people working in a store • All people might not have sold all products and therefore wouldn’t
be in the joined table
• Granularity is also an issue• Can’t track daily sales if “date” is monthly (pre-aggregated)• So why not include every single sale and do no aggregation
beforehand?