tvdm digital 5 - dimentional modeling
DESCRIPTION
Dimensional modeling and SQLTRANSCRIPT
3 GEST S492 – The Digital Firm
Example of Star Schema
time_key day day_of_the_week month quarter year
time
location_key street city province_or_street country
location
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales Measures
item_key item_name brand type supplier_type
item
branch_key branch_name branch_type
branch
4 GEST S492 – The Digital Firm
Conceptual Modeling of DWH
Modeling data warehouses: dimensions & measures • Star schema: A fact table in the middle connected to a set
of dimension tables
• Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake
• Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
5 GEST S492 – The Digital Firm
Example of Snowflake Schema
time_key day day_of_the_week month quarter year
time
location_key street city_key
location
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales
Measures
item_key item_name brand type supplier_key
item
branch_key branch_name branch_type
branch
supplier_key supplier_type
supplier
city_key city province_or_street country
city
6 GEST S492 – The Digital Firm
Example of Fact Constellation
time_key day day_of_the_week month quarter year
time
location_key street city province_or_street country
location
Sales Fact Table
time_key
item_key
branch_key
location_key
units_sold
dollars_sold
avg_sales Measures
item_key item_name brand type supplier_type
item
branch_key branch_name branch_type
branch
Shipping Fact Table
time_key
item_key
shipper_key
from_location
to_location
dollars_cost
units_shipped
shipper_key shipper_name location_key shipper_type
shipper
7 GEST S492 – The Digital Firm
Fact tables contain factual or quantitative data
Dimension tables contain descriptions about the subjects of the business
1:N relationship between dimension tables and fact tables
Dimension tables are denormalized to maximize performance
Star Schema
8 GEST S492 – The Digital Firm
Fact table provides statistics for sales broken down by product, period and store dimensions
An example
10 GEST S492 – The Digital Firm
Fact tables contain time-period data è Date dimensions are important
Modeling dates
11 GEST S492 – The Digital Firm
Issues Regarding Star Schema
• Dimension table keys must be surrogate (non-intelligent and non-business related), because: – Keys will change over time when history on dimensions is
built – Length/format consistency
• Granularity of Fact Table – what level of detail do you want? – Transactional grain – finest level – Aggregated grain – more summarized – Finer grain è better market basket analysis capability – Finer grain è more dimension tables, more rows in fact table
14 GEST S492 – The Digital Firm
How does it work? • Slice 1: Number of Trucks sold in Europe in 1997
SELECT sum(Vehicles_Sold) FROM Location L, Product P, Time T, Fact F WHERE L.Location_Key = F.Location_Key AND P.Product_Key = F.Product_Key AND T.Time_Key = F.Time_Key AND L.Continent = ‘Europe’ AND P.Vehicle_Kind = ‘Truck’ AND T.Year =1997
Joining necessary dimensions & Fact
Using connected dimensions to constraint
F.Location_Key in (1,2) and F.Product_Key in (3, 4) and T.Time_Key in (3,4)
Step 3
T.Time_Key= (3,4) P.Product_Key= (3,4)
L.Location_Key= (1,2) Step 1
L.Location_Key = F.Location_Key = (1,2) P.Product_Key = F.Product_Key= (3, 4) T.Time_Key = F.Time_Key = (3,4)
Step 2
15 GEST S492 – The Digital Firm
SQL on Facts
• Slice 1 SELECT sum(Vehicles_Sold) FROM Location L, Product P, Time T, Fact F WHERE L.Location_Key = F.Location_Key AND P.Product_Key = F.Product_Key AND T.Time_Key = F.Time_Key AND L.Continent = ‘Europe’ AND P.Vehicle_Kind = ‘Truck’ AND T.Year =1997
• Slice 2: nbre of vehicles by country, size & year SELECT L.Country, P.Vehicle_Size, T.Year, SUM(Vehicles_Sold) FROM Location L, Product P, Time T, Fact F WHERE L.Location_Key = F.Location_Key AND P.Product_Key = F.Product_Key AND T.Time_Key = F.Time_Key AND L.Country = ‘USA’ GROUP BY L.Country, P.Vehicle_Size, T.Year
• Roll-up and Drill-down are implemented by Group By and aggregate functions
Joining necessary dimensions & Fact Using connected dimensions to constraint
16 GEST S492 – The Digital Firm
SQL for OLAP operations
• « Analyzing the sales by Continents » SELECT L.Continent , SUM(F.Vehicles_Sold) FROM Location L, Product P, Time T, Fact F WHERE L.Location_Key = F.Location_Key AND P.Product_Key = F.Product_Key AND T.Time_Key = F.Time_Key
GROUP BY L.Continent • « Drill down to countries »
SELECT L.Continent, L.Country, SUM(F.Vehicles_Sold) FROM Location L, Product P, Time T, Fact F WHERE L.Location_Key = F.Location_Key AND P.Product_Key = F.Product_Key AND T.Time_Key = F.Time_Key
GROUP BY L.Continent, L.Country
17 GEST S492 – The Digital Firm
An example
• Star for Grocery stores – TVDM BI 6 Star Grocery.mdb
• OLAP in Pivot Table Excel – TVDM BI 6 Star Grocery Pivot.xlsx
19 GEST S492 – The Digital Firm
Dimensional Models
• A denormalized relational model – Made up of tables with attributes – Relationships defined by keys and foreign keys
• Organized for understandability and ease of reporting rather than update
• Queried and maintained by SQL or special purpose management tools – Because the semantic structure is known, and the surrogate
key mechanism is standard, tools can be optimized for query and indexing
20 GEST S492 – The Digital Firm
Building a DWH from a Normalized Database
The steps • Analyse the normalized entity-relationship
– identify business processes to model each separately • Translate this into a dimensional model
– Select m-n relationships containing numeric and additive facts: these will be the fact tables
– Denormalize the remaining tables into flat tables with single-part key that connects directly to fact tables: these are the dimension tables
• Translate this into the physical model. This reflects the changes necessary to reach the performance objectives
21 GEST S492 – The Digital Firm
Steps in dimensional modeling
• Select an associative entity for a fact table (table containing additive measures)
• Determine granularity • Replace operational keys with surrogate keys • Promote the keys from all hierarchies to the fact table
– Can be in attributes: Date/week/month; Product/Category/Brand – Can be in dimensions: Customer / Segment
• Add date dimension • Split all compound attributes
– “Rue Auguste Rodin 45, 1050 Bruxelles, Belgique” • Add necessary categorical attributes: Reference Tables !
– DOB: baby, young, teenager, adult, older • Fact (varies with time) / Attribute (constant)
22 GEST S492 – The Digital Firm
ORDER order_num (PK) customer_ID (FK) store_ID (FK) clerk_ID (FK) date
STORE store_ID (PK) store_name address district floor_type
CLERK clerk_id (PK) clerk_name clerk_grade
PRODUCT SKU (PK) description brand category
CUSTOMER customer_ID (PK) customer_name purchase_profile credit_profile address
PROMOTION promotion_NUM (PK) promotion_name price_type ad_type
ORDER-LINE order_num (PK) (FK) SKU (PK) (FK) promotion_key (FK) dollars_sold units_sold dollars_cost
ERD
SKU = Stock Keeping Unit
n
n
n
1
1
1
1
1
1
n
n
n
23 GEST S492 – The Digital Firm
ORDER order_num (PK) customer_ID (FK) store_ID (FK) clerk_ID (FK) date
STORE store_ID (PK) store_name address district floor_type
CLERK clerk_id (PK) clerk_name clerk_grade
PRODUCT SKU (PK) description brand category
CUSTOMER customer_ID (PK) customer_name purchase_profile credit_profile address
PROMOTION promotion_NUM (PK) promotion_name price_type ad_type
ORDER-LINE order_num (PK) (FK) SKU (PK) (FK) promotion_key (FK) dollars_sold units_sold dollars_cost
ERD
SKU = Stock Keeping Unit
Measures Dimensions
Where are the measures? Where are the dimensions?
n
n
n
1
1
1
1 n
1 n
1 n
This date will be used to link the time dimension to the fact What if this date is missing in OLT?
24 GEST S492 – The Digital Firm
TIME time_key (PK) SQL_date day_of_week month
STORE store_key (PK) store_ID store_name address district floor_type
CLERK clerk_key (PK) clerk_id clerk_name clerk_grade
PRODUCT product_key (PK) SKU description brand category
CUSTOMER customer_key (PK) customer_name purchase_profile credit_profile address
PROMOTION promotion_key (PK) promotion_name price_type ad_type
FACT time_key (FK) store_key (FK) clerk_key (FK) product_key (FK) customer_key (FK) promotion_key (FK) dollars_sold units_sold dollars_cost
DIMENSONAL MODEL
25 GEST S492 – The Digital Firm
The 5 Step Design Process
• Choose the Data Mart • Choose the Grain • Choose the Dimensions • Choose the Facts • Define the Measures
26 GEST S492 – The Digital Firm
Choosing the Mart
• A specific set of business questions (belongs to a BL) – Finance & control, marketing, sales, procurement, etc.
• A set of related fact and dimension tables – Costs, sales, number of contacts – Clients, Products, Campaign, Channel, etc.
• Single source or multiple source – ERP, legacy, Excel, etc.
• Conformed dimensions – Dimensions shared by all facts – allow to join them
• Typically have a fact table for each process – Campaign selection, Campaign tracking, etc.
27 GEST S492 – The Digital Firm
Grain (unit of analysis)
The grain determines the level of detail of each fact record • For example
– Individual transactions: sales transaction, product shipment – Snapshots (points in time): sales of the day
• Generally it is better to focus on the smallest grain – Moving from low grain to more aggregate is possible, but
inverse is not • Retailer: sales ticket = transaction, Sales per item per client per week
will, for ever, make any basket analysis impossible
– Levels can be used for going from high granularity to lowest granularity
– There is a price to pay: the finest grain will create the largest fact tables, and hence aggregates will have to be computed
Question: how to do this?
28 GEST S492 – The Digital Firm
Dimensions
A table (or hierarchy of tables) connected with the fact table with keys and foreign keys
• Dimension tables contain text or numeric attributes that gives the context of the measures available in a Fact Table – Time, client, status, shop, product, employee, etc.
• Preferably in a 1-M relationship with Facts – 1 context can have many measures: 1 client having many sales – If not a bridge table must be created
• Dimension’s attributes will be the source of query constraints – Select sum(Sales) from … where Client.ZIP = 1150 – Customer (region, status); Product (Type, Price_range); Time (Mondays, WE, 2010)
• Connected with surrogate (generated) keys, not operational keys – Surrogate key will be used to store the history of the dimensions. They
are the central point of the whole dimensional architecture
29 GEST S492 – The Digital Firm
Fact Tables
Represent a process or reporting environment interesting for some business users
• Contains many foreign keys connecting to Dimensions • It is important to determine the identity of the fact table and
specify exactly what it represents – Sales in shops, direct-mail campaigns, shipment, production of item x
(family)
• Typically correspond to an associative entity in the ER model
• Tend to have huge numbers of records – The max number of records is the combination of all records of each
attribute (Cartesian product)
• Useful facts tend to be numeric and additive – Quantity ordered, Price paid, Number of responders
30 GEST S492 – The Digital Firm
Measures
Measurements associated with fact records at the fact table granularity
• Normally numeric and additive – Sales, number of contacts, etc.
• Semi-additive (“stock” or “level” measures) – Measures that are non additive on time but additive along
other dimensions: Inventory, balance • Non-additive (“Value-per-unit” measures)
– Measure that are non additive at all: exchange rate Attributes in dimension tables are constants. Facts attributes vary with the granularity of the fact table: when aggregations are done, they should smoothly aggregate as well.
31 GEST S492 – The Digital Firm
Dimensions with m:n Fact relation
• Use a Bridge Table • Problem for additive measures – will be counted N times
– Add a weighting factor to correct fact addition – Define a business rule to identify main record (winner-takes-all)
Bridge tables introduce complexity in the query process to take care of additive facts à user impact
Credit FACT - Amount - Rate
Client
Day Campaign
Group_ Campaign - Weight - Main
Campaign Group_CampaignCpg_key Desc Launch_date GRPCpg_keyCpg_key Main
1 hello world 1/01/2009 1 1 12 Contest 15/01/2009 1 3 03 Credit 17/01/2009 1 5 04 loan it 1/02/2009 2 1 05 Mbre2Mbre 19/02/2009 2 3 16 Picnic 30/02/2009 2 7 07 Card 1/03/2009 2 8 08 Safe it 12/03/2009
n (one Cpg can be n times in Group_Cpg table)
1 (one Cpg can befound once in Cpg table)
The choice of the weight and groups is done at the loading of the fact
32 GEST S492 – The Digital Firm
Good Dimension Attributes
• Verbose – Gives a rich context to the measure
• Descriptive – User crystal clear naming and semantics
• Complete – All modalities must be represented – ‘ooops “Risk A” is not loaded’ – All attributes within a dimension must describe completely the object
represented by the dimension • Quality assured
– No trash value like 99 for age…
33 GEST S492 – The Digital Firm
Good Dimension Attributes
• Indexed (b-tree vs bitmap) – Fast for SQL ‘where’ clauses
• Equally available – Attribute must be available for most of the objects it represents – ‘age is
available for only 1% of our clients’ • Documented
– From a business point: semantic and point in time (‘risk profile attribute of clients are updated every night)
– From a technical point: Risk profile is loaded from Table X with process Y, etc.
35 GEST S492 – The Digital Firm
Hierarchies
• Hierarchies are the key relationship for roll-up and drill-down in OLAP environment – Sales per Day à Week à Month à Quarter à … – Sales for cheese à milk_productà food à …
• Hierarchies can be represented in 2 ways in dimensions – Explicit Attribute sets:
• Dim.Time: date; DayofWeek; Week; Month; Quarter; … • Dim.product: Product (Cheese); Family (Milk_product); Category
(Food)
– Snowflake: • Create a table per level and link tables by 1:N relationships
36 GEST S492 – The Digital Firm
Snowflaking & Attribute Hierarchies
• Efficiency vs Space – Snowflaking normalizes the attributes
within a hierarchy and hence reduces redundancy in a dimension table
• For each combination of attributes involved in the hierarchy, a dimension will have one additional record
• Understandability – The user must join tables to get the
right level of aggregation
Product.level1 Product.level2 Product.level3cheese milk FoodYougourt milk Foodmilk milk Food
Product.level3Food
Product.level2milk
Product.level1cheeseYougourtmilk
38 GEST S492 – The Digital Firm
Slowly Changing Dimensions
• Type 1: Store only the current value – No history is created: all facts are like the value was always
like the last one
Select week.year, Client.zip, Sum(Sales.sale) as Sales From Sales, Client, Week Where Sales.client_key=Client.key and Sales.week_key= week.key and week.year=2010 Group by week.year, Client.zip
Year Zip Sales2010 1950 18 €
Client Sales weekKey Name zip Date_in Date_close client_key week_key Sale Key Week Month Year1 Kasparov 1150 1/01/2010 1 1 10 € 1 1 1 2010
2 2 1 20103 3 1 2010
Type 1Client Sales weekKey Name zip Date_in Date_close client_key week_key Sale Key Week Month Year1 Kasparov 1950 1/01/2010 1 1 10 € 1 1 1 2010
1 2 8 € 2 2 1 20103 3 1 2010
39 GEST S492 – The Digital Firm
Slowly Changing Dimensions
• Type 2: Create a new dimension record for each new value of the attribute – This will create a new surrogate key to be used as from this moment in all
new records of the Fact table, creating the historical context of the measures
Select week.year, Client.zip, Sum(Sales.sale) as Sales From Sales, Client, Week Where Sales.client_key=Client.key and Sales.week_key= week.key and week.year=2010 Group by week.year, Client.zip
Year Zip Sales2010 1150 10 € 2010 1950 8 €
Client Sales weekKey Name zip Date_in Date_close client_key week_key Sale Key Week Month Year1 Kasparov 1150 1/01/2010 1 1 10 € 1 1 1 2010
2 2 1 20103 3 1 2010
Type 2Client Sales weekKey Name zip Date_in Date_close client_key week_key Sale Key Week Month Year1 Kasparov 1150 1/01/2010 7/01/2010 1 1 10 € 1 1 1 20102 Kasparov 1950 8/01/2010 2 2 8 € 2 2 1 2010
3 3 1 2010
40 GEST S492 – The Digital Firm
Slowly Changing Dimensions
• Type 3: Create an attribute in the dimension record for previous value – This will not build a history at the level of the facts, but only in dimension
Select week.year, Client.zip, Sum(Sales.sale) as Sales From Sales, Client, Week Where Sales.client_key=Client.key and Sales.week_key= week.key and week.year=2010 Group by week.year, Client.zip
Year Zip Sales2010 1950 18 €
Type 3Client Sales week
Key Name zip zip_prev Date_in Date_close client_key week_key Sale Key Week Month Year1 Kasparov 1950 1150 1/01/2010 1 1 10 € 1 1 1 2010
1 2 8 € 2 2 1 20103 3 1 2010
Client Sales weekKey Name zip Date_in Date_close client_key week_key Sale Key Week Month Year1 Kasparov 1150 1/01/2010 1 1 10 € 1 1 1 2010
2 2 1 20103 3 1 2010
42 GEST S492 – The Digital Firm
Strengths of Dimensional Modeling
• Predictable and understandable, standard framework – The semantic of table links is not hidden from the user as it is the case for
relational models – Users can understand it easily
• Respond well to changes in user reporting needs – Most of the reports will be created through aggregation of low level facts. Hence
any reports – given that the model uses conformed dimensions – can be build on top
• Relatively easy to add data without reloading tables – If a new measure must be added to a fact (‘credit line level’ on an account), given
that the historical data is available, adding this measure in existing fact will be easy and reveal all historical context as well
– Adding new attributes to an existing dimension might be more difficult if the added attribute history must be synchronized with the dimension history – surrogate key conflict –
• Adding an attribute adds a new granularity and hence some records must be split (product family is the actual level and we want to have product shelf location…)
43 GEST S492 – The Digital Firm
Strengths of Dimensional Modeling
• Standard design approaches have been developed – Allowing to quickly designing facts and dimensions, even in complex
environments. The framework might be understood by all participants – including the final users through the conceptual star description
• There exist a number of products supporting the dimensional model – Because the semantic of the dimensional modeling is fixed, software can
leverage this knowledge and propose tools that provide automatic mechanism for Type1, Type2 and Type3 updates, that accept explicit description of hierarchies, etc. OLAP functionalities creation can be fully automated from a constellation of fact tables. The programming and BDA work is largely reduced.
• Multidimensional models are more efficient for queries & Analysis – Fact tables have many records (millions), and dimensions will normally
be relatively small (product = 10.000, Shops = 500). Joining small tables on large ones can be efficient with the right indexes.
44 GEST S492 – The Digital Firm
Weaknesses of Dimensional Modeling
• Some dimensions might be too large – Client with address à 10% change/year – Large dimensions must be cut
• A relation between two dimensional objects only exist through their link in a fact – A client with no sales has no geo info attached
• normalized and indexed relational models more flexible – Granularity choice can be avoided – Relationships among objects do not depend on some events
stored in facts – The cost is: lose all advantages of DM!
In practice, we often use a combination of both at low level (high granularity) and build departmental and personal data marts using dimensional models only.
ClientKeyNameStatus
Sales streetn°ZIP
RegionCountry
Type_suburb
ClientKeyNameStatus
SalesGeo
Keystreetn°ZIP
RegionCountry
Type_suburb
45 GEST S492 – The Digital Firm
Weaknesses of Dimensional Modeling
In practice, we often use a combination of both at low level (high granularity) and build departmental and personal data marts using dimensional models only.
A link will be created between Client and Geo as soon as client record is added in the dimension. The client record will eventuelly closed even before it is used in the fact.
ClientKeyNameStatus
SalesGeo
Keystreetn°ZIP
RegionCountry
Type_suburb
47 GEST S492 – The Digital Firm
Campaign Datamart
Factless fact table Is a table that creates a relationship between objects. No measure is stored.