tvdm digital 5 - dimentional modeling

48
1 GEST S492 – The Digital Firm The Digital Firm Data-Driven Decision-making Dimensional Modeling

Upload: john

Post on 11-Dec-2015

11 views

Category:

Documents


2 download

DESCRIPTION

Dimensional modeling and SQL

TRANSCRIPT

1 GEST S492 – The Digital Firm

The Digital Firm Data-Driven Decision-making

Dimensional Modeling

2 GEST S492 – The Digital Firm

DIMENSIONAL MODELING

Data Organization in DWH: Star Schema

3 GEST S492 – The Digital Firm

Example of Star Schema

time_key day day_of_the_week month quarter year

time

location_key street city province_or_street country

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales Measures

item_key item_name brand type supplier_type

item

branch_key branch_name branch_type

branch

4 GEST S492 – The Digital Firm

Conceptual Modeling of DWH

Modeling data warehouses: dimensions & measures •  Star schema: A fact table in the middle connected to a set

of dimension tables

•  Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake

•  Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

5 GEST S492 – The Digital Firm

Example of Snowflake Schema

time_key day day_of_the_week month quarter year

time

location_key street city_key

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_key item_name brand type supplier_key

item

branch_key branch_name branch_type

branch

supplier_key supplier_type

supplier

city_key city province_or_street country

city

6 GEST S492 – The Digital Firm

Example of Fact Constellation

time_key day day_of_the_week month quarter year

time

location_key street city province_or_street country

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales Measures

item_key item_name brand type supplier_type

item

branch_key branch_name branch_type

branch

Shipping Fact Table

time_key

item_key

shipper_key

from_location

to_location

dollars_cost

units_shipped

shipper_key shipper_name location_key shipper_type

shipper

7 GEST S492 – The Digital Firm

Fact tables contain factual or quantitative data

Dimension tables contain descriptions about the subjects of the business

1:N relationship between dimension tables and fact tables

Dimension tables are denormalized to maximize performance

Star Schema

8 GEST S492 – The Digital Firm

Fact table provides statistics for sales broken down by product, period and store dimensions

An example

9 GEST S492 – The Digital Firm

An example with data

10 GEST S492 – The Digital Firm

Fact tables contain time-period data è Date dimensions are important

Modeling dates

11 GEST S492 – The Digital Firm

Issues Regarding Star Schema

•  Dimension table keys must be surrogate (non-intelligent and non-business related), because: –  Keys will change over time when history on dimensions is

built –  Length/format consistency

•  Granularity of Fact Table – what level of detail do you want? –  Transactional grain – finest level –  Aggregated grain – more summarized –  Finer grain è better market basket analysis capability –  Finer grain è more dimension tables, more rows in fact table

12 GEST S492 – The Digital Firm

DIMENSIONAL MODELING

Star Schema: SQL for OLAP implementation

13 GEST S492 – The Digital Firm

STAR for Vehicule sales

14 GEST S492 – The Digital Firm

How does it work? •  Slice 1: Number of Trucks sold in Europe in 1997

SELECT sum(Vehicles_Sold) FROM Location L, Product P, Time T, Fact F WHERE L.Location_Key = F.Location_Key AND P.Product_Key = F.Product_Key AND T.Time_Key = F.Time_Key AND L.Continent = ‘Europe’ AND P.Vehicle_Kind = ‘Truck’ AND T.Year =1997

Joining necessary dimensions & Fact

Using connected dimensions to constraint

F.Location_Key in (1,2) and F.Product_Key in (3, 4) and T.Time_Key in (3,4)

Step 3

T.Time_Key= (3,4) P.Product_Key= (3,4)

L.Location_Key= (1,2) Step 1

L.Location_Key = F.Location_Key = (1,2) P.Product_Key = F.Product_Key= (3, 4) T.Time_Key = F.Time_Key = (3,4)

Step 2

15 GEST S492 – The Digital Firm

SQL on Facts

•  Slice 1 SELECT sum(Vehicles_Sold) FROM Location L, Product P, Time T, Fact F WHERE L.Location_Key = F.Location_Key AND P.Product_Key = F.Product_Key AND T.Time_Key = F.Time_Key AND L.Continent = ‘Europe’ AND P.Vehicle_Kind = ‘Truck’ AND T.Year =1997

•  Slice 2: nbre of vehicles by country, size & year SELECT L.Country, P.Vehicle_Size, T.Year, SUM(Vehicles_Sold) FROM Location L, Product P, Time T, Fact F WHERE L.Location_Key = F.Location_Key AND P.Product_Key = F.Product_Key AND T.Time_Key = F.Time_Key AND L.Country = ‘USA’ GROUP BY L.Country, P.Vehicle_Size, T.Year

•  Roll-up and Drill-down are implemented by Group By and aggregate functions

Joining necessary dimensions & Fact Using connected dimensions to constraint

16 GEST S492 – The Digital Firm

SQL for OLAP operations

•  « Analyzing the sales by Continents » SELECT L.Continent , SUM(F.Vehicles_Sold) FROM Location L, Product P, Time T, Fact F WHERE L.Location_Key = F.Location_Key AND P.Product_Key = F.Product_Key AND T.Time_Key = F.Time_Key

GROUP BY L.Continent •  « Drill down to countries »

SELECT L.Continent, L.Country, SUM(F.Vehicles_Sold) FROM Location L, Product P, Time T, Fact F WHERE L.Location_Key = F.Location_Key AND P.Product_Key = F.Product_Key AND T.Time_Key = F.Time_Key

GROUP BY L.Continent, L.Country

17 GEST S492 – The Digital Firm

An example

•  Star for Grocery stores –  TVDM BI 6 Star Grocery.mdb

•  OLAP in Pivot Table Excel –  TVDM BI 6 Star Grocery Pivot.xlsx

18 GEST S492 – The Digital Firm

DIMENSIONAL MODELING

Building Star Schema

19 GEST S492 – The Digital Firm

Dimensional Models

•  A denormalized relational model –  Made up of tables with attributes –  Relationships defined by keys and foreign keys

•  Organized for understandability and ease of reporting rather than update

•  Queried and maintained by SQL or special purpose management tools –  Because the semantic structure is known, and the surrogate

key mechanism is standard, tools can be optimized for query and indexing

20 GEST S492 – The Digital Firm

Building a DWH from a Normalized Database

The steps •  Analyse the normalized entity-relationship

–  identify business processes to model each separately •  Translate this into a dimensional model

–  Select m-n relationships containing numeric and additive facts: these will be the fact tables

–  Denormalize the remaining tables into flat tables with single-part key that connects directly to fact tables: these are the dimension tables

•  Translate this into the physical model. This reflects the changes necessary to reach the performance objectives

21 GEST S492 – The Digital Firm

Steps in dimensional modeling

•  Select an associative entity for a fact table (table containing additive measures)

•  Determine granularity •  Replace operational keys with surrogate keys •  Promote the keys from all hierarchies to the fact table

–  Can be in attributes: Date/week/month; Product/Category/Brand –  Can be in dimensions: Customer / Segment

•  Add date dimension •  Split all compound attributes

–  “Rue Auguste Rodin 45, 1050 Bruxelles, Belgique” •  Add necessary categorical attributes: Reference Tables !

–  DOB: baby, young, teenager, adult, older •  Fact (varies with time) / Attribute (constant)

22 GEST S492 – The Digital Firm

ORDER order_num (PK) customer_ID (FK) store_ID (FK) clerk_ID (FK) date

STORE store_ID (PK) store_name address district floor_type

CLERK clerk_id (PK) clerk_name clerk_grade

PRODUCT SKU (PK) description brand category

CUSTOMER customer_ID (PK) customer_name purchase_profile credit_profile address

PROMOTION promotion_NUM (PK) promotion_name price_type ad_type

ORDER-LINE order_num (PK) (FK) SKU (PK) (FK) promotion_key (FK) dollars_sold units_sold dollars_cost

ERD

SKU = Stock Keeping Unit

n

n

n

1

1

1

1

1

1

n

n

n

23 GEST S492 – The Digital Firm

ORDER order_num (PK) customer_ID (FK) store_ID (FK) clerk_ID (FK) date

STORE store_ID (PK) store_name address district floor_type

CLERK clerk_id (PK) clerk_name clerk_grade

PRODUCT SKU (PK) description brand category

CUSTOMER customer_ID (PK) customer_name purchase_profile credit_profile address

PROMOTION promotion_NUM (PK) promotion_name price_type ad_type

ORDER-LINE order_num (PK) (FK) SKU (PK) (FK) promotion_key (FK) dollars_sold units_sold dollars_cost

ERD

SKU = Stock Keeping Unit

Measures Dimensions

Where are the measures? Where are the dimensions?

n

n

n

1

1

1

1 n

1 n

1 n

This date will be used to link the time dimension to the fact What if this date is missing in OLT?

24 GEST S492 – The Digital Firm

TIME time_key (PK) SQL_date day_of_week month

STORE store_key (PK) store_ID store_name address district floor_type

CLERK clerk_key (PK) clerk_id clerk_name clerk_grade

PRODUCT product_key (PK) SKU description brand category

CUSTOMER customer_key (PK) customer_name purchase_profile credit_profile address

PROMOTION promotion_key (PK) promotion_name price_type ad_type

FACT time_key (FK) store_key (FK) clerk_key (FK) product_key (FK) customer_key (FK) promotion_key (FK) dollars_sold units_sold dollars_cost

DIMENSONAL MODEL

25 GEST S492 – The Digital Firm

The 5 Step Design Process

•  Choose the Data Mart •  Choose the Grain •  Choose the Dimensions •  Choose the Facts •  Define the Measures

26 GEST S492 – The Digital Firm

Choosing the Mart

•  A specific set of business questions (belongs to a BL) –  Finance & control, marketing, sales, procurement, etc.

•  A set of related fact and dimension tables –  Costs, sales, number of contacts –  Clients, Products, Campaign, Channel, etc.

•  Single source or multiple source –  ERP, legacy, Excel, etc.

•  Conformed dimensions –  Dimensions shared by all facts – allow to join them

•  Typically have a fact table for each process –  Campaign selection, Campaign tracking, etc.

27 GEST S492 – The Digital Firm

Grain (unit of analysis)

The grain determines the level of detail of each fact record •  For example

–  Individual transactions: sales transaction, product shipment –  Snapshots (points in time): sales of the day

•  Generally it is better to focus on the smallest grain –  Moving from low grain to more aggregate is possible, but

inverse is not •  Retailer: sales ticket = transaction, Sales per item per client per week

will, for ever, make any basket analysis impossible

–  Levels can be used for going from high granularity to lowest granularity

–  There is a price to pay: the finest grain will create the largest fact tables, and hence aggregates will have to be computed

Question: how to do this?

28 GEST S492 – The Digital Firm

Dimensions

A table (or hierarchy of tables) connected with the fact table with keys and foreign keys

•  Dimension tables contain text or numeric attributes that gives the context of the measures available in a Fact Table –  Time, client, status, shop, product, employee, etc.

•  Preferably in a 1-M relationship with Facts –  1 context can have many measures: 1 client having many sales –  If not a bridge table must be created

•  Dimension’s attributes will be the source of query constraints –  Select sum(Sales) from … where Client.ZIP = 1150 –  Customer (region, status); Product (Type, Price_range); Time (Mondays, WE, 2010)

•  Connected with surrogate (generated) keys, not operational keys –  Surrogate key will be used to store the history of the dimensions. They

are the central point of the whole dimensional architecture

29 GEST S492 – The Digital Firm

Fact Tables

Represent a process or reporting environment interesting for some business users

•  Contains many foreign keys connecting to Dimensions •  It is important to determine the identity of the fact table and

specify exactly what it represents –  Sales in shops, direct-mail campaigns, shipment, production of item x

(family)

•  Typically correspond to an associative entity in the ER model

•  Tend to have huge numbers of records –  The max number of records is the combination of all records of each

attribute (Cartesian product)

•  Useful facts tend to be numeric and additive –  Quantity ordered, Price paid, Number of responders

30 GEST S492 – The Digital Firm

Measures

Measurements associated with fact records at the fact table granularity

•  Normally numeric and additive –  Sales, number of contacts, etc.

•  Semi-additive (“stock” or “level” measures) –  Measures that are non additive on time but additive along

other dimensions: Inventory, balance •  Non-additive (“Value-per-unit” measures)

–  Measure that are non additive at all: exchange rate Attributes in dimension tables are constants. Facts attributes vary with the granularity of the fact table: when aggregations are done, they should smoothly aggregate as well.

31 GEST S492 – The Digital Firm

Dimensions with m:n Fact relation

•  Use a Bridge Table •  Problem for additive measures – will be counted N times

–  Add a weighting factor to correct fact addition –  Define a business rule to identify main record (winner-takes-all)

Bridge tables introduce complexity in the query process to take care of additive facts à user impact

Credit FACT - Amount - Rate

Client

Day Campaign

Group_ Campaign -  Weight -  Main

Campaign Group_CampaignCpg_key Desc Launch_date GRPCpg_keyCpg_key Main

1 hello  world 1/01/2009 1 1 12 Contest 15/01/2009 1 3 03 Credit 17/01/2009 1 5 04 loan  it 1/02/2009 2 1 05 Mbre2Mbre 19/02/2009 2 3 16 Picnic 30/02/2009 2 7 07 Card 1/03/2009 2 8 08 Safe  it 12/03/2009

n (one  Cpg  can  be  n  times  in  Group_Cpg  table)

1 (one  Cpg  can  befound  once  in  Cpg  table)

The choice of the weight and groups is done at the loading of the fact

32 GEST S492 – The Digital Firm

Good Dimension Attributes

•  Verbose –  Gives a rich context to the measure

•  Descriptive –  User crystal clear naming and semantics

•  Complete –  All modalities must be represented – ‘ooops “Risk A” is not loaded’ –  All attributes within a dimension must describe completely the object

represented by the dimension •  Quality assured

–  No trash value like 99 for age…

33 GEST S492 – The Digital Firm

Good Dimension Attributes

•  Indexed (b-tree vs bitmap) –  Fast for SQL ‘where’ clauses

•  Equally available –  Attribute must be available for most of the objects it represents – ‘age is

available for only 1% of our clients’ •  Documented

–  From a business point: semantic and point in time (‘risk profile attribute of clients are updated every night)

–  From a technical point: Risk profile is loaded from Table X with process Y, etc.

34 GEST S492 – The Digital Firm

DIMENSIONAL MODELING

Hierarchies in Star Schema

35 GEST S492 – The Digital Firm

Hierarchies

•  Hierarchies are the key relationship for roll-up and drill-down in OLAP environment –  Sales per Day à Week à Month à Quarter à … –  Sales for cheese à milk_productà food à …

•  Hierarchies can be represented in 2 ways in dimensions –  Explicit Attribute sets:

•  Dim.Time: date; DayofWeek; Week; Month; Quarter; … •  Dim.product: Product (Cheese); Family (Milk_product); Category

(Food)

–  Snowflake: •  Create a table per level and link tables by 1:N relationships

36 GEST S492 – The Digital Firm

Snowflaking & Attribute Hierarchies

•  Efficiency vs Space –  Snowflaking normalizes the attributes

within a hierarchy and hence reduces redundancy in a dimension table

•  For each combination of attributes involved in the hierarchy, a dimension will have one additional record

•  Understandability –  The user must join tables to get the

right level of aggregation

Product.level1 Product.level2 Product.level3cheese milk FoodYougourt milk Foodmilk milk Food

Product.level3Food

Product.level2milk

Product.level1cheeseYougourtmilk

37 GEST S492 – The Digital Firm

DIMENSIONAL MODELING

Building the contextual history

38 GEST S492 – The Digital Firm

Slowly Changing Dimensions

•  Type 1: Store only the current value –  No history is created: all facts are like the value was always

like the last one

Select week.year, Client.zip, Sum(Sales.sale) as Sales From Sales, Client, Week Where Sales.client_key=Client.key and Sales.week_key= week.key and week.year=2010 Group by week.year, Client.zip

Year Zip Sales2010 1950 18  €                

Client Sales weekKey Name zip Date_in Date_close client_key week_key Sale Key Week Month Year1 Kasparov 1150 1/01/2010 1 1 10  €         1 1 1 2010

2 2 1 20103 3 1 2010

Type  1Client Sales weekKey Name zip Date_in Date_close client_key week_key Sale Key Week Month Year1 Kasparov 1950 1/01/2010 1 1 10  €         1 1 1 2010

1 2 8  €             2 2 1 20103 3 1 2010

39 GEST S492 – The Digital Firm

Slowly Changing Dimensions

•  Type 2: Create a new dimension record for each new value of the attribute –  This will create a new surrogate key to be used as from this moment in all

new records of the Fact table, creating the historical context of the measures

Select week.year, Client.zip, Sum(Sales.sale) as Sales From Sales, Client, Week Where Sales.client_key=Client.key and Sales.week_key= week.key and week.year=2010 Group by week.year, Client.zip

Year Zip Sales2010 1150 10  €                2010 1950 8  €                    

Client Sales weekKey Name zip Date_in Date_close client_key week_key Sale Key Week Month Year1 Kasparov 1150 1/01/2010 1 1 10  €         1 1 1 2010

2 2 1 20103 3 1 2010

Type  2Client Sales weekKey Name zip Date_in Date_close client_key week_key Sale Key Week Month Year1 Kasparov 1150 1/01/2010 7/01/2010 1 1 10  €         1 1 1 20102 Kasparov 1950 8/01/2010 2 2 8  €             2 2 1 2010

3 3 1 2010

40 GEST S492 – The Digital Firm

Slowly Changing Dimensions

•  Type 3: Create an attribute in the dimension record for previous value –  This will not build a history at the level of the facts, but only in dimension

Select week.year, Client.zip, Sum(Sales.sale) as Sales From Sales, Client, Week Where Sales.client_key=Client.key and Sales.week_key= week.key and week.year=2010 Group by week.year, Client.zip

Year Zip Sales2010 1950 18  €                

Type  3Client Sales week

Key Name zip zip_prev Date_in Date_close client_key week_key Sale Key Week Month Year1 Kasparov 1950 1150 1/01/2010 1 1 10  €         1 1 1 2010

1 2 8  €             2 2 1 20103 3 1 2010

Client Sales weekKey Name zip Date_in Date_close client_key week_key Sale Key Week Month Year1 Kasparov 1150 1/01/2010 1 1 10  €         1 1 1 2010

2 2 1 20103 3 1 2010

41 GEST S492 – The Digital Firm

DIMENSIONAL MODELING

Dimensional Model Strengths & Weaknesses

42 GEST S492 – The Digital Firm

Strengths of Dimensional Modeling

•  Predictable and understandable, standard framework –  The semantic of table links is not hidden from the user as it is the case for

relational models –  Users can understand it easily

•  Respond well to changes in user reporting needs –  Most of the reports will be created through aggregation of low level facts. Hence

any reports – given that the model uses conformed dimensions – can be build on top

•  Relatively easy to add data without reloading tables –  If a new measure must be added to a fact (‘credit line level’ on an account), given

that the historical data is available, adding this measure in existing fact will be easy and reveal all historical context as well

–  Adding new attributes to an existing dimension might be more difficult if the added attribute history must be synchronized with the dimension history – surrogate key conflict –

•  Adding an attribute adds a new granularity and hence some records must be split (product family is the actual level and we want to have product shelf location…)

43 GEST S492 – The Digital Firm

Strengths of Dimensional Modeling

•  Standard design approaches have been developed –  Allowing to quickly designing facts and dimensions, even in complex

environments. The framework might be understood by all participants – including the final users through the conceptual star description

•  There exist a number of products supporting the dimensional model –  Because the semantic of the dimensional modeling is fixed, software can

leverage this knowledge and propose tools that provide automatic mechanism for Type1, Type2 and Type3 updates, that accept explicit description of hierarchies, etc. OLAP functionalities creation can be fully automated from a constellation of fact tables. The programming and BDA work is largely reduced.

•  Multidimensional models are more efficient for queries & Analysis –  Fact tables have many records (millions), and dimensions will normally

be relatively small (product = 10.000, Shops = 500). Joining small tables on large ones can be efficient with the right indexes.

44 GEST S492 – The Digital Firm

Weaknesses of Dimensional Modeling

•  Some dimensions might be too large –  Client with address à 10% change/year –  Large dimensions must be cut

•  A relation between two dimensional objects only exist through their link in a fact –  A client with no sales has no geo info attached

•  normalized and indexed relational models more flexible –  Granularity choice can be avoided –  Relationships among objects do not depend on some events

stored in facts –  The cost is: lose all advantages of DM!

In practice, we often use a combination of both at low level (high granularity) and build departmental and personal data marts using dimensional models only.

ClientKeyNameStatus

Sales streetn°ZIP

RegionCountry

Type_suburb

ClientKeyNameStatus

SalesGeo

Keystreetn°ZIP

RegionCountry

Type_suburb

45 GEST S492 – The Digital Firm

Weaknesses of Dimensional Modeling

In practice, we often use a combination of both at low level (high granularity) and build departmental and personal data marts using dimensional models only.

A link will be created between Client and Geo as soon as client record is added in the dimension. The client record will eventuelly closed even before it is used in the fact.

ClientKeyNameStatus

SalesGeo

Keystreetn°ZIP

RegionCountry

Type_suburb

46 GEST S492 – The Digital Firm

DIMENSIONAL MODELING

Dimensional Model example

47 GEST S492 – The Digital Firm

Campaign Datamart

Factless fact table Is a table that creates a relationship between objects. No measure is stored.

48 GEST S492 – The Digital Firm

Tracking campaign effects

Non additive