11-dmdw11-dw curs 2cursuri.cs.pub.ro/~radulescu/dmdw/dmdw-nou/dmdw11.pdf · facts and dimensions...
Post on 31-Jan-2018
238 Views
Preview:
TRANSCRIPT
Dimensional ModelingDimensional Modeling
Prof.dr.ing. Florin Radulescu
Universitatea Politehnica din Bucureşti
�Facts and dimensions
�Steps in dimensional modeling
�Modeling example
Summary
Road Map
2
Florin Radulescu, Note de curs
DMDW-11
�Summary
In the previous lesson we saw that:
�A large enterprise data warehouse will consist of 20 or more very similar-looking data marts, with similar dimensional
Facts and dimensions
3
Florin Radulescu, Note de curs
DMDW-11
data marts, with similar dimensional models.
�Each data mart may contain several fact tables, each with 5 to 15 dimension tables.
�Many of these dimension tables will be shared between several fact tables.
�A fact table represents a business process and contains the values for the main measurements describing that process.
�For a sale business process, this table
Facts
4
Florin Radulescu, Note de curs
DMDW-11
�For a sale business process, this table contains for example quantity sold, total price, total cost, total gross profit.
�The attributes of a fact table may be additive, semi-additive or non-additive
• Additive measures can be aggregated across all dimensions.
• All examples in the previous slide are additive.
Additive measures
5
Florin Radulescu, Note de curs
DMDW-11
additive.
• For example, SUM(total price) is meaningful on all dimensions: time, location, store, customer, etc.
�Semi-Additive measures can be aggregated across some dimensions but not all.
�Here are for example periodic measurements:�account balance for a bank account or
�inventory level for a retail chain.
Semi-Additive
6
Florin Radulescu, Note de curs
DMDW-11
�inventory level for a retail chain.
� In the first case an average may be computed for knowing the average daily balance but the sum of daily balances is not meaningful.
� In the second case, inventory level is additive on product and warehouse but not across time: the sum of yesterday and today inventory level for a given product is not a meaningful value.
• Non-additive measures cannot be aggregated across all/any dimension.
• A classical example is the unit price.
• Considering a retail company, the sum of unit prices along any dimension (product, customer,
Non-additive measures
7
Florin Radulescu, Note de curs
DMDW-11
prices along any dimension (product, customer, location, etc.) is not meaningful.
• For that reason, if these values can be computed based on additive measures, the non-additive measures are not stored in the fact tables.
• For our example, the unit price can always be computed dividing total cost by the quantity sold.
�The level of detail of o record in a fact table is called the “grain” of the table.
�Besides business process measurements, the fact table contains also foreign keys for
Grain
8
Florin Radulescu, Note de curs
DMDW-11
the fact table contains also foreign keys for all the dimension tables and possibly some pseudo-foreign keys for some degenerate dimensions.
�More details on this topic in the next paragraphs of this lesson.
In [CS680, 2004], dimension tables are characterized as follows:
�Represent the who, what, where, when and how of a measurement/artifact
�Represent real-world entities not business
Dimensions
9
Florin Radulescu, Note de curs
DMDW-11
�Represent real-world entities not business processes
�Give the context of a measurement (subject)
Example: in a retail company DW, the Sales fact table can be linked with dimensions like Location (Where), Time (When), Product (What), Customer (Who), Sales Channel (How).
The Dimension Attributes are the columns of the dimension table. [Wikipedia] lists some features for these attributes:
�Verbose - labels consisting of full words,
�Descriptive,
Dimensions
10
Florin Radulescu, Note de curs
DMDW-11
�Descriptive,
�Complete - no missing values,
�Discretely valued - only one value per row in dimensional table,
�Quality assured - no misspelling, no impossible values.
Star scheme
11
Florin Radulescu, Note de curs
DMDW-11
�Each fact table is surrounded by several linked dimension tables, as in Figure 1.
�Because of its appearance, such a construction is called a ‘star scheme’.
�A star scheme has several advantages:
Advantages
12
Florin Radulescu, Note de curs
DMDW-11
�A star scheme has several advantages:�Is easy to understand. Graphic representations have
almost always this advantage
�Provide better performance: data is de-normalized in fact and dimension tables, so for obtaining a query result needs only the joins between the fact table and the implied dimensions
�Is extensible. Attributes and dimensions may be added easily
SQL Query
SELECT P.Name, SUM(S.Sales), . . .
FROM Sales S, Date D, Product P, Location L, Promotion R
WHERE S.Date_Id = D.Date_Id // join condition
AND S.Poduct_Id = P.Product_Id
AND S.Location_Id = L.Location_Id
13
Florin Radulescu, Note de curs
DMDW-11
AND S.Location_Id = L.Location_Id
AND S.Promotion_Id = R.Promotion_Id
// additional condition
AND D.Month='JUN' AND D.Year='2012' AND L.Country_Name='ROU'
GROUP BY P.Product_Id, P.Name // P.Name is needed because
// is in the SELECT clause
�Sometimes the dimension tables are normalized: each dimension is stored as a set of tables.
�In this case, the scheme is called ‘Snow-
Snow-flake schemes
14
Florin Radulescu, Note de curs
DMDW-11
�In this case, the scheme is called ‘Snow-flake scheme’
Example
15
Florin Radulescu, Note de curs
DMDW-11
�Facts and dimensions
�Steps in dimensional modeling
�Modeling example
Summary
Road Map
16
Florin Radulescu, Note de curs
DMDW-11
�Summary
�There are four steps in dimensional modeling design.
�These steps must be performed in particular order and every review of a step
The four step approach
17
Florin Radulescu, Note de curs
DMDW-11
particular order and every review of a step triggers the review of all subsequent steps.
�These four steps are:
1. Select the business processes
2. Declare the grain
3. Choose the dimensions
The four step approach
18
Florin Radulescu, Note de curs
DMDW-11
3. Choose the dimensions
4. Identify the facts
� An organization has several departments and carries out several business processes.
�Selecting the business process does not
Select the business processes
19
Florin Radulescu, Note de curs
DMDW-11
�Selecting the business process does not refer to one structural department of the organization but to a process carried out by one or more departments together.
�In a general company for example some main business processes are:
�Supply chain management
�Orders,
Select the business processes
20
Florin Radulescu, Note de curs
DMDW-11
�Orders,
�Shipments,
�Invoicing,
�Stocking and inventory
�General ledger
� This approach also ensures that data contains no duplicate data.
� If a department approach in structuring the data warehouse is used, same data may be used by several departments and must be presented redundantly in the DW.
� For example, inventory data are used for supply chain management but also for production management in a car
No duplicate data
21
Florin Radulescu, Note de curs
DMDW-11
management but also for production management in a car factory.
� A data warehouse organized based on departmental structure will duplicate inventory data but organizing it on business processes will avoid redundancy and both departments –supply management and production – will use the same data.
�No duplicate data means also that:
�Consistency is better preserved; redundancy
is known for inducing consistency problems
(as the update anomaly)
No duplicate data
22
Florin Radulescu, Note de curs
DMDW-11
(as the update anomaly)
�Data are published once and this single view
is used as published for decision support in all
departments and activities.
� Each line in a fact table is a grain in our data warehouse. In step 2 of the dimensional design process the level of detail for these lines / grains must be defined.
� Thinking at a retail company with registered users (like Metro or Selgros), for the POS sales business
Step 2: Declare the grain
23
Florin Radulescu, Note de curs
DMDW-11
Metro or Selgros), for the POS sales business process, a grain may be:
1. An individual line item on a customer’s retail sales ticket or invoice, as measured by a scanner device (in that case the same item may be on several lines in the same ticket/invoice because the quantity was greater than one and each product was scanned individually).
2. The same significance as above but lines containing the same part number are summarized in a single line.
3. A daily reunion of the sales tickets of a customer containing items and prices.
4. A sales ticket / invoice for a customer. The same
Step 2: Declare the grain
24
Florin Radulescu, Note de curs
DMDW-11
4. A sales ticket / invoice for a customer. The same customer may have several sales tickets each day.
5. A daily summary on the sales tickets of a customer, containing only total sales amount.
6. A weekly summary on the sales tickets of a customer containing only total sales amount.
• In the first three cases data on what products
bought each customer is preserved in the
DW, so there is a low detail lever allowing
much more queries and much more ‘slicing
Discussion
25
Florin Radulescu, Note de curs
DMDW-11
much more queries and much more ‘slicing
and dicing’ actions: reports, analysis, etc.
• For the last three cases these data are
removed, the level of detail is higher and only
a part of the above actions may be
performed.
A key idea in choosing the granularity level is emphasized in [Kimball, Ross, 2002]:
� “Preferably you should develop dimensional models for the most atomic information captured by a business process. Atomic data is the most detailed information collected; such data cannot be subdivided
Discussion
26
Florin Radulescu, Note de curs
DMDW-11
information collected; such data cannot be subdivided further.”
and
� “A data warehouse almost always demands data expressed at the lowest possible grain of each dimension not because queries want to see individual low-level rows, but because queries need to cut through the details in very precise ways.”
�Some features of atomic data listed in Kimball & Ross book are:
�Is highly dimensional,
�Being highly dimensional, detailed and atomic
Atomic data features
27
Florin Radulescu, Note de curs
DMDW-11
�Being highly dimensional, detailed and atomic
fact measurement allow more dimensions and
so data may be drilled in more ways,
�Dimensional approach is favored by atomic
data, each extra dimension being easily
added to the star schemes,
�Features – cont.:�Provides maximum analytic flexibility,�Detailed data allow more ad hoc queries,�Low level grain does not prohibit adding also
summary high level grain in the DW for speeding
Atomic data features
28
Florin Radulescu, Note de curs
DMDW-11
summary high level grain in the DW for speeding up frequent queries and reports.
�Note that declaring the grain is a critical step. If later the granularity choice is proved to be wrong, the process must go back to step 2 for re-declaring the grain correctly, and after that steps 3 and 4 must be run again.
�Knowing the grain, dimensions can be determined easily: each dimension represents a possible fact table line description.
�Example of common dimensions used in a sales data warehouse:�Product,
Step 3: Choose the dimensions
29
Florin Radulescu, Note de curs
DMDW-11
�Product,
�Customer,
�Date,
�Ticket number
�Status,
�Store,
�Salesperson
�Promotion
• Dimensions can be found by asking ourselves haw can we describe a single line in the fact table.
• In the above example, a line in the fact
Step 3: Choose the dimensions
30
Florin Radulescu, Note de curs
DMDW-11
• In the above example, a line in the fact table represents a single line on a sales ticket.
• This line is about selling a product to a customer at a given date and in a given store possibly under a promotion.
• The line is on a ticket having a number, a status (for example paid by credit card) and has been made by a particular salesperson.
Step 3: Choose the dimensions
31
Florin Radulescu, Note de curs
DMDW-11
• The number of attributes in a dimension table is not so small.
• This lesson presents an example showing that each dimension may have several tens of attributes.
�Every line in a fact table must contain some attribute values.
� These attributes represents the measures assigned to that business process that must be determined at this step.
� In the case of a star scheme containing data on POS
Step 4: Identify the facts
32
Florin Radulescu, Note de curs
DMDW-11
� In the case of a star scheme containing data on POS retail sales in a store chain, possible attributes of the fact table are:�Quantity sold – additive value
�Total line value amount – additive value
�Line cost amount – additive value
�Line profit amount – additive value
�Unit price – not an additive value
• The product sold, store, date, time, customer, promotion, sales ticket and other data are also identified by the linked records in the corresponding dimension
Discussion
33
Florin Radulescu, Note de curs
DMDW-11
records in the corresponding dimension tables.
• In the fact table are stored only information specific to an association of an instance for each dimension.
• Additive measures are preferred. So unit price, which is not an additive value, will be removed because it can be computed by division from the total line value amount and quantity sold.
• Redundant data can be stored in a fact table if
Discussion
34
Florin Radulescu, Note de curs
DMDW-11
• Redundant data can be stored in a fact table if they are additive or semi-additive.
• For example, Line profit amount may be computed by subtracting the cost amount from the value amount.
• The presence of these redundant values is allowed for speeding up processing.
�Facts and dimensions
�Steps in dimensional modeling
�Modeling example
Summary
Road Map
35
Florin Radulescu, Note de curs
DMDW-11
�Summary
� A retail sales modeling example is presented in [Kimball, Ross, 2002] for a store chain.
�Each store has several departments and sales several tens of thousands items (called stock keeping units – SKU).
Modeling example
36
Florin Radulescu, Note de curs
DMDW-11
keeping units – SKU).
�Each SKU has either a universal product code imprinted by the manufacturers or a local code for bulk goods (for example agricultural products -vegetables and fruits, meat, bakery, etc.).
�Package variation of a product is another SKU and by consequence has a different code.
�Some products are sold under some promotions. There are four types of promotions in the modeling example cited:
�Temporary price reductions,
Modeling example
37
Florin Radulescu, Note de curs
DMDW-11
�Temporary price reductions,
�Ads in newspapers and newspaper inserts,
�Displays in the store (end-aisle displays
included),
�Coupons.
Step 1: Select the business process
�The most important business process in a retail company is customer purchases as captured by the POS system, so the
Step 1
38
Florin Radulescu, Note de curs
DMDW-11
captured by the POS system, so the business process modeled is “POS retail sales”.
Step 2: Declare the grain
�As seen earlier, in dimensional modeling is preferably to store atomic information, data collected at the POS location and not
Step 2
39
Florin Radulescu, Note de curs
DMDW-11
data collected at the POS location and not summarized or aggregated data based on POS transactions.
�So the grain in this modeling example will be an individual line on the sales ticket generated by the POS.
Step 3: Choose the dimensions
�As presented earlier, a line on a sales ticket is about selling a product to a customer at a given date and in a given
Step 3
40
Florin Radulescu, Note de curs
DMDW-11
customer at a given date and in a given store possibly under a promotion.
�The line is on a ticket having a number and is made by a particular salesperson.
Step 3 – star scheme
Product Product_Key (PK) Product attributes
Store Store_Key (PK) Store attributes
POS_Sales Product_Key (FK) Date_Key (FK) Store_Key (FK) SP_Key (FK)
41
Florin Radulescu, Note de curs
DMDW-11
Date Date_Key (PK) Date attributes
Promotion Promotion_Key (PK) Promotion attributes
Salesperson SP_Key (PK) SP attributes
SP_Key (FK) Promotion_Key (FK) Ticket_number (FK) Fact table attributes
�From this definition of the grain, the dimensions are:
�Product
�Date
Step 3 - details
42
Florin Radulescu, Note de curs
DMDW-11
�Date
�Store
�Promotion
�Salesperson
�Sales ticket
�The sales ticket is a so-called “degenerate dimension”.
�Such degenerate dimensions come from operational control numbers: ticket
Degenerate dimensions
43
Florin Radulescu, Note de curs
DMDW-11
operational control numbers: ticket number, order number, invoice number, and so on.
�These dimensions are empty - without other attributes.
�No associated dimension table is present in the
star scheme, but they are necessary in some
queries, for example for finding products sold
together in the same sales basket.
Degenerate dimensions
44
Florin Radulescu, Note de curs
DMDW-11
�For these dimensions only the pseudo-foreign
key associated with the dimension is present as
attribute in the fact table.
�Every dimension has a surrogate primary key
and this key is also contained as foreign key in
the fact table.
�The candidate fact attributes:
�Quantity_sold – additive value
�Line_amount – additive value
�Cost_amount – additive value
Step 4: Identify the facts
45
Florin Radulescu, Note de curs
DMDW-11
�Cost_amount – additive value
�Profit amount – additive value
�Unit_price – not an additive value
Star scheme again
Product Product_Key (PK) Product attributes
Date
Store Store_Key (PK) Store attributes
POS_Sales Product_Key (FK) Date_Key (FK) Store_Key (FK) SP_Key (FK)
Promotion_Key (FK) Ticket_number (FK) Quantity_sold Line_amount
46
Florin Radulescu, Note de curs
DMDW-11
Date Date_Key (PK) Date attributes
Promotion Promotion_Key (PK) Promotion attributes
Salesperson SP_Key (PK) SP attributes
Line_amount Cost_amount Profit amount
�Because Unit_price is not additive, this attribute will not be stored in the fact table but can always be computed by dividing Line_amount to Quantity_sold.
Discussion
47
Florin Radulescu, Note de curs
DMDW-11
�This applies to percentages and ratios, (almost all are non-additive values) and and in these cases the numerator and denominator must be stored in the fact table.
Dimension attributes example
Date Key (PK) Date Full Date Description Day of Week Day Number in Epoch Week Number in Epoch Month Number in Epoch
Calendar Quarter Calendar Year-Quarter Calendar Half Year Calendar Year Fiscal Week Fiscal Week Number in Year Fiscal Month
48
Florin Radulescu, Note de curs
DMDW-11
Month Number in Epoch Day Number in Calendar Month Day Number in Calendar Year Day Number in Fiscal Month Day Number in Fiscal Year Last Day in Week Indicator Last Day in Month Indicator Calendar Week Ending Date Calendar Week Number in Year Calendar Month Name Calendar Month Number in Year Calendar Year-Month (YYYY-MM)
Fiscal Month Fiscal Month Number in Year Fiscal Year-Month Fiscal Quarter Fiscal Year-Quarter Fiscal Half Year Fiscal Year Holiday Indicator Weekday Indicator Selling Season Major Event SQL Date Stamp
Product attributes example
Product Key (PK) SKU Number (Natural Key) Category Description Package Type Description Fat Content Weight
Product Description Brand Description Department Description Package Size Diet Type Weight Units of Measure
49
Florin Radulescu, Note de curs
DMDW-11
Weight Storage Type Shelf Width Shelf Depth
Weight Units of Measure Shelf Life Type Shelf Height
Store attributes example
Store Name Store Number (Natural Key) Store Street Address Store City Store County Store State
Store Region Floor Plan Type Photo Processing Type Financial Service Type Selling Square Footage Total Square Footage
50
Florin Radulescu, Note de curs
DMDW-11
Store State Store Zip Code Store Manager Store District
Total Square Footage First Open Date Last Remodel Date
Promotion attributes example
Promotion Key (PK) Promotion Name Price Reduction Type Promotion Media Type Ad Type
Coupon Type Ad Media Name Display Provider Promotion Cost Promotion Begin Date
51
Florin Radulescu, Note de curs
DMDW-11
Ad Type Display Type
Promotion Begin Date Promotion End Date
�This course presented the dimensional model of data warehouses:�Definitions for facts and dimensions, definitions
for star scheme and snow-flake scheme.
�The four steps in dimensional modeling: identify
Summary
52
Florin Radulescu, Note de curs
DMDW-11
�The four steps in dimensional modeling: identify the business process, declare the grain, choose dimensions and identify the facts
�A modeling example for a sales chain with illustration of attributes in fact and dimension tables
�Next week: Data warehouse case study
[CS680, 2004] Introduction to Data Warehouses, Drexel Univ. CS
680 Course notes, 2004 (page
https://www.cs.drexel.edu/~dvista/cs680/2.DW.Overview.ppt
visited 2010)
[Kimball, Ross, 2002] Ralph Kimball, Margy Ross - The Data
Warehouse Toolkit, Second Edition, Wiley & Sons, 2002
References
53
Florin Radulescu, Note de curs
DMDW-11
Warehouse Toolkit, Second Edition, Wiley & Sons, 2002
[Wikipedia] Wikipedia, the free encyclopedia, en.wikipedia.org
top related