technologica ltd. 3, sofiisko pole str. tel : ( + 359 2) 91 91 2 (ten lines)

59
1 TechnoLogica Ltd. 3, Sofiisko Pole Str. tel: (+ 3592) 91 91 2 (ten lines) e-mail: offi[email protected], http:// www.technologica.com DW Concepts Dimension Modeling Techniques Milena Gerova Project Manager

Upload: meryl

Post on 05-Jan-2016

45 views

Category:

Documents


0 download

DESCRIPTION

DW Concepts Dimension Modeling Techniques. Milena Gerova Project Manager. TechnoLogica Ltd. 3, Sofiisko Pole Str. tel : ( + 359 2) 91 91 2 (ten lines) e-mail: [email protected], http:// www.technologica.com. TechnoLogica DW Projects. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

1

TechnoLogica Ltd.3, Sofiisko Pole Str. tel: (+ 3592) 91 91 2 (ten lines)e-mail: [email protected], http:// www.technologica.com

DW Concepts

Dimension Modeling Techniques

Milena GerovaProject Manager

Page 2: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

2

www.technologica.com

TechnoLogica DW Projects

Business Management System National Health Insurance Fund (10.2004 – current)

Customer Data Integration Allianz Bulgaria Holding (10.2004 – current)

Regulatory Reporting System BULBANK (2002 - 2003)

Information System Monetary StatisticsBulgarian National Bank (April 2003 – August 2004)

Management Information System BULBANK (January 2001 - June 2002)

Page 3: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

3

www.technologica.com

Agenda

DW Terminology Overview

Dimensional Modeling

Dimension Types

History and Dimensions

Hierarchy in Dimensions

Page 4: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

4

www.technologica.com

The data warehouse must

Make an organization’s information easily accessible.

Present the organization’s information consistently.

Be adaptive and resilient to change

Be a secure bastion that protects our information assets.

Serve as the foundation for improved decision making

The business community must accept the data warehouse if it is to be deemed successful.

Page 5: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

5

www.technologica.com

Components of a Data Warehouse

Page 6: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

6

www.technologica.com

Dimensional Modeling

Dimensional modeling is a new name for an old technique for making databases simple and understandable

Dimensional modeling is quite different from third-normal-form (3NF) modeling

ERM ->The Transaction Processing Model

o One table per entity

o Minimize data redundancy

o Optimize update

DM -> The data warehousing model

o One fact table for a process in the organization

o Maximize understandability

o Optimized for retrieval

o Resilient to change

Page 7: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

7

www.technologica.com

Star Dimensional Modeling

History(Dimension

table)

Customer(Dimension

table)

Product(Dimension

table)

Channel(Dimension

table)

Item_nbrItem_descQuantityDiscnt_priceUnit_priceOrder_amount…

(Fact table)

OrderHistory

(Dimensiontable)

Customer(Dimension

table)

Product(Dimension

table)

Channel(Dimension

table)

Item_nbrItem_descQuantityDiscnt_priceUnit_priceOrder_amount…

(Fact table)

Order

Page 8: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

8

www.technologica.com

Four-Step Dimensional Design Process

1. Select the business process to model.

2. Declare the grain of the business process.

3. Choose the dimensions that apply to each fact table row.

4. Identify the numeric facts that will populate each fact table row.

Page 9: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

9

www.technologica.com

Dimensions

Determine these by the ways you want to slice and dice the data

Small number of rows compared to facts

Usually 5-10 dimensions surrounding a fact table

Time is almost always a dimension used by every fact

Track history

Uses Surrogate Keys

Hierarchies are usually built into them if possible

Page 10: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

10

www.technologica.com

Date Dimension

The date dimension is the one dimension nearly guaranteed to be in every data mart

Date Dimension = Time Dimension before

We can build the date dimension table in advance (5-10 years -> only 3,650 rows)

Page 11: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

11

www.technologica.com

Date Dimension

Page 12: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

12

www.technologica.com

Date Dimension

Page 13: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

13

www.technologica.com

Date Dimension

Data warehouses always need an explicit date dimension table. There are many date attributes not supported by the SQL date function, including fiscal periods, seasons, holidays, and weekends. Rather than attempting to determine these nonstandard calendar calculations in a query, we should look them up in a date dimension table.

select sum(f.amount_sold)from DATE_DIM d, FACT fwhere d.Calendar_Month = ‘January’

and d.id = f.date_dim_id;

Page 14: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

14

www.technologica.com

Dimension Normalization(Denormalized dimension)

Page 15: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

15

www.technologica.com

Dimension Normalization(Denormalized dimension)

Page 16: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

16

www.technologica.com

Dimension Normalization(Snowflaking)

Page 17: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

17

www.technologica.com

Dimension Normalization(Snowflaking)

The dimension tables should remain as flat tables physically.

Normalized, snowflaked dimension tables penalize cross-attribute browsing and prohibit the use of bit-mapped indexes.

Disk space savings gained by normalizing the dimension tables typically are less than 1 percent of the total disk space needed for the overall schema

Page 18: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

18

www.technologica.com

Too Many Dimensions

Page 19: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

19

www.technologica.com

Too Many Dimensions

A very large number of dimensions typically is a sign that several dimensions are not completely independent and should be combined into a single dimension.

If our design has 25 or more dimensions, we should look for ways to combine correlated dimensions into a single dimension

It is a dimensional modeling mistake to represent elements of a hierarchy as separate dimensions in the fact table.

Page 20: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

20

www.technologica.com

Surrogate Keys

Every join between dimension and fact tables in the data warehouse should be based on meaningless integer surrogate keys.

You should avoid using the natural operational production codes. None of the data warehouse keys should be smart, where you can tell something about the row just by looking at the key.

Page 21: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

21

www.technologica.com

Surrogate Keys

Surrogate keys are like an immunization for the data warehouse

Buffer the data warehouse environment from operational changes

Performance advantages The smaller surrogate key translates into smaller fact tables, smaller fact table indices, and more fact table rows per block input-output operation

Surrogate keys are used to record dimension conditions that may not have an operational code“No Promotion in Effect”, “Date Not Applicable.”

Page 22: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

22

www.technologica.com

Surrogate Keys

The date dimension is the one dimension where surrogate keys should be assigned in a meaningful, sequential order

Surrogate keys are needed to support one of the primary techniques for handling changes to dimension table attributes

Don’t use concatenated or compound keys for dimension tables

Page 23: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

23

www.technologica.com

Data Warehouse Bus Architecture

Page 24: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

24

www.technologica.com

Data Warehouse Bus Matrix

Page 25: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

25

www.technologica.com

Conformed Dimensions

Most dimensions are defined naturally at the most granular level possible

Conformed dimensions are either identical or strict mathematical subsets of the most granular, detailed dimension

They have consistent dimension keys, consistent attribute column names, consistent attribute definitions, and consistent attribute values

The conformed dimension may be the same physical table within the database or may be duplicated synchronously in each data mart

Page 26: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

26

www.technologica.com

Conformed Dimensions

Roll-up dimensions conform to the base-level atomic dimension if they are a strict subset of that atomic dimension.

Page 27: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

27

www.technologica.com

Conformed Dimensions

They should be built once in the staging area

They must be published prior to staging of the fact data

The dimension authority has responsibility for defining, maintaining, and publishing a particular dimension or its subsets to all the data mart clients who need it

Page 28: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

28

www.technologica.com

Tracking History in Dimensions

Unchanging Dimensions

Changing, but Original Values are Irrelevant A phone number in a customer record

Slowly Changing Dimensions (SCD) A customer address, manager

Rapidly Changing Dimensions Income range of a customer

Continuously Changing Dimensions Customer age

Page 29: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

29

www.technologica.com

Type 1: Overwrite the Value

The type 1 response is easy to implement, but: it does not maintain any history of prior attribute values

any preexisting aggregations based on the department value will need to be rebuilt

Page 30: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

30

www.technologica.com

The type 2 response is the primary technique for accurately tracking slowly changing dimension attributes. It is extremely powerful because the new dimension row automatically partitions history in the fact table.

It’s not suitable for dimension tables that already exceed a million rows

Type 2: Add a Dimension Row

Page 31: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

31

www.technologica.com

Type 2: Add a Dimension Row

Product Key

Product Description Department

SKU Number

Effective Date

Expiration Date

12345 IntelliKidz 1.0 Education ABC922-Z 01.1.1900 22.4.200525984 IntelliKidz 1.0 Strategy ABC922-Z 23.4.2005 01.1.2500

Product Key

Product Description Department

SKU Number

Effective Date

Most Resent Flag

12345 IntelliKidz 1.0 Education ABC922-Z 01.1.1900 N25984 IntelliKidz 1.0 Strategy ABC922-Z 23.4.2005 Y

Product Key Date Key

Amount Sold

12345 200 100 <--- 20.04.200525984 203 200 <--- 23.04.2005

Page 32: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

32

www.technologica.com

Type 3: Add a Dimension Column

The type 3 slowly changing dimension technique allows us to see new and historical fact data by either the new or prior attribute values.

Page 33: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

33

www.technologica.com

Hybrid SCD TechniquesSeries of Type 3 Attributes

Predictable Changes with Multiple Version Overlays

Report each year’s sales using the district map for that year.

Report each year’s sales using a district map from an arbitrary different year.

Report an arbitrary span of years’ sales using a single district map from any chosen year. The most common version of this requirement would be to report the complete span of fact data using the current district map.

Page 34: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

34

www.technologica.com

Hybrid SCD TechniquesType 2 with "Current" Overwrite

Unpredictable Changes with Single-Version Overlay preserves historical accuracy while supporting the ability to report historical data according to the current values

Page 35: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

35

www.technologica.com

Dimension Table Staging

Page 36: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

36

www.technologica.com

Dimension Table Staging

Page 37: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

38

www.technologica.com

Junk Dimensions

What to do with flags and indicators Leave the flags and indicators unchanged in the fact

table row.

Make each flag and indicator into its own separate dimension

Strip out all the flags and indicators from the design.

A junk dimension is a convenient grouping of typically low-cardinality flags and indicators

Page 38: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

39

www.technologica.com

Junk Dimensions

Whether to use junk dimension5 indicators, each has 3 values -> 243 (35) rows5 indicators, each has 100 values -> 100 million (1005) rows

When to insert rows in the dimension

Page 39: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

40

www.technologica.com

Multiple Currencies

Page 40: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

41

www.technologica.com

Customer Dimension

Critical element for effective CRM

The most challenging dimension for any data warehouse

extremely deep (with millions of rows)

extremely wide (with dozens or even hundreds of attributes)

sometimes subject to rather rapid change

Page 41: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

42

www.technologica.com

Customer Dimension Name and Address Parsing

Page 42: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

43

www.technologica.com

Customer Dimension Other Common Customer Attributes

Gender

Ethnicity

Age or other life-stage classifications

Income or other lifestyle classifications

Status (for example, new, active, inactive, closed)

Referring source

Business-specific market segment

Scores characterizing the customer, such as purchase behavior, payment behavior, product preferences

Page 43: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

44

www.technologica.com

Customer Dimension Aggregated Facts as Attributes

These attributes are to be used for constraining and labeling; they are not to be used in numeric calculations

Focus on those which will be used frequently

Minimize the frequency with which these attributes need to be updated

Replace metrics with more meaningful descriptive values, such as “High Spender”

Page 44: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

45

www.technologica.com

Dimension Outriggers for aLow-Cardinality Attribute Set

Page 45: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

46

www.technologica.com

Rapidly Changing CustomerDimensions

Challenges It generally takes too long to constrain or browse

among the relationships in such a big table

It is difficult to use previously described techniques for tracking changes in these large dimensions

One solution is to break off frequently analyzed or frequently changing attributes into a separate dimension, referred to as a minidimension

Page 46: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

47

www.technologica.com

Rapidly Changing CustomerDimensions

The Mini Dimension with "Current" Overwrite

Page 47: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

48

www.technologica.com

Rapidly Changing CustomerDimensions

The minidimension terminology refers to when the demographics key is part of the fact table composite key

If the demographics key is a foreign key in the customer dimension, we refer to it as an outrigger

Page 48: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

49

www.technologica.com

Rapidly Changing CustomerDimensions

Type 2 with Natural Keys in Fact Table

Customer Dimension - Current Attributes (SCD1) Fact Table

Customer ID (Natural Key) Customer Key (FK) Customer Name Customer Demographics Key (FK) Customer Address More Foreign Keys … Customer Date of Birth Facts … Customer Date of 1st Order … Age

Gender Customer Dimension - "As was" Attributes (SCD2)

Annual Income Customer Key (PK) Number of Children Customer ID (Natural Key) Marital Status Customer Name

Customer Address Customer Date of Birth Customer Date of 1st Order … Age Gender Annual Income Number of Children Marital Status

Page 49: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

50

www.technologica.com

Implications of Type 2 CustomerDimension Changes

Be careful to avoid overcounting because we may have multiple rows in the customer dimension for the same individual

COUNT DISTINCT

A most recent row indicator

The comparison operators depend on the business rules used to set our effective/expiration dates.

Page 50: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

51

www.technologica.com

Capture the keys of the customers or products whose behavior you are tracking

Customer Behavior Study Groups

Page 51: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

52

www.technologica.com

Commercial Customer Hierarchies

Page 52: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

53

www.technologica.com

Commercial Customer Hierarchies

Bridge tables

Page 53: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

54

www.technologica.com

Commercial Customer Hierarchies

Page 54: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

55

www.technologica.com

Commercial Customer Hierarchies

Be aware of risk of double counting

SELECT 'San Francisco', SUM(F.REVENUE)FROM FACT F, DATE DWHERE F.CUSTOMER_KEY IN

(SELECT B.SUBSIDIARY_KEYFROM CUSTOMER C, BRIDGE BWHERE C.CUSTOMER_KEY =

B.PARENT_KEYAND C.CUSTOMER_CITY = 'San

Francisco') //to sum all SF parentsAND F.DATE_KEY = D.DATE_KEYAND D.MONTH = 'January 2002‘GROUP BY 'San Francisco'

Page 55: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

56

www.technologica.com

Heterogeneous Product Schemas

Page 56: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

57

www.technologica.com

Heterogeneous Product Schemas

Page 57: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

58

www.technologica.com

Common Dimensional Modeling Mistakes to Avoid

Mistake 10: Place text attributes used for constraining and grouping in a fact table

Mistake 9: Limit verbose descriptive attributes in dimensions to save space

Mistake 8: Split hierarchies and hierarchy levels into multiple dimensions

Mistake 7: Ignore the need to track dimension attribute changes

Mistake 6: Solve all query performance problems by adding more hardware

Page 58: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

59

www.technologica.com

Common Dimensional Modeling Mistakes to Avoid

Mistake 5: Use operational or smart keys to join dimension tables to a fact table

Mistake 4: Neglect to declare and then comply with the fact table’s grain

Mistake 3: Design the dimensional model based on a specific report

Mistake 2: Expect users to query the lowest-level atomic data in a normalized forma

Mistake 1: Fail to conform facts and dimensions across separate fact tables

Page 59: TechnoLogica Ltd. 3, Sofiisko Pole Str.  tel : ( + 359 2) 91 91 2  (ten lines)

60

www.technologica.com

Answers

Questionsand