

Aalborg University

DE-1 project

Building a Data Warehouse

Authors: Dovydas Sabonis, Femi Adisa

Supervisor: Liu Xiufeng

December 19, 2008


Faculty of Engineering and Science, Aalborg University

Department of Computer Science

PROJECT TITLE: Building a Data Warehouse

PROJECT PERIOD: DE-1, September 2, 2008 - December 19, 2008

GROUP MEMBERS: Dovydas Sabonis, Femi Adisa

SUPERVISOR: Liu Xiufeng

REPORT PAGES: 60


Contents

1 Introduction
  What is a Data Warehouse?
  What is Data Warehousing?
  Why build a Data Warehouse?
  The Case Study
  Summary

2 The Data Warehouse Architecture
  Data Flow Architecture
  Summary

3 The Methodology
  The Four-Step Design Process
  Summary

4 Functional Requirements
  Summary

5 Data Modelling
  5.2 Data Modelling Primer
    5.2.1 Dimensional Model
    5.2.2 Metadata
  5.3 Designing the Dimensional Data Store
    5.3.1 STEP 1: Selecting the Business Model
    5.3.2 STEP 2: Declaring the Grain
    5.3.3 STEP 3: Choosing the dimensions
  5.4 Slowly Changing Dimensions
  5.5 Data Hierarchy
  5.6 The Date Dimension
  5.7 The Office Dimension
  5.8 The Product Dimension
  5.9 The Customer Dimension
  5.10 Step 4: Identifying the Facts
  5.11 Source System Mapping
  5.12 Summary

6 The Physical Database Design
  6.2 The source system database
  6.3 The Staging area database
  6.4 The DDS database
  6.5 The Metadata database
  6.6 Views
  6.7 Summary

7 Populating the Data Warehouse
  7.2 Populating the Stage database
  7.3 Data Mappings
  7.4 Control Flow
  7.5 Moving Data to the DDS
  7.6 Populating the Dimension tables
  7.7 Populating the Fact table
  7.8 Preparing for the next upload
  7.9 Scheduling the ETL
  7.10 Summary

8 Building Reports
  Selecting Report fields
  Summary

Bibliography

List of Figures

1.1  A simple Data Warehouse
2.1  Elements of a Data Warehouse
2.2  A single DDS architecture
3.1  The Four-Step Dimensional Design Process
5.1  The Four-Step Dimensional Design Process
5.2  Product Sales Data mart
5.3  The Product dimension Hierarchy
5.4  The Customer dimension Hierarchy
5.5  The Date dimension Hierarchy
5.6  The Office dimension Hierarchy
5.7  The Date dimension
5.8  The Office dimension
5.9  The Product dimension
5.10 The Customer dimension
5.11 The Product Sales Data mart
7.1  Data flowing through the warehouse
7.2  Sample customer table
7.3  The Metadata data flow table
7.4  Source-to-stage mappings
7.5  Stage ETL Control Flow
7.6  DDS ETL Control Flow
7.7  Populating the Customer Dimension
7.8  Slowly changing Customer Dimension
7.9  Merge Joining the Orders and OrderDetails tables
7.10 Retrieving the Office code
7.11 Business to Surrogate key
7.12 Fact table mapping
7.13 Populating the Fact table
7.14 Creating an SQL Agent Job
7.15 The ETL scheduler
8.1  Creating the Profit report
8.2  Building the Profit report
8.3  Designing the report matrix
8.4  Sales by country report
8.5  Model Sales Report
8.6  Model Sales Report

List of Tables

4.1 Functional Requirements
5.1 Type 2 response to SCD
5.2 Type 3 response to SCD

ABSTRACT:

This report documents our experiences while learning the fundamental aspects of data warehousing. It presents our journey into building a data warehouse for a case-study company and the obstacles we encountered along the way.


Chapter 1

Introduction

What is a Data Warehouse?

Before we get down to work and try to build a data warehouse, we feel it is very important to first define a data warehouse and related terminology, and to explain why organizations decide to implement one. Further down we will discuss what should be the driving force behind the need to build a data warehouse, and where the focus should lie during implementation.

While various definitions abound for what is and what constitutes a data warehouse, the definition which we believe best describes a data warehouse comes from [1]: A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process.

We take a moment to go through the definition. A data warehouse is subject-oriented: this means that it is specifically designed to address a particular business domain. A data warehouse is integrated: it is a repository of data from multiple, possibly heterogeneous data sources, presented with consistent and coherent semantics. Data in a data warehouse comes from one or more source systems. These are usually OLTP, or online transaction processing, systems that handle the day-to-day transactions of a business or organization.

A data warehouse is time-variant: each unit of data in a data warehouse is relevant to some moment in time.

A data warehouse is non-volatile: it contains historic snapshots of various operational system data, and it is durable. Data in the data warehouse is usually neither updated nor deleted; rather, new rows are uploaded, usually in batches on a regular basis.

A data warehouse supports management's decision making process: the main reason for building a data warehouse is to be able to query it for business intelligence and other analytical activities. Users use various front-end tools such as spreadsheets, pivot tables, reporting tools and SQL query tools to probe, retrieve and analyze (slice and dice) the data in a data warehouse to gain a deeper understanding of their businesses. They can analyze sales by time, customer and product, or analyze the revenue and cost for a certain month, region and product type.

ETL: Data from source systems is moved into the data warehouse by a process known as ETL (Extract, Transform and Load). It is basically a system that connects to the source systems, reads the data, transforms it, and loads it into a target system. It is the ETL system that integrates, transforms, and loads the data into a dimensional data store (DDS). A DDS is a database that stores the data warehouse data in a different format than OLTP. The data is moved from the source system into the DDS because data in the DDS is arranged in a dimensional format that is more suitable for analysis, and doing so avoids querying the source system directly. Another reason is that a DDS is a one-stop shop for data from several source systems.
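To make the extract-and-load idea concrete, here is a minimal hedged sketch of one such step expressed in T-SQL; the schema and table names are hypothetical, not the actual Classic Cars database:

    -- Extract step sketch: copy order rows from the source system
    -- into the corresponding staging table.
    INSERT INTO stage.Orders (OrderNumber, OrderDate, CustomerNumber)
    SELECT o.OrderNumber, o.OrderDate, o.CustomerNumber
    FROM   source.Orders AS o;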

Figure 1.1: A Data Warehouse in its simplest form.

What is Data Warehousing?

This is the process of designing, building, and maintaining a data warehouse system.

Why build a Data Warehouse?

The most compelling reason for an organization to want a data warehouse is to help it make sense of the vast amount of transactional data the business generates, the volume of which grows tremendously from day to day. Before the advent of data warehousing, data from OLTP systems was typically archived onto magnetic disk and kept in storage for a period of time, in case something went wrong and the data needed to be restored, or, as in the banking and insurance industries, because regulations required it, and also for performance reasons.

It was not until much later that the potential these data hold for analyzing business activities over time, as well as for forecasting and trend analysis, was realized. Even then, it was not quite feasible to get a consolidated or integrated overview of the data, due to the lack of available technology, and because the information often came from several disparate systems that the available reporting tools could not deal with.

Technology has come a long way, and data warehousing has matured with it. Any organization that runs an OLTP system in the day-to-day running of its business knows that the information contained within such systems, when analyzed properly, can help leverage the business and support management decision making.

It is important to mention at this early juncture that the decision to build a data warehouse should, to a large extent, be a business decision and not one of technology. Early data warehouse projects failed because project managers focused on delivering a technology, and at the end of the day they succeeded. But what they delivered was beautiful nonsense: nice to look at and state of the art, but of little benefit to business users. The business users and their needs were not properly incorporated into the data warehousing effort; instead the focus was on delivering the technology.

These projects failed not because a data warehouse was not delivered. On the contrary, they delivered a product that did not meet the needs of the business users, and as a result it was abandoned.

It is of utmost importance to get business users involved in every stage of the data warehouse development cycle, and to put in place a mechanism for constant interaction and feedback sessions, from the moment a need is identified until the final delivery of a fully functional data warehouse.

The Classic Cars Case Study

During the course of this project, we will build a data warehouse for a fictitious company called Classic Cars Inc. We try to cover all the core aspects of data warehousing: architecture, methodology, requirements, data modeling, ETL, metadata and reports.

Building a complete data warehouse within our time frame and with our human resources is not feasible, so it is very important that we define a scope for our project. We do this by analyzing the source system, to learn what kind of data resides in it and what we can derive from it. The Classic Cars source database contains sales order transaction data, which makes it ideal for constructing a sales data mart.

Classic Cars Inc. is a company in the business of selling scale models of classic/vintage cars, aeroplanes, ships, trucks, motorbikes, trains and buses. Its customer base spans the globe. It sells only to retailers in different regions, and there is usually more than one customer in a country. The company itself is headquartered in the USA and has branch offices in several countries. Each branch office is responsible for a different geographical region.

Customers send in their orders and the company ships them via courier. Each customer has a responsible employee who deals with them. The company also extends credit facilities to its customers, and each customer has a credit limit depending on their standing with the company. Customers usually mail in their payment checks after they receive their orders. The company does not manufacture the products it sells, yet there is no information in the database about its suppliers. We can only assume that its operations are not fully computerized, or that it runs several disparate systems.


Summary

In this chapter we gave a breakdown of what data warehousing is. We explained what should be the driving force behind every decision to build a data warehouse, and we finished by giving an introduction to our case study. In the next chapter we will look at the various data warehouse architectures.


Chapter 2

The Data Warehouse Architecture

In this chapter we will give a brief overview of data warehouse elements. We will explain typical data warehouse architectures and explain which one we have chosen and why.

A data warehouse system comprises two architectures: the data flow architecture and the system architecture. The system architecture deals with the physical configuration of the servers, network, software, storage and clients, and will not be discussed in this report.

Choosing which architecture to implement when building a data warehouse is largely based on the business environment the warehouse will operate in: for example, how many source systems feed into the data warehouse, how the data flows within the data stores to the users, or what kind of data end-user applications will request. Figure 2.1 illustrates the basic elements of a data warehouse.


Figure 2.1: Basic elements of a Data Warehouse

Data Flow Architecture.

According to [3], there are four data flow architectures: single dimensional data store (DDS), normalized data store (NDS) + DDS, operational data store (ODS) + DDS, and federated data warehouse. The first three use a dimensional model as their back-end data stores, but they differ in the middle-tier data store. The federated data warehouse architecture consists of several data warehouses integrated by a data retrieval layer.

We have chosen to implement the single DDS architecture, partly because our data warehouse will be fed from only one source system and our DDS will consist of only the sales data mart, and partly because the single DDS is the simplest, quickest and most straightforward architecture to implement. The architecture is by every means extensible: it can quite easily be scaled up to be fed by more than one source system, and the DDS can also comprise several data marts.

Figure 2.2: A single DDS Data Warehouse architecture.


A data store is one or more databases or files containing data warehouse data,arranged in a particular format and involved in data warehouse processes [3].

The stage is an internal data store used for transforming and preparing the data obtained from the source systems before the data is loaded into the DDS. Extracting data into the stage minimizes the connection time with the source system and allows processing to be done in the staging area without undue strain on the OLTP systems. We have incorporated the staging area to keep the design extensible as well: if in the future the DDS is fed from multiple source systems, the staging area will be vital for the processing and transformation.

The dimensional data store (DDS) is a user-facing data store, in the form of a database, made up of one or more data marts, with each data mart comprising dimension and fact tables arranged in dimensional format for the purpose of supporting analytical queries. We will describe the format of the DDS later.

For applications that require the data to be in the form of a multidimensional database (MDB) rather than a relational database, an MDB is incorporated into our design. An MDB is a database where the data is stored in cells, and the position of each cell is defined by a number of variables called dimensions [3]. Each cell represents a business event, and the values of the dimensions show when and where this event happened. The MDB is populated from the DDS.

In between the data stores sit the ETL processes that move data from one data store (the source) into another (the target). Embedded within the ETL is the logic to extract, transform and load the data. Information about each ETL process is stored in metadata: the source and target information, the transformations applied, the parent process, and each ETL process's run schedule.

The technology we have chosen for this data warehousing project is Microsoft SQL Server Integration Services and Analysis Services (MSSIS, MSSAS). It provides a platform for building data integration and workflow applications: an integrated set of tools providing database, multidimensional cube, ETL and reporting capabilities. It also includes the Business Intelligence Development Studio (BIDS), which allows us to edit SSIS packages.
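Packages developed in BIDS can also be executed outside the designer using the dtexec command-line utility that ships with SQL Server. For example (the package path here is hypothetical):

    dtexec /F "C:\ETL\StageToDDS.dtsx"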


Summary

In this chapter we explained what constitutes a data warehouse architecture. We mentioned the four types of data flow architecture, explained why we adopted the single DDS architecture, and went on to describe it in detail. We also introduced the technology we will be using. In the next chapter we will explain the methodology we will follow to build the data warehouse, and why we have adopted that particular approach.


Chapter 3

The Methodology

In this chapter we discuss the process we will adopt in building our data warehouse. We have chosen to follow Ralph Kimball's Four-Step Dimensional Design Process [2]. The approach was mentioned and recommended in all the literature we read, and it is followed by experts in the field; it was quite easy to see why after consulting The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling [2] ourselves. Dimensional modelling is well outlined and quite straightforward, and we felt it provided us with the right footing to hit the ground running when it came to building our own data warehouse.

Figure 3.1: Key Input to the four-step dimensional design process

The Four-Step Design Process.

STEP 1: Selecting a business process to model. A process is a natural business activity performed in an organization that is typically supported by a source system [3]. It should not be confused with a business department: orders, purchasing, shipments, invoicing and inventory all fall under business processes. For example, a single dimensional model is built to handle orders rather than building separate models for the sales and marketing departments. That way both departments can access the orders data, the data is published once, and inconsistencies are avoided.

After a careful analysis of our source system database, we have selected sales as the business process to model, because it is the only model that can be supported by the data available to us in the source system. We will build a sales data mart for Classic Cars Inc., which should allow business users to analyze individual and overall product sales and individual stores' performance.

The norm would have been to set up a series of meetings with the prospective users of the data warehouse, as a means of gathering requirements and selecting which model to implement, but because we do not have this opportunity, we are confined to selecting the model which we feel can best be implemented from the data available in our source system database.

STEP 2: Declaring the grain of the business process. Here we identify what exactly constitutes a row in the fact table. The grain conveys the level of detail associated with the fact table measurements [3]. Kimball and Ross recommend that a dimensional model be developed for the most atomic information captured by a business process. Typical examples of suitable candidates:

• An individual line item on a customer's retail sales ticket, as measured by a scanner device.

• A daily snapshot of the inventory levels of each product in a warehouse.

• A monthly snapshot for each bank account.

When data is in its atomic form, it provides maximum analytic flexibility, because it can be rolled up and cut through (sliced and diced) in every possible manner. Detailed data in a dimensional model is most suitable for ad hoc user queries: a must if the data warehouse is to be accepted by the users.

STEP 3: Choosing the dimensions. By choosing the correct grain for the fact table, the dimensions automatically become evident: they are basically the fields that describe the grain items. We try to create very robust dimensions, which means enriching them with descriptive, text-like attributes: fields like order date, which represents the date the order was placed, or product description, which helps to describe the product, and so on.


As we understand the problem better, more dimensions will be added as required. Sometimes adding a new dimension causes us to take a closer look at the fact table. Adding additional dimensions should, however, not cause additional fact rows to be generated.

STEP 4: Identifying the numeric facts that will populate each fact table row. Numeric facts are basically business performance measures. According to [2], all candidate facts in a design must be true to the grain defined in step 2. In our case, an individual order details line includes facts like the quantity sold, the unit cost amount and the total sale amount. These facts are numeric, additive figures that allow for slicing and dicing; their sums will be correct across dimensions, and additional measures can be derived or computed from them. With the proper facts, figures like gross profit (sales amount - cost amount) can easily be computed, and this derived figure is also additive across dimensions.

In building a data warehouse, it is highly important to keep the business users' requirements and the realities of the source data in tandem. One should normally use an understanding of the business to determine what dimensions and facts are required to build the dimensional model. We will do our best to apply Kimball and Ross's four-step methodology to what we believe would be the normal business requirements for this project.


Summary

In this chapter, we outlined Ralph Kimball's four-step methodology and presented why it is very popular in the data warehousing community. We talked briefly about our constraint of not having business users to interact with as a means of gathering business requirements for this project, and how we hope to work around this. In the next chapter, we will discuss the functional requirements for the data warehouse.


Chapter 4

Functional Requirements

Before diving into the process of data warehousing, it is important to define what is expected from the completed data mart, i.e. what the business users expect to be able to do with our system or, as in our case, what we believe will help Classic Cars achieve their business objectives. Functional requirements define what the system does. By defining the functional requirements, we have a measure of success at the completion of the project, as we can easily look at the data warehouse and determine how well it conforms to, or provides answers to, the various requirements posed in table 4.1.

In trying to define the functional requirements, we explored the source system and analyzed the business operations of Classic Cars. In the end, we agreed that the data warehouse should be able to help users provide answers to:

No.  Requirement                             Priority
1    Customer purchase history               High
2    Product order history                   High
3    Product sales per geographic region     High
4    Store sales performance                 High
5    Customer payment history                High
6    Buying patterns per geographic region   High

Table 4.1: Functional requirements for the Classic Cars Data Warehouse.


4.1 Summary

In this short but very important chapter, we outlined what the business users expect from our finished data warehouse. This will very much be the yardstick that determines whether the data warehouse is accepted by the users or not. A data warehouse that does not meet the expectations of the business users will not be used, and from that perspective would be deemed to have failed.

In the next chapter, we combine the functional requirements and the methodology and try to come up with a dimensional model of our data warehouse.


Chapter 5

Data Modelling

We start off this chapter by explaining some dimensional modelling terms, and then design the data stores. By looking at the functional requirements, we know what to include in our data stores. We will use the dimensional modelling approach and follow the Four-Step Dimensional Design Process [2] outlined in the previous chapter. We will first define and then build our fact table, surrounded by the dimension tables. The contents of our fact and dimension tables are dictated by the functional requirements defined in the previous chapter. We will also construct a data hierarchy and a metadata database.

5.2 Data Modelling Primer

Fact Table: A fact table is the primary table in a dimensional model, where the numerical performance measurements of the business are stored [2]. Measurements from a single business process are stored in a single data mart. A fact represents a business measure, e.g. quantities sold or dollar sales amount per product, per day, in a store. The most useful facts in a fact table are numeric and additive, because the usual operation on warehouse data is retrieving thousands of rows and adding them up.

Fact tables contain a primary key which is a combination of primary keys from the dimension tables (foreign keys). Also known as a composite or concatenated key, this helps to form a many-to-many relationship between the fact table and the dimension tables. Not every foreign key in the fact table is needed to guarantee uniqueness.

Fact tables may also contain a degenerate dimension (DD) column. This is a dimension with only one attribute, and as such it is added to the fact table, as opposed to having a dimension table of its own with only one column.

Dimension Tables: These contain the textual descriptors that accompany the data in the fact table. The aim is to include as many descriptive attributes as possible, because they serve as the primary source of query constraints, groupings and report labels. For example, when a user asks to see model sales by country or by region, country and region must be available as dimension attributes. Dimension attributes are the key to making the data warehouse usable and understandable, and should contain verbose business terminology as opposed to cryptic abbreviations [2].

Dimension tables are highly denormalized and as a result contain redundant data, but this is a small price to pay for the trade-off: what we gain is ease of use and better query performance, as fewer joins are required.

The data warehouse is only as good as its dimension attributes. Dimension tables also represent hierarchical relationships in the business.

5.2.1 Dimensional Model

When we join the fact table together with the corresponding dimension tables, we get what is known as a data mart. This forms a star-like structure and is also referred to as the star join schema [2]. The star schema is based on simplicity and symmetry, and it is very easy to understand and navigate. Because data in the dimension tables is highly denormalized and contains meaningful, verbose business descriptors, users can quickly recognize that the dimensional model properly represents their business.

Another advantage of using a dimensional model is that it is gracefully extensible to accommodate changes [2]. It can easily withstand unexpected changes in user behavior. We can easily add completely new dimensions to the schema, as long as a single value of that dimension is defined for each existing fact row.

It has no built-in bias as to query expectations and certainly no preference for likely business questions. All dimensions are equal and present symmetrical, equal entry points into the fact table. The schema should not have to be adjusted every time users come up with new ways to analyze the business.

The key to achieving this lies in the process of choosing the granularity, as the most granular or atomic data has the most dimensionality [2]. According to [2], atomic data that has not been aggregated is the most expressive. Our fact table incorporates atomic data, and so it should be able to withstand ad hoc user queries: a must if our warehouse is to be useful and durable. Creating a report should be as simple as dragging and dropping dimensional attributes and facts into a simple report.

5.2.2 Metadata

Metadata is the encyclopedia of a data warehouse: it contains all the information about the data in the data warehouse. It supports the various activities required to keep the data warehouse functioning, be they technical (information about source systems, source tables, target tables, load times, last successful load, transformations on data, etc.), administrative (indexes, view definitions, security privileges and access rights, ETL run schedules, run-log results, usage statistics, etc.) or business user support (user documentation, business names and definitions, etc.). We build a metadata database, which will serve as the catalogue of the data warehouse.
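As a hedged sketch of one piece of such a catalogue, a run-log table like the following could back the "last successful load" and scheduling information; all names and types here are our assumptions:

    -- Hypothetical ETL run log kept in the metadata database.
    CREATE TABLE etl_load_log (
        load_id      INT IDENTITY(1,1) PRIMARY KEY,
        package_name VARCHAR(100) NOT NULL,   -- e.g. the stage ETL package
        source_table VARCHAR(128) NOT NULL,
        target_table VARCHAR(128) NOT NULL,
        started_at   DATETIME     NOT NULL,
        finished_at  DATETIME     NULL,
        rows_loaded  INT          NULL,
        succeeded    BIT          NULL        -- supports "last successful load"
    );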

5.3 Designing the Dimensional Data Store

In order to produce a good DDS design, we must ensure that the design of the DDS is driven by the functional requirements defined in the previous chapter. This is because the functional requirements represent the kind of analysis that the business users will want to perform on the data in the warehouse.

Figure 5.1: Key Input to the four-step dimensional design process


5.3.1 STEP 1: Selecting the Business Model

Understanding the business requirements, coupled with analysis of the available data, helps us to choose which business process to model. In a normal real-life situation, we would choose an area that would have an immediate and substantial impact on business users, as a means of getting them to adopt the system easily. However, we are constrained by the fact that the only data available to us in our source system is sales data, so our business process to model is product sales. We will build a product-sales data mart.

A data mart is simply a fact table surrounded by its corresponding dimension tables, modelling a business process. It will allow users to answer the questions posed in the functional requirements.

The product sales event happens when a customer, through a sales rep, places an order for some of the products. The roles (who, what, where) in this case are the customer, the product and the store. The measures are the quantity, unit price and value of sales. We will put the measures into the fact table and the roles (plus dates) into the dimension tables. The business events become individual rows in the fact table.

Figure 5.2: Preliminary Sales Data Mart

5.3.2 STEP 2: Declaring the Grain

Declaring the grain means deciding what level of data detail should be available in the dimensional model, the goal being to create a dimensional model for the most atomic information captured by the business process outlined in step 1.

Different arguments abound about how low, or how atomic, the grain should be. According to Ralph Kimball, tackling data at its lowest, most atomic grain makes sense on multiple fronts. Atomic data is highly dimensional: the more detailed and atomic the fact measurement, the more things we know for sure. Atomic data also provides maximum analytic flexibility, because it can be constrained and rolled up in every possible way. Detailed data in a dimensional model is poised and ready for the ad hoc attack by the business users.

Selecting a higher-level grain limits the potential of the model to fewer, less detailed dimensions and makes it vulnerable to unexpected user requests to drill down into the details. The same would also be true if summary or aggregated data were used.

In our Classic Cars study, we have chosen an individual line item in the order details transaction table as the most granular data item. In other words, the grain, i.e. one row of the product-sales fact table, corresponds to one unit of a model sold (car, truck, motorcycle, etc.).

By choosing such a low-level grain, we are not restricting the potential of the data warehouse by anticipating user queries; we are ensuring maximum dimensionality and flexibility, because queries need to cut through the details (slicing and dicing) in precise ways, whether to compare sales between particular days or to compare model sales by scale model size.

While users will probably not want to analyze every single line item sale in a particular order, providing access to summarized data only would make it impossible to answer such questions.

5.3.3 STEP 3: Choosing the dimensions

After we have identified what constitutes the business measure of the event we are modelling (product sales), certain fields which describe or qualify the event (the roles) become obvious: product, store, customer and date will form the dimensions. We will also have the order number as a dimension, but because it does not have any other attributes of its own, it will sit in our fact table as a degenerate dimension. It will help to identify products belonging to a particular order.

Dimension tables need to be robust and as verbose as possible. Dimensions implement the user interface to a data warehouse, and it is not uncommon to have a dimension table containing 50 to 100 columns. Unlike fact tables, they are updated infrequently, and updates are usually minor additions, like adding a new product or customer or updating prices.

5.4 Slowly Changing Dimensions

This brings us to the problem of slowly changing dimensions (SCD) and how it is handled. If we recall from our definition of a data warehouse that it stores historical data, what then happens if, for example, the value of a dimensional attribute changes?

Say, for example, an office that was overseeing a particular region, or a customer, changes address. Merely updating the dimension by simply changing the address will mean that all previous transactions carried out under the old region or address can no longer be isolated, and we might not be able to analyze the information, because queries would have no means of referring to them explicitly, since they are now part of the new region or address. A fundamental function of our data warehouse, storing historical data, would no longer hold.

According to [2], the problem of SCD can be handled either by overwriting existing values (type 1 SCD), by preserving the old attribute values as rows (type 2), or by storing them as columns (type 3).

A type 1 response is only suitable if the attribute change is a correction or if there is no value in retaining the old description. This is not usually desirable, and it should be up to the business users to determine whether they want to keep the old value or not.

The type 2 response is the most common technique, as it is the most flexible to implement and does not limit the number of times we can reflect a change in a dimension attribute. It involves adding a new dimension row every time an attribute changes: the current value is preserved in the existing row and the new value is reflected in the new row. Using this method we stay true to our definition of a data warehouse keeping historical data, while also allowing users to track historical changes and perform analysis constrained on either or both values.

Let us suppose, in our case study, that a particular car model is sold only in region 1 up until a certain period, and that Classic Cars then decides to discontinue its sale there and move it to region 2. Obviously, under a type 1 response, from the moment the field attribute is corrected to reflect region 2 as the new region, there will be no way of analyzing car model X's sales performance prior to when it was moved to region 2.

Furthermore, analysis of the sales figures in region 2 will, albeit incorrectly, include car model X's sales figures from when it was in region 1 as part of region 2's. Using the type 2 approach, when the region changes we add a new dimension row to reflect the change in the region attribute. We then have two product dimension rows for car model X:

Product Key   Product Description   Region     Product Code
1233          Ferrari Blazer        Region 1   FERR-12
2346          Ferrari Blazer        Region 2   FERR-12

Table 5.1: Type 2 response to SCD


The above table also helps us to see why we have to introduce surrogate keys into our dimension tables, as opposed to using the natural keys. The surrogate keys help to identify a unique product attribute profile that was true for a span of time [2]. In addition, we do not need to go into the fact table to modify the product keys, and the new dimension row helps to automatically partition history in the fact table.

Constraining a query on region 1, for car model X prior to the change date, will only reflect product key 1233, from when car model X was still in region 1; constraining on a date after the change will no longer reflect the same product key, because it now rolls up into region 2.

We also introduce a date stamp column on the dimension row, which helps track when new rows are added; a valid/invalid attribute is added as well, to indicate the state of the attributes. Effective and expiration dates are necessary in the staging area, because they help to determine which surrogate keys are valid when the ETL is loading historical fact records.
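To make the type 2 mechanics concrete, here is a hedged T-SQL sketch of recording a customer address change; the table, column and variable names are illustrative assumptions, not our actual DDS schema:

    -- Type 2 SCD sketch: expire the current row, then insert a new row.
    -- A new surrogate key is assigned automatically (IDENTITY column).
    UPDATE dim_customer
    SET    expiry_date = @change_date,
           is_current  = 0
    WHERE  customer_code = @customer_code   -- natural (business) key
      AND  is_current = 1;

    INSERT INTO dim_customer
           (customer_code, customer_name, address,
            effective_date, expiry_date, is_current)
    VALUES (@customer_code, @customer_name, @new_address,
            @change_date, '9999-12-31', 1);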

The type 3 response uses a technique that requires adding a new column to the dimension table to reflect the new attribute. The advantage it offers is that, unlike the type 2 response, it allows us to associate the new value with old fact history, or vice versa [2]. If we remember that in the type 2 response the new row had to be assigned a new product key (surrogate key) to guarantee uniqueness, the only way to connect the two rows was through the product code (natural key). Using a type 3 response, the solution would look like this:

Product Key   Product Description   Region     Prior Region   Product Code
1233          Ferrari Blazer        Region 2   Region 1       FERR-12

Table 5.2: Type 3 response to SCD


The type 3 response is suitable when there is a need to support both the current and the previous view of an attribute value simultaneously. But it is quite obvious that adding a new column involves structural changes to the physical design of the underlying dimension table, so it may be preferable only if the business users decide that just the last two or three prior attribute values need to be tracked. The biggest drawback arises if we need to track the impact of intermediate attribute values [2].

There are hybrid methods for solving the problem of SCD which combine features of the above techniques, but while they can offer greater flexibility, they usually introduce more complexity and, according to [2], should be avoided if possible.

We introduce surrogate keys into our dimension tables and use them as the primary keys. This approach is more suitable because, for one, it helps to tackle the problem of SCD. It is also essential for the stage ETL process, especially because we have chosen the type 2 response to dealing with SCDs.

Surrogate keys help the ETL process keep track of rows that already exist in the data warehouse, and so avoid reloading them. Surrogate keys are very easy to automate and assign, because they are usually integer values; the last assigned value is stored in metadata and is easily retrieved and incremented on the next run.

5.5 Data Hierarchy

Dimension tables often also represent hierarchical relationships in the business. Hierarchies help us to roll up and drill down to analyze information based on related facts.

For example, state rolls up into country and country into region. In the date dimension, days roll up into weeks, weeks into months, and months into periods. Products roll up into product lines, and product lines into vendors. Having a hierarchy translates into better query performance and more efficient slicing and dicing through grouping along a path. Users are able, for example, to view a product's performance during a week and later roll it up into a month, and further into a quarter or period. All four of our dimensions have hierarchies.


Figure 5.3: The Product dimension hierarchy

Figure 5.4: The Customer dimension hierarchy


Figure 5.5: The Date dimension hierarchy

Figure 5.6: The Office dimension hierarchy with multiple paths.


5.6 The Date Dimension

Every business event takes place on a particular date, and so the date dimension is very important to a data warehouse. It is the primary basis of every report, and virtually every data mart is a time series [2]. It is also common to every data mart in a data warehouse and, as a result, must be designed correctly. When modelling the date dimension, care must be taken to make sure that it is filled with the attributes that are necessary for every fact table that will be using it.

Assigning the right columns will make it possible to create reports that, for example, compare sales on a Monday with sales on a Sunday, or compare one particular month with another. According to [3], the columns or attributes in a date dimension can be categorized into four groups (a population sketch follows the list):

• Date formats: The date format columns contain dates in variousformats.

• Calendar date attributes: The calendar date attributes containvarious elements of a date, such as day, month name, and year.

• Fiscal attributes: The fiscal attribute columns contain elementsrelated to the fiscal calendar, such as fiscal week, fiscal period, andfiscal year.

• Indicator columns: These contain Boolean values used to determine whether a particular date satisfies a certain condition, e.g. a national holiday.
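Because the date dimension is independent of the source data, it can be populated once, up front, with one row per calendar day. A minimal sketch, under assumed column names and an assumed date range:

    -- Sketch: generate one dim_date row per day; names are illustrative.
    DECLARE @d DATETIME;
    SET @d = '2003-01-01';
    WHILE @d <= '2010-12-31'
    BEGIN
        INSERT INTO dim_date
               (date_key, full_date, day_name, month_name, calendar_year,
                is_weekend)
        VALUES (CONVERT(INT, CONVERT(CHAR(8), @d, 112)),  -- e.g. 20030101
                @d,
                DATENAME(weekday, @d),                    -- calendar attribute
                DATENAME(month, @d),                      -- calendar attribute
                YEAR(@d),
                CASE WHEN DATENAME(weekday, @d) IN ('Saturday', 'Sunday')
                     THEN 1 ELSE 0 END);                  -- indicator column
        SET @d = DATEADD(day, 1, @d);
    END;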


Figure 5.7: The Date dimension table.


5.7 The Office Dimension

The office dimension describes every branch office outlet in the business. It is a geographic dimension: each outlet is a location, and so it can be rolled up into city, state or country. Each office can also easily be rolled up into its corresponding geographic region. To accommodate the movement of an office's coverage region, we have introduced the store key as a surrogate key, and this will be used to implement a type 2 SCD response.

Figure 5.8: The Office dimension table.


5.8 The Product Dimension

The product dimension describes the complete portfolio of products sold by the company. We have introduced the product key as the surrogate key. It is mapped to the product code in the source system(s), which helps to integrate product information sourced from different operational systems. It also helps to overcome the problem that arises when the company discontinues a product and assigns the same code to a new product, and, as we mentioned earlier, the problem of SCD. Apart from a very few dimension attributes changing over time, most attributes stay the same.

Hierarchies are also very apparent in our product dimension. Products roll up into product line, product scale and product vendor, and business users will normally constrain on a product hierarchy attribute. Drilling down simply means adding more row headers, and drilling up is the opposite. As with all dimension tables, we try to make our attributes as rich and textually verbose as possible, since they will also be used to construct row headers for reports.


Figure 5.9: The product dimension table.


5.9 The Customer Dimension

The customer forms an important part of the product sales event; the customer is actually the initiator of this event. All of Classic Cars' customers are commercial entities, since they are all resellers.

The customer name field in this respect makes sense as only one column, but we do have a contact first name and a contact last name for correspondence. The customer key is a surrogate key that helps with SCD. Attributes are chosen based on the business users' requirements outlined in the functional requirements.

Figure 5.10: The customer dimension table.


5.10 Step 4: Identifying the Facts.

The final step is identifying the facts that will form the columns of the fact table. The facts are dictated by the grain declared in step 2. According to [2], the facts must be true to the grain, which in our case is an individual order line item. The facts available to us are the sales quantity, the buy price per unit and the sales amount, all purely additive across all dimensions.

We will be able to calculate the gross profit (sales amount - buy price) on items sold. We can calculate the gross profit of any combination of products sold in any set of stores over any number of days. And in the cases where stores sell products at slightly different prices from the recommended retail price, we should also be able to calculate the average selling price for a product in a series of stores or across a period of time.

Kimball et al. recommend that these computed facts be stored in the physical database, to eliminate the possibility of user error: the cost of a user incorrectly deriving gross profit overwhelms the minor incremental storage cost. We agree with this, since storage cost is no longer the issue it once was.
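In SQL Server, one hedged way to do this is a persisted computed column, so that the derived fact is physically written alongside the base facts; the table and column names are assumptions (the full table is sketched in the next paragraph):

    -- Store gross profit physically as a persisted computed column.
    ALTER TABLE fact_product_sales
    ADD gross_profit AS (sales_amount - cost_amount) PERSISTED;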

Since the fact table connects to our dimension tables to form a data mart, it must contain attributes that link it with the dimension tables; in other words, attributes that enforce referential integrity. All the surrogate keys in the dimension tables are present in the fact table as foreign keys, and a combination of these keys helps us define a primary key for our fact table to guarantee uniqueness. Our fact table also contains two degenerate dimensions, namely the order number and the order line number.
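Putting these pieces together, a hedged sketch of the fact table's structure, with names and types that are our assumptions rather than the exact DDS schema, might look like this:

    -- Fact table sketch: surrogate foreign keys to each dimension,
    -- two degenerate dimensions, and a composite primary key.
    CREATE TABLE fact_product_sales (
        date_key          INT NOT NULL REFERENCES dim_date (date_key),
        customer_key      INT NOT NULL REFERENCES dim_customer (customer_key),
        product_key       INT NOT NULL REFERENCES dim_product (product_key),
        office_key        INT NOT NULL REFERENCES dim_office (office_key),
        order_number      INT NOT NULL,             -- degenerate dimension
        order_line_number INT NOT NULL,             -- degenerate dimension
        quantity_ordered  INT NOT NULL,
        cost_amount       DECIMAL(10,2) NOT NULL,   -- buy price for the line
        sales_amount      DECIMAL(10,2) NOT NULL,
        PRIMARY KEY (date_key, customer_key, product_key, office_key,
                     order_number, order_line_number)
    );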


Figure 5.11: The Product Sales Data mart.


5.11 Source System Mapping.

After completing the DDS design, the next step is to map the source system columns to the DDS columns. This helps the ETL process during the extraction phase to know which columns to extract from and which target columns to populate.

Since the fact table columns comprise attributes from different tables, which in turn could come from different source systems, we need a source system code in order to identify the source system a record comes from, and for the ETL to be able to map to the column in whatever system it resides in. The only requirement is that the source system code and its mapping be stored in the metadata database.
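As a hypothetical sketch of how such a mapping could be stored in the metadata database (all names are our assumptions):

    -- Source-to-target column mapping kept in the metadata database.
    CREATE TABLE column_mapping (
        source_system_code SMALLINT     NOT NULL, -- identifies the source system
        source_table       VARCHAR(128) NOT NULL,
        source_column      VARCHAR(128) NOT NULL,
        target_table       VARCHAR(128) NOT NULL,
        target_column      VARCHAR(128) NOT NULL,
        transformation     VARCHAR(200) NULL      -- none needed in our case
    );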

At this stage we also consider the necessary transformations and calculations to be performed by the ETL logic during extraction. But because our source system database is rather simple and straightforward, we will not be performing any.


5.12 Summary

In this chapter we went in depth into data modelling and what guides the modelling process, and then designed our DDS. We started by defining some data modelling jargon, and we used the Kimball four-step approach in our DDS construction process. We also looked at how columns from the source system are mapped to the columns in the DDS. In the next chapter, we will look at the physical elements of our data warehouse.


Chapter 6

The Physical Database Design

In this chapter we look at the physical structure of our data warehouse and its supporting technology. We will show how we implement our DDS and data warehouse structure using Microsoft SQL Server. We will not discuss hardware structure or requirements, as these are beyond our defined scope for this project.

In a normal business environment, the source system, the ETL server and the DDS would ideally run on separate systems, the more so because the source system is an OLTP system and we must not interfere with its smooth running. For the purpose of implementation, we needed a way to simulate multiple systems on a single computer. Our solution is to represent each element as a separate database running on a single SQL Server installation on one computer. What we have is an environment where each database behaves like an individual system; using MSSIS, we can connect and move data between the different elements of the data warehouse through OLE DB, just as we would if the databases resided on separate systems.

6.2 The source system database.

This simulates our source system: a database of transactions pulled from the order management system of the Classic Cars company. It is an OLTP system and records the day-to-day transactions of receiving and dispatching orders, as well as inventories and all the supporting data. In our case it is a single system, but as Classic Cars has stores in various regions, it would ideally be OLTP data from the various stores. As data warehouse designers and implementers, we do not create the source systems, but they are the first place we start our feasibility study for the functional requirements of the data warehouse. Careful thought must be put into giving the OLTP source system as little interference as possible from the other elements of the data warehouse. According to Kimball and Ross, a well-designed data warehouse can help relieve OLTP systems of the responsibility of storing historical data.

6.3 The Staging area database.

In keeping with that last point, it is essential that we have a staging area. A data warehouse differs from an OLTP system in that the data in a data warehouse is accurate only up until the last time it was updated. A data warehouse does not contain live data and is not updated in real time.

Updating the data in a data warehouse might mean uploading hundreds of megabytes to tens of gigabytes of data from OLTP systems on a daily basis. OLTP systems are not designed to tolerate this kind of extraction. So, in order to avoid slowing the source systems down, we create a stage database: the stage ETL connects to the source system at a predefined time of day (usually a time of low transaction traffic), extracts the data, dumps it into the stage database and immediately disconnects from the source system database.

The internal structure of the staging database is basically the same as that of the source system, except that the tables have been stripped of all constraints and indexes. We have added two columns, the source system code and the date of record creation, as a means of identifying the originating source of the data and the date of extraction as a bookmark. These are for auditing and ETL purposes; they let the ETL avoid reloading the same data on the next load. The stage ETL performs all the necessary transformations on the extracted data in this area and then loads it into the dimension and fact tables of the DDS.
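As a sketch, the staged copy of the Customers table might look like the following: the same columns as the source, no keys or indexes, plus the two audit columns (column names are our own illustration):

-- Staging copy of the source Customers table: no constraints, no indexes
CREATE TABLE Stage.dbo.Customers (
    customerNumber     int,          -- no PRIMARY KEY: constraints are stripped in the stage
    customerName       varchar(50),
    country            varchar(50),
    created            datetime,
    lastUpdated        datetime,
    -- (remaining source columns omitted for brevity)
    source_system_code int,          -- which source system the row came from
    date_created       datetime      -- when the row was extracted: the bookmark
);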


The stage database area is akin to a workshop in that it is not accessible to user queries. It is just an intermediate place that data warehouse data pass through on their way to the DDS.

6.4 The DDS database.

The DDS database houses the Classic Cars DDS that contains our dimension and fact tables. Our data mart contains four dimensions and one fact table, but in the real world a DDS could house tens of data marts, and we would recommend giving it a standalone system of its own. This is our data presentation area and will be accessed by various report writers, analytic applications, data mining and other data access tools. We aim to design a DDS that is unbiased and transparent to the accessing application or tool, so that users are not tied to any particular tool for querying or analysis.

Due to referential integrity, it is important to create the dimensions before the fact table, since the fact table references every dimension through a foreign key.
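The dependency is visible in the DDL: the fact table cannot be created, or loaded, before the dimensions it references exist. A trimmed sketch, with table and key names assumed:

-- A dimension must exist before the fact table that references it
CREATE TABLE dimCustomer (
    customerDimKey int IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
    customerNumber int,                            -- business key from the source
    customerName   varchar(50)
);

CREATE TABLE factSales (
    customerDimKey  int NOT NULL REFERENCES dimCustomer(customerDimKey),
    -- (productDimKey, officeDimKey and dateDimKey reference their dimensions the same way)
    quantityOrdered int,
    priceEach       money
);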

6.5 The Metadata database.

The metadata database maintains all the information in a data warehouse that is not actual data itself: it is data about data. Kimball likens it to the encyclopedia of the data warehouse. Under normal circumstances it would be filled with information about everything that is done in the warehouse and how it is done, supporting all user groups, from technical to administrative to business users.

Our metadata database is a stripped-down version whose primary purpose is to support our ETL processes. We store information about the source systems and column mappings, and about ETL scheduling. The date and time of the last successful and unsuccessful loads are recorded, as are the last increments of the surrogate keys. The metadata is the starting point of every ETL process.
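A minimal sketch of the ETL bookkeeping table, the data flow table referenced in the next chapter, with column types assumed:

USE metadata;

-- One row per source table tracked by the stage ETL
CREATE TABLE dbo.data_flow (
    name   varchar(128) PRIMARY KEY,  -- source table name, e.g. 'Customers'
    LSET   datetime NULL,             -- last successful extraction time
    CET    datetime NULL,             -- extraction time of the run in progress
    status varchar(20) NULL           -- outcome of the last run, e.g. 'success' or 'failed'
);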


6.6 Views.

A view is a database object akin to a table, with rows and columns, but it is not physically stored on disk. It is a virtual table defined by a query, often a join, that selects subsets of rows and columns from one or more tables.

We created a view in order to be able to link a sale to a particular store. This was necessary because the store table does not connect to the orders transaction table, and the only way to deduce in which store a transaction took place is through the employee who made the sale. To extract this information, we join the orders transaction table to the employees table through the salesRepEmployeeNumber, and from that we can retrieve the store ID.
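A sketch of such a view, assuming the column names of the Classic Models sample schema (an order links to a customer, the customer to a sales rep, and the rep to an office):

CREATE VIEW dbo.vOrderOffice AS
SELECT o.orderNumber, f.officeCode
FROM Orders o
JOIN Customers c ON c.customerNumber = o.customerNumber          -- who placed the order
JOIN Employees e ON e.employeeNumber = c.salesRepEmployeeNumber  -- who handled the sale
JOIN Offices   f ON f.officeCode     = e.officeCode;             -- the store the rep belongs to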


6.7 Summary

This chapter looked at the physical components of our data warehouse. We explained how we simulate the various elements and environment of a data warehouse on a single system. We have built our databases and can now look forward to the next phase of our implementation: populating the data warehouse. In the next chapter, we will look at how to move data from our source system into the DDS.


Chapter 7

Populating the Data Warehouse

7.1 Introduction

In this chapter we look at how we move data from our source system into the data warehouse. Populating our data warehouse is done in two steps: first the data is moved from our source database to the staging database, where the necessary transformations are applied; thereafter the data is transferred to the DDS.

While transferring the data from the staging database to the DDS, we first need to denormalize it, a necessary step in preparing it for the DDS.

To achieve this goal we have implemented two ETL processes:

• The Stage ETL: this connects to the source system, moves the data to the stage database and disconnects from the source system.

• The DDS ETL: this denormalizes the data and then loads it into the DDS.

Both steps are illustrated in the figure below


Figure 7.1: Data flow through the data warehouse showing ETL processes


7.2 Populating the Stage database

As we mentioned in an earlier chapter, our decision to include the stage database in our data warehouse architecture is primarily to reduce the amount of time during which our ETL is connected to the source system.

To minimize this burden on the source database, we have chosen to implement the incremental extract method in the stage ETL. Using this approach, only the initial load requires that all the data in the source system be moved into the stage database. Thereafter, at regular intervals, usually once a day and normally at a time when the OLTP system is handling fewer transactions, the ETL connects, picks up only the records that are new or updated since its last connection to the source system, and loads them into the data warehouse; hence the name incremental extract method.

To enable the ETL to recognize and extract the data incrementally, we have added the created and lastUpdated timestamp columns to each table in our source database. Below is an extract from the Customer table:

Figure 7.2: Sample customer table showing the created and the lastUpdated columns.
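In SQL terms, adding the two columns is a one-off change to each source table; a sketch for Customers, with defaults of our own choosing:

ALTER TABLE Customers
ADD created     datetime NOT NULL DEFAULT GETDATE(),  -- set when the row is inserted
    lastUpdated datetime NOT NULL DEFAULT GETDATE();  -- must be refreshed on every update
-- SQL Server has no ON UPDATE clause, so lastUpdated would be maintained
-- by the application or by an update trigger on the table.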


We use the metadata database to store the last successful extraction time (LSET) and the current extraction time (CET) for each table in the source system. This mechanism helps the ETL process figure out where to begin the next run of incremental extraction, and also helps with error recovery if there is a failure during an extract and the process does not complete [3].

Figure 7.3: Snapshot from the Metadata data flow table showing the LSET and the CET.


From Figure 7.3 we can see that the last ETL run successfully loaded all the records from the source database to the stage up until the 11th of November, 2008 (the LSET). Therefore, in the next extraction session, the ETL process need only load those records which were created or updated after the last successful extraction time.

As an example, let us assume that we are running our ETL process on the 11th of December 2008 (the CET). Of the three customers shown in the figure above, only one will be transferred to our staging area, namely Atelier graphique. The reason is that this record was last updated on 2008/12/08, which is after the last successful extraction time for the Customers table (2008/11/11).

In order to pick up all the new or updated records from the source database, we must first save the current time as the CET in our metadata data flow table. Next we need to get the LSET for the particular table, for example the Customers table. This is achieved with a simple SQL query:

SELECT LSET FROM metadata.dbo.data_flow WHERE name = 'Customers'

The query returns the LSET for the Customers table. Armed with these two parameters, we can then extract the new or updated customers from our source database with the following query:

SELECT * FROM Customers
WHERE (created > LSET AND created <= CET)
OR (lastUpdated > LSET AND lastUpdated <= CET)

Picking logically correct data also requires that the created or lastUpdated timestamps not exceed the current extraction time; rows stamped after the CET are left for the next run.

For even better performance of our stage ETL process, our staging database tables do away with constraints and indexes. While transferring the data to the staging area, we do not want our ETL process to be bogged down by unnecessary checks for constraint violations. We use constraints to ensure data quality only when inserting new or updating old data in our source systems. That way we are sure that the data that comes to our staging area is correct, and there is no need to double-check it again.

7.3 Data Mappings

Another important step in the design of the ETL process is data mapping. It is common for the columns that make up a table in the DDS to be derived from multiple tables in the source systems. Mapping them helps the ETL process populate the tables with their rightful columns. Columns are mapped to tables in the stage area, and mappings are also done between the stage area and the DDS tables. Below are the data mappings of our source-to-stage ETL process:


Figure 7.4: Column mappings in the ETL process for source system to staging database.


7.4 Control Flow

The next figure shows the control flow diagram of our ETL process. It is based on the incremental extraction algorithm that we described earlier in this chapter. The whole procedure is divided into four main steps:

1. Set the current extraction time (CET = current time).

2. Get the last successful extraction time (LSET) of a particular table.

3. Pick the records that were created or updated during the time interval (LSET < T <= CET) and load them into the stage database.

4. Set the last successful extraction time (LSET = CET).

This ETL process extracts and loads the data for each table in parallel, and the individual tables are not dependent on each other. If an error occurs while executing one of the steps for a table, then only the procedure that works with that table is halted and marked as failed, while the others continue as usual. Step 4 is executed if and only if there were no errors during the whole procedure for that table, i.e. the extraction was successful.
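Put together, the four steps for a single table look roughly like the following T-SQL; in the actual packages these are separate SSIS tasks, and the database and table names follow our earlier sketches:

DECLARE @CET datetime, @LSET datetime;

-- Step 1: fix the current extraction time
SET @CET = GETDATE();

-- Step 2: read the last successful extraction time for this table
SELECT @LSET = LSET FROM metadata.dbo.data_flow WHERE name = 'Customers';

-- Step 3: pick up rows created or updated in the interval and stage them,
-- appending the source system code (here 1) and the extraction date
INSERT INTO Stage.dbo.Customers
SELECT s.*, 1, GETDATE()
FROM SourceSystem.dbo.Customers s
WHERE (s.created > @LSET AND s.created <= @CET)
   OR (s.lastUpdated > @LSET AND s.lastUpdated <= @CET);

-- Step 4: only after a fully successful load, move the bookmark forward
UPDATE metadata.dbo.data_flow SET LSET = @CET WHERE name = 'Customers';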


Figure 7.5: ETL process for populating the staging database.


7.5 Moving Data to the DDS

Now that we have the data in our staging database, it is time to apply transformations, where needed, and move the data to the DDS. This is the task our second ETL process is responsible for. One of the first things to do when moving data to the DDS is to populate the dimension tables before the fact table. That is because our fact table references every dimension table through a foreign key; trying to populate the fact table first would violate its referential integrity.

Having successfully populated the dimension tables, we can safely load the fact table. If no errors occurred and all the data has been successfully transferred to the DDS, we clear the staging database. Since the data is now in the data warehouse, we no longer need to keep a copy of it in the stage database.

Our DDS contains four dimensions: Customer, Product, Office and Date. This ETL process only populates three of them, because the Date dimension is populated just once, when the DDS is created. That dimension rarely changes, except when an organization needs to redefine its fiscal calendar or update a holiday, so we do not need to update it very often. While designing the control flow architecture of this ETL process in Business Intelligence Development Studio, we placed every dimension-populating procedure in a sequence container and made it the starting point of the ETL process. This loads the dimension tables first and separates that task from loading the fact table.

If an error occurs in one of the procedures in the sequence container, all further execution is halted and the entire ETL process results in a failure. The figure below illustrates the control flow diagram from Business Intelligence Development Studio.


Figure 7.6: ETL process for populating the DDS.


7.6 Populating the Dimension tables

One of the most important requirements when designing this ETL process was to incorporate slowly changing dimensions into our data warehouse. This feature makes the loading logic of this ETL process not quite as straightforward as it was when populating the staging database.

For this project we implement Type 1 and Type 2 SCD. Slowly changing dimension handling is used only when populating dimension tables, because dimension records are updated over time, whereas fact table records, once loaded, are not.

Implementing SCD in Microsoft Business Intelligence Development Studio is a rather easy task. For that purpose we use the Slowly Changing Dimension data flow transformation, as seen in the next figure. The same data flow architecture is used to populate each dimension in the Populate Dimensions sequence container.


Figure 7.7: Data flow architecture for populating Customer Dimension.


While populating the Customer dimension we can select columns to receive a SCD Type 1 or Type 2 response. The following figure illustrates the SCD response type for those columns in the Customer dimension.

Figure 7.8: Handling SCD in the Customer Dimension.


The Changing attribute represents a Type 1 SCD response: new values overwrite existing values and no history is preserved. The Historical attribute corresponds to a Type 2 SCD response: changes in these column values are saved in new record rows, and previous values are kept in records marked as outdated. To show a record's current status, we use the currentStatus column, with the possible values Active, meaning the record is up to date, and Expired, meaning the record is outdated. When data passes through the SCD data flow transformation, it can go to three different outputs:

1. If we receive an updated version of a changing attribute, the data is directed to the Changing Attribute Updates Output and the old value in the DDS is overwritten with the new one. Nothing needs to be inserted into the DDS in this case.

2. If a historical attribute is updated, the data is directed to the Historical Attribute Inserts Output. Since we want to keep a history of the previous value of this attribute, we do not update the record; instead we create a new Customer record with the new data and mark it as Active. The old record is then marked as Expired. Both of these records reside in the DDS.

3. Data is redirected to the New Output when we receive a record that is not currently in our DDS. Such records are marked as Active by default and inserted into the DDS.
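In SQL terms, the historical attribute path (output 2) amounts to an expire-and-insert pair. A sketch for a changed customer address, with example values and with the dimension's other columns omitted:

-- Incoming values from the stage (example data)
DECLARE @customerNumber int = 103,
        @newAddress varchar(100) = '54, rue Royale';

-- Expire the currently active version of the record
UPDATE dimCustomer
SET currentStatus = 'Expired'
WHERE customerNumber = @customerNumber AND currentStatus = 'Active';

-- Insert the new version, marked as Active
INSERT INTO dimCustomer (customerNumber, address, currentStatus)
VALUES (@customerNumber, @newAddress, 'Active');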


7.7 Populating the Fact table

The next step after populating the dimension tables is to load data into the fact table. Remember that dimensions are connected to the fact table through surrogate keys. This means that we cannot simply load the fact table with the data we get from the source database: we first need to find each record's matching surrogate key in the dimension tables, and only then can we link those tables together through foreign keys.

What distinguishes this procedure from the ones mentioned earlier is that here the data flow architecture has two data sources. This is because our fact table is composed of columns from two different tables in the data source, namely the Orders and OrderDetails tables; this is not uncommon in data warehousing. So when populating the fact table we first need to join those two tables. We do that with the help of the "Merge Join" data flow transformation, as shown below.

Figure 7.9: Joining the Orders and OrderDetails tables with a Merge Join.
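The transformation corresponds to an ordinary equi-join; in SQL the same result would read roughly:

SELECT o.orderNumber, o.orderDate, o.customerNumber,     -- from Orders
       od.productCode, od.quantityOrdered, od.priceEach  -- from OrderDetails
FROM Orders o
JOIN OrderDetails od ON od.orderNumber = o.orderNumber;  -- orderNumber is the join key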


Before joining the two data sets, it is required that they be sorted first. Here, Sort 1 is the sorted OrderDetails table and Sort is the sorted Orders table. The orderNumber naturally forms the join key.

After joining the two tables, we have all the data we need to form our fact table. The only thing left is to find the surrogate keys that join every fact table record to the dimensions. Since our fact table is connected to four dimensions, every fact record must carry four of these keys, very much like a composite key. At this moment our fact table can join directly with the following dimensions:

• Product dimension on productCode.

• Customer dimension on customerNumber.

• Date dimension on orderDate.

At this point, we do not have a column that directly joins the Office dimension, because neither the Orders nor the OrderDetails tables have a direct link to the Office table in the source database. To work around this, we created a view that joins every order to the office where the order was made, through the sales rep number: the order record contains the employee number of the person who handled the sale, so we use that employee number as a handle to get the office code, and, voilà, problem solved.


Figure 7.10: Joining the Orders, Customers, Employees and Offices tables.


Using the data now in our fact table, we can successfully join with each dimension using a business key. However, for our star join we need to be using surrogate keys. Every dimension stores its business key (although not as its primary key), so we can use it to get the surrogate key with the "Lookup" data flow transformation. The Lookup transformation joins the fact data with one of the dimensions on the business key, retrieves the surrogate key, and adds it to the fact table. The figure below shows a Lookup data flow transformation for getting the Customer dimension surrogate key.

Figure 7.11: Retrieving the Customer dimension surrogate key.


The business key in this example is customerNumber. Using this column we join our fact table with the Customer dimension and retrieve its surrogate key, i.e. customerDimKey. We insert this key into our fact table as a new column. Using the same approach, we get the remaining three surrogate keys.
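In SQL the lookup is again a join on the business key, keeping only the current version of each dimension record; stagedFactRows is a hypothetical name for the joined data set:

SELECT f.*, c.customerDimKey               -- append the surrogate key as a new column
FROM stagedFactRows f
JOIN dimCustomer c
  ON c.customerNumber = f.customerNumber   -- join on the business key
 AND c.currentStatus = 'Active';           -- ignore expired SCD versions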

After getting the surrogate keys for all four dimensions, we can finally insert records into the fact table. The only thing left is to get rid of the business keys that we currently carry in the data set that is to be inserted into the fact table. This is easily solved in the data mapping: we simply leave the business keys out of the mappings, since we do not want to include them in the fact table now that we have all the surrogate keys.

Figure 7.12: Data mappings for the Fact table.


7.8 Preparing for the next upload

Populating the fact table with data from the staging database completes the data warehouse population task. The last step is to clear the staging database for the next scheduled data extraction. The complete data flow architecture for populating the fact table is depicted in Figure 7.13.


Figure 7.13: Data flow diagram for populating the Fact table.


7.9 Scheduling the ETL

To keep our data warehouse up to date we need to run our ETL processes regularly. Using incremental extract, the best option is to run them on a daily basis, at a time when business transactions are low, or at the end of a work day. Admittedly, the latter argument holds little water in these days of online shopping, where as one part of the world is shutting down, another is resuming work. The general idea is that we do not want to interfere with the smooth running of the OLTP source systems, so it is purely a business decision when an organization runs the ETL process that connects to the source systems.

On the technical side, to schedule the ETL and execute all of our previously built SSIS packages, we create a new SQL Server Agent job. It is a multi-step process, with one step per ETL process. Figure 7.14 illustrates.

Figure 7.14: Creating an SQL Agent Job.


The job runs as a unit: first it populates the staging database. If that fails at any point, it does not continue to step 2, for the obvious reason that it makes no sense to load data from the stage database to the DDS if no data was transferred to the staging area. Instead, we quit the job, reporting failure.

If step 1 succeeds, the job moves on to step 2. If both steps succeed, the whole job is marked as a successful execution of our ETL processes. There are also options for notifying the administrator in case of a failure; notifications can be sent by mail or network message, or written to the metadata or the Windows Application event log.

So as not to interfere with the smooth running of the source system, we have scheduled our ETL processes to run at 3:00 AM.

Figure 7.15: Scheduling the ETL.
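The same job can also be scripted with the stored procedures in the msdb database; a sketch, with job names and package paths that are placeholders of our own:

USE msdb;

EXEC sp_add_job @job_name = N'DW nightly load';

EXEC sp_add_jobstep @job_name = N'DW nightly load', @step_id = 1,
    @step_name = N'Run Stage ETL', @subsystem = N'CmdExec',
    @command = N'dtexec /F "C:\ETL\StageETL.dtsx"',
    @on_success_action = 3;  -- go to the next step; on failure the default is to quit reporting failure

EXEC sp_add_jobstep @job_name = N'DW nightly load', @step_id = 2,
    @step_name = N'Run DDS ETL', @subsystem = N'CmdExec',
    @command = N'dtexec /F "C:\ETL\DDSETL.dtsx"';

EXEC sp_add_jobschedule @job_name = N'DW nightly load', @name = N'Daily at 3 AM',
    @freq_type = 4, @freq_interval = 1,  -- daily, every day
    @active_start_time = 030000;         -- 03:00:00

-- sp_add_jobserver must still be called to attach the job to the server.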


At this point we have an up-to-date data warehouse. Our ETL processes are built, and the data flows into the DDS regularly. The data warehouse is ready and fully functional.


7.10 Summary

In this chapter we explained our implementation in detail, with illustrated diagrams. This was the part of the project that took the longest to complete, and finishing it was a milestone for us. In the next chapter we will look at reporting from our completed data warehouse.


Chapter 8

Building Reports

Now that our data warehouse is up and running, it is ready to be tested. We have chosen to build some sample reports as a means of seeing our data warehouse in action, but our main goal is not to limit the data warehouse to certain pre-built reports. We believe a data warehouse should not be biased towards any particular reporting or analysis tool, and should be flexible enough to let users slice and dice in whatever manner they choose. To us, this represents how successful our planning and design process was.

One of the most common ways to study data in the data warehouse is by building reports. The data is gathered from the DDS and presented to the user in a convenient way: instead of plain data fields in the DDS, the user can use charts, diagrams and other representations. This makes it much easier to inspect and analyze the data that resides in the DDS.

For creating our reports we use SQL Server 2008 Reporting Services. These services make it possible to deploy the reports so that they can be accessed via a web site. We are not using this feature because we do not have access to an HTTP server with SQL Server 2008 Enterprise Edition installed.


Selecting the Report fields

The first thing to do when building a report is to decide which tables and columns to include. This of course depends on what kind of information we expect from the report. If we want a report that shows the sales of a particular product in different countries, then we should build a SQL query that extracts all the sales and groups the result by product and country.

Before continuing with the reports we need to create another view: one that relates every order line item of a particular product to the profit made from that sale. To create such a view we have to collect data from two tables, the fact table and the Product dimension. To calculate the profit for one particular order line in the fact table we use the expression:

Profit = (selling price per unit - buying price per unit) * quantity of units sold

We cast this value as Money and insert the column into our view, so the view contains three columns: orderNumber, profit and productDimKey. Why not use a nested query, one might ask? We are forced to take this approach because SQL Server 2008 does not allow grouping by nested queries when creating reports.
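A sketch of the view, with fact and dimension table names assumed from our earlier sketches; priceEach and quantityOrdered come from the fact table, buyPrice from the Product dimension:

CREATE VIEW dbo.vOrderLineProfit AS
SELECT f.orderNumber,
       CAST((f.priceEach - p.buyPrice) * f.quantityOrdered AS money) AS profit,
       f.productDimKey
FROM factSales f
JOIN dimProduct p ON p.productDimKey = f.productDimKey;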


Figure 8.1: Creating a view to relate each order line with the profit it made


With this view created, we can build some reports to test our data warehouse. The first report we created shows product sales by country over a period of time. The time period comes from the year and quarter columns of the date dimension; the two other important columns are profit and country.

Figure 8.2 shows how the query groups the data by year, quarter and country and then sums up the profit. The result of this operation is exactly what we need to analyze the company's sales in different countries over time.

Figure 8.2: A query used to provide data to the Sales by country report.
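In outline, and with table and column names assumed from our earlier sketches, the query behind Figure 8.2 looks like this:

SELECT d.year, d.quarter, c.country, SUM(v.profit) AS profit
FROM factSales f
JOIN dbo.vOrderLineProfit v ON v.orderNumber = f.orderNumber
                           AND v.productDimKey = f.productDimKey
JOIN dimDate d     ON d.dateDimKey     = f.dateDimKey
JOIN dimCustomer c ON c.customerDimKey = f.customerDimKey
GROUP BY d.year, d.quarter, c.country;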


The final step is to wrap this data into a nice, readable format, i.e. to build a report or paint a picture. Once we have the data we need, building a report is a very straightforward process.

In this report we use a matrix-style layout, which gives us three dimensions to work with: columns, rows and details:

• The year and quarter from the date dimension are represented as columns.

• Countries are displayed as rows.

• Profit is shown as details.

The matrix design template is depicted in Figure 8.3.

Figure 8.3: Designing the report matrix.


The main group in the time dimension is the year field, which contains the quarter field as a child group. This groups the data hierarchically, with every year column a parent of four quarter columns. There is also an additional row, Total per quarter, that shows the total profit made during each quarter.

A picture, they say, speaks a thousand words, so to present the data in a more easily readable format we also included a chart in this report. Notice that, having three dimensions (time, profit and country) in the matrix, we also need a three-dimensional chart. A simple column or pie chart would be unsuitable because those present only two-dimensional data. We use a stacked cylinder chart instead: to achieve the third dimension the cylinder is split into parts, which in our case represent the different countries.


Figure 8.4: The Sales by country report.


Including charts with reports is a very helpful practice. When we have a big matrix and large numbers, it is very hard to analyze the data by just looking at the plain digits. For example, a quick look at the chart in the previous report makes it easy to notice that the two most profitable countries in our case are France and the USA.

The following two reports were built while testing our data warehouse, to demonstrate its flexibility. They were built using the same methods as the previous one.

The report in Figure 8.5, Sales by model type, gives an overview of the sales grouped by model type (cars, planes, ships, etc.), and the report in Figure 8.6, Sales by manufacturer, shows the performance of the manufacturers over time.


Figure 8.5: Sales by model type report.


Figure 8.6: Sales by manufacturer report.


Summary

In this chapter we demonstrated creating reports from our data warehouse. While we only created three reports as a demonstration, any number of reports can be generated according to the users' needs. It is also possible for third-party reporting and analysis tools to connect to the data warehouse and slice and dice through the data as required. That flexibility rests on the solid design principles adopted during the implementation of the data warehouse.


Bibliography

[1] W.H. Inmon, Building the Data Warehouse. Wiley and Sons, Inc., 3rd Edition, 2002.

[2] Ralph Kimball and Margy Ross, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modelling. Wiley and Sons, Inc., 2nd Edition, 2002.

[3] Vincent Rainardi, Building a Data Warehouse. Apress, CA, 2008.

[4] Paulraj Ponniah, Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals. Wiley and Sons, Inc., 2001.

[5] C. Imhoff, N. Galemmo, J.G. Geiger, Mastering Data Warehouse Design. Wiley and Sons, Indiana, 2003.
