data warehouse ims5024 – presented by eder tsang

25
Data Warehouse IMS5024 – presented by Eder Tsang

Post on 19-Dec-2015

227 views

Category:

Documents


3 download

TRANSCRIPT

Data Warehouse

IMS5024 – presented by Eder Tsang

Data Warehouse

A data warehouse is a system consisting of processes and

databases used to provide the “data infrastructure” for EIS

and DSS

“… a subject-oriented, integrated, timevariant, and non-

volatile collection of data in support of management’s

decisions”

Inmon and Hackathorn (1994)

Data warehouse - subject oriented

The data warehouse is organised by “data subjects” that

are relevant to the organisation.

– Customer, claim, shipment, product

This may be contrasted with the process orientation of

many OLTP systems

Data warehouse - integrated

Data in the warehouse is structured based on a corporate-

wide model, spanning the functional boundaries of legacy

systems

This includes naming standards, units of measurement

and periodicity

Data warehouse - time variant

Data is the data warehouse is characterised by the time-

series nature of historical data

The data consists of a series of “snapshots” which are

time-stamped and record values at a moment in time

This supports trend analysis of the data

Data warehouse - non volatile

The data warehouse is not continuously updated (inserts,

eletes and changes) like data in an OLTP system

Data in a data warehouse is periodically up-loaded at a

scheduled time intervals (say daily)

Motivations for data warehousing

Demands on OLTP data bases for query processing

would be too great

Data warehousing is designed for efficient retrieval

Data in legacy systems is frequently inconsistent, of poor

quality and stored in different formats

Reduce costs in providing data for decision making

Motivations for data warehousing

Support for focus on complete business processes (BPR)

Support for new initiatives

– CRM, Balanced Scorecard

Industry sources quote ROI’s averaging 401% over 3

years

Remain competitive

Typical Data Warehouse Architecture

An Actual Data Warehouse

Data warehouse development

Requirements identification

Logical design, data modelling

Data extract, transform and load (ETL)

Warehouse architecture , technology and tools

Physical database design

Delivery systems

Operational policies

Designing a data warehouse – data design

There are two main approaches to data modelling or data

warehouse design

– entity relationship modelling and normalisation

– dimensional modelling

The design of databases using a traditional E-R approach

Entities and relationships

Normalisation 3NF

Entity relationship schema

Why do we normalise data?

Normalisation is a process for converting complex data

structures into simple, stable data structures

Normalisation protects integrity of database by avoiding

anomalies (update, delete, create)

• Normalised data models are:

• robust and stable

• have minimum redundancy

Dimensional Modeling (star schema)

Dimensional Modeling (star schema)

Components of dimensional model:

– Fact Tables : contain measurements of business

eg. Sales, purchase order, shipment

– Dimension Tables : store the descriptions of the dimensions

of the business

eg. Product, customer, vendor, store

Dimensional Modeling (star schema)

Each dimension table has a single primary key that

corresponds exactly to one of the components of the

multipart key in the fact table.

A fact table always expresses a many to many relationship

(the key is composed of foreign keys

The most useful facts in a fact table are numeric and

additive

( typically values are added up)

Snowflake schema

Snowflake schema –all the tables are normalised

Star schemas are preferable to snowflake – fewer joins for

information retrieval

Dimensional Modelling vs E-R modelling

the purpose of dimensional modelling structure data for easy

and efficient analysis

E-R modelling creates a single required to support

organisation’s Whereas

DM creates individual models for business/decision interest

eg. • model for sales info

• model for Inventory info

Entity relationship schema (3NF)

Corresponding Star schema

Corresponding snowflake schema

Dimensional Modelling vs E-R modelling (Con’t)

OLTP and DW have different purpose: – operational vs

informational

Normalisation protects integrity of database by avoiding

anomalies (update, delete, create)

Data models for data warehouse do not have to be normalised

– In contrast, data in DW does not change often

– periodic additions of new data

DM vs. E-R modeling debate (Kimball’s view)

OLTP systems are volatile – high rates of update transactions

In normalised models the goal is to reduce data redundancy

and prevent update anomalies

Data in a data warehouse does not need to be normalised

because it is periodically refreshed not updated by user

transactions