data warehouse basic

30
Training Material

Upload: madasamy-murugaboobathi

Post on 07-Apr-2018

220 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Warehouse Basic

8/4/2019 Data Warehouse Basic

http://slidepdf.com/reader/full/data-warehouse-basic 1/30

Training Material

Page 2: Data Warehouse Basic

8/4/2019 Data Warehouse Basic

http://slidepdf.com/reader/full/data-warehouse-basic 2/30

Data Warehouse

Name of the author: Kanchan Yadav

Date Created:23rd Feb 2009

Page 3: Data Warehouse Basic

8/4/2019 Data Warehouse Basic

http://slidepdf.com/reader/full/data-warehouse-basic 3/30

Data Warehouse Architecture 3

CONTENTS

Why DW Systems?

Why Now?

Status of DW Systems

DW Architecture

Operations System vs. DW Systems

Data Quality

ETL

Audit Requirements

Page 4: Data Warehouse Basic

8/4/2019 Data Warehouse Basic

http://slidepdf.com/reader/full/data-warehouse-basic 4/30

4

Data Warehouse Architecture 4

 What is a Data Warehouse?

A single, complete andconsistent store of dataobtained from a variety ofdifferent sources madeavailable to end users in awhat they can understandand use in a businesscontext.

[Barry Devlin]

E.g. Same person promoted and asked to leave

E.g. customer for Savings A/C and same customer for Loan are considered

different

Page 5: Data Warehouse Basic

8/4/2019 Data Warehouse Basic

http://slidepdf.com/reader/full/data-warehouse-basic 5/30

5

Data Warehouse Architecture 5

Data Warehousing --

It is a process

Technique for assembling andmanaging data from various sourcesfor the purpose of answeringbusiness questions. Thus makingdecisions that were not previouspossible

A decision support databasemaintained separately from theorganization’s operational database

Page 6: Data Warehouse Basic

8/4/2019 Data Warehouse Basic

http://slidepdf.com/reader/full/data-warehouse-basic 6/30

6

Data Warehouse Architecture 6

 Why Now?

Data is being produced

ERP provides clean data

The computing power is available

The computing power is affordable

The competitive pressures are strong

Commercial products are available

Page 7: Data Warehouse Basic

8/4/2019 Data Warehouse Basic

http://slidepdf.com/reader/full/data-warehouse-basic 7/30

7

Data Warehouse Architecture 7

Status of Data Warehousing Systems

Just begun in India

One of the hottest & fast growing area, inIndia & abroad

Many IT organizations have establishedseparate group for domestic & export work

Data Warehousing already being carried outin various Indian organizations, includingStock exchanges, Banks, Telcom, Insurance

cos, etc. Data Mining not far behind

Page 8: Data Warehouse Basic

8/4/2019 Data Warehouse Basic

http://slidepdf.com/reader/full/data-warehouse-basic 8/30

Data Warehouse Architecture 8

Data Warehouse Architecture

OLTP 1RDBMS

OLTP 3ERP

OLTP 2 VSAM

DataWarehouse/Data Mart

Staging Area

Cube II

Cube I

OLAPTool –Slicing /Dicing

Query /Reporting

Tool

ETL

Page 9: Data Warehouse Basic

8/4/2019 Data Warehouse Basic

http://slidepdf.com/reader/full/data-warehouse-basic 9/30

9

Data Warehouse Architecture 9

 What are Operational Systems?

They are mainly OLTPsystems

Run mission criticalapplications

Need to work with stringentperformance requirementsfor routine tasks

Used to run a business!

Page 10: Data Warehouse Basic

8/4/2019 Data Warehouse Basic

http://slidepdf.com/reader/full/data-warehouse-basic 10/30

Data Warehouse Architecture 10

Data Environment in a business

Organization

Current scenario

Data are captured through differentapplications

Applications may be on different platforms,developed at different times with specifiedobjectives

Getting data out of applications is a difficulttask

In other words Data are said to be “in jail”

Page 11: Data Warehouse Basic

8/4/2019 Data Warehouse Basic

http://slidepdf.com/reader/full/data-warehouse-basic 11/30

Page 12: Data Warehouse Basic

8/4/2019 Data Warehouse Basic

http://slidepdf.com/reader/full/data-warehouse-basic 12/30

Data Warehouse Architecture 12

Data in Data Warehouse

Used to manage and control business

Data is historical or point-in-time

Optimized for inquiry rather than update

Use of the system is loosely defined and can be ad-

hoc

Used by managers and end-users to understand the

business and make judgements

Page 13: Data Warehouse Basic

8/4/2019 Data Warehouse Basic

http://slidepdf.com/reader/full/data-warehouse-basic 13/30

Data Warehouse Architecture 13

 Application-Orientation vs. Subject-

Orientation Application-Orientation Subject-Orientation

DataWarehouse

Customer

 Vendor

Product

 Activity

OperationalDatabase

LoansCreditCard

TrustSavings

Page 14: Data Warehouse Basic

8/4/2019 Data Warehouse Basic

http://slidepdf.com/reader/full/data-warehouse-basic 14/30

Data Warehouse Architecture 14

Integration

Integration can take place in various dimensions likeconsistent naming conventions, consistentmeasurement of variables, consistent encodingstructures, consistent physical attributes of data etc.Integration is done at data staging level withoutchanging the operational application systems.

Page 15: Data Warehouse Basic

8/4/2019 Data Warehouse Basic

http://slidepdf.com/reader/full/data-warehouse-basic 15/30

Data Warehouse Architecture 15

 Time Orientation

Data warehouse data are snapshot data

It has longer time horizon

It has a key structure containing an element of time.

Page 16: Data Warehouse Basic

8/4/2019 Data Warehouse Basic

http://slidepdf.com/reader/full/data-warehouse-basic 16/30

Data Warehouse Architecture 16

Non Volatility 

Data are loaded into the warehouse and accessedthere, but once the snapshot of data is made, the datain the warehouse do not change. Data can be updatedaccording to pre-announced calendar of programme.

Page 17: Data Warehouse Basic

8/4/2019 Data Warehouse Basic

http://slidepdf.com/reader/full/data-warehouse-basic 17/30

Data Warehouse Architecture 17

Metadata

Metadata explains what data exists, where it is locatedand how to access it. The metadata is a core of a datalogistics system, the infrastructure for DW andultimately the intelligence system.

Page 18: Data Warehouse Basic

8/4/2019 Data Warehouse Basic

http://slidepdf.com/reader/full/data-warehouse-basic 18/30

Data Warehouse Architecture 18

 To summarize ...

OLTP Systems areused to “run” abusiness

The Data Warehousehelps to “optimize” thebusiness

Page 19: Data Warehouse Basic

8/4/2019 Data Warehouse Basic

http://slidepdf.com/reader/full/data-warehouse-basic 19/30

Loading the Warehouse

Cleaning the data beforeit is loaded

Page 20: Data Warehouse Basic

8/4/2019 Data Warehouse Basic

http://slidepdf.com/reader/full/data-warehouse-basic 20/30

Page 21: Data Warehouse Basic

8/4/2019 Data Warehouse Basic

http://slidepdf.com/reader/full/data-warehouse-basic 21/30

Data Warehouse Architecture 21

Data Quality - The Reality 

Tempting to think creating a data warehouse issimply extracting operational data and entering intoa data warehouse

Nothing could be farther from the truth

Warehouse data comes from disparatequestionable sources

Data Profiling before and during ETL

Batch Totals

Handling Missing Values / Outliers / Duplicates / Non-quality Data

Page 22: Data Warehouse Basic

8/4/2019 Data Warehouse Basic

http://slidepdf.com/reader/full/data-warehouse-basic 22/30

Data Warehouse Architecture 22

Data Integration Across Sources

Trust Credit cardSavings Loans

Same datadifferent name

Different dataSame name

Different Type,Length

Different Units

Page 23: Data Warehouse Basic

8/4/2019 Data Warehouse Basic

http://slidepdf.com/reader/full/data-warehouse-basic 23/30

23

Data Warehouse Architecture 23

Data Transformation Terms

Transformation

The conversion of data types from the

source to the target data store

(warehouse) -- always a relational

database

•90% of persons were born on November 11, 1911

• 80% robbery performed in Ghatkopar, Chowky # 1

Page 24: Data Warehouse Basic

8/4/2019 Data Warehouse Basic

http://slidepdf.com/reader/full/data-warehouse-basic 24/30

Data Warehouse Architecture 24

Data Transformation Example

      e       n      c       o 

        d         i      n

      g   

      u       n

        i        t 

        f        i      e 

        l        d 

appl A - balanceappl B - balappl C - currbalappl D - balcurr

appl A - pipeline - cmappl B - pipeline - inappl C - pipeline - feetappl D - pipeline - yds

appl A - m,f appl B - 1,0appl C - x,yappl D - male, female

Data Warehouse

Page 25: Data Warehouse Basic

8/4/2019 Data Warehouse Basic

http://slidepdf.com/reader/full/data-warehouse-basic 25/30

Data Warehouse Architecture 25

Data Integrity Problems

Same person, different spellings

Agarwal, Agrawal, Aggarwal etc...

Multiple ways to denote company name

Persistent Systems, PSPL, Persistent Pvt.LTD.

Use of different names

mumbai, bombay

Different account numbers generated by differentapplications for the same customer

Page 26: Data Warehouse Basic

8/4/2019 Data Warehouse Basic

http://slidepdf.com/reader/full/data-warehouse-basic 26/30

26

Data Warehouse Architecture 26

Data Integrity Problems (Cont)

Required fields left blank

Invalid product codes collected at point of sale

manual entry leads to mistakes

“in case of a problem use 9999999”

Page 27: Data Warehouse Basic

8/4/2019 Data Warehouse Basic

http://slidepdf.com/reader/full/data-warehouse-basic 27/30

27

Data Warehouse Architecture 27

Loads

After extracting, scrubbing, cleaning, validating etc.need to load the data into the warehouse

Issues

huge volumes of data to be loaded

small time window available when warehouse canbe taken off line (usually nights)

when to build index and summary tables

allow system administrators to monitor, cancel,resume, change load rates

Page 28: Data Warehouse Basic

8/4/2019 Data Warehouse Basic

http://slidepdf.com/reader/full/data-warehouse-basic 28/30

Data Warehouse Architecture 28

 When to Refresh?

periodically (e.g., every night, every week) or after

significant events

on every update: not warranted unless warehouse

data require current data (up to the minute stock

quotes)

refresh policy set by administrator based on user

needs and traffic

possibly different policies for different sources

Page 29: Data Warehouse Basic

8/4/2019 Data Warehouse Basic

http://slidepdf.com/reader/full/data-warehouse-basic 29/30

Data Warehouse Architecture 29

Extraction Techniques

Full Extract from base tables

read entire source table: too expensive

maybe the only choice for legacy systems

Page 30: Data Warehouse Basic

8/4/2019 Data Warehouse Basic

http://slidepdf.com/reader/full/data-warehouse-basic 30/30

Data Warehouse Architecture 30

How To Detect Changes

Create a snapshot log table to record ids ofupdated rows of source data and timestamp

Detect changes by:

Defining after row triggers to update snapshot logwhen source table changes

Using regular transaction log to detect changes tosource data