datawarehouse

52
Introduction DATAWAREHOUSE www.notesvillage.com

Upload: sherlock-jairo

Post on 03-Jan-2016

12 views

Category:

Documents


1 download

DESCRIPTION

DATAWAREHOUSE. Introduction. Information. Data. What is Data Warehousing?. A process of transforming data into information and making it available to users in a timely enough manner to make a difference [Forrester Research]. What are the users saying. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: DATAWAREHOUSE

Introduction

DATAWAREHOUSE

www.notesvillage.com

Page 2: DATAWAREHOUSE

What is Data Warehousing?

A process of transforming data into information and making it available to users in a timely enough manner to make a difference

[Forrester Research]

Data

Information

www.notesvillage.com

Page 3: DATAWAREHOUSE

What are the users saying...Data should be integrated

across the enterpriseSummary data has a real

value to the organizationHistorical data holds the

key to understanding data over time

What-if capabilities are required

www.notesvillage.com

Page 4: DATAWAREHOUSE

What is a Data Warehouse? A single, complete and

consistent store of data obtained from a variety of different sources made available to end users in a what they can understand and use in a business context.

[Barry Devlin]

www.notesvillage.com

Page 5: DATAWAREHOUSE

Data, Data everywhereyet ... I can’t find the data I need

data is scattered over the networkmany versions, subtle differences

• I can’t get the data I need– need an expert to get the data

• I can’t understand the data I found– available data poorly documented

• I can’t use the data I found– results are unexpected– data needs to be transformed from

one form to otherwww.notesvillage.com

Page 6: DATAWAREHOUSE

Which are our lowest/highest margin

customers ?

Which are our lowest/highest margin

customers ?

Who are my customers and what products are they buying?

Who are my customers and what products are they buying?

Which customers are most likely to go to the competition ?

Which customers are most likely to go to the competition ?

What impact will new products/services

have on revenue and margins?

What impact will new products/services

have on revenue and margins?

What product prom--otions have the biggest

impact on revenue?

What product prom--otions have the biggest

impact on revenue?

What is the most effective distribution

channel?

What is the most effective distribution

channel?

A producer wants to know….

www.notesvillage.com

Page 7: DATAWAREHOUSE

NecessityNeed for Strategic Information

Retain consumer baseIncrease customer base by 20%Enhance consumer satisfaction levelLaunch two new products

Information CrisisHow to maintain lots of data in the organizationIT resources and strategy is not effective for data

Opportunities and Risks:Quick decisionsManagerial analysisCrisis management

www.notesvillage.com

Page 8: DATAWAREHOUSE

History of DSSAd hoc reportsSmall applicationsInformation centersDecision support system

www.notesvillage.com

Page 9: DATAWAREHOUSE

DisadvantagesMany adhoc requests .Supplementary report problemsNo interactivity of the userFlexible informations.

www.notesvillage.com

Page 10: DATAWAREHOUSE

OPERATIONAL VS INFORMATIONALCurrent ValuesOptimized for

transactionHigh frequencyRead Update

DeleteLarge userResponse time is

very fast

Archived valueOptimized for

complex queriesLowReadAdhoc,RandomRelatively more

several seconds

www.notesvillage.com

Page 11: DATAWAREHOUSE

Scenerio 1ABC Pvt Ltd is a company with branches at

Mumbai, Delhi, Chennai and Banglore. The Sales Manager wants quarterly sales report. Each branch has a separate operational system.

www.notesvillage.com

Page 12: DATAWAREHOUSE

Scenario 1 : ABC Pvt Ltd.

Mumbai

Delhi

Chennai

Banglore

SalesManager

Sales per item type per branchfor first quarter.

www.notesvillage.com

Page 13: DATAWAREHOUSE

Solution 1:ABC Pvt Ltd

Extract sales information from each database.

Store the information in a common repository at a single site.

www.notesvillage.com

Page 14: DATAWAREHOUSE

Solution 1:ABC Pvt Ltd.

Mumbai

Delhi

Chennai

Banglore

DataWarehouse

SalesManager

Query &Analysis tools

Report

www.notesvillage.com

Page 15: DATAWAREHOUSE

Scenario 2 : One Stop Shopping

OperationalDatabase

Data Entry Operator

Data Entry Operator

ManagementWait

Report

www.notesvillage.com

Page 16: DATAWAREHOUSE

Solution 2Solution 2•Extract data needed for analysis from operational database.•Store it in warehouse.•Refresh warehouse at regular interval so that it contains up to date information for analysis.•Warehouse will contain data with historical perspective.

www.notesvillage.com

Page 17: DATAWAREHOUSE

Solution 2

Operationaldatabase

DataWarehouse

Extractdata

Data EntryOperator

Data EntryOperator

Manager

Report

Transaction

www.notesvillage.com

Page 18: DATAWAREHOUSE

Scenario 3Cakes & Cookies is a small,new

company.President of the company wants his company should grow.He needs information so that he can make correct decisions.

www.notesvillage.com

Page 19: DATAWAREHOUSE

Solution 3Solution 3•Improve the quality of data before loading it into the warehouse.

•Perform data cleaning and transformation before loading the data.

•Use query analysis tools to support adhoc queries.

www.notesvillage.com

Page 20: DATAWAREHOUSE

Solution 3

Query and Analysistool

President

Expansion

Improvement

sales

time

DataWarehouse

www.notesvillage.com

Page 21: DATAWAREHOUSE

Definitions of a Data Warehouse

- W.H. Inmon

“A subject-oriented, integrated, time-variant and

non-volatile collection of data in support of

management's decision making process”

- Ralph Kimball

“A copy of transaction data, specifically structured for query and analysis”

1.

2.

www.notesvillage.com

Page 22: DATAWAREHOUSE

Data WarehouseFor organizational learning to take place,

data from many sources must be gathered together and organized in a consistent and useful way – hence, Data Warehousing (DW)

DW allows an organization (enterprise) to remember what it has noticed about its data

Data Mining techniques make use of the data in a DW

www.notesvillage.com

Page 23: DATAWAREHOUSE

What is Data Warehouse??Inmons’s definition

A data warehouse is-subject-oriented,-integrated,-time-variant,-nonvolatile

collection of data in support of management’sdecision making process

www.notesvillage.com

Page 24: DATAWAREHOUSE

Subject-orientedSubject-oriented

•Data warehouse is organized around subjects such as sales, product, customer.

•It focuses on modeling and analysis of data for decision makers.

•Excludes data not useful in decision support process.

www.notesvillage.com

Page 25: DATAWAREHOUSE

Integrated DataIntegrationData Warehouse is constructed by

integrating multiple heterogeneous sources.

Data Preprocessing are applied to ensure consistency.RDBMS

LegacySystem

DataWarehouse

Flat Filewww.notesvillage.com

Page 26: DATAWAREHOUSE

IntegrationIntegrationIn terms of data.

encoding structures.

Measurement ofattributes.

physical attribute. of data

naming conventions.

Data type format

www.notesvillage.com

Page 27: DATAWAREHOUSE

Time-variantTime-variant

•Provides information from historical perspective e.g. past 5-10 years

•Data is stored as snapshots

•Every key structure contains either implicitly or explicitly an element of time

•Allows analysis for the past

•Relates information to the present

•Enables forecasts for the futurewww.notesvillage.com

Page 28: DATAWAREHOUSE

NonvolatileNonvolatile•Data once recorded cannot be updated.•Data warehouse requires two operations in data accessing

•Initial loading of data•Access of data

www.notesvillage.com

Page 29: DATAWAREHOUSE

Data GranularityOperational system is kept at the lowest

levelData is summarized at different levelsThe lower the level the more data

granularity.

www.notesvillage.com

Page 30: DATAWAREHOUSE

Operational v/s Information SystemOperational v/s Information System

Features Operational Information

Characteristics Operational processing

Informational processing

Orientation Transaction Analysis

User Clerk,DBA,database professional

Knowledge workers

Function Day to day operation Decision support

Data Current Historical

View Detailed,flat relational Summarized, multidimensional

DB design Application oriented Subject oriented

Unit of work Short ,simple transaction

Complex query

Access Read/write Mostly readwww.notesvillage.com

Page 31: DATAWAREHOUSE

Data Mart

A Data Mart is a smaller, more focused

Data Warehouse – a mini-warehouse.

A Data Mart typically reflects the business

rules of a specific business unit within an

enterprise.

www.notesvillage.com

Page 32: DATAWAREHOUSE

Data Warehouse to Data Mart

DataWarehouse

Data Mart

Data Mart

Data Mart

Decision Support

Information

Decision Support

Information

Decision Support

Information

www.notesvillage.com

Page 33: DATAWAREHOUSE

www.notesvillage.com

Page 34: DATAWAREHOUSE

Data Mart Layer

Presentation/ Desktop

Access Layer

Meta-data Repository Layer

Warehouse Management Layer

Core DW Layer

Data Staging and Quality Layer

Data Access Layer

Operational Data Layer

External Data Layer

Data Feed/Data Mining/

Indexing Layer

Virtual DW

Coarse DW

Central DW

Distributed DW

Application Messaging (Transport) Layer

Internet/Intranet Layer

direct queries

virtual queries

ad hoc queries

1

2a

2b

2c

3

4 56 7

8

9

10

11

United Statesby Sales

$10,340 to $10,350 (1)$8,730 to $10,340 (2)$4,320 to $8,730 (2)$1,100 to $4,320 (1)

$730 to $1,100 (3)

United States$11,000

Sales

North America

Non-operational

Data Layer

Data Marts vs Data WarehousesData Marts vs Data Warehouses

www.notesvillage.com

Page 35: DATAWAREHOUSE

Operational vs. InformationalSystems

Operational vs. InformationalSystems

OperationalOperationalSystemsSystems

InformationInformationDelivery SystemDelivery SystemDataDataWarehouseWarehouse

InformationalInformationalSystemsSystems

Data Data MartsMarts

www.notesvillage.com

Page 36: DATAWAREHOUSE

Virtual Date Warehouse

A Virtual Data Warehouse approach is often chosen when there are infrequent demands for data and management wants to determine if/how users will use operational data.

One of the weaknesses of a Virtual Data Warehouse approach is that user queries a made against operational DBs.

One way to minimize this problem is to build a “Query Monitor” to check the performance characteristics of a query before executing it.

www.notesvillage.com

Page 37: DATAWAREHOUSE

• A Coarse Data Warehouse is often chosen when the organization has a relatively clean/new operational system and management wants to make the operational data more easily available for just that system.

• A Central Data Warehouse is often chosen when the organization has a clear understanding about it Information Access needs and wants to provide “quality”, “integrated” , information to its knowledge workers

• A Distributed Data Warehouse is similar in most respects to a Central Data Warehouse, except that the data is distributed to separate mini-Data Warehouses (Data Marts )on local or specialized servers

www.notesvillage.com

Page 38: DATAWAREHOUSE

Data Mart Layer

Presentation/ Desktop

Access Layer

Meta-data Repository Layer

Warehouse Management Layer

Core DW Layer

Data Staging and Quality Layer

Data Access Layer

Operational Data Layer

External Data Layer

Data Feed/Data Mining/

Indexing Layer

Virtual DW

Coarse DW

Central DW

Application Messaging (Transport) Layer

Distributed DW

Internet/Intranet Layer

direct queries

virtual queries

ad hoc queries

1

2a

2b

2c

3

4 56 7

8

9

10

11

United Statesby Sales

$10,340 to $10,350 (1)$8,730 to $10,340 (2)$4,320 to $8,730 (2)$1,100 to $4,320 (1)

$730 to $1,100 (3)

United States$11,000

Sales

North America

Non-operational

Data Layer

Central Data WarehouseCentral Data Warehouse

www.notesvillage.com

Page 39: DATAWAREHOUSE

Top-down approachSingle, Central

storage of data about the content

Centralized Rule and Content

May seek quick result if implemented with Iteration

Needs high level cross functional skills

High outlay with proof of concepts

High exposure to risk/Failures

www.notesvillage.com

Page 40: DATAWAREHOUSE

www.notesvillage.com

Page 41: DATAWAREHOUSE

Data Warehouse ArchitecturesGeneric Two-Level ArchitectureIndependent Data MartDependent Data Mart and Operational Data

StoreLogical Data Mart and Real-Time Data

WarehouseThree-Layer architecture

All involve some form of extraction, transformation and loading (ETLETL)

www.notesvillage.com

Page 42: DATAWAREHOUSE

Process Architecture These interrelated sub-processes are sometimes

referred to as an “ETL” process. 1)Extract- Since data for the data warehouse can come

from different sources and may be of different types, the plan to extract the data along with appropriate compression and encryption techniques is an important requirement for consideration.

2)Transform- Transformation of data with appropriate conversion, aggregation and cleaning besides de-normalization and surrogate key management is also an important process to be planned for building a data warehouse.

3)Load- Steps to be considered to load data with optimization by considering the multiple areas where the data is targeted to be loaded and retrieved is also an important part of the data warehouse architecture plan

www.notesvillage.com

Page 43: DATAWAREHOUSE

Data Model Architecture In Data Model Architecture (also known

as Dimensional Data Model), there are 3 main data modeling styles for enterprise warehouses:

3rd Normal Form - Top Down Architecture, Top Down Implementation

Federated Star Schemas - Bottom Up Architecture, Bottom Up Implementation

Data Vault - Top Down Architecture, Bottom Up Implementation

.

www.notesvillage.com

Page 44: DATAWAREHOUSE

Technology Architecture

Scalability and flexibility is required in all facets. The extent of these features are largely depend upon organizational size, business requirements, nature of business etc.

Technology or Technical architecture primary evolved from derivations from the process architecture, meta data management requirements based on business rules and security levels implementations and technology tool specific evaluation.

Besides these, the Technology architecture also looks into the various technology implementation standards in database management, database connectivity protocols (ODBC, JDBC, OLE DB etc), Middleware (based on ORB, RMI, COM/DOM etc.), Network protocols (DNS, LDAP etc) and other related technologies.

www.notesvillage.com

Page 45: DATAWAREHOUSE

Information Architecture

Information Architecture is the process of translating the information from one form to another in a step by step sequence so as to manage the storage, retrieval, modification and deletion of the data in the data warehouse.

www.notesvillage.com

Page 46: DATAWAREHOUSE

Resource Architecture Resource architecture is related to software

architecture in that many resources come from software resources. Resources are important because they help determine performance.

Workload is the other part of the equation. If you have enough resources to complete the workload in the right amount of time, then performance will be high.

www.notesvillage.com

Page 47: DATAWAREHOUSE

Figure 11-2: Generic two-level data warehousing architecture

E

T

LOne, company-wide warehouse

Periodic extraction data is not completely current in warehouse

www.notesvillage.com

Page 48: DATAWAREHOUSE

Scrub/Cleanse…uses pattern recognition and AI techniques to upgrade data quality

Fixing errors:Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies

Also:Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data

Figure 11-10: Steps in data reconciliation

(cont.)

www.notesvillage.com

Page 49: DATAWAREHOUSE

Figure 11-3 Independent data mart data warehousing architecture

Data marts:Data marts:Mini-warehouses, limited in scope

E

T

L

Separate ETL for each independent data mart

Data access complexity due to multiple data marts

www.notesvillage.com

Page 50: DATAWAREHOUSE

Figure 11-4 Dependent data mart with operational data store: a three-level architecture

ET

L

Single ETL for enterprise data warehouse(EDW)(EDW)

Simpler data access

ODS ODS provides option for obtaining current data

Dependent data marts loaded from EDW

www.notesvillage.com

Page 51: DATAWAREHOUSE

ET

L

Near real-time ETL for Data WarehouseData Warehouse

ODS ODS and data warehousedata warehouse are one and the same

Data marts are NOT separate databases, but logical views of the data warehouse Easier to create new data marts

Figure 11-5 Logical data mart and real time warehouse architecture

www.notesvillage.com

Page 52: DATAWAREHOUSE

Figure 11-6 Three-layer data architecture for a data warehouse

www.notesvillage.com