basics of data warehousing, mis and etl

Post on 06-Apr-2015

266 Views

Category:

Documents

1 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Data Warehousing, MIS & ETL

Overview

SivaSatish.K

Data Warehousing

Data & Information

Data, in real terms is raw and has no facts behind its existence.

On the other hand, information is the processed raw data (as per the business requirement) and supported by facts.

Example : For an example, Management requires “Number of booking (booking creation trend) for Asia in year 2007”. Whenever this data will be retrieved from data sources as per business requirements (any input system) and stored in data base, this would be information to the management.

Data Warehousing

It stores the data related to the subject area of the organizations and controlled/owned by centrally coordinated organizations. – In other words, it’s a relational database which is

designed for performing analysis and execution of queries. It contains historical data also.

– Technique to implement data warehouse is known as “Data Warehousing”

Why do we need data warehousing ?

History

Reasons:

In OLTP approach, historical data was archived to update the recent information in the system and even it is bit difficult to update that information everywhere in OLTP. Unlike OLTP; DW stores historical data to compare trends over period of time (now a days, it’s in Tera bytes)

As it has one DB at the end, number of sources can be used to transfer data and sources can be relational DB, Flat File, Unix file. ETL tool handles these source files to transfer data from source to target.

Continued…..

DW stores the necessary data for specific analysis; not jus all data.

Small companies focus on one area so they can have data in data marts ; on the contrary large companies can have data in data warehouse and data marts to facilitate their users’ need.

DW is commonly used by organizations to analyze trends over time. Trends can be of Revenue/Sales/Purchases/ BL generation etc.

Continued……

Most importantly, Though Organizations may use data warehousing to view day-to-day operations but primarily DW is to facilitate strategic planning resulting from long-term data overviews and these overviews, are the pillars for next business models, forecasts and other reports.

Approaches:

Normalized Approach Dimensional approach or De - Normalized

approach

Normalized Vs Dimensional Approach:

Firslty, any of these approaches can be used to store the data to the data warehouse but objective makes a difference to implement these.

In normalized approach, data element is completely ignored for data warehouse.

With dimensional approach, multidimensional model with dimensional approach came into picture. Here, processed data or transaction data are partitioned into "facts" and “dimensions”.

Dimensional Approach continued…...

Facts in simple meaning are ‘measures’ (figures such as # of booking created, Container quantity, Revenue, Gross Weight etc)

Where as Dimensions are the entity to which these facts belong such as Customer name, continent, Week, Month)

Hence, MIS is an example of Dimensional approach.

Continued…with an example

One Dimensional:-– For SVVD (PAX-ANX-054 W), there are 4572

containers. Here, SVVD is the dimension and 4572 (container qty) is the measure i.e. fact.

– For Shipper name “ABC 123”, revenue is 1.2 million $. Here, shipper is the dimension and revenue is the measure i.e. fact corresponds to dimensions.

Continued….

Multidimensional example:– What is the gross weight for 47GP Containers

traveling from New York to Singapore with shipment direction “E” in year 2006?

– Here in this example, Gross weight is the only measure (Fact) and container type (47GP), Load Port (New York) and Discharge Port (Singapore), shipment direction (E) and Year (2006) are various dimensions.

Characteristics : Data Warehouse

Subject Oriented Integrated Non Volatile Time Variant

Continued……

Subject oriented in a way that it revolves around the subject area of corporate such as “Booking”, if management wants to see booking pattern of last quarter like which country has maximum booking and shipment started ports.

Integrated in a way that database should avoid naming conflicts and inconsistencies among keywords used and somehow related to subject oriented feature also.

Continued…..

Non volatile in a way that once stored in the database, data should not be changed as it depicts what has occurred in the past.

Time Variant in a way that, In order to compare the trends in business over the past years, managements/decision makers need large amounts of data. A data warehouse's focus on change over time is what is meant by the term time variant which is unlike OLTP, where performance requirements demand that historical data be moved to an archive.

Architecture of Data Warehouse

Data warehouse basic Flow:

Data Sources:

-Relation data source-Flat File

End Users Analysis

Meta Data

Raw Data

Summarized data

End Users Analysis

End Users Analysis

Terminologies: Freq. used in DW

Source : Data where it is coming from Target : Destination database or data mart Test director Test management and defect

management tool from Mercury Sunopsis/ Informatica : ETL dev tool used to

extract, load and transformation Process Packages, Scenarios, Interfaces,

procedures : Different objects and their hierarchy in Sunopsis ETL tool

MIS : Management Information system

MIS: Management Information system

Important Module in terms of strategic decisions:– As most of the strategic decisions based on the

information.– As management decisions are based on the

information (information is nothing, just the processed data), processed data has been stored in database in more organized and structured manner.

MIS deals only with Data (back- end) It is treated as DSS ( Decision support system)

Continued…

As said, deals only with Data; MIS is a Data Warehouse used to store the data for various modules such as Shipment, Equipment, DnD, AR and Cost etc.

MIS : Architecture

Pictorial Representation:

CSSSybase

ODS Sybase

Gemstone Sybase

CPF Sybase

CDW OracleWare house

EQPMart

AR Mart

CostMart

ShpmtMart

CDW(ETL)

Data mart(ETL)

OLAP Cubes

Server

Cognos Repository

BI Reports

BI Client

The components and the purpose :

OSCAR – One of the source systems for the master and transactional data

Middleware – The Sunopsis ETL tool is used as the Middleware tool to Extract, Transform and Load the source data into the target systems

CDW (centralized Data Warehouse) – This is one of the target systems, which holds the data for all the Marts (Shipment, Cost, Trade PL, AR and Equipment) which is used for reporting

Types of Testing

Data Functional If Require, Middleware

Data Testing includes:

Initial data Testing, Data validation, Data quality, Data integration and Data volume testing. – Data Validation ensures Null/not null values,

duplicate records, behavior of surrogate keys/primary keys, Data type and format etc.

– Data quality testing ensures the accurate information in dimension and fact tables.

– Initial testing means verifying the data in CDW, Master tables which will be populated through OSCAR and its web application (Front End System)

Continued….

– Data Integration testing ensures the integrity of data in target system if data populated from more than one source system.

– Moreover, if any change to the data happens through UI, then possibly test the same changes to the target data warehouse.

– Data Volume testing ensures that Volume of data is available from the OSCAR Test environment then perform data volume testing

Functional Testing

Number of records from source system (OSCAR) for each individual dimension and facttable should be tested into target data warehouse, sometimes it may differ due to join conditions between tables in source system.

Testing will be done in target data warehouse after data loads from Source system to target system.

If any changes occur in data source, check corresponding changes in target data warehouse.

Continued…..

Application specific features to be tested according to business rules/ transformation rules in target data warehouse. (E.g. booking id should be system generated unique identifier).

Middleware (sunopsis) Testing

Verify the connection between source & target database.

If needed, Verify Source to stage and stage to target mappings and based on that only test what is in source and what should be in target

If an errors occurs, Check the error notification through Open view

For some target tables, there are two different packages/scenarios (for source to stage and stage to target), run both packages manually

Continued….

Performance testing / Fail-over testing: Test performance and fail-over of Middleware with huge volume of data (If possible and Volume of data will be available from the OSCAR Test environment)

Test cycle

Subject areas for master target tables: Customer, Geography, Organization, Commodity, Product, Vendor/Tariff, Service contract and Equipment.

Subject areas for transactional target tables: B/L, Booking, Container Movement, Empty poisoning plan, Job Order and Job order Invoice (Transportation/Non-Transportation)

ETL

Overview

Why ETL tool ?

While storing the data to the target database, data will be processed according to the various requirements of business (like ignore a particular country, region etc) and later on saved through ETL tools and technique to the database (target repository)

Generate a Unique Id in Shipment Mart.

ETL: Extraction, Transform and Load.

Extract the data (Sales in different countries in Asia region) through UI/Any other Relational Source or Flat File.

Transform that data based on various requirements from business.

Load the data to the target (table) and store it in database.

Continued…..

The Simplest way to understand the ETL concepts such as – ETL process, – ETL software, – ETL developments – and ETL testing starts with Data warehousing

concepts which was explained earlier.

Recommended References:

http://en.wikipedia.org/wiki/Data_warehouse

top related