unit – ii introduction to data warehousing by mrs. chetana

144
UNIT – II UNIT – II Introduction to Introduction to Data Warehousing Data Warehousing By Mrs. Chetana

Upload: muriel-long

Post on 30-Dec-2015

221 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

UNIT – IIUNIT – IIIntroduction to Introduction to

Data WarehousingData Warehousing

By Mrs. Chetana

Page 2: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

UNIT - II

• Data Warehouse and OLAP Technology for Data Mining Data Warehouse, Multidimensional Data Model, Data Warehouse Architecture, Data Warehouse Implementation

• Data Cube Computation:

Efficient Methods for simple Data Cube Computation (Full Cube, Iceberg Cube, Closed Cube and Shell Cube), Discovery Driven exploration of data cubes

By Mrs. Chetana

Page 3: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Why Data Warehouse?Why Data Warehouse?

“Necessity is the mother of invention”

An operational data store (Database) stores data for a specific application.

It feeds the data warehouse a stream of desired raw data.Data store is generally subject oriented, volatile, current

commonly focused on customers, products, orders, policies, claims, etc…

By Mrs. Chetana

Page 4: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Scenario 1Scenario 1

ABC Pvt Ltd is a company with branches at Mumbai, Hyderabad, Chennai and Bangalore.

The Sales Manager wants quarterly sales report. Each branch has a separate operational system.

By Mrs. Chetana

Page 5: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Scenario 1 : ABC Pvt Ltd.Scenario 1 : ABC Pvt Ltd.

Mumbai

Hyderabad

Chennai

Banglore

SalesManager

Sales per item type per branchfor first quarter.

By Mrs. Chetana

Page 6: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Solution 1:ABC Pvt LtdSolution 1:ABC Pvt Ltd..

Extract sales information from each database.Store the information in a common repository

at a single site.

By Mrs. Chetana

Page 7: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Solution 1:ABC Pvt Ltd.Solution 1:ABC Pvt Ltd.

Mumbai

Delhi

Chennai

Bangalore

DataWarehouse

SalesManager

Query &Analysis tools

Report

By Mrs. Chetana

Page 8: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Scenario 2Scenario 2

One Stop Shopping Super Market has hugeoperational database.

Whenever Executives wants some report the OLTP(database)system becomes slow and data entry operators have to wait for some time.

By Mrs. Chetana

Page 9: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Scenario 2 : One Stop ShoppingScenario 2 : One Stop Shopping

OperationalDatabase

Data Entry Operator

Data Entry Operator

ManagementWait

Report

By Mrs. Chetana

Page 10: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Solution 2Solution 2

Extract data needed for analysis from operational database.Store it in warehouse.Refresh warehouse at regular interval so that it contains up

to date information for analysis.Warehouse will contain data with historical perspective.

By Mrs. Chetana

Page 11: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Solution 2Solution 2

Operationaldatabase

DataWarehouse

Extractdata

Data EntryOperator

Data EntryOperator

Manager

Report

Transaction

By Mrs. Chetana

Page 12: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Scenario 3Scenario 3

Cakes & Cookies is a small, new company.

President of the company wants his company to grow. He needs information so that he can make correct decisions.

By Mrs. Chetana

Page 13: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Solution 3Solution 3

Improve the quality of data before loading it into the warehouse.

Perform data cleaning and transformation before loading the data.

Use query analysis tools to support adhoc queries.

By Mrs. Chetana

Page 14: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Solution 3Solution 3

Query and

Analysistool

President

Expansion

Improvement

sales

time

DataWarehouse

By Mrs. Chetana

Page 15: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Database Vs Data WarehouseDatabase Vs Data WarehouseDatabases : to record Data warehouse : to respond to analysis questions that

are critical for a business. All data warehouses are databases, not all databases

are data warehouses.

By Mrs. Chetana

Page 16: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

OLTP System - Online Transaction Processing (Operational System)OLAP System - Online Analytical Processing (Data Warehouse)

Source of dataOLTP: Operational data; OLTPs are the original source of the data.OLAP: Consolidation data; OLAP data comes from the various OLTP Databases

Purpose of dataOLTP: To control and run fundamental business tasksOLAP: To help with planning, problem solving, and decision support

What the dataOLTP: Reveals a snapshot of ongoing business processesOLAP: Multi-dimensional views of various kinds of business activities

Inserts and UpdatesOLTP: Short and fast inserts and updates initiated by end usersOLAP: Periodic long-running batch jobs refresh the data

OLTP Vs OLAP

By Mrs. Chetana

Page 17: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

QueriesOLTP: Relatively standardized and simple queries Returning relatively few recordsOLAP: Often complex queries involving aggregations

Processing SpeedOLTP: Typically very fastOLAP: Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes

Space RequirementsOLTP: Can be relatively small if historical data is archivedOLAP: Larger due to the existence of aggregation structures and history data; requires more indexes than OLTP

Database DesignOLTP: Highly normalized with many tablesOLAP: Typically de-normalized with fewer tables; use of star and/or snowflake schemas

Backup and RecoveryOLTP: Backup religiously; operational data is critical to run the business, data loss is likely to entail significant monetary loss and legal liabilityOLAP: Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method 

By Mrs. Chetana

Page 18: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Operational Database ( OLAP )An operational database is the database that is accessed and updated on a continual basis and usually handles the daily transactions for a business. Operational databases use an OLTP approach and are designed to be write-optimized.

OLTPOn-Line Transaction Processing (OLTP) is the process in which systems facilitate and manage data entry and retrieval on a frequent basis. The transaction is almost immediately processed and is the main strategy of operational databases. An example of an OLTP system would be an ATM machine. The benefits of using OLTP is that it is fast and efficient and simplifies the process of accessing data. (Business Intelligence).

By Mrs. Chetana

Page 19: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Operational v/s Information SystemOperational v/s Information System

Features Operational InformationCharacteristics Operational processing Informational processing

Orientation Transaction, customer-oriented

Analysis, market-oriented

User Clerk, DBA, database professional

Knowledge workers

Function Day to day operation Decision support

Data contents Current Historical

View Detailed, flat relational Summarized, multidimensional

DB design Application oriented, ER based

Subject oriented, star/snowflakes

Unit of work Short , simple transaction

Complex query

Access Read/write Mostly readBy Mrs. Chetana

Page 20: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Operational v/s Information SystemOperational v/s Information System

Features Operational Information

Focus Data in Information out

Number of records accessed

tens millions

Number of users thousands hundreds

DB size 100MB to GB 100 GB to TB

Priority High performance, high availability

High flexibility, end-user autonomy

Metric Transaction throughput

Query throughput

By Mrs. Chetana

Page 21: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

What is Data Warehouse??What is Data Warehouse??

By Mrs. Chetana

Page 22: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

A Data Warehouse is not an individual repository product.

It is a decision support systems and a knowledge-based applications architecture.

It is an environment that supports both everyday tactical decision making and long-term business strategizes.

The Data Warehouse environment positions a business to utilize an enterprise-wide data store to link information from diverse sources and make the information accessible for a variety of user purposes, most notably, Strategic Analysis.

By Mrs. Chetana

Page 23: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

What is Data Warehouse?What is Data Warehouse?

• A data warehouse is a collection of integrated databases designed to support a DSS.

• By Inmon: A data warehouse is-subject-oriented,-integrated,-time-variant,-nonvolatile

collection of data in support of management’s decision making process.

By Mrs. Chetana

Page 24: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

RalphRalph KimballKimball

Ralph Kimball provided a more concise definition of a data warehouse:

A data warehouse is a copy of transaction data specifically structured for query and analysis.

By Mrs. Chetana

Page 25: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Cont..Cont..

Where is it used?

It is used for evaluating future strategy.

It needs a successful technician:◦ Flexible.◦ Team player.◦ Good balance of business and technical

understanding.

By Mrs. Chetana

Page 26: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Subject-orientedSubject-orientedData are organized based on how the users refer to them.

Eg. sales, product, customer.It focuses on modeling and analysis of data for decision

makers.Excludes data not useful in decision support process.

By Mrs. Chetana

Page 27: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

IntegrationIntegrationData Warehouse is constructed by integrating multiple

heterogeneous sources.Data Preprocessing are applied to ensure all

inconsistencies regarding naming convention and value representations are removed.

RDBMS

LegacySystem

DataWarehouse

Flat File Data ProcessingData Transformation

By Mrs. Chetana

Page 28: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

ExamplesExamples

In terms of data.◦ encoding structures.

◦ Measurement ofattributes.

◦ physical attribute. of data

◦ naming conventions.

◦ Date type format

remarks

By Mrs. Chetana

Page 29: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Time-variantTime-variantA data warehouse is a time-varient database which

supports business needs of end users in comparing and analyzing the business with different time periods. This is also known as Time series analysis.

Provides information from historical perspective

e.g. past 5-10 years.

Every key structure contains either implicitly or explicitly an element of time.

By Mrs. Chetana

Page 30: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

NonvolatileNonvolatile

Data are stored in read-only format and do not change over time.

Data warehouse requires two operations in data accessing◦ Initial loading of data◦ Access of data

load access

By Mrs. Chetana

Page 31: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

History & EvolutionHistory & Evolution 60’s: Batch Reports

◦ Hard to find and analyze information◦ Inflexible and expensive, reprogram every new request◦ Data collection, database creation

70’s: Database Management System◦ Still inflexible, not integrated with desktop tools,

80’s: Advanced Database systems, Desktop Data Access and Analysis Tools

◦ OLTP , Query tools, Spreadsheets, GUIs◦ Easier to use, but only access operational databases

1990s—2000s: ◦ Data mining and data warehousing, multimedia databases, and Web databases◦ OLAP Engines and Tools

By Mrs. Chetana

Page 32: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Data Warehouse ArchitectureData Warehouse Architecture

By Mrs. Chetana

Page 33: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Operational source systemOperational source systemCharacteristics:Source data often in OLTP (Online Transaction

Processing) systems, also called TPS (Transaction Processing Systems)

High level of performance and availabilityBusy with the normal operations of the

organizationDifferent types of data format and data structures in

different OLTP systems AND DIFFERENT SEMANTICS

By Mrs. Chetana

Page 34: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Operational SystemsOperational Systems Operational Systems

◦ Run the business in real time.

◦ Optimized to handle large numbers of simple read/write transactions.

◦ Optimized for fast response to predefined transactions.

◦ Used by people who deal with customers, products.

Database systems have been used traditionally for OLTP.

◦ Online Transaction Processing

◦ Clerical data processing tasks

◦ Detailed, up to date data

◦ Structured repetitive tasks

Examples of Operational Data◦ Customer Files

◦ Account Balance, Call Record

◦ Point of Sale Data, Production Record

By Mrs. Chetana

Page 35: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Data staging AreaData staging AreaOften the most complex part in the architecture, and

involves...◦ Extraction (E)

◦ Transformation (T)

◦ Load (L)

◦ indexing

ETL-tools can be usedeg. Informatica – PowerCenter

  IBM – Infosphere Information Server

Oracle – Data Integrator

Scripts for extraction, transformation and load are

implemented

By Mrs. Chetana

Page 36: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

ExtractionExtraction

Reading and understanding the source data and copying the data needed for the data warehouse into staging area for further manipulation,

i.e. transformation

By Mrs. Chetana

Page 37: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

TransformationTransformationTransforms the data in accordance with the business rules and standards that have been established

Involves:data conversion/transformation:

rules to convert to a common data format and common terms/semantics

data cleaning/cleansing:

– data scrubbing

use domain-specific knowledge to check the data

– data auditing

discover suspicious pattern, discover violation of stated rulescombining data from multiple sourcesdata aggregation

By Mrs. Chetana

Page 38: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Metadata Repository Metadata Repository

The metadata repository contains the schema of the warehouse

It is similar to data dictionary- the information is kept about the logical data structures, the info about files and addresses, the info about indexes, and so on.

By Mrs. Chetana

Page 39: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Data Presentation AreaData Presentation Area

What is OLAP?Dimensional modelingData MartsROLAP/MOLAP servers

By Mrs. Chetana

Page 40: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Physical storagePhysical storage::

Due to the huge amount of data, the warehouse may be distributed onto multiple servers.

In such a case the metadata is usually replicated on each server.

Most of the data warehouse solutions employ relational database management systems. Multidimensional db mgmt systems are employed in some solutions.

Data extracted from the data warehouse storage aggregated in many ways and the summary is kept in the multidimensional databases.

By Mrs. Chetana

Page 41: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

OLAP serversOLAP servers◦ To support and operate on multi-dimensional data structures

These servers receive OLAP queries or data mining requests from front-end tools and process them by accessing data in the physical storage.

The architect of the data warehouse may decide on the number of processing units/servers depending on the typical load of the system.

By Mrs. Chetana

Page 42: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Front-end toolsFront-end tools

This is the interface between the analyst and the data warehouse.

Clients

◦Query and reporting tools◦Analysis tools◦Data mining tools

The analyst should be able to pose OlAP queries or data mining tasks and visualize the results /reports in various textual and graphical ways.

By Mrs. Chetana

Page 43: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Data warehouse Design Data warehouse Design ApproachesApproaches

Top-Down: Starts with Overall design and planning . Useful when technology is mature & well known. The business problems are clear & well understood. Advantages:

- planned & strategic nature

- Emphasizes high-level business needs

- minimizes integration problem Disadvantages:

- Requires an increase costs

- Lacks of flexibility

- Difficult to define the boundaries

- May not be suitable unless the client needs

cross-functional reporting

By Mrs. Chetana

Page 44: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Bottom-Up Starts with experiments & prototypes. Useful in early stage of business modeling & technology development. Starts with lower cost & helps to evaluate benefits of technology

Advantages:

- rapid implementation &

opportunistic application

- Lo cost Disadvantages:

- Problematic when integrating

Data marts

- Design may be out-of-date before delivery

- Requires reengineering for each increment

- Solutions may be rejected by the next line of

business to be involved

- Overall benefit to the business may be minimized By Mrs. Chetana

Page 45: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Combined ApproachCombined Approach

An organization can exploit the planned and strategic nature of the top-down approach

While retaining the rapid implementation and opportunistic application of the bottom-up approach

By Mrs. Chetana

Page 46: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

3- tier Architecture of Data Warehouse

By Mrs. Chetana

Page 47: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Three Data Warehouse ModelsThree Data Warehouse ModelsEnterprise warehouse

◦ collects all of the information about subjects spanning the entire organization

Data Mart

◦ a subset of corporate-wide data that is of value to a specific groups of users. Its scope is confined to specific, selected groups, such as marketing data mart

Virtual warehouse

◦ A set of views over operational databases

◦ Only some of the possible summary views may be materialized

By Mrs. Chetana

Page 48: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Data Warehouse Development: A Recommended Approach

Define a high-level corporate data model

Data Mart

Data Mart

Distributed Data Marts

Multi-Tier Data Warehouse

Enterprise Data Warehouse

Model refinementModel refinement

1

2 2

3

4

By Mrs. Chetana

Page 49: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Incremental & Evolutionary ApproachIncremental & Evolutionary Approach

High-level Corporate Data Model: Defined in 1 or 2 months, provides a corporate-wide consistent,

integrated view of data for various usage.

Data Marts & Enterprise data warehouse:

Can be implemented parallel , on the above corporate data model.

Distributed data marts:constructed to integrate different data marts via hub server.

Multitier data warehouse:

sole custodian of all warehouse data.

By Mrs. Chetana

Page 50: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Data MartData Mart

It is lower-cost, scaled down version of the DW.

Data Mart offer a targeted and less costly method of gaining the advantages associated with data warehousing and can be scaled up to a full DW environment over time.

example, the marketing data mart may contain data related to items, customers, and sales. Data marts are confined to subjects.

By Mrs. Chetana

Page 51: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

By Mrs. Chetana

Page 52: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Meta DataMeta DataMetadata is “data about data”

With the emergence of the data warehouse as a decision support structure, the metadata are considered as much a resource as the business data they describe.

Metadata are abstractions -- they are high level data that provide concise descriptions of lower-level data.

By Mrs. Chetana

Page 53: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

For example, a line in a sales database may contain:

4056 KJ596 223.45

This is mostly meaningless until we consult the metadata that tells us it was store number 4056, product KJ596 and sales of $223.45

The metadata are essential ingredients in the transformation of raw data into knowledge.

They are the “keys” that allow us to handle the raw data.

By Mrs. Chetana

Page 54: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Data Warehouse SchemaData Warehouse SchemaModeling data warehouses: dimensions & measures

◦ Star schema: A fact table in the middle connected to a set

of dimension tables

◦ Snowflake schema: A refinement of star schema where

some dimensional hierarchy is normalized into a set of

smaller dimension tables, forming a shape similar to

snowflake

◦ Fact constellations: Multiple fact tables share dimension

tables, viewed as a collection of stars, therefore called

galaxy schema or fact constellation

By Mrs. Chetana

Page 55: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Dimensions & MeasuresDimensions & Measures

Dimensions: perspectives or entities with respect to which an organization wants to keep records .◦ Each dimension will may have a table associated with it called as

“dimension table”

◦ The dimension table contains facts i.e measures and key to each related dimension tables.

◦ Specified by users or experts, or automatically generated and adjusted based on data distributions.

Fact: considered as numerical measures.◦ Quantitative to analyze relationships between dimensions.

By Mrs. Chetana

Page 56: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

time_keydayday_of_the_weekmonthquarteryear

time

location_keystreetcityprovince_or_streetcountry

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

Measures

item_keyitem_namebrandtypesupplier_type

item

branch_keybranch_namebranch_type

branch

Time_dimensionItem _dimension

Branch_dimension

Location_dimension

Example of Star Schema

By Mrs. Chetana

Page 57: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Star SchemaStar Schema

By Mrs. Chetana

Page 58: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Example of Star SchemaExample of Star Schema

Store Key

Product Key

Period Key

Units

Price

Store Dimension Time Dimension

Product Dimension

Sales Fact Table

Store Key

Store Name

City

State

Region

Period Key

Year

Quarter

Month

Product Key

Product Desc

By Mrs. Chetana

Page 59: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

time_keydayday_of_the_weekmonthquarteryear

time

location_keystreetcity_key

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

Measures

item_keyitem_namebrandtypesupplier_key

item

branch_keybranch_namebranch_type

branch

supplier_keysupplier_type

supplier

city_keycityprovince_or_streetcountry

city

normalized

Time_dimensionItem _dimension

Branch_dimensionLocation_dimension

Example of Snowflake

By Mrs. Chetana

Page 60: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

SnowFlake SchemaSnowFlake SchemaVariant of star schema model.A single, large and central fact table and one or more tables for

each dimension.Used when a dimensional table becomes very bigDimension tables are more normalized

i.e. split dimension table data into additional tablesAdvantages:

Redundancy is reduced and therefore, it becomes easy to maintain and save storage space.

Drawbacks:

Increases number of joins hence time consuming query processing

Complex queries results in slow report generation

By Mrs. Chetana

Page 61: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Example of SnowFlake SchemaExample of SnowFlake Schema

Store Key

Product Key

Period Key

Units

Price

Time Dimension

Product Dimension

Sales Fact Table

Store Key

Store Name

City Key

Period Key

Year

Quarter

Month

Product Key

Product Desc

City Key

City

State

Region

City Dimension

Store Dimension

By Mrs. Chetana

Page 62: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

By Mrs. Chetana

Page 63: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

time_keydayday_of_the_weekmonthquarteryear

time

location_keystreetcityprovince_or_streetcountry

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

Measures

item_keyitem_namebrandtypesupplier_type

item

branch_keybranch_namebranch_type

branch

Shipping Fact Table

time_key

item_key

shipper_key

from_location

to_location

dollars_cost

units_shipped

shipper_keyshipper_namelocation_keyshipper_type

shipper

Example of Fact Constellation

By Mrs. Chetana

Page 64: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Fact ConstellationFact Constellation

Multiple fact tables share dimension tables.This schema is viewed as collection of stars hence

called galaxy schema or fact constellation.Sophisticated application requires such schema.

By Mrs. Chetana

Page 65: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Example of Fact ConstellationExample of Fact Constellation

Store Key

Product Key

Period Key

Units

Price

Store Dimension

Product Dimension

SalesFact Table

Store Key

Store Name

City

State

Region

Product Key

Product Desc

Shipper Key

Store Key

Product Key

Period Key

Units

Price

ShippingFact Table

By Mrs. Chetana

Page 66: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Software Engineering point of viewSoftware Engineering point of view

The design and construction of a DWH may consist of the following steps:

◦Planning◦Requirement study◦Problem Analysis◦Warehouse design◦Data integration and testing◦Finally Deployment of data warehouse

By Mrs. Chetana

Page 67: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

General Data warehouse design ProcessGeneral Data warehouse design Process

1. Choose a Business Process to modelData warehouse Model : For Complex object collections

Data Mart Model: For focused Process on one kind of business

1. Choose the grain of the business processFundamental , atomic level of data to be represented in fact table

eg. In bank Individual transactions, individual daily snapshot

1. Choose the dimensions that will apply to each fact table record

2. Choose measures that will populate each fact table record (ex: dollars_sold or units_sold)

By Mrs. Chetana

Page 68: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Multi-dimensional Data ModelMulti-dimensional Data ModelData warehouse and OLAP tools are multi dimensional data model.This model views data in the form of data Cube model.A data cube allows data to be modeled & viewed in multiple

dimensions. It is defined by dimensions & facts.

Dimensions: perspectives or entities with respect to which an organization wants to keep records .

◦ Each dimension will may have a table associated with it called as “dimension table”

◦ Specified by users or experts, or automatically generated and adjusted based on data distributions.

Fact: considered as numerical measures.

◦ Quantitative to analyze relationships between dimensions.

By Mrs. Chetana

Page 69: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

From Tables and Spreadsheets to Data CubesFrom Tables and Spreadsheets to Data Cubes

A data warehouse is based on a multidimensional data model which views

data in the form of a data cube

A data cube, such as sales, allows data to be modeled and viewed in

multiple dimensions. It is defined by dimensions and facts.

◦ Dimension tables, such as item (item_name, brand, type), or time(day,

week, month, quarter, year), branch, and location.

◦ Fact table contains measures (such as dollars_sold) and keys to each of

the related dimension tables

Viewing things in 4-D becomes tricky

Data cubes are n-dimensional & do not confine data to 3-D

By Mrs. Chetana

Page 70: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

4- D data cube representation for sales data4- D data cube representation for sales data

By Mrs. Chetana

Page 71: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Data Models in Data WarehousesData Models in Data Warehouses

In OLTP systems, data are stored in 2D matrixes.

Data warehouses are subject-oriented

◦ Profits, Sales …

◦ Data need to be reorganized to better reflect the subjects.

A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.

A data cube allows data to be modeled and viewed in multiple dimensions.

By Mrs. Chetana

Page 72: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

BASICSBASICS Dimensional Model is conceptually represented using Schemas that are

described by basic elements, such as facts, measures, attributes, dimensions, hierarchies and granularity.

Facts are a focus of interest about a particular subject. Facts are used to store

business information(measures such as dollars sold) on which detailed

analysis is carried out.

Dimensions determine the contextual background for the facts and provide

the context of the measures such as item (item name, brand), product, location

or time(day, week, month, quarter, year).

Measures are continuous values that describe the fact from different points of

view (e.g., each sale is measured by its revenue)

By Mrs. Chetana

Page 73: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Hierarchies are made up of discrete dimension attributes linked by one to one relationships and determine how facts may be aggregated and selected significantly for the decision-making process.

Granularity is related to fact table and which represents the lowest level of information that will be stored in fact table.

It is based on

(i) which dimensions will be included

(ii)the hierarchy of each dimension the information will be kept

By Mrs. Chetana

Page 74: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Schema DefinitionSchema Definition

Multidimensional schema is defined using Data Mining Query Language (DMQL). The two primitives, cube definition and dimension definition, can be used for defining the data warehouses and data marts.

Syntax for Cube Definition

define cube < cube_name > [ < dimension-list > }: < measure_list >

Syntax for Dimension Definition

define dimension < dimension_name > as (attribute_or_dimension_list > )

By Mrs. Chetana

Page 75: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Star Schema DefinitionStar Schema Definition The star schema that we have discussed can be defined using Data Mining

Query Language (DMQL) as follows:

define cube sales_star [time, item, branch, location]:

dollars sold = sum(sales in dollars), units sold = count(*)

define dimension time as (time key, day, day of week, month, quarter, year)

define dimension item as (item key, item name, brand, type, supplier type)

define dimension branch as (branch key, branch name, branch type)

define dimension location as (location key, street, city, province or state, country)

Explanation:

define cube statement defines a data cube called sale_star which corresponds to central sales fact table.

define dimension statement used to define four dimensions.

By Mrs. Chetana

Page 76: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Snowflake Schema Snowflake Schema DefinitionDefinition

define cube sales_snowflake [time, item, branch, location]:

dollars sold = sum(sales in dollars), units sold = count(*)

define dimension time as (time key, day, day of week, month, quarter, year)

define dimension item as (item key, item name, brand, type, supplier (supplier key, supplier type))

define dimension branch as (branch key, branch name, branch type)

define dimension location as (location key, street, city (city key, city, province or state, country))

Explanation:

item & location dimension tables are normalized.

supplier & city are two sales_snoflake tables created.

By Mrs. Chetana

Page 77: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Fact Constellation Schema DefinitionFact Constellation Schema Definitiondefine cube sales [time, item, branch, location]:

dollars sold = sum(sales in dollars), units sold = count(*)

define dimension time as (time key, day, day of week, month, quarter, year)

define dimension item as (item key, item name, brand, type, supplier type)

define dimension branch as (branch key, branch name, branch type)

define dimension location as (location key, street, city, province or state, country)

define cube shipping [time, item, shipper, from location, to location]:

dollars cost = sum(cost in dollars), units shipped = count(*)

define dimension time as time in cube sales

define dimension item as item in cube sales

define dimension shipper as (shipper key, shipper name, location as location in cube sales, shipper type)

define dimension from location as location in cube sales

define dimension to location as location in cube sales

By Mrs. Chetana

Page 78: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

From Tables to Data From Tables to Data CubesCubesID Product Country Date Sales

1 TV US 1Qtr 100

2 PC Canada 4Qtr 500

3 CAR US 2Qtr 30

4 PC UK 3Qtr 200

5 CAR UK 1Qtr 20

6 CAR UK 2Qtr 15

7 TV Canada 4Qtr 80

By Mrs. Chetana

Page 79: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Lattice of Cuboid ( Data Cube)Lattice of Cuboid ( Data Cube)A data cube is referred as cuboidCuboid can be generated for each subset of given

dimensions.In data warehousing literature, an n-D base cube is

called a base cuboid. The top most 0-D cuboid, which holds the highest-level

of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube, each shows

different level of summarization or group by.

By Mrs. Chetana

Page 80: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Cube: A Lattice of CuboidsCube: A Lattice of Cuboids

all

time item location supplier

time,item time,location

time,supplier

item,location

item,supplier

location,supplier

time,item,location

time,item,supplier

time,location,supplier

item,location,supplier

time, item, location, supplier

0-D cuboid

1-D cuboids

2-D cuboids

3-D cuboids

4-D cuboid

APEX ( Highest level of Summarization)

Base ( Lowest level of Summarization)

By Mrs. Chetana

Page 81: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

A Sample Data CubeA Sample Data Cube

Date

Produ

ct

Cou

ntr

y

sum

sum TV

VCRPC

1Qtr 2Qtr 3Qtr 4Qtr

U.S.A

Canada

Mexico

sum

Total annual salesof TV in U.S.A.

By Mrs. Chetana

Page 82: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

MeasuresMeasuresA data cube measure is a numerical function that can be

evaluated at each point in the data cube space.

1. Distributive: If it can be computed in a distributed manner using count(), sum(), min(), max() distributed aggregate functions.

1. Algebraic: if it can be computed by an algebraic functions with M arguments using avg(), min_N(), max_N(), sum()/count() algebraic aggregate functions.

By Mrs. Chetana

Page 83: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Holistic : if there is no constant bound on the storage size needed to describe a sub aggregate.

Uses median(), mode(), rank() holistic aggregate functions.

By Mrs. Chetana

Page 84: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Concept HierarchiesConcept Hierarchies

A Concept Hierarchy defines a sequence of mappings from a set of low-level concepts to high-level

Consider a concept hierarchy for the dimension “Location”

By Mrs. Chetana

Page 85: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

By Mrs. Chetana

Page 86: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Concept HierarchiesConcept HierarchiesMany concept hierarchies are implicit within

the database system

By Mrs. Chetana

Page 87: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Concept HierarchiesConcept HierarchiesConcept hierarchies may also be defined by grouping

values for a given dimension or attribute, resulting in a set-grouping hierarchy

By Mrs. Chetana

Page 88: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Online Analysis Online Analysis Processing(OLAP)Processing(OLAP)

It enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.

Data Warehouse

Time

Product

Reg

ion

By Mrs. Chetana

Page 89: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

OLAP Server Architectures Relational OLAP (ROLAP)

◦ Use relational DBMS to store and manage warehouse data.

◦ ROLAP tools access the data in a relational database and generate SQL queries

to calculate information at the appropriate level as required.

◦ Handles Large amount of data.

◦ Limited by SQL functionalities

By Mrs. Chetana

Page 90: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

MOLAPMOLAP Multidimensional OLAP (MOLAP)

◦ uses array-based multidimensional storage engines provides natural indexing

◦ Fast query performance due to optimized storage and indexing

◦ Automated computation of higher level aggregates of the data

◦ Can perform complex calculations

◦ the storage utilization may be low if the dataset is sparse

◦ uses two levels of data storage representation to handle dense and sparse

datasets.

◦ Limited in the amount of data it can handle

By Mrs. Chetana

Page 91: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

HOLAPHOLAP Hybrid OLAP (HOLAP)

◦ Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers

higher scalability of ROLAP and faster computation of MOLAP.

◦ Stores large data volumes of detailed information. The aggregations are

stored separately in MOLAP store.

By Mrs. Chetana

Page 92: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

OLAP CubeOLAP Cube

City Product Time Units Dollars

All All All 113 251.26

Mumbai All All 64 146.07

Mumbai White Bread All 38 98.49

Mumbai Wheat Bread All 13 32.24

Mumbai Wheat Bread Qtr1 3 7.44

Mumbai Wheat Bread March 3 7.44

By Mrs. Chetana

Page 93: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

OLAP OperationsOLAP OperationsDrill down (roll down): reverse of roll-up

from higher level summary to lower level summary or detailed data, or introducing new dimensions

Time

Regio

n

Product

Category e.g Electrical Appliance

Sub Category e.g Kitchen

Product e.g Toaster

By Mrs. Chetana

Page 94: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

drill Down on time(from quarters to month)

By Mrs. Chetana

Page 95: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

OLAP OperationsOLAP OperationsRoll up (drill-up): summarize data

by climbing up hierarchy or by dimension reductionOwner may be interested in further aggregation along some dimension

Time

Regio

n

ProductCategory e.g Electrical Appliance

Sub Category e.g Kitchen

Product e.g Toaster

By Mrs. Chetana

Page 96: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

roll_up on location (from cities to countries)

By Mrs. Chetana

Page 97: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

OLAP OperationsOLAP OperationsSlice and dice:

user specify interest in some particular values of specific dimensions.project and select

Time

Regio

n

ProductProduct=Toaster

Time

Regio

n

By Mrs. Chetana

Page 98: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Slice for time = “Q1”Slice for time = “Q1”

By Mrs. Chetana

Page 99: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Dice for (location=“Toronto” or “Vancouver”)Dice for (location=“Toronto” or “Vancouver”)and (time=“Q1” or “Q2”) and (item=“Mobile” or “Modem”)and (time=“Q1” or “Q2”) and (item=“Mobile” or “Modem”)

By Mrs. Chetana

Page 100: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

By Mrs. Chetana

Page 101: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

OLAP OperationsOLAP Operations

Pivot (rotate):reorient the cube, visualization, 3D to series of 2D planes.

Time

Regio

n

Product

Region

Tim

e

Product

By Mrs. Chetana

Page 102: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

By Mrs. Chetana

Page 103: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Other OLAP OperationsOther OLAP OperationsDrill across:

◦ Accesses more than one fact table that is linked by common dimensions.

◦ Combines cubes that share one or more dimensions.

Drill through: ◦ Drill down to the bottom level of a data cube down to its back-

end relational tables.

Cross-tab :◦ Spreadsheet style row/column aggregates.

By Mrs. Chetana

Page 104: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

By Mrs. Chetana

Page 105: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Typical Data warehouse Architecture

By Mrs. Chetana

Page 106: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Data Warehouse Data Warehouse ImplementationImplementationData warehouses contain huge volumes of data.OLAP servers demand that queries should be answered

in seconds. So, data warehouse should need highly efficient cube

computation techniques, access methods and query processing techniques.

The core of multidimensional data analysis is the efficient computation of aggregations across many sets of dimensions.

In SQL aggregations are referred to as group-by’s.Each group-by can be represented as cuboid.Set of group-by’s forms a lattice of a cuboids defining a

data cube. By Mrs. Chetana

Page 107: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

The compute cube Operator and the Curse of Dimensionality

The compute cube operator computes aggregates over all subsets of the dimensions specified in the operation.

Requires an excessive storage space, especially for large number of dimensions.

A data cube is a lattice of cuboids.Suppose that we create a data cube for AllElectronics sales

that contains the following: city, item year, and sales_in_dollars.

Efficient Data Cube Computation

Compute the sum of sales, grouping by city and item.Compute the sum of sales, grouping by city.Compute the sum of sales, grouping by item.

By Mrs. Chetana

Page 108: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Lattice of cuboid, making up 3-D data cube.

By Mrs. Chetana

Page 109: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Efficient Data Cube ComputationWhat is the total number of cuboids, or group-by’s, that

can be computed for this data cube?

Three attributes: city, item, year (dimensions) sales_in_dollars (measure)The total number of cuboids or group-by’s computed for

this cube are 23=8.Group-by’s:

{(city,item,year), (city, item), (city, year), (item, year), (city), (item), (year),()}. () : group-by is empty i.e. the dimensions are not grouped.

Base cuboid contains all three dimensions (most specific) Apex cuboid is empty (least specific)

By Mrs. Chetana

Page 110: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Cube OperationCube definition and computation in DMQL:

define cube sales_cube[ city, item, year]: sum (sales_in_dollars)

compute cube sales_cube

Transform it into a SQL-like language

(with a new operator cube by, introduced by Gray et al.’96)

SELECT item, city, year, SUM (amount)

FROM SALES

CUBE BY item, city, year(item)(city)

()

(year)

(city, item) (city, year) (item, year)

(city, item, year)

By Mrs. Chetana

Page 111: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Efficient Data Cube ComputationOn-line analytical processing may need to access different cuboids

for different queries.So we have to compute all or at least some of the cuboids in the

data cube in advance.Pre-computation leads to fast response time and avoids some

redundant computation.A major challenge related to pre-computation would be storage

space if all the cuboids in the data cube are computed, especially when the cube has many dimensions.

The storage requirements are even more excessive when many of the dimensions have associated concept hierarchies, each with multiple levels.

This problem is referred to as the Curse of Dimensionality.

By Mrs. Chetana

Page 112: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Efficient Data Cube Efficient Data Cube ComputationData cube can be viewed as a lattice of cuboids

◦ The bottom-most cuboid is the base cuboid

◦ The top-most cuboid (apex) contains only one cell

How many cuboids in an n-dimensional cube with L levels?Li is number of levels associated with dimension i.

Eg. For time dimension hierarchy is “day<month<quarter<year”

Li = 4 & All is Virtual top level i.e Total Level = (4+1) = 5

& Dimensions = 10

No of cuboids generated = 510 = 9.8x106

T=∏i=1

n

(Li+1)

By Mrs. Chetana

Page 113: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Partial Materialization: Selected Computation of Cuboids

There are three choices for data cube materialization given a base cuboid.

No Materialization – Do not pre-compute any of the non-base cuboids

Full Materialization – Pre-compute all of the cuboids. Need huge memory space.

Partial Materialization - Identify the subsets of cuboids or subcubes to materialize. Exploit the materialized cuboids or subcubes during query

processing. Efficiently update the materialized cuboids or subcubes during

load and refresh.

Selection of which cuboids to materialize

Based on size, sharing, access frequency, etc.

By Mrs. Chetana

Page 114: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Iceberg CubeIceberg Cube

Computing only the cuboid cells whose count

or other aggregates satisfying the condition like

HAVING COUNT(*) >= minsup

It is called an Iceberg-Cube because it contains only some of the cells of the full cube, like the tip of an iceberg.

Motivation Only a small portion of cube cells may be “above the water’’

in a sparse cube Only calculate “interesting” cells — data above certain

threshold Avoid explosive growth of the cube

By Mrs. Chetana

Page 115: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Iceberge Cube Query with three dimensions Iceberge Cube Query with three dimensions

SELECT    A,B,C, Count(*),SUM(X) 

FROM TableName  CUBE BY A,B,C  HAVING COUNT(*)>=minsup

By Mrs. Chetana

Page 116: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Indexing OLAP Data: Bitmap Index Index on a particular columnEach value in the column has a bit vector : bit-op is fastThe length of the bit vector : # of records in the base tableThe i-th bit is set if the i-th row of the base table has the value for the

indexed columnnot suitable for high cardinality domains It reduces join, aggregation & comparison operations

Cust Region TypeC1 Asia RetailC2 Europe DealerC3 Asia DealerC4 America RetailC5 Europe Dealer

RecID Retail Dealer1 1 02 0 13 0 14 1 05 0 1

RecID Asia Europe America1 1 0 02 0 1 03 1 0 04 0 0 15 0 1 0

Base table Index on Region Index on Type

By Mrs. Chetana

Page 117: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Indexing OLAP Data: Join IndicesJoin index: JI(R-id, S-id)

where R (R-id, …) S (S-id, …)Traditional indices map the values to a list of record

ids◦ It materializes relational join in JI file and

speeds up relational join

In data warehouses, join index relates the values of the dimensions of a star schema to rows in the fact table.◦ E.g. fact table: Sales and two dimensions item and

location.

◦ A join index on location maintains for each distinct location a list of R-IDs of the tuples recording the Sales in the locations.

◦ Join indices can span multiple dimensions

By Mrs. Chetana

Page 118: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Efficient Processing OLAP Queries

The purpose of materializing cuboids and constructing OLAP index

structures is to speed up the query processing in data cubes.

Given materialized views, query processing should proceed as follows:

Determine which operations should be performed on the available

cuboids

◦ Transform drill, roll, etc. into corresponding SQL and/or OLAP

operations, e.g. Slice & dice = selection + projection

Determine which materialized cuboid(s) the relevant operations should

be applied.

By Mrs. Chetana

Page 119: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Efficient Processing OLAP QueriesSuppose that the query to be processed is on

{brand, province_or_state} with the selection constant “year = 2004”,

Dimension hierarchies as follows:

Time: “day<month<quarter<year” Item: “item_name<brand<type”

Location: “street<city<state<country”

and there are 4 materialized cuboids available:

1) { year, item_name, city }

2) {year, brand, country}

3) {year, brand, province_or_state}

4) {item_name, province_or_state} where year = 2004

Which of the cuboids should be selected to process the query?By Mrs. Chetana

Page 120: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Solution:Solution:Cuboid 2 cant be used because country is a more general

concept than state.

Cuboids 1,3,4 can be used to process query:1. they have the same set or superset of dimensions in query

2. the selection clause in the query can imply the selection in the cuboid

3. the abstraction levels for the item and location dimensions in these cuboids are at the finer level than brand and state, respectively.

By Mrs. Chetana

Page 121: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

How would the cost of each cuboid compare if used to process the query??Cuboid 1 would cost most because both item_name and

city are at lower level than brand and state concepts specified in query

Cuboid 3 will be smaller than cuboid 4;◦ If not many year values associated with items in the cube

◦ But there are mant item_names for each brand

Cuboid 4 may be better choice; If efficient indices are available for cuboid 4.

By Mrs. Chetana

Page 122: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Data Warehouse UsageData Warehouse Usage Three kinds of data warehouse applications

◦ Information processing

supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphs

◦ Analytical processing

Operates on historical data in both summarized and detailed forms.

multidimensional data analysis of data warehouse data

supports basic OLAP operations, slice-dice, drill-down, roll- up, and pivoting

◦ Data mining

knowledge discovery from hidden patterns

supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization toolsBy Mrs. Chetana

Page 123: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Data Warehouse UsageData Warehouse UsageDo OLAP systems perform data mining? Are OLAP systems actually data mining systems?

OLAP is a data summarization/ aggregation tool that helps simplify data analysis.

Data mining allows the automated discovery of implicit patterns and interesting knowledge hidden in large amounts of data.

Data mining covers much broader spectrum than simple OLAP operations because it performs not only data summary and comparison but also association, classification, prediction, clustering, time-series analysis, and other data analysis tasks.

By Mrs. Chetana

Page 124: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

From On-Line Analytical Processing (OLAP) to On Line Analytical Mining (OLAM)

Why online analytical mining?

◦ High quality of data in data warehouses DW contains integrated, consistent, cleaned data

◦ Available information processing structure surrounding data warehouses ODBC, OLEDB, Web accessing, service facilities, reporting

and OLAP tools

◦ OLAP-based exploratory data analysis Mining with drilling, dicing, pivoting, etc.

◦ On-line selection of data mining functions Integration and swapping of multiple mining functions,

algorithms, and tasks

By Mrs. Chetana

Page 125: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

August 1, 2015 Data Mining: Concepts and Techniques 125

An Integrated OLAM and OLAP Architecture

Data Warehouse

Meta Data

MDDB

OLAMEngine

OLAPEngine

User GUI API

Data Cube API

Database API

Data cleaning

Data integration

Layer3

OLAP/OLAM

Layer2

MDDB

Layer1

Data Repository

Layer4

User Interface

Filtering&Integration Filtering

Databases

Mining query Mining result

By Mrs. Chetana

Page 126: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Summary of UNIT- IISummary of UNIT- IIWhy data warehousing?

A multi-dimensional model of a data warehouse

◦ Star schema, snowflake schema, fact constellations

◦ A data cube consists of dimensions & measures

OLAP operations: drilling, rolling, slicing, dicing and pivoting

Data warehouse architecture

OLAP servers: ROLAP, MOLAP, HOLAP

Efficient computation of data cubes

◦ Partial vs. full vs. no materialization

◦ Indexing OALP data: Bitmap index and join index

◦ OLAP query processing

From OLAP to OLAM (on-line analytical mining)

By Mrs. Chetana

Page 127: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Building Data WarehouseBuilding Data Warehouse

Data SelectionData Preprocessing

◦Fill missing values◦Remove inconsistency

Data Transformation & IntegrationData Loading Data in warehouse is stored in form of fact

tables and dimension tables.

By Mrs. Chetana

Page 128: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Case StudyCase Study

Afco Foods & Beverages is a new company which produces dairy,bread and meat products with production unit located at Baroda.

There products are sold in North,North West and Western region of India.

They have sales units at Mumbai, Pune , Ahemdabad ,Delhi and Baroda.

The President of the company wants sales information.

By Mrs. Chetana

Page 129: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Sales InformationSales Information

January February March April

14 41 33 25

Report: The number of units sold.

113

Report: The number of units sold over time

By Mrs. Chetana

Page 130: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Sales InformationSales Information

Jan Feb Mar Apr

Wheat Bread 6 17

Cheese 6 16 6 8

Swiss Rolls 8 25 21

Report : The number of items sold for each product withtime

Product

Tim

eBy Mrs. Chetana

Page 131: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Sales InformationSales Information

Jan Feb Mar Apr

Mumbai Wheat Bread 3 10

Cheese 3 16 6

Swiss Rolls 4 16 6

Pune Wheat Bread 3 7

Cheese 3 8

Swiss Rolls 4 9 15

Report: The number of items sold in each City for each product with time

Product

Tim

e

City

By Mrs. Chetana

Page 132: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Sales InformationSales Information

Report: The number of items sold and income in each region for each product with time.

Jan Feb Mar Apr

Rs U Rs U Rs U Rs U

Mumbai Wheat Bread 7.44 3 24.80 10

Cheese 7.95 3 42.40 16 15.90 6

Swiss Rolls 7.32 4 29.98 16 10.98 6

Pune Wheat Bread 7.44 3 17.36 7

Cheese 7.95 3 21.20 8

Swiss Rolls 7.32 4 16.47 9 27.45 15

By Mrs. Chetana

Page 133: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Sales Measures & DimensionsSales Measures & DimensionsMeasure – Units sold, Amount.Dimensions – Product,Time,Region.

By Mrs. Chetana

Page 134: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Sales Data Warehouse ModelSales Data Warehouse Model

City Product Month Units Rupees

Mumbai Wheat Bread January 3 7.95

Mumbai Cheese January 4 7.32

Pune Wheat Bread January 3 7.95

Pune Cheese January 4 7.32

Mumbai Swiss Rolls February 16 42.40

Fact Table

By Mrs. Chetana

Page 135: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Sales Data Warehouse ModelSales Data Warehouse Model

City_ID Prod_ID Month Units Rupees

1 589 1/1/1998 3 7.95

1 1218 1/1/1998 4 7.32

2 589 1/1/1998 3 7.95

2 1218 1/1/1998 4 7.32

1 589 2/1/1998 16 42.40

By Mrs. Chetana

Page 136: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Sales Data Warehouse ModelSales Data Warehouse Model

Prod_ID Product_Name Product_Category_ID

589 Wheat Bread 1

590 White Bread 1

288 Coconut Cookies 2

Product Dimension Tables

Product_Category_Id Product_Category

1 Bread

2 Cookies

By Mrs. Chetana

Page 137: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Sales Data Warehouse ModelSales Data Warehouse Model

City_ID City Region Country

1 Mumbai West India

2 Pune NorthWest India

Region Dimension Table

By Mrs. Chetana

Page 138: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Sales Data Warehouse ModelSales Data Warehouse Model

Sales Fact

Region

ProductProduct

Category

Time

By Mrs. Chetana

Page 139: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

PresentationPresentation

Time

Regio

n

Product

Report

ReportingTool

By Mrs. Chetana

Page 140: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Data Warehousing includesData Warehousing includes

Build Data WarehouseOnline analysis processing(OLAP).Presentation.

RDBMS

Flat File

Presentation

Cleaning ,Selection &Integration

Warehouse & OLAP serverClient

By Mrs. Chetana

Page 141: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Need for Data WarehousingNeed for Data Warehousing

Industry has huge amount of operational data

Knowledge worker wants to turn this data into useful information.

This information is used by them to support strategic decision making .

By Mrs. Chetana

Page 142: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Need for Data Warehousing Need for Data Warehousing (contd..)(contd..)

It is a platform for consolidated historical data for analysis.

It stores data of good quality so that knowledge worker can make correct decisions.

By Mrs. Chetana

Page 143: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Need for Data Warehousing Need for Data Warehousing (contd..)(contd..)

From business perspective-it is latest marketing weapon-helps to keep customers by learning more about their needs .-valuable tool in today’s competitive fast evolving world.

By Mrs. Chetana

Page 144: UNIT – II Introduction to Data Warehousing By Mrs. Chetana

Data Warehousing ToolsData Warehousing Tools

Data Warehouse◦SQL Server 2000 DTS◦Oracle 8i Warehouse Builder

OLAP tools◦SQL Server Analysis Services◦Oracle Express Server

Reporting tools◦MS Excel Pivot Chart◦VB Applications

By Mrs. Chetana