datawarehouse
DESCRIPTION
DATAWAREHOUSE. Introduction. Information. Data. What is Data Warehousing?. A process of transforming data into information and making it available to users in a timely enough manner to make a difference [Forrester Research]. What are the users saying. - PowerPoint PPT PresentationTRANSCRIPT
Introduction
DATAWAREHOUSE
www.notesvillage.com
What is Data Warehousing?
A process of transforming data into information and making it available to users in a timely enough manner to make a difference
[Forrester Research]
Data
Information
www.notesvillage.com
What are the users saying...Data should be integrated
across the enterpriseSummary data has a real
value to the organizationHistorical data holds the
key to understanding data over time
What-if capabilities are required
www.notesvillage.com
What is a Data Warehouse? A single, complete and
consistent store of data obtained from a variety of different sources made available to end users in a what they can understand and use in a business context.
[Barry Devlin]
www.notesvillage.com
Data, Data everywhereyet ... I can’t find the data I need
data is scattered over the networkmany versions, subtle differences
• I can’t get the data I need– need an expert to get the data
• I can’t understand the data I found– available data poorly documented
• I can’t use the data I found– results are unexpected– data needs to be transformed from
one form to otherwww.notesvillage.com
Which are our lowest/highest margin
customers ?
Which are our lowest/highest margin
customers ?
Who are my customers and what products are they buying?
Who are my customers and what products are they buying?
Which customers are most likely to go to the competition ?
Which customers are most likely to go to the competition ?
What impact will new products/services
have on revenue and margins?
What impact will new products/services
have on revenue and margins?
What product prom--otions have the biggest
impact on revenue?
What product prom--otions have the biggest
impact on revenue?
What is the most effective distribution
channel?
What is the most effective distribution
channel?
A producer wants to know….
www.notesvillage.com
NecessityNeed for Strategic Information
Retain consumer baseIncrease customer base by 20%Enhance consumer satisfaction levelLaunch two new products
Information CrisisHow to maintain lots of data in the organizationIT resources and strategy is not effective for data
Opportunities and Risks:Quick decisionsManagerial analysisCrisis management
www.notesvillage.com
History of DSSAd hoc reportsSmall applicationsInformation centersDecision support system
www.notesvillage.com
DisadvantagesMany adhoc requests .Supplementary report problemsNo interactivity of the userFlexible informations.
www.notesvillage.com
OPERATIONAL VS INFORMATIONALCurrent ValuesOptimized for
transactionHigh frequencyRead Update
DeleteLarge userResponse time is
very fast
Archived valueOptimized for
complex queriesLowReadAdhoc,RandomRelatively more
several seconds
www.notesvillage.com
Scenerio 1ABC Pvt Ltd is a company with branches at
Mumbai, Delhi, Chennai and Banglore. The Sales Manager wants quarterly sales report. Each branch has a separate operational system.
www.notesvillage.com
Scenario 1 : ABC Pvt Ltd.
Mumbai
Delhi
Chennai
Banglore
SalesManager
Sales per item type per branchfor first quarter.
www.notesvillage.com
Solution 1:ABC Pvt Ltd
Extract sales information from each database.
Store the information in a common repository at a single site.
www.notesvillage.com
Solution 1:ABC Pvt Ltd.
Mumbai
Delhi
Chennai
Banglore
DataWarehouse
SalesManager
Query &Analysis tools
Report
www.notesvillage.com
Scenario 2 : One Stop Shopping
OperationalDatabase
Data Entry Operator
Data Entry Operator
ManagementWait
Report
www.notesvillage.com
Solution 2Solution 2•Extract data needed for analysis from operational database.•Store it in warehouse.•Refresh warehouse at regular interval so that it contains up to date information for analysis.•Warehouse will contain data with historical perspective.
www.notesvillage.com
Solution 2
Operationaldatabase
DataWarehouse
Extractdata
Data EntryOperator
Data EntryOperator
Manager
Report
Transaction
www.notesvillage.com
Scenario 3Cakes & Cookies is a small,new
company.President of the company wants his company should grow.He needs information so that he can make correct decisions.
www.notesvillage.com
Solution 3Solution 3•Improve the quality of data before loading it into the warehouse.
•Perform data cleaning and transformation before loading the data.
•Use query analysis tools to support adhoc queries.
www.notesvillage.com
Solution 3
Query and Analysistool
President
Expansion
Improvement
sales
time
DataWarehouse
www.notesvillage.com
Definitions of a Data Warehouse
- W.H. Inmon
“A subject-oriented, integrated, time-variant and
non-volatile collection of data in support of
management's decision making process”
- Ralph Kimball
“A copy of transaction data, specifically structured for query and analysis”
1.
2.
www.notesvillage.com
Data WarehouseFor organizational learning to take place,
data from many sources must be gathered together and organized in a consistent and useful way – hence, Data Warehousing (DW)
DW allows an organization (enterprise) to remember what it has noticed about its data
Data Mining techniques make use of the data in a DW
www.notesvillage.com
What is Data Warehouse??Inmons’s definition
A data warehouse is-subject-oriented,-integrated,-time-variant,-nonvolatile
collection of data in support of management’sdecision making process
www.notesvillage.com
Subject-orientedSubject-oriented
•Data warehouse is organized around subjects such as sales, product, customer.
•It focuses on modeling and analysis of data for decision makers.
•Excludes data not useful in decision support process.
www.notesvillage.com
Integrated DataIntegrationData Warehouse is constructed by
integrating multiple heterogeneous sources.
Data Preprocessing are applied to ensure consistency.RDBMS
LegacySystem
DataWarehouse
Flat Filewww.notesvillage.com
IntegrationIntegrationIn terms of data.
encoding structures.
Measurement ofattributes.
physical attribute. of data
naming conventions.
Data type format
www.notesvillage.com
Time-variantTime-variant
•Provides information from historical perspective e.g. past 5-10 years
•Data is stored as snapshots
•Every key structure contains either implicitly or explicitly an element of time
•Allows analysis for the past
•Relates information to the present
•Enables forecasts for the futurewww.notesvillage.com
NonvolatileNonvolatile•Data once recorded cannot be updated.•Data warehouse requires two operations in data accessing
•Initial loading of data•Access of data
www.notesvillage.com
Data GranularityOperational system is kept at the lowest
levelData is summarized at different levelsThe lower the level the more data
granularity.
www.notesvillage.com
Operational v/s Information SystemOperational v/s Information System
Features Operational Information
Characteristics Operational processing
Informational processing
Orientation Transaction Analysis
User Clerk,DBA,database professional
Knowledge workers
Function Day to day operation Decision support
Data Current Historical
View Detailed,flat relational Summarized, multidimensional
DB design Application oriented Subject oriented
Unit of work Short ,simple transaction
Complex query
Access Read/write Mostly readwww.notesvillage.com
Data Mart
A Data Mart is a smaller, more focused
Data Warehouse – a mini-warehouse.
A Data Mart typically reflects the business
rules of a specific business unit within an
enterprise.
www.notesvillage.com
Data Warehouse to Data Mart
DataWarehouse
Data Mart
Data Mart
Data Mart
Decision Support
Information
Decision Support
Information
Decision Support
Information
www.notesvillage.com
www.notesvillage.com
Data Mart Layer
Presentation/ Desktop
Access Layer
Meta-data Repository Layer
Warehouse Management Layer
Core DW Layer
Data Staging and Quality Layer
Data Access Layer
Operational Data Layer
External Data Layer
Data Feed/Data Mining/
Indexing Layer
Virtual DW
Coarse DW
Central DW
Distributed DW
Application Messaging (Transport) Layer
Internet/Intranet Layer
direct queries
virtual queries
ad hoc queries
1
2a
2b
2c
3
4 56 7
8
9
10
11
United Statesby Sales
$10,340 to $10,350 (1)$8,730 to $10,340 (2)$4,320 to $8,730 (2)$1,100 to $4,320 (1)
$730 to $1,100 (3)
United States$11,000
Sales
North America
Non-operational
Data Layer
Data Marts vs Data WarehousesData Marts vs Data Warehouses
www.notesvillage.com
Operational vs. InformationalSystems
Operational vs. InformationalSystems
OperationalOperationalSystemsSystems
InformationInformationDelivery SystemDelivery SystemDataDataWarehouseWarehouse
InformationalInformationalSystemsSystems
Data Data MartsMarts
www.notesvillage.com
Virtual Date Warehouse
A Virtual Data Warehouse approach is often chosen when there are infrequent demands for data and management wants to determine if/how users will use operational data.
One of the weaknesses of a Virtual Data Warehouse approach is that user queries a made against operational DBs.
One way to minimize this problem is to build a “Query Monitor” to check the performance characteristics of a query before executing it.
www.notesvillage.com
• A Coarse Data Warehouse is often chosen when the organization has a relatively clean/new operational system and management wants to make the operational data more easily available for just that system.
• A Central Data Warehouse is often chosen when the organization has a clear understanding about it Information Access needs and wants to provide “quality”, “integrated” , information to its knowledge workers
• A Distributed Data Warehouse is similar in most respects to a Central Data Warehouse, except that the data is distributed to separate mini-Data Warehouses (Data Marts )on local or specialized servers
www.notesvillage.com
Data Mart Layer
Presentation/ Desktop
Access Layer
Meta-data Repository Layer
Warehouse Management Layer
Core DW Layer
Data Staging and Quality Layer
Data Access Layer
Operational Data Layer
External Data Layer
Data Feed/Data Mining/
Indexing Layer
Virtual DW
Coarse DW
Central DW
Application Messaging (Transport) Layer
Distributed DW
Internet/Intranet Layer
direct queries
virtual queries
ad hoc queries
1
2a
2b
2c
3
4 56 7
8
9
10
11
United Statesby Sales
$10,340 to $10,350 (1)$8,730 to $10,340 (2)$4,320 to $8,730 (2)$1,100 to $4,320 (1)
$730 to $1,100 (3)
United States$11,000
Sales
North America
Non-operational
Data Layer
Central Data WarehouseCentral Data Warehouse
www.notesvillage.com
Top-down approachSingle, Central
storage of data about the content
Centralized Rule and Content
May seek quick result if implemented with Iteration
Needs high level cross functional skills
High outlay with proof of concepts
High exposure to risk/Failures
www.notesvillage.com
www.notesvillage.com
Data Warehouse ArchitecturesGeneric Two-Level ArchitectureIndependent Data MartDependent Data Mart and Operational Data
StoreLogical Data Mart and Real-Time Data
WarehouseThree-Layer architecture
All involve some form of extraction, transformation and loading (ETLETL)
www.notesvillage.com
Process Architecture These interrelated sub-processes are sometimes
referred to as an “ETL” process. 1)Extract- Since data for the data warehouse can come
from different sources and may be of different types, the plan to extract the data along with appropriate compression and encryption techniques is an important requirement for consideration.
2)Transform- Transformation of data with appropriate conversion, aggregation and cleaning besides de-normalization and surrogate key management is also an important process to be planned for building a data warehouse.
3)Load- Steps to be considered to load data with optimization by considering the multiple areas where the data is targeted to be loaded and retrieved is also an important part of the data warehouse architecture plan
www.notesvillage.com
Data Model Architecture In Data Model Architecture (also known
as Dimensional Data Model), there are 3 main data modeling styles for enterprise warehouses:
3rd Normal Form - Top Down Architecture, Top Down Implementation
Federated Star Schemas - Bottom Up Architecture, Bottom Up Implementation
Data Vault - Top Down Architecture, Bottom Up Implementation
.
www.notesvillage.com
Technology Architecture
Scalability and flexibility is required in all facets. The extent of these features are largely depend upon organizational size, business requirements, nature of business etc.
Technology or Technical architecture primary evolved from derivations from the process architecture, meta data management requirements based on business rules and security levels implementations and technology tool specific evaluation.
Besides these, the Technology architecture also looks into the various technology implementation standards in database management, database connectivity protocols (ODBC, JDBC, OLE DB etc), Middleware (based on ORB, RMI, COM/DOM etc.), Network protocols (DNS, LDAP etc) and other related technologies.
www.notesvillage.com
Information Architecture
Information Architecture is the process of translating the information from one form to another in a step by step sequence so as to manage the storage, retrieval, modification and deletion of the data in the data warehouse.
www.notesvillage.com
Resource Architecture Resource architecture is related to software
architecture in that many resources come from software resources. Resources are important because they help determine performance.
Workload is the other part of the equation. If you have enough resources to complete the workload in the right amount of time, then performance will be high.
www.notesvillage.com
Figure 11-2: Generic two-level data warehousing architecture
E
T
LOne, company-wide warehouse
Periodic extraction data is not completely current in warehouse
www.notesvillage.com
Scrub/Cleanse…uses pattern recognition and AI techniques to upgrade data quality
Fixing errors:Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
Also:Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
Figure 11-10: Steps in data reconciliation
(cont.)
www.notesvillage.com
Figure 11-3 Independent data mart data warehousing architecture
Data marts:Data marts:Mini-warehouses, limited in scope
E
T
L
Separate ETL for each independent data mart
Data access complexity due to multiple data marts
www.notesvillage.com
Figure 11-4 Dependent data mart with operational data store: a three-level architecture
ET
L
Single ETL for enterprise data warehouse(EDW)(EDW)
Simpler data access
ODS ODS provides option for obtaining current data
Dependent data marts loaded from EDW
www.notesvillage.com
ET
L
Near real-time ETL for Data WarehouseData Warehouse
ODS ODS and data warehousedata warehouse are one and the same
Data marts are NOT separate databases, but logical views of the data warehouse Easier to create new data marts
Figure 11-5 Logical data mart and real time warehouse architecture
www.notesvillage.com
Figure 11-6 Three-layer data architecture for a data warehouse
www.notesvillage.com