chapter 16 data warehouse technology and management

50
Chapter 16 Chapter 16 Data Warehouse Technology and Management

Upload: stella-sims

Post on 02-Jan-2016

236 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Chapter 16 Data Warehouse Technology and Management

Chapter 16Chapter 16Data Warehouse Technology and Management

Page 2: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Outline Outline

Basic concepts and characteristicsBusiness architectures and applicationsData cube concepts and operatorsRelational DBMS featuresPopulating a data warehouse

Page 3: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Comparison of Processing Comparison of Processing EnvironmentsEnvironmentsTransaction processing

– Uses operational databases– Short-term decisions: fulfill orders, resolve

complaints, provide staffingDecision support processing

– Uses integrated and summarized data– Medium and long-term decisions: capacity

planning, store locations, new lines of business

Page 4: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Data Warehouse Definition and Data Warehouse Definition and CharacteristicsCharacteristics

A central repository for summarized and integrated data from operational databases and external data sources

Key Characteristics– Subject-oriented– Integrated– Time-variant– Nonvolatile

Page 5: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Data ComparisonData ComparisonCharacteristic Operational

Database Data Warehouse

Currency Current Historical

Detail level Individual Individual and Summary

Orientation Process orientation

Subject orientation

Number of records processed

Few Thousands

Normalization level Mostly normalized

Mostly denormalized

Update level Volatile Nonvolatile (refreshed)

Data model Relational Multidimensional

Page 6: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Business Architectures and Business Architectures and ApplicationsApplicationsData warehouse projectsTop-down architecturesBottom-up architectureApplications and data mining

Page 7: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Data Warehouse ProjectsData Warehouse Projects

Large efforts with much coordination across departments

Enterprise data model– Important artifact of data warehouse project– Structure of data model– Meta data for data transformation

Top-down vs. bottom-up business architectures

Page 8: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Two Tier ArchitectureTwo Tier Architecture

Data warehouse

Operationaldatabase

Operationaldatabase

Externaldata source

EDM

Summarizeddata

Transformationprocess

Data warehouseserver

User departments

Page 9: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Three Tier ArchitectureThree Tier Architecture

Data warehouse

Operationaldatabase

Operationaldatabase

Externaldata source

EDM

Summarizeddata

Transformationprocess

Data warehouseserver

Userdepartments

Data mart

Data mart

Data mart tier

Extractionprocess

Page 10: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Bottom-up ArchitectureBottom-up Architecture

Operationaldatabase

Operationaldatabase

Externaldata source

Transformationprocess

Userdepartments

Data mart

Data mart

Data mart tier

Page 11: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

ApplicationsApplications

Industry Key Applications Airline Yield management,

route assessment Telecommunications Customer retention,

network design Insurance Risk assessment, product

design, fraud detection Retail Target marketing,

supply-chain management

Page 12: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Data MiningData Mining

Discover significant, implicit patterns– Target promotions– Change mix and collocation of items

Requires large volumes of transaction dataImportant application for data warehouses

Page 13: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Data Cube Concepts and Data Cube Concepts and OperatorsOperatorsBasicsDimension and measure detailsOperators

Page 14: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Data Cube BasicsData Cube Basics

Multidimensional arrangement of dataUsers think about decision support data as

data cubesTerminology

– Dimension: subject label for a row or column– Member: value of dimension– Measure: quantitative data stored in cells

Page 15: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Data Cube ExampleData Cube Example

Product

Soda Diet Orange Lime soda soda soda

Lo

cati

on California

Utah

Arizona

Washington

Colorado Time 1/1/2003 1/2/2003

….. 12/31/2003

80 110 60 25

40 90 50 30

70 55 60 35

75 85 45 45

65 45 85 60

Page 16: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Dimension and Measure Dimension and Measure DetailsDetailsDimensions

– Hierarchies: members can have sub members– Sparsity: many cells do not have data

Measures– Derived measures– Multiple measures in cells

Page 17: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Time Series DataTime Series Data

Common data type in trend analysisReduce dimensionality using time seriesTime series properties

– Data type– Start date– Calendar– Periodicity– Conversion

Page 18: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Slice OperatorSlice Operator

Focus on a subset of dimensions Set dimension to specific value: 1/1/2003

Location Product Soda Diet soda Lime soda Orange soda

California 80 110 60 25 Utah 40 90 50 30 Arizona 70 55 60 35 Washington 75 85 45 45 Colorado 65 45 85 60

Page 19: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Dice OperatorDice OperatorFocus on a subset of member valuesReplace dimension with a subset of valuesDice operation often follows a slice

operation

PRODUCT

Soda Diet Orange Lime soda soda soda

LO

CA

TIO

N

Utah 40 90 50 30

Page 20: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Other OperatorsOther Operators

Operators for hierarchical dimensions– Drill-down: add detail to a dimension– Roll-up: remove detail from a dimension– Recalculate measure values

Pivot: rearrange dimensions

Page 21: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Operator SummaryOperator Summary

Operator Purpose Description

Slice Focus attention on a

subset of dimensions Replace a dimension with a single member value or with a summary of

its measure values Dice Focus attention on a

subset of member values Replace a dimension with a subset of members

Drill-down Obtain more detail about a dimension

Navigate from a more general level to a more specific level

Roll-up Summarize details about

a dimension Navigate from a more specific level to a more general level

Pivot Present data in a different order

Rearrange the dimensions in a data

cube

Page 22: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Relational DBMS SupportRelational DBMS Support

Data modelingDimension representationGROUP BY extensionsMaterialized views and query rewritingStorage structures and optimization

Page 23: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Relational Data ModelingRelational Data Modeling

Dimension table: contains member valuesFact table: contains measure values1-M relationships from dimension to fact

tablesGrain: most detailed measure values stored

Page 24: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Star Schema ExampleStar Schema Example

CustomerCustIdCustNameCustPhoneCustStreetCustCityCustStateCustZipCustNation

StoreStoreIdStoreManagerStoreStreetStoreCityStoreStateStoreZipStoreNationDivIdDivNameDivManager

SalesSalesNoSalesUnitsSalesDollarSalesCost

ItemItemIdItemNameItemUnitPriceItemBrandItemCategory

TimeTimeNoTimeDayTimeMonthTimeQuarterTimeYearTimeDayOfWeekTimeFiscalYear

ItemSales

CustSales

TimeSales

StoreSales

Page 25: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Constellation Schema ExampleConstellation Schema Example

CustomerCustIdCustNameCustPhoneCustStreetCustCityCustStateCustZipCustNation

StoreStoreIdStoreManagerStoreStreetStoreCityStoreStateStoreZipStoreNationDivIdDivNameDivManager

SalesSalesNoSalesUnitsSalesDollarSalesCost

ItemItemIdItemNameItemUnitPriceItemBrandItemCategory

TimeTimeNoTimeDayTimeMonthTimeQuarterTimeYearTimeDayOfWeekTimeFiscalYear

ItemSales

CustSales

TimeSales

StoreSales

InventoryInvNoInvQOHInvCostInvReturns

SupplierSuppIdSuppNameSuppCitySuppStateSuppZipSuppNation

SuppInv

ItemInv

StoreInv

TimeInv

Page 26: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Snowflake Schema ExampleSnowflake Schema Example

CustomerCustIdCustNameCustPhoneCustStreetCustCityCustStateCustZipCustNation

StoreStoreIdStoreManagerStoreStreetStoreCityStoreStateStoreZipStoreNation

SalesSalesNoSalesUnitsSalesDollarSalesCost

ItemItemIdItemNameItemUnitPriceItemBrandItemCategory

TimeTimeNoTimeDayTimeMonthTimeQuarterTimeYearTimeDayOfWeekTimeFiscalYear

ItemSales

CustSales

TimeSales

StoreSales

DivisionDivIdDivNameDivManager

DivStore

Page 27: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Handling M-N RelationshipsHandling M-N Relationships

Source data may have M-N relationships, not 1-M relationships

Adjust fact or dimension tables for a fixed number of exceptions

More complex solutions to support M-N relationships with a variable number of connections

Page 28: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Dimension RepresentationDimension Representation

Star schema and variations lack dimension representation

Explicit dimension representation important to data cube operations and optimization

Proprietary extensions for dimension representation

Represent levels, hierarchies, and constraints

Page 29: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Oracle Dimension RepresentationOracle Dimension Representation

Levels: dimension componentsHierarchies: may have multiple hierarchiesConstraints: functional dependency

relationships

Page 30: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

CREATE DIMENSION ExampleCREATE DIMENSION ExampleCREATE DIMENSION StoreDim LEVEL StoreId IS Store.StoreId LEVEL City IS Store.StoreCity LEVEL State IS Store.StoreState LEVEL Zip IS Store.StoreZip LEVEL Nation IS Store.StoreNation LEVEL DivId IS Division.DivId HIERARCHY CityRollup ( StoreId CHILD OF City CHILD OF State CHILD OF Nation )HIERARCHY ZipRollup ( StoreId CHILD OF Zip CHILD OF State CHILD OF Nation )HIERARCHY DivisionRollup ( StoreId CHILD OF DivId JOIN KEY Store.DivId REFERENCES DivId )ATTRIBUTE DivId DETERMINES Division.DivNameATTRIBUTE DivId DETERMINES Division.DivManager ;

Page 31: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

GROUP BY ExtensionsGROUP BY Extensions

ROLLUP operatorCUBE operatorGROUPING SETS operatorOther extensions

– Ranking– Ratios– Moving summary values

Page 32: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

CUBE ExampleCUBE Example

SELECT StoreZip, TimeMonth, SUM(SalesDollar) AS SumSales FROM Sales, Store, Time WHERE Sales.StoreId = Store.StoreId AND Sales.TimeNo = Time.TimeNo AND (StoreNation = 'USA' OR StoreNation = 'Canada') AND TimeYear = 2002 GROUP BY CUBE (StoreZip, TimeMonth)

Page 33: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

ROLLUP ExampleROLLUP Example

SELECT TimeMonth, TimeYear, SUM(SalesDollar) AS SumSales FROM Sales, Store, Time WHERE Sales.StoreId = Store.StoreId AND Sales.TimeNo = Time.TimeNo AND (StoreNation = 'USA' OR StoreNation = 'Canada') AND TimeYear BETWEEN 2002 AND 2003 GROUP BY ROLLUP (TimeMonth,TimeYear);

Page 34: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

GROUPING SETS ExampleGROUPING SETS Example

SELECT StoreZip, TimeMonth, SUM(SalesDollar) AS SumSales FROM Sales, Store, Time WHERE Sales.StoreId = Store.StoreId AND Sales.TimeNo = Time.TimeNo AND (StoreNation = 'USA' OR StoreNation = 'Canada') AND TimeYear = 2002 GROUP BY GROUPING SETS((StoreZip, TimeMonth), StoreZip, TimeMonth, ());

Page 35: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Variations of the Grouping Variations of the Grouping OperatorsOperatorsPartial cubePartial rollupComposite columnsCUBE and ROLLUP inside a

GROUPIING SETS operation

Page 36: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Materialized ViewsMaterialized Views

Stored viewPeriodically refreshed with source dataUsually contain summary dataFast query response for summary dataAppropriate in query dominant

environments

Page 37: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Materialized View ExampleMaterialized View Example

CREATE MATERIALIZED VIEW MV1BUILD IMMEDIATEREFRESH COMPLETE ON DEMANDENABLE QUERY REWRITE ASSELECT StoreState, TimeYear, SUM(SalesDollar) AS SUMDollar1 FROM Sales, Store, Time WHERE Sales.StoreId = Store.StoreId AND Sales.TimeNo = Time.TimeNo AND TimeYear > 2000 GROUP BY StoreState, TimeYear;

Page 38: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Query RewritingQuery Rewriting

Substitution processMaterialized view replaces references to

fact and dimension tables in a query Query optimizer must evaluate whether the

substitution will improve performance over the original query

More complex than query modification process for traditional views

Page 39: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Query Rewriting ExampleQuery Rewriting Example-- Data warehouse querySELECT StoreState, TimeYear, SUM(SalesDollar) FROM Sales, Store, Time WHERE Sales.StoreId = Store.StoreId AND Sales.TimeNo = Time.TimeNo AND StoreNation IN ('USA','Canada') AND TimeYear = 2002 GROUP BY StoreState, TimeYear;-- Query Rewrite: replace Sales and Time tables with MV1SELECT DISTINCT MV1.StoreState, TimeYear, SumDollar1FROM MV1, StoreWHERE MV1.StoreState = Store.StoreState AND TimeYear = 2002 AND StoreNation IN ('USA','Canada');

Page 40: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Storage and Optimization Storage and Optimization TechnologiesTechnologiesMOLAP: direct storage and manipulation

of data cubesROLAP: relational extensions to support

multidimensional dataHOLAP: combine MOLAP and ROLAP

storage engines

Page 41: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

ROLAP TechniquesROLAP Techniques

Bitmap join indexesStar join optimizationQuery rewritingSummary storage advisorsParallel query execution

Page 42: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Populating a Data WarehousePopulating a Data Warehouse

Data sourcesWorkflow representationOptimizing the refresh process

Page 43: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Data SourcesData Sources

CooperativeLoggedQueryableSnapshot

Page 44: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Maintenance WorkflowMaintenance Workflow

PreparationPhase

IntegrationPhase

UpdatePhase

Propagation

Notification

Extraction

Cleaning

Auditing

Transportation

Merging

Auditing

Page 45: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Data Quality ProblemsData Quality Problems

Multiple identifiersMultiple field namesDifferent unitsMissing valuesOrphaned valuesMultipurpose fieldsConflicting dataDifferent update times

Page 46: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

ETL ToolsETL Tools

Extraction, Transformation, and LoadingSpecification basedEliminate custom codingThird party and DBMS based tools

Page 47: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Refresh OptimizationRefresh Optimization

Accounting

UnknownProcesses

ExternalData Sources

InternalData Sources

DataWarehouse

ETLTools

Valid Time Lag

Load Time Lag

Fact andDimensionChanges

PrimarilyDimensionChanges

Page 48: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Determining the Refresh Determining the Refresh FrequencyFrequencyMaximize net refresh benefitValue of data timelinessCost of refreshSatisfy data warehouse and source system

constraints

Page 49: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

Determining the Level of Historical Determining the Level of Historical IntegrityIntegrityPrimarily an issue for dimension updatesType I: overwrite old valuesType II: version numbers for an unlimited

historyType III: new columns for a limited

history

Page 50: Chapter 16 Data Warehouse Technology and Management

McGraw-Hill/Irwin © 2004 The McGraw-Hill Companies, Inc. All rights reserved.

SummarySummary

Data warehouse requirements differ from transaction processing.

Architecture choice is important.Multidimensional data model is intuitiveRelational representation and storage

techniques are significant.Maintaining a data warehouse is an

important, operational problem.