business systems intelligence: 3. data warehousing dr. brian mac namee (

53
Business Systems Intelligence: 3. Data Warehousing D r . B r i a n M a c N a m e e ( w w w . c o m p . d i t . i e / b m a c n a m e e )

Post on 21-Dec-2015

216 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

Business Systems Intelligence:3. Data Warehousing

Dr. B

rian Mac N

amee (w

ww

.comp.dit.ie/bm

acnamee)

Page 2: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

2of25

2of58 Acknowledgments

These notes are based (heavily) on those provided by the authors to

accompany “Data Mining: Concepts & Techniques” by Jiawei Han and Micheline Kamber

Some slides are also based on trainer’s kits provided by

More information about the book is available at:www-sal.cs.uiuc.edu/~hanj/bk2/

And information on SAS is available at:www.sas.com

Page 3: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

3of25

3of52

Data Warehousing & OLAP Technology For Data Mining

Today we will begin to look at data warehouses, and in particular:

– What is a data warehouse? – A multi-dimensional data model– Data warehouse architecture– Data warehouse implementation– Further development of data cube technology– From data warehousing to data mining

Page 4: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

4of25

4of52 What Is A Data Warehouse?Defined in many different ways, but not rigorously

– A decision support database that is maintained separately from the organization’s operational database

– Support information processing by providing a solid platform of consolidated, historical data for analysis

“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile

collection of data in support of management’s decision-making process”

—Bill Inmon

Page 5: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

5of25

5of52

Data Warehouse - Subject-Oriented

Organized around major subjects, such as customer, product, sales

Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing

Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process

Page 6: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

6of25

6of52 Data Warehouse - IntegratedConstructed by integrating multiple, heterogeneous data sources

– Relational databases, flat files, on-line transaction records

Data cleaning and data integration techniques are applied

– Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources

• E.g., Hotel price: currency, tax, breakfast covered, etc.

– When data is moved to the warehouse, it is converted

Page 7: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

7of25

7of52 Data Warehouse - Time VariantThe time horizon for the data warehouse is significantly longer than that of operational systems

– Operational database: current value data– Data warehouse data: provide information from

a historical perspective (e.g., past 5-10 years)

Every key structure in the data warehouse– Contains an element of time, explicitly or

implicitly– But the key of operational data may or may not

contain “time element”

Page 8: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

8of25

8of52 Data Warehouse - Non-VolatileA physically separate store of data transformed from the operational environment

Operational update of data does not occur in the data warehouse environment

– Does not require transaction processing, recovery, and concurrency control mechanisms

– Requires only two operations in data accessing: • Initial loading of data and access of data

Page 9: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

9of25

9of52

Data Warehouse Vs. Heterogeneous DBMS

Traditional heterogeneous DB integration: – Build wrappers/mediators on top of heterogeneous

databases – Query driven approach

• When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for individual heterogeneous sites involved, and the results are integrated into a global answer set

• Complex information filtering, compete for resources

Data warehouse: update-driven, high performance– Information from heterogeneous sources is integrated in

advance and stored in warehouses for direct query and analysis

Page 10: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

10of25

10of52 What Is OLAP?Online Analytical Processing (OLAP) is an industry-accepted reporting technology that provides high-performance analysis and easy reporting on large volumes of data

The goal of OLAP, also known as multidimensional data analysis, is to provide fast and flexible data summarization, analysis, and reporting capabilities with the ability to view trends over time

Page 11: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

11of25

11of52 What Is OLAP? (cont…)OLAP applications, also called decision support systems (DSS), have the following features:

– Enable users to look at different relationships in data by looking beyond traditional two-dimensional row and column data analysis

– Offer high-performance access to large amounts of presummarized data

– Give users the power to retrieve answers to multi-dimensional business questions quickly and easily

– Provide slice-and-dice views of multiple relationships in large quantities of presummarized data

Page 12: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

12of25

12of52

Data Warehouse Vs. Operational DBMS

OLTP (on-line transaction processing)– Major task of traditional relational DBMS– Day-to-day operations: purchasing, inventory, banking,

manufacturing, payroll, registration, accounting, etc.OLAP (on-line analytical processing)

– Major task of data warehouse system– Data analysis and decision making

Distinct features (OLTP vs. OLAP):– User and system orientation: customer vs. market– Data contents: current, detailed vs. historical,

consolidated– Database design: ER + application vs. star + subject– View: current, local vs. evolutionary, integrated– Access patterns: update vs. read-only but complex

queries

Page 13: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

13of25

13of52 OLTP Vs. OLAP

OLTP OLAP

Users Clerk, IT professional Knowledge worker

Function Day to day operations Decision support

DB Design Application-oriented Subject-oriented

DataCurrent, up-to-datedetailed, flat relationalIsolated

Historical, summarized, multidimensional, integrated, consolidated

Usage Repetitive Ad-hoc

AccessRead/writeIndex/hash on prim. Key

Lots of scans

Unit of Work Short, simple transaction Complex query

# Records Accessed Tens Millions

# Users Thousands Hundreds

DB Size 100MB-GB 100GB-TB

Metric Transaction throughput Query throughput, response

Page 14: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

14of25

14of52 Why Separate Data Warehouse?High performance for both systems

– DBMS - tuned for OLTP: access methods, indexing, concurrency control, recovery

– Warehouse - tuned for OLAP: complex OLAP queries, multidimensional view, consolidation.

Different functions and different data:– Missing data: Decision support requires historical data

which operational DBs do not typically maintain– Data consolidation: DS requires consolidation

(aggregation, summarization) of data from heterogeneous sources

– Data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

Page 15: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

15of25

15of52

Some Key Points Regarding Data Warehousing

Data warehousing has these characteristics:– Not new; existed for some time– An accepted practice– Fairly widespread, but not always well done

Many organizations now run data warehouse projects

Data warehousing is used in multiple industry sectors throughout the world

Within organizations, more and more data warehouses and data marts are used

Data warehousing is most successful when used as the foundation of an integrated information strategy

Page 16: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

16of25

16of52

SAS Rapid Data Warehouse Methodology

The SAS Rapid Data Warehouse Methodology facilitates the development of high quality data warehousing environments

The methodology is based onexperience that was gainedfrom hundreds of warehousingprojects

Page 17: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

17of25

17of52 Why Use A Methodology?For the first few data warehouse projects, a methodology accomplishes the following:

– Gives practitioners a roadmap to follow for success

– Enables practitioners to benefit from the experience (and mistakes!) of others

For experienced practitioners, a methodology provides the following:

– A checklist of tasks, roles, deliverables, and so on

– An explanation of tasks to others (customer management, users, project staff)

Page 18: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

18of25

18of52

SAS Rapid Data Warehouse Methodology

The SAS Rapid Data Warehouse Methodology has seven phases

The phases provide logical work groupings and milestones to verify that the project has a solid foundationOngoing

Maintenanceand Administration

Assessment

Requirements

Design

Deployment

Review

Construction

Final Test

Page 19: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

19of25

19of52 Assessment Phase

Phase 1 – Assessment:Identify the organization’sreadiness for undertakinga data warehousing project.

Ongoing Maintenance

and Administration

Assessment

Requirements

Design

Deployment

Review

Construction

Final Test

Page 20: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

20of25

20of52 Requirements Phase

Ongoing Maintenance

and Administration

Assessment

Requirements

Design

Deployment

Review

Construction

Final Test

Phase 2 – Requirements:Gather the businessrequirements and definethe acceptance criteria.

Page 21: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

21of25

21of52 Design Phase

Ongoing Maintenance

and Administration

Assessment

Requirements

Design

Deployment

Review

Construction

Final Test

Phase 3 – Design:Analyze and designthe warehouse systemarchitecture. Confirm the acceptance test criteria.

Page 22: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

22of25

22of52 Construction Phase

Ongoing Maintenance

and Administration

Assessment

Requirements

Design

Deployment

Review

Construction

Final Test

Phase 4 – Construction:Construct and populate the warehouse. Code and test the exploitation applications and processes. Validate types of test. Perform unit and integration tests.

Page 23: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

23of25

23of52 Final Test Phase

Ongoing Maintenance

and Administration

Assessment

Requirements

Design

Deployment

Review

Construction

Final Test

Phase 5 – Final Test:Test the warehouse andensure that it meets thespecifications of therequirements document.

Page 24: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

24of25

24of52 Deployment Phase

Ongoing Maintenance

and Administration

Assessment

Requirements

Design

Deployment

Review

Construction

Final Test

Phase 6 – Deployment:Roll out to the environmentand perform an acceptance test. Ensure knowledge transfer and user access throughout the organization.

Page 25: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

25of25

25of52 Review Phase

Ongoing Maintenance

and Administration

Assessment

Requirements

Design

Deployment

Review

Construction

Final Test

Phase 7 – Review:Review the project process, the deployment process, and the impact on the organization.

Page 26: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

26of25

26of52

From Tables & Spreadsheets to Data Cubes

A data warehouse is based on a multi-dimensional data model which views data in the form of a data cube

A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions

– Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)

– Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables

Page 27: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

27of25

27of52 A Data Cube Of Dollars Sold

Page 28: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

28of25

28of52 Adding A 4th Dimension

Page 29: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

29of25

29of52 N-Dimensional Data CubesWe can continue to add dimensions indefinitley

In data warehousing literature, an n-D base cube is called a base cuboid

The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid

The lattice of cuboids forms a data cube

Page 30: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

30of25

30of52 Cube: A Lattice Of Cuboids

all

time item location supplier

time,item time,location

time,supplier

item,location

item,supplier

location,supplier

time,item,location

time,item,supplier

time,location,supplier

item,location,supplier

time, item, location, supplier

0-D(apex) cuboid

1-D cuboids

2-D cuboids

3-D cuboids

4-D(base) cuboid

Page 31: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

31of25

31of52

Conceptual Modeling of Data Warehouses

Modeling data warehouses: dimensions & measures

– Star schema: A fact table in the middle connected to a set of dimension tables

– Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake

– Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation

Page 32: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

32of25

32of52 Example Star Schema

time_keydayday_of_the_weekmonthquarteryear

time

location_keystreetcitystate_or_provincecountry

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_keyitem_namebrandtypesupplier_type

item

branch_keybranch_namebranch_type

branch

Page 33: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

33of25

33of52 Example Snowflake Schema

time_keydayday_of_the_weekmonthquarteryear

time

location_keystreetcity_key

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_keyitem_namebrandtypesupplier_key

item

branch_keybranch_namebranch_type

branch

supplier_keysupplier_type

supplier

city_keycitystate_or_provincecountry

city

Page 34: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

34of25

34of52 Example Fact Constellation

time_keydayday_of_the_weekmonthquarteryear

time

location_keystreetcityprovince_or_statecountry

location

Measures

item_keyitem_namebrandtypesupplier_type

item

branch_keybranch_namebranch_type

branch

Shipping Fact Table

time_key

item_key

shipper_key

from_location

to_location

dollars_cost

units_shipped

shipper_keyshipper_namelocation_keyshipper_type

shipper

Sales Fact Table time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Page 35: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

35of25

35of52 Measures: Three CategoriesDistributive: If the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning

• E.g., count(), sum(), min(), max().

Algebraic: If the result can be computed by an algebraic function with M arguments, each of which is obtained by applying a distributive aggregate function

• E.g., avg(), min_N(), standard_deviation()

Holistic: If there is no constant bound on the storage size needed to describe a sub-aggregate.

• E.g., median(), mode(), rank()

Measures are usually confined to numerics

Page 36: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

36of25

36of52

A Concept Hierarchy: Dimension (Location)

All

Europe North America

MexicoCanadaSpainGermany

Vancouver

M. WindL. Chan

...

......

... ...

...

All

Region

Office

Country

TorontoFrankfurtCity

Page 37: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

37of25

37of52 Multidimensional DataSales volume as a function of product, month, and region

Pro

duct

Regio

n

Month

Dimensions: Product, Location, TimeHierarchical summarization paths

Industry Region Year

Category Country Quarter

Product City Month Week

Office Day

Page 38: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

38of25

38of52 A Sample Data Cube

Total annual salesof TV in U.S.A.Date

Produ

ct

Cou

ntr

ysum

sum TV

VCRPC

1Qtr 2Qtr 3Qtr 4Qtr

U.S.A

Canada

Mexico

sum

Page 39: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

39of25

39of52

Cuboids Corresponding To The Cube

all

product date country

product,date product,country date, country

product, date, country

0-D (apex) cuboid

1-D cuboids

2-D cuboids

3-D (base) cuboid

Page 40: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

40of25

40of52 Browsing A Data Cube

VisualizationOLAP capabilitiesInteractive manipulation

Page 41: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

41of25

41of52 Typical OLAP OperationsRoll up (drill-up): Summarize data

– By climbing up hierarchy or by dimension reduction

Drill down (roll down): Reverse of roll-up– From higher level summary to lower level

summary or detailed data, or introducing new dimensions

Slice and dice: – Project and select

Pivot (rotate): – Reorient the cube, visualization, 3D to series of

2D planes

Page 42: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

42of25

42of52

Typical OLAP Operations (cont…)

Other operations:– Drill across: Involving (across) more than one

fact table– Drill through: Through the bottom level of the

cube to its back-end relational tables (using SQL)

Page 43: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

43of25

43of52

Page 44: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

44of25

44of52

Design Of A Data Warehouse: A Business Analysis Framework

Four views regarding the design of a data warehouse

– Top-down view• Allows selection of the relevant information necessary

for the data warehouse

– Data source view• Exposes the information being captured, stored, and

managed by operational systems

– Data warehouse view• Consists of fact tables and dimension tables

– Business query view • Sees the perspectives of data in the warehouse from

the view of end-user

Page 45: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

45of25

45of52

Data Warehouse Design Process

Top-down, bottom-up approaches or a combination of both

– Top-down: • Starts with overall design and planning (mature)

– Bottom-up: • Starts with experiments and prototypes (rapid)

From software engineering point of view– Waterfall:

• Structured and systematic analysis at each step before proceeding to the next

– Spiral:• Rapid generation of increasingly functional systems,

short turn around time, quick turn around

Page 46: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

46of25

46of52

Data Warehouse Design Process (cont…)

Typical data warehouse design process– Choose a business process to model

• E.g. orders, invoices, etc.

– Choose the grain (atomic level of data) of the business process

– Choose the dimensions that will apply to each fact table record

– Choose the measure that will populate each fact table record

Page 47: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

47of25

47of52

DataWarehouse

ExtractTransformLoadRefresh

OLAP Engine

AnalysisQueryReportsData mining

Monitor&

IntegratorMetadata

Data Sources Front-End Tools

Serve

Data Marts

Operational DBs

other

sources

Data Storage

OLAP Server

Multi-Tiered Architecture

Page 48: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

48of25

48of52 Three Data Warehouse ModelsEnterprise Warehouse

– Collects all of the information about subjects spanning the entire organization

Data Mart– A subset of corporate-wide data that is of value

to a specific groups of users• Its scope is confined to specific, selected groups• Independent vs. dependent (directly from warehouse)

data mart

Virtual Warehouse– A set of views over operational databases– Only some of the possible summary views may

be materialized

Page 49: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

49of25

49of52

Data Warehouse Development: A Recommended Approach

Define a High-Level Corporate Data Model

Data Mart Data Mart

Distributed Data Marts

Enterprise Data Warehouse

Multi-Tier Data Warehouse

Model RefinementModel

Refinement

Page 50: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

50of25

50of52 OLAP Server ArchitecturesRelational OLAP (ROLAP)

– Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middle ware to support missing pieces

– Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services

– Greater scalability

Multidimensional OLAP (MOLAP) – Array-based multidimensional storage engine

(sparse matrix techniques)– Fast indexing to pre-computed summarized data

Page 51: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

51of25

51of52

OLAP Server Architectures (cont…)

Hybrid OLAP (HOLAP)– User flexibility, e.g., low level: relational, high-

level: array

Specialized SQL Servers– Specialized support for SQL queries over

star/snowflake schemas

Page 52: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

52of25

52of52 SummaryToday we took an overview of data warehousing

We really have barely scratched the surface

Page 53: Business Systems Intelligence: 3. Data Warehousing Dr. Brian Mac Namee (

53of25

53of52 Questions

?