chapter 2: the data warehouse modern data warehousing, mining, and visualization: core concepts by...

28
Chapter 2: The Data Warehouse Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas Spring 2012

Upload: mervin-robertson

Post on 01-Jan-2016

221 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Chapter 2: The Data Warehouse Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas Spring 2012

Chapter 2: The Data Warehouse

Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas

Spring 2012

Page 2: Chapter 2: The Data Warehouse Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas Spring 2012

© 2003, Prentice-Hall 2

Objectives

Goals and characteristics of Data Warehouses

Compare Operational Data Store to Data Warehouse

Elements of data warehouse architecture Role & components of Metadata Challenges of implementing DW Examine DW Technologies

Page 3: Chapter 2: The Data Warehouse Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas Spring 2012

2-1: Stores, Warehouses and Marts

A data warehouse is a collection of integrated databases designed to support a DSS.

An operational data store (ODS) stores data for a specific application. It feeds the data warehouse a stream of desired raw data.

A data mart is a lower-cost, scaled-down version of a data warehouse, usually designed to support a small group of users (rather than the entire firm).

The metadata is information that is kept about the warehouse.

Page 4: Chapter 2: The Data Warehouse Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas Spring 2012

The Data Warehouse Environment

The organization’s legacy systems and data stores provide data to the data warehouse or mart.

During the transfer of data from the various sources, cleansing or transformation may occur, so the data in the DW is more uniform (ETS=Extract, Transform, Store consistently).

Simultaneously, metadata is recorded. Finally, the DW or mart may be used to

create one or more “personal” warehouses.

Page 5: Chapter 2: The Data Warehouse Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas Spring 2012

Organizational Data Flow and Data Storage Components

Figure 2-1

Page 6: Chapter 2: The Data Warehouse Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas Spring 2012

Characteristics of a Data Warehouse

Subject oriented – organized based on use Integrated – inconsistencies removed Nonvolatile – stored in read-only format Time variant – data are normally time series Summarized – in decision-usable format Large volume – data sets are quite large Non-normalized – often redundant (semi-

normalized in actuality) Metadata – data about data are stored Data sources – comes from nonintegrated

sources

Page 7: Chapter 2: The Data Warehouse Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas Spring 2012

A Data Warehouse is Subject Oriented

Figure 2-2

Page 8: Chapter 2: The Data Warehouse Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas Spring 2012

Data in a Data Warehouse are Integrated

ETS

Page 9: Chapter 2: The Data Warehouse Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas Spring 2012

© 2003, Prentice-Hall 9

2-2: The Data Warehouse Architecture

The architecture consists of various interconnected elements: Operational and external database layer – the

source data for the DW Information access layer – the tools the end user

access to extract and analyze the data Data access layer – the interface(s) between the

operational and information access layers Data staging layer – includes all necessary

processes to select, edit, summarize, and load all operation and information access layer data

Metadata layer – the data directory or repository of metadata information

Page 10: Chapter 2: The Data Warehouse Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas Spring 2012

Components of the Data Warehouse Architecture Figure 2-4

DW

Data

Metadata

Page 11: Chapter 2: The Data Warehouse Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas Spring 2012

© 2003, Prentice-Hall 11

The Data Warehouse Architecture (cont.)

Additional layers are: Process management layer – the

scheduler or job controller Application messaging layer – the

“middleware” that transports information around the firm

Physical data warehouse layer – where the actual data used in the DSS are located

Page 12: Chapter 2: The Data Warehouse Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas Spring 2012

2-3: Data Have Data -- The Metadata The name suggests some high-level

technological concept, but it really is fairly simple. Metadata is “data about data”.

With the emergence of the data warehouse as a decision support structure, the metadata are considered as much a resource as the business data they describe.

Metadata are abstractions -- they are high level data that provide concise descriptions of lower-level data.

Page 13: Chapter 2: The Data Warehouse Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas Spring 2012

© 2003, Prentice-Hall 13

Metadata in Action – The Author’s View

The metadata are essential ingredients in the transformation of raw data into knowledge. They are the “keys” that allow us to handle the raw data.

For example, a “line” or row in a sales database may contain:

1023 K596 111.21

This is mostly meaningless until we consult the metadata (in the data directory) that tells us:

store number =1023, product= K596, and sales = $111.21

Page 14: Chapter 2: The Data Warehouse Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas Spring 2012

© 2003, Prentice-Hall 14

2-4: Metadata Extraction

Regardless of the nature of a query, certain aspects of the metadata are important to all decision-makers. Some of these are:

What tables, attributes and keys does the DW contain?

Where did each set of data come from? What transformations were applied with cleansing? How have the metadata changed over time? How often do the data get reloaded? Are there so many data elements that you need to

be careful what you ask for?

Page 15: Chapter 2: The Data Warehouse Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas Spring 2012

© 2003, Prentice-Hall 15

Components of the Metadata

Transformation maps – records that show what transformations were applied Extraction history – records that show what

data was analyzed; a.k.a, log files Algorithms for summarization – methods

available for aggregating and summarizing Data ownership – records that show origin Access patterns – records that show what

data are accessed and how often

Page 16: Chapter 2: The Data Warehouse Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas Spring 2012

© 2003, Prentice-Hall 16

Typical Mapping MetadataTable 2-4

Transformation mapping records include: Identification of original source Attribute conversions Physical characteristic conversions Encoding/reference table conversions Naming changes Key changes Values of default attributes Logic to choose from multiple sources Algorithmic changes

Page 17: Chapter 2: The Data Warehouse Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas Spring 2012

© 2003, Prentice-Hall 17

Consider Accounting Systems

The DW needs the meanings of all of the reference codes that are used within your accounting data (account code, department, product, etc)

Codes in various systems, in different operating units of the company, are typically different.

You must establish common vocabulary; consolidated views must be done by mapping operating unit’s codes against a master code scheme.

The DW must maintain records of all codes, and the way they are used in reporting.

The DW must have an accurate history of the changes in metadata: dates and details of how business units have added codes, dropped codes, etc.

Page 18: Chapter 2: The Data Warehouse Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas Spring 2012

© 2003, Prentice-Hall 18

Tracking the Dimension tables

Metadata are data about data. It is also a way of documenting information about data tables.

The information contained in metadata will document the creation of a dataset and history (audit trail) of where is came from

A metadata audit trail not only includes this information, it tracks the origins of interrelationships between and transformations to your data as it moves through the enterprise

Page 19: Chapter 2: The Data Warehouse Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas Spring 2012

© 2003, Prentice-Hall 19

Consider a Multinational Enterprise: Dimensions

The following groupings and dimensions are typically found in an accounting data hypercube:

1. Account group, and Account code 2. Period, year and month 3. Reporting entity, division, dept 4. Actual, Budget, Forecast 5. By product line, product, etc. 6. By currencies 7. By executive, manager, & supervisor responsible 8. By productivity or billing, by professional or

employee, depending on industry, etc. etc. 

Page 20: Chapter 2: The Data Warehouse Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas Spring 2012

© 2003, Prentice-Hall 20

2-5: Implementing the Data Warehouse

Denis Kozar, Chase Manhattan Bank (1997) assembled a list of “seven deadly sins” of data warehouse implementation:

1. “If you build it, they will come” – the DW needs to be designed to meet people’s needs

2. Omission of an architectural framework – you need to consider the number of users, volume of data, update cycle, etc.

3. Underestimating the importance of documenting assumptions – the assumptions and potential conflicts must be included in the framework

Page 21: Chapter 2: The Data Warehouse Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas Spring 2012

“Seven Deadly Sins” (continued)

4. Failure to use the right tool – a DW project needs different tools than those used to develop an application

5. Life cycle abuse – in a DW, the life cycle really never ends… continuous new development projects

6. Ignorance about data conflicts – resolving these takes a lot more effort than most people realize. Data ownership and data naming convention issues are pervasive.

7. Failure to learn from mistakes – since one DW project tends to beget another, learning from the early mistakes will yield higher quality later

Page 22: Chapter 2: The Data Warehouse Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas Spring 2012

2-6: Data Warehouse Technologies

No one currently offers an end-to-end DW solution. Organizations buy bits and pieces from a number of vendors and hopefully make them work together.

SAS, IBM, Software AG, Information Builders and Platinum offer solutions that are at least fairly comprehensive.

The market is very competitive. Table 2-6 (pp. 64-66) in the text lists 90 firms that produce DW products.

Page 23: Chapter 2: The Data Warehouse Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas Spring 2012

2-7: The Future of Data Warehousing

As the DW becomes a standard part of an organization, there will be efforts to find new ways to use the data. This will likely bring with it several new challenges: Regulatory constraints may limit the ability to

combine sources of disparate data. These disparate sources are likely to contain

unstructured data, which is hard to store. The Internet makes it possible to access data from

virtually “anywhere”. This further increases the disparity and volume of data,

Page 24: Chapter 2: The Data Warehouse Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas Spring 2012

© 2003, Prentice-Hall 24

Quiz Time – Define each of the following Sales Returns. Inventory Tracking Raw Material Inward Stock Status Direct Labor Minimum, Maximum &

Reorder Level Attendance Recording. Check Endorsement Semi-Finished goods Equity Deferred charge

Net profit Depletion Damaged items Journal entry Multiple Bank accounts

maintenance facility. Bank Book Cash Book. Trial Balance Manufacturing Account Trading Account Profit and Loss account Market value Gross income

Page 25: Chapter 2: The Data Warehouse Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas Spring 2012

© 2003, Prentice-Hall 25

To Summarize First, there's the challenge of understanding the warehouse's

data properly from a business perspective. The ability to audit any data element is an essential for instilling

confidence in the data. Second, there's the challenge of consistency in data usage. Two users might think they're debating over similar data

elements when in fact the elements indicate different aspects of the business, either because of naming inconsistencies or nonsynchronized data.

Third, there's the problem of properly building SQL queries. Slight variations in SQL syntax can produce widely differing results from the same set of data. Most of SQL's syntax is poorly understood, requiring an expert to use it properly.

Fourth, there's the challenge of expanding access to the data warehouse to people who might be unfamiliar about the precise business interpretation of a specific data element.

Page 26: Chapter 2: The Data Warehouse Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas Spring 2012

© 2003, Prentice-Hall 26

Data Warehousing Typology

The virtual data warehouse – the end users have direct access to the data stores, using tools enabled at the data access layer

The central data warehouse – a single physical database contains all of the data for a specific functional area

The distributed data warehouse – the components are distributed across several physical databases

Page 27: Chapter 2: The Data Warehouse Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas Spring 2012

Summary

The data warehouse (DW) is separated from all other operational systems

Data must be cleansed and transformed [ETS]

Metadata stores the data about the data Transformation maps completes the data

warehousing essential documentation The DW has a seven component or layer

architecture

Page 28: Chapter 2: The Data Warehouse Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas Spring 2012

© 2003, Prentice-Hall 28

The End