chapter 2: the data warehouse modern data warehousing, mining, and visualization: core concepts by...
TRANSCRIPT
Chapter 2: The Data Warehouse
Modern Data Warehousing, Mining, and Visualization: Core Concepts by George M. Marakas
Spring 2012
© 2003, Prentice-Hall 2
Objectives
Goals and characteristics of Data Warehouses
Compare Operational Data Store to Data Warehouse
Elements of data warehouse architecture Role & components of Metadata Challenges of implementing DW Examine DW Technologies
2-1: Stores, Warehouses and Marts
A data warehouse is a collection of integrated databases designed to support a DSS.
An operational data store (ODS) stores data for a specific application. It feeds the data warehouse a stream of desired raw data.
A data mart is a lower-cost, scaled-down version of a data warehouse, usually designed to support a small group of users (rather than the entire firm).
The metadata is information that is kept about the warehouse.
The Data Warehouse Environment
The organization’s legacy systems and data stores provide data to the data warehouse or mart.
During the transfer of data from the various sources, cleansing or transformation may occur, so the data in the DW is more uniform (ETS=Extract, Transform, Store consistently).
Simultaneously, metadata is recorded. Finally, the DW or mart may be used to
create one or more “personal” warehouses.
Organizational Data Flow and Data Storage Components
Figure 2-1
Characteristics of a Data Warehouse
Subject oriented – organized based on use Integrated – inconsistencies removed Nonvolatile – stored in read-only format Time variant – data are normally time series Summarized – in decision-usable format Large volume – data sets are quite large Non-normalized – often redundant (semi-
normalized in actuality) Metadata – data about data are stored Data sources – comes from nonintegrated
sources
A Data Warehouse is Subject Oriented
Figure 2-2
Data in a Data Warehouse are Integrated
ETS
© 2003, Prentice-Hall 9
2-2: The Data Warehouse Architecture
The architecture consists of various interconnected elements: Operational and external database layer – the
source data for the DW Information access layer – the tools the end user
access to extract and analyze the data Data access layer – the interface(s) between the
operational and information access layers Data staging layer – includes all necessary
processes to select, edit, summarize, and load all operation and information access layer data
Metadata layer – the data directory or repository of metadata information
Components of the Data Warehouse Architecture Figure 2-4
DW
Data
Metadata
© 2003, Prentice-Hall 11
The Data Warehouse Architecture (cont.)
Additional layers are: Process management layer – the
scheduler or job controller Application messaging layer – the
“middleware” that transports information around the firm
Physical data warehouse layer – where the actual data used in the DSS are located
2-3: Data Have Data -- The Metadata The name suggests some high-level
technological concept, but it really is fairly simple. Metadata is “data about data”.
With the emergence of the data warehouse as a decision support structure, the metadata are considered as much a resource as the business data they describe.
Metadata are abstractions -- they are high level data that provide concise descriptions of lower-level data.
© 2003, Prentice-Hall 13
Metadata in Action – The Author’s View
The metadata are essential ingredients in the transformation of raw data into knowledge. They are the “keys” that allow us to handle the raw data.
For example, a “line” or row in a sales database may contain:
1023 K596 111.21
This is mostly meaningless until we consult the metadata (in the data directory) that tells us:
store number =1023, product= K596, and sales = $111.21
© 2003, Prentice-Hall 14
2-4: Metadata Extraction
Regardless of the nature of a query, certain aspects of the metadata are important to all decision-makers. Some of these are:
What tables, attributes and keys does the DW contain?
Where did each set of data come from? What transformations were applied with cleansing? How have the metadata changed over time? How often do the data get reloaded? Are there so many data elements that you need to
be careful what you ask for?
© 2003, Prentice-Hall 15
Components of the Metadata
Transformation maps – records that show what transformations were applied Extraction history – records that show what
data was analyzed; a.k.a, log files Algorithms for summarization – methods
available for aggregating and summarizing Data ownership – records that show origin Access patterns – records that show what
data are accessed and how often
© 2003, Prentice-Hall 16
Typical Mapping MetadataTable 2-4
Transformation mapping records include: Identification of original source Attribute conversions Physical characteristic conversions Encoding/reference table conversions Naming changes Key changes Values of default attributes Logic to choose from multiple sources Algorithmic changes
© 2003, Prentice-Hall 17
Consider Accounting Systems
The DW needs the meanings of all of the reference codes that are used within your accounting data (account code, department, product, etc)
Codes in various systems, in different operating units of the company, are typically different.
You must establish common vocabulary; consolidated views must be done by mapping operating unit’s codes against a master code scheme.
The DW must maintain records of all codes, and the way they are used in reporting.
The DW must have an accurate history of the changes in metadata: dates and details of how business units have added codes, dropped codes, etc.
© 2003, Prentice-Hall 18
Tracking the Dimension tables
Metadata are data about data. It is also a way of documenting information about data tables.
The information contained in metadata will document the creation of a dataset and history (audit trail) of where is came from
A metadata audit trail not only includes this information, it tracks the origins of interrelationships between and transformations to your data as it moves through the enterprise
© 2003, Prentice-Hall 19
Consider a Multinational Enterprise: Dimensions
The following groupings and dimensions are typically found in an accounting data hypercube:
1. Account group, and Account code 2. Period, year and month 3. Reporting entity, division, dept 4. Actual, Budget, Forecast 5. By product line, product, etc. 6. By currencies 7. By executive, manager, & supervisor responsible 8. By productivity or billing, by professional or
employee, depending on industry, etc. etc.
© 2003, Prentice-Hall 20
2-5: Implementing the Data Warehouse
Denis Kozar, Chase Manhattan Bank (1997) assembled a list of “seven deadly sins” of data warehouse implementation:
1. “If you build it, they will come” – the DW needs to be designed to meet people’s needs
2. Omission of an architectural framework – you need to consider the number of users, volume of data, update cycle, etc.
3. Underestimating the importance of documenting assumptions – the assumptions and potential conflicts must be included in the framework
“Seven Deadly Sins” (continued)
4. Failure to use the right tool – a DW project needs different tools than those used to develop an application
5. Life cycle abuse – in a DW, the life cycle really never ends… continuous new development projects
6. Ignorance about data conflicts – resolving these takes a lot more effort than most people realize. Data ownership and data naming convention issues are pervasive.
7. Failure to learn from mistakes – since one DW project tends to beget another, learning from the early mistakes will yield higher quality later
2-6: Data Warehouse Technologies
No one currently offers an end-to-end DW solution. Organizations buy bits and pieces from a number of vendors and hopefully make them work together.
SAS, IBM, Software AG, Information Builders and Platinum offer solutions that are at least fairly comprehensive.
The market is very competitive. Table 2-6 (pp. 64-66) in the text lists 90 firms that produce DW products.
2-7: The Future of Data Warehousing
As the DW becomes a standard part of an organization, there will be efforts to find new ways to use the data. This will likely bring with it several new challenges: Regulatory constraints may limit the ability to
combine sources of disparate data. These disparate sources are likely to contain
unstructured data, which is hard to store. The Internet makes it possible to access data from
virtually “anywhere”. This further increases the disparity and volume of data,
© 2003, Prentice-Hall 24
Quiz Time – Define each of the following Sales Returns. Inventory Tracking Raw Material Inward Stock Status Direct Labor Minimum, Maximum &
Reorder Level Attendance Recording. Check Endorsement Semi-Finished goods Equity Deferred charge
Net profit Depletion Damaged items Journal entry Multiple Bank accounts
maintenance facility. Bank Book Cash Book. Trial Balance Manufacturing Account Trading Account Profit and Loss account Market value Gross income
© 2003, Prentice-Hall 25
To Summarize First, there's the challenge of understanding the warehouse's
data properly from a business perspective. The ability to audit any data element is an essential for instilling
confidence in the data. Second, there's the challenge of consistency in data usage. Two users might think they're debating over similar data
elements when in fact the elements indicate different aspects of the business, either because of naming inconsistencies or nonsynchronized data.
Third, there's the problem of properly building SQL queries. Slight variations in SQL syntax can produce widely differing results from the same set of data. Most of SQL's syntax is poorly understood, requiring an expert to use it properly.
Fourth, there's the challenge of expanding access to the data warehouse to people who might be unfamiliar about the precise business interpretation of a specific data element.
© 2003, Prentice-Hall 26
Data Warehousing Typology
The virtual data warehouse – the end users have direct access to the data stores, using tools enabled at the data access layer
The central data warehouse – a single physical database contains all of the data for a specific functional area
The distributed data warehouse – the components are distributed across several physical databases
Summary
The data warehouse (DW) is separated from all other operational systems
Data must be cleansed and transformed [ETS]
Metadata stores the data about the data Transformation maps completes the data
warehousing essential documentation The DW has a seven component or layer
architecture
© 2003, Prentice-Hall 28
The End