
Page 1: Granularity in the Data Warehouse

Granularity in the Data Warehouse

Chapter 4

Page 2: Granularity in the Data Warehouse

Raw Estimates

• The single most important design issue facing the data warehouse developer is determining the proper level of granularity of the data that will reside in the data warehouse.

• Granularity is also important to the warehouse architect because it affects all the environments that depend on the warehouse for data.

• Granularity affects how efficiently data can be shipped to the different environments and determines the types of analysis that can be done.

• The primary issue of granularity is that of getting it at the right level. The level of granularity needs to be neither too high nor too low.

Page 3: Granularity in the Data Warehouse

Raw Estimates

• The starting point for determining the appropriate level of granularity is to do a raw estimate of the number of rows of data and the DASD (direct access storage device) that will be in the data warehouse.

• The raw estimate of the number of rows of data that will reside in the data warehouse tells the architect a great deal.

– If there are only 10,000 rows, almost any level of granularity will do.
– If there are 10 million rows, a low level of granularity is possible.
– If there are 10 billion rows, not only is a higher level of granularity needed, but a major portion of the data will probably go into overflow storage.

• The figure shows an algorithmic path to calculate the space occupied by a data warehouse; a simple version of that calculation is sketched below.
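The algorithmic path itself is simple arithmetic. Below is a minimal sketch of such a raw space estimate in Python; the table names, bytes-per-row figures, row counts, and index-overhead factor are all hypothetical inputs, not values taken from the book's figure.

```python
def estimate_space(tables, index_overhead=0.3):
    """Raw DASD estimate from per-table row counts and row sizes.

    tables: list of (name, bytes_per_row, low_rows, high_rows) tuples.
    index_overhead: assumed extra fraction of space for indexes.
    """
    low = sum(bpr * lo for _, bpr, lo, _ in tables)
    high = sum(bpr * hi for _, bpr, _, hi in tables)
    factor = 1 + index_overhead
    return low * factor, high * factor

# Hypothetical one-year estimates for two tables.
tables = [
    ("account_snapshot", 250, 10_000_000, 20_000_000),
    ("transaction_detail", 120, 200_000_000, 400_000_000),
]
low, high = estimate_space(tables)
print(f"estimated DASD: {low / 1e9:.1f} GB to {high / 1e9:.1f} GB")
```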

Page 4: Granularity in the Data Warehouse
Page 5: Granularity in the Data Warehouse

Input to the Planning Process

The estimate of rows and DASD then serves as input to the planning process, as shown in the figure.

Page 6: Granularity in the Data Warehouse

Data in Overflow

• Once the raw estimate as to the size of the data warehouse is made, the next step is to compare the total number of rows in the warehouse environment to the charts shown in the figure.

• Depending on how many total rows will be in the warehouse environment, different approaches to design, development, and storage are necessary.

Page 7: Granularity in the Data Warehouse

Data in Overflow

Page 8: Granularity in the Data Warehouse

Data in Overflow

On the five-year horizon, the totals shift by about an order of magnitude or perhaps even more. The theory is that after five years, these factors will be in place:

■■There will be more expertise available in managing the data warehouse volumes of data.

■■Hardware costs will have dropped to some extent.

■■More powerful software tools will be available.

■■The end user will be more sophisticated.

Page 9: Granularity in the Data Warehouse

Overflow Storage

• Data in the data warehouse environment grows at a rate never before seen by IT professionals. The combination of historical data and detailed data produces a growth rate that is phenomenal.

• As data grows large, a natural subdivision of data occurs between actively used data and inactively used data. Inactive data is sometimes called dormant data or infrequently used data.

• At some point in the life of the data warehouse, the vast majority of the data in the warehouse becomes stale and unused. At this point, it makes sense to start separating the data onto different storage media.

Page 10: Granularity in the Data Warehouse

Overflow Storage

• The figure shows that a data monitor is needed to determine the usage of data. The data monitor tells where to place data by determining what data is and is not being used in the data warehouse (a simple version is sketched below).

• The movement between disk storage and near-line storage is controlled by means of software called a cross-media storage manager (CMSM).

• The data in alternate storage or near-line storage can be accessed directly by means of software that has the intelligence to know where data is located in near-line storage.

• These three software components are the minimum required for alternate storage or near-line storage to be used effectively.
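As a rough illustration of the data monitor's role, this sketch classifies tables as active or dormant from last-access timestamps. The dormancy threshold, table names, and dates are hypothetical; a real monitor would track usage at a much finer grain.

```python
from datetime import datetime, timedelta

def classify_tables(last_access, now, dormant_after_days=180):
    """Split tables into actively used and dormant (inactive) sets."""
    cutoff = now - timedelta(days=dormant_after_days)
    active = {t for t, ts in last_access.items() if ts >= cutoff}
    return active, set(last_access) - active

last_access = {
    "current_month_balances": datetime(2016, 1, 10),
    "balances_1998": datetime(2014, 3, 12),
}
active, dormant = classify_tables(last_access, now=datetime(2016, 1, 15))
# Dormant tables are candidates for the CMSM to move to near-line storage.
print("move to near-line storage:", dormant)
```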

Page 11: Granularity in the Data Warehouse
Page 12: Granularity in the Data Warehouse

What the Levels of Granularity Will Be

Page 13: Granularity in the Data Warehouse

Some Feedback Loop Techniques

• Build the first parts of the data warehouse in very small, very fast steps, and carefully listen to the end users’ comments at the end of each step of development. Be prepared to make adjustments quickly.

• If available, use prototyping and allow the feedback loop to function using observations gleaned from the prototype.

• Look at how other people have built their levels of granularity and learn from their experience.

• Go through the feedback process with an experienced user who is aware of the process occurring.

• Under no circumstances should you keep your users in the dark as to the dynamics of the feedback loop.

• Look at whatever the organization has now that appears to be working, and use those functional requirements as a guideline.

• Execute joint application design (JAD) sessions and simulate the output to achieve the desired feedback.

Page 14: Granularity in the Data Warehouse

Some Feedback Loop Techniques

Granularity of data can be raised in many ways, such as the following (the first technique is sketched in code after the list):

• Summarize data from the source as it goes into the target.

• Average or otherwise calculate data as it goes into the target.

• Push highest and/or lowest set values into the target.

• Push only data that is obviously needed into the target.

• Use conditional logic to select only a subset of records to go into the target.

The ways that data may be summarized or aggregated are limitless. When building a data warehouse, keep one important point in mind: in classical requirements systems development, it is unwise to proceed until the vast majority of the requirements are identified. But in building the data warehouse, it is unwise not to proceed if at least half of the requirements for the data warehouse are identified.
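As an illustration of the first technique in the list above, this sketch summarizes transaction-level detail into monthly account totals as the data moves toward the target; the record layout is invented for the example.

```python
from collections import defaultdict

def summarize(detail_rows):
    """Roll transaction-level detail up to (account, month) totals."""
    totals = defaultdict(float)
    for account, date, amount in detail_rows:
        month = date[:7]  # 'YYYY-MM' from 'YYYY-MM-DD'
        totals[(account, month)] += amount
    return dict(totals)

detail = [
    ("A-100", "1995-01-03", 25.00),
    ("A-100", "1995-01-17", 40.00),
    ("A-100", "1995-02-02", 10.00),
]
print(summarize(detail))
# {('A-100', '1995-01'): 65.0, ('A-100', '1995-02'): 10.0}
```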

Page 15: Granularity in the Data Warehouse

Levels of Granularity — Banking Environment

Page 16: Granularity in the Data Warehouse
Page 17: Granularity in the Data Warehouse
Page 18: Granularity in the Data Warehouse
Page 19: Granularity in the Data Warehouse

The Data Warehouse and Technology

Chapter 5

Page 20: Granularity in the Data Warehouse

Some Basic Requirements of Technology Supporting Data Warehousing

Page 21: Granularity in the Data Warehouse
Page 22: Granularity in the Data Warehouse

Managing Large Amounts of Data

• In the ideal case, the data warehouse developer builds a data warehouse under the assumption that the technology that houses the data warehouse can handle the volumes required.

• When the designer has to go to extraordinary lengths in design and implementation to map the technology to the data warehouse, then there is a problem with the underlying technology.

• When technology is an issue, it is normal to engage more than one technology.

• The ability to participate in moving dormant data to overflow storage is perhaps the most strategic capability that a technology can have.

Page 23: Granularity in the Data Warehouse

Managing Multiple Media

• In conjunction with managing large amounts of data efficiently and cost effectively, the technology underlying the data warehouse must handle multiple storage media. It is insufficient to manage a mature data warehouse on DASD alone.

• Following is a hierarchy of storage of data in terms of speed of access and cost of storage:

Page 24: Granularity in the Data Warehouse

Indexing and Monitoring Data

• Of course, the designer uses many practices to make data as flexible as possible, such as spreading data across different storage media and partitioning data. But the technology that houses the data must be able to support easy indexing as well.

• Unlike the monitoring of transaction processing, where the transactions themselves are monitored, data warehouse activity monitoring determines what data has and has not been used.

Monitoring data warehouse data determines such factors as the following:

– If a reorganization needs to be done
– If an index is poorly structured
– If too much or not enough data is in overflow
– The statistical composition of the access of the data
– Available remaining space

Page 25: Granularity in the Data Warehouse

Interfaces to Many Technologies

An extremely important component of the data warehouse is the ability both to receive data from and to pass data to a wide variety of technologies.

The interface to different technologies requires several considerations:

■■ Does the data pass from one DBMS to another easily?
■■ Does it pass from one operating system to another easily?
■■ Does it change its basic format in passage (EBCDIC, ASCII, and so forth)? A small translation sketch follows this list.
■■ Can passage into multidimensional processing be done easily?
■■ Can selected increments of data, such as changed data capture (CDC), be passed rather than entire tables?
■■ Is the context of data lost in translation as data is moved to other environments?
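On the basic-format question above, this minimal sketch translates EBCDIC bytes to a character string using Python's standard cp037 codec (IBM EBCDIC, US/Canada); the sample field value is hypothetical.

```python
# 'cp037' is a standard Python codec for IBM EBCDIC (US/Canada).
ebcdic_bytes = "CUSTOMER".encode("cp037")  # as it might arrive from a mainframe
text = ebcdic_bytes.decode("cp037")        # translated for the target environment
print(ebcdic_bytes.hex(), "->", text)
```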

Page 26: Granularity in the Data Warehouse

Programmer or Designer Control of Data Placement

Because of efficiency of access and update, the programmer or designer must have specific control over the placement of data at the physical block or page level, as shown in the figure.

Page 27: Granularity in the Data Warehouse
Page 28: Granularity in the Data Warehouse
Page 29: Granularity in the Data Warehouse

Metadata Management

Metadata becomes even more important in the data warehouse than in the classical operational environment. Metadata is vital because of the fundamental difference in the development life cycle that is associated with the data warehouse.

Typically, the technical metadata that describes the data warehouse contains the following:

• Data warehouse table structures
• Data warehouse table attribution
• Data warehouse source data (the system of record)

Page 30: Granularity in the Data Warehouse

Metadata Management

• Mapping from the system of record to the data warehouse
• Data model specification
• Extract logging
• Common routines for access of data
• Definitions and/or descriptions of data
• Relationships of one unit of data to another

Metadata comes in different varieties:

• Business metadata is metadata that is of use and value to the business person.
• Technical metadata is metadata that is of use and value to the technician.
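A minimal sketch of what one technical metadata record might look like, expressed as a Python dataclass; the field names and values are hypothetical choices mirroring the list above, not a standard metadata schema.

```python
from dataclasses import dataclass, field

@dataclass
class TableMetadata:
    name: str
    columns: dict         # column name -> data type (table attribution)
    source: str           # the system of record
    mapping: dict         # source field -> warehouse column
    description: str = ""
    related_tables: list = field(default_factory=list)

meta = TableMetadata(
    name="account_snapshot",
    columns={"account_id": "CHAR(10)", "balance": "DECIMAL(15,2)"},
    source="legacy_accounts_master",
    mapping={"ACCT-NO": "account_id", "CUR-BAL": "balance"},
    description="Month-end snapshot of account balances",
    related_tables=["transaction_detail"],
)
print(meta.name, "<-", meta.source)
```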

Page 31: Granularity in the Data Warehouse

Language Interface

The data warehouse must have a rich language specification. The languages used by the programmer and the DSS end user to access data inside the data warehouse should be easy to use and robust.

Typically, the language interface to the data warehouse should do the following (the first two items are sketched in code after the list):

■■ Be able to access data a set at a time
■■ Be able to access data a record at a time
■■ Specifically ensure that one or more indexes will be used in the satisfaction of a query
■■ Have an SQL interface
■■ Be able to insert, delete, or update data
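This sketch contrasts set-at-a-time and record-at-a-time access, using Python's built-in sqlite3 module as a stand-in SQL interface; the table and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0), ("east", 75.0)])

# Set-at-a-time: one SQL statement operates on the whole qualifying set.
print(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall())

# Record-at-a-time: a cursor steps through qualifying rows one by one.
for row in conn.execute("SELECT region, amount FROM sales WHERE region = 'east'"):
    print(row)
```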

Page 32: Granularity in the Data Warehouse

Efficient Loading of Data

An important technological capability of the data warehouse is the ability to load the data warehouse efficiently, as shown in the figure.

Page 33: Granularity in the Data Warehouse
Page 34: Granularity in the Data Warehouse

Data is loaded into a data warehouse in two fundamental ways: a record at a time through a language interface or en masse with a utility.

Page 35: Granularity in the Data Warehouse

Efficient Loading of Data

Loading data by means of a utility is much faster. In addition, indexes must be efficiently loaded at the same time the data is loaded.

As the burden of the volume of loading becomes an issue, the load is often parallelized. When this happens, the data being loaded is divided into one of several job streams (a simple version is sketched below).

Another related approach to the efficient loading of very large amounts of data is staging the data prior to loading.
Page 36: Granularity in the Data Warehouse

Efficient Index Utilization

Technology can support efficient index access in several ways (the first is sketched in code after the list):

■■ Using bit maps
■■ Having multileveled indexes
■■ Storing all or parts of an index in main memory
■■ Compacting the index entries when the order of the data being indexed allows such compaction
■■ Creating selective indexes and range indexes
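This sketch illustrates the first technique, a bitmap index on a low-cardinality column: each distinct value gets one bit per row, and predicates combine bitmaps with bitwise operations. The column and values are hypothetical.

```python
def build_bitmap_index(values):
    """One bitmap (an int used as a bit array) per distinct column value."""
    index = {}
    for row_id, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index

regions = ["east", "west", "east", "north", "west", "east"]
idx = build_bitmap_index(regions)

# Rows where region is 'east' OR 'west': OR the two bitmaps together.
mask = idx["east"] | idx["west"]
print([i for i in range(len(regions)) if mask & (1 << i)])  # [0, 1, 2, 4, 5]
```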

Page 37: Granularity in the Data Warehouse

Compaction of Data

• The ability of the data warehouse to manage large amounts of data necessitates the ability to compact data.

• When data is compacted, it can be stored in a minimal amount of space, which makes access to the data very efficient.

• Another advantage is that the programmer gets the most out of a given I/O when data is stored compactly.

• Of course, there is always the corresponding issue of decompaction of data on access.

• As a rule, in the data warehouse environment, I/O resources are much scarcer than CPU resources, so decompaction is not a major issue: it consumes CPU cycles, not I/O.
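A minimal sketch of the compaction trade-off using Python's standard zlib module: compaction shrinks the bytes that must be stored and read (saving I/O), while decompaction on access spends CPU cycles. The row contents are hypothetical.

```python
import zlib

row = b"1995-01-03|A-100|branch-0042|25.00|USD|posted" * 100
compacted = zlib.compress(row, 6)
print(len(row), "bytes raw ->", len(compacted), "bytes compacted")

# Decompaction happens on access and consumes CPU cycles, not I/O.
restored = zlib.decompress(compacted)
assert restored == row
```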

Page 38: Granularity in the Data Warehouse

Variable-Length Data

Another simple but vital technological requirement of the data warehouse environment is the ability to manage variable-length data efficiently, as seen in the figure.

Page 39: Granularity in the Data Warehouse
Page 40: Granularity in the Data Warehouse

Because of the variety of data found in the data warehouse, variable-length structuring of data must be supported.
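A minimal sketch of one common way to manage variable-length data: length-prefixed fields packed into a single byte string, so records of different sizes share one physical structure. The record layout is hypothetical.

```python
import struct

def pack_record(account: str, note: str) -> bytes:
    a, n = account.encode(), note.encode()
    # Two 2-byte length prefixes, then the variable-length fields.
    return struct.pack(f"<HH{len(a)}s{len(n)}s", len(a), len(n), a, n)

def unpack_record(blob: bytes):
    a_len, n_len = struct.unpack_from("<HH", blob)
    account = blob[4:4 + a_len].decode()
    note = blob[4 + a_len:4 + a_len + n_len].decode()
    return account, note

blob = pack_record("A-100", "opened at branch 42")
print(unpack_record(blob))  # ('A-100', 'opened at branch 42')
```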

Page 41: Granularity in the Data Warehouse

Lock Management

A standard part of database technology is the lock manager, which ensures that two or more people are not updating the same record at the same time. But updates are not done in the data warehouse; instead, data is stored in a series of snapshot records. When a change occurs, a new snapshot record is added rather than an update being done.

One of the effects of the lock manager is that it consumes a fair amount of resources, even when data is not being updated. Merely turning the lock manager on requires overhead. Therefore, to streamline the data warehouse environment, being able to selectively turn the lock manager off and on is necessary.
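A minimal sketch of the snapshot style of change described above: rather than updating a record in place (the operation a lock manager protects), each change appends a new snapshot record. The table and values are hypothetical.

```python
from datetime import datetime

snapshots = []  # the warehouse table: an append-only series of snapshots

def record_change(account, balance, as_of):
    # A change arrives as a new snapshot record, not an in-place update.
    snapshots.append({"account": account, "balance": balance, "as_of": as_of})

record_change("A-100", 500.00, datetime(1995, 1, 1))
record_change("A-100", 650.00, datetime(1995, 2, 1))  # new snapshot, no update

# History is preserved; the current value is simply the newest snapshot.
latest = max((s for s in snapshots if s["account"] == "A-100"),
             key=lambda s: s["as_of"])
print(latest["balance"])  # 650.0
```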

Page 42: Granularity in the Data Warehouse

Other Technological Features

It is noteworthy that many features of DBMS technology found in classical transaction processing play only a small role (if they play a role at all) in supporting the data warehouse environment. Some of those features include the following:

■■ Transaction integrity
■■ High-speed buffering
■■ Row- or page-level locking
■■ Referential integrity
■■ VIEWs of data
■■ Partial block loading

Indeed, whenever a transaction-based DBMS is used in the data warehouse environment, it is desirable to turn off such features, because they interfere with the efficient processing of data inside the data warehouse.

Page 43: Granularity in the Data Warehouse

DBMS Types and the Data Warehouse

With the advent of data warehousing and the recognition of DSS as an integral part of the modern information systems infrastructure, a new class of DBMS has arisen. This class can be called a data warehouse-specific DBMS, optimized for data warehousing and DSS processing. It differs from the general-purpose DBMS in several regards:

• There is no overhead of update.
• No freespace for update is needed.
• A much more robust and sophisticated indexing structure is needed.
• A data warehouse-specific DBMS allows data to be physically optimized for DSS access and analysis.

Page 44: Granularity in the Data Warehouse

Multidimensional DBMS and the Data Warehouse

One of the technologies often discussed in the context of the data warehouse is multidimensional DBMS processing (sometimes called OLAP processing). Multidimensional database management systems, or data marts, provide an information system with the structure that allows an organization to have very flexible access to data, to slice and dice data any number of ways, and to dynamically explore the relationship between summary and detail data. Multidimensional DBMSs offer both flexibility and control to the end user, and as such they fit well in a DSS environment.
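As a rough illustration of slice-and-dice, this sketch holds a tiny cube as a Python dict keyed by (product, region, month) and filters it along fixed dimensions; the data is hypothetical, and a real multidimensional DBMS does far more.

```python
# Sales cube keyed by (product, region, month); values are hypothetical.
cube = {
    ("widget", "east", "1995-01"): 120,
    ("widget", "west", "1995-01"): 80,
    ("gadget", "east", "1995-01"): 45,
    ("widget", "east", "1995-02"): 130,
}

def slice_cube(cube, product=None, region=None, month=None):
    """Keep only the cells matching whichever dimensions are fixed."""
    return {k: v for k, v in cube.items()
            if (product is None or k[0] == product)
            and (region is None or k[1] == region)
            and (month is None or k[2] == month)}

print(slice_cube(cube, region="east"))                      # a slice
print(slice_cube(cube, product="widget", month="1995-01"))  # a dice
```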

Page 45: Granularity in the Data Warehouse

Though it is tempting to think that multidimensional DBMS technology should be the database technology for the data warehouse, in all but the most unusual cases this is a mistake.

Consider the differences between the multidimensional DBMS and the data warehouse:

■■The data warehouse holds massive amounts of data; the multidimensional DBMS holds at least an order of magnitude less data.

■■The data warehouse is geared for a limited amount of flexible access; the multidimensional DBMS is geared for very heavy and unpredictable access and analysis of data.

■■The data warehouse contains data with a very lengthy time horizon (from 5 to 10 years); the multidimensional DBMS holds a much shorter time horizon of data.

■■The data warehouse allows analysts to access its data in a constrained fashion; the multidimensional DBMS allows unfettered access.

■■Instead of the data warehouse being housed in a multidimensional DBMS, the multidimensional DBMS and the data warehouse enjoy a complementary relationship.

Page 46: Granularity in the Data Warehouse

Multidimensional DBMSs come in several flavors. Some multidimensional DBMSs operate on a foundation of relational technology, and some operate on a technological foundation optimal for “slicing and dicing” the data, where data can be thought of as existing in multidimensional cubes. The latter technological foundation is sometimes called a cube or OLAP foundation.

Both foundations can support multidimensional DBMS data marts. But there are some differences between the two types of technological foundations.

Page 47: Granularity in the Data Warehouse

Following is the relational foundation for multidimensional DBMS data marts:

Strengths:

■■ Can support a lot of data.
■■ Can support dynamic joining of data.
■■ Has proven technology.
■■ Is capable of supporting general-purpose update processing.
■■ If there is no known pattern of usage of data, then the relational structure is as good as any other.

Weaknesses:

■■ Has performance that is less than optimal.
■■ Cannot be purely optimized for access processing.

Page 48: Granularity in the Data Warehouse

Following is the cube foundation for multidimensional DBMS data marts:

Strengths:

■■ Performance that is optimal for DSS processing.
■■ Can be optimized for very fast access of data.
■■ If the pattern of access of data is known, then the structure of data can be optimized.
■■ Can easily be sliced and diced.
■■ Can be examined in many ways.

Weaknesses:

■■ Cannot handle nearly as much data as a standard relational format.
■■ Does not support general-purpose update processing.
■■ May take a long time to load.
■■ If access is desired on a path not supported by the design of the data, the structure is not flexible.
■■ Questionable support for dynamic joins of data.

Page 49: Granularity in the Data Warehouse

Context and Content

An example: Suppose a manager asks for a report from the data warehouse for 1995. The report is generated, and the manager is pleased. In fact, the manager is so pleased that a similar report for 1990 is requested. Because the data warehouse carries historical information, such a request is not hard to accommodate. The report for 1990 is generated. Now the manager holds the two reports—one for 1995 and one for 1990—in his hands and declares that the reports are a disaster.

The data warehouse architect examines the reports and sees that the financial statement for 1995 shows $50 million in revenue, while the report for 1990 shows a value of $10,000 for the same category.

The manager declares that there is no way that any account or category could have increased in value that much in five years’ time.

Page 50: Granularity in the Data Warehouse

Before giving up, the data warehouse architect points out to the manager that there are other relevant factors that do not show up in the report.

• In 1990, there was a different source of data than in 1995.

• In 1990, the definition of a product was not the same as in 1995.

• In 1990, there were different marketing territories than in 1995.

• In 1990, there were different calculations, such as for depreciation, than in 1995.

• In addition, there were many different external considerations, such as a difference in inflation, taxation, economic forecasts, and so forth.

• Once the context of the reports is explained to the manager, the contents now appear to be quite acceptable.

Page 51: Granularity in the Data Warehouse

Three Types of Contextual Information

Three levels of contextual information must be managed:

■■ Simple contextual information
■■ Complex contextual information
■■ External contextual information

Page 52: Granularity in the Data Warehouse

Simple contextual information

Simple contextual information relates to the basic structure of data itself, and includes such things as these:

■■ The structure of data
■■ The encoding of data
■■ The naming conventions used for data
■■ The metrics describing the data, such as:
   ■■ How much data there is
   ■■ How fast the data is growing
   ■■ What sectors of the data are growing
   ■■ How the data is being used

Page 53: Granularity in the Data Warehouse

Complex contextual information

Complex contextual information describes the same data as simple contextual information, but from a different perspective. This type of information addresses such aspects of data as these:

■■ Product definitions
■■ Marketing territories
■■ Pricing
■■ Packaging
■■ Organization structure
■■ Distribution

Page 54: Granularity in the Data Warehouse

External contextual information

External contextual information is information outside the corporation that nevertheless plays an important role in understanding information over time. Some examples of external contextual information include the following:

■■ Economic forecasts:
   ■■ Inflation
   ■■ Financial trends
   ■■ Taxation
   ■■ Economic growth
■■ Political information
■■ Competitive information
■■ Technological advancements

Page 55: Granularity in the Data Warehouse

Capturing and Managing Contextual Information

Complex and external contextual types of information are hard to capture and quantify because they are so unstructured. Compared to simple contextual information, external and complex contextual types of information are amorphous.

Another complicating factor is that contextual information changes quickly. What is relevant one minute is passé the next. It is this constant flux and the amorphous state of external and complex contextual information that make these types of information so hard to systematize.

Page 56: Granularity in the Data Warehouse

Refreshing the Data Warehouse

• Once the data warehouse is built, attention shifts from the building of the data warehouse to its day-to-day operations.

• The cost of operating and maintaining a data warehouse is high, and the volume of data in the warehouse is growing faster than anyone had predicted.

• The widespread and unpredictable usage of the data warehouse by the end user DSS analyst causes contention on the server managing the warehouse.

• The largest unexpected expense associated with the operation of the data warehouse is the periodic refreshment of legacy data.

Page 57: Granularity in the Data Warehouse

• The first step most organizations take in the refreshment of data warehouse data is to read the old legacy databases. However, repeated and direct reads of the legacy data are very costly.

• A much more appealing approach is to trap the data in the legacy environment as it is being updated.

• Two basic techniques are used to trap data as an update is occurring in the legacy operational environment. One technique is called data replication; the other is called changed data capture (CDC).

Page 58: Granularity in the Data Warehouse

Data replication:

• Replication requires that the data to be trapped be identified prior to the update. Then, as the update occurs, the data is trapped.

• A trigger is set that causes the update activity to be captured.

• One of the advantages of replication is that the process of trapping can be selectively controlled. Only the data that needs to be captured is, in fact, captured.

• Another advantage of replication is that the format of the data is “clean” and well defined. The content and structure of the data that has been trapped are well documented and readily understandable to the programmer.

• The disadvantages of replication are that extra I/O is incurred as a result of trapping the data and that, because of the unstable, ever-changing nature of the data warehouse, the system requires constant attention to the definition of the parameters and triggers that control trapping.
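A minimal sketch of trigger-based trapping, using Python's built-in sqlite3: an AFTER UPDATE trigger captures the before and after values of a changed row into a capture table. The table names and columns are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (id TEXT PRIMARY KEY, balance REAL);
CREATE TABLE trapped_changes (id TEXT, old_balance REAL, new_balance REAL);
-- The trigger traps update activity as it occurs.
CREATE TRIGGER trap_update AFTER UPDATE ON accounts
BEGIN
    INSERT INTO trapped_changes VALUES (OLD.id, OLD.balance, NEW.balance);
END;
""")
conn.execute("INSERT INTO accounts VALUES ('A-100', 500.0)")
conn.execute("UPDATE accounts SET balance = 650.0 WHERE id = 'A-100'")

# The trapped row is clean and well defined, ready to ship to the warehouse.
print(conn.execute("SELECT * FROM trapped_changes").fetchall())
```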

Page 59: Granularity in the Data Warehouse

Changed data capture:

• One approach to CDC is to use the log tape to capture and identify the changes that have occurred throughout the online day. In this approach, the log or journal tape is read. Reading a log tape is no small matter, however. Many obstacles are in the way, including the following:

– The log tape contains much extraneous data.
– The log tape format is often arcane.
– The log tape contains spanned records.
– The log tape often contains addresses instead of data values.
– The log tape reflects the idiosyncrasies of the DBMS and varies widely from one DBMS to another.

• There is a second approach to CDC: lift the changed data out of the DBMS buffers as the change occurs.
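A minimal sketch of the log-reading approach, assuming a much-simplified journal format (real log tapes are far more arcane, as the list above notes): only committed updates against the table of interest are kept.

```python
# A much-simplified, hypothetical journal; entries are Python dicts.
log_records = [
    {"type": "checkpoint"},                                # extraneous
    {"type": "update", "table": "accounts", "key": "A-100",
     "after": {"balance": 650.0}, "txn": 7},
    {"type": "update", "table": "audit_log", "key": "x",   # wrong table
     "after": {}, "txn": 7},
    {"type": "commit", "txn": 7},
]

def changes_for(table, records):
    """Keep only committed updates against the table of interest."""
    committed = {r["txn"] for r in records if r["type"] == "commit"}
    return [r for r in records
            if r["type"] == "update"
            and r["table"] == table
            and r["txn"] in committed]

print(changes_for("accounts", log_records))
```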

Page 60: Granularity in the Data Warehouse

What is done in practice?

First, the organization reads legacy databases directly to refresh its data warehouse. Then it tries replication. Finally, the economics and the efficiencies of operation lead it to CDC as the primary means to refresh the data warehouse.