(kajal maam)four ways to build a data warehouse

8/7/2019 (Kajal Maam)Four Ways to Build a Data Warehouse

1/7

Four Ways to Build a DataWarehouseIt has been said there are as many ways to build data warehouses asthere are companies to build them.It has been said there are as many ways to build data warehouses as there arecompanies to build them. Each data warehouse is unique because it must adapt to theneeds of business users in different functional areas, whose companies face differentbusiness conditions and competitive pressures.

Nonetheless, four major approaches to building a data warehousing environment exist.These architectures are generally referred to as 1) top-down 2) bottom-up 3) hybrid,and 4) federated . Most organizationswittingly or notfollow one or another of theseapproaches as a blueprint for development.

Although we have been building data warehouses since the early 1990s, there is still agreat deal of confusion about the similarities and differences among these architectures.This is especially true of the top-down and bottom-up approaches, which haveexisted the longest and occupy the polar ends of the development spectrum.

As a result, some companies fail to adopt a clear vision for the way the datawarehousing environment can and should evolve. Others, paralyzed by confusion or fear of deviating from prescribed tenets for success, cling too rigidly to one approach or another, undermining their ability to respond flexibly to new or unexpected situations.Ideally, organizations need to borrow concepts and tactics from each approach tocreate environments that uniquely meets their needs.

Semantic and Substantive Differences The two most influential approaches arechampioned by industry heavyweights Bill Inmon and Ralph Kimball, both prolificauthors and consultants in the data warehousing field. Inmon, who is credited withcoining the term data warehousing in the early 1990s, advocates a top-downapproach, in which companies first build a data warehouse followed by data marts.Kimballs approach, on the other hand, is often called bottom-up because it starts andends with data marts, negating the need for a physical data warehouse altogether.

On the surface, there is considerable friction between top-down and bottom-upapproaches. But in reality, the differences are not as stark as they may appear. Bothapproaches advocate building a robust enterprise architecture that adapts easily tochanging business needs and delivers a single version of the truth. In some cases, thedifferences are more semantic than substantive in nature. For example, bothapproaches collect data from source systems into a single data store, from which datamarts are populated. But while top-down subscribers call this a data warehouse,bottom-up adherents often call this a staging area.

Nonetheless, significant differences exist between the two approaches (see chart.) Datawarehousing professionals need to understand the substantial, subtle, and semanticdifferences among the approaches and which industry gurus or consultants advocate


2/7

each approach. This will provide a clearer understanding of the different routes toachieve data warehousing success and how to translate between the advice andrhetoric of the different approaches.

Top-Down ApproachThe top-down approach views the data warehouse as the linchpin of the entire analyticenvironment. The data warehouse holds atomic or transaction data that is extractedfrom one or more source systems and integrated within a normalized, enterprise datamodel. From there, the data is summarized, dimensionalized, and distributed to one or more dependent data marts. These data marts are dependent because they deriveall their data from a centralized data warehouse.

Sometimes, organizations supplement the data warehouse with a staging area to collectand store source system data before it can be moved and integrated within the datawarehouse. A separate staging area is particularly useful if there are numerous sourcesystems, large volumes of data, or small batch windows with which to extract data fromsource systems.

The major benefit of a top-down approach is that it provides an integrated, flexiblearchitecture to support downstream analytic data structures. First, this means the datawarehouse provides a departure point for all data marts, enforcing consistency andstandardization so that organizations can achieve a single version of the truth. Second,the atomic data in the warehouse lets organizations re-purpose that data in any number of ways to meet new and unexpected business needs. For example, a data warehousecan be used to create rich data sets for statisticians, deliver operational reports, or support operational data stores (ODS) and analytic applications. Moreover, users canquery the data warehouse if they need cross-functional or enterprise views of the data.

On the downside, a top-down approach may take longer and cost more to deploy thanother approaches, especially in the initial increments. This is because organizationsmust create a reasonably detailed enterprise data model as well as the physicalinfrastructure to house the staging area, data warehouse, and the marts beforedeploying their applications or reports. (Of course, depending on the size of animplementation, organizations can deploy all three tiers within a single database.) Thisinitial delay may cause some groups with their own IT budgets to build their ownanalytic applications. Also, it may not be intuitive or seamless for end users to drillthrough from a data mart to a data warehouse to find the details behind the summarydata in their reports.

Bottom-Up ApproachIn a bottom-up approach, the goal is to deliver business value by deploying dimensional

data marts as quickly as possible. Unlike the top-down approach, these data martscontain all the databoth atomic and summarythat users may want or need, now or in the future. Data is modeled in a star schema design to optimize usability and queryperformance. Each data mart builds on the next, reusing dimensions and facts so userscan query across data marts, if desired, to obtain a single version of the truth as well asboth summary and atomic data.

The bottom-up approach consciously tries to minimize back-office operations,preferring to focus an organizations effort on developing dimensional designs that meet


3/7

end-user requirements. The bottom-up staging area is non-persistent, and may simplystream flat files from source systems to data marts using the file transfer protocol. Inmost cases, dimensional data marts are logically stored within a single database. Thisapproach minimizes data redundancy and makes it easier to extend existingdimensional models to accommodate new subject areas.

Pros/Cons. The major benefit of a bottom-up approach is that it focuses on creatinguser-friendly, flexible data structures using dimensional, star schema models. It alsodelivers value rapidly because it doesnt lay down a heavy infrastructure up front.

Without an integration infrastructure, the bottom-up approach relies on a dimensionalbus to ensure that data marts are logically integrated and stovepipe applications areavoided. To integrate data marts logically, organizations use conformed dimensionsand facts when building new data marts. Thus, each new data mart is integrated withothers within a logical enterprise dimensional model.

Another advantage of the bottom-up approach is that since the data marts contain bothsummary and atomic data, users do not have to drill through from a data mart to

another structure to obtain detailed or transaction data. The use of a staging area alsoeliminates redundant extracts and overhead required to move source data into thedimensional data marts.

One problem with a bottom-up approach is that it requires organizations to enforce theuse of standard dimensions and facts to ensure integration and deliver a single versionof the truth. When data marts are logically arrayed within a single physical database,this integration is easily done. But in a distributed, decentralized organization, it may betoo much to ask departments and business units to adhere and reuse references andrules for calculating facts. There can be a tendency for organizations to createindependent or non-integrated data marts.

In addition, dimensional marts are designed to optimize queries, not support batch or transaction processing. Thus, organizations that use a bottom-up approach need tocreate additional data structures outside of the bottom-up architecture to accommodatedata mining, ODSs, and operational reporting requirements. However, this may beachieved simply by pulling a subset of data from a data mart at night when users are notactive on the system.

Hybrid ApproachThe hybrid approach tries to blend the best of both top-down and bottom-upapproaches. It attempts to capitalize on the speed and user-orientation of the bottom-up approach without sacrificing the integration enforced by a data warehouse in a topdown approach. Pieter Mimno, an independent consultant who teaches at TDWIconferences, is currently the most vocal proponent of this approach.The hybrid approach recommends spending about two weeks developing an enterprisemodel in third normal form before developing the first data mart. The first several datamarts are also designed in third normal form but deployed using star schema physicalmodels. This dual modeling approach fleshes out the enterprise model withoutsacrificing the usability and query performance of a star schema.


4/7

The hybrid approach relies on an extraction, transformation, and load (ETL) tool to storeand manage the enterprise and local models in the data marts as well as synchronizethe differences between them. This lets local groups, for example, develop their owndefinitions or rules for data elements that are derived from the enterprise model withoutsacrificing long-term integration. Organizations also use the ETL tool to extract and load

data from source systems into the dimensional data marts at both the atomic andsummary levels. Most ETL tools today can create summary tables on the fly.

After deploying the first few dependent data marts, an organization then backfills adata warehouse behind the data marts, instantiating the fleshed out version of theenterprise data model. The organization then transfers atomic data from the data martsto the data warehouse and consolidates redundant data feeds, saving the organizationtime, money, and processing resources. Organizations typically backfill a datawarehouse once business users request views of atomic data across multiple datamarts.

The major benefit of a hybrid approach is that it combines rapid developmenttechniques within an enterprise architecture framework. It develops an enterprise datamodel iteratively and only develops a heavyweight infrastructure once its really needed(e.g. when executives start asking for reports that cross data mart boundaries.)

However, backfilling a data warehouse can be a highly disruptive process that deliversno ostensible value and therefore may never be funded. In addition, few query tools candynamically and intelligently query atomic data in one database (i.e. the datawarehouse) and summary data in another database (i.e. the data marts.) Users may beconfused when to query which database.

This approach also relies heavily on an ETL tool to synchronize meta data betweenenterprise and local versions, develop aggregates, load detail data, and orchestrate thetransition to a data warehousing infrastructure. Although ETL tools have maturedconsiderably, they can never enforce adherence to architecture. The hybrid approachmay make it too easy for local groups to stray irrevocably from the enterprise datamodel.

Federated ApproachThe federated approach is sometimes confused with the hybrid approach above or hub-and-spoke data warehousing architectures that are a reflection of a top-downapproach.

However, the federated approachas defined by its most vocal proponent, DougHackneyis not a methodology or architecture per se, but a concession to the naturalforces that undermine the best laid plans for deploying a perfect system. A federatedapproach rationalizes the use of whatever means possible to integrate analyticalresources to meet changing needs or business conditions. In short, its a salve for thesoul of the stressed out data warehousing project manager who must sacrificearchitectural purity to meet the immediate (and ever-changing) needs of his businessusers.

Hackney says the federated approach is an architecture of architectures. Itrecommends how to integrate a multiplicity of heterogeneous data warehouses, data


5/7

marts, and packaged applications that companies have already deployed and willcontinue to implement in spite of the IT groups best effort to enforce standards andadhere to a specific architecture.

Hackney concedes that a federated architecture will never win awards for elegance or be drawn up on clean white boards as an optimal solution. He says it provides themaximum amount of architecture possible in a given political and implementationreality. The approach merely encourages organizations to share the highest valuemetrics, dimensions, and measures wherever possible, however possible. This maymean, for example, creating a common staging area to eliminate redundant data feedsor building a data warehouse that sources data from multiple data marts, datawarehouses, or analytic applications.

The major problem with the federated approach is that it is not well documented. Thereare only a few columns written on the subject. But perhaps this is enough, as it doesntprescribe a specific end-state or approach. Another potential problem is that without aspecific architecture in mind, a federated approach can perpetuate the continueddecentralization and fragmentation of analytical resources, making it harder to deliver an enterprise view in the end. Also, integrating meta data is a pernicious problem in aheterogeneous, ever-changing environment.

SummaryThe four approaches described here represent the dominant strains of datawarehousing methodologies. Data warehousing managers need to be aware of thesemethodologies but not wedded to them. These methodologies have shaped the debateabout data warehousing best practices, and comprise the building blocks for methodologies developed by practicing consultants.

Ultimately, organizations need to understand the strengths and limitations of eachmethodology and then pursue their own way through the data warehousing thicket.Since each organization must respond to unique needs and business conditions, havinga foundation of best practice models to start with augurs a successful outcome.


6/7

Top-Down Bottom-Up Hybrid Federated

Major Characteristics

Emphasizes the DW. Starts by designing anenterprise model for aDW. Deploys multi-tier architecture comprisedof a staging area, a DW,and dependent datamarts. The staging area ispersistent. The DW is enterprise-oriented; data marts arefunction-specific. The DW has atomic-level data; data martshave summary data. The DW uses anenterprise-basednormalized model; datamarts use a subject-specific dimensionalmodel. Users can query thedata warehouse anddata marts.

Emphasizes datamarts. Starts by designing adimensional model for adata mart. Uses a flatarchitecture consistingof a staging area anddata marts. The staging area islargely non-persistent. Data marts containboth atomic andsummary data. Data marts can provideboth enterprise andfunction-specific views. A data mart consists of a single star schema,logically or physicallydeployed. Data marts aredeployed incrementallyand integrated usingconformed dimensions.

Emphasizes DW anddata marts; blends top-down and bottom-upmethods. Starts by designingenterprise and localmodels synchronously. Spends 23 weekscreating a high-level,normalized, enterprisemodel; fleshes outmodel with initial marts. Populates marts withatomic and summarydata via a non-persistentstaging area. Models marts as oneor more star schemas. Uses ETL tool topopulate data marts andexchange meta databetween ETL tool anddata marts. Backfills a DW behindthe marts when userswant views at atomic

level across marts;instantiates the fleshedout enterprise model,and moves atomic datato the DW.

Emphasizes the needto integrate new andexisting heterogeneousBI environments. An architecture of architectures. Acknowledges thereality of change inorganizations andsystems that make itdifficult to implement aformalized architecture. Rationalizes the useof whatever meanspossible to implement or integrate analyticalresources to meetchanging needs or business conditions. Encouragesorganizations to sharedimensions, facts, rules,definitions, and datawherever possible,however possible.

Pros

Enforces a flexible,enterprise architecture. Once built, minimizesthe possibility of renegade independentdata marts. Supports other analytical structures inan architectedenvironment, includingdata mining sets, ODSs,and operational reports. Keeps detailed data in

Focuses on creatinguser-friendly, flexibledata structures. Minimizes backoffice operations andredundant datastructures to acceleratedeployment and reducecost. No drill-throughrequired since atomicdata is always stored inthe data marts.

Provides rapiddevelopment within anenterprise architectureframework. Avoids creation of renegade independentdata marts. Instantiates enterprisemodel and architectureonly when needed andonce data marts deliver real value. Synchronizes meta

Provides a rationalefor band aidapproaches that solvereal business problems. Alleviates the guilt andstress data warehousingmanagers mightexperience by notadhering to formalizedarchitectures. Provides pragmaticway to share data andresources.


7/7

normalized form so itcan be flexibly re-purposed to meet newand unexpected needs. Data warehouseeliminates redundant

extracts.

Creates new views byextending existing starsor building new oneswithin the same logicalmodel. Staging area

eliminates redundantextracts.

data and databasemodels betweenenterprise and localdefinitions. Backfilled DWeliminates redundant

extracts.

Cons

Upfront modeling andplatform deploymentmean the firstincrements take longer to deploy and cost more. Requires building andmanaging multiple datastores and platforms. Difficult to drill throughfrom summary data inmarts to detail data inDW. Might need to storedetail data in data martsanyway.

Few query tools caneasily join data acrossmultiple, physicallydistinct marts. Requires groupsthroughout anorganization toconsistently usedimensions and facts toensure a consolidatedview. Not designed tosupport operational datastores or operationalreporting data structuresor processes.

Requires organizationsto enforce standard useof entities and rules. Backfilling a DW isdisruptive, requiringcorporate commitment,funding, and applicationrewrites. Few query tools candynamically queryatomic and summarydata in differentdatabases.

The approach is notfully articulated. With no predefinedend-state or architecturein mind, it may give wayto unfettered chaos. It might encouragerather than reign inindependentdevelopment andperpetuate thedisintegration of standards and controls.

(kajal maam)four ways to build a data warehouse

Documents