
Page 1: The Data Warehouse ETL Toolkit - Chapter 01

The Data Warehouse ETL Toolkit, by Kimball
VSV Training
Chapter 01: Surrounding the Requirements
Prepared by: Khoi NGUYEN – Hien NGUYEN
Date: 09/02/2008

Page 2: The Data Warehouse ETL Toolkit - Chapter 01

ETL Book Overview

PART I: Requirements, Realities, and Architecture
Chapter 01: Surrounding the Requirements
Chapter 02: ETL Data Structures

PART II: Data Flow
Chapter 03: Extracting
Chapter 04: Cleaning and Conforming
Chapter 05: Delivering Dimension Tables
Chapter 06: Delivering Fact Tables

PART III: Implementation and Operations
Chapter 07: Development
Chapter 08: Operations
Chapter 09: Metadata
Chapter 10: Responsibilities

PART IV: Real Time Streaming ETL Systems
Chapter 11: Real-Time ETL Systems
Chapter 12: Conclusions

Page 3: The Data Warehouse ETL Toolkit - Chapter 01

Chapter 1: Surrounding the Requirements

Introduction
Requirements
Architecture
The Mission of the Data Warehouse
The Mission of the ETL Team

Page 4: The Data Warehouse ETL Toolkit - Chapter 01

1.0 Introduction

The ETL system is the foundation of the data warehouse. It:
Extracts data from source systems
Enforces data quality and consistency
Conforms data from different sources
Delivers data in a presentation-ready format

ETL adds significant value to data. It:
Removes mistakes and corrects missing data
Provides documented measures of confidence in data
Captures the flow of transactional data for safekeeping
Adjusts data from multiple sources to be used together
Structures data to be usable by end-user tools
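To make these value-adding steps concrete, here is a minimal Python sketch (not from the book) that runs a couple of invented customer records through extract, clean, conform, and deliver stages. The sample rows, column names, and country mapping are assumptions for illustration only.

# A minimal, hypothetical sketch of the four ETL value-adding steps
# (extract, clean, conform, deliver) over in-memory records.
# Source rows and column names are invented for illustration; a real
# system would read from files or databases.

RAW_CUSTOMERS = [
    {"cust_id": "101", "name": "  Alice ", "gender": "F", "country": "US"},
    {"cust_id": "102", "name": "Bob",      "gender": "",  "country": "usa"},
]

def extract():
    """Pretend to read rows from a source system."""
    return list(RAW_CUSTOMERS)

def clean(rows):
    """Remove mistakes and supply defaults for missing data."""
    for row in rows:
        row["name"] = row["name"].strip()
        row["gender"] = row["gender"] or "Unknown"   # correct missing value
    return rows

def conform(rows):
    """Adjust data from multiple sources so it can be used together."""
    country_map = {"US": "United States", "usa": "United States"}
    for row in rows:
        row["country"] = country_map.get(row["country"], row["country"])
    return rows

def deliver(rows):
    """Structure data into a presentation-ready (dimension-like) shape."""
    return [
        {"customer_key": i + 1, **row}   # simple surrogate key
        for i, row in enumerate(rows)
    ]

if __name__ == "__main__":
    for rec in deliver(conform(clean(extract()))):
        print(rec)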

Page 5: The Data Warehouse ETL Toolkit - Chapter 01

1.0 Introduction (You Will Learn in This Book)

Plan and design your ETL system
Choose the appropriate architecture from the many possible choices
Manage the implementation
Manage the day-to-day operations
Build the development/test/production suite of ETL processes
Understand the tradeoffs of various back-room data structures, including flat files, normalized schemas, XML schemas, and star join (dimensional) schemas
Analyze and extract source data
Build a comprehensive data-cleaning subsystem
Structure data into dimensional schemas for the most effective delivery to end users, business-intelligence tools, data-mining tools, OLAP cubes, and analytic applications
Deliver data effectively both to highly centralized and profoundly distributed data warehouses using the same techniques
Tune the overall ETL process for optimum performance

Page 6: The Data Warehouse ETL Toolkit - Chapter 01

1.0 Introduction (Two Simultaneous Threads)

Page 7: The Data Warehouse ETL Toolkit - Chapter 01

1.0 Introduction (The Planning and Design Thread – 1st Step: Requirements and Realities)

Business needs
Data profiling and other data-source realities
Compliance requirements
Security requirements
Data integration
Data latency
Archiving and lineage
End user delivery interfaces
Available development skills
Available management skills
Legacy licenses

Page 8: The Data Warehouse ETL Toolkit - Chapter 01

1.0 Introduction (The Planning and Design Thread – 2nd Step: Architecture)

Hand-coded versus ETL vendor tool
Batch versus streaming data flow
Horizontal versus vertical task dependency
Scheduler automation
Exception handling
Quality handling
Recovery and restart
Metadata
Security

Page 9: The Data Warehouse ETL Toolkit - Chapter 01

1.0 Introduction (The Planning and Design Thread – 3rd Step: System Implementation)

Hardware
Software
Coding practices
Documentation practices
Specific quality checks

Page 10: The Data Warehouse ETL Toolkit - Chapter 01

1.0 Introduction (The Planning and Design Thread – Final Step: Administration and Maintenance)

Development systems
Test systems
Production systems
Handoff procedures
Update propagation approach
System snapshotting and rollback procedures
Performance tuning

Page 11: The Data Warehouse ETL Toolkit - Chapter 01

1.0 Introduction (The Data Flow Thread – Extract Step)

Reading source-data models
Connecting to and accessing data
Scheduling the source system, intercepting notifications and daemons
Capturing changed data
Staging the extracted data to disk
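As a rough illustration of this step, the following Python sketch (an assumption, not the book's implementation) captures rows changed since the last run and stages them to disk. The source table, column names, and file names are invented.

# Hypothetical sketch of the extract step: capture rows changed since the
# last run and stage them to disk. A real extract would query an actual
# source system instead of the in-memory SOURCE_ROWS list.
import csv
import json
from datetime import datetime, timezone

STATE_FILE = "extract_state.json"        # remembers the last extraction point
STAGE_FILE = "stage_orders_extract.csv"

SOURCE_ROWS = [  # stand-in for a source table with a last_updated column
    {"order_id": 1, "amount": "120.00", "last_updated": "2008-02-08T10:00:00"},
    {"order_id": 2, "amount": "75.50",  "last_updated": "2008-02-09T08:30:00"},
]

def load_last_extract_time():
    try:
        with open(STATE_FILE) as f:
            return json.load(f)["last_extract_time"]
    except FileNotFoundError:
        return "1900-01-01T00:00:00"     # first run: take everything

def extract_changed_rows():
    since = load_last_extract_time()
    changed = [r for r in SOURCE_ROWS if r["last_updated"] > since]
    # Stage the extracted data to disk before any further processing.
    with open(STAGE_FILE, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["order_id", "amount", "last_updated"])
        writer.writeheader()
        writer.writerows(changed)
    with open(STATE_FILE, "w") as f:
        json.dump({"last_extract_time": datetime.now(timezone.utc).isoformat()}, f)
    return changed

if __name__ == "__main__":
    print(f"Staged {len(extract_changed_rows())} changed rows to {STAGE_FILE}")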

Page 12: The Data Warehouse ETL Toolkit - Chapter 01

1.0 Introduction (The Data Flow Thread – Clean Step)

Enforcing column properties
Enforcing structure
Enforcing data and value rules
Enforcing complex business rules
Building a metadata foundation to describe data quality
Staging the cleaned data to disk
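A minimal sketch of the clean step follows, under invented assumptions: a small set of column rules is enforced on staged rows, and each violation is recorded so it can feed the data-quality metadata. The rules, columns, and sample rows are illustrative only.

# Hypothetical sketch of the clean step: enforce column properties and value
# rules, and record simple data-quality events for each rule violation.

COLUMN_RULES = {
    "order_id": {"required": True, "type": int},
    "amount":   {"required": True, "type": float, "min": 0.0},
    "status":   {"required": False, "allowed": {"OPEN", "SHIPPED", "CANCELLED"}},
}

def clean_rows(rows):
    clean, quality_events = [], []
    for row in rows:
        ok = True
        for col, rule in COLUMN_RULES.items():
            value = row.get(col)
            if value in (None, "") and rule.get("required"):
                quality_events.append((col, "missing required value", row))
                ok = False
                continue
            if value not in (None, ""):
                try:
                    value = rule["type"](value) if "type" in rule else value
                except ValueError:
                    quality_events.append((col, "wrong type", row))
                    ok = False
                    continue
                if "min" in rule and value < rule["min"]:
                    quality_events.append((col, "below minimum", row))
                    ok = False
                if "allowed" in rule and value not in rule["allowed"]:
                    quality_events.append((col, "value not in allowed set", row))
                    ok = False
                row[col] = value
        if ok:
            clean.append(row)
    return clean, quality_events   # quality_events feed the quality metadata

if __name__ == "__main__":
    rows = [{"order_id": "1", "amount": "12.5", "status": "OPEN"},
            {"order_id": "2", "amount": "-3",   "status": "LOST"}]
    good, events = clean_rows(rows)
    print(good)
    print(events)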

Page 13: The Data Warehouse ETL Toolkit - Chapter 01

1.0 Introduction (The Data Flow Thread – Conform Step)

Conforming business labels (in dimensions)
Conforming business metrics and performance indicators (in fact tables)
Deduplicating
Householding
Internationalizing
Staging the conformed data to disk
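The sketch below illustrates two of these tasks, conforming labels and deduplicating, under invented assumptions: the label mapping and the natural-key matching rule are not from the book, and real duplicate matching is usually fuzzier.

# Hypothetical sketch of the conform step: map source-specific labels to a
# single conformed label and deduplicate records describing the same customer.

CONFORMED_CATEGORY = {          # source label -> conformed business label
    "Ship-To Customer": "Customer",
    "CUST":             "Customer",
    "Reseller Acct":    "Reseller",
}

def conform(records):
    seen, conformed = {}, []
    for rec in records:
        rec = dict(rec)
        rec["category"] = CONFORMED_CATEGORY.get(rec["category"], rec["category"])
        # Deduplicate on a simple natural key; real matching is usually fuzzier.
        key = (rec["name"].strip().lower(), rec["postal_code"])
        if key not in seen:
            seen[key] = rec
            conformed.append(rec)
    return conformed

if __name__ == "__main__":
    source_a = [{"name": "Acme Corp", "postal_code": "70000", "category": "CUST"}]
    source_b = [{"name": "ACME CORP ", "postal_code": "70000",
                 "category": "Ship-To Customer"}]
    print(conform(source_a + source_b))   # one conformed, deduplicated record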

Page 14: The Data Warehouse ETL Toolkit - Chapter 01

1.0 Introduction (The Data Flow Thread – Deliver/Load Step)

Loading flat and snowflaked dimensions
Generating time dimensions
Loading degenerate dimensions
Loading subdimensions
Loading types 1, 2, and 3 slowly changing dimensions
Conforming dimensions and conforming facts
Handling late-arriving dimensions and late-arriving facts
Loading multi-valued dimensions
Loading ragged hierarchy dimensions
Loading text facts in dimensions
Running the surrogate key pipeline for fact tables
Loading three fundamental fact table grains
Loading and updating aggregations
Staging the delivered data to disk
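One of these tasks, the surrogate key pipeline, is sketched below under invented assumptions: each natural key on an incoming fact row is replaced with the surrogate key from the corresponding dimension, and rows whose dimension members are not yet present are set aside as late-arriving. Dimension contents and fact rows are made up for illustration.

# Hypothetical sketch of the surrogate key pipeline for a fact table.

CUSTOMER_DIM = {"C-101": 1, "C-102": 2}        # natural key -> surrogate key
PRODUCT_DIM  = {"SKU-9": 10, "SKU-7": 11}

def surrogate_key_pipeline(fact_rows):
    loadable, late_arriving = [], []
    for row in fact_rows:
        cust_key = CUSTOMER_DIM.get(row["customer_id"])
        prod_key = PRODUCT_DIM.get(row["product_id"])
        if cust_key is None or prod_key is None:
            late_arriving.append(row)          # dimension row not there yet
            continue
        loadable.append({
            "customer_key": cust_key,          # surrogate keys only
            "product_key":  prod_key,
            "sale_amount":  row["sale_amount"],
        })
    return loadable, late_arriving

if __name__ == "__main__":
    facts = [
        {"customer_id": "C-101", "product_id": "SKU-9", "sale_amount": 25.0},
        {"customer_id": "C-999", "product_id": "SKU-7", "sale_amount": 10.0},
    ]
    ready, late = surrogate_key_pipeline(facts)
    print("ready to load:", ready)
    print("late-arriving:", late)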

Page 15: The Data Warehouse ETL Toolkit - Chapter 01

1.0 Introduction (The Data Flow Thread – Operation Step)

Scheduling
Job execution
Exception handling
Recovery and restart
Quality checking
Release
Support

Page 16: The Data Warehouse ETL Toolkit - Chapter 01

1.1 Requirements

Business Needs
Compliance Requirements
Data Profiling
Security Requirements
Data Integration
Data Latency
Archiving and Lineage
End User Delivery Interfaces
Available Skills
Legacy Licenses

Page 17: The Data Warehouse ETL Toolkit - Chapter 01

1.1 Requirements

This is the list of major categories of requirements. Although every one of these requirements can be a show-stopper, business needs are the most fundamental and important.

Page 18: The Data Warehouse ETL Toolkit - Chapter 01

1.1.1 Business Needs

Business needs are the information requirements of the end users of the data warehouse.

Business needs directly drive the choice of data sources.

The Data Warehouse Lifecycle Toolkit describes the process for interviewing end users and gathering business requirements (another must-read book for system analysts).

Page 19: The Data Warehouse ETL Toolkit - Chapter 01

1.1.2 Compliance Requirements

Typical due diligence requirements for the data warehouse include:
Archived copies of data sources and subsequent staging of data.
Proof of the complete transaction flow that changed any data.
Fully documented algorithms for allocations and adjustments.
Proof of security of the data copies over time, both on-line and off-line.

Page 20: The Data Warehouse ETL Toolkit - Chapter 01

1.1.3 Data Profiling

Data profiling is a necessary precursor to designing any kind of system to use that data.

Data profiling is a systematic examination of the quality, scope, and context of a data source to allow an ETL system to be built.
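As a rough illustration, the sketch below profiles a tiny, invented sample: for each column it counts nulls and distinct values and records minimum and maximum lengths. A real profile would run against the actual source and go much further.

# Hypothetical sketch of basic data profiling over a sample of source rows.
from collections import defaultdict

def profile(rows):
    stats = defaultdict(lambda: {"nulls": 0, "distinct": set(),
                                 "min_len": None, "max_len": None})
    for row in rows:
        for col, value in row.items():
            s = stats[col]
            if value in (None, ""):
                s["nulls"] += 1
                continue
            s["distinct"].add(value)
            length = len(str(value))
            s["min_len"] = length if s["min_len"] is None else min(s["min_len"], length)
            s["max_len"] = length if s["max_len"] is None else max(s["max_len"], length)
    return {col: {"nulls": s["nulls"], "distinct": len(s["distinct"]),
                  "min_len": s["min_len"], "max_len": s["max_len"]}
            for col, s in stats.items()}

if __name__ == "__main__":
    sample = [{"customer_id": "C1", "state": "CA"},
              {"customer_id": "C2", "state": ""},
              {"customer_id": "C3", "state": "Californa"}]   # suspect value
    for col, s in profile(sample).items():
        print(col, s)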

Page 21: The Data Warehouse ETL Toolkit - Chapter 01

1.1.4 Security Requirements

The ETL team should not be directly concerned with designing or managing end user security.

The ETL team needs to work in a special environment.

Security must be extended to physical backups.

Page 22: The Data Warehouse ETL Toolkit - Chapter 01

1.1.5 Data Integration

Data integration is a huge topic for IT because ultimately IT aims to make all systems work together seamlessly.
Conforming dimensions
Conforming facts (KPIs)

Data integration is a separate step identified in our data flow thread as the conform step.

Page 23: The Data Warehouse ETL Toolkit - Chapter 01

1.1.6 Data Latency

The data latency requirement describes how quickly the data must be delivered to end users.
Traditional batch-oriented data flows
Improve processing algorithms
Parallel processing
Better hardware
Streaming?

Page 24: The Data Warehouse ETL Toolkit - Chapter 01

1.1.7 Archiving and Lineage

Staging the data at each point where a major transformation has occurred.

Each staged/archived data set should have accompanying metadata describing the origins and processing steps that produced the data.
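A minimal sketch of this idea follows, under invented assumptions: every staged data set is written together with a small sidecar file recording its origin, the processing step that produced it, a row count, and a checksum. The file names and metadata fields are illustrative, not a prescribed format.

# Hypothetical sketch of staging a data set with lineage metadata alongside it.
import csv
import hashlib
import json
from datetime import datetime, timezone

def stage_with_lineage(rows, stage_path, source_name, processing_step):
    fieldnames = list(rows[0].keys())
    with open(stage_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    with open(stage_path, "rb") as f:
        checksum = hashlib.sha256(f.read()).hexdigest()
    metadata = {
        "staged_file": stage_path,
        "source": source_name,
        "processing_step": processing_step,
        "row_count": len(rows),
        "sha256": checksum,
        "staged_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(stage_path + ".meta.json", "w") as f:
        json.dump(metadata, f, indent=2)
    return metadata

if __name__ == "__main__":
    rows = [{"order_id": 1, "amount": 120.0}]
    print(stage_with_lineage(rows, "stage_orders_clean.csv",
                             source_name="orders_db.orders",
                             processing_step="clean"))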

Page 25: The Data Warehouse ETL Toolkit - Chapter 01

1.1.8 End User Delivery Interfaces

The ETL team and the data modelers need to work closely with the end user application developers to determine the exact requirements for the final data handoff.

Page 26: The Data Warehouse ETL Toolkit - Chapter 01

1.1.9 Available Skills

The big design decisions when building an ETL system must be made on the basis of who builds and manages the system.

Page 27: The Data Warehouse ETL Toolkit - Chapter 01

1.1.10 Legacy Licenses

Major design decisions will be made for you implicitly by senior management's insistence that you use existing legacy licenses.
ETL tool (Data Integrator) license
Server licenses
Other legacy system licenses

Page 28: The Data Warehouse ETL Toolkit - Chapter 01

1.2 Architecture

The choice of architecture is a fundamental and early decision in the design of the ETL system.

Page 29: The Data Warehouse ETL Toolkit - Chapter 01

1.2.1 ETL Tool vs. Hand Coding (Buy a Tool Suite or Roll Your Own?)

The answer is, "It depends."

Hand-Coded ETL Advantages:
Automated unit testing tools are available in a hand-coded system.
Object-oriented programming techniques help you make all your transformations consistent for error reporting, validation, and metadata updates.
File-based vs. database processing.
Existing legacy routines remain unchanged.
Scripting languages give more flexibility.
In-house programmers may be available.

Tool-Based ETL Advantages:
Makes the ETL steps simple.
Simpler, faster, cheaper development.
Non-professional programmers can use an ETL tool.
Integrated metadata repository.
Automatically generated metadata.
Built-in job scheduler.
Data lineage management.
Built-in connectors and security features.
Good performance for large datasets.
Load balancing.
Automatic impact analysis.
Monitoring tools.

Page 30: The Data Warehouse ETL Toolkit - Chapter 01

1.2.1 ETL Tool vs. Hand Coding (Using Proven Technology – Use an ETL Tool)

Some additional benefits of using proven ETL technology are as follows:
Define once, apply many.
Impact analysis.
Metadata repository.
Incremental aggregation.
Managed batch loading.
Simpler connectivity.
Parallel pipelined multithreaded operation.
Vendor experience.

Page 31: The Data Warehouse ETL Toolkit - Chapter 01

1.2.1 ETL Tool vs. Hand Coding (Batch vs. Streaming Data Flow)

The standard architecture for an ETL system is based on periodic batch extracts from the source data.

Changing from a batch to a streaming data flow changes everything.

Page 32: The Data Warehouse ETL Toolkit - Chapter 01

1.2.1 ETL Tool vs. Hand Coding (Horizontal vs. Vertical Task Dependency)

A horizontally organized task flow allows each final database load to run to completion independently.

A vertically oriented task flow synchronizes two or more separate job flows so that, above all, the final database loads occur simultaneously.
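The contrast can be sketched in a few lines of Python. The two subject areas, job names, and print-only job bodies below are invented; the point is only the ordering of the final loads.

# Hypothetical sketch contrasting horizontal and vertical task dependency
# for two subject areas (orders and inventory).

def extract(area):   print(f"extract {area}")
def clean(area):     print(f"clean {area}")
def load(area):      print(f"load {area} into presentation tables")

AREAS = ["orders", "inventory"]

def run_horizontal():
    # Horizontal: each subject area runs to its final load independently,
    # so orders can publish even if inventory is delayed.
    for area in AREAS:
        extract(area); clean(area); load(area)

def run_vertical():
    # Vertical: the flows are synchronized step by step, and the final loads
    # happen together only after every area is ready.
    for step in (extract, clean):
        for area in AREAS:
            step(area)
    for area in AREAS:          # simultaneous (back-to-back) final loads
        load(area)

if __name__ == "__main__":
    print("-- horizontal --"); run_horizontal()
    print("-- vertical --");   run_vertical()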

Page 33: The Data Warehouse ETL Toolkit - Chapter 01

1.2.1 ETL Tool vs. Hand Coding (Scheduler Automation)

A related architectural decision is how deeply to control your overall ETL system with automated scheduler technology.

Page 34: The Data Warehouse ETL Toolkit - Chapter 01

1.2.1 ETL Tool vs. Hand Coding (Exception Handling)

Every job needs to be architected to write its exception-reporting records into the database.
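A minimal sketch of such uniform exception reporting follows, assuming a shared SQLite table; the table layout, job name, and exception details are invented for illustration.

# Hypothetical sketch: every job writes its exception-reporting records
# into a shared database table.
import sqlite3
from datetime import datetime, timezone

def open_exception_store(path="etl_exceptions.db"):
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS etl_exceptions (
            job_name     TEXT,
            record_key   TEXT,
            exception    TEXT,
            severity     TEXT,
            reported_at  TEXT
        )""")
    return conn

def report_exception(conn, job_name, record_key, message, severity="WARN"):
    conn.execute(
        "INSERT INTO etl_exceptions VALUES (?, ?, ?, ?, ?)",
        (job_name, record_key, message, severity,
         datetime.now(timezone.utc).isoformat()))
    conn.commit()

if __name__ == "__main__":
    conn = open_exception_store()
    report_exception(conn, "load_orders_fact", "order_id=42",
                     "customer surrogate key not found", severity="ERROR")
    for row in conn.execute("SELECT * FROM etl_exceptions"):
        print(row)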

Page 35: The Data Warehouse ETL Toolkit - Chapter 01

1.2.1 ETL Tool vs. Hand Coding (Quality Handling)

Decide in advance on a common response to quality issues that arise while processing the data.

Page 36: The Data Warehouse ETL Toolkit - Chapter 01

1.2.1 ETL Tool vs. Hand Coding (Recovery and Restart)

You need to build your ETL system around the ability to recover from the abnormal ending of a job and restart.
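One common way to get restartability, sketched below under invented assumptions, is to checkpoint after each committed batch so a failed run resumes from the last checkpoint rather than reprocessing everything. The batch contents and checkpoint file name are illustrative.

# Hypothetical sketch of recovery and restart via batch checkpoints.
import json
import os

CHECKPOINT_FILE = "load_orders.checkpoint"
BATCHES = [["row-1", "row-2"], ["row-3", "row-4"], ["row-5"]]

def load_checkpoint():
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["next_batch"]
    return 0

def save_checkpoint(next_batch):
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"next_batch": next_batch}, f)

def run_job():
    start = load_checkpoint()
    for i in range(start, len(BATCHES)):
        print(f"loading batch {i}: {BATCHES[i]}")   # pretend database load
        save_checkpoint(i + 1)                      # checkpoint after commit
    os.remove(CHECKPOINT_FILE)                      # clean finish
    print("job complete")

if __name__ == "__main__":
    run_job()   # if the previous run died mid-way, this resumes where it stopped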

Page 37: The Data Warehouse ETL Toolkit - Chapter 01

1.2.1 ETL Tool vs. Hand Coding (Metadata)

The biggest metadata challenge for the ETL team is where and how to store process-flow information.

Page 38: The Data Warehouse ETL Toolkit - Chapter 01

1.2.1 ETL Tool vs. Hand Coding (Security)

Security in the ETL environment is less granular than in the end user environment.

Page 39: The Data Warehouse ETL Toolkit - Chapter 01

1.2.2 The Back Room - Preparing the Data

The back room and the front room of the data warehouse are physically, logically, and administratively separate.

Page 40: The Data Warehouse ETL Toolkit - Chapter 01

1.2.2 The Back Room - Preparing the Data (cont.)

Page 41: The Data Warehouse ETL Toolkit - Chapter 01

1.2.2 The Back Room - Preparing the Data (cont.)

Prohibiting data access in the back room kitchen relieves the ETL team from:
Providing detailed security at a row, column, or applications level.
Building query performance-enhancing indexes and aggregations.
Providing continuous up-time under service-level agreements.
Guaranteeing that all datasets are consistent with each other.

Page 42: The Data Warehouse ETL Toolkit - Chapter 01

1.2.2 The Back Room - Preparing the Data (Four Staging Steps of a Data Warehouse)

Page 43: The Data Warehouse ETL Toolkit - Chapter 01

1.2.3 The Front Room – Data Access

Accessing data in the presentation area of the data warehouse is a client, or follow-on, project that must be closely coordinated with the building and managing of the ETL system.

Data marts are an important component of the front room.

Page 44: The Data Warehouse ETL Toolkit - Chapter 01

1.2.3 The Front Room – Data Access (Data Mart)

A data mart is a set of dimensional tables supporting a business process.

Some strong statements about data marts:
Data marts are based on the source of data, not on a department's view of data.
Data marts contain all atomic detail needed to support drilling down to the lowest level.
Data marts can be centrally controlled or decentralized.

Page 45: The Data Warehouse ETL Toolkit - Chapter 01

1.2.3 The Front Room – Data Access (Extended Concept of the Front Room)

Indexing dimensional tables in the presentation area for query performance.
Choosing front-end tools, including query tools, report writers, and dashboards.
Writing SQL to solve end user queries.
Data-mining techniques.
Forecasting, behavior scoring, and calculating allocations.
Security on the tables and applications accessible by end users.
Metadata supporting end user tools.
End user training and documentation.

Page 46: The Data Warehouse ETL Toolkit - Chapter 01

1.3 The Mission of the Data Warehouse

The mission of the data warehouse is to publish the organization's data assets to most effectively support decision making.

Page 47: The Data Warehouse ETL Toolkit - Chapter 01

1.3.1 What the Data Warehouse Is

A data warehouse is a system that extracts, cleans, conforms, and delivers source data into a dimensional data store and then supports and implements querying and analysis for the purpose of decision making.

Page 48: The Data Warehouse ETL Toolkit - Chapter 01

1.3.2 What the Data Warehouse Is Not

The top five things the data warehouse is mistaken to be are as follows:
A product.
A language.
A project.
A data model.
A copy of your transaction system.

Page 49: The Data Warehouse ETL Toolkit - Chapter 01

1.3.3 Industry Terms Not Used Consistently

Several industry terms are given different meanings by different writers.

Page 50: The Data Warehouse ETL Toolkit - Chapter 01

1.3.3 Industry Terms Not Used Consistently (Data Mart)

Other authors frequently define a data mart as an aggregated set of data pre-built to answer specific business questions for a given department.

Our data marts (call them dimensional data marts) look the same to all observers and would be implemented identically by anyone with access to the underlying measurement events.

Page 51: The Data Warehouse ETL Toolkit - Chapter 01

1.3.3 Industry Terms Not Used Consistently (Enterprise Data Warehouse – EDW)

EDW is sometimes used as the name for a specific design approach (as contrasted with the uncapitalized enterprise data warehouse, which refers generically to the data warehouse assets of a large organization).

Page 52: The Data Warehouse ETL Toolkit - Chapter 01

1.3.4 Resolving Architectural Conflict: The Hybrid Bus Approach

A really dirty data source benefits from the discipline of enforcing the many-to-one relationships brought to the surface by the process of normalization.

The ETL team then converts any such normalized structures into simple dimensional structures for the conforming and final handoff steps.
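The conversion can be illustrated with a small Python sketch under invented assumptions: a normalized (snowflaked) product/category pair from staging is flattened into a single denormalized dimension for handoff. Table contents and column names are made up.

# Hypothetical sketch: flatten normalized staging tables into a simple
# dimensional structure.

# Normalized staging tables with an enforced many-to-one relationship:
CATEGORY_TABLE = {1: {"category_id": 1, "category_name": "Beverages"},
                  2: {"category_id": 2, "category_name": "Snacks"}}
PRODUCT_TABLE = [
    {"product_id": 10, "product_name": "Green Tea", "category_id": 1},
    {"product_id": 11, "product_name": "Pretzels",  "category_id": 2},
]

def build_product_dimension():
    """Flatten product + category into one denormalized dimension table."""
    dimension = []
    for surrogate_key, product in enumerate(PRODUCT_TABLE, start=1):
        category = CATEGORY_TABLE[product["category_id"]]
        dimension.append({
            "product_key":   surrogate_key,              # surrogate key
            "product_id":    product["product_id"],      # natural key kept
            "product_name":  product["product_name"],
            "category_name": category["category_name"],  # flattened attribute
        })
    return dimension

if __name__ == "__main__":
    for row in build_product_dimension():
        print(row)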

Page 53: The Data Warehouse ETL Toolkit - Chapter 01

1.3.5 How the Data Warehouse Is Changing

The data warehouse is undergoing a significant change, perhaps the most significant change since the beginning of data warehousing.

Page 54: The Data Warehouse ETL Toolkit - Chapter 01

1.4 The Mission of the ETL Team

The mission of the ETL team, at the highest level, is to build the back room of the data warehouse.

More specifically, the ETL system must:
Deliver data most effectively to end user tools.
Add value to data in the cleaning and conforming steps.
Protect and document the lineage of data.

Page 55: The Data Warehouse ETL Toolkit - Chapter 01

1.4 The Mission of the ETL Team (cont.)

In every data warehouse, the back room must support four key steps:
Extracting data from the original sources.
Quality assuring and cleaning data.
Conforming the labels and measures in the data to achieve consistency across the original sources.
Delivering data in a physical format that can be used by query tools, report writers, and dashboards.