data warehouse data transformation. extract transform insert n extract data from operational system,...

42
Data Warehouse Data Warehouse DATA TRANSFORMATION DATA TRANSFORMATION

Post on 22-Dec-2015

216 views

Category:

Documents


0 download

TRANSCRIPT

Data WarehouseData Warehouse

DATA TRANSFORMATIONDATA TRANSFORMATION

Extract Transform InsertExtract Transform Insert

Extract data from operational system, Extract data from operational system, transform and insert into data transform and insert into data warehousewarehouse

Why ETI?Why ETI? Will your warehouse produce correct Will your warehouse produce correct

information with the current data?information with the current data? How ho w can I ensure warehouse How ho w can I ensure warehouse

credibility?credibility?

Excuses for NOT Excuses for NOT Transforming Legacy DataTransforming Legacy Data Old data works fine, new will work as well.Old data works fine, new will work as well. Data will be Data will be fixed fixed at point of entry through at point of entry through

GUI.GUI. If needed, data will be cleaned after new If needed, data will be cleaned after new

system populated; After proof-of-concept system populated; After proof-of-concept pilot.pilot.

Keys join the data most of the time.Keys join the data most of the time. Users will not agree to modifying or Users will not agree to modifying or

standardizing their data.standardizing their data.

Levels of Migration ProblemLevels of Migration Problem

Existing metadata is insufficient and unreliableExisting metadata is insufficient and unreliable Metadata must hold for all occurrencesMetadata must hold for all occurrences Metadata must represent business and technical Metadata must represent business and technical

attributesattributes

Data values incorrectly typed and accessibleData values incorrectly typed and accessible Values form extracted from storageValues form extracted from storage Values meaning inferred from its contentValues meaning inferred from its content

Entity keys unreliable or unavailableEntity keys unreliable or unavailable Inferred from related valuesInferred from related values

Metadata ChallengeMetadata Challenge

Metadata gets out of synch with details it Metadata gets out of synch with details it summarizessummarizes Business grows faster than systems designed to Business grows faster than systems designed to

capture business infocapture business info Not at the right level of detailNot at the right level of detail

Multiple values in a single fieldMultiple values in a single field Multiple meanings to a single fieldMultiple meanings to a single field No fixed format for valueNo fixed format for value

Expressed in awkward of limited termsExpressed in awkward of limited terms Program/compiler view rather than business viewProgram/compiler view rather than business view

Character-level ChallengeCharacter-level Challenge

Value Value instanceinstance level level Spelling, aliasesSpelling, aliases Abbreviations, truncations, transpositionsAbbreviations, truncations, transpositions Inconsistent storage formatsInconsistent storage formats

Named Named typetype level level Multiple meanings, contextual meaningsMultiple meanings, contextual meanings Synonyms, homonymsSynonyms, homonyms

EntityEntity level level No common keys or representationNo common keys or representation No integrated view across records, files, systemsNo integrated view across records, files, systems

Some Data Quality examplesSome Data Quality examples

The magic shrinking vendor fileThe magic shrinking vendor file

127 ways to spell...127 ways to spell...

Data surprises in individual fieldsData surprises in individual fields

Cowbirds and Data FieldsCowbirds and Data Fields

Magic numbers and embedded intelligenceMagic numbers and embedded intelligence

The Magic Shrinking Vendor The Magic Shrinking Vendor FileFileA Medical claims processor was A Medical claims processor was

having trouble with their Insurance having trouble with their Insurance Vendor file. They thought they had Vendor file. They thought they had 300,000 Insurance Vendors. 300,000 Insurance Vendors.

When they cleaned up their data, When they cleaned up their data, they discovered they had only they discovered they had only 27,000 unique Insurance Vendors.27,000 unique Insurance Vendors.

127 ways to spell...127 ways to spell...

Have over 127 different ways to spell Have over 127 different ways to spell AT&TAT&T

Have over 1000 ways to spell duPontHave over 1000 ways to spell duPont

NAME SOC. SEC. # TELEPHONE

Data surprises in individual Data surprises in individual fieldsfieldsData surprises in individual Data surprises in individual fieldsfields

Source: ValitySource: Vality

NAME SOC. SEC. # TELEPHONEMeta

Data surprises in individual Data surprises in individual fieldsfieldsData surprises in individual Data surprises in individual fieldsfields

Source: ValitySource: Vality

NAME SOC. SEC. # TELEPHONE

Denise Mario DBA

Marc Di Lorenzo ETAL

Tom & Mary Roberts

First Natl Provident

Digital 15 State St.

Astorial Fedrl Savings

Kevin Cooke, Receiver

John Doe Trustee for K

Meta

Actual Data Values

Data surprises in individualData surprises in individual fieldsfieldsData surprises in individualData surprises in individual fieldsfields

Source: ValitySource: Vality

NAME SOC. SEC. # TELEPHONE

Denise Mario DBA

Marc Di Lorenzo ETAL

Tom & Mary Roberts

First Natl Provident

Digital 15 State St.

Astorial Fedrl Savings

Kevin Cooke, Receiver

John Doe Trustee for K

228-02-1975

999999999

025-37-1888

34-2671434

101010101

LN#12-756

18-7534216

111111111

Meta

Actual Data Values

Data surprises in individualData surprises in individual fieldsfieldsData surprises in individualData surprises in individual fieldsfields

Source: ValitySource: Vality

NAME SOC. SEC. # TELEPHONE

Denise Mario DBA

Marc Di Lorenzo ETAL

Tom & Mary Roberts

First Natl Provident

Digital 15 State St.

Astorial Fedrl Savings

Kevin Cooke, Receiver

John Doe Trustee for K

228-02-1975

999999999

025-37-1888

34-2671434

101010101

LN#12-756

18-7534216

111111111

6173380300

3380321

415-392-2000

508-466-1200

212-235-1000

FAX 528-9825

5436

Meta

Actual Data Values

Data surprises in individualData surprises in individual fieldsfieldsData surprises in individualData surprises in individual fieldsfields

Source: ValitySource: Vality

Cowbirds and Data FieldsCowbirds and Data Fields

Cowbirds lay their eggs in other birds Cowbirds lay their eggs in other birds netsnets

Users use data fields that are not used Users use data fields that are not used for other purposesfor other purposes

Magic Numbers and Magic Numbers and Embedded IntelligenceEmbedded Intelligence

Customer Number = Customer Number = XXXXXXXX--YYYY--ZZZZ

XXXXXXXX = 1st 4 Positions of Zip Code = 1st 4 Positions of Zip Code

If If YYYY = 00-70 Then Cust = Pharmacy = 00-70 Then Cust = Pharmacy

If If YY YY = 80-89 Then Cust = Hospital = 80-89 Then Cust = Hospital

Except if Except if YYYY = 82 and = 82 and ZZZZ = ** Which Means... = ** Which Means...

Orr's Laws of Data QualityOrr's Laws of Data QualityLaw #1 - “Data that is not used cannot be correct!”Law #1 - “Data that is not used cannot be correct!”

Law #2 - “Data quality is a function of its use, not its Law #2 - “Data quality is a function of its use, not its collection!”collection!”

Law #3 - “Data will be no better than its most stringent Law #3 - “Data will be no better than its most stringent use!”use!”

Law #4 - “Data quality problems increase with the age of Law #4 - “Data quality problems increase with the age of the system!”the system!”

Law #5 - “Data quality laws apply equally to meta-data!”Law #5 - “Data quality laws apply equally to meta-data!”

Law #6 - The less likely something is to occur, the more Law #6 - The less likely something is to occur, the more traumatic it will be when it happens!traumatic it will be when it happens!

Legacy Data Contaminants Legacy Data Contaminants Found in MigrationsFound in Migrations Lack of standardsLack of standards Data surprises in individual fieldsData surprises in individual fields Legacy information buried in free form Legacy information buried in free form

fieldsfields Legacy myopia – multiple account Legacy myopia – multiple account

numbers block consolidated viewnumbers block consolidated view Anomaly nightmare – complex matching Anomaly nightmare – complex matching

and consolidationand consolidation

4 Fundamental Types of 4 Fundamental Types of TransformationTransformation Simple TransformationSimple Transformation

Fundamental building blocks of all data Fundamental building blocks of all data transformationstransformations

One field at a timeOne field at a time Cleansing and ScrubbingCleansing and Scrubbing

Ensure consistent formatting and usage of Ensure consistent formatting and usage of field or related group of filedsfield or related group of fileds

Checks valid valuesChecks valid values

4 Fundamental Types of 4 Fundamental Types of Transformation (con’t)Transformation (con’t) IntegrationIntegration

Takes operational data from one or more Takes operational data from one or more sources and maps it, field by field to new sources and maps it, field by field to new data structuredata structure

Aggregation and SummarizationAggregation and Summarization Remove low level of detailRemove low level of detail Data for data martData for data mart

Simple TransformationSimple Transformation

Convert data element from one type to Convert data element from one type to anotheranother semantic value samesemantic value same rename elementsrename elements

Date time conversionDate time conversion standard warehouse formatstandard warehouse format

Decode encoded fieldsDecode encoded fields M F vs C S MMM F vs C S MM

Cleansing and ScrubbingCleansing and Scrubbing

Actual content examinedActual content examined Range checking, Range checking, enumerated lists, enumerated lists, dependency checkingdependency checking

Uniform representation for dwUniform representation for dw address informationaddress information

parse to componentsparse to components

IntegrationIntegration Simple field level mappingsSimple field level mappings -80-90% -80-90% Complex integrationComplex integration

No common identifierNo common identifier probable matchesprobable matches 2-stage process, isolation/reconciliation2-stage process, isolation/reconciliation

Multiple sources for same target elementMultiple sources for same target element contradictorycontradictory

Missing dataMissing data Derived/calculated dataDerived/calculated data

redundant?redundant?

Aggregation and SummarizationAggregation and Summarization Summarization Summarization is the addition of like values is the addition of like values

along one or more business dimensionsalong one or more business dimensions add daily sales by stores for monthly sales by add daily sales by stores for monthly sales by

regionregion Aggregation Aggregation is the addition of different is the addition of different

business elements into common totalbusiness elements into common total daily product sales plus monthly consulting sales daily product sales plus monthly consulting sales

give monthly combined sales amountgive monthly combined sales amount Details of process available in metadataDetails of process available in metadata

Data Re-engineering Data Re-engineering ProblemProblem

Programming for the unknownProgramming for the unknown Unanticipated values, structures and patternsUnanticipated values, structures and patterns

Programming for noise and uncertaintyProgramming for noise and uncertainty Conflicting and missing valuesConflicting and missing values

Programming for productivity and efficiencyProgramming for productivity and efficiency Changing data values, changing user requirementsChanging data values, changing user requirements High volumes, non-linear searchesHigh volumes, non-linear searchesConventional data transformation methods do not solve Conventional data transformation methods do not solve

the metadata and data value challenges – need data the metadata and data value challenges – need data re-engineeringre-engineering Stephen Brown, Vality Corp.Stephen Brown, Vality Corp.

Data Re-engineering ProcessData Re-engineering Process

External Files

Legacy Applications

Historical Extracts

•Customer Information Systems

•Data Warehouses

•Client/server Applications

•Consolidations

• Data investigation and Metadata Mining

• Data Standardization

• Data Integration

• Data Survivorship and Formating

Natural Laws of Natural Laws of Data Re-engineeringData Re-engineering

Data has no standardData has no standard You can’t predict or legislate format or contentYou can’t predict or legislate format or content Data will evolve faster than its capture and storage systemsData will evolve faster than its capture and storage systems You can’t write rules for what you don’t know and can’t seeYou can’t write rules for what you don’t know and can’t see Instructions for handling data are within the dataInstructions for handling data are within the data Don’t trust the metadata, make the data reveal itselfDon’t trust the metadata, make the data reveal itself Revealed metadata is knowledge about the businessRevealed metadata is knowledge about the business Revealed metadata validates warehouse designRevealed metadata validates warehouse design Revealed metadata supports conversion project managementRevealed metadata supports conversion project management Revealed metadata is insurance against misinformationRevealed metadata is insurance against misinformation

Buy tool or manually code Buy tool or manually code programs ?programs ?

3 - DW Tools3 - DW Tools

1st Generation ETL

Job Schedulers

RDBMS Utilities

Replication/Distribution Tools

Universal Repositories

Q&R/MQE/MRE

MOLAP/ROLAP/LowLAP

Data Mining

CASE

DB Design

Repositories

DB & System Monitors

TechnologiesTechnologies

ProcessesProcesses

EIS

Data Visualization

Meta Data Browsers

•Design•Mapping

•Design•Mapping

•Extract•Scrub•Transform

•Extract•Scrub•Transform

•Load•Index•Aggregation

•Load•Index•Aggregation

•Replication•Data Set Distribution

•Replication•Data Set Distribution

•Access & Analysis•Resource Scheduling & Distribution

•Access & Analysis•Resource Scheduling & Distribution

Meta DataMeta Data

System MonitoringSystem Monitoring

2nd Generation ETL Suites / Environments

TransformationTransformation Choosing between tool and manually coded Choosing between tool and manually coded

programsprograms Time frames - tools take longerTime frames - tools take longer

select, configure, learnselect, configure, learn

Budgets - short term or long termBudgets - short term or long term Size of warehouse - initial project small enough for Size of warehouse - initial project small enough for

codingcoding Size and skills of warehouse teamSize and skills of warehouse team

Tool automatically generates and maintains Tool automatically generates and maintains metadatametadata

Hand Generated CodeHand Generated Code

UpsideUpside No learning curveNo learning curve Inherent skillsInherent skills In house capabilitiesIn house capabilities Usually simpleUsually simple No culture change/mandate (CASE)No culture change/mandate (CASE)

DownsideDownside Manual meta dataManual meta data Maintenance challenge when talent level changesMaintenance challenge when talent level changes No automationNo automation

ToolsTools

UpsideUpside Easy to maintain as talent level changesEasy to maintain as talent level changes Automatic meta dataAutomatic meta data May gain efficienciesMay gain efficiencies Integration with repositoriesIntegration with repositories Integration with other toolsIntegration with other tools

SchedulersSchedulers MonitorsMonitors Meta data managementMeta data management

ToolsTools

DownsideDownside Cost (1st generation tools very high $)Cost (1st generation tools very high $) Learning curveLearning curve Enforced culture changeEnforced culture change

Must use tool for all changesMust use tool for all changes

Speed, may be slower to implementSpeed, may be slower to implement May require additional resourcesMay require additional resources

Copyright © 1997, Enterprise Group, Ltd.

Manual Code / 1st Generation Manual Code / 1st Generation ETL Tools ProcessETL Tools Process

ExtractProgram

SourceSourceOLTPOLTP

SystemsSystems

TransformProgram

FileTransferProgram

FileLoad

Program

Source Mainframe or C/S System Data Warehouse Client/Server System

IndexProgram

AggregationProgram

External Job Scheduling and Control - External Meta Data Load/Maintenance

Copyright © 1997, Enterprise Group, Ltd.

2nd Generation ETL Tools 2nd Generation ETL Tools ProcessProcess

Source OLTPSource OLTPSystemsSystems

TransformationEngine

•Monitoring•Scheduling•Extraction•Scrubbing•Transformation•Load•Index•Aggregation•Meta Data Load•Meta Data Maint.

CachingCaching

Source Mainframeor C/S System

Transformation EngineC/S System

Data Warehouse/MartC/S System

Data WarehouseData Warehouseor Data Martor Data Mart

Copyright © 1997, Enterprise Group, Ltd.

2nd Generation ETL Environment 2nd Generation ETL Environment ProcessProcess

Source OLTPSource OLTPSystemsSystems

TransformationEngine

•Monitoring•Scheduling•Extraction•Scrubbing•Transformation•Load•Index•Aggregation•Meta Data Load•Meta Data Maint.•Request Broker

CachingCaching

Source Mainframeor C/S System

Transformation EngineC/S System

Data WarehouseC/S System

Enterprise Meta Data

Data MartC/S System

Data MartData Mart

Data MartC/S System

Data MartData Mart

•Surf Meta Data•Request Resource•Schedule Delivery

User Process

DataWarehouse

1st Generation ETL Tools 1st Generation ETL Tools Hampered by:Hampered by:

High cost (average deal prices in the $250-400k range)High cost (average deal prices in the $250-400k range) Long learning curvesLong learning curves Perceived value (most teams felt they could write better code)Perceived value (most teams felt they could write better code) Cultural challenges (like a CASE tool, the team must use the Cultural challenges (like a CASE tool, the team must use the

code generator for all creation and changes, no matter how code generator for all creation and changes, no matter how minor)minor)

Core capabilities (complex transformations still required manual Core capabilities (complex transformations still required manual code)code)

Management requirements (users still had to manage all the Management requirements (users still had to manage all the programs generated)programs generated)

Performance issues (the resulting programs could not leverage Performance issues (the resulting programs could not leverage parallelism)parallelism)

Important 2nd Generation ETL Important 2nd Generation ETL tool features:tool features:

Transformation engine designTransformation engine design Ability to leverage parallel server technologyAbility to leverage parallel server technology CDC (Change Data Capture, which allows only the CDC (Change Data Capture, which allows only the

new data to be extracted)new data to be extracted) Incremental aggregation (ability to add CDC Incremental aggregation (ability to add CDC

incremental data to existing aggregations)incremental data to existing aggregations) Limited or no use of temporary files or data base tables Limited or no use of temporary files or data base tables

(virtual caching only)(virtual caching only) Common, open and extensible meta data repositoryCommon, open and extensible meta data repository Enterprise scalabilityEnterprise scalability

Important 2nd Generation ETL Important 2nd Generation ETL tool features:tool features:

Common UI (User Interface) across all toolsCommon UI (User Interface) across all tools Extensive selection of transformation algorithmsExtensive selection of transformation algorithms Easily extensible scrub and transform algorithm library Easily extensible scrub and transform algorithm library Extensive heterogeneous source and target supportExtensive heterogeneous source and target support Native OLAP data set target supportNative OLAP data set target support System monitoring & management System monitoring & management Enterprise meta data repository (content, resources, Enterprise meta data repository (content, resources,

structure, etc.)structure, etc.) Transform once, populate many (populate multiple targets Transform once, populate many (populate multiple targets

with a single transformation output)with a single transformation output)

Important 2nd Generation ETL Important 2nd Generation ETL tool features:tool features:

Integrated enterprise scale scrubbing capabilitiesIntegrated enterprise scale scrubbing capabilities Seamless interoperability with external point solution Seamless interoperability with external point solution

toolstools Integrated information access, analysis, scheduling Integrated information access, analysis, scheduling

and deliveryand delivery Aggregate aware information request broker (enables Aggregate aware information request broker (enables

virtual data warehouse)virtual data warehouse) Ad hoc aggregation monitoring and managementAd hoc aggregation monitoring and management Pipeline parallelism / very high throughputPipeline parallelism / very high throughput Native drivers (source and target)Native drivers (source and target)

OLTP <> OLAPOLTP <> OLAP

OLTPOLTP normalizednormalized

OLAPOLAP tools must provide multidimensional tools must provide multidimensional

conceptual view of data ??????conceptual view of data ??????Providing OLAP to User Analysts, Providing OLAP to User Analysts, E.F.CoddE.F.Codd

redundant dataredundant data

Multidimensional ModelMultidimensional Model

Data stored as facts and dimensionsData stored as facts and dimensions

Sales Fact Cube