65518203 etl concepts

Upload: sravya-reddy

Post on 03-Apr-2018

225 views

Category:

Documents


2 download

TRANSCRIPT

  • 7/28/2019 65518203 ETL Concepts

    1/56

    ETL Concepts

  • 7/28/2019 65518203 ETL Concepts

    2/56

    Scope of the TrainingWhat is ETL?

    Need for ETLETL Glossary

    The ETL Process

    Data Extraction and Preparation

    Data Cleansing

    Data Transformation

    Data Load

    Data Refresh Strategies

    ETL Solution Options

    Characteristics of ETL Tools

  • 7/28/2019 65518203 ETL Concepts

    3/56

    What is ETL?ETL stands for Extraction, Transformation

    and Load

    This is the most challenging, costly and

    time consuming step towards building any

    type of Data warehouse.This step usually determines the success

    or failure of a Data warehouse because

    any analysis lays a lot of importance ondata and the quality of data that is being

    analyzed.

  • 7/28/2019 65518203 ETL Concepts

    4/56

    What is ETL?

  • 7/28/2019 65518203 ETL Concepts

    5/56

    What is ETL? - ExtractionExtraction

    The process of culling out data that is

    required for the Data Warehouse from the

    source system

    Can be to a file or to a database

    Could involve some degree of cleansing or

    transformation

    Can be automated since it becomes

    repetitive once established

  • 7/28/2019 65518203 ETL Concepts

    6/56

    What is ETL? - Transformation &

    Cleansing

    Transformation

    Modification or transformation of data being

    imported into the Data Warehouse.

    Usually done with the purpose of ensuring

    clean and consistent data

    Cleansing

    The process of removing errors and

    inconsistencies from data being imported to

    a data warehouse

    Could involve multi le sta es

  • 7/28/2019 65518203 ETL Concepts

    7/56

    What is ETL? - Loading

    After extracting, scrubbing, cleaning,validating etc. need to load the data intothe warehouse

    Issueshuge volumes of data to be loaded

    small time window available when

    warehouse can be taken off line (usuallynights)

    when to build index and summarytables

  • 7/28/2019 65518203 ETL Concepts

    8/56

    What is ETL? - Loading Techniques

    Techniques

    batch load utility: sort input records on

    clustering key and use sequential I/O;

    build indexes and derived tables

    sequential loads still too long

    use parallelism and incrementaltechniques

  • 7/28/2019 65518203 ETL Concepts

    9/56

    The Need for ETLFacilitates Integration of data from various

    data sources for building a Datawarehouse Note: Mergers and acquisitions also create

    disparities in data representation and pose

    more difficult challenges in ETL.Businesses have data in multiple

    databases with different codification andformats

    Transformation is required to convert andto summarize operational data into aconsistent, business oriented format

    Pre-Computation of any derived

  • 7/28/2019 65518203 ETL Concepts

    10/56

    The Need for ETL - Example

    appl A - balanceappl B - bal

    appl C - currbal

    appl D - balcurr

    appl A - pipeline - cm

    appl B - pipeline - in

    appl C - pipeline - feet

    appl D - pipeline - yds

    appl A - m,f

    appl B - 1,0

    appl C - x,y

    appl D - male, female

    Data Warehouse

  • 7/28/2019 65518203 ETL Concepts

    11/56

    Data Integrity Problems - ScenariosSame person, different spellings

    Agarwal, Agrawal, Aggarwal etc...Multiple ways to denote company name

    Persistent Systems, PSPL, Persistent

    Pvt. LTD.Use of different names

    mumbai, bombay

    Different account numbers generated bydifferent applications for the same customer

    Required fields left blank

    Invalid product codes collected at point of

  • 7/28/2019 65518203 ETL Concepts

    12/56

    ETL Glossary

    Extracting

    Conditioning

    House holding

    Enrichment

    Scoring

  • 7/28/2019 65518203 ETL Concepts

    13/56

    ETL Glossary

    Extracting

    Capture of data from operational

    source in as is statusSources for data generally in legacy

    mainframes in VSAM, IMS, IDMS,

    DB2; more data today in relationaldatabases on Unix

    Conditioning

    The conversion of data types from the

  • 7/28/2019 65518203 ETL Concepts

    14/56

    ETL GlossaryHouse holding

    Identifying all members of a household(living at the same address)

    Ensures only one mail is sent to a

    householdCan result in substantial savings: 1 lakh

    catalogues at Rs. 50 each costs Rs. 50lakhs. A 2% savings would save Rs. 1 lakh.

    Enrichment

    Bring data from external sources toaugment/enrich operational data. Data

    sources include Dunn and Bradstreet, A. C.

  • 7/28/2019 65518203 ETL Concepts

    15/56

    The ETL ProcessAccess data dictionaries defining source

    files

    Build logical and physical data models for

    target data

    Identify sources of data from existing

    systems

    Specify business and technical rules fordata extraction, conversion and

    transformation

    Perform data extraction and

  • 7/28/2019 65518203 ETL Concepts

    16/56

    The ETL Process Push vs. PullPull :- A Pull strategy is initiated by the Target

    System. As a part of the Extraction Process, thesource data can be pulled from Transactional

    system into a staging area by establishing a

    connection to the relational/flat/ODBC sources. Advantage :- No additional space required to store

    the data that needs to be loaded into to the staging

    database

    Disadvantage :- Burden on the Transactional

    systems when we want to load data into the staging

    database

    OR

  • 7/28/2019 65518203 ETL Concepts

    17/56

    The ETL Process Push vs. Pull

    With a PUSH strategy, the source system areamaintains the application to read the source and

    create an interface file that is presented to your ETL.

    With a PULL strategy, the DW maintains the

    application to read the source.

  • 7/28/2019 65518203 ETL Concepts

    18/56

    The ETL Process - Data Extraction

    and Preparation

    Extract

    Analyze, Cleanand Transform

    Data Movementand Load

    Stage I

    Stage II

    Stage III

    Periodic

    Refresh/

    Update

  • 7/28/2019 65518203 ETL Concepts

    19/56

    The ETL Process A simplified

    picture

    OLTP

    Systems

    StagingArea

    DataWarehouseExtract

    OLTP

    Systems

    OLTPSystems

    Load

    Transform

    Stage I Stage II Stage III

  • 7/28/2019 65518203 ETL Concepts

    20/56

    Static extract = capturing a

    snapshot of the source data at a

    point in time

    Incremental extract = capturing

    changes that have occurred since

    the last static extract

    Capture = extractobtaining a snapshot of

    a chosen subset of the source data for

    loading into the data warehouse

    The ETL Process Step1

  • 7/28/2019 65518203 ETL Concepts

    21/56

    Scrub = cleanseuses pattern recognition

    and AI techniques to upgrade data quality

    Fixing errors: misspellings, erroneousdates, incorrect field usage, mismatched

    addresses, missing data, duplicate data,

    inconsistencies

    Also: decoding, reformatting, timestamping, conversion, key generation,

    merging, error detection/logging, locating

    missing data

    The ETL Process Step2

  • 7/28/2019 65518203 ETL Concepts

    22/56

    Transform = convert data from format of

    operational system to format of data

    warehouse

    Record-level:Selection data partitioning

    Joining data combining

    Aggregation data summarization

    Field-level:Single-field from one field to one field

    Multi-field from many fields to one, or one

    field to many

    The ETL Process Step3

  • 7/28/2019 65518203 ETL Concepts

    23/56

    Load/Index= place transformed data into the

    warehouse and create indexes

    Refresh mode: bulk rewriting of targetdata at periodic intervals

    Update mode: only changes in sourcedata are written to data warehouse

    The ETL Process Step4

  • 7/28/2019 65518203 ETL Concepts

    24/56

    The ETL Process - Data

    Transformation

    Transforms the data in accordance with

    the business rules and standards that

    have been established

    Example include: format changes, de-

    duplication, splitting up fields,

    replacement of codes, derived values,

    and aggregates

  • 7/28/2019 65518203 ETL Concepts

    25/56

    Scrubbing/Cleansing DataSophisticated transformation tools used for

    improving the quality of dataClean data is vital for the success of the

    warehouse

    Example

    Seshadri, Sheshadri, Sesadri, Seshadri S.,

    Srinivasan Seshadri, etc. are the same

    person

  • 7/28/2019 65518203 ETL Concepts

    26/56

    Dummy Values

    Absence of DataMultipurpose Fields

    Cryptic Data

    Contradicting Data

    Inappropriate Use of Address Lines

    Violation of Business Rules

    Reused Primary Keys

    Non-Unique IdentifiersData Integration Problems

    Reasons for Dirty data

  • 7/28/2019 65518203 ETL Concepts

    27/56

    The ETL Process - Data Cleansing

    Source systems contain dirty data

    that must be cleansed

    ETL software contains rudimentary

    data cleansing capabilities

    Specialized data cleansing software is

    often used. Important for performing

    name and address correction and

    house holding functions

    Leading data cleansing/Quality

    Technology vendors include IBM

  • 7/28/2019 65518203 ETL Concepts

    28/56

    Steps in Data Cleansing

    Parsing

    Correcting

    Standardizing

    Matching

    Consolidating

  • 7/28/2019 65518203 ETL Concepts

    29/56

    Parsing

    Parsing locates and identifies

    individual data elements in the source

    files and then isolates these data

    elements in the target files.Examples include parsing the first,

    middle, and last name; street number

    and street name; and city and state.

  • 7/28/2019 65518203 ETL Concepts

    30/56

    Correcting

    Corrects parsed individual data

    components using sophisticated data

    algorithms and secondary data

    sources.

    Example include replacing a vanity

    address and adding a zip code.

  • 7/28/2019 65518203 ETL Concepts

    31/56

    Standardizing

    Standardizing applies conversion

    routines to transform data into its

    preferred (and consistent) format using

    both standard and custom businessrules.

    Examples include adding a pre name,

    replacing a nickname, and using apreferred street name.

  • 7/28/2019 65518203 ETL Concepts

    32/56

    Matching

    Searching and matching records within

    and across the parsed, corrected and

    standardized data based on predefined

    business rules to eliminateduplications.

    Examples include identifying similar

    names and addresses.

  • 7/28/2019 65518203 ETL Concepts

    33/56

    Consolidating

    Analyzing and identifying relationships

    between matched records and

    consolidating/merging them into ONE

    representation.

  • 7/28/2019 65518203 ETL Concepts

    34/56

    Data Quality Technology Tools

    (Vendors)

    DataFlux Integration Server & dfPower

    Studio (www.DataFlux.com)

    Trillium Software Discovery & Trillium

    Software System

    (www.trilliumsoftware.com)

    ProfileStage & QualityStage

    (www.ascential.com)

    MarketScope Update: Data Quality

  • 7/28/2019 65518203 ETL Concepts

    35/56

    MarketScope Update: Data Quality

    Technology ratings, 2005 (Source:

    Gartner - June 2005)

  • 7/28/2019 65518203 ETL Concepts

    36/56

    The ETL Process - Data Loading

    Data are physically moved to the data

    warehouse

    The loading takes place within a load

    window

    The trend is to near real time updates

    of the data warehouse as the

    warehouse is increasingly used foroperational applications

  • 7/28/2019 65518203 ETL Concepts

    37/56

    Data Loading - First Time LoadFirst load is a complex exercise

    Data extracted from tapes, files, archivesetc.

    First time load might take a lot of time to

    complete

  • 7/28/2019 65518203 ETL Concepts

    38/56

    Data Refresh

    Issues:

    when to refresh?

    on every update: too expensive, onlynecessary if OLAP queries need

    current data (e.g., up-the-minute stock

    quotes)

    periodically (e.g., every 24 hours,

    every week) or after significant

    events

    refresh olic set b administrator

  • 7/28/2019 65518203 ETL Concepts

    39/56

    Data RefreshData refreshing can follow two approaches :

    Complete Data Refresh

    Completely refresh the target table every

    time

    Data Trickle Load

    Replicate only net changes and update

    the target database

  • 7/28/2019 65518203 ETL Concepts

    40/56

    Data Refresh TechniquesSnapshot Approach - Full extract from base

    tablesread entire source table or database:

    expensive

    may be the only choice for legacydatabases or files.

    Incremental techniques (related to work onactive DBs)

    detect & propagate changes on basetables: replication servers (e.g., Sybase,Oracle, IBM Data Propagator)

    snapshots & triggers (Oracle)

  • 7/28/2019 65518203 ETL Concepts

    41/56

    ETL Solution Options

    ETL

    Custom

    Solution

    Generic

    Solution

  • 7/28/2019 65518203 ETL Concepts

    42/56

    Custom SolutionUsing RDBMS staging tables and stored

    proceduresProgramming languages like C, C++, Perl,

    Visual Basic etc

    Building a code generator

  • 7/28/2019 65518203 ETL Concepts

    43/56

    Custom Solution Typical components

    Control Program

    Time window based extraction

    Restart at point of failure

    High level of error handling

    Control metadata captured in Oracle tables

    Facility to launch failure recovery programs Automatically

    Snapshots for dimension

    tables

    PL/SQL extraction procedure

    Complex views for

    transformation

    Control table and highly

    parameterized/generic

    extraction process

    Extract From Source

    Control table driven

    Highly configurable process

    PL/SQL procedure

    Checks performed - referential

    integrity, Y2K, elementary

    statistics, business rules

    Mechanism to flag the records

    as bad / reject

    Data Quality

    Multiple Stars extracted as

    separate groups Pro*C programs using

    embedded SQL

    Surrogate key generation

    mechanism

    ASCII file downloads

    generated for load into

    warehouse

    Generate Download Files

  • 7/28/2019 65518203 ETL Concepts

    44/56

    Generic SolutionAddress limitations (in scalability &

    complexity) of manual codingThe need to deliver quantifiable business

    value

    Functionality, Reliability and Viability areno longer major issues

  • 7/28/2019 65518203 ETL Concepts

    45/56

    Characteristics of ETL ToolsProvides facility to specify a large number of

    transformation rules with a GUIGenerate programs to transform data

    Handle multiple data sources

    Handle data redundancy

    Generate metadata as output

    Most tools exploit parallelism by running on

    multiple low-cost servers in multi-threaded

    environment

    Support data extraction, cleansing, aggregation,

    reorganization, transformation, and load

    operations

  • 7/28/2019 65518203 ETL Concepts

    46/56

    Types of ETL ToolsFirst-generation

    Code-generation products

    Generate the source code

    Second-generation Engine-driven products

    Generate directly executable code

    Note: Due to more efficient architecture,

    second generation tools have significant

  • 7/28/2019 65518203 ETL Concepts

    47/56

    Types of ETL Tools - First-

    GenerationExtraction, transformation,load process

    run on server or hostGUI interface is used to define extraction/

    transformation processes

    Detailed transformations require coding inCOBOL or C

    Extract program is generated

    automatically as source code. Sourcecode is compiled, scheduled, and run in

    batch mode

    Uses intermediate files

    Fi G i ETL T l

  • 7/28/2019 65518203 ETL Concepts

    48/56

    First-Generation ETL Tools

    Strengths and LimitationsStrengths

    Tools are mature

    Programmers are

    familiar with code

    generation in

    COBOL or C

    Limitations

    High cost of products Complex training

    Extract programs have to

    compiled from source

    Many transformations have tocoded manually

    Lack of parallel execution

    support

    Most metadata to be manually

    generated

    Fi t G ti ETL T l

  • 7/28/2019 65518203 ETL Concepts

    49/56

    First-Generation ETL Tools

    ExamplesSAS/Warehouse Administrator

    Prism from Prism Solutions

    Passport from Apertus Carleton Corp

    ETI-EXTRACT Tool Suite from

    Evolutionary Technologies

    Copy Manager from Information Builders

    T f ETL T l S d

  • 7/28/2019 65518203 ETL Concepts

    50/56

    Types of ETL Tools - Second-

    GenerationExtraction/Transformation/Load runs on

    serverData directly extracted from source and

    processed on server

    Data transformation in memory and written

    directly to warehouse database. High

    throughput since intermediate files are not

    usedDirectly executable code

    Support for monitoring, scheduling,

    extraction, scrubbing, transformation,

    S d G ti ETL T l

  • 7/28/2019 65518203 ETL Concepts

    51/56

    Second-Generation ETL Tools

    Strengths and LimitationsStrengths

    Lower cost suites,

    platforms, and

    environment

    Fast, efficient, and

    multi-threaded

    ETL functions highly

    Limitations

    Not mature

    Initial tools oriented only to

    RDBMS sources

    S d G ti ETL T l

  • 7/28/2019 65518203 ETL Concepts

    52/56

    Second-Generation ETL Tools

    ExamplesPowerMart from Informatica

    DataStage from Ardent

    Data Mart Solution from Sagent

    Technology

    Tapestry from D2K

  • 7/28/2019 65518203 ETL Concepts

    53/56

    ETL Tools - Examples

    DataStage from Ascential Software

    SAS System from SAS Institute

    Power Mart/Power Center from

    Informatica

    Sagent Solution from Sagent Software

    Hummingbird Genio Suite fromHummingbird Communications

  • 7/28/2019 65518203 ETL Concepts

    54/56

    ETL Tool - General Selection criteria Business Vision/Considerations

    Overall IT strategy/Architecture Over all cost of Ownership

    Vendor Positioning in the Market

    Performance

    In-house Expertise available

    User friendliness

    Training requirements to existing users

    References from other customers

  • 7/28/2019 65518203 ETL Concepts

    55/56

    ETL Tool Specific Selection criteriaSupport to retrieve, cleanse, transform,

    summarize, aggregate, and load dataEngine-driven products for fast, parallel

    operation

    Generate and manage central metadatarepository

    Open metadata exchange architecture

    Provide end-users with access tometadata in business terms

    Support development of logical and

    physical data models

  • 7/28/2019 65518203 ETL Concepts

    56/56

    ETL Tool - Selection criteria

    ETI Extract

    SAS Warehouse

    Administrator

    Informatica PowerCenter

    Platinum Decision Base

    Ardent DataStage

    Data Mirror Transformation

    Server

    Ardent Warehouse

    Executive

    Carleton Pureview

    Ease of Use / Development

    Capabilities

    Target Database Loading

    Data

    Transformation

    and Repair

    Complexity

    Operations

    Management/

    ProcessAutomation

    Metadata

    Management and

    Administration

    Data Extraction &

    Integration

    complexity

    Source: Gartner Report

    High Rating

    Low Rating