dw training

Upload: santoshsunkara

Post on 09-Apr-2018

217 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/8/2019 DW Training

    1/32

    ETL- Extraction,Transformation & Loading

  • 8/8/2019 DW Training

    2/32

    1-2

    What is ETL?

    ETL(Extraction, Transformation and Loading) is a process by which

    data is integrated and transformed from the operational systems intothe Data warehouse environment

    Operational systemsOperational systems

    Filters andFilters and

    ExtractorsExtractors

    TransformationTransformation

    RulesRules

    Rule 1Rule 1

    Rule 2Rule 2

    Rule 3Rule 3

    IntegratorIntegrator

    CleaningCleaningRulesRules

    Rule 1Rule 1

    Rule 2Rule 2

    Rule 3Rule 3

    TransformationTransformationEngineEngine

    CleanserCleanser

    LoaderLoaderWarehouseWarehouse

    ErrorError

    ViewCheck

    Correct

    ErrorError

    View

    Check

    Correct

  • 8/8/2019 DW Training

    3/32

    1-3

    Operational Data - Challenges

    Data from heterogeneous sources

    Format differences

    Data Variations

    Context

    Across locations the same code could

    represent different customers

    Across periods of time a product code could

    have been reused

  • 8/8/2019 DW Training

    4/32

    1-4

    Extraction

    Oracle

    Sybase

    Text files

    Target

    80 tables

    50 tables

  • 8/8/2019 DW Training

    5/32

    1-5

    Transformation

    First

    Name

    Last

    Name

    Emp id

    IndianaJones10001

    SherlockHolmes10002

    Name =

    Concat(First Name,

    Last Name)

    Indiana Jones

    Sherlock Homes

    Staging Area

    Source

  • 8/8/2019 DW Training

    6/32

    1-6

    Loading

    StagingArea

    SourceData

    Warehouse

    Direct Load

    Cleaning,

    Transformation

    & Integration of

    Raw data

  • 8/8/2019 DW Training

    7/32

    1-7

    Volume of ETL in a Data warehouse

    Source OLTP

    Systems Data MartsData Marts

    DesignMapping

    Extract

    Scrub

    Transform

    LoadIndex

    Aggregation ReplicationData Set Distribution

    Access & AnalysisResource Scheduling & Distributio

    Meta Data

    System Monitoring

    Enterprise

    DataW

    areh

    ouse

    Metadata

  • 8/8/2019 DW Training

    8/32

    1-8

    Factors Influencing ETL Architecture

    Volume at each warehouse component.

    The time window available for extraction.

    The extraction type (Full,Periodic etc.)

    Complexity of the processes at each stage.

  • 8/8/2019 DW Training

    9/32

    Extraction Types

  • 8/8/2019 DW Training

    10/32

    1-10

    Extraction Types

    Extraction

    Full Extract

    Periodic/

    Incremental

    Extract

  • 8/8/2019 DW Training

    11/32

    1-11

    Source System

    Full Extract

    Existing data

    Data Mart

    Full Extract

  • 8/8/2019 DW Training

    12/32

    1-12

    Full Extract

    Source System

    Full Extract

    Data Mart

    New data

  • 8/8/2019 DW Training

    13/32

    1-13

    Full Extract

    Data Mart

    New data

    Source System

    Full Extract

  • 8/8/2019 DW Training

    14/32

    1-14

    Incremental Extract

    Data Mart

    Source SystemIncremental Extract

    Existing data

    Incremental

    Data

  • 8/8/2019 DW Training

    15/32

    1-15

    Incremental Extract

    Data Mart

    Source SystemIncremental Extract

    New data

    Changed data

    Existing data

    Incremental

    Data

  • 8/8/2019 DW Training

    16/32

    1-16

    Incremental Extract

    Data Mart

    Source System

    Incremental Extract

    New data

    Changed dataExisting data updated

    using changed data

    Incremental

    Data

    Incremental addition

    to data mart

  • 8/8/2019 DW Training

    17/32

    Transformation

  • 8/8/2019 DW Training

    18/32

    1-18

    Data Transformation

    Conversions

    Data type (e.g. Char to Date)

    Bring data to common units (Currency,Measuring Units)

    Classifications

    Changing continuous values to discrete ranges (e.g.

    Temperatures to Temperature Ranges)

    Splitting of fields

    Merging of fields

    Aggregations (e.g. Sum, Avg., Count)

    Derivations (Percentages, Ratios, Indicators)

  • 8/8/2019 DW Training

    19/32

    1-19

    Structural Transformations

    Additive

    Orders arrive

    every

    two minutesAggregate

    Average

    DailyProductivity

    figuresAverage

    OLTP

    OLTP

    Data ware

    house

    Data ware

    house

  • 8/8/2019 DW Training

    20/32

    1-20

    Format transformation

    Splitting

    Data Type

    Conversions

    Source

    Schema

    32

    Transformation

    Target

    Schema

    32

    Age as a String Age as an

    Integer

    15-10-1992

    Source

    Schema

    Date as a

    String

    Transformation15 10 1999

    Target

    Schema

    Day Month Year

    Date as a combination of3

    integer fields

  • 8/8/2019 DW Training

    21/32

    1-21

    Simple Conversions

    Transformations using Simple Conversions

    Source

    Schema

    Rs. 10000Multiply by 1/43

    Target

    Schema

    $232.56

    Revenue in

    Rupees

    Revenue in

    Dollars

    1000 lbs.Multiply by 0.4536

    453.56 kgs.

    Production in

    Pounds

    Production in

    Kilograms

    SourceSchema

    TargetSchema

  • 8/8/2019 DW Training

    22/32

    1-22

    Classification

    Name Age

    John Black 27

    Richard Wayne 53

    Jennifer Goldman 45

    Helmut Koch 37

    Anna Ludwig 32

    Shito Maketha 28

    Tracy Withman 39

    Ada Zhesky 25

    David Rosenberg 33

    Pankaj Sharma 29

    Zhu Ling 44

    George Kurtz 27

    Rita Hartman 34

    Grouping

    ge Grou Frequency

    20-25 1

    26-30 4

    31-35 3

    36-40 2

    41-45 2

    46-50 1

    51-55 1

    56-60 0

  • 8/8/2019 DW Training

    23/32

    1-23

    Data Consistency Transformations

    Source 1

    ender

    Male M

    Female F

    Source 2

    Gender

    Male Male

    Female Female

    Source 1

    Gender

    Male 1

    Female 2

    Target

    Gender

    Male M

    emale

  • 8/8/2019 DW Training

    24/32

    1-24

    Reconciliation of Duplicated data

    Joe Smith

    123 MaineSt.

    MA -

    70127

    Joseph

    Smith123 Maine

    St.

    MA -

    70127

    J.R.Smith

    123 MaineSt.

    MA -

    70127

    Joseph R Smith

    123 Maine St.

    MA - 70127

  • 8/8/2019 DW Training

    25/32

    1-25

    Data Aggregation - Design Requirements

    Aggregates must be stored in their own fact tables and each level

    should have its own fact table

    Dimension tables attached to the aggregate fact tables should where

    ever possible be shrunken versions of the dimension tables attached to

    the base fact table

    The base fact table and all of its related aggregate fact tables must be

    associated together as a family of schemas

  • 8/8/2019 DW Training

    26/32

    Loading

  • 8/8/2019 DW Training

    27/32

    1-27

    Types of Data warehouse Loading

    Target update types

    Insert

    Update

  • 8/8/2019 DW Training

    28/32

    1-28

    Types of Data Warehouse Updates

    Insert

    Full Replace

    Selective Replace

    Update

    Update plus RetainHistory

    Point in Time Snapshots

    New Data

    Changed Data

    Data Warehouse

    Source data Data Staging

  • 8/8/2019 DW Training

    29/32

    1-29

    New Data and Point-In-Time Data Insert

    Source data

    New data

    OR

    Point-in-Time

    Snapshot(e.g.. Monthly)

    New Data Added toExisting Data

  • 8/8/2019 DW Training

    30/32

    1-30

    Changed Data Insert

    Source data

    Changed Data Added to

    Existing Data

    Changed

    data

  • 8/8/2019 DW Training

    31/32

    1-31

    When the value of dimension in a data warehouse changes,

    then

    History of change needs to be maintained.

    Changed data alone needs to be identified

    Changed data should be easier to access.

    Reconstruction of the dimension table any point in time should be

    easier

    Change of Dimension values

  • 8/8/2019 DW Training

    32/32

    1-32

    ETL - Approach in a nutshell

    1) Identify the Operational systems based on data islands in the

    target

    2) Map source-target dependencies.

    3) Define cleaning and transformation rules

    4) Validate source-target mapping

    5) Consolidate Meta data for ETL

    6) Draw the ETL architecture

    7) Build the cleaning, transformation and auditing routines using

    either a tool or customized programs