02- etl design strategy

95
Internal and Confidential Data Integration – Techniques for Extracting, Transforming and Loading Data

Upload: muneer24

Post on 14-Dec-2014

572 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: 02- ETL Design Strategy

Internal and Confidential

Data Integration – Techniques for Extracting, Transforming and Loading Data

Page 2: 02- ETL Design Strategy

April 10, 2023

Agenda• Module 1 - Source Data Analysis & Modeling

• Module 2 – Data Capture Analysis & Design

• Module 3 – Data Transformation Analysis & Design

• Module 4 – Data Transport & Load Design

• Module 5 – Implementation Guidelines

Page 3: 02- ETL Design Strategy

Internal and Confidential April 10, 2023

Module 1 – Source Data Analysis and Modeling

Page 4: 02- ETL Design Strategy

April 10, 2023

Module 1 – Source Data Analysis & Modeling

Data Acquisition Concepts

Source Data Analysis

Source Data Modeling

Understanding the Scope of Data sources

Understanding Source Content

Logical to Physical Mapping

Data Acquisition Roles and Skills

Page 5: 02- ETL Design Strategy

April 10, 2023

• What is Data Acquisition?– Data acquisition is the process of moving data to and within the

warehousing environment• Goal oriented

– Not an isolated activity. Like data modeling, it is driven by goals and purpose of the data warehouse

• Source Driven / Target Driven– Source Driven – the necessary activities for getting data into the

warehousing environment from various sources– Target Driven – activities for acquisition within the warehousing

environment• Data Acquisition activities

– Identifying the set of data needed– Choosing the extract approach– Extract data from the source– Apply transformations– Load data

Data Acquisition Concepts

Staging

Datamart

ETL

Datawarehouse

ETL

ETLETL

ETL

Operational dataexternal data

Data sources

Data Intake

Data Distribution

Datamart

Datamart

Information DeliveryAccess

Get data from sources

Get data into warehouse

Get data for delivery

Page 6: 02- ETL Design Strategy

April 10, 2023

Module 1 – Source Data Analysis & Modeling

Data Acquisition Concepts

Source Data Analysis

Source Data Modeling

Understanding the Scope of Data sources

Understanding Source Content

Logical to Physical Mapping

Data Acquisition Roles and Skills

Page 7: 02- ETL Design Strategy

April 10, 2023

Source Data Analysis – Scope of Data Sources• Data acquisition issues

– Identifying and understanding data sources

– Mapping source data to target data

– Deciding which data to capture

– Effectively and efficiently capturing the data, and

– Determining how and when to transform the data

• Acquisition process design must pay attention to:– Ensuring data quality

– Efficiently loading warehouse databases

– Using tools effectively

– Planning for error recovery

product Customer

Finance Process

OrganizationHREntities

Customer

Order

Product…..

Business Events

Receive order

Ship order

Cancel order…. 2005

20042003

Merger with..

Acquisition of..

Termination of….

Enterprise Events

Subjects?

Kinds of

data?

History?

Page 8: 02- ETL Design Strategy

April 10, 2023

Source Data Analysis – Identification of sources• Types of sources

– Operational Systems

– Secondary Systems

– Backups, Logs, and Archives

– Shadow Systems

– DSS/EIS Systems, and

– External Data

• On-Going versus Single Load Sources

Page 9: 02- ETL Design Strategy

April 10, 2023

Source Data Analysis – Evaluation of Sources

Qualifying criteria Assessment Questions

Availability How available and accessible is the data?Are there technical obstacles to access?Or ownership and access authority issues?

Understandability How easily understood is the data?Is it well documented?Does someone in the organization have depth of knowledge?Who works regularly with this data?

Stability How frequently do data structures change?What is the history of change for the data?What is the expected life span of the potential data source?

Accuracy How reliable is the data?Do the business people who work with the data trust it?

Timeliness When and how often is the data updated?How current is the data?How much history is available?How available is it for extraction?

Completeness Does the scope of data correspond to the scope of the data warehouse?Is any data missing?

Granularity Is the source the lowest available grain ( most detailed level) for this data?

Page 10: 02- ETL Design Strategy

April 10, 2023

Evaluation of Sources – Origin of data• Original Point of Entry

– Best practice technique is to evaluate the original point of entry. “Is is this the very first place that the data is recorded anywhere within the business?”

– If “yes”, then you have found the original point of entry. If “no”, then source may not be the original point of entry. Ask the follow up question “ Can the element be updated in this file/table?”

– If not then this is not the original point of entry. If “yes” then the data element may be useful as a data warehousing source data

CUSTOMER-NUMBER N(9)CUSTOMER-NAME A(32)GENDER A(1)DATE-OF-BIRTH N(6)SSN N(9)

CUSTOMER MASTER FILE

Point of Origin?

CLAIM-NUMBER Unique number to identify the claim N(9)POLICY-NUMBER ID of policy against which the claim is filed NNNN-XXX-NNCUSTOMER-NUMBER ID number of the customer who filed the claim N(9)CUSTOMER-NAME Name of the policy holder A(32)DRIVER-NAME Name of the driver(if any) involved in the accident A(32)

DRIVER-ID State and driver's license number of the driver X(22)VIN Vehicle ID number of involved vehicle X(40)STATUS-CODE Status of claim, blank=open, C-closed A(1)INCIDENT-DATE Date of incident that initiated the claim YYYYMMDDFILING-DATE Date that the claim was received YYYYMMDDINCIDENT-TYPE-CODE What kind of incident resulted in a claim?

1=accident2=theft3=vandalism4=fire5=earthquake6=hurricane7=tornado8=flood9=act of god

N(1)

CLA

IM M

AS

TER

FIL

E

Page 11: 02- ETL Design Strategy

April 10, 2023

Evaluation of Sources – Origin of data

• Original Point of Entry – This practice has many benefits

– Data timeliness and accuracy are improved

– Simplifies the set of extracts from the source system

• Business System of Record

– To what system do the business people go when they are validating results?

– If business identifies a system as the “System of Record” then it must be considered as a probable warehousing data source

• Data Stewardship

– In organizations that have data stewardship program, involve the data stewards

Page 12: 02- ETL Design Strategy

April 10, 2023

Availabilit

y

Un

ders

tan

dabilit

y

Sta

bilit

y

Accu

racy

Tim

elin

ess

Com

ple

ten

ess

Gra

nu

lari

ty

Poin

f of

Ori

gin

?

Syste

m o

f R

ecord

?

APMS Policy Master File + - - + + + + policy policyAPMS Driver File X - - + - X + driver driverAPMS Premium File - X + - X - + premiumCPS Claim Master File + X + - X + +CPS Claim Detail File + X - X X + +CPS Claim Action File - + + - + - +

Evaluation of Sources – An example

FIELD ATTRIBUTE ID(KEY)

ENTITY RELATIONSHIP

POE SOR

CLAIM-NUMBER Unique number to identify the claim Y Claim Y Y Y

POLICY-NUMBER ID of policy against which the claim is filed

N Policy Y N Y

CUSTOMER-NUMBER ID number of the customer who filed the claim

N Customer Y N N

CUSTOMER-NAME Name of the policy holder N Customer N N N

DRIVER-NAME Name of the driver(if any) involved in the accident

N Claim N Y N

DRIVER-ID State and driver's license number of the driver

N Claim N Y N

VIN Vehicle ID number of involved vehicle N Claim N Y Y

STATUS-CODE Status of claim, blank=open, C-closed N Claim N Y Y

INCIDENT-DATE Date of incident that initiated the claim N Claim N Y Y

FILING-DATE Date that the claim was received N Claim N Y Y

Source Data Store Matrix

Source Data Element Matrix

Page 13: 02- ETL Design Strategy

April 10, 2023

Module 1 – Source Data Analysis & Modeling

Data Acquisition Concepts

Source Data Analysis

Source Data Modeling

Understanding the Scope of Data sources

Understanding Source Content

Logical to Physical Mapping

Data Acquisition Roles and Skills

Page 14: 02- ETL Design Strategy

April 10, 2023

Source Data Modeling – Overview of Warehouse data modeling• Business Goals and Drivers

• Information Needs

• Source composition• Source subjects

• Source Data Files

• Integrated Source Data Model (ERM)

• Source Data Structure Model

• Source Data File Descriptions

•Warehousing subjects•Business Questions•Facts and Qualifiers•Targets Configuration

• Staging, Warehouse, & mart ER models• Data mart Dimensional models

• Staging Area Structure• Warehouse structure• Relational mart structures• Dimensional mart structures

• Staging Physical Design• Warehouse Physical Design• Data Mart Physical Designs

• Implemented warehousing databases

Contextual

Models

Conceptual Models

Logical

Models

Structural

Models

Physical

Models

Functional

Databases

Triage

Page 15: 02- ETL Design Strategy

April 10, 2023

Source Data Modeling

Source Data Modeling Objectives

• A single logical model representing a design view of all source data within scope

• An entity relationship model in 3rd normal form ( a business model without implementation redundancies)

• Traceability from logical entities to the specific data sources that implement those entities

• Traceability from logical relationships to the specific data sources that implement those relationships

• Verification that each logical attribute is identifiable in implemented data sources

Source Data Modeling Challenges

• Many data sources do not have data models

• Where data models exist, they are probably out-dated and almost certainly not integrated

• Many source structures are only documented in code (e.g. COBOL definitions of VSAM files)

• Sometimes multiple and conflicting file descriptions exist for a single data structure

Page 16: 02- ETL Design Strategy

April 10, 2023

Source Data Modeling – The activities

Business drivers

Business goalsInformation needs

What kinds of

data stores

Totarget modeling

Source composition model Does

source model exist

SourceSubjectmodel

integrate

StructureOf data

Store (matrix)

Existing file desc

Existing data store

SourceLogical

Model (ERM)

Existing data model

validate

locate extract

WhichModeling

Approach?

Source data warehousing data

Each source

no yes

bottom-up

top-down

Con

textu

al

(scop

e)

Con

cep

tua

l (an

aly

ze)

Logical

(design)

Structural

(specify)

Physical

(optimize)

Functional

(Implement)

Page 17: 02- ETL Design Strategy

April 10, 2023

Module 1 – Source Data Analysis & Modeling

Data Acquisition Concepts

Source Data Analysis

Source Data Modeling

Understanding the Scope of Data sources

Understanding Source Content

Logical to Physical Mapping

Data Acquisition Roles and Skills

Page 18: 02- ETL Design Strategy

April 10, 2023

Understand the Scope of Data SourcesBusiness drivers

Business goalsInformation needs

What kinds of

data stores

Totarget modeling

Source composition model Does

source model exist

SourceSubjectmodel

Existing data model

validate

WhichModeling

Approach?

Source data warehousing data

Each source

no yes

bottom-up

top-down

Con

text

ual

(scop

e)

Con

cep

tu

al

(an

aly

ze)

•Source composition model uses set notation to develop a subject area model

•Classifies each source by the business subjects that it supports

•Helps to understand

•which subjects have a robust set of sources

•which sources address a broad range of business subjects

•Helpful to plan, size, sequence and schedule development of the DW increments

Identify & NameSubjects

AssociateSubjects

Page 19: 02- ETL Design Strategy

April 10, 2023

Composition Subject Model - Example

MIS customer table

LIS policy file

RPS policy file

APMS policy master

MIS product table

APMS premium file

CPS claim master

LIS claim file

CPS claim detail file

CPS claim

action file

CPS party file

MIS auto

Marketplace tableMIS residential

Marketplace table

CUSTOMER

REVENUE

POLICY

CLAIM

EXPENSE

INCIDENT

ORGANIZATION

MARKETPLACE

PARTY

Page 20: 02- ETL Design Strategy

April 10, 2023

Composition Subject Matrix Example

SUBJ ECTS

FILES/TABLES

CLAI M MASTER FI LE √ √ √CLAI M DETAI L FI LE √ √ √ √CLAI M ACTI ON FI LE √PARTY FI LE √POLI CY MASTER √ √PREMI UM FI LE √DRI VER FI LE

RIP

RESI DENTI AL POLI CY FI LE √ √ √LI FE CLAI MS FI LE √ √ √ √LI FE POLI CY FI LE √ √ √CUSTOMER TABLE √PRODUCT TABLE √AUTO MARKET TABLE √RESI DENTI AL MARKET TABLE √

LIS

MIS

MA

RK

ETP

LA

CE

CP

SA

PM

S

PA

RTY

OR

GA

NIZA

TIO

N

REV

EN

UE

EX

PEN

SE

CU

STO

MER

PO

LIC

Y

CLA

IM

IN

CID

EN

T

Page 21: 02- ETL Design Strategy

April 10, 2023

Module 1 – Source Data Analysis & Modeling

Data Acquisition Concepts

Source Data Analysis

Source Data Modeling

Understanding the Scope of Data sources

Understanding Source Content

Logical to Physical Mapping

Data Acquisition Roles and Skills

Page 22: 02- ETL Design Strategy

April 10, 2023

Understanding source content - Integrated view of Non-integrated sources

“Not within the charter of the warehousing program to redesign data sources”

• Understand the existing source data designs

• Merging all of the designs into one representative model

• The source logical data modeling process is not one of design but of integrating

• To develop a logical source data model, you will need to integrate design information from multiple inputs including

– Merging and integrating existing data models

– Extending the subject model that represents any source

– Extracting design structures from source data stores with reverse engineering techniques

Source Data Modeling…

Page 23: 02- ETL Design Strategy

April 10, 2023

Understanding Source content – Using existing models

•This modeling activity begins with collection of existing data models

•Models must be validated – to ensure accuracy and currency

•Existing models – “jump start” the process

•Merging models – identifying and resolving redundancy and conflict across source models

Check for Currency& Accuracy

Combine intoSingle model

Source composition model

Does source model

exist

SourceSubjectmodel

integrate

SourceLogical

Model (ERM)

Existing data model

validate

WhichModeling

Approach?

no yes

bottom-up

top-down

Con

cep

tual

(an

aly

ze)

Logical

(design)

Existing data model

Existing data model

Page 24: 02- ETL Design Strategy

April 10, 2023

Understanding Source content – Working Top DownSource composition

modelDoes

source model exist

SourceSubjectmodel

integrate

SourceLogical

Model (ERM)

Existing data model

validate

WhichModeling

Approach?

no yes

bottom-up

top-down

Con

cep

tual

(an

aly

ze)

Logical

(design)

Existing data model

Existing data model

Identify, Name & Describe Entities

Map to Data Stores

Identify, Name & Describe Relationships

Identify, Name & Describe Attributes

Page 25: 02- ETL Design Strategy

April 10, 2023

Understanding Source Content – Working Bottom-Up

• Derive the data model from the File descriptions

• The source data element matrix serves as the tool to perform source data modeling

• Source modeling and source assessment work well together and share the same set of documentation techniques.

File Field Attribute (what fact?)

ID (key?)

Entity (what

subject?)

Relationship (foreign key?)

APMS Premium POLICY-NUMBER Unique Policy ID Yes POLICYAPMS Premium NAME Policy Holder Name CUSTOMERAPMS Premium ADDRESS Policy Holder Address CUSTOMERAPMS Premium PREMIUM-AMOUNT Cost of Policy Premium POLICYAPMS Premium POLICY-TERM Coverage Duration POLICYAPMS Premium BEGIN-DATE Start date of coverage POLICYAPMS Premium END-DATE End date of coverage POLICYAPMS Premium DISCOUNT-CD Identify kind of discount Partial DISCOUNTAPMS Premium SCHEDULE Basis of discount amt DISCOUNTAPMS Policy POLICY-NUMBER Unique Policy ID Yes POLICYAPMS Policy CUSTOMER-NUMBER Unique customer ID Yes CUSTOMER POLICY->CUSTOMERAPMS Policy VIN Vehicle ID number Yes VEHICLE POLICY-> VEHICLEAPMS Policy MAKE Vehicle Manufacturer VEHICLE

Page 26: 02- ETL Design Strategy

April 10, 2023

Integrating Multiple Views – Resolving Redundancy & Conflict

Resolve RedundancyAnd Conflict

Verify Model

Model States

Normalize

Examine sets of entities and relationships:

customer places order and person places order customer and person are redundant

customer places order and customer sends order places redundant with sends

Examine sets of entities and attributes

When differently named entities have a high degree of similarity in their sets of attributes

Page 27: 02- ETL Design Strategy

April 10, 2023

Understanding Source content - Data Profiling – Looking at the Data

Look at the data to:

discover patterns

know how it is used

understand data quality

identify all data values

classify and organize3 types of profiling:

• Column profiling

• Dependency profiling

• Redundancy profiling

Page 28: 02- ETL Design Strategy

April 10, 2023

Module 1 – Source Data Analysis & Modeling

Data Acquisition Concepts

Source Data Analysis

Source Data Modeling

Understanding the Scope of Data sources

Understanding Source Content

Logical to Physical Mapping

Data Acquisition Roles and Skills

Page 29: 02- ETL Design Strategy

April 10, 2023

Logical to Physical Mapping

• Two way connection

– What attribute is implemented by this field/column?

– Which fields/columns implement this attribute?

• Documenting the mapping

– Extend the source data element matrix to include all data sources

– Provides comprehensive documentation of source data, and detailed mapping from the business view to implemented data elements

Tracing Business data to Physical Implementation

Page 30: 02- ETL Design Strategy

April 10, 2023

Module 1 – Source Data Analysis & Modeling

Data Acquisition Concepts

Source Data Analysis

Source Data Modeling

Understanding the Scope of Data sources

Understanding Source Content

Logical to Physical Mapping

Data Acquisition Roles and Skills

Page 31: 02- ETL Design Strategy

April 10, 2023

Understanding the source systems – A Team Effort• Understanding the source systems feeding the warehousing environment is a critical success factor

• All members of the warehousing team have a role in this effort

• The acquisition (ETL) team is generally responsible to– document the source layouts– perform reverse engineering as needed– determine point of entry– identify system of record– and look at actual data values

• The data modelers are likely to– create the single source logical model (using the inputs gathered by the acquisition team)

• Business Analysts/representatives are involved to– look at the data and help understand the values– help to identify point of entry– and help to determine system of record

Page 32: 02- ETL Design Strategy

Internal and Confidential April 10, 2023

Module 2 – Data Capture Analysis and Design

Page 33: 02- ETL Design Strategy

April 10, 2023

Module 2 – Data Capture Analysis & Design

Data Capture Concepts

Source/Target Mapping

Source Data Triage

Data Capture Design Considerations

Time and Data Capture

Page 34: 02- ETL Design Strategy

April 10, 2023

Data Capture Concepts – An overview• Data capture – activities involved in getting data out of sources

• Synonym for Data Extraction

Data capture Analysis – What to extract?

– Performed to understand requirements for data capture

• Which data entities and elements are required by target data stores?

• Which data sources are needed to meet target requirements?

• What data elements need to be extracted from each data source?

Data capture Design – When to extract? How to extract?

– Performed to understand and specify methods of data capture

• Timing of extracts from each data source

• Kinds of data to be extracted

• Occurrences of data (all or changes only) to be captured

• Change detection methods

• Extract technique (snapshot or audit trail)

• Data capture method (push from source or pull from warehouse

source

Extract

Transform

Load

target

Page 35: 02- ETL Design Strategy

April 10, 2023

Module 2 – Data Capture Analysis & Design

Data Capture Concepts

Source/Target Mapping

Source Data Triage

Data Capture Design Considerations

Time and Data Capture

Page 36: 02- ETL Design Strategy

April 10, 2023

Source/Target Mapping – Mapping Objectives

– Primary technique used to perform data capture analysis

– Mapping source data to target data

– Three levels of detail

• Entities

• Data Stores

• Data Elements

– The terms “source” and “target” describe roles that a data store may play, not fixed characteristics of a data store

source

Extract

Transform

Load

target

Source/Target mapping

Data

avai

labi

lity

Data Requirements

Page 37: 02- ETL Design Strategy

April 10, 2023

Source and Target as Roles

Staging

Datamart

ETL

Datawarehouse

ETL

ETLETL

ETL

Operational dataexternal data

Data sources

Data Intake

Data Distribution

Datamart

Datamart

Information Delivery

Source: Operational/external data

Target: staging data

Source: staging data

Target: data warehouse

Source: data warehouse

Target: data marts

The terms “source” and “target”

describe roles that a data store

may play, not fixed characteristics

of a data store

Page 38: 02- ETL Design Strategy

April 10, 2023

Mapping Techniques

Map source entities to target entities

Map source data storesto target data stores

Map source data elements totarget data elements

Source & target data models

design transformations

product

customer

service

Logical data models

Structural &

physical models

Page 39: 02- ETL Design Strategy

April 10, 2023

Source/Target Mapping: EXAMPLES

CU

STO

MER

CU

STO

MER

S

ALES

TR

AN

SA

CTIO

N

PR

OD

UC

T S

ALES

TR

AN

SA

CTIO

N

PR

OD

UC

T

INV

EN

TO

RY

TU

RN

OV

ER

MEMBER √

SALES TRANSACTI ON √ √ √

PRODUCT √

PRODUCT I NVENTORY √ √ √

RESTOCK TRANSACTI ON √ √enti

ties

from

sourc

e log

ical

data

mod

el

Entities from target logical models

CU

STO

MER

TA

BLE

CU

STO

MER

SA

LES

TR

AN

SA

CTIO

N

TA

BLE

PR

OD

UC

T

SA

LES

TR

AN

SA

CTIO

N

TA

BLE

PR

OD

UC

T

TA

BLE

IN

VEN

TO

RY

TU

RN

OV

ER

TA

BLE

MEMBERSHI P MASTER FI LE √MEMBERSHI P ACTI VI TYFI LE √ √POI NT-OF-SALEDETAI L FI LE √PRODUCT I NVENTORYFI LE √ √ √WAREHOUSE TO STOREACTI VI TY TABLE √ √ F

iles/

table

fro

m

sourc

e s

truct

ura

lmodel

Entities from target structural model

cust

om

er

nu

mber

cust

om

er

firs

t n

am

e

cust

om

er

last

nam

e

cust

om

er

biz

nam

e

ren

ew

al date

cust

om

er

nu

mber

cust

om

er

addre

ss

cust

om

er

city

cust

om

er

state

cust

om

er

zip

code

tran

sact

ion

ID

lin

e n

um

ber

term

inal ID

SK

U I

D

tran

sact

ion

qu

an

tity

member-number √ √

membership-type √ √ √

date-joineddate-last-renewed √

term-last-renewed √

date-of-last-activityname √ √ √

address √

city-and-state √ √

zip-code √

transcation-ID √

line-number √

ME

MB

ER

SH

IP M

AS

TE

RPO

INT O

FS

ALE

DE

TA

IL

CUSTOMERCUSTOMER ADDRESS

SALESTRANSACTION

ENTITY MAPPING

DATA STORE MAPPING

DATA ELEMENT MAPPING

Page 40: 02- ETL Design Strategy

April 10, 2023

Source/target Mapping: Full set of Data elements

Source/target mapping

triage

transform design

Logica

l

model

Data element

matrix

File/ta

ble

descrip

tions

Business

questio

ns

Physic

al

desig

n

Source/target map

Logica

l

model

Elements added by business

Elements added by triage

Ele

men

ts a

dd

ed

by

tran

sfo

rm log

ic

Page 41: 02- ETL Design Strategy

April 10, 2023

Module 2 – Data Capture Analysis & Design

Data Capture Concepts

Source/Target Mapping

Source Data Triage

Data Capture Design Considerations

Time and Data Capture

Page 42: 02- ETL Design Strategy

April 10, 2023

Source Data Triage

What to extract - opportunities• Source/target mapping analyzes need, triage analyzes opportunity

• Triage is about extracting all data with potential value

What is Triage?• Source data structures are analyzed to determine the appropriate data elements for inclusion

Why Triage?• Ensure that a complete set of attributes is captured in the warehousing environment.

• Rework is minimized

Triage and Acquisition• Performing triage is a joint effort between the acquisition team and the warehousing data modelers

Page 43: 02- ETL Design Strategy

April 10, 2023

The Triage Technique

Select needed files

Identify elements addressing known business questions

Eliminate Operational and redundant elements

Take all other business elements

First draft of element mapping for staging area or atomic DW

Source systems needed for increment

Page 44: 02- ETL Design Strategy

April 10, 2023

Module 2 – Data Capture Analysis & Design

Data Capture Concepts

Source/Target Mapping

Source Data Triage

Data Capture Design Considerations

Time and Data Capture

Page 45: 02- ETL Design Strategy

April 10, 2023

Customer

Member NumberMembership TypeLast NameFirst NameDate J oinedLast Activity DateLast Update DateLast Upate TimeLast Update User ID

Sales Transaction

Transaction DateTransaction TimePayment MethodTransaction StatusRegister IDStore NumberChecker Employee IDMember Number

Employee

Employee IdEmployee NameEmployee AddressStore Number

Store

Store NumberStore NameStore AddressPhone NumberManager Employee ID

Kinds of Data

Source system metadata

Reference Data

Event Data

Source system keys

Page 46: 02- ETL Design Strategy

April 10, 2023

PULL FROMSOURCE

Extract sourceFiles/tables

Extract source Changes ortransactions

PUSH TOWAREHOUSE

Replicate sourceFiles/tables

ReplicateSource changesOr transactions

Data Capture Methods

ALL DATA

CHANGEDDATA

Page 47: 02- ETL Design Strategy

April 10, 2023

Detecting Data Changes

• Detecting changes at source

• source date/time stamps

• source transaction files/logs

• replicate source data changes

• DBMS logs

• compare back-up files

• Detecting changes after extract

• compare warehouse extract generations

• compare warehouse extract to source system

PULL FROMSOURCE

Extract sourceFiles/tables

Extract source Changes ortransactions

PUSH TOWAREHOUSE

Replicate sourceFiles/tables

ReplicateSource changesOr transactions

ALL DATA

CHANGEDDATA

how to know

which data has

changed???

Page 48: 02- ETL Design Strategy

April 10, 2023

Module 2 – Data Capture Analysis & Design

Data Capture Concepts

Source/Target Mapping

Source Data Triage

Data Capture Design Considerations

Time and Data Capture

Page 49: 02- ETL Design Strategy

April 10, 2023

Timing issues

OLTP

DataExtraction

Work Tables

DataTransformation

WarehouseLoading

Intake layer

DataMart

Sources

Frequency of

Acquisition

Latency of

Load

Periodicity

Of

Data Marts

Page 50: 02- ETL Design Strategy

April 10, 2023

Source System Considerations

OLTP

DataExtraction

Work Tables

Sources

When is the data ready in each source system ?

How will i know when it’s ready?

How will I respond to source system failures?

How will i recover from a failure?

How long will it remain in the steady state?

How will I know when source systems fail?

Page 51: 02- ETL Design Strategy

April 10, 2023

Handling time variance – techniques and methods

• SNAPSHOT– Periodically posts records as of a specific point in time

– Records all data of interest without regard to changes

– Acquisition techniques to create snapshots

• DBMS replication

• Full File Unload or Copy

• AUDIT TRAIL– Records details of each change to data of interest

– Details may include date and time of change, how the change was detected, reason for change, before and after data values, etc.

– Acquisition techniques

• DBMS triggers

• DBMS replication

• Incremental selection

• Full file unload/copy

Important distinction between Snapshot and Audit trail : Audit trail techniques only Changed data is extracted and loaded,Snapshot all data is extracted and loaded, whether changed or not

Page 52: 02- ETL Design Strategy

Internal and Confidential April 10, 2023

Module 3 – Data Transformation Analysis & Design

Page 53: 02- ETL Design Strategy

April 10, 2023

Module 3 – Data Transformation Analysis & Design

Data Transformation Concepts

Transformation Analysis

Transformation Design

Transformation Rules and Logic

Transformation Sequences and Processes

Page 54: 02- ETL Design Strategy

April 10, 2023

Transformation concepts – An overview• Data Transformation

• Changes that occur to the data after it is extracted• Transformation processing removes

– Complexities of operational environments– Conflicts and redundancies of multiple databases– Details of daily operations– Obscurity of highly encoded data

• Transformation Analysis• Integrate disparate data• Change granularity of data• Assure data quality

• Transformation Design• Specifies the processing needed to meet the requirements that are determined by

transformation analysis• Determining kinds of transformations

– Selection– Filtering– Conversion– Translation– Derivation– Summarization– Organized into programs, scripts, modules, jobs, etc. that are compatible with chosen tools

and technology

Page 55: 02- ETL Design Strategy

April 10, 2023

Module 3 – Data Transformation Analysis & Design

Data Transformation Concepts

Transformation Analysis

Transformation Design

Transformation Rules and Logic

Transformation Sequences and Processes

Page 56: 02- ETL Design Strategy

April 10, 2023

Data Integration Requirements• Integration

– Create a single view of the data

• Integration & Staging Data– organize data by business subjects– ensure integrated identity through use of common, shared business keys

• Integration & Warehouse data– implement data standards, including derivation of conformed facts and structuring

of conformed dimensions– ensure integration of internal identifiers – where staging integrates real world keys,

the warehouse needs to do the same for surrogate keys

• Integration & Data marts– Intended to satisfy business specific /department specific requirements

Page 57: 02- ETL Design Strategy

April 10, 2023

Data Granularity Requirements• Granularity

– Each change of data grain, from atomic data to progressively higher levels of summary – achieved through transformation

• Granularity & Staging Data

– Staging data kept at atomic level

• Granularity & warehouse data

– In a 3 tier environment, warehouse should contain all common and standard summaries

• Granularity & Data marts

– Derivation of summaries specific to individual needs

Page 58: 02- ETL Design Strategy

April 10, 2023

Data Quality Requirements

• Data Cleansing

– process by which data quality needs are met

– range from filtering bad date to replacing data values with some alternative default or derived values

• Cleansing & Staging Data

– the earlier the data is cleansed, the better the result

– sometimes important for staging data to reflect what was contained in the source systems

– Delay data cleansing transformation until data is moved from staging to warehouse

– Keep both cleansed and un-cleansed data in staging area

• Cleansing & Warehouse data

– data not cleansed in staging is cleansed before loaded into the warehouse

• Cleansing & Data marts

– cleansing at data marts is not necessarily desirable, however as a practical matter may be necessary

Page 59: 02- ETL Design Strategy

April 10, 2023

Module 3 – Data Transformation Analysis & Design

Data Transformation Concepts

Transformation Analysis

Transformation Design

Transformation Rules and Logic

Transformation Sequences and Processes

Page 60: 02- ETL Design Strategy

April 10, 2023

Transformation Design - Approach

Identify transformation rules& logic

Determine transformationsequences

Specify transformation process

transformation specifications

transformation requirements

Page 61: 02- ETL Design Strategy

April 10, 2023

Module 3 – Data Transformation Analysis & Design

Data Transformation Concepts

Transformation Analysis

Transformation Design

Transformation Rules and Logic

Transformation Sequences and Processes

Page 62: 02- ETL Design Strategy

April 10, 2023

Kinds of transformations

This Transformation type… is Used to…

Selection Choose one source to be used among multiple possibilities

Filtering Choose a subset of rows from a source data table, or a subset of records from a source data file

Conversion and Translation Change the format of data elements

Derivation Create new data values, which can be inferred from the values of existing data elements

Summarization Create new data values, which can inferred from the values of existing data elements

Page 63: 02- ETL Design Strategy

April 10, 2023

Choose among alternative sources

based upon selection rules

Selection

ExtractedSource # 1

ExtractedSource # 2

TransformedTarget data

Select

sometimes from source 1

sometimes from source 2

‘If membership type is individual use member name from the membership master file, otherwise use member name from the business contact table’

Page 64: 02- ETL Design Strategy

April 10, 2023

eliminate somedata from the target set of data based on

filtering rules

Filtering

ExtractedSource data

TransformedTarget data

Filter

Some rows or values discarded

‘If the last 2 digits of policy number are 04,27,46, or 89 extract data for the data mart, otherwise exclude the policy and all associated data’

Page 65: 02- ETL Design Strategy

April 10, 2023

Change data content and/or format based on conversion rules

Conversion

ExtractedSource data

TransformedTarget data

Convert

Value/format in is different than value/format out

‘For policy history prior to 1994, reformat from Julian date to YYYYMMDD format. Default century to 19’

Page 66: 02- ETL Design Strategy

April 10, 2023

decode data whose values are encoded based on rules for

translation

Translation

ExtractedSource data

TransformedTarget data

Translate

both encoded and decoded value out

‘if membership-type-code is ‘C’ translate to ‘Business’; If membership-type-code is ‘P’, blank, or null translate to ‘Individual’; otherwise translate to ‘Unknown’ ’

encode values in

Page 67: 02- ETL Design Strategy

April 10, 2023

use existing data values to create new

data based on derivation rules

Derivation

ExtractedSource data

TransformedTarget data

Derive

new data values created…

More values out than in

‘Total Premium Cost = base-premium-amount + (sum of all additional coverage amounts)-(sum of all discount amounts) ’

Page 68: 02- ETL Design Strategy

April 10, 2023

Change data granularity based on

rules of summarization

Summarization

ExtractedSource data

TransformedTarget data

Summarize

Summary data out

‘for each store (for each product line (for each day (count the number of transactions, accumulate the total dollar value of the transactions))) ’

‘for each week (sum daily transaction count, sum daily dollar total)

atomic or base data in

Page 69: 02- ETL Design Strategy

April 10, 2023

Identifying Transformation Rules

cust

om

er

nu

mber

cust

om

er

firs

t n

am

e

cust

om

er

last

nam

e

cust

om

er

biz

nam

e

ren

ew

al date

cust

om

er

nu

mber

cust

om

er

addre

ss

cust

om

er

city

cust

om

er

state

cust

om

er

zip c

ode

tran

sact

ion

ID

line n

um

ber

term

inal ID

SK

U I

D

tran

sact

ion

qu

an

tity

member-number √ √

membership-type √ √ √

date-joineddate-last-renewed √

term-last-renewed √

date-of-last-activityname √ √ √

address √

city-and-state √ √

zip-code √

transcation-ID √

line-number √

ME

MB

ER

SH

IP M

AS

TE

RPO

INT O

FS

ALE

DE

TA

IL

CUSTOMERCUSTOMER ADDRESS

SALESTRANSACTION

for any source-to-target data element association, what needs exist for:

• selection?• filtering?• conversion?• translation?• derivation?• summarization?

Page 70: 02- ETL Design Strategy

April 10, 2023

Specifying Transformation Rules

cust

om

er

nu

mber

cust

om

er

firs

t n

am

e

cust

om

er

last

nam

e

cust

om

er

biz

nam

e

ren

ew

al date

cust

om

er

nu

mber

cust

om

er

addre

ss

cust

om

er

city

cust

om

er

state

cust

om

er

zip c

ode

tran

sact

ion

ID

line n

um

ber

term

inal ID

SK

U I

D

tran

sact

ion

qu

an

tity

member-number √ √

membership-type √ √ √

date-joineddate-last-renewed √

term-last-renewed √

date-of-last-activityname √ √ √

address √

city-and-state √ √

zip-code √

transcation-ID √

line-number √

ME

MB

ER

SH

IP M

AS

TE

RPO

INT O

FS

ALE

DE

TA

ILCUSTOMER

CUSTOMER ADDRESS

SALESTRANSACTION

cleansing DTR027 (default value)

Derivation DTR008 (Derive name)

DTR027(Default Membership Type) If membership-type is null or invalid assume “family” membership

DTR008(Derive Name) If membership-type is “family” separate name using comma insert characters prior to comma in customer-last-name insert characters after comma in customer-first-name else move name to customer-biz-name

cells expand to identify

transformations by type &

name

logic

of

transfo

rmatio

ns

is se

parate

ly

docum

ented

Page 71: 02- ETL Design Strategy

April 10, 2023

Module 3 – Data Transformation Analysis & Design

Data Transformation Concepts

Transformation Analysis

Transformation Design

Transformation Rules and Logic

Transformation Sequences and Processes

Page 72: 02- ETL Design Strategy

April 10, 2023

Dependencies and Sequences• Time Dependency – when one transformation rule must execute before another

• example: summarization of derived data cannot occur before the derivation

• Rule Dependency – when execution of a transformation rule is based upon the result of another rule

• example: different translations occur depending on source chosen by a selection rule

• Grain Dependency – when developing one level of summary if based on results of a previous summarization

• example: quarters can’t be summarized annually before months are summarized on a quarterly basis

Page 73: 02- ETL Design Strategy

April 10, 2023

Dependencies and Sequences

1. Identify the transformation rules

2. Understand rule dependency – package as modules

3. Understand time dependency – package as processes

4. Validate and define the test plan

Specify selection

Specify filtering

Specify conversion & translation

Specify derivation

Specify summarization

1

2

3

4

Page 74: 02- ETL Design Strategy

April 10, 2023

Modules and ProgramsDTR027(Default Membership Type) If membership-type is null or invalid assume “family” membershipDTR008(Derive Name) If membership-type is “family” separate name using comma insert characters prior to comma in customer-last-name

insert characters after comma in customer-first-name

else move name to customer-biz-name

Dependencies among

rules

Transformation Rules

Structures of Modules, Programs, Scripts, etc.

Page 75: 02- ETL Design Strategy

April 10, 2023

Job Streams & Manual Procedures- completing the ETL design

Extract & Load

Dependencies

scheduling

execution

verification

communication

restart

Transformation Rules and their implementation

Automated & Manual Procedures

Page 76: 02- ETL Design Strategy

Internal and Confidential April 10, 2023

Module 4 – Data Transportation & Loading Design

Page 77: 02- ETL Design Strategy

April 10, 2023

Module 4 – Data Transport & Load Design

Data Transport and Load Concepts

Data Transport Design

Database Load Design

Page 78: 02- ETL Design Strategy

April 10, 2023

Overview

Extract

Transform

Load

Source Data

Target Data

wh

ere

do p

latf

orm

chang

es

occ

ur?

data

tra

nsp

ort

database load

Page 79: 02- ETL Design Strategy

April 10, 2023

Module 4 – Data Transport & Load Design

Data Transport and Load Concepts

Data Transport Design

Database Load Design

Page 80: 02- ETL Design Strategy

April 10, 2023

Data Transport Issues

Extract

Transform

Load

Source Data

Target Data

data

tra

nsp

ort

which platforms?data volumes?

transport frequency?network capacity?ASCII vs EBCDIC

data security?transport methods?

Page 81: 02- ETL Design Strategy

April 10, 2023

Data Transport Techniques

Extract

Transform

Load

Source Data

Target Data

data

tra

nsp

ort

Open FTPSecure FTP

Alternatives to FTPData compressionData Encryption

ETL tools

Page 82: 02- ETL Design Strategy

April 10, 2023

Module 4 – Data Transport & Load Design

Data Transport and Load Concepts

Data Transport Design

Database Load Design

Page 83: 02- ETL Design Strategy

April 10, 2023

Database Load Issues

Extract

Transform

Load

Source Data

Target Data

database load

which DBMS?relational vs dimensional?

tables & indices?load frequency?

load timing?data volumes?

exception handling?restart & recovery?

load methods?referential integrity?

Page 84: 02- ETL Design Strategy

April 10, 2023

Populating Tables• Drop and rebuild the tables

• Insert (only) rows into a table

• Delete old rows and insert changed rows

Page 85: 02- ETL Design Strategy

April 10, 2023

Indexing

Load

Tables

Indices

update at load?drop & rebuild?index segmentation?

Page 86: 02- ETL Design Strategy

April 10, 2023

Updating

• Isn’t the warehouse read only?

• Updating Business Data

• Updating Row level Metadata

Load

Indices

Tables

allow updating of

rows in tables?

Page 87: 02- ETL Design Strategy

April 10, 2023

Referential Integrity• RI is the condition where every reference to another table has a foreign key/primary key

match.

• Three common options for RI

• DBMS checking

• Test load files before load using a tool/custom application

• Test data base(s) after load using a tool/custom application

Page 88: 02- ETL Design Strategy

April 10, 2023

Timing Considerations• User Expectations

• Data Readiness

• Database synchronization

Page 89: 02- ETL Design Strategy

April 10, 2023

Exception Processing

Transform

Load

Targetdata

Log

Reports

Suspendexceptions

ok

Discard

Page 90: 02- ETL Design Strategy

April 10, 2023

Integrating with ETL processes

• Loading as a part of single transform & load job stream

• Loads triggered by completion of transform job stream

• Loads triggered by verification of transforms

• Parallel ETL processing

• Loading Partitions

• Updating summary tables

communication

dependencies

scheduling

execution

verificationrestart

scheduling

execution

verificationrestart

scheduling

execution

verificationrestart

EX

TR

AC

TTR

AN

SFO

RM

LO

AD

restart/recovery

dependencies

processmetadata

toolcapabilities

schedulingdependencies

parallel processing

Page 91: 02- ETL Design Strategy

Internal and Confidential April 10, 2023

Module 5 – Implementation Guidelines

Page 92: 02- ETL Design Strategy

April 10, 2023

Module 5 – Implementation Guidelines

Data Acquisition Technology

ETL Summary

Page 93: 02- ETL Design Strategy

April 10, 2023

Technology in Data Acquisition

Data Mapping

Data Conversion

Data Cleansing

Data Movement

Data

base

Loadin

g

Data

Acc

ess

Data Transformation

ETL Technology

Storage Management

Sourc

e S

yst

em

s

Data

base

Manag

em

ent

Metadata Management

Page 94: 02- ETL Design Strategy

April 10, 2023

ETL - Critical Success Factors

Inta

ke

Dis

trib

uti

on

Info

rmati

on

Deliv

ery

Inte

gra

tion

Gra

nula

rity

Cle

ansi

ng

1. Design for the Future, Not for the Present v v v v v v2. Capture and store only changed data v3. Fully understand source systems and data v v v v v4. Allow enough time to do the job right v v v v v v5. Use the right sources, not the easy ones v v v6. Pay attention to data quality v v v v v v7. Capture comprehensive ETL metadata v v v v v v8. Test thoroughly and according to a test plan v v v v v v9. Distinguish between one-time and ongoing loads v v v10. Use the right technology for the right reasons v v v v v v11. Triage source attributes v v v12. Capture atomic level detail v v13. Strive for subject orientation and integration v v14. Capture history of changes in audit trail form v15. Modularize ETL processing v v v v v v16. Ensure that business data is non-volatile v v v17. Use bulk loads and/or insert-only processing v18. Complete subject orientation and integration v v19. Use the right data structures (relational vs. dimensional) v v v v v20. Use shared transformation rules and logic v v v v v21. Design for distribution first, then for access v22. Fully understand each unique access need v v v23. Use DBMS update capabilities v v v24. Design for access before other purposes v v25. Design for access tool capabilities v v26. Capture quality metadata and report data quality v v v

Data Store Roles

Data Transformation

Roles

Page 95: 02- ETL Design Strategy

April 10, 2023

Exercises

Exercise 1: Source Data Options

Exercise 2: Source Data Modeling

Exercise 3: Data Capture

Exercise 4: Data Transformation

Exercise 5: Data Acquisition Decision