informatica tips and tricks every etl developer should know

63
Tips & Tricks Every ETL Developer Should Know Sean Desmond, Informatica Vijay Viswanathan, Cognicase

Upload: shyamap

Post on 28-Oct-2015

439 views

Category:

Documents


12 download

DESCRIPTION

NA

TRANSCRIPT

Page 1: Informatica Tips and Tricks Every ETL Developer Should Know

Tips & Tricks Every ETL Developer Should Know

Sean Desmond, InformaticaVijay Viswanathan, Cognicase

Page 2: Informatica Tips and Tricks Every ETL Developer Should Know

Agenda

� Objectives

� Who Are The Presenters

� Top Ten ETL Tips & Tricks - Overview

� The Ministry Project – Overview

� Top Ten ETL Tips & Tricks – “The Meat and Potatoes”

� Summary

� Questions & Answers

Page 3: Informatica Tips and Tricks Every ETL Developer Should Know

Introduction

Page 4: Informatica Tips and Tricks Every ETL Developer Should Know

Objectives

� By the end of this session you should…� Understand how a warehouse design is tightly

integrated into the PowerCenter Architecture� Receive several mapping ‘tips & tricks’ garnered from a

successful implementation� Weigh the pros and cons of applying these ‘tips &

tricks’ to one of your own project solutions

Page 5: Informatica Tips and Tricks Every ETL Developer Should Know

The Presenters

� Sean Desmond� Regional Manager, Informatica Professional Services

New England/E. Canada� 6 years in Data Warehousing, Metadata Management,

Project Delivery

� Vijay Viswanathan� BIDW Consultant, Cognicase (Toronto)� Specializes in data warehousing / ETL design� Over 5 years in Data Warehousing

Page 6: Informatica Tips and Tricks Every ETL Developer Should Know

Top Ten ETL Tips & Tricks

Page 7: Informatica Tips and Tricks Every ETL Developer Should Know

Top Ten ETL Tips & Tricks

10. Use Velocity

9. Dedicate Time to Infrastructure and Standards Prior to Development (Baseline Architecture Deployment)

8. Reduce Reliance on Stored Procedures

7. Audit Your Loads

6. Track Data Errors

Page 8: Informatica Tips and Tricks Every ETL Developer Should Know

Top Ten ETL “Tips & Tricks”

5. Bless the ROUTER!

4. Be careful of “Lookup Gotchas”!

3. Determine the Record Type in Staging

2. Use Parameter Files

1. Create a Common Library of Sources, Targets and Transformations

Page 9: Informatica Tips and Tricks Every ETL Developer Should Know

The Ministry Project

Page 10: Informatica Tips and Tricks Every ETL Developer Should Know

Ministry Project - Overview

� Combined data about students, course marks, schools, teachers, funding, socio-economic demographics, standardized testing results

� Tools Used:� ErWin v4.0� Informatica PowerCenter v5.1� Cognos PowerPlay Web v7.0, Cognos IWR v7.0� DB2 v7.2

� 3 target areas� Stage (based on source file layouts)� Data Warehouse (mainly normalized)� Data Marts (dimensional)

Page 11: Informatica Tips and Tricks Every ETL Developer Should Know

Project Architecture

Datamart

Staging Area

ALL INFORMATION DEPERSONALIZED

NO USER ACCESS

Transformation• Computation of

referenceelements

• Data aggregationwhere necessary

ESDW Architecture

Operational Databases

• Removal of personal identifying information

• Generation of unique record identifier

Extract, Cleanse&

LoadCentral Store

Metadata

Index

Data

Extract,Transform

& Load

UserData Mart

(Access via BITools etc)

Business Intelligence&

Reporting Tools

Load

Extract,Transform &

Load

DataAtomisation

Central Store

Source information&

Metadata (Updates)

Extract Cleansing

Depersonalising

Page 12: Informatica Tips and Tricks Every ETL Developer Should Know

Top Ten ETL Tips & Tricks“The Meat & Potatoes”

Page 13: Informatica Tips and Tricks Every ETL Developer Should Know

Tip #10 - Dedicate Time to Infrastructure and Standards

Prior to Development

Page 14: Informatica Tips and Tricks Every ETL Developer Should Know

Dedicate Time to Infrastructure and Standards Prior to Development

� Should take place at least 2-3 weeks BEFORE beginning any development (BLA)� Development Standards� Folder Architecture� Security Measures� Naming Convention Standards� Metadata Documentation Standards� Lifecycle Strategy� Shared Objects Strategy

Page 15: Informatica Tips and Tricks Every ETL Developer Should Know

Tip #9 – Use Velocity

Page 16: Informatica Tips and Tricks Every ETL Developer Should Know

Use Velocity

� A methodology for the development of analytic solutions based on Informatica platform products, InformaticaPowerCenter® and Informatica PowerMart®.

� Applications & PowerAnalyzer being incorporated

� Informatica Velocity covers each of the major phases of analytic solution development efforts, including Manage, Architect, Design, Build, Deploy, and Operate.

� ‘Project Roadmap’

� Best Practices – Mapping Design, P&T, Migration

� Sample Deliverables – Mapping Inventory, Mapping Spec, System Test Plan

Page 17: Informatica Tips and Tricks Every ETL Developer Should Know

Tip #8 – Reduce Reliance on Stored Procedures

Page 18: Informatica Tips and Tricks Every ETL Developer Should Know

Reduce Reliance on Stored Procedures

� Stored Procedures are a big performance hit!

� Try to avoid external calls to stored procedures unless it is a necessity

� For surrogate key generation use native Informatica sequence generator or the IDENTITY datatype.

� Personally saw performance of a mapping increase from 5 rows/sec to over 500 rows/sec once stored proc was replaced!

Page 19: Informatica Tips and Tricks Every ETL Developer Should Know

Tip #7 – Audit Your Loads

Page 20: Informatica Tips and Tricks Every ETL Developer Should Know

Audit Your Loads

� A key area that is quite often ignored

� You must match to the source systems or be able to explain the differences

� Auditing data loads (when did we start a load and what is the status?)

� Audit information is provided to increase the end-user’s confidence in the quality of data contained in the Data Warehouse

� Without proof, you will lose all credibility!

Page 21: Informatica Tips and Tricks Every ETL Developer Should Know

Audit Data Model

ETL_AUDITetl_load_key: INTEGER NOT NULL

academic_yr: CHAR(9)prev_etl_load_key: INTEGERmost_rcnt_fy_ind: CHAR NOT NULLsystem_cd: VARCHAR(5) NOT NULL (FK)load_status_flg: VARCHAR(12)load_type_flg: CHARstage_archvd_date: DATEwh_archvd_date: DATEstage_start_ts: TIMESTAMPwarehouse_start_ts: TIMESTAMPnum_rows_read: INTEGERfct_cleanup_ind: CHARacad_yr_transt_ind: CHAR

ETL_Source_Systemsystem_cd: VARCHAR(5) NOT NULL

system_name: VARCHAR(20)system_desc: VARCHAR(255)sys_req_file_cnt: INTEGER

ETL_AUDIT_TABLE_LOADSetl_load_key: INTEGER NOT NULL (FK)source_name: VARCHAR(80) NOT NULL

num_rows_read: INTEGERnum_records_reqd: INTEGERload_status_flg: VARCHAR(12)extract_num: INTEGERextract_ts: TIMESTAMPstop_source_row_id: INTEGERload_session_name: VARCHAR(80)load_start_ts: TIMESTAMPload_stop_ts: TIMESTAMP

Page 22: Informatica Tips and Tricks Every ETL Developer Should Know

Audit Your Loads - Methodology

At the beginning of this stage the status code is set to ‘Stg-Loading’

At completion, it is set to ‘Stg-Complete’ only if all the source extract files have been processed.

If any one of the source files is not processed, the status code is set to ‘Stg-Fail’.

Staging LoadLoad Status CodesStep

Page 23: Informatica Tips and Tricks Every ETL Developer Should Know

Audit Your Loads - Methodology

At the beginning of this stage, the status is set to ‘WH-Loading’

At the end, if all the tables are properly loaded, it is set to ‘WH-Complete’

If any of the expected tables did not load completely, the load status is set to ‘WH-Fail’

Warehouse LoadLoad Status CodesStep

Page 24: Informatica Tips and Tricks Every ETL Developer Should Know

Audit Your Loads - Methodology

At the beginning of a Data Mart load, the Load_Status_Flag is set to ‘DM-Loading’

When the Data Mart load is completed, the status flag is set as ‘DM-Loaded’

If the Data Mart load does not succeed, the status will be set to ‘DM-Fail’

The load status of ‘DM-Completed’ is used only after the data is reviewed and the business metrics match what is expected from the source system

Data Mart LoadLoad Status CodesStep

Page 25: Informatica Tips and Tricks Every ETL Developer Should Know

Audit Your Loads – Informatica Mapping

Page 26: Informatica Tips and Tricks Every ETL Developer Should Know

Audit Your Loads – Informatica Mapping

Page 27: Informatica Tips and Tricks Every ETL Developer Should Know

Tip #6 – Track Data Errors

Page 28: Informatica Tips and Tricks Every ETL Developer Should Know

Track Data Errors

� Use Informatica to conduct Data Validity, IsNull, IsDate, IsNumber and other Pre Defined error checks

� Errors are logged by calling the INSERT ERROR RECS stored procedure from Informatica

� Invalid values can be either skipped or passed through with default values

� Hold or point to the original source record and be able to recreate it

� Best Practices exist, but design is key

Page 29: Informatica Tips and Tricks Every ETL Developer Should Know

Error Correction Model

Target

Reload

ErrorExists

Source StageLoadProcess

Page 30: Informatica Tips and Tricks Every ETL Developer Should Know

Error Correction Data Model

Error_typeerror_type_cd: VARCHAR(2) NOT NULL

error_type_desc: VARCHAR(255)last_update_ts: TIMESTAMP NOT NULLrecord_expiry_ts: TIMESTAMP

Severity_Levelseverity_cd: VARCHAR(3) NOT NULL

severity_desc: VARCHAR(255)last_update_ts: TIMESTAMP NOT NULLrecord_expiry_ts: TIMESTAMP ETL_ERROR

etl_load_key: INTEGER NOT NULL (FK)sys_load_col_name: VARCHAR(30) NOT NULLsource_name: VARCHAR(80) NOT NULL (FK)error_type_cd: VARCHAR(2) NOT NULL (FK)source_row_id: INTEGER

severity_cd: VARCHAR(3) NOT NULL (FK)

Page 31: Informatica Tips and Tricks Every ETL Developer Should Know

General Rules for Non-Lookup Errors

Missing or NullMN

Data LengthDL

Is DeleteDD

Is Valid DateID

Datatype MismatchDM

Inconsistent RecordIR

Error Type DescriptionError Type Code

Page 32: Informatica Tips and Tricks Every ETL Developer Should Know

Error Severity

Non Fatal NF

Fatal ErrorFTL

Severity DescriptionSeverity Code

Page 33: Informatica Tips and Tricks Every ETL Developer Should Know

Error Correction Checks

Page 34: Informatica Tips and Tricks Every ETL Developer Should Know

Pulling Audit & Error Correction Together

ETL_AUDITetl_load_key: INTEGER NOT NULL

academic_yr: CHAR(9)prev_etl_load_key: INTEGERmost_rcnt_fy_ind: CHAR NOT NULLsystem_cd: VARCHAR(5) NOT NULL (FK)load_status_flg: VARCHAR(12)load_type_flg: CHARstage_archvd_date: DATEwh_archvd_date: DATEstage_start_ts: TIMESTAMPwarehouse_start_ts: TIMESTAMPnum_rows_read: INTEGERfct_cleanup_ind: CHARacad_yr_transt_ind: CHAR

ETL_Source_Systemsystem_cd: VARCHAR(5) NOT NULL

system_name: VARCHAR(20)system_desc: VARCHAR(255)sys_req_file_cnt: INTEGER

ETL_AUDIT_TABLE_LOADSetl_load_key: INTEGER NOT NULL (FK)source_name: VARCHAR(80) NOT NULL

num_rows_read: INTEGERnum_records_reqd: INTEGERload_status_flg: VARCHAR(12)extract_num: INTEGERextract_ts: TIMESTAMPstop_source_row_id: INTEGERload_session_name: VARCHAR(80)load_start_ts: TIMESTAMPload_stop_ts: TIMESTAMP

Error_typeerror_type_cd: VARCHAR(2) NOT NULL

error_type_desc: VARCHAR(255)last_update_ts: TIMESTAMP NOT NULLrecord_expiry_ts: TIMESTAMP

Severity_Levelseverity_cd: VARCHAR(3) NOT NULL

severity_desc: VARCHAR(255)last_update_ts: TIMESTAMP NOT NULLrecord_expiry_ts: TIMESTAMP ETL_ERROR

etl_load_key: INTEGER NOT NULL (FK)sys_load_col_name: VARCHAR(30) NOT NULLsource_name: VARCHAR(80) NOT NULL (FK)error_type_cd: VARCHAR(2) NOT NULL (FK)source_row_id: INTEGER

severity_cd: VARCHAR(3) NOT NULL (FK)

Page 35: Informatica Tips and Tricks Every ETL Developer Should Know

Tip #5 – Bless The Router!

Page 36: Informatica Tips and Tricks Every ETL Developer Should Know

Bless the Router!

� New feature introduced in Powercenter 5

� Similar to a Filter since both allow the developer to use a condition to test data

� Big Difference – Router allows you to test multiple conditions!

� Use the Router instead of multiple Filter transformations

� Big Advantage – Only reads the data once!

� Considerable Performance Gains

� Crucial in dealing with both Type 1 Dimensions and Type 2 Dimensions

Page 37: Informatica Tips and Tricks Every ETL Developer Should Know

Router At Work

Page 38: Informatica Tips and Tricks Every ETL Developer Should Know

Groups in a Router

Page 39: Informatica Tips and Tricks Every ETL Developer Should Know

Tip #4 – Lookup Gotchas

Page 40: Informatica Tips and Tricks Every ETL Developer Should Know

Lookup Gotchas

� Use Dynamic Lookup if conducting a lookup on the target and want the lookup to be synchronized with target. MUST BE CONNECTED!

� Cannot explicitly set the ORDER BY clause in the SQL Overide. Can trick Informatica by ordering ports in desired ORDER BY sequence

� Make sure Datatypes and Precision of ports being compared are the same otherwise you might get undesired results

� If Lookups are large adjust the lookup data cache and lookup index cache size in the session properties to improve performance

� Can also take advantage of the persistent cache feature. This feature is valuable if you know the lookup table does not changebetween sessions runs

Page 41: Informatica Tips and Tricks Every ETL Developer Should Know

Tip #3 – Determination of Record Type in Staging Layer

Page 42: Informatica Tips and Tricks Every ETL Developer Should Know

Determination of Record Type in Staging Layer

� First a Quick Refresher on Dimensions:� Type 1 – No history� Type 2 – All history� Type 3 – Some history

Page 43: Informatica Tips and Tricks Every ETL Developer Should Know

More Dimension Types…Combinations

� Type 3 Prime – Types 1 and 2 (the most common)

� Type 4 – Types 1 and 3

� Type 5 – Types 2 & 3

� Type 6 – Types 1, 2, and 3 (the second most common)

Page 44: Informatica Tips and Tricks Every ETL Developer Should Know

Type 1 – No History

Source Transaction #1

Ms.SalutationBedrockCity23 Boulder RdAddressSandy RubbleName1Id 100Key

Warehouse Transaction #1

01-Jan-2002DateMs.SalutationBedrockCity23 Boulder RdAddressSandy RubbleName1Id

Page 45: Informatica Tips and Tricks Every ETL Developer Should Know

Type 1 – No HistorySource Transaction #1

Ms.SalutationBedrockCity23 Boulder RdAddressSandy RubbleName1Id

100KeyWarehouse Transaction #1

01-Jan-2002DateMrs.SalutationGravelPitCity42 Slate AveAddressSandy RubbleName1Id

Source Transaction #2

Mrs.SalutationGravelPitCity42 Slate AveAddressSandy RubbleName1Id

Page 46: Informatica Tips and Tricks Every ETL Developer Should Know

Type 2 – All HistorySource Transaction #1

Ms.SalutationBedrockCity23 Boulder RdAddressSandy RubbleName1Id

100KeyWarehouse Transaction #1

01-Jan-2002DateMs.SalutationBedrockCity23 Boulder RdAddressSandy RubbleName1Id

Source Transaction #2

Mrs.SalutationGravelPitCity42 Slate AveAddressSandy RubbleName1Id 100Key

Warehouse Transaction #2

15-Nov-2002DateMrs.SalutationGravelPitCity42 Slate AveAddressSandy RubbleName1Id

Page 47: Informatica Tips and Tricks Every ETL Developer Should Know

Type 3 – Some History

Source Transaction #1

Ms.SalutationBedrockCity23 Boulder RdAddressSandy RubbleName1Id

Ms.Original Salutation

100KeyWarehouse Transaction #1

01-Jan-2002DateMs.Salutation

BedrockCity23 Boulder RdAddressSandy RubbleName1Id

Page 48: Informatica Tips and Tricks Every ETL Developer Should Know

Type 3 – Some HistorySource Transaction #1

Ms.SalutationBedrockCity23 Boulder RdAddressSandy RubbleName1Id

Ms.Original Salutation

100KeyWarehouse Transaction #1

15-Nov-2002DateMrs.Salutation

GravelPitCity42 Slate AveAddressSandy RubbleName1Id

Source Transaction #2

Mrs.SalutationGravelPitCity42 Slate AveAddressSandy RubbleName1Id

Page 49: Informatica Tips and Tricks Every ETL Developer Should Know

Methodology

� As each record from the source file is processed into the staging area, a record type indicator is added to identify how the staging record should later be processed (e.g. as an insert, delete or update)

� This indicator is set based on a comparison to the previous successful data load for that table

� The flag then dictates what path the record will take when loaded into the Warehouse Layer

Page 50: Informatica Tips and Tricks Every ETL Developer Should Know

Record Type Identification

Record should be expired in the warehouse by performing a type 1 update to the previous instance of the record and populating the expiry_ts column with the current date/time. (Fatal error for systems in which deletions are prohibited.)

D

Record contains a trigger field update, and should be treated as a type II update

L

Record contains a non-trigger field update, and should be treated as a type I update.

M

Record is New, and will be treated as an insert.N

Record is unchanged. Only the ETL load key and fields should be updated in the target table.

X

DescriptionRecord Type Identifier

Page 51: Informatica Tips and Tricks Every ETL Developer Should Know

Record Type Identification – An Example

M2Terrell Davis HS

B00991997-19983

X2Terrell Davis HS

B00991998-19994

1

1

SemesterCode

Terrell Davis HS

John Elway HS

School Name

NB00991995-19961

LB00991996-19972

Record Type Flag

School Number

Academic Yr

Stage Key

Page 52: Informatica Tips and Tricks Every ETL Developer Should Know

Early Detection - Advantages

� Reduces Complexity in Warehouse Layer Mappings

� Shift focus of Warehouse Layer Mappings to Error Checking and Error Handling

� Improved Performance of Warehouse Load

Page 53: Informatica Tips and Tricks Every ETL Developer Should Know

Tip #2 – Use Parameter Files

Page 54: Informatica Tips and Tricks Every ETL Developer Should Know

Using Parameter Files

� Parameter Files� A mapping parameter represents a constant value that

you can define before running a session� A mapping parameter retains the same value

throughout the entire session� In a parameter file for the session, one defines the

value of the parameter� During the session, the Informatica Server evaluates

all references to the parameter to that value

Page 55: Informatica Tips and Tricks Every ETL Developer Should Know

Parameter Files Syntax

� Use the following format to define parameters and variables in a session. The folder name is optional:

[(folder_name.)session_name]parameter_name=valueparameter2_name=value

� An Example:[s_m_DM_BOARD]$DBConnection_DW=ESDWP_DW$DBConnection_DM=ESDWP_DM

Page 56: Informatica Tips and Tricks Every ETL Developer Should Know

How do you call a Parameter File?

� Can be specified at the session level

Page 57: Informatica Tips and Tricks Every ETL Developer Should Know

How do you call a Parameter File?

� Also can be specified at the the batch level

Page 58: Informatica Tips and Tricks Every ETL Developer Should Know

Why Use a Parameter File?

BENEFITS:

� Portable Across Environments (Dev, SIT, UAT, Prod)

� Simplifies and automates the Code Promotion Process

� Removes the manual step of updating the database connection(s) in a session

MAIN DRAWBACK:

� File needs to be modified if new mappings are created and are part of the load process

Page 59: Informatica Tips and Tricks Every ETL Developer Should Know

Tip #1 – Creation of Common Library of Components

Page 60: Informatica Tips and Tricks Every ETL Developer Should Know

Creation of Common Library of Components

� Components include: Sources, Targets, Reusable Transformations (Mapplets, Lookups), Variables, Parameter Files, Database Connections

Advantages:

� Reduces redundancy

� Increases standardization/common structure in mappings

� Consistency among mappings

� Reduces chance of mapping errors due to “designer license”

Page 61: Informatica Tips and Tricks Every ETL Developer Should Know

Summary

Page 62: Informatica Tips and Tricks Every ETL Developer Should Know

Summary – Always remember…

� PLAN PLAN PLAN! � Short Term Pain – Long Term Gain� Promote Standardization and Structure� Net Net Effect is more consistent ETL Mappings and a

more robust ETL Load Process

� Credibility is Everything� Remember to audit the load process� Have a strong error detection and correction

methodology

� Leverage Resources� Methodology, Devnet, etc.

Page 63: Informatica Tips and Tricks Every ETL Developer Should Know

Thank YouSean Desmond, [email protected]

Vijay Viswanathan, [email protected]

A N S W E R SA N S W E R SQ U E S T I O N SQ U E S T I O N S