
Enterprise Architecture

ETL Standards for Informatica

Cisco Guidelines for Reporting Architecture

Prepared By: Data and Information Architecture Team
Authors: Richard Powell
Creation Date: September 30, 2003
Last Updated: September 30, 2003
Last Updated By: Richard Powell
Version: 1.0

Copyright 2003 Cisco Systems Inc.

All Rights Reserved


Table of Contents

1.0 DOCUMENT CONTROL
  1.1 Change History
  1.2 Change Management

2.0 WHY THIS DOCUMENT?
  2.1 Document Purpose
  2.2 Document Context
  2.3 Document Scope

3.0 INTRODUCTION TO ETL STANDARDS
  3.1 Overview
  3.2 Data Extraction Method
  3.3 General Standards
  3.4 Design Template
  3.5 Architecture Template

4.0 PACKAGED ETL SOLUTION - INFORMATICA
  4.1 Informatica Architecture
    Informatica Components using PowerCenter
    Repository Folder Structure
  4.2 Naming Standards
    Repository Name
    Server Name
    User Groups
    Folder
    ODBC Connections
    Database Connections
    Transformations
    Mapping Name
    Maplet Name
    Ports
    Log files and Bad files
    Workflow Name
    Session Name
    Session Parameter Name
  4.3 Using Standard Job Control PL/SQL Procedures within Mappings
  4.4 Templates
    Incremental Data Extraction and Processing using a control table
    ETL Process Recovery
    Error Handling
    Incremental Data Extraction and Processing using a Change Table
    Using a reusable transformation in a mapping
    Using a mapplet in a mapping
  4.5 Code Promotion Process
    EDW Informatica Code Promotion Process
  4.6 Scheduling Informatica jobs
    Warehouse Operations Environment
    Standard KSH script

5.0 APPENDIX
  5.1 APPENDIX B: Research Bibliography


1.0 DOCUMENT CONTROL

1.1 Change History

Date       Document Version   Version Details
08/27/03   PP1.x              Version in-process

1.2 Change Management

Date       Document Version   Author(s) Name   Reviewer Names   Approver Names
08/27/03   PP1.x              Richard Powell


2.0 WHY THIS DOCUMENT?

2.1 Document Purpose

The purpose of this document is to record and communicate the standards for the underlying architecture for data movement and data transformation in support of the information delivery process.

2.2 Document Context

This document forms part of a suite of guidance and standards documents (to-be) issued by the Enterprise Architecture (EA) Data and Information Architecture (DAIA) team focused on various aspects of business intelligence. This team plays a key role in three corporate initiatives that either enable, support or plan for enterprise-class information delivery and reporting capabilities.

The corporate initiatives are:

Enterprise Reporting & Business Intelligence (ERBI) Enterprise Architecture (EA), Enterprise Data Warehouse (EDW).

The principal documents in this suite are:

Document Name                                        Owner          Initial Publish Date   Cisco ISO Document Number
The Enterprise Data Warehouse Staging Guidelines     EA DAIA Team   tbd                    tbd
The Enterprise DW and DM Guidelines                  EA DAIA Team   tbd                    tbd
ETL Standards for Warehouse Operations Environment   EA DAIA Team   tbd                    tbd
ETL Standards for PLSQL                              EA DAIA Team   tbd                    tbd

2.3 Document Scope

This document establishes and communicates the enterprise ETL Standards within Cisco, as established by the Data and Information Architecture team. It translates the work that is currently being performed in the Enterprise Data Warehouse (EDW) and the Enterprise Architecture (EA) efforts into tangible reporting architecture products that are to be consumed by IT and business users.


3.0 INTRODUCTION TO ETL STANDARDS

3.1 Overview

What are the ETL Standards?

ETL is an acronym generally used to describe the overall process of data Extraction, Transformation, and Loading. The order in which transformation and loading occur depends on the technologies used for the target database environment and the tools used for the data movement. ELT is a type of data movement and transformation where data is extracted from a source system, loaded into staging tables in the target system, and transformed as needed before loading into the final target tables. Using Informatica Pushdown Optimization with Teradata is a good example of ELT.
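As an illustration only (the staging and target table names below are hypothetical, not part of any Cisco schema), an ELT-style load stages raw source rows first and then applies the transformation inside the target database:

    -- Hypothetical ELT step: raw rows have already been landed in SALES_STG by the load step
    INSERT INTO SALES_FACT (ORDER_ID, ORDER_DATE, USD_AMOUNT)
    SELECT STG.ORDER_ID,
           TRUNC(STG.ORDER_DATE),
           STG.AMOUNT * STG.USD_CONVERSION_RATE   -- transformation happens in the target database
    FROM   SALES_STG STG;

With Pushdown Optimization, Informatica generates comparable SQL and pushes it into the target database (for example Teradata) rather than transforming rows on the PowerCenter server.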

ETL Standards identify the technologies chosen and architecture components needed to perform data movement and data transformation across the enterprise along with data auditing processes and tools to ensure that data is delivered as expected. This document will clearly discuss ETL vs ELT where needed.

The Data and Information Architecture team is the governing team that will establish, communicate, and consult on enterprise-wide ETL standards, guidelines, and best practices for ETL development efforts across Cisco. This document will discuss each ETL Standard, naming conventions, Audit Standards, ETL environment standards, and the process for development and deployment of ETL.

The current Cisco standards by ETL technology are as follows:

Custom Solution: PL/SQL, Ksh

Packaged Solution: Informatica

Messaging Solution: Cisco Data Bus

Auditing Solution: Custom-built environment

For more information regarding recommendations of when to use each technology, please review the ETL Guidelines document.

3.2 Data Extraction Method

All scheduled and recurring data extractions should be incremental in nature, meaning no truncate-then-reload scenarios.

Sometimes truncate and reloads are necessary; these should be considered on a case-by-case basis.

3.3 General Standards

Include comments in all mappings, transformations and any ports that are being modified by the transformation.

The overall mapping logic should appear in the mapping level comment area.

Any changes to a mapping once migrated (copied from the master folder) should have the date of change, name of the person making the change and the purpose of change.

3.4 Design Template

The standard ETL Design template should be used for capturing Informatica designs prior to coding. This template is available on Livelink at:


http://ework.cisco.com/Livelink/livelink.exe?func=ll&objId=2247272&objAction=browse&sort=name

3.5 Architecture Template

The Architecture template which should have been created during the Analysis phase of the project should be updated to point to the completed ETL design template for future reference. The Architecture template is also available on Livelink at:

http://ework.cisco.com/Livelink/livelink.exe?func=ll&objId=2247272&objAction=browse&sort=name

4.0 PACKAGED ETL SOLUTION - INFORMATICA

What is the Packaged ETL Solution?

What are the different components and standards of each?

What type of environment is required?

The Packaged ETL Solution consists of the following components:

Informatica Architecture

Using Standard Job Control PL/SQL Procedures within Mappings

Mappings and Maplets

Sessions

Templates

Code Promotion Process

Each of these components is discussed in the sections below.

4.1 Informatica Architecture

The PowerPoint slide below shows the basic components needed when using Informatica's PowerCenter engine. UNIX servers are not shown in the diagram to highlight that Informatica can be co-located on the source or target database UNIX hosts or separated onto its own UNIX server altogether. However, Informatica recommends co-locating the server on the target database UNIX host, since that is typically where most transformations occur. Depending on your situation, one of the other two configurations may be more appropriate.

Informatica Components using PowerCenter


[Slide: Informatica Components - PowerCenter (an oversimplified view of the Informatica architecture). Sources and Targets connect through the PowerCenter ETL engine, which stores its metadata in the PowerCenter Metadata Repository. PowerMarts are being converted to PowerCenters.]

Repository Folder Structure

NO user specific folders should be created within the Informatica repository. Only application or project specific folders should be created.

Most Informatica installations today are organization specific and are typically installed on the target database host. The following folder naming should be followed in this case:

<Application_Name>[_<project_name>]

The target enterprise deployment architecture calls for one PowerCenter per major business process. Therefore future repository structures being shared across multiple organizations should adhere to the following folder naming standard:

<organization acronym>_<application name>_<project or release>
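A hypothetical example following this pattern: EDW_ERBI_Revenue (organization EDW, application ERBI, project Revenue).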

Why are user specific folders not allowed?

Attrition is not a possibility, it is a fact of life, and dealing with that churn from a code management perspective should be avoided at all costs. Rather than spending significant amounts of time chasing down who might be using code within someone's personal folder, it is better to manage ETL within its appropriate project or application folder so that code is sunsetted along with the project or application it was created for.

4.2 Naming Standards

The name of an object should be self-explanatory. It should convey the functionality and the usage of the object. Prefix the name of an object with a 3-5 character abbreviation. Use Hungarian notation wherever possible and otherwise separate the words with '_'.

Repository Name

If the Repository is being used by a single organization, then use the following format:


<Organization Name>_[Dev | Test | Prod]

Ex: EDW_Prod
Ex: EDW_Dev

If the Repository is being used by multiple organizations, then use the following format:

<Business Process Area Name>_[Dev | Test | Prod]

Ex: M20_Dev
Ex: H2R_Prod

Server Name

Format of the Server Name should be as follows:

<Repository Name>_P[c | m]Svr[_<sequence number>]The sequence number is used to distinguish between servers when more than one server is running against the same repository to share the workload.

Ex: EDW_Prod_PcSvr
Ex: H2R_Prod_PcSvr

User Groups

Typically each project or application folder will have its own Developers group assigned.

<Folder Name>_DevelopersEx: ERBI_Revenue_Developers

Folder

There are three types of folders, depending on usage: Global Shareable, Local Shareable, and Normal folders.

Based on the purpose, the folders are named as follows:

Development Folders

Each project and/or application should have its own folder. If the repository is used by a single organization then use this format:

<project or application name>Ex: ERBI_Revenue

If the repository is used by multiple organizations then use this format:
<Organization name>_<project or application name>
Ex: EDW_ERBI_Revenue

Sources Folder

If the repository is used for only one project, then the name should be Sources. Otherwise Src_[project_name]

Target Folder

If the repository is used for only one project, then the name should be Target. Otherwise Tgt_[project_name]


Reusable Transformations Folder

If the repository is used for only one project, then the name should be RTx. Otherwise Rtx_[project_name]

Also there could be a prefix based on the scope of the folders:
GSFld_FolderName
LSFld_FolderName

Either the Repository Admin or the Folder Admin should create the shareable objects in the respective folders. Normal developers should not be given permission to change these objects.

ODBC Connections

ODBC name should be consistent across the user group in each folder.

Database Connections

Database connection name should be consistent across the user group in each folder.

TransformationsName’s prefix should indicate the transformation.

Transformation                               Prefix
Source Qualifier                             Sq_
Target                                       TableName_[Operation]
Expression                                   Exp_
Update Strategy                              Upd_
Filter                                       Fil_
Lookup                                       Lkp_
Aggregator                                   Agg_
Sequence Generator                           Seq_
Stored Procedure                             Sp_
External Stored Procedure                    Ep_
Adv. Ext Stored Procedure                    Aep_
Joiner                                       Jnr_
Normalizer                                   Nrm_
Rank                                         Rnk_
ERP                                          ErpSrc_
Maplet Input                                 Mapi_
Maplet Output                                Mapo_
Sorter                                       Srt_
Transaction Control                          Tctrl_
XML Source Qualifier                         XML_
Application Source Qualifier                 <app>_
MQ Series Source Qualifier                   MQ_
Tibco Source Qualifier                       Tibco_
Application Multi-Group Source Qualifier     amg_
Shared Objects                               SO_ObjectName
Reusable Transformations                     Rtx_[TransformationName]


Mapping Name

Use a meaningful name describing the purpose of the mapping. <load_area> may be something like SA for Staging Area, a schema name, or the architecture layer being loaded.

m_<load_area>_<table_name>_<[insert | update | delete ]>Ex: m_ECM_BACKLOG_SUMMARY_UPDATE

Maplet Name

Prefix "mplt_" should be used along with a meaningful name describing the purpose of the maplet.

Ports

Port names should start with:

I_ for input ports
O_ for output ports
V_ for variable ports
R_ for return ports

Delete all unnecessary ports from all the transformations.

Log files and Bad files

Named after session and targets by default. It's OK to use the default name.

Workflow Name

Prefix "wf_" should be used along with a meaningful name describing the purpose of the workflow.

wf_<work flow name>Ex: wf_ECM_BACKLOG_UPDATEEx: wf_ECM_BACKLOG_UPDATE_hist

Session Name

Prefix "s_" should be used along with a meaningful name describing the purpose of the session.

s_<session name>

Session Parameter Name

The user can customize the Name portion of each parameter name shown below:

Parameter             Naming Convention
Database Connection   $DBConnectionName
Reject File           $BadFileName
Source File           $InputFileName
Target File           $OutputFileName

4.3 Using Standard Job Control PL/SQL Procedures within Mappings

Provide an example of how to call a standard Job Control PL/SQL procedure within a mapping.

TBD
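Pending the standard procedures, the sketch below is illustrative only; the procedure, sequence, and job-control table names are placeholders and do not represent the actual Cisco job-control API.

    -- Hypothetical job-control procedure (placeholder names, not the Cisco standard)
    CREATE OR REPLACE PROCEDURE SP_JOB_CONTROL_START (
      p_job_name IN  VARCHAR2,
      p_run_id   OUT NUMBER
    ) AS
    BEGIN
      SELECT JOB_RUN_SEQ.NEXTVAL INTO p_run_id FROM DUAL;   -- assumed sequence
      INSERT INTO DW_JOB_RUNS (RUN_ID, JOB_NAME, START_TIME, STATUS)
      VALUES (p_run_id, p_job_name, SYSDATE, 'RUNNING');    -- assumed audit table
      COMMIT;
    END SP_JOB_CONTROL_START;

Within a mapping, such a procedure would typically be wrapped in a Stored Procedure transformation (prefix Sp_ per the naming standards above) and, if unconnected, invoked from an Expression transformation using the :SP. syntax, for example :SP.SP_JOB_CONTROL_START(JOB_NAME, PROC_RESULT).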


4.4 Templates

Provide a diagram for the Template Mappings created by the Administration team and explain the various components.

The following Informatica templates can be found in the "mapping_templates" folder within the EDW development Informatica repository called EDW_InfoDev. Actual templates were provided by Informatica while onsite during August – September '04.

The templates use a combination of mappings, mapplets, sessions, workflows, reusable transformations, Unix shell scripts and other objects. Many of these templates can cover one or more scenarios that are to be addressed. All templates have elaborate descriptions/comments to explain in detail the purpose of each object.

Incremental Data Extraction and Processing using a control table

The mapping template m_Tmplt_Process_Label_1 demonstrates how to incrementally extract data from a source table with the help of a control table and load it to the target table(s). The source qualifier query is overridden such that only the incremental data is extracted. The control table is updated at the end of the session with the new Last GES update date.
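As a sketch only (the table and column names are hypothetical, not the actual objects in m_Tmplt_Process_Label_1), a control-table-driven source qualifier override typically looks like this:

    -- Hypothetical SQL override: extract only rows changed since the last successful run
    SELECT SRC.*
    FROM   ORDERS SRC
    WHERE  SRC.LAST_UPDATE_DATE >
           (SELECT CTL.LAST_EXTRACT_DATE
            FROM   ETL_CONTROL CTL
            WHERE  CTL.MAPPING_NAME = 'm_SA_ORDERS_INSERT');

At the end of a successful session the control table row is updated with the new high-water-mark date, so the next run picks up where this one left off.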

ETL Process Recovery

The ETL process recovery is illustrated by the following set of templates:

The identical mapping templates m_Tmplt_Process_Label_1 & m_Tmplt_Process_Label_2

The workflow template wf_Tmplt_Process_Labels_1_and_2 that includes

the session templates s_m_Tmplt_Process_Label_1 & s_m_Tmplt_Process_Label_2 email task templates Email_Reject_counts and Email_Reject_counts1

The Shell script template /apps/Informatica/template1.ksh

The purpose behind this template is to implement the ETL process in such a way that if a workflow fails for whatever reason, then that workflow should be restarted with no or minimal manual intervention.

In the workflow wf_Tmplt_Process_Labels_1_and_2, each task has the property ‘Fail parent if this task fails’ turned on so that the shell script calling this workflow will always get the correct return value.

Also the link between any two sessions will have conditions set such that the second session will run only if the first one succeeds. This makes sure that the workflow does not process any data further until the failed session has re-run successfully.

In each of the mappings, there is a pre-read stored procedure that updates the Current Label in the control table, among other fields. There is one row in the control table for each workflow run, and the Current Label field for that row holds the name of the session that is currently running. When all the sessions in the workflow have completed successfully, this field is set to null, so that the next run of the workflow will start from the first session.

The shell script that kicks off the workflow reads the Current Label for that workflow from the control table. If the Current Label is null, then the workflow is started from the beginning. Otherwise the workflow is started from the session that corresponds to the Current Label.
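For illustration (the control table and column names below are hypothetical), the restart decision made by the shell script amounts to a query such as:

    -- Hypothetical query used to decide where to (re)start the workflow
    SELECT CTL.CURRENT_LABEL
    FROM   ETL_WORKFLOW_CONTROL CTL
    WHERE  CTL.WORKFLOW_NAME = 'wf_Tmplt_Process_Labels_1_and_2';
    -- NULL  -> start the workflow from the beginning
    -- value -> restart from the session named by CURRENT_LABEL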

Note1: One point to keep in mind in this template is that if a session fails during the control table update operation, then before restarting the workflow you need to manually update the control table to make sure that the correct value is written into the Current Label field.


Note2: Additionally, this template also demonstrates how to notify the duty alias if there are any rejected records in a session. In some cases, you may want to continue the session even if some records are rejected by the database. In such cases, set the 'stop on errors' property in the session to 0 (which is the default value) and invoke an email task when any record is rejected by the database, as demonstrated by the email tasks in this workflow and the conditions in the links that connect the sessions to the email tasks (e.g., $s_m_Tmplt_Process_Label_1.TgtFailedRows > 0).

Error Handling

Identifying errors and creating an error handling strategy is an essential part of a data warehousing project. In the production environment, data must be checked and validated prior to entry into the data warehouse. One strategy for handling errors is to maintain database constraints. Another approach is to use mappings to trap data errors.

The mapping template m_Tmplt_Error_Handling demonstrates how to trap errors from within a mapping. An error table is created for each target table. The error table has all the fields that the corresponding target table has, plus additional columns such as Error_Severity, Error_Description, a date and timestamp, and a flag that indicates whether the error has been fixed or not.

An Expression transformation is used to check for all the required error conditions, and a Router transformation routes any rows containing errors to the error table. The error rows are also marked with an error severity and contain an error description.

Note: How many error conditions should be checked in this type of approach depends on the quality of the data, the accuracy requirements, and the performance requirements. Checking too many conditions unnecessarily for millions of rows will obviously increase the processing time for the session.
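To make the error-table structure described above concrete, here is a minimal sketch assuming a hypothetical target table ORDERS (the actual template derives the error-table columns from each real target):

    -- Hypothetical error table: same columns as the target, plus error metadata
    CREATE TABLE ORDERS_ERR AS
      SELECT * FROM ORDERS WHERE 1 = 0;

    ALTER TABLE ORDERS_ERR ADD (
      ERROR_SEVERITY    VARCHAR2(10),
      ERROR_DESCRIPTION VARCHAR2(4000),
      ERROR_TIMESTAMP   DATE DEFAULT SYSDATE,
      FIXED_FLAG        CHAR(1) DEFAULT 'N'      -- set to 'Y' once the error has been fixed
    );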

Incremental Data Extraction and Processing using a Change Table

The mapping template m_Tmplt_Incr_Load_With_ChngTbl shows how to extract and process incremental data with the help of a change table. The change table has the same primary key as the main table and stores operation type and timestamp fields in addition to the primary key. The Source Qualifier has a modified query that selects only the 'changed' rows from the source table with the help of the change table. Once the changed rows are selected, the update strategy in the mapping flags each row as insert/update/delete depending on the operation type obtained from the change table.

The Change Table is truncated after every run of the session. However, you can further modify this template to select only new change rows from the Change Table using the timestamp field in it if you do not want to truncate the table every time.
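Again as a sketch with hypothetical names (not the actual objects in m_Tmplt_Incr_Load_With_ChngTbl), the modified source qualifier query joins the source table to its change table on the shared primary key:

    -- Hypothetical SQL override: select only the rows recorded in the change table
    SELECT SRC.*,
           CHG.OPERATION_TYPE
    FROM   ORDERS        SRC,
           ORDERS_CHANGE CHG
    WHERE  SRC.ORDER_ID = CHG.ORDER_ID;

Downstream, the Update Strategy transformation flags each row as insert, update, or delete based on the OPERATION_TYPE value obtained from the change table.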

Using a reusable transformation in a mapping

The template m_Tmplt_Demonstrate_Reusable_Transformation illustrates the use of reusable transformations. Reusable transformations can be used in different mappings and also multiple times within the same mapping. This is very similar to calling a function or a subroutine. The reusable Expression transformation used in this mapping takes two fields as input and outputs the concatenation of those two fields.

Using a mapplet in a mapping

The template m_Tmplt_Demonstrate_Mapplet shows the use of a mapplet within a mapping. A mapplet is ideally used to implement repeatable logic that cannot be implemented in a single reusable transformation. This mapping template shows that you can have multiple output groups from the mapplet Output transformation, or you can use a mapplet more than once in the same mapping, much like calling a subroutine multiple times from within a single procedural-language program.


4.5 Code Promotion Process

The slide below provides a detailed look at the code deployment process within EDW. The key takeaway from this process is that the production support organization must be the keeper of production code and should be the only group with access to the production environment, aside from the Informatica administration team.

EDW Informatica Code Promotion Process

[Flowchart: Securing the EDW Environment - Informatica Code Deployment. Swim lanes: IT Team (EDW or Non-EDW), CDWDuty, DAIA, and the EDW Change Control Team. Recoverable steps include: create and test the code change, complete the CR template, and notify CDWDuty; verify that the folder exists in the Production repository (DAIA creates the production repository folder if it does not and notifies CDWDuty); validate the code change; the EDW Change Control Team reviews the request and, if approved, updates the Master CR Log and notifies CDWDuty, who migrates the code; if not approved, the IT Team addresses the CR issues and resubmits.]

4.6 Scheduling Informatica jobs

Provide a diagram showing the UPROC to Job Config to Informatica call sequence.

Warehouse Operations Environment

All ETL workflow or session launches should be set up within the Job Configuration System Call, namely the DW_JOB_STREAMS.SYSTEM_CALL column within a particular job_group. More information about the Job Configuration schema can be found within the ETL Standards for Warehouse Operations Environment.

Standard KSH script

The standard UNIX script for launching Informatica workflows, which should be called by the job configuration system call, is:

dw_jobs_call_infmt_session.ksh

Parameter Position   Description
$1                   Folder name which holds the Workflow being called
$2                   Workflow Name to launch
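For example (hypothetical folder and workflow names, reusing the naming examples above), the DW_JOB_STREAMS.SYSTEM_CALL value might be:

    dw_jobs_call_infmt_session.ksh ERBI_Revenue wf_ECM_BACKLOG_UPDATE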


This script will return 0 if successful; otherwise it will return a non-zero value and send email to the "is-cdw-jobs" alias.


5.0 APPENDIX

5.1 APPENDIX B: Research Bibliography


5.1 APPENDIX B: Research Bibliography

Adrian Billington. "Oracle 9i MERGE STATEMENT - DECOMPOSING SQL%ROWCOUNT." Pipeline Newsletter (September 2003) <http://www.revealnet.com/newsletter-v4/0903_D.htm>

Adarsh Viswanathan. "General Development Guidelines for ERBI Development and Production Rollout." (Aug 22, 2003) <\\sjc-fs1\WG-I\IT-EDIT\published\ERBI\guidelines_doc\index.htm>

Damir Bersinic. "Evolution of Oracle SQL: The Transform Factor." (October 03, 2003) <http://certcities.com/certs/oracle/columns/story.asp?EditorialsID=63>

Vimal Chopra. "CDW Developer's Reference." (June 27, 1996) < ?? >

Raghu Halur. "PowerCenter ETL Object Templates Document." (September 2004) <http://ework.cisco.com/Livelink/livelink.exe?func=ll&objId=4479993&objAction=Open>

Raghu Halur. "Standards and Naming Conventions Document." (September 2004) <http://ework.cisco.com/Livelink/livelink.exe?func=ll&objId=4479580&objAction=Open>
