XYZ Client
Kenilworth, NJ
Analysis of ETL Tools
System Name: CDW - Scoping v1.0
Prepared By: Ken
Date: ______________
ANALYSIS OF ETL TOOLS
Contents
1. Characteristics of an ETL Tool
2. Requirements of the Project
3. ETL Tool Comparison
Characteristics of an ETL Tool
1. Access data from multiple operational data sources
2. Re-map source data into a common format
3. Standardize data to enable loading into conformed target databases
4. Filter data, convert codes, perform table lookups, and calculate derived values
5. Automated slowly changing dimension support (Type I, Type II, Type III)
6. Incremental aggregation and computation of aggregates by the ETL tool in one pass of the source data
7. Support for Unicode and multi-byte character sets, localized for Japanese and other languages
8. Support for a graphical job sequencer, re-usable containers, and nesting of sessions
9. Validate data to check the content and range of field values
10. Perform procedural data-cleansing functions
11. Support for a complete development environment, including versioning and a run-time debugger
12. Load cleansed data to the target data mart or central data warehouse
13. Produce audit and operational reports for each data load
14. Automatic generation of centralized metadata
15. Automatic generation of data extract programs
16. Native interfaces to legacy files, relational databases, ERP sources (e.g., SAP R/3 and PeopleSoft), eBusiness applications, Web log files, IBM MQSeries, XML sources, etc.
17. Support for near real-time clickstream data warehousing
18. Support for an enterprise eBusiness environment, including integration at the metadata level with BI tools, ERP applications, CRM applications, analytic applications, corporate portals, etc.
19. Platform independence and scalability to enterprise data warehousing applications; directly executable, in-memory, multi-threaded processing for fast, parallel operation
20. No requirement to generate and compile source code
21. No requirement for intermediate disk files
22. Support for concurrent processing of multiple source data streams, without writing procedural code
Characteristics of an ETL Tool (contd.)
23. Specification of ETL functions using pre-packaged transformation objects, accessible via an intuitive graphical user interface
24. Extensible transformation objects at a high level of significance
25. Ability to specify complex transformations using only built-in transformation objects; the goal is to specify transformations without writing any procedural code
26. Automatic generation of central metadata, including source data definitions, transformation objects, target data models, and operational statistics
27. Metadata exchange architecture that supports automatic synchronization of central metadata with local metadata for multiple end-user BI tools
28. Central management of distributed ETL engines and metadata using a central console and a global metadata repository
29. End-user access to the central metadata repository via a right mouse click
30. Metadata exchange API compliant with COM, UML, and XML
31. Support for metadata standards, including OLE DB for OLAP
32. Ability to schedule ETL sessions on time or on the occurrence of a specified event, including support for command-line scheduling using external scheduling programs
33. Ability to schedule FTP sessions on time or event
34. Integration with data cleansing tools
35. Import of complete data models from external data modeling tools
36. Strong data warehouse administration functions
37. Support for the analysis of transformations that were rejected by the ETL process
38. Extensive reporting of the results of an ETL session, including automatic notification of significant failures of the ETL process
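Item 5 above refers to automated slowly changing dimension handling. As a rough illustration of what Type II support automates, the Python sketch below expires the current dimension row and inserts a new version whenever a tracked attribute changes, preserving full history. The field names (`key`, `city`, `start_date`, `end_date`) are invented for the example and do not come from any of the tools discussed here.

```python
from datetime import date

def apply_scd2(dimension, incoming, load_date):
    """Type II slowly changing dimension: when a tracked attribute
    changes, close the current row and insert a new version."""
    rows = [dict(r) for r in dimension]          # work on copies
    for rec in incoming:
        current = next((r for r in rows
                        if r["key"] == rec["key"] and r["end_date"] is None),
                       None)
        if current is None:
            # brand-new dimension member: insert its first version
            rows.append({**rec, "start_date": load_date, "end_date": None})
        elif current["city"] != rec["city"]:
            # attribute changed: expire the old version, open a new one
            current["end_date"] = load_date
            rows.append({**rec, "start_date": load_date, "end_date": None})
    return rows
```

A Type I change would instead overwrite `city` in place, and Type III would keep the prior value in an extra column; Type II is the variant that keeps every historical version as its own row.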
Requirements of the Project
1. Ease of use and maintenance
2. Security features and administrative functions
3. Sources: Oracle database, files, XML files, COBOL files
4. Targets: Oracle
5. Must be able to use Oracle functions, specifically date functions
6. Must be able to take advantage of Oracle's parallel architecture
7. Must be able to partition data for reading and writing purposes
8. Must support Oracle external loads, i.e., SQL*Loader
9. Able to call Oracle functions and procedures
10. Must be able to work with lookups
11. Must be able to work with containers
12. Graphical interface for design, development, and implementation
13. Must have a debugger or another way to debug a mapping/graph/job stream
14. Version control
15. Loading of failed (rejected) data
16. Incremental load
17. Must be able to work with Unicode multi-byte character sets localized for different countries (data resides in different locations)
18. Sort, aggregate, and join transformations (look for inner, semi-outer, and outer joins)
19. Must be able to work with the Oracle BLOB and CLOB data types
20. Metadata management: a global metadata repository
21. Metadata can be shared across the enterprise
22. Able to import metadata from ERwin or Designer 2000
Requirements of the Project (contd.)
23. Able to export metadata to Cognos
24. Metadata change impact analysis
25. Must be able to log errors
26. Keeping track of slowly changing dimensions and change data capture (CDC)
27. Must have a sequence generator for primary keys
28. Able to perform audits on loads and on data failures
29. Able to validate data and check the content and range of data
30. Can be scheduled through shell scripts or scheduling tools
31. Ability to merge data from more than one source in a single mapping/graph/job stream
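Several of the requirements above (lookups, incremental load, and a sequence generator for primary keys) combine in a typical load step. The Python sketch below shows the shape of that logic only; the field names (`nk` for natural key, `updated_at`) and the in-memory set standing in for a target-table lookup are assumptions for illustration, not features of any tool listed here.

```python
import itertools

def incremental_load(source_rows, target_keys, last_load_ts, key_seq):
    """Incremental load: pick up only rows changed since the last run,
    skip rows already present in the target (lookup), and assign each
    new row a surrogate primary key from a sequence generator."""
    loaded = []
    for row in source_rows:
        if row["updated_at"] <= last_load_ts:
            continue                      # unchanged since the last load
        if row["nk"] in target_keys:
            continue                      # lookup hit: already in target
        loaded.append({"pk": next(key_seq), **row})
        target_keys.add(row["nk"])
    return loaded
```

In a real tool the lookup would be a cached target-table read and the sequence would be a database sequence or the tool's sequence-generator transformation; `itertools.count` stands in for both here.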
ETL Tool Comparison

The comparison below covers four tools: Ascential Software DataStage XE, Informatica PowerCenter, Ab Initio (GDE, the Graphical Development Environment, running on the Ab Initio Co>Operating System), and Cognos DecisionStream. For each feature, the findings for each tool are listed in turn.

Feature: Development and Maintenance Process Support: Visual Metaphor (multiple screens for handling development, scheduling, and administrative tasks)
- DataStage XE: Has different screens for source import, target import, scheduling, and administrative tasks. Mappings become complex as the number of transformations grows, yet remain easy to handle as the project evolves.
- PowerCenter: Two environments, one for development and one for job scheduling.
- Ab Initio: A GDE (Graphical Development Environment) is present. It is not very intuitive; it has many components, each represented by a square box. The newer version has a GUI for running graphs and administrative jobs.
Feature: Multiple Sources and Targets
- DataStage XE: Supports an unlimited number of heterogeneous data sources and targets. Multiple targets can be of mixed origin, can have multiple destinations, and can receive data through a variety of loading strategies (bulk loader, flat-file I/O, direct SQL) in the same job.
- PowerCenter: Has access to multiple sources and targets, but within one mapping it can read from heterogeneous sources and cannot write to heterogeneous targets, a limitation to be addressed in a newer version.
- Ab Initio: Can read from and write to heterogeneous sources because of its own operating system, which gives it greater flexibility.
- DecisionStream: Claims to read from heterogeneous sources and write to heterogeneous targets, but this is not clear.
Feature: Native Mainframe Data Extraction
- DataStage XE: A separate product, DataStage XE/390, generates COBOL programs that run directly on the mainframe, providing native mainframe extraction, transformation, and loading.
- PowerCenter: A separate product called PowerConnect for Mainframe. It accesses DB2 on the mainframe using DB2 Connect.
- Ab Initio: On mainframes, Ab Initio can directly read, write, update, and delete rows in DB2 databases and records in VSAM files. It can read and write any MVS dataset and can read mainframe IMS data. All data types are supported.
- DecisionStream: Extraction from the mainframe requires third-party software.
Feature: Built-in Functions and Routines
- DataStage XE: Many built-in functions, plus scripting languages such as BASIC and SQL for creating your own.
- PowerCenter: Many built-in functions, with SQL as the scripting language for creating your own; roughly 50 built-in components cover the most common data warehouse jobs.
- Ab Initio: Many built-in components for different jobs, including Compress, Database, Datasets, Departition, FTP, Miscellaneous, Partition, Sort, Transform, and Validate.
- DecisionStream: Yes.
Feature: Advanced Transformation Support
- DataStage XE: A BASIC-like scripting language is available, so the metadata stays within the tool and advanced transformations can be written.
- PowerCenter: Scripting is limited to SQL; external procedures can be called for advanced transformations, but the metadata of those transformations is not integrated.
- Ab Initio: Includes a full programming language (called DML) that can express if-then-else, case, cascading (prioritized) rules, looping, and much more.
- DecisionStream: Can call C-like functions.
Feature: Version Control and Configuration Management
- DataStage XE: Includes a version-control component that saves the history of all data integration development. It preserves application components such as table definitions, transformation rules, and source/target column mappings within a two-part numbering scheme.
- PowerCenter: Version control is implemented only at the folder level. It is not great, but is to be addressed in the 6.0 or 7.0 release. Currently Informatica users rely on third-party version-control applications such as PVCS.
- Ab Initio: The built-in version-management system lets users navigate to old versions of an object, find the version of a graph used in a particular run of a job, view the state of the entire repository at a point in time, and compare two versions of a graph, data transformation, or data-type definition.
- DecisionStream: Through source control.
Feature: Graphical Job Sequencer
- DataStage XE: A very good tool for visually creating the sequence of jobs to be run, through the GUI.
- PowerCenter: The newer version 6.0 has this; currently it is done through the Session Manager.
- Ab Initio: Not until the latest version; I am told the new version has it.
- DecisionStream: Not known.
Feature: External Function Support
- DataStage XE: External functions are easily integrated within DataStage. More importantly, the flexibility and completeness of the tool lets the developer stay inside it, using easy scripting languages such as BASIC and SQL.
- PowerCenter: Yes, but limited to C++, Java, PL/SQL, etc. Most jobs and transformations can be achieved with the built-in functions and transformations. There is no way to import metadata from an external function; this will be addressed in a newer version.
- Ab Initio: Custom components can be implemented in any language, including COBOL, C, C++, Perl, Java, Forte, and PL/SQL.
- DecisionStream: Yes, for C-like functions.
Feature: Metadata Analysis and Management: Source Schema Change Management (Graphical Impact Analysis Across Tools)
- DataStage XE: Developers can see all the relationships associated with an object; with this feature they can assess the impact of changes across the environment before a change actually occurs.
- PowerCenter: Yes, through PowerPlug. Informatica's PowerPlug product allows easy comparison between industry-leading modeling tools and the Informatica repository; differences can be highlighted and metadata imported. (A newer version will automatically update all connected fields, from source through target, when the source changes.) Built-in reports show source-to-target column relationships, the sources and targets used in a mapping, and the mapplets used in a mapping.
- Ab Initio: The repository tracks the dependencies and relationships among the objects stored in it, so the impact of any change can be assessed; for example, the possible impact of program changes can be determined by locating the parts of an application that depend on the object being changed.
- DecisionStream: No.
Feature: Global Metadata Browsing
- DataStage XE: Metadata exploration, analysis, and usage functions for impact analysis, data lineage, and the publication and subscription of metadata, through MetaStage Explorer.
- PowerCenter: Metadata can be browsed through the client tools; in addition, a web-based metadata reporter is distributed with PowerCenter.
- Ab Initio: The Repository Browser interface gives users access to the repository through a standard web browser.
- DecisionStream: Yes.
Feature: Containers (for reuse of business logic)
- DataStage XE: For business-logic reuse as discrete components, DataStage uses Containers. Containers are independent, shareable DataStage objects that can be included in many jobs within the same project or across multiple projects. When business logic changes, updates are made to the shared container, and jobs using it pick up the new logic when they are recompiled. Shared containers are useful for building a library of standardized, granular, repetitive data integration tasks such as test-data generation, job recovery, or slowly changing dimensions.
- PowerCenter: The same concept of business-logic reuse across projects and mappings, via mapplets.
- Ab Initio: Provides subgraphs for developing graphs from reusable pieces; a subgraph can be stored in a library and reused as many times as desired, across multiple applications.
Feature: Validation
- DataStage XE: Permits validation of transforms and user connections. Jobs fail immediately at run time, or in validation mode, because connections and SQL statements are checked before any rows are processed.
- PowerCenter: Through scripts and built-in functions such as is_numeric and is_spaces. Database connections are validated before a mapping runs, and the Designer validates a mapping for syntactic or data-type errors before it is ready to be executed by the scheduler.
- Ab Initio: Has many built-in components that can be placed in graphs to perform validations, such as Compare Checksum and Validate Records. The GDE alerts the developer to any error in the graph.
Feature: Canvas Annotation
- DataStage XE: Users can write notes or add text onto the canvas as they create DataStage jobs, adding comments, labels, or other explanations to the designs.
- PowerCenter: Every mapping and transformation has two types of annotation: a name/object description and a comment box. The name is displayed when the mouse hovers over the transformation.
- Ab Initio: Component-specific and graph-specific annotations are available.
Feature: Publication of Metadata Directory
- DataStage XE: Metadata can be published in XML and/or HTML format, complete with hyperlinks, for easy end-user access and navigation.
- PowerCenter: Through Webzine, a web-based metadata reporter that facilitates publication of metadata.
- Ab Initio: Not very clear, but a specified user can display all or any object properties, such as business area, subject area, steward group, database, modified by, and created/last-modified dates.
- DecisionStream: Not known.
Feature: Metadata Reuse
- DataStage XE: Uses a publish-and-subscribe mechanism to distribute standard metadata from a variety of sources. Other users can subscribe to metadata publications on a one-time or recurring basis; when metadata changes, subscribers are automatically notified.
- PowerCenter: Collects (loads) and distributes (unloads) metadata between tools using bridges, e.g., import from Oracle Designer, ERwin, or PowerDesigner, and export to BI tools such as BusinessObjects, MicroStrategy, and Impromptu. These bridges are one-directional. A separate tool, the Metadata Exchange SDK, allows bi-directional exchange of metadata with design tools and OLAP tools. Metadata can also be exported and imported through XML files.
- Ab Initio: Has ways of integrating with design tools and OLAP tools, but it is not clear how reuse of metadata is achieved.
- DecisionStream: Yes.
Feature: Data Lineage (integration of design and event metadata): a way to know from which sources the data was populated into the target when a job extracts from more than one source.
- DataStage XE: Possible through scripts.
- PowerCenter: Possible through scripts.
- DecisionStream: No.
Feature: Integration with E/R Modeling Tools
- DataStage XE: Metadata sharing with the industry's leading E/R modeling tools, including ERwin, Designer 2000, PowerDesigner, and ER/Studio.
- PowerCenter: Metadata sharing with the industry's leading E/R modeling tools, including ERwin, Designer 2000, and PowerDesigner, with the help of PowerPlug and the Metadata Exchange SDK.
- Ab Initio: Has an ERwin unloader, which reads an ERwin diagram, creates dataset and column objects in the Ab Initio repository, and annotates them with comments and descriptive information. Whether, and how, it can be integrated with ERwin or Designer 2000 remains an open question.
- DecisionStream: Yes.
Feature: Integration with Business Intelligence Tools
- DataStage XE: With BusinessObjects, Brio, Cognos Impromptu, and MicroStrategy.
- PowerCenter: Through PowerBridges, with Brio, Cognos, MicroStrategy, and BusinessObjects.
- Ab Initio: Not very clear; how this is done is an open question.
- DecisionStream: With Cognos PowerPlay, through Architect.
Performance

Feature: Parallelism
- DataStage XE: Advanced parallel-processing capabilities enable it to use SMP, SMP cluster, and MPP platforms effectively and efficiently. In addition, DataStage XE efficiently leverages database parallelism (e.g., DB2 EEE, Oracle Enterprise Edition with parallel tables).
- PowerCenter: The architecture limits the product's ability to perform parallel processing. While PowerCenter can partition a data set and run separate processes, the newer version has greater capabilities for the Sort, Join, and Aggregate transformations and is faster than previous versions. It supports pipelined bulk loading of data and memory caching.
- Ab Initio: Because of its parallel (MPP-style) architecture and its partitioning model (component, pipeline, and data parallelism), it has been able to handle very high volumes of data. It can do parallel and pipelined loading.
- DecisionStream: Hashing techniques for sorts, joins, and aggregates.
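The data-partitioning idea underlying all of these parallel approaches can be sketched in a few lines of Python: rows are assigned to buckets by a stable hash of a key, so each bucket can then be read, transformed, and written by an independent worker. This is an illustrative sketch under invented field names, not any vendor's implementation.

```python
import zlib

def hash_partition(rows, key, n_parts):
    """Assign each row to a partition by a stable hash of its key,
    so rows sharing a key always land in the same partition (needed
    for partition-local joins and aggregates)."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        # crc32 is stable across runs, unlike Python's salted hash()
        bucket = zlib.crc32(str(row[key]).encode()) % n_parts
        parts[bucket].append(row)
    return parts
```

Because equal keys always hash to the same bucket, a sort, join, or aggregate can run independently per partition, which is what makes the parallelism effective.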
Feature: Named Pipe Support
- DataStage XE: The ability to break a large job into smaller jobs that can communicate with each other, making the whole process run faster.
- PowerCenter: The Sort and Aggregate transformations have the same facility.
- Ab Initio: When graphs run on the same server they communicate through named pipes or shared memory; when they run on different servers, they communicate over the network. All of this is managed by the Ab Initio Co>Operating System.
- DecisionStream: Not known.
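The point of pipe-based communication between job stages is that the downstream stage starts consuming records before the upstream stage finishes, with no intermediate disk file. Here is a minimal Python sketch using an anonymous OS pipe between two threads (named pipes behave the same way; an anonymous pipe just keeps the example self-contained):

```python
import os
import threading

def pipeline_demo(records):
    """Two pipeline stages connected by a pipe: the producer transforms
    and writes records while the consumer reads them concurrently."""
    read_fd, write_fd = os.pipe()
    results = []

    def extract_transform():
        with os.fdopen(write_fd, "w") as pipe:
            for rec in records:
                pipe.write(rec.upper() + "\n")   # stage 1: transform

    def load():
        with os.fdopen(read_fd) as pipe:
            for line in pipe:
                results.append(line.rstrip("\n"))  # stage 2: load

    workers = [threading.Thread(target=extract_transform),
               threading.Thread(target=load)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results
```

Because both stages run at once, the total elapsed time approaches that of the slower stage rather than the sum of the two, which is the speedup the tools above advertise.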
Feature: Shared In-Memory Hash Tables
- DataStage XE: Allows multiple "instances" of a job, or even multiple disparate jobs running in parallel, to share the same in-memory image of a hash table and thus make better use of machine resources.
- PowerCenter: Yes; the Lookup, Joiner, Aggregator, and Sorter transformations use this technique to boost performance, as does dynamic lookup.
- Ab Initio: There are in-memory versions of the Join, Sort, and Lookup components. For dynamic data lookups, Ab Initio includes several mechanisms; a Lookup File represents one or more serial files, or a parallel flat file, of data records small enough to be held in main memory, letting a transform function retrieve records much more quickly than it could from disk.
- DecisionStream: Yes.
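The performance idea behind all of these features is the same: build the reference data into an in-memory hash table once, then enrich each incoming record with a constant-time probe instead of a per-row database or disk read. A small Python sketch, with `store_id` and `region` as invented field names for illustration:

```python
def build_lookup(reference_rows, key_field):
    """Build the in-memory hash table once, up front."""
    return {row[key_field]: row for row in reference_rows}

def enrich(fact_rows, lookup):
    """Enrich each fact row with a constant-time dictionary probe;
    unmatched keys fall back to a default value."""
    return [{**fact,
             "region": lookup.get(fact["store_id"], {}).get("region", "UNKNOWN")}
            for fact in fact_rows]
```

Sharing one such table across parallel job instances, as DataStage XE describes, saves memory; the lookup cost per row stays O(1) either way.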
Platform Support

Feature: Hardware Support
- DataStage XE: Unix, NT, and mainframe platforms.
- PowerCenter: Unix and NT platforms only.
- Ab Initio: Unix, NT, and Windows platforms.

Feature: Extensibility
- DataStage XE: Central to DataStage's architecture is the DataStage plug-in API, which allows Ascential Software engineering, VARs, and end users alike to code their own interfaces to external databases, processes, and bulk loaders.
- PowerCenter: Not aware of any such facility.
- Ab Initio: A simple mechanism for "wrapping" user programs; the Co>Operating System delivers the ultimate in extensibility.

Feature: Integrated Bulk Loader
- DataStage XE: Bundled within the software and can be invoked within scripts.
- PowerCenter: Bundled within the software and can be invoked within scripts, for Oracle, Sybase, and SQL Server.
- Ab Initio: Loading and unloading of bulk data through the Load and Unload components.
Data Quality and Cleansing

Feature: Ability to Assess Current Data Quality
- DataStage XE: Development teams and business users can audit, monitor, and certify the quality of data at key points throughout the data integration lifecycle.
- PowerCenter: Data cleansing can be achieved by building the rules into the mapping logic; rows can be flagged as problems or routed to a problem file or table.
- Ab Initio: Achieved through the Data Profiler product, which sits on top of the Ab Initio Co>Operating System. It analyzes the data in a graph and stores the results in the metadata repository, which the developer can then use to examine data quality. Moreover, built-in components such as Checksum and Compare Records can be incorporated into graphs to assess data quality.

Feature: Integration with Third-Party Data Cleansing Tools
- DataStage XE: With Trillium and FirstLogic (how easy it is, I do not know, as I have not tried it).
- PowerCenter: With Trillium and FirstLogic (same caveat).
- Ab Initio: With Trillium and FirstLogic (same caveat).
PRODUCT
- DataStage XE: A component-based product: buy what you use. DataStage XE comes with all the features we require, such as communication with design tools and OLAP tools.
- PowerCenter: A component-based product: buy what you use (it now has its own OLAP tools). Separate products must be bought for transferring metadata from design tools and to OLAP tools.
- Ab Initio: One product, with add-on products for metadata transfer (but it comes with access to mainframe and other sources).
COST
- DataStage XE: Cost can be spread because of the product's component nature. Licensed per CPU.
- PowerCenter: Cost can be spread because of the product's component nature. Licensing is per server, per repository (many servers can point to one repository) and also depends on the number of sources and targets; it is licensed by number of CPUs, e.g., 4 or 6.
- Ab Initio: A one-time cost, per CPU and per developer.
Feature: Ease of Use and Maintenance
- DataStage XE: It has improved a lot over time (based on a web demo of the tool).
- PowerCenter: Very easy to use and maintain (own experience).
- Ab Initio: Users say the new GUI is good, but I have no direct experience with it.
Feature: New Releases and Patches (how much does annual maintenance cost?)
- DataStage XE: 18% to 22% of the purchase price every year.
- PowerCenter: 20% of the purchase price every year.
- Ab Initio: Not known.
Feature: Customer Support and Services (can we have a dedicated person to answer our questions?)
- DataStage XE: Not known.
- PowerCenter: Not known.
- Ab Initio: Not known.
Feature: Case Studies and Access to a Production Site Using the ETL Tool (can we get a client reference and visit a site that uses the tool for its warehousing?)
- DataStage XE: Yes.
- PowerCenter: Yes.
- Ab Initio: Yes.
Feature: Availability of Manpower (price and consultancy)
- DataStage XE: Good.
- PowerCenter: Very good (low price and high quality).