XYZ Client
Kenilworth, NJ
Analysis of ETL Tools
System Name: CDW - Scoping v1.0
Prepared By: Ken
Date: ______________
ANALYSIS OF ETL TOOLS
Contents
1. Characteristics of an ETL Tool
2. Requirements of the Project
3. ETL Tool Comparison
Characteristics of an ETL Tool
1. Access data from multiple operational data sources
2. Re-map source data into a common format
3. Standardize data to enable loading into conformed target databases
4. Filter data, convert codes, perform table lookups, and calculate derived values
5. Automated slowly changing dimension support (Type I, Type II, Type III)
6. Incremental aggregation and computation of aggregates by the ETL tool in one pass of the source data
7. Support for Unicode and multi-byte character sets, localized for Japanese and other languages
8. Support for a graphical job sequencer, re-usable containers, and nesting of sessions
9. Validate data to check the content and range of field values
10. Perform procedural data-cleansing functions
11. Support for a complete development environment, including versioning and a run-time debugger
12. Load cleansed data to the target data mart or central data warehouse
13. Produce audit and operational reports for each data load
14. Automatic generation of centralized metadata
15. Automatic generation of data extract programs
16. Native interfaces to legacy files, relational databases, ERP sources (e.g., SAP R/3 and PeopleSoft), eBusiness applications, Web log files, IBM MQSeries, XML sources, etc.
17. Support for near real-time clickstream data warehousing
18. Support for an enterprise eBusiness environment, including integration at the metadata level with BI tools, ERP applications, CRM applications, analytic applications, corporate portals, etc.
19. Platform independence and scalability to enterprise data warehousing applications; directly executable, in-memory, multi-threaded processing for fast, parallel operation
20. No requirement to generate and compile source code
21. No requirement for intermediate disk files
22. Support for concurrent processing of multiple source data streams, without writing procedural code
Characteristics of an ETL Tool (contd.)
23. Specification of ETL functions using pre-packaged transformation objects, accessible via an intuitive graphical user interface
24. Extensible transformation objects at a high level of significance
25. Ability to specify complex transformations using only built-in transformation objects; the goal is to specify transformations without writing any procedural code
26. Automatic generation of central metadata, including source data definitions, transformation objects, target data models, and operational statistics
27. Metadata exchange architecture that supports automatic synchronization of central metadata with local metadata for multiple end-user BI tools
28. Central management of distributed ETL engines and metadata using a central console and a global metadata repository
29. End-user access to the central metadata repository via a right mouse click
30. Metadata exchange API compliant with COM, UML, and XML
31. Support for metadata standards, including OLE DB for OLAP
32. Ability to schedule ETL sessions on time or on the occurrence of a specified event, including support for command-line scheduling using external scheduling programs
33. Ability to schedule FTP sessions on time or event
34. Integration with data cleansing tools
35. Import of complete data models from external data modeling tools
36. Strong data warehouse administration functions
37. Support for the analysis of transformations that were rejected by the ETL process
38. Extensive reporting of the results of an ETL session, including automatic notification of significant failures of the ETL process
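Item 5 above refers to automated slowly changing dimension handling. As a rough illustration of what Type II support automates, the Python sketch below expires the current dimension row and inserts a new version whenever a tracked attribute changes, preserving full history. The field names (`key`, `city`, `start_date`, `end_date`) are invented for the example and do not come from any of the tools discussed here.

```python
from datetime import date

def apply_scd2(dimension, incoming, load_date):
    """Type II slowly changing dimension: when a tracked attribute
    changes, close the current row and insert a new version."""
    rows = [dict(r) for r in dimension]          # work on copies
    for rec in incoming:
        current = next((r for r in rows
                        if r["key"] == rec["key"] and r["end_date"] is None),
                       None)
        if current is None:
            # brand-new dimension member: insert its first version
            rows.append({**rec, "start_date": load_date, "end_date": None})
        elif current["city"] != rec["city"]:
            # attribute changed: expire the old version, open a new one
            current["end_date"] = load_date
            rows.append({**rec, "start_date": load_date, "end_date": None})
    return rows
```

A Type I change would instead overwrite `city` in place, and Type III would keep the prior value in an extra column; Type II is the variant that keeps every historical version as its own row.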
Requirements of the Project
1. Ease of use and maintenance
2. Security features and administrative functions
3. Sources: Oracle database, files, XML files, COBOL files
4. Targets: Oracle
5. Must be able to use Oracle functions, specifically date functions
6. Must be able to take advantage of Oracle's parallel architecture
7. Must be able to partition data for reading and writing purposes
8. Must support Oracle external loads, i.e., SQL*Loader
9. Able to call Oracle functions and procedures
10. Must be able to work with lookups
11. Must be able to work with containers
12. Graphical interface for design, development, and implementation
13. Must have a debugger or another way to debug a mapping/graph/job stream
14. Version control
15. Loading of failed (rejected) data
16. Incremental load
17. Must be able to work with Unicode multi-byte character sets localized for different countries (data resides in different locations)
18. Sort, aggregate, and join transformations (look for inner, semi-outer, and outer joins)
19. Must be able to work with the Oracle BLOB and CLOB data types
20. Metadata management: a global metadata repository
21. Metadata can be shared across the enterprise
22. Able to import metadata from ERwin or Designer 2000
Requirements of the Project (contd.)
23. Able to export metadata to Cognos
24. Metadata change impact analysis
25. Must be able to log errors
26. Keeping track of slowly changing dimensions and change data capture (CDC)
27. Must have a sequence generator for primary keys
28. Able to perform audits on loads and on data failures
29. Able to validate data and check the content and range of data
30. Can be scheduled through shell scripts or scheduling tools
31. Ability to merge data from more than one source in a single mapping/graph/job stream
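Several of the requirements above (lookups, incremental load, and a sequence generator for primary keys) combine in a typical load step. The Python sketch below shows the shape of that logic only; the field names (`nk` for natural key, `updated_at`) and the in-memory set standing in for a target-table lookup are assumptions for illustration, not features of any tool listed here.

```python
import itertools

def incremental_load(source_rows, target_keys, last_load_ts, key_seq):
    """Incremental load: pick up only rows changed since the last run,
    skip rows already present in the target (lookup), and assign each
    new row a surrogate primary key from a sequence generator."""
    loaded = []
    for row in source_rows:
        if row["updated_at"] <= last_load_ts:
            continue                      # unchanged since the last load
        if row["nk"] in target_keys:
            continue                      # lookup hit: already in target
        loaded.append({"pk": next(key_seq), **row})
        target_keys.add(row["nk"])
    return loaded
```

In a real tool the lookup would be a cached target-table read and the sequence would be a database sequence or the tool's sequence-generator transformation; `itertools.count` stands in for both here.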
ETL Tool Comparison

The comparison below covers four tools: Ascential Software DataStage XE, Informatica PowerCenter, Ab Initio (GDE, the Graphical Development Environment, running on the Ab Initio Co>Operating System), and Cognos DecisionStream. For each feature, the findings for each tool are listed in turn.

Feature: Development and Maintenance Process Support: Visual Metaphor (multiple screens for handling development, scheduling, and administrative tasks)
- DataStage XE: Has different screens for source import, target import, scheduling, and administrative tasks. Mappings become complex as the number of transformations grows, yet remain easy to handle as the project evolves.
- PowerCenter: Two environments, one for development and one for job scheduling.
- Ab Initio: A GDE (Graphical Development Environment) is present. It is not very intuitive; it has many components, each represented by a square box. The newer version has a GUI for running graphs and administrative jobs.
Feature: Multiple Sources and Targets
- DataStage XE: Supports an unlimited number of heterogeneous data sources and targets. Multiple targets can be of mixed origin, can have multiple destinations, and can receive data through a variety of loading strategies (bulk loader, flat-file I/O, direct SQL) in the same job.
- PowerCenter: Has access to multiple sources and targets, but within one mapping it can read from heterogeneous sources and cannot write to heterogeneous targets, a limitation to be addressed in a newer version.
- Ab Initio: Can read from and write to heterogeneous sources because of its own operating system, which gives it greater flexibility.
- DecisionStream: Claims to read from heterogeneous sources and write to heterogeneous targets, but this is not clear.
Feature: Native Mainframe Data Extraction
- DataStage XE: A separate product, DataStage XE/390, generates COBOL programs that run directly on the mainframe, providing native mainframe extraction, transformation, and loading.
- PowerCenter: A separate product called PowerConnect for Mainframe. It accesses DB2 on the mainframe using DB2 Connect.
- Ab Initio: On mainframes, Ab Initio can directly read, write, update, and delete rows in DB2 databases and records in VSAM files. It can read and write any MVS dataset and can read mainframe IMS data. All data types are supported.
- DecisionStream: Extraction from the mainframe requires third-party software.
Feature: Built-in Functions and Routines
- DataStage XE: Many built-in functions, plus scripting languages such as BASIC and SQL for creating your own.
- PowerCenter: Many built-in functions, with SQL as the scripting language for creating your own; roughly 50 built-in components cover the most common data warehouse jobs.
- Ab Initio: Many built-in components for different jobs, including Compress, Database, Datasets, Departition, FTP, Miscellaneous, Partition, Sort, Transform, and Validate.
- DecisionStream: Yes.
Feature: Advanced Transformation Support
- DataStage XE: A BASIC-like scripting language is available, so the metadata stays within the tool and advanced transformations can be written.
- PowerCenter: Scripting is limited to SQL; external procedures can be called for advanced transformations, but the metadata of those transformations is not integrated.
- Ab Initio: Includes a full programming language (called DML) that can express if-then-else, case, cascading (prioritized) rules, looping, and much more.
- DecisionStream: Can call C-like functions.
Feature: Version Control and Configuration Management
- DataStage XE: Includes a version-control component that saves the history of all data integration development. It preserves application components such as table definitions, transformation rules, and source/target column mappings within a two-part numbering scheme.
- PowerCenter: Version control is implemented only at the folder level. It is not great, but is to be addressed in the 6.0 or 7.0 release. Currently Informatica users rely on third-party version-control applications such as PVCS.
- Ab Initio: The built-in version-management system lets users navigate to old versions of an object, find the version of a graph used in a particular run of a job, view the state of the entire repository at a point in time, and compare two versions of a graph, data transformation, or data-type definition.
- DecisionStream: Through source control.
Feature: Graphical Job Sequencer
- DataStage XE: A very good tool for visually creating the sequence of jobs to be run, through the GUI.
- PowerCenter: The newer version 6.0 has this; currently it is done through the Session Manager.
- Ab Initio: Not until the latest version; I am told the new version has it.
- DecisionStream: Not known.
Feature: External Function Support
- DataStage XE: External functions are easily integrated within DataStage. More importantly, the flexibility and completeness of the tool lets the developer stay inside it, using easy scripting languages such as BASIC and SQL.
- PowerCenter: Yes, but limited to C++, Java, PL/SQL, etc. Most jobs and transformations can be achieved with the built-in functions and transformations. There is no way to import metadata from an external function; this will be addressed in a newer version.
- Ab Initio: Custom components can be implemented in any language, including COBOL, C, C++, Perl, Java, Forte, and PL/SQL.
- DecisionStream: Yes, for C-like functions.
Feature: Metadata Analysis and Management: Source Schema Change Management (Graphical Impact Analysis Across Tools)
- DataStage XE: Developers can see all the relationships associated with an object; with this feature they can assess the impact of changes across the environment before a change actually occurs.
- PowerCenter: Yes, through PowerPlug. Informatica's PowerPlug product allows easy comparison between industry-leading modeling tools and the Informatica repository; differences can be highlighted and metadata imported. (A newer version will automatically update all connected fields, from source through target, when the source changes.) Built-in reports show source-to-target column relationships, the sources and targets used in a mapping, and the mapplets used in a mapping.
- Ab Initio: The repository tracks the dependencies and relationships among the objects stored in it, so the impact of any change can be assessed; for example, the possible impact of program changes can be determined by locating the parts of an application that depend on the object being changed.
- DecisionStream: No.
Feature: Global Metadata Browsing
- DataStage XE: Metadata exploration, analysis, and usage functions for impact analysis, data lineage, and the publication and subscription of metadata, through MetaStage Explorer.
- PowerCenter: Metadata can be browsed through the client tools; in addition, a web-based metadata reporter is distributed with PowerCenter.
- Ab Initio: The Repository Browser interface gives users access to the repository through a standard web browser.
- DecisionStream: Yes.
Feature: Containers (for reuse of business logic)
- DataStage XE: For business-logic reuse as discrete components, DataStage uses Containers. Containers are independent, shareable DataStage objects that can be included in many jobs within the same project or across multiple projects. When business logic changes, updates are made to the shared container, and jobs using it pick up the new logic when they are recompiled. Shared containers are useful for building a library of standardized, granular, repetitive data integration tasks such as test-data generation, job recovery, or slowly changing dimensions.
- PowerCenter: The same concept of business-logic reuse across projects and mappings, via mapplets.
- Ab Initio: Provides subgraphs for developing graphs from reusable pieces; a subgraph can be stored in a library and reused as many times as desired, across multiple applications.
Feature: Validation
- DataStage XE: Permits validation of transforms and user connections. Jobs fail immediately at run time, or in validation mode, because connections and SQL statements are checked before any rows are processed.
- PowerCenter: Through scripts and built-in functions such as is_numeric and is_spaces. Database connections are validated before a mapping runs, and the Designer validates a mapping for syntactic or data-type errors before it is ready to be executed by the scheduler.
- Ab Initio: Has many built-in components that can be placed in graphs to perform validations, such as Compare Checksum and Validate Records. The GDE alerts the developer to any error in the graph.
Feature: Canvas Annotation
- DataStage XE: Users can write notes or add text onto the canvas as they create DataStage jobs, adding comments, labels, or other explanations to the designs.
- PowerCenter: Every mapping and transformation has two types of annotation: a name/object description and a comment box. The name is displayed when the mouse hovers over the transformation.
- Ab Initio: Component-specific and graph-specific annotations are available.
Feature: Publication of Metadata Directory
- DataStage XE: Metadata can be published in XML and/or HTML format, complete with hyperlinks, for easy end-user access and navigation.
- PowerCenter: Through Webzine, a web-based metadata reporter that facilitates publication of metadata.
- Ab Initio: Not very clear, but a specified user can display all or any object properties, such as business area, subject area, steward group, database, modified by, and created/last-modified dates.
- DecisionStream: Not known.
Feature: Metadata Reuse
- DataStage XE: Uses a publish-and-subscribe mechanism to distribute standard metadata from a variety of sources. Other users can subscribe to metadata publications on a one-time or recurring basis; when metadata changes, subscribers are automatically notified.
- PowerCenter: Collects (loads) and distributes (unloads) metadata between tools using bridges, e.g., import from Oracle Designer, ERwin, or PowerDesigner, and export to BI tools such as BusinessObjects, MicroStrategy, and Impromptu. These bridges are one-directional. A separate tool, the Metadata Exchange SDK, allows bi-directional exchange of metadata with design tools and OLAP tools. Metadata can also be exported and imported through XML files.
- Ab Initio: Has ways of integrating with design tools and OLAP tools, but it is not clear how reuse of metadata is achieved.
- DecisionStream: Yes.
Feature: Data Lineage (integration of design and event metadata): a way to know from which sources the data was populated into the target when a job extracts from more than one source.
- DataStage XE: Possible through scripts.
- PowerCenter: Possible through scripts.
- DecisionStream: No.
Feature: Integration with E/R Modeling Tools
- DataStage XE: Metadata sharing with the industry's leading E/R modeling tools, including ERwin, Designer 2000, PowerDesigner, and ER/Studio.
- PowerCenter: Metadata sharing with the industry's leading E/R modeling tools, including ERwin, Designer 2000, and PowerDesigner, with the help of PowerPlug and the Metadata Exchange SDK.
- Ab Initio: Has an ERwin unloader, which reads an ERwin diagram, creates dataset and column objects in the Ab Initio repository, and annotates them with comments and descriptive information. Whether, and how, it can be integrated with ERwin or Designer 2000 remains an open question.
- DecisionStream: Yes.
Feature: Integration with Business Intelligence Tools
- DataStage XE: With BusinessObjects, Brio, Cognos Impromptu, and MicroStrategy.
- PowerCenter: Through PowerBridges, with Brio, Cognos, MicroStrategy, and BusinessObjects.
- Ab Initio: Not very clear; how this is done is an open question.
- DecisionStream: With Cognos PowerPlay, through Architect.
Performance

Feature: Parallelism
- DataStage XE: Advanced parallel-processing capabilities enable it to use SMP, SMP cluster, and MPP platforms effectively and efficiently. In addition, DataStage XE efficiently leverages database parallelism (e.g., DB2 EEE, Oracle Enterprise Edition with parallel tables).
- PowerCenter: The architecture limits the product's ability to perform parallel processing. While PowerCenter can partition a data set and run separate processes, the newer version has greater capabilities for the Sort, Join, and Aggregate transformations and is faster than previous versions. It supports pipelined bulk loading of data and memory caching.
- Ab Initio: Because of its parallel (MPP-style) architecture and its partitioning model (component, pipeline, and data parallelism), it has been able to handle very high volumes of data. It can do parallel and pipelined loading.
- DecisionStream: Hashing techniques for sorts, joins, and aggregates.
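The data-partitioning idea underlying all of these parallel approaches can be sketched in a few lines of Python: rows are assigned to buckets by a stable hash of a key, so each bucket can then be read, transformed, and written by an independent worker. This is an illustrative sketch under invented field names, not any vendor's implementation.

```python
import zlib

def hash_partition(rows, key, n_parts):
    """Assign each row to a partition by a stable hash of its key,
    so rows sharing a key always land in the same partition (needed
    for partition-local joins and aggregates)."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        # crc32 is stable across runs, unlike Python's salted hash()
        bucket = zlib.crc32(str(row[key]).encode()) % n_parts
        parts[bucket].append(row)
    return parts
```

Because equal keys always hash to the same bucket, a sort, join, or aggregate can run independently per partition, which is what makes the parallelism effective.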
Feature: Named Pipe Support
- DataStage XE: The ability to break a large job into smaller jobs that can communicate with each other, making the whole process run faster.
- PowerCenter: The Sort and Aggregate transformations have the same facility.
- Ab Initio: When graphs run on the same server they communicate through named pipes or shared memory; when they run on different servers, they communicate over the network. All of this is managed by the Ab Initio Co>Operating System.
- DecisionStream: Not known.
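The point of pipe-based communication between job stages is that the downstream stage starts consuming records before the upstream stage finishes, with no intermediate disk file. Here is a minimal Python sketch using an anonymous OS pipe between two threads (named pipes behave the same way; an anonymous pipe just keeps the example self-contained):

```python
import os
import threading

def pipeline_demo(records):
    """Two pipeline stages connected by a pipe: the producer transforms
    and writes records while the consumer reads them concurrently."""
    read_fd, write_fd = os.pipe()
    results = []

    def extract_transform():
        with os.fdopen(write_fd, "w") as pipe:
            for rec in records:
                pipe.write(rec.upper() + "\n")   # stage 1: transform

    def load():
        with os.fdopen(read_fd) as pipe:
            for line in pipe:
                results.append(line.rstrip("\n"))  # stage 2: load

    workers = [threading.Thread(target=extract_transform),
               threading.Thread(target=load)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results
```

Because both stages run at once, the total elapsed time approaches that of the slower stage rather than the sum of the two, which is the speedup the tools above advertise.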
Feature: Shared In-Memory Hash Tables
- DataStage XE: Allows multiple "instances" of a job, or even multiple disparate jobs running in parallel, to share the same in-memory image of a hash table and thus make better use of machine resources.
- PowerCenter: Yes; the Lookup, Joiner, Aggregator, and Sorter transformations use this technique to boost performance, as does dynamic lookup.
- Ab Initio: There are in-memory versions of the Join, Sort, and Lookup components. For dynamic data lookups, Ab Initio includes several mechanisms; a Lookup File represents one or more serial files, or a parallel flat file, of data records small enough to be held in main memory, letting a transform function retrieve records much more quickly than it could from disk.
- DecisionStream: Yes.
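The performance idea behind all of these features is the same: build the reference data into an in-memory hash table once, then enrich each incoming record with a constant-time probe instead of a per-row database or disk read. A small Python sketch, with `store_id` and `region` as invented field names for illustration:

```python
def build_lookup(reference_rows, key_field):
    """Build the in-memory hash table once, up front."""
    return {row[key_field]: row for row in reference_rows}

def enrich(fact_rows, lookup):
    """Enrich each fact row with a constant-time dictionary probe;
    unmatched keys fall back to a default value."""
    return [{**fact,
             "region": lookup.get(fact["store_id"], {}).get("region", "UNKNOWN")}
            for fact in fact_rows]
```

Sharing one such table across parallel job instances, as DataStage XE describes, saves memory; the lookup cost per row stays O(1) either way.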
Platform Support

Feature: Hardware Support
- DataStage XE: Unix, NT, and mainframe platforms.
- PowerCenter: Unix and NT platforms only.
- Ab Initio: Unix, NT, and Windows platforms.

Feature: Extensibility
- DataStage XE: Central to DataStage's architecture is the DataStage plug-in API, which allows Ascential Software engineering, VARs, and end users alike to code their own interfaces to external databases, processes, and bulk loaders.
- PowerCenter: Not aware of any such facility.
- Ab Initio: A simple mechanism for "wrapping" user programs; the Co>Operating System delivers the ultimate in extensibility.

Feature: Integrated Bulk Loader
- DataStage XE: Bundled within the software and can be invoked within scripts.
- PowerCenter: Bundled within the software and can be invoked within scripts, for Oracle, Sybase, and SQL Server.
- Ab Initio: Loading and unloading of bulk data through the Load and Unload components.
Data Quality and Cleansing

Feature: Ability to Assess Current Data Quality
- DataStage XE: Development teams and business users can audit, monitor, and certify the quality of data at key points throughout the data integration lifecycle.
- PowerCenter: Data cleansing can be achieved by building the rules into the mapping logic; rows can be flagged as problems or routed to a problem file or table.
- Ab Initio: Achieved through the Data Profiler product, which sits on top of the Ab Initio Co>Operating System. It analyzes the data in a graph and stores the results in the metadata repository, which the developer can then use to examine data quality. Moreover, built-in components such as Checksum and Compare Records can be incorporated into graphs to assess data quality.

Feature: Integration with Third-Party Data Cleansing Tools
- DataStage XE: With Trillium and FirstLogic (how easy it is, I do not know, as I have not tried it).
- PowerCenter: With Trillium and FirstLogic (same caveat).
- Ab Initio: With Trillium and FirstLogic (same caveat).
PRODUCT
- DataStage XE: A component-based product: buy what you use. DataStage XE comes with all the features we require, such as communication with design tools and OLAP tools.
- PowerCenter: A component-based product: buy what you use (it now has its own OLAP tools). Separate products must be bought for transferring metadata from design tools and to OLAP tools.
- Ab Initio: One product, with add-on products for metadata transfer (but it comes with access to mainframe and other sources).
COST
- DataStage XE: Cost can be spread because of the product's component nature. Licensed per CPU.
- PowerCenter: Cost can be spread because of the product's component nature. Licensing is per server, per repository (many servers can point to one repository) and also depends on the number of sources and targets; it is licensed by number of CPUs, e.g., 4 or 6.
- Ab Initio: A one-time cost, per CPU and per developer.
Feature: Ease of Use and Maintenance
- DataStage XE: It has improved a lot over time (based on a web demo of the tool).
- PowerCenter: Very easy to use and maintain (own experience).
- Ab Initio: Users say the new GUI is good, but I have no direct experience with it.
Feature: New Releases and Patches (how much does annual maintenance cost?)
- DataStage XE: 18% to 22% of the purchase price every year.
- PowerCenter: 20% of the purchase price every year.
- Ab Initio: Not known.
Feature: Customer Support and Services (can we have a dedicated person to answer our questions?)
- DataStage XE: Not known.
- PowerCenter: Not known.
- Ab Initio: Not known.
Feature: Case Studies and Access to a Production Site Using the ETL Tool (can we get a client reference and visit a site that uses the tool for its warehousing?)
- DataStage XE: Yes.
- PowerCenter: Yes.
- Ab Initio: Yes.
Feature: Availability of Manpower (price and consultancy)
- DataStage XE: Good.
- PowerCenter: Very good (low price and high quality).