DataStage: How to Kick Start
DESCRIPTION
This is about the DataStage ETL tool and is very useful for beginners learning ETL.
TRANSCRIPT
Information Server 8.1 Introduction
Business Intelli Solutions Inc www.businessintelli.com
Course Outline
This course will explain the concepts of DataStage, its architecture, and how to apply it to a
'real life' scenario in a business case-study in which you'll solve business problems. We will
begin by looking at the big picture and discuss why businesses need ETL tools and where DataStage fits in the product set.
Once we've talked about the very basic architecture of DataStage, we'll investigate a
business case-study, learn about a company called Amalgamated Conglomerate Corporation
(ACC) - a fictitious holding company - and its business and technical needs. We'll then go
about solving this company's problems with DataStage in a Guided Tour Product Simulation.
In a practice environment, you'll become ACC’s Senior DataStage Developer. In this
capacity, you will assess an existing DataStage Job, build your own Job, modify a Job, and build a Sequencer Job.
Using the DataStage clients, you'll log onto the DataStage server and look at a Job that was
built previously by the former DataStage Developer at ACC. You’ll then build your own Job
by importing metadata, building a Job design, compiling, running, troubleshooting, and
then fixing this Job. You’ll then modify a Job and finally build a special type of Job called a
Sequence Job.
Let’s get started!
This section begins by talking about the DataStage tool, what it does, and some of the
concepts that underpin its use. We’ll discuss how it fits in the Information Server suite of
products and how it can be purchased as a stand-alone product or in conjunction with other products in the suite. We’ll briefly cover DataStage’s architecture and its clients.
What Is DataStage? InfoSphere DataStage is an ETL (Extract Transform and Load) tool
that is a part of the InfoSphere suite of products. It functions as a stand-alone product or in
conjunction with other products in the suite. DataStage provides a visual UI that you can
use in a point-and-click fashion (a non-linear programming tool) to quickly build
DataStage Jobs that will perform extractions, transformations, and loads of data for use in
data warehousing, system migrations, data integration projects, and data marts.
Sometimes DataStage is purchased with QualityStage, which performs data cleansing (we’ll
touch on it in this training, but be sure to take the “Introduction to QualityStage”
FlexLearning module as well). When implementing DataStage as a stand-alone product, you
will still need to install InfoSphere Information Server components with it. Other
components or products in the suite can be added at a later time.
The other products in the suite are QualityStage, Information Analyzer, Business Glossary,
and MetaData Workbench. InfoSphere Information Server is a suite of tools that is used to
manage your information needs. Several components come with Information Server. One of
the main components is DataStage, which provides the ETL capability.
IBM InfoSphere Metadata Workbench provides end-to-end metadata management,
depicting the relationships between sources and consumers.
IBM InfoSphere Change Data Capture provides a real-time change data capture (CDC) and replication solution across heterogeneous environments.
IBM InfoSphere Change Data Capture for Oracle Replication provides real-time data distribution and a high availability/disaster recovery solution for Oracle environments.
IBM InfoSphere Information Analyzer profiles and establishes an understanding of source systems and monitors data rules.
IBM InfoSphere Business Glossary creates, manages, and searches metadata definitions.
IBM InfoSphere QualityStage standardizes and matches information across heterogeneous sources.
IBM InfoSphere DataStage extracts, transforms, and loads data between multiple sources
and targets.
IBM InfoSphere DataStage MVS Edition provides native data integration capabilities for the
mainframe.
IBM InfoSphere Federation Server defines integrated views across diverse and distributed
information sources, including cost-based query optimization and integrated caching.
IBM InfoSphere Information Services Director allows information access and integration processes to be published as reusable services in a service-oriented architecture.
IBM InfoSphere Information Server FastTrack simplifies and streamlines communication
between the business analyst and developer by capturing business requirements and automatically translating them into DataStage ETL Jobs.
Connectivity Software provides efficient and cost-effective cross-platform, high-speed, real-
time, batch and change-only integration for your data sources.
This training will focus on DataStage, which gives you the ability to import, export, create,
and manage metadata from a wide variety of sources for use within these DataStage ETL
Jobs. After Jobs are created, they can be scheduled, run, and monitored, all within the DataStage environment.
DataStage provides a point-and-click user interface in which there is a Canvas. You drag
and drop icons that represent objects (Stages and Links) from a Palette onto the Canvas to
build Jobs.
For instance, you might drag and drop icons for a source, a transformation, and a target
onto the Canvas. The data might then flow from a Source Stage via a Link to a Transformation Stage through another Link to a Database Stage, for example.
Stages and Links present a graphical environment that guides you, the developer, through
needed steps. You are presented with fill-in-the-blank-style boxes that enable you to quickly
and easily configure data flows.
When connecting to a database, for example, you drag a proprietary database Stage icon
onto the Canvas; you don’t have to worry about how that database’s native code works in order to leverage its high performance.
After configuring some particulars and doing a compile, the result is a working Job that is
executable within the DataStage environment, as well as a Job design that provides a
graphical view of how the data or information is flowing and being used.
Information Server Backbone In order to understand the environment in which we will
work as Developers, let’s first very briefly look at the architecture and framework behind the
InfoSphere Information Server and its various components (Collectively known as
InfoSphere). InfoSphere has various product modules within it including the Information
Services Director, the Business Glossary, Information Analyzer, DataStage, QualityStage,
and Federation Server.
These product modules are all supported by common metadata access services and
metadata analysis services. They all sit on top of the repository.
They all work with one another and can be used as an integrated product suite, or DataStage can be deployed as a stand-alone application.
Whichever component or components you have, you will need to install the basics of the Information Server itself including the repository and the application layer.
From there, this 3-tiered architecture fulfills many information integration needs allowing
metadata and terminology collected with one product during one phase of a project to flow
to other products throughout the enterprise enabling common understandings of data and
leveraging common discoveries.
Let’s just review the products in the suite again.
• DataStage, by itself, can pull and push metadata, table definitions, and information
from and to various targets and sources of data. DataStage provides the
components needed to get data from one place to another and manage it on an
ongoing basis.
• The Information Services Director lets you turn your Jobs into real-time Web
Services.
• The Business Glossary maps your business terms to your technical terms so that all
the members of your organization can have common nomenclature during all
business activities and stages of technical development thus reducing ambiguity
among varying audiences.
• The Information Analyzer allows you to analyze data in databases, files, and data
stores and gives you the knowledge to determine what types of Jobs need to be
developed. DataStage and QualityStage allow you to then manipulate the data and
cleanse it in-line while you're working with it.
• DataStage and QualityStage have been integrated into a single canvas so that all
your activities within a Job can flow from a traditional DataStage Stage to one of the
QualityStage Stages; thereby allowing you to, in the process of doing your other ETL
activities, cleanse your data inline. This tutorial will focus primarily on the ETL
activities with a little bit of knowledge on DataStage's QualityStage Stages that go
along with it.
• The Federation Server is a tool that lets you put together disparate and
heterogeneous data stores through a single point of view for the organization's entire
data world: databases still stored on mainframes, as well as on a variety of other computers, are made to look like one repository.
These product modules feed to DataStage allowing you to create Jobs that will function in a
high efficiency manner to move data and/or cleanse it as necessary. Using DataStage, you
will be able to build data warehouses and other new repositories for integrated forms and
views of data for your entire organization. DataStage is also very useful in data migration
activities, EDI and other means of data communication. As mergers and acquisitions occur,
for example, the underlying IT systems can, using DataStage, be brought together into a
single trusted view or common unified view with which to make better business decisions
and implementations. DataStage can help with ordering and receiving activities as well by
putting data onto and/or pulling data from an EDI wire, effectively parsing it, and moving it
into a data warehouse or to online operating systems so that you can have quicker access to and more robust delivery of your business data.
DataStage also integrates with various other tools in addition to databases, such as MQ
Series, which provides reliable delivery of messaging traffic. DataStage has both real-time and
batch capabilities.
Now that we've talked about the overall Information Server's architecture and components, let's take a more detailed look at DataStage's components.
DataStage Architecture DataStage has a 3-tiered architecture consisting of clients,
dedicated engines, and a shared repository. These are all components within the InfoSphere
Information Server using its shared and common repository. The clients include an
Administrator client, a Designer client, and the Director client.
The Administrator is used for setting up and managing individual DataStage Projects and
each Project's related common project information. The Designer is used to develop and
manage DataStage applications and all their related components such as metadata. The
Designer is where you will be doing most of your development. The Director allows you to
actively monitor runtime activities, watch Jobs, schedule Jobs, and keep track of events
once your Job has been developed, deployed, and is running.
Inside DataStage, there are two engines: the traditional Server engine (used for DataStage
'Server' Jobs) and the Parallel engine (used for DataStage Parallel Jobs). The 'Server’ Jobs
are typically single-threaded and have evolved over the years during various versions of
DataStage since its inception. The Parallel engine allows you to create Jobs that can, at
runtime, dynamically increase or decrease their efficiency and to optimize the use of
resources at hand within your hardware. The Parallel engine can dynamically change the
way in which it performs its parallelism.
The Shared or 'Common' repository holds all of the information such as Job designs, logs of
the Job runs, and metadata that you will need while developing Jobs. Now, let's take a look
at the Administrator client's UI.
Administrator Client We're looking at the Administrator's UI under the tab entitled
'General'. Here, you can enable job administration from within the Director client, enable
runtime column propagation (RCP) for all of your Parallel Jobs, enable editing of internal
Job references, and enable sharing metadata when importing from Connectors (the DataStage
objects that automatically connect to various proprietary sources and databases with
minimal configuration). This tab also gives you some auto-purge settings for your log.
What is RCP and how does it work?
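Conceptually, RCP lets columns that were never defined in a Stage's design pass through to downstream Stages at runtime. As a rough sketch (this is an illustration in Python, not DataStage's actual implementation), think of each row as a dictionary: the Stage transforms only the columns it was designed for, and everything else is propagated untouched:

```python
def transform_known(row, known_columns, fn):
    """Apply fn only to the columns the Stage was designed for;
    with RCP enabled, all other columns pass through unchanged."""
    out = {}
    for name, value in row.items():
        if name in known_columns:
            out[name] = fn(value)
        else:
            out[name] = value  # propagated at runtime, not part of the design
    return out

# 'region' is not in the Stage's design, yet it still reaches the output.
row = {"id": 7, "name": "acme", "region": "EMEA"}
result = transform_known(row, {"name"}, str.upper)
```

Without RCP, the hypothetical `region` column would simply be dropped at this Stage.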
DataStage is a role-based tool. Usually, when it is installed, the Administrator assigns roles
such as Developer or Operator. This means that the UI may have different options available
to different people. The Administrator, developers, and operators may not be the same
person. Thus, when you look at the actual product, your UI may be different than others'.
Under the Permissions tab, DataStage can assign functions to individuals who will be using
the tool. The person with the role of ‘administrator’ may have more options than does a
‘developer’. There may be differing functions made available to different people. This can all
be designated for each separate DataStage Project. You can thus determine which user gets
what role for that particular Project.
Traces are available should you wish to put tracing on the engine. The Schedule tab allows
us to dictate which scheduler will be used. In the Windows environment, we would need to
designate who will be the authorized user to run Jobs from the scheduler.
The Mainframe tab allows us to set needed options when working with the mainframe
version of DataStage. This would allow us to create Jobs that could then be ported up to the
mainframe and then executed in their native form with the appropriate JCL.
Other tabs let us set the memory requirements on a per-Project basis. The Parallel tab gives
us options as to how the parallel operations will work with the parallel engine. Under the
Sequence tab, the Sequencer options are set. And under the Remote tab of the
Administrator client, we make settings for remote parallel Job deployment on the USS
system or remote Job deployment on a Grid.
DataStage Designer The second of the three DataStage clients is the Designer. As an ETL
Developer, this is where you'll spend a good deal of your time. Let's examine the main
areas of the UI. In the upper-left (shown highlighted) is the Repository Browser area. It
shows us the metadata objects and Jobs found within DataStage. Toward the lower-left is
the Palette. The Palette contains all of the Stages (drag-and-droppable icons). You would
select a Database source icon, for instance, and drag it to the right.
On the right is the Canvas. This is where you would drop the Stages that you have chosen
to be part of your Job design. You can see that the Job in the graphic above is a Parallel Job
and that it contains two Stages that have been joined together with a Link. Links can be
added from the Palette directly, or you can just right-click on a Stage, drag, and then drop
on the next Stage to create the Link dynamically. Now, in this particular example, data will
flow from a Row Generator Stage to the Peek Stage.
The Designer is where you will create the kinds of working code that you need in order to
properly move your data from one source to another. There are over 100 Stages that come
with the product out of the box (At installation time there are approximately 60 base Stages
installed, an additional 40 that are optional, and 50-100 more based on service packs and
add-ons, and then you can build your own custom Stages in addition).
DataStage Director The third DataStage client is the Director. Director gives us a runtime
view of the Job. Above, we can see an individual Job log. There are messages and other
events shown in this Job run that are logged to this area and kept inside the Shared
Repository. We can see the types of things that occur during a Job run such as the Job
starting, setting environment variables, main activities of the program, and then eventually
the successful completion of the Job. Should there be any warnings or runtime errors, they
can be found and analyzed here to determine what went wrong in order to take corrective
action. Color-coding shows that green messages are just informational, a yellow message is
a warning and may not directly affect the running of the Job but should probably be looked
at. If you should ever see a red message in here, this is a fatal error, which must be
resolved before the Job can run successfully.
DataStage Repository In the Repository Browser, we have various folder icons that
represent different categories within the DataStage Project. Some come automatically with
the Project and then you can add your own and name them. Organize them in a way
suitable to your DataStage Project providing for easy export and import. Standard folders
include those for Jobs, Table Definitions, Rule Sets, and other DataStage objects. Tip:
Create a Folder named after your initiative and then order or move all the appropriate
objects for just that initiative under that one common location.
Table Definitions are also known as metadata or schemas. These terms are often used
interchangeably. Each typically contains a collection of information that defines the
metadata about the individual record or row within a table or file. A piece of metadata can
describe column names, column lengths, or columns' data types or it can describe fields
within a file.
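The pieces of metadata a Table Definition carries can be sketched as a small data model. This is a minimal illustration in Python; the field names and the sample `CUST_ID`/`CUST_NAME` columns are ours, not DataStage's internal format:

```python
from dataclasses import dataclass

@dataclass
class ColumnDef:
    """One column's metadata: name, type, length, and nullability."""
    name: str
    sql_type: str      # e.g. "Integer", "VarChar"
    length: int
    nullable: bool = False

# A Table Definition is essentially a named collection of column metadata
# describing one record or row within a table or file.
customer = [
    ColumnDef("CUST_ID", "Integer", 10),
    ColumnDef("CUST_NAME", "VarChar", 40, nullable=True),
]
```

When you import a Table Definition from a database, DataStage populates this kind of structure for you from the source's catalog.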
Other folders include one for Routines (sub-routines able to be called from within a Job).
Shared Containers are pieces of DataStage Jobs that can be pulled together, used, and re-
used as modular components. Stage Types (highlighted) is a master list of all the Stages
that are available to you in this Project. Stages can also be added later through various
options depending on what product(s) have already been installed and depending on what
your administrator has made available to you. The Standardization Rules folder contains
out-of-the-box or default rules for QualityStage.
Transforms provide the standard Server engine with macro-style functions that can be used
and re-used from its calling structure. WAVES Rules are also used by QualityStage, for
address verification modules. Match Specifications are used by QualityStage's Match
Stages. Machine Profiles are used with IBM mainframes. IMS “view sets” are for working
with legacy IMS types of Database Management Systems.
Steps to Create a DataStage Job
This section talks about the steps to building your first Job. First, you'll need to understand
some basic concepts, then we'll talk about setting up an environment, next you'll connect to
the sources, then we'll talk about how to import table definitions, and then we'll provide you
with an understanding of the various types of Stages and how they are used. Then, we'll
talk about working with RCP, creating Parameter Sets, understanding how to use the CLI,
and, in the next lesson, you will put all of this knowledge to use and begin building Jobs in a case-study scenario. This section covers the following:
1. Understand Some Underpinnings
   1. Types of Jobs
   2. Design Elements of Parallel Jobs
   3. Pipeline Parallelism
   4. Partition Parallelism
   5. Three-Node Partitioning
   6. Job Design Versus Execution
   7. Architectural Setup Option
2. Setting up a new DataStage environment
   1. Setting up the DataStage Engine for the First Time
   2. DataStage Administrator Projects Tab
   3. Environment Reporting Variables
   4. DataStage Administrator Permissions Tab
   5. Export Window
   6. Choose Stages; Passive Stages
   7. Connect to Databases
3. Import Table Definitions
4. Active Stages; The Basics
5. An Important Active Stage; The Transformer Stage
6. Advanced Concepts When Building Jobs; RCP and More
7. Pulling Jobs Together; Parameter Sets, CLI, etc.
Let’s now shift our discussion to the process of developing Jobs in DataStage. What are the
steps involved in developing Jobs? First, you define global and Project properties using the
Administrator client that we mentioned earlier in the tutorial. This includes defining how you
want the Job to be run. For instance, do you want Runtime Column Propagation or any
other common variables in your environment that you might be using throughout the
Project? We'll discuss this in greater depth later in this tutorial.
Next, you go into the Designer client and import metadata into the repository for use, later,
in building your Jobs. Then, using the Designer tool, you actually build the Job and compile
it. The next step is to use the Director tool to run and monitor the Job (Jobs can also be run
from within the Designer but the Job log messages must be viewed from within the Director
client). So, as you’re testing your Jobs it can be a good idea to have both tools open. This
way, you can get very detailed information about your Job and all of the things that are
happening in it.
Administrator Client
Let’s say that we want to create a Job that extracts data from our primary operational system.
We could go to the Administrator tool to set up our Jobs to be able to use RCP and set the
number of purges in the logs for however often we want to run purges.
Still in the Administrator, we could then put in some common environmental variables, for instance, the name of the database that is being accessed, the user ID, and password.
These can be set there and encrypted in a common location so that everyone can use these without necessarily exposing security in your system.
Designer Client
To then create this kind of extract Job, we would next want to go into our Designer tool to first import metadata from the database itself for the table or tables which we’ll be pulling.
Also within the Designer we want to use Stages that will connect us to the database. We do this by dragging and dropping the appropriate Stage onto the Canvas.
Inside the Stage(s), we will use the variables (that we defined in our common environment
using the Administrator tool) to define the Job as we see it needs to be done. We do this by double-clicking the Stage in question and filling out the particulars that it requests.
We’ll pull that data out and store it into either a local repository such as a flat file (By
dragging a Sequential File Stage onto the Canvas and creating a link between the two
Stages) or store it into one of DataStage’s custom Data Sets (A proprietary data store which
keeps the data in a partitioned, ready-to-use, high-speed format so that it doesn’t need to be parsed for re-use as it would otherwise need to be from a flat file).
And then we could, for instance, put it directly into another database (By dragging a Stage onto the Canvas at the end of that data flow).
The database Stages come pre-built to meet the needs of many proprietary databases and
satisfy their native connectivity requirements thus providing high-speed transfer capability
while eliminating the need to do a lot of the configuration.
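The Data Set advantage mentioned above, avoiding a re-parse on every re-use, can be sketched with stdlib tools. This is only an analogy: DataStage's Data Set format is proprietary and partition-aware, whereas here we stand in for it with `pickle`:

```python
import csv
import io
import os
import pickle
import tempfile

# Flat file: every re-read must re-parse text back into typed values.
csv_text = "id,qty\n1,5\n2,7\n"
rows = [{"id": int(r["id"]), "qty": int(r["qty"])}
        for r in csv.DictReader(io.StringIO(csv_text))]

# Data Set-like store: values are saved already typed, so a re-read
# skips the parsing step entirely (illustration only).
path = os.path.join(tempfile.mkdtemp(), "rows.ds")
with open(path, "wb") as f:
    pickle.dump(rows, f)
with open(path, "rb") as f:
    reloaded = pickle.load(f)  # no parsing on re-use
```

The real engine goes further, keeping the data split into its runtime partitions so a downstream parallel Job can pick it up without repartitioning.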
Types of Jobs Let’s talk about the three types of Jobs that we would most likely be
developing. These fit into the categories of Parallel Jobs, Job Sequences (which control other
Jobs), and Server Jobs (the legacy Jobs from earlier versions of DataStage).
Parallel Jobs are executed by DataStage using a parallel engine. They have built-in
functionality for what is called pipeline and partition parallelism. These are means of
efficiency that can be set. For instance, you can configure your parallelism dynamically.
We’ll discuss these things in greater detail later in the tutorial. When Jobs are compiled,
they are compiled into a DataStage-specific language called Orchestrate Scripting
Language, or OSH. The OSH executes various Operators.
Operators are pre-built functions relating to the Stages in our Job design. You can also
create custom Operators using a toolkit and the C++ language. If you have a special need
to write data to a special device, such as one for scanning and tracking control on delivery
trucks, then you could write a custom module to put the data directly into that format
without having to go through any other intermediate tools. The executables that the OSH
runs are implemented as C++ class instances. These are then monitored in the
DataStage Director with the runtime Job monitoring process.
Job Sequences, as we mentioned, are Jobs that control other Jobs. Master Sequencer Jobs
can, for example, kick off other Jobs and other activities including Command Line activities,
other control logic, or looping that needs to go on in order to re-run a job over and over
again. The Job Sequencer can control all your Jobs at once without having to schedule each
individual Job with the scheduler. Although you still have the ability to call any/each Job
individually, you can also group them together through the Job Sequencer and call them
that way.
In addition, you can perform activities based on their failures and do that either specifically,
when a local Job fails or globally, when anything happens and an exception is thrown. A
common Command Line API is provided. Thus, you can embed this into third-party
schedulers and can then execute Jobs in whatever fashion you choose.
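DataStage's command-line interface is the `dsjob` utility. A third-party scheduler might wrap it roughly as below; the `-run`, `-jobstatus`, and `-param` flags are typical of the 8.x CLI, but the project and job names (`ACCProject`, `LoadWarehouse`, `DBName`) are made up for illustration, and you should verify the exact flags against your installation's documentation:

```python
def build_dsjob_command(project, job, params=None):
    """Assemble a dsjob invocation that runs a Job and waits to
    report its status. A scheduler would hand the resulting list
    to subprocess.run()."""
    cmd = ["dsjob", "-run", "-jobstatus"]
    for name, value in (params or {}).items():
        cmd += ["-param", f"{name}={value}"]  # per-run Job parameters
    cmd += [project, job]
    return cmd

cmd = build_dsjob_command("ACCProject", "LoadWarehouse", {"DBName": "ACCDB"})
```

Because the command is just a process invocation, any scheduler that can launch a program, cron, Control-M, or an enterprise equivalent, can drive DataStage Jobs this way.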
The third type of Job is the Server Job. Server Jobs are legacy DataStage Jobs that continue
to exist and are supported as DataStage applications but you may also find them useful for
their string-parsing capabilities. These are executed by the DataStage Server engine and
are compiled into a form of BASIC and then run against the Server engine. Runtime
monitoring is built into the DataStage engine and can also be viewed from the Director
client.
Server Jobs cannot be parallelized in the same way that the Parallel Jobs can be. However,
you can achieve parallel processing capabilities by either sequencing these Jobs with the Job
Sequencer to run at the same time, or using other options to gain efficiencies with their
performance. Legacy Server Jobs do not follow the same parallelization scheme that the
Parallel Jobs do. Therefore, they cannot easily have their parallelism changed dynamically at
runtime as the Parallel Jobs can.
Design Elements of Parallel Jobs One of the nice things about having a visual medium
like DataStage is that you can easily take the notions of Extract, Transform, and Load and
represent them visually in a manner that is familiar to people. For instance, you can drag an
Extract Stage out onto the visual Canvas and drop it on the left of the screen, flow through
transformations in the middle of the Canvas, and place outputs or Loading Stages on the right
side.
DataStage lets you put elements such as Stages and Links in any fashion that you wish,
using an orderly, flow-chart-style, left-to-right, top-to-bottom visual design. This flow makes
future interpretation of your Jobs simple. Those who are new to DataStage are able to look
at, see, and understand the activities and flows that are going on within the DataStage Job.
This also means that other Developers' Jobs will be intuitively understandable to you.
When we start designing a Parallel Job, we should think of the basic elements, which are
Stages and Links. Stages, as we mentioned, get implemented ‘under the covers’ as OSH
Operators (The pre-built components that you don’t have to worry about and that will
execute based on your Job design on the graphical Canvas).
Next, we have Passive Stages, which represent the ‘E’ and the ‘L’ of ETL (Extract and Load).
These allow you to read data and write data. For example, Passive Stages include things
such as the Sequential File Stage, the Stage to read DB2, the Oracle Stage, and Peek Stages
that allow you to debug and look at some of the Job’s interim activity without having to land
your data somewhere. Other Passive Stages include the MQ Series Stage and various
other third-party component Stages, such as SAP and Siebel Stages.
Processing or Active Stages are the ‘T’ of ETL (Transformation). They do the transformation
and other ‘heavy lifting’ work. These include the Transformer Stage, which allows you to
transform data, manipulate column order, and apply a variety of other functions to your data
to enrich it and potentially to validate and pre-cleanse it before applying some of the
QualityStage Stages to it.
You will learn to rely heavily on the Transformer Stage. It provides a variety of functionality,
including the ability to filter rows as well as individually modify each column of data that
goes out. It allows you to propagate the data that goes out in a number of ways and, in
some cases, merge two streams of data, or pieces from two streams, into one common
stream.
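What a Transformer Stage does can be sketched as a constraint (which rows pass) plus a set of per-column derivations (how each output column is computed). This is a conceptual Python illustration, not DataStage's expression language; the `qty`/`price`/`total` columns are invented for the example:

```python
def transformer(rows, constraint, derivations):
    """Conceptual Transformer Stage: keep only rows that satisfy the
    constraint, then compute each output column from its derivation."""
    for row in rows:
        if constraint(row):
            yield {col: derive(row) for col, derive in derivations.items()}

rows = [{"qty": 5, "price": 2.0}, {"qty": 0, "price": 9.0}]
out = list(transformer(
    rows,
    constraint=lambda r: r["qty"] > 0,                       # filter rows
    derivations={"total": lambda r: r["qty"] * r["price"]},  # derive a column
))
```

In the real Stage you fill in the constraint and the derivations through fill-in-the-blank boxes on each output Link rather than writing code like this.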
These are also assisted by specific Stages to do filtering, perform aggregation of the data
(summations or counting of data or finding first and last), generating the data – such as
generating rows or columns, splitting or merging the data (using the ability to direct data
down multiple paths simultaneously, or pulling the data – either by joining it into one
common larger column or by funneling it into multiple rows within the same data. Now that
we’ve talked about some Stages, let’s talk about the things that help the data flow from one
Stage to the Next
Links are the ‘pipes’ through which the data moves from Stage to Stage. DataStage is a
link-based system. Although the Stages are visually predominant on the Canvas, in
actuality, all the action is happening on the links. When you open up a Stage by double-
clicking it, you get various dialogue boxes that refer to input and output (input to that Stage
and output from that Stage) in DataStage. These represent “Links” coming into and out of
the Stage (when viewed from the Canvas).
The settings that you define in these two areas within the Stage affect the Link coming in
and the Link going out of the Stage. Your settings, including the metadata that you chose,
are all kept with the Link. Therefore, when you modify Jobs, it’s easy to move those
pieces of data (all the metadata and accompanying information), because they stay with
the Links, from one Stage type to another without losing data or having to re-enter it.
Pipeline Parallelism The purpose of parallelism is to create more efficiency and get more
work out of your machine(s). This is done in two ways, the first of which is Pipeline
Parallelism. Here, each of the Operators (corresponding to the various Stages that we see
on our Canvas) sets itself up within its own processing space and passes data from
one to the other rapidly, without having to be re-instantiated continually and without having
to do all the work in a given Stage before starting the next portion of the pipelining process.
For example, this allows us to execute transformation, cleansing, and loading processes simultaneously. You can think of it as a conveyor belt moving the rows from process to process. This reduces disk usage for staging areas by not requiring you to land data as often. It uses your processors more efficiently and thus maximizes the investment in your hardware. There are some limits on scalability, depending on how much CPU horsepower is available and how much memory is available to store things dynamically.
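The conveyor-belt behavior described above can be sketched in plain Python with generators (an illustration only; DataStage implements this natively in its parallel engine, and the field names here are invented):

```python
# Illustration only (plain Python, not DataStage): each generator below is a
# "stage" that passes rows downstream as soon as it receives them, like a
# conveyor belt, with no intermediate landing of data to disk.
def extract(rows):
    for row in rows:                 # source stage: emit rows one at a time
        yield row

def transform(rows):
    for row in rows:                 # transform stage: modify each row in flight
        yield {**row, "name": row["name"].upper()}

def load(rows):
    return list(rows)                # target stage: consume the stream

source = [{"name": "alice"}, {"name": "bob"}]        # invented sample rows
result = load(transform(extract(source)))
```

Because each stage consumes rows lazily, the transform begins before the extract has finished, which is the essence of pipelining.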
Partition Parallelism The second form of parallelism is called Partition Parallelism. This is
more of a ‘divide and conquer’ approach in which we divide an incoming stream of data into
subsets to be separately processed by an Operator. Each one of these subsets is called a
partition. Each partition of data is processed by the same Operator. For instance, let’s say
that you want to do a filtering type of operation. Each partition will be filtered exactly the
same way but multiple filters will be set up in order to handle each of the partitions
themselves.
This facilitates near-linear scalability of your hardware. That means that we could run 8 times faster on 8 processors, 24 times faster on 24 processors, and so on. This assumes that the data is evenly distributed and that no other factors, such as network latency or other external issues, limit throughput.
Three-Node Partitioning Data can be partitioned into multiple streams. The way in which we partition will depend upon how we need to process the data. The simplest of the methods is called Round Robin, in which we assign each incoming row to the next partition in turn in order to create a balanced workload. In other words, it is like a card dealer in Vegas dealing the cards in the deck around the table, giving each player the same number of cards.
Other times, we need to segregate the data within the partition based on some key values, whether they are numeric or alpha values. This is known as Hash partitioning (or, with numeric keys, Modulus partitioning), and in this way you keep related rows together by these common values.
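As a rough illustration (plain Python, not DataStage code; the partition count and field names are invented for the example), the two methods described above can be sketched like this:

```python
# Illustration only: Round Robin deals rows out evenly, like a card dealer;
# Hash sends every row with the same key value to the same partition.
def round_robin(rows, n):
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)                    # next partition in turn
    return parts

def hash_partition(rows, n, key):
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)       # same key -> same partition
    return parts

rows = [{"id": i, "state": s} for i, s in enumerate("AABBBA")]
rr = round_robin(rows, 3)              # balanced: two rows per partition
hp = hash_partition(rows, 3, "state")  # every "A" row lands in one partition
```

Round Robin guarantees balance but scatters key values; Hash guarantees co-location of keys but may leave partitions unevenly loaded.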
This is very important when we’re doing things such as sorting the data or removing duplicates based on these key values. Should the data become separated (rows sharing a key value start going into different partitions), the Operation will not work fully correctly: some of the rows that should have been removed may be in another partition and thereby mistakenly ‘kept alive’, moving down another stream in a parallel Operation. Then, at the other end of the partitioning process, when all the data is collected, we still have multiple copies where there shouldn’t be any. If we had put all the rows with the same key into one partition, only one copy would survive, which was our desired outcome.
The kind of partition that you perform matters a lot!
One of the reasons why Round Robin is the default is that if you can evenly distribute the
data, it will be processed much more quickly. If you have to partition by range or some type
of hash partitioning scheme, the quantities of data in each partition may become
imbalanced and one or two of the partitions may finish long before the others and thus reduce performance.
Job Design Versus Execution Let’s look at an example that will give you an idea of what
happens when we design the Job and then what happens when we run it. At the top of the
graphic, you can see that the data is flowing from two Oracle Stages and the data is
brought together inside of a Merge operation. From there, we take the merged data and we
do an aggregation on it and finally we send the data out to a DB2 database.
In the background, let’s say that we want to partition this data four ways (in DataStage terminology, this would be known as running on 4 nodes). Here’s what we would see: for each of the Oracle Stages (toward the left of the data flow) and the other Stages in this design, we would create four instances to handle the four separate streams of data. In this kind of activity, when doing summation, it is extremely important that we partition (segregate) our data, and that we partition it in a way that works with our design.
Here, we want to partition by the values on which we are going to do the summation. This way, we don’t accidentally sum two different things together and lose parts of our rows. Let’s say that we have 5 rows that need to come together for a sum: we want them all in the same partition so that the summation can be done in one place. Otherwise, we might get four values out of this phase (one from each of the 4 instances of the Aggregator Stage), with only one of the partitions summing anything at all!
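To see why key-based partitioning matters for summation, here is a small hedged sketch (plain Python with invented data, not DataStage code): summing independently in each partition is only correct when rows sharing a group key sit in the same partition.

```python
# Illustration only: each partition computes its own partial sums.
def partial_sums(partitions, key, value):
    out = []
    for part in partitions:
        sums = {}
        for row in part:                             # sum within this partition only
            sums[row[key]] = sums.get(row[key], 0) + row[value]
        out.append(sums)
    return out

rows = [{"grp": "A", "amt": 1}, {"grp": "A", "amt": 2}, {"grp": "B", "amt": 5}]
good = partial_sums([[rows[0], rows[1]], [rows[2]]], "grp", "amt")  # keyed partitioning
bad = partial_sums([[rows[0], rows[2]], [rows[1]]], "grp", "amt")   # key "A" split up
```

With keyed partitioning, the total for group A is complete (3). With the split, two partial sums for A emerge from different partitions and would have to be re-aggregated downstream.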
Architectural Setup Options
Now that we've talked about the considerations that we must be aware of as we do our
partitioning both in the design of our Job and then in the choice of partitioning strategy,
let's continue our discussion in a different direction. Note that we will discuss these partitioning strategies in greater detail later in the tutorial.
For now, though, let's talk about some architectural setup options.
This section of the tutorial will talk about some of the underpinnings of DataStage in general and specifically some of the architectural setups that we can use.
In one setup, we have deployed all of the Information Server DataStage components on a
single machine. All of the components are hosted in one environment. This includes the
WebSphere Domain, the Xmeta repository using DB2 (other databases can be used for the
repository such as Oracle or SQL Server), and the DataStage Server itself. The clients can
be stored on the same machine as well if it is a Windows machine. In other words, let’s say
that we are carrying a laptop around: It can contain all the co-located components (and it
could still be connected to by a remote client). What are some reasons for putting all of these components on a single machine? Perhaps that is all the hardware you have available, or perhaps you have very poor network performance and want to avoid the latency of going across the wire to the repository.
Now let’s talk about putting the components on two machines. A typical two-machine setup would place the Metadata repository and the Domain layer on Machine A and the DataStage Server on Machine B, with the clients potentially connecting to Machine B remotely. One of the benefits of this separation is that the metadata repository on Machine A can be utilized concurrently (by a product such as Information Analyzer) without impacting the performance of DataStage as it runs on Machine B.
In a third scenario, the major components are separated onto different machines. The database repository is on Machine A (maybe because you are using a machine that already has your database farms installed on it). The Metadata Server backbone is installed on Machine B so that Information Analyzer and Business Glossary activities don’t risk negatively impacting any development on the DataStage Server: several DataStage development teams can keep working and utilizing their resources most effectively. Finally, you have the DataStage Server on Machine C, connected to by remote clients.
Setting up a new DataStage environment.
In this last section, we talked about some different setup options. It should be said that you must determine the optimal setup for your environment depending on your own conditions and requirements.
In this next section of the tutorial, we are going to talk about DataStage’s architecture
‘under the cover’ and some of the ways that we can set up the product. We’ll also talk about
setting up the DataStage engine itself for the first time using the Administrator client and
see how to then setup DataStage Projects. Within a Project, we’ll see how to import and
export table definitions into that Project. Next we’ll look at Jobs within Projects and talk
about a number of important development objects such as Passive DataStage Stages. For
example, we’ll talk about the Data Set Stage, the Sequential File Stage, and various relational data Stages. We’ll also talk about some Active Stages, which are important when we are trying to combine and/or manipulate data. These include the Lookup Stage, the Join Stage, the Merge Stage, and the Transformer Stage (at the heart of many Jobs).
Then we’ll talk about some key concepts that you’ll want to know about as you develop Jobs, such as the use of Runtime Column Propagation (RCP), how to create Shared Containers for re-usability, and putting together Sequencers to organize all of your Jobs.
Then we’ll talk about ways of implementing and deploying our DataStage application that can make things easier on you and the team. These might include creating Parameter Sets, using various Run options, and using the Command Line Interface (CLI), through which Jobs can be called from a third-party scheduler.
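As a hedged sketch of that last point: a third-party scheduler typically shells out to the dsjob CLI. The project and job names below are invented placeholders, and you should verify the exact dsjob options available in your DataStage version:

```python
# Hedged sketch: calling a DataStage job from a scheduler script via the
# dsjob command-line interface. "dstage1" and "LoadWarehouse" are invented
# placeholder names; check the dsjob reference for your installation.
import subprocess

def build_dsjob_command(project, job, params=None):
    cmd = ["dsjob", "-run", "-jobstatus"]      # -jobstatus waits and reports status
    for name, value in (params or {}).items():
        cmd += ["-param", f"{name}={value}"]   # pass job parameters by name
    return cmd + [project, job]

def run_job(project, job, params=None):
    # Returns the dsjob exit code; 0 conventionally means the job finished OK.
    return subprocess.run(build_dsjob_command(project, job, params)).returncode

# Example (would actually invoke dsjob on a machine with the DataStage engine):
# run_job("dstage1", "LoadWarehouse", {"RunDate": "2024-01-31"})
```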
Of course, we’ll discuss parallelism in more depth. Specifically, we’ll talk about where we apply parallelism, whether within Sequencers or at the Job level, and about strategies for applying it in different ways. Within InfoSphere, we’ll talk in more detail about the two forms of parallelism (Pipeline Parallelism and Partition Parallelism). We’ll see how these work and then what happens when a parallel Job gets compiled.
Finally, we’ll discuss how to apply these concepts and scale our previously constructed DataStage application up or down depending on the resources that we have available to us.
Setting up the DataStage Engine for the First Time When we're setting up a new
DataStage engine, the first thing is to go into the Administrator client and make sure that
some of the basics have been taken care of. When we log on, we are presented with the
Attach to DataStage dialog box shown above. It shows the Domain to which we want to connect (with the corresponding port number of the application server), the user ID and password of the DataStage Administrator with which we will connect, and the DataStage Server name or its IP address.
Often, you will find that the Server’s name is the same as the beginning portion of the Domain name (but this does not have to be the case; it depends on the architecture, the name of the computer, and how it is set up in your company’s Domain Name Server (DNS)).
DataStage Administrator Projects Tab Within the Administrator client, we have a tab
entitled Projects. We use this for each of the individual Projects that we want to set up here
within DataStage. The Command button allows you to issue native DataStage engine
commands when necessary such as the list.readu command which would show you any
locks that are still active in DataStage (as well as any other commands associated with the
DataStage engine or with UniVerse, which was the predecessor of the DataStage engine).
Below that, you can see the Project pathname box. This shows us where the Project directory is located. This is extremely important, as you must know where the Project is located on the machine (the machine on which the DataStage Server has been installed). Let’s take a look at the Properties button. It will show us the settings for the selected Project (datastage1 in the graphic).
The Properties button brings up the dialog box shown above. The various tabs across the top each have their own options. One thing that you’ll probably want to do is to check the first option, which enables job administration in Director. Enabling job administration in Director allows developers to do things such as stopping their jobs, resetting their jobs, clearing job information, and so on. One time that you may NOT want to do this is in a production environment. Otherwise, you typically do want to give people the ability to administer Jobs in the Director client as they see fit.
Next is the option to enable RCP for parallel Jobs. We’ll cover this more in depth a little later in the tutorial; just be aware that this is where you enable it for the datastage1 Project (it is set per Project). The ‘Enable editing of internal references in jobs’ option will help you when dealing with lineage and impact analysis (if you change Jobs or objects over time). The ‘Share metadata when importing from Connectors’ option, again, allows us to perform cross-application impact analysis.
The auto-purge of the job log option is very important. You always want to set up default purge options so that the logs don’t fill up and consume a large amount of disk space on your machine (the machine that the DataStage Server is installed on, not necessarily the machine that houses your clients, such as the Administrator client).
You have your choice of purging by the number of previous runs to keep or by age in days. One strategy is to keep the 5 previous runs in your development and test environments, which may go long periods between runs of a particular Job, and to purge logs older than 30 days in your production environment, where you plan on running your Jobs daily.
Next is a button that allows you to protect a Project. It’s mainly useful in production, where you don’t want people to make modifications to Jobs. Since developers make changes regularly, it’s not very useful for development environments. The
Environment button allows you to set global environment variables for your entire Project.
These will be Project-specific. These will be separate from your overall DataStage
Environment variables, which are global to the entire Server. Project-specific Environment
variables are only for this one particular Project.
The “Generate operational metadata” checkbox allows information about the data, in particular the row counts, to be captured as Jobs run, for later analysis and understanding of the data in its ‘operational’ form. Metadata is usually understood as descriptive (how big is a field, what type of data is it), but operational metadata tells us things like how frequently a Job has been run and how much volume has gone through it. This can be important later as you determine scalability and future considerations such as capacity planning.
When you click the Environment button, it opens the Environment variables dialog box. There are two panes in this window. On the left, we have our categories of environment variables, such as General, which includes the Parallel environment variables. The other category is User Defined. For each category, the pane on the right displays the name of each environment variable, its prompt, and the value that will be used for the entire DataStage Project.
Environment Reporting Variables A standard category holds environment variables for reporting, such as APT_DUMP_SCORE, APT_MSG_FILELINE, and APT_NO_JOBMON. By setting these here, you have all the developers on the Project use them in a standardized way. However, each of these variables can also be set individually in each Job. That is to say, at runtime, you can have just one Job carry a different value from the other uses of that same environment variable throughout the rest of the Project.
DataStage Administrator Permissions Tab This is where we set up users and groups for a variety of authorizations. We would add a user and then assign a product role. The “DataStage Operator to view the full log” checkbox is typically checked unless you have a particular reason not to, such as sensitive data or Operators being overwhelmed by too much information. Normally, we allow Operators to view the full log so that, if they encounter problems during a DataStage Job, they can easily communicate that information to their second line of support when escalating a problem.
Once we create a new user, we can see in the User Role drop-down menu that there are
several roles available for us to assign to that user or group. These consist of Operator,
Super Operator, Developer, and Production Manager.
Let’s continue our discussion by moving over to the Tracing tab. This allows us to perform engine traces. Typically, you will not use this unless you are working with IBM support while experiencing a difficult problem that cannot be solved any other way.
The Parallel tab contains some options. The checkbox at the top allows you to see the
Orchestrate Shell (OSH) Scripting language that has been generated from within your Job
properties box. It creates a tab entitled OSH. Further down, there are some advanced
options for doing things such as creating format defaults for various data types used in the
parallel framework for this Project.
Under the tab entitled Sequence, we can add checkpoints so that Job Sequences are restartable upon failure. We can automatically handle activities that fail, log warnings after activities that finish with a status other than ‘OK’, and log report messages after each Job run.
Work on Projects and Jobs
Once we get our engine configured and set our basic Project properties, the next thing we will want to do is work on a new Project. If we generate all of our DataStage Jobs from scratch, we won’t be re-using anything from previous Projects; however, it often makes sense to leverage work that we have already done.
We may want to import things from a previous Project: program objects such as Jobs, Sequences, and Shared Containers; metadata objects (for instance, Table Definitions); Data Connections; things used globally at runtime, such as Parameter Sets; and other objects, such as routines and transforms, that we want to bring along and re-use.
Exporting and Importing Objects
Export Window In order to make this happen, the first thing that we need to do is to
export it from one environment. Then we will be able to import it. Here in the Export
window, we can highlight certain objects and then add them to our export set.
One important thing to consider is the ‘Job components to export’ drop-down menu shown
highlighted. This will allow you to export the job designs with executables, where applicable.
Let’s say that we are promoting all of these objects to a production environment that does
not have its own compiler. You will then need to have these pre-compiled Jobs ready to run
when they reach their target. You can opt to exclude ‘read only’ items. This is a typical
default.
Next, you would choose which file you would like to export it into. This is a text file that will
‘live’ on your client machine. All of the export functionality is done through client
components. Once we have identified all the objects that we want to export, we simply click
the Export button to begin the export.
We export either into an XML file or the native DSX file. Once this is accomplished, then we
can begin our import. For the import, we would use the DataStage Designer client, click its
import option and select the objects to import from the flat file into our new environment.
Let's take a look at that dialog box.
The file that we are importing may also serve as a backup, in that it allows us to import object-by-object. Our export is exactly like a database export insofar as it allows us to pull out one object at a time. Should someone accidentally delete something in the new environment, instead of losing the entire Project, we only lose one object, which can then be
restored from the export file without losing any of the other work that you’ve done in the meantime.
On the Import screen, you will need to identify the export file. You then have a choice to ‘Import all’ or ‘Import selected’. At that time, you can choose to ‘Overwrite without a query’ if you know that everything you want is the most current. You can also choose to perform an ‘Impact analysis’ to see what objects are related to each other. Once you have set things up, you are ready to begin developing Jobs. Let’s talk about some very important and key components that we will use during development.
Choose Stages
Now that we've talked about setting up our Project, let's focus on building Jobs. In particular, let’s talk about the DataStage Stages that we see on our Canvas. There are two classes of Stages: Passive Stages and Active Stages. First, let's discuss the Passive Stages.
Passive Stages
Passive Stages are typically used for ‘data in’ and ‘data out’ such as working with sequential
files, databases, and Data Sets (Data Sets are the DataStage proprietary high-speed data
format).
Active Stages, in contrast, actually operate on the data: as rows come in, they immediately go out. For example, the Transformer Stage, the Filter Stage, and the Modify Stage pass data along as they receive it, unlike a Passive Stage, which only inputs data or only outputs it. Active Stages manipulate the data. So Passive Stages are beginning or ending points in the data flow, while Active Stages are continuous-flow Stages that go in the middle.
Most notable of the Active type is the Transformer Stage, which has many functions. We'll get to this Stage a little later. First, let's cover the Passive Stages.
Types of File Data Let’s describe the types of file data that we’ll be dealing with as we begin to develop Jobs. Primarily, this falls into three categories: sequential file data (either fixed- or variable-length), Data Sets, which are the DataStage proprietary format for high-speed access, and complex flat file data, which is most notably used with older COBOL-type structures. These allow us to look at complicated data structures embedded inside files, in proprietary formats, much the way that COBOL does.
How Sequential Data is Handled When we are using our Sequential File Stage, it is
important to know that DataStage has an import/export operator working on the data (Not
to be confused with importing and exporting objects, components, and metadata). This
Sequential File Stage’s import and export is about taking the data from its native format,
parsing it, and putting it into an internal structure (within DataStage) so that it can be used
in the Job and then, likewise, on the way out (when writing it), taking that internal structure
and writing it back out into a flat file format.
After the Job has run, we can look in the log and see messages such as how many records were imported successfully and how many were rejected. Records get rejected when they cannot be converted correctly during the import or export. Later, we’ll see how we can push those rejected rows out.
Normally, this Stage executes in sequential mode, but you can use one of the features of the Sequential File Stage to read multiple files at the same time; these reads will execute in parallel. If we set up multiple readers, we can even read chunks of a single file in parallel.
The Stage needs to ‘know’ how the file is divided into rows, and it needs to ‘know’ what the record format is (i.e., how each row is divided into columns). Typically, a field of data within a record is either delimited or at a fixed position. We can use record delimiters (such as the new-line character) and column delimiters (such as the comma) to parse the file.
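Conceptually, the import step works like the following sketch (plain Python, not the actual operator; the delimiters, sample data, and column names are example assumptions):

```python
# Conceptual sketch of the Sequential File Stage's "import" step (not the
# real operator): split on the record delimiter, then on the column
# delimiter, and send unparsable records to a reject list.
def import_flat(text, record_delim="\n", column_delim=",", columns=None):
    rows, rejects = [], []
    for record in filter(None, text.split(record_delim)):
        fields = record.split(column_delim)
        if columns and len(fields) != len(columns):
            rejects.append(record)                  # cannot be converted: reject
        else:
            rows.append(dict(zip(columns, fields)) if columns else fields)
    return rows, rejects

data = "1,Ann\n2,Bob\nbadrow\n"                     # invented sample data
rows, rejects = import_flat(data, columns=["id", "name"])
# rows now hold the internal (named-column) structure; "badrow" is rejected
```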
Job Design Using Sequential Stages The graphic above shows how we can read data from the Selling_Group_Mapping file, send it into a Copy Stage, and load it into the target flat file. However, if anything is not read correctly on the Selling_Group_Mapping link, it will be sent down the dashed reject link entitled Source_Rejects and, in this case, read by a Peek Stage. Over on the right, by the target, anything that isn’t written correctly will be sent out on the Target_File_Target_Rejects link to the TargetRejects Peek Stage. To configure the Sequential File Stage, we would double-click it.
Sequential Source Columns Tab Then, within the Sequential File Stage, on the Output
tab, under the Columns sub-tab, we would configure our Input file column definitions. The
column metadata looks like the metadata that we have elsewhere. In other words all of the
Column Names, Keys, SQL Types, Lengths, Scales, etc. under the Columns sub-tab look like
all of the other metadata that we’ve seen for this file. Next, let's say that we clicked on the
sub-tab entitled 'Properties'.
Under the Properties tab, we are presented with some options. These options help us define
how we will read the file. For instance, there is the file’s name, its read method, and other
options such as whether or not the first line is a column header, whether or not we will reject unreadable rows, etc.
Data Sets The other primary file type that we use is the Data Set. This kind of file stores data in a binary format that is not readable without a DataStage viewer. It preserves the partitioning that we established during our Jobs; the data contained in the Data Set is already partitioned and ready to go into the next parallel Job in the proper format. Data Sets are very important to our parallel operations in that they work within the parallel framework, use all the same terminology, and are therefore accessed more easily.
We use a Data Set as an intermediate point to land data that does not need to be
interchanged with any application outside of DataStage: It will only be used within
DataStage Jobs going from one job to the next (or just within a Job). It is not a useful form
of data storage for sending to other people as an FTP file, or for EDI, or for any of the other
normal forms of data interchange. It is a proprietary DataStage file format that can be used
within the InfoSphere world for intermediate staging of data where we do not want to use the database or a sequential file.
Data Set Management Utility You may need to look at a Data Set to verify that it looks ‘right’ or to find specific values as part of your testing. You can look at a Data Set using the Designer client: there is a tools option that allows you to do Data Set management. It lets you first seek out the file that you want. It then opens a screen which shows how the file is separated: it shows the number of partitions and nodes and how the data is balanced between them. If you would like, it will also display your data. Let’s look at this display.
Data and Schema Displayed A Data Set cannot be read like a typical sequential file by native tools such as the vi editor or Notepad in Windows. When looking at it in the Data Viewer, we see the data in the normal tabular format, just as we view all other DataStage data using the View command.
Connect to Databases Now that we’ve talked about some of the basic Passive Stages for accessing file data, let’s take a look at some of the basic Database Stages that allow us to connect to a variety of relational databases. Before we can access our databases, however, we need to know how to import relational data, so let’s begin there.
To import relational data, we will use the Table Definition import utility within the Designer client, or you can use orchdbutil. The orchdbutil utility is the preferred method for getting correct type conversions. In any situation, however, you need to verify that your type conversions have come through correctly and will suit the downstream needs (they must suit the needs of subsequent Stages and other activities within DataStage).
You will also want to work with Data Connection objects. Data Connection objects store all of our database connection information in one single named object so that, when we go into the other Stages in our Jobs, we won’t need to re-enter every piece of detailed information, such as the user ID and password, for the various databases that we’re connecting to.
Next, we need to see what Stages are available to access relational data. First are the Connector Stages. They provide parallel support, are the most functional, and present a consistent GUI and behavior across all the relational database types.
Then, we have the Enterprise Stages. These are the legacy Stages from previous versions of DataStage. They also provide parallel support, via the parallel extender set.
There are also things called Plug-in Stages: These are the oldest family of connectivity
Stages for databases and other relational sources. These were ported to the current version of Information Server in order to support DataStage Server Jobs and their functionality.
Any of these Stages gives you the ability to select the data you want: we can use SELECT statements as we would in any database. DataStage also provides a feature called the SQL Builder utility that quickly builds up such statements, allowing you to get the data that you desire. When we’re writing data, DataStage has similar functionality that lets you build INSERT, UPDATE, and DELETE statements, also using the SQL Builder.
Import Table Definitions
The first thing that we want to do is to import our table definitions.
To import these table definitions, we can use either ODBC or the Orchestrate schema
definitions. Orchestrate schema imports are better because the data types tend to be more
accurate.
From within the Designer client, you simply click Import > Table Definitions > Orchestrate Schema Definitions. Alternatively, you could choose to import table definitions using the ODBC option. Likewise, you can use the Plug-in table definitions or other sources from which you can import metadata, including legacy information such as COBOL files, which can later be used when pulling that same data out of the corresponding databases.
Orchestrate Schema Import The Orchestrate Schema Import utility does require certain
bits of information including the database type, the name of the database from which you
are pulling it, the server on which it is hosted, username, and password.
ODBC Import When we’re using the ODBC Import, first we select the ODBC data source name (DSN). This must have been set up for you by your System Administrator before you use the screen shown above (otherwise, all the entries will be blank). Most DSNs will need a username and password, although some may come pre-configured.
Connector Stage Types There are several Connector Stage types, including ODBC, which conforms to the ODBC 3.5 standard, is level 3 compliant, and is certified for use with Oracle, DB2 UDB, SQL Server, and various other databases. This also includes the suite of DataDirect drivers, which allow you to connect to these various data sources and give you unlimited connections.
The next Connector type is the DB2 UDB Stage, which is useful for DB2 versions 8.1 and 8.2. There is also a Connector for WebSphere MQ that allows us to connect to MQ Series queues; it can be used with MQ 5.3 and 6.0 for client/server, as well as WSMB 5.0. The last Connector Stage type is Teradata, which gives us fast access to Teradata databases.
Connector Stages Now that we’ve done some of the basic setup for our relational data, we want to be able to use it on our Canvas. You can see, in the graphic above, that an ODBC Connector has been dragged and dropped on the left. Next we would want to configure it so that we can pull our data from an ODBC source. Let’s take a look inside of the Stage.
Stage Editor Inside of the Connector Stage, you can see a Navigator panel, an area in
which we can see the link properties, and then below, we can see the various other
properties of the connection itself. Once we have chosen the connector for our ODBC
Connector Stage, we will then want to configure an SQL statement that will allow us to pull
data. We can use the SQL Builder tool highlighted on the right. Or we can manually enter
our own SQL. Alternatively, we can use SQL that is file-based.
Connector Stage Properties
When configuring the Connector Stage, we should have our connection information and know what SQL we will be using. This may include any transaction information and session management information that we entered (this would be the record count and commit control shown highlighted).
Additionally, we can enter any Before SQL commands or After SQL commands that may
need to occur such as dropping indexes, re-creating indexes, removing constraints, or various other directives to the database.
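The Before SQL / After SQL pattern can be sketched outside of DataStage as well. Below is a minimal illustration in Python using an in-memory SQLite database; the table and index names are hypothetical, and a real Job would issue these statements against the target database through the Connector Stage.

```python
import sqlite3

# Set up a hypothetical target table with an index on it
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (item_id INTEGER, qty INTEGER)")
cur.execute("CREATE INDEX idx_item ON sales (item_id)")

# Before SQL: drop the index so the bulk load runs faster
cur.execute("DROP INDEX idx_item")

# Main statement: the rows the Job writes (SQL Builder would generate this INSERT)
cur.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 10), (2, 5), (3, 7)])

# After SQL: re-create the index once the load is complete
cur.execute("CREATE INDEX idx_item ON sales (item_id)")
conn.commit()

row_count = cur.execute("SELECT COUNT(*) FROM sales").fetchone()[0]  # 3 rows loaded
```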
Building a Query Using SQL Builder When you build a query using the SQL Builder
utility, there are certain things that you’ll need to know: Be sure that you are using the
proper table definition and be sure that the Locator tab information is specified fully and
correctly. You can also drag the table definition to the SQL Builder Canvas. And, you can
drag on the columns that you want to select.
Data Connection Data Connections are objects in DataStage that allow us to store all of
the characteristics and important information that we need to connect to our various data
sources. The Data Connection stores the database parameters and values as a named
object in the DataStage repository. It is associated with a Stage type. The property values
can be specified in a Job Stage of the given type by loading the Data Connection into that
Stage.
Creating a New Data Connection Object You can see the icon that represents the Data
Connection highlighted above.
Select the Stage Type Inside the Data Connection, we can select the Stage type that
we’re interested in using and that we want to associate this to. Then we specify any
parameters that are needed by that particular database. Obviously, different databases
have different required parameters, so the selection of the Stage type is important.
Loading the Data Connection Once we have built our Data Connection, then we will want
to load the Data Connection into the Stage that we have selected.
Active Stages
Lookup, Merge, Join Stages
Aside from Passive Stages, our other main classification of Stages is the Active Stage.
Some of the Active Stages that we’ll be looking at provide functionality such as data combination, data transformation, data filtering, metadata modification, and a number of other operations for which there are corresponding Active Stages.
Data Combination First, let’s talk about data combination. For this, the three Stages that
we use are the Lookup Stage, the Merge Stage, and the Join Stage. These Stages combine
two or more input links. These Stages differ, mainly, in the way that they use memory.
They also differ in how they treat rows of data when there are unmatched key values. Some
of these Stages also have input requirements such as needing to sort the data or de-
duplicate it prior to its combination.
The Lookup Stage The Lookup Stage requires exactly one primary input link, but you can have multiple reference links. It can have only one primary output link (and an optional reject link if that option is selected).
Lookup Failure Actions The Lookup Stage is limited to one output link; however, with its lookup failure options, you can include a reject link. Other lookup failure options include continuing the row of data through the process, passing the data through with a NULL value in place of what should have been looked up, dropping the row, or failing the Job (which aborts the Job).
The Lookup Stage can also return multiple matching rows, so you must be careful how you
use it. The Lookup Stage builds a Hash table in memory from the Lookup file(s). This data is
indexed by using a hash key, which gives it a high-speed lookup capability. You should
make sure that the Lookup data is small enough to fit into physical memory.
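As a conceptual sketch (not DataStage itself), the hash-table lookup and the failure actions can be illustrated in Python. The column names and data below are hypothetical.

```python
# Reference rows are loaded once into an in-memory hash table keyed on the
# lookup key -- this is why the reference data must fit in physical memory.
reference_rows = [
    {"warehouse_item": 101, "description": "Widget"},
    {"warehouse_item": 102, "description": "Gadget"},
]
lookup = {r["warehouse_item"]: r["description"] for r in reference_rows}

def do_lookup(primary_rows, failure_action="continue"):
    """Probe each primary row against the hash table, applying one of the
    lookup failure actions when no match is found."""
    out, rejects = [], []
    for row in primary_rows:
        desc = lookup.get(row["item"])
        if desc is not None:
            out.append({**row, "description": desc})
        elif failure_action == "continue":   # pass the row through with NULL
            out.append({**row, "description": None})
        elif failure_action == "drop":       # silently discard the row
            pass
        elif failure_action == "reject":     # send it down the reject link
            rejects.append(row)
        elif failure_action == "fail":       # abort the Job
            raise RuntimeError(f"lookup failed for row {row!r}")
    return out, rejects

primary = [{"item": 101, "qty": 3}, {"item": 999, "qty": 1}]
matched, rejected = do_lookup(primary, failure_action="reject")
```

Here the row with item 101 picks up its description, while the unmatched item 999 goes to the reject link.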
Lookup Types There are different types of Lookups depending on whether we want to do
an equality match, a caseless match, or a range lookup on the reference link.
Lookup Example Let’s see how a Lookup Stage might typically fit into a Job. We have a
primary input coming in from the left and the data flows into the Lookup Stage. Coming in
from the top is our Reference data. Notice that the reference link is dashed (a string of
broken lines). As you are dragging objects onto the Canvas and building the Job, always
draw the Primary link before the Reference link. Then coming from our Lookup Stage is the
output link going to the target. What would happen if we clicked on the Lookup Stage to see
how it's configured? Let's take a look.
Lookup Stage With an Equality Match Within the Lookup Stage, there is a multi-pane
window. Everything on the left represents the data coming into the Lookup Stage – both
Primary link and Reference link(s). Everything on the right is coming out of the Lookup
Stage. On the top of the screen is more of a graphical representation of each row. Toward
the bottom of the screen is a more metadata-oriented view in which you can see each of the
links and their respective characteristics. This same metaphor applies to most other Stages
as well.
Notice that the metadata references (shown highlighted lower left) look like the table
definitions that we used earlier to pull metadata into the DataStage Project.
Highlighted toward the top is the place where you connect column(s) from the input area (highlight Item input link) by dragging them down and dropping them onto the corresponding Reference link, connecting them to the equivalent fields there (highlight on Warehouse Item in the lower blue area). In this case, we are using a single column to form an equality: where a “Warehouse Item” is equal to an “Item”.
Next, we can move all the columns from the input area on the left to the output link area on
the right but replace this Item (highlight on the right box) description with the Warehouse
Item description (the one that we found in the Reference link). So, we are taking all of our
input columns and we are adding a new column to it based on what we looked up on the
Warehouse Item ID.
When you click on the icon with golden-chains (shown highlighted) it gives you a new dialog
screen.
Specifying Lookup Failure Actions This dialog box shows us the constraints that we can use. Here we can determine what actions to take should our lookup fail: you can continue the row, drop the row, fail (abort the Job), or reject the row. Also, if you use a condition during your lookup (highlight on fail under conditions not met), you can use these same options of continue, drop, fail, or reject when your condition is not met.
Join Stage The Join Stage also combines data. There are four types of joins: inner, left outer, right outer, and full outer. The input links must be sorted before they come into a Join Stage. The Join Stage uses much less memory and is known as a ‘light-weight’ Stage: the data doesn't need to be indexed because it comes into the Stage already sorted. We must identify a left and a right link as they come in, and the Stage supports additional ‘intermediate’ links. Like a join in an SQL query, the Join Stage uses key columns to pull our data together.
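The ‘light-weight’ behavior can be sketched conceptually: because both inputs arrive sorted on the key, the rows can be combined in a single forward pass with no in-memory index. The sketch below (hypothetical data, assuming unique keys on each input for simplicity) shows an inner join; a real Join Stage also handles duplicates and the outer-join variants.

```python
def sorted_inner_join(left, right, key):
    """Merge-join two lists of dicts that are already sorted on `key`."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1                      # left row has no match yet; advance
        elif lk > rk:
            j += 1                      # right row has no match yet; advance
        else:
            out.append({**left[i], **right[j]})  # keys match: combine the rows
            i += 1
            j += 1
    return out

left = [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}, {"id": 4, "qty": 1}]
right = [{"id": 2, "desc": "Gadget"}, {"id": 3, "desc": "Sprocket"}, {"id": 4, "desc": "Widget"}]
joined = sorted_inner_join(left, right, "id")
```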
Job with Join Stage Above is an example of a Job that uses the Join Stage. Notice the
green Join Stage icon in the middle. Similar to the Lookup Stage, here, we have a primary
input, a secondary input, and an output. But, in this case, notice that the right outer link
input is not a dashed line (Highlight on the upper link). Remember that the reference link
that we saw earlier did have a dashed line.
Join Stage Editor When we get inside the Join Stage, we will use the Properties editor.
Similar editors are found in other Stages. Here, we will use the attributes necessary to
create the join. We will specify the join type that we want and what kind of key we will be
joining on.
The Merge Stage The Merge Stage is quite similar to the Join Stage. Much like the Join Stage, its input links must be sorted. Instead of a left and a right link, we have a 'Master' link and one or more Secondary links. It is a ‘light-weight’ Stage in that it uses little memory (this is because we expect the data to already be sorted before coming into the Stage, so there are no keys in memory [no indexes to be used]). Unmatched Master rows can be kept or dropped; keeping them gives the effect of a left outer join. Unmatched Secondary rows can be captured in a Reject link and dispositioned accordingly.
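A conceptual sketch of this Master/Secondary behavior, with hypothetical data (a real Merge Stage streams both sorted inputs rather than holding the updates in a hash, as done here for brevity):

```python
def merge_stage(master, updates, key, keep_unmatched_masters=True):
    """Combine a Master link with one Secondary (update) link.
    Unmatched masters are kept or dropped; unmatched updates are rejected."""
    upd = {u[key]: u for u in updates}   # updates assumed sorted and unique on key
    out, matched = [], set()
    for m in master:
        u = upd.get(m[key])
        if u is not None:
            out.append({**m, **u})       # enrich the master row with update columns
            matched.add(m[key])
        elif keep_unmatched_masters:     # this gives the left-outer-join effect
            out.append(m)
    rejects = [u for k, u in upd.items() if k not in matched]
    return out, rejects

master = [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}]
updates = [{"id": 2, "desc": "Gadget"}, {"id": 3, "desc": "Sprocket"}]
out, rejects = merge_stage(master, updates, "id")
```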
When the data is not ‘pre-sorted’, the sorting can be done by using an explicit Sort Stage or
you can use an On-link sort. If you have an On-link sort, then you will see its icon on the
Link (The icon is shown in the lower-left of the graphic). The icon toward the right of each
link above is the partitioning icon (Partitioning it out, it looks like a fan, and then collecting
the partitioned data back up).
Coming out of the Merge Stage, we have a solid line for the main output and a dashed line
for the rejects. The latter will capture the incoming rows that were not matched in the
merge.
Merge Stage Properties Inside the Merge Stage, you must specify the key attributes by which you will pull together the two sets of data, what to do with the unmatched Master records, whether to warn on rejected updates, and whether to warn on unmatched Masters. This is another way to combine data, which will create more columns within one row and, in some cases, produce multiple rows.
The Funnel Stage The Funnel Stage provides a way to bring many rows from different
input links into one common output link. The important thing here is that all sources must
have identical metadata! If your links do not have this, then they must first go through
some kind of Transformer Stage or Modify Stage in order to make all of their metadata
match before coming into the Funnel Stage.
The Funnel Stage works in three modes. In Continuous mode, records are combined in no particular order: the first one coming into the Funnel is the first one put out. In Sort mode, the input records are combined in an order defined by keys; this produces a sorted output if all input links are sorted by the same key. Sequence mode outputs all of the records from the first input link, then all from the second input link, and so on (based on the number of links coming into your Funnel Stage).
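The three modes can be sketched conceptually in Python, with each input link represented as a list of dicts sharing identical metadata (the data is hypothetical):

```python
import heapq
import itertools

link1 = [{"id": 1}, {"id": 4}]   # already sorted on "id"
link2 = [{"id": 2}, {"id": 3}]   # already sorted on "id"

# Sequence mode: all of link1, then all of link2, and so on
sequence_mode = list(itertools.chain(link1, link2))

# Sort mode: combine by key; the output is sorted because every input
# link was already sorted on the same key
sort_mode = list(heapq.merge(link1, link2, key=lambda r: r["id"]))

# Continuous mode has no guaranteed order: a real run interleaves the
# links nondeterministically as rows arrive, so it is not modeled here.
```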
Funnel Stage Example Above, you can see an example of the Funnel Stage’s use.
Let’s continue our discussion on Active Stages. Here, we’ll briefly talk about the Sort Stage and the Aggregate Stage.
Obviously, the Sort Stage is used to sort data that requires sorting, as was the case with
the Join Stage and the Merge Stage. As mentioned, sorts can be done on the input link to a
given Stage. This on-link sort is configured within a given Stage on its Input link’s Partitioning tab. Just set the partitioning to anything other than Auto.
Alternatively, you can have a separate Sort Stage. One of the advantages of the Sort Stage is that it gives you more options for controlling memory usage during the sort. The advantage of on-link sorting (within the Stage) is that it is quick and easy and can often be done in conjunction with any partitioning that you’ll be doing.
Sorting Alternatives The Sort Stage is highlighted above. In this example, we have sequential data coming into a Sort Stage before it is moved to a Remove_Duplicates Stage and then sent out to a Data Set. Below, we have a different example in which a file is being sorted directly into a Data Set, but in this case the sort is happening directly on the link (an on-link sort).
Our next Active Stage is the Aggregator Stage. The purpose of this Stage is to perform data
aggregations. When using it, you specify one or more key columns, which define the
aggregation units (or groups). Columns to be aggregated must be specified. Next, you
specify the aggregation functions. Some examples of functions are counting values (such as
nulls/non-nulls), summing values, or determining the maximum, the minimum, or ranges of
values that you are looking for. The grouping method (hash table or pre-sort) is often a performance consideration. If your data is already sorted, the pre-sort is unnecessary. If you don’t want to take the time to sort your data prior to aggregation, the default hash option allows you to pull the data in without any sorting ahead of time.
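The hash grouping method can be sketched conceptually: each group is accumulated in a hash table keyed on the grouping column(s), so no pre-sort is needed. The column names, data, and chosen functions (count, sum, max) below are hypothetical.

```python
from collections import defaultdict

rows = [
    {"region": "East", "amount": 10},
    {"region": "West", "amount": 5},
    {"region": "East", "amount": 7},
]

# One accumulator per group key, created on first sight of the key
groups = defaultdict(lambda: {"count": 0, "total": 0, "max": None})
for r in rows:
    g = groups[r["region"]]
    g["count"] += 1                                   # count of rows in the group
    g["total"] += r["amount"]                         # running sum
    g["max"] = r["amount"] if g["max"] is None else max(g["max"], r["amount"])
```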
Job with Aggregator Stage In this Job we can see the Aggregator Stage’s icon. Data
flows from a Data Set (at the top of the data flow) and is copied two ways. One way (to the
right) is sent to a Join Stage. Another way (straight down) is sent to an Aggregator Stage to
aggregate the row counts and then (flowing from the Aggregate to the Join) is passed back
into the same Join so that each row that came into this Job can also have, attached to it,
the total number of rows that are being counted. The Aggregator Stage uses a Greek Sigma
character as a symbol for summation.
Remove Duplicates Stage Our next Active Stage is the Remove Duplicates Stage. The process of removing duplicates can be accomplished using the Sort Stage with the Unique option. However, this leaves us little choice as to which duplicate to keep: a Stable sort always retains the first row in the group, while with a Non-Stable sort it is indeterminate which row will be kept. The alternative is to use the Remove Duplicates Stage. This gives you more sophisticated ways to remove duplicates. In particular, it lets you choose whether you want to retain the first or the last of the duplicates in the group.
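The first-versus-last choice can be sketched conceptually (hypothetical data; the input is assumed already sorted on the key, as the Stage requires):

```python
def remove_duplicates(sorted_rows, key, retain="first"):
    """Keep one row per key value: either the first or the last seen."""
    kept = {}
    for row in sorted_rows:
        k = row[key]
        if retain == "last" or k not in kept:
            kept[k] = row        # 'last' keeps overwriting; 'first' keeps the original
    return list(kept.values())

rows = [{"id": 1, "v": "a"}, {"id": 1, "v": "b"}, {"id": 2, "v": "c"}]
first_kept = remove_duplicates(rows, "id", retain="first")
last_kept = remove_duplicates(rows, "id", retain="last")
```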
This sample Job shows an example of the Remove Duplicates Stage, and we can see what its icon looks like. From the Sort Stage (toward the left of the graphic), the data comes into a Copy Stage toward the bottom, which then runs the sorted data up into the Remove Duplicates Stage. The data is then sent out to a Data Set.
An Important Active Stage: The Transformer Stage One of the most important Active
Stages is the Transformer Stage. It provides for four things: Column Mappings, Derivations,
Constraints, and Expressions to be referenced.
It is one of the most powerful Stages. With column mappings, we can change the metadata, its layout, and its content. Also within the Transformer Stage are derivations. These are written in a BASIC-like syntax, and the final compiled result is generated C++ object code. There are a number of out-of-the-box derivations that you can use, allowing you to do things such as string manipulation, character detection, concatenation, and other activities. You can also write your own custom derivations and save these in your Project for use in many Jobs.
Other features of the Transformer include Constraints. The Constraint allows you to filter
data much like the Filter Stage does. You can direct data down different Output links and
process it differently or process different forms of the data.
The Transformer's expressions are what constraints and derivations reference. We can use the input columns coming into the Transformer to determine how we want to either derive or constrain our data. We can use Job parameters and/or functions (whether the native ones provided with DataStage or ones that you have custom-created). We can use system variables and constants, Stage variables (local in scope to just the Stage, as opposed to being global for the whole Job), and we can also use external routines within our Transformer.
Job with a Transformer Stage The Transformer Stage has an icon that looks like a T
shape with an arrow. In the example Job above, we have the data coming into the
Transformer. There is a reject link coming out of the Transformer (performed with the
Otherwise option) and two other links coming out of the Transformer as well to populate
two different files.
Inside the Transformer Stage Inside a Transformer Stage we have a look-and-feel
similar to that of the Lookup Stage that we talked about earlier, in that we have an Input
link (highlighted on the left) and all of our Output links on the right. These two output links (the two boxes on the right) give us a graphical view, show the column names, and also give us places to enter any derivations. The top area (highlight on upper-most blue box on the right) allows us to create Stage variables, which we can use to make local calculations in the process of doing all of our other derivations.
Again, as with the Lookup Stage, the bottom half of the screen resembles the table
definitions that you will see inside of your repository (highlight on center lower half). This
gives you the very detailed information about the metadata on each column.
(Yellow highlight on the Golden-Chain button in the toolbar at the top) When you click the Golden-Chain button, it brings you to the section that allows you to define Constraints.
Defining a Constraint You can type your constraint into the highlighted area, which also uses BASIC-like terminology. Here you can create an expression that will put certain data into one link, into a different link, or into no link at all. When you right-click in any of the white space, DataStage brings up options for your Expression Builder (highlighted toward the left).
Alternatively, you can also use the Ellipsis button to bring up the same menu. The example above shows an UpCase function and a Job parameter named Channel_Description. At the beginning of the expression, we are using an input column, which we compare to an upper-cased version of one of our Job parameters.
Defining a Derivation When we are defining or building a derivation, we also use the BASIC-like code to create expressions within a particular cell. These derivations can occur within the Stage Variable area or down within any one of the Links. As with the constraints, a right-click or clicking the Ellipsis button brings up the context-sensitive code menu to help you build your expression.
If Then Else Derivation Using our Transformer Stage, we can also build up “if, then, else” derivations, where we can put conditional logic that will affect the data on our output link.
String Functions and Operators Some things included in the Transformer Stage’s
functionality are string functions and operators. The Transformer can use a Substring
operator, it can Upcase/Downcase, and/or it can find the length of strings. It provides a
variety of other string functions as well such as string substitution and finding positional
information.
Checking for NULLs Other out-of-the-box functionality of the Transformer includes
checking for NULLs. DataStage can identify whether or not there are NULLs, assist us in how
we want to handle them, and can do various testing, setting, or replacing NULLs as we see
fit.
Transformer Functions Other functions of the Transformer include those for Date and
Time, logic, NULL handling, Numbers, Strings, and those for type-conversion.
Transformer Execution Order It is important to know that, when a Transformer is executed, there is a very specific order of execution for each row. Everything has a top-down flow:
1. Derivations in Stage variables (from top to bottom)
2. Constraints for each output link (from top to bottom)
3. Within each link, the column derivations (from top to bottom)
Earlier links execute before later links, and derivations in higher columns are executed before those in lower columns.
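DataStage compiles Transformer logic into its own generated code; the per-row order can nevertheless be sketched conceptually in Python. The column names, links, and derivations below are hypothetical.

```python
def transform_row(row):
    # 1. Stage variables, top to bottom -- a later variable may use an earlier one
    sv_total = row["price"] * row["qty"]
    sv_big = sv_total > 100

    outputs = {}
    # 2. Constraint on the first output link
    if sv_big:
        # 3. Column derivations within that link, top to bottom --
        #    a lower column may use the result of a higher one
        total = sv_total
        discounted = total * 0.9
        outputs["big_orders"] = {"total": total, "discounted": discounted}
    # 2. Constraint on the next link
    else:
        outputs["small_orders"] = {"total": sv_total}
    return outputs
```

Because the `discounted` column is derived after (below) `total`, it can reuse that result; re-ordering the columns would break this, which is exactly the re-arrangement hazard the tip warns about.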
Let’s look at a quick example of how this might work. At the beginning, we execute all the Stage variables (from top to bottom) and then evaluate each constraint (from top to bottom). For each constraint that is satisfied, the Transformer fires the corresponding output link, and within that link all the column derivations run from top to bottom. This means that, if we compute something in a column above, we can use its result in a column below. TIP: You must be very, very careful if you ever re-arrange your metadata, because the order of execution could affect your results within a Transformer.
Transformer Reject Links You can also have Reject links coming out of your Transformer.
Reject links differentiate themselves from other links by how they are designated inside the
constraint. Let’s quickly see how this is done.
Off to the right, we have the “Otherwise Log” option or heading. Checking the respective checkbox creates a Reject link as necessary.
Our “Otherwise” link appears as a straight line (such as the Missing Data link shown above), whereas a “Reject” link has a dashed line (such as the link all the way to the left).
Advanced Concepts When Building Jobs
We’ve discussed the key objects in DataStage and we have set up DataStage. We’ve been
able to use Passive Stages to pull data in from various files or relational database sources and we’ve talked about various Active Stages that let us manipulate our data.
Now, we need to cover a few other important concepts that will help us in our development of DataStage Jobs.
• Runtime Column Propagation (RCP)
• Shared Containers
• One of the most important concepts is that of Runtime Column Propagation (RCP).
When RCP is turned on, the columns of data can flow through a Stage without being
explicitly defined in the Stage. We’ll see some examples of this later.
• With RCP enabled, the target columns in a Stage need not have any columns
explicitly mapped to them: No column mapping is enforced at design-time.
• With RCP, the input columns are mapped to unmapped columns by name. This is all
done ‘behind the scenes’ for you.
• Let’s talk about how the implicit columns get into a Job. The implicit columns are read from sequential files associated with schemas, or from tables read from relational databases using a SELECT, or they can be explicitly defined as output columns in a Stage earlier in the data flow of our Job.
• The main benefits of RCP are twofold. First, it gives us greater Job flexibility: a Job can process inputs with different layouts, so we don’t have to have a separate Job for multiple record types that are just slightly different.
• Second, RCP gives us the ability to create more re-usability within components, such
as the Shared Containers (There are many ways to create re-usable bits of
DataStage ETL code). This way you can create a component of logic and apply it to a
single named column while all the other columns will flow through untouched.
• When we want to enable RCP, it must be enabled at the Project-level. If it’s not
enabled there, then you won’t find the option in your Job. At the Job level, we can
enable it for the entire Job or just for individual Stages. Within each Stage, we look
at the Output Column tab in order to decide whether or not we want to use RCP.
• You must enable RCP for the entire Project if you intend to use it. If you do, you can also choose whether each newly created Link uses RCP by default. This default setting can be overridden at the Job level or at the Link level within each Stage.
Here’s an example of where we set up and enable RCP within a single Stage (see the
highlighted checkbox on the ‘Enabling RCP at Stage Level’ slide). When you see this
checkmark, you know that columns coming out of this Stage will be RCP-enabled.
When RCP is Disabled
Let’s say that we had four columns coming into this Stage on the input link (shown above
on the left) and four columns going out (on the right). If we only drew derivations to two of
the columns (the upper two on the right), the two lower columns would remain in a RED
state (incomplete derivations) and this Job will not compile. However, let’s see what we can
do with RCP.
When RCP is enabled, we can see that the output link’s columns (lower right) no longer
have anything in the red state. By name matching, DataStage will automatically assign the
two columns on the input link (on the left) to the two corresponding columns on the output
link (lower right).
Alternatively, we could leave these two columns off the output link entirely (the ones in the
lower right of the previous graphic) and they would be carried invisibly through subsequent
Stages. This allows one Job to handle row types that have the upper two columns explicitly
but also carry the lower two (or more) columns implicitly.
In other words, we can run this Job one time with a record that has four columns, another
time with a record that has eight columns, and another time with six columns.
The only thing that all of these records have to share is that the very first two columns
(highlighted at the top) must be the same and must be explicit.
Runtime Column Propagation (RCP) In Detail
Concepts such as Runtime Column Propagation (RCP), the use of schemas, and the use of
Shared Containers (the chunks of DataStage code that can be re-used) give you greater
flexibility and help create greater re-usability within your DataStage Jobs, so they are worth
discussing in greater detail.
Even people who have been in the field for years can have difficulty grasping both the power
and the pitfalls of RCP, so let’s review and continue our discussion of it.
RCP, when it is turned on, will allow columns of data to flow through a Stage without being
explicitly defined in the Stage. Target columns in the Stage do not need to have any
columns explicitly mapped to them. Therefore, no column mapping is enforced at design-time when RCP is turned on. Input columns are mapped to unmapped columns by name.
RCP is how implicit columns get into a Job. We can define their metadata by using a schema
file in conjunction with a Sequential File Stage. As you’ll recall, a schema file is another way
to define our metadata much like a table definition. Using this schema, we can work with a
Sequential File Stage or the Modify Stage to take unknown data and give it an explicit metadata tag.
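For illustration, a schema file used with a Sequential File Stage might look something like
the sketch below (the column names, lengths, and delimiter properties are invented; the
exact syntax follows DataStage’s schema-file format):

```text
record
  {final_delim=end, delim=',', quote=double}
(
  CUST_ID: string[max=10];
  CUST_NAME: string[max=50];
)
```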
We can also pull in our metadata implicitly when we read it from a database using our
“select” statement.
Finally, a column can be explicitly defined as an output column in an earlier Stage in our data flow (a previous Stage in the Job design).
The main benefits of RCP are Job flexibility, so that we can process input with different or
varying layouts, and re-usability. RCP works very well within DataStage’s Shared Containers (the user-defined chunks of DataStage ETL applications).
Let’s look at an example of RCP so that we can better understand it.
The graphic above summarizes the main points of RCP but by revisiting the example that we
looked at a few moments ago, we can gain a deeper understanding of RCP. So let’s review
our example in which RCP is NOT used.
We’re looking within a Transformer Stage at a typical example of columns when RCP has not
been enabled. Let’s review how it functions without RCP. We can see four columns on the
input link to the left. To the lower right, we can see four outgoing columns (on the output
link). But the two lower columns, SPEC_HANDLING_CODE and DISTR_CHANNEL_DESC,
aren’t connected to the input columns and don’t have any derivations. These two columns
are therefore left in a red color, indicating incomplete derivations, and this Job won’t
compile.
When RCP is Enabled
Now consider the exact same situation, but with RCP enabled. With the same Transformer,
the bottom two columns are no longer in a red state and therefore this Job will compile.
When this Job runs, the two columns on the bottom left will be assigned, by name
reference, to the two columns on the bottom right.
If we had additional explicitly named input link columns beyond the four highlighted on the
left, they would be carried invisibly along to the next Stage, where they could be used or
not used as necessary. This removes the need for individual column-checking at design-
time. It has mixed benefits: it gives us a lot of flexibility, but it can create headaches at
runtime if you are not aware of data that is passing through your Job, and that can
negatively impact downstream activities. Thus you must be very careful when using RCP.
Important things to know when using RCP are to understand what data is coming in, what
metadata needs to get processed, and what data is going out of the Stage, both in terms of
the number of columns and the metadata types. If they do not match, then what you are
bringing in will, at some point in the processing, need to be modified to work with the final
target metadata of the Job.
For instance, if you bring in one column as numeric data but it needs to go out as character
data, you’ll need to perform some type of conversion function on it. If a column comes in
with one length but needs to go out with a longer length, you’ll need to handle it within
either a Transformer Stage or a Modify Stage. Likewise, if you have a longer column coming
in, you’ll need to make sure that it is truncated properly to fit into the outgoing column
space so that it doesn’t overflow and create a runtime failure.
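The kind of adjustment described above can be sketched in plain Python (illustrative only; in an actual Job you would use Transformer or Modify Stage functions instead):

```python
def to_target_column(value, target_len):
    """Convert a numeric input to character data, then truncate it so it
    cannot overflow the outgoing column's length at runtime."""
    s = str(value)            # numeric -> character conversion
    return s[:target_len]     # truncate to fit the target column width

assert to_target_column(1234567, 5) == "12345"  # longer input is truncated
assert to_target_column(42, 5) == "42"          # shorter input fits as-is
```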
Business Use Case Scenario
Let’s talk about a good use of RCP in business terms. Say that you have a large
conglomerate with five companies underneath it in a corporate hierarchy. The conglomerate
wants to take data from all five of these companies’ operational systems and populate a
data warehouse with Customer Information. Each of the sub-systems has a different layout
for its Customer data, but they all have one thing in common: each has its own Customer
ID field. However, these Customer ID fields may overlap: the value ‘123’ at Company A may
represent a totally different Customer than the value ‘123’ at Company B, even though they
are the same value. The conglomerate has a single Customer ID that can track each of
them distinctly. But each of these sub-companies will not know what their corporate
conglomerate Customer ID is until it comes time to pull all of this data into the warehouse;
their systems weren’t built knowing about any corporate conglomerate Customer ID. The
conglomerate, however, knows about it and can pull the customer data from all of the
sub-companies, add another column to say from which company it came, and then process
the data through a DataStage Job which relates the sub-system’s ID to the corporate ID via
a cross-reference that is stored in the corporate data warehouse.
Two columns would be needed: the source system ID and the source system Customer ID.
The Job could then use this information to do a Lookup against a reference from the
conglomerate’s warehouse, which would in turn find the corresponding corporate
conglomerate Customer ID.
It would then bring this data together so that it could be written out with the new number
attached, and that data could then be processed subsequently, knowing to which corporate
Customer ID this particular sub-system’s row belongs.
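As a minimal sketch, the two-column Lookup might behave like the plain Python below (the cross-reference values and column names are invented for illustration):

```python
# Cross-reference held in the corporate warehouse:
# (source system ID, source Customer ID) -> corporate Customer ID
xref = {
    ("A", "123"): "CORP-0001",   # Joe Smith at Company A
    ("B", "123"): "CORP-0002",   # a different customer at Company B
}

def add_corporate_id(row):
    """Attach the conglomerate Customer ID found via the Lookup."""
    key = (row["SOURCE_SYSTEM_ID"], row["CUST_ID"])
    row["CORP_CUST_ID"] = xref[key]
    return row

row = add_corporate_id({"SOURCE_SYSTEM_ID": "B", "CUST_ID": "123"})
assert row["CORP_CUST_ID"] == "CORP-0002"
```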
So, if we were to process Customer records from the five companies separately, one
company might have 10 columns in its Customer table while another has 20. We are only
going to explicitly identify 2 of those columns coming in and only 3 of those columns going
out of our Stage.
If we do this in a file-to-file example, we will read the file coming in, process the data, then
write it out. Even though we only see, in our Job, the 2 or 3 columns that we have explicitly
identified, when we write our output file, all of the columns that came from the input file
will be written, in addition to the new conglomerate Customer ID that we looked up.
This is the case with RCP on. We will need only one Job instead of having to write 5
separate Jobs, one for each of the specific data layouts coming from each of the
sub-companies.
Shared Containers
Shared Containers are encapsulated pieces of DataStage Job designs. They are components
of Jobs that are stored as separate container objects and can then be re-used in various
other Job designs. We can apply stored Transformer business logic to a variety of different
situations. In the previous example we performed some logic matching Customer
Information from sub-companies to a corporate Customer ID. This logic could be
encapsulated, stored, and then re-applied.
Creating a Shared Container
In the example above, the Developer has selected the Copy Stage and the Transformer
Stage to become a Shared Container. Clicking (E)dit on the menubar, then the (C)onstruct
Container option, and then (S)hared completes the task. The same Job will then have a
slightly different look. Let’s see what it would look like.
Using a Shared Container in a Job Above, you can see the Shared Container icon where
the selected Stages used to be. This same Shared Container can now be used in many other
Jobs. Shared Containers can use RCP although they don’t have to do so. The combination of
Shared Container and RCP is a very powerful tool for re-use. Let’s examine the concept of
combining the two.
If we create a Shared Container in conjunction with RCP, then the metadata inside the
Shared Container needs to include only the columns necessary for the functions that will occur in that Shared Container.
Recall for a moment the example that we were talking about with the conglomerate and the
five sub-companies.
If that Shared Container is where we did our Lookup to find our corporate conglomerate
number, all we really needed to pass into the Shared Container explicitly were the 2
columns we knew about; any other columns (regardless of how many there are – different
sub-companies have different numbers of columns) would be passed in there too.
On the way out of the Transformer (which adds the conglomerate Customer ID), all the
columns would be passed out, along with the newly added corporate identifier (which we
looked up in our Shared Container).
Mapping Input/Output Links to the Container
When mapping the input and output links to the Shared Container, you need to select the
Shared Container link to which the input link will be mapped, as well as the appropriate
columns that you’ll need. By using RCP, we only have to specify the few columns that really
matter to us while we are inside the Shared Container. When we come back out of the
Shared Container, we can re-identify any of the columns as necessary, depending on what
we want to do with them.
Shared Containers Continued
The main idea behind Shared Containers is that they let us create re-usable code that can
be used in a variety of Jobs, maybe even in Jobs that weren’t intended to process just the
Customer data. For instance, we might want to process Order data where we still need to
look up the conglomerate’s Customer ID for the Order. We simply pass in our whole Order
record but expose only the Customer number (the sub-system Customer identifier) and the
source sub-system from which it came, so that the Shared Container can do the Lookup,
produce the appropriate output including the concatenated conglomerate Customer ID, and
tag that onto the Order. Then, as we process this Order into our data warehouse, we will
know not only the source Customer ID that appeared on the billing invoice and other
information from that particular sub-company, but will also be able to tie it to our
conglomerate database so that we can distinguish it from customers of other
sub-companies.
Again, Shared Containers do not necessarily have to be RCP-enabled but the combination of the two is a very powerful re-usability tool.
Job Control
Now that we’ve created the DataStage environment, have talked about how to pull in
metadata, and have seen how to build Jobs, the next topic answers the question “How do
we control all of these Jobs and process them in an orderly flow without having to do each one individually?”
The main element that we use for this is called a Job Sequence. A Job Sequence is a special
type of Job: a master controlling Job that controls the execution of a set of subordinate
Jobs. The subordinate Jobs might even be Sequence Jobs themselves, in which case they
are sub-Sequences, or they can be ordinary Parallel or Server Jobs.
One of the things that a Job Sequence does is pass values to the subordinate Jobs’
parameters. It also controls the order of execution by using Links. These Links are different
from our Job Links, though, in that no metadata is passed on them. A Sequence Link tells
us what should be executed next and when it should be executed. The Sequence specifies
the conditions under which the subordinate Jobs get executed using a mechanism known as
“Triggers”.
We can specify a complex flow of control using Loops, the All or Some options, and we can
also do things such as ‘Wait for File’ before our Job starts. For example, let’s say that we
were waiting for a file to be FTP’ed to our system; we can then sit there in a waiting-mode
until that file comes. When it does arrive, the rest of the activities within that Sequencer will
be kicked off.
The Job Sequencers will allow us to do system activities. For example you can use its e-mail
option that ties into your native system and sends an e-mail. They also allow us to execute
system commands and executables (that we might normally execute from the Command
Line). Any Command Line options can be executed from within the Sequencer. For example,
you could send a message to your Tivoli Work Manager, by writing a Command Line option
that writes a message to the Tivoli Command Center so that operators can see what has
happened.
The Sequencer can include restart checkpoints. Should there be some failure within your
Sequence, you don’t have to go back and re-run all the Jobs. You can just find the one that
failed, skipping over all the ones that worked, and pick up and re-run the one that failed.
When you create a Job Sequencer, you open a new Job Sequence & specify whether or not
it's re-startable. Then, on the Job Sequencer’s Canvas you add Stages that will execute the
Jobs, Stages that will execute system commands in other executables, and/or special
purpose Stages that will perform functions such as looping options or setting up local
variables. Next, you add Links between Stages; these specify the order in which the Jobs
are executed. The order of sub-sequences, Jobs, looping, system commands, and e-mails is
determined by the flow that you create on the Canvas.
You can specify triggers on the Links. For instance, you might create a condition so that
when the flow is coming out of Activity A, one Link goes to Activity B if the Activity A Job
finished successfully but goes to Activity C on a different Link if the Job ‘errors-out’. The
‘error-out’ Link might then go to an e-mail that sends an error message to someone’s
console and stops the Sequencer at that point.
Job Sequences allow us to specify particular error handling, such as the global ability to trap errors. You can also enable and disable restart checkpoints within the Sequence.
Now let's see some of the Stages that can be used within the Sequencer.
Here is a graphic of some of the Stages that can be used within the Sequencer. These
include the EndLoop Activity Stage, the Exception Handler Stage, the Execute Command
Stage (which runs native OS commands), the Job Activity Stage (used to execute a
DataStage Job), and the Nested Condition Stage.
In addition, there are the Notification Activity Stage (typically tied into DataStage’s e-mail
function), the Routine Activity Stage (used to call DataStage Server routines), the
Sequencer Stage (which allows us to bring together and coordinate a rendezvous point for
multiple Links once we have started many Jobs in parallel), and the StartLoop Activity
Stage.
There are also the Terminator Activity Stage (which forces a stop), the UserVariables
Activity Stage (used to create local variables), and the Wait For File Activity Stage.
Sequencer Job Example
Here we can see an example of a Sequencer Job. At the top left, this Job Sequence starts
with nothing coming into it. This Stage will thus start executing immediately. Normally the
one at the bottom would as well, however, the bottom Stage is an exception handler Stage
and will only kick off if there is an exception during our runtime.
In a different example, if there were many Stages independent of one another, they would
normally all start at the same time within a typical Job Sequence.
In the top flow of the Job Sequence above, though, the first Stage waits for a file, then it
kicks off Job 1. If Job 1 finishes successfully, the flow follows the green Link to Job 2, and
so on. The green color of the Link indicates that there is a trigger on the Link (this trigger
basically says ‘if the Job was successful, go to Job 2’). If Job 3 completes successfully, the
flow goes on to execute a command. The red Links indicate error activity. Should any of
Jobs 1, 2, or 3 fail, the flow follows the red Link to a Sequencer Stage (which can be set to
fire when all or any of its inputs complete); in this case, if any of the red Links are followed
to it, it sends a notification warning by e-mail that there has been a Job failure.
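The trigger logic in this flow can be sketched as ordinary Python (purely illustrative; the Job names and the "OK" status value are invented):

```python
def run_sequence(run_job, notify):
    """Run Jobs 1-3 in order. On any failure, follow the 'red link':
    send a notification instead of continuing down the chain."""
    for job in ("Job1", "Job2", "Job3"):
        if run_job(job) != "OK":        # failure trigger (red link)
            notify(job + " failed")
            return False
    return True                         # success trigger all the way through

messages = []
# Every Job succeeds: the chain completes and nothing is notified.
assert run_sequence(lambda job: "OK", messages.append) is True
# Job2 fails: the flow diverts to the notification instead of running Job3.
assert run_sequence(lambda job: "FAIL" if job == "Job2" else "OK",
                    messages.append) is False
assert messages == ["Job2 failed"]
```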
Exception Handler Stage
There is one final note about our Sequence Job: it is not a scheduler unto itself; it just
sequences the activities. There must be some external scheduler, whether that is the
DataStage scheduler or a third-party enterprise scheduler, to start up the whole Sequence
Job, and only from there will it call all the other Jobs for you.
Think of it as an entry point into your application from whatever external source you intend
to use, whether that source is just kicking it off from a Command Line or using something
like a Tivoli Work Manager to execute the Job on a daily or monthly schedule, or on a
conditional schedule should some other activity occur first and then, in turn, call this Job.
In order to do that, we need to be able to execute the Job from the Command Line. We will
talk a little about how the Command Line can be used in this manner, as well as how it will
help us deploy our applications effectively and efficiently.
Pulling Jobs Together
• Parameter Sets
• Running Jobs from the Command Line
• Other Director Functions
Parameter Sets
Let's start by talking about Parameter Sets. One of the areas that will help us to pull our
Jobs together more effectively will be the use of Parameter Sets. Parameters are needed for
various functions within our Jobs and are extremely useful in providing Jobs with flexibility and versatility.
A Parameter Set allows us to store a number of parameters in a particular named object.
One or more value files can then be named and specified; a value file stores the values for
each parameter within the Parameter Set, and these values are picked up at runtime.
Parameter Sets can be added to the Job’s parameter list, which is on the Parameters tab
within the Job Properties. This makes it very convenient for us: as we develop our Jobs, we
don’t have to enter a long list of parameters for every single Job, risking mis-keying or
omissions that would prevent our parameters from passing successfully from the Sequence
Job into our lower-level Jobs.
An example of a Parameter Set might include all the things that happen within a certain environment.
We might set up a Parameter Set entitled “environment” and have it contain parameters
such as the “Host Name”, the “Host Type”, the “Database Name”, the “Database Type”, the
“Usernames”, and the “Passwords” (Not to be confused with the DataStage Connection
Objects).
We could then have one set of values in a value file used specifically for development, another set for testing, and another set for production.
We could then move our Job from one environment to the next and simply select the appropriate values for that environment from the Parameter Set.
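Conceptually, a Parameter Set with per-environment value files behaves like this small Python sketch (the set name, parameter names, and values are invented):

```python
# One named Parameter Set ("environment"), with three value files.
environment_values = {
    "dev":  {"HostName": "devhost",  "DatabaseName": "DEVDB"},
    "test": {"HostName": "testhost", "DatabaseName": "TESTDB"},
    "prod": {"HostName": "prodhost", "DatabaseName": "PRODDB"},
}

def resolve(parameter_set, value_file):
    """Pick up the parameter values for the chosen value file at runtime."""
    return parameter_set[value_file]

params = resolve(environment_values, "test")
assert params["DatabaseName"] == "TESTDB"
```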
Running Jobs from the Command Line
Now that we’ve learned the basics of how to sequence our Jobs and group our parameters,
we have an application. Let’s say that we’ve built, developed, and tested a Job and are now ready to put it into production. The next question is “How do we execute it?”
Although a Job can be run by hand using the DataStage Director client, this is not the
typical way that production applications are launched. Instead, there is a series of
Command Line options that allow us to start the Job from the native Command Line, or to
have our scheduler call Jobs.
The graphic above shows some of the particulars.
Other Director Functions We mentioned that we can schedule Jobs using the DataStage
Director, but more often than not, Jobs are scheduled by an external scheduler. To do this,
we must run Jobs from the Command Line as most schedulers use a Command Line execution.
The InfoSphere suite provides a utility for DataStage called dsjob. dsjob is a multi-
functional Application Programming Interface (API) on the Command Line that allows us to
perform a number of functions, one of the most common of which is running a Job. By
executing the dsjob command on your Command Line with various options after it, you’ll be
able to accomplish a number of automated functions from a whole host of other tools (since
the Command Line is the most common method for launching other activities).
The top example in the graphic above (highlighted) will run a Job. In it, we pass parameters
that include the number of rows that we want to run, the name of the Project that we want
to run from, and the name of the Job.
dsjob also has other functions such as giving us information about a Job’s run status,
summary of its messages, and even Link information. The example above (in the second
bullet point) displays a summary of all Job messages in the log.
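As a sketch, such invocations might look like the lines below (the Project name, Job name, and row count are invented; check the exact option syntax against your dsjob documentation):

```text
dsjob -run -rows 100 -jobstatus MyProject MyJob
dsjob -logsum MyProject MyJob
```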
You can use the -logsum or the -logdetail options to the dsjob command to write your
output to a file, which can then either be archived or even used by another DataStage Job
and loaded into a database with all your log information (any database – doesn’t have to be
part of the DataStage repository). With DataStage version 8.1, you are able to store the
logged information directly into the Xmeta repository or keep it locally in the DataStage engine.
You may choose the latter for performance reasons, so that you don’t tie up your Xmeta
database with a lot of logging activity while the Job is running, which could affect your
network should your Xmeta be remotely located from your DataStage server. You then have
the option to pull the logging information out subsequently: run all the log information into
the local DataStage engine’s repository and, only at the end, pull that data out and process
it into your master repository (whether that be Xmeta or some other relational data source)
so that you can keep it for long-term usage. This way you avoid negatively impacting the
performance of the application as it runs, since you do the transfer after the application is
finished.
As mentioned earlier, Xmeta is the repository for all of DataStage.
dsjob’s function is documented in the “Parallel Job Advanced Developer’s Guide”.
Now, typically, when we use the dsjob command it will be encapsulated inside some kind of
shell script (whether on a Windows or UNIX type of system, or whatever system DataStage
is running on), and we want to make sure that all the options are correct before issuing the
command.
Product Simulation
Welcome to the Guided Tour Product Simulation. Earlier in the tutorial, we discussed many
of the important concepts involved in DataStage. In this interactive section, you will
‘use the product’, without having to perform any installations, to create DataStage Jobs,
navigate around the UI, and ‘learn by doing’.
Specifically, we will first introduce you to a case-study of a business use-case scenario in
which you will learn about a business entitled Amalgamated Conglomerate Corporation.
Amalgamated Conglomerate Corporation (ACC) has hired you as their DataStage
Developer! ACC is a large holding company that owns many subsidiary companies
underneath it.
In our scenario, we’re going to talk about what happens to ACC over a period of a year as they acquire new companies. In this section you will do the following:
1. Understand the Business Problem and Case Scenario
2. Assess the Existing Architecture
3. Look at a Feed Job
4. Create a Feed Job
5. Modify the Consolidation Job
6. Create a Sequence Job
Let’s look at ACC’s corporate structure as it looks at the beginning of the year, on January
1st. As you can see in the graphic above, ACC has four subsidiary companies. These
subsidiary companies include Acme Universal Manufacturing (Company A), Big Box Stores
(Company B), Cracklin’ Communications (Company C) and Eco Research and Design
(Company E). All of these belong to ACC and information from all of them will need to be
pulled together from time to time.
Now, some time will have gone by, and, on March 3rd, the state of ACC has changed: Your
employer has acquired a new company called Disks of the World (Company D). Disks of the
World will now need to be able to feed its information into ACC’s Data Warehouse. Let’s go
back to January 1 and see what the feeds looked like before Disks of the World (Company
D) came on board.
Each of the individual subsidiaries (Companies A, B, C, and E) feeds its data into a common
customer Data Mart. But this is done in two steps. Let’s quickly talk about how this was
done from more of a technical standpoint. Above, we can see four particular feeds; one for
each company and that particular company's customers. The feeds are all turned into files. Each FEED FILE has been generated by an individual DataStage DS FEED JOB.
These files are then later pulled together by a consolidation job (A DataStage Job) and run into the COMMON CUSTOMER DATA MART.
The reason why we have several different feeds running to files is that each of these
subsidiary companies put out their data at different intervals or periods. In other words,
each puts out its own feed at different frequencies. However, we only want to consolidate it
in “one fell swoop” (ACC’s business requirements are that we put all the feed files into the Data Mart at one time).
Acme or Company A puts out their data once per day. But Big Box or Company B has such a
large feed that they put it out once an hour. Cracklin’ Communications or Company C has a
large feed but not quite as large as Company B, so they do it four times per day. Eco
Research or Company E has such a small customer base that they only produce their feed one time per week.
ACC’s business requirement, though, is for us to have a daily consolidation of these feeds
into the Common Customer Data Mart. It will therefore pull in many files from Companies B
and C, probably only one from Company A, and then, once per week, there will also be one
in the heap from Company E.
Now let’s go ahead in time to March 3rd. You’ll recall that ACC has now acquired another
company called Disks of the World. The DataStage Developer for ACC (that’s you) will now
need to add a new DataStage feed for the customer data coming from Company D (Disks of
the World) into the Common Customer Data Mart. You will build a DataStage feed that
accommodates their need to produce their data multiple times per day during their sales
cycle. This feed will be landed to the same set of files (the feed files at the center of the
graphic) that are then picked up by another DataStage Job entitled the Consolidation DS
Job (which combines them into the mart). Finally, you will need to build a special type of
Job called a Sequencer Job, which is used to control the running of other DataStage Jobs.
In other words, we’ll see how all of the Jobs (all the DS FEED JOBs and the CONSOLIDATION
JOB) can be brought together by using a Sequencer Job. This Sequencer will call all the Jobs
in the correct order; this way, we don’t have to schedule them individually in our
corporate-wide scheduling software. We will just schedule the one Sequencer.
In sum, as the DataStage Developer for ACC, you will need to do the following:
1. View the existing Acme DS FEED JOB to see how the previous DataStage Developer
at ACC built it
2. Create a new DataStage Job, shown highlighted above (Company D's DS FEED JOB)
3. Modify ACC’s DS CONSOLIDATION JOB to bring this new/extra feed (from
Company D) into the data mart
4. Later in this Guided Tour Product Simulation, create a Job that controls all of
these Jobs
5. Understand a technical problem that has been identified and reported to ACC in
the past, so that, as you develop a new DS FEED JOB, you will help solve it: ACC
will need to recognize any particular customer as having come from a particular
subsidiary:
6. One of the things that all of these DS Feed Jobs have in common is that the
DataStage Jobs will need to find a common customer ID for the corporation. Since
each of the subsidiary companies developed their own systems independently (of
ACC), the numbering systems that they use for their customers are different and
may well inappropriately overlap. For example, Company A has a Customer 123 that
represents Joe Smith but over at Company B, there is a Customer 123 that
represents Emmanuel Jones Enterprises. Even though these two share the same
number, they are not the same customer. If we put them into the Common
Customer Data Mart ‘as is’, then we would have ambiguity and not know which
customer we were dealing with.
7. The Common Customer Data Mart will have its own numbering system for these
customers. It will differentiate by using the Customer ID that comes from the
company (Joe Smith = Customer 123) along with an additional identifier that
indicates which company the data came from (Company A is known to the Data Mart
as Source System ID = 1, for instance). These two fields, when combined, create
a new field that is an alternate key and a unique identifier in the Common
Customer Data Mart for Joe. This number can now distinguish any customer from
any subsidiary.
8. Our DataStage “Feed” Jobs will need to be able to find out if their customer is
already known to the Common Customer Data Mart.
9. If the customer is not known, then we will hand it off to ACC's Data Mart team to add
it in a separate process (a ‘Black Box’ process Job that we will not see in this
tutorial) which will then place the new customer or insert it for the first time into the
Data Mart.
10. However, if the customer is already known to the Mart, then the Data Mart must find
the unique ID and add it to the data (for example, Customer_123SourceSysID_1 is
added to an ADDRESS update for Joe Smith). This way, when it comes time to load
the Common Customer Data Mart (with the Consolidation DS Job), we won’t have
ambiguity in our data. We’ll be able to load it correctly and distinctly. Joe’s
address is updated.
11. Let’s talk about how the Consolidation DataStage Job would do this.
12. ACC's Business Requirements
13. Anytime that ACC runs one of these Jobs the Feed Jobs designate from which source
system (Company) they are coming. In other words, each Feed Job has designated
its subsidiary company uniquely. For example, “Source system 1” or “coming from
Company A” for Acme, “Source system 2” or “coming from Company B” for Big Box
Stores, and so forth. When the data is processed, the source system number will be
sent into the Job (as a parameter). Then the Job must perform a lookup. For this
we will use both the source system number or ID and the Customer ID from the
source system to do a lookup into the table, from the Common Customer Data Mart,
to see if this compound key already exists. This is called a “reverse lookup”. In other
words, the Job uses those two columns to find the true key for the Common
Customer Data Mart which should then be returned into the Job so that that column
(the true key) can be carried along with all the other native source columns that are
coming in the row as it passes through the Job.
14. Additionally, we need to code the Job so that it will convert all of the source-system-specific metadata from the source system’s metadata layout into the target
metadata layout. Coming into the Job, the metadata appears as it would in the
source system (Acme’s layout of the customer) but, on the way out, we should see
the same type of data but captured in columns and fields which conform to the
Common Customer Data Mart’s metadata structure.
15. The Common Customer ID is a unique key within the Common Customer Data Mart.
This key has no meaning to each individual source system. So the only way that the
source system can find this unique key is through this feed process. The feed process
enriches the data by looking at the Common Customer Data Mart and finding this
Common Customer ID to help differentiate all of our different customers across all of
our subsidiary companies. If the customer is new, then it will be sent to another
process for insertion into the Customer Data Mart. This Insertion Job is not discussed
in this tutorial and is considered a “Black Box” process used by Amalgamated
Conglomerate Corporation to add new customers.
16. In summary, the unique Customer ID in the Common Customer Data Mart has
compounded the Source System Identifier with a source-specific Customer ID into a
field that, for example, might contain a value like “9788342”.
17. And this field is unique, so that no customer overlaps with another.
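The “reverse lookup” described in the steps above can be sketched as follows. This is a minimal illustration, assuming the mart’s cross-reference table has been read into a Python dictionary; the field names, sample IDs, and the `xref` structure are all illustrative assumptions, not ACC’s actual schema or DataStage’s Lookup stage.

```python
# Sketch of the compound-key reverse lookup: the mart keys each customer by
# (Source System ID, source Customer ID) and returns its own unique
# Common Customer ID, or None when the customer is not yet known.

# (source_system_id, source_customer_id) -> Common Customer ID (illustrative)
xref = {
    (1, "123"): "9788342",  # Joe Smith, from Company A (Source System ID = 1)
    (2, "123"): "9788343",  # Emmanuel Jones Enterprises, from Company B
}

def lookup_common_id(source_system_id, customer_id):
    """Return the mart's Common Customer ID, or None if the customer is
    unknown (i.e. route the row to the 'Black Box' insert process)."""
    return xref.get((source_system_id, customer_id))

# The same source customer number from two subsidiaries resolves to two
# different customers -- no ambiguity in the Common Customer Data Mart:
assert lookup_common_id(1, "123") != lookup_common_id(2, "123")
# An unknown customer returns None and is handed off for insertion:
assert lookup_common_id(3, "999") is None
```

The design choice this mirrors is that neither ID alone is unique across subsidiaries; only the compound of Source System ID plus source Customer ID identifies one customer, which is why the Feed Jobs must carry the returned key along with every row.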
Look At Acme DS Feed Job
Now that we know the business requirements, the architecture, and the development
strategy, let's look at the DS Feed Job for the Acme subsidiary (Company A) that was built
previously by another DataStage Developer at ACC (highlighted above). After we've seen it
and how it works, then, after ACC makes another acquisition in March, we will go to a
section of the tutorial in which you will have to build a similar DS Feed Job for Disks of the World (Company D).
In order to look at the DataStage Feed Job for Acme Universal Manufacturing, we'll need to
bring up the DataStage Designer Client. In this Job, we will see how it pulls the data from
Acme’s operational database and puts that data into a file after it has looked up to find the
appropriate Common Customer ID from the Data Mart. Then, we’ll look at the Consolidation
Job. There, we will see how all of the individual Feed files are brought together prior to
being loaded into the Common Customer Data Mart. We will need the highlighted client to look at, create, and modify DataStage Jobs.
Our Login screen comes up. Here, we’ll need to enter username/password information. At
the top of the screen, you can see the domain in which we are working. At the bottom of
the screen, we need to make sure that we are working with the correct DataStage Project;
in this case, it is the AmalCorp Project (Amalgamated Conglomerate Corporation Project).
Then, with a password filled in for you (for purposes of this tutorial), enter a username of
student and click the OK button. (The NEXT button below has been disabled for the
remainder of the Guided Tour Product Simulation.)
Our Designer client is now connecting to the DataStage Server via authentication and
authorization from the Server’s Security layer. A screen will then pop up within which we
can work on our Job. First, we are asked if we want to create a Job and, if so, what type. For right now, since we will look at an existing Job, just click the Cancel button.