Datastage: What I Knew


Page 1: Datastage What I Knew

KT Document

Manikandan. K

Page 2: Datastage What I Knew

DAY - 1

Page 3: Datastage What I Knew

Day-1

• Datastage

– ETL tool used to extract data from any number and type of databases, transform the data, and load it into the data warehouse.

– It has client and server components.

– Client components: DS Designer, DS Director, DS Manager, DS Administrator.

– Server components: Repository, DS Server.

– Datastage export files have the extension .dsx.

– If we run a job from the Designer, we won’t be able to view the logs there.

Page 4: Datastage What I Knew

Day-1

• Datastage Designer
  – It is used to create jobs and compile them.
  – Running a job from the Designer is also possible, but it is an optional feature.

• Datastage Director
  – It is used to validate the jobs created in the Designer, then run and monitor them.
  – Once a job has executed, we can check the logs to get the details of the run.

Page 5: Datastage What I Knew

Day-1

• Datastage Manager
  – It is used to view and edit the contents of the repository.
  – It is used to export, import and manage the metadata.

• Datastage Administrator
  – It is a user interface used to perform administrative tasks.

Page 6: Datastage What I Knew

Day-1

• Development / Debug stages
  – Row Generator
  – Column Generator
  – Peek
  – Tail
  – Head
  – Sample

Page 7: Datastage What I Knew

Day-1

• File stages
  – Sequential File
  – Dataset
  – Fileset
  – Lookup Fileset

• Processing stages
  – Sort
  – Copy
  – Remove Duplicates
  – Join

Page 8: Datastage What I Knew

Day-1

  – Look Up
  – Merge
  – Funnel
  – Modify
  – Aggregator
  – Transformer
  – Pivot
  – Filter
  – Surrogate Key Generator

Page 9: Datastage What I Knew

Day-1

Row Generator (no input -> 1 output)
• Row Generator generates mock data based on the metadata.
• It can be used as input data for testing purposes.
• By default the number of rows generated is 10.

Column Generator (1 input -> 1 output)
• We must enter the metadata for the column to be generated.
• Its output depends on the number of input rows.

Page 10: Datastage What I Knew

Day-1

• So if the input has 15 rows, the output will also have 15 rows.

• To give every record in the generated column the same value, set the column’s default to that value.

• We can also set the start value, end value and increment.
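A rough sketch of what these two generator stages do, in Python. The function names, the cycling scheme for mock values, and the start/increment defaults are my own assumptions; the 10-row default and the row-count rule follow the slides.

def row_generator(metadata, num_rows=10):
    # Mock-data source: no input link, one output link.
    # By default 10 rows are generated from the column metadata.
    return [{col: values[i % len(values)] for col, values in metadata.items()}
            for i in range(num_rows)]

def column_generator(input_rows, col_name, start=1, increment=1):
    # Adds a generated column; the output row count equals the input row count.
    return [{**row, col_name: start + i * increment}
            for i, row in enumerate(input_rows)]

mock = row_generator({"id": [1, 2, 3], "name": ["a", "b", "c"]})
print(len(mock))                           # 10 rows by default
print(len(column_generator(mock, "seq")))  # 10: row count follows the input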

Head (1 input -> 1 output)
• It is used to pass through the first ‘n’ records of the input.

Page 11: Datastage What I Knew

Day-1

• By default ‘n’ is set to 10; it can be changed to any value we need.

• If n=10 and the input has 50 records, only the first 10 records will pass through.

Tail (1 input -> 1 output)
• It is similar to Head; the main difference is that it selects the last ‘n’ records of the input data.

Sample (1 input -> n outputs)
• It selects ‘n’ records, where ‘n’ is given as a percentage of the full set of input records, chosen in a random manner.

Page 12: Datastage What I Knew

Day-1

• If n=50% and the input has 150 records, the output will have 75 records selected in a random manner.

Peek (1 input -> 1 output (optional))
• Allows us to view the data in the logs in the Director.
• By default the Output property is set to Log, so that we can view the records in the Director.
• If the Output property is set to Output instead, the output from the Peek is stored in a separate output file and cannot be seen in the logs.
• It has the records only if the stage is valid and is executed.
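A minimal Python sketch of the record-selection semantics of Head, Tail, Sample and Peek. Function names are mine; the 10-record defaults and the 50%-of-150 example follow the slides, and Peek is modelled with a single output printed like a log.

import random

def head(rows, n=10):         # first 'n' records; default n = 10
    return rows[:n]

def tail(rows, n=10):         # last 'n' records
    return rows[-n:]

def sample(rows, percent):    # 'n'% of the records, chosen at random
    k = int(len(rows) * percent / 100)
    return random.sample(rows, k)

def peek(rows, per_link=10):  # show a few records, like Peek writing to the log
    for row in rows[:per_link]:
        print(row)
    return rows               # records pass through unchanged

rows = list(range(150))
print(len(head(rows)))        # 10
print(len(sample(rows, 50)))  # 75, selected in a random manner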

Page 13: Datastage What I Knew

DAY - 2

Page 14: Datastage What I Knew

Day - 2

Types of jobs we use in Datastage
• Parallel Job
• Parallel Shared Container
• Job Sequence
Server jobs are used rarely.

Parallel Processing
• There are mainly 2 types of parallel processing:
  – Partition parallelism
  – Pipeline parallelism

Page 15: Datastage What I Knew

Day - 2

• Partition Parallelism
  – To process very large quantities of data, the data is split into smaller subsets, and each subset is handled by a separate processor so that the job completes faster.
  – At the end of the job the data subsets can be collected back together again and written to a single data source.

• Pipeline Parallelism
  – When there are multiple stages, each processor starts processing a particular stage and feeds its output into the pipeline.

Page 16: Datastage What I Knew

Day - 2
• So multiple stages execute simultaneously, and the job completes at a much faster pace.
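Partition parallelism can be approximated in plain Python with a process pool: split the data into smaller subsets, let each worker handle one subset, then collect the results back into a single output. The round-robin chunking and the doubling transform are illustrative assumptions, not Datastage behaviour.

from multiprocessing import Pool

def transform(subset):
    # each worker handles one partition of the data
    return [x * 2 for x in subset]

if __name__ == "__main__":
    data = list(range(100_000))
    n_parts = 4
    # partition: split the large input into smaller subsets
    parts = [data[i::n_parts] for i in range(n_parts)]
    with Pool(n_parts) as pool:
        results = pool.map(transform, parts)
    # collect: gather the subsets back together into a single output
    output = [row for part in results for row in part]
    print(len(output))  # 100000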

Sequential File
• A sequential file may be a text file, a CSV file, a fixed-width file, etc.
• The data from a sequential file is transferred in a sequential manner.
• So a fan-out/fan-in step is performed before data is transferred to/from parallel processing stages.

Page 17: Datastage What I Knew

Day - 2

Dataset
• A dataset has the extension .ds.
• It can be viewed only using Datastage, since it is stored in Datastage’s internal format.
• The data in a dataset can be accessed in parallel, so it is much faster compared to sequential files.

Page 18: Datastage What I Knew

DAY - 3

Page 19: Datastage What I Knew

Day-3

Join [2 inputs (more than 2 optional) -> 1 output]
• It performs a join based on the field that is set as the key.
• 4 types of join can be performed:
  – Inner join
  – Left outer join
  – Right outer join
  – Full outer join

• We can explicitly set the input table to be left, right or intermediate.

Page 20: Datastage What I Knew

Day-3

• This is done in the Link Ordering tab.
• The input data to the Join must be sorted.
• It is faster compared to Lookup but slower compared to Merge.
• It does not have a reject link.
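The four join types map directly onto the 'how' argument of pandas' merge; a small sketch (the table contents and the key name are made up):

import pandas as pd

left  = pd.DataFrame({"key": [1, 2, 3], "l": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "r": ["x", "y", "z"]})

# the field set as the key drives the join
inner  = pd.merge(left, right, on="key", how="inner")  # keys 2, 3
louter = pd.merge(left, right, on="key", how="left")   # keys 1, 2, 3
router = pd.merge(left, right, on="key", how="right")  # keys 2, 3, 4
fouter = pd.merge(left, right, on="key", how="outer")  # keys 1, 2, 3, 4
print(fouter)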

Pivot (1 input -> 1 output)
• It is used to convert columns to rows.
• It performs the transpose function of a matrix.

Page 21: Datastage What I Knew

Day-3

• If there are 4 columns q1, q2, q3, q4 and we want to convert them into rows under a common column named ‘q’, the derivation is given as q = q1,q2,q3,q4.
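The q1..q4 example from the slide, sketched with pandas' melt, which performs the same columns-to-rows conversion (the id column and sample values are invented):

import pandas as pd

df = pd.DataFrame({"id": [1, 2],
                   "q1": [10, 50], "q2": [20, 60],
                   "q3": [30, 70], "q4": [40, 80]})

# columns q1..q4 become rows under the single common column 'q'
long = df.melt(id_vars=["id"], value_vars=["q1", "q2", "q3", "q4"],
               var_name="quarter", value_name="q")
print(long)  # 8 rows: one per (id, quarter) pair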

Remove Duplicates (1 input -> 1 output)
• The prerequisite for this stage is that the input data must be sorted.
• If duplicate values exist in the input, only one record is retained; the others are dropped.
• In the properties we can set which record is retained, either the first one or the last one.

Page 22: Datastage What I Knew

Day-3

• It compares each record with the very next record; if both are the same, one record is dropped. So if the input is not sorted, duplicate values will remain in the output.

• We must also give a field (key) based on which the duplicate records are to be removed.
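The retain-first / retain-last choice corresponds to pandas' drop_duplicates (a sketch; the key and value columns are mine, and unlike the stage, drop_duplicates itself does not require sorted input):

import pandas as pd

df = pd.DataFrame({"k": [1, 1, 2, 2, 3], "v": ["a", "b", "c", "d", "e"]})
df = df.sort_values("k")  # prerequisite in the stage: input sorted on the key

first = df.drop_duplicates(subset=["k"], keep="first")  # rows a, c, e
last  = df.drop_duplicates(subset=["k"], keep="last")   # rows b, d, e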

Copy (1 input -> ‘n’ outputs)
• It is used to simply copy the input data to multiple output links.
• It is good practice to have a Copy stage between two transformation stages.

Page 23: Datastage What I Knew

Day - 4

Sort (1 input -> 1 output)
• It is used to sort the input data based on a particular key that we mention explicitly.
• We can also specify how to handle null values, with a first or last preference.
• There is an option called Allow Duplicates; if this is set to false and duplicate records exist, only the first record is retained and the subsequent records are dropped.
• We can also choose the sort order, either ascending or descending.

Page 24: Datastage What I Knew

Day - 4
• Sorting can also be done on a link by setting the sort properties on the link.

• We cannot have a reject link for the Sort stage.

• It is active if Allow Duplicates is set to true, and passive in the other case.
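A sketch of the key Sort options (explicit key, null preference, sort order, Allow Duplicates) in pandas terms; the column names and data are invented:

import pandas as pd

df = pd.DataFrame({"k": [3, 1, None, 2, 1], "v": list("abcde")})

# explicit key, ascending/descending order, nulls first or last
out = df.sort_values(by="k", ascending=True, na_position="first")

# Allow Duplicates = false: keep only the first record per key
no_dups = out.drop_duplicates(subset=["k"], keep="first")
print(no_dups)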

Funnel (‘n’ inputs -> 1 output)

• It is used to combine multiple input files into a single output file.

• There are 3 types of funnel:

– Sort

– Sequence

– Continuous

Page 25: Datastage What I Knew

Day - 4

• By default it is set to Continuous.
• For Sort funnel we must give the key: it sorts the records from the ‘n’ input files and combines them into a single sorted output file.
• In Continuous funnel the records are interleaved: the 1st record from the 1st file, then the 1st from the 2nd file, and so on through the subsequent records of the input files.
• In Sequence funnel, initially all the records from the 1st input are output, then the ones from the 2nd input, and so on for all the input files.
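The three funnel types sketched in Python (function names are mine): Sequence concatenates input after input, Continuous interleaves record by record, and Sort merges inputs that are already sorted on a key.

import heapq
from itertools import chain, zip_longest

def sequence_funnel(*inputs):
    # all records from the 1st input, then the 2nd, and so on
    return list(chain(*inputs))

def continuous_funnel(*inputs):
    # 1st record from each input, then 2nd from each, and so on
    _missing = object()
    interleaved = chain(*zip_longest(*inputs, fillvalue=_missing))
    return [r for r in interleaved if r is not _missing]

def sort_funnel(*inputs, key=None):
    # inputs must already be sorted on the key; merge keeps the order
    return list(heapq.merge(*inputs, key=key))

a, b = [1, 4, 7], [2, 3, 9]
print(continuous_funnel(a, b))  # [1, 2, 4, 3, 7, 9]
print(sort_funnel(a, b))        # [1, 2, 3, 4, 7, 9]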

Page 26: Datastage What I Knew

Day - 4

Merge (1 master input, ‘n’ update inputs -> 1 output, ‘n’ rejects)

• In Merge the number of update links must match the number of reject links.
• It is mandatory that the inputs are sorted.
• Master and update links must be explicitly mentioned.
• It is the fastest way of joining two or more tables.
• The rejects will have the records that are dropped from the updates only.
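A rough sketch of the master/update/reject behaviour (the key name and record shapes are assumptions): matched update records enrich the master row, and unmatched update records go to that update link's own reject list, matching the slide's one-reject-per-update rule.

def merge_stage(master, updates_list):
    # master: list of dicts sorted on 'key'.
    # updates_list: one sorted list of dicts per update link.
    # Returns (output, rejects) with one reject list per update link.
    output = [dict(m) for m in master]
    by_key = {m["key"]: m for m in output}
    rejects = []
    for updates in updates_list:
        rejected = []
        for u in updates:
            m = by_key.get(u["key"])
            if m is None:
                rejected.append(u)   # dropped from the updates -> reject
            else:
                m.update(u)          # matched: merge the update fields in
        rejects.append(rejected)
    return output, rejects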

Page 27: Datastage What I Knew

Day - 4

Filter (1 input -> n outputs, 1 reject (optional))
• It is used to filter the records from the input based on a condition and pass the records that satisfy the condition to the output.

• We can set multiple conditions and get multiple outputs based on those conditions.

• The link ordering can be done in the Link Ordering tab of the properties dialog box.

• The records that do not satisfy any of the conditions are either dropped or passed on to the reject link.
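A sketch of the multiple-conditions / reject behaviour (the predicates and function name are mine; a record that satisfies several conditions goes to each matching output):

def filter_stage(rows, conditions, with_reject=True):
    # conditions: one predicate per output link, checked in link order.
    # Rows failing every condition go to the reject output (or are dropped).
    outputs = [[] for _ in conditions]
    reject = []
    for row in rows:
        matched = False
        for out, cond in zip(outputs, conditions):
            if cond(row):
                out.append(row)
                matched = True
        if not matched:
            reject.append(row)
    return (outputs, reject) if with_reject else (outputs, None)

outs, rej = filter_stage(range(10), [lambda x: x < 3, lambda x: x % 2 == 0])
# outs[0] = [0, 1, 2]; outs[1] = [0, 2, 4, 6, 8]; rej = [3, 5, 7, 9]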

Page 28: Datastage What I Knew

DAY - 5

Page 29: Datastage What I Knew

Day - 5

Transformer (1 input -> ‘n’ outputs, 1 reject)
• This stage must be used only if the process cannot be done using any other stage.
• It is an active stage.
• It can do filter, copy, sort, etc.
• It has stage variables, where we can create variables to store values that are used frequently across the output derivations.

• In the properties there is a Constraints option where we can set conditions to be checked on the incoming data.

Page 30: Datastage What I Knew

Day - 5

• The derivations option allows us to perform calculations using arithmetic, string, trigonometric and other functions.

• Null handling, conditional statements like if-else, and date functions can also be used in the derivations.

• We can route the outputs from the Transformer to multiple targets based on the derivations.

• The stage variables declared and defined are evaluated in the order they are listed.
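A sketch of the Transformer's moving parts: a constraint gates the incoming rows, stage variables are computed once per row in order, and each output link applies its own derivations. The field names and string functions are invented for illustration.

def transformer(rows):
    out_a, out_b, reject = [], [], []
    for row in rows:
        # stage variables: evaluated in order, reusable across derivations
        sv_full = f"{row['first']} {row['last']}".strip()
        sv_len = len(sv_full)
        # constraint: a condition checked on the incoming data
        if sv_len == 0:
            reject.append(row)
            continue
        # derivations: per-output calculations (string functions here)
        out_a.append({"name": sv_full.upper(), "length": sv_len})
        out_b.append({"name": sv_full, "short": sv_full[:5]})
    return out_a, out_b, reject

a, b, rej = transformer([{"first": "Mani", "last": "K"},
                         {"first": "", "last": ""}])
# a = [{'name': 'MANI K', 'length': 6}]; rej holds the empty-name row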

Page 31: Datastage What I Knew

Day - 5

Look Up (1 source (primary), ‘n’ references -> 1 output, 1 reject)
• Lookup is used to join tables based on the key (field) we mention.
• There are 2 types of lookup:
  – Normal lookup
  – Sparse lookup

• In this case the reject gets the records that are rejected from the primary input.

• Each record from the primary is looked up in the reference input, and if the keys match the join is performed.

Page 32: Datastage What I Knew

Day - 5
• If there is no record in the reference table matching the primary table, the record is either dropped or passed to the reject link.

• It is the slowest compared to the Merge and Join stages.
• In Lookup also we can set constraints.
• There are two types of constraints:
  – Condition failure
  – Lookup failure

Page 33: Datastage What I Knew

Day - 5

• In case of failure there are 4 ways of handling it:

– Continue

– Fail

– Reject

– Drop

• Continue just puts a null value in case of failure and continues with the process.

• Fail quits the job and comes out of the execution, i.e. it aborts the process.

• Drop just drops the record that didn’t meet the condition and continues processing the next record.

• Reject passes the rejected records to the reject link.
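The four failure actions sketched in Python (the function name, key name and the null-filled 'ref_value' placeholder column are assumptions): Continue substitutes nulls, Fail aborts, Drop skips the record, Reject diverts it.

def lookup(primary, reference, key, on_failure="continue"):
    ref_by_key = {r[key]: r for r in reference}
    output, reject = [], []
    for row in primary:
        match = ref_by_key.get(row[key])
        if match is not None:
            output.append({**row, **match})      # keys match: join performed
        elif on_failure == "continue":
            output.append({**row, "ref_value": None})  # null-fill and go on
        elif on_failure == "fail":
            raise RuntimeError("lookup failure: job aborted")
        elif on_failure == "drop":
            continue                             # silently drop the record
        elif on_failure == "reject":
            reject.append(row)                   # divert to the reject link
    return output, reject

out, rej = lookup([{"k": 1}, {"k": 2}], [{"k": 1, "v": "x"}], "k",
                  on_failure="reject")
# out = [{'k': 1, 'v': 'x'}]; rej = [{'k': 2}]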

Page 34: Datastage What I Knew

Thank You!