Datastage: What I Knew


    DAY - 1


    Datastage

    An ETL tool used to extract data from any number or type of source database, transform the data, and load it into a data warehouse.

    It has client and server components.

    The client components are DS Designer, DS Director, DS Manager, and DS Administrator.

    The server components are the Repository and DS Server.

    Datastage files have the extension .dsx.

    If we run a job from the Designer, we will not be able to view the logs.


    Datastage Designer

    It is used to create and compile jobs. Running a job from the Designer is also possible, but it is an optional feature.

    Datastage Director

    It is used to validate the jobs created in the Designer, then run and monitor them.

    Once a job has executed, we can check the logs to get the details of the run.


    Datastage Manager

    It is used to view and edit the contents of the repository.

    It is used to export, import, and manage the metadata.

    Datastage Administrator

    It is a user interface used to perform administrative tasks.


    Development / Debug stages

    Row generator

    Column generator

    Peek

    Tail

    Head

    Sample


    Files

    Sequential File

    Dataset

    Fileset

    Lookup Fileset

    Processing Stages

    Sort

    Copy

    Remove Duplicates

    Join


    Look Up

    Merge

    Funnel

    Modify

    Aggregator

    Transformer

    Pivot

    Filter

    Surrogate Key generator


    Row Generator (no input -> one output)

    Row Generator generates mock data based on the metadata we define.

    It can be used as input data for testing purposes.

    By default, the number of rows generated is 10.

    Column Generator (1 input -> 1 output)

    We must enter the metadata for the column to be generated.

    The output depends on the number of input rows.


    So if the input has 15 rows, the output will also have 15 rows.

    To give every record the same value in the generated column, set the default value to that particular value.

    We can also set the increment, start value, and end value.
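    These stages are configured through the Designer GUI rather than code, but as a rough illustration, here is a minimal Python sketch of what Row Generator and Column Generator do; the field names and cycle values are made-up assumptions:

```python
import itertools

# Row Generator analogue: emit n mock rows from column metadata
# (hypothetical fields); n defaults to 10, as in the stage.
def row_generator(n=10):
    for i in range(n):
        yield {"id": i + 1, "name": f"name_{i + 1}"}

# Column Generator analogue: add a generated column to each input
# row, cycling through start/increment/end values; the output row
# count always equals the input row count.
def column_generator(rows, column="batch_no", start=1, increment=1, end=3):
    values = itertools.cycle(range(start, end + 1, increment))
    for row in rows:
        yield {**row, column: next(values)}

for row in column_generator(row_generator(5)):
    print(row)
```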

    Head (1 input -> 1 output)

    It is used to pass the first n records of the input file.


    By default, n is set to 10; it can be changed to any value we need.

    If n = 10 and the input has 50 records, only the first 10 records will pass through.

    Tail (1 input -> 1 output)

    It is similar to Head; the main difference is that it selects the last n records of the input data.

    Sample (1 input -> n output)

    It selects records in a random manner, where the sample size n is given as a percentage of the full set of input records.

    If n = 50% and the input has 150 records, the output will have 75 records selected in a random manner.
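    Again purely as a sketch (the real stages are GUI-configured), Head, Tail, and Sample behave roughly like this in Python; note that this Sample keeps each record with the given probability, so the output size is only approximately the requested percentage:

```python
import random
from collections import deque
from itertools import islice

def head(rows, n=10):
    # Pass only the first n records (n defaults to 10).
    return islice(rows, n)

def tail(rows, n=10):
    # Pass only the last n records.
    return deque(rows, maxlen=n)

def sample(rows, percent=50.0):
    # Keep each record with probability percent/100.
    return [r for r in rows if random.random() < percent / 100.0]

records = list(range(1, 51))   # 50 input records
print(list(head(records)))     # records 1..10
print(list(tail(records)))     # records 41..50
print(len(sample(records)))    # roughly 25 of 50
```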

    Peek (1 input -> 1 output (optional))

    It allows us to view the data in the logs in the Director.

    By default, the output property is set to Log so that we can view the records in the Director.

    If the output property is instead set to Output, the records from the Peek are sent to a separate output file and cannot be seen in the logs.

    The log has the records only if the stage is valid and has executed.


    DAY - 2

    Types of Jobs we use in Datastage

    Parallel Job

    Parallel Shared Container

    Job Sequence

    The Server Job is used rarely.

    Parallel Processing

    There are mainly 2 types of parallel processing:

    Partition parallelism

    Pipeline parallelism


    Partition Parallelism

    To process very large quantities of data, the data is split into smaller subsets, and each subset is handled by a separate processor so that the job completes faster.

    At the end of the job, the data subsets can be collected back together again and written to a single data source.

    Pipeline Parallelism

    When a job has multiple stages, each processor starts processing a particular stage and loads its output into the pipeline.


    So multiple stages execute at the same time on different rows, and the job completes at a much faster pace.
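    Datastage's engine handles partitioning internally according to its configuration file; purely as a conceptual sketch of partition parallelism, here is a Python version using the standard multiprocessing module (the partition count and the transform are arbitrary choices):

```python
from multiprocessing import Pool

def transform(record):
    # Placeholder per-record transformation.
    return record * 2

def transform_subset(subset):
    return [transform(r) for r in subset]

def run_partitioned(records, partitions=4):
    # Split the data round-robin into smaller subsets, let a separate
    # worker process each subset in parallel, then collect the subsets
    # back together into a single result. Note the collected order is
    # partition by partition, not the original input order.
    subsets = [records[i::partitions] for i in range(partitions)]
    with Pool(partitions) as pool:
        results = pool.map(transform_subset, subsets)
    return [row for subset in results for row in subset]

if __name__ == "__main__":
    print(run_partitioned(list(range(12))))
```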

    Sequential File

    A sequential file may be a text file, a CSV file, a fixed-width file, etc.

    The data from the sequential file is transferred in a sequential manner.

    So a fan-out/fan-in process is performed before the data is transferred from/to parallel processing stages.


    Dataset

    A Dataset has the extension .ds.

    It can be viewed only using Datastage, as it is stored in Datastage's internal format.

    The data from a Dataset can be accessed in parallel, so it is much faster compared to sequential files.


    DAY - 3


    Join [2 input (more than 2 optional) -> 1 output]

    It performs a join based on the field that is set as the key.

    4 types of Join can be performed:

    Inner Join

    Left Outer Join

    Right Outer Join

    Full Outer Join

    We can explicitly set an input table to be the left, right, or intermediate table.


    This is done in the Link Ordering tab.

    The input data to a Join must be sorted.

    It is faster than Lookup but slower than Merge.

    It does not have a reject link.
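    As an illustrative sketch only (the real stage requires sorted inputs and is configured in the GUI), this hash-based Python version shows the four join semantics over made-up tables:

```python
def join(left, right, key, how="inner"):
    # Index the right input by key, then walk the left input.
    right_index = {}
    for row in right:
        right_index.setdefault(row[key], []).append(row)
    left_keys = {row[key] for row in left}

    out = []
    for l in left:
        matches = right_index.get(l[key], [])
        if matches:
            out.extend({**l, **r} for r in matches)
        elif how in ("left", "full"):
            out.append(dict(l))          # unmatched left row survives
    if how in ("right", "full"):
        out.extend(dict(r) for r in right if r[key] not in left_keys)
    return out

customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
orders    = [{"id": 2, "amount": 30}, {"id": 3, "amount": 99}]
print(join(customers, orders, "id", how="full"))
```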

    Pivot (1 Input -> 1 Output)

    It is used to convert columns to rows; it performs the transpose function of a matrix.


    If there are 4 columns q1, q2, q3, and q4, we can convert them into rows under a common column named q, where the derivation for q is q1,q2,q3,q4.
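    A minimal Python sketch of that horizontal pivot (the column names mirror the q1..q4 example above):

```python
def pivot(rows, columns, target="q"):
    # Turn the named columns of each input row into separate
    # output rows under one common column.
    for row in rows:
        for col in columns:
            out = {k: v for k, v in row.items() if k not in columns}
            out[target] = row[col]
            yield out

data = [{"year": 2010, "q1": 10, "q2": 20, "q3": 30, "q4": 40}]
for r in pivot(data, ["q1", "q2", "q3", "q4"]):
    print(r)   # {'year': 2010, 'q': 10} ... {'year': 2010, 'q': 40}
```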

    Remove Duplicates (1 Input -> 1 Output)

    The prerequisite for this stage is that the input data must be sorted.

    If duplicate values exist in the input, only one record is retained; the others are dropped.

    In the properties we can set which record is to be retained: either the first one or the last one.


    It compares each record with the very next record; if both are the same, one record is dropped. So if the input is not sorted, duplicate values will remain in the output.

    We must also give a field or key based on which the duplicate records are to be removed.
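    As a sketch, here is that neighbour-comparison behaviour in Python, assuming dict records and a made-up key:

```python
def remove_duplicates(sorted_rows, key, keep="first"):
    # Compare each record with the previous one; like the stage,
    # unsorted input will leave duplicates in the output.
    out = []
    for row in sorted_rows:
        if out and out[-1][key] == row[key]:
            if keep == "last":
                out[-1] = row       # the later duplicate wins
        else:
            out.append(row)
    return out

rows = [{"id": 1, "v": "a"}, {"id": 1, "v": "b"}, {"id": 2, "v": "c"}]
print(remove_duplicates(rows, "id", keep="last"))
# [{'id': 1, 'v': 'b'}, {'id': 2, 'v': 'c'}]
```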

    Copy (1 Input -> n Output)

    It is used simply to copy the input data to multiple output links.

    It is good practice to have a Copy stage between two transformation stages.


    DAY - 4

    Sort (1 Input -> 1 Output)

    It is used to sort the input data based on a particular key that we mention explicitly.

    We can also mention how to handle null values, placing them either first or last.

    There is an option called Allow Duplicates; if this is set to false and duplicate records exist, only the first record is retained and the subsequent records are dropped.

    We can also choose the sort order: either ascending or descending.


    Sorting can also be done on a link by setting the properties of the link.

    We cannot have a reject link for the Sort stage.

    It is active if Allow Duplicates is set to true, and passive otherwise.
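    A rough Python sketch of those sort options (explicit key, sort order, null placement, Allow Duplicates); the field names are made up:

```python
def sort_stage(rows, key, ascending=True, nulls_first=True,
               allow_duplicates=True):
    # Sort on an explicit key, placing nulls first or last; with
    # allow_duplicates=False only the first record per key is kept.
    non_null = sorted((r for r in rows if r[key] is not None),
                      key=lambda r: r[key], reverse=not ascending)
    nulls = [r for r in rows if r[key] is None]
    out = (nulls + non_null) if nulls_first else (non_null + nulls)
    if not allow_duplicates:
        seen, unique = set(), []
        for r in out:
            if r[key] not in seen:
                seen.add(r[key])
                unique.append(r)
        out = unique
    return out

rows = [{"k": 3}, {"k": None}, {"k": 1}, {"k": 3}]
print(sort_stage(rows, "k", allow_duplicates=False))
# [{'k': None}, {'k': 1}, {'k': 3}]
```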

    Funnel (n Input -> 1 Output)

    It is used to combine multiple input files into a single output file.

    There are 3 types of Funnel:

    Sort

    Sequence

    Continuous


    By default it is set to Continuous.

    For a Sort funnel we must give the key; it sorts the records from the n input files and combines them into a single sorted output file.

    A Continuous funnel interleaves the records: the 1st record from the 1st file, then the 1st record from the 2nd file, and so on through the subsequent records of all the input files.

    In the case of a Sequence funnel, first all the records from the 1st input appear, then the ones from the 2nd input, and so on for all the input files.
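    The three modes, sketched in Python (the real Continuous funnel takes records as they become available; the strict round-robin below simply matches the description above):

```python
from heapq import merge
from itertools import chain, zip_longest

_END = object()   # marks exhausted inputs during interleaving

def sequence_funnel(*inputs):
    # All records from the 1st input, then the 2nd, and so on.
    return list(chain(*inputs))

def continuous_funnel(*inputs):
    # Interleave: 1st record of each input, then the 2nd, ...
    return [r for batch in zip_longest(*inputs, fillvalue=_END)
            for r in batch if r is not _END]

def sort_funnel(*inputs):
    # Merge inputs that are already sorted into one sorted output.
    return list(merge(*inputs))

a, b = [1, 4, 7], [2, 3, 9, 10]
print(sequence_funnel(a, b))    # [1, 4, 7, 2, 3, 9, 10]
print(continuous_funnel(a, b))  # [1, 2, 4, 3, 7, 9, 10]
print(sort_funnel(a, b))        # [1, 2, 3, 4, 7, 9, 10]
```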


    Filter (1 Input -> n Output, 1 Reject (optional))

    It is used to filter the records from the input based on a condition and pass the records that satisfy the condition to the output.

    We can set multiple conditions and get multiple outputs based on those conditions.

    The link ordering can be done in the Link Ordering tab in the properties dialog box.

    The records that do not satisfy any of the conditions will be either dropped or passed on to the reject link.
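    A sketch of that routing in Python; here each record goes to the output of the first condition it satisfies (the real stage can optionally copy a record to every matching output), and unmatched records go to the reject list:

```python
def filter_stage(rows, conditions, use_reject=True):
    outputs = [[] for _ in conditions]       # one output per condition
    rejected = []
    for row in rows:
        for out, cond in zip(outputs, conditions):
            if cond(row):
                out.append(row)
                break
        else:                                # no condition matched
            if use_reject:
                rejected.append(row)         # otherwise it is dropped
    return outputs, rejected

rows = [{"amount": 5}, {"amount": 50}, {"amount": 500}]
(small, large), rejects = filter_stage(
    rows,
    [lambda r: r["amount"] < 10, lambda r: r["amount"] > 100])
print(small, large, rejects)
# [{'amount': 5}] [{'amount': 500}] [{'amount': 50}]
```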


    DAY - 5


    Transformer (1 Input -> n Output, 1 Reject)

    This stage must be used only if the processing cannot be done using any other stage.

    It is an active stage.

    It can perform filter, copy, sort, etc.

    It has stage variables, where we can store values that are used frequently across the outputs.

    In the properties there is a Constraints option, where we can set conditions to be checked on the incoming data.


    The Derivations option allows us to perform calculations using functions such as arithmetic, string, and trigonometric functions.

    Null handling, conditional statements like if-else, and date functions can also be used in the derivations.

    We can derive the outputs from the Transformer to multiple targets based on the derivations.

    The stage variables declared and defined are calculated in the order in which they are derived.
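    Transformer logic is written in the GUI's expression language, but the flow of stage variables, constraints, and derivations looks roughly like this Python sketch (all column and variable names are invented):

```python
def transformer(rows):
    out_high, out_low, rejects = [], [], []
    for row in rows:
        # Stage variables: evaluated top to bottom, reusable below.
        sv_total = row["qty"] * row["price"]
        sv_is_big = sv_total > 100

        # Constraint: rows failing the check go to the reject link.
        if row["qty"] < 0:
            rejects.append(row)
            continue

        # Derivations: compute output columns, then route the row
        # to one of the target links.
        derived = {**row, "total": sv_total}
        (out_high if sv_is_big else out_low).append(derived)
    return out_high, out_low, rejects

rows = [{"qty": 2, "price": 80.0}, {"qty": 1, "price": 5.0},
        {"qty": -1, "price": 9.0}]
print(transformer(rows))
```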


    Look Up (1 Source (Primary), n Reference -> 1 Output, 1 Reject)

    Lookup is used to join tables based on the key (field) we mention.

    There are 2 types of lookup:

    Normal lookup

    Sparse lookup

    The reject link receives the records that are rejected from the primary input.

    Each record from the primary is looked up in the reference input, and if the keys match, the join is performed.


    If there is no record in the reference table matching a record from the primary table, that record is dropped or passed to the reject link.

    It is the slowest compared to the Merge and Join stages.

    In Lookup we can also set constraints.

    There are two types of constraints:

    Condition failure

    Lookup failure


    In case of failure there are 4 ways of handling it:

    Continue

    Fail

    Reject

    Drop

    Continue just puts a null value in case of failure and continues with the process.

    Fail quits the job and comes out of the execution, i.e., it aborts the process.

    Drop just drops the record that did not meet the condition and continues processing the next record.

    Reject passes the rejected records to the reject link.
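    A sketch of a normal lookup with the four failure actions in Python; the table and column names are made up:

```python
def lookup(primary, reference, key, on_failure="continue"):
    ref_index = {r[key]: r for r in reference}
    ref_cols = [c for c in (reference[0] if reference else {}) if c != key]
    output, rejected = [], []
    for row in primary:
        match = ref_index.get(row[key])
        if match is not None:
            output.append({**row, **match})          # keys matched
        elif on_failure == "continue":
            output.append({**row, **{c: None for c in ref_cols}})
        elif on_failure == "drop":
            continue                                 # silently dropped
        elif on_failure == "reject":
            rejected.append(row)                     # to the reject link
        elif on_failure == "fail":
            raise RuntimeError(f"lookup failed for key {row[key]!r}")
    return output, rejected

orders    = [{"cust": 1, "amt": 10}, {"cust": 9, "amt": 20}]
customers = [{"cust": 1, "name": "Ann"}]
print(lookup(orders, customers, "cust", on_failure="continue"))
```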
