
Datastage Interview Questions - Part I

 

1. What are the Environmental variables in Datastage?

2. Check for Job Errors in datastage

3. What are Stage Variables, Derivations and Constants? 

4. What is Pipeline Parallelism?

5. Debug stages in PX

6. How do you remove duplicates in a Dataset?

7. What is the difference between Job Control and a Job Sequence?

8. What is the max size of Data set stage?

9. Performance tuning in the Sort stage

10. How to develop the SCD using LOOKUP stage?

12. What are the errors you experienced with DataStage?

13. What are the main differences between a server job and a parallel job in DataStage?

14. Why do you need the Modify stage?

15. What is the difference between the Sequential File stage and the Dataset stage? When do you use them?

16. Memory allocation while using the Lookup stage

17. What is a Phantom error in DataStage? How do you overcome this error?

18. Parameter file usage in Datastage

19. Explain the best approach to do an SCD Type 2 mapping in a parallel job.

20. How can we improve the performance of a job while handling a huge amount of data?

21. How can we create read-only jobs in DataStage?

22. How do you implement routines in DataStage? Does anyone have any material for DataStage?

23. How will you determine the sequence of jobs to load into the data warehouse?

24. How can we test jobs in DataStage?

25. DataStage - delete the header and footer on the source sequential file

26. How can we implement Slowly Changing Dimensions in DataStage?

27. Differentiate Database data and Data warehouse data? 

28. How to run a shell script within the scope of a DataStage job?

29. What is the difference between DataStage and Informatica?

30. Explain about job control language such as (DS_JOBS)

32. What is Invocation ID?

33. How to connect two stages which do not have any common columns between them?

34. In SAP R/3, how do you declare and pass parameters in a parallel job?

35. Difference between Hashfile and Sequential File?


36. How do you fix the error "OCI has fetched truncated data" in DataStage

37. A batch is running and it is scheduled to run in 5 minutes. But after 10 days the time changes to 10 minutes. What type of error is this and how to fix it?

38. Which partitioning do we have to use for the Aggregator stage in parallel jobs?

39. What is the baseline to implement partitioning or the parallel execution method in a DataStage job? E.g., is it advised only for more than 2 million records?

40. How do we create an index in DataStage?

41. What is the flow of loading data into fact & dimensional tables? 

42. What is a sequential file that has a single input link?

43. Aggregators – What does the warning “Hash table has grown to ‘xyz’ ….” mean? 

44. What is a hashing algorithm?

45. How do you load partial data after a job has failed? The source has 10,000 records and the job failed after 5,000 records were loaded. The status of the job is 'Aborted'; instead of removing the 5,000 records from the target, how can I resume the load?

46. What are the Orchestrate options in the generic stage? What are the option names and values? Name of an Orchestrate operator to call. What are the Orchestrate operators available in DataStage for the AIX environment?

47. Is a Type 30D hash file GENERIC or SPECIFIC?

48. Is a Hashed file an Active or a Passive stage? When will it be useful?

49. How do you extract job parameters from a file?

50. 1. What about System variables?
2. How can we create Containers?
3. How can we improve the performance of DataStage?
4. What are the Job parameters?
5. What is the difference between a routine, a transform and a function?
6. What are all the third-party tools used in DataStage?
7. How can we implement a Lookup in DataStage Server jobs?
8. How can we implement Slowly Changing Dimensions in DataStage?
9. How can we join one Oracle source and a Sequential file?
10. What are the Iconv and Oconv functions?

Read more: http://www.placementpapers.us/datastage/300-datastage_interview_questions_part_i.html#ixzz1Lvg1Hc2y Under Creative Commons License: Attribution

How do you fix the error "OCI has fetched truncated data" in DataStage

Can we use the Change Capture stage to get the truncated data? Members, please confirm.

Dear friend, not only truncated data; it captures duplicates, edits, inserts and unwanted data. That means every change between the before and after data.

How did you connect with DB2 in your last project?

Most of the time the data was sent to us in the form of flat files; the data was dumped and sent to us. In some cases, where we needed to connect to DB2 for lookups for instance, we used ODBC drivers.

Asked by: Interview Candidate 

Best Answer:

By making use of the DB2/UDB stage. There we basically need to specify connection parameters like the client instance name, default server and default database.


What are Sequencers?

Sequencers are job control programs that execute other jobs with preset Job parameters.       

Asked by: Interview Candidate 

Best Answer:

A sequencer allows you to synchronize the control flow of multiple activities in a job sequence. It can have multiple input triggers as well as multiple output triggers. The sequencer operates in two modes: ALL mode, in which all of the inputs to the sequencer must be TRUE for any of the sequencer outputs to fire, and ANY mode, in which output triggers can be fired if any of the sequencer inputs are TRUE.

How did you handle an 'Aborted' sequencer?

In almost all cases we have to delete the data inserted by this job from the DB manually, fix the job and then run the job again.

Asked by: Interview Candidate 

Best Answer:

Have you set the compilation options for the sequence so that, in case a job aborts, you need not run it from the first job? By selecting that compilation option you can run the aborted sequence from the point at which it was aborted.

For example, if you have 10 jobs (job1, job2, job3, etc.) in a sequence and job 5 aborts, then by checking "Add checkpoints so sequence is restartable on failure" and "Automatically handle activities that fail" you can restart this sequence from job 5 only; it will not rerun jobs 1, 2, 3 and 4.

Please check these options in your sequence. 

Hope this helps. 

Answered by: ritu singhai

What are other performance tunings you have done in your last project to increase the performance of slowly running jobs?

Staged the data coming from ODBC/OCI/DB2UDB stages or any database on the server using Hash/Sequential files for optimum performance, and also for data recovery in case the job aborts. Tuned the OCI stage for 'Array...

Asked by: Interview Candidate 

Best Answer:

1. Minimise the usage of Transformer (Instead of this use Copy, modify, Filter, Row Generator)

2. Use SQL Code while extracting the data

3. Handle the nulls

4. Minimise the warnings

5. Reduce the number of lookups in a job design

6. Use not more than 20 stages in a job

7. Use the IPC stage between two passive stages; it reduces processing time

8. Drop indexes before data loading and recreate after loading data into tables

9. Generally we cannot avoid the number of lookups if our requirements make lookups compulsory.

10. There is no hard limit on the number of stages (like 20 or 30), but we can break the job into smaller jobs and then use Dataset stages to store the data.

11. The IPC stage is provided in Server jobs, not in Parallel jobs.

12. Check the write cache of the Hash file. If the same hash file is used for lookup as well as the target, disable this option.

13. If the hash file is used only for lookup then "enable Preload to memory". This will improve the performance. Also, check the order of execution of the routines.

14. Don't use more than 7 lookups in the same transformer; introduce new transformers if it exceeds 7 lookups.

15. Use Preload to memory option in the hash file output.

16. Use Write to cache in the hash file input.

17. Write into the error tables only after all the transformer stages.

18. Reduce the width of the input record - remove the columns that you would not use.

19. Cache the hash files you are reading from and writing into. Make sure your cache is big enough to hold the hash files.


20. Use ANALYZE.FILE or HASH.HELP to determine the optimal settings for your hash files.       This would also minimize overflow on the hash file.

21. If possible, break the input into multiple threads and run multiple instances of the job.

22. Staged the data coming from ODBC/OCI/DB2UDB stages or any database on the server using Hash/Sequential files for optimum performance, and also for data recovery in case the job aborts.

23. Tuned the OCI stage for 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects.

24. Tuned the 'Project Tunables' in Administrator for better performance.

25. Used sorted data for Aggregator.

26. Sorted the data as much as possible in DB and reduced the use of DS-Sort for better performance of jobs

27. Removed the data not used from the source as early as possible in the job.

28. Worked with DB-admin to create appropriate Indexes on tables for better performance of DS queries

29. Converted some of the complex joins/business logic in DS to stored procedures on the database for faster execution of the jobs.

30. If an input file has an excessive number of rows and can be split-up then use standard logic to run jobs in parallel.

31. Before writing a routine or a transform, make sure that the functionality required is not already available in one of the standard routines supplied in the sdk or ds utilities categories.

Constraints are generally CPU intensive and take a significant amount of time to process. This may be the case if the constraint calls routines or external macros, but if it is inline code then the overhead will be minimal.

32. Try to have the constraints in the 'Selection' criteria of the jobs itself. This will eliminate the unnecessary records even

getting in before joins are made.

33. Tuning should occur on a job-by-job basis.

34. Use the power of DBMS.

35. Try not to use a sort stage when you can use an ORDER BY clause in the database.

36. Using a constraint to filter a record set is much slower than performing a SELECT … WHERE….

37. Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally faster than using ODBC or OLE.

Where do you use Link-Partitioner and Link-Collector?

Link Partitioner - used for partitioning the data. Link Collector - used for collecting the partitioned data.

Asked by: Interview Candidate

Best Answer:

Link Partitioner and Link Collector are basically used to introduce data parallelism in server jobs. The Link Partitioner splits the data onto many links; once the data is processed, the Link Collector collects the data and passes it to a single link. These are used in server jobs. In DataStage parallel jobs, these things are built in and automatically taken care of.

Answered by: nikhilanshuman


What are OConv() and IConv() functions and where are they used?

IConv() - converts a string to an internal storage format. OConv() - converts an expression to an output format.

Asked by: Interview Candidate


Best Answer:

Iconv is used to convert the date into the internal format, i.e. a format only DataStage can understand. For example, for a date coming in mm/dd/yyyy format, DataStage will convert the date into an internal number (something like 740). You can then present this internal number in your own format by using Oconv. Suppose you want to change mm/dd/yyyy to dd/mm/yyyy: you use Iconv and Oconv together, as in Oconv(Iconv(DateFromInputString, <Iconv format - see the help>), <required Oconv format>).

Answered by: sekr
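
For illustration, a minimal sketch of that conversion as it might appear in a server routine or Transformer derivation (the variable names are made up, and the exact conversion codes should be checked against the Iconv/Oconv help for your release):

* Convert an external mm/dd/yyyy string to dd/mm/yyyy via the internal date number.
InDate   = "12/31/2023"                                   ;* external mm/dd/yyyy value
Internal = Iconv(InDate, "D/MDY[2,2,4]")                  ;* internal day number understood by DataStage
OutDate  = Oconv(Internal, "D/DMY[2,2,4]")                ;* formatted back out as "31/12/2023"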

What is Metastage?


Asked by: Interview Candidate

Best Answer:

MetaStage is a persistent metadata directory that uniquely synchronizes metadata across multiple separate silos, eliminating re-keying and the manual establishment of cross-tool relationships. Based on patented technology, it provides seamless cross-tool integration throughout the entire Business Intelligence and data integration life cycle and tool sets.

Answered by: spartankiya



What are the command line functions that import and export the DS jobs? 

A. dsimport.exe - imports the DataStage components.
B. dsexport.exe - exports the DataStage components.

What is the difference between DataStage and Informatica?

Here is a very good article on these differences, which helps to get an idea. Basically it depends on what you are trying to accomplish.

What are the requirements for your ETL tool?

Do you have large sequential files (1 million rows, for example) that need to be compared every day versus yesterday? 

If so, then ask how each vendor would do that. Think about what process they are going to do. Are they requiring you to load yesterday's file into a table and do lookups?

If so, RUN!! Are they doing a match/merge routine that knows how to process this in sequential files? Then maybe they are the right one. It all depends on what you need the ETL to do. 

If you are small enough in your data sets, then either would probably be OK.

I want to process 3 files sequentially, one by one. How can I do that? While processing the files it should fetch the files automatically.

If the metadata for all the files is the same, then create a job having the file name as a parameter, then use the same job in a routine and call the job with a different file name each time; or you can create a sequencer to use the job. I think in DataStage 8.0.1 there is an option in the sequence job, namely Loop, via which this purpose can be achieved.
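
As a rough job-control sketch of that idea (the job name LoadFile, the parameter name FileName and the file list are assumptions, not part of the original answer):

* Run the same parameterised job once per file, waiting for each run to finish.
FileList = "file1.txt" : @FM : "file2.txt" : @FM : "file3.txt"
For i = 1 To Dcount(FileList, @FM)
   ThisFile = FileList<i>
   hJob = DSAttachJob("LoadFile", DSJ.ERRFATAL)           ;* assumed job designed with a FileName parameter
   ErrCode = DSSetParam(hJob, "FileName", ThisFile)
   ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
   ErrCode = DSWaitForJob(hJob)                           ;* wait, so the files are processed one by one
   ErrCode = DSDetachJob(hJob)
Next i

In DataStage 8.x the Loop activity in a job sequence achieves the same thing without hand-written code.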

What happens if RCP is disabled?

Runtime column propagation (RCP): if RCP is enabled for a job, and specifically for those stages whose output connects to the shared container input, then metadata will be propagated at run time, so there is no need to map it at design time.

If RCP is disabled for the job, then OSH has to perform an import and an export every time the job runs, and the processing time of the job also increases.


Where does a Unix script called from DataStage execute: on the client machine or on the server? If it executes on the server, where exactly does it run?

DataStage jobs are executed on the server machines only. There is nothing that is stored on the client machine.

Default nodes for DataStage Parallel Edition

Actually, the number of nodes depends on the number of processors in your system. If your system supports two processors, we will get two nodes by default.

How can we pass parameters to a job by using a file?

You can do this by passing parameters from a Unix file and then calling the execution of a DataStage job; the DS job has the parameters defined (which are passed by Unix).
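
One hedged way to sketch this from a controlling job (the parameter file path, its name=value layout, and the job and parameter names are all assumptions): read the Unix file line by line and apply each name=value pair with DSSetParam before running the job.

* Read name=value pairs from a Unix file and pass them as job parameters.
ParamFile = "/etl/params/daily_load.txt"                  ;* assumed path and layout
hJob = DSAttachJob("DailyLoad", DSJ.ERRFATAL)             ;* assumed job name
OpenSeq ParamFile To F.PARAMS Else Stop "Cannot open " : ParamFile
Loop
   ReadSeq Line From F.PARAMS Else Exit                   ;* e.g. "RunDate=2023-12-31"
   ErrCode = DSSetParam(hJob, Field(Line, "=", 1), Field(Line, "=", 2))
Repeat
CloseSeq F.PARAMS
ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
ErrCode = DSWaitForJob(hJob)

The same idea can also be driven from a Unix shell script that reads the file and passes each value to the job with the dsjob command's -param option.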

What is 'insert for update' in DataStage?

I think 'insert for update' means that the updated value is inserted in order to maintain history.


What are the Repository Tables in DataStage and What are they?

A data warehouse is a repository (centralized as well as distributed) of data, able to answer any ad hoc, analytical, historical or complex queries. Metadata is data about data. Examples of metadata include data element descriptions, data type descriptions, attribute/property descriptions, range/domain descriptions, and process/method descriptions. The repository environment encompasses all corporate metadata resources: database catalogs, data dictionaries, and navigation services. Metadata includes things like the name, length, valid values, and description of a data element. Metadata is stored in a data dictionary and repository. It insulates the data warehouse from changes in the schema of operational systems.

In DataStage I/O and Transfer, under the interface tab (input, output and transfer pages) you will have 4 tabs; the last one is Build, and under that you can find the TABLE NAME.

The DataStage client components are:

Administrator - administers DataStage projects and conducts housekeeping on the server.

Designer - creates DataStage jobs that are compiled into executable programs.

Director - used to run and monitor the DataStage jobs.

Manager - allows you to view and edit the contents of the repository.

What is version Control?

Version Control

stores different versions of DS jobs

runs different versions of the same job

reverts to a previous version of a job

views version histories

What happens if the output of a hash file is connected to a Transformer stage? What error does it throw?

If the hash file output is connected to a Transformer stage, the hash file will be considered as the lookup file when there is a primary link to the same Transformer stage; if there is no primary link, then it will be treated as the primary link itself. You can do SCD in a server job by using this lookup functionality. This will not return any error code.

How can I extract data from DB2 (on IBM iSeries) to the data warehouse via Datastage as the ETL tool. I mean do I first need to use ODBC to create connectivity and use an adapter for the extraction and transformation of data? Thanks so much if anybody could provide an answer.

You would need to install ODBC drivers to connect to the DB2 instance (they do not come with the regular drivers that we try to install; use the CD provided for the DB2 installation, which has the ODBC drivers to connect to DB2) and then try it out.

How can I connect my DB2 database on AS400 to DataStage? Do I need to use ODBC first to open the database connectivity and then use an adapter for just connecting between the two? Thanks a lot for any replies.

You need to configure the ODBC connectivity for the database (DB2 or AS400) in DataStage.

What is NLS in DataStage? How do we use NLS in DataStage? What are the advantages of it? At the time of installation I did not choose the NLS option; now I want to use that option. What can I do? Reinstall DataStage, or first uninstall and then install it again?

Just reinstall; you will see the option to include NLS.

NLS stands for National Language Support. It is used for including other country languages like French, German, Spanish, etc. (whose scripts are similar to English) in the data that is processed using data warehousing.

What is the OCI? And how is it used by the ETL tool?

OCI doesn't mean Orabulk data. It actually uses the "Oracle Call Interface" of Oracle to load the data. It is kind of the lowest-level interface to Oracle being used for loading the data.

What is APT_CONFIG in DataStage?

APT_CONFIG is just an environment variable used to identify the *.apt file. Don't confuse the variable with the *.apt file itself, which has the node information and the configuration of the SMP/MPP server.

The APT config file is used to store the node information; it contains the disk storage information and scratch disk information, and DataStage understands the architecture of the system based on this config file. For parallel processing, normally two or more nodes are defined in this file, as sketched below.
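
For illustration only, a minimal two-node configuration file might look like this (the node names, host name and directory paths are assumptions; a real file is written for the actual hardware):

{
   node "node1"
   {
      fastname "etlserver"
      pools ""
      resource disk "/ds/data/node1" {pools ""}
      resource scratchdisk "/ds/scratch/node1" {pools ""}
   }
   node "node2"
   {
      fastname "etlserver"
      pools ""
      resource disk "/ds/data/node2" {pools ""}
      resource scratchdisk "/ds/scratch/node2" {pools ""}
   }
}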

How do we use the NLS function in DataStage? What are the advantages of the NLS function? Where can we use it? Explain briefly.

By using the NLS function we can do the following:

- Process the data in a wide range of languages
- Use local formats for dates, times and money

- Sort the data according to the local rules

If NLS is installed, various extra features appear in the product.

For Server jobs, NLS is implemented in DataStage Server engine

For Parallel jobs, NLS is implemented using the ICU library.

How can we ETL an Excel file to a Datamart?

Open the ODBC Data Source Administrator found in Control Panel / Administrative Tools.

Under the System DSN tab, add the Microsoft Excel driver.

Then you'll be able to access the XLS file from DataStage.

What is the meaning of "Try to have the constraints in the 'Selection' criteria of the jobs itself. This will eliminate the unnecessary records even getting in before joins are made"?

This means: try to improve the performance by avoiding the use of constraints in the job wherever possible, and instead filter while selecting the data itself, using a WHERE clause. This improves performance.

What is the difference between Datastage and Datastage TX?

It's a critical question to answer, but one thing I can tell you is that DataStage TX is not an ETL tool and it is not a new version of DataStage 7.5.

TX is used for ODS sources; this much I know.

DS TX is not an ETL tool. It comes under the category of EAI.


TX helps in connecting various front ends to different back ends. It has built-in plugins for many different sources and targets. In short, it can make your application platform independent. It can also parse data.

If data is partitioned in your job on key 1 and then you aggregate on key 2, what issues could arise?

The data will have to be repartitioned on key 2 for the aggregation; rows with the same key 2 value may otherwise sit on different partitions, giving wrong aggregate results, and the extra repartitioning also means the job will take more time to execute.

How is DataStage 4.0 functionally different from the Enterprise Edition now? What are the exact changes?

There are a lot of changes in DS EE: the CDC stage, the Procedure stage, etc.

Is it possible to run parallel jobs in server jobs?

No, it is not possible to run parallel jobs in server jobs, but server jobs can be executed in parallel jobs.

Can anyone tell me how to extract data from more than one heterogeneous source; for example, one sequential file, Sybase and Oracle, in a single job?

Yes, you can extract the data from two heterogeneous sources in DataStage using the Transformer stage. It's simple: you just need to form a link from the two sources into the Transformer stage. That's it. Bye, Hameed.

If a DataStage job aborts after, say, 1000 records, how do you continue the job from the 1000th record after fixing the error?

By specifying checkpointing in the job sequence properties; if we restart the job, then the job will start by skipping up to the failed record. This option is available in the 7.5 edition.

There are three different types of user-created stages available for PX. What are they? Which would you use? What are the disadvantages of using each type?

These are the three different stages: i) Custom ii) Build iii) Wrapped

What is the max capacity of Hash file in DataStage?

Take a look at the uvconfig file:


# 64BIT_FILES - This sets the default mode used to
# create static hashed and dynamic files.
# A value of 0 results in the creation of 32-bit
# files. 32-bit files have a maximum file size of
# 2 gigabytes. A value of 1 results in the creation
# of 64-bit files (ONLY valid on 64-bit capable platforms).
# The maximum file size for 64-bit
# files is system dependent. The default behavior
# may be overridden by keywords on certain commands.
64BIT_FILES 0

What are orabulk and bcp stages?

ORABULK is used to load bulk data into single table of target oracle database.

BCP is used to load bulk data into a single table for Microsoft SQL Server and Sybase.

What is a project? Specify its various components?

You always enter DataStage through a DataStage project. When you start a DataStage client you are prompted to connect to a project. Each project contains:

DataStage jobs. 

Built-in components. These are predefined components used in a job. 

User-defined components. These are customized components created using the DataStage Manager or DataStage Designer


Is it possible to call one job in another job in server jobs?

I think we can call a job within another job. In fact, 'calling' doesn't sound quite right, because you attach/add the other job through the job properties. You can attach zero or more jobs.

Steps will be Edit --> Job Properties --> Job Control 

Click on Add Job and select the desired job. 
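
Behind the scenes, the Add Job button generates job-control BASIC on that tab. A minimal hand-written sketch of the same idea (the child job name ChildJob and the parameter shown are assumptions) looks roughly like this:

* Attach, run and wait for another job from the controlling job's Job Control code.
hJob = DSAttachJob("ChildJob", DSJ.ERRFATAL)
ErrCode = DSSetParam(hJob, "RunDate", "2023-12-31")       ;* optional: set the child job's parameters
ErrCode = DSRunJob(hJob, DSJ.RUNNORMAL)
ErrCode = DSWaitForJob(hJob)
Status  = DSGetJobInfo(hJob, DSJ.JOBSTATUS)
If Status <> DSJS.RUNOK And Status <> DSJS.RUNWARN Then
   Call DSLogFatal("ChildJob did not finish cleanly", "JobControl")
End
ErrCode = DSDetachJob(hJob)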


How to find the number of rows in a sequential file?

There is a system variable named @INROWNUM that can be used in a processing stage (like a Transformer, etc.). The data is processed record by record in the processing stage. @INROWNUM can be used to count the row number of the record being processed.

For example, consider that a Sequential File stage is connected to a Transformer stage. Include a stage variable sv = @INROWNUM. Use this as the value for one of the output columns, countCol, whose derivation is sv.

Now at the output you can see the count for each record. The countCol value for the last record gives the total number of records in the sequential file.

What is the difference between the DRS and ODBC stages?

To answer your question, the DRS stage should be faster than the ODBC stage as it uses native database connectivity. You will need to install and configure the required database clients on your DataStage server for it to work.

The Dynamic Relational Stage was leveraged for PeopleSoft to have a job run on any of the supported databases. It supports ODBC connections too. Read more about that in the plug-in documentation.

ODBC uses the ODBC driver for a particular database; DRS is a stage that tries to make it seamless to switch from one database to another. It uses the native connectivity for the chosen target.

How does DataStage handle the user security?

We have to create users in the Administrators and give the necessary privileges to users.