DataStage Interview Questions and Answers

Page 1: Data Stage Interview

DataStage Designer

What is Data warehouse ?

What is Operational Databases ?

Data Extraction ?

Data Aggregation ?

Data Transformation ?

Advantages of Data warehouse ?

Page 2: Data Stage Interview

DataStage ?

Client Component ?

Server Component ?

DataStage Jobs ?

Page 3: Data Stage Interview

DataStage NLS ?

Stages

Passive Stage ?

Active Stage ?

Page 4: Data Stage Interview

Server Job Stages

Page 5: Data Stage Interview

Parallel Job Stage

Page 6: Data Stage Interview

Links ?

Parallel Processing

Page 7: Data Stage Interview

Types of Parallelism

Plug in Stage?

Difference Between Lookup and Join:

Page 8: Data Stage Interview

What is Staging Variable?

What are Routines?

what are the Job parameters?

why fact table is in normal form?

What are Stage Variables, Derivations and Constants?

What are an Entity, Attribute and Relationship?

Page 9: Data Stage Interview

DataStage Designer

A data warehouse is a central integrated database containing data from all the operational sources and archive systems in an organization. It contains a copy of transaction data specifically structured for query analysis.

This database can be accessed by all users, ensuring that each group in an organization is accessing valuable, stable data.

Operational databases are usually accessed by many concurrent users. The data in the database changes quickly and often. It is very difficult to obtain an accurate picture of the contents of the database at any one time. Because operational databases are task oriented, for example, stock inventory systems, they are likely to contain "dirty" data. The high throughput of data into operational databases makes it difficult to trap mistakes or incomplete entries. However, you can cleanse data before loading it into a data warehouse, ensuring that you store only "good" complete records.

Data extraction is the process used to obtain data from operational sources, archives, and external data sources.

Data aggregation summarizes detail records before they are loaded; the summed (aggregated) total is stored in the data warehouse. Because the number of records stored in the data warehouse is greatly reduced, it is easier for the end user to browse and analyze the data.

Transformation is the process that converts data to a required definition and value. Data is transformed using routines based on a transformation rule; for example, product codes can be mapped to a common format using a transformation rule that applies only to product codes. After data has been transformed it can be loaded into the data warehouse in a recognized and required format.

• Capitalizes on the potential value of the organization's information
• Improves the quality and accessibility of data
• Combines valuable archive data with the latest data in operational sources

Page 10: Data Stage Interview

• Increases the amount of information available to users
• Reduces the requirement of users to access operational data
• Reduces the strain on IT departments, as they can produce one database to serve all user groups
• Allows new reports and studies to be introduced without disrupting operational systems
• Promotes users to be self-sufficient

DataStage is an ETL tool that handles the design and processing required to build a data warehouse. It:
• Extracts data from any number or type of database.
• Transforms data. DataStage has a set of predefined transforms and functions you can use to convert your data. You can easily extend the functionality by defining your own transforms to use.
• Loads the data warehouse.

DataStage consists of a number of client components and server components. DataStage server and parallel jobs are compiled and run on the DataStage server. The job will connect to databases on other machines as necessary, extract data, process it, then write the data to the target data warehouse.

DataStage mainframe jobs are compiled and run on a mainframe. Data extracted by such jobs is then loaded into the data warehouse.

Client Components:
DataStage Designer -> A design interface used to create DataStage applications (known as jobs).
DataStage Director -> A user interface used to validate, schedule, run, and monitor DataStage server jobs and parallel jobs.
DataStage Manager -> A user interface used to view and edit the contents of the Repository.
DataStage Administrator -> A user interface used to perform administration tasks such as setting up DataStage users, creating and moving projects, and setting up purging criteria.

Server Components:
Repository -> A central store that contains all the information required to build a data mart or data warehouse.
DataStage Server -> Runs executable jobs that extract, transform, and load data into a data warehouse.
DataStage Package Installer -> A user interface used to install packaged DataStage jobs and plug-ins.

Server Jobs -> These are compiled and run on the DataStage server. A server job will connect to databases on other machines as necessary, extract data, process it, then write the data to the target data warehouse.

Parallel Jobs -> These are compiled and run on the DataStage server in a similar way to server jobs, but support parallel processing on SMP, MPP, and cluster systems.
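Once compiled, either kind of job can also be started outside the Designer with the dsjob command-line client. A minimal sketch, assuming the standard dsjob client is on the path; the project and job names are placeholders:

dsjob -run -jobstatus myproject myjob

The -jobstatus option makes dsjob wait for the job to finish and return an exit code that reflects the job's final status.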

Page 11: Data Stage Interview

DataStage has built-in National Language Support (NLS). With NLS installed, DataStage can do the following:
• Process data in a wide range of languages
• Accept data in any character set into most DataStage fields
• Use local formats for dates, times, and money (server jobs)
• Sort data according to local rules
• Convert data between different encodings of the same language (for example, for Japanese it can convert JIS to EUC)


Mainframe Jobs -> These are available only if you have Enterprise MVS Edition installed. A mainframe job is compiled and run on the mainframe. Data extracted by such jobs is then loaded into the data warehouse.

Shared Containers -> These are reusable job elements. They typically comprise a number of stages and links. Copies of shared containers can be used in any number of server jobs or parallel jobs and edited as required.

Job Sequences -> A job sequence allows you to specify a sequence of DataStage jobs to be executed, and actions to take depending on results.

Built-in Stages -> Supplied with DataStage and used for extracting, aggregating, transforming, or writing data. All types of job have these stages.

Plug-in Stages -> Additional stages that can be installed in DataStage to perform specialized tasks that the built-in stages do not support. Server jobs and parallel jobs can make use of these.

Job Sequence Stages -> Special built-in stages which allow you to define sequences of activities to run. Only job sequences have these.

A job consists of stages linked together which describe the flow of data from a data source to a data target (for example, a final data warehouse). The stages available in the DataStage Designer depend on the type of job that is currently open in the Designer.

A passive stage handles access to databases for the extraction or writing of data.

Active stages model the flow of data and provide mechanisms for combining data streams, aggregating data,

Page 12: Data Stage Interview

and converting data from one data type to another.

Database

ODBC. -> Extracts data from or loads data into databases that support the industry-standard Open Database Connectivity API. This stage is also used as an intermediate stage for aggregating data. This is a passive stage.

UniVerse. -> Extracts data from or loads data into UniVerse databases. This stage is also used as an intermediate stage for aggregating data. This is a passive stage.

UniData. -> Extracts data from or loads data into UniData databases. This is a passive stage.

Oracle 7 Load. -> Bulk loads an Oracle 7 database. Previously known as ORABULK.

Sybase BCP Load. -> Bulk loads a Sybase 6 database. Previously known as BCPLoad.

File

Processing

Hashed File. -> Extracts data from or loads data into databases that contain hashed files. Also acts as an intermediate stage for quick lookups. This is a passive stage.

Sequential File. -> Extracts data from, or loads data into,operating system text files. This is a passive stage.

Aggregator. -> Classifies incoming data into groups, computes totals and other summary functions for each group, and passes them to another stage in the job. This is an active stage.

BASIC Transformer. -> Receives incoming data, transforms it in a variety of ways, and outputs it to another stage in the job. This is an active stage.

Folder. -> Folder stages are used to read or write data as files in a directory located on the DataStage server.

Inter-process. -> Provides a communication channel between DataStage processes running simultaneously in the same job. This is a passive stage.

Page 13: Data Stage Interview

Real Time

Containers

Databases

Link Partitioner. -> Allows you to partition a data set into up to 64 partitions. Enables server jobs to run in parallel on SMP systems. This is an active stage.

Link Collector. -> Collects partitioned data from up to 64 partitions. Enables server jobs to run in parallel on SMP systems. This is an active stage.

RTI Source. -> Entry point for a Job exposed as an RTI service. The Table Definition specified on the output link dictates the input arguments of the generated RTI service.

RTI Target. -> Exit point for a Job exposed as an RTI service. The Table Definition on the input link dictates the output arguments of the generated RTI service.

Server Shared Container. -> Represents a group of stages and links. The group is replaced by a single Shared Container stage in the Diagram window.

Local Container. -> Represents a group of stages and links. The group is replaced by a single Container stage in the Diagram window.

Container Input and Output. -> Represent the interface that links a container stage to the rest of the job design.

DB2/UDB Enterprise. -> Allows you to read and write a DB2 database.

Informix Enterprise. -> Allows you to read and write an Informix XPS database.

Oracle Enterprise. -> Allows you to read and write an Oracle database.

Teradata Enterprise. -> Allows you to read and write a Teradata database.

Page 14: Data Stage Interview

Development/Debug Stages
Row Generator. -> Generates a dummy data set.
Column Generator. -> Adds extra columns to a data set.

Sample. -> Samples a data set.

File Stages

Data set. -> Stores a set of data.

File set. -> A set of files used to store data.
Lookup file set. -> Provides storage for a lookup table.
SAS data set. -> Provides storage for SAS data sets.

Processing Stages

Change apply. -> Applies a set of captured changes to a data set.

Compress. -> Compresses a data set.
Copy. -> Copies a data set.

Difference. -> Compares two data sets and works out the difference between them.

Head. -> Copies the specified number of records from the beginning of a data partition.

Peek. -> Prints column values to the screen as records are copied from its input data set to one or more output data sets.

Tail. -> Copies the specified number of records from the end of a data partition.

Write range map. -> Enables you to carry out range map partitioning on a data set.

Complex Flat File. -> Allows you to read complex flat files on a mainframe machine. This is intended for use on USS systems.

External source. -> Allows a parallel job to read an external data source.
External target. -> Allows a parallel job to write to an external data source.

Sequential file. -> Extracts data from, or writes data to, a text file.

Transformer. -> Receives incoming data, transforms it in a variety of ways, and outputs it to another stage in the job.

Aggregator. -> Classifies incoming data into groups, computes totals and other summary functions for each group, and passes them to another stage in the job.

Change Capture. -> Compares two data sets and records the differences between them.
Compare. -> Performs a column-by-column compare of two pre-sorted data sets.

Decode. -> Uses an operating system command to decode a previously encoded data set.

Page 15: Data Stage Interview

Expand. -> Expands a previously compressed data set.

Funnel. -> Copies multiple data sets to a single data set.

Lookup. -> Performs table lookups.
Merge. -> Combines data sets.
Modify. -> Alters the record schema of its input data set.

Sort. -> Sorts input columns.

Real Time

Restructure

Other Stages

Encode. -> Encodes a data set using an operating system command.

External Filter. -> Uses an external program to filter a data set.
Filter. -> Transfers, unmodified, the records of the input data set which satisfy requirements that you specify, and filters out all other records.

Generic. -> Allows Orchestrate experts to specify their own custom commands.

Remove duplicates. -> Removes duplicate entries from a data set.

SAS (Statistical Analysis System). -> Allows you to run SAS applications from within the DataStage job.

Switch. -> Takes a single data set as input and assigns each input record to an output data set based on the value of a selector field.
Surrogate Key. -> Generates one or more surrogate key columns and adds them to an existing data set.

RTI Source. -> Entry point for a Job exposed as an RTI service. The Table Definition specified on the output link dictates the input arguments of the generated RTI service.

RTI Target. -> Exit point for a Job exposed as an RTI service. The Table Definition on the input link dictates the output arguments of the generated RTI service.

Column export. -> Exports a column of another type to a string or binary column.
Column import. -> Imports a column from a string or binary column.
Combine records. -> Combines several columns associated by a key field to build a vector.
Make subrecord. -> Combines a number of vectors to form a subrecord.
Make vector. -> Combines a number of fields to form a vector.
Promote subrecord. -> Promotes the members of a subrecord to a top-level field.
Split subrecord. -> Separates a number of subrecords into top-level fields.
Split vector. -> Separates a number of vector members into separate columns.

Page 16: Data Stage Interview

Linking Server Stages ->

Linking Parallel Stages ->

Parallel processing is the ability to carry out multiple operations or tasks simultaneously.

Parallel Shared Container. -> Represents a group of stages and links. The group is replaced by a single Parallel Shared Container stage in the Diagram window. Parallel Shared Container stages are handled differently from other stage types; they do not appear on the palette.

Local Container. -> Represents a group of stages and links. The group is replaced by a single Container stage in the Diagram window.

Container Input and Output. -> Represent the interface that links a container stage to the rest of the job design.

Links join the various stages in a job together and are used to specify how data flows when the job is run.

Stream. A link representing the flow of data. This is the principal type of link, and is used by both active and passive stages.

Reference. A link representing a table lookup. Reference links are only used by active stages. They are used to provide information that might affect the way data is changed, but do not supply the data to be changed.

Stream. -> A link representing the flow of data. This is the principal type of link, and is used by all stage types.

Reference. -> A link representing a table lookup. Reference links can only be input to Lookup stages; they can only be output from certain types of stage.

Reject. -> Some parallel job stages allow you to output records that have been rejected for some reason onto an output link.

Page 17: Data Stage Interview

Pipeline Parallelism -> If we run a job on a system with at least three processors, the stage reading the data would start on one processor and begin filling a pipeline with the data it had read. The transformation stage would start running on a second processor as soon as there was data in the pipeline, process it, and start filling another pipeline. The target stage would start running on a third processor as soon as there was data in its pipeline, so all three stages operate simultaneously.

Partitioning Parallelism -> Using partitioning parallelism, the same job would effectively be run simultaneously by several processors, each handling a separate subset of the data.

BULK COPY PROGRAM: Microsoft SQL Server and Sybase have a utility called BCP (Bulk Copy Program). This command-line utility copies SQL Server data to or from an operating system file in a user-specified format. BCP uses the bulk copy API in the SQL Server client libraries. By using BCP, you can load large volumes of data into a table without recording each insert in a log file. You can run BCP manually from a command line using command-line options (switches). A format (.fmt) file is created which is used to load the data into the database.
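A minimal sketch of a BCP load from the command line; the database, table, file, and login names are placeholders, not values taken from this document:

bcp MyDatabase.dbo.MyTable in customers.dat -f customers.fmt -S myserver -U loaduser -P secret

Here "in" means load the file into the table, -f names the format (.fmt) file, and -S, -U, and -P identify the server and login.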

The Orabulk stage is a plug-in stage supplied by Ascential. The Orabulk plug-in is installed automatically when you install DataStage. An Orabulk stage generates control and data files for bulk loading into a single table on an Oracle target database. The files are suitable for loading into the target database using the Oracle command sqlldr. One input link provides a sequence of rows to load into an Oracle table. The meta data for each input column determines how it is loaded. One optional output link provides a copy of all input rows to allow easy combination of this stage with other stages.
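Once the Orabulk stage has produced the control and data files, they are loaded with the Oracle SQL*Loader utility. A sketch with placeholder file names and credentials:

sqlldr userid=scott/tiger control=mytable.ctl data=mytable.dat log=mytable.log

The control file tells sqlldr how the records in the data file map onto the columns of the target table.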

Lookup and join perform equivalent operations: combining two or more input datasets based on one or more specified keys. Lookup requires all but one (the first or primary) input to fit into physical memory. Join requires all inputs to be sorted. When one unsorted input is very large or sorting isn't feasible, lookup is the preferred solution. When all inputs are of manageable size or are pre-sorted, join is the preferred solution.

Page 18: Data Stage Interview

Staging (stage) variables are temporary variables created in the Transformer stage for intermediate calculations.

A fact table consists of measurements of business requirements and the foreign keys of dimension tables, as per the business rules.

Routines are functions that we develop in BASIC code for required tasks which DataStage does not fully support out of the box (complex logic).

These parameters are used to provide administrative control and to change the run-time values of a job. Choose Edit > Job Parameters; on the Parameters tab we can define the name, prompt, type and value.
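Once parameters are defined, their values can also be supplied when the job is started from the command line with the dsjob client. A sketch; the parameter, project, and job names are placeholders:

dsjob -run -param SourceDir=/data/in -param RunDate=20160103 myproject myjob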

Stage Variable - An intermediate processing variable that retains its value during a read and does not pass the value into a target column.
Derivation - An expression that specifies the value to be passed on to the target column.
Constant - A condition that is either true or false and that specifies the flow of data along a link.

An entity represents a chunk of information; in relational databases, an entity often maps to a table. An attribute is a component of an entity and helps define the uniqueness of the entity; in relational databases, an attribute maps to a column. The entities are linked together using relationships.


Page 20: Data Stage Interview

JOB SEQUENCE

Job Sequence?

Activity Stages?

Page 21: Data Stage Interview

Triggers?

Job Sequence Properties?

Page 22: Data Stage Interview

Job Report

How do you generate Sequence number in Datastage?

Page 23: Data Stage Interview

JOB SEQUENCE

DataStage provides a graphical Job Sequencer which allows you to specify a sequence of server jobs or parallel jobs to run. The sequence can also contain control information; for example, you can specify different courses of action to take depending on whether a job in the sequence succeeds or fails. Once you have defined a job sequence, it can be scheduled and run using the DataStage Director. It appears in the DataStage Repository and in the DataStage Director client as a job.

• Job. Specifies a DataStage server or parallel job.
• Routine. Specifies a routine. This can be any routine in the DataStage Repository (but not transforms).
• ExecCommand. Specifies an operating system command to execute.
• Email Notification. Specifies that an email notification should be sent at this point of the sequence (uses SMTP).
• Wait-for-file. Waits for a specified file to appear or disappear.

• Exception Handler. There can only be one of these in a job sequence. It is executed if a job in the sequence fails to run (other exceptions are handled by triggers) or if the job aborts and the Automatically handle job runs that fail option is set for that job.
• Nested Conditions. Allows you to further branch the execution of a sequence depending on a condition.
• Sequencer. Allows you to synchronize the control flow of multiple activities in a job sequence.
• Terminator. Allows you to specify that, if certain situations occur, the jobs a sequence is running shut down cleanly.

Page 24: Data Stage Interview

General, Parameters, Job Control, Dependencies, NLS

• Start Loop and End Loop. Together these two stages allow you to implement a For…Next or For…Each loop within your sequence.
• User Variable. Allows you to define variables within a sequence. These variables can then be used later on in the sequence, for example to set job parameters.

The control flow in the sequence is dictated by how you interconnect activity icons with triggers.

There are three types of trigger:
• Conditional. A conditional trigger fires the target activity if the source activity fulfills the specified condition. The condition is defined by an expression, and can be one of the following types:
– OK. Activity succeeds.
– Failed. Activity fails.
– Warnings. Activity produced warnings.
– ReturnValue. A routine or command has returned a value.
– Custom. Allows you to define a custom expression.
– User status. Allows you to define a custom status message to write to the log.
• Unconditional. An unconditional trigger fires the target activity once the source activity completes, regardless of what other triggers are fired from the same activity.
• Otherwise. An otherwise trigger is used as a default where a source activity has multiple output triggers, but none of the conditional ones have fired.

Page 25: Data Stage Interview

The job reporting facility allows you to generate an HTML report of a server, parallel, or mainframe job or shared container. You can view this report in a standard Internet browser (such as Microsoft Internet Explorer) and print it from the browser. The report contains an image of the job design followed by information about the job or container and its stages; hotlinks facilitate navigation through the report, from the job image and contents list down to more detailed job component descriptions. The report is not dynamic: if you change the job design you will need to regenerate the report.
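A run-time report (as opposed to the HTML design report described above) can also be requested from the command line with the dsjob client. A sketch, assuming the standard dsjob syntax and placeholder names; the last argument selects the level of detail (for example BASIC, DETAIL, or XML):

dsjob -report myproject myjob DETAIL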

By using the routines KeyMgtGetNextVal and KeyMgtGetNextValConn. It can also be done with an Oracle sequence.

Page 26: Data Stage Interview

Scenarios

What are the environment variables in DataStage?

How do you extract job parameters from a file?

Suppose we have 3 jobs in a sequencer; if job 1 fails while running, how can we still run job 2 and job 3?

How do you remove duplicates using the Transformer stage in DataStage?

How do you call shell scripts from sequences in DataStage?

Page 27: Data Stage Interview

How do you reduce warnings?

How do you lock/unlock jobs as a DataStage admin?

How do you get unique records on multiple columns by using the Sequential File stage only?

If a column contains data like abc,aaa,xyz,pwe,xok,abc,xyz,abc,pwe,abc,pwe,xok,xyz,xxx,abc,roy,pwe,aaa,xxx,xyz,roy,xok, how do you send the unique values to one output and the remaining data to another?

Is there any possibility to generate an alphanumeric surrogate key?

How do you write an entry to an auditing table whenever a job finishes?

Page 28: Data Stage Interview

What is a .dsx file?

What is APT_DUMP_SCORE?

What is an audit table? Have you used an audit table in your project?

Can we use Round Robin for the Aggregator? Is there any benefit underlying it?

How many reject links can a Merge stage have?

I have 3 jobs A, B and C, which are dependent on each other. I want to run jobs A and C daily and job B only on Sunday. How can we do it?

How do you generate a surrogate key without using the Surrogate Key stage?

What are the push and pull techniques? If I want to import two sequential files to my desktop using the push technique, what do I do?

How do you capture rejected data when using a Join stage (not a Lookup stage)?

Page 29: Data Stage Interview

What are normalization and denormalization?

What are the different types of errors in DataStage?

How do you convert columns to rows in DataStage?

What are environment variables?

There are two tables, country and state. Table 1 has cid, cname; table 2 has sid, sname, cid. Based on cid, how do I display the countries that have more than 25 states?

What is the difference between the 7.1, 7.5.2 and 8.1 versions of DataStage?

What is the difference between a junk dimension and a conformed dimension?

30 jobs are running in UNIX and I want to find my job. How do I do this? Give me the command.

Page 30: Data Stage Interview

Where does DataStage store its repository?

How do you register plug-ins?

How can one source's columns or rows be loaded into two different tables?

A source Sequential File stage has 10 records and moves to a Transformer stage; one output link gets 2 records and the reject link gets 5 records. How do I capture the remaining 3 records?

Page 31: Data Stage Interview

Scenarios

To run a job even if its previous job in the sequence has failed, go to the Triggers tab of that particular job activity in the sequence itself. There you will find three fields:
Name: The name of the next link (the link going to the next job; e.g. for job activity 1 the link name will be the link going to job activity 2).
Expression Type: This lets you trigger the next job activity based on the status you want. For example, if job 1 fails and you want to run job 2 and job 3, go to the trigger properties of job 1 and select the expression type "Failed - (Conditional)". This way you can run job 2 even if job 1 is aborted. There are many other options available.
Expression: This is editable for some options; for expression type "Failed" you cannot change this field.

Double-click the Transformer stage and open the stage properties (the first icon in the header line). Go to Inputs, then Partitioning, and select a partitioning technique (anything other than Auto). Enable Perform Sort, then enable Unique and select the required key column. The output will then contain only unique values, so the duplicates are removed.

Shell scripts can be called in sequences by using the Execute Command activity. In this activity, type the following command:

bash /path of your script/scriptname.sh

The bash command is used to run the shell script.

Environment variables in DataStage are settings, often paths, that the system can use as shortcuts while a job runs instead of hard-coding the values everywhere. Most of the time, environment variables are defined when the software is installed; others can be added later at the project or job level.

We could use the dsjob command on a Linux or UNIX platform to extract parameters from a job.
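A sketch of that approach, assuming the standard dsjob client; the project and job names are placeholders and the params.txt file of name=value pairs is hypothetical:

# list the parameters the job expects
dsjob -lparams myproject myjob
# read name=value pairs from a file and pass them when starting the job
PARAMS=""
while read p; do PARAMS="$PARAMS -param $p"; done < params.txt
dsjob -run $PARAMS myproject myjob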

Page 32: Data Stage Interview

In the Sequential File stage there is an option called Filter. In this filter we can use whatever UNIX commands we want (for example, sort -u to keep only unique records).

By using the Sort stage: go to Properties, set Sorting Keys (key = column name) and set the option Allow Duplicates = False.

In order to reduce warnings you first need a clear idea of the particular warning. If you can fix it on the code or design side, do that; otherwise go to the Director, select the warning, right-click and choose Add rule to message handler, then click OK. From the next run onward you should not see those warnings.

It is not possible to generate an alphanumeric surrogate key in DataStage.

I think this answer might satisfy you: 1. Open the Administrator. 2. Go to the Projects tab. 3. Click the Command button. 4. Issue the LIST.READU command and press Execute (it lists the job locks; note the PID (process ID) of the job you want to unlock). 5. Close that and come back to the command window. 6. Give the DS.TOOLS command and execute it. 7. Read the options given there and type 4. 8. Then give 6 or 7 depending on your requirement. 9. Give the PID that you noted before. 10. Then "yes". 11. Generally it will not work the first time; press 7 again and give the PID again, and it will work. Please get back to me if any further clarification is required.
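A compact sketch of the same command-window session; the PID shown is a placeholder:

LIST.READU       (note the PID of the lock you want to release, e.g. 12345)
DS.TOOLS         (choose option 4, then 6 or 7, then enter 12345 and confirm)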

Some companies use shell scripts to load logs into the audit table, and some companies load logs into the audit table using DataStage jobs that we develop ourselves.

Page 33: Data Stage Interview

Yes, we can use Round Robin in the Aggregator. It is used for partitioning and collecting.

We can have n-1 reject links for a Merge stage (where n is the number of input links).

An audit table is essentially a log table; every job should have an audit table.

First, schedule jobs A and C in one sequence and run it Monday to Saturday. Next, take the three jobs, in dependency order, in one more sequence and schedule that job only on Sunday.

We can do it by using the Transformer. To generate the sequence number there is a formula using the system variables: @PARTITIONNUM + (@INROWNUM - 1) * @NUMPARTITIONS.

Push means the source team sends (pushes) the data; pull means the developer extracts (pulls) the data from the source.

A .dsx file is nothing but a DataStage export (backup) file. When we want to load the project onto another system or server, we take the file and load it on the other system/server.

We cannot capture the reject data by using the Join stage itself. For that we can use a Transformer stage after the Join stage, with a constraint that picks out the unmatched rows.

APT_DUMP_SCORE is a reporting environment variable used to show how the data is processed and how the processes (operators) are combined.
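A sketch of enabling it for a single run, assuming $APT_DUMP_SCORE has been added to the job as an environment-variable parameter; the project and job names, and this way of passing the variable, are assumptions rather than something taken from this document:

dsjob -run -param '$APT_DUMP_SCORE=True' myproject myjob

The score dump then appears in the job log for that run.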

Page 34: Data Stage Interview

ps -ef | grep USER_ID | grep JOB_NAME

Using the Pivot stage.

Join these two tables on cid and send all the columns to the output. Then, in an Aggregator stage, count the rows with key column cid. Then use a Filter or Transformer stage to keep the records with count > 25.

The main difference is that in 7.5 we can open a job only once at a time, but in 8.1 we can open one job multiple times in read-only mode. Other differences are that 8.1 has the Slowly Changing Dimension stage and a common repository.

Normalization is controlled by eliminating redundant data, whereas denormalization deliberately keeps redundant data.

JUNK DIMENSION: A dimension which cannot be used to describe the facts is known as a junk dimension (a junk dimension provides additional information to the main dimension). Example: customer address.

CONFORMED DIMENSION: A dimension table which can be shared by multiple fact tables is known as a conformed dimension. Example: a time dimension.

Basically, an environment variable is a predefined variable that we can use while creating a DataStage job. We can set it either at the project level or at the job level. Once we set a specific variable, that variable is available to the project/job.

Page 35: Data Stage Interview

DataStage stores its repository in the IBM UniVerse database.

Using the DataStage Manager: Tools -> Register Plug-in -> set the specific path -> OK.

For columns, we can directly map the single source's columns to two different targets. For rows, we have to put some constraint (condition) on each link.

Page 36: Data Stage Interview

DataStage Important Interview Questions and Answers

What is a data warehouse? What is the concept of a data warehouse?

What type of data is available in a data warehouse?

What is a node? What is node configuration?

What is the use of nodes?

Page 37: Data Stage Interview

What is the APT_CONFIG_FILE?

What is version control in DataStage?

What are the descriptor file and data file in a Dataset?

What is a job commit (in DataStage)?

What is a complex job in DataStage?

Page 38: Data Stage Interview

What are the Iconv and Oconv functions?

How to Improve Performance of Datastage Jobs?

Page 39: Data Stage Interview

Difference between Server Jobs and Parallel Jobs

Difference between Datastage and Informatica.

What is a compiler? What is the compilation process in DataStage?

Page 40: Data Stage Interview

What is modelling in DataStage?

What is a data mart? What are its importance and advantages?

Page 41: Data Stage Interview

Data Warehouse vs. Data Mart

Page 42: Data Stage Interview

What are the different types of errors in DataStage?

Page 43: Data Stage Interview

What are the client components in DataStage 7.5x2 version?

Page 44: Data Stage Interview

Difference Between 7.5x2 And 8.0.1?

Page 45: Data Stage Interview

What is IBM InfoSphere? What is its history?

What does a DataStage project contain?

Page 46: Data Stage Interview

What is the difference between the hash and modulus techniques?

What are the features of DataStage?

Page 47: Data Stage Interview

ETL Project Phase?

What is RCP?

Page 48: Data Stage Interview

What are the roles and responsibilities of a software engineer?

What are the server components of the DataStage 7.5x2 version?

Page 49: Data Stage Interview

How do you create a group ID in the Sort stage?

What is a fast (rapidly) changing dimension?

What is force compilation?

How many rows are sorted in the Sort stage by default in server jobs?

When should we go for a Sequential File stage and when for a Dataset in DataStage?

Page 50: Data Stage Interview

What is the difference between the Switch and Filter stages in DataStage?

Specify DataStage's strengths.

Symmetric multiprocessing (SMP)

Briefly state the difference between a data warehouse and a data mart.

What are System variables?

What are Sequencers?

Page 51: Data Stage Interview

What is the difference between Hashfile and Sequential File?

What is OCI?

Which algorithm did you use for your hash file?

What is the difference between an operational data store (ODS) and a data warehouse?

Page 52: Data Stage Interview

DataStage Important Interview Questions and Answers

A data warehouse is a database which is used to store data from heterogeneous sources, with characteristics such as: a) Subject Oriented b) Historical Information c) Integrated d) Non-Volatile e) Time Variant

The source will be an Online Transaction Processing (OLTP) system; the warehouse collects its data from OLTP. OLTP maintains the data for 30 to 90 days and is time sensitive. If we want to store the data for a long period, we need a permanent database; that is the archival database (AD).

Data in the data warehouse comes from the client systems. The data that you use to manage your business is very important, and manipulations are done on it according to the client requirements.

A node is a logical CPU in DataStage.

Each node in a configuration file is distinguished by a virtual name and defines the number of CPUs, their speed, memory availability, etc.

Node configuration is a technique of creating logical CPUs.

In a grid environment a node is the place where the jobs execute. Nodes are like processors; if we have more nodes when running the job, performance will be better because the job runs in parallel and is more efficient.

Page 53: Data Stage Interview

APT_CONFIG_FILE is the file which identifies the .apt configuration files, in which we store the nodes, disk storage space, and so on. APT_CONFIG_FILE is installed under the top-level directory (i.e. the config files under apt_orchhome). The size of the computer system on which you run jobs is defined in the configuration files. You can find the configuration files under Manager -> Tools -> Configurations, and node is the name of the processing node that an entry defines.

A complex job in DataStage is nothing but a job having many joins, lookups or Transformer stages. There is no limitation on the number of stages in a job; we can use any number of stages in a single job. But you should reduce the number of stages wherever you can, for example by writing the query in one stage rather than using two stages; then you will get good performance. If you still end up with many stages in the job, another technique for good performance is to split the job into two jobs.

Version Control is used to store different versions of DataStage jobs, to run the different versions of the same job, and to revert to a previous version of a job.

Descriptor and Data files are the dataset files.

Descriptor file contains the Schema details and address of the data.

And Data file contains the data in the native format.

In the DRS stage we have a Transaction Isolation property set to Read Committed, and we set the array size and transaction size to, say, 10 and 2000, so that it commits every 2000 records.

Page 54: Data Stage Interview

The Iconv and Oconv functions are used to convert date formats.

Iconv() is used to convert a string to the internal storage format.

Oconv() is used to convert an expression to an output format.

Performance of the job is really important to maintain. Some precautions to get good performance from jobs are as follows: avoid relying on a single flow for performance or tuning testing; try to work in increments; isolate problems and solve them job by job.

For that

a) Avoid using the Transformer stage where it is not really needed. For example, if you are using a Transformer stage only to change column names or to drop columns, use a Copy stage instead; it will give the job better performance.

b) Take care to choose the correct partitioning technique, according to the job and the requirement.

c) Use user-defined queries for extracting the data from databases.

d) If the data volume is small, use SQL join statements rather than a Lookup stage.

e) If you have a large number of stages in the job, divide the job into multiple jobs.

Page 55: Data Stage Interview

Server jobs work only if the server edition of DataStage has been installed on your system. Server jobs do not support parallelism or partitioning techniques, and they generate BASIC programs after job compilation. Parallel jobs work if you have installed the Enterprise Edition; they run on DataStage servers that are SMP (Symmetric Multi-Processing), MPP (Massively Parallel Processing) and so on, and generate OSH (Orchestrate Shell) programs after job compilation. Different stages are available, such as Data Set, Lookup, etc. Server jobs work in a sequential way, while parallel jobs work in a parallel fashion (the parallel extender works on the principle of pipeline and partition parallelism) for input/output processing.

The difference between DataStage and Informatica is that DataStage has partitioning, parallelism, Lookup, Merge, etc.

Informatica does not have the same concept of partitioning and parallelism, and its file lookup is really poor.

Compilation is the process of converting the GUI design into machine code, that is, a machine-understandable language.

In this process it checks all the link requirements, the mandatory stage property values, and whether there are any logical errors.

And the compiler produces OSH code.

Page 56: Data Stage Interview

Modeling is a logical and physical representation of the source system. There are two types of modeling tools: ERwin and ER/Studio.

In the source system there will be an ER model, and in the target system there will be an ER model and a dimensional model. Dimension: a table designed from the client's perspective; we can look at the data in many ways through the dimension tables.

And there are two types of modeling approaches: Forward Engineering (F.E.) and Reverse Engineering (R.E.).

F.E. is the process of starting the model from scratch, for example for a bank that requires a data warehouse. R.E. is the process of altering an existing model for another bank.

A data mart is a repository of data gathered from operational data and other sources that is designed to serve a particular community of knowledge workers. In scope, the data may derive from an enterprise-wide database or data warehouse or be more specialized. The emphasis of a data mart is on meeting the specific demands of a particular group of knowledge users in terms of analysis, content, presentation, and ease of use. Users of a data mart can expect to have data presented in terms that are familiar.

Page 57: Data Stage Interview

There are many reasons to create a data mart; data marts have a lot of importance and advantages.

It is easy to access frequently needed data from the database when required by the client.

We can give a group of users access to view the data mart when it is required, and of course performance will be good.

It is easy to create and maintain a data mart, and it relates to a specific business area.

And it is cheaper to create a data mart than to create a data warehouse with a huge amount of space.

A data warehouse tends to be a strategic but somewhat unfinished concept. The design of a data warehouse tends to start from an analysis of what data already exists and how it can be collected in such a way that the data can later be used. A data warehouse is a central aggregation of data (which can be distributed physically); A data mart tends to be tactical and aimed at meeting an immediate need. The design of a data mart tends to start from an analysis of user needs. A data mart is a data repository that may derive from a data warehouse or not and that emphasizes ease of access and usability for a particular designed purpose.

Page 58: Data Stage Interview

You may get many errors in DataStage while compiling or running jobs. Some of the errors are as follows:

a) Source file not found: you are trying to read a file which does not exist with that name.

b) Sometimes you may get fatal errors.

c) Data type mismatches: this occurs when data types do not match within the job.

d) Field size errors.

e) Metadata mismatch.

f) Data type size differs between source and target.

g) Column mismatch.

h) Process time-out: if the server is busy, this error can appear from time to time.

Page 59: Data Stage Interview

1) DataStage Designer 2) DataStage Director 3) DataStage Manager 4) DataStage Administrator

In the DataStage Designer, we create jobs, compile jobs and run jobs.

In the Director, we can view jobs, view logs, batch jobs, unlock jobs, schedule jobs, monitor jobs and handle messages.

In the Manager, we can import and export jobs and do node configuration.

And by using the Administrator, we can create projects, organize projects and delete projects.

Page 60: Data Stage Interview

5) In 7.5x2 there are 2 architecture components: a) Server b) Client. In 8.0.1 there are 5 architecture components: a) Common User Interface b) Common Repository c) Common Engine d) Common Connectivity e) Common Shared Services.

6) In 7.5x2, phases P-3 and P-4 can be performed: P-3 is Data Transformation and P-4 is Metadata Management. In 8.0.1, phases P-1, P-2, P-3 and P-4 can be performed: P-1 is Data Profiling, P-2 is Data Quality, P-3 is Data Transformation and P-4 is Metadata Management.

7) In 7.5x2 the server is IIS; in 8.0.1 the server is WebSphere.

8) 7.5x2 has no web-based administration; 8.0.1 has web-based administration.

Page 61: Data Stage Interview

DataStage is a product owned by IBM.

DataStage is an ETL tool and it is platform independent.

ETL means extraction, transformation and loading.

DataStage was introduced by a company called VMark under the name DataIntegrator, in the UK, in 1997.

Later it was acquired by other companies; finally it reached IBM in 2006.

DataStage got its parallel capabilities when it was integrated with Orchestrate, and its platform-independent capabilities when it was integrated with the MKS Toolkit.

DataStage is a comprehensive ETL tool used to extract, transform and load data. DataStage projects are worked on through the DataStage clients; we log in to the DataStage Designer to enter the DataStage tool for designing DataStage jobs and so on.

DataStage jobs are maintained according to the project standards.

Every project contains the DataStage jobs, built-in components, table definitions, the Repository, and the other components required for the project.

Page 62: Data Stage Interview

Hash and modulus techniques are key-based partitioning techniques, but they are used for different purposes.

If the key column's data type is textual, we use the hash partitioning technique for the job. If the key column's data type is numeric, we use the modulus partitioning technique. If one key column is numeric and another is textual, we again use the hash technique; if all the key columns are numeric, we use the modulus technique.

1) Any-to-any: that means DataStage can extract the data from any source and can load the data into any target.

2) Platform independent: a job developed on one platform can run on any other platform. That means if we designed a job for uniprocessor-level processing, it can also run on an SMP machine.

3) Node configuration: node configuration is a technique to create logical CPUs; a node is a logical CPU.

4) Partition parallelism: partition parallelism is a technique of distributing the data across the nodes based on partitioning techniques. The partitioning techniques are: a) key-based techniques: 1) Hash 2) Modulus 3) Range 4) DB2

b) key-less techniques: 1) Same 2) Entire 3) Round Robin 4) Random

5) Pipeline parallelism: pipeline parallelism is the process in which extraction, transformation and loading occur simultaneously. Re-partitioning: the redistribution of already distributed data. Reverse partitioning: reverse partitioning is called collecting. The collecting methods are Ordered, Round Robin, Sort Merge and Auto.

Page 63: Data Stage Interview

And the four phases are: 1) Data Profiling 2) Data Quality 3) Data Transformation 4) Metadata Management. Data Profiling: data profiling is performed in 5 steps. It analyses whether the source data is good or dirty, and the 5 steps are

a) Column Analysis b) Primary Key Analysis c) Foreign Key Analysis d) Cross-domain Analysis e) Baseline Analysis

After completing the analysis, if the data is good there is no problem. If the data is dirty, it is sent for cleansing; this is done in the second phase. Data Quality: after receiving the dirty data, Data Quality cleans it using 5 different steps. They are a) Parsing b) Correcting c) Standardizing d) Matching e) Consolidating

Data Transformation: after completing the second phase, it gives the golden copy. The golden copy is nothing but a single version of the truth; that means the data is now good.

RCP is nothing but Runtime Column Propagation. When we run DataStage jobs, the columns may change from one stage to another stage, and at that point we may be loading unnecessary columns into a stage that are not required. If we want only the required columns to be loaded into the target, we can do this by enabling RCP; with RCP enabled, we can send the required columns through to the target.

Page 64: Data Stage Interview

The roles and responsibilities of a software engineer are

1) Preparing Questions 2) Logical Designs ( i.e Flow Chart ) 3) Physical Designs ( i.e Coding ) 4) Unit Testing 5) Performance Tuning. 6) Peer Review 7) Design Turnover Document or Detailed Design Document or Technical design Document 8) Doing Backups 9) Job Sequencing ( It is for Senior Developer )

There are three architecture components in DataStage 7.5x2. They are:
Repository: an environment where we create, design, compile and run jobs; some components it contains are jobs, table definitions, shared containers, routines, etc.
Server (engine): it runs executable jobs that extract, transform and load data into a data warehouse.
DataStage Package Installer: a user interface used to install packaged DataStage jobs and plug-ins.

Page 65: Data Stage Interview

10,000

Group ids are created in two different ways. We can create group id's by using

a) Key Change Column b) Cluster Key change Column

Both options are used to create group IDs. When we select either option and set it to True, it will create the group IDs group-wise.

The data will be divided into groups based on the key column, and it will give 1 for the first row of every group and 0 for the rest of the rows in all groups.

Key Change Column and Cluster Key Change Column are used based on the data we are getting from the source.

If the data we are getting is not sorted, then we use Key Change Column to create group IDs. If the data we are getting is already sorted, then we use Cluster Key Change Column to create group IDs.

The entities in a dimension which change rapidly are called a rapidly (fast) changing dimension. The best example is ATM machine transactions.

For parallel jobs there is also a force compile option. The compilation of parallel jobs is by default optimized such that transformer stages only get recompiled if they have changed since the last compilation. The force compile option overrides this and causes all transformer stages in the job to be compiled. To select this option, choose File ➤ Force Compile.

When the memory requirement is large, go for a Dataset; a sequential file does not support more than 2 GB.

Page 66: Data Stage Interview

Filter: 1) We can write multiple conditions on multiple fields. 2) It supports one input link and n output links.
Switch: 1) Conditions apply to a single field (column). 2) It supports one input link and 128 output links.

The major strengths of DataStage are: partitioning, pipelining, node configuration, handling huge volumes of data, and platform independence.

symmetric multiprocessing (SMP) involves a multiprocessor computer hardware architecture where two or more identical processors are connected to a single shared main memory and are controlled by a single OS instance. Most common multiprocessor systems today use an SMP architecture.

A data warehouse is made up of many data marts. A DWH contains many subject areas, whereas a data mart generally focuses on one subject area. E.g. if there is a DWH for a bank, then there can be one data mart for accounts, one for loans, etc. These are high-level definitions.

A data mart (DM) is the access layer of the data warehouse (DW) environment that is used to get data out to the users. The DM is a subset of the DW, usually oriented to a specific business line or team.

System variables comprise of a set of variables which are used to get system information and they can be accessed from a transformer or a routine. They are read only and start with an @.

A sequencer allows you to synchronize the control flow of multiple activities in a job sequence. It can have multiple input triggers as well as multiple output triggers.

Page 67: Data Stage Interview

It uses the GENERAL or SEQ.NUM algorithm.

A data warehouse is a decision-support database for organizational needs; it is a subject-oriented, non-volatile, integrated, time-variant collection of data. An ODS (Operational Data Store) is an integrated collection of related information and contains at most about 90 days of information. The ODS is part of the transactional database landscape: it keeps integrated data from different transactional databases and allows common operations across the organization, e.g. banking transactions. In simple terms, ODS data is dynamic.

A hash file stores data based on a hash algorithm and a key value; a sequential file is just a file with no key column. A hash file can be used as a reference for a lookup; a sequential file cannot.

If you mean the Oracle Call Interface (OCI), it is a set of low-level APIs used to interact with Oracle databases. It allows one to use operations like logon, execute, parse, etc. from a C or C++ program.