selector web

8/6/2019 Selector Web

1/8

1

Question Answer ==========================================================

Phases vsCheckpoints

Phases - are used to break the graph into pieces. Temporary files created during a phase will bedeleted after its completion. Phases are used to effectively separately manage resource-consuming (memory, CPU, disk) parts of the application.

Checkpoints - created for recovery purposes. These are points where everything is written todisk. You can recover to the latest saved point - and rerun from it.

You can have phase breaks with or without checkpoints.

xfrA new sandbox will have many directories: mp, dml, xfr, db, ... . xfr is a directory where you putfiles with extension .xfr containing your own custom functions (and then use : include"somepath/xfr/yourfile.xfr"). Usually XFR stores mapping.

three types ofparallelism

1) Data Parallesim - data (partitionning of data into parallel streams for parallel processing).

2) Componnent Paralelism (execute simultaneously on different branches of the graph)

3) Pipeline (sequential).

MFS

Multi-File System

m_mkfs - create a multifile (m_mkfs ctrlfile mpfile1 ... mpfileN)m_ls - list all the multifilesm_rm - remove the multifilem_cp - copy a multifile

m_mkdir - to add more directories to existing directory structure

Memory requirements

of a graph

Each partition of a component uses: ~ 8 MB + max-core (if any) Add size of lookup files used in phase (if multiple components use same lookup only

count it once)

Multiply by degree of parallelism. Add up all components in a phase; that is how muchmemory is used in that phase.

Select the largest-memory phase in the graph

How to calculate aSUM

SCANROLLUPSCANWITHROLLUPScan followed by Dedup sort and select the last

dedup sort with nullkey

If we don't use any key in the sort component while using the dedup sort,then the output depends on the keep parameter.

first - only the first record last - only last record

unique_only - there will be no records in the output file.

join on partitioned flowfile1 (A,B,C) , file2 (A,B,D). We partition both files by "A", and then join by "A,B". IS it OK? Orshould we partition by "A,B" ? Not clear.

checkin, checkout You can do checkin/checkout using the wizard right from the GDE using versions and tags

Selector Web Definitions

1


2/8

how to have differentpasswords for QA andproduction

parameterize the .dbc file - or use environmental variable.

How to get records50-75 out of 100

use scan and filter m_dump -start 50 -end 75

use next_in_sequence() function and filter by expression component(next_in_sequence() >50 && next_in_sequence()


3/8

Continuouscomponents

Continuous components - produce useful output file while running continously. For example,Continuous rollup, Continuous update batch subscribe

2

Question Answer ==========================================================

deadlockDeadlock is when two or more processes are requesting the same resource. To avoid use

phasing and resource pooling.

environment

AB_HOME - where co>operating system is installed AB_AIR_ROOT - default location for EME datastore sandboxes standard environment AI_SORT_MAX_CORE, AI_HOME, AI_SERIAL, AI_MFS, etc.

from unix prompt: env | grep AI

wrapper script unix script to run graphs

multistage component

A multistage component is a component which transforms input records in 5 stages (1.inputselect, 2.temporary initialization, 3.processing, 4. output selection, 5.finalize). So it is a transformcomponent which has packages. Examples: scan Normalize and Denormalize, rollup scannormalize and denormalize sorted.

Dynamic DML

Dynamic DML is used if the input metadata can change. Example: at different time differentinput files are recieved for processing which have different dml. in that case we can use flag inthe dml and the flag is first read in the input file recieved and according to the flag itscorresponding dml is used.

fan in, fan out

fan out - partition component (increase parallelism)

fan in departition component (decrease parallelism)

locka user can lock the graph for editing so that others will see the message and can not edit thesame graph.

join vs lookupLookup is good for spped for small files (will load whole file in memory). For large files use join.You may need to increase the maxcore limit to handle big joins.

multi updatemulti update executes SQL statements - it treats each input record as a completely separatepiece of work.

scheduler

We can use Autosys, Control-M, or any other external scheduler.

We can take care of dependencies in many ways. For example, if scripts should runsequentially, we can arrange for this in Autosys, or we can create a wrapper script andput there several sequential commands (nohup command1.ksh & ; nohupcommand2.ksh &; etc). We can even create a special graph in Ab Initio to execute

individual scripts as needed.

Api and Utility modesin input table

These are database interfaces (api - uses SQL, utility - bulk loads, whatever vendor provides)

lookup file

lookup file component. Functions: lookup, lookup_count, lookup_next, lookup_match,lookup_local.

Lookups are always used with combination of the reformat components.


3


4/8

Calling stored proc inDB

You can call stored proc (for example, from input component). In fact, you can even write SP inAb Initio. Make it "with recompile" to assure good performance.

Frequently usedfunctions

string_ltrim, string_lrtrim, string_substring, reinterpret_as, today(), now()

data validation is_valid, is_null, is_blank, is_defined

driving portWhen joining inputs (in0, in1, ...) one of the ports is used as "driving (by default - in0). Drivinginput is usually the largest one. Whereas the smallest can have "Sorted-Input" parameter be set

to "Input need not be sorted" because it will be loaded completely in memory.

Ab Initio vsInformatica for ETL

Ab Initio benefits: parallelism built in, mulitifile system, handles huge amounts of data, easy tobuild and run. Generates scripts which can be easily modified as needed )if something couldn'tbe done in ETL tool itself). The scripts can be easily scheduled using any external scheduler -and easily integrated with other systems.

Ab Initio doesn't require a dedicated administrator.

Ab Initio doesn't have built-in CDC capabilities (CDC = Change Data Capture).

Ab Initio allows to (attach error / reject files) to each transformation and capture and analyze themessage and data separately (as opposed to Informatica which has just one huge log). Ab Initio

provides immediate metrics for each component.

override key override key option is used when we need to join 2 fields which have different field names.

control file control file should be in the multifile directory (contains the addresses of the serial fi les)

max-core

max-core parameter (for example, sort 100 MBytes) specifies the amount of memory used by acomponent (like Sort or Rollup) - per partition - before spilling to disk. Usually you don't need tochange it - just use default value. Setting it too high may degrade the performance because ofOS swapping and degrading of the performance of other components.

Input Parametersgraph > select parameters tab > click "create" - and create a parameter. Usage: $paramname.Edit > parameters. These parameters will be substituted during run time. You may need todeclare you parameter scope as formal.

Error Trapping

Each component has reject, error, and log ports. Reject captures rejected records, Errorcaptures corresponding error, and log captures the execution statistics of the component. Youcan control reject status of each component by setting reject threshold to either Never Abort,Abort on first reject, or setting ramp/limit. You can also use force_error() function in transformfunction.

3

Question Answer ==========================================================

How to see resource

usage

In GDE goto options View > Tracking Details - will see each component's CPU and memory

usage, etc.

assign keyscomponent

Easy and saves development time. Need to understand how to feed parameters, and you can'tcontrol it easily.

Join in DB vs join in AbInitio

Scenario 1 (preferred): we run query which joins 2 tables in DB and gives us the result injust 1 DB component.

Scenario 2 (much slower): we use 2 database components, extract all data - and jointhem in Ab Initio.


4


5/8

Join with DBnot recommended if number of records is big. It is better to retrieve the data out - and then join inAb Initio.

Data Skew

Parameter showing how data is unevenly distributed between partitions.

skew = (partition size - avg.part.size)* 100 / (size of the largest partition)

dbc vs cfg

.dbc - database configuration file (dbname, nodes, version user/pwd) - resides in the db directory

.cfg - any tyoe of config file. for example, remote connection config (name of remote server,user/pwd to connect to db, location of OS on remote machine, connection method). .cfg fileresides in the config dir.

compilation errors

depth not equal data format error etc...

depth error : we get this error.. when two components connected together but does't match therelayout

types of partitions broadcast pbyexpression pbyroundrobin pbykey pwithloadbalance

unused port when joining, used records go to the output port, unused records - to the unused port

tuning performance Go parallel using partitionning. Roundrobin partitionning gives good balance.

Use Multi-file system (MFS). Use Ad Hoc MFS to read many serial files in parallel, and use concat component. Once data is partitionned - do not switch it to serial and back. Repartition instead. Do not acceess large filess via NFS - use FTP instead use lookup local rather than lookup (especially for big lookups). Use rollup and Filter as soon as possible to reduce number of records. Ideally do it in the

source (database ?) before you get the data. Remove unnecessary components. For example, instead of using filter by exp, you can

implement the same function in reformat/Join/Rollup. Another example - when joiningdata from 2 files, use union function instead of adding an additional component forremoving duplicates.

use gather instead of concatenate. it is faster to do a sort after a partitino, than to do a sort before a partition. try to avoid using a join with the "db" component. when getting data from database - make sure your queries are fast (use indexes, etc.). If

possible, do necessary selection / aggregation / sorting in the database before gettingdata into Ab Initio.

tune Max_core for Optimal performance (for sort depends on the size of the input file). Note - If in-memory join cannot fit its non-driving inputs in the provided MAX-CORE, then

it will drop all the inputs to disk and in-memory does not make sence. Using phase breaks let you allocate more memory in individual components - thus

improving performance. Use checkpoint after sort to land data on disk Use Join and rollup in-memory feature When joining very small dataset to a very large dataset it is more efficient to broadcast

the small dataset to MFS using broadcast component, or use the small file as lookup. Butfor large dataset don't use broadcast as a partitioner. Use Ab Initio layout instead of database default to achieve parallel loads Change AB_REPORT parameter to increased monitoring duration Use catalogs for reusability Components like join/ rollup should have the option "Input must be sorted"

if they are placed after a sort component. minimize number of sort components. Minimize usage of sorted join component, and if

possible replace them by in-memory join/hash join. Use only required fields in the sortreformat join components. Use "Sort within Groups" instead of just Sort when data was


5


6/8

already presorted. Use phasing/flow buffers in case of merge sorted joins Minimize the use of regular expression functions like re_index in the transfer functions Avoid repartitioning of data unnecessarily. When splitting records into more than two

flows, use Reformat rather than Broadcast component. For joining records from 2 flows use Concatenate component ONLY when there is a need

to follow some specific order in joining records. If no order is required then it is preferableto use Gather component.

Instead of putting many Reformat components consecutively, use output indexesparameter in the first Reformat component and mention the condition there.

delta table

Delta table maintain the sequencer of each data table.

Master (or base) table - a table on top of which we create a view

scan vs rollup rollup - performs aggregate calculations on groups, scan - calculates cumulative totals

packages used in multistage components or transform components

Reformat vs "Redefine

Format"

Reformat - deriving new data by adding/dropping fields

Redefine format - rename fields

Conditional DML DML which is separated based on a condition

SORTWITHINGROUP The prerequisit for using sortwithingroup is that the data is already sorted by the major

key. sortwithingroup outputs the data once it has finished reading the major key group. Itis like an implicit phase.

passing a condition asa parameter

Define a Formal Keyword Parameter of type string. For example, you call it FilterCondition, andyou want it to do filtering on COUNT > 0 . Also in your graph in your "Filter by expression"Component enter following condition: $FilterCondition

Now on your command line or in wrapper script give the following commandYourGraphname.ksh -FilterCondition COUNT > 0

Passing file name as aparameter

#!/bin/ksh#Running the set up script on enviornmenttypeset PROJ_DIR $(cd $(dirname $0)/..; pwd). $PROJ_DIR/ab_project_setup.ksh $PROJ_DIR#Exporting the script parameter1 to INPUT_FILE_NAMEif [ $# -ne 2 ];thenINPUT_FILE_PARAMETER_1 $1INPUT_FILE_PARAMETER_2 $2# This grpah is using the input filecd $AI_RUN

./my_graph1.ksh $INPUT_FILE_PARAMETER_1# This graph also is using the input file../my_graph2.ksh $INPUT_FILE_PARAMETER_2exit 0;

elseecho Insufficient parametersexit 1;

fi-------------------------------------#!/bin/ksh


6


7/8

#Running the set up script on enviornmenttypeset PROJ_DIR $(cd $(dirname $0)/..; pwd). $PROJ_DIR/ab_project_setup.ksh $PROJ_DIR

#Exporting the script parameter1 to INPUT_FILE_NAMEexport INPUT_FILE_NAME $1

# This grpah is using the input filecd $AI_RUN

./my_graph1.ksh

# This graph also is using the input file../my_graph2.ksh

exit 0;

How to remove headerand trailer lines?

use conditional dml where you can separate detail from header and trailer. For validations usereformat with count :3 (out0:header out1:detail out2:trailer.)

How to create a multifile system on

Windows

first method: in GDE go to RUN > Execute Command - and run m_mkfs c:control c:dp1c:dp2 c:dp3 c:dp4

second method: double-click on the file component, and in ports tab double-click onpartitions - there you can enter the number of partitions.

VectorA vector is simply an array. It is an ordered set of elements of the same type (type can be anytype, including a vector or a record).

Dependency AnalysisDependency analysis will answer the questions regarding datalinage, that is where does the datacome from what applications prodeuce and depend on this data etc..

4

Question Answer ==========================================================

Surrogate key

There are many ways to create a surrogate key. For example, you canuse next_in_sequence() function in your transform. Or you can use "Assign key values"component. Or you can write a stored procedure - and call it.

Note: if you use partitions, then do something like this:

(next_in_sequence()-1)*no_of_partition()+this_partition()

.abinitiorcThis is a config file for ab initio - in user's home directory and in $AB_HOME/Config. It setsabinitio home pat


7


8/8


8

selector web

Documents