
    Informatica Join Vs Database Join

In yet another DWBI Concepts original article, we test the performance of the Informatica PowerCentre 8.5 Joiner transformation against an Oracle 10g database join. The article gives application developers crucial insight for making informed performance-tuning decisions.

Which is faster? Informatica or Oracle?

In our previous article, we tested the performance of the ORDER BY operation in Informatica and Oracle and found that, under our test conditions, Oracle performs sorting 14% faster than Informatica. This time we will look into the JOIN operation, not only because JOIN is the single most important data set operation but also because the performance of JOIN gives a developer crucial data for implementing proper pushdown optimization manually.

Informatica is one of the leading data integration tools in today's world. More than 4,000 enterprises worldwide rely on Informatica to access, integrate and trust their information assets. On the other hand, Oracle database is arguably the most successful and powerful RDBMS, trusted since the 1980s in all sorts of business domains and across all major platforms. Both of these systems are best in the technologies they support. But when it comes to application development, developers often face the challenge of striking the right balance of operational load sharing between these systems. This article will help them take an informed decision.

    Which JOINs data faster? Oracle or Informatica?

As an application developer, you have the choice of either using join syntax at database level to join your data or using a JOINER TRANSFORMATION in Informatica to achieve the same outcome. The question is: which system performs this faster?

    Test Preparation

We will perform the same test with 4 different data points (data volumes) and log the results. We will start with 1 million records in the detail table and 0.1 million in the master table. Subsequently, we will test with 2 million, 4 million and 6 million detail table records and 0.2 million, 0.4 million and 0.6 million master table records. Here are the details of the setup we will use:

    1. Oracle 10g database as relational source and target

    2. Informatica PowerCentre 8.5 as ETL tool


3. Database and Informatica setup on different physical servers using HP UNIX

4. Source database table has no constraint, no index, no database statistics and no partition

5. Source database table is not available in Oracle shared pool before the same is read

6. There is no session level partition in Informatica PowerCentre

7. There is no parallel hint provided in extraction SQL query

8. Informatica JOINER has enough cache size

We have used two sets of Informatica PowerCentre mappings created in Informatica PowerCentre Designer. The first mapping, m_db_side_join, uses an INNER JOIN clause in the source qualifier to join data at database level. The second mapping, m_Infa_side_join, uses an Informatica JOINER to join data at Informatica level. We have executed these mappings with different data points and logged the results.
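For illustration, the source qualifier SQL override in m_db_side_join would look roughly like the sketch below. The actual table and column names of the test are not given in the article, so these are assumed:

-- Hypothetical SQL override for the source qualifier of m_db_side_join.
-- DETAIL_TBL, MASTER_TBL and the join key MASTER_ID are assumed names,
-- not taken from the original test setup.
SELECT d.*,
       m.master_attr
FROM   detail_tbl d
       INNER JOIN master_tbl m
               ON m.master_id = d.master_id;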

Further to the above test, we will execute the m_db_side_join mapping once again, this time with proper database-side indexes and statistics, and log the results.
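The article does not list the exact indexes and statistics used in this re-run; something along these lines, with assumed object names, would be typical:

-- Hypothetical index and statistics setup for the second m_db_side_join run.
CREATE INDEX idx_detail_master_id ON detail_tbl (master_id);
CREATE INDEX idx_master_master_id ON master_tbl (master_id);

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'DETAIL_TBL', cascade => TRUE);
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'MASTER_TBL', cascade => TRUE);
END;
/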

    Result

The following graph shows the performance of Informatica and the database in terms of the time taken by each system to join the data. The average time is plotted along the vertical axis and the data points are plotted along the horizontal axis.

Data Point   Master Table Record Count   Detail Table Record Count
1            0.1 M                       1 M
2            0.2 M                       2 M
3            0.4 M                       4 M
4            0.6 M                       6 M


    Verdict

In our test environment, Oracle 10g performs the JOIN operation 24% faster than the Informatica Joiner Transformation without a database index, and 42% faster with a database index.

Assumptions

1. Average server load remains the same during all the experiments
2. Average network speed remains the same during all the experiments

    Note

1. This data can only be used for performance comparison but cannot be used for performance benchmarking.
2. This data is only indicative and may vary under different testing conditions.

    What is incremental aggregation?

When using incremental aggregation, you apply captured changes in the source to aggregate calculations in a session. If the source changes only incrementally and you can capture the changes, you can configure the session to process only those changes. This allows the Informatica Server to update your target incrementally, rather than forcing it to process the entire source and recalculate the same calculations each time you run the session.
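The idea can be pictured with a database-side analogy: only the captured changes are applied on top of the stored aggregates instead of re-aggregating the whole source. The sketch below uses made-up table and column names purely for illustration; Informatica does this internally through its aggregate cache, not through SQL:

-- Illustration only: apply just the changed rows to an aggregate target.
-- SALES_CHANGES (captured changes) and SALES_AGG (aggregate target) are
-- hypothetical tables, not part of the article's example.
MERGE INTO sales_agg t
USING (
        SELECT product_id, SUM(amount) AS delta_amount
        FROM   sales_changes            -- only the captured changes
        GROUP  BY product_id
      ) s
ON (t.product_id = s.product_id)
WHEN MATCHED THEN
  UPDATE SET t.total_amount = t.total_amount + s.delta_amount
WHEN NOT MATCHED THEN
  INSERT (product_id, total_amount)
  VALUES (s.product_id, s.delta_amount);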


Comparing Performance of SORT Operation (ORDER BY) in Informatica and Oracle

    In this "DWBI Concepts' Original article", we put Oracle database and Informatica

    PowerCentre to lock horns to prove which one of them handles data SORTing

    operation faster. This article gives a crucial insight to application developer in order

    to take informed decision regarding performance tuning.

Which is faster? Informatica or Oracle?

Informatica is one of the leading data integration tools in today's world. More than 4,000 enterprises worldwide rely on Informatica to access, integrate and trust their information assets. On the other hand, Oracle database is arguably the most successful and powerful RDBMS, trusted since the 1980s in all sorts of business domains and across all major platforms. Both of these systems are best in the technologies they support. But when it comes to application development, developers often face the challenge of striking the right balance of operational load sharing between these systems.

Think about a typical ETL operation often used in enterprise-level data integration. A lot of data processing can be redirected either to the database or to the ETL tool. In general, both the database and the ETL tool are reasonably capable of doing such operations with almost the same efficiency and capability. But in order to achieve optimized performance, a developer must carefully consider and decide which system s/he should trust with each individual processing task.

In this article, we take a basic database operation, sorting, and put these two systems to the test in order to determine which one does it faster, if at all.

    Which sorts data faster? Oracle or Informatica?

As an application developer, you have the choice of either using ORDER BY at database level to sort your data or using a SORTER TRANSFORMATION in Informatica to achieve the same outcome. The question is: which system performs this faster?

    Test Preparation

We will perform the same test with different data points (data volumes) and log the results. We will start with 1 million records and double the volume for each subsequent data point. Here are the details of the setup we will use:

1. Oracle 10g database as relational source and target

2. Informatica PowerCentre 8.5 as ETL tool

3. Database and Informatica setup on different physical servers using HP UNIX

4. Source database table has no constraint, no index, no database statistics and no partition

5. Source database table is not available in Oracle shared pool before the same is read

6. There is no session level partition in Informatica PowerCentre

7. There is no parallel hint provided in extraction SQL query

8. The source table has 10 columns and the first 8 columns will be used for sorting

9. Informatica sorter has enough cache size

We have used two sets of Informatica PowerCentre mappings created in Informatica PowerCentre Designer. The first mapping, m_db_side_sort, uses an ORDER BY clause in the source qualifier to sort data at database level. The second mapping, m_Infa_side_sort, uses an Informatica sorter to sort data at Informatica level. We have executed these mappings with different data points and logged the results.
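As an illustration, the SQL override in the source qualifier of m_db_side_sort would be roughly of this shape. The actual column names of the 10-column test table are not given in the article, so generic names are assumed:

-- Hypothetical source qualifier SQL override for m_db_side_sort:
-- sort on the first 8 of the 10 columns, as per the test setup.
SELECT col1, col2, col3, col4, col5, col6, col7, col8, col9, col10
FROM   sort_test_src
ORDER  BY col1, col2, col3, col4, col5, col6, col7, col8;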

    Result

The following graph shows the performance of Informatica and the database in terms of the time taken by each system to sort data. The time is plotted along the vertical axis and the data volume is plotted along the horizontal axis.


    Verdict

The above experiment demonstrates that the Oracle database is faster in the SORT operation than Informatica by an average of 14%.

Assumptions

1. Average server load remains the same during all the experiments
2. Average network speed remains the same during all the experiments

    Note

This data can only be used for performance comparison but cannot be used for performance benchmarking.

For the Informatica and Oracle performance comparison of the JOIN operation, see the first section of this article.

Implementing Informatica Incremental Aggregation

Using incremental aggregation, we apply captured changes in the source data (the CDC part) to aggregate calculations in a session. If the source changes incrementally and we can capture the changes, then we can configure the session to process only those changes. This allows the Integration Service to update the target incrementally, rather than forcing it to delete the previously loaded data, process the entire source data and recalculate the same aggregates each time the session runs.

    Using Informatica Normalizer Transformation

Normalizer, a native transformation in Informatica, can ease many complex data transformation requirements. Learn how to use the Normalizer effectively here.

Using Normalizer Transformation

A Normalizer is an Active transformation that returns multiple rows from a source row; it returns duplicate data for single-occurring source columns. The Normalizer transformation parses multiple-occurring columns from COBOL sources, relational tables, or other sources. A Normalizer can be used to transpose the data in columns to rows.

    Normalizer effectively does the opposite of Aggregator!

    Example of Data Transpose using Normalizer

Think of a relational table that stores four quarters of sales by store, where we need to create a row for each sales occurrence. We can configure a Normalizer transformation to return a separate row for each quarter, like below.

    The following source rows contain four quarters of sales by store:

Source Table

Store    Quarter1   Quarter2   Quarter3   Quarter4
Store1   100        300        500        700
Store2   250        450        650        850

The Normalizer returns a row for each store and sales combination. It also returns an index (GCID) that identifies the quarter number:

Target Table

Store    Sales   Quarter
Store1   100     1
Store1   300     2
Store1   500     3
Store1   700     4
Store2   250     1
Store2   450     2
Store2   650     3
Store2   850     4
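For comparison only, the same transpose expressed in SQL would look like the sketch below; this is simply to show what the Normalizer achieves inside the mapping, and STORE_SALES is an assumed table name:

-- Database-side equivalent of the Normalizer transpose, for illustration only.
SELECT store, quarter1 AS sales, 1 AS quarter FROM store_sales
UNION ALL
SELECT store, quarter2, 2 FROM store_sales
UNION ALL
SELECT store, quarter3, 3 FROM store_sales
UNION ALL
SELECT store, quarter4, 4 FROM store_sales;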

    How Informatica Normalizer Works

    Suppose we have the following data in source:

Name   Month   Transportation   House Rent   Food
Sam    Jan     200              1500         500
John   Jan     300              1200         300
Tom    Jan     300              1350         350
Sam    Feb     300              1550         450
John   Feb     350              1200         290
Tom    Feb     350              1400         350

and we need to transform the source data and populate it in the target table as below:

Name   Month   Expense Type   Expense
Sam    Jan     Transport      200
Sam    Jan     House rent     1500
Sam    Jan     Food           500
John   Jan     Transport      300
John   Jan     House rent     1200
John   Jan     Food           300
Tom    Jan     Transport      300
Tom    Jan     House rent     1350
Tom    Jan     Food           350

...and so on for the remaining months.

Now, below is the screenshot of a complete mapping that shows how to achieve this result using Informatica PowerCenter Designer.

Image: Normalization Mapping Example 1

    I will explain the mapping further below.

    Setting Up Normalizer Transformation Property

First, we need to set the number of occurrences property of the Expense head as 3 in the Normalizer tab of the Normalizer transformation, since we have Food, House Rent and Transportation. This in turn will create the corresponding 3 input ports in the Ports tab, along with the fields Individual and Month.


In the Ports tab of the Normalizer, the ports will be created automatically as configured in the Normalizer tab. Interestingly, we will observe two new columns, namely GK_EXPENSEHEAD and GCID_EXPENSEHEAD. The GK field generates a sequence number starting from the value defined in the Sequence field, while GCID holds the value of the occurrence field, i.e. the column number of the input Expense head. Here 1 is for FOOD, 2 is for HOUSERENT and 3 is for TRANSPORTATION.


Now the GCID tells us which expense corresponds to which field while converting the columns to rows.

Below is the screenshot of the expression used to handle this GCID efficiently:
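The screenshot itself is not reproduced in this transcript. A mapping of the GCID value to the expense-type label would typically use a DECODE of roughly this shape; it is shown here as SQL for readability (the Informatica expression language has a similar DECODE function), and the port/table names are assumed, not taken from the original mapping:

-- Illustration only: deriving the expense-type label from GCID_EXPENSEHEAD.
SELECT individual,
       expense_month,
       DECODE(gcid_expensehead,
              1, 'Food',
              2, 'House rent',
              3, 'Transport') AS expense_type,
       expensehead            AS expense
FROM   normalized_expenses;   -- hypothetical staging of the Normalizer output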

What is the difference between Normal load and Bulk load?

Load types: 1) Bulk Load 2) Normal Load

Normal load:
1) Used in case of less data.
2) We can get its log details.
3) We can rollback and commit.
4) Session recovery is possible.
5) Performance may be lower.

Bulk load:
1) Used in case of large data.
2) No log details are available.
3) Can't rollback and commit.
4) Session recovery is not possible.
5) Performance improves.

    Implementing Informatica Partitions

    Why use Informatica Pipeline Partition?

Identification and elimination of performance bottlenecks will obviously optimize session performance. After tuning all the mapping bottlenecks, we can further optimize session performance by increasing the number of pipeline partitions in the session. Adding partitions can improve performance by utilizing more of the system hardware while processing the session.

    PowerCenter Informatica Pipeline Partition

    Different Types of Informatica Partitions

We can define the following partition types: Database partitioning, Hash auto-keys, Hash user keys, Key range, Pass-through and Round-robin.

    Informatica Pipeline Partitioning Explained

Each mapping contains one or more pipelines. A pipeline consists of a source qualifier, all the transformations and the target. When the Integration Service runs the session, it can achieve higher performance by partitioning the pipeline and performing the extract, transformation and load for each partition in parallel.

A partition is a pipeline stage that executes in a single reader, transformation, or writer thread. The number of partitions in any pipeline stage equals the number of threads in that stage. By default, the Integration Service creates one partition in every pipeline stage. If we have the Informatica Partitioning option, we can configure multiple partitions for a single pipeline stage.

Setting partition attributes includes partition points, the number of partitions, and the partition types. In the session properties, we can add or edit partition points. When we change partition points, we can define the partition type and add or delete partitions (the number of partitions).

    We can set the following attributes to partition a pipeline:

Partition point: Partition points mark thread boundaries and divide the pipeline into stages. A stage is a section of a pipeline between any two partition points. The Integration Service redistributes rows of data at partition points. When we add a partition point, we increase the number of pipeline stages by one. Increasing the number of partitions or partition points increases the number of threads. We cannot create partition points at Source instances or at Sequence Generator transformations.

Number of partitions: A partition is a pipeline stage that executes in a single thread. If we purchase the Partitioning option, we can set the number of partitions at any partition point. When we add partitions, we increase the number of processing threads, which can improve session performance. We can define up to 64 partitions at any partition point in a pipeline. When we increase or decrease the number of partitions at any partition point, the Workflow Manager increases or decreases the number of partitions at all partition points in the pipeline. The number of partitions remains consistent throughout the pipeline. The Integration Service runs the partition threads concurrently.

Partition types: The Integration Service creates a default partition type at each partition point. If we have the Partitioning option, we can change the partition type. The partition type controls how the Integration Service distributes data among partitions at partition points. We can define the following partition types: Database partitioning, Hash auto-keys, Hash user keys, Key range, Pass-through and Round-robin.

Database partitioning: The Integration Service queries the database system for table partition information. It reads partitioned data from the corresponding nodes in the database.
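For context, database partitioning reads from a table that is already partitioned on the database side, for example an Oracle range-partitioned table like the sketch below; the table, column and partition names here are illustrative assumptions only:

-- Illustrative Oracle range-partitioned table that database partitioning
-- could read from; names and boundary values are assumed.
CREATE TABLE sales_fact (
  sale_id    NUMBER,
  sale_date  DATE,
  amount     NUMBER
)
PARTITION BY RANGE (sale_date) (
  PARTITION p_2018_q1 VALUES LESS THAN (DATE '2018-04-01'),
  PARTITION p_2018_q2 VALUES LESS THAN (DATE '2018-07-01'),
  PARTITION p_rest    VALUES LESS THAN (MAXVALUE)
);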

Pass-through: The Integration Service processes data without redistributing rows among partitions. All rows in a single partition stay in that partition after crossing a pass-through partition point. Choose pass-through partitioning when we want to create an additional pipeline stage to improve performance, but do not want to change the distribution of data across partitions.

Round-robin: The Integration Service distributes data evenly among all partitions. Use round-robin partitioning when we want each partition to process approximately the same number of rows, i.e. for load balancing.

Hash auto-keys: The Integration Service uses a hash function to group rows of data among partitions. The Integration Service groups the data based on a partition key, using all grouped or sorted ports as a compound partition key. We may need to use hash auto-keys partitioning at Rank, Sorter and unsorted Aggregator transformations.

Hash user keys: The Integration Service uses a hash function to group rows of data among partitions. We define the number of ports used to generate the partition key.


Key range: The Integration Service distributes rows of data based on a port or set of ports that we define as the partition key. For each port, we define a range of values. The Integration Service uses the key and ranges to send rows to the appropriate partition. Use key range partitioning when the sources or targets in the pipeline are partitioned by key range.

We cannot create a partition key for hash auto-keys, round-robin, or pass-through partitioning.

Add, delete, or edit partition points on the Partitions view of the Mapping tab of the session properties in Workflow Manager.

The PowerCenter Partitioning Option increases the performance of PowerCenter through parallel data processing.

This option provides a thread-based architecture and automatic data partitioning that optimizes parallel processing on multiprocessor and grid-based hardware environments.

    Implementing Informatica Persistent Cache

You must have noticed that the time Informatica takes to build the lookup cache can sometimes be too long, depending on the lookup table size/volume. Using a persistent cache, you may save a lot of this time.

    What is Persistent Cache?

Lookups are cached by default in Informatica. This means that, by default, Informatica brings the entire data of the lookup table from the database server to the Informatica server as part of the lookup cache building activity during the session run. If the lookup table is huge, this can take quite some time. Now consider this scenario: what if you are looking up the same table several times using different lookups in different mappings? Do you want to spend the time of building the lookup cache again and again for each lookup? Of course not! Just use the persistent cache option. Yes, a lookup cache can be either non-persistent or persistent. The Integration Service saves or deletes lookup cache files after a successful session run based on whether the lookup cache is marked as persistent or not.

Where and when shall we use persistent cache?

Suppose we have a lookup with the same lookup condition and return/output ports, and the lookup table is used many times in multiple mappings. Let us say a Customer Dimension table is used in many mappings to populate the surrogate key in the fact tables based on their source system


Re-cache from lookup source: To be checked, i.e. the Named Persistent Cache file will be rebuilt or refreshed with the current data of the lookup table.

Next, in all the mappings where we want to use the same already-built Named Persistent Cache, we need to set two properties in the Properties tab of the Lookup transformation.

Lookup cache persistent: To be checked, i.e. the lookup will use a Named Persistent Cache that is already saved in the Cache Directory; if the cache file is not there, the session will not fail, it will just create the cache file instead.

Cache File Name Prefix: user_defined_cache_file_name, i.e. the Named Persistent Cache file name that was defined in the mapping where the persistent cache file was created.

    Note:

If there is any Lookup SQL Override, then the SQL statement in all the lookups should match exactly; even an extra blank space will fail the session that is using the already built persistent cache file.

So if the incoming source data volume is high, the lookup table data volume that needs to be cached is also high, and the same lookup table is used in many mappings, then the best way to handle the situation is to use a one-time-built, already created persistent named cache.

Aggregation without Informatica Aggregator

Since Informatica processes data row by row, it is generally possible to handle a data aggregation operation even without an Aggregator Transformation. In certain cases, you may get a huge performance gain using this technique!

General Idea of Aggregation without Aggregator Transformation

Let us take an example: suppose we want to find the SUM of SALARY for each department of the employee table. The SQL query for this would be:

SELECT DEPTNO, SUM(SALARY) FROM EMP_SRC GROUP BY DEPTNO;

If we need to implement this in Informatica, it would be very easy, as we would obviously go for an Aggregator Transformation. By taking the DEPTNO port as GROUP BY and one output port as SUM(SALARY), the problem can be solved easily.

Now the trick is to use only an Expression transformation to achieve the functionality of the Aggregator. We would use the fundamental capability of the Expression transformation to hold the value of an attribute from the previous row.

    But wait... why would we do this? Aren't we complicating the thing here?

Yes, we are. But as it appears, in many cases it might have a performance benefit (especially if the input is already sorted or when you know the input data will not violate the order, e.g. you are loading daily data and want to sort it by day). Remember, Informatica holds all the rows in the Aggregator cache for the aggregation operation. This needs time and cache space, and it also voids the normal row-by-row processing in Informatica. By replacing the Aggregator with an Expression, we reduce the cache space requirement and ease the row-by-row processing. The mapping below shows how to do this.

    Image: Aggregation with Expression and Sorter 1

    Sorter (SRT_SAL) Ports Tab

Now, I am showing a sorter here just to illustrate the concept. If you already have sorted data from the source, you need not use it, thereby increasing the performance benefit.

    Expression (EXP_SAL) Ports Tab

    Image: Expression Ports Tab Properties

    Sorter (SRT_SAL1) Ports Tab


    Expression (EXP_SAL2) Ports Tab


    Filter (FIL_SAL) Properties Tab

This is how we can implement aggregation without using the Informatica Aggregator transformation. Hope you liked it!
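For readers who think in SQL, the effect of the Expression-plus-Filter mapping (carry the group total on every row, then keep one row per group) can be pictured with an analytic query like the sketch below. This is only an analogy for the EMP_SRC example, not what Informatica actually executes:

-- Analogy only: group totals computed without GROUP BY, then one row kept
-- per DEPTNO, mirroring the Expression + Filter technique on sorted input.
SELECT deptno, dept_salary
FROM  (SELECT deptno,
              SUM(salary)  OVER (PARTITION BY deptno)                 AS dept_salary,
              ROW_NUMBER() OVER (PARTITION BY deptno ORDER BY deptno) AS rn
       FROM   emp_src)
WHERE rn = 1;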

    Informatica Dynamic Lookup Cache

A lookup cache does not change once built. But what if the underlying lookup table's data changes after the lookup cache is created? Is there a way to keep the cache up-to-date even if the underlying table changes?
