etl implementation for extreme performance etl implementation for extreme performance presented by:...

36
1

Upload: hoangphuc

Post on 09-Mar-2018

221 views

Category:

Documents


2 download

TRANSCRIPT

1

2 2

ETL Implementation for Extreme Performance

Presented By:

Mrs. Catherine Boeving

Mr. Greg Wade

3

Topics

• About Us

• Tips and tricks for high performance mapping

design

• Pipeline techniques to improve throughput

• Stacked pipelines to achieve extreme throughput

• Ensuring data integrity in an extreme

environment

• Q&A

Copyright 2012 Lockheed Martin

4

About Us

• Who we are

• Catherine Boeving, Software Developer

• Greg Wade, Information Systems Architect

• What we do

• Build large scale active data warehouses with near real time data loads and high availability for Department of Defense (DoD) customers

• Where do we work

• Lockheed Martin Global Systems and Solutions; A leading federal services and information technology contractor

Copyright 2012 Lockheed Martin

5

Our Environment

Input

Data Acquisition

EDW Teradata

Staging Oracle

ETL • SPARC Enterprise M5000

• 144 GB RAM

• Oracle Solaris 10 OS

• Informatica PowerCenter and

DataTransformation V9.1.0 HotFix 3

Excess ETL server capacity needed to achieve extreme throughput.

Copyright 2012 Lockheed Martin

6

Performance vs. Throughput

• Performance

• The execution time for one run of a workflows/mappings

• Throughput

• The volume of data that can be ETL’ed in a specified period of time

• Performance is needed to achieve throughput

• High performance workflows/mappings are not always enough to meet demanding service level agreements

Demanding SLAs require both high performance and throughput

Copyright 2012 Lockheed Martin

7

Carefully implemented lookups can improve mapping performance

Tips and Tricks

• Reference Data Lookups

• Validate the source data

• Expand the source data

• Stage Data Lookups

• Previous source data

• Processing of partial transactions

• Improve integration

Lookups – Requirements

Copyright 2012 Lockheed Martin

8

• Cached

• Only one DB request

• Best used when referenced data does not change

• Better for small tables

• May be able to compensate for slow/overloaded DB

• Un-Cached

• Many DB requests

• Required when referenced data changes

• Better for large tables

• Most performance improvement with fast DB

Selecting the correct lookup type is key to performance

Tips and Tricks

Lookups – Approaches

Copyright 2012 Lockheed Martin

9

• Part Number Example

• 1000 part numbers and descriptions

• 20 character part numbers with 50 character descriptions

Lookups – Calculations

Tips and Tricks

Simple calculations provide some insight but testing is needed in your environment

Copyright 2012 Lockheed Martin

10

Lookups – Calculations

• DB Transfer Entire Table

• = # rows * (characters per row)

• = 1000 * (20 + 50) = 70K Bytes

• One Un-Cached Lookup • = SQL request + SQL response • = 100 + (20 + 50) = 170 Bytes

• Break even point = 70K / 170 = about 411

lookups per mapping execution

Tips and Tricks

A good estimate but ignores DB speed, network,

etc.

Copyright 2012 Lockheed Martin

11

Tips and Tricks

Lookups – Implementation

Peer review or inspection checklists should include validating lookup type selection

• Historical Data Load

• Cached large table and process large amounts of data

one time

• 25 files/10K rows of data against 25M cached lookup

• Average 50 minute workflow execution times and 6 hours total load time

Copyright 2012 Lockheed Martin

12

Tips and Tricks

• Reduces Round Trips to the Database

• Combine Several Lookups with Sequence Logic

• Simplifies Complex Database Insert Logic

• Potential Loss of Data Lineage

Controlling network chatter between ETL and the DB is essential for high performance

Stored Procedures

Copyright 2012 Lockheed Martin

13

Tips and Tricks

Stored Procedures – Implementation

Eliminate redundant

stored procedure calls

Copyright 2012 Lockheed Martin

14

Tips and Tricks

Stored Procedures – Implementation

• Ensure Matching Port and Parameter Sizes

• Mismatched parameter sizes will send extra bytes to database

• Occurs when database and mapping development done in parallel

Verify the Import of Stored Procedures

Copyright 2012 Lockheed Martin

15

• Rapid Development

• Reusable components and patterns

• Understandability

• Onboard new staff with unique ETL approach

• Maintenance

• Source system updates

• Future performance tuning

Tips and Tricks

Mapping Design – Goals

Copyright 2012 Lockheed Martin

16

Tips and Tricks

Mapping Design – Mapplet Execution

• Used to handle similar code from different sources

• Smaller risk in one-time changes

Copyright 2012 Lockheed Martin

17

Tips and Tricks

Mapping Design – Worklet Execution

• Isolates performance tuning and minimizes regression testing

• Implements standards for new developers

Copyright 2012 Lockheed Martin

18

Tips and Tricks

Mapping Design – Stage Table Implementation

• Improves critical

path completion

• Simplifies complex data

Copyright 2012 Lockheed Martin

19

2 Files Processed in 4 Minutes

Standard Approach – No Pipeline

Pipeline Processing

Copyright 2012 Lockheed Martin

20

2 Files Processed in 3 Minutes

Pipeline Processing

Pipeline Approach

Copyright 2012 Lockheed Martin

21

• Pre-Process

• Sort transaction types

• Split large files

• Transform in Multiple Steps

• DataTransformation (DT)

• Break sources in multiple logical parts

• Smart Loading

• Decouple DB loading from transformations

• Use of external loaders

Extr

act

Tra

nsfo

rm

Load

Added Complexity requires standards and review

Pipeline Processing

Copyright 2012 Lockheed Martin

22

• Use Flat Files Between Pipeline Steps

• High demands on ETL server’s file system

• Requires highly tuned cluster file systems

• Pipeline Steps Have Similar Run Times

• Simple Three Step Pipeline

• DT File, Workflow Output File, External DB Loader

Pipeline Processing

Implementing the Pipeline

Flat File Movement Adds Complexity And Must Be Monitored

Copyright 2012 Lockheed Martin

23

Pipeline Processing

Implementing the Pipeline

Batch File Processing

Copyright 2012 Lockheed Martin

24

6 Files Processed in 3 Minutes

Files applied to the DB in non-deterministic order

Pipeline Processing

Stacked Pipelines – Threaded 3x

Copyright 2012 Lockheed Martin

25

Pipeline Threading – Implementation

Pipeline Processing

• Manipulate XML

• Replicate Parameter Files

Copyright 2012 Lockheed Martin

26

Pipeline Processing

• Example Calculations

• F = files to process = 100 files

• T = time to ETL and load a file = 5 minutes

• P = number of pipeline steps = 3 steps

• S = number of stacked pipelines = 4 pipelines

• Assume

• All pipeline steps take the same amount of time

• Ignore any overhead for intermediate files

Estimate with your workload to see the possibilities

Copyright 2012 Lockheed Martin

27

Pipeline Processing

• Example Calculations Summary

• Standard Processing (No Pipeline)

• F * T = 100 files * 5 minutes = 500 minutes

• Pipeline

• (P + (F – 1)) * (T/P) = (3 + (100 -1)) * (5/3) = 170 minutes

• 294% speedup over standard processing

• Stacked Pipelines

• ((P + (F – 1)) * (T/P)) / S = 170 / 4 = 43

• 395% speedup over single pipeline

• 1162% speedup over standard processing

Theoretical speedup shown. Actual speedup depends on your environment Copyright 2012 Lockheed Martin

28

Copyright 2012 Lockheed Martin

Pipeline Processing

• Pipeline Metrics – 2 Steps

• Standard Processing

• 387 sec for one file batch

• Pipeline Processing

• Step #1 – 222 sec for one file batch

• Step #2 – 240 sec for one file batch

• 100 File Batches Calculation

• Standard = 100 * 387 = 38700 sec

• Pipeline = (2 + (100 – 1))* (240/2) = 12120

• Speedup = 313 %

29

Threaded Workflows Metrics -- 4 Threads

Pipeline Processing

TOTAL

RUNTIME (SECONDS)

TOTAL INPUT ROWS

TOTAL WEIGHTED AVERAGE RUNTIME

TOTAL ROWS/ TOTAL

WEIGHTED AVERAGE RUNTIME

PERCENT DIFFERENCE

Non-Threaded Workflows

1,836,519.00 81,148,820.00 303,229.40 267.62 -17.93%

Threaded Workflows

5,503,758.00 88,555,273.00 271,577.78 326.08 21.85%

Full speedup not realized. Consider data volume when threading.

We ran out of

files to process!

Copyright 2012 Lockheed Martin

30

Data Integrity

Customer Satisfaction, Trust, and Growth

• Is the data accurate?

• How complete is the picture?

• Finding the Bottleneck?

Building the System of Record

Copyright 2012 Lockheed Martin

31

• File Monitoring

• System exchange

• Reaching pre-processing phase

• Data Monitoring

• Check for data validity

• Track session execution times

• Output File Monitoring

• Output files load time

• Follow invalid output files

Extr

act

Tra

nsfo

rm

Load

Data Integrity

Copyright 2012 Lockheed Martin

32

Data Integrity

Event Tracking

E Exchange

Event

E Pre-Process

Event E Bulk Process

Event E Output File

Event

E Database File

Event

E T L

Copyright 2012 Lockheed Martin

33

Data Integrity

Monitor the Data - Alert

• Maintain Timelines for Files and Data

• Define expectations for data

• Empower the System

• System operators resolution

• Track to data quality issues

Copyright 2012 Lockheed Martin

34

Data Integrity

Proven Metrics

• Checking the Box on the SLA

• Quantifiable numbers

• Build and Track Future Growth

• Handle errors and invalid data

Review of metrics may require redesign.

Copyright 2012 Lockheed Martin

35

Key Points

• Tips and Tricks

• Stored Procedures, Lookups, Mapping Design

• Pipelines

• Pipeline ETL Processing

• Stacked Pipelines and Threaded Workflows

• Data Integrity

• Events for ETL, Alerting, Proven Metrics

Copyright 2012 Lockheed Martin

36

Copyright 2012 Lockheed Martin