Improve Your ETL Performance

SEVEN QUICK TIPS FOR IMPROVING ETL PERFORMANCE
Yaniv Mor, CEO and Co-founder, Xplenty


DESCRIPTION

There is not much data when a company is first established. The ETL process runs overnight and is completed by dawn. However, as the company grows, the data thickens and the process takes progressively longer. One morning you almost choke on your coffee when you see it’s still not done, and all the while the phone keeps ringing with complaints about the system being stuck and the data getting stale. Clutching your stress ball, you sit at your desk and think: “How do I improve ETL performance?” We offer seven quick tips for improving your ETL performance.

TRANSCRIPT

Page 1: Improve Your ETL Performance

SEVEN QUICK TIPS FOR IMPROVING ETL PERFORMANCE
Yaniv Mor, CEO and Co-founder, Xplenty

Page 2: Improve Your ETL Performance

A company’s BI analysts check figures and look for insights, while its BI developers are in charge of ETL:

1. Extract data from the various operational stores to the staging environment.

2. Transform the data using an ETL tool, thus building facts and dimensions.

3. Load the transformed data into the production data warehouse.

When a company first starts, there’s not much data. The ETL process runs overnight and is completed by dawn. However, as the company grows, the data thickens and the process takes progressively longer. So the question becomes…

Page 3: Improve Your ETL Performance

HOW DO I IMPROVE ETL PERFORMANCE?

1. Tackle Bottlenecks

2. Load Data Incrementally

3. Partition Large Tables

4. Cut Out Extraneous Data

5. Cache the Data

6. Process in Parallel

7. Use Hadoop

Page 4: Improve Your ETL Performance

TACKLE BOTTLENECKS
Before anything else, make sure you log metrics such as time, the number of records processed, and hardware usage. Check how many resources each part of the process takes and address the heaviest one. Usually it will be the second part, building the facts and dimensions in the staging environment.

Wherever your bottleneck may be, take a deep breath and dive into the code. May the force be with you.
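As a concrete starting point, here is a minimal sketch of per-stage metrics logging in Python. The extract/transform/load functions are hypothetical stand-ins for your own steps; only standard-library modules are used.

    import logging
    import time

    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")

    def timed_stage(name, func, *args):
        """Run one ETL stage, logging its duration and record count."""
        start = time.monotonic()
        records = func(*args)                      # each stage returns a list of records
        elapsed = time.monotonic() - start
        logging.info("stage=%s records=%d seconds=%.1f", name, len(records), elapsed)
        return records

    # Hypothetical stages; replace with the real extract/transform/load code.
    def extract():
        return [{"id": i} for i in range(100_000)]

    def transform(rows):
        return [{**r, "is_even": r["id"] % 2 == 0} for r in rows]

    def load(rows):
        return rows                                # pretend to write to the warehouse

    rows = timed_stage("extract", extract)
    rows = timed_stage("transform", transform, rows)
    rows = timed_stage("load", load, rows)

Comparing the three log lines run after run shows which stage is actually the heaviest before you touch any code.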

Page 5: Improve Your ETL Performance

LOAD DATA INCREMENTALLY
Loading only the changes between the previous and the new data saves a lot of time compared to a full load. It’s more difficult to implement and maintain, but difficult doesn’t mean impossible, so do consider it. Loading incrementally can definitely improve ETL performance.
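A minimal sketch of the high-water-mark approach, assuming a PostgreSQL source and warehouse reached through psycopg2; the table and column names (orders, updated_at, staging.orders) and the connection strings are illustrative assumptions, not part of the original deck.

    import psycopg2

    src = psycopg2.connect("dbname=operational")   # hypothetical source DSN
    dwh = psycopg2.connect("dbname=warehouse")     # hypothetical warehouse DSN

    # Find the newest change already loaded (the high-water mark).
    with dwh.cursor() as cur:
        cur.execute("SELECT COALESCE(MAX(updated_at), '1970-01-01') FROM staging.orders")
        high_water_mark = cur.fetchone()[0]

    # Pull only the rows changed since the last run instead of the full table.
    with src.cursor() as cur:
        cur.execute("SELECT * FROM orders WHERE updated_at > %s", (high_water_mark,))
        changed_rows = cur.fetchall()

    # ...upsert changed_rows into the warehouse and commit...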

Page 6: Improve Your ETL Performance

PARTITION LARGE TABLES
If you use relational databases and want to shrink the data processing window, you can partition large tables. That is, cut big tables down into physically smaller ones, typically by date.

Each partition has its own indexes, and those index trees are shallower, allowing quicker access to the data.

Partitioning also allows switching data in and out of a table with a quick metadata operation instead of actually inserting or deleting data records.
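As one concrete way to do this, here is a sketch using PostgreSQL declarative partitioning (PostgreSQL 10 or later) driven from Python via psycopg2; the sales table, its columns, and the connection string are illustrative assumptions.

    import psycopg2

    conn = psycopg2.connect("dbname=warehouse")    # hypothetical DSN
    with conn, conn.cursor() as cur:
        # The parent table is logical only; rows land in the monthly partitions.
        cur.execute("""
            CREATE TABLE sales (
                sale_id   bigint,
                sale_date date,
                amount    numeric
            ) PARTITION BY RANGE (sale_date)
        """)
        # One physical table per month, each with its own (shallower) indexes.
        cur.execute("""
            CREATE TABLE sales_2015_01 PARTITION OF sales
                FOR VALUES FROM ('2015-01-01') TO ('2015-02-01')
        """)
        # Switching a month out is a metadata operation, not a row-by-row delete.
        cur.execute("ALTER TABLE sales DETACH PARTITION sales_2015_01")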

Page 7: Improve Your ETL Performance

CUT OUT EXTRANEOUS DATA
It’s important to collect as much data as possible, but not all of it is worth loading into the data warehouse. For instance, images of furniture models are useless to BI analysts. If you want to improve ETL performance, sit down and define exactly which data should be processed, and leave the irrelevant rows and columns out.

It’s better to start small and grow as you go than to create a giant data monster that takes eons to process.
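A minimal sketch of trimming the feed before it reaches staging, using pandas; the file name and column names are illustrative assumptions.

    import pandas as pd

    # Read only the columns the analysts actually use; skip image URLs, blobs, etc.
    wanted = ["order_id", "customer_id", "order_date", "amount"]
    orders = pd.read_csv("orders_export.csv", usecols=wanted, parse_dates=["order_date"])

    # Drop rows that will never be queried, e.g. zero-value test orders.
    orders = orders[orders["amount"] > 0]

    orders.to_csv("staging/orders_trimmed.csv", index=False)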

Page 8: Improve Your ETL Performance

CACHE THE DATA
Caching data can greatly speed things up, since memory access is far faster than reading from hard drives. Note that caching is limited by the maximum amount of memory your hardware supports, so all that plastic-furniture big data might not fit.
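A minimal sketch of in-memory caching for repeated dimension lookups, using the standard-library functools.lru_cache; lookup_customer and its stubbed body are illustrative assumptions.

    from functools import lru_cache

    @lru_cache(maxsize=100_000)            # bounded, so it stays within available RAM
    def lookup_customer(customer_id):
        # In a real pipeline this would query the customer dimension table;
        # it is stubbed out here for the sketch.
        return {"customer_id": customer_id, "segment": "retail"}

    # Repeated keys in the fact stream are now served from memory, not the database.
    for customer_id in [1, 2, 1, 3, 1, 2]:
        lookup_customer(customer_id)

    print(lookup_customer.cache_info())    # hit/miss counts show the cache at work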

Page 9: Improve Your ETL Performance

PROCESS IN PARALLEL
Instead of processing serially, optimize resources by processing in parallel. Sadly, this is not always possible. Sort and aggregate functions (count, sum, etc.) block processing because they must end before the next task can begin.

Even if you can process in parallel, it won’t help if the machine is running at 100% CPU the entire time. You could scale up by upgrading the CPU, but that only scales so far. There’s a much better solution.
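Before reaching for new hardware, here is a minimal sketch of parallelising independent work with the standard-library multiprocessing module; transform_partition and the partition list are illustrative assumptions, and this only helps while the chunks really are independent and CPU headroom remains.

    from multiprocessing import Pool

    def transform_partition(name):
        # Stand-in for the real per-partition transform (cleanse, join, aggregate).
        return f"{name}: done"

    partitions = ["2015-01", "2015-02", "2015-03", "2015-04"]

    if __name__ == "__main__":
        # Four worker processes transform four monthly partitions at once.
        with Pool(processes=4) as pool:
            results = pool.map(transform_partition, partitions)
        print(results)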

Page 10: Improve Your ETL Performance

USE HADOOP
Apache Hadoop is designed for distributed processing of large datasets over a cluster of machines. It uses HDFS, a dedicated file system that cuts data into small chunks and spreads them optimally over the cluster. Duplicate copies are kept, and the system maintains integrity automatically.
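A minimal sketch of landing an extract on HDFS so the cluster can process it locally, calling the standard hdfs dfs commands from Python; the local file name and HDFS paths are illustrative assumptions.

    import subprocess

    # Copy the day's extract into HDFS; its blocks are replicated across the cluster.
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/data/staging/2015-01-27"], check=True)
    subprocess.run(["hdfs", "dfs", "-put", "orders_2015-01-27.csv",
                    "/data/staging/2015-01-27/"], check=True)

    # List the directory; the output includes each file's replication factor.
    subprocess.run(["hdfs", "dfs", "-ls", "/data/staging/2015-01-27"], check=True)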

Page 11: Improve Your ETL Performance

MAPREDUCE
MapReduce is the framework Hadoop uses to process tasks (Hadoop 2, with YARN, allows additional applications). Each MapReduce job works in two stages (see the sketch after the list):

▪ Map - filtering & sorting data - tasks are divided into sub-tasks and processed in parallel by the cluster machines.

▪ Reduce - summary operations - data from the previous stage is combined.
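Here is a minimal Hadoop Streaming sketch of both stages in Python, summing sales per day; the tab-separated input layout (date in the second column, amount in the third) and the job itself are illustrative assumptions.

    import sys
    from itertools import groupby

    def mapper(lines):
        # Map: emit (sale_date, amount) key/value pairs from tab-separated input.
        for line in lines:
            fields = line.rstrip("\n").split("\t")
            print(f"{fields[1]}\t{fields[2]}")

    def reducer(lines):
        # Reduce: sum amounts per date; Hadoop delivers the lines sorted by key.
        pairs = (line.rstrip("\n").split("\t") for line in lines)
        for date, group in groupby(pairs, key=lambda pair: pair[0]):
            total = sum(float(amount) for _, amount in group)
            print(f"{date}\t{total}")

    if __name__ == "__main__":
        {"map": mapper, "reduce": reducer}[sys.argv[1]](sys.stdin)

With Hadoop Streaming, the same script would typically be wired in as -mapper "python job.py map" and -reducer "python job.py reduce", together with -input and -output HDFS paths.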

Page 12: Improve Your ETL Performance

MORE ABOUT HADOOP
Hadoop is optimized for distributed analytics processing. Sort and aggregate functions execute in parallel across the entire cluster. If you’re CPU’d out, Hadoop lets you scale out rather than up: just add more machines to the cluster and you’re good to go. Running Hadoop-as-a-Service in the cloud makes scaling even easier, since you can add more power with a few clicks.

Page 13: Improve Your ETL Performance

LIMITATIONS
Moving heavy processes to Hadoop could speed up your ETL and make your bosses happier. There are a few limitations, though:

1. Hadoop works best when the data is stored locally on the cluster via HDFS. If you run Hadoop on the cloud, the data can be saved in an object store (e.g. S3 on AWS or Softlayer Object Storage) and accessed when necessary. It may decrease performance for large clusters because of the network communication overhead.

2. Hadoop doesn’t read relational data directly. Use Sqoop to import/export it into Hadoop (see the sketch after this list).

3. Running Hadoop on-premises requires a large upfront expenditure for implementation and machines. Running SaaS Hadoop in the cloud solves this issue.

4. You’ll need to train or hire techies skilled in Java, Linux, and distributed systems.
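For limitation 2, a minimal sketch of a Sqoop import invoked from Python; the JDBC URL, table name, target directory, and mapper count are illustrative assumptions.

    import subprocess

    subprocess.run(
        [
            "sqoop", "import",
            "--connect", "jdbc:mysql://dbhost/sales",   # hypothetical source database
            "--table", "orders",
            "--target-dir", "/data/staging/orders",     # lands as files on HDFS
            "--num-mappers", "4",                       # parallel import tasks
        ],
        check=True,
    )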

Page 14: Improve Your ETL Performance

SUMMARY
There are several measures you can take to shrink the data processing window: concentrate on bottlenecks, load data incrementally, partition large tables, process only relevant data, try caching, and use parallel processing. Ultimately, give Hadoop a try, since it is designed for processing humongous amounts of data. Soon you’ll be able to lean back in your plastic chair, put your feet up, and enjoy the sound of data crunching.