Improve Your ETL Performance

SEVEN QUICK TIPS FOR IMPROVING ETL PERFORMANCE
Yaniv Mor, CEO and Co-founder, Xplenty


DESCRIPTION

There is not much data when a company is first established. The ETL process runs overnight and is completed by dawn. However, as the company grows, the data thickens and the process takes progressively longer. One morning you almost choke on your coffee when you see it’s still not done, and all the while the phone keeps ringing with complaints about the system being stuck and the data getting stale. Clutching your stress ball, you sit at your desk and think: “How do I improve ETL performance?” We offer seven quick tips for improving your ETL performance.

TRANSCRIPT

Page 1: Improve Your ETL Performance

SEVEN QUICK TIPS FOR IMPROVING ETL PERFORMANCE
Yaniv Mor, CEO and Co-founder, Xplenty

Page 2: Improve Your ETL Performance

A company’s BI analysts check figures and look for insights, while its BI developers are in charge of ETL:

1. Extract data from the various operational stores to the staging environment.

2. Transform the data using an ETL tool, thus building facts and dimensions.

3. Load the transformed data into the production data warehouse.

When a company first starts, there’s not much data. The ETL process runs overnight and is completed by dawn. However, as the company grows, the data thickens and the process takes progressively longer. So the question becomes…

Page 3: Improve Your ETL Performance

HOW DO I IMPROVE ETL PERFORMANCE?

1. Tackle Bottlenecks

2. Load Data Incrementally

3. Partition Large Tables

4. Cut Out Extraneous Data

5. Cache the Data

6. Process in Parallel

7. Use Hadoop

Page 4: Improve Your ETL Performance

TACKLE BOTTLENECKS
Before anything else, make sure you log metrics such as time, the number of records processed, and hardware usage. Check how many resources each part of the process takes and address the heaviest one. Usually it will be the second part, building the facts and dimensions in the staging environment.

Wherever your bottleneck may be, take a deep breath and dive into the code. May the force be with you.
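As a concrete starting point, here is a minimal sketch of per-stage metrics logging in Python. The extract/transform/load functions are hypothetical stand-ins for your own steps; only standard-library modules are used.

    import logging
    import time

    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")

    def timed_stage(name, func, *args):
        """Run one ETL stage, logging its duration and record count."""
        start = time.monotonic()
        records = func(*args)                      # each stage returns a list of records
        elapsed = time.monotonic() - start
        logging.info("stage=%s records=%d seconds=%.1f", name, len(records), elapsed)
        return records

    # Hypothetical stages; replace with the real extract/transform/load code.
    def extract():
        return [{"id": i} for i in range(100_000)]

    def transform(rows):
        return [{**r, "is_even": r["id"] % 2 == 0} for r in rows]

    def load(rows):
        return rows                                # pretend to write to the warehouse

    rows = timed_stage("extract", extract)
    rows = timed_stage("transform", transform, rows)
    rows = timed_stage("load", load, rows)

Comparing the three log lines run after run shows which stage is actually the heaviest before you touch any code.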

Page 5: Improve Your ETL Performance

LOAD DATA INCREMENTALLY
Loading only the changes between the previous and the new data saves a lot of time compared to a full load. It’s more difficult to implement and maintain, but difficult doesn’t mean impossible, so do consider it. Loading incrementally can definitely improve ETL performance.
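A minimal sketch of the high-water-mark approach, assuming a PostgreSQL source and warehouse reached through psycopg2; the table and column names (orders, updated_at, staging.orders) and the connection strings are illustrative assumptions, not part of the original deck.

    import psycopg2

    src = psycopg2.connect("dbname=operational")   # hypothetical source DSN
    dwh = psycopg2.connect("dbname=warehouse")     # hypothetical warehouse DSN

    # Find the newest change already loaded (the high-water mark).
    with dwh.cursor() as cur:
        cur.execute("SELECT COALESCE(MAX(updated_at), '1970-01-01') FROM staging.orders")
        high_water_mark = cur.fetchone()[0]

    # Pull only the rows changed since the last run instead of the full table.
    with src.cursor() as cur:
        cur.execute("SELECT * FROM orders WHERE updated_at > %s", (high_water_mark,))
        changed_rows = cur.fetchall()

    # ...upsert changed_rows into the warehouse and commit...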

Page 6: Improve Your ETL Performance

PARTITION LARGE TABLES
If you use relational databases and want to shrink the data processing window, you can partition large tables. That is, cut big tables down into physically smaller ones, typically by date.

Each partition has its own indexes, and those index trees are shallower, allowing quicker access to the data.

Partitioning also allows switching data in and out of a table with a quick metadata operation instead of actually inserting or deleting data records.
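As one concrete way to do this, here is a sketch using PostgreSQL declarative partitioning (PostgreSQL 10 or later) driven from Python via psycopg2; the sales table, its columns, and the connection string are illustrative assumptions.

    import psycopg2

    conn = psycopg2.connect("dbname=warehouse")    # hypothetical DSN
    with conn, conn.cursor() as cur:
        # The parent table is logical only; rows land in the monthly partitions.
        cur.execute("""
            CREATE TABLE sales (
                sale_id   bigint,
                sale_date date,
                amount    numeric
            ) PARTITION BY RANGE (sale_date)
        """)
        # One physical table per month, each with its own (shallower) indexes.
        cur.execute("""
            CREATE TABLE sales_2015_01 PARTITION OF sales
                FOR VALUES FROM ('2015-01-01') TO ('2015-02-01')
        """)
        # Switching a month out is a metadata operation, not a row-by-row delete.
        cur.execute("ALTER TABLE sales DETACH PARTITION sales_2015_01")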

Page 7: Improve Your ETL Performance

CUT OUT EXTRANEOUS DATA
It’s important to collect as much data as possible, but not all of it is worth loading into the data warehouse. For instance, images of furniture models are useless to BI analysts. If you want to improve ETL performance, sit down and define exactly which data should be processed, and leave the irrelevant rows and columns out.

It’s better to start small and grow as you go than to create a giant data monster that takes eons to process.
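A minimal sketch of trimming the feed before it reaches staging, using pandas; the file name and column names are illustrative assumptions.

    import pandas as pd

    # Read only the columns the analysts actually use; skip image URLs, blobs, etc.
    wanted = ["order_id", "customer_id", "order_date", "amount"]
    orders = pd.read_csv("orders_export.csv", usecols=wanted, parse_dates=["order_date"])

    # Drop rows that will never be queried, e.g. zero-value test orders.
    orders = orders[orders["amount"] > 0]

    orders.to_csv("staging/orders_trimmed.csv", index=False)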

Page 8: Improve Your ETL Performance

CACHE THE DATA
Caching data can greatly speed things up, since memory access is far faster than reading from hard drives. Note that caching is limited by the maximum amount of memory your hardware supports, so all that plastic-furniture big data might not fit.
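A minimal sketch of in-memory caching for repeated dimension lookups, using the standard-library functools.lru_cache; lookup_customer and its stubbed body are illustrative assumptions.

    from functools import lru_cache

    @lru_cache(maxsize=100_000)            # bounded, so it stays within available RAM
    def lookup_customer(customer_id):
        # In a real pipeline this would query the customer dimension table;
        # it is stubbed out here for the sketch.
        return {"customer_id": customer_id, "segment": "retail"}

    # Repeated keys in the fact stream are now served from memory, not the database.
    for customer_id in [1, 2, 1, 3, 1, 2]:
        lookup_customer(customer_id)

    print(lookup_customer.cache_info())    # hit/miss counts show the cache at work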

Page 9: Improve Your ETL Performance

PROCESS IN PARALLEL
Instead of processing serially, optimize resources by processing in parallel. Sadly, this is not always possible. Sort and aggregate functions (count, sum, etc.) block processing because they must end before the next task can begin.

Even if you can process in parallel, it won’t help if the machine is running at 100% CPU the entire time. You could scale up by upgrading the CPU, but that only scales so far. There’s a much better solution.
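Before reaching for new hardware, here is a minimal sketch of parallelising independent work with the standard-library multiprocessing module; transform_partition and the partition list are illustrative assumptions, and this only helps while the chunks really are independent and CPU headroom remains.

    from multiprocessing import Pool

    def transform_partition(name):
        # Stand-in for the real per-partition transform (cleanse, join, aggregate).
        return f"{name}: done"

    partitions = ["2015-01", "2015-02", "2015-03", "2015-04"]

    if __name__ == "__main__":
        # Four worker processes transform four monthly partitions at once.
        with Pool(processes=4) as pool:
            results = pool.map(transform_partition, partitions)
        print(results)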

Page 10: Improve Your ETL Performance

USE HADOOP
Apache Hadoop is designed for distributed processing of large datasets over a cluster of machines. It uses HDFS, a dedicated file system that cuts data into small chunks and spreads them optimally over the cluster. Duplicate copies are kept, and the system maintains integrity automatically.
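A minimal sketch of landing an extract on HDFS so the cluster can process it locally, calling the standard hdfs dfs commands from Python; the local file name and HDFS paths are illustrative assumptions.

    import subprocess

    # Copy the day's extract into HDFS; its blocks are replicated across the cluster.
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/data/staging/2015-01-27"], check=True)
    subprocess.run(["hdfs", "dfs", "-put", "orders_2015-01-27.csv",
                    "/data/staging/2015-01-27/"], check=True)

    # List the directory; the output includes each file's replication factor.
    subprocess.run(["hdfs", "dfs", "-ls", "/data/staging/2015-01-27"], check=True)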

Page 11: Improve Your ETL Performance

MAPREDUCE
MapReduce is the framework Hadoop uses to process tasks (Hadoop 2, with YARN, allows additional applications). Each MapReduce job works in two stages (see the sketch after the list):

▪ Map - filtering & sorting data - tasks are divided into sub-tasks and processed in parallel by the cluster machines.

▪ Reduce - summary operations - data from the previous stage is combined.
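Here is a minimal Hadoop Streaming sketch of both stages in Python, summing sales per day; the tab-separated input layout (date in the second column, amount in the third) and the job itself are illustrative assumptions.

    import sys
    from itertools import groupby

    def mapper(lines):
        # Map: emit (sale_date, amount) key/value pairs from tab-separated input.
        for line in lines:
            fields = line.rstrip("\n").split("\t")
            print(f"{fields[1]}\t{fields[2]}")

    def reducer(lines):
        # Reduce: sum amounts per date; Hadoop delivers the lines sorted by key.
        pairs = (line.rstrip("\n").split("\t") for line in lines)
        for date, group in groupby(pairs, key=lambda pair: pair[0]):
            total = sum(float(amount) for _, amount in group)
            print(f"{date}\t{total}")

    if __name__ == "__main__":
        {"map": mapper, "reduce": reducer}[sys.argv[1]](sys.stdin)

With Hadoop Streaming, the same script would typically be wired in as -mapper "python job.py map" and -reducer "python job.py reduce", together with -input and -output HDFS paths.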

Page 12: Improve Your ETL Performance

MORE ABOUT HADOOP
Hadoop is optimized for distributed analytics processing. Sort and aggregate functions execute in parallel across the entire cluster. If you’re CPU’d out, Hadoop lets you scale out rather than up: just add more machines to the cluster and you’re good to go. Running Hadoop-as-a-Service in the cloud makes scaling even easier, since you can add more power with a few clicks.

Page 13: Improve Your ETL Performance

LIMITATIONS
Moving heavy processes to Hadoop could speed up your ETL and make your bosses happier. There are a few limitations, though:

1. Hadoop works best when the data is stored locally on the cluster via HDFS. If you run Hadoop on the cloud, the data can be saved in an object store (e.g. S3 on AWS or Softlayer Object Storage) and accessed when necessary. It may decrease performance for large clusters because of the network communication overhead.

2. Hadoop doesn’t read relational data directly. Use Sqoop to import/export it into Hadoop (see the sketch after this list).

3. Running Hadoop on-premises requires a large upfront expenditure for implementation and machines. Running SaaS Hadoop in the cloud solves this issue.

4. You’ll need to train or hire techies skilled in Java, Linux, and distributed systems.
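For limitation 2, a minimal sketch of a Sqoop import invoked from Python; the JDBC URL, table name, target directory, and mapper count are illustrative assumptions.

    import subprocess

    subprocess.run(
        [
            "sqoop", "import",
            "--connect", "jdbc:mysql://dbhost/sales",   # hypothetical source database
            "--table", "orders",
            "--target-dir", "/data/staging/orders",     # lands as files on HDFS
            "--num-mappers", "4",                       # parallel import tasks
        ],
        check=True,
    )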

Page 14: Improve Your ETL Performance

SUMMARY
There are several measures you can take to shrink the data processing window: concentrate on bottlenecks, load data incrementally, partition large tables, process only relevant data, try caching, and use parallel processing. Ultimately, give Hadoop a try, since it is designed for processing humongous amounts of data. Soon you’ll be able to lean back in your plastic chair, put your feet up, and enjoy the sound of data crunching.