making sense of performance and identifying stragglers in data analytics framework

Making sense of performance and identifying stragglers in Data Analytics Framework

CSCI 8780 Advanced Distributed Systems

Manish Ranjan and Narita Pandhe

Introduction

- Large-scale data analytics has become widespread

- Research devoted to improving the performance of data analytics frameworks

- But comparatively little effort has been spent on identifying the performance bottlenecks


More resource efficient

Faster


Experiments


What Cluster Configuration did we use?

- 1 master, 6 slaves

- Master config: 64-bit, 8 GB RAM, 2 cores, 50 GB SSD

- Slave config (each): 64-bit, 2 GB RAM, 1 core, 30 GB SSD

- Config-related modifications: e.g. replication + SSDs


First: benchmarking the NameNode

NNBench: used first to test the NameNode hardware and configuration

What it does:

Generates a large number of HDFS-related requests

Why it does it:

To put a high HDFS management stress on the NameNode

How it does it:

Simulates requests for creating, reading, renaming and deleting files on HDFS
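
The four request types above correspond to NNBench's `create_write`, `open_read`, `rename` and `delete` operations. As a rough sketch, the helper below only builds the command line for one run per operation and executes nothing. The jar path, task counts, file count and base directory are hypothetical placeholders, not the values used in these experiments, and the flag names should be checked against the NNBench shipped with your Hadoop version.

```python
# NNBench operations that correspond to create, read, rename and delete.
NNBENCH_OPERATIONS = ["create_write", "open_read", "rename", "delete"]


def nnbench_command(operation, jar_path, maps, reduces, number_of_files, base_dir):
    """Build the argument list for one NNBench run (nothing is executed)."""
    if operation not in NNBENCH_OPERATIONS:
        raise ValueError(f"unknown NNBench operation: {operation}")
    return [
        "hadoop", "jar", jar_path, "nnbench",
        "-operation", operation,
        "-maps", str(maps),
        "-reduces", str(reduces),
        "-numberOfFiles", str(number_of_files),
        "-baseDir", base_dir,
    ]


# Hypothetical values: adjust the jar path and sizes to your own cluster.
commands = [
    nnbench_command(
        operation=op,
        jar_path="hadoop-test.jar",      # placeholder, depends on the Hadoop release
        maps=6,                          # assumption: one map task per slave
        reduces=6,
        number_of_files=1000,
        base_dir="/benchmarks/NNBench",  # placeholder HDFS directory
    )
    for op in NNBENCH_OPERATIONS
]

for command in commands:
    print(" ".join(command))
```

Each list can be handed to `subprocess.run` on a machine with the Hadoop client installed. Running `create_write` first matters, because the other operations act on the files it creates.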


What Workload did we use?

- TeraSort benchmark suite

- Goal of TeraSort: sort 1TB of data (or any other amount of data you want) as fast as possible.

- Limited by our cluster configuration, we ran several experiments on 1 GB, 5 GB and 10 GB of data.

- TeraSort benchmark can be utilized to iron out your Hadoop configuration
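
To make the data sizes above concrete: TeraGen, the generator in the TeraSort suite, takes a row count rather than a byte count, and each row is 100 bytes. The sketch below converts the three sizes into row counts and compares each with the aggregate slave memory from the cluster configuration (6 slaves with 2 GB each). Treating 1 GB as 10^9 bytes is an assumption, and the function and constant names are illustrative only.

```python
ROW_BYTES = 100   # TeraGen writes fixed-size 100-byte rows
GB = 10**9        # assumption: decimal gigabytes

SLAVE_COUNT = 6   # from the cluster configuration slide
SLAVE_RAM_GB = 2  # RAM per slave, from the same slide


def teragen_rows(size_gb):
    """Number of 100-byte rows TeraGen must generate for size_gb gigabytes."""
    return size_gb * GB // ROW_BYTES


total_slave_ram_gb = SLAVE_COUNT * SLAVE_RAM_GB

for size_gb in (1, 5, 10):
    rows = teragen_rows(size_gb)
    ratio = size_gb / total_slave_ram_gb
    print(f"{size_gb} GB -> {rows:,} rows, {ratio:.2f}x aggregate slave RAM")
```

Under these assumptions even the 10 GB input is smaller than the combined slave memory of 12 GB, although each individual slave holds only 2 GB.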


Hadoop

i-6c76c1da (M), i-40684ef0 (s1), i-41684ef1 (s2), i-42684ef2 (s3), i-43684ef3 (s4), i-4e684efe (s5), i-4f684eff (s6)



Red: s6, Dark Green: s4



Observations for 10GB




Identified Stragglers


Spark



Orange: s2, Red: s6


Hadoop / Spark

Red: s6, Bright Blue: s5, Orange: s2

Conclusions

- A straggler task spends an unusually long time in a particular part of task execution.

- It is usually not too hard to find a straggler in a specific execution; what is hard is to find it consistently.

- We were lucky enough to spot a few even on a modest cluster, which emphasizes the need to understand the cluster's meta information well.

E.g.: DFS disk read time, shuffle write time, shuffle read time, and Java's garbage collection

- Spark:

- often breaks jobs into many more tasks

- has much lower task launch overhead than Hadoop
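
The notion of a straggler in the conclusions above can be made concrete with a small sketch. It flags any task whose duration exceeds 1.5 times the median duration of its peers. The 1.5x threshold is a commonly used convention and an assumption here, not necessarily the criterion used in these experiments, and the node names and durations are made up for illustration rather than measured.

```python
from statistics import median


def find_stragglers(task_durations, threshold=1.5):
    """Return the tasks whose duration exceeds threshold * median duration."""
    if not task_durations:
        return {}
    cutoff = threshold * median(task_durations.values())
    return {task: d for task, d in task_durations.items() if d > cutoff}


# Hypothetical task durations in seconds, keyed by (node, task index).
durations = {
    ("n1", 0): 41.0,
    ("n2", 1): 44.0,
    ("n3", 2): 39.0,
    ("n4", 3): 43.0,
    ("n5", 4): 40.0,
    ("n6", 5): 95.0,
}

stragglers = find_stragglers(durations)
print(stragglers)
```

The same function can be applied to one component of task time, such as shuffle read time or garbage collection time, to see which part of execution makes a task slow. Repeating it over several runs shows whether the same node is flagged consistently.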


References

- Making Sense of Performance in Data Analytics Frameworks, Kay Ousterhout, Ryan Rasti, Sylvia Ratnasamy, Scott Shenker, Byung-Gon Chun, UC Berkeley, ICSI, VMware, Seoul National University

- No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics, https://www.cs.duke.edu/starfish/files/socc11-cluster-sizing.pdf

- http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/

- https://github.com/ehiggs/spark-terasort

- aws.amazon.com
