the fifth elephant 2016: self-serve performance tuning for hadoop and spark

Self-Serve Performance Tuning for Hadoop & Spark

The Fifth Elephant 2016

Akshay Rai

Engineer, Hadoop Development Team, LinkedIn

Dr. Elephant

© 2016 LinkedIn Corporation. All Rights Reserved.

Hadoop @ LinkedIn c. 2008

● 1 cluster

● 20 nodes

● 10 users

● 10 workflows in production

● MapReduce, Pig

Hadoop @ LinkedIn c. 2016

● > 10 clusters

● > 10000 nodes

● > 1000 users

● Thousands of queries and flows in development

● Hundreds running in Production

● MapReduce, Pig, Hive, Spark, Scalding, Gobblin, Cubert

Scaling Hadoop Infrastructure

• Add extra machines to the cluster

• Hadoop is scalable, but not that optimal!

• We cannot keep adding machines forever

• Tune the given resources and minimize the addition of new machines

Measuring performance

• Highlights hardware failures and poorly performing components

• Reveals scope for environment upgrades

Cluster Level Performance Tuning

Job Level Performance Tuning

How difficult is it to tune a Job?

• Production Gatekeeper: let a job go into production only after verifying that it is tuned.

• Restriction! This raised more questions on how to tune, so we spent more resources helping people.

Here’s what we tried in order to achieve job tuning!

Challenges in tuning a job

• Hadoop is designed to let users tune their jobs, BUT:

• One cannot optimize if one doesn’t understand the internals of the framework

• Critical information is scattered

• Hadoop has a huge set of parameters; tuning some may impact others

You cannot tune what you do not know & you cannot improve what you cannot measure

Training Sessions

Training - Doesn’t Scale

• More people means more frequent sessions

• Hadoop experience varies from person to person

• Framework-specific training is needed (Pig, Hive, etc.)

Expert Review

Expert Review - Also Doesn’t Work

• Again, not scalable

• Cannot ensure a job is performing optimally; no easy comparison

• Different people, different perspectives, no consensus

• Error-prone; one might overlook certain aspects

Scaling Hadoop Infrastructure is HARD

Scaling User Productivity is much HARDER

Birth of Dr. Elephant

What does Dr. Elephant do?

• Helps every user get the best performance from their jobs

• Analyses and compares historical executions

• Provides a platform for other performance-related tools

Architecture

Rule #1 : Mapper Data Skew

Mapper Skew Problem

• Varying split sizes can cause skew in the Mapper input
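
A minimal sketch of how skew in the Mapper input could be measured, assuming the input size of each map task is already known. The function name, the split around the median, and the example sizes are illustrative assumptions, not Dr. Elephant's actual heuristic.

```python
from statistics import mean, median


def mapper_skew_ratio(input_sizes):
    """Return how many times larger the average 'large' task input is
    than the average 'small' task input (1.0 means no skew)."""
    if len(input_sizes) < 2:
        return 1.0
    mid = median(input_sizes)
    # Tasks at or below the median form the small group, the rest the large one.
    small = [s for s in input_sizes if s <= mid]
    large = [s for s in input_sizes if s > mid]
    if not large:
        return 1.0
    small_avg = mean(small)
    if small_avg == 0:
        return float("inf")
    return mean(large) / small_avg


# Hypothetical task inputs in MB: three small splits and one very large one.
sizes_mb = [64, 64, 64, 1024]
ratio = mapper_skew_ratio(sizes_mb)
```

A ratio close to 1 means the map tasks received similar amounts of data; a large ratio means a few tasks do most of the work while the others finish early.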

Solution to Mapper Skewness

• Each Mapper should process the same amount of data

• Combine the small chunks and feed them to a single Mapper
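
The idea of combining small chunks can be sketched as a greedy packing step. This is an illustration under assumed chunk sizes and an assumed 128 MB target; Hadoop ships `CombineFileInputFormat` for this purpose, and the sketch does not reproduce its logic.

```python
def combine_splits(chunk_sizes, target_size):
    """Greedily pack chunks, in order, into groups of at most target_size.

    A chunk larger than target_size ends up in a group of its own.
    """
    groups, current, current_size = [], [], 0
    for size in chunk_sizes:
        # Close the current group if adding this chunk would overflow it.
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Hypothetical chunk sizes in MB, packed toward 128 MB per Mapper.
groups = combine_splits([16, 32, 64, 16, 100, 8, 120], target_size=128)
```

Each resulting group would be handed to a single Mapper, so seven uneven chunks become three inputs of roughly similar size.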

Rule #2 : Mapper Memory

Mapper Memory Problem & Solution

• Requested Container Memory >> Task’s Consumed Memory

• Request a 4 GB container

• The job actually uses only 512 MB

• Wait longer to get 4 GB and then block 4 GB of resources!

• Request less container memory by setting

• mapreduce.map.memory.mb (or mapreduce.reduce.memory.mb)
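
The arithmetic behind this slide can be sketched as follows, using its own numbers (4 GB requested, 512 MB consumed). The helper names, the 1.5x headroom, and the 512 MB rounding step are assumptions chosen for illustration, not values taken from Dr. Elephant.

```python
import math


def memory_utilisation(requested_mb, consumed_mb):
    """Fraction of the requested container memory that the task used."""
    return consumed_mb / requested_mb


def suggest_container_mb(peak_consumed_mb, headroom=1.5, step_mb=512):
    """Round peak usage plus headroom up to the next multiple of step_mb."""
    needed = peak_consumed_mb * headroom
    return math.ceil(needed / step_mb) * step_mb


# Numbers from the slide: a 4 GB request for a task that uses 512 MB.
utilisation = memory_utilisation(requested_mb=4096, consumed_mb=512)
suggested = suggest_container_mb(peak_consumed_mb=512)
```

Under these assumptions the task uses one eighth of its container, and the suggested value could be applied through `mapreduce.map.memory.mb` (or `mapreduce.reduce.memory.mb` for reducers).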

Search

MapReduce Report

Job History

How to define a rule?

How does a Rule work?

INPUT: Counters & Task Data

LOGIC: Some logic to compute a value

OUTPUT: Compare value against threshold levels
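
The three stages above can be sketched as a small function. The severity names, the threshold values, and the shape of `task_data` are assumptions made for this example; real Dr. Elephant rules are classes inside the project, and this is not their API.

```python
SEVERITIES = ["NONE", "LOW", "MODERATE", "SEVERE", "CRITICAL"]


def severity_ascending(value, thresholds):
    """Map a computed value onto a severity level.

    thresholds holds four ascending limits; each limit the value
    reaches raises the severity by one level.
    """
    level = sum(1 for limit in thresholds if value >= limit)
    return SEVERITIES[level]


def memory_rule(task_data):
    # INPUT: task data (requested and consumed memory, in MB).
    requested = task_data["requested_mb"]
    consumed = task_data["consumed_mb"]
    # LOGIC: ratio of requested memory to average consumed memory.
    ratio = requested / (sum(consumed) / len(consumed))
    # OUTPUT: compare the value against the threshold levels.
    return ratio, severity_ascending(ratio, thresholds=[2, 4, 6, 8])


# Hypothetical job: 4096 MB requested, two tasks using 512 MB each.
result = memory_rule({"requested_mb": 4096, "consumed_mb": [512, 512]})
```

Keeping the thresholds as a parameter mirrors the point that threshold levels are something a user can tune per rule.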

Customising Dr. Elephant

Adding a Custom Rule

1. Create a new Rule and test it.

2. Create a help page defining the rule, the parameters to tune, etc.

3. Add the details of the Rule to the HeuristicConf.xml file:

    <heuristic>
      <applicationtype>Mapreduce</applicationtype>
      <heuristicname>Rule Name</heuristicname>
      <classname>path.to.rule.class</classname>
      <viewname>path.to.rule.help.page</viewname>
    </heuristic>

4. Run Dr. Elephant. It should now include the new rule.
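
Before step 4, an entry like the one in step 3 could be sanity-checked with a short script. This is a sketch only: it treats the four tags shown on the slide as required, which is an assumption, and `missing_fields` is a hypothetical helper rather than part of Dr. Elephant.

```python
import xml.etree.ElementTree as ET

# Assumption: the four child tags shown on the slide are all required.
REQUIRED_TAGS = ["applicationtype", "heuristicname", "classname", "viewname"]

# The placeholder entry from the slide, unchanged.
ENTRY = """
<heuristic>
  <applicationtype>Mapreduce</applicationtype>
  <heuristicname>Rule Name</heuristicname>
  <classname>path.to.rule.class</classname>
  <viewname>path.to.rule.help.page</viewname>
</heuristic>
"""


def missing_fields(entry_xml):
    """Return the required child tags that are absent or empty."""
    root = ET.fromstring(entry_xml)
    missing = []
    for tag in REQUIRED_TAGS:
        node = root.find(tag)
        if node is None or not (node.text or "").strip():
            missing.append(tag)
    return missing


missing = missing_fields(ENTRY)
```

An empty list means every expected tag is present and non-empty; the class and view paths still have to point at a real rule and help page.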

What else can you customize?

● Rules and their threshold levels

● Easily integrate with new schedulers (Azkaban, Airflow, Oozie, etc.)

● Enable, disable, or add new Fetchers

● Extend to newer application types and job types

Production Gatekeeper

Automated Production Reviews | JIRA Bot

• Cluster for critical workloads

• Audit before deployment

Workflow monitoring and reports

• Monitor performance on each execution

• Compare behaviour across revisions

• Cost to Serve analysis
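
One way to compare two revisions of a workflow is to total the memory each one holds over time. The sketch below assumes per-task container size and runtime are available and uses GB-hours as the cost unit; the field names and figures are hypothetical, and this is not Dr. Elephant's own calculation.

```python
def resource_usage_gb_hours(tasks):
    """Sum container memory held over task runtime, in GB-hours."""
    mb_seconds = sum(t["container_mb"] * t["runtime_s"] for t in tasks)
    return mb_seconds / 1024 / 3600


def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


# Hypothetical revisions of the same flow: four 30-minute tasks each,
# with the container request lowered from 4 GB to 1 GB in revision B.
rev_a = [{"container_mb": 4096, "runtime_s": 1800}] * 4
rev_b = [{"container_mb": 1024, "runtime_s": 1800}] * 4

usage_a = resource_usage_gb_hours(rev_a)
usage_b = resource_usage_gb_hours(rev_b)
change = percent_change(usage_a, usage_b)
```

Tracking a figure like this on every execution makes a regression between revisions visible as a jump in cost.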

Open Source, April 2016

github.com/linkedin/dr-elephant

Watchers: 60, Stars: 262, Forks: 109

Let’s collectively contribute!

Dr. Elephant Community

• Pull Requests: 60+

• Contributors: 10+

• User Topics: 50+

Coming Soon

● Real time analysis of Jobs

● Analytics for Failed Jobs

● Visualizing Workflows through DAGs

● Support for other schedulers and frameworks

References

Engineering Blog: engineering.linkedin.com/blog/2016/04/dr-elephant-open-source-self-serve-performance-tuning-hadoop-spark

Open Source GitHub Link: github.com/linkedin/dr-elephant

Mailing List & Gitter: dr-elephant-users (https://groups.google.com/forum/#!forum/dr-elephant-users), linkedin/dr-elephant

Hadoop Summit 2015: https://www.youtube.com/watch?v=aL3OJ4YoxPA (Mark Wagner)

github.com/linkedin/dr-elephant

Thank You

Akshay Rai

https://in.linkedin.com/in/akshayrai09
