The Fifth Elephant 2016: Self-Serve Performance Tuning for Hadoop and Spark



Self-Serve Performance Tuning for Hadoop & Spark

The Fifth Elephant 2016

Akshay Rai
Engineer, Hadoop Development Team
LinkedIn

Dr. Elephant

© 2016 LinkedIn Corporation. All Rights Reserved.


Hadoop @ LinkedIn c. 2008

● 1 cluster

● 20 nodes

● 10 users

● 10 workflows in production

● MapReduce, Pig


Hadoop @ LinkedIn c. 2016

● > 10 clusters

● > 10000 nodes

● > 1000 users

● Thousands of queries and flows in development

● Hundreds running in Production

● MapReduce, Pig, Hive, Spark, Scalding, Gobblin, Cubert

Scaling Hadoop Infrastructure

• Add extra machines to the cluster

• Hadoop is scalable but not that optimal!

• We cannot keep adding machines forever

• Tune given resources and minimize addition of new machines


Measuring performance

• Highlights hardware failures and poorly performing components

• Identifies scope for environment upgrades


Cluster Level Performance Tuning

Job Level Performance Tuning

How difficult is it to tune a Job?

• Production Gatekeeper: let a job go into production only after verifying that it is tuned.

• Restriction! More questions on how to tune! More resources spent helping people.

Here’s what we tried in order to achieve job tuning!


Challenges in tuning a job

• Hadoop is designed to let users tune their jobs BUT!

• One cannot optimize if one doesn’t understand the internals of the framework

• Critical information is scattered

• Hadoop has a huge set of parameters; tuning some may impact others


You cannot tune what you do not know & you cannot improve what you cannot measure


Training Sessions


Training - Doesn’t Scale

• More people, more frequent sessions

• Hadoop experience varies from person to person

• Framework-specific training: Pig, Hive, etc.


Expert Review


Expert Review - Also Doesn’t Work

• Again, not scalable

• Cannot ensure a job is performing optimally; no easy comparison

• Different people, different perspectives, no consensus

• Error-prone: one might overlook certain aspects


Scaling Hadoop Infrastructure is HARD

Scaling User Productivity is much HARDER

Birth of Dr. Elephant


What does Dr. Elephant do?

• Helps every user get the best performance from their jobs

• Analyses and compares historical executions

• Provides a platform for other performance-related tools


Architecture


Rule #1 : Mapper Data Skew


Mapper Skew Problem

• Varying split sizes can skew the Mapper input


Solution to Mapper Skewness

• Each Mapper should process the same amount of data

• Combine the small chunks and feed them to a single Mapper
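The idea of combining small chunks can be sketched as a simple greedy packing. This is only an illustration, not how Dr. Elephant or Hadoop implements it: the function name `combine_splits`, the chunk sizes, and the 128 MB target are all assumptions. In Hadoop itself this role is typically played by a combining input format such as CombineFileInputFormat.

```python
def combine_splits(chunk_sizes_mb, target_mb):
    """Greedily pack chunks into groups of at most target_mb each.

    Hypothetical helper for illustration: every returned group stands
    for the input that one Mapper would process.
    """
    groups = []
    current, current_size = [], 0
    # Largest chunks first, so big chunks end up alone and small ones share.
    for size in sorted(chunk_sizes_mb, reverse=True):
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    # Assumed chunk sizes in MB: nine chunks, many of them small.
    chunks = [128, 120, 64, 60, 32, 32, 30, 20, 14]
    for group in combine_splits(chunks, target_mb=128):
        print(sum(group), group)
```

With these assumed sizes, nine uneven chunks become four Mapper inputs of roughly equal size instead of nine Mappers with very different loads.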


Rule #2 : Mapper Memory


Mapper Memory Problem & Solution

• Requested Container Memory >> Task’s Consumed Memory

• Request a 4 GB container

• The job actually uses only 512 MB

• Wait longer to get 4 GB and then block 4 GB of resources!

• Request a lower container memory by setting

• mapreduce.map.memory.mb (or mapreduce.reduce.memory.mb)
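As a rough illustration of the sizing arithmetic, the sketch below derives a smaller container request from observed usage. The helper name `suggest_container_mb`, the 1.5x headroom factor, and the 512 MB rounding step are assumptions chosen for this example, not values taken from Dr. Elephant.

```python
import math


def suggest_container_mb(peak_used_mb, requested_mb, headroom=1.5, step_mb=512):
    """Suggest a container size based on observed peak memory.

    headroom and step_mb are illustrative assumptions: keep some slack
    above the peak, then round up to a multiple of step_mb. The result
    never exceeds what was originally requested.
    """
    needed = peak_used_mb * headroom
    suggested = math.ceil(needed / step_mb) * step_mb
    return min(suggested, requested_mb)


if __name__ == "__main__":
    requested, peak = 4096, 512  # the 4 GB request and 512 MB usage from the slide
    print("requested/used ratio:", requested / peak)
    print("mapreduce.map.memory.mb=%d" % suggest_container_mb(peak, requested))
```

For the slide's numbers this suggests a 1 GB container rather than 4 GB; a job that already uses most of its request is left unchanged.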


Search


MapReduce Report


Job History


How to define a rule?


How does a Rule work?

INPUT: Counters & Task Data

LOGIC: Some logic to compute a value

OUTPUT: Compare value against threshold levels
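The input, logic, output flow can be sketched with mapper data skew as the example rule. Everything named here is an assumption for illustration: the way the skew value is computed, the threshold levels `[2, 4, 8, 16]`, and the severity labels are not taken from Dr. Elephant's source.

```python
from bisect import bisect_right
from statistics import mean

# Illustrative severity labels, ordered from best to worst.
SEVERITIES = ["NONE", "LOW", "MODERATE", "SEVERE", "CRITICAL"]


def skew_value(task_input_mb):
    """LOGIC: compare the average input of the larger half of the tasks
    with the average input of the smaller half."""
    if len(task_input_mb) < 2:
        return 0.0
    ordered = sorted(task_input_mb)
    half = len(ordered) // 2
    small, large = ordered[:half], ordered[half:]
    if mean(small) == 0:
        return float("inf")
    return (mean(large) - mean(small)) / mean(small)


def severity(value, thresholds):
    """OUTPUT: map the computed value onto ascending threshold levels."""
    return SEVERITIES[bisect_right(thresholds, value)]


if __name__ == "__main__":
    # INPUT: per-task input sizes in MB (made-up task data).
    tasks = [100, 110, 105, 900, 950, 1000]
    value = skew_value(tasks)
    print(round(value, 2), severity(value, [2, 4, 8, 16]))
```

Keeping the thresholds as a plain argument mirrors the idea that threshold levels are something a user can adjust without touching the rule's logic.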


Customising Dr. Elephant

Adding a Custom Rule

1. Create a new Rule and test it.

2. Create a help page defining the rule, parameters to tune etc.

3. Add the details of the Rule in the HeuristicConf.xml file

   <heuristic>
     <applicationtype>Mapreduce</applicationtype>
     <heuristicname>Rule Name</heuristicname>
     <classname>path.to.rule.class</classname>
     <viewname>path.to.rule.help.page</viewname>
   </heuristic>

4. Run Dr. Elephant. It should now include the new rules.
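Before running Dr. Elephant with a new entry, a quick sanity check of the XML can catch a missing field. The sketch below is a hypothetical pre-flight check using only Python's standard library; it is not part of Dr. Elephant, and the function name `read_heuristic` is an assumption. It only checks that the four elements shown in step 3 are present and non-empty.

```python
import xml.etree.ElementTree as ET

# The four child elements shown in the slide's <heuristic> entry.
REQUIRED = ("applicationtype", "heuristicname", "classname", "viewname")


def read_heuristic(xml_text):
    """Parse one <heuristic> entry and return its fields as a dict.

    Raises ValueError if the root tag is wrong or a field is missing.
    """
    node = ET.fromstring(xml_text)
    if node.tag != "heuristic":
        raise ValueError("expected <heuristic>, got <%s>" % node.tag)
    entry = {}
    for name in REQUIRED:
        value = node.findtext(name)
        if value is None or not value.strip():
            raise ValueError("missing <%s>" % name)
        entry[name] = value.strip()
    return entry


SAMPLE = """
<heuristic>
  <applicationtype>Mapreduce</applicationtype>
  <heuristicname>Rule Name</heuristicname>
  <classname>path.to.rule.class</classname>
  <viewname>path.to.rule.help.page</viewname>
</heuristic>
"""

if __name__ == "__main__":
    print(read_heuristic(SAMPLE))
```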


What else can you customize?

● Rules and their threshold levels

● Easily integrate with new schedulers (Azkaban, Airflow, Oozie, etc)

● Enable/disable and extend to new Fetchers

● Extend to newer application types and job types


Production Gatekeeper

Automated Production Reviews | JIRA Bot

• Cluster for critical workloads

• Audit before deployment


Workflow monitoring and reports

• Monitor performance on each execution

• Compare behaviour across revisions

• Cost to Serve analysis


Open Source, April 2016

github.com / linkedin / dr-elephant

Watchers: 60

Stars: 262

Forks: 109

Let’s collectively contribute!

Pull Requests: 60+

Contributors: 10+

User Topics: 50+

Dr. Elephant Community


Coming Soon


● Real time analysis of Jobs

● Analytics for Failed Jobs

● Visualizing Workflows through DAGs

● Support for other schedulers and frameworks

References

Engineering Blog: engineering.linkedin.com/blog/2016/04/dr-elephant-open-source-self-serve-performance-tuning-hadoop-spark

Open Source GitHub Link: github.com/linkedin/dr-elephant

Mailing List & Gitter: dr-elephant-users, linkedin/dr-elephant

Hadoop Summit 2015: https://www.youtube.com/watch?v=aL3OJ4YoxPA (Mark Wagner)


github.com / linkedin / dr-elephant

Thank You


Akshay Rai
https://in.linkedin.com/in/akshayrai09
