Self-Serve Performance Tuning for Hadoop & Spark
The Fifth Elephant 2016
Akshay Rai
Engineer, Hadoop Development Team, LinkedIn
Dr. Elephant
© 2016 LinkedIn Corporation. All Rights Reserved.
Hadoop @ LinkedIn c. 2008
● 1 cluster
● 20 nodes
● 10 users
● 10 workflows in production
● MapReduce, Pig
Hadoop @ LinkedIn c. 2016
● > 10 clusters
● > 10000 nodes
● > 1000 users
● Thousands of queries and flows in development
● Hundreds running in Production
● MapReduce, Pig, Hive, Spark, Scalding, Gobblin, Cubert
Scaling Hadoop Infrastructure
• Add extra machines to the cluster
• Hadoop is scalable, but scaling out alone is not optimal!
• We cannot keep adding machines forever
• Tune the existing resources and minimize the addition of new machines
Measuring performance
• Highlights hardware failures and poorly performing components
• Identifies scope for environment upgrades
Cluster Level Performance Tuning
Job Level Performance Tuning
How difficult is it to tune a Job?
• Production Gatekeeper: let a job go into production only after verifying that it is tuned.
• Restriction! More questions on how to tune! More resources spent helping people.
Here’s what we tried in order to achieve job tuning!
Challenges in tuning a job
• Hadoop is designed to let users tune their jobs BUT!
• One cannot optimize if one doesn’t understand the internals of the framework
• Critical information is scattered
• Hadoop has a huge set of parameters, and tuning some may affect others
You cannot tune what you do not know & you cannot improve what you cannot measure
Training Sessions
Training - Doesn’t Scale
• More people means more frequent sessions
• Hadoop experience varies from person to person
• Framework-specific training is needed: Pig, Hive, etc.
Expert Review
Expert Review - Also Doesn’t Work
• Again not scalable
• Cannot ensure the job is performing optimally; there is no easy comparison
• Different people, different perspectives, no consensus
• Error-prone: one might overlook certain aspects
Scaling Hadoop Infrastructure is HARD
Scaling User Productivity is much HARDER
Birth of Dr. Elephant
What does Dr. Elephant do?
• Helps every user get the best performance from their jobs
• Analyses and compares historical executions
• Provides a platform for other performance related tools
Architecture
Rule #1 : Mapper Data Skew
Mapper Skew Problem
• Varying size of splits can cause skewness in the Mapper Input
Solution to Mapper Skewness
• Each Mapper should process the same amount of data
• Combine the small chunks and feed them to a single Mapper
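The skew problem and its fix can be illustrated with a minimal sketch. This is not Dr. Elephant's actual heuristic: the function names, the max-to-median skew measure, and the greedy packing strategy are all assumptions chosen for illustration.

```python
from statistics import median

MB = 1024 * 1024


def mapper_skew_ratio(input_bytes_per_mapper):
    """Largest mapper input divided by the median mapper input (hypothetical metric)."""
    sizes = [b for b in input_bytes_per_mapper if b > 0]
    if not sizes:
        return 0.0
    return max(sizes) / median(sizes)


def combine_small_splits(split_sizes, target_bytes):
    """Greedily pack splits into groups whose total stays at or below target_bytes.

    A single split larger than target_bytes still gets a group of its own.
    """
    groups, current, current_size = [], [], 0
    for size in sorted(split_sizes, reverse=True):
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Made-up split sizes: two large splits and four small ones.
splits = [128 * MB, 4 * MB, 6 * MB, 120 * MB, 2 * MB, 8 * MB]
groups = combine_small_splits(splits, 128 * MB)
group_sizes = [sum(g) for g in groups]

print("skew before combining:", round(mapper_skew_ratio(splits), 2))
print("mapper inputs after combining (MB):", [s // MB for s in group_sizes])
print("skew after combining:", round(mapper_skew_ratio(group_sizes), 2))
```

In a real job the combining is done by the framework's input format rather than by user code; the sketch only shows why feeding several small chunks to one Mapper evens out the per-Mapper input.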
Rule #2 : Mapper Memory
Mapper Memory Problem & Solution
• Requested container memory >> memory the task actually consumes
• Example: the job requests a 4 GB container
• but actually uses only 512 MB
• It waits longer to get 4 GB and then blocks 4 GB of resources!
• Fix: request less container memory by setting
• mapreduce.map.memory.mb (or mapreduce.reduce.memory.mb)
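The 4 GB versus 512 MB example can be worked through in a few lines. This is a rough sketch, not a sizing recommendation: the 1.5x headroom factor and the 256 MB rounding step are hypothetical values picked for illustration, and the function names are not part of any Hadoop or Dr. Elephant API.

```python
import math


def memory_utilization(requested_mb, peak_used_mb):
    """Fraction of the requested container memory that the task actually used."""
    return peak_used_mb / requested_mb


def suggest_container_mb(peak_used_mb, headroom=1.5, step_mb=256):
    """Peak usage plus headroom, rounded up to a multiple of step_mb.

    Both headroom and step_mb are assumed values, not Hadoop defaults.
    """
    needed = peak_used_mb * headroom
    return math.ceil(needed / step_mb) * step_mb


requested_mb, peak_mb = 4096, 512  # the numbers from the slide
print(f"utilization: {memory_utilization(requested_mb, peak_mb):.1%}")
print(f"mapreduce.map.memory.mb={suggest_container_mb(peak_mb)}")
```

The suggested value would then be passed to the job as the `mapreduce.map.memory.mb` (or `mapreduce.reduce.memory.mb`) property.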
Search
MapReduce Report
Job History
How to define a rule?
How does a Rule work?
INPUT Counters & Task Data
LOGIC Some logic to compute a value
OUTPUT Compare value against threshold levels
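The INPUT / LOGIC / OUTPUT shape of a rule can be sketched as a single function. This is an illustration under stated assumptions, not Dr. Elephant's real rule interface (which is a Java class): the severity names, the threshold values, and the memory-waste metric are all chosen here for the example.

```python
from bisect import bisect_right

# Illustrative severity scale, ordered from least to most serious.
SEVERITIES = ["NONE", "LOW", "MODERATE", "SEVERE", "CRITICAL"]


def severity(value, limits):
    """Map a value to a severity; limits must be four ascending thresholds."""
    return SEVERITIES[bisect_right(limits, value)]


def memory_waste_rule(requested_mb, task_peak_mb, limits=(2, 4, 6, 8)):
    """Hypothetical rule: how many times more memory was requested than used."""
    # INPUT: counters and task data collected for the job
    if not task_peak_mb:
        return "NONE"
    # LOGIC: compute one value from the input
    average_peak = sum(task_peak_mb) / len(task_peak_mb)
    ratio = requested_mb / average_peak
    # OUTPUT: compare the value against the threshold levels
    return severity(ratio, limits)


print(memory_waste_rule(4096, [500, 512, 524]))
print(memory_waste_rule(1024, [900, 950]))
```

Keeping the thresholds as a parameter mirrors the idea that threshold levels are configurable per rule.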
Customising Dr. Elephant
Adding a Custom Rule
1. Create a new Rule and test it.
2. Create a help page defining the rule, parameters to tune etc.
3. Add the details of the Rule in the HeuristicConf.xml file:
   <heuristic>
     <applicationtype>Mapreduce</applicationtype>
     <heuristicname>Rule Name</heuristicname>
     <classname>path.to.rule.class</classname>
     <viewname>path.to.rule.help.page</viewname>
   </heuristic>
4. Run Dr. Elephant. It should now include the new rules.
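Before restarting Dr. Elephant it can help to check that a new entry carries all four fields listed in step 3. The following is a small standalone sketch, not a tool that ships with Dr. Elephant; the `<heuristics>` wrapper element in the sample and the `missing_fields` helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The four child elements shown in the slide's <heuristic> entry.
REQUIRED = ("applicationtype", "heuristicname", "classname", "viewname")

# Sample content; the <heuristics> root element is an assumption.
SAMPLE = """
<heuristics>
  <heuristic>
    <applicationtype>Mapreduce</applicationtype>
    <heuristicname>Rule Name</heuristicname>
    <classname>path.to.rule.class</classname>
    <viewname>path.to.rule.help.page</viewname>
  </heuristic>
</heuristics>
"""


def missing_fields(xml_text):
    """Return {entry index: [missing or empty tags]} for each incomplete entry."""
    root = ET.fromstring(xml_text)
    problems = {}
    for index, entry in enumerate(root.iter("heuristic")):
        missing = [tag for tag in REQUIRED
                   if not (entry.findtext(tag) or "").strip()]
        if missing:
            problems[index] = missing
    return problems


print(missing_fields(SAMPLE) or "all entries complete")
```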
What else can you customize?
● Rules and their threshold levels
● Easily integrate with new schedulers (Azkaban, Airflow, Oozie, etc.)
● Enable or disable Fetchers and add new ones
● Extend to newer application types and job types
Production Gatekeeper
Automated Production Reviews | JIRA Bot
• Cluster for critical workloads
• Audit before deployment
Workflow monitoring and reports
• Monitor performance on each execution
• Compare behaviour across revisions
• Cost to Serve analysis
Open Source, April 2016
github.com/linkedin/dr-elephant
Watchers: 60
Stars: 262
Forks: 109
Let’s collectively contribute!
Dr. Elephant Community
Pull Requests: 60+
Contributors: 10+
User Topics: 50+
Coming Soon
● Real time analysis of Jobs
● Analytics for Failed Jobs
● Visualizing Workflows through DAGs
● Support for other schedulers and frameworks
References
Engineering Blog: engineering.linkedin.com/blog/2016/04/dr-elephant-open-source-self-serve-performance-tuning-hadoop-spark
Open Source GitHub Link: github.com/linkedin/dr-elephant
Mailing List & Gitter: dr-elephant-users, linkedin/dr-elephant
Hadoop Summit 2015: https://www.youtube.com/watch?v=aL3OJ4YoxPA (Mark Wagner)
github.com/linkedin/dr-elephant
Thank You
Akshay Rai
https://in.linkedin.com/in/akshayrai09