Hadoop Administration
TRANSCRIPT
Hadoop Administration
Issues and Fixes
InMobi Cluster
HDP Distribution - 2.2
450 nodes
5PB data storage
60-75K applications run every day
9000 Cores
14 TB of Total RAM
Our Hadoop Ecosystem
Falcon
Oozie
Scribe/Databus
Kafka
HttpFS services
ZooKeeper
Aerospike
Hardware Configurations
NameNode
64GB of RAM / 24 core cpu
DataNode
64GB of RAM /32 core/ 6*2TB disks
128GB of RAM/40 core / 10*2TB disks
OS
Ubuntu 12.04
Configuration Management
Puppet
system package pushes
system configuration management
hadoop package pushes
hadoop configuration management using templates
Hadoop Configurations
1.5GB of RAM and 1 core per container
Max container limit is 20GB - to enable Spark jobs
Max vcore allocation is 8 CPUs
Using Capacity Scheduler
Node Manager recovery is enabled
CGroups for CPU isolation
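The container and scheduler settings above map to a handful of standard yarn-site.xml properties; a sketch using the values from this slide (the property names are stock YARN, the values are ours):

```xml
<!-- yarn-site.xml (sketch): container sizing and scheduler settings -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1536</value> <!-- 1.5GB of RAM per container -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>20480</value> <!-- 20GB max container, to enable Spark jobs -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>8</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
```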
JVM Configuration
NameNode - 30GB
NodeManager - 2GB
DataNode - 3GB
GC - Concurrent Mark Sweep collector (CMS)
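These heap and GC settings are typically applied through hadoop-env.sh and yarn-env.sh; a config sketch assuming the sizes above (any flags beyond the heap sizes and the CMS choice are illustrative):

```shell
# hadoop-env.sh (sketch): heap sizes from this slide, CMS collector
export HADOOP_NAMENODE_OPTS="-Xms30g -Xmx30g -XX:+UseConcMarkSweepGC ${HADOOP_NAMENODE_OPTS}"
export HADOOP_DATANODE_OPTS="-Xmx3g ${HADOOP_DATANODE_OPTS}"

# yarn-env.sh (sketch): NodeManager heap in MB
export YARN_NODEMANAGER_HEAPSIZE=2048
```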
Alerts and Trends
Using Nagios for alerting.
Monitor only master node services, e.g. NameNode, ResourceManager, HistoryServer.
DataNodes/NodeManagers are monitored based on availability (say, 90% of the nodes are up).
Alerts are configured to be sent to PagerDuty.
Alerts and Trends [contd]
Using Graphite for metrics.
JMX metrics
Use jmxtrans and custom scripts to send metrics to Graphite.
Cluster-wide alerts on top of Graphite data.
Issues and Challenges - Hadoop
Tasks hogging up all allocated CPU vcores.
Problem
We have a couple of jobs that are very CPU intensive. While these jobs are running, other tasks starve for resources, which leads to SLA misses.
Reason
With the DefaultContainerExecutor class, there is no limit on the CPU a single task can use. Other tasks running on the same node starve for CPU, which can lead to long-running jobs.
Issues and Challenges - Hadoop [contd]
Solution
We mitigated this problem by enforcing strict CPU limits with cgroups and the LinuxContainerExecutor class.
Install the libcgroup1 package.
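A sketch of the yarn-site.xml properties involved; the property names are standard YARN, while the cgroup hierarchy and executor group values are assumptions:

```xml
<!-- yarn-site.xml (sketch): LinuxContainerExecutor with cgroups CPU limits -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
  <value>/yarn</value> <!-- assumed cgroup path -->
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage</name>
  <value>true</value> <!-- cap each container at its allocated vcores -->
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.group</name>
  <value>hadoop</value> <!-- assumed group -->
</property>
```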
Issues and Challenges - Hadoop [contd]
Job priority
Problem
YARN does not honor MapReduce job priority (mapreduce.job.priority: VERY_LOW, LOW, HIGH, VERY_HIGH).
Reason
Priorities across applications within the same queue are not implemented yet.
Issues and Challenges - Hadoop [contd]
Solution
Implemented hierarchical queues within a queue with different thresholds.
Allocate different proportions to the sub-queues, say 30% to report, 20% to report_low, and 30% to report_high.
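In capacity-scheduler.xml this looks roughly as follows. The parent queue name is a hypothetical placeholder, and since the Capacity Scheduler requires sibling capacities under one parent to sum to 100%, the sketch uses a 50/20/30 split rather than the 30/20/30 quoted in the talk:

```xml
<!-- capacity-scheduler.xml (sketch): hierarchical queues with different shares -->
<property>
  <name>yarn.scheduler.capacity.root.reports.queues</name>
  <value>report,report_low,report_high</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.reports.report.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.reports.report_low.capacity</name>
  <value>20</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.reports.report_high.capacity</name>
  <value>30</value>
</property>
```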
Issues and Challenges - Hadoop [contd]
NameNode HA failover with ssh fencing
Problem
The active NameNode (Namenode1) server went down physically. We expected the standby to become active, but that didn't happen.
Reason
We are using NameNode HA with ssh fencing. zkfc (ZooKeeper Failover Controller) does ssh to the active NameNode and makes sure the NameNode process is killed, to avoid any possible chance of split-brain. Since the active NameNode was physically down, the ssh fencing always returned a non-zero status, so the failover never completed.
Issues and Challenges - Hadoop [contd]
Solution
1. We added a dummy host in place of Namenode1 in hdfs-site.xml. The zkfc running on the standby could then fence successfully.
2. We can use the shell fencing method to return exit status '0'.
3. You can also use custom scripts with the shell fencing method to handle this situation.
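Options 2 and 3 map to the dfs.ha.fencing.methods property in hdfs-site.xml; methods are tried in order, and shell(/bin/true) always returns exit status 0, so failover can proceed when sshfence cannot reach a dead host. A sketch (the key path is an assumption):

```xml
<!-- hdfs-site.xml (sketch): ssh fencing with a shell fallback -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence
shell(/bin/true)</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hdfs/.ssh/id_rsa</value> <!-- assumed key path -->
</property>
```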
Issues and Challenges - Hadoop [contd]
Tasks running slow
Problem
Jobs are running extremely slow.
Reason
There can be system issues as well as application issues that cause this problem.
1. Misconfigured NodeManager
The actual CPU count available on a NodeManager is 12, but yarn.nodemanager.resource.cpu-vcores is set to 24.
2. Nodes were configured with 100Mb/s network speed.
3. Bad disks
Solution
a. Enabling speculative execution
Cons
Enabling speculative execution can lead to wasted resources. We have patched speculative execution so that it spawns duplicate tasks only in cases of absolute necessity.
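Stock speculative execution is switched on per task type in mapred-site.xml (our patch restricting it to cases of absolute necessity is internal and not shown here):

```xml
<!-- mapred-site.xml (sketch): enable stock speculative execution -->
<property>
  <name>mapreduce.map.speculative</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>true</value>
</property>
```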
Issues and Challenges - Hadoop [contd]
b. Excluding bad disks from the cluster
We exclude bad disks from a node automatically using Puppet.
c. Excluding bad nodes from the cluster
We use a custom health check to blacklist a node if a certain number of applications fail on it within a certain amount of time.
i. Separate the NodeManager audit log into a different file using log4j.
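A log4j.properties sketch for this separation; NMAuditLogger is the stock NodeManager audit logger class, while the appender name, file path, and rotation sizes are assumptions:

```properties
# log4j.properties (sketch): route NodeManager audit events to their own file
log4j.logger.org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger=INFO,NMAUDIT
log4j.additivity.org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger=false
log4j.appender.NMAUDIT=org.apache.log4j.RollingFileAppender
log4j.appender.NMAUDIT.File=${hadoop.log.dir}/nm-audit.log
log4j.appender.NMAUDIT.MaxFileSize=256MB
log4j.appender.NMAUDIT.MaxBackupIndex=20
log4j.appender.NMAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.NMAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n
```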
Issues and Challenges - Hadoop [contd]
Then use a custom script to count the failures in the NodeManager audit log and use that count to blacklist the node.
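A minimal sketch of such a script, wired in via yarn.nodemanager.health-checker.script.path; YARN marks the node unhealthy when the script prints a line starting with ERROR. The log path, failure pattern, and threshold here are assumptions, not the script from the talk:

```shell
#!/bin/sh
# Sketch of a NodeManager health-check script (path/pattern/threshold assumed).
# YARN blacklists the node when the script's output starts with "ERROR".

AUDIT_LOG="${AUDIT_LOG:-/var/log/hadoop-yarn/nm-audit.log}"
THRESHOLD="${THRESHOLD:-10}"

# Count audit-log lines that record a failed container in the given file.
count_failures() {
    grep -c 'RESULT=FAILURE' "$1" 2>/dev/null
}

# Print ERROR (unhealthy) or OK (healthy) for a log file and a threshold.
check_health() {
    failures=$(count_failures "$1")
    if [ "${failures:-0}" -gt "$2" ]; then
        echo "ERROR too many container failures: $failures"
    else
        echo "OK"
    fi
}

check_health "$AUDIT_LOG" "$THRESHOLD"
```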
Issues and Challenges - Hadoop [contd]
d. Machines with bad hardware configurations
Solution
Use a script to validate each node.
Issues and Challenges - Hadoop [contd]
History Server log size
We have seen AM logs of size 1GB.
Controlled AM log size with yarn.app.mapreduce.am.container.log.limit.kb.
Log aggregation is enabled and logs are kept in HDFS for 3 days.
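The relevant properties: the AM log cap goes in mapred-site.xml and log aggregation in yarn-site.xml. The 3-day retention comes from this slide; the cap value is an assumption since the talk does not state one:

```xml
<!-- mapred-site.xml (sketch): cap AM container log size -->
<property>
  <name>yarn.app.mapreduce.am.container.log.limit.kb</name>
  <value>102400</value> <!-- 100MB; assumed value, not stated in the talk -->
</property>

<!-- yarn-site.xml: aggregate logs to HDFS and keep them for 3 days -->
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>259200</value> <!-- 3 days -->
</property>
```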
Q&A
We are hiring rockstars like you. If interested, please drop a note at