Hadoop Administration
TRANSCRIPT
Hadoop Administration
Issues and Fixes
InMobi Cluster
HDP Distribution - 2.2
450 nodes
5PB data storage
60-75K applications run every day
9000 Cores
14 TB of Total RAM
Our Hadoop Ecosystem
Falcon
Oozie
Scribe/Databus
Kafka
HttpFS services
ZooKeeper
Aerospike
Hardware Configurations
NameNode
64GB of RAM / 24 core cpu
DataNode
64GB of RAM /32 core/ 6*2TB disks
128GB of RAM/40 core / 10*2TB disks
OS
Ubuntu 12.04
Configuration Management
Puppet
system package pushes
system configuration management
hadoop package pushes
hadoop configuration management using templates
Hadoop Configurations
1.5GB of RAM and 1 core per container
Max container limit is 20GB - to enable Spark jobs
Max vcore allocation is 8 CPUs
Using Capacity Scheduler
Node Manager recovery is enabled
CGroups for CPU isolation
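The container and scheduler settings above map to a handful of standard yarn-site.xml properties; a sketch using the values from this slide (the property names are stock YARN, the values are ours):

```xml
<!-- yarn-site.xml (sketch): container sizing and scheduler settings -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1536</value> <!-- 1.5GB of RAM per container -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>20480</value> <!-- 20GB max container, to enable Spark jobs -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>8</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
```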
JVM Configuration
NameNode - 30GB
NodeManager - 2GB
DataNode - 3GB
GC - Concurrent Mark Sweep collector (CMS)
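These heap and GC settings are typically applied through hadoop-env.sh and yarn-env.sh; a config sketch assuming the sizes above (any flags beyond the heap sizes and the CMS choice are illustrative):

```shell
# hadoop-env.sh (sketch): heap sizes from this slide, CMS collector
export HADOOP_NAMENODE_OPTS="-Xms30g -Xmx30g -XX:+UseConcMarkSweepGC ${HADOOP_NAMENODE_OPTS}"
export HADOOP_DATANODE_OPTS="-Xmx3g ${HADOOP_DATANODE_OPTS}"

# yarn-env.sh (sketch): NodeManager heap in MB
export YARN_NODEMANAGER_HEAPSIZE=2048
```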
Alerts and Trends
Using Nagios for alerting.
Monitor only master node services, e.g. NameNode, ResourceManager, HistoryServer.
DataNodes/NodeManagers are monitored based on availability (say, 90% of the nodes are up).
Alerts are configured to be sent to PagerDuty.
Alerts and Trends [contd]
Using Graphite for metrics.
JMX metrics
Use jmxtrans and custom scripts to send metrics to Graphite.
Cluster-wide alerts on top of Graphite data.
Issues and Challenges - Hadoop
Tasks hogging up all allocated CPU vcores.
Problem
We have a couple of jobs that are very CPU intensive. While these jobs are running, other tasks starve for resources, which leads to SLA misses.
Reason
With the DefaultContainerExecutor class, there is no limit on the CPU a single task can use. Other tasks running on the same node starve for CPU, which can lead to long-running jobs.
Issues and Challenges - Hadoop [contd]
Solution
We mitigated this problem by enforcing strict CPU limits with cgroups and the LinuxContainerExecutor class.
Install the libcgroup1 package.
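A sketch of the yarn-site.xml properties involved; the property names are standard YARN, while the cgroup hierarchy and executor group values are assumptions:

```xml
<!-- yarn-site.xml (sketch): LinuxContainerExecutor with cgroups CPU limits -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
  <value>/yarn</value> <!-- assumed cgroup path -->
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage</name>
  <value>true</value> <!-- cap each container at its allocated vcores -->
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.group</name>
  <value>hadoop</value> <!-- assumed group -->
</property>
```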
Issues and Challenges - Hadoop [contd]
Job priority
Problem
YARN does not honor MapReduce job priority (mapreduce.job.priority: VERY_LOW, LOW, HIGH, VERY_HIGH).
Reason
Priorities across applications within the same queue are not implemented yet.
Issues and Challenges - Hadoop [contd]
Solution
Implemented hierarchical queues within a queue with different thresholds.
Allocate different proportions to the sub-queues, say 30% to report, 20% to report_low, and 30% to report_high.
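In capacity-scheduler.xml this looks roughly as follows. The parent queue name is a hypothetical placeholder, and since the Capacity Scheduler requires sibling capacities under one parent to sum to 100%, the sketch uses a 50/20/30 split rather than the 30/20/30 quoted in the talk:

```xml
<!-- capacity-scheduler.xml (sketch): hierarchical queues with different shares -->
<property>
  <name>yarn.scheduler.capacity.root.reports.queues</name>
  <value>report,report_low,report_high</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.reports.report.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.reports.report_low.capacity</name>
  <value>20</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.reports.report_high.capacity</name>
  <value>30</value>
</property>
```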
Issues and Challenges - Hadoop [contd]
NameNode HA failover with ssh fencing
Problem
The active NameNode (Namenode1) server went down physically. We expected the standby to become active, but that didn't happen.
Reason
We are using NameNode HA with ssh fencing. zkfc (ZooKeeper Failover Controller) does ssh to the active NameNode and makes sure the NameNode process is killed, to avoid any possible chance of split-brain. Since the active NameNode was physically down, the ssh fencing always returned a non-zero status, so the failover never completed.
Issues and Challenges - Hadoop [contd]
Solution
1. We added a dummy host in place of Namenode1 in hdfs-site.xml. The zkfc running on the standby could then fence successfully.
2. We can use the shell fencing method to return exit status '0'.
3. You can also use custom scripts with the shell fencing method to handle this situation.
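Options 2 and 3 map to the dfs.ha.fencing.methods property in hdfs-site.xml; methods are tried in order, and shell(/bin/true) always returns exit status 0, so failover can proceed when sshfence cannot reach a dead host. A sketch (the key path is an assumption):

```xml
<!-- hdfs-site.xml (sketch): ssh fencing with a shell fallback -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence
shell(/bin/true)</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hdfs/.ssh/id_rsa</value> <!-- assumed key path -->
</property>
```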
Issues and Challenges - Hadoop [contd]
Tasks running slow
Problem
Jobs are running extremely slow.
Reason
There can be system issues as well as application issues that cause this problem.
1. Misconfigured NodeManager
The actual CPU count available on a NodeManager is 12, but yarn.nodemanager.resource.cpu-vcores is set to 24.
2. Nodes were configured with 100Mb/s network speed.
3. Bad disks
Solution
a. Enabling speculative execution
Cons
Enabling speculative execution can lead to wasted resources. We have patched speculative execution so that it spawns duplicate tasks only in cases of absolute necessity.
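Stock speculative execution is switched on per task type in mapred-site.xml (our patch restricting it to cases of absolute necessity is internal and not shown here):

```xml
<!-- mapred-site.xml (sketch): enable stock speculative execution -->
<property>
  <name>mapreduce.map.speculative</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>true</value>
</property>
```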
Issues and Challenges - Hadoop [contd]
b. Excluding bad disks from the cluster
We exclude bad disks from a node automatically using Puppet.
c. Excluding bad nodes from the cluster
We use a custom health check to blacklist a node if a certain number of applications fail on it within a certain amount of time.
i. Separate the NodeManager audit log into a different file using log4j.
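A log4j.properties sketch for this separation; NMAuditLogger is the stock NodeManager audit logger class, while the appender name, file path, and rotation sizes are assumptions:

```properties
# log4j.properties (sketch): route NodeManager audit events to their own file
log4j.logger.org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger=INFO,NMAUDIT
log4j.additivity.org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger=false
log4j.appender.NMAUDIT=org.apache.log4j.RollingFileAppender
log4j.appender.NMAUDIT.File=${hadoop.log.dir}/nm-audit.log
log4j.appender.NMAUDIT.MaxFileSize=256MB
log4j.appender.NMAUDIT.MaxBackupIndex=20
log4j.appender.NMAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.NMAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n
```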
Issues and Challenges - Hadoop [contd]
Then use a custom script to count the failures in the NodeManager audit log and use that count to blacklist the node.
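A minimal sketch of such a script, wired in via yarn.nodemanager.health-checker.script.path; YARN marks the node unhealthy when the script prints a line starting with ERROR. The log path, failure pattern, and threshold here are assumptions, not the script from the talk:

```shell
#!/bin/sh
# Sketch of a NodeManager health-check script (path/pattern/threshold assumed).
# YARN blacklists the node when the script's output starts with "ERROR".

AUDIT_LOG="${AUDIT_LOG:-/var/log/hadoop-yarn/nm-audit.log}"
THRESHOLD="${THRESHOLD:-10}"

# Count audit-log lines that record a failed container in the given file.
count_failures() {
    grep -c 'RESULT=FAILURE' "$1" 2>/dev/null
}

# Print ERROR (unhealthy) or OK (healthy) for a log file and a threshold.
check_health() {
    failures=$(count_failures "$1")
    if [ "${failures:-0}" -gt "$2" ]; then
        echo "ERROR too many container failures: $failures"
    else
        echo "OK"
    fi
}

check_health "$AUDIT_LOG" "$THRESHOLD"
```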
Issues and Challenges - Hadoop [contd]
d. Machines with bad hardware configurations
Solution
Use a script to validate each node.
Issues and Challenges - Hadoop [contd]
History Server log size
We have seen AM logs of size 1GB.
Controlled AM log size with yarn.app.mapreduce.am.container.log.limit.kb.
Log aggregation is enabled and logs are kept in HDFS for 3 days.
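The relevant properties: the AM log cap goes in mapred-site.xml and log aggregation in yarn-site.xml. The 3-day retention comes from this slide; the cap value is an assumption since the talk does not state one:

```xml
<!-- mapred-site.xml (sketch): cap AM container log size -->
<property>
  <name>yarn.app.mapreduce.am.container.log.limit.kb</name>
  <value>102400</value> <!-- 100MB; assumed value, not stated in the talk -->
</property>

<!-- yarn-site.xml: aggregate logs to HDFS and keep them for 3 days -->
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>259200</value> <!-- 3 days -->
</property>
```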
Q&A
We are hiring rockstars like you. If interested, please drop a note at