improving hadoop cluster performance via linux configuration

Improving Hadoop Cluster Performance via Linux Configuration
DevIgnition 2014, Dulles, Virginia

Alex Moundalexis // @technmsg

© Cloudera, Inc. All rights reserved.

Tips from a former system administrator



CC BY 2.0 / Richard Bumgardner

Been there, done that.


Tips from a former system administrator turned field guy



CC BY 2.0 / Alex Moundalexis

Home sweet home.


Tips
Easy steps to take… that most people don’t.


What this talk isn’t about

• Deploying
  • Puppet, Chef, Ansible, homegrown scripts, intern labor

• Sizing & Tuning
  • Depends heavily on data and workload

• Coding
  • Unless you count STDOUT redirection

• Algorithms
  • I suck at math, but we’ll try some multiplication later


“The answer to most Hadoop questions is… it depends.” (helpful, right?)


So what ARE we talking about?

• Seven simple things
  • Quick
  • Safe
  • Viable for most environments and use cases

• Identify issue, then offer solution
  • Note: Commands run as root or sudo


1. Swapping
Bad news, best not to.


Swapping

• A form of memory management
• When OS runs low on memory…
  • write blocks to disk
  • use now-free memory for other things
  • read blocks back into memory from disk when needed

• Also known as paging


Swapping

• Problem: Disks are slow, especially to seek
• Hadoop is about maximizing IO
  • spend less time acquiring data
  • operate on data in place
  • large streaming reads/writes from disk

• Memory usage is somewhat limited within JVM
  • we should be able to manage our memory
  • account for JVM overhead


Limit swapping in kernel

• Well, as much as possible.
• Immediate:
  # echo 1 > /proc/sys/vm/swappiness

• Persist after reboot:
  # echo "vm.swappiness = 1" >> /etc/sysctl.conf
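
Appending with >> adds a new line every time the command is re-run. As a rough sketch (the function name persist_sysctl is made up for illustration, and the optional file argument exists only so it can be tried on a scratch copy), the persistent setting could be written so that exactly one entry survives:

```shell
# persist_sysctl KEY VALUE [FILE]   (hypothetical helper, not a standard tool)
# Ensure FILE holds exactly one "KEY = VALUE" line, replacing older entries.
# FILE defaults to /etc/sysctl.conf, which needs root to modify.
persist_sysctl() {
    key=$1
    value=$2
    file=${3:-/etc/sysctl.conf}
    tmp=$(mktemp) || return 1
    touch "$file" || return 1
    # Copy every line except existing assignments to KEY (exact name match).
    awk -v k="$key" '{
        name = $0
        sub(/=.*/, "", name)        # keep the text left of the first "="
        gsub(/[ \t]/, "", name)     # strip blanks around the name
        if (name != k) print
    }' "$file" > "$tmp"
    printf '%s = %s\n' "$key" "$value" >> "$tmp"
    cat "$tmp" > "$file"            # overwrite in place, keeping permissions
    rm -f "$tmp"
}
```

Run as root, `persist_sysctl vm.swappiness 1` followed by `sysctl -p` should leave a single entry in place and load it without a reboot.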


Swapping peculiarities

• Behavior of vm.swappiness = 0 varies based on Linux kernel
  • CentOS 6.4+ / Ubuntu 10.10+
  • For you kernel gurus, that’s Linux 2.6.32-303+

• Prior
  • We don’t swap, except to avoid OOM condition.

• After
  • We don’t swap, ever.

• Details: http://tiny.cloudera.com/noswap


2. File Access Time
Disable this too.


File access time

• Linux tracks access time
  • writes to disk even if all you did was read

• Problem
  • more disk seeks
  • HDFS is write-once, read-many
  • NameNode tracks access information for HDFS


Don’t track access time

• Mount volumes with noatime option
  • In /etc/fstab:
    /dev/sdc /data01 ext3 defaults,noatime 0

• Note: noatime implies nodiratime as well
• What about relatime?
  • Faster than atime but slower than noatime

• No reboot required
  • # mount -o remount /data01
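
After the remount it is worth confirming that the option actually took effect. A minimal sketch, assuming the usual /proc/mounts layout (device, mount point, type, options); the function name mount_has_option is hypothetical:

```shell
# mount_has_option MOUNTPOINT OPTION [MOUNTS_FILE]   (hypothetical helper)
# Succeeds if MOUNTPOINT is listed in MOUNTS_FILE with OPTION set.
# MOUNTS_FILE defaults to /proc/mounts; the third argument is only there
# so the check can be tried against a saved copy.
mount_has_option() {
    awk -v mp="$1" -v opt="$2" '
        $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == opt) found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "${3:-/proc/mounts}"
}
```

For example, `mount_has_option /data01 noatime || echo "/data01 still tracks atime"` could be looped over every data mount.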


3. Root Reserved Space
Reclaim it, impress your bosses!


Root reserved space

• EXT3/4 reserve 5% of disk for root-owned files
  • On an OS disk, sure
  • System logs, kernel panics, etc



CC BY 2.0 / Alex Moundalexis

Disks used to be much smaller, right?


Do the math

• Conservative
  • 5% of 1 TB disk = 46 GB
  • 5 data disks per server = 230 GB
  • 5 servers per rack = 1.15 TB

• Quasi-Aggressive
  • 5% of 4 TB disk = 186 GB
  • 12 data disks per server = 2.23 TB
  • 18 servers per rack = 40.1 TB

• That’s a LOT of unused storage!
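
The multiplication above can be redone for any rack layout. This is only a sketch: it assumes a vendor terabyte of 10^12 bytes reported in 2^30-byte units (which is how 5% of 1 TB comes out as 46 GB), rounds down per disk, and uses a made-up function name, reserved_gb.

```shell
# reserved_gb SIZE_TB DISKS_PER_SERVER SERVERS_PER_RACK   (hypothetical helper)
# Print the space, in GB, held back by a 5% root reserve across one rack.
# Needs a shell with 64-bit arithmetic (bash and dash both qualify).
reserved_gb() {
    bytes=$(( $1 * 1000000000000 ))                 # vendor TB = 10^12 bytes
    per_disk=$(( bytes * 5 / 100 / 1073741824 ))    # 5%, in 2^30-byte GB
    echo $(( per_disk * $2 * $3 ))
}
```

`reserved_gb 1 5 5` prints 1150 and `reserved_gb 4 12 18` prints 40176, matching the 1.15 TB and 40.1 TB figures above.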


Root reserved space

• On a Hadoop data disk, no root-owned files
• When creating a partition
  # mkfs.ext3 -m 0 /dev/sdc

• On existing partitions
  # tune2fs -m 0 /dev/sdc
  • 0 is safe, 1 is for the ultra-paranoid
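
To see how much a disk currently holds back, the block counts printed by `tune2fs -l` can be compared. A small sketch, assuming the "Block count:" and "Reserved block count:" lines that tune2fs normally prints; the function name reserved_percent is hypothetical:

```shell
# reserved_percent   (hypothetical helper)
# Read `tune2fs -l` output on stdin and print the reserved block count
# as a whole-number percentage of all blocks (rounded down).
reserved_percent() {
    awk -F: '
        $1 == "Reserved block count" { reserved = $2 + 0 }
        $1 == "Block count"          { total = $2 + 0 }
        END {
            if (total == 0) exit 1      # no block count seen, nothing to report
            printf "%d\n", reserved * 100 / total
        }
    '
}
```

Usage would look like `tune2fs -l /dev/sdc | reserved_percent`, which should print 0 once the reserve has been removed.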


4. Name Service Cache
Turn it on, already!


Name Service Cache Daemon

• Daemon that caches name service requests
  • Passwords
  • Groups
  • Hosts

• Helps weather network hiccups
• Helps more with high latency LDAP, NIS, NIS+
• Small footprint
• Zero configuration required



• Hadoop nodes
  • largely a network-based application
  • on the network constantly
  • issue lots of name lookups, especially HBase & distcp
  • can thrash name servers

• Reducing latency of service requests? Smart.
• Reducing impact on shared infrastructure? Smart.



• Turn it on, let it work, leave it alone:
  # chkconfig --level 345 nscd on
  # service nscd start

• Check on it later:
  # nscd -g

• Unless using Red Hat SSSD; modify nscd config first!
  • Don’t use nscd to cache passwd, group, or netgroup
  • Red Hat, Using NSCD with SSSD. http://goo.gl/68HTMQ
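
For the SSSD case, the change comes down to flipping a few enable-cache lines in the nscd configuration. The following is a sketch rather than Red Hat's documented procedure: the function name nscd_set_cache is made up, and it assumes the usual "enable-cache SERVICE yes|no" line format of nscd.conf.

```shell
# nscd_set_cache FILE SERVICE yes|no   (hypothetical helper)
# Rewrite the active "enable-cache SERVICE ..." line in FILE,
# or append one if the file has none. Commented lines are left alone.
nscd_set_cache() {
    file=$1
    service=$2
    state=$3
    tmp=$(mktemp) || return 1
    awk -v s="$service" -v st="$state" '
        $1 == "enable-cache" && $2 == s {
            print "\tenable-cache\t\t" s "\t\t" st
            done = 1
            next
        }
        { print }
        END { if (!done) print "\tenable-cache\t\t" s "\t\t" st }
    ' "$file" > "$tmp"
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

As root, something like `for s in passwd group netgroup; do nscd_set_cache /etc/nscd.conf "$s" no; done` followed by `service nscd restart` would leave only host caching to nscd.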


5. File Handle Limits
Not a problem, until they are.


File handle limits

• Kernel refers to files via a handle
  • Also called descriptors

• Linux is a multi-user system
• File handles protect the system from
  • Poor coding
  • Malicious users
  • Poor coding of malicious users
  • Pictures of cats on the Internet

Microsoft Office EULA. Really.

java.io.FileNotFoundException: (Too many open files)


File handle limits

• Linux defaults usually not enough
• Increase maximum open files (default 1024)

# echo hdfs - nofile 32768 >> /etc/security/limits.conf
# echo mapred - nofile 32768 >> /etc/security/limits.conf
# echo hbase - nofile 32768 >> /etc/security/limits.conf

• Bonus: Increase maximum processes too

# echo hdfs - nproc 32768 >> /etc/security/limits.conf
# echo mapred - nproc 32768 >> /etc/security/limits.conf
# echo hbase - nproc 32768 >> /etc/security/limits.conf

• Note: Cloudera Manager will do this for you.
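
If Cloudera Manager is not in the picture, the six echo commands can be collapsed into a loop. A minimal sketch, using the three account names and the 32768 value from the slide; the function name hadoop_limits is hypothetical:

```shell
# hadoop_limits [VALUE]   (hypothetical helper)
# Print limits.conf entries raising nofile and nproc for the Hadoop
# service accounts. VALUE defaults to 32768.
hadoop_limits() {
    value=${1:-32768}
    for user in hdfs mapred hbase; do
        for item in nofile nproc; do
            printf '%s - %s %s\n' "$user" "$item" "$value"
        done
    done
}
```

The function only writes to STDOUT, so reviewing the output first and then running `hadoop_limits >> /etc/security/limits.conf` as root is the redirection step. A fresh login session for each account should then report the new value from `ulimit -n`.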


6. Dedicated Disks
Don’t be tempted to share, even with monster disks.


The Situation

1. Your new server has a dozen 1 TB disks
2. Eleven disks are used to store data
3. One disk is used for the OS
   • 20 GB for the OS
   • 980 GB sits unused
4. Someone asks “can we store data there too?”
5. Seems reasonable, lots of space… “OK, why not.”

Sound familiar?

Microsoft Office EULA. Really.

“I don’t understand it, there’s no consistency to these run times!”


No love for shared disk

• Our quest for data gets interrupted a lot:
  • OS operations
  • OS logs
  • Hadoop logging, quite chatty
  • Hadoop execution
  • userspace execution

• Disk seeks are slow, remember?


Dedicated disk for OS and logs

• At install time
  • Disk 0, OS & logs
  • Disk 1-n, Hadoop data

• After install, more complicated effort, requires manual HDFS block rebalancing:
  1. Take down HDFS
     • If you can do it in under 10 minutes, just the DataNode
  2. Move or distribute blocks from disk0/dir to disk[1-n]/dir
  3. Remove dir from HDFS config (dfs.data.dir)
  4. Start HDFS


7. Name Resolution
Sane, both forward and reverse.


Name resolution options

1. Hosts file, if you must
2. DNS, much preferred


Name resolution with hosts file

• Set canonical names properly

• Right
  10.1.1.1 r01m01.cluster.org r01m01 master1
  10.1.1.2 r01w01.cluster.org r01w01 worker1

• Wrong
  10.1.1.1 r01m01 r01m01.cluster.org master1
  10.1.1.2 r01w01 r01w01.cluster.org worker1


Name resolution with hosts file

• Set loopback address properly
  • Ensure 127.0.0.1 resolves to “localhost,” NOT hostname

• Right
  127.0.0.1 localhost

• Wrong
  127.0.0.1 r01m01
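
Both hosts file rules (fully qualified name first, loopback mapped to localhost) are easy to check mechanically. This is a rough sketch with a hypothetical function name, check_hosts; it treats any first name without a dot as unqualified and skips IPv6 entries entirely, which is an assumption made to keep it short.

```shell
# check_hosts [FILE]   (hypothetical helper)
# Print one line per problem found in a hosts file; print nothing if clean.
# FILE defaults to /etc/hosts.
check_hosts() {
    awk '
        /^[ \t]*#/ || NF < 2 { next }    # skip comments and blank lines
        $1 ~ /:/ { next }                # skip IPv6 entries (not checked here)
        $1 == "127.0.0.1" {
            if ($2 !~ /^localhost/)
                print "loopback maps to " $2 ", expected localhost"
            next
        }
        $2 !~ /\./ {
            print $1 ": first name " $2 " is not fully qualified"
        }
    ' "${1:-/etc/hosts}"
}
```

Run against the "Right" examples it prints nothing; against the "Wrong" examples it prints one complaint per offending line.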


Name resolution with DNS

• Forward
• Reverse

• Hostname should match the FQDN in DNS
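
One way to test the forward and reverse match on a node is a round trip through `getent hosts`. Treat this as a sketch: the function names are hypothetical, and because getent follows nsswitch.conf it exercises the hosts file and DNS together rather than DNS alone.

```shell
# canonical_name   (hypothetical helper)
# Read one `getent hosts` result line on stdin and print its second
# field, the canonical name.
canonical_name() {
    awk 'NR == 1 { print $2 }'
}

# check_resolution [NAME]   (hypothetical helper)
# Forward-resolve NAME (default: this host's FQDN), reverse-resolve the
# address, and succeed only if the round trip returns the same name.
check_resolution() {
    name=${1:-$(hostname -f)}
    addr=$(getent hosts "$name" | awk 'NR == 1 { print $1 }')
    if [ -z "$addr" ]; then
        echo "forward lookup failed for $name"
        return 1
    fi
    back=$(getent hosts "$addr" | canonical_name)
    if [ "$back" != "$name" ]; then
        echo "mismatch: $name -> $addr -> ${back:-nothing}"
        return 1
    fi
    echo "ok: $name <-> $addr"
}
```

Running `check_resolution` with no arguments on every node, ideally before enabling nscd, should report "ok" each time.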


This is what you ought to see


Name resolution errata

• Mismatches? Expect odd results.
  • Problems starting DataNodes
  • Non-FQDN in Web UI links
  • Security features are extra sensitive to FQDN

• Errors so common that link to FAQ is included in logs!
  • http://wiki.apache.org/hadoop/UnknownHost

• Get name resolution working BEFORE enabling nscd!


Summary
Now is the appropriate time to take out your camera phone.


A white background is supposedly better for printing. (who prints things anymore?)


A white background is supposedly better for printing. (but makes for very pale slides)


Summary

1. limit swapping (vm.swappiness = 1)
2. data disks: mount with noatime option
3. data disks: disable root reserve space
4. enable nscd
5. increase file handle limits
6. use dedicated OS/logging disk
7. sane name resolution

http://tiny.cloudera.com/7steps


Recommended reading

• Hadoop Operations http://amzn.to/1ydMrLf


Questions? Preferably related to the talk…


Thanks! Alex Moundalexis| @technmsg


8. Bonus Round
Because we have enough time (or I talked really fast)…


Other things to check

• Disk IO
  • hdparm
    • # hdparm -Tt /dev/sdc
    • Looking for at least 70 MB/s from 7200 RPM disks
    • Slower could indicate a failing drive, disk controller, array, etc.
  • dd
    • http://romanrm.ru/en/dd-benchmark
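
The 70 MB/s rule of thumb can be turned into a pass/fail check across many disks. A sketch only: it assumes hdparm's usual "Timing buffered disk reads: … = NN.NN MB/sec" result line, and the function name disk_read_ok is made up.

```shell
# disk_read_ok [MIN_MB_PER_SEC]   (hypothetical helper)
# Read `hdparm -t` or `hdparm -Tt` output on stdin; succeed if the
# buffered disk read rate meets the minimum (default 70).
disk_read_ok() {
    awk -v min="${1:-70}" '
        /Timing buffered disk reads/ {
            rate = $(NF - 1) + 0     # the number just before "MB/sec"
            seen = 1
        }
        END { exit(seen && rate >= min ? 0 : 1) }
    '
}
```

As root, `hdparm -t /dev/sdc | disk_read_ok || echo "/dev/sdc looks slow"` could be looped over each data disk.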



• Disable Red Hat Transparent Huge Pages (RH6+ until 6.5)
  • Can reduce elevated CPU usage
  • In rc.local:

echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled

• Reference: Linux 6 Transparent Huge Pages and Hadoop Workloads, http://goo.gl/WSF2qC
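
To confirm the setting after a reboot, read the same files back: the kernel lists the available choices and puts the active one in square brackets. A minimal sketch with a hypothetical function name, thp_setting:

```shell
# thp_setting FILE   (hypothetical helper)
# Print the active choice from a transparent hugepage sysfs file, where
# the active value is the one shown in square brackets.
thp_setting() {
    sed -n 's/.*\[\([a-z]*\)\].*/\1/p' "$1"
}
```

For example, `thp_setting /sys/kernel/mm/redhat_transparent_hugepage/enabled` should print never once the rc.local lines have run. On kernels without the Red Hat prefix the directory is /sys/kernel/mm/transparent_hugepage instead.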






• Enable Jumbo Frames
  • Only if your network infrastructure supports it!
  • Can easily (and arguably) boost throughput by 10-20%

• Monitor and Chart Everything
  • How else will you know what’s happening?
  • Nagios
  • Ganglia


