Next Generation Hadoop Operations


DESCRIPTION

Andrew Ryan describes how Facebook operates Hadoop as a shared resource across many groups. More information and video at: http://developer.yahoo.com/blogs/hadoop/posts/2011/02/hug-feb-2011-recap/

TRANSCRIPT

Page 1: Next Generation Hadoop Operations
Page 2: Next Generation Hadoop Operations

Next-Generation Hadoop Operations

What’s ahead in the next 12 months for Hadoop cluster administration

Andrew Ryan, AppOps Engineer, Feb 16, 2011

Page 3: Next Generation Hadoop Operations

Agenda

1 Hadoop operations @Facebook: an overview

2 Existing operational best practices

3 The challenges ahead: new directions in Hadoop

4 Emerging operational best practices

5 Conclusions and next steps

Page 4: Next Generation Hadoop Operations
Page 5: Next Generation Hadoop Operations

Hadoop Operations @Facebook

▪  Lean staffing, fast moving, highly leveraged

▪  Basic oncall structure:

▪  Level 1: 24x7 sysadmin team (“SRO”) for whole site

▪  Level 2: 2 people (“AppOps”) trading 1-week oncall shifts

▪  Level 3: 4 different Hadoop dev subteams with 1-week rotations

▪  Plus oncalls from other adjunct teams: SiteOps for machine repairs, NetEng for network, etc.

▪  Every engineer @FB is issued a cell phone and is expected to be available in emergencies, or whenever they change production systems or code.

Page 6: Next Generation Hadoop Operations

Operational gaps in Hadoop

Our best practices address all of these gaps:

▪  Hardware selection, preparation, and configuration

▪  Installation/packaging

▪  Upgrades

▪  Autostart/start/stop/restart/status as correct UNIX user (a wrapper sketch follows this list)

▪  Node level application and system monitoring

▪  Cluster-level and job-level monitoring

▪  Integrated log viewing/tailing/grepping

▪  Fast, reliable, centrally logged cluster-level shell ( != slaves.sh)
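To make the start/stop gap concrete, here is a minimal sketch of a wrapper that always runs daemons as the right UNIX user. The tool name, service account, and paths are assumptions, not Facebook's actual hadoopctl:

#!/bin/bash
# hadoopctl-lite {start|stop|restart|status} <daemon>  (hypothetical)
DAEMON_USER=hdfs                # assumed service account
HADOOP_HOME=/usr/lib/hadoop     # assumed install path
ACTION=$1; DAEMON=$2
case "$ACTION" in
  start|stop)
    # hadoop-daemon.sh ships with Hadoop; always run it as the service user
    su -s /bin/bash -c "$HADOOP_HOME/bin/hadoop-daemon.sh $ACTION $DAEMON" "$DAEMON_USER"
    ;;
  restart)
    "$0" stop "$DAEMON"; sleep 5; "$0" start "$DAEMON"
    ;;
  status)
    # hadoop-daemon.sh writes hadoop-<user>-<daemon>.pid (default dir: /tmp)
    pid=$(cat "/tmp/hadoop-$DAEMON_USER-$DAEMON.pid" 2>/dev/null)
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null \
      && echo "$DAEMON running (pid $pid)" || echo "$DAEMON not running"
    ;;
esac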

Page 7: Next Generation Hadoop Operations

Existing operational best practices (1): Sysadmin

▪  All the stuff you would do for a large distributed system but especially…

▪  Failed/failing hardware is your biggest enemy. FIND IT AND FIX IT, OR GET IT OUT OF YOUR CLUSTERS! (the ‘excludes’ file is your friend)

▪  Regularly run every possible diagnostic to safely scan for bad hardware

▪  Identify and remove “repeat offender” hardware

▪  Fail fast, recover quickly, small things add up in big clusters:

▪  RHEL/CentOS kickstart steals your disk space (1.5%-3%+ per disk)

▪  No swap + vm.panic_on_oom=1 + kernel.kdb=0 for “fast auto reboot on OOM” (config sketch after this list)

▪  Never fsck ext3 data drives unless Hadoop says you have to
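The last two bullets translate directly into config. A hedged sketch: kernel.panic is my addition (panic_on_oom alone panics but does not reboot), and the hostname and excludes path are hypothetical:

# /etc/sysctl.conf fragment for "fast auto reboot on OOM"
# (run with no swap partitions at all)
vm.panic_on_oom = 1    # panic instead of letting the OOM killer pick victims
kernel.kdb = 0         # from the slide: don't stop in the kernel debugger
kernel.panic = 10      # assumption: reboot 10 seconds after any panic

# Excludes workflow: add a bad node to the file named by
# dfs.hosts.exclude, then make the NameNode re-read it
$ echo badnode042.example.com >> /etc/hadoop/conf/excludes
$ hadoop dfsadmin -refreshNodes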

Page 8: Next Generation Hadoop Operations

Sysadmin example: identifying your “America’s Most Wanted” pays off

Page 9: Next Generation Hadoop Operations

Existing operational best practices (2): Tooling

▪  Maintain a central registry of clusters, nodes, and each node’s role in the cluster, integrated with your service/asset management platform

▪  Build centrally maintained tools to:

▪  Start/stop/restart/autostart daemons on hosts (hadoopctl)

▪  View/grep/tail daemon logs on hosts (hadooplog)

▪  Start/stop, or execute commands on, entire clusters (clusterctl; a sketch follows this list)

▪  Manage excludes files based on repair status (excluderator)

▪  Deploy any arbitrary version of software to clusters

▪  Monitor daemon health and collect statistics
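A minimal sketch of what a registry-driven cluster shell buys you over slaves.sh: parallel execution plus central, per-node logs. The registry format and tool name are assumptions:

#!/bin/bash
# clusterctl-lite <cluster> <command...>  (hypothetical)
CLUSTER=$1; shift
REGISTRY=/etc/hadoop/registry   # assumed: one "<cluster> <host> <role>" line per node
mkdir -p /var/log/clusterctl
awk -v c="$CLUSTER" '$1 == c {print $2}' "$REGISTRY" |
while read -r host; do
  # run in parallel; keep each node's output in a central log
  ssh -n -o BatchMode=yes "$host" "$@" \
    > "/var/log/clusterctl/$CLUSTER.$host.log" 2>&1 &
done
wait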

Page 10: Next Generation Hadoop Operations

Tooling example: deploy & upgrade clusters

# Deploy an HDFS/MapReduce cluster pair: 2 to 4000 nodes via torrent
$ deploy-hadoop-release.py --clusterdeploy=DFS1,SILVER branch@rev
$ clusterctl restart DFS1 SILVER

# “Refresh deploy” on 10 clusters, and then restart just the datanodes
$ deploy-hadoop-release.py --poddeploy=DFSSCRIBE-ALL redeploy
$ clusterctl restart DFSSCRIBE-ALL:datanode

Page 11: Next Generation Hadoop Operations

Existing operational best practices (3): Process

▪  Document everything

▪  Segregate different classes of users on different clusters, with appropriate service levels and capacities

▪  Graph user-visible metrics like HDFS and job latency (a canary sketch follows this list)

▪  Build “least destructive” procedures for getting hardware back in service

▪  Developers and Ops should use the same procedures and tools
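A hedged sketch of one user-visible metric: a canary that times a small HDFS write/read/delete round trip. The metric name and scratch path are assumptions; feed the output to whatever graphing system you already run:

#!/bin/bash
CANARY=/tmp/canary.$HOSTNAME.$$     # assumed HDFS scratch path
start=$(date +%s)
hadoop fs -put /etc/hosts "$CANARY" &&
hadoop fs -cat "$CANARY" > /dev/null &&
hadoop fs -rm "$CANARY" > /dev/null
echo "hdfs.canary.roundtrip_seconds $(( $(date +%s) - start ))"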

Page 12: Next Generation Hadoop Operations

Process example: graphing our users’ experience on the cluster

Page 13: Next Generation Hadoop Operations

A Hadoop cluster admin’s worst enemies

▪  The “X-Files”: machines which fail in strange ways, undetected by your monitoring systems

▪  Get your basics under control, then you’ll have more time for these

▪  “America’s Most Wanted”: machines which keep failing, again and again

▪  Our data: 1% of our machines accounted for 30% of our repair tickets (a counting sketch follows)
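Finding “America’s Most Wanted” is a one-liner once you can export tickets. A hedged sketch, assuming one “date hostname component” line per repair ticket:

$ awk '{print $2}' repair_tickets.log | sort | uniq -c | sort -rn | head -20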

Page 14: Next Generation Hadoop Operations

New directions for Hadoop

▪  HBase (Facebook Messages, real-time click logs)

▪  Zero-downtime upgrades (AvatarNode, rolling upgrades)

▪  “Megadatanodes” and Hadoop RAID

▪  HDFS as an “appliance”

See also: http://www.facebook.com/notes/facebook-engineering/looking-at-the-code-behind-our-three-uses-of-apache-hadoop/468211193919

Page 15: Next Generation Hadoop Operations

HBase and Hadoop

▪  Very new technology with emerging operational characteristics

▪  Applications using HBase are also new, with their own usage quirks

▪  Aiming for a large number of small clusters (~100 nodes each)

▪  Slow/dead nodes are a big problem: these are real-time, user-facing systems

▪  Region failover is slow; no speculative execution

▪  Full-downtime restarts must be avoided

View the Messages tech talk here: http://fb.me/95OQ8YaD2rkb3r

Page 16: Next Generation Hadoop Operations

Zero-downtime upgrades

▪  HDFS upgrades mean 1-2 hours of downtime

▪  Jobtracker upgrades are quick (5 min), but kill all currently running jobs

▪  Rolling upgrades work today, but are too slow for large clusters (see the sketch after this list)

▪  Must be able to be lenient as well as strict about multiple versions of client and server software installed and running in the cluster
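A hedged sketch of the rolling-upgrade loop, one datanode at a time. The registry format and hadoopctl verbs follow the earlier sketches, and --hostdeploy is hypothetical (the deck only shows --clusterdeploy and --poddeploy):

#!/bin/bash
# rolling-upgrade <cluster> <branch@rev>  (hypothetical)
CLUSTER=$1; REV=$2
for host in $(awk -v c="$CLUSTER" '$1 == c && $3 == "datanode" {print $2}' /etc/hadoop/registry); do
  ssh -n "$host" "hadoopctl stop datanode"
  ssh -n "$host" "deploy-hadoop-release.py --hostdeploy=$host $REV"   # hypothetical flag
  ssh -n "$host" "hadoopctl start datanode"
  # crude pacing; a real tool would poll 'hadoop dfsadmin -report'
  # until the node re-registers and finishes its block report
  sleep 60
done

This serial loop is exactly why rolling upgrades are too slow today: 4,000 nodes at a minute apiece is nearly three days.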

Page 17: Next Generation Hadoop Operations

“Megadatanodes” and Hadoop RAID

▪  Storage requirements continue to increase rapidly, as does CPU/RAM

▪  9X increase in datanode density from 2009-2011 (4TB->36TB)

▪  Hadoop RAID with XOR and Reed-Solomon brings tremendous cost savings, along with management challenges (back-of-envelope arithmetic follows this list):

▪  Losing one node is a big deal (200k-600k blocks/node?). A rack? Ouch!

▪  Tools and admin capabilities are not ready yet

▪  Will HDFS administration in 2012 be “like administering a cluster of 4000 NetApps”?

▪  Host/rack level network will be a bottleneck
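Where the savings come from, as back-of-envelope arithmetic. These stripe and parity parameters are the commonly cited HDFS RAID defaults, not numbers from this talk:

# Raw bytes stored per logical byte:
#   3-way replication:                          3.0x
#   XOR, 10+1 stripe, data+parity at rep 2:     2 + 2*(1/10) = 2.2x
#   Reed-Solomon (10,4), everything at rep 1:   1 + 4/10     = 1.4x
# The catch: at rep 1, a lost block means a Reed-Solomon decode across
# the stripe instead of a simple re-copy, hence the tooling challenges.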

Page 18: Next Generation Hadoop Operations

HDFS as an “appliance”

▪  Use HDFS cluster instead of commercial storage appliance

▪  Requires commercial-grade support & features

▪  Must be price-competitive

[image: an HDFS cluster vs. a commercial storage appliance]

Page 19: Next Generation Hadoop Operations

Emerging operational best practices

▪  More careful selection of hardware and network designs to accommodate new uses of Hadoop

▪  Find and deal with slowness at a node/rack/segment level

▪  Auto-healing at a granularity finer than “reboot” or “restart”

▪  Node-level version detection and installation (a detection sketch follows this list)

▪  Rolling, zero-downtime upgrades (AvatarNode + new JobTracker)

…and do all this without making Hadoop any harder to set up and run
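A hedged sketch of node-level version detection: compare what each node actually runs against what the registry says it should. The extra version column in the registry is an assumption:

#!/bin/bash
REGISTRY=/etc/hadoop/registry   # assumed: "<cluster> <host> <role> <version>" per node
while read -r cluster host role want; do
  have=$(ssh -n -o BatchMode=yes "$host" "hadoop version | head -1 | awk '{print \$2}'")
  [ "$have" = "$want" ] || echo "$host: running ${have:-unknown}, expected $want"
done < "$REGISTRY"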

Page 20: Next Generation Hadoop Operations

Next steps

▪  Are we trying to do too much?

▪  Facebook needs an enormous data warehouse

▪  Facebook needs a large distributed filesystem

▪  Facebook needs a database alternative to MySQL

▪  Facebook is always looking to spend less money

▪  …and all that other stuff too

▪  Failure is not an option

▪  Never a dull moment!

Page 21: Next Generation Hadoop Operations

(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved.