Next Generation Hadoop Operations


DESCRIPTION

Andrew Ryan describes how Facebook operates Hadoop as a shared resource across many groups. More information and video at: http://developer.yahoo.com/blogs/hadoop/posts/2011/02/hug-feb-2011-recap/

TRANSCRIPT

Page 1: Next Generation Hadoop Operations
Page 2: Next Generation Hadoop Operations

Next-Generation Hadoop Operations

What’s ahead in the next 12 months for Hadoop cluster administration

Andrew Ryan, AppOps Engineer, Feb 16, 2011

Page 3: Next Generation Hadoop Operations

Agenda

1 Hadoop operations @Facebook: an overview

2 Existing operational best practices

3 The challenges ahead: new directions in Hadoop

4 Emerging operational best practices

5 Conclusions and next steps

Page 4: Next Generation Hadoop Operations
Page 5: Next Generation Hadoop Operations

Hadoop Operations @Facebook

▪  Lean staffing, fast moving, highly leveraged

▪  Basic oncall structure:

▪  Level 1: 24x7 sysadmin team (“SRO”) for whole site

▪  Level 2: 2 people (“AppOps”) trading 1-week oncall shifts

▪  Level 3: 4 different Hadoop dev subteams with 1-week rotations

▪  Plus oncalls from other adjunct teams: SiteOps for machine repairs, NetEng for network, etc.

▪  Every engineer @FB is issued a cell phone and is expected to be available in emergencies, or whenever they change production systems or code.

Page 6: Next Generation Hadoop Operations

Operational gaps in Hadoop

Our best practices address all of these gaps:

▪  Hardware selection, preparation, and configuration

▪  Installation/packaging

▪  Upgrades

▪  Autostart/start/stop/restart/status as correct UNIX user (a wrapper sketch follows this list)

▪  Node level application and system monitoring

▪  Cluster-level and job-level monitoring

▪  Integrated log viewing/tailing/grepping

▪  Fast, reliable, centrally logged cluster-level shell ( != slaves.sh)
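To make the start/stop gap concrete, here is a minimal sketch of a wrapper that always runs daemons as the right UNIX user. The tool name, service account, and paths are assumptions, not Facebook's actual hadoopctl:

#!/bin/bash
# hadoopctl-lite {start|stop|restart|status} <daemon>  (hypothetical)
DAEMON_USER=hdfs                # assumed service account
HADOOP_HOME=/usr/lib/hadoop     # assumed install path
ACTION=$1; DAEMON=$2
case "$ACTION" in
  start|stop)
    # hadoop-daemon.sh ships with Hadoop; always run it as the service user
    su -s /bin/bash -c "$HADOOP_HOME/bin/hadoop-daemon.sh $ACTION $DAEMON" "$DAEMON_USER"
    ;;
  restart)
    "$0" stop "$DAEMON"; sleep 5; "$0" start "$DAEMON"
    ;;
  status)
    # hadoop-daemon.sh writes hadoop-<user>-<daemon>.pid (default dir: /tmp)
    pid=$(cat "/tmp/hadoop-$DAEMON_USER-$DAEMON.pid" 2>/dev/null)
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null \
      && echo "$DAEMON running (pid $pid)" || echo "$DAEMON not running"
    ;;
esac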

Page 7: Next Generation Hadoop Operations

Existing operational best practices (1): Sysadmin

▪  All the stuff you would do for a large distributed system but especially…

▪  Failed/failing hardware is your biggest enemy. FIND IT AND FIX IT, OR GET IT OUT OF YOUR CLUSTERS! (the ‘excludes’ file is your friend)

▪  Regularly run every possible diagnostic to safely scan for bad hardware

▪  Identify and remove “repeat offender” hardware

▪  Fail fast, recover quickly, small things add up in big clusters:

▪  RHEL/CentOS kickstart steals your disk space (1.5%-3%+ per disk)

▪  No swap + vm.panic_on_oom=1 + kernel.kdb=0 for “fast auto reboot on OOM” (config sketch after this list)

▪  Never fsck ext3 data drives unless Hadoop says you have to
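The last two bullets translate directly into config. A hedged sketch: kernel.panic is my addition (panic_on_oom alone panics but does not reboot), and the hostname and excludes path are hypothetical:

# /etc/sysctl.conf fragment for "fast auto reboot on OOM"
# (run with no swap partitions at all)
vm.panic_on_oom = 1    # panic instead of letting the OOM killer pick victims
kernel.kdb = 0         # from the slide: don't stop in the kernel debugger
kernel.panic = 10      # assumption: reboot 10 seconds after any panic

# Excludes workflow: add a bad node to the file named by
# dfs.hosts.exclude, then make the NameNode re-read it
$ echo badnode042.example.com >> /etc/hadoop/conf/excludes
$ hadoop dfsadmin -refreshNodes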

Page 8: Next Generation Hadoop Operations

Sysadmin example: identifying your “America’s Most Wanted” pays off

Page 9: Next Generation Hadoop Operations

Existing operational best practices (2): Tooling

▪  Maintain a central registry of clusters, nodes, and each node’s role in the cluster, integrated with your service/asset management platform

▪  Build centrally maintained tools to:

▪  Start/stop/restart/autostart daemons on hosts (hadoopctl)

▪  View/grep/tail daemon logs on hosts (hadooplog)

▪  Start/stop, or execute commands on, entire clusters (clusterctl; a sketch follows this list)

▪  Manage excludes files based on repair status (excluderator)

▪  Deploy any arbitrary version of software to clusters

▪  Monitor daemon health and collect statistics
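A minimal sketch of what a registry-driven cluster shell buys you over slaves.sh: parallel execution plus central, per-node logs. The registry format and tool name are assumptions:

#!/bin/bash
# clusterctl-lite <cluster> <command...>  (hypothetical)
CLUSTER=$1; shift
REGISTRY=/etc/hadoop/registry   # assumed: one "<cluster> <host> <role>" line per node
mkdir -p /var/log/clusterctl
awk -v c="$CLUSTER" '$1 == c {print $2}' "$REGISTRY" |
while read -r host; do
  # run in parallel; keep each node's output in a central log
  ssh -n -o BatchMode=yes "$host" "$@" \
    > "/var/log/clusterctl/$CLUSTER.$host.log" 2>&1 &
done
wait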

Page 10: Next Generation Hadoop Operations

Tooling example: deploy & upgrade clusters

# Deploy an HDFS/MapReduce cluster pair: 2 to 4000 nodes via torrent
$ deploy-hadoop-release.py --clusterdeploy=DFS1,SILVER branch@rev
$ clusterctl restart DFS1 SILVER

# “Refresh deploy” on 10 clusters, and then restart just the datanodes
$ deploy-hadoop-release.py --poddeploy=DFSSCRIBE-ALL redeploy
$ clusterctl restart DFSSCRIBE-ALL:datanode

Page 11: Next Generation Hadoop Operations

Existing operational best practices (3): Process

▪  Document everything

▪  Segregate different classes of users on different clusters, with appropriate service levels and capacities

▪  Graph user-visible metrics like HDFS and job latency (a canary sketch follows this list)

▪  Build “least destructive” procedures for getting hardware back in service

▪  Developers and Ops should use the same procedures and tools
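A hedged sketch of one user-visible metric: a canary that times a small HDFS write/read/delete round trip. The metric name and scratch path are assumptions; feed the output to whatever graphing system you already run:

#!/bin/bash
CANARY=/tmp/canary.$HOSTNAME.$$     # assumed HDFS scratch path
start=$(date +%s)
hadoop fs -put /etc/hosts "$CANARY" &&
hadoop fs -cat "$CANARY" > /dev/null &&
hadoop fs -rm "$CANARY" > /dev/null
echo "hdfs.canary.roundtrip_seconds $(( $(date +%s) - start ))"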

Page 12: Next Generation Hadoop Operations

Process example: graphing our users’ experience on the cluster

Page 13: Next Generation Hadoop Operations

A Hadoop cluster admin’s worst enemies

▪  The “X-Files”: machines which fail in strange ways, undetected by your monitoring systems

▪  Get your basics under control, then you’ll have more time for these

▪  “America’s Most Wanted”: machines which keep failing, again and again

▪  Our data: 1% of our machines accounted for 30% of our repair tickets (a counting sketch follows)
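Finding “America’s Most Wanted” is a one-liner once you can export tickets. A hedged sketch, assuming one “date hostname component” line per repair ticket:

$ awk '{print $2}' repair_tickets.log | sort | uniq -c | sort -rn | head -20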

Page 14: Next Generation Hadoop Operations

New directions for Hadoop

▪  HBase (Facebook Messages, real-time click logs)

▪  Zero-downtime upgrades (AvatarNode, rolling upgrades)

▪  “Megadatanodes” and Hadoop RAID

▪  HDFS as an “appliance”

See also: http://www.facebook.com/notes/facebook-engineering/looking-at-the-code-behind-our-three-uses-of-apache-hadoop/468211193919

Page 15: Next Generation Hadoop Operations

HBase and Hadoop

▪  Very new technology with emerging operational characteristics

▪  Applications using HBase are also new, with their own usage quirks

▪  Aiming for a large number of small clusters (~100 nodes each)

▪  Slow/dead nodes are a big problem: these are real-time, user-facing systems

▪  Region failover is slow; no speculative execution

▪  Full-downtime restarts must be avoided

View the Messages tech talk here: http://fb.me/95OQ8YaD2rkb3r

Page 16: Next Generation Hadoop Operations

Zero-downtime upgrades

▪  HDFS upgrades mean 1-2 hours of downtime

▪  Jobtracker upgrades are quick (5 min), but kill all currently running jobs

▪  Rolling upgrades work today, but are too slow for large clusters (see the sketch after this list)

▪  Must be able to be lenient as well as strict about multiple versions of client and server software installed and running in the cluster
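A hedged sketch of the rolling-upgrade loop, one datanode at a time. The registry format and hadoopctl verbs follow the earlier sketches, and --hostdeploy is hypothetical (the deck only shows --clusterdeploy and --poddeploy):

#!/bin/bash
# rolling-upgrade <cluster> <branch@rev>  (hypothetical)
CLUSTER=$1; REV=$2
for host in $(awk -v c="$CLUSTER" '$1 == c && $3 == "datanode" {print $2}' /etc/hadoop/registry); do
  ssh -n "$host" "hadoopctl stop datanode"
  ssh -n "$host" "deploy-hadoop-release.py --hostdeploy=$host $REV"   # hypothetical flag
  ssh -n "$host" "hadoopctl start datanode"
  # crude pacing; a real tool would poll 'hadoop dfsadmin -report'
  # until the node re-registers and finishes its block report
  sleep 60
done

This serial loop is exactly why rolling upgrades are too slow today: 4,000 nodes at a minute apiece is nearly three days.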

Page 17: Next Generation Hadoop Operations

“Megadatanodes” and Hadoop RAID

▪  Storage requirements continue to increase rapidly, as does CPU/RAM

▪  9X increase in datanode density from 2009-2011 (4TB->36TB)

▪  Hadoop RAID with XOR and Reed-Solomon brings tremendous cost savings, along with management challenges (back-of-envelope arithmetic follows this list):

▪  Losing one node is a big deal (200k-600k blocks/node?). A rack? Ouch!

▪  Tools and admin capabilities are not ready yet

▪  Will HDFS administration in 2012 be “like administering a cluster of 4000 NetApps”?

▪  Host/rack level network will be a bottleneck
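Where the savings come from, as back-of-envelope arithmetic. These stripe and parity parameters are the commonly cited HDFS RAID defaults, not numbers from this talk:

# Raw bytes stored per logical byte:
#   3-way replication:                          3.0x
#   XOR, 10+1 stripe, data+parity at rep 2:     2 + 2*(1/10) = 2.2x
#   Reed-Solomon (10,4), everything at rep 1:   1 + 4/10     = 1.4x
# The catch: at rep 1, a lost block means a Reed-Solomon decode across
# the stripe instead of a simple re-copy, hence the tooling challenges.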

Page 18: Next Generation Hadoop Operations

HDFS as an “appliance”

▪  Use HDFS cluster instead of commercial storage appliance

▪  Requires commercial-grade support & features

▪  Must be price-competitive

[image: an HDFS cluster vs. a commercial storage appliance]

Page 19: Next Generation Hadoop Operations

Emerging operational best practices

▪  More careful selection of hardware and network designs to accommodate new uses of Hadoop

▪  Find and deal with slowness at a node/rack/segment level

▪  Auto-healing at a granularity finer than “reboot” or “restart”

▪  Node-level version detection and installation (a detection sketch follows this list)

▪  Rolling, zero-downtime upgrades (AvatarNode + new JobTracker)

…and do all this without making Hadoop any harder to set up and run
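A hedged sketch of node-level version detection: compare what each node actually runs against what the registry says it should. The extra version column in the registry is an assumption:

#!/bin/bash
REGISTRY=/etc/hadoop/registry   # assumed: "<cluster> <host> <role> <version>" per node
while read -r cluster host role want; do
  have=$(ssh -n -o BatchMode=yes "$host" "hadoop version | head -1 | awk '{print \$2}'")
  [ "$have" = "$want" ] || echo "$host: running ${have:-unknown}, expected $want"
done < "$REGISTRY"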

Page 20: Next Generation Hadoop Operations

Next steps

▪  Are we trying to do too much?

▪  Facebook needs an enormous data warehouse

▪  Facebook needs a large distributed filesystem

▪  Facebook needs a database alternative to MySQL

▪  Facebook is always looking to spend less money

▪  …and all that other stuff too

▪  Failure is not an option

▪  Never a dull moment!

Page 21: Next Generation Hadoop Operations

(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved.