hw09 monitoring best practices

How to monitor the $H!T out of Hadoop: Developing a comprehensive open approach to monitoring Hadoop clusters

Upload: cloudera-inc

Post on 14-Jul-2015


TRANSCRIPT

Page 1: Hw09   Monitoring Best Practices

How to monitor the $H!T out of Hadoop

Developing a comprehensive open approach to monitoring Hadoop clusters

Page 2: Hw09   Monitoring Best Practices

Relevant Hadoop Information

From 3 – 3000 Nodes

Hardware/Software failures “common”

Redundant Components: DataNode, TaskTracker

Non-redundant Components: NameNode, JobTracker, SecondaryNameNode

Fast Evolving Technology (Best Practices?)

Page 3: Hw09   Monitoring Best Practices

Monitoring Software

Nagios
– Red Yellow Green Alerts, Escalations
– De facto Standard, Widely deployed
– Text based configuration
– Web Interface
– Pluggable with shell scripts/external apps
    Return 0 - OK
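
The "Return 0 - OK" bullet refers to the Nagios plugin convention: a plugin exits with 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN. A minimal sketch of that convention follows; the function names and the thresholds are illustrative and do not come from the slides.

```python
# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measured value to a Nagios status code (higher values are worse)."""
    if value is None:
        return UNKNOWN  # the measurement could not be taken
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def plugin_result(name, value, warn, crit):
    """Return (exit_code, status_line) as a plugin would report them."""
    code = evaluate(value, warn, crit)
    return code, f"{name} {LABELS[code]} - value={value}"
```

A real plugin would print the status line and then call `sys.exit(code)` so that Nagios can read the result from the exit status.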

Page 4: Hw09   Monitoring Best Practices

Cacti

Performance Graphing System

RRD/RRA Front End

Slick Web Interface

Template System for Graph Types

Pluggable
– SNMP input
– Shell script / external program

Page 5: Hw09   Monitoring Best Practices
Page 6: Hw09   Monitoring Best Practices

hadoop-cacti-jtg

JMX Fetching Code w/ (kick off) scripts

Cacti templates For Hadoop

Premade Nagios Check Scripts

Helper/Batch/automation scripts

Apache License

Page 7: Hw09   Monitoring Best Practices

Hadoop JMX

Page 8: Hw09   Monitoring Best Practices

Sample Cluster P1

NameNode & SecNameNode
– Hardware RAID
– 8 GB RAM
– 1x Quad Core
– DerbyDB (hive) on SecNameNode

JobTracker
– 8 GB RAM
– 1x Quad Core

Page 9: Hw09   Monitoring Best Practices

A Sample Cluster P2

Slave (hadoopdata1-XXXX)
– JBOD 8x 1TB SATA Disk
– RAM 16 GB
– 2x Quad Core

Page 10: Hw09   Monitoring Best Practices

Prerequisites

Nagios (install) DAG RPMs

Cacti (install) Several RPMs

Liberal network access to the cluster

Page 11: Hw09   Monitoring Best Practices

Alerts & Escalations

X nodes * Y Services = < Sleep

Define a policy
– Wake Me Up’s (SMS)
– Don’t Wake Me Up’s (EMAIL)
– Review (Daily, Weekly, Monthly)

Page 12: Hw09   Monitoring Best Practices

Wake Me Up’s

NameNode
– Disk Full (Big Big Headache)
– RAID Array Issues (failed disk)

JobTracker

SecNameNode
– Do not realize too late that it is not working

Page 13: Hw09   Monitoring Best Practices

Don’t Wake Me Up’s

Or ‘Wake someone else up’

DataNode
– Warning: Currently a failed disk will down the DataNode (see Jira)

TaskTracker

Hardware
– Bad Disk (Start RMA)

Slaves are expendable (up to a point)
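
The policy on the two slides above can be written down as a small routing table. This is only a sketch: the channel names, the fallback to "review" for unlisted components, and the `max_down_fraction` limit are assumptions made for illustration, not values from the presentation.

```python
# Routing taken from the two slides above; the channel names are illustrative.
WAKE_ME_UP = {"NameNode", "JobTracker", "SecNameNode"}
DONT_WAKE_ME_UP = {"DataNode", "TaskTracker", "Hardware"}


def notification_channel(component):
    """Pick a channel for a failed component: "sms", "email", or "review"."""
    if component in WAKE_ME_UP:
        return "sms"
    if component in DONT_WAKE_ME_UP:
        return "email"
    return "review"  # assumption: anything unlisted waits for the periodic review


def slaves_need_escalation(down, total, max_down_fraction=0.25):
    """Model "expendable up to a point": escalate once too many slaves are down.

    max_down_fraction is a hypothetical limit, not a value from the slides.
    """
    return total > 0 and down / total > max_down_fraction
```

For example, `notification_channel("DataNode")` yields `"email"`, while a cluster with 30 of 100 slaves down would trip `slaves_need_escalation` under the assumed 25% limit.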

Page 14: Hw09   Monitoring Best Practices

Monitoring Battle Plan

Start With the Basics
– Ping, Disk

Add Hadoop Specific Alarms
– check_data_node

Add JMX Graphing
– NameNodeOperations

Add JMX Based alarms
– FilesTotal > 1,000,000 or LiveNodes < 50%

Page 15: Hw09   Monitoring Best Practices

The Basics Nagios

Nagios (All Nodes)
– Host up (Ping check)
– Disk % Full
– SWAP > 85 %

* Load based alarms are somewhat useless: 389% CPU load is not necessarily a bad thing in Hadoopville
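
The disk and swap checks above are simple percentage thresholds. A rough sketch of how such checks could be computed is shown here, assuming a Linux host where swap figures are read from `/proc/meminfo`; the function names are hypothetical, and only the 85% swap threshold comes from the slide.

```python
import shutil


def disk_percent_full(path="/"):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def swap_percent_used(meminfo_text):
    """Percent of swap in use, parsed from /proc/meminfo-style text (Linux)."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # values are reported in kB
    total = fields.get("SwapTotal", 0)
    if total == 0:
        return 0.0  # no swap configured
    return 100.0 * (total - fields.get("SwapFree", 0)) / total


def swap_alarm(meminfo_text, threshold=85.0):
    """True when swap use exceeds the threshold (85% on the slide)."""
    return swap_percent_used(meminfo_text) > threshold
```

On a live node the text would come from `open("/proc/meminfo").read()`; passing it in as a string keeps the parsing easy to test.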

Page 16: Hw09   Monitoring Best Practices

The Basics Cacti

Cacti (All Nodes)
– CPU (full CPU)
– RAM/SWAP
– Network
– Disk Usage

Page 17: Hw09   Monitoring Best Practices

Disk Utilization

Page 18: Hw09   Monitoring Best Practices

RAID Tools

Hpacucli: not a Street Fighter move
– Alerts on RAID events (NameNode)
    Disk failed
    Rebuilding
– JBOD (DataNode)
    Failed Drive
    Drive Errors

Dell, SUN, Vendor Specific Tools

Page 19: Hw09   Monitoring Best Practices

Before you jump in

X Nodes * Y Checks = Lots of work

About 3 Nodes into the process …
– Wait!!! I need some interns!!!

Solution: S.I.C.C.T. (Semi-Intelligent-Configuration-cloning-tools)
– (I made that up)
– (for this presentation)
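
One way to read "configuration cloning" is generating the repetitive Nagios text configuration from a single template instead of typing it per node. The sketch below illustrates that idea only; it is not the tooling from the presentation, and the host naming scheme and the `check_remote_datanode` command name are hypothetical.

```python
# One service definition, rendered once per host.
SERVICE_TEMPLATE = """define service {{
    service_description {check}
    use generic-service
    host_name {host}
    check_command {check}!{port}
}}
"""


def slave_hosts(prefix, count):
    """Hypothetical naming scheme: prefix1, prefix2, ... prefixN."""
    return [f"{prefix}{i}" for i in range(1, count + 1)]


def clone_services(hosts, check, port):
    """Render one Nagios service definition per host from the template."""
    return "\n".join(
        SERVICE_TEMPLATE.format(host=host, check=check, port=port) for host in hosts
    )
```

Calling `clone_services(slave_hosts("hadoopdata", 3), "check_remote_datanode", 50075)` returns three service blocks that could be written to a `.cfg` file, turning X nodes times Y checks into one loop.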

Page 20: Hw09   Monitoring Best Practices

Nagios

Answers “IS IT RUNNING?”

Text based Configuration

Page 21: Hw09   Monitoring Best Practices

Cacti

Answers “HOW WELL IS IT RUNNING?”

Web Based configuration
– php-cli tools

Page 22: Hw09   Monitoring Best Practices

Monitoring Battle Plan Thus Far

Start With the Basics
– Ping, Disk !!!!!!Done!!!!!!

Add Hadoop Specific Alarms
– check_data_node

Add JMX Graphing
– NameNodeOperations

Add JMX Based alarms
– FilesTotal > 1,000,000 or LiveNodes < 50%

Page 23: Hw09   Monitoring Best Practices

Add Hadoop Specific Alarms

Hadoop Components with a Web Interface
– NameNode 50070
– JobTracker 50030
– TaskTracker 50060
– DataNode 50075

check_http + regex = simple + effective
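
The "check_http + regex" idea is to fetch a component's status page and confirm that an expected string appears in it. The sketch below shows the same logic outside Nagios; it is not a replacement for the `check_http` plugin, and the function names and the timeout are assumptions. The ports are the ones listed on the slide.

```python
import re
import urllib.request

# Web UI ports listed on the slide.
WEB_PORTS = {"NameNode": 50070, "JobTracker": 50030, "TaskTracker": 50060, "DataNode": 50075}


def body_matches(body, pattern):
    """True when the regex is found anywhere in the page body."""
    return re.search(pattern, body) is not None


def check_web_ui(host, port, path, pattern, timeout=10):
    """Return a Nagios-style code: 0 when the page loads and matches, else 2."""
    url = f"http://{host}:{port}{path}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except OSError:
        return 2  # connection refused, timeout, HTTP error, or DNS failure
    return 0 if body_matches(body, pattern) else 2
```

A page that loads but lacks the expected text is treated as a failure, which catches a web server that answers while the Hadoop daemon behind it is unhealthy.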

Page 24: Hw09   Monitoring Best Practices

nagios_check_commands.cfg

Component Failure

(Future) Newer Hadoop will have XML status

define command {
    command_name check_remote_namenode
    command_line $USER1$/check_http -H $HOSTADDRESS$ -u http://$HOSTADDRESS$:$ARG1$/dfshealth.jsp -p $ARG1$ -r NameNode
}

define service {
    service_description check_remote_namenode
    use generic-service
    host_name hadoopname1
    check_command check_remote_namenode!50070
}

Page 25: Hw09   Monitoring Best Practices

Monitoring Battle Plan

Start With the Basics
– Ping, Disk (Done)

Add Hadoop Specific Alarms
– check_data_node (Done)

Add JMX Graphing
– NameNodeOperations

Add JMX Based alarms
– FilesTotal > 1,000,000 or LiveNodes < 50%

Page 26: Hw09   Monitoring Best Practices

JMX Graphing

Enable JMX

Import Templates

Page 27: Hw09   Monitoring Best Practices

JMX Graphing

Page 28: Hw09   Monitoring Best Practices

JMX Graphing

Page 29: Hw09   Monitoring Best Practices

JMX Graphing

Page 30: Hw09   Monitoring Best Practices
Page 31: Hw09   Monitoring Best Practices

Standard Java JMX

Page 32: Hw09   Monitoring Best Practices

Monitoring Battle Plan Thus Far

Start With the Basics !!!!!!Done!!!!!!
– Ping, Disk

Add Hadoop Specific Alarms !Done!
– check_data_node

Add JMX Graphing !Done!
– NameNodeOperations

Add JMX Based alarms
– FilesTotal > 1,000,000 or LiveNodes < 50%

Page 33: Hw09   Monitoring Best Practices

Add JMX based Alarms

hadoop-cacti-jtg is flexible
– extend fetch classes
– Don’t call output()
– Write your own check logic
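
As an illustration of "write your own check logic", the two example thresholds from the battle plan (FilesTotal > 1,000,000 or LiveNodes < 50%) could be evaluated as below. This is a sketch only: the slide does not say what LiveNodes is a percentage of, so the `expected_nodes` parameter is an assumption, as are the function name and message format.

```python
def jmx_alarm(files_total, live_nodes, expected_nodes,
              max_files=1_000_000, min_live_fraction=0.5):
    """Return (exit_code, message) for the two example thresholds.

    expected_nodes is an assumed input: the number of DataNodes the
    cluster should have, used to turn LiveNodes into a fraction.
    """
    problems = []
    if files_total > max_files:
        problems.append(f"FilesTotal={files_total} > {max_files}")
    if expected_nodes > 0 and live_nodes / expected_nodes < min_live_fraction:
        problems.append(f"LiveNodes={live_nodes}/{expected_nodes}")
    if problems:
        return 2, "CRITICAL - " + "; ".join(problems)  # 2 is Nagios CRITICAL
    return 0, "OK"
```

The values themselves would come from the JMX fetch step; keeping the threshold logic in a plain function makes it easy to adjust the limits per cluster.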

Page 34: Hw09   Monitoring Best Practices

Quick JMX Base Walkthrough

url, user, pass, object specified from CLI

wantedVariables, wantedOperations by inheritance

fetch() output() provided

Page 35: Hw09   Monitoring Best Practices

Extend for NameNode

Page 36: Hw09   Monitoring Best Practices

Extend for Nagios
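
The code on these three slides did not survive in the transcript. The design they describe (a base class with fetch() and output(), subclasses that declare wanted variables, and a Nagios variant that skips output()) might look roughly like the following. This is a Python analogy, not the Java API of hadoop-cacti-jtg: every class name is hypothetical, the `reader` callable stands in for a real JMX connection, and `BlocksTotal` is an assumed attribute name alongside the slide's `FilesTotal`.

```python
class JmxBase:
    """Illustrative stand-in for a JMX fetch base class (not the project's real API)."""

    wanted_variables = []  # subclasses list the attributes they want

    def __init__(self, url, user, password, object_name, reader):
        # url, user, password and object_name come from the command line on the slide;
        # `reader` is a hypothetical callable standing in for a real JMX connection.
        self.url, self.user, self.password = url, user, password
        self.object_name = object_name
        self.reader = reader
        self.values = {}

    def fetch(self):
        """Read every wanted attribute of the target object."""
        self.values = {
            name: self.reader(self.object_name, name) for name in self.wanted_variables
        }
        return self.values

    def output(self):
        """Cacti-style script output: space-separated name:value pairs."""
        return " ".join(f"{name}:{value}" for name, value in self.values.items())


class NameNodeStatus(JmxBase):
    """Graphing subclass: only declares which attributes to collect."""

    wanted_variables = ["FilesTotal", "BlocksTotal"]


class NameNodeFilesCheck(NameNodeStatus):
    """Alarm subclass: fetches, skips output(), and applies its own threshold."""

    def check(self, max_files=1_000_000):
        self.fetch()
        return 2 if self.values["FilesTotal"] > max_files else 0  # Nagios codes
```

The graphing and alarm paths share the same fetch step, so adding an alarm for a metric that is already graphed only needs a new subclass with its own check method.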

Page 37: Hw09   Monitoring Best Practices

Monitoring Battle Plan

Start With the Basics !DONE!
– Ping, Disk

Add Hadoop Specific Alarms !DONE!
– check_data_node

Add JMX Graphing !DONE!
– NameNodeOperations

Add JMX Based alarms !DONE!
– FilesTotal > 1,000,000 or LiveNodes < 50%

Page 38: Hw09   Monitoring Best Practices

Review

File System Growth
– Size
– Number of Files
– Number of Blocks
– Ratios

Utilization
– CPU/Memory
– Disk

Email (nightly)
– FSCK
– DFSADMIN
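
The slide does not say which ratios to review. As one possible reading, the three file system counters can be combined into per-file and per-block averages and compared between review periods. The ratio names and both helper functions below are assumptions for illustration.

```python
def ratios(size_bytes, files, blocks):
    """Ratios that could be tracked over time; None when a denominator is zero."""
    return {
        "blocks_per_file": blocks / files if files else None,
        "bytes_per_block": size_bytes / blocks if blocks else None,
        "bytes_per_file": size_bytes / files if files else None,
    }


def growth_percent(previous, current):
    """Percent change between two snapshots of the same metric."""
    if previous == 0:
        return None  # growth from zero is undefined
    return 100.0 * (current - previous) / previous
```

A falling bytes-per-file figure, for instance, would suggest many small files accumulating, which ties back to the FilesTotal alarm.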

Page 39: Hw09   Monitoring Best Practices

The Future

JMX Coming to JobTracker and TaskTracker (0.21)
– Collect and Graph Jobs Running
– Collect and Graph Map / Reduce per node
– Profile Specific Jobs in Cacti?