Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you



Post on 28-Jan-2015

DESCRIPTION

Andrew Widdersheim's presentation on Nagios high availability. The presentation was given during the Nagios World Conference North America, held Sept 25-28, 2012 in Saint Paul, MN. For more information on the conference (including photos and videos), visit: http://go.nagios.com/nwcna

TRANSCRIPT

Page 1: Nagios Conference 2012 - Andrew Widdersheim - Nagios is down boss wants to see you

Nagios Is Down and Your Boss Wants to See You

Andrew Widdersheim


Page 2

Nooooooooooo!!!

Page 3

Breaking News!

Page 4

Nagios High Availability Options

Merlin by op5

Classic method described in Nagios Core documentation

Some type of virtualized solution like VMware

or…

Page 5

Nagios High Availability

+

= Win


Page 6

DRBD magic

Page 7

DRBD magic

Linbit

Free

Runs in the kernel, either as a module or as part of the mainline code if the kernel is new enough

Each server gets its own independent storage

Able to maintain the data’s consistency between the nodes

Resource level fencing

Page 8

DRBD considerations

DRBD is only as fast as the slowest node

Network latency

Replication over great distances can be done

DRBD Proxy can increase performance over great distances but costs money

Recommend a dedicated crossover link for best performance

Protocol Choices

Protocol A: write I/O is reported as completed once it has reached the local disk and the local TCP send buffer.

Protocol B: write I/O is reported as completed once it has reached the local disk and the remote buffer cache.

Protocol C: write I/O is reported as completed once it has reached both the local and the remote disk.
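The three protocol definitions above differ only in how far a write must travel before it is acknowledged. The sketch below is a toy model of that rule, not DRBD code; the `Stage` names and the `write_completed` helper are made up for illustration.

```python
from enum import Enum


class Stage(Enum):
    """Places a replicated write can have reached (illustrative names)."""
    LOCAL_DISK = "local disk"
    LOCAL_TCP_SEND_BUFFER = "local TCP send buffer"
    REMOTE_BUFFER_CACHE = "remote buffer cache"
    REMOTE_DISK = "remote disk"


# Stages each protocol requires before a write is reported as completed,
# copied from the three definitions on the slide.
REQUIRED = {
    "A": {Stage.LOCAL_DISK, Stage.LOCAL_TCP_SEND_BUFFER},
    "B": {Stage.LOCAL_DISK, Stage.REMOTE_BUFFER_CACHE},
    "C": {Stage.LOCAL_DISK, Stage.REMOTE_DISK},
}


def write_completed(protocol, reached):
    """Return True if a write that has reached `reached` counts as completed."""
    return REQUIRED[protocol] <= set(reached)


# A write that is on the local disk and queued in the local TCP send buffer,
# but has not arrived at the peer yet.
in_flight = {Stage.LOCAL_DISK, Stage.LOCAL_TCP_SEND_BUFFER}

for protocol in "ABC":
    print(protocol, write_completed(protocol, in_flight))
```

In this model only protocol A acknowledges the in-flight write, which is why A tolerates slow links best and C gives the strongest guarantee that the peer has the data.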

Page 9

Pacemaker

Page 10

Pacemaker + DRBD + Nagios

[Diagram: cluster stack]

Resource Manager: Pacemaker

Messaging: Corosync / Heartbeat

Hardware: Node1, Node2

DRBD: Primary, Secondary

Filesystem: ext4

VIP: 192.168.1.57

Nagios Stuff: rrdcached, NSCA, NPCD, Apache, Nagios

Page 11

Pacemaker + DRBD + Nagios

[Diagram: cluster stack]

DRBD: Primary, Secondary

Filesystem: ext4

VIP: 192.168.1.57

Nagios Stuff: rrdcached, NSCA, NPCD, Apache, Nagios

Resource Manager: Pacemaker

Messaging: Corosync / Heartbeat

Hardware: Node1, Node2

Page 12

Pacemaker + DRBD + Nagios

[Diagram: cluster stack]

DRBD: Primary, Secondary

Filesystem: ext4

Services: rrdcached, NSCA, NPCD, Apache, Nagios

VIP: 192.168.1.57

Resource Manager: Pacemaker

Messaging: Corosync / Heartbeat

Hardware: Node1, Node2

Page 13

Pacemaker and Nagios

Page 14

Pacemaker and Nagios

primitive p_nagios lsb:nagios \
    op start interval="0" timeout="180s" \
    op stop interval="0" timeout="40s" \
    op monitor interval="30s" \
    meta target-role="Started"

primitive p_fs_nagios ocf:heartbeat:Filesystem \
    params device="/dev/drbd/by-res/r1" directory="/drbd/r1" fstype="ext4" options="noatime" \
    op start interval="0" timeout="60s" \
    op stop interval="0" timeout="180s" \
    op monitor interval="30s" timeout="40s"

group g_nagios p_fs_nagios p_nagios_ip p_nagios_bacula p_nagios_mysql \
    p_nagios_rrdcached p_nagios_npcd p_nagios_nsca p_nagios_apache \
    p_nagios_syslog-ng p_nagios \
    meta target-role="Started"
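The order of members in the g_nagios group matters: a Pacemaker group keeps its members on one node, starts them in the order listed, and stops them in reverse. The sketch below only illustrates that ordering for the group from the slide; the helper names are made up and nothing here talks to a cluster.

```python
# Members of g_nagios in the order listed on the slide.
G_NAGIOS = [
    "p_fs_nagios", "p_nagios_ip", "p_nagios_bacula", "p_nagios_mysql",
    "p_nagios_rrdcached", "p_nagios_npcd", "p_nagios_nsca",
    "p_nagios_apache", "p_nagios_syslog-ng", "p_nagios",
]


def start_order(group):
    """Group members start one after another, in the order listed."""
    return list(group)


def stop_order(group):
    """Group members stop in the reverse of the listed order."""
    return list(reversed(group))


print("start:", " -> ".join(start_order(G_NAGIOS)))
print("stop: ", " -> ".join(stop_order(G_NAGIOS)))
```

So the DRBD-backed filesystem is mounted before anything that reads from it, and Nagios itself comes up last and goes down first.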

Page 15

Pacemaker and Nagios

Page 16

Pacemaker considerations

Redundant communication links are a must

Recommend a crossover link to help accomplish this

Init scripts for Nagios must be LSB compliant… some are not
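One way to judge whether an init script is LSB compliant is to run it through a fixed sequence of actions and compare the exit statuses with what Pacemaker expects from an lsb: resource. The sketch below encodes that expected sequence as I understand it and reports deviations; it takes the observed statuses as input rather than running a script, and `lsb_violations` is a hypothetical helper, not part of any tool.

```python
# (step, expected exit status), in the order the steps are run.
LSB_EXPECTATIONS = [
    ("start while stopped", 0),
    ("status while running", 0),
    ("start while already running", 0),
    ("stop while running", 0),
    ("status while stopped", 3),
    ("stop while already stopped", 0),
]


def lsb_violations(observed):
    """Compare observed exit statuses (one per step, in order) with LSB."""
    if len(observed) != len(LSB_EXPECTATIONS):
        raise ValueError("need exactly one exit status per step")
    return [
        f"{step}: expected {want}, got {got}"
        for (step, want), got in zip(LSB_EXPECTATIONS, observed)
        if got != want
    ]


# Hypothetical script that reports status 0 even when the service is
# stopped and fails when asked to stop a service that is not running.
for problem in lsb_violations([0, 0, 0, 0, 0, 1]):
    print(problem)
```

A script with either of these two faults would confuse the resource manager: the first makes a stopped Nagios look healthy, the second makes a clean stop look like a failure.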

Page 17

What to replicate?

Configuration

Host

Service

Multi check command files

Webinject command files

PNP4Nagios RRDs

Nagios log files

retention.dat

Mail Queue (eh…)

Page 18

Everything else?

Binaries and main configuration files installed using packages independently on each server

Able to update one node at a time

Easy to roll back should there be an issue

Version/change management

Consistent build process

NDO and MySQL hosted on separate HA cluster

Page 19

RPMs

Build and maintain our own RPMs

Lets us configure everything to our liking

Lets us update at our own pace

Controlled through SVN, with a post-commit hook that automatically updates our own Nagios repository with new packages/updates. Then it is as simple as running “yum update” on the servers.

A lot of upfront work but was worth it

Page 20

How has this helped?

Have been able to repair, upgrade and move hardware with minimal downtime

Updated the OS and restarted servers with minimal downtime

Able to update to 3.4.1 and promptly patch an issue affecting Nagios downtimes that was not caught in QA

CGI pages of death

Page 21

What doesn’t this solve?

Having an HA cluster is great, but there are still things that can go wrong that a cluster does not solve

Configuration issues are probably the most prevalent thing we run into that might bring down Nagios without there being a major hardware/DC issue

We use NagiosQL, which makes a backup whenever a configuration is changed. This lets us roll back unwanted changes, but it isn’t ideal.
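A common guard against configuration mistakes is to run the Nagios pre-flight check (`nagios -v`) and only restart when it passes. The sketch below is one possible shape for that guard and is not something the slides describe; the binary and config paths are assumptions, and the `run` parameter exists only so the logic can be exercised without Nagios installed.

```python
import subprocess

# Assumed locations; adjust to match the local install.
NAGIOS_BIN = "/usr/local/nagios/bin/nagios"
NAGIOS_CFG = "/usr/local/nagios/etc/nagios.cfg"


def config_is_valid(run=subprocess.run):
    """Run the pre-flight check; exit status 0 means the config verified."""
    result = run([NAGIOS_BIN, "-v", NAGIOS_CFG])
    return result.returncode == 0


def restart_if_valid(restart, run=subprocess.run):
    """Call `restart` only when the pre-flight check passes."""
    if not config_is_valid(run):
        return False
    restart()
    return True
```

On a cluster like the one described here, `restart` would wrap the Pacemaker command from the Gotchas slide, for example `lambda: subprocess.run(["crm", "resource", "restart", "p_nagios"], check=True)`.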

Page 22

Two is better than one

Setting up another cluster for “development” with similar hardware and software is a great way to test things outside of production

Lets you spot potential problems before they reach production

Page 23

Monitoring your cluster

check_crm

http://exchange.nagios.org/directory/Plugins/Clustering-and-High-2DAvailability/Check-CRM/details

check_drbd

http://exchange.nagios.org/directory/Plugins/Operating-Systems/Linux/check_drbd/details

check_heartbeat_link

http://exchange.nagios.org/directory/Plugins/Operating-Systems/Linux/check_heartbeat_link/details
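To give a feel for what a DRBD check looks at, here is a minimal sketch that evaluates the text of /proc/drbd and returns a Nagios-style exit code and message. It is not the check_drbd plugin linked above; it assumes the DRBD 8.x line format (`cs:`, `ro:`, `ds:` fields), and `evaluate_drbd` and the sample text are illustrative.

```python
import re

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes

# Matches a device line such as:
#  0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
DEVICE = re.compile(r"^[ \t]*(\d+): cs:(\S+) ro:(\S+) ds:(\S+)", re.MULTILINE)


def evaluate_drbd(proc_drbd_text):
    """Return (exit_code, message) for the given /proc/drbd contents."""
    devices = DEVICE.findall(proc_drbd_text)
    if not devices:
        return CRITICAL, "CRITICAL: no DRBD devices found"
    problems = [
        f"drbd{minor} cs:{cs} ro:{ro} ds:{ds}"
        for minor, cs, ro, ds in devices
        if cs != "Connected" or ds != "UpToDate/UpToDate"
    ]
    if problems:
        return CRITICAL, "CRITICAL: " + ", ".join(problems)
    return OK, f"OK: {len(devices)} device(s) connected and up to date"


# Made-up sample: device 0 is healthy, device 1 has lost its peer.
SAMPLE = """version: 8.3.13 (api:88/proto:86-96)
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
 1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
"""

code, message = evaluate_drbd(SAMPLE)
print(code, message)
```

A real plugin would read /proc/drbd, print the message and exit with the code; it would usually also distinguish WARNING states such as an ongoing resync.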

Page 24

Gotchas

RPMs and symlinks in an HA solution are a bad mix

Example: with the symlink /usr/local/nagios/etc/ -> /drbd/r1/nagios/etc, updating the RPM while the node is secondary will blow the symlink away

Restarting services controlled by Pacemaker should be done within Pacemaker

crm resource restart p_nagios
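For the symlink gotcha, a small post-update check can catch a replaced symlink before the next failover does. The sketch below is one hedged way to do it, using the paths from the slide; `symlink_intact` is a made-up helper name.

```python
import os


def symlink_intact(link, target):
    """True if `link` is a symlink that points at `target`."""
    # Strip trailing slashes: with one, the OS resolves the link first and
    # islink() would report False even for a healthy symlink.
    link = link.rstrip("/") or "/"
    return os.path.islink(link) and os.readlink(link) == target


if __name__ == "__main__":
    ok = symlink_intact("/usr/local/nagios/etc/", "/drbd/r1/nagios/etc")
    print("symlink intact" if ok else "symlink missing or replaced")
```

The check also works on the secondary node, where the DRBD filesystem is not mounted and the link dangles, because it inspects the link itself rather than what it points to.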

Page 25

Quick Stats

Thousands of host and service checks

Average check latency ~0.300 sec

Average checks per second ~70

Mostly active checks polling every 5 minutes

DL360 G5

6 x 146GB 10k SAS drives in RAID 10

2 x quad-core E5450 @ 3.00GHz

8GB Memory
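The check rate and the polling interval above give a rough size for the installation. The arithmetic below assumes every check is active and runs exactly once per 5-minute interval, which the slide only says is "mostly" true, so treat the result as an order-of-magnitude estimate.

```python
CHECKS_PER_SECOND = 70            # "~70" from the slide
CHECK_INTERVAL_SECONDS = 5 * 60   # "every 5 minutes"

# Number of checks needed to sustain the observed rate if each one runs
# exactly once per interval.
estimated_checks = CHECKS_PER_SECOND * CHECK_INTERVAL_SECONDS

print(estimated_checks)  # 21000
```

That is consistent with the "thousands of host and service checks" figure.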

Page 26

Tuning

RAM disk for check results queue, NPCD queue, objects.cache and status.dat

NDOUtils with async patch

Built in since version 1.5

Limit what you send to NDOUtils

Bulk Mode with npcdmod

rrdcached 

Restarting Nagios through external command eventually resulted in higher latencies for some reason

Large installation tweaks

Disable environment macros

A lot of trial and error with scheduling and reaper frequencies

Small amount of check optimization

Measuring Nagios performance using PNP4Nagios is a must
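Several of the tuning items above correspond to nagios.cfg directives, so they can be audited mechanically. The sketch below checks a config excerpt for the large installation tweaks, disabled environment macros, and RAM-disk locations for the check results queue, objects.cache and status.dat. The /dev/shm mount point and the sample excerpt are assumptions; the NPCD queue is configured elsewhere and is not covered.

```python
RAM_DISK = "/dev/shm"  # hypothetical RAM disk mount point

# nagios.cfg directives for files the slide moves to a RAM disk.
RAM_DISK_KEYS = ("check_result_path", "object_cache_file", "status_file")


def parse_cfg(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def audit(text, ram_disk=RAM_DISK):
    """Return a list of tuning items from the slide that are not applied."""
    cfg = parse_cfg(text)
    findings = []
    if cfg.get("use_large_installation_tweaks") != "1":
        findings.append("use_large_installation_tweaks is not 1")
    if cfg.get("enable_environment_macros") != "0":
        findings.append("enable_environment_macros is not 0")
    for key in RAM_DISK_KEYS:
        if not cfg.get(key, "").startswith(ram_disk + "/"):
            findings.append(f"{key} is not under {ram_disk}")
    return findings


# Made-up excerpt with two items still to fix.
SAMPLE = """
check_result_path=/dev/shm/nagios/checkresults
object_cache_file=/dev/shm/nagios/objects.cache
status_file=/usr/local/nagios/var/status.dat
use_large_installation_tweaks=1
enable_environment_macros=1
"""

for finding in audit(SAMPLE):
    print(finding)
```

Reaper and scheduling frequencies are left out on purpose: the slide describes those as trial and error, so there is no single correct value to test for.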

Page 27

RAM disk + ndo-async + rrdcached

Page 28

non-external command file restarts

Page 29

nsca-2.9

Page 30

One Year’s Progress

Page 31

How we run today

Page 32

Quick Stats

Questions?