deploying nagios in a large enterprise
TRANSCRIPT
![Page 2: Deploying Nagios in a Large Enterprise](https://reader034.vdocuments.us/reader034/viewer/2022042723/586e18ec1a28ab14178b886a/html5/thumbnails/2.jpg)
or...If you strap enough
rockets to a brick, you can make it fly
![Page 3: Deploying Nagios in a Large Enterprise](https://reader034.vdocuments.us/reader034/viewer/2022042723/586e18ec1a28ab14178b886a/html5/thumbnails/3.jpg)
In the beginning...
• New Linux HPC skunkworks project
• Catastrophic success
• Need monitoring added yesterday
![Page 4: Deploying Nagios in a Large Enterprise](https://reader034.vdocuments.us/reader034/viewer/2022042723/586e18ec1a28ab14178b886a/html5/thumbnails/4.jpg)
Looking for Solutions
• Why not use what we already had?
• Stability problems
• Resource utilization problems
• Custom agents were hard
• No support for our new Linux platform
![Page 5: Deploying Nagios in a Large Enterprise](https://reader034.vdocuments.us/reader034/viewer/2022042723/586e18ec1a28ab14178b886a/html5/thumbnails/5.jpg)
Why Nagios?
• Purchasing a new commercial solution was politically difficult
• At the time (2003) nagios was the most mature of the open source solutions
• Good community support
![Page 6: Deploying Nagios in a Large Enterprise](https://reader034.vdocuments.us/reader034/viewer/2022042723/586e18ec1a28ab14178b886a/html5/thumbnails/6.jpg)
The Naive Approach(and why it didn’t work)
• Performance Problems
• Configuration Management Problems
• Availability Problems
• Integration / Automation Requirements
![Page 7: Deploying Nagios in a Large Enterprise](https://reader034.vdocuments.us/reader034/viewer/2022042723/586e18ec1a28ab14178b886a/html5/thumbnails/7.jpg)
Performance Problems
• State check performance
• Active checks: ~3 checks / second maximum
• Statistics performance
• fork()/exec() for every sample
• Web UI performance
• Large configurations take a long time to display (much improved in 2.x)
![Page 8: Deploying Nagios in a Large Enterprise](https://reader034.vdocuments.us/reader034/viewer/2022042723/586e18ec1a28ab14178b886a/html5/thumbnails/8.jpg)
Configuration Management Problems
• Configuration files are very verbose (even with templates)
• Syntax errors are easy
• Keeping up with a high churn rate in monitored servers is expensive
![Page 9: Deploying Nagios in a Large Enterprise](https://reader034.vdocuments.us/reader034/viewer/2022042723/586e18ec1a28ab14178b886a/html5/thumbnails/9.jpg)
• Hardware / software failures
• Building power-downs
• Patches / upgrades
• Who watches the watchers?
Availability Problems
![Page 10: Deploying Nagios in a Large Enterprise](https://reader034.vdocuments.us/reader034/viewer/2022042723/586e18ec1a28ab14178b886a/html5/thumbnails/10.jpg)
Integration / Automation Requirements
• Alarms need to be dispatched to our existing alerting and escalation system
• Alarms need to be suppressed by existing maintenance tools
• Provisioning should flow from our existing provisioning system
![Page 11: Deploying Nagios in a Large Enterprise](https://reader034.vdocuments.us/reader034/viewer/2022042723/586e18ec1a28ab14178b886a/html5/thumbnails/11.jpg)
Solving the Problems
• Move to passive checks
• Run multiple nagios instances
• Deploy HA nagios servers
• Use data-driven configuration file generation
• Create a custom notification back end
![Page 12: Deploying Nagios in a Large Enterprise](https://reader034.vdocuments.us/reader034/viewer/2022042723/586e18ec1a28ab14178b886a/html5/thumbnails/12.jpg)
Passive Checks
• Move most of the work to the clients
• Batch server updates unless a state change occurs
• Randomize server update times to avoid spikes
• Queue the results on the server
• Send statistics to a different back end
![Page 13: Deploying Nagios in a Large Enterprise](https://reader034.vdocuments.us/reader034/viewer/2022042723/586e18ec1a28ab14178b886a/html5/thumbnails/13.jpg)
Passive Data FlowClient 1 Client n
...
... Server n
monqueue
stats-catcher
... nagios n
nagios 1
Server 1
nagios-agent nagios-agent
monqueue
stats-catcher
... nagios n
nagios 1
![Page 14: Deploying Nagios in a Large Enterprise](https://reader034.vdocuments.us/reader034/viewer/2022042723/586e18ec1a28ab14178b886a/html5/thumbnails/14.jpg)
Multiple Nagios Instances
• Run many copies of nagios on one server
• Improve web UI performance
• Show each group only their own servers, so the top level dashboard is more useful
• Allow per-group customizations
![Page 15: Deploying Nagios in a Large Enterprise](https://reader034.vdocuments.us/reader034/viewer/2022042723/586e18ec1a28ab14178b886a/html5/thumbnails/15.jpg)
HA Nagios Servers
• Run multiple nagios servers on different machines in different buildings
• All clients update all servers
• A heartbeat is published through each server to its counterpart
• Notifications are suppressed on slaves if the heartbeat service is OK
• Partitioned masters can cause duplicate alerts
![Page 16: Deploying Nagios in a Large Enterprise](https://reader034.vdocuments.us/reader034/viewer/2022042723/586e18ec1a28ab14178b886a/html5/thumbnails/16.jpg)
HA Data Flowprimary server secondary server
client
nagios-agent
nagios
monqueueheartbeat
notify
monqueue
nagios
ping
notify
heartbeat
ping
![Page 17: Deploying Nagios in a Large Enterprise](https://reader034.vdocuments.us/reader034/viewer/2022042723/586e18ec1a28ab14178b886a/html5/thumbnails/17.jpg)
Data-driven Configuration File Generation
• Leverage our existing host database and provisioning tools
• Generate client and server configurations via cfengine based on templates and database lookups
• Mostly driven by database data, with some per-server threshold overrides managed in cfengine master files
![Page 18: Deploying Nagios in a Large Enterprise](https://reader034.vdocuments.us/reader034/viewer/2022042723/586e18ec1a28ab14178b886a/html5/thumbnails/18.jpg)
Custom NotificationBack End
• Custom code integrates with our Netcool infrastructure
• Alarm suppression based on external criteria
• Also supports email alerts, optionally batched
![Page 19: Deploying Nagios in a Large Enterprise](https://reader034.vdocuments.us/reader034/viewer/2022042723/586e18ec1a28ab14178b886a/html5/thumbnails/19.jpg)
Design Trade-offs
• Batch updates mean slow detection of “zombie” hosts (ping-able, but not running user processes)
• nagios’ notification escalation doesn’t work well without active checks, especially if updates are batched
• Requires configuration management
• More complexity = more bugs
![Page 20: Deploying Nagios in a Large Enterprise](https://reader034.vdocuments.us/reader034/viewer/2022042723/586e18ec1a28ab14178b886a/html5/thumbnails/20.jpg)
nagios-agentDesign Criteria
• Lightweight
• Easy to write and deploy additional agents
• Avoid fork()/exec() where possible
• Support agent callbacks to avoid blocking
• Support “proxy” agents to monitor other devices where we can’t deploy nagios-agent
• Evaluate all thresholds locally and batch server updates
![Page 21: Deploying Nagios in a Large Enterprise](https://reader034.vdocuments.us/reader034/viewer/2022042723/586e18ec1a28ab14178b886a/html5/thumbnails/21.jpg)
nagios-agentnagios-agent
Agent
Class: Ping
Instance: db
Categories: appdev, sa
Args:
hostlist => /my/dblist,
latency => [ 50, 100],
losspct => [ 66, 100 ],
count => 3
Agent
Class: Ping
Instance: LDN
Categories: sa
Args:
hostlist => /my/ldnlist,
latency => [ 100, 200],
losspct => [ 66, 100 ],
count => 3
Server
Class: monqueue
Categories: appdev
Args:
auth => GSSAPI
keytab => mykeytab
server => myserver,
queue => 0
Server
Class: monqueue
Categories: sa
Args:
auth => GSSAPI
keytab => mykeytab
server => myserver,
queue => 1
![Page 22: Deploying Nagios in a Large Enterprise](https://reader034.vdocuments.us/reader034/viewer/2022042723/586e18ec1a28ab14178b886a/html5/thumbnails/22.jpg)
monqueueDesign Criteria
• Fast
• Secure
• Accept data from clients, and dispatch to multiple output queues
• Supply heartbeats to nagios
• Supply queue depth stats to stats-catcher
![Page 23: Deploying Nagios in a Large Enterprise](https://reader034.vdocuments.us/reader034/viewer/2022042723/586e18ec1a28ab14178b886a/html5/thumbnails/23.jpg)
Client Evolution• nagios-agent slowly grew features as they became
required
• multiple agent instances
• agent instance to server mapping
• auto reload of configuration, modules on update
• auto re-exec of nagios-agent on update
• stats collection
• SASL authentication to monqueue
![Page 24: Deploying Nagios in a Large Enterprise](https://reader034.vdocuments.us/reader034/viewer/2022042723/586e18ec1a28ab14178b886a/html5/thumbnails/24.jpg)
Server Evolution
• Started as one monolithic instance
• As deployment spread, split into multiple instances based on administrative domain
• added HA
• added SASL authentication and authorization
• added monitoring of monqueue itself, and service dependencies (so a monqueue failure didn’t trigger alarms for all services)
![Page 25: Deploying Nagios in a Large Enterprise](https://reader034.vdocuments.us/reader034/viewer/2022042723/586e18ec1a28ab14178b886a/html5/thumbnails/25.jpg)
Kudzu
• Originally for one project, fewer than 200 hosts
• Eventually used for large sections of the environment
• Documentation and internal consultancy are critical for user acceptance
![Page 26: Deploying Nagios in a Large Enterprise](https://reader034.vdocuments.us/reader034/viewer/2022042723/586e18ec1a28ab14178b886a/html5/thumbnails/26.jpg)
One of Our Servers
• A single HP DL385 G1 (dual 2.6GHz Opteron, 4GB RAM), running RHEL4 U4, nagios 2.9
• 11 nagios instances
• 27,000+ services (mostly 2 minute intervals)
• 6,600+ hosts
• ~10% CPU
• ~500 MB RAM
![Page 27: Deploying Nagios in a Large Enterprise](https://reader034.vdocuments.us/reader034/viewer/2022042723/586e18ec1a28ab14178b886a/html5/thumbnails/27.jpg)
Still to Come• Source code release
• Encryption of nagios-agent / monqueue communications
• Support for “pulling” status from nagios-agent to better support DMZ environments
• Statistical analysis of multiple data samples to determine service status
• Yet more agent plugins
• nagios-agent support for traditional nagios plugins