Mike Guthrie - Revamping Your 10 Year Old Nagios Installation
![Page 2: Mike Guthrie - Revamping Your 10 Year Old Nagios Installation](https://reader031.vdocuments.us/reader031/viewer/2022020119/587565cc1a28abd80a8b507f/html5/thumbnails/2.jpg)
Case Study: Red Ventures
• Digital Marketing Company
• Acquire customers for our partners
– Optimize SEO for websites
– Take inbound call volume for sales calls
![Page 3: Mike Guthrie - Revamping Your 10 Year Old Nagios Installation](https://reader031.vdocuments.us/reader031/viewer/2022020119/587565cc1a28abd80a8b507f/html5/thumbnails/3.jpg)
RV Technology Notes
• LAMP Environment - PHP and JS
• We LOVE data – Many TB of DB storage
• We move fast…think Agile development on steroids.
• 50-60 in-house developers
• Redundancy – CLT and ATL datacenters
• Almost everything is clustered
• Our speed often creates technical debt
![Page 4: Mike Guthrie - Revamping Your 10 Year Old Nagios Installation](https://reader031.vdocuments.us/reader031/viewer/2022020119/587565cc1a28abd80a8b507f/html5/thumbnails/4.jpg)
March 2015 – Nagios Profile
• 2 Nagios Installations – CLT and ATL
• 1100 Hosts/8000 Services (Now 1500/13000)
– Linux servers (web, mysql, cron, load balancers)
– Windows servers (phone, terminal)
– Network (Routers, UPS, PDU)
• PNP4Nagios for Performance Data
• Thruk UI
![Page 5: Mike Guthrie - Revamping Your 10 Year Old Nagios Installation](https://reader031.vdocuments.us/reader031/viewer/2022020119/587565cc1a28abd80a8b507f/html5/thumbnails/5.jpg)
Key Problems
• No system for managing configs whatsoever
• No consistency between ATL and CLT in setup
• Adding one check to a server type meant touching hundreds of files
• Terrible alert storms
• Misdirected or missing alerts
• Lots of hosts not being monitored at all
• ATL latency problems
• Broken escalations
• No effective historical reporting
![Page 6: Mike Guthrie - Revamping Your 10 Year Old Nagios Installation](https://reader031.vdocuments.us/reader031/viewer/2022020119/587565cc1a28abd80a8b507f/html5/thumbnails/6.jpg)
What Every Engineer Wants To Hear
![Page 7: Mike Guthrie - Revamping Your 10 Year Old Nagios Installation](https://reader031.vdocuments.us/reader031/viewer/2022020119/587565cc1a28abd80a8b507f/html5/thumbnails/7.jpg)
Goals
• Manageable configuration
• Minimize time spent on maintenance
• Reporting / Dashboards / Visualization
• Scalability
• Noise reduction
![Page 8: Mike Guthrie - Revamping Your 10 Year Old Nagios Installation](https://reader031.vdocuments.us/reader031/viewer/2022020119/587565cc1a28abd80a8b507f/html5/thumbnails/8.jpg)
Step 1: Fix Config Management
• Version controlled and synced:
– Contacts
– Templates
– Commands
– Hostgroups
– Escalations
– Dependencies
• Decoupled:
– Hosts
– One-off escalations and dependencies
![Page 9: Mike Guthrie - Revamping Your 10 Year Old Nagios Installation](https://reader031.vdocuments.us/reader031/viewer/2022020119/587565cc1a28abd80a8b507f/html5/thumbnails/9.jpg)
Step 1: Fix Config Management
• All hosts / services use templates
• Almost all service checks are applied through Service -> Hostgroup relationships
• Hostgroup = roles / attributes
– linux-server (Load, Memory, Disk, Procs, etc)
– mysql-server (Mysql, Slaving, Storage partition)
– supervisord-server (supervisord procs running)
• Use host variables for differing ports, SNMP strings, active disk partitions, etc
![Page 10: Mike Guthrie - Revamping Your 10 Year Old Nagios Installation](https://reader031.vdocuments.us/reader031/viewer/2022020119/587565cc1a28abd80a8b507f/html5/thumbnails/10.jpg)
Host Config Before
define host {
    host_name rv-atl-serverl01
    alias rv-atl-serverl01
    use rv-routers
    hostgroups MsSQL
    contact_groups phoneops
}
define service {
    host_name rv-atl-serverl01
    use rv-routers-service
    service_description PING
    contact_groups phoneops
    check_command check_ping!200.0,30%!300.0,70%
}
define service {
    host_name rv-atl-serverl01
    use rv-windows-service
    service_description DISK
    contact_groups phoneops
    check_command check_windows_disk!public!CHIJKM!80!85
}
define service {
    host_name rv-atl-serverl01
    use rv-windows-service
    service_description CPU LOAD
    check_command check_snmp_load_windows!public!50!80
    contact_groups phoneops
    check_period 24x7MinusSQLBackup
}
define service {
    host_name rv-atl-serverl01
    use rv-windows-service
    service_description SQL Service
    check_command check_windows_service!public!SQL Server \\(MSSQLSERVER\\)
    contact_groups phoneops
}
define service {
    host_name rv-atl-serverl01
    use rv-windows-service
    service_description VIRTUAL MEMORY USAGE
    check_command check_snmp_misc!public!Virtual Memory!90!95
    contact_groups phoneops
}
Host Config After
define host {
    host_name rv-atl-serverl01
    alias rv-atl-serverl01
    use mssql-server
    hostgroups windows-server,mssql-server
    _SNMP public
}
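The slimmed-down host definition works because the service checks are attached to hostgroups rather than to individual hosts. A minimal sketch of what one such shared service definition could look like, rendered from Python: the `render_service` helper is hypothetical and not from the talk, while the template, hostgroup, and command names are reused from the slide. `$_HOSTSNMP$` is the standard Nagios macro for a host custom variable named `_SNMP`.

```python
def render_service(hostgroup, description, check_command, template):
    """Return a Nagios service definition applied to every host in a hostgroup.

    Hypothetical helper for illustration; the directive names are standard
    Nagios object directives.
    """
    lines = [
        "define service {",
        f"    use                  {template}",
        f"    hostgroup_name       {hostgroup}",
        f"    service_description  {description}",
        f"    check_command        {check_command}",
        "}",
    ]
    return "\n".join(lines)


# One definition covers every member of the windows-server hostgroup.
# The SNMP community comes from each host's _SNMP custom variable.
cpu_load = render_service(
    hostgroup="windows-server",
    description="CPU LOAD",
    check_command="check_snmp_load_windows!$_HOSTSNMP$!50!80",
    template="rv-windows-service",
)
print(cpu_load)
```

With definitions like this, adding a check to a server type means editing one service object instead of touching every host file.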
![Page 11: Mike Guthrie - Revamping Your 10 Year Old Nagios Installation](https://reader031.vdocuments.us/reader031/viewer/2022020119/587565cc1a28abd80a8b507f/html5/thumbnails/11.jpg)
Step 2: Config Automation
• Most of our servers are puppet managed (transitioning to Salt)
• Linux machines need to be self-aware of what they need to have monitored
• Linux servers use passive checks to propagate themselves up to Nagios
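The talk does not say how the passive results travel to the Nagios server, so the following is only a sketch of the payload. It formats a result as a standard Nagios `PROCESS_SERVICE_CHECK_RESULT` external command line; the `passive_result` function name and the sample values are assumptions for illustration.

```python
import time


def passive_result(host, service, return_code, output, now=None):
    """Format one passive service check result as a Nagios external command.

    The transport (NSCA, a webhook, writing to the command file) is left
    out because the slides do not specify it.
    """
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN)")
    timestamp = int(time.time() if now is None else now)
    # Keep the plugin output on a single line; the command is newline-terminated.
    clean_output = output.replace("\n", " ")
    return f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{return_code};{clean_output}"


print(passive_result("rv-atl-serverl01", "Load", 0, "OK - load average: 0.42"))
```

A crontab entry on the remote host could run the local check and submit a line like this on each interval.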
![Page 12: Mike Guthrie - Revamping Your 10 Year Old Nagios Installation](https://reader031.vdocuments.us/reader031/viewer/2022020119/587565cc1a28abd80a8b507f/html5/thumbnails/12.jpg)
[Diagram: config automation flow. Arrow labels: Generate, Enforces, Process, Result]
Config Manager
• Puppet
• Salt
Remote Host
• NRPE Config
• Passive Crontab
Nagios / Webhook
• Does this host exist?
• Add Host
• Verify
• Restart
• Notify
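The "does this host exist?" decision and the add, verify, restart, notify steps from the diagram can be sketched as below. This is an illustration under assumptions, not the author's webhook: every function name is hypothetical, and the side effects (writing the file, verifying the config, restarting Nagios, sending a notice) are passed in as callables so the flow itself stays visible.

```python
def render_host(host_name, template, hostgroups):
    """Render a minimal host definition in the style of the 'after' config."""
    return "\n".join([
        "define host {",
        f"    host_name   {host_name}",
        f"    alias       {host_name}",
        f"    use         {template}",
        f"    hostgroups  {','.join(hostgroups)}",
        "}",
    ])


def handle_checkin(host_name, template, hostgroups, known_hosts,
                   write_config, remove_config, verify, restart, notify):
    """Add an unknown host to monitoring, or do nothing if it already exists."""
    if host_name in known_hosts:
        return "exists"
    write_config(host_name, render_host(host_name, template, hostgroups))
    if not verify():
        # Roll back so a bad definition cannot block the next restart.
        remove_config(host_name)
        notify(f"config verification failed, {host_name} was not added")
        return "rejected"
    restart()
    known_hosts.add(host_name)
    notify(f"added {host_name} to monitoring")
    return "added"
```

In a real deployment `verify` would typically wrap the Nagios pre-flight check (`nagios -v` against the main config file) before any restart is attempted.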
![Page 13: Mike Guthrie - Revamping Your 10 Year Old Nagios Installation](https://reader031.vdocuments.us/reader031/viewer/2022020119/587565cc1a28abd80a8b507f/html5/thumbnails/13.jpg)
Result
• CONSISTENCY!
• All Linux configs are now auto-generated
• Everything else is either cloned or generated from a custom webtool
• Maintenance time went from 10-20 hours per week to less than 1 hour most weeks
![Page 14: Mike Guthrie - Revamping Your 10 Year Old Nagios Installation](https://reader031.vdocuments.us/reader031/viewer/2022020119/587565cc1a28abd80a8b507f/html5/thumbnails/14.jpg)
Step 3: Reporting / Visualization
• Need NOC-level visibility
• Need cluster-level views of performance data
• Historical view of state changes and notifications
![Page 15: Mike Guthrie - Revamping Your 10 Year Old Nagios Installation](https://reader031.vdocuments.us/reader031/viewer/2022020119/587565cc1a28abd80a8b507f/html5/thumbnails/15.jpg)
[Diagram: reporting architecture]
Sources: Nagios Core CLT, Nagios Core ATL
Data stores: Perfdata (Carbon), Ndoutils (Mysql), Event Server? (TODO)
Front ends: Grafana Dashboards, Custom NOC Dashboard, Thruk UI
![Page 16: Mike Guthrie - Revamping Your 10 Year Old Nagios Installation](https://reader031.vdocuments.us/reader031/viewer/2022020119/587565cc1a28abd80a8b507f/html5/thumbnails/16.jpg)
NOC Dashboard
![Page 17: Mike Guthrie - Revamping Your 10 Year Old Nagios Installation](https://reader031.vdocuments.us/reader031/viewer/2022020119/587565cc1a28abd80a8b507f/html5/thumbnails/17.jpg)
![Page 18: Mike Guthrie - Revamping Your 10 Year Old Nagios Installation](https://reader031.vdocuments.us/reader031/viewer/2022020119/587565cc1a28abd80a8b507f/html5/thumbnails/18.jpg)
Graphite + Grafana = Awesome
• Opted not to use Graphios
service_perfdata_file_template=$TIMET$\t$HOSTNAME$\t$SERVICEDESC$\t$SERVICEPERFDATA$
service_perfdata_file_processing_command=<customScript>
• Nagios writes to buffer file
• Custom script grabs the buffer and flushes it to carbon
• Multiline socket write over UDP
• Will send 1000 data points in less than 0.03 seconds
• Carbon can scale far beyond anything we can throw at it
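The custom script itself is not shown in the slides. A minimal sketch of the idea, assuming the four tab-separated fields of the perfdata template and carbon's plaintext protocol (`path value timestamp`, one metric per line): the function names, the `prefix` argument, the batch size, and the use of port 2003 are assumptions, and the perfdata parsing is simplified to numeric `label=value` pairs.

```python
import re
import socket

# Matches label=value at the start of each perfdata token; labels may be
# single-quoted. Units and thresholds after the value are ignored.
PERF_RE = re.compile(r"(?:^|\s)(?:'([^']+)'|([^\s=']+))=(-?\d+(?:\.\d+)?)")


def sanitize(part):
    """Replace characters that are awkward in a Graphite path with underscores."""
    return re.sub(r"[^A-Za-z0-9_-]", "_", part)


def perfdata_to_carbon(line, prefix):
    """Turn one buffer line (time, host, service, perfdata) into carbon lines."""
    timet, host, service, perfdata = line.rstrip("\n").split("\t", 3)
    metrics = []
    for quoted, bare, value in PERF_RE.findall(perfdata):
        label = quoted or bare
        path = ".".join([prefix, sanitize(host), sanitize(service), sanitize(label)])
        metrics.append(f"{path} {value} {timet}")
    return metrics


def flush_to_carbon(metrics, host, port=2003, batch=50):
    """Send metrics as multiline UDP datagrams; batching keeps datagrams small."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for start in range(0, len(metrics), batch):
            payload = "\n".join(metrics[start:start + batch]) + "\n"
            sock.sendto(payload.encode("ascii"), (host, port))
    finally:
        sock.close()
```

A buffer line for service "CPU Load" on an ATL host would produce paths shaped like `atl.<host>.CPU_Load.load5`, which is the naming the Grafana queries in this deck rely on. Carbon only accepts UDP when its UDP listener is enabled.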
![Page 19: Mike Guthrie - Revamping Your 10 Year Old Nagios Installation](https://reader031.vdocuments.us/reader031/viewer/2022020119/587565cc1a28abd80a8b507f/html5/thumbnails/19.jpg)
Grafana
• Makes combining and templating graphs EASY
• Can combine all sorts of metrics on a graph and perform a variety of mathematical functions on them
atl.rv-atl-server*.CPU_Load.load5
*.rv-{atl,clt}-server*.CPU_Load.load
• Can set up new NOC dashboards in minutes
• Also using this for application monitoring data
• Allows us to easily spot performance anomalies
• Helps with event correlation
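The two path expressions on this slide rely on Graphite's wildcard rules: `*` matches within a single dot-separated node, and `{a,b}` selects alternatives. The sketch below approximates that matching in plain Python to show which metric names such a query would pick up. It is not Graphite's own implementation, and `expand_braces` and `path_matches` are hypothetical names.

```python
import fnmatch
import re


def expand_braces(pattern):
    """Expand {a,b} alternatives into a list of plain glob patterns."""
    found = re.search(r"\{([^{}]*)\}", pattern)
    if not found:
        return [pattern]
    head, tail = pattern[:found.start()], pattern[found.end():]
    expanded = []
    for alternative in found.group(1).split(","):
        expanded.extend(expand_braces(head + alternative + tail))
    return expanded


def path_matches(pattern, metric):
    """Match node by node, so a wildcard never crosses a dot."""
    metric_nodes = metric.split(".")
    for candidate in expand_braces(pattern):
        pattern_nodes = candidate.split(".")
        if len(pattern_nodes) != len(metric_nodes):
            continue
        if all(fnmatch.fnmatchcase(node, glob)
               for node, glob in zip(metric_nodes, pattern_nodes)):
            return True
    return False


query = "*.rv-{atl,clt}-server*.CPU_Load.load"
print(path_matches(query, "atl.rv-atl-server01.CPU_Load.load"))
print(path_matches(query, "clt.rv-clt-server22.CPU_Load.load"))
```

Because host names follow a consistent scheme, one expression like this can drive a templated graph for a whole cluster in both datacenters.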
![Page 20: Mike Guthrie - Revamping Your 10 Year Old Nagios Installation](https://reader031.vdocuments.us/reader031/viewer/2022020119/587565cc1a28abd80a8b507f/html5/thumbnails/20.jpg)
This is OK
This is not OK
![Page 21: Mike Guthrie - Revamping Your 10 Year Old Nagios Installation](https://reader031.vdocuments.us/reader031/viewer/2022020119/587565cc1a28abd80a8b507f/html5/thumbnails/21.jpg)
TODO
• Event Automation – create automatic response tasks to known issues with common fixes
• Better connectivity to application monitoring
• Adaptive monitoring for situations like this:
![Page 22: Mike Guthrie - Revamping Your 10 Year Old Nagios Installation](https://reader031.vdocuments.us/reader031/viewer/2022020119/587565cc1a28abd80a8b507f/html5/thumbnails/22.jpg)
Implementation
• Left the old servers alone and running
• Spun up new servers with notifications and event handling disabled
• Migrated 600+ configs by hand
• 500+ generated automatically
• Problem states were perfect for identifying what wasn’t set up yet
• Took about 6 weeks to migrate configs to new machines
• Launch day was changing Thruk’s backend config to point to new servers, and switching over notifications
• Audit, design, migration, and stable implementation took about 90 days
![Page 23: Mike Guthrie - Revamping Your 10 Year Old Nagios Installation](https://reader031.vdocuments.us/reader031/viewer/2022020119/587565cc1a28abd80a8b507f/html5/thumbnails/23.jpg)
Things I Learned
• Take the time to understand what you’re monitoring
• Lack of understanding will produce alert noise, which is ineffective monitoring
• In a complex system, log everything
• Small changes do the most damage
• Automation is cool except for when it automatically sends 60% of your environment into a CPU death spiral
• I wish Nagios Core allowed hostgroup exclusions in service definitions (hint, hint)
• Nagios is still the best monitoring tool out there
![Page 24: Mike Guthrie - Revamping Your 10 Year Old Nagios Installation](https://reader031.vdocuments.us/reader031/viewer/2022020119/587565cc1a28abd80a8b507f/html5/thumbnails/24.jpg)
Thank you!
Any Questions?
![Page 25: Mike Guthrie - Revamping Your 10 Year Old Nagios Installation](https://reader031.vdocuments.us/reader031/viewer/2022020119/587565cc1a28abd80a8b507f/html5/thumbnails/25.jpg)