TRANSCRIPT
Using Nagios XI as the platform for Monitoring as a Service
Bryan Heden
Introduction and Agenda
I’m Bryan Heden, Director of Systems at Agile Networks, headquartered in Canton, Ohio
• Who we are and what we do
• Some customers of ours
• Last year’s presentation recap
• The major problems we were faced with
• Solving hardware issues
• Automating user management and multi-tenancy
• Further configuration wizard and component customizations
• MRTG overloads and other issues
• Remote MRTG bandwidth polling and Nagios checks
• Empowering standard users
• Geospatial information system integration
• Conclusion
2
Who we are and what we do
Agile Networks
We engineer and operate The Agile Network, a general-purpose backhaul network with Last-Mile Agility™
We provide world-class connectivity to:
• The public sector (Public Safety)
• Tier 1 Carriers
• The Oil and Gas industry
• Underserved communities
• Business and Residential customers
• Wireless Internet Service Providers
3
Some customers of ours
4
Last year’s presentation recap
10,000 Services (and growing!) Across the State of Ohio
Choosing Nagios XI and ModGearman
• Easy to use and understandable front-end interface
• ModGearman’s distributed checks
Customizing configuration wizards and components
• Specialized config wizards for our networking equipment
• NOC Overview map to provide geospatially based status info
• ModGearman management, Smokeping component and portal
Offloading MRTG, MySQL, Smokeping and IO improvements
• Upgraded hardware several times to keep up
• Offloaded MRTG, split the processes up
• Offloaded MySQL
• Installed and then immediately offloaded Smokeping
5
The major problems we were faced with
Midnight alerts should power cycle my coffee maker
• IOwait was continuing to grow and memory was limited
• Engineers need to see backhaul, sales needs to see customer equipment
• Our configuration wizards’ defects became glaringly obvious
• MRTG graphs were starting to sawtooth again (checks completing only once every 10-15 minutes)
• The need arose to segment some bandwidth polling and Nagios checks entirely away from our backhaul network
• Network or Sales Engineers should not have to be Nagios Administrators to remove devices from monitoring
• The basic NOC Overview map was fast approaching end-of-life. We needed a better way to manage geospatial data that could be
utilized by more than one team of engineers
6
Solving hardware issues
IOwait was continuing to grow and memory was limited
• We had already migrated hardware several times
• Latest migration was to a 24-SSD DAS (12Gb/s SAS) array attached to a 3-node VMware 6 cluster
• XI VM has 24 cores, 24GB RAM
• MRTG and MySQL VMs are similar
• Several RAMDisks are in use
• Famous Last Words:
“I haven’t seen IOwait over 1% in a long long time now!”
7
Automating user management and multi-tenancy
Engineers need to see backhaul, sales needs to see customer equipment
• We needed to limit the views of company department users (network
engineers, network operations, sales engineers, operations) and
telecommunication customer users (public safety, oil and gas, wireless
resellers)
• Automating this process was on the roadmap for far too long before it was
developed. Manual maintenance was a nightmare!
• We built an intermediary database that manages user groups, e.g. Agile Networks Engineering
8
Automating user management and multi-tenancy
Engineers need to see backhaul, sales needs to see customer equipment
• This database links those user groups with contactgroups and default
hostgroups
• We have a component/portal that populates the database upon user group
creation, and uses that data to create users in Nagios and assign them to the
proper contactgroup upon creation
• The default hostgroup is useful for ModGearman and also for keeping track of
who is tracking what
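The user-group automation above can be sketched roughly as follows. This is an illustrative assumption of the intermediary database, not Agile Networks’ actual schema; the group, contactgroup, and hostgroup names are made up:

```python
import sqlite3

# Hypothetical sketch of the intermediary multi-tenancy database:
# each user group maps to a Nagios contactgroup and a default hostgroup.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE user_groups (
    name              TEXT PRIMARY KEY,
    contactgroup      TEXT NOT NULL,
    default_hostgroup TEXT NOT NULL)""")
db.executemany(
    "INSERT INTO user_groups VALUES (?, ?, ?)",
    [("Agile Networks Engineering", "engineering-contacts", "backhaul-devices"),
     ("Acme Wireless Reseller",     "acme-contacts",        "acme-devices")])

def provision_user(username, group_name):
    """Look up the Nagios objects a new user should be attached to."""
    row = db.execute(
        "SELECT contactgroup, default_hostgroup FROM user_groups WHERE name = ?",
        (group_name,)).fetchone()
    if row is None:
        raise ValueError(f"unknown user group: {group_name}")
    contactgroup, default_hostgroup = row
    # A real component would now create the user in Nagios XI and
    # add the contact to `contactgroup`; here we just return the mapping.
    return {"user": username, "contactgroup": contactgroup,
            "default_hostgroup": default_hostgroup}
```

A component/portal built on this lookup can then create the user and assign the contactgroup in one pass, which is the manual step that used to be the nightmare.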
9
Further configuration wizard and component customizations
Our configuration wizards’ defects became glaringly obvious
• The old configuration wizards were specific to a device type. Service checks were added to each host individually, which became a problem whenever we introduced a new OID to monitor, or needed to remove one!
• We wrote a script that generates each configuration wizard from a generic template. While creating the config wizard, it also creates the device hostgroup that any device created with that wizard will be added to.
10
Further configuration wizard and component customizations
Our configuration wizards’ defects became glaringly obvious
• Now, we assign service checks to those device hostgroups (Satellites
Tracked, Temperature, SysUpTime). If we ever need to make a change, we
make it at one place, and it is applied to all devices.
• We still gather interface information via MRTG’s cfgmaker command, and allow the user to decide which ports they want checks performed on.
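The hostgroup-wide check idea can be illustrated with a small generator. The service descriptions and the "define once, apply everywhere" pattern come from the talk; the hostgroup name and check_command names are assumptions:

```python
# Illustrative sketch: generating hostgroup-wide Nagios service definitions,
# so each check is defined once per device hostgroup instead of per host.
def hostgroup_service(hostgroup, description, check_command):
    """Render one Nagios service definition scoped to a hostgroup."""
    return (
        "define service {\n"
        f"    hostgroup_name        {hostgroup}\n"
        f"    service_description   {description}\n"
        f"    check_command         {check_command}\n"
        "    use                   generic-service\n"
        "}\n")

checks = [
    ("Satellites Tracked", "check_snmp_satellites"),   # assumed command name
    ("Temperature",        "check_snmp_temperature"),  # assumed command name
    ("SysUpTime",          "check_snmp_uptime"),       # assumed command name
]
# Changing one entry here changes the check for every device in the hostgroup.
cfg = "".join(hostgroup_service("gps-clock-devices", d, c) for d, c in checks)
```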
11
MRTG overloads and other issues
MRTG graphs were starting to sawtooth again (checks completing once every 10-15 minutes)
• MRTG was already split manually into 8 separate processes
• ~12k checks every 5 minutes
• If some part of the network became unavailable overnight, and an error on a single interface stopped that process from completing successfully, we suddenly had no bandwidth data for ~1,500 ports. Unacceptable!
12
MRTG overloads and other issues
MRTG graphs were starting to sawtooth again (checks completing once every 10-15 minutes)
• We created a database synchronization tool (MRTGQL?), and converted our configuration wizards to write directly to tables
• Now we can handle duplicate checking in a sane manner!
• We split our MRTG processes based on information in a config array present in the synchronization script, which updates our crontab
file and rewrites all of the individual config files
• We also monitor the log file directory for errors, and send out alerts based on these findings – no more bandwidthless nights
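A minimal sketch of the synchronization script’s splitting, crontab-rewriting, and log-scanning logic, assuming an 8-way split and illustrative file paths:

```python
# Hedged sketch of the MRTG sync idea: a config array drives how targets are
# split into separate MRTG processes; the script then rewrites one config
# file and one crontab entry per process, and scans logs for errors.
PROCESS_COUNT = 8  # the talk mentions 8 manually split processes

def split_targets(targets, n=PROCESS_COUNT):
    """Round-robin targets into n buckets, one bucket per MRTG process."""
    buckets = [[] for _ in range(n)]
    for i, target in enumerate(targets):
        buckets[i % n].append(target)
    return buckets

def crontab_lines(n=PROCESS_COUNT):
    """One five-minute cron entry per MRTG process (paths are assumptions)."""
    return [
        f"*/5 * * * * mrtg /etc/mrtg/mrtg-{i}.cfg --logging /var/log/mrtg/proc-{i}.log"
        for i in range(n)]

def has_errors(log_text):
    """Flag log output containing error lines so alerts can be sent."""
    return any(line.startswith("ERROR") for line in log_text.splitlines())
```

Because a failing interface only takes down its own bucket (and the log scan alerts on it), a single bad port no longer silences bandwidth graphs for ~1,500 others.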
13
Remote MRTG bandwidth polling and Nagios checks
The need arose to segment some bandwidth polling and Nagios checks onto logical networks
• We have all kinds of customers, and support calls are expensive
• Let’s give them access to their own monitoring solution!
• It will be fun and easy, they said!
14
Remote MRTG bandwidth polling and Nagios checks
The need arose to segment some bandwidth polling and Nagios checks entirely away from our backhaul network
• Executing remote Nagios checks is as easy as ensuring that each customer’s device has the appropriate default hostgroup added. Their remote ModGearman worker takes care of the rest!
• We changed the configuration wizards to hide all of the default hostgroups from the user’s selectable listbox, and only assign the one that user is linked to in our intermediary management database
• But what about remote bandwidth polling?
• We changed the configuration wizards to execute cfgmaker on their remote ModGearman box. Once the
user selects which ports to monitor, these are all stored in our mrtg database with the appropriate
remote information so that our database sync occurs on the proper server
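The hostgroup-based routing can be sketched like this. The hostnames are hypothetical; the queue names follow Mod-Gearman’s `hostgroup_<name>` convention for hostgroup-routed checks:

```python
# Illustrative sketch of routing by default hostgroup: each customer's
# hostgroup maps to its remote ModGearman worker queue and to the MRTG
# server where cfgmaker runs and the database sync writes configs.
ROUTES = {
    "acme-devices": {
        "gearman_queue": "hostgroup_acme-devices",
        "mrtg_server":   "mrtg.acme.example.net",      # hypothetical host
    },
    "backhaul-devices": {
        "gearman_queue": "hostgroup_backhaul-devices",
        "mrtg_server":   "mrtg-core.agile.example.net",  # hypothetical host
    },
}

def route_for(default_hostgroup):
    """Pick the worker queue and MRTG poller for a device's hostgroup."""
    try:
        return ROUTES[default_hostgroup]
    except KeyError:
        raise ValueError(f"no route for hostgroup {default_hostgroup!r}")
```

With this mapping, assigning the default hostgroup in the wizard is the single decision that steers both the Nagios checks and the bandwidth polling to the customer’s own network.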
15
Remote MRTG bandwidth polling and Nagios checks
16
Empowering standard users
Network or Sales Engineers should not have to be Nagios Administrators to remove devices from monitoring
• We had to train people on Core Config Manager if they ever
planned on creating or removing hostgroups, removing hosts
or services, renaming anything, removing hosts from
hostgroups, etc.
• So we figured out exactly what the most commonly used
features of Core Config Manager were internally (Hint: it is all
the ones I listed in the last bullet point)
17
Empowering standard users
Network or Sales Engineers should not have to be Nagios Administrators to remove devices from monitoring
• Then we built a component that does all of those things via direct calls to the NagiosQL DB
and the filesystem
• Now all of our users can only remove the objects they have permissions for. Network Engineers can’t remove customer equipment, and Sales can’t remove backhaul routers!
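The permission check behind such a component might look like this minimal sketch; the role names and hostgroup names are assumptions:

```python
# Hypothetical sketch of the guard in front of the NagiosQL DB calls:
# a user may only remove hosts belonging to a hostgroup they are
# authorized for in the intermediary management database.
USER_HOSTGROUPS = {
    "network_engineer": {"backhaul-devices"},
    "sales_engineer":   {"acme-devices", "customer-cpe"},
}

def can_remove(user_role, host_hostgroups):
    """True if the user's allowed hostgroups cover any of the host's groups."""
    allowed = USER_HOSTGROUPS.get(user_role, set())
    return bool(allowed & set(host_hostgroups))
```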
18
Geospatial information system integration
The basic NOC Overview map was fast approaching end-of-life.
• The original map was an extension of the Google Maps component
• It had a decent interface that tied existing hostgroups to lat/lng coordinates
(locations) and displayed them on the map based on that hostgroup’s hosts’ statuses
• We tied locations together by linking a specific host and service at a location to a
specific host and service at another location (relationship)
• We displayed relationships as lines between locations
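A minimal data-model sketch of the location and relationship concepts described above; the field names are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Location:
    """A hostgroup pinned to lat/lng coordinates on the map."""
    hostgroup: str
    lat: float
    lng: float

@dataclass
class Relationship:
    """A link drawn between two locations, tied to a specific
    host/service pair on each end."""
    a_location: str
    a_host: str
    a_service: str
    b_location: str
    b_host: str
    b_service: str
```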
19
Geospatial information system integration
The basic NOC Overview map was fast approaching end-of-life.
• We also built an animated radar layer and overlaid it on top of the map
• This is all fine and good, but that data only existed inside of this portal inside of our
Nagios XI instance
• We needed to export that data to a true geospatial information system (PostGIS,
GeoServer)
20
Geospatial information system integration (continued)
We needed a better way to manage geospatial data that could be utilized by more than one team of engineers
• We built a GeoServer, and built a component that pulls its WMS layers into OpenLayers
• We created multiple datastores for each particular customer we service, with multiple layers in each (locations, wireless relationships, fiber
relationships, etc.)
• We built an awesome interface for that portal that allows any user of our Nagios XI instance to add locations and relationships with ease
• We built an application that parses status data and rebuilds all of the WMS layers with the proper styling (red for down, green for up, etc.)
• Now we can log in to the GeoServer via separate credentials and view the relevant data
• This is useful for our GIS and Project Management departments
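The status-to-style pass can be sketched as follows, assuming Nagios’ numeric service states (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN) and hypothetical style names:

```python
# Sketch of the restyling application described above: map each
# relationship's Nagios service state to a WMS line style before the
# layer is rebuilt. Style names are illustrative assumptions.
STATUS_STYLE = {
    0: "line-green",   # OK
    1: "line-yellow",  # WARNING
    2: "line-red",     # CRITICAL
    3: "line-gray",    # UNKNOWN
}

def style_for(current_state):
    """Choose the line style for a relationship from its service state."""
    return STATUS_STYLE.get(current_state, "line-gray")

def restyle(relationships):
    """Attach a style to each (name, state) relationship record; a real
    application would then push the restyled features to GeoServer."""
    return [{"name": name, "style": style_for(state)}
            for name, state in relationships]
```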
21
Geospatial information system integration (continued)
22
Conclusion
• We (seriously) beefed up the hardware to accommodate almost doubling hosts and services
• We automated everything we could possibly automate
• We built a layer on top of MRTG to manage configuration files and remote workers
• We refactored our configuration wizards to be extremely efficient, and tie in directly to our MRTG sync tool
• We built our map functionality on top of a real GeoServer
What’s next?
• Automating the deployment of XI instances based on growth and location
• Tying password change component into LDAP
• Automatic interference detection in the frequency map (with alerting!)
• Receive signal threshold alarming based on propagation prediction
• Alerting based on average values over time and a percentage change in those values
• Deeper geospatial integration (propagation/coverage maps)
Contact and Questions
• Any questions?
23