TRANSCRIPT
Using Nagios XI as the platform for Monitoring as a Service
Bryan Heden
Introduction and Agenda
I’m Bryan Heden, Director of Systems at Agile Networks, headquartered in Canton, Ohio
• Who we are and what we do
• Some customers of ours
• Last year’s presentation recap
• The major problems we were faced with
• Solving hardware issues
• Automating user management and multi-tenancy
• Further configuration wizard and component customizations
• MRTG overloads and other issues
• Remote MRTG bandwidth polling and Nagios checks
• Empowering standard users
• Geospatial information system integration
• Conclusion
2
Who we are and what we do
Agile Networks
We engineer and operate The Agile Network, a general-purpose backhaul network with Last-Mile Agility™
We provide world-class connectivity to:
• The public sector (Public Safety)
• Tier 1 Carriers
• The Oil and Gas industry
• Underserved communities
• Business and Residential customers
• Wireless Internet Service Providers
3
Some customers of ours
4
Last year’s presentation recap
10,000 Services (and growing!) Across the State of Ohio
Choosing Nagios XI and ModGearman
• Easy to use and understandable front-end interface
• ModGearman’s distributed checks
Customizing configuration wizards and components
• Specialized config wizards for our networking equipment
• NOC Overview map to provide geospatially based status info
• ModGearman management, Smokeping component and portal
Offloading MRTG, MySQL, Smokeping and IO improvements
• Upgraded hardware several times to keep up
• Offloaded MRTG, split the processes up
• Offloaded MySQL
• Installed and then immediately offloaded Smokeping
5
The major problems we were faced with
Midnight alerts should power cycle my coffee maker
• IOwait was continuing to grow and memory was limited
• Engineers need to see backhaul, sales needs to see customer equipment
• Our configuration wizards’ defects became glaringly obvious
• MRTG graphs were starting to sawtooth again (checks completing only once every 10-15 minutes)
• The need arose to segment some bandwidth polling and Nagios checks entirely away from our backhaul network
• Network or Sales Engineers should not have to be Nagios Administrators to remove devices from monitoring
• The basic NOC Overview map was fast approaching end-of-life. We needed a better way to manage geospatial data that could be
utilized by more than one team of engineers
6
Solving hardware issues
IOwait was continuing to grow and memory was limited
• We had already migrated hardware several times
• Latest migration was to a 24-SSD DAS (12Gb/s SAS) array attached to a 3-node VMware 6 cluster
• XI VM has 24 cores, 24GB RAM
• MRTG and MySQL VMs are similar
• Several RAMDisks are in use
• Famous Last Words:
“I haven’t seen IOwait over 1% in a long long time now!”
7
Automating user management and multi-tenancy
Engineers need to see backhaul, sales needs to see customer equipment
• We needed to limit the views of company department users (network
engineers, network operations, sales engineers, operations) and
telecommunication customer users (public safety, oil and gas, wireless
resellers)
• Automating this process was on the roadmap for far too long before it was
developed. Manual maintenance was a nightmare!
• We built an intermediary database that manages user groups, e.g. Agile Networks Engineering
8
Automating user management and multi-tenancy
Engineers need to see backhaul, sales needs to see customer equipment
• This database links those user groups with contactgroups and default
hostgroups
• We have a component/portal that populates the database upon user group
creation, and uses that data to create users in Nagios and assign them to the
proper contactgroup upon creation
• The default hostgroup is useful for ModGearman and also for keeping track of
who is tracking what
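The user-group automation above can be sketched roughly as follows. This is an illustrative assumption of the intermediary database, not Agile Networks’ actual schema; the group, contactgroup, and hostgroup names are made up:

```python
import sqlite3

# Hypothetical sketch of the intermediary multi-tenancy database:
# each user group maps to a Nagios contactgroup and a default hostgroup.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE user_groups (
    name              TEXT PRIMARY KEY,
    contactgroup      TEXT NOT NULL,
    default_hostgroup TEXT NOT NULL)""")
db.executemany(
    "INSERT INTO user_groups VALUES (?, ?, ?)",
    [("Agile Networks Engineering", "engineering-contacts", "backhaul-devices"),
     ("Acme Wireless Reseller",     "acme-contacts",        "acme-devices")])

def provision_user(username, group_name):
    """Look up the Nagios objects a new user should be attached to."""
    row = db.execute(
        "SELECT contactgroup, default_hostgroup FROM user_groups WHERE name = ?",
        (group_name,)).fetchone()
    if row is None:
        raise ValueError(f"unknown user group: {group_name}")
    contactgroup, default_hostgroup = row
    # A real component would now create the user in Nagios XI and
    # add the contact to `contactgroup`; here we just return the mapping.
    return {"user": username, "contactgroup": contactgroup,
            "default_hostgroup": default_hostgroup}
```

A component/portal built on this lookup can then create the user and assign the contactgroup in one pass, which is the manual step that used to be the nightmare.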
9
Further configuration wizard and component customizations
Our configuration wizards’ defects became glaringly obvious
• The old configuration wizards were specific to a device type. Service checks were added to each host individually, which became a problem whenever we introduced a new OID to monitor, or needed to remove one!
• We wrote a script that generates each configuration wizard from a generic template. While creating the config wizard, it also creates the device hostgroup that any device created with that wizard will be added to.
10
Further configuration wizard and component customizations
Our configuration wizards’ defects became glaringly obvious
• Now, we assign service checks to those device hostgroups (Satellites
Tracked, Temperature, SysUpTime). If we ever need to make a change, we
make it at one place, and it is applied to all devices.
• We still gather interface information via MRTG’s cfgmaker command, and allow the user to decide which ports they want checks performed on.
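The hostgroup-wide check idea can be illustrated with a small generator. The service descriptions and the "define once, apply everywhere" pattern come from the talk; the hostgroup name and check_command names are assumptions:

```python
# Illustrative sketch: generating hostgroup-wide Nagios service definitions,
# so each check is defined once per device hostgroup instead of per host.
def hostgroup_service(hostgroup, description, check_command):
    """Render one Nagios service definition scoped to a hostgroup."""
    return (
        "define service {\n"
        f"    hostgroup_name        {hostgroup}\n"
        f"    service_description   {description}\n"
        f"    check_command         {check_command}\n"
        "    use                   generic-service\n"
        "}\n")

checks = [
    ("Satellites Tracked", "check_snmp_satellites"),   # assumed command name
    ("Temperature",        "check_snmp_temperature"),  # assumed command name
    ("SysUpTime",          "check_snmp_uptime"),       # assumed command name
]
# Changing one entry here changes the check for every device in the hostgroup.
cfg = "".join(hostgroup_service("gps-clock-devices", d, c) for d, c in checks)
```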
11
MRTG overloads and other issues
MRTG graphs were starting to sawtooth again (checks completing once every 10-15 minutes)
• MRTG was already split manually into 8 separate processes
• ~12k checks every 5 minutes
• If some part of the network became unavailable overnight, and an error on a single interface stopped that process from completing successfully, we suddenly had no bandwidth data for ~1,500 ports. Unacceptable!
12
MRTG overloads and other issues
MRTG graphs were starting to sawtooth again (checks completing once every 10-15 minutes)
• We created a database synchronization tool (MRTGQL?), and converted our configuration wizards to write directly to tables
• Now we can handle duplicate checking in a sane manner!
• We split our MRTG processes based on information in a config array present in the synchronization script, which updates our crontab
file and rewrites all of the individual config files
• We also monitor the log file directory for errors, and send out alerts based on these findings – no more bandwidthless nights
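A minimal sketch of the synchronization script’s splitting, crontab-rewriting, and log-scanning logic, assuming an 8-way split and illustrative file paths:

```python
# Hedged sketch of the MRTG sync idea: a config array drives how targets are
# split into separate MRTG processes; the script then rewrites one config
# file and one crontab entry per process, and scans logs for errors.
PROCESS_COUNT = 8  # the talk mentions 8 manually split processes

def split_targets(targets, n=PROCESS_COUNT):
    """Round-robin targets into n buckets, one bucket per MRTG process."""
    buckets = [[] for _ in range(n)]
    for i, target in enumerate(targets):
        buckets[i % n].append(target)
    return buckets

def crontab_lines(n=PROCESS_COUNT):
    """One five-minute cron entry per MRTG process (paths are assumptions)."""
    return [
        f"*/5 * * * * mrtg /etc/mrtg/mrtg-{i}.cfg --logging /var/log/mrtg/proc-{i}.log"
        for i in range(n)]

def has_errors(log_text):
    """Flag log output containing error lines so alerts can be sent."""
    return any(line.startswith("ERROR") for line in log_text.splitlines())
```

Because a failing interface only takes down its own bucket (and the log scan alerts on it), a single bad port no longer silences bandwidth graphs for ~1,500 others.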
13
Remote MRTG bandwidth polling and Nagios checks
The need arose to segment some bandwidth polling and Nagios checks onto logical networks
• We have all kinds of customers, and support calls are expensive
• Let’s give them access to their own monitoring solution!
• It will be fun and easy, they said!
14
Remote MRTG bandwidth polling and Nagios checks
The need arose to segment some bandwidth polling and Nagios checks entirely away from our backhaul network
• Executing remote Nagios checks is as easy as ensuring that each customer’s device has the appropriate default hostgroup added. Their remote ModGearman worker takes care of the rest!
• We changed the configuration wizards to hide all of the default hostgroups from the user’s selectable listbox, and only assign the one that user is linked to in our intermediary management database
• But what about remote bandwidth polling?
• We changed the configuration wizards to execute cfgmaker on their remote ModGearman box. Once the
user selects which ports to monitor, these are all stored in our mrtg database with the appropriate
remote information so that our database sync occurs on the proper server
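The hostgroup-based routing can be sketched like this. The hostnames are hypothetical; the queue names follow Mod-Gearman’s `hostgroup_<name>` convention for hostgroup-routed checks:

```python
# Illustrative sketch of routing by default hostgroup: each customer's
# hostgroup maps to its remote ModGearman worker queue and to the MRTG
# server where cfgmaker runs and the database sync writes configs.
ROUTES = {
    "acme-devices": {
        "gearman_queue": "hostgroup_acme-devices",
        "mrtg_server":   "mrtg.acme.example.net",      # hypothetical host
    },
    "backhaul-devices": {
        "gearman_queue": "hostgroup_backhaul-devices",
        "mrtg_server":   "mrtg-core.agile.example.net",  # hypothetical host
    },
}

def route_for(default_hostgroup):
    """Pick the worker queue and MRTG poller for a device's hostgroup."""
    try:
        return ROUTES[default_hostgroup]
    except KeyError:
        raise ValueError(f"no route for hostgroup {default_hostgroup!r}")
```

With this mapping, assigning the default hostgroup in the wizard is the single decision that steers both the Nagios checks and the bandwidth polling to the customer’s own network.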
15
Remote MRTG bandwidth polling and Nagios checks
16
Empowering standard users
Network or Sales Engineers should not have to be Nagios Administrators to remove devices from monitoring
• We had to train people on Core Config Manager if they ever
planned on creating or removing hostgroups, removing hosts
or services, renaming anything, removing hosts from
hostgroups, etc.
• So we figured out exactly what the most commonly used
features of Core Config Manager were internally (Hint: it is all
the ones I listed in the last bullet point)
17
Empowering standard users
Network or Sales Engineers should not have to be Nagios Administrators to remove devices from monitoring
• Then we built a component that does all of those things via direct calls to the NagiosQL DB
and the filesystem
• Now all of our users can only remove the objects they have permissions for. Network Engineers can’t remove customer equipment, and Sales can’t remove backhaul routers!
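The permission check behind such a component might look like this minimal sketch; the role names and hostgroup names are assumptions:

```python
# Hypothetical sketch of the guard in front of the NagiosQL DB calls:
# a user may only remove hosts belonging to a hostgroup they are
# authorized for in the intermediary management database.
USER_HOSTGROUPS = {
    "network_engineer": {"backhaul-devices"},
    "sales_engineer":   {"acme-devices", "customer-cpe"},
}

def can_remove(user_role, host_hostgroups):
    """True if the user's allowed hostgroups cover any of the host's groups."""
    allowed = USER_HOSTGROUPS.get(user_role, set())
    return bool(allowed & set(host_hostgroups))
```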
18
Geospatial information system integration
The basic NOC Overview map was fast approaching end-of-life.
• The original map was an extension of the Google Maps component
• It had a decent interface that tied existing hostgroups to lat/lng coordinates
(locations) and displayed them on the map based on that hostgroup’s hosts’ statuses
• We tied locations together by linking a specific host and service at a location to a
specific host and service at another location (relationship)
• We displayed relationships as lines between locations
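A minimal data-model sketch of the location and relationship concepts described above; the field names are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Location:
    """A hostgroup pinned to lat/lng coordinates on the map."""
    hostgroup: str
    lat: float
    lng: float

@dataclass
class Relationship:
    """A link drawn between two locations, tied to a specific
    host/service pair on each end."""
    a_location: str
    a_host: str
    a_service: str
    b_location: str
    b_host: str
    b_service: str
```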
19
Geospatial information system integration
The basic NOC Overview map was fast approaching end-of-life.
• We also built an animated radar layer and overlaid it on top of the map
• This is all fine and good, but that data only existed inside of this portal inside of our
Nagios XI instance
• We needed to export that data to a true geospatial information system (PostGIS,
GeoServer)
20
Geospatial information system integration (continued)
We needed a better way to manage geospatial data that could be utilized by more than one team of engineers
• We built a GeoServer, and built a component that pulls its WMS layers into OpenLayers
• We created multiple datastores for each particular customer we service, with multiple layers in each (locations, wireless relationships, fiber
relationships, etc.)
• We built an awesome interface for that portal that allows any user of our Nagios XI instance to add locations and relationships with ease
• We built an application that parses status data and rebuilds all of the WMS layers with the proper styling (red for down, green for up, etc.)
• Now we can log in to the GeoServer via separate credentials and view the relevant data
• This is useful for our GIS and Project Management departments
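The status-to-style pass can be sketched as follows, assuming Nagios’ numeric service states (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN) and hypothetical style names:

```python
# Sketch of the restyling application described above: map each
# relationship's Nagios service state to a WMS line style before the
# layer is rebuilt. Style names are illustrative assumptions.
STATUS_STYLE = {
    0: "line-green",   # OK
    1: "line-yellow",  # WARNING
    2: "line-red",     # CRITICAL
    3: "line-gray",    # UNKNOWN
}

def style_for(current_state):
    """Choose the line style for a relationship from its service state."""
    return STATUS_STYLE.get(current_state, "line-gray")

def restyle(relationships):
    """Attach a style to each (name, state) relationship record; a real
    application would then push the restyled features to GeoServer."""
    return [{"name": name, "style": style_for(state)}
            for name, state in relationships]
```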
21
Geospatial information system integration (continued)
22
Conclusion
• We (seriously) beefed up the hardware to accommodate almost doubling hosts and services
• We automated everything we could possibly automate
• We built a layer on top of MRTG to manage configuration files and remote workers
• We refactored our configuration wizards to be extremely efficient, and tie in directly to our MRTG sync tool
• We built our map functionality on top of a real GeoServer
What’s next?
• Automating the deployment of XI instances based on growth and location
• Tying password change component into LDAP
• Automatic interference detection in the frequency map (with alerting!)
• Receive signal threshold alarming based on propagation prediction
• Alerting based on average values over time and a percentage change in those values
• Deeper geospatial integration (propagation/coverage maps)
Contact and Questions
• Any questions?
23