Nagios On-Call Rotation
James Clark
Topics Discussed
About Me / My Monitoring History
Monitoring History at Current Company
Prerequisites
Current Company Setup
Scripts
Nagios Configuration
About Me
Have been in the IT industry since 1988
In 2004 became server group manager
Have been using Nagios since ~2003
Switched to XI ~2010 (And loved every part of it)
Changed jobs in August 2012 and quickly convinced new company to purchase XI
About Me
Private web page is http://www.bandits-home-on-the-web.com
On that page you will find some of the Nagios modifications I have done
History of Monitoring and Alerting at new job
Many monitoring applications spread throughout the IT department
CCSS for iSeries
Foglight for DB
SCOM for Windows
Three separate Nagios Core servers
IBM NetCool
Many departments had no monitoring
All of the applications forward to NetCool and NetCool then forwards alerts to AlarmPoint (xMatters)
AlarmPoint holds the on-call schedule for the many different groups in the IT department
History of Monitoring and Alerting at new job
CCSS for iSeries: Partial conversion to XI started
Foglight for DB: Complete conversion to XI hopeful
SCOM for Windows: Will either develop custom script to communicate back and forth, or use WMI. Currently testing WMI and hope to use that instead
Three separate Nagios Core servers: Converted to a single XI server
History of Monitoring and Alerting at new job
IBM NetCool: Removing from company
AlarmPoint: More than likely, removing from company
One primary XI server currently with 3 mod_gearman workers
One XI server for monitoring primary XI and a few other devices
One XI server in our DR web data center
History of Monitoring and Alerting at new job
Besides AlarmPoint, the on-call schedule is also kept in a separate MS SharePoint site that DC Operations uses.
No full-time administrator for either NetCool or AlarmPoint.
When done switching everything to Nagios XI, a significant savings will be realized.
One of the main hurdles to the switch is on-call rotation for alerting.
On Call Data - Prerequisites
On-call information stored in some application
On-call information able to be exported from the application in a specific format
A job scheduler to run the jobs
On Call Data – Our Setup
SharePoint site to store on-call schedule
SharePoint admin created an application to export the data needed and send the files to an FTP server.
Two files are sent, one for primary and one for secondary.
We use Control-M to schedule the above program and the two Linux scripts.
The job is run daily at 8am. Our on-call changes Mondays at 8am.
If changes are made to the on-call schedule that need to take effect immediately, the job is manually run. Otherwise, it can wait until the next day at 8am.
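The daily job above runs its steps in a fixed order, so a failed download should stop the run before any config is rebuilt. A minimal sketch of a wrapper that a scheduler could call is below. The helper `run_steps` and the two script names in the comment are assumptions for illustration; the slides do not name the scripts or show how Control-M chains them.

```shell
#!/bin/sh
# Hypothetical wrapper: run each step in order, stop at the first failure.
# The script names mentioned below are placeholders, not names from the slides.

run_steps() {
    for step in "$@"; do
        echo "running: $step"
        if ! sh -c "$step"; then
            echo "step failed: $step" >&2
            return 1
        fi
    done
    echo "all steps finished"
}

# Intended use (assumed file names):
#   run_steps "./get_oncall_files.sh" "./build_oncall_cfg.pl"
# Harmless demonstration with stand-in commands:
run_steps "true" "echo fetched"
```

Stopping at the first failure means a missing export leaves the previous day's contact groups in place instead of wiping them.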
On Call Data – Our Setup
Added ID to contacts table.
Added short name to On-Call Groups table.
Set the SharePoint site to alert me when any changes are made to those two tables so they can be mirrored in Nagios.
The scripts do handle blanks. This will be shown in a later slide.
On Call Data – Example files
Networking,network,smithj
System p Administration,aix_admins,doej
AE Direct,aed_infra,user1
Database,dba,clarks
System i Administration,system_i_admin,walenciejs
Wintel Administration,wintel_admins,hilderbrandr
System i Applications,system_i_apps,brownr
Client Server Applications,client_server,yatesp
DataWarehouse/Enterprise Rpts,datawarehouse,connerys
Store Applications,store_apps,probstj
The first field is what is displayed on the SharePoint site and is the alias assigned in Nagios. The second field is the name given to the contact groups. The third field is of course the ID of the user.
Scripts:
On Call Data – FTP Script
HOST=xxxxxxx #This is the FTP server's host name or IP address.
USER=xxxxxxx #This is the FTP user that has access to the server.
PASS=xxxxxxx #This is the password for the FTP user.
ftp -inv $HOST << EOF
user $USER $PASS
cd /nagiosftp
get primaryOnCall.txt
get secondaryOnCall.txt
delete primaryOnCall.txt
delete secondaryOnCall.txt
bye
EOF
exit 0
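The FTP script exits 0 whether or not the transfers worked, so a check on the downloaded files can catch a bad run before the old configs are removed. The sketch below is an assumption, not part of the original scripts: `check_oncall_files` is a hypothetical helper, and it assumes both export files always contain at least one row.

```shell
#!/bin/sh
# Hypothetical pre-check: both export files must exist and be non-empty.

check_oncall_files() {
    dir=${1:-.}
    for f in primaryOnCall.txt secondaryOnCall.txt; do
        # -s is true only if the file exists and has a size greater than zero
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            return 1
        fi
    done
    return 0
}

# Intended use, after the FTP step and before the Perl script:
#   check_oncall_files . || exit 1
```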
On Call Data – Data Manipulation Script
#!/usr/bin/perl
#Remove old config files
system ("find /usr/local/nagios/etc/static -type f -not -name 'xi*' -not -name 'esc*' -not -name 'aed_*' | xargs rm");
#Process primary on-call file
open (INFILE, 'primaryOnCall.txt') or die $!;
while (<INFILE>) {
    chomp;
    ($group, $alias, $id) = split(",");
    #Skip the row if any of the three fields is blank
    if (($alias ne '') && ($group ne '') && ($id ne '')) {
        open (OUTFILE, '>/usr/local/nagios/etc/static/' . $alias . '_oncall_pri.cfg') or die $!;
        print OUTFILE "define contactgroup{\n";
        print OUTFILE "contactgroup_name $alias" . "_oncall_pri\n";
        print OUTFILE "alias $group\n";
        print OUTFILE "members $id\n";
        print OUTFILE "}";
        close (OUTFILE);
    }
}
close (INFILE);
On Call Data – Data Manipulation Script(cont…)
#Process secondary on-call file
open (INFILE, 'secondaryOnCall.txt') or die $!;
while (<INFILE>) {
    chomp;
    ($group, $alias, $id) = split(",");
    #Skip the row if any of the three fields is blank
    if (($alias ne '') && ($group ne '') && ($id ne '')) {
        open (OUTFILE, '>/usr/local/nagios/etc/static/' . $alias . '_oncall_sec.cfg') or die $!;
        print OUTFILE "define contactgroup{\n";
        print OUTFILE "contactgroup_name $alias" . "_oncall_sec\n";
        print OUTFILE "alias $group\n";
        print OUTFILE "members $id\n";
        print OUTFILE "}";
        close (OUTFILE);
    }
}
close (INFILE);
On Call Data – Data Manipulation Script(cont…)
#Change ownership and permissions of config files
system ("sudo /bin/chown apache:nagios /usr/local/nagios/etc/static/*.cfg");
system ("sudo /bin/chmod 777 /usr/local/nagios/etc/static/*.cfg");
#Delete data files
system ("rm primaryOnCall.txt");
system ("rm secondaryOnCall.txt");
#Restart Nagios
system ("sudo su -l nagios -c 'cd /usr/local/nagiosxi/scripts/ && ./reconfigure_nagios.sh'");
#Exit clean
exit 0;
On Call Data – List of Files Created
Due to a blank for secondary on-call in the file, only the primary file for datawarehouse exists.
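The blank handling can be illustrated with a short shell sketch that applies the same rule as the Perl script: a row only produces a file when all three fields are present. `list_cfg_names` is a hypothetical helper written for this illustration, and the sample rows are made up to match the file format.

```shell
#!/bin/sh
# Read "group,alias,id" rows on stdin and print the config file name that
# each complete row would produce. $1 is the suffix: pri or sec.

list_cfg_names() {
    suffix=$1
    while IFS=, read -r group alias id; do
        # Same rule as the Perl script: skip the row if any field is blank
        if [ -n "$group" ] && [ -n "$alias" ] && [ -n "$id" ]; then
            echo "${alias}_oncall_${suffix}.cfg"
        fi
    done
}

# Sample secondary file in which datawarehouse has no user ID
printf '%s\n' \
    'Database,dba,clarks' \
    'DataWarehouse/Enterprise Rpts,datawarehouse,' \
    | list_cfg_names sec
```

Only `dba_oncall_sec.cfg` is printed; the datawarehouse row is skipped because its third field is empty.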
On Call Data – Files Created – Example Content
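The file content follows directly from the print statements in the data manipulation script. The sketch below reproduces it in shell for one row of the example file. `render_contactgroup` is a hypothetical helper for illustration only, and the line layout is an assumption, since the original whitespace did not survive in the slides.

```shell
#!/bin/sh
# Print the contactgroup definition for one on-call row.
# Arguments: group (display name), alias (short name), id (user), suffix (pri or sec).

render_contactgroup() {
    group=$1
    alias=$2
    id=$3
    suffix=$4
    printf 'define contactgroup{\n'
    printf 'contactgroup_name %s_oncall_%s\n' "$alias" "$suffix"
    printf 'alias %s\n' "$group"
    printf 'members %s\n' "$id"
    printf '}\n'
}

# First row of the example primary file: Networking,network,smithj
render_contactgroup "Networking" "network" "smithj" "pri"
```

For that row the result is a contact group named `network_oncall_pri` with alias `Networking` and the single member `smithj`.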
Nagios Configuration:
NagiosXI Configuration
No contacts or contact groups are assigned to the hosts or services, unless someone should always receive alerts, e.g. someone who needs to be alerted but is not a member of the specific on-call group.
Users receive permissions to see hosts and services by having an escalation for them
Escalations must be created for both hosts and services. Services do not inherit escalations like they do notifications
NagiosXI Configuration(cont…)
Escalations created as static config files.
Otherwise Nagios would error on the empty contact groups.
All members of groups go into an ALL group. This will be used to give users permissions
The group manager goes into a BOSS group. This is used for alerting the manager after on-call individuals fail to acknowledge an issue
Static Configuration Example - Hosts
# network_oncall_pri: created by script
define hostescalation{
    hostgroup_name network_oncall
    contact_groups network_oncall_pri
    first_notification 1
    last_notification 0
    notification_interval 15
    }
# network_oncall_sec: created by script
define hostescalation{
    hostgroup_name network_oncall
    contact_groups network_oncall_sec
    first_notification 2
    last_notification 0
    notification_interval 15
    }
# network_boss: created in XI and manager of group assigned as member
define hostescalation{
    hostgroup_name network_oncall
    contact_groups network_boss
    first_notification 4
    last_notification 0
    notification_interval 15
    }
# network_all: created in XI and all members of group assigned as members
define hostescalation{
    hostgroup_name network_oncall
    contact_groups network_all
    first_notification 3
    last_notification 0
    notification_interval 15
    }
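The four host escalations follow one naming pattern per group, so the stanzas for another group could be produced from its short name. The sketch below is one possible way to do that and is not taken from the slides, which do not say how the static files were written. `host_escalations` is a hypothetical helper; the group suffixes and notification numbers are copied from the network example.

```shell
#!/bin/sh
# Print the four hostescalation stanzas for one group short name.
# Order follows the example: primary = 1, secondary = 2, all = 3, boss = 4.

host_escalations() {
    short=$1
    n=1
    for cg in oncall_pri oncall_sec all boss; do
        printf 'define hostescalation{\n'
        printf '    hostgroup_name        %s_oncall\n' "$short"
        printf '    contact_groups        %s_%s\n' "$short" "$cg"
        printf '    first_notification    %s\n' "$n"
        printf '    last_notification     0\n'
        printf '    notification_interval 15\n'
        printf '    }\n\n'
        n=$((n + 1))
    done
}

host_escalations network
```

Redirecting the output to a file in the static directory would be the last step; the file name to use is left open here because the slides only show the `esc*` prefix that the cleanup command preserves.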
Static Configuration Example - Services
define serviceescalation{
    hostgroup_name network_oncall
    service_description *
    contact_groups network_oncall_pri
    first_notification 1
    last_notification 0
    notification_interval 15
    }
define serviceescalation{
    hostgroup_name network_oncall
    service_description *
    contact_groups network_oncall_sec
    first_notification 2
    last_notification 0
    notification_interval 15
    }
define serviceescalation{
    hostgroup_name network_oncall
    service_description *
    contact_groups network_all
    first_notification 3
    last_notification 0
    notification_interval 15
    }
define serviceescalation{
    hostgroup_name network_oncall
    service_description *
    contact_groups network_boss
    first_notification 4
    last_notification 0
    notification_interval 15
    }
The way we set it up, it uses the same hostgroup used for all the hosts and uses a wildcard for service, to include all services.
This could get very complicated if different groups/individuals were needed on different services on the same host.
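Because every escalation in the network example has last_notification 0, each one stays active once its first_notification is reached, so the list of recipients grows with each notification. The sketch below works that out for the first four notifications. `recipients` is a hypothetical helper, and the minute figures are approximate: they assume the problem stays unacknowledged and that the interval unit is the Nagios default of one minute.

```shell
#!/bin/sh
# List the contact groups whose escalation covers notification number N.
# Pairs are "contact_group:first_notification" from the network example;
# last_notification 0 means there is no upper bound.

recipients() {
    n=$1
    out=""
    for rule in network_oncall_pri:1 network_oncall_sec:2 network_all:3 network_boss:4; do
        cg=${rule%%:*}
        first=${rule##*:}
        if [ "$n" -ge "$first" ]; then
            out="${out:+$out }$cg"
        fi
    done
    echo "$out"
}

for n in 1 2 3 4; do
    # With notification_interval 15, notification n is sent roughly
    # (n - 1) * 15 minutes after the first one.
    echo "notification $n (about $(( (n - 1) * 15 )) min): $(recipients "$n")"
done
```

So the primary on-call is alerted first, the secondary joins at the second notification, the whole group at the third, and the manager at the fourth.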
Static Configuration Example - Services
define serviceescalation{
    host_name *
    servicegroup_name dba_oncall
    contact_groups dba_oncall_pri,dba_oncall_sec
    first_notification 1
    last_notification 0
    notification_interval 15
    }
define serviceescalation{
    host_name *
    servicegroup_name dba_oncall
    contact_groups dba
    first_notification 500
    last_notification 0
    notification_interval 15
    }
Static Configuration Example - Services
escalations_aed_serv.cfg
The service escalations can be as simple as the last slide, or as complex as you can imagine. The attached file is a good example of the complexity that is possible.