Managing Monitoring in Distributed Environments

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.


Introduction

• Motivation

• Background

• Implementation

• Conclusions


Why Monitor?

“A distributed system is one that stops you from getting any work done when a machine you've never even heard of crashes.”

Leslie Lamport


Reliability

• In a replicated system, you only know when the last server fails

• Sometimes you need to know that a system is down before the first user notices


Why Monitor?

“Statistics are like bikinis.  What they reveal is suggestive, but what they conceal is vital”

Aaron Levenstein


Performance Metrics

• Do we have “five 9s” uptime?

• Are there servers which regularly fall over?

• Are there patterns behind our service outages?

• Does Jim’s team manage their systems better than Susan’s?
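To put a number on “five 9s”, the sketch below turns an availability target into a yearly downtime budget. The function name is illustrative, and a 365-day year is assumed.

```python
def downtime_budget_minutes(availability: float, days: int = 365) -> float:
    """Minutes of downtime allowed over `days` for a given availability."""
    total_minutes = days * 24 * 60  # assumption: a 365-day year by default
    return total_minutes * (1.0 - availability)


for label, target in [("three 9s", 0.999), ("four 9s", 0.9999), ("five 9s", 0.99999)]:
    print(f"{label}: {downtime_budget_minutes(target):.2f} minutes per year")
```

Five 9s leaves a little over five minutes of downtime per year, which is why the question is hard to answer without recorded monitoring data.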


Monitoring

• Regularly check if all services are working correctly

• Tell people if they’re not


Complications

Image credit: fliepsiebieps http://www.etsy.com/view_transaction.php?transaction_id=6780862


Completeness

• Monitoring systems are rapidly taken for granted

• “I haven’t been paged - there can’t be anything wrong”

• Maintaining the monitoring system is often forgotten

• Easy to forget to add new services

• Omissions only noticed when things go wrong

• Although ... harder to forget when decommissioning!


Complications

Image credit: rustybrick : http://www.flickr.com/photos/rustybrick/2100356981


Dependencies (and message storms)

• One small failure can have dramatic consequences

• Receiving 1,000 messages when a router fails can be both annoying and expensive

• Large scale monitoring configurations have to understand
  • Network topology
  • Service dependencies

• Modelling these is hard. Maintaining the model is harder still
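A minimal sketch of how a dependency model can damp a message storm: given the set of failed checks and a map of what each item depends on, only failures with no failed ancestor are reported. The host names and the shape of the map are hypothetical, not taken from a real topology.

```python
from typing import Dict, List, Set


def root_causes(failed: Set[str], depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return failed items with no failed direct or transitive dependency."""

    def has_failed_ancestor(item: str, seen: Set[str]) -> bool:
        for parent in depends_on.get(item, []):
            if parent in seen:
                continue  # already visited; also guards against cycles
            seen.add(parent)
            if parent in failed or has_failed_ancestor(parent, seen):
                return True
        return False

    return {item for item in failed if not has_failed_ancestor(item, {item})}


# Hypothetical topology: everything sits behind one router.
DEPENDS_ON = {
    "web1": ["router"],
    "web2": ["router"],
    "switch": ["router"],
    "db": ["switch"],
}

print(root_causes({"router", "web1", "web2", "db"}, DEPENDS_ON))
```

With the router down, only the router is reported, rather than one message per unreachable service.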


Complications

Image credit: Guacamole Goalie : http://www.flickr.com/photos/jeopei


Communication : Passive vs active

• Not everything you care about is network visible
  • Some things can’t be
  • Some things shouldn’t be

• An event occurring is as notable as a service outage

• Have to be able to handle
  • Services reporting their own state
  • Reporting of events (SNMP traps &c)
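As an illustration of a service reporting its own state, the sketch below formats a passive result in the Nagios external command format (`PROCESS_SERVICE_CHECK_RESULT`). The helper name is hypothetical, and writing the line to the Nagios command file is left to the caller.

```python
import time
from typing import Optional


def passive_result_line(
    host: str,
    service: str,
    return_code: int,
    output: str,
    timestamp: Optional[int] = None,
) -> str:
    """Format one passive service check result as an external command line."""
    if return_code not in (0, 1, 2, 3):  # OK, WARNING, CRITICAL, UNKNOWN
        raise ValueError(f"unexpected return code: {return_code}")
    ts = int(time.time()) if timestamp is None else timestamp
    clean_output = output.replace("\n", " ")  # a command must fit on one line
    return f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{return_code};{clean_output}\n"


# Arbitrary fixed timestamp so the example is reproducible.
print(passive_result_line("duffus", "Profile Translation", 0, "OK", timestamp=1206838200), end="")
```

The reporting service decides when to speak; the monitoring system only has to accept the result and notice when results stop arriving.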


Complications

“Quis custodiet ipsos custodes?” (“Who will guard the guards themselves?”) Juvenal

Image credit: Ley_photograhy http://www.flickr.com/photos/lostindevon/


Self monitoring

• Silence isn’t always golden

• Failure of a component of the monitoring system can hide everything

• Important that the monitoring system itself is robustly observed
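One common way to stop silence being mistaken for health is a heartbeat watchdog: every monitoring component reports in regularly, and anything that goes quiet for too long is flagged. The class and source names below are hypothetical; this is a sketch of the idea, not the mechanism used in the deployment described here.

```python
import time
from typing import Dict, List, Optional


class HeartbeatWatchdog:
    """Track the last report time per source and flag the silent ones."""

    def __init__(self, max_silence_seconds: float) -> None:
        self.max_silence = max_silence_seconds
        self.last_seen: Dict[str, float] = {}

    def report(self, source: str, now: Optional[float] = None) -> None:
        """Record that `source` has just checked in."""
        self.last_seen[source] = time.time() if now is None else now

    def stale(self, now: Optional[float] = None) -> List[str]:
        """Return sources that have been silent for longer than allowed."""
        current = time.time() if now is None else now
        return sorted(
            source
            for source, seen in self.last_seen.items()
            if current - seen > self.max_silence
        )


watchdog = HeartbeatWatchdog(max_silence_seconds=60)
watchdog.report("monitor-a", now=0)   # hypothetical monitoring hosts
watchdog.report("monitor-b", now=50)
print(watchdog.stale(now=70))
```

For this to help, the watchdog has to run somewhere other than the component it is watching.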


Complexity and redundancy

• Satisfying all of these requirements leads to configurations that are
  • complex
  • fragile
  • hard to understand
  • full of redundant, duplicated information

• Removing redundancy is highly beneficial. It lets us
  • ... make changes in one place instead of 20
  • ... avoid inconsistency between multiple copies
  • ... speed up configuration tasks
  • ... reduce errors!


Aside on redundancy

• System configuration is full of redundant information

• Take a web server configuration - information shared with
  • DNS
  • DHCP
  • Firewall
  • Certificate Server
  • Console Server
  • and your Monitoring System

• Removing redundancy dramatically reduces the potential for errors across all of these components


Lost in translation

• Very hard to handle raw configuration files

• Rapidly becomes an O(n²) problem (translate all of these configuration formats to all of these other formats)

• Instead, use a set of abstract configuration resources

• Translate from this to
  • Service configuration
  • Monitoring configuration
  • Everything else ...
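With n configuration formats, translating directly between them needs up to n(n-1) converters, while going through one abstract description needs only n. The sketch below renders a single abstract description for two consumers. The dictionary keys and function names are hypothetical, and the generated stanzas are deliberately abbreviated.

```python
from typing import Callable, Dict

Resources = Dict[str, str]

# Hypothetical abstract description of one virtual host.
VHOST: Resources = {
    "hostname": "www.inf.ed.ac.uk",
    "port": "443",
    "docroot": "/var/www/html",
    "contact_group": "nagios/web-service",
}


def to_apache(res: Resources) -> str:
    """Render the abstract description as an Apache virtual host."""
    return (
        f"<VirtualHost {res['hostname']}:{res['port']}>\n"
        f"    ServerName {res['hostname']}\n"
        f"    DocumentRoot {res['docroot']}\n"
        "</VirtualHost>\n"
    )


def to_monitoring(res: Resources) -> str:
    """Render the same description as an abbreviated Nagios service stanza."""
    return (
        "define service {\n"
        f"    service_description Apache {res['hostname']}:{res['port']}\n"
        f"    contact_groups {res['contact_group']}\n"
        "}\n"
    )


TRANSLATORS: Dict[str, Callable[[Resources], str]] = {
    "apache": to_apache,
    "monitoring": to_monitoring,
}


def translate_all(res: Resources) -> Dict[str, str]:
    """Run every registered translator over one abstract description."""
    return {name: render(res) for name, render in TRANSLATORS.items()}


def direct_translators(n: int) -> int:
    """Converters needed when every format is translated to every other."""
    return n * (n - 1)


for name, text in translate_all(VHOST).items():
    print(f"# {name}\n{text}")
```

Adding a new consumer means writing one more translator; nothing already written has to change.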


Web server example

• This machine is an SSL-capable web server

• It runs two virtual hosts
  • One http service on port 80, answering all requests, and redirecting to the https service
  • One https service on port 443, answering to ‘www.inf.ed.ac.uk’, serving documents from /var/www/html

• The https service should be given a signed certificate

• The web-service monitoring group is notified on failure


Implementation - Configuration

(other configuration systems are available)


Implementation - Monitoring


Example - Apache resources

#include <options/apache.h>
#include <options/apache-ssl.h>
#include <options/x509-client.h>

apache.vhosts redir ssl

apache.vhostname_redir _default_
apache.vhostverbatim_redir Redirect / https://www.inf.ed.ac.uk

apache.vhostname_ssl www.inf.ed.ac.uk
apache.vhostssl_ssl true
apache.vhostroot_ssl /var/www/html

apache.nagios_groups nagios/web-service

X509_CERTIFICATE(ssl, /etc/pki/tls/certs)

Small


Example - Apache configuration

ServerType standalone
ServerRoot /etc/httpd
LockFile /var/run/httpd.locak
PidFile /var/run/httpd.pid
ScoreBoardFile logs/apache_runtime_status
Timeout 300
KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 15
MinSpareServers 5
MaxSpareServers 20
StartServers 8
MaxClients 150
MaxRequestsPerChild 1000

LoadModule env_module /usr/lib/apache/mod_env.so
LoadModule config_log_module /usr/lib/apache/mod_log_config.so
LoadModule mime_module /usr/lib/apache/mod_mime.so
LoadModule negotiation_module /usr/lib/apache/mod_negotiation.so
LoadModule status_module /usr/lib/apache/mod_status.so
LoadModule includes_module /usr/lib/apache/mod_include.so
LoadModule autoindex_module /usr/lib/apache/mod_autoindex.so
LoadModule dir_module /usr/lib/apache/mod_dir.so
LoadModule cgi_module /usr/lib/apache/mod_cgi.so
LoadModule asis_module /usr/lib/apache/mod_asis.so
LoadModule imap_module /usr/lib/apache/mod_imap.so
LoadModule action_module /usr/lib/apache/mod_actions.so
LoadModule userdir_module /usr/lib/apache/mod_userdir.so
LoadModule alias_module /usr/lib/apache/mod_alias.so
LoadModule access_module /usr/lib/apache/mod_access.so
LoadModule auth_module /usr/lib/apache/mod_auth.so
LoadModule digest_module /usr/lib/apache/mod_digest.so
LoadModule expires_module /usr/lib/apache/mod_expires.so
LoadModule setenvif_module /usr/lib/apache/mod_setenvif.so
LoadModule headers_module /usr/lib/apache/mod_headers.so

Port 80
Listen 443
Listen 80

User apache
Group apache
ServerAdmin sxw
ServerName duffus.inf.ed.ac.uk
DocumentRoot /var/www/html
<Directory />
    Options FollowSymLinks
    AllowOverride None
</Directory>

AccessFileName .htaccess

<Files ~ "^\.ht">
    Order allow,deny
    Deny from all
    Satisfy All
</Files>

UseCanonicalName On

TypesConfig /etc/mime.types
DefaultType text/html

ErrorLog logs/error_log

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined

LogLevel warn
CustomLog logs/access_log combined

SSLPassPhraseDialog builtin
SSLSessionCache shm:logs/ssl_scache(512000)
SSLSessionCacheTimeout 300
SSLMutex file:logs/ssl_mutex
SSLRandomSeed startup builtin
SSLRandomSeed connect builtin
SSLLog logs/ssl_engine_log
SSLLogLevel warn

AddType application/x-httpd-php .php

NameVirtualHost 129.215.165.30:80

<VirtualHost _default_:80>
    ServerName _default_
    Redirect / https://www.inf.ed.ac.uk/
</VirtualHost>

<VirtualHost www.inf.ed.ac.uk:443>
    ServerName www.inf.ed.ac.uk
    SSLEngine On
    SSLCertificateFile /etc/pki/tls/certs/ssl.crt
    SSLCertificateKeyFile /etc/pki/tls/certs/ssl.key
    SSLCertificateChainFile /etc/pki/tls/certs/ssl.chain
    DocumentRoot /var/www/html
</VirtualHost>

Medium


Example - Apache monitoring

define command {
    command_line $USER1$/check_http -I$HOSTADDRESS$ -p $ARG1$
    command_name check_apache_main
}

define command {
    command_line $USER1$/check_http -I$ARG1$ -H$ARG2$ -p $ARG3$ -C$ARG4$
    command_name check_apache_cert
}

define command {
    command_line $USER1$/check_http -I$ARG1$ -H$ARG2$ -p $ARG3$ -u $ARG4$ -S
    command_name check_apache_https
}

define command {
    command_line $USER1$/check_http -I$ARG1$ -H$ARG2$ -p $ARG3$ -u $ARG4$
    command_name check_apache_http
}

define contactgroup {
    alias nagios/web-service
    contactgroup_name nagios/web-service
    members sxw
}

define contact {
    use default-contact
    alias Simon Wilkinson
    contact_name sxw
    email [email protected]
}

define hostgroup {
    alias DICE/Web/Servers
    hostgroup_name DICEWebServers
}

define host {
    use default-host
    address 129.215.165.30
    contact_groups nagios/web-service
    host_name duffus
    hostgroups DICEWebServers
}

define service {
    use default-service
    active_checks_enabled 0
    check_command check_self_active
    contact_groups nagios/web-service
    host_name duffus
    passive_checks_enabled 1
    service_description Profile Translation
}

define service {
    use default-service
    check_command check_apache_main!80
    contact_groups nagios/web-service
    host_name duffus
    service_description Apache
}

define service {
    use default-service
    check_command check_apache_http!129.215.165.30!www.inf.ed.ac.uk!80!/
    contact_groups nagios/web-service
    host_name duffus
    service_description Apache www.inf.ed.ac.uk:80 HTTP
}

define servicedependency {
    dependent_host_name duffus
    dependent_service_description Apache www.inf.ed.ac.uk:80 HTTP
    execution_failure_criteria w,u,c
    host_name duffus
    notification_failure_criteria w,u,c
    service_description Apache
}

define service {
    use default-service
    check_command check_apache_https!129.215.165.30!www.inf.ed.ac.uk!443!/
    contact_groups nagios/web-service
    host_name duffus
    service_description Apache www.inf.ed.ac.uk:443 HTTPS
}

define servicedependency {
    dependent_host_name duffus
    dependent_service_description Apache www.inf.ed.ac.uk:443 HTTPS
    execution_failure_criteria w,u,c
    host_name duffus
    notification_failure_criteria w,u,c
    service_description Apache
}

define service {
    use default-service
    check_command check_apache_cert!129.215.202.30!www.inf.ed.ac.uk!443!18
    contact_groups nagios/web-service
    host_name duffus
    service_description Apache www.inf.ed.ac.uk:443 Certificate
}

define servicedependency {
    dependent_host_name duffus
    dependent_service_description Apache www.inf.ed.ac.uk:443 Certificate
    execution_failure_criteria w,u,c
    host_name duffus
    notification_failure_criteria w,u,c
    service_description Apache
}

Large


Example - Apache monitoring 2

log_file=/var/log/nagios/nagios.log
object_cache_file=/var/log/nagios/objects.cache
resource_file=/etc/nagios/private/resource.cfg
status_file=/var/log/nagios/status.dat
nagios_user=nagios
nagios_group=nagios
check_external_commands=1
command_check_interval=-1
command_file=/var/spool/nagios/cmd/nagios.cmd
comment_file=/var/log/nagios/comments.dat
downtime_file=/var/log/nagios/downtime.dat
lock_file=/var/run/nagios/nagios.pid
temp_file=/var/log/nagios/nagios.tmp
event_broker_options=-1
log_rotation_method=d
log_archive_path=/var/log/nagios/archives
use_syslog=1
log_notifications=1
log_service_retries=1
log_host_retries=1
log_event_handlers=1
log_initial_states=0
log_external_commands=1
log_passive_checks=1
service_inter_check_delay_method=s
max_service_check_spread=30
service_interleave_factor=s
host_inter_check_delay_method=s
max_host_check_spread=30
max_concurrent_checks=0
service_reaper_frequency=10
auto_reschedule_checks=0
auto_rescheduling_interval=30
auto_rescheduling_window=180
sleep_time=0.25
service_check_timeout=60
host_check_timeout=30
event_handler_timeout=30
notification_timeout=30
ocsp_timeout=5
perfdata_timeout=5
retain_state_information=1
state_retention_file=/var/log/nagios/retention.dat
retention_update_interval=60
use_retained_program_state=1
use_retained_scheduling_info=1
interval_length=60
use_aggressive_host_checking=0
execute_service_checks=1
accept_passive_service_checks=1
execute_host_checks=1
accept_passive_host_checks=1
enable_notifications=1
enable_event_handlers=1
process_performance_data=0
obsess_over_services=0
check_for_orphaned_services=1
check_service_freshness=1
service_freshness_check_interval=60
check_host_freshness=0
host_freshness_check_interval=60
aggregate_status_updates=1
status_update_interval=15
enable_flap_detection=0
low_service_flap_threshold=5.0
high_service_flap_threshold=20.0
low_host_flap_threshold=5.0
high_host_flap_threshold=20.0
date_format=euro
p1_file=/usr/sbin/p1.pl
illegal_object_name_chars=`~!$%^&*|'"<>?,()=
illegal_macro_output_chars=`~$&|'"<>
use_regexp_matching=0
use_true_regexp_matching=0
admin_email=nagios-admin@inf.ed.ac.uk
admin_pager=empty
daemon_dumps_core=0

define timeperiod{
    timeperiod_name 24x7
    alias 24 Hours A Day, 7 Days A Week
    sunday 00:00-24:00
    monday 00:00-24:00
    tuesday 00:00-24:00
    wednesday 00:00-24:00
    thursday 00:00-24:00
    friday 00:00-24:00
    saturday 00:00-24:00
}

define contact{
    name default-contact
    service_notification_period 24x7
    host_notification_period 24x7
    service_notification_options w,u,c,r
    host_notification_options d,r
    service_notification_commands service_jnotify
    host_notification_commands host_jnotify
    register 0
}

define host{
    name default-host ; The name of this host template
    active_checks_enabled 1 ; Active host checks are enabled
    notifications_enabled 1 ; Host notifications are enabled
    event_handler_enabled 1 ; Host event handler is enabled
    flap_detection_enabled 1 ; Flap detection is enabled
    failure_prediction_enabled 1 ; Failure prediction is enabled
    process_perf_data 1 ; Process performance data
    retain_status_information 1 ; Retain status information across program restarts
    retain_nonstatus_information 1 ; Retain non-status information across program restarts
    notification_period 24x7 ; Send host notifications at any time
    check_period 24x7
    max_check_attempts 10
    notification_interval 10
    notification_options d,u,r
    check_command check-host-alive
    register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
}

define service{
    name default-service ; The 'name' of this service template
    active_checks_enabled 1 ; Active service checks are enabled
    passive_checks_enabled 1 ; Passive service checks are enabled/accepted
    parallelize_check 1 ; Active service checks should be parallelized (disabling this can lead to major performance problems)
    obsess_over_service 1 ; We should obsess over this service (if necessary)
    check_freshness 0 ; Default is to NOT check service 'freshness'
    notifications_enabled 1 ; Service notifications are enabled
    event_handler_enabled 1 ; Service event handler is enabled
    flap_detection_enabled 1 ; Flap detection is enabled
    failure_prediction_enabled 1 ; Failure prediction is enabled
    process_perf_data 1 ; Process performance data
    retain_status_information 1 ; Retain status information across program restarts
    retain_nonstatus_information 1 ; Retain non-status information across program restarts
    is_volatile 0 ; The service is not volatile
    check_period 24x7
    max_check_attempts 4
    normal_check_interval 5
    retry_check_interval 1
    notification_options w,u,c,r
    notification_interval 10
    notification_period 24x7
    register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}

define command {
    command_name check_passive
    command_line $USER1$/check_dummy 2 "Passive check results not recently received"
}

define command {
    command_name host_jnotify
    command_line /usr/bin/printf "%b" "***** Nagios 2.9 *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /usr/bin/notify-by-jnotify --number $NOTIFICATIONNUMBER$ --subject "Host $HOSTSTATE$ alert for $HOSTNAME$!" $CONTACTEMAIL$
}

define command {
    command_name service_jnotify
    command_line /usr/bin/printf "%b" "***** Nagios 2.9 *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$" | /usr/bin/notify-by-jnotify --number $NOTIFICATIONNUMBER$ --subject "** $NOTIFICATIONTYPE$ alert - $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" $CONTACTEMAIL$
}

define command {
    command_line $USER1$/check_dummy 1 "Service checked passively"
    command_name check_self_active
}

define host {
    use default-host
    address 127.0.0.1
    alias Self Monitoring Pseudo Host
    contact_groups ignore
    host_name Self
    notification_options n
}

define service {
    use default-service
    active_checks_enabled 0
    check_command check_self_active
    contact_groups sxw,idurkacz
    host_name Self
    passive_checks_enabled 1
    service_description Nagios Configuration
}

define command {
    command_line $USER1$/check_http -I$HOSTADDRESS$ -p $ARG1$
    command_name check_apache_main
}

define command {
    command_line $USER1$/check_http -I$ARG1$ -H$ARG2$ -p $ARG3$ -C$ARG4$
    command_name check_apache_cert
}

define command {
    command_line $USER1$/check_http -I$ARG1$ -H$ARG2$ -p $ARG3$ -u $ARG4$ -S
    command_name check_apache_https
}

define command {
    command_line $USER1$/check_http -I$ARG1$ -H$ARG2$ -p $ARG3$ -u $ARG4$
    command_name check_apache_http
}

define contactgroup {
    alias nagios/web-service
    contactgroup_name nagios/web-service
    members sxw
}

define contact {
    use default-contact
    alias Simon Wilkinson
    contact_name sxw
    email [email protected]
}

define hostgroup {
    alias DICE/Web/Servers
    hostgroup_name DICEWebServers
}

define host {
    use default-host
    address 129.215.165.30
    contact_groups nagios/web-service
    host_name duffus
    hostgroups DICEWebServers
}

define service {
    use default-service
    active_checks_enabled 0
    check_command check_self_active
    contact_groups nagios/web-service
    host_name duffus
    passive_checks_enabled 1
    service_description Profile Translation
}

define service {
    use default-service
    check_command check_apache_main!80
    contact_groups nagios/web-service
    host_name duffus
    service_description Apache
}

define service {
    use default-service
    check_command check_apache_http!129.215.165.30!www.inf.ed.ac.uk!80!/
    contact_groups nagios/web-service
    host_name duffus
    service_description Apache www.inf.ed.ac.uk:80 HTTP
}

define servicedependency {
    dependent_host_name duffus
    dependent_service_description Apache www.inf.ed.ac.uk:80 HTTP
    execution_failure_criteria w,u,c
    host_name duffus
    notification_failure_criteria w,u,c
    service_description Apache
}

define service {
    use default-service
    check_command check_apache_https!129.215.165.30!www.inf.ed.ac.uk!443!/
    contact_groups nagios/web-service
    host_name duffus
    service_description Apache www.inf.ed.ac.uk:443 HTTPS
}

define servicedependency {
    dependent_host_name duffus
    dependent_service_description Apache www.inf.ed.ac.uk:443 HTTPS
    execution_failure_criteria w,u,c
    host_name duffus
    notification_failure_criteria w,u,c
    service_description Apache
}

define service {
    use default-service
    check_command check_apache_cert!129.215.202.30!www.inf.ed.ac.uk!443!18
    contact_groups nagios/web-service
    host_name duffus
    service_description Apache www.inf.ed.ac.uk:443 Certificate
}

define servicedependency {
    dependent_host_name duffus
    dependent_service_description Apache www.inf.ed.ac.uk:443 Certificate
    execution_failure_criteria w,u,c
    host_name duffus
    notification_failure_criteria w,u,c
    service_description Apache
}

Extra Large


Translation

• Have to translate our resources into application specific configuration

• Traditionally LCFG has ‘components’ which translate a machine’s resources into a local configuration

• Monitoring introduces more complexity
  • The monitoring configuration is made up of resources from many different machines
  • Introduce translators which convert machine resource fragments into monitoring configuration descriptions


Spanning Maps

• LCFG models configuration on a per host basis

• A publish/subscribe model exists for sharing information between hosts

• We call that model “spanning maps”

• We use it a lot: DHCP, firewall, Kerberos, X509

• Monitoring extends this, so it can see more, quicker


Monitoring Framework

• Monitoring system agnostic translation framework
  • Spanning maps tell it which components to monitor
  • Fetch resources for that component from database
  • Run resources through translator
  • Extract user/group information from user database
  • Combine translator results into configuration file

• All implemented in OO Perl
  • Translators are run time loaded Perl modules, written to a defined interface
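The pipeline above can be sketched roughly as follows. The real framework is written in Perl; Python is used here only for illustration. The interface, class names, output lines and in-memory "database" are all hypothetical, and the user database step is left out for brevity.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Iterable, List, Tuple

Resources = Dict[str, str]


class Translator(ABC):
    """Interface every component translator is assumed to implement."""

    @abstractmethod
    def translate(self, host: str, resources: Resources) -> List[str]:
        """Return monitoring description lines for one host."""


class ApacheTranslator(Translator):
    def translate(self, host: str, resources: Resources) -> List[str]:
        port = resources.get("port", "80")
        return [f"{host}: check http on port {port}"]


def build_configuration(
    spanning_map: Iterable[Tuple[str, str]],
    fetch_resources: Callable[[str, str], Resources],
    translators: Dict[str, Translator],
) -> str:
    """Translate every (host, component) pair and combine the results."""
    lines: List[str] = []
    for host, component in sorted(spanning_map):
        translator = translators.get(component)
        if translator is None:
            continue  # no translator registered for this component
        lines.extend(translator.translate(host, fetch_resources(host, component)))
    return "\n".join(lines) + "\n"


# Hypothetical stand-in for the configuration database.
FAKE_DB = {("duffus", "apache"): {"port": "443"}}

config = build_configuration(
    spanning_map=[("duffus", "apache"), ("duffus", "unknown")],
    fetch_resources=lambda host, component: FAKE_DB.get((host, component), {}),
    translators={"apache": ApacheTranslator()},
)
print(config, end="")
```

Because the framework only sees the `Translator` interface, swapping the monitoring system means replacing the translators, not the pipeline.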


Caching

• Monitoring reconfiguration is costly

• Have to cache as much as possible

• Framework caches
  • Results from configuration database
  • Results from user database
  • Results from translators

• Only performs reconfigurations when required
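One way to cache translator results, sketched under the assumption that a component's resources can be serialised as JSON: fingerprint the input, reuse the previous output when the fingerprint is unchanged, and report whether anything changed so the caller knows if a reconfiguration is needed. Class and key names are hypothetical.

```python
import hashlib
import json
from typing import Callable, Dict, Tuple

Resources = Dict[str, str]


class TranslationCache:
    """Re-run a translator only when its input resources have changed."""

    def __init__(self, translate: Callable[[Resources], str]) -> None:
        self._translate = translate
        self._entries: Dict[str, Tuple[str, str]] = {}  # key -> (digest, output)

    def get(self, key: str, resources: Resources) -> Tuple[str, bool]:
        """Return (output, changed); `changed` is False on a cache hit."""
        payload = json.dumps(resources, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()
        cached = self._entries.get(key)
        if cached is not None and cached[0] == digest:
            return cached[1], False
        output = self._translate(resources)
        self._entries[key] = (digest, output)
        return output, True


cache = TranslationCache(lambda res: f"check http on port {res['port']}")
print(cache.get("duffus/apache", {"port": "80"}))   # first call runs the translator
print(cache.get("duffus/apache", {"port": "80"}))   # unchanged input is a cache hit
print(cache.get("duffus/apache", {"port": "443"}))  # changed input is re-translated
```

The monitoring system only needs to be reloaded when at least one lookup reports a change.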


Why so much?

• Why is that configuration so complex?

• Actually monitoring 4 services
  • The default httpd instance
  • The virtual host running on port 80
  • The SSL service running on port 443
  • The validity of the certificate

• These all have dependencies


Dependencies

[Diagram: dependencies between the Apache server, SSL server, SSL certificate and Redirect server]
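Dependencies like these are what the translator has to write out as Nagios `servicedependency` objects. The sketch below generates them from a child-to-parent map. It assumes, as in the sample Nagios configuration, that the redirect, SSL and certificate checks each depend directly on the base Apache service; the function name is hypothetical.

```python
from typing import Dict


def service_dependencies(host: str, parents: Dict[str, str]) -> str:
    """Render one servicedependency stanza per child -> parent pair."""
    stanzas = []
    for child, parent in sorted(parents.items()):
        stanzas.append(
            "define servicedependency {\n"
            f"    host_name {host}\n"
            f"    service_description {parent}\n"
            f"    dependent_host_name {host}\n"
            f"    dependent_service_description {child}\n"
            "    execution_failure_criteria w,u,c\n"
            "    notification_failure_criteria w,u,c\n"
            "}\n"
        )
    return "\n".join(stanzas)


# Assumed tree shape: every check hangs directly off the base Apache service.
PARENTS = {
    "Apache www.inf.ed.ac.uk:80 HTTP": "Apache",
    "Apache www.inf.ed.ac.uk:443 HTTPS": "Apache",
    "Apache www.inf.ed.ac.uk:443 Certificate": "Apache",
}

print(service_dependencies("duffus", PARENTS))
```

With these in place, a dead httpd produces one alert for "Apache" rather than four.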


Redundant Servers and Clusters

• Modelling clusters is important
  • I really care if more than 2 of my KDCs are down
  • My LDAP service is dependent on there being a KDC available

• Nagios makes this hard; the framework makes it simple

kerberos.nagios_cluster INF.ED.AC.UK_KDC

openldap.nagios_dependency INF.ED.AC.UK_KDC
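The logic those two resources ask for can be sketched as below: a cluster is critical once more than a given number of members are down, and a dependent service only needs one member to be up. Function names, member names and the threshold are illustrative, not part of the framework.

```python
from typing import Dict


def cluster_state(member_up: Dict[str, bool], max_down: int) -> str:
    """Summarise a cluster: CRITICAL once more than `max_down` members are down."""
    down = [name for name, up in member_up.items() if not up]
    if len(down) > max_down:
        return "CRITICAL"
    return "WARNING" if down else "OK"


def dependency_satisfied(member_up: Dict[str, bool]) -> bool:
    """A dependent service (e.g. LDAP) only needs one cluster member up."""
    return any(member_up.values())


# Hypothetical KDC cluster with three of four members down.
kdcs = {"kdc1": True, "kdc2": False, "kdc3": False, "kdc4": False}
print(cluster_state(kdcs, max_down=2))
print(dependency_satisfied(kdcs))
```

Here the cluster check is critical, yet the LDAP dependency still holds because one KDC remains available.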


Notifications

• Notification storms are still a danger

• Systems tend to be maintained by more than one person

• Presence enables us to tell who is available

• Escalation lets us shout louder and wider

• Jabber is used for presence, and initial notification

• Escalations are by email


More on the Jabber bot

• Written using Twisted, the Python asynchronous networking framework

• Bot is permanently connected to our Jabber server, receives notifications by Unix socket

• Maintains state information for all of its buddies
  • Initially only notifies you if you’re ‘available’
  • Then notifies you if you’re online
  • Then emails you
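The escalation ladder can be sketched as a small decision function. The presence values and the mapping from notification number to step are assumptions made for illustration; the bot's actual rules may differ.

```python
from typing import Optional

PRESENCE_STATES = ("available", "away", "offline")  # assumed presence values


def choose_channel(presence: str, notification_number: int) -> Optional[str]:
    """Pick a delivery channel, or None if this contact is skipped for now."""
    if presence not in PRESENCE_STATES:
        raise ValueError(f"unknown presence: {presence}")
    if notification_number <= 1:
        # First attempt: only disturb people who are marked as available.
        return "jabber" if presence == "available" else None
    if notification_number == 2:
        # Second attempt: anyone who is online at all.
        return "jabber" if presence in ("available", "away") else None
    # Third attempt onwards: escalate to email regardless of presence.
    return "email"


for number in (1, 2, 3):
    print(number, choose_channel("away", number))
```

A contact who is away is skipped on the first notification, messaged on the second, and emailed from the third onwards.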


Statistics

Using 2 monitoring machines, we monitor 159 services across 42 hosts. The monitoring system has 5305 lines of configuration, all automatically generated and maintained. (As of 00:50 on 30th March 2008)

Image credit: williamhartz : http://www.flickr.com/photos/whartz/


Issues

• Only checking real world == configuration definition

• If you remove your web server from your configuration, the monitoring system won’t tell you it’s down

• Don’t have a good way of expressing
  • “There must be a www.inf.ed.ac.uk”
  • “There shall be at least 3 KDCs up at all times”
  • “Each site must have an AFS database server”

• Need a more descriptive configuration language to solve these problems


Futures

• Use monitoring events to repair problems
  • Bring up new systems when one goes down
  • Move volumes off AFS fileservers when one becomes too full
  • ...

• Lots of risks, but lots of potential

“Physician, heal thyself”


Conclusions

• A quick tour of some of the complexities of monitoring

• Examined how using a central configuration database can simplify these

• Described an implementation for LCFG and Nagios

• Looked at some possible ways of extending that implementation in the future


Questions?

This talk: http://www.dice.inf.ed.ac.uk/publications/

LCFG: http://www.lcfg.org/

Me: [email protected]

Image credit: hustvedt : http://commons.wikimedia.org/wiki/Image:Three_Surveillance_cameras.jpg