gaspar talk

Upload: nathaleenoo

Post on 10-Apr-2018

218 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/8/2019 Gaspar Talk

    1/28

    Deploying Nagios in a

    Large EnterpriseCarson GasparGoldman Sachs

    [email protected]

  • 8/8/2019 Gaspar Talk

    2/28

    or...

    If you strap enoughrockets to a brick, you

    can make it fly

  • 8/8/2019 Gaspar Talk

    3/28

    In the beginning...

    New Linux HPC skunkworks project Catastrophic success

    Need monitoring added yesterday

  • 8/8/2019 Gaspar Talk

    4/28

    Looking for Solutions

    Why not use what we already had?

    Stability problems Resource utilization problems

    Custom agents were hard

    No support for our new Linux platform

  • 8/8/2019 Gaspar Talk

    5/28

    Why Nagios?

    Purchasing a new commercial solution waspolitically difficult

    At the time (2003) nagios was the most matureof the open source solutions

    Good community support

  • 8/8/2019 Gaspar Talk

    6/28

    The Naive Approach

    (and why it didnt work)

    Performance Problems Configuration Management Problems

    Availability Problems Integration / Automation Requirements

  • 8/8/2019 Gaspar Talk

    7/28

    Performance Problems

    State check performance Active checks: ~3 checks / second maximum

    Statistics performance fork()/exec() for every sample

    Web UI performance Large configurations take a long time to display(much improved in 2.x)

  • 8/8/2019 Gaspar Talk

    8/28

    Configuration

    Management Problems

    Configuration files are very verbose (even withtemplates)

    Syntax errors are easy Keeping up with a high churn rate in monitoredservers is expensive

  • 8/8/2019 Gaspar Talk

    9/28

    Hardware / software failures Building power-downs Patches / upgrades

    Who watches the watchers?

    Availability Problems

  • 8/8/2019 Gaspar Talk

    10/28

    Integration / Automation

    Requirements

    Alarms need to be dispatched to our existingalerting and escalation system Alarms need to be suppressed by existing

    maintenance tools

    Provisioning should flow from our existingprovisioning system

  • 8/8/2019 Gaspar Talk

    11/28

    Solving the Problems

    Move to passive checks

    Run multiple nagios instances Deploy HA nagios servers

    Use data-driven configuration file generation

    Create a custom notification back end

  • 8/8/2019 Gaspar Talk

    12/28

    Passive Checks

    Move most of the work to the clients

    Batch server updates unless a state changeoccurs Randomize server update times to avoid spikes

    Queue the results on the server Send statistics to a different back end

  • 8/8/2019 Gaspar Talk

    13/28

    Passive Data FlowClient 1 Client n

    ...

    ... Server n

    monqueue

    stats-catcher

    ... nagios n

    nagios 1

    Server 1

    nagios-agent nagios-agent

    monqueue

    stats-catcher

    ... nagios n

    nagios 1

  • 8/8/2019 Gaspar Talk

    14/28

    Multiple Nagios Instances

    Run many copies of nagios on one server Improve web UI performance Show each group only their own servers, so the

    top level dashboard is more useful

    Allow per-group customizations

  • 8/8/2019 Gaspar Talk

    15/28

    HA Nagios Servers

    Run multiple nagios servers on differentmachines in different buildings

    All clients update all servers A heartbeat is published through each server to

    its counterpart

    Notifications are suppressed on slaves if theheartbeat service is OK Partitioned masters can cause duplicate alerts

  • 8/8/2019 Gaspar Talk

    16/28

    HA Data Flowprimary server secondary server

    client

    nagios-agent

    nagios

    monqueueheartbeat

    notify

    monqueue

    nagios

    ping

    notify

    heartbeat

    ping

  • 8/8/2019 Gaspar Talk

    17/28

    Data-driven Configuration

    File Generation

    Leverage our existing host database andprovisioning tools

    Generate client and server configurations viacfengine based on templates and databaselookups

    Mostly driven by database data, with some per-server threshold overrides managed in cfenginemaster files

  • 8/8/2019 Gaspar Talk

    18/28

    Custom Notification

    Back End

    Custom code integrates with our Netcoolinfrastructure

    Alarm suppression based on external criteria

    Also supports email alerts, optionally batched

  • 8/8/2019 Gaspar Talk

    19/28

    Design Trade-offs

    Batch updates mean slow detection of zombiehosts (ping-able, but not running user processes)

    nagios notification escalation doesnt work wellwithout active checks, especially if updates arebatched

    Requires configuration management More complexity = more bugs

  • 8/8/2019 Gaspar Talk

    20/28

    nagios-agent

    Design Criteria Lightweight

    Easy to write and deploy additional agents Avoid fork()/exec() where possible Support agent callbacks to avoid blocking

    Support proxy agents to monitor other deviceswhere we cant deploy nagios-agent Evaluate all thresholds locally and batch server

    updates

  • 8/8/2019 Gaspar Talk

    21/28

    nagios-agentnagios-agent

    Agent

    Class: Ping

    Instance: db

    Categories: appdev, sa

    Args:

    hostlist => /my/dblist,

    latency => [ 50, 100],

    losspct => [ 66, 100 ],

    count => 3

    Agent

    Class: Ping

    Instance: LDN

    Categories: sa

    Args:

    hostlist => /my/ldnlist,

    latency => [ 100, 200],

    losspct => [ 66, 100 ],

    count => 3

    Server

    Class: monqueue

    Categories: appdevArgs:

    auth => GSSAPI

    keytab => mykeytab

    server => myserver,

    queue => 0

    Server

    Class: monqueue

    Categories: saArgs:

    auth => GSSAPI

    keytab => mykeytab

    server => myserver,

    queue => 1

  • 8/8/2019 Gaspar Talk

    22/28

    monqueue

    Design Criteria

    Fast

    Secure Accept data from clients, and dispatch to multiple

    output queues

    Supply heartbeats to nagios Supply queue depth stats to stats-catcher

  • 8/8/2019 Gaspar Talk

    23/28

    Client Evolution

    nagios-agent slowly grew features as they becamerequired

    multiple agent instances

    agent instance to server mapping auto reload of configuration, modules on

    update

    auto re-exec of nagios-agent on update stats collection SASL authentication to monqueue

  • 8/8/2019 Gaspar Talk

    24/28

    Server Evolution

    Started as one monolithic instance As deployment spread, split into multiple

    instances based on administrative domain

    added HA added SASL authentication and authorization

    added monitoring of monqueue itself, and servicedependencies (so a monqueue failure didnttrigger alarms for all services)

  • 8/8/2019 Gaspar Talk

    25/28

    Kudzu

    Originally for one project, fewer than 200 hosts

    Eventually used for large sections of theenvironment

    Documentation and internal consultancy are

    critical for user acceptance

  • 8/8/2019 Gaspar Talk

    26/28

    One of Our Servers

    A single HP DL385 G1 (dual 2.6GHz Opteron,4GB RAM), running RHEL4 U4, nagios 2.9

    11 nagios instances 27,000+ services (mostly 2 minute intervals) 6,600+ hosts

    ~10% CPU ~500 MB RAM

  • 8/8/2019 Gaspar Talk

    27/28

    Still to Come

    Source code release Encryption of nagios-agent / monqueue

    communications

    Support for pulling status from nagios-agent tobetter support DMZ environments Statistical analysis of multiple data samples to

    determine service status

    Yet more agent plugins nagios-agent support for traditional nagios

    plugins

  • 8/8/2019 Gaspar Talk

    28/28

    Deploying Nagios in a

    Large EnterpriseCarson GasparGoldman Sachs

    [email protected]