
Sensu: The Monitoring Router

Peter Burkholder
@pburkholder

DevOpsDC MeetUp
8 May 2012

Sunday, May 13, 12

So, unless I'm mistaken, more of our DevOpsDC talks have dealt with monitoring than with any other area of our work. Not as much time has been devoted to, say, automated testing, configuration management, culture, career development, operating systems, databases, or cloud providers, to name a few.


Monitoring

• #monitoringsucks

• Or is it idiosyncratic?


Is this because "Monitoring Sucks"?

I would argue that it is more that monitoring is technically the most "idiosyncratic" thing that we have to concern ourselves with, both as the developers of online systems and as the operations folks who run said software. (Say why it's more idiosyncratic: it varies along multiple dimensions: time (office hours, after hours, during deploys/maintenance); space (which environment or role the monitored client occupies); and xxxxx the distinctiveness of your application's needs; and finally it's coupled back into itself in terms of service dependencies and meta-monitoring.)

Wherever we sit in relation to this mess of running stuff, we'll generally expect two or three things from the monitoring we have:

What’s Happening?

What’s Happened?

That is, we need to be informed when something is going south, preferably before the impact is discernible to our customers, so we can take remedial action.

And we need ways of gathering and visualizing numeric metrics so we can anticipate future needs and look for correlations that lead us to the cause of present (or past) problems.

System and application log aggregation is a third leg of what might be considered a tripod, but it is generally distinct from the first two.

If you want to delve into monitoring considerations at a higher level, I recommend Limoncelli and Hogan, and Patrick Debois. For our purposes, I think how Sensu fits into the monitoring universe can be well illustrated by a case study: namely, the monitoring needs we face at Audax Health in running Careverge:

The Problem

Careverge

At Audax Health we run a consumer health social network called Careverge, and the core architecture should be familiar to most of you. Web traffic to www.careverge.com will hit:

a) an AWS ELB
b) an elastic array of web nodes
c) an internal LB
d) an elastic array of API nodes
e) a MongoDB replica set
f) miscellaneous servers that provide Careverge services (recommendations) or dev/infra services (puppet, jenkins, graphite, logging, monitoring)

[Diagram: the Careverge stack replicated across environments: Prod, Exp, Dev, QA]

And then smaller copies of that in various environments: dev, test, and RC. All of this is managed by Puppet, with some help from MCollective.

The way I've generally tackled monitoring over the last few gigs has been with the Nagios monitoring framework because, despite the age of some of the code, Ethan Galstad has gotten a lot of things right that I've seen utterly absent from recent "enterprise-class" products (don't go into MySQL monitor, Hyperic)

Nagios

[Diagram: the Nagios server actively polls API-1 (API, NRPE) and LAMP-1 (httpd, NRPE) with checks such as "check_api 8443" and "check_nrpe -c disk"]


Nagios primitives

• Services

• Hosts

• ServiceGroups

• HostGroups

• Dependencies, Commands, Contacts, ...


Puppet + Nagios

• Node comes up as Puppet client w/ ‘role’

• Puppet stashes facts in storeconfig DB

• Nagios puppet run

• ‘exported resources’=>‘hosts.cfg’

• host is member of hostgroups: generic, role

• services are monitored across hostgroups
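The flow on this slide can be sketched in a few lines of Ruby. This is only an illustration of the idea, not the Puppet code we ran: the Node struct, the addresses, and the 'generic-host' template name are assumptions made up for the example. It shows how per-node records (name, address, roles) turn into Nagios host objects whose hostgroups are 'generic' plus one per role.

```ruby
# Hypothetical node records, standing in for what Puppet's storeconfig
# database would hold after each node checks in with its role.
Node = Struct.new(:name, :address, :roles)

# Render one Nagios host object. Every host joins the 'generic'
# hostgroup plus one hostgroup per role. 'generic-host' is an assumed
# template name.
def nagios_host(node)
  hostgroups = (['generic'] + node.roles).uniq.join(',')
  <<~CFG
    define host {
        use         generic-host
        host_name   #{node.name}
        address     #{node.address}
        hostgroups  #{hostgroups}
    }
  CFG
end

# Concatenate all host objects into the body of hosts.cfg.
def hosts_cfg(nodes)
  nodes.map { |n| nagios_host(n) }.join("\n")
end

nodes = [
  Node.new('api-1',  '10.0.0.11', ['cvapi']),
  Node.new('lamp-1', '10.0.0.12', ['lamp'])
]
puts hosts_cfg(nodes)
```

Note that the file is only regenerated when the Nagios server itself runs Puppet, which is where the trouble starts.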


fail

• storeconfig

• new nodes ... Nagios server lag

• old nodes ... No API to del from DB

• new roles => new hostgroup => fail


Now you can probably guess at some of the pitfalls of this approach.

One caveat: the default storeconfig can be a real resource pig in Puppet, with something along the lines of 10,000 inserts being done for each client. 'thin_storeconfigs' is much better and completely sufficient for storing the node data needed for exported node resources.

The obvious issue is the lag between node instantiation and the convergence of the Nagios server. Requiring a high frequency of Puppet runs is resource intensive, and triggering runs from node instantiation is yet another layer of complexity to build in. In either case you risk frequent interruptions in your monitoring while the Nagios server restarts to pick up the new configuration or, worse, it goes down because somehow an error or inconsistency appeared in the configuration.

For example, the hostgroups we assigned to each host were based on the role parameter assigned to the node. E.g.

role_list=opsusers,carecontent

Compare to

role_list=opsusers,carenewfoo

and abruptly the whole thing comes crashing down.
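A guard against this failure might look like the following sketch. It is not something we had in place; the list of defined hostgroups is hypothetical. The point is simply that every role in role_list must already exist as a hostgroup on the Nagios server, because Nagios will not load a configuration in which a host references an undefined hostgroup.

```ruby
# Hostgroups that already have a 'define hostgroup' block in the
# Nagios configuration (hypothetical names).
DEFINED_HOSTGROUPS = %w[generic opsusers carecontent].freeze

# Return the roles in a comma-separated role_list that have no
# matching hostgroup. Any non-empty result would break the Nagios
# configuration if the host were exported as-is.
def undefined_hostgroups(role_list, defined = DEFINED_HOSTGROUPS)
  role_list.split(',').map(&:strip).reject { |r| defined.include?(r) }
end

p undefined_hostgroups('opsusers,carecontent') # => []
p undefined_hostgroups('opsusers,carenewfoo')  # => ["carenewfoo"]
```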

Also disconcerting was the process of removing a node. At least a new node would be monitorable within a few minutes of coming up. A terminated node is still in the storeconfig DB, and there's no direct API for it, so you'd need to script the SQL to delete the correct rows to effect a node removal.

That was about as far as I got before I realized I needed to take my hands off the keyboard and step back.

Sensu


Fortunately, there'd been some buzz on Twitter about Sensu, and over the course of a weekend I became convinced that I needed to abandon Nagios, or any other monolithic monitoring system, and try Sensu.

Sensu started as an internal project at Sonian, an archive-as-a-service provider that runs on AWS. Sean Porter and others had been using Nagios with Chef, and ran into many of the same convergence issues that I have described, as well as scaling issues with Nagios's active check architecture. So Sean and a teammate, TKTK, wrote Sensu for internal use and open-sourced it in November of 2011. Since then, it has seen a lot of uptake and has a very active community. Lots of help is available on IRC if you stop in.

Architecture

• RabbitMQ AMQP message bus

• sensu-server (Ruby) + Redis k/v store

• sensu-client

• sensu-api

• sensu-dashboard


sensu-mq

• RabbitMQ

• Sonian scales to 500-1000 nodes with 1 EC2 instance


There is no ‘sensu-mq’ actually.

sensu-server

• sensu-server (Ruby) and Redis (C)

• JSON configuration

• /etc/sensu/config.json (main config)

• /etc/sensu/conf.d/ (JSON snippets)

Then we need at least one sensu-server, and typically on that box we'll run Redis to provide persistence. Sensu-server is written in Ruby and can be installed as a gem, as an RPM, and soon as a .deb.

All of the configuration is in JSON. Here's a minimal configuration for a sensu-server:

    {
      "rabbitmq": {
        "host": "<%= rabbitmq_host %>",
        "port": <%= rabbitmq_port %>
      },
      "redis": {
        "host": "<%= redis_host %>",
        "port": <%= redis_port %>
      },
      "api": {
        "host": "<%= api_host %>",
        "port": <%= api_port %>
      }
    }

No checks configured here.

sensu-client

    {
      "rabbitmq": {
        "host": "<%= rabbitmq_host %>",
        "port": <%= rabbitmq_port %>
      },
      "api": {
        "host": "<%= api_host %>",
        "port": <%= api_port %>
      },
      "client": {
        "name": "<%= sensu_hostname %>",
        "address": "<%= ipaddress %>",
        "subscriptions": ["generic", "cvapi"]
      }
    }


One config.json to rule them all

sensu

[Diagram: sensu-clients on API-1 (API) and LAMP-1 (httpd) connected to sensu-server through RabbitMQ]

checks

    {
      "checks": {
        "careverge_api": {
          "handlers": ["irc", "mailer"],
          "notification": "Careverge API is not responding appropriately",
          "command": "/etc/sensu/plugins/local/check_cvapi.sh -S",
          "subscribers": ["cvapi"],
          "interval": 30,
          "refresh": 600
        }
      }
    }


Plugins!

How it works

• server publishes ‘check-api’ to ‘cvapi’

• some clients subscribe ‘cvapi’

• run check

• publish result

• server processes results, passes to handlers
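The steps above can be mimicked with a toy, in-memory model. This is not Sensu code and there is no RabbitMQ involved: the Bus and Client classes, the client names, and the canned check results are all invented for illustration. It only shows the shape of the flow: the server publishes a check request to a subscription, only the subscribed clients run it, and the server passes non-OK results on to handlers.

```ruby
# In-memory stand-in for the message bus: subscription name => clients.
class Bus
  def initialize
    @subscribers = Hash.new { |h, k| h[k] = [] }
  end

  def subscribe(subscription, client)
    @subscribers[subscription] << client
  end

  # Deliver a check request to every client on the subscription and
  # collect the results they publish back.
  def publish(subscription, check)
    @subscribers[subscription].map { |client| client.run(check) }
  end
end

class Client
  def initialize(name, runner)
    @name = name
    @runner = runner # callable that returns [exit status, output]
  end

  def run(check)
    status, output = @runner.call(check[:command])
    { client: @name, check: check[:name], status: status, output: output }
  end
end

bus = Bus.new
bus.subscribe('cvapi', Client.new('api-1', ->(_cmd) { [0, 'API OK'] }))
bus.subscribe('cvapi', Client.new('api-2', ->(_cmd) { [2, 'API CRITICAL'] }))
bus.subscribe('lamp',  Client.new('lamp-1', ->(_cmd) { [0, 'OK'] }))

check = { name: 'careverge_api', command: 'check_cvapi.sh -S' }
results = bus.publish('cvapi', check)

# Server side: hand every non-zero result to the handlers
# (here, just print it).
alerts = results.reject { |r| r[:status].zero? }
alerts.each { |r| puts "#{r[:client]} #{r[:check]}: #{r[:output]}" }
```

The server never needs a list of hosts; a new client simply starts listening on its subscriptions.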


Works almost too well

Sensu works so well that I had to make sure that the check scripts were installed before the sensu service. Not because there's any logical dependency, but simply because otherwise the client would come up and start acting on published requests for, say, 'check_disks' and fail because the 'check_disk.rb' script wasn't there yet.

Notification Handlers

• subclassed from Sensu::Handler

• distributed as .rb scripts with .json config

• community:

• mail, irc, hipchat, campfire, pagerDuty, twitter
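As a rough idea of what a handler does, here is a minimal sketch that does not depend on the Sensu::Handler base class. It assumes the event is JSON with a 'client' name and a 'check' carrying 'name', 'status', and 'output'; those field names and the sample event are assumptions for illustration, so check them against a real event before relying on them.

```ruby
require 'json'

STATUS_NAMES = { 0 => 'OK', 1 => 'WARNING', 2 => 'CRITICAL' }.freeze

# Build a one-line notification from a parsed event hash.
def notification(event)
  client = event['client']['name']
  check  = event['check']
  status = STATUS_NAMES.fetch(check['status'], 'UNKNOWN')
  "#{status}: #{client}/#{check['name']}: #{check['output'].to_s.strip}"
end

# Sample event for illustration only. A real handler script would
# parse the event JSON it receives from the server instead.
sample = {
  'client' => { 'name' => 'api-1' },
  'check'  => {
    'name'   => 'careverge_api',
    'status' => 2,
    'output' => "Careverge API is not responding appropriately\n"
  }
}.to_json

puts notification(JSON.parse(sample))
```

Everything medium-specific (mail, IRC, and so on) would replace the final puts.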


API

• thin/sinatra on port 4567

• GET/PUT/POST/DELETE k/v in Redis and

• make check requests

• Very handy, for, say...


Dropping a Node

• sensu-client publishes keep-alive

• On orderly termination:

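    require 'json'
    require 'net/http'
    require 'uri'

    # Read the client name and API endpoint from the local Sensu config.
    config = JSON.parse(File.read(config_file))
    client_name = config['client']['name']
    api_host = config['api']['host']
    api_port = config['api']['port']

    # DELETE /client/<name> removes this node from Sensu.
    uri = URI.parse("http://#{api_host}:#{api_port}/client/#{client_name}")
    http = Net::HTTP.new(uri.host, uri.port)
    http.request(Net::HTTP::Delete.new(uri.path))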

One also has a fairly rudimentary interface to the API with the Sensu Dashboard:

sensu-dashboard


Not PHB-compliant.


So Far...

• Components: RabbitMQ, Redis, Ruby

• sensu-server:

• pubs check requests

• pushes results to handlers

• sensu-client: perform checks, pushes results

• sensu-api, sensu-dashboard

• JSON configuration

• Plugins, Handlers, Keep-Alives


• What’s Happening?

• What’s Happened?


Metric Handlers

• E.g. ‘vmstat_metrics’ plugin returns:

    stats.sensu-server.swap.in 0 1336502402
    stats.sensu-server.swap.out 0 1336502402
    stats.sensu-server.memory.cache 1408388 1336502402
    stats.sensu-server.memory.swap_used 0 1336502402
    stats.sensu-server.memory.free 5492292 1336502402

• Define a check as a ‘type: metric’

• Add to a subscription

Metric Handlers

• ‘type: metric’ is always passed to handler

• On server, use a ‘graphite’ handler

• Feeds to Graphite over TCP or AMQP
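What a Graphite handler has to do over TCP is small enough to sketch. This is not the community handler, just an illustration: the function names are made up, and a StringIO stands in for the socket so the example runs without a Graphite server. The plugin output is already in Graphite's plaintext format ("path value timestamp", one metric per line), so the handler only needs to validate and forward it.

```ruby
require 'stringio'

# Parse plugin output in Graphite's plaintext format:
# "<metric path> <value> <unix timestamp>", one metric per line.
# Lines that do not have exactly three fields are skipped.
def parse_metrics(output)
  fields = output.each_line.map(&:split).select { |f| f.size == 3 }
  fields.map do |path, value, timestamp|
    { path: path, value: Float(value), timestamp: Integer(timestamp) }
  end
end

# Write metrics to any IO-like object. Against a real Graphite this
# would be a TCPSocket connected to Carbon's plaintext listener
# (port 2003 by default).
def send_to_graphite(io, metrics)
  metrics.each { |m| io.puts("#{m[:path]} #{m[:value]} #{m[:timestamp]}") }
end

output = <<~OUT
  stats.sensu-server.swap.in 0 1336502402
  stats.sensu-server.memory.free 5492292 1336502402
OUT

io = StringIO.new
send_to_graphite(io, parse_metrics(output))
print io.string
```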


At least one site has standardized on using Sensu everywhere, so that all of their metrics are defined in one place.

But wait, there’s more...

• Metrics integration (Graphite, Librato)

• Application Integration (port 2030)

• Standalone Checks

• Parameter Passing

• Scheduling Downtime

• Sensu and Puppet/Chef


What’s Happening

• Sensu is great at adapting to changes in your operating environment

• Notifies effectively across various media

• Lacks:

• Tactical dashboard

• Notification Hours, Contact Groups

To Do: Wrap up slide: What’s happening and What’s happened? Draw Sensu OmniGraffle?

What’s Happened

• Metrics integration with Graphite, Librato, Geckoboard

• Applications can fire-and-forget to UDP port 2030

• Lacks:

• Uptime History

• Notification History


Bear in Mind

• Not even a toddler (Nov 2011 open-source)

• Active Community

• Traction


For more:

• GitHub repo and wiki: http://github.com/sensu

• Joe Miller’s excellent blog series:

• http://joemiller.me/category/sensu/

• IRC Channel: irc://irc.freenode.net/#sensu

• My interview with Sean Porter on Sensu:

• http://bit.ly/zGZhjg


fini
