Sensu: The Monitoring Router
Peter Burkholder (@pburkholder)
DevOpsDC MeetUp, 8 May 2012
Sunday, May 13, 12
So, unless I'm mistaken, more of our DevOpsDC talks have dealt with monitoring than with any other area of our work. Not as much time has been devoted to, say, automated testing, configuration management, culture, career development, operating systems, databases, or cloud providers, to name a few.
Monitoring
• #monitoringsucks
• Or is it idiosyncratic?
Is this because "Monitoring Sucks"?
I would argue it is more that monitoring is technically the most "idiosyncratic" thing we have to concern ourselves with, both as the developers of online systems and as the operations folks who run said software. It varies along multiple dimensions: time (office hours, after hours, during deploys and maintenance); space (which environment or role the monitored client occupies); and the distinctiveness of your application's needs. Finally, it is coupled back into itself through service dependencies and meta-monitoring.
Wherever we sit in relation to this mess of running stuff, we'll generally expect two or three things from the monitoring we have:
What’s Happening?
What’s Happened?
That is, we need to be informed when something is going south, preferably before the impact is discernible to our customers, so we can take remedial action.
And we need ways of gathering and visualizing numeric metrics so we can anticipate future needs and look for correlations that lead us to the cause of present (or past) problems.
System and application log aggregation is the third leg of what might be considered a tripod, but it is generally distinct from the first two.
If you want to delve into monitoring considerations at a higher level, I recommend Limoncelli and Hogan, and Patrick Debois. For our purposes, I think how Sensu fits into the monitoring universe is best illustrated by a case study: namely, the monitoring needs we face at Audax Health in running Careverge.
The Problem
Careverge
At Audax Health we run a health consumer social network called Careverge, and the core architecture should be not unfamiliar to most of you. Web traffic to www.careverge.com will hit:
a) an AWS ELB
b) an elastic array of web nodes
c) an internal LB
d) an elastic array of API nodes
e) a MongoDB server replication set
f) miscellaneous servers that provide Careverge services (recommendations) or dev/infra services (puppet, jenkins, graphite, logging, monitoring)
Careverge
[Diagram: Careverge environments: Prod, Exp, Dev, QA]
And then smaller copies of that in various environments: dev, test, and RC. All of this is managed by Puppet, with some help from MCollective.
The way I've generally tackled monitoring over the last few gigs has been with the Nagios monitoring framework because, despite the age of some of the code, Ethan Galstad got a lot of things right that I've seen utterly absent from recent "enterprise-class" products (I won't go into MySQL Monitor or Hyperic here).
Nagios
[Diagram: the Nagios server polls API-1 (API, NRPE) and LAMP-1 (httpd, NRPE) using check_api 8443 and check_nrpe -c disk]
Nagios primitives
• Services
• Hosts
• ServiceGroups
• HostGroups
• Dependencies, Commands, Contacts, ...
Puppet + Nagios
• Node comes up as Puppet client w/ ‘role’
• Puppet stashes facts in storeconfig DB
• Nagios puppet run
• ‘exported resources’=>‘hosts.cfg’
• host is member of hostgroups: generic, role
• services are monitored across hostgroups
fail
• storeconfig
• new nodes ... Nagios server lag
• old nodes ... No API to del from DB
• new roles => new hostgroup => fail
Now you can probably guess at some of the pitfalls of this approach.
One caveat: the default storeconfig can be a real resource pig in Puppet, with something on the order of 10,000 inserts being done for each client. 'thin_storeconfigs' is much better and is entirely sufficient for storing the node data needed for exported resources.
The obvious issue is the lag between node instantiation and the convergence of the Nagios server. Requiring a high frequency of Puppet runs is resource intensive, and triggering runs from node instantiation is yet another layer of complexity to build in. In either case you risk frequent interruptions in your monitoring while the Nagios server restarts to pick up the new configuration or, worse, goes down because an error or inconsistency has appeared in the configuration.
For example, the hostgroups we assigned to each host were based on the role parameter assigned to the node. E.g.
role_list=opsusers,carecontent
Compare to
role_list=opsusers,carenewfoo
and abruptly the whole thing comes crashing down.
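The failure mode above can be sketched in a few lines of Ruby. This is an illustration only, not part of the original setup: the `DEFINED_HOSTGROUPS` list and the `missing_hostgroups` helper are hypothetical stand-ins for whatever hostgroups already exist in the Nagios configuration. The point is that a role with no matching hostgroup is enough to make the generated configuration inconsistent.

```ruby
# Hypothetical set of hostgroups already defined in the Nagios configuration.
DEFINED_HOSTGROUPS = %w[generic opsusers carecontent].freeze

# Return the roles in a role_list string that have no matching hostgroup.
def missing_hostgroups(role_list, defined = DEFINED_HOSTGROUPS)
  role_list.split(',').map(&:strip) - defined
end

# The first role list is fully covered; the second introduces an unknown role.
puts missing_hostgroups('opsusers,carecontent').inspect
puts missing_hostgroups('opsusers,carenewfoo').inspect
```

A check like this could run before the Nagios configuration is regenerated, so that a new role is flagged rather than taking the monitoring server down.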
Also disconcerting was the process of removing a node. At least a new node would be monitorable within a few minutes of coming up. A terminated node is still in the storeconfig DB, and there's no direct API for it, so you'd need to script the SQL to delete the correct rows to effect a node removal.
That was about as far as I got before I realized I needed to take my hands off the keyboard and step back.
Sensu
Fortunately, there'd been some buzz on Twitter about Sensu, and over the course of a weekend I became convinced that I needed to abandon Nagios, or any other monolithic monitoring system, and try Sensu.
Sensu started as an internal project at Sonian, an archive-as-a-service provider that runs on AWS. Sean Porter and others had been using Nagios with Chef, and ran into many of the same convergence issues I have described, as well as scaling issues with Nagios's active check architecture. So Sean and a teammate wrote Sensu for internal use and open-sourced it in November 2011. Since then it has seen a lot of uptake and has a very active community. Lots of help is available on IRC if you stop in.
Architecture
• RabbitMQ AMQP message bus
• sensu-server (Ruby) + Redis k/v store
• sensu-client
• sensu-api
• sensu-dashboard
sensu-mq
• RabbitMQ
• Sonian scales to 500-1000 nodes with 1 EC2 instance
There is no ‘sensu-mq’ actually.
sensu-server
• sensu-server (Ruby) and Redis (C)
• JSON configuration
• /etc/sensu/config.json (main config)
• /etc/sensu/conf.d/ (JSON snippets)
Then we need at least one sensu-server, and typically on that box we'll run Redis to provide persistence. Sensu-server is written in Ruby and can be installed as a gem, as an RPM, and soon as a .deb.
All of the configuration is in JSON. Here's a minimal configuration for a sensu-server:
{
  "rabbitmq": {
    "host": "<%= rabbitmq_host %>",
    "port": <%= rabbitmq_port %>
  },
  "redis": {
    "host": "<%= redis_host %>",
    "port": <%= redis_port %>
  },
  "api": {
    "host": "<%= api_host %>",
    "port": <%= api_port %>
  }
}
sensu-server
No checks configured here.
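Because the configuration is an ERB template that must render to valid JSON, it can help to render and parse it before it reaches a node. The following is a minimal sketch under stated assumptions: the host names and the RabbitMQ and Redis ports are made-up example values, and `render_config` is a hypothetical helper, not part of Sensu or Puppet. Parsing the rendered result catches mistakes such as a stray trailing comma.

```ruby
require 'erb'
require 'json'

# ERB template mirroring the server configuration; variable names follow the slide.
TEMPLATE = <<~'ERB_TEMPLATE'
  {
    "rabbitmq": { "host": "<%= rabbitmq_host %>", "port": <%= rabbitmq_port %> },
    "redis": { "host": "<%= redis_host %>", "port": <%= redis_port %> },
    "api": { "host": "<%= api_host %>", "port": <%= api_port %> }
  }
ERB_TEMPLATE

# Example values only; the hosts are hypothetical.
VALUES = {
  rabbitmq_host: 'mq.example.com', rabbitmq_port: 5672,
  redis_host: 'localhost', redis_port: 6379,
  api_host: 'localhost', api_port: 4567
}.freeze

# Render the template, then parse it so invalid JSON raises JSON::ParserError.
def render_config(template, values)
  rendered = ERB.new(template).result_with_hash(values)
  JSON.parse(rendered)
  rendered
end

puts render_config(TEMPLATE, VALUES)
```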
{
  "rabbitmq": {
    "host": "<%= rabbitmq_host %>",
    "port": <%= rabbitmq_port %>
  },
  "api": {
    "host": "<%= api_host %>",
    "port": <%= api_port %>
  },
  "client": {
    "name": "<%= sensu_hostname %>",
    "address": "<%= ipaddress %>",
    "subscriptions": ["generic", "cvapi"]
  }
}
sensu-client
No checks configured here.
One config.json to rule them all
[Diagram: sensu clients on API-1 (API) and LAMP-1 (httpd) connect through RabbitMQ to sensu-server]
{
  "checks": {
    "careverge_api": {
      "handlers": ["irc", "mailer"],
      "notification": "Careverge API is not responding appropriately",
      "command": "/etc/sensu/plugins/local/check_cvapi.sh -S",
      "subscribers": ["cvapi"],
      "interval": 30,
      "refresh": 600
    }
  }
}
checks
Plugins!
How it works
• server publishes ‘check-api’ to ‘cvapi’
• some clients subscribe to ‘cvapi’
• run check
• publish result
• server processes results, passes to handlers
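The flow above can be sketched with an in-memory stand-in for the message bus. This is a toy model, not Sensu's code: the `Bus` and `Client` classes are hypothetical, no RabbitMQ is involved, and the client fakes an OK result instead of executing the command. It only shows why a check reaches exactly the clients that hold the matching subscription.

```ruby
# Minimal in-memory stand-in for the message bus: subscription name => clients.
class Bus
  def initialize
    @subscribers = Hash.new { |hash, key| hash[key] = [] }
  end

  def subscribe(subscription, client)
    @subscribers[subscription] << client
  end

  # Deliver a check request to every client on the subscription; collect results.
  def publish(subscription, check)
    @subscribers[subscription].map { |client| client.run(check) }
  end
end

Client = Struct.new(:name) do
  # A real client would execute check[:command]; here we fake an OK result.
  def run(check)
    { client: name, check: check[:name], status: 0, output: 'OK' }
  end
end

bus = Bus.new
api1 = Client.new('API-1')
lamp1 = Client.new('LAMP-1')
%w[generic cvapi].each { |subscription| bus.subscribe(subscription, api1) }
bus.subscribe('generic', lamp1)

check = {
  name: 'careverge_api',
  command: '/etc/sensu/plugins/local/check_cvapi.sh -S',
  subscribers: ['cvapi']
}

# The server publishes to each subscription named by the check.
results = check[:subscribers].flat_map { |subscription| bus.publish(subscription, check) }
results.each { |r| puts "#{r[:client]} #{r[:check]} status=#{r[:status]}" }
```

Only API-1 answers here, because LAMP-1 never subscribed to ‘cvapi’; adding a node to a role is just a matter of giving its client the subscription.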
Works almost too well
Sensu works so well that I had to make sure the check scripts were installed before the sensu service. Not because there's any logical dependency, but simply because otherwise the client would come up, start acting on published requests for, say, 'check_disks', and fail because the 'check_disk.rb' script wasn't there yet.
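To make the failure concrete, here is a rough sketch of running a check command and mapping its exit status onto the Nagios-style plugin convention (0 OK, 1 WARNING, 2 CRITICAL, anything else UNKNOWN). The `run_check` helper and the script path are hypothetical; this is not how sensu-client is implemented, only an illustration of what happens when the script is missing.

```ruby
require 'open3'

# Exit-status convention shared with Nagios plugins.
STATUS_NAMES = { 0 => 'OK', 1 => 'WARNING', 2 => 'CRITICAL' }.freeze

# Run a check command and report its output, exit status, and state name.
def run_check(command)
  output, status = Open3.capture2e(command)
  code = status.exitstatus
  { output: output.strip, status: code, state: STATUS_NAMES.fetch(code, 'UNKNOWN') }
rescue Errno::ENOENT => e
  # The script is not installed (yet): report UNKNOWN instead of crashing.
  { output: e.message, status: 127, state: 'UNKNOWN' }
end

# A hypothetical plugin path that does not exist on this machine.
puts run_check('/nonexistent/plugins/check_disk.rb')[:state]
# A stand-in command that exits with status 2.
puts run_check("sh -c 'exit 2'")[:state]
```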
Notification Handlers
• subclassed from Sensu::Handler
• distributed as .rb scripts with .json config
• community:
• mail, irc, hipchat, campfire, pagerDuty, twitter
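As a rough idea of what a handler does with an event, the sketch below turns an event hash into a one-line notification. It deliberately avoids the Sensu::Handler base class so that it runs on its own; the `notification_line` helper is hypothetical, and the event shape (a `client` with a `name`, and a `check` with `name`, `status`, and `output`) is an assumption to verify against the events your Sensu version emits.

```ruby
# Map a check status to a state name (Nagios-style convention).
STATE_NAMES = { 0 => 'OK', 1 => 'WARNING', 2 => 'CRITICAL' }.freeze

# Build a one-line notification from an event hash.
# Assumed shape: { 'client' => { 'name' }, 'check' => { 'name', 'status', 'output' } }
def notification_line(event)
  check = event.fetch('check')
  client_name = event.fetch('client').fetch('name')
  state = STATE_NAMES.fetch(check['status'], 'UNKNOWN')
  "#{state}: #{client_name}/#{check['name']}: #{check['output'].to_s.strip}"
end

# Sample event, hand-written for illustration.
sample_event = {
  'client' => { 'name' => 'API-1' },
  'check' => {
    'name' => 'careverge_api',
    'status' => 2,
    'output' => "Careverge API is not responding appropriately\n"
  }
}

puts notification_line(sample_event)
```

In a real handler the resulting line would be passed on to mail, IRC, or whichever medium the handler serves.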
API
• thin/sinatra on port 4567
• GET/PUT/POST/DELETE k/v in Redis and
• make check requests
• Very handy, for, say...
Dropping a Node
• sensu-client publishes keep-alive
• On orderly termination:
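require 'json'
require 'net/http'
require 'uri'

# Remove this client from Sensu via the API on orderly termination.
config = JSON.parse(File.read(config_file))
client_name = config['client']['name']
api = config['api']
# Include the API port from the config; without it the request goes to port 80.
uri = URI.parse("http://#{api['host']}:#{api['port']}/client/#{client_name}")
http = Net::HTTP.new(uri.host, uri.port)
http.request(Net::HTTP::Delete.new(uri.path))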
One also has a fairly rudimentary interface to the API with the Sensu Dashboard:
sensu-dashboard
Not PHB-compliant.
sensu-dashboard
So Far...
• Components: RabbitMQ, Redis, Ruby
• sensu-server:
• pubs check requests
• pushes results to handlers
• sensu-client: performs checks, pushes results
• sensu-api, sensu-dashboard
• JSON configuration
• Plugins, Handlers, Keep-Alives
• What’s Happening?
• What’s Happened?
Metric Handlers
• E.g. ‘vmstat_metrics’ plugin returns:
stats.sensu-server.swap.in 0 1336502402
stats.sensu-server.swap.out 0 1336502402
stats.sensu-server.memory.cache 1408388 1336502402
stats.sensu-server.memory.swap_used 0 1336502402
stats.sensu-server.memory.free 5492292 1336502402
• Define a check as a ‘type: metric’
• Add to a subscription
Metric Handlers
• ‘type: metric’ is always passed to handler
• On server, use a ‘graphite’ handler
• Feeds to Graphite over TCP or AMQP
At least one site has standardized on using Sensu everywhere to consolidate where they define all metrics.
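The metric output shown for ‘vmstat_metrics’ follows Graphite's plaintext format: a metric path, a value, and a Unix timestamp on each line. A small sketch of parsing such output, for instance inside a handler that wants to inspect the values before forwarding them, might look like the following. The `Metric` struct and `parse_metrics` helper are hypothetical names used for illustration.

```ruby
# One parsed metric line: path, numeric value, Unix timestamp.
Metric = Struct.new(:path, :value, :timestamp)

# Parse "path value timestamp" lines, skipping blank lines.
def parse_metrics(output)
  output.each_line.map(&:strip).reject(&:empty?).map do |line|
    path, value, timestamp = line.split
    Metric.new(path, Float(value), Integer(timestamp))
  end
end

# Two of the lines from the plugin output above.
sample_output = <<~OUTPUT
  stats.sensu-server.swap.in 0 1336502402
  stats.sensu-server.memory.free 5492292 1336502402
OUTPUT

metrics = parse_metrics(sample_output)
metrics.each { |m| puts "#{m.path} = #{m.value} at #{Time.at(m.timestamp).utc}" }
```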
But wait, there’s more...
• Metrics integration (Graphite, Librato)
• Application Integration (port 2030)
• Standalone Checks
• Parameter Passing
• Scheduling Downtime
• Sensu and Puppet/Chef
What’s Happening
• Sensu is great at adapting to changes in your operating environment
• Notifies effectively across various media
• Lacks:
• Tactical dashboard
• Notification Hours, Contact Groups
To Do: Wrap-up slide: What’s happening and What’s happened? Draw Sensu OmniGraffle?
What’s Happened
• Metrics integration with Graphite, Librato, Geckoboard
• Applications can fire-and-forget to UDP port 2030
• Lacks:
• Uptime History
• Notification History
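The fire-and-forget integration mentioned above could be sketched as follows. This is a sketch under assumptions, not a documented client: the payload fields (`name`, `output`, `status`) are my assumption about what the local sensu-client accepts, `build_result` and `report` are hypothetical helpers, and the port is left as a required argument so that it is taken from your client's configuration rather than from this example.

```ruby
require 'json'
require 'socket'

# Build the JSON payload for an application-reported result.
# The field names are an assumption about what the client accepts.
def build_result(name, output, status)
  JSON.generate('name' => name, 'output' => output, 'status' => status)
end

# Send one datagram and return without waiting for any reply (fire-and-forget).
def report(payload, port:, host: '127.0.0.1')
  socket = UDPSocket.new
  socket.send(payload, 0, host, port)
ensure
  socket&.close
end

payload = build_result('app_queue_depth', 'queue depth is 1200', 1)
puts payload
```

An application would then call `report(payload, port: ...)` with the port its local client listens on. Since UDP gives no acknowledgement, the application never blocks on monitoring, which is the appeal of this approach.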
Bear in Mind
• Not even a toddler (Nov 2011 open-source)
• Active Community
• Traction
For more:
• GitHub repo and wiki: http://github.com/sensu
• Joe Miller’s excellent blog series:
• http://joemiller.me/category/sensu/
• IRC Channel: irc://irc.freenode.net/#sensu
• My interview with Sean Porter on Sensu:
• http://bit.ly/zGZhjg
fini