sensu @ yelp!: a guided tour

Sensu @ Yelp - A Guided Tour Kyle Anderson https://github.com/solarkennedy

Upload: kyle-anderson

Post on 08-May-2015


DESCRIPTION

This is a presentation demonstrating how Sensu is used at Yelp to support dynamic infrastructure, and promote self-service monitoring among teams. Video Part 1: https://vimeo.com/92770954 Video Part 2: https://vimeo.com/92838680

TRANSCRIPT

Page 1: Sensu @ Yelp!: A Guided Tour

Sensu @ Yelp - A Guided Tour

Kyle Anderson
https://github.com/solarkennedy

Page 2: Sensu @ Yelp!: A Guided Tour

Disclaimer

I’m just a dude.

I know that when I watch a presentation by a company that I recognize, I think to myself, “Hmm, $company, I’ve heard of them. They probably have their stuff together. Let’s see what they do…”

I’m here to describe, not persuade. I may not have everything together. Just because I have things with “Unit Tests”, doesn’t mean I’m “Right”.

Especially with a “framework” like Sensu, there can be more than one way to do things. The trick is figuring out what works for you. I hope that by giving a real, concrete example, you might be inspired to step up your monitoring game.

Page 3: Sensu @ Yelp!: A Guided Tour

Outline

1. Overall Architecture
2. Sensu Server Setup
   a. Custom Base Handler
3. Client Configuration
   a. Sensu Check Puppet Wrapper
4. Yelp SOA Checks
5. AWS/Cloudwatch Checks
6. Dealing with Ephemeral EC2 Servers
7. Cron Job Monitoring
8. Future Work

Page 4: Sensu @ Yelp!: A Guided Tour

Overall Architecture

● profile::sensu_client
  ○ Sensu clients connect to RabbitMQ on one of the servers (DNS Round Robin)
● profile::sensu_server
  ○ Base HAProxy install
  ○ RabbitMQ in Mirror Mode, load balanced via HAProxy
  ○ Redis in Master/slave mode, load balanced via HAProxy (only master passes healthcheck)
  ○ Sensu Server installed, subscribes on RabbitMQ
  ○ API load balanced via HAProxy
  ○ Dashboard load balanced via HAProxy

Page 5: Sensu @ Yelp!: A Guided Tour

Logical Diagram

Page 6: Sensu @ Yelp!: A Guided Tour

Puppet Modules in Use

puppetlabs/rabbitmq

puppetlabs/haproxy

kyleanderson/redis_sentinel

arioch/redis

sensu/sensu

Page 7: Sensu @ Yelp!: A Guided Tour

Addressing Complexity

“Sensu has so many moving parts that I wouldn’t be able to sleep at night unless I set up a Nagios instance to make sure they were all running.”

Laurie Denness
https://laur.ie/blog/2014/02/why-ill-be-letting-nagios-live-on-a-bit-longer-thank-you-very-much/

Page 8: Sensu @ Yelp!: A Guided Tour

Addressing Complexity

“I will be honest; I haven’t used Sensu, because I’m in a happy place right now, but just the architectural diagram of how it works scares the shit out of me.

When you need 7 arrow colours to describe where data is going in a monitoring system, I’m starting to fear it slightly. But hey, if it works, good on you guys. It just looks a lot like this. Nothing wrong with that, if you can make it stable and reliable.”

Laurie Denness
https://laur.ie/blog/2014/02/why-ill-be-letting-nagios-live-on-a-bit-longer-thank-you-very-much/

Page 9: Sensu @ Yelp!: A Guided Tour

First Principle: Single Point of Truth

Page 10: Sensu @ Yelp!: A Guided Tour

Pop Quiz: Determine what Servers are Puppetmasters?

• A: Puppet manifests (include puppetmaster)
• B: DNS (puppet.local A 10.5.x.x)
• C: update-live script (for Server in ….)
• D: The servers that have had the puppetmaster bootstrap script run on them
• E: What MCollective says (mco find -C puppetmaster)

Answer: All / None of the above!

Page 11: Sensu @ Yelp!: A Guided Tour

Sensu Server Detection

# Use DNS to detect if this server is a sensu server
$local_sensu_server_array = gethostbyname2array("sensu.local-${::habitat}.yelpcorp.com")
$ip_address_array = split($::all_ipaddresses, ',')
validate_array($local_sensu_server_array)
validate_array($ip_address_array)
$array_intersection = intersection($ip_address_array, $local_sensu_server_array)

# If our ipaddresses are in the dns entries, we must be a sensu server!
if size($array_intersection) > 0 {
  $is_sensu_server = true
} else {
  $is_sensu_server = false
}
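The same "am I in the DNS answer?" test can be sketched outside Puppet. This is a minimal Ruby illustration, not the code Yelp runs: the function names are made up for the example, and the two helper methods only show where the address lists could come from on a real host.

```ruby
require 'resolv'
require 'socket'

# Pure decision: true when any local address appears in the DNS answer.
# (Hypothetical name; mirrors the intersection() test in the Puppet snippet.)
def sensu_server?(local_ips, dns_ips)
  !(local_ips & dns_ips).empty?
end

# Assumed helper: IPv4 addresses configured on this host.
def local_ipv4_addresses
  Socket.ip_address_list.select(&:ipv4?).map(&:ip_address)
end

# Assumed helper: every address the round-robin DNS name resolves to.
def resolve_all(hostname)
  Resolv.getaddresses(hostname)
end
```

On a real host the call would look like `sensu_server?(local_ipv4_addresses, resolve_all(name))`, where `name` is the round-robin DNS entry. Keeping the decision separate from the lookups makes it easy to unit test.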

Page 12: Sensu @ Yelp!: A Guided Tour

HAProxy

• Every server in the sensu cluster runs its own HAProxy
• HAProxy listens on the “standard” ports, individual instances listen on standard + 1
• Having an array of sensu servers from DNS allows us to grow the backends
• If HAProxy dies, clients will re-resolve, and reconnect.

Page 13: Sensu @ Yelp!: A Guided Tour

RabbitMQ

• Every server in the sensu cluster runs a rabbitmq server in mirror mode (with autoheal for AP)
• Lots of individual clusters, not doing shoveling.
• Client authentication via SSL client certs (controlled by puppet)
• Load balanced by haproxy
• Sensu-clients automatically reconnect on failure

Page 14: Sensu @ Yelp!: A Guided Tour

Redis

• Redis is the persistent store used by Sensu to keep track of heartbeats, what alerts are silenced, how many times a check has failed, etc

• Redis is set up in a cluster mode, with redis-sentinel doing automatic master/slave promotion. (Kinda CP)

• We use the redis-role haproxy master pattern suggestion from http://failshell.io/sensu/high-availability-sensu/

Page 15: Sensu @ Yelp!: A Guided Tour

Sensu API + Dashboard

• sensu-api provides a REST API with JSON output for integration.

• sensu-cli is provided for easy command line interactive use

• Both the API and Dashboard use basic auth internally (shared secret), and then LDAP+SSL auth externally.

• sensu-dashboard uses this api, and is behind our external facing apache for authentication.

Page 16: Sensu @ Yelp!: A Guided Tour

Sensu Servers:

• Automatically does master election, good. Build for 3.
• Connects to RabbitMQ, pulls events off and acts on them
• Runs “handlers” on the event data
• That’s kinda it
• Which leads to handlers….

Page 17: Sensu @ Yelp!: A Guided Tour

Sensu Timing Tunables Before/After

Custom check definition key-values

Custom key-values can be added to a check definition, which will be included in event data, enabling handler creativity.

Common custom check definitions:

• interval: How frequently (in seconds) the check will be executed
• occurrences: Number of event occurrences before the handler should take action
• refresh: Number of seconds handlers should wait before taking second action. Relies on sensu-plugin.

Yelp Monitoring Check Definition Key Values

The custom base handler interprets these values:

• check_every = '5m',
• alert_after = '0s',
• realert_every = '1',

Page 18: Sensu @ Yelp!: A Guided Tour

Custom Base Handler

def filter_repeated
  interval = @event['check']['interval'] || 0
  alert_after = @event['check']['alert_after'] || 0
  realert_every = @event['check']['realert_every'] || 1
  # How long the check has been failing, in seconds
  failing_for = @event['occurrences'].to_i * @event['check']['interval'].to_i
  if failing_for < alert_after
    bail "Only failing for #{failing_for}, less than #{alert_after}. Not performing any action yet."
  elsif interval > 0 and @event['action'] == 'create'
    # Occurrences that elapsed before alert_after was reached
    initial_failing_occurrences = alert_after.fdiv(interval).to_i
    number_of_failed_attempts = @event['occurrences'] - initial_failing_occurrences
    unless number_of_failed_attempts == 0 || number_of_failed_attempts % realert_every == 0
      bail 'only handling every ' + realert_every.to_s + ' occurrences'
    end
  end
end
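To see what the filter does over time, here is the same arithmetic as a standalone function. It is a sketch for illustration only: the name `act_on_event?` and its keyword arguments are invented, and all values are assumed to already be integers in seconds.

```ruby
# Returns true when a handler should act on this occurrence.
# Hypothetical standalone version of the filter_repeated arithmetic.
def act_on_event?(occurrences:, interval:, alert_after: 0, realert_every: 1, action: 'create')
  failing_for = occurrences * interval
  # Still inside the alert_after grace period: stay quiet.
  return false if failing_for < alert_after

  if interval > 0 && action == 'create'
    initial_failing_occurrences = alert_after.fdiv(interval).to_i
    failed_attempts = occurrences - initial_failing_occurrences
    # Act on the first eligible occurrence, then every realert_every-th one.
    return false unless failed_attempts.zero? || (failed_attempts % realert_every).zero?
  end
  true
end

# A check run every 60s with alert_after = 600s and realert_every = 2.
(9..13).each do |n|
  decision = act_on_event?(occurrences: n, interval: 60, alert_after: 600, realert_every: 2)
  puts "occurrence #{n}: #{decision}"
end
```

With these numbers the function stays quiet through occurrence 9, acts on occurrence 10 (ten minutes of failure), skips 11, acts on 12, and skips 13.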

Page 19: Sensu @ Yelp!: A Guided Tour

Other Handlers In Use

● IRC (Triaged by who is “on-point”)
● Email (not a thing)
● Pagerduty (Handled by “on-call”)
● OpsGenie (trialing)
● aws_prune (only on ec2 nodes)
● motd (sensu-report, not really a handler. Used for situation awareness)

Future Handlers

● JIRA (auto create/close a ticket after a while?)
● Flapjack?

Page 20: Sensu @ Yelp!: A Guided Tour

Sensu Clients

• Almost every server @yelp runs the sensu client (thank you omnibus packages!)

• They connect to the Round-Robin dns entry local to their zone.

• All checks are standalone, configured by puppet

Page 21: Sensu @ Yelp!: A Guided Tour

Monitoring Check Puppet Wrapper

define monitoring_check (
  $command,
  $runbook,
  $check_every   = '5m',
  $alert_after   = '0s',
  $realert_every = '1',
  $irc_channels  = undef,
  $tip           = false,
  $page          = false,
  $wake          = true,
  $needs_sudo    = false,
  $sudo_user     = 'root',
  $team          = 'operations',
  $ensure        = 'present',
  $dependencies  = [],
  $sensu_custom  = {},
) {
  ……

Lots of validation. Lots of tests.
Mandatory runbook!
Human readable time units!
Easy to add sudo rules!
TIP: The one line runbook for lazy humans!
Team defaults to ops for convenience. Usually set to $::profile::server::team
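The slides do not show how the wrapper turns '5m' or '4h' into the seconds that Sensu's interval expects. One way that conversion could look, as a Ruby sketch: the function name is hypothetical, and only the s/m/h suffixes that appear in this deck are handled.

```ruby
# Seconds per unit suffix. Only suffixes seen in the slides are included;
# whether the real wrapper accepts others is not shown.
UNIT_SECONDS = { 's' => 1, 'm' => 60, 'h' => 3600 }.freeze

# Convert a value such as '5m' into an integer number of seconds.
# Hypothetical helper, not the wrapper's actual implementation.
def human_time_to_seconds(value)
  match = /\A(\d+)([smh])\z/.match(value.to_s)
  raise ArgumentError, "unrecognised time value: #{value.inspect}" unless match

  match[1].to_i * UNIT_SECONDS.fetch(match[2])
end
```

Rejecting anything that does not match, instead of guessing, fits the "lots of validation" approach: a typo in a check definition fails at compile time, not at 3am.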

Page 22: Sensu @ Yelp!: A Guided Tour

Monitoring Check Puppet Wrapper Example

# Make sure apt-mirroring is working by checking the age of the NEW file left over.
monitoring_check { 'apt-mirror':
  check_every => '4h',
  team        => 'operations',
  page        => false,
  runbook     => 'y/rb-package-mirroring',
  tip         => 'Talk to kwa. Check /var/spool/apt-mirror/var/cron.log, then /nail/apt-mirror/var/apt-mirror.lock.',
  command     => '/usr/lib/nagios/plugins/check_file_age /nail/apt-mirror/var/NEW -w 86400 -c 172800',
}
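For readers who have not used check_file_age, the idea behind the `-w 86400 -c 172800` thresholds is roughly the following. This Ruby sketch is not the Nagios plugin's implementation. The function name is invented, and two behaviors are assumptions: that a file exactly at a threshold counts as over it, and that a missing file is critical.

```ruby
OK, WARNING, CRITICAL = 0, 1, 2

# Compare a file's age against warning/critical thresholds (in seconds)
# and return a Nagios-style [status, message] pair.
def file_age_status(path, warn_seconds:, crit_seconds:, now: Time.now)
  # Assumption: a missing file is treated as critical.
  return [CRITICAL, "#{path} does not exist"] unless File.exist?(path)

  age = (now - File.mtime(path)).to_i
  status = if age >= crit_seconds
             CRITICAL
           elsif age >= warn_seconds
             WARNING
           else
             OK
           end
  [status, "#{path} is #{age} seconds old"]
end
```

With the apt-mirror thresholds, a NEW file older than one day would warn and one older than two days would go critical.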

Page 23: Sensu @ Yelp!: A Guided Tour

Why Not Use The Native Puppet Type?

● The wrapper reduces the boilerplate and gives good defaults

● Enforces site-specific policies and validation (team names, mandatory runbooks)

● Allows us to modify all puppet-controlled sensu checks in the future from a single spot.

● Custom tests

● Allows us to be backend agnostic (maybe)

Page 24: Sensu @ Yelp!: A Guided Tour

Yelp SOA Checks

• How do we (Yelp) empower our developers to monitor their services?

• How can we safely and conveniently allow devs to define checks within our SOA framework?

• How can Devs not be blocked by Ops for service deployment?

Page 25: Sensu @ Yelp!: A Guided Tour

Define the Meta Check

# Defined on all hosts that run yelp SOA infrastructure
monitoring_check { 'check-yelp_soa':
  check_every => '1m',
  alert_after => '10m',
  page        => true,
  runbook     => 'http://y/rb-check-yelpsoa',
  tip         => 'Run /etc/sensu/plugins/check-yelp_soa.rb --debug to see what is wrong?',
  command     => '/etc/sensu/plugins/check-yelp_soa.rb',
  require     => Class['::yelp_soa'],
}

Page 26: Sensu @ Yelp!: A Guided Tour

check-yelp_soa.rb redux

def run
  # TODO: Parallelize?
  configs.each do |service, config|
    next unless services_that_run_here.include?(service)
    $log.debug "Processing #{service} as apparently it runs here"
    srv_configs = read_srv_configs(service)
    next unless srv_configs.include?('monitoring_check')
    monitoring_check = srv_configs['monitoring_check']
    if numeric?(config['port'])
      ...
      if command == 'check_http'
        url = monitoring_check['check_url'] || '/status'
        $log.debug "Making a http check for #{service}, team: #{team}, warn_timeout: #{warn_timeout}, crit_timeout: #{crit_timeout}"
        output, status = check_http(port,url,http_expect,warn_timeout,crit_timeout)
      elsif monitoring_check['command'] == 'check_tcp'
        $log.debug "Making a tcp check for #{service}, team: #{team}, warn_timeout: #{warn_timeout}, crit_timeout: #{crit_timeout}"
        output, status = check_tcp(port,warn_timeout,crit_timeout)
      else
        $log.debug "Not spawning a check for #{service} because I don't know how to run #{command}"
        next
      end
      send_result_to_sensu(service, status, output, team, runbook, tip, page, alert_after, realert_every, irc_channels)
      services_checked << service
    end # End port check
  end # End for loop
  ok "Finished run. Ran checks on #{services_checked}"
end

Page 27: Sensu @ Yelp!: A Guided Tour

What was that?

Iterate through the SOA services that are configured to run on a server.
Determine if that service has monitoring metadata defined by the authors.
Operate on that metadata to check it (usually check_http).
Send the results of the check to the localhost:3030 socket as a *Different* check (“soa_$servicename”).

See https://gist.github.com/joemiller/5806570 for another example
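The last step, handing a result to the local sensu-client socket, can be sketched in a few lines of Ruby. The `name`, `status`, and `output` keys are what the client socket expects. Everything else here is an assumption: the function names, and the idea that extra keys such as `team` ride along as custom fields. The slides do not show the real send_result_to_sensu.

```ruby
require 'json'
require 'socket'

# Build the result for one service under its own check name ("soa_<service>").
# Extra keys (team, page, runbook, ...) are merged in as custom fields.
def build_soa_result(service, status, output, extra = {})
  { 'name' => "soa_#{service}", 'status' => status, 'output' => output }.merge(extra)
end

# Write the JSON document to the sensu-client socket on localhost:3030.
def send_result_to_sensu(result, host: 'localhost', port: 3030)
  TCPSocket.open(host, port) { |socket| socket.write(JSON.generate(result)) }
end
```

Because each service is reported under its own name, one meta check on the host fans out into one Sensu event per service, each routed to its owning team.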

Page 28: Sensu @ Yelp!: A Guided Tour

An example service (request_blocking)

# from request_blocking.yaml
monitoring_check:
  team: 'infra'
  alert_after: 2m
  realert_every: 2
  irc_channels: 'infra'
  url: '/status'
  tip: "no tips yet"
  warn_timeout: 2.0
  crit_timeout: 5.0
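Reading that metadata and filling in defaults might look like the Ruby sketch below. It is an illustration, not the plugin's code: `monitoring_check_for` and the `DEFAULTS` hash are hypothetical, the key names follow the YAML on this slide, and the default values are borrowed from the Puppet wrapper's defaults shown earlier.

```ruby
require 'yaml'

# Assumed defaults, modelled on the monitoring_check wrapper's parameters.
DEFAULTS = {
  'url' => '/status',
  'page' => false,
  'alert_after' => '0s',
  'realert_every' => 1
}.freeze

# Return the service's monitoring metadata with defaults applied,
# or nil when the service author did not define a monitoring_check.
def monitoring_check_for(yaml_text)
  config = YAML.safe_load(yaml_text) || {}
  check = config['monitoring_check']
  return nil unless check

  DEFAULTS.merge(check)
end
```

Returning nil for services without metadata matches the `next unless srv_configs.include?('monitoring_check')` step: no metadata means no check is spawned.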

Page 29: Sensu @ Yelp!: A Guided Tour

AWS/Cloudwatch Checks

• Pretty much the same thing, except:
• Checks are executed on special monitoring hosts in the AZ (not on the ephemeral node)
• Runs graphite/check_data.rb against the provided metric name
• Written in python this time! (https://pypi.python.org/pypi/sensu)

Page 30: Sensu @ Yelp!: A Guided Tour

Dealing with Ephemeral EC2 Nodes

• Yelp lives in a hybrid world: we have lots of “ephemeral” EC2 nodes that are baked and do NOT run puppet. Can Sensu still work on them?

• How do we prevent ourselves from being spammed when hosts go away “normally”?

• How do we know what a host is without logging into it? (EC2 metadata)

• Baking………..

Page 31: Sensu @ Yelp!: A Guided Tour

EC2 Considerations

• We use puppet to bake AMIs for ELBs, so we can control (via puppet) how Sensu is configured at bake time.

• We can query the AWS API to know if a host has gone away, and prune it from the Queue to squelch alerts.

• Using custom client metadata, we can add things like puppet cert name, AMI_ID, etc at runtime with a special init script.

Page 32: Sensu @ Yelp!: A Guided Tour

For Non-Ephemeral Instances

if str2bool($::is_ec2) == true {
  $client_custom = {
    'instance_id' => $::ec2_instanceid,
    'keepalive'   => {
      'handlers' => [ 'aws_prune', 'default' ],
      'team'     => $team,
      'page'     => true
    }
  }
} else {
  $client_custom = {
    'team' => $team,
    'page' => true
  }
}

Only EC2 Servers need the special aws_prune handler
A Fact! Embed it for easy troubleshooting

Page 33: Sensu @ Yelp!: A Guided Tour

For Ephemeral (baked) Instances

description "Fix Sensu clientinfo on startup for baked ec2 instances"
author "Kyle Anderson <[email protected]>"
start on starting sensu-client
task
script
  ADDRESS=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)
  AMI_ID=$(curl -s http://169.254.169.254/latest/meta-data/ami-id)
  INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
  /usr/bin/jq ".client.name = \"$(/usr/local/sbin/puppet-certname)\" | .client.address = \"$ADDRESS\" | .client.instance_id = \"$INSTANCE_ID\" | .client.ami_id = \"$AMI_ID\" " /etc/sensu/conf.d/client.json > /etc/sensu/conf.d/newclient.json
  mv /etc/sensu/conf.d/client.json /etc/sensu/conf.d/client.json.old
  mv /etc/sensu/conf.d/newclient.json /etc/sensu/conf.d/client.json
end script

Only run once, right before sensu-client
Real data. Can’t lie.
Overwrite what we were baked with. It is wrong.
jq FTW
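For anyone not fluent in jq, the filter above amounts to "overwrite four keys under client, keep everything else". Here is the same transformation sketched in Ruby, for illustration only. The function name is made up, and the deck's actual mechanism is the jq one-liner.

```ruby
require 'json'

# Return a copy of the baked config with the runtime identity filled in.
# Keys not listed here (subscriptions, keepalive, ...) are left untouched.
def patch_client_config(config, name:, address:, instance_id:, ami_id:)
  client = (config['client'] || {}).merge(
    'name' => name,
    'address' => address,
    'instance_id' => instance_id,
    'ami_id' => ami_id
  )
  config.merge('client' => client)
end

# Example with made-up values standing in for the EC2 metadata lookups.
baked = { 'client' => { 'name' => 'bake-host', 'subscriptions' => ['base'] } }
patched = patch_client_config(baked, name: 'web1', address: '10.0.0.5',
                                     instance_id: 'i-0abc', ami_id: 'ami-0123')
puts JSON.pretty_generate(patched)
```

The baked name is replaced and the subscriptions survive, which is the whole point: the image carries the configuration, and the instance supplies its own identity at boot.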

Page 34: Sensu @ Yelp!: A Guided Tour

Pruning Terminated EC2 Nodes

● Modification of https://github.com/sensu/sensu-community-plugins/blob/master/handlers/other/ec2_node.rb

● Instead we use a cron job to cache the results of the api call into json so we can be nice to AWS

● Then we can have *every* check use this handler, as it is easy to just check on disk if the instance_id is active.

● Use the instance_id from the client data to figure out who you are. (which should be correct from the above)

Page 35: Sensu @ Yelp!: A Guided Tour

What Does It Look Like?

file { '/etc/sensu/plugins/cache_instance_list.rb':
  owner  => 'root',
  group  => 'root',
  mode   => '0500',
  source => 'puppet:///modules/profile/sensu/handlers/cache_instance_list.rb',
} ->
cron::d { 'cache_instance_list':
  minute  => '*',
  user    => 'root',
  command => "/etc/sensu/plugins/cache_instance_list.rb -a ${access_key} -r ${region} -k ${secret_key}",
} ->
monitoring_check { 'cache_instance_list-staleness':
  check_every => '10m',
  alert_after => '1h',
  team        => 'test',
  runbook     => 'y/rb-aws-prune',
  command     => "/usr/lib/nagios/plugins/check_file_age /var/cache/instance_list.json -w 1800 -c 3600",
  page        => false,
}

Page 36: Sensu @ Yelp!: A Guided Tour

The Handler (puppet)

  $access_key = hiera('sensu::aws_key')
  $secret_key = hiera('sensu::aws_secret')
  $aws_config_hash = {
    access_key           => $access_key,
    secret_key           => $secret_key,
    region               => $region,
    blacklist_name_array => [ 'bake_soa_ami', 'Packer Builder' ]
  }
  sensu::handler { 'aws_prune':
    type    => 'pipe',
    source  => 'puppet:///modules/profile/sensu/handlers/aws_prune.rb',
    config  => $aws_config_hash,
    require => [ Package['rubygem-fog'], Package['rubygem-sensu-plugin'], Package['rubygem-unf'] ],
  }
}

Page 37: Sensu @ Yelp!: A Guided Tour

The Handler (Ruby)

def ec2_node_exists?
  running_instances = load_instances_cache
  instance_ids = running_instances.collect { |s| Hash[ 'id', s['id'], 'tags', s['tags'] ]}
  my_instance_id = @event['client']['instance_id']
  instance_ids.each do |instance|
    # Only look at the cache entry for this client; checking the blacklist
    # against every entry would return false as soon as *any* blacklisted
    # instance appeared earlier in the list.
    next unless my_instance_id == instance['id']
    # YELP SPECIFIC CODE
    instance_name = instance['tags']['Name'].to_s
    # Yelp specific: pretend that the node does not exist if we are in our blacklist
    return false if blacklist_name_array.include?(instance_name)
    return true
  end
  return false # no match found, node doesn't exist
end
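The slides do not show load_instances_cache, so here is one way the cache read and the lookup could fit together, as a standalone Ruby sketch. It rests on several assumptions. The cache is taken to be a JSON array of objects with `id` and `tags` keys, as the handler code implies. The path comes from the staleness check. The function signatures are invented. The sketch also applies the blacklist only to the entry matching the client's own instance_id.

```ruby
require 'json'

# Read the instance list that the cron job cached on disk.
# A missing or corrupt cache raises on purpose: treating it as "no
# instances" would make every node look terminated.
def load_instances_cache(path = '/var/cache/instance_list.json')
  JSON.parse(File.read(path))
end

# True when instance_id is in the cached list and its Name tag is not
# blacklisted. Hypothetical standalone form of ec2_node_exists?.
def ec2_node_exists?(instances, instance_id, blacklist = [])
  instance = instances.find { |candidate| candidate['id'] == instance_id }
  return false if instance.nil?

  name = (instance['tags'] || {})['Name'].to_s
  !blacklist.include?(name)
end
```

Since the lookup only touches a local file, every handler invocation can afford to run it, and the AWS API is called just once a minute by the cron job.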

Page 38: Sensu @ Yelp!: A Guided Tour

Cron Job Monitoring

• I believe cron sending emails is an anti-pattern and not *web-scale*

• Let’s use Sensu to monitor our cron jobs!
• Use a combination of a cron puppet type wrapper and my Sensu-Shell-Helper
• Modified sensu-shell-helper includes fields for team and page for yelp-specific things: https://github.com/solarkennedy/sensu-shell-helper

Page 39: Sensu @ Yelp!: A Guided Tour

What does it look like?

$command = 'chgrp -R admin /nail/packages/'
cron::d { 'fix-packages-permissions':
  mailto  => '',
  minute  => '10',
  user    => 'root',
  comment => 'Make permissions group writable for collaboration purposes',
  command => "sensu-shell-helper -n fix-packages-permissions -p false -t operations ${command}",
  ensure  => 'present',
}

See https://github.com/torrancew/puppet-cron#cronjob for related work.
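The idea behind wrapping a cron command is simple: run it, capture the output and exit code, and turn that into a check result instead of an email. Here is a rough Ruby sketch of that idea. It is not how sensu-shell-helper is implemented. The function name is invented, and the mapping of any non-zero exit to critical (2) is an assumption made for the example.

```ruby
require 'json'
require 'open3'
require 'rbconfig'

# Run a command (given as an argv array) and describe the outcome as a
# Sensu-style result hash. team and page are the Yelp-specific extras.
def run_as_check(name, command, team:, page: false)
  output, process_status = Open3.capture2e(*command)
  # Assumption: exit 0 is OK, anything else is critical.
  status = process_status.success? ? 0 : 2
  { 'name' => name, 'status' => status, 'output' => output, 'team' => team, 'page' => page }
end

# Example: a "cron job" that fails, using the current Ruby as the command.
failing = [RbConfig.ruby, '-e', 'puts "disk full"; exit 1']
puts JSON.generate(run_as_check('fix-packages-permissions', failing, team: 'operations'))
```

The resulting hash could then be written to the local client socket, so a failing cron job pages or pings the owning team like any other check.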

Page 40: Sensu @ Yelp!: A Guided Tour

Future Work

● battle-test more of the pagerduty stuff (blocked on bogus aws nodes still)
● sort out AWS pruning, harder (#61626)
● make tools that work on nagios *and* sensu?
● really monitor the sensu instances in nagios with alerts (#60164)
● enable self-serve sensu alerts for services (#62201)
● make a library for sending passive checks (#62440)
● set up infrastructure for “aggregate” checks (cluster checks)
● better test the alerting tunables we have (#61628)
● enable sensu alerts for Asgardy services (#57450)
● set up easy to use metric based alerting (like horsefly, blocked on #67000)
● write my sensu-downtime tool
● write a super-dashboard (hackathon)
● write the sensu archive service (sensu-db?)

Page 41: Sensu @ Yelp!: A Guided Tour

Thanks!