From #MonitoringSucks to #MonitoringLove (and back)
@KrisBuytaert OSMC 2014, Nuremberg, Germany
Kris Buytaert
● I used to be a Dev
● Then became an Op
● Chief Trolling Officer and Open Source Consultant @inuits.eu
● Everything is an effing DNS Problem
● Building Clouds since before the bookstore
● Organising Conferences
● Evangelizing devops
An opinionated talk about the Open Source Monitoring tooling landscape
In which I hope to learn from YOU
#devops=~C(L)AMS
● Culture
● (Lean)
● Automation
● Monitoring and Measurement
● Sharing
● Damon Edwards and John Willis
Gene Kim
Monitoring is usually an afterthought: ENOBUDGET, ENOTIME
A 2008 OLS Paper
● We have bloated Java tools
● Some open core stuff
● DIY folks want traditional Nagios
● DBA Required
#monitoringsucks
● John Vincent (@lusis), June 2011
● A sub #devops movement
● https://github.com/monitoringsucks/
Why #monitoringsucks
● Manual config (GUI)
● Not in sync with reality
● Hosts only
● Services sometimes
● Application never
● Chaos or out of sync with reality
● Alert Fatigue
Let's forget about
● Tools with no (stable) API
● Tools with strong focus on GUI
● Unless you are an SME with < 100 nodes
● Zenoss, Hyperic, GroundWork, ...
● P.S.: don't even mention proprietary software to me
What we want
● Small, well-suited components
• Collect
• Transport / Mangle
• Store
• Analyse
• Act / Alert
• Visualize
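A minimal sketch of what "small, well-suited components" could look like when each stage is its own replaceable function. Every name and value below is a hypothetical stand-in; no real tool's API is implied.

```python
# Hypothetical stand-ins for each stage; none of these names come from a real tool.

def collect():
    """Collect: gather raw samples (hard-coded here instead of polling hosts)."""
    return [("web01.load", 0.7), ("web02.load", 3.9), ("web03.load", 1.1)]

def mangle(samples, prefix="prod"):
    """Transport / Mangle: rewrite metric names on the way to the store."""
    return [(f"{prefix}.{name}", value) for name, value in samples]

def store(samples, db):
    """Store: append each value to an in-memory series keyed by metric name."""
    for name, value in samples:
        db.setdefault(name, []).append(value)
    return db

def analyse(db, threshold=3.0):
    """Analyse: pick the series whose latest value exceeds the threshold."""
    return [name for name, series in db.items() if series[-1] > threshold]

def act(offenders):
    """Act / Alert: turn offenders into notification strings."""
    return [f"ALERT {name} is over threshold" for name in offenders]

def visualize(db):
    """Visualize: one text bar per series, one '#' per 0.5 of load."""
    return "\n".join(
        f"{name:<16} {'#' * int(series[-1] * 2)}" for name, series in db.items()
    )

# Wire the stages together; any single stage can be swapped without touching the rest.
db = store(mangle(collect()), {})
alerts = act(analyse(db))
```

Swapping `store` for a real time-series backend or `act` for a pager would leave the other stages untouched, which is the point of the slide.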
#monitoringlove
•Ulf Mansson #devopsdays Rome 2011
•A new era of tooling
•#monitoringlove hacksessions @inuits
•#monitorama
Icinga
•2009 Fork
•I consider Nagios dead
•Vibrant Community (or they stalk me)
•They throw great parties in Nürnberg
•Nobody can pronounce it anyhow
•https://github.com/Inuits/puppet-icinga/
Stored Configs
#monitoringlove
But the love was about:
Sensu
● Awesome for non-static environments
● Scaling a clustered RabbitMQ ?
● This is Europe, U no do cloud
Automation of #monitoring brought back the #love
●Autodetection
●Multiplexing
●Trend Forecasting
I love CheckMK
•Autodetection ?
•Service,
•Business Functionalities
•eg. vhosts etc
•Single Source of Truth
I hate CheckMK
Monitoring a service vs
Monitoring a Service
definition of done:
monitored and in production
A software project is not done until your last end user is dead
Culture,
Automation,
Measurement : measure all the things
Sharing
Deploy Statistics
● Time To Deploy
● Deploy Frequency
● Lifecycle frequency
● Map to other metrics
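A rough sketch of how the first two deploy statistics might be computed from a deploy log. The `(started, finished)` tuple format and the sample timestamps are assumptions made up for illustration, not data from the talk.

```python
from datetime import datetime, timedelta

def mean_time_to_deploy(deploys):
    """Average duration of a deploy, given (started, finished) datetime pairs."""
    durations = [finished - started for started, finished in deploys]
    return sum(durations, timedelta()) / len(durations)

def deploys_per_day(deploys, period_days):
    """Deploy frequency: number of deploys divided by the observed period."""
    return len(deploys) / period_days

# Hypothetical deploy log covering a four-day period.
deploy_log = [
    (datetime(2014, 11, 1, 10, 0), datetime(2014, 11, 1, 10, 12)),
    (datetime(2014, 11, 3, 9, 0), datetime(2014, 11, 3, 9, 8)),
]
```

Both numbers are ordinary metrics once computed, so they can be stored next to load, latency or error rates and correlated with them.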
CollectD all the metrics, at high intervals
Oldschool graphite
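For reference, Graphite's plaintext listener accepts one `path value timestamp` line per metric, by default on TCP port 2003. A minimal sketch, assuming a Carbon daemon on `localhost`; the metric path in the usage comment is a made-up example.

```python
import socket
import time

def format_metric(path, value, timestamp=None):
    """Build one line of Graphite's plaintext protocol: 'path value timestamp\\n'."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"

def send_metric(line, host="localhost", port=2003):
    """Send a single plaintext line to a Carbon listener (host and port are assumptions)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

# Usage, with a hypothetical metric path:
#   send_metric(format_metric("deploys.webshop.count", 1))
```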
Self Service Gdash based pipelines
Puppetized Templates (wip)
Gdash
Grafana
Graphite++
● Dashboards
• Grafana
● Engines :
• InfluxDB
• Cyanite
Triggers on Graphs
● Export Java Metrics
● JMXTrans
● Export JMXConfigs
● Configure NRPE Check
● Export NagiosCheck
● Collect JMX Exports on JMXTransNode
● Graph Em
● Collect Icinga Configs on Icinga
Aggregation
● Alert on streams
● Alert on aggregated metrics
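A small sketch of alerting on an aggregate instead of on individual samples: a sliding-window average over a stream, so a single spike does not page anyone but a sustained rise does. Window size, threshold and the sample latencies are arbitrary assumptions.

```python
from collections import deque

def alerts_on_average(stream, size=3, threshold=500.0):
    """Yield (index, average) whenever the sliding-window average exceeds the threshold."""
    window = deque(maxlen=size)  # keeps only the most recent `size` values
    for index, value in enumerate(stream):
        window.append(value)
        average = sum(window) / len(window)
        if average > threshold:
            yield index, average

# Hypothetical response times in milliseconds.
latencies = [100, 120, 900, 950, 980, 110]
firing = [index for index, _ in alerts_on_average(latencies)]
```

With these numbers the first 900 ms sample alone does not fire; the alert only starts once the windowed average crosses the threshold.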
Riemann
● I still don't get it ?
● Distributed Top
● Do you like Clojure ?
● Riemann Health plugin ?
● s/riemann-health/collectd/g;
● Output to graphite
Graphs to Knowledge
Skyline
•Oculus
•Creating Information out of this data
•Big data
•Machine Learning
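To make "creating information out of this data" concrete, here is a deliberately naive anomaly test: flag a datapoint that lies more than three standard deviations from the mean of its recent history. This is a generic three-sigma rule used as an illustration, not Skyline's or Oculus's actual implementation, and the sample series is invented.

```python
from statistics import mean, stdev

def is_anomalous(history, latest, sigmas=3.0):
    """Return True when `latest` deviates more than `sigmas` standard deviations from `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate a spread
    mu = mean(history)
    sd = stdev(history)
    if sd == 0:
        return latest != mu  # flat series: any change is a deviation
    return abs(latest - mu) > sigmas * sd

# Hypothetical recent values of one metric.
recent = [10, 11, 9, 10, 12, 10, 11, 9]
```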
But I have log files..
Logs and Metrics
● Graylog2
● ELSA (Enterprise Log Search and Archive)
● ELK Stack
● Collect from anywhere
● Filter
● Send anywhere
● Queuing
Black on White ?
APM But what about my apps ?
Half the world cheers about SAAS tools :(
Packetbeat
● Traffic Flow through network
● Transactions causing errors
● SQL per HTTP
● API call usage
PacketBeat
This new “D” hype
Containers are the new black
● 1 process per container
● Metric collection ?
● Service health ?
So you want service registration of your healthy (containerized) applications ?
Enter Consul.io
● Service discovery
● Failure detection
● Using Gossip built on top of Serf
● Random node 2 node communication
● A HashiCorp project
Consul
● Uses monitoring_plugins for health
● Creates unhealthy DNS setups
● Sensu-like
● Key-Value store
● Consul_template => fills your templates
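Monitoring plugins report state through their exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), and a Consul script check reads exit 0 as passing, 1 as warning and anything else as failing, so one check can serve both. A minimal sketch of such a check; the metric name and thresholds are hypothetical.

```python
# Exit codes used by the monitoring plugins convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

def check_threshold(name, value, warn, crit):
    """Map a measured value to a plugin-style (exit_code, message) pair."""
    if value is None:
        code = UNKNOWN  # the measurement itself failed
    elif value >= crit:
        code = CRITICAL
    elif value >= warn:
        code = WARNING
    else:
        code = OK
    return code, f"{name} {LABELS[code]} - value={value}"

# As a script, the check prints the message and exits with the code:
#   import sys
#   code, message = check_threshold("queue_depth", 15, warn=10, crit=20)
#   print(message)
#   sys.exit(code)
```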
Everything is a freaking DNS problem
Self Healing
● Pacemaker Corosync (ocf resource that monitors your service)
● Mesos
● Kubernetes
● Scale changes, Consensus Models change
So your DC fails
Whom to alert when ?
'New' kids on the block
● Flapjack
● flapjack.io
● monitoring notification routing + event processing system
● OpenDuty
● github.com/szechuen/OpenDuty
● Duty management
My Alerting Strategy
Is still in beta
And back :(
In 2014 I'm still running the same check for
- service registration (consul)
- high availability (pacemaker/corosync)
- monitoring (icinga)
But I love where Monitoring is heading
We have far fewer false positives
And we have a Maintainable Monitoring Infra
Kinda
Your next trip to Gent !
CfgMgmtcamp.eu February 2 and 3, 2015
CFP is Open !
Contact
[email protected]
Further Reading
@krisbuytaert
http://www.krisbuytaert.be/blog/
http://www.inuits.eu/
Inuits Duboistraat 50 2060 Antwerpen Belgium 891.514.231 +32 475 961221