
Monitoring at Spotify- When things go ping in the night

Martin Parm, Product owner for Monitoring

‣ Martin Parm, Danish, 36 years old
‣ Master's degree in CS from Copenhagen University
‣ IT operations and infrastructure since 2004
‣ Joined Spotify in 2012 (and moved to Sweden)
‣ Joined Spotify’s monitoring team in February 2014
‣ Currently Product Owner for monitoring

About Martin Parm

This talk neither endorses nor condemns any specific products, open source or otherwise. All opinions expressed are in the context of Spotify’s specific history with monitoring.

Disclaimer

This talk is not a sales pitch for our monitoring solution. It’s a story about how it came to be.

Disclaimer (2)

Spotify - what we do

‣ Music streaming - and discovery
‣ Clients for all major operating systems
‣ Partner integration into speakers, TVs, PlayStation, ChromeCast, etc.
‣ More than 20 million paid subscribers*
‣ More than 75 million daily active users*
‣ Paid back more than $3 billion in royalties*

The service - the right music for every moment

* https://news.spotify.com/se/2015/06/10/20-million-reasons-to-say-thanks/


‣ Major tech offices in Stockholm, Gothenburg, New York and San Francisco

‣ Small sales and label relations offices in all markets (read: countries)

‣ ~1500 employees worldwide
  ○ ~50% in Technology
  ○ ~100 people in IO, our infrastructure department
  ○ 6 engineers in our monitoring team

The people

‣ 4 physical data centers
‣ ~10K physical servers
‣ Microservice infrastructure with ~1000 different services
‣ Mostly Ubuntu Linux

The technology

‣ Ops-In-Squads
  ○ Distributed operational responsibility in the feature teams
  ○ Monitoring as self-service for backend
‣ Two monitoring systems for backend
  ○ Heroic - time series based graphs and alerts
  ○ Riemann - event based alerting
‣ ~100M time series
‣ ~750 graph-based alert definitions
‣ ~1500 graph dashboards

Operations and monitoring

Our story begins back in 2006...

Spotify starts in 2006 - a different world

‣ No cloud computing
  ○ AWS had only just launched
  ○ Google App engine and Microsoft Azure didn’t exist yet
‣ Few cloud applications
  ○ GMail was still in limited public beta
  ○ Google Docs was still in limited testing
  ○ Facebook opened for public access in September
‣ No smartphones
  ○ Apple had not even unveiled the iPhone yet
  ○ Android didn’t become available until 2008

Minutes from meeting in June 2007

“Munin mail - It's not sustainable to get 100 status mails per day about full disks to dev. X should turn them off.”

‣ Sitemon; our first graphing system
  ○ Based on Munin but with a custom frontend
  ○ Metrics were pulled from hosts by aggregators
  ○ Metrics were written to several different RRD files with different resolutions
  ○ Static graphs were generated from these RRD files
  ○ One single dashboard for all systems and users
  ○ Main metrics: Backend Request Failures

First steps with monitoring

“Our first alerting system was an engineer, who looked at graphs all the time; day and night; weekends. And he would start calling up people, when something looked wrong.”
-- Emil Fredriksson, Operations director at Spotify (slightly paraphrased)

‣ Spotify launched in Sweden and UK in October 2008
‣ Zabbix was introduced in September 2009
‣ Alerts were sent as text messages to Operations, who would then contact feature developers
‣ Most common “alerting source”: Users
  ○ Operations had a permanent twitter search

First steps with alerting

2011/2012: Ops in squads

‣ Opened our 3rd data center and grew to ~1000 hosts

‣ Spotify grew from ~100 to 400-600 people worldwide in a few months

‣ Many new engineers didn’t have operational experience or DevOps mentality

‣ A rift between dev and ops emerged...

2011-2012: The 2nd great hiring spree

‣ Development speed-up and a vast increase in new services

‣ Stability and reliability were an increasing problem

‣ Service ownership was often unclear
‣ Too frequent changes for a monolithic SRE team to keep up


‣ Big incidents almost every week → The business were unhappy

‣ Constant panic and fire fighting → The SRE team were unhappy

‣ Policies and restrictions, and angry SRE engineers → The feature developers were unhappy


“The infrastructure and feature squads that write services should also take responsibility for correct day-to-day operation of individual services.”
‣ Capacity Planning
‣ Service Configuration and Deployment
‣ Monitoring and Alerting
‣ Defining and Managing to SLAs
‣ Managing Incidents

September 2012: Ops In Squads

Benefits
‣ Organizational Scalability
‣ Faster incident solving - getting The Right Person™ on the problem faster
‣ Accountability - making the right people hurt
‣ Autonomy - feature teams make all their own planning and decisions


Human challenges
‣ Developers need training, but not a new education
‣ Developers need autonomy, but will do stupid things
‣ Developers need to care about monitoring and alerting, but not the monitoring pipeline


We needed infrastructure as services
‣ The classic Operations team was disbanded
‣ Operations engineers and tools teams were reformed in IO, our infrastructure organization
‣ Teaching and self-service became a primary priority


“Creating IO was probably one of the smartest moves in Spotify”
-- Previous Product Owner for Monitoring (slightly paraphrased)

Three tales of failure*

* Read: Learning opportunities

‣ Late 2011: Backend Infrastructure Team (BIT) was formed
  ○ BIT was the first infrastructure team at Spotify
‣ Tasked with log delivery, monitoring and alerting
‣ Development of Sitemon2 began
  ○ Meant to replace Sitemon
  ○ Still based on Munin, but with a Cassandra backend and much more powerful frontend

Sitemon2 - The graphing system, which never launched

...but BIT was set up for failure from the start
‣ Sitemon2 was developed mostly in isolation and with very little collaboration with developers
‣ Priority collisions: Log delivery was always more critical than monitoring
‣ Scope creep: BIT tried to integrate Sitemon2 with analytics

Sitemon2 - The graphing system, which never launched

We needed feature teams to take part in monitoring, but Zabbix was too inflexible and hard to learn.
‣ Late 2012: Development of OMG began
  ○ Event streaming processor similar to Riemann
  ○ Initial development was super fast and focused
  ○ Developed in collaboration with Operations
‣ A few teams adopted OMG, but...

OMG - The alerting system no one understood

OMG rule written in Esper

{Template_Site:grpsum["{$SPOTIFY_SITE} Access Points","spotify.muninplugintcp4950[hermes_requests_discovery,%%any]","last","0"].avg(300)}>0.1&(({TRIGGER.VALUE}=0&{Template_Site:grpsum["{$SPOTIFY_SITE} Access Points","spotify.muninplugintcp4950[hermes_replies_discovery,%%any]","last","0"].max(300)}/{Template_Site:grpsum["{$SPOTIFY_SITE} Access Points","spotify.muninplugintcp4950[hermes_requests_discovery,%%any]","last","0"].min(300)}<0.9)|({TRIGGER.VALUE}=1&{Template_Site:grpsum["{$SPOTIFY_SITE} Access Points","spotify.muninplugintcp4950[hermes_replies_discovery,%%any]","last","0"].max(300)}/{Template_Site:grpsum["{$SPOTIFY_SITE} Access Points","spotify.muninplugintcp4950[hermes_requests_discovery,%%any]","last","0"].min(300)}<0.97))

The alerting rule language was Esper (EPL)
‣ Most engineers found the learning curve way too steep and confusing
‣ Too few tools and libraries for the language

Why was this not caught?
‣ The Ops engineer assigned for the collaboration happened to also like Esper

OMG - The alerting system no one understood

‣ February 2013: One of our system architects builds Monster as a proof-of-concept hack project
  ○ In-memory time series database in Java
  ○ Based on Munin collection and data model
  ○ Metric data was pushed rather than pulled
  ○ The prototype was completed in 2 weeks

Monster

‣ Pushing monitoring data was much more reliable than pulling

‣ Querying and graphing is blazing fast
‣ The Operations engineers loved it!
‣ Sitemon kept running, but development of Sitemon2 was halted

We’ll get back to the failure part...

Monster

2013: The birth of a dedicated monitoring team

‣ First dedicated monitoring team at Spotify
‣ Assigned with the task of “Providing self-service monitoring solutions for DevOps teams”
‣ Inherited Monster, Zabbix and OMG
‣ Calculation: Monster could survive a year, so we focused on alerting first

Hero Squad

‣ Replaced Zabbix and OMG with Riemann
  ○ “Riemann is an event stream processor”
  ○ Written in Clojure
  ○ Rules are also written in Clojure
‣ We built a support library with helper functions, namespace support and unit testing
‣ We built a web frontend for reading the current state of Riemann

Riemann as a self-service alerting system

* Some boilerplate code *

(def-rules
  (where (tagged-any roles)
    (tagged "monitoring-hooks"
      (:alert target))
    (where (service "vfs/disk_used/")
      (spotify/trigger-on (p/above 0.9)
        (with {:description "Disk is getting full"}
          (:alert target))))))

Riemann rule written in Clojure

‣ Success: Riemann was widely adopted
‣ Success: Riemann is a true self-service
  ○ Riemann rules live in a shared git repo, which gets automatically deployed
  ○ Each team/project has its own namespace
  ○ Unit tests ensure that rules work as intended
  ○ Peak: 36 namespaces and ~5000 lines of Clojure code
‣ Failure: Many engineers didn’t understand or like the Clojure language

Riemann as a self-service alerting system

2014: Now for the pretty graphs...

‣ Sharding and rebalancing quickly became a serious operational overhead
  ○ The whisper write pattern involved randomly seeking and writing across a lot of files – one for each series
  ○ The Cyanite backend has recently addressed this
‣ Hierarchical naming
  ○ Example: “db1.cpu.idle”
  ○ Difficult to slice and dice metrics across dimensions, e.g. select all hosts in a site or running a particular piece of software

A brief encounter with collectd and Graphite

‣ Replace long hierarchical names with tags

‣ Compatible with what Riemann does with events

‣ Makes slicing and dicing metrics easy

.... but who supports it?

Metrics 2.0
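To make the contrast with hierarchical names concrete, here is a minimal sketch of tag-based selection. The tag names (`host`, `site`, `role`, `what`), the sample values and the helper functions are assumptions made up for illustration; they are not taken from the talk or from any Metrics 2.0 tooling.

```python
from typing import Dict, List

# Hypothetical sample data: each series is identified by a set of tags
# instead of a dotted name such as "db1.cpu.idle".
SERIES: List[Dict[str, str]] = [
    {"host": "db1", "site": "sto", "role": "database", "what": "cpu-idle"},
    {"host": "db2", "site": "lon", "role": "database", "what": "cpu-idle"},
    {"host": "ap1", "site": "sto", "role": "accesspoint", "what": "cpu-idle"},
]


def select(series: List[Dict[str, str]], **wanted: str) -> List[Dict[str, str]]:
    """Return every series whose tags match all of the wanted key/value pairs."""
    return [s for s in series if all(s.get(k) == v for k, v in wanted.items())]


def hosts(series: List[Dict[str, str]]) -> List[str]:
    """Sorted host names of the given series, for easy comparison."""
    return sorted(s["host"] for s in series)
```

With tags, "all hosts in a site" is `select(SERIES, site="sto")` and "all hosts running a particular piece of software" is `select(SERIES, role="database")`. With a name like “db1.cpu.idle” neither dimension is part of the name, so there is nothing to filter on.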

‣ We need quick adoption and commitment from our feature teams

‣ Monitoring was still very immature and we need room to experiment and fail

‣ Problem: Engineers get sick of migrations and refactorings

‣ Solution: A flexible infrastructure and an “API”

Adoption vs. flexibility

‣ Small daemon running on each host, which forwards events and metrics to monitoring infrastructure
‣ First written in Ruby, but later ported to Java
‣ Provides a stable entry point for our users

ffwd - a monitoring “API”
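The idea of a stable entry point in front of a changing backend can be sketched as follows. This is an illustration of the abstraction only, not ffwd's actual interface or wire protocol; the `Forwarder` class, its method names and the metric layout are all hypothetical.

```python
from typing import Callable, Dict, List

Metric = Dict[str, object]
Sink = Callable[[Metric], None]


class Forwarder:
    """Stable entry point: callers only ever use `send`."""

    def __init__(self) -> None:
        self._sinks: List[Sink] = []

    def add_sink(self, sink: Sink) -> None:
        """Attach a backend; callers of `send` never see this change."""
        self._sinks.append(sink)

    def send(self, key: str, value: float, **tags: str) -> None:
        """Build one metric and hand it to every attached backend."""
        metric: Metric = {"key": key, "value": value, "tags": dict(tags)}
        for sink in self._sinks:
            sink(metric)
```

Because services only call `send`, the monitoring team can add, replace or remove backends behind the forwarder without asking feature teams to migrate.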

Abstracting away the infrastructure from monitoring collection

[Diagram: metrics and events → ffwd → magical monitoring pipeline → alerts and pretty graphs]

‣ Atlas, developed by Netflix
  ○ Hadn’t been open sourced yet
‣ Prometheus, developed by SoundCloud
  ○ Hadn’t been open sourced yet
‣ OpenTSDB, originally developed by StumbleUpon
  ○ Was rejected because of bad experiences with HBase
‣ InfluxDB
  ○ Was too immature at the time

Tripping towards graphing; it’s all about the timing

‣ Time series database written in Java and backed by Cassandra

‣ Looked promising at first
  ○ We deployed it and killed Sitemon for good
  ○ We quickly ran into problems with the query engine
‣ Timing: The two main developers got hired by DataStax (the Cassandra company)
  ○ KairosDB development came to a halt

KairosDB

‣ Originally written as an alternative query engine for KairosDB

‣ We kept using the KairosDB database schema and KairosDB metric writers

‣ June 2014: We dropped KairosDB and Heroic became a stand-alone product

‣ ElasticSearch used for indexing time series metadata

May 2014: The birth of Heroic

Monster couldn’t scale, but this was not obvious to the users
‣ When it worked, it was blazing fast and beat all other solutions
‣ When it broke, it crashed and required the attention of the monitoring team, but most users never knew
‣ Only visible sign: shorter and shorter history

Back to the Monster failure

‣ Failure: Because Monster was loved, and the users weren’t experiencing the pain when it broke, many teams resisted migrating from Monster

‣ Result: We didn’t manage to shut down Monster until August 2015

‣ In its last 6 weeks Monster crashed 51(!) times

Back to the Monster failure

July 2014: Graph-based alerting

Alerting was becoming a problem again

‣ Scaling Riemann with the increasing number of metrics became hard
  ○ We began sharding, but some groups of hosts were still too big
‣ Writing reliable sliding window rules in Riemann was hard
  ○ Learning Riemann and Clojure was the most common complaint from our users

‣ One team dropped our monitoring solution and moved to an external vendor

‣ Simple thresholds on time series using the same backend, data and query language
  ○ 3 operators: Above, Below or Missing for X time
‣ Integrated directly into our frontend
  ○ No code, no fancy math, just a line on a graph

Graph-based alerting
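The three operators can be sketched in a few lines. This is a simplified illustration, not Heroic's implementation: the function name, the `(timestamp, value)` point format and the rule that every point in the window must breach the threshold are assumptions.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (unix timestamp in seconds, value)


def is_firing(points: List[Point], operator: str, now: float,
              duration: float, threshold: Optional[float] = None) -> bool:
    """Evaluate one threshold rule over the last `duration` seconds."""
    if operator not in ("above", "below", "missing"):
        raise ValueError(f"unknown operator: {operator}")
    # Keep only the values that fall inside the evaluation window.
    window = [v for t, v in points if now - duration <= t <= now]
    if operator == "missing":
        return not window
    if not window or threshold is None:
        return False
    if operator == "above":
        return all(v > threshold for v in window)
    return all(v < threshold for v in window)
```

For example, a "disk is getting full" rule becomes `is_firing(points, "above", now, 300, threshold=0.9)`, which is the line-on-a-graph equivalent of the Riemann rule shown earlier.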

‣ Our engineers love it!
  ○ Thousands of lines of Riemann code were ripped out
  ○ Many teams have migrated completely away from Riemann
  ○ We saw a massive speed-up in adoption of monitoring; both data collection and definitions of dashboards and alerts

‣ Many monitoring problems can indeed be expressed as a simple threshold


Adoption of Heroic

‣ We are currently collecting ~10TB of metrics per month worldwide
  ○ 30TB of storage in Cassandra due to replication factor
‣ ~80% of our data was collected within the last 6 months

Adoption of Heroic
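The storage figure follows from simple multiplication. The replication factor of 3 below is inferred from the 10TB and 30TB numbers on the slide; the talk does not state it, and the sketch ignores compression and other overhead.

```python
def replicated_storage_tb(raw_tb: float, replication_factor: int) -> float:
    """Disk space needed when every byte is stored `replication_factor` times."""
    return raw_tb * replication_factor


monthly_raw_tb = 10  # ~10TB of metrics per month, from the slide
# Assumption: a replication factor of 3, inferred from 30TB / 10TB.
stored_tb = replicated_storage_tb(monthly_raw_tb, 3)
```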

The final current picture

[Diagram components: metrics and events, ffwd, Apache Kafka, Riemann, Heroic, alerts, pretty graphs]

‣ ffwd and ffwd-java have been developed as Open Source software from the start
‣ Heroic was released as Open Source software yesterday
  ○ Blog post: “Monitoring at Spotify: Introducing Heroic”
  ○ Other components are being released later

We finally Open Sourced it

https://labs.spotify.com/2015/11/17/monitoring-at-spotify-introducing-heroic/

What we have learned so far

● Learning a new monitoring system is an investment

● Legacy systems are almost always the hardest

But it gets worse...

● Almost all systems end up as legacy

● You probably haven’t installed your last monitoring system

Migrations are hard and expensive

Suggestions:

● Consider having abstraction layers

● Beware of vendor lock-in
  ○ Open Source software is not safe

● Sometimes it’s cheaper to keep a migration layer for legacy systems than migrating

● The monitoring/operations team are experts; feature developers might not be

● User experience matters for adoption

● The learning curve affects the cost of adoption for teams

User experience and learning curve matter

● A technically superior solution is worthless, if your users don’t understand it

● Providing good defaults and an easy golden path will not only drive adoption, but also prevent users from making common mistakes

When collection is easy and performance is good, engineers will start using the monitoring system as a debugger.

● Storage is cheap but not free

● The operational cost of keeping debugging data highly available is significant

Beware of scope creep in monitoring

When graphing is easy, pretty and powerful, people will start using monitoring for business analytics.

● Monitoring is supposed to be reliable, but not accurate

● Seems very intuitive
● Fragile, sensitive to latency and sporadic failures
● Noise for alerting
● What are you really measuring?
● Solution: Convert your problem into a metric by interpreting close to the source

Heartbeats are hard to get right

● We used to send events on every Puppet run
● Teams would make monitoring rules for failed Puppet runs and absent Puppet runs
● Problem: Absent Puppet runs look exactly the same when
  ○ Puppet is disabled
  ○ Network is down
  ○ Host is down
  ○ Host has been decommissioned
● Solution: Emit “Time since last successful Puppet run” metric instead
  ○ Now we can do simple thresholds, which are easy to reason about

Heartbeats example: Puppet runs
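A minimal sketch of turning run events into a "time since last success" gauge on the host itself. The class and method names are hypothetical, and how the value is shipped to the monitoring system is left out.

```python
from typing import Optional


class PuppetRunTracker:
    """Remembers the last successful run and reports its age as a gauge."""

    def __init__(self) -> None:
        self.last_success: Optional[float] = None

    def record_run(self, succeeded: bool, finished_at: float) -> None:
        """Failed runs are ignored, so the gauge keeps growing until a success."""
        if succeeded:
            self.last_success = finished_at

    def seconds_since_success(self, now: float) -> Optional[float]:
        """Age of the last successful run, or None if there has never been one."""
        if self.last_success is None:
            return None
        return now - self.last_success
```

A disabled or failing Puppet now shows up as a steadily growing number that a plain "above" threshold can catch, while a host that is down or decommissioned stops reporting altogether and can be handled by a separate "missing" rule.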

● Indexing 100M time series is hard

● Browsing 100M time series is hard
  ○ UI design - getting an overview of 100M time series is hard
  ○ Understanding a graph with thousands of lines is difficult for humans

● Your data will keep growing

The next big scaling problem is very human: data discovery

● Anomaly detection and machine learning might help us
  ○ Many new and upcoming products look promising
  ○ ...but still largely an unsolved problem

Thank you for your time and patience!

Martin Parm

email: [email protected]

twitter: @parmus_dk


List of Open Source software mentioned

‣ Munin, http://munin-monitoring.org/
‣ Zabbix, http://www.zabbix.com/
‣ Riemann, http://riemann.io/
‣ Apache Kafka, http://kafka.apache.org/
‣ Atlas, https://github.com/Netflix/atlas
‣ Prometheus, http://prometheus.io/
‣ OpenTSDB, http://opentsdb.net/
‣ InfluxDB, https://influxdb.com/
‣ KairosDB, http://kairosdb.github.io/
‣ ffwd, https://github.com/spotify/ffwd
‣ ffwd-java, https://github.com/spotify/ffwd-java
‣ Heroic, https://github.com/spotify/heroic
‣ Cassandra, http://cassandra.apache.org/
‣ ElasticSearch, https://www.elastic.co/

