Monitoring at Spotify - When things go ping in the night Martin Parm, Product owner for Monitoring

Upload: dodan

Post on 11-Mar-2018



‣ Martin Parm, Danish, 36 years old
‣ Master's degree in CS from Copenhagen University
‣ IT operations and infrastructure since 2004
‣ Joined Spotify in 2012 (and moved to Sweden)
‣ Joined Spotify’s monitoring team in February 2014
‣ Currently Product Owner for monitoring

About Martin Parm

This talk neither endorses nor condemns any specific products, open source or otherwise. All opinions expressed are in the context of Spotify’s specific history with monitoring.

Disclaimer

This talk is not a sales pitch for our monitoring solution. It’s a story about how it came to be.

Disclaimer (2)

Spotify - what we do

‣ Music streaming - and discovery
‣ Clients for all major operating systems
‣ Partner integration into speakers, TVs, PlayStation, Chromecast, etc.
‣ More than 20 million paid subscribers*
‣ More than 75 million daily active users*
‣ Paid back more than $3 billion in royalties*

The service - the right music for every moment

* https://news.spotify.com/se/2015/06/10/20-million-reasons-to-say-thanks/

‣ Major tech offices in Stockholm, Gothenburg, New York and San Francisco

‣ Small sales and label relations offices in all markets (read: countries)

‣ ~1500 employees worldwide
  ○ ~50% in Technology
  ○ ~100 people in IO, our infrastructure department
  ○ 6 engineers in our monitoring team

The people

‣ 4 physical data centers
‣ ~10K physical servers
‣ Microservice infrastructure with ~1000 different services
‣ Mostly Ubuntu Linux

The technology

‣ Ops-In-Squads
  ○ Distributed operational responsibility in the feature teams
  ○ Monitoring as self-service for backend
‣ Two monitoring systems for backend
  ○ Heroic - time series based graphs and alerts
  ○ Riemann - event based alerting
‣ ~100M time series
‣ ~750 graph-based alert definitions
‣ ~1500 graph dashboards

Operations and monitoring

Our story begins back in 2006...

Spotify starts in 2006 - a different world

‣ No cloud computing
  ○ AWS had only just launched
  ○ Google App Engine and Microsoft Azure didn’t exist yet
‣ Few cloud applications
  ○ GMail was still in limited public beta
  ○ Google Docs was still in limited testing
  ○ Facebook opened for public access in September
‣ No smartphones
  ○ Apple had not even unveiled the iPhone yet
  ○ Android didn’t become available until 2008

Minutes from meeting in June 2007

“Munin mail - It's not sustainable to get 100 status mails per day about full disks to dev. X should turn them off.”

‣ Sitemon; our first graphing system
  ○ Based on Munin but with a custom frontend
  ○ Metrics were pulled from hosts by aggregators
  ○ Metrics were written to several different RRD files with different resolutions
  ○ Static graphs were generated from these RRD files
  ○ One single dashboard for all systems and users
  ○ Main metrics: Backend Request Failures

First steps with monitoring

“Our first alerting system was an engineer, who looked at graphs all the time; day and night; weekends. And he would start calling up people, when something looked wrong.”
-- Emil Fredriksson, Operations director at Spotify (Slightly paraphrased)

‣ Spotify launched in Sweden and UK in October 2008
‣ Zabbix was introduced in September 2009
‣ Alerts were sent as text messages to Operations, who would then contact feature developers
‣ Most common “alerting source”: Users
  ○ Operations had a permanent Twitter search

First steps with alerting

2011/2012: Ops in squads

‣ Opened our 3rd data center and grew to ~1000 hosts

‣ Spotify grew from ~100 to 400-600 people worldwide in a few months

‣ Many new engineers didn’t have operational experience or DevOps mentality

‣ A rift between dev and ops emerged...

2011-2012: The 2nd great hiring spree

‣ Development speed-up and a vast increase in new services
‣ Stability and reliability were an increasing problem
‣ Service ownership was often unclear
‣ Too frequent changes for a monolithic SRE team to keep up

2011-2012: The 2nd great hiring spree

‣ Big incidents almost every week → The business were unhappy

‣ Constant panic and fire fighting → The SRE team were unhappy

‣ Policies and restrictions, and angry SRE engineers → The feature developers were unhappy

2011-2012: The 2nd great hiring spree

“The infrastructure and feature squads that write services should also take responsibility for correct day-to-day operation of individual services.”
‣ Capacity Planning
‣ Service Configuration and Deployment
‣ Monitoring and Alerting
‣ Defining and Managing to SLAs
‣ Managing Incidents

September 2012: Ops In Squads

Benefits
‣ Organizational Scalability
‣ Faster incident solving - getting The Right Person™ on the problem faster
‣ Accountability - making the right people hurt
‣ Autonomy - feature teams make all their own planning and decisions

September 2012: Ops In Squads

Human challenges
‣ Developers need training, but not a new education
‣ Developers need autonomy, but will do stupid things
‣ Developers need to care about monitoring and alerting, but not the monitoring pipeline

September 2012: Ops In Squads

We needed infrastructure as services
‣ The classic Operations team was disbanded
‣ Operations engineers and tools teams were reformed in IO, our infrastructure organization
‣ Teaching and self-service became a primary priority

September 2012: Ops In Squads

“Creating IO was probably one of the smartest moves in Spotify”
-- Previous Product Owner for Monitoring (Slightly paraphrased)

Three tales of failure*

* Read: Learning opportunities

‣ Late 2011: Backend Infrastructure Team (BIT) was formed
  ○ BIT was the first infrastructure team at Spotify
‣ Tasked with log delivery, monitoring and alerting
‣ Development of Sitemon2 began
  ○ Meant to replace Sitemon
  ○ Still based on Munin, but with a Cassandra backend and much more powerful frontend

Sitemon2 - The graphing system, which never launched

...but BIT was set up for failure from the start
‣ Sitemon2 was developed mostly in isolation and with very little collaboration with developers
‣ Priority collisions: Log delivery was always more critical than monitoring
‣ Scope creep: BIT tried to integrate Sitemon2 with analytics

Sitemon2 - The graphing system, which never launched

We needed feature teams to take part in monitoring, but Zabbix was too inflexible and hard to learn.
‣ Late 2012: Development of OMG began
  ○ Event streaming processor similar to Riemann
  ○ Initial development was super fast and focused
  ○ Developed in collaboration with Operations
‣ A few teams adopted OMG, but...

OMG - The alerting system no one understood

OMG rule written in Esper

{Template_Site:grpsum["{$SPOTIFY_SITE} Access Points","spotify.muninplugintcp4950[hermes_requests_discovery,%%any]","last","0"].avg(300)}>0.1&(({TRIGGER.VALUE}=0&{Template_Site:grpsum["{$SPOTIFY_SITE} Access Points","spotify.muninplugintcp4950[hermes_replies_discovery,%%any]","last","0"].max(300)}/{Template_Site:grpsum["{$SPOTIFY_SITE} Access Points","spotify.muninplugintcp4950[hermes_requests_discovery,%%any]","last","0"].min(300)}<0.9)|({TRIGGER.VALUE}=1&{Template_Site:grpsum["{$SPOTIFY_SITE} Access Points","spotify.muninplugintcp4950[hermes_replies_discovery,%%any]","last","0"].max(300)}/{Template_Site:grpsum["{$SPOTIFY_SITE} Access Points","spotify.muninplugintcp4950[hermes_requests_discovery,%%any]","last","0"].min(300)}<0.97))

The alerting rule language was Esper (EPL)
‣ Most engineers found the learning curve way too steep and confusing
‣ Too few tools and libraries for the language

Why was this not caught?
‣ The Ops engineer assigned to the collaboration happened to also like Esper

OMG - The alerting system no one understood

‣ February 2013: One of our system architects builds Monster as a proof-of-concept hack project
  ○ In-memory time series database in Java
  ○ Based on Munin collection and data model
  ○ Metric data was pushed rather than pulled
  ○ The prototype was completed in 2 weeks

Monster

‣ Pushing monitoring data was much more reliable than pulling
‣ Querying and graphing was blazing fast
‣ The Operations engineers loved it!
‣ Sitemon kept running, but development of Sitemon2 was halted

We’ll get back to the failure part...

Monster

2013: The birth of a dedicated monitoring team

‣ First dedicated monitoring team at Spotify
‣ Assigned with the task of “Providing self-service monitoring solutions for DevOps teams”
‣ Inherited Monster, Zabbix and OMG
‣ Calculation: Monster could survive a year, so we focused on alerting first

Hero Squad

‣ Replaced Zabbix and OMG with Riemann
  ○ “Riemann is an event stream processor”
  ○ Written in Clojure
  ○ Rules are also written in Clojure
‣ We built a support library with helper functions, namespace support and unit testing
‣ Built a web frontend for reading the current state of Riemann

Riemann as a self-service alerting system

* Some boilerplate code *

(def-rules
  (where (tagged-any roles)
    (tagged "monitoring-hooks"
      (:alert target))
    (where (service "vfs/disk_used/")
      (spotify/trigger-on (p/above 0.9)
        (with {:description "Disk is getting full"}
          (:alert target))))))

Riemann rule written in Clojure

‣ Success: Riemann was widely adopted
‣ Success: Riemann is a true self-service
  ○ Riemann rules live in a shared git repo, which gets automatically deployed
  ○ Each team/project has its own namespace
  ○ Unit tests ensure that rules work as intended
  ○ Peak: 36 namespaces and ~5000 lines of Clojure code
‣ Failure: Many engineers didn’t understand or like the Clojure language

Riemann as a self-service alerting system

2014: Now for the pretty graphs...

‣ Sharding and rebalancing quickly became a serious operational overhead
  ○ The whisper write pattern involved randomly seeking and writing across a lot of files – one for each series
  ○ The Cyanite backend has recently addressed this
‣ Hierarchical naming
  ○ Example: “db1.cpu.idle”
  ○ Difficult to slice and dice metrics across dimensions, e.g. select all hosts in a site or running a particular piece of software

A brief encounter with collectd and Graphite

‣ Replace long hierarchical names with tags

‣ Compatible with what Riemann does with events

‣ Makes slicing and dicing metrics easy

.... but who supports it?

Metrics 2.0
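As a rough illustration of why tags make slicing easier than dotted names such as “db1.cpu.idle”, here is a minimal Python sketch. The tag names (`host`, `site`, `role`), the sample hosts and the `select` helper are all made up for this example; they are not taken from any real Spotify system or from a specific metrics specification.

```python
from typing import Dict, List

# Hypothetical series metadata: each time series is identified by a metric
# key plus a set of tags, instead of one dotted name like "db1.cpu.idle".
SERIES: List[Dict[str, str]] = [
    {"key": "cpu-idle", "host": "db1", "site": "sto", "role": "database"},
    {"key": "cpu-idle", "host": "db2", "site": "lon", "role": "database"},
    {"key": "cpu-idle", "host": "ap1", "site": "sto", "role": "accesspoint"},
]


def select(series: List[Dict[str, str]], **wanted: str) -> List[Dict[str, str]]:
    """Return every series whose tags match all of the wanted key/value pairs."""
    return [s for s in series if all(s.get(k) == v for k, v in wanted.items())]


# "All hosts in a site" and "all hosts running a particular piece of software"
# become plain tag filters; no parsing of name positions is needed.
sto_hosts = [s["host"] for s in select(SERIES, key="cpu-idle", site="sto")]
db_hosts = [s["host"] for s in select(SERIES, role="database")]
```

With hierarchical names the same two questions would need a naming convention that encodes site and role at fixed positions, which is the difficulty the previous slide describes.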

‣ We needed quick adoption and commitment from our feature teams

‣ Monitoring was still very immature and we needed room to experiment and fail

‣ Problem: Engineers get sick of migrations and refactorings

‣ Solution: A flexible infrastructure and an “API”

Adoption vs. flexibility

‣ Small daemon running on each host, which forwards events and metrics to monitoring infrastructure
‣ First written in Ruby, but later ported to Java
‣ Provides a stable entry point for our users

ffwd - a monitoring “API”
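The idea of a stable entry point can be sketched in a few lines of Python. This is only an illustration of the abstraction, under the assumption that callers emit metrics through one small interface while the outputs behind it can change. The `Forwarder` class, its methods and the JSON layout are hypothetical and do not reflect ffwd's actual API or wire format.

```python
import json
from typing import Callable, List

Output = Callable[[str], None]


class Forwarder:
    """Hypothetical stand-in for a local forwarding daemon."""

    def __init__(self) -> None:
        self._outputs: List[Output] = []

    def add_output(self, output: Output) -> None:
        # Outputs can be added or swapped without changing any caller.
        self._outputs.append(output)

    def metric(self, key: str, value: float, **tags: str) -> None:
        # The caller-facing call stays the same whatever the backends are.
        line = json.dumps(
            {"type": "metric", "key": key, "value": value, "tags": tags},
            sort_keys=True,
        )
        for output in self._outputs:
            output(line)


# Two lists stand in for an old and a new monitoring backend.
old_backend: List[str] = []
new_backend: List[str] = []

forwarder = Forwarder()
forwarder.add_output(old_backend.append)
forwarder.metric("disk-used", 0.42, host="db1", site="sto")

# A second backend is attached later; the emitting code does not change.
forwarder.add_output(new_backend.append)
forwarder.metric("disk-used", 0.43, host="db1", site="sto")
```

The design point is that a backend migration only touches the forwarder configuration, not the services that emit metrics.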

Abstracting away the infrastructure from monitoring collection

[Diagram. Components shown: Metrics and events, ffwd, Magical monitoring pipeline, Alerts, Pretty graphs]

‣ Atlas, developed by Netflix
  ○ Hadn’t been open sourced yet
‣ Prometheus, developed by SoundCloud
  ○ Hadn’t been open sourced yet
‣ OpenTSDB, originally developed by StumbleUpon
  ○ Was rejected because of bad experiences with HBase
‣ InfluxDB
  ○ Was too immature at the time

Tripping towards graphing; it’s all about the timing

‣ Time series database written in Java and backed by Cassandra
‣ Looked promising at first
  ○ We deployed it and killed Sitemon for good
  ○ We quickly ran into problems with the query engine
‣ Timing: The two main developers got hired by DataStax (the Cassandra company)
  ○ KairosDB development came to a halt

KairosDB

‣ Originally written as an alternative query engine for KairosDB

‣ We kept using the KairosDB database schema and KairosDB metric writers

‣ June 2014: We dropped KairosDB and Heroic became a stand-alone product

‣ ElasticSearch used for indexing time series metadata

May 2014: The birth of Heroic

Monster couldn’t scale, but this was not obvious to the users
‣ When it worked, it was blazing fast and beat all other solutions
‣ When it broke, it crashed and required the attention of the monitoring team, but most users never knew

‣ Only visible sign: shorter and shorter history

Back to the Monster failure

‣ Failure: Because Monster was loved, and the users weren’t experiencing the pain when it broke, many teams resisted migrating from Monster

‣ Result: We didn’t manage to shut down Monster until August 2015

‣ In its last 6 weeks Monster crashed 51(!) times

Back to the Monster failure

July 2014: Graph-based alerting

Alerting was becoming a problem again

‣ Scaling Riemann with the increasing number of metrics became hard
  ○ We began sharding, but some groups of hosts were still too big
‣ Writing reliable sliding window rules in Riemann was hard
  ○ Learning Riemann and Clojure was the most common complaint from our users

‣ One team dropped our monitoring solution and moved to an external vendor

‣ Simple thresholds on time series using the same backend, data and query language
  ○ 3 operators: Above, Below or Missing for X time
‣ Integrated directly into our frontend
  ○ No code, no fancy math, just a line on a graph

Graph-based alerting
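The three operators can be illustrated with a small Python sketch. It assumes that “Above for X time” means every data point in the last X seconds exceeds the threshold, and that “Missing” means no data points at all in that window. The function name, the parameters and these exact semantics are assumptions made for the example, not Heroic's implementation.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (unix timestamp in seconds, value)


def is_alerting(
    points: List[Point],
    operator: str,
    threshold: Optional[float],
    duration: float,
    now: float,
) -> bool:
    """Evaluate one threshold rule over the last `duration` seconds."""
    window = [v for t, v in points if now - duration <= t <= now]
    if operator == "missing":
        # Missing: no data arrived in the window; the threshold is unused.
        return not window
    if not window:
        # Above/Below cannot hold without data; "missing" covers that case.
        return False
    if operator == "above":
        return all(v > threshold for v in window)
    if operator == "below":
        return all(v < threshold for v in window)
    raise ValueError(f"unknown operator: {operator}")


# Example: disk usage sampled once per minute.
disk_used = [(0.0, 0.50), (60.0, 0.95), (120.0, 0.96), (180.0, 0.97)]
full_for_2_min = is_alerting(disk_used, "above", 0.9, duration=120, now=180)
full_for_3_min = is_alerting(disk_used, "above", 0.9, duration=180, now=180)
gone_quiet = is_alerting(disk_used, "missing", None, duration=60, now=300)
```

A sparse series with a single point in the window would also satisfy “above” here; a real system would need to decide how to treat that case.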

Graph-based alerting

‣ Our engineers love it!
  ○ Thousands of lines of Riemann code were ripped out
  ○ Many teams have migrated completely away from Riemann
  ○ We saw a massive speed-up in adoption of monitoring; both data collection and definitions of dashboards and alerts

‣ Many monitoring problems can indeed be expressed as a simple threshold

Graph-based alerting

Adoption of Heroic

‣ We are currently collecting ~10TB of metrics per month worldwide
  ○ 30TB of storage in Cassandra due to replication factor
‣ ~80% of our data was collected within the last 6 months

Adoption of Heroic

The final current picture

[Diagram. Components shown: Metrics and events, ffwd, Apache Kafka, Riemann, Heroic, Alerts, Pretty graphs]

‣ ffwd and ffwd-java have been developed as Open Source software from the start
‣ Heroic was released as Open Source software yesterday
  ○ Blog post: “Monitoring at Spotify: Introducing Heroic”
  ○ Other components are being released later

We finally Open Sourced it

What we have learned so far

● Learning a new monitoring system is an investment
● Legacy systems are almost always the hardest

But it gets worse...

● Almost all systems end up as legacy
● You probably haven’t installed your last monitoring system

Migrations are hard and expensive

Suggestions:

● Consider having abstraction layers
● Beware of vendor lock-in
  ○ Open Source software is not safe
● Sometimes it’s cheaper to keep a migration layer for legacy systems than migrating

● The monitoring/operations team are experts; feature developers might not be
● User experience matters for adoption
● The learning curve affects the cost of adoption for teams

User experience and learning curve matter

● A technically superior solution is worthless if your users don’t understand it
● Providing good defaults and an easy golden path will not only drive adoption, but also prevent users from making common mistakes

When collection is easy and performance is good, engineers will start using the monitoring system as a debugger.

● Storage is cheap but not free

● The operational cost of keeping debugging data highly available is significant

Beware of scope creep in monitoring

When graphing is easy, pretty and powerful, people will start using monitoring for business analytics.

● Monitoring is supposed to be reliable, but not accurate

● Seems very intuitive
● Fragile, sensitive to latency and sporadic failures
● Noise for alerting
● What are you really measuring?
● Solution: Convert your problem into a metric by interpreting close to the source

Heartbeats are hard to get right

● We used to send events on every Puppet run
● Teams would make monitoring rules for failed Puppet runs and absent Puppet runs
● Problem: Absent Puppet runs look exactly the same when
  ○ Puppet is disabled
  ○ Network is down
  ○ Host is down
  ○ Host has been decommissioned
● Solution: Emit “Time since last successful Puppet run” metric instead
  ○ Now we can do simple thresholds, which are easy to reason about

Heartbeats example: Puppet runs
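A minimal Python sketch of the “time since last successful run” idea follows. It assumes the host records a timestamp whenever a Puppet run succeeds and reports the age of that timestamp as a metric. The function names and the two-hour limit are placeholders chosen for this example, not values from the talk.

```python
import time
from typing import Optional


def seconds_since_last_success(
    last_success_ts: float, now: Optional[float] = None
) -> float:
    """Age of the last successful run, computed on the host itself."""
    if now is None:
        now = time.time()
    # Clamp at zero so a small clock adjustment cannot yield a negative age.
    return max(0.0, now - last_success_ts)


def puppet_is_stale(age_seconds: float, max_age_seconds: float = 2 * 3600) -> bool:
    """Simple 'above' threshold on the age metric (limit is an assumption)."""
    return age_seconds > max_age_seconds


# Last success recorded at t=1000s, evaluated one hour later.
age = seconds_since_last_success(1000.0, now=4600.0)
stale = puppet_is_stale(age)
```

A disabled or failing Puppet shows up as a steadily growing value, while a host that is down or decommissioned stops reporting the metric altogether, so the cases no longer look identical.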

● Indexing 100M time series is hard
● Browsing 100M time series is hard
  ○ UI design - getting an overview of 100M time series is hard
  ○ Understanding a graph with thousands of lines is difficult for humans
● Your data will keep growing

The next big scaling problem is very human: data discovery

● Anomaly detection and machine learning might help us
  ○ Many new and upcoming products look promising
  ○ ...but still largely an unsolved problem

Thank you for your time and patience!

Martin Parm

email: [email protected]

twitter: @parmus_dk

List of Open Source software mentioned

‣ Munin, http://munin-monitoring.org/
‣ Zabbix, http://www.zabbix.com/
‣ Riemann, http://riemann.io/
‣ Apache Kafka, http://kafka.apache.org/
‣ Atlas, https://github.com/Netflix/atlas
‣ Prometheus, http://prometheus.io/
‣ OpenTSDB, http://opentsdb.net/
‣ InfluxDB, https://influxdb.com/
‣ KairosDB, http://kairosdb.github.io/
‣ ffwd, https://github.com/spotify/ffwd
‣ ffwd-java, https://github.com/spotify/ffwd-java
‣ Heroic, https://github.com/spotify/heroic
‣ Cassandra, http://cassandra.apache.org/
‣ ElasticSearch, https://www.elastic.co/