Analytics-Driven Operations - Steve Acreman - Dataloop
TRANSCRIPT
www.dataloop.io | @dataloopio | [email protected]
Monitoring for Online Services
What is Dataloop?
Performance
Up / Down Alerts
Dev Env
Enterprise Stuff
Architecture
First Year
Measure
Putting out the fire
rollup worker / metric worker
Problems
• NodeJS metrics workers not scaling
• Memory management was an issue
• Needed big caches to reduce database load
• GC cycles too long
• 8 x single processes on an 8 core server
Metric worker re-write
• Approximately 6 weeks from no Erlang experience to working version
• No more crashes
• Reduced servers needed from 16 to 8
• Pushes metrics straight from RabbitMQ into DalmatinerDB (a new database)
Today
Happy Ending
Just the beginning!
Initial Instrumentation
› StatsD libraries in Node and Erlang code
› Push UDP packets to a StatsD server for aggregation
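A StatsD push is one UDP datagram in `name:value|type` form; a minimal stdlib sketch of that push (host, port, and metric names here are illustrative assumptions, not from the talk):

```python
import socket

def statsd_send(name, value, metric_type="c", host="127.0.0.1", port=8125):
    """Format one StatsD metric and fire it at the server over UDP.

    Fire-and-forget: there is no acknowledgement, which is why the
    approach ends up being lossy.
    """
    payload = f"{name}:{value}|{metric_type}"
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("ascii"), (host, port))
    return payload

statsd_send("api.requests", 1, "c")        # counter increment
statsd_send("api.response_ms", 42, "ms")   # timing sample
```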
Pitfalls
› Metrics increase as service usage increases
› UDP isn’t great (fire-and-forget; packets can be silently dropped)
› Aggregates across a service (hard to spot an outlier)
› Quite lossy
Better Instrumentation
› Prometheus http metrics endpoints
› 10 second scrape interval into Dataloop
› Raw data (no loss)
› Dimensions allow drill down into host
Prometheus Output
curl http://localhost/metrics
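The scrape returns plain text in the Prometheus exposition format, one sample per line with labels as dimensions; a minimal sketch of rendering such a line (metric and label names are illustrative):

```python
def prom_line(name, labels, value):
    """Render one sample in the Prometheus text exposition format:
    name{key="val",...} value
    Labels are sorted so the output is deterministic."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

print(prom_line("http_requests_total", {"method": "GET", "host": "web-1"}, 1027))
# http_requests_total{host="web-1",method="GET"} 1027
```

Because each host appears as a label on the series, one metric can be broken down per host — the drill-down dimensions mentioned above.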
What to instrument?
› Everything!
› Feature usage
› Throughput
› Error rates
› If it moves, instrument it
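One way to read "instrument everything": wrap any function so calls and errors are counted automatically; a minimal decorator sketch (function and counter names are illustrative, not from the talk):

```python
import functools
from collections import Counter

counters = Counter()

def instrumented(name):
    """Count calls and errors for the wrapped function --
    throughput and error rate come for free."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            counters[f"{name}.calls"] += 1
            try:
                return fn(*args, **kwargs)
            except Exception:
                counters[f"{name}.errors"] += 1
                raise
        return inner
    return wrap

@instrumented("export")
def export_report(rows):
    return len(rows)

export_report([1, 2, 3])
try:
    export_report(None)  # len(None) raises TypeError and is counted
except TypeError:
    pass
```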
Analytics
› Simple things like API response times
› Pretty useful to plot when a problem started
Yesterday vs. Today
SQL Like Query Language
Time Series Functions
› Create a query to answer questions
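A typical time series function is a per-second rate over a counter series — the kind of building block a yesterday-vs-today query would combine; a minimal sketch (sample points are illustrative):

```python
def rate(points):
    """Per-second rate of change for a monotonically increasing counter.
    points: list of (timestamp_seconds, value), sorted by time."""
    return [
        (t1, (v1 - v0) / (t1 - t0))
        for (t0, v0), (t1, v1) in zip(points, points[1:])
    ]

series = [(0, 100), (10, 150), (20, 150), (30, 210)]
rate(series)  # [(10, 5.0), (20, 0.0), (30, 6.0)]
```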
Future
› Prediction algorithms
› Search ‘similar’ metrics
› Outlier algorithms
› More functions!
Summary
› Code-level metrics with Prometheus are extremely lightweight
› Have a framework in place to quickly add more when issues arise
› Don’t wait until your first fire to start
› Start small and try to get both operations and developers on board
Q&A