Analytics-Driven Operations - Steve Acreman - Dataloop
TRANSCRIPT
www.dataloop.io | @dataloopio | [email protected]
Monitoring for Online Services
What is Dataloop?
Performance
Up / Down Alerts
Dev Env
Enterprise Stuff
Architecture
First Year
Measure
Putting out the fire
rollup worker / metric worker
Problems
• NodeJS metrics workers not scaling
• Memory management was an issue
• Needed big caches to reduce database load
• GC cycles too long
• 8 x single processes on an 8 core server
Metric worker re-write
• Approximately 6 weeks from no Erlang experience to working version
• No more crashes
• Reduced servers needed from 16 to 8
• Pushes metrics straight from RabbitMQ into DalmatinerDB (a new database)
Today
Happy Ending
Just the beginning!
Initial Instrumentation
› StatsD libraries in Node and Erlang code
› Push UDP packets to a StatsD server for aggregation
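A StatsD push is one UDP datagram in `name:value|type` form; a minimal stdlib sketch of that push (host, port, and metric names here are illustrative assumptions, not from the talk):

```python
import socket

def statsd_send(name, value, metric_type="c", host="127.0.0.1", port=8125):
    """Format one StatsD metric and fire it at the server over UDP.

    Fire-and-forget: there is no acknowledgement, which is why the
    approach ends up being lossy.
    """
    payload = f"{name}:{value}|{metric_type}"
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("ascii"), (host, port))
    return payload

statsd_send("api.requests", 1, "c")        # counter increment
statsd_send("api.response_ms", 42, "ms")   # timing sample
```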
Pitfalls
› Metrics increase as service usage increases
› UDP isn’t great (fire-and-forget; packets can be silently dropped)
› Aggregates across a service (hard to spot an outlier)
› Quite lossy
Better Instrumentation
› Prometheus http metrics endpoints
› 10 second scrape interval into Dataloop
› Raw data (no loss)
› Dimensions allow drill down into host
Prometheus Output
curl http://localhost/metrics
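The scrape returns plain text in the Prometheus exposition format, one sample per line with labels as dimensions; a minimal sketch of rendering such a line (metric and label names are illustrative):

```python
def prom_line(name, labels, value):
    """Render one sample in the Prometheus text exposition format:
    name{key="val",...} value
    Labels are sorted so the output is deterministic."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

print(prom_line("http_requests_total", {"method": "GET", "host": "web-1"}, 1027))
# http_requests_total{host="web-1",method="GET"} 1027
```

Because each host appears as a label on the series, one metric can be broken down per host — the drill-down dimensions mentioned above.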
What to instrument?
› Everything!
› Feature usage
› Throughput
› Error rates
› If it moves, instrument it
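One way to read "instrument everything": wrap any function so calls and errors are counted automatically; a minimal decorator sketch (function and counter names are illustrative, not from the talk):

```python
import functools
from collections import Counter

counters = Counter()

def instrumented(name):
    """Count calls and errors for the wrapped function --
    throughput and error rate come for free."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            counters[f"{name}.calls"] += 1
            try:
                return fn(*args, **kwargs)
            except Exception:
                counters[f"{name}.errors"] += 1
                raise
        return inner
    return wrap

@instrumented("export")
def export_report(rows):
    return len(rows)

export_report([1, 2, 3])
try:
    export_report(None)  # len(None) raises TypeError and is counted
except TypeError:
    pass
```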
Analytics
› Simple things like API response times
› Pretty useful to plot when a problem started
Yesterday vs. Today
SQL Like Query Language
Time Series Functions
› Create a query to answer questions
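A typical time series function is a per-second rate over a counter series — the kind of building block a yesterday-vs-today query would combine; a minimal sketch (sample points are illustrative):

```python
def rate(points):
    """Per-second rate of change for a monotonically increasing counter.
    points: list of (timestamp_seconds, value), sorted by time."""
    return [
        (t1, (v1 - v0) / (t1 - t0))
        for (t0, v0), (t1, v1) in zip(points, points[1:])
    ]

series = [(0, 100), (10, 150), (20, 150), (30, 210)]
rate(series)  # [(10, 5.0), (20, 0.0), (30, 6.0)]
```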
Future
› Prediction algorithms
› Search ‘similar’ metrics
› Outlier algorithms
› More functions!
Summary
› Code-level metrics with Prometheus are extremely lightweight
› Have a framework in place to quickly add more when issues arise
› Don’t wait until your first fire to start
› Start small and try to get both operations and developers on board
Q&A