TRANSCRIPT
Metrics and Monitoring Infrastructure
Lessons Learned Building Metrics at LinkedIn
“If you can’t measure it, you can’t manage it.”
- W. Edwards Deming
“It is wrong to suppose that if you can’t measure it, you can’t manage it – a costly myth.”
- W. Edwards Deming
Who Am I?
Grier Johnson: Platform Engineer → Production Engineer → Site Reliability Engineer → Service Engineer → Production Engineer
All these titles mean about the same thing
Built out some metrics collection at LinkedIn
Architected the alerting system
So what makes a “Metrics Infrastructure”?
The Metrics Store
Metric stores have high write volume and low read volume
The read requirements for historical data are even lower
Metrics Transport
You probably just want to use Kafka
Now your metrics are pub/sub
Maybe add something like Hadoop for data analysis?
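To make the transport idea concrete, here is a minimal sketch of what a metric sample might look like as a pub/sub message on a Kafka topic. The field names and JSON encoding are illustrative assumptions, not LinkedIn's actual wire format; any downstream consumer (the metrics store, the alerting engine, a Hadoop job) would decode the same payload.

```python
import json
import time

def encode_metric(service, name, value, ts=None):
    """Encode one metric sample as a JSON message for a pub/sub
    transport such as a Kafka topic (field names are illustrative)."""
    return json.dumps({
        "service": service,
        "metric": name,
        "value": value,
        "timestamp": ts if ts is not None else int(time.time()),
    }, sort_keys=True)

def decode_metric(payload):
    """Any consumer -- store, alerting, batch analysis -- decodes the
    same payload, which is what makes the transport pub/sub."""
    return json.loads(payload)
```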
Metrics Emission
Standardize…
• Your metrics
• Your intervals
• Your protocols
• Your namespace
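One way to enforce a standardized namespace is to make every emitter go through a single name-building function. The dotted `datacenter.service.host.metric` scheme below is an assumption for illustration; the point is that a shared helper, not each team's convention, decides what a metric is called.

```python
def metric_name(datacenter, service, host, metric):
    """Build a dotted, lowercase metric name so all teams' metrics
    land in a predictable namespace (the scheme is illustrative)."""
    parts = [datacenter, service, host, metric]
    # Normalize: lowercase, and keep "." reserved as the separator.
    cleaned = [p.strip().lower().replace(".", "_").replace(" ", "_")
               for p in parts]
    if not all(cleaned):
        raise ValueError("every namespace component is required")
    return ".".join(cleaned)
```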
Metrics Types
Allow for custom metrics
Prefer counters
Histograms
Protect your namespace, seriously
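A minimal sketch of the two metric types above, assuming nothing about LinkedIn's actual implementation: a monotonic counter (preferred, because the store can derive rates from it and a missed scrape loses nothing) and a fixed-bucket histogram of the kind used for latency distributions.

```python
import bisect

class Counter:
    """Monotonic counter: it only goes up, so rates can be derived
    downstream and a missed collection interval loses no information."""
    def __init__(self):
        self.value = 0

    def inc(self, n=1):
        if n < 0:
            raise ValueError("counters only go up")
        self.value += n

class Histogram:
    """Fixed-bucket histogram, e.g. request latency in ms.
    counts[i] holds observations where buckets[i-1] < v <= buckets[i];
    the final slot catches everything above the largest bound."""
    def __init__(self, buckets=(5, 10, 25, 50, 100, 250, 500, 1000)):
        self.buckets = sorted(buckets)
        self.counts = [0] * (len(self.buckets) + 1)

    def observe(self, v):
        self.counts[bisect.bisect_left(self.buckets, v)] += 1
```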
Metrics Presentation
Allow for customization
Allow for customization:
• Graph size
• Start and end times
• Graph type
• Legends
• Color palette
Static Graphs and Links
Speed Matters
Update frequency matters
Give a lot of thought to the colors. No really. People really care about this.
Ok, now how about monitoring and alerting…
Data Sources
Metrics
Centralized Checks
Decentralized / Distributed Checks
Stream Processing
Defining Your Alerts
What data are we looking at?
What does it mean for the data to be good or bad?
What do we do if it’s bad?
Processing
Bringing together data and definition
Read your data sources
Read your definitions
Apply the definitions to the data
Do something! Or don’t.
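The processing loop above can be sketched in a few lines. The definition format here (a metric name, a predicate for "bad", and an action) is a hypothetical illustration of the pattern, not a real alerting DSL: read the data, read the definitions, apply one to the other, and emit actions, or nothing at all.

```python
# Hypothetical alert definitions: what data we look at, what "bad"
# means for that data, and what to do about it.
DEFINITIONS = [
    {"metric": "search.qps", "bad": lambda v: v < 100, "action": "page"},
    {"metric": "search.disk_pct", "bad": lambda v: v > 90, "action": "run_cleanup"},
]

def process(samples, definitions=DEFINITIONS):
    """Apply each definition to the latest samples and return the
    actions to take; an empty list means do nothing at all."""
    actions = []
    for d in definitions:
        value = samples.get(d["metric"])
        if value is not None and d["bad"](value):
            actions.append((d["metric"], d["action"]))
    return actions
```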
Alerting Actions
Nothing, nothing at all.
Alert!
Run a script
… wait what?
Use your monitoring system to respond.
Automation is better than alerting
UI/UX
At-a-glance visual for site/service health
Point-and-click alert suppression
CLI and API for automation tasks
Bulk actions, please
Stopping Alerts
“Suppression”, also called acknowledge, sleep, quiet, or silence
Allow suppression scheduling
Cancelling suppressions
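Scheduling and cancellation fit naturally into one small object. This is a sketch under assumed policy, not a real API: every suppression has a hard end time (unlimited suppression, as Lesson 8 says, equals regret), may start in the future, and can be cancelled early.

```python
import time

class Suppression:
    """A suppression (ack/sleep/quiet/silence) with a mandatory end
    time, an optional future start, and explicit cancellation.
    The one-week cap is an illustrative policy choice."""
    MAX_SECONDS = 7 * 24 * 3600

    def __init__(self, alert, start, end):
        if end - start > self.MAX_SECONDS:
            raise ValueError("suppression window too long")
        self.alert, self.start, self.end = alert, start, end
        self.cancelled = False

    def cancel(self):
        self.cancelled = True

    def active(self, now=None):
        now = time.time() if now is None else now
        return not self.cancelled and self.start <= now < self.end
```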
So what did I learn…
Lesson 1: Writing your Metrics Infrastructure from scratch
Don’t Invent It Here
Use Open Source
Metrics Stores
• Netflix’s Atlas (https://github.com/Netflix/atlas)
• Prometheus (https://prometheus.io)
• RackSpace’s Blueflood (http://blueflood.io)
• OpenTSDB (http://opentsdb.net)
Alerting
• Sensu (https://sensuapp.org)
• Riemann (http://riemann.io)
• Zabbix (http://www.zabbix.com)
• Nagios (http://actually-dont.com)
Hybridize
Contribute
Lesson 2: When you build your metrics system from scratch anyhow…
Redundancy doesn’t matter, until your first outage with data loss
Care about thin bandwidth pipes
Distribute your stores close to the metrics creators
Aggregate distributed metrics at the presentation layer
Cache what you can
Use Kafka
Lesson 3: Graphing UIs
DPI Matters
When your graph is 500px wide, how many of the 750 data points can you show?
Show the highest data points?
The lowest?
The average?
They’re all trade-offs and someone will hate you for it.
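The 750-points-into-500-pixels problem comes down to bucketing. A sketch of the trade-off, with no assumptions beyond a plain list of samples: compute min, max, and average per bucket so the UI can choose which to draw. Max preserves spikes, min preserves dips, average smooths both away.

```python
def downsample(points, width_px):
    """Reduce a series to at most width_px values per rendering.
    Returns (mins, maxes, avgs) per pixel-bucket so the caller picks
    its trade-off: max keeps spikes, min keeps dips, avg smooths."""
    if len(points) <= width_px:
        return points, points, points
    size = len(points) / width_px  # samples per pixel (> 1 here)
    mins, maxes, avgs = [], [], []
    for i in range(width_px):
        bucket = points[int(i * size):int((i + 1) * size)]
        mins.append(min(bucket))
        maxes.append(max(bucket))
        avgs.append(sum(bucket) / len(bucket))
    return mins, maxes, avgs
```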
JavaScript is slow
It looks good though
Hard to e-mail dynamic JavaScript graphs
Remember to plan for caching
Outages test your frontend’s performance
Lesson 4: Metrics Discovery
Why have 100M metrics if you can’t find them?
Even when you find them, you can’t make sense of 1,000 metrics for one service
Standard names help. Namespace helps.
Not having 100M metrics helps
Lesson 5: Alerting On Metrics
Alerting on absolute values is bad
Use standard deviation
Use rate of change
But remember DST and Holidays ruin everything
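Both techniques above are easy to sketch. These are minimal illustrative checks, not a production detector: a z-score test against recent history instead of an absolute threshold, and a rate-of-change test. Neither defends against DST shifts or holidays by itself; comparing against the same weekday and hour is one common mitigation.

```python
import statistics

def zscore_alert(history, current, threshold=3.0):
    """Flag a sample more than `threshold` standard deviations from
    recent history, instead of comparing to a fixed absolute value."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return False  # flat history: no deviation signal to use
    return abs(current - mean) / stdev > threshold

def rate_of_change_alert(previous, current, interval_s, max_per_s):
    """Alert when the metric moves faster than max_per_s units/sec.
    Caveat from the slide: DST and holidays shift traffic patterns,
    so compare like-for-like time windows where possible."""
    return abs(current - previous) / interval_s > max_per_s
```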
Lesson 6: Alerting Overload
Alerts are for humans
Low friction for alert suppression
Low friction for alert changes and customization
Lesson 7: Alerting Levels
There is no “Warning” level for alerts
If it is worth alerting you, it’s critical
Rate of change monitoring will help with most useful cases here
Lesson 8: Suppression Times
Unlimited suppression time = Regret
Less than unlimited is OK
Low friction alert modifications
Lesson 9: Processing your metrics from streams for alerts
Don’t re-read your Kafka stream to build up metrics history
It’s right there in your metrics store
Lesson 10: Alerting on Logs
Don’t
No really, logs are for humans, rethink your monitoring strategy
Lesson 11: Alerting on Exceptions
Maybe
When exceptions are rare, alerting is fine
There’s a special place in hell for people that alert on rates of exceptions
I’m sure there’s more
But if I’m not over time by this point, I talked REALLY FAST
Questions?
• Grier Johnson
• @grierj on Twitter for DMs
• https://www.linkedin.com/in/grierjohnson