Metrics and Monitoring Infrastructure: Lessons Learned Building Metrics at LinkedIn
TRANSCRIPT
-
Metrics and Monitoring Infrastructure
Lessons Learned Building Metrics at LinkedIn
-
If you can't measure it, you can't manage it.
- W. Edwards Deming
-
It is wrong to suppose that if you can't measure it, you can't manage it - a costly myth.
- W. Edwards Deming
-
Who Am I?
-
Grier Johnson
Platform Engineer
-
Grier Johnson
Production Engineer
-
Grier Johnson
Site Reliability Engineer
-
Grier Johnson
Service Engineer
-
Grier Johnson
Production Engineer
-
Grier Johnson
All these titles mean about the same thing
-
Built out some metrics collection at LinkedIn
-
Architected the alerting system
-
So what makes a Metrics Infrastructure?
-
The Metrics Store
-
Metric stores have high writes, low reads
-
The read requirements for historical data are even lower
-
Metrics Transport
-
You probably just want to use Kafka
-
Now your metrics are pub/sub
-
Add something like Hadoop for data analysis, maybe?
-
Metrics Emission
-
Standardize
Your metrics
-
Standardize
Your intervals
-
Standardize
Your protocols
-
Standardize
Your namespace
-
Metrics Types
-
Allow for custom metrics
-
Prefer counters
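The slide does not spell out why counters are preferable, so the following rests on a common argument that is assumed here: with an ever-increasing counter, a missed collection interval costs resolution but not data, because the rate can still be derived from the two surrounding samples. A minimal sketch of that derivation, with a hypothetical `counter_rates` helper and a simple counter-reset rule:

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix_timestamp_seconds, counter_value)


def counter_rates(samples: List[Sample]) -> List[Tuple[float, float]]:
    """Turn monotonically increasing counter samples into per-second rates.

    Assumption: a drop in the counter means the process restarted, so the
    new value is taken as the increase since the reset.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order samples
        increase = v1 - v0 if v1 >= v0 else v1
        rates.append((t1, increase / elapsed))
    return rates
```

For example, `counter_rates([(0, 100), (60, 160)])` yields one point at t=60 with a rate of one event per second.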
-
Histograms
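As an illustration of what a histogram metric might hold, here is a minimal sketch with fixed bucket boundaries. The class name, the bucket bounds, and the quantile method are all assumptions made for the example; the talk does not describe a specific implementation.

```python
import bisect
from typing import List, Sequence


class BucketHistogram:
    """Fixed-boundary histogram: counts observations per bucket."""

    def __init__(self, upper_bounds: Sequence[float]) -> None:
        self.upper_bounds = sorted(upper_bounds)
        # One extra bucket catches values above the largest bound.
        self.counts: List[int] = [0] * (len(self.upper_bounds) + 1)
        self.total = 0

    def observe(self, value: float) -> None:
        # bisect_left returns the index of the first bound >= value.
        idx = bisect.bisect_left(self.upper_bounds, value)
        self.counts[idx] += 1
        self.total += 1

    def quantile_upper_bound(self, q: float) -> float:
        """Upper bound of the bucket that contains quantile q (0..1)."""
        if self.total == 0:
            raise ValueError("no observations")
        target = q * self.total
        running = 0
        for idx, count in enumerate(self.counts):
            running += count
            if running >= target:
                if idx < len(self.upper_bounds):
                    return self.upper_bounds[idx]
                break
        return float("inf")  # quantile falls in the overflow bucket
```

The store only has to keep one counter per bucket, at the cost of answering quantile questions with bucket-level precision.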
-
Protect your namespace, seriously
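One way to protect a namespace is to reject badly named metrics at emission time. The sketch below assumes a made-up convention (at least three lower-case, dot-separated segments) and made-up reserved prefixes; none of these rules come from the talk.

```python
import re

# Hypothetical convention: "<team>.<service>.<metric>", lower-case segments.
_SEGMENT = r"[a-z][a-z0-9_]*"
_NAME_RE = re.compile(rf"^{_SEGMENT}(\.{_SEGMENT}){{2,}}$")

# Assumed prefixes that only the platform team may write to.
RESERVED_PREFIXES = ("system.", "platform.")


def validate_metric_name(name: str, owner_is_platform: bool = False) -> None:
    """Raise ValueError if the name breaks the (assumed) naming rules."""
    if not _NAME_RE.match(name):
        raise ValueError(f"metric name {name!r} does not match the convention")
    if name.startswith(RESERVED_PREFIXES) and not owner_is_platform:
        raise ValueError(f"metric name {name!r} uses a reserved prefix")
```

Calling `validate_metric_name("search.frontend.requests_total")` passes silently, while `"Search.Frontend"` raises.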
-
Metrics Presentation
-
Allow for customization
Graph Size
-
Allow for customization
Start and end times
-
Allow for customization
Graph type
-
Allow for customization
Legends
-
Allow for customization
Color palette
-
Static Graphs and Links
-
Speed Matters
-
Update frequency matters
-
Give a lot of thought to the colors. No really. People really care about this.
-
OK, now how about monitoring and alerting?
-
Data Sources
-
Metrics
-
Centralized Checks
-
Decentralized / Distributed Checks
-
Stream Processing
-
Defining Your Alerts
-
What data are we looking at?
-
What does it mean for the data to be good or bad?
-
What do we do if it's bad?
-
Processing
-
Bringing together data and definition
-
Read your data sources
-
Read your definitions
-
Apply the definitions to the data
-
Do something! Or don't.
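The processing steps above (read the data, read the definitions, apply one to the other, then act or not) can be sketched as a single loop. `AlertDefinition`, `evaluate`, and the `read_metric` callback are hypothetical names for illustration, not the design of the system described in the talk.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class AlertDefinition:
    name: str
    metric: str                                      # what data are we looking at
    is_bad: Callable[[List[float]], bool]            # what makes the data bad
    action: Optional[Callable[[str], None]] = None   # None means "do nothing"


def evaluate(definitions: Iterable[AlertDefinition],
             read_metric: Callable[[str], List[float]]) -> List[str]:
    """Apply each definition to its data and return the names that fired."""
    fired = []
    for definition in definitions:
        values = read_metric(definition.metric)
        if not values:
            continue  # no data; a real system may want to alert on this too
        if definition.is_bad(values):
            fired.append(definition.name)
            if definition.action is not None:
                definition.action(definition.name)
    return fired
```

Keeping the action optional makes "nothing, nothing at all" a first-class outcome rather than a special case.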
-
Alerting Actions
-
Nothing, nothing at all.
-
E-mail
-
Alert!
-
Run a script
-
Wait, what?
-
Use your monitoring system to respond.
-
Automation is better than alerting
-
UI/UX
-
At-a-glance visual for site/service health
-
Point-and-click alert suppression
-
CLI and API for automation tasks
-
Bulk actions, please
-
Stopping Alerts
-
Suppression
or acknowledge
-
Suppression
or sleep
-
Suppression
or quiet
-
Suppression
or silence
-
Allow suppression scheduling
-
Cancelling suppressions
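A minimal sketch of scheduled, cancellable suppressions, assuming an in-memory store. `Suppression` and `SuppressionBook` are hypothetical names; a real system would persist these and expose them through the UI, CLI, and API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class Suppression:
    alert_name: str
    start: datetime
    end: datetime
    cancelled: bool = False

    def is_active(self, now: datetime) -> bool:
        # Active only inside [start, end) and only if nobody cancelled it.
        return not self.cancelled and self.start <= now < self.end


class SuppressionBook:
    def __init__(self) -> None:
        self._by_alert: Dict[str, List[Suppression]] = {}

    def schedule(self, alert_name: str, start: datetime,
                 duration: timedelta) -> Suppression:
        """Register a suppression window that may start in the future."""
        if duration <= timedelta(0):
            raise ValueError("duration must be positive")
        entry = Suppression(alert_name, start, start + duration)
        self._by_alert.setdefault(alert_name, []).append(entry)
        return entry

    def is_suppressed(self, alert_name: str, now: datetime) -> bool:
        return any(s.is_active(now) for s in self._by_alert.get(alert_name, []))
```

Cancelling is then a matter of setting `cancelled = True` on the returned entry, which takes effect on the next check.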
-
So what did I learn?
-
Lesson 1
Writing your Metrics Infrastructure from scratch
-
Don't Invent It Here
-
Use Open Source
-
Metrics Stores
Netflix's Atlas (https://github.com/Netflix/atlas)
Prometheus (https://prometheus.io)
Rackspace's Blueflood (http://blueflood.io)
OpenTSDB (http://opentsdb.net)
-
Alerting
Sensu (https://sensuapp.org)
Riemann (http://riemann.io)
Zabbix (http://www.zabbix.com)
Nagios (http://actually-dont.com)
-
Hybridize
-
Contribute
-
Lesson 2
When you build your metrics system from scratch anyhow
-
Redundancy doesn't matter, until your first outage with data loss
-
Care about thin bandwidth pipes
-
Distribute your stores close to the metrics creators
-
Aggregate distributed metrics at the presentation layer
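A minimal sketch of aggregating at the presentation layer: each site's store is queried separately and the results are combined only when the graph is drawn. It assumes the per-site series share aligned timestamps and that the metric is additive (request counts, not percentiles); `merge_series` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Series = List[Tuple[int, float]]  # (unix_timestamp, value)


def merge_series(per_site: Iterable[Series]) -> Series:
    """Sum values that share a timestamp across all per-site series."""
    totals: Dict[int, float] = defaultdict(float)
    for series in per_site:
        for timestamp, value in series:
            totals[timestamp] += value
    return sorted(totals.items())
```

Only the already-reduced query results cross the thin pipes between sites, not the raw metric stream.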
-
Cache what you can
-
Use Kafka
-
Lesson 3
Graphing UIs
-
DPI Matters
-
When your graph is 500px wide, how many of the 750 data points can you show?
-
Show the highest data points?
-
The lowest?
-
The average?
-
They're all trade-offs and someone will hate you for it.
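The trade-off is easy to see in code. The sketch below squeezes a series into one value per pixel column and lets the caller choose the reducer (highest, lowest, or average); `downsample` is a hypothetical helper, not how any particular graphing tool does it.

```python
from statistics import mean
from typing import Callable, List, Sequence


def downsample(points: Sequence[float], width_px: int,
               reducer: Callable[[Sequence[float]], float]) -> List[float]:
    """Reduce points to at most width_px values, one per pixel column."""
    if width_px <= 0:
        raise ValueError("width_px must be positive")
    n = len(points)
    if n <= width_px:
        return list(points)  # everything fits, nothing to decide
    out = []
    for px in range(width_px):
        # Integer arithmetic spreads the points evenly across the columns.
        start = px * n // width_px
        end = (px + 1) * n // width_px
        out.append(reducer(points[start:end]))
    return out


# A short spike survives max, vanishes with min, and is halved by mean.
series = [0, 100, 0, 0, 0, 0]
as_max = downsample(series, 3, max)
as_min = downsample(series, 3, min)
as_avg = downsample(series, 3, mean)
```

With 750 points and 500 pixels, the columns alternate between one and two points each, so whichever reducer is chosen silently rewrites a large share of the graph.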
-
JavaScript is slow
-
It looks good though
-
Hard to e-mail dynamic JavaScript graphs
-
Remember to plan for caching
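One simple form of caching for a graphing frontend is a short time-to-live cache keyed by the query, so that many people loading the same dashboard during an outage trigger one backend read rather than hundreds. This is a sketch under that assumption; `TTLCache` is a hypothetical class, and the injectable clock exists only to make it testable.

```python
import time
from typing import Callable, Dict, Generic, Hashable, Tuple, TypeVar

V = TypeVar("V")


class TTLCache(Generic[V]):
    """Cache values for ttl_seconds, recomputing them after expiry."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, V]] = {}

    def get_or_compute(self, key: Hashable, compute: Callable[[], V]) -> V:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh, skip the backend
        value = compute()
        self._entries[key] = (now, value)
        return value
```

A TTL close to the metric emission interval keeps graphs as fresh as the data can be anyway.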
-
Outages test your frontend's performance
-
Lesson 4
Metrics Discovery
-
Why have 100M metrics if you can't find them?
-
Even when you find them, you can't make sense of 1000 metrics for one service
-
Standard names help. Namespace helps.
Not having 100M metrics helps
-
Lesson 5
Alerting On Metrics
-
Alerting on absolute values is bad
-
Use standard deviation
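A minimal sketch of a standard-deviation check, assuming the recent history is roughly stable: the latest value is flagged when it sits more than a chosen number of standard deviations from the historical mean. The function name and the default threshold of 3 are illustrative assumptions.

```python
from statistics import mean, pstdev
from typing import Sequence


def is_anomalous(history: Sequence[float], latest: float,
                 threshold: float = 3.0) -> bool:
    """True if latest is more than threshold std devs from the mean."""
    if len(history) < 2:
        return False  # not enough data to judge
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return latest != mu  # flat history: any change is notable
    return abs(latest - mu) / sigma > threshold
```

Unlike an absolute threshold, the same definition works for a service that normally handles 100 requests per second and one that handles 100,000.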
-
Use rate of change
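A minimal sketch of a rate-of-change check: compare the current interval with the previous one and flag large relative moves. The 50% default and the function names are assumptions for illustration.

```python
def relative_change(previous: float, current: float) -> float:
    """Fractional change from previous to current (-0.6 means a 60% drop)."""
    if previous == 0:
        return float("inf") if current != 0 else 0.0
    return (current - previous) / abs(previous)


def changed_too_fast(previous: float, current: float,
                     max_change: float = 0.5) -> bool:
    """True if the value moved by more than max_change in either direction."""
    return abs(relative_change(previous, current)) > max_change
```

For example, a drop from 1000 to 400 requests per interval is a 60% change and would be flagged, while 1000 to 900 would not.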
-
But remember DST and Holidays ruin everything
-
Lesson 6
Alerting Overload
-
Alerts are for humans
-
Low friction for alert suppression
-
Low friction for alert changes and customization
-
Lesson 7
Alerting Levels
-
There is no Warning level for alerts
-
If it is worth alerting you, it's critical
-
Rate of change monitoring will help with most useful cases here
-
Lesson 8
Suppression Times
-
Unlimited suppression time = Regret
-
Less than unlimited is OK
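One way to rule out unlimited suppressions is to clamp every request to a maximum duration. The seven-day cap below is an arbitrary assumption, as is the `suppression_end` helper.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed cap; pick whatever suits your on-call rotation.
MAX_SUPPRESSION = timedelta(days=7)


def suppression_end(start: datetime,
                    requested: Optional[timedelta]) -> datetime:
    """Return when a suppression ends, never later than the cap allows."""
    if requested is None or requested > MAX_SUPPRESSION:
        requested = MAX_SUPPRESSION  # "forever" quietly becomes the cap
    if requested <= timedelta(0):
        raise ValueError("suppression duration must be positive")
    return start + requested
```

Anyone who truly needs longer has to come back and renew, which is a cheap reminder that the alert is still being ignored.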
-
Low friction alert modifications
-
Lesson 9
Processing your metrics from streams for alerts
-
Don't re-read your Kafka stream to build up metrics history.
-
It's right there in your metrics store
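A minimal sketch of that idea: when the alert processor starts, it seeds its window with one query against the metrics store, then only appends new points as they arrive on the stream. `RollingWindow` and the `fetch_history` callback are hypothetical stand-ins for whatever store client is in use.

```python
from collections import deque
from typing import Callable, Deque, List


class RollingWindow:
    """Recent values for one metric, seeded from the store at startup."""

    def __init__(self, metric: str, size: int,
                 fetch_history: Callable[[str, int], List[float]]) -> None:
        # One store read replaces replaying the stream from an old offset.
        self.values: Deque[float] = deque(fetch_history(metric, size),
                                          maxlen=size)

    def on_stream_point(self, value: float) -> None:
        """Called for each new point on the stream; the oldest one drops off."""
        self.values.append(value)
```

Restarting the processor then costs one read per metric instead of a long catch-up on the stream.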
-
Lesson 10
Alerting on Logs
-
Don't
-
No really, logs are for humans. Rethink your monitoring strategy.
-
Lesson 11
Alerting on Exceptions
-
Maybe
-
When exceptions are rare, alerting is fine
-
There's a special place in hell for people that alert on rates of exceptions
-
I'm sure there's more
But if I'm not over time by this point I talked REALLY FAST
-
Questions?
Grier Johnson
@grierj on Twitter for DMs
https://www.linkedin.com/in/grierjohnson