metrics and monitoring infrastructure: lessons learned building metrics at linkedin

Metrics and Monitoring Infrastructure

Lessons Learned Building Metrics at LinkedIn

If you can't measure it, you can't manage it.

- W. Edwards Deming

It is wrong to suppose that if you can't measure it, you can't manage it - a costly myth.

- W. Edwards Deming

Who Am I?

Grier Johnson, Platform Engineer

Grier Johnson, Production Engineer

Grier Johnson, Site Reliability Engineer

Grier Johnson, Service Engineer

Grier Johnson, Production Engineer

Grier Johnson: all these titles mean about the same thing

Built out some metrics collection at LinkedIn

Architected the alerting system

So what makes a Metrics Infrastructure?

The Metrics Store

Metrics stores have high write volume and low read volume

The read requirements for historical data are even lower

Metrics Transport

You probably just want to use Kafka

Now your metrics are pub/sub

Add something like Hadoop for data analysis, maybe?

Metrics Emission

Standardize your metrics

Standardize your intervals

Standardize your protocols

Standardize your namespace

Metrics Types

Allow for custom metrics

Prefer counters

Histograms

Protect your namespace, seriously
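One reason to prefer counters is that a missed scrape does not lose events: the rate can always be derived later from any two samples. The sketch below is a minimal illustration of that derivation, not the system described in the talk. The function name, the sample format and the reset handling (a drop in value is assumed to mean the process restarted from zero) are all assumptions made for the example.

```python
from typing import List, Tuple


def counter_rates(samples: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Turn (timestamp_seconds, counter_value) samples into (timestamp, per-second rate)."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order samples
        # Assumption: a drop in value means the counter was reset to zero,
        # so the increase over this interval is just the new value.
        delta = v1 - v0 if v1 >= v0 else v1
        rates.append((t1, delta / elapsed))
    return rates


# Hypothetical samples taken every 60 seconds; the process restarts before t=120.
samples = [(0, 100), (60, 160), (120, 10), (180, 70)]
for timestamp, rate in counter_rates(samples):
    print(timestamp, round(rate, 3))
```

A gauge sampled at the same interval would hide anything that happened between samples; the counter only loses resolution.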

Metrics Presentation

Allow for customization

Graph Size


Start and end times


Graph type


Legends


Color palette

Static Graphs and Links

Speed Matters

Update frequency matters

Give a lot of thought to the colors. No really. People really care about this.

OK, now how about monitoring and alerting?

Data Sources

Metrics

Centralized Checks

Decentralized / Distributed Checks

Stream Processing

Defining Your Alerts

What data are we looking at?

What does it mean for the data to be good or bad?

What do we do if it's bad?

Processing

Bringing together data and definition

Read your data sources

Read your definitions

Apply the definitions to the data

Do something! Or don't.
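The processing steps above can be sketched as a small loop. This is an illustrative sketch only: the `AlertDefinition` class, the metric names and the `email` action are hypothetical, and the data source is a plain dictionary standing in for whatever store or stream is actually read. Each definition answers the three questions from "Defining Your Alerts": which data, what counts as bad, and what to do about it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AlertDefinition:
    metric: str                          # what data are we looking at?
    is_bad: Callable[[float], bool]      # what does it mean for the data to be bad?
    action: Callable[[str, float], str]  # what do we do if it's bad?


def evaluate(definitions: List[AlertDefinition], data: Dict[str, float]) -> List[str]:
    """Apply every definition to the data and collect the results of fired actions."""
    results = []
    for definition in definitions:
        value = data.get(definition.metric)
        if value is None:
            continue  # no data for this metric; a real system may want to alert on absence
        if definition.is_bad(value):
            results.append(definition.action(definition.metric, value))
    return results


def email(metric: str, value: float) -> str:
    # Stand-in for a real notification; returns the message instead of sending it.
    return f"email: {metric}={value}"


definitions = [
    AlertDefinition("api.error_ratio", lambda v: v > 0.05, email),
    AlertDefinition("api.latency_p99_ms", lambda v: v > 500, email),
]
data = {"api.error_ratio": 0.12, "api.latency_p99_ms": 220.0}
fired = evaluate(definitions, data)
print(fired)
```

Keeping the action a plain callable means "do nothing", "e-mail" and "run a script" all fit the same definition shape.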

Alerting Actions

Nothing, nothing at all.

E-mail

Alert!

Run a script

Wait, what?

Use your monitoring system to respond.

Automation is better than alerting

At-a-glance visual for site/service health

Point-and-click alert suppression

CLI and API for automation tasks

Bulk actions, please

Stopping Alerts

Suppression, or acknowledge

Suppression, or sleep

Suppression, or quiet

Suppression, or silence

Allow suppression scheduling

Cancelling suppressions

So what did I learn?

Lesson 1: Writing your Metrics Infrastructure from scratch

Don't Invent It Here

Use Open Source

Metrics Stores

Netflix's Atlas (https://github.com/Netflix/atlas)

Prometheus (https://prometheus.io)

Rackspace's Blueflood (http://blueflood.io)

OpenTSDB (http://opentsdb.net)


Alerting

Sensu (https://sensuapp.org)

Riemann (http://riemann.io)

Zabbix (http://www.zabbix.com)

Nagios (http://actually-dont.com)


Hybridize

Contribute

Lesson 2: When you build your metrics system from scratch anyhow

Redundancy doesn't matter until your first outage with data loss

Care about thin bandwidth pipes

Distribute your stores close to the metrics creators

Aggregate distributed metrics at the presentation layer

Cache what you can

Use Kafka

Lesson 3: Graphing UIs

DPI Matters

When your graph is 500px wide, how many of the 750 data points can you show?

Show the highest data points?

The lowest?

The average?

They're all trade-offs and someone will hate you for it.
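To make the trade-off concrete, here is a minimal downsampling sketch: the points are split into one bucket per pixel column and each bucket is reduced with a function of your choice. The `downsample` helper and the sample series are assumptions for illustration, not how any particular graphing frontend does it.

```python
from statistics import mean
from typing import Callable, List, Sequence


def downsample(points: Sequence[float], width: int,
               reduce: Callable[[Sequence[float]], float]) -> List[float]:
    """Collapse `points` into at most `width` values, one per pixel column."""
    n = len(points)
    if n <= width:
        return list(points)  # enough pixels for every point
    out = []
    for col in range(width):
        start = col * n // width
        end = (col + 1) * n // width  # always > start because n > width
        out.append(reduce(points[start:end]))
    return out


# 750 flat data points with a single one-sample spike, drawn on a 500px graph.
series = [0.0] * 750
series[101] = 100.0

by_max = downsample(series, 500, max)    # the spike survives at full height
by_min = downsample(series, 500, min)    # the spike disappears entirely
by_avg = downsample(series, 500, mean)   # the spike is halved

print(max(by_max), max(by_min), max(by_avg))
```

The same spike is reported as 100, 0 or 50 depending on the reducer, which is exactly why someone will be unhappy with whichever default is chosen.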

JavaScript is slow

It looks good though

Hard to e-mail dynamic JavaScript graphs

Remember to plan for caching

Outages test your frontend's performance

Lesson 4: Metrics Discovery

Why have 100M metrics if you can't find them?

Even when you find them, you can't make sense of 1000 metrics for one service

Standard names help. Namespace helps.

Not having 100M metrics helps

Lesson 5: Alerting On Metrics

Alerting on absolute values is bad

Use standard deviation

Use rate of change

But remember: DST and holidays ruin everything
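As a rough sketch of the two alternatives to absolute thresholds, the checks below compare the latest value against recent history rather than against a fixed number. The function names, the three-sigma default and the 50% change limit are illustrative assumptions. Neither check knows about seasonality, so a baseline that spans a DST shift or a holiday will still misfire.

```python
from statistics import mean, stdev
from typing import Sequence


def deviates(history: Sequence[float], latest: float, sigmas: float = 3.0) -> bool:
    """True if `latest` is more than `sigmas` standard deviations from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu, sd = mean(history), stdev(history)
    if sd == 0:
        return latest != mu  # perfectly flat history: any change is a deviation
    return abs(latest - mu) > sigmas * sd


def changed_too_fast(previous: float, latest: float, max_fraction: float = 0.5) -> bool:
    """True if the value moved by more than `max_fraction` of its previous value."""
    if previous == 0:
        return latest != 0
    return abs(latest - previous) / abs(previous) > max_fraction


history = [100, 102, 98, 101, 99]  # hypothetical requests per second
print(deviates(history, 103))      # within normal variation
print(deviates(history, 120))      # well outside it
print(changed_too_fast(100, 120))  # 20% change
print(changed_too_fast(100, 40))   # 60% drop
```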

Lesson 6: Alerting Overload

Alerts are for humans

Low friction for alert suppression

Low friction for alert changes and customization

Lesson 7: Alerting Levels

There is no Warning level for alerts

If it is worth alerting you, it's critical

Rate of change monitoring will help with most useful cases here

Lesson 8: Suppression Times

Unlimited suppression time = Regret

Less than unlimited is OK

Low friction alert modifications
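One way to rule out unlimited suppression is to make the duration a required, capped argument. The sketch below assumes a hypothetical in-memory `SuppressionBook` and a 24-hour cap; the class, the cap and the alert name are illustrative and not taken from the talk. It also covers scheduling a suppression in advance and cancelling it early.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict

MAX_SUPPRESSION = timedelta(hours=24)  # hypothetical cap; pick what suits your team


@dataclass
class Suppression:
    start: datetime
    end: datetime


class SuppressionBook:
    def __init__(self) -> None:
        self._by_alert: Dict[str, Suppression] = {}

    def suppress(self, alert: str, start: datetime, duration: timedelta) -> Suppression:
        """Schedule a suppression; `start` may be in the future."""
        if duration <= timedelta(0) or duration > MAX_SUPPRESSION:
            raise ValueError("duration must be positive and at most MAX_SUPPRESSION")
        suppression = Suppression(start, start + duration)
        self._by_alert[alert] = suppression
        return suppression

    def cancel(self, alert: str) -> None:
        """Remove a suppression early; a no-op if none exists."""
        self._by_alert.pop(alert, None)

    def is_suppressed(self, alert: str, now: datetime) -> bool:
        s = self._by_alert.get(alert)
        return s is not None and s.start <= now < s.end


book = SuppressionBook()
noon = datetime(2020, 1, 1, 12, 0)
# Schedule a two-hour maintenance window starting at 13:00.
book.suppress("db.disk_full", noon + timedelta(hours=1), timedelta(hours=2))
print(book.is_suppressed("db.disk_full", noon))                       # before the window
print(book.is_suppressed("db.disk_full", noon + timedelta(hours=2)))  # inside the window
print(book.is_suppressed("db.disk_full", noon + timedelta(hours=3)))  # window has expired
```

Because every suppression carries an end time, a forgotten one expires on its own instead of hiding an outage months later.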

Lesson 9: Processing your metrics from streams for alerts

Don't re-read your Kafka stream to build up metrics history.

It's right there in your metrics store

Lesson 10: Alerting on Logs

No really, logs are for humans. Rethink your monitoring strategy.

Lesson 11: Alerting on Exceptions

When exceptions are rare, alerting is fine

There's a special place in hell for people that alert on rates of exceptions

I'm sure there's more. But if I'm not over time by this point, I talked REALLY FAST

Questions?

Grier Johnson

[email protected]

@grierj on Twitter for DMs

https://www.linkedin.com/in/grierjohnson

