TRANSCRIPT
Metrics and Monitoring Infrastructure
Lessons Learned Building Metrics at LinkedIn
“If you can’t measure it, you can’t manage it.”
- W. Edwards Deming
“It is wrong to suppose that if you can’t measure it, you can’t manage it – a costly myth.”
- W. Edwards Deming
Who Am I?
Grier Johnson: Platform Engineer → Production Engineer → Site Reliability Engineer → Service Engineer → Production Engineer
All these titles mean about the same thing
Built out some metrics collection at LinkedIn
Architected the alerting system
So what makes a “Metrics Infrastructure”?
The Metrics Store
Metric stores have high write volume and low read volume
The read requirements for historical data are even lower
Metrics Transport
You probably just want to use Kafka
Now your metrics are pub/sub
Maybe add something like Hadoop for data analysis?
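To make the transport idea concrete, here is a minimal sketch of what a metric sample might look like as a pub/sub message on a Kafka topic. The field names and JSON encoding are illustrative assumptions, not LinkedIn's actual wire format; any downstream consumer (the metrics store, the alerting engine, a Hadoop job) would decode the same payload.

```python
import json
import time

def encode_metric(service, name, value, ts=None):
    """Encode one metric sample as a JSON message for a pub/sub
    transport such as a Kafka topic (field names are illustrative)."""
    return json.dumps({
        "service": service,
        "metric": name,
        "value": value,
        "timestamp": ts if ts is not None else int(time.time()),
    }, sort_keys=True)

def decode_metric(payload):
    """Any consumer -- store, alerting, batch analysis -- decodes the
    same payload, which is what makes the transport pub/sub."""
    return json.loads(payload)
```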
Metrics Emission
Standardize…
• Your metrics
• Your intervals
• Your protocols
• Your namespace
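One way to enforce a standardized namespace is to make every emitter go through a single name-building function. The dotted `datacenter.service.host.metric` scheme below is an assumption for illustration; the point is that a shared helper, not each team's convention, decides what a metric is called.

```python
def metric_name(datacenter, service, host, metric):
    """Build a dotted, lowercase metric name so all teams' metrics
    land in a predictable namespace (the scheme is illustrative)."""
    parts = [datacenter, service, host, metric]
    # Normalize: lowercase, and keep "." reserved as the separator.
    cleaned = [p.strip().lower().replace(".", "_").replace(" ", "_")
               for p in parts]
    if not all(cleaned):
        raise ValueError("every namespace component is required")
    return ".".join(cleaned)
```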
Metrics Types
Allow for custom metrics
Prefer counters
Histograms
Protect your namespace, seriously
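A minimal sketch of the two metric types above, assuming nothing about LinkedIn's actual implementation: a monotonic counter (preferred, because the store can derive rates from it and a missed scrape loses nothing) and a fixed-bucket histogram of the kind used for latency distributions.

```python
import bisect

class Counter:
    """Monotonic counter: it only goes up, so rates can be derived
    downstream and a missed collection interval loses no information."""
    def __init__(self):
        self.value = 0

    def inc(self, n=1):
        if n < 0:
            raise ValueError("counters only go up")
        self.value += n

class Histogram:
    """Fixed-bucket histogram, e.g. request latency in ms.
    counts[i] holds observations where buckets[i-1] < v <= buckets[i];
    the final slot catches everything above the largest bound."""
    def __init__(self, buckets=(5, 10, 25, 50, 100, 250, 500, 1000)):
        self.buckets = sorted(buckets)
        self.counts = [0] * (len(self.buckets) + 1)

    def observe(self, v):
        self.counts[bisect.bisect_left(self.buckets, v)] += 1
```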
Metrics Presentation
Allow for customization
Allow for customization:
• Graph size
• Start and end times
• Graph type
• Legends
• Color palette
Static Graphs and Links
Speed Matters
Update frequency matters
Give a lot of thought to the colors. No really. People really care about this.
Ok, now how about monitoring and alerting…
Data Sources
Metrics
Centralized Checks
Decentralized / Distributed Checks
Stream Processing
Defining Your Alerts
What data are we looking at?
What does it mean for the data to be good or bad?
What do we do if it’s bad?
Processing
Bringing together data and definition
Read your data sources
Read your definitions
Apply the definitions to the data
Do something! Or don’t.
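The processing loop above can be sketched in a few lines. The definition format here (a metric name, a predicate for "bad", and an action) is a hypothetical illustration of the pattern, not a real alerting DSL: read the data, read the definitions, apply one to the other, and emit actions, or nothing at all.

```python
# Hypothetical alert definitions: what data we look at, what "bad"
# means for that data, and what to do about it.
DEFINITIONS = [
    {"metric": "search.qps", "bad": lambda v: v < 100, "action": "page"},
    {"metric": "search.disk_pct", "bad": lambda v: v > 90, "action": "run_cleanup"},
]

def process(samples, definitions=DEFINITIONS):
    """Apply each definition to the latest samples and return the
    actions to take; an empty list means do nothing at all."""
    actions = []
    for d in definitions:
        value = samples.get(d["metric"])
        if value is not None and d["bad"](value):
            actions.append((d["metric"], d["action"]))
    return actions
```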
Alerting Actions
Nothing, nothing at all.
Alert!
Run a script
… wait what?
Use your monitoring system to respond.
Automation is better than alerting
UI/UX
At-a-glance visual for site/service health
Point-and-click alert suppression
CLI and API for automation tasks
Bulk actions, please
Stopping Alerts
“Suppression”, also called acknowledge, sleep, quiet, or silence
Allow suppression scheduling
Cancelling suppressions
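Scheduling and cancellation fit naturally into one small object. This is a sketch under assumed policy, not a real API: every suppression has a hard end time (unlimited suppression, as Lesson 8 says, equals regret), may start in the future, and can be cancelled early.

```python
import time

class Suppression:
    """A suppression (ack/sleep/quiet/silence) with a mandatory end
    time, an optional future start, and explicit cancellation.
    The one-week cap is an illustrative policy choice."""
    MAX_SECONDS = 7 * 24 * 3600

    def __init__(self, alert, start, end):
        if end - start > self.MAX_SECONDS:
            raise ValueError("suppression window too long")
        self.alert, self.start, self.end = alert, start, end
        self.cancelled = False

    def cancel(self):
        self.cancelled = True

    def active(self, now=None):
        now = time.time() if now is None else now
        return not self.cancelled and self.start <= now < self.end
```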
So what did I learn…
Lesson 1: Writing your Metrics Infrastructure from scratch
Don’t Invent It Here
Use Open Source
Metrics Stores
• Netflix’s Atlas (https://github.com/Netflix/atlas)
• Prometheus (https://prometheus.io)
• RackSpace’s Blueflood (http://blueflood.io)
• OpenTSDB (http://opentsdb.net)
Alerting
• Sensu (https://sensuapp.org)
• Riemann (http://riemann.io)
• Zabbix (http://www.zabbix.com)
• Nagios (http://actually-dont.com)
Hybridize
Contribute
Lesson 2: When you build your metrics system from scratch anyhow…
Redundancy doesn’t matter, until your first outage with data loss
Care about thin bandwidth pipes
Distribute your stores close to the metrics creators
Aggregate distributed metrics at the presentation layer
Cache what you can
Use Kafka
Lesson 3: Graphing UIs
DPI Matters
When your graph is 500px wide, how many of the 750 data points can you show?
Show the highest data points?
The lowest?
The average?
They’re all trade-offs and someone will hate you for it.
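The 750-points-into-500-pixels problem comes down to bucketing. A sketch of the trade-off, with no assumptions beyond a plain list of samples: compute min, max, and average per bucket so the UI can choose which to draw. Max preserves spikes, min preserves dips, average smooths both away.

```python
def downsample(points, width_px):
    """Reduce a series to at most width_px values per rendering.
    Returns (mins, maxes, avgs) per pixel-bucket so the caller picks
    its trade-off: max keeps spikes, min keeps dips, avg smooths."""
    if len(points) <= width_px:
        return points, points, points
    size = len(points) / width_px  # samples per pixel (> 1 here)
    mins, maxes, avgs = [], [], []
    for i in range(width_px):
        bucket = points[int(i * size):int((i + 1) * size)]
        mins.append(min(bucket))
        maxes.append(max(bucket))
        avgs.append(sum(bucket) / len(bucket))
    return mins, maxes, avgs
```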
JavaScript is slow
It looks good though
Hard to e-mail dynamic JavaScript graphs
Remember to plan for caching
Outages test your frontend’s performance
Lesson 4: Metrics Discovery
Why have 100M metrics if you can’t find them?
Even when you find them, you can’t make sense of 1,000 metrics for one service
Standard names help. Namespace helps.
Not having 100M metrics helps
Lesson 5: Alerting On Metrics
Alerting on absolute values is bad
Use standard deviation
Use rate of change
But remember DST and Holidays ruin everything
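Both techniques above are easy to sketch. These are minimal illustrative checks, not a production detector: a z-score test against recent history instead of an absolute threshold, and a rate-of-change test. Neither defends against DST shifts or holidays by itself; comparing against the same weekday and hour is one common mitigation.

```python
import statistics

def zscore_alert(history, current, threshold=3.0):
    """Flag a sample more than `threshold` standard deviations from
    recent history, instead of comparing to a fixed absolute value."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return False  # flat history: no deviation signal to use
    return abs(current - mean) / stdev > threshold

def rate_of_change_alert(previous, current, interval_s, max_per_s):
    """Alert when the metric moves faster than max_per_s units/sec.
    Caveat from the slide: DST and holidays shift traffic patterns,
    so compare like-for-like time windows where possible."""
    return abs(current - previous) / interval_s > max_per_s
```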
Lesson 6: Alerting Overload
Alerts are for humans
Low friction for alert suppression
Low friction for alert changes and customization
Lesson 7: Alerting Levels
There is no “Warning” level for alerts
If it is worth alerting you, it’s critical
Rate of change monitoring will help with most useful cases here
Lesson 8: Suppression Times
Unlimited suppression time = Regret
Less than unlimited is OK
Low friction alert modifications
Lesson 9: Processing your metrics from streams for alerts
Don’t re-read your Kafka stream to build up metrics history
It’s right there in your metrics store
Lesson 10: Alerting on Logs
Don’t
No really, logs are for humans, rethink your monitoring strategy
Lesson 11: Alerting on Exceptions
Maybe
When exceptions are rare, alerting is fine
There’s a special place in hell for people that alert on rates of exceptions
I’m sure there’s more
But if I’m not over time by this point, I talked REALLY FAST
Questions?
• Grier Johnson
• @grierj on Twitter for DMs
• https://www.linkedin.com/in/grierjohnson