Metrics and Monitoring Infrastructure Lessons Learned Building Metrics at LinkedIn

Upload: grier-johnson

Post on 21-Apr-2017

TRANSCRIPT

Page 1: Metrics and Monitoring Infrastructure: Lessons Learned Building Metrics at LinkedIn

Metrics and Monitoring Infrastructure

Lessons Learned Building Metrics at LinkedIn

“If you can’t measure it, you can’t manage it.”

- W. Edwards Deming

“It is wrong to suppose that if you can’t measure it, you can’t manage it – a costly myth.”

- W. Edwards Deming

Who Am I?

Grier Johnson, Platform Engineer

Grier Johnson, Production Engineer

Grier Johnson, Site Reliability Engineer

Grier Johnson, Service Engineer

Grier Johnson, Production Engineer

Grier Johnson: all these titles mean about the same thing

Built out some metrics collection at LinkedIn

Architected the alerting system

So what makes a “Metrics Infrastructure”?

The Metrics Store

Metrics stores have high write volume and low read volume

The read requirements for historical data are even lower

Metrics Transport

You probably just want to use Kafka

Now your metrics are pub/sub

Add something like Hadoop for data analysis, maybe?

Metrics Emission

Standardize… your metrics

Standardize… your intervals

Standardize… your protocols

Standardize… your namespace
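
One way to enforce a standard namespace is to validate metric names at emission time. The sketch below assumes a hypothetical dotted convention of datacenter.service.host.metric; the talk does not describe LinkedIn's actual scheme, so the field names and the regular expression are illustrative only.

```python
import re

# Hypothetical convention: datacenter.service.host.metric, lowercase, dot-separated.
METRIC_NAME = re.compile(
    r"^(?P<datacenter>[a-z0-9-]+)\."
    r"(?P<service>[a-z0-9-]+)\."
    r"(?P<host>[a-z0-9-]+)\."
    r"(?P<metric>[a-z0-9_]+(?:\.[a-z0-9_]+)*)$"
)


def parse_metric_name(name):
    """Return the namespace fields of a metric name, or raise ValueError."""
    match = METRIC_NAME.match(name)
    if match is None:
        raise ValueError(
            f"metric name does not follow the namespace convention: {name!r}"
        )
    return match.groupdict()
```

Rejecting bad names at the emitter is cheaper than cleaning them out of the store later.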

Metrics Types

Allow for custom metrics

Prefer counters
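
A likely reason to prefer counters is that a monotonically increasing counter can be turned into a rate between any two samples, so a missed sample costs resolution rather than data. A minimal sketch, assuming samples arrive as (timestamp_seconds, value) pairs and that a drop in value means the process restarted and the counter reset to zero:

```python
def counter_rates(samples):
    """Convert cumulative counter samples into per-second rates.

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    Returns a list of (timestamp_seconds, rate_per_second), one per interval.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order samples
        # Assumption: a decrease means the counter was reset to zero.
        delta = v1 - v0 if v1 >= v0 else v1
        rates.append((t1, delta / elapsed))
    return rates
```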

Histograms
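
A histogram can be emitted as a set of bucket counters, which keeps it compatible with a counter-first approach. A minimal sketch; the class name and the bucket bounds are assumptions, not something from the talk:

```python
from bisect import bisect_left


class Histogram:
    """Fixed-bucket histogram; each bucket is a simple counter."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra bucket catches observations above the largest bound.
        self.counts = [0] * (len(self.upper_bounds) + 1)

    def observe(self, value):
        # bisect_left finds the first bucket whose upper bound is >= value.
        self.counts[bisect_left(self.upper_bounds, value)] += 1

    def snapshot(self):
        labels = [f"le_{b}" for b in self.upper_bounds] + ["overflow"]
        return dict(zip(labels, self.counts))
```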

Protect your namespace, seriously

Metrics Presentation

Allow for customization

Graph Size

Allow for customization

Start and end times

Allow for customization

Graph type

Allow for customization

Legends

Allow for customization

Color palette

Static Graphs and Links

Speed Matters

Update frequency matters

Give a lot of thought to the colors. No really. People really care about this.

OK, now how about monitoring and alerting…

Data Sources

Metrics

Centralized Checks

Decentralized / Distributed Checks

Stream Processing

Defining Your Alerts

What data are we looking at?

What does it mean for the data to be good or bad?

What do we do if it’s bad?

Processing

Bringing together data and definition

Read your data sources

Read your definitions

Apply the definitions to the data

Do something! Or don’t.
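
The four processing steps above can be sketched as a small evaluation loop. Everything here is hypothetical: the AlertDefinition fields, the read_latest callable standing in for a data source, and the action callback are illustrative names, not the design of LinkedIn's system.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class AlertDefinition:
    metric: str                      # what data are we looking at
    is_bad: Callable[[float], bool]  # what "bad" means for that data
    # What to do if it is bad; None means do nothing.
    action: Optional[Callable[[str, float], None]] = None


def evaluate(definitions: List[AlertDefinition],
             read_latest: Callable[[str], Optional[float]]) -> List[str]:
    """Apply each definition to the latest value and run its action if bad."""
    fired = []
    for definition in definitions:
        value = read_latest(definition.metric)   # read your data sources
        if value is None:
            continue  # no data; a real system may want to treat this as bad
        if definition.is_bad(value):             # apply the definition
            fired.append(definition.metric)
            if definition.action is not None:    # do something, or don't
                definition.action(definition.metric, value)
    return fired
```

Keeping the definition as data (metric, predicate, action) is what makes it possible to change alerts without redeploying the processor.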

Alerting Actions

Nothing, nothing at all.

E-mail

Alert!

Run a script

… wait what?

Use your monitoring system to respond.

Automation is better than alerting
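
One reading of these slides is that the monitoring system should try a scripted fix first and only involve a human when that fails. A minimal sketch, assuming a hypothetical remediate callable that returns True on success and a hypothetical page callable that notifies the on-call engineer:

```python
def respond(alert_name, remediate, page, max_attempts=2):
    """Try automated remediation; page a human only if it keeps failing."""
    last_error = "no remediation attempted"
    for attempt in range(1, max_attempts + 1):
        try:
            fixed = remediate(alert_name)
        except Exception as exc:
            # A crashing script counts as a failed attempt.
            fixed, last_error = False, repr(exc)
        else:
            if not fixed:
                last_error = "remediation reported failure"
        if fixed:
            return f"remediated on attempt {attempt}"
    page(alert_name, last_error)
    return "paged"
```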

UI/UX

At-a-glance visual for site/service health

Point-and-click alert suppression

CLI and API for automation tasks

Bulk actions, please

Stopping Alerts

“Suppression” or acknowledge

“Suppression” or sleep

“Suppression” or quiet

“Suppression” or silence

Allow suppression scheduling

Cancelling suppressions
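
Scheduled and cancellable suppressions can be modeled as time windows attached to an alert. A minimal sketch with hypothetical class and method names; times are plain epoch seconds so the example stays self-contained:

```python
from dataclasses import dataclass


@dataclass
class Suppression:
    alert: str
    start: float   # epoch seconds; may be in the future (scheduled)
    end: float     # epoch seconds; always finite
    cancelled: bool = False

    def active(self, now: float) -> bool:
        return not self.cancelled and self.start <= now < self.end


class SuppressionBook:
    def __init__(self):
        self._entries = []

    def schedule(self, alert, start, end):
        if end <= start:
            raise ValueError("suppression must end after it starts")
        entry = Suppression(alert, start, end)
        self._entries.append(entry)
        return entry

    def cancel(self, entry):
        # Cancel early, e.g. when maintenance finishes ahead of schedule.
        entry.cancelled = True

    def is_suppressed(self, alert, now):
        return any(e.alert == alert and e.active(now) for e in self._entries)
```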

So what did I learn…

Lesson 1: Writing your Metrics Infrastructure from scratch

Don’t Invent It Here

Use Open Source

Metrics Stores

• Netflix’s Atlas (https://github.com/Netflix/atlas)

• Prometheus (https://prometheus.io)

• Rackspace’s Blueflood (http://blueflood.io)

• OpenTSDB (http://opentsdb.net)

Alerting

• Sensu (https://sensuapp.org)

• Riemann (http://riemann.io)

• Zabbix (http://www.zabbix.com)

• Nagios (http://actually-dont.com)

Hybridize

Contribute

Lesson 2: When you build your metrics system from scratch anyhow…

Redundancy doesn’t matter, until your first outage with data loss

Care about thin bandwidth pipes

Distribute your stores close to the metrics creators

Aggregate distributed metrics at the presentation layer
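
Aggregating at the presentation layer means each distributed store answers for its own data and the frontend merges the results. A minimal sketch, assuming each store returns a {timestamp: value} mapping on an already aligned interval and that summing is the right merge for the metric (true for counts, not for averages or percentiles):

```python
from collections import defaultdict


def merge_series(per_store_series):
    """Sum same-timestamp values from several stores into one series."""
    merged = defaultdict(float)
    for series in per_store_series:
        for timestamp, value in series.items():
            merged[timestamp] += value
    return sorted(merged.items())
```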

Cache what you can

Use Kafka

Lesson 3: Graphing UIs

DPI Matters

When your graph is 500px wide, how many of the 750 data points can you show?

Show the highest data points?

The lowest?

The average?

They’re all trade-offs and someone will hate you for it.
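
The trade-off in the preceding slides can be made concrete by grouping points per pixel column and choosing an aggregate. A minimal sketch; the function and parameter names are illustrative:

```python
from statistics import mean


def downsample(points, width_px, aggregate=max):
    """Reduce a list of values to at most width_px values, one per pixel column.

    aggregate is applied to the points that fall in each column: max keeps
    spikes, min keeps dips, mean smooths both away.
    """
    if width_px <= 0:
        raise ValueError("width_px must be positive")
    if len(points) <= width_px:
        return list(points)
    columns = []
    for col in range(width_px):
        # Integer arithmetic spreads the points as evenly as possible.
        lo = col * len(points) // width_px
        hi = (col + 1) * len(points) // width_px
        columns.append(aggregate(points[lo:hi]))
    return columns
```

With 750 points and a 500px graph, each column holds one or two points, so the choice of max, min, or mean decides which of the pair the viewer never sees.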

JavaScript is slow

It looks good though

Hard to e-mail dynamic JavaScript graphs

Remember to plan for caching

Outages test your frontend’s performance

Lesson 4: Metrics Discovery

Why have 100M metrics if you can’t find them?

Even when you find them, you can’t make sense of 1000 metrics for one service

Standard names help. Namespace helps. Not having 100M metrics helps.

Lesson 5: Alerting On Metrics

Alerting on absolute values is bad

Use standard deviation
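
One common way to alert on deviation rather than an absolute threshold is to compare the newest value against the mean and standard deviation of a recent window. A minimal sketch; the window contents and the 3-sigma threshold are assumptions to tune, not values from the talk:

```python
from statistics import mean, stdev


def is_anomalous(history, current, threshold_sigmas=3.0):
    """Return True if current is more than threshold_sigmas from the window mean."""
    if len(history) < 2:
        return False  # stdev needs at least two points
    spread = stdev(history)
    if spread == 0:
        # Flat history: any change at all is a deviation.
        return current != history[0]
    return abs(current - mean(history)) / spread > threshold_sigmas
```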

Use rate of change
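
Rate-of-change alerting looks at how fast a value is moving instead of where it currently sits. A minimal sketch, assuming (timestamp_seconds, value) samples and a hypothetical per-second limit:

```python
def rate_of_change(samples):
    """Per-second change between the oldest and newest sample in the window."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time range")
    return (v1 - v0) / (t1 - t0)


def changing_too_fast(samples, max_per_second):
    """True if the value rises or falls faster than max_per_second."""
    return abs(rate_of_change(samples)) > max_per_second
```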

But remember DST and holidays ruin everything

Lesson 6: Alerting Overload

Alerts are for humans

Low friction for alert suppression

Low friction for alert changes and customization

Lesson 7: Alerting Levels

There is no “Warning” level for alerts

If it is worth alerting you, it’s critical

Rate-of-change monitoring will cover most of the useful cases here

Lesson 8: Suppression Times

Unlimited suppression time = Regret

Less than unlimited is OK
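
A simple way to rule out unlimited suppressions is to clamp every requested duration to a maximum. A minimal sketch; the 24-hour cap is an arbitrary assumption, not a number from the talk:

```python
# Assumed cap; pick whatever fits your on-call rotation.
MAX_SUPPRESSION_SECONDS = 24 * 60 * 60


def suppression_end(start, requested_seconds=None, cap=MAX_SUPPRESSION_SECONDS):
    """Return the end time of a suppression, never later than start + cap.

    requested_seconds=None models a request for an "unlimited" suppression.
    """
    if requested_seconds is not None and requested_seconds <= 0:
        raise ValueError("requested_seconds must be positive")
    duration = cap if requested_seconds is None else min(requested_seconds, cap)
    return start + duration
```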

Low friction alert modifications

Lesson 9: Processing your metrics from streams for alerts

Don’t re-read your Kafka stream to build up metrics history.

It’s right there in your metrics store

Lesson 10: Alerting on Logs

Don’t

No really, logs are for humans. Rethink your monitoring strategy.

Lesson 11: Alerting on Exceptions

Maybe

When exceptions are rare, alerting is fine

There’s a special place in hell for people who alert on rates of exceptions

I’m sure there’s more. But if I’m not over time by this point, I talked REALLY FAST.

Questions?

• Grier Johnson

[email protected]

• @grierj on Twitter for DMs

• https://www.linkedin.com/in/grierjohnson