less alarming alerts - usenix...less alarming alerts saturday, april 9, 16. hello /@robtreat2 former...

55
/ Robert Treat Less Alarming Alerts Saturday, April 9, 16

Upload: others

Post on 22-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

/ Robert Treat

Less Alarming Alerts

Saturday, April 9, 16

Page 2: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Hello /@robtreat2

Former

WebDev SysAdmin DBA

I have now been promoted to where I can do the least damage

Saturday, April 9, 16

Page 3: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Hello /@robtreat2

Now

CEO @OMNITI

Saturday, April 9, 16

Page 4: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Hello /@robtreat2

Who Cares What Some Suite Thinks?

Saturday, April 9, 16

Page 5: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Hello /@robtreat2

Phantom Pages

Saturday, April 9, 16

Page 6: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Memory Lane /@robtreat2

Benny

Saturday, April 9, 16

Page 7: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Memory Lane /@robtreat2

MyFirstPager

Saturday, April 9, 16

Page 8: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Memory Lane /@robtreat2

Multiple Rotations

Saturday, April 9, 16

Page 9: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Memory Lane /@robtreat2

always available, phone onlyno pager for years

Saturday, April 9, 16

Page 10: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Hello /@robtreat2

Phantom Pages

Saturday, April 9, 16

Page 11: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Hello /@robtreat2

I manage the SRE team at OmniTI

we manage multiple sites24x7

millions of users

(omniti.com/is/hiring)

Saturday, April 9, 16

Page 12: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Why God Why?

paging is useful

“broken systems should not be just another day at the office”

-- me

Saturday, April 9, 16

Page 13: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Why God Why?

paging is useful

Who has ever gotten an alert and ignored it? (/me looks at alert, says “oh, it’ll probably recover, no need to look further”)

Saturday, April 9, 16

Page 14: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Why God Why?

paging is useful

How many alerts were received in the past week that were not actionable?

(no human action was required)

Saturday, April 9, 16

Page 15: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Why God Why?

paging CAN BE useful

Saturday, April 9, 16

Page 16: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Can We Fix It?

how to improve?

Saturday, April 9, 16

Page 17: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Can We Fix It?

[email protected]

we offer operationally focused services to help build and manage your infrastructure

:-)

Saturday, April 9, 16

Page 18: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Terms

• Metrics

• (anything which can be measured)

Saturday, April 9, 16

Page 19: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Terms

• Metrics

• (anything which can be measured)

• Graphs

• (trending systems)

Saturday, April 9, 16

Page 20: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Terms

• Metrics

• (anything which can be measured)

• Graphs

• (trending systems)

• Notices

• (notification of event; email)

Saturday, April 9, 16

Page 21: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Terms

• Metrics

• (anything which can be measured)

• Graphs

• (trending systems)

• Notices

• (notification of event; email)

• ALERTS

• (wake’n you up; pages)

Saturday, April 9, 16

Page 22: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Terms

• Metrics

• (anything which can be measured)

• Graphs

• (trending systems)

• Notices

• (notification of event; email)

• ALERTS

• (wake’n you up; pages)

Saturday, April 9, 16

Page 23: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Onward and Upward

If you want to improve your alerts

use systems thinking to reason about your“system”

Saturday, April 9, 16

Page 24: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Onward and Upward

alerts should be seen as evidence that your system is behaving in a way

outside of your existing understanding

Saturday, April 9, 16

Page 25: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Onward and Upward

If you want to improve your alerts

think in terms your business can get on board with

Saturday, April 9, 16

Page 26: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Onward and Upward

for every alert you receive

What is the business impact of this alert?

Saturday, April 9, 16

Page 27: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Onward and Upward

for every alert you receive

What is the remediation for this alert?

Saturday, April 9, 16

Page 28: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Onward and Upward

remediation:

• Summarize the problem

• What was done to solve the problem?

• Who was notified?

• Can this be prevented?

Saturday, April 9, 16

Page 29: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Onward and Upward

send the answer to these questions to everyone on the team

every time

Saturday, April 9, 16

Page 30: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Onward and Upward

link to this documentationfrom your alerting system

Saturday, April 9, 16

Page 31: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Onward and Upward

• Knowledge Transfer

• Gaps Exposed

• Patterns will emerge

Saturday, April 9, 16

Page 32: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Onward and Upward

you might be a bad alert

• cannot determine business impact

• no remediation necessary

• no one needs to be told

• work arounds are available

Saturday, April 9, 16

Page 33: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Onward and Upward

if you can’t fix it, you don’t need to wake up for it

Saturday, April 9, 16

Page 34: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Onward and Upward

if it can wait until morning, you don’t need to wake up

for it

Saturday, April 9, 16

Page 35: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Onward and Upward

in case of bad alert

• remove the alert

Saturday, April 9, 16

Page 36: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Onward and Upward

in case of bad alert

• remove the alert

• convert the alert to a notice

Saturday, April 9, 16

Page 37: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Onward and Upward

in case of bad alert

• remove the alert

• convert the alert to a notice

• implement fixes

Saturday, April 9, 16

Page 38: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Onward and Upward

pro tip: never let anyone add an alert unless they can answer these

questions first

Saturday, April 9, 16

Page 39: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Can We Really Do This?

this is partially an organizational issue

Saturday, April 9, 16

Page 40: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Can We Really Do This?

thought exercise: if you launched a new web site today,

you really only need one alarm

Saturday, April 9, 16

Page 41: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Can We Really Do This?

“I don’t care if my servers are on fire, as long as I am still making money”

-- Kevin, actual OmniTI customer

Saturday, April 9, 16

Page 42: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

This sounds good but...

Most SA/SRE types want to be pro-active, not re-active.

ie. they want to alert on leading indicators, not on problems

Saturday, April 9, 16

Page 43: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

This sounds good but...

Carrie: I-I'm just making sure we don't get hit again.

Saul: Well, I'm glad someone's looking out for us, Carrie.

Carrie: I'm serious. I-I missed something once before, I won't... I can't let that happen again.

Saul: It was ten years ago. Everyone missed something that day.

Carrie: Yeah, everyone's not me.

Saturday, April 9, 16

Page 44: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Based On A True Story

site down: monitor was checking 200 response code.

failed to notice absence of response code.

easily fixed, but reactive

Saturday, April 9, 16

Page 45: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Based On A True Story

“root cause” ==> OOM

why don’t we alert on OOM?

OOM does not consistently cause outages

Saturday, April 9, 16

Page 46: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Based On A True Story

too many false positives leads toignoring alarms

Saturday, April 9, 16

Page 47: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Digression

Friendman, Naparstek, Taussing-Rubbo, Alarmingly Useless, The Case For Banning Car Alarms In NYC

http://transalt.org/files/news/reports/caralarms/report.pdf

Blackstone, Buck, HakimEvaluation of alternative policies to combat false emergency calls

http://isc.temple.edu/economics/wkpapers/Pubs/FalsePolicy.pdf

Wickens, Rice, Keller, Hutchins, Hughes, ClaytonFalse Alerts in Air Traffic Control Conflict Alerting System: Is There A Cry Wolf Effect?

http://www.tc.faa.gov/LOGISTICS/grants/pdf/2007/07-G-002.pdf

Görges M, Markewitz BA, Westenskow DRImproving Alarm Performance In The Medical Intensive Care Unit Using Delays and Clinical Context

http://www.ncbi.nlm.nih.gov/pubmed/19372334

“In an intensive care unit, alarms are used to call attention to a patient, to alert a change in the patient's physiology, or to warn of a failure in a medical

device; however, up to 94% of the alarms are false.”

Saturday, April 9, 16

Page 48: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Digression

AESOP

The Boy Who Cried Wolf

Saturday, April 9, 16

Page 49: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Based On A True Story

• send notice of OOM?

• fix the cause of OOM?

• make a useful alert?

Saturday, April 9, 16

Page 50: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Based On A True Story

useful alerting

• script that checks for OOM

• restart app server when found

• find offending process; kill it

• spin up new node; kill old node

in the event all of these fail, send an alert?

Saturday, April 9, 16

Page 51: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Based On A True Story

thought exercise: if you launched a new web site today,

you really only need one alert

Saturday, April 9, 16

Page 52: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

In Conclusion

if we need software that runs 24x7, we should design resiliency into our software,

not human intervention

Saturday, April 9, 16

Page 53: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

In Conclusion

thinking doesn’t scale

especially at 2AM

Saturday, April 9, 16

Page 54: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

In Conclusion

thanks!more:

Surge 2016http://surge.omniti.com

@robtreat2@omniti

Saturday, April 9, 16

Page 55: Less Alarming Alerts - USENIX...Less Alarming Alerts Saturday, April 9, 16. Hello /@robtreat2 Former WebDev SysAdmin DBA I have now been promoted to where I can do the least damage

Saturday, April 9, 16