Reducing MTTR for Production Alerts at Twitter HQ (SRE Meetup)
TRANSCRIPT
Reducing MTTR for production alerts via alert enrichment and auto-remediation
SF Reliability Engineering Meetup, Twitter HQ, Sep 21st 2016
Kiran Gollu, Founder
Neptune.io © 2016
Agenda
• State of incident response today
• Challenges for SRE teams
• Core aspects of an incident response system (including cultural aspects)
• Alert enrichment and auto-remediation techniques and trade-offs
• Our learnings and recommendations
Brief Intro: Kiran Gollu
• Present: Founder
• Past: Founding engineer at AWS for 5 years
What is Incident Response?
How do you handle alerts, incidents, and outages?
Problem #1: Lots of alerts
Problem #2: Failures are complicated
Problem #3: Debugging is hard!
Source: DevOps survey; VictorOps incident response
Problem #4: 95% of Time to Recovery (TTR) is still manual today
Phases of TTR: Alert → Troubleshooting (Triage | Investigate | Identify) → Resolution → Documentation
Time split across the phases (per the slide): 73% | 10% | 5% | 12%
Troubleshooting
• Snapshots: graphs & metrics, logs, webpages
• Service health checks: internal, external
• Host/app diagnostics: "top", "df -H" etc., heap dumps/stack traces (see the capture sketch below)
Resolution
• Runbooks: on a single host or a cluster of hosts; any script, any language
• Cloud API/CLI actions: start/stop/reboot, scale up/down
Documentation
• Root-cause analysis & audit: heap dumps, logs, graphs
• Post-mortem: history, diagnostics
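The host/app diagnostics above amount to running a handful of commands the moment an alert fires and keeping the output with the incident. Below is a minimal sketch of that capture step; the command list, output directory, and alert-ID format are illustrative assumptions, not Neptune.io's implementation.

```python
# Minimal sketch (not Neptune.io's implementation): capture host diagnostics
# the moment an alert fires, so the snapshot reflects the state at alert time.
# The command list, output directory, and alert_id format are illustrative.
import subprocess
import time
from pathlib import Path

DIAG_COMMANDS = {
    "top": ["top", "-b", "-n", "1"],   # one-shot CPU/memory snapshot
    "disk": ["df", "-H"],              # disk usage, human-readable
    "sockets": ["ss", "-s"],           # socket/connection summary
}

def capture_diagnostics(alert_id: str, out_dir: str = "/var/tmp/diag") -> Path:
    """Run each diagnostic command and save its output alongside the alert ID."""
    dest = Path(out_dir) / f"{alert_id}-{int(time.time())}"
    dest.mkdir(parents=True, exist_ok=True)
    for name, cmd in DIAG_COMMANDS.items():
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
        (dest / f"{name}.txt").write_text(result.stdout or result.stderr)
    return dest

if __name__ == "__main__":
    print(capture_diagnostics("high-memory-web-01"))
```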
• FBAR, the Facebook auto-remediation platform: "…it's doing the work of approximately 200 sysadmins…"
• "We built an internal tool for AWS"
• Nurse, an auto-remediation platform: "60% of problems are fixed automatically…"
• Winston, an event-driven automation tool
Your Options
• SaaS product: on-premise offering on AWS, deep integrations with monitoring tools
• Open-source event-driven automation
• Build it in house
Common Issues
• Alert noise: non-actionable alerts, false positives, and self-recovering alerts
• Not measuring the cost of dealing with incidents
• Incorrect monitoring thresholds
• Engineer burnout: too much manual work
• Band-aids instead of root-causing problems
• Alert correlation is hard: downstream/upstream dependencies
• No clear incident response and escalation processes
Maturity level of incident response teams
Source: @jpaulreed, @kfinnbraun (DevOps Enterprise Summit)
Incident fired – what is the root cause?
• There are two possibilities:
  • A real problem that needs a corrective action
  • A monitoring problem (a bad alert or threshold)
3 core pieces of an incident response platform
1. Tracking & Analytics
• Helps identify the top 20% of alerts causing 80% of the pain, sorted by frequency and MTTR (see the sketch after this list)
• Capture:
  • MTTA (mean time to acknowledgement)
  • MTTR (mean time to resolution)
  • Frequency of occurrence (how many times a particular alert has fired)
• Reporting + auditing:
  • Audit all activity (both manual and automated)
  • Leads to data-driven post-mortems
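As a concrete illustration of this slide, here is a minimal sketch of computing those metrics from an incident log; the Incident fields and the "pain" ranking (frequency weighted by MTTR) are assumptions, not a prescribed schema.

```python
# Minimal sketch: rank alert types by how much pain they cause, using the
# metrics from this slide (frequency, MTTA, MTTR). Field names are assumptions.
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

@dataclass
class Incident:
    alert_name: str
    fired_at: float      # epoch seconds
    acked_at: float
    resolved_at: float

def alert_stats(incidents: list[Incident]) -> list[dict]:
    by_alert: dict[str, list[Incident]] = defaultdict(list)
    for inc in incidents:
        by_alert[inc.alert_name].append(inc)

    rows = [{
        "alert": name,
        "count": len(group),
        "mtta_sec": mean(i.acked_at - i.fired_at for i in group),
        "mttr_sec": mean(i.resolved_at - i.fired_at for i in group),
    } for name, group in by_alert.items()]

    # Sort by total pain (frequency weighted by MTTR) to surface the top ~20%.
    rows.sort(key=lambda r: r["count"] * r["mttr_sec"], reverse=True)
    return rows
```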
2. Enrich Alerts
• When an alert occurs:
  • Gather context automatically from 13 different tools
  • Monitoring tools, logging tools, health checks, dependent services
• Use cases (see the sketch after this list):
  • Show relevant events on the same host/app right next to the alert
  • Latency is high → correlate events and capture the health of dependencies
  • High memory → capture the top-10 memory hogs and memory-usage graphs
  • High app error rate → capture error rate, latency trends, and app logs for 5xx errors from Splunk/Sumo Logic
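A minimal sketch of that enrichment step is below; the collector functions are hypothetical placeholders standing in for calls to your monitoring, logging, and health-check tools, not any specific integration.

```python
# Minimal sketch of alert enrichment: when an alert arrives, fan out to a few
# context collectors in parallel and attach whatever they return. The collector
# functions and payload fields are hypothetical placeholders, not a real tool's API.
from concurrent.futures import ThreadPoolExecutor

def recent_metric_graphs(host: str) -> dict:
    return {"memory_graph_url": f"https://graphs.example.com/{host}/memory"}  # placeholder

def recent_error_logs(app: str) -> list[str]:
    return [f"(stub) last 5xx log lines for {app}"]                           # placeholder

def dependency_health(app: str) -> dict:
    return {"downstream_db": "healthy", "cache": "healthy"}                   # placeholder

def enrich(alert: dict) -> dict:
    host, app = alert["host"], alert["app"]
    collectors = {
        "graphs": lambda: recent_metric_graphs(host),
        "logs": lambda: recent_error_logs(app),
        "dependencies": lambda: dependency_health(app),
    }
    # Run collectors concurrently so enrichment adds seconds, not minutes.
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn) for name, fn in collectors.items()}
        alert["context"] = {name: f.result() for name, f in futures.items()}
    return alert

print(enrich({"name": "HighErrorRate", "host": "web-01", "app": "checkout"}))
```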
3. Auto-Remediate repetitive alerts
• When an alert occurs:
  • If it's a known alert → run a remediation runbook
• Use cases (see the sketch after this list):
  • Process crashed → restart the process
  • Host is unpingable → restart up to 3 times and escalate if it still fails
  • Service is down → capture graphs, run a remediation workflow
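A minimal sketch of this "known alert → runbook, otherwise escalate" dispatch follows; the runbook bodies (systemctl restart, ping-and-retry) and the escalate() hook are illustrative assumptions, and a real reboot would go through your cloud API or CLI.

```python
# Minimal sketch of "known alert -> run a remediation runbook, else escalate".
# Runbook bodies and the escalate() hook are illustrative, not a specific product.
import subprocess
import time

def restart_process(alert: dict) -> bool:
    # Assumes a systemd-managed service named in the alert payload.
    return subprocess.run(["systemctl", "restart", alert["service"]]).returncode == 0

def recover_unpingable_host(alert: dict, attempts: int = 3) -> bool:
    for _ in range(attempts):
        # Placeholder for a reboot via your cloud API/CLI, then a health re-check.
        time.sleep(60)
        if subprocess.run(["ping", "-c", "3", alert["host"]]).returncode == 0:
            return True
    return False  # still unpingable after the retries -> escalate

RUNBOOKS = {
    "ProcessCrashed": restart_process,
    "HostUnpingable": recover_unpingable_host,
}

def escalate(alert: dict) -> None:
    print(f"escalating {alert['name']} to the on-call engineer")  # e.g. page someone

def handle(alert: dict) -> None:
    runbook = RUNBOOKS.get(alert["name"])
    if runbook is None or not runbook(alert):
        escalate(alert)  # unknown alert, or the runbook did not fix it
```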
Cultural aspects
• Have a clear incident resolution and escalation process
• Document and version your runbooks
• Single consolidated report per incident to make post-mortems easy
• Audit all manual and automated actions for an incident
• Use your own communication tools (Slack, HipChat) but record incident logs
• Use tools to log team collaboration activity
• Break silos: Dev and Ops resolve incidents together and share how they were resolved
Our learnings working with 100+ SRE/Ops teams
• Automate simple things first
• Have checks in place to avoid cascading failures: rate limiting, handling correlated failures (see the rate-limiting sketch after this list)
• Capture state and snapshots for self-recovering alerts
• Don’t automatically fix when you don’t know root cause
• Enriching incidents is as important as automating repetitive incidents
• The availability of your automation tool should be far higher than the availability of your apps
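As one example of the "checks in place to avoid cascading failures" point above, here is a minimal sketch of rate-limiting automated remediations; the window size and action limit are illustrative assumptions.

```python
# Minimal sketch: rate-limit automated remediations so a correlated failure
# can't trigger a storm of restarts. Window size and limit are illustrative.
import time
from collections import deque

class RemediationRateLimiter:
    def __init__(self, max_actions: int = 3, window_sec: float = 600.0):
        self.max_actions = max_actions
        self.window_sec = window_sec
        self.history: dict[str, deque] = {}

    def allow(self, action: str) -> bool:
        """True if this action may run now; False means stop and page a human."""
        now = time.monotonic()
        recent = self.history.setdefault(action, deque())
        while recent and now - recent[0] > self.window_sec:
            recent.popleft()            # drop attempts outside the window
        if len(recent) >= self.max_actions:
            return False
        recent.append(now)
        return True

limiter = RemediationRateLimiter(max_actions=3, window_sec=600)
if not limiter.allow("restart:checkout-service"):
    print("too many automated restarts recently; escalating to a human instead")
```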
How do you get better at it?
• Continuously eliminate manual effort involved
• Streamline your incident response workflow (the cultural aspects)
• Encourage good behavior and punish bad behavior
• Measure the time spent in incident response
• Make your alerts actionable
• Fix monitoring thresholds as a continuous process
• Enrich and automate incidents to reduce MTTR
Summary
• Reducing MTTR gives you:
  • More uptime and a better customer experience
  • More sleep and happier engineers
• To reduce MTTR for production incidents:
  • Create actionable alerts and fix monitoring thresholds
  • Embrace automation: enrich and automate alerts
  • Instill good incident response processes