Reducing MTTR for Production Alerts at Twitter HQ (SRE Meetup)
TRANSCRIPT
Reducing MTTR for production alerts via alert enrichment and auto-remediation
SF Reliability Engineering Meetup, Twitter HQ, Sep 21st 2016
Kiran Gollu, Founder
Neptune.io © 2016
Agenda
• State of incident response today
• Challenges for SRE teams
• Core aspects of an incident response system (including cultural aspects)
• Alert enrichment and auto-remediation techniques and trade-offs
• Our learnings and recommendations
Brief Intro: Kiran Gollu
• Present: Founder
• Past: Founding engineer at AWS for 5 years
What is Incident Response?
How do you handle alerts, incidents, and outages?
Problem #1: Lots of alerts
Problem #2: Failures are complicated
Problem #3: Debugging is hard!
Source: DevOps survey; VictorOps incident response
Problem #4: 95% of Time to Recovery (TTR) is still manual today
Phases of TTR: Alert → Troubleshooting (Triage | Investigate | Identify) → Resolution → Documentation
Time split across the phases (per the slide): 73% | 10% | 5% | 12%
Troubleshooting
• Snapshots: graphs & metrics, logs, webpages
• Service health checks: internal, external
• Host/app diagnostics: "top", "df -H" etc., heap dumps/stack traces (see the capture sketch below)
Resolution
• Runbooks: on a single host or a cluster of hosts; any script, any language
• Cloud API/CLI actions: start/stop/reboot, scale up/down
Documentation
• Root-cause analysis & audit: heap dumps, logs, graphs
• Post-mortem: history, diagnostics
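The host/app diagnostics above amount to running a handful of commands the moment an alert fires and keeping the output with the incident. Below is a minimal sketch of that capture step; the command list, output directory, and alert-ID format are illustrative assumptions, not Neptune.io's implementation.

```python
# Minimal sketch (not Neptune.io's implementation): capture host diagnostics
# the moment an alert fires, so the snapshot reflects the state at alert time.
# The command list, output directory, and alert_id format are illustrative.
import subprocess
import time
from pathlib import Path

DIAG_COMMANDS = {
    "top": ["top", "-b", "-n", "1"],   # one-shot CPU/memory snapshot
    "disk": ["df", "-H"],              # disk usage, human-readable
    "sockets": ["ss", "-s"],           # socket/connection summary
}

def capture_diagnostics(alert_id: str, out_dir: str = "/var/tmp/diag") -> Path:
    """Run each diagnostic command and save its output alongside the alert ID."""
    dest = Path(out_dir) / f"{alert_id}-{int(time.time())}"
    dest.mkdir(parents=True, exist_ok=True)
    for name, cmd in DIAG_COMMANDS.items():
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
        (dest / f"{name}.txt").write_text(result.stdout or result.stderr)
    return dest

if __name__ == "__main__":
    print(capture_diagnostics("high-memory-web-01"))
```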
• FBAR, the Facebook auto-remediation platform: "…it's doing the work of approximately 200 sysadmins…"
• "We built an internal tool for AWS"
• Nurse, an auto-remediation platform: "60% of problems are fixed automatically…"
• Winston, an event-driven automation tool
Your Options
• SaaS product: on-premise offering on AWS, deep integrations with monitoring tools
• Open-source event-driven automation
• Build it in house
Common Issues
• Alert noise: non-actionable alerts, false positives, and self-recovering alerts
• Not measuring the cost of dealing with incidents
• Incorrect monitoring thresholds
• Engineer burnout: too much manual work
• Band-aids instead of root-causing problems
• Alert correlation is hard: downstream/upstream dependencies
• No clear incident response and escalation processes
Maturity level of incident response teams
Source: @jpaulreed, @kfinnbraun (DevOps Enterprise Summit)
Incident fired – what is the root cause?
• There are two possibilities:
  • A real problem that needs a corrective action
  • A monitoring problem (a bad alert or threshold)
3 core pieces of an incident response platform
1. Tracking & Analytics
• Helps identify the top 20% of alerts causing 80% of the pain, sorted by frequency and MTTR (see the sketch after this list)
• Capture:
  • MTTA (mean time to acknowledgement)
  • MTTR (mean time to resolution)
  • Frequency of occurrence (how many times a particular alert has fired)
• Reporting + auditing:
  • Audit all activity (both manual and automated)
  • Leads to data-driven post-mortems
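As a concrete illustration of this slide, here is a minimal sketch of computing those metrics from an incident log; the Incident fields and the "pain" ranking (frequency weighted by MTTR) are assumptions, not a prescribed schema.

```python
# Minimal sketch: rank alert types by how much pain they cause, using the
# metrics from this slide (frequency, MTTA, MTTR). Field names are assumptions.
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

@dataclass
class Incident:
    alert_name: str
    fired_at: float      # epoch seconds
    acked_at: float
    resolved_at: float

def alert_stats(incidents: list[Incident]) -> list[dict]:
    by_alert: dict[str, list[Incident]] = defaultdict(list)
    for inc in incidents:
        by_alert[inc.alert_name].append(inc)

    rows = [{
        "alert": name,
        "count": len(group),
        "mtta_sec": mean(i.acked_at - i.fired_at for i in group),
        "mttr_sec": mean(i.resolved_at - i.fired_at for i in group),
    } for name, group in by_alert.items()]

    # Sort by total pain (frequency weighted by MTTR) to surface the top ~20%.
    rows.sort(key=lambda r: r["count"] * r["mttr_sec"], reverse=True)
    return rows
```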
2. Enrich Alerts
• When an alert occurs:
  • Gather context automatically from 13 different tools
  • Monitoring tools, logging tools, health checks, dependent services
• Use cases (see the sketch after this list):
  • Show relevant events on the same host/app right next to the alert
  • Latency is high → correlate events and capture the health of dependencies
  • High memory → capture the top-10 memory hogs and memory-usage graphs
  • High app error rate → capture error rate, latency trends, and app logs for 5xx errors from Splunk/Sumo Logic
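A minimal sketch of that enrichment step is below; the collector functions are hypothetical placeholders standing in for calls to your monitoring, logging, and health-check tools, not any specific integration.

```python
# Minimal sketch of alert enrichment: when an alert arrives, fan out to a few
# context collectors in parallel and attach whatever they return. The collector
# functions and payload fields are hypothetical placeholders, not a real tool's API.
from concurrent.futures import ThreadPoolExecutor

def recent_metric_graphs(host: str) -> dict:
    return {"memory_graph_url": f"https://graphs.example.com/{host}/memory"}  # placeholder

def recent_error_logs(app: str) -> list[str]:
    return [f"(stub) last 5xx log lines for {app}"]                           # placeholder

def dependency_health(app: str) -> dict:
    return {"downstream_db": "healthy", "cache": "healthy"}                   # placeholder

def enrich(alert: dict) -> dict:
    host, app = alert["host"], alert["app"]
    collectors = {
        "graphs": lambda: recent_metric_graphs(host),
        "logs": lambda: recent_error_logs(app),
        "dependencies": lambda: dependency_health(app),
    }
    # Run collectors concurrently so enrichment adds seconds, not minutes.
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn) for name, fn in collectors.items()}
        alert["context"] = {name: f.result() for name, f in futures.items()}
    return alert

print(enrich({"name": "HighErrorRate", "host": "web-01", "app": "checkout"}))
```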
3. Auto-Remediate repetitive alerts
• When an alert occurs:
  • If it's a known alert → run a remediation runbook
• Use cases (see the sketch after this list):
  • Process crashed → restart the process
  • Host is unpingable → restart up to 3 times and escalate if it still fails
  • Service is down → capture graphs, run a remediation workflow
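A minimal sketch of this "known alert → runbook, otherwise escalate" dispatch follows; the runbook bodies (systemctl restart, ping-and-retry) and the escalate() hook are illustrative assumptions, and a real reboot would go through your cloud API or CLI.

```python
# Minimal sketch of "known alert -> run a remediation runbook, else escalate".
# Runbook bodies and the escalate() hook are illustrative, not a specific product.
import subprocess
import time

def restart_process(alert: dict) -> bool:
    # Assumes a systemd-managed service named in the alert payload.
    return subprocess.run(["systemctl", "restart", alert["service"]]).returncode == 0

def recover_unpingable_host(alert: dict, attempts: int = 3) -> bool:
    for _ in range(attempts):
        # Placeholder for a reboot via your cloud API/CLI, then a health re-check.
        time.sleep(60)
        if subprocess.run(["ping", "-c", "3", alert["host"]]).returncode == 0:
            return True
    return False  # still unpingable after the retries -> escalate

RUNBOOKS = {
    "ProcessCrashed": restart_process,
    "HostUnpingable": recover_unpingable_host,
}

def escalate(alert: dict) -> None:
    print(f"escalating {alert['name']} to the on-call engineer")  # e.g. page someone

def handle(alert: dict) -> None:
    runbook = RUNBOOKS.get(alert["name"])
    if runbook is None or not runbook(alert):
        escalate(alert)  # unknown alert, or the runbook did not fix it
```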
Cultural aspects
• Have a clear incident resolution and escalation process
• Document and version your runbooks
• Single consolidated report per incident to make post-mortems easy
• Audit all manual and automated actions for an incident
• Use your own communication tools (Slack, HipChat) but record incident logs
• Use tools to log team collaboration activity
• Break silos: Dev and Ops resolve incidents together and share how they were resolved
Our learnings working with 100+ SRE/Ops teams
• Automate simple things first
• Have checks in place to avoid cascading failures: rate limiting, handling correlated failures (see the rate-limiting sketch after this list)
• Capture state and snapshots for self-recovering alerts
• Don’t automatically fix when you don’t know root cause
• Enriching incidents is as important as automating repetitive incidents
• The availability of your automation tool should be far higher than the availability of your apps
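As one example of the "checks in place to avoid cascading failures" point above, here is a minimal sketch of rate-limiting automated remediations; the window size and action limit are illustrative assumptions.

```python
# Minimal sketch: rate-limit automated remediations so a correlated failure
# can't trigger a storm of restarts. Window size and limit are illustrative.
import time
from collections import deque

class RemediationRateLimiter:
    def __init__(self, max_actions: int = 3, window_sec: float = 600.0):
        self.max_actions = max_actions
        self.window_sec = window_sec
        self.history: dict[str, deque] = {}

    def allow(self, action: str) -> bool:
        """True if this action may run now; False means stop and page a human."""
        now = time.monotonic()
        recent = self.history.setdefault(action, deque())
        while recent and now - recent[0] > self.window_sec:
            recent.popleft()            # drop attempts outside the window
        if len(recent) >= self.max_actions:
            return False
        recent.append(now)
        return True

limiter = RemediationRateLimiter(max_actions=3, window_sec=600)
if not limiter.allow("restart:checkout-service"):
    print("too many automated restarts recently; escalating to a human instead")
```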
How do you get better at it?
• Continuously eliminate manual effort involved
• Streamline your incident response workflow (the cultural aspects)
• Encourage good behavior and punish bad behavior
• Measure the time spent in incident response
• Make your alerts actionable
• Fix monitoring thresholds as a continuous process
• Enrich and automate incidents to reduce MTTR
Summary
• Reducing MTTR gives you:
  • More uptime and a better customer experience
  • More sleep and happier engineers
• To reduce MTTR for production incidents:
  • Create actionable alerts and fix monitoring thresholds
  • Embrace automation: enrich and automate alerts
  • Instill good incident response processes