Sandstorm or Significant? The Evolving Role of Situational Context in Incident Management

Post on 05-Apr-2017

Category: Technology

TRANSCRIPT

The evolving role of context in Incident Management

Matthew Boeckman, Developer Advocate

Victorops.com/blog

@matthewboeckman

Background

● 18 years on-call Ops
● 15 years w/ software teams
● Startup junkie
● DevOps enthusiast

3

What is VictorOps?

VictorOps ingests all of your alerts from your current monitoring tools and becomes the logical layer between your alerts and the people who receive them.

victorops.com/IMA

5

5 Phases of Incident Management

Detection

monitoring, metrics, thresholds

Response

alerting, on-call, escalation

Remediation

fixes, tickets, deployments

Analysis

postmortem, how or why, understand

Readiness

improvement, game days, learning

6

Standard Incident Workflow

Detection Response Remediation Analysis Readiness

7

Incident Management Assessment Matrix

Detection Response Remediation Analysis Preparedness

Novice

Beginner

Competent

Proficient

Expert

8

Incident Management Maturity Matrix

Detection Response Remediation Analysis Preparedness

Novice

Beginner x

Competent x x

Proficient x x

Expert

9

Self Assessment

Poll: How would you rate your overall team maturity?

A. Novice
B. Beginner
C. Competent
D. Proficient
E. Expert

10

The Focus Question

How can we help teams mature their incident management practice?

(Stated plainly: Make On-Call suck less)

11

Situational Context

12

Incident Management Key Metrics

● Mean Time to Repair (MTTR)
● Availability (SLA)
● Ticket Volumes
● Escalations
● Customer Satisfaction
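
As a rough illustration of the first metric, MTTR can be computed as the average time between detection and resolution. The sketch below assumes incidents are available as (detected, resolved) timestamp pairs; the function name and the sample data are hypothetical and not taken from any particular tool.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Return the average (resolved - detected) duration, or None if there are no incidents."""
    if not incidents:
        return None
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incident log: (detected, resolved) timestamps.
incidents = [
    (datetime(2017, 4, 1, 2, 0), datetime(2017, 4, 1, 2, 45)),    # 45 minutes
    (datetime(2017, 4, 2, 14, 0), datetime(2017, 4, 2, 15, 15)),  # 75 minutes
]

mttr = mean_time_to_repair(incidents)
print(mttr)
```

Tracking the same number per team or per service over time is one way to see whether added context is actually shortening repairs.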

13

Incident Management Key Metrics

14

Time Spent Managing Incidents - Low Maturity

Detection Response Remediation Analysis

Readiness

Time to Repair (MTTR)

15

Time Spent Managing Incidents - Medium Maturity

Detection Response Remediation Analysis

Readiness

Time to Repair (MTTR)

16

Time Spent Managing Incidents - High Maturity

Detection

Response

Remediation Analysis Readiness

Time to Repair (MTTR)

17

A New Core Metric

Detection

Response

Remediation Analysis Readiness

Time to Repair (MTTR)

Time to Learn (TTL)

Identify trends
Capacity plan
Improve infrastructure

Gamedays
Cross train
Update runbooks

18

Beep Beep Beep

19

Standard Incident Workflow

20

Standard Diagnostic Procedure

1. Fire up the VPN

2. Navigate dashboards, find relevant section

3. Review ticket or incident history for host

4. Review Runbooks for associated host

21

Common Bottlenecks to Establishing Context

● Multiple sources of record
● Duplicate Runbooks or documentation
● Metric overload
● New responders unfamiliar with systems

22

Where Does it Hurt?

Poll: Which is the most painful problem you experience in establishing context?

A. Multiple sources of record
B. Duplicate documentation
C. Metric overload
D. Everything is equally on fire
E. Everything is fantastic

23

Beep Beep Beep

24

A Tale of Two Graphs

Massive spike above expected norm

Response: Fire up the laptop and put a pot of coffee on

25

A Tale of Two Graphs

Small spike for a consistently loaded box.

Response: ACK alert, go back to sleep
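
The two graphs can be told apart by comparing the current reading with the host's own recent history rather than with a fixed threshold. The sketch below is one possible way to do that, using a simple z-score; the function name, the 3.0 cutoff, and the sample load values are assumptions made for illustration.

```python
from statistics import mean, pstdev


def classify_spike(history, current, threshold=3.0):
    """Label a reading 'significant' if it sits far above the host's own baseline."""
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # Perfectly flat history: any rise above it is unusual.
        return "significant" if current > baseline else "routine"
    z_score = (current - baseline) / spread
    return "significant" if z_score > threshold else "routine"


# Hypothetical CPU load samples (percent) for two hosts.
quiet_host = [10, 12, 11, 9, 10, 11]
busy_host = [85, 92, 88, 95, 90, 87]

quiet_result = classify_spike(quiet_host, 90)  # massive spike above the expected norm
busy_result = classify_spike(busy_host, 96)    # small spike on a consistently loaded box
print(quiet_result, busy_result)
```

Attaching the baseline and the score to the alert itself gives the responder the "fire up the laptop" or "go back to sleep" answer without opening a dashboard.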

26

This Time, with Context!

27

Enhanced Contextual Workflow

28

Alert Enhancements

Poll: My team is doing some enhancement of alerts today.

A. True
B. False

Many incidents can be tracked to deploys

Developer Velocity = Constant Change

Silos impair communication
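
Since many incidents trace back to deploys, one simple enhancement is to attach recent deploys for the affected service to the alert before it reaches the responder. The sketch below assumes alerts and deploys are plain dictionaries; the field names (`service`, `fired_at`, `deployed_at`, `version`) and the 30-minute window are hypothetical and do not reflect any specific product's schema.

```python
from datetime import datetime, timedelta


def enrich_with_deploys(alert, deploys, window=timedelta(minutes=30)):
    """Return a copy of the alert with deploys to the same service in the preceding window."""
    recent = [
        deploy
        for deploy in deploys
        if deploy["service"] == alert["service"]
        and timedelta(0) <= alert["fired_at"] - deploy["deployed_at"] <= window
    ]
    return {**alert, "recent_deploys": recent}


# Hypothetical alert and deploy log.
alert = {"service": "web", "fired_at": datetime(2017, 4, 5, 3, 10)}
deploys = [
    {"service": "web", "deployed_at": datetime(2017, 4, 5, 3, 0), "version": "example-2"},
    {"service": "web", "deployed_at": datetime(2017, 4, 4, 12, 0), "version": "example-1"},
    {"service": "db", "deployed_at": datetime(2017, 4, 5, 3, 5), "version": "example-9"},
]

enriched = enrich_with_deploys(alert, deploys)
print(enriched["recent_deploys"])
```

With the deploy listed on the alert, "because we just deployed?" becomes a hypothesis the responder can check immediately.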

29

CI/CD Exacerbates the Contextual Challenge

30

A Tale of Two Incidents

31

A Tale of Two Incidents

32

Introducing: The Scientific Method

Make Observations (the measurement)

Ask a question (why would a webserver quit working?)

Form a hypothesis (because we just deployed?)

33

The Sandstorm

34

No. Do not.

35

Measure Everything: the Anti-pattern

Measurements cost time and money

Busy dashboards lead to subconscious filtering

Measurements create a natural impulse to alert

36

Enhance

37

Stop

38

An Embarrassment of Dashboards

39

Rule of Thumb

Measure much

Alert on some

Contextualize all
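
One way to read the rule of thumb in code: every metric is recorded, only metrics with a threshold can page, and any alert that fires carries its context with it. This is a minimal sketch under those assumptions; the `Metric` class, its fields, and the example runbook path are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Metric:
    name: str
    alert_threshold: Optional[float] = None  # None: measured, but never pages anyone
    context: dict = field(default_factory=dict)


def evaluate(metric, value):
    """Return an alert dict with context attached, or None if no alert should fire."""
    if metric.alert_threshold is None or value <= metric.alert_threshold:
        return None
    return {"metric": metric.name, "value": value, **metric.context}


# Hypothetical metrics: both are measured, only one can alert.
disk_io = Metric("disk_io_wait")
latency = Metric(
    "p99_latency_ms",
    alert_threshold=500,
    context={"runbook": "runbooks/latency (hypothetical)", "owner": "web team"},
)

quiet = evaluate(disk_io, 9999)
fired = evaluate(latency, 750)
print(quiet, fired)
```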

40

Iteration is Key

Dialing in context takes time

Conduct blameless postmortems

Experiment with more and less context

Be objective in your assessment of what works

41

Leverage Situational Context

Providing incident responders with context can meaningfully impact MTTR, paying dividends in time to move your practice forward.

42

The Beginning

Detection Response Remediation Analysis

Readiness

Time to Repair (MTTR)

43

The Goal

Detection

Response

Remediation Analysis Readiness

Time to Repair (MTTR)

Time to Learn (TTL)

Identify trends
Capacity plan
Improve infrastructure

Gamedays
Cross train
Update runbooks

Take the IMA!

http://victorops.com/ima

Questions?

44

Thank you!

Matthew Boeckman

@matthewboeckman

Slides on devops.com & slideshare.com
