DevOps incident handling - making friends, not enemies

Posted on 27-Jun-2015

Category:

Technology

DESCRIPTION

David Mytton, CEO of Server Density, presented this talk at the DevOps Meetup in London. It walks through how to handle DevOps incidents, outages and downtime, and more specifically how to make friends, not enemies, in the process.

TRANSCRIPT

How to win friends when handling outages and downtime

David Mytton, London DevOps - Oct 2014

blog.serverdensity.com

David Mytton

Server monitoring, cloud management, dashboards and alerting

serverdensity.com

Slides: twitter.com/davidmytton

Let’s talk about downtime

2013 Spend: ~$5bn

2013 Spend: ~$6bn

2013 Spend: ~$4bn

You will have downtime

How much do you spend?

Preparation

Preparation - On Call

● Primary?

● Secondary?

● Reachability - Tube, 3G/4G (edge?!), Do Not Disturb mode, at the gym, family emergency, system updates

Preparation - On Call

● Off call

● Rotations

● Illness

● Work the next day?
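
The on-call points above (primary, secondary, rotations, illness) can be sketched in code. This is a minimal illustration, not anything shown in the talk: the function name `on_call_for`, the weekly shift length and the example names are all assumptions.

```python
from datetime import date

def on_call_for(day, engineers, rotation_start, shift_days=7, unavailable=()):
    """Return (primary, secondary) for `day`.

    Hypothetical helper: the roster rotates one slot every `shift_days`
    days, and anyone listed in `unavailable` (illness, off call) is skipped.
    """
    shift = (day - rotation_start).days // shift_days
    n = len(engineers)
    # Walk the roster starting at this shift's slot.
    order = [engineers[(shift + i) % n] for i in range(n)]
    candidates = [e for e in order if e not in unavailable]
    if len(candidates) < 2:
        raise ValueError("need at least two available engineers")
    return candidates[0], candidates[1]

# Example roster (made-up names).
team = ["alice", "bob", "carol"]
start = date(2014, 10, 6)
primary, secondary = on_call_for(date(2014, 10, 13), team, start)
```

Skipping unavailable people at lookup time, rather than editing the roster, keeps the rotation order stable once they return.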

Preparation - Documentation

● Searchable

● Easy to edit

● Independent of your infrastructure

● Up to date

Preparation - Key Info

● Team contacts

● Key vendor contacts

● Credentials to key systems

Unexpected failures

● Communication systems

● Network connectivity

● Access to support

ALERT!

1. Load up incident response checklist

2. Log incident in JIRA

3. Log into Ops War Room

4. Public status post

5. Initial investigation
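
One way to make this checklist hard to skip under pressure is to track it in code. The sketch below takes the five steps from the slide in numeric order; the class `IncidentChecklist` and its methods are hypothetical and not part of any tool named in the talk.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Steps as listed on the slide, in numeric order.
CHECKLIST = (
    "Load up incident response checklist",
    "Log incident in JIRA",
    "Log into Ops War Room",
    "Public status post",
    "Initial investigation",
)

@dataclass
class IncidentChecklist:
    """Hypothetical tracker recording when each step was completed."""
    steps: tuple = CHECKLIST
    completed: dict = field(default_factory=dict)

    def complete(self, step, when=None):
        if step not in self.steps:
            raise ValueError(f"unknown step: {step}")
        # Timestamp each step so the postmortem has a timeline.
        self.completed[step] = when or datetime.now(timezone.utc)

    def next_step(self):
        """Return the first step not yet completed, or None when done."""
        for step in self.steps:
            if step not in self.completed:
                return step
        return None

checklist = IncidentChecklist()
checklist.complete("Load up incident response checklist")
```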

Key response principles

● Log everything

● Frequent public status updates

● Gather the team

● Escalate!
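
"Log everything" and "frequent public status updates" can be combined in a small timestamped log that flags when a public update is overdue. This is a sketch under assumptions: the class `IncidentLog` is hypothetical, and the 30-minute interval is an arbitrary example, not a figure from the talk.

```python
from datetime import datetime, timedelta, timezone

class IncidentLog:
    """Hypothetical timestamped incident log with a status-update reminder."""

    def __init__(self, update_interval=timedelta(minutes=30)):
        # 30 minutes is an assumed default, chosen only for illustration.
        self.update_interval = update_interval
        self.entries = []  # (timestamp, "public" | "internal", message)

    def note(self, message, when=None, public=False):
        when = when or datetime.now(timezone.utc)
        self.entries.append((when, "public" if public else "internal", message))

    def status_update_due(self, now=None):
        """True if an incident is logged and the last public post is stale."""
        now = now or datetime.now(timezone.utc)
        public_times = [t for t, kind, _ in self.entries if kind == "public"]
        if not public_times:
            # Something was logged but nothing has been posted publicly yet.
            return bool(self.entries)
        return now - max(public_times) >= self.update_interval

t0 = datetime(2014, 10, 1, 12, 0, tzinfo=timezone.utc)
log = IncidentLog()
log.note("Alert fired", when=t0)
log.note("Investigating elevated error rates", when=t0, public=True)
```

The same entries double as the raw timeline for the postmortem.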

Postmortem

● Within a few days

● Tell the story

● Provide technical detail

● Explain what failed and why

Postmortem

● How it’s going to be fixed

stspg.io/ZDC

Summary

● Preparation

● Communication

● Checklists

● Documentation

● Postmortem

Thank you very much

@davidmytton

david@serverdensity.com

blog.serverdensity.com

www.serverdensity.com
