scaling humans - ops teams and incident management

Scaling Humans Ops teams and incident management dotScale, Paris 2015 David Mytton, CEO, Server Density

Upload: server-density

Post on 27-Jul-2015


TRANSCRIPT

Scaling Humans: Ops teams and incident management

dotScale, Paris 2015 David Mytton, CEO, Server Density

Cost of uptime?

• $2.9bn (Q1 2015)

• $870m (Q1 2015)

• $4.1bn (Q1 2015)

How much are you spending?

Expect downtime

• Prepare

• Respond

• Postmortem

Prepare

• On call

    • Primary/secondary

    • Reachability

• Off call

• Docs

    • Searchable

    • Independent
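
The primary/secondary on-call arrangement above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Responder` type, the `page_on_call` function, the names in the rota, and the `acknowledged` callback (a stand-in for whatever paging system is actually used) are all hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Responder:
    name: str
    role: str  # "primary" or "secondary"


def page_on_call(
    rota: list[Responder],
    acknowledged: Callable[[Responder], bool],
) -> Optional[Responder]:
    """Page responders in rota order and return the first who acknowledges.

    `acknowledged` is a hypothetical callback standing in for a real paging
    system: it returns True if the responder confirmed the alert in time.
    """
    for responder in rota:
        if acknowledged(responder):
            return responder
    # Nobody was reachable: the caller should escalate further.
    return None


# Hypothetical rota: the primary is paged first, then the secondary.
rota = [Responder("alice", "primary"), Responder("bob", "secondary")]

# Simulate the primary being unreachable, so the page falls through.
unreachable = {"alice"}
owner = page_on_call(rota, lambda r: r.name not in unreachable)
print(owner.name if owner else "escalate")
```

Ordering the rota as a plain list keeps the reachability question explicit: if nobody acknowledges, the function returns `None` rather than silently assuming someone is handling the incident.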

Prepare

• Key info

    • Team contacts

    • Vendor contacts

    • Key credentials

• Unexpected situations

    • Communication

    • Internet access

    • Support access

Respond

• First responder

1. Load incident response checklist

2. Log into Ops War Room

3. Log incident in JIRA

4. Begin investigation
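
The first-responder steps above are an ordered checklist, which could be encoded roughly as follows. This is a minimal sketch, not the tooling described in the talk: the `CHECKLIST` constant and the `next_step` helper are hypothetical names, and the only content taken from the slides is the four step labels.

```python
from typing import Optional

# The four first-responder steps from the slides, in order.
CHECKLIST = [
    "Load incident response checklist",
    "Log into Ops War Room",
    "Log incident in JIRA",
    "Begin investigation",
]


def next_step(done: set[str]) -> Optional[str]:
    """Return the earliest checklist step not yet completed, or None."""
    for step in CHECKLIST:
        if step not in done:
            return step
    return None


# Walk through the checklist, marking each step as completed in turn.
done: set[str] = set()
order = []
while (step := next_step(done)) is not None:
    order.append(step)
    done.add(step)

for number, step in enumerate(order, start=1):
    print(f"{number}. {step}")
```

Returning the earliest incomplete step, rather than the step after the last completed one, means a skipped step is surfaced again instead of being lost under pressure.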

Respond

• Key response principles

• Log everything

• Frequent public updates

• Gather the team

• Escalate!
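
As an illustration of "log everything" and "frequent public updates", the sketch below keeps a timestamped incident log and reports when a public update is overdue. Everything here is an assumption made for the example: the `IncidentLog` class, the "internal"/"public" entry kinds, the sample date and messages, and especially the 30-minute interval, since the slides only say updates should be frequent.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class IncidentLog:
    # Hypothetical cadence; the slides do not specify an interval.
    update_interval: timedelta = timedelta(minutes=30)
    # Each entry is (timestamp, kind, message); kind is "internal" or "public".
    entries: list[tuple[datetime, str, str]] = field(default_factory=list)

    def record(self, when: datetime, kind: str, message: str) -> None:
        """Append a timestamped entry so the timeline can be rebuilt later."""
        self.entries.append((when, kind, message))

    def public_update_due(self, now: datetime) -> bool:
        """True if no public update exists or the last one is too old."""
        public = [t for t, kind, _ in self.entries if kind == "public"]
        if not public:
            return True
        return now - max(public) >= self.update_interval


# Hypothetical incident starting at 09:00 UTC on an arbitrary date.
start = datetime(2015, 6, 8, 9, 0, tzinfo=timezone.utc)
log = IncidentLog()
log.record(start, "internal", "Alert received, investigation started")
log.record(start + timedelta(minutes=5), "public", "Investigating elevated error rates")

print(log.public_update_due(start + timedelta(minutes=20)))  # False
print(log.public_update_due(start + timedelta(minutes=40)))  # True
```

A log kept this way also gives the postmortem a ready-made timeline of what was seen and said, and when.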

Postmortem

• Within a few days

• Tell the story

• Appropriate technical detail

• What failed, why?

• How it’s going to be fixed

Thank you very much

[email protected]

@davidmytton