SPOF - Single "Person" of Failure

Single Point of Failure… Expert Sasha Rosenbaum, @DivineOps

Posted on 07-Jan-2017

TRANSCRIPT

Page 1: SPOF - Single "Person" of Failure

Single Point of Failure… Expert

Sasha Rosenbaum, @DivineOps

Page 2: SPOF - Single "Person" of Failure

Who am I?

Sasha Rosenbaum

Azure & DevOps consultant at 10th Magnitude for 4 years

Co-organizer of

- DevOps Days Chicago Conference

- Chicago Azure meetup

@DivineOps

Page 3: SPOF - Single "Person" of Failure

What is a Single Point of Failure?

Page 4: SPOF - Single "Person" of Failure

A single point of failure (SPOF) is a part of a system that, if it fails, will stop the entire system from working

Page 5: SPOF - Single "Person" of Failure

High Availability

Achieving redundancy by removing single points of failure

Having reliable cross-over capabilities to switch between components

Detection of failures as they occur, so that cross-over can be initiated
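The three points above (redundancy, cross-over, detection) can be sketched in a few lines. This is only an illustrative model, not something from the talk: the `Component` class, the `is_healthy` probe, and the `pick_active` function are hypothetical names, and a real system would probe over the network with timeouts and retries.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Component:
    """One redundant copy of a service (hypothetical model)."""
    name: str
    is_healthy: Callable[[], bool]  # health probe, assumed cheap to call


def pick_active(components: List[Component]) -> Optional[Component]:
    """Return the first component whose health probe passes.

    Detection: each probe call checks whether the component has failed.
    Cross-over: a failed probe moves on to the next redundant component.
    Returns None when every component is down (a total outage).
    """
    for component in components:
        if component.is_healthy():
            return component
    return None


# Usage: the primary is down, so traffic crosses over to the standby.
primary = Component("primary", lambda: False)
standby = Component("standby", lambda: True)
active = pick_active([primary, standby])
print(active.name if active else "outage")  # prints "standby"
```

With only one entry in the list, a single failed probe yields `None`: that one component is the single point of failure.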

Page 6: SPOF - Single "Person" of Failure

This is complicated

Page 7: SPOF - Single "Person" of Failure

Architecting for HA

Page 8: SPOF - Single "Person" of Failure

How is the entire system down?

Page 9: SPOF - Single "Person" of Failure

We forgot a dependency!

Page 10: SPOF - Single "Person" of Failure

Oh…

Page 11: SPOF - Single "Person" of Failure

Just imagine buying a server that:

Has an uptime of roughly 16 hours a day

With interruptions

Is the only one of its kind

Cannot be replicated!
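To put the 16-hours-a-day figure in perspective, here is a small back-of-the-envelope calculation. The 99.9% comparison target is an assumption added for illustration and does not come from the talk; the function names are hypothetical.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365


def availability(uptime_hours_per_day: float) -> float:
    """Fraction of the day the 'server' is up."""
    return uptime_hours_per_day / HOURS_PER_DAY


def downtime_hours_per_year(avail: float) -> float:
    """Hours per year the 'server' is unavailable at a given availability."""
    return (1.0 - avail) * HOURS_PER_DAY * DAYS_PER_YEAR


human = availability(16)   # the "server" from the slide
three_nines = 0.999        # assumed comparison target, not from the talk

print(f"{human:.1%}")                                  # 66.7%
print(f"{downtime_hours_per_year(human):.0f}")         # 2920
print(f"{downtime_hours_per_year(three_nines):.2f}")   # 8.76
```

Roughly 2,920 hours of downtime a year, against under 9 hours for a 99.9% target, and that is before counting the interruptions.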

Page 12: SPOF - Single "Person" of Failure

Humans are NOT highly available

Page 13: SPOF - Single "Person" of Failure

How did we get here?

Lack of budget

Lack of people

Human nature

Page 14: SPOF - Single "Person" of Failure

How to recognize that you have a problem?

Page 15: SPOF - Single "Person" of Failure

1

Page 16: SPOF - Single "Person" of Failure

Keys to the Kingdom

Page 17: SPOF - Single "Person" of Failure

TO MY PRODUCTION SERVER

Page 18: SPOF - Single "Person" of Failure

Even when the systems are automated, there are still humans who manage them

Page 19: SPOF - Single "Person" of Failure

Why is there a single admin?

The situation evolved organically from having a small team

Someone took over deliberately

Page 20: SPOF - Single "Person" of Failure

Role Based Access

Grant access based on a role/group

Admin group size > 1

Service accounts
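The "admin group size > 1" rule is easy to check automatically. The sketch below assumes group membership has already been exported into a plain dictionary; the group names, user names, and the `single_member_groups` function are all hypothetical, and a real audit would read this data from your directory or cloud provider.

```python
from typing import Dict, List, Set


def single_member_groups(groups: Dict[str, Set[str]], min_size: int = 2) -> List[str]:
    """Return the names of groups with fewer than min_size members, sorted."""
    return sorted(
        name for name, members in groups.items() if len(members) < min_size
    )


# Hypothetical export of role/group membership.
groups = {
    "prod-admins": {"alice"},                # a single "person" of failure
    "db-admins": {"alice", "bob"},
    "deployers": {"svc-deploy", "carol"},    # includes a service account
}

print(single_member_groups(groups))  # ['prod-admins']
```

Running a check like this on a schedule turns the rule into something that is noticed before the one admin goes on vacation.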

Page 21: SPOF - Single "Person" of Failure

Make sure that the person on call has the necessary access to fix the problem
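One way to verify this before an incident is to compare what the on-call person is granted against what the runbook needs. This is a minimal sketch under assumed data shapes: the role names, permission strings, and the `missing_access` function are hypothetical and not tied to any particular access-control product.

```python
from typing import Dict, Set


def missing_access(
    on_call: str,
    user_roles: Dict[str, Set[str]],
    role_permissions: Dict[str, Set[str]],
    required: Set[str],
) -> Set[str]:
    """Return the required permissions the on-call person does not hold."""
    granted: Set[str] = set()
    for role in user_roles.get(on_call, set()):
        granted |= role_permissions.get(role, set())
    return required - granted


# Hypothetical roles and permissions.
role_permissions = {
    "prod-admins": {"restart-service", "read-logs"},
    "developers": {"read-logs"},
}
user_roles = {"alice": {"prod-admins"}, "bob": {"developers"}}
required = {"restart-service", "read-logs"}

print(missing_access("bob", user_roles, role_permissions, required))    # {'restart-service'}
print(missing_access("alice", user_roles, role_permissions, required))  # set()
```

An empty result means the person on call can actually carry out the fix; anything else is a gap to close before the pager goes off.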

Page 22: SPOF - Single "Person" of Failure

TRUST YOUR PEOPLE!!!

Page 23: SPOF - Single "Person" of Failure

2

Page 24: SPOF - Single "Person" of Failure

Beware of the Expert!

Page 25: SPOF - Single "Person" of Failure

“This will take 15 minutes to fix and 8 hours to explain”

Page 26: SPOF - Single "Person" of Failure

We cannot afford the loss of productivity!

Page 27: SPOF - Single "Person" of Failure

Can you afford to lose this knowledge?

Page 28: SPOF - Single "Person" of Failure

Delegate to Juniors

Page 29: SPOF - Single "Person" of Failure

Juniors are wonderful people

They ask tough questions

Page 30: SPOF - Single "Person" of Failure

Your new hires haven’t yet caught the “This is how it’s always been” virus

Page 31: SPOF - Single "Person" of Failure

You are emotionally invested in your code

It is hard not to get protective of it

Page 32: SPOF - Single "Person" of Failure

Documentation

Documents

Readme

Comments

Tests

Automation

Features

Page 33: SPOF - Single "Person" of Failure

3

Page 34: SPOF - Single "Person" of Failure

“I cannot afford to take vacation!”

Page 35: SPOF - Single "Person" of Failure

Job security?

Page 36: SPOF - Single "Person" of Failure

Productivity?

Page 37: SPOF - Single "Person" of Failure

Hours / Productivity

Page 38: SPOF - Single "Person" of Failure

Research shows that working longer hours DOES NOT increase productivity

Page 39: SPOF - Single "Person" of Failure

You need rest to be at your best!

Page 40: SPOF - Single "Person" of Failure

Cell phones are the single worst thing that has happened to people AND businesses in the last century

Page 41: SPOF - Single "Person" of Failure

If people were actually unreachable, we would find a more reliable way to solve problems

Page 42: SPOF - Single "Person" of Failure

Mandatory Vacation

Page 43: SPOF - Single "Person" of Failure

Game Days

Page 44: SPOF - Single "Person" of Failure

Say NO to having a Single PERSON of Failure ;-)

Page 45: SPOF - Single "Person" of Failure

Great job, DoD Silicon Valley!
