SPOF - Single "Person" of Failure
TRANSCRIPT
Single Point of Failure… Expert
Sasha Rosenbaum, @DivineOps
Who am I?
Sasha Rosenbaum
Azure & DevOps consultant
at 10th Magnitude for 4 years
Co-organizer of
- DevOps Days Chicago Conference
- Chicago Azure meetup
@DivineOps
What is a Single Point of Failure?
A single point of failure (SPOF) is a part of a system that, if it fails, will stop the
entire system from working
High Availability
Achieving redundancy by removing single points of failure
Having reliable cross-over capabilities to switch between components
Detection of failures as they occur, so that cross-over can be initiated
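The three points above (redundancy, cross-over, detection) can be sketched in a few lines. This is a minimal illustration, not a production failover mechanism: the replica names and the `is_healthy` callback are hypothetical stand-ins for whatever health check a real system would use.

```python
from typing import Callable, Optional, Sequence


def first_healthy(
    replicas: Sequence[str], is_healthy: Callable[[str], bool]
) -> Optional[str]:
    """Return the first replica that passes the health check, or None."""
    for name in replicas:
        if is_healthy(name):  # detection of failure
            return name       # cross-over target
    return None               # every redundant component is down


# Hypothetical replica names; "primary" is simulated as failed.
down = {"primary"}
active = first_healthy(
    ["primary", "standby-1", "standby-2"], lambda r: r not in down
)
print(active)
```

With a single entry in the replica list there is nothing to cross over to, which is exactly the single-point-of-failure case.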
This is complicated
Architecting for HA
How is the entire system down?
We forgot a dependency!
Oh…
Just imagine buying a server that:
Has an uptime of roughly 16 hours a day
With interruptions
Is the single one of its kind
Cannot be replicated!
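To put the 16-hours-a-day figure in availability terms, here is a rough back-of-the-envelope calculation. The 99.9% ("three nines") target is an assumed comparison point that does not come from the slides, and the function names are illustrative.

```python
HOURS_PER_DAY = 24
HOURS_PER_YEAR = 365 * HOURS_PER_DAY  # 8760, ignoring leap years


def availability(up_hours_per_day: float) -> float:
    """Fraction of the day the 'server' is up."""
    return up_hours_per_day / HOURS_PER_DAY


def yearly_downtime_hours(avail: float) -> float:
    """Hours per year spent down at a given availability."""
    return (1 - avail) * HOURS_PER_YEAR


human = availability(16)
print(f"{human:.1%}")                           # 66.7%
print(round(yearly_downtime_hours(human)))      # 2920
print(round(yearly_downtime_hours(0.999), 2))   # 8.76 (assumed 99.9% target)
```

Even before counting the interruptions, a person is down for roughly 2,920 hours a year, against under 9 hours for a component held to three nines.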
Humans are NOT highly available
How did we get here?
Lack of budget
Lack of people
Human nature
How to recognize that you have a problem?
1
Keys to the Kingdom
TO MY PRODUCTION SERVER
Even when the systems are automated there are still humans who manage them
Why is there a single admin?
The situation evolved organically from having a small team
Someone took over deliberately
Role Based Access
Grant access based on a role/group
Admin group size > 1
Service accounts
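One way to act on the "admin group size > 1" rule is a periodic audit that flags any group with fewer than two human members. The sketch below assumes group membership is available as a plain dictionary. The group names, user names, and the `single_person_groups` helper are all hypothetical and not tied to any particular directory or cloud API.

```python
from typing import Dict, List, Set


def single_person_groups(
    groups: Dict[str, Set[str]], service_accounts: Set[str]
) -> List[str]:
    """Return names of groups with fewer than two human members."""
    risky = []
    for name, members in groups.items():
        humans = members - service_accounts  # service accounts are not people
        if len(humans) < 2:
            risky.append(name)
    return sorted(risky)


# Hypothetical membership data for illustration only.
groups = {
    "prod-admins": {"alice", "svc-deploy"},
    "db-admins": {"alice", "bob"},
}
print(single_person_groups(groups, service_accounts={"svc-deploy"}))
# ['prod-admins']
```

Service accounts are excluded from the count on purpose: they keep automation running, but they cannot respond to an incident when the only human admin is away.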
Make sure that the person on call has the necessary access to fix the problem
TRUST YOUR PEOPLE!!!
2
Beware of the Expert!
“This will take 15 minutes to fix
And 8 hours to explain”
We cannot afford the loss of productivity!
Can you afford losing this knowledge?
Delegate to Juniors
Juniors are wonderful people
They ask tough questions
Your new hires haven’t yet caught the
“This is how it’s always been” virus
You are emotionally invested in your code
It is hard not to get protective of it
Documentation
Documents
Readme
Comments
Tests
Automation
Features
3
“I cannot afford to take vacation!”
Job security?
Productivity?
Hours / Productivity
Research shows that working longer hours
DOES NOT increase productivity
You need rest to be at your best!
Cell phones are the single worst thing that happened to people AND businesses in the last century
If people were actually unreachable, we would find a more reliable way to solve problems
Mandatory Vacation
Game Days
Say NO to having a
Single PERSON of Failure ;-)
Great job, DoD Silicon Valley!