fault-tolerance on the cheap: making systems that (probably) won't fall over

92
Fault-Tolerance On The Cheap Making systems that probably won’t fall over.

Upload: brian-troutwine

Post on 28-Jul-2015

961 views

Category:

Software


1 download

TRANSCRIPT

Page 1: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Fault-Tolerance On The CheapMaking systems that probably won’t fall over.

Page 2: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Hi, folks!

Page 3: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

I do things to/with computers.

Page 4: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

I’m a real-time, networked systems engineer.

Page 5: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Real-Time Systems

Page 6: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Real-Time Systems• Computation on a deadline

Page 7: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Real-Time Systems• Computation on a deadline • Fail-safe / Fail-operational

Page 8: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Real-Time Systems• Computation on a deadline • Fail-safe / Fail-operational • Guaranteed response / Best effort

Page 9: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Real-Time Systems• Computation on a deadline • Fail-safe / Fail-operational • Guaranteed response / Best effort

• Resource adequate / inadequate

Page 10: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Networked Systems

Page 11: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Networked Systems• Out-of-order messages

Page 12: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Networked Systems• Out-of-order messages • No legitimate concept of “now”

Page 13: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Networked Systems• Out-of-order messages • No legitimate concept of “now” • High-latency transmission

Page 14: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Networked Systems• Out-of-order messages • No legitimate concept of “now” • High-latency transmission • Lossy transmission

Page 15: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Punk Rock Version

Page 16: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Here’s a socket.

Page 17: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Here’s an interrupt.

Page 18: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Go program a computer!

Page 19: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

AdRoll

Page 20: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

real-time bidding

Page 21: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Erlang

Page 22: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Fault Tolerance

Page 23: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Sub-components fail. The system does not.

Page 24: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Well, not right away…

Page 25: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

What’s it take?

Page 26: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Option 1: Perfection

Page 27: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

• Total control over the whole mechanism.

Option 1: Perfection

Page 28: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

• Total control over the whole mechanism.

• Total understanding of the problem domain.

Option 1: Perfection

Page 29: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

• Total control over the whole mechanism.

• Total understanding of the problem domain.

• Specific, explicit system goals.

Option 1: Perfection

Page 30: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

• Total control over the whole mechanism.

• Total understanding of the problem domain.

• Specific, explicit system goals. • Well-known service lifetime.

Option 1: Perfection

Page 31: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Option 1: Perfection

“They Write the Right Stuff” Fast Company, 2005

Page 32: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

• Extremely expensive.

Option 1: Perfection

Page 33: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

• Extremely expensive. • Intentionally stifles creativity.

Option 1: Perfection

Page 34: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

• Extremely expensive. • Intentionally stifles creativity. • Design up front.

Option 1: Perfection

Page 35: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

• Extremely expensive. • Intentionally stifles creativity. • Design up front. • Complete control of the system

is not complete.

Option 1: Perfection

Page 36: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Option 2: Hope for the Best

Page 37: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

• Little up-front knowledge of the problem domain.

Option 2: Hope for the Best

Page 38: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

• Little up-front knowledge of the problem domain.

• Implicit or short-term system goals.

Option 2: Hope for the Best

Page 39: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

• Little up-front knowledge of the problem domain.

• Implicit or short-term system goals.

• No money down.

Option 2: Hope for the Best

Page 40: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

• Little up-front knowledge of the problem domain.

• Implicit or short-term system goals.

• No money down. • Ingenuity under pressure.

Option 2: Hope for the Best

Page 41: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Option 2: Hope for the Best

“Move fast and break things!”

Page 42: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

• Ignorance of problem domain leads to long-term system issues.

Option 2: Hope for the Best

Page 43: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

• Ignorance of problem domain leads to long-term system issues.

• Failures do propagate out toward users.

Option 2: Hope for the Best

Page 44: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

• Ignorance of problem domain leads to long-term system issues.

• Failures do propagate out toward users.

• No, money down!

Option 2: Hope for the Best

Page 45: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

• Ignorance of problem domain leads to long-term system issues.

• Failures do propagate out toward users.

• No, money down! • Hard to change cultural values.

Option 2: Hope for the Best

Page 46: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Option 3: Embrace Faults

Page 47: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Option 3: Embrace Faults• Partial control over the whole

mechanism.

Page 48: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Option 3: Embrace Faults• Partial control over the whole

mechanism. • Partial understanding of the

problem domain.

Page 49: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Option 3: Embrace Faults• Partial control over the whole

mechanism. • Partial understanding of the

problem domain. • Sorta explicit system goals.

Page 50: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Option 3: Embrace Faults• Partial control over the whole

mechanism. • Partial understanding of the

problem domain. • Sorta explicit system goals. • Able to spot a failure when you

see one.

Page 51: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Option 3: Embrace Faults

“Fail fast. Either do the right thing

or stop.” “Why Do Computers Stop and What Can Be Done About it?”, Jim Gray, 1985 (paraphrase)

Page 52: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Option 3: Embrace Faults• Faults are isolated but must be

resolved in production.

Page 53: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Option 3: Embrace Faults• Faults are isolated but must be

resolved in production. • Must carefully design for

introspection.

Page 54: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Option 3: Embrace Faults• Faults are isolated but must be

resolved in production. • Must carefully design for

introspection. • Moderate design up-front.

Page 55: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Option 3: Embrace Faults• Faults are isolated but must be

resolved in production. • Must carefully design for

introspection. • Moderate design up-front. • Pay a little now, pay a little later.

Page 56: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Let’s talk embracing faults.

Page 57: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

There are four conceptual stages to consider.

Page 58: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Component

Page 59: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

The most atomic level of the system.

Page 60: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Progress here has an outsized impact.

Page 61: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Immutable Data

Structures

Page 62: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Isolate Side-Effects

Page 63: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Compile-Time Guarantees

Page 64: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Why test, when you can prove?

Page 65: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

This is Functional

Programming

Page 66: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Machine

Page 67: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Faults in components are exercised here.

Page 68: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Faults in interactions are exercised here.

Page 69: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Supervise, and restart.

Page 70: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Use only addressable

names.

Page 71: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Distinguish your critical components.

Page 72: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Cluster

Page 73: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Redundant Components

Page 74: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

No Single Points of Failure

Page 75: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Mean Time to Failure Estimates

Page 76: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Instrument and Monitor

Page 77: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Organization

Page 78: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

A finely built machine without a supporting

organization is a disaster waiting to happen.

Page 79: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

A finely built machine without

organization is a disaster waiting to happen.

Chernobyl STS-51-L Deepwater Horizon Magnitogorsk Damascus Incident

Chevron Refinery BART ATC Asiana #214 Therac-25 New Orleans Levee

Page 80: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Correct the conditions that allowed mistakes, as well as the mistake.

Page 81: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Process is Priceless

Page 82: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Build flexible tools for experts.

Page 83: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Separate Your

Concerns

Page 84: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Build with Failure in mind.

Page 85: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Have resources you’re willing to sacrifice.

Page 86: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Study accidents.

Page 87: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Every system carries the potential for its own destruction.

Page 88: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Some things aren’t worth building.

Page 89: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Understand Networks.

Page 90: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

0. The network is unreliable. 1. Latency is non-zero. 2. Bandwidth is finite. 3. The network is insecure. 4. Topology changes. 5. There are many administrators. 6. Transport cost is non-zero. 7. The network is heterogenous.

Page 91: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Thanks so

much!@bltroutwine

Page 92: Fault-tolerance on the Cheap: Making Systems That (Probably) Won't Fall Over

Recommended Reading“Normal Accidents: Living with High-Risk Technologies”, Charles Perrow

“Digital Apollo: Human and Machine in Spaceflight”, David A. Mindel

“Command and Control: Nuclear Weapons, the Damascus Accident, and the Illusion of Safety”, Eric Schlosser

“Erlang Programming”, Simon Thompson and Francesco Cesarini

“Steeltown, USSR”, Stephen Kotkin

“Crash-Only Software”, George Candea and Armando Fox

“The Truth About Chernobyl”, Grigorii Medvedev

“Real-Time Systems: Design Principles for Distributed Embedded Applications”, Hermann Kopetz

“ T h e A p o l l o G u i d a n c e C o m p u t e r : Architecture and Operation”, Frank O’Brien

“Why Do Computers Stop and What Can Be Done About It?”, Jim Gray

“Thirteen: The Apollo Flight That Failed”, Henry S.F. Cooper Jr.