why do computers stop and what can be done about...

Why do computers stop and what can be done about it?

Jim Gray

Symposium on Reliability in Distributed Software and Database Systems (1986)

Presented by Yeh Tsung-Yu

Outline

Introduction
An analysis of failures
Implications of the analysis
Fault-tolerant execution
Process pair
Conclusion

Introduction

Reliability and availability are different:

– Availability is doing the right thing within the specified response time. Reliability is not doing the wrong thing.

– Reliability is proportional to the Mean Time Between Failures (MTBF).

– Availability can be expressed as the probability that the system will be available: MTBF / (MTBF + MTTR), where MTTR is the Mean Time To Repair.
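As a rough numeric illustration of availability as the ratio MTBF / (MTBF + MTTR), the sketch below computes it for made-up figures. The function name and the sample numbers are assumptions chosen for illustration, not measurements from the talk.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative numbers only: one failure per 1000 hours, one hour to repair.
a = availability(mtbf_hours=1000.0, mttr_hours=1.0)
downtime_minutes_per_year = (1.0 - a) * 365 * 24 * 60
print(f"availability = {a:.5f}")
print(f"expected downtime = {downtime_minutes_per_year:.0f} minutes/year")
```

The ratio makes the two levers visible: availability improves either by failing less often (larger MTBF) or by repairing faster (smaller MTTR).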

The key to providing high availability is to modularize the system so that modules are the unit of failure and replacement.

Von Neumann was the first to analytically study the use of redundancy.

– The key difference is that von Neumann’s model lacked modularity: a failure in any bundle of wires, anywhere, implied a total system failure.

In contrast, modern computer systems are constructed in a modular fashion: a failure within a module affects only that module.

Fault-tolerant hardware can be constructed as follows:

– Hierarchically decompose the system into modules.

– Make each module fail-fast.

– Detect module faults promptly: make the module signal failure, or make it periodically send an I AM ALIVE message or reset a watchdog timer.

– Configure extra modules which can pick up the load of failed modules.

Takeover time, including the detection of the module failure, should be seconds.
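The I AM ALIVE / watchdog idea can be sketched in a few lines. This is a minimal illustration under assumptions: the `Watchdog` class, its method names, and the timeout value are hypothetical, and a real system would run detection in the hardware or kernel rather than in application code.

```python
import time


class Watchdog:
    """Declares a module failed if no I AM ALIVE arrives within timeout_s."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable so the sketch can be tested
        self.last_seen = {}         # module name -> time of last heartbeat

    def i_am_alive(self, module):
        """Called periodically by each healthy module."""
        self.last_seen[module] = self.clock()

    def failed_modules(self):
        """Modules whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(m for m, t in self.last_seen.items()
                      if now - t > self.timeout_s)


# Usage with a fake clock, so no real waiting is needed.
fake_now = [0.0]
wd = Watchdog(timeout_s=2.0, clock=lambda: fake_now[0])
wd.i_am_alive("disk")
wd.i_am_alive("cpu")
fake_now[0] = 1.5
wd.i_am_alive("cpu")        # cpu keeps reporting, disk goes silent
fake_now[0] = 3.0
print(wd.failed_modules())  # the silent module is the takeover candidate
```

The timeout bounds detection time, which is part of the takeover time budget mentioned above.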

An Analysis of Failures of a Fault-Tolerant System

The analysis data is from:

– the causes of system failures reported to Tandem over a seven-month period; the sample set covered more than 2000 systems.

During the measured period, 166 failures were reported, including one fire and one flood.

If we subtract out “infant” failures, then the remaining failures, 107 in all, make an interesting analysis.

– “Infant” failures are related to a new software or hardware product that still has bugs.

Failures due to maintenance: e.g., sometimes it was clear that the maintenance person typed the wrong command or unplugged the wrong module.

Implications of the Analysis

Less than one in a thousand resulted in an interruption of service. Hardware fault-tolerance works!

The top priority for improving system availability is to reduce administrative mistakes.

– Make self-configured systems with minimal maintenance and operator interaction.

A contradiction about maintenance:

– Software and hardware fixes should be installed as soon as possible.

– But a new patch may itself suffer “infant mortality”!

Implications of the Analysis

Software fixes outnumber hardware fixes; as a result, software and hardware maintenance strategies must be separated!

– In the long term, hardware should be updated as soon as possible.

– A software fix should be installed only if the bug is causing outages.

– If the bug is not causing outages, we can depend on software fault tolerance.

Fault-tolerant Execution

The keys to this software fault-tolerance are:

– Software modularity through processes and messages.

– Fault containment through fail-fast software modules.

– Process-pairs to tolerate hardware and transient software faults.

– (Optional) Transaction mechanism combined with process-pairs to ease exception handling and tolerate software faults.

Fault containment through fail-fast software modules

– A process module should be fail-fast: it should either function correctly or detect the fault, signal failure, and stop operating.

– Processes are made fail-fast by defensive programming. They check all their inputs, intermediate results, outputs and data structures. If any error is detected, they signal a failure and stop.
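A small sketch of defensive, fail-fast code follows. The `FailFastError` exception, the `check` helper, and the `withdraw` function are hypothetical names invented for this illustration; the point is only that inputs and results are checked and that the first violation stops the operation instead of letting a bad value propagate.

```python
class FailFastError(Exception):
    """Signals that the module detected a fault and has stopped."""


def check(condition, message):
    # Signal failure and stop at the first violated expectation.
    if not condition:
        raise FailFastError(message)


def withdraw(balance_cents: int, amount_cents: int) -> int:
    # Check the inputs.
    check(isinstance(balance_cents, int) and balance_cents >= 0,
          "input: bad balance")
    check(isinstance(amount_cents, int) and amount_cents > 0,
          "input: bad amount")
    check(amount_cents <= balance_cents, "input: insufficient funds")

    new_balance = balance_cents - amount_cents

    # Check the output against an invariant before returning it.
    check(0 <= new_balance < balance_cents,
          "output: balance invariant violated")
    return new_balance
```

A caller, or a backup process, sees either a correct result or an explicit failure signal, never a silently corrupted balance.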

Software faults: the Bohrbug / Heisenbug hypothesis

– Most hardware faults are transient. Solutions: memory error correction, checksums plus retransmission.

– By conjecture, most software faults are also transient.

– Transient software faults (Heisenbugs) are typically related to:

  – strange hardware conditions (transient device fault).

  – limit conditions (out of storage, counter overflow, etc.).

  – race conditions (forgetting to request a semaphore).

– Bohrbugs, like the Bohr atom, are solid and easily detected by standard techniques.

Experiment for Bohrbug / Heisenbug

– Method: when a process detects a fault, it stops and lets its brother continue the operation. The brother does a software retry.

– If the brother also fails, then the bug is a Bohrbug rather than a Heisenbug.

– In the measured period, one of 132 software faults was a Bohrbug; the remainder were Heisenbugs.
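The retry experiment can be mimicked with a toy classifier: run an operation, and if it faults, let a "brother" retry it once. The function and the two sample operations below are hypothetical stand-ins, not the Tandem mechanism; in this sketch the retry happens in the same process rather than on a separate backup.

```python
def classify_fault(operation, retries=1):
    """Return 'no fault', 'Heisenbug' (retry succeeds) or 'Bohrbug'."""
    try:
        operation()
        return "no fault"
    except Exception:
        pass                      # primary detected a fault and stopped
    for _ in range(retries):
        try:
            operation()           # the brother's software retry
            return "Heisenbug"    # fault did not recur: transient
        except Exception:
            continue
    return "Bohrbug"              # fault recurs on every retry: solid


def make_flaky():
    """Operation that fails only on its first call (a transient fault)."""
    calls = {"n": 0}

    def op():
        calls["n"] += 1
        if calls["n"] == 1:
            raise RuntimeError("transient race condition")

    return op


def always_fails():
    raise ZeroDivisionError("deterministic bug")


print(classify_fault(make_flaky()))
print(classify_fault(always_fails))
```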

Process-pairs

Process-pairs for fault-tolerant execution

– Purpose: make process modules redundant, just like hardware.

First kind of process-pair: Lockstep

– Primary and backup processes synchronously execute the same instruction stream on independent processors.

– If one of the processors fails, the other simply continues the computation.

– Gives good tolerance of hardware failures but no tolerance of Heisenbugs.

Second kind: State Checkpointing

– The primary process does the computation and sends state changes and reply messages to its backup prior to each major event.

– Gives excellent fault-tolerance, but programming checkpoints is difficult.

– The trend is towards the “Delta” or “Persistent” approaches described below.
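A minimal sketch of state checkpointing, assuming a toy key-value process: before replying (the "major event" here), the primary copies its state and the pending reply to the backup. The class and method names are hypothetical, and a direct method call stands in for the message that would travel between processors.

```python
import copy


class Backup:
    def __init__(self):
        self.state = {}
        self.last_reply = None

    def checkpoint(self, state, reply):
        # Keep an independent copy so later primary changes cannot leak in.
        self.state = copy.deepcopy(state)
        self.last_reply = reply


class Primary:
    def __init__(self, backup):
        self.state = {}
        self.backup = backup

    def handle(self, key, value):
        self.state[key] = value
        reply = f"stored {key}"
        # Checkpoint state change and reply before the reply is sent.
        self.backup.checkpoint(self.state, reply)
        return reply


backup = Backup()
primary = Primary(backup)
primary.handle("x", 1)
print(backup.state, backup.last_reply)
```

Even in this toy, the programmer has to decide what to checkpoint and when, which hints at why hand-written checkpoints are hard to get right.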

Third kind: Automatic Checkpointing

– The kernel automatically manages the checkpointing, relieving the programmer of the effort. All messages to and from the process are saved by the message kernel for the backup.

– At takeover, these messages are replayed to the backup to roll it forward to the primary process’ state.

– Higher execution cost than state checkpointing.
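The replay idea can be illustrated with a toy message log. Everything named here (`MessageKernel`, `Counter`, `deliver`, `take_over`) is hypothetical, and the sketch assumes the process is deterministic, so that replaying the same messages reproduces the same state.

```python
class Counter:
    """Toy deterministic process: its state is a running total."""

    def __init__(self):
        self.total = 0

    def on_message(self, amount):
        self.total += amount
        return self.total


class MessageKernel:
    """Logs every message delivered to the primary, for later replay."""

    def __init__(self):
        self.log = []

    def deliver(self, process, message):
        self.log.append(message)      # saved on behalf of the backup
        return process.on_message(message)

    def take_over(self, backup):
        # Roll the backup forward by replaying the saved messages in order.
        for message in self.log:
            backup.on_message(message)
        return backup


kernel = MessageKernel()
primary = Counter()
kernel.deliver(primary, 5)
kernel.deliver(primary, 7)
backup = kernel.take_over(Counter())
print(primary.total, backup.total)
```

The application code contains no checkpoint calls at all; the price is that every message is logged and takeover must replay the whole log.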

Fourth kind: Delta Checkpointing

– This is an evolution of state checkpointing. Logical rather than physical updates are sent to the backup.

– Deltas have the virtue of performance.

– A bug in the primary process is less likely to corrupt the backup’s state.
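To contrast with copying whole state, the sketch below sends only a logical update (a delta) to the backup, which applies it itself. The delta format `("add", key, amount)` and the names are assumptions made for illustration; a plain dictionary stands in for the backup process.

```python
def apply_delta(state, delta):
    """Apply one logical update of the form ('add', key, amount)."""
    op, key, amount = delta
    if op != "add":
        raise ValueError(f"unknown delta operation: {op}")
    state[key] = state.get(key, 0) + amount


class DeltaPrimary:
    def __init__(self, backup_state):
        self.state = {}
        self.backup_state = backup_state

    def add(self, key, amount):
        delta = ("add", key, amount)
        apply_delta(self.state, delta)
        # Stands in for sending the small delta message to the backup,
        # which applies it to its own, separately held state.
        apply_delta(self.backup_state, delta)


backup_state = {}
primary = DeltaPrimary(backup_state)
primary.add("a", 2)
primary.add("a", 3)
print(backup_state)
```

Because the backup rebuilds its state from logical operations, stray damage to the primary's memory is not copied across byte for byte.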

Fifth kind: Persistence

– If the primary fails, the backup wakes up in the null state with amnesia about what was happening at the time of the primary failure.

– If the primary process fails, the database or devices it manages are left in a mess.

– As a result, we need a simple way to resynchronize these processes to a common state: transactions!

The programmer’s interface to transactions is quite simple:

– The programmer starts a transaction by asserting the BeginTransaction verb.

– The programmer ends it by asserting the EndTransaction or AbortTransaction verb.
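The three verbs can be sketched on a toy in-memory store. The class below and its snake_case method names are hypothetical renderings of BeginTransaction, EndTransaction and AbortTransaction; it shows only the all-or-nothing behavior, not durability or concurrency control.

```python
class TransactionalStore:
    def __init__(self):
        self.committed = {}     # state visible after successful transactions
        self._working = None    # private copy used while a transaction is open

    def begin_transaction(self):
        if self._working is not None:
            raise RuntimeError("transaction already open")
        self._working = dict(self.committed)

    def write(self, key, value):
        if self._working is None:
            raise RuntimeError("no open transaction")
        self._working[key] = value

    def end_transaction(self):
        if self._working is None:
            raise RuntimeError("no open transaction")
        self.committed = self._working   # all changes become visible at once
        self._working = None

    def abort_transaction(self):
        self._working = None             # discard every uncommitted change


store = TransactionalStore()
store.begin_transaction()
store.write("a", 1)
store.end_transaction()

store.begin_transaction()
store.write("a", 2)
store.abort_transaction()    # e.g. the primary failed mid-transaction
print(store.committed)
```

After an abort the store is back in its last committed state, which is the clean starting point a persistent backup needs when it wakes up with no memory of the failed work.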

Conclusion

Dealing with system configuration, operations, and maintenance remains an unsolved problem.

The only hope is to simplify and reduce human intervention in these aspects of the system.
