chapter 11 fault tolerance. topics introduction process resilience reliable group communication...

Chapter 11

Fault Tolerance

Topics

- Introduction
- Process Resilience
- Reliable Group Communication
- Recovery

Basic Concepts

- Failure: state of a system in which the system fails to meet its contract
- Error: part of the system state that leads to a failure (e.g., a value differing from its intended value)
- Fault: cause of an error, e.g., resulting from
  - design errors
  - manufacturing faults
  - deterioration
  - external disturbance

Fault → Error → Failure

Remark:

The presence of a fault does not guarantee that an error will occur, e.g., a memory cell stuck at 0 causes no error as long as only 0 is stored in it.

Characteristics of Faults: Duration

- Permanent fault
  - Once a component fails, it never works correctly again
  - Easiest to diagnose
- Transient fault
  - Occurs only once
  - 10 times as likely as permanent faults
- Intermittent fault
  - Recurring
  - May appear to be transient (if the period is long)
  - Hard and expensive to detect

Fault Tolerance Includes…

- Fault tolerance: the system can avoid a failure despite the occurrence of faults
- Availability: probability that the system works correctly at a given instant of time
- Reliability: expected time between failures
- Safety: absence of catastrophic consequences of a fault
- Maintainability: ease of recovering from a failure (incl. automatic recognition of faults)
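To make the relation between these measures concrete, here is a minimal sketch. It assumes the common steady-state approximation availability = MTTF / (MTTF + MTTR), which is not stated on the slide; the function name and the parameters `mttf_hours` and `mttr_hours` are illustrative.

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability under the assumed MTTF/(MTTF+MTTR) model.

    mttf_hours: mean time to failure (how long the system runs correctly)
    mttr_hours: mean time to repair (how long recovery takes)
    """
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("times must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


# A system that runs 999 hours between failures and needs 1 hour to repair
# is available 99.9% of the time under this model.
print(availability(999, 1))
```

Under this model a system can be highly available yet unreliable (many short outages), which is why the slide lists the two properties separately.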

Failure Models

Type of failure               Description
Crash failure                 A server halts, but is working correctly until it halts
Omission failure              A server fails to respond to incoming requests
  Receive omission            A server fails to receive incoming messages
  Send omission               A server fails to send messages
Timing failure                A server's response lies outside the specified time interval
Response failure              The server's response is incorrect
  Value failure               The value of the response is wrong
  State transition failure    The server deviates from the correct flow of control
Arbitrary failure             A server may produce arbitrary responses at arbitrary times
  (Byzantine failure)

How to Overcome Failures?

- Design servers that are able to announce that they might fail in the near future?
- Design a DS that is able to detect that a server is down and/or no longer works correctly?
- Design a DS that is able to mask faults via redundancy?

Failure Masking

Hide the occurrence of faults using redundancy:

- Information (e.g., additional bits, i.e., error-correcting codes such as the Hamming code)
- Time (e.g., retry an operation; an aborted transaction may be repeated without any side effects)
- Physical
  - Hardware (replicated equipment)
  - Software (replicated server processes/threads)
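As an illustration of information redundancy, the following sketch implements a Hamming(7,4) code: three parity bits are added to four data bits so that any single flipped bit can be located and corrected. The function names and the bit layout (parity bits at positions 1, 2, and 4) are choices made for this example, not taken from the slides.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_pos = s1 + 2 * s2 + 4 * s4  # 0 means no error was detected
    if error_pos:
        c[error_pos - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]]


word = [1, 0, 1, 1]
sent = hamming74_encode(word)
received = list(sent)
received[4] ^= 1  # simulate a transient single-bit fault on the channel
assert hamming74_decode(received) == word
```

The fault (a flipped bit) is masked: the receiver never observes an error in the decoded data. Two or more flipped bits exceed what this code can correct.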

Hardware Redundancy

- Passive (static)
  - Uses fault masking to hide the occurrence of faults
  - No action from the system is required
  - E.g., voting (see next slide)
- Active (dynamic)
  - Uses comparison for detection and/or diagnosis
  - Removes faulty hardware from the system (reconfiguration)
- Hybrid
  - Combines both approaches
  - Masking until diagnosis is complete
  - Expensive, but better suited to achieving higher reliability

Failure Masking by Redundancy

Triple modular redundancy.
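A minimal software sketch of the voting idea behind triple modular redundancy: three replicas compute the same result and a voter forwards the majority value, so one faulty replica is masked. The names `tmr_vote` and `run_tmr` are hypothetical, and results are assumed to be hashable values.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the value produced by at least two of the three modules."""
    value, votes = Counter([a, b, c]).most_common(1)[0]
    if votes < 2:
        # All three outputs differ: more than one module must be faulty.
        raise ValueError("no majority among the three outputs")
    return value


def run_tmr(modules, x):
    """Feed the same input to three replicated modules and vote on the outputs."""
    return tmr_vote(*(module(x) for module in modules))


def good(x):
    return x + 1


def faulty(x):
    return 0  # a replica that always produces a wrong value


print(run_tmr([good, good, faulty], 4))  # the faulty replica is outvoted
```

The voter itself remains a single point of failure in this sketch; hardware designs typically replicate the voters as well.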

Stand-by Sparing

- Only one module is driving the outputs
- The other modules are
  - idle: hot spares
  - shut down: cold spares
- In case of error detection, switch to a new module
- Hot spares
  - No power-up delays
  - But possibly significant power consumption
- Cold spares
  - The reverse of hot spares: power-up delay, but no power consumption while shut down
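The switching logic of stand-by sparing can be sketched as follows. This is only an illustration: the class name `StandbySparing` and the `detect_error` callback (standing in for whatever error detection the system has, e.g., a checksum or range check) are assumptions, and power-up delays of cold spares are not modelled.

```python
class StandbySparing:
    """One active module drives the output; spares take over on detected errors."""

    def __init__(self, modules, detect_error):
        self.modules = list(modules)      # modules[0] is active, the rest are spares
        self.detect_error = detect_error  # hypothetical check: (input, output) -> bool
        self.active = 0

    def step(self, x):
        while self.active < len(self.modules):
            out = self.modules[self.active](x)
            if not self.detect_error(x, out):
                return out
            # Error detected: retire the active module and switch to the next spare.
            self.active += 1
        raise RuntimeError("no spare module left")


def broken(x):
    return -1


def working(x):
    return 2 * x


system = StandbySparing([broken, working], detect_error=lambda x, out: out < 0)
print(system.step(3))   # the spare's output after the switch
print(system.active)    # index of the module now driving the output
```

Unlike voting, this is active redundancy: the fault is detected first and the system is then reconfigured.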

Failure Masking by Software Redundancy

How to improve reliability?

What can we do to mask thread/process faults?

Process Resilience

- Protection against process failures
- A group of identical processes provides redundancy
  - Software: multiple processes on the same machine
  - Hardware: processes on different machines
- Multicast communication ensures all members receive all messages (often atomic and ordered)
- Processes can join and leave groups dynamically, e.g., to replace failed processes
- Membership protocol ensures agreement on group membership at any given time

Flat Groups versus Hierarchical Groups

a) Communication in a flat group.

b) Communication in a simple hierarchical group.

Group Management

1. Use a single group server with a single database (a typical single point of failure)
2. Use a single database but several group servers (standby solution)
3. Manage groups in a distributed way, i.e., every outsider wanting to enter a group sends a corresponding enter_group message via reliable multicast to every current group member, but:
   - When does a new group member get all the group-internal messages?
   - When leaving the group, what about messages already sent but not yet received?

Agreement in Faulty Systems (1)

Ensure that all non-faulty processes reach consensus in a finite number of steps.

1. Reliable processes, faulty communication (omission faults): two-army problem
2. Reliable communication, faulty processes (Byzantine faults)

Agreement in Faulty Systems (2)

The Byzantine generals problem for 3 loyal generals and 1 traitor.

a) The generals announce their troop strengths (in units of 1 kilosoldier).

b) The vectors that each general assembles based on (a).

c) The vectors that each general receives in step 3.

Agreement in Faulty Systems (3)

The same as in previous slide, except now with 2 loyal generals and one traitor.

With m faulty processes, at least 2m+1 correctly functioning processes (i.e., 3m+1 processes in total) are required to reach an agreement.
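The two scenarios can be replayed with a small simulation of the vector exchange: every general announces its value, assembles a vector, forwards the vector to all others, and then takes a per-element majority over the vectors it received. The sketch below is a simplified, deterministic model; the function `byzantine_agreement` and the traitor's lying strategy (a different bogus value for every receiver and every element) are assumptions made for illustration.

```python
from collections import Counter

UNKNOWN = None


def byzantine_agreement(values, traitors):
    """values: general id -> true value; traitors: set of faulty general ids.

    Returns, for every loyal general, its result vector (id -> value or UNKNOWN).
    """
    ids = sorted(values)

    def lie(sender, receiver, about):
        # Assumed traitor behaviour: never tell two receivers the same thing.
        return f"lie-{sender}-{receiver}-{about}"

    # Steps 1 and 2: announce values; each general assembles a vector.
    vector = {}
    for g in ids:
        vector[g] = {}
        for s in ids:
            if s != g and s in traitors:
                vector[g][s] = lie(s, g, s)
            else:
                vector[g][s] = values[s]

    # Step 3: every general forwards its vector to all others; traitors send garbage.
    received = {g: [] for g in ids}
    for s in ids:
        for g in ids:
            if s == g:
                continue
            if s in traitors:
                received[g].append({k: lie(s, g, k) for k in ids})
            else:
                received[g].append(dict(vector[s]))

    # Step 4: per element, keep a value only if a majority of received vectors agree.
    result = {}
    for g in ids:
        if g in traitors:
            continue
        result[g] = {}
        for k in ids:
            value, votes = Counter(v[k] for v in received[g]).most_common(1)[0]
            result[g][k] = value if votes > len(received[g]) / 2 else UNKNOWN
    return result


# 3 loyal generals and 1 traitor (general 3): the loyal generals agree.
print(byzantine_agreement({1: 1, 2: 2, 3: 3, 4: 4}, traitors={3}))
# 2 loyal generals and 1 traitor: no element reaches a majority.
print(byzantine_agreement({1: 1, 2: 2, 3: 3}, traitors={3}))
```

With four generals, each loyal general ends up with the same vector in which only the traitor's entry is unknown. With three generals, a loyal general receives only two vectors, so a single lie is enough to prevent any majority.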

Reliable Group Communication

Basic Reliable-Multicasting Schemes

A simple solution to reliable multicasting when all receivers are known and are assumed not to fail

a) Message transmission

b) Reporting feedback
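A minimal in-memory sketch of such a scheme, assuming sequence-numbered messages, a history buffer at the sender, and positive acknowledgements from every known receiver. The class names, the `lost_for` parameter (used only to simulate message loss), and the synchronous method calls standing in for network messages are all illustrative assumptions.

```python
class Sender:
    def __init__(self, receivers):
        self.receivers = receivers  # receiver id -> Receiver (all known in advance)
        self.history = {}           # seq -> payload, kept until everyone has ACKed
        self.acks = {}              # seq -> ids of receivers that ACKed
        self.next_seq = 0

    def multicast(self, payload, lost_for=()):
        seq = self.next_seq
        self.next_seq += 1
        self.history[seq] = payload
        self.acks[seq] = set()
        for rid, receiver in self.receivers.items():
            if rid in lost_for:
                continue  # simulate a message lost on the way to this receiver
            receiver.on_message(seq, payload, self)

    def on_ack(self, rid, seq):
        if seq not in self.history:
            return
        self.acks[seq].add(rid)
        if self.acks[seq] == set(self.receivers):
            # Everyone has the message: it can leave the history buffer.
            del self.history[seq]
            del self.acks[seq]

    def on_retransmit_request(self, rid, seq):
        self.receivers[rid].on_message(seq, self.history[seq], self)


class Receiver:
    def __init__(self, rid):
        self.rid = rid
        self.expected = 0    # next sequence number to deliver
        self.buffer = {}     # out-of-order messages
        self.delivered = []

    def on_message(self, seq, payload, sender):
        if seq < self.expected or seq in self.buffer:
            return  # duplicate
        self.buffer[seq] = payload
        # A gap in the sequence numbers reveals missed messages: ask for them.
        for missing in range(self.expected, seq):
            if missing not in self.buffer:
                sender.on_retransmit_request(self.rid, missing)
        while self.expected in self.buffer:
            self.delivered.append(self.buffer.pop(self.expected))
            sender.on_ack(self.rid, self.expected)
            self.expected += 1


a, b = Receiver("A"), Receiver("B")
sender = Sender({"A": a, "B": b})
sender.multicast("m0", lost_for={"B"})  # B misses m0
sender.multicast("m1")                  # B notices the gap and recovers m0
print(a.delivered, b.delivered, sender.history)
```

In this sketch the sender processes one acknowledgement per receiver per message, which is exactly the load that becomes a problem as the group grows.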

Scalability

- Feedback implosion: the sender is swamped with feedback messages
- Nonhierarchical multicast:
  - Use NACKs
  - Feedback suppression: NACKs are multicast to everyone, which prevents other receivers from sending a NACK if they have already seen one
  - Reduces the (N)ACK load on the server
  - Receivers have to be coordinated so that they do not all multicast NACKs at the same time
  - Multicasting feedback also interrupts processes that have successfully received the messages

Nonhierarchical Feedback Control

Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of others.
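The suppression effect can be sketched with a small timing model: every receiver that missed the message schedules its NACK after some delay (normally chosen at random), and a receiver cancels its own NACK if it sees someone else's NACK before its timer fires. The function `nacks_sent` and the single `propagation_delay` parameter are assumptions made for this example.

```python
def nacks_sent(timer_delays, propagation_delay=0.0):
    """Return the receivers that actually multicast a NACK.

    timer_delays: receiver id -> delay after which it would send its NACK
    propagation_delay: time a multicast NACK needs to reach the other receivers
    """
    first = min(timer_delays.values())
    # A receiver is suppressed only if the first NACK reaches it
    # before its own timer expires.
    return sorted(
        r for r, delay in timer_delays.items()
        if delay == first or delay < first + propagation_delay
    )


delays = {"r1": 0.8, "r2": 0.3, "r3": 0.5}
print(nacks_sent(delays))                          # only the earliest timer fires
print(nacks_sent(delays, propagation_delay=0.25))  # r3 fires before r2's NACK arrives
```

The second call shows why the timers need to be spread out: when they lie closer together than the propagation delay, several NACKs are still sent.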

Hierarchical Feedback Control

The essence of hierarchical reliable multicasting.

a) Each local coordinator forwards the message to its children.

b) A local coordinator handles retransmission requests.

Atomic Multicast: Virtual Synchrony

- Deliver a message either to all group members (in the same order) or to none
- Requires agreement about group membership
  - Replica crash?
- Process group:
  - Group view: the list of processes the sender has when a message is sent
  - Each message is uniquely associated with a group
  - View changes need to be ordered with respect to message transmissions: a message is delivered either to the old or to the new view
- Special case: sender failure

Virtual Synchrony (2)

The principle of virtually synchronous multicast. If the sender crashes during the multicast, the message may either be delivered to all or ignored by each of them.

Implementing Virtual Synchrony

- A message m sent in view Gi is stable if it was received by all members of Gi
- Only stable messages are delivered
- View changes are announced by the arriving/departing node or by the failure-detecting node via a view change message, followed by any unstable messages in the old view, followed by a flush message
- The view is changed after the flush message has arrived from all members of the old view
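The flush protocol can be sketched as a small single-threaded simulation. It is a simplification under stated assumptions: method calls stand in for reliable network messages, a process waits for flushes only from the other surviving members (a crashed member cannot send one), and buffered unstable messages are handed to the application when the new view is installed. The class `Process` and its method names are hypothetical.

```python
class Process:
    def __init__(self, pid, view):
        self.pid = pid
        self.view = frozenset(view)  # currently installed group view
        self.unstable = {}           # msg id -> payload, not yet known to be stable
        self.delivered = []          # msg ids handed to the application
        self.flushed_by = set()      # peers whose flush message has arrived
        self.next_view = None

    def on_message(self, msg_id, payload):
        if msg_id not in self.unstable and msg_id not in self.delivered:
            self.unstable[msg_id] = payload

    def on_view_change(self, new_view, network):
        peers = frozenset(new_view) - {self.pid}
        # Forward every unstable message so all survivors hold the same set ...
        for peer in peers:
            for msg_id, payload in self.unstable.items():
                network[peer].on_message(msg_id, payload)
        # ... then tell everyone that nothing else will follow in the old view.
        for peer in peers:
            network[peer].on_flush(self.pid)
        self.next_view = frozenset(new_view)
        self._try_install()

    def on_flush(self, sender):
        self.flushed_by.add(sender)
        self._try_install()

    def _try_install(self):
        if self.next_view is None:
            return  # own flush not sent yet
        if not (self.next_view - {self.pid}) <= self.flushed_by:
            return  # still waiting for somebody's flush
        for msg_id in sorted(self.unstable):
            self.delivered.append(msg_id)
        self.unstable.clear()
        self.view, self.next_view = self.next_view, None
        self.flushed_by.clear()


# Process 7 multicasts m1 but crashes after only process 6 has received it.
procs = {pid: Process(pid, {4, 5, 6, 7}) for pid in (4, 5, 6)}
procs[6].on_message("m1", "payload")
for pid in (4, 5, 6):
    procs[pid].on_view_change({4, 5, 6}, procs)
print({pid: p.delivered for pid, p in procs.items()})
```

After the view change, all three survivors have delivered m1 and installed the view {4, 5, 6}; had no survivor received m1, none of them would deliver it.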

Implementing Virtual Synchrony (2)

a) Process 4 notices that process 7 has crashed and sends a view change

b) Process 6 sends out all its unstable messages, followed by a flush message

c) Process 6 installs the new view when it has received a flush message from everyone else
