Chapter 11
Fault Tolerance
Topics
  Introduction
  Process Resilience
  Reliable Group Communication
  Recovery
Basic Concepts
  Failure = state of a system in which the system fails to meet its contract
  Error = part of the system state that leads to a failure (e.g. a value differing from its intended value)
  Fault = cause of an error, e.g. resulting from
    Design errors
    Manufacturing faults
    Deterioration
    External disturbance
  Fault -> Error -> Failure
  Remark:
    The presence of a fault does not ensure that an error will occur, e.g. a memory bit stuck at 0 causes an error only when a 1 has to be stored in it
Characteristics of Faults: Duration
  Permanent fault
    Once a component fails, it never works correctly again
    Easiest to diagnose
  Transient fault
    Occurs one time only
    10 times as likely as permanent faults
  Intermittent fault
    Re-occurring
    May appear to be transient (if the period is long)
    Hard and expensive to detect
Fault Tolerance Includes…
  Fault tolerance: the system can avoid a failure despite the occurrence of faults
  Availability: probability that the system works correctly at a given instant of time
  Reliability: expected time between failures
  Safety: absence of catastrophic consequences of a fault
  Maintainability: ease of recovering from a failure (incl. automatic recognition of faults)
Failure Models
  Type of failure                         Description
  Crash failure                           A server halts, but is working correctly until it halts
  Omission failure                        A server fails to respond to incoming requests
    Receive omission                      A server fails to receive incoming messages
    Send omission                         A server fails to send messages
  Timing failure                          A server's response lies outside the specified time interval
  Response failure                        The server's response is incorrect
    Value failure                         The value of the response is wrong
    State transition failure              The server deviates from the correct flow of control
  Arbitrary failure (Byzantine failure)   A server may produce arbitrary responses at arbitrary times
How to Overcome Failures?
  Design servers that are able to announce that they might fail in the near future?
  Design a DS that is able to detect that a server is down and/or that a server no longer works correctly?
  Design a DS that is able to mask faults via redundancy?
Failure Masking
  Hide the occurrence of faults using redundancy
    Information (e.g. additional bits, i.e. error-correcting codes such as the Hamming code)
    Time (e.g. retry an operation; an aborted transaction may be repeated without any side effects)
    Physical
      Hardware (replicated equipment)
      Software (replicated server processes/threads)
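
Information redundancy can be illustrated with a Hamming(7,4) code. The sketch below is only one possible illustration: the bit layout (parity bits at positions 1, 2 and 4) and the function names hamming74_encode and hamming74_decode are assumptions, not taken from the slides.

```python
def hamming74_encode(data):
    """Encode 4 data bits as 7 bits laid out as [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(code):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 1-based position of the flipped bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1          # mask the fault by flipping the bit back
    return [c[2], c[4], c[5], c[6]]


word = [1, 0, 1, 1]
sent = hamming74_encode(word)
sent[4] ^= 1                          # simulate a single-bit fault in transit
recovered = hamming74_decode(sent)    # equals word again
```

Three extra bits per four data bits are enough to mask any single-bit fault; two or more flipped bits in the same code word are not corrected by this scheme.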
Hardware Redundancy
  Passive (static)
    Uses fault masking to hide the occurrence of faults
    No action from the system is required
    e.g. voting (see next slide)
  Active (dynamic)
    Uses comparison for detection and/or diagnosis
    Removes faulty hardware from the system (reconfiguration)
  Hybrid
    Combines both approaches
    Masking until diagnosis is complete
    Expensive, but better suited to achieving higher reliability
Failure Masking by Redundancy
Triple modular redundancy.
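
Triple modular redundancy can be sketched as three replicas of the same module feeding a majority voter. The names majority_vote and tmr, and the use of plain Python callables as "modules", are assumptions made for this illustration.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among module outputs")
    return value


def tmr(modules, x):
    """Run every replica on the same input and vote on the results."""
    return majority_vote([module(x) for module in modules])


def good(x):
    return x + 1


def stuck_at_zero(x):
    return 0   # a faulty replica that always outputs 0


result = tmr([good, stuck_at_zero, good], 4)   # the voter masks the fault: 5
```

With three replicas the voter masks one faulty module; if two replicas fail with different values, no majority exists and the sketch raises an error instead of guessing.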
Stand-by Sparing
  Only one module is driving the outputs
  Other modules are
    Idle: hot spares
    Shut down: cold spares
  In case of error detection, switch to a new module
  Hot spares
    No power-up delays
    But may have significant power consumption
  Cold spares
    The reverse of hot spares: power-up delay, but no power consumption while shut down
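
The switch-over logic of stand-by sparing might look as follows. This is a minimal sketch under the assumption that error detection shows up as a Python exception; the class name StandbySparing is hypothetical, and power-up delays of cold spares are not modelled.

```python
class StandbySparing:
    """One active module drives the output; spares take over on a detected error."""

    def __init__(self, modules):
        self.modules = list(modules)
        self.active = 0                   # index of the module driving the outputs

    def __call__(self, x):
        while self.active < len(self.modules):
            try:
                return self.modules[self.active](x)
            except Exception:
                self.active += 1          # error detected: switch to the next spare
        raise RuntimeError("all modules have failed")


def broken(x):
    raise IOError("module failure")


unit = StandbySparing([broken, lambda x: x * 2])
value = unit(3)   # the first module fails, the spare answers with 6
```

Unlike the voter in triple modular redundancy, this scheme needs an explicit detection step before the fault can be handled.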
Failure Masking by Software Redundancy
How to improve reliability?
What can we do to mask thread/process faults?
Process Resilience
  Protection against process failures
  A group of identical processes provides redundancy
    Software: multiple processes on the same machine
    Hardware: processes on different machines
  Multicast communication ensures all members receive all messages (often atomic and ordered)
  Processes can join and leave groups dynamically, e.g. to replace failed processes
  A membership protocol ensures agreement on group membership at any given time
Flat Groups versus Hierarchical Groups
  a) Communication in a flat group.
  b) Communication in a simple hierarchical group.
Group Management
  1. Use a single group server with a single database: a typical single point of failure
  2. Use a single database but several group servers (standby solution)
  3. Manage groups in a distributed way, i.e. every outsider wanting to enter a group sends a corresponding enter_group message via reliable multicast to every current group member, but
    When does a new group member get all the group-internal messages?
    When a member leaves the group, what about messages already sent but not yet received?
Agreement in Faulty Systems (1)
  Ensure all non-faulty processes reach consensus in a finite number of steps
  1. Reliable processes, faulty communication (omission faults): the two-army problem
  2. Reliable communication, faulty processes (Byzantine faults)
Agreement in Faulty Systems (2)
  The Byzantine generals problem for 3 loyal generals and 1 traitor.
  a) The generals announce their troop strengths (in units of 1 kilosoldier).
  b) The vectors that each general assembles based on (a).
  c) The vectors that each general receives in step 3.
Agreement in Faulty Systems (3)
  The same as in the previous slide, except now with 2 loyal generals and 1 traitor.
  With m faulty processes, at least 2m+1 correctly functioning processes are required to reach an agreement, i.e. 3m+1 processes in total.
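
The vector exchange from the two Byzantine generals slides can be replayed in a few lines. This is a sketch under stated assumptions: the traitor is modelled as sending a fresh, distinct made-up value in every message (other lying strategies are not explored), and the names byzantine_agreement, majority and UNKNOWN are hypothetical.

```python
from collections import Counter

UNKNOWN = None


def majority(values):
    """Return the value held by a strict majority, or UNKNOWN if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count * 2 > len(values) else UNKNOWN


def byzantine_agreement(strengths, traitors):
    """strengths: general -> true troop strength; traitors: set of faulty generals."""
    generals = sorted(strengths)
    lie = iter(range(1000, 10**6))   # assumption: every lie is a new, distinct value

    # Steps 1 and 2: everyone announces a strength; each general assembles a vector.
    vector = {
        g: {s: strengths[s] if s == g or s not in traitors else next(lie)
            for s in generals}
        for g in generals
    }

    # Step 3: every general sends its vector to all others; traitors send garbage.
    received = {g: [] for g in generals}
    for sender in generals:
        for receiver in generals:
            if receiver == sender:
                continue
            if sender in traitors:
                received[receiver].append({s: next(lie) for s in generals})
            else:
                received[receiver].append(dict(vector[sender]))

    # Step 4: each loyal general takes the majority per element of the received vectors.
    return {
        g: [majority([v[s] for v in received[g]]) for s in generals]
        for g in generals if g not in traitors
    }


four = byzantine_agreement({1: 1, 2: 2, 3: 3, 4: 4}, traitors={3})
three = byzantine_agreement({1: 1, 2: 2, 3: 3}, traitors={3})
```

With three loyal generals and one traitor, every loyal general ends up with [1, 2, None, 4], so they agree on all loyal strengths. With two loyal generals and one traitor, every element stays None (UNKNOWN), which matches the 2m+1 requirement.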
Reliable Group Communication
Basic Reliable-Multicasting Schemes
  A simple solution to reliable multicasting when all receivers are known and are assumed not to fail
  a) Message transmission
  b) Reporting feedback
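
The basic scheme can be sketched as a sender that keeps every message in a history buffer until all receivers have acknowledged it, and receivers that report a gap in the sequence numbers. The classes Sender and Receiver, the lost_by parameter used to simulate message loss, and the in-process "network" are assumptions made for this illustration.

```python
class Receiver:
    def __init__(self):
        self.last = -1            # highest sequence number delivered in order
        self.delivered = []

    def receive(self, seq, payload):
        """Return ('ACK', seq) on delivery, or ('NACK', first missing seq) on a gap."""
        if seq == self.last + 1:
            self.last = seq
            self.delivered.append(payload)
            return ("ACK", seq)
        if seq <= self.last:
            return ("ACK", seq)   # duplicate, already delivered
        return ("NACK", self.last + 1)


class Sender:
    def __init__(self, receivers):
        self.receivers = receivers
        self.history = {}         # seq -> payload, kept until everyone has ACKed
        self.acked = {}           # seq -> indices of receivers that ACKed
        self.next_seq = 0

    def multicast(self, payload, lost_by=()):
        seq = self.next_seq
        self.next_seq += 1
        self.history[seq] = payload
        self.acked[seq] = set()
        for i in range(len(self.receivers)):
            if i not in lost_by:  # simulate loss on the way to some receivers
                self._send(i, seq)

    def _send(self, i, seq):
        kind, n = self.receivers[i].receive(seq, self.history[seq])
        if kind == "NACK":
            # Retransmit from the first missing message up to the current one.
            for missing in range(n, seq + 1):
                self.receivers[i].receive(missing, self.history[missing])
                self._ack(i, missing)
        else:
            self._ack(i, seq)

    def _ack(self, i, seq):
        self.acked[seq].add(i)
        if len(self.acked[seq]) == len(self.receivers):
            del self.history[seq], self.acked[seq]   # stable: drop from the buffer


receivers = [Receiver(), Receiver()]
sender = Sender(receivers)
sender.multicast("a", lost_by={1})     # receiver 1 misses message 0
pending = dict(sender.history)         # message 0 is still buffered
sender.multicast("b")                  # receiver 1 detects the gap and gets 0 and 1
```

The history buffer can only shrink once every receiver has acknowledged a message, which is why the scheme assumes that all receivers are known and do not fail.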
Scalability
  Feedback implosion: the sender is swamped with feedback messages
  Nonhierarchical multicast:
    Use NACKs
    Feedback suppression: NACKs are multicast to everyone, which prevents other receivers from sending a NACK if they have already seen one
      Reduces the (N)ACK load on the server
      Receivers have to be coordinated so that they do not all multicast NACKs at the same time
      Multicasting feedback also interrupts processes that have successfully received the messages
Nonhierarchical Feedback Control
Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of others.
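
Feedback suppression is commonly realised by letting each receiver wait a random time before multicasting its NACK. The sketch below assumes uniformly distributed timers and a fixed propagation delay for the first NACK; the function name nack_suppression and its parameters are hypothetical.

```python
import random


def nack_suppression(receivers, max_delay=1.0, propagation=0.05, seed=0):
    """Return the receivers that actually multicast a NACK, plus all timers."""
    rng = random.Random(seed)
    # Every receiver that missed the message schedules its NACK at a random time.
    timers = {r: rng.uniform(0.0, max_delay) for r in receivers}
    first = min(timers.values())
    # A receiver stays silent if the first NACK reaches it before its own timer fires.
    senders = sorted(
        r for r, t in timers.items() if t == first or t < first + propagation
    )
    return senders, timers


senders, timers = nack_suppression(range(10))
```

The shorter the propagation delay relative to the timer range, the fewer duplicate NACKs are sent; if the first NACK takes longer to arrive than any timer runs, no suppression happens at all.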
Hierarchical Feedback Control
  The essence of hierarchical reliable multicasting.
  a) Each local coordinator forwards the message to its children.
  b) A local coordinator handles retransmission requests.
Atomic Multicast: Virtual Synchrony
  Deliver a message either to all group members (in the same order), or to none
    Requires agreement about group membership
    Replica crash?
  Process group:
    Group view: the list of processes the sender has when a message is sent
    Each message is uniquely associated with a group view
    View changes need to be ordered with respect to message transmissions: a message is delivered either in the old view or in the new view
  Special case: sender failure
Virtual Synchrony (2)
  The principle of virtual synchronous multicast. If the sender crashes during the multicast, the message may either be delivered to all or ignored by each of them.
Implementing Virtual Synchrony
  A message m sent in view Gi is stable if it was received by all members of Gi
  Only stable messages are delivered
  View changes are announced
    By the arriving/departing node or by the node that detected the failure
    Via a view change message, followed by any unstable messages in the old view, followed by a flush message
  The view is changed after the flush message has arrived from all (surviving) members of the old view
Implementing Virtual Synchrony (2)
  a) Process 4 notices that process 7 has crashed and sends a view change.
  b) Process 6 sends out all its unstable messages, followed by a flush message.
  c) Process 6 installs the new view when it has received a flush message from everyone else.
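
The flush steps (a) to (c) can be replayed on a small example. This sketch assumes reliable point-to-point delivery between the surviving processes and represents messages by simple labels; the function name flush_view_change and the example message labels are hypothetical.

```python
def flush_view_change(received, crashed):
    """received: process -> set of messages it received in the old view."""
    survivors = sorted(p for p in received if p not in crashed)
    # A message is stable if every member of the old view has received it.
    stable = set.intersection(*received.values())

    state = {p: set(received[p]) for p in survivors}
    flushes = {p: set() for p in survivors}
    for p in survivors:
        unstable = received[p] - stable
        for q in survivors:
            if q != p:
                state[q] |= unstable   # forward all unstable messages ...
                flushes[q].add(p)      # ... followed by a flush message

    # A process installs the new view once everyone else has flushed.
    installed = [p for p in survivors if flushes[p] == set(survivors) - {p}]
    return installed, state


# Process 7 crashed after its last multicast m2 reached only process 6.
old_view = {4: {"m1"}, 6: {"m1", "m2"}, 7: {"m1", "m2"}}
new_view, state = flush_view_change(old_view, crashed={7})
```

After the flush, processes 4 and 6 both hold m1 and m2 and install the new view [4, 6]: the message from the crashed sender is delivered to all survivors rather than to only some of them.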