Fault Tolerance

Basic Concepts

• Availability: The system is ready to work immediately

• Reliability: The system can run continuously

• Safety: When the system fails, nothing catastrophic happens

• Maintainability: A failed system can be easily repaired

Fault types: transient, intermittent, permanent

Failure Models

Different types of failures.

Type of failure               Description
Crash failure                 A server halts, but is working correctly until it halts
Omission failure              A server fails to respond to requests
  Receive omission            A server fails to receive incoming messages
  Send omission               A server fails to send messages
Timing failure                A server's response lies outside the specified time interval
Response failure              The server's response is incorrect
  Value failure               The value of the response is wrong
  State transition failure    The server deviates from the correct flow of control
Arbitrary failure             A server may produce arbitrary responses at arbitrary times

Failure Masking by Redundancy

• Information redundancy (extra bits)

• Time redundancy (extra operations)

• Physical redundancy (extra equipment or processes)

Failure Masking by Redundancy

Triple modular redundancy (TMR). An electronic circuit example
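The voting idea behind TMR can be sketched in software. This is a minimal illustration, not the circuit from the slide: the names `majority_vote`, `tmr`, `good` and `faulty` are hypothetical, and the sketch assumes that at most one of the three replicas is faulty.

```python
from collections import Counter

def majority_vote(a, b, c):
    """Return the value produced by at least two of the three replicas."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: all three replicas disagree")
    return value

def tmr(replicas, x):
    """Run three replicas of the same computation and vote on the result."""
    outputs = [replica(x) for replica in replicas]
    return majority_vote(*outputs)

def good(x):
    # Correct replica (hypothetical computation).
    return x * 2

def faulty(x):
    # Hypothetical replica with a value failure.
    return x * 2 + 1

print(tmr([good, good, faulty], 21))  # prints 42: the faulty output is outvoted
```

A single faulty replica is masked; if two replicas fail in the same way, the voter returns the wrong value, which is why TMR tolerates only one fault per stage.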

Process failures

To tolerate a faulty process, identical processes are organized into a group

When one process of the group fails, some other process in the group takes care of the work

Process groups may be dynamic

Mechanisms are needed for managing group membership:

• A group server maintains information on membership (centralized)

• Distributed management (less simple and more time-consuming)

Flat Groups versus Hierarchical Groups

a) Communication in a flat group (voting mechanism, slow decisions): replicated-write protocols

b) Communication in a simple hierarchical group (single point of failure): primary-based protocols

Client-server communication failures

Using a reliable transport protocol (TCP) masks omission failures, but many failures are not masked.

Classes of failure

• The client is unable to locate the server: raising an exception is a solution, but we lose transparency

• The request message from the client to the server is lost: retransmission

• The server crashes after receiving a request

• The reply message from the server to the client is lost: retransmission, but…

• The client crashes after sending a request: an orphan is generated (extermination, reincarnation with epoch numbers, gentle reincarnation, expiration…)

Server Crashes (1)

A server in client-server communication: a) Normal case b) Crash after execution c) Crash before execution

At-least-once semantics: after the server reboots, keep trying until a reply is obtained

At-most-once semantics: give up immediately and report the failure

Exactly-once semantics: in general there is no way to guarantee this

Server Crashes (2)

Different combinations of client and server strategies in the presence of server crashes.

Client                 Server
                       Strategy M -> P          Strategy P -> M
Reissue strategy       MPC    MC(P)   C(MP)     PMC    PC(M)   C(PM)
Always                 DUP    OK      OK        DUP    DUP     OK
Never                  OK     ZERO    ZERO      OK     OK      ZERO
Only when ACKed        DUP    OK      ZERO      DUP    OK      ZERO
Only when not ACKed    OK     ZERO    OK        OK     DUP     OK

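Example: a client sends a message to a server to have it printed (P) and gets a completion message back (M). The server can crash (C). Events in parentheses never take place because of the crash. OK = the text is printed once, DUP = printed twice, ZERO = not printed at all.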
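The table can be reproduced by enumerating the cases. The sketch below is illustrative only: the function and policy names are made up for this example, and it assumes that a reissued request is always executed (the server does not crash a second time).

```python
def outcome(server_order, events_before_crash, reissue_policy):
    """Classify one scenario as ZERO, OK or DUP.

    server_order: "MP" (send completion message, then print) or "PM".
    events_before_crash: how many of those events happened before the crash.
    reissue_policy: function(acked) -> True if the client reissues the request.
    """
    done = server_order[:events_before_crash]
    acked = "M" in done        # the client received the completion message
    printed = "P" in done      # the text was printed before the crash
    reissue = reissue_policy(acked)
    prints = int(printed) + int(reissue)  # a reissued request prints once more
    return ["ZERO", "OK", "DUP"][prints]

POLICIES = {
    "Always": lambda acked: True,
    "Never": lambda acked: False,
    "Only when ACKed": lambda acked: acked,
    "Only when not ACKed": lambda acked: not acked,
}

def table():
    """Columns: MPC, MC(P), C(MP), PMC, PC(M), C(PM)."""
    return {
        name: [outcome(order, k, policy)
               for order in ("MP", "PM")
               for k in (2, 1, 0)]
        for name, policy in POLICIES.items()
    }

for name, row in table().items():
    print(f"{name:<20}", *row)
```

No row consists only of OK entries: no combination of client and server strategy yields exactly-once behavior in every crash scenario.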

Group Communication

Basic Reliable-Multicasting Schemes

A simple solution to reliable multicasting when all receivers are known and are assumed not to fail

a) Message transmission b) Reporting feedback

Efficient only for a small number of receivers (possible optimizations: negative acknowledgments only, timers, etc.)

Important for messaging within a process group

Nonhierarchical Feedback Control

Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of the others (Scalable Reliable Multicasting protocol).

It leads to timing problems, useless retransmissions, or a complicated organization of the group membership.

To scale, we need to reduce the number of messages with feedback suppression.
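A rough model of feedback suppression, and of why timing matters, is sketched below. It is not the SRM protocol itself: it assumes every receiver missed the same message, picked a random back-off timer, and hears a multicast NACK exactly `propagation` seconds after it was sent. The function name and parameters are hypothetical.

```python
import random

def nacks_sent(delays, propagation=0.0):
    """Return the indices of receivers that actually multicast a NACK.

    delays[i] is the back-off receiver i chose before reporting the loss.
    A receiver is suppressed if the earliest NACK reaches it before its
    own timer fires.
    """
    first = min(delays)
    return [i for i, d in enumerate(delays)
            if d == first or d < first + propagation]

# Ideal network: only the receiver with the shortest timer sends a NACK.
print(nacks_sent([0.30, 0.10, 0.25, 0.12]))                    # [1]
# With 50 ms propagation delay, receiver 3 fires before the NACK arrives.
print(nacks_sent([0.30, 0.10, 0.25, 0.12], propagation=0.05))  # [1, 3]

# 1000 receivers with random timers in [0, 1) seconds.
rng = random.Random(0)
timers = [rng.random() for _ in range(1000)]
print(len(nacks_sent(timers, propagation=0.01)), "NACKs instead of 1000")
```

The second call shows the timing problem: timers that expire close together still produce redundant requests, so the back-off interval has to be chosen relative to the network delay.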

Hierarchical Feedback Control

The essence of hierarchical reliable multicasting. A tree of receiver partitions is formed.

• Each local coordinator forwards the message to its children.

• A local coordinator handles retransmission requests.

Acknowledgments are exchanged between coordinators.

Atomic Multicast

In the presence of process failures, we need the guarantee that a message is delivered to all of the receivers or to none of them. This leads to the atomic multicast problem.

Atomic multicasting ensures that group members maintain consistency.

The logical organization of a distributed system to distinguish between message receipt and message delivery

In atomic multicasting, a multicast message is uniquely associated with a list of receiving processes (the group view)

A view change takes place when a process joins or leaves the group

Virtual Synchrony

The principle of virtually synchronous multicast (a view change is similar to a synchronization variable)

We need an ordered reliable multicasting.

Virtual synchrony guarantees that a message sent to a group view is delivered to each non-faulty member of the group.

If the sender crashes, the message may be either delivered to all the other processes or ignored by each of them.

Message Ordering

Four different types of multicast ordering:

• Reliable, unordered multicast: no guarantee is given on the order in which messages are delivered

• FIFO-ordered multicast: messages from the same process are delivered in the order in which they were sent

• Causally ordered multicast: causality between messages is preserved

• Totally ordered multicast: messages are delivered in the same order to all members of the group
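One common way to obtain FIFO-ordered delivery is a hold-back queue driven by per-sender sequence numbers. The sketch below assumes each sender numbers its messages 0, 1, 2, …; the class name `FifoReceiver` and its methods are hypothetical, and nothing here addresses causal or total ordering.

```python
class FifoReceiver:
    """Delivers messages from each sender in the order they were sent."""

    def __init__(self):
        self.next_seq = {}   # sender -> next sequence number to deliver
        self.held = {}       # (sender, seq) -> payload received too early
        self.delivered = []  # payloads in delivery order

    def receive(self, sender, seq, payload):
        expected = self.next_seq.get(sender, 0)
        if seq < expected:
            return  # duplicate of a message that was already delivered
        self.held[(sender, seq)] = payload
        # Deliver every consecutive message that is now available.
        while (sender, expected) in self.held:
            self.delivered.append(self.held.pop((sender, expected)))
            expected += 1
        self.next_seq[sender] = expected

r = FifoReceiver()
r.receive("P1", 1, "m2")   # arrives early: held back until m1 is delivered
r.receive("P4", 0, "m3")
r.receive("P1", 0, "m1")   # releases m1 and then m2
r.receive("P4", 1, "m4")
print(r.delivered)         # ['m3', 'm1', 'm2', 'm4']
```

Receipt and delivery are separated: `m2` is received first but delivered only after `m1`, while messages from different senders may still interleave freely.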

Message Ordering

Four processes in the same group with two different senders, and a possible delivery order of messages under FIFO-ordered multicasting

Process P1    Process P2     Process P3     Process P4
sends m1      receives m1    receives m3    sends m3
sends m2      receives m3    receives m1    sends m4
              receives m2    receives m2
              receives m4    receives m4

Unordered multicast: three communicating processes in the same group. The ordering of events per process is shown along the vertical axis.

Process P1    Process P2     Process P3
sends m1      receives m1    receives m2
sends m2      receives m2    receives m1
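The two examples above can be checked mechanically. The helper functions below are hypothetical and only test the definitions as stated on the slides; `is_total` additionally assumes that all processes deliver the same set of messages.

```python
def is_fifo(sent, delivered):
    """sent: sender -> messages in send order.
    delivered: process -> messages in delivery order."""
    for order in delivered.values():
        pos = {m: i for i, m in enumerate(order)}
        for msgs in sent.values():
            positions = [pos[m] for m in msgs if m in pos]
            if positions != sorted(positions):
                return False  # some sender's messages arrived out of order
    return True

def is_total(delivered):
    """True if every process delivered the messages in the same order."""
    orders = list(delivered.values())
    return all(order == orders[0] for order in orders)

# Four-process example: P1 sends m1, m2; P4 sends m3, m4.
sent_a = {"P1": ["m1", "m2"], "P4": ["m3", "m4"]}
delivered_a = {"P2": ["m1", "m3", "m2", "m4"],
               "P3": ["m3", "m1", "m2", "m4"]}
print(is_fifo(sent_a, delivered_a), is_total(delivered_a))  # True False

# Three-process example: P1 sends m1, m2.
sent_b = {"P1": ["m1", "m2"]}
delivered_b = {"P2": ["m1", "m2"], "P3": ["m2", "m1"]}
print(is_fifo(sent_b, delivered_b), is_total(delivered_b))  # False False
```

The first scenario is FIFO-ordered but not totally ordered, because P2 and P3 disagree on the relative order of m1 and m3. The second violates even FIFO order at P3.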

Message Ordering

Six different versions of virtually synchronous reliable multicasting.

Multicast                  Basic Message Ordering     Total-ordered Delivery?
Reliable multicast         None                       No
FIFO multicast             FIFO-ordered delivery      No
Causal multicast           Causal-ordered delivery    No
Atomic multicast           None                       Yes
FIFO atomic multicast      FIFO-ordered delivery      Yes
Causal atomic multicast    Causal-ordered delivery    Yes

Virtually synchronous reliable multicasting offering totally ordered delivery is called atomic multicasting

Distributed Commit

a) The finite state machine for the coordinator in two-phase commit. b) The finite state machine for a participant.

The first phase is the voting phase, the second is the decision phase. Timeout mechanisms are necessary, since the coordinator can crash.

Distributed commit means that an operation has to be performed by each member of a group or by none at all

One-phase distributed commit is performed using a coordinator (if a participant cannot perform the operation, it has no means of informing the coordinator)

Two Phase Commit

• The coordinator sends a vote_request to all participants

• A participant returns a vote_commit (it is ready to commit its part of the transaction) or a vote_abort

• The coordinator collects the votes and sends a global_commit, or a global_abort if at least one participant has sent a vote_abort

• A participant that receives a global_commit locally commits the transaction; one that receives a global_abort locally aborts it

Phase 1: voting phase. Phase 2: decision phase.
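The message flow of two-phase commit can be sketched as follows. This is a failure-free illustration only: the class and function names are hypothetical, the state names INIT and READY are assumptions, and crashes, timeouts and logging are not modeled.

```python
from enum import Enum

class Vote(Enum):
    COMMIT = "vote_commit"
    ABORT = "vote_abort"

class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # whether the local part can be committed
        self.state = "INIT"

    def on_vote_request(self):
        # Phase 1: answer the coordinator's vote_request.
        if self.can_commit:
            self.state = "READY"
            return Vote.COMMIT
        self.state = "ABORT"
        return Vote.ABORT

    def on_decision(self, decision):
        # Phase 2: apply the coordinator's global decision locally.
        self.state = "COMMIT" if decision == "global_commit" else "ABORT"

def two_phase_commit(participants):
    """Play the coordinator: collect the votes, then broadcast the decision."""
    votes = [p.on_vote_request() for p in participants]           # voting phase
    unanimous = all(v is Vote.COMMIT for v in votes)
    decision = "global_commit" if unanimous else "global_abort"   # decision phase
    for p in participants:
        p.on_decision(decision)
    return decision

group = [Participant("A"), Participant("B"), Participant("C", can_commit=False)]
print(two_phase_commit(group))     # global_abort
print([p.state for p in group])    # ['ABORT', 'ABORT', 'ABORT']
```

A single vote_abort is enough to abort everywhere. In a real deployment, a participant in the READY state blocks if the coordinator crashes before announcing its decision, which is why timeouts are needed.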

Three-Phase Commit

It avoids blocking processes when the coordinator crashes

• There is no single state from which it is possible to make a transition directly to either a COMMIT or an ABORT state

• There is no state in which it is not possible to make a final decision and from which a transition to a COMMIT state can be made

Recovery

• Backward recovery brings the system back to a previous correct state. It is necessary to record the state (checkpointing)

• Forward recovery attempts to bring the system into a correct new state from which it can continue executing
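Backward recovery with checkpointing can be illustrated on a single in-memory state. The sketch below is a simplification under stated assumptions: the names `CheckpointedState`, `run_step` and `failing_withdrawal` are hypothetical, the state is a plain Python object copied with `copy.deepcopy`, and distributed checkpointing is not addressed.

```python
import copy

class CheckpointedState:
    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)  # initial checkpoint

    def checkpoint(self):
        """Record the current state as the latest known-correct state."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        """Backward recovery: return to the last recorded state."""
        self.state = copy.deepcopy(self._checkpoint)

def run_step(store, step):
    """Apply one step; checkpoint on success, roll back on failure."""
    try:
        step(store.state)
        store.checkpoint()
        return True
    except Exception:
        store.rollback()
        return False

def failing_withdrawal(state):
    state["balance"] -= 500          # partial update ...
    raise RuntimeError("crash")      # ... followed by a simulated failure

store = CheckpointedState({"balance": 100})
print(run_step(store, failing_withdrawal))  # False
print(store.state)                          # {'balance': 100}
```

The partial update made before the failure is discarded by the rollback. The cost is that every checkpoint copies the whole state.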
