a survey of fault tolerance in distributed systems by szeying tan fall 2002 cs 633

A Survey of Fault Tolerance in Distributed Systems

By Szeying Tan

Fall 2002, CS 633

Introduction

Paper covers:
Definitions of faults and failures
Failure models and elements of fault tolerance
Hardware fault tolerant techniques
Software fault tolerant techniques

Why Fault Tolerance?

Mission critical systems – fault tolerance is a requirement to ensure reliability and availability
High availability and reliability are especially important in distributed real-time systems
Providing fault tolerance raises more complex issues in distributed systems than in single-processor systems

What do we do with faults?

Error detection – find the error in the system
Damage control and assessment – contain the damage and fix it
Error recovery – return the system to an error-free state
Fault treatment/continued service – attempt uninterrupted execution regardless of the fault

Failure Models

Failstop
Crash
Crash+Link
Receive Omission
Send Omission
General Omission
Byzantine Failures

Types of Faults

Permanent – remains in the system indefinitely until corrective action is taken
Transient – disappears after a short period of time
Intermittent – appears and disappears repeatedly

Elements of Fault Tolerance

Redundancy – addition of information, resources, or time beyond what is needed for normal system operation

Failure semantics – knowledgebase of failure behaviors of a system

Group failure masking – Masks failures from others in group.
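Redundancy in time can be illustrated by re-running an operation so that a transient fault is masked. The following is a minimal sketch, not something taken from the paper: `TransientFault`, `retry`, `make_flaky_read`, and the default of three attempts are all hypothetical names and values chosen for illustration.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientFault(Exception):
    """Hypothetical exception standing in for a fault that may go away."""


def retry(operation: Callable[[], T], max_attempts: int = 3) -> T:
    """Time redundancy: re-run the operation up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return operation()
        except TransientFault as err:
            last_error = err  # the fault may be transient, so try again
    # Still failing after every attempt: treat the fault as permanent.
    raise RuntimeError("fault persisted across all attempts") from last_error


def make_flaky_read(failures_before_success: int) -> Callable[[], int]:
    """Build an operation that fails a fixed number of times, then succeeds."""
    remaining = [failures_before_success]

    def flaky_read() -> int:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TransientFault("simulated glitch")
        return 42

    return flaky_read


if __name__ == "__main__":
    # Two simulated failures are masked by the third attempt.
    print(retry(make_flaky_read(2)))
```

A fault that outlasts every attempt is reported instead of masked, which is how this sketch separates a transient fault from one that behaves as permanent.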

Hardware Fault Tolerant Techniques

Hardware redundancy – duplicate components to detect or tolerate faults

Passive techniques – fault masking
Active techniques – fault detection and removal
Hybrid techniques – a combination of both

Techniques listed on the next slide

Triple Modular Redundancy

Execute a task three times
Take a majority vote
In a fault-free system, all three results are identical
Does not work for Byzantine (arbitrary) failures

[Figure: the three results feed a VOTE stage, which produces the RESULT]
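The voting step can be sketched in a few lines. This is an illustrative sketch only: `tmr_vote`, `run_tmr`, `square`, and `faulty_square` are hypothetical names, and the "modules" are plain Python functions standing in for replicated hardware.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def tmr_vote(results: Sequence[Hashable]) -> Hashable:
    """Return the value produced by at least two of the three runs."""
    if len(results) != 3:
        raise ValueError("TMR expects exactly three results")
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        # All three results differ, so no fault can be masked.
        raise RuntimeError("no majority among the three results")
    return value


def run_tmr(modules: Sequence[Callable[[int], int]], x: int) -> int:
    """Execute the task on three modules and vote on the outputs."""
    return tmr_vote([module(x) for module in modules])


def square(x: int) -> int:
    return x * x


def faulty_square(x: int) -> int:
    return x * x + 1  # hypothetical faulty module that returns a wrong value


if __name__ == "__main__":
    # The single wrong result is outvoted by the two correct ones.
    print(run_tmr([square, faulty_square, square], 4))
```

When all three results disagree the voter raises an error instead of guessing, since there is no majority left to trust.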

N-Modular Redundancy

[Figure: three VOTE stages]

Accomplished by replicating the computation N times and voting so that errors are masked
Works similarly to TMR
Masks symmetrical and asymmetrical failures
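Generalizing the voter from three results to N can be sketched as below. The sketch assumes a strict-majority rule, under which N modules can outvote at most (N - 1) // 2 faulty ones; the names `nmr_vote` and `max_masked_faults` are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def max_masked_faults(n: int) -> int:
    """Largest number of faulty modules that a strict majority can outvote."""
    return (n - 1) // 2


def nmr_vote(results: Sequence[Hashable]) -> Hashable:
    """Return the value reported by a strict majority of the N modules."""
    n = len(results)
    if n == 0:
        raise ValueError("at least one result is required")
    value, count = Counter(results).most_common(1)[0]
    if count <= n // 2:
        # No value was reported by more than half of the modules.
        raise RuntimeError("no strict majority among the results")
    return value


if __name__ == "__main__":
    # Five modules, two of them faulty: the three agreeing results win.
    print(nmr_vote([7, 7, 9, 7, 3]))
    print(max_masked_faults(5))
```

Under this rule an even N buys no extra tolerance over N - 1, which is one reason odd replica counts are the usual choice.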

Standby Sparing

[Figure: standby sparing diagram with the labels INPUT, OUTPUT, COMPONENT 1, COMPONENT 3, FAULT DETECTOR, and SWITCH]

Replicate spares in the system (duplicate components)
Spares are activated when a fault is detected
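The interplay of fault detector and switch can be sketched as follows. This is a sketch under stated assumptions, not the design from the slide: `StandbySparing`, `stuck_component`, `spare_component`, and `negative_output` are hypothetical, and the fault detector is assumed to be a simple predicate over the input and output.

```python
from typing import Callable, Sequence


class StandbySparing:
    """One active component plus spares that are switched in on a fault."""

    def __init__(
        self,
        components: Sequence[Callable[[int], int]],
        fault_detector: Callable[[int, int], bool],
    ) -> None:
        self.components = list(components)
        self.fault_detector = fault_detector  # returns True when output looks faulty
        self.active = 0  # index of the component currently in service

    def process(self, value: int) -> int:
        while self.active < len(self.components):
            component = self.components[self.active]
            try:
                output = component(value)
                faulty = self.fault_detector(value, output)
            except Exception:
                faulty = True  # a crash also counts as a detected fault
            if not faulty:
                return output
            self.active += 1  # the switch: retire this component, activate a spare
        raise RuntimeError("all components and spares have failed")


def stuck_component(x: int) -> int:
    return -1  # hypothetical component whose output is stuck at -1


def spare_component(x: int) -> int:
    return 2 * x


def negative_output(x: int, out: int) -> bool:
    return out < 0  # hypothetical detector: outputs should never be negative


if __name__ == "__main__":
    system = StandbySparing([stuck_component, spare_component], negative_output)
    print(system.process(5), system.active)
```

Once a spare has been switched in it stays active for later requests, so the faulty component is not consulted again.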

Duplex Systems

[Figure: two executions feed a COMPARE stage, which produces the RESULT]

Execute the task twice
Compare the results for discrepancies
Execution can occur on separate hardware or sequentially on the same hardware
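A duplex comparison can be sketched as below. With only two results a mismatch reveals that a fault occurred but not which execution was wrong, so the sketch reports the discrepancy instead of masking it. `duplex` and `DiscrepancyError` are hypothetical names.

```python
from typing import Callable


class DiscrepancyError(Exception):
    """Raised when the two executions disagree."""


def duplex(
    first_run: Callable[[int], int],
    second_run: Callable[[int], int],
    x: int,
) -> int:
    """Run the task twice and accept the result only if both runs agree."""
    a = first_run(x)
    b = second_run(x)
    if a != b:
        # A fault is detected, but there is no third result to break the tie.
        raise DiscrepancyError(f"results differ: {a!r} != {b!r}")
    return a


def double(x: int) -> int:
    return x + x


def faulty_double(x: int) -> int:
    return x + x + 1  # hypothetical faulty execution


if __name__ == "__main__":
    print(duplex(double, double, 21))
    try:
        duplex(double, faulty_double, 21)
    except DiscrepancyError as err:
        print("fault detected:", err)
```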

An example of a hardware fault tolerant system

Stratus servers – Fault tolerant hardware servers that use TMR and fully replicated hardware design to provide fault tolerance.

http://www.stratus.com

Software Fault Tolerant Techniques

Two main areas:
Provide for static redundancy
Provide for dynamic redundancy

N-Version Programming
Recovery Blocks or Primary-Backup technique

N-Version Programming

Run n versions of a program on n processes
Forward recovery scheme that masks faults
Relies on voting mechanisms
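As an illustration of voting across versions, the sketch below runs three differently written implementations of the same specification (the sum 1 + 2 + ... + n) and accepts the majority answer. The three versions, including the deliberately faulty one, and the name `n_version_execute` are hypothetical; a real deployment would run the versions on separate processes.

```python
from collections import Counter
from typing import Callable, Sequence


def sum_formula(n: int) -> int:
    return n * (n + 1) // 2  # version 1: closed-form formula


def sum_loop(n: int) -> int:
    total = 0
    for i in range(1, n + 1):  # version 2: explicit loop
        total += i
    return total


def sum_off_by_one(n: int) -> int:
    return sum(range(1, n))  # version 3: faulty, omits n itself


def n_version_execute(versions: Sequence[Callable[[int], int]], n: int) -> int:
    """Run every version on the same input and return the majority result."""
    results = [version(n) for version in versions]
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("versions disagree and no majority exists")
    return value


if __name__ == "__main__":
    # The faulty version is outvoted, so execution continues with a correct value.
    print(n_version_execute([sum_formula, sum_loop, sum_off_by_one], 10))
```

No rollback is involved here: the fault is masked by the vote and execution simply moves forward.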

Agreement Problems

Agreement problems arise when a processor is faulty and the other, non-faulty processors have to agree on a course of action

Some agreement problems covered in my paper:
Byzantine Generals Protocol
Consensus Problem
Interactive Consistency

Application of agreement protocols

Fault tolerant clock synchronization – non-faulty processes must have clocks that are approximately equal in value

Atomic commits – process actions have certain characteristics that must be followed (indivisible, instantaneous, non-revealing state changes, etc.)
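One common way to keep non-faulty clocks close together is fault-tolerant averaging: discard the f lowest and f highest readings, then average the rest. The slides do not say which algorithm the paper uses, so the sketch below is only an example of the idea. The name `fault_tolerant_average` and the sample readings are hypothetical, and the check n > 3f reflects the usual bound assumed for arbitrary (Byzantine) faults.

```python
from typing import Sequence


def fault_tolerant_average(readings: Sequence[float], f: int) -> float:
    """Average the clock readings after discarding the f lowest and f highest."""
    n = len(readings)
    if f < 0 or n <= 3 * f:
        raise ValueError("need more than 3f readings to tolerate f faulty clocks")
    ordered = sorted(readings)
    kept = ordered[f:n - f]  # extreme values from faulty clocks fall outside this slice
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Three clocks agree to within a tick; the fourth is wildly wrong.
    readings = [100.0, 101.0, 99.0, 5000.0]
    print(fault_tolerant_average(readings, f=1))
```

A faulty clock can shift the result only within the range spanned by the correct clocks, because any reading outside that range is trimmed before averaging.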

Recovery Blocks

[Figure: message flow among Client, Process, Backup 1, and Backup 2, with messages labeled request, update, reply, and ack]

Backward error recovery scheme
Also known as the primary-backup approach
Relies on acceptance tests, which check that the output is within an acceptable range
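The control flow of a recovery block can be sketched as follows: save a checkpoint, run the primary, apply the acceptance test, and on failure roll back to the checkpoint and try the next alternate. This is a single-process sketch, not the client/backup message exchange from the figure; `recovery_block`, `primary_sort`, `backup_sort`, and `accept_sorted` are hypothetical names.

```python
import copy
from typing import Callable, List, Sequence


def recovery_block(
    data: List[int],
    alternates: Sequence[Callable[[List[int]], List[int]]],
    acceptance_test: Callable[[List[int], List[int]], bool],
) -> List[int]:
    """Try each alternate on a fresh copy of the checkpointed input."""
    checkpoint = copy.deepcopy(data)  # recovery point taken before any work
    for alternate in alternates:
        working = copy.deepcopy(checkpoint)  # backward recovery: restore state
        try:
            result = alternate(working)
        except Exception:
            continue  # a crash counts as a failed alternate
        if acceptance_test(checkpoint, result):
            return result
    raise RuntimeError("every alternate failed the acceptance test")


def primary_sort(items: List[int]) -> List[int]:
    items.sort(reverse=True)  # hypothetical bug: sorts in the wrong order
    return items


def backup_sort(items: List[int]) -> List[int]:
    return sorted(items)


def accept_sorted(original: List[int], result: List[int]) -> bool:
    """Acceptance test: same length as the input and in ascending order."""
    same_length = len(result) == len(original)
    ascending = all(a <= b for a, b in zip(result, result[1:]))
    return same_length and ascending


if __name__ == "__main__":
    print(recovery_block([3, 1, 2], [primary_sort, backup_sort], accept_sorted))
```

Because each alternate works on a copy restored from the checkpoint, the faulty primary's in-place changes never reach the backup or the caller's data.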

Error Detection Techniques

Effectiveness of any fault tolerant system depends on the effectiveness of its error detection techniques

Early detection or late detection
The concept of acceptability determines the thoroughness of error detection on a distributed system

Error Detection Techniques

Replication Checks
Timing Checks
Structural Checks
Reasonableness Checks
Reversal Checks
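Three of these checks are small enough to sketch directly. The slides give only the names, so the sketch assumes the usual meanings: a replication check compares replicated results, a reasonableness check tests a value against an expected range, and a reversal check recomputes the input from the output. All function names and the tolerance value are hypothetical.

```python
import math
from typing import Sequence


def replication_check(results: Sequence[float]) -> bool:
    """Replication check: all replicated computations must agree."""
    return all(r == results[0] for r in results)


def reasonableness_check(value: float, low: float, high: float) -> bool:
    """Reasonableness check: the value must lie within an expected range."""
    return low <= value <= high


def reversal_check(x: float, root: float, tol: float = 1e-9) -> bool:
    """Reversal check for a square root: squaring the output must give back the input."""
    return math.isclose(root * root, x, rel_tol=tol, abs_tol=tol)


if __name__ == "__main__":
    print(replication_check([4.0, 4.0, 5.0]))        # one replica disagrees
    print(reasonableness_check(21.5, -40.0, 60.0))   # plausible reading for the range
    print(reversal_check(9.0, math.sqrt(9.0)))       # output maps back to the input
    print(reversal_check(9.0, 2.9))                  # wrong output is caught
```

A reversal check only applies when the computation has a cheap inverse, as a square root does. The reasonableness check is cheaper still but catches only results that fall outside the expected range.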

Conclusion

Fault tolerance can be provided on a distributed system by many different means
Sections not covered include error recovery and fault treatment
