A Survey of Fault Tolerance in Distributed Systems By Szeying Tan Fall 2002 CS 633


Page 1: A Survey of Fault Tolerance in Distributed Systems By Szeying Tan Fall 2002 CS 633

A Survey of Fault Tolerance in Distributed Systems
By Szeying Tan
Fall 2002 CS 633

Page 2: Introduction

Paper covers:
  Definitions of faults/failures
  Failure models and elements of fault tolerance
  Hardware fault tolerant techniques
  Software fault tolerant techniques

Page 3: Why Fault Tolerance?

Mission critical systems – a requirement to ensure reliability and availability
High availability and reliability are especially important in distributed real time systems
Providing fault tolerance raises more complex issues in distributed systems than in single processor systems

Page 4: What do we do with faults?

Error detection – find the error in the system
Damage control and assessment – contain the damage and fix it
Error recovery – return the system to an error-free state
Fault treatment/continued service – attempt uninterrupted execution regardless of the fault

Page 5: Failure Models

Failstop
Crash
Crash+Link
Receive Omission
Send Omission
General Omission
Byzantine Failures

Page 6: Types of Faults

Permanent – remains in the system indefinitely until corrective action is taken
Transient – disappears after a short period of time
Intermittent – appears and disappears repeatedly

Page 7: Elements of Fault Tolerance

Redundancy – addition of information, resources, or time beyond what is needed for normal system operation

Failure semantics – knowledgebase of failure behaviors of a system

Group failure masking – a group of redundant components hides the failure of individual members from the group's users
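As a small illustration of information redundancy, the sketch below appends a single even-parity bit to a data word. The parity scheme and the names `add_parity` and `parity_ok` are assumptions chosen for this example; the slides do not name a specific code.

```python
def add_parity(bits):
    """Append an even-parity bit so the total number of 1s is even."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """Return True when the word still contains an even number of 1s."""
    return sum(word) % 2 == 0


word = add_parity([1, 0, 1, 1])   # one extra bit beyond what normal operation needs
corrupted = word.copy()
corrupted[2] ^= 1                 # simulate a single bit flip

print(parity_ok(word))       # the stored word passes the check
print(parity_ok(corrupted))  # the single-bit error is detected
```

A single parity bit only detects an odd number of bit flips and cannot correct them; stronger codes trade more redundancy for more coverage.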

Page 8: Hardware Fault Tolerant Techniques

Hardware redundancy – duplicate components to detect or tolerate faults
  Passive techniques – fault masking
  Active techniques – fault detection and removal
  Hybrid techniques – a combination of both

Techniques listed on the next slide

Page 9: Triple Modular Redundancy

Execute a task three times
Take a majority vote
In a fault free system, all three results are identical
Does not work for Byzantine (arbitrary) failures

[Diagram labels: VOTE, RESULT]
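The three steps on this slide can be sketched in a few lines. This is a minimal illustration, not a hardware design: the helper names (`majority_vote`, `tmr`, `make_flaky_square`) are hypothetical, the task is run sequentially in one process, and results are assumed to be hashable so they can be counted.

```python
from collections import Counter


def majority_vote(results):
    """Return the value produced by a strict majority of the results."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among results: %r" % (results,))
    return value


def tmr(task, *args):
    """Execute the task three times and vote on the outcome."""
    results = [task(*args) for _ in range(3)]
    return majority_vote(results)


def make_flaky_square(bad_call):
    """Build a squaring task whose call number `bad_call` returns a wrong value."""
    calls = {"n": 0}

    def square(x):
        calls["n"] += 1
        if calls["n"] == bad_call:
            return -1          # simulated transient fault
        return x * x

    return square


masked = tmr(make_flaky_square(bad_call=2), 7)
print(masked)  # the single faulty execution is outvoted
```

The voter itself is a single point of failure in this sketch; it masks one faulty execution but not a faulty vote.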

Page 10: N-Modular Redundancy

[Diagram: three VOTE blocks]

Accomplished by replicating a task N times and masking errors by voting
Works similarly to TMR
Masks symmetrical and asymmetrical failures
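A sketch of the N-way case, assuming a simple strict-majority voter over N replicas. The function names and the example replicas are hypothetical. With a strict majority, N replicas can mask at most (N - 1) // 2 faulty results.

```python
from collections import Counter


def nmr(replicas, x):
    """Run every replica on the same input and return the majority result."""
    results = [replica(x) for replica in replicas]
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority among %d replicas" % len(results))
    return value


def max_masked_faults(n):
    """Largest number of faulty replicas a strict majority vote can mask."""
    return (n - 1) // 2


def good(x):
    return x + 1


def stuck(x):
    return 0        # simulated stuck-at fault


def noisy(x):
    return x + 2    # simulated wrong result


five = [good, stuck, good, noisy, good]
print(nmr(five, 10))          # two faulty replicas are outvoted by three good ones
print(max_masked_faults(5))
```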

Page 11: Standby Sparing

[Diagram labels: INPUT, COMPONENT 1, COMPONENT 3, FAULT DETECTOR, SWITCH, OUTPUT]

Replicate spares in the system (duplicate components)
Spares activated when fault is detected
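The interplay of fault detector and switch might look like the following sketch. The class `StandbySparing`, the detector signature, and the example components are all assumptions made for illustration; an exception raised by a component is treated here as a detected fault.

```python
class StandbySparing:
    """Route requests to one active component and switch to a spare on a fault."""

    def __init__(self, components, is_faulty):
        self.components = list(components)   # primary first, then the spares
        self.is_faulty = is_faulty           # fault detector: (input, output) -> bool
        self.active = 0                      # position of the switch

    def run(self, x):
        while self.active < len(self.components):
            try:
                output = self.components[self.active](x)
                if not self.is_faulty(x, output):
                    return output
            except Exception:
                pass                         # a crash also counts as a detected fault
            self.active += 1                 # switch: activate the next spare
        raise RuntimeError("all spares exhausted")


def dead_primary(x):
    raise RuntimeError("component has failed")


def spare(x):
    return x * 2


def output_is_missing(x, output):
    return output is None


system = StandbySparing([dead_primary, spare], output_is_missing)
first = system.run(3)    # the fault is detected and the spare takes over
second = system.run(5)   # later requests go straight to the spare
print(first, second, system.active)
```

Unlike voting, this scheme needs a working detector, and the failed request is retried on the spare rather than masked in place.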

Page 12: Duplex Systems

[Diagram labels: COMPARE, RESULT]

Execute the task twice
Compare results for discrepancies
Execution can occur on separate hardware or sequentially on the same hardware
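A minimal sketch of the compare step, assuming the sequential, same-hardware variant. `duplex`, `DuplexMismatch`, and the checksum tasks are hypothetical names for this example.

```python
class DuplexMismatch(Exception):
    """Raised when the two executions disagree."""


def duplex(run_a, run_b, x):
    """Execute the task twice and compare the results."""
    a = run_a(x)
    b = run_b(x)
    if a != b:
        raise DuplexMismatch("results differ: %r vs %r" % (a, b))
    return a


def checksum(data):
    return sum(data) % 256


def faulty_checksum(data):
    return (sum(data) + 1) % 256   # simulated fault in one execution


ok = duplex(checksum, checksum, [1, 2, 3])

try:
    duplex(checksum, faulty_checksum, [1, 2, 3])
    detected = False
except DuplexMismatch:
    detected = True

print(ok, detected)
```

With only two results, a discrepancy reveals that a fault occurred but not which execution was wrong, so a duplex system detects faults rather than masking them.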

Page 13: An example of a hardware fault tolerant system

Stratus servers – Fault tolerant hardware servers that use TMR and fully replicated hardware design to provide fault tolerance.

http://www.stratus.com

Page 14: Software Fault Tolerant Techniques

Two main areas:
  Provide for static redundancy
  Provide for dynamic redundancy

N-Version Programming
Recovery Blocks or Primary-Backup technique

Page 15: N-Version Programming

Run n versions of a program on n processes
Forward recovery scheme that masks faults
Relies on voting mechanisms
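The idea can be sketched with three differently written versions of the same function, one of which contains a deliberate bug. Everything here is illustrative: the versions, the `n_version` driver, and the choice to run the versions sequentially in one process rather than on n separate processes.

```python
from collections import Counter


def max_v1(xs):
    return max(xs)


def max_v2(xs):
    return sorted(xs)[-1]


def max_v3(xs):
    # Deliberately faulty version: it never looks at the last element.
    best = xs[0]
    for x in xs[1:-1]:
        if x > best:
            best = x
    return best


def n_version(versions, xs):
    """Run every version on its own copy of the input and vote on the results."""
    results = [version(list(xs)) for version in versions]
    value, count = Counter(results).most_common(1)[0]
    if count <= len(versions) // 2:
        raise RuntimeError("versions disagree: %r" % (results,))
    return value


answer = n_version([max_v1, max_v2, max_v3], [3, 1, 9])
print(answer)  # the faulty version is outvoted, so execution moves forward
```

No rollback happens here, which is why the scheme counts as forward recovery: the vote masks the wrong result and the computation simply continues.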

Page 16: Agreement problems

An agreement problem occurs when a processor is faulty and the other, non-faulty processors have to agree on a course of action

Some agreement problems covered in my paper:
  Byzantine Generals Protocol
  Consensus Problem
  Interactive Consistency
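As an illustration of the Byzantine generals setting, the sketch below simulates one round of the oral-messages algorithm for a single traitor, in the style of Lamport, Shostak, and Pease, with one commander and three lieutenants. That algorithm is known to need at least 3m + 1 processes to tolerate m traitors. The function names, the message representation, and the default order "retreat" are assumptions for this example.

```python
from collections import Counter


def majority(values, default="retreat"):
    """Return the strict-majority value, or the default when there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count * 2 > len(values) else default


def om1(commander_sends, relay):
    """One-traitor agreement among lieutenants.

    commander_sends: dict mapping each lieutenant to the order it received.
    relay(sender, receiver, order): the order `sender` forwards to `receiver`;
    a loyal lieutenant forwards `order` unchanged, a traitor may lie.
    """
    lieutenants = sorted(commander_sends)
    decisions = {}
    for me in lieutenants:
        seen = [commander_sends[me]]
        for other in lieutenants:
            if other != me:
                seen.append(relay(other, me, commander_sends[other]))
        decisions[me] = majority(seen)
    return decisions


# Case 1: loyal commander, lieutenant L3 is a traitor and relays "retreat".
case1 = om1(
    {"L1": "attack", "L2": "attack", "L3": "attack"},
    lambda sender, receiver, order: "retreat" if sender == "L3" else order,
)

# Case 2: traitorous commander sends conflicting orders, all lieutenants loyal.
case2 = om1(
    {"L1": "attack", "L2": "retreat", "L3": "attack"},
    lambda sender, receiver, order: order,
)

print(case1["L1"], case1["L2"])   # the loyal lieutenants follow the loyal commander
print(set(case2.values()))        # the loyal lieutenants still agree with each other
```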

Page 17: Application of agreement protocols

Fault tolerant clock syncs
  Non faulty processes must have clocks that are approximately equal in value
Atomic commits
  Process actions must exhibit certain characteristics (indivisible, instantaneous, non-revealing state changes, etc.)
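For the clock synchronization item, one common approach is a fault-tolerant average: each process collects clock readings, discards the m highest and m lowest, and averages the rest. This technique is brought in here as an assumption for illustration and is not taken from the slides; `fault_tolerant_average` and the sample readings are hypothetical.

```python
def fault_tolerant_average(readings, m):
    """Average the readings after dropping the m largest and m smallest values."""
    if len(readings) <= 2 * m:
        raise ValueError("need more than 2*m readings to tolerate m faulty clocks")
    ordered = sorted(readings)
    trimmed = ordered[m:len(ordered) - m]
    return sum(trimmed) / len(trimmed)


# Three clocks that roughly agree and one faulty clock far off.
readings = [100.0, 100.2, 99.9, 5000.0]
adjusted = fault_tolerant_average(readings, m=1)
print(adjusted)
```

If at most m readings come from faulty clocks, every value that survives the trimming lies between two readings from non-faulty clocks, so a single wild clock cannot drag the result away.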

Page 18: Recovery Blocks

[Diagram labels: Client, Process, Backup 1, Backup 2; messages: request, update, reply, ack]

Backward error recovery scheme
Also known as primary-backup approach
Relies on acceptance tests
  Checks that the output is within an acceptable range
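A recovery block can be sketched as a checkpoint, a list of alternates, and an acceptance test. The code below is a single-process illustration under stated assumptions: state is a plain dict, `recovery_block` and the two alternates are hypothetical names, and the acceptance test is a simple range check like the one the slide describes.

```python
import copy


def recovery_block(state, alternates, acceptance_test):
    """Try each alternate from the same checkpoint until one passes the test."""
    checkpoint = copy.deepcopy(state)
    for alternate in alternates:
        candidate = copy.deepcopy(checkpoint)   # backward recovery: restart from the checkpoint
        try:
            result = alternate(candidate)
        except Exception:
            continue                            # a crash counts as a failed alternate
        if acceptance_test(result):
            state.clear()
            state.update(candidate)             # commit only the accepted alternate's state
            return result
    raise RuntimeError("all alternates failed the acceptance test")


def primary(state):
    state["readings"].append(0.0)               # side effect that must be rolled back
    return sum(state["readings"])               # bug: returns the sum, not the mean


def backup(state):
    readings = state["readings"]
    return sum(readings) / len(readings)


def in_range(value):
    return -40.0 <= value <= 60.0               # acceptable range for a temperature


state = {"readings": [20.5, 21.0, 19.5]}
result = recovery_block(state, [primary, backup], in_range)
print(result, len(state["readings"]))
```

The primary's out-of-range answer is rejected, its side effect is discarded with its copy of the state, and the backup runs from the original checkpoint.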

Page 19: Error Detection Techniques

Effectiveness of any fault tolerant system depends on the effectiveness of its error detection techniques
Early detection or late detection
Concept of acceptability determines the thoroughness of error detection on a distributed system

Page 20: Error Detection Techniques

Replication Checks
Timing Checks
Structural Checks
Reasonableness Checks
Reversal Checks
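Three of these checks are easy to show in miniature. The functions below are hypothetical sketches: the reasonableness check is a range test, the reversal check recomputes the input from the output of a square root, and the timing check only reports a missed deadline after the task has finished, whereas a real watchdog timer would interrupt the task.

```python
import math
import time


def reasonableness_check(value, low, high):
    """Accept the value only if it lies in the expected range."""
    return low <= value <= high


def reversal_check(x, claimed_root, rel_tol=1e-9):
    """Reverse the computation: squaring the claimed root should give back x."""
    return math.isclose(claimed_root * claimed_root, x, rel_tol=rel_tol)


def timing_check(task, deadline_s):
    """Run the task and report whether it finished within the deadline."""
    start = time.monotonic()
    result = task()
    elapsed = time.monotonic() - start
    return result, elapsed <= deadline_s


print(reasonableness_check(21.0, -40.0, 60.0))
print(reversal_check(2.0, math.sqrt(2.0)))
print(reversal_check(2.0, 1.5))
print(timing_check(lambda: 42, 1.0))
```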

Page 21: Conclusion

There are many different means by which fault tolerance can be provided on a distributed system
Sections not covered include error recovery and fault treatment