A Survey of Fault Tolerance in Distributed Systems
By Szeying Tan
Fall 2002, CS 633
Introduction
Paper covers:
Definitions of faults/failures
Discuss failure models and elements of fault tolerance
Introduce hardware fault tolerant techniques
Introduce software fault tolerant techniques
Why Fault Tolerance?
Mission critical systems – fault tolerance is a requirement to ensure reliability and availability
High availability and reliability are especially important in distributed real time systems
Providing fault tolerance raises more complex issues in distributed systems than in single processor systems
What do we do with faults?
Error detection – find the error in the system
Damage control and assessment – contain and fix
Error recovery – return the system back to an error-free state
Fault treatment/continued service – attempt uninterrupted execution regardless of fault
Failure Models
Failstop
Crash
Crash+Link
Receive Omission
Send Omission
General Omission
Byzantine Failures
Types of Faults
Permanent – remains in the system indefinitely till corrective action is taken
Transient – disappears after a short period of time
Intermittent – appears and disappears repeatedly
Elements of Fault Tolerance
Redundancy – addition of information, resources, or time beyond what is needed for normal system operation
Failure semantics – knowledge of the failure behaviors of a system
Group failure masking – masks failures from others in the group
Hardware Fault Tolerant Techniques
Hardware redundancy – duplicate components to detect or tolerate faults
Passive techniques – fault masking
Active techniques – fault detection and removal
Hybrid techniques – a combination of both
Techniques listed on the next slide
Triple Modular Redundancy
Execute a task three times
Take a majority vote
In a fault free system, all three results are identical
Does not work for Byzantine (arbitrary) failures
[Figure labels: VOTE, RESULT]
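The execute-three-times-and-vote idea can be sketched in a few lines. This is a minimal illustration, not the slide's implementation: `tmr` and `make_flaky_adder` are hypothetical names, the fault is injected artificially, and results are assumed to be hashable so they can be counted.

```python
from collections import Counter

def tmr(task, *args):
    """Run `task` three times and return the result that at least two runs agree on."""
    results = [task(*args) for _ in range(3)]
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        # All three results differ, so no majority exists to mask the fault.
        raise RuntimeError(f"no majority among {results!r}")
    return value

def make_flaky_adder(faulty_call):
    """Hypothetical task: adds two numbers, but one chosen call returns a corrupted sum."""
    calls = 0
    def add(a, b):
        nonlocal calls
        calls += 1
        return a + b + 1 if calls == faulty_call else a + b
    return add

# The second of the three executions is faulty; the vote masks it.
masked = tmr(make_flaky_adder(faulty_call=2), 2, 3)
print(masked)  # 5
```

A single wrong result is outvoted. If two executions fail in the same way, the vote returns the wrong value, which is why the number of tolerated faults is limited to one.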
N-Modular Redundancy
[Figure labels: VOTE, VOTE, VOTE]
Accomplished by masking an error N times
Works similarly to TMR
Masks symmetrical and asymmetrical failures
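Generalizing the vote from three modules to N can be sketched as follows. The function names are hypothetical, and the sketch assumes a strict-majority voter, under which N modules can mask at most (N - 1) // 2 faulty outputs.

```python
from collections import Counter

def max_masked_faults(n):
    """Number of faulty modules a strict-majority vote over n modules can mask."""
    return (n - 1) // 2

def nmr_vote(outputs):
    """Return the value produced by a strict majority of the module outputs."""
    value, votes = Counter(outputs).most_common(1)[0]
    if votes <= len(outputs) // 2:
        raise RuntimeError("no strict majority among module outputs")
    return value

# Five modules, two of them faulty: the three agreeing outputs win.
five_way = nmr_vote([7, 7, 9, 7, 3])
print(five_way, max_masked_faults(5))  # 7 2
```

With N = 3 this reduces to the TMR case, where one faulty result is masked.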
Standby Sparing
[Figure labels: INPUT, OUTPUT, COMPONENT 1, FAULT DETECTOR, COMPONENT 3, SWITCH]
Replicate spares in the system (duplicate components)
Spares are activated when a fault is detected
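A rough software analogue of standby sparing is shown below. The class name, the component functions, and the fault detector are all assumptions made for illustration; a real fault detector would be hardware or system specific.

```python
class StandbySparing:
    """Route work to the active component and switch to the next spare on a fault."""

    def __init__(self, components, fault_detector):
        self.components = list(components)    # index 0 is the primary, the rest are spares
        self.fault_detector = fault_detector  # returns True when a result looks faulty
        self.active = 0

    def run(self, x):
        while self.active < len(self.components):
            try:
                result = self.components[self.active](x)
                faulty = self.fault_detector(x, result)
            except Exception:
                faulty = True  # a crash also counts as a detected fault
            if not faulty:
                return result
            self.active += 1   # the "switch": retire this component, activate the next spare
        raise RuntimeError("all spares exhausted")

def stuck_component(x):
    return -1                  # hypothetical fault: output stuck at -1

def crashing_component(x):
    raise OSError("component down")

def healthy_component(x):
    return 2 * x

system = StandbySparing(
    [stuck_component, crashing_component, healthy_component],
    fault_detector=lambda x, result: result < 0,  # assumes valid outputs are non-negative
)
first = system.run(4)
second = system.run(5)
print(first, second, system.active)  # 8 10 2
```

Once a spare has been switched in, later requests go straight to it; the failed components are not retried.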
Duplex Systems
[Figure labels: COMPARE, RESULT]
Execute the task twice
Compare results for discrepancies
Execution can occur on separate hardware or sequentially on the same hardware
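The duplex comparison can be sketched as below, running the two executions sequentially. The function names are hypothetical. Unlike a vote, a comparison of two results can only detect a discrepancy; it cannot tell which result is wrong.

```python
def duplex(task_a, task_b, *args):
    """Execute the task twice and compare; raise if the two results disagree."""
    first = task_a(*args)
    second = task_b(*args)
    if first != second:
        raise RuntimeError(f"duplex mismatch: {first!r} != {second!r}")
    return first

def total(values):
    return sum(values)

def faulty_total(values):
    return sum(values) + 1     # hypothetical injected fault

agreed = duplex(total, total, [1, 2, 3])
try:
    duplex(total, faulty_total, [1, 2, 3])
    detected = False
except RuntimeError:
    detected = True
print(agreed, detected)  # 6 True
```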
An example of a hardware fault tolerant system
Stratus servers – Fault tolerant hardware servers that use TMR and fully replicated hardware design to provide fault tolerance.
http://www.stratus.com
Software Fault Tolerant Techniques
Two main areas:
Provide for static redundancy
Provide for dynamic redundancy
N-Version Programming
Recovery Blocks or Primary-Backup technique
N-Version programming
Run n versions of a program on n processes
Forward recovery scheme that masks faults
Relies on voting mechanisms
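As a small sketch of N-version programming, the example below votes over three differently written versions of an integer square root. The versions run in one process here rather than on n processes, and all function names are illustrative; `isqrt_buggy` is a deliberately faulty version.

```python
import math
from collections import Counter

def isqrt_builtin(n):
    return math.isqrt(n)

def isqrt_newton(n):
    """Integer square root by Newton's iteration."""
    if n < 2:
        return n
    x = n
    while True:
        y = (x + n // x) // 2
        if y >= x:
            return x
        x = y

def isqrt_bisect(n):
    """Integer square root by binary search."""
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

def isqrt_buggy(n):
    return int(n ** 0.5) + 1   # hypothetical faulty version: off by one

def n_version(versions, n):
    """Run every version on the same input and return the strict-majority answer."""
    outputs = [version(n) for version in versions]
    value, votes = Counter(outputs).most_common(1)[0]
    if votes <= len(outputs) // 2:
        raise RuntimeError(f"versions disagree: {outputs!r}")
    return value

answer = n_version([isqrt_builtin, isqrt_newton, isqrt_buggy], 10)
print(answer)  # 3
```

The faulty version is outvoted without any rollback, which is what makes this a forward recovery scheme.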
Agreement problems
Agreement problems arise when a processor is faulty and the other non-faulty processors have to agree on a course of action
Some agreement problems covered in my paper:
Byzantine Generals Protocol
Consensus Problem
Interactive Consistency
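To make the Byzantine Generals problem concrete, here is a small simulation of one classic solution, the oral-messages algorithm for a single traitor among four generals (one commander, three lieutenants). The slides do not say which protocol the paper presents, so this choice is an assumption, and the names `om1`, `flip`, and `majority` are illustrative.

```python
from collections import Counter

ATTACK, RETREAT = "attack", "retreat"

def flip(order):
    return RETREAT if order == ATTACK else ATTACK

def majority(orders, default=RETREAT):
    """Most common order; fall back to a default when the top two are tied."""
    ranked = Counter(orders).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return default
    return ranked[0][0]

def om1(orders_from_commander, traitor=None):
    """Round 1: the commander sends each lieutenant an order (the dict argument).
    Round 2: every lieutenant relays what it received to the others; a traitorous
    lieutenant relays the opposite. Each loyal lieutenant decides by majority."""
    lieutenants = sorted(orders_from_commander)
    decisions = {}
    for me in lieutenants:
        if me == traitor:
            continue                       # a traitor's decision is irrelevant
        seen = [orders_from_commander[me]]
        for other in lieutenants:
            if other != me:
                relayed = orders_from_commander[other]
                seen.append(flip(relayed) if other == traitor else relayed)
        decisions[me] = majority(seen)
    return decisions

# Loyal commander, lieutenant 3 is the traitor: loyal lieutenants follow the order.
loyal_commander = om1({1: ATTACK, 2: ATTACK, 3: ATTACK}, traitor=3)
# Traitorous commander sends conflicting orders: loyal lieutenants still agree.
traitor_commander = om1({1: ATTACK, 2: RETREAT, 3: ATTACK})
print(loyal_commander, traitor_commander)
```

In both runs the loyal lieutenants reach the same decision, which is the agreement property the problem asks for.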
Application of agreement protocols
Fault tolerant clock syncs – non faulty processes must have clocks that are approximately equal in value
Atomic commits – process actions have certain characteristics that must be followed (indivisible, instantaneous, non-revealing state changes, etc.)
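For the clock synchronization requirement, one common approach is a trimmed average: each process collects clock readings from all processes, discards the f highest and f lowest, and averages the rest, so that up to f arbitrary readings cannot drag the result outside the range of the correct clocks. The sketch below assumes that scheme and the usual n > 3f condition; it is not taken from the slides, and the function name is hypothetical.

```python
def fault_tolerant_average(readings, f):
    """Average the clock readings after discarding the f lowest and f highest."""
    if len(readings) <= 3 * f:
        raise ValueError("need more than 3f readings to tolerate f faulty clocks")
    ordered = sorted(readings)
    trimmed = ordered[f:len(ordered) - f]
    return sum(trimmed) / len(trimmed)

# Four clocks, one of them wildly wrong (500.0); f = 1 removes its influence.
adjusted = fault_tolerant_average([100.0, 100.2, 99.9, 500.0], f=1)
print(round(adjusted, 3))  # 100.1
```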
Recovery Blocks
[Figure labels: Client, Process, Backup 1, Backup 2; messages: request, update, reply, ack]
Backward error recovery scheme
Also known as primary-backup approach
Relies on acceptance tests
Checks output is within an acceptable range
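The recovery block structure (try the primary, apply the acceptance test, roll back and try the next alternate on failure) can be sketched as follows. The function names and the sorting example are illustrative assumptions; the checkpoint here is simply a deep copy of the input state.

```python
import copy

def recovery_block(alternates, acceptance_test, state):
    """Try each alternate on a fresh copy of `state`; return the first accepted result."""
    for alternate in alternates:
        working_copy = copy.deepcopy(state)   # checkpoint: the original is never touched
        try:
            result = alternate(working_copy)
        except Exception:
            continue                          # a crash is treated like a failed test
        if acceptance_test(result):
            return result
        # Rejected: discard the working copy (backward recovery) and try the next one.
    raise RuntimeError("all alternates failed the acceptance test")

def primary_sort(values):
    values.sort(reverse=True)                 # hypothetical bug: sorts in the wrong order
    return values

def backup_sort(values):
    return sorted(values)

def is_ascending(values):
    return all(a <= b for a, b in zip(values, values[1:]))

data = [3, 1, 2]
accepted = recovery_block([primary_sort, backup_sort], is_ascending, data)
print(accepted, data)  # [1, 2, 3] [3, 1, 2]
```

The primary's in-place damage never reaches `data`, because each alternate only ever sees its own copy of the checkpointed state.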
Error Detection Techniques
Effectiveness of any fault tolerant system depends on the effectiveness of its error detection techniques
Early detection or late detection
Concept of acceptability determines the thoroughness of error detection on a distributed system
Error Detection Techniques
Replication Checks
Timing Checks
Structural Checks
Reasonableness Checks
Reversal Checks
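Three of these checks are easy to show on a square root computation: a timing check (did the result arrive within a deadline), a reasonableness check (is the result in a plausible range), and a reversal check (does squaring the result give back the input). The sketch below is an illustration under those assumptions; `run_checks`, the deadline value, and the faulty functions are hypothetical.

```python
import math
import time

def run_checks(compute, x, deadline_s=1.0):
    """Run `compute(x)` as a square root and return (result, list of failed checks)."""
    start = time.monotonic()
    root = compute(x)
    elapsed = time.monotonic() - start

    failed = []
    if elapsed > deadline_s:                   # timing check
        failed.append("timing")
    if not (0.0 <= root <= max(1.0, x)):       # reasonableness check: plausible range
        failed.append("reasonableness")
    if not math.isclose(root * root, x, rel_tol=1e-9, abs_tol=1e-12):
        failed.append("reversal")              # reversal check: invert and compare
    return root, failed

good = run_checks(math.sqrt, 9.0)
negative = run_checks(lambda x: -3.0, 9.0)     # squares back to 9, but out of range
halved = run_checks(lambda x: x / 2, 9.0)      # in range, but does not square back
print(good, negative[1], halved[1])
```

The two faulty results are each caught by a different check, which shows why a system typically combines several detection techniques.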
Conclusion
Many different means by which fault tolerance can be provided on a distributed system
Sections not covered include error recovery and fault treatment