fault tolerance in parallel systems v6 3meseec.ce.rit.edu/756-projects/spring2006/d2/5/fault...why...

Fault Tolerance in Parallel Systems

Jamie Boeheim
Sarah Kay

May 18, 2006

Outline
What is a fault?
Reasons for fault tolerance
Desirable characteristics
Fault detection
Methodology
  Hardware
    Redundancy
    Routing
    Example schemes
  Software
    Operating system
    Software
    Routing protocols
    Example system
Cost
Conclusions
References

What is a fault?
“…an abnormal condition or defect at the component, equipment, or sub-system level which may lead to a failure.”

Hardware
  A defect in a circuit or wiring caused by imperfect connections, poor insulation, grounding, or shorting.

Software
  An accidental condition, or a manifestation of a programming mistake, that may cause a system or component not to perform as required.

Why Is Fault Tolerance Necessary?

In a standard system, a fault can disrupt work on the processor and destroy the data. A single fault can propagate through the entire system.

For components in series, the system works only if every component works. Assuming independent failures, the system reliability is the product of the individual component reliabilities, so the probability of a system failure is one minus that product and grows with every added component.
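A minimal sketch of the series calculation, assuming component failures are independent. The function names and the example probabilities are illustrative only and do not come from the slides; the redundant case is included for contrast.

```python
from math import prod


def series_failure_probability(failure_probs):
    """Probability that a series system fails.

    Assumption: failures are independent. The system works only if every
    component works, so reliability is the product of (1 - p_i).
    """
    reliability = prod(1.0 - p for p in failure_probs)
    return 1.0 - reliability


def redundant_failure_probability(failure_probs):
    """Probability that a fully redundant group fails.

    Assumption: failures are independent and one working copy is enough,
    so the group fails only if every copy fails.
    """
    return prod(failure_probs)


if __name__ == "__main__":
    # Hypothetical example: 100 components, each failing with probability 1%.
    print(series_failure_probability([0.01] * 100))
    # The same component triplicated instead of chained.
    print(redundant_failure_probability([0.01] * 3))
```

The series result climbs toward 1 as components are added, which is the motivation for adding redundancy rather than relying on every part working.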

Desirable Characteristics
A large number of PE-disjoint paths between any pair of PEs, for increased reliability and fault tolerance
Message routing that is simple to implement and flexible enough to route around faulty PEs in the network
Graceful degradation in performance as the number of faults increases

PE: Processing Element

Fault Detection

Faults can occur in a processor, in the network, or in data.

Processor faults: errors in the processor itself; can be detected by processor status bits or external result comparison
Network faults: broken links; can be detected with link status information
Data faults: errors in data; can be detected with parity bits, error checking codes, etc.
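As an illustration of data-fault detection, here is a minimal even-parity sketch. The function names are hypothetical, and real systems usually compute parity in hardware per word rather than per byte string.

```python
def parity_bit(data: bytes) -> int:
    """Even-parity bit: 1 if the data contains an odd number of 1 bits."""
    ones = sum(bin(byte).count("1") for byte in data)
    return ones % 2


def passes_parity_check(data: bytes, stored_parity: int) -> bool:
    """True if the data still matches the parity bit stored with it."""
    return parity_bit(data) == stored_parity


if __name__ == "__main__":
    word = b"\x0f"                 # four 1 bits
    stored = parity_bit(word)
    corrupted = b"\x0e"            # one bit flipped
    print(passes_parity_check(word, stored))
    print(passes_parity_check(corrupted, stored))
```

A single parity bit detects any odd number of flipped bits but misses an even number of flips, which is why stronger error checking codes are used when that matters.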

Fault Tolerance Methodologies
Hardware
  Redundancy
  Limited Routing
  FTPA
Software
  Check Pointing
  FDIR
  GENESIS Cluster Support
  Routing protocols
    Crosshatch
    Meshes/Tori

Redundancy: Processing Nodes
Hardware Fault Tolerance

Have multiple processing elements perform the same calculations. Compare the results to find the correct value.

In a simple computation (e.g. systolic multiplication)
  “Majority rules”
  Simple comparator used for selection of result

In more complex/critical systems
  Confidence voting
  More complex logic required, which introduces more possibilities of failure
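A minimal software sketch of "majority rules" voting across redundant processing elements. In hardware this would be a comparator circuit; the function name and the choice to raise an error when there is no strict majority are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(results):
    """Return the value reported by a strict majority of the PEs.

    Assumption: results are hashable values, one per redundant PE.
    Raises ValueError when no value has more than half of the votes.
    """
    if not results:
        raise ValueError("no results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority among redundant results")
    return value


if __name__ == "__main__":
    # Three PEs compute the same product; one of them is faulty.
    print(majority_vote([42, 42, 7]))
```

With three copies the voter masks one faulty PE; masking more faults requires more copies.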

Redundancy: Links
Multiple links between processing nodes

If a failure is detected on one link, stop sending/accepting packets on that link
  Move communication to an unused link
  Split messages assigned to the nonfunctional link among other links (some software intervention)
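The link-redundancy idea can be sketched as a small scheduler that stops using failed links and spreads messages over the remaining ones. The round-robin policy, the function name, and the boolean link-status list are assumptions for illustration, not the mechanism of any particular machine.

```python
def assign_messages(messages, link_ok):
    """Distribute messages round-robin over the links that still work.

    link_ok is a list of booleans, one per physical link (hypothetical
    link-status information). Returns {link_index: [messages]}.
    """
    healthy = [i for i, ok in enumerate(link_ok) if ok]
    if not healthy:
        raise RuntimeError("no functional link between the nodes")
    plan = {i: [] for i in healthy}
    for n, message in enumerate(messages):
        plan[healthy[n % len(healthy)]].append(message)
    return plan


if __name__ == "__main__":
    # Link 1 has failed, so its traffic is split between links 0 and 2.
    print(assign_messages(["a", "b", "c", "d"], [True, False, True]))
```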


Routing
Once a fault is detected, the offending link or processing node needs to be fixed, masked, or avoided.
Masking was shown for redundant systems; where redundancy is not available, routing around the fault is important.
Only a limited amount of routing can be done directly by the hardware.

[Figure: fault is masked (redundant system)]
[Figure: fault must be routed around (no redundancy)]


FTPA (Fault Tolerant Processor Array)

Designed to route data around nonfunctional processors. The design had to determine where to route data while trying to minimize communication time.

Local redundancy: a single redundant cell assigned to a small cluster can replace one of the cells
Set switching: swap out the entire block if an error occurs in one cell
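Local redundancy can be sketched as a remapping step inside one cluster: the single spare takes over for at most one faulty cell. The cell identifiers, the function name, and the convention of returning None for an unrepairable cluster are hypothetical.

```python
def remap_cluster(cells, spare, faulty):
    """Map logical cell positions to working physical cells in one cluster.

    cells  -- physical cell ids in logical order (hypothetical ids)
    spare  -- id of the cluster's single redundant cell
    faulty -- set of cell ids known to be faulty
    Returns the new mapping, or None if the one spare cannot cover the faults.
    """
    bad = [c for c in cells if c in faulty]
    if not bad:
        return list(cells)          # nothing to repair
    if len(bad) > 1 or spare in faulty:
        return None                 # local redundancy is exhausted
    return [spare if c in faulty else c for c in cells]


if __name__ == "__main__":
    # Cluster of four cells with spare cell 4; cell 2 has failed.
    print(remap_cluster([0, 1, 2, 3], 4, {2}))
```

When the function returns None, a coarser scheme such as set switching would have to replace the whole block.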

FTPA (continued)

Processor switching: switches remove damaged processors from the pipeline and add spare nodes to handle the necessary operations

Summary of redundancy techniques

Scheme               Simplicity  Efficiency  Area
Set Switching        Good        Poor        Poor
Local Redundancy     Good        Fair        Fair
Processor Switching  Fair        Good        Fair

Check Pointing
Software Fault Tolerance

Copy process resources/state to stable storage
  Non-deterministic events should be prevented while the checkpoint is created (e.g. by blocking the process's inter-process communication to stop rollback propagation)
If a fault occurs, the process can be restarted on the same or a different PE by simply copying the saved process state

FDIR (Fault Detection, Isolation, and Recovery)
Software Fault Tolerance

Used with the NASA X-38 experimental vehicle processors
Software tracks where faults occur and, if necessary, provides recovery with some form of backup.

GENESIS Cluster Support
Software Fault Tolerance

Check pointing is transparent to the programmer
Check pointing is similar to process duplication
High performance
Low overhead

Alternative Software Approaches
Software Fault Tolerance

CALYPSO
Cocheck: checkpoint based
Manetho: log based
Fault Tolerant MPI

Routing Protocols

After an erroneous module or link has been found, a way to avoid it should be determined.
Even with masking, only a limited number of faults can be tolerated.
Software allows for more flexible design.

Crosshatch Routing
Each switch knows the fault status of the switches to which it is connected
In case of a fault, packets are transmitted around the fault without changing the switching technique
Rerouted messages may deadlock as they take space on routes not intended to handle them
  One way to avoid the deadlock is to designate certain switches to handle fault conditions


Meshes/Tori

Adaptive routing around the failure area (single PE or block)
Reconfigure the routing table to adapt to the new topology after a failure
Tradeoff: flexibility vs. performance
Minimize use of additional resources (e.g. virtual channels)
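Routing around a failure area in a 2D mesh can be sketched as a shortest-path search that skips faulty PEs. This is a centralized breadth-first search used purely for illustration; it is not the routing methodology from the referenced work and it ignores deadlock avoidance and virtual channels. The coordinates and function name are assumptions.

```python
from collections import deque


def route(width, height, src, dst, faulty):
    """Shortest path from src to dst in a width x height mesh, avoiding faulty PEs.

    Nodes are (x, y) tuples; faulty is a set of such tuples.
    Returns the list of nodes on the path, or None if dst is unreachable.
    """
    if src in faulty or dst in faulty:
        return None
    previous = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:      # walk back to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            inside = 0 <= nx < width and 0 <= ny < height
            if inside and nxt not in faulty and nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


if __name__ == "__main__":
    # 3x3 mesh with PE (1, 0) faulty: the path detours through row 1.
    print(route(3, 3, (0, 0), (2, 0), {(1, 0)}))
```

For a torus, the neighbor computation would wrap around with a modulo instead of checking the mesh boundaries.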


Cost of Fault Tolerance

Hardware
  Redundant hardware requires extra space
    Major issue in massively parallel machines
  Performance may be lost if, instead of duplicating hardware, some of the existing hardware is dedicated to fault tolerance
Software
  Performance degradation with checks
  Memory requirements

Conclusions
The added cost of fault tolerance is necessary when:
  PEs are inherently error-prone (e.g. nanotechnology)
  Long term projects require extended reliability (e.g. space exploration)
  Accuracy of results is essential (e.g. banking transactions)

Hardware fault tolerance has less system overhead but is not flexible
Software fault tolerance has more system overhead but better adaptability for individual implementations

References
KleinOsowski, A. et al. “The Recursive NanoBox Processor Grid: A Reliable System Architecture for Unreliable Nanotechnology Devices”. IEEE. 2004.
Gómez, M.E. et al. “An Efficient Fault-Tolerant Routing Methodology for Meshes and Tori”.
Baratlooz, A. et al. “Calypso: A Novel Software System for Fault-Tolerant Parallel Processing on Distributed Platforms”.
Racine, R. et al. “Design of a Fault-Tolerant Parallel Processor”. IEEE. 2002.
Rough, J., Goscinski, A. “Exploiting operating system services to efficiently checkpoint parallel applications in GENESIS”. Algorithms and Architectures for Parallel Processing. 2002.
Yasudo et al. “Deadlock-free Fault-tolerant Routing in the Multi-dimensional Crossbar Network and Its Implementation for the Hitachi SR2201”.
Chean, M., Fortes, J. “A Taxonomy of Reconfiguration Techniques for Fault-Tolerant Processor Arrays”. Survey & Tutorial Series. 1990.
Harper, R. et al. “Fault Tolerant Parallel Processor Architecture Overview”. IEEE. 1988.

Questions
