fault tolerance in parallel systems v6 3meseec.ce.rit.edu/756-projects/spring2006/d2/5/fault...why...
TRANSCRIPT
Fault Tolerance in Parallel Systems
Jamie Boeheim
Sarah Kay
May 18, 2006
Outline
  What is a fault?
  Reasons for fault tolerance
  Desirable characteristics
  Fault detection
  Methodology
    Hardware
      Redundancy
      Routing
      Example schemes
    Software
      Operating system
      Software
      Routing protocols
      Example system
  Cost
  Conclusions
  References
What is a fault?
  "…an abnormal condition or defect at the component, equipment, or sub-system level which may lead to a failure."
  Hardware: A defect in a circuit or wiring caused by imperfect connections, poor insulation, grounding, or shorting.
  Software: An accidental condition, or a manifestation of a programming mistake, that may cause a system or component not to perform as required.
Why Is Fault Tolerance Necessary?
  In a standard system, a fault can disrupt work on the processor and destroy the data. A single fault can propagate through the entire system.
  For components in series, the system works only if every component works, so the system reliability is the product of the individual component reliabilities. Equivalently, assuming independent failures, the probability of a system failure is one minus the product of each component's probability of not failing, and it grows as components are added.
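As a rough numerical illustration of the series case, the sketch below computes the failure probability of a chain of components. It assumes failures are independent; the function name and the example values are illustrative and do not come from the slides.

```python
from math import prod


def series_failure_probability(component_failure_probs):
    """Probability that a series system fails.

    Assumes independent component failures: the system survives only if
    every component survives, so reliability is the product of the
    per-component survival probabilities.
    """
    reliability = prod(1.0 - p for p in component_failure_probs)
    return 1.0 - reliability


# Hypothetical example: 100 components, each failing with probability 0.01.
p_system = series_failure_probability([0.01] * 100)
print(f"system failure probability: {p_system:.3f}")
```

Even with highly reliable individual components, the failure probability of the whole chain rises quickly with the number of components, which is the motivation for adding fault tolerance.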
Desirable Characteristics
  A large number of PE-disjoint paths between any pair of PEs, for increased reliability and fault tolerance
  Message routing that is simple to implement and flexible enough to route around faulty PEs in the network
  Graceful degradation in performance with an increasing number of faults
  (PE: Processing Element)
Fault Detection
  Faults can occur in a processor, in the network, or in data
  Processor faults: errors in the processor itself; can be detected by processor status bits or external result comparison
  Network faults: broken links; can be detected with link status information
  Data faults: errors in data; can be detected with parity bits, error checking codes, etc.
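Data fault detection with a parity bit can be sketched as follows. This is a minimal illustration of even parity on non-negative integers; the function names are hypothetical, and a real system would typically use a stronger error checking code.

```python
def even_parity_bit(data: int) -> int:
    """Return the bit that makes the total number of 1 bits even."""
    return bin(data).count("1") % 2


def encode(data: int) -> int:
    """Append the even-parity bit as the least significant bit."""
    return (data << 1) | even_parity_bit(data)


def has_error(word: int) -> bool:
    """A valid word has an even number of 1 bits; odd means a fault."""
    return bin(word).count("1") % 2 == 1


word = encode(0b1011001)
corrupted = word ^ 0b100  # flip a single bit in transit
print(has_error(word), has_error(corrupted))
```

A single parity bit detects any odd number of flipped bits but misses an even number of flips, which is one reason stronger codes are used when data integrity is critical.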
Fault Tolerance Methodologies
  Hardware
    Redundancy
    Limited Routing
    FTPA
  Software
    Check Pointing
    FDIR
    GENESIS Cluster Support
    Routing protocols
      Crosshatch
      Meshes/Tori
Hardware Fault Tolerance
Redundancy: Processing Nodes
  Have multiple processing elements perform the same calculations. Compare the results to find the correct value.
  In a simple computation (e.g. systolic multiplication)
    "Majority rules"
    Simple comparator used for selection of the result
  More complex/critical systems
    Confidence voting
    More complex logic required, which adds more possibilities of failure
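The "majority rules" selection can be sketched in software as below. This is only an illustration of the voting idea, assuming each redundant PE returns a hashable result; in the hardware schemes described here the voter would be a comparator circuit, and the function name is hypothetical.

```python
from collections import Counter


def majority_vote(results):
    """Return the value produced by a strict majority of redundant PEs.

    Raises ValueError when no value has more than half of the votes,
    since the correct result cannot then be determined by voting alone.
    """
    value, count = Counter(results).most_common(1)[0]
    if count * 2 > len(results):
        return value
    raise ValueError("no majority among redundant results")


# Three PEs compute the same product; one of them is faulty.
print(majority_vote([42, 42, 41]))
```

With three replicas this masks a single faulty PE; tolerating more simultaneous faults requires more replicas.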
Redundancy: Links
  Multiple links between processing nodes
  If a failure is detected on one link, stop sending/accepting packets on that link
  Move communication to an unused link
  Split messages assigned to the nonfunctional link among the other links (requires some software intervention)
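The step of splitting a failed link's messages among the remaining links might look like the sketch below. The queue representation (a dictionary from link id to a list of pending messages) and the round-robin policy are assumptions made for illustration; the slides do not specify either.

```python
def redistribute(queues, failed_link):
    """Return new per-link queues with the failed link's traffic reassigned.

    queues: dict mapping a link id to the list of messages pending on it.
    Pending messages from the failed link are spread round-robin over the
    healthy links, and the failed link is left out of the result so that
    nothing further is sent on it.
    """
    healthy = [link for link in queues if link != failed_link]
    if not healthy:
        raise RuntimeError("no functional link left between the nodes")
    new_queues = {link: list(queues[link]) for link in healthy}
    for i, message in enumerate(queues[failed_link]):
        new_queues[healthy[i % len(healthy)]].append(message)
    return new_queues


queues = {"A": ["m1"], "B": ["m2", "m3", "m4"], "C": []}
print(redistribute(queues, "B"))
```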
Routing
  Once a fault is detected, the offending link or processing node needs to be fixed, masked, or avoided.
  Masking was shown for redundant systems; where redundancy is not available, routing around the fault is important.
  Only a limited amount of routing can be done directly by the hardware.
  Figure labels: fault is masked (redundant system); fault must be routed around (no redundancy)
FTPA (Fault Tolerant Processor Array)
  Designed to route data around nonfunctional processors. The design had to determine where to route data while minimizing communication time.
FTPA (continued)
  Local redundancy: a single redundant cell assigned to a small cluster can replace one of the cells
  Set switching: swap out the entire block if an error occurs in one cell
  Processor switching: switches remove damaged processors from the pipeline and add spare nodes to handle the necessary operations
Summary of redundancy techniques
  Scheme               Simplicity  Efficiency  Area
  Set Switching        Good        Poor        Poor
  Local Redundancy     Good        Fair        Fair
  Processor Switching  Fair        Good        Fair
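The processor switching idea can be modeled in software as a remapping of pipeline stages onto spare nodes. This is a loose sketch under assumed data structures (lists of processor ids); the actual FTPA performs the substitution with hardware switches, and the names below are hypothetical.

```python
def remap_pipeline(stages, spares, faulty):
    """Replace faulty processors in a pipeline with working spares.

    stages: physical processor ids in pipeline order.
    spares: ids of spare processors, in order of preference.
    faulty: set of ids known to be damaged.
    Returns the new pipeline and the spares that remain unused.
    """
    available = [s for s in spares if s not in faulty]
    new_stages = []
    for proc in stages:
        if proc in faulty:
            if not available:
                raise RuntimeError("no spare left for faulty processor")
            new_stages.append(available.pop(0))
        else:
            new_stages.append(proc)
    return new_stages, available


pipeline, remaining = remap_pipeline(["p0", "p1", "p2"], ["s0", "s1"], {"p1"})
print(pipeline, remaining)
```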
Software Fault Tolerance
Check Pointing
  Copy process resources/state to stable storage
  Non-deterministic events should be prevented while the checkpoint is created (e.g. by blocking the process's inter-process communication to stop rollback propagation)
  If a fault occurs, the process can be restarted on the same or a different PE by simply copying the saved process state
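A minimal single-process sketch of checkpoint and restart follows. It assumes the process state fits in a picklable dictionary and that a local file stands in for stable storage; the function names, the checkpoint interval of 10 steps, and the simulated fault are all illustrative. It does not address coordinating checkpoints across communicating processes.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write state to a temporary file, then atomically rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # a crash never leaves a half-written checkpoint


def load_checkpoint(path):
    """Read back the most recently saved state."""
    with open(path, "rb") as f:
        return pickle.load(f)


def run(total_steps, path, fail_at=None):
    """Sum the integers below total_steps, checkpointing every 10 steps."""
    if os.path.exists(path):
        state = load_checkpoint(path)  # restart from the saved state
    else:
        state = {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated fault")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % 10 == 0:
            save_checkpoint(state, path)
    return state["acc"]
```

Calling `run` again with the same path after a fault resumes from the last checkpoint rather than from step 0; work done since that checkpoint is repeated, which is the usual cost of the approach.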
FDIR (Fault Detection, Isolation, and Recovery)
  Used with the NASA X-38 experimental vehicle processors
  Software is used to track where faults occur and, if necessary, provide recovery with some form of backup.
GENESIS Cluster Support
  Check pointing is transparent to the programmer
  Check pointing is similar to process duplication
  High performance
  Low overhead
Alternative Software Approaches
  CALYPSO
  Cocheck: checkpoint based
  Manetho: log based
  Fault Tolerant MPI
Routing Protocols
  After an erroneous module or link has been found, a way to avoid it should be determined.
  Even with masking, only a limited number of faults can be tolerated.
  Software allows for a more flexible design.
Crosshatch Routing
  Each switch knows the fault status of the switches to which it is connected
  In case of a fault, packets are transmitted around the fault without changing the switching technique
  Rerouted messages may deadlock, as they take space on routes not intended to handle them
  One way to avoid the deadlock is to designate certain switches to handle fault conditions
Meshes/Tori
  Adaptive routing around the failure area (single PE or block)
  Reconfigure the routing table to adapt to the new topology after a failure
  Tradeoff: flexibility vs. performance
  Minimize use of additional resources (e.g. virtual channels)
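Reconfiguring a routing table after a failure can be sketched for a 2D mesh as below. This is not the methodology from the referenced paper: it is a plain breadth-first search that rebuilds next-hop entries while skipping faulty PEs, it ignores wrap-around links (so it covers a mesh, not a torus), and it makes no attempt at deadlock avoidance. All names are hypothetical, and the destination is assumed to be healthy.

```python
from collections import deque


def build_next_hops(width, height, dest, faulty):
    """Map each reachable healthy node to its next hop toward dest.

    Nodes are (x, y) tuples. The search starts at dest and expands
    outward, so every entry lies on a shortest fault-free path.
    """
    next_hop = {dest: dest}
    queue = deque([dest])
    while queue:
        x, y = queue.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            node = (nx, ny)
            inside = 0 <= nx < width and 0 <= ny < height
            if inside and node not in faulty and node not in next_hop:
                next_hop[node] = (x, y)
                queue.append(node)
    return next_hop


def route(src, dest, table):
    """Follow next-hop entries from src until dest is reached."""
    path = [src]
    while path[-1] != dest:
        path.append(table[path[-1]])
    return path


# 3x3 mesh with the center PE failed: traffic must detour around it.
table = build_next_hops(3, 3, dest=(2, 1), faulty={(1, 1)})
print(route((0, 1), (2, 1), table))
```

Nodes cut off from the destination by faults simply have no entry in the table, so a lookup for them raises `KeyError`.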
Cost of Fault Tolerance
  Hardware
    Redundant hardware requires extra space
    Major issue in massively parallel machines
    Performance may be lost if, instead of duplicating hardware, some of the existing hardware is dedicated to fault tolerance
  Software
    Performance degradation with checks
    Memory requirements
Conclusions
  The added cost of fault tolerance is necessary when
    PEs are inherently error-prone (nanotechnology)
    Long-term projects require extended reliability (space exploration)
    Accuracy of results is essential (banking transactions)
  Hardware fault tolerance has less system overhead but is not flexible
  Software fault tolerance has more system overhead but adapts better to individual implementations
References
  KleinOsowski, A. et al. "The Recursive NanoBox Processor Grid: A Reliable System Architecture for Unreliable Nanotechnology Devices". IEEE. 2004.
  Gómez, M.E. et al. "An Efficient Fault-Tolerant Routing Methodology for Meshes and Tori".
  Baratloo, A. et al. "Calypso: A Novel Software System for Fault-Tolerant Parallel Processing on Distributed Platforms".
  Racine, R. et al. "Design of a Fault-Tolerant Parallel Processor". IEEE. 2002.
  Rough, J., Goscinski, A. "Exploiting operating system services to efficiently checkpoint parallel applications in GENESIS". Algorithms and Architectures for Parallel Processing. 2002.
  Yasudo et al. "Deadlock-free Fault-tolerant Routing in the Multi-dimensional Crossbar Network and Its Implementation for the Hitachi SR2201".
  Chean, M., Fortes, J. "A Taxonomy of Reconfiguration Techniques for Fault-Tolerant Processor Arrays". Survey & Tutorial Series. 1990.
  Harper, R. et al. "Fault Tolerant Parallel Processor Architecture Overview". IEEE. 1988.
Questions