SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery
Daniel J. Sorin, et al., University of Wisconsin-Madison
Presented by: Nick Kirchem, March 26, 2004
Motivation and Goals
- Availability is crucial
  - Internet services and database management systems are highly relied upon
  - Unless the architecture changes, availability will decrease as the number of components increases
- Goals for the paper
  - A lightweight mechanism providing end-to-end recovery from transient and permanent faults
  - Decouple recovery from detection (use traditional detection techniques: RAID, ECC, duplicate ALUs, etc.)
Solution: SafetyNet
- A global checkpoint/recovery scheme
- Creates periodic system-wide logical checkpoints
- Logs all changes to the architected state
Recovery Scheme Challenges
(1) Saving previous values before every register/cache update or coherence message would require too much storage space
(2) All processors and components must recover to a consistent point
(3) SafetyNet must determine when it is safe to roll back to a recovery checkpoint
(1) Checkpointing via Logging
- Checkpoints contain a complete copy of the system's architectural state
- Taken at coarse granularity (e.g., every 100,000 cycles)
- A block is logged only on the first altering action per checkpoint interval
(2) Consistent Checkpoints
- All components coordinate local checkpoints through logical time
- A coherence transaction appears logically atomic once it completes
- Point of atomicity (PoA)
  - Occurs when the previous owner processes the request
  - The response includes the checkpoint number (CN) of this PoA
- The requestor does not advance its recovery point until all outstanding transactions are complete
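The recovery-point rule can be sketched as follows (Python purely for illustration; the class and method names are invented, not from the paper):

```python
# Illustrative sketch (not the paper's hardware): a requestor tracks its
# in-flight coherence transactions and advances its recovery point only
# once all of them have completed.

class Requestor:
    def __init__(self):
        self.outstanding = set()   # ids of coherence transactions in flight
        self.recovery_point = 0    # checkpoint it is currently safe to roll back to

    def issue(self, txn_id):
        self.outstanding.add(txn_id)

    def response(self, txn_id, poa_cn):
        # the response carries the CN assigned at the point of atomicity
        self.outstanding.discard(txn_id)

    def try_advance(self, new_cn):
        # safe to advance only when no transaction is outstanding
        if not self.outstanding:
            self.recovery_point = new_cn
        return self.recovery_point
```

While transaction 1 is outstanding, `try_advance` refuses to move the recovery point; once its response (with the PoA's CN) arrives, advancement succeeds.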
(2) PoA Example
(3) Validating Checkpoints
- States: current state, checkpoints waiting to be validated, recovery point
- Validation: determining which checkpoint becomes the recovery point
  - All prior execution must be fault-free
- Coordination is pipelined and performed in the background (off the critical path)
(3) Validation (continued)
- Validation latency depends on fault detection latency
- Output commit problem
  - Delay all output events until the checkpoint that produced them is validated
  - Cost depends on validation latency
- Input commit problem
  - Log incoming messages so they can be replayed after a recovery
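A minimal sketch of the output-commit rule (the buffer structure and names below are invented for illustration):

```python
# Sketch: externally visible outputs are buffered, tagged with the
# checkpoint number (CN) of the interval that produced them, and released
# only once the recovery point has advanced past that interval (i.e., the
# interval has been validated as fault-free).

class OutputBuffer:
    def __init__(self):
        self.pending = []              # list of (cn, message)

    def emit(self, cn, message):
        self.pending.append((cn, message))

    def on_validate(self, rpcn):
        # release outputs from intervals at or before the new recovery point
        ready = [m for cn, m in self.pending if cn <= rpcn]
        self.pending = [(cn, m) for cn, m in self.pending if cn > rpcn]
        return ready
```

This is why output-commit delay is bounded by validation latency: a message waits exactly until the recovery point passes its interval.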
Recovery
- Processors restore their register checkpoints
- Caches and memories unroll their local logs
- State from in-progress coherence transactions is discarded
- Reconfiguration occurs if necessary
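The log-unroll step might look like this (a functional sketch under invented data layouts; real CLBs are hardware buffers):

```python
# Sketch: undo logged updates from checkpoint intervals at or after the
# recovery point, newest first, restoring each block's old value.
# `clb` is a list of (ccn, addr, old_value) entries, oldest first, where
# ccn is the checkpoint interval in which the update occurred.

def unroll(blocks, clb, rpcn):
    while clb and clb[-1][0] >= rpcn:
        ccn, addr, old_value = clb.pop()
        blocks[addr] = old_value
    return blocks
```

Entries from intervals before the recovery point stay in the log; only state newer than the recovery point is undone.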
Implementation
Implementation (Cont'd)
- Checkpoint Log Buffers (CLBs)
  - Associated with each cache and memory component
  - Store log state
- Shadow registers
- 2D torus interconnect, MOSI directory protocol
Logical Time Base
- Loosely synchronous checkpoint clock, distributed redundantly
  - Ensures no single point of failure
- An edge of the clock increments the current checkpoint number (CCN)
- Works as long as skew < minimum communication time between nodes
- Assigning a transaction to a checkpoint interval is protocol-dependent
Logging
- A memory block is written to the CLB whenever an action might have to be undone
- CLBs are write-only (except during recovery) and off the critical path
- A CN is added to each block in the cache
- Steps taken for an update action:
  1. Compare the CCN with the block's CN
  2. Log the block if CCN >= CN
  3. Update the block's CN to CCN + 1
  4. Perform the update action
- The updated CN is sent with the coherence response
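The four update-action steps can be sketched as follows (Python for illustration only; the structure and field names are invented):

```python
# Sketch of SafetyNet's log-on-first-write rule: each cached block carries
# a checkpoint number (CN); the old value is copied to the Checkpoint Log
# Buffer (CLB) only on the first update in a given checkpoint interval.

class LoggingCache:
    def __init__(self, ccn=0):
        self.ccn = ccn            # current checkpoint number (from the clock)
        self.blocks = {}          # addr -> (value, cn)
        self.clb = []             # log entries: (old_cn, addr, old_value)

    def update_action(self, addr, new_value):
        value, cn = self.blocks.get(addr, (None, 0))
        if self.ccn >= cn:        # steps 1-2: compare CCN with CN; log if CCN >= CN
            self.clb.append((cn, addr, value))
        self.blocks[addr] = (new_value, self.ccn + 1)  # steps 3-4: CN := CCN + 1, update
```

Because the block's CN is bumped to CCN + 1, later writes in the same interval fail the CCN >= CN test and are not re-logged, which is how storage challenge (1) is addressed.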
Checkpoint Creation/Validation
- Choose a suitable checkpoint clock frequency, trading off:
  - Fault detection latency tolerance
  - Total CLB storage
- Lost messages (detected by timeout) trigger recovery
- The recovery point checkpoint number (RPCN) is broadcast when the recovery point is advanced
- After a fault, a recovery message (including the RPCN) is sent
  - The interconnection network is drained
  - Processors, caches, and memories recover to the RPCN
Implementation Summary
- Processor/cache changes required
  - Processor must be able to checkpoint its register state
  - Must be able to copy old versions out of the cache before transferring them
  - CNs added to L1 cache blocks
- Directory protocol changes
  - CNs added to data response messages
  - Coherence requests can be nack'ed
  - Final ack required from requestor to directory
Evaluation Parameters
- 16-processor target system
- Simics + memory hierarchy simulator
- 4 commercial workloads, 1 scientific
- In-order processors, 4 billion instructions/sec
- MOSI directory protocol with 2D torus
- Checkpoint interval = 100,000 cycles
Experiments
(1) Fault-free performance
  - Overhead determined to be negligible
(2) Dropped messages
  - Periodic transient faults injected (10/sec)
  - Recovery latency << crash + reboot
(3) Lost switch
  - Hard fault: kill half of a switch
  - Crash avoided; performance suffers due to restricted bandwidth
Sensitivity Analysis
- Cache bandwidth
  - Depends on the frequency of stores requiring logging (additional bandwidth is consumed reading the old copy of the block)
  - Cache ownership transfers: no additional bandwidth
- Storage cost
  - CLBs sized to avoid performance degradation due to full buffers
  - Entries per checkpoint correspond to logging frequency
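As a back-of-envelope illustration (the numbers below are invented, not the paper's measurements), CLB storage scales with the logging rate, block size, and number of checkpoints awaiting validation:

```python
# Hypothetical sizing: each log entry holds one old block copy (plus a
# small CN/address tag, ignored here).

def clb_bytes(entries_per_checkpoint, block_bytes, outstanding_checkpoints):
    return entries_per_checkpoint * block_bytes * outstanding_checkpoints

# e.g. 1,000 logged blocks per interval, 64 B blocks,
# 4 checkpoints awaiting validation:
size = clb_bytes(1000, 64, 4)   # 256,000 bytes
```

This makes the trade-off on the previous slide concrete: a longer detection latency means more outstanding checkpoints, and hence proportionally more CLB storage.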
Conclusion
- SafetyNet is efficient in the common case (error-free execution): little to no latency added
- Latency is hidden by pipelining validation of checkpoints
- Checkpoints are coordinated in logical time (no synchronization exchanges necessary)
Questions
- What about faults/errors in the saved state itself?
- What if there is a permanent fault for which you can't reconfigure (an endless loop of recovering to the last checkpoint)?