ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors
Milos Prvulovic, Zheng Zhang, Josep Torrellas
University of Illinois at Urbana-Champaign
Hewlett-Packard Laboratories
Isaac Liu
Introduction
• Targeting large-scale applications that provide services (need high availability)
• Improvements in silicon technology make modern integrated circuits prone to transient and permanent faults
• Forward error recovery (FER) vs. backward error recovery (BER)
◦ Hardware redundancy vs. recovery
ReVive design
• Goal: Cost-effective general-purpose rollback recovery
◦ Modest amount of hardware (cost-effective)
◦ Recovery from a wide class of errors (general-purpose)
◦ Short system downtime due to error (high availability)
◦ Low overhead when error-free (high performance)
Hardware Modifications
Design Choices
◦ Checkpoint Storage: safe internal storage with distributed parity; safe external; specialized fault class
◦ Checkpoint Separation: partial separation with logging; full separation; partial separation with buffering (renaming)
◦ Checkpoint Consistency: global; coordinated local; uncoordinated local
Overview
• Periodically establish a checkpoint.
• Between checkpoints, whenever main memory is written to, log the old contents so that the checkpoint state can be restored.
• If an error is detected, use the logs to roll back to the checkpoint state.
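The checkpoint, log, and rollback cycle above can be sketched in software as a simple undo log. This is only an illustrative model, not the hardware design: the class `CheckpointedMemory`, its method names, and the dictionary-based memory are assumptions made for the example, and it keeps a single checkpoint for simplicity.

```python
_ABSENT = object()  # sentinel: the address held no value at checkpoint time


class CheckpointedMemory:
    """Toy model (hypothetical): memory as a dict plus an undo log."""

    def __init__(self):
        self.memory = {}
        self.log = []  # (address, old value) pairs since the last checkpoint

    def write(self, addr, value):
        # Save the content being overwritten before the write takes effect.
        self.log.append((addr, self.memory.get(addr, _ABSENT)))
        self.memory[addr] = value

    def checkpoint(self):
        # Current memory becomes the checkpoint; older log entries are dropped.
        self.log.clear()

    def rollback(self):
        # Undo writes in reverse order to restore the checkpoint state.
        while self.log:
            addr, old = self.log.pop()
            if old is _ABSENT:
                self.memory.pop(addr, None)
            else:
                self.memory[addr] = old


mem = CheckpointedMemory()
mem.write(0x10, 1)
mem.checkpoint()
mem.write(0x10, 2)   # overwrites checkpointed data, so the old value is logged
mem.write(0x20, 3)   # new address, logged as previously absent
mem.rollback()       # memory is back to the checkpoint state
```

Replaying the log in reverse order matters when the same address is written more than once between checkpoints: the oldest entry, which holds the checkpoint value, is applied last.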
Design Choices
◦ Checkpoint Storage: safe internal storage with distributed parity
◦ Checkpoint Separation: partial separation with logging
◦ Checkpoint Consistency: global
Distributed Parity
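As a rough illustration of how distributed parity protects memory, the sketch below uses RAID-style XOR parity over equally sized pages. The helper names (`xor_bytes`, `compute_parity`, `update_parity`, `reconstruct`) and the two-byte "pages" are assumptions for the example and do not describe the actual hardware interface.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def compute_parity(pages):
    """Parity page = XOR of all data pages in the parity group."""
    return reduce(xor_bytes, pages)


def update_parity(parity, old_data, new_data):
    """Incremental update on a write: parity' = parity XOR old XOR new."""
    return xor_bytes(parity, xor_bytes(old_data, new_data))


def reconstruct(surviving_pages, parity):
    """Rebuild the one lost page from the parity and the surviving pages."""
    return reduce(xor_bytes, surviving_pages, parity)


# Three data pages, imagined to live on three different nodes.
pages = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = compute_parity(pages)

# If the node holding pages[1] is lost, its page can be rebuilt.
rebuilt = reconstruct([pages[0], pages[2]], parity)
```

The incremental update needs only the old and new contents of the written page, which is why a memory write can be turned into a single parity-update message to the node holding the parity.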
Design Choices
◦ Checkpoint Storage: safe internal storage with distributed parity
◦ Checkpoint Separation: partial separation with logging
◦ Checkpoint Consistency: global
Logging
Design Choices
◦ Checkpoint Storage: safe internal storage with distributed parity
◦ Checkpoint Separation: partial separation with logging
◦ Checkpoint Consistency: global checkpoint
Global checkpoint
• Commit all work and state to main memory.
• Two-phase commit protocol: the first synchronization is a tentative commit, and a second synchronization fully commits the checkpoint.
• Keeps the two most recent checkpoints.
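The two-phase structure of the global checkpoint can be sketched as a sequential simulation. Everything here is a simplification made for illustration: the `Node` class, its fields, and `global_checkpoint` are hypothetical names, real processors would synchronize at barriers rather than run in a loop, and the "checkpoints" are represented only by their epoch numbers.

```python
from collections import deque


class Node:
    """Hypothetical node: cached dirty data, local memory, checkpoint epochs."""

    def __init__(self, name):
        self.name = name
        self.dirty = {}                     # cached writes not yet in memory
        self.memory = {}
        self.tentative = None               # epoch tentatively committed
        self.checkpoints = deque(maxlen=2)  # only the two most recent are kept

    def flush(self):
        # Write all dirty cached data back to main memory.
        self.memory.update(self.dirty)
        self.dirty.clear()


def global_checkpoint(nodes, epoch):
    # Phase 1: each node writes its dirty data to memory, then marks the
    # checkpoint as tentatively committed.
    for node in nodes:
        node.flush()
        node.tentative = epoch
    # First synchronization: every node must have reached the tentative commit.
    if not all(node.tentative == epoch for node in nodes):
        raise RuntimeError("a node failed to reach the tentative commit")
    # Phase 2 (second synchronization): the checkpoint is fully committed.
    for node in nodes:
        node.checkpoints.append(epoch)
        node.tentative = None


nodes = [Node("n0"), Node("n1")]
nodes[0].dirty[0x40] = 7
for epoch in (1, 2, 3):
    global_checkpoint(nodes, epoch)
```

Using a bounded `deque` mirrors the "two most recent checkpoints" rule: committing a third checkpoint discards the oldest one automatically.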
Global Checkpoint
Implementation issues
• Extra L bit for each directory entry
• New states in directory protocol, new messages (parity update/ack)
• Race Conditions
◦ Log-Data Update race
◦ Atomic Log Update race
◦ Log-Parity Update race
◦ Data-Parity Update race
◦ Checkpoint Commit race
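One way to picture the role of the L bit is as a per-line "already logged" flag, so that only the first write to a line after a checkpoint is logged. The sketch below assumes that interpretation; the `Directory` class and its methods are hypothetical, and the model keeps a single checkpoint and ignores the race conditions listed above.

```python
class Directory:
    """Hypothetical home directory with one L bit per memory line."""

    def __init__(self):
        self.memory = {}
        self.l_bit = set()  # lines whose L bit is currently set
        self.log = []       # (line, checkpoint content) entries

    def write_back(self, line, data):
        # Log the old content only on the first write since the checkpoint.
        if line not in self.l_bit:
            self.log.append((line, self.memory.get(line)))
            self.l_bit.add(line)
        self.memory[line] = data

    def checkpoint(self):
        # A new checkpoint clears every L bit and retires the old log.
        self.l_bit.clear()
        self.log.clear()


d = Directory()
d.write_back(7, "a")  # first write to line 7: logged
d.write_back(7, "b")  # L bit already set: not logged again
d.write_back(8, "c")  # first write to line 8: logged
```

Skipping repeated log entries keeps the log proportional to the number of distinct lines modified in a checkpoint interval rather than to the number of writes.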
Rollback
Overhead
• Logging and parity maintenance
◦ Depends on application
• Global Checkpoint
◦ Cross-processor interrupt
◦ Write dirty data to memory
• Rollback
◦ Recovery + lost work + rebuilding lost memory pages
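The rollback cost above is a sum of three terms, which makes its effect on availability easy to estimate. The sketch below is back-of-the-envelope arithmetic only: the function names are hypothetical, the numbers in the usage lines are invented for illustration and are not measurements from the evaluation, and it assumes the mean time between errors includes the downtime itself.

```python
def unavailable_time(recovery_s, lost_work_s, rebuild_s):
    """Time lost per error: recovery + lost work + rebuilding lost pages."""
    return recovery_s + lost_work_s + rebuild_s


def availability(mean_time_between_errors_s, downtime_per_error_s):
    """Fraction of time the system does useful work."""
    return 1.0 - downtime_per_error_s / mean_time_between_errors_s


# Hypothetical inputs: one error per day and sub-second components.
downtime = unavailable_time(recovery_s=0.1, lost_work_s=0.2, rebuild_s=0.5)
result = availability(mean_time_between_errors_s=24 * 3600,
                      downtime_per_error_s=downtime)
```

The lost-work term grows with the checkpoint interval, so a shorter interval trades higher error-free overhead for less downtime per error.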
Evaluation environment
• CC-NUMA multiprocessor with 16 nodes
• Non-blocking, write-back caches
• Full-map directory and cache coherence protocol similar to DASH
• Cache size:
◦ 16 KB for L1, 128 KB for L2
* Applications run with smaller problem sizes and for shorter periods
Evaluation Results
• Cp10ms – Parity and checkpoint every 10 ms
• CpInf – Parity and checkpoint with infinite interval
• Cp10msM – Mirror and checkpoint every 10 ms
• CpInfM – Mirror and checkpoint with infinite interval
Traffic
• Par – parity updates
• Ckp – checkpoint
• WB – writeback
• RD/RDX – cache miss
• LOG – writing to logs
Overhead
ReVive vs. SafetyNet
• Both use log-based rollback mechanisms
• ReVive enables recovery from the permanent loss of a node
• ReVive does not need to change the processor’s cache
• ReVive is more general, so it may incur a larger performance overhead
Conclusion
• ReVive provides:
◦ Modest amount of hardware (cost-effective)
◦ Recovery from a wide class of errors (general-purpose)
◦ Short system downtime due to error (high availability)
◦ Low overhead when error-free (high performance)