milos prvulovic , zheng zhang, josep torrellas university of illinois at urbana-champaign

22
ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared- Memory Multiprocessors Milos Prvulovic, Zheng Zhang, Josep Torrellas University of Illinois at Urbana-Champaign Hewlett-Packard Laboratories Isaac Liu

Upload: myrrh

Post on 24-Feb-2016

24 views

Category:

Documents


0 download

DESCRIPTION

ReVive : Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors. Milos Prvulovic , Zheng Zhang, Josep Torrellas University of Illinois at Urbana-Champaign Hewlett-Packard Laboratories. Isaac Liu. Introduction. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Milos Prvulovic ,  Zheng  Zhang,  Josep Torrellas University of Illinois at Urbana-Champaign

ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors

Milos Prvulovic, Zheng Zhang, Josep TorrellasUniversity of Illinois at Urbana-Champaign

Hewlett-Packard Laboratories

Isaac Liu

Page 2: Milos Prvulovic ,  Zheng  Zhang,  Josep Torrellas University of Illinois at Urbana-Champaign

IntroductionTargeting large scale applications

that provide services (need high availability)

Improvements in silicon technology make modern integrated circuits prone to transient and permanent faults

FER vs. BER ◦Hardware redundancy vs. recovery

Page 3: Milos Prvulovic ,  Zheng  Zhang,  Josep Torrellas University of Illinois at Urbana-Champaign

ReVive designGoal: Cost-effective general-

purpose rollback recovery◦Modest amount of hardware (cost-

effective)◦Recovery from a wide class of errors

(General-purpose)◦Short system downtime due to error

(high availability)◦Low overhead when error-free (high

performance)

Page 4: Milos Prvulovic ,  Zheng  Zhang,  Josep Torrellas University of Illinois at Urbana-Champaign

Hardware Modifications

Page 5: Milos Prvulovic ,  Zheng  Zhang,  Josep Torrellas University of Illinois at Urbana-Champaign

Design Choices◦Checkpoint Storage:

Safe Internal Storage with Distributed parity

Safe External Specialized fault class

◦Checkpoint Separation: Partial separation with Logging Full separation Partial separation with buffering (renaming)

◦Checkpoint Consistency: Global (Un) Coordinated Local

Page 6: Milos Prvulovic ,  Zheng  Zhang,  Josep Torrellas University of Illinois at Urbana-Champaign

OverviewPeriodically establish checkpointBetween checkpoints, whenever

main memory written to, log the data to maintain checkpoint state.

If error is detected, then use the logs to roll back state.

Page 7: Milos Prvulovic ,  Zheng  Zhang,  Josep Torrellas University of Illinois at Urbana-Champaign

Design Choices◦Checkpoint Storage:

Safe Internal Storage with Distributed parity

◦Checkpoint Separation: Partial separation with Logging

◦Checkpoint Consistency: Global

Page 8: Milos Prvulovic ,  Zheng  Zhang,  Josep Torrellas University of Illinois at Urbana-Champaign

Distributed Parity

Page 9: Milos Prvulovic ,  Zheng  Zhang,  Josep Torrellas University of Illinois at Urbana-Champaign

Design Choices◦Checkpoint Storage:

Safe Internal Storage with Distributed parity

◦Checkpoint Separation: Partial separation with Logging

◦Checkpoint Consistency: Global

Page 10: Milos Prvulovic ,  Zheng  Zhang,  Josep Torrellas University of Illinois at Urbana-Champaign

Logging

Page 11: Milos Prvulovic ,  Zheng  Zhang,  Josep Torrellas University of Illinois at Urbana-Champaign

Design Choices◦Checkpoint Storage:

Safe Internal Storage with Distributed parity

◦Checkpoint Separation: Partial separation with Logging

◦Checkpoint Consistency: Global Checkpoint

Page 12: Milos Prvulovic ,  Zheng  Zhang,  Josep Torrellas University of Illinois at Urbana-Champaign

Global checkpointCommit all work and states to

main memory.Two phase commit protocol, first

sync is tentative commit, and then sync again to fully commit.

Keeps two most recent checkpoints.

Page 13: Milos Prvulovic ,  Zheng  Zhang,  Josep Torrellas University of Illinois at Urbana-Champaign

Global Checkpoint

Page 14: Milos Prvulovic ,  Zheng  Zhang,  Josep Torrellas University of Illinois at Urbana-Champaign

Implementation issuesExtra L bit for each directory

entryNew states in directory protocol,

new messages (parity update/ack)

Race Conditions◦Log-Data Update race◦Atomic Log Update Race◦Log-Parity Update Race◦Data-Parity Update Race◦Checkpoint commit Race

Page 15: Milos Prvulovic ,  Zheng  Zhang,  Josep Torrellas University of Illinois at Urbana-Champaign

Rollback

Page 16: Milos Prvulovic ,  Zheng  Zhang,  Josep Torrellas University of Illinois at Urbana-Champaign

OverheadLogging and parity maintenance

◦Depends on applicationGlobal Checkpoint

◦cross-processor interrupt◦Write dirty data to memory

Rollback◦Recovery + Lost work + Rebuild lost

memory pages

Page 17: Milos Prvulovic ,  Zheng  Zhang,  Josep Torrellas University of Illinois at Urbana-Champaign

Evaluation environmentCC-NUMA multiprocessor with 16

nodesNon-blocking and write-back

cacheFull-map directory and cache

coherent protocol similar to DASH.

Cache size: ◦16KB for L1, 128kB for L2

*Applications run on smaller problems sizes and shorter periods

Page 18: Milos Prvulovic ,  Zheng  Zhang,  Josep Torrellas University of Illinois at Urbana-Champaign

Evaluation Results

•Cp10ms – Parity and checkpoint every 10ms•CpInf – Parity and checkpoint with infinite interval•Cp10msM – Mirror and checkpoint every 10ms•CpInfM –Mirror and checkpoint with infinite interval

Page 19: Milos Prvulovic ,  Zheng  Zhang,  Josep Torrellas University of Illinois at Urbana-Champaign

Traffic

•Par – parity updates•Ckp – checkpoint•WB – writeback•RD/RDX- cache miss•LOG – writing to logs

Page 20: Milos Prvulovic ,  Zheng  Zhang,  Josep Torrellas University of Illinois at Urbana-Champaign

Overhead

Page 21: Milos Prvulovic ,  Zheng  Zhang,  Josep Torrellas University of Illinois at Urbana-Champaign

ReVive vs. SafetyNetBoth use log-based rollback

mechanismsReVive enables recovery from a

permanent nodeReVive does not need to change

processor’s cacheReVive is more general, so it may

result in larger performance overhead.

Page 22: Milos Prvulovic ,  Zheng  Zhang,  Josep Torrellas University of Illinois at Urbana-Champaign

ConclusionReVive provides:

◦Modest amount of hardware (cost-effective)

◦Recovery from a wide class of errors (General-purpose)

◦Short system downtime due to error (high availability)

◦Low overhead when error-free (high performance)