Transparent Dynamic Binding with Fault-Tolerant Cache Coherence Protocol for Chip Multiprocessors

Post on 23-Feb-2016


Shuchang Shan†‡, Yu Hu†, Xiaowei Li†

†Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences

‡Graduate University of Chinese Academy of Sciences (GUCAS)


Outline

Introduction

TDB execution model

Experimental results

Conclusion

Architectural-Level Dual Modular Redundancy

[Figure: two identical out-of-order pipelines (Fetch, Decode/Rename, Issue Queue, FUs, Register File, Reorder Buffer, Writeback/Commit) shown side by side and compared stage for stage ("=")]

[Figure: a CMP memory system with per-core L1 caches]

Instruction-level DMR: DIVA [MICRO'99], SHREC [MICRO'04], EDDI [TR'02]

Thread-level DMR: AR-SMT [FTCS'99], SRT [ISCA'00]

Core-level DMR: CRTR [ISCA'03], Reunion [MICRO'06], DCC [DSN'07]

[Figure: in thread-level DMR, leading-thread instructions execute (EX') while a checker (CHK) verifies them against the trailing thread]

For CMP systems, to make use of the abundant hardware resources, build core-level DMR!

Core-Level Dual Modular Redundancy (DMR)

Coupled cores verify each other's execution.

Static binding
– Lacks flexibility
– e.g., Reunion [MICRO'06], CRT [ISCA'02], CRTR [ISCA'03]

Dynamic binding
– Lacks scalability for parallel processing
– e.g., DCC [DSN'07, WDDD'08]

[Figure: with static binding, cores are coupled in fixed pairs (A with A', B with B') over the on-chip network and shared cache, so a faulty core (X) also disables its healthy partner; with dynamic binding, any two cores can be coupled (A with B', B with A'), so a spare core C can replace a faulty one]

Key Issue in Core-Level DMR

Maintaining master–slave memory consistency:
– Coupled cores must observe the same memory values
– External writes cause consistency violations

Reunion [Smolens, MICRO'06]
– Rollback and recovery on inconsistency

Dynamic Core Coupling (DCC) [LaFrieda, DSN'07]
– A consistency window stalls the external writes, which creates a scalability problem

[Figure: the master's LD1 and the slave's redundant LD1' straddle an external ST3, so the pair reads different values (consistency violation); DCC stalls ST3 until both loads complete, paying stall latency]
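The LD1 / ST3 / LD1' scenario above can be reduced to a few lines of Python. This is a minimal sketch of the input-incoherence problem, not code from the paper; the trace format and names are invented for illustration.

```python
# Minimal sketch (not the paper's code) of the master-slave input
# incoherence problem: an external core's store lands between the
# master's load (LD1) and the slave's redundant load (LD1'), so the
# logically identical loads return different values.

memory = {0x40: 1}   # one shared line, initial value 1

def run(interleaving):
    """Replay a global interleaving of loads and external stores."""
    loaded = {}
    for who, op, addr, val in interleaving:
        if op == "LD":
            loaded[who] = memory[addr]
        else:                      # "ST" issued by an external core
            memory[addr] = val
    return loaded

# LD1 (master), external ST3, LD1' (slave): the pair disagrees.
trace = [("master", "LD", 0x40, None),
         ("extern", "ST", 0x40, 3),
         ("slave",  "LD", 0x40, None)]
loaded = run(trace)
print(loaded["master"] == loaded["slave"])   # False: consistency violation
```

Reunion recovers from this divergence by rollback; DCC avoids it by stalling ST3 until both loads are done.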

Scalability Problem

External writes occur earlier and more frequently as the system scales:
– Reunion: unacceptable recovery overhead for consistency violations
– DCC: unacceptable stall latency caused by the consistency window

A scalable solution is needed: reduce the consistency-maintenance overhead.

[Figure: breakdown of external-write intervals (<100, <200, <300, <500, >500 cycles) for the SPLASH-2 benchmarks lu, fft, ocean-con, barnes, cholesky, radix, radiosity, and their average]

Probability of an external write occurring within a given slack:
– 4-core CMP: 28% within 100 cycles, 37% within 500 cycles
– 16-core CMP: 43% within 100 cycles, 55% within 500 cycles

External writes within 1K cycles: 0.3 for a 4-core CMP vs. 3.3 for a 16-core CMP.

Basic Idea

Shrink the scope of master–slave memory consistency maintenance, the Sphere of Consistency (SoC):
– from the whole memory hierarchy
– to the private caches

[Figure: the SoC shrinks from enclosing master, slave, both L1 caches, and global memory to enclosing only the master's and slave's L1 caches]

Transparent Dynamic Binding (TDB): reduce the SoC to the scale of the private caches and provide a scalable, flexible core-level DMR solution!


Outline

Introduction

TDB execution model

Experimental results

Conclusion

TDB Principle

The pair runs on the same program input, so master and slave show similar memory-access behavior.

Transparent binding: the master issues L1-miss requests on behalf of the logical pair; the slave is prevented from accessing the global memory.

Dynamic binding: the system network is used for data communication and result comparison.

[Figure: the program drives both A-L1$ and A'-L1$; only the master's L1 cache exchanges data with global memory]

Transparent Dynamic Binding

– Logical pair: consumer–consumer
– Sphere of Consistency: the private caches
– Transparency of slaves: passively waiting for data
– Consumer–consumer data access pattern

[Figure: a producer core writes to global memory; the master and the slave both act as consumers of the produced data]
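A toy model of this transparent-binding behavior. The class and channel names are invented for illustration and a Python dict stands in for global memory; this is a sketch of the idea, not the paper's hardware mechanism.

```python
# Toy model (invented names, not the paper's mechanism) of transparent
# binding: on an L1 miss only the master accesses global memory; the
# slave's identical miss is satisfied by the fill the master forwards,
# so the slave never touches memory and both cores see the same input.

from collections import deque

GLOBAL_MEM = {0x100: 42}   # stand-in for the shared memory hierarchy

class Core:
    def __init__(self, is_master, fill_channel):
        self.l1 = {}                  # private cache: addr -> value
        self.is_master = is_master
        self.fills = fill_channel     # master-to-slave forwarding queue

    def load(self, addr):
        if addr in self.l1:           # L1 hit
            return self.l1[addr]
        if self.is_master:
            data = GLOBAL_MEM[addr]   # only the master issues the miss
            self.fills.append((addr, data))
        else:
            fill_addr, data = self.fills.popleft()  # passively wait
            assert fill_addr == addr  # the pair runs the same program
        self.l1[addr] = data
        return data

channel = deque()
master, slave = Core(True, channel), Core(False, channel)
a, b = master.load(0x100), slave.load(0x100)
print(a == b == 42)   # True: one memory access served both cores
```

Because the slave never issues its own requests, external writes only have to be reconciled against the master's accesses, which is what shrinks the Sphere of Consistency to the private caches.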

Maintaining Consistency under Out-of-Order Execution

Out-of-order execution brings in wrong-path effects [1]: the master may issue wrong-path memory references (in the example, the sequence MA1, MA3, MA6, MA1, MA5 instead of the program order MA1 through MA5), which a pipeline refresh later squashes. Fills and LRU updates triggered by these wrong-path references leave the master's private cache in a state the slave's cache never reaches: a master–slave private cache consistency violation.

[Figure (animation over five slides): master and slave both run program accesses MA1 through MA5; the master's out-of-order issue sequence MA1, MA3, MA6, MA1, MA5 perturbs the LRU/MRU order of its private cache, and after the pipeline refresh the two private caches diverge]

Invariant: the in-order memory-instruction retirement sequence.

[1] R. Sendag, et al., "The impact of wrong-path memory references in cache-coherent multiprocessor systems," JPDC'07.

Victim-Buffer-Assisted Conservative Private Cache Ingress Rule

Victim buffer: filters the wrong-path (WP) data blocks.

Conservative private cache ingress rule: accept only data blocks from the correct path into the private caches.

[Figure (animation over four slides): miss fills returned from global memory are first placed in the victim buffer; blocks referenced by retired, correct-path instructions move into the master's and slave's private caches, while wrong-path blocks stay filtered out]
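The ingress rule can be sketched in a few lines. The class and method names here are invented for illustration; real hardware would implement this in the cache controller.

```python
# Sketch (names invented for illustration) of the conservative ingress
# rule: a miss fill parks in the victim buffer and is promoted into the
# private cache only when the requesting memory instruction retires;
# wrong-path fills are dropped on a pipeline squash and never pollute
# the cache, so master and slave caches stay identical.

class IngressCache:
    def __init__(self):
        self.cache = {}           # consistent, retirement-ordered contents
        self.victim_buffer = {}   # speculative fills: addr -> value

    def fill(self, addr, value):
        self.victim_buffer[addr] = value     # hold, do not install yet

    def retire(self, addr):
        # the instruction was on the correct path: promote its block
        if addr in self.victim_buffer:
            self.cache[addr] = self.victim_buffer.pop(addr)

    def squash(self, addr):
        # wrong-path reference discarded on pipeline refresh
        self.victim_buffer.pop(addr, None)

c = IngressCache()
c.fill(0xA, 7)     # correct-path miss
c.fill(0xB, 9)     # wrong-path miss, squashed below
c.retire(0xA)
c.squash(0xB)
print(c.cache)     # only the correct-path block was admitted
```

Since master and slave retire the same memory instructions in the same order, installing blocks at retirement keeps their private cache contents in lockstep.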

Maintaining Consistency under Out-of-Order Execution (continued)

Even with the conservative ingress rule, wrong-path references that hit in the private cache still update its LRU state, leaving a potential master–slave consistency violation.

[Figure: the master's wrong-path hits on MA1 and MA5 reorder its cache's LRU/MRU state relative to the slave's]

Invariant: the in-order memory-instruction retirement sequence.

update-after-retirement LRU Replacement Policy (uar-LRU)

uar-LRU: update the MRU position only after the instruction retires, preventing wrong-path (WP) memory references from violating the consistency.

[Figure (animation over two slides): with uar-LRU, the master's speculative accesses leave the replacement order untouched until retirement, so the master's and slave's private cache states stay identical]
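One way to picture uar-LRU, as a sketch only: an LRU structure whose recency update is decoupled from the access and applied at retirement. The OrderedDict-based class and its method names are my own illustration, not the paper's hardware design.

```python
# uar-LRU sketch: the recency update for an access is deferred to
# retirement, so squashed wrong-path touches never reorder the LRU
# stack and both cores of a pair evict the same lines.

from collections import OrderedDict

class UarLRU:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()   # leftmost = LRU, rightmost = MRU

    def access(self, addr):
        # speculative access: read the data but leave recency untouched
        return self.lines.get(addr)

    def retire(self, addr):
        # update-after-retirement: only now promote the line to MRU
        if addr in self.lines:
            self.lines.move_to_end(addr)

    def insert(self, addr, value):
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)   # evict the current LRU line
        self.lines[addr] = value

cache = UarLRU(2)
cache.insert(0x1, "a"); cache.retire(0x1)
cache.insert(0x2, "b"); cache.retire(0x2)
cache.access(0x1)       # wrong-path touch: recency order unchanged
cache.insert(0x3, "c")  # still evicts 0x1, exactly as on the slave
print(list(cache.lines))   # [2, 3]
```

Under plain LRU the wrong-path touch would have promoted 0x1 to MRU and the insert would have evicted 0x2 instead, diverging from the slave's cache state.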

Master–Slave Memory Consistency Violation

– External writes violate the master–slave memory consistency
– Root cause: the atomicity of master–slave data access behavior
– Lacks scalability as external writes become more frequent

[Figure: master–slave input coherence; (a) an external write violates the consistency; (b) the master–slave consistency window in DCC]

Transparent Input Coherence Strategy

– Takes advantage of transparent dynamic binding
– Breaks the atomicity of master–slave data access behavior

[Figure: timeline of the master's LD1, the external ST3, and the slave's LD1', with per-copy cache-line states (I, D) and the checker, contrasting the baseline with the optimized scheme]


Outline

Introduction

TDB execution model

Experimental results

Conclusion

Experimental Setup

– Full-system simulator: Simics + GEMS
– Parallel workloads: SPLASH-2
– Baseline dual modular redundancy system:
  – N active cores and another N disabled cores
  – Simulates a DMR system in which the slaves work without interfering with the masters

The Performance of the TDB Proposal

[Figure: runtime normalized to the baseline for 4P, 8P, 16P, and 32P systems]

Normalized runtime is 97.2%, 99.8%, 101.2%, and 105.4% of the baseline for 4, 8, 16, and 32 cores respectively.

The conservative private cache ingress rule helps filter the wrong-path (WP) effects.

Network Traffic of the TDB Proposal

[Figure: network traffic normalized to the baseline for 4P, 8P, 16P, and 32P systems]

Total traffic increases by 5.2%, 3.6%, 1.3%, and 2.5% for 4-, 8-, 16-, and 32-core CMP systems.

Comparison against DCC [DSN'07]

[Figure: runtime (left) and network traffic (right), normalized to the baseline, for TDB vs. DCC on 4P, 8P, 16P, and 32P systems; DCC's overhead grows with scale: 9.2%, 10.4%, 18%, and 37.1%]

Transparent Dynamic Binding (TDB): a scalable and flexible core-level DMR solution!

Conclusion

Transparent Dynamic Binding
– Reduces the SoC to the scale of the private caches

Techniques to maintain the consistency
– Consumer–consumer data access pattern
– Victim-buffer-assisted conservative ingress rule
– uar-LRU replacement policy
– Transparent input coherence policy

A scalable and flexible core-level DMR solution.

31

Q&A?
