Transparent Dynamic Binding with Fault-Tolerant Cache Coherence Protocol for Chip Multiprocessors

Post on 23-Feb-2016


Shuchang Shan†‡, Yu Hu†, Xiaowei Li†

†Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences

‡Graduate University of Chinese Academy of Sciences (GUCAS)


Outline

Introduction

TDB execution model

Experimental results

Conclusion

Architectural-Level Dual Modular Redundancy

[Figure: two identical out-of-order pipelines (Fetch, Decode/Rename, Issue Queue, FUs, Register File, Reorder Buffer, Writeback/Commit) shown side by side and compared stage for stage ("=")]

[Figure: a CMP memory system with per-core L1 caches]

Instruction-level DMR: DIVA [MICRO'99], SHREC [MICRO'04], EDDI [TR'02]

Thread-level DMR: AR-SMT [FTCS'99], SRT [ISCA'00]

Core-level DMR: CRTR [ISCA'03], Reunion [MICRO'06], DCC [DSN'07]

[Figure: in thread-level DMR, leading-thread instructions execute (EX') while a checker (CHK) verifies them against the trailing thread]

For CMP systems, to make use of the abundant hardware resources, build core-level DMR!

Core-Level Dual Modular Redundancy (DMR)

Coupled cores verify each other's execution.

Static binding
– Lacks flexibility
– e.g., Reunion [MICRO'06], CRT [ISCA'02], CRTR [ISCA'03]

Dynamic binding
– Lacks scalability for parallel processing
– e.g., DCC [DSN'07, WDDD'08]

[Figure: with static binding, cores are coupled in fixed pairs (A with A', B with B') over the on-chip network and shared cache, so a faulty core (X) also disables its healthy partner; with dynamic binding, any two cores can be coupled (A with B', B with A'), so a spare core C can replace a faulty one]

Key Issue in Core-Level DMR

Maintaining master–slave memory consistency:
– Coupled cores must observe the same memory values
– External writes cause consistency violations

Reunion [Smolens, MICRO'06]
– Rollback and recovery on inconsistency

Dynamic Core Coupling (DCC) [LaFrieda, DSN'07]
– A consistency window stalls the external writes, which creates a scalability problem

[Figure: the master's LD1 and the slave's redundant LD1' straddle an external ST3, so the pair reads different values (consistency violation); DCC stalls ST3 until both loads complete, paying stall latency]
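The LD1 / ST3 / LD1' scenario above can be reduced to a few lines of Python. This is a minimal sketch of the input-incoherence problem, not code from the paper; the trace format and names are invented for illustration.

```python
# Minimal sketch (not the paper's code) of the master-slave input
# incoherence problem: an external core's store lands between the
# master's load (LD1) and the slave's redundant load (LD1'), so the
# logically identical loads return different values.

memory = {0x40: 1}   # one shared line, initial value 1

def run(interleaving):
    """Replay a global interleaving of loads and external stores."""
    loaded = {}
    for who, op, addr, val in interleaving:
        if op == "LD":
            loaded[who] = memory[addr]
        else:                      # "ST" issued by an external core
            memory[addr] = val
    return loaded

# LD1 (master), external ST3, LD1' (slave): the pair disagrees.
trace = [("master", "LD", 0x40, None),
         ("extern", "ST", 0x40, 3),
         ("slave",  "LD", 0x40, None)]
loaded = run(trace)
print(loaded["master"] == loaded["slave"])   # False: consistency violation
```

Reunion recovers from this divergence by rollback; DCC avoids it by stalling ST3 until both loads are done.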

Scalability Problem

External writes occur earlier and more frequently as the system scales:
– Reunion: unacceptable recovery overhead for consistency violations
– DCC: unacceptable stall latency caused by the consistency window

A scalable solution is needed: reduce the consistency-maintenance overhead.

[Figure: breakdown of external-write intervals (<100, <200, <300, <500, >500 cycles) for the SPLASH-2 benchmarks lu, fft, ocean-con, barnes, cholesky, radix, radiosity, and their average]

Probability of an external write occurring within a given slack:
– 4-core CMP: 28% within 100 cycles, 37% within 500 cycles
– 16-core CMP: 43% within 100 cycles, 55% within 500 cycles

External writes within 1K cycles: 0.3 for a 4-core CMP vs. 3.3 for a 16-core CMP.

Basic Idea

Shrink the scope of master–slave memory consistency maintenance, the Sphere of Consistency (SoC):
– from the whole memory hierarchy
– to the private caches

[Figure: the SoC shrinks from enclosing master, slave, both L1 caches, and global memory to enclosing only the master's and slave's L1 caches]

Transparent Dynamic Binding (TDB): reduce the SoC to the scale of the private caches and provide a scalable, flexible core-level DMR solution!


Outline

Introduction

TDB execution model

Experimental results

Conclusion

TDB Principle

The pair runs on the same program input, so master and slave show similar memory-access behavior.

Transparent binding: the master issues L1-miss requests on behalf of the logical pair; the slave is prevented from accessing the global memory.

Dynamic binding: the system network is used for data communication and result comparison.

[Figure: the program drives both A-L1$ and A'-L1$; only the master's L1 cache exchanges data with global memory]

Transparent Dynamic Binding

– Logical pair: consumer–consumer
– Sphere of Consistency: the private caches
– Transparency of slaves: passively waiting for data
– Consumer–consumer data access pattern

[Figure: a producer core writes to global memory; the master and the slave both act as consumers of the produced data]
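A toy model of this transparent-binding behavior. The class and channel names are invented for illustration and a Python dict stands in for global memory; this is a sketch of the idea, not the paper's hardware mechanism.

```python
# Toy model (invented names, not the paper's mechanism) of transparent
# binding: on an L1 miss only the master accesses global memory; the
# slave's identical miss is satisfied by the fill the master forwards,
# so the slave never touches memory and both cores see the same input.

from collections import deque

GLOBAL_MEM = {0x100: 42}   # stand-in for the shared memory hierarchy

class Core:
    def __init__(self, is_master, fill_channel):
        self.l1 = {}                  # private cache: addr -> value
        self.is_master = is_master
        self.fills = fill_channel     # master-to-slave forwarding queue

    def load(self, addr):
        if addr in self.l1:           # L1 hit
            return self.l1[addr]
        if self.is_master:
            data = GLOBAL_MEM[addr]   # only the master issues the miss
            self.fills.append((addr, data))
        else:
            fill_addr, data = self.fills.popleft()  # passively wait
            assert fill_addr == addr  # the pair runs the same program
        self.l1[addr] = data
        return data

channel = deque()
master, slave = Core(True, channel), Core(False, channel)
a, b = master.load(0x100), slave.load(0x100)
print(a == b == 42)   # True: one memory access served both cores
```

Because the slave never issues its own requests, external writes only have to be reconciled against the master's accesses, which is what shrinks the Sphere of Consistency to the private caches.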

Maintaining Consistency under Out-of-Order Execution

Out-of-order execution brings in wrong-path effects [1]: the master may issue wrong-path memory references (in the example, the sequence MA1, MA3, MA6, MA1, MA5 instead of the program order MA1 through MA5), which a pipeline refresh later squashes. Fills and LRU updates triggered by these wrong-path references leave the master's private cache in a state the slave's cache never reaches: a master–slave private cache consistency violation.

[Figure (animation over five slides): master and slave both run program accesses MA1 through MA5; the master's out-of-order issue sequence MA1, MA3, MA6, MA1, MA5 perturbs the LRU/MRU order of its private cache, and after the pipeline refresh the two private caches diverge]

Invariant: the in-order memory-instruction retirement sequence.

[1] R. Sendag, et al., "The impact of wrong-path memory references in cache-coherent multiprocessor systems," JPDC'07.

Victim-Buffer-Assisted Conservative Private Cache Ingress Rule

Victim buffer: filters the wrong-path (WP) data blocks.

Conservative private cache ingress rule: accept only data blocks from the correct path into the private caches.

[Figure (animation over four slides): miss fills returned from global memory are first placed in the victim buffer; blocks referenced by retired, correct-path instructions move into the master's and slave's private caches, while wrong-path blocks stay filtered out]
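The ingress rule can be sketched in a few lines. The class and method names here are invented for illustration; real hardware would implement this in the cache controller.

```python
# Sketch (names invented for illustration) of the conservative ingress
# rule: a miss fill parks in the victim buffer and is promoted into the
# private cache only when the requesting memory instruction retires;
# wrong-path fills are dropped on a pipeline squash and never pollute
# the cache, so master and slave caches stay identical.

class IngressCache:
    def __init__(self):
        self.cache = {}           # consistent, retirement-ordered contents
        self.victim_buffer = {}   # speculative fills: addr -> value

    def fill(self, addr, value):
        self.victim_buffer[addr] = value     # hold, do not install yet

    def retire(self, addr):
        # the instruction was on the correct path: promote its block
        if addr in self.victim_buffer:
            self.cache[addr] = self.victim_buffer.pop(addr)

    def squash(self, addr):
        # wrong-path reference discarded on pipeline refresh
        self.victim_buffer.pop(addr, None)

c = IngressCache()
c.fill(0xA, 7)     # correct-path miss
c.fill(0xB, 9)     # wrong-path miss, squashed below
c.retire(0xA)
c.squash(0xB)
print(c.cache)     # only the correct-path block was admitted
```

Since master and slave retire the same memory instructions in the same order, installing blocks at retirement keeps their private cache contents in lockstep.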

Maintaining Consistency under Out-of-Order Execution (continued)

Even with the conservative ingress rule, wrong-path references that hit in the private cache still update its LRU state, leaving a potential master–slave consistency violation.

[Figure: the master's wrong-path hits on MA1 and MA5 reorder its cache's LRU/MRU state relative to the slave's]

Invariant: the in-order memory-instruction retirement sequence.

update-after-retirement LRU Replacement Policy (uar-LRU)

uar-LRU: update the MRU position only after the instruction retires, preventing wrong-path (WP) memory references from violating the consistency.

[Figure (animation over two slides): with uar-LRU, the master's speculative accesses leave the replacement order untouched until retirement, so the master's and slave's private cache states stay identical]
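One way to picture uar-LRU, as a sketch only: an LRU structure whose recency update is decoupled from the access and applied at retirement. The OrderedDict-based class and its method names are my own illustration, not the paper's hardware design.

```python
# uar-LRU sketch: the recency update for an access is deferred to
# retirement, so squashed wrong-path touches never reorder the LRU
# stack and both cores of a pair evict the same lines.

from collections import OrderedDict

class UarLRU:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()   # leftmost = LRU, rightmost = MRU

    def access(self, addr):
        # speculative access: read the data but leave recency untouched
        return self.lines.get(addr)

    def retire(self, addr):
        # update-after-retirement: only now promote the line to MRU
        if addr in self.lines:
            self.lines.move_to_end(addr)

    def insert(self, addr, value):
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)   # evict the current LRU line
        self.lines[addr] = value

cache = UarLRU(2)
cache.insert(0x1, "a"); cache.retire(0x1)
cache.insert(0x2, "b"); cache.retire(0x2)
cache.access(0x1)       # wrong-path touch: recency order unchanged
cache.insert(0x3, "c")  # still evicts 0x1, exactly as on the slave
print(list(cache.lines))   # [2, 3]
```

Under plain LRU the wrong-path touch would have promoted 0x1 to MRU and the insert would have evicted 0x2 instead, diverging from the slave's cache state.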

Master–Slave Memory Consistency Violation

– External writes violate the master–slave memory consistency
– Root cause: the atomicity of master–slave data access behavior
– Lacks scalability as external writes become more frequent

[Figure: master–slave input coherence; (a) an external write violates the consistency; (b) the master–slave consistency window in DCC]

Transparent Input Coherence Strategy

– Takes advantage of transparent dynamic binding
– Breaks the atomicity of master–slave data access behavior

[Figure: timeline of the master's LD1, the external ST3, and the slave's LD1', with per-copy cache-line states (I, D) and the checker, contrasting the baseline with the optimized scheme]


Outline

Introduction

TDB execution model

Experimental results

Conclusion

Experimental Setup

– Full-system simulator: Simics + GEMS
– Parallel workloads: SPLASH-2
– Baseline dual modular redundancy system:
  – N active cores and another N disabled cores
  – Simulates a DMR system in which the slaves work without interfering with the masters

The Performance of the TDB Proposal

[Figure: runtime normalized to the baseline for 4P, 8P, 16P, and 32P systems]

Normalized runtime is 97.2%, 99.8%, 101.2%, and 105.4% of the baseline for 4, 8, 16, and 32 cores respectively.

The conservative private cache ingress rule helps filter the wrong-path (WP) effects.

Network Traffic of the TDB Proposal

[Figure: network traffic normalized to the baseline for 4P, 8P, 16P, and 32P systems]

Total traffic increases by 5.2%, 3.6%, 1.3%, and 2.5% for 4-, 8-, 16-, and 32-core CMP systems.

Comparison against DCC [DSN'07]

[Figure: runtime (left) and network traffic (right), normalized to the baseline, for TDB vs. DCC on 4P, 8P, 16P, and 32P systems; DCC's overhead grows with scale: 9.2%, 10.4%, 18%, and 37.1%]

Transparent Dynamic Binding (TDB): a scalable and flexible core-level DMR solution!

Conclusion

Transparent Dynamic Binding
– Reduces the SoC to the scale of the private caches

Techniques to maintain the consistency
– Consumer–consumer data access pattern
– Victim-buffer-assisted conservative ingress rule
– uar-LRU replacement policy
– Transparent input coherence policy

A scalable and flexible core-level DMR solution.

31

Q&A?
