the locality-aware adaptive cache coherence protocol

The Locality-Aware Adaptive Cache Coherence Protocol. George Kurian (1), Omer Khan (2), Srini Devadas (1). (1) Massachusetts Institute of Technology; (2) University of Connecticut, Storrs

Upload: jake

Post on 23-Feb-2016


Page 1: The Locality-Aware Adaptive Cache Coherence Protocol

The Locality-Aware Adaptive Cache Coherence Protocol

George Kurian (1), Omer Khan (2), Srini Devadas (1)

(1) Massachusetts Institute of Technology
(2) University of Connecticut, Storrs

Page 2: The Locality-Aware Adaptive Cache Coherence Protocol

Cache Hierarchy Organization: Directory-Based Coherence

• Private caches: 1 or 2 levels
• Shared cache: Last-level

[Figure: numbered diagram of a "Write word" request: (1) private cache write miss, (2) shared cache + directory, (3) sharer, (4) sharer.]

• Concurrent reads lead to replication in private caches

• Directory maintains coherence for replicated lines
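The replication and invalidation behaviour described on this slide can be sketched in a few lines of Python. This is only an illustrative model, assuming a full sharer set per line; the class and method names are hypothetical and do not come from the talk.

```python
class Directory:
    """Toy directory: tracks which cores hold a private copy of each line."""

    def __init__(self):
        self.sharers = {}  # line address -> set of core ids holding a copy

    def read(self, core, line):
        # A read miss replicates the line into the requester's private cache.
        self.sharers.setdefault(line, set()).add(core)

    def write(self, core, line):
        # A write miss invalidates every other copy before granting ownership.
        invalidated = sorted(self.sharers.get(line, set()) - {core})
        self.sharers[line] = {core}
        return invalidated


d = Directory()
d.read(3, 0x40)
d.read(4, 0x40)
print(d.write(1, 0x40))  # the cores whose copies had to be invalidated
```

Every concurrent reader adds one replica, and every write pays for one invalidation per replica, which is the cost the rest of the talk tries to avoid for low-locality data.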

Page 3: The Locality-Aware Adaptive Cache Coherence Protocol

Private Caching: Advantages & Drawbacks

☺ Exploits spatio-temporal locality

☺ Efficient low-latency local access to private + shared data (cache line replication)

☹ Inefficiently handles data with LOW spatio-temporal locality

☹ Working set > private cache size
  ☹ Inefficient cache utilization (cache thrashing)
  ☹ Unnecessary fetch of entire cache line
  ☹ Shared data replication increases working set

Page 4: The Locality-Aware Adaptive Cache Coherence Protocol

Private Caching: Advantages & Drawbacks

☺ Exploits spatio-temporal locality

☺ Efficient low-latency local access to private + shared data (cache line replication)

☹ Inefficiently handles data with LOW spatio-temporal locality

☹ Working set > private cache size

☹ Shared data with frequent writes
  ☹ Wasteful invalidations, synchronous writebacks, cache line ping-ponging

Increased on-chip communication and time spent waiting for expensive events

Page 5: The Locality-Aware Adaptive Cache Coherence Protocol

On-Chip Communication Problem

"Wires relative to gates are getting worse every generation" (Shekhar Borkar, Intel)

"Bit movement is much more expensive than computation" (Bill Dally, Stanford)

Must Architect Efficient Coherence Protocols

Page 6: The Locality-Aware Adaptive Cache Coherence Protocol

Locality of Benchmarks: Evaluating Reuse before Evictions

• Utilization: # private L1 cache accesses before cache line is evicted

• 40% of lines evicted have a utilization < 4

[Figure: breakdown of evicted lines by utilization; chart annotations: 80%, 20%.]

Page 7: The Locality-Aware Adaptive Cache Coherence Protocol

Locality of Benchmarks: Evaluating Reuse before Invalidations

• Utilization: # private L1 cache accesses before cache line is invalidated (intervening write)

[Figure: breakdown of invalidated lines by utilization; chart annotations: 80%, 10%.]

Page 8: The Locality-Aware Adaptive Cache Coherence Protocol

Remote-Word Access (RA)

NUCA-based protocol [Fensch et al HPCA’08] [Hoffmann et al HiPEAC’10]

• Assign each memory address to a unique “home” core
  – Cache line present only in shared cache at “home” core (single location)

• For access to a non-locally cached word, request the “remote” shared cache on the “home” core to perform the read/write access

[Figure: numbered steps (1, 2) of a "Write word" request travelling to the home core.]
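A minimal sketch of the remote-word access idea, assuming line-interleaved placement of addresses across cores. The interleaving, the 64-byte line size, and the function names are illustrative assumptions; the talk does not specify the mapping on this slide (the core count of 64 matches the evaluation setup mentioned later).

```python
LINE_BYTES = 64   # assumed cache line size
NUM_CORES = 64    # core count, matching the evaluated configuration


def home_core(address):
    """Map an address to its unique home core by interleaving lines across cores."""
    return (address // LINE_BYTES) % NUM_CORES


def remote_access(l2_slices, address, value=None):
    """Read a word (value is None) or write it at the home core's L2 slice."""
    l2_slice = l2_slices[home_core(address)]
    if value is not None:
        l2_slice[address] = value
    return l2_slice.get(address)


l2_slices = [{} for _ in range(NUM_CORES)]
remote_access(l2_slices, 128, value=7)   # write goes to the home slice only
print(home_core(128), remote_access(l2_slices, 128))
```

Because each word lives in exactly one slice, no copies exist that would need to be invalidated; the price is a network round trip for every access.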

Page 9: The Locality-Aware Adaptive Cache Coherence Protocol

Remote-Word Access: Advantages & Drawbacks

☺ Energy efficient (low locality data): word access (~200 bits) cheaper than cache line fetch (~640 bits)

☺ NO data replication: efficient private cache utilization

☺ NO invalidations / synchronous writebacks

☹ Round-trip network request for remote-WORD access

☹ Expensive for high locality data

☹ Data placement dictates distance & frequency of remote accesses
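The trade-off between the two modes can be made concrete with the approximate message sizes quoted above. The sketch below counts only bits moved over the network; it deliberately ignores L1 hit energy, invalidation traffic, and hop distance, so it is a rough illustration rather than the energy model used in the evaluation.

```python
WORD_ACCESS_BITS = 200   # approximate bits per remote word access (from the slide)
LINE_FETCH_BITS = 640    # approximate bits per cache line fetch (from the slide)


def cheaper_mode(accesses):
    """Return the mode that moves fewer bits for a line touched `accesses` times."""
    remote_bits = accesses * WORD_ACCESS_BITS
    return "remote" if remote_bits < LINE_FETCH_BITS else "private"


for n in range(1, 6):
    print(n, n * WORD_ACCESS_BITS, cheaper_mode(n))
```

Under these simplifying assumptions the break-even point is 640 / 200 = 3.2 accesses: up to three accesses favour remote word access, and four or more favour fetching the line.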

Page 10: The Locality-Aware Adaptive Cache Coherence Protocol

Locality-Aware Cache Coherence

• Combine advantages of private caching and remote access
• Privately cache high locality lines
  – Optimize hit latency and energy
• Remotely cache low locality lines
  – Prevent data replication & costly data movement
• Private Caching Threshold (PCT)
  – Utilization >= PCT: mark as private
  – Utilization < PCT: mark as remote
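The PCT rule amounts to a single comparison. A small sketch, with a hypothetical function name and made-up utilization counts, using PCT = 4 as the default because that is the value the talk settles on:

```python
def classify(utilization, pct=4):
    """Return the caching mode for a line given its observed utilization count."""
    return "private" if utilization >= pct else "remote"


# Hypothetical utilization counts observed when lines left the L1 cache.
observed = [1, 2, 7, 4, 3, 12]
modes = [classify(u) for u in observed]
print(modes)
```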

Page 11: The Locality-Aware Adaptive Cache Coherence Protocol

Locality-Aware Cache Coherence

• Private Caching Threshold (PCT) = 4

[Figure: Invalidations vs Utilization. Y-axis: Invalidations Breakdown (%), 0% to 100%. Utilization bins: 1; 2,3; 4,5; 6,7; >=8. Regions of the chart are labeled Remote and Private.]

Page 12: The Locality-Aware Adaptive Cache Coherence Protocol

Outline

• Motivation for Locality-Aware Coherence
• Detailed Implementation
• Optimizations
• Evaluation
• Conclusion

Page 13: The Locality-Aware Adaptive Cache Coherence Protocol

Baseline System

• Compute pipeline
• Private L1-I and L1-D caches
• Logically shared, physically distributed L2 cache with integrated directory
• L2 cache managed by Reactive-NUCA [Hardavellas – ISCA09]
• ACKwise limited-directory protocol [Kurian – PACT10]

[Figure: a core tile containing the compute pipeline, L1 I-cache, L1 D-cache, L2 shared cache, directory, and router; three blocks labeled "M" appear alongside the tiles.]

Page 14: The Locality-Aware Adaptive Cache Coherence Protocol

Locality-Aware Coherence: Important Features

• Intelligent allocation of cache lines
  – In the private L1 cache
  – Allocation decision made per-core at cache line level
• Efficient locality tracking hardware
  – Decoupled from traditional coherence tracking structures
• Protocol complexity low
  – NO additional networks for deadlock avoidance

Page 15: The Locality-Aware Adaptive Cache Coherence Protocol

Implementation Details: Private Cache Line Tag

• Private Utilization bits to track cache line usage in L1 cache

• Communicated back to directory on eviction or invalidation

• Storage overhead is only 0.4%

Tag layout: State | LRU | Tag | Private Utilization

Page 16: The Locality-Aware Adaptive Cache Coherence Protocol

Implementation Details: Directory Entry

• P/R_i: Private/Remote Mode

• Remote-Utilization_i: Line usage by Core i at shared L2 cache

• Complete Locality Classifier: Track mode/remote-utilization for all cores

• Storage overhead reduced later

Directory entry layout: State | Tag | ACKwise Pointers 1 … p | P/R_1, Remote Utilization_1 | … | P/R_n, Remote Utilization_n

Page 17: The Locality-Aware Adaptive Cache Coherence Protocol

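Mode Transitions Summary

• Classification based on previous behavior

[Figure: two-state diagram, Private and Remote]
• Initial mode: Private
• Private -> Remote: Private Utilization < PCT
• Private stays Private: Private Utilization >= PCT
• Remote stays Remote: Remote Utilization < PCT
• Remote -> Private: Remote Utilization >= PCT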
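The transition diagram can be read as a small per-core, per-line state machine kept at the directory. The sketch below is one hedged interpretation: the class and method names are hypothetical, and it omits details such as resetting the remote count when another core writes the line.

```python
class SharerMode:
    """Per-core, per-line locality classification kept at the directory."""

    def __init__(self, pct):
        self.pct = pct
        self.mode = "private"        # every core starts in private mode
        self.remote_utilization = 0

    def on_private_copy_removed(self, private_utilization):
        # Called on eviction or invalidation with the count reported by the L1.
        if private_utilization < self.pct:
            self.mode = "remote"
            self.remote_utilization = 0

    def on_remote_access(self):
        # Called for each word access served at the shared L2 while remote.
        self.remote_utilization += 1
        if self.remote_utilization >= self.pct:
            self.mode = "private"
        return self.mode


core_a = SharerMode(pct=2)
core_a.on_private_copy_removed(private_utilization=1)  # demoted to remote
print(core_a.mode, core_a.on_remote_access(), core_a.on_remote_access())
```

With PCT = 2, a core whose copy was used once is demoted, serves one access remotely, and is promoted back on the second remote access.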

Page 18: The Locality-Aware Adaptive Cache Coherence Protocol

Walk Through Example (slides 18 to 49)

[Animated figure, condensed here into its sequence of states.]

Setup: Cores A, B and C are each shown as a pipeline + L1 cache, and Core D as the L2 cache + directory holding line X; all are connected by the network. Private Caching Threshold PCT = 2. All cores start out in private mode and line X is Uncached.

Notation: the directory lists each core as Private (U or C, read here as uncached or cached) or Remote (with its remote utilization count). An L1 entry is shown as its state plus private utilization, for example "Shared 1". The L2 copy is marked Clean or Dirty.

• Slides 19 to 21: Core A issues Read[X]. The directory marks Core-A as Private C, the line becomes Shared, and Cache Line [X] is returned. Core A's L1 holds X as Shared 1.
• Slides 22 to 24: Core C issues Read[X]. The directory marks Core-C as Private C and returns Cache Line [X]. Core C's L1 holds X as Shared 1.
• Slides 25 to 26: Core C reads X again; its L1 entry becomes Shared 2. Core A's entry stays Shared 1.
• Slides 27 to 28: Core B issues Write[X]. The directory sends Inv [X] to the sharers.
• Slides 29 to 30: Core A's copy becomes Invalid 0 and it sends Inv-Reply [X] (1). Its utilization of 1 is below PCT, so the directory reclassifies Core-A as Remote with remote utilization 0.
• Slides 31 to 32: Core C's copy becomes Invalid 0 and it sends Inv-Reply [X] (2). Its utilization of 2 meets PCT, so Core-C stays Private (U). The line is Uncached.
• Slides 33 to 34: The directory marks Core-B as Private C, the line becomes Modified, and Cache Line [X] is sent to Core B. Core B's L1 holds X as Modified 1.
• Slides 35 to 38: Core A issues Read[X]. The directory sends WB [X] to Core B, whose copy becomes Shared 1 and which answers with WB-Reply [X]. The directory state becomes Shared and the L2 copy changes from Clean to Dirty.
• Slide 39: Core A is in remote mode, so only Word [X] is returned. Core-A's remote utilization becomes 1.
• Slides 40 to 42: Core B issues Write [X] on its Shared 1 copy. The directory state becomes Modified, Core-A's remote utilization returns to 0, and Upgrade-Reply [X] is sent. Core B's L1 holds X as Modified 2.
• Slides 43 to 45: Core A issues Read [X]. Core B's copy goes from Modified 2 to Shared 2, the directory state is Shared, Core-A's remote utilization becomes 1, and Word [X] is returned.
• Slides 46 to 47: Core A issues Read [X] again. Core-A's remote utilization becomes 2.
• Slides 48 to 49: The remote utilization of 2 meets PCT, so Core-A is promoted to Private (C) and Cache Line [X] (2) is sent. Core A's L1 holds X as Shared 2; Core B's copy remains Shared 2.

Page 50: The Locality-Aware Adaptive Cache Coherence Protocol

Outline

• Motivation for Locality-Aware Coherence
• Detailed Implementation
• Optimizations
• Evaluation
• Conclusion

Page 51: The Locality-Aware Adaptive Cache Coherence Protocol

Complete Locality Classifier: High Directory Storage

• Complete Locality Classifier: Tracks locality information for all cores

Directory entry layout: State | Tag | ACKwise Pointers 1 … p | P/R_1, Remote Utilization_1 | … | P/R_n, Remote Utilization_n

Classifier                         | Complete
Bit overhead per core (256 KB L2)  | 192 KB (60%)

Page 52: The Locality-Aware Adaptive Cache Coherence Protocol

Limited Locality Classifier: Reduces Directory Storage

• Utilization and mode tracked for k sharers
• Modes of other sharers obtained by taking a majority vote

Directory entry layout: State | Tag | ACKwise Pointers 1 … p | Core ID_1, P/R_1, Remote Utilization_1 | … | Core ID_k, P/R_k, Remote Utilization_k
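A rough sketch of how a limited classifier might answer a mode query for a core it does not track. The function name, the dictionary representation, the default for an empty table, and the tie-break toward private are all assumptions made for illustration; the slide only states that a majority vote is taken.

```python
from collections import Counter


def mode_for(core, tracked, default="private"):
    """Return the mode of `core`; `tracked` maps the k tracked core ids to modes."""
    if core in tracked:
        return tracked[core]
    if not tracked:
        return default
    votes = Counter(tracked.values())
    # Ties are resolved toward private here; this tie-break is an assumption.
    return "remote" if votes["remote"] > votes["private"] else "private"


tracked = {0: "remote", 5: "remote", 9: "private"}   # a Limited-3 style table
print(mode_for(9, tracked), mode_for(12, tracked))
```

Core 9 is tracked and keeps its own mode, while untracked core 12 inherits the majority mode of the tracked sharers.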

Page 53: The Locality-Aware Adaptive Cache Coherence Protocol

Limited-3 Locality Classifier

• Utilization and mode tracked for 3 sharers

Classifier                         | Complete     | Limited-3
Bit overhead per core (256 KB L2)  | 192 KB (60%) | 18 KB (5.7%)

Metric           | Limited-3 vs Complete
Completion Time  | 3% lower
Energy           | 1.5% lower

Achieves the performance and energy of the Complete locality classifier
• CT and Energy lower because remote mode classification learned faster with Limited-3

Page 54: The Locality-Aware Adaptive Cache Coherence Protocol

Private <-> Remote Transition: Results In Private Cache Thrashing

[Figure: two-state diagram, Private and Remote]
• Initial mode: Private
• Private -> Remote: Private Utilization < PCT
• Private stays Private: Private Utilization >= PCT
• Remote stays Remote: Remote Utilization < PCT
• Remote -> Private: Remote Utilization >= PCT

• Core reverts back to private mode after #PCT accesses to cache line at shared L2 cache

• Evicts other lines in the private L1 cache
• Results in low spatio-temporal locality for all

• Difficult to measure private cache locality of line in shared L2 cache

Page 55: The Locality-Aware Adaptive Cache Coherence Protocol

Ideal Classifier: NO Private Cache Thrashing

• Ideal classifier maintains part of the working set in the private cache

• Other lines placed in remote mode at shared cache

Page 56: The Locality-Aware Adaptive Cache Coherence Protocol

Remote Access Threshold: Reduces Private Cache Thrashing

• Remote Access Threshold (RAT) varied based on PCT & application behavior [details in paper]

[Figure: two-state diagram, Private and Remote]
• Initial mode: Private
• Private -> Remote: Private Utilization < PCT
• Private stays Private: Private Utilization >= PCT
• Remote stays Remote: Remote Utilization < RAT
• Remote -> Private: Remote Utilization >= RAT

• If core classified as remote sharer (capacity), increase cost of promotion to private mode

• If core classified as private sharer, reset the cost back to its starting value

Reduces private cache thrashing to a negligible level
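One possible way to model the adjustable promotion cost is sketched below. The slide defers the actual RAT policy to the paper, so the specifics here are assumptions chosen only to illustrate the "increase, then reset" behaviour: the threshold starts at PCT, doubles on each remote classification, and is capped at a hypothetical maximum of 16.

```python
class RemoteAccessThreshold:
    """Toy RAT: promotion to private mode gets costlier after each demotion."""

    def __init__(self, pct, rat_max=16):
        self.start = pct          # assumed starting value of RAT
        self.rat = pct
        self.rat_max = rat_max    # assumed cap, not taken from the talk

    def on_classified_remote(self):
        # Core was demoted for capacity reasons: raise the cost of promotion.
        self.rat = min(self.rat * 2, self.rat_max)

    def on_classified_private(self):
        # Core proved it has locality: reset the cost to its starting value.
        self.rat = self.start

    def should_promote(self, remote_utilization):
        return remote_utilization >= self.rat


rat = RemoteAccessThreshold(pct=4)
rat.on_classified_remote()
print(rat.rat, rat.should_promote(4), rat.should_promote(8))
```

After one demotion, four remote accesses are no longer enough to bring the line back into the L1, which is what keeps a thrashing line from repeatedly evicting useful data.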

Page 57: The Locality-Aware Adaptive Cache Coherence Protocol

Outline

• Motivation for Locality-Aware Coherence
• Implementation Details
• Optimizations
• Evaluation
• Conclusion

Page 58: The Locality-Aware Adaptive Cache Coherence Protocol

Reducing Capacity Misses: Private L1 Cache Miss Rate vs PCT (Blackscholes)

• Miss rate reduces as PCT increases (better utilization)
• Multiple capacity misses (expensive) replaced with single word access (cheap)
• Cache miss rate increases towards the end (one capacity miss turns into multiple word misses)

[Figure: Cache Miss Rate Breakdown (%) vs PCT. X-axis: PCT from 1 to 8. Y-axis: 0 to 3. Categories: Cold, Capacity, Upgrade, Sharing, Word.]

Page 59: The Locality-Aware Adaptive Cache Coherence Protocol

Energy vs PCT: Blackscholes

• Reducing L1 cache misses (and converting capacity misses into word accesses) leads to less network traffic and fewer L2 accesses

• Accessing a word (200 bits) cheaper than fetching the entire cache line (640 bits)

[Figure: Energy (normalized) vs PCT. X-axis: PCT from 1 to 8. Y-axis: 0 to 1.2. Components: Network Link, Network Router, Directory, L2 Cache, L1-D Cache, L1-I Cache.]

Page 60: The Locality-Aware Adaptive Cache Coherence Protocol

Completion Time vs PCT: Blackscholes

• Lower L1 cache miss rate + miss penalty
• Less time spent waiting on L1 cache misses

[Figure: Completion Time (normalized) vs PCT. X-axis: PCT from 1 to 8. Y-axis: 0 to 1.2. Components: Synchronization, L2Cache-OffChip, L2Cache-Sharers, L2Cache-Waiting, L1Cache-L2Cache, Compute.]

Page 61: The Locality-Aware Adaptive Cache Coherence Protocol

Reducing Sharing Misses: Private L1 Cache Miss Rate vs PCT (Streamcluster)

• Sharing misses (expensive) turned into word misses (cheap) as PCT increases

[Figure: Cache Miss Rate Breakdown (%) vs PCT. X-axis: PCT from 1 to 8. Y-axis: 0 to 8. Categories: Cold, Capacity, Upgrade, Sharing, Word.]

Page 62: The Locality-Aware Adaptive Cache Coherence Protocol

Energy vs PCT: Streamcluster

• Reduce invalidations, asynchronous write-backs and cache-line ping-ponging

[Figure: Energy (normalized) vs PCT. X-axis: PCT from 1 to 8. Y-axis: 0 to 1.2. Components: Network Link, Network Router, Directory, L2 Cache, L1-D Cache, L1-I Cache.]

Page 63: The Locality-Aware Adaptive Cache Coherence Protocol

Completion Time vs PCT: Streamcluster

• Less time spent waiting for invalidations and by loads waiting for previous stores

• Critical section time reduction -> synchronization time reduction

[Figure: Completion Time (normalized) vs PCT. X-axis: PCT from 1 to 8. Y-axis: 0 to 1.2. Components: Synchronization, L2Cache-OffChip, L2Cache-Sharers, L2Cache-Waiting, L1Cache-L2Cache, Compute.]

Page 64: The Locality-Aware Adaptive Cache Coherence Protocol

Variation with PCT: Results Summary

• Evaluated 18 benchmarks from the SPLASH-2, PARSEC, Parallel MI-Bench and UHPC suites + 3 hand-written benchmarks

• PCT of 4 obtains 25% reduction in energy and 15% reduction in completion time

• Evaluations done using Graphite simulator for 64 cores, McPAT/CACTI cache energy models and DSENT network energy models at 11 nm

Page 65: The Locality-Aware Adaptive Cache Coherence Protocol

Conclusion

• Three potential advantages of the locality-aware adaptive cache coherence protocol
  – Better private cache utilization
  – Reduced on-chip communication (invalidations, asynchronous write-backs and cache-line transfers)
  – Reduced memory access latency and energy
• Efficient locality tracking hardware
• Decoupled from traditional coherence tracking structures
• Limited-3 locality classifier has low overhead of 18 KB per core (with 256 KB per-core L2 cache)
• Simple to implement
  – NO additional networks for deadlock avoidance