
Locality-Aware Data Replication in the Last-Level Cache

George Kurian¹, Srinivas Devadas¹, Omer Khan²

¹ Massachusetts Institute of Technology
² University of Connecticut, Storrs

The Problem

• Future multicore processors will have hundreds of cores
• LLC management is key to optimizing performance and energy
• Last-level cache (LLC) data locality and off-chip miss rates often show opposing trends


• Goal: Intelligent replication at the LLC

# Network hops ≈ ⅔ · √N for an N-core mesh
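The ⅔·√N figure matches the expected Manhattan distance between two uniformly random tiles on a √N × √N mesh; a quick sanity-check derivation (assuming uniform traffic and dimension-ordered routing):

```latex
% Per dimension, over k = \sqrt{N} tile positions:
%   E|x_1 - x_2| = (k^2 - 1) / (3k)
% Summing the two mesh dimensions:
\[
  \mathbb{E}[\text{hops}] \;=\; 2\cdot\frac{k^{2}-1}{3k}
  \;\approx\; \frac{2}{3}k \;=\; \frac{2}{3}\sqrt{N}.
\]
```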

LLC Replication Strategy

• Black block shows benefit with replication
  – E.g., frequently-read shared data
  – Core-1 and Core-2 allowed to create replicas
• Red block shows NO benefit with replication
  – E.g., frequently-written shared data

Fig: Tiled multicore. Each tile contains a compute pipeline, private L1-I/L1-D caches, an L2 cache (LLC slice) with directory, and a router. "Home" marks a line's home LLC slice; "Replica" marks the slices where Core-1 and Core-2 create local copies.

Outline

• Motivation
• Comparison to Previous Schemes
• Design & Implementation
• Evaluation
• Conclusion

Motivation: Reuse at the LLC

• Reuse: # accesses to a cache line by a core before eviction or a conflicting access by another core
  – Note: private L1 cache hits are filtered out

Fig: Reuse example on the tiled multicore. Core 3 makes 5 accesses to a line at its home LLC slice; Core 4 then issues a conflicting write, so Reuse = 5.
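The reuse metric can be made concrete with a small trace-driven sketch; the event hooks and trace format below are illustrative, not the paper's hardware mechanism:

```python
from collections import defaultdict

# reuse[(line, core)] = accesses by `core` to `line` at the LLC so far.
# L1 hits never reach the LLC, so they are filtered out by construction.
reuse = defaultdict(int)
finished_runs = []  # (line, core, reuse) tuples for completed runs

def on_llc_access(line, core, is_write):
    if is_write:
        # A conflicting write ends every other core's reuse run.
        for key in [k for k in reuse if k[0] == line and k[1] != core]:
            finished_runs.append((key[0], key[1], reuse.pop(key)))
    reuse[(line, core)] += 1

def on_llc_eviction(line):
    # Eviction ends all reuse runs for the line.
    for key in [k for k in reuse if k[0] == line]:
        finished_runs.append((key[0], key[1], reuse.pop(key)))

# The slide's example: Core 3 accesses a line 5 times, then Core 4 writes it.
for _ in range(5):
    on_llc_access(0xA0, core=3, is_write=False)
on_llc_access(0xA0, core=4, is_write=True)
assert finished_runs == [(0xA0, 3, 5)]  # Reuse = 5
```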


Motivation: Reuse Determines Replication Benefit

• Reuse: # accesses to a cache line by a core before eviction or a conflicting access by another core
• The higher the reuse, the higher the efficacy of replication

Fig: LLC Access Count vs Reuse. For each benchmark (RADIX, FFT, LU-C, LU-NC, CHOLESKY, BARNES, OCEAN-C, OCEAN-NC, WATER-NSQ, RAYTRACE, VOLREND, BLACKSCH., SWAPTIONS, FLUIDANIM., STREAMCLUS., DEDUP, FERRET, BODYTRACK, FACESIM, PATRICIA, CONCOMP), the fraction of LLC accesses (0–100%) falling in reuse buckets [1-2], [3-9], and [≥10].

Motivation (cont'd): Reuse Determines Replication Benefit

• Reuse: # accesses to a cache line by a core before eviction or a conflicting access by another core
• The higher the reuse, the higher the efficacy of replication

Fig: LLC Access Count vs Reuse (same data as above, annotated): high-reuse buckets are marked "Replicate", the low-reuse [1-2] bucket "Don't Replicate".

Motivation (cont'd): Reuse Independent of Cache Line Type

• Private data exhibits varying degrees of reuse

Fig: LLC Access Count vs Reuse for private data only (reuse buckets [1-2], [3-9], [≥10]; same benchmarks as above).

Motivation (cont'd): Reuse Independent of Cache Line Type

• Instructions mostly exhibit high reuse

Fig: LLC Access Count vs Reuse for private data and instructions (same benchmarks and reuse buckets).

Motivation (cont'd): Reuse Independent of Cache Line Type

• Shared read-only data exhibits varying degrees of reuse

Fig: LLC Access Count vs Reuse for private, instruction, and shared read-only lines (same benchmarks and reuse buckets).

Motivation (cont'd): Reuse Independent of Cache Line Type

• Shared read-write data exhibits varying degrees of reuse

Fig: LLC Access Count vs Reuse for private, instruction, shared read-only, and shared read-write lines (same benchmarks and reuse buckets).

Motivation (cont'd): Reuse Independent of Cache Line Type

• Replication must be based on reuse and not cache line classification

Fig: LLC Access Count vs Reuse, all four line types (same benchmarks and reuse buckets).

Takeaway — replicate based on reuse:
• Instructions
• Shared read-only data
• Shared read-write data
• Even private data

Locality-Aware Replication: Salient Features

• Locality-based: driven by reuse, not memory classification information
  – Replicate data with high reuse
  – Bypass replication mechanisms for low-reuse data
• Cache-line level: reuse measured and replication decision made at cache-line granularity
• Dynamic: reuse profiled at runtime using highly accurate hardware counters
• Minimal coherence protocol changes: replication is done at the local LLC slice
• Fully hardware: the LLC replication techniques require no modification to the operating system


Comparison to Previous Schemes

LLC Management Scheme | Replication Candidates | When to Replicate?
Static-NUCA (S-NUCA) | None | Never
Reactive-NUCA (R-NUCA) | Instructions (per-cluster) | Every L1 cache miss (no intelligence)
Victim Replication (VR) | All | Every L1 cache eviction (no intelligence)
Adaptive Selective Replication (ASR) | Shared read-only | L1 cache eviction (adapts replication level)
Locality-Aware Replication | All | L1 cache miss (detects high reuse)

Outline

• Motivation
• Comparison to Previous Schemes
• Design & Implementation
• Evaluation
• Conclusion

Baseline System

• Compute pipeline with private L1-I and L1-D caches
• Logically shared, physically distributed L2 cache (LLC) with integrated directory

Fig: Baseline tile: compute pipeline, L1 I-cache, L1 D-cache, L2 cache (LLC) with integrated directory, and router; M denotes memory controllers.

• LLC managed using Reactive-NUCA [Hardavellas, ISCA'09]
  – Private pages placed locally; shared pages striped across slices
• ACKwise limited-directory protocol [Kurian, PACT'10]
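As a rough illustration of the baseline placement policy, a sketch of home-slice selection; the tile count, line size, and function name are illustrative, not the simulator's actual interface:

```python
NUM_TILES = 64   # illustrative: matches the 64-core evaluation
LINE_BYTES = 64  # assumed cache-line size

def home_slice(addr: int, page_is_private: bool, requester_tile: int) -> int:
    """Pick the LLC slice that homes the line containing `addr`."""
    if page_is_private:
        # R-NUCA places private pages in the requester's local LLC slice.
        return requester_tile
    # Shared pages are striped (address-interleaved) across all slices.
    return (addr // LINE_BYTES) % NUM_TILES
```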


Locality Tracking Intelligence: Replica Reuse Counter

• Replica Reuse: tracks cache-line usage by a core at the LLC replica
• The replica reuse counter is communicated back to the directory on eviction or invalidation, for classification
• NO additional network messages
• Storage overhead: 1 KB (0.4%)

Fig: LLC tag/directory entry layout: State | Tag | LRU | Replica Reuse, plus ACKwise pointers (1…p) and a complete locality list (1…n) of ⟨Mode_i, Home Reuse_i⟩ entries.

Locality Tracking Intelligence: Mode & Home Reuse Counters

• Mode_i: can the cache line be replicated at Core_i?
• Home Reuse_i: tracks cache-line usage by Core_i at the home LLC slice
• Complete locality classifier: tracks locality information for all cores and all LLC cache lines
• Storage overhead: 96 KB (30%)
  – We'll fix this later


Mode Transitions: Replication Intelligence

• Initially, no replica is created; all requests are serviced at the LLC home
• Replication decision made based on previous cache-line reuse behavior

Mode Transitions

• Home Reuse counter: tracks the # accesses by a core at the LLC home location
• Replication decision made based on previous cache-line reuse behavior

Mode Transitions

• A replica is created if enough reuse is detected at the LLC home
• If (Home Reuse >= Replication Threshold): promote to "Replica" mode and create a replica
• The Replication Threshold (RT) determines how many replicas are created

Mode Transitions

• Replica Reuse counter: tracks the # accesses to the LLC at the replica location

Mode Transitions

• Eviction from the LLC replica location (triggered by capacity limitations):
  – If (Replica Reuse >= Replication Threshold): stay in "Replica" mode
  – Else: demote to "No Replica" mode

Mode Transitions

• Invalidation at the LLC replica location (triggered by a conflicting write):
  – If (Replica Reuse + Home Reuse >= Replication Threshold): stay in "Replica" mode
  – Else: demote to "No Replica" mode

Mode Transitions

• Conflicting write from another core: reset the Home Reuse counter to 0
  – A line in "No Replica" mode thus stays there while Home Reuse < RT

Mode Transitions Summary

Fig: mode transition diagram. Initial → No Replica. No Replica → Replica when Home Reuse >= RT; remains No Replica while Home Reuse < RT. Replica remains Replica when XReuse >= RT and is demoted to No Replica when XReuse < RT. (RT: Replication Threshold; XReuse: Replica Reuse on eviction, Replica + Home Reuse on invalidation.)

• Replication decision made based on previous cache-line reuse behavior; a behavioral sketch of the state machine follows
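The transitions above can be consolidated in a compact behavioral sketch; the class and method names are illustrative, not the paper's hardware:

```python
from enum import Enum

RT = 3  # Replication Threshold (the static value chosen in the evaluation)

class Mode(Enum):
    NO_REPLICA = 0
    REPLICA = 1

class LineLocalityState:
    """Per-(cache line, core) mode and reuse counters."""
    def __init__(self):
        self.mode = Mode.NO_REPLICA  # Initial: no replica, serve at home
        self.home_reuse = 0
        self.replica_reuse = 0

    def on_home_access(self):
        # Access serviced at the LLC home; promote once reuse is proven.
        self.home_reuse += 1
        if self.home_reuse >= RT:
            self.mode = Mode.REPLICA  # subsequent misses create a replica

    def on_replica_access(self):
        self.replica_reuse += 1

    def on_replica_eviction(self):
        # Capacity eviction: stay in Replica mode only if the replica
        # itself saw enough reuse.
        if self.replica_reuse < RT:
            self.mode = Mode.NO_REPLICA
        self.home_reuse = self.replica_reuse = 0

    def on_replica_invalidation(self):
        # Conflicting write: judge on combined replica + home reuse.
        if self.replica_reuse + self.home_reuse < RT:
            self.mode = Mode.NO_REPLICA
        self.home_reuse = self.replica_reuse = 0

    def on_conflicting_write_at_home(self):
        # Conflicting write while in No Replica mode resets the counter.
        self.home_reuse = 0
```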


Locality Tracking Intelligence: Limited-k Locality Classifier

• Complete locality classifier: prohibitive storage overhead (30%)
• Limited locality classifier (k): mode and home reuse information tracked for only k cores
• Modes of untracked cores obtained by majority voting (see the sketch below)
• Smaller k → lower overhead
• Inactive cores replaced in the locality list based on access pattern, to accommodate new sharers
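A minimal sketch of the majority-vote fallback; the data layout and names are illustrative:

```python
from collections import Counter
from typing import Dict

Mode = str  # "REPLICA" or "NO_REPLICA"

def classify(core_id: int, tracked: Dict[int, Mode]) -> Mode:
    """Limited-k lookup: exact mode if tracked, else majority vote."""
    if core_id in tracked:
        return tracked[core_id]
    # Core not among the k tracked entries: vote over the tracked modes.
    return Counter(tracked.values()).most_common(1)[0][0]

# Example: k = 3 tracked cores, two of them in REPLICA mode.
print(classify(7, {0: "REPLICA", 1: "REPLICA", 2: "NO_REPLICA"}))  # REPLICA
```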

Fig: Limited-k entry layout: State | Tag | LRU, ACKwise pointers (1…p), and a limited locality list (1…k) of ⟨Core ID_i, Mode_i, Home Reuse_i⟩ entries.

Limited-3 Locality Classifier

• The Limited-3 classifier approximates the performance & energy of the Complete classifier


Classifier (256KB L2, 32KB L1-D, 16KB L1-I) | Complete | Limited-3
Storage overhead per core | 96 KB (30%) | 13.5 KB (4.5%)

Metric | Limited-3 vs Complete
Completion Time | 0.6% higher
Energy | 1.0% higher

• Mode and Home Reuse tracked for 3 sharers
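The quoted percentages are consistent with dividing classifier storage by the per-core cache capacity listed in the table header; a quick check, assuming that denominator:

```latex
% Per-core cache capacity: 256 + 32 + 16 = 304 KB
\[
  \frac{96\,\text{KB}}{304\,\text{KB}} \approx 31.6\%\ \text{(Complete)},
  \qquad
  \frac{13.5\,\text{KB}}{304\,\text{KB}} \approx 4.4\%\ \text{(Limited-3)}.
\]
```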


Outline

• Motivation
• Comparison to Previous Schemes
• Design & Implementation
• Evaluation
• Conclusion

Evaluation Methodology

• Evaluations done using
  – Graphite simulator for 64 cores
  – McPAT/CACTI cache energy models and DSENT network energy models at 11 nm
• Evaluated 21 benchmarks from the SPLASH-2 (11), PARSEC (8), Parallel MI-bench (1), and UHPC (1) suites
• LLC management schemes compared:
  – Static-NUCA (S-NUCA)
  – Reactive-NUCA (R-NUCA)
  – Victim Replication (VR)
  – Adaptive Selective Replication (ASR) [modified]
  – Locality-Aware Replication (RT-1, RT-3, RT-8)

Replicate Shared Read-Write Data: LLC Accesses (BARNES)

• Most LLC accesses are reads to widely shared, high-reuse shared read-write data
• Important to replicate shared read-write data

Fig: LLC Access Count vs Number of Sharers (1–64) for BARNES, broken down by line type (Private, Instruction, Shared Read-Only, Shared Read-Write) and reuse bucket ([1-2], [3-9], [≥10]).

Replicate Shared Read-Write Data: Energy Results (BARNES)

• The locality-aware protocol reduces network router & link energy by replicating shared read-write data locally
• Victim Replication (VR) obtains limited energy benefits
  – (Almost) blind replica-creation scheme
  – Simplistic LLC replacement policy
  – Removes and re-inserts replicas on L1 misses & evictions
• Adaptive Selective Replication (ASR) and Reactive-NUCA do not replicate shared read-write data

Fig: Normalized energy for S-NUCA, R-NUCA, VR, ASR, RT-1, RT-3, and RT-8, broken down into DRAM, network link, network router, directory, L2 cache (LLC), L1-D cache, and L1-I cache components.

Replicate Shared Read-Write Data: Completion Time Results (BARNES)

• The locality-aware protocol reduces communication time with the LLC home (L1-To-LLC-Home)

Fig: Normalized completion time across schemes, broken down into Compute, L1-To-LLC-Replica, L1-To-LLC-Home, LLC-Home-Waiting, LLC-Home-To-Sharers, LLC-Home-To-OffChip, and Synchronization components.

Replicate Private Cache Lines: Page vs Cache-Line Classification (BLACKSCHOLES)

• Page-level classification incurs false positives
  – Multiple cores work privately on cache lines in the same page
  – Page classified shared read-only instead of private
• Page-level data placement is not optimal
  – Reactive-NUCA cannot localize most LLC accesses
• Replicate private data to localize all LLC accesses

Fig: LLC Access Count (0–100%) by classification (Private, Instruction, Shared Read-Only, Shared Read-Write) under page-level vs cache-line-level classification.

Replicate Private Cache Lines: Energy Results (BLACKSCHOLES)

• The locality-aware protocol reduces network energy through replication of private cache lines
• ASR replicates just shared read-only cache lines
• VR obtains limited improvements in energy
  – Still restricted by its replication mechanisms

Fig: Normalized energy breakdown across schemes (same components as above).

Replicate All Classes of Cache Lines: LLC Accesses (BODYTRACK)

• Most LLC accesses are reads to widely shared, high-reuse instructions, shared read-only, and shared read-write data
• The best replication policy should optimize handling of all 3 classes of cache lines

Fig: LLC Access Count vs Number of Sharers (1–64) for BODYTRACK, by line type and reuse bucket.

Replicate All Classes of Cache Lines: Energy Results (BODYTRACK)

• R-NUCA replicates instructions, hence obtains small network energy improvements
• ASR replicates instructions and shared read-only data, and obtains larger energy improvements
• The locality-aware protocol replicates shared read-write data as well

Fig: Normalized energy breakdown across schemes (same components as above).

Use the Optimal Replication Threshold: Energy Results (STREAMCLUSTER)

• Replication must be intelligent
• RT-1 performs badly due to LLC pollution
• RT-8 identifies fewer replicas and is slow to identify useful ones
• RT-3 identifies more replicas, and faster, without polluting the LLC
• Use the optimal replication threshold of 3

Fig: Normalized energy breakdown across schemes (same components as above).

Results Summary

• We choose a static Replication Threshold (RT) of 3
• Energy improved by 13–21%
• Completion time improved by 4–13%

Fig: Normalized Energy (left) and Completion Time (right) for S-NUCA, R-NUCA, VR, ASR, RT-1, RT-3, and RT-8.

Conclusion

• Locality-aware instruction and data replication in the last-level cache (LLC)
• Spatio-temporal locality profiled dynamically at the cache-line level using low-overhead yet highly accurate hardware counters
• Enables replication only for lines with high reuse
• Requires minimal changes to the baseline cache coherence protocol, since replicas are placed locally
• Significant energy and performance improvements over state-of-the-art replication mechanisms

See The Paper For …

• Exhaustive benchmark case studies
  – Apps with migratory shared data
  – Apps with NO benefit from replication
• Limited locality classifier study
  – Sensitivity to the number of tracked cores (k)
• Cluster-level locality-aware LLC replication study
  – Sensitivity to cluster size (C)

Thank You! Questions?
