
Locality-Aware Data Replication in the Last-Level Cache

George Kurian¹, Srinivas Devadas¹, Omer Khan²

¹ Massachusetts Institute of Technology
² University of Connecticut, Storrs

The Problem

• Future multicore processors will have hundreds of cores
• LLC management is key to optimizing performance and energy
• Last-level cache (LLC) data locality and off-chip miss rates often show opposing trends


• Goal: Intelligent replication at the LLC

# Network hops ≈ ⅔ · √N for an N-core mesh
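The ⅔·√N figure matches the expected Manhattan distance between two uniformly random tiles on a √N × √N mesh; a quick sanity-check derivation (assuming uniform traffic and dimension-ordered routing):

```latex
% Per dimension, over k = \sqrt{N} tile positions:
%   E|x_1 - x_2| = (k^2 - 1) / (3k)
% Summing the two mesh dimensions:
\[
  \mathbb{E}[\text{hops}] \;=\; 2\cdot\frac{k^{2}-1}{3k}
  \;\approx\; \frac{2}{3}k \;=\; \frac{2}{3}\sqrt{N}.
\]
```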

LLC Replication Strategy

• Black block shows benefit with replication
  – E.g., frequently-read shared data
  – Core-1 and Core-2 allowed to create replicas
• Red block shows NO benefit with replication
  – E.g., frequently-written shared data

Fig: Tiled multicore. Each tile contains a compute pipeline, private L1-I/L1-D caches, an L2 cache (LLC slice) with directory, and a router. "Home" marks a line's home LLC slice; "Replica" marks the slices where Core-1 and Core-2 create local copies.

Outline

• Motivation
• Comparison to Previous Schemes
• Design & Implementation
• Evaluation
• Conclusion

Motivation: Reuse at the LLC

• Reuse: # accesses to a cache line by a core before eviction or a conflicting access by another core
  – Note: private L1 cache hits are filtered out

Fig: Reuse example on the tiled multicore. Core 3 makes 5 accesses to a line at its home LLC slice; Core 4 then issues a conflicting write, so Reuse = 5.
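The reuse metric can be made concrete with a small trace-driven sketch; the event hooks and trace format below are illustrative, not the paper's hardware mechanism:

```python
from collections import defaultdict

# reuse[(line, core)] = accesses by `core` to `line` at the LLC so far.
# L1 hits never reach the LLC, so they are filtered out by construction.
reuse = defaultdict(int)
finished_runs = []  # (line, core, reuse) tuples for completed runs

def on_llc_access(line, core, is_write):
    if is_write:
        # A conflicting write ends every other core's reuse run.
        for key in [k for k in reuse if k[0] == line and k[1] != core]:
            finished_runs.append((key[0], key[1], reuse.pop(key)))
    reuse[(line, core)] += 1

def on_llc_eviction(line):
    # Eviction ends all reuse runs for the line.
    for key in [k for k in reuse if k[0] == line]:
        finished_runs.append((key[0], key[1], reuse.pop(key)))

# The slide's example: Core 3 accesses a line 5 times, then Core 4 writes it.
for _ in range(5):
    on_llc_access(0xA0, core=3, is_write=False)
on_llc_access(0xA0, core=4, is_write=True)
assert finished_runs == [(0xA0, 3, 5)]  # Reuse = 5
```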


Motivation: Reuse Determines Replication Benefit

• Reuse: # accesses to a cache line by a core before eviction or a conflicting access by another core
• The higher the reuse, the higher the efficacy of replication

Fig: LLC Access Count vs Reuse. For each benchmark (RADIX, FFT, LU-C, LU-NC, CHOLESKY, BARNES, OCEAN-C, OCEAN-NC, WATER-NSQ, RAYTRACE, VOLREND, BLACKSCH., SWAPTIONS, FLUIDANIM., STREAMCLUS., DEDUP, FERRET, BODYTRACK, FACESIM, PATRICIA, CONCOMP), the fraction of LLC accesses (0–100%) falling in reuse buckets [1-2], [3-9], and [≥10].

Motivation (cont'd): Reuse Determines Replication Benefit

• Reuse: # accesses to a cache line by a core before eviction or a conflicting access by another core
• The higher the reuse, the higher the efficacy of replication

Fig: LLC Access Count vs Reuse (same data as above, annotated): high-reuse buckets are marked "Replicate", the low-reuse [1-2] bucket "Don't Replicate".

Motivation (cont'd): Reuse Independent of Cache Line Type

• Private data exhibits varying degrees of reuse

Fig: LLC Access Count vs Reuse for private data only (reuse buckets [1-2], [3-9], [≥10]; same benchmarks as above).

Motivation (cont'd): Reuse Independent of Cache Line Type

• Instructions mostly exhibit high reuse

Fig: LLC Access Count vs Reuse for private data and instructions (same benchmarks and reuse buckets).

Motivation (cont'd): Reuse Independent of Cache Line Type

• Shared read-only data exhibits varying degrees of reuse

Fig: LLC Access Count vs Reuse for private, instruction, and shared read-only lines (same benchmarks and reuse buckets).

Motivation (cont'd): Reuse Independent of Cache Line Type

• Shared read-write data exhibits varying degrees of reuse

Fig: LLC Access Count vs Reuse for private, instruction, shared read-only, and shared read-write lines (same benchmarks and reuse buckets).

Motivation (cont'd): Reuse Independent of Cache Line Type

• Replication must be based on reuse and not cache line classification

Fig: LLC Access Count vs Reuse, all four line types (same benchmarks and reuse buckets).

Takeaway — replicate based on reuse:
• Instructions
• Shared read-only data
• Shared read-write data
• Even private data

Locality-Aware Replication: Salient Features

• Locality-based: driven by reuse, not memory classification information
  – Replicate data with high reuse
  – Bypass replication mechanisms for low-reuse data
• Cache-line level: reuse measured and replication decision made at cache-line granularity
• Dynamic: reuse profiled at runtime using highly accurate hardware counters
• Minimal coherence protocol changes: replication is done at the local LLC slice
• Fully hardware: the LLC replication techniques require no modification to the operating system


Comparison to Previous Schemes

LLC Management Scheme | Replication Candidates | When to Replicate?
Static-NUCA (S-NUCA) | None | Never
Reactive-NUCA (R-NUCA) | Instructions (per-cluster) | Every L1 cache miss (no intelligence)
Victim Replication (VR) | All | Every L1 cache eviction (no intelligence)
Adaptive Selective Replication (ASR) | Shared read-only | L1 cache eviction (adapts replication level)
Locality-Aware Replication | All | L1 cache miss (detects high reuse)

Outline

• Motivation
• Comparison to Previous Schemes
• Design & Implementation
• Evaluation
• Conclusion

Baseline System

• Compute pipeline with private L1-I and L1-D caches
• Logically shared, physically distributed L2 cache (LLC) with integrated directory

Fig: Baseline tile: compute pipeline, L1 I-cache, L1 D-cache, L2 cache (LLC) with integrated directory, and router; M denotes memory controllers.

• LLC managed using Reactive-NUCA [Hardavellas, ISCA'09]
  – Private pages placed locally; shared pages striped across slices
• ACKwise limited-directory protocol [Kurian, PACT'10]
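As a rough illustration of the baseline placement policy, a sketch of home-slice selection; the tile count, line size, and function name are illustrative, not the simulator's actual interface:

```python
NUM_TILES = 64   # illustrative: matches the 64-core evaluation
LINE_BYTES = 64  # assumed cache-line size

def home_slice(addr: int, page_is_private: bool, requester_tile: int) -> int:
    """Pick the LLC slice that homes the line containing `addr`."""
    if page_is_private:
        # R-NUCA places private pages in the requester's local LLC slice.
        return requester_tile
    # Shared pages are striped (address-interleaved) across all slices.
    return (addr // LINE_BYTES) % NUM_TILES
```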


Locality Tracking Intelligence: Replica Reuse Counter

• Replica Reuse: tracks cache-line usage by a core at the LLC replica
• The replica reuse counter is communicated back to the directory on eviction or invalidation, for classification
• NO additional network messages
• Storage overhead: 1 KB (0.4%)

Fig: LLC tag/directory entry layout: State | Tag | LRU | Replica Reuse, plus ACKwise pointers (1…p) and a complete locality list (1…n) of ⟨Mode_i, Home Reuse_i⟩ entries.

Locality Tracking Intelligence: Mode & Home Reuse Counters

• Mode_i: can the cache line be replicated at Core_i?
• Home Reuse_i: tracks cache-line usage by Core_i at the home LLC slice
• Complete locality classifier: tracks locality information for all cores and all LLC cache lines
• Storage overhead: 96 KB (30%)
  – We'll fix this later


Mode Transitions: Replication Intelligence

• Initially, no replica is created; all requests are serviced at the LLC home
• Replication decision made based on previous cache-line reuse behavior

Mode Transitions

• Home Reuse counter: tracks the # accesses by a core at the LLC home location
• Replication decision made based on previous cache-line reuse behavior

Mode Transitions

• A replica is created if enough reuse is detected at the LLC home
• If (Home Reuse >= Replication Threshold): promote to "Replica" mode and create a replica
• The Replication Threshold (RT) determines how many replicas are created

Mode Transitions

• Replica Reuse counter: tracks the # accesses to the LLC at the replica location

Mode Transitions

• Eviction from the LLC replica location (triggered by capacity limitations):
  – If (Replica Reuse >= Replication Threshold): stay in "Replica" mode
  – Else: demote to "No Replica" mode

Mode Transitions

• Invalidation at the LLC replica location (triggered by a conflicting write):
  – If (Replica Reuse + Home Reuse >= Replication Threshold): stay in "Replica" mode
  – Else: demote to "No Replica" mode

Mode Transitions

• Conflicting write from another core: reset the Home Reuse counter to 0
  – A line in "No Replica" mode thus stays there while Home Reuse < RT

Mode Transitions Summary

Fig: mode transition diagram. Initial → No Replica. No Replica → Replica when Home Reuse >= RT; remains No Replica while Home Reuse < RT. Replica remains Replica when XReuse >= RT and is demoted to No Replica when XReuse < RT. (RT: Replication Threshold; XReuse: Replica Reuse on eviction, Replica + Home Reuse on invalidation.)

• Replication decision made based on previous cache-line reuse behavior; a behavioral sketch of the state machine follows
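The transitions above can be consolidated in a compact behavioral sketch; the class and method names are illustrative, not the paper's hardware:

```python
from enum import Enum

RT = 3  # Replication Threshold (the static value chosen in the evaluation)

class Mode(Enum):
    NO_REPLICA = 0
    REPLICA = 1

class LineLocalityState:
    """Per-(cache line, core) mode and reuse counters."""
    def __init__(self):
        self.mode = Mode.NO_REPLICA  # Initial: no replica, serve at home
        self.home_reuse = 0
        self.replica_reuse = 0

    def on_home_access(self):
        # Access serviced at the LLC home; promote once reuse is proven.
        self.home_reuse += 1
        if self.home_reuse >= RT:
            self.mode = Mode.REPLICA  # subsequent misses create a replica

    def on_replica_access(self):
        self.replica_reuse += 1

    def on_replica_eviction(self):
        # Capacity eviction: stay in Replica mode only if the replica
        # itself saw enough reuse.
        if self.replica_reuse < RT:
            self.mode = Mode.NO_REPLICA
        self.home_reuse = self.replica_reuse = 0

    def on_replica_invalidation(self):
        # Conflicting write: judge on combined replica + home reuse.
        if self.replica_reuse + self.home_reuse < RT:
            self.mode = Mode.NO_REPLICA
        self.home_reuse = self.replica_reuse = 0

    def on_conflicting_write_at_home(self):
        # Conflicting write while in No Replica mode resets the counter.
        self.home_reuse = 0
```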


Locality Tracking Intelligence: Limited-k Locality Classifier

• Complete locality classifier: prohibitive storage overhead (30%)
• Limited locality classifier (k): mode and home reuse information tracked for only k cores
• Modes of untracked cores obtained by majority voting (see the sketch below)
• Smaller k → lower overhead
• Inactive cores replaced in the locality list based on access pattern, to accommodate new sharers
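A minimal sketch of the majority-vote fallback; the data layout and names are illustrative:

```python
from collections import Counter
from typing import Dict

Mode = str  # "REPLICA" or "NO_REPLICA"

def classify(core_id: int, tracked: Dict[int, Mode]) -> Mode:
    """Limited-k lookup: exact mode if tracked, else majority vote."""
    if core_id in tracked:
        return tracked[core_id]
    # Core not among the k tracked entries: vote over the tracked modes.
    return Counter(tracked.values()).most_common(1)[0][0]

# Example: k = 3 tracked cores, two of them in REPLICA mode.
print(classify(7, {0: "REPLICA", 1: "REPLICA", 2: "NO_REPLICA"}))  # REPLICA
```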

Fig: Limited-k entry layout: State | Tag | LRU, ACKwise pointers (1…p), and a limited locality list (1…k) of ⟨Core ID_i, Mode_i, Home Reuse_i⟩ entries.

Limited-3 Locality Classifier

• The Limited-3 classifier approximates the performance & energy of the Complete classifier


Classifier (256KB L2, 32KB L1-D, 16KB L1-I) | Complete | Limited-3
Storage overhead per core | 96 KB (30%) | 13.5 KB (4.5%)

Metric | Limited-3 vs Complete
Completion Time | 0.6% higher
Energy | 1.0% higher

• Mode and Home Reuse tracked for 3 sharers
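The quoted percentages are consistent with dividing classifier storage by the per-core cache capacity listed in the table header; a quick check, assuming that denominator:

```latex
% Per-core cache capacity: 256 + 32 + 16 = 304 KB
\[
  \frac{96\,\text{KB}}{304\,\text{KB}} \approx 31.6\%\ \text{(Complete)},
  \qquad
  \frac{13.5\,\text{KB}}{304\,\text{KB}} \approx 4.4\%\ \text{(Limited-3)}.
\]
```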


Outline

• Motivation
• Comparison to Previous Schemes
• Design & Implementation
• Evaluation
• Conclusion

Evaluation Methodology

• Evaluations done using
  – Graphite simulator for 64 cores
  – McPAT/CACTI cache energy models and DSENT network energy models at 11 nm
• Evaluated 21 benchmarks from the SPLASH-2 (11), PARSEC (8), Parallel MI-bench (1), and UHPC (1) suites
• LLC management schemes compared:
  – Static-NUCA (S-NUCA)
  – Reactive-NUCA (R-NUCA)
  – Victim Replication (VR)
  – Adaptive Selective Replication (ASR) [modified]
  – Locality-Aware Replication (RT-1, RT-3, RT-8)

Replicate Shared Read-Write Data: LLC Accesses (BARNES)

• Most LLC accesses are reads to widely shared, high-reuse shared read-write data
• Important to replicate shared read-write data

Fig: LLC Access Count vs Number of Sharers (1–64) for BARNES, broken down by line type (Private, Instruction, Shared Read-Only, Shared Read-Write) and reuse bucket ([1-2], [3-9], [≥10]).

Replicate Shared Read-Write Data: Energy Results (BARNES)

• The locality-aware protocol reduces network router & link energy by replicating shared read-write data locally
• Victim Replication (VR) obtains limited energy benefits
  – (Almost) blind replica-creation scheme
  – Simplistic LLC replacement policy
  – Removes and re-inserts replicas on L1 misses & evictions
• Adaptive Selective Replication (ASR) and Reactive-NUCA do not replicate shared read-write data

Fig: Normalized energy for S-NUCA, R-NUCA, VR, ASR, RT-1, RT-3, and RT-8, broken down into DRAM, network link, network router, directory, L2 cache (LLC), L1-D cache, and L1-I cache components.

Replicate Shared Read-Write Data: Completion Time Results (BARNES)

• The locality-aware protocol reduces communication time with the LLC home (L1-To-LLC-Home)

Fig: Normalized completion time across schemes, broken down into Compute, L1-To-LLC-Replica, L1-To-LLC-Home, LLC-Home-Waiting, LLC-Home-To-Sharers, LLC-Home-To-OffChip, and Synchronization components.

Replicate Private Cache Lines: Page vs Cache-Line Classification (BLACKSCHOLES)

• Page-level classification incurs false positives
  – Multiple cores work privately on cache lines in the same page
  – Page classified shared read-only instead of private
• Page-level data placement is not optimal
  – Reactive-NUCA cannot localize most LLC accesses
• Replicate private data to localize all LLC accesses

Fig: LLC Access Count (0–100%) by classification (Private, Instruction, Shared Read-Only, Shared Read-Write) under page-level vs cache-line-level classification.

Replicate Private Cache Lines: Energy Results (BLACKSCHOLES)

• The locality-aware protocol reduces network energy through replication of private cache lines
• ASR replicates just shared read-only cache lines
• VR obtains limited improvements in energy
  – Still restricted by its replication mechanisms

Fig: Normalized energy breakdown across schemes (same components as above).

Replicate All Classes of Cache Lines: LLC Accesses (BODYTRACK)

• Most LLC accesses are reads to widely shared, high-reuse instructions, shared read-only, and shared read-write data
• The best replication policy should optimize handling of all 3 classes of cache lines

Fig: LLC Access Count vs Number of Sharers (1–64) for BODYTRACK, by line type and reuse bucket.

Replicate All Classes of Cache Lines: Energy Results (BODYTRACK)

• R-NUCA replicates instructions, hence obtains small network energy improvements
• ASR replicates instructions and shared read-only data, and obtains larger energy improvements
• The locality-aware protocol replicates shared read-write data as well

Fig: Normalized energy breakdown across schemes (same components as above).

Use the Optimal Replication Threshold: Energy Results (STREAMCLUSTER)

• Replication must be intelligent
• RT-1 performs badly due to LLC pollution
• RT-8 identifies fewer replicas and is slow to identify useful ones
• RT-3 identifies more replicas, and faster, without polluting the LLC
• Use the optimal replication threshold of 3

Fig: Normalized energy breakdown across schemes (same components as above).

Results Summary

• We choose a static Replication Threshold (RT) of 3
• Energy improved by 13–21%
• Completion time improved by 4–13%

Fig: Normalized Energy (left) and Completion Time (right) for S-NUCA, R-NUCA, VR, ASR, RT-1, RT-3, and RT-8.

Conclusion

• Locality-aware instruction and data replication in the last-level cache (LLC)
• Spatio-temporal locality profiled dynamically at the cache-line level using low-overhead yet highly accurate hardware counters
• Enables replication only for lines with high reuse
• Requires minimal changes to the baseline cache coherence protocol, since replicas are placed locally
• Significant energy and performance improvements over state-of-the-art replication mechanisms

See The Paper For …

• Exhaustive benchmark case studies
  – Apps with migratory shared data
  – Apps with NO benefit from replication
• Limited locality classifier study
  – Sensitivity to the number of tracked cores (k)
• Cluster-level locality-aware LLC replication study
  – Sensitivity to cluster size (C)

Thank You! Questions?
