emulating optimal replacement with a shepherd cache
Post on 22-Nov-2021
5 Views
Preview:
TRANSCRIPT
Emulating Optimal Replacement with a ShepherdCache
Kaushik Rajan1 & R. Govindarajan
Supercomputer Education and Research CentreIndian Institute of Science
Bangalore
05-Dec-2007
1Author acknowledges financial supportprovided through the MSR PhD Fellowship
Kaushik Rajan, Indian Institute of Science
Introduction
Outline
1 IntroductionMemory hierarchyDrawbacks of LRU
2 The Optimal Replacement Policy (OPT)DefinitionsUnderstanding OPT
3 Emulating Optimal Replacement with a Shepherd CacheShepherd cacheIllustrative exampleCache organization
4 Performance EvaluationBridging the performance gapComparison with related workImpact on system performance
5 ConclusionsKaushik Rajan, Indian Institute of Science
Introduction Memory hierarchy
Memory Hierarchy
Kaushik Rajan, Indian Institute of Science
Introduction Memory hierarchy
Memory Hierarchy
L2 and lower cachesObjective : Need to reduce expensivememory accesses
Design : Large size, Higher associativity,Complex design
Problem : Do not interact with programdirectly and observe filtered temporal locality
Kaushik Rajan, Indian Institute of Science
Introduction Memory hierarchy
Memory Hierarchy
L2 and lower cachesObjective : Need to reduce expensivememory accesses
Design : Large size, Higher associativity,Complex design
Problem : Do not interact with programdirectly and observe filtered temporal locality
High Associativity =⇒ replacement policy crucial to performance
L1 cache services temporal accesses =⇒ Lack of temporalaccesses at L2 =⇒ LRU replacement inefficient
Replacement decisions are taken off the processor critical path
Kaushik Rajan, Indian Institute of Science
Introduction Drawbacks of LRU
LRU vs OPT
MPKI for SPEC2000 suite, Benchmarks with MPKI < 5 not plotted butcount towards average
Huge performance gap between LRU and OPT
Kaushik Rajan, Indian Institute of Science
Introduction Drawbacks of LRU
LRU vs OPT
MPKI for SPEC2000 suite, Benchmarks with MPKI < 5 not plotted butcount towards average
Huge performance gap between LRU and OPTOPT at half the size preferable to LRU at double the size
Kaushik Rajan, Indian Institute of Science
The Optimal Replacement Policy (OPT)
Outline
1 IntroductionMemory hierarchyDrawbacks of LRU
2 The Optimal Replacement Policy (OPT)DefinitionsUnderstanding OPT
3 Emulating Optimal Replacement with a Shepherd CacheShepherd cacheIllustrative exampleCache organization
4 Performance EvaluationBridging the performance gapComparison with related workImpact on system performance
5 ConclusionsKaushik Rajan, Indian Institute of Science
The Optimal Replacement Policy (OPT) Definitions
The Optimal Replacement Policy
1 Replacement Candidates : On a miss any replacement policycould either choose to replace any of the lines in the cache orchoose not to place the miss causing line in the cache at all.
2 Self Replacement : The latter choice is referred to as aself-replacement or a cache bypass
Kaushik Rajan, Indian Institute of Science
The Optimal Replacement Policy (OPT) Definitions
The Optimal Replacement Policy
1 Replacement Candidates : On a miss any replacement policycould either choose to replace any of the lines in the cache orchoose not to place the miss causing line in the cache at all.
2 Self Replacement : The latter choice is referred to as aself-replacement or a cache bypass
Optimal Replacement Policy
On a miss replace the candidate to which an access is leastimminent [Belady1966,Mattson1970,McFarling-thesis]
Kaushik Rajan, Indian Institute of Science
The Optimal Replacement Policy (OPT) Definitions
The Optimal Replacement Policy
1 Replacement Candidates : On a miss any replacement policycould either choose to replace any of the lines in the cache orchoose not to place the miss causing line in the cache at all.
2 Self Replacement : The latter choice is referred to as aself-replacement or a cache bypass
Optimal Replacement Policy
On a miss replace the candidate to which an access is leastimminent [Belady1966,Mattson1970,McFarling-thesis]
3 Lookahead Window : Window of accesses between miss causingaccess and the access to the least imminent replacementcandidate. Single pass simulation of OPT make use of lookaheadwindows to identify replacement candidates and modify currentcache state [Sugumar-SIGMETRICS1993]
Kaushik Rajan, Indian Institute of Science
The Optimal Replacement Policy (OPT) Understanding OPT
Understanding OPT
0 1 2 3 4
0 1 2 3 4
Access Sequence 5 1 6 3 1 4 5 2 5 7 6 8A A A A A A A A A A A A
5
6
A
AOPT order for
OPT order for
Consider 4 way associative cache with one set initially containing lines(A1,A2,A3,A4), consider the access stream shown in table
Access A5 misses, replacement decision proceeds as follows
Kaushik Rajan, Indian Institute of Science
The Optimal Replacement Policy (OPT) Understanding OPT
Understanding OPT
0 1 2 3 4
0 1 2 3 4
Access Sequence 5 1 6 3 1 4 5 2 5 7 6 8A A A A A A A A A A A A
5
6
A
AOPT order for
OPT order for
Consider 4 way associative cache with one set initially containing lines(A1,A2,A3,A4), consider the access stream shown in table
Access A5 misses, replacement decision proceeds as follows1 Identify replacement candidates : (A1,A2,A3,A4,A5)
Kaushik Rajan, Indian Institute of Science
The Optimal Replacement Policy (OPT) Understanding OPT
Understanding OPT
0 1 2 3 4
0 1 2 3 4
Access Sequence 5 1 6 3 1 4 5 2 5 7 6 8A A A A A A A A A A A A
5
6
A
AOPT order for
OPT order for
Consider 4 way associative cache with one set initially containing lines(A1,A2,A3,A4), consider the access stream shown in table
Access A5 misses, replacement decision proceeds as follows1 Identify replacement candidates : (A1,A2,A3,A4,A5)2 Lookahead and gather imminence order : shown in table,
lookahead window circled
Kaushik Rajan, Indian Institute of Science
The Optimal Replacement Policy (OPT) Understanding OPT
Understanding OPT
0 1 2 3 4
0 1 2 3 4
Access Sequence 5 1 6 3 1 4 5 2 5 7 6 8A A A A A A A A A A A A
5
6
A
AOPT order for
OPT order for
Consider 4 way associative cache with one set initially containing lines(A1,A2,A3,A4), consider the access stream shown in table
Access A5 misses, replacement decision proceeds as follows1 Identify replacement candidates : (A1,A2,A3,A4,A5)2 Lookahead and gather imminence order : shown in table,
lookahead window circled3 Make replacement decision : A5 replaces A2
Kaushik Rajan, Indian Institute of Science
The Optimal Replacement Policy (OPT) Understanding OPT
Understanding OPT
0 1 2 3 4
0 1 2 3 4
Access Sequence 5 1 6 3 1 4 5 2 5 7 6 8A A A A A A A A A A A A
5
6
A
AOPT order for
OPT order for
Consider 4 way associative cache with one set initially containing lines(A1,A2,A3,A4), consider the access stream shown in table
Access A5 misses, replacement decision proceeds as follows1 Identify replacement candidates : (A1,A2,A3,A4,A5)2 Lookahead and gather imminence order : shown in table,
lookahead window circled3 Make replacement decision : A5 replaces A2
A6 self-replaces, lookahead window and imminence order in table
Kaushik Rajan, Indian Institute of Science
Emulating Optimal Replacement with a Shepherd Cache
Outline
1 IntroductionMemory hierarchyDrawbacks of LRU
2 The Optimal Replacement Policy (OPT)DefinitionsUnderstanding OPT
3 Emulating Optimal Replacement with a Shepherd CacheShepherd cacheIllustrative exampleCache organization
4 Performance EvaluationBridging the performance gapComparison with related workImpact on system performance
5 ConclusionsKaushik Rajan, Indian Institute of Science
Emulating Optimal Replacement with a Shepherd Cache Shepherd cache
Motivation
LRU-OPT gap
Smaller OPT cache better than larger LRU cache =⇒
Focus on matching OPT performance even on a smallercache
Lookahead
OPT requires lookahead to to find least imminentline =⇒ Use part of cache to aid in emulating OPT forremaining cache
Kaushik Rajan, Indian Institute of Science
Emulating Optimal Replacement with a Shepherd Cache Shepherd cache
Emulating OPT with a Shepherd Cache
Kaushik Rajan, Indian Institute of Science
Emulating Optimal Replacement with a Shepherd Cache Shepherd cache
Emulating OPT with a Shepherd Cache
Split the cache into two logical partsMain Cache (MC) for which optimalreplacement is emulatedShepherd Cache (SC) used to provide alookahead and guide replacements from MCtowards OPT
Kaushik Rajan, Indian Institute of Science
Emulating Optimal Replacement with a Shepherd Cache Shepherd cache
Emulating OPT with a Shepherd Cache
Split the cache into two logical partsMain Cache (MC) for which optimalreplacement is emulatedShepherd Cache (SC) used to provide alookahead and guide replacements from MCtowards OPT
Operation1 Buffer lines temporarily in SC before moving
them to MC, SC acts as a FIFO buffer2 While in SC, gather imminence information and
emulate lookahead3 When forced out of SC, make an MC
replacement based on the gathered imminenceorder
Kaushik Rajan, Indian Institute of Science
Emulating Optimal Replacement with a Shepherd Cache Illustrative example
Overview of Shepherd Caching
A 5,A 1,A 6,A 3,A 1,A 4,
A 5, 2,A A 5,A 7,A 6,A 8
2SCSC1
NVC2
CM
MC
A 1A 2
3A
NVC1
4A
To emulate MC with 4 ways per set and 2 SCways per set
To gather imminence order add a countermatrix (CM)
CM has one column per SC way to trackimminence order w.r.t to it
CM has one row per SC and MC line as anyof them can be a replacement candidate
Each column has one Next Value Counter(NVC) to track the next value to assign alongcolumn
Kaushik Rajan, Indian Institute of Science
Emulating Optimal Replacement with a Shepherd Cache Illustrative example
1A
A 2A 3
4A
5,A 1,A 6,A 3,A 1,A 4,
A 5, 2,A A 5,A 7,A 6,A 8
CM
SCSC
21
NVCs
MC
A
Kaushik Rajan, Indian Institute of Science
Emulating Optimal Replacement with a Shepherd Cache Illustrative example
1A
A 2A 3
4A
5,A 1,A 6,A 3,A 1,A 4,
A 5, 2,A A 5,A 7,A 6,A 8
CM
SCSC
21
NVCs
MC
A
1A
A 2A 3
4A
5,A 1,A 6,A 3,A 1,A 4,
A 5, 2,A A 5,A 7,A 6,A 8
CM
SCSC
21
NVCs
e
MC
5A eeeee
0
A
Kaushik Rajan, Indian Institute of Science
Emulating Optimal Replacement with a Shepherd Cache Illustrative example
1A
A 2A 3
4A
5,A 1,A 6,A 3,A 1,A 4,
A 5, 2,A A 5,A 7,A 6,A 8
CM
SCSC
21
NVCs
MC
A
1A
A 2A 3
4A
5,A 1,A 6,A 3,A 1,A 4,
A 5, 2,A A 5,A 7,A 6,A 8
CM
SCSC
21
NVCs
e
MC
5A eeeee
0
A
1A
A 2A 3
4A
5, 1, 6,A 3,A 1,A 4,
A 5, 2,A A 5,A 7,A 6,A 8
CM
SCSC
21
NVCs
e
MC
5A e0
A AA
eee
1
Kaushik Rajan, Indian Institute of Science
Emulating Optimal Replacement with a Shepherd Cache Illustrative example
1A
A 2A 3
4A
5, A 3,A 1,A 4,
A 5, 2,A A 5,A 7,A 6,A 8
CM
SCSC
21
NVCs
0
MC
5A e0
A AA
eee
1
1, 6,
A 6 eeeeee
0
Kaushik Rajan, Indian Institute of Science
Emulating Optimal Replacement with a Shepherd Cache Illustrative example
1A
A 2A 3
4A
5, A 3,A 1,A 4,
A 5, 2,A A 5,A 7,A 6,A 8
CM
SCSC
21
NVCs
0
MC
5A e0
A AA
eee
1
1, 6,
A 6 eeeeee
0
1A
A 2A 3
4A
5, 3,A 1,A 4,
A 5, 2,A A 5,A 7,A 6,A 8
CM
SCSC
21
NVCs
0
MC
5A e0
A A
e
e
1, 6,
A 6
0
1
e
eeee
2
1
A A
Kaushik Rajan, Indian Institute of Science
Emulating Optimal Replacement with a Shepherd Cache Illustrative example
1A
A 2A 3
4A
5, A 3,A 1,A 4,
A 5, 2,A A 5,A 7,A 6,A 8
CM
SCSC
21
NVCs
0
MC
5A e0
A AA
eee
1
1, 6,
A 6 eeeeee
0
1A
A 2A 3
4A
5, 3,A 1,A 4,
A 5, 2,A A 5,A 7,A 6,A 8
CM
SCSC
21
NVCs
0
MC
5A e0
A A
e
e
1, 6,
A 6
0
1
e
eeee
2
1
A A
1A
A 2A 3
4A
5, 3, 1,A 4,
A 5, 2,A A 5,A 7,A 6,A 8
CM
SCSC
21
NVCs
0
MC
5A e0
A A
e
1, 6,
A 6
0
2
e
ee
1
A AA
e e
1
2
Kaushik Rajan, Indian Institute of Science
Emulating Optimal Replacement with a Shepherd Cache Illustrative example
1A
A 2A 3
4A
5, 3, 1, 4,
5, 2,A A 5,A 7,A 6,A 8
CM
SCSC
21
NVCs
0
MC
5A
0
A A
e
1, 6,
A 6
0
4
e
e
1
A A A
1
2
4
A
A
2
3 3
Kaushik Rajan, Indian Institute of Science
Emulating Optimal Replacement with a Shepherd Cache Illustrative example
1A
A 2A 3
4A
5, 3, 1, 4,
5, 2,A A 5,A 7,A 6,A 8
CM
SCSC
21
NVCs
0
MC
5A
0
A A
e
1, 6,
A 6
0
4
e
e
1
A A A
1
2
4
A
A
2
3 31A
A 2A 3
4A
5, 3, 1, 4,
5, 2,A 5,A 7,A 6,A 8
CM
SCSC
21
NVCs
0
MC
5A
0
A A 1, 6,
A 6
0
5
e
1
A A A
1
2
5
A
2
3
AA
3
4 4
Kaushik Rajan, Indian Institute of Science
Emulating Optimal Replacement with a Shepherd Cache Illustrative example
1A
A 2A 3
4A
5, 3, 1, 4,
5, 2,A A 5,A 7,A 6,A 8
CM
SCSC
21
NVCs
0
MC
5A
0
A A
e
1, 6,
A 6
0
4
e
e
1
A A A
1
2
4
A
A
2
3 31A
A 2A 3
4A
5, 3, 1, 4,
5, 2,A 5,A 7,A 6,A 8
CM
SCSC
21
NVCs
0
MC
5A
0
A A 1, 6,
A 6
0
5
e
1
A A A
1
2
5
A
2
3
AA
3
4 41A
A 3
4A
5, 3, 1, 4,
5, 2, 7,A 6,A 8
CM
SCSC
21
NVCs
MC
7A
A A 1, 6,
A 6
0
5
e
A A A
1
0
A
2
e 3
A 5,AA
A 5
eee
ee
0
Kaushik Rajan, Indian Institute of Science
Emulating Optimal Replacement with a Shepherd Cache Illustrative example
1A
A 3
4A
5, 3, 1, 4,
5, 2, 7,A 6,A 8
CM
SCSC
21
NVCs
MC
7A
A A 1, 6,
A 6
0
5
e
A A A
1
0
A
2
e 3
A 5,AA
A 5
eee
ee
01A
A 3
4A
5, 3, 1, 4,
5, 2, 7, 6,A 8
CM
SCSC
21
NVCs
MC
7A
A A 1, 6,
A 6
0
A A A
1
1
A
2
3
A 5,A
A 5
eeeee
0 5
6
0
AAA
Kaushik Rajan, Indian Institute of Science
Emulating Optimal Replacement with a Shepherd Cache Illustrative example
1A
A 3
4A
5, 3, 1, 4,
5, 2, 7,A 6,A 8
CM
SCSC
21
NVCs
MC
7A
A A 1, 6,
A 6
0
5
e
A A A
1
0
A
2
e 3
A 5,AA
A 5
eee
ee
01A
A 3
4A
5, 3, 1, 4,
5, 2, 7, 6,A 8
CM
SCSC
21
NVCs
MC
7A
A A 1, 6,
A 6
0
A A A
1
1
A
2
3
A 5,A
A 5
eeeee
0 5
6
0
AAA
1A
A 3
4A
5, 3, 1, 4,
5, 2, 7, 6, 8
CM
SCSC
21
NVCs
MC
7A
A A 1, 6,
A 8
A A A
1
A
A 5,A
A 5
eeeee
0 e
0
AA AA
eeeee
Kaushik Rajan, Indian Institute of Science
Emulating Optimal Replacement with a Shepherd Cache Cache organization
Baseline MC replacement
Lookahead provided may-notalways be sufficient to identifyleast imminent replacement
Partial imminence order used torestrict replacement choice toless imminent candidates
Use baseline MC replacement tochoose victim among these
Baseline MC replacement can beany standard policy, we evaluateLRU, LFU and random
When line moves from SC to MC,place it in LRU/LFU position
1A
A 2A 3
4A
CM
SCSC
21
NVCs
MC
1
2
4
CM
SCSC
21
NVCs
MC
56 0
2
0
e
e
0
1
2
e
e
3 2
1
e
A
A
3A
A
A
A
A1,
A A6, 3, 5, 1,
A75,
A A A
Initial State Before Access A7
A5 replaces one of A2 or A4 based onbaseline MC replacement
Kaushik Rajan, Indian Institute of Science
Emulating Optimal Replacement with a Shepherd Cache Cache organization
Cache Organization
TAG DATA
ASSOCIATIVITYSE
T 0
SE
T 1
ASSOCIATIVITY
ASSOCIATIVITY
....... .
.
. . .
. . .
. . .
. . .
. . .
SC−FLAG/SC−PTR
NEXT VALUE
TAG DATA
s/3s/2s/1s/0
mmmmmmmmmm
mm
s/0 s/1 s/2 s/3
COUNTER MATRIX
COUNTERS
Cache Organization (SC-4, MC-12)
Divide cache into SC and MC with some ways of each set are SC ways
Use SC flag to identify SC ways, use SC-ptr to link SC line to its column
No need to physically move lines from SC to MC, just adjust SC-flag/ptr
Larger SC-assoc =⇒ more lookahead =⇒ OPT in smaller MC
Kaushik Rajan, Indian Institute of Science
Performance Evaluation
Outline
1 IntroductionMemory hierarchyDrawbacks of LRU
2 The Optimal Replacement Policy (OPT)DefinitionsUnderstanding OPT
3 Emulating Optimal Replacement with a Shepherd CacheShepherd cacheIllustrative exampleCache organization
4 Performance EvaluationBridging the performance gapComparison with related workImpact on system performance
5 ConclusionsKaushik Rajan, Indian Institute of Science
Performance Evaluation Bridging the performance gap
Simulation Methodology
1 Memory hierarchyL2 Cache:- 512KB,1MB,2MB and 4MB, 16 way, 128B lines, 8 cycleL1 cache:- Both L1-I and L1-D are 16KB 2 way, 32B lines, 2 cycleMemory:- Infinite size 400 cycleInfinite MSHR and store buffer
2 SC+MC configurationTotal assoc 16, performance evaluated for (SC-2, MC-14) to(SC-14, MC-2)Schemes referred to by SC-associativity, SC-4 =⇒ (SC-4, MC-12)
3 Simulation methodologyBenchmarks Suite:- SPEC 2006 all 26 benchmarks with ref inputsInstructions:- Fast-forward 2 billion run 3 billionMethodology:- Trace driven for MPKI, Simplescalar for systemperformanceProcessor configuration:- Similar to recent studies
Kaushik Rajan, Indian Institute of Science
Performance Evaluation Bridging the performance gap
Trade-off between SC and MC
Avg MPKI over SPEC2000 suite
SC-assoc vs MC-assocMarginal gains whenMC size less than halfthe cache
Peak gains : SC-6 toSC-4
Quarter to half ofcache should be forlookahead
Size overhead48B-92B, less than onecache block (128B)
Kaushik Rajan, Indian Institute of Science
Performance Evaluation Bridging the performance gap
Bridging the performance gap
Avg MPKI over SPEC2000 suite
Bridging the LRU-OPT gap
SC-4 bridges 32-52%of gap
SC moves closer toOPT as cache sizeincreases
Kaushik Rajan, Indian Institute of Science
Performance Evaluation Comparison with related work
Comparison with related work
Scheme ParametersDIP [Qureshi-ISCA07] 16/32/64 dedicated sets, epsilon 1/32V-way cache [Qureshi-ISCA05] 32 tags per set, tdr 2
global reuse distance replacementVictim Buffer [Jouppi-ISCA90] MC+VB, VB same assoc as SCfa cache fully associative with LRUOPT-16 16 way associative OPT replacement
Performance compared with base LRU, DIP, victim buffer with oneadditional way per set to compensate for size overhead
Kaushik Rajan, Indian Institute of Science
Performance Evaluation Comparison with related work
Comparison with related schemes
MPKI for 512KB 16 way cache, select benchmarks, all 26 in average
On average both SC-4 and SC-8 out-performs LRU, DIP, v-way, faand victim by 4-10%
For most benchmarks SC-4 performs better than SC-8
Different SC associativities best for different benchmarks
Kaushik Rajan, Indian Institute of Science
Performance Evaluation Impact on system performance
Speedup in CPI
512KB
2MBAverage speedup of about 7% with SC-4 and 5% with SC-8
Kaushik Rajan, Indian Institute of Science
Conclusions
Outline
1 IntroductionMemory hierarchyDrawbacks of LRU
2 The Optimal Replacement Policy (OPT)DefinitionsUnderstanding OPT
3 Emulating Optimal Replacement with a Shepherd CacheShepherd cacheIllustrative exampleCache organization
4 Performance EvaluationBridging the performance gapComparison with related workImpact on system performance
5 ConclusionsKaushik Rajan, Indian Institute of Science
Conclusions
Conclusions and Future Work
Conclusion
Using part of cache for lookahead and emulating OPT inremaining cache effective at improving L2 performance
Quarter to Half of cache space should be dedicated to SC
Performs better than related proposals and successfully bridges32-52% of the LRU-OPT gap
Future Work
Different SC associativity works best with different benchmarks,can the SC-associativity be adapted at runtime?
Study the benefits of dissociating SC and MC?
Tuning the organization to a shared CMP cache?
Kaushik Rajan, Indian Institute of Science
top related