IMPROVING CACHE MANAGEMENT POLICIES USING DYNAMIC REUSE DISTANCES
Nam Duong1, Dali Zhao1, Taesu Kim1, Rosario Cammarota1, Mateo Valero2, Alexander V. Veidenbaum1
1University of California, Irvine
2Universitat Politècnica de Catalunya and Barcelona Supercomputing Center
CACHE MANAGEMENT
[Figure: A taxonomy of cache management research. Single-core: replacement (LRU, NRU, EELRU, DIP, RRIP, ...) and bypass (SDP, ...); shared-cache: partitioning (UCP, PIPP, TA-DIP, TA-DRRIP, Vantage, ...); prefetch-aware management. PDP contributes to replacement, bypass, and partitioning.]
- Cache management has been a hot research topic
OVERVIEW
- Proposed new cache replacement and partitioning algorithms with a better balance between reuse and pollution
- Introduced a new concept, the Protecting Distance (PD), which is shown to achieve such a balance
- Developed single- and multi-core hit rate models as a function of the PD, cache configuration, and program behavior; the models are used to dynamically compute the best PD
- Showed that PD-based cache management policies improve performance for both single- and multi-core systems
OUTLINE
1. The concept of Protecting Distance
2. The single-core PD-based replacement and bypass policy (PDP)
3. The multi-core PD-based management policies
4. Evaluation
DEFINITIONS
- The (line) reuse distance (RD): the number of accesses to the same cache set between two accesses to the same line. This metric is directly related to hit rate.
- The reuse distance distribution (RDD): a distribution of observed reuse distances; a program signature for a given cache configuration.
[Figure: RDDs of representative benchmarks (403.gcc, 436.cactusADM, 464.h264ref); X-axis: the RD (<256)]
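The definition above can be made concrete with a short sketch. This is a minimal Python version; the modulo set-index function and the tiny trace are illustrative assumptions, not the paper's hardware sampler.

```python
from collections import defaultdict

def reuse_distance_distribution(trace, num_sets=4, max_rd=256):
    """Build an RDD: for each access, the reuse distance is the number of
    accesses to the same set since the previous access to the same line."""
    set_access_count = defaultdict(int)  # accesses observed per set
    last_seen = {}                       # line -> set access count at its last access
    rdd = defaultdict(int)               # reuse distance -> occurrence count
    for line in trace:
        s = line % num_sets              # simple set-index function (assumed)
        set_access_count[s] += 1
        if line in last_seen:
            rd = set_access_count[s] - last_seen[line]
            if rd < max_rd:
                rdd[rd] += 1
        last_seen[line] = set_access_count[s]
    return dict(rdd)

# Line 0 is re-accessed with two other same-set accesses in between: RD = 3.
trace = [0, 4, 8, 0]  # with num_sets=4, all map to set 0
print(reuse_distance_distribution(trace))  # {3: 1}
```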
FUTURE BEHAVIOR PREDICTION
- Cache management policies use past reference behavior to predict future accesses; prediction accuracy is critical
- Prediction in some of the prior policies:
  - LRU: predicts that lines are reused after K unique accesses, where K < W (W: cache associativity)
  - Early eviction LRU (EELRU): counts evictions in two non-LRU regions (early/late) to predict a line to evict
  - RRIP: predicts whether a line will be reused in the near, long, or distant future
BALANCING REUSE AND CACHE POLLUTION
- Key to good performance (high hit rate): cache lines must be reused as much as possible before eviction AND must be evicted soon after the “last” reuse to give space to new lines
- The former can be achieved by using the reuse distance and actively preventing eviction: “protecting” a line from eviction
- The latter can be achieved by evicting a line when it is not reused within this distance
- There is an optimal reuse distance balancing the two; it is called the Protecting Distance (PD)
EXAMPLE: 436.CACTUSADM
- A majority of lines are reused at 64 or fewer accesses; there are multiple peaks at different reuse distances
- Reuse is maximized if lines are kept in the cache for 64 accesses: lines may not be reused if evicted before that, and lines kept beyond that are likely to pollute the cache
- Assume that no lines are kept longer than a given RD
[Figure: RDD of 436.cactusADM]
[Chart: Reduction in miss rate over LRU for 436.cactusADM when no line is kept longer than RD = 16, 32, 48, 72, 128, or 256, compared with EELRU, DIP, and RRIP; Y-axis: 0% to 60%]
THE PROTECTING DISTANCE (PD)
- A distance at which a majority of lines are covered
- A single value for all sets, predicted based on the current RDD
- Questions to answer/solve:
  - Why does using the PD achieve the balance?
  - How to dynamically find the PD for an application and a cache configuration?
  - How to build the PD-based management policies?
OUTLINE
1. The concept of Protecting Distance
2. Single-core PD-based replacement and bypass policy (PDP)
3. The multi-core PD-based management policies
4. Evaluation
THE SINGLE-CORE PDP
- A cache tag contains a line’s remaining PD (RPD); a line can be evicted when its RPD = 0
- The RPD of an inserted or promoted line is set to the predicted PD; the RPDs of the other lines in the set are decremented
- Example: a 4-way cache, the predicted PD is 7; a line is promoted on a hit
[Diagram: the RPDs of a set before and after the hit access; the reused line’s RPD is reset to 7, the other lines’ RPDs are decremented; unused = inserted but not yet reused]
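The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not the hardware design; the function name and the example set state are assumed.

```python
def access_set(rpds, way, pd):
    """On a hit (promotion) or a fill (insertion) into `way`, reset that
    line's remaining PD to the predicted PD and decrement the RPDs of all
    other lines in the set, saturating at 0."""
    for w in range(len(rpds)):
        rpds[w] = pd if w == way else max(0, rpds[w] - 1)
    return rpds

# 4-way set, predicted PD = 7: the line in way 2 is promoted on a hit.
print(access_set([1, 6, 5, 2], 2, 7))  # [0, 5, 7, 1]
```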
THE SINGLE-CORE PDP (CONT.)
- Selecting a victim on a miss: a line with an RPD = 0 can be replaced
- Two cases when all RPDs > 0 (no unprotected lines):
  - Caches without bypass (inclusive): unused lines are less likely to be reused than reused lines, so replace an unused line with the highest RPD first; if there is no unused line, replace the line with the highest RPD
  - Caches with bypass (non-inclusive): bypass the new line
[Diagram: example sets showing the victim chosen in each case (reused vs. unused lines)]
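The victim-selection rules above can be sketched as follows; function name, argument layout, and the example set states are assumptions for illustration.

```python
def choose_victim(rpds, used, bypass_allowed):
    """Victim selection on a miss, per the rules above:
    1. any line with RPD == 0 is unprotected and may be replaced;
    2. if all lines are protected and the cache is non-inclusive, bypass;
    3. otherwise prefer an unused (never reused) line with the highest RPD,
       falling back to the reused line with the highest RPD.
    Returns the victim way, or None to bypass. used[w] marks reused lines."""
    unprotected = [w for w, r in enumerate(rpds) if r == 0]
    if unprotected:
        return unprotected[0]
    if bypass_allowed:
        return None  # non-inclusive cache: bypass the new line
    unused = [w for w in range(len(rpds)) if not used[w]]
    candidates = unused if unused else range(len(rpds))
    return max(candidates, key=lambda w: rpds[w])

print(choose_victim([3, 1, 0, 5], [True, False, True, True], False))  # 2
print(choose_victim([3, 1, 2, 5], [True, False, True, True], False))  # 1
print(choose_victim([3, 1, 2, 5], [True, False, True, True], True))   # None
```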
EVALUATION OF THE STATIC PDP
- Static PDP: use the best static PD (< 256) for each benchmark
  - SPDP-NB: static PDP with replacement only
  - SPDP-B: static PDP with replacement and bypass
- Performance: in general, DRRIP < SPDP-NB < SPDP-B
  - 436.cactusADM: a 10% additional miss reduction
  - The two static PDP policies otherwise have similar performance
  - 483.xalancbmk: 3 different execution windows have different behavior for SPDP-B
[Chart: Miss reduction over DRRIP of SPDP-NB and SPDP-B for 403.gcc, 429.mcf, 433.milc, 434.zeusmp, 436.cactusADM, 437.leslie3d, 450.soplex, 456.hmmer, 459.GemsFDTD, 462.libquantum, 464.h264ref, 470.lbm, 471.omnetpp, 473.astar, 482.sphinx3, and three windows of 483.xalancbmk; Y-axis: -5% to 25%]
436.CACTUSADM: EXPLAINING THE PERFORMANCE DIFFERENCE
- How do the evicted lines occupy the cache?
  - DRRIP: early evicted lines account for 75% of accesses but occupy only 4% of the cache; late evicted lines account for 2% of accesses but occupy 8% of the cache → pollution
  - SPDP-NB: early and late evicted lines account for 42% of accesses but occupy only 4%
  - SPDP-B: late evicted lines account for 1% of accesses and occupy 3% of the cache → yielding cache space to useful lines
- PDP suffers less pollution from long-RD lines in the cache than RRIP
[Chart: access and occupancy breakdown (hit; bypass; evicted before 16 accesses (early); evicted after 16 accesses (late)) for DRRIP, SPDP-NB, and SPDP-B]
CASE STUDY: 483.XALANCBMK
[Figure: RDDs and SPDP-B hit rates of three execution windows of 483.xalancbmk (483.xalancbmk.1, .2, .3)]
- The best PD is different in different windows, and for different programs
- We need a dynamic policy that finds the best PD, and a model to drive the search
- There is a close relationship between the hit rate, the PD, and the RDD
A HIT RATE MODEL FOR NON-INCLUSIVE CACHE
- The model estimates the hit rate as a function of dp and the RDD:
  - {Ni}, Nt: the RDD (the number of accesses at each reuse distance i, and the total number of accesses)
  - dp: the protecting distance
  - de: experimentally set to W (W: cache associativity)
- The model:

  E(dp) = W × Hits(dp) / Accesses(dp), where
  Hits(dp) = Σ_{i=1..dp} Ni
  Accesses(dp) = Σ_{i=1..dp} i·Ni + (dp + de) × (Nt − Hits(dp))

- E is used to find the PD maximizing the hit rate
[Figure: RDD, E, and hit rate curves for 403.gcc, 436.cactusADM, and 464.h264ref]
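A minimal Python sketch of searching for the PD with this model; the function and argument names are assumed, and the formula in the docstring follows the definitions above.

```python
def best_pd(rdd, n_total, assoc, max_pd=256):
    """Scan candidate protecting distances dp and return the one maximizing
    E(dp) = W * hits(dp) / (sum_{i<=dp} i*N_i + (dp + de) * (N_t - hits(dp))),
    with de set to the associativity W as in the model above.
    `rdd` maps reuse distance -> count, `n_total` is N_t."""
    de = assoc
    best, best_e = 1, 0.0
    hits = weighted = 0
    for dp in range(1, max_pd + 1):
        n = rdd.get(dp, 0)
        hits += n                      # running sum of N_i for i <= dp
        weighted += dp * n             # running sum of i * N_i for i <= dp
        denom = weighted + (dp + de) * (n_total - hits)
        e = assoc * hits / denom if denom else 0.0
        if e > best_e:
            best, best_e = dp, e
    return best

# 90% of lines reuse at RD=4, 10% at RD=200: protecting to 200 wastes
# occupancy, so the model picks dp = 4.
print(best_pd({4: 90, 200: 10}, 100, 16))  # 4
```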
PDP CACHE ORGANIZATION
- RD Sampler: tracks accesses to several cache sets and measures the reuse distance of a new access; in the L2 miss/WB stream, the sampling rate can be reduced
- RD Counter Array: collects the number of accesses at each RD = i (Ni) and the total (Nt); to reduce overhead, each counter covers a range of RDs
- PD Compute Logic: finds the PD that maximizes E
- The computed PD is used in the next interval (0.5M L3 accesses)
- Reasonable hardware overhead: 2 or 3 bits per tag to store the RPD
[Diagram: the LLC between the higher cache level and main memory; the RD Sampler observes the access address stream and feeds RDs to the RD Counter Array, which provides the RDD to the PD Compute Logic, which returns the PD]
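The ranged counters of the RD Counter Array might look like the following minimal sketch; the class and field names are assumed, and 16 counters covering 16 RDs each are illustrative values.

```python
class RDCounterArray:
    """Sketch of the RD Counter Array above: to cut hardware cost, each
    counter covers a range of `step` reuse distances rather than a single
    RD value; `nt` counts all sampled accesses (N_t)."""
    def __init__(self, num_counters=16, step=16):
        self.step = step
        self.counters = [0] * num_counters
        self.nt = 0

    def record(self, rd):
        """Record one sampled reuse distance."""
        self.nt += 1
        idx = rd // self.step            # map the RD to its range counter
        if idx < len(self.counters):     # RDs past the range only count in nt
            self.counters[idx] += 1

arr = RDCounterArray()
for rd in (3, 17, 40, 500):              # 500 exceeds the tracked range
    arr.record(rd)
print(arr.counters[:3], arr.nt)          # [1, 1, 1] 4
```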
PDP VS. EXISTING POLICIES

Policy        | Replacement | Bypass | Reuse | Pollution | Distance measurement | Model
LRU           | Yes         | No     | No    | Yes       | Stack-based          | No
EELRU [1] (*) | Yes         | No     | No    | Yes       | Stack-based          | Probabilistic
DIP [2]       | Yes         | No     | Yes   | No        | N/A                  | No
RRIP [3]      | Yes         | No     | Yes   | No        | N/A                  | No
SDP [4]       | No          | Yes    | Yes   | No        | N/A                  | No
PDP           | Yes         | Yes    | Yes   | Yes       | Access-based         | Hit rate

(Columns: supported policy (replacement, bypass); which side of the balance is addressed (reuse, pollution); how distance is measured; the model used.)
[1] Y. Smaragdakis, S. Kaplan, and P. Wilson. EELRU: simple and effective adaptive page replacement. In SIGMETRICS’99
[2] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer. Adaptive insertion policies for high performance caching. In ISCA’07
[3] A. Jaleel, K. B. Theobald, S. C. Steely, Jr., and J. Emer. High performance cache replacement using re-reference interval prediction (RRIP). In ISCA’10
[4] S. M. Khan, Y. Tian, and D. A. Jimenez. Sampling dead block prediction for last-level caches. In MICRO’10
(*) The originally proposed EELRU has the concept of a late eviction point, which shares some similarities with the protecting distance; however, its lines are not always guaranteed to be protected.
OUTLINE
1. The concept of Protecting Distance
2. The single-core PD-based replacement and bypass policy (PDP)
3. The multi-core PD-based management policies
4. Evaluation
PD-BASED SHARED CACHE PARTITIONING
- Each thread has its own PD (thread-aware): the counter array is replicated per thread; the sampler and compute logic are shared
- A thread’s PD determines its cache partition: its lines occupy the cache longer if its PD is large, so the cache is implicitly partitioned per the needs of each thread using the thread PDs
- The problem is to find a set of thread PDs that together maximize the hit rate
SHARED-CACHE HIT RATE MODEL
- Extends the single-core approach: compute a vector <PD> = <PD1, …, PDT> (T = number of threads)

  E(<PD>) = W × Σ_{t=1..T} Hits_t / Σ_{t=1..T} Accesses_t

- An exhaustive search for <PD> is not practical; a heuristic search algorithm finds a combination of the threads’ RDD peaks that maximizes the hit rate
- The single-core model generates the top 3 peaks per thread; the complexity is O(T²)
- See the paper for more detail
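The peak-based search can be sketched as below. This brute-force version over the per-thread top-3 peaks is an illustrative stand-in for the paper's O(T²) heuristic (feasible only for small T), and the shared-cache evaluation function is passed in rather than assumed.

```python
from itertools import product

def top_peaks(rdd, k=3):
    """Top-k local peaks of a thread's RDD (candidate PDs), ranked by count."""
    peaks = [rd for rd in rdd
             if rdd[rd] >= rdd.get(rd - 1, 0) and rdd[rd] >= rdd.get(rd + 1, 0)]
    return sorted(peaks, key=lambda rd: rdd[rd], reverse=True)[:k]

def search_pd_vector(thread_rdds, evaluate):
    """Try every combination of per-thread top-3 RDD peaks and keep the <PD>
    vector maximizing the shared-cache model `evaluate` (a callable taking
    the PD tuple; the combined-E details are in the paper)."""
    candidates = [top_peaks(rdd) for rdd in thread_rdds]
    return max(product(*candidates), key=evaluate)

# Two threads; with a toy evaluate (here just the sum of PDs), the search
# picks one candidate peak per thread.
print(search_pd_vector([{4: 50, 64: 30}, {8: 40}], sum))  # (64, 8)
```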
OUTLINE
1. The concept of Protecting Distance
2. The single-core PD-based replacement and bypass policy (PDP)
3. The multi-core PD-based management policies
4. Evaluation
EVALUATION METHODOLOGY
- CMP$im simulator; LLC replacement; target cache: the LLC

  Cache          | Params
  DCache         | 32KB, 8-way, 64B, 2 cycles
  ICache         | 32KB, 4-way, 64B, 2 cycles
  L2 Cache       | 256KB, 8-way, 64B, 10 cycles
  L3 Cache (LLC) | 2MB, 16-way, 64B, 30 cycles
  Memory         | 200 cycles
EVALUATION METHODOLOGY (CONT.)
- Benchmarks: SPEC CPU 2006 benchmarks, excluding those which did not stress the LLC
- Single-core: compared to EELRU, SDP, DIP, DRRIP
- Multi-core: 4- and 16-core configurations, 80 workloads each, generated by randomly combining benchmarks; compared to UCP, PIPP, TA-DRRIP
- Our policy: PDP-x, where x is the number of bits per cache line
SINGLE-CORE PDP
- PDP-x, where x is the number of bits per cache line; each benchmark is executed for 1B instructions
- PDP is best with 3 bits per line, but still better than prior work at 2 bits
[Chart: IPC improvement over DIP of SDP, DRRIP, EELRU, PDP-2, PDP-3, PDP-8, and SPDP-B for the SPEC CPU 2006 benchmarks and their average; Y-axis: -30% to 30%]
ADAPTATION TO PROGRAM PHASES
- 5 benchmarks which demonstrate significant phase changes; each benchmark is run for 5B instructions
[Figure: change of the PD over time for 403.gcc, 429.mcf, 450.soplex, 482.sphinx3, and 483.xalancbmk; X-axis: 1M LLC accesses]
ADAPTATION TO PROGRAM PHASES (CONT.)
[Chart: IPC improvement over DIP of RRIP, PDP-2, PDP-3, and PDP-8 for the five phase-changing benchmarks; Y-axis: -5% to 15%]
PD-BASED CACHE PARTITIONING FOR 16 CORES
[Charts: per-workload results (80 workloads) and averages of UCP, PIPP, PDP-2, and PDP-3 for three metrics labeled W, T, and H, normalized to TA-DRRIP; Y-axis: -20% to 40% per workload, -10% to 10% average]
HARDWARE OVERHEAD

  Policy | Per-line bits | Overhead (%)
  DIP    | 4             | 0.8%
  RRIP   | 2             | 0.4%
  SDP    | 4             | 1.4%
  PDP-2  | 2             | 0.6%
  PDP-3  | 3             | 0.8%
OTHER RESULTS
- Exploration of PDP cache parameters
- Cache bypass fraction
- Prefetch-aware PDP
- PD-based cache management policy for 4-core
CONCLUSIONS
- Proposed the concept of the Protecting Distance (PD) and showed that it can be used to better balance reuse and cache pollution
- Developed a hit rate model as a function of the PD, program behavior, and cache configuration
- Proposed PD-based management policies for both single- and multi-core systems
- PD-based policies outperform existing policies
THANK YOU!
BACKUP SLIDES
RDD, E and hit rate of all benchmarks
RDDS, MODELED AND REAL HIT RATES OF SPEC CPU 2006 BENCHMARKS
[Figures: RDD, E, and hit rate curves for 403.gcc, 429.mcf, 433.milc, 434.zeusmp, 436.cactusADM, 437.leslie3d, 450.soplex, 456.hmmer, 459.GemsFDTD, 462.libquantum, 464.h264ref, 470.lbm, 471.omnetpp, 473.astar, 482.sphinx3, and the three windows of 483.xalancbmk]