IMPROVING CACHE MANAGEMENT POLICIES USING DYNAMIC REUSE DISTANCES
Nam Duong1, Dali Zhao1, Taesu Kim1, Rosario Cammarota1, Mateo Valero2, Alexander V. Veidenbaum1
1University of California, Irvine
2Universitat Politècnica de Catalunya and Barcelona Supercomputing Center
CACHE MANAGEMENT
[Figure: A taxonomy of cache management research. Single-core: replacement (LRU, NRU, EELRU, DIP, RRIP, ...) and bypass (SDP, ...); shared-cache: partitioning (UCP, PIPP, TA-DIP, TA-DRRIP, Vantage, ...); prefetch-aware management. PDP contributes to replacement, bypass, and partitioning.]
- Cache management has been a hot research topic
OVERVIEW
- Proposed new cache replacement and partitioning algorithms with a better balance between reuse and pollution
- Introduced a new concept, the Protecting Distance (PD), which is shown to achieve such a balance
- Developed single- and multi-core hit rate models as a function of the PD, cache configuration, and program behavior; the models are used to dynamically compute the best PD
- Showed that PD-based cache management policies improve performance for both single- and multi-core systems
OUTLINE
1. The concept of Protecting Distance
2. The single-core PD-based replacement and bypass policy (PDP)
3. The multi-core PD-based management policies
4. Evaluation
DEFINITIONS
- The (line) reuse distance (RD): the number of accesses to the same cache set between two accesses to the same line. This metric is directly related to hit rate.
- The reuse distance distribution (RDD): a distribution of observed reuse distances; a program signature for a given cache configuration.
[Figure: RDDs of representative benchmarks (403.gcc, 436.cactusADM, 464.h264ref); X-axis: the RD (<256)]
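The definition above can be made concrete with a short sketch. This is a minimal Python version; the modulo set-index function and the tiny trace are illustrative assumptions, not the paper's hardware sampler.

```python
from collections import defaultdict

def reuse_distance_distribution(trace, num_sets=4, max_rd=256):
    """Build an RDD: for each access, the reuse distance is the number of
    accesses to the same set since the previous access to the same line."""
    set_access_count = defaultdict(int)  # accesses observed per set
    last_seen = {}                       # line -> set access count at its last access
    rdd = defaultdict(int)               # reuse distance -> occurrence count
    for line in trace:
        s = line % num_sets              # simple set-index function (assumed)
        set_access_count[s] += 1
        if line in last_seen:
            rd = set_access_count[s] - last_seen[line]
            if rd < max_rd:
                rdd[rd] += 1
        last_seen[line] = set_access_count[s]
    return dict(rdd)

# Line 0 is re-accessed with two other same-set accesses in between: RD = 3.
trace = [0, 4, 8, 0]  # with num_sets=4, all map to set 0
print(reuse_distance_distribution(trace))  # {3: 1}
```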
FUTURE BEHAVIOR PREDICTION
- Cache management policies use past reference behavior to predict future accesses; prediction accuracy is critical
- Prediction in some of the prior policies:
  - LRU: predicts that lines are reused after K unique accesses, where K < W (W: cache associativity)
  - Early eviction LRU (EELRU): counts evictions in two non-LRU regions (early/late) to predict a line to evict
  - RRIP: predicts whether a line will be reused in the near, long, or distant future
BALANCING REUSE AND CACHE POLLUTION
- Key to good performance (high hit rate): cache lines must be reused as much as possible before eviction AND must be evicted soon after the “last” reuse to give space to new lines
- The former can be achieved by using the reuse distance and actively preventing eviction: “protecting” a line from eviction
- The latter can be achieved by evicting a line when it is not reused within this distance
- There is an optimal reuse distance balancing the two; it is called the Protecting Distance (PD)
EXAMPLE: 436.CACTUSADM
- A majority of lines are reused at 64 or fewer accesses; there are multiple peaks at different reuse distances
- Reuse is maximized if lines are kept in the cache for 64 accesses: lines may not be reused if evicted before that, and lines kept beyond that are likely to pollute the cache
- Assume that no lines are kept longer than a given RD
[Figure: RDD of 436.cactusADM]
[Chart: Reduction in miss rate over LRU for 436.cactusADM when no line is kept longer than RD = 16, 32, 48, 72, 128, or 256, compared with EELRU, DIP, and RRIP; Y-axis: 0% to 60%]
THE PROTECTING DISTANCE (PD)
- A distance at which a majority of lines are covered
- A single value for all sets, predicted based on the current RDD
- Questions to answer/solve:
  - Why does using the PD achieve the balance?
  - How to dynamically find the PD for an application and a cache configuration?
  - How to build the PD-based management policies?
OUTLINE
1. The concept of Protecting Distance
2. Single-core PD-based replacement and bypass policy (PDP)
3. The multi-core PD-based management policies
4. Evaluation
THE SINGLE-CORE PDP
- A cache tag contains a line’s remaining PD (RPD); a line can be evicted when its RPD = 0
- The RPD of an inserted or promoted line is set to the predicted PD; the RPDs of the other lines in the set are decremented
- Example: a 4-way cache, the predicted PD is 7; a line is promoted on a hit
[Diagram: the RPDs of a set before and after the hit access; the reused line’s RPD is reset to 7, the other lines’ RPDs are decremented; unused = inserted but not yet reused]
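The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not the hardware design; the function name and the example set state are assumed.

```python
def access_set(rpds, way, pd):
    """On a hit (promotion) or a fill (insertion) into `way`, reset that
    line's remaining PD to the predicted PD and decrement the RPDs of all
    other lines in the set, saturating at 0."""
    for w in range(len(rpds)):
        rpds[w] = pd if w == way else max(0, rpds[w] - 1)
    return rpds

# 4-way set, predicted PD = 7: the line in way 2 is promoted on a hit.
print(access_set([1, 6, 5, 2], 2, 7))  # [0, 5, 7, 1]
```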
THE SINGLE-CORE PDP (CONT.)
- Selecting a victim on a miss: a line with an RPD = 0 can be replaced
- Two cases when all RPDs > 0 (no unprotected lines):
  - Caches without bypass (inclusive): unused lines are less likely to be reused than reused lines, so replace an unused line with the highest RPD first; if there is no unused line, replace the line with the highest RPD
  - Caches with bypass (non-inclusive): bypass the new line
[Diagram: example sets showing the victim chosen in each case (reused vs. unused lines)]
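The victim-selection rules above can be sketched as follows; function name, argument layout, and the example set states are assumptions for illustration.

```python
def choose_victim(rpds, used, bypass_allowed):
    """Victim selection on a miss, per the rules above:
    1. any line with RPD == 0 is unprotected and may be replaced;
    2. if all lines are protected and the cache is non-inclusive, bypass;
    3. otherwise prefer an unused (never reused) line with the highest RPD,
       falling back to the reused line with the highest RPD.
    Returns the victim way, or None to bypass. used[w] marks reused lines."""
    unprotected = [w for w, r in enumerate(rpds) if r == 0]
    if unprotected:
        return unprotected[0]
    if bypass_allowed:
        return None  # non-inclusive cache: bypass the new line
    unused = [w for w in range(len(rpds)) if not used[w]]
    candidates = unused if unused else range(len(rpds))
    return max(candidates, key=lambda w: rpds[w])

print(choose_victim([3, 1, 0, 5], [True, False, True, True], False))  # 2
print(choose_victim([3, 1, 2, 5], [True, False, True, True], False))  # 1
print(choose_victim([3, 1, 2, 5], [True, False, True, True], True))   # None
```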
EVALUATION OF THE STATIC PDP
- Static PDP: use the best static PD (< 256) for each benchmark
  - SPDP-NB: static PDP with replacement only
  - SPDP-B: static PDP with replacement and bypass
- Performance: in general, DRRIP < SPDP-NB < SPDP-B
  - 436.cactusADM: a 10% additional miss reduction
  - The two static PDP policies otherwise have similar performance
  - 483.xalancbmk: 3 different execution windows have different behavior for SPDP-B
[Chart: Miss reduction over DRRIP of SPDP-NB and SPDP-B for 403.gcc, 429.mcf, 433.milc, 434.zeusmp, 436.cactusADM, 437.leslie3d, 450.soplex, 456.hmmer, 459.GemsFDTD, 462.libquantum, 464.h264ref, 470.lbm, 471.omnetpp, 473.astar, 482.sphinx3, and three windows of 483.xalancbmk; Y-axis: -5% to 25%]
436.CACTUSADM: EXPLAINING THE PERFORMANCE DIFFERENCE
- How do the evicted lines occupy the cache?
  - DRRIP: early evicted lines account for 75% of accesses but occupy only 4% of the cache; late evicted lines account for 2% of accesses but occupy 8% of the cache → pollution
  - SPDP-NB: early and late evicted lines account for 42% of accesses but occupy only 4%
  - SPDP-B: late evicted lines account for 1% of accesses and occupy 3% of the cache → yielding cache space to useful lines
- PDP suffers less pollution from long-RD lines in the cache than RRIP
[Chart: access and occupancy breakdown (hit; bypass; evicted before 16 accesses (early); evicted after 16 accesses (late)) for DRRIP, SPDP-NB, and SPDP-B]
CASE STUDY: 483.XALANCBMK
[Figure: RDDs and SPDP-B hit rates of three execution windows of 483.xalancbmk (483.xalancbmk.1, .2, .3)]
- The best PD is different in different windows, and for different programs
- We need a dynamic policy that finds the best PD, and a model to drive the search
- There is a close relationship between the hit rate, the PD, and the RDD
A HIT RATE MODEL FOR NON-INCLUSIVE CACHE
- The model estimates the hit rate as a function of dp and the RDD:
  - {Ni}, Nt: the RDD (the number of accesses at each reuse distance i, and the total number of accesses)
  - dp: the protecting distance
  - de: experimentally set to W (W: cache associativity)
- The model:

  E(dp) = W × Hits(dp) / Accesses(dp), where
  Hits(dp) = Σ_{i=1..dp} Ni
  Accesses(dp) = Σ_{i=1..dp} i·Ni + (dp + de) × (Nt − Hits(dp))

- E is used to find the PD maximizing the hit rate
[Figure: RDD, E, and hit rate curves for 403.gcc, 436.cactusADM, and 464.h264ref]
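A minimal Python sketch of searching for the PD with this model; the function and argument names are assumed, and the formula in the docstring follows the definitions above.

```python
def best_pd(rdd, n_total, assoc, max_pd=256):
    """Scan candidate protecting distances dp and return the one maximizing
    E(dp) = W * hits(dp) / (sum_{i<=dp} i*N_i + (dp + de) * (N_t - hits(dp))),
    with de set to the associativity W as in the model above.
    `rdd` maps reuse distance -> count, `n_total` is N_t."""
    de = assoc
    best, best_e = 1, 0.0
    hits = weighted = 0
    for dp in range(1, max_pd + 1):
        n = rdd.get(dp, 0)
        hits += n                      # running sum of N_i for i <= dp
        weighted += dp * n             # running sum of i * N_i for i <= dp
        denom = weighted + (dp + de) * (n_total - hits)
        e = assoc * hits / denom if denom else 0.0
        if e > best_e:
            best, best_e = dp, e
    return best

# 90% of lines reuse at RD=4, 10% at RD=200: protecting to 200 wastes
# occupancy, so the model picks dp = 4.
print(best_pd({4: 90, 200: 10}, 100, 16))  # 4
```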
PDP CACHE ORGANIZATION
- RD Sampler: tracks accesses to several cache sets and measures the reuse distance of a new access; in the L2 miss/WB stream, the sampling rate can be reduced
- RD Counter Array: collects the number of accesses at each RD = i (Ni) and the total (Nt); to reduce overhead, each counter covers a range of RDs
- PD Compute Logic: finds the PD that maximizes E
- The computed PD is used in the next interval (0.5M L3 accesses)
- Reasonable hardware overhead: 2 or 3 bits per tag to store the RPD
[Diagram: the LLC between the higher cache level and main memory; the RD Sampler observes the access address stream and feeds RDs to the RD Counter Array, which provides the RDD to the PD Compute Logic, which returns the PD]
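The ranged counters of the RD Counter Array might look like the following minimal sketch; the class and field names are assumed, and 16 counters covering 16 RDs each are illustrative values.

```python
class RDCounterArray:
    """Sketch of the RD Counter Array above: to cut hardware cost, each
    counter covers a range of `step` reuse distances rather than a single
    RD value; `nt` counts all sampled accesses (N_t)."""
    def __init__(self, num_counters=16, step=16):
        self.step = step
        self.counters = [0] * num_counters
        self.nt = 0

    def record(self, rd):
        """Record one sampled reuse distance."""
        self.nt += 1
        idx = rd // self.step            # map the RD to its range counter
        if idx < len(self.counters):     # RDs past the range only count in nt
            self.counters[idx] += 1

arr = RDCounterArray()
for rd in (3, 17, 40, 500):              # 500 exceeds the tracked range
    arr.record(rd)
print(arr.counters[:3], arr.nt)          # [1, 1, 1] 4
```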
PDP VS. EXISTING POLICIES

Policy        | Replacement | Bypass | Reuse | Pollution | Distance measurement | Model
LRU           | Yes         | No     | No    | Yes       | Stack-based          | No
EELRU [1] (*) | Yes         | No     | No    | Yes       | Stack-based          | Probabilistic
DIP [2]       | Yes         | No     | Yes   | No        | N/A                  | No
RRIP [3]      | Yes         | No     | Yes   | No        | N/A                  | No
SDP [4]       | No          | Yes    | Yes   | No        | N/A                  | No
PDP           | Yes         | Yes    | Yes   | Yes       | Access-based         | Hit rate

(Columns: supported policy (replacement, bypass); which side of the balance is addressed (reuse, pollution); how distance is measured; the model used.)
[1] Y. Smaragdakis, S. Kaplan, and P. Wilson. EELRU: simple and effective adaptive page replacement. In SIGMETRICS’99
[2] M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer. Adaptive insertion policies for high performance caching. In ISCA’07
[3] A. Jaleel, K. B. Theobald, S. C. Steely, Jr., and J. Emer. High performance cache replacement using re-reference interval prediction (RRIP). In ISCA’10
[4] S. M. Khan, Y. Tian, and D. A. Jimenez. Sampling dead block prediction for last-level caches. In MICRO’10
(*) The originally proposed EELRU has the concept of a late eviction point, which shares some similarities with the protecting distance; however, its lines are not always guaranteed to be protected.
OUTLINE
1. The concept of Protecting Distance
2. The single-core PD-based replacement and bypass policy (PDP)
3. The multi-core PD-based management policies
4. Evaluation
PD-BASED SHARED CACHE PARTITIONING
- Each thread has its own PD (thread-aware): the counter array is replicated per thread; the sampler and compute logic are shared
- A thread’s PD determines its cache partition: its lines occupy the cache longer if its PD is large, so the cache is implicitly partitioned per the needs of each thread using the thread PDs
- The problem is to find a set of thread PDs that together maximize the hit rate
SHARED-CACHE HIT RATE MODEL
- Extends the single-core approach: compute a vector <PD> = <PD1, …, PDT> (T = number of threads)

  E(<PD>) = W × Σ_{t=1..T} Hits_t / Σ_{t=1..T} Accesses_t

- An exhaustive search for <PD> is not practical; a heuristic search algorithm finds a combination of the threads’ RDD peaks that maximizes the hit rate
- The single-core model generates the top 3 peaks per thread; the complexity is O(T²)
- See the paper for more detail
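The peak-based search can be sketched as below. This brute-force version over the per-thread top-3 peaks is an illustrative stand-in for the paper's O(T²) heuristic (feasible only for small T), and the shared-cache evaluation function is passed in rather than assumed.

```python
from itertools import product

def top_peaks(rdd, k=3):
    """Top-k local peaks of a thread's RDD (candidate PDs), ranked by count."""
    peaks = [rd for rd in rdd
             if rdd[rd] >= rdd.get(rd - 1, 0) and rdd[rd] >= rdd.get(rd + 1, 0)]
    return sorted(peaks, key=lambda rd: rdd[rd], reverse=True)[:k]

def search_pd_vector(thread_rdds, evaluate):
    """Try every combination of per-thread top-3 RDD peaks and keep the <PD>
    vector maximizing the shared-cache model `evaluate` (a callable taking
    the PD tuple; the combined-E details are in the paper)."""
    candidates = [top_peaks(rdd) for rdd in thread_rdds]
    return max(product(*candidates), key=evaluate)

# Two threads; with a toy evaluate (here just the sum of PDs), the search
# picks one candidate peak per thread.
print(search_pd_vector([{4: 50, 64: 30}, {8: 40}], sum))  # (64, 8)
```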
OUTLINE
1. The concept of Protecting Distance
2. The single-core PD-based replacement and bypass policy (PDP)
3. The multi-core PD-based management policies
4. Evaluation
EVALUATION METHODOLOGY
- CMP$im simulator; LLC replacement; target cache: the LLC

  Cache          | Params
  DCache         | 32KB, 8-way, 64B, 2 cycles
  ICache         | 32KB, 4-way, 64B, 2 cycles
  L2 Cache       | 256KB, 8-way, 64B, 10 cycles
  L3 Cache (LLC) | 2MB, 16-way, 64B, 30 cycles
  Memory         | 200 cycles
EVALUATION METHODOLOGY (CONT.)
- Benchmarks: SPEC CPU 2006 benchmarks, excluding those which did not stress the LLC
- Single-core: compared to EELRU, SDP, DIP, DRRIP
- Multi-core: 4- and 16-core configurations, 80 workloads each, generated by randomly combining benchmarks; compared to UCP, PIPP, TA-DRRIP
- Our policy: PDP-x, where x is the number of bits per cache line
SINGLE-CORE PDP
- PDP-x, where x is the number of bits per cache line; each benchmark is executed for 1B instructions
- PDP is best with 3 bits per line, but still better than prior work at 2 bits
[Chart: IPC improvement over DIP of SDP, DRRIP, EELRU, PDP-2, PDP-3, PDP-8, and SPDP-B for the SPEC CPU 2006 benchmarks and their average; Y-axis: -30% to 30%]
ADAPTATION TO PROGRAM PHASES
- 5 benchmarks which demonstrate significant phase changes; each benchmark is run for 5B instructions
[Figure: change of the PD over time for 403.gcc, 429.mcf, 450.soplex, 482.sphinx3, and 483.xalancbmk; X-axis: 1M LLC accesses]
ADAPTATION TO PROGRAM PHASES (CONT.)
[Chart: IPC improvement over DIP of RRIP, PDP-2, PDP-3, and PDP-8 for the five phase-changing benchmarks; Y-axis: -5% to 15%]
PD-BASED CACHE PARTITIONING FOR 16 CORES
[Charts: per-workload results (80 workloads) and averages of UCP, PIPP, PDP-2, and PDP-3 for three metrics labeled W, T, and H, normalized to TA-DRRIP; Y-axis: -20% to 40% per workload, -10% to 10% average]
HARDWARE OVERHEAD

  Policy | Per-line bits | Overhead (%)
  DIP    | 4             | 0.8%
  RRIP   | 2             | 0.4%
  SDP    | 4             | 1.4%
  PDP-2  | 2             | 0.6%
  PDP-3  | 3             | 0.8%
OTHER RESULTS
- Exploration of PDP cache parameters
- Cache bypass fraction
- Prefetch-aware PDP
- PD-based cache management policy for 4-core
CONCLUSIONS
- Proposed the concept of the Protecting Distance (PD) and showed that it can be used to better balance reuse and cache pollution
- Developed a hit rate model as a function of the PD, program behavior, and cache configuration
- Proposed PD-based management policies for both single- and multi-core systems
- PD-based policies outperform existing policies
THANK YOU!
BACKUP SLIDES
RDD, E and hit rate of all benchmarks
RDDS, MODELED AND REAL HIT RATES OF SPEC CPU 2006 BENCHMARKS
[Figures: RDD, E, and hit rate curves for 403.gcc, 429.mcf, 433.milc, 434.zeusmp, 436.cactusADM, 437.leslie3d, 450.soplex, 456.hmmer, 459.GemsFDTD, 462.libquantum, 464.h264ref, 470.lbm, 471.omnetpp, 473.astar, 482.sphinx3, and the three windows of 483.xalancbmk]