Predicting Inter-Thread Cache Contention on a Chip Multi-Processor Architecture
Dhruba Chandra, Fei Guo, Seongbeom Kim, Yan Solihin
Electrical and Computer Engineering, North Carolina State University
HPCA-2005
Chandra, Guo, Kim, Solihin - Contention Model
Cache Sharing in CMP
[Diagram: Processor Core 1 and Processor Core 2 (and possibly more cores), each with a private L1 cache, all sharing a single L2 cache.]
Impact of Cache Space Contention
[Charts: L2 cache misses (0% to 400%) and mcf's normalized IPC (0% to 100%) for mcf running alone vs. co-scheduled as mcf+art, mcf+swim, mcf+mst, and mcf+gzip.]
- Application-specific (what is co-scheduled); coschedule-specific (when accesses interleave)
- Significant: up to 4X cache misses, 65% IPC reduction
- Need a model to understand the cache sharing impact
Related Work
Uniprocessor miss estimation:
- Cascaval et al., LCPC 1999; Chatterjee et al., PLDI 2001; Fraguela et al., PACT 1999; Ghosh et al., TOPLAS 1999; J. Lee et al., HPCA 2001; Vera and Xue, HPCA 2002; Wassermann et al., SC 1997
Context switch impact on a time-shared processor:
- Agarwal, ACM Trans. on Computer Systems, 1989; Suh et al., ICS 2001
No model for the cache sharing impact:
- Relatively new phenomenon: SMT, CMP
- Many possible access interleaving scenarios
Contributions
Inter-thread cache contention models:
- 2 heuristic models (refer to the paper)
- 1 analytical model
- Input: circular sequence profiling for each thread
- Output: predicted number of cache misses per thread in a co-schedule
Validation:
- Against a detailed CMP simulator
- 3.9% average error for the analytical model
Insight:
- Temporal reuse patterns determine the impact of cache sharing
Outline
- Model Assumptions
- Definitions
- Inductive Probability Model
- Validation
- Case Study
- Conclusions
Assumptions
- One circular sequence profile per thread: an average profile yields high prediction accuracy; a phase-specific profile may improve accuracy further
- LRU replacement algorithm: other policies are usually LRU approximations
- Threads do not share data: mostly true for serial apps; in parallel apps, threads are likely to be impacted uniformly
Definitions
- seqX(dX, nX) = a sequence of nX accesses to dX distinct addresses by a thread X to the same cache set
- cseqX(dX, nX) (circular sequence) = a sequence in which the first and the last accesses are to the same address

Example: in the access stream A B C D A E E B
- A B C D A is cseq(4,5)
- E E is cseq(1,2)
- B C D A E E B is cseq(5,7)
- the whole stream is seq(5,8)
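The circular-sequence definition above can be sketched in code. A minimal Python sketch (an illustration, not the paper's profiling tool) that extracts every cseq(d, n) from an access stream to one cache set:

```python
# For each re-access of an address, the circular sequence runs from
# that address's previous access to the current one; d counts the
# distinct addresses in that window and n its length.

def circular_sequences(trace):
    """Yield (d, n) for every circular sequence in `trace`."""
    last_pos = {}  # address -> index of its most recent access
    for i, addr in enumerate(trace):
        if addr in last_pos:
            window = trace[last_pos[addr]:i + 1]
            yield (len(set(window)), len(window))
        last_pos[addr] = i

# The slide's example stream:
print(sorted(circular_sequences(list("ABCDAEEB"))))
# -> [(1, 2), (4, 5), (5, 7)], matching cseq(1,2), cseq(4,5), cseq(5,7)
```

The profile F(cseq) used later is simply a histogram of these (d, n) pairs.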
Circular Sequence Properties
- Thread X runs alone in the system: given a circular sequence cseqX(dX, nX), the last access is a cache miss iff dX > Assoc
- Thread X shares the cache with thread Y: if a sequence of intervening accesses seqY(dY, nY) occurs during cseqX(dX, nX)'s lifetime, the last access of thread X is a miss iff dX + dY > Assoc
Example
Assume a 4-way associative cache:
- X's circular sequence: A B A, i.e. cseqX(2,3)
- Y's intervening access sequence during its lifetime: U V V W
With no cache sharing, the last A is a cache hit. With cache sharing, is A a hit or a miss?
Example
Assume a 4-way associative cache, with X's circular sequence cseqX(2,3) = A B A and Y's intervening accesses U V V W:
- Interleaving A U B V V A W: seqY(2,3) intervenes in cseqX's lifetime, so dX + dY = 2 + 2 <= Assoc and the last A is a cache hit
- Interleaving A U B V V W A: seqY(3,4) intervenes in cseqX's lifetime, so dX + dY = 2 + 3 > Assoc and the last A is a cache miss
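The two interleavings can be checked by replaying them on a small LRU set model. A sketch (an illustration, not the paper's simulator), assuming all addresses map to one 4-way set:

```python
# Replay a trace against a single assoc-way LRU cache set and
# record which accesses hit.

def lru_hits(trace, assoc=4):
    """Return the set of indices in `trace` that hit in the set."""
    lru = []  # least recently used at the front, most recent at the end
    hits = set()
    for i, addr in enumerate(trace):
        if addr in lru:
            hits.add(i)
            lru.remove(addr)      # will be re-appended as most recent
        elif len(lru) == assoc:
            lru.pop(0)            # evict the least recently used line
        lru.append(addr)
    return hits

hit_case  = list("AUBVVAW")  # seqY(2,3) intervenes: last A at index 5
miss_case = list("AUBVVWA")  # seqY(3,4) intervenes: last A at index 6
print(5 in lru_hits(hit_case))       # -> True  (cache hit)
print(6 not in lru_hits(miss_case))  # -> True  (cache miss)
```

This agrees with the dX + dY > Assoc rule from the previous slide.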
Outline Model Assumptions Definitions Inductive Probability Model Validation Case Study Conclusions
15Chandra, Guo, Kim, Solihin - Contention Model
Inductive Probability Model
For each cseqX(dX, nX) of thread X, compute Pmiss(cseqX): the probability that its last access is a miss.
Steps:
1. Compute E(nY): the expected number of intervening accesses from thread Y during cseqX's lifetime
2. For each possible dY, compute P(seq(dY, E(nY))): the probability of occurrence of seq(dY, E(nY))
3. If dY + dX > Assoc, add P(seq(dY, E(nY))) to Pmiss(cseqX)
Predicted misses = old_misses + Σ Pmiss(cseqX) × F(cseqX)
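The final aggregation step can be sketched as follows. The profile values and the Pmiss placeholder below are toy numbers, not from the paper; only the summation Σ Pmiss(cseqX) × F(cseqX) mirrors the model:

```python
# Predicted additional misses for thread X under cache sharing.

def extra_misses(cseq_profile, pmiss):
    """cseq_profile: {(dX, nX): F(cseqX)} frequency histogram.
    pmiss: function (dX, nX) -> probability the closing access of
    that circular sequence misses under sharing."""
    return sum(pmiss(d, n) * f for (d, n), f in cseq_profile.items())

# Toy profile and a placeholder Pmiss: a certain miss if dX alone
# already exceeds the associativity, else a made-up sharing probability.
profile = {(2, 3): 100, (5, 7): 40, (9, 10): 10}
assoc = 8
pm = lambda d, n: 1.0 if d > assoc else 0.2
print(extra_misses(profile, pm))  # 100*0.2 + 40*0.2 + 10*1.0 = 38.0
```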
Computing P(seq(dY, E(nY)))
Basic idea: a recurrence over sequence length,
P(seq(d,n)) = A × P(seq(d-1,n-1)) + B × P(seq(d,n-1))
where A and B are transition probabilities:
- seq(d-1,n-1) + 1 access to a distinct address → seq(d,n)
- seq(d,n-1) + 1 access to a non-distinct address → seq(d,n)
Detailed steps are in the paper.
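The shape of the recurrence can be sketched under a simplifying assumption: each intervening access touches a new distinct address with a fixed probability p_distinct. The paper derives the actual transition probabilities A and B from thread Y's profile; this sketch only illustrates the inductive structure:

```python
# P(seq(d, n)): probability that n intervening accesses touch exactly
# d distinct addresses, with a fixed per-access probability of hitting
# a new distinct address (a simplification of the paper's derivation).

def p_seq(d, n, p_distinct):
    if d == 0 and n == 0:
        return 1.0
    if d < 0 or d > n:
        return 0.0
    # + 1 access to a distinct address      (from seq(d-1, n-1))
    # + 1 access to a non-distinct address  (from seq(d, n-1))
    return (p_distinct * p_seq(d - 1, n - 1, p_distinct)
            + (1 - p_distinct) * p_seq(d, n - 1, p_distinct))

# Sanity check: for fixed n, the probabilities over d sum to 1.
print(round(sum(p_seq(d, 5, 0.3) for d in range(6)), 6))  # -> 1.0
```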
Outline Model Assumptions Definitions Inductive Probability Model Validation Case Study Conclusions
18Chandra, Guo, Kim, Solihin - Contention Model
Validation
- SESC simulator: detailed CMP + memory hierarchy
- 14 co-schedules of benchmarks (Spec2K and Olden)
- A co-schedule is terminated when an app completes

Configuration:
- CMP cores: 2 cores, each 4-issue dynamic, 3.2 GHz
- L1 I/D (private): each WB, 32 KB, 4-way, 64 B line
- L2 unified (shared): WB, 512 KB, 8-way, 64 B line, LRU replacement
Validation (Error = (PM - AM) / AM, predicted vs. actual misses; one value pair per thread in each co-schedule):

Co-schedule     Actual Miss Increase    Prediction Error
gzip + applu    243%, 11%               -25%, 2%
gzip + apsi     180%, 0%                -9%, 0%
mcf + art       296%, 0%                7%, 0%
mcf + gzip      18%, 102%               7%, 22%
mcf + swim      59%, 0%                 -7%, 0%

- Larger error happens when the miss increase is very large
- Overall, the model is accurate
Other Observations
Based on how vulnerable applications are to the cache sharing impact:
- Highly vulnerable: mcf, gzip
- Not vulnerable: art, apsi, swim
- Somewhat / sometimes vulnerable: applu, equake, perlbmk, mst
Prediction error:
- Very small, except for highly vulnerable apps
- 3.9% average, 25% maximum
- Also small for different cache associativities and sizes
Case Study
The profile is approximated by a geometric progression:
F(cseq(1,*)), F(cseq(2,*)), F(cseq(3,*)), ..., F(cseq(A,*)), ... = Z, Zr, Zr², ..., Zr^(A-1), ...
- Z = amplitude; 0 < r < 1 = common ratio
- Larger r implies a larger working set
What is the impact of an interfering thread on the base thread?
- Fix the base thread; vary the interfering thread
- Miss frequency = # misses / time; reuse frequency = # hits / time
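The geometric profile can be sketched as follows. Z, the depth cutoff, and the "deep reuse" threshold are illustrative choices, not from the paper; the point is that larger r puts more reuses at deep recurrence distances, i.e. a larger working set:

```python
# F(cseq(d, *)) = Z * r**(d - 1), the case study's profile shape.

def geometric_profile(Z, r, max_d):
    """Return [F(cseq(1,*)), ..., F(cseq(max_d,*))]."""
    return [Z * r ** (d - 1) for d in range(1, max_d + 1)]

small_ws = geometric_profile(100, 0.5, 8)  # r = 0.5: mass at small d
large_ws = geometric_profile(100, 0.9, 8)  # r = 0.9: heavier tail

# Fraction of reuses at distance d > 4 (most sensitive to sharing
# in a cache of modest associativity):
deep = lambda p: sum(p[4:]) / sum(p)
print(deep(small_ws) < deep(large_ws))  # -> True
```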
Base Thread: r = 0.5 (Small Working Set)
[Chart: base thread's L2 cache misses vs. multiplying factor (1 to 4) applied to the interfering thread's miss frequency and reuse frequency.]
The base thread is not vulnerable to the interfering thread's miss frequency, but is vulnerable to its reuse frequency.
Base Thread: r = 0.9 (Large Working Set)
[Chart: base thread's L2 cache misses vs. multiplying factor (1 to 4) applied to the interfering thread's miss frequency and reuse frequency.]
The base thread is vulnerable to both the interfering thread's miss frequency and its reuse frequency.
Conclusions
New inter-thread cache contention models, simple to use:
- Input: circular sequence profiling per thread
- Output: number of misses per thread in co-schedules
Accurate:
- 3.9% average error
Useful insight:
- Temporal reuse patterns determine the cache sharing impact
Future work:
- Predict and avoid problematic co-schedules
- Release the tool at http://www.cesr.ncsu.edu/solihin