1 optimizing compilers managing cache bercovici sivan

1

Optimizing compilers

Managing Cache

Bercovici Sivan

2

Overview

Motivation Cache structure Important observations Techniques covered in the book

– Loop interchange– Blocking– Unaligned Data– Pre-fetching

3

Overview (Cont.)

Issues and techniques not covered in the book– Instruction Cache– Dynamic profiling driven cache optimization

4

Motivation

Shorten fetch time

Processor-DRAMPerformance Gap:(grows 50% / year)

1

10

100

1000

19

80

19

81

19

82

19

83

19

84

19

85

19

86

19

87

19

88

19

89

19

90

19

91

19

92

19

93

19

94

19

95

19

96

19

97

19

98

19

99

20

00

DRAM

CPU

Pe

rfo

rma

nc

e

Time

5

Motivation (cont.)

Solution: Cache– Faster memory

Software problems– Maximize cache performance

6

Memory structure

Hierarchical

Registers

Cache

Memory

Disk

Instructions

Blocks

Pages

Larger

Faster

100s Bytes, <10s ns

K Bytes,10-100 ns

M Bytes, 100ns-1us

Capacity, Access Time

7

Cache structure

Specialized– Instruction– Data– What about stack

cache?

8

Cache Structure (cont.)

Organized into blocks– Multiple machine-words

Maps entire memory Most use LRU

replacementstrategy

Tag LineTag Array

Tag = Block#

Address Fields0431

Data array

031

==

=

hit data

Line Offset

9

Observations

Temporal locality– If a variable is referenced, it tends to be

referenced again soon.

Spatial locality– If a variable is referenced, nearby variables tend

to referenced soon

10

Observations (cont.)

Temporal locality example– Variables used inside a loop

Spatial locality example– Iterating on array items

11

Cache exploiting observations

Temporal locality– Cache attempts to keep recently accessed data

Spatial locality– The cache brings blocks of data from memory

12

Loop interchange

13

Example

DO I = 1, M DO J = 1, N A(I, J) = A(I, J) + B(I, J) ENDDOENDDO

I iterates on rows, J on columns Fortran arrays are column major

– The first column is stored first, then the second column..– C is the other way around (row major)

No spatial reuse

14

Example - visually

B

DO I = 1, M DO J = 1, N A(I, J) = A(I, J) + B(I, J) ENDDOENDDO

Cache mappingCache miss

A

Cache miss

15

Example analysis

DO I = 1, M

DO J = 1, N

A(I, J) = A(I, J) + B(I, J)

ENDDO

ENDDO

2*N*M misses– Due to the fact the innermost loop iterates on the

non-contiguous dimension

16

Example fixed (loop interchange)

DO J = 1, N

DO I = 1, M

A(I, J) = A(I, J) + B(I, J)

ENDDO

ENDDO

We process column by column spatial reuse

17

Example - visually

DO J = 1, N DO I = 1, M A(I, J) = A(I, J) + B(I, J) ENDDOENDDO

Cache mapping

A B

18

Analyzing fixed example

DO J = 1, N

DO I = 1, M

A(I, J) = A(I, J) + B(I, J)

ENDDO

ENDDO

2*N*M/b misses– b is the cache-block size

19

A harder example

DO I = 1, NDO J = 1, M

D(I) = D(I) + B(I,J)ENDDO

ENDDO NM for B, N/b for D After interchange: NM/b for B, NM/b for D When should interchange?

– N/b+NM - 2NM/b > 0

20

Loop interchange

Determine which loop should be innermost– Strive to increase locality

Heuristic approach– Compute cost function for each loop– Order loops:

Cheapest loop innermost Most expensive, outermost

21

Cost assignment

Cost is 1 for references that do not depend on loop induction variables

Cost is N for references based on induction variables over a non-contiguous space

Cost is N/b for induction variables based references over contiguous space

Multiply the cost by the loop trip count if the reference varies with the loop index

22

Loop interchange) cont.)

Special notes– Avoid over counting references

Don’t overcount references that are available due to temporal reuse (available to next iterations)

References can still be in the same cache block as other references (references in the same iteration)

– Not all loops order are possible due to data dependency

Find permutation that is both legal and best suits minimal score

23

Blocking

24

Back to the example

DO J = 1, N

DO I = 1, M

D(I) = D(I) + B(I,J)

ENDDO

ENDDO 2NM/b misses

25

Example - visually

D BDO J = 1, N DO I = 1, M D(I) = D(I) + B(I,J) ENDDOENDDO

Cache block size

Cache miss

26

Back to the example (cont.)

DO I = 1, N, S

DO J = 1, M

DO i2 = I, MIN(I+S-1, N)

D(i2) = D(i2) + B(i2,J)

ENDDO

ENDDO

ENDDO

Work on smaller strips (s)

27

Example - visually

D B

DO I = 1, N, S DO J = 1, M DO i2 = I, MIN(I+S-1, N) D(i2) = D(i2) + B(i2,J) ENDDO ENDDOENDDO

Cache block size

Second strip

28

Analysis

Cost of B does not change: NM/b Cost of D effect due to reuse: N/b

– No misses during iterations on J

Conclude: (1+1/M)NM/b


29

Unaligned data

What if B is not aligned on cache block boundary?

At most an additional penalty for each sub-column iteration– Additional NM/S

Conclude: (1+1/M+b/S)NM/bDO I = 1, N, S DO J = 1, M DO i2 = I, MIN(I+S-1, N) D(i2) = D(i2) + B(i2,J) ENDDO ENDDOENDDO

30

Unaligned data - visually

D B


Strip size

Cache alignment

Additional miss

31

Unaligned data (cont.)

What can be done?– Enforce data alignment– Refine our score for loop interchange

include these misses as well in the score


32

Blocking - legality

Split to strips - Legal Interchange –

not alwayslegal

procedure StripMineAndInterchange (L, m, k, o, S)// L = {L1, L2, ..., Lm}is the loop nest to be transformed// Lk is the loop to be strip mined// Lo is the outer loop which is to be just inside the by-strip loop// after interchange// S is the variable to use as strip size; it’s value must be positivelet the header of Lk be

DO I = L, N, D;split the loop into two loops, a by-strip loop:

DO I = L, N, S*Dand a within-strip loop:

DO i = I, MAX(I+S*D-D,N), Daround the loop body;

interchange the by-strip loop to the position just outside of Lo;

end StripMineAndInterchange

33

Blocking – a harder example

DO I = 1, N

DO J = 1, M

A(J+1) = (A(J) + A(J+1))/2

ENDDO

ENDDO

Due to dependence, loops can not be interchanged

Statements

Dependencies between statements

34

Blocking – a closer look

DO I = 1, N

DO J = 1, M

A(J+1) = (A(J) + A(J+1))/2

ENDDO

ENDDO

Not computable due to dependencies

Bad performance due to low cache reuse

35

Harder example – skew it

DO I = 1, N

DO j = I, M+I-1

A(j-I+2) = (A(j-I+1) + A(j-I+2))/2

ENDDO

ENDDO

j = 1..M 2..M+1 3..M+2

36

Harder example - Strip it

DO I = 1, N

DO j = I, M+I-1, S

DO jj = j, MAX(j+S-1, M+I-1)

A(jj-I+2) = (A(jj-I+1) + A(jj-I+2))/2

ENDDO

ENDDO

ENDDO

37

Hard example – interchange loops

DO j = 1, M+N-1, S

DO I = MAX(1, j-M+1), MIN(j, N)

DO jj = j, MAX(j+S-1, M+I-1)

A(jj-I+2) = (A(jj-I+1) + A(jj-I+2))/2

ENDDO

ENDDO

ENDDO

38

Harder example - Comparison

NM/b misses (M+N)*(1/b+1/S) misses

39

Triangular blocking

DO I = 2, N

DO J = 1, I-1

A(I, J) = A(I, I) + A(J, J)

ENDDO

ENDDO

Explicitly: 1..2

1...3

1....4

1…..5

40

Triangular – Strip it

DO I = 2, N, K

DO ii = I, I+K-1

DO J = 1, ii – 1

A(ii, J) = A(ii, I) + A(J, J)

ENDDO

ENDDO

ENDDO

K-size strips Nothing important changed yet..

41

Triangular – transform!

DO I = 2, N, K

DO J = 1, I+K-1

DO ii = MAX(J+1, I), I+K-1

A(ii, J) = A(ii, I) + A(J, J)

ENDDO

ENDDO

ENDDO

Triangular loop interchange Working on the k-strips Preserving correct triangular loop limits

42

Blocking with parallelization

Dimension of parallelism is that of the sequential access– Solution: If multiple parallelization dimensions are

available, avoid the stride-one dimension False sharing

– Data used by different processors is on the same cache-line, but not the exact same data

– Solution: Language extension - expressing data division to processors. Memory data layout accordingly

43

Prefetch

44

“…And I don’t want to miss a thing…” - AeroSmith, 98’ Optimization Seminar

Problematic misses:– Data used for the first time– Data re-used in ways that can not be predicted at

compile-time

DO I=1,NA(I) = B(LOC(I))

ENDDO

45

Prefetch

Brings a line into the cache Typically does not cause stall

– Loads the line parallel to continues execution

Introduced by programmer/compiler

46

Prefetch

Advantages– Miss latencies can be avoided

Assuming we can introduce a prefetch far enough Assuming the cache is large enough

Disadvantages– Number of instruction to execute increases– May cause useful data inside cache to be

evacuated prematurely– Data brought by prefetch might be evicted

prematurely.

47

Minimizing disadvantages impact

The number of added prefetches must be close to what is needed

Prefetches should not arrive “too early”

48

Do J=1, M Do I=1, 32 A(I+1,J) = A(I,J) + C(J) ENDDOENDDO Group generator is not contained in a dependence

cycle – a miss is expected on each iteration unless references to

the generator on subsequent iterations display temporal locality

Generate miss on every new cache line Use prefetch to before references to generators

RAW

Identify prefetch opportunities

49

Acyclic name partitioning

Two cases:– references to the generator do not iterate

sequentially within the loop– references have spatial locality within

the loop

Do I=1, 32 Do J=1, M A(I+1,J) = A(I,J) + C(J) ENDDOENDDO

Do J=1, M Do I=1, 32 A(I+1,J) = A(I,J) + C(J) ENDDOENDDO

50


Case I: references to the generator do not iterate sequentially within the loop insert prefetch before each reference to the generator

Final positioning of the prefetches will be determined by the instruction scheduler

Do I=1, 32 Do J=1, M A(I+1,J) = A(I,J) + C(J) ENDDOENDDO

Prefetch

51

Case II: references have spatial locality within the loop – Determine i0 of the first iteration after the initial

iteration that causes a miss on the access to the generator

– Determine iteration delta between misses in the cache



pre-

loop

Mai

n lo

op

52

Acyclic with spatial reuse

Partition the loop into two parts– initial subloop running from 1 to io-1

– remainder running from io to the end

In the example, io=4DO I = 1, 3 A(I, J) = A(I, J) + A(I-1, J)ENDDODO I = 4, M A(I, J) = A(I, J) + A(I-1, J)ENDDO

DO I = 1, M A(I, J) = A(I, J) + A(I-1, J)ENDDO

53


Strip mine the second loop to have subloops of length delta

In the example, delta=4DO I = 1, 3 A(I, J) = A(I, J) + A(I-1, J)ENDDODO I = 4, M, 4 IU = MIN(M, I+4) DO ii = I, IU A(ii, J) = A(ii, J) + A(ii-1, J) ENDDOENDDO

DO I = 1, 3 A(I, J) = A(I, J) + A(I-1, J)ENDDODO I = 4, M A(I, J) = A(I, J) + A(I-1, J)ENDDO

54


Insert a prefetch before the initial loop Insert prefetches before the inner-loop

prefetch(A(0,J))DO I = 1, 3 A(I, J) = A(I, J) + A(I-1, J)ENDDODO I = 4, M, 4 IU = MIN(M, I+3) prefetch(A(I, J)) DO ii = I, IU A(ii, J) = A(ii, J) + A(ii-1, J) ENDDOENDDO

DO I = 1, 3 A(I, J) = A(I, J) + A(I-1, J)ENDDODO I = 4, M, 4 IU = MIN(M, I+4) DO ii = I, IU A(ii, J) = A(ii, J) + A(ii-1, J) ENDDOENDDO

55


prefetch(A(0,J))DO I = 1, 3 A(I, J) = A(I, J) + A(I-1, J)ENDDODO I = 4, M, 4 IU = MIN(M, I+3) prefetch(A(I, J)) DO ii = I, IU A(ii, J) = A(ii, J) + A(ii-1, J) ENDDOENDDO

DO I = 1, 3 A(I, J) = A(I, J) + A(I-1, J)ENDDODO I = 4, M, 4 IU = MIN(M, I+4) DO ii = I, IU A(ii, J) = A(ii, J) + A(ii-1, J) ENDDOENDDO

DO I = 1, M A(I, J) = A(I, J) + A(I-1, J)ENDDO

56


Group generator is contained in a dependence cycle– a miss is expected only on the first few iterations of the

carrying loop prefetch to the reference can be placed before the

loop carrying the dependence

Identify prefetch opportunities

Input Dependence

57

Put it all together

Rearrange the loop nest so that the loop iterating sequentially over cache lines is innermost

Split the innermost loop into two – – Pre-loop to the first iteration of the innermost loop

containing a generator reference beginning on a new cache line and

– Main loop that begins with the iteration containing the new cache reference.

Insert the prefetch, as previously explained

58

Example

DO J = 1, M DO I = 2, 33 A(I, J) = A(I, J) * B(I) ENDDO ENDDO

prefetch(B(2))DO I = 5, 33, 4 prefetch(B(I))ENDDODO J = 1, M DO I = 2, 33 A(I, J) = A(I, J) * B(I) ENDDOENDDO

prefetch(B(2))DO I = 5, 33, 4 prefetch(B(I))ENDDODO J = 1, M prefetch(A(2,J)) DO I = 2, 4 A(I, J) = A(I, J) * B(I) ENDDO DO I = 5, 33, 4 prefetch(A(I, J)) A(I, J) = A(I, J) * B(I) A(I+1, J) = A(I+1, J) * B(I+1) A(I+2, J) = A(I+2, J) * B(I+2) A(I+3, J) = A(I+3, J) * B(I+3) ENDDO prefetch(A(33, J)) A(33, J) = A(33, J) * B(33)ENDDO

59

Effectiveness of prefetching

60

What did we miss?

61

..“Sometimes, is never quite enough..” - Alanis Morissette , 95’ optimization seminar

Static analysis often ineffective – missing information:– Run-time cache miss– Miss address information

62

Profiling based optimization

Dynamic optimization systems– Collect information dynamically– Optimize according to profile

Use collected information for re-compilation, optimizing accordingly.

Use collected information to perform run-time optimization (Code modification on run-time to perform pre-fetch)

– Example: ADORE

1 optimizing compilers managing cache bercovici sivan

Documents

j bi

j enddoenddo2

j enddoenddoi

references references

cache block

di bi

cache structure cont

stack cache