Optimizing Compilers: Managing Cache
Bercovici Sivan


Page 1:

Optimizing Compilers: Managing Cache
Bercovici Sivan

Page 2:

Overview

Motivation
Cache structure
Important observations
Techniques covered in the book
– Loop interchange
– Blocking
– Unaligned data
– Pre-fetching

Page 3:

Overview (cont.)

Issues and techniques not covered in the book
– Instruction cache
– Dynamic profiling-driven cache optimization

Page 4:

Motivation

Shorten fetch time

[Figure: Processor-DRAM performance gap (grows 50% / year); log-scale performance (1 to 1000) vs. time, 1980-2000; the CPU curve pulls away from the DRAM curve.]

Page 5:

Motivation (cont.)

Solution: cache
– Faster memory

Software problem
– Maximize cache performance

Page 6:

Memory structure

Hierarchical – each level is larger but slower than the one above:
– Registers: 100s of bytes, < 10s ns
– Cache: KBytes, 10-100 ns
– Memory: MBytes, 100 ns - 1 us
– Disk

Data moves between the levels as instruction operands, cache blocks, and pages, respectively; capacity grows and access time lengthens as we move down.

Page 7:

Cache structure

Specialized
– Instruction
– Data
– What about a stack cache?

Page 8:

Cache Structure (cont.)

Organized into blocks
– Multiple machine words
Maps the entire memory
Most caches use an LRU replacement strategy

[Figure: cache organization – a tag array and a data array; the 32-bit address splits into a tag (= block #) and a line offset (bits 4..0); a matching tag signals a hit and selects the data.]
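The address split shown in the diagram can be sketched in C; the 5 offset bits (a 32-byte line) are an assumption read off the figure's 0/4/31 field markers, not a stated parameter.

```c
/* Address decomposition sketch matching the diagram: bits 4..0 select a
 * byte within an assumed 32-byte cache line, and the remaining bits
 * form the tag (the block number). */
enum { OFFSET_BITS = 5 };                      /* 2^5 = 32-byte lines */

unsigned long tag_of(unsigned long addr)    { return addr >> OFFSET_BITS; }
unsigned long offset_of(unsigned long addr) { return addr & ((1UL << OFFSET_BITS) - 1); }
```

The tag array stores `tag_of(addr)` for each resident line; a lookup compares it against the incoming address's tag to decide hit or miss.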

Page 9:

Observations

Temporal locality
– If a variable is referenced, it tends to be referenced again soon.
Spatial locality
– If a variable is referenced, nearby variables tend to be referenced soon.

Page 10:

Observations (cont.)

Temporal locality example
– Variables used inside a loop
Spatial locality example
– Iterating over array items

Page 11:

How the cache exploits these observations

Temporal locality
– The cache attempts to keep recently accessed data
Spatial locality
– The cache brings in whole blocks of data from memory

Page 12:

Loop interchange

Page 13:

Example

    DO I = 1, M
      DO J = 1, N
        A(I, J) = A(I, J) + B(I, J)
      ENDDO
    ENDDO

I iterates over rows, J over columns
Fortran arrays are column major
– The first column is stored first, then the second column, ...
– C is the other way around (row major)
No spatial reuse

Page 14:

Example - visually

    DO I = 1, M
      DO J = 1, N
        A(I, J) = A(I, J) + B(I, J)
      ENDDO
    ENDDO

[Figure: cache mapping of A and B; every access to A and to B is a cache miss.]

Page 15:

Example analysis

    DO I = 1, M
      DO J = 1, N
        A(I, J) = A(I, J) + B(I, J)
      ENDDO
    ENDDO

2*N*M misses
– Because the innermost loop iterates over the non-contiguous dimension

Page 16:

Example fixed (loop interchange)

    DO J = 1, N
      DO I = 1, M
        A(I, J) = A(I, J) + B(I, J)
      ENDDO
    ENDDO

We now process column by column, gaining spatial reuse
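The same transformation in C looks mirrored, because C arrays are row major: there the *last* index is contiguous, so it should drive the innermost loop. A sketch (array sizes are arbitrary, not from the slides):

```c
#include <stddef.h>

#define M 64
#define N 64

/* Bad order for row-major C: the inner loop varies the FIRST index,
 * so consecutive accesses are N doubles apart - no spatial reuse. */
void add_bad(double a[M][N], double b[M][N]) {
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < M; i++)
            a[i][j] += b[i][j];
}

/* Interchanged order: the inner loop varies the last (contiguous)
 * index, so consecutive accesses share cache lines. */
void add_good(double a[M][N], double b[M][N]) {
    for (size_t i = 0; i < M; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] += b[i][j];
}
```

Both versions compute the same result; only the traversal order, and hence the miss count, differs.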

Page 17:

Example - visually

    DO J = 1, N
      DO I = 1, M
        A(I, J) = A(I, J) + B(I, J)
      ENDDO
    ENDDO

[Figure: cache mapping; A and B are now traversed along cache blocks.]

Page 18:

Analyzing the fixed example

    DO J = 1, N
      DO I = 1, M
        A(I, J) = A(I, J) + B(I, J)
      ENDDO
    ENDDO

2*N*M/b misses
– b is the cache-block size

Page 19:

A harder example

    DO I = 1, N
      DO J = 1, M
        D(I) = D(I) + B(I,J)
      ENDDO
    ENDDO

NM misses for B, N/b for D
After interchange: NM/b for B, NM/b for D
When should we interchange?
– When N/b + NM - 2NM/b > 0
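The break-even condition can be evaluated directly; a sketch (the function names are mine, and b is the cache-block size in array elements):

```c
/* Estimated cache misses for the example
 *   DO I = 1, N / DO J = 1, M / D(I) = D(I) + B(I,J)
 * before and after loop interchange. */
double misses_before(double N, double M, double b) {
    return N / b + N * M;   /* D: N/b; B: NM (inner loop is non-contiguous) */
}
double misses_after(double N, double M, double b) {
    return 2.0 * N * M / b; /* D and B both stride-one: NM/b each */
}
/* Interchange pays off when it removes misses: N/b + NM - 2NM/b > 0. */
int should_interchange(double N, double M, double b) {
    return misses_before(N, M, b) > misses_after(N, M, b);
}
```

With N = M = 100 and b = 8 the interchange wins (10012.5 vs. 2500 estimated misses); with b = 1 it cannot help, since there is no spatial reuse to gain.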

Page 20:

Loop interchange

Determine which loop should be innermost
– Strive to increase locality
Heuristic approach
– Compute a cost function for each loop
– Order the loops: cheapest loop innermost, most expensive outermost

Page 21:

Cost assignment

Cost is 1 for references that do not depend on the loop's induction variable
Cost is N for induction-variable-based references over a non-contiguous space
Cost is N/b for induction-variable-based references over a contiguous space
Multiply the cost by the loop trip count if the reference varies with the loop index
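The three cost rules above can be written down as one function; this is a hypothetical sketch of the scoring step only (the names `RefKind` and `ref_cost` are mine, not from the book):

```c
/* Per-reference cost for loop-order selection, following the slide's
 * rules. trip = trip count of the candidate loop, b = block size. */
typedef enum { INVARIANT, NONCONTIGUOUS, CONTIGUOUS } RefKind;

double ref_cost(RefKind k, double trip, double b) {
    switch (k) {
    case INVARIANT:     return 1.0;      /* does not vary with the loop */
    case NONCONTIGUOUS: return trip;     /* a new cache line every iteration */
    case CONTIGUOUS:    return trip / b; /* a new line every b iterations */
    }
    return 0.0;
}
```

The compiler would sum these costs over all references per candidate loop and put the cheapest loop innermost.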

Page 22:

Loop interchange (cont.)

Special notes
– Avoid over-counting references
– Don't over-count references that are available due to temporal reuse (available to later iterations)
– References can fall in the same cache block as other references in the same iteration
– Not all loop orders are possible, due to data dependences
– Find the permutation that is both legal and has the minimal score

Page 23:

Blocking

Page 24:

Back to the example

    DO J = 1, N
      DO I = 1, M
        D(I) = D(I) + B(I,J)
      ENDDO
    ENDDO

2NM/b misses

Page 25:

Example - visually

    DO J = 1, N
      DO I = 1, M
        D(I) = D(I) + B(I,J)
      ENDDO
    ENDDO

[Figure: D and B against the cache-block size; D is re-fetched on every pass over a column, causing cache misses.]

Page 26:

Back to the example (cont.)

    DO I = 1, N, S
      DO J = 1, M
        DO i2 = I, MIN(I+S-1, N)
          D(i2) = D(i2) + B(i2,J)
        ENDDO
      ENDDO
    ENDDO

Work on smaller strips (of size S)
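A C mirror of the strip-mined nest, as a sketch (sizes are arbitrary; b is laid out so the i index is contiguous, mirroring column-major B(i2, J) in the Fortran):

```c
#include <stddef.h>

enum { N = 100, M = 40, S = 16 };  /* S: strip size, assumed to fit cache */

/* Unblocked form: every pass over j re-touches all N elements of d. */
void plain(double d[N], double b[M][N]) {
    for (size_t j = 0; j < M; j++)
        for (size_t i = 0; i < N; i++)
            d[i] += b[j][i];
}

/* Strip-mined form: each S-element strip of d is reused across all M
 * passes before moving on, so it can stay cache-resident. */
void strip_mined(double d[N], double b[M][N]) {
    for (size_t i = 0; i < N; i += S) {
        size_t hi = (i + S < N) ? i + S : N;   /* MIN(I+S-1, N) in the slide */
        for (size_t j = 0; j < M; j++)
            for (size_t i2 = i; i2 < hi; i2++)
                d[i2] += b[j][i2];
    }
}
```

Both forms compute the same sums; the blocking only reorders the additions per element of d.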

Page 27:

Example - visually

    DO I = 1, N, S
      DO J = 1, M
        DO i2 = I, MIN(I+S-1, N)
          D(i2) = D(i2) + B(i2,J)
        ENDDO
      ENDDO
    ENDDO

[Figure: D and B against the cache-block size; the second strip of D and B.]

Page 28:

Analysis

    DO I = 1, N, S
      DO J = 1, M
        DO i2 = I, MIN(I+S-1, N)
          D(i2) = D(i2) + B(i2,J)
        ENDDO
      ENDDO
    ENDDO

Cost of B does not change: NM/b
Cost of D drops due to reuse: N/b
– No misses during the iterations over J
Conclusion: (1 + 1/M) NM/b misses

Page 29:

Unaligned data

    DO I = 1, N, S
      DO J = 1, M
        DO i2 = I, MIN(I+S-1, N)
          D(i2) = D(i2) + B(i2,J)
        ENDDO
      ENDDO
    ENDDO

What if B is not aligned on a cache-block boundary?
At most one additional miss for each sub-column iteration
– An additional NM/S misses
Conclusion: (1 + 1/M + b/S) NM/b misses

Page 30:

Unaligned data - visually

    DO I = 1, N, S
      DO J = 1, M
        DO i2 = I, MIN(I+S-1, N)
          D(i2) = D(i2) + B(i2,J)
        ENDDO
      ENDDO
    ENDDO

[Figure: strip size vs. cache alignment; an unaligned strip straddles one extra cache block, causing an additional miss.]

Page 31:

Unaligned data (cont.)

What can be done?
– Enforce data alignment
– Refine our score for loop interchange: include these misses in the score as well

    DO I = 1, N, S
      DO J = 1, M
        DO i2 = I, MIN(I+S-1, N)
          D(i2) = D(i2) + B(i2,J)
        ENDDO
      ENDDO
    ENDDO

Page 32:

Blocking - legality

Splitting into strips is always legal
The interchange is not always legal

    procedure StripMineAndInterchange (L, m, k, o, S)
      // L = {L1, L2, ..., Lm} is the loop nest to be transformed
      // Lk is the loop to be strip mined
      // Lo is the outer loop which is to be just inside the by-strip loop
      //    after interchange
      // S is the variable to use as strip size; its value must be positive
      let the header of Lk be
          DO I = L, N, D;
      split the loop into two loops, a by-strip loop:
          DO I = L, N, S*D
      and a within-strip loop:
          DO i = I, MIN(I+S*D-D, N), D
      around the loop body;
      interchange the by-strip loop to the position just outside of Lo;
    end StripMineAndInterchange

Page 33:

Blocking – a harder example

    DO I = 1, N
      DO J = 1, M
        A(J+1) = (A(J) + A(J+1))/2
      ENDDO
    ENDDO

Due to the dependence, the loops cannot be interchanged

[Figure: the statements and the dependences between them across iterations.]

Page 34:

Blocking – a closer look

    DO I = 1, N
      DO J = 1, M
        A(J+1) = (A(J) + A(J+1))/2
      ENDDO
    ENDDO

Cannot be blocked directly, due to the dependences
Performance is poor, due to low cache reuse

Page 35:

Harder example – skew it

    DO I = 1, N
      DO j = I, M+I-1
        A(j-I+2) = (A(j-I+1) + A(j-I+2))/2
      ENDDO
    ENDDO

The inner loop now runs over j = 1..M, 2..M+1, 3..M+2, ...
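Skewing only shifts the inner index; it touches the same elements in the same order. A C sketch that checks this equivalence (0-based, with `NIT` passes over an (M+1)-element array; names are mine):

```c
enum { NIT = 5, M = 10 };  /* NIT smoothing passes, array a[0..M] */

/* Original nest: A(J+1) = (A(J)+A(J+1))/2 for J = 1..M becomes, in
 * 0-based C, a[j] = (a[j-1] + a[j]) / 2 for j = 1..M. */
void smooth(double a[M + 1]) {
    for (int i = 1; i <= NIT; i++)
        for (int j = 1; j <= M; j++)
            a[j] = (a[j - 1] + a[j]) / 2.0;
}

/* Skewed version from the slide: j runs I..M+I-1 and every subscript
 * is shifted by I, so the body is unchanged element-for-element. */
void smooth_skewed(double a[M + 1]) {
    for (int i = 1; i <= NIT; i++)
        for (int j = i; j <= M + i - 1; j++)
            a[j - i + 1] = (a[j - i] + a[j - i + 1]) / 2.0;
}
```

Because the operations are identical and in the same order, the two versions produce bit-identical results; the skew only reshapes the iteration space so that it can later be stripped and interchanged.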

Page 36:

Harder example – strip it

    DO I = 1, N
      DO j = I, M+I-1, S
        DO jj = j, MIN(j+S-1, M+I-1)
          A(jj-I+2) = (A(jj-I+1) + A(jj-I+2))/2
        ENDDO
      ENDDO
    ENDDO

Page 37:

Harder example – interchange loops

    DO j = 1, M+N-1, S
      DO I = MAX(1, j-M+1), MIN(j, N)
        DO jj = j, MIN(j+S-1, M+I-1)
          A(jj-I+2) = (A(jj-I+1) + A(jj-I+2))/2
        ENDDO
      ENDDO
    ENDDO

Page 38:

Harder example - comparison

Before: NM/b misses
After: (M+N)*(1/b + 1/S) misses

Page 39:

Triangular blocking

    DO I = 2, N
      DO J = 1, I-1
        A(I, J) = A(I, I) + A(J, J)
      ENDDO
    ENDDO

Explicitly, the inner ranges grow triangularly: 1..2, 1..3, 1..4, 1..5, ...

Page 40:

Triangular – strip it

    DO I = 2, N, K
      DO ii = I, I+K-1
        DO J = 1, ii-1
          A(ii, J) = A(ii, I) + A(J, J)
        ENDDO
      ENDDO
    ENDDO

K-size strips
Nothing important has changed yet

Page 41:

Triangular – transform!

    DO I = 2, N, K
      DO J = 1, I+K-1
        DO ii = MAX(J+1, I), I+K-1
          A(ii, J) = A(ii, I) + A(J, J)
        ENDDO
      ENDDO
    ENDDO

Triangular loop interchange
Working on the K-strips
Preserving the correct triangular loop limits

Page 42:

Blocking with parallelization

Problem: the dimension of parallelism is also that of sequential (stride-one) access
– Solution: if multiple parallelization dimensions are available, avoid the stride-one dimension
False sharing
– Data used by different processors lies on the same cache line, though it is not the exact same data
– Solution: a language extension expressing how data is divided among processors; lay out memory accordingly

Page 43:

Prefetch

Page 44:

“…And I don’t want to miss a thing…” - Aerosmith, ’98 Optimization Seminar

Problematic misses:
– Data used for the first time
– Data re-used in ways that cannot be predicted at compile time

    DO I = 1, N
      A(I) = B(LOC(I))
    ENDDO

Page 45:

Prefetch

Brings a line into the cache
Typically does not cause a stall
– Loads the line in parallel with continued execution
Introduced by the programmer or the compiler

Page 46:

Prefetch

Advantages
– Miss latencies can be avoided
– Assuming we can issue the prefetch far enough in advance
– Assuming the cache is large enough
Disadvantages
– The number of instructions to execute increases
– May cause useful data in the cache to be evicted prematurely
– Data brought in by a prefetch might itself be evicted prematurely

Page 47:

Minimizing the impact of the disadvantages

The number of added prefetches must be close to what is actually needed
Prefetches should not arrive “too early”

Page 48:

Identify prefetch opportunities

    DO J = 1, M
      DO I = 1, 32
        A(I+1,J) = A(I,J) + C(J)
      ENDDO
    ENDDO

The group generator is not contained in a dependence cycle (the A references form a RAW dependence)
– A miss is expected on each iteration, unless references to the generator on subsequent iterations display temporal locality
– A miss occurs on every new cache line
– Use a prefetch before references to generators

Page 49:

Acyclic name partitioning

Two cases:
– references to the generator do not iterate sequentially within the loop:

    DO I = 1, 32
      DO J = 1, M
        A(I+1,J) = A(I,J) + C(J)
      ENDDO
    ENDDO

– references have spatial locality within the loop:

    DO J = 1, M
      DO I = 1, 32
        A(I+1,J) = A(I,J) + C(J)
      ENDDO
    ENDDO

Page 50:

Acyclic name partitioning

Case I: references to the generator do not iterate sequentially within the loop – insert a prefetch before each reference to the generator
The final positioning of the prefetches is determined by the instruction scheduler

    DO I = 1, 32
      DO J = 1, M
        A(I+1,J) = A(I,J) + C(J)
      ENDDO
    ENDDO

(A prefetch is inserted before the generator reference.)

Page 51:

Acyclic name partitioning

Case II: references have spatial locality within the loop
– Determine i0, the first iteration after the initial iteration that causes a miss on the access to the generator
– Determine the iteration delta between misses in the cache

    DO J = 1, M
      DO I = 1, 32
        A(I+1,J) = A(I,J) + C(J)
      ENDDO
    ENDDO

[Figure: the iteration space split into a pre-loop and a main loop.]

Page 52:

Acyclic with spatial reuse

Partition the loop into two parts
– an initial subloop running from 1 to i0-1
– a remainder running from i0 to the end
In the example, i0 = 4

Before:

    DO I = 1, M
      A(I, J) = A(I, J) + A(I-1, J)
    ENDDO

After:

    DO I = 1, 3
      A(I, J) = A(I, J) + A(I-1, J)
    ENDDO
    DO I = 4, M
      A(I, J) = A(I, J) + A(I-1, J)
    ENDDO

Page 53:

Acyclic with spatial reuse

Strip mine the second loop into subloops of length delta
In the example, delta = 4

Before:

    DO I = 1, 3
      A(I, J) = A(I, J) + A(I-1, J)
    ENDDO
    DO I = 4, M
      A(I, J) = A(I, J) + A(I-1, J)
    ENDDO

After:

    DO I = 1, 3
      A(I, J) = A(I, J) + A(I-1, J)
    ENDDO
    DO I = 4, M, 4
      IU = MIN(M, I+3)
      DO ii = I, IU
        A(ii, J) = A(ii, J) + A(ii-1, J)
      ENDDO
    ENDDO

Page 54:

Acyclic with spatial reuse

Insert a prefetch before the initial loop
Insert prefetches before the inner loop

Before:

    DO I = 1, 3
      A(I, J) = A(I, J) + A(I-1, J)
    ENDDO
    DO I = 4, M, 4
      IU = MIN(M, I+3)
      DO ii = I, IU
        A(ii, J) = A(ii, J) + A(ii-1, J)
      ENDDO
    ENDDO

After:

    prefetch(A(0,J))
    DO I = 1, 3
      A(I, J) = A(I, J) + A(I-1, J)
    ENDDO
    DO I = 4, M, 4
      IU = MIN(M, I+3)
      prefetch(A(I, J))
      DO ii = I, IU
        A(ii, J) = A(ii, J) + A(ii-1, J)
      ENDDO
    ENDDO

Page 55:

Acyclic with spatial reuse – summary

Original:

    DO I = 1, M
      A(I, J) = A(I, J) + A(I-1, J)
    ENDDO

Split and strip-mined:

    DO I = 1, 3
      A(I, J) = A(I, J) + A(I-1, J)
    ENDDO
    DO I = 4, M, 4
      IU = MIN(M, I+3)
      DO ii = I, IU
        A(ii, J) = A(ii, J) + A(ii-1, J)
      ENDDO
    ENDDO

With prefetches:

    prefetch(A(0,J))
    DO I = 1, 3
      A(I, J) = A(I, J) + A(I-1, J)
    ENDDO
    DO I = 4, M, 4
      IU = MIN(M, I+3)
      prefetch(A(I, J))
      DO ii = I, IU
        A(ii, J) = A(ii, J) + A(ii-1, J)
      ENDDO
    ENDDO

Page 56:

Identify prefetch opportunities

    DO J = 1, M
      DO I = 1, 32
        A(I+1,J) = A(I,J) + C(J)
      ENDDO
    ENDDO

The group generator is contained in a dependence cycle (here, the input dependence on C(J))
– A miss is expected only on the first few iterations of the carrying loop
– The prefetch for the reference can be placed before the loop carrying the dependence

Page 57:

Putting it all together

Rearrange the loop nest so that the loop iterating sequentially over cache lines is innermost
Split the innermost loop into two:
– a pre-loop, up to the first iteration of the innermost loop containing a generator reference beginning on a new cache line, and
– a main loop, beginning with the iteration containing that new-cache-line reference
Insert the prefetches, as previously explained

Page 58:

Example

Original:

    DO J = 1, M
      DO I = 2, 33
        A(I, J) = A(I, J) * B(I)
      ENDDO
    ENDDO

Prefetching B before the nest:

    prefetch(B(2))
    DO I = 5, 33, 4
      prefetch(B(I))
    ENDDO
    DO J = 1, M
      DO I = 2, 33
        A(I, J) = A(I, J) * B(I)
      ENDDO
    ENDDO

Fully transformed, with A split into a pre-loop, an unrolled main loop, and an epilogue:

    prefetch(B(2))
    DO I = 5, 33, 4
      prefetch(B(I))
    ENDDO
    DO J = 1, M
      prefetch(A(2,J))
      DO I = 2, 4
        A(I, J) = A(I, J) * B(I)
      ENDDO
      DO I = 5, 33, 4
        prefetch(A(I, J))
        A(I, J) = A(I, J) * B(I)
        A(I+1, J) = A(I+1, J) * B(I+1)
        A(I+2, J) = A(I+2, J) * B(I+2)
        A(I+3, J) = A(I+3, J) * B(I+3)
      ENDDO
      prefetch(A(33, J))
      A(33, J) = A(33, J) * B(33)
    ENDDO
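The inserted-prefetch pattern above can be sketched in C with the GCC/Clang `__builtin_prefetch` intrinsic (a compiler-specific assumption; the 64-byte line size is also assumed). A prefetch is a hint only, so it never changes the computed result:

```c
#include <stddef.h>

enum { LINE = 8 };   /* assumed 64-byte cache line = 8 doubles */

/* Once per cache line, request the NEXT line so the fetch overlaps
 * with the work on the current one. rw=0 means read, locality=3 means
 * keep the line in all cache levels. */
double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i % LINE == 0 && i + LINE < n)
            __builtin_prefetch(&a[i + LINE], 0, 3);
        s += a[i];
    }
    return s;
}
```

In a real compiler the prefetch distance would be chosen from the miss latency and loop trip time, as the preceding slides describe; one line ahead is just an illustration.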

Page 59:

Effectiveness of prefetching

Page 60:

What did we miss?

Page 61:

“..Sometimes is never quite enough..” - Alanis Morissette, ’95 Optimization Seminar

Static analysis is often ineffective – missing information:
– Run-time cache misses
– Miss address information

Page 62:

Profiling-based optimization

Dynamic optimization systems
– Collect information dynamically
– Optimize according to the profile
– Use the collected information for re-compilation, optimizing accordingly
– Use the collected information to perform run-time optimization (modifying code at run time to insert prefetches)
– Example: ADORE
