1 optimizing compilers managing cache bercovici sivan
TRANSCRIPT
1
Optimizing compilers
Managing Cache
Bercovici Sivan
2
Overview
Motivation Cache structure Important observations Techniques covered in the book
– Loop interchange– Blocking– Unaligned Data– Pre-fetching
3
Overview (Cont.)
Issues and techniques not covered in the book– Instruction Cache– Dynamic profiling driven cache optimization
4
Motivation
Shorten fetch time
Processor-DRAMPerformance Gap:(grows 50% / year)
1
10
100
1000
19
80
19
81
19
82
19
83
19
84
19
85
19
86
19
87
19
88
19
89
19
90
19
91
19
92
19
93
19
94
19
95
19
96
19
97
19
98
19
99
20
00
DRAM
CPU
Pe
rfo
rma
nc
e
Time
5
Motivation (cont.)
Solution: Cache– Faster memory
Software problems– Maximize cache performance
6
Memory structure
Hierarchical
Registers
Cache
Memory
Disk
Instructions
Blocks
Pages
Larger
Faster
100s Bytes, <10s ns
K Bytes,10-100 ns
M Bytes, 100ns-1us
Capacity, Access Time
7
Cache structure
Specialized– Instruction– Data– What about stack
cache?
8
Cache Structure (cont.)
Organized into blocks– Multiple machine-words
Maps entire memory Most use LRU
replacementstrategy
Tag LineTag Array
Tag = Block#
Address Fields0431
Data array
031
==
=
hit data
Line Offset
9
Observations
Temporal locality– If a variable is referenced, it tends to be
referenced again soon.
Spatial locality– If a variable is referenced, nearby variables tend
to referenced soon
10
Observations (cont.)
Temporal locality example– Variables used inside a loop
Spatial locality example– Iterating on array items
11
Cache exploiting observations
Temporal locality– Cache attempts to keep recently accessed data
Spatial locality– The cache brings blocks of data from memory
12
Loop interchange
13
Example
DO I = 1, M DO J = 1, N A(I, J) = A(I, J) + B(I, J) ENDDOENDDO
I iterates on rows, J on columns Fortran arrays are column major
– The first column is stored first, then the second column..– C is the other way around (row major)
No spatial reuse
14
Example - visually
B
DO I = 1, M DO J = 1, N A(I, J) = A(I, J) + B(I, J) ENDDOENDDO
Cache mappingCache miss
A
Cache miss
15
Example analysis
DO I = 1, M
DO J = 1, N
A(I, J) = A(I, J) + B(I, J)
ENDDO
ENDDO
2*N*M misses– Due to the fact the innermost loop iterates on the
non-contiguous dimension
16
Example fixed (loop interchange)
DO J = 1, N
DO I = 1, M
A(I, J) = A(I, J) + B(I, J)
ENDDO
ENDDO
We process column by column spatial reuse
17
Example - visually
DO J = 1, N DO I = 1, M A(I, J) = A(I, J) + B(I, J) ENDDOENDDO
Cache mapping
A B
18
Analyzing fixed example
DO J = 1, N
DO I = 1, M
A(I, J) = A(I, J) + B(I, J)
ENDDO
ENDDO
2*N*M/b misses– b is the cache-block size
19
A harder example
DO I = 1, NDO J = 1, M
D(I) = D(I) + B(I,J)ENDDO
ENDDO NM for B, N/b for D After interchange: NM/b for B, NM/b for D When should interchange?
– N/b+NM - 2NM/b > 0
20
Loop interchange
Determine which loop should be innermost– Strive to increase locality
Heuristic approach– Compute cost function for each loop– Order loops:
Cheapest loop innermost Most expensive, outermost
21
Cost assignment
Cost is 1 for references that do not depend on loop induction variables
Cost is N for references based on induction variables over a non-contiguous space
Cost is N/b for induction variables based references over contiguous space
Multiply the cost by the loop trip count if the reference varies with the loop index
22
Loop interchange) cont.)
Special notes– Avoid over counting references
Don’t overcount references that are available due to temporal reuse (available to next iterations)
References can still be in the same cache block as other references (references in the same iteration)
– Not all loops order are possible due to data dependency
Find permutation that is both legal and best suits minimal score
23
Blocking
24
Back to the example
DO J = 1, N
DO I = 1, M
D(I) = D(I) + B(I,J)
ENDDO
ENDDO 2NM/b misses
25
Example - visually
D BDO J = 1, N DO I = 1, M D(I) = D(I) + B(I,J) ENDDOENDDO
Cache block size
Cache miss
26
Back to the example (cont.)
DO I = 1, N, S
DO J = 1, M
DO i2 = I, MIN(I+S-1, N)
D(i2) = D(i2) + B(i2,J)
ENDDO
ENDDO
ENDDO
Work on smaller strips (s)
27
Example - visually
D B
DO I = 1, N, S DO J = 1, M DO i2 = I, MIN(I+S-1, N) D(i2) = D(i2) + B(i2,J) ENDDO ENDDOENDDO
Cache block size
Second strip
28
Analysis
Cost of B does not change: NM/b Cost of D effect due to reuse: N/b
– No misses during iterations on J
Conclude: (1+1/M)NM/b
DO I = 1, N, S DO J = 1, M DO i2 = I, MIN(I+S-1, N) D(i2) = D(i2) + B(i2,J) ENDDO ENDDOENDDO
29
Unaligned data
What if B is not aligned on cache block boundary?
At most an additional penalty for each sub-column iteration– Additional NM/S
Conclude: (1+1/M+b/S)NM/bDO I = 1, N, S DO J = 1, M DO i2 = I, MIN(I+S-1, N) D(i2) = D(i2) + B(i2,J) ENDDO ENDDOENDDO
30
Unaligned data - visually
D B
DO I = 1, N, S DO J = 1, M DO i2 = I, MIN(I+S-1, N) D(i2) = D(i2) + B(i2,J) ENDDO ENDDOENDDO
Strip size
Cache alignment
Additional miss
31
Unaligned data (cont.)
What can be done?– Enforce data alignment– Refine our score for loop interchange
include these misses as well in the score
DO I = 1, N, S DO J = 1, M DO i2 = I, MIN(I+S-1, N) D(i2) = D(i2) + B(i2,J) ENDDO ENDDOENDDO
32
Blocking - legality
Split to strips - Legal Interchange –
not alwayslegal
procedure StripMineAndInterchange (L, m, k, o, S)// L = {L1, L2, ..., Lm}is the loop nest to be transformed// Lk is the loop to be strip mined// Lo is the outer loop which is to be just inside the by-strip loop// after interchange// S is the variable to use as strip size; it’s value must be positivelet the header of Lk be
DO I = L, N, D;split the loop into two loops, a by-strip loop:
DO I = L, N, S*Dand a within-strip loop:
DO i = I, MAX(I+S*D-D,N), Daround the loop body;
interchange the by-strip loop to the position just outside of Lo;
end StripMineAndInterchange
33
Blocking – a harder example
DO I = 1, N
DO J = 1, M
A(J+1) = (A(J) + A(J+1))/2
ENDDO
ENDDO
Due to dependence, loops can not be interchanged
Statements
Dependencies between statements
34
Blocking – a closer look
DO I = 1, N
DO J = 1, M
A(J+1) = (A(J) + A(J+1))/2
ENDDO
ENDDO
Not computable due to dependencies
Bad performance due to low cache reuse
35
Harder example – skew it
DO I = 1, N
DO j = I, M+I-1
A(j-I+2) = (A(j-I+1) + A(j-I+2))/2
ENDDO
ENDDO
j = 1..M 2..M+1 3..M+2
36
Harder example - Strip it
DO I = 1, N
DO j = I, M+I-1, S
DO jj = j, MAX(j+S-1, M+I-1)
A(jj-I+2) = (A(jj-I+1) + A(jj-I+2))/2
ENDDO
ENDDO
ENDDO
37
Hard example – interchange loops
DO j = 1, M+N-1, S
DO I = MAX(1, j-M+1), MIN(j, N)
DO jj = j, MAX(j+S-1, M+I-1)
A(jj-I+2) = (A(jj-I+1) + A(jj-I+2))/2
ENDDO
ENDDO
ENDDO
38
Harder example - Comparison
NM/b misses (M+N)*(1/b+1/S) misses
39
Triangular blocking
DO I = 2, N
DO J = 1, I-1
A(I, J) = A(I, I) + A(J, J)
ENDDO
ENDDO
Explicitly: 1..2
1...3
1....4
1…..5
40
Triangular – Strip it
DO I = 2, N, K
DO ii = I, I+K-1
DO J = 1, ii – 1
A(ii, J) = A(ii, I) + A(J, J)
ENDDO
ENDDO
ENDDO
K-size strips Nothing important changed yet..
41
Triangular – transform!
DO I = 2, N, K
DO J = 1, I+K-1
DO ii = MAX(J+1, I), I+K-1
A(ii, J) = A(ii, I) + A(J, J)
ENDDO
ENDDO
ENDDO
Triangular loop interchange Working on the k-strips Preserving correct triangular loop limits
42
Blocking with parallelization
Dimension of parallelism is that of the sequential access– Solution: If multiple parallelization dimensions are
available, avoid the stride-one dimension False sharing
– Data used by different processors is on the same cache-line, but not the exact same data
– Solution: Language extension - expressing data division to processors. Memory data layout accordingly
43
Prefetch
44
“…And I don’t want to miss a thing…” - AeroSmith, 98’ Optimization Seminar
Problematic misses:– Data used for the first time– Data re-used in ways that can not be predicted at
compile-time
DO I=1,NA(I) = B(LOC(I))
ENDDO
45
Prefetch
Brings a line into the cache Typically does not cause stall
– Loads the line parallel to continues execution
Introduced by programmer/compiler
46
Prefetch
Advantages– Miss latencies can be avoided
Assuming we can introduce a prefetch far enough Assuming the cache is large enough
Disadvantages– Number of instruction to execute increases– May cause useful data inside cache to be
evacuated prematurely– Data brought by prefetch might be evicted
prematurely.
47
Minimizing disadvantages impact
The number of added prefetches must be close to what is needed
Prefetches should not arrive “too early”
48
Do J=1, M Do I=1, 32 A(I+1,J) = A(I,J) + C(J) ENDDOENDDO Group generator is not contained in a dependence
cycle – a miss is expected on each iteration unless references to
the generator on subsequent iterations display temporal locality
Generate miss on every new cache line Use prefetch to before references to generators
RAW
Identify prefetch opportunities
49
Acyclic name partitioning
Two cases:– references to the generator do not iterate
sequentially within the loop– references have spatial locality within
the loop
Do I=1, 32 Do J=1, M A(I+1,J) = A(I,J) + C(J) ENDDOENDDO
Do J=1, M Do I=1, 32 A(I+1,J) = A(I,J) + C(J) ENDDOENDDO
50
Acyclic name partitioning
Case I: references to the generator do not iterate sequentially within the loop insert prefetch before each reference to the generator
Final positioning of the prefetches will be determined by the instruction scheduler
Do I=1, 32 Do J=1, M A(I+1,J) = A(I,J) + C(J) ENDDOENDDO
Prefetch
51
Case II: references have spatial locality within the loop – Determine i0 of the first iteration after the initial
iteration that causes a miss on the access to the generator
– Determine iteration delta between misses in the cache
Acyclic name partitioning
Do J=1, M Do I=1, 32 A(I+1,J) = A(I,J) + C(J) ENDDOENDDO
pre-
loop
Mai
n lo
op
52
Acyclic with spatial reuse
Partition the loop into two parts– initial subloop running from 1 to io-1
– remainder running from io to the end
In the example, io=4DO I = 1, 3 A(I, J) = A(I, J) + A(I-1, J)ENDDODO I = 4, M A(I, J) = A(I, J) + A(I-1, J)ENDDO
DO I = 1, M A(I, J) = A(I, J) + A(I-1, J)ENDDO
53
Acyclic with spatial reuse
Strip mine the second loop to have subloops of length delta
In the example, delta=4DO I = 1, 3 A(I, J) = A(I, J) + A(I-1, J)ENDDODO I = 4, M, 4 IU = MIN(M, I+4) DO ii = I, IU A(ii, J) = A(ii, J) + A(ii-1, J) ENDDOENDDO
DO I = 1, 3 A(I, J) = A(I, J) + A(I-1, J)ENDDODO I = 4, M A(I, J) = A(I, J) + A(I-1, J)ENDDO
54
Acyclic with spatial reuse
Insert a prefetch before the initial loop Insert prefetches before the inner-loop
prefetch(A(0,J))DO I = 1, 3 A(I, J) = A(I, J) + A(I-1, J)ENDDODO I = 4, M, 4 IU = MIN(M, I+3) prefetch(A(I, J)) DO ii = I, IU A(ii, J) = A(ii, J) + A(ii-1, J) ENDDOENDDO
DO I = 1, 3 A(I, J) = A(I, J) + A(I-1, J)ENDDODO I = 4, M, 4 IU = MIN(M, I+4) DO ii = I, IU A(ii, J) = A(ii, J) + A(ii-1, J) ENDDOENDDO
55
Acyclic with spatial reuse
prefetch(A(0,J))DO I = 1, 3 A(I, J) = A(I, J) + A(I-1, J)ENDDODO I = 4, M, 4 IU = MIN(M, I+3) prefetch(A(I, J)) DO ii = I, IU A(ii, J) = A(ii, J) + A(ii-1, J) ENDDOENDDO
DO I = 1, 3 A(I, J) = A(I, J) + A(I-1, J)ENDDODO I = 4, M, 4 IU = MIN(M, I+4) DO ii = I, IU A(ii, J) = A(ii, J) + A(ii-1, J) ENDDOENDDO
DO I = 1, M A(I, J) = A(I, J) + A(I-1, J)ENDDO
56
Do J=1, M Do I=1, 32 A(I+1,J) = A(I,J) + C(J) ENDDOENDDO
Group generator is contained in a dependence cycle– a miss is expected only on the first few iterations of the
carrying loop prefetch to the reference can be placed before the
loop carrying the dependence
Identify prefetch opportunities
Input Dependence
57
Put it all together
Rearrange the loop nest so that the loop iterating sequentially over cache lines is innermost
Split the innermost loop into two – – Pre-loop to the first iteration of the innermost loop
containing a generator reference beginning on a new cache line and
– Main loop that begins with the iteration containing the new cache reference.
Insert the prefetch, as previously explained
58
Example
DO J = 1, M DO I = 2, 33 A(I, J) = A(I, J) * B(I) ENDDO ENDDO
prefetch(B(2))DO I = 5, 33, 4 prefetch(B(I))ENDDODO J = 1, M DO I = 2, 33 A(I, J) = A(I, J) * B(I) ENDDOENDDO
prefetch(B(2))DO I = 5, 33, 4 prefetch(B(I))ENDDODO J = 1, M prefetch(A(2,J)) DO I = 2, 4 A(I, J) = A(I, J) * B(I) ENDDO DO I = 5, 33, 4 prefetch(A(I, J)) A(I, J) = A(I, J) * B(I) A(I+1, J) = A(I+1, J) * B(I+1) A(I+2, J) = A(I+2, J) * B(I+2) A(I+3, J) = A(I+3, J) * B(I+3) ENDDO prefetch(A(33, J)) A(33, J) = A(33, J) * B(33)ENDDO
59
Effectiveness of prefetching
60
What did we miss?
61
..“Sometimes, is never quite enough..” - Alanis Morissette , 95’ optimization seminar
Static analysis often ineffective – missing information:– Run-time cache miss– Miss address information
62
Profiling based optimization
Dynamic optimization systems– Collect information dynamically– Optimize according to profile
Use collected information for re-compilation, optimizing accordingly.
Use collected information to perform run-time optimization (Code modification on run-time to perform pre-fetch)
– Example: ADORE
63