Embedded Computer Architecture (TU/e 5KK73), Henk Corporaal, Bart Mesman: Loop Transformations
Post on 21-Dec-2015
@HC 5KK73 Embedded Computer Architecture 2
Overview
• Motivation
• Loop level transformations catalogue
  – Loop merging
  – Loop interchange
  – Loop unrolling
  – Unroll-and-Jam
  – Loop tiling
• Loop Transformation Theory and Dependence Analysis
Thanks for many of the slides go to the DTSE people from IMEC and to Dr. Peter Knijnenburg († 2007) of Leiden University.
@HC 5KK73 Embedded Computer Architecture 3
Loop Transformations
• Change the order in which the iteration space is traversed.
• Can expose parallelism, increase available ILP, or improve memory behavior.
• Dependence testing is required to check validity of transformation.
@HC 5KK73 Embedded Computer Architecture 4
Why loop transformations
Example 1: in-place mapping

for (j=1; j<M; j++)
  for (i=0; i<N; i++)
    A[i][j] = f(A[i][j-1]);
for (i=0; i<N; i++)
  OUT[i] = A[i][M-1];

for (i=0; i<N; i++) {
  for (j=1; j<M; j++)
    A[i][j] = f(A[i][j-1]);
  OUT[i] = A[i][M-1];
}

[figure: i-j iteration space of A and the OUT array]
@HC 5KK73 Embedded Computer Architecture 5
Why loop transformations
Example 2: memory allocation

for (i=0; i<N; i++)
  B[i] = f(A[i]);        // N cycles
for (i=0; i<N; i++)
  C[i] = g(B[i]);        // N cycles, 2N cycles total; 2 background ports

for (i=0; i<N; i++) {    // N cycles total;
  B[i] = f(A[i]);        // 1 background + 1 foreground port
  C[i] = g(B[i]);
}

[figure: CPU connected to foreground memory and, via an external memory interface, to background memory]
@HC 5KK73 Embedded Computer Architecture 6
Loop transformation catalogue
• merge (fusion): improves locality
• bump, extend/reduce, body split, reverse: improve regularity
• interchange: improves locality/regularity
• skew, tiling, index split, unroll, unroll-and-jam
@HC 5KK73 Embedded Computer Architecture 7
Loop Merge
• Improve locality
• Reduce loop overhead

for Ia = exp1 to exp2
  A(Ia)
for Ib = exp1 to exp2
  B(Ib)

for I = exp1 to exp2
  A(I)
  B(I)
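The generic merge can be sketched in C; the array names, sizes, and the functions f and g below are illustrative placeholders, not part of the slides:

```c
#include <assert.h>

#define N 8

/* Placeholder "computations"; any pure functions work here. */
static int f(int x) { return 2 * x; }
static int g(int x) { return x + 1; }

/* Separate loops: two traversals; B is written fully before C reads it. */
static void separate(const int A[N], int B[N], int C[N]) {
    for (int i = 0; i < N; i++) B[i] = f(A[i]);
    for (int i = 0; i < N; i++) C[i] = g(B[i]);
}

/* Merged loop: B[i] is consumed right after it is produced,
   improving locality and halving the loop overhead. */
static void merged(const int A[N], int B[N], int C[N]) {
    for (int i = 0; i < N; i++) {
        B[i] = f(A[i]);
        C[i] = g(B[i]);
    }
}
```

Both versions compute identical results; only the traversal order of the iteration space changes.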
@HC 5KK73 Embedded Computer Architecture 8
Loop Merge: Example of locality improvement
Consumptions of the second loop are closer to the productions and consumptions of the first loop.
• Is this always the case after merging loops?

for (i=0; i<N; i++)
  B[i] = f(A[i]);
for (j=0; j<N; j++)
  C[j] = f(B[j],A[j]);

for (i=0; i<N; i++) {
  B[i] = f(A[i]);
  C[i] = f(B[i],A[i]);
}

[figure: production/consumption per memory location over time, before and after merging]
@HC 5KK73 Embedded Computer Architecture 9
Loop Merge not always allowed!
• Data dependencies from the first to the second loop can block Loop Merge.
• Merging is allowed if, for every iteration I, the element consumed in loop 2 at iteration I was produced in loop 1 at an iteration ≤ I.
• Enablers: Bump, Reverse, Skew

for (i=0; i<N; i++)
  B[i] = f(A[i]);
for (i=0; i<N; i++)
  C[i] = g(B[N-1]);      // N-1 >= i: merge blocked

for (i=0; i<N; i++)
  B[i] = f(A[i]);
for (i=2; i<N; i++)
  C[i] = g(B[i-2]);      // i-2 < i: merge allowed
@HC 5KK73 Embedded Computer Architecture 10
Loop Bump

for I = exp1 to exp2
  A(I)

for I = exp1+N to exp2+N
  A(I-N)
@HC 5KK73 Embedded Computer Architecture 11
Loop Bump: Example as enabler

for (i=2; i<N; i++)
  B[i] = f(A[i]);
for (i=0; i<N-2; i++)
  C[i] = g(B[i+2]);

i+2 > i, so direct merging is not possible: we would consume B[i+2] before it is produced.

Loop Bump:
for (i=2; i<N; i++)
  B[i] = f(A[i]);
for (i=2; i<N; i++)
  C[i-2] = g(B[i+2-2]);

i+2-2 = i, so merging is possible.

Loop Merge:
for (i=2; i<N; i++) {
  B[i] = f(A[i]);
  C[i-2] = g(B[i]);
}
@HC 5KK73 Embedded Computer Architecture 12
Loop Extend
Precondition: exp3 ≤ exp1 and exp4 ≥ exp2

for I = exp1 to exp2
  A(I)

for I = exp3 to exp4
  if I ≥ exp1 and I ≤ exp2
    A(I)
@HC 5KK73 Embedded Computer Architecture 13
Loop Extend: Example as enabler

for (i=0; i<N; i++)
  B[i] = f(A[i]);
for (i=2; i<N+2; i++)
  C[i-2] = g(B[i]);

Loop Extend:
for (i=0; i<N+2; i++)
  if (i<N) B[i] = f(A[i]);
for (i=0; i<N+2; i++)
  if (i>=2) C[i-2] = g(B[i]);

Loop Merge:
for (i=0; i<N+2; i++) {
  if (i<N) B[i] = f(A[i]);
  if (i>=2) C[i-2] = g(B[i]);
}
@HC 5KK73 Embedded Computer Architecture 14
Loop Reduce

for I = exp1 to exp2
  if I ≥ exp3 and I ≤ exp4
    A(I)

for I = max(exp1,exp3) to min(exp2,exp4)
  A(I)
@HC 5KK73 Embedded Computer Architecture 15
Loop Body Split
A(I) must be single-assignment: its elements should be written once.

for I = exp1 to exp2
  A(I)
  B(I)

for Ia = exp1 to exp2
  A(Ia)
for Ib = exp1 to exp2
  B(Ib)
@HC 5KK73 Embedded Computer Architecture 16
Loop Body Split: Example as enabler

for (i=0; i<N; i++) {
  A[i] = f(A[i-1]);
  B[i] = g(in[i]);
}
for (j=0; j<N; j++)
  C[j] = h(B[j],A[N]);

Loop Body Split:
for (i=0; i<N; i++)
  A[i] = f(A[i-1]);
for (k=0; k<N; k++)
  B[k] = g(in[k]);
for (j=0; j<N; j++)
  C[j] = h(B[j],A[N]);

Loop Merge:
for (i=0; i<N; i++)
  A[i] = f(A[i-1]);
for (j=0; j<N; j++) {
  B[j] = g(in[j]);
  C[j] = h(B[j],A[N]);
}
@HC 5KK73 Embedded Computer Architecture 17
Loop Reverse

for I = exp1 to exp2
  A(I)

for I = exp2 downto exp1
  A(I)

Or written differently:

for I = exp1 to exp2
  A(exp2-(I-exp1))

[figure: production/consumption per memory location over time, before and after reversal]
@HC 5KK73 Embedded Computer Architecture 18
Loop Reverse: Satisfy dependencies
No loop-carried dependencies allowed! We cannot reverse this loop:

for (i=0; i<N; i++)
  A[i] = f(A[i-1]);

Enabler: data-flow transformations, e.g. exploiting associativity of the + operator:

A[0] = …;
for (i=1; i<=N; i++)
  A[i] = A[i-1] + f(…);
… = A[N];

A[N] = …;
for (i=N-1; i>=0; i--)
  A[i] = A[i+1] + f(…);
… = A[0];
@HC 5KK73 Embedded Computer Architecture 19
Loop Reverse: Example as enabler

for (i=0; i<=N; i++)
  B[i] = f(A[i]);
for (i=0; i<=N; i++)
  C[i] = g(B[N-i]);

N-i > i: merging not possible.

Loop Reverse:
for (i=0; i<=N; i++)
  B[i] = f(A[i]);
for (i=0; i<=N; i++)
  C[N-i] = g(B[N-(N-i)]);

N-(N-i) = i: merging possible.

Loop Merge:
for (i=0; i<=N; i++) {
  B[i] = f(A[i]);
  C[N-i] = g(B[i]);
}
@HC 5KK73 Embedded Computer Architecture 20
Loop Interchange

for I1 = exp1 to exp2
  for I2 = exp3 to exp4
    A(I1, I2)

for I2 = exp3 to exp4
  for I1 = exp1 to exp2
    A(I1, I2)
@HC 5KK73 Embedded Computer Architecture 21
Loop Interchange: index traversal

for(i=0; i<W; i++)
  for(j=0; j<H; j++)
    A[i][j] = …;

for(j=0; j<H; j++)
  for(i=0; i<W; i++)
    A[i][j] = …;

[figure: traversal order in the i-j iteration space, before and after interchange]
@HC 5KK73 Embedded Computer Architecture 22
Loop Interchange
• Validity: check the dependence direction vectors.
• Often used to improve cache behavior.
• The innermost loop (whose index changes fastest) should (only) index the right-most array subscript, in case of row-major storage as in C.
• Can improve execution time by one or two orders of magnitude.
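A hedged C illustration of the row-major rule (array contents and sizes are illustrative; both orders compute the same result, but the j-inner version walks memory with unit stride):

```c
#include <assert.h>

#define H 4
#define W 5

/* Column-wise traversal of a row-major C array: stride W between
   consecutive accesses, hence poor spatial locality. */
static long sum_col_major_order(const int a[H][W]) {
    long s = 0;
    for (int j = 0; j < W; j++)
        for (int i = 0; i < H; i++)
            s += a[i][j];
    return s;
}

/* Interchanged: the innermost index j is the right-most subscript,
   so consecutive accesses are unit-stride. */
static long sum_row_major_order(const int a[H][W]) {
    long s = 0;
    for (int i = 0; i < H; i++)
        for (int j = 0; j < W; j++)
            s += a[i][j];
    return s;
}
```

For small arrays both versions hit the cache equally; the difference shows up once the array exceeds the cache size.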
@HC 5KK73 Embedded Computer Architecture 23
Loop Interchange (cont’d)
• Loop interchange can also expose parallelism.
• If an inner loop does not carry a dependence (its entry in the direction vector equals ‘=’), it can be executed in parallel.
• Moving this loop outwards increases the granularity of the parallel loop iterations.
@HC 5KK73 Embedded Computer Architecture 24
Loop Interchange: Example of locality improvement

for (j=0; j<M; j++)
  for (i=0; i<N; i++)
    B[i][j] = f(A[j],B[i][j-1]);

for (i=0; i<N; i++)
  for (j=0; j<M; j++)
    B[i][j] = f(A[j],B[i][j-1]);

• In the j-outer version: A[j] is reused N times in a row (temporal locality), but spatial locality in B is lost (assuming row-major ordering).
• ⇒ Exploration required
@HC 5KK73 Embedded Computer Architecture 25
Loop Interchange: Satisfy dependencies
• No loop-carried dependencies allowed, unless every production still precedes its consumption after the interchange.
• Enablers:
  – Data-flow transformations
  – Loop Bump
  – Skewing

for (i=0; i<N; i++)
  for (j=0; j<M; j++)
    A[i][j] = f(A[i-1][j+1]);   // dependence (<,>): interchange not allowed

Compare:
for (i=0; i<N; i++)
  for (j=0; j<M; j++)
    A[i][j] = f(A[i-1][j]);     // dependence (<,=): interchange allowed
@HC 5KK73 Embedded Computer Architecture 26
Loop Skew: Basic transformation

for I1 = exp1 to exp2
  for I2 = exp3 to exp4
    A(I1, I2)

for I1 = exp1 to exp2
  for I2 = exp3+α·I1+β to exp4+α·I1+β
    A(I1, I2-α·I1-β)
@HC 5KK73 Embedded Computer Architecture 27
Loop Skew

for(j=0; j<H; j++)
  for(i=0; i<W; i++)
    A[i][j] = ...;

for(j=0; j<H; j++)
  for(i=0+j; i<W+j; i++)
    A[i-j][j] = ...;

[figure: i-j iteration space before and after skewing]
@HC 5KK73 Embedded Computer Architecture 28
Loop Skew: Example as enabler of regularity improvement

for (i=0; i<N; i++)
  for (j=0; j<M; j++)
    … = f(A[i+j]);

Loop Skew:
for (i=0; i<N; i++)
  for (j=i; j<i+M; j++)
    … = f(A[j]);

Loop Extend & Interchange:
for (j=0; j<N+M; j++)
  for (i=0; i<N; i++)
    if (j>=i && j<i+M)
      … = f(A[j]);
@HC 5KK73 Embedded Computer Architecture 29
Loop skewing as enabler for Loop Interchange

for (i=0; i<N; i++)
  for (j=0; j<M; j++)
    A[i+1][j] = f(A[i][j+1]);

[figure: i-j iteration spaces showing the dependence pattern of the original nest, after loop skewing, and after loop interchange]
@HC 5KK73 Embedded Computer Architecture 30
Loop Tiling
Tile size: exp1
Tile factor: exp2

for I = 0 to exp1·exp2
  A(I)

for I1 = 0 to exp2
  for I2 = exp1·I1 to exp1·(I1+1)
    A(I2)
@HC 5KK73 Embedded Computer Architecture 31
Loop Tiling (1-D example)

for(i=0; i<9; i++)
  A[i] = ...;

for(j=0; j<3; j++)
  for(i=4*j; i<4*j+4; i++)
    if (i<9)
      A[i] = ...;

[figure: 1-D iteration space folded into tiles of size 4]
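The 1-D tiling above can be checked mechanically; this sketch (tile size 4, 9 iterations, as on the slide) verifies that both forms touch the same indices in the same order:

```c
#include <assert.h>

/* Original loop: record the visit order of indices 0..8. */
static int visit_orig(int out[9]) {
    int n = 0;
    for (int i = 0; i < 9; i++)
        out[n++] = i;
    return n;
}

/* Tiled loop: 3 tiles of size 4; the guard i<9 handles the
   partial last tile (indices 9..11 are skipped). */
static int visit_tiled(int out[9]) {
    int n = 0;
    for (int j = 0; j < 3; j++)
        for (int i = 4 * j; i < 4 * j + 4; i++)
            if (i < 9)
                out[n++] = i;
    return n;
}
```

Because the tiled form visits exactly the same indices in the same order, the transformation is trivially valid here; with 2-D tiles the order changes, which is why legality needs the interchange test.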
@HC 5KK73 Embedded Computer Architecture 32
Loop Tiling
• Improve cache reuse by dividing the (2 or higher dimensional) iteration space into tiles and iterating over these tiles.
• Only useful when the working set does not fit into the cache or when there is much interference.
• Two adjacent loops can legally be tiled if they can legally be interchanged.
@HC 5KK73 Embedded Computer Architecture 33
2-D Tiling Example

for(i=0; i<N; i++)
  for(j=0; j<N; j++)
    A[i][j] = B[j][i];

for(TI=0; TI<N; TI+=16)
  for(TJ=0; TJ<N; TJ+=16)
    for(i=TI; i<min(TI+16,N); i++)      // inside
      for(j=TJ; j<min(TJ+16,N); j++)    // a tile
        A[i][j] = B[j][i];
@HC 5KK73 Embedded Computer Architecture 35
What is the best Tile Size?
• Current tile size selection algorithms use a cache model:
  – Generate a collection of tile sizes;
  – Estimate the resulting cache miss rate;
  – Select the best one.
• They only take the L1 cache into account.
• They mostly do not take n-way associativity into account.
@HC 5KK73 Embedded Computer Architecture 36
Loop Index Split

for Ia = exp1 to exp2
  A(Ia)

for Ia = exp1 to p
  A(Ia)
for Ib = p+1 to exp2
  A(Ib)
@HC 5KK73 Embedded Computer Architecture 37
Loop Unrolling
• Duplicate loop body and adjust loop header.
• Increases available ILP, reduces loop overhead, and increases possibilities for common subexpression elimination.
• Always valid !!
@HC 5KK73 Embedded Computer Architecture 38
(Partial) Loop Unrolling

for I = exp1 to exp2
  A(I)

Full unrolling:
A(exp1)
A(exp1+1)
A(exp1+2)
…
A(exp2)

Partial unrolling (factor 2):
for I = exp1/2 to exp2/2
  A(2I)
  A(2I+1)
@HC 5KK73 Embedded Computer Architecture 39
Loop Unrolling: Downside
• If unroll factor is not divisor of trip count, then need to add remainder loop.
• If trip count not known at compile time, need to check at runtime.
• Code size increases which may result in higher I-cache miss rate.
• Global determination of optimal unroll factors is difficult.
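A sketch of unrolling by 4 with a remainder loop, for trip counts not known to be multiples of 4 (the summation body is an illustrative stand-in for any loop body):

```c
#include <assert.h>

/* Rolled reference version. */
static long sum_rolled(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by 4: the main loop runs while at least 4 iterations
   remain; the remainder loop finishes the last n % 4 iterations. */
static long sum_unrolled4(const int *a, int n) {
    long s = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }
    for (; i < n; i++)   /* remainder loop */
        s += a[i];
    return s;
}
```

If n is a runtime value, the `i + 4 <= n` guard is exactly the runtime check the slide refers to.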
@HC 5KK73 Embedded Computer Architecture 40
Loop Unroll: Example of removal of a non-affine iterator

for (L=1; L < 4; L++)
  for (i=0; i < (1<<L); i++)
    A(L,i);

for (i=0; i < 2; i++) A(1,i);
for (i=0; i < 4; i++) A(2,i);
for (i=0; i < 8; i++) A(3,i);
@HC 5KK73 Embedded Computer Architecture 41
Unroll-and-Jam
• Unroll the outer loop and fuse the new copies of the inner loop.
• Increases the size of the loop body and hence the available ILP.
• Can also improve locality.
@HC 5KK73 Embedded Computer Architecture 42
Unroll-and-Jam Example
• More ILP exposed
• Spatial locality of B enhanced

for (i=0;i<N;i++)
  for (j=0;j<N;j++)
    A[i][j] = B[j][i];

Unroll the outer loop:
for (i=0; i<N; i+=2) {
  for (j=0; j<N; j++)
    A[i][j] = B[j][i];
  for (j=0; j<N; j++)
    A[i+1][j] = B[j][i+1];
}

Jam (fuse) the inner-loop copies:
for (i=0; i<N; i+=2)
  for (j=0; j<N; j++) {
    A[i][j] = B[j][i];
    A[i+1][j] = B[j][i+1];
  }
@HC 5KK73 Embedded Computer Architecture 43
Simplified loop transformation script
• Give all loops the same nesting depth
  – Use dummy 1-iteration loops if necessary
• Improve regularity
  – Usually by applying loop interchange or reverse
• Improve locality
  – Usually by applying loop merge
  – Break data dependencies with loop bump/skew
  – Sometimes loop index split or unrolling is easier
@HC 5KK73 Embedded Computer Architecture 44
Loop transformation theory
• Iteration spaces
• Polyhedral model
• Dependency analysis
@HC 5KK73 Embedded Computer Architecture 45
Technical Preliminaries (1)

do i = 2, N
  do j = i, N
    x[i,j] = 0.5*(x[i-1,j] + x[i,j-1])
  enddo
enddo

• perfect loop nest
• iteration space: all (i,j) with 2 ≤ i ≤ N and i ≤ j ≤ N
• dependence vectors: (1,0) and (0,1)

[figure: triangular i-j iteration space with the dependence vectors drawn at each point]
@HC 5KK73 Embedded Computer Architecture 46
Technical Preliminaries (2)
Switch loop indexes:

do m = 2, N
  do n = 2, m
    x[n,m] = 0.5*(x[n-1,m] + x[n,m-1])
  enddo
enddo

This is an affine transformation of the iteration space:

  (m)   (0 1) (i)
  (n) = (1 0) (j)

The dependence vectors (1,0) and (0,1) are mapped onto each other by the same matrix.

[figure: transformed m-n iteration space]
@HC 5KK73 Embedded Computer Architecture 47
Polyhedral Model
• A polyhedron is a set {x : Ax ≤ c} for some matrix A and bounds vector c.
• Polyhedra (or polytopes) are objects in a many-dimensional space without holes.
• Iteration spaces of loops (with unit stride) can be represented as polyhedra.
• Array accesses and loop transformations can be represented as matrices.
@HC 5KK73 Embedded Computer Architecture 48
Iteration Space
• A loop nest is represented as B·I ≤ b for iteration vector I.
• Example:

for(i=0; i<10; i++)
  for(j=i; j<10; j++)
    ...

  [ -1  0 ] ( i )   ( 0 )
  [  1  0 ] ( j ) ≤ ( 9 )
  [  1 -1 ]         ( 0 )
  [  0  1 ]         ( 9 )
@HC 5KK73 Embedded Computer Architecture 49
Array Accesses
• Any array access A[e1][e2] for linear index expressions e1 and e2 can be represented as an access matrix A and offset vector a: the access maps iteration vector I to A·I + a.
• This can be considered as a mapping from the iteration space into the storage space of the array (which is a trivial polyhedron).
@HC 5KK73 Embedded Computer Architecture 50
Unimodular Matrices
• A unimodular matrix T is a matrix with integer entries and determinant ±1.
• This means that such a matrix maps an object onto another object with exactly the same number of integer points in it.
• Its inverse T⁻¹ always exists and is unimodular as well.
@HC 5KK73 Embedded Computer Architecture 51
Types of Unimodular Transformations
• Loop interchange
• Loop reversal
• Loop skewing for arbitrary skew factor
• Since unimodular transformations are closed under multiplication, any combination is a unimodular transformation again.
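For 2×2 matrices the unimodularity check reduces to an integer determinant; a minimal sketch covering the three transformations listed above (the skew factor 5 is an arbitrary example, since any integer skew factor gives determinant 1):

```c
#include <assert.h>
#include <stdlib.h>

/* Integer determinant of a 2x2 matrix m = [[a,b],[c,d]]. */
static int det2(const int m[2][2]) {
    return m[0][0] * m[1][1] - m[0][1] * m[1][0];
}

/* A matrix is unimodular iff its determinant is +1 or -1. */
static int is_unimodular2(const int m[2][2]) {
    return abs(det2(m)) == 1;
}
```

Composing these matrices (interchange, reversal, skew) by multiplication keeps the determinant at ±1, which is the closure property the slide states.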
@HC 5KK73 Embedded Computer Architecture 52
Application
• The transformed loop nest is given by A·T⁻¹·I’ ≤ a.
• Any array access matrix is likewise transformed into A·T⁻¹.
• The transformed loop nest needs to be normalized by means of Fourier-Motzkin elimination, to ensure that the loop bounds are affine expressions in the more outer loop indices.
@HC 5KK73 Embedded Computer Architecture 53
Dependence Analysis

Consider the following statements:

S1: a = b + c;
S2: d = a + f;
S3: a = g + h;

• S1 → S2: true or flow dependence = RaW
• S2 → S3: anti-dependence = WaR
• S1 → S3: output dependence = WaW
@HC 5KK73 Embedded Computer Architecture 54
Dependences in Loops
• Consider the following loop:

for(i=0; i<N; i++){
S1:  a[i] = …;
S2:  b[i] = a[i-1];
}

• Loop-carried dependence S1 → S2.
• Need to detect whether there exist i and i’ such that i = i’-1 in the loop space.
@HC 5KK73 Embedded Computer Architecture 55
Definition of Dependence
• There exists a dependence if there are two statement instances that refer to the same memory location and (at least) one of them is a write.
• There should not be an intervening write between these two statement instances.
• In general, it is undecidable whether there exists a dependence.
@HC 5KK73 Embedded Computer Architecture 56
Direction of Dependence
• If there is a flow dependence between two statements S1 and S2 in a loop, then S1 writes to a variable in an earlier iteration than the one in which S2 reads that variable.
• The iteration vector of the write is lexicographically less than the iteration vector of the read.
• I ≺ I’ iff i1 = i1’ ∧ … ∧ i(k-1) = i(k-1)’ ∧ ik < ik’ for some k.
@HC 5KK73 Embedded Computer Architecture 57
Direction Vectors
• A direction vector is a vector
  (=,=,…,=,<,*,*,…,*)
  where * can denote =, <, or >.
• Such a vector encodes a (collection of) dependences.
• A loop transformation should result in a new direction vector for the dependence that is still lexicographically positive.
@HC 5KK73 Embedded Computer Architecture 58
Loop Interchange
• Interchanging two loops also interchanges the corresponding entries in a direction vector.
• Example: if direction vector of a dependence is (<,>) then we may not interchange the loops because the resulting direction would be (>,<) which is lexicographically negative.
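A small sketch of this legality check: swap the first two entries of a direction vector and test lexicographic positivity. The integer encoding (-1 for ‘<’, 0 for ‘=’, +1 for ‘>’) is an assumption for illustration, not the slides’ notation:

```c
#include <assert.h>

/* Direction entries encoded as: -1 for '<', 0 for '=', +1 for '>'. */

/* A dependence is lexicographically positive if its first
   nonzero entry is '<' (-1 in this encoding). The all-'='
   (loop-independent) case is treated separately in practice. */
static int lex_positive(const int d[], int n) {
    for (int i = 0; i < n; i++) {
        if (d[i] < 0) return 1;
        if (d[i] > 0) return 0;
    }
    return 0;
}

/* Interchanging the two loops swaps the two entries; the
   interchange is legal only if the result stays positive. */
static int interchange_legal(const int d[2]) {
    int swapped[2] = { d[1], d[0] };
    return lex_positive(swapped, 2);
}
```

On the slide’s example, (<,>) swaps to (>,<), which is lexicographically negative, so the check rejects the interchange.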
@HC 5KK73 Embedded Computer Architecture 59
Affine Bounds and Indices
• We assume loop bounds and array index expressions are affine expressions:
  a0 + a1·i1 + … + ak·ik
• Each loop bound for loop index ik is an affine expression over the previous loop indices i1 to i(k-1).
• Each array index expression is an affine expression over all loop indices.
@HC 5KK73 Embedded Computer Architecture 60
Non-Affine Expressions
• Index expressions like i*j cannot be handled by dependence tests. We must assume that there exists a dependence.
• An important class of index expressions are indirections A[B[i]]. These occur frequently in scientific applications (sparse matrix computations).
• In embedded applications???
@HC 5KK73 Embedded Computer Architecture 61
Linear Diophantine Equations
• A linear Diophantine equation is of the form
  a1·x1 + a2·x2 + … + an·xn = c
• The equation has an integer solution iff gcd(a1,…,an) is a divisor of c.
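The solvability condition can be coded directly; this sketch handles the two-variable case for brevity:

```c
#include <assert.h>
#include <stdlib.h>

/* Euclid's algorithm on absolute values. */
static int gcd(int a, int b) {
    a = abs(a);
    b = abs(b);
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* a*x + b*y = c has an integer solution iff gcd(a,b) divides c. */
static int diophantine_solvable(int a, int b, int c) {
    int g = gcd(a, b);
    if (g == 0)            /* a = b = 0: solvable only if c = 0 */
        return c == 0;
    return c % g == 0;
}
```

For example, 6x + 4y = 10 is solvable (gcd 2 divides 10, e.g. x = 1, y = 1), while 6x + 4y = 7 is not.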
@HC 5KK73 Embedded Computer Architecture 62
GCD Test for Dependence
• Assume a single loop and two references A[a+b·i] and A[c+d·i].
• If there exists a dependence, then gcd(b,d) divides (c-a).
• Note the direction of the implication!
• If gcd(b,d) does not divide (c-a), then there exists no dependence.
@HC 5KK73 Embedded Computer Architecture 63
GCD Test (cont’d)
• However, if gcd(b,d) does divide (c-a), then there might exist a dependence.
• The test is not exact, since it does not take the loop bounds into account.
• For example:

for(i=0; i<10; i++)
  A[i] = A[i+10] + 1;

Here gcd(1,1) = 1 divides 10, yet the bounds rule out any dependence.
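The GCD test itself is a one-liner on top of gcd; on the slide’s example it fails to disprove the (nonexistent) dependence, exactly as described:

```c
#include <assert.h>
#include <stdlib.h>

static int gcd(int a, int b) {
    a = abs(a);
    b = abs(b);
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* References A[a + b*i] and A[c + d*i'] in a single loop:
   returns 1 if a dependence is POSSIBLE (gcd(b,d) divides c-a),
   0 if a dependence is definitely absent. Loop bounds are
   ignored, so a result of 1 does not prove a dependence. */
static int gcd_test_may_depend(int a, int b, int c, int d) {
    int g = gcd(b, d);
    int diff = c - a;
    if (g == 0)
        return diff == 0;
    return diff % g == 0;
}
```

A case where the test does disprove: A[2i] vs. A[2i+1] (even vs. odd locations), since gcd(2,2) = 2 does not divide 1.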
@HC 5KK73 Embedded Computer Architecture 64
GCD Test (cont’d)
• Using the theorem on linear Diophantine equations, we can test in arbitrary loop nests.
• We need one test for each direction vector.
• The vector (=,=,…,=,<,…) implies that the first k indices are the same.
• See the book by Zima for details.
@HC 5KK73 Embedded Computer Architecture 65
Other Dependence Tests
• There exist many dependence tests:
  – Separability test
  – GCD test
  – Banerjee test
  – Range test
  – Fourier-Motzkin test
  – Omega test
• Exactness increases, but so does the cost.
@HC 5KK73 Embedded Computer Architecture 66
Fourier-Motzkin Elimination
• Consider a collection of linear inequalities over the variables i1,…,in:

e1(i1,…,in) ≤ e1’(i1,…,in)
…
em(i1,…,in) ≤ em’(i1,…,in)

• Is this system consistent, i.e., does there exist a solution?
• FM-elimination can determine this.
@HC 5KK73 Embedded Computer Architecture 67
FM-Elimination (cont’d)
• First, rewrite every inequality involving in as a lower bound L(i1,…,i(n-1)) ≤ in or an upper bound in ≤ U(i1,…,i(n-1)).
• Then create a new system containing, for every such pair,

L(i1,…,i(n-1)) ≤ U(i1,…,i(n-1))

together with all original inequalities not involving in.
• This new system has one variable less, and we continue this way.
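A tiny sketch of one elimination step for a single variable x with constant bounds: pair every lower bound with every upper bound and check L ≤ U. (Real FM-elimination carries affine expressions in the remaining variables; constants suffice to show the mechanics.)

```c
#include <assert.h>

/* Given lower bounds L[k] <= x and upper bounds x <= U[m],
   eliminating x yields the cross-product constraints L[k] <= U[m].
   The system is consistent (over the reals) iff all of them hold. */
static int fm_step_consistent(const int *L, int nl,
                              const int *U, int nu) {
    for (int k = 0; k < nl; k++)
        for (int m = 0; m < nu; m++)
            if (L[k] > U[m])
                return 0;   /* e.g. 10 <= 3: inconsistent */
    return 1;
}
```

The nl × nu pairing is also where the exponential blow-up mentioned on the next slide comes from.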
@HC 5KK73 Embedded Computer Architecture 68
FM-Elimination (cont’d)
• After eliminating i1, we end up with a collection of inequalities between constants, c ≤ c’.
• The original system is consistent iff every such inequality can be satisfied, i.e., there is no inequality like 10 ≤ 3.
• Exponentially many new inequalities may be generated!
@HC 5KK73 Embedded Computer Architecture 69
Fourier-Motzkin Test
• Loop bounds plus array index expressions generate sets of inequalities, using new loop indices i’ for the sink of the dependence.
• Each direction vector generates the inequalities
  i1 = i1’ ∧ … ∧ i(k-1) = i(k-1)’ ∧ ik < ik’
• If all these systems are inconsistent, then there exists no dependence.
• This test is not exact (there may be real solutions but no integer ones), but almost.
@HC 5KK73 Embedded Computer Architecture 70
N-Dimensional Arrays
• Test in each dimension separately.
• This can introduce another level of inaccuracy.
• Some tests (FM and Omega test) can test in many dimensions at the same time.
• Otherwise, you can linearize an array: Transform a logically N-dimensional array to its one-dimensional storage format.
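Linearization maps a logical 2-D index to its 1-D row-major offset; a minimal sketch (the cols parameter is an illustrative array width):

```c
#include <assert.h>

/* Row-major linearization of a logical [rows][cols] array:
   element (i,j) lives at flat offset i*cols + j. */
static int linearize(int i, int j, int cols) {
    return i * cols + j;
}

/* Inverse mapping, recovering (i,j) from the flat offset. */
static void delinearize(int off, int cols, int *i, int *j) {
    *i = off / cols;
    *j = off % cols;
}
```

After linearization, a single dependence test on the flat index replaces the per-dimension tests, at the cost of more complex index expressions.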
@HC 5KK73 Embedded Computer Architecture 71
Hierarchy of Tests
• Try a cheap test first, then more expensive ones:

if (cheap test1 = NO)
  then print ‘NO’
else if (test2 = NO)
  then print ‘NO’
else if (test3 = NO)
  then print ‘NO’
else …
@HC 5KK73 Embedded Computer Architecture 72
Practical Dependence Testing
• Cheap tests, like GCD and Banerjee tests, can disprove many dependences.
• Adding expensive tests only disproves a few more possible dependences.
• A compiler writer needs to trade off compilation time against the accuracy of dependence testing.
• For time critical applications, expensive tests like Omega test (exact!) can be used.