Embedded Computer Architecture (TU/e 5KK73), Henk Corporaal, Bart Mesman: Loop Transformations
Post on 21-Dec-2015
@HC 5KK73 Embedded Computer Architecture 2
Overview
• Motivation
• Loop level transformations catalogue
  – Loop merging
  – Loop interchange
  – Loop unrolling
  – Unroll-and-Jam
  – Loop tiling
• Loop Transformation Theory and Dependence Analysis
Thanks for many of the slides go to the DTSE people from IMEC and to Dr. Peter Knijnenburg († 2007) of Leiden University.
@HC 5KK73 Embedded Computer Architecture 3
Loop Transformations
• Change the order in which the iteration space is traversed.
• Can expose parallelism, increase available ILP, or improve memory behavior.
• Dependence testing is required to check validity of transformation.
@HC 5KK73 Embedded Computer Architecture 4
Why loop transformations
Example 1: in-place mapping

for (j=1; j<M; j++)
  for (i=0; i<N; i++)
    A[i][j] = f(A[i][j-1]);
for (i=0; i<N; i++)
  OUT[i] = A[i][M-1];

for (i=0; i<N; i++) {
  for (j=1; j<M; j++)
    A[i][j] = f(A[i][j-1]);
  OUT[i] = A[i][M-1];
}

[figure: i-j iteration space of A and the OUT array]
@HC 5KK73 Embedded Computer Architecture 5
Why loop transformations
Example 2: memory allocation

for (i=0; i<N; i++)
  B[i] = f(A[i]);        // N cycles
for (i=0; i<N; i++)
  C[i] = g(B[i]);        // N cycles, 2N cycles total; 2 background ports

for (i=0; i<N; i++) {    // N cycles total;
  B[i] = f(A[i]);        // 1 background + 1 foreground port
  C[i] = g(B[i]);
}

[figure: CPU connected to foreground memory and, via an external memory interface, to background memory]
@HC 5KK73 Embedded Computer Architecture 6
Loop transformation catalogue
• merge (fusion): improves locality
• bump, extend/reduce, body split, reverse: improve regularity
• interchange: improves locality/regularity
• skew, tiling, index split, unroll, unroll-and-jam
@HC 5KK73 Embedded Computer Architecture 7
Loop Merge
• Improve locality
• Reduce loop overhead

for Ia = exp1 to exp2
  A(Ia)
for Ib = exp1 to exp2
  B(Ib)

for I = exp1 to exp2
  A(I)
  B(I)
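The generic merge can be sketched in C; the array names, sizes, and the functions f and g below are illustrative placeholders, not part of the slides:

```c
#include <assert.h>

#define N 8

/* Placeholder "computations"; any pure functions work here. */
static int f(int x) { return 2 * x; }
static int g(int x) { return x + 1; }

/* Separate loops: two traversals; B is written fully before C reads it. */
static void separate(const int A[N], int B[N], int C[N]) {
    for (int i = 0; i < N; i++) B[i] = f(A[i]);
    for (int i = 0; i < N; i++) C[i] = g(B[i]);
}

/* Merged loop: B[i] is consumed right after it is produced,
   improving locality and halving the loop overhead. */
static void merged(const int A[N], int B[N], int C[N]) {
    for (int i = 0; i < N; i++) {
        B[i] = f(A[i]);
        C[i] = g(B[i]);
    }
}
```

Both versions compute identical results; only the traversal order of the iteration space changes.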
@HC 5KK73 Embedded Computer Architecture 8
Loop Merge: Example of locality improvement
Consumptions of the second loop are closer to the productions and consumptions of the first loop.
• Is this always the case after merging loops?

for (i=0; i<N; i++)
  B[i] = f(A[i]);
for (j=0; j<N; j++)
  C[j] = f(B[j],A[j]);

for (i=0; i<N; i++) {
  B[i] = f(A[i]);
  C[i] = f(B[i],A[i]);
}

[figure: production/consumption per memory location over time, before and after merging]
@HC 5KK73 Embedded Computer Architecture 9
Loop Merge not always allowed!
• Data dependencies from the first to the second loop can block Loop Merge.
• Merging is allowed if, for every iteration I, the element consumed in loop 2 at iteration I was produced in loop 1 at an iteration ≤ I.
• Enablers: Bump, Reverse, Skew

for (i=0; i<N; i++)
  B[i] = f(A[i]);
for (i=0; i<N; i++)
  C[i] = g(B[N-1]);      // N-1 >= i: merge blocked

for (i=0; i<N; i++)
  B[i] = f(A[i]);
for (i=2; i<N; i++)
  C[i] = g(B[i-2]);      // i-2 < i: merge allowed
@HC 5KK73 Embedded Computer Architecture 10
Loop Bump

for I = exp1 to exp2
  A(I)

for I = exp1+N to exp2+N
  A(I-N)
@HC 5KK73 Embedded Computer Architecture 11
Loop Bump: Example as enabler

for (i=2; i<N; i++)
  B[i] = f(A[i]);
for (i=0; i<N-2; i++)
  C[i] = g(B[i+2]);

i+2 > i, so direct merging is not possible: we would consume B[i+2] before it is produced.

Loop Bump:
for (i=2; i<N; i++)
  B[i] = f(A[i]);
for (i=2; i<N; i++)
  C[i-2] = g(B[i+2-2]);

i+2-2 = i, so merging is possible.

Loop Merge:
for (i=2; i<N; i++) {
  B[i] = f(A[i]);
  C[i-2] = g(B[i]);
}
@HC 5KK73 Embedded Computer Architecture 12
Loop Extend
Precondition: exp3 ≤ exp1 and exp4 ≥ exp2

for I = exp1 to exp2
  A(I)

for I = exp3 to exp4
  if I ≥ exp1 and I ≤ exp2
    A(I)
@HC 5KK73 Embedded Computer Architecture 13
Loop Extend: Example as enabler

for (i=0; i<N; i++)
  B[i] = f(A[i]);
for (i=2; i<N+2; i++)
  C[i-2] = g(B[i]);

Loop Extend:
for (i=0; i<N+2; i++)
  if (i<N) B[i] = f(A[i]);
for (i=0; i<N+2; i++)
  if (i>=2) C[i-2] = g(B[i]);

Loop Merge:
for (i=0; i<N+2; i++) {
  if (i<N) B[i] = f(A[i]);
  if (i>=2) C[i-2] = g(B[i]);
}
@HC 5KK73 Embedded Computer Architecture 14
Loop Reduce

for I = exp1 to exp2
  if I ≥ exp3 and I ≤ exp4
    A(I)

for I = max(exp1,exp3) to min(exp2,exp4)
  A(I)
@HC 5KK73 Embedded Computer Architecture 15
Loop Body Split
A(I) must be single-assignment: its elements should be written once.

for I = exp1 to exp2
  A(I)
  B(I)

for Ia = exp1 to exp2
  A(Ia)
for Ib = exp1 to exp2
  B(Ib)
@HC 5KK73 Embedded Computer Architecture 16
Loop Body Split: Example as enabler

for (i=0; i<N; i++) {
  A[i] = f(A[i-1]);
  B[i] = g(in[i]);
}
for (j=0; j<N; j++)
  C[j] = h(B[j],A[N]);

Loop Body Split:
for (i=0; i<N; i++)
  A[i] = f(A[i-1]);
for (k=0; k<N; k++)
  B[k] = g(in[k]);
for (j=0; j<N; j++)
  C[j] = h(B[j],A[N]);

Loop Merge:
for (i=0; i<N; i++)
  A[i] = f(A[i-1]);
for (j=0; j<N; j++) {
  B[j] = g(in[j]);
  C[j] = h(B[j],A[N]);
}
@HC 5KK73 Embedded Computer Architecture 17
Loop Reverse

for I = exp1 to exp2
  A(I)

for I = exp2 downto exp1
  A(I)

Or written differently:

for I = exp1 to exp2
  A(exp2-(I-exp1))

[figure: production/consumption per memory location over time, before and after reversal]
@HC 5KK73 Embedded Computer Architecture 18
Loop Reverse: Satisfy dependencies
No loop-carried dependencies allowed! We cannot reverse this loop:

for (i=0; i<N; i++)
  A[i] = f(A[i-1]);

Enabler: data-flow transformations, e.g. exploiting associativity of the + operator:

A[0] = …;
for (i=1; i<=N; i++)
  A[i] = A[i-1] + f(…);
… = A[N];

A[N] = …;
for (i=N-1; i>=0; i--)
  A[i] = A[i+1] + f(…);
… = A[0];
@HC 5KK73 Embedded Computer Architecture 19
Loop Reverse: Example as enabler

for (i=0; i<=N; i++)
  B[i] = f(A[i]);
for (i=0; i<=N; i++)
  C[i] = g(B[N-i]);

N-i > i: merging not possible.

Loop Reverse:
for (i=0; i<=N; i++)
  B[i] = f(A[i]);
for (i=0; i<=N; i++)
  C[N-i] = g(B[N-(N-i)]);

N-(N-i) = i: merging possible.

Loop Merge:
for (i=0; i<=N; i++) {
  B[i] = f(A[i]);
  C[N-i] = g(B[i]);
}
@HC 5KK73 Embedded Computer Architecture 20
Loop Interchange

for I1 = exp1 to exp2
  for I2 = exp3 to exp4
    A(I1, I2)

for I2 = exp3 to exp4
  for I1 = exp1 to exp2
    A(I1, I2)
@HC 5KK73 Embedded Computer Architecture 21
Loop Interchange: index traversal

for(i=0; i<W; i++)
  for(j=0; j<H; j++)
    A[i][j] = …;

for(j=0; j<H; j++)
  for(i=0; i<W; i++)
    A[i][j] = …;

[figure: traversal order in the i-j iteration space, before and after interchange]
@HC 5KK73 Embedded Computer Architecture 22
Loop Interchange
• Validity: check the dependence direction vectors.
• Often used to improve cache behavior.
• The innermost loop (whose index changes fastest) should (only) index the right-most array subscript, in case of row-major storage as in C.
• Can improve execution time by one or two orders of magnitude.
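A hedged C illustration of the row-major rule (array contents and sizes are illustrative; both orders compute the same result, but the j-inner version walks memory with unit stride):

```c
#include <assert.h>

#define H 4
#define W 5

/* Column-wise traversal of a row-major C array: stride W between
   consecutive accesses, hence poor spatial locality. */
static long sum_col_major_order(const int a[H][W]) {
    long s = 0;
    for (int j = 0; j < W; j++)
        for (int i = 0; i < H; i++)
            s += a[i][j];
    return s;
}

/* Interchanged: the innermost index j is the right-most subscript,
   so consecutive accesses are unit-stride. */
static long sum_row_major_order(const int a[H][W]) {
    long s = 0;
    for (int i = 0; i < H; i++)
        for (int j = 0; j < W; j++)
            s += a[i][j];
    return s;
}
```

For small arrays both versions hit the cache equally; the difference shows up once the array exceeds the cache size.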
@HC 5KK73 Embedded Computer Architecture 23
Loop Interchange (cont’d)
• Loop interchange can also expose parallelism.
• If an inner loop does not carry a dependence (its entry in the direction vector equals ‘=’), it can be executed in parallel.
• Moving this loop outwards increases the granularity of the parallel loop iterations.
@HC 5KK73 Embedded Computer Architecture 24
Loop Interchange: Example of locality improvement

for (j=0; j<M; j++)
  for (i=0; i<N; i++)
    B[i][j] = f(A[j],B[i][j-1]);

for (i=0; i<N; i++)
  for (j=0; j<M; j++)
    B[i][j] = f(A[j],B[i][j-1]);

• In the j-outer version: A[j] is reused N times in a row (temporal locality), but spatial locality in B is lost (assuming row-major ordering).
• ⇒ Exploration required
@HC 5KK73 Embedded Computer Architecture 25
Loop Interchange: Satisfy dependencies
• No loop-carried dependencies allowed, unless every production still precedes its consumption after the interchange.
• Enablers:
  – Data-flow transformations
  – Loop Bump
  – Skewing

for (i=0; i<N; i++)
  for (j=0; j<M; j++)
    A[i][j] = f(A[i-1][j+1]);   // dependence (<,>): interchange not allowed

Compare:
for (i=0; i<N; i++)
  for (j=0; j<M; j++)
    A[i][j] = f(A[i-1][j]);     // dependence (<,=): interchange allowed
@HC 5KK73 Embedded Computer Architecture 26
Loop Skew: Basic transformation

for I1 = exp1 to exp2
  for I2 = exp3 to exp4
    A(I1, I2)

for I1 = exp1 to exp2
  for I2 = exp3+α·I1+β to exp4+α·I1+β
    A(I1, I2-α·I1-β)
@HC 5KK73 Embedded Computer Architecture 27
Loop Skew

for(j=0; j<H; j++)
  for(i=0; i<W; i++)
    A[i][j] = ...;

for(j=0; j<H; j++)
  for(i=0+j; i<W+j; i++)
    A[i-j][j] = ...;

[figure: i-j iteration space before and after skewing]
@HC 5KK73 Embedded Computer Architecture 28
Loop Skew: Example as enabler of regularity improvement

for (i=0; i<N; i++)
  for (j=0; j<M; j++)
    … = f(A[i+j]);

Loop Skew:
for (i=0; i<N; i++)
  for (j=i; j<i+M; j++)
    … = f(A[j]);

Loop Extend & Interchange:
for (j=0; j<N+M; j++)
  for (i=0; i<N; i++)
    if (j>=i && j<i+M)
      … = f(A[j]);
@HC 5KK73 Embedded Computer Architecture 29
Loop skewing as enabler for Loop Interchange

for (i=0; i<N; i++)
  for (j=0; j<M; j++)
    A[i+1][j] = f(A[i][j+1]);

[figure: i-j iteration spaces showing the dependence pattern of the original nest, after loop skewing, and after loop interchange]
@HC 5KK73 Embedded Computer Architecture 30
Loop Tiling
Tile size: exp1
Tile factor: exp2

for I = 0 to exp1·exp2
  A(I)

for I1 = 0 to exp2
  for I2 = exp1·I1 to exp1·(I1+1)
    A(I2)
@HC 5KK73 Embedded Computer Architecture 31
Loop Tiling (1-D example)

for(i=0; i<9; i++)
  A[i] = ...;

for(j=0; j<3; j++)
  for(i=4*j; i<4*j+4; i++)
    if (i<9)
      A[i] = ...;

[figure: 1-D iteration space folded into tiles of size 4]
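The 1-D tiling above can be checked mechanically; this sketch (tile size 4, 9 iterations, as on the slide) verifies that both forms touch the same indices in the same order:

```c
#include <assert.h>

/* Original loop: record the visit order of indices 0..8. */
static int visit_orig(int out[9]) {
    int n = 0;
    for (int i = 0; i < 9; i++)
        out[n++] = i;
    return n;
}

/* Tiled loop: 3 tiles of size 4; the guard i<9 handles the
   partial last tile (indices 9..11 are skipped). */
static int visit_tiled(int out[9]) {
    int n = 0;
    for (int j = 0; j < 3; j++)
        for (int i = 4 * j; i < 4 * j + 4; i++)
            if (i < 9)
                out[n++] = i;
    return n;
}
```

Because the tiled form visits exactly the same indices in the same order, the transformation is trivially valid here; with 2-D tiles the order changes, which is why legality needs the interchange test.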
@HC 5KK73 Embedded Computer Architecture 32
Loop Tiling
• Improve cache reuse by dividing the (2 or higher dimensional) iteration space into tiles and iterating over these tiles.
• Only useful when the working set does not fit into the cache or when there is much interference.
• Two adjacent loops can legally be tiled if they can legally be interchanged.
@HC 5KK73 Embedded Computer Architecture 33
2-D Tiling Example

for(i=0; i<N; i++)
  for(j=0; j<N; j++)
    A[i][j] = B[j][i];

for(TI=0; TI<N; TI+=16)
  for(TJ=0; TJ<N; TJ+=16)
    for(i=TI; i<min(TI+16,N); i++)      // inside
      for(j=TJ; j<min(TJ+16,N); j++)    // a tile
        A[i][j] = B[j][i];
@HC 5KK73 Embedded Computer Architecture 35
What is the best Tile Size?
• Current tile size selection algorithms use a cache model:
  – Generate a collection of tile sizes;
  – Estimate the resulting cache miss rate;
  – Select the best one.
• They only take the L1 cache into account.
• They mostly do not take n-way associativity into account.
@HC 5KK73 Embedded Computer Architecture 36
Loop Index Split

for Ia = exp1 to exp2
  A(Ia)

for Ia = exp1 to p
  A(Ia)
for Ib = p+1 to exp2
  A(Ib)
@HC 5KK73 Embedded Computer Architecture 37
Loop Unrolling
• Duplicate loop body and adjust loop header.
• Increases available ILP, reduces loop overhead, and increases possibilities for common subexpression elimination.
• Always valid !!
@HC 5KK73 Embedded Computer Architecture 38
(Partial) Loop Unrolling

for I = exp1 to exp2
  A(I)

Full unrolling:
A(exp1)
A(exp1+1)
A(exp1+2)
…
A(exp2)

Partial unrolling (factor 2):
for I = exp1/2 to exp2/2
  A(2I)
  A(2I+1)
@HC 5KK73 Embedded Computer Architecture 39
Loop Unrolling: Downside
• If unroll factor is not divisor of trip count, then need to add remainder loop.
• If trip count not known at compile time, need to check at runtime.
• Code size increases which may result in higher I-cache miss rate.
• Global determination of optimal unroll factors is difficult.
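A sketch of unrolling by 4 with a remainder loop, for trip counts not known to be multiples of 4 (the summation body is an illustrative stand-in for any loop body):

```c
#include <assert.h>

/* Rolled reference version. */
static long sum_rolled(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by 4: the main loop runs while at least 4 iterations
   remain; the remainder loop finishes the last n % 4 iterations. */
static long sum_unrolled4(const int *a, int n) {
    long s = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }
    for (; i < n; i++)   /* remainder loop */
        s += a[i];
    return s;
}
```

If n is a runtime value, the `i + 4 <= n` guard is exactly the runtime check the slide refers to.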
@HC 5KK73 Embedded Computer Architecture 40
Loop Unroll: Example of removal of a non-affine iterator

for (L=1; L < 4; L++)
  for (i=0; i < (1<<L); i++)
    A(L,i);

for (i=0; i < 2; i++) A(1,i);
for (i=0; i < 4; i++) A(2,i);
for (i=0; i < 8; i++) A(3,i);
@HC 5KK73 Embedded Computer Architecture 41
Unroll-and-Jam
• Unroll the outer loop and fuse the new copies of the inner loop.
• Increases the size of the loop body and hence the available ILP.
• Can also improve locality.
@HC 5KK73 Embedded Computer Architecture 42
Unroll-and-Jam Example
• More ILP exposed
• Spatial locality of B enhanced

for (i=0;i<N;i++)
  for (j=0;j<N;j++)
    A[i][j] = B[j][i];

Unroll the outer loop:
for (i=0; i<N; i+=2) {
  for (j=0; j<N; j++)
    A[i][j] = B[j][i];
  for (j=0; j<N; j++)
    A[i+1][j] = B[j][i+1];
}

Jam (fuse) the inner-loop copies:
for (i=0; i<N; i+=2)
  for (j=0; j<N; j++) {
    A[i][j] = B[j][i];
    A[i+1][j] = B[j][i+1];
  }
@HC 5KK73 Embedded Computer Architecture 43
Simplified loop transformation script
• Give all loops the same nesting depth
  – Use dummy 1-iteration loops if necessary
• Improve regularity
  – Usually by applying loop interchange or reverse
• Improve locality
  – Usually by applying loop merge
  – Break data dependencies with loop bump/skew
  – Sometimes loop index split or unrolling is easier
@HC 5KK73 Embedded Computer Architecture 44
Loop transformation theory
• Iteration spaces
• Polyhedral model
• Dependency analysis
@HC 5KK73 Embedded Computer Architecture 45
Technical Preliminaries (1)

do i = 2, N
  do j = i, N
    x[i,j] = 0.5*(x[i-1,j] + x[i,j-1])
  enddo
enddo

• perfect loop nest
• iteration space: all (i,j) with 2 ≤ i ≤ N and i ≤ j ≤ N
• dependence vectors: (1,0) and (0,1)

[figure: triangular i-j iteration space with the dependence vectors drawn at each point]
@HC 5KK73 Embedded Computer Architecture 46
Technical Preliminaries (2)
Switch loop indexes:

do m = 2, N
  do n = 2, m
    x[n,m] = 0.5*(x[n-1,m] + x[n,m-1])
  enddo
enddo

This is an affine transformation of the iteration space:

  (m)   (0 1) (i)
  (n) = (1 0) (j)

The dependence vectors (1,0) and (0,1) are mapped onto each other by the same matrix.

[figure: transformed m-n iteration space]
@HC 5KK73 Embedded Computer Architecture 47
Polyhedral Model
• A polyhedron is a set {x : Ax ≤ c} for some matrix A and bounds vector c.
• Polyhedra (or polytopes) are objects in a many-dimensional space without holes.
• Iteration spaces of loops (with unit stride) can be represented as polyhedra.
• Array accesses and loop transformations can be represented as matrices.
@HC 5KK73 Embedded Computer Architecture 48
Iteration Space
• A loop nest is represented as B·I ≤ b for iteration vector I.
• Example:

for(i=0; i<10; i++)
  for(j=i; j<10; j++)
    ...

  [ -1  0 ] ( i )   ( 0 )
  [  1  0 ] ( j ) ≤ ( 9 )
  [  1 -1 ]         ( 0 )
  [  0  1 ]         ( 9 )
@HC 5KK73 Embedded Computer Architecture 49
Array Accesses
• Any array access A[e1][e2] for linear index expressions e1 and e2 can be represented as an access matrix A and offset vector a: the access maps iteration vector I to A·I + a.
• This can be considered as a mapping from the iteration space into the storage space of the array (which is a trivial polyhedron).
@HC 5KK73 Embedded Computer Architecture 50
Unimodular Matrices
• A unimodular matrix T is a matrix with integer entries and determinant ±1.
• This means that such a matrix maps an object onto another object with exactly the same number of integer points in it.
• Its inverse T⁻¹ always exists and is unimodular as well.
@HC 5KK73 Embedded Computer Architecture 51
Types of Unimodular Transformations
• Loop interchange
• Loop reversal
• Loop skewing for arbitrary skew factor
• Since unimodular transformations are closed under multiplication, any combination is a unimodular transformation again.
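For 2×2 matrices the unimodularity check reduces to an integer determinant; a minimal sketch covering the three transformations listed above (the skew factor 5 is an arbitrary example, since any integer skew factor gives determinant 1):

```c
#include <assert.h>
#include <stdlib.h>

/* Integer determinant of a 2x2 matrix m = [[a,b],[c,d]]. */
static int det2(const int m[2][2]) {
    return m[0][0] * m[1][1] - m[0][1] * m[1][0];
}

/* A matrix is unimodular iff its determinant is +1 or -1. */
static int is_unimodular2(const int m[2][2]) {
    return abs(det2(m)) == 1;
}
```

Composing these matrices (interchange, reversal, skew) by multiplication keeps the determinant at ±1, which is the closure property the slide states.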
@HC 5KK73 Embedded Computer Architecture 52
Application
• The transformed loop nest is given by A·T⁻¹·I’ ≤ a.
• Any array access matrix is likewise transformed into A·T⁻¹.
• The transformed loop nest needs to be normalized by means of Fourier-Motzkin elimination, to ensure that the loop bounds are affine expressions in the more outer loop indices.
@HC 5KK73 Embedded Computer Architecture 53
Dependence Analysis

Consider the following statements:

S1: a = b + c;
S2: d = a + f;
S3: a = g + h;

• S1 → S2: true or flow dependence = RaW
• S2 → S3: anti-dependence = WaR
• S1 → S3: output dependence = WaW
@HC 5KK73 Embedded Computer Architecture 54
Dependences in Loops
• Consider the following loop:

for(i=0; i<N; i++){
S1:  a[i] = …;
S2:  b[i] = a[i-1];
}

• Loop-carried dependence S1 → S2.
• Need to detect whether there exist i and i’ such that i = i’-1 in the loop space.
@HC 5KK73 Embedded Computer Architecture 55
Definition of Dependence
• There exists a dependence if there are two statement instances that refer to the same memory location and (at least) one of them is a write.
• There should not be an intervening write between these two statement instances.
• In general, it is undecidable whether there exists a dependence.
@HC 5KK73 Embedded Computer Architecture 56
Direction of Dependence
• If there is a flow dependence between two statements S1 and S2 in a loop, then S1 writes to a variable in an earlier iteration than the one in which S2 reads that variable.
• The iteration vector of the write is lexicographically less than the iteration vector of the read.
• I ≺ I’ iff i1 = i1’ ∧ … ∧ i(k-1) = i(k-1)’ ∧ ik < ik’ for some k.
@HC 5KK73 Embedded Computer Architecture 57
Direction Vectors
• A direction vector is a vector
  (=,=,…,=,<,*,*,…,*)
  where * can denote =, <, or >.
• Such a vector encodes a (collection of) dependences.
• A loop transformation should result in a new direction vector for the dependence that is still lexicographically positive.
@HC 5KK73 Embedded Computer Architecture 58
Loop Interchange
• Interchanging two loops also interchanges the corresponding entries in a direction vector.
• Example: if direction vector of a dependence is (<,>) then we may not interchange the loops because the resulting direction would be (>,<) which is lexicographically negative.
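A small sketch of this legality check: swap the first two entries of a direction vector and test lexicographic positivity. The integer encoding (-1 for ‘<’, 0 for ‘=’, +1 for ‘>’) is an assumption for illustration, not the slides’ notation:

```c
#include <assert.h>

/* Direction entries encoded as: -1 for '<', 0 for '=', +1 for '>'. */

/* A dependence is lexicographically positive if its first
   nonzero entry is '<' (-1 in this encoding). The all-'='
   (loop-independent) case is treated separately in practice. */
static int lex_positive(const int d[], int n) {
    for (int i = 0; i < n; i++) {
        if (d[i] < 0) return 1;
        if (d[i] > 0) return 0;
    }
    return 0;
}

/* Interchanging the two loops swaps the two entries; the
   interchange is legal only if the result stays positive. */
static int interchange_legal(const int d[2]) {
    int swapped[2] = { d[1], d[0] };
    return lex_positive(swapped, 2);
}
```

On the slide’s example, (<,>) swaps to (>,<), which is lexicographically negative, so the check rejects the interchange.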
@HC 5KK73 Embedded Computer Architecture 59
Affine Bounds and Indices
• We assume loop bounds and array index expressions are affine expressions:
  a0 + a1·i1 + … + ak·ik
• Each loop bound for loop index ik is an affine expression over the previous loop indices i1 to i(k-1).
• Each array index expression is an affine expression over all loop indices.
@HC 5KK73 Embedded Computer Architecture 60
Non-Affine Expressions
• Index expressions like i*j cannot be handled by dependence tests. We must assume that there exists a dependence.
• An important class of index expressions are indirections A[B[i]]. These occur frequently in scientific applications (sparse matrix computations).
• In embedded applications???
@HC 5KK73 Embedded Computer Architecture 61
Linear Diophantine Equations
• A linear Diophantine equation is of the form
  a1·x1 + a2·x2 + … + an·xn = c
• The equation has an integer solution iff gcd(a1,…,an) is a divisor of c.
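The solvability condition can be coded directly; this sketch handles the two-variable case for brevity:

```c
#include <assert.h>
#include <stdlib.h>

/* Euclid's algorithm on absolute values. */
static int gcd(int a, int b) {
    a = abs(a);
    b = abs(b);
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* a*x + b*y = c has an integer solution iff gcd(a,b) divides c. */
static int diophantine_solvable(int a, int b, int c) {
    int g = gcd(a, b);
    if (g == 0)            /* a = b = 0: solvable only if c = 0 */
        return c == 0;
    return c % g == 0;
}
```

For example, 6x + 4y = 10 is solvable (gcd 2 divides 10, e.g. x = 1, y = 1), while 6x + 4y = 7 is not.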
@HC 5KK73 Embedded Computer Architecture 62
GCD Test for Dependence
• Assume a single loop and two references A[a+b·i] and A[c+d·i].
• If there exists a dependence, then gcd(b,d) divides (c-a).
• Note the direction of the implication!
• If gcd(b,d) does not divide (c-a), then there exists no dependence.
@HC 5KK73 Embedded Computer Architecture 63
GCD Test (cont’d)
• However, if gcd(b,d) does divide (c-a), then there might exist a dependence.
• The test is not exact, since it does not take the loop bounds into account.
• For example:

for(i=0; i<10; i++)
  A[i] = A[i+10] + 1;

Here gcd(1,1) = 1 divides 10, yet the bounds rule out any dependence.
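The GCD test itself is a one-liner on top of gcd; on the slide’s example it fails to disprove the (nonexistent) dependence, exactly as described:

```c
#include <assert.h>
#include <stdlib.h>

static int gcd(int a, int b) {
    a = abs(a);
    b = abs(b);
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* References A[a + b*i] and A[c + d*i'] in a single loop:
   returns 1 if a dependence is POSSIBLE (gcd(b,d) divides c-a),
   0 if a dependence is definitely absent. Loop bounds are
   ignored, so a result of 1 does not prove a dependence. */
static int gcd_test_may_depend(int a, int b, int c, int d) {
    int g = gcd(b, d);
    int diff = c - a;
    if (g == 0)
        return diff == 0;
    return diff % g == 0;
}
```

A case where the test does disprove: A[2i] vs. A[2i+1] (even vs. odd locations), since gcd(2,2) = 2 does not divide 1.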
@HC 5KK73 Embedded Computer Architecture 64
GCD Test (cont’d)
• Using the theorem on linear Diophantine equations, we can test in arbitrary loop nests.
• We need one test for each direction vector.
• The vector (=,=,…,=,<,…) implies that the first k indices are the same.
• See the book by Zima for details.
@HC 5KK73 Embedded Computer Architecture 65
Other Dependence Tests
• There exist many dependence tests:
  – Separability test
  – GCD test
  – Banerjee test
  – Range test
  – Fourier-Motzkin test
  – Omega test
• Exactness increases, but so does the cost.
@HC 5KK73 Embedded Computer Architecture 66
Fourier-Motzkin Elimination
• Consider a collection of linear inequalities over the variables i1,…,in:

e1(i1,…,in) ≤ e1’(i1,…,in)
…
em(i1,…,in) ≤ em’(i1,…,in)

• Is this system consistent, i.e., does there exist a solution?
• FM-elimination can determine this.
@HC 5KK73 Embedded Computer Architecture 67
FM-Elimination (cont’d)
• First, rewrite every inequality involving in as a lower bound L(i1,…,i(n-1)) ≤ in or an upper bound in ≤ U(i1,…,i(n-1)).
• Then create a new system containing, for every such pair,

L(i1,…,i(n-1)) ≤ U(i1,…,i(n-1))

together with all original inequalities not involving in.
• This new system has one variable less, and we continue this way.
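A tiny sketch of one elimination step for a single variable x with constant bounds: pair every lower bound with every upper bound and check L ≤ U. (Real FM-elimination carries affine expressions in the remaining variables; constants suffice to show the mechanics.)

```c
#include <assert.h>

/* Given lower bounds L[k] <= x and upper bounds x <= U[m],
   eliminating x yields the cross-product constraints L[k] <= U[m].
   The system is consistent (over the reals) iff all of them hold. */
static int fm_step_consistent(const int *L, int nl,
                              const int *U, int nu) {
    for (int k = 0; k < nl; k++)
        for (int m = 0; m < nu; m++)
            if (L[k] > U[m])
                return 0;   /* e.g. 10 <= 3: inconsistent */
    return 1;
}
```

The nl × nu pairing is also where the exponential blow-up mentioned on the next slide comes from.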
@HC 5KK73 Embedded Computer Architecture 68
FM-Elimination (cont’d)
• After eliminating i1, we end up with a collection of inequalities between constants, c ≤ c’.
• The original system is consistent iff every such inequality can be satisfied, i.e., there is no inequality like 10 ≤ 3.
• Exponentially many new inequalities may be generated!
@HC 5KK73 Embedded Computer Architecture 69
Fourier-Motzkin Test
• Loop bounds plus array index expressions generate sets of inequalities, using new loop indices i’ for the sink of the dependence.
• Each direction vector generates the inequalities
  i1 = i1’ ∧ … ∧ i(k-1) = i(k-1)’ ∧ ik < ik’
• If all these systems are inconsistent, then there exists no dependence.
• This test is not exact (there may be real solutions but no integer ones), but almost.
@HC 5KK73 Embedded Computer Architecture 70
N-Dimensional Arrays
• Test in each dimension separately.
• This can introduce another level of inaccuracy.
• Some tests (FM and Omega test) can test in many dimensions at the same time.
• Otherwise, you can linearize an array: Transform a logically N-dimensional array to its one-dimensional storage format.
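Linearization maps a logical 2-D index to its 1-D row-major offset; a minimal sketch (the cols parameter is an illustrative array width):

```c
#include <assert.h>

/* Row-major linearization of a logical [rows][cols] array:
   element (i,j) lives at flat offset i*cols + j. */
static int linearize(int i, int j, int cols) {
    return i * cols + j;
}

/* Inverse mapping, recovering (i,j) from the flat offset. */
static void delinearize(int off, int cols, int *i, int *j) {
    *i = off / cols;
    *j = off % cols;
}
```

After linearization, a single dependence test on the flat index replaces the per-dimension tests, at the cost of more complex index expressions.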
@HC 5KK73 Embedded Computer Architecture 71
Hierarchy of Tests
• Try a cheap test first, then more expensive ones:

if (cheap test1 = NO)
  then print ‘NO’
else if (test2 = NO)
  then print ‘NO’
else if (test3 = NO)
  then print ‘NO’
else …
@HC 5KK73 Embedded Computer Architecture 72
Practical Dependence Testing
• Cheap tests, like GCD and Banerjee tests, can disprove many dependences.
• Adding expensive tests only disproves a few more possible dependences.
• A compiler writer needs to trade off compilation time against the accuracy of dependence testing.
• For time critical applications, expensive tests like Omega test (exact!) can be used.