stanford university cs243 winter 2006 wei li 1 loop transformations and locality

40
Wei Li 1 Stanford University CS243 Winter 2006 Loop Loop Transformations and Transformations and Locality Locality

Post on 19-Dec-2015

217 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

Wei Li 1Stanford University

CS243 Winter 2006

Loop Transformations Loop Transformations and Locality and Locality

Page 2: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

2CS243 Winter 2006Stanford University

AgendaAgenda

IntroductionIntroduction Loop TransformationsLoop Transformations Affine Transform TheoryAffine Transform Theory

Page 3: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

3CS243 Winter 2006Stanford University

Memory HierarchyMemory Hierarchy

CPU

C

C

Memory

Cache

Page 4: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

4CS243 Winter 2006Stanford University

Cache LocalityCache Locality

for i = 1, 100 for j = 1, 200 A[i, j] = A[i, j] + 3 end_forend_for

• Suppose array A has column-major layout

A[1,1]A[1,1] A[2,1]A[2,1] …… A[1,2]A[1,2] A[2,2]A[2,2] …… A[1,3]A[1,3] ……

• Loop nest has poor spatial cache locality.

Page 5: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

5CS243 Winter 2006Stanford University

Loop InterchangeLoop Interchange

for i = 1, 100 for j = 1, 200 A[i, j] = A[i, j] + 3 end_forend_for

• Suppose array A has column-major layout

A[1,1]A[1,1] A[2,1]A[2,1] …… A[1,2]A[1,2] A[2,2]A[2,2] …… A[1,3]A[1,3] ……

• New loop nest has better spatial cache locality.

for j = 1, 200

for i = 1, 100 A[i, j] = A[i, j] + 3 end_forend_for

Page 6: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

6CS243 Winter 2006Stanford University

Interchange Loops?Interchange Loops?

for i = 2, 100 for j = 1, 200 A[i, j] = A[i-1, j+1]+3 end_forend_for

• e.g. dependence from (3,3) to (4,2)

i

j

Page 7: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

7CS243 Winter 2006Stanford University

Dependence VectorsDependence Vectors

i

j

Distance vector (1,-1) Distance vector (1,-1) = (4,2)-(3,3)= (4,2)-(3,3)

Direction vector (+, -) Direction vector (+, -) from the signs of from the signs of distance vectordistance vector

Loop interchange is Loop interchange is not legal if there exists not legal if there exists dependence (+, -)dependence (+, -)

Page 8: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

8CS243 Winter 2006Stanford University

AgendaAgenda

IntroductionIntroduction Loop TransformationsLoop Transformations Affine Transform TheoryAffine Transform Theory

Page 9: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

9CS243 Winter 2006Stanford University

Loop FusionLoop Fusion

for i = 1, 1000 A[i] = B[i] + 3end_for

for j = 1, 1000 C[j] = A[j] + 5end_for

for i = 1, 1000 A[i] = B[i] + 3 C[i] = A[i] + 5end_for

Better reuse between A[i] and A[i]Better reuse between A[i] and A[i]

Page 10: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

10CS243 Winter 2006Stanford University

Loop DistributionLoop Distribution

for i = 1, 1000 A[i] = A[i-1] + 3end_for

for i = 1, 1000 C[i] = B[i] + 5end_for

for i = 1, 1000 A[i] = A[i-1] + 3 C[i] = B[i] + 5end_for

22ndnd loop is parallel loop is parallel

Page 11: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

11CS243 Winter 2006Stanford University

Register BlockingRegister Blocking

for j = 1, 2*m for i = 1, 2*n A[i, j] = A[i-1, j] + A[i-1, j-1] end_forend_for

for j = 1, 2*m, 2 for i = 1, 2*n, 2 A[i, j] = A[i-1,j] + A[i-1,j-1] A[i, j+1] = A[i-1,j+1] + A[i-1,j] A[i+1, j] = A[i, j] + A[i, j-1] A[i+1, j+1] = A[i, j+1] + A[i, j] end_forend_for

Better reuse between A[i,j] and A[i,j]Better reuse between A[i,j] and A[i,j]

Page 12: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

12CS243 Winter 2006Stanford University

Virtual Register AllocationVirtual Register Allocationfor j = 1, 2*M, 2 for i = 1, 2*N, 2 r1 = A[i-1,j] r2 = r1 + A[i-1,j-1] A[i, j] = r2 r3 = A[i-1,j+1] + r1 A[i, j+1] = r3 A[i+1, j] = r2 + A[i, j-1] A[i+1, j+1] = r3 + r2 end_forend_for

Memory operations reduced to register load/store

8MN loads to 4MN loads

Page 13: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

13CS243 Winter 2006Stanford University

Scalar ReplacementScalar Replacement

for i = 2, N+1 = A[i-1]+1 A[i] =end_for

t1 = A[1]for i = 2, N+1 = t1 + 1 t1 = A[i] = t1end_for

Eliminate loads and stores for array references

Page 14: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

14CS243 Winter 2006Stanford University

Unroll-and-JamUnroll-and-Jam

for j = 1, 2*M for i = 1, N A[i, j] = A[i-1, j] + A[i-1, j-1] end_forend_for

for j = 1, 2*M, 2 for i = 1, N A[i, j]=A[i-1,j]+A[i-1,j-1] A[i, j+1]=A[i-1,j+1]+A[i-1,j] end_forend_for

Expose more opportunity for scalar replacement

Page 15: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

15CS243 Winter 2006Stanford University

Large ArraysLarge Arrays

for i = 1, 1000 for j = 1, 1000 A[i, j] = A[i, j] + B[j, i] end_forend_for

• Suppose arrays A and B have row-major layout

B has poor cache locality. Loop interchange will not help.

Page 16: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

16CS243 Winter 2006Stanford University

Loop BlockingLoop Blocking

for v = 1, 1000, 20 for u = 1, 1000, 20 for j = v, v+19 for i = u, u+19 A[i, j] = A[i, j] + B[j, i] end_for end_for end_forend_for

Access to small blocks of the arrays has good cache locality.

Page 17: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

17CS243 Winter 2006Stanford University

Loop Unrolling for ILPLoop Unrolling for ILP

for i = 1, 10 a[i] = b[i]; *p = ... end_for

for I = 1, 10, 2 a[i] = b[i]; *p = … a[i+1] = b[i+1]; *p = …end_for

Large scheduling regions. Fewer dynamic branches Increased code size

Page 18: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

18CS243 Winter 2006Stanford University

AgendaAgenda

IntroductionIntroduction Loop TransformationsLoop Transformations Affine Transform TheoryAffine Transform Theory

Page 19: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

19CS243 Winter 2006Stanford University

ObjectiveObjective

Unify a large class of program Unify a large class of program transformations.transformations.

Example:Example:

float Z[100];for i = 0, 9 Z[i+10] = Z[i];end_for

Page 20: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

20CS243 Winter 2006Stanford University

Iteration SpaceIteration Space

A d-deep loop nest has d index variables, A d-deep loop nest has d index variables, and is modeled by a d-dimensional space. and is modeled by a d-dimensional space. The space of iterations is bounded by the The space of iterations is bounded by the lower and upper bounds of the loop lower and upper bounds of the loop indices.indices.

Iteration space i = 0,1, …9 Iteration space i = 0,1, …9

for i = 0, 9 Z[i+10] = Z[i];end_for

Page 21: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

21CS243 Winter 2006Stanford University

Matrix FormulationMatrix Formulation

The iterations in a d-deep loop nest can be The iterations in a d-deep loop nest can be represented mathematically asrepresented mathematically as

Z is the set of integersZ is the set of integers B is a d B is a d xx d integer matrix d integer matrix b is an integer vector of length d, andb is an integer vector of length d, and 0 is a vector of d 0’s.0 is a vector of d 0’s.

}0|{ biBdZi

Page 22: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

22CS243 Winter 2006Stanford University

ExampleExample

for i = 0, 5 for j = i, 7 Z[j,i] = 0;

E.g. the 3rd row –i+j ≥ 0 is from the lower bound j ≥ i for loop j.

0

0

0

0

7

0

5

0

10

11

01

01

j

i

Page 23: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

23CS243 Winter 2006Stanford University

Symbolic ConstantsSymbolic Constants

for i = 0, n Z[i] = 0;

E.g. the 1st row –i+n ≥ 0 is from the upper bound i ≤ n.

0

0

01

11|{n

iZi

Page 24: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

24CS243 Winter 2006Stanford University

Data SpaceData Space

An n-dimensional array is modeled by an An n-dimensional array is modeled by an n-dimensional space. The space is n-dimensional space. The space is bounded by the array bounds.bounded by the array bounds.

Data space a = 0,1, …99 Data space a = 0,1, …99

float Z[100]for i = 0, 9 Z[i+10] = Z[i];end_for

Page 25: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

25CS243 Winter 2006Stanford University

Processor SpaceProcessor Space

Initially assume unbounded number of Initially assume unbounded number of virtual processors (vp1, vp2, …) organized virtual processors (vp1, vp2, …) organized in a multi-dimensional space. in a multi-dimensional space. (iteration 1, vp1), (iteration 2, vp2),…(iteration 1, vp1), (iteration 2, vp2),…

After parallelization, map to physical After parallelization, map to physical processors (p1, p2).processors (p1, p2). (vp1, p1), (vp2, p2), (vp3, p1), (vp4, p2),… (vp1, p1), (vp2, p2), (vp3, p1), (vp4, p2),…

Page 26: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

26CS243 Winter 2006Stanford University

Affine Array Index FunctionAffine Array Index Function

Each array access in the code specifies a Each array access in the code specifies a mapping from an iteration in the iteration mapping from an iteration in the iteration space to an array element in the data space to an array element in the data spacespace

Both i+10 and i are affine.Both i+10 and i are affine.

float Z[100]for i = 0, 9 Z[i+10] = Z[i];end_for

Page 27: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

27CS243 Winter 2006Stanford University

Array Affine AccessArray Affine Access

The bounds of the loop are expressed as The bounds of the loop are expressed as affine expressions of the surrounding loop affine expressions of the surrounding loop variables and symbolic constants, andvariables and symbolic constants, and

The index for each dimension of the array The index for each dimension of the array is also an affine expression of surrounding is also an affine expression of surrounding loop variables and symbolic constantsloop variables and symbolic constants

Page 28: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

28CS243 Winter 2006Stanford University

Matrix FormulationMatrix Formulation

Array access maps a vector i within the Array access maps a vector i within the bounds to array element location Fi+f.bounds to array element location Fi+f.

E.g. access X[i-1] in loop nest i,j E.g. access X[i-1] in loop nest i,j

}0|{ biBdZi

101

j

i

Page 29: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

29CS243 Winter 2006Stanford University

Affine PartitioningAffine Partitioning

An affine function to assign iterations in an An affine function to assign iterations in an iteration space to processors in the iteration space to processors in the processor space.processor space.

E.g. iteration i to processor 10-i.E.g. iteration i to processor 10-i.

float Z[100]for i = 0, 9 Z[i+10] = Z[i];end_for

Page 30: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

30CS243 Winter 2006Stanford University

Data Access RegionData Access Region

An affine function to assign iterations in an An affine function to assign iterations in an iteration space to processors in the iteration space to processors in the processor space.processor space.

Region for Z[i+10] is {a | 10 ≤ a ≤ 20}.Region for Z[i+10] is {a | 10 ≤ a ≤ 20}.

float Z[100]for i = 0, 9 Z[i+10] = Z[i];end_for

Page 31: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

31CS243 Winter 2006Stanford University

Data DependencesData Dependences

Solution to linear constraints as shown in Solution to linear constraints as shown in the last lecture.the last lecture. There exist iThere exist irr, i, iww, such that, such that

0 0 ≤ ≤ iirr,, i iw w ≤≤ 9, 9,

iiw w + 10 = i+ 10 = irrfloat Z[100]for i = 0, 9 Z[i+10] = Z[i];end_for

Page 32: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

32CS243 Winter 2006Stanford University

Affine TransformAffine Transform

i

j

u

v

bj

iB

v

u

Page 33: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

33CS243 Winter 2006Stanford University

Locality OptimizationLocality Optimization

for i = 1, 100 for j = 1, 200 A[i, j] = A[i, j] + 3 end_forend_for

for u = 1, 200

for v = 1, 100 A[v,u] = A[v,u]+ 3 end_forend_for

j

i

v

u

01

10

Page 34: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

34CS243 Winter 2006Stanford University

Old Iteration SpaceOld Iteration Space

j

i

v

u

01

10

0

0

0

0

200

1

100

1

10

10

01

01

j

ifor i = 1, 100 for j = 1, 200 A[i, j] = A[i, j] + 3 end_forend_for

0

0

0

0

200

1

100

11

01

10

10

10

01

01

v

u

Page 35: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

35CS243 Winter 2006Stanford University

New Iteration SpaceNew Iteration Space

0

0

0

0

200

1

100

1

01

01

10

10

v

u

0

0

0

0

200

1

100

11

01

10

10

10

01

01

v

ufor u = 1, 200

for v = 1, 100 A[v,u] = A[v,u]+ 3 end_forend_for

Page 36: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

36CS243 Winter 2006Stanford University

Old Array AccessesOld Array Accesses

j

i

v

u

01

10

]10,01[

j

i

j

iA

for i = 1, 100 for j = 1, 200 A[i, j] = A[i, j] + 3 end_forend_for

]1

01

1010,1

01

1001[

v

u

v

uA

Page 37: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

37CS243 Winter 2006Stanford University

New Array AccessesNew Array Accesses

]1

01

1010,1

01

1001[

v

u

v

uA

for u = 1, 200

for v = 1, 100 A[v,u] = A[v,u]+ 3 end_forend_for

]01,10[

v

u

v

uA

Page 38: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

38CS243 Winter 2006Stanford University

Interchange Loops?Interchange Loops?

for i = 2, 1000 for j = 1, 1000 A[i, j] = A[i-1, j+1]+3 end_forend_for

• e.g. dependence vector dold = (1,-1)

i

j

Page 39: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

39CS243 Winter 2006Stanford University

Interchange Loops?Interchange Loops?

j

i

v

u

01

10

1

1

1

1

01

10

01

10oldnew dd

A transformation is legal, if the new dependence A transformation is legal, if the new dependence is lexicographically positive, i.e. the leading non-is lexicographically positive, i.e. the leading non-zero in the dependence vector is positive.zero in the dependence vector is positive.

Page 40: Stanford University CS243 Winter 2006 Wei Li 1 Loop Transformations and Locality

40CS243 Winter 2006Stanford University

SummarySummary

Locality OptimizationsLocality Optimizations Loop TransformationsLoop Transformations Affine Transform TheoryAffine Transform Theory