
Source: labexcompilation.ens-lyon.fr/wp-content/uploads/2013/02/Louis-Noel…

Program transformations and optimizations in the polyhedral framework

Louis-Noël Pouchet

Dept. of Computer Science, UCLA

May 14, 2013

1st Polyhedral Spring School, St-Germain au Mont d’or, France


Overview


Agenda for the Lecture

Disclaimer: this lecture is (mainly) for new Ph.D. students

1 Program Representation in the Polyhedral Model

2 Applying loop transformations in the model

3 Exact data dependence representation

4 Scheduling using Farkas-based methods

5 Scheduling using approximate data dependences (if time permits...)


Running Example: DGEMM

Example (dgemm)

/* C := alpha*A*B + beta*C */
for (i = 0; i < ni; i++)
  for (j = 0; j < nj; j++)
S1:   C[i][j] *= beta;
for (i = 0; i < ni; i++)
  for (j = 0; j < nj; j++)
    for (k = 0; k < nk; ++k)
S2:     C[i][j] += alpha * A[i][k] * B[k][j];

- Loop transformation: permute(i,k,S2)

Execution time (in s) on this laptop, GCC 4.2, ni=nj=nk=512

version    -O0    -O1    -O2    -O3 -vec
original   1.81   0.78   0.78   0.78
permute    1.52   0.35   0.35   0.20

http://gcc.gnu.org/onlinedocs/gcc-4.2.1/gcc/Optimize-Options.html


Running Example: fdtd-2d

Example (fdtd-2d)

for (t = 0; t < tmax; t++) {
  for (j = 0; j < ny; j++)
    ey[0][j] = _edge_[t];
  for (i = 1; i < nx; i++)
    for (j = 0; j < ny; j++)
      ey[i][j] = ey[i][j] - 0.5*(hz[i][j]-hz[i-1][j]);
  for (i = 0; i < nx; i++)
    for (j = 1; j < ny; j++)
      ex[i][j] = ex[i][j] - 0.5*(hz[i][j]-hz[i][j-1]);
  for (i = 0; i < nx - 1; i++)
    for (j = 0; j < ny - 1; j++)
      hz[i][j] = hz[i][j] - 0.7*(ex[i][j+1] - ex[i][j] + ey[i+1][j] - ey[i][j]);
}

- Loop transformation: polyhedralOpt(fdtd-2d)

Execution time (in s) on this laptop, GCC 4.2, 64x1024x1024

version        -O0    -O1    -O2    -O3 -vec
original       2.59   1.62   1.54   1.54
polyhedralOpt  2.05   0.41   0.41   0.41


Manual transformations?

(the same fdtd-2d code as on the previous slide)


NO NO NO!!!

Example (fdtd-2d tiled)for (c0 = 0; c0 <= (((ny + 2 * tmax + -3) * 32 < 0?((32 < 0?-((-(ny + 2 * tmax + -3) + 32 + 1) / 32) : -((-(ny + 2 *tmax + -3) + 32 - 1) / 32))) : (ny + 2 * tmax + -3) / 32)); ++c0) {

#pragma omp parallel for private(c3, c4, c2, c5)for (c1 = (((c0 * 2 < 0?-(-c0 / 2) : ((2 < 0?(-c0 + -2 - 1) / -2 : (c0 + 2 - 1) / 2)))) > (((32 * c0 + -tmax + 1) *

32 < 0?-(-(32 * c0 + -tmax + 1) / 32) : ((32 < 0?(-(32 * c0 + -tmax + 1) + -32 - 1) / -32 : (32 * c0 + -tmax + 1 + 32- 1) / 32))))?((c0 * 2 < 0?-(-c0 / 2) : ((2 < 0?(-c0 + -2 - 1) / -2 : (c0 + 2 - 1) / 2)))) : (((32 * c0 + -tmax + 1) *32 < 0?-(-(32 * c0 + -tmax + 1) / 32) : ((32 < 0?(-(32 * c0 + -tmax + 1) + -32 - 1) / -32 : (32 * c0 + -tmax + 1 + 32- 1) / 32))))); c1 <= (((((((ny + tmax + -2) * 32 < 0?((32 < 0?-((-(ny + tmax + -2) + 32 + 1) / 32) : -((-(ny + tmax +-2) + 32 - 1) / 32))) : (ny + tmax + -2) / 32)) < (((32 * c0 + ny + 30) * 64 < 0?((64 < 0?-((-(32 * c0 + ny + 30) + 64+ 1) / 64) : -((-(32 * c0 + ny + 30) + 64 - 1) / 64))) : (32 * c0 + ny + 30) / 64))?(((ny + tmax + -2) * 32 < 0?((32 <0?-((-(ny + tmax + -2) + 32 + 1) / 32) : -((-(ny + tmax + -2) + 32 - 1) / 32))) : (ny + tmax + -2) / 32)) : (((32 * c0 +ny + 30) * 64 < 0?((64 < 0?-((-(32 * c0 + ny + 30) + 64 + 1) / 64) : -((-(32 * c0 + ny + 30) + 64 - 1) / 64))) : (32 *c0 + ny + 30) / 64)))) < c0?(((((ny + tmax + -2) * 32 < 0?((32 < 0?-((-(ny + tmax + -2) + 32 + 1) / 32) : -((-(ny + tmax+ -2) + 32 - 1) / 32))) : (ny + tmax + -2) / 32)) < (((32 * c0 + ny + 30) * 64 < 0?((64 < 0?-((-(32 * c0 + ny + 30) + 64+ 1) / 64) : -((-(32 * c0 + ny + 30) + 64 - 1) / 64))) : (32 * c0 + ny + 30) / 64))?(((ny + tmax + -2) * 32 < 0?((32 <0?-((-(ny + tmax + -2) + 32 + 1) / 32) : -((-(ny + tmax + -2) + 32 - 1) / 32))) : (ny + tmax + -2) / 32)) : (((32 * c0 +ny + 30) * 64 < 0?((64 < 0?-((-(32 * c0 + ny + 30) + 64 + 1) / 64) : -((-(32 * c0 + ny + 30) + 64 - 1) / 64))) : (32 *c0 + ny + 30) / 64)))) : c0)); ++c1) {

for (c2 = c0 + -c1; c2 <= (((((tmax + nx + -2) * 32 < 0?((32 < 0?-((-(tmax + nx + -2) + 32 + 1) / 32) : -((-(tmax +nx + -2) + 32 - 1) / 32))) : (tmax + nx + -2) / 32)) < (((32 * c0 + -32 * c1 + nx + 30) * 32 < 0?((32 < 0?-((-(32 * c0 +-32 * c1 + nx + 30) + 32 + 1) / 32) : -((-(32 * c0 + -32 * c1 + nx + 30) + 32 - 1) / 32))) : (32 * c0 + -32 * c1 + nx +30) / 32))?(((tmax + nx + -2) * 32 < 0?((32 < 0?-((-(tmax + nx + -2) + 32 + 1) / 32) : -((-(tmax + nx + -2) + 32 - 1) /32))) : (tmax + nx + -2) / 32)) : (((32 * c0 + -32 * c1 + nx + 30) * 32 < 0?((32 < 0?-((-(32 * c0 + -32 * c1 + nx + 30)+ 32 + 1) / 32) : -((-(32 * c0 + -32 * c1 + nx + 30) + 32 - 1) / 32))) : (32 * c0 + -32 * c1 + nx + 30) / 32)))); ++c2){

if (c0 == 2 * c1 && c0 == 2 * c2) {for (c3 = 16 * c0; c3 <= ((tmax + -1 < 16 * c0 + 30?tmax + -1 : 16 * c0 + 30)); ++c3)

if (c0 % 2 == 0)(ey[0])[0] = (_edge_[c3]);

........... (200 more lines!)

Performance gain: 2-6× on modern multicore platforms


Loop Transformations in Production Compilers

Limitations of standard syntactic frameworks:
- Composition of transformations may be tedious
  - composability rules / applicability
- Parametric loop bounds, imperfectly nested loops are challenging
  - Look at the examples!
- Approximate dependence analysis
  - Misses parallelization opportunities (among many others)
- (Very) conservative performance models


Achievements of Polyhedral Compilation

The polyhedral model:
- Model/apply seamlessly arbitrary compositions of transformations
- Automatic handling of imperfectly nested, parametric loop structures
- Many loop transformations can be modeled

- Exact dependence analysis on a class of programs
- Unleashes the power of automatic parallelization
- Aggressive multi-objective program restructuring (parallelism, SIMD, cache, etc.)

- Requires computationally expensive algorithms
  - Usually NP-complete / exponential complexity
  - Requires careful problem statement/representation


Research around Polyhedral Compilation

"first generation"
- Provided the theoretical foundations
- Focused on high-level/abstract objectives
- Was questioned about how practical this model is

"second generation"
- Provided practical / "more scalable" algorithms
- Focused on concrete performance improvements
- Still often questioned about how practical this model is

- Thanks for the legacy! ;-)


Perspective on 25 Years of Computing

"first generation"
- Parallel machines were... rare
- Era of increased program performance for free
- Intel 486 DX2: 54 MIPS, 66 MHz

"second generation"
- Parallel machines are ubiquitous (multi-core, SIMD, GPUs)
- Very difficult to get good performance
- Intel Core i7: 177730 MIPS, 3.33 GHz


More complex problems can be solved, but more complex objectives are needed anyway


Disclaimer

This lecture is:
- Partial
- Biased (strongly!)
- Provocative at times (hopefully ;-) )

(illustrated with a Google Scholar screenshot: a search for "polyhedral model affine scheduling" returning about 3,520 results)


Polyhedral Program Representation


Example: DGEMM

Example (dgemm)

/* C := alpha*A*B + beta*C */
for (i = 0; i < ni; i++)
  for (j = 0; j < nj; j++) {
S1:   C[i][j] *= beta;
      for (k = 0; k < nk; ++k)
S2:     C[i][j] += alpha * A[i][k] * B[k][j];
  }

This code has:
- imperfectly nested loops
- multiple statements
- parametric loop bounds


Granularity of Program Representation

DGEMM has:
- 3 loops
  - For loops in the code, while loops
  - Control-flow graph analysis
- 2 (syntactic) statements
  - Input source code?
  - Basic block?
  - ASM instructions?
- S1 is executed ni × nj times
  - dynamic instances of the statement
  - Does not (necessarily) correspond to reality!


Some Observations

Reasoning at the loop/statement level:
- Some loop transformations may be very difficult to implement
  - How to fuse loops with different loop bounds?
  - How to permute triangular loops?
  - How to unroll-and-jam triangular loops?
  - How to apply time-tiling?
  - ...
- Statements may operate on the same array while being independent


Some Motivations for Polyhedral Transformations

- Known problem: scheduling of a task graph
  - Obvious limitations: the task graph is not finite / its size depends on the problem / effective code generation is almost impossible
- Alternative approach: use loop transformations
  - solves all the above limitations
  - BUT the problem is to find a sequence that implements the order we want
  - AND also how to apply/compose them
- Desired features:
  - ability to reason at the instance level (as for task graph scheduling)
  - ability to easily apply/compose loop transformations


The Polyhedral Model


Polyhedral Program Optimization: a Three-Stage Process

1 Analysis: from code to model

→ Existing prototype tools
  - URUK, Omega, ChiLL, ...
  - ISL, Loopo
  - PolyOpt/PoCC (Clan-Candl-LetSee-Pluto-Ponos-Cloog-Polylib-PIPLib-ISL-FM-...)

→ GCC GRAPHITE, LLVM Polly (now in mainstream)

→ Reservoir Labs R-Stream, IBM XL/Poly

2 Transformation in the model

→ Build and select a program transformation

3 Code generation: from model to code

→ "Apply" the transformation in the model

→ Regenerate syntactic (AST-based) code


Motivating Example [1/2]

Example

for (i = 0; i <= 1; ++i)
  for (j = 0; j <= 2; ++j)
    A[i][j] = i * j;

Program execution:

1: A[0][0] = 0 * 0;
2: A[0][1] = 0 * 1;
3: A[0][2] = 0 * 2;
4: A[1][0] = 1 * 0;
5: A[1][1] = 1 * 1;
6: A[1][2] = 1 * 2;


Motivating Example [2/2]

A few observations:
- The statement is executed 6 times
- A different value of (i, j) is associated with each of these 6 instances
- There is an order on them (the execution order)

A rough analogy: polyhedral compilation is about (statically) scheduling tasks, where tasks are statement instances, or operations



Polyhedral Representation of Programs

Static Control Parts
- Loops have affine control only (over-approximation otherwise)
- Iteration domain: represented as integer polyhedra

for (i = 1; i <= n; ++i)
  for (j = 1; j <= n; ++j)
    if (i <= n-j+2)
      s[i] = ...

DS1 =

  [  1   0   0  -1 ]   [ i ]
  [ -1   0   1   0 ]   [ j ]
  [  0   1   0  -1 ] . [ n ]  ≥ ~0
  [  0  -1   1   0 ]   [ 1 ]
  [ -1  -1   1   2 ]

(rows: i ≥ 1, i ≤ n, j ≥ 1, j ≤ n, i ≤ n - j + 2)


Polyhedral Representation of Programs

Static Control Parts
- Loops have affine control only (over-approximation otherwise)
- Iteration domain: represented as integer polyhedra
- Memory accesses: static references, represented as affine functions of ~xS and ~p

for (i = 0; i < n; ++i) {
  s[i] = 0;
  for (j = 0; j < n; ++j)
    s[i] = s[i] + a[i][j]*x[j];
}

fs(~xS2) = [ 1 0 0 0 ] . (~xS2, n, 1)^T          (access s[i])

fa(~xS2) = [ 1 0 0 0 ]
           [ 0 1 0 0 ] . (~xS2, n, 1)^T          (access a[i][j])

fx(~xS2) = [ 0 1 0 0 ] . (~xS2, n, 1)^T          (access x[j])


Polyhedral Representation of Programs

Static Control Parts
- Loops have affine control only (over-approximation otherwise)
- Iteration domain: represented as integer polyhedra
- Memory accesses: static references, represented as affine functions of ~xS and ~p
- Data dependence between S1 and S2: a subset of the Cartesian product of DS1 and DS2 (exact analysis)

for (i = 1; i <= 3; ++i) {
  s[i] = 0;
  for (j = 1; j <= 3; ++j)
    s[i] = s[i] + 1;
}

DS1δS2 :

  [ 1  -1   0   0 ] . (iS1, iS2, jS2, 1)^T = 0        (iS1 = iS2)

  [ 1   0   0  -1 ]
  [-1   0   0   3 ]
  [ 0   1   0  -1 ]
  [ 0  -1   0   3 ] . (iS1, iS2, jS2, 1)^T ≥ ~0       (1 ≤ iS1, iS2, jS2 ≤ 3)
  [ 0   0   1  -1 ]
  [ 0   0  -1   3 ]

(figure: S1 and S2 iterations plotted along i, with the dependences between them)


Math Corner


The Affine Qualifier

Definition (Affine function)

A function f : K^m → K^n is affine if there exists a vector ~b ∈ K^n and a matrix A ∈ K^(n×m) such that:

∀~x ∈ K^m, f(~x) = A~x + ~b

Definition (Affine half-space)

An affine half-space of K^m (affine constraint) is defined as the set of points:

{~x ∈ K^m | ~a.~x ≤ b},  where ~a ∈ K^m and b ∈ K


Polyhedron (Implicit Representation)

Definition (Polyhedron)

A set P ⊆ K^m is a polyhedron if there exists a system of finitely many inequalities A~x ≤ ~b such that:

P = {~x ∈ K^m | A~x ≤ ~b}

Equivalently, it is the intersection of finitely many half-spaces.

Definition (Polytope)

A polytope is a bounded polyhedron.


Integer Polyhedron

Definition (Z-polyhedron)

A polyhedron all of whose extreme points are integer valued

Definition (Integer hull)

The integer hull of a rational polyhedron P is the convex hull of the integer points contained in P.

For the moment, we will "say" an integer polyhedron is a polyhedron of integer points (language abuse)


Rational and Integer Polytopes


Another View of Polyhedra

The dual representation models a polyhedron as a combination of lines L and rays R (forming the polyhedral cone) and vertices V (forming the polytope)

Definition (Dual representation)

P : {~x ∈ Q^n | ~x = L~λ + R~µ + V~ν,  ~µ ≥ 0,  ~ν ≥ 0,  Σi νi = 1}

Definition (Face)

A face F of P is the intersection of P with a supporting hyperplane of P. We have:

dim(F) ≤ dim(P)

Definition (Facet)

A facet F of P is a face of P such that:

dim(F) = dim(P) − 1


Some Equivalence Properties

Theorem (Fundamental Theorem on Polyhedral Decomposition)

If P is a polyhedron, then it can be decomposed as a polytope V plus a polyhedral cone L.

Theorem (Equivalence of Representations)

Every polyhedron has both an implicit and a dual representation

- Chernikova’s algorithm can compute the dual representation from the implicit one
- The dual representation is heavily used in polyhedral compilation
- Some works operate on the constraint-based representation (Pluto)


Parametric Polyhedra

Definition (Parametric polyhedron)

Given ~n the vector of symbolic parameters, P is a parametric polyhedron if it is defined by:

P = {~x ∈ K^m | A~x ≤ B~n + ~b}

- Requires adapting theory and tools to parameters
- Can become nasty: case distinctions (QUAST)
- Reflects nicely the program context


Some Useful Algorithms

All extended to parametric polyhedra:
- Compute the facets of a polytope: PolyLib [Wilde et al.]
- Compute the volume of a polytope (number of points): Barvinok [Clauss/Verdoolaege]
- Scan a polytope (code generation): CLooG [Quilleré/Bastoul]
- Find the lexicographic minimum: PIP [Feautrier]


Operations on Polyhedra

Definition (Intersection)

The intersection of two convex sets P1 and P2 is a convex set P :

P = {~x ∈ K^m | ~x ∈ P1 ∧ ~x ∈ P2}

Definition (Union)

The union of two convex sets P1 and P2 is a set P :

P = {~x ∈ K^m | ~x ∈ P1 ∨ ~x ∈ P2}

The union of two convex sets may not be a convex set


Lattices

Definition (Lattice)

A subset L in Qn is a lattice if is generated by integral combination of finitelymany vectors: a1,a2, . . . ,an (ai ∈Qn). If the ai vectors have integralcoordinates, L is an integer lattice.

Definition (Z-polyhedron)

A Z-polyhedron is the intersection of a polyhedron and an affine, integral, full-dimensional lattice.


Pictured Example

Example of a Z-polyhedron:
- Q1 = {(i, j) | 0 ≤ i ≤ 5, 0 ≤ 3j ≤ 20}
- L1 = {(2i+1, 3j+5) | i, j ∈ Z}
- Z1 = Q1 ∩ L1


Quick Facts on Z-polyhedra

- Iteration domains are in fact Z-polyhedra with the unit lattice
- Intersection of Z-polyhedra is not convex in general
- Union is complex to compute
- Can count points, can optimize, can scan

- Implementations available for most operations in PolyLib


Image and Preimage

Definition (Image)

The image of a polyhedron P ⊆ Z^n by an affine function f : Z^n → Z^m is a Z-polyhedron P′:

P′ = { f(~x) ∈ Z^m | ~x ∈ P }

Definition (Preimage)

The preimage of a polyhedron P ⊆ Z^m by an affine function f : Z^n → Z^m is a Z-polyhedron P′:

P′ = { ~x ∈ Z^n | f(~x) ∈ P }

We have Image(f^−1, P) = Preimage(f, P) if f is invertible.
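On finite point sets the two definitions can be played with directly; a small Python sketch (the affine map f and the bounds are arbitrary illustrative choices):

```python
# Image and preimage of a finite set of integer points under an affine map.
# Here f(x, y) = (x + y, 2y) is just an illustrative affine function.
def f(p):
    x, y = p
    return (x + y, 2 * y)

P = {(x, y) for x in range(3) for y in range(3)}       # a tiny square "domain"
image = {f(p) for p in P}                               # { f(x) | x in P }
candidates = {(x, y) for x in range(6) for y in range(6)}
preimage = {p for p in candidates if f(p) in P}         # { x | f(x) in P }
```

Note the image lands only on points with even second coordinate: the lattice structure announced by the Z-polyhedron definition.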

UCLA 35


The Polyhedral Model: Math Corner Polyhedral School

Relation Between Image, Preimage and Z-polyhedra

I The image of a Z-polyhedron by a unimodular function is a Z-polyhedron

I The preimage of a Z-polyhedron by an affine function is a Z-polyhedron

I The image of a polyhedron by an affine invertible function is a Z-polyhedron

I The image by a non-invertible function is not a Z-polyhedron

UCLA 36


The Polyhedral Model: Math Corner Polyhedral School

Program Transformations

UCLA 37


The Polyhedral Model: Modeling Programs Polyhedral School

What Can Be Modeled?

Exact vs. approximate representation:

I Exact representation of iteration domains
  I Static control flow
  I Affine loop bounds (includes min/max/integer division)
  I Affine conditionals (conjunction/disjunction)

I Approximate representation of iteration domains
  I Use affine over-approximations, predicate statement executions
  I Full-function support

UCLA 38


Program Transformation: Polyhedral School

Key Intuition

I Programs are represented with geometric shapes

I Transforming a program is about modifying those shapes
  I rotation, skewing, stretching, ...

I But we need here to assume some order to scan points

UCLA 39


Program Transformation: Polyhedral School

Affine Transformations

Interchange Transformation
The transformation matrix is the identity with a permutation of two rows.

(a) original polyhedron  A~x + ~a ≥ ~0:

    [  1  0 ]           [ -1 ]
    [ -1  0 ]  ( i )  + [  2 ]  ≥ ~0        (i.e. 1 ≤ i ≤ 2, 1 ≤ j ≤ 3)
    [  0  1 ]  ( j )    [ -1 ]
    [  0 -1 ]           [  3 ]

(b) transformation function  ~y = T~x:

    ( i' )   [ 0 1 ]  ( i )
    ( j' ) = [ 1 0 ]  ( j )

(c) target polyhedron  (A T^-1)~y + ~a ≥ ~0:

    [  0  1 ]            [ -1 ]
    [  0 -1 ]  ( i' )  + [  2 ]  ≥ ~0       (i.e. 1 ≤ j' ≤ 2, 1 ≤ i' ≤ 3)
    [  1  0 ]  ( j' )    [ -1 ]
    [ -1  0 ]            [  3 ]

do i = 1, 2                      do i' = 1, 3
  do j = 1, 3         =⇒           do j' = 1, 2
    S(i,j)                            S(i=j',j=i')

UCLA 40
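The effect of the interchange can be checked pointwise; a small Python sketch with the example's bounds hard-coded:

```python
# Apply T = [[0, 1], [1, 0]] (interchange) to the original iteration domain
# {1 <= i <= 2, 1 <= j <= 3} and compare with the target domain scanned by
# the transformed loop nest {1 <= i' <= 3, 1 <= j' <= 2}.
original = [(i, j) for i in range(1, 3) for j in range(1, 4)]

def T(p):
    i, j = p
    return (j, i)   # (i', j') = (j, i)

target = [(ip, jp) for ip in range(1, 4) for jp in range(1, 3)]
assert sorted(T(p) for p in original) == sorted(target)
```

The same six instances are executed; only the order in which the points are scanned changes.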


Program Transformation: Polyhedral School

Affine Transformations

Reversal Transformation
The transformation matrix is the identity with one diagonal element replaced by −1.

(a) original polyhedron  A~x + ~a ≥ ~0:

    [  1  0 ]           [ -1 ]
    [ -1  0 ]  ( i )  + [  2 ]  ≥ ~0        (i.e. 1 ≤ i ≤ 2, 1 ≤ j ≤ 3)
    [  0  1 ]  ( j )    [ -1 ]
    [  0 -1 ]           [  3 ]

(b) transformation function  ~y = T~x:

    ( i' )   [ -1 0 ]  ( i )
    ( j' ) = [  0 1 ]  ( j )

(c) target polyhedron  (A T^-1)~y + ~a ≥ ~0:

    [ -1  0 ]            [ -1 ]
    [  1  0 ]  ( i' )  + [  2 ]  ≥ ~0       (i.e. -2 ≤ i' ≤ -1, 1 ≤ j' ≤ 3)
    [  0  1 ]  ( j' )    [ -1 ]
    [  0 -1 ]            [  3 ]

do i = 1, 2                      do i' = -2, -1
  do j = 1, 3         =⇒           do j' = 1, 3
    S(i,j)                            S(i=-i',j=j')

UCLA 40


Program Transformation: Polyhedral School

Affine Transformations

Compound Transformation
The transformation matrix is the composition of an interchange and a reversal.

(a) original polyhedron  A~x + ~a ≥ ~0:

    [  1  0 ]           [ -1 ]
    [ -1  0 ]  ( i )  + [  2 ]  ≥ ~0        (i.e. 1 ≤ i ≤ 2, 1 ≤ j ≤ 3)
    [  0  1 ]  ( j )    [ -1 ]
    [  0 -1 ]           [  3 ]

(b) transformation function  ~y = T~x:

    ( i' )   [ 0 -1 ]  ( i )
    ( j' ) = [ 1  0 ]  ( j )

(c) target polyhedron  (A T^-1)~y + ~a ≥ ~0:

    [  0  1 ]            [ -1 ]
    [  0 -1 ]  ( i' )  + [  2 ]  ≥ ~0       (i.e. -3 ≤ i' ≤ -1, 1 ≤ j' ≤ 2)
    [ -1  0 ]  ( j' )    [ -1 ]
    [  1  0 ]            [  3 ]

do i = 1, 2                      do i' = -3, -1
  do j = 1, 3         =⇒           do j' = 1, 2
    S(i,j)                            S(i=j',j=-i')

UCLA 40


Program Transformation: Polyhedral School

So, What is This Matrix?

I We know how to generate code for arbitrary matrices with integer coefficients
  I Arbitrary number of rows (but fixed number of columns)
  I Arbitrary value for the coefficients
I Through code generation, the number of dynamic instances is preserved
  I But this is not true for the transformed polyhedra!

Some classification:
I The matrix is unimodular
I The matrix is full rank and invertible
I The matrix is full rank
I The matrix has only integral coefficients
I The matrix has rational coefficients
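The first class can be tested mechanically: a square integer matrix is unimodular when the absolute value of its determinant is 1. A small Python sketch:

```python
# |det T| == 1 means T is unimodular: its inverse is also integral, so it
# maps Z^n bijectively onto Z^n and preserves the integer points exactly.
def det(m):
    # Laplace expansion along the first row; fine for tiny matrices.
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** c * m[0][c] * det([r[:c] + r[c + 1:] for r in m[1:]])
               for c in range(len(m)))

interchange = [[0, 1], [1, 0]]
reversal = [[-1, 0], [0, 1]]
scaling = [[2, 0], [0, 1]]     # integral and full rank, but not unimodular

assert abs(det(interchange)) == 1 and abs(det(reversal)) == 1
assert abs(det(scaling)) == 2
```

Interchange and reversal (and their compositions) are unimodular; a scaling by 2 is not, which is why it does not preserve the set of integer points of the polyhedron.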

UCLA 41


Program Transformation: Polyhedral School

A Reverse History Perspective

1 CLooG: arbitrary matrix
2 Affine Mappings
3 Unimodular framework
4 SARE
5 SURE

UCLA 42


Program Transformation: Polyhedral School

Program Transformations

Original Schedule

for (i = 0; i < n; ++i)
  for (j = 0; j < n; ++j) {
S1: C[i][j] = 0;
    for (k = 0; k < n; ++k)
S2:   C[i][j] += A[i][k]*B[k][j];
  }

ΘS1.~xS1 = [ 1 0 0 0 ] . (i j n 1)T
           [ 0 1 0 0 ]

ΘS2.~xS2 = [ 1 0 0 0 0 ] . (i j k n 1)T
           [ 0 1 0 0 0 ]
           [ 0 0 1 0 0 ]

Generated code (identical to the original):

for (i = 0; i < n; ++i)
  for (j = 0; j < n; ++j) {
    C[i][j] = 0;
    for (k = 0; k < n; ++k)
      C[i][j] += A[i][k]*B[k][j];
  }

I Represent Static Control Parts (control flow and dependences must be statically computable)
I Use code generator (e.g. CLooG) to generate C code from polyhedral representation (provided iteration domains + schedules)

UCLA 43
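The schedules above can be read as timestamp assignments: executing all instances in lexicographic order of their Θ-vectors reproduces the original execution order. A Python sketch with n fixed to a small value:

```python
# Each instance gets the timestamp computed by its schedule; sorting the
# timestamps lexicographically yields the execution order.
n = 2
events = []
for i in range(n):
    for j in range(n):
        events.append(((i, j), ("S1", i, j)))            # ΘS1(i,j) = (i, j)
        for k in range(n):
            events.append(((i, j, k), ("S2", i, j, k)))  # ΘS2(i,j,k) = (i, j, k)

events.sort(key=lambda e: e[0])   # (i, j) sorts before (i, j, k): S1 first
order = [e[1] for e in events]
```

Python's tuple comparison is exactly the lexicographic order used here, with a shorter prefix ordered first, so S1(i,j) precedes every S2(i,j,k).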


Program Transformation: Polyhedral School

Program Transformations

Distribute loops

for (i = 0; i < n; ++i)
  for (j = 0; j < n; ++j) {
S1: C[i][j] = 0;
    for (k = 0; k < n; ++k)
S2:   C[i][j] += A[i][k]*B[k][j];
  }

ΘS1.~xS1 = [ 1 0 0 0 ] . (i j n 1)T
           [ 0 1 0 0 ]

ΘS2.~xS2 = [ 1 0 0 1 0 ] . (i j k n 1)T
           [ 0 1 0 0 0 ]
           [ 0 0 1 0 0 ]

Generated code:

for (i = 0; i < n; ++i)
  for (j = 0; j < n; ++j)
    C[i][j] = 0;
for (i = n; i < 2*n; ++i)
  for (j = 0; j < n; ++j)
    for (k = 0; k < n; ++k)
      C[i-n][j] += A[i-n][k]*B[k][j];

I All instances of S1 are executed before the first S2 instance

UCLA 43


Program Transformation: Polyhedral School

Program Transformations

Distribute loops + Interchange loops for S2

for (i = 0; i < n; ++i)
  for (j = 0; j < n; ++j) {
S1: C[i][j] = 0;
    for (k = 0; k < n; ++k)
S2:   C[i][j] += A[i][k]*B[k][j];
  }

ΘS1.~xS1 = [ 1 0 0 0 ] . (i j n 1)T
           [ 0 1 0 0 ]

ΘS2.~xS2 = [ 0 0 1 1 0 ] . (i j k n 1)T
           [ 0 1 0 0 0 ]
           [ 1 0 0 0 0 ]

Generated code:

for (i = 0; i < n; ++i)
  for (j = 0; j < n; ++j)
    C[i][j] = 0;
for (k = n; k < 2*n; ++k)
  for (j = 0; j < n; ++j)
    for (i = 0; i < n; ++i)
      C[i][j] += A[i][k-n]*B[k-n][j];

I The outer-most loop for S2 becomes k

UCLA 43


Program Transformation: Polyhedral School

Program Transformations

Illegal schedule

for (i = 0; i < n; ++i)
  for (j = 0; j < n; ++j) {
S1: C[i][j] = 0;
    for (k = 0; k < n; ++k)
S2:   C[i][j] += A[i][k]*B[k][j];
  }

ΘS1.~xS1 = [ 1 0 1 0 ] . (i j n 1)T
           [ 0 1 0 0 ]

ΘS2.~xS2 = [ 0 0 1 0 0 ] . (i j k n 1)T
           [ 0 1 0 0 0 ]
           [ 1 0 0 0 0 ]

Generated code:

for (k = 0; k < n; ++k)
  for (j = 0; j < n; ++j)
    for (i = 0; i < n; ++i)
      C[i][j] += A[i][k]*B[k][j];
for (i = n; i < 2*n; ++i)
  for (j = 0; j < n; ++j)
    C[i-n][j] = 0;

I All instances of S1 are executed after the last S2 instance

UCLA 43


Program Transformation: Polyhedral School

Program Transformations

A legal schedule

for (i = 0; i < n; ++i)
  for (j = 0; j < n; ++j) {
S1: C[i][j] = 0;
    for (k = 0; k < n; ++k)
S2:   C[i][j] += A[i][k]*B[k][j];
  }

ΘS1.~xS1 = [ 1 0 1 0 ] . (i j n 1)T
           [ 0 1 0 0 ]

ΘS2.~xS2 = [ 0 0 1 1 1 ] . (i j k n 1)T
           [ 0 1 0 0 0 ]
           [ 1 0 0 0 0 ]

Generated code:

for (i = n; i < 2*n; ++i)
  for (j = 0; j < n; ++j)
    C[i-n][j] = 0;
for (k = n+1; k <= 2*n; ++k)
  for (j = 0; j < n; ++j)
    for (i = 0; i < n; ++i)
      C[i][j] += A[i][k-n-1]*B[k-n-1][j];

I Delay the S2 instances
I Constraints must be expressed between ΘS1 and ΘS2

UCLA 43


Program Transformation: Polyhedral School

Program Transformations

Implicit fine-grain parallelism

for (i = 0; i < n; ++i)
  for (j = 0; j < n; ++j) {
S1: C[i][j] = 0;
    for (k = 0; k < n; ++k)
S2:   C[i][j] += A[i][k]*B[k][j];
  }

ΘS1.~xS1 = ( 1 0 0 0 ) . (i j n 1)T

ΘS2.~xS2 = ( 0 0 1 1 0 ) . (i j k n 1)T

Generated code:

for (i = 0; i < n; ++i)
  pfor (j = 0; j < n; ++j)
    C[i][j] = 0;
for (k = n; k < 2*n; ++k)
  pfor (j = 0; j < n; ++j)
    pfor (i = 0; i < n; ++i)
      C[i][j] += A[i][k-n]*B[k-n][j];

I Fewer (linear) rows than the loop depth → the remaining dimensions are parallel

UCLA 43


Program Transformation: Polyhedral School

Program Transformations

Representing a schedule

for (i = 0; i < n; ++i)
  for (j = 0; j < n; ++j) {
S1: C[i][j] = 0;
    for (k = 0; k < n; ++k)
S2:   C[i][j] += A[i][k]*B[k][j];
  }

ΘS1.~xS1 = [ 1 0 1 0 ] . (i j n 1)T
           [ 0 1 0 0 ]

ΘS2.~xS2 = [ 0 0 1 1 1 ] . (i j k n 1)T
           [ 0 1 0 0 0 ]
           [ 1 0 0 0 0 ]

Generated code:

for (i = n; i < 2*n; ++i)
  for (j = 0; j < n; ++j)
    C[i-n][j] = 0;
for (k = n+1; k <= 2*n; ++k)
  for (j = 0; j < n; ++j)
    for (i = 0; i < n; ++i)
      C[i][j] += A[i][k-n-1]*B[k-n-1][j];

Both schedules can be packed into a single matrix applied to the concatenated iterators, parameters ~p and constants:

Θ.~x = [ 1 0 0 0 1 1 1 0 1 ] . ( i j i j k n n 1 1 )T
       [ 0 1 0 1 0 0 0 0 0 ]
       [ 0 0 1 0 0 0 0 0 0 ]

(the first two columns are the iterators of S1, the next three those of S2, then the parameter and constant columns of each statement)

UCLA 43


Program Transformation: Polyhedral School

Program Transformations

Representing a schedule

Part of Θ   Transformation   Description

~ı          reversal         Changes the direction in which a loop traverses its iteration range
            skewing          Makes the bounds of a given loop depend on an outer loop counter
            interchange      Exchanges two loops in a perfectly nested loop, a.k.a. permutation

~p          fusion           Fuses two loops, a.k.a. jamming
            distribution     Splits a single loop nest into many, a.k.a. fission or splitting

c           peeling          Extracts one iteration of a given loop
            shifting         Allows to reorder loops

UCLA 43


Program Transformation: Polyhedral School

Fusion in the Polyhedral Model

Iteration space: 0 ... N for both Blue and Red.

for (i = 0; i <= N; ++i) {
  Blue(i);
  Red(i);
}

Perfectly aligned fusion

UCLA 44


Program Transformation: Polyhedral School

Fusion in the Polyhedral Model

Iteration space: 0 ... N, with Red shifted by one iteration.

Blue(0);
for (i = 1; i <= N; ++i) {
  Blue(i);
  Red(i-1);
}
Red(N);

Fusion with shift of 1
Not all instances are fused

UCLA 44


Program Transformation: Polyhedral School

Fusion in the Polyhedral Model

Iteration space: 0 ... N+P, with Red shifted by P iterations.

for (i = 0; i < P; ++i)
  Blue(i);
for (i = P; i <= N; ++i) {
  Blue(i);
  Red(i-P);
}
for (i = N+1; i <= N+P; ++i)
  Red(i-P);

Fusion with parametric shift of P
Automatic generation of prolog/epilog code

UCLA 44
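The effect of the parametric shift can be traced for concrete values of N and P; a Python sketch mirroring the prolog/kernel/epilog structure of the code above:

```python
# Fusion with a parametric shift: Red is delayed by P iterations, so a
# prolog (only Blue) and an epilog (only Red) surround the fused kernel.
def fused_with_shift(N, P):
    trace = []
    for i in range(0, P):                 # prolog: only Blue
        trace.append(("Blue", i))
    for i in range(P, N + 1):             # fused kernel
        trace.append(("Blue", i))
        trace.append(("Red", i - P))
    for i in range(N + 1, N + P + 1):     # epilog: only Red
        trace.append(("Red", i - P))
    return trace

t = fused_with_shift(N=4, P=2)
# Every instance is executed exactly once, and Blue(i) still precedes Red(i).
```

Each of Blue and Red still runs for i = 0 .. N in order; only the interleaving changes.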


Program Transformation: Polyhedral School

Fusion in the Polyhedral Model

Many other transformations may be required to enable fusion: interchange, skewing, etc.

UCLA 44


Program Transformation: Polyhedral School

Scheduling Matrix

Definition (Affine multidimensional schedule)

Given a statement S, an affine schedule ΘS of dimension m is an affine form on the d outer loop iterators ~xS and the p global parameters ~n. ΘS ∈ Z^(m×(d+p+1)) can be written as:

            [ θ1,1  . . .  θ1,d+p+1 ]   ( ~xS )
ΘS(~xS)  =  [  ...   ...     ...    ] . (  ~n )
            [ θm,1  . . .  θm,d+p+1 ]   (  1  )

ΘS_k denotes the kth row of ΘS.

Definition (Bounded affine multidimensional schedule)

ΘS is a bounded schedule if θS_i,j ∈ [x, y] with x, y ∈ Z

UCLA 45


Program Transformation: Polyhedral School

Another Representation

One can separate the coefficients of Θ into:
1 The iterator coefficients
2 The parameter coefficients
3 The constant coefficients

One can also enforce the schedule dimension to be 2d+1:
I A d-dimensional square matrix for the linear part
  I represents composition of interchange/skewing/slowing
I A d×n matrix for the parametric part
  I represents (parametric) shift
I A (d+1)-vector for the scalar offset
  I represents statement interleaving
I See URUK for instance

UCLA 46


Program Transformation: Polyhedral School

Computing the 2d+1 Identity Schedule

ΘS(~xS) = TS ~xS + ~tS

Example (statements S1, S2, S3 of the pictured nest; S4–S6 follow the same pattern):

ΘS1(~xS1) = (0, i, 0)
ΘS2(~xS2) = (0, i, 1, j, 0)
ΘS3(~xS3) = (0, i, 2)

UCLA 47
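A 2d+1 identity schedule interleaves the static position of a statement at each depth (the scalar components) with its d loop iterators; a Python sketch of how the example vectors above are built:

```python
# Build a 2d+1 identity schedule from the statement's static positions
# (one per depth, including the innermost) and its surrounding iterators.
def identity_schedule(betas, iterators):
    sched = []
    for b, it in zip(betas, iterators):
        sched += [b, it]          # interleave position and iterator
    sched.append(betas[len(iterators)])   # innermost static position
    return tuple(sched)

assert identity_schedule([0, 0], ["i"]) == (0, "i", 0)                  # S1
assert identity_schedule([0, 1, 0], ["i", "j"]) == (0, "i", 1, "j", 0)  # S2
assert identity_schedule([0, 2], ["i"]) == (0, "i", 2)                  # S3
```

The iterators are kept symbolic here; in the matrix view these positions are exactly the scalar rows and the iterators the linear rows.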


Program Transformation: Polyhedral School

Transformation Catalogue [1/2]

28 Girbal, Vasilache et al.

CUTDOM constrains a domain with an additional inequality, in the form of a vector c with the same dimension as a row of matrix Λ.

EXTEND inserts a new intermediate loop level at depth ℓ, initially restricted to a single iteration. This new iterator will be used in following code transformations.

ADDLOCALVAR inserts a fresh local variable to the domain and to the access functions. This local variable is typically used by CUTDOM.

PRIVATIZE and CONTRACT implement basic forms of array privatization and contraction, respectively, considering dimension ℓ of the array. Privatization needs an additional parameter s, the size of the additional dimension; s is required to update the array declaration (it cannot be inferred in general, since some references may not be affine). These primitives are simple examples updating the data layout and array access functions.

This table is not complete (e.g., it lacks index-set splitting and data-layout transformations), but it demonstrates the expressiveness of the unified representation.

(Table of transformation primitives: UNIMODULAR, SHIFT, CUTDOM, EXTEND, ADDLOCALVAR, PRIVATIZE, CONTRACT, FUSION, FISSION, MOTION.)

Fig. 18. Some classical transformation primitives

Primitives operate on the program representation while maintaining the structure of the polyhedral components (the invariants). Despite their familiar names, the primitives' practical outcome on the program representation is widely extended compared to their syntactic counterparts. Indeed, transformation primitives like fusion or interchange apply to sets of statements…

UCLA 48


Program Transformation: Polyhedral School

Transformation Catalogue [2/2]

Girbal, Vasilache et al.

(Table composing primitives: INTERCHANGE, SKEW, REVERSE, STRIPMINE and TILE, each expressed with UNIMODULAR, EXTEND and CUTDOM; e.g. TILE(P,o,k1,k2) strip-mines the outer loop, strip-mines the inner loop, then interchanges.)

Fig. 19. Composition of transformation primitives

3.2. Implementing Loop Unrolling

In the context of code optimization, one of the most important transformations is loop unrolling. A naive implementation of unrolling with statement duplications may result in severe complexity overhead for further transformations and for the code generation algorithm (its separation algorithm is exponential in the number of statements, in the worst case). Instead of implementing loop unrolling in the intermediate representation of our framework, we delay it to the code generation phase and perform full loop unrolling in a lazy way. This strategy is fully implemented in the code generation phase and is triggered by annotations (carrying depth information) of the statements whose surrounding loops need to be unrolled; unrolling occurs in the separation algorithm of the code generator (22) when all the statements being printed out are marked for unrolling at the current depth.

Practically, in most cases, loop unrolling by a factor b can be implemented as a combination of strip-mining (by a factor b) and full unrolling (6). Strip-mining itself may be implemented in several ways in a polyhedral setting. Following our earlier work in (7) and calling b the strip-mining factor, we choose to model a strip-mined loop by dividing the iteration span of the outer loop by b instead of leaving the bounds unchanged and inserting a non-unit stride b, see Figure 20.

UCLA 49


Program Transformation: Polyhedral School

Some Final Observations

Some classical pitfalls:
I The number of rows of Θ does not correspond to actual parallel levels
I Scalar rows vs. linear rows
I Linear independence
I Parametric shift for domains without parametric bound
I ...

UCLA 50


Program Transformation: Polyhedral School

Data Dependences

UCLA 51


Data Dependences: Polyhedral School

Purpose of Dependence Analysis

I Not all program transformations preserve the semantics
I Semantics is preserved if the dependences are preserved

I In standard frameworks, it usually means reordering statements/loops
  I Statements containing dependent references should not be executed in a different order
  I Granularity: usually a reference to an array

I In the polyhedral framework, it means reordering statement instances
  I Statement instances containing dependent references should not be executed in a different order
  I Granularity: a reference to an array cell

UCLA 52


Data Dependences: Polyhedral School

Example

Exercise: Compute the set of cells of A accessed

Example

for (i = 0; i < N; ++i)
  for (j = i; j < N; ++j)
    A[2*i + 3][4*j] = i * j;

I DS: {i, j | 0 ≤ i < N, i ≤ j < N}
I Function: fA : (i, j) → (2i+3, 4j)
I Image(fA, DS) is the set of cells of A accessed (a Z-polyhedron):
  I Polyhedron: Q : {i, j | 3 ≤ i < 2N+2, 0 ≤ j < 4N}
  I Lattice: L : {2i+3, 4j | i, j ∈ Z}
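For a fixed N the exercise can be checked by brute force; a Python sketch enumerating the accessed cells and testing them against Q and L:

```python
# Enumerate the cells of A accessed by the loop nest above for a fixed N,
# and check that every cell lies in the stated polyhedron Q and lattice L.
N = 5
DS = [(i, j) for i in range(N) for j in range(i, N)]       # 0 <= i <= j < N
cells = {(2 * i + 3, 4 * j) for (i, j) in DS}              # fA(i, j)

in_Q = lambda x, y: 3 <= x < 2 * N + 2 and 0 <= y < 4 * N
in_L = lambda x, y: (x - 3) % 2 == 0 and y % 4 == 0
assert all(in_Q(x, y) and in_L(x, y) for (x, y) in cells)
```

Note the converse does not hold: because of the triangular constraint i ≤ j, not every point of Q ∩ L is actually accessed.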

UCLA 53


Data Dependences: Polyhedral School

Data Dependence

Definition (Bernstein conditions)

Given two references, there exists a dependence between them if the three following conditions hold:

I they reference the same array (cell)
I at least one of these accesses is a write
I the two associated statements are executed

Three categories of dependences:
I RAW (Read-After-Write, aka flow): first reference writes, second reads
I WAR (Write-After-Read, aka anti): first reference reads, second writes
I WAW (Write-After-Write, aka output): both references write

Another kind: RAR (Read-After-Read) dependences, used for locality/reuse computations
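The classification above follows directly from which of the two conflicting accesses writes; a minimal Python sketch:

```python
# Bernstein's conditions for two accesses to the same array cell:
# a dependence exists when at least one of the two accesses is a write.
def dependence_kind(first_is_write, second_is_write):
    if first_is_write and second_is_write:
        return "WAW"   # output dependence
    if first_is_write:
        return "RAW"   # flow: first writes, second reads
    if second_is_write:
        return "WAR"   # anti: first reads, second writes
    return None        # read-after-read: no ordering constraint

assert dependence_kind(True, False) == "RAW"
```

Returning None for two reads matches the remark above: RAR pairs impose no ordering constraint but are still useful for locality/reuse analysis.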

UCLA 54


Data Dependences: Polyhedral School

Some Terminology on Dependence Relations

We categorize the dependence relations in three kinds:

I Uniform dependences: the distance between two dependent iterations is a constant
  I ex: i → i+1
  I ex: i, j → i+1, j+1

I Non-uniform dependences: the distance between two dependent iterations varies during the execution
  I ex: i → i+j
  I ex: i → 2i

I Parametric dependences: at least a parameter is involved in the dependence relation
  I ex: i → i+N
  I ex: i+N → j+M

UCLA 55


Data Dependences: Dependence Polyhedra Polyhedral School

Dependence Polyhedron [1/3]

Principle: model all pairs of instances in dependence

Definition (Dependence of statement instances)

A statement S depends on a statement R (written R→ S) if there exist an operation S(~xS), an operation R(~xR) and a memory location m such that:

1 S(~xS) and R(~xR) refer to the same memory location m, and at least one of them writes to that location,
2 ~xS and ~xR belong to the iteration domains of S and R,
3 in the original sequential order, R(~xR) is executed before S(~xS).

UCLA 56


Data Dependences: Dependence Polyhedra Polyhedral School

Dependence Polyhedron [2/3]

1 Same memory location: equality of the subscript functions of a pair of references to the same array: FS~xS + aS = FR~xR + aR.

2 Iteration domains: both S and R iteration domains can be described using affine inequalities, respectively AS~xS + cS ≥ 0 and AR~xR + cR ≥ 0.

3 Precedence order: each case corresponds to a common loop depth, and is called a dependence level.

For each dependence level l, the precedence constraints are the equality of the loop index variables at depth less than l: xR,i = xS,i for i < l, and xR,l > xS,l if l is less than the common nesting loop level. Otherwise, there is no additional constraint and the dependence only exists if S is textually before R.

Such constraints can be written using linear inequalities: Pl,S~xS − Pl,R~xR + b ≥ 0.

UCLA 57


Data Dependences: Dependence Polyhedra Polyhedral School

Dependence Polyhedron [3/3]

The dependence polyhedron for R→ S at a given level l and for a given pair of references fR, fS is described as [Feautrier/Bastoul]:

                   [ FS  −FR ]             [ aS − aR ]   (= 0)
DR,S,fR,fS,l :     [ AS   0  ]  ( ~xS ) +  [ cS      ]   (≥ ~0)
                   [ 0    AR ]  ( ~xR )    [ cR      ]   (≥ ~0)
                   [ PS  −PR ]             [ b       ]   (≥ ~0)

A few properties:
I We can always build the dependence polyhedron for a given pair of affine array accesses (it is convex)
I It is exact if the iteration domain and the access functions are also exact
I It is over-approximated if the iteration domain or the access function is an approximation

UCLA 58


Data Dependences: Dependence Polyhedra Polyhedral School

Polyhedral Dependence Graph

Definition (Polyhedral Dependence Graph)

The Polyhedral Dependence Graph is a multi-graph with one vertex per syntactic program statement Si, with edges Si → Sj for each dependence polyhedron DSi,Sj.

UCLA 59


Data Dependences: Dependence Polyhedra Polyhedral School

Scheduling

UCLA 60


Affine Scheduling: Polyhedral School

Affine Scheduling

Definition (Affine schedule)

Given a statement S, a p-dimensional affine schedule ΘS is an affine form on the outer loop iterators ~xS and the global parameters ~n. It is written:

                 ( ~xS )
ΘS(~xS) = TS  .  (  ~n ),      TS ∈ K^(p × (dim(~xS)+dim(~n)+1))
                 (  1  )

I A schedule assigns a timestamp to each executed instance of a statement
I If TS is a vector, then ΘS is a one-dimensional schedule
I If TS is a matrix, then ΘS is a multidimensional schedule

I Question: does it translate to sequential loops?

UCLA 61


Affine Scheduling: Polyhedral School

Legal Program Transformation

Definition (Precedence condition)

Given Θ^R a schedule for the instances of R, Θ^S a schedule for the instances
of S. Θ^R and Θ^S are legal schedules if ∀⟨~xR, ~xS⟩ ∈ DR,S:

  Θ^R(~xR) ≺ Θ^S(~xS)

≺ denotes the lexicographic ordering:

  (a_1, …, a_n) ≺ (b_1, …, b_m) iff ∃i, 1 ≤ i ≤ min(n, m) s.t.
  (a_1, …, a_{i−1}) = (b_1, …, b_{i−1}) and a_i < b_i

UCLA 62
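The lexicographic ordering above is exactly what tuple comparison implements; a minimal sketch (the function name is mine):

```python
def lex_lt(a, b):
    """(a1,...,an) < (b1,...,bm) lexicographically: the first differing
    coordinate decides; tuples equal up to min(n, m) are not ordered."""
    for ai, bi in zip(a, b):
        if ai != bi:
            return ai < bi
    return False

# timestamps of two statement instances in 2-d time
assert lex_lt((1, 5), (2, 0))      # the first coordinate decides
assert lex_lt((1, 2), (1, 3))      # tie broken at the second level
assert not lex_lt((1, 2), (1, 2))  # equal timestamps are not ordered
```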


Affine Scheduling: Polyhedral School

Scheduling in the Polyhedral Model

Constraints:
- The schedule must respect the precedence condition, for all dependent
  instances
- Dependence constraints can be turned into constraints on the solution set

Scheduling:
- Among all possibilities, one has to be picked
- Finding an optimal solution requires considering all legal possible
  schedules
- Question: is this always true?

UCLA 63


Affine Scheduling: Polyhedral School

One-Dimensional Affine Schedules

For now, we focus on 1-d schedules

Example

for (i = 1; i < N; ++i)
  A[i] = A[i - 1] + A[i] + A[i + 1];

- Simple program: 1 loop, 1 polyhedral statement
- 2 dependences:
  - RAW: A[i] → A[i - 1]
  - WAR: A[i + 1] → A[i]

UCLA 64


Affine Scheduling: Polyhedral School

Checking the Legality of a Schedule

Exercise: given the dependence polyhedra, check if a schedule is legal

  D1 :  [ 1   1   0   0  −1 ]              D2 :  [ 1   1   0   0  −1 ]
        [ 1  −1   0   1  −1 ]  ( iS  )           [ 1  −1   0   1  −1 ]  ( iS  )
        [ 1   0   1   0   1 ]  ( i′S )           [ 1   0   1   0   1 ]  ( i′S )
        [ 1   0  −1   1  −1 ]  ( n   )           [ 1   0  −1   1  −1 ]  ( n   )
        [ 0   1  −1   0   1 ]  ( 1   )           [ 0   1  −1   0   1 ]  ( 1   )
        [ 1  −1   1   0  −1 ]                    [ 1  −1   1   0  −1 ]

Schedules to check:
1. Θ = i
2. Θ = −i

Solution: check for the emptiness of the polyhedron

  P :  [ D ; iS ⪰ i′S ] . ( iS, i′S, n, 1 )^T

where:
- iS ⪰ i′S gets the consumer instances scheduled after the producer ones
- For Θ = −i, it is −iS ⪰ −i′S, which is non-empty

UCLA 65
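The same check can be done by brute force for a fixed N: enumerate the dependent instance pairs of the one-loop example and test the precedence condition directly, instead of testing polyhedron emptiness (a sketch; names are mine):

```python
# A[i] = A[i-1] + A[i] + A[i+1], for i in [1, N): both the RAW and the WAR
# dependence relate iteration i to iteration i+1 (distance 1).
N = 8
deps = [(i, i + 1) for i in range(1, N - 1)]

def legal(theta):
    # precedence: the consumer instance must be scheduled strictly later
    return all(theta(tgt) - theta(src) >= 1 for src, tgt in deps)

assert legal(lambda i: i)        # 1. Θ = i  is legal
assert not legal(lambda i: -i)   # 2. Θ = -i reverses the loop: illegal
```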



Affine Scheduling: Polyhedral School

A (Naive) Scheduling Approach

- Pick a schedule for the program statements
- Check if it respects all dependences

This is called filtering

Limitations:
- How to use this in combination with an objective function?
- The density of legal 1-d affine schedules is low:

               matmult    locality   fir        h264       crout
  ~i-Bounds    −1,1       −1,1       0,1        −1,1       −3,3
  c-Bounds     −1,1       −1,1       0,3        0,4        −3,3
  #Sched.      1.9×10^4   5.9×10^4   1.2×10^7   1.8×10^8   2.6×10^15
  #Legal       6561       912        792        360        798

UCLA 66
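For the one-loop example above, the same filtering can be done exhaustively. With Θ = a·i + c and both coefficients bounded in [−1, 1] (bounds assumed here, in the style of the table), only the 3 candidates with a = 1 survive:

```python
from itertools import product

# dependences of: for (i = 1; i < N; ++i) A[i] = A[i-1] + A[i] + A[i+1];
# both have distance 1, so Theta(i+1) - Theta(i) = a must be >= 1
N = 8
deps = [(i, i + 1) for i in range(1, N - 1)]

candidates = list(product([-1, 0, 1], repeat=2))          # (a, c) pairs
legal = [(a, c) for a, c in candidates
         if all(a * tgt + c - (a * src + c) >= 1 for src, tgt in deps)]

assert len(candidates) == 9
assert legal == [(1, -1), (1, 0), (1, 1)]                 # only a = 1 is legal
```

Only 3 of 9 candidates are legal; the ratio collapses further as the number of statements and coefficients grows, which is why filtering does not scale.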


Affine Scheduling: Polyhedral School

Objectives for a Good Scheduling Algorithm

- Build a legal schedule!
- Embed some properties in this legal schedule:
  - latency: minimize the time of the last iteration
  - delay: minimize the time between the first and last iteration
  - parallelism / placement
  - permutability (for tiling)
  - ...

A possible "simple" two-step approach:
- Find the solution set of all legal affine schedules
- Find an ILP/PIP formulation for the objective function(s)

UCLA 67


Affine Scheduling: Polyhedral School

The Precedence Constraint (Again!)

Precedence constraint adapted to 1-d schedules:

Definition (Causality condition for schedules)

Given DR,S, Θ^R and Θ^S are legal iff for each pair of instances in dependence:

  Θ^R(~xR) < Θ^S(~xS)

Equivalently: ∆R,S = Θ^S(~xS) − Θ^R(~xR) − 1 ≥ 0

- All functions ∆R,S which are non-negative over the dependence polyhedron
  represent legal schedules
- For the instances which are not in dependence, we don't care
- First step: how to get all non-negative functions over a polyhedron?

UCLA 68


Affine Scheduling: Polyhedral School

Affine Form of the Farkas Lemma

Lemma (Affine form of Farkas lemma)

Let D be a nonempty polyhedron defined by A~x + ~b ≥ ~0. Then any affine
function f(~x) is non-negative everywhere in D iff it is a positive affine
combination:

  f(~x) = λ_0 + ~λ^T (A~x + ~b),  with λ_0 ≥ 0 and ~λ ≥ ~0

λ_0 and ~λ^T are called the Farkas multipliers.

UCLA 69


Affine Scheduling: Polyhedral School

The Farkas Lemma: Example

- Function: f(x) = a·x + b
- Domain of x: {1 ≤ x ≤ 3} → x − 1 ≥ 0, −x + 3 ≥ 0
- Farkas lemma: f(x) ≥ 0 ⇔ f(x) = λ_0 + λ_1(x − 1) + λ_2(−x + 3)

The system to solve:

  λ_1 − λ_2         = a
  λ_0 − λ_1 + 3λ_2  = b
  λ_0 ≥ 0,  λ_1 ≥ 0,  λ_2 ≥ 0

UCLA 70
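Working the system for a concrete choice, say f(x) = x (a = 1, b = 0): the equalities force λ_1 = 1 + λ_2 and λ_0 = λ_1 − 3λ_2, and taking λ_2 = 0 gives λ_0 = λ_1 = 1. A quick numeric check of that decomposition (a sketch):

```python
# f(x) = x on {1 <= x <= 3}: Farkas multipliers lam0 = 1, lam1 = 1, lam2 = 0
lam0, lam1, lam2 = 1, 1, 0
a, b = 1, 0

# the identification equations from the system above
assert lam1 - lam2 == a
assert lam0 - lam1 + 3 * lam2 == b
assert lam0 >= 0 and lam1 >= 0 and lam2 >= 0

# the decomposition reproduces f pointwise, and is non-negative on the domain
for x in [1, 1.5, 2, 3]:
    f = lam0 + lam1 * (x - 1) + lam2 * (-x + 3)
    assert f == a * x + b and f >= 0
```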

Affine Scheduling: One-dimensional Schedules Polyhedral School

Example: Semantics Preservation (1-D)

From the affine schedules down to the valid transformation coefficients:

  Legal Distinct Schedules / Affine Schedules
    → (causality condition, Farkas lemma) → Valid Farkas Multipliers
    → (identification, projection)        → Valid Transformation Coefficients

- Causality condition

Property (Causality condition for schedules)

Given R δ S, Θ^R and Θ^S are legal iff for each pair of instances in dependence:

  Θ^R(~xR) < Θ^S(~xS)

Equivalently: ∆R,S = Θ^S(~xS) − Θ^R(~xR) − 1 ≥ 0

- Farkas Lemma

Lemma (Affine form of Farkas lemma)

Let D be a nonempty polyhedron defined by A~x + ~b ≥ ~0. Then any affine
function f(~x) is non-negative everywhere in D iff it is a positive affine
combination:

  f(~x) = λ_0 + ~λ^T (A~x + ~b),  with λ_0 ≥ 0 and ~λ ≥ ~0.

λ_0 and ~λ^T are called the Farkas multipliers. The map from Farkas
multipliers to schedules is many-to-one.

- Identification

  Θ^S(~xS) − Θ^R(~xR) − 1 = λ_0 + ~λ^T ( DR,S (~xR, ~xS)^T + ~dR,S ) ≥ 0

Equating the coefficients of each variable on both sides, for DRδS:

  iR :  −t1^R             = λD1,1 − λD1,2 + λD1,3 − λD1,4
  iS :   t1^S             = −λD1,1 + λD1,2 + λD1,5 − λD1,6
  jS :   t2^S             = λD1,7 − λD1,8
  n  :   t3^S − t2^R      = λD1,4 + λD1,6 + λD1,8
  1  :   t4^S − t3^R − 1  = λD1,0

- Projection

- Solve the constraint system
- Use (purpose-optimized) Fourier-Motzkin projection algorithm:
  - Reduce redundancy
  - Detect implicit equalities

The space of valid transformation coefficients is in bijection with the
legal distinct schedules:

- One point in the space ⇔ one set of legal schedules w.r.t. the dependences

UCLA 71


Affine Scheduling: One-dimensional Schedules Polyhedral School

Scheduling Algorithm for Multiple Dependences

Algorithm:
- Compute the schedule constraints for each dependence
- Intersect all sets of constraints
- Output is a convex solution set of all legal one-dimensional schedules

- Computation is fast, but requires eliminating variables in a system of
  inequalities: projection
- Can be computed as soon as the dependence polyhedra are known

UCLA 72


Affine Scheduling: One-dimensional Schedules Polyhedral School

Objective Function

Idea: bound the latency of the schedule and minimize this bound

Theorem (Schedule latency bound)

If all domains are bounded, and if there exists at least one 1-d schedule Θ,
then there exists at least one affine form in the structure parameters:

  L = ~u·~n + w

such that:

  ∀~xR,  L ≥ Θ^R(~xR)

- Objective function: min { ~u, w | ~u·~n + w − Θ ≥ 0 }
- Subject to: Θ is a legal schedule, and θ_i ≥ 0
- In many cases, it is equivalent to take the lexicosmallest point in the
  polytope of non-negative legal schedules

UCLA 73


Affine Scheduling: One-dimensional Schedules Polyhedral School

Example

min { ~u, w | ~u·~n + w − Θ ≥ 0 } :  Θ^R = 0,  Θ^S = k + 1

Example

parfor (i = 0; i < N; ++i)
  parfor (j = 0; j < N; ++j)
    C[i][j] = 0;
for (k = 1; k < N + 1; ++k)
  parfor (i = 0; i < N; ++i)
    parfor (j = 0; j < N; ++j)
      C[i][j] += A[i][k-1] + B[k-1][j];

UCLA 74
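On the earlier one-loop example the same minimization can be done by enumeration for a fixed N: among the legal bounded schedules Θ = a·i + c with non-negative coefficients, a = 1, c = 0 minimizes the latency bound (a sketch; the coefficient bound [0, 3] is mine):

```python
from itertools import product

N = 10                                        # a fixed value of the parameter
dom = range(1, N)                             # iteration domain of the loop
deps = [(i, i + 1) for i in range(1, N - 1)]  # both dependences: distance 1

def theta(a, c):
    return lambda i: a * i + c

legal = [(a, c) for a, c in product(range(4), repeat=2)   # a, c in [0, 3]
         if all(theta(a, c)(t) - theta(a, c)(s) >= 1 for s, t in deps)]

# minimize the latency: the timestamp of the last iteration
best = min(legal, key=lambda ac: max(theta(*ac)(i) for i in dom))
assert best == (1, 0)
assert max(theta(*best)(i) for i in dom) == N - 1         # L = N - 1
```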


Affine Scheduling: One-dimensional Schedules Polyhedral School

Limitations of One-dimensional Schedules

- Not all programs have a legal one-dimensional schedule
- Question: does this program have a 1-d schedule?

Example

for (i = 1; i < N - 1; ++i)
  for (j = 1; j < N - 1; ++j)
    A[i][j] = A[i-1][j-1] + A[i+1][j] + A[i][j+1];

- Not all compositions of transformations are possible:
  - Interchange in inner-loops
  - Fusion / distribution of inner-loops

UCLA 75
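One way to answer the question for a fixed N is brute force: enumerate the dependent instance pairs, then search bounded schedules Θ = a·i + b·j. The search below (a sketch; the fixed N and the coefficient bounds are mine) does find a legal 1-d schedule, Θ = i + j:

```python
from itertools import product

N = 7
pts = [(i, j) for i in range(1, N - 1) for j in range(1, N - 1)]

def refs(i, j):   # writes, reads of one instance of the stencil statement
    return {(i, j)}, {(i - 1, j - 1), (i + 1, j), (i, j + 1)}

deps = []
for s in pts:
    for t in pts:
        if s < t:  # source executes before target in the original nest
            ws, rs = refs(*s)
            wt, rt = refs(*t)
            if (ws & rt) or (rs & wt) or (ws & wt):  # flow / anti / output
                deps.append((s, t))

def legal(a, b):
    return all(a * (ti - si) + b * (tj - sj) >= 1
               for (si, sj), (ti, tj) in deps)

assert legal(1, 1)                           # Θ = i + j is legal
assert [ab for ab in product(range(-1, 2), repeat=2) if legal(*ab)] == [(1, 1)]
```

All dependence distances here are (1,1), (1,0) or (0,1), so any a ≥ 1, b ≥ 1 works; within the bound [−1, 1] only (1, 1) survives.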


Affine Scheduling: One-dimensional Schedules Polyhedral School

Multidimensional Scheduling

UCLA 76


Affine Scheduling: Multidimensional Scheduling Polyhedral School

Multidimensional Scheduling

- Some programs do not have a legal 1-d schedule
- It means it is not possible to enforce the precedence condition for all
  dependences within a single time dimension

Example

for (i = 0; i < N; ++i)
  for (j = 0; j < N; ++j)
    s += s;

- Intuition: multidimensional time means nested time loops
- The precedence constraint needs to be adapted to multidimensional time

UCLA 77
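A brute-force check makes the intuition concrete: for the s += s loop every iteration depends on the previous one, and any 1-d schedule Θ = a·i + b·j with bounded coefficients that works for one value of N breaks for a larger one (the coefficient bound 6 is an assumption of this sketch):

```python
from itertools import product

def dists(N):
    # consecutive iterations of the double loop: (0, 1) inside a row,
    # (1, -(N-1)) when wrapping from (i, N-1) to (i+1, 0)
    return [(0, 1), (1, -(N - 1))]

def legal_1d(a, b, N):
    return all(a * di + b * dj >= 1 for di, dj in dists(N))

ok_n4 = [(a, b) for a, b in product(range(-6, 7), repeat=2) if legal_1d(a, b, 4)]
assert ok_n4                      # schedules exist for N = 4 ...
assert (4, 1) in ok_n4            # ... e.g. Θ = 4i + j
# ... but none survives every N: legality needs a >= b(N-1) + 1, growing with N
assert not [ab for ab in ok_n4 if all(legal_1d(*ab, N) for N in range(4, 12))]
```

The 2-d schedule Θ = (i, j) sidesteps the problem for every N, which is exactly what nested time loops provide.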


Affine Scheduling: Multidimensional Scheduling Polyhedral School

Dependence Satisfaction

Definition (Strong dependence satisfaction)

Given DR,S, the dependence is strongly satisfied at schedule level k if

  ∀⟨~xR, ~xS⟩ ∈ DR,S,  Θ^S_k(~xS) − Θ^R_k(~xR) ≥ 1

Definition (Weak dependence satisfaction)

Given DR,S, the dependence is weakly satisfied at dimension k if

  ∀⟨~xR, ~xS⟩ ∈ DR,S,  Θ^S_k(~xS) − Θ^R_k(~xR) ≥ 0
  ∃⟨~xR, ~xS⟩ ∈ DR,S,  Θ^S_k(~xS) = Θ^R_k(~xR)

UCLA 78
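For a uniform dependence the per-level difference is the same for every instance pair, so the two definitions reduce to a test on one delta per level. A small classifier, with an assumed 2-d schedule Θ = (i, j) and two assumed distance vectors:

```python
rows = [(1, 0), (0, 1)]          # the two rows of the scheduling matrix
dists = [(1, -1), (0, 1)]        # distance vectors of two uniform dependences

def delta(row, dist):
    """Per-level schedule difference Θ_k(target) - Θ_k(source)."""
    return sum(r * x for r, x in zip(row, dist))

def level_status(row, dist):
    # uniform case: the delta is constant over all instance pairs,
    # so weak satisfaction degenerates to delta == 0
    d = delta(row, dist)
    return "strong" if d >= 1 else "weak" if d == 0 else "violated"

assert level_status(rows[0], dists[0]) == "strong"  # solved at level 1
assert level_status(rows[0], dists[1]) == "weak"    # tied at level 1 ...
assert level_status(rows[1], dists[1]) == "strong"  # ... solved at level 2
```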


Affine Scheduling: Multidimensional Scheduling Polyhedral School

Program Legality and Existence Results

- All dependences must be strongly satisfied for the program to be correct
- Once a dependence is strongly satisfied at level k, it does not contribute
  to the constraints of levels k + i
- Unlike with 1-d schedules, it is always possible to build a legal
  multidimensional schedule for a SCoP [Feautrier]

Theorem (Existence of an affine schedule)

Every static control program has a multidimensional affine schedule

UCLA 79


Affine Scheduling: Multidimensional Scheduling Polyhedral School

Reformulation of the Precedence Condition

- We introduce the variable δ^{DR,S}_1 to model the dependence satisfaction
- Considering the first row of the scheduling matrices, to preserve the
  precedence relation we have:

  ∀DR,S, ∀⟨~xR, ~xS⟩ ∈ DR,S,  Θ^S_1(~xS) − Θ^R_1(~xR) ≥ δ^{DR,S}_1,
  δ^{DR,S}_1 ∈ {0, 1}

Lemma (Semantics-preserving affine schedules)

Given a set of affine schedules Θ^R, Θ^S, … of dimension m, the program
semantics is preserved if:

  ∀DR,S, ∃p ∈ {1, …, m},  δ^{DR,S}_p = 1
    ∧ ∀j < p,  δ^{DR,S}_j = 0
    ∧ ∀j ≤ p, ∀⟨~xR, ~xS⟩ ∈ DR,S,  Θ^S_j(~xS) − Θ^R_j(~xR) ≥ δ^{DR,S}_j

UCLA 80


Affine Scheduling: Multidimensional Scheduling Polyhedral School

Space of All Affine Schedules

Objective:
- Design an ILP which operates on all scheduling coefficients
- Easier optimality reasoning: the space contains all schedules (hence
  necessarily the optimal one)
- Examples: maximal fusion, maximal coarse-grain parallelism, best
  locality, etc.

Idea:
- Combine all coefficients of all rows of the scheduling function into a
  single solution set
- Find a convex encoding for the lexicopositivity of dependence satisfaction:
  - A dependence must be weakly satisfied until it is strongly satisfied
  - Once it is strongly satisfied, it must not constrain subsequent levels

UCLA 81


Affine Scheduling: Multidimensional Scheduling Polyhedral School

Schedule Lower Bound

Idea:
- Bound the schedule latency with a lower bound which does not prevent
  finding all solutions
- Intuitively:
  - Θ^S(~xS) − Θ^R(~xR) ≥ δ if the dependence has not been strongly satisfied
  - Θ^S(~xS) − Θ^R(~xR) ≥ −∞ if it has

Lemma (Schedule lower bound)

Given Θ^R_k, Θ^S_k such that each coefficient value is bounded in [x, y].
Then there exists K ∈ Z such that:

  max ( Θ^S_k(~xS) − Θ^R_k(~xR) ) > −K·~n − K

UCLA 82


Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School

Space of Semantics-Preserving Affine Schedules

  All unique bounded affine multidimensional schedules
    ⊇ All unique semantics-preserving affine multidimensional schedules

1 point ↔ 1 unique semantically equivalent program
(up to affine iteration reordering)

UCLA 83


Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School

Semantics Preservation

Definition (Causality condition)

Given Θ^R a schedule for the instances of R, Θ^S a schedule for the instances
of S. Θ^R and Θ^S preserve the dependence DR,S if ∀⟨~xR, ~xS⟩ ∈ DR,S:

  Θ^R(~xR) ≺ Θ^S(~xS)

≺ denotes the lexicographic ordering:

  (a_1, …, a_n) ≺ (b_1, …, b_m) iff ∃i, 1 ≤ i ≤ min(n, m) s.t.
  (a_1, …, a_{i−1}) = (b_1, …, b_{i−1}) and a_i < b_i

UCLA 84

Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School

Lexico-positivity of Dependence Satisfaction

- Θ^R(~xR) ≺ Θ^S(~xS) is equivalently written Θ^S(~xS) − Θ^R(~xR) ≻ ~0
- Considering the row p of the scheduling matrices:

  Θ^S_p(~xS) − Θ^R_p(~xR) ≥ δ_p

- δ_p ≥ 1 implies no constraints on δ_k, k > p
- δ_p ≥ 0 is required if ∄ k < p, δ_k ≥ 1
- Schedule lower bound:

Lemma (Schedule lower bound)

Given Θ^R_k, Θ^S_k such that each coefficient value is bounded in [x, y].
Then there exists K ∈ Z such that:

  ∀~xR, ~xS,  Θ^S_k(~xS) − Θ^R_k(~xR) > −K·~n − K

UCLA 85


Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School

Convex Form of All Bounded Affine Schedules

Lemma (Convex form of semantics-preserving affine schedules)

Given a set of affine schedules Θ^R, Θ^S, … of dimension m, the program
semantics is preserved if the three following conditions hold:

  (i)   ∀DR,S,  δ^{DR,S}_p ∈ {0, 1}

  (ii)  ∀DR,S,  Σ_{p=1}^{m} δ^{DR,S}_p = 1

  (iii) ∀DR,S, ∀p ∈ {1, …, m}, ∀⟨~xR, ~xS⟩ ∈ DR,S,

        Θ^S_p(~xS) − Θ^R_p(~xR) ≥ δ^{DR,S}_p − Σ_{k=1}^{p−1} δ^{DR,S}_k · (K·~n + K)

→ Use the Farkas lemma to build all non-negative functions over a
polyhedron (here, the dependence polyhedra) [Feautrier,92]
→ Bounded coefficients required [Vasilache,07]

UCLA 86


Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School

Maximal Fine-Grain Parallelism

Objectives:
- Have as few dimensions as possible carrying a dependence
- For dimension k ∈ p..1:

  min Σ_{DR,S} δ^{DR,S}_k

- We use lexicographic optimization

UCLA 87


Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School

Key Observations

Is all of this really necessary?

- We have encoded one objective per row of Θ
- Question: do we need to solve this large ILP/LP?

UCLA 88


Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School

Feautrier’s Greedy Algorithm

Main idea:
1. Start at row 1 of Θ
2. Build the set of legal one-dimensional schedules
3. Maximize the number of dependences strongly solved (max Σ δ_i)
4. Remove strongly solved dependences from P
5. Goto 1

This is a row-by-row decomposition of the scheduling problem

UCLA 89
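The greedy loop can be sketched by replacing the per-row ILP with a brute-force search over bounded coefficients (a sketch; the uniform-dependence representation and the bound 3 are assumptions of this toy version):

```python
from itertools import product

def greedy_schedule(dists, bound=3, max_rows=4):
    """Row by row: pick the bounded row that strongly satisfies (delta >= 1)
    the most distance vectors while violating none (delta >= 0), drop the
    strongly satisfied ones, and recurse on the rest."""
    rows, remaining = [], list(dists)
    while remaining and len(rows) < max_rows:
        best_row, best_solved = None, None
        for row in product(range(-bound, bound + 1), repeat=2):
            deltas = [sum(r * x for r, x in zip(row, d)) for d in remaining]
            if all(d >= 0 for d in deltas):
                solved = [d for d, dd in zip(remaining, deltas) if dd >= 1]
                if best_solved is None or len(solved) > len(best_solved):
                    best_row, best_solved = row, solved
        rows.append(best_row)
        remaining = [d for d in remaining if d not in best_solved]
    return rows

# s += s in a double loop, N = 5: consecutive-iteration distance vectors
rows = greedy_schedule([(0, 1), (1, -4)])
assert len(rows) == 2   # a 2-d schedule is found, one dependence per row
for d in [(0, 1), (1, -4)]:
    deltas = [sum(r * x for r, x in zip(row, d)) for row in rows]
    # lexico-positivity: the first nonzero delta of each dependence is >= 1
    assert next(x for x in deltas if x != 0) >= 1
```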


Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School

Key Properties of Feautrier’s Algorithm

- It terminates
- It finds "optimal" fine-grain parallelism
- Granularity of dependence satisfaction: all-or-nothing

Example

for (i = 0; i < 2 * N; ++i)
  A[i] = A[2 * N - i];

UCLA 90


Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School

Key Observations

Is all of this really necessary?

- Question: do we need to consider all statements at once?
- Insight: the PDG gives structural information about dependences
- Decomposition of the PDG into strongly-connected components

UCLA 91


Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School

Feautrier’s Scheduler

UCLA 92


Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School

More Observations

- Some problems may be decomposed without loss of optimality
- The PDG gives extra information about further problem decomposition

Still, is all of this really necessary?

- Question: can we use additional knowledge about dependences?
- Uniform, non-uniform and parametric dependences

UCLA 93


Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School

Cost Functions

UCLA 94


Scheduling for Performance: Polyhedral School

Objectives for Good Scheduling

Fine-grain parallelism is nice, but...
I It has little connection with modern SIMD parallelism
I No information about the quality of the generated code
I Ignores all the important performance objectives:
  I Data locality / TLB / cache consideration
  I Multi-core parallelism (sync-free, with barrier)
  I SIMD vectorization

Question: how to find a FAST schedule for a modern processor?


Performance Distribution for 1-D Schedules [1/2]

[Plot residue removed: cycles vs. transformation identifier for matmult and locality, with the original program's performance marked.]

Figure : Performance distribution for matmult and locality


Performance Distribution for 1-D Schedules [2/2]

[Plot residue removed: cycles vs. transformation identifier for crout, under (a) GCC -O3 and (b) ICC -fast, with the original program marked in each.]

Figure : The effect of the compiler


Quantitative Analysis: The Hypothesis

Extremely large generated spaces: > 10^50 points

→ we must leverage static and dynamic characteristics to build traversal mechanisms

Hypothesis:
I It is possible to statically order the impact on performance of transformation coefficients, that is, decompose the search space into subspaces where the performance variation is maximal or reduced
I First rows of Θ are more performance impacting than the last ones


Observations on the Performance Distribution

[Plot residue removed: performance improvement vs. point index for the first schedule row (best / average / worst) - 8x8 DCT.]

for (i = 0; i < M; i++)
  for (j = 0; j < M; j++) {
    tmp[i][j] = 0.0;
    for (k = 0; k < M; k++)
      tmp[i][j] += block[i][k] * cos1[j][k];
  }

for (i = 0; i < M; i++)
  for (j = 0; j < M; j++) {
    sum2 = 0.0;
    for (k = 0; k < M; k++)
      sum2 += cos1[i][k] * tmp[k][j];
    block[i][j] = ROUND(sum2);
  }

I Extensive study of 8x8 Discrete Cosine Transform (UTDSP)
I Search space analyzed: 66×19683 = 1.29×10^6 different legal program versions



Observations on the Performance Distribution

[Plot residue removed: the same 8x8 DCT distribution, annotated with the best, average and worst points.]

I Take one specific value for the first row
I Try the 19683 possible values for the second row
I Very low proportion of best points: < 0.02%


[Plot residue removed: sorted performance distribution over the 19683 values of the second schedule dimension, first one fixed - 8x8 DCT.]


I Performance variation is large for good values of the first row
I It is usually reduced for bad values of the first row



Scanning The Space of Program Versions

The search space:
I Performance variation indicates to partition the space: ~ı > ~p > c (iterator coefficients matter most, then parameter coefficients, then the constant)
I Non-uniform distribution of performance
I No clear analytical property of the optimization function

→ Build dedicated heuristic and genetic operators aware of these static and dynamic characteristics


The Quest for Good Objective Functions

I For data locality, loop tiling is key
  I But what is the cost of tiling?
  I Is tiling the only criterion?
I For coarse-grain parallelism, doall parallelization is key
  I But what is the cost of parallelization?


Dependence Distance Minimization

I Idea: minimize the delay between instances accessing the same data
I Formulation in the polyhedral model:
  I Expression of the delay through a parametric form
  I Use all dependences (including RAR)

Definition (Dependence distance minimization)

    uk.~n + wk ≥ ΘS_k(~xS) − ΘR_k(~xR),   ∀〈~xR,~xS〉 ∈ DR,S   (1)

    with uk ∈ N^p, wk ∈ N


Key Observations

I Minimizing d = uk.~n + wk minimizes the dependence distance
I When d = 0, then ΘR_k(~xR) = ΘS_k(~xS), since 0 ≥ ΘR_k(~xR) − ΘS_k(~xS) ≥ 0
I d gives an indication of the communication volume between hyperplanes


Scheduling for Performance: Tiling Polyhedral School

An Overview of Tiling

Tiling: partition the computation into atomic blocks

I Early work in the late 80's
I Motivation: data locality improvement + parallelization


An Overview of Tiling

I Tiling the iteration space
  I It must be valid (dependence analysis required)
  I It may require pre-transformation
  I Unimodular transformation framework limitations
I Supported in current compilers, but limited applicability
I Challenges: imperfectly nested loops, parametric loops, pre-transformations, tile shape, ...
I Tile size selection
  I Critical for locality concerns: determines the footprint
  I Empirical search of the best size (problem + machine specific)
  I Parametric tiling makes the generated code valid for any tile size


Scheduling for Performance: The Tiling Hyperplane Method Polyhedral School

Tiling in the Polyhedral Model

I Tiling partitions the computation into blocks
I Note we consider only rectangular tiling here
I For tiling to be legal, such a partitioning must be legal


Key Ideas of the Tiling Hyperplane Algorithm

Affine transformations for communication-minimal parallelization and locality optimization of arbitrarily nested loop sequences [Bondhugula et al., CC'08 & PLDI'08]

I Compute a set of transformations to make loops tilable
  I Try to minimize synchronizations
  I Try to maximize locality (maximal fusion)
I Result is a set of permutable loops, if possible
  I Strip-mining / tiling can be applied
  I Tiles may be sync-free parallel or pipeline parallel
I Algorithm always terminates (possibly by splitting loops/statements)


Legality of Tiling

Theorem (Legality of Tiling)

Given two one-dimensional schedules ΘR_k, ΘS_k, they are valid tiling hyperplanes if

    ∀DR,S, ∀〈~xR,~xS〉 ∈ DR,S:  ΘS_k(~xS) − ΘR_k(~xR) ≥ 0

I For a schedule to be a legal tiling hyperplane, all communications must go forward: Forward Communication Only [Griebl]
I All dependences must be considered at each level, including those previously strongly satisfied
I Equivalence between loop permutability and loop tilability


Greedy Algorithm for Tiling Hyperplane Computation

1 Start from the outer-most level, find the set of FCO schedules
2 Select one which minimizes the distance between dependent iterations
3 Mark dependences strongly satisfied by this schedule, but do not remove them
4 Formulate the problem for the next level (FCO), adding orthogonality constraints (linear independence)
5 Solve again, etc.

Special treatment when no permutable band can be found: splitting

A few properties:
I Result is a set of permutable/tilable outer loops, when possible
I It exhibits coarse-grain parallelism
I Maximal fusion achieved to improve locality


Example: 1D-Jacobi

(Slides: The Ohio State University / Louisiana State University)

1-D Jacobi (imperfectly nested)

for (t=1; t<M; t++) {
  for (i=2; i<N-1; i++)
S:  b[i] = 0.333*(a[i-1]+a[i]+a[i+1]);
  for (j=2; j<N-1; j++)
T:  a[j] = b[j];
}


Example: 1D-Jacobi


1-D Jacobi (imperfectly nested)

•  The resulting transformation is equivalent to a constant shift of one for T relative to S, fusion (j and i are named the same as a result), and skewing the fused i loop with respect to the t loop by a factor of two.

•  The (1,0) hyperplane has the least communication: no dependence crosses more than one hyperplane instance along it.


Example: 1D-Jacobi

[Diagram removed: transforming S — the (t, i) iteration space of S and its transformed image.]


Example: 1D-Jacobi

[Diagram removed: transforming T — the (t, j) iteration space of T and its transformed image.]


Example: 1D-Jacobi

[Diagram removed: interleaving S and T in the transformed space.]



Example: 1D-Jacobi


1-D Jacobi (imperfectly nested) – transformed code

for (t0=0; t0<=M-1; t0++) {
S': b[2] = 0.333*(a[2-1]+a[2]+a[2+1]);
  for (t1=2*t0+3; t1<=2*t0+N-2; t1++) {
S:  b[-2*t0+t1] = 0.333*(a[-2*t0+t1-1]+a[-2*t0+t1]+a[-2*t0+t1+1]);
T:  a[-2*t0+t1-1] = b[-2*t0+t1-1];
  }
T': a[N-2] = b[N-2];
}



Scheduling for Performance: Fusion-driven Optimization Polyhedral School

Fusion-driven Optimization


Overview

Problem: How to improve program execution time?

I Focus on shared-memory computation
  I OpenMP parallelization
  I SIMD vectorization
  I Efficient usage of the intra-node memory hierarchy
I Challenges to address:
  I Different machines require different compilation strategies
  I One-size-fits-all scheme hinders optimization opportunities

Question: how to restructure the code for performance?


Objectives for a Successful Optimization

During the program execution, interplay between the hardware resources:
I Thread-centric parallelism
I SIMD-centric parallelism
I Memory layout, inc. caches, prefetch units, buses, interconnects...

→ Tuning the trade-off between these is required

A loop optimizer must be able to transform the program for:
I Thread-level parallelism extraction
I Loop tiling, for data locality
I Vectorization

Our approach: form a tractable search space of possible loop transformations


Running Example

Original code

Example (tmp = A.B, D = tmp.C)

for (i1 = 0; i1 < N; ++i1)
  for (j1 = 0; j1 < N; ++j1) {
R:  tmp[i1][j1] = 0;
    for (k1 = 0; k1 < N; ++k1)
S:    tmp[i1][j1] += A[i1][k1] * B[k1][j1];
  }
for (i2 = 0; i2 < N; ++i2)
  for (j2 = 0; j2 < N; ++j2) {
T:  D[i2][j2] = 0;
    for (k2 = 0; k2 < N; ++k2)
U:    D[i2][j2] += tmp[i2][k2] * C[k2][j2];
  }

{R,S} fused, {T,U} fused

                           Original   Max. fusion   Max. dist   Balanced
4× Xeon 7450 / ICC 11      1×
4× Opteron 8380 / ICC 11   1×


Running Example

Cost model: maximal fusion, minimal synchronization [Bondhugula et al., PLDI'08]

Example (tmp = A.B, D = tmp.C)

parfor (c0 = 0; c0 < N; c0++) {
  for (c1 = 0; c1 < N; c1++) {
R:  tmp[c0][c1] = 0;
T:  D[c0][c1] = 0;
    for (c6 = 0; c6 < N; c6++)
S:    tmp[c0][c1] += A[c0][c6] * B[c6][c1];
    parfor (c6 = 0; c6 <= c1; c6++)
U:    D[c0][c6] += tmp[c0][c1-c6] * C[c1-c6][c6];
  }
  for (c1 = N; c1 < 2*N - 1; c1++)
    parfor (c6 = c1-N+1; c6 < N; c6++)
U:    D[c0][c6] += tmp[c0][c1-c6] * C[c1-c6][c6];
}

{R,S,T,U} fused

                           Original   Max. fusion   Max. dist   Balanced
4× Xeon 7450 / ICC 11      1×         2.4×
4× Opteron 8380 / ICC 11   1×         2.2×


Running Example

Maximal distribution: best for Intel Xeon 7450
Poor data reuse, best vectorization

Example (tmp = A.B, D = tmp.C)

parfor (i1 = 0; i1 < N; ++i1)
  parfor (j1 = 0; j1 < N; ++j1)
R:  tmp[i1][j1] = 0;
parfor (i1 = 0; i1 < N; ++i1)
  for (k1 = 0; k1 < N; ++k1)
    parfor (j1 = 0; j1 < N; ++j1)
S:    tmp[i1][j1] += A[i1][k1] * B[k1][j1];
parfor (i2 = 0; i2 < N; ++i2)
  parfor (j2 = 0; j2 < N; ++j2)
T:  D[i2][j2] = 0;
parfor (i2 = 0; i2 < N; ++i2)
  for (k2 = 0; k2 < N; ++k2)
    parfor (j2 = 0; j2 < N; ++j2)
U:    D[i2][j2] += tmp[i2][k2] * C[k2][j2];

{R} and {S} and {T} and {U} distributed

                           Original   Max. fusion   Max. dist   Balanced
4× Xeon 7450 / ICC 11      1×         2.4×          3.9×
4× Opteron 8380 / ICC 11   1×         2.2×          6.1×


Running Example

Balanced distribution/fusion: best for AMD Opteron 8380

Example (tmp = A.B, D = tmp.C)

parfor (c1 = 0; c1 < N; c1++)
  parfor (c2 = 0; c2 < N; c2++)
R:  C[c1][c2] = 0;
parfor (c1 = 0; c1 < N; c1++)
  for (c3 = 0; c3 < N; c3++) {
T:  E[c1][c3] = 0;
    parfor (c2 = 0; c2 < N; c2++)
S:    C[c1][c2] += A[c1][c3] * B[c3][c2];
  }
parfor (c1 = 0; c1 < N; c1++)
  for (c3 = 0; c3 < N; c3++)
    parfor (c2 = 0; c2 < N; c2++)
U:    E[c1][c2] += C[c1][c3] * D[c3][c2];

{S,T} fused, {R} and {U} distributed

                           Original   Max. fusion   Max. dist   Balanced
4× Xeon 7450 / ICC 11      1×         2.4×          3.9×        3.1×
4× Opteron 8380 / ICC 11   1×         2.2×          6.1×        8.3×


Running Example


The best fusion/distribution choice drives the quality of the optimization


Loop Structures

Possible grouping + ordering of statements

I {{R}, {S}, {T}, {U}}; {{R}, {S}, {U}, {T}}; ...
I {{R,S}, {T}, {U}}; {{R}, {S}, {T,U}}; {{R}, {T,U}, {S}}; {{T,U}, {R}, {S}}; ...
I {{R,S,T}, {U}}; {{R}, {S,T,U}}; {{S}, {R,T,U}}; ...
I {{R,S,T,U}}

Number of possibilities: >> n! (number of total preorders)


Loop Structures

Removing non-semantics-preserving ones

I {{R}, {S}, {T}, {U}}; {{R}, {S}, {U}, {T}}; ...
I {{R,S}, {T}, {U}}; {{R}, {S}, {T,U}}; {{R}, {T,U}, {S}}; {{T,U}, {R}, {S}}; ...
I {{R,S,T}, {U}}; {{R}, {S,T,U}}; {{S}, {R,T,U}}; ...
I {{R,S,T,U}}

Number of possibilities: 1 to 200 for our test suite


Loop Structures

For each partitioning, many possible loop structures

I {{R}, {S}, {T}, {U}}
I For S: {i, j, k}; {i, k, j}; {k, i, j}; {k, j, i}; ...
I However, only {i, k, j} has:
  I outer-parallel loop
  I inner-parallel loop
  I lowest striding access (efficient vectorization)


Possible Loop Structures for 2mm

I 4 statements, 75 possible partitionings
I 10 loops, up to 10! possible loop structures for a given partitioning
I Two steps:
  I Remove all partitionings which break the semantics: from 75 to 12
  I Use static cost models to select the loop structure for a partitioning: from d! to 1
I Final search space: 12 possibilities


Contributions and Overview of the Approach

I Empirical search on possible fusion/distribution schemes
I Each structure drives the success of other optimizations:
  I Parallelization
  I Tiling
  I Vectorization
I Use static cost models to compute a complex loop transformation for a specific fusion/distribution scheme
I Iteratively test the different versions, retain the best
I Best performing loop structure is found


Search Space of Loop Structures

I Partition the set of statements into classes:
  I This is deciding loop fusion / distribution
  I Statements in the same class will share at least one common loop in the target code
  I Classes are ordered, to reflect code motion
I Locally on each partition, apply model-driven optimizations
I Leverage the polyhedral framework:
  I Build the smallest yet most expressive space of possible partitionings [Pouchet et al., POPL'11]
  I Consider semantics-preserving partitionings only: orders of magnitude smaller space


Summary of the Optimization Process

benchmark     description                  #loops  #stmts  #refs  #deps  #part.  #valid  Variability  Pb. Size
2mm           Linear algebra (BLAS3)         6       4       8     12      75      12       X         1024x1024
3mm           Linear algebra (BLAS3)         9       6      12     19    4683     128       X         1024x1024
adi           Stencil (2D)                  11       8      36    188  545835       1                 1024x1024
atax          Linear algebra (BLAS2)         4       4      10     12      75      16       X         8000x8000
bicg          Linear algebra (BLAS2)         3       4      10     10      75      26       X         8000x8000
correl        Correlation (PCA: StatLib)     5       6      12     14    4683     176       X         500x500
covar         Covariance (PCA: StatLib)      7       7      13     26   47293      96       X         500x500
doitgen       Linear algebra                 5       3       7      8      13       4                 128x128x128
gemm          Linear algebra (BLAS3)         3       2       6      6       3       2                 1024x1024
gemver        Linear algebra (BLAS2)         7       4      19     13      75       8       X         8000x8000
gesummv       Linear algebra (BLAS2)         2       5      15     17     541      44       X         8000x8000
gramschmidt   Matrix normalization           6       7      17     34   47293       1                 512x512
jacobi-2d     Stencil (2D)                   5       2       8     14       3       1                 20x1024x1024
lu            Matrix decomposition           4       2       7     10       3       1                 1024x1024
ludcmp        Solver                         9      15      40    188    1012      20       X         1024x1024
seidel        Stencil (2D)                   3       1      10     27       1       1                 20x1024x1024

Table : Summary of the optimization process


Experimental Setup

We compare three schemes:
I maxfuse: static cost model for fusion (maximal fusion)
I smartfuse: static cost model for fusion (fuse only if data reuse)
I iterative: iterative compilation, output the best result


Performance Results - Intel Xeon 7450 - ICC 11

[Plot residue removed: performance improvement over ICC -fast -parallel on Intel Xeon 7450 (24 threads) for the 16 benchmarks, comparing pocc-maxfuse, pocc-smartfuse and iterative.]


Performance Results - AMD Opteron 8380 - ICC 11

[Plot residue removed: performance improvement over ICC -fast -parallel on AMD Opteron 8380 (16 threads) for the 16 benchmarks, comparing pocc-maxfuse, pocc-smartfuse and iterative.]


Performance Results - Intel Atom 330 - GCC 4.3

[Plot residue removed: performance improvement over GCC 4.3 -O3 -fopenmp on Intel Atom (2 threads) for the 16 benchmarks, comparing pocc-maxfuse, pocc-smartfuse and iterative.]


Assessment from Experimental Results

1 Empirical tuning required for 9 out of 16 benchmarks
2 Strong performance improvements: 2.5× - 3× on average
3 Portability achieved:
  I Automatically adapt to the program and target architecture
  I No assumption made about the target
  I Exhaustive search finds the optimal structure (1-176 variants)
4 Substantial improvements over state-of-the-art (up to 2×)


Approximate Scheduling: Polyhedral School

Approximate Scheduling


Exact vs. Approximate Dependences

I So far, we used an exact dependence representation
I But numerous other works use approximate representations:
  I dependence distance vectors [Allen/Kennedy, KMW]
  I dependence directions
  I DDVs approximated as polyhedra [Darte-Vivien]
I Optimality may not always be lost when using a relaxed dependence representation!


Key Aspects of Dependence Representations

I Can use a graph to represent the dependences
  I Karp, Miller, Winograd: graph for SURE, using DDVs
  I Darte-Vivien: "PDG"-like graph
I For certain kinds of dependences (e.g., uniform), "optimal" algorithms:
  I Darte-Vivien
  I SURE


Some Observations

I Maximal parallelism (under some conditions) can be extracted from less precise representations
I Very fast scheduling algorithms can be derived
I Remaining question: how useful are those algorithms?


Conclusion


The End

Let’s stop here for today!
