
Source: labexcompilation.ens-lyon.fr/wp-content/uploads/2013/02/Louis-Noel…

Program transformations and optimizations in the polyhedral framework

Louis-Noël Pouchet

Dept. of Computer Science, UCLA

May 14, 2013

1st Polyhedral Spring School, St-Germain au Mont d’or, France


Overview


Agenda for the Lecture

Disclaimer: this lecture is (mainly) for new Ph.D. students

1 Program Representation in the Polyhedral Model

2 Applying loop transformations in the model

3 Exact data dependence representation

4 Scheduling using Farkas-based methods

5 Scheduling using approximate data dependences (if time permits...)


Running Example: DGEMM

Example (dgemm)

/* C := alpha*A*B + beta*C */
for (i = 0; i < ni; i++)
  for (j = 0; j < nj; j++)
S1:   C[i][j] *= beta;
for (i = 0; i < ni; i++)
  for (j = 0; j < nj; j++)
    for (k = 0; k < nk; ++k)
S2:     C[i][j] += alpha * A[i][k] * B[k][j];

- Loop transformation: permute(i,k,S2)

Execution time (in s) on this laptop, GCC 4.2, ni=nj=nk=512

version    -O0    -O1    -O2    -O3 -vec
original   1.81   0.78   0.78   0.78
permute    1.52   0.35   0.35   0.20

http://gcc.gnu.org/onlinedocs/gcc-4.2.1/gcc/Optimize-Options.html


Running Example: fdtd-2d

Example (fdtd-2d)

for (t = 0; t < tmax; t++) {
  for (j = 0; j < ny; j++)
    ey[0][j] = _edge_[t];
  for (i = 1; i < nx; i++)
    for (j = 0; j < ny; j++)
      ey[i][j] = ey[i][j] - 0.5*(hz[i][j]-hz[i-1][j]);
  for (i = 0; i < nx; i++)
    for (j = 1; j < ny; j++)
      ex[i][j] = ex[i][j] - 0.5*(hz[i][j]-hz[i][j-1]);
  for (i = 0; i < nx - 1; i++)
    for (j = 0; j < ny - 1; j++)
      hz[i][j] = hz[i][j] - 0.7*(ex[i][j+1] - ex[i][j] + ey[i+1][j] - ey[i][j]);
}

- Loop transformation: polyhedralOpt(fdtd-2d)

Execution time (in s) on this laptop, GCC 4.2, 64x1024x1024

version        -O0    -O1    -O2    -O3 -vec
original       2.59   1.62   1.54   1.54
polyhedralOpt  2.05   0.41   0.41   0.41


Manual transformations?

(the same fdtd-2d code as on the previous slide)


NO NO NO!!!

Example (fdtd-2d tiled)for (c0 = 0; c0 <= (((ny + 2 * tmax + -3) * 32 < 0?((32 < 0?-((-(ny + 2 * tmax + -3) + 32 + 1) / 32) : -((-(ny + 2 *tmax + -3) + 32 - 1) / 32))) : (ny + 2 * tmax + -3) / 32)); ++c0) {

#pragma omp parallel for private(c3, c4, c2, c5)for (c1 = (((c0 * 2 < 0?-(-c0 / 2) : ((2 < 0?(-c0 + -2 - 1) / -2 : (c0 + 2 - 1) / 2)))) > (((32 * c0 + -tmax + 1) *

32 < 0?-(-(32 * c0 + -tmax + 1) / 32) : ((32 < 0?(-(32 * c0 + -tmax + 1) + -32 - 1) / -32 : (32 * c0 + -tmax + 1 + 32- 1) / 32))))?((c0 * 2 < 0?-(-c0 / 2) : ((2 < 0?(-c0 + -2 - 1) / -2 : (c0 + 2 - 1) / 2)))) : (((32 * c0 + -tmax + 1) *32 < 0?-(-(32 * c0 + -tmax + 1) / 32) : ((32 < 0?(-(32 * c0 + -tmax + 1) + -32 - 1) / -32 : (32 * c0 + -tmax + 1 + 32- 1) / 32))))); c1 <= (((((((ny + tmax + -2) * 32 < 0?((32 < 0?-((-(ny + tmax + -2) + 32 + 1) / 32) : -((-(ny + tmax +-2) + 32 - 1) / 32))) : (ny + tmax + -2) / 32)) < (((32 * c0 + ny + 30) * 64 < 0?((64 < 0?-((-(32 * c0 + ny + 30) + 64+ 1) / 64) : -((-(32 * c0 + ny + 30) + 64 - 1) / 64))) : (32 * c0 + ny + 30) / 64))?(((ny + tmax + -2) * 32 < 0?((32 <0?-((-(ny + tmax + -2) + 32 + 1) / 32) : -((-(ny + tmax + -2) + 32 - 1) / 32))) : (ny + tmax + -2) / 32)) : (((32 * c0 +ny + 30) * 64 < 0?((64 < 0?-((-(32 * c0 + ny + 30) + 64 + 1) / 64) : -((-(32 * c0 + ny + 30) + 64 - 1) / 64))) : (32 *c0 + ny + 30) / 64)))) < c0?(((((ny + tmax + -2) * 32 < 0?((32 < 0?-((-(ny + tmax + -2) + 32 + 1) / 32) : -((-(ny + tmax+ -2) + 32 - 1) / 32))) : (ny + tmax + -2) / 32)) < (((32 * c0 + ny + 30) * 64 < 0?((64 < 0?-((-(32 * c0 + ny + 30) + 64+ 1) / 64) : -((-(32 * c0 + ny + 30) + 64 - 1) / 64))) : (32 * c0 + ny + 30) / 64))?(((ny + tmax + -2) * 32 < 0?((32 <0?-((-(ny + tmax + -2) + 32 + 1) / 32) : -((-(ny + tmax + -2) + 32 - 1) / 32))) : (ny + tmax + -2) / 32)) : (((32 * c0 +ny + 30) * 64 < 0?((64 < 0?-((-(32 * c0 + ny + 30) + 64 + 1) / 64) : -((-(32 * c0 + ny + 30) + 64 - 1) / 64))) : (32 *c0 + ny + 30) / 64)))) : c0)); ++c1) {

for (c2 = c0 + -c1; c2 <= (((((tmax + nx + -2) * 32 < 0?((32 < 0?-((-(tmax + nx + -2) + 32 + 1) / 32) : -((-(tmax +nx + -2) + 32 - 1) / 32))) : (tmax + nx + -2) / 32)) < (((32 * c0 + -32 * c1 + nx + 30) * 32 < 0?((32 < 0?-((-(32 * c0 +-32 * c1 + nx + 30) + 32 + 1) / 32) : -((-(32 * c0 + -32 * c1 + nx + 30) + 32 - 1) / 32))) : (32 * c0 + -32 * c1 + nx +30) / 32))?(((tmax + nx + -2) * 32 < 0?((32 < 0?-((-(tmax + nx + -2) + 32 + 1) / 32) : -((-(tmax + nx + -2) + 32 - 1) /32))) : (tmax + nx + -2) / 32)) : (((32 * c0 + -32 * c1 + nx + 30) * 32 < 0?((32 < 0?-((-(32 * c0 + -32 * c1 + nx + 30)+ 32 + 1) / 32) : -((-(32 * c0 + -32 * c1 + nx + 30) + 32 - 1) / 32))) : (32 * c0 + -32 * c1 + nx + 30) / 32)))); ++c2){

if (c0 == 2 * c1 && c0 == 2 * c2) {for (c3 = 16 * c0; c3 <= ((tmax + -1 < 16 * c0 + 30?tmax + -1 : 16 * c0 + 30)); ++c3)

if (c0 % 2 == 0)(ey[0])[0] = (_edge_[c3]);

........... (200 more lines!)

Performance gain: 2-6× on modern multicore platforms


Loop Transformations in Production Compilers

Limitations of standard syntactic frameworks:
- Composition of transformations may be tedious
  - composability rules / applicability
- Parametric loop bounds, imperfectly nested loops are challenging
  - Look at the examples!
- Approximate dependence analysis
  - Misses parallelization opportunities (among many others)
- (Very) conservative performance models


Achievements of Polyhedral Compilation

The polyhedral model:
- Model/apply seamlessly arbitrary compositions of transformations
- Automatic handling of imperfectly nested, parametric loop structures
- Many loop transformations can be modeled

- Exact dependence analysis on a class of programs
- Unleashes the power of automatic parallelization
- Aggressive multi-objective program restructuring (parallelism, SIMD, cache, etc.)

- Requires computationally expensive algorithms
  - Usually NP-complete / exponential complexity
  - Requires careful problem statement/representation


Research around Polyhedral Compilation

"first generation"
- Provided the theoretical foundations
- Focused on high-level/abstract objectives
- Was questioned about how practical this model is

"second generation"
- Provided practical / "more scalable" algorithms
- Focused on concrete performance improvements
- Still often questioned about how practical this model is

- Thanks for the legacy! ;-)


Perspective on 25 Years of Computing

"first generation"
- Parallel machines were... rare
- Era of increased program performance for free
- Intel 486 DX2: 54 MIPS, 66 MHz

"second generation"
- Parallel machines are ubiquitous (multi-core, SIMD, GPUs)
- Very difficult to get good performance
- Intel Core i7: 177730 MIPS, 3.33 GHz


More complex problems can be solved, but more complex objectives are needed anyway


Disclaimer

This lecture is:
- Partial
- Biased (strongly!)
- Provocative at times (hopefully ;-) )

(illustrated with a Google Scholar screenshot: a search for "polyhedral model affine scheduling" returning about 3,520 results)


Polyhedral Program Representation


Example: DGEMM

Example (dgemm)

/* C := alpha*A*B + beta*C */
for (i = 0; i < ni; i++)
  for (j = 0; j < nj; j++) {
S1:   C[i][j] *= beta;
      for (k = 0; k < nk; ++k)
S2:     C[i][j] += alpha * A[i][k] * B[k][j];
  }

This code has:
- imperfectly nested loops
- multiple statements
- parametric loop bounds


Granularity of Program Representation

DGEMM has:
- 3 loops
  - For loops in the code, while loops
  - Control-flow graph analysis
- 2 (syntactic) statements
  - Input source code?
  - Basic block?
  - ASM instructions?
- S1 is executed ni × nj times
  - dynamic instances of the statement
  - Does not (necessarily) correspond to reality!


Some Observations

Reasoning at the loop/statement level:
- Some loop transformations may be very difficult to implement
  - How to fuse loops with different loop bounds?
  - How to permute triangular loops?
  - How to unroll-and-jam triangular loops?
  - How to apply time-tiling?
  - ...
- Statements may operate on the same array while being independent


Some Motivations for Polyhedral Transformations

- Known problem: scheduling of a task graph
  - Obvious limitations: the task graph is not finite / its size depends on the problem / effective code generation is almost impossible
- Alternative approach: use loop transformations
  - solves all the above limitations
  - BUT the problem is to find a sequence that implements the order we want
  - AND also how to apply/compose them
- Desired features:
  - ability to reason at the instance level (as for task graph scheduling)
  - ability to easily apply/compose loop transformations


The Polyhedral Model


Polyhedral Program Optimization: a Three-Stage Process

1 Analysis: from code to model

→ Existing prototype tools
  - URUK, Omega, ChiLL, ...
  - ISL, Loopo
  - PolyOpt/PoCC (Clan-Candl-LetSee-Pluto-Ponos-Cloog-Polylib-PIPLib-ISL-FM-...)

→ GCC GRAPHITE, LLVM Polly (now in mainstream)

→ Reservoir Labs R-Stream, IBM XL/Poly

2 Transformation in the model

→ Build and select a program transformation

3 Code generation: from model to code

→ "Apply" the transformation in the model

→ Regenerate syntactic (AST-based) code


Motivating Example [1/2]

Example

for (i = 0; i <= 1; ++i)
  for (j = 0; j <= 2; ++j)
    A[i][j] = i * j;

Program execution:

1: A[0][0] = 0 * 0;
2: A[0][1] = 0 * 1;
3: A[0][2] = 0 * 2;
4: A[1][0] = 1 * 0;
5: A[1][1] = 1 * 1;
6: A[1][2] = 1 * 2;


Motivating Example [2/2]

A few observations:
- The statement is executed 6 times
- A different value of (i, j) is associated with each of these 6 instances
- There is an order on them (the execution order)

A rough analogy: polyhedral compilation is about (statically) scheduling tasks, where tasks are statement instances, or operations



Polyhedral Representation of Programs

Static Control Parts
- Loops have affine control only (over-approximation otherwise)
- Iteration domain: represented as integer polyhedra

for (i = 1; i <= n; ++i)
  for (j = 1; j <= n; ++j)
    if (i <= n-j+2)
      s[i] = ...

DS1 =

  [  1   0   0  -1 ]   [ i ]
  [ -1   0   1   0 ]   [ j ]
  [  0   1   0  -1 ] . [ n ]  ≥ ~0
  [  0  -1   1   0 ]   [ 1 ]
  [ -1  -1   1   2 ]

(rows: i ≥ 1, i ≤ n, j ≥ 1, j ≤ n, i ≤ n - j + 2)


Polyhedral Representation of Programs

Static Control Parts
- Loops have affine control only (over-approximation otherwise)
- Iteration domain: represented as integer polyhedra
- Memory accesses: static references, represented as affine functions of ~xS and ~p

for (i = 0; i < n; ++i) {
  s[i] = 0;
  for (j = 0; j < n; ++j)
    s[i] = s[i] + a[i][j]*x[j];
}

fs(~xS2) = [ 1 0 0 0 ] . (~xS2, n, 1)^T          (access s[i])

fa(~xS2) = [ 1 0 0 0 ]
           [ 0 1 0 0 ] . (~xS2, n, 1)^T          (access a[i][j])

fx(~xS2) = [ 0 1 0 0 ] . (~xS2, n, 1)^T          (access x[j])


Polyhedral Representation of Programs

Static Control Parts
- Loops have affine control only (over-approximation otherwise)
- Iteration domain: represented as integer polyhedra
- Memory accesses: static references, represented as affine functions of ~xS and ~p
- Data dependence between S1 and S2: a subset of the Cartesian product of DS1 and DS2 (exact analysis)

for (i = 1; i <= 3; ++i) {
  s[i] = 0;
  for (j = 1; j <= 3; ++j)
    s[i] = s[i] + 1;
}

DS1δS2 :

  [ 1  -1   0   0 ] . (iS1, iS2, jS2, 1)^T = 0        (iS1 = iS2)

  [ 1   0   0  -1 ]
  [-1   0   0   3 ]
  [ 0   1   0  -1 ]
  [ 0  -1   0   3 ] . (iS1, iS2, jS2, 1)^T ≥ ~0       (1 ≤ iS1, iS2, jS2 ≤ 3)
  [ 0   0   1  -1 ]
  [ 0   0  -1   3 ]

(figure: S1 and S2 iterations plotted along i, with the dependences between them)


Math Corner


The Affine Qualifier

Definition (Affine function)

A function f : K^m → K^n is affine if there exists a vector ~b ∈ K^n and a matrix A ∈ K^(n×m) such that:

∀~x ∈ K^m, f(~x) = A~x + ~b

Definition (Affine half-space)

An affine half-space of K^m (affine constraint) is defined as the set of points:

{~x ∈ K^m | ~a.~x ≤ b},  where ~a ∈ K^m and b ∈ K


Polyhedron (Implicit Representation)

Definition (Polyhedron)

A set P ⊆ K^m is a polyhedron if there exists a system of finitely many inequalities A~x ≤ ~b such that:

P = {~x ∈ K^m | A~x ≤ ~b}

Equivalently, it is the intersection of finitely many half-spaces.

Definition (Polytope)

A polytope is a bounded polyhedron.


Integer Polyhedron

Definition (Z-polyhedron)

A polyhedron all of whose extreme points are integer valued

Definition (Integer hull)

The integer hull of a rational polyhedron P is the convex hull of the integer points contained in P.

For the moment, we will "say" an integer polyhedron is a polyhedron of integer points (language abuse)


Rational and Integer Polytopes


Another View of Polyhedra

The dual representation models a polyhedron as a combination of lines L and rays R (forming the polyhedral cone) and vertices V (forming the polytope)

Definition (Dual representation)

P : {~x ∈ Q^n | ~x = L~λ + R~µ + V~ν,  ~µ ≥ 0,  ~ν ≥ 0,  Σi νi = 1}

Definition (Face)

A face F of P is the intersection of P with a supporting hyperplane of P. We have:

dim(F) ≤ dim(P)

Definition (Facet)

A facet F of P is a face of P such that:

dim(F) = dim(P) − 1


Some Equivalence Properties

Theorem (Fundamental Theorem on Polyhedral Decomposition)

If P is a polyhedron, then it can be decomposed as a polytope V plus a polyhedral cone L.

Theorem (Equivalence of Representations)

Every polyhedron has both an implicit and a dual representation

- Chernikova’s algorithm can compute the dual representation from the implicit one
- The dual representation is heavily used in polyhedral compilation
- Some works operate on the constraint-based representation (Pluto)


Parametric Polyhedra

Definition (Parametric polyhedron)

Given ~n the vector of symbolic parameters, P is a parametric polyhedron if it is defined by:

P = {~x ∈ K^m | A~x ≤ B~n + ~b}

- Requires adapting theory and tools to parameters
- Can become nasty: case distinctions (QUAST)
- Reflects nicely the program context


Some Useful Algorithms

All extended to parametric polyhedra:
- Compute the facets of a polytope: PolyLib [Wilde et al.]
- Compute the volume of a polytope (number of points): Barvinok [Clauss/Verdoolaege]
- Scan a polytope (code generation): CLooG [Quilleré/Bastoul]
- Find the lexicographic minimum: PIP [Feautrier]


Operations on Polyhedra

Definition (Intersection)

The intersection of two convex sets P1 and P2 is a convex set P :

P = {~x ∈ K^m | ~x ∈ P1 ∧ ~x ∈ P2}

Definition (Union)

The union of two convex sets P1 and P2 is a set P :

P = {~x ∈ K^m | ~x ∈ P1 ∨ ~x ∈ P2}

The union of two convex sets may not be a convex set


Lattices

Definition (Lattice)

A subset L in Qn is a lattice if is generated by integral combination of finitelymany vectors: a1,a2, . . . ,an (ai ∈Qn). If the ai vectors have integralcoordinates, L is an integer lattice.

Definition (Z-polyhedron)

A Z-polyhedron is the intersection of a polyhedron and an affine, integral, full-dimensional lattice.


Pictured Example

Example of a Z-polyhedron:
- Q1 = {(i, j) | 0 ≤ i ≤ 5, 0 ≤ 3j ≤ 20}
- L1 = {(2i+1, 3j+5) | i, j ∈ Z}
- Z1 = Q1 ∩ L1


Quick Facts on Z-polyhedra

- Iteration domains are in fact Z-polyhedra with the unit lattice
- Intersection of Z-polyhedra is not convex in general
- Union is complex to compute
- Can count points, can optimize, can scan

- Implementations available for most operations in PolyLib


Image and Preimage

Definition (Image)

The image of a polyhedron P ⊆ Z^n by an affine function f : Z^n → Z^m is a Z-polyhedron P′:

P′ = { f(~x) ∈ Z^m | ~x ∈ P }

Definition (Preimage)

The preimage of a polyhedron P ⊆ Z^m by an affine function f : Z^n → Z^m is a Z-polyhedron P′:

P′ = { ~x ∈ Z^n | f(~x) ∈ P }

We have Image(f^−1, P) = Preimage(f, P) if f is invertible.
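On finite point sets the two definitions can be played with directly; a small Python sketch (the affine map f and the bounds are arbitrary illustrative choices):

```python
# Image and preimage of a finite set of integer points under an affine map.
# Here f(x, y) = (x + y, 2y) is just an illustrative affine function.
def f(p):
    x, y = p
    return (x + y, 2 * y)

P = {(x, y) for x in range(3) for y in range(3)}       # a tiny square "domain"
image = {f(p) for p in P}                               # { f(x) | x in P }
candidates = {(x, y) for x in range(6) for y in range(6)}
preimage = {p for p in candidates if f(p) in P}         # { x | f(x) in P }
```

Note the image lands only on points with even second coordinate: the lattice structure announced by the Z-polyhedron definition.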

UCLA 35


The Polyhedral Model: Math Corner Polyhedral School

Relation Between Image, Preimage and Z-polyhedra

I The image of a Z-polyhedron by a unimodular function is a Z-polyhedron

I The preimage of a Z-polyhedron by an affine function is a Z-polyhedron

I The image of a polyhedron by an affine invertible function is a Z-polyhedron

I The image by a non-invertible function is not a Z-polyhedron

UCLA 36


The Polyhedral Model: Math Corner Polyhedral School

Program Transformations

UCLA 37


The Polyhedral Model: Modeling Programs Polyhedral School

What Can Be Modeled?

Exact vs. approximate representation:

I Exact representation of iteration domains
  I Static control flow
  I Affine loop bounds (includes min/max/integer division)
  I Affine conditionals (conjunction/disjunction)

I Approximate representation of iteration domains
  I Use affine over-approximations, predicate statement executions
  I Full-function support

UCLA 38


Program Transformation: Polyhedral School

Key Intuition

I Programs are represented with geometric shapes

I Transforming a program is about modifying those shapes
  I rotation, skewing, stretching, ...

I But we need here to assume some order to scan points

UCLA 39


Program Transformation: Polyhedral School

Affine Transformations

Interchange Transformation
The transformation matrix is the identity with a permutation of two rows.

(a) original polyhedron  A~x + ~a ≥ ~0:

    [  1  0 ]           [ -1 ]
    [ -1  0 ]  ( i )  + [  2 ]  ≥ ~0        (i.e. 1 ≤ i ≤ 2, 1 ≤ j ≤ 3)
    [  0  1 ]  ( j )    [ -1 ]
    [  0 -1 ]           [  3 ]

(b) transformation function  ~y = T~x:

    ( i' )   [ 0 1 ]  ( i )
    ( j' ) = [ 1 0 ]  ( j )

(c) target polyhedron  (A T^-1)~y + ~a ≥ ~0:

    [  0  1 ]            [ -1 ]
    [  0 -1 ]  ( i' )  + [  2 ]  ≥ ~0       (i.e. 1 ≤ j' ≤ 2, 1 ≤ i' ≤ 3)
    [  1  0 ]  ( j' )    [ -1 ]
    [ -1  0 ]            [  3 ]

do i = 1, 2                      do i' = 1, 3
  do j = 1, 3         =⇒           do j' = 1, 2
    S(i,j)                            S(i=j',j=i')

UCLA 40
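The effect of the interchange can be checked pointwise; a small Python sketch with the example's bounds hard-coded:

```python
# Apply T = [[0, 1], [1, 0]] (interchange) to the original iteration domain
# {1 <= i <= 2, 1 <= j <= 3} and compare with the target domain scanned by
# the transformed loop nest {1 <= i' <= 3, 1 <= j' <= 2}.
original = [(i, j) for i in range(1, 3) for j in range(1, 4)]

def T(p):
    i, j = p
    return (j, i)   # (i', j') = (j, i)

target = [(ip, jp) for ip in range(1, 4) for jp in range(1, 3)]
assert sorted(T(p) for p in original) == sorted(target)
```

The same six instances are executed; only the order in which the points are scanned changes.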


Program Transformation: Polyhedral School

Affine Transformations

Reversal Transformation
The transformation matrix is the identity with one diagonal element replaced by −1.

(a) original polyhedron  A~x + ~a ≥ ~0:

    [  1  0 ]           [ -1 ]
    [ -1  0 ]  ( i )  + [  2 ]  ≥ ~0        (i.e. 1 ≤ i ≤ 2, 1 ≤ j ≤ 3)
    [  0  1 ]  ( j )    [ -1 ]
    [  0 -1 ]           [  3 ]

(b) transformation function  ~y = T~x:

    ( i' )   [ -1 0 ]  ( i )
    ( j' ) = [  0 1 ]  ( j )

(c) target polyhedron  (A T^-1)~y + ~a ≥ ~0:

    [ -1  0 ]            [ -1 ]
    [  1  0 ]  ( i' )  + [  2 ]  ≥ ~0       (i.e. -2 ≤ i' ≤ -1, 1 ≤ j' ≤ 3)
    [  0  1 ]  ( j' )    [ -1 ]
    [  0 -1 ]            [  3 ]

do i = 1, 2                      do i' = -2, -1
  do j = 1, 3         =⇒           do j' = 1, 3
    S(i,j)                            S(i=-i',j=j')

UCLA 40


Program Transformation: Polyhedral School

Affine Transformations

Compound Transformation
The transformation matrix is the composition of an interchange and a reversal.

(a) original polyhedron  A~x + ~a ≥ ~0:

    [  1  0 ]           [ -1 ]
    [ -1  0 ]  ( i )  + [  2 ]  ≥ ~0        (i.e. 1 ≤ i ≤ 2, 1 ≤ j ≤ 3)
    [  0  1 ]  ( j )    [ -1 ]
    [  0 -1 ]           [  3 ]

(b) transformation function  ~y = T~x:

    ( i' )   [ 0 -1 ]  ( i )
    ( j' ) = [ 1  0 ]  ( j )

(c) target polyhedron  (A T^-1)~y + ~a ≥ ~0:

    [  0  1 ]            [ -1 ]
    [  0 -1 ]  ( i' )  + [  2 ]  ≥ ~0       (i.e. -3 ≤ i' ≤ -1, 1 ≤ j' ≤ 2)
    [ -1  0 ]  ( j' )    [ -1 ]
    [  1  0 ]            [  3 ]

do i = 1, 2                      do i' = -3, -1
  do j = 1, 3         =⇒           do j' = 1, 2
    S(i,j)                            S(i=j',j=-i')

UCLA 40


Program Transformation: Polyhedral School

So, What is This Matrix?

I We know how to generate code for arbitrary matrices with integer coefficients
  I Arbitrary number of rows (but fixed number of columns)
  I Arbitrary value for the coefficients
I Through code generation, the number of dynamic instances is preserved
  I But this is not true for the transformed polyhedra!

Some classification:
I The matrix is unimodular
I The matrix is full rank and invertible
I The matrix is full rank
I The matrix has only integral coefficients
I The matrix has rational coefficients
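The first class can be tested mechanically: a square integer matrix is unimodular when the absolute value of its determinant is 1. A small Python sketch:

```python
# |det T| == 1 means T is unimodular: its inverse is also integral, so it
# maps Z^n bijectively onto Z^n and preserves the integer points exactly.
def det(m):
    # Laplace expansion along the first row; fine for tiny matrices.
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** c * m[0][c] * det([r[:c] + r[c + 1:] for r in m[1:]])
               for c in range(len(m)))

interchange = [[0, 1], [1, 0]]
reversal = [[-1, 0], [0, 1]]
scaling = [[2, 0], [0, 1]]     # integral and full rank, but not unimodular

assert abs(det(interchange)) == 1 and abs(det(reversal)) == 1
assert abs(det(scaling)) == 2
```

Interchange and reversal (and their compositions) are unimodular; a scaling by 2 is not, which is why it does not preserve the set of integer points of the polyhedron.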

UCLA 41


Program Transformation: Polyhedral School

A Reverse History Perspective

1 CLooG: arbitrary matrix
2 Affine Mappings
3 Unimodular framework
4 SARE
5 SURE

UCLA 42


Program Transformation: Polyhedral School

Program Transformations

Original Schedule

for (i = 0; i < n; ++i)
  for (j = 0; j < n; ++j) {
S1: C[i][j] = 0;
    for (k = 0; k < n; ++k)
S2:   C[i][j] += A[i][k]*B[k][j];
  }

ΘS1.~xS1 = [ 1 0 0 0 ] . (i j n 1)T
           [ 0 1 0 0 ]

ΘS2.~xS2 = [ 1 0 0 0 0 ] . (i j k n 1)T
           [ 0 1 0 0 0 ]
           [ 0 0 1 0 0 ]

Generated code (identical to the original):

for (i = 0; i < n; ++i)
  for (j = 0; j < n; ++j) {
    C[i][j] = 0;
    for (k = 0; k < n; ++k)
      C[i][j] += A[i][k]*B[k][j];
  }

I Represent Static Control Parts (control flow and dependences must be statically computable)
I Use code generator (e.g. CLooG) to generate C code from polyhedral representation (provided iteration domains + schedules)

UCLA 43
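The schedules above can be read as timestamp assignments: executing all instances in lexicographic order of their Θ-vectors reproduces the original execution order. A Python sketch with n fixed to a small value:

```python
# Each instance gets the timestamp computed by its schedule; sorting the
# timestamps lexicographically yields the execution order.
n = 2
events = []
for i in range(n):
    for j in range(n):
        events.append(((i, j), ("S1", i, j)))            # ΘS1(i,j) = (i, j)
        for k in range(n):
            events.append(((i, j, k), ("S2", i, j, k)))  # ΘS2(i,j,k) = (i, j, k)

events.sort(key=lambda e: e[0])   # (i, j) sorts before (i, j, k): S1 first
order = [e[1] for e in events]
```

Python's tuple comparison is exactly the lexicographic order used here, with a shorter prefix ordered first, so S1(i,j) precedes every S2(i,j,k).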


Program Transformation: Polyhedral School

Program Transformations

Distribute loops

for (i = 0; i < n; ++i)
  for (j = 0; j < n; ++j) {
S1: C[i][j] = 0;
    for (k = 0; k < n; ++k)
S2:   C[i][j] += A[i][k]*B[k][j];
  }

ΘS1.~xS1 = [ 1 0 0 0 ] . (i j n 1)T
           [ 0 1 0 0 ]

ΘS2.~xS2 = [ 1 0 0 1 0 ] . (i j k n 1)T
           [ 0 1 0 0 0 ]
           [ 0 0 1 0 0 ]

Generated code:

for (i = 0; i < n; ++i)
  for (j = 0; j < n; ++j)
    C[i][j] = 0;
for (i = n; i < 2*n; ++i)
  for (j = 0; j < n; ++j)
    for (k = 0; k < n; ++k)
      C[i-n][j] += A[i-n][k]*B[k][j];

I All instances of S1 are executed before the first S2 instance

UCLA 43


Program Transformation: Polyhedral School

Program Transformations

Distribute loops + Interchange loops for S2

for (i = 0; i < n; ++i)
  for (j = 0; j < n; ++j) {
S1: C[i][j] = 0;
    for (k = 0; k < n; ++k)
S2:   C[i][j] += A[i][k]*B[k][j];
  }

ΘS1.~xS1 = [ 1 0 0 0 ] . (i j n 1)T
           [ 0 1 0 0 ]

ΘS2.~xS2 = [ 0 0 1 1 0 ] . (i j k n 1)T
           [ 0 1 0 0 0 ]
           [ 1 0 0 0 0 ]

Generated code:

for (i = 0; i < n; ++i)
  for (j = 0; j < n; ++j)
    C[i][j] = 0;
for (k = n; k < 2*n; ++k)
  for (j = 0; j < n; ++j)
    for (i = 0; i < n; ++i)
      C[i][j] += A[i][k-n]*B[k-n][j];

I The outer-most loop for S2 becomes k

UCLA 43


Program Transformation: Polyhedral School

Program Transformations

Illegal schedule

for (i = 0; i < n; ++i)
  for (j = 0; j < n; ++j) {
S1: C[i][j] = 0;
    for (k = 0; k < n; ++k)
S2:   C[i][j] += A[i][k]*B[k][j];
  }

ΘS1.~xS1 = [ 1 0 1 0 ] . (i j n 1)T
           [ 0 1 0 0 ]

ΘS2.~xS2 = [ 0 0 1 0 0 ] . (i j k n 1)T
           [ 0 1 0 0 0 ]
           [ 1 0 0 0 0 ]

Generated code:

for (k = 0; k < n; ++k)
  for (j = 0; j < n; ++j)
    for (i = 0; i < n; ++i)
      C[i][j] += A[i][k]*B[k][j];
for (i = n; i < 2*n; ++i)
  for (j = 0; j < n; ++j)
    C[i-n][j] = 0;

I All instances of S1 are executed after the last S2 instance

UCLA 43


Program Transformation: Polyhedral School

Program Transformations

A legal schedule

for (i = 0; i < n; ++i)
  for (j = 0; j < n; ++j) {
S1: C[i][j] = 0;
    for (k = 0; k < n; ++k)
S2:   C[i][j] += A[i][k]*B[k][j];
  }

ΘS1.~xS1 = [ 1 0 1 0 ] . (i j n 1)T
           [ 0 1 0 0 ]

ΘS2.~xS2 = [ 0 0 1 1 1 ] . (i j k n 1)T
           [ 0 1 0 0 0 ]
           [ 1 0 0 0 0 ]

Generated code:

for (i = n; i < 2*n; ++i)
  for (j = 0; j < n; ++j)
    C[i-n][j] = 0;
for (k = n+1; k <= 2*n; ++k)
  for (j = 0; j < n; ++j)
    for (i = 0; i < n; ++i)
      C[i][j] += A[i][k-n-1]*B[k-n-1][j];

I Delay the S2 instances
I Constraints must be expressed between ΘS1 and ΘS2

UCLA 43


Program Transformation: Polyhedral School

Program Transformations

Implicit fine-grain parallelism

for (i = 0; i < n; ++i)
  for (j = 0; j < n; ++j) {
S1: C[i][j] = 0;
    for (k = 0; k < n; ++k)
S2:   C[i][j] += A[i][k]*B[k][j];
  }

ΘS1.~xS1 = ( 1 0 0 0 ) . (i j n 1)T

ΘS2.~xS2 = ( 0 0 1 1 0 ) . (i j k n 1)T

Generated code:

for (i = 0; i < n; ++i)
  pfor (j = 0; j < n; ++j)
    C[i][j] = 0;
for (k = n; k < 2*n; ++k)
  pfor (j = 0; j < n; ++j)
    pfor (i = 0; i < n; ++i)
      C[i][j] += A[i][k-n]*B[k-n][j];

I Fewer (linear) rows than the loop depth → the remaining dimensions are parallel

UCLA 43


Program Transformation: Polyhedral School

Program Transformations

Representing a schedule

for (i = 0; i < n; ++i)
  for (j = 0; j < n; ++j) {
S1: C[i][j] = 0;
    for (k = 0; k < n; ++k)
S2:   C[i][j] += A[i][k]*B[k][j];
  }

ΘS1.~xS1 = [ 1 0 1 0 ] . (i j n 1)T
           [ 0 1 0 0 ]

ΘS2.~xS2 = [ 0 0 1 1 1 ] . (i j k n 1)T
           [ 0 1 0 0 0 ]
           [ 1 0 0 0 0 ]

Generated code:

for (i = n; i < 2*n; ++i)
  for (j = 0; j < n; ++j)
    C[i-n][j] = 0;
for (k = n+1; k <= 2*n; ++k)
  for (j = 0; j < n; ++j)
    for (i = 0; i < n; ++i)
      C[i][j] += A[i][k-n-1]*B[k-n-1][j];

Both schedules can be packed into a single matrix applied to the concatenated iterators, parameters ~p and constants:

Θ.~x = [ 1 0 0 0 1 1 1 0 1 ] . ( i j i j k n n 1 1 )T
       [ 0 1 0 1 0 0 0 0 0 ]
       [ 0 0 1 0 0 0 0 0 0 ]

(the first two columns are the iterators of S1, the next three those of S2, then the parameter and constant columns of each statement)

UCLA 43


Program Transformation: Polyhedral School

Program Transformations

Representing a schedule

Part of Θ   Transformation   Description

~ı          reversal         Changes the direction in which a loop traverses its iteration range
            skewing          Makes the bounds of a given loop depend on an outer loop counter
            interchange      Exchanges two loops in a perfectly nested loop, a.k.a. permutation

~p          fusion           Fuses two loops, a.k.a. jamming
            distribution     Splits a single loop nest into many, a.k.a. fission or splitting

c           peeling          Extracts one iteration of a given loop
            shifting         Allows to reorder loops

UCLA 43


Program Transformation: Polyhedral School

Fusion in the Polyhedral Model

Iteration space: 0 ... N for both Blue and Red.

for (i = 0; i <= N; ++i) {
  Blue(i);
  Red(i);
}

Perfectly aligned fusion

UCLA 44


Program Transformation: Polyhedral School

Fusion in the Polyhedral Model

Iteration space: 0 ... N, with Red shifted by one iteration.

Blue(0);
for (i = 1; i <= N; ++i) {
  Blue(i);
  Red(i-1);
}
Red(N);

Fusion with shift of 1
Not all instances are fused

UCLA 44


Program Transformation: Polyhedral School

Fusion in the Polyhedral Model

Iteration space: 0 ... N+P, with Red shifted by P iterations.

for (i = 0; i < P; ++i)
  Blue(i);
for (i = P; i <= N; ++i) {
  Blue(i);
  Red(i-P);
}
for (i = N+1; i <= N+P; ++i)
  Red(i-P);

Fusion with parametric shift of P
Automatic generation of prolog/epilog code

UCLA 44
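The effect of the parametric shift can be traced for concrete values of N and P; a Python sketch mirroring the prolog/kernel/epilog structure of the code above:

```python
# Fusion with a parametric shift: Red is delayed by P iterations, so a
# prolog (only Blue) and an epilog (only Red) surround the fused kernel.
def fused_with_shift(N, P):
    trace = []
    for i in range(0, P):                 # prolog: only Blue
        trace.append(("Blue", i))
    for i in range(P, N + 1):             # fused kernel
        trace.append(("Blue", i))
        trace.append(("Red", i - P))
    for i in range(N + 1, N + P + 1):     # epilog: only Red
        trace.append(("Red", i - P))
    return trace

t = fused_with_shift(N=4, P=2)
# Every instance is executed exactly once, and Blue(i) still precedes Red(i).
```

Each of Blue and Red still runs for i = 0 .. N in order; only the interleaving changes.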


Program Transformation: Polyhedral School

Fusion in the Polyhedral Model

Many other transformations may be required to enable fusion: interchange, skewing, etc.

UCLA 44


Program Transformation: Polyhedral School

Scheduling Matrix

Definition (Affine multidimensional schedule)

Given a statement S, an affine schedule ΘS of dimension m is an affine form on the d outer loop iterators ~xS and the p global parameters ~n. ΘS ∈ Z^(m×(d+p+1)) can be written as:

            [ θ1,1  . . .  θ1,d+p+1 ]   ( ~xS )
ΘS(~xS)  =  [  ...   ...     ...    ] . (  ~n )
            [ θm,1  . . .  θm,d+p+1 ]   (  1  )

ΘS_k denotes the kth row of ΘS.

Definition (Bounded affine multidimensional schedule)

ΘS is a bounded schedule if θS_i,j ∈ [x, y] with x, y ∈ Z

UCLA 45


Program Transformation: Polyhedral School

Another Representation

One can separate the coefficients of Θ into:
1 The iterator coefficients
2 The parameter coefficients
3 The constant coefficients

One can also enforce the schedule dimension to be 2d+1:
I A d-dimensional square matrix for the linear part
  I represents composition of interchange/skewing/slowing
I A d×n matrix for the parametric part
  I represents (parametric) shift
I A (d+1)-vector for the scalar offset
  I represents statement interleaving
I See URUK for instance

UCLA 46


Program Transformation: Polyhedral School

Computing the 2d+1 Identity Schedule

ΘS(~xS) = TS ~xS + ~tS

Example (statements S1, S2, S3 of the pictured nest; S4–S6 follow the same pattern):

ΘS1(~xS1) = (0, i, 0)
ΘS2(~xS2) = (0, i, 1, j, 0)
ΘS3(~xS3) = (0, i, 2)

UCLA 47
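A 2d+1 identity schedule interleaves the static position of a statement at each depth (the scalar components) with its d loop iterators; a Python sketch of how the example vectors above are built:

```python
# Build a 2d+1 identity schedule from the statement's static positions
# (one per depth, including the innermost) and its surrounding iterators.
def identity_schedule(betas, iterators):
    sched = []
    for b, it in zip(betas, iterators):
        sched += [b, it]          # interleave position and iterator
    sched.append(betas[len(iterators)])   # innermost static position
    return tuple(sched)

assert identity_schedule([0, 0], ["i"]) == (0, "i", 0)                  # S1
assert identity_schedule([0, 1, 0], ["i", "j"]) == (0, "i", 1, "j", 0)  # S2
assert identity_schedule([0, 2], ["i"]) == (0, "i", 2)                  # S3
```

The iterators are kept symbolic here; in the matrix view these positions are exactly the scalar rows and the iterators the linear rows.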


Program Transformation: Polyhedral School

Transformation Catalogue [1/2]

28 Girbal, Vasilache et al.

CUTDOM constrains a domain with an additional inequality, in the form of a vector c with the same dimension as a row of matrix Λ.

EXTEND inserts a new intermediate loop level at depth ℓ, initially restricted to a single iteration. This new iterator will be used in following code transformations.

ADDLOCALVAR inserts a fresh local variable to the domain and to the access functions. This local variable is typically used by CUTDOM.

PRIVATIZE and CONTRACT implement basic forms of array privatization and contraction, respectively, considering dimension ℓ of the array. Privatization needs an additional parameter s, the size of the additional dimension; s is required to update the array declaration (it cannot be inferred in general, since some references may not be affine). These primitives are simple examples updating the data layout and array access functions.

This table is not complete (e.g., it lacks index-set splitting and data-layout transformations), but it demonstrates the expressiveness of the unified representation.

(Table of transformation primitives: UNIMODULAR, SHIFT, CUTDOM, EXTEND, ADDLOCALVAR, PRIVATIZE, CONTRACT, FUSION, FISSION, MOTION.)

Fig. 18. Some classical transformation primitives

Primitives operate on the program representation while maintaining the structure of the polyhedral components (the invariants). Despite their familiar names, the primitives' practical outcome on the program representation is widely extended compared to their syntactic counterparts. Indeed, transformation primitives like fusion or interchange apply to sets of statements…

UCLA 48


Program Transformation: Polyhedral School

Transformation Catalogue [2/2]

Girbal, Vasilache et al.

(Table composing primitives: INTERCHANGE, SKEW, REVERSE, STRIPMINE and TILE, each expressed with UNIMODULAR, EXTEND and CUTDOM; e.g. TILE(P,o,k1,k2) strip-mines the outer loop, strip-mines the inner loop, then interchanges.)

Fig. 19. Composition of transformation primitives

3.2. Implementing Loop Unrolling

In the context of code optimization, one of the most important transformations is loop unrolling. A naive implementation of unrolling with statement duplications may result in severe complexity overhead for further transformations and for the code generation algorithm (its separation algorithm is exponential in the number of statements, in the worst case). Instead of implementing loop unrolling in the intermediate representation of our framework, we delay it to the code generation phase and perform full loop unrolling in a lazy way. This strategy is fully implemented in the code generation phase and is triggered by annotations (carrying depth information) of the statements whose surrounding loops need to be unrolled; unrolling occurs in the separation algorithm of the code generator (22) when all the statements being printed out are marked for unrolling at the current depth.

Practically, in most cases, loop unrolling by a factor b can be implemented as a combination of strip-mining (by a factor b) and full unrolling (6). Strip-mining itself may be implemented in several ways in a polyhedral setting. Following our earlier work in (7) and calling b the strip-mining factor, we choose to model a strip-mined loop by dividing the iteration span of the outer loop by b instead of leaving the bounds unchanged and inserting a non-unit stride b, see Figure 20.

UCLA 49


Program Transformation: Polyhedral School

Some Final Observations

Some classical pitfalls:
I The number of rows of Θ does not correspond to actual parallel levels
I Scalar rows vs. linear rows
I Linear independence
I Parametric shift for domains without parametric bound
I ...

UCLA 50


Program Transformation: Polyhedral School

Data Dependences

UCLA 51


Data Dependences: Polyhedral School

Purpose of Dependence Analysis

I Not all program transformations preserve the semantics
I Semantics is preserved if the dependences are preserved

I In standard frameworks, it usually means reordering statements/loops
  I Statements containing dependent references should not be executed in a different order
  I Granularity: usually a reference to an array

I In the polyhedral framework, it means reordering statement instances
  I Statement instances containing dependent references should not be executed in a different order
  I Granularity: a reference to an array cell

UCLA 52


Data Dependences: Polyhedral School

Example

Exercise: Compute the set of cells of A accessed

Example

for (i = 0; i < N; ++i)
  for (j = i; j < N; ++j)
    A[2*i + 3][4*j] = i * j;

I DS: {i, j | 0 ≤ i < N, i ≤ j < N}
I Function: fA : (i, j) → (2i+3, 4j)
I Image(fA, DS) is the set of cells of A accessed (a Z-polyhedron):
  I Polyhedron: Q : {i, j | 3 ≤ i < 2N+2, 0 ≤ j < 4N}
  I Lattice: L : {2i+3, 4j | i, j ∈ Z}
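For a fixed N the exercise can be checked by brute force; a Python sketch enumerating the accessed cells and testing them against Q and L:

```python
# Enumerate the cells of A accessed by the loop nest above for a fixed N,
# and check that every cell lies in the stated polyhedron Q and lattice L.
N = 5
DS = [(i, j) for i in range(N) for j in range(i, N)]       # 0 <= i <= j < N
cells = {(2 * i + 3, 4 * j) for (i, j) in DS}              # fA(i, j)

in_Q = lambda x, y: 3 <= x < 2 * N + 2 and 0 <= y < 4 * N
in_L = lambda x, y: (x - 3) % 2 == 0 and y % 4 == 0
assert all(in_Q(x, y) and in_L(x, y) for (x, y) in cells)
```

Note the converse does not hold: because of the triangular constraint i ≤ j, not every point of Q ∩ L is actually accessed.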

UCLA 53


Data Dependences: Polyhedral School

Data Dependence

Definition (Bernstein conditions)

Given two references, there exists a dependence between them if the three following conditions hold:

I they reference the same array (cell)
I at least one of these accesses is a write
I the two associated statements are executed

Three categories of dependences:
I RAW (Read-After-Write, aka flow): first reference writes, second reads
I WAR (Write-After-Read, aka anti): first reference reads, second writes
I WAW (Write-After-Write, aka output): both references write

Another kind: RAR (Read-After-Read) dependences, used for locality/reuse computations
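The classification above follows directly from which of the two conflicting accesses writes; a minimal Python sketch:

```python
# Bernstein's conditions for two accesses to the same array cell:
# a dependence exists when at least one of the two accesses is a write.
def dependence_kind(first_is_write, second_is_write):
    if first_is_write and second_is_write:
        return "WAW"   # output dependence
    if first_is_write:
        return "RAW"   # flow: first writes, second reads
    if second_is_write:
        return "WAR"   # anti: first reads, second writes
    return None        # read-after-read: no ordering constraint

assert dependence_kind(True, False) == "RAW"
```

Returning None for two reads matches the remark above: RAR pairs impose no ordering constraint but are still useful for locality/reuse analysis.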

UCLA 54


Data Dependences: Polyhedral School

Some Terminology on Dependence Relations

We categorize the dependence relations in three kinds:

I Uniform dependences: the distance between two dependent iterations is a constant
  I ex: i → i+1
  I ex: i, j → i+1, j+1

I Non-uniform dependences: the distance between two dependent iterations varies during the execution
  I ex: i → i+j
  I ex: i → 2i

I Parametric dependences: at least a parameter is involved in the dependence relation
  I ex: i → i+N
  I ex: i+N → j+M

UCLA 55


Data Dependences: Dependence Polyhedra Polyhedral School

Dependence Polyhedron [1/3]

Principle: model all pairs of instances in dependence

Definition (Dependence of statement instances)

A statement S depends on a statement R (written R→ S) if there exist an operation S(~xS), an operation R(~xR) and a memory location m such that:

1 S(~xS) and R(~xR) refer to the same memory location m, and at least one of them writes to that location,
2 ~xS and ~xR belong to the iteration domains of S and R,
3 in the original sequential order, R(~xR) is executed before S(~xS).

UCLA 56


Data Dependences: Dependence Polyhedra Polyhedral School

Dependence Polyhedron [2/3]

1 Same memory location: equality of the subscript functions of a pair of references to the same array: FS~xS + aS = FR~xR + aR.

2 Iteration domains: both S and R iteration domains can be described using affine inequalities, respectively AS~xS + cS ≥ 0 and AR~xR + cR ≥ 0.

3 Precedence order: each case corresponds to a common loop depth, and is called a dependence level.

For each dependence level l, the precedence constraints are the equality of the loop index variables at depth less than l: xR,i = xS,i for i < l, and xR,l > xS,l if l is less than the common nesting loop level. Otherwise, there is no additional constraint and the dependence only exists if S is textually before R.

Such constraints can be written using linear inequalities: Pl,S~xS − Pl,R~xR + b ≥ 0.

UCLA 57


Data Dependences: Dependence Polyhedra Polyhedral School

Dependence Polyhedron [3/3]

The dependence polyhedron for R→ S at a given level l and for a given pair of references fR, fS is described as [Feautrier/Bastoul]:

                   [ FS  −FR ]             [ aS − aR ]   (= 0)
DR,S,fR,fS,l :     [ AS   0  ]  ( ~xS ) +  [ cS      ]   (≥ ~0)
                   [ 0    AR ]  ( ~xR )    [ cR      ]   (≥ ~0)
                   [ PS  −PR ]             [ b       ]   (≥ ~0)

A few properties:
I We can always build the dependence polyhedron for a given pair of affine array accesses (it is convex)
I It is exact if the iteration domain and the access functions are also exact
I It is over-approximated if the iteration domain or the access function is an approximation

UCLA 58


Data Dependences: Dependence Polyhedra Polyhedral School

Polyhedral Dependence Graph

Definition (Polyhedral Dependence Graph)

The Polyhedral Dependence Graph is a multi-graph with one vertex per syntactic program statement Si, with edges Si → Sj for each dependence polyhedron DSi,Sj.

UCLA 59


Data Dependences: Dependence Polyhedra Polyhedral School

Scheduling

UCLA 60


Affine Scheduling: Polyhedral School

Affine Scheduling

Definition (Affine schedule)

Given a statement S, a p-dimensional affine schedule ΘS is an affine form on the outer loop iterators ~xS and the global parameters ~n. It is written:

                 ( ~xS )
ΘS(~xS) = TS  .  (  ~n ),      TS ∈ K^(p × (dim(~xS)+dim(~n)+1))
                 (  1  )

I A schedule assigns a timestamp to each executed instance of a statement
I If TS is a vector, then ΘS is a one-dimensional schedule
I If TS is a matrix, then ΘS is a multidimensional schedule

I Question: does it translate to sequential loops?

UCLA 61


Affine Scheduling: Polyhedral School

Legal Program Transformation

Definition (Precedence condition)

Given Θ^R a schedule for the instances of R, Θ^S a schedule for the instances
of S. Θ^R and Θ^S are legal schedules if ∀⟨~xR, ~xS⟩ ∈ DR,S:

  Θ^R(~xR) ≺ Θ^S(~xS)

≺ denotes the lexicographic ordering:

  (a_1, …, a_n) ≺ (b_1, …, b_m) iff ∃i, 1 ≤ i ≤ min(n, m) s.t.
  (a_1, …, a_{i−1}) = (b_1, …, b_{i−1}) and a_i < b_i

UCLA 62
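The lexicographic ordering above is exactly what tuple comparison implements; a minimal sketch (the function name is mine):

```python
def lex_lt(a, b):
    """(a1,...,an) < (b1,...,bm) lexicographically: the first differing
    coordinate decides; tuples equal up to min(n, m) are not ordered."""
    for ai, bi in zip(a, b):
        if ai != bi:
            return ai < bi
    return False

# timestamps of two statement instances in 2-d time
assert lex_lt((1, 5), (2, 0))      # the first coordinate decides
assert lex_lt((1, 2), (1, 3))      # tie broken at the second level
assert not lex_lt((1, 2), (1, 2))  # equal timestamps are not ordered
```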


Affine Scheduling: Polyhedral School

Scheduling in the Polyhedral Model

Constraints:
- The schedule must respect the precedence condition, for all dependent
  instances
- Dependence constraints can be turned into constraints on the solution set

Scheduling:
- Among all possibilities, one has to be picked
- Finding an optimal solution requires considering all legal possible
  schedules
- Question: is this always true?

UCLA 63


Affine Scheduling: Polyhedral School

One-Dimensional Affine Schedules

For now, we focus on 1-d schedules

Example

for (i = 1; i < N; ++i)
  A[i] = A[i - 1] + A[i] + A[i + 1];

- Simple program: 1 loop, 1 polyhedral statement
- 2 dependences:
  - RAW: A[i] → A[i - 1]
  - WAR: A[i + 1] → A[i]

UCLA 64


Affine Scheduling: Polyhedral School

Checking the Legality of a Schedule

Exercise: given the dependence polyhedra, check if a schedule is legal

  D1 :  [ 1   1   0   0  −1 ]              D2 :  [ 1   1   0   0  −1 ]
        [ 1  −1   0   1  −1 ]  ( iS  )           [ 1  −1   0   1  −1 ]  ( iS  )
        [ 1   0   1   0   1 ]  ( i′S )           [ 1   0   1   0   1 ]  ( i′S )
        [ 1   0  −1   1  −1 ]  ( n   )           [ 1   0  −1   1  −1 ]  ( n   )
        [ 0   1  −1   0   1 ]  ( 1   )           [ 0   1  −1   0   1 ]  ( 1   )
        [ 1  −1   1   0  −1 ]                    [ 1  −1   1   0  −1 ]

Schedules to check:
1. Θ = i
2. Θ = −i

Solution: check for the emptiness of the polyhedron

  P :  [ D ; iS ⪰ i′S ] . ( iS, i′S, n, 1 )^T

where:
- iS ⪰ i′S gets the consumer instances scheduled after the producer ones
- For Θ = −i, it is −iS ⪰ −i′S, which is non-empty

UCLA 65
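The same check can be done by brute force for a fixed N: enumerate the dependent instance pairs of the one-loop example and test the precedence condition directly, instead of testing polyhedron emptiness (a sketch; names are mine):

```python
# A[i] = A[i-1] + A[i] + A[i+1], for i in [1, N): both the RAW and the WAR
# dependence relate iteration i to iteration i+1 (distance 1).
N = 8
deps = [(i, i + 1) for i in range(1, N - 1)]

def legal(theta):
    # precedence: the consumer instance must be scheduled strictly later
    return all(theta(tgt) - theta(src) >= 1 for src, tgt in deps)

assert legal(lambda i: i)        # 1. Θ = i  is legal
assert not legal(lambda i: -i)   # 2. Θ = -i reverses the loop: illegal
```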



Affine Scheduling: Polyhedral School

A (Naive) Scheduling Approach

- Pick a schedule for the program statements
- Check if it respects all dependences

This is called filtering

Limitations:
- How to use this in combination with an objective function?
- The density of legal 1-d affine schedules is low:

               matmult    locality   fir        h264       crout
  ~i-Bounds    −1,1       −1,1       0,1        −1,1       −3,3
  c-Bounds     −1,1       −1,1       0,3        0,4        −3,3
  #Sched.      1.9×10^4   5.9×10^4   1.2×10^7   1.8×10^8   2.6×10^15
  #Legal       6561       912        792        360        798

UCLA 66
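For the one-loop example above, the same filtering can be done exhaustively. With Θ = a·i + c and both coefficients bounded in [−1, 1] (bounds assumed here, in the style of the table), only the 3 candidates with a = 1 survive:

```python
from itertools import product

# dependences of: for (i = 1; i < N; ++i) A[i] = A[i-1] + A[i] + A[i+1];
# both have distance 1, so Theta(i+1) - Theta(i) = a must be >= 1
N = 8
deps = [(i, i + 1) for i in range(1, N - 1)]

candidates = list(product([-1, 0, 1], repeat=2))          # (a, c) pairs
legal = [(a, c) for a, c in candidates
         if all(a * tgt + c - (a * src + c) >= 1 for src, tgt in deps)]

assert len(candidates) == 9
assert legal == [(1, -1), (1, 0), (1, 1)]                 # only a = 1 is legal
```

Only 3 of 9 candidates are legal; the ratio collapses further as the number of statements and coefficients grows, which is why filtering does not scale.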


Affine Scheduling: Polyhedral School

Objectives for a Good Scheduling Algorithm

- Build a legal schedule!
- Embed some properties in this legal schedule:
  - latency: minimize the time of the last iteration
  - delay: minimize the time between the first and last iteration
  - parallelism / placement
  - permutability (for tiling)
  - ...

A possible "simple" two-step approach:
- Find the solution set of all legal affine schedules
- Find an ILP/PIP formulation for the objective function(s)

UCLA 67


Affine Scheduling: Polyhedral School

The Precedence Constraint (Again!)

Precedence constraint adapted to 1-d schedules:

Definition (Causality condition for schedules)

Given DR,S, Θ^R and Θ^S are legal iff for each pair of instances in dependence:

  Θ^R(~xR) < Θ^S(~xS)

Equivalently: ∆R,S = Θ^S(~xS) − Θ^R(~xR) − 1 ≥ 0

- All functions ∆R,S which are non-negative over the dependence polyhedron
  represent legal schedules
- For the instances which are not in dependence, we don't care
- First step: how to get all non-negative functions over a polyhedron?

UCLA 68


Affine Scheduling: Polyhedral School

Affine Form of the Farkas Lemma

Lemma (Affine form of Farkas lemma)

Let D be a nonempty polyhedron defined by A~x + ~b ≥ ~0. Then any affine
function f(~x) is non-negative everywhere in D iff it is a positive affine
combination:

  f(~x) = λ_0 + ~λ^T (A~x + ~b),  with λ_0 ≥ 0 and ~λ ≥ ~0

λ_0 and ~λ^T are called the Farkas multipliers.

UCLA 69


Affine Scheduling: Polyhedral School

The Farkas Lemma: Example

- Function: f(x) = a·x + b
- Domain of x: {1 ≤ x ≤ 3} → x − 1 ≥ 0, −x + 3 ≥ 0
- Farkas lemma: f(x) ≥ 0 ⇔ f(x) = λ_0 + λ_1(x − 1) + λ_2(−x + 3)

The system to solve:

  λ_1 − λ_2         = a
  λ_0 − λ_1 + 3λ_2  = b
  λ_0 ≥ 0,  λ_1 ≥ 0,  λ_2 ≥ 0

UCLA 70
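Working the system for a concrete choice, say f(x) = x (a = 1, b = 0): the equalities force λ_1 = 1 + λ_2 and λ_0 = λ_1 − 3λ_2, and taking λ_2 = 0 gives λ_0 = λ_1 = 1. A quick numeric check of that decomposition (a sketch):

```python
# f(x) = x on {1 <= x <= 3}: Farkas multipliers lam0 = 1, lam1 = 1, lam2 = 0
lam0, lam1, lam2 = 1, 1, 0
a, b = 1, 0

# the identification equations from the system above
assert lam1 - lam2 == a
assert lam0 - lam1 + 3 * lam2 == b
assert lam0 >= 0 and lam1 >= 0 and lam2 >= 0

# the decomposition reproduces f pointwise, and is non-negative on the domain
for x in [1, 1.5, 2, 3]:
    f = lam0 + lam1 * (x - 1) + lam2 * (-x + 3)
    assert f == a * x + b and f >= 0
```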

Affine Scheduling: One-dimensional Schedules Polyhedral School

Example: Semantics Preservation (1-D)

From the affine schedules down to the valid transformation coefficients:

  Legal Distinct Schedules / Affine Schedules
    → (causality condition, Farkas lemma) → Valid Farkas Multipliers
    → (identification, projection)        → Valid Transformation Coefficients

- Causality condition

Property (Causality condition for schedules)

Given R δ S, Θ^R and Θ^S are legal iff for each pair of instances in dependence:

  Θ^R(~xR) < Θ^S(~xS)

Equivalently: ∆R,S = Θ^S(~xS) − Θ^R(~xR) − 1 ≥ 0

- Farkas Lemma

Lemma (Affine form of Farkas lemma)

Let D be a nonempty polyhedron defined by A~x + ~b ≥ ~0. Then any affine
function f(~x) is non-negative everywhere in D iff it is a positive affine
combination:

  f(~x) = λ_0 + ~λ^T (A~x + ~b),  with λ_0 ≥ 0 and ~λ ≥ ~0.

λ_0 and ~λ^T are called the Farkas multipliers. The map from Farkas
multipliers to schedules is many-to-one.

- Identification

  Θ^S(~xS) − Θ^R(~xR) − 1 = λ_0 + ~λ^T ( DR,S (~xR, ~xS)^T + ~dR,S ) ≥ 0

Equating the coefficients of each variable on both sides, for DRδS:

  iR :  −t1^R             = λD1,1 − λD1,2 + λD1,3 − λD1,4
  iS :   t1^S             = −λD1,1 + λD1,2 + λD1,5 − λD1,6
  jS :   t2^S             = λD1,7 − λD1,8
  n  :   t3^S − t2^R      = λD1,4 + λD1,6 + λD1,8
  1  :   t4^S − t3^R − 1  = λD1,0

- Projection

- Solve the constraint system
- Use (purpose-optimized) Fourier-Motzkin projection algorithm:
  - Reduce redundancy
  - Detect implicit equalities

The space of valid transformation coefficients is in bijection with the
legal distinct schedules:

- One point in the space ⇔ one set of legal schedules w.r.t. the dependences

UCLA 71


Affine Scheduling: One-dimensional Schedules Polyhedral School

Scheduling Algorithm for Multiple Dependences

Algorithm:
- Compute the schedule constraints for each dependence
- Intersect all sets of constraints
- Output is a convex solution set of all legal one-dimensional schedules

- Computation is fast, but requires eliminating variables in a system of
  inequalities: projection
- Can be computed as soon as the dependence polyhedra are known

UCLA 72


Affine Scheduling: One-dimensional Schedules Polyhedral School

Objective Function

Idea: bound the latency of the schedule and minimize this bound

Theorem (Schedule latency bound)

If all domains are bounded, and if there exists at least one 1-d schedule Θ,
then there exists at least one affine form in the structure parameters:

  L = ~u·~n + w

such that:

  ∀~xR,  L ≥ Θ^R(~xR)

- Objective function: min { ~u, w | ~u·~n + w − Θ ≥ 0 }
- Subject to: Θ is a legal schedule, and θ_i ≥ 0
- In many cases, it is equivalent to take the lexicosmallest point in the
  polytope of non-negative legal schedules

UCLA 73


Affine Scheduling: One-dimensional Schedules Polyhedral School

Example

min { ~u, w | ~u·~n + w − Θ ≥ 0 } :  Θ^R = 0,  Θ^S = k + 1

Example

parfor (i = 0; i < N; ++i)
  parfor (j = 0; j < N; ++j)
    C[i][j] = 0;
for (k = 1; k < N + 1; ++k)
  parfor (i = 0; i < N; ++i)
    parfor (j = 0; j < N; ++j)
      C[i][j] += A[i][k-1] + B[k-1][j];

UCLA 74
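On the earlier one-loop example the same minimization can be done by enumeration for a fixed N: among the legal bounded schedules Θ = a·i + c with non-negative coefficients, a = 1, c = 0 minimizes the latency bound (a sketch; the coefficient bound [0, 3] is mine):

```python
from itertools import product

N = 10                                        # a fixed value of the parameter
dom = range(1, N)                             # iteration domain of the loop
deps = [(i, i + 1) for i in range(1, N - 1)]  # both dependences: distance 1

def theta(a, c):
    return lambda i: a * i + c

legal = [(a, c) for a, c in product(range(4), repeat=2)   # a, c in [0, 3]
         if all(theta(a, c)(t) - theta(a, c)(s) >= 1 for s, t in deps)]

# minimize the latency: the timestamp of the last iteration
best = min(legal, key=lambda ac: max(theta(*ac)(i) for i in dom))
assert best == (1, 0)
assert max(theta(*best)(i) for i in dom) == N - 1         # L = N - 1
```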


Affine Scheduling: One-dimensional Schedules Polyhedral School

Limitations of One-dimensional Schedules

- Not all programs have a legal one-dimensional schedule
- Question: does this program have a 1-d schedule?

Example

for (i = 1; i < N - 1; ++i)
  for (j = 1; j < N - 1; ++j)
    A[i][j] = A[i-1][j-1] + A[i+1][j] + A[i][j+1];

- Not all compositions of transformations are possible:
  - Interchange in inner-loops
  - Fusion / distribution of inner-loops

UCLA 75
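One way to answer the question for a fixed N is brute force: enumerate the dependent instance pairs, then search bounded schedules Θ = a·i + b·j. The search below (a sketch; the fixed N and the coefficient bounds are mine) does find a legal 1-d schedule, Θ = i + j:

```python
from itertools import product

N = 7
pts = [(i, j) for i in range(1, N - 1) for j in range(1, N - 1)]

def refs(i, j):   # writes, reads of one instance of the stencil statement
    return {(i, j)}, {(i - 1, j - 1), (i + 1, j), (i, j + 1)}

deps = []
for s in pts:
    for t in pts:
        if s < t:  # source executes before target in the original nest
            ws, rs = refs(*s)
            wt, rt = refs(*t)
            if (ws & rt) or (rs & wt) or (ws & wt):  # flow / anti / output
                deps.append((s, t))

def legal(a, b):
    return all(a * (ti - si) + b * (tj - sj) >= 1
               for (si, sj), (ti, tj) in deps)

assert legal(1, 1)                           # Θ = i + j is legal
assert [ab for ab in product(range(-1, 2), repeat=2) if legal(*ab)] == [(1, 1)]
```

All dependence distances here are (1,1), (1,0) or (0,1), so any a ≥ 1, b ≥ 1 works; within the bound [−1, 1] only (1, 1) survives.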


Affine Scheduling: One-dimensional Schedules Polyhedral School

Multidimensional Scheduling

UCLA 76


Affine Scheduling: Multidimensional Scheduling Polyhedral School

Multidimensional Scheduling

- Some programs do not have a legal 1-d schedule
- It means it is not possible to enforce the precedence condition for all
  dependences within a single time dimension

Example

for (i = 0; i < N; ++i)
  for (j = 0; j < N; ++j)
    s += s;

- Intuition: multidimensional time means nested time loops
- The precedence constraint needs to be adapted to multidimensional time

UCLA 77
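A brute-force check makes the intuition concrete: for the s += s loop every iteration depends on the previous one, and any 1-d schedule Θ = a·i + b·j with bounded coefficients that works for one value of N breaks for a larger one (the coefficient bound 6 is an assumption of this sketch):

```python
from itertools import product

def dists(N):
    # consecutive iterations of the double loop: (0, 1) inside a row,
    # (1, -(N-1)) when wrapping from (i, N-1) to (i+1, 0)
    return [(0, 1), (1, -(N - 1))]

def legal_1d(a, b, N):
    return all(a * di + b * dj >= 1 for di, dj in dists(N))

ok_n4 = [(a, b) for a, b in product(range(-6, 7), repeat=2) if legal_1d(a, b, 4)]
assert ok_n4                      # schedules exist for N = 4 ...
assert (4, 1) in ok_n4            # ... e.g. Θ = 4i + j
# ... but none survives every N: legality needs a >= b(N-1) + 1, growing with N
assert not [ab for ab in ok_n4 if all(legal_1d(*ab, N) for N in range(4, 12))]
```

The 2-d schedule Θ = (i, j) sidesteps the problem for every N, which is exactly what nested time loops provide.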


Affine Scheduling: Multidimensional Scheduling Polyhedral School

Dependence Satisfaction

Definition (Strong dependence satisfaction)

Given DR,S, the dependence is strongly satisfied at schedule level k if

  ∀⟨~xR, ~xS⟩ ∈ DR,S,  Θ^S_k(~xS) − Θ^R_k(~xR) ≥ 1

Definition (Weak dependence satisfaction)

Given DR,S, the dependence is weakly satisfied at dimension k if

  ∀⟨~xR, ~xS⟩ ∈ DR,S,  Θ^S_k(~xS) − Θ^R_k(~xR) ≥ 0
  ∃⟨~xR, ~xS⟩ ∈ DR,S,  Θ^S_k(~xS) = Θ^R_k(~xR)

UCLA 78
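For a uniform dependence the per-level difference is the same for every instance pair, so the two definitions reduce to a test on one delta per level. A small classifier, with an assumed 2-d schedule Θ = (i, j) and two assumed distance vectors:

```python
rows = [(1, 0), (0, 1)]          # the two rows of the scheduling matrix
dists = [(1, -1), (0, 1)]        # distance vectors of two uniform dependences

def delta(row, dist):
    """Per-level schedule difference Θ_k(target) - Θ_k(source)."""
    return sum(r * x for r, x in zip(row, dist))

def level_status(row, dist):
    # uniform case: the delta is constant over all instance pairs,
    # so weak satisfaction degenerates to delta == 0
    d = delta(row, dist)
    return "strong" if d >= 1 else "weak" if d == 0 else "violated"

assert level_status(rows[0], dists[0]) == "strong"  # solved at level 1
assert level_status(rows[0], dists[1]) == "weak"    # tied at level 1 ...
assert level_status(rows[1], dists[1]) == "strong"  # ... solved at level 2
```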


Affine Scheduling: Multidimensional Scheduling Polyhedral School

Program Legality and Existence Results

- All dependences must be strongly satisfied for the program to be correct
- Once a dependence is strongly satisfied at level k, it does not contribute
  to the constraints of levels k + i
- Unlike with 1-d schedules, it is always possible to build a legal
  multidimensional schedule for a SCoP [Feautrier]

Theorem (Existence of an affine schedule)

Every static control program has a multidimensional affine schedule

UCLA 79


Affine Scheduling: Multidimensional Scheduling Polyhedral School

Reformulation of the Precedence Condition

- We introduce the variable δ^{DR,S}_1 to model the dependence satisfaction
- Considering the first row of the scheduling matrices, to preserve the
  precedence relation we have:

  ∀DR,S, ∀⟨~xR, ~xS⟩ ∈ DR,S,  Θ^S_1(~xS) − Θ^R_1(~xR) ≥ δ^{DR,S}_1,
  δ^{DR,S}_1 ∈ {0, 1}

Lemma (Semantics-preserving affine schedules)

Given a set of affine schedules Θ^R, Θ^S, … of dimension m, the program
semantics is preserved if:

  ∀DR,S, ∃p ∈ {1, …, m},  δ^{DR,S}_p = 1
    ∧ ∀j < p,  δ^{DR,S}_j = 0
    ∧ ∀j ≤ p, ∀⟨~xR, ~xS⟩ ∈ DR,S,  Θ^S_j(~xS) − Θ^R_j(~xR) ≥ δ^{DR,S}_j

UCLA 80


Affine Scheduling: Multidimensional Scheduling Polyhedral School

Space of All Affine Schedules

Objective:
- Design an ILP which operates on all scheduling coefficients
- Easier optimality reasoning: the space contains all schedules (hence
  necessarily the optimal one)
- Examples: maximal fusion, maximal coarse-grain parallelism, best
  locality, etc.

Idea:
- Combine all coefficients of all rows of the scheduling function into a
  single solution set
- Find a convex encoding for the lexicopositivity of dependence satisfaction:
  - A dependence must be weakly satisfied until it is strongly satisfied
  - Once it is strongly satisfied, it must not constrain subsequent levels

UCLA 81


Affine Scheduling: Multidimensional Scheduling Polyhedral School

Schedule Lower Bound

Idea:
- Bound the schedule latency with a lower bound which does not prevent
  finding all solutions
- Intuitively:
  - Θ^S(~xS) − Θ^R(~xR) ≥ δ if the dependence has not been strongly satisfied
  - Θ^S(~xS) − Θ^R(~xR) ≥ −∞ if it has

Lemma (Schedule lower bound)

Given Θ^R_k, Θ^S_k such that each coefficient value is bounded in [x, y].
Then there exists K ∈ Z such that:

  max ( Θ^S_k(~xS) − Θ^R_k(~xR) ) > −K·~n − K

UCLA 82


Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School

Space of Semantics-Preserving Affine Schedules

  All unique bounded affine multidimensional schedules
    ⊇ All unique semantics-preserving affine multidimensional schedules

1 point ↔ 1 unique semantically equivalent program
(up to affine iteration reordering)

UCLA 83


Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School

Semantics Preservation

Definition (Causality condition)

Given Θ^R a schedule for the instances of R, Θ^S a schedule for the instances
of S. Θ^R and Θ^S preserve the dependence DR,S if ∀⟨~xR, ~xS⟩ ∈ DR,S:

  Θ^R(~xR) ≺ Θ^S(~xS)

≺ denotes the lexicographic ordering:

  (a_1, …, a_n) ≺ (b_1, …, b_m) iff ∃i, 1 ≤ i ≤ min(n, m) s.t.
  (a_1, …, a_{i−1}) = (b_1, …, b_{i−1}) and a_i < b_i

UCLA 84

Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School

Lexico-positivity of Dependence Satisfaction

- Θ^R(~xR) ≺ Θ^S(~xS) is equivalently written Θ^S(~xS) − Θ^R(~xR) ≻ ~0
- Considering the row p of the scheduling matrices:

  Θ^S_p(~xS) − Θ^R_p(~xR) ≥ δ_p

- δ_p ≥ 1 implies no constraints on δ_k, k > p
- δ_p ≥ 0 is required if ∄ k < p, δ_k ≥ 1
- Schedule lower bound:

Lemma (Schedule lower bound)

Given Θ^R_k, Θ^S_k such that each coefficient value is bounded in [x, y].
Then there exists K ∈ Z such that:

  ∀~xR, ~xS,  Θ^S_k(~xS) − Θ^R_k(~xR) > −K·~n − K

UCLA 85


Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School

Convex Form of All Bounded Affine Schedules

Lemma (Convex form of semantics-preserving affine schedules)

Given a set of affine schedules Θ^R, Θ^S, … of dimension m, the program
semantics is preserved if the three following conditions hold:

  (i)   ∀DR,S,  δ^{DR,S}_p ∈ {0, 1}

  (ii)  ∀DR,S,  Σ_{p=1}^{m} δ^{DR,S}_p = 1

  (iii) ∀DR,S, ∀p ∈ {1, …, m}, ∀⟨~xR, ~xS⟩ ∈ DR,S,

        Θ^S_p(~xS) − Θ^R_p(~xR) ≥ δ^{DR,S}_p − Σ_{k=1}^{p−1} δ^{DR,S}_k · (K·~n + K)

→ Use the Farkas lemma to build all non-negative functions over a
polyhedron (here, the dependence polyhedra) [Feautrier,92]
→ Bounded coefficients required [Vasilache,07]

UCLA 86


Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School

Maximal Fine-Grain Parallelism

Objectives:
- Have as few dimensions as possible carrying a dependence
- For dimension k ∈ p..1:

  min Σ_{DR,S} δ^{DR,S}_k

- We use lexicographic optimization

UCLA 87


Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School

Key Observations

Is all of this really necessary?

- We have encoded one objective per row of Θ
- Question: do we need to solve this large ILP/LP?

UCLA 88


Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School

Feautrier’s Greedy Algorithm

Main idea:
1. Start at row 1 of Θ
2. Build the set of legal one-dimensional schedules
3. Maximize the number of dependences strongly solved (max Σ δ_i)
4. Remove strongly solved dependences from P
5. Goto 1

This is a row-by-row decomposition of the scheduling problem

UCLA 89
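The greedy loop can be sketched by replacing the per-row ILP with a brute-force search over bounded coefficients (a sketch; the uniform-dependence representation and the bound 3 are assumptions of this toy version):

```python
from itertools import product

def greedy_schedule(dists, bound=3, max_rows=4):
    """Row by row: pick the bounded row that strongly satisfies (delta >= 1)
    the most distance vectors while violating none (delta >= 0), drop the
    strongly satisfied ones, and recurse on the rest."""
    rows, remaining = [], list(dists)
    while remaining and len(rows) < max_rows:
        best_row, best_solved = None, None
        for row in product(range(-bound, bound + 1), repeat=2):
            deltas = [sum(r * x for r, x in zip(row, d)) for d in remaining]
            if all(d >= 0 for d in deltas):
                solved = [d for d, dd in zip(remaining, deltas) if dd >= 1]
                if best_solved is None or len(solved) > len(best_solved):
                    best_row, best_solved = row, solved
        rows.append(best_row)
        remaining = [d for d in remaining if d not in best_solved]
    return rows

# s += s in a double loop, N = 5: consecutive-iteration distance vectors
rows = greedy_schedule([(0, 1), (1, -4)])
assert len(rows) == 2   # a 2-d schedule is found, one dependence per row
for d in [(0, 1), (1, -4)]:
    deltas = [sum(r * x for r, x in zip(row, d)) for row in rows]
    # lexico-positivity: the first nonzero delta of each dependence is >= 1
    assert next(x for x in deltas if x != 0) >= 1
```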


Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School

Key Properties of Feautrier’s Algorithm

- It terminates
- It finds "optimal" fine-grain parallelism
- Granularity of dependence satisfaction: all-or-nothing

Example

for (i = 0; i < 2 * N; ++i)
  A[i] = A[2 * N - i];

UCLA 90


Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School

Key Observations

Is all of this really necessary?

- Question: do we need to consider all statements at once?
- Insight: the PDG gives structural information about dependences
- Decomposition of the PDG into strongly-connected components

UCLA 91


Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School

Feautrier’s Scheduler

UCLA 92


Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School

More Observations

- Some problems may be decomposed without loss of optimality
- The PDG gives extra information about further problem decomposition

Still, is all of this really necessary?

- Question: can we use additional knowledge about dependences?
- Uniform, non-uniform and parametric dependences

UCLA 93


Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School

Cost Functions

UCLA 94


Scheduling for Performance: Polyhedral School

Objectives for Good Scheduling

Fine-grain parallelism is nice, but...
I It has little connection with modern SIMD parallelism
I No information about the quality of the generated code
I Ignores all the important performance objectives:
  I Data locality / TLB / cache consideration
  I Multi-core parallelism (sync-free, with barrier)
  I SIMD vectorization

Question: how to find a FAST schedule for a modern processor?


Performance Distribution for 1-D Schedules [1/2]

[Plot residue removed: cycles vs. transformation identifier for matmult and locality, with the original program's performance marked.]

Figure : Performance distribution for matmult and locality


Performance Distribution for 1-D Schedules [2/2]

[Plot residue removed: cycles vs. transformation identifier for crout, under (a) GCC -O3 and (b) ICC -fast, with the original program marked in each.]

Figure : The effect of the compiler


Quantitative Analysis: The Hypothesis

Extremely large generated spaces: > 10^50 points

→ we must leverage static and dynamic characteristics to build traversal mechanisms

Hypothesis:
I It is possible to statically order the impact on performance of transformation coefficients, that is, decompose the search space into subspaces where the performance variation is maximal or reduced
I First rows of Θ are more performance impacting than the last ones


Observations on the Performance Distribution

[Plot residue removed: performance improvement vs. point index for the first schedule row (best / average / worst) - 8x8 DCT.]

for (i = 0; i < M; i++)
  for (j = 0; j < M; j++) {
    tmp[i][j] = 0.0;
    for (k = 0; k < M; k++)
      tmp[i][j] += block[i][k] * cos1[j][k];
  }

for (i = 0; i < M; i++)
  for (j = 0; j < M; j++) {
    sum2 = 0.0;
    for (k = 0; k < M; k++)
      sum2 += cos1[i][k] * tmp[k][j];
    block[i][j] = ROUND(sum2);
  }

I Extensive study of 8x8 Discrete Cosine Transform (UTDSP)
I Search space analyzed: 66×19683 = 1.29×10^6 different legal program versions



Observations on the Performance Distribution

[Plot residue removed: the same 8x8 DCT distribution, annotated with the best, average and worst points.]

I Take one specific value for the first row
I Try the 19683 possible values for the second row
I Very low proportion of best points: < 0.02%


[Plot residue removed: sorted performance distribution over the 19683 values of the second schedule dimension, first one fixed - 8x8 DCT.]


I Performance variation is large for good values of the first row
I It is usually reduced for bad values of the first row



Scanning The Space of Program Versions

The search space:
I Performance variation indicates to partition the space: ~ı > ~p > c (iterator coefficients matter most, then parameter coefficients, then the constant)
I Non-uniform distribution of performance
I No clear analytical property of the optimization function

→ Build dedicated heuristic and genetic operators aware of these static and dynamic characteristics


The Quest for Good Objective Functions

I For data locality, loop tiling is key
  I But what is the cost of tiling?
  I Is tiling the only criterion?
I For coarse-grain parallelism, doall parallelization is key
  I But what is the cost of parallelization?


Dependence Distance Minimization

I Idea: minimize the delay between instances accessing the same data
I Formulation in the polyhedral model:
  I Expression of the delay through a parametric form
  I Use all dependences (including RAR)

Definition (Dependence distance minimization)

    uk.~n + wk ≥ ΘS_k(~xS) − ΘR_k(~xR),   ∀〈~xR,~xS〉 ∈ DR,S   (1)

    with uk ∈ N^p, wk ∈ N


Key Observations

I Minimizing d = uk.~n + wk minimizes the dependence distance
I When d = 0, then ΘR_k(~xR) = ΘS_k(~xS), since 0 ≥ ΘR_k(~xR) − ΘS_k(~xS) ≥ 0
I d gives an indication of the communication volume between hyperplanes


Scheduling for Performance: Tiling Polyhedral School

An Overview of Tiling

Tiling: partition the computation into atomic blocks

I Early work in the late 80's
I Motivation: data locality improvement + parallelization


An Overview of Tiling

I Tiling the iteration space
  I It must be valid (dependence analysis required)
  I It may require pre-transformation
  I Unimodular transformation framework limitations
I Supported in current compilers, but limited applicability
I Challenges: imperfectly nested loops, parametric loops, pre-transformations, tile shape, ...
I Tile size selection
  I Critical for locality concerns: determines the footprint
  I Empirical search of the best size (problem + machine specific)
  I Parametric tiling makes the generated code valid for any tile size


Scheduling for Performance: The Tiling Hyperplane Method Polyhedral School

Tiling in the Polyhedral Model

I Tiling partitions the computation into blocks
I Note we consider only rectangular tiling here
I For tiling to be legal, such a partitioning must be legal


Key Ideas of the Tiling Hyperplane Algorithm

Affine transformations for communication-minimal parallelization and locality optimization of arbitrarily nested loop sequences [Bondhugula et al., CC'08 & PLDI'08]

I Compute a set of transformations to make loops tilable
  I Try to minimize synchronizations
  I Try to maximize locality (maximal fusion)
I Result is a set of permutable loops, if possible
  I Strip-mining / tiling can be applied
  I Tiles may be sync-free parallel or pipeline parallel
I Algorithm always terminates (possibly by splitting loops/statements)


Legality of Tiling

Theorem (Legality of Tiling)

Given two one-dimensional schedules ΘR_k, ΘS_k, they are valid tiling hyperplanes if

    ∀DR,S, ∀〈~xR,~xS〉 ∈ DR,S:  ΘS_k(~xS) − ΘR_k(~xR) ≥ 0

I For a schedule to be a legal tiling hyperplane, all communications must go forward: Forward Communication Only [Griebl]
I All dependences must be considered at each level, including those previously strongly satisfied
I Equivalence between loop permutability and loop tilability


Greedy Algorithm for Tiling Hyperplane Computation

1 Start from the outer-most level, find the set of FCO schedules
2 Select one which minimizes the distance between dependent iterations
3 Mark dependences strongly satisfied by this schedule, but do not remove them
4 Formulate the problem for the next level (FCO), adding orthogonality constraints (linear independence)
5 Solve again, etc.

Special treatment when no permutable band can be found: splitting

A few properties:
I Result is a set of permutable/tilable outer loops, when possible
I It exhibits coarse-grain parallelism
I Maximal fusion achieved to improve locality


Example: 1D-Jacobi

(Slides: The Ohio State University / Louisiana State University)

1-D Jacobi (imperfectly nested)

for (t=1; t<M; t++) {
  for (i=2; i<N-1; i++)
S:  b[i] = 0.333*(a[i-1]+a[i]+a[i+1]);
  for (j=2; j<N-1; j++)
T:  a[j] = b[j];
}


Example: 1D-Jacobi


1-D Jacobi (imperfectly nested)

•  The resulting transformation is equivalent to a constant shift of one for T relative to S, fusion (j and i are named the same as a result), and skewing the fused i loop with respect to the t loop by a factor of two.

•  The (1,0) hyperplane has the least communication: no dependence crosses more than one hyperplane instance along it.


Example: 1D-Jacobi

[Diagram removed: transforming S — the (t, i) iteration space of S and its transformed image.]


Example: 1D-Jacobi

[Diagram removed: transforming T — the (t, j) iteration space of T and its transformed image.]


Example: 1D-Jacobi

[Diagram removed: interleaving S and T in the transformed space.]



Example: 1D-Jacobi


1-D Jacobi (imperfectly nested) – transformed code

for (t0=0; t0<=M-1; t0++) {
S': b[2] = 0.333*(a[2-1]+a[2]+a[2+1]);
  for (t1=2*t0+3; t1<=2*t0+N-2; t1++) {
S:  b[-2*t0+t1] = 0.333*(a[-2*t0+t1-1]+a[-2*t0+t1]+a[-2*t0+t1+1]);
T:  a[-2*t0+t1-1] = b[-2*t0+t1-1];
  }
T': a[N-2] = b[N-2];
}



Scheduling for Performance: Fusion-driven Optimization Polyhedral School

Fusion-driven Optimization


Overview

Problem: How to improve program execution time?

I Focus on shared-memory computation
  I OpenMP parallelization
  I SIMD vectorization
  I Efficient usage of the intra-node memory hierarchy
I Challenges to address:
  I Different machines require different compilation strategies
  I One-size-fits-all scheme hinders optimization opportunities

Question: how to restructure the code for performance?


Objectives for a Successful Optimization

During the program execution, interplay between the hardware resources:
I Thread-centric parallelism
I SIMD-centric parallelism
I Memory layout, inc. caches, prefetch units, buses, interconnects...

→ Tuning the trade-off between these is required

A loop optimizer must be able to transform the program for:
I Thread-level parallelism extraction
I Loop tiling, for data locality
I Vectorization

Our approach: form a tractable search space of possible loop transformations


Running Example

Original code

Example (tmp = A.B, D = tmp.C)

for (i1 = 0; i1 < N; ++i1)
  for (j1 = 0; j1 < N; ++j1) {
R:  tmp[i1][j1] = 0;
    for (k1 = 0; k1 < N; ++k1)
S:    tmp[i1][j1] += A[i1][k1] * B[k1][j1];
  }
for (i2 = 0; i2 < N; ++i2)
  for (j2 = 0; j2 < N; ++j2) {
T:  D[i2][j2] = 0;
    for (k2 = 0; k2 < N; ++k2)
U:    D[i2][j2] += tmp[i2][k2] * C[k2][j2];
  }

{R,S} fused, {T,U} fused

                           Original   Max. fusion   Max. dist   Balanced
4× Xeon 7450 / ICC 11      1×
4× Opteron 8380 / ICC 11   1×


Running Example

Cost model: maximal fusion, minimal synchronization [Bondhugula et al., PLDI'08]

Example (tmp = A.B, D = tmp.C)

parfor (c0 = 0; c0 < N; c0++) {
  for (c1 = 0; c1 < N; c1++) {
R:  tmp[c0][c1] = 0;
T:  D[c0][c1] = 0;
    for (c6 = 0; c6 < N; c6++)
S:    tmp[c0][c1] += A[c0][c6] * B[c6][c1];
    parfor (c6 = 0; c6 <= c1; c6++)
U:    D[c0][c6] += tmp[c0][c1-c6] * C[c1-c6][c6];
  }
  for (c1 = N; c1 < 2*N - 1; c1++)
    parfor (c6 = c1-N+1; c6 < N; c6++)
U:    D[c0][c6] += tmp[c0][c1-c6] * C[c1-c6][c6];
}

{R,S,T,U} fused

                           Original   Max. fusion   Max. dist   Balanced
4× Xeon 7450 / ICC 11      1×         2.4×
4× Opteron 8380 / ICC 11   1×         2.2×


Running Example

Maximal distribution: best for Intel Xeon 7450
Poor data reuse, best vectorization

Example (tmp = A.B, D = tmp.C)

parfor (i1 = 0; i1 < N; ++i1)
  parfor (j1 = 0; j1 < N; ++j1)
R:  tmp[i1][j1] = 0;
parfor (i1 = 0; i1 < N; ++i1)
  for (k1 = 0; k1 < N; ++k1)
    parfor (j1 = 0; j1 < N; ++j1)
S:    tmp[i1][j1] += A[i1][k1] * B[k1][j1];
parfor (i2 = 0; i2 < N; ++i2)
  parfor (j2 = 0; j2 < N; ++j2)
T:  D[i2][j2] = 0;
parfor (i2 = 0; i2 < N; ++i2)
  for (k2 = 0; k2 < N; ++k2)
    parfor (j2 = 0; j2 < N; ++j2)
U:    D[i2][j2] += tmp[i2][k2] * C[k2][j2];

{R} and {S} and {T} and {U} distributed

                           Original   Max. fusion   Max. dist   Balanced
4× Xeon 7450 / ICC 11      1×         2.4×          3.9×
4× Opteron 8380 / ICC 11   1×         2.2×          6.1×


Running Example

Balanced distribution/fusion: best for AMD Opteron 8380

Example (tmp = A.B, D = tmp.C)

parfor (c1 = 0; c1 < N; c1++)
  parfor (c2 = 0; c2 < N; c2++)
R:  C[c1][c2] = 0;
parfor (c1 = 0; c1 < N; c1++)
  for (c3 = 0; c3 < N; c3++) {
T:  E[c1][c3] = 0;
    parfor (c2 = 0; c2 < N; c2++)
S:    C[c1][c2] += A[c1][c3] * B[c3][c2];
  }
parfor (c1 = 0; c1 < N; c1++)
  for (c3 = 0; c3 < N; c3++)
    parfor (c2 = 0; c2 < N; c2++)
U:    E[c1][c2] += C[c1][c3] * D[c3][c2];

{S,T} fused, {R} and {U} distributed

                           Original   Max. fusion   Max. dist   Balanced
4× Xeon 7450 / ICC 11      1×         2.4×          3.9×        3.1×
4× Opteron 8380 / ICC 11   1×         2.2×          6.1×        8.3×


Running Example


The best fusion/distribution choice drives the quality of the optimization


Loop Structures

Possible grouping + ordering of statements

I {{R}, {S}, {T}, {U}}; {{R}, {S}, {U}, {T}}; ...
I {{R,S}, {T}, {U}}; {{R}, {S}, {T,U}}; {{R}, {T,U}, {S}}; {{T,U}, {R}, {S}}; ...
I {{R,S,T}, {U}}; {{R}, {S,T,U}}; {{S}, {R,T,U}}; ...
I {{R,S,T,U}}

Number of possibilities: >> n! (number of total preorders)


Loop Structures

Removing non-semantics-preserving ones

I {{R}, {S}, {T}, {U}}; {{R}, {S}, {U}, {T}}; ...
I {{R,S}, {T}, {U}}; {{R}, {S}, {T,U}}; {{R}, {T,U}, {S}}; {{T,U}, {R}, {S}}; ...
I {{R,S,T}, {U}}; {{R}, {S,T,U}}; {{S}, {R,T,U}}; ...
I {{R,S,T,U}}

Number of possibilities: 1 to 200 for our test suite


Loop Structures

For each partitioning, many possible loop structures

I {{R}, {S}, {T}, {U}}
I For S: {i, j, k}; {i, k, j}; {k, i, j}; {k, j, i}; ...
I However, only {i, k, j} has:
  I outer-parallel loop
  I inner-parallel loop
  I lowest striding access (efficient vectorization)


Possible Loop Structures for 2mm

I 4 statements, 75 possible partitionings
I 10 loops, up to 10! possible loop structures for a given partitioning
I Two steps:
  I Remove all partitionings which break the semantics: from 75 to 12
  I Use static cost models to select the loop structure for a partitioning: from d! to 1
I Final search space: 12 possibilities


Contributions and Overview of the Approach

I Empirical search on possible fusion/distribution schemes
I Each structure drives the success of other optimizations:
  I Parallelization
  I Tiling
  I Vectorization
I Use static cost models to compute a complex loop transformation for a specific fusion/distribution scheme
I Iteratively test the different versions, retain the best
I Best performing loop structure is found


Search Space of Loop Structures

I Partition the set of statements into classes:
  I This is deciding loop fusion / distribution
  I Statements in the same class will share at least one common loop in the target code
  I Classes are ordered, to reflect code motion
I Locally on each partition, apply model-driven optimizations
I Leverage the polyhedral framework:
  I Build the smallest yet most expressive space of possible partitionings [Pouchet et al., POPL'11]
  I Consider semantics-preserving partitionings only: orders of magnitude smaller space


Summary of the Optimization Process

benchmark     description                  #loops  #stmts  #refs  #deps  #part.  #valid  Variability  Pb. Size
2mm           Linear algebra (BLAS3)         6       4       8     12      75      12       X         1024x1024
3mm           Linear algebra (BLAS3)         9       6      12     19    4683     128       X         1024x1024
adi           Stencil (2D)                  11       8      36    188  545835       1                 1024x1024
atax          Linear algebra (BLAS2)         4       4      10     12      75      16       X         8000x8000
bicg          Linear algebra (BLAS2)         3       4      10     10      75      26       X         8000x8000
correl        Correlation (PCA: StatLib)     5       6      12     14    4683     176       X         500x500
covar         Covariance (PCA: StatLib)      7       7      13     26   47293      96       X         500x500
doitgen       Linear algebra                 5       3       7      8      13       4                 128x128x128
gemm          Linear algebra (BLAS3)         3       2       6      6       3       2                 1024x1024
gemver        Linear algebra (BLAS2)         7       4      19     13      75       8       X         8000x8000
gesummv       Linear algebra (BLAS2)         2       5      15     17     541      44       X         8000x8000
gramschmidt   Matrix normalization           6       7      17     34   47293       1                 512x512
jacobi-2d     Stencil (2D)                   5       2       8     14       3       1                 20x1024x1024
lu            Matrix decomposition           4       2       7     10       3       1                 1024x1024
ludcmp        Solver                         9      15      40    188    1012      20       X         1024x1024
seidel        Stencil (2D)                   3       1      10     27       1       1                 20x1024x1024

Table : Summary of the optimization process


Experimental Setup

We compare three schemes:
I maxfuse: static cost model for fusion (maximal fusion)
I smartfuse: static cost model for fusion (fuse only if data reuse)
I iterative: iterative compilation, output the best result


Performance Results - Intel Xeon 7450 - ICC 11

[Plot residue removed: performance improvement over ICC -fast -parallel on Intel Xeon 7450 (24 threads) for the 16 benchmarks, comparing pocc-maxfuse, pocc-smartfuse and iterative.]


Performance Results - AMD Opteron 8380 - ICC 11

[Plot residue removed: performance improvement over ICC -fast -parallel on AMD Opteron 8380 (16 threads) for the 16 benchmarks, comparing pocc-maxfuse, pocc-smartfuse and iterative.]


Performance Results - Intel Atom 330 - GCC 4.3

[Plot residue removed: performance improvement over GCC 4.3 -O3 -fopenmp on Intel Atom (2 threads) for the 16 benchmarks, comparing pocc-maxfuse, pocc-smartfuse and iterative.]


Assessment from Experimental Results

1 Empirical tuning required for 9 out of 16 benchmarks
2 Strong performance improvements: 2.5× - 3× on average
3 Portability achieved:
  I Automatically adapt to the program and target architecture
  I No assumption made about the target
  I Exhaustive search finds the optimal structure (1-176 variants)
4 Substantial improvements over state-of-the-art (up to 2×)


Approximate Scheduling: Polyhedral School

Approximate Scheduling


Exact vs. Approximate Dependences

I So far, we used an exact dependence representation
I But numerous other works use approximate representations:
  I dependence distance vectors [Allen/Kennedy, KMW]
  I dependence directions
  I DDVs approximated as polyhedra [Darte-Vivien]
I Optimality may not always be lost when using a relaxed dependence representation!


Key Aspects of Dependence Representations

I Can use a graph to represent the dependences
  I Karp, Miller, Winograd: graph for SURE, using DDVs
  I Darte-Vivien: "PDG"-like graph
I For certain kinds of dependences (e.g., uniform), "optimal" algorithms:
  I Darte-Vivien
  I SURE


Some Observations

I Maximal parallelism (under some conditions) can be extracted from less precise representations
I Very fast scheduling algorithms can be derived
I Remaining question: how useful are those algorithms?


Conclusion


The End

Let’s stop here for today!
