TRANSCRIPT
Program transformations and optimizations in the polyhedral framework
Louis-Noël Pouchet
Dept. of Computer Science, UCLA
May 14, 2013
1st Polyhedral Spring School, St-Germain au Mont d'or, France
Overview
Agenda for the Lecture
Disclaimer: this lecture is (mainly) for new Ph.D. students

1 Program Representation in the Polyhedral Model
2 Applying loop transformations in the model
3 Exact data dependence representation
4 Scheduling using Farkas-based methods
5 Scheduling using approximate data dependences (if time permits...)
Running Example: DGEMM
Example (dgemm):

/* C := alpha*A*B + beta*C */
for (i = 0; i < ni; i++)
  for (j = 0; j < nj; j++)
S1:   C[i][j] *= beta;
for (i = 0; i < ni; i++)
  for (j = 0; j < nj; j++)
    for (k = 0; k < nk; ++k)
S2:     C[i][j] += alpha * A[i][k] * B[k][j];

- Loop transformation: permute(i,k,S2)

Execution time (in s) on this laptop, GCC 4.2, ni=nj=nk=512:

version    -O0    -O1    -O2    -O3 -vec
original   1.81   0.78   0.78   0.78
permute    1.52   0.35   0.35   0.20
http://gcc.gnu.org/onlinedocs/gcc-4.2.1/gcc/Optimize-Options.html
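A sketch of the permuted code (my addition, not on the slide), reading permute(i,k,S2) as exchanging the i and k loops of the S2 nest:

/* dgemm after permute(i,k,S2); the S1 nest is unchanged */
for (i = 0; i < ni; i++)
  for (j = 0; j < nj; j++)
S1:   C[i][j] *= beta;
for (k = 0; k < nk; ++k)       /* k is now the outermost loop of S2 */
  for (j = 0; j < nj; j++)
    for (i = 0; i < ni; i++)   /* i is now innermost */
S2:     C[i][j] += alpha * A[i][k] * B[k][j];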
Running Example: fdtd-2d
Example (fdtd-2d):

for (t = 0; t < tmax; t++) {
  for (j = 0; j < ny; j++)
    ey[0][j] = _edge_[t];
  for (i = 1; i < nx; i++)
    for (j = 0; j < ny; j++)
      ey[i][j] = ey[i][j] - 0.5*(hz[i][j]-hz[i-1][j]);
  for (i = 0; i < nx; i++)
    for (j = 1; j < ny; j++)
      ex[i][j] = ex[i][j] - 0.5*(hz[i][j]-hz[i][j-1]);
  for (i = 0; i < nx - 1; i++)
    for (j = 0; j < ny - 1; j++)
      hz[i][j] = hz[i][j] - 0.7*(ex[i][j+1] - ex[i][j] + ey[i+1][j] - ey[i][j]);
}

- Loop transformation: polyhedralOpt(fdtd-2d)

Execution time (in s) on this laptop, GCC 4.2, 64x1024x1024:

version         -O0    -O1    -O2    -O3 -vec
original        2.59   1.62   1.54   1.54
polyhedralOpt   2.05   0.41   0.41   0.41
Manual transformations?
Example (fdtd-2d): same code as on the previous slide.
NO NO NO!!!
Example (fdtd-2d tiled)for (c0 = 0; c0 <= (((ny + 2 * tmax + -3) * 32 < 0?((32 < 0?-((-(ny + 2 * tmax + -3) + 32 + 1) / 32) : -((-(ny + 2 *tmax + -3) + 32 - 1) / 32))) : (ny + 2 * tmax + -3) / 32)); ++c0) {
#pragma omp parallel for private(c3, c4, c2, c5)for (c1 = (((c0 * 2 < 0?-(-c0 / 2) : ((2 < 0?(-c0 + -2 - 1) / -2 : (c0 + 2 - 1) / 2)))) > (((32 * c0 + -tmax + 1) *
32 < 0?-(-(32 * c0 + -tmax + 1) / 32) : ((32 < 0?(-(32 * c0 + -tmax + 1) + -32 - 1) / -32 : (32 * c0 + -tmax + 1 + 32- 1) / 32))))?((c0 * 2 < 0?-(-c0 / 2) : ((2 < 0?(-c0 + -2 - 1) / -2 : (c0 + 2 - 1) / 2)))) : (((32 * c0 + -tmax + 1) *32 < 0?-(-(32 * c0 + -tmax + 1) / 32) : ((32 < 0?(-(32 * c0 + -tmax + 1) + -32 - 1) / -32 : (32 * c0 + -tmax + 1 + 32- 1) / 32))))); c1 <= (((((((ny + tmax + -2) * 32 < 0?((32 < 0?-((-(ny + tmax + -2) + 32 + 1) / 32) : -((-(ny + tmax +-2) + 32 - 1) / 32))) : (ny + tmax + -2) / 32)) < (((32 * c0 + ny + 30) * 64 < 0?((64 < 0?-((-(32 * c0 + ny + 30) + 64+ 1) / 64) : -((-(32 * c0 + ny + 30) + 64 - 1) / 64))) : (32 * c0 + ny + 30) / 64))?(((ny + tmax + -2) * 32 < 0?((32 <0?-((-(ny + tmax + -2) + 32 + 1) / 32) : -((-(ny + tmax + -2) + 32 - 1) / 32))) : (ny + tmax + -2) / 32)) : (((32 * c0 +ny + 30) * 64 < 0?((64 < 0?-((-(32 * c0 + ny + 30) + 64 + 1) / 64) : -((-(32 * c0 + ny + 30) + 64 - 1) / 64))) : (32 *c0 + ny + 30) / 64)))) < c0?(((((ny + tmax + -2) * 32 < 0?((32 < 0?-((-(ny + tmax + -2) + 32 + 1) / 32) : -((-(ny + tmax+ -2) + 32 - 1) / 32))) : (ny + tmax + -2) / 32)) < (((32 * c0 + ny + 30) * 64 < 0?((64 < 0?-((-(32 * c0 + ny + 30) + 64+ 1) / 64) : -((-(32 * c0 + ny + 30) + 64 - 1) / 64))) : (32 * c0 + ny + 30) / 64))?(((ny + tmax + -2) * 32 < 0?((32 <0?-((-(ny + tmax + -2) + 32 + 1) / 32) : -((-(ny + tmax + -2) + 32 - 1) / 32))) : (ny + tmax + -2) / 32)) : (((32 * c0 +ny + 30) * 64 < 0?((64 < 0?-((-(32 * c0 + ny + 30) + 64 + 1) / 64) : -((-(32 * c0 + ny + 30) + 64 - 1) / 64))) : (32 *c0 + ny + 30) / 64)))) : c0)); ++c1) {
for (c2 = c0 + -c1; c2 <= (((((tmax + nx + -2) * 32 < 0?((32 < 0?-((-(tmax + nx + -2) + 32 + 1) / 32) : -((-(tmax +nx + -2) + 32 - 1) / 32))) : (tmax + nx + -2) / 32)) < (((32 * c0 + -32 * c1 + nx + 30) * 32 < 0?((32 < 0?-((-(32 * c0 +-32 * c1 + nx + 30) + 32 + 1) / 32) : -((-(32 * c0 + -32 * c1 + nx + 30) + 32 - 1) / 32))) : (32 * c0 + -32 * c1 + nx +30) / 32))?(((tmax + nx + -2) * 32 < 0?((32 < 0?-((-(tmax + nx + -2) + 32 + 1) / 32) : -((-(tmax + nx + -2) + 32 - 1) /32))) : (tmax + nx + -2) / 32)) : (((32 * c0 + -32 * c1 + nx + 30) * 32 < 0?((32 < 0?-((-(32 * c0 + -32 * c1 + nx + 30)+ 32 + 1) / 32) : -((-(32 * c0 + -32 * c1 + nx + 30) + 32 - 1) / 32))) : (32 * c0 + -32 * c1 + nx + 30) / 32)))); ++c2){
if (c0 == 2 * c1 && c0 == 2 * c2) {for (c3 = 16 * c0; c3 <= ((tmax + -1 < 16 * c0 + 30?tmax + -1 : 16 * c0 + 30)); ++c3)
if (c0 % 2 == 0)(ey[0])[0] = (_edge_[c3]);
........... (200 more lines!)
Performance gain: 2-6× on modern multicore platforms
Loop Transformations in Production Compilers
Limitations of standard syntactic frameworks:
- Composition of transformations may be tedious
  - composability rules / applicability
- Parametric loop bounds and imperfectly nested loops are challenging
  - Look at the examples!
- Approximate dependence analysis
  - Misses parallelization opportunities (among many others)
- (Very) conservative performance models
Achievements of Polyhedral Compilation
The polyhedral model:
- Models and seamlessly applies arbitrary compositions of transformations
- Automatic handling of imperfectly nested, parametric loop structures
- Many loop transformations can be modeled
- Exact dependence analysis on a class of programs
  - Unleashes the power of automatic parallelization
  - Aggressive multi-objective program restructuring (parallelism, SIMD, cache, etc.)
- Requires computationally expensive algorithms
  - Usually NP-complete / exponential complexity
  - Requires careful problem statement/representation
Research around Polyhedral Compilation
"first generation"I Provided the theoretical foundationsI Focused on high-level/abstract objectivesI Was questioned about how practical is this model
"second generation"I Provided practical / "more scalable" algorithmsI Focused on concrete performance improvementI Still often questioned about how practical is this model
I Thanks for the legacy! ;-)
Perspective on 25 Years of Computing
"first generation"I Parallel machines were... rareI Era of increased program performance for freeI Intel 486 DX2: 54 MIPS, 66MHz
"second generation"I Parallel machines are ubiquitous (multi-core, SIMD, GPUs)I Very difficult to get good performanceI Intel Core i7: 177730 MIPS, 3.33 GHz
More complex problems can be solved, but more complex objectives are needed anyway.
Disclaimer
This lecture is:
- Partial
- Biased (strongly!)
- Provocative at times (hopefully ;-) )

[Background image: Google Scholar results for "polyhedral model affine scheduling" (about 3,520 results), listing work by Quilleré/Rajopadhye/Wilde, Bastoul/Cohen/Girbal/Sharma, Bondhugula/Hartono/Ramanujam, Baskaran/Ramanujam/Sadayappan, and Pouchet/Bastoul/Cohen/Cavazos, among others.]
Polyhedral Program Representation
Example: DGEMM
Example (dgemm):

/* C := alpha*A*B + beta*C */
for (i = 0; i < ni; i++)
  for (j = 0; j < nj; j++) {
S1: C[i][j] *= beta;
    for (k = 0; k < nk; ++k)
S2:   C[i][j] += alpha * A[i][k] * B[k][j];
  }

This code has:
- imperfectly nested loops
- multiple statements
- parametric loop bounds
Granularity of Program Representation
DGEMM has:
- 3 loops
  - for loops in the code, while loops?
  - Control-flow graph analysis
- 2 (syntactic) statements
  - Input source code?
  - Basic block?
  - ASM instructions?
- S1 is executed ni × nj times
  - dynamic instances of the statement
  - Does not (necessarily) correspond to reality!
Some Observations
Reasoning at the loop/statement level:
- Some loop transformations may be very difficult to implement
  - How to fuse loops with different loop bounds?
  - How to permute triangular loops?
  - How to unroll-and-jam triangular loops?
  - How to apply time-tiling?
  - ...
- Statements may operate on the same array while being independent
Some Motivations for Polyhedral Transformations
- Known problem: scheduling of a task graph
  - Obvious limitations: the task graph is not finite / its size depends on the problem / effective code generation is almost impossible
- Alternative approach: use loop transformations
  - solves all the limitations above
  - BUT the problem is to find a sequence that implements the order we want
  - AND also how to apply/compose them
- Desired features:
  - ability to reason at the instance level (as for task graph scheduling)
  - ability to easily apply/compose loop transformations
The Polyhedral Model
Polyhedral Program Optimization: a Three-Stage Process
1 Analysis: from code to model
  → Existing prototype tools
    - URUK, Omega, ChiLL...
    - ISL, Loopo
    - PolyOpt/PoCC (Clan-Candl-LetSee-Pluto-Ponos-Cloog-Polylib-PIPLib-ISL-FM-...)
  → GCC GRAPHITE, LLVM Polly (now in mainstream)
  → Reservoir Labs R-Stream, IBM XL/Poly
2 Transformation in the model
  → Build and select a program transformation
3 Code generation: from model to code
  → "Apply" the transformation in the model
  → Regenerate syntactic (AST-based) code
Motivating Example [1/2]
Example:

for (i = 0; i <= 1; ++i)
  for (j = 0; j <= 2; ++j)
    A[i][j] = i * j;

Program execution:

1: A[0][0] = 0 * 0;
2: A[0][1] = 0 * 1;
3: A[0][2] = 0 * 2;
4: A[1][0] = 1 * 0;
5: A[1][1] = 1 * 1;
6: A[1][2] = 1 * 2;
Motivating Example [2/2]
A few observations:
- The statement is executed 6 times
- A different value of (i, j) is associated with each of these 6 instances
- There is an order on them (the execution order)

A rough analogy: polyhedral compilation is about (statically) scheduling tasks, where tasks are statement instances, or operations.
Polyhedral Representation of Programs
Static Control Parts:
- Loops have affine control only (over-approximation otherwise)
- Iteration domain: represented as integer polyhedra

for (i = 1; i <= n; ++i)
  for (j = 1; j <= n; ++j)
    if (i <= n - j + 2)
      s[i] = ...;

$$ \mathcal{D}_{S1} = \left\{ (i,j) \;\middle|\; \begin{pmatrix} 1 & 0 & 0 & -1 \\ -1 & 0 & 1 & 0 \\ 0 & 1 & 0 & -1 \\ 0 & -1 & 1 & 0 \\ -1 & -1 & 1 & 2 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix} \ge \vec{0} \right\} $$
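To make the constraint view concrete, here is a minimal sketch (my addition, not on the slide) that scans a bounding box and keeps exactly the points satisfying D_S1 · (i, j, n, 1)^T ≥ 0 for n = 3:

#include <stdio.h>

int main(void) {
  int n = 3;
  /* Rows of D_S1: each row d encodes d[0]*i + d[1]*j + d[2]*n + d[3] >= 0 */
  int D[5][4] = { {  1,  0, 0, -1 },   /* i - 1 >= 0         */
                  { -1,  0, 1,  0 },   /* n - i >= 0         */
                  {  0,  1, 0, -1 },   /* j - 1 >= 0         */
                  {  0, -1, 1,  0 },   /* n - j >= 0         */
                  { -1, -1, 1,  2 } }; /* n - i - j + 2 >= 0 */
  for (int i = 0; i <= n + 1; ++i)
    for (int j = 0; j <= n + 1; ++j) {
      int inside = 1;
      for (int r = 0; r < 5; ++r)
        if (D[r][0]*i + D[r][1]*j + D[r][2]*n + D[r][3] < 0)
          inside = 0;
      if (inside)
        printf("(%d, %d)\n", i, j);  /* the dynamic instances of S1 */
    }
  return 0;
}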
Static Control Parts (continued):
- Memory accesses: static references, represented as affine functions of $\vec{x}_S$ and $\vec{p}$

for (i = 0; i < n; ++i) {
  s[i] = 0;
  for (j = 0; j < n; ++j)
    s[i] = s[i] + a[i][j] * x[j];
}

$$ f_s(\vec{x}_{S2}) = \begin{pmatrix} 1 & 0 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} \vec{x}_{S2} \\ n \\ 1 \end{pmatrix} \quad f_a(\vec{x}_{S2}) = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} \vec{x}_{S2} \\ n \\ 1 \end{pmatrix} \quad f_x(\vec{x}_{S2}) = \begin{pmatrix} 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} \vec{x}_{S2} \\ n \\ 1 \end{pmatrix} $$
Static Control Parts (continued):
- Data dependence between S1 and S2: a subset of the Cartesian product of $\mathcal{D}_{S1}$ and $\mathcal{D}_{S2}$ (exact analysis)

for (i = 1; i <= 3; ++i) {
  s[i] = 0;
  for (j = 1; j <= 3; ++j)
    s[i] = s[i] + 1;
}

$$ \mathcal{D}_{S1 \delta S2}: \quad \begin{pmatrix} 1 & -1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i_{S1} \\ i_{S2} \\ j_{S2} \\ 1 \end{pmatrix} = 0, \qquad \begin{pmatrix} 1 & 0 & 0 & -1 \\ -1 & 0 & 0 & 3 \\ 0 & 1 & 0 & -1 \\ 0 & -1 & 0 & 3 \\ 0 & 0 & 1 & -1 \\ 0 & 0 & -1 & 3 \end{pmatrix} \cdot \begin{pmatrix} i_{S1} \\ i_{S2} \\ j_{S2} \\ 1 \end{pmatrix} \ge \vec{0} $$

[Figure: S1 iterations and S2 iterations in dependence along the i axis]
Math Corner
The Affine Qualifier
Definition (Affine function)
A function $f : K^m \to K^n$ is affine if there exist a vector $\vec{b} \in K^n$ and a matrix $A \in K^{n \times m}$ such that:
$$ \forall \vec{x} \in K^m,\ f(\vec{x}) = A\vec{x} + \vec{b} $$

Definition (Affine half-space)
An affine half-space of $K^m$ (affine constraint) is defined as the set of points:
$$ \{ \vec{x} \in K^m \mid \vec{a} \cdot \vec{x} \le b \} $$
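A small worked instance (my addition, for readers new to the notation):

$$ f : K^2 \to K^2, \quad f(i, j) = (i + 2j + 1,\; j - 3) = \begin{pmatrix} 1 & 2 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix} + \begin{pmatrix} 1 \\ -3 \end{pmatrix} $$

Loop bounds, array subscripts, and schedules in this lecture are all affine functions of this shape.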
Polyhedron (Implicit Representation)
Definition (Polyhedron)
A set $\mathcal{P} \subseteq K^m$ is a polyhedron if there exists a system of a finite number of inequalities $A\vec{x} \le \vec{b}$ such that:
$$ \mathcal{P} = \{ \vec{x} \in K^m \mid A\vec{x} \le \vec{b} \} $$
Equivalently, it is the intersection of finitely many half-spaces.

Definition (Polytope)
A polytope is a bounded polyhedron.
Integer Polyhedron
Definition (Z-polyhedron)
A polyhedron all of whose extreme points are integer valued.

Definition (Integer hull)
The integer hull of a rational polyhedron $\mathcal{P}$ is the largest set of integer points such that each of these points is in $\mathcal{P}$.

For the moment, we will "say" an integer polyhedron is a polyhedron of integer points (language abuse).
Rational and Integer Polytopes
Another View of Polyhedra

The dual representation models a polyhedron as a combination of lines $L$ and rays $R$ (forming the polyhedral cone) and vertices $V$ (forming the polytope).

Definition (Dual representation)
$$ \mathcal{P} : \{ \vec{x} \in \mathbb{Q}^n \mid \vec{x} = L\vec{\lambda} + R\vec{\mu} + V\vec{\nu},\ \vec{\mu} \ge 0,\ \vec{\nu} \ge 0,\ \textstyle\sum_i \nu_i = 1 \} $$

Definition (Face)
A face $\mathcal{F}$ of $\mathcal{P}$ is the intersection of $\mathcal{P}$ with a supporting hyperplane of $\mathcal{P}$. We have: $\dim(\mathcal{F}) \le \dim(\mathcal{P})$

Definition (Facet)
A facet $\mathcal{F}$ of $\mathcal{P}$ is a face of $\mathcal{P}$ such that: $\dim(\mathcal{F}) = \dim(\mathcal{P}) - 1$
Some Equivalence Properties
Theorem (Fundamental Theorem on Polyhedral Decomposition)
If $\mathcal{P}$ is a polyhedron, then it can be decomposed as a polytope $\mathcal{V}$ plus a polyhedral cone $\mathcal{L}$.

Theorem (Equivalence of Representations)
Every polyhedron has both an implicit and a dual representation.

- Chernikova's algorithm can compute the dual representation from the implicit one
- The dual representation is heavily used in polyhedral compilation
- Some works operate on the constraint-based representation (Pluto)
Parametric Polyhedra
Definition (Parametric polyhedron)
Given $\vec{n}$ the vector of symbolic parameters, $\mathcal{P}$ is a parametric polyhedron if it is defined by:
$$ \mathcal{P} = \{ \vec{x} \in K^m \mid A\vec{x} \le B\vec{n} + \vec{b} \} $$

- Requires adapting the theory and tools to parameters
- Can become nasty: case distinctions (QUAST)
- Reflects nicely the program context
Some Useful Algorithms
All extended to parametric polyhedra:
- Compute the facets of a polytope: PolyLib [Wilde et al.]
- Compute the volume of a polytope (number of points): Barvinok [Clauss/Verdoolaege]
- Scan a polytope (code generation): CLooG [Quilleré/Bastoul]
- Find the lexicographic minimum: PIP [Feautrier]
Operations on Polyhedra
Definition (Intersection)
The intersection of two convex sets $\mathcal{P}_1$ and $\mathcal{P}_2$ is a convex set $\mathcal{P}$:
$$ \mathcal{P} = \{ \vec{x} \in K^m \mid \vec{x} \in \mathcal{P}_1 \wedge \vec{x} \in \mathcal{P}_2 \} $$

Definition (Union)
The union of two convex sets $\mathcal{P}_1$ and $\mathcal{P}_2$ is a set $\mathcal{P}$:
$$ \mathcal{P} = \{ \vec{x} \in K^m \mid \vec{x} \in \mathcal{P}_1 \vee \vec{x} \in \mathcal{P}_2 \} $$

The union of two convex sets may not be a convex set: for instance, the union of $\{x \mid 0 \le x \le 1\}$ and $\{x \mid 3 \le x \le 4\}$ contains 1 and 3 but not 2.
Lattices
Definition (Lattice)
A subset $L$ of $\mathbb{Q}^n$ is a lattice if it is generated by integral combinations of finitely many vectors $a_1, a_2, \ldots, a_n$ ($a_i \in \mathbb{Q}^n$). If the $a_i$ vectors have integral coordinates, $L$ is an integer lattice.

Definition (Z-polyhedron)
A Z-polyhedron is the intersection of a polyhedron and an affine integral full-dimensional lattice.
Pictured Example
Example of a Z-polyhedron:
- $Q_1 = \{ i, j \mid 0 \le i \le 5,\ 0 \le 3j \le 20 \}$
- $L_1 = \{ 2i+1, 3j+5 \mid i, j \in \mathbb{Z} \}$
- $Z_1 = Q_1 \cap L_1$
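A tiny sketch (my addition) that enumerates $Z_1$ by scanning $Q_1$ and keeping the points lying on the lattice $L_1$:

#include <stdio.h>

int main(void) {
  /* Q1: 0 <= i <= 5 and 0 <= 3j <= 20; L1: points (2a+1, 3b+5), a, b integers */
  for (int i = 0; i <= 5; ++i)
    for (int j = 0; 3 * j <= 20; ++j)
      if (i % 2 == 1 && (j - 5) % 3 == 0)   /* i odd, j congruent to 5 mod 3 */
        printf("(%d, %d)\n", i, j);
  return 0;
}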
Quick Facts on Z-polyhedra
- Iteration domains are in fact Z-polyhedra with a unit lattice
- The intersection of Z-polyhedra is not convex in general
- The union is complex to compute
- One can count points, optimize, and scan
- Implementations of most operations are available in PolyLib
Image and Preimage
Definition (Image)
The image of a polyhedron $\mathcal{P} \subseteq \mathbb{Z}^n$ by an affine function $f : \mathbb{Z}^n \to \mathbb{Z}^m$ is a Z-polyhedron $\mathcal{P}'$:
$$ \mathcal{P}' = \{ f(\vec{x}) \in \mathbb{Z}^m \mid \vec{x} \in \mathcal{P} \} $$

Definition (Preimage)
The preimage of a polyhedron $\mathcal{P} \subseteq \mathbb{Z}^m$ by an affine function $f : \mathbb{Z}^n \to \mathbb{Z}^m$ is a Z-polyhedron $\mathcal{P}'$:
$$ \mathcal{P}' = \{ \vec{x} \in \mathbb{Z}^n \mid f(\vec{x}) \in \mathcal{P} \} $$

We have $\mathrm{Image}(f^{-1}, \mathcal{P}) = \mathrm{Preimage}(f, \mathcal{P})$ if $f$ is invertible.
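A quick worked case (my addition) showing why the image is a Z-polyhedron rather than a plain polyhedron:

$$ \mathcal{P} = \{ i \mid 0 \le i \le 2 \},\quad f(i) = 2i \quad\Rightarrow\quad \mathrm{Image}(f, \mathcal{P}) = \{0, 2, 4\} = \{ x \mid 0 \le x \le 4 \} \cap 2\mathbb{Z} $$

The bounding polyhedron $[0, 4]$ alone would wrongly include the odd points 1 and 3; the lattice $2\mathbb{Z}$ is needed to exclude them.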
Relation Between Image, Preimage and Z-polyhedra

- The image of a Z-polyhedron by a unimodular function is a Z-polyhedron
- The preimage of a Z-polyhedron by an affine function is a Z-polyhedron
- The image of a polyhedron by an affine invertible function is a Z-polyhedron
- The image by a non-invertible function is not a Z-polyhedron
Program Transformations
What Can Be Modeled?
Exact vs. approximate representation:
- Exact representation of iteration domains
  - Static control flow
  - Affine loop bounds (includes min/max/integer division)
  - Affine conditionals (conjunction/disjunction)
- Approximate representation of iteration domains
  - Use affine over-approximations, predicate statement executions
  - Full-function support
Key Intuition
- Programs are represented with geometric shapes
- Transforming a program is about modifying those shapes
  - rotation, skewing, stretching, ...
- But we need here to assume some order in which to scan the points
Affine Transformations
Interchange Transformation
The transformation matrix is the identity with a permutation of two rows.

[Figure: the six instances, numbered in execution order, in the original (i, j) polyhedron and in the target (i', j') polyhedron]

(a) original polyhedron, $A\vec{x} + \vec{a} \ge \vec{0}$:
$$ \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix} + \begin{pmatrix} -1 \\ 2 \\ -1 \\ 3 \end{pmatrix} \ge \vec{0} $$

(b) transformation function, $\vec{y} = T\vec{x}$:
$$ \begin{pmatrix} i' \\ j' \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix} $$

(c) target polyhedron, $(AT^{-1})\vec{y} + \vec{a} \ge \vec{0}$:
$$ \begin{pmatrix} 0 & 1 \\ 0 & -1 \\ 1 & 0 \\ -1 & 0 \end{pmatrix} \begin{pmatrix} i' \\ j' \end{pmatrix} + \begin{pmatrix} -1 \\ 2 \\ -1 \\ 3 \end{pmatrix} \ge \vec{0} $$

do i = 1, 2
  do j = 1, 3
    S(i,j)
⟹
do i' = 1, 3
  do j' = 1, 2
    S(i=j', j=i')
Affine Transformations
Reversal Transformation
The transformation matrix is the identity with one diagonal element replaced by −1.

[Figure: the six instances in the original (i, j) polyhedron and in the target (i', j') polyhedron, with the i' axis spanning −3 to 2]

(a) original polyhedron, $A\vec{x} + \vec{a} \ge \vec{0}$:
$$ \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix} + \begin{pmatrix} -1 \\ 2 \\ -1 \\ 3 \end{pmatrix} \ge \vec{0} $$

(b) transformation function, $\vec{y} = T\vec{x}$:
$$ \begin{pmatrix} i' \\ j' \end{pmatrix} = \begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix} $$

(c) target polyhedron, $(AT^{-1})\vec{y} + \vec{a} \ge \vec{0}$:
$$ \begin{pmatrix} -1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & -1 \end{pmatrix} \begin{pmatrix} i' \\ j' \end{pmatrix} + \begin{pmatrix} -1 \\ 2 \\ -1 \\ 3 \end{pmatrix} \ge \vec{0} $$

do i = 1, 2
  do j = 1, 3
    S(i,j)
⟹
do i' = -1, -2, -1
  do j' = 1, 3
    S(i=3-i', j=j')
Affine Transformations
Compound Transformation
The transformation matrix is the composition of an interchange and a reversal.

[Figure: the six instances in the original (i, j) polyhedron and in the target (i', j') polyhedron]

(a) original polyhedron, $A\vec{x} + \vec{a} \ge \vec{0}$:
$$ \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix} + \begin{pmatrix} -1 \\ 2 \\ -1 \\ 3 \end{pmatrix} \ge \vec{0} $$

(b) transformation function, $\vec{y} = T\vec{x}$:
$$ \begin{pmatrix} i' \\ j' \end{pmatrix} = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix} $$

(c) target polyhedron, $(AT^{-1})\vec{y} + \vec{a} \ge \vec{0}$:
$$ \begin{pmatrix} 0 & 1 \\ 0 & -1 \\ -1 & 0 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} i' \\ j' \end{pmatrix} + \begin{pmatrix} -1 \\ 2 \\ -1 \\ 3 \end{pmatrix} \ge \vec{0} $$

do i = 1, 2
  do j = 1, 3
    S(i,j)
⟹
do j' = -1, -3, -1
  do i' = 1, 2
    S(i=4-j', j=i')
So, What is This Matrix?
- We know how to generate code for arbitrary matrices with integer coefficients
  - Arbitrary number of rows (but fixed number of columns)
  - Arbitrary values for the coefficients
- Through code generation, the number of dynamic instances is preserved
  - But this is not true for the transformed polyhedra!

Some classification:
- The matrix is unimodular
- The matrix is full rank and invertible
- The matrix is full rank
- The matrix has only integral coefficients
- The matrix has rational coefficients
A Reverse History Perspective
1 CLooG: arbitrary matrix
2 Affine Mappings
3 Unimodular framework
4 SARE
5 SURE
Program Transformations
Original Schedule
for (i = 0; i < n; ++i)
  for (j = 0; j < n; ++j) {
S1: C[i][j] = 0;
    for (k = 0; k < n; ++k)
S2:   C[i][j] += A[i][k]*B[k][j];
  }

$$ \Theta^{S1} \cdot \vec{x}_{S1} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix} \qquad \Theta^{S2} \cdot \vec{x}_{S2} = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ k \\ n \\ 1 \end{pmatrix} $$

Generated code: identical to the original.

- Represent Static Control Parts (control flow and dependences must be statically computable)
- Use a code generator (e.g. CLooG) to generate C code from the polyhedral representation (provided iteration domains + schedules)
Program Transformations
Distribute loops
(Same original code and $\Theta^{S1}$ as above; the S2 schedule becomes:)

$$ \Theta^{S2} \cdot \vec{x}_{S2} = \begin{pmatrix} 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ k \\ n \\ 1 \end{pmatrix} $$

Generated code:

for (i = 0; i < n; ++i)
  for (j = 0; j < n; ++j)
    C[i][j] = 0;
for (i = n; i < 2*n; ++i)
  for (j = 0; j < n; ++j)
    for (k = 0; k < n; ++k)
      C[i-n][j] += A[i-n][k]*B[k][j];

- All instances of S1 are executed before the first S2 instance
Program Transformations
Distribute loops + Interchange loops for S2
(Same original code and $\Theta^{S1}$; the S2 schedule becomes:)

$$ \Theta^{S2} \cdot \vec{x}_{S2} = \begin{pmatrix} 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ k \\ n \\ 1 \end{pmatrix} $$

Generated code:

for (i = 0; i < n; ++i)
  for (j = 0; j < n; ++j)
    C[i][j] = 0;
for (k = n; k < 2*n; ++k)
  for (j = 0; j < n; ++j)
    for (i = 0; i < n; ++i)
      C[i][j] += A[i][k-n]*B[k-n][j];

- The outermost loop for S2 becomes k
Program Transformations
Illegal schedule
(Same original code; the schedules become:)

$$ \Theta^{S1} \cdot \vec{x}_{S1} = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix} \qquad \Theta^{S2} \cdot \vec{x}_{S2} = \begin{pmatrix} 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ k \\ n \\ 1 \end{pmatrix} $$

Generated code:

for (k = 0; k < n; ++k)
  for (j = 0; j < n; ++j)
    for (i = 0; i < n; ++i)
      C[i][j] += A[i][k]*B[k][j];
for (i = n; i < 2*n; ++i)
  for (j = 0; j < n; ++j)
    C[i-n][j] = 0;

- All instances of S1 are executed after the last S2 instance
Program Transformations
A legal schedule
(Same original code; the schedules become:)

$$ \Theta^{S1} \cdot \vec{x}_{S1} = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix} \qquad \Theta^{S2} \cdot \vec{x}_{S2} = \begin{pmatrix} 0 & 0 & 1 & 1 & 1 \\ 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ k \\ n \\ 1 \end{pmatrix} $$

Generated code:

for (i = n; i < 2*n; ++i)
  for (j = 0; j < n; ++j)
    C[i-n][j] = 0;
for (k = n+1; k <= 2*n; ++k)
  for (j = 0; j < n; ++j)
    for (i = 0; i < n; ++i)
      C[i][j] += A[i][k-n-1]*B[k-n-1][j];

- Delay the S2 instances
- Constraints must be expressed between $\Theta^{S1}$ and $\Theta^{S2}$
Program Transformations
Implicit fine-grain parallelism
(Same original code; the schedules become one-row:)

$$ \Theta^{S1} \cdot \vec{x}_{S1} = \begin{pmatrix} 1 & 0 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix} \qquad \Theta^{S2} \cdot \vec{x}_{S2} = \begin{pmatrix} 0 & 0 & 1 & 1 & 0 \end{pmatrix} \cdot \begin{pmatrix} i \\ j \\ k \\ n \\ 1 \end{pmatrix} $$

Generated code:

for (i = 0; i < n; ++i)
  pfor (j = 0; j < n; ++j)
    C[i][j] = 0;
for (k = n; k < 2*n; ++k)
  pfor (j = 0; j < n; ++j)
    pfor (i = 0; i < n; ++i)
      C[i][j] += A[i][k-n]*B[k-n][j];

- Fewer (linear) rows than the loop depth → the remaining dimensions are parallel
Program Transformations
Representing a schedule
(Same original code; schedules as on the "A legal schedule" slide.)

The two schedules can be combined into a single matrix operating on all iterators, parameters, and constants:

$$ \Theta \cdot \vec{x} = \begin{pmatrix} 1 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 1 \\ 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \end{pmatrix} \cdot \begin{pmatrix} i & j & i & j & k & n & n & 1 & 1 \end{pmatrix}^T $$

The columns group into the iterator coefficients $\vec{\imath}$ (the first five), the parameter coefficients $\vec{p}$ (the two $n$ columns), and the constant coefficients $c$ (the two trailing columns).
Transformations expressible through each group of schedule coefficients:

Part | Transformation | Description
$\vec{\imath}$ | reversal | Changes the direction in which a loop traverses its iteration range
$\vec{\imath}$ | skewing | Makes the bounds of a given loop depend on an outer loop counter
$\vec{\imath}$ | interchange | Exchanges two loops in a perfectly nested loop, a.k.a. permutation
$\vec{p}$ | fusion | Fuses two loops, a.k.a. jamming
$\vec{p}$ | distribution | Splits a single loop nest into many, a.k.a. fission or splitting
$c$ | peeling | Extracts one iteration of a given loop
$c$ | shifting | Allows to reorder loops
Fusion in the Polyhedral Model
[Figure: Blue and Red instances aligned on the 0..N axis]

for (i = 0; i <= N; ++i) {
  Blue(i);
  Red(i);
}

Perfectly aligned fusion
Fusion in the Polyhedral Model
[Figure: Blue instances on 0..N, Red instances shifted to 1..N+1]

Blue(0);
for (i = 1; i <= N; ++i) {
  Blue(i);
  Red(i-1);
}
Red(N);

Fusion with a shift of 1; not all instances are fused.
Fusion in the Polyhedral Model
[Figure: Blue instances on 0..N, Red instances shifted to P..N+P]

for (i = 0; i < P; ++i)
  Blue(i);
for (i = P; i <= N; ++i) {
  Blue(i);
  Red(i-P);
}
for (i = N+1; i <= N+P; ++i)
  Red(i-P);

Fusion with a parametric shift of P; automatic generation of prolog/epilog code.
Many other transformations may be required to enable fusion: interchange, skewing, etc.
Scheduling Matrix
Definition (Affine multidimensional schedule)
Given a statement S, an affine schedule $\Theta^S$ of dimension m is an affine form on the d outer loop iterators $\vec{x}_S$ and the p global parameters $\vec{n}$. $\Theta^S \in \mathbb{Z}^{m \times (d+p+1)}$ can be written as:

$$ \Theta^S(\vec{x}_S) = \begin{pmatrix} \theta_{1,1} & \ldots & \theta_{1,d+p+1} \\ \vdots & & \vdots \\ \theta_{m,1} & \ldots & \theta_{m,d+p+1} \end{pmatrix} \cdot \begin{pmatrix} \vec{x}_S \\ \vec{n} \\ 1 \end{pmatrix} $$

$\Theta^S_k$ denotes the k-th row of $\Theta^S$.

Definition (Bounded affine multidimensional schedule)
$\Theta^S$ is a bounded schedule if $\theta^S_{i,j} \in [x, y]$ with $x, y \in \mathbb{Z}$.
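A concrete instance (my addition): for $\vec{x}_S = (i, j)$ and one parameter $n$, the schedule

$$ \Theta^S = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \end{pmatrix}, \qquad \Theta^S(i, j) = (j,\; i + n) $$

interchanges the two loops and shifts the new inner dimension by $n$: a composition of two transformations captured by a single matrix.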
Another Representation
One can separate the coefficients of $\Theta$ into:
1 The iterator coefficients
2 The parameter coefficients
3 The constant coefficients

One can also enforce the schedule dimension to be 2d+1:
- A d-dimensional square matrix for the linear part
  - represents compositions of interchange/skewing/slowing
- A d×n matrix for the parametric part
  - represents (parametric) shifts
- A (d+1)-vector for the scalar offsets
  - represents statement interleaving
- See URUK for instance
Computing the 2d+1 Identity Schedule
[Figure: an imperfectly nested loop structure with statements S1...S6 over dimensions i, j, k, annotated with the scalar dimensions 0, 1, 2 that interleave them]

$$ \Theta^S(\vec{x}_S) = T_S \vec{x}_S + \vec{t}_S $$

$$ \Theta^{S1}(\vec{x}_{S1}) = (0, i, 0) \qquad \Theta^{S2}(\vec{x}_{S2}) = (0, i, 1, j, 0) \qquad \Theta^{S3}(\vec{x}_{S3}) = (0, i, 2) $$
Transformation Catalogue [1/2]
(Excerpted from Girbal, Vasilache et al.)

CUTDOM constrains a domain with an additional inequality, in the form of a vector c with the same dimension as a row of matrix Λ.

EXTEND inserts a new intermediate loop level at depth ℓ, initially restricted to a single iteration. This new iterator will be used in following code transformations.

ADDLOCALVAR inserts a fresh local variable into the domain and the access functions. This local variable is typically used by CUTDOM.

PRIVATIZE and CONTRACT implement basic forms of array privatization and contraction, respectively, considering dimension ℓ of the array. Privatization needs an additional parameter s, the size of the additional dimension; s is required to update the array declaration (it cannot be inferred in general, since some references may not be affine). These primitives are simple examples updating the data layout and array access functions.

This table is not complete (e.g., it lacks index-set splitting and data-layout transformations), but it demonstrates the expressiveness of the unified representation.

[Fig. 18, "Some classical transformation primitives": formal definitions of UNIMODULAR, SHIFT, CUTDOM, EXTEND, ADDLOCALVAR, PRIVATIZE, CONTRACT, FUSION, FISSION, and MOTION over the (β, A, Γ, Λ) representation; the table did not survive extraction.]

Primitives operate on the program representation while maintaining the structure of the polyhedral components (the invariants). Despite their familiar names, the primitives' practical outcome on the program representation is widely extended compared to their syntactic counterparts. Indeed, transformation primitives like fusion or interchange apply to sets of statements.
Transformation Catalogue [2/2]

(Excerpted from Girbal, Vasilache et al.)

[Fig. 19, "Composition of transformation primitives": INTERCHANGE swaps rows o and o+1 via a unimodular matrix; SKEW adds a skew factor s at position (ℓ, c) of the identity; REVERSE puts a −1 in position (o, o); STRIPMINE combines EXTEND and CUTDOM to insert an intermediate loop with factor k; TILE composes two STRIPMINEs with an INTERCHANGE. The table did not survive extraction.]

3.2. Implementing Loop Unrolling

In the context of code optimization, one of the most important transformations is loop unrolling. A naive implementation of unrolling with statement duplication may result in severe complexity overhead for further transformations and for the code generation algorithm (its separation algorithm is exponential in the number of statements, in the worst case). Instead of implementing loop unrolling in the intermediate representation of our framework, we delay it to the code generation phase and perform full loop unrolling in a lazy way. This strategy is fully implemented in the code generation phase and is triggered by annotations (carrying depth information) of the statements whose surrounding loops need to be unrolled; unrolling occurs in the separation algorithm of the code generator (22) when all the statements being printed out are marked for unrolling at the current depth.

Practically, in most cases, loop unrolling by a factor b can be implemented as a combination of strip-mining (by a factor b) and full unrolling (6). Strip-mining itself may be implemented in several ways in a polyhedral setting. Following our earlier work in (7) and calling b the strip-mining factor, we choose to model a strip-mined loop by dividing the iteration span of the outer loop by b instead of leaving the bounds unchanged and inserting a non-unit stride b, see Figure 20.
Some Final Observations
Some classical pitfalls:
- The number of rows of Θ does not correspond to actual parallel levels
- Scalar rows vs. linear rows
- Linear independence
- Parametric shift for domains without parametric bounds
- ...
Data Dependences
Purpose of Dependence Analysis
- Not all program transformations preserve the semantics
- The semantics is preserved if the dependences are preserved
- In standard frameworks, this usually means reordering statements/loops
  - Statements containing dependent references should not be executed in a different order
  - Granularity: usually a reference to an array
- In the polyhedral framework, it means reordering statement instances
  - Statement instances containing dependent references should not be executed in a different order
  - Granularity: a reference to an array cell
Example
Exercise: compute the set of cells of A accessed.

Example:

for (i = 0; i < N; ++i)
  for (j = i; j < N; ++j)
    A[2i + 3][4j] = i * j;

- $\mathcal{D}_S = \{ i, j \mid 0 \le i < N,\ i \le j < N \}$
- Function: $f_A(i, j) = (2i + 3,\ 4j)$
- $\mathrm{Image}(f_A, \mathcal{D}_S)$ is the set of cells of A accessed (a Z-polyhedron):
  - Polyhedron: $Q : \{ i, j \mid 3 \le i < 2N + 2,\ 0 \le j < 4N \}$
  - Lattice: $L : \{ 2i + 3, 4j \mid i, j \in \mathbb{Z} \}$
Data Dependence
Definition (Bernstein conditions)
Given two references, there exists a dependence between them if the three following conditions hold:
- they reference the same array (cell)
- one of these accesses is a write
- the two associated statements are executed

Three categories of dependences (illustrated in the sketch below):
- RAW (Read-After-Write, a.k.a. flow): the first reference writes, the second reads
- WAR (Write-After-Read, a.k.a. anti): the first reference reads, the second writes
- WAW (Write-After-Write, a.k.a. output): both references write

Another kind: RAR (Read-After-Read dependences), used for locality/reuse computations.
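A minimal C sketch (my addition, not on the slide) where each dependence category appears between concrete references:

/* Illustrative only: each commented pair satisfies the Bernstein conditions. */
for (i = 1; i < N; ++i) {
  A[i] = B[i] + 1;   /* S1: writes A[i], reads B[i]                           */
  C[i] = A[i-1];     /* S2: reads A[i-1] written by S1 at iteration i-1: RAW  */
  B[i] = 2 * i;      /* S3: writes B[i] read by S1 at the same iteration: WAR */
  A[i] = C[i];       /* S4: writes A[i] also written by S1: WAW               */
}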
Some Terminology on Dependence Relations
We categorize dependence relations into three kinds:
- Uniform dependences: the distance between two dependent iterations is a constant
  - ex: i → i+1
  - ex: (i, j) → (i+1, j+1)
- Non-uniform dependences: the distance between two dependent iterations varies during the execution
  - ex: i → i+j
  - ex: i → 2i
- Parametric dependences: at least one parameter is involved in the dependence relation
  - ex: i → i+N
  - ex: i+N → j+M
Dependence Polyhedron [1/3]
Principle: model all pairs of instances in dependence.

Definition (Dependence of statement instances)
A statement S depends on a statement R (written R → S) if there exist an operation S($\vec{x}_S$), an operation R($\vec{x}_R$), and a memory location m such that:
1 S($\vec{x}_S$) and R($\vec{x}_R$) refer to the same memory location m, and at least one of them writes to that location,
2 $\vec{x}_R$ and $\vec{x}_S$ belong to the iteration domains of R and S,
3 in the original sequential order, R($\vec{x}_R$) is executed before S($\vec{x}_S$).
Dependence Polyhedron [2/3]
1 Same memory location: equality of the subscript functions of a pair of references to the same array: $F_S \vec{x}_S + \vec{a}_S = F_R \vec{x}_R + \vec{a}_R$.
2 Iteration domains: both S and R iteration domains can be described using affine inequalities, respectively $A_S \vec{x}_S + \vec{c}_S \ge 0$ and $A_R \vec{x}_R + \vec{c}_R \ge 0$.
3 Precedence order: each case corresponds to a common loop depth, and is called a dependence level.

For each dependence level l, the precedence constraints are the equality of the loop index variables at depths less than l: $x_{S,i} = x_{R,i}$ for $i < l$, and $x_{S,l} > x_{R,l}$ if l is less than the common nesting loop level. Otherwise, there is no additional constraint and the dependence only exists if R is textually before S.

Such constraints can be written using linear inequalities: $P_S \vec{x}_S - P_R \vec{x}_R + \vec{b} \ge 0$.
Dependence Polyhedron [3/3]
The dependence polyhedron for R → S at a given level l and for a given pair of references $f_R$, $f_S$ is described as [Feautrier/Bastoul]:

$$ \mathcal{D}_{R,S,f_R,f_S,l} : \quad \begin{pmatrix} F_S & -F_R \\ A_S & 0 \\ 0 & A_R \\ P_S & -P_R \end{pmatrix} \begin{pmatrix} \vec{x}_S \\ \vec{x}_R \end{pmatrix} + \begin{pmatrix} \vec{a}_S - \vec{a}_R \\ \vec{c}_S \\ \vec{c}_R \\ \vec{b} \end{pmatrix} \;\; \begin{matrix} = 0 \\ \ge \vec{0} \end{matrix} $$

(the first block row is an equality; the remaining rows are inequalities)

A few properties:
- We can always build the dependence polyhedron for a given pair of affine array accesses (it is convex)
- It is exact if the iteration domain and the access functions are also exact
- It is over-approximated if the iteration domain or the access function is an approximation
Polyhedral Dependence Graph
Definition (Polyhedral Dependence Graph)
The polyhedral dependence graph is a multi-graph with one vertex per syntactic program statement $S_i$, and an edge $S_i \to S_j$ for each dependence polyhedron $\mathcal{D}_{S_i,S_j}$.
Scheduling
Affine Scheduling
Definition (Affine schedule)
Given a statement S, a p-dimensional affine schedule $\Theta^S$ is an affine form on the outer loop iterators $\vec{x}_S$ and the global parameters $\vec{n}$. It is written:

$$ \Theta^S(\vec{x}_S) = T_S \begin{pmatrix} \vec{x}_S \\ \vec{n} \\ 1 \end{pmatrix}, \qquad T_S \in K^{p \times (\dim(\vec{x}_S) + \dim(\vec{n}) + 1)} $$

- A schedule assigns a timestamp to each executed instance of a statement
- If $T_S$ is a vector, then $\Theta^S$ is a one-dimensional schedule
- If $T_S$ is a matrix, then $\Theta^S$ is a multidimensional schedule
- Question: does it translate to sequential loops?
Legal Program Transformation
Definition (Precedence condition)
Given $\Theta^R$ a schedule for the instances of R and $\Theta^S$ a schedule for the instances of S, $\Theta^R$ and $\Theta^S$ are legal schedules if $\forall \langle \vec{x}_R, \vec{x}_S \rangle \in \mathcal{D}_{R,S}$:

$$ \Theta^R(\vec{x}_R) \prec \Theta^S(\vec{x}_S) $$

$\prec$ denotes the lexicographic ordering:
$(a_1, \ldots, a_n) \prec (b_1, \ldots, b_m)$ iff $\exists i,\ 1 \le i \le \min(n, m)$ s.t. $(a_1, \ldots, a_{i-1}) = (b_1, \ldots, b_{i-1})$ and $a_i < b_i$.
Scheduling in the Polyhedral Model
Constraints:
- The schedule must respect the precedence condition, for all dependent instances
- Dependence constraints can be turned into constraints on the solution set

Scheduling:
- Among all possibilities, one has to be picked
- An optimal solution requires considering all legal possible schedules
- Question: is this always true?
One-Dimensional Affine Schedules
For now, we focus on 1-d schedules.

Example:

for (i = 1; i < N; ++i)
  A[i] = A[i - 1] + A[i] + A[i + 1];

- Simple program: 1 loop, 1 polyhedral statement
- 2 dependences:
  - RAW: A[i] → A[i - 1]
  - WAR: A[i + 1] → A[i]
Checking the Legality of a Schedule

Exercise: given the dependence polyhedra, check if a schedule is legal.

(Constraint systems as transcribed, with columns applying to $(i_S, i'_S, n, 1)^T$ and a leading column flagging equality (0) vs. inequality (1):)

$$ \mathcal{D}_1 : \begin{pmatrix} 1 & 1 & 0 & 0 & -1 \\ 1 & -1 & 0 & 1 & -1 \\ 1 & 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 1 & -1 \\ 0 & 1 & -1 & 0 & 1 \\ 1 & -1 & 1 & 0 & -1 \end{pmatrix} \qquad \mathcal{D}_2 : \begin{pmatrix} 1 & 1 & 0 & 0 & -1 \\ 1 & -1 & 0 & 1 & -1 \\ 1 & 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 1 & -1 \\ 0 & 1 & -1 & 0 & 1 \\ 1 & -1 & 1 & 0 & -1 \end{pmatrix} $$

Schedules to check:
1 Θ = i
2 Θ = −i

Solution: check the emptiness of the polyhedron

$$ \mathcal{P} : \begin{bmatrix} \mathcal{D} \\ \Theta(i_S) \ge \Theta(i'_S) \end{bmatrix} \cdot \begin{pmatrix} i_S \\ i'_S \\ n \\ 1 \end{pmatrix} $$

where:
- the extra constraint $\Theta(i_S) \ge \Theta(i'_S)$ selects the pairs whose consumer instance is not scheduled after its producer
- For Θ = −i, it is $-i_S \ge -i'_S$, which is non-empty: the schedule is illegal
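A sketch of this emptiness test with the isl library (my addition; assumes isl is installed, and uses only isl_set_read_from_str and isl_set_is_empty):

#include <isl/ctx.h>
#include <isl/set.h>
#include <stdio.h>

int main(void) {
  isl_ctx *ctx = isl_ctx_alloc();
  /* RAW dependence of the previous slide: producer i, consumer ip = i + 1.
     Violation set for Theta = -i: pairs where -i >= -ip. */
  isl_set *viol = isl_set_read_from_str(ctx,
      "[n] -> { [i, ip] : 1 <= i and ip = i + 1 and ip <= n - 1 and -i >= -ip }");
  /* A non-empty violation set means Theta = -i is illegal for this dependence. */
  printf("violation set empty? %d\n", isl_set_is_empty(viol));
  isl_set_free(viol);
  isl_ctx_free(ctx);
  return 0;
}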
A (Naive) Scheduling Approach
- Pick a schedule for the program statements
- Check if it respects all dependences

This is called filtering.

Limitations:
- How can this be combined with an objective function?
- The density of legal 1-d affine schedules is low:

benchmark    matmult    locality   fir        h264       crout
ı-bounds     −1,1       −1,1       0,1        −1,1       −3,3
c-bounds     −1,1       −1,1       0,3        0,4        −3,3
#schedules   1.9×10^4   5.9×10^4   1.2×10^7   1.8×10^8   2.6×10^15
#legal       6561       912        792        360        798
Objectives for a Good Scheduling Algorithm
- Build a legal schedule!
- Embed some properties in this legal schedule:
  - latency: minimize the time of the last iteration
  - delay: minimize the time between the first and last iterations
  - parallelism / placement
  - permutability (for tiling)
  - ...

A possible "simple" two-step approach:
- Find the solution set of all legal affine schedules
- Find an ILP/PIP formulation for the objective function(s)
The Precedence Constraint (Again!)
Precedence constraint adapted to 1-d schedules:

Definition (Causality condition for schedules)
Given $\mathcal{D}_{R,S}$, $\Theta^R$ and $\Theta^S$ are legal iff for each pair of instances in dependence:
$$ \Theta^R(\vec{x}_R) < \Theta^S(\vec{x}_S) $$
Equivalently: $\Delta_{R,S} = \Theta^S(\vec{x}_S) - \Theta^R(\vec{x}_R) - 1 \ge 0$

- All functions $\Delta_{R,S}$ which are non-negative over the dependence polyhedron represent legal schedules
- For the instances which are not in dependence, we don't care
- First step: how to get all non-negative functions over a polyhedron?
Affine Form of the Farkas Lemma
Lemma (Affine form of Farkas lemma)
Let $\mathcal{D}$ be a nonempty polyhedron defined by $A\vec{x} + \vec{b} \ge \vec{0}$. Then any affine function $f(\vec{x})$ is non-negative everywhere in $\mathcal{D}$ iff it is a positive affine combination:
$$ f(\vec{x}) = \lambda_0 + \vec{\lambda}^T (A\vec{x} + \vec{b}), \quad \text{with } \lambda_0 \ge 0 \text{ and } \vec{\lambda} \ge \vec{0} $$
$\lambda_0$ and $\vec{\lambda}^T$ are called the Farkas multipliers.
The Farkas Lemma: Example
- Function: $f(x) = ax + b$
- Domain of x: $\{1 \le x \le 3\} \rightarrow x - 1 \ge 0,\ -x + 3 \ge 0$
- Farkas lemma: $f(x) \ge 0 \Leftrightarrow f(x) = \lambda_0 + \lambda_1 (x - 1) + \lambda_2 (-x + 3)$

The system to solve:
$$ \begin{aligned} \lambda_1 - \lambda_2 &= a \\ \lambda_0 - \lambda_1 + 3\lambda_2 &= b \\ \lambda_0 \ge 0,\quad \lambda_1 \ge 0,&\quad \lambda_2 \ge 0 \end{aligned} $$
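Two quick checks (my addition, not on the slide):

$$ f(x) = x - 1 \;(a = 1,\, b = -1): \quad (\lambda_0, \lambda_1, \lambda_2) = (0, 1, 0) \text{ solves the system} \;\Rightarrow\; f \ge 0 \text{ on } [1, 3]. $$

$$ f(x) = -x \;(a = -1,\, b = 0): \quad \lambda_2 = \lambda_1 + 1 \text{ forces } \lambda_0 = \lambda_1 - 3\lambda_2 = -2\lambda_1 - 3 < 0, \text{ infeasible} \;\Rightarrow\; f < 0 \text{ somewhere on } [1, 3] \text{ (indeed } f(1) = -1\text{)}. $$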
Example: Semantics Preservation (1-D)
[Diagram: the set of legal distinct schedules inside the set of all affine schedules; the causality condition and the Farkas lemma map it, many-to-one, onto the set of valid Farkas multipliers; identification and projection map that set, bijectively, onto the valid transformation coefficients.]

The construction proceeds as follows:

- Causality condition: given R δ S, $\Theta^R$ and $\Theta^S$ are legal iff for each pair of instances in dependence $\Theta^R(\vec{x}_R) < \Theta^S(\vec{x}_S)$, i.e., $\Delta_{R,S} = \Theta^S(\vec{x}_S) - \Theta^R(\vec{x}_R) - 1 \ge 0$.

- Farkas lemma: $\Delta_{R,S}$ is non-negative over $\mathcal{D}_{R,S}$ iff it is a positive affine combination of the constraints defining $\mathcal{D}_{R,S}$ (see the lemma above). Each legal schedule thus corresponds to a set of valid Farkas multipliers (many-to-one).

- Identification: equate, coefficient by coefficient, the two sides of
$$ \Theta^S(\vec{x}_S) - \Theta^R(\vec{x}_R) - 1 = \lambda_0 + \vec{\lambda}^T \left( \mathcal{D}_{R,S} \begin{pmatrix} \vec{x}_R \\ \vec{x}_S \end{pmatrix} + \vec{d}_{R,S} \right) \ge 0 $$
For $\mathcal{D}_{R \delta S}$ this yields:
$$ \begin{aligned} i_R:\quad & -t^R_1 = \lambda_{D_1,1} - \lambda_{D_1,2} + \lambda_{D_1,3} - \lambda_{D_1,4} \\ i_S:\quad & t^S_1 = -\lambda_{D_1,1} + \lambda_{D_1,2} + \lambda_{D_1,5} - \lambda_{D_1,6} \\ j_S:\quad & t^S_2 = \lambda_{D_1,7} - \lambda_{D_1,8} \\ n:\quad & t^S_3 - t^R_2 = \lambda_{D_1,4} + \lambda_{D_1,6} + \lambda_{D_1,8} \\ 1:\quad & t^S_4 - t^R_3 - 1 = \lambda_{D_1,0} \end{aligned} $$

- Projection: solve the constraint system and eliminate the Farkas multipliers with a (purpose-optimized) Fourier-Motzkin projection algorithm; reduce redundancy; detect implicit equalities.

The result is the set of valid transformation coefficients: one point in this space ⇔ one set of legal schedules w.r.t. the dependences.
Scheduling Algorithm for Multiple Dependences
Algorithm:
- Compute the schedule constraints for each dependence
- Intersect all sets of constraints
- The output is a convex solution set of all legal one-dimensional schedules

- The computation is fast, but requires eliminating variables in a system of inequalities: projection
- It can be computed as soon as the dependence polyhedra are known
Objective Function
Idea: bound the latency of the schedule and minimize this bound.

Theorem (Schedule latency bound)
If all domains are bounded, and if there exists at least one 1-d schedule Θ, then there exists at least one affine form in the structure parameters:
$$ L = \vec{u} \cdot \vec{n} + w $$
such that: $\forall \vec{x}_R,\ L \ge \Theta^R(\vec{x}_R)$

- Objective function: $\min \{ \vec{u}, w \mid \vec{u} \cdot \vec{n} + w - \Theta \ge 0 \}$
- Subject to: Θ is a legal schedule, and $\theta_i \ge 0$
- In many cases, this is equivalent to taking the lexicosmallest point in the polytope of non-negative legal schedules
Example
$\min \{ \vec{u}, w \mid \vec{u} \cdot \vec{n} + w - \Theta \ge 0 \}$ yields $\Theta^R = 0$, $\Theta^S = k + 1$.

Example:

parfor (i = 0; i < N; ++i)
  parfor (j = 0; j < N; ++j)
    C[i][j] = 0;
for (k = 1; k < N + 1; ++k)
  parfor (i = 0; i < N; ++i)
    parfor (j = 0; j < N; ++j)
      C[i][j] += A[i][k-1] + B[k-1][j];
Limitations of One-dimensional Schedules
- Not all programs have a legal one-dimensional schedule
- Question: does this program have a 1-d schedule?

Example:

for (i = 1; i < N - 1; ++i)
  for (j = 1; j < N - 1; ++j)
    A[i][j] = A[i-1][j-1] + A[i+1][j] + A[i][j+1];

- Not all compositions of transformations are possible:
  - Interchange in inner loops
  - Fusion / distribution of inner loops
Multidimensional Scheduling
- Some programs do not have a legal 1-d schedule
- This means it is not possible to enforce the precedence condition for all dependences

Example:

for (i = 0; i < N; ++i)
  for (j = 0; j < N; ++j)
    s += s;

- Intuition: multidimensional time means nested time loops
- The precedence constraint needs to be adapted to multidimensional time
UCLA 77
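A hedged sketch of why the example above admits no legal 1-d schedule: its N² iterations form a single dependence chain, so a legal schedule must give them N² strictly increasing dates, while a 1-d affine schedule

Θ(i, j) = α.i + β.j + γ.N + δ (α, β, γ, δ fixed integers)

spans only O(N) distinct values over the domain 0 ≤ i, j < N, which is fewer than N² for large N.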
Affine Scheduling: Multidimensional Scheduling Polyhedral School
Dependence Satisfaction
Definition (Strong dependence satisfaction)
Given DR,S, the dependence is strongly satisfied at schedule level k if
∀〈~xR,~xS〉 ∈ DR,S,  ΘS_k(~xS) − ΘR_k(~xR) ≥ 1
Definition (Weak dependence satisfaction)
Given DR,S, the dependence is weakly satisfied at dimension k if
∀〈~xR,~xS〉 ∈ DR,S,  ΘS_k(~xS) − ΘR_k(~xR) ≥ 0
∃〈~xR,~xS〉 ∈ DR,S,  ΘS_k(~xS) = ΘR_k(~xR)
UCLA 78
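A hedged illustration with a two-row schedule Θ1 = t, Θ2 = i on a (t, i) loop nest: a uniform dependence of distance (0,1) gives ΘS_1 − ΘR_1 = 0 (weak satisfaction at level 1) and ΘS_2 − ΘR_2 = 1 ≥ 1 (strong satisfaction at level 2), whereas a distance of (1,0) is already strongly satisfied at level 1.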
Affine Scheduling: Multidimensional Scheduling Polyhedral School
Program Legality and Existence Results
I All dependences must be strongly satisfied for the program to be correct
I Once a dependence is strongly satisfied at level k, it does not contribute to the constraints of level k + i
I Unlike with 1-d schedules, it is always possible to build a legal multidimensional schedule for a SCoP [Feautrier]
Theorem (Existence of an affine schedule)
Every static control program has a multidimensional affine schedule
UCLA 79
Affine Scheduling: Multidimensional Scheduling Polyhedral School
Reformulation of the Precedence Condition

I We introduce the variable δ1^{DR,S} to model the dependence satisfaction
I Considering the first row of the scheduling matrices, to preserve the precedence relation we have:

∀DR,S, ∀〈~xR,~xS〉 ∈ DR,S,  ΘS_1(~xS) − ΘR_1(~xR) ≥ δ1^{DR,S},  δ1^{DR,S} ∈ {0,1}

Lemma (Semantics-preserving affine schedules)

Given a set of affine schedules ΘR, ΘS, ... of dimension m, the program semantics is preserved if:

∀DR,S, ∃p ∈ {1, ..., m},  δp^{DR,S} = 1
  ∧ ∀j < p,  δj^{DR,S} = 0
  ∧ ∀j ≤ p, ∀〈~xR,~xS〉 ∈ DR,S,  ΘS_j(~xS) − ΘR_j(~xR) ≥ δj^{DR,S}
UCLA 80
Affine Scheduling: Multidimensional Scheduling Polyhedral School
Space of All Affine Schedules
Objective:
I Design an ILP which operates on all scheduling coefficients
I Easier optimality reasoning: the space contains all schedules (hence necessarily the optimal one)
I Examples: maximal fusion, maximal coarse-grain parallelism, best locality, etc.

Idea:
I Combine all coefficients of all rows of the scheduling function into a single solution set
I Find a convex encoding for the lexicopositivity of dependence satisfaction
  I A dependence must be weakly satisfied until it is strongly satisfied
  I Once it is strongly satisfied, it must not constrain subsequent levels
UCLA 81
Affine Scheduling: Multidimensional Scheduling Polyhedral School
Schedule Lower Bound
Idea:
I Bound the schedule latency with a lower bound which does not prevent finding all solutions
I Intuitively:
  I ΘS(~xS) − ΘR(~xR) ≥ δ if the dependence has not been strongly satisfied
  I ΘS(~xS) − ΘR(~xR) ≥ −∞ if it has

Lemma (Schedule lower bound)

Given ΘR_k, ΘS_k such that each coefficient value is bounded in [x,y]. Then there exists K ∈ Z such that:

∀~xR,~xS,  ΘS_k(~xS) − ΘR_k(~xR) > −K.~n − K
UCLA 82
Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School
Space of Semantics-Preserving Affine Schedules
All unique bounded affine multidimensional schedules
  ⊃ all unique semantics-preserving affine multidimensional schedules

1 point ↔ 1 unique semantically equivalent program (up to affine iteration reordering)
UCLA 83
Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School
Semantics Preservation
Definition (Causality condition)
Given ΘR a schedule for the instances of R, and ΘS a schedule for the instances of S. ΘR and ΘS preserve the dependence DR,S if ∀〈~xR,~xS〉 ∈ DR,S:

ΘR(~xR) ≺ ΘS(~xS)

≺ denotes the lexicographic ordering:

(a1, ..., an) ≺ (b1, ..., bm) iff ∃i, 1 ≤ i ≤ min(n,m) s.t. (a1, ..., ai−1) = (b1, ..., bi−1) and ai < bi
UCLA 84
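For instance, (1, 3, 7) ≺ (1, 4, 0): the first components are equal and 3 < 4, so the remaining components do not matter.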
Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School
Lexico-positivity of Dependence Satisfaction
I ΘR(~xR) ≺ ΘS(~xS) is equivalently written ΘS(~xS) − ΘR(~xR) ≻ ~0
I Considering the row p of the scheduling matrices:

ΘS_p(~xS) − ΘR_p(~xR) ≥ δp

I δp ≥ 1 implies no constraints on δk, k > p
I δp ≥ 0 is required if ∄k < p such that δk ≥ 1
I Schedule lower bound:

Lemma (Schedule lower bound)

Given ΘR_k, ΘS_k such that each coefficient value is bounded in [x,y]. Then there exists K ∈ Z such that:

∀~xR,~xS,  ΘS_k(~xS) − ΘR_k(~xR) > −K.~n − K
UCLA 85
Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School
Convex Form of All Bounded Affine Schedules
Lemma (Convex form of semantics-preserving affine schedules)

Given a set of affine schedules ΘR, ΘS, ... of dimension m, the program semantics is preserved if the three following conditions hold:

(i) ∀DR,S,  δp^{DR,S} ∈ {0,1}

(ii) ∀DR,S,  ∑_{p=1..m} δp^{DR,S} = 1

(iii) ∀DR,S, ∀p ∈ {1, ..., m}, ∀〈~xR,~xS〉 ∈ DR,S,
  ΘS_p(~xS) − ΘR_p(~xR) ≥ δp^{DR,S} − ∑_{k=1..p−1} δk^{DR,S}.(K.~n + K)

→ Use the Farkas lemma to build all non-negative functions over a polyhedron (here, the dependence polyhedra) [Feautrier,92]
→ Bounded coefficients required [Vasilache,07]
UCLA 86
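A hedged instantiation of the lemma for a single dependence and m = 2: conditions (i) and (ii) reduce to δ1 + δ2 = 1 with δ1, δ2 ∈ {0,1}, and condition (iii) becomes

ΘS_1(~xS) − ΘR_1(~xR) ≥ δ1
ΘS_2(~xS) − ΘR_2(~xR) ≥ δ2 − δ1.(K.~n + K)

so either the dependence is strongly satisfied at level 1 (δ1 = 1) and level 2 is only constrained by the lower bound −K.~n − K, or it is weakly satisfied at level 1 (δ1 = 0) and must then be strongly satisfied at level 2.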
Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School
Maximal Fine-Grain Parallelism
Objectives:
I Have as few dimensions as possible carrying a dependence
I For dimension k ∈ p..1:

min ∑_{DR,S} δk^{DR,S}

I We use lexicographic optimization
UCLA 87
Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School
Key Observations
Is all of this really necessary?
I We have encoded one objective per row of Θ
I Question: do we need to solve this large ILP/LP?
UCLA 88
Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School
Feautrier’s Greedy Algorithm
Main idea:
1 Start at row 1 of Θ
2 Build the set of legal one-dimensional schedules
3 Maximize the number of dependences strongly solved (max δi)
4 Remove strongly solved dependences from P
5 Go to 1

This is a row-by-row decomposition of the scheduling problem
UCLA 89
Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School
Key Properties of Feautrier’s Algorithm
I It terminates
I It finds "optimal" fine-grain parallelism
I Granularity of dependence satisfaction: all-or-nothing
Example
for (i = 0; i < 2 * N; ++i) A[i] = A[2 * N - i];
UCLA 90
Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School
Key Observations
Is all of this really necessary?
I Question: do we need to consider all statements at once?
I Insight: the PDG gives structural information about dependences
I Decomposition of the PDG into strongly-connected components
UCLA 91
Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School
Feautrier’s Scheduler
UCLA 92
Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School
More Observations
I Some problems may be decomposed without loss of optimality
I The PDG gives extra information about further problem decomposition
Still, is all of this really necessary?
I Question: can we use additional knowledge about dependences?
I Uniform, non-uniform and parametric dependences
UCLA 93
Affine Scheduling: Space of Semantics-Preserving Affine Schedules Polyhedral School
Cost Functions
UCLA 94
Scheduling for Performance: Polyhedral School
Objectives for Good Scheduling
Fine-grain parallelism is nice, but...
I It has little connection with modern SIMD parallelism
I No information about the quality of the generated code
I Ignores all the important performance objectives:
  I Data locality / TLB / cache considerations
  I Multi-core parallelism (sync-free, with barrier)
  I SIMD vectorization
Question: how to find a FAST schedule for a modern processor?
UCLA 95
Scheduling for Performance: Polyhedral School
Performance Distribution for 1-D Schedules [1/2]
[Two plots: cycles vs. transformation identifier for matmult (about 1000 schedules) and locality (about 7000 schedules), with the original version marked.]

Figure: Performance distribution for matmult and locality
UCLA 96
Scheduling for Performance: Polyhedral School
Performance Distribution for 1-D Schedules [2/2]
[Two plots: cycles vs. transformation identifier for crout, under (a) GCC -O3 and (b) ICC -fast, with the original version marked.]

Figure: The effect of the compiler
UCLA 97
Scheduling for Performance: Polyhedral School
Quantitative Analysis: The Hypothesis
Extremely large generated spaces: > 10^50 points

→ we must leverage static and dynamic characteristics to build traversal mechanisms

Hypothesis:
I It is possible to statically order the impact on performance of transformation coefficients, that is, decompose the search space into subspaces where the performance variation is maximal or reduced
I The first rows of Θ have more performance impact than the last ones
UCLA 98
Scheduling for Performance: Polyhedral School
Observations on the Performance Distribution
[Plot: best/average/worst performance improvement vs. point index for the first schedule row (8x8 DCT).]
for (i = 0; i < M; i++)
  for (j = 0; j < M; j++) {
    tmp[i][j] = 0.0;
    for (k = 0; k < M; k++)
      tmp[i][j] += block[i][k] * cos1[j][k];
  }
for (i = 0; i < M; i++)
  for (j = 0; j < M; j++) {
    sum2 = 0.0;
    for (k = 0; k < M; k++)
      sum2 += cos1[i][k] * tmp[k][j];
    block[i][j] = ROUND(sum2);
  }
I Extensive study of the 8x8 Discrete Cosine Transform (UTDSP)
I Search space analyzed: 66 × 19683 ≈ 1.29 × 10^6 different legal program versions
UCLA 99
Scheduling for Performance: Polyhedral School

Observations on the Performance Distribution

[Plot: sorted performance distribution over the values of the second schedule row, first row fixed (8x8 DCT).]

I Take one specific value for the first row
I Try the 19683 possible values for the second row
I Very low proportion of best points: < 0.02%
I Performance variation is large for good values of the first row
I It is usually reduced for bad values of the first row

UCLA 99
Scheduling for Performance: Polyhedral School
Scanning The Space of Program Versions
The search space:
I Performance variation indicates how to partition the space: ~ı > ~p > c (iterator coefficients first, then parameters, then the constant)
I Non-uniform distribution of performance
I No clear analytical property of the optimization function

→ Build dedicated heuristics and genetic operators aware of these static and dynamic characteristics
UCLA 100
Scheduling for Performance: Polyhedral School
The Quest for Good Objective Functions
I For data locality, loop tiling is key
  I But what is the cost of tiling?
  I Is tiling the only criterion?
I For coarse-grain parallelism, doall parallelization is key
  I But what is the cost of parallelization?
UCLA 101
Scheduling for Performance: Polyhedral School
Dependence Distance Minimization
I Idea: minimize the delay between instances accessing the same data
I Formulation in the polyhedral model:
  I Expression of the delay through a parametric form
  I Use all dependences (including RAR)

Definition (Dependence distance minimization)

uk.~n + wk ≥ ΘS_k(~xS) − ΘR_k(~xR),  〈~xR,~xS〉 ∈ DR,S    (1)

uk ∈ N^p, wk ∈ N
UCLA 102
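A hedged instance of the definition: if consecutive iterations i and i + 1 read the same array cell (a RAR dependence) under the schedule Θ = i, the delay is Θ(i+1) − Θ(i) = 1, and the tightest bound in (1) is uk = 0, wk = 1: the reuse distance is constant, independent of the problem size ~n.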
Scheduling for Performance: Polyhedral School
Key Observations
I Minimizing d = uk.~n + wk minimizes the dependence distance
I When d = 0, then ΘR_k(~xR) = ΘS_k(~xS):

0 ≥ ΘR_k(~xR) − ΘS_k(~xS) ≥ 0

I d gives an indication of the communication volume between hyperplanes
UCLA 103
Scheduling for Performance: Tiling Polyhedral School
An Overview of Tiling
Tiling: partition the computation into atomic blocks

I Early work in the late 80's
I Motivation: data locality improvement + parallelization
UCLA 104
Scheduling for Performance: Tiling Polyhedral School
An Overview of Tiling
I Tiling the iteration space
  I It must be valid (dependence analysis required)
  I It may require pre-transformation
  I Unimodular transformation framework limitations
I Supported in current compilers, but limited applicability
I Challenges: imperfectly nested loops, parametric loops, pre-transformations, tile shape, ...
I Tile size selection
  I Critical for locality concerns: determines the footprint
  I Empirical search of the best size (problem + machine specific)
  I Parametric tiling makes the generated code valid for any tile size
UCLA 105
Scheduling for Performance: The Tiling Hyperplane Method Polyhedral School
Tiling in the Polyhedral Model
I Tiling partitions the computation into blocks
I Note: we consider only rectangular tiling here
I For tiling to be applicable, such a partitioning must be legal
UCLA 106
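As a hedged, minimal C sketch of rectangular tiling (hypothetical array and sizes; the nest is fully parallel, so any rectangular partitioning is trivially legal, and N is a multiple of the tile size so no min() bounds are needed):

#include <stdio.h>
#define N 1024
#define TS 32                        /* tile size */
static double A[N][N];
int main(void) {
  /* Tile loops (ti, tj) enumerate the blocks; point loops (i, j)
     scan the iterations inside one block. */
  for (int ti = 0; ti < N; ti += TS)
    for (int tj = 0; tj < N; tj += TS)
      for (int i = ti; i < ti + TS; i++)
        for (int j = tj; j < tj + TS; j++)
          A[i][j] = 0.5 * (double)(i + j);
  printf("%f\n", A[N-1][N-1]);       /* keep the computation observable */
  return 0;
}

Each (ti, tj) block touches a TS x TS footprint, which is why the tile size is the knob controlling the cache footprint mentioned above.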
Scheduling for Performance: The Tiling Hyperplane Method Polyhedral School
Key Ideas of the Tiling Hyperplane Algorithm
Affine transformations for communication-minimal parallelization and locality optimization of arbitrarily nested loop sequences [Bondhugula et al., CC'08 & PLDI'08]

I Compute a set of transformations to make loops tilable
  I Try to minimize synchronizations
  I Try to maximize locality (maximal fusion)
I Result is a set of permutable loops, if possible
  I Strip-mining / tiling can be applied
  I Tiles may be sync-free parallel or pipeline parallel
I Algorithm always terminates (possibly by splitting loops/statements)
UCLA 107
Scheduling for Performance: The Tiling Hyperplane Method Polyhedral School
Legality of Tiling
Theorem (Legality of Tiling)
Given ΘR_k, ΘS_k two one-dimensional schedules. They are valid tiling hyperplanes if:

∀DR,S, ∀〈~xR,~xS〉 ∈ DR,S,  ΘS_k(~xS) − ΘR_k(~xR) ≥ 0

I For a schedule to be a legal tiling hyperplane, all communications must go forward: Forward Communication Only [Griebl]
I All dependences must be considered at each level, including those previously strongly satisfied
I Equivalence between loop permutability and loop tilability
UCLA 108
Scheduling for Performance: The Tiling Hyperplane Method Polyhedral School
Greedy Algorithm for Tiling HyperplaneComputation
1 Start from the outer-most level, find the set of FCO schedules
2 Select one which minimizes the distance between dependent iterations
3 Mark dependences strongly satisfied by this schedule, but do not remove them
4 Formulate the problem for the next level (FCO), adding orthogonality constraints (linear independence)
5 Solve again, etc.

Special treatment when no permutable band can be found: splitting

A few properties:
I Result is a set of permutable/tilable outer loops, when possible
I It exhibits coarse-grain parallelism
I Maximal fusion achieved to improve locality
UCLA 109
Scheduling for Performance: The Tiling Hyperplane Method Polyhedral School
Example: 1D-Jacobi
1-D Jacobi (imperfectly nested)

for (t=1; t<M; t++) {
  for (i=2; i<N-1; i++) {
S:  b[i] = 0.333*(a[i-1]+a[i]+a[i+1]);
  }
  for (j=2; j<N-1; j++) {
T:  a[j] = b[j];
  }
}
UCLA 110
Scheduling for Performance: The Tiling Hyperplane Method Polyhedral School
Example: 1D-Jacobi
1-D Jacobi (imperfectly nested)

• The resulting transformation is equivalent to a constant shift of one for T relative to S, fusion (j and i are named the same as a result), and skewing the fused i loop with respect to the t loop by a factor of two.
• The (1,0) hyperplane has the least communication: no dependence crosses more than one hyperplane instance along it.
UCLA 111
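One consistent way to write the schedules just described (a hedged reconstruction; it matches the transformed code shown a few slides below, where t1 = 2*t0 + i for S and t1 = 2*t0 + j + 1 for T):

ΘS(t, i) = (t, 2t + i)
ΘT(t, j) = (t, 2t + j + 1)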
Scheduling for Performance: The Tiling Hyperplane Method Polyhedral School
Example: 1D-Jacobi
Transforming S

[Figure: the (t, i) iteration space of S and its transformed (t′, i′) image.]
UCLA 112
Scheduling for Performance: The Tiling Hyperplane Method Polyhedral School
Example: 1D-Jacobi
Transforming T

[Figure: the (t, j) iteration space of T and its transformed (t′, j′) image.]
UCLA 113
Scheduling for Performance: The Tiling Hyperplane Method Polyhedral School
Example: 1D-Jacobi
Interleaving S and T

[Figure: the transformed spaces of S and T, (t′, i′) and (t′, j′), side by side.]
UCLA 114
Scheduling for Performance: The Tiling Hyperplane Method Polyhedral School
Example: 1D-Jacobi
Interleaving S and T

[Figure: the interleaved instances of S and T on the common time axis t.]
UCLA 115
Scheduling for Performance: The Tiling Hyperplane Method Polyhedral School
Example: 1D-Jacobi
1-D Jacobi (imperfectly nested) – transformed code

for (t0=0; t0<=M-1; t0++) {
S': b[2] = 0.333*(a[2-1]+a[2]+a[2+1]);
  for (t1=2*t0+3; t1<=2*t0+N-2; t1++) {
S:  b[-2*t0+t1] = 0.333*(a[-2*t0+t1-1]+a[-2*t0+t1]+a[-2*t0+t1+1]);
T:  a[-2*t0+t1-1] = b[-2*t0+t1-1];
  }
T': a[N-2] = b[N-2];
}
UCLA 116
Scheduling for Performance: Fusion-driven Optimization Polyhedral School
Fusion-driven Optimization
UCLA 118
Scheduling for Performance: Fusion-driven Optimization Polyhedral School
Overview
Problem: How to improve program execution time?
I Focus on shared-memory computation
  I OpenMP parallelization
  I SIMD vectorization
  I Efficient usage of the intra-node memory hierarchy
I Challenges to address:
  I Different machines require different compilation strategies
  I One-size-fits-all scheme hinders optimization opportunities
Question: how to restructure the code for performance?
UCLA 119
Scheduling for Performance: Fusion-driven Optimization Polyhedral School
Objectives for a Successful Optimization
During the program execution, there is interplay between the hardware resources:
I Thread-centric parallelism
I SIMD-centric parallelism
I Memory layout, incl. caches, prefetch units, buses, interconnects...

→ Tuning the trade-off between these is required

A loop optimizer must be able to transform the program for:
I Thread-level parallelism extraction
I Loop tiling, for data locality
I Vectorization

Our approach: form a tractable search space of possible loop transformations
UCLA 120
Scheduling for Performance: Fusion-driven Optimization Polyhedral School
Running Example
Original code
Example (tmp = A.B, D = tmp.C)

for (i1 = 0; i1 < N; ++i1)
  for (j1 = 0; j1 < N; ++j1) {
R:  tmp[i1][j1] = 0;
    for (k1 = 0; k1 < N; ++k1)
S:    tmp[i1][j1] += A[i1][k1] * B[k1][j1];
  }
for (i2 = 0; i2 < N; ++i2)
  for (j2 = 0; j2 < N; ++j2) {
T:  D[i2][j2] = 0;
    for (k2 = 0; k2 < N; ++k2)
U:    D[i2][j2] += tmp[i2][k2] * C[k2][j2];
  }

{R,S} fused, {T,U} fused

                          Original  Max. fusion  Max. dist  Balanced
4× Xeon 7450 / ICC 11        1×
4× Opteron 8380 / ICC 11     1×
UCLA 121
Scheduling for Performance: Fusion-driven Optimization Polyhedral School
Running Example
Cost model: maximal fusion, minimal synchronization[Bondhugula et al., PLDI’08]
Example (tmp = A.B, D = tmp.C)

parfor (c0 = 0; c0 < N; c0++) {
  for (c1 = 0; c1 < N; c1++) {
R:  tmp[c0][c1] = 0;
T:  D[c0][c1] = 0;
    for (c6 = 0; c6 < N; c6++)
S:    tmp[c0][c1] += A[c0][c6] * B[c6][c1];
    parfor (c6 = 0; c6 <= c1; c6++)
U:    D[c0][c6] += tmp[c0][c1-c6] * C[c1-c6][c6];
  }
  for (c1 = N; c1 < 2*N - 1; c1++)
    parfor (c6 = c1-N+1; c6 < N; c6++)
U:    D[c0][c6] += tmp[c0][c1-c6] * C[c1-c6][c6];
}

{R,S,T,U} fused

                          Original  Max. fusion  Max. dist  Balanced
4× Xeon 7450 / ICC 11        1×        2.4×
4× Opteron 8380 / ICC 11     1×        2.2×
UCLA 121
Scheduling for Performance: Fusion-driven Optimization Polyhedral School
Running Example
Maximal distribution: best for Intel Xeon 7450
Poor data reuse, best vectorization

Example (tmp = A.B, D = tmp.C)

parfor (i1 = 0; i1 < N; ++i1)
  parfor (j1 = 0; j1 < N; ++j1)
R:  tmp[i1][j1] = 0;
parfor (i1 = 0; i1 < N; ++i1)
  for (k1 = 0; k1 < N; ++k1)
    parfor (j1 = 0; j1 < N; ++j1)
S:    tmp[i1][j1] += A[i1][k1] * B[k1][j1];
parfor (i2 = 0; i2 < N; ++i2)
  parfor (j2 = 0; j2 < N; ++j2)
T:  D[i2][j2] = 0;
parfor (i2 = 0; i2 < N; ++i2)
  for (k2 = 0; k2 < N; ++k2)
    parfor (j2 = 0; j2 < N; ++j2)
U:    D[i2][j2] += tmp[i2][k2] * C[k2][j2];

{R} and {S} and {T} and {U} distributed

                          Original  Max. fusion  Max. dist  Balanced
4× Xeon 7450 / ICC 11        1×        2.4×        3.9×
4× Opteron 8380 / ICC 11     1×        2.2×        6.1×
UCLA 121
Scheduling for Performance: Fusion-driven Optimization Polyhedral School
Running Example
Balanced distribution/fusion: best for AMD Opteron 8380

Example (tmp = A.B, D = tmp.C)

parfor (c1 = 0; c1 < N; c1++)
  parfor (c2 = 0; c2 < N; c2++)
R:  C[c1][c2] = 0;
parfor (c1 = 0; c1 < N; c1++)
  for (c3 = 0; c3 < N; c3++) {
T:  E[c1][c3] = 0;
    parfor (c2 = 0; c2 < N; c2++)
S:    C[c1][c2] += A[c1][c3] * B[c3][c2];
  }
parfor (c1 = 0; c1 < N; c1++)
  for (c3 = 0; c3 < N; c3++)
    parfor (c2 = 0; c2 < N; c2++)
U:    E[c1][c2] += C[c1][c3] * D[c3][c2];

{S,T} fused, {R} and {U} distributed

                          Original  Max. fusion  Max. dist  Balanced
4× Xeon 7450 / ICC 11        1×        2.4×        3.9×       3.1×
4× Opteron 8380 / ICC 11     1×        2.2×        6.1×       8.3×
UCLA 121
Scheduling for Performance: Fusion-driven Optimization Polyhedral School
Running Example
The best fusion/distribution choice drives the quality of the optimization

UCLA 121
Scheduling for Performance: Fusion-driven Optimization Polyhedral School
Loop Structures
Possible grouping + ordering of statements
I {{R}, {S}, {T}, {U}}; {{R}, {S}, {U}, {T}}; ...
I {{R,S}, {T}, {U}}; {{R}, {S}, {T,U}}; {{R}, {T,U}, {S}}; {{T,U}, {R}, {S}}; ...
I {{R,S,T}, {U}}; {{R}, {S,T,U}}; {{S}, {R,T,U}}; ...
I {{R,S,T,U}}
Number of possibilities: >> n! (number of total preorders)
UCLA 122
Scheduling for Performance: Fusion-driven Optimization Polyhedral School
Loop Structures
Removing non-semantics preserving ones
I {{R}, {S}, {T}, {U}}; {{R}, {S}, {U}, {T}}; ...
I {{R,S}, {T}, {U}}; {{R}, {S}, {T,U}}; {{R}, {T,U}, {S}}; {{T,U}, {R}, {S}}; ...
I {{R,S,T}, {U}}; {{R}, {S,T,U}}; {{S}, {R,T,U}}; ...
I {{R,S,T,U}}
Number of possibilities: 1 to 200 for our test suite
UCLA 122
Scheduling for Performance: Fusion-driven Optimization Polyhedral School
Loop Structures
For each partitioning, many possible loop structures
I {{R}, {S}, {T}, {U}}
I For S: {i, j, k}; {i, k, j}; {k, i, j}; {k, j, i}; ...
I However, only {i, k, j} has:
  I an outer-parallel loop
  I an inner-parallel loop
  I lowest-striding accesses (efficient vectorization)
UCLA 122
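A hedged, self-contained C sketch of the {i, k, j} structure for S (toy sizes; the comments mark the properties listed above):

#include <stdio.h>
#define N 256
static double A[N][N], B[N][N], tmp[N][N];
int main(void) {
  for (int i = 0; i < N; ++i)        /* outer-parallel loop */
    for (int k = 0; k < N; ++k)
      for (int j = 0; j < N; ++j)    /* inner-parallel, unit-stride on tmp and B */
        tmp[i][j] += A[i][k] * B[k][j];
  printf("%f\n", tmp[0][0]);
  return 0;
}

With j innermost, consecutive iterations touch consecutive elements of tmp[i][] and B[k][], which is the low-striding access pattern that enables efficient vectorization.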
Scheduling for Performance: Fusion-driven Optimization Polyhedral School
Possible Loop Structures for 2mm
I 4 statements, 75 possible partitionings
I 10 loops, up to 10! possible loop structures for a given partitioning

I Two steps:
  I Remove all partitionings which break the semantics: from 75 to 12
  I Use static cost models to select the loop structure for a partitioning: from d! to 1

I Final search space: 12 possibilities
UCLA 123
Scheduling for Performance: Fusion-driven Optimization Polyhedral School
Contributions and Overview of the Approach
I Empirical search on possible fusion/distribution schemes
I Each structure drives the success of other optimizations
  I Parallelization
  I Tiling
  I Vectorization
I Use static cost models to compute a complex loop transformation for a specific fusion/distribution scheme
I Iteratively test the different versions, retain the best
I Best performing loop structure is found
UCLA 124
Scheduling for Performance: Fusion-driven Optimization Polyhedral School
Search Space of Loop Structures
I Partition the set of statements into classes:
  I This is deciding loop fusion / distribution
  I Statements in the same class will share at least one common loop in the target code
  I Classes are ordered, to reflect code motion
I Locally on each partition, apply model-driven optimizations
I Leverage the polyhedral framework:
  I Build the smallest yet most expressive space of possible partitionings [Pouchet et al., POPL'11]
  I Consider semantics-preserving partitionings only: orders of magnitude smaller space
UCLA 125
Scheduling for Performance: Fusion-driven Optimization Polyhedral School
Summary of the Optimization Process
Benchmark    Description                 #loops  #stmts  #refs  #deps  #part.  #valid  Variability  Pb. Size
2mm          Linear algebra (BLAS3)         6       4       8     12      75      12       X        1024x1024
3mm          Linear algebra (BLAS3)         9       6      12     19    4683     128       X        1024x1024
adi          Stencil (2D)                  11       8      36    188  545835       1                1024x1024
atax         Linear algebra (BLAS2)         4       4      10     12      75      16       X        8000x8000
bicg         Linear algebra (BLAS2)         3       4      10     10      75      26       X        8000x8000
correl       Correlation (PCA: StatLib)     5       6      12     14    4683     176       X        500x500
covar        Covariance (PCA: StatLib)      7       7      13     26   47293      96       X        500x500
doitgen      Linear algebra                 5       3       7      8      13       4                128x128x128
gemm         Linear algebra (BLAS3)         3       2       6      6       3       2                1024x1024
gemver       Linear algebra (BLAS2)         7       4      19     13      75       8       X        8000x8000
gesummv      Linear algebra (BLAS2)         2       5      15     17     541      44       X        8000x8000
gramschmidt  Matrix normalization           6       7      17     34   47293       1                512x512
jacobi-2d    Stencil (2D)                   5       2       8     14       3       1                20x1024x1024
lu           Matrix decomposition           4       2       7     10       3       1                1024x1024
ludcmp       Solver                         9      15      40    188    1012      20       X        1024x1024
seidel       Stencil (2D)                   3       1      10     27       1       1                20x1024x1024

Table: Summary of the optimization process
UCLA 126
Scheduling for Performance: Fusion-driven Optimization Polyhedral School
Experimental Setup
We compare three schemes:

I maxfuse: static cost model for fusion (maximal fusion)
I smartfuse: static cost model for fusion (fuse only if data reuse)
I Iterative: iterative compilation, output the best result
UCLA 127
Scheduling for Performance: Fusion-driven Optimization Polyhedral School
Performance Results - Intel Xeon 7450 - ICC 11
[Bar chart: performance improvement over ICC -fast -parallel on the Intel Xeon 7450 (24 threads), for each benchmark; series pocc-maxfuse, pocc-smartfuse, iterative; improvements range up to about 5×.]
UCLA 128
Scheduling for Performance: Fusion-driven Optimization Polyhedral School
Performance Results - AMD Opteron 8380 - ICC 11
[Bar chart: performance improvement over ICC -fast -parallel on the AMD Opteron 8380 (16 threads), for each benchmark; series pocc-maxfuse, pocc-smartfuse, iterative; improvements range up to about 9×.]
UCLA 129
Scheduling for Performance: Fusion-driven Optimization Polyhedral School
Performance Results - Intel Atom 330 - GCC 4.3
[Bar chart: performance improvement over GCC 4.3 -O3 -fopenmp on the Intel Atom (2 threads), for each benchmark; series pocc-maxfuse, pocc-smartfuse, iterative; improvements range up to about 30×.]
UCLA 130
Scheduling for Performance: Fusion-driven Optimization Polyhedral School
Assessment from Experimental Results
1 Empirical tuning required for 9 out of 16 benchmarks
2 Strong performance improvements: 2.5× - 3× on average
3 Portability achieved:
  I Automatically adapt to the program and target architecture
  I No assumption made about the target
  I Exhaustive search finds the optimal structure (1-176 variants)
4 Substantial improvements over state-of-the-art (up to 2×)
UCLA 131
Approximate Scheduling: Polyhedral School
Approximate Scheduling
UCLA 132
Approximate Scheduling: Polyhedral School
Exact vs. Approximate Dependences
I So far, we used an exact dependence representation
I But numerous other works use approximate representations:
  I dependence distance vectors [Allen/Kennedy, KMW]
  I dependence directions
  I DDVs approximated as polyhedra [Darte-Vivien]

I Optimality may not always be lost when using a relaxed dependence representation!
UCLA 133
Approximate Scheduling: Polyhedral School
Key Aspects of Dependence Representations
I Can use a graph to represent the dependences
  I Karp, Miller, Winograd: graph for SURE, using DDV
  I Darte-Vivien: "PDG"-like graph
I For certain kinds of dependences (i.e., uniform), "optimal" algorithms exist
  I Darte-Vivien
  I SURE
UCLA 134
Approximate Scheduling: Polyhedral School
Some Observations
I Maximal parallelism (under some conditions) can be extracted from less precise representations
I Very fast scheduling algorithms can be derived
I Remaining question: how useful are those algorithms?
UCLA 135
Approximate Scheduling: Polyhedral School
Conclusion
UCLA 136
Approximate Scheduling: Polyhedral School
The End
Let’s stop here for today!
UCLA 137