Motivation PPCG Adaptive Mapping Results Conclusions
Adaptive Mapping and Parameter Selection Scheme to Improve Automatic
Code Generation for GPUs
Carlos Juega 1, José Ignacio Gómez 1, Christian Tenllado 1, Francky Catthoor 2
1 ArTeCS, Universidad Complutense de Madrid
2 IMEC
February 19, 2014
Carlos Juega et al. Adaptive Mapping and Parameter Selection 1 / 29
1 Motivation
2 PPCG
3 Adaptive Mapping
4 Results
5 Conclusions
Motivation
Some facts:
- GPGPU programming is still a hard task.
- The parallelism vs. locality trade-off rules the mapping.
- The optimal implementation usually changes across different GPU generations.

gemm example:
- naive: 40 ms
- + parallelism, shared memory: 12.15 ms
- + parallelism, shared memory + registers: 9.36 ms

There is always hope:
- Many source-to-source compilers for GPUs.
- Other approaches (OpenACC, CUDA-CHiLL, ...).
Polyhedral Parallel Code Generator
Source-to-source compiler based on the Polyhedral Model.
Contributions:
1. Exposes parallelism.
2. Maximizes temporal locality.
3. Exploits shared memory and registers.

[Pipeline figure: C code with pragmas → polyhedral extraction → dependence analysis → scheduling → tiling and GPU mapping → memory allocation → code generator → CUDA code.]
PPCG limitations
1. Static mapping process (it is always the same no matter the input).
2. It still needs user assistance (parametrization).
3. It is not aware of platform details.

[Pipeline figure: C code with pragmas → polyhedral extraction → dependence analysis → scheduling → tiling and GPU mapping → memory allocation → code generator → CUDA code.]
Proposal
Basics:
- Explore minor changes in PPCG's schedule: loop interchange; serialization of parallel loops.
- Use parametric tiling or ranged values (just for pruning).
- Take platform specs as problem variables.

Heuristics:
- Memory footprint.
- Access cost.

Contributions:
- Adapts PPCG schedules to the input's data reuse requirements.
- Computes common GPU parameters based on data reuse.
[Figure: the baseline PPCG pipeline — C code with pragmas → polyhedral extraction → dependence analysis → scheduling → tiling and GPU mapping → memory allocation → code generator → CUDA code.]
[Figure: the proposed pipeline — C code with pragmas → polyhedral extraction → dependence analysis → scheduling → sequential data reuse analysis → parallel data reuse analysis → restriction solver → code generator → CUDA code. The data reuse analyses comprise search space construction and search space expansion (each with reuse analysis and pruning); the restriction solver comprises sorting and an optimization solver.]
Example (search space construction): the input schedule [a, b] is tiled into the candidate schedules [Ta, Tb, Pa, Pb], [Ta, Tb, Pb, Pa], and [Ta, Pa, Tb, Pb].
[Figure (reuse analysis): per-candidate memory footprints of arrays R and S, expressed in terms of the tile sizes TA and TB and the problem sizes M and N.]
[Plot (pruning): Total Access Cost vs. Memory Footprint (0-300) for the candidate schedules [Ta, Pa, Tb, Pb], [Ta, Tb, Pa, Pb], and [Ta, Tb, Pb, Pa].]
Example (search space expansion): [Ta, Tb, Pb, Pa] is expanded into [Ta, Tb, Pb, PTa, PPa], [Ta, Tb, Pb, PPa, PTa], [Ta, Tb, PPa, Pb, PTa], ...
Restriction solver
Goal function: maximize parallelism.
Restriction types:
- Architecture constraints.
- Problem size constraints (when known at compile time).
- Platform reuse constraint.
- Previously built restrictions: tiling constraints; optimization constraints.

Example model:
max: active_blocks*TTA
active_blocks <= 8
active_blocks >= 1
TTA <= 1024
TTA >= 1
active_blocks*TTA <= 1536
4000 / TA >= 14
4000 % TA = 0
4000 % TB = 0
TA >= 36
TA % TTA = 0
TB % TTA = 0
TTA % 32 = 0
active_blocks*(TA) <= 32768
active_blocks*(TA+TB) <= 12288
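The model above is small enough to check by exhaustive search. The following sketch (mine, not the solver the methodology actually uses) enumerates the feasible (active_blocks, TTA, TA, TB) tuples under the bounds copied from the slide and maximizes active_blocks*TTA; the comments naming each bound's hardware meaning are my reading, not stated in the model.

```python
# Exhaustive check of the restriction model above (illustrative only; the
# methodology feeds such models to a proper optimization solver).

def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

best = None
for TA in divisors(4000):                          # 4000 % TA = 0
    if TA < 36 or 4000 / TA < 14:                  # TA >= 36, 4000/TA >= 14
        continue
    for TB in divisors(4000):                      # 4000 % TB = 0
        for TTA in range(32, 1025, 32):            # TTA % 32 = 0, TTA <= 1024
            if TA % TTA or TB % TTA:               # TA % TTA = 0, TB % TTA = 0
                continue
            for active_blocks in range(1, 9):      # 1 <= active_blocks <= 8
                if active_blocks * TTA > 1536:     # resident threads (assumed)
                    continue
                if active_blocks * TA > 32768:     # register budget (assumed)
                    continue
                if active_blocks * (TA + TB) > 12288:  # shared mem (assumed)
                    continue
                obj = active_blocks * TTA          # goal: maximize parallelism
                if best is None or obj > best[0]:
                    best = (obj, active_blocks, TTA, TA, TB)

print(best)
```

Under these bounds the maximum of active_blocks*TTA is 1280, reached with 8 active blocks of 160 threads and TA = TB = 160.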
Environment
Tested GPUs:
- Tesla K20 (Kepler architecture).
- Tesla M2070 (Fermi architecture).
- Tesla C1060 (Tesla architecture).

Benchmarks (compiled with CUDA 5.0):
PolyBench-gpu-1.0
(http://www.cse.ohio-state.edu/~pouchet/software/polybench/GPU/index.html)
Results on the K20
[Bar chart: GFlops/s (log scale, 1-1000) for covariance, atax, bicg, gemm, gesummv, mvt, 2dconv, 3dconv, syrk, syr2k, fdtd-2d; series: manual CUDA, ppcg default, ppcg hand-tuned, ppcg+methodology.]
Results on the M2070
[Bar chart: GFlops/s (log scale, 1-1000) for covariance, atax, bicg, gemm, gesummv, mvt, 2dconv, 3dconv, syrk, syr2k, fdtd-2d; series: manual CUDA, ppcg default, ppcg hand-tuned, ppcg+methodology.]
Results on the C1060
[Bar chart: GFlops/s (log scale, 1-1000) for covariance, atax, bicg, gemm, gesummv, mvt, 2dconv, 3dconv, syrk, syr2k, fdtd-2d; series: manual CUDA, ppcg default, ppcg hand-tuned, ppcg+methodology.]
State-of-art comparison
[Bar chart: GFlops/s (log scale, 0.1-1000) for covariance, atax, bicg, gemm, gesummv, mvt, 2dconv, 3dconv, syrk, syr2k, fdtd-2d; series: manual CUDA, OpenACC, Par4All, ppcg+methodology.]
Conclusions and future work
Conclusions:
- A data-reuse-driven system to improve performance.
- Turns PPCG into a platform-aware compiler.
- Performs very well compared against the previous ppcg.
- Speedups: 2.96x, 3.23x, 2.95x (compared to out-of-the-box ppcg).

Future work:
- Explore the effect of other loop transformations.
- Add even more restrictions to the solver.
- Validate our model with bigger benchmarks.
Thanks for your attention!
Any questions?
Polyhedral Model
#pragma scop
for (i = 0; i < 5; i++) {
S0    A[i] = 0;
      for (j = i; j < 5; j++)
S1        A[i] += B[i][j];
}
#pragma endscop

Iteration domain: contains the dynamic instances of the statements.
{ S0(i) | 0 ≤ i < 5 } ∪ { S1(i, j) | 0 ≤ i < 5 ∧ i ≤ j < 5 }

[Figure: the iteration domain — S0 instances along the i axis (0-4); S1 instances are the lattice points bounded by i ≥ 0, i ≤ 4, j ≥ i, j ≤ 4.]
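To make the set notation concrete, a small sketch (mine, not part of the slides) enumerates the two statements' instance sets and executes the SCoP on sample data:

```python
# Enumerate the iteration domain of the SCoP above and execute it.
N = 5
B = [[i * N + j for j in range(N)] for i in range(N)]  # arbitrary sample input
A = [0] * N

# { S0(i) | 0 <= i < 5 }
domain_S0 = [(i,) for i in range(N)]
# { S1(i, j) | 0 <= i < 5 and i <= j < 5 }
domain_S1 = [(i, j) for i in range(N) for j in range(i, N)]

for i in range(N):
    A[i] = 0                  # S0(i)
    for j in range(i, N):
        A[i] += B[i][j]       # S1(i, j)

print(len(domain_S0), len(domain_S1))  # 5 15
```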
Access relations: map statement instances to the array elements.
writes: { S0(i) → A(i) } ∪ { S1(i, j) → A(i) }
reads: { S1(i, j) → B(i, j) }

Schedule: specifies the order in which the statement instances are executed.
{ S0(i) → (i, 0, 0) } ∪ { S1(i, j) → (i, j, 1) }

[Figure: the statement instances of the iteration domain visited in schedule order.]
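The schedule can be read as a lexicographic timestamp per instance: executing the SCoP means visiting instances in increasing timestamp order. A sketch, assuming the domain and schedule above:

```python
# Order all statement instances by their schedule timestamps and check that
# S0(i) runs just before the S1(i, j) instances of the same row.
N = 5
instances = [("S0", (i,), (i, 0, 0)) for i in range(N)] \
          + [("S1", (i, j), (i, j, 1)) for i in range(N) for j in range(i, N)]

ordered = sorted(instances, key=lambda inst: inst[2])  # lexicographic order

print(ordered[:3])
```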
Dependence analysis
Dependence relation: determines which statement instances depend on which other statement instances.
{ S0(i) → S1(i, 0) } ∪ { S1(i, j) → S1(i, j + 1) }

Dependence distances: used to know whether schedule dimensions are parallel and/or tileable.
{ (0, 0, 1), (0, 1, 0) }

[Figure: dependence arrows between statement instances over the iteration domain.]
Scheduling
1. Based on the Pluto algorithm: exposes parallelism and temporal locality.
2. Uses isl to build a new affine schedule S = (Si)1≤i≤d.
3. The affine functions Si are constructed one by one in such a way that the corresponding dependence distances are non-negative. The sequence of Si constructed in this way forms what is known as a tilable band, or simply a band.
4. Ensures at least one parallel dimension within the band. Parallel dimensions are placed outermost.
5. Three scheduling strategies: minimal fusion; maximal fusion; maximize band depth.
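The distance-vector criteria can be sketched directly (my illustration; PPCG works on the polyhedral representation via isl rather than on explicit vector lists):

```python
# Classify schedule dimensions from dependence distance vectors:
# a dimension is parallel if every distance vector is 0 in that position;
# a prefix of dimensions forms a tilable band while all distances stay
# non-negative in those positions.
distances = [(0, 0, 1), (0, 1, 0)]  # from the example SCoP

def parallel_dims(distances):
    n = len(distances[0])
    return [d for d in range(n) if all(v[d] == 0 for v in distances)]

def tilable_band_depth(distances):
    n = len(distances[0])
    depth = 0
    while depth < n and all(v[depth] >= 0 for v in distances):
        depth += 1
    return depth

print(parallel_dims(distances), tilable_band_depth(distances))  # [0] 3
```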
Tiling and Mapping to GPU
1. Tiling splits each loop dimension into a pair of loops:
   - a tile loop that iterates over the different tiles;
   - a point loop that iterates inside a tile.
2. Dimensions before the outermost band are executed on the CPU.
3. Tile the outermost band:
   - up to two parallel tile dimensions are mapped to thread blocks;
   - up to three parallel point dimensions are mapped to threads.
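The tile/point split can be sketched for a 2-D band (the tile sizes Ti and Tj are hypothetical parameters; in PPCG the tile loops would become the thread-block grid and the point loops the threads):

```python
# Split a 2-D iteration space into tile loops (over tiles) and point loops
# (inside each tile), and record the visiting order.
def tiled_order(N, M, Ti, Tj):
    order = []
    for ti in range(0, N, Ti):                    # tile loop -> thread blocks
        for tj in range(0, M, Tj):                # tile loop -> thread blocks
            for i in range(ti, min(ti + Ti, N)):  # point loop -> threads
                for j in range(tj, min(tj + Tj, M)):
                    order.append((i, j))
    return order

order = tiled_order(4, 4, 2, 2)
print(order[:4])  # first tile: [(0, 0), (0, 1), (1, 0), (1, 1)]
```

Tiling only reorders the iterations: sorted(order) equals the full Cartesian product of the original loop bounds.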
Memory allocation
1. Transfers to/from the GPU, done at the beginning/end of the SCoP:
   - arrays accessed inside the SCoP are copied to the GPU;
   - arrays updated inside the SCoP are copied back from the GPU;
   - read-only scalars are passed to kernels as parameters;
   - scalars updated inside the SCoP are treated as zero-dimensional arrays.
2. Group array references to avoid inconsistencies:
for (i = 0; i < N; i++)
    C[i] = foo(i);
for (i = 0; i < N; i++)
    a += C[i];
3. Allocation to registers and shared memory:
   - If we are able to compute register tiles and there is any reuse, the data is placed in registers.
   - Otherwise, if we are able to compute shared memory tiles and there is any reuse, or the original accesses were not coalesced, the data is placed in shared memory.
   - Otherwise, the data is kept in global memory.
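The three-way placement rule reads as a small decision function (a sketch; the boolean inputs are hypothetical stand-ins for PPCG's actual tile, reuse, and coalescing analyses):

```python
# Placement rule from the slide as a decision function. The flags stand in
# for the register-tile / shared-tile / reuse / coalescing analyses.
def placement(register_tile_ok, shared_tile_ok, has_reuse, coalesced):
    if register_tile_ok and has_reuse:
        return "registers"
    if shared_tile_ok and (has_reuse or not coalesced):
        return "shared memory"
    return "global memory"

print(placement(True, True, True, True))     # registers
print(placement(False, True, False, False))  # shared memory (fixes coalescing)
print(placement(False, False, True, True))   # global memory
```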
Main steps:
- Sequential data reuse analysis.
- Parallel data reuse analysis.
- Restriction solver.

Data reuse analysis steps:
- Prepare the search space (tiling plus loop interchange).
- Analyze data reuse (and build restrictions).
- Prune the search space.

Restriction solver steps:
- Sort the search space (penalty for non-coalesced schedules).
- Solve an optimization problem.
Copy candidates construction
Uses the polyhedral representation to compute the pieces of data in which data reuse exists.
[Figure: gemm schedule [Ti, Tj, Tk, Pi, Pj, Pk]; copy candidates for array S, array R, and matrix A.]
Platform
Host:
- 2× Intel Xeon E5530 @ 2.4 GHz.
- 4 cores per chip + HyperThreading ⇒ 16 virtual cores.
- GCC 4.6.

GPU: Tesla M2070
- 14 multiprocessors @ 1.15 GHz, 32 cores per multiprocessor.
- 32768 registers per multiprocessor.
- 64 KB L1 per multiprocessor.
- CUDA 4.0.

Benchmarks: PolyBench-3.1 (http://artecs.dacya.ucm.es/?q=node/861).
PPCG scheduling strategies
[Bar chart: speedup (log scale, 0.1-1000) for correlation, covariance, 2mm, 3mm, atax, bicg, cholesky, doitgen, durbin, dynprog, gemm, gemver, gesummv, gramschmidt, lu, ludcmp, mvt, symm, syr2k, syrk, transpose, trisolv, trmm, floyd-warshall, reg_detect, adi, fdtd-2d, fdtd-apml, jacobi-1d-imper, jacobi-2d-imper, seidel-2d, and the geometric mean; series: min fusion, max fusion, max band.]
baseline: Sequential execution on CPU.
State-of-art comparison 1/2
[Bar chart: speedup (log scale, 0.1-1000) for correlation, covariance, 2mm, 3mm, bicg, doitgen, gemm, gemver, gesummv, gramschmidt, lu, mvt, symm, syr2k, syrk, transpose, adi, fdtd-2d, jacobi-1d-imper, jacobi-2d-imper, and the geometric mean; series: PPCG custom, PPCG default, Pluto OpenMP, Par4All.]
baseline: Sequential execution on CPU.
State-of-art comparison 2/2
[Bar chart: speedup (log scale, 0.01-1000) for gemm, bicg, mvt, adi, jacobi-2d-imper; series: ppcg, pluto+openmpc, manual+openmpc, PGI OpenACC man., PGI OpenACC auto.]
baseline: Sequential execution on CPU.
State-of-art comparison: gemm
[Line chart: GFlops/s (log scale, 0.01-1000) vs. problem size (mini, small, standard, large, extra-large); series: CUBLAS, PPCG fixed sizes, PPCG parametric sizes, Par4all, C to CUDA, Pluto (openmp+icc), ATLAS, seq (icc).]
- Compute Memory Footprint: MemoryFootprint = TA + 1 + TB
- Compute Total Access Cost: TotalAccessCost = TA*(M/TB)*(TA/N) + ...
- Two data reuse types: inter-thread data reuse; intra-thread data reuse.
- Compute Memory Footprint: coalescing; footprint expansion.
- Compute Total Access Cost: registers are faster than shared memory.