Adaptive Mapping and Parameter Selection Scheme to Improve Automatic Code Generation for GPUs

Carlos Juega (1), José Ignacio Gómez (1), Christian Tenllado (1), Francky Catthoor (2)

(1) ArTeCS, Universidad Complutense de Madrid
(2) IMEC

CGO 2014, February 19, 2014


Outline

1. Motivation
2. PPCG
3. Adaptive Mapping
4. Results
5. Conclusions


Motivation

Some facts:
- GPGPU programming is still a hard task.
- The parallelism vs. locality trade-off rules the mapping.
- The optimal implementation usually changes across different GPU generations.

Example: gemm
  naive                                      40 ms
  + parallelism, shared memory               12.15 ms
  + parallelism, shared memory + registers    9.36 ms

There is always hope!
- Many source-to-source compilers for GPUs.
- Other approaches (OpenACC, CUDA-CHiLL, ...).
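The gemm numbers above come from progressively exploiting locality. As a rough illustration of why blocking for fast on-chip memory helps (a conceptual Python sketch, not the authors' CUDA code), a tiled matrix multiply reuses each loaded tile instead of streaming operands repeatedly:

```python
def matmul_blocked(A, B, T=2):
    """Tiled matrix multiply: once a T x T tile of A and B is 'loaded',
    it is reused for T iterations, mimicking shared-memory blocking."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, T):
        for j0 in range(0, n, T):
            for k0 in range(0, n, T):
                # one tile-by-tile product accumulated into a tile of C
                for i in range(i0, min(i0 + T, n)):
                    for j in range(j0, min(j0 + T, n)):
                        s = 0.0
                        for k in range(k0, min(k0 + T, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] += s
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
assert matmul_blocked(A, B, T=1) == [[19.0, 22.0], [43.0, 50.0]]
```

The loop structure changes, but the result is the same as the naive triple loop; on a GPU the reused tiles would live in shared memory or registers.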


Polyhedral Parallel Code Generator (PPCG)

Source-to-source compiler based on the polyhedral model.

Contributions:
1. Exposes parallelism.
2. Maximizes temporal locality.
3. Exploits shared memory and registers.

Pipeline: C code with pragmas -> polyhedral extraction -> dependence analysis -> scheduling -> tiling and GPU mapping -> memory allocation -> code generator -> CUDA code.


PPCG limitations

1. Static mapping process (it is always the same, no matter the input).
2. It still needs user assistance (parametrization).
3. It is not aware of platform details.


Proposal

Basics:
- Explore minor changes in PPCG's schedule:
  - loop interchange.
  - serialization of parallel loops.
- Use parametric tiling or ranged values (just for pruning).
- Take platform specs as problem variables.

Heuristics:
- Memory footprint.
- Access cost.

Contributions:
- Adapts PPCG schedules to the input's data reuse requirements.
- Computes common GPU parameters based on data reuse.

Adaptive mapping pipeline

PPCG's tiling/GPU mapping and memory allocation stages are replaced by three new stages between scheduling and code generation:
- Sequential data reuse analysis.
- Parallel data reuse analysis.
- Restriction solver.

Internally the flow is: search space construction (reuse analysis + pruning) -> search space expansion (reuse analysis + pruning) -> sorting -> optimization solver.

Example: for a band [a, b], construction generates candidate tiled schedules such as [Ta, Tb, Pa, Pb], [Ta, Tb, Pb, Pa], and [Ta, Pa, Tb, Pb], where Tx and Px are the tile and point loops of dimension x.

[Figure: per-tile footprints of arrays R and S, expressed in terms of the tile sizes TA and TB.]

[Chart: total access cost vs. memory footprint for the candidates [Ta, Pa, Tb, Pb], [Ta, Tb, Pa, Pb], and [Ta, Tb, Pb, Pa].]

Expansion then splits point loops further, e.g. [Ta, Tb, Pb, Pa] into [Ta, Tb, Pb, PTa, PPa], [Ta, Tb, Pb, PPa, PTa], and [Ta, Tb, PPa, Pb, PTa].
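The construction step above can be sketched as a small enumerator (a hypothetical Python illustration of the idea, not the tool's actual code). The ordering constraints are assumed from the slide's examples: each point loop Px stays after its tile loop Tx, and tile loops keep their original relative order.

```python
from itertools import permutations

def candidate_schedules(dims):
    """Enumerate interleavings of tile (T) and point (P) loops for the
    given dimensions, keeping Tx before Px and tile loops in order."""
    loops = [f"T{d}" for d in dims] + [f"P{d}" for d in dims]
    tiles = [f"T{d}" for d in dims]
    out = []
    for perm in permutations(loops):
        ok = all(perm.index(f"T{d}") < perm.index(f"P{d}") for d in dims)
        ok = ok and [l for l in perm if l in tiles] == tiles
        if ok:
            out.append(list(perm))
    return out

# For the band [a, b] this reproduces the three candidates on the slide.
cands = candidate_schedules(["a", "b"])
assert ["Ta", "Tb", "Pa", "Pb"] in cands
assert ["Ta", "Tb", "Pb", "Pa"] in cands
assert ["Ta", "Pa", "Tb", "Pb"] in cands
assert len(cands) == 3
```

Each candidate would then be scored by the memory footprint and access cost heuristics before pruning.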

Restriction solver

Goal function: maximize parallelism.

Restriction types:
- Architecture constraints.
- Problem size constraints (when known at compile time).
- Platform reuse constraint.
- Previously built restrictions:
  - tiling constraints.
  - optimization constraints.

Example restriction system:

  max: active_blocks*TTA

  active_blocks <= 8
  active_blocks >= 1
  TTA <= 1024
  TTA >= 1
  active_blocks*TTA <= 1536
  4000 / TA >= 14
  4000 % TA = 0
  4000 % TB = 0
  TA >= 36
  TA % TTA = 0
  TB % TTA = 0
  TTA % 32 = 0
  active_blocks*(TA) <= 32768
  active_blocks*(TA+TB) <= 12288

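Because all variables in the example restriction system range over small integer domains, it can be solved exactly by exhaustive search. The following Python sketch is our illustration of that (the tool itself would use a proper optimization solver):

```python
def solve():
    """Brute-force the slide's example: maximize active_blocks * TTA
    subject to the listed architecture, problem-size, and tiling
    restrictions (problem size N = 4000)."""
    N = 4000
    divisors = [d for d in range(1, N + 1) if N % d == 0]
    best = None
    for TA in divisors:                      # 4000 % TA == 0
        if TA < 36 or N // TA < 14:
            continue
        for TTA in range(32, 1025, 32):      # TTA % 32 == 0, TTA <= 1024
            if TA % TTA:
                continue
            for TB in divisors:              # 4000 % TB == 0
                if TB % TTA:
                    continue
                for ab in range(1, 9):       # 1 <= active_blocks <= 8
                    if (ab * TTA <= 1536 and ab * TA <= 32768
                            and ab * (TA + TB) <= 12288):
                        cand = (ab * TTA, ab, TTA, TA, TB)
                        if best is None or cand > best:
                            best = cand
    return best

obj, active_blocks, TTA, TA, TB = solve()
# Optimum of this system: objective 1280 with active_blocks=8, TTA=160, TA=160.
```

Only TA = 160 satisfies both the divisibility and the TA % TTA = 0 / TTA % 32 = 0 constraints here, which is why the optimum is so constrained.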

Environment

Tested GPUs:
- Tesla K20 (Kepler architecture).
- Tesla M2070 (Fermi architecture).
- Tesla C1060 (Tesla architecture).

Benchmarks (compiled with CUDA 5.0):
PolyBench-gpu-1.0 (http://www.cse.ohio-state.edu/~pouchet/software/polybench/GPU/index.html)

Results on the K20

[Chart: GFlops/s (log scale, 1 to 1000) on covariance, atax, bicg, gemm, gesummv, mvt, 2dconv, 3dconv, syrk, syr2k, and fdtd-2d, comparing manual CUDA, ppcg default, ppcg hand-tuned, and ppcg+methodology.]

Results on the M2070

[Chart: GFlops/s (log scale, 1 to 1000) on the same benchmarks, comparing manual CUDA, ppcg default, ppcg hand-tuned, and ppcg+methodology.]

Results on the C1060

[Chart: GFlops/s (log scale, 1 to 1000) on the same benchmarks, comparing manual CUDA, ppcg default, ppcg hand-tuned, and ppcg+methodology.]

State-of-the-art comparison

[Chart: GFlops/s (log scale, 0.1 to 1000) on the same benchmarks, comparing manual CUDA, OpenACC, Par4All, and ppcg+methodology.]

Conclusions and future work

Conclusions:
- A data-reuse-driven system to improve performance.
- Turns PPCG into a platform-aware compiler.
- Performs very well compared with the previous PPCG.
- Speedups of 2.96x, 3.23x, and 2.95x over out-of-the-box PPCG.

Future work:
- Explore the effect of other loop transformations.
- Add even more restrictions to the solver.
- Validate our model with bigger benchmarks.

Thanks for your attention!

Any questions?

Polyhedral Model

#pragma scop
for (i = 0; i <= 4; i++) {
S0:   A[i] = 0;
      for (j = i; j <= 4; j++)
S1:     A[i] += B[i][j];
}
#pragma endscop

Iteration domain: contains the dynamic instances of the statements.

{ S0(i) | 0 ≤ i < 5 } ∪ { S1(i, j) | 0 ≤ i < 5 ∧ i ≤ j < 5 }

[Figure: the (i, j) grid of statement instances, bounded by i ≥ 0, i ≤ 4, j ≥ i, j ≤ 4.]
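The iteration domain above can be enumerated directly (a Python sketch of the two sets, using plain loops rather than a polyhedral library such as isl):

```python
# Instances of S0 and S1 for the SCoP above:
#   S0(i)    for 0 <= i < 5
#   S1(i, j) for 0 <= i < 5 and i <= j < 5
S0 = [("S0", i) for i in range(5)]
S1 = [("S1", i, j) for i in range(5) for j in range(i, 5)]

assert len(S0) == 5
# Triangular domain: 5 + 4 + 3 + 2 + 1 instances of S1.
assert len(S1) == 15
```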

Access relations: map statement instances to the array elements they access (same SCoP as above).

writes: { S0(i) → A(i) } ∪ { S1(i, j) → A(i) }
reads:  { S1(i, j) → B(i, j) }

Schedule: specifies the order in which the statement instances are executed.

{ S0(i) → (i, 0, 0) } ∪ { S1(i, j) → (i, j, 1) }

[Figure: lexicographic execution order over the (i, j) grid.]
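Sorting the instances by their schedule vectors (lexicographically) reproduces the original sequential order. A Python sketch:

```python
def schedule(inst):
    """Affine schedule from the slide: S0(i) -> (i, 0, 0), S1(i, j) -> (i, j, 1)."""
    if inst[0] == "S0":
        return (inst[1], 0, 0)
    _, i, j = inst
    return (i, j, 1)

domain = [("S0", i) for i in range(5)] + \
         [("S1", i, j) for i in range(5) for j in range(i, 5)]
order = sorted(domain, key=schedule)

# Each A[i] is initialized (S0) before any accumulation into it (S1).
assert order[0] == ("S0", 0)
assert order[1] == ("S1", 0, 0)
```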

Dependence analysis (same SCoP as above)

Dependence relation: determines which statement instances depend on which other statement instances.

{ S0(i) → S1(i, 0) } ∪ { S1(i, j) → S1(i, j + 1) }

Dependence distances: used to know whether schedule dimensions are parallel and/or tileable.

{ (0, 0, 1), (0, 1, 0) }

[Figure: dependence arrows over the (i, j) iteration grid.]
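The distances follow from subtracting the schedule vectors of dependent instances (a Python sketch, reusing the schedules from the previous slide):

```python
def sched(inst):
    # S0(i) -> (i, 0, 0); S1(i, j) -> (i, j, 1)
    return (inst[1], 0, 0) if inst[0] == "S0" else (inst[1], inst[2], 1)

def distance(src, dst):
    """Dependence distance = schedule(dst) - schedule(src), componentwise."""
    return tuple(d - s for s, d in zip(sched(src), sched(dst)))

# S0(i) -> S1(i, 0): the write of A[i] before its first accumulation.
assert distance(("S0", 0), ("S1", 0, 0)) == (0, 0, 1)
# S1(i, j) -> S1(i, j + 1): consecutive accumulations into A[i].
assert distance(("S1", 1, 2), ("S1", 1, 3)) == (0, 1, 0)
```

Both distances are zero in the first (i) dimension, which is why that dimension is parallel.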

Scheduling

1. Based on the Pluto algorithm:
   - exposes parallelism.
   - exposes temporal locality.
2. Uses isl to build a new affine schedule S = (S_i), 1 <= i <= d.
3. The affine functions S_i are constructed one by one in such a way that the corresponding dependence distances are non-negative.
   - The sequence of S_i constructed in this way forms what is known as a tilable band, or simply a band.
4. Ensures at least one parallel dimension within the band. Parallel dimensions are placed outermost.
5. Three scheduling strategies:
   - minimal fusion.
   - maximal fusion.
   - maximize band depth.

Tiling and Mapping to GPU

1. Tiling splits each loop dimension into a pair of loops:
   - a tile loop that iterates over the different tiles.
   - a point loop that iterates inside a tile.
2. Dimensions before the outermost band are executed on the CPU.
3. The outermost band is tiled:
   - up to two parallel tile dimensions are mapped to thread blocks.
   - up to three parallel point dimensions are mapped to threads.
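Step 1 above, splitting a loop into a tile loop and a point loop, can be sketched as follows (a Python illustration; the tile size T is a free parameter):

```python
def original(n):
    return list(range(n))

def tiled(n, T):
    """Split `for i in range(n)` into a tile loop over tile origins and a
    point loop inside each tile: same iterations, new loop structure."""
    out = []
    for ti in range(0, n, T):                 # tile loop: iterates over tiles
        for pi in range(ti, min(ti + T, n)):  # point loop: inside one tile
            out.append(pi)
    return out

assert tiled(10, 4) == original(10)
```

On the GPU, the tile loop indices become block coordinates and the point loop indices become thread coordinates.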

Memory allocation

1. Transfers to/from the GPU are done at the beginning/end of the SCoP:
   - arrays accessed inside the SCoP are copied to the GPU.
   - arrays updated inside the SCoP are copied back from the GPU.
   - read-only scalars are passed to kernels as parameters.
   - scalars updated inside the SCoP are treated as zero-dimensional arrays.

Memory allocation

2. Array references are grouped to avoid inconsistencies, e.g. when one loop writes an array that another loop then reads:

for (i = 0; i < N; i++)
    C[i] = foo(i);
for (i = 0; i < N; i++)
    a += C[i];

Memory allocation

3. Allocation to registers and shared memory:
   - If register tiles can be computed and there is any reuse, the data is placed in registers.
   - Otherwise, if shared memory tiles can be computed and there is any reuse, or the original accesses were not coalesced, the data is placed in shared memory.
   - Otherwise, the data is kept in global memory.
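That decision chain can be written down directly (a Python sketch of the rule as stated, with boolean inputs standing in for the actual tile computability and reuse analyses):

```python
def placement(register_tile, shared_tile, reuse, coalesced):
    """PPCG's allocation rule from the slide, as a decision chain."""
    if register_tile and reuse:
        return "registers"
    if shared_tile and (reuse or not coalesced):
        return "shared"
    return "global"

assert placement(True, True, reuse=True, coalesced=True) == "registers"
assert placement(False, True, reuse=False, coalesced=False) == "shared"
assert placement(False, False, reuse=True, coalesced=True) == "global"
```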

Main steps:
- Sequential data reuse analysis.
- Parallel data reuse analysis.
- Restriction solver.

Data reuse analysis steps:
- Prepare the search space (tiling plus loop interchange).
- Analyze data reuse (and build restrictions).
- Prune the search space.

Restriction solver steps:
- Sort the search space (penalty for non-coalesced schedules).
- Solve an optimization problem.

Copy candidates construction

Uses the polyhedral representation to compute the pieces of data in which data reuse exists.

Example: gemm with schedule [Ti, Tj, Tk, Pi, Pj, Pk]; copy candidates are computed for array S, array R, and matrix A.


Platform

Host
- 2x Intel Xeon E5530 @ 2.4 GHz.
- 4 cores per chip + HyperThreading ⇒ 16 virtual cores.
- GCC 4.6.

GPU: Tesla M2070
- 14 multiprocessors @ 1.15 GHz, 32 cores per multiprocessor.
- 32768 registers per multiprocessor.
- 64 KB L1 per multiprocessor.
- CUDA 4.0.

Benchmarks
PolyBench-3.1 (http://artecs.dacya.ucm.es/?q=node/861).


PPCG scheduling strategies

[Figure: per-benchmark speedup (log scale, 0.1 to 1000x) for three PPCG scheduling strategies (min fusion, max fusion, max band) across the PolyBench suite, plus the geometric mean.]

baseline: Sequential execution on CPU.


State-of-art comparison 1/2

[Figure: per-benchmark speedup (log scale, 0.1 to 1000x) for PPCG custom, PPCG default, Pluto OpenMP and Par4All across PolyBench, plus the geometric mean.]

baseline: Sequential execution on CPU.


State-of-art comparison 2/2

[Figure: speedup (log scale, 0.01 to 1000x) on gemm, bicg, mvt, adi and jacobi-2d-imper for ppcg, pluto+openmpc, manual+openmpc, PGI OpenACC manual and PGI OpenACC auto.]

baseline: Sequential execution on CPU.


State-of-art comparison: gemm

[Figure: gemm performance in GFlops/s (log scale, 0.01 to 1000) across problem sizes (mini, small, standard, large, extra-large) for CUBLAS, PPCG fixed sizes, PPCG parametric sizes, Par4all, C to CUDA, Pluto (openmp+icc), ATLAS and sequential (icc).]


- Compute memory footprint:
  MemoryFootprint = TA + 1 + TB
- Compute total access cost:
  TotalAccessCost = TA ∗ (M/TB) ∗ (TA/N) + ...


- Two data reuse types:
  - inter-thread data reuse.
  - intra-thread data reuse.
- Compute memory footprint:
  - coalescing.
  - footprint expansion.
- Compute total access cost:
  - registers are faster than shared memory.
