decoupled access/execute metaprogramming for gpu-accelerated...

Decoupled Access/Execute Metaprogrammingfor GPU-Accelerated Systems

Anton Lokhmotov1 Lee Howes1

Alastair F. Donaldson2,3 Paul H.J. Kelly1

1

2

3

Urbana-Champaign, 28 July 2009

Software engineering challengeMotivating examples

Æcute model

Recent meeting on accelerated computing at Imperial(35–40 attendees summarised by their affiliation)

Computing (software optimisation, cognitive robotics,visual information processing, reconfigurable computing)Electrical Engineering (reconfigurable computing, designautomation)Mechanical Engineering (multiscale flow dynamics,vibration technology)Earth Science and Engineering (applied modelling &computation)Physics (plasma, experimental solid state)Chemistry (computational, biological & biophysical)Biomedical EngineeringChemical EngineeringCivil EngineeringAeronautics

A. Lokhmotov, L. Howes, A.F. Donaldson, P.H.J. Kelly Decoupled Access/Execute Metaprogramming


Æcute model

Why accelerator programming is challenging?

Accelerator hardware

hundreds of functional unitssoftware-managed memory hierarchies, e.g.

host memory (main memory)device global memory (on-board)device local memory (on-chip)

Accelerator software

low-level, hence unproductive

architecture-specific, hence nonportable



Æcute model

The fundamental software engineering challenge

How to use accelerator technology but keep

maintainability,

composability,

reusability,

portability?



Æcute model

Vertical mean image filterHorizontal mean image filterTowards metaprogramming

Motivating example A: vertical mean image filter

Ox ,y =1D

D−1∑

k=0

Ix ,y+k

I is a W × H grey-scale input image;

O is a W × (H − D) grey-scale output image;

D is the diameter of the filter (D ≪ H);

0 ≤ x < W , 0 ≤ y < H − D.



Æcute model


Memory access of iteration (x , y) [D = 2]

Ox ,y =1D

D−1∑

k=0

Ix ,y+k =12(Ix ,y + Ix ,y+1)



Æcute model


Scalable algorithm for vertical mean image filter

Ox ,y =

{

1D

∑D−1k=0 Ix ,y+k , for y = y0;

Ox ,y−1 + 1D

(

Ix ,y+D−1 − Ix ,y−1)

, for 1 ≤ y − y0 < T .

1 for(int x = 0; x < W; ++x) { // for each column2 for(int y0 = 0; y0 < H-D; y0 += T) { // for each strip of rows3 // first phase: convolution4 float sum = 0.0f;5 for(int k = 0; k < D; ++k)6 sum += I[(y0+k)*W + x];7 O[y0*W + x] = sum / (float)D;89 // second phase: rolling sum

10 for(int dy = 1; dy < min(T,H-D-y0); ++dy) {11 int y = y0 + dy;12 sum -= I[(y-1)*W + x];13 sum += I[(y-1+D)*W + x];14 O[y*W + x] = sum / (float)D;15 } } }



Æcute model


Complexity and parallelism

Ox ,y =

{

1D

∑D−1k=0 Ix ,y+k , for y = y0;

Ox ,y−1 + 1D

(

Ix ,y+D−1 − Ix ,y−1)

, for 1 ≤ y − y0 < T .

Writes to O: N = W × (H − D)

Reads from I and arithmetic operations: Θ(N + ND/T )

Θ(ND) when T ≪ D;Θ(N) when T ≫ D.

Thread parallelism: ⌈N/T ⌉



Æcute model


Two-dimensional grid mapping

WPBX

WPBY

H−D

W

1 2 3

4 5 6

7 8 9

10 11 12

each WPBX×WPBY box is assigned to a thread blockwork/block = work/thread × threads/block, e.g.

WPBX = 1 pixel × 128 threads/block = 128 pixels/blockWPBY = T pixels × 1 thread/block = T pixels/block

red regions represent inefficiency (early terminated threads)



Æcute model


One-dimensional grid mapping

WPBY

H−D

WPBX

W

1 2 3

4 5

6 7

8 9 10

85

3

Round to SIMD size

alleviating inefficiency by wrapping around the right edge

wasting a small number of threads to keep SIMD alignment



Æcute model


VMIF on NVIDIA GTX 280, 5120 × 3200 image2D grid; 1D grid.

4000

4500

5000

5500

6000

6500

7000

7500

8000

8500

9000

9500

10000

0 100 200 300 400 500 600 700 800

Thr

ough

put (

Mpi

xel/s

)

Output pixels per thread (T)

vmean: W=5120, H=3200, D=40, TPBX=128, TPBY=1

2D1D



Æcute model


VMIF on NVIDIA GTX 280, 5121 × 3200 image2D grid. Data padded to multiples of 16, 32, 64, and 128 ps.

4000

4500

5000

5500

6000

6500

7000

7500

8000

8500

9000

9500

10000

0 100 200 300 400 500 600 700 800

Thr

ough

put (

Mpi

xel/s

)


vmean 2D: W=5121, H=3200, D=40, TPBX=128, TPBY=1

2D, padded to 51362D, padded to 51522D, padded to 51842D, padded to 5248



Æcute model


VMIF on NVIDIA GTX 280, 5121 × 3200 imageData padded to a multiple of 64 ps. 1D grid wrapped on multiples of 16, 32 and 64 ps.

4000

4500

5000

5500

6000

6500

7000

7500

8000

8500

9000

9500

10000

0 100 200 300 400 500 600 700 800

Thr

ough

put (

Mpi

xel/s

)


vmean 1D: W=5121 (padded to 5184), H=3200, D=40, TPBX=128, TPBY=1

1D wrapped on 51211D wrapped on 51361D wrapped on 51521D wrapped on 5184



Æcute model


VMIF on NVIDIA GTX 280, 5121 × 3200 imageData padded to a multiple of 64 ps. 2D grid; 1D grid wrapped on a multiple of 32 ps.

4000

4500

5000

5500

6000

6500

7000

7500

8000

8500

9000

9500

10000

0 100 200 300 400 500 600 700 800

Thr

ough

put (

Mpi

xel/s

)


vmean: W=5121 (padded to 5184), H=3200, D=40, TPBX=128, TPBY=1

2D1D wrapped on 5152



Æcute model


Motivating example B: horizontal mean image filter

Ox ,y =1D

D−1∑

k=0

Ix+k ,y

I is a W × H grey-scale input image;

O is a (W − D) × H grey-scale output image;

D is the diameter of the filter (D ≪ W );

0 ≤ x < W − D, 0 ≤ y < H.



Æcute model


Bad news and good ideas

Bad news

Threads in a warp access consecutive rows:non-coalesced memory access results in reducedbandwidth

Good ideas

use shared memory to repair inefficient memory accessuse optimised vertical mean filter implementation:

hmean := transpose | vmean | transpose



Æcute model


hmean := transpose | vmean | transpose

H

W H

W

H

W−D

W−D

H

Forward

transpose

Vertical

mean filter

Backward

transpose

Forward transpose: 5120 × 3200 → 3200 × 5120Output pitch 3200: 60.2 GiB/s

Output pitch 3264 = 3200+64: 81.1 GiB/s

Backward transpose: 3200 × 5120 → 5120 × 3200Output pitch 5120: 19.1 GiB/s





Æcute model


HMIF on NVIDIA GTX 280, 5120 × 3200 image

0 200 400 600 800

1000 1200 1400 1600 1800 2000 2200 2400 2600 2800 3000 3200 3400 3600 3800 4000

0 100 200 300 400 500 600 700 800

Thr

ough

put (

Mpi

xel/s

)


hmean: W=5120, H=3200, D=40

VanillaTranspose

Shared memory, 32 threadsShared memory, 64 threads



Æcute model


Towards metaprogramming

Separate algorithm representation from mapping andtuningGenerate efficient device-specific code, in particular:

synthesise data movement for software-managed memoryhierarchyhandle device-specific alignment, padding, etc.



Æcute model

Access/execute metadataCurrent & future work

Decoupled Access/Execute (Æcute) model

Decoupled Access/Execute metaprogramming

kernel code written for uniform memory

execute metadata describe execution constraints

access metadata describe memory access pattern

part of the kernel’s interface specification

Goals

robust translation into efficient low-level code

ample opportunities for optimisation

convenience and flexibility



Æcute model


Execute metadata

Execute metadata for a kernel is a tuple E = (I, R, P), where:

I ⊂ Zn is a finite, n-dimensional iteration space, for some

n > 0;

R ⊆ I × I, is a precedence relation such that (i1, i2) ∈ R iffiteration i1 must be executed before iteration i2.

P is a partition of I into a set of non-empty, disjoint iterationsubspaces: P =

{

Ik : I =⋃

Ik ; Ii⋂

Ij = ∅, i 6= j}



Æcute model


Access metadata

Access metadata for a kernel is a tuple A = (Mr , Mw ), where:

Mr : I → P(M) specifies the set of memory locations Mr (i)that may be read on iteration i ∈ I;

Mw : I → P(M) specifies the set of memory locationsMw (i) that may be written on iteration i ∈ I.



Æcute model


Example: 2D convolution

Oy ,x =K

∑

u=−K

K∑

v=−K

Cu,v · Iy+u,x+v

I: input image

O: output image

C: coefficients

W : image width

H: image height

K : neighbourhood radius

K ≤ y < H − K ; K ≤ x < W − K



Æcute model


Memory access of a single (y , x) iteration (K = 1)

Oy ,x =K

∑

u=−K

K∑

v=−K

Cu,v · Iy+u,x+v

x

y

3

Region of I3

3

3

All of C

Iteration (y,x) Region of O

1

1



Æcute model


Æcute specification (h × w rectangular tiling)

Oy ,x =K

∑

u=−K

K∑

v=−K

Cu,v · Iy+u,x+v

Execute metadata (I, R, P):

I ={

(y , x) : K ≤ y < H − K , K ≤ x < W − K}

R = ∅

P ={

{(y , x) ∈ I : h(j − 1) ≤ y − K < hj , w(i − 1) ≤x − K < wi} : 1 ≤ j < (H − 2K )/h, 1 ≤ i < (W − 2K )/w

}

Access metadata (Mr , Mw ):

Mr ={

Iy+u,x+v , Cu,v : (y , x) ∈ I,−K ≤ u, v ≤ K}

Mw ={

Oy ,x : (y , x) ∈ I}



Æcute model


Ii ,j , Oi ,j : 0 ≤ i < H, 0 ≤ j < W ; Ci ,j : −K ≤ i , j ≤ K

C++rgb I[W][H];rgb O[W][H];rgb C[2*K+1][2*K+1];

Æcute (data wrappers)Array2D<rgb> arrayI(&I[0][0], W, H);Array2D<rgb> arrayO(&O[0][0], W, H);Array2D<rgb> arrayC(&C[0][0], 2*K+1, 2*K+1);



Æcute model


Iteration space: K ≤ y < H − K , K ≤ x < W − K

C++for(y = K; y < H-K; ++y)for(x = K; x < W-K; ++x)

// Kernel code for each (y,x)

Æcute (execute metadata)IterationSpace1D y(K,H-K);IterationSpace1D x(K,W-K);IterationSpace2D iterYX(y,x);



Æcute model


Access regions: implicit in C++, explicit in Æcute

C++// Kernel code for each (y,x)rgb sum(0.0f, 0.0f, 0.0f);for(u = -K; u <= K; ++u)for(v = -K; v <= K; ++v)

sum += C[K+u][K+v] * I[y+u][x+v]; // read from C and IO[y][x] = sum; // write to O

Æcute (access metadata)// Access descriptorsNeighbourhood2D_R accessI(iterYX, arrayI, K);Point2D_W accessO(iterYX, arrayO);All_R accessC(iterYX, arrayC);



Æcute model


Kernel code

C++// Kernel code for each (y,x)int u, v;rgb sum(0.0f, 0.0f, 0.0f);for(u = -K; u <= K; ++u)for(v = -K; v <= K; ++v)

sum += C[K+u][K+v] * I[y+u][x+v];O[y][x] = sum;

Æcute (kernel method)void kernel(const IterationSpace2D::iterator &it){int u, v;rgb sum(0.0f, 0.0f, 0.0f);for(u = -K; u <= K; ++u)

for(v = -K; v <= K; ++v)sum += accessC(u, v) * accessI(it, u, v);

accessO(it) = sum;}



Æcute model


Bringing all together

// Data wrappersArray2D<rgb> arrayI(&I, W, H);Array2D<rgb> arrayO(&O, W, H);Array2D<rgb> arrayC(&C, 2*K+1, 2*K+1);

// Execute metadataIterationSpace1D y(K,H-K);IterationSpace1D x(K,W-K);IterationSpace2D iterYX(y,x);

// Access metadataNeighbourhood2D_R accessI(iterYX, arrayI, K);Point2D_W accessO(iterYX, arrayO);All_R accessC(iterYX, arrayC);

// Filter initialisation and executionConvolutionFilter2D conv(iterYX, accessI, accessO, accessC);iterYX.tile(h, w);conv.run();



Æcute model


Æcute metadata benefits

data movement synthesis and optimisation (e.g. softwarepipelining and exploiting data reuse)

machine-independent abstraction, machine-dependenttuning (via partitioning)

potential for inter-kernels optimisations (e.g. loop fusionand array contraction)



Æcute model


Memory access of an iteration subspace

Let Ik ⊂ I be an iteration subspace

Mr (Ik ) =⋃

i∈IkMr (i)

Mw (Ik ) =⋃

i∈IkMw (i)

x

y

4

Region of I4

3

3

All of C

One block Region of O

2

2



Æcute model


Memory access of two iteration subspaces

Let Ik and In be two iteration subspaces

Mr (Ik )T

Mr (In) determines reuse

Mw (Ik )T

Mr (In) determines true dependence

Mr (Ik )T

Mw (In) determines anti dependence

Mw (Ik )T

Mw (In) determines output dependence

x

y

Two (overlapping) regions of I

All of C

Two blocks Two (disjoint) regions of O



Æcute model


Inter-kernel optimisation (fusion and contraction)

x

y

Region of I

Kernel A

Region of T

1

1

x

y

Region of O

Kernel B

Region of T

1

1



Æcute model


Inter-kernel optimisation (fusion and contraction)

x

y

Region of I

Kernel A/B

Region of O

Improved locality, reduced communication

Tricky but within the reach of polyhedral code generation

Using metadata bypasses fragile inter-kernel analysis



Æcute model


Current & future work

Generating and optimising code for multiple targets(OpenCL)Integrating Æcute metadata into existing languages

Sieve C++ (EPSRC CASE Award Imperial/Codeplay)OpenCL (AMD)OpenMP?

Handling both regular and irregular computations

Extracting metadata both statically and dynamically

Using metadata for cross optimisation of parallelcomponents

Scaling to large applications



Æcute model


References

1 L. Howes, A. Lokhmotov, A.F. Donaldson, P.H.J. Kelly: Deriving datamovement from decoupled access/execute metadata. In Proceedings ofthe 4th International Conference on High-Performance and EmbeddedArchitectures and Compilers (HiPEAC). (2009)

2 J.L.T. Cornwall, L. Howes, P.H.J. Kelly, P. Parsonage, B. Nicoletti:High-performance SIMT code generation in an active visual effectslibrary. In Proceedings of the ACM International Conference onComputing Frontiers. (2009)

3 L. Howes, A. Lokhmotov, P.H.J. Kelly, A.J. Field: Optimising componentcomposition using indexed dependence metadata. In Proceedings ofthe 1st International Workshop on New Frontiers in High-performanceand Hardware-aware Computing. (2008)

4 J.L.T. Cornwall, P.H.J. Kelly, P. Parsonage, B. Nicoletti: Explicitdependence metadata in an active visual effects library. In Proceedingsof the 20th International Workshop on Languages and Compilers forParallel Computing (LCPC). (2007)



Æcute model


Questions?


Æcute Cell BE frameworkÆcute specifications

Extra material

OverviewRuntimeBenchmarks

Prototype framework for the Cell BE architecture

Sony/Toshiba/IBM Cell Broadband Engine

1 PowerPC Processing Element (PPE)

8 Synergistic Processing Elements (SPEs)

Main memory, 256 KiB local memory per SPE

DMA to copy data between main and local memories

Benchmark variants (kernel code is basically the same)

Æcute

Hand-written C

Software cache (IBM Cell SDK)

Experimental setup

3.2 GHz Cell (Sony PlayStation 3, 6 SPEs available)

IBM Cell SDK 2.1, Fedora Linux 7



Extra material


Prototype framework for the Cell BE architecturePPE and SPE runtimes



Extra material


Prototype framework for the Cell BE architecturePPE takes a block of the iteration space (blocking is configurable)



Extra material


Prototype framework for the Cell BE architecturePPE assigns the block of iterations to a ready SPE by sending a message



Extra material


Prototype framework for the Cell BE architectureSPE copies the block’s working set into buffers associated with access descriptors



Extra material


Prototype framework for the Cell BE architectureWhilst SPE processes the block, it receives another block for execution



Extra material


Prototype framework for the Cell BE architectureSPE copies the new block’s working set into another set of buffers



Extra material


Prototype framework for the Cell BE architectureSPE copies out the first block’s results and clears its buffers



Extra material


Matrix-vector multiply: ~y = A · ~x

2048x2048 2048x4096 4096x4096 4096x81920

1

2

3

4

5

6

7

8

9

10

Matrix size

Exe

cutio

n tim

e no

rmal

ised

to h

and−

writ

ten

code

Hand−writtenSoftware cacheAEcute 4x1024AEcute 4x512AEcute 4x256

Æcute spec

Bandwidth limited

Tiling increaseslocality: ~x isreused

Longer strips(e.g. 1024) aremore efficient

Æcute is2.3–3.6× slower,software cache is5–10× slower



Extra material


Closest-to-mean filter on square D × D images

D=256, N=15 D=256, N=63 D=1024, N=15 D=1024, N=630

0.5

1

1.5

2

2.5

3

3.5

Image dimension D, pixel neighbourhood diameter N

Exe

cutio

n tim

e no

rmal

ised

to h

and−

writ

ten

code

Hand−writtenSoftware cacheAEcute 20x20AEcute 5x40AEcute 40x5

5.7 5.1 9.6

Æcute spec

No tile size wasuniversally best

40 × 5 is best forD=256

20 × 20 is bestfor D=1024

5 × 40 is nearbest for bothD=256 and 1024

Æcute is 12–40%slower, softwarecache is2.2–9.6× slower



Extra material


Bit-reversed copy: y [σn(i)] = x [i ], for 0 ≤ i < N = 2n

0

0.5

1

1.5

2

2.5

3

3.5

14 16 18 20 22 24 0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

2.25

2.5

2.75

3

3.25

3.5

Ban

dwid

th (

GiB

/s)

Data size (index bits)

Software cacheHand-written

AEcute

Æcute spec

(σn(i) reverses bits of n-bit index,

e.g. σ4(01112) = 11102)

Tiled algorithm (Carter& Gatlin)

Vector kernel(Lokhmotov & Mycroft)

Non-affine accessspecification

For large problem sizes(n = 22–24), Æcute is1.6× slower, softwarecache is 20× slower



Extra material

Matrix-vector multiplyClosest-to-mean filterBit-reversed copy

Matrix-vector multiply


I = {(i , j) : 0 ≤ i < H, 0 ≤ j < W}

R = {((i , j), (i , k)) : 0 ≤ i < H, 0 ≤ j < k < W}

P ={

{(i , j) ∈ I : h(k − 1) ≤ i < hk , w(l − 1) ≤ j < wl} :1 ≤ k < H/h, 1 ≤ l < W/w

}


Mr (i , j) = {A[i][j], x [j]}

Mw (i , j) = {y [i]}

Benchmark results



Extra material


Closest-to-mean filter


I ={

(x , y) : K ≤ x < W − K , K ≤ y < H − K}

R = ∅

P ={

{(x , y) ∈ I : w(i − 1) ≤ x − K < wi , h(j − 1) ≤y − K < hj} : 1 ≤ i < (W − 2K )/w , 1 ≤ j < (H − 2K )/h

}


Mr ={

I[x + u][y + v ] : (x , y) ∈ I,−K ≤ u, v ≤ K}

Mw ={

O[x ][y ] : (x , y) ∈ I}

Benchmark results



Extra material


Bit-reversed copy


I = {t : 0 ≤ t < N/B2}

R = ∅

P ={

{t} : t ∈ I}


Mr (t) = {x [u.t .v ] : t ∈ I, 0 ≤ u, v < B}

Mw (t) = {y [u.σn(t).v ] : t ∈ I, 0 ≤ u, v < B}

Benchmark results



Extra material

Parallel scopesBit-reversal on GTX 280

Parallel scopesJoint work with A. Mycroft (Cambridge), A. Donaldson, A. Richards (Codeplay)

Aliasing complicates dependence analysis

The compiler must produce reliable, if inefficient, code

The programmer needs to tell more to the compiler



Extra material


Delayed side-effects in Codeplay’s Sieve C++

In Codeplay’s C++ extension, the programmer can place a codefragment in a parallel scope, thereby instructing the compiler to

delay writes to memory locations defined outside of thescope, and

apply them in order on exit from the scope.

Sieve C++float *pa, *pb; ...sieve { // sieve blockfor(int i = 1; i <= n; ++i){

pb[i] = pa[i] + 42;}

}// writes to pb[1:n] happen here

Fortran 90pb[1:n] = pa[1:n] + 42;



Extra material


Example: N-body simulationForce acting on particle i according to the law of gravity

~Fi = −∑

j 6=i

GMiMj

|rij |3~rij



Extra material


Example: N-body simulationC++ code

int Size;float3 rNormalised(float3 * Pos, int i, int j);...void computeForces(float3 * Force, float3 * Pos, float * Mass){for(int i = 0; i < Size; ++i) {

float3 Potential = { 0.0f, 0.0f, 0.0f };for(int j = 0; j < Size; ++j) {

float3 r = rNormalised(Pos, i, j);Potential -= r * Mass[j];

}Force[i] = Potential * Mass[i];

}}



Extra material


Example: N-body simulationSieve C++ code

int Size;sieve float3 rNormalised(outer float3 * Pos, int i, int j);...void computeForces(float3 * Force, float3 * Pos, float * Mass){sieve{

for(int i = 0; i < Size; ++i) {float3 Potential = { 0.0f, 0.0f, 0.0f };for(int j = 0; j < Size; ++j) {float3 r = rNormalised(Pos, i, j);Potential -= r * Mass[j];

}// Delayed writeForce[i] = Potential * Mass[i];

}}// Side-effects to Force[] committed here

}



Extra material


Sieve/OpenMP C++ (Codeplay/Microsoft) vs. C++Dual quad-core AMD Opteron (2GHz Barcelona), 4GiB RAM, Windows Server 2003

Benchmark details



Extra material


Sieve/OpenMP C++ (Codeplay/IBM XL Alpha) vs. C++Sony PlayStation 3 (3.2GHz Cell BE, 6 SPEs), IBM SDK 3, Fedora Core 7

Benchmark details



Extra material


Lessons learnt

The concept of delayed side-effects is useful

Eliminates the need for fragile whole-program analysis

Enables safe speculative parallelisation

Provides deterministic behaviour

Efficient implementation is hard

Data movement is the culprit for heterogeneousarchitectures

Mainstream languages mix memory access and computeoperations

Efficient programs decouple memory access andcomputation (and then overlap them asynchonously)

Conclusion: we need a programming model that promotes thisdecoupling (hence Æcute)!



Extra material


Sieve/OpenMP C++ benchmarks

GRAVITY: N-body molecular dynamics simulation of 8192particles

NOISE RGB: Noise reduction filter applied to 512 × 512colour image

NOISE GREY: Noise reduction filter applied to 512 × 512greyscale image

CRC: Cyclic redundancy check on random 8M (1M=220)word message

MAND: Calculates 1024 × 1024 fragment of theMandelbrot set

FFT3D: Fast Fourier transform of complex 1283 data set

x86 results Cell results



Extra material


Bit-reversal on NVIDIA GTX 280y [σn(i)] = x [i], for 0 ≤ i < N = 2n

10

20

30

40

50

60

70

80

90

100

110

14 16 18 20 22 24 26

Ban

dwid

th (

GiB

/s)

Problem size (number of index bits)

copybr-gp-smbr-gp+smbr+gp-smbr+gp+sm


decoupled access/execute metaprogramming for gpu-accelerated...

Documents