Dr Michael K Bane, G14, Computer Science, University of Liverpool · [email protected] · https://cgi.csc.liv.ac.uk/~mkbane/COMP528
COMP528: Multi-core and Multi-Processor Computing
Programming GPUs
CUDA
• proprietary
• NVIDIA only GPUs
• non-portable
• performant
Directives
• portable
– in theory?
• less coding
• maybe not so performant
• (some extensions to parallelism on
CPUs, Xeon Phis, FPGAs)
• OpenACC
• OpenMP 4.x (and 5.0…)
• OpenCL
OpenMP vs. OpenACC
OpenMP
• 1998 onwards
• offloading from v4.0 (2013)
• CPU & accelerator
• FORTRAN, C, C++, …
• prescriptive
– user explicitly specifies actions to be
undertaken by the compiler
• slower uptake of new [accelerator] ideas
• but generally more mature for CPU
OpenACC
• 2012 onwards
• offloading always
• CPU & accelerator
• FORTRAN, C, C++, …
• descriptive
– user describes (guides) compiler but
compiler makes decision how/if to do
parallelism
• generally more reactive to new ideas
• maturity for GPU
https://openmpcon.org/wp-content/uploads/openmpcon2015-james-beyer-comparison.pdf
Who supports What?
• Intel make CPUs (and Xeon Phi) but not discrete [compute]
GPUs
– Intel compilers support OpenMP but not OpenACC
• NVIDIA (owners of PGI) make GPUs but not CPUs
– PGI compilers support OpenACC & (only recently) OpenMP (but
only CPU ‘target’)
• Cray no longer make chips, more of an “integrator”
– Cray compilers support OpenMP & OpenACC
OpenACC
// "parallel loop": prescriptive, the user asserts the loops are parallel
#pragma acc parallel loop reduction(max:error)
for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
        Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
        error = fmax( error, fabs(A[j][i]-Anew[j][i]));
    }
}

// "kernels": descriptive, the compiler decides how/if to parallelise the region
#pragma acc kernels
{
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
            error = fmax( error, fabs(A[j][i]-Anew[j][i]));
        }
    }
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];
        }
    }
}
OpenMP…
• syntax via examples
• what is in which version
• which compilers support which version
• v4.0 (2013): support offloading
– Intel v15 & v16
– GCC v4.9.0
• v4.5 (2015): improved support for offloading targets
– Intel v17 onwards
– GCC v6.1 onwards
– Cray CCE 8.7 onwards
• Intel
– OpenMP for CPU and XeonPhi
– Intel don’t (yet) make discrete GPUs for computation
– No support in their OpenMP for GPUs
• Cray & PGI compilers
– provide OpenMP support for NVIDIA GPUs
• PGI
– only supports the host (CPU) as the target device
• There are some options to DIY extend LLVM/clang
OpenMP for Accelerators
• #pragma omp target
– region of code is for the GPU
– then need to say what happens within that region of code eg
• #pragma omp parallel for
– on CPU: creates threads & spreads iterations over threads
– within ‘target’: runs using threads of GPU
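For example, a minimal sketch combining the two directives (the routine and variable names are illustrative, not from the lecture; the map clauses are explained on the next slide):

// hypothetical saxpy offload
void saxpy(int n, float a, const float *x, float *y) {
    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])  // region of code is for the GPU
    #pragma omp parallel for                                // runs using threads of GPU
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}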
Target Clauses
• device (N)
– run on device #N
• map(A,B)
– ensure A, B vars available on GPU
• map(tofrom: C)
– copy C to device, run region on device, copy C back
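An illustrative sketch of these clauses (A, B, C and n are assumed names): note that mapping a bare pointer copies only the pointer, so array sections such as C[0:n] are used to move the actual data.

// run on device 0; A, B are inputs only, C is copied both ways
#pragma omp target device(0) map(to: A[0:n], B[0:n]) map(tofrom: C[0:n])
#pragma omp parallel for
for (int i = 0; i < n; i++) {
    C[i] = A[i] + B[i];   // C copied to device, updated there, copied back
}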
GPU threads != CPU threads
• OpenMP designed around CPU threads
– high cost of set-up and of synchronisation
• GPUs
– light weight threads, very low cost of switching
– “thread blocks”
• SO… OpenMP has a “teams” directive for GPUs, used within a target region
OpenMP accelerator
• teams:
– for GPU targets: one master per thread-block
• distribute:
– work sharing between thread blocks
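Putting the pieces together, a sketch of the commonly used combined construct (names again illustrative):

// 'teams' creates a league of teams (one master per thread-block on a GPU),
// 'distribute' shares iterations between the teams,
// 'parallel for' shares them among the threads within each team
#pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
for (int i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
}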
Conclusion?
• OpenMP for accelerators
– … limited support (Intel for Intel Xeon Phi)
– … clang/LLVM (hand-built or via IBM) for GPUs…
not the most straightforward
OpenACC ↔ OpenMP
• parallel | kernels ↔ target / teams / parallel
• copy ↔ map(tofrom: …)
• copyin ↔ map(to: …)
• copyout ↔ map(from: …)
• create ↔ map(alloc: …)
• delete ↔ map(release: …) / map(delete: …)
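A minimal sketch of the correspondence (illustrative loop and names), expressing the same data movement both ways:

// OpenACC: x is input only (copyin), y is output only (copyout)
#pragma acc parallel loop copyin(x[0:n]) copyout(y[0:n])
for (int i = 0; i < n; i++) {
    y[i] = a * x[i];
}

// a rough OpenMP 4.x equivalent
#pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(from: y[0:n])
for (int i = 0; i < n; i++) {
    y[i] = a * x[i];
}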
Handling with/without GPU…
OpenMP for Accelerators
Further Reading
• Resources will be added to COMP528 & to VITAL
– OpenMP example of Jacobi:
https://www.slideshare.net/jefflarkin/gtc16-s6510-targeting-gpus-with-openmp-45
OpenCL
• much lower level than directives
• perhaps even lower level than CUDA
• but portable
– get the OpenCL compiler
– compiles for various targets
• and devotees of OpenCL claim good & portable
performance
• targets: CPUs, GPUs, Xeon Phis, FPGAs
OpenCL Principles
• OpenCL compiler + OpenCL run-time library
• “Open Computing Language”: Khronos et al. Standard. 2008-
• context
– defining the actual run-time hardware
• a “compute device” has 1 or more “compute units”
• a “compute unit” comprises several “processing elements”
– eg CPU has several cores; eg NVIDIA GPU has several Streaming Multiprocessors
• kernels to run on processing elements
– kernels are compiled at runtime (to give portability) cf Java JIT
• kernels queued up in “command queue”
– NDRange (1D or 2D) assigns tasks to processing elements
OpenCL kernel: mat-vect mult (c/o Wikipedia)
// Multiplies A*x, leaving the result in y.
// A is a row-major matrix, meaning the (i,j) element is at A[i*ncols+j].
__kernel void matvec(__global const float *A, __global const float *x,
                     uint ncols, __global float *y)
{
    size_t i = get_global_id(0);            // Global id, used as the row index
    __global float const *a = &A[i*ncols];  // Pointer to the i'th row
    float sum = 0.f;                        // Accumulator for dot product
    for (size_t j = 0; j < ncols; j++) {
        sum += a[j] * x[j];
    }
    y[i] = sum;
}
• cf threadIdx.x in CUDA
• cf CUDA __global__ and __device__ and __shared__ attributes
OpenCL “boiler plate”
• https://en.wikipedia.org/wiki/OpenCL
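For concreteness, a minimal host-side sketch for the matvec kernel above (a sketch only: error checking is omitted, and clCreateCommandQueue is the pre-OpenCL-2.0 entry point):

#include <CL/cl.h>

// 'src' holds the kernel source text; A is nrows x ncols, row-major
int run_matvec(const char *src, const float *A, const float *x,
               cl_uint nrows, cl_uint ncols, float *y) {
    cl_int err;
    cl_platform_id platform;
    cl_device_id device;

    // 1. context: pick a platform and device, create context + command queue
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

    // 2. compile the kernel at runtime (this is what gives the portability)
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "matvec", &err);

    // 3. device buffers, copying the inputs across
    cl_mem dA = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               sizeof(float) * nrows * ncols, (void *)A, &err);
    cl_mem dx = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               sizeof(float) * ncols, (void *)x, &err);
    cl_mem dy = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                               sizeof(float) * nrows, NULL, &err);

    // 4. set arguments; queue the kernel as a 1D NDRange, one row per work-item
    clSetKernelArg(k, 0, sizeof(cl_mem), &dA);
    clSetKernelArg(k, 1, sizeof(cl_mem), &dx);
    clSetKernelArg(k, 2, sizeof(cl_uint), &ncols);
    clSetKernelArg(k, 3, sizeof(cl_mem), &dy);
    size_t global = nrows;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);

    // 5. blocking read of the result, then tidy up
    clEnqueueReadBuffer(q, dy, CL_TRUE, 0, sizeof(float) * nrows, y,
                        0, NULL, NULL);
    clReleaseMemObject(dA); clReleaseMemObject(dx); clReleaseMemObject(dy);
    clReleaseKernel(k); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    return 0;
}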
A personal view…
• OpenCL
– complex
– but re-usable (amend as appropriate!) boiler plate
• set up context
– more portable than CUDA
– open source not proprietary
– now also supported by NVIDIA
– targets include Intel FPGA (Altera)
• your OpenCL for GPU will just run on FPGA
• but you will need to tune it!
Other Options…
• …to THREADING,
generally (so for CPU rather
than GPU)
• POSIX
• Java Threads & Java Tasks
• TBB
• http://www.archer.ac.uk/training/past_courses.php
>> “Advanced OpenMP” (eg Cambridge, 17-19 July 2018) >> Course Materials
SIMD & VECTORS
SIMD
• Single Instruction Multiple Data (SIMD) at instruction level
– a = b + c (SIMD fl-pt addition)
– y = mx + z (SIMD fl-pt mult-add)
– where relevant hardware exists to do in 1 instruction
• Modern architectures support SIMD vectorisation
– SSE, AVX (Intel x86)
– AltiVec (IBM Power)
• i.e. vector units that can do a[i] = b[i] + c[i] for several [i] in
a single clock cycle
How to Make Use…?
1. let the compiler’s “vectorisation” phase (of optimisation)
identify opportunities
2. use of non-portable vendor-specific compiler directives?
3. use of a portable OpenMP directive
#pragma omp simd [clause/s]
SIMD Example (with timings)
ompTimer[0] = omp_get_wtime();
#pragma omp simd
for (i=0; i<N; i++) {
res[i] = A*x[i] + y[i];
}
ompTimer[1] = omp_get_wtime();
This is on a single core
Parallelism at the logical vector unit
cf GPUs and threads per CUDA core
SIMD Example (with timings)
ompTimer[0] = omp_get_wtime();
#pragma omp parallel for simd
for (i=0; i<N; i++) {
res[i] = A*x[i] + y[i];
}
ompTimer[1] = omp_get_wtime();
This is 2 levels of parallelism
a) across many cores
b) across the logical vector unit on each core
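A self-contained version of this fragment (problem size and initial values are illustrative; compile with e.g. gcc -fopenmp):

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 10000000   /* illustrative problem size */

int main(void) {
    float A = 2.0f;
    float *x   = malloc(N * sizeof(float));
    float *y   = malloc(N * sizeof(float));
    float *res = malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    double ompTimer[2];
    ompTimer[0] = omp_get_wtime();
    #pragma omp parallel for simd     // threads across cores + SIMD on each core
    for (int i = 0; i < N; i++) {
        res[i] = A * x[i] + y[i];
    }
    ompTimer[1] = omp_get_wtime();
    printf("elapsed: %f s\n", ompTimer[1] - ompTimer[0]);

    free(x); free(y); free(res);
    return 0;
}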
SIMD Example (with timings)
(the same #pragma omp simd loop as the first example above)
Instead of using a directive to force the compiler to make use of
the vector unit in each core of modern processors, we can also
look at helping the compiler do such “vectorisation” itself
Can be key to Nx speed-up per core… but what is N?
Vector Widths
• SSE
– 128 bits, ie 16 bytes so 4 floats or 2 doubles
• AVX
– 256 bits
– ie 32 bytes so 8 floats or 4
doubles
– so 8x or 4x faster
• AVX-512
– 512 bits
– 16x (single precision) or 8x
(double precision) faster
• BUT it is also about the types
of supported vector operations:
• z = a + b
• z = Ax + y
• z = exp(x)
• z = 1/x
• z = 1 / sqrt(x)
• NN (neural network) operations
• etc
• growing set with each upgrade