Dr Michael K Bane, G14, Computer Science, University of Liverpool · [email protected] · https://cgi.csc.liv.ac.uk/~mkbane/COMP528
COMP528: Multi-core and Multi-Processor Computing
Programming GPUs
CUDA
• proprietary
• NVIDIA only GPUs
• non-portable
• performant
Directives
• portable
– in theory?
• less coding
• maybe not so performant
• (some extensions to parallelism on
CPUs, Xeon Phis, FPGAs)
• OpenACC
• OpenMP 4.x (and 5.0…)
• OpenCL
OpenMP vs. OpenACC
OpenMP
• 1998 onwards
• offloading from v4.0 (2013)
• CPU & accelerator
• FORTRAN, C, C++, …
• prescriptive
– user explicitly specifies actions to be
undertaken by the compiler
• slower uptake of new [accelerator] ideas
• but generally more mature for CPU
OpenACC
• 2012 onwards
• offloading always
• CPU & accelerator
• FORTRAN, C, C++, …
• descriptive
– user describes (guides) compiler but
compiler makes decision how/if to do
parallelism
• generally more reactive to new ideas
• maturity for GPU
https://openmpcon.org/wp-content/uploads/openmpcon2015-james-beyer-comparison.pdf
Who supports What?
• Intel make CPUs (and Xeon Phi) but not discrete [compute]
GPUs
– Intel compilers support OpenMP but not OpenACC
• NVIDIA (owners of PGI) make GPUs but not CPUs
– PGI compilers support OpenACC & (only recently) OpenMP (but
only CPU ‘target’)
• Cray no longer make chips, more of an “integrator”
– Cray compilers support OpenMP & OpenACC
OpenACC
// "parallel loop": prescriptive, the user asserts the loops are parallel
#pragma acc parallel loop reduction(max:error)
for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
        Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
        error = fmax( error, fabs(A[j][i]-Anew[j][i]));
    }
}

// "kernels": descriptive, the compiler decides how/if to parallelise the region
#pragma acc kernels
{
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
            error = fmax( error, fabs(A[j][i]-Anew[j][i]));
        }
    }
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];
        }
    }
}
OpenMP…
• syntax via examples
• what is in which version
• which compilers support which version
• v4.0 (2013): support offloading
– Intel v15 & v16
– GCC v4.9.0
• v4.5 (2015): improved support for offloading targets
– Intel v17 onwards
– GCC v6.1 onwards
– Cray CCE 8.7 onwards
• Intel
– OpenMP for CPU and XeonPhi
– Intel don’t (yet) make discrete GPUs for computation
– No support in their OpenMP for GPUs
• Cray & PGI compilers
– provide OpenMP support for NVIDIA GPUs
• PGI
– only supports the host (CPU) as the target device
• There are some options to DIY extend LLVM/clang
OpenMP for Accelerators
• #pragma omp target
– region of code is for the GPU
– then need to say what happens within that region of code eg
• #pragma omp parallel for
– on CPU: creates threads & spreads iterations over threads
– within ‘target’: runs using threads of GPU
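For example, a minimal sketch combining the two directives (the routine and variable names are illustrative, not from the lecture; the map clauses are explained on the next slide):

// hypothetical saxpy offload
void saxpy(int n, float a, const float *x, float *y) {
    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])  // region of code is for the GPU
    #pragma omp parallel for                                // runs using threads of GPU
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}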
Target Clauses
• device (N)
– run on device #N
• map(A,B)
– ensure A, B vars available on GPU
• map(tofrom: C)
– copy C to device, run region on device, copy C back
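An illustrative sketch of these clauses (A, B, C and n are assumed names): note that mapping a bare pointer copies only the pointer, so array sections such as C[0:n] are used to move the actual data.

// run on device 0; A, B are inputs only, C is copied both ways
#pragma omp target device(0) map(to: A[0:n], B[0:n]) map(tofrom: C[0:n])
#pragma omp parallel for
for (int i = 0; i < n; i++) {
    C[i] = A[i] + B[i];   // C copied to device, updated there, copied back
}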
GPU threads != CPU threads
• OpenMP designed around CPU threads
– high cost of set-up and of synchronisation
• GPUs
– light weight threads, very low cost of switching
– “thread blocks”
• SO… OpenMP has a “teams” directive for GPUs, used within a target region
OpenMP accelerator
• teams:
– for GPU targets: one master per thread-block
• distribute:
– work sharing between thread blocks
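Putting the pieces together, a sketch of the commonly used combined construct (names again illustrative):

// 'teams' creates a league of teams (one master per thread-block on a GPU),
// 'distribute' shares iterations between the teams,
// 'parallel for' shares them among the threads within each team
#pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
for (int i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
}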
Conclusion?
• OpenMP for accelerators
– … limited support (Intel for Intel Xeon Phi)
– … clang/LLVM (hand-built or via IBM) for GPUs…
not the most straightforward
OpenACC ↔ OpenMP
• parallel | kernels ↔ target / teams / parallel
• copy ↔ map(tofrom: …)
• copyin ↔ map(to: …)
• copyout ↔ map(from: …)
• create ↔ map(alloc: …)
• delete ↔ map(release: …) / map(delete: …)
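A minimal sketch of the correspondence (illustrative loop and names), expressing the same data movement both ways:

// OpenACC: x is input only (copyin), y is output only (copyout)
#pragma acc parallel loop copyin(x[0:n]) copyout(y[0:n])
for (int i = 0; i < n; i++) {
    y[i] = a * x[i];
}

// a rough OpenMP 4.x equivalent
#pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(from: y[0:n])
for (int i = 0; i < n; i++) {
    y[i] = a * x[i];
}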
Handling with/without GPU…
OpenMP for Accelerators
Further Reading
• Resources will be added to COMP528 & to VITAL
– OpenMP example of Jacobi:
https://www.slideshare.net/jefflarkin/gtc16-s6510-targeting-gpus-with-openmp-45
OpenCL
• much lower level than directives
• perhaps even lower level than CUDA
• but portable
– get the OpenCL compiler
– compiles for various targets
• and devotees of OpenCL claim good & portable
performance
• targets: CPUs, GPUs, Xeon Phis, FPGAs
OpenCL Principles
• OpenCL compiler + OpenCL run-time library
• “Open Computing Language”: Khronos et al. Standard. 2008-
• context
– defining the actual run-time hardware
• a “compute device” has 1 or more “compute units”
• a “compute unit” comprises several “processing elements”
– eg CPU has several cores; eg NVIDIA GPU has several Streaming Multiprocessors
• kernels to run on processing elements
– kernels are compiled at runtime (to give portability) cf Java JIT
• kernels queued up in “command queue”
– NDRange (1D or 2D) assigns tasks to processing elements
OpenCL kernel: mat-vect mult (c/o Wikipedia)
// Multiplies A*x, leaving the result in y.
// A is a row-major matrix, meaning the (i,j) element is at A[i*ncols+j].
__kernel void matvec(__global const float *A, __global const float *x,
                     uint ncols, __global float *y)
{
    size_t i = get_global_id(0);            // Global id, used as the row index
    __global float const *a = &A[i*ncols];  // Pointer to the i'th row
    float sum = 0.f;                        // Accumulator for dot product
    for (size_t j = 0; j < ncols; j++) {
        sum += a[j] * x[j];
    }
    y[i] = sum;
}
• cf threadIdx.x in CUDA
• cf CUDA __global__ and __device__ and __shared__ attributes
OpenCL “boiler plate”
• https://en.wikipedia.org/wiki/OpenCL
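For concreteness, a minimal host-side sketch for the matvec kernel above (a sketch only: error checking is omitted, and clCreateCommandQueue is the pre-OpenCL-2.0 entry point):

#include <CL/cl.h>

// 'src' holds the kernel source text; A is nrows x ncols, row-major
int run_matvec(const char *src, const float *A, const float *x,
               cl_uint nrows, cl_uint ncols, float *y) {
    cl_int err;
    cl_platform_id platform;
    cl_device_id device;

    // 1. context: pick a platform and device, create context + command queue
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

    // 2. compile the kernel at runtime (this is what gives the portability)
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "matvec", &err);

    // 3. device buffers, copying the inputs across
    cl_mem dA = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               sizeof(float) * nrows * ncols, (void *)A, &err);
    cl_mem dx = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               sizeof(float) * ncols, (void *)x, &err);
    cl_mem dy = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                               sizeof(float) * nrows, NULL, &err);

    // 4. set arguments; queue the kernel as a 1D NDRange, one row per work-item
    clSetKernelArg(k, 0, sizeof(cl_mem), &dA);
    clSetKernelArg(k, 1, sizeof(cl_mem), &dx);
    clSetKernelArg(k, 2, sizeof(cl_uint), &ncols);
    clSetKernelArg(k, 3, sizeof(cl_mem), &dy);
    size_t global = nrows;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);

    // 5. blocking read of the result, then tidy up
    clEnqueueReadBuffer(q, dy, CL_TRUE, 0, sizeof(float) * nrows, y,
                        0, NULL, NULL);
    clReleaseMemObject(dA); clReleaseMemObject(dx); clReleaseMemObject(dy);
    clReleaseKernel(k); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    return 0;
}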
A personal view…
• OpenCL
– complex
– but re-usable (amend as appropriate!) boiler plate
• set up context
– more portable than CUDA
– open source not proprietary
– now also supported by NVIDIA
– targets include Intel FPGA (Altera)
• your OpenCL for GPU will just run on FPGA
• but you will need to tune it!
Other Options…
• …to THREADING,
generally (so for CPU rather
than GPU)
• POSIX
• Java Threads & Java Tasks
• TBB
• http://www.archer.ac.uk/training/past_courses.php
>> “Advanced OpenMP” (eg Cambridge, 17-19 July 2018) >> Course Materials
SIMD & VECTORS
SIMD
• Single Instruction Multiple Data (SIMD) at instruction level
– a = b + c (SIMD fl-pt addition)
– y = mx + z (SIMD fl-pt mult-add)
– where relevant hardware exists to do in 1 instruction
• Modern architectures support SIMD vectorisation
– SSE, AVX (Intel x86)
– AltiVec (IBM Power)
• i.e. vector units that can do a[i] = b[i] + c[i] for several [i] in
a single clock cycle
How to Make Use…?
1. let the compiler’s “vectorisation” phase (of optimisation)
identify opportunities
2. use of non-portable vendor-specific compiler directives?
3. use of a portable OpenMP directive
#pragma omp simd [clause/s]
SIMD Example (with timings)
ompTimer[0] = omp_get_wtime();
#pragma omp simd
for (i=0; i<N; i++) {
res[i] = A*x[i] + y[i];
}
ompTimer[1] = omp_get_wtime();
This is on a single core
Parallelism at the logical vector unit
cf GPUs and threads per CUDA core
SIMD Example (with timings)
ompTimer[0] = omp_get_wtime();
#pragma omp parallel for simd
for (i=0; i<N; i++) {
res[i] = A*x[i] + y[i];
}
ompTimer[1] = omp_get_wtime();
This is 2 levels of parallelism
a) across many cores
b) across the logical vector unit on each core
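A self-contained version of this fragment (problem size and initial values are illustrative; compile with e.g. gcc -fopenmp):

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 10000000   /* illustrative problem size */

int main(void) {
    float A = 2.0f;
    float *x   = malloc(N * sizeof(float));
    float *y   = malloc(N * sizeof(float));
    float *res = malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    double ompTimer[2];
    ompTimer[0] = omp_get_wtime();
    #pragma omp parallel for simd     // threads across cores + SIMD on each core
    for (int i = 0; i < N; i++) {
        res[i] = A * x[i] + y[i];
    }
    ompTimer[1] = omp_get_wtime();
    printf("elapsed: %f s\n", ompTimer[1] - ompTimer[0]);

    free(x); free(y); free(res);
    return 0;
}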
SIMD Example (with timings)
(the same #pragma omp simd loop as the first example above)
Instead of using a directive to force the compiler to make use of
the vector unit in each core of modern processors, we can also
look at helping the compiler do such “vectorisation” itself
Can be key to Nx speed-up per core… but what is N?
Vector Widths
• SSE
– 128 bits, ie 16 bytes so 4 floats or 2 doubles
• AVX
– 256 bits
– ie 32 bytes so 8 floats or 4
doubles
– so 8x or 4x faster
• AVX-512
– 512 bits
– 16x (single precision) or 8x
(double precision) faster
• BUT it is also about the types
of supported vector operations:
• z = a + b
• z = Ax + y
• z = exp(x)
• z = 1/x
• z = 1 / sqrt(x)
• NN (neural network) operations
• etc
• growing set with each upgrade