
Page 1: COMP528: Multi-core and Multi-Processor Computing · 2018-11-28

Dr Michael K Bane, G14, Computer Science, University of [email protected] https://cgi.csc.liv.ac.uk/~mkbane/COMP528

COMP528: Multi-core and Multi-Processor Computing

42

Page 2:

Programming GPUs

CUDA
• proprietary
• NVIDIA GPUs only
• non-portable
• performant

Directives
• portable
  – in theory?
• less coding
• maybe not as performant
• (some extensions to parallelism on CPUs, Xeon Phis, FPGAs)

Page 3:

• OpenACC

• OpenMP 4.x (and 5.0…)

• OpenCL

Page 4:

OpenMP vs OpenACC

OpenMP
• 1998 onwards
• offloading from v4.0 (2013)
• CPU & accelerator
• Fortran, C, C++, …
• prescriptive
  – the user explicitly specifies the actions to be undertaken by the compiler
• slower uptake of new [accelerator] ideas, but generally greater maturity for CPU

OpenACC
• 2012 onwards
• offloading always
• CPU & accelerator
• Fortran, C, C++, …
• descriptive
  – the user describes (guides); the compiler decides how/whether to parallelise
• generally more reactive to new ideas
• maturity for GPU

https://openmpcon.org/wp-content/uploads/openmpcon2015-james-beyer-comparison.pdf

Page 5:

Who Supports What?

• Intel make CPUs (and Xeon Phi) but not discrete [compute] GPUs
  – Intel compilers support OpenMP but not OpenACC
• NVIDIA (owners of PGI) make GPUs but not CPUs
  – PGI compilers support OpenACC & (only recently) OpenMP (but only with a CPU ‘target’)
• Cray no longer make chips; they are more of an “integrator”
  – Cray compilers support OpenMP & OpenACC

Page 6:

OpenACC

#pragma acc parallel loop reduction(max:error)
for (int j = 1; j < n-1; j++) {
    for (int i = 1; i < m-1; i++) {
        Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
        error = fmax(error, fabs(A[j][i] - Anew[j][i]));
    }
}

#pragma acc kernels
{
    for (int j = 1; j < n-1; j++) {
        for (int i = 1; i < m-1; i++) {
            Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j-1][i] + A[j+1][i]);
            error = fmax(error, fabs(A[j][i] - Anew[j][i]));
        }
    }
    for (int j = 1; j < n-1; j++) {
        for (int i = 1; i < m-1; i++) {
            A[j][i] = Anew[j][i];
        }
    }
}

Page 7:

OpenMP…

• syntax via examples

• what is in which version

• which compilers support which version

comp528#24 (c) univ of liverpool

Page 8:

• v4.0 (2013): support for offloading

– Intel v15 & v16

– GCC v4.9.0

• v4.5 (2015): improved support for offloading targets

– Intel v17 onwards

– GCC v6.1 onwards

– Cray CCE 8.7 onwards

Page 9:

• Intel
  – OpenMP for CPU and Xeon Phi
  – Intel don’t (yet) make discrete GPUs for computation
  – no support in their OpenMP for GPUs
• Cray & PGI compilers
  – provide OpenMP support for NVIDIA GPUs
• PGI
  – only supports the host (CPU) as the target device
• There are some options to DIY-extend LLVM/clang

Page 10:

OpenMP for Accelerators

• #pragma omp target
  – region of code is for the GPU
  – then need to say what happens within that region of code, eg
• #pragma omp parallel for
  – on CPU: creates threads & spreads iterations over the threads
  – within ‘target’: runs using threads of the GPU

Page 11:

Target Clauses

• device(N)
  – run on device #N
• map(A, B)
  – ensure variables A and B are available on the GPU
• map(tofrom: C)
  – copy C to the device, run the region on the device, copy C back

Page 12:

GPU threads != CPU threads

• OpenMP was designed around CPU threads
  – high cost of set-up and of synchronisation
• GPUs
  – lightweight threads, very low cost of switching
  – “thread blocks”
• SO… the “teams” directive of OpenMP for GPUs

Page 13:

[diagram: a while loop enclosing the target region]

Page 14:

OpenMP accelerator

• teams:
  – for GPU targets: one master thread per thread block
• distribute:
  – work-sharing between thread blocks

Page 15:

Conclusion?

• OpenMP for accelerators

– … limited support (Intel for Intel Xeon Phi)

– … clang/LLVM (hand-built or via IBM) for GPUs… not the most straightforward

Page 16:

OpenACC                    OpenMP
parallel | kernels         target / teams / parallel
copy                       map(tofrom: …)
copyin                     map(to: …)
copyout                    map(from: …)
create                     map(alloc: …)
delete                     map(release: …) / map(delete: …)


Page 17:

Handling with/without GPU…

Page 18:

OpenMP for Accelerators

Further Reading

• Resources will be added to COMP528 & to VITAL

– OpenMP example of Jacobi: https://www.slideshare.net/jefflarkin/gtc16-s6510-targeting-gpus-with-openmp-45

Page 19:

OpenCL

• much more low-level than directives
• perhaps more low-level than CUDA
• but portable
  – get the OpenCL compiler
  – it compiles for various targets
• and devotees of OpenCL claim good & portable performance
• targets: CPUs, GPUs, Xeon Phis, FPGAs

Page 20:

OpenCL Principles

• “Open Computing Language”: Khronos et al. standard, 2008–
• OpenCL compiler + OpenCL run-time library
• context
  – defines the actual run-time hardware
• a “compute device” has 1 or more “compute units”; a “compute unit” comprises several “processing elements”
  – eg a CPU has several cores; an NVIDIA GPU has several Streaming Multiprocessors
• kernels run on processing elements
  – kernels are compiled at runtime (to give portability), cf Java JIT
• kernels are queued up in a “command queue”
  – NDRange (1D or 2D) assigns tasks to processing elements

Page 21:

Page 22:

OpenCL kernel: matrix–vector multiply (c/o Wikipedia)

// Multiplies A*x, leaving the result in y.
// A is a row-major matrix, meaning the (i,j) element is at A[i*ncols+j].
__kernel void matvec(__global const float *A, __global const float *x,
                     uint ncols, __global float *y)
{
    size_t i = get_global_id(0);           // Global id, used as the row index
    __global float const *a = &A[i*ncols]; // Pointer to the i'th row
    float sum = 0.f;                       // Accumulator for dot product
    for (size_t j = 0; j < ncols; j++) {
        sum += a[j] * x[j];
    }
    y[i] = sum;
}

cf threadIdx.x in CUDA; cf the CUDA __global__, __device__ and __shared__ attributes

Page 23:

OpenCL “boiler plate”

• https://en.wikipedia.org/wiki/OpenCL

Page 24:

A personal view…

• OpenCL
  – complex
  – but re-usable (amend as appropriate!) boilerplate
    • set up context
  – more portable than CUDA
  – open source, not proprietary
  – now also supported by NVIDIA
  – targets include Intel FPGA (Altera)
    • your OpenCL for GPU will just run on an FPGA
    • but you will need to tune it!

Page 25:

Other Options…

• …to THREADING, generally (so for CPU rather than GPU)
• POSIX threads
• Java Threads & Java Tasks
• TBB (Intel Threading Building Blocks)
• http://www.archer.ac.uk/training/past_courses.php
  >> “Advanced OpenMP” (eg Cambridge, 17–19 July 2018) >> Course Materials

Page 26:

SIMD & VECTORS


Page 27:

SIMD

• Single Instruction Multiple Data (SIMD) at the instruction level
  – a = b + c (SIMD fl-pt addition)
  – y = mx + z (SIMD fl-pt multiply-add)
  – where the relevant hardware exists to do it in 1 instruction
• modern architectures support vectorisation via SIMD
  – SSE, AVX (Intel x86)
  – AltiVec (IBM Power)
• ie vector units that can do a[i] = b[i] + c[i] for several [i] in a single clock cycle

Page 28:

How to Make Use…?

1. let the compiler’s “vectorisation” phase (of optimisation) identify opportunities
2. use non-portable vendor-specific compiler directives?
3. use the portable OpenMP directive: #pragma omp simd [clause/s]

Page 29:

SIMD Example (with timings)

ompTimer[0] = omp_get_wtime();
#pragma omp simd
for (i = 0; i < N; i++) {
    res[i] = A*x[i] + y[i];
}
ompTimer[1] = omp_get_wtime();

This is on a single core: parallelism at the logical vector unit (cf GPUs and threads per CUDA core).

Page 30:

SIMD Example (with timings)

ompTimer[0] = omp_get_wtime();
#pragma omp parallel for simd
for (i = 0; i < N; i++) {
    res[i] = A*x[i] + y[i];
}
ompTimer[1] = omp_get_wtime();

This is 2 levels of parallelism:
a) across many cores
b) across the logical vector unit on each core

Page 31:

SIMD Example (with timings)

ompTimer[0] = omp_get_wtime();
#pragma omp simd
for (i = 0; i < N; i++) {
    res[i] = A*x[i] + y[i];
}
ompTimer[1] = omp_get_wtime();

Instead of using a directive to force the compiler to make use of the vector unit in each core of modern processors, we can also look at helping the compiler do such “vectorisation” itself.
This can be key to an Nx speed-up per core… but what is N?

Page 32:

Vector Widths

• SSE
  – 128 bits
• AVX
  – 256 bits
  – ie 32 bytes, so 8 floats or 4 doubles
  – so 8x or 4x faster
• AVX-512
  – 512 bits
  – 16x (single precision) or 8x (double precision) faster

• BUT it is also about the types of supported vector operations:
  – z = a + b
  – z = Ax + y
  – z = exp(x)
  – z = 1/x
  – z = 1/sqrt(x)
  – NN operations
  – etc
  – a growing set with each upgrade