Dr Michael K Bane, G14, Computer Science, University of Liverpool
[email protected]
https://cgi.csc.liv.ac.uk/~mkbane/COMP528
COMP528: Multi-core and Multi-Processor Computing
So far…
• Why and what HPC / multi-core / multi-processing [#1 - #3]
• Use of HPC facility
– batch, timing variations
• Theory
– Amdahl’s Law, Gustafson
– deadlock, livelock
• Message Passing Interface (MPI)
– distributed memory, many nodes, scaling up nodes & memory
– wrappers for compiler and launching
• OpenMP
– shared (globally addressable) memory, single node
So far…
• But not only CPU
• GPU [#18 - #24]
– CUDA: <<<blocks, threads>>> & writing kernels
– Directives: OpenACC [#23], OpenMP 4.0+ [#24]
– OpenCL [#24]
Still to come
• Vectorisation
– including some ‘optimisation’
• Hybrid programming
– how to use OpenMP from MPI
• Libraries
• “Black box” codes
– Using what we have learned to understand how these can help
• Profiling & Optimisation
Still to come
• a “summary” lecture
– what’s key to remember
– opportunity to ask Qs
– what might be interesting but would need another course…
and did somebody say cloud?
HYBRID
Today’s High End Architectures
• processors:
– many cores
• each with vector unit
– maybe specialised units
• TPU, Tensor cores (etc) for Machine Learning
• nodes
– one or more processors
– zero or more GPUs
• potentially likes of Xeon Phi, FPGA, custom ASIC, …
Today’s High End Architectures
• i.e. an eclectic mix
• Needing appropriate programming for max performance
– MPI for inter-node
– MPI or OpenMP for intra-node
– CUDA | OpenCL | OpenACC | OpenMP for accelerators
• BUT heterogeneous arch ==> heterogeneous use of languages
• MPI across nodes, OpenMP on a node
– or MPI per processor & OpenMP across cores?
• Already done (assignment #3)
– OpenMP for CPU + CUDA for GPU
– a single thread calls the CUDA kernel for GPU to run
– (calling a CUDA kernel in a parallel region would launch many instances of the kernel, each requesting <<<blocks, threads>>>)
MPI + OMP: Simple Case
• MPI code => runs a copy on each process
• Put one process per node
• When we need to “accelerate” (eg a for loop), we use OpenMP
– the ‘master’ OpenMP thread is the MPI process
– and we have the other cores as the ‘slave’ OpenMP threads
• (Inter-process) Comms is only via MPI
– consider each OpenMP team independent of (and without any knowledge of) other OpenMP teams
why else may we wish to use OMP rather than MPI?
dynamic load balancing, eg schedule(dynamic)
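A minimal sketch of this simple case (illustrative only, not the course’s own code; the file name and compile line below are assumptions): each MPI process, ideally one per node, opens its own OpenMP team.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* each MPI process launches its own OpenMP team; the MPI process
       itself acts as the 'master' OpenMP thread */
    #pragma omp parallel
    printf("MPI rank %d: OpenMP thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}

Built with the MPI compiler wrapper plus the OpenMP flag, eg mpicc -fopenmp hybrid.c (exact wrapper and flag names depend on the implementation).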
[Diagrams: an OpenMP program (no MPI); MPI with 1 process launching OpenMP parallel regions; MPI with 4 processes each launching OpenMP parallel regions]
MPI with 4 processes, each launching OpenMP parallel regions
• Data exchange between MPI processes
• via MPI comms
– pt-to-pt
– collectives
• Easiest if OUTSIDE of the OpenMP regions
REDUCTION TO ROOT
There is no reason we have to have the same size OpenMP team on each MPI process
Example / DEMO
• ~/MPI/hybrid/ex1.c
– how to compile hybrid?
– run: illustrate
• ~/MPI/hybrid/ex2.c
– v. simple example of summation over MPI*OMP
– MPI_Scatter → #pragma omp parallel for reduction → MPI_Reduce
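ex2.c itself is not reproduced in the transcript; the following is a hedged sketch of the pattern just described (the size N, the data values and the expected answer are illustrative assumptions):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1024   /* assumed problem size, divisible by the number of processes */

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int chunk = N / nprocs;
    double *data = NULL;
    double *local = malloc(chunk * sizeof(double));
    if (rank == 0) {
        data = malloc(N * sizeof(double));
        for (int i = 0; i < N; i++) data[i] = 1.0;   /* so the sum should be N */
    }

    /* MPI_Scatter: root hands each process its chunk (outside any OMP region) */
    MPI_Scatter(data, chunk, MPI_DOUBLE, local, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* each process sums its chunk across its OpenMP threads */
    double mysum = 0.0;
    #pragma omp parallel for reduction(+:mysum)
    for (int i = 0; i < chunk; i++) mysum += local[i];

    /* MPI_Reduce: combine the per-process sums on the root, again outside OMP */
    double total = 0.0;
    MPI_Reduce(&mysum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("total = %.1f (expect %d)\n", total, N);
    free(local);
    free(data);
    MPI_Finalize();
    return 0;
}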
Other Options
HANDLE WITH CARE
• A single OMP thread (eg in a master or single region) sends info via MPI
– generally okay
– will be to another master thread
– is pretty much like sending outside the OMP region
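A hedged sketch of that pattern (function and parameter names are illustrative; it needs at least MPI_THREAD_FUNNELED, covered on the next slide):

#include <mpi.h>

/* one thread sends from inside a parallel region via an "omp master" construct */
void compute_then_send(double *result, int count, int dest) {
    #pragma omp parallel
    {
        /* ... all threads fill result[] here ... */

        #pragma omp barrier    /* make sure result is complete before sending */
        #pragma omp master
        MPI_Send(result, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
        #pragma omp barrier    /* omp master has no implied barrier of its own */
    }
}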
Other Options
• One or more OMP threads doing MPI comms inside an OMP parallel region (or concurrently with one)
– “threaded MPI”
– requires MPI_Init_thread rather than MPI_Init
MPI_Init_thread(&argc, &argv, required, &provided)
– requires provided support (implementation dependent) of one of:
MPI_THREAD_FUNNELED
MPI_THREAD_SERIALIZED
MPI_THREAD_MULTIPLE
HANDLE WITH CARE
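A hedged sketch of the initialisation (requesting MPI_THREAD_FUNNELED here is just an example; ask for whatever your comms pattern needs and check what is actually provided):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int provided;
    /* request a threading level; the library reports what it really supports */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "insufficient MPI threading support (got %d)\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... hybrid MPI + OpenMP work ... */

    MPI_Finalize();
    return 0;
}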
Performance, Batch etc
• How many cores to request in batch job?
– different batch systems would require a request for:
4 processors * 7 cores (MPI: proc, OMP: core)
24 cores (& then worry re placement)
– Chadwick: 24 cores, place MPI per node via mpirun (SHOW)
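For illustration only, a hypothetical SLURM request for the first variant (Chadwick’s actual batch system, flags and executable name may well differ):

#!/bin/bash
# hypothetical: 4 MPI processes, one per node, 7 OpenMP threads each
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=7

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
mpirun -np $SLURM_NTASKS ./hybrid.exe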
• Is it efficient use of resources?
– depends on whether it runs faster
– but there is ‘dead time’ (cf Amdahl)
Can think of some tricks…
#pragma omp parallel …
if (omp_get_thread_num() == 0) {
    MPI_Send(…);   // or other MPI, eg MPI_Recv on a different MPI process
}
else {
    // do some OMP work on remaining threads
}
what can we NOT do here?
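Filling in the trick as a hedged, self-contained sketch (names and the work done are illustrative; it needs at least MPI_THREAD_FUNNELED, since only thread 0, the master, touches MPI):

#include <mpi.h>
#include <omp.h>

/* overlap one thread's MPI comms with compute on the remaining threads */
void overlap_comms(double *buf, int n, double *work, int m, int rank) {
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nt = omp_get_num_threads();
        if (id == 0) {
            /* thread 0 (the master) communicates... */
            if (rank == 0)
                MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        } else {
            /* ...while threads 1..nt-1 share the compute by hand; we cannot
               use "omp for" or "omp barrier" inside these branches, since
               not all threads of the team would encounter them */
            for (int i = id - 1; i < m; i += nt - 1)
                work[i] *= 2.0;
        }
    }
}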
Further Reading
• https://www.intertwine-project.eu/sites/default/files/images/INTERTWinE_Best_Practice_Guide_MPI%2BOpenMP_1.2.pdf
• (Archer?)