
Dr Michael K Bane, G14, Computer Science, University of Liverpool

m.k.bane@liverpool.ac.uk

https://cgi.csc.liv.ac.uk/~mkbane/COMP528

COMP528: Multi-core and Multi-Processor Computing


So far…

• Why and what HPC / multi-core / multi-processing [#1 - #3]

• Use of HPC facility

– batch, timing variations

• Theory

– Amdahl’s Law, Gustafson

– deadlock, livelock

• Message Passing Interface (MPI)

– distributed memory, many nodes, scaling up nodes & memory

– wrappers for compiler and launching

• OpenMP

– shared (globally addressable) memory, single node

So far…

• But not only CPU

• GPU [#18 - #24]

– CUDA: <<<blocks, threads>>> & writing kernels

– Directives: OpenACC [#23], OpenMP 4.0+ [#24]

– OpenCL [#24]


Still to come

• Vectorisation

– including some ‘optimisation’

• Hybrid programming

– how to use OpenMP from MPI

• Libraries

• “Black box” codes

– Using what we have learned to understand how these can help

• Profiling & Optimisation

Still to come

• a “summary” lecture

– what’s key to remember

– opportunity to ask Qs

– what might be interesting but would need another course…

and did somebody say cloud?

HYBRID

Today’s High End Architectures

• processors:

– many cores

• each with vector unit

– maybe specialised units

• TPU, Tensor cores (etc) for Machine Learning

• nodes

– one or more processors

– zero or more GPUs

• potentially the likes of Xeon Phi, FPGA, custom ASIC, …

Today’s High End Architectures

• i.e. an eclectic mix

• Needing appropriate programming for max performance

– MPI for inter-node

– MPI or OpenMP for intra-node

– CUDA | OpenCL | OpenACC | OpenMP for accelerators

• BUT heterogeneous arch ==> heterogeneous use of languages

• MPI across nodes, OpenMP on a node

– or MPI per processor & OpenMP across cores?

• Already done (assignment #3)

– OpenMP for CPU + CUDA for GPU

– a single thread calls the CUDA kernel for GPU to run

– (calling a CUDA kernel in a parallel region would launch many instances of the kernel, each requesting <<<blocks, threads>>>)

MPI + OMP: Simple Case

• MPI code => runs a copy on each process

• Put one process per node

• When we need to “accelerate” (eg a for loop), then use OpenMP

– the ‘master’ OpenMP thread is the MPI process

– and we have the other cores as the ‘slave’ OpenMP threads

• (Inter-process) Comms is only via MPI

– consider each OpenMP team independent of (and without any knowledge of) other OpenMP teams

Why else may we wish to use OMP rather than MPI? Dynamic load balancing, eg schedule(dynamic).

[Diagrams comparing: an OpenMP program (no MPI); MPI with 1 process launching OpenMP parallel regions; MPI with 4 processes, each launching OpenMP parallel regions. A code sketch of the hybrid case follows.]
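To make the hybrid case concrete, here is a minimal sketch, not taken from the course materials, of an MPI program in which every process opens its own OpenMP parallel region; the file name and output format are purely illustrative.

/* hybrid_hello.c (illustrative): each MPI process launches an OpenMP team */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);               /* plain MPI_Init: no MPI calls inside the parallel region */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* each MPI process launches its own OpenMP team;
       the team sizes do not have to be the same on every process */
    #pragma omp parallel
    {
        printf("MPI process %d of %d: OpenMP thread %d of %d\n",
               rank, nprocs, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}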


MPI with 4 processes, each launching OpenMP parallel regions

• Data exchange between MPI processes

• Via MPI Comms

– pt-to-pt

– collectives

• Easiest if OUTSIDE of the OpenMP regions

REDUCTION TO ROOT

There is no reason we have to have the same size OpenMP team on each MPI process.

Example / DEMO

• ~/MPI/hybrid/ex1.c

– how compile hybrid?

– run: illustrate

• ~/MPI/hybrid/ex2.c

– v. simple example of summation over MPI*OMP

– MPI_Scatter → #pragma omp parallel for reduction → MPI_Reduce
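Since ex1.c and ex2.c are not reproduced in this transcript, the following is a minimal sketch of the same MPI_Scatter → OpenMP reduction → MPI_Reduce pattern; the array size, values and variable names are assumptions. On the compilation question above: hybrid code is normally built with the MPI compiler wrapper plus the compiler's OpenMP flag (eg mpicc -fopenmp with a GCC-based MPI), though the exact wrapper and flag are installation dependent.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N_PER_PROC 1000   /* illustrative chunk size per MPI process */

int main(int argc, char **argv) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *full = NULL;
    double local[N_PER_PROC];
    if (rank == 0) {
        /* root initialises the full array */
        full = malloc((size_t)nprocs * N_PER_PROC * sizeof(double));
        for (int i = 0; i < nprocs * N_PER_PROC; i++) full[i] = 1.0;
    }

    /* MPI_Scatter: distribute one chunk to each process (outside any OpenMP region) */
    MPI_Scatter(full, N_PER_PROC, MPI_DOUBLE,
                local, N_PER_PROC, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* OpenMP reduction over the local chunk */
    double local_sum = 0.0;
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < N_PER_PROC; i++) local_sum += local[i];

    /* MPI_Reduce: combine the per-process sums on the root */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("global sum = %f\n", global_sum);
        free(full);
    }
    MPI_Finalize();
    return 0;
}

Note that both MPI calls sit outside the OpenMP parallel region, matching the "easiest if OUTSIDE of the OpenMP regions" advice above.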

Other Options HANDLE WITH CARE

• A single OMP thread (eg in a master or single region) sends info via MPI

– generally okay

– will be to another master thread

– is pretty much like sending outside an OMP region (see the sketch below)
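A sketch of this option, with the function name, buffer and ranks assumed rather than taken from the course code, might look as follows; it relies on the threaded-MPI support described on the next slide.

#include <mpi.h>
#include <omp.h>

/* Illustrative sketch only: one OMP thread, the master, does the MPI
   point-to-point exchange from inside a parallel region while the other
   threads do OpenMP work.  This needs at least MPI_THREAD_FUNNELED
   support (next slide), because MPI is called while the process is
   multi-threaded, but only by the thread that initialised MPI. */
void exchange_from_master(int rank, double *buf, int n) {
    #pragma omp parallel
    {
        #pragma omp master
        {
            if (rank == 0)
                MPI_Send(buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        /* remaining threads carry on with OpenMP work here */
    }
}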

Other Options

• One or more OMP threads in an OMP parallel region doing MPI Comms (or at the same time)

– "threaded MPI"

– requires MPI_Init_thread rather than MPI_Init

MPI_Init_thread(&argc, &argv, required, &provided)

– requires provided support (implementation dependent) of one of:

MPI_THREAD_FUNNELED

MPI_THREAD_SERIALIZED

MPI_THREAD_MULTIPLE

HANDLE WITH CARE
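As a minimal sketch of requesting that support (illustrative, not from the course code), assuming the funnelled level is enough because only the master thread will make MPI calls:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided;
    /* ask for FUNNELED: only the thread that called MPI_Init_thread
       (the OpenMP master) will make MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    if (provided < MPI_THREAD_FUNNELED) {
        /* the implementation does not give us enough thread support */
        fprintf(stderr, "insufficient MPI thread support (provided=%d)\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... hybrid MPI + OpenMP work as before ... */

    MPI_Finalize();
    return 0;
}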

Performance, Batch etc

• How many cores to request in batch job?

– different batch systems would require a request for:

4 processors * 7 cores (MPI:proc, OMP:core)

or 24 cores (& then worry re placement)

– Chadwick: 24 cores, place MPI per node via mpirun (SHOW)

• Is it efficient use of resources?

– depends on whether it runs faster

– but there is ‘dead time’ (cf Amdahl)

Can think of some tricks…

#pragma omp parallel …
if (omp_get_thread_num()==0) {
    MPI_Send(…);   // or other MPI, eg MPI_Recv on a different MPI process
}
else {
    // do some OMP work on the remaining threads
}

What can we NOT do here?

Further Reading

• https://www.intertwine-project.eu/sites/default/files/images/INTERTWinE_Best_Practice_Guide_MPI%2BOpenMP_1.2.pdf

• (Archer?)
