
Page 1:

Dr Michael K Bane, G14, Computer Science, University of Liverpool, [email protected], https://cgi.csc.liv.ac.uk/~mkbane/COMP528

COMP528: Multi-core and Multi-Processor Computing


Page 2:

So far…

• Why and what HPC / multi-core / multi-processing [#1 - #3]

• Use of HPC facility

– batch, timing variations

• Theory

– Amdahl’s Law, Gustafson

– deadlock, livelock

• Message Passing Interface (MPI)

– distributed memory, many nodes, scaling up nodes & memory

– wrappers for compiler and launching

• OpenMP

– shared (globally addressable) memory, single node

Page 3:

So far…

• But not only CPU

• GPU [#18 - #24]

– CUDA: <<<blocks, threads>>> & writing kernels

– Directives: OpenACC [#23], OpenMP 4.0+ [#24]

– OpenCL [#24]


Page 4:

Still to come

• Vectorisation

– including some ‘optimisation’

• Hybrid programming

– how to use OpenMP from MPI

• Libraries

• “Black box” codes

– Using what we have learned to understand how these can help

• Profiling & Optimisation

Page 5:

Still to come

• a “summary” lecture

– what’s key to remember

– opportunity to ask Qs

– what might be interesting but would need another course…

and did somebody say cloud?

Page 6:

HYBRID

Page 7:

Today’s High End Architectures

• processors:

– many cores

• each with vector unit

– maybe specialised units

• TPU, Tensor cores (etc.) for Machine Learning

• nodes

– one or more processors

– zero or more GPUs

• potentially the likes of Xeon Phi, FPGA, custom ASIC, …

Page 8:

Today’s High End Architectures

• i.e. an eclectic mix

• Needing appropriate programming for max performance

– MPI for inter-node

– MPI or OpenMP for intra-node

– CUDA | OpenCL | OpenACC | OpenMP for accelerators

• BUT heterogeneous architecture ==> heterogeneous use of languages

Page 9:

• MPI across nodes, OpenMP on a node

– or MPI per processor & OpenMP across cores?

• Already done (assignment #3)

– OpenMP for CPU + CUDA for GPU

– a single thread calls the CUDA kernel for the GPU to run

– (calling a CUDA kernel in a parallel region would launch many instances of the kernel, each requesting <<<blocks, threads>>>)

Page 10:

MPI + OMP: Simple Case

• MPI code => runs a copy on each process

• Put one process per node

• When we need to “accelerate” (e.g. a for loop), then use OpenMP

– the ‘master’ OpenMP thread is the MPI process

– and we have the other cores as the ‘slave’ OpenMP threads

• (Inter-process) Comms is only via MPI

– consider each OpenMP team independent of (and without any knowledge of) other OpenMP teams

(Why else might we wish to use OMP rather than MPI? One answer: dynamic load balancing, e.g. schedule(dynamic).)
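As a minimal sketch of this simple pattern (illustrative only, not one of the course examples, and assuming the problem size divides evenly among the processes): every MPI process runs a copy of the code, and an OpenMP parallel for accelerates the local loop, with the MPI process acting as the master thread of its team.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

#define N 1000000   /* illustrative problem size, assumed divisible by the number of processes */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);               /* a copy of this program runs on each MPI process */

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each process (ideally one per node) owns a slice of the iteration space */
    int chunk = N / size;
    double local = 0.0;

    /* the MPI process is the 'master' OpenMP thread; the other cores on the
       node join in as the remaining threads of the team */
    #pragma omp parallel for reduction(+:local)
    for (int i = rank * chunk; i < (rank + 1) * chunk; i++) {
        local += 1.0 / (double)(i + 1);
    }

    printf("rank %d of %d: local sum = %f\n", rank, size, local);

    MPI_Finalize();
    return 0;
}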

Page 11:

Figure: three configurations side by side: an OpenMP program (no MPI); MPI with 1 process launching OpenMP parallel regions; MPI with 4 processes, each launching OpenMP parallel regions.


Page 12:

MPI with 4 processes, each launching OpenMP parallel regions

• Data exchange between MPI processes

• Via MPI comms

– pt-to-pt

– collectives

• Easiest if OUTSIDE of the OpenMP regions

REDUCTION TO ROOT

There is no reason we have to have the same size OpenMP team on each MPI process
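A hedged sketch of these two points (not from the course files; names and team sizes are illustrative): each rank may pick a different OpenMP team size, and all inter-process exchange, here point-to-point sends gathered by the root, happens outside the OpenMP region.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* nothing requires the same OpenMP team size on every MPI process:
       purely for illustration, each rank picks its own */
    omp_set_num_threads(2 + rank % 3);

    double local = 0.0;
    #pragma omp parallel reduction(+:local)
    {
        local += 1.0;                    /* stand-in for real per-thread work */
    }

    /* data exchange between MPI processes, OUTSIDE the OpenMP region:
       point-to-point here; a collective (e.g. MPI_Reduce to root) also works */
    if (rank != 0) {
        MPI_Send(&local, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else {
        double total = local, incoming;
        for (int src = 1; src < size; src++) {
            MPI_Recv(&incoming, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            total += incoming;
        }
        printf("root gathered total = %f\n", total);
    }

    MPI_Finalize();
    return 0;
}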

Page 13:

Example / DEMO

• ~/MPI/hybrid/ex1.c

– how to compile hybrid code?

– run it to illustrate

• ~/MPI/hybrid/ex2.c

– a very simple example of summation across the MPI processes * OMP threads

– MPI_Scatter → #pragma omp parallel for reduction → MPI_Reduce
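The course files themselves are not reproduced here. On compilation: hybrid codes are typically built with the MPI compiler wrapper plus the OpenMP flag, e.g. mpicc -fopenmp or, with the Intel toolchain, mpiicc -qopenmp; the exact wrapper and flag depend on the installation. Below is a minimal sketch of the MPI_Scatter, OpenMP reduction, MPI_Reduce pattern the second example describes, assuming N is divisible by the number of processes.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <omp.h>

#define N 1024   /* assumed divisible by the number of MPI processes */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = N / size;
    double *data = NULL;
    double *part = malloc(chunk * sizeof(double));

    if (rank == 0) {                      /* root sets up the full array */
        data = malloc(N * sizeof(double));
        for (int i = 0; i < N; i++) data[i] = (double)i;
    }

    /* MPI_Scatter: one chunk of the array to each process */
    MPI_Scatter(data, chunk, MPI_DOUBLE, part, chunk, MPI_DOUBLE,
                0, MPI_COMM_WORLD);

    /* OpenMP parallel for reduction over the local chunk */
    double local = 0.0;
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < chunk; i++) local += part[i];

    /* MPI_Reduce: combine the per-process sums on the root */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f (expected %f)\n", total, (double)N * (N - 1) / 2.0);

    free(part);
    free(data);
    MPI_Finalize();
    return 0;
}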

Page 14:

Other Options (HANDLE WITH CARE)

• A single OMP thread (e.g. in a master or single region) sends info via MPI

– generally okay

– will be to another master thread

– is pretty much like sending outside the OMP region
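A hedged sketch of this option (the function, partner rank and message are illustrative, not from the course code): the master thread of each team makes the MPI call from inside the parallel region, which strictly wants the MPI library initialised with at least MPI_THREAD_FUNNELED support, as covered on the next slide.

#include <mpi.h>
#include <omp.h>

/* Illustrative only: the master thread of the OpenMP team sends (or receives)
   a single int to/from the master thread of a partner MPI process, much as if
   the call were made outside the parallel region. */
void exchange_flag(int rank, int partner, int *flag) {
    #pragma omp parallel
    {
        #pragma omp master
        {
            if (rank < partner)
                MPI_Send(flag, 1, MPI_INT, partner, 0, MPI_COMM_WORLD);
            else
                MPI_Recv(flag, 1, MPI_INT, partner, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }
        /* the remaining threads carry on with their OpenMP work here */
    }
}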

Page 15:

Other Options

• One or more OMP threads in an OMP parallel region doing MPI comms (possibly at the same time)

– “threaded MPI”

– requires MPI_Init_thread rather than MPI_Init

MPI_Init_thread(&argc, &argv, required, &provided)

– requires provided support (implementation dependent) of one of:

MPI_THREAD_FUNNELED

MPI_THREAD_SERIALIZED

MPI_THREAD_MULTIPLE

HANDLE WITH CARE
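A minimal sketch of the initialisation step (the error handling is illustrative): request the level of thread support you need and check what the implementation actually provided; the thread-support constants are ordered, so they can be compared directly.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    /* ask for FUNNELED support (only the master thread will make MPI calls);
       the library reports the level it can actually provide */
    int required = MPI_THREAD_FUNNELED, provided;
    MPI_Init_thread(&argc, &argv, required, &provided);

    if (provided < required) {
        fprintf(stderr, "insufficient MPI thread support (provided=%d)\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... hybrid MPI + OpenMP work as before ... */

    MPI_Finalize();
    return 0;
}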

Page 16:

Performance, Batch etc

• How many cores to request in a batch job?

– different batch systems would require a request for:

4 processors * 7 cores (MPI: one process per processor, OMP: one thread per core)

24 cores (and then worry about placement)

– Chadwick: 24 cores, place one MPI process per node via mpirun (SHOW)

• Is it efficient use of resources?

– depends on whether it runs faster

– but there is ‘dead time’ (cf Amdahl)
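How the processes and threads actually land depends on the batch system and the mpirun options, which differ from machine to machine, so a small diagnostic like the hedged sketch below (not a course file) is a common way to check that the placement matches what was requested.

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);

    /* every thread of every process reports where it is running, so the
       rank-to-node and thread-to-core layout can be inspected */
    #pragma omp parallel
    {
        printf("MPI rank %d on %s: OpenMP thread %d of %d\n",
               rank, host, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}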

Page 17:

Can think of some tricks…

#pragma omp parallel …
{
    if (omp_get_thread_num() == 0) {
        MPI_Send(…);   // or other MPI, e.g. MPI_Recv on a different MPI process
    } else {
        // do some OMP work on the remaining threads
    }
}

What can we NOT do here?

Page 18:

Further Reading

• https://www.intertwine-project.eu/sites/default/files/images/INTERTWinE_Best_Practice_Guide_MPI%2BOpenMP_1.2.pdf

• (Archer?)