Information Technology Services, HKU
MPI Parallel Programming
Information Technology Services
The University of Hong Kong
By HPC/Grid Team, Email: [email protected]
Must Know Before Programming…
• HKU HPC Facilities
– hpcpower2: 64-bit Linux cluster consisting of 24 nodes
• each node has two 64-bit quad-core Intel Xeon CPUs running at 3 GHz
– hpcpower: 32-bit Linux cluster consisting of 178 nodes
• 128 nodes with dual 2.8 GHz Xeon processors, and
• 50 nodes with dual 3.06 GHz Xeon processors
How to get an account
• Download the form at http://www.hku.hk/cc/home/services/forms.htm
CF-137e : High Performance Computing Cluster (hpcpower) Account Application (for Non-TOSI Staff/Students)
• Return the application form to CC office.
How to log in to HPCPOWER
• From any PC within the HKU campus network:
    ssh hpcpower
  An SSH client is bundled with Linux. On Windows, download PuTTY to use SSH:
  http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
• From a PC outside the HKU campus network: you must use HKUVPN to get an HKU campus network IP
File Transfer
• From/to any PC within the HKU campus network: use scp or sftp with hpcpower (both are bundled with Linux). On Windows, download WinSCP to transfer files over SSH:
  http://winscp.net/eng/download.php
• From a PC outside the HKU campus network: you must use HKUVPN to get an HKU campus network IP
Program Editing
• You can use vi, emacs or pico to edit programs. Please refer to the UNIX user's guide for details: http://www.hku.hk/cc/handbook/unix/unix_toc.html
• Microsoft Windows users: do not use a standard Microsoft Windows editor such as Notepad/WordPad to edit files that will be used on the Linux system.
• To convert a Windows file to a UNIX file, enter:
    dos2unix file.txt
Resource Management System
• After logging in to the cluster, the user is on the master node. When a program is run, it runs immediately on the master node. This is the "interactive mode", which is convenient for running simple commands like ls, vi, etc. or for editing/compiling a program.
• Any interactive program that runs on the master node will be killed without further notice.
PBS (Portable Batch System)
• Long computing jobs should be submitted through the batch system.
• A submitted job waits in a queue for its turn, and is then sent to one or more compute nodes, to which it has dedicated access until it finishes.
• Because a job has its compute nodes to itself, it runs faster under PBS, and the cluster is utilized more efficiently.
Basic PBS Commands
Command        Description
qsub           Submit a job to the queuing system
qdel           Delete a job that has been submitted to the queuing system
qstat / showq  List all information about queues and jobs
Sample PBS job script
• Sample PBS job script (pbs.cmd)
    #!/bin/sh
    #PBS -N test
    #PBS -m ea
    #PBS -q qdev
    #PBS -l walltime=02:00:00
    #PBS -l nodes=1:ppn=2
    ### Define number of processors used
    NP=`wc -l < $PBS_NODEFILE`
    cd $PBS_O_WORKDIR
    mpirun -np $NP ./a.out
Sample PBS job script
• A line beginning with # is a comment;
• A line beginning with #PBS is a PBS directive;
• The submit script pbs.cmd specifies the name of the job (test), which queue to use (qdev), that it needs both processors on a single node (nodes=1:ppn=2), that it will run for at most 2 hours (walltime), and that TORQUE should email the user when the job exits or aborts (ea). Additionally, the user specifies where and what to execute (./a.out).
Submit a PBS job
$ qsub pbs.cmd
2234.hpcpower
• PBS returns a job identifier of the form jobid.hpcpower where jobid (2234) is an integer number assigned by PBS.
• The job identifier is needed for any actions involving the job, such as checking job status or deleting the job.
qsub options
• Resources requested on the command line take precedence over the directive lines in the script file.
• e.g. if the job is submitted with the command
    qsub -l nodes=2:ppn=8 pbs.cmd
  it will run on 2 compute nodes with 8 processors each, instead of what is stated in the script file pbs.cmd.
Option           Action
qsub -l list     Set job resource list
qsub -N jobname  Set job name to jobname
qsub -q dest     Submit to queue dest
List the status of all submitted jobs
$ qstat -a or $ qa
hpcpower:
Job ID         Username  Queue   Jobname  SessID  NDS  TSK  Req'd Memory  Req'd Time  S  Elap Time
-------------  --------  ------  -------  ------  ---  ---  ------------  ----------  -  ---------
2230.hpcpower  ycleung   qdev    Test     5597    8    --   --            02:00       R  01:02
2231.hpcpower  ycleung   qprod   Job_1    11048   16   --   --            10:00       R  09:17
2233.hpcpower  chliu     oneday  huge     --      16   --   --            15:00       Q  --
Job information provided
• Username: Job owner
• NDS: Number of nodes requested
• Req'd Time: Requested amount of wallclock time
• Elap Time: Elapsed time in the current job state
• S: Job state (E-Exit; R-Running; Q-Queuing)
qstat options
Option             Action
qstat -a           List the status of all jobs
qstat -q           List all queues on the system
qstat -n           List the node(s) allocated to each job
qstat -u username  List all jobs owned by user username
qstat -r           List all running jobs
qstat -f jobid     List all information known about the specified job (jobid)
Delete a Job
$ qdel 2234
where 2234 is the job id.
MPI (Message Passing Interface)
• Is a standard for message-passing communication in parallel programming
• Is a library of functions, not a language
• MPI software is available for free
• MPICH and Open MPI are the most common free versions
• Many vendors have their own optimized version of MPI: IBM, Sun, SGI, Myrinet, ...
What is MPI
[Figure: processes P0, P1, P2 and P3 attached to a communication network through the Message Passing Interface]
Process
A process is a set of executable instructions (a program) which runs on a processor. For maximum performance, each CPU (or core in a multicore machine) will be assigned just a single process.
Message Passing
The method by which data from one processor's memory is copied to the memory of another processor.
What is MPI
• The same program runs on many processors at the same time.
• Single-Program-Multiple-Data (SPMD) style.
    if (my_rank == 0) then
        call Master(....)
    else
        call Worker(....)
    endif
What is MPI
• MPI codes can run on parallel machines with distributed-memory and shared-memory architectures.
• High portability.
• All variables are private to each process.
• Processes communicate via specific communication calls.
General MPI Program Structure
    #include <mpi.h>                          /* MPI include file */
    int main(int argc, char *argv[])
    {
        int np, rank, ierr;                   /* variable declarations */
        ierr = MPI_Init(&argc, &argv);        /* initialize MPI environment */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &np);
        /* Do some work and make message passing calls */
        ierr = MPI_Finalize();                /* terminate MPI environment */
        return 0;
    }
MPI Header File
• C :
    #include "mpi.h"
• FORTRAN :
    include 'mpif.h'
MPI Function Format
• All MPI functions have names that begin with the prefix MPI_ to avoid name collisions.
• C function names are case sensitive but Fortran function names are not.
C :
    ierror = MPI_Xxxx(parameter, ...);
FORTRAN :
    call MPI_Xxxx(argument, ..., ierror)
Initialising MPI environment
• C :
    int MPI_Init(int *argc, char ***argv);
• FORTRAN :
    MPI_Init(ierror)
    INTEGER ierror
Exit MPI environment
• C :
    int MPI_Finalize(void);
• FORTRAN :
    MPI_Finalize(ierror)
    INTEGER ierror
Communicator
• A communicator is a collection of processes that can send messages to each other.
• MPI_COMM_WORLD is predefined and consists of all the processes.
[Figure: processes 0 to 5 inside the communicator MPI_COMM_WORLD]
Size
• How many processes are contained within a communicator?
• C :
    MPI_Comm_size(MPI_Comm comm, int *size)
• FORTRAN :
    MPI_Comm_size(comm, size, ierror)
    INTEGER comm, size, ierror
Rank
• How do we identify different processes?
• C :
    MPI_Comm_rank(MPI_Comm comm, int *rank);
• FORTRAN :
    MPI_Comm_rank(comm, rank, ierror)
    INTEGER comm, rank, ierror
Rank values range from 0 to N-1, where N is the number of processes in the communicator.
A First MPI Program : Helloworld
• C
    #include <stdio.h>
    #include "mpi.h"
    int main(int argc, char **argv) {
        int nproc, myrank, ierr;
        ierr = MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        printf("Hello World! I'm process %d of %d\n", myrank, nproc);
        ierr = MPI_Finalize();
        return 0;
    }
• FORTRAN
    Program Helloworld
    INCLUDE 'mpif.h'
    INTEGER nproc, myrank, ierr
    CALL MPI_Init(ierr)
    CALL MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
    CALL MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
    Print *, "Hello World! I'm process ", myrank, " of ", nproc
    CALL MPI_Finalize(ierr)
    END Program Helloworld
Compile MPI in HPCPOWER
• C
    pgcc -Mmpi program.c
• FORTRAN (FIXED FORM)
    pgf90 -Mmpi program.f
• FORTRAN (FREE FORM)
    pgf90 -Mmpi -Mfree program.f90
Execute MPI in HPCPOWER
• Running with 4 processes on the interactive node :
    mpirun -np 4 ./a.out
    Hello World! I'm process 0 of 4
    Hello World! I'm process 1 of 4
    Hello World! I'm process 2 of 4
    Hello World! I'm process 3 of 4
What is in a message
• An MPI message is an array of elements of a particular MPI datatype.
MPI Datatype
MPI Fortran Datatype   Fortran Datatype   MPI C Datatype       C Datatype
MPI_CHARACTER          character(1)       MPI_CHAR             signed char
                                          MPI_UNSIGNED_CHAR    unsigned char
MPI_INTEGER            integer            MPI_INT              signed int
                                          MPI_SHORT            signed short int
                                          MPI_LONG             signed long int
                                          MPI_UNSIGNED_SHORT   unsigned short int
                                          MPI_UNSIGNED         unsigned int
                                          MPI_UNSIGNED_LONG    unsigned long int
MPI_REAL               real               MPI_FLOAT            float
MPI_DOUBLE_PRECISION   double precision   MPI_DOUBLE           double
                                          MPI_LONG_DOUBLE      long double
MPI_COMPLEX            complex
MPI_LOGICAL            logical
Point-to-point communication
A point-to-point communication always involves exactly two processes. One process acts as the sender and the other acts as the receiver.
[Figure: processes 0 to 5 inside a communicator, with a message passing from a source process to a dest process]
Point-to-point communication
• Communication is between 2 processes.
• The source process sends a message to the destination process.
• Communication takes place within a communicator.
• The destination process is identified by its rank in the communicator.
Sending Messages
• C :
    int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
• FORTRAN :
    MPI_Send(buf, count, datatype, dest, tag, comm, ierror)
    <type> BUF(*)
    INTEGER count, datatype, dest, tag, comm
    INTEGER ierror
Receiving Messages
• C :
    int MPI_Recv(void *buf, int count,
                 MPI_Datatype datatype, int source,
                 int tag, MPI_Comm comm,
                 MPI_Status *status)
• FORTRAN :
    MPI_Recv(buf, count, datatype, source, tag, comm, status, ierror)
    <type> BUF(*)
    INTEGER count, datatype, source, tag, comm
    INTEGER status(MPI_STATUS_SIZE), ierror
Messages
• Messages = envelope + message body
• Envelope
1. Source – the sending process
2. Destination – the receiving process
3. Communicator – specifies the group of processes to which both source and destination belong
4. Tag – used to classify messages
• Message Body
1. Buffer – the message data
2. Datatype – type of message data
3. Count – number of items of type datatype in the buffer
Tag
• The tag is a number specified by the programmer for each message. It is attached to the message being sent.
• When the destination process receives the message, it can check the tag and act accordingly.
Message Matching Rules
• Sender must specify a valid destination RANK.
• Receiver must specify a valid source RANK.
• Sender and receiver must use the same COMMUNICATOR.
• TAG must match.
• DATATYPE must match.
• Receiver's buffer must be large enough.
Example : Greetings (Fortran)
    PROGRAM greetings
    include 'mpif.h'
    integer my_rank
    integer p
    integer source
    integer dest
    integer tag
    character*100 message
    character*10 digit_string
    integer size
    integer status(MPI_STATUS_SIZE)
    integer ierr

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, p, ierr)

    if (my_rank /= 0) then
        write(digit_string,FMT="(I3)") my_rank
        message = 'Greetings from process ' // trim(digit_string) // '!'
        dest = 0
        tag = 0
        call MPI_Send(message, len_trim(message), &
            & MPI_CHARACTER, dest, tag, MPI_COMM_WORLD, ierr)
    else ! my_rank == 0
        do source = 1, p-1
            tag = 0
            call MPI_Recv(message, len(message), &
                & MPI_CHARACTER, source, tag, &
                & MPI_COMM_WORLD, status, ierr)
            write(6,FMT="(A)") message
        enddo
    endif
    call MPI_Finalize(ierr)
    END PROGRAM greetings
Example : Greetings (C)
    #include <stdio.h>
    #include <string.h>
    #include "mpi.h"
    int main(int argc, char* argv[]) {
        int my_rank;           /* rank of process */
        int p;                 /* number of processes */
        int source;            /* rank of sender */
        int dest;              /* rank of receiver */
        int tag = 0;           /* tag for messages */
        char message[100];     /* storage for message */
        MPI_Status status;     /* return status for receive */

        /* Start up MPI */
        MPI_Init(&argc, &argv);
        /* Find out process rank */
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
        /* Find out number of processes */
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        if (my_rank != 0) {
            /* Create message */
            sprintf(message, "Greetings from process %d!", my_rank);
            dest = 0;
            /* Use strlen+1 so that '\0' gets transmitted */
            MPI_Send(message, strlen(message)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        } else { /* my_rank == 0 */
            for (source = 1; source < p; source++) {
                MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
                printf("%s\n", message);
            }
        }
        /* Shut down MPI */
        MPI_Finalize();
        return 0;
    } /* main */
Example : Greetings
• % mpirun -np 4 ./a.out
    Greetings from process 1 !
    Greetings from process 2 !
    Greetings from process 3 !
MPI_Sendrecv
    MPI_Sendrecv(*sendbuf, sendcount, sendtype, dest, sendtag,
                 *recvbuf, recvcount, recvtype, source, recvtag,
                 comm, *status)
Wildcard
• MPI_Recv can use MPI_ANY_SOURCE and MPI_ANY_TAG as arguments.
• There is no wildcard for the communicator.
    call MPI_Recv(message, 1, MPI_INTEGER, &
        MPI_ANY_SOURCE, MPI_ANY_TAG, &
        MPI_COMM_WORLD, status, ierror)
• MPI_Send cannot use wildcards.
Communication Envelope
status
• Envelope information is returned by MPI_Recv in its status argument.
• status has 3 elements :
– status(MPI_SOURCE)
– status(MPI_TAG)
– status(MPI_ERROR)
Received Message Count
• status also carries information on the size of the message received (i.e. the number of elements). However, this is not directly accessible.
• To determine the size of the message received, we can call :
    call MPI_Get_count(status, datatype, count, ierror)
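Message Order Preservation
• Messages do not overtake each other: messages sent from the same source to the same destination in the same communicator are received in the order in which they were sent.
[Figure: processes 0 to 5 inside a communicator, with messages travelling from source to dest]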
System Buffer
[Figure: send buffer, system buffer, network and receive buffer]
Blocking Communication
• MPI_Recv blocks until the message has been received into the buffer specified by the buffer argument.
• If no message is available, the process remains blocked until one becomes available.
Blocking Communication
• MPI_Send blocks until either
– the whole message has arrived at the receiver, or
– the whole message has been copied into a system buffer.
• After MPI_Send returns, the send buffer is safe for reuse.
Blocking Communication
• Correct version: succeeds even if no system buffer is available.
IF (rank == 0) THEN
call MPI_SEND(sendbuf,....to P(1))
call MPI_RECV(recvbuf,....from P(1))
ELSE ! rank == 1
call MPI_RECV(recvbuf,....from P(0))
call MPI_SEND(sendbuf,....to P(0))
ENDIF
Blocking Communication
• Common bug: always deadlocks, because both processes wait in MPI_RECV and neither reaches its MPI_SEND.
IF (rank == 0) THEN
call MPI_RECV(recvbuf,....from P(1))
call MPI_SEND(sendbuf,....to P(1))
ELSE ! rank == 1
call MPI_RECV(recvbuf,....from P(0))
call MPI_SEND(sendbuf,....to P(0))
ENDIF
Blocking Communication
• Unsafe: at least one of the two messages sent must be buffered.
• Deadlocks if the system buffer is not big enough.
IF (rank == 0) THEN
call MPI_SEND(sendbuf,....to P(1))
call MPI_RECV(recvbuf,....from P(1))
ELSE ! rank == 1
call MPI_SEND(sendbuf,....to P(0))
call MPI_RECV(recvbuf,....from P(0))
ENDIF
Non-Blocking Communication
• One can improve performance by overlapping communication and computation.
Non-Blocking Send
• A non-blocking send start call initiates the send operation, but does not complete it.
• The send start call may return before the message has been copied out of the send buffer.
• Communication may proceed concurrently with computation.
• A separate send complete operation is needed to complete the communication, i.e. to verify that the data has been copied out of the send buffer.
Non-Blocking Send Example
call MPI_Isend(x, 1, MPI_INTEGER, &
&1,0,MPI_COMM_WORLD,request1,ierror)
.....
! Do something here while the message is being sent.
call MPI_Wait(request1,status,ierror)
MPI_Isend Syntax
MPI_Isend(buf, count, datatype, dest, &
&tag, comm, request, ierror)
<type> buf(*)
INTEGER, intent(in) :: &
count, datatype, dest, tag, comm
INTEGER, intent(out) :: request, ierror
MPI_Isend
• The sender should not access any part of the send buffer after MPI_Isend is called, until the send completes.
Request
• request is a handle, i.e. the object referenced by request is system defined and cannot be directly accessed by the user.
• In FORTRAN, every request should be declared as an integer.
Request
• Its purpose is to identify the operation started by the non-blocking call.
• It matches the operation that initiates the communication with the operation that terminates it.
• It stores information about the status of the pending communication operation.
Communication Completion
• Completion of a send indicates that the whole message has been copied out and the sender is now free to update the locations in the send buffer.
• It doesn't indicate that the message has been received; the whole message may only have been put in a system buffer.
MPI_Wait
• MPI_Wait blocks until the operation identified by request completes.
• When MPI_Wait returns, request is set to MPI_REQUEST_NULL. This means there is no pending operation associated with request. The opaque object associated with it is also deallocated.
MPI_Wait Syntax
MPI_Wait(request, status, ierror)
INTEGER, intent(inout) :: request, &
& ierror
INTEGER, intent(out) :: &
& status(MPI_STATUS_SIZE)
MPI_Test
• MPI_Test is the non-blocking version of MPI_Wait.
MPI_Test(request, flag, status, ierror)
LOGICAL, intent(out) :: flag
INTEGER, intent(inout) :: request, &
& ierror
INTEGER, intent(out) :: &
& status(MPI_STATUS_SIZE)
MPI_Test
• MPI_Test returns immediately. It does not wait for the communication operation to complete.
• It sets flag to .TRUE. if the operation identified by request is complete. It also sets request to MPI_REQUEST_NULL.
• It sets flag to .FALSE. if the operation identified by request is not complete.
Nonblocking Receive Example
• Method 1: MPI_Wait
    MPI_Irecv(buf, …, req);
    … do work not using buf
    MPI_Wait(req, status);
    … do work using buf
• Method 2: MPI_Test
    MPI_Irecv(buf, …, req);
    MPI_Test(req, flag, status);
    while (flag == 0) {
        … do work not using buf
        MPI_Test(req, flag, status);
    }
    … do work using buf
MPI_Irecv
• The receiver should not access any part of the receive buffer after MPI_Irecv is called, until the receive completes.
Multiple Completion
• MPI provides routines which test multiple communications at once.
– MPI_WAITANY, MPI_TESTANY
– MPI_WAITALL, MPI_TESTALL
– MPI_WAITSOME, MPI_TESTSOME
Probe
• The MPI_Probe and MPI_Iprobe operations allow incoming messages to be checked for without actually receiving them.
• The user can then decide how to receive them based on the information returned by probe.
• e.g. allocate memory for receive buffer according to the length of the probed message.
Cancel
• The MPI_Cancel operation allows pending communications to be cancelled.
Collective communication
A collective communication involves communication among all processes in a process group (communicator).
[Figure: processes 0 to 5 inside a communicator, with one source process sending to the others, e.g. MPI_Bcast]
Collective communication
• All collective communication events are blocking.
• A call to a collective routine typically replaces several point-to-point calls.
• The source code is much more readable.
• Optimized forms of the collective routines are often faster than the equivalent operation expressed in terms of point-to-point routines.
Collective communication
[Figure: the collective operations broadcast, scatter, gather and reduction; in the reduction panel the values 1, 3, 5 and 7 are combined into 16]
MPI_Barrier
• MPI_Barrier (comm)
• Creates a barrier synchronization in a group (communicator): each process blocks at the call until all processes in the group have reached it.
MPI_Bcast
• Correct: call MPI_Bcast on all processes
    MPI_Bcast (message, … , 0, MPI_COMM_WORLD)
• Common mistake
    if (I am master) then
        MPI_Bcast (message, … , 0, MPI_COMM_WORLD)
    else
        MPI_Recv(message, … , 0, MPI_COMM_WORLD)
    endif
MPI_Scatter
MPI_Gather
MPI_Allgather
MPI_Reduce
MPI_Allreduce
MPI_Scan
MPI Reduce and Scan predefined operations
MPI OP      Operation             C datatype
MPI_MAX     Maximum               integer, float
MPI_MIN     Minimum               integer, float
MPI_SUM     Sum                   integer, float
MPI_PROD    Product               integer, float
MPI_LAND    Logical AND           integer
MPI_BAND    Bitwise AND           integer, MPI_BYTE
MPI_LOR     Logical OR            integer
MPI_BOR     Bitwise OR            integer, MPI_BYTE
MPI_LXOR    Logical exclusive OR  integer
MPI_BXOR    Bitwise exclusive OR  integer, MPI_BYTE
MPI_MAXLOC  Max val and location  float, double, long double
MPI_MINLOC  Min val and location  float, double, long double
Example : Matrix-vector Multiplication
Schematic of parallel decomposition for matrix-vector multiplication, A=B*C. The vector A is depicted in yellow. The matrix B and vector C are depicted in multiple colors representing the portions, columns, and elements assigned to each processor, respectively.
Example : Matrix-vector Multiplication
A=B*C
\[
\begin{aligned}
a_0 &= b_{0,0}c_0 + b_{0,1}c_1 + b_{0,2}c_2 + b_{0,3}c_3 \\
a_1 &= b_{1,0}c_0 + b_{1,1}c_1 + b_{1,2}c_2 + b_{1,3}c_3 \\
a_2 &= b_{2,0}c_0 + b_{2,1}c_1 + b_{2,2}c_2 + b_{2,3}c_3 \\
a_3 &= b_{3,0}c_0 + b_{3,1}c_1 + b_{3,2}c_2 + b_{3,3}c_3
\end{aligned}
\]
Each column of terms is computed by one process (P0, P1, P2, P3 from left to right); a Reduction (SUM) over P0 to P3 gives the elements of A.
Further Reading
Group and Topology
Group and Communicator Management Routines
• Purpose
– Allows you to organize tasks, based upon function, into task groups.
– Enables collective communication operations across a subset of related tasks.
– Provides a basis for implementing virtual topologies.
Group and Communicator Management Routines
• MPI_Group_difference
• MPI_Group_excl
• MPI_Group_incl
• MPI_Group_intersection
• MPI_Group_union
• MPI_Group_compare
• MPI_Group_rank
• MPI_Group_size
• MPI_Group_free
• MPI_Comm_group
• MPI_Comm_create
• MPI_Comm_dup
• MPI_Comm_compare
• MPI_Comm_free
Virtual Topologies
• Cartesian, tree or graph
• Cartesian example :
Example
    call MPI_Sendrecv( &
        sendmsg, 1, MPI_INTEGER, northprocess, sendtag, &
        recvmsg, 1, MPI_INTEGER, southprocess, recvtag, &
        comm, status, ierror )
Virtual Topologies
• MPI_Cart_coords
• MPI_Cart_create
• MPI_Cart_get
• MPI_Cart_map
• MPI_Cart_rank
• MPI_Cart_shift
• MPI_Cart_sub
• MPI_Cartdim_get
• MPI_Dims_create
• MPI_Graph_create
• MPI_Graph_get
• MPI_Graph_map
• MPI_Graph_neighbors
• MPI_Graph_neighbors_count
• MPI_Graphdims_get
• MPI_Topo_test
Derived Datatypes
The count argument
• MPI_Send and MPI_Recv have a count argument. E.g.,
real, dimension(100) :: vector
if ( my_rank == 0 ) then
call MPI_Send(vector,50,MPI_REAL,1,0, &
&MPI_COMM_WORLD,ierror)
else
call MPI_Recv(vector,50,MPI_REAL,0,0, &
&MPI_COMM_WORLD,status,ierror)
endif
The count argument
• Many data items can be sent together. However, they must be :
– contiguous in memory, and
– of the same FORTRAN datatype
Why Derived Datatypes
• Consider
    double precision results(IMAX,JMAX)
  where we want to send results(5,1), results(5,2), ..., results(5,JMAX). The data to be sent does not lie in one contiguous area of memory.
Why Derived Datatypes
• One solution is to use a FORTRAN 90 array section. However,
• this method is not efficient because it involves a large amount of memory copying.
• Since a FORTRAN 90 array section is implemented by memory copying, it cannot be used in non-blocking communication.
Why Derived Datatypes
• Consider
INTEGER nResult, n, m
DOUBLE PRECISION result (RMAX)
• where it is required to send nResult followed by result. The data is of mixed type.
Discussion of examples
• The programmer can make consecutive MPI calls to send and receive each data element in turn, which is slow and clumsy.
• The programmer may copy the data to a buffer before sending it, but this wastes memory and is long-winded.
MPI_TYPE_VECTOR
• count = 2
• stride = 5
• blocklength = 3
MPI_TYPE_VECTOR Syntax
MPI_TYPE_VECTOR( count, &
blocklength, stride, oldtype, &
newtype, ierror )
INTEGER, intent(in) :: count, &
blocklength, stride, oldtype
INTEGER, intent(out):: newtype, &
ierror
Example : Sending Sub-matrix
[Figure: a matrix with columns numbered 1 to 8 and rows numbered 1 to 10, with row 5 marked as the part to be sent]
Example : Sending Sub-matrix
double precision :: results (10,8)
INTEGER :: newtype
....
call MPI_TYPE_VECTOR(8, 1, 10, MPI_DOUBLE_PRECISION, newtype, ierror)
call MPI_TYPE_COMMIT(newtype, ierror)
call MPI_SEND(results(5,1),1,newtype, dest, tag, comm, ierror)
MPI_TYPE_COMMIT
• The new datatype is "committed" with a call to MPI_TYPE_COMMIT. It can then be used in any number of communications. The form of MPI_TYPE_COMMIT is :
call MPI_TYPE_COMMIT(newtype,ierror)
MPI_TYPE_FREE
• MPI_TYPE_FREE deallocates the datatype.
call MPI_TYPE_FREE(datatype, ierror)
• datatype is returned as MPI_DATATYPE_NULL.
• After calling MPI_TYPE_FREE, the datatype can be redefined.
Communicator
• Parallel libraries also send messages. A library message may accidentally use the same tag as one of the user's own messages. We need to distinguish messages sent by library routines from messages sent by our own code.
• The solution adopted by MPI is to add a further piece of information: a communicator.
• Two processes using distinct communicators cannot receive messages from each other.