Programming Clusters using the Message-Passing Interface (MPI)

Dr. Rajkumar Buyya
Cloud Computing and Distributed Systems (CLOUDS) Laboratory
The University of Melbourne, Melbourne, Australia
www.cloudbus.org
Outline
Introduction to Message Passing Environments
HelloWorld MPI Program
Compiling and Running MPI Programs
  on interactive and batch clusters
Elements of the HelloWorld Program
MPI Routines Listing
Communication in MPI Programs
Summary
Message-Passing Programming Paradigm
Each processor in a message-passing program runs a sub-program:
  written in a conventional sequential language
  all variables are private
  sub-programs communicate via special subroutine calls
[Figure: each node pairs a processor (P) with its own memory (M); the nodes are linked by an interconnection network]
SPMD: A dominant paradigm for writing data parallel applications
main(int argc, char **argv)
{
    if (process is assigned Master role) {
        /* Assign work, coordinate workers, and collect results */
        MasterRoutine(/* arguments */);
    } else { /* it is a worker process */
        /* Interact with master and other workers; do the work
           and send results to the master */
        WorkerRoutine(/* arguments */);
    }
}
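A minimal concrete sketch of this pattern in C, assuming rank 0 takes the master role; MasterRoutine and WorkerRoutine here are hypothetical stubs standing in for real work:

#include <mpi.h>
#include <stdio.h>

/* Hypothetical placeholders for the real master/worker logic */
void MasterRoutine(int numtasks) { printf("Master coordinating %d tasks\n", numtasks); }
void WorkerRoutine(int rank)     { printf("Worker %d doing work\n", rank); }

int main(int argc, char **argv)
{
    int numtasks, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)              /* process 0 takes the master role */
        MasterRoutine(numtasks);
    else                        /* every other process is a worker */
        WorkerRoutine(rank);

    MPI_Finalize();
    return 0;
}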
Messages
Messages are packets of data moving between sub-programs.
The message passing system has to be told the following information:
  Sending processor
  Source location
  Data type
  Data length
  Receiving processor(s)
  Destination location
  Destination size
Messages
Access: each sub-program needs to be connected to a message passing system
Addressing: messages need to have addresses to be sent to
Reception: it is important that the receiving process is capable of dealing with the messages it is sent
A message passing system is similar to: post office, phone line, fax, e-mail, etc.
Message types: point-to-point, collective, synchronous (telephone) / asynchronous (postal)
Message Passing Systems and MPI
www.mpi-forum.org
Initially, each manufacturer developed their own message passing interface: a wide range of features, often incompatible.
The MPI Forum brought together several vendors and users of HPC systems from the US and Europe to overcome these limitations.
It produced a document defining a standard, called the Message Passing Interface (MPI), derived from the experience and the common features/issues addressed by many message-passing libraries. It aimed:
  to provide source-code portability
  to allow efficient implementation
  to provide a high level of functionality
  to support heterogeneous parallel architectures
  to support parallel I/O (in MPI 2.0)
MPI 1.0 contains over 115 routines/functions.
General MPI Program Structure
  MPI include file
        ↓
  Initialise MPI environment
        ↓
  Do work and perform message communication
        ↓
  Terminate MPI environment
MPI programs
MPI is a library - there are NO language changes
Header files, C:
  #include <mpi.h>
MPI function format, C:
  error = MPI_Xxxx(parameter, ...);
  or, if the error code is ignored:
  MPI_Xxxx(parameter, ...);
Example - C
#include <mpi.h>
/* include other usual header files */

int main(int argc, char **argv)
{
    /* initialize MPI */
    MPI_Init(&argc, &argv);

    /* main part of program */

    /* terminate MPI */
    MPI_Finalize();
    return 0;
}
MPI helloworld.c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int numtasks, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("Hello World from process %d of %d\n", rank, numtasks);

    MPI_Finalize();
    return 0;
}
MPI Programs Compilation and Execution
Compile and Run Commands (LAM MPI)

Compile:
  > mpicc helloworld.c -o helloworld
Run (hosts are picked from the configuration file):
  > lamboot machines.list
  > mpirun -np 3 helloworld      (-np sets the number of processes)
The file machines.list contains the node list:
  manjra.cs.mu.oz.au
  node1
  node2
  ..
  node6
  node13
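Note: LAM/MPI has since been superseded by Open MPI; there the equivalent run needs no lamboot step and would look roughly like this, assuming the same machines.list file:

  > mpicc helloworld.c -o helloworld
  > mpirun -np 3 --hostfile machines.list ./helloworld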
Sample Run and Output
A run with 3 processes:
  > lamboot
  > mpirun -np 3 helloworld
Output:
  Hello World from process 0 of 3
  Hello World from process 1 of 3
  Hello World from process 2 of 3
Sample Run and Output
A run with 6 processes:
  > lamboot
  > mpirun -np 6 helloworld
Output:
  Hello World from process 0 of 6
  Hello World from process 3 of 6
  Hello World from process 1 of 6
  Hello World from process 5 of 6
  Hello World from process 4 of 6
  Hello World from process 2 of 6
Note: Process execution need not be in process number order.
Sample Run and Output
A run with 6 processes:
  > lamboot
  > mpirun -np 6 helloworld
Output:
  Hello World from process 0 of 6
  Hello World from process 3 of 6
  Hello World from process 1 of 6
  Hello World from process 2 of 6
  Hello World from process 5 of 6
  Hello World from process 4 of 6
Note: the output order changes from run to run. The mapping of processes to machines can differ on each run, and those machines may carry different loads; hence the difference.
More on MPI Program Elements and Error Checking
Initializing MPI
The first MPI routine called in any MPI program must be MPI_Init.
The C version accepts the arguments to main
int MPI_Init(int *argc, char ***argv);
MPI_Init must be called by every MPI program
Making multiple MPI_Init calls is erroneous
MPI_COMM_WORLD
MPI_Init defines a communicator called MPI_COMM_WORLD for every process that calls it
All MPI communication calls require a communicator argument
MPI processes can only communicate if they share a communicator.
A communicator contains a group which is a list of processes
Each process has its rank within the communicator
A process can have several communicators
[Figure: MPI_COMM_WORLD containing 20 processes with ranks 0-19; two user-created communicators group subsets of them, each re-ranking its members from 0]
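Communicators like the user-created ones in the figure can be carved out of MPI_COMM_WORLD. A minimal sketch using MPI_Comm_split, splitting processes into two groups by even/odd world rank (the color and key arguments here are illustrative choices):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_rank, new_rank;
    MPI_Comm new_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* color selects the group; key orders ranks within the new communicator */
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &new_comm);
    MPI_Comm_rank(new_comm, &new_rank);

    printf("World rank %d has rank %d in its new communicator\n",
           world_rank, new_rank);

    MPI_Comm_free(&new_comm);
    MPI_Finalize();
    return 0;
}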
Communicators
MPI uses objects called communicators to define which collection of processes may communicate with each other.
Every process has a unique integer identifier, or rank, assigned by the system when the process initialises. A rank is sometimes called a process ID.
Processes can request information from a communicator:
  MPI_Comm_rank(MPI_Comm comm, int *rank)
    returns the rank of the calling process in comm
  MPI_Comm_size(MPI_Comm comm, int *size)
    returns the size of the group in comm
Finishing up
An MPI program should call MPI_Finalize when all communications have completed
Once called, no other MPI calls can be made
Aborting: MPI_Abort(comm, errorcode)
  Attempts to abort all processes listed in comm
  If comm = MPI_COMM_WORLD, the whole program terminates
Hello World with Error Check
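The original slide's listing is not reproduced here; what follows is a minimal sketch of such an error-checked version, relying on the standard convention that every MPI call returns MPI_SUCCESS on success:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int numtasks, rank, rc;

    rc = MPI_Init(&argc, &argv);
    if (rc != MPI_SUCCESS) {
        /* initialisation failed: abort every process in the job */
        printf("Error starting MPI program. Terminating.\n");
        MPI_Abort(MPI_COMM_WORLD, rc);
    }

    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("Hello World from process %d of %d\n", rank, numtasks);

    MPI_Finalize();
    return 0;
}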
Display Hostname of MPI Process
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int numtasks, rank, resultlen;
    static char mpi_hostname[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(mpi_hostname, &resultlen);

    printf("Hello World from process %d of %d running on %s\n",
           rank, numtasks, mpi_hostname);

    MPI_Finalize();
    return 0;
}
MPI Routines
MPI Routines – C and Fortran
Environment Management
Point-to-Point Communication
Collective Communication
Process Group Management
Communicators
Derived Types
Virtual Topologies
Miscellaneous Routines
Environment Management Routines
Point-to-Point Communication
The simplest form of message passing: one process sends a message to another
There are several variations on how the sending of a message can interact with the execution of the sub-program
Point-to-Point variations
Synchronous sends provide information about the completion of the message (e.g. fax machines)
Asynchronous sends only know when the message has left (e.g. post cards)
Blocking operations only return from the call when the operation has completed
Non-blocking operations return straight away - you can test/wait later for completion
Point-to-Point Communication
Collective Communications
Collective communication routines are higher-level routines involving several processes at a time
They can be built out of point-to-point communications
Barriers: synchronise processes
Broadcast: one-to-many communication
Reduction operations: combine data from several processes to produce a (usually single) result - see the sketch below
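A minimal sketch of broadcast and reduction, assuming one int broadcast from rank 0 and a sum of ranks reduced back to rank 0:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        value = 42;                       /* data known only to the root */

    /* every process receives rank 0's value */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* combine each process's rank into a single sum at rank 0 */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Broadcast value %d; sum of ranks = %d\n", value, sum);

    MPI_Finalize();
    return 0;
}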
Collective Communication Routines
Process Group Management Routines
Communicators Routines
Derived Type Routines
Virtual Topologies Routines
Miscellaneous Routines
MPI Communication Routines and Examples
MPI Messages
A message contains a number of elements of some particular data type
MPI data types:
  basic types
  derived types - can be built up from basic types
"C" types are different from Fortran types
MPI Basic Data types - C
MPI datatype          C datatype
MPI_CHAR              signed char
MPI_SHORT             signed short int
MPI_INT               signed int
MPI_LONG              signed long int
MPI_UNSIGNED_CHAR     unsigned char
MPI_UNSIGNED_SHORT    unsigned short int
MPI_UNSIGNED          unsigned int
MPI_UNSIGNED_LONG     unsigned long int
MPI_FLOAT             float
MPI_DOUBLE            double
MPI_LONG_DOUBLE       long double
MPI_BYTE              (no C equivalent)
MPI_PACKED            (no C equivalent)
Point-to-Point Communication
Communication between two processes: a source process sends a message to a destination process
Communication takes place within a communicator
The destination process is identified by its rank in the communicator
MPI provides four communication modes for sending messages: standard, synchronous, buffered, and ready (MPI_Send, MPI_Ssend, MPI_Bsend, MPI_Rsend)
There is only one mode for receiving
Standard Send
Completes once the message has been sent (note: it may or may not have been received)
Programs should obey the following rules:
  Do not assume the send will complete before the receive begins - this can lead to deadlock
  Do not assume the send will complete after the receive begins - this can lead to non-determinism
  Processes should be eager readers - they should guarantee to receive all messages sent to them - otherwise the network may overload
Can be implemented as either a buffered send or a synchronous send
Standard Send (cont.)
MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
buf       the address of the data to be sent
count     the number of elements of datatype that buf contains
datatype  the MPI datatype
dest      the rank of the destination in communicator comm
tag       a marker used to distinguish different message types
comm      the communicator shared by sender and receiver
ierror    the return value of the send (Fortran binding only)
Standard Blocking Receive
Note: all sends so far have been blocking (but this only makes a difference for synchronous sends)
Completes when the message has been received
MPI_Recv(buf, count, datatype, source, tag, comm, status)
  source - the rank of the source process in communicator comm
  status - returns information about the message
Synchronous blocking message-passing:
  processes synchronise
  the sender process specifies the synchronous mode
  blocking - both processes wait until the transaction has completed

For a communication to succeed:
  The sender must specify a valid destination rank
  The receiver must specify a valid source rank
  The communicator must be the same
  Tags must match
  Message types must match
  The receiver's buffer must be large enough
The receiver can use wildcards:
  MPI_ANY_SOURCE, MPI_ANY_TAG
  the actual source and tag are returned in the status parameter (see the sketch below)
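A minimal sketch of a wildcard receive that then inspects the status object; the MPI_SOURCE and MPI_TAG fields and MPI_Get_count are standard MPI:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value, count;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* accept one int from any sender, with any tag */
        MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_INT, &count);  /* elements received */
        printf("Got %d element(s) from rank %d with tag %d\n",
               count, status.MPI_SOURCE, status.MPI_TAG);
    } else if (rank == 1) {
        value = 99;
        MPI_Send(&value, 1, MPI_INT, 0, 7, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}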
Standard/Blocked Send/Receive
MPI Send/Receive a Character
/* mpi_com.c */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int numtasks, rank, dest, source, rc, tag = 1;
    char inmsg, outmsg = 'X';
    MPI_Status Stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        dest = 1;
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        printf("Rank0 sent: %c\n", outmsg);
        source = 1;
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
        printf("Rank0 recv: %c\n", inmsg);
    }
MPI Send/Receive a Character (cont...)
    else if (rank == 1) {
        source = 0;
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
        printf("Rank1 received: %c\n", inmsg);
        dest = 0;
        outmsg = 'Y';   /* reply with a different character, as in the demo output */
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
Execution Demo
> mpicc mpi_com.c
> lamboot
> mpirun -np 2 ./a.out

Rank0 sent: X
Rank0 recv: Y
Rank1 received: X
Non Blocking Message Passing
Functions to check on completion:
  MPI_Wait, MPI_Test
  MPI_Waitany, MPI_Testany
  MPI_Waitall, MPI_Testall
  MPI_Waitsome, MPI_Testsome

MPI_Status status;
MPI_Wait(&request, &status);        /* blocks until the operation completes */
MPI_Test(&request, &flag, &status); /* doesn't block; sets flag on completion */
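A minimal sketch of a non-blocking exchange, assuming exactly two processes; MPI_Irecv and MPI_Isend return immediately, and MPI_Waitall blocks until both transfers complete:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, other, sendval, recvval;
    MPI_Request reqs[2];
    MPI_Status  stats[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    other   = 1 - rank;          /* assumes exactly 2 processes */
    sendval = rank * 100;

    /* post both operations; neither call blocks */
    MPI_Irecv(&recvval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &reqs[1]);

    /* other useful work can proceed here while communication completes */

    MPI_Waitall(2, reqs, stats); /* block until both complete */
    printf("Rank %d received %d\n", rank, recvval);

    MPI_Finalize();
    return 0;
}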
Timers
C: double MPI_Wtime(void);
  Returns the elapsed wall-clock time in seconds (double precision) on the calling processor
Time is measured in seconds
The time to perform a task is measured by consulting the timer before and after (see the sketch below)
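A minimal sketch of the before-and-after pattern; the work being timed (a barrier here) is just an illustrative stand-in:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    double start, finish;

    MPI_Init(&argc, &argv);

    start = MPI_Wtime();          /* consult the timer before... */
    MPI_Barrier(MPI_COMM_WORLD);  /* ...the task being measured... */
    finish = MPI_Wtime();         /* ...and after */

    printf("Task took %f seconds\n", finish - start);

    MPI_Finalize();
    return 0;
}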