distributed memory programming with mpi · mpi programming 5 message passing and mpi message...
TRANSCRIPT
![Page 1: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/1.jpg)
Distributed Memory Distributed Memory Programming with MPIProgramming with MPI
Moreno MarzollaDip. di Informatica—Scienza e Ingegneria (DISI)Università di Bologna
Pacheco chapter 3
![Page 2: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/2.jpg)
MPI Programming 2
![Page 3: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/3.jpg)
MPI Programming 3
Credits
● Peter Pacheco, Dept. of Computer Science, University of San Franciscohttp://www.cs.usfca.edu/~peter/
● Mary Hall, School of Computing, University of Utahhttps://www.cs.utah.edu/~mhall/
● Salvatore Orlando, Univ. Ca' Foscari di Veneziahttp://www.dsi.unive.it/~orlando/
● Blaise Barney, https://computing.llnl.gov/tutorials/mpi/ (highly recommended!!)
![Page 4: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/4.jpg)
MPI Programming 4
Introduction
![Page 5: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/5.jpg)
MPI Programming 5
Message Passing and MPI
● Message passing is the principal alternative to shared memory parallel programming– Message passing represents the predominant programming
model for supercomputers and clusters● What MPI is
– A library used within conventional sequential languagess (Fortran, C, C++)
– Based on Single Program, Multiple Data (SPMD) – Isolation of separate address spaces
● no data races, but communication errors possible● exposes execution model and forces programmer to think about
locality, both good for performance● complexity and code growth!
![Page 6: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/6.jpg)
MPI Programming 6
SPMD(Single Program Multiple Data)
● The same program is executed by P processes● Each process may choose a different execution path
depending on its ID (rank)
...MPI_Init(...);...foo(); /* executed by all processes */if ( my_id == 0 ) {
do_something(); /* executed by process 0 only */} else {
do_something_else(); /* executed by all other processes */}...MPI_Finalize();
...
![Page 7: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/7.jpg)
MPI Programming 7
Message Passing and MPI
● All communication and synchronization operations require subroutine calls– No shared variables
● Subroutines for– Communication
● Pairwise or point-to-point● Collectives involving multiple processes
– Synchronization ● Barrier● No locks because there are no shared variables to protect
– Queries● How many processes? Which one am I? Any messages waiting?
![Page 8: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/8.jpg)
MPI Programming 8
Using MPI under Debian/Ubuntu
● Install the mpi-default-bin and mpi-default-dev packages– Installs OpenMPI– You might also want openmpi-doc for the man pages
● Use mpicc to compile, mpirun to execute● To execute your program on remote hosts, make sure you can
ssh into them without entering a password– Generate a public/private key pair on your local machine, if
you have not already done so, with ssh-keygen -t dsa; do not enter a passphrase
– Append the content of .ssh/id_dsa.pub to the remote file .ssh/authorized_keys
![Page 9: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/9.jpg)
MPI Programming 9
● Two important questions that arise early in a parallel program are:– How many processes are participating in this computation?– Which one am I?
● MPI provides functions to answer these questions:– MPI_Comm_size reports the number of processes– MPI_Comm_rank reports the rank, a number between 0 and
(size - 1), identifying the calling process
Finding Out About the Environment
![Page 10: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/10.jpg)
MPI Programming 10
/* mpi-hello.c */
#include <mpi.h>#include <stdio.h>
int main( int argc, char *argv[] ){
int rank, size, len;char hostname[MPI_MAX_PROCESSOR_NAME];MPI_Init( &argc, &argv ); MPI_Comm_rank( MPI_COMM_WORLD, &rank );MPI_Comm_size( MPI_COMM_WORLD, &size );MPI_Get_processor_name( hostname, &len );printf(”Greetings from process %d of %d running on %s\n",
rank, size, hostname);MPI_Finalize(); return 0;
}
Hello, world!
No MPI call after this line
No MPI call before this line
![Page 11: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/11.jpg)
MPI Programming 11
Hello, world!
● Compilation:
mpicc -Wall mpi_hello.c -o mpi_hello
● Execution (8 processes on localhost):
mpirun -n 8 ./mpi_hello
● Execution (two processes on host “foo” and one on host “bar”)
mpirun -H foo,foo,bar ./mpi_hello
![Page 12: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/12.jpg)
MPI Programming 12
Hello, world!
$ mpirun -n 8 ./mpi_helloGreetings from process 7 of 8 running on wopr Greetings from process 5 of 8 running on woprGreetings from process 0 of 8 running on woprGreetings from process 3 of 8 running on woprGreetings from process 6 of 8 running on woprGreetings from process 4 of 8 running on woprGreetings from process 1 of 8 running on woprGreetings from process 2 of 8 running on wopr
![Page 13: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/13.jpg)
MPI Programming 13
Hostfile$ cat myhostfileaa slots=4bb slots=4cc slots=4
$ mpirun -hostfile myhostfile -n 6 ./mpi_hello
● To run 4 instances on node “aa” and 2 instances on node “bb”:
● To run 2 instances on “aa”, 2 on “bb” and the remaining 2 on “cc”:
● man mpirun for additional information
$ mpirun -loadbalance -hostfile myhostfile -n 6 ./mpi_hello
![Page 14: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/14.jpg)
MPI Programming 14
MPI
● The following six functions suffice for most programs:– MPI_Init– MPI_Finalize– MPI_Comm_size– MPI_Comm_rank– MPI_Send (blocking send)– MPI_Recv (blocking receive)– MPI_Abort (aborting the computation)
![Page 15: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/15.jpg)
MPI Programming 15
A Simple MPI Program/* mpi-point-to-point.c */#include <mpi.h>#include <stdio.h>int main( int argc, char *argv[]){
int rank, buf;MPI_Status status;MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &rank);
/* process 0 sends and process 1 receives */if (rank == 0) {
buf = 123456;MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
} else if (rank == 1) {MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);printf("Received %d\n", buf);
}
MPI_Finalize();return 0;
}
![Page 16: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/16.jpg)
MPI Programming 16
● How to organize processes– Processes can be collected into groups– A group and context together form a communicator– A process is identified by its rank in the group associated
with a communicator● There is a default communicator MPI_COMM_WORLD
whose group contains all processes
Some Basic Concepts
![Page 17: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/17.jpg)
MPI Programming 17
MPI datatypes
● The data to be sent or received is described by a triple (address, count, datatype), where an MPI datatype is recursively defined as:– predefined, corresponding to a data type from the language
(e.g., MPI_INT, MPI_DOUBLE)– a contiguous array of MPI datatypes– a strided block of datatypes– an indexed array of blocks of datatypes– an arbitrary structure of datatypes
● There are MPI functions to construct custom datatypes
![Page 18: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/18.jpg)
MPI Programming 18
Some MPI datatypesMPI Datatype C datatype
MPI_CHAR signed char
MPI_SHORT signed short int
MPI_INT signed int
MPI_LONG signed lont int
MPI_LONG_LONG signed long long int
MPI_UNSIGNED_CHAR unsigned char
MPI_UNSIGNED_SHORT unsigned short int
MPI_UNSIGNED unsigned int
MPI_UNSIGNED_LONG unsigned long int
MPI_FLOAT float
MPI_DOUBLE double
MPI_LONG_DOUBLE long double
MPI_BYTE
![Page 19: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/19.jpg)
MPI Programming 19
MPI Tags
● Messages are sent with an accompanying user-defined integer tag, to assist the receiving process in identifying the message
● Messages can be screened at the receiving end by specifying a specific tag, or not screened by specifying MPI_ANY_TAG as the tag in a receive
![Page 20: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/20.jpg)
MPI Programming 20
MPI Basic (Blocking) Send
int MPI_Send(const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
– The message buffer is described by (buf, count, datatype).– count is the number of items to send (NOT the number of bytes)– The target process is specified by dest, which is the rank of the
target process in the communicator specified by comm● If dest is MPI_PROC_NULL, che MPI_Send operation has no effect
– When this function returns, the data has been delivered to the system and the buffer can be reused. The message may have not been received yet by the target process.
int buf = 123456;MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
![Page 21: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/21.jpg)
MPI Programming 21
MPI Basic (Blocking) Receive
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
– Waits until a matching (both source and tag) message is received from the system, and the buffer can be used
– source is the process rank in the communicator specified by comm, or MPI_ANY_SOURCE
– tag is a tag to be matched, or MPI_ANY_TAG– receiving fewer than count occurrences is OK, but receiving
more is an error– status contains further information (e.g. size of message), or MPI_STATUS_IGNORE if no information is needed
MPI_Recv( &buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status );
![Page 22: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/22.jpg)
MPI Programming 22
MPI_Status
● MPI_Status is a C structure with (among others) the following fields:– int MPI_SOURCE;– int MPI_TAG;– int MPI_ERROR;
● Therefore, a process can check the actual source and tag of a message received with MPI_ANY_TAG or MPI_ANY_SOURCE
![Page 23: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/23.jpg)
MPI Programming 23
MPI_Get_count()
int MPI_Get_count( const MPI_Status *status, MPI_Datatype datatype, int *count )
– MPI_Recv may complete even if less than count elements have been received
● Provided that the matching MPI_Send actually sent fewer elements
– MPI_Get_count can be used to know how many elements of type datatype have actually been received
– See mpi-get-count.c
![Page 24: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/24.jpg)
MPI Programming 24
Blocking communication and deadlocks
● Blocking send/receive may lead to deadlock if not paired carefully
MPI_Send to 1
Process 0 Process 1
MPI_Recv from 1
MPI_Send to 0
MPI_Recv from 0
Possible deadlock!
![Page 25: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/25.jpg)
MPI Programming 25
User program
MPI subsystem
Blocking communication and deadlocks
Process 0 Process 1
MPI_Send(send_buf, …);send_buf
out buffer
Operating System
User program
MPI subsystem
MPI_Send(send_buf, …);send_buf
out buffer
Operating System
MPI_Recv(…);
in buffer in buffer
out buffer
in buffer
out buffer
in buffer
send_buf
MPI_Recv(…);
send_buf
Deadlock
….
![Page 26: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/26.jpg)
MPI Programming 26
Blocking communication and deadlocks
● To avoid the deadlock it is necessary to reorder the operations so that send/receive pairs match...
● ...or use non-blocking communication primitives
MPI_Send to 1
MPI_Recv from 1 MPI_Send to 0
MPI_Recv from 0
Process 0 Process 1
![Page 27: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/27.jpg)
MPI Programming 27
Non-blocking Send
int MPI_Isend(const void *start, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *req)
– The message buffer is described by (start, count, datatype).– count is the number of items to send– The target process is specified by dest, which is the rank of the
target process in the communicator specified by comm– A unique identifier of this request is stored to req– This function returns immediately
int buf = 123456;MPI_Request req;MPI_Isend(%buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
![Page 28: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/28.jpg)
MPI Programming 28
Non-blocking Receive
int MPI_Irecv(void *start, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *req)
– Processing continues immediately without waiting for the message to be received
– A communication request handle (req) is returned for handling the pending message status
– The program must cal MPI_Wait() or MPI_Test() to determine when the non-blocking receive operation completes
– Note: it is OK to use MPI_Isend with the (blocking) MPI_Recv, and vice-versa
MPI_Request req;MPI_Irecv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
![Page 29: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/29.jpg)
MPI Programming 29
MPI_Test()
int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)
– Checks the status of a specified non-blocking send or receive operation
– The integer flag parameter is set to 1 if the operation has completed, 0 if not
– For multiple non-blocking operations, there exist functions to specify any (MPI_Testany), all (MPI_Testall) or some (MPI_Testsome) completions
– See man pages for details
![Page 30: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/30.jpg)
MPI Programming 30
MPI_Wait()
int MPI_Wait(MPI_Request *request, MPI_Status *status)
– Blocks until a specified non-blocking send or receive operation has completed
– For multiple non-blocking operations, there exists variants to specify any (MPI_Waitany), all (MPI_Waitall) or some (MPI_Waitsome) completions
– See man pages for details
![Page 31: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/31.jpg)
MPI Programming 31
Async send demo/* mpi-async.c */#include <stdio.h>#include <mpi.h>int main( int argc, char *argv[]){ int rank, size, buf; MPI_Status status; MPI_Request req; MPI_Init(&argc, &argv); MPI_Comm_rank( MPI_COMM_WORLD, &rank ); MPI_Comm_size( MPI_COMM_WORLD, &size ); if (rank == 0) { buf = 123456; MPI_Isend( &buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req); big_computation(); MPI_Wait(&req, &status); } else if (rank == 1) { MPI_Recv( &buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status ); printf("Received %d\n", buf); } MPI_Finalize(); return 0;}
![Page 32: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/32.jpg)
MPI Programming 32
Aborting the computation
● To abort a computation, do not use exit() or abort(): call MPI_Abort() instead
● MPI_Abort(comm, err) "gracefully" terminates all running MPI processes on communicator “comm” (e.g., MPI_COMM_WORLD) returning the error code “err”
![Page 33: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/33.jpg)
MPI Programming 33
Example: Trapezoid rule with MPI
![Page 34: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/34.jpg)
MPI Programming 34
See trap.c
The trapezoid strikes back
double result = 0.0;double h = (b-a)/n;double x = a;int i;for ( i = 0; i<n; i++ ) {
result += h*(f(x) + f(x+h))/2.0;x += h;
}
a b a b
![Page 35: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/35.jpg)
MPI Programming 35
Parallel pseudo-code (naïve)
partial_result = trap(my_rank, comm_sz, a, b, n);
if (my_rank != 0) {Send partial_result to process 0;
} else { /* my_rank == 0 */result = partial_result;for ( p = 1; p < comm_sz; p++ ) {
Receive partial_result from process p;result += partial_result;
}print result;
}
See mpi-trap0.c
![Page 36: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/36.jpg)
MPI Programming 36
Collective Communication
![Page 37: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/37.jpg)
MPI Programming 37
Collective communications
● Send/receive operations are rarely used in practice
● Many applications use the bulk synchronous pattern:– Repeat:
● Local computation● Communicate to update global view on all
processes● Collective communications are
executed by all processes in the group to compute and share some global result
P0 P1 P2 P3
Credits: Tim Mattson
Collective comm.
Collective comm.
![Page 38: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/38.jpg)
MPI Programming 38
Collective communications
● Collective communications are assumed to be more efficient than point-to-point operations achieving the same result
● Understanding when collective communications are to be used is an essential skill of a MPI programmer
![Page 39: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/39.jpg)
MPI Programming 39
MPI_Barrier()
● Executes a barrier synchronization in a group– When reaching the MPI_Barrier() call, a process blocks
until all processes in the group reach the same MPI_Barrier() call
– Then all processes are free to continue
![Page 40: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/40.jpg)
MPI Programming 40
MPI_Bcast()Broadcasts a message to all other processes of a group
count = 3;src = 1; /* broadcast originates from process 1 */MPI_Bcast(buf, count, MPI_INT, src, MPI_COMM_WORLD);
Proc 0 Proc 1 Proc 2 Proc 3
buf[] (before)
buf[] (after)
1 2 3
1 2 3 1 2 3 1 2 3 1 2 3
![Page 41: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/41.jpg)
MPI Programming 41
MPI_Scatter()Distribute data to other processes in a group
sendcnt = 3; /* how many items are sent to each process */recvcnt = 3; /* how many items are received by each process */src = 1; /* process 1 contains the message to be scattered */MPI_Scatter(sendbuf, sendcnt, MPI_INT, recvbuf, recvcnt, MPI_INT, src, MPI_COMM_WORLD);
Proc 0
Proc 1
Proc 2
Proc 3
sendbuf[] (before) recvbuf[] (after)
1 2 3 4 5 6 7 8 9 10 11 12
1 2 3
4 5 6
7 8 9
10 11 12
![Page 42: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/42.jpg)
MPI Programming 42
MPI_Scatter()
● The MPI_Scatter() operation produces the same result as if the root executes a series of
and all other processes execute
MPI_Send(sendbuf + i * sendcount * extent(sendtype), sendcount, sendtype, i, ...)
MPI_Recv(recvbuf, recvcount, recvtype, i, ...)
![Page 43: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/43.jpg)
MPI Programming 43
MPI_Gather()Gathers together data from other processes
sendcnt = 3; /* how many items are sent by each process */recvcnt = 3; /* how many items are received from each process */dst = 1; /* message will be gathered at process 1 */MPI_Gather(sendbuf, sendcnt, MPI_INT, recvbuf, recvcnt, MPI_INT, dst, MPI_COMM_WORLD);
Proc 0
Proc 1
Proc 2
Proc 3
sendbuf[] (before) recvbuf[] (after)
1 2 3 4 5 6 7 8 9 10 11 12
1 2 3
4 5 6
7 8 9
10 11 12
![Page 44: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/44.jpg)
MPI Programming 44
MPI_Allgather()Gathers data from other processes and distribute to all
sendcnt = 3;recvcnt = 3;MPI_Allgather(sendbuf, sendcnt, MPI_INT, recvbuf, recvcnt, MPI_INT, MPI_COMM_WORLD);
Proc 0
Proc 1
Proc 2
Proc 3
sendbuf[] (before) recvbuf[] (after)
1 2 3
4 5 6
7 8 9
10 11 12 1 2 3 4 5 6 7 8 9 10 11 12
1 2 3 4 5 6 7 8 9 10 11 12
1 2 3 4 5 6 7 8 9 10 11 12
1 2 3 4 5 6 7 8 9 10 11 12
![Page 45: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/45.jpg)
MPI Programming 45
Example: Parallel Vector Sum
x+ y = (x0, x1,… , xn−1) + ( y0, y1,… , yn−1)
= (x0+ y0, x1+ y1,… , xn−1+ yn−1)
= (z0, z1,… , zn−1)
= z
void sum( double* x, double* y, double* z, int n ){
int i;for (i=0; i<n; i++) {
z[i] = x[i] + y[i];}
}
![Page 46: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/46.jpg)
MPI Programming 46
Parallel Vector Sum
Proc 0 Proc 1 Proc 2 Proc 3
x[]
y[]
x[]
z[]
+ + + +
= = = =
![Page 47: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/47.jpg)
MPI Programming 47
Proc 0
Parallel Vector Sum
Proc 0
x[]
y[]
local_x[]
local_y[]
local_z[]
+
=
Proc 1
local_x[]
local_y[]
local_z[]
+
=
Proc 2
local_x[]
local_y[]
local_z[]
+
=
Proc 3
local_x[]
local_y[]
local_z[]
+
=
Proc 0
z[]
MPI_Scatter
MPI_Gather
See mpi-vecsum.c
![Page 48: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/48.jpg)
MPI Programming 48
MPI_Scatter()
● Contiguous data● Uniform message size
Proc 0
Proc 0
Proc 1
Proc 2
Proc 3
![Page 49: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/49.jpg)
MPI Programming 49
MPI_Scatterv() / MPI_Gatherv()
● Gaps are allowed between messages in source data ● Irregular message sizes are allowed● Data can be distributed to processes in any order
Proc 0
Proc 0
Proc 1
Proc 2
Proc 3
![Page 50: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/50.jpg)
MPI Programming 50
MPI_Scatterv()
Proc 0
Proc 0
Proc 1
Proc 2
Proc 3
int MPI_Scatterv ( void *sendbuf, int *sendcnts, int *displs, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, int root, MPI_Comm comm )
sendbuf
recvbufrecvcnt
sendcnts[0] sendcnts[1] sendcnts[3] sendcnts[2]
displs[0]
displs[1]
displs[2]
displs[3]
Number of array elements, NOT bytes!
![Page 51: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/51.jpg)
MPI Programming 51
Example
int sendbuf[] = {10, 11, 12, 13, 14, 15, 16}; /* at master */int displs[] = {3, 0, 1}; /* assume P=3 MPI processes */int sendcnts[] = {3, 1, 4}; int recvbuf[5];...MPI_Scatterv(sendbuf, sendcnts, displs, MPI_INT, recvbuf, 5, MPI_INT, 0, MPI_COMM_WORLD);
10 11 12 13 14 15 16
13 14 15 10 11 12 13 14
sendbuf[]
recvbuf[]
Proc 0 Proc 1 Proc 2
See mpi-vecsum3.c
![Page 52: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/52.jpg)
MPI Programming 52
MPI_Reduce()Performs a reduction and place result in one process
0 3 2 4
count = 1;dst = 1; /* result will be placed in process 1 */MPI_Reduce(sendbuf, recvbuf, count, MPI_INT, MPI_SUM, dst, MPI_COMM_WORLD);
sendbuf[] (before)
recvbuf[] (after)
Proc 0 Proc 1 Proc 2 Proc 3
= 0 + 3 + 2 + 49
MPI_SUM
![Page 53: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/53.jpg)
MPI Programming 53
Predefined reduction operators
Operation Value Meaning
MPI_MAX Maximum
MPI_MIN Minimum
MPI_SUM Sum
MPI_PROD Product
MPI_LAND Logical AND
MPI_BAND Bitwise AND
MPI_LOR Logical OR
MPI_BOR Bitwise OR
MPI_LXOR Logical exclusive OR
MPI_BXOR Bitwise exclusive OR
MPI_MAXLOC Maximum and location of maximum
MPI_MINLOC Minimum and location of minimum
![Page 54: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/54.jpg)
MPI Programming 54
MINLOC and MAXLOCCompute global min/max and return an associated index
4 2 4 3
Proc 0 Proc 1 Proc 2 Proc 3
-5.0
MPI_MINLOC
-3.0 13.0 -5.0 2.0
4
struct {double val; int idx} in, out;dst = 1; /* result will be placed in process 1 */MPI_Reduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MINLOC, dst, MPI_COMM_WORLD);
val
idx
The min of val
The index associated with the min of val
See mpi-minloc.c
![Page 55: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/55.jpg)
MPI Programming 55
MPI_SUMMPI_SUM
MPI_Reduce()
● If count > 1, recvbuf[i] is the reduction of all elements sendbuf[i] at the various processes
0 1 2 3 4 5 6 7 8 9 10 11
18 22 26
Proc 0 Proc 1 Proc 2 Proc 3
count = 3;dst = 1;MPI_Reduce(sendbuf, recvbuf, count, MPI_INT, MPI_SUM, dst, MPI_COMM_WORLD);
sendbuf[] (before)
recvbuf[] (after)
MPI_SUM
![Page 56: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/56.jpg)
MPI Programming 56
Parallel trapezoid(with reduction)
● See mpi-trap1.c
![Page 57: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/57.jpg)
MPI Programming 57
MPI_Allreduce()Performs a reduction and place result in all processes
0 3 2 4
count = 1;MPI_Allreduce(sendbuf, recvbuf, count, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
sendbuf[] (before)
recvbuf[] (after)
Proc 0 Proc 1 Proc 2 Proc 3
9
MPI_SUMMPI_SUM MPI_SUM MPI_SUM
9 9 9
![Page 58: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/58.jpg)
MPI Programming 60
MPI_Alltoall()Each process performs a scatter operation
sendcnt = 1;recvcnt = 1;MPI_Alltoall(sendbuf, sendcnt, MPI_INT, recvbuf, recvcnt, MPI_INT, MPI_COMM_WORLD);
Proc 0
Proc 1
Proc 2
Proc 3
sendbuf[] (before) recvbuf[] (after)
4
8
12
16
1 2 3
5 6 7
9 10 11
13 14 15 4 8 12 16
1
2
3
5
6
7
9
10
11
13
14
15
![Page 59: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/59.jpg)
MPI Programming 61
MPI_Scan()Compute the inclusive scan
-2 3
10 161
count = 1;MPI_Scan(sendbuf, recvbuf, count, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
sendbuf[] (before)
recvbuf[] (after)
See mpi-scan.c
3 9 6
-2
Proc 0 Proc 1 Proc 2 Proc 3
MPI_SUM MPI_SUM MPI_SUM MPI_SUM
![Page 60: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/60.jpg)
MPI Programming 63
MPI_Scan()
● If count > 1, recvbuf[i] at proc. j is the scan of all elements sendbuf[i] at the first j processes (incl.)
0 1 2 3 4 5 6 7 8 9 10 11
3 5 7
Proc 0 Proc 1 Proc 2 Proc 3
count = 3;MPI_Scan(sendbuf, recvbuf, count, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
sendbuf[] (before)
recvbuf[] (after)0 1 2 9 12 15 18 22 26
See mpi-scan.c
MPI_SUM MPI_SUM MPI_SUM MPI_SUM
![Page 61: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/61.jpg)
MPI Programming 64
Collective Communication Routines / 1MPI_Barrier(comm)
– Synchronization operation. Creates a barrier synchronization in a group. Each rocess, when reaching the MPI_Barrier call, blocks until all processes in the group reach the same MPI_Barrier call. Then all processes can continue.
MPI_Bcast(buffer, count, datatype, root, comm) – Broadcasts (sends) a message from the process with rank "root" to all other processes in the group.
MPI_Scatter(sendbuf, sendcnt, sendtype, recvbuf, recvcnt, recvtype, root, comm) MPI_Scatterv(sendbuf, sendcnts[], displs[], sendtype, recvbuf, int recvcnt,
recvtype, root, comm)– Distributes distinct messages from a single source process to each process in the group
MPI_Gather(sendbuf, sendcnt, sendtype, recvbuf, recvcount, recvtype, root, comm)MPI_Gatherv(sendbuf, sendcnt, sendtype, recvbuf, recvcnts[], displs[], recvtype, root, comm)
– Gathers distinct messages from each process in the group to a single destination process. This routine is the reverse operation of MPI_Scatter
MPI_Allgather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm) – Concatenation of data to all processes in a group. Each process in the group, in effect, performs a
one-to-all broadcasting operation within the group
MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm) – Applies a reduction operation on all processes in the group and places the result in one process
![Page 62: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/62.jpg)
MPI Programming 65
Collective Communication Routines / 2
MPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm) – Collective computation operation + data movement. Does an element-wise reduction on a vector
across all processes in the group; all processes receive the result. This is logicaly equivalent to a MPI_Reduce() followed by MPI_Bcast()
MPI_Scan(sendbuf, recvbuf, count, datatype, op, comm) – Performs a scan with respect to a reduction operation across a process group.
MPI_Alltoall(sendbuf, sendcount, sendtype, recvbuf, recvcnt, recvtype, comm) – Data movement operation. Each process in a group performs a scatter operation, sending a distinct
message to all the processes in the group in order by index.
![Page 63: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/63.jpg)
MPI Programming 66
Odd-Even sort with MPI
![Page 64: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/64.jpg)
MPI Programming 67
Odd-Even Transposition Sort
● Variant of bubble sort● First, compare all (even, odd) pairs of adjacent
elements; exchange them if in the wrong order● Then compare all (odd, even) pairs, exchanging if
necessary; repeat the step abovev[0]
v[1]
v[2]
v[3]
v[4]
v[5]
v[6]
Compare andexchange
Time
![Page 65: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/65.jpg)
MPI Programming 68
Odd-Even Transposition Sort
● Variant of bubble sort● First, compare all (even, odd) pairs of adjacent
elements; exchange them if in the wrong order● Then compare all (odd, even) pairs, exchanging if
necessary; repeat the step above
for (phase = 0; phase < n; phase++) {if (phase % 2 == 0) {
for (i=0; i<n-1; i += 2) {if (a[i] > a[i+1]) Swap(&a[i], &a[i+1]);
}} else {
for (i=1; i<n-1; i += 2) {if (a[i] > a[i+1]) Swap(&a[i], &a[i+1]);
}}
}
![Page 66: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/66.jpg)
MPI Programming 69
Choosing the Granularity
● Communication in distributed memory systems is very expensive, therefore working on individual array elements does not make sense
● We achieve a coarser granularity by splitting the array in blocks and applying odd-even sort at the block level
G. Baudet and D. Stevenson, "Optimal Sorting Algorithms for Parallel Computers," in IEEE Transactions on Computers, vol. C-27, no. 1, pp. 84-87, Jan. 1978. doi:10.1109/TC.1978.1674957
![Page 67: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/67.jpg)
MPI Programming 70
Odd-Even Transposition Sort
● The vector is split in blocks that are assigned to MPI processes
● Each process sorts its block (e.g., using qsort)● At each exchange-swap step:
– Process i sends a copy of its chunk to process i + 1– Process i + 1 sends a copy of its chunk to process i– Process i merges the chunks, discards upper half– Process i + 1 merges the chunks, discards lower half
![Page 68: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/68.jpg)
MPI Programming 71
Exchange-Merge step
10 7 12 33 9 13 52 14 97 31
7 9 10 12 33
sort
13 14 31 52 97
sort
7 9 10 12 13 14 31 33 52 97
merge
7 9 10 12 13 14 31 33 52 97
merge
7 9 10 12 13
discard upper half
14 31 33 52 97
discard lower half
Process i Process i + 1
7 9 10 12 33 13 14 31 52 97 13 14 31 52 977 9 10 12 33
Send to partner
…..
![Page 69: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/69.jpg)
MPI Programming 72
Communication pattern forodd-even sort
Proc 0 Proc 1 Proc 2
Proc 0 Proc 1 Proc 2
Proc 3
Proc 3
Proc 0 Proc 1 Proc 2 Proc 3
phase
0
1
2
Proc 0 Proc 1 Proc 2 Proc 33
![Page 70: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/70.jpg)
MPI Programming 73
Sort local valuesfor (phase=0; phase < comm_sz; phase++) {
partner = compute_partner(phase, my_rank);if (I am not idle) {
Send my keys to partner;Receive keys from partner;if (my_rank < partner) {
Keep smaller keys;} else {
Keep larger keys;}
}}
Beware of the (possible) deadlock!!
Depending on the MPI implementation, the send
operation may block if there is no matching receive at the other end; unfortunately, all
receive are executed only after the send completes!
MPI_Send MPI_Send
MPI_Recv MPI_Recv
![Page 71: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/71.jpg)
MPI Programming 74
Solution 1 (ugly):restructure communications
MPI_Send(msg1, size, MPI_INT, partner, 0, comm);MPI_Recv(msg2, size, MPI_INT, partner, 0, comm, MPI_STATUS_IGNORE);
if ( my_rank % 2 == 0 ) {MPI_Send(msg1, size, MPI_INT, partner, 0, comm);MPI_Recv(msg2, size, MPI_INT, partner,
0, comm, MPI_STATUS_IGNORE);} else {
MPI_Recv(msg2, size, MPI_INT, partner, 0, comm, MPI_STATUS_IGNORE);
MPI_Send(msg1, size, MPI_INT, partner, 0, comm);}
![Page 72: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/72.jpg)
MPI Programming 75
Better: use MPI_Sendrecv()
● Executes a blocking send and a receive in a single call– dest and the source can be the
same or different– MPI schedules the
communications so that the program won’t hang or crash
● MPI_Sendrecv() can be matched by MPI_Send() / MPI_Recv()– However, it is very unlikely that
you will ever need to do that
int MPI_Sendrecv(void* sendbuf, int sendcount, MPI_Datatype sendtype,int dest, int sendtag, void* recvbuf, int recvcount,MPI_Datatype recvtype, int source, int recvtag,MPI_Comm comm, MPI_Status* status
)
See mpi-odd-even.c
![Page 73: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/73.jpg)
MPI Programming 76
MPI Datatypes
![Page 74: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/74.jpg)
MPI Programming 77
Example
● Let us consider a two-dimensional domain● (*, Block) decomposition
– with ghost cells along the vertical edges only
P0 P1 P2
Ghost cells
![Page 75: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/75.jpg)
MPI Programming 78
Example
● At each step, nodes must exchange their outer columns with neighbors
P0 P1 P2
![Page 76: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/76.jpg)
MPI Programming 79
Example● In the C language, matrices are stored row-wise
– Elements of the same column are not contiguous in memory
Pi
Pi+1
.....
![Page 77: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/77.jpg)
MPI Programming 80
Example● The BAD solution: send each element with MPI_Send
(or MPI_Isend)
Pi
MPI_Send
MPI_Send
MPI_Send
MPI_Send
MPI_Send
MPI_Send
MPI_Send
MPI_Send
MPI_Send
MPI_Send
MPI_Send
MPI_Send
Pi+1
![Page 78: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/78.jpg)
MPI Programming 81
Example● The UGLY solution: copy the column into a temporary
buffer; MPI_Send() the buffer; fill the destination column
Pi
Pi+1
MPI_Send
..........
![Page 79: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/79.jpg)
MPI Programming 82
Example● The GOOD solution: define a new datatype for the
column, and MPI_Send the column directly
Pi
Pi+1
MPI_Send
![Page 80: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/80.jpg)
MPI Programming 83
MPI Derived Datatypes
● MPI provides several methods for constructing derived data types:– Contiguous: a contiguous block of elements– Vector: a strided vector of elements– Indexed: an irregularly spaced set of blocks of the same type– Struct: an irregularly spaced set of blocks possibly of
different types● Other functions
– MPI_Type_commit(...) commits a new datatype– MPI_Type_free(...) deallocates a datatype object
![Page 81: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/81.jpg)
MPI Programming 84
MPI_Type_contiguous()
int MPI_Type_contiguous(int count, MPI_Datatype oldtype, MPI_Datatype *newtype) ● A contiguous block of count elements of an existing
MPI type oldtype– oldtype can be another previously defined custom datatype
MPI_Datatype rowtype; MPI_Type_contiguous( 4, MPI_FLOAT, &rowtype );MPI_Type_commit(&rowtype);
rowtype
4 ´ MPI_FLOAT
![Page 82: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/82.jpg)
MPI Programming 85
MPI_Type_contiguous()int count = 4MPI_Datatype rowtype;MPI_Type_contiguous(count, MPI_FLOAT, &rowtype);MPI_Type_commit(&rowtype);
1.0 2.0 3.0 4.0
5.0 6.0 7.0 8.0
9.0 10.011.012.0
13.014.015.016.0
a[4][4]
MPI_Send(&a[2][0], 1, rowtype, dest, tag, MPI_COMM_WORLD);
9.0 10.011.012.0 1 element of type rowtype
See mpi-type-contiguous.c
a[2][0]
![Page 83: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/83.jpg)
MPI Programming 86
MPI_Type_vector()
int MPI_Type_vector(int count, int blocklen, int stride, MPI_Datatype oldtype, MPI_Datatype *newtype) ● A regularly spaced array of elements of the existing MPI
type oldtype– count number of blocks– blocklen number of elements of each block– stride number of elements between start of contiguous blocks– oldtype can be another previously defined datatype
MPI_Datatype columntype; MPI_Type_vector( 4, 1, 4, MPI_FLOAT, &columntype );MPI_Type_commit(&columntype);
![Page 84: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/84.jpg)
MPI Programming 87
MPI_Type_vector()int count = 4, blocklen = 1, stride = 4;MPI_Datatype columntype;MPI_Type_vector(count, blocklen, stride, MPI_FLOAT, &columntype);MPI_Type_commit(&columntype);
1.0 2.0 3.0 4.0
5.0 6.0 7.0 8.0
9.0 10.011.012.0
13.014.015.016.0
a[4][4]
MPI_Send(&a[0][1], 1, columntype, dest, tag, MPI_COMM_WORLD);
2.0 6.0 10.014.0 1 element of type columntype
blocklen = 1
stride = 4
See mpi-type-vector.ca[0][1]
![Page 85: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/85.jpg)
MPI Programming 88
Quizint count = 4, blocklen = 2, stride = 4;MPI_Datatype newtype;MPI_Type_vector(count, blocklen, stride, MPI_FLOAT, &newtype);MPI_Type_commit(&newtype);
1.0 2.0 3.0 4.0
5.0 6.0 7.0 8.0
9.0 10.011.012.0
13.014.015.016.0
a[4][4]
MPI_Send(&a[0][1], 1, newtype, dest, tag, MPI_COMM_WORLD);
Which data are being transmitted?
![Page 86: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/86.jpg)
MPI Programming 89
Quiz
1.0 2.0 3.0 4.0
5.0 6.0 7.0 8.0
9.0 10.011.012.0
13.014.015.016.0
a[4][4]
MPI_Send(&a[0][1], 1, newtype, dest, tag, MPI_COMM_WORLD);
10.011.014.015.0
blocklen = 2
stride = 4
2.0 3.0 6.0 7.0
a[0][1]
int count = 4, blocklen = 2, stride = 4;MPI_Datatype newtype;MPI_Type_vector(count, blocklen, stride, MPI_FLOAT, &newtype);MPI_Type_commit(&newtype);
![Page 87: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/87.jpg)
MPI Programming 90
Quizint count = 3, blocklen = 1, stride = 5;MPI_Datatype newtype;MPI_Type_vector(count, blocklen, stride, MPI_FLOAT, &newtype);MPI_Type_commit(&newtype);
1.0 2.0 3.0 4.0
5.0 6.0 7.0 8.0
9.0 10.011.012.0
13.014.015.016.0
a[4][4]
MPI_Send(&a[0][1], 1, newtype, dest, tag, MPI_COMM_WORLD);
Which data are being transmitted?
![Page 88: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/88.jpg)
MPI Programming 91
Quiz
1.0 2.0 3.0 4.0
5.0 6.0 7.0 8.0
9.0 10.011.012.0
13.014.015.016.0
a[4][4]
MPI_Send(&a[0][1], 1, newtype, dest, tag, MPI_COMM_WORLD);
12.0
blocklen = 1
stride = 5
2.0 7.0
a[0][1]
int count = 3, blocklen = 1, stride = 5;MPI_Datatype newtype;MPI_Type_vector(count, blocklen, stride, MPI_FLOAT, &newtype);MPI_Type_commit(&newtype);
![Page 89: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/89.jpg)
MPI Programming 92
Quizint count = ???, blocklen = ???, stride = ???;MPI_Datatype newtype;MPI_Type_vector(count, blocklen, stride, MPI_FLOAT, &newtype);MPI_Type_commit(&newtype);
1.0 2.0 3.0 4.0
5.0 6.0 7.0 8.0
9.0 10.011.012.0
13.014.015.016.0
a[4][4]
MPI_Send(???????, 1, newtype, dest, tag, MPI_COMM_WORLD);
10.04.0 7.0 13.0
Fill the ??? with the parameters required to get the behavior below
![Page 90: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/90.jpg)
MPI Programming 93
Quizint count = 4; blocklen = 1, stride = 3;MPI_Datatype newtype;MPI_Type_vector(count, blocklen, stride, MPI_FLOAT, &newtype);MPI_Type_commit(&newtype);
1.0 2.0 3.0 4.0
5.0 6.0 7.0 8.0
9.0 10.011.012.0
13.014.015.016.0
a[4][4]
MPI_Send(&a[0][3], 1, newtype, dest, tag, MPI_COMM_WORLD);
10.04.0 7.0 13.0
blocklen = 1
stride = 3
a[0][3]
![Page 91: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/91.jpg)
MPI Programming 94
MPI_Type_indexed()
int MPI_Type_indexed(int count, const int array_of_blklen[], const int array_of_displ[], MPI_Datatype oldtype, MPI_Datatype *newtype) ● An irregularly spaced set of blocks of elements of an
existing MPI type oldtype– count number of blocks– array_of_blklen number of elements in each block– array_of_displ displacement of each block with respect to
the beginning of the data structure– oldtype can be another previously defined datatype
MPI_Datatype newtype; MPI_Type_indexed( ... );MPI_Type_commit(&newtype);
![Page 92: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/92.jpg)
MPI Programming 95
MPI_Type_indexed()int count = 3; int blklens[] = {1, 3, 4}; int displs[] = {2, 5, 12};MPI_Datatype newtype;MPI_Type_indexed(count, blklens, displs, MPI_FLOAT, &newtype);MPI_Type_commit(&newtype);
1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.011.012.013.014.015.016.0 a[16]
MPI_Send(&a[0], 1, newtype, dest, tag, MPI_COMM_WORLD);
13.014.015.016.0 1 element of type newtype
See mpi-type-indexed.c
blklens[0] = 1 blklens[1] = 3 blklens[2] = 4
displs[0] = 2
displs[1] = 5
displs[2] = 12
3.0 6.0 7.0 8.0
![Page 93: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/93.jpg)
MPI Programming 96
MPI_Type_indexed()
1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.011.012.013.014.015.016.0 a[16]
MPI_Send(&a[0], 1, newtype, dest, tag, MPI_COMM_WORLD);
MPI_Recv(&b[0], 1, newtype, src, tag, MPI_COMM_WORLD);
3.0 6.0 7.0 8.0 13.014.015.016.0 b[16]
int count = 3; int blklens[] = {1, 3, 4}; int displs[] = {2, 5, 12};MPI_Datatype newtype;MPI_Type_indexed(count, blklens, displs, MPI_FLOAT, &newtype);MPI_Type_commit(&newtype);
![Page 94: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/94.jpg)
MPI Programming 97
MPI_Type_indexed()
1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.011.012.013.014.015.016.0 a[16]
MPI_Send(&a[0], 1, newtype, dest, tag, MPI_COMM_WORLD);
MPI_Recv(&b[0], 8, MPI_FLOAT, src, tag, MPI_COMM_WORLD);
13.014.015.016.03.0 6.0 7.0 8.0 b[16]
int count = 3; int blklens[] = {1, 3, 4}; int displs[] = {2, 5, 12};MPI_Datatype newtype;MPI_Type_indexed(count, blklens, displs, MPI_FLOAT, &newtype);MPI_Type_commit(&newtype);
![Page 95: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/95.jpg)
MPI Programming 98
Combining custom datatypes
● The oldtype parameter of functions MPI_Type_contiguous(), MPI_Type_vector() and MPI_Type_indexed() can be another user-defined datatype
int count, blocklen, stride;MPI_Datatype vec, vecvec;
count = 2; blocklen = 2; stride = 3;MPI_Type_vector( count, blocklen, stride, MPI_FLOAT, &vec);MPI_Type_commit(&vec);count = 2; blocklen = 1; stride = 3MPI_Type_vector( count, blocklen, stride, vec, &vecvec);MPI_Type_commit(&vecvec);
![Page 96: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/96.jpg)
MPI Programming 99
int count, blocklen, stride;MPI_Datatype vec, vecvec;
count = 2; blocklen = 2; stride = 3;MPI_Type_vector( count, blocklen, stride, MPI_FLOAT, &vec);MPI_Type_commit(&vec);count = 2; blocklen = 1; stride = 3MPI_Type_vector( count, blocklen, stride, vec, &vecvec);MPI_Type_commit(&vecvec);
blocklen = 2
stride = 3
vec
blocklen = 1
stride = 3
vecvec
stride = 3 elements of type vec
![Page 97: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/97.jpg)
MPI Programming 100
MPI_Type_struct()
int MPI_Type_struct(int count, int *array_of_blklen, MPI_Aint *array_of_displ, MPI_Datatype *array_of_types, MPI_Datatype *newtype) ● An irregularly spaced set of blocks of elements of existing
MPI types array_of_types– count number of blocks (and number of elements of the arrays array_of_*)
– array_of_blklen number of elements in each block– array_of_displ displacement in bytes of each block with respect
to the beginning of the data structure (of type MPI_Aint)– array_of_types array of MPI_Datatype
![Page 98: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/98.jpg)
MPI Programming 101
MPI_Type_struct()typedef struct {
float x, y, z, v;int n, t;
} particle_t;
int count = 2; int blklens[] = {4, 2}; MPI_Aint displs[2], lb, extent; MPI_Datatype oldtypes[2] = {MPI_FLOAT, MPI_INT}, newtype;MPI_Type_get_extent(MPI_FLOAT, &lb, &extent);displs[0] = 0; displs[1] = 4*extent;MPI_Type_struct(count, blklens, displs, oldtypes, &newtype);MPI_Type_commit(&newtype);
float
blklens[0] = 4
newtype float float float int int
blklens[1] = 2
displs[1] = 4*extent
See mpi-type-struct.c
![Page 99: Distributed Memory Programming with MPI · MPI Programming 5 Message Passing and MPI Message passing is the principal alternative to shared memory parallel programming – Message](https://reader036.vdocuments.us/reader036/viewer/2022062223/5f0611df7e708231d416238b/html5/thumbnails/99.jpg)
MPI Programming 102
Concluding Remarks
● Message passing is a very simple model – Extremely low level; heavy weight – Expense comes from communication and lots of local code – Communication code is often more than half – Tough to make adaptable and flexible – Tough to get right
● Programming model of choice for scalability – Widespread adoption due to portability