CSE 160 – Lecture 16
MPI Concepts, Topology and Synchronization
So Far …
• We have concentrated on generic message passing algorithms
• We have used PVM to implement codes (simpler on workstations)
• MPI is a de facto standard for message passing and introduces some important concepts
Message Addressing
• Identify an endpoint
• Use a tag to distinguish a particular message
– pvm_send(dest, tag)
– MPI_SEND(COMM, dest, tag, buf, len, type)
• Receiving
– recv(src, tag); recv(*, tag); recv(src, *); recv(*, *)
• What if you want to build a library that uses message passing? Is (src, tag) safe in all instances?
Level 0 Issues
• Basic Pt-2-Pt Message Passing is straightforward, but how does one …
– Make it go fast
• Eliminate extra memory copies
• Take advantage of specialized hardware
– Move complex data structures (packing)
– Receive from one-of-many (wildcarding)
– Synchronize a group of tasks
– Recover from errors
– Start tasks
– Build safe libraries
– Monitor tasks
– …
MPI-1 addresses many of the level 0 issues
(but not all)
A long history of research efforts in message passing
• P4
• Chameleon
• Parmacs
• TCGMSG
• CHIMP
• NX (Intel i860, Paragon)
• PVM
• …
And these begot MPI
So What is MPI?
• It is a standard message-passing API
– Specifies many variants of send/recv
• 9 send interface calls
– e.g., synchronous send, asynchronous send, ready send, asynchronous ready send
– Plus other defined APIs
• Process topologies
• Group operations
• Derived datatypes
• Profiling API (standard way to instrument MPI code)
– Implemented and optimized by machine vendors
– Should you use it? Absolutely!
So What’s Missing in MPI-1?
• Process control
– How do you start 100 tasks?
– How do you kill/signal/monitor remote tasks?
• I/O
– Addressed in MPI-2
• Fault tolerance
– If one MPI process dies, the rest eventually hang
• Interoperability
– No standard for running a single parallel job across architectures (e.g., cannot split a computation between x86 Linux and Alpha)
Communication Envelope
• (src, tag) is enough to distinguish messages; however, there is a problem:
– How does one write safe libraries that also use message passing?
• Tags used in libraries can be the same as tags used in calling code, so messages could get confused
• Message envelope – add another, system-assigned tag to a message
– This label defines a message envelope
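The matching rule can be sketched in plain C, without any MPI installation. The struct, function, and wildcard constants below are illustrative stand-ins, not MPI internals:

```c
#include <assert.h>

#define ANY_SRC (-1)   /* stands in for MPI_ANY_SOURCE */
#define ANY_TAG (-1)   /* stands in for MPI_ANY_TAG    */

/* A message envelope: the user-visible (src, tag) pair plus a
 * system-assigned context identifying the communicator. */
struct envelope { int src; int tag; int context; };

/* A receive matches a message only if the contexts are equal.
 * src and tag may be wildcards, but the context never is. */
int matches(struct envelope msg, int src, int tag, int context)
{
    return msg.context == context
        && (src == ANY_SRC || msg.src == src)
        && (tag == ANY_TAG || msg.tag == tag);
}
```

Because a library can obtain its own context, even a receive on (src, tag) values identical to the user's cannot accidentally consume the library's messages.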
Envelopes
[Figure: two separate envelopes, each holding the messages (Tag 100, Src A), (Tag 100, Src B), and (Tag 200, Src A). A context tag distinguishes the two – messages are received relative to a context.]
Communicators
• Messages ride inside of message envelopes, and receives can only match candidate messages relative to an envelope.
• MPI implements these envelopes as communicators.
• A communicator is more than just an envelope – it also encompasses a group of processes.
• Communicator = context + group of tasks
MPI_COMM_WORLD
• MPI programs are static – they can never grow or shrink
• There is one default communicator, called MPI_COMM_WORLD
• All processes are members of MPI_COMM_WORLD
• Each process is labeled 0 – N, for a communicator of size N+1
– Similar to a PVM group
How do you build new groups?
• All new communicators are derived from existing communicators
– All communicators have MPI_COMM_WORLD as an “ancestor”
– New communicators are formed synchronously
• All members of a new communicator call the construction routine at the same time
• MPI_COMM_DUP(), MPI_COMM_SPLIT()
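MPI_COMM_SPLIT partitions a communicator by a color argument and orders the ranks of each new communicator by a key argument, with ties broken by old rank. The rank-assignment rule can be sketched in plain C (a simulation of the semantics, not MPI code):

```c
/* Given every process's (color, key), compute the rank that process
 * `me` would receive in its new communicator: the number of processes
 * with the same color whose (key, old rank) pair sorts earlier. */
int split_new_rank(const int *color, const int *key, int n, int me)
{
    int r = 0;
    for (int i = 0; i < n; i++) {
        if (i == me || color[i] != color[me]) continue;
        if (key[i] < key[me] || (key[i] == key[me] && i < me))
            r++;   /* process i comes before me in the new group */
    }
    return r;
}
```

For example, four processes with colors {0, 1, 0, 1} split MPI_COMM_WORLD into two 2-process communicators, each renumbered 0–1 by key.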
Communicators as groups
[Figure: MPI_COMM_WORLD containing processes with ranks 0–5; two communicators derived from it each contain three of those processes, relabeled with ranks 0–2 within the new communicator.]
Running MPI Programs
• The MPI-1 Standard does not specify how to run an MPI program, just as the Fortran standard does not specify how to run a Fortran program.
• In general, starting an MPI program is dependent on the implementation of MPI you are using, and might require various scripts, program arguments, and/or environment variables.
• mpiexec <args> is part of MPI-2, as a recommendation, but not a requirement
– You can use mpiexec for MPICH and mpirun for other systems
Finding Out About the Environment
• Two important questions that arise early in a parallel program are:
– How many processes are participating in this computation?
– Which one am I?
• MPI provides functions to answer these questions:
– MPI_Comm_size reports the number of processes
– MPI_Comm_rank reports the rank, a number between 0 and size-1, identifying the calling process
MPI Datatypes
• The data in a message to be sent or received is described by a triple (address, count, datatype), where
• An MPI datatype is recursively defined as:
– predefined, corresponding to a datatype from the language (e.g., MPI_INT, MPI_DOUBLE_PRECISION)
– a contiguous array of MPI datatypes
– a strided block of datatypes
– an indexed array of blocks of datatypes
– an arbitrary structure of datatypes
• There are MPI functions to construct custom datatypes, such as an array of (int, float) pairs, or a row of a matrix stored columnwise.
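A strided block is the layout that MPI_Type_vector describes: count blocks of blocklen elements, with block starts separated by stride elements. The loop below sketches in plain C the copy an implementation would perform for such a layout – it is an illustration of the layout rule, not actual MPI code:

```c
/* Gather the elements described by an MPI_Type_vector-style layout
 * (count blocks, blocklen elements each, block starts stride apart)
 * into a contiguous destination buffer. */
void gather_vector(const double *src, double *dst,
                   int count, int blocklen, int stride)
{
    for (int i = 0; i < count; i++)          /* each block      */
        for (int j = 0; j < blocklen; j++)   /* within a block  */
            *dst++ = src[i * stride + j];
}
```

For instance, column c of an r-by-n row-major matrix a is described by count = r, blocklen = 1, stride = n, starting at &a[c] – exactly the "row of a matrix stored columnwise" case, transposed.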
MPI Tags
• Messages are sent with an accompanying user-defined integer tag, to assist the receiving process in identifying the message.
• Messages can be screened at the receiving end by specifying a specific tag, or not screened by specifying MPI_ANY_TAG as the tag in a receive.
• Some non-MPI message-passing systems have called tags “message types”. MPI calls them tags to avoid confusion with datatypes.
Why Datatypes?
• Since all data is labeled by type, an MPI implementation can support communication between processes on machines with very different memory representations and lengths of elementary datatypes (heterogeneous communication).
• Specifying application-oriented layout of data in memory– reduces memory-to-memory copies in the implementation
– allows the use of special hardware (scatter/gather) when available
• Theoretically simplifies the programmer’s life
• How would you build “datatypes” in PVM? – sequences of pvm_pack()
Introduction to Collective Operations in MPI
• Collective operations are called by all processes in a communicator.
• MPI_BCAST distributes data from one process (the root) to all others in a communicator.– How does this differ from pvm_mcast(), pvm_broadcast()?
• MPI_REDUCE combines data from all processes in communicator and returns it to one process.
• In many numerical algorithms, SEND/RECEIVE can be replaced by BCAST/REDUCE, improving both simplicity and efficiency.
Example: PI in C - 1

#include "mpi.h"
#include <stdio.h>
#include <math.h>
int main(int argc, char *argv[])
{
    int done = 0, n, myid, numprocs, i, rc;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x, a;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    while (!done) {
        if (myid == 0) {
            printf("Enter the number of intervals: (0 quits) ");
            scanf("%d", &n);
        }
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0) break;
Example: PI in C - 2

        h = 1.0 / (double) n;
        sum = 0.0;
        for (i = myid + 1; i <= n; i += numprocs) {
            x = h * ((double)i - 0.5);
            sum += 4.0 / (1.0 + x*x);
        }
        mypi = h * sum;
        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (myid == 0)
            printf("pi is approximately %.16f, Error is %.16f\n",
                   pi, fabs(pi - PI25DT));
    }
    MPI_Finalize();
    return 0;
}
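The numerical kernel above is the midpoint rule for the integral of 4/(1+x²) from 0 to 1, which equals π; each rank sums every numprocs-th interval and MPI_Reduce adds the partial sums. With numprocs = 1 the parallel loop collapses to the serial sum below, a handy way to sanity-check the formula without running MPI:

```c
#include <math.h>

/* Midpoint-rule estimate of pi with n intervals: the same sum the
 * MPI version splits across ranks, computed here by one "rank". */
double pi_midpoint(int n)
{
    double h = 1.0 / (double)n, sum = 0.0;
    for (int i = 1; i <= n; i++) {
        double x = h * ((double)i - 0.5);  /* interval midpoint */
        sum += 4.0 / (1.0 + x * x);
    }
    return h * sum;
}
```

Since ranks only interleave the same set of intervals, the parallel result matches this serial sum up to floating-point reassociation.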
Sources of Deadlocks

• Send a large message from process 0 to process 1
– If there is insufficient storage at the destination, the send must wait for the user to provide the memory space (through a receive)
• What happens with

Process 0: Send(1); Recv(1)
Process 1: Send(0); Recv(0)

• This is called “unsafe” because it depends on the availability of system buffers (PVM does buffering)
Some Solutions to the “unsafe” Problem
• Order the operations more carefully:

Process 0: Send(1); Recv(1)
Process 1: Recv(0); Send(0)

• Use non-blocking operations:

Process 0: Isend(1); Irecv(1); Waitall
Process 1: Isend(0); Irecv(0); Waitall
Extending the Message-Passing Interface
• Dynamic process management
– Dynamic process startup
– Dynamic establishment of connections
• One-sided communication
– Put/get
– Other operations
• Parallel I/O
• Other MPI-2 features
– Generalized requests
– Bindings for C++/Fortran-90; interlanguage issues
When to use MPI
• Portability and Performance
• Irregular Data Structures
• Building Tools for Others– Libraries
• Need to manage memory on a per-processor basis
When not to use MPI
• Regular computation matches HPF– But see PETSc/HPF comparison (ICASE 97-72)
• Solution (e.g., library) already exists– http://www.mcs.anl.gov/mpi/libraries.html
• Require Fault Tolerance– Sockets
• Distributed Computing– CORBA, DCOM, etc.
Synchronization
• Messages serve two purposes:
– Moving data
– Synchronizing processes
• When does synchronization occur for
– Blocking send?
– Non-blocking send?
» Depends on the implementation
Common Synchronization Constructs
• Barrier
– All processes stop; no information is passed
• Broadcast
– All processes get information from the root
• How does this differ from a barrier?
• Ideas on how to implement an MPI_BCAST?
• Reduction
– Data is reduced (summed, multiplied, counted, …)
• Ideas on how to implement MPI_REDUCE?
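One classic answer to both questions is a binomial tree: in round k, every rank below 2^k forwards to the rank 2^k away, so the set of informed processes doubles each round. The plain-C simulation below sketches that broadcast schedule (a sketch of the idea, not MPI code); a reduction runs the same tree in reverse, combining values at each step instead of copying them:

```c
/* Simulate a binomial-tree broadcast from rank 0 among p ranks.
 * has[i] becomes 1 once rank i holds the data; returns the number
 * of communication rounds, which is ceil(log2(p)). */
int bcast_rounds(int p, int *has)
{
    for (int i = 0; i < p; i++) has[i] = 0;
    has[0] = 1;                      /* the root starts with the data */
    int rounds = 0;
    for (int dist = 1; dist < p; dist *= 2) {
        /* every informed rank i < dist sends to rank i + dist */
        for (int i = 0; i < dist && i + dist < p; i++)
            has[i + dist] = has[i];
        rounds++;
    }
    return rounds;
}
```

All p-1 transfers of a linear scheme still happen, but many proceed concurrently, so the critical path drops from p-1 message times to about log2(p).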
Synchronization Trees
• The topology of the messages that implement a reduction is important.
• The LogP lecture illustrated a tree that was neither linear nor binary
– The major idea was to maintain concurrency
• MPI specifies many operations as (possibly) synchronizing, so that implementers have a choice in how to optimize them.