Managing Heterogeneous MPI Application Interoperation and Execution.
From PVMPI to SNIPE-based MPI_Connect()
Graham E. Fagg*, Kevin S. London, Jack J. Dongarra and Shirley V. Browne.
University of Tennessee and Oak Ridge National Laboratory
* contact at [email protected]
Project Targets
• Allow intercommunication between different MPI implementations or instances of the same implementation on different machines
• Provide heterogeneous intercommunicating MPI applications with access to some of the MPI-2 process control and parallel I/O features
• Allow use of optimized vendor MPI implementations while still permitting distributed heterogeneous parallel computing in a transparent manner.
MPI-1 Communicators
• All processes in an MPI-1 application belong to a global communicator called MPI_COMM_WORLD
• All other communicators are derived from this global communicator.
• Communication can only occur within a communicator.
– Safe communication
MPI Internals: Processes
• All process groups are derived from the membership of the MPI_COMM_WORLD communicator.
– i.e. no external processes.
• MPI-1 process membership is static, not dynamic as in PVM or LAM 6.X
– simplified consistency reasoning
– fast communication (fixed addressing) even across complex topologies.
– interfaces well to simple run-time systems as found on many MPPs.
MPI-1 Application (diagram)
• An MPI_COMM_WORLD containing a derived communicator.
Disadvantages of MPI-1
• Static process model
– If a process fails, all communicators it belongs to become invalid, i.e. no fault tolerance.
– Dynamic resources either cause applications to fail due to loss of nodes, or make applications inefficient as they cannot take advantage of new nodes by starting/spawning additional processes.
– When using a dedicated MPP MPI implementation you cannot usually use off-machine or even off-partition nodes.
MPI-2
• Problem areas and needed additional features identified in MPI-1 are being addressed by the MPI-2 forum.
• These include:
– inter-language operation
– dynamic process control / management
– parallel I/O
– extended collective operations
• Support for inter-implementation communication/control was considered but not adopted
– See other projects such as NIST IMPI, PACX and PLUS
User Requirements for Inter-Operation
• Dynamic connections between tasks on different platforms/systems/sites
• Transparent inter-operation once it has started
• Single style of API, i.e. MPI only!
• Access to virtual machine / resource management when required, not mandatory use
• Support for complex features across implementations, such as user-derived data types
MPI_Connect / PVMPI 2
• API in the MPI-1 style
• Inter-application communication using standard MPI-1 point-to-point function calls
– Supporting all variations of send and receive
– All MPI data types, including user-derived
• Naming service functions similar in semantics and functionality to current MPI inter-communicator functions
– Ease of use for programmers experienced only in MPI, not in other message-passing systems such as PVM, Nexus or TCP socket programming
Process Identification
• Process groups are identified by a character name. Individual processes are addressed by a tuple.
• In MPI:
– {communicator, rank}
– {process group, rank}
• In MPI_Connect/PVMPI2:
– {name, rank}
– {name, instance}
• Instance and rank are identical and range from 0..N-1, where N is the number of processes.
Registration, i.e. Naming MPI Applications
• Process groups register their name with a global naming service, which returns them a system handle used for future operations on this name / communicator pair.
– int MPI_Conn_register (char *name, MPI_Comm local_comm, int *handle);
– call MPI_CONN_REGISTER (name, local_comm, handle, ierr)
• Processes can remove their name from the naming service with:
– int MPI_Conn_remove (int handle);
• A process may have multiple names associated with it.
• Names can be registered, removed and re-registered multiple times without restriction.
MPI-1 Inter-communicators
• An inter-communicator is used for point to point communication between disjoint groups of processes.
• Inter-communicators are formed using MPI_Intercomm_create(), which operates upon two existing non-overlapping intra-communicators and a bridge communicator.
• MPI_Connect/PVMPI could not use this mechanism, as there is no MPI bridge communicator between groups formed from separate MPI applications: their MPI_COMM_WORLDs do not overlap.
Forming Inter-communicators
• MPI_Connect/PVMPI2 forms its inter-communicators with a modified MPI_Intercomm_create call.
• The bridging communication is performed automatically; the user only has to specify the remote group's registered name.
– int MPI_Conn_intercomm_create (int local_handle, char *remote_group_name, MPI_Comm *new_inter_comm);
– call MPI_CONN_INTERCOMM_CREATE (localhandle, remotename, newcomm, ierr)
Inter-communicators
• Once an inter-communicator has been formed it can be used almost exactly as any other MPI inter-communicator:
– All point-to-point operations
– Communicator comparisons and duplication
– Remote group information
• Resources are released by MPI_Comm_free()
Simple Example: Air Model

    /* air model */
    MPI_Init (&argc, &argv);
    MPI_Conn_register ("AirModel", MPI_COMM_WORLD, &air_handle);
    MPI_Conn_intercomm_create (air_handle, "OceanModel", &ocean_comm);
    MPI_Comm_rank (MPI_COMM_WORLD, &myrank);
    while (!done) {
        /* do work using intra-comms */
        /* swap values with other model */
        MPI_Send (databuf, cnt, MPI_DOUBLE, myrank, tag, ocean_comm);
        MPI_Recv (databuf, cnt, MPI_DOUBLE, myrank, tag, ocean_comm, &status);
    } /* end while done work */
    MPI_Conn_remove (air_handle);
    MPI_Comm_free (&ocean_comm);
    MPI_Finalize ();
Ocean Model

    /* ocean model */
    MPI_Init (&argc, &argv);
    MPI_Conn_register ("OceanModel", MPI_COMM_WORLD, &ocean_handle);
    MPI_Conn_intercomm_create (ocean_handle, "AirModel", &air_comm);
    MPI_Comm_rank (MPI_COMM_WORLD, &myrank);
    while (!done) {
        /* do work using intra-comms */
        /* swap values with other model */
        MPI_Recv (databuf, cnt, MPI_DOUBLE, myrank, tag, air_comm, &status);
        MPI_Send (databuf, cnt, MPI_DOUBLE, myrank, tag, air_comm);
    }
    MPI_Conn_remove (ocean_handle);
    MPI_Comm_free (&air_comm);
    MPI_Finalize ();
Coupled Model (diagram)
• Two MPI applications, the Ocean Model and the Air Model, each with its own MPI_COMM_WORLD, are joined by a global inter-communicator: the air side addresses it as ocean_comm, the ocean side as air_comm.
MPI_Connect Internals, i.e. SNIPE
• SNIPE is a meta-computing system from UTK that was designed to support long-term distributed applications.
• MPI_Connect uses SNIPE as its communications layer.
• Naming service is provided by the RCDS RC_Server system (which is also used by HARNESS for repository information).
• MPI application startup is via SNIPE daemons that interface to standard batch/queuing systems.
– These understand LAM, MPICH, IBM MPI (POE), SGI MPI, and qsub variations
– Condor and LSF support coming soon
PVMPI2 Internals, i.e. PVM
• Uses PVM as a communications layer.
• Naming service is provided by the PVM Group Server in PVM 3.3.x and by the Mailbox system in PVM 3.4.
• PVM 3.4 is simpler, as it has user-controlled message contexts and message handlers.
• MPI application startup is provided by specialist PVM tasker processes.
– These understand LAM, MPICH, IBM MPI (POE), SGI MPI
Internals: Linking with MPI
• MPI_Connect and PVMPI are built as an MPI profiling interface, and are thus transparent to user applications.
• During building they can be configured to call other profiling interfaces, allowing inter-operation with other MPI monitoring/tracing/debugging tool sets.
MPI_Connect / PVMPI Layering (diagram)
• The user's code calls an MPI_function, which enters the intercomm library.
• The library looks up the communicator:
– If it is a true MPI intra-communicator, the profiled MPI call (PMPI_Function) is used.
– Else the call is translated into SNIPE/PVM addressing and performed with SNIPE/PVM functions.
• The library works out the correct return code and hands it back to the user's code.
Process Management
• PVM and/or SNIPE can handle the startup of MPI jobs
– General Resource Manager / specialist PVM tasker control / SNIPE daemons
• Jobs can also be started by MPIRUN
– Useful when testing on interactive nodes
• Once enrolled, MPI processes are under SNIPE or PVM control
– Signals (such as TERM/HUP)
– Notification messages (fault tolerance)
Process Management (diagram)
• A user request goes to the GRM and PVMD, which drive an MPICH tasker on an MPICH cluster, a POE tasker on an IBM SP2, and a PBS/qsub tasker on an SGI O2K.
Conclusions
• MPI_Connect and PVMPI allow different MPI implementations to inter-operate.
• Only 3 additional calls are required.
• Layering requires a fully profiled MPI library (complex linking).
• Intra-communication performance may be slightly affected.
• Inter-communication is as fast as the intercomm library used (either PVM or SNIPE).
The MPI_Connect interface is much like names, addresses and ports in MPI-2
• Well-known address host:port
– MPI_PORT_OPEN makes a port
– MPI_ACCEPT lets a client connect
– MPI_CONNECT performs the client-side connection
• Service naming
– MPI_NAME_PUBLISH (port, info, service)
– MPI_NAME_GET on the client side to get the port
Server-Client Model: Connection (diagram)
• The server calls MPI_Accept() on a {host:port} address; the client calls MPI_Connect to that address, yielding an inter-comm between them.
Server-Client Model: Naming (diagram)
• The server calls MPI_Port_open() to obtain a {host:port} address and publishes it under a NAME with MPI_Name_publish; the client retrieves the {host:port} with MPI_Name_get.
SNIPE
• Single global name space
– Built using RCDS; supports URLs, URNs and LIFNs
– Testing against LDAP
• Scalable / secure
• Multiple resource managers
• Safe execution environment
– Java etc.
• Parallelism is the basic unit of execution
Additional Information
• PVM: http://www.epm.ornl.gov/pvm
• PVMPI: http://icl.cs.utk.edu/projects/pvmpi/
• MPI_Connect: http://icl.cs.utk.edu/projects/mpi_connect/
• SNIPE: http://www.nhse.org/snipe/ and http://icl.cs.utk.edu/projects/snipe/
• RCDS: http://www.netlib.org/utk/projects/rcds/
• ICL: http://www.netlib.org/icl/
• CRPC: http://www.crpc.rice.edu/CRPC
• DOD HPC MSRCs
– CEWES: http://www.wes.hpc.mil
– ARL: http://www.arl.hpc.mil
– ASC: http://www.asc.hpc.mil