MPICH2—A New Start for MPI Implementations
William D. Gropp
Mathematics and Computer Science, Argonne National Laboratory
www.mcs.anl.gov/~gropp
The newest edition of Using MPI-2, translated by Takao Hatazaki, www.pearsoned.co.jp/washo/prog/wa_pro61-j.html
MPICH2 Team
• Bill Gropp
• Rusty Lusk
• Rajeev Thakur
• Rob Ross
• David Ashton
• Brian Toonen
• Rob Latham
What’s New
• Pre-pre-pre-pre release version available for groups that expect to perform research on MPI implementations with MPICH2
• Contains:
  Most of MPI-1, plus service functions from MPI-2
  C and Fortran bindings
  Example devices for TCP
  Documentation
MPICH2 Research
• All-new implementation is our vehicle for research in:
  Thread safety and efficiency (e.g., avoiding thread locks)
  Optimized MPI datatypes
  Optimized Remote Memory Access (RMA)
  High scalability (64K MPI processes and more)
  Exploiting Remote Direct Memory Access (RDMA) capable networks
  All of MPI-2, including dynamic process management, parallel I/O, and RMA
  Usability and robustness
• Software engineering techniques that automate and simplify creating and maintaining a solid, user-friendly implementation
• Allow extensive runtime error checking but do not require it
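A minimal sketch of the optional-checking pattern: the test compiles away entirely when error checking is configured out (the macro and guard names follow MPICH conventions but should be treated as illustrative):

    /* With error checking configured in, arguments are validated
       before the ADI routine is called; configured out, the test
       costs nothing at runtime. */
    #ifdef HAVE_ERROR_CHECKING
    #define MPIR_ERRTEST_COUNT(count, err)          \
        do {                                        \
            if ((count) < 0) {                      \
                (err) = MPI_ERR_COUNT;              \
                goto fn_fail;                       \
            }                                       \
        } while (0)
    #else
    #define MPIR_ERRTEST_COUNT(count, err) do { } while (0)
    #endif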
Some Target Platforms
• Clusters (TCP, UDP, Infiniband, Myrinet, Proprietary Interconnects, …)
• Clusters of SMPs
• Grids (UDP, TCP, Globus I/O, …)
• BlueGene/x
  64K processors; 64K address spaces
  ANL/IBM developing MPI and process management for BG/L
• Other systems
  ANL/Cray developing MPI and PVFS for Red Storm
Structure of MPICH-2
[Architecture diagram: MPICH-2 sits atop two device interfaces, ADI-3 (communication) and ADIO (I/O, over PVFS and other existing parallel file systems), plus PMI for process management. ADI-3 implementations include the Channel Interface (TCP existing; Myrinet and other NICs in progress), a Multi-Method device, and BG/L and Portals devices. PMI implementations include MPD on Unix (Python) and Windows, fork, and vendor process managers. Components are marked existing, in progress, or for others to implement.]
The Major Components
• PMI: Process Manager Interface
  Provides a scalable interface to both process creation and communication setup
  Designed to permit many implementations, including with/without daemons and with 3rd-party process managers
• ADIO: I/O interface
  No change (except for error reporting and request management) from current ROMIO, at least this year
• ADI3
  New device aimed at higher-performance networks and new network capabilities
Needs of an MPI Implementation
• Point-to-point communication
  Can work with polling
• Cancel of send
  Requires some agent (interrupt-driven receive, separate thread, guaranteed timer)
• Active target RMA
  Can work with polling; performance may require "bypass"
• Passive target RMA
  Requires some agent
• For some operations, the agent may be special hardware capabilities
The Layers
• ADI3
  Full-featured interface, closely matched to MPI point-to-point and RMA operations
  Most MPI communication routines perform (optional) error checking and then "call" the ADI3 routine
  Modular design allows replacement of parts, e.g., datatypes
• New Channel Interface
  Much smaller than ADI-3, easily implemented on most platforms
  Nonblocking design is more robust and efficient than the MPICH-1 version
Expose Structures To All Levels of the Implementation
• All MPI opaque objects are defined structs at all levels (ADI, channel, and lower)
• All objects have a handle that encodes the type of the object within the handle value
  Permits runtime type checking of handles
  Null handles are now distinct
  Easier detection of misused values
  Fortran integer-valued handles simplify the implementation for 64-bit systems
• Consistent mechanism to extend definitions to support the needs of particular devices
• Defined fields simplify much code, e.g., direct access to the rank and size of communicators (see the sketch below)
Special Case:Predefined Objects
• Many predefined objects contain all information within the handle
  Predefined MPI datatype handles contain:
  • Size in bytes
  • The fact that the handle is a predefined datatype
  No other data is needed by most MPI routines
  • Eliminates extra loads, pointer chasing, and setup at MPI_Init time
• Predefined attributes are handled separately from general attributes
  Special case anyway, since the C and Fortran versions differ for the predefined attributes
• Other predefined objects are initialized only on demand
  Handle always valid
  Data areas may not be initialized until needed
  Example: names (MPI_Type_set_name) on datatypes
Builtin MPI Datatypes
• MPI_TYPE_NULL has a datatype code in the upper bits
• An index field is used to implement MPI_Type_set_name/MPI_Type_get_name
[Handle layout diagram: code for builtin datatype (upper bits) | index to struct | size in bytes]
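A hedged sketch of such a handle encoding; field widths and helper names are illustrative assumptions, not MPICH2's actual layout:

    #include <stdint.h>

    /* Upper bits: object kind; middle bits: index used by
       MPI_Type_set/get_name; low bits: size in bytes. */
    #define KIND_SHIFT    30
    #define KIND_BUILTIN  0x3u
    #define INDEX_SHIFT   16
    #define SIZE_MASK     0xFFFFu

    static inline uint32_t make_builtin_handle(uint32_t index, uint32_t size)
    {
        return (KIND_BUILTIN << KIND_SHIFT) | (index << INDEX_SHIFT) | size;
    }

    /* Most MPI routines need only the size: no load, no pointer chase. */
    static inline uint32_t builtin_size(uint32_t handle)
    {
        return handle & SIZE_MASK;
    }

    static inline int is_builtin(uint32_t handle)
    {
        return (handle >> KIND_SHIFT) == KIND_BUILTIN;
    }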
Channel “CH3”
• One possible implementation design for ADI3
  Others possible and under way
• Thread safe by design
  Requires atomic operations
• Nonblocking design
  Requires some completion handle, so…
• Delayed allocation
  Only allocate/initialize a request if a communication operation did not complete immediately
An Example: CH3 Implementation over TCP
• Pollable and active-message data paths
• RMA path
Typical CH3 Routines
• Begin messages
  CH3_iStartMsg, CH3_iStartMsgv
  CH3_iStartRead
  CH3_Request_create
• Send or receive data using an existing request (e.g., incremental datatype handling, rendezvous)
  CH3_iSend, CH3_iSendv
  CH3_iWrite
  CH3_iRead
• Ensure progress
  CH3_Progress, CH3_Progress_start, CH3_Progress_end, CH3_Progress_poke, CH3_Progress_signal_completion
• Initialization
  CH3_Init, CH3_Finalize, CH3_InitParent
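As an illustration of the gather-write style these routines enable, the sketch below builds a two-element struct iovec (header plus payload) and starts a send. The CH3_iStartMsgv signature shown is an assumption, since the slide lists only the routine names:

    #include <stddef.h>
    #include <sys/uio.h>

    /* Assumed signature: returns NULL if the data was sent immediately,
       otherwise a request handle tracked by the progress engine. */
    void *CH3_iStartMsgv(void *conn, struct iovec *iov, int n_iov);

    void send_eager(void *conn, void *hdr, size_t hdrlen,
                    void *data, size_t datalen)
    {
        struct iovec iov[2];
        iov[0].iov_base = hdr;  iov[0].iov_len = hdrlen;
        iov[1].iov_base = data; iov[1].iov_len = datalen;

        /* Header and payload go out in one gather write
           ("0 copy" for headers). */
        void *request = CH3_iStartMsgv(conn, iov, 2);
        if (request != NULL) {
            /* completion will be reported through CH3_Progress */
        }
    }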
Implementation Sketch
    MPID_Send()
    {
        decide if eager based on message size and flow control
        if (eager) {
            create packet on stack
            fill in as eager send packet
            if (data contiguous) {
                request = CH3_iStartmsgv( iov )
                /* the request is null if the message was sent immediately */
            } else {
                create pack buffer
                pack into buffer
                request = CH3_iStartmsgv( iov )
                if (!request)
                    free pack buffer
                else
                    save location of pack buffer in request
            }
        } else { /* rendezvous */
            create packet on stack
            request = CH3_request_create()
            fill in request
            fill in packet as rndv_req_to_send (include request id)
            CH3_iSend( request, packet )
        }
        return request
    }
Example Channel/TCP Internal Profiling
Example: Cost of Preallocating Requests
Extra Wrinkles
• Some fast networks do not guarantee ordered delivery (probably all of the very fastest)
• MPI, fortunately, does not require ordered delivery
  Except for message envelopes (message headers are ordered)
• What should a channel interface look like? Should it require ordering (streams)?
BlueGene/L
Preserving Message Ordering
• Diversion: consider message matching for trace tools such as Jumpshot. How are arrows drawn?
  If communication is single threaded, the trace tool can "replay" the messages to perform matching
  For multithreaded (MPI_THREAD_MULTIPLE) programs, the easiest solution is a message sequence number on envelopes
• But that's (almost) all we need to handle unordered message delivery!
  Need to count data segments
  Need to handle data that arrives before its envelope
  • Can be handled as a low (but not zero) probability case
• Solution: provide hooks in the channel device for sequence numbers on envelopes; use these for MPI_THREAD_MULTIPLE and unordered networks (see the sketch below)
  Other wrinkles are handled below the channel device
  • OK, because it does not force ordered, stream-like messaging on the low levels
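A minimal sketch of the sequencing hook, assuming per-connection counters and a parking list for the (low-probability) early arrivals; all names are illustrative, not MPICH2 internals:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct Envelope {
        uint16_t seqnum;           /* stamped by the sender per connection */
        /* ... context id, tag, source, length ... */
        struct Envelope *next;
    } Envelope;

    typedef struct Connection {
        uint16_t  next_send_seq;   /* next stamp for outgoing envelopes */
        uint16_t  next_recv_seq;   /* next envelope we may process */
        Envelope *parked;          /* early arrivals (rare but possible) */
    } Connection;

    /* Assumed helpers: park saves an early envelope; unpark returns the
       parked envelope with the given seqnum, or NULL; match delivers. */
    void      park(Connection *c, Envelope *e);
    Envelope *unpark(Connection *c, uint16_t seq);
    void      match(Connection *c, Envelope *e);

    void on_envelope(Connection *c, Envelope *e)
    {
        if (e->seqnum != c->next_recv_seq) {
            park(c, e);            /* arrived before a predecessor */
            return;
        }
        while (e != NULL) {        /* process in order, draining any
                                      parked successors we unblock */
            match(c, e);
            c->next_recv_seq++;
            e = unpark(c, c->next_recv_seq);
        }
    }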
BG/L and the MPICH2 Architecture
[Diagram: BG/L in the MPICH2 architecture. The Message Passing Interface (MPI) sits on the Abstract Device Interface (MPID), which provides types and key values, the notion of requests, transformation to point-to-point operations, and the request progress engine. Below it, the Channel Interface (CH3) has implementations over TCP/IP and over the BG/L Torus, Tree, and GI networks. Special opportunities: collective bypass, scalable buffer management, out-of-order network.
Call path through the MPICH2 BG/L software for a user MPI_Send or MPI_Bcast: MPI → MPID_Send (ADI-3 implementation) → CH3_iStartmsgv (channel implementation) → Channel_Write (Torus Message Layer transport) → Torus_Send (Torus Packet Layer) → torus hardware (lfpdux()/sfpdx()).]
CH3 Summary
• Nonblocking interface for correctness and “0 copy” transfers
• struct iovec routines to provide “0 copy” for headers and data
• Lazy request creation to avoid unnecessary operations when data can be sent immediately (low latency case); routines to reuse requests during incremental transfers
• Thread-safe message queue manipulation routines
• Supports both polling and preemptive progress
Small Message Latency
[Figure: latency in µs vs. message size (0–1000 bytes) over 100Mb Ethernet TCP, comparing Raw TCP, MPICH-chp4, and MPICH2-ch3:tcp; y-axis spans roughly 50–250 µs]
Bandwidth
[Figure: bandwidth in MB/s vs. message size (1 byte to 10 MB, log scale) over 100Mb Ethernet TCP, comparing Raw TCP, MPICH-chp4, and MPICH2-ch3:tcp; y-axis to 12 MB/s]
Small Message Latency
[Figure: latency in µs vs. message size (0–1000 bytes) over Gigabit Ethernet TCP, comparing Raw TCP, MPICH-chp4, and MPICH2-ch3:tcp; y-axis spans roughly 90–115 µs]
Bandwidth
[Figure: bandwidth in MB/s vs. message size (1 byte to 10 MB, log scale) over Gigabit Ethernet TCP, comparing Raw TCP, MPICH-chp4, and MPICH2-ch3:tcp; y-axis to 120 MB/s]
Enhancing Collective Performance
• MPICH-1 algorithms are a combination of purely functional and minimum spanning tree (MST) approaches
• Better algorithms, based on scatter/gather operations, exist for large messages
  E.g., see van de Geijn for the 1-D mesh
• And better algorithms, based on MST, exist for small messages
  Correct implementations must be careful with MPI datatypes
• Rajeev Thakur and I have developed and implemented algorithms for switched networks that provide much better performance
• The following results are for MPICH-1 and will be in the next MPICH-1 release
Bcast with Scatter/Gather
• Implement MPI_Bcast(buf, n, …) as (sketch below):
  MPI_Scatter(buf, n/p, …, buf + rank*n/p, …)
  MPI_Allgather(buf + rank*n/p, n/p, …, buf, …)
[Diagram: eight processes P0–P7 exchanging chunks of the broadcast buffer]
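A sketch of this algorithm in MPI, assuming a contiguous byte buffer with n divisible by p, and using MPI_IN_PLACE to avoid the send/receive buffer aliasing that the slide's shorthand glosses over (a real implementation also handles remainders and general datatypes):

    #include <mpi.h>

    void bcast_scatter_allgather(void *buf, int n, int root, MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);
        int chunk = n / p;                       /* bytes per process */
        char *mine = (char *)buf + rank * chunk; /* my chunk's offset */

        /* Root scatters its buffer; every other process receives one
           chunk at its own offset. Root's chunk is already in place. */
        if (rank == root)
            MPI_Scatter(buf, chunk, MPI_BYTE, MPI_IN_PLACE, chunk,
                        MPI_BYTE, root, comm);
        else
            MPI_Scatter(NULL, 0, MPI_BYTE, mine, chunk,
                        MPI_BYTE, root, comm);

        /* Allgather reassembles the full buffer on every process;
           MPI_IN_PLACE says each contribution is already at its
           offset within buf. */
        MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                      buf, chunk, MPI_BYTE, comm);
    }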
Collective Performance
[Figure, four panels, time in microseconds vs. message size:
  Allgather, 128 nodes (total size 0–15000 bytes): old ring algorithm vs. new recursive doubling
  Scatter, 128 nodes (total size 0–15000 bytes): old linear algorithm vs. new MST
  Broadcast, 128 nodes (data size up to 8,000,000 bytes): old MST vs. new scatter+allgather
  Reduce Scatter, 128 procs (total size 0–250000 bytes): old reduce+scatter vs. new pairwise exchange]
And We’ll Go Faster Yet
• MPICH-2 enables additional collective optimizations:
  Pipelining of long messages
  Store-and-forward of communication buffers eliminates extra copies, particularly for non-contiguous messages
• And custom algorithms (see the sketch below)
  Each collective operation may be replaced, on a per-communicator basis, with a separate routine
  Unlike MPICH-1, each algorithm is contained within the file implementing that particular collective routine, so only the collective routines used are loaded…
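One plausible shape for per-communicator replacement: a table of function pointers hung off the communicator, consulted before falling back to the default algorithm. The struct and field names below are illustrative assumptions, not the MPICH2 definitions:

    #include <mpi.h>

    struct MPID_Comm;   /* communicator object, defined elsewhere */

    /* Table of collective implementations; one pointer per operation. */
    typedef struct Collops {
        int (*Bcast)(void *buf, int count, MPI_Datatype datatype,
                     int root, struct MPID_Comm *comm);
        /* ... Reduce, Allgather, and so on ... */
    } Collops;

    /* Inside MPI_Bcast, after error checking, something like:
           if (comm_ptr->coll_fns && comm_ptr->coll_fns->Bcast)
               mpi_errno = comm_ptr->coll_fns->Bcast(buf, count,
                               datatype, root, comm_ptr);
           else
               mpi_errno = default_bcast(buf, count, datatype,
                               root, comm_ptr);
    */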
Memory Footprint
• Processor-rich systems have less memory per node, making lean implementations of libraries important
• MPICH-2 addresses the memory "footprint" of the implementation through:
  Library design and organization that reduce the number of extraneous routines needed only to satisfy (normally unused) references
  • Don't even think of a (classic, independent on-demand) shared library approach on 64K nodes
  Lazy initialization of objects that are not commonly used
  Use of callbacks to bring in code only when needed
Reducing the Memory Footprint
• MPI groups
  Groups are rarely used by applications
  Groups are not used internally in the implementation of MPI routines, not even within the communicators
• Datatype names
  MPI-2 allows users to set and get names on all MPI datatypes, including predefined ones
  Rather than preinitializing all names during MPI_Init, names are set during the first use of MPI_Type_set_name or MPI_Type_get_name
• Buffered send (see the sketch below)
  Proper completion requires checking for pending buffered sends during MPI_Finalize
  The first use of MPI_Buffer_attach adds a bsend callback to a list of routines that are called during MPI_Finalize
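A sketch of this first-use pattern with illustrative names (not the MPICH2 source): the finalize list stays empty, and no bsend code runs, unless MPI_Buffer_attach is ever called.

    typedef int (*Finalize_cb)(void);

    #define MAX_FINALIZE_CB 32
    static Finalize_cb finalize_cbs[MAX_FINALIZE_CB];
    static int         n_finalize_cbs = 0;

    static void add_finalize_cb(Finalize_cb cb)
    {
        if (n_finalize_cbs < MAX_FINALIZE_CB)
            finalize_cbs[n_finalize_cbs++] = cb;
    }

    static int bsend_finalize(void)
    {
        /* wait for pending buffered sends, then release the buffer */
        return 0;
    }

    int MPI_Buffer_attach(void *buffer, int size)
    {
        static int registered = 0;
        if (!registered) {          /* first use pays the setup cost */
            add_finalize_cb(bsend_finalize);
            registered = 1;
        }
        /* ... record buffer and size for later buffered sends ... */
        return 0;  /* MPI_SUCCESS */
    }

    /* MPI_Finalize then simply walks the list:
           for (i = 0; i < n_finalize_cbs; i++)
               (*finalize_cbs[i])();
    */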
Thread Safety
• Internal APIs are defined to be thread-safe, usually by providing an atomic operation
  Locks are not part of the internal API
  They may be used to implement the API, but they are not required
  Locks are bad but sometimes inevitable
• Level of thread safety settable at configure time and/or runtime using the same source tree
Reference Counting MPI Objects
• Most MPI objects have reference count semantics
  Free only happens when the reference count is zero
  Important for nonblocking (split phase) operations
• Updating the reference count must be atomic
  A non-atomic update is one of the most common and most difficult-to-find errors in multithreaded programs
• MPICH2 uses
  MPIU_Object_add_ref(pointer_to_object)
  MPIU_Object_release_ref(pointer_to_object, &inuse)
  • If inuse is false, the object can be freed (see the sketch below)
• Why not use fetch-and-increment? Not available on IA32
  The above API uses atomic instructions on all common platforms
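A minimal sketch of these two calls using C11 atomics purely for illustration (MPICH2 itself predates C11 and uses platform-specific atomic instructions, as noted above):

    #include <stdatomic.h>

    typedef struct {
        atomic_int ref_count;
        /* ... object contents ... */
    } MPIU_Object_head;

    static void MPIU_Object_add_ref(MPIU_Object_head *obj)
    {
        atomic_fetch_add(&obj->ref_count, 1);
    }

    static void MPIU_Object_release_ref(MPIU_Object_head *obj, int *inuse)
    {
        /* fetch_sub returns the value before the decrement, so the
           count reached zero exactly when the old value was 1 */
        *inuse = (atomic_fetch_sub(&obj->ref_count, 1) != 1);
    }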
Managing the Source
• Error messages
  Short text, long text
  Instance-specific text
  Tools extract and write error message routines
  Source markings for error tests
• Leads to…
• Testing
  Coverage tests
  Problem: ignoring error-handling code, particularly for low-likelihood errors
  • Exploit error messaging macros and comment markers
Build Environment
• Goal: no replicated data
• Approach: solve everything with indirection
• Suite of scripts that derive and write the necessary files
  Autoconf 2.13/2.52
  "simplemake" creates Makefile.in, understands libraries whose source files are in different directories, handles make bugs; will create Microsoft Visual Studio project files
  "codingcheck" looks for source code problems
  "extracterrmsgs" creates message catalogs
  Fortran and C++ interfaces derived from mpi.h
MPI-2 In MPICH-2
• Everything is in progress:
• I/O will have a first port (error messaging and generalized requests); later versions will exploit more of MPICH-2 internals
• MPI_Comm_spawn and friends will appear soon, using the PMI interface
• RMA will appear over the next year
  Challenge is low latency and high performance
  Key is to exploit the semantics of the operations and the work in BSP, generalized for MPI
Conclusion
• MPICH2 is ready for researchers
• New design is cleaner, more flexible, faster
• Raises the bar on performance for all MPI implementations