MPICH2—A New Start for MPI Implementations
William D. Gropp
Mathematics and Computer Science, Argonne National Laboratory
www.mcs.anl.gov/~gropp
The newest edition of Using MPI-2, translated by Takao Hatazaki, www.pearsoned.co.jp/washo/prog/wa_pro61-j.html
MPICH2 Team
• Bill Gropp
• Rusty Lusk
• Rajeev Thakur
• Rob Ross
• David Ashton
• Brian Toonen
• Rob Latham
What’s New
• Pre-pre-pre-pre release version available for groups that expect to perform research on MPI implementations with MPICH2
• Contains:
  Most of MPI-1, plus service functions from MPI-2
  C and Fortran bindings
  Example devices for TCP
  Documentation
MPICH2 Research
• All-new implementation is our vehicle for research in:
  Thread safety and efficiency (e.g., avoiding thread locks)
  Optimized MPI datatypes
  Optimized Remote Memory Access (RMA)
  High scalability (64K MPI processes and more)
  Exploiting Remote Direct Memory Access (RDMA) capable networks
  All of MPI-2, including dynamic process management, parallel I/O, and RMA
  Usability and robustness
• Software engineering techniques that automate and simplify creating and maintaining a solid, user-friendly implementation
• Allow extensive runtime error checking but do not require it
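A minimal sketch of the optional-checking pattern: the test compiles away entirely when error checking is configured out (the macro and guard names follow MPICH conventions but should be treated as illustrative):

    /* With error checking configured in, arguments are validated
       before the ADI routine is called; configured out, the test
       costs nothing at runtime. */
    #ifdef HAVE_ERROR_CHECKING
    #define MPIR_ERRTEST_COUNT(count, err)          \
        do {                                        \
            if ((count) < 0) {                      \
                (err) = MPI_ERR_COUNT;              \
                goto fn_fail;                       \
            }                                       \
        } while (0)
    #else
    #define MPIR_ERRTEST_COUNT(count, err) do { } while (0)
    #endif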
Some Target Platforms
• Clusters (TCP, UDP, Infiniband, Myrinet, Proprietary Interconnects, …)
• Clusters of SMPs
• Grids (UDP, TCP, Globus I/O, …)
• BlueGene/x
  64K processors; 64K address spaces
  ANL/IBM developing MPI and process management for BG/L
• Other systems
  ANL/Cray developing MPI and PVFS for Red Storm
Structure of MPICH-2
[Architecture diagram: MPICH-2 sits atop two device interfaces, ADI-3 (communication) and ADIO (I/O, over PVFS and other existing parallel file systems), plus PMI for process management. ADI-3 implementations include the Channel Interface (TCP existing; Myrinet and other NICs in progress), a Multi-Method device, and BG/L and Portals devices. PMI implementations include MPD on Unix (Python) and Windows, fork, and vendor process managers. Components are marked existing, in progress, or for others to implement.]
The Major Components
• PMI: Process Manager Interface
  Provides a scalable interface to both process creation and communication setup
  Designed to permit many implementations, including with/without daemons and with 3rd-party process managers
• ADIO: I/O interface
  No change (except for error reporting and request management) from current ROMIO, at least this year
• ADI3
  New device aimed at higher-performance networks and new network capabilities
Needs of an MPI Implementation
• Point-to-point communication
  Can work with polling
• Cancel of send
  Requires some agent (interrupt-driven receive, separate thread, guaranteed timer)
• Active target RMA
  Can work with polling; performance may require "bypass"
• Passive target RMA
  Requires some agent
• For some operations, the agent may be special hardware capabilities
The Layers
• ADI3
  Full-featured interface, closely matched to MPI point-to-point and RMA operations
  Most MPI communication routines perform (optional) error checking and then "call" the ADI3 routine
  Modular design allows replacement of parts, e.g., datatypes
• New Channel Interface
  Much smaller than ADI-3, easily implemented on most platforms
  Nonblocking design is more robust and efficient than the MPICH-1 version
Expose Structures To All Levels of the Implementation
• All MPI opaque objects are defined structs at all levels (ADI, channel, and lower)
• All objects have a handle that encodes the type of the object within the handle value
  Permits runtime type checking of handles
  Null handles are now distinct
  Easier detection of misused values
  Fortran integer-valued handles simplify the implementation for 64-bit systems
• Consistent mechanism to extend definitions to support the needs of particular devices
• Defined fields simplify much code, e.g., direct access to the rank and size of communicators (see the sketch below)
Special Case:Predefined Objects
• Many predefined objects contain all information within the handle
  Predefined MPI datatype handles contain:
  • Size in bytes
  • The fact that the handle is a predefined datatype
  No other data is needed by most MPI routines
  • Eliminates extra loads, pointer chasing, and setup at MPI_Init time
• Predefined attributes are handled separately from general attributes
  Special case anyway, since the C and Fortran versions differ for the predefined attributes
• Other predefined objects are initialized only on demand
  Handle always valid
  Data areas may not be initialized until needed
  Example: names (MPI_Type_set_name) on datatypes
Builtin MPI Datatypes
• MPI_TYPE_NULL has a datatype code in the upper bits
• An index field is used to implement MPI_Type_set_name/MPI_Type_get_name
[Handle layout diagram: code for builtin datatype (upper bits) | index to struct | size in bytes]
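A hedged sketch of such a handle encoding; field widths and helper names are illustrative assumptions, not MPICH2's actual layout:

    #include <stdint.h>

    /* Upper bits: object kind; middle bits: index used by
       MPI_Type_set/get_name; low bits: size in bytes. */
    #define KIND_SHIFT    30
    #define KIND_BUILTIN  0x3u
    #define INDEX_SHIFT   16
    #define SIZE_MASK     0xFFFFu

    static inline uint32_t make_builtin_handle(uint32_t index, uint32_t size)
    {
        return (KIND_BUILTIN << KIND_SHIFT) | (index << INDEX_SHIFT) | size;
    }

    /* Most MPI routines need only the size: no load, no pointer chase. */
    static inline uint32_t builtin_size(uint32_t handle)
    {
        return handle & SIZE_MASK;
    }

    static inline int is_builtin(uint32_t handle)
    {
        return (handle >> KIND_SHIFT) == KIND_BUILTIN;
    }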
Channel “CH3”
• One possible implementation design for ADI3
  Others possible and under way
• Thread safe by design
  Requires atomic operations
• Nonblocking design
  Requires some completion handle, so…
• Delayed allocation
  Only allocate/initialize a request if a communication operation did not complete immediately
An Example: CH3 Implementation over TCP
• Pollable and active-message data paths
• RMA path
Typical CH3 Routines
• Begin messages
  CH3_iStartMsg, CH3_iStartMsgv
  CH3_iStartRead
  CH3_Request_create
• Send or receive data using an existing request (e.g., incremental datatype handling, rendezvous)
  CH3_iSend, CH3_iSendv
  CH3_iWrite
  CH3_iRead
• Ensure progress
  CH3_Progress, CH3_Progress_start, CH3_Progress_end, CH3_Progress_poke, CH3_Progress_signal_completion
• Initialization
  CH3_Init, CH3_Finalize, CH3_InitParent
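As an illustration of the gather-write style these routines enable, the sketch below builds a two-element struct iovec (header plus payload) and starts a send. The CH3_iStartMsgv signature shown is an assumption, since the slide lists only the routine names:

    #include <stddef.h>
    #include <sys/uio.h>

    /* Assumed signature: returns NULL if the data was sent immediately,
       otherwise a request handle tracked by the progress engine. */
    void *CH3_iStartMsgv(void *conn, struct iovec *iov, int n_iov);

    void send_eager(void *conn, void *hdr, size_t hdrlen,
                    void *data, size_t datalen)
    {
        struct iovec iov[2];
        iov[0].iov_base = hdr;  iov[0].iov_len = hdrlen;
        iov[1].iov_base = data; iov[1].iov_len = datalen;

        /* Header and payload go out in one gather write
           ("0 copy" for headers). */
        void *request = CH3_iStartMsgv(conn, iov, 2);
        if (request != NULL) {
            /* completion will be reported through CH3_Progress */
        }
    }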
Implementation Sketch
    MPID_Send()
    {
        decide if eager based on message size and flow control
        if (eager) {
            create packet on stack
            fill in as eager send packet
            if (data contiguous) {
                request = CH3_iStartmsgv( iov )
                /* the request is null if the message was sent immediately */
            } else {
                create pack buffer
                pack into buffer
                request = CH3_iStartmsgv( iov )
                if (!request)
                    free pack buffer
                else
                    save location of pack buffer in request
            }
        } else { /* rendezvous */
            create packet on stack
            request = CH3_request_create()
            fill in request
            fill in packet as rndv_req_to_send (include request id)
            CH3_iSend( request, packet )
        }
        return request
    }
Example Channel/TCP Internal Profiling
Example: Cost of Preallocating Requests
Extra Wrinkles
• Some fast networks do not guarantee ordered delivery (probably all of the very fastest)
• MPI, fortunately, does not require ordered delivery
  Except for message envelopes (message headers are ordered)
• What should a channel interface look like? Should it require ordering (streams)?
BlueGene/L
Preserving Message Ordering
• Diversion: consider message matching for trace tools such as Jumpshot. How are arrows drawn?
  If communication is single threaded, the trace tool can "replay" the messages to perform matching
  For multithreaded (MPI_THREAD_MULTIPLE) programs, the easiest solution is a message sequence number on envelopes
• But that's (almost) all we need to handle unordered message delivery!
  Need to count data segments
  Need to handle data that arrives before its envelope
  • Can be handled as a low (but not zero) probability case
• Solution: provide hooks in the channel device for sequence numbers on envelopes; use these for MPI_THREAD_MULTIPLE and unordered networks (see the sketch below)
  Other wrinkles are handled below the channel device
  • OK, because it does not force ordered, stream-like messaging on the low levels
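A minimal sketch of the sequencing hook, assuming per-connection counters and a parking list for the (low-probability) early arrivals; all names are illustrative, not MPICH2 internals:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct Envelope {
        uint16_t seqnum;           /* stamped by the sender per connection */
        /* ... context id, tag, source, length ... */
        struct Envelope *next;
    } Envelope;

    typedef struct Connection {
        uint16_t  next_send_seq;   /* next stamp for outgoing envelopes */
        uint16_t  next_recv_seq;   /* next envelope we may process */
        Envelope *parked;          /* early arrivals (rare but possible) */
    } Connection;

    /* Assumed helpers: park saves an early envelope; unpark returns the
       parked envelope with the given seqnum, or NULL; match delivers. */
    void      park(Connection *c, Envelope *e);
    Envelope *unpark(Connection *c, uint16_t seq);
    void      match(Connection *c, Envelope *e);

    void on_envelope(Connection *c, Envelope *e)
    {
        if (e->seqnum != c->next_recv_seq) {
            park(c, e);            /* arrived before a predecessor */
            return;
        }
        while (e != NULL) {        /* process in order, draining any
                                      parked successors we unblock */
            match(c, e);
            c->next_recv_seq++;
            e = unpark(c, c->next_recv_seq);
        }
    }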
BG/L and the MPICH2 Architecture
[Diagram: BG/L in the MPICH2 architecture. The Message Passing Interface (MPI) sits on the Abstract Device Interface (MPID), which provides types and key values, the notion of requests, transformation to point-to-point operations, and the request progress engine. Below it, the Channel Interface (CH3) has implementations over TCP/IP and over the BG/L Torus, Tree, and GI networks. Special opportunities: collective bypass, scalable buffer management, out-of-order network.
Call path through the MPICH2 BG/L software for a user MPI_Send or MPI_Bcast: MPI → MPID_Send (ADI-3 implementation) → CH3_iStartmsgv (channel implementation) → Channel_Write (Torus Message Layer transport) → Torus_Send (Torus Packet Layer) → torus hardware (lfpdux()/sfpdx()).]
CH3 Summary
• Nonblocking interface for correctness and “0 copy” transfers
• struct iovec routines to provide “0 copy” for headers and data
• Lazy request creation to avoid unnecessary operations when data can be sent immediately (low latency case); routines to reuse requests during incremental transfers
• Thread-safe message queue manipulation routines
• Supports both polling and preemptive progress
Small Message Latency
[Figure: latency in µs vs. message size (0–1000 bytes) over 100Mb Ethernet TCP, comparing Raw TCP, MPICH-chp4, and MPICH2-ch3:tcp; y-axis spans roughly 50–250 µs]
Bandwidth
[Figure: bandwidth in MB/s vs. message size (1 byte to 10 MB, log scale) over 100Mb Ethernet TCP, comparing Raw TCP, MPICH-chp4, and MPICH2-ch3:tcp; y-axis to 12 MB/s]
Small Message Latency
[Figure: latency in µs vs. message size (0–1000 bytes) over Gigabit Ethernet TCP, comparing Raw TCP, MPICH-chp4, and MPICH2-ch3:tcp; y-axis spans roughly 90–115 µs]
Bandwidth
[Figure: bandwidth in MB/s vs. message size (1 byte to 10 MB, log scale) over Gigabit Ethernet TCP, comparing Raw TCP, MPICH-chp4, and MPICH2-ch3:tcp; y-axis to 120 MB/s]
Enhancing Collective Performance
• MPICH-1 algorithms are a combination of purely functional and minimum spanning tree (MST) approaches
• Better algorithms, based on scatter/gather operations, exist for large messages
  E.g., see van de Geijn for the 1-D mesh
• And better algorithms, based on MST, exist for small messages
  Correct implementations must be careful with MPI datatypes
• Rajeev Thakur and I have developed and implemented algorithms for switched networks that provide much better performance
• The following results are for MPICH-1 and will be in the next MPICH-1 release
Bcast with Scatter/Gather
• Implement MPI_Bcast(buf, n, …) as (sketch below):
  MPI_Scatter(buf, n/p, …, buf + rank*n/p, …)
  MPI_Allgather(buf + rank*n/p, n/p, …, buf, …)
[Diagram: eight processes P0–P7 exchanging chunks of the broadcast buffer]
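A sketch of this algorithm in MPI, assuming a contiguous byte buffer with n divisible by p, and using MPI_IN_PLACE to avoid the send/receive buffer aliasing that the slide's shorthand glosses over (a real implementation also handles remainders and general datatypes):

    #include <mpi.h>

    void bcast_scatter_allgather(void *buf, int n, int root, MPI_Comm comm)
    {
        int rank, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &p);
        int chunk = n / p;                       /* bytes per process */
        char *mine = (char *)buf + rank * chunk; /* my chunk's offset */

        /* Root scatters its buffer; every other process receives one
           chunk at its own offset. Root's chunk is already in place. */
        if (rank == root)
            MPI_Scatter(buf, chunk, MPI_BYTE, MPI_IN_PLACE, chunk,
                        MPI_BYTE, root, comm);
        else
            MPI_Scatter(NULL, 0, MPI_BYTE, mine, chunk,
                        MPI_BYTE, root, comm);

        /* Allgather reassembles the full buffer on every process;
           MPI_IN_PLACE says each contribution is already at its
           offset within buf. */
        MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                      buf, chunk, MPI_BYTE, comm);
    }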
Collective Performance
[Figure, four panels, time in microseconds vs. message size:
  Allgather, 128 nodes (total size 0–15000 bytes): old ring algorithm vs. new recursive doubling
  Scatter, 128 nodes (total size 0–15000 bytes): old linear algorithm vs. new MST
  Broadcast, 128 nodes (data size up to 8,000,000 bytes): old MST vs. new scatter+allgather
  Reduce Scatter, 128 procs (total size 0–250000 bytes): old reduce+scatter vs. new pairwise exchange]
And We’ll Go Faster Yet
• MPICH-2 enables additional collective optimizations:
  Pipelining of long messages
  Store-and-forward of communication buffers eliminates extra copies, particularly for non-contiguous messages
• And custom algorithms (see the sketch below)
  Each collective operation may be replaced, on a per-communicator basis, with a separate routine
  Unlike MPICH-1, each algorithm is contained within the file implementing that particular collective routine, so only the collective routines used are loaded…
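One plausible shape for per-communicator replacement: a table of function pointers hung off the communicator, consulted before falling back to the default algorithm. The struct and field names below are illustrative assumptions, not the MPICH2 definitions:

    #include <mpi.h>

    struct MPID_Comm;   /* communicator object, defined elsewhere */

    /* Table of collective implementations; one pointer per operation. */
    typedef struct Collops {
        int (*Bcast)(void *buf, int count, MPI_Datatype datatype,
                     int root, struct MPID_Comm *comm);
        /* ... Reduce, Allgather, and so on ... */
    } Collops;

    /* Inside MPI_Bcast, after error checking, something like:
           if (comm_ptr->coll_fns && comm_ptr->coll_fns->Bcast)
               mpi_errno = comm_ptr->coll_fns->Bcast(buf, count,
                               datatype, root, comm_ptr);
           else
               mpi_errno = default_bcast(buf, count, datatype,
                               root, comm_ptr);
    */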
Memory Footprint
• Processor-rich systems have less memory per node, making lean implementations of libraries important
• MPICH-2 addresses the memory "footprint" of the implementation through:
  Library design and organization that reduce the number of extraneous routines needed only to satisfy (normally unused) references
  • Don't even think of a (classic, independent on-demand) shared library approach on 64K nodes
  Lazy initialization of objects that are not commonly used
  Use of callbacks to bring in code only when needed
Reducing the Memory Footprint
• MPI groups
  Groups are rarely used by applications
  Groups are not used internally in the implementation of MPI routines, not even within the communicators
• Datatype names
  MPI-2 allows users to set and get names on all MPI datatypes, including predefined ones
  Rather than preinitializing all names during MPI_Init, names are set during the first use of MPI_Type_set_name or MPI_Type_get_name
• Buffered send (see the sketch below)
  Proper completion requires checking for pending buffered sends during MPI_Finalize
  The first use of MPI_Buffer_attach adds a bsend callback to a list of routines that are called during MPI_Finalize
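A sketch of this first-use pattern with illustrative names (not the MPICH2 source): the finalize list stays empty, and no bsend code runs, unless MPI_Buffer_attach is ever called.

    typedef int (*Finalize_cb)(void);

    #define MAX_FINALIZE_CB 32
    static Finalize_cb finalize_cbs[MAX_FINALIZE_CB];
    static int         n_finalize_cbs = 0;

    static void add_finalize_cb(Finalize_cb cb)
    {
        if (n_finalize_cbs < MAX_FINALIZE_CB)
            finalize_cbs[n_finalize_cbs++] = cb;
    }

    static int bsend_finalize(void)
    {
        /* wait for pending buffered sends, then release the buffer */
        return 0;
    }

    int MPI_Buffer_attach(void *buffer, int size)
    {
        static int registered = 0;
        if (!registered) {          /* first use pays the setup cost */
            add_finalize_cb(bsend_finalize);
            registered = 1;
        }
        /* ... record buffer and size for later buffered sends ... */
        return 0;  /* MPI_SUCCESS */
    }

    /* MPI_Finalize then simply walks the list:
           for (i = 0; i < n_finalize_cbs; i++)
               (*finalize_cbs[i])();
    */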
Thread Safety
• Internal APIs are defined to be thread-safe, usually by providing an atomic operation
  Locks are not part of the internal API
  They may be used to implement the API, but they are not required
  Locks are bad but sometimes inevitable
• Level of thread safety settable at configure time and/or runtime using the same source tree
Reference Counting MPI Objects
• Most MPI objects have reference count semantics
  Free only happens when the reference count is zero
  Important for nonblocking (split phase) operations
• Updating the reference count must be atomic
  A non-atomic update is one of the most common and most difficult-to-find errors in multithreaded programs
• MPICH2 uses
  MPIU_Object_add_ref(pointer_to_object)
  MPIU_Object_release_ref(pointer_to_object, &inuse)
  • If inuse is false, the object can be freed (see the sketch below)
• Why not use fetch-and-increment? Not available on IA32
  The above API uses atomic instructions on all common platforms
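A minimal sketch of these two calls using C11 atomics purely for illustration (MPICH2 itself predates C11 and uses platform-specific atomic instructions, as noted above):

    #include <stdatomic.h>

    typedef struct {
        atomic_int ref_count;
        /* ... object contents ... */
    } MPIU_Object_head;

    static void MPIU_Object_add_ref(MPIU_Object_head *obj)
    {
        atomic_fetch_add(&obj->ref_count, 1);
    }

    static void MPIU_Object_release_ref(MPIU_Object_head *obj, int *inuse)
    {
        /* fetch_sub returns the value before the decrement, so the
           count reached zero exactly when the old value was 1 */
        *inuse = (atomic_fetch_sub(&obj->ref_count, 1) != 1);
    }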
Managing the Source
• Error messages
  Short text, long text
  Instance-specific text
  Tools extract and write error message routines
  Source markings for error tests
• Leads to…
• Testing
  Coverage tests
  Problem: ignoring error-handling code, particularly for low-likelihood errors
  • Exploit error messaging macros and comment markers
Build Environment
• Goal: no replicated data
• Approach: solve everything with indirection
• Suite of scripts that derive and write the necessary files
  Autoconf 2.13/2.52
  "simplemake" creates Makefile.in, understands libraries whose source files are in different directories, handles make bugs; will create Microsoft Visual Studio project files
  "codingcheck" looks for source code problems
  "extracterrmsgs" creates message catalogs
  Fortran and C++ interfaces derived from mpi.h
MPI-2 In MPICH-2
• Everything is in progress:
• I/O will have a first port (error messaging and generalized requests); later versions will exploit more of MPICH-2 internals
• MPI_Comm_spawn and friends will appear soon, using the PMI interface
• RMA will appear over the next year
  Challenge is low latency and high performance
  Key is to exploit the semantics of the operations and the work in BSP, generalized for MPI
Conclusion
• MPICH2 is ready for researchers
• New design is cleaner, more flexible, faster
• Raises the bar on performance for all MPI implementations