MPICH2—A New Start for MPI Implementations
William D. Gropp
Mathematics and Computer Science, Argonne National Laboratory
www.mcs.anl.gov/~gropp


Page 1: MPICH2—A New Start for MPI Implementations

MPICH2—A New Start for MPI Implementations

William D. Gropp
Mathematics and Computer Science
Argonne National Laboratory
www.mcs.anl.gov/~gropp

Page 2: MPICH2—A New Start for MPI Implementations

University of Chicago Department of Energy

The newest edition of Using MPI-2, translated by Takao Hatazaki, www.pearsoned.co.jp/washo/prog/wa_pro61-j.html

Page 3: MPICH2—A New Start for MPI Implementations

MPICH2 Team

• Bill Gropp
• Rusty Lusk
• Rajeev Thakur
• Rob Ross
• David Ashton
• Brian Toonen
• Rob Latham

Page 4: MPICH2—A New Start for MPI Implementations

What’s New

• Pre-pre-pre-pre release version available for groups that expect to perform research on MPI implementations with MPICH2
• Contains:
  - Most of MPI-1, plus service functions from MPI-2
  - C and Fortran bindings
  - Example devices for TCP
  - Documentation

Page 5: MPICH2—A New Start for MPI Implementations

MPICH2 Research

• An all-new implementation is our vehicle for research in:
  - Thread safety and efficiency (e.g., avoiding thread locks)
  - Optimized MPI datatypes
  - Optimized Remote Memory Access (RMA)
  - High scalability (64K MPI processes and more)
  - Exploiting Remote Direct Memory Access (RDMA)-capable networks
  - All of MPI-2, including dynamic process management, parallel I/O, RMA
  - Usability and robustness
• Software engineering techniques that automate and simplify creating and maintaining a solid, user-friendly implementation
• Allow extensive runtime error checking but do not require it

Page 6: MPICH2—A New Start for MPI Implementations

Some Target Platforms

• Clusters (TCP, UDP, Infiniband, Myrinet, proprietary interconnects, …)
• Clusters of SMPs
• Grids (UDP, TCP, Globus I/O, …)
• BlueGene/x
  - 64K processors; 64K address spaces
  - ANL/IBM developing MPI and process management for BG/L
• Other systems
  - ANL/Cray developing MPI and PVFS for Red Storm

Page 7: MPICH2—A New Start for MPI Implementations

Structure of MPICH-2

[Diagram: the MPICH-2 layered architecture. MPICH-2 sits on ADI-3 (point-to-point), ADIO (I/O), and PMI (process management). ADIO connects to existing parallel file systems and PVFS. PMI connects to MPD (Unix/python and Windows implementations), Fork, and vendor process managers. Below ADI-3, the Channel Interface supports TCP, Myrinet and other NICs, BG/L, Portals, and a multi-method device. Components are marked Existing, In Progress, or For others.]

Page 8: MPICH2—A New Start for MPI Implementations

The Major Components

• PMI (Process Manager Interface)
  - Provides a scalable interface to both process creation and communication setup
  - Designed to permit many implementations, including with/without daemons and with 3rd-party process managers
• ADIO
  - I/O interface. No change (except for error reporting and request management) from current ROMIO, at least this year
• ADI3
  - New device aimed at higher-performance networks and new network capabilities

Page 9: MPICH2—A New Start for MPI Implementations

Needs of an MPI Implementation

• Point-to-point communication
  - Can work with polling
• Cancel of send
  - Requires some agent (interrupt-driven receive, separate thread, guaranteed timer)
• Active target RMA
  - Can work with polling
  - Performance may require “bypass”
• Passive target RMA
  - Requires some agent
• For some operations, the agent may be special hardware capabilities

Page 10: MPICH2—A New Start for MPI Implementations

The Layers

• ADI3
  - Full-featured interface, closely matched to MPI point-to-point and RMA operations
  - Most MPI communication routines perform (optional) error checking and then “call” an ADI3 routine
  - Modular design allows replacement of parts, e.g., datatypes
• New Channel Interface
  - Much smaller than ADI-3, easily implemented on most platforms
  - Nonblocking design is more robust and efficient than the MPICH-1 version

Page 11: MPICH2—A New Start for MPI Implementations

Expose Structures To All Levels of the Implementation

• All MPI opaque objects are defined structs at all levels (ADI, channel, and lower)
• Every object has a handle that encodes the type of the object within the handle value
  - Permits runtime type checking of handles
  - Null handles are now distinct
  - Easier detection of misused values
  - Fortran integer-valued handles simplify the implementation for 64-bit systems
• Consistent mechanism to extend definitions to support the needs of particular devices
• Defined fields simplify much code
  - E.g., direct access to the rank and size of communicators

Page 12: MPICH2—A New Start for MPI Implementations

Special Case: Predefined Objects

• Many predefined objects contain all of their information within the handle
  - Predefined MPI datatype handles contain:
    · Size in bytes
    · The fact that the handle is a predefined datatype
  - No other data is needed by most MPI routines
  - Eliminates extra loads, pointer chasing, and setup at MPI_Init time
• Predefined attributes are handled separately from general attributes
  - A special case anyway, since the C and Fortran versions differ for the predefined attributes
• Other predefined objects are initialized only on demand
  - The handle is always valid
  - Data areas may not be initialized until needed
  - Example: names (MPI_Type_set_name) on datatypes

Page 13: MPICH2—A New Start for MPI Implementations

Builtin MPI Datatypes

• MPI_TYPE_NULL has its datatype code in the upper bits
• The index is used to implement MPI_Type_set/get_name

[Diagram: handle layout for builtin datatypes — bit fields, from high to low, for the “builtin” kind code, the code for the datatype, the size in bytes, and an index into a struct table]

Page 14: MPICH2—A New Start for MPI Implementations

Channel “CH3”

• One possible implementation design for ADI3
  - Others possible and underway
• Thread safe by design
  - Requires* atomic operations
• Nonblocking design
  - Requires some completion handle, so …
• Delayed allocation
  - Only allocate/initialize a request if a communication operation did not complete immediately

Page 15: MPICH2—A New Start for MPI Implementations

An Example: CH3 Implementation over TCP

• Pollable and active-message data paths
• RMA path

Page 16: MPICH2—A New Start for MPI Implementations

Typical CH3 Routines

• Begin messages
  - CH3_iStartMsg, CH3_iStartMsgv
  - CH3_iStartRead
  - CH3_Request_create
• Send or receive data using an existing request (e.g., incremental datatype handling, rendezvous)
  - CH3_iSend, CH3_iSendv
  - CH3_iWrite
  - CH3_iRead
• Ensure progress
  - CH3_Progress, CH3_Progress_start, CH3_Progress_end, CH3_Progress_poke, CH3_Progress_signal_completion
• CH3_Init, CH3_Finalize, CH3_InitParent

Page 17: MPICH2—A New Start for MPI Implementations

Implementation Sketch

MPID_Send( )
{
    decide if eager based on message size and flow control
    if (eager) {
        create packet on stack
        fill in as eager send packet
        if (data contiguous) {
            request = CH3_iStartMsgv( iov )
            /* note: the request will be null if the message was sent */
        } else {
            create pack buffer
            pack into buffer
            request = CH3_iStartMsgv( iov )
            if (!request) free pack buffer
            else save location of pack buffer in request
        }
    } else { /* rendezvous */
        create packet on stack
        request = CH3_Request_create()
        fill in request
        fill in packet as rndv_req to send (include request id)
        CH3_iSend( request, packet )
    }
    return request
}

Page 18: MPICH2—A New Start for MPI Implementations


Example Channel/TCP Internal Profiling

Page 19: MPICH2—A New Start for MPI Implementations


Example: Cost of Preallocating Requests

Page 20: MPICH2—A New Start for MPI Implementations

Extra Wrinkles

• Some fast networks do not guarantee ordered delivery (probably all of the very fastest)
• MPI, fortunately, does not require ordered delivery
  - Except for message envelopes (message headers are ordered)
• What should a channel interface look like? Should it require ordering (streams)?

Page 21: MPICH2—A New Start for MPI Implementations


BlueGene/L

Page 22: MPICH2—A New Start for MPI Implementations

Preserving Message Ordering

• Diversion: consider message matching for trace tools such as Jumpshot. How are arrows drawn?
  - If communication is single-threaded, the trace tool can “replay” the messages to perform matching
  - For multithreaded (MPI_THREAD_MULTIPLE), the easiest solution is a message sequence number on envelopes
• But that’s (almost) all we need to handle unordered message delivery!
  - Need to count data segments
  - Need to handle data that arrives before its envelope
    · Can be handled as a low (but not zero) probability case
• Solution: provide hooks in the channel device for sequence numbers on envelopes; use these for MPI_THREAD_MULTIPLE and unordered networks. Other wrinkles are handled below the channel device
  - OK, because it does not force ordered, stream-like messaging on the low levels
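A minimal resequencing sketch along these lines, with illustrative names and a fixed per-source window (assumptions for this example, not MPICH2’s internal API): envelopes that arrive ahead of a gap are parked, and each arrival drains every consecutive envelope starting from the next expected sequence number.

```c
#include <assert.h>

#define WINDOW 16   /* assumed bound on how far ahead an envelope can arrive */

typedef struct {
    unsigned next_seq;          /* next sequence number to deliver        */
    int      present[WINDOW];   /* parked envelopes, indexed seq % WINDOW */
    int      payload[WINDOW];
} reseq_t;

/* Record an arriving envelope and deliver, in order, everything that is
 * now contiguous.  Returns the number of envelopes delivered; delivered
 * payloads are appended to out[] (capacity nout). */
static int reseq_arrive(reseq_t *r, unsigned seq, int payload,
                        int *out, int nout) {
    int n = 0;
    r->present[seq % WINDOW] = 1;
    r->payload[seq % WINDOW] = payload;
    while (r->present[r->next_seq % WINDOW] && n < nout) {
        r->present[r->next_seq % WINDOW] = 0;
        out[n++] = r->payload[r->next_seq % WINDOW];
        r->next_seq++;
    }
    return n;
}
```

Data segments that arrive before their envelope would need the extra counting described above; this sketch covers only the envelope side.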

Page 23: MPICH2—A New Start for MPI Implementations

BG/L and the MPICH2 Architecture

[Diagram: the MPI layer sits on the Abstract Device Interface (MPID). CH3 provides the Channel Interface — types, key values, the notion of requests; transformation to pt2pt ops; the request progress engine — with a TCP/IP implementation. BG/L channel implementations target the BG/L torus, tree, and GI networks.]

Special opportunities:
• collective bypass
• scalable buffer management
• out-of-order network

Page 24: MPICH2—A New Start for MPI Implementations

[Diagram: a call path through the MPICH2 BG/L software stack —
  User: MPI_Send, MPI_Bcast
  MPI → Abstract Device Interface/MPID: MPID_Send (ADI-3 implementation)
  → Channel Interface/CH3: CH3_iStartMsgv (channel implementation)
  → Transport (Torus Message Layer): Channel_Write
  → Torus Packet Layer: Torus_Send
  → Torus hardware: lfpdux(), sfpdx()]

Page 25: MPICH2—A New Start for MPI Implementations

CH3 Summary

• Nonblocking interface for correctness and “0 copy” transfers
• struct iovec routines to provide “0 copy” for headers and data
• Lazy request creation to avoid unnecessary operations when data can be sent immediately (the low-latency case); routines to reuse requests during incremental transfers
• Thread-safe message queue manipulation routines
• Supports both polling and preemptive progress
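The lazy-request point can be sketched as follows. The names and the stand-in network write are assumptions for this example, not the real CH3 API; the pattern is the one named above: try to push the bytes out immediately, and allocate a completion request only when some data is left for the progress engine.

```c
#include <assert.h>
#include <stdlib.h>

/* A request exists only for a send that did not complete immediately. */
typedef struct request { char *buf; size_t remaining; } request_t;

/* Stand-in for a nonblocking network write: accepts at most `cap`
 * bytes and reports how many it took. */
static size_t net_write(const char *data, size_t len, size_t cap) {
    return len < cap ? len : cap;
}

/* Returns NULL when the whole message went out (the common low-latency
 * case pays no allocation); otherwise returns a request describing what
 * the progress engine still has to send. */
static request_t *start_msg(char *data, size_t len, size_t cap) {
    size_t sent = net_write(data, len, cap);
    if (sent == len)
        return NULL;
    request_t *req = malloc(sizeof *req);
    req->buf = data + sent;
    req->remaining = len - sent;
    return req;
}
```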

Page 26: MPICH2—A New Start for MPI Implementations

Small Message Latency

[Figure: time (µs) vs. message size (0–1000 bytes) over 100Mb Ethernet TCP, comparing raw TCP, MPICH-chp4, and MPICH2-ch3:tcp]

Page 27: MPICH2—A New Start for MPI Implementations

Bandwidth

[Figure: bandwidth (MB/s) vs. message size (1 byte to 10 MB, log scale) over 100Mb Ethernet TCP, comparing raw TCP, MPICH-chp4, and MPICH2-ch3:tcp]

Page 28: MPICH2—A New Start for MPI Implementations

Small Message Latency

[Figure: time (µs) vs. message size (0–1000 bytes), GigE TCP results, comparing raw TCP, MPICH-chp4, and MPICH2-ch3:tcp]

Page 29: MPICH2—A New Start for MPI Implementations

Bandwidth

[Figure: bandwidth (MB/s) vs. message size (1 byte to 10 MB, log scale), GigE TCP results, comparing raw TCP, MPICH-chp4, and MPICH2-ch3:tcp]

Page 30: MPICH2—A New Start for MPI Implementations

Enhancing Collective Performance

• MPICH-1 algorithms are a combination of purely functional and minimum spanning tree (MST)
• Better algorithms, based on scatter/gather operations, exist for large messages
  - E.g., see van de Geijn for a 1-D mesh
• And better algorithms, based on MST, exist for small messages
  - Correct implementations must be careful with MPI datatypes
• Rajeev Thakur and I have developed and implemented algorithms for switched networks that provide much better performance
• The following results are for MPICH-1 and will be in the next MPICH-1 release

Page 31: MPICH2—A New Start for MPI Implementations

Bcast with Scatter/Gather

• Implement MPI_Bcast(buf, n, …) as
  - MPI_Scatter(buf, n/p, …, buf+rank*n/p, …)
  - MPI_Allgather(buf+rank*n/p, n/p, …, buf, …)

[Diagram: 8 processes P0–P7; the root’s buffer is scattered in n/p-sized pieces, then an allgather fills every process’s buffer]
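The decomposition can be illustrated without MPI by treating each process’s buffer as a row of a 2-D array. This is a sketch, not the library’s code; the allgather here is a simple ring (the actual allgather algorithm — recursive doubling on the next slide — differs, but the data movement it must accomplish is the same).

```c
#include <assert.h>
#include <string.h>

#define P 4        /* number of simulated processes       */
#define N 8        /* message length; must be divisible by P */

/* Broadcast from rank 0 expressed as scatter + ring allgather. */
static void bcast_scatter_allgather(int buf[P][N]) {
    int chunk = N / P;
    /* Scatter: rank r receives piece r of the root's buffer. */
    for (int r = 1; r < P; r++)
        memcpy(&buf[r][r * chunk], &buf[0][r * chunk], chunk * sizeof(int));
    /* Ring allgather: in step s, rank r forwards to rank (r+1) mod P
     * the piece it received s steps ago. After P-1 steps every rank
     * holds all P pieces. */
    for (int s = 0; s < P - 1; s++)
        for (int r = 0; r < P; r++) {
            int dst = (r + 1) % P;
            int piece = ((r - s) % P + P) % P;
            memcpy(&buf[dst][piece * chunk], &buf[r][piece * chunk],
                   chunk * sizeof(int));
        }
}
```

Each of the P-1 ring steps moves only n/p bytes per process, which is why this decomposition beats a plain MST broadcast for large messages.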

Page 32: MPICH2—A New Start for MPI Implementations

Collective Performance

[Figure, four panels of results on 128 nodes:
• Allgather, 128 nodes — time (µs) vs. total size (0–15000 bytes): old (ring) vs. new (recursive doubling)
• Scatter, 128 nodes — time (µs) vs. total size (0–15000 bytes): old (linear) vs. new (MST)
• Broadcast, 128 nodes — time (µs) vs. data size (0–8,000,000 bytes): old (MST) vs. new (scatter+allgather)
• Reduce Scatter, 128 procs — time (µs) vs. total size (0–250000 bytes): old (reduce+scatter) vs. new (pairwise exchange)]

Page 33: MPICH2—A New Start for MPI Implementations

And We’ll Go Faster Yet

• MPICH-2 enables additional collective optimizations:
  - Pipelining of long messages
  - Store-and-forward of communication buffers eliminates extra copies, particularly for non-contiguous messages
• And custom algorithms
  - Each collective operation may be replaced, on a per-communicator basis, with a separate routine
  - Unlike MPICH-1, each algorithm is contained within the file implementing that particular collective routine, so only the collective routines used are loaded…

Page 34: MPICH2—A New Start for MPI Implementations

Memory Footprint

• Processor-rich systems have less memory per node, making lean implementations of libraries important
• MPICH-2 addresses the memory “footprint” of the implementation through:
  - Library design and organization that reduce the number of extraneous routines needed only to satisfy (normally unused) references
    · Don’t even think of a (classic, independent, on-demand) shared-library approach on 64K nodes
  - Lazy initialization of objects that are not commonly used
  - Use of callbacks to bring in code only when needed

Page 35: MPICH2—A New Start for MPI Implementations

Reducing the Memory Footprint

• MPI groups
  - Groups are rarely used by applications
  - Groups are not used internally in the implementation of MPI routines, not even within the communicators
• Datatype names
  - MPI-2 allows users to set and get names on all MPI datatypes, including predefined ones
  - Rather than preinitializing all names during MPI_Init, names are set during the first use of MPI_Type_set_name or MPI_Type_get_name
• Buffered send
  - Proper completion requires checking during MPI_Finalize for pending buffered sends
  - The first use of MPI_Buffer_attach adds a bsend callback to a list of routines that are called during MPI_Finalize
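The callback pattern behind the buffered-send item can be sketched in a few lines. All names here are illustrative stand-ins, not MPICH2 internals; the point is that a feature registers its cleanup routine on first use, so MPI_Finalize pays nothing for features the application never touched.

```c
#include <assert.h>

#define MAX_CB 8
typedef void (*finalize_cb)(void);
static finalize_cb cb_list[MAX_CB];
static int n_cb;

/* Register a routine to be run during finalize. */
static void at_finalize(finalize_cb f) {
    if (n_cb < MAX_CB) cb_list[n_cb++] = f;
}

static int bsend_flushed;
static void bsend_finalize(void) { bsend_flushed = 1; }  /* flush pending bsends */

/* First use of the buffered-send path registers its cleanup, once. */
static int bsend_attached;
static void buffer_attach(void) {
    if (!bsend_attached) { at_finalize(bsend_finalize); bsend_attached = 1; }
}

/* Finalize runs only the callbacks that were actually registered. */
static void finalize(void) {
    for (int i = 0; i < n_cb; i++) cb_list[i]();
}
```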

Page 36: MPICH2—A New Start for MPI Implementations

Thread Safety

• Internal APIs are defined to be thread-safe, usually by providing an atomic operation
  - Locks are not part of the internal API
  - They may be used to implement the API, but they are not required. Locks are bad but sometimes inevitable
• Level of thread safety settable at configure and/or run time using the same source tree

Page 37: MPICH2—A New Start for MPI Implementations

Reference Counting MPI Objects

• Most MPI objects have reference-count semantics
  - Free only happens when the reference count reaches zero
  - Important for nonblocking (split-phase) operations
• Updating the reference count must be atomic
  - A non-atomic update is one of the most common and most difficult-to-find errors in multithreaded programs
• MPICH2 uses
  - MPIU_Object_add_ref(pointer_to_object)
  - MPIU_Object_release_ref(pointer_to_object, &inuse)
    · If inuse is false, the object can be freed
• Why not use fetch-and-increment?
  - Not available on IA32
  - The above API uses atomic instructions on all common platforms
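A sketch of the two routines named above, written with C11 atomics for portability (MPICH2 itself predates C11 and uses platform-specific atomic instructions; the struct layout here is an assumption):

```c
#include <assert.h>
#include <stdatomic.h>

/* Every reference-counted object embeds a count; a real MPI object
 * would carry its other fields alongside it. */
typedef struct { atomic_int ref_count; } MPIU_Object;

static void MPIU_Object_add_ref(MPIU_Object *obj) {
    atomic_fetch_add(&obj->ref_count, 1);
}

/* Atomically decrement; *inuse is nonzero while references remain.
 * When it comes back false, the caller may free the object. */
static void MPIU_Object_release_ref(MPIU_Object *obj, int *inuse) {
    *inuse = (atomic_fetch_sub(&obj->ref_count, 1) - 1) != 0;
}
```

Returning the “still in use” flag from the same atomic operation is what closes the race: testing the count in a separate load could let two threads both decide to free the object.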

Page 38: MPICH2—A New Start for MPI Implementations

Managing the Source

• Error messages
  - Short text, long text
  - Instance-specific text
  - Tools extract and write error-message routines
  - Source markings for error tests
• Leads to…
• Testing
  - Coverage tests
  - Problem: ignoring error-handling code, particularly for low-likelihood errors
    · Exploit error-messaging macros and comment markers

Page 39: MPICH2—A New Start for MPI Implementations

Build Environment

• Goal: no replicated data
• Approach: solve everything with indirection
• Suite of scripts that derive and write the necessary files
  - Autoconf 2.13/2.52
  - “simplemake” creates Makefile.in, understands libraries whose source files are in different directories, handles make bugs; will create Microsoft Visual Studio project files
  - “codingcheck” looks for source-code problems
  - “extracterrmsgs” creates message catalogs
  - Fortran and C++ interfaces derived from mpi.h

Page 40: MPICH2—A New Start for MPI Implementations

MPI-2 In MPICH-2

• Everything is in progress:
• I/O will have a first port (error messaging and generalized requests); later versions will exploit more of the MPICH-2 internals
• MPI_Comm_spawn and friends will appear soon, using the PMI interface
• RMA will appear over the next year
  - The challenge is low latency and high performance
  - The key is to exploit the semantics of the operations and the work in BSP, generalized for MPI

Page 41: MPICH2—A New Start for MPI Implementations

Conclusion

• MPICH2 is ready for researchers
• The new design is cleaner, more flexible, and faster
• Raises the bar on performance for all MPI implementations