
Page 1:

A Preview of MPI 3.0: The Shape of Things to Come

Manjunath Gorentla Venkata [email protected]

Joshua Hursey

[email protected]

Page 2:

Overview of Seminar Series

•  Monday, June 25, 3-4 pm
   –  MPI Process (brief)
   –  Timeline to 3.0
   –  MPI 3.0 Fortran Bindings
   –  MPI 2.2

•  Tuesday, June 26, 3-4 pm
   –  Collectives in MPI 3.0:
      •  Neighborhood
      •  Nonblocking
   –  Communicator Creation:
      •  Noncollective
      •  Nonblocking duplication

•  Thursday, June 28, 3-4 pm
   –  MPI Matched Probe/Recv
   –  RMA / One-sided enhancements
   –  Tool Interfaces
   –  MPI <next>
      •  Fault Tolerance
      •  Hybrid, collectives, …

Page 3:

MPI Topology and Collectives Support

•  Topology functions (MPI-2.1)
   •  Create a Graph or Cartesian topology and query it, nothing else
   •  Each rank specifies the full graph

•  Scalable graph topology (MPI-2.2)
   •  Each rank specifies only a subset of the graph (a hedged sketch of the distributed-graph interface follows below)
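
As a hedged illustration (not from the slides), the sketch below uses the MPI-2.2 distributed-graph interface to build a ring topology in which each rank names only its own two neighbors; the communicator name ring_comm and the ring pattern are illustrative assumptions, and rank and size are assumed to come from MPI_COMM_WORLD.

/* Sketch only: MPI-2.2 scalable graph creation. Each rank declares just
   its two ring neighbors instead of the full graph. */
int left  = (rank - 1 + size) % size;
int right = (rank + 1) % size;
int sources[2]      = { left, right };   /* who sends to me */
int destinations[2] = { left, right };   /* whom I send to  */

MPI_Comm ring_comm;
MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                               2, sources,      MPI_UNWEIGHTED,
                               2, destinations, MPI_UNWEIGHTED,
                               MPI_INFO_NULL, 0 /* no reorder */, &ring_comm);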

Page 4:

MPI Topology and Collectives Support

•  Neighborhood Collectives (MPI-3.0)
   •  Communication functions on the neighbors of the topology (Cartesian, Graph, Distributed Graph)
   •  All processes in the communicator call the collective, but communication occurs only along the edges of the process topology (neighbors)

•  Topology and Neighborhood Collectives: users can define a communication topology and perform communication between neighbors in this topology

Page 5:

Need for Neighborhood Collectives

•  Many applications and libraries exhibit sparse communication patterns
   •  Examples: weather prediction applications, PETSc

•  Many architectures support sparse communication efficiently
   •  A Cray XE/XK node has six neighbors

•  Implementation complexity can be reduced if sparse communication is abstracted by libraries

Hoefler et al.: Implementation and Performance Analysis of Non-Blocking Collective Operations for MPI

Page 6:

MPI_NEIGHBOR_ALLGATHER

MPI_Neighbor_allgather(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)

•  Sends the same data element to all neighbor processes
•  Receives a distinct data element from each neighbor
•  The type signature of sendtype and recvtype must match at the corresponding processes
•  Neighbor order is determined by MPI_Graph_neighbors / MPI_Dist_graph_neighbors
•  A vector (V) variant, MPI_Neighbor_allgatherv, is also available

A minimal usage sketch follows below.
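
A minimal, hedged sketch of the call (assuming the ring distributed-graph communicator ring_comm from the earlier sketch, with indegree and outdegree of 2; variable names are illustrative):

/* Sketch only: each rank sends its rank to both ring neighbors and
   gathers one int from each incoming neighbor. */
int myval = rank;
int neighbor_vals[2];   /* one entry per in-neighbor, ordered as the sources
                           were passed to MPI_Dist_graph_create_adjacent */
MPI_Neighbor_allgather(&myval, 1, MPI_INT,
                       neighbor_vals, 1, MPI_INT, ring_comm);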

Thanks to Torsten Hoefler (UIUC) and Martin Schulz (LLNL)

Page 7:

Neighborhood Collectives (Cartesian Communicator)

•  Communication between nearest neighbors
•  All processes in the communicator are required to call the collective
•  The number of sources and destinations equals 2 * (number of dimensions)
•  The order of neighbors in the buffers is dimension order; within each dimension, first the negative neighbor and then the positive neighbor (see the sketch below)
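
A hedged sketch of the buffer ordering on a two-dimensional Cartesian communicator (a 3x3 non-periodic grid matching the figures on the next slides; cart_comm is an illustrative name and rank is assumed to come from MPI_COMM_WORLD):

/* Sketch only: 2D, non-periodic 3x3 grid, one int exchanged per neighbor.
   recvbuf slots are ordered dim 0 negative, dim 0 positive,
   dim 1 negative, dim 1 positive; slots for neighbors that fall outside
   the grid (MPI_PROC_NULL) are neither updated nor communicated. */
int dims[2]    = { 3, 3 };
int periods[2] = { 0, 0 };
MPI_Comm cart_comm;
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0 /* no reorder */,
                &cart_comm);

int myval = rank;
int recvbuf[4];   /* 2 * ndims slots */
MPI_Neighbor_allgather(&myval, 1, MPI_INT, recvbuf, 1, MPI_INT, cart_comm);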

Page 8:

MPI_NEIGHBOR_ALLGATHER (Cartesian Communicator)

•  Buffer order: In dimension order, first negative, and then positive

[Figure: 3x3 grid of processes 0-8. Process 4 sends its sendbuf to neighbors 1, 3, 5, and 7; its recvbuf holds their contributions in the order Proc 3, Proc 5, Proc 1, Proc 7.]

Thanks to Torsten Hoefler (UIUC) and Martin Schulz (LLNL)

Page 9:

MPI_NEIGHBOR_ALLGATHER (Cartesian Communicator)

[Figure: 3x3 grid of processes 0-8. Process 7, on the grid boundary, exchanges with neighbors 6, 8, and 4; the sendbuf and recvbuf slots for the missing (out-of-grid) neighbor are not updated or communicated.]

Page 10:

Neighborhood Collectives

(Dist Graph or Graph Communicator)

•  Communication between arbitrary neighbors
•  All processes must call the collective
•  Order is determined by the MPI_Graph_neighbors / MPI_Dist_graph_neighbors call

Equivalent to the regular collectives when each process creates a graph treating all processes in the communicator as its neighbors.

Page 11:

MPI_NEIGHBOR_ALLGATHER (Dist Graph Communicator)

[Figure (from the MPI standard draft): a distributed-graph neighborhood with destinations d0-d3 and sources s0-s5; each process sends its sendbuf to every destination, and its recvbuf is laid out as s0 s1 s2 s3 s4 s5, one block per source in neighbor order.]

The type signature associated with sendcount, sendtype at a process must be equal to the type signature associated with recvcount, recvtype at all other processes. This implies that the amount of data sent must be equal to the amount of data received, pairwise between every pair of communicating processes. Distinct type maps between sender and receiver are still allowed.

Rationale: For optimization reasons, the same type signature is required independently of whether the topology graph is connected or not. (End of rationale.)

The "in place" option is not meaningful for this operation. The vector variant of MPI_NEIGHBOR_ALLGATHER allows one to gather different numbers of elements from each neighbor.


•  Between two processes, it sends and receives the same amount of data

•  MPI_IN_PLACE is not meaningful

Thanks to Torsten Hoefler (UIUC)

Page 12:

MPI_NEIGHBOR_ALLTOALL

MPI_Neighbor_alltoall(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)

•  Sends a distinct data element to each neighbor process
•  Receives a distinct data element from each neighbor
•  The type signature of sendtype and recvtype must match at the corresponding processes
•  Neighbor order is determined by MPI_Graph_neighbors / MPI_Dist_graph_neighbors
•  Vector (V) and W variants, MPI_Neighbor_alltoallv and MPI_Neighbor_alltoallw, are also available

A minimal sketch follows below.
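
A hedged sketch of a halo-style exchange with MPI_Neighbor_alltoall (reusing the Cartesian communicator cart_comm from the earlier sketch; the values sent are illustrative):

/* Sketch only: send a distinct int to each of the four Cartesian
   neighbors and receive one distinct int from each; buffer slots
   follow the same dimension ordering as for the allgather. */
int sendbuf[4] = { 10 * rank, 10 * rank + 1, 10 * rank + 2, 10 * rank + 3 };
int recvbuf[4];
MPI_Neighbor_alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, cart_comm);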

Thanks to Torsten Hoefler (UIUC) and Martin Schulz (LLNL)

Page 13:

Neighborhood Collectives Summary

•  Scalable Graph Topology Creation

•  Neighborhood Collectives
   •  MPI_Neighbor_allgather{v}
   •  MPI_Neighbor_alltoall{v,w}

•  Neighborhood Collectives (Cartesian Communicator)

•  Neighborhood Collectives (Graph Communicator)

Page 14:

Nonblocking Collectives

•  Collectives: a global synchronization, data communication, or reduction operation

•  Blocking collectives: return when the operation is complete

•  Nonblocking collectives: split the invocation and completion of an operation (a minimal sketch follows below)

•  Properties
   •  Synchronization decoupled from invocation
   •  Enables asynchronous progress (not guaranteed)
   •  Multiple outstanding operations
   •  Out-of-order completion
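
A hedged sketch of the split between invocation and completion (compute_something() stands in for application work and is not a real MPI routine; comm and the buffers are assumed):

/* Sketch only: start a nonblocking allreduce, overlap it with local
   computation, and complete it later. */
MPI_Request req;
double local = 1.0, global = 0.0;
MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm, &req);

compute_something();               /* independent local work (placeholder) */

MPI_Wait(&req, MPI_STATUS_IGNORE); /* 'global' may be read only after this */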

Thanks to Torsten Hoefler (UIUC) and Martin Schulz (LLNL)

Page 15:

Nonblocking Collective Routines in MPI 3.0

MPI_IBARRIER               MPI_IALLTOALLW
MPI_IBCAST                 MPI_IREDUCE
MPI_IGATHER                MPI_IALLREDUCE
MPI_IGATHERV               MPI_IREDUCE_SCATTER
MPI_ISCATTER               MPI_ISCAN
MPI_ISCATTERV              MPI_IEXSCAN
MPI_IALLGATHER             MPI_INEIGHBOR_ALLGATHER
MPI_IALLGATHERV            MPI_INEIGHBOR_ALLGATHERV
MPI_IALLTOALL              MPI_INEIGHBOR_ALLTOALL
MPI_IALLTOALLV             MPI_INEIGHBOR_ALLTOALLV
MPI_IREDUCE_SCATTER_BLOCK  MPI_INEIGHBOR_ALLTOALLW

Page 16:

Nonblocking Collectives Semantics

•  Multiple nonblocking collectives can be outstanding and their progress is independent

MPI_Request req1, req2;

/* Two nonblocking all-to-alls may be outstanding at the same time;
   each in-flight operation needs its own buffers. */
MPI_Ialltoall(sbuf1, scnt, stype, rbuf1, rcnt, rtype, comm, &req1);
MPI_Ialltoall(sbuf2, scnt, stype, rbuf2, rcnt, rtype, comm, &req2);

/* The requests may be completed in any order. */
MPI_Wait(&req2, MPI_STATUS_IGNORE);
MPI_Wait(&req1, MPI_STATUS_IGNORE);

https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/109

Page 17:

Nonblocking Collectives Semantics

•  Blocking and nonblocking collectives can be interleaved

MPI_Request req;

/* A blocking collective may be issued while a nonblocking one is still
   outstanding, as long as all processes use the same order. */
MPI_Ialltoall(sbuf, scnt, stype, rbuf, rcnt, rtype, comm, &req);
MPI_Bcast(bbuf, bcnt, btype, 0, comm);   /* uses its own buffer */
MPI_Wait(&req, MPI_STATUS_IGNORE);

https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/109

Page 18:

Nonblocking Collectives Semantics

•  All processes must start collective operations on a communicator in the same order; in the example below, rank 0 starts the MPI_Ibcast before the MPI_Barrier while rank 1 starts it after

switch (rank) {
case 0:
    /* Rank 0 starts the Ibcast first, then the Barrier. */
    MPI_Ibcast(buf, count, type, 0, comm, &req);
    MPI_Barrier(comm);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    break;
case 1:
    /* Rank 1 starts the Barrier first, then the Ibcast:
       the collectives on comm are started in a different order. */
    MPI_Barrier(comm);
    MPI_Ibcast(buf, count, type, 0, comm, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    break;
}

https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/109

Page 19:

Nonblocking Collectives Semantics

•  Matching a blocking collective with a nonblocking collective is invalid

switch (rank) {
case 0:
    /* Erroneous: the nonblocking MPI_Ibcast on rank 0 ... */
    MPI_Ibcast(buf, count, type, 0, comm, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    break;
case 1:
    /* ... does not match the blocking MPI_Bcast on rank 1. */
    MPI_Bcast(buf, count, type, 0, comm);
    break;
}

https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/109

Page 20:

Nonblocking Collectives Advantages

•  Communication-Computation Overlap

•  Noise Resiliency

•  Asynchronous Progress

•  Multiple Outstanding Operations

Page 21:

Nonblocking Collectives Provide Better Computation-Communication Overlap

•  64-process MPI_Ialltoall, with progress driven by MPI_Test
•  With network-interface offload support, close to 100% overlap can be achieved (a test-driven progress sketch follows below)
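
A hedged sketch of the measurement pattern: progress is driven by polling with MPI_Test between chunks of computation (do_work_chunk() is a placeholder for application work; comm and the buffers are assumed):

/* Sketch only: poll the outstanding MPI_Ialltoall with MPI_Test while
   doing useful work, so the library can progress the collective. */
MPI_Request req;
int done = 0;
MPI_Ialltoall(sbuf, scnt, stype, rbuf, rcnt, rtype, comm, &req);
while (!done) {
    do_work_chunk();                         /* placeholder computation */
    MPI_Test(&req, &done, MPI_STATUS_IGNORE);
}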

[Figure: communication-computation overlap (%) vs. message size (bytes, up to ~1.2e+06) for a 64-process alltoall; the Cheetah offloaded alltoall reaches close to 100% overlap, the Cheetah host-based alltoall does not.]

Gorentla et al. : Exploring the All-To-All Collective Optimization Space with ConnectX CORE-Direct

Page 22:

System Noise

•  Noise: OS-related activity that steals CPU cycles from the application
   •  Timer ticks
   •  Hardware interrupts
   •  Kernel daemons

Page 23:

Collective (Global) Performance Cost of System Noise

[Figure: collective operation timelines compared under "No Noise" and "System Noise" conditions.]

Page 24:

Impact of System Noise on MPI_Allreduce

[Figure 8 (Ferreira et al.): factor of slowdown vs. message size (bytes) for (a) MPI_Allreduce and (b) MPI_Bcast under a 2.5% net processor noise signature with a 10 Hz frequency and 2500 us duration.]

Impact of MPI Usage Differences on Noise Sensitivity. To understand how these different uses of MPI affect each application's sensitivity to noise, we examined the performance of these specific operations under the 2.5% low-frequency noise signature to which SAGE demonstrated sensitivity in section 5. Figure 8 shows the performance impact of this signature on MPI_Allreduce and MPI_Bcast operations for 128 nodes. From this figure, we can see that MPI_Bcast is much less sensitive to this noise signature than MPI_Allreduce, with MPI_Bcast showing a factor of 2-4 slowdown and MPI_Allreduce showing a factor of 12-35 slowdown. In addition, small MPI_Allreduce calls appear to be much more sensitive to this noise signature than larger operations.

Based on this, we believe that SAGE's sensitivity to OS noise comes from a combination of sources: SAGE spends more time in MPI_Allreduce than CTH, MPI_Allreduce operations are more sensitive to noise than MPI_Bcast operations, and SAGE's smaller MPI_Allreduce operations are impacted to a greater degree than CTH's larger ones. In addition, CTH is likely able to absorb some of the injected noise because it spends 60% of its time in operations that can potentially absorb noise (MPI_Send, MPI_Wait, and MPI_Recv).


Ferreira et al. : The Impact of System Design Parameters on Application Noise Sensitivity

Page 25:

Nonblocking Collectives Are Resilient to System Noise Effects

[Figure: timeline comparison of a blocking collective and a nonblocking collective under system noise.]

Page 26:

Nonblocking Collectives: Impact on Parallel 3D FFT Kernel Performance

Parallel 3D FFT Performance (ISC '11)

[Figure: P3DFFT run-time (s) vs. problem size (512, 600, 720, 800) for the Blocking, Host-based Test, and Offload variants; lower is better.]

P3DFFT application kernel run-time comparison with 128 processes: P3DFFT with Offload-Ialltoall performs about 13.5% better than default P3DFFT and about 12% better than P3DFFT with host-based MPI_Test progression.

K. Kandalla et al. : High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT

Page 27:

Nonblocking Collectives Summary

•  Nonblocking Collectives Semantics

•  Nonblocking Collectives Advantages
   •  Communication-computation overlap
   •  Noise resiliency

•  Nonblocking Performance Results

Page 28:

Noncollective Communicator Creation

MPI_Group_comm_create(MPI_Comm in, MPI_Group grp, int tag, MPI_Comm *out)

•  grp is a subgroup of the input communicator (in)
•  No cached information passes from the old communicator to the new one
•  Creates a communicator with fewer processes, which is useful for fault tolerance and scalability (a hedged sketch follows below)
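
The routine shown above was a proposal at the time of these slides; MPI-3.0 standardized it as MPI_Comm_create_group with the same shape. A hedged sketch using the standardized name, splitting off the even-numbered ranks (the group construction is illustrative; rank and size are assumed to come from MPI_COMM_WORLD):

/* Sketch only: only the even ranks call the creation routine, yet they
   obtain a new communicator without the odd ranks participating. */
MPI_Group world_group, even_group;
MPI_Comm  even_comm = MPI_COMM_NULL;
MPI_Comm_group(MPI_COMM_WORLD, &world_group);

int nevens = (size + 1) / 2;
int evens[nevens];
for (int i = 0; i < nevens; i++)
    evens[i] = 2 * i;
MPI_Group_incl(world_group, nevens, evens, &even_group);

if (rank % 2 == 0)
    MPI_Comm_create_group(MPI_COMM_WORLD, even_group, 0 /* tag */, &even_comm);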

Thanks to Torsten Hoefler (UIUC), Martin Schulz (LLNL), and James Dinan (ANL)

Page 29:

Nonblocking Communicator Duplication Function

•  Duplicates a communicator without blocking
•  Provides a way to overlap communicator creation with other computation
•  Semantics
   •  The restrictions and assumptions of nonblocking collectives apply here
   •  It is an error to use newcomm before MPI_Comm_idup has completed
   •  Attributes changed after MPI_Comm_idup is called are not copied to the new communicator

MPI_Comm_idup(MPI_Comm comm, MPI_Comm *newcomm, MPI_Request *request)

A minimal sketch follows below.
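
A hedged sketch of overlapping the duplication with unrelated work (do_other_work() is a placeholder for application work; the new communicator is used only after MPI_Wait):

/* Sketch only: nonblocking duplication overlapped with other work. */
MPI_Comm    newcomm;
MPI_Request req;

MPI_Comm_idup(MPI_COMM_WORLD, &newcomm, &req);
do_other_work();                    /* must not touch newcomm yet */
MPI_Wait(&req, MPI_STATUS_IGNORE);

MPI_Barrier(newcomm);               /* newcomm is usable from here on */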

Thanks to Torsten Hoefler (UIUC) and Martin Schulz (LLNL)

Page 30:

Implementation Status

Feature                               Open MPI                            MPICH2
Nonblocking Collectives               Partial support (limited release)   Supported
Neighborhood Collectives              No support                          No support
Nonblocking Communicator Duplicate    No support                          Supported
Noncollective Communicator Create     No support                          Supported

Page 31:

Acknowledgements

•  Center for Computational Sciences
•  MPI Forum

•  Torsten Hoefler (UIUC) and Martin Schulz (LLNL)