
MPI: Advanced Topics and Future Trends

Dhabaleswar K. (DK) Panda

The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~panda

A Presentation at the HPC Advisory Council Workshop, Lugano 2012

• MPI is a dominant programming model for HPC Systems

• Introduced some of the MPI Features and their Usage

• Introduced MVAPICH2 stack

• Illustrated many performance optimizations and tuning techniques for MVAPICH2

• Most default settings are geared towards generic applications

– Default values are chosen after careful study across a wide variety of applications

• Default settings may be further tuned based on

– Application needs

– Specific system characteristics

– Other needs (like memory limits, etc.)

2

Recap of Yesterday’s Presentation (03/13/12)

HPC Advisory Council, Lugano Switzerland '12

• Challenges in designing Exascale Systems

• Overview of new MPI-3 features

• Overview of Latest and Upcoming MVAPICH2 features

– Optimized support for GPU to GPU communication

– Collective Communication: Network Offload, Non-blocking,

Topology-Aware, Power-Aware

– Support for PGAS (UPC and OpenShmem) and Hybrid Programming

– Support for MPI-3 RMA Model

– Future Plans

• Conclusions

3

Presentation Overview

HPC Advisory Council, Lugano Switzerland '12

• Exaflop = 10^18 floating point operations per second
• Represents a factor of 100-1000x over the current state of the art
• Goal: reach Exaflop levels by 2019-2020
• Exaflop computing is expected to spur research into high-performance technologies
• Discover new technologies to enable the next generation of science

HPC Advisory Council, Lugano Switzerland '12 4

Exaflop Computing

Exascale System Targets

Each row: metric, 2010 value → 2018 target (difference between today and 2018)

• System peak: 2 PFlop/s → 1 EFlop/s, O(1,000)
• Power: 6 MW → ~20 MW (goal)
• System memory: 0.3 PB → 32 – 64 PB, O(100)
• Node performance: 125 GF → 1.2 or 15 TF, O(10) – O(100)
• Node memory BW: 25 GB/s → 2 – 4 TB/s, O(100)
• Node concurrency: 12 → O(1k) or O(10k), O(100) – O(1,000)
• Total node interconnect BW: 3.5 GB/s → 200 – 400 GB/s (1:4 or 1:8 of memory BW), O(100)
• System size (nodes): 18,700 → O(100,000) or O(1M), O(10) – O(100)
• Total concurrency: 225,000 → O(billion) + [O(10) to O(100) for latency hiding], O(10,000)
• Storage capacity: 15 PB → 500 – 1000 PB (>10x system memory is the minimum), O(10) – O(100)
• I/O rates: 0.2 TB/s → 60 TB/s, O(100)
• MTTI: days → O(1 day), -O(10)

Courtesy: DOE Exascale Study and Prof. Jack Dongarra

5 HPC Advisory Council, Lugano Switzerland '12

• Supercomputers require a lot of energy

– Power consumption by processors for computation

– Power required by memory and other devices to move data

– Power required to cool the system

• Power requirement by current generation is already high

• Design constraint on Exaflop systems: must reach exaflops using a total of around 20 MW of power

HPC Advisory Council, Lugano Switzerland '12 6

Power Constraints for Exascale Systems

• DARPA Exascale Report – Peter Kogge, Editor and Lead

• Energy and Power Challenge

– Hard to solve power requirements for data movement

• Memory and Storage Challenge

– Hard to achieve high capacity and high data rate

• Concurrency and Locality Challenge

– Management of very large amount of concurrency (billion threads)

• Resiliency Challenge

– Low voltage devices (for low power) introduce more faults

HPC Advisory Council, Lugano Switzerland '12 7

What are the basic design challenges for Exascale Systems?

• Power required for data movement operations is one of

the main challenges

• Non-blocking collectives

– Overlap computation and communication

• Much improved One-sided interface

– Reduce synchronization of sender/receiver

• Manage concurrency

– Improved interoperability with PGAS (e.g. UPC, Global Arrays,

OpenShmem)

• Resiliency

– New interface for detecting failures

HPC Advisory Council, Lugano Switzerland '12 8

How does MPI plan to meet these challenges?

• Challenges in designing Exascale Systems

• Overview of new MPI-3 features

• Overview of Latest and Upcoming MVAPICH2 features

– Optimized support for GPU to GPU communication

– Collective Communication: Network Offload, Non-blocking,

Topology-Aware, Power-Aware

– Support for PGAS (UPC and OpenShmem) and Hybrid Programming

– Support for MPI-3 RMA Model

– Future Plans

• Conclusions

9

Presentation Overview

HPC Advisory Council, Lugano Switzerland '12

• Non-blocking Collectives

• Improved One-Sided (RMA) Model

• MPI Tools Interface

HPC Advisory Council, Lugano Switzerland '12 10

Major New Features

• Enables overlap of computation with communication
• Removes the synchronization effects of collective operations (with the exception of barrier)
• A non-blocking collective completes when its local part is complete
• Completion of one non-blocking collective does not imply completion of other non-blocking collectives
• No “tag” for the collective operation
• Issuing many non-blocking collectives may exhaust resources
– Quality implementations will ensure that this happens only in pathological cases

HPC Advisory Council, Lugano Switzerland '12 11

Non-blocking Collective operations

• Non-blocking calls do not match blocking collective calls
– The MPI implementation may use different algorithms for blocking and non-blocking collectives
– Therefore, non-blocking collectives cannot match blocking collectives
– Blocking collectives: optimized for latency
– Non-blocking collectives: optimized for overlap
• Users must call collectives in the same order on all ranks
• Progress rules are the same as those for point-to-point
• Example new calls: MPI_Ibarrier, MPI_Iallreduce, … (a minimal usage sketch follows after this slide)

HPC Advisory Council, Lugano Switzerland '12 12

Non-blocking Collective Operations (cont’d)
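To make the usage concrete, here is a minimal sketch of the overlap pattern these calls enable, written against the MPI_Iallreduce and MPI_Test signatures of the MPI-3 proposal; the buffers and the compute_independent_work() routine are illustrative stand-ins, not part of the original slides.

#include <mpi.h>

void compute_independent_work(void);   /* illustrative, application-specific */

/* Overlap an allreduce with computation that does not depend on its result. */
void overlapped_allreduce(double *local, double *global, int n, MPI_Comm comm)
{
    MPI_Request req;
    int done = 0;

    MPI_Iallreduce(local, global, n, MPI_DOUBLE, MPI_SUM, comm, &req);

    while (!done) {
        compute_independent_work();               /* useful work during the collective */
        MPI_Test(&req, &done, MPI_STATUS_IGNORE); /* also drives library progress */
    }
    /* global[] now holds the reduced values on every rank */
}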

• Easy to express irregular communication patterns
– Easier than a request-response pattern using two-sided communication
• Decouple data transfer from synchronization
• Better overlap of communication with computation (a minimal MPI-2 example follows after this slide)

HPC Advisory Council, Lugano Switzerland '12 13

Recap of One Sided Communication
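As a concrete reminder of the existing model, the following is a minimal MPI-2 style one-sided program (window creation, fence synchronization and a put); the neighbor exchange itself is only an illustrative choice.

#include <mpi.h>

/* Each rank exposes one integer as an RMA window and puts its own rank ID
   into the right neighbor's window, using fence synchronization. */
int main(int argc, char **argv)
{
    int rank, size, my_val = -1;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Win_create(&my_val, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                         /* open the access epoch  */
    MPI_Put(&rank, 1, MPI_INT, (rank + 1) % size,  /* target rank            */
            0, 1, MPI_INT, win);                   /* target disp and count  */
    MPI_Win_fence(0, win);                         /* complete all RMA ops   */

    /* my_val now holds the rank of the left neighbor */
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}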

[Figure: four ranks (0-3), each exposing a region of its local memory ("mem") as part of an RMA window]

• Remote Memory Access (RMA)
• The new proposal has major improvements
• MPI-2: public and private windows
– Synchronization of windows is explicit
• MPI-2: works for non-cache-coherent systems
• MPI-3: two types of windows
– Unified and Separate
– The Unified window leverages hardware cache coherence

HPC Advisory Council, Lugano Switzerland '12 14

Improved One-sided Model

[Figure: separate-window model, showing a process's private window and public window with incoming RMA operations, explicit synchronization, and local memory operations]

• MPI_Win_create_dynamic

– Window without memory attached

– MPI_Win_attach to attach memory to a window

• MPI_Win_allocate_shared

– Windows with shared memory

– Allows direct load/store accesses by remote processes

• MPI_Rput, MPI_Rget, MPI_Raccumulate

– Local completion by using MPI_Wait on request objects

• MPI_Get_accumulate, MPI_Fetch_and_op

– Accumulate into target memory, return old data to origin

• MPI_Compare_and_swap

– Atomic compare and swap

HPC Advisory Council, Lugano Switzerland '12 15

Highlights of New MPI-3 RMA Calls

• MPI_Win_lock_all

– Faster way to lock all members of win

• MPI_Win_flush / MPI_Win_flush_all

– Flush all RMA ops to target / window

• MPI_Win_flush_local / MPI_Win_flush_local_all

– Locally complete RMA ops to target / window

• MPI_Win_sync

– Synchronize public and private copies of window

• Overlapping accesses were “erroneous” in MPI-2
– They are “undefined” in MPI-3
(A usage sketch of the new RMA calls follows after this slide.)

HPC Advisory Council, Lugano Switzerland '12 16

Highlights of New MPI-3 RMA Calls (Cont’d)
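A sketch of how several of these calls fit together, assuming the signatures from the MPI-3 RMA proposal as later standardized: an allocated window, a passive-target lock_all epoch, a flush for remote completion and an atomic fetch-and-op. The counter layout and values are illustrative only.

#include <mpi.h>

/* Put a value to a peer, flush to guarantee remote completion, then
   atomically increment a shared counter kept on rank 0. */
void rma3_sketch(int peer, MPI_Comm comm)
{
    long *base;                 /* window memory allocated by MPI */
    long one = 1, old = 0, val = 42;
    MPI_Win win;

    MPI_Win_allocate(2 * sizeof(long), sizeof(long),
                     MPI_INFO_NULL, comm, &base, &win);
    base[0] = 0; base[1] = 0;
    MPI_Barrier(comm);                            /* windows initialized everywhere */

    MPI_Win_lock_all(0, win);                     /* passive-target epoch on all ranks */

    MPI_Put(&val, 1, MPI_LONG, peer, 1, 1, MPI_LONG, win);
    MPI_Win_flush(peer, win);                     /* put is now remotely complete */

    MPI_Fetch_and_op(&one, &old, MPI_LONG, 0, 0,  /* atomic counter on rank 0 */
                     MPI_SUM, win);
    MPI_Win_flush(0, win);

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
}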

HPC Advisory Council, Lugano, Switzerland '12 17

MPI Tools Interface

• Extended tools support in MPI-3, beyond the PMPI interface
• Provides a standardized interface (MPIT) to access MPI-internal information (a minimal sketch follows after this slide)
– Configuration and control information: eager limit, buffer sizes, . . .
– Performance information: time spent blocking, memory usage, . . .
– Debugging information: packet counters, thresholds, . . .
• External tools can build on top of this standard interface

HPC Advisory Council, Lugano, Switzerland '12 18
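A minimal sketch of how a tool could enumerate control variables through this interface, assuming the MPI_T_* calls of the MPI-3 proposal (MPI_T_init_thread, MPI_T_cvar_get_num, MPI_T_cvar_get_info); an implementation would expose items such as the eager limit here.

#include <mpi.h>
#include <stdio.h>

/* List the names and descriptions of all control variables (cvars)
   exposed by the MPI library; usable even without calling MPI_Init. */
int main(void)
{
    int provided, ncvars, i;

    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_T_cvar_get_num(&ncvars);

    for (i = 0; i < ncvars; i++) {
        char name[256], desc[1024];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, binding, scope;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;

        MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                            &enumtype, desc, &desc_len, &binding, &scope);
        printf("cvar %d: %s - %s\n", i, name, desc);
    }

    MPI_T_finalize();
    return 0;
}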

Miscellaneous features

• New FORTRAN bindings

• Non-blocking File I/O

• Remove C++ bindings

• Support for large data counts

• Scalable sparse collectives on process topologies

• Topology aware communicator creation

• Support for multiple “MPI processes” within a single “operating system process” (in progress)

• Support for helper threads (in progress)

• Numerous fixes in the existing proposal

• Challenges in designing Exascale Systems

• Overview of new MPI-3 features

• Overview of Latest and Upcoming MVAPICH2 features

– Optimized support for GPU to GPU communication

– Collective Communication: Network Offload, Non-blocking,

Topology-Aware, Power-Aware

– Support for PGAS (UPC and OpenShmem) and Hybrid Programming

– Support for MPI-3 RMA Model

– Future Plans

• Conclusions

19

Presentation Overview

HPC Advisory Council, Lugano Switzerland '12

Data movement in GPU+IB clusters

• Many deployments today combine GPUs with high-speed networks such as InfiniBand

• Steps in Data movement in InfiniBand clusters with GPUs

– From GPU device memory to main memory at source process, using CUDA

– From source to destination process, using MPI

– From main memory to GPU device memory at destination process, using CUDA

• Earlier, GPU device and InfiniBand device required separate memory registration

• GPU-Direct (collaboration between NVIDIA and Mellanox) supported common registration between these devices

• However, GPU-GPU communication is still costly and programming is harder

20 HPC Advisory Council, Lugano Switzerland '12

[Figure: two nodes, each with a GPU and a CPU attached via PCIe to a NIC, connected through a switch]

At Sender:

cudaMemcpy(sbuf, sdev, . . .);

MPI_Send(sbuf, size, . . .);

At Receiver:

MPI_Recv(rbuf, size, . . .);

cudaMemcpy(rdev, rbuf, . . .);

Sample Code - Without MPI integration

• Naïve implementation with standard MPI and CUDA

• High Productivity and Poor Performance

21 HPC Advisory Council, Lugano Switzerland '12


At Sender:
for (j = 0; j < pipeline_len; j++)
    cudaMemcpyAsync(sbuf + j * blksz, sdev + j * blksz, . . .);
for (j = 0; j < pipeline_len; j++) {
    while (result != cudaSuccess) {
        result = cudaStreamQuery(…);
        if (j > 0) MPI_Test(…);
    }
    MPI_Isend(sbuf + j * blksz, blksz, . . .);
}
MPI_Waitall();

Sample Code – User Optimized Code

• Pipelining at user level with non-blocking MPI and CUDA interfaces

• Code at Sender side (and repeated at Receiver side)

• User-level copying may not match the internal MPI design

• High Performance and Poor Productivity

HPC Advisory Council, Lugano Switzerland '12 22

Can this be done within MPI Library?

• Support GPU-to-GPU communication through standard MPI interfaces
– e.g. enable MPI_Send and MPI_Recv from/to GPU memory
• Provide high performance without exposing low-level details to the programmer
– Pipelined data transfer that automatically provides optimizations inside the MPI library without user tuning
• A new design incorporated in MVAPICH2 to support this functionality

23 HPC Advisory Council, Lugano Switzerland '12

At Sender:

MPI_Send(s_device, size, …);

At Receiver:

MPI_Recv(r_device, size, …);

inside MVAPICH2

Sample Code – MVAPICH2-GPU

• MVAPICH2-GPU: standard MPI interfaces used (a fuller, self-contained sketch follows after this slide)
• High Performance and High Productivity

24 HPC Advisory Council, Lugano Switzerland '12
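Expanding the fragment above into a self-contained sketch, assuming a CUDA-enabled MVAPICH2 build; the message size and tag are arbitrary. The point is that MPI_Send/MPI_Recv operate directly on device pointers, with staging and pipelining handled inside the library.

#include <mpi.h>
#include <cuda_runtime.h>

#define MSG_SIZE (4 * 1024 * 1024)    /* 4 MB, illustrative */

int main(int argc, char **argv)
{
    int rank;
    void *dbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc(&dbuf, MSG_SIZE);      /* buffer lives in GPU device memory */

    if (rank == 0)                    /* send directly from device memory */
        MPI_Send(dbuf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)               /* receive directly into device memory */
        MPI_Recv(dbuf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(dbuf);
    MPI_Finalize();
    return 0;
}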

Design considerations

• Memory detection
– CUDA 4.0 introduces Unified Virtual Addressing (UVA)
– The MPI library can differentiate between device memory and host memory without any hints from the user (see the sketch after this slide)
• Overlap of CUDA copy and RDMA transfer
– Data movement from the GPU and the RDMA transfer are both DMA operations
– Allow for asynchronous progress

25 HPC Advisory Council, Lugano Switzerland '12
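A sketch of the UVA-based detection described above, assuming the CUDA 4.x cudaPointerGetAttributes API and its memoryType field; inside an MPI library this kind of check would run on every communication call.

#include <cuda_runtime.h>

/* Return 1 if ptr refers to GPU device memory, 0 otherwise.  With UVA
   (CUDA 4.0+), cudaPointerGetAttributes can classify a pointer; plain host
   pointers unknown to CUDA return an error, which is treated as "host". */
static int is_device_pointer(void *ptr)
{
    struct cudaPointerAttributes attr;

    if (cudaPointerGetAttributes(&attr, ptr) != cudaSuccess) {
        cudaGetLastError();               /* clear the sticky error state */
        return 0;
    }
    return attr.memoryType == cudaMemoryTypeDevice;
}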

MPI-Level Two-sided Communication

• 45% improvement compared with a naïve user-level implementation (Memcpy+Send) for 4 MB messages
• 24% improvement compared with an advanced user-level implementation (MemcpyAsync+Isend) for 4 MB messages

[Chart: two-sided latency (us) vs. message size (32 KB to 4 MB) for Memcpy+Send, MemcpyAsync+Isend and MVAPICH2-GPU]

H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur and D. K. Panda, MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters, ISC ‘11

26 HPC Advisory Council, Lugano Switzerland '12

Other MPI Operations and Optimizations for GPU Buffers

• Similar approaches can be used for

– One-sided

– Collectives

– Communication with Datatypes

• Designs can also be extended for multiple GPUs per node
– Use CUDA IPC (introduced in CUDA 4.1) to avoid copying through host memory

27

• H. Wang, S. Potluri, M. Luo, A. Singh, X. Ouyang, S. Sur and D. K. Panda, Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2, IEEE Cluster '11, Sept. 2011

HPC Advisory Council, Lugano Switzerland '12

• A. Singh, S. Potluri, H. Wang, K. Kandalla, S. Sur and D. K. Panda, MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefits, Workshop on Parallel Programming on Accelerator Clusters (PPAC '11), held in conjunction with Cluster '11, Sept. 2011

• S. Potluri et al. Optimizing MPI Communication on Multi-GPU Systems using CUDA Inter-Process Communication, Workshop on Accelerators and Hybrid Exascale Systems(ASHES), to be held in conjunction with IPDPS 2012, May 2012

MVAPICH2 1.8a2 Release

• Supports point-to-point and collective communication
• Supports communication between GPU devices and between GPU device and host
• Supports communication using contiguous and non-contiguous MPI Datatypes
• Supports GPU-Direct through CUDA support (from 4.0)
• Takes advantage of CUDA IPC for intra-node (intra-I/OH) communication (from CUDA 4.1)
• Provides flexibility in tuning the performance of both RDMA and shared-memory based designs, based on the predominant message sizes in applications

28 HPC Advisory Council, Lugano Switzerland '12

OSU MPI Micro-Benchmarks (OMB) 3.5.1 Release

• OSU MPI Micro-Benchmarks provides a comprehensive suite of benchmarks to compare the performance of different MPI stacks and networks
• Enhancements done for three benchmarks
– Latency
– Bandwidth
– Bi-directional Bandwidth
• Flexibility for using buffers in NVIDIA GPU device memory (D) and host memory (H)
• Flexibility for selecting data movement between D->D, D->H and H->D
• Available from http://mvapich.cse.ohio-state.edu/benchmarks
• Also available in an integrated manner with the MVAPICH2 stack

29 HPC Advisory Council, Lugano Switzerland '12

HPC Advisory Council, Lugano Switzerland '12 30

MVAPICH2 vs. OpenMPI (Device-Device, Inter-node)

[Charts: latency (us), bandwidth (MB/s) and bi-directional bandwidth (MB/s) vs. message size (1 byte to 1 MB) for MVAPICH2 and OpenMPI]
MVAPICH2 1.8a2 and OpenMPI (trunk nightly snapshot on Feb 3, 2012); Westmere with ConnectX-2 QDR HCA, NVIDIA Tesla C2075 GPU and CUDA Toolkit 4.1

HPC Advisory Council, Lugano Switzerland '12 31

MVAPICH2 vs. OpenMPI (Device-Host, Inter-node)

[Charts: latency (us), bandwidth (MB/s) and bi-directional bandwidth (MB/s) vs. message size (1 byte to 1 MB) for MVAPICH2 and OpenMPI; Host-Device performance is similar]
MVAPICH2 1.8a2 and OpenMPI (trunk nightly snapshot on Feb 3, 2012); Westmere with ConnectX-2 QDR HCA, NVIDIA Tesla C2075 GPU and CUDA Toolkit 4.1

HPC Advisory Council, Lugano Switzerland '12 32

MVAPICH2 vs. OpenMPI (Device-Device, Intra-node, Multi-GPU)

[Charts: latency (us), bandwidth (MB/s) and bi-directional bandwidth (MB/s) vs. message size (1 byte to 1 MB) for MVAPICH2 and OpenMPI]
MVAPICH2 1.8a2 and OpenMPI (trunk nightly snapshot on Feb 3, 2012); Westmere with ConnectX-2 QDR HCA, NVIDIA Tesla C2075 GPU and CUDA Toolkit 4.1

HPC Advisory Council, Lugano Switzerland '12 33

Applications-Level Evaluation (Lattice Boltzmann Method (LBM))

[Chart: time per LB step (us) for matrix sizes 128x512x64 to 1024x512x64 with MVAPICH2 1.7 and MVAPICH2 1.8a2; 23.5 - 24.2% improvement]

• LBM-CUDA (Courtesy: Carlos Rosale, TACC) is a parallel distributed CUDA implementation of a Lattice Boltzmann Method for multiphase flows with large density ratios
• NVIDIA Tesla C2050, Mellanox QDR InfiniBand HCA MT26428, Intel Westmere processor with 12 GB main memory; CUDA 4.1, MVAPICH2 1.7 and MVAPICH2 1.8a2
• Run one process on each node for one GPU (8-node cluster)

HPC Advisory Council, Lugano Switzerland '12 34

Application-Level Evaluation (AWP-ODC)

[Chart: total execution time (sec) with 4 and 8 processes for MVAPICH2 1.7 and MVAPICH2 1.8a2; 12.5% and 13.0% improvement]
• AWP-ODC simulates the dynamic rupture and wave propagation that occur during an earthquake
• A Gordon Bell Prize finalist at SC 2010
• Originally a Fortran code; a new version is being written in C and CUDA
• NVIDIA Tesla C2050, Mellanox QDR IB, Intel Westmere processor with 12 GB main memory
• CUDA 4.1, MVAPICH2 1.7 and MVAPICH2 1.8a2
• One process on each node with one GPU; 128x128x1024 data grid per process/GPU

• Challenges in designing Exascale Systems

• Overview of new MPI-3 features

• Overview of Latest and Upcoming MVAPICH2 features

– Optimized support for GPU to GPU communication

– Collective Communication: Network Offload, Non-blocking,

Topology-Aware, Power-Aware

– Support for PGAS (UPC and OpenShmem) and Hybrid Programming

– Support for MPI-3 RMA Model

– Future Plans

• Conclusions

35

Presentation Overview

HPC Advisory Council, Lugano Switzerland '12

• Mellanox’s ConnectX-2 and ConnectX-3 adapters feature a “task-list” offload interface
– An extension to the existing InfiniBand APIs
• Collective communication with “blocking” semantics is usually a scaling bottleneck
– Matches the need for non-blocking collectives in MPI
• Accordingly, MPI software stacks need to be re-designed to leverage offload in a comprehensive manner
• Can applications be modified to take advantage of non-blocking collectives, and what will the benefits be? (a sketch of the overlap pattern follows after this slide)

Collective Offload in ConnectX-2 and ConnectX-3

36 HPC Advisory Council, Lugano Switzerland '12
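As a sketch of the application-level restructuring this implies (in the style of the P3DFFT study below), assuming MPI-3's MPI_Ialltoall; compute_on_local_data() is an illustrative stand-in for work that does not depend on the exchanged data.

#include <mpi.h>

void compute_on_local_data(void);     /* illustrative, application-specific */

/* Start the all-to-all, let the offload-capable HCA progress it, do
   independent computation, and wait only before the exchanged data is used. */
void exchange_with_overlap(double *sendbuf, double *recvbuf,
                           int count_per_rank, MPI_Comm comm)
{
    MPI_Request req;

    MPI_Ialltoall(sendbuf, count_per_rank, MPI_DOUBLE,
                  recvbuf, count_per_rank, MPI_DOUBLE, comm, &req);

    compute_on_local_data();          /* overlapped with the collective */

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    /* recvbuf is now valid */
}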


Collective Offload Support in ConnectX-2 (Recv followed by Multi-Send)

• Sender creates a task-list consisting of only send and wait WQEs
– One send WQE is created for each registered receiver and is appended to the rear of a singly linked task-list
– A wait WQE is added to make the ConnectX-2 HCA wait for an ACK packet from the receiver

HPC Advisory Council, Lugano Switzerland '12

[Figure: the offloaded task list (Send, Wait, Send, Send, Send, Wait) posted to an InfiniBand HCA with send/receive queues, send/receive completion queues and a management queue (MQ/MCQ) over the physical link]
37

P3DFFT Application Performance with Non-Blocking Alltoall based on CX-2 Collective Offload

38

[Chart: P3DFFT application run-time (s) vs. data size (512, 600, 720, 800)]

P3DFFT Application Run-time Comparison. Overlap version with Offload-Alltoall does up to 17% better than default blocking version

K. Kandalla, H. Subramoni, K. Tomko, D. Pekurovsky, S. Sur and D. K. Panda, High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, Int'l Supercomputing Conference (ISC), June 2011.

128 Processes

HPC Advisory Council, Lugano Switzerland '12


Experimental Setup:
• 8-core Intel Xeon (2.53 GHz), 12 MB L3 cache, 12 GB memory per node, 64 nodes
• MT26428 QDR ConnectX-2, PCI-Ex interfaces, 171-port Mellanox QDR switch

Non-Blocking Broadcast with Collective Offload and Impact on HPL Performance

39

[Chart: normalized HPL performance vs. HPL problem size (N) as a percentage of total memory, for HPL-Offload, HPL-1ring and HPL-Host with 512 processes; up to 4.5% improvement with HPL-Offload]

K. Kandalla, H. Subramoni, J. Vienne, K. Tomko, S. Sur and D. K. Panda, Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, Hot Interconnect '11, Aug. 2011.

HPC Advisory Council, Lugano Switzerland '12

Pre-conditioned Conjugate Gradient (PCG) Solver Performance with Non-Blocking Allreduce based on CX-2 Collective Offload

40

[Chart: PCG solver run-time (s) vs. number of processes (64 to 512) for PCG-Default and Modified-PCG-Offload]

64,000 unknowns per process; the modified PCG with Offload-Allreduce performs up to 21.8% better than the default PCG

K. Kandalla, U. Yang, J. Keasler, T. Kolev, A. Moody, H. Subramoni, K. Tomko, J. Vienne and D. K. Panda, Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, Accepted for publication at IPDPS ’12, May 2012.

HPC Advisory Council, Lugano Switzerland '12


Non-contiguous Allocation of Jobs

• Supercomputing systems are organized as racks of nodes interconnected using complex network architectures
• Job schedulers are used to allocate compute nodes to various jobs
• Individual processes belonging to one job can get scattered
• The primary responsibility of the scheduler is to keep system throughput high

[Figure: fat-tree network with spine and line-card switches; busy cores, idle cores and a newly scheduled job spread across the system]

42 HPC Advisory Council, Lugano Switzerland '12

[Figure: three jobs whose compute nodes are spread under different leaf switches of a fat-tree, leading to network contention]

43

Topology Aware Collectives

HPC Advisory Council, Lugano Switzerland '12

HPC Advisory Council, Lugano Switzerland '12

Topology-Aware Collectives

Default (Binomial) Vs Topology-Aware Algorithms with 296 Processes

44

K. Kandalla, H. Subramoni, A. Vishnu and D. K. Panda, “Designing Topology-Aware Collective Communication Algorithms for Large Scale Infiniband Clusters: Case Studies with Scatter and Gather,” CAC ‘10

[Charts: scatter and gather latency (msec) vs. message size (2 KB to 512 KB) for the default and topology-aware algorithms; estimated latency (usec) of the default and topology-aware algorithms for small messages as the system grows from 2 to 32 racks, with 22% and 54% improvements]

Experimental Setup:
• 8-core Intel Xeon “Clovertown” (2.53 GHz), 4 MB L2 cache, 5 GB memory per node, 64 nodes
• MT25418 DDR ConnectX-2, PCI-Ex interfaces, 171-port Mellanox DDR switch

Generic Fat-Tree System

Impact of Network-Topology Aware Algorithms on Broadcast Performance

[Chart: broadcast latency (ms) vs. message size (128 KB to 1 MB) for No-Topo-Aware, Topo-Aware-DFT and Topo-Aware-BFT algorithms]

HPC Advisory Council, Lugano Switzerland '12 45

[Chart: normalized broadcast latency vs. job size (128 to 1K processes)]

• The impact of topology-aware schemes is more pronounced as
– Message size increases
– System size scales
• Up to 14% improvement in performance at scale

H. Subramoni, K. Kandalla, J. Vienne, S. Sur, B. Barth, K.Tomko, R. McLay, K. Schulz, and D. K. Panda, Design and Evaluation of Network Topology-/Speed- Aware Broadcast Algorithms for InfiniBand Clusters, Cluster ‘11

Power-Aware Collective Algorithms

Default (No Power Savings): run each core at peak frequency/throttling state
Frequency Scaling Only: dynamically detect communication phases, treat them as a black box and scale the CPU frequency
MVAPICH2 Approach: consider the communication characteristics of different collectives and intelligently use both DVFS and CPU throttling to deliver fine-grained power savings

46 HPC Advisory Council, Lugano Switzerland '12

[Chart: collective latency (msec) vs. message size for the Default, DVFS-only and Proposed schemes]

47

Power and Energy Savings with Power-Aware Collectives

[Charts: power (KW) over time during MPI_Alltoall with 64 processes on 8 nodes; estimated energy consumption (KJ) of an MPI_Alltoall with 128 KB messages for increasing system sizes, showing 30% and 32% savings]

K. Kandalla, E. Mancini, Sayantan Sur and D. K. Panda, “Designing Power Aware Collective Communication Algorithms for Infiniband Clusters”, ICPP ‘10

HPC Advisory Council, Lugano Switzerland '12

Experimental Setup:
• 8-core Intel Xeon “Nehalem” (2.40 GHz), 8 MB L3 cache, 12 GB memory per node, 8 nodes
• MT26428 QDR ConnectX-2, PCI-Ex interfaces, 171-port Mellanox QDR switch

Power Savings at application level (CPMD with 32 and 64 Processes)

[Charts: CPMD application run-time (s) and power consumption (KJ) with 32 and 64 processes for the Default, Freq-Scaling and Proposed approaches]

Estimated power savings: 7.7%; performance degradation: 2.6%

48 HPC Advisory Council, Lugano Switzerland '12

• Challenges in designing Exascale Systems

• Overview of new MPI-3 features

• Overview of Latest and Upcoming MVAPICH2 features

– Optimized support for GPU to GPU communication

– Collective Communication: Network Offload, Non-blocking,

Topology-Aware, Power-Aware

– Support for PGAS (UPC and OpenShmem) and Hybrid Programming

– Support for MPI-3 RMA Model

– Future Plans

• Conclusions

49

Presentation Overview

HPC Advisory Council, Lugano Switzerland '12

• Partitioned Global Address Space (PGAS) models provide a complementary interface to message passing
– The idea is to decouple data movement from process synchronization
– Processes have asynchronous access to globally distributed data
– Well suited for irregular applications and kernels that require dynamic access to different data
• Different libraries and compilers provide this model
– Global Arrays (library), UPC (compiler), CAF (compiler)
– HPCS languages: X10, Chapel, Fortress

HPC Advisory Council, Lugano Switzerland '12 50

PGAS Models

• Currently, UPC and MPI do not share runtimes
– Duplication of lower-level communication mechanisms
– GASNet is unable to leverage the advanced buffering mechanisms developed for MVAPICH2
• Our novel approach is to enable a truly unified communication library

Unifying UPC and MPI Runtimes: Experience with MVAPICH2

[Figure: today, the UPC compiler uses the GASNet runtime and MPI its own runtime, each with separate buffers, queue pairs and other resources over the network interface; in the proposed design, the GASNet and MPI interfaces share a unified MVAPICH + GASNet runtime with common buffers, queue pairs and resources]

HPC Advisory Council, Lugano Switzerland '12 51

• BUPC micro-benchmarks from the latest release, 2.10.2
• UPC performance is identical with both the native IBV layer and the new UCR layer
• Performance of the GASNet-MPI conduit is not very good
– Mismatch between the MPI specification and Active Messages
• GASNet-UCR is more scalable than the native IBV conduit

[Chart: UPC memput latency (us) vs. message size (1 byte to 2 KB)]

HPC Advisory Council, Lugano Switzerland '12 52

UPC Micro-benchmark Performance

GASNet-UCR vs. GASNet-IBV vs. GASNet-MPI
[Charts: UPC memput bandwidth (MBps) vs. message size (1 byte to 1 MB); UPC memory footprint (MB) vs. number of processes (16 to 256)]

J. Jose, M. Luo, S. Sur and D. K. Panda, “Unifying UPC and MPI Runtimes: Experience with MVAPICH”, International Conference on Partitioned Global Address Space (PGAS), 2010

Evaluation using UPC NAS Benchmarks

• GASNet-UCR performs equal to or better than GASNet-IBV

• 10% improvement for CG (B, 128)

• 23% improvement for MG (B, 128)

[Charts: run time (sec) of the UPC NAS MG, FT and CG benchmarks, Classes B and C with 64 and 128 processes, for GASNet-UCR, GASNet-IBV and GASNet-MPI]

53 HPC Advisory Council, Lugano Switzerland '12

Evaluation of Hybrid MPI+UPC NAS-FT

[Chart: run time (sec) of NAS-FT, Classes B and C with 64 and 128 processes, for GASNet-UCR, GASNet-IBV, GASNet-MPI and the Hybrid version]

• Modified the NAS FT UPC all-to-all pattern to use MPI_Alltoall
• A truly hybrid program
• 34% improvement for FT (C, 128)

54 HPC Advisory Council, Lugano Switzerland '12

Graph500 Results with new UPC Queue Design

• Workload – Scale:24, Edge Factor:16 (16 million vertices, 256 million edges)

• 44% Improvement over base version for 512 UPC-Threads

• 30% Improvement over base version for 1024 UPC-Threads


J. Jose, S. Potluri, M. Luo, S. Sur and D. K. Panda, UPC Queues for Scalable Graph Traversals: Design and Evaluation on InfiniBand Clusters, Fifth Conference on Partitioned Global Address Space Programming Model (PGAS '11), Oct. 2011.

HPC Advisory Council, Lugano Switzerland '12 55

HPC Advisory Council, Lugano, Switzerland '12 56

Hybrid (MPI and OpenSHMEM) design

Hybrid MPI+OpenSHMEM

• Based on the OpenSHMEM Reference Implementation (http://openshmem.org/)
• Hybrid MPI+OpenSHMEM model (a small sketch follows after this slide)
• Current model: separate runtimes for OpenSHMEM and MPI
– Possible deadlock if both runtimes are not progressed
– Consumes more network resources
• Our approach: a single runtime for MPI and OpenSHMEM

[Figure: a hybrid (OpenSHMEM+MPI) application issues OpenSHMEM calls and MPI calls through their respective interfaces into a single OSU-designed runtime over the InfiniBand network]

HPC Advisory Council, Lugano, Switzerland '12 57
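A sketch of what such a hybrid program looks like, assuming the OpenSHMEM 1.0-era API (start_pes, shmalloc, shmem_int_put) next to standard MPI; with the unified runtime both sets of calls are served by one library, but the application code itself stays ordinary MPI + OpenSHMEM.

#include <mpi.h>
#include <shmem.h>

/* One-sided neighbor exchange with OpenSHMEM followed by a global
   reduction with MPI, in the same program. */
int main(int argc, char **argv)
{
    int rank, size, sum;

    MPI_Init(&argc, &argv);
    start_pes(0);                                 /* OpenSHMEM initialization */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *recv = (int *) shmalloc(sizeof(int));    /* symmetric heap buffer */
    *recv = -1;
    shmem_barrier_all();

    shmem_int_put(recv, &rank, 1, (rank + 1) % size);  /* put my rank to the right neighbor */
    shmem_barrier_all();

    MPI_Allreduce(recv, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);  /* reduce with MPI */

    shfree(recv);
    MPI_Finalize();
    return 0;
}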

Performance of Hybrid (MPI+OpenSHMEM) Applications

[Chart: 2D Heat Transfer Modeling, execution time (s) vs. number of processes (32 to 512) for pure OpenSHMEM and Hybrid (OpenSHMEM+MPI); 34% improvement at 512 processes]

[Chart: hybrid application network resource usage, memory (MB) vs. number of processes (32 to 512) for separate runtimes and a single runtime; 27% reduction]

• Improved performance for hybrid applications
– 34% improvement for 2D Heat Transfer Modeling with 512 processes over the pure OpenSHMEM version
• Our single-runtime approach consumes 27% less network resources

• Challenges in designing Exascale Systems

• Overview of new MPI-3 features

• Overview of Latest and Upcoming MVAPICH2 features

– Optimized support for GPU to GPU communication

– Collective Communication: Network Offload, Non-blocking,

Topology-Aware, Power-Aware

– Support for PGAS (UPC and OpenShmem) and Hybrid Programming

– Support for MPI-3 RMA Model

– Future Plans

• Conclusions

58

Presentation Overview

HPC Advisory Council, Lugano Switzerland '12

HPC Advisory Council, Lugano Switzerland '12 59

Support for MPI-3 RMA Model

MPI-3 One-Sided Communication
• Semantics: accumulate ordering, undefined conflicting accesses, separate and unified windows
• Window creation: Win_allocate; Win_allocate_shared; Win_create_dynamic, Win_attach, Win_detach
• Synchronization: Lock_all, Unlock_all; Win_flush, Win_flush_local, Win_flush_all, Win_flush_local_all; Win_sync
• Communication: Get_accumulate; Rput, Rget, Raccumulate, Rget_accumulate; Fetch_and_op, Compare_and_swap

HPC Advisory Council, Lugano Switzerland '12 60

Support for MPI-3 RMA Model

• Flush operations
– Local and remote completions are bundled in MPI-2
– Considerable overhead on networks like IB, where the semantics and cost of local and remote completions are different
– Flush operations allow a more efficient check for completions
• Request-based operations
– The current semantics provide bulk synchronization
– Request-based operations return an MPI request that can be polled individually
– Allow for better overlap with the direct one-sided design in MVAPICH2
• Dynamic windows
– Allow users to attach and detach memory dynamically; the key is to hide the overheads (the exchange of RDMA keys)

HPC Advisory Council, Lugano Switzerland '12 61

[Charts: put latency (usec) vs. message size (1 byte to 4 KB) for Put+Unlock, Put+Flush_local and Put+Flush; percentage overlap vs. message size (2 KB to 2 MB) for Lock-Unlock and request-based operations]

• Flush_local allows for a faster check on local completions
• Request-based operations provide much better overlap than Lock-Unlock in a Get-Compute-Put communication pattern (see the sketch after this slide)

Support for MPI-3 RMA Model

S. Potluri, S. Sur, D. Bureddy and D. K. Panda – Design and Implementation of Key Proposed MPI-3 One-Sided Communication Semantics on InfiniBand, EuroMPI 2011, September 2011
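A sketch of the Get-Compute-Put pattern with request-based operations, assuming MPI-3's MPI_Rget/MPI_Rput; win is an existing window already locked with MPI_Win_lock_all, and compute() is an illustrative placeholder.

#include <mpi.h>

void compute(double *buf, int n);     /* illustrative, application-specific */

/* Fetch a block from the target, compute on it, and write it back.  Each
   operation returns its own request, so it can be completed individually
   instead of waiting for a bulk epoch close. */
void get_compute_put(double *buf, int n, int target, MPI_Win win)
{
    MPI_Request req;

    MPI_Rget(buf, n, MPI_DOUBLE, target, 0, n, MPI_DOUBLE, win, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);    /* completes only this get */

    compute(buf, n);

    MPI_Rput(buf, n, MPI_DOUBLE, target, 0, n, MPI_DOUBLE, win, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);    /* local completion of the put */
    MPI_Win_flush(target, win);           /* remote completion at the target */
}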

• Challenges in designing Exascale Systems

• Overview of new MPI-3 features

• Overview of Latest and Upcoming MVAPICH2 features

– Optimized support for GPU to GPU communication

– Collective Communication: Network Offload, Non-blocking,

Topology-Aware, Power-Aware

– Support for PGAS (UPC and OpenShmem) and Hybrid Programming

– Support for MPI-3 RMA Model

– Future Plans

• Conclusions

62

Presentation Overview

HPC Advisory Council, Lugano Switzerland '12

• Performance and memory scalability toward 500K-1M cores
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, …)
• Enhanced optimization for GPU support and accelerators
– Extending the GPGPU support
– Support for Intel MIC (a paper will be presented at the Intel-TACC Symposium in April ‘12)
• Taking advantage of the Collective Offload framework
– Including support for non-blocking collectives (MPI 3.0)
• Extended topology-aware collectives
• Power-aware collectives
• Enhanced multi-rail designs
• Automatic optimization of collectives: LiMIC2, XRC, Hybrid (UD-RC/XRC) and multi-rail
• Support for the MPI Tools Interface
• Checkpoint-restart and migration support with incremental checkpointing
• Fault tolerance with run-through stabilization (being discussed in MPI 3.0)
• QoS-aware I/O and checkpointing
• Automatic tuning and self-adaptation for different systems and applications

MVAPICH2 – Plans for Exascale

63 HPC Advisory Council, Lugano Switzerland '12

• Presented challenges for designing Exascale systems

• Overview of MPI-3 Features

• How MVAPICH2 is planning to address some of these challenges

• MVAPICH2 plans for Exascale systems

Conclusions

64 HPC Advisory Council, Lugano Switzerland '12

• Programming models like MPI have taken advantage of RDMA to design high-performance and scalable libraries
• Can similar things be done for other middleware used for processing Big Data?
• Present challenges and provide initial designs for
– Memcached
– HBase and HDFS in the Hadoop framework

65

Tomorrow’s (March 15th) Presentation

HPC Advisory Council, Lugano, Switzerland '12

HPC Advisory Council, Lugano Switzerland '12

Funding Acknowledgments

Funding Support by

Equipment Support by

66

HPC Advisory Council, Lugano Switzerland '12

Personnel Acknowledgments

Current Students – J. Chen (Ph.D.)

– V. Dhanraj (M.S.)

– N. Islam (Ph.D.)

– J. Jose (Ph.D.)

– K. Kandalla (Ph.D.)

– M. Luo (Ph.D.)

– X. Ouyang (Ph.D.)

– S. Potluri (Ph.D.)

– R. Rajachandrasekhar (Ph.D.)

– M. Rahman (Ph.D.)

– A. Singh (Ph.D.)

– H. Subramoni (Ph.D.)

Past Students – P. Balaji (Ph.D.)

– D. Buntinas (Ph.D.)

– S. Bhagvat (M.S.)

– L. Chai (Ph.D.)

– B. Chandrasekharan (M.S.)

– N. Dandapanthula (M.S.)

– T. Gangadharappa (M.S.)

– K. Gopalakrishnan (M.S.)

– W. Huang (Ph.D.)

– W. Jiang (M.S.)

– S. Kini (M.S.)

– M. Koop (Ph.D.)

– R. Kumar (M.S.)

– S. Krishnamoorthy (M.S.)

– P. Lai (Ph. D.)

– J. Liu (Ph.D.)

– A. Mamidala (Ph.D.)

– G. Marsh (M.S.)

– V. Meshram (M.S.)

– S. Naravula (Ph.D.)

– R. Noronha (Ph.D.)

– S. Pai (M.S.)

– G. Santhanaraman (Ph.D.)

– J. Sridhar (M.S.)

– S. Sur (Ph.D.)

– K. Vaidyanathan (Ph.D.)

– A. Vishnu (Ph.D.)

– J. Wu (Ph.D.)

– W. Yu (Ph.D.)

67

Past Research Scientist – S. Sur

Current Post-Docs – J. Vienne

– H. Wang

Current Programmers – M. Arnold

– D. Bureddy

– J. Perkins

Past Post-Docs – X. Besseron

– H.-W. Jin

– E. Mancini

– S. Marcarelli