
Page 1

Manjunath Gorentla Venkata, on behalf of OpenSHMEM members @ Mellanox

OpenSHMEM on ConnectX-6 and Mellanox SHARP

OpenSHMEM BoF @ SC19, November 19th, 2019

Page 2

Higher Data Speeds, Faster Data Processing, Better Data Security

Adapters, Switches, Cables & Transceivers, SmartNIC, System on a Chip

Page 3

InfiniBand Technology and Solutions

§ 200G HDR InfiniBand end-to-end, extremely low latency, high message rate, RDMA and GPUDirect
§ Advanced end-to-end adaptive routing, congestion control and quality of service
§ In-Network Computing acceleration engines (Mellanox SHARP, MPI offloads)
§ Self-Healing Network with SHIELD for highest network resiliency
§ High speed gateways to Ethernet, and long reach up to 40 km
§ Standard, backward and forward compatibility

[Diagram: an InfiniBand high speed network with advanced In-Network Computing and extremely low latency connects compute servers and NVMe/storage; Mellanox Skyway provides a high speed gateway to Ethernet-side NVMe/storage, and Quantum LongReach extends InfiniBand over long distances.]

Page 4

Accelerating OpenSHMEM with Mellanox Hardware and Software

Hardware

§ High throughput and message rate
  § Dual ports of 200Gb/s VPI adapter (HDR)
  § Message rate: 200 million messages per second
  § Latency: 0.6 usec
§ In-Network Computing
  § SHARP v2 (Scalable Hierarchical Aggregation and Reduction Protocol) support
  § Enables offloading high-bandwidth collective/reduce operations
§ Atomic operations enhancements (see the sketch after this list)
  § XOR, AND, OR, ADD, MIN, MAX
  § PCIe atomics
  § MEMIC memory for atomics and counters

Software

§ OSHMEM: OpenSHMEM implementation in Open MPI
  § Supports OpenSHMEM 1.4 and some OpenSHMEM 1.5 features
  § Leverages UCX and HCOLL
  § Supports Intel, Arm and POWER architectures
  § Supports ConnectX-4, ConnectX-5, and ConnectX-6
  § The production quality implementation used by Mellanox, Arm, HPE, Cray and other vendors
§ Mellanox vendor tests
  § Open source and available to the community on GitHub
  § Used for verifying OpenSHMEM implementations
  § Nightly functionality and performance testing

Specification community engagement

§ Committee and working group meetings
§ Engage with users
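As a concrete illustration of the atomic-operation support listed above, here is a minimal sketch using only the standard OpenSHMEM 1.4 API (nothing Mellanox-specific is assumed): every PE performs a fetch-and-add on a symmetric counter at PE 0, the kind of operation the HCA can execute without involving the target CPU.

```c
/* Minimal OpenSHMEM 1.4 example: every PE atomically adds me+1 to
 * a counter on PE 0. Uses only standard API calls. */
#include <stdio.h>
#include <shmem.h>

static long counter = 0;   /* symmetric variable, target of atomics */

int main(void) {
    shmem_init();
    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Atomic fetch-and-add on PE 0; on ConnectX-class HCAs this
     * can be executed by the NIC. */
    shmem_long_atomic_fetch_add(&counter, (long)(me + 1), 0);

    shmem_barrier_all();
    if (me == 0)   /* expect the sum 1 + 2 + ... + npes */
        printf("counter = %ld (expected %ld)\n",
               counter, (long)npes * (npes + 1) / 2);

    shmem_finalize();
    return 0;
}
```

With OSHMEM, a program like this is typically compiled with the oshcc wrapper and launched with oshrun.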

Page 5

OSHMEM Software Stack

[Diagram: OSHMEM software stack, built on Open MPI's MCA (Modular Component Architecture). OSHMEM frameworks and components: SPML (UCX, IKRIT), Atomics (UCX, MXM, BASIC), Collectives (HCOLL, BASIC, MPI), Heap (BUDDY, PTMALLOC). OMPI frameworks: PML, Coll (Basic, other), BTL (Vader, SM, uGNI, UCX, other). Both layers rest on the OPAL and ORTE components.]

Page 6

HCOLL: A High-Performing Collectives Library

[Fig. 2: HCOLL components to realize the hierarchical collective implementation. Above the runtime sit the ML (Messaging Layer), the BCOL components (UMA, IBOFFLOAD, UCX_PTP, SHARP, NCCL), the SBGP components (UMA, SOCKET, IBNET, P2P), and common facilities (COMM Patterns, OFACM, MULTICAST).]

and customizing the implementation to the various communication mechanisms in the system. To understand the design, consider the implementation of shmem reductions using HCOLL. The shmem reductions are implemented by combining reduction, allreduce, and broadcast primitives. The primitives are not necessarily OpenSHMEM compliant; however, the composition of these primitives is OpenSHMEM compliant.
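To make the composition idea concrete, the following hedged sketch builds an OpenSHMEM-compliant sum-to-all out of two primitives, a naive linear reduce-to-root followed by a broadcast, using only standard OpenSHMEM 1.4 calls; this mirrors the structure of the HCOLL design, not its actual algorithms. Neither primitive alone satisfies the shmem_float_sum_to_all semantics, but their composition does.

```c
/* Composition sketch: sum-to-all = reduce-to-root + broadcast. */
#include <stdio.h>
#include <shmem.h>

static float src_val, red, result;            /* symmetric data */
static long psync[SHMEM_BCAST_SYNC_SIZE];

int main(void) {
    shmem_init();
    int me = shmem_my_pe(), npes = shmem_n_pes();
    const int root = 0;

    for (int i = 0; i < SHMEM_BCAST_SYNC_SIZE; i++)
        psync[i] = SHMEM_SYNC_VALUE;
    src_val = (float)(me + 1);
    shmem_barrier_all();                /* inputs and psync ready */

    /* Primitive 1: naive linear reduce-to-root (not itself an
     * OpenSHMEM-compliant collective). */
    if (me == root) {
        red = 0.0f;
        for (int pe = 0; pe < npes; pe++) {
            float v;
            shmem_float_get(&v, &src_val, 1, pe);
            red += v;
        }
    }
    shmem_barrier_all();                /* red is ready on root   */

    /* Primitive 2: broadcast from the root. shmem_broadcast32
     * does not write dest on the root itself, so copy locally. */
    shmem_broadcast32(&result, &red, 1, root, 0, 0, npes, psync);
    if (me == root) result = red;

    shmem_barrier_all();
    printf("PE %d: sum = %.1f (expected %.1f)\n", me,
           (double)result, (double)npes * (npes + 1) / 2.0);
    shmem_finalize();
    return 0;
}
```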

The various components of HCOLL are shown in Figure 2. They include the Messaging Layer (ML), Subgroup (SBGP), and Basic Collective (BCOL) components. The ML layer is responsible for orchestrating the collective primitives. The subgrouping component is responsible for grouping the PEs in the job into multiple subgroups based on the communication mechanism shared among them. The BCOL components provide the basic collective primitives, which are then combined to provide the implementation of the OpenSHMEM collective operations.

Though the HCOLL library accelerates all collective operations typically used in parallel programming models and defined by the MPI and OpenSHMEM specifications, we focus here on the shmem_barrier, shmem_float_sum_to_all, and shmem_broadcast32 operations. For these operations, a variety of collective algorithms are implemented, including hierarchical and non-hierarchical n-ary, recursive doubling, and recursive k-ing algorithms. The collective algorithm and the number of hierarchies can be selected either at compile time or at runtime, based on the system architecture or communication characteristics. For this paper, we use hierarchical recursive doubling configured with two hierarchies (intra-node and inter-node). In this configuration, we designate one PE on each node as a leader process, which is responsible for communication between the nodes, as sketched below.
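The following is a hedged sketch of that two-hierarchy, leader-based scheme, shown as a broadcast from PE 0. The assumptions are illustrative, not HCOLL's implementation: PEs are ranked contiguously by node with ppn PEs per node, and the fan-out is linear for brevity where the paper's configuration uses recursive doubling between leaders.

```c
/* Two-level broadcast: PE 0 -> node leaders -> node-local PEs. */
#include <stdio.h>
#include <shmem.h>

static int payload;          /* symmetric broadcast buffer */
static int flag = 0;         /* symmetric arrival flag     */

static void hier_bcast_from_pe0(int ppn) {
    int me = shmem_my_pe(), npes = shmem_n_pes();
    int leader = (me / ppn) * ppn;      /* lowest PE on my node */

    if (me == 0) {
        /* Inter-node step (higher latency): signal each remote
         * node leader. */
        for (int pe = ppn; pe < npes; pe += ppn) {
            shmem_int_put(&payload, &payload, 1, pe);
            shmem_fence();              /* data before flag */
            shmem_int_p(&flag, 1, pe);
        }
    } else if (me == leader) {
        shmem_int_wait_until(&flag, SHMEM_CMP_EQ, 1);
    }

    if (me == leader) {
        /* Intra-node step (lower latency): fan out on my node. */
        for (int pe = leader + 1; pe < leader + ppn && pe < npes; pe++) {
            shmem_int_put(&payload, &payload, 1, pe);
            shmem_fence();
            shmem_int_p(&flag, 1, pe);
        }
    } else {
        shmem_int_wait_until(&flag, SHMEM_CMP_EQ, 1);
    }
    shmem_barrier_all();                /* safe flag reset */
    flag = 0;
}

int main(void) {
    shmem_init();
    if (shmem_my_pe() == 0) payload = 42;
    shmem_barrier_all();
    hier_bcast_from_pe0(4);             /* assume 4 PEs per node */
    printf("PE %d got %d\n", shmem_my_pe(), payload);
    shmem_finalize();
    return 0;
}
```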

C. OpenSHMEM Collectives using SHARP and HCOLL:

The OpenSHMEM collective operations in OSHMEM can use a no-acceleration path, a software-accelerated path, or a hardware- and software-accelerated (In-network Computing) path.

[Fig. 3: No-acceleration approach: the number of high-latency paths taken for shmem_broadcast32 depends on the topology and the algorithm chosen. Legend: root OpenSHMEM PE, non-root OpenSHMEM PEs; intra-node paths (lower latency), inter-node paths (higher latency).]

[Fig. 4: Software-acceleration approach: the number of high-latency paths taken for shmem_broadcast32 is proportional to log(number of nodes). Legend: root OpenSHMEM PE, leader OpenSHMEM PEs, non-leader OpenSHMEM PEs; intra-node paths (lower latency), inter-node paths (higher latency).]

[Fig. 5: In-network Computing approach: the number of high-latency paths taken for shmem_broadcast32 is proportional to the number of layers of switches in the path of the operation. Legend: switch, root OpenSHMEM PE, non-root OpenSHMEM PEs; intra-node paths (lower latency), inter-node paths (higher latency).]
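Whichever of the three paths services the operation, the application-facing call is identical; path selection happens inside the OSHMEM/HCOLL stack. For reference, here is a minimal runnable example of the shmem_broadcast32 operation the figures analyze, using only the standard OpenSHMEM 1.4 API:

```c
/* PE 0 broadcasts a 32-bit buffer to all PEs. The same call is
 * used regardless of whether the no-acceleration, HCOLL, or
 * SHARP path executes it. */
#include <stdio.h>
#include <shmem.h>

#define NELEMS 8

static int src[NELEMS], dst[NELEMS];              /* symmetric */
static long psync[SHMEM_BCAST_SYNC_SIZE];

int main(void) {
    shmem_init();
    int me = shmem_my_pe(), npes = shmem_n_pes();

    for (int i = 0; i < SHMEM_BCAST_SYNC_SIZE; i++)
        psync[i] = SHMEM_SYNC_VALUE;
    for (int i = 0; i < NELEMS; i++)
        src[i] = (me == 0) ? i * i : -1;
    shmem_barrier_all();          /* src and psync initialized */

    /* Broadcast from PE_root=0 over the active set of all PEs
     * (PE_start=0, logPE_stride=0, PE_size=npes). Note: dst on
     * the root is not updated by the call. */
    shmem_broadcast32(dst, src, NELEMS, 0, 0, 0, npes, psync);

    if (npes > 1 && me == 1)
        printf("PE 1 received dst[3] = %d\n", dst[3]);
    shmem_finalize();
    return 0;
}
```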

Page 7

Experiment Testbed

§ Software
  § OSHMEM: master
  § UCX: master
  § HCOLL from HPC-X v2.4

§ Benchmarks
  § OSU OpenSHMEM benchmarks

§ Hardware
  § Dual-socket Intel® Xeon® 16-core E5-2697A v4 CPUs @ 2.60 GHz
  § Mellanox ConnectX-6 HDR100 100Gb/s InfiniBand adapters
  § Mellanox Switch-IB 2 SB7800 36-port 100Gb/s EDR InfiniBand switches
  § Memory: 256GB DDR4 2400MHz RDIMMs per node

Page 8

OpenSHMEM: Put and Get Latency and Overlap

§ Put and Get latency below 1.5 usec when SHMEM_THREAD_MULTIPLE is enabled
§ Overlap is close to 100%
§ Wait time is close to 0%

[Two plots. Left: overlap benchmark showing compute time, communication time, wait time, and overlap versus message size from 1 byte to 1 MB; latency axis in microsec. Right: Put and Get latency (microsec.) versus message size.]
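As a hedged sketch of how such overlap numbers are commonly obtained (the usual nonblocking-put pattern, not the OSU benchmark's actual source; the message size and compute loop are illustrative):

```c
/* Overlap measurement sketch: start a nonblocking put, compute
 * while it progresses, then complete it with shmem_quiet(). */
#include <stdio.h>
#include <time.h>
#include <shmem.h>

#define MSG_BYTES 65536
static char src[MSG_BYTES], dst[MSG_BYTES];       /* symmetric */

static double now_usec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

static double overlap_iteration(int target_pe, long loops) {
    volatile double x = 1.0;
    double t0 = now_usec();
    shmem_putmem_nbi(dst, src, MSG_BYTES, target_pe);  /* start    */
    for (long i = 0; i < loops; i++)                   /* compute  */
        x = x * 1.0000001 + 0.5;
    shmem_quiet();                                     /* complete */
    return now_usec() - t0;
}

int main(void) {
    shmem_init();
    if (shmem_n_pes() >= 2 && shmem_my_pe() == 0)
        printf("combined time: %.2f usec\n",
               overlap_iteration(1, 1000000));
    shmem_barrier_all();
    shmem_finalize();
    return 0;
}
```

Comparing the combined time against separately measured pure-communication and pure-compute times yields the overlap and wait-time percentages plotted above.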

Page 9

OpenSHMEM on ConnectX-6: Bandwidth

[Plot: OpenSHMEM Put memory bandwidth (MB/s) versus message size from 8 KB to 16 MB, y-axis 0 to 25,000 MB/s. Caption: OpenSHMEM Put Memory Bandwidth (MB/s), AMD Rome / PCIe 4 and ConnectX-6.]

Page 10

OpenSHMEM on ConnectX-6: Bi-directional Message Rate

Page 11

Experimental Setup

§ The Summit system at ORNL
  § 4608 compute nodes
  § Each node: two IBM POWER9 processors and six NVIDIA Volta GPUs
  § Mellanox ConnectX-5 EDR 100 Gb/s HCAs
  § Switch-IB 2 InfiniBand switches, which support SHARP

§ Software
  § Open MPI 4.0.0
  § HCOLL
  § SHARP software

§ Benchmarks
  § OSU Microbenchmarks: invoke the collective operation in a tight loop
  § 2D-Heat application kernel

Page 12

Barrier using the In-Network Computing Approach has better Performance and Scaling Characteristics

§ A barrier implemented using the In-Network Computing approach is 710% faster than the no-acceleration approach
  § The acceleration is a result of using both software and hardware mechanisms

§ The software-accelerated barrier is 200% faster than the no-acceleration approach
  § The acceleration is a result of the hierarchical approach to implementing collective operations

§ The latency changes by only 25% when increasing the job size from 40 to 5120 PEs
  § With no acceleration it increases by 425%

[Fig. 6: The latency (microsec.) of shmem_barrier as the number of PEs increases from 40 to 5120: OSHMEM with no acceleration, OSHMEM with software acceleration, and OSHMEM with In-network Computing.]

[Fig. 7: The latency of a large-message (32 KB) shmem_broadcast32 as the number of PEs increases from 40 to 5120, for the same three configurations.]

B. Performance Characteristics of shmem_barrier, shmem_broadcast32, and shmem_float_sum_to_all while using the In-network Computing approach

Performance characteristics of shmem_barrier: Figure 6 shows the performance of shmem_barrier as we increase the number of PEs from 40 to 5120. From the figure, we can observe that the shmem_barrier implementation using the In-network Computing approach performs best. The latency of shmem_barrier with the In-network Computing approach is 5.32 usec, which is 3.2 times better than shmem_barrier with software-only acceleration and 7.1 times better than shmem_barrier with no acceleration. We can also observe that, as we increase the number of PEs, the In-network Computing approach has the best scaling characteristics; its latency increases by 25%, while the latency with no acceleration increases by 425%.

[Fig. 8: Bcast latency with increasing message size at 5120 PEs: the latency of shmem_broadcast32 as the message size increases with the number of PEs fixed at 5120, for the three configurations.]

[Fig. 9: The latency of a small-message (8-byte) shmem_float_sum_to_all as the number of PEs increases from 40 to 5120, for the three configurations.]

Performance characteristics of shmem_broadcast32: Figure 7 shows the latency of a 32 KB shmem_broadcast32 as we increase the number of PEs from 40 to 5120. For the 32 KB shmem_broadcast32, the latency with the In-network Computing approach is 25.80 usec, which is 3.7 times lower than with the no-acceleration approach and 2.9 times lower than with the software-accelerated approach.

Figure 8 shows the latency for all message sizes from 4 bytes to 1 MB at 5120 PEs. We can observe that the latency over a wide range of message sizes is much lower with the broadcast implementation using the In-network Computing approach. This shows that the shmem_broadcast32 implementation using the In-network Computing approach has the best scaling characteristics.

Performance characteristics of shmem_float_sum_to_all: Figures 9 and 10 show the latency of shmem_float_sum_to_all as we increase the number of PEs from 40 to 5120.

Page 13

Reductions using the In-Network Computing Approach have better Performance and Scaling Characteristics

[Fig. 10: The latency of a large-message (32 KB) shmem_float_sum_to_all as the number of PEs increases from 40 to 5120. The In-network Computing and software-accelerated curves overlap and can appear to be missing.]

[Fig. 11: Reduce latency with increasing message size at 5120 PEs: the latency of shmem_float_sum_to_all as the message size increases with the number of PEs fixed at 5120, for the three configurations.]

The latency of shmem_float_sum_to_all using the In-network Computing approach increases from 3.58 usec to 6.07 usec, while with software-only acceleration the latency increases from 2.64 usec to 18.57 usec, and with the no-acceleration approach it increases from 6.32 usec to 59.99 usec. At 5120 PEs, the shmem_float_sum_to_all implementation using the In-network Computing approach has 10 times lower latency than the no-acceleration implementation and 3 times lower latency than the software-only accelerated implementation.

Figure 11 shows the latency for all message sizes from 4 bytes to 1 MB at 5120 PEs. We can observe that the latency over a wide range of message sizes is much lower with shmem_float_sum_to_all using the In-network Computing approach. As the message size grows, the performance of shmem_float_sum_to_all using the In-network Computing approach and of the software-accelerated shmem_float_sum_to_all converge. This is expected, as the current generation of SHARP and the HCAs do not provide acceleration for large-message reductions, so the acceleration comes only from the software.

C. Performance of the 2D-Heat application kernel while using Software and Hardware Acceleration

[Fig. 12: Completion time of the 2D-Heat application kernel for No Accel, SW Accel, and In-Network Computing.]

The 2D-Heat application kernel, available at [2], models 2D heat conduction using the Gauss-Seidel method. The kernel completes when the standard deviation between adjacent iterates of the matrix is less than a predefined convergence value. The communication is dominated by shmem_put, which is used for sending data between adjacent PEs. shmem_barrier is used for synchronization, and the kernel uses shmem_float_sum_to_all and shmem_broadcast32 for exchanging data and intermediate results.
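A hedged sketch of the kernel's communication pattern as described, not the actual code from [2]; the grid size, one-dimensional row decomposition, and update stencil are illustrative assumptions:

```c
/* One sweep: shmem_put halo exchange with adjacent PEs, a
 * Gauss-Seidel-style update, and a global residual reduction. */
#include <shmem.h>

#define NX 512                  /* local rows per PE (assumption) */
#define NY 512
static float grid[NX + 2][NY];  /* interior rows 1..NX,
                                   halo rows 0 and NX+1           */
static float err, global_err;   /* symmetric                      */
static float pwrk[SHMEM_REDUCE_MIN_WRKDATA_SIZE];
static long  psync[SHMEM_REDUCE_SYNC_SIZE];

static void heat_step(void) {
    int me = shmem_my_pe(), npes = shmem_n_pes();

    if (me > 0)                 /* my top row -> upper neighbor   */
        shmem_float_put(grid[NX + 1], grid[1], NY, me - 1);
    if (me < npes - 1)          /* my bottom row -> lower neighbor*/
        shmem_float_put(grid[0], grid[NX], NY, me + 1);
    shmem_barrier_all();        /* halos are in place             */

    err = 0.0f;
    for (int i = 1; i <= NX; i++)
        for (int j = 1; j < NY - 1; j++) {
            float old = grid[i][j];
            grid[i][j] = 0.25f * (grid[i - 1][j] + grid[i + 1][j] +
                                  grid[i][j - 1] + grid[i][j + 1]);
            err += (grid[i][j] - old) * (grid[i][j] - old);
        }

    /* The reduction that SHARP offloads under In-network
     * Computing; a single pSync is reused here for brevity.      */
    shmem_float_sum_to_all(&global_err, &err, 1, 0, 0, npes,
                           pwrk, psync);
}

int main(void) {
    shmem_init();
    for (int i = 0; i < SHMEM_REDUCE_SYNC_SIZE; i++)
        psync[i] = SHMEM_SYNC_VALUE;
    shmem_barrier_all();        /* psync (and grid) initialized   */
    do { heat_step(); } while (global_err > 1e-4f);
    shmem_finalize();
    return 0;
}
```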

In Figure 12, we can observe that the completion time is reduced by 13% for the 5120 PE problem size when using the collectives implemented with the In-network Computing approach. For these experiments, we used a matrix with 8K rows and 8K columns. As we increase the problem size, i.e., the number of PEs and the matrix size, we expect the performance of the collective operations to become more important for reducing the completion time. Unfortunately, we could not experiment at higher scale because of the limited availability of computing resources.

VI. RESULTS DISCUSSION AND CONCLUSION

The results clearly demonstrate that the In-network Computing approach improves the performance and scalability of collective operations. From Figures 6, 8 and 11, we can observe that the shmem_barrier, shmem_broadcast32, and shmem_float_sum_to_all implementations using the In-network Computing approach outperform the no-acceleration implementations by 710%, 370%, and 1000%, respectively.


Page 14

In-Network Computing Accelerates 2D Heat Kernel

§ The kernel models 2D heat conduction using the Gauss-Seidel method

§ Barriers are used for synchronization and reductions are used for exchanging intermediate results

§ The workload is dominated by shmem_put

§ For the 5120 PE problem configuration, the In-Network Computing implementation improves performance by 13%


Page 15

Thank You