© 2019 Mellanox Technologies
Manjunath Gorentla Venkata On behalf of OpenSHMEM members @ Mellanox
OpenSHMEM on ConnectX-6 and Mellanox SHARP
OpenSHMEM BoF @ SC19, November 19th, 2019
Higher Data Speeds, Faster Data Processing, Better Data Security
Adapters, Switches, Cables & Transceivers, SmartNIC, System on a Chip
InfiniBand Technology and Solutions
§ 200G HDR InfiniBand end-to-end, extremely low latency, high message rate, RDMA and GPUDirect
§ Advanced end-to-end adaptive routing, congestion control and quality of service
§ In-Network Computing acceleration engines (Mellanox SHARP, MPI offloads)
§ Self-Healing Network with SHIELD for highest network resiliency
§ High-speed gateways to Ethernet, and long reach up to 40km
§ Standard, backward and forward compatibility
[Diagram: An InfiniBand high-speed network with advanced In-Network Computing and extremely low latency connects compute servers and NVMe storage; Mellanox Skyway provides a high-speed gateway to Ethernet-attached NVMe storage, and Quantum LongReach extends InfiniBand over long distances.]
Accelerating OpenSHMEM with Mellanox Hardware and Software
Hardware
§ High throughput and message rate
  § Dual-port 200Gb/s VPI adapter (HDR)
  § Message rate: 200 million messages per second
  § Latency: 0.6 µs
§ In-Network Computing
  § SHARP v2 (Scalable Hierarchical Aggregation and Reduction Protocol) support
  § Enables offloading high-bandwidth collective/reduce operations
§ Atomic operations enhancements
  § XOR, AND, OR, ADD, MIN, MAX
  § PCIe atomics
  § MEMIC memory for atomics and counters

Software
§ OSHMEM: OpenSHMEM implementation in Open MPI
  § Supports OpenSHMEM 1.4 and some OpenSHMEM 1.5 features
  § Leverages UCX and HCOLL
  § Supports Intel, Arm and POWER architectures
  § Supports ConnectX-4, ConnectX-5, and ConnectX-6
  § The production-quality implementation used by Mellanox, Arm, HPE, Cray and other vendors
§ Mellanox vendor tests
  § Open source and available to the community on GitHub
  § Used for verifying OpenSHMEM implementations
  § Nightly functionality and performance testing
§ Specification community engagement
  § Committee and working group meetings
  § Engagement with users
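As a flavor of the hardware-offloaded atomics listed above, here is a minimal OpenSHMEM 1.4 program sketch: every PE atomically increments a symmetric counter on PE 0. It assumes Open MPI's OSHMEM (compile with `oshcc`, launch with `oshrun`); the counter variable and the printed check are illustrative.

```c
#include <shmem.h>
#include <stdio.h>

static long counter = 0;  /* symmetric variable, lives on every PE */

int main(void) {
    shmem_init();
    int me = shmem_my_pe();

    /* Every PE atomically increments the counter on PE 0; on
     * InfiniBand HCAs such atomics can execute in hardware. */
    shmem_long_atomic_inc(&counter, 0);

    shmem_barrier_all();
    if (me == 0)
        printf("counter on PE 0: %ld (number of PEs: %d)\n",
               counter, shmem_n_pes());

    shmem_finalize();
    return 0;
}
```

A typical invocation would be `oshcc hello.c -o hello && oshrun -np 4 ./hello`.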
OSHMEM Software Stack
[Diagram: OSHMEM software stack. OSHMEM sits alongside OMPI on top of Open MPI's OPAL and ORTE layers, built from MCA (Modular Component Architecture) components. OSHMEM frameworks and their components: SPML (UCX, IKRIT), Atomics (UCX, MXM, BASIC), Collectives (HCOLL, UCX, MPI, BASIC), and Heap (BUDDY, PTMALLOC). OMPI-side frameworks include the PML with BTLs (SM, Vader, uGNI, UCX, and others) and the Coll framework (Basic and others).]
HCOLL : A high-performing Collectives Library
Fig. 2: HCOLL components realizing the hierarchical collective implementation: the Messaging Layer (ML); subgrouping (SBGP) components UMA, SOCKET, P2P, and IBNET; basic collective (BCOL) components UMA, IBOFFLOAD, and UCX_PTP; and common infrastructure (runtime, communication patterns, OFACM, NCCL, MULTICAST, SHARP).
and customizing the implementation to various communication mechanisms in the system. To understand the design, consider the implementation of shmem reductions using HCOLL. The shmem reductions are implemented by combining reduction, allreduce, and broadcast primitives. The primitives are not necessarily OpenSHMEM compliant; however, the composition of these primitives is OpenSHMEM compliant.
The various components of HCOLL are shown in Figure 2. It includes the Messaging Layer (ML), Subgroup, and Basic Collective (BCOL) components. The ML layer is responsible for orchestrating the collective primitives. The subgrouping component is responsible for grouping the PEs in the job into multiple subgroups based on the communication mechanism shared among them. The BCOL components provide the basic collective primitives, which are then combined to provide the implementation for OpenSHMEM collective operations.
Though the HCOLL library accelerates all collective operations typically used in the parallel programming models and defined by the MPI and OpenSHMEM specifications, we focus here on the shmem_barrier, shmem_float_sum_to_all, and shmem_broadcast32 operations. For these operations, a variety of collective algorithms are implemented, including hierarchical and non-hierarchical n-ary, recursive doubling, and recursive k-ing algorithms. The collective algorithm and the number of hierarchies can be selected either at compile time or at runtime based on the system architecture or communication characteristics. For this paper, we use hierarchical recursive doubling configured with two hierarchies (intra-node and inter-node). In this configuration, we designate a PE on each node as a leader process, which is responsible for communication between the nodes.
C. OpenSHMEM Collectives using SHARP and HCOLL:
The OpenSHMEM collective operations in OSHMEM can use either no-acceleration, software-accelerated, or hardware- and software-accelerated (In-network Computing) paths.
Fig. 3: No-acceleration approach: the number of high-latency paths taken for shmem_broadcast32 depends on the topology and the algorithm chosen. [Diagram legend: root and non-root OpenSHMEM PEs; intra-node paths (lower latency) and inter-node paths (higher latency).]
Fig. 4: Software-acceleration approach: the number of high-latency paths taken for shmem_broadcast32 is proportional to log(number of nodes). [Diagram legend: root, leader, and non-leader OpenSHMEM PEs; intra-node paths (lower latency) and inter-node paths (higher latency).]
Fig. 5: In-network Computing: the number of high-latency paths taken for shmem_broadcast32 is proportional to the number of layers of switches in the path of the operation. [Diagram legend: switch, root and non-root OpenSHMEM PEs; intra-node paths (lower latency) and inter-node paths (higher latency).]
Experiment Testbed
§ Software
  § OSHMEM: master
  § UCX: master
  § HCOLL from HPC-X v2.4
§ Benchmarks
  § OSU OpenSHMEM benchmarks
§ Hardware
  § Dual-socket Intel Xeon 16-core E5-2697A v4 CPUs @ 2.60 GHz
  § Mellanox ConnectX-6 HDR100 100Gb/s InfiniBand adapters
  § Mellanox Switch-IB 2 SB7800 36-port 100Gb/s EDR InfiniBand switches
  § Memory: 256GB DDR4 2400MHz RDIMMs per node
OpenSHMEM : Put and Get Latency and Overlap
§ Put and Get latency below 1.5 µs when SHMEM_THREAD_MULTIPLE is enabled
§ Overlap is close to 100%
§ Wait time is close to 0%
[Plots: Put and Get latency (µs, roughly 1 to 3.5) vs. message size (bytes, log scale); compute time, communication time, wait time, and overlap vs. message size from 1 byte to 1 MB.]
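The near-100% overlap reported above comes from issuing communication non-blockingly and computing before completing it. A hedged sketch of this pattern using standard OpenSHMEM 1.4 calls (shmem_putmem_nbi, shmem_quiet); the buffer size and the compute stub are illustrative:

```c
#include <shmem.h>
#include <stddef.h>

#define N (1 << 20)

static long src[N], dst[N];  /* symmetric (global) arrays */

/* Stand-in for application work done while the put is in flight. */
static void compute(void) { /* ... */ }

int main(void) {
    shmem_init();
    int peer = (shmem_my_pe() + 1) % shmem_n_pes();

    /* Issue the put without waiting for completion... */
    shmem_putmem_nbi(dst, src, sizeof src, peer);

    /* ...overlap it with local computation... */
    compute();

    /* ...then complete all outstanding non-blocking puts. */
    shmem_quiet();

    shmem_finalize();
    return 0;
}
```

The measured wait time is the time spent in shmem_quiet after the computation; if the transfer finished during compute(), it is close to zero.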
OpenSHMEM on ConnectX-6: Bandwidth
[Plot: OpenSHMEM Put memory bandwidth (MB/s, axis up to 25,000) vs. message size (8 KB to 16 MB) on AMD Rome / PCIe Gen4 with ConnectX-6.]
OpenSHMEM on ConnectX-6: Bi-directional Message Rate
Experimental Setup
§ The Summit system at ORNL
  § 4,608 compute nodes
  § Each node: two IBM POWER9 processors and six NVIDIA Volta GPUs
  § Mellanox ConnectX-5 EDR 100 Gb/s HCAs
  § Switch-IB 2 InfiniBand switches, which support SHARP
§ Software
  § Open MPI 4.0.0
  § HCOLL
  § SHARP software
§ Benchmarks
  § OSU Microbenchmarks: invoke the collective operation in a tight loop
  § 2D-Heat application kernel
Barrier using the In-Network Computing Approach has better Performance and Scaling Characteristics
§ Barrier implemented using the In-Network Computing approach is 7.1x (710%) faster than the no-acceleration approach
  § Acceleration is a result of using software and hardware mechanisms
§ The software-accelerated barrier is 2x (200%) faster than the no-acceleration approach
  § Acceleration is a result of the hierarchical approach to implementing collective operations
§ The performance changes only 25% when increasing the problem size from 40 to 5120 PEs
  § With no acceleration it increases by 425%
Fig. 6: The latency of shmem_barrier as we increase the number of PEs from 40 to 5120. [Plot: latency (µs, 0 to 50) vs. number of OpenSHMEM PEs for no acceleration, software acceleration, and In-Network Computing.]
Fig. 7: The latency of large-message (32 KB) shmem_broadcast32 as we increase the number of PEs from 40 to 5120. [Plot: latency (µs, 10 to 90) vs. number of OpenSHMEM PEs for the three approaches.]
B. Performance Characteristics of shmem_barrier, shmem_broadcast32, and shmem_float_sum_to_all while using the In-network Computing approach
Performance characteristics of shmem_barrier

Figure 6 shows the performance of shmem_barrier as we increase the number of PEs from 40 to 5120. From the figure, we can observe that the shmem_barrier implementation using the In-network Computing approach performs the best. The latency of shmem_barrier with the In-network Computing approach is 5.32 µs, which is 3.2 times better than shmem_barrier with software-only acceleration and 7.1 times better than shmem_barrier with no acceleration. Also, we can observe that as we increase the number of PEs, the In-network Computing approach has the best scaling characteristics; its latency increases by 25%, while the latency with no acceleration increases by 425%.
Fig. 8: The latency of shmem_broadcast32 as we increase the message size and fix the number of PEs at 5120. [Plot: broadcast latency (µs, 0 to 600) vs. message size (bytes, log scale) for the three approaches.]
Fig. 9: The latency of small-message (8 byte) shmem_float_sum_to_all as we increase the number of PEs from 40 to 5120. [Plot: latency (µs, 0 to 35) vs. number of OpenSHMEM PEs for the three approaches.]
Performance characteristics of shmem_broadcast32

Figure 7 shows the latency of 32 KB shmem_broadcast32 as we increase the number of PEs from 40 to 5120. For 32 KB shmem_broadcast32 with the In-network Computing approach, the latency is 25.80 µs, which is 3.7 times lower than the no-acceleration approach and 2.9 times lower than the software-accelerated approach.
Figure 8 shows the latency for all message sizes from 4 bytes to 1 MB at 5120 PEs. We can observe that the latency for a wide range of messages is much lower with the broadcast implementation using the In-network Computing approach. This shows that the shmem_broadcast32 implementation using the In-network Computing approach has the best scaling characteristics.
Performance characteristics of shmem_float_sum_to_all

Figures 9 and 10 show the latency of shmem_float_sum_to_all as we increase the number of
Reductions using the In-Network Computing Approach Have Better Performance and Scaling Characteristics
Fig. 10: The latency of large-message (32 KB) shmem_float_sum_to_all as we increase the number of PEs from 40 to 5120. The In-network Computing and software-accelerated plots overlap and may appear to be missing. [Plot: latency (µs, 0 to 20,000) vs. number of OpenSHMEM PEs for the three approaches.]
Fig. 11: The latency of shmem_float_sum_to_all as we increase the message size and fix the number of PEs at 5120. [Plot: reduce latency (µs, 0 to 120) vs. message size (bytes, log scale) for the three approaches.]
PEs from 40 to 5120. The latency increases from 3.58 µs to 6.07 µs for shmem_float_sum_to_all using the In-network Computing approach, while with software-only acceleration the latency increases from 2.64 µs to 18.57 µs, and with the no-acceleration approach it increases from 6.32 µs to 59.99 µs. At 5120 PEs, the shmem_float_sum_to_all implementation using the In-network Computing approach has 10 times less latency than the no-acceleration implementation and 3 times less latency than the software-only acceleration implementation.
Figure 11 shows the latency for all message sizes from 4 bytes to 1 MB at 5120 PEs. We can observe that the latency for a wide range of messages is much lower with shmem_float_sum_to_all using the In-network Computing approach. As the message size gets larger, the performance of shmem_float_sum_to_all using the In-network Computing approach and the software-accelerated shmem_float_sum_to_all converge. This is expected, as the current generation of SHARP and the HCAs do not provide acceleration for large-message reduction, and the acceleration comes only from the software.
C. Performance of the 2D-Heat application kernel while using Software and Hardware Acceleration
Fig. 12: Completion time of the 2D-Heat application kernel. [Bar chart: completion time (plotted scale roughly 1.6 to 2.1) for No Accel, SW Accel, and In-Network Computing.]
The 2D-Heat application kernel, available at [2], models 2D heat conduction using the Gauss-Seidel method. The kernel completes when the standard deviation between adjacent matrices is less than a predefined convergence value. The communication is dominated by shmem_put, which is used for sending data between adjacent PEs. shmem_barrier is used for synchronization, and shmem_float_sum_to_all and shmem_broadcast32 are used for exchanging data and intermediate results.
In Figure 12, we can observe that the completion time is reduced by 13% for a 5120-PE problem size when using collectives implemented with the In-network Computing approach. For these experiments, we used a matrix with 8K rows and 8K columns. As we increase the problem size, the number of PEs, and the matrix size, we expect the performance of the collective operations to become more important for reducing the completion time. Unfortunately, we could not experiment at higher scale because of the limited availability of computing resources.
VI. RESULTS DISCUSSION AND CONCLUSION
The results clearly demonstrate that the In-network Computing approach improves the performance and scalability of collective operations. From Figures 6, 8 and 11, we can observe that the shmem_barrier, shmem_broadcast32, and shmem_float_sum_to_all implementations using the In-network Computing approach outperform the no-acceleration implementations by 710%, 370% and 1000%, respectively.
In-Network Computing Accelerates 2D Heat Kernel
§ Kernel models 2D heat conduction using the Gauss-Seidel method
§ Barriers are used for synchronization and reductions are used for exchanging intermediate results
§ The workload is dominated by shmem_put
§ For 5120 PE problem configuration, the In-Network computing approach implementation improves performance by 13%
Thank You