IMPLEMENTATION & COMPARISON OF RDMA OVER ETHERNET
LA-UR 10-05188
Students: Lee Gaiser, Brian Kraus, and James Wernicke
Mentors: Andree Jacobson, Susan Coulter, Jharrod LaFon, and Ben McClelland
Summary
Background, Objective, Testing Environment, Methodology, Results, Conclusion, Further Work, Challenges, Lessons Learned, Acknowledgments, References & Links, Questions
Background: Remote Direct Memory Access (RDMA)
RDMA provides high-throughput, low-latency networking:
Reduce consumption of CPU cycles
Reduce communication latency
Images courtesy of http://www.hpcwire.com/features/17888274.html
Background: InfiniBand
InfiniBand is a switched fabric communication link designed for HPC:
High throughput
Low latency
Quality of service
Failover
Scalability
Reliable transport
How do we interface this high-performance link with existing Ethernet infrastructure?
Background: RDMA over Converged Ethernet (RoCE)
Provide InfiniBand-like performance and efficiency to ubiquitous Ethernet infrastructure.
Utilize the same transport and network layers from the IB stack and swap the link layer for Ethernet.
Implement IB verbs over Ethernet.
Not quite IB strength, but it's getting close.
As of OFED 1.5.1, code written for OFED RDMA auto-magically works with RoCE (see the sketch below).
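Because RoCE reuses the verbs API, application code does not need to distinguish an InfiniBand HCA from a RoCE-capable NIC. As an illustration only (this is not code from the talk), a minimal C sketch of verbs-based device discovery follows, assuming libibverbs from OFED is installed; the same calls enumerate either adapter type. Compile with -libverbs.

```c
/* Minimal sketch: enumerate RDMA-capable devices via the verbs API.
 * The same code finds InfiniBand HCAs and RoCE-capable Ethernet NICs. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **list = ibv_get_device_list(&num);
    if (!list) {
        perror("ibv_get_device_list");
        return 1;
    }
    for (int i = 0; i < num; i++) {
        struct ibv_context *ctx = ibv_open_device(list[i]);
        if (!ctx)
            continue;
        struct ibv_device_attr attr;
        if (ibv_query_device(ctx, &attr) == 0)
            printf("%s: %d port(s), max QPs: %d\n",
                   ibv_get_device_name(list[i]),
                   attr.phys_port_cnt, attr.max_qp);
        ibv_close_device(ctx);
    }
    ibv_free_device_list(list);
    return 0;
}
```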
Objective
We would like to answer the following questions:
What kind of performance can we get out of RoCE on our cluster?
Can we implement RoCE in software (Soft RoCE), and how does it compare with hardware RoCE?
Testing Environment
Hardware:
HP ProLiant DL160 G6 servers
Mellanox MNPH29B-XTC 10GbE adapters
50/125 OFNR cabling
Operating System:
CentOS 5.3 with 2.6.32.16 kernel
Software/Drivers:
OpenFabrics Enterprise Distribution (OFED) 1.5.2-rc2 (RoCE) & 1.5.1-rxe (Soft RoCE)
OSU Micro-Benchmarks (OMB) 3.1.1
OpenMPI 1.4.2
Methodology
Set up a pair of nodes for each technology: IB, RoCE, Soft RoCE, and no RDMA.
Install, configure & run minimal services on the test nodes to maximize machine performance.
Directly connect the nodes to maximize network performance.
Acquire latency benchmarks:
OSU MPI Latency Test
Acquire bandwidth benchmarks:
OSU MPI Uni-Directional Bandwidth Test
OSU MPI Bi-Directional Bandwidth Test
Script it all to perform many repetitions (a sketch of the underlying timing pattern follows below).
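The OSU latency test reports half the average round-trip time of an MPI ping-pong between two ranks. As a rough illustration (this is not the OSU source), a stripped-down version of that pattern might look like the following, launched with mpirun -np 2 across the two directly connected nodes; the 128-byte message size and iteration count are illustrative choices.

```c
/* Illustrative ping-pong loop in the style of the OSU MPI Latency Test.
 * One-way latency is estimated as (round-trip time) / 2, averaged over iterations. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int iters = 1000;   /* illustrative repetition count */
    const int size  = 128;    /* 128-byte messages, as in the analysis slide */
    char *buf = malloc(size);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);          /* start both ranks together */
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg one-way latency: %.2f us\n",
               (t1 - t0) * 1e6 / (2.0 * iters));

    free(buf);
    MPI_Finalize();
    return 0;
}
```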
Results: Latency
[Figure: Latency (µs) vs. Message Size (Bytes), log scale; series: SoftRoCE, IB QDR, RoCE, No RDMA]
Results: Uni-directional Bandwidth
[Figure: Bandwidth (MB/s) vs. Message Size (Bytes), log scale; series: IB QDR, RoCE, SoftRoCE, No RDMA]
Results: Bi-directional Bandwidth
[Figure: Bandwidth (MB/s) vs. Message Size (Bytes), log scale; series: IB QDR, RoCE, No RDMA]
Results: Analysis

Peak Values          IB QDR   RoCE     Soft RoCE   No RDMA
Latency (µs)         1.96     3.7      11.6        21.09
One-way BW (MB/s)    3024.8   1142.7   1204.1      301.31
Two-way BW (MB/s)    5481.9   2284.7   -           1136.1
RoCE performance gains over 10GbE (no RDMA):
Up to 5.7x speedup in latency
Up to 3.7x increase in bandwidth
IB QDR vs. RoCE:
IB is less than 1 µs faster than RoCE at 128-byte message sizes.
IB peak bandwidth is 2-2.5x greater than RoCE.
Conclusion
RoCE is capable of providing near-InfiniBand QDR performance for:
Latency-critical applications at message sizes from 128 B to 8 KB
Bandwidth-intensive applications for messages <1 KB
Soft RoCE is comparable to hardware RoCE at message sizes above 65 KB.
Soft RoCE can improve performance where RoCE-enabled hardware is unavailable.
Further Work & Questions
How does RoCE perform over collectives?
Can we further optimize the RoCE configuration to yield better performance?
Can we stabilize the Soft RoCE configuration?
How much does Soft RoCE affect the compute nodes' ability to perform?
How does RoCE compare with iWARP?
Challenges
Finding an OS that works with OFED & RDMA:
Fedora 13 was too new.
Ubuntu 10 wasn't supported.
CentOS 5.5 was missing some drivers.
Had to compile a new kernel with IB/RoCE support.
Built OpenMPI 1.4.2 from source, but it wasn't configured for RDMA; used the OpenMPI 1.4.1 supplied with OFED instead.
The machines communicating via Soft RoCE frequently lock up during OSU bandwidth tests.
Lessons Learned
Installing and configuring HPC clusters
Building, installing, and fixing the Linux kernel, modules, and drivers
Working with IB, 10GbE, and RDMA technologies
Using tools such as OMB 3.1.1 and netperf for benchmarking performance
Acknowledgments
Andree Jacobson
Susan Coulter
Jharrod LaFon
Ben McClelland
Sam Gutierrez
Bob Pearson
Questions?
References & Links
Subramoni, H. et al. RDMA over Ethernet – A Preliminary Study. OSU. http://nowlab.cse.ohio-state.edu/publications/conf-presentations/2009/subramoni-hpidc09.pdf
Feldman, M. RoCE: An Ethernet-InfiniBand Love Story. HPCWire.com. April 22, 2010. http://www.hpcwire.com/blogs/RoCE-An-Ethernet-InfiniBand-Love-Story-91866499.html
Woodruff, R. Access to InfiniBand from Linux. Intel. October 29, 2009. http://software.intel.com/en-us/articles/access-to-infiniband-from-linux/
OFED 1.5.2-rc2: http://www.openfabrics.org/downloads/OFED/ofed-1.5.2/OFED-1.5.2-rc2.tgz
OFED 1.5.1-rxe: http://www.systemfabricworks.com/pub/OFED-1.5.1-rxe.tgz
OMB 3.1.1: http://mvapich.cse.ohio-state.edu/benchmarks/OMB-3.1.1.tgz