
IMPLEMENTATION & COMPARISON OF RDMA OVER ETHERNET

LA-UR 10-05188

Students: Lee Gaiser, Brian Kraus, and James Wernicke
Mentors: Andree Jacobson, Susan Coulter, Jharrod LaFon, and Ben McClelland


Page 1: Implementation & Comparison of RDMA over Ethernet

IMPLEMENTATION & COMPARISON OF RDMA OVER ETHERNET

LA-UR 10-05188

Students: Lee Gaiser, Brian Kraus, and James Wernicke
Mentors: Andree Jacobson, Susan Coulter, Jharrod LaFon, and Ben McClelland

Page 2: Summary

Summary

- Background
- Objective
- Testing Environment
- Methodology
- Results
- Conclusion
- Further Work
- Challenges
- Lessons Learned
- Acknowledgments
- References & Links
- Questions

Page 3: Background: Remote Direct Memory Access (RDMA)

Background: Remote Direct Memory Access (RDMA)

RDMA provides high-throughput, low-latency networking:
- Reduce consumption of CPU cycles
- Reduce communication latency

Images courtesy of http://www.hpcwire.com/features/17888274.html

Page 4: Background: InfiniBand

Background: InfiniBand

InfiniBand is a switched fabric communication link designed for HPC:
- High throughput
- Low latency
- Quality of service
- Failover
- Scalability
- Reliable transport

How do we interface this high-performance link with existing Ethernet infrastructure?

Page 5: Background: RDMA over Converged Ethernet (RoCE)

Background: RDMA over Converged Ethernet (RoCE)

- Provide InfiniBand-like performance and efficiency on ubiquitous Ethernet infrastructure.
- Utilize the same transport and network layers from the IB stack and swap the link layer for Ethernet.
- Implement IB verbs over Ethernet.
- Not quite IB strength, but it's getting close.
- As of OFED 1.5.1, code written for OFED RDMA auto-magically works with RoCE.
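The layer swap described above can be pictured with a toy sketch (illustrative only; the layer names here are informal labels, not identifiers from any RDMA API):

```python
# Toy model of the stack reuse behind RoCE: everything above the link
# layer (verbs API, IB transport, IB network) is kept unchanged, and
# only the link layer is swapped for Ethernet.
IB_STACK = ["verbs API", "IB transport", "IB network", "IB link"]

def to_roce(stack):
    # Replace only the bottom (link) layer with Ethernet.
    return stack[:-1] + ["Ethernet link"]

ROCE_STACK = to_roce(IB_STACK)
print(ROCE_STACK)
# ['verbs API', 'IB transport', 'IB network', 'Ethernet link']
```

This is why, as noted above, existing OFED RDMA code runs over RoCE without changes: applications program against the verbs layer, which is untouched by the swap.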

Page 6: Objective

Objective

We would like to answer the following questions:We would like to answer the following questions: What kind of performance can we get out of RoCE

on our cluster?on our cluster? Can we implement RoCE in software (Soft RoCE)

and how does it compare with hardware RoCE?and how does it compare with hardware RoCE?

Objjjective

WeWe would like to answer the following questions:w ould like to answer the following questions: What kind of performance can we get out of RoCE

on our cluster?on our cluster? Can we implement RoCE in software (Soft RoCE)

and how does it compare with hardware RoCE?and how does it compare with hardware RoCE?

Page 7: Testing Environment

Testing Environment

Hardware:
- HP ProLiant DL160 G6 servers
- Mellanox MNPH29B-XTC 10GbE adapters
- 50/125 OFNR cabling

Operating System:
- CentOS 5.3 with 2.6.32.16 kernel

Software/Drivers:
- OpenFabrics Enterprise Distribution (OFED) 1.5.2-rc2 (RoCE) & 1.5.1-rxe (Soft RoCE)
- OSU Micro-Benchmarks (OMB) 3.1.1
- OpenMPI 1.4.2

Page 8: Methodology

Methodology

- Set up a pair of nodes for each technology: IB, RoCE, Soft RoCE, and no RDMA.
- Install, configure & run minimal services on test nodes to maximize machine performance.
- Directly connect nodes to maximize network performance.
- Acquire latency benchmarks: OSU MPI Latency Test.
- Acquire bandwidth benchmarks: OSU MPI Uni-Directional Bandwidth Test and OSU MPI Bi-Directional Bandwidth Test.
- Script it all to perform many repetitions.
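Latency benchmarks of this kind report half the average round-trip time of a ping-pong exchange at each message size. A minimal sketch of that measurement pattern, using a local socket pair as a stand-in for the network (the study itself used the OSU MPI tests between two directly connected nodes):

```python
# Sketch of the ping-pong pattern behind latency benchmarks such as
# osu_latency: send a message, wait for the echo, and report half the
# average round-trip time as the one-way latency.
import socket
import threading
import time

def echo_server(sock, size, iters):
    # Echo each fixed-size message straight back to the sender.
    for _ in range(iters):
        buf = sock.recv(size, socket.MSG_WAITALL)
        sock.sendall(buf)

def ping_pong_latency(size=128, iters=1000):
    a, b = socket.socketpair()
    t = threading.Thread(target=echo_server, args=(b, size, iters))
    t.start()
    msg = b"x" * size
    start = time.perf_counter()
    for _ in range(iters):
        a.sendall(msg)
        a.recv(size, socket.MSG_WAITALL)
    elapsed = time.perf_counter() - start
    t.join()
    a.close()
    b.close()
    # Half the average round trip, converted to microseconds.
    return elapsed / iters / 2 * 1e6

print(f"{ping_pong_latency():.2f} us one-way (local socket pair)")
```

The bandwidth tests differ in that the sender streams a window of back-to-back messages before waiting for an acknowledgment, so they measure sustained throughput rather than per-message delay.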

Page 9: Results: Latency

Results: Latency

[Chart: latency (µs) vs. message size (bytes), log scale from 1 to 10000 µs, for Soft RoCE, IB QDR, RoCE, and no RDMA]

Page 10: Results: Uni-Directional Bandwidth

Results: Uni-Directional Bandwidth

[Chart: bandwidth (MB/s) vs. message size (bytes), log scale from 0.1 to 10000 MB/s, for IB QDR, RoCE, Soft RoCE, and no RDMA]

Page 11: Results: Bi-Directional Bandwidth

Results: Bi-Directional Bandwidth

[Chart: bandwidth (MB/s) vs. message size (bytes), log scale from 0.1 to 10000 MB/s, for IB QDR, RoCE, and no RDMA]

Page 12: Results: Analysis

Results: Analysis

Peak Values         | IB QDR | RoCE   | Soft RoCE | No RDMA
Latency (µs)        | 1.96   | 3.7    | 11.6      | 21.09
One-way BW (MB/s)   | 3024.8 | 1142.7 | 1204.1    | 301.31
Two-way BW (MB/s)   | 5481.9 | 2284.7 | -         | 1136.1

RoCE performance gains over 10GbE:
- Up to 5.7x speedup in latency
- Up to 3.7x increase in bandwidth

IB QDR vs. RoCE:
- IB is less than 1 µs faster than RoCE at a 128-byte message.
- IB peak bandwidth is 2-2.5x greater than RoCE's.
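The headline ratios can be checked directly against the peak-value table (the slide's "up to" figures come from the full curves across message sizes, so the peak ratios are close to, but not exactly, the quoted numbers):

```python
# Recomputing the headline ratios from the peak-value table above.
peak = {
    "IB QDR":    {"lat_us": 1.96,  "bw_1way": 3024.8, "bw_2way": 5481.9},
    "RoCE":      {"lat_us": 3.7,   "bw_1way": 1142.7, "bw_2way": 2284.7},
    "Soft RoCE": {"lat_us": 11.6,  "bw_1way": 1204.1},
    "No RDMA":   {"lat_us": 21.09, "bw_1way": 301.31, "bw_2way": 1136.1},
}

# RoCE vs. plain 10GbE ("No RDMA")
lat_speedup = peak["No RDMA"]["lat_us"] / peak["RoCE"]["lat_us"]
bw_gain = peak["RoCE"]["bw_1way"] / peak["No RDMA"]["bw_1way"]

# IB QDR vs. RoCE, peak one-way bandwidth
ib_vs_roce = peak["IB QDR"]["bw_1way"] / peak["RoCE"]["bw_1way"]

print(f"latency speedup:  {lat_speedup:.1f}x")  # prints 5.7x
print(f"bandwidth gain:   {bw_gain:.1f}x")      # prints 3.8x
print(f"IB/RoCE one-way:  {ib_vs_roce:.1f}x")   # prints 2.6x
```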

Page 13: Conclusion

Conclusion

RoCE is capable of providing near-InfiniBand QDR performance for:
- Latency-critical applications at message sizes from 128B to 8KB
- Bandwidth-intensive applications for messages <1KB

Soft RoCE is comparable to hardware RoCE at message sizes above 65KB.

Soft RoCE can improve performance where RoCE-enabled hardware is unavailable.

Page 14: Further Work & Questions

Further Work & Questions

How does RoCE perform over collectives? How does RoCE perform over collectives? Can we further optimize RoCE configuration to yield

better performance?better performance? Can we stabilize the Soft RoCE configuration? How much does Soft RoCE affect the compute nodes How much does Soft RoCE affect the compute nodes

ability to perform? How does RoCE compare with iWARP? How does RoCE compare with iWARP?

Further Work & Questions

How does RoCE perform over collectives? How does RoCE perform over collectives? Can we further optimize RoCE configuration to yield

better performance?better performance? Can we stabilize the Soft RoCE configuration? How much does Soft RoCE affect the compute nodes How much does Soft RoCE affect the compute nodes

ability to perform? How does RoCE compare with iWARP? How does RoCE compare with iWARP?

Page 15: Challenges

Challenges

- Finding an OS that works with OFED & RDMA:
  - Fedora 13 was too new.
  - Ubuntu 10 wasn't supported.
  - CentOS 5.5 was missing some drivers.
- Had to compile a new kernel with IB/RoCE support.
- Built OpenMPI 1.4.2 from source, but it wasn't configured for RDMA; used the OpenMPI 1.4.1 supplied with OFED instead.
- The machines communicating via Soft RoCE frequently lock up during OSU bandwidth tests.

Page 16: Lessons Learned

Lessons Learned

- Installing and configuring HPC clusters
- Building, installing, and fixing the Linux kernel, modules, and drivers
- Working with IB, 10GbE, and RDMA technologies
- Using tools such as OMB-3.1.1 and netperf for benchmarking performance

Page 17: Acknowledgments

Acknowledgments

Andree Jacobson
Susan Coulter
Jharrod LaFon
Ben McClelland
Sam Gutierrez
Bob Pearson

Page 18: Questions

Questions?

Page 19: References & Links

References & Links

Subramoni, H. et al. RDMA over Ethernet – A Preliminary Study. OSU.
http://nowlab.cse.ohio-state.edu/publications/conf-presentations/2009/subramoni-hpidc09.pdf

Feldman, M. RoCE: An Ethernet-InfiniBand Love Story. HPCWire.com. April 22, 2010.
http://www.hpcwire.com/blogs/RoCE-An-Ethernet-InfiniBand-Love-Story-91866499.html

Woodruff, R. Access to InfiniBand from Linux. Intel. October 29, 2009.
http://software.intel.com/en-us/articles/access-to-infiniband-from-linux/

OFED 1.5.2-rc2
http://www.openfabrics.org/downloads/OFED/ofed-1.5.2/OFED-1.5.2-rc2.tgz

OFED 1.5.1-rxe
http://www.systemfabricworks.com/pub/OFED-1.5.1-rxe.tgz

OMB 3.1.1
http://mvapich.cse.ohio-state.edu/benchmarks/OMB-3.1.1.tgz