Network File System (NFS) in High Performance Networks
PDLSVD08-02
Wittawat Tantisiriroj, Garth Gibson
Overview
• NFS over RDMA was recently released in February 2008
• What is the value of RDMA to storage users?
• Competing networks:
  • General-purpose network (e.g., Ethernet)
  • High-performance network with RDMA (e.g., InfiniBand)
NFS over IPoIB / NFS over RDMA
• IPoIB: implemented as a standard network driver, so NFS keeps using its ordinary TCP RPC transport
• RDMA: implemented as a new RPC transport type (a client-side mount sketch follows below)
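To make the client-side difference concrete, the following is a minimal sketch, assuming a hypothetical export path, mount point, and server names (server-eth for the Ethernet address, server-ib for the IPoIB address); the rdma and port=20049 mount options follow the Linux NFS/RDMA documentation of that period, and the script only prints the commands rather than running them.

    # Minimal sketch: the same NFS export reached over three transports.
    # All names below are hypothetical placeholders, not the poster's testbed.
    EXPORT = "/export/bench"   # hypothetical export path on the server
    MNT = "/mnt/nfs"           # hypothetical client-side mount point

    mount_cmds = {
        # Standard TCP RPC over Gigabit Ethernet.
        "eth":   f"mount -t nfs -o proto=tcp server-eth:{EXPORT} {MNT}",
        # Same TCP RPC, addressed to the server's IPoIB interface:
        # NFS itself is unchanged, because IPoIB is just another network driver.
        "ipoib": f"mount -t nfs -o proto=tcp server-ib:{EXPORT} {MNT}",
        # RDMA RPC transport; 20049 is the conventional NFS/RDMA port.
        "rdma":  f"mount -t nfs -o rdma,port=20049 server-ib:{EXPORT} {MNT}",
    }

    for name, cmd in mount_cmds.items():
        print(f"{name:5s}: {cmd}")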
Experiment Setup
• For point-to-point throughput, IP over InfiniBand (Connected Mode) is comparable to native InfiniBand (an iPerf-style measurement sketch follows below)
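As an illustration of this kind of iPerf-style measurement, the sketch below times TCP throughput as a function of bytes handed to each send() call. It is self-contained (both ends run on loopback), whereas the plotted measurements were taken between a client and the server over each interconnect; the port, message sizes, and duration are arbitrary choices for the sketch, not the poster's settings.

    # Minimal iPerf-style throughput sketch: one draining receiver, one sender
    # that reports MB/s for each message size. Loopback only; not the poster's tool.
    import socket
    import threading
    import time

    HOST, PORT = "127.0.0.1", 5201          # arbitrary loopback test endpoint
    SIZES = [2 ** k for k in range(6, 21)]  # 64 B .. 1 MB per send() call
    SECONDS_PER_SIZE = 1.0

    def drain() -> None:
        """Accept one connection and read until the sender closes it."""
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((HOST, PORT))
        srv.listen(1)
        conn, _ = srv.accept()
        buf = bytearray(1 << 20)
        while conn.recv_into(buf):
            pass
        conn.close()
        srv.close()

    def main() -> None:
        receiver = threading.Thread(target=drain, daemon=True)
        receiver.start()
        time.sleep(0.2)                     # give the listener time to come up
        sock = socket.create_connection((HOST, PORT))
        for size in SIZES:
            payload = b"x" * size
            sent, start = 0, time.perf_counter()
            while time.perf_counter() - start < SECONDS_PER_SIZE:
                sock.sendall(payload)
                sent += size
            elapsed = time.perf_counter() - start
            print(f"{size:>8} B per send(): {sent / elapsed / 2**20:8.1f} MB/s")
        sock.close()
        receiver.join()

    if __name__ == "__main__":
        main()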
Experimental Results
• When the disk is the bottleneck, NFS benefits from neither IPoIB nor RDMA
• As the number of concurrent read operations increases, aggregate throughput for both IPoIB and RDMA improves significantly, with no disadvantage for IPoIB (see the sketch after this list)
• When the disk is not the bottleneck, NFS benefits significantly from both IPoIB and RDMA
• RDMA is better than IPoIB by ~20%
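The parallel-read measurement behind these bullets can be approximated with a simple multi-threaded reader. The sketch below is a stand-in under stated assumptions, not the poster's benchmark: it expects a 512 MB test file already present on the NFS mount (the path is a placeholder) and warm in the server's cache, reads it in 64 KB operations from several threads, and reports aggregate throughput.

    # Minimal sketch of a parallel sequential read test over an NFS mount.
    # PATH is a hypothetical placeholder; create the file beforehand (e.g. copy
    # 512 MB of data onto the mount) so the server's cache is warm.
    import threading
    import time

    PATH = "/mnt/nfs/testfile-512mb"   # hypothetical file on the NFS mount
    OP_SIZE = 64 * 1024                # 64 KB per read operation
    THREADS = 4                        # e.g. the "1 client x 4 threads" case

    def read_sequentially(totals: list, idx: int) -> None:
        """One reader: walk the whole file in 64 KB operations, counting bytes."""
        n = 0
        with open(PATH, "rb", buffering=0) as f:
            while True:
                chunk = f.read(OP_SIZE)
                if not chunk:
                    break
                n += len(chunk)
        totals[idx] = n

    def main() -> None:
        totals = [0] * THREADS
        readers = [threading.Thread(target=read_sequentially, args=(totals, i))
                   for i in range(THREADS)]
        start = time.perf_counter()
        for r in readers:
            r.start()
        for r in readers:
            r.join()
        elapsed = time.perf_counter() - start
        print(f"aggregate throughput: {sum(totals) / elapsed / 2**20:.1f} MB/s")

    if __name__ == "__main__":
        main()

Python threads are enough here because the interpreter lock is released during blocking read() calls, so the NFS reads genuinely overlap; the measured runs also spread readers across separate client machines as well as multiple threads per client.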
[Figure: NFS/RPC software stacks. NFSv2, NFSv3, NFSv4, and NFSv4.1 sit on Open Network Computing Remote Procedure Call (ONC RPC). The TCP RPC transport runs over TCP/IP on either an Ethernet card driver and Ethernet card, or the IPoIB driver and an InfiniBand card. The RDMA RPC transport runs over RDMA Verbs with the RDMA CM, IB CM, and iWARP CM connection managers on an InfiniBand/iWARP card.]
[Experiment setup: one NFS server and four NFS clients (NFS Client 1-4), connected by 1) a Gigabit Ethernet switch and 2) an InfiniBand SDR switch; server-side storage is 1) ext2 on a single 160 GB SATA disk or 2) tmpfs in 4 GB of RAM.]
[Figure: Point-to-point throughput with iPerf and ib_bw. Y-axis: throughput (MB/s, 0 to 1000); X-axis: bytes per send() call (2 B to 2 MB). Series: InfiniBand Reliable Connection (ib_bw), Socket Direct Protocol / SDP (iPerf), IP over InfiniBand Connected Mode (iPerf), IP over InfiniBand Connected Mode with 2 streams (iPerf), IP over InfiniBand Datagram Mode (iPerf), 1 Gigabit Ethernet (iPerf).]
[Figure: Sequential write to server's disk and read from server's disk; 5 GB file with 64 KB per operation (ext2 on a single 160 GB SATA disk). Y-axis: throughput (MB/s); bars: local disk, eth, ipoib, rdma for write and read. All configurations fall in roughly the 25-62 MB/s range, limited by the disk.]
[Figure: Sequential write to server's disk and read from server's cache; 512 MB file with 64 KB per operation (ext2 on a single 160 GB SATA disk). Y-axis: throughput (MB/s, 0 to 1000); bars: local disk, eth, ipoib, rdma for write and read. Cached reads reach roughly 398 MB/s over IPoIB and 479 MB/s over RDMA, while disk-bound writes stay far lower.]
[Figure: Parallel sequential read from server's cache; 512 MB file with 64 KB per operation (ext2 on a single 160 GB SATA disk). Y-axis: aggregate throughput (MB/s, 0 to 1000); groups: eth, ipoib, rdma; series: 1 client x 1 thread, 1 client x 2 threads, 1 client x 4 threads, 2 clients x 1 thread, 2 clients x 2 threads, 4 clients x 1 thread. Reference lines: iperf(eth) ~110 MB/s, iperf(ipoib) ~650 MB/s, iperf(ipoib, 2 streams) ~980 MB/s, ib_send_bw ~940 MB/s. Ethernet stays near ~110 MB/s in every configuration, while IPoIB and RDMA both climb toward roughly 760 MB/s as concurrency increases.]
Type                  Bandwidth (Gbps)   ~Latency (µs)   ~Price per NIC+Port ($)
Gigabit Ethernet      1                  40              40
10 Gigabit Ethernet   10                 40              1,350
InfiniBand 4X SDR     8                  4               600
InfiniBand 4X DDR     16                 4               720
InfiniBand 4X QDR     32                 4               1,200
Source: High-Performance Systems Integration group, Los Alamos National Laboratory (HPC-5, LANL)