Presenters: Seth Howell, Ziye Yang
Company: Intel
Notices and Disclaimers
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration.
No computer system can be absolutely secure.
Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. For more complete information about performance and benchmark results, visit http://www.intel.com/benchmarks .
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/benchmarks .
Benchmark results were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown." Implementation of these updates may make these results inapplicable to your device or system.
Intel® Advanced Vector Extensions (Intel® AVX)* provides higher throughput to certain processor operations. Due to varying processor power characteristics, utilizing AVX instructions may cause a) some parts to operate at less than the rated frequency and b) some parts with Intel® Turbo Boost Technology 2.0 to not achieve any or maximum turbo frequencies. Performance varies depending on hardware, software, and system configuration and you can learn more at http://www.intel.com/go/turbo.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.
Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.
© 2018 Intel Corporation. Intel, the Intel logo, and Intel Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as property of others.
Agenda
• SPDK NVMe-oF development history & status
• SPDK RDMA transport enhancement
• SPDK TCP transport introduction
• Conclusion
SPDK NVMe-oF Target Timeline
• July 2016: Released with RDMA transport support
• 17.03 – 17.07: Functional hardening
• 17.11 – 18.11: RDMA transport improvements
• 19.01: TCP transport released
• 19.04: Continuing improvement
SPDK NVMe-oF Host Timeline
• Dec 2016: Released with RDMA transport support
• 17.03 – 17.07: Functional hardening (e.g., interoperability testing with the kernel target)
• 17.11 – 18.11: RDMA transport improvements
• 19.01: TCP transport released
• 19.04: Continuing improvement
SPDK NVMe-oF Target Design Highlights (NVMe over Fabrics target feature → performance benefit)
• Utilizes the user-space NVM Express (NVMe) polled-mode driver → reduced overhead per NVMe I/O
• Group polling on each SPDK thread (bound to a CPU core) for multiple transports → no interrupt overhead
• Connections pinned to a dedicated SPDK thread → no synchronization overhead
• Asynchronous NVMe command handling through the whole life cycle → no locks in the NVMe command data handling path
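The group-polling row above is the heart of the run-to-completion model: each SPDK thread owns a poller that repeatedly drives every connection assigned to that thread, so there are no interrupts and no cross-thread locks. Below is a minimal, hedged sketch of that pattern using spdk_poller_register(); my_poll_group and poll_all_transports() are hypothetical stand-ins for the real nvmf poll-group structures, and the poller-callback return convention varies slightly across SPDK releases.

```c
#include "spdk/thread.h"

/* Hypothetical per-thread state; the real poll group also tracks the
 * transport poll groups and the qpairs pinned to this thread. */
struct my_poll_group {
	struct spdk_poller *poller;
};

static int poll_all_transports(struct my_poll_group *group)
{
	(void)group;
	return 0;  /* return > 0 when any I/O was processed */
}

static int nvmf_group_poll(void *ctx)
{
	struct my_poll_group *group = ctx;

	/* No interrupts, no locks: only this thread ever touches these qpairs. */
	return poll_all_transports(group);
}

static void register_group_poller(struct my_poll_group *group)
{
	/* period 0 => run on every reactor iteration (busy polling) */
	group->poller = spdk_poller_register(nvmf_group_poll, group, 0);
}
```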
Shared Receive Queue Support
• Disaggregates requests and buffers from queue pairs (a verbs-level sketch follows this list).
• Fixes the majority of allocations at startup.
• Scaling connections becomes much cheaper.
• Can increase the cost of core scaling.
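As a rough illustration of how an SRQ lets receive work requests and buffers be shared across queue pairs, here is a hedged libibverbs sketch; the queue depths are illustrative and the surrounding SPDK plumbing is omitted.

```c
#include <infiniband/verbs.h>

/* Hedged sketch: one shared receive queue (SRQ) per RDMA device, so receive
 * WRs and their buffers are allocated once at startup instead of per qpair.
 * The sizes below are illustrative, not SPDK defaults. */
struct ibv_srq *create_shared_recv_queue(struct ibv_pd *pd)
{
	struct ibv_srq_init_attr attr = {
		.attr = {
			.max_wr  = 4096,  /* receives shared by every qpair on the device */
			.max_sge = 1,
		},
	};

	return ibv_create_srq(pd, &attr);
}

/* Receives are replenished on the SRQ rather than on individual qpairs. */
int repost_shared_recv(struct ibv_srq *srq, struct ibv_recv_wr *wr)
{
	struct ibv_recv_wr *bad_wr = NULL;

	return ibv_post_srq_recv(srq, wr, &bad_wr);
}

/* Queue pairs created afterwards simply reference the SRQ
 * (qp_init_attr.srq = srq;), so a new connection only needs its own
 * qpair bookkeeping. */
```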
NVMe-oF RDMA Transport Target Architecture (diagram)
• RDMA transport: shared resources are the buffers for IOVs.
• Network Device 1 (SRQ support):
  • Poller 1 (Poll Group 1): shared resources are recvs, reqs, and in-capsule data buffers; QP1 and QP2 each hold only a qpair struct.
  • Poller 2 (Poll Group 2): shared resources are recvs, reqs, and in-capsule data buffers; QP3 holds only a qpair struct.
• Network Device 2 (no SRQ support):
  • Poller 3 (Poll Group 1): no shared resources; QP1 holds the qpair struct plus its own recvs, reqs, and in-capsule data buffers.
Scaling Connections and Cores (charts)
• Connection scaling, traditional vs. SRQ: memory overhead (MiB, 0–50) vs. number of qpairs (8–64).
• Core scaling, traditional vs. SRQ: memory overhead (MiB, 0–200) vs. number of cores (1–8).
Further Enhancements
Completed
– Send With Invalidate support (see the verbs-level sketch after this list)
– Multi-SGE in both target and initiator
Ongoing and Future
– Support for an increasing number of NICs from multiple vendors
– SEND and RECV operation batching
– Automated performance testing
– Extended verbs interface
– NVMe-oF offload support
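For readers unfamiliar with the completed "Send With Invalidate" item, the hedged libibverbs sketch below shows the basic idea: the SEND carrying the NVMe completion also asks the NIC to invalidate the host's remote key, saving the host a separate invalidation step. Function and variable names are illustrative, not SPDK code.

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Hedged sketch: post the response SEND with IBV_WR_SEND_WITH_INV so the
 * rkey the host registered for this I/O is invalidated as part of the same
 * work request. qp, rsp_sge, and remote_rkey come from the caller. */
static int send_response_with_invalidate(struct ibv_qp *qp,
					 struct ibv_sge *rsp_sge,
					 uint32_t remote_rkey)
{
	struct ibv_send_wr wr;
	struct ibv_send_wr *bad_wr = NULL;

	memset(&wr, 0, sizeof(wr));
	wr.opcode          = IBV_WR_SEND_WITH_INV;  /* SEND + remote invalidate */
	wr.invalidate_rkey = remote_rkey;           /* rkey to invalidate on the host */
	wr.sg_list         = rsp_sge;
	wr.num_sge         = 1;
	wr.send_flags      = IBV_SEND_SIGNALED;

	return ibv_post_send(qp, &wr, &bad_wr);
}
```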
General design and implementation
• Follow the SPDK transport abstraction:
  • Host-side code: lib/nvme/nvme_tcp.c
  • Target-side code: lib/nvmf/tcp.c
• Transport abstraction (diagram): FC, RDMA, and TCP transports plug into the common abstraction; the TCP transport sits on a socket layer (POSIX or VPP).
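To make the abstraction concrete, here is a hedged, self-contained illustration of the pattern: the generic NVMe-oF layer calls through a table of function pointers, and each transport supplies its own table. The struct and function names below are illustrative, not the actual identifiers from lib/nvmf/transport.h.

```c
#include <stdio.h>

/* Hedged illustration of the transport abstraction pattern. The real SPDK
 * ops table has many more callbacks (create/destroy, listen/accept,
 * poll-group management, request completion, ...). */
struct transport_ops {
	const char *name;
	int (*listen)(const char *addr, int port);
	int (*poll_group_poll)(void *group);  /* driven by the per-thread poller */
};

static int tcp_listen(const char *addr, int port)
{
	printf("TCP transport listening on %s:%d\n", addr, port);
	return 0;
}

static int tcp_poll_group_poll(void *group)
{
	(void)group;
	return 0;  /* number of completions processed on this pass */
}

/* lib/nvmf/tcp.c and lib/nvme/nvme_tcp.c each fill in the equivalent table
 * for the target and host sides respectively. */
static const struct transport_ops tcp_transport = {
	.name            = "TCP",
	.listen          = tcp_listen,
	.poll_group_poll = tcp_poll_group_poll,
};
```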
Performance design considerations for the TCP transport on the target side (ingredient: methodology)
• Design framework: follow the general SPDK NVMe-oF framework (e.g., polling groups).
• TCP connection optimization: use the SPDK-encapsulated socket API (preparing for the integration of other stacks, e.g., VPP).
• NVMe/TCP PDU handling: use a state machine to track it.
• NVMe/TCP request life cycle: use a state machine to track it (purpose: easy to debug and good for further performance improvement).
TCP PDU receive handling for each connection (state machine)
Ready → Handle CH (common header) → Handle PSH (PDU-specific header) → Handle payload, then back to Ready; PDUs with no payload skip the payload state and return to Ready directly. Any failure takes the error path into the error state.
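A hedged C sketch of that receive state machine is below; the enum values and helper names (tcp_conn, pdu_has_payload) are illustrative rather than the exact identifiers used in lib/nvmf/tcp.c.

```c
#include <stdbool.h>

/* Illustrative states for receiving one NVMe/TCP PDU on a connection. */
enum pdu_recv_state {
	PDU_RECV_READY,           /* waiting for the next PDU             */
	PDU_RECV_HANDLE_CH,       /* parse the common header              */
	PDU_RECV_HANDLE_PSH,      /* parse the PDU-specific header        */
	PDU_RECV_HANDLE_PAYLOAD,  /* read the payload, if the PDU has one */
	PDU_RECV_ERROR,           /* error path: tear the connection down */
};

struct tcp_conn {
	enum pdu_recv_state recv_state;
	/* ... partially received PDU, socket handle, owning poll group ... */
};

static bool pdu_has_payload(const struct tcp_conn *conn)
{
	(void)conn;
	return false;  /* decided from the parsed headers in the real code */
}

/* Advance the state machine one step; called whenever more bytes arrive. */
static void pdu_recv_advance(struct tcp_conn *conn)
{
	switch (conn->recv_state) {
	case PDU_RECV_READY:
		conn->recv_state = PDU_RECV_HANDLE_CH;
		break;
	case PDU_RECV_HANDLE_CH:
		conn->recv_state = PDU_RECV_HANDLE_PSH;
		break;
	case PDU_RECV_HANDLE_PSH:
		conn->recv_state = pdu_has_payload(conn) ?
				   PDU_RECV_HANDLE_PAYLOAD : PDU_RECV_READY;
		break;
	case PDU_RECV_HANDLE_PAYLOAD:
		conn->recv_state = PDU_RECV_READY;  /* PDU fully received */
		break;
	case PDU_RECV_ERROR:
		break;  /* stay here until the qpair is destroyed */
	}
}
```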
SPDK NVMe-oF TCP request life cycle for each connection on the target side (state machine)
Free → New → Need buffer? → Ready to execute in NVMe → Executing in NVMe driver → Executed in NVMe driver → Ready to complete → Transfer response PDU to host → Completed → resources recycled back to Free.
• A write command that needs an R2T first goes through Pending R2T → Send R2T → Transfer data from host (get the data for the write command) before it is ready to execute; read commands and writes without an R2T proceed directly.
• An invalid command or a data-transfer error takes the error path.
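A hedged sketch of the same life cycle as a C enum is below; the names roughly mirror the states in the diagram but are not the exact identifiers from lib/nvmf/tcp.c.

```c
/* Illustrative request states for the target-side NVMe/TCP transport,
 * mirroring the life-cycle diagram above. */
enum tcp_req_state {
	TCP_REQ_FREE,                     /* sitting in the free pool            */
	TCP_REQ_NEW,                      /* command capsule PDU received        */
	TCP_REQ_NEED_BUFFER,              /* waiting for a data buffer           */
	TCP_REQ_PENDING_R2T,              /* write cmd waiting to send an R2T    */
	TCP_REQ_TRANSFER_DATA_FROM_HOST,  /* receiving write data from the host  */
	TCP_REQ_READY_TO_EXECUTE,         /* data in place, hand to NVMe layer   */
	TCP_REQ_EXECUTING,                /* executing in the NVMe driver        */
	TCP_REQ_EXECUTED,                 /* NVMe completion received            */
	TCP_REQ_READY_TO_COMPLETE,        /* build the response capsule          */
	TCP_REQ_TRANSFERRING_RESP_PDU,    /* response PDU being sent to the host */
	TCP_REQ_COMPLETED,                /* resources recycled back to FREE     */
};
```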
SPDK Target side (TCP transport): I/O scaling
System configuration: (1) Target: server platform: SuperMicro SYS-2029U-TN24R4T; 2x Intel® Xeon® Platinum 8180 CPU @ 2.50 GHz, Intel® SpeedStep enabled, Intel® Turbo Boost Technology enabled, 4x 2GB DDR4 2666 MT/s, 1 DIMM per channel; 2x 100GbE Mellanox ConnectX-5 NICs; Fedora 28, Linux kernel 5.0.5, SPDK 19.01.1; 6x Intel® SSD DC P4600 2.0TB. (2) Initiator: server platform: SuperMicro SYS-2028U-TN24R4T+; 44x Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz (HT off); 1x 100GbE Mellanox ConnectX-4 NIC; Fedora 28, Linux kernel 5.0.5, SPDK 19.01.1. (3) Fio version: fio-3.3; Fio workload: blocksize=4k, iodepth=1, iodepth_batch=128, iodepth_low=256, ioengine=libaio or SPDK bdev engine, size=10G, ramp_time=0, run_time=300, group_reporting, thread, direct=1, rw=read/write/rw/randread/randwrite/randrw
(Charts) 4K 100% random, 70% read / 30% write: throughput (k IOPS, bars) and average latency (usec, lines) vs. number of CPU cores (1–4).
• QD=1: throughput 120.5 / 151.2 / 153.5 / 158.2 k IOPS; average latency 132.4 / 105.4 / 103.8 / 100.8 usec.
• QD=32: throughput 229.4 / 475.2 / 672.0 / 812.6 k IOPS; average latency 2224.0 / 1067.5 / 754.1 / 624.5 usec.
SPDK host side (TCP transport): I/O scaling
System configuration: identical to the SPDK target-side I/O scaling slide above.
(Chart) 4K random, 70% read / 30% write, QD=256: throughput (k IOPS, bars) and average latency (usec, line) vs. number of CPU cores (1–4); throughput 192.1 / 401.4 / 603.8 / 845.4 k IOPS; average latency 1324.5 / 1233.9 / 1217.4 / 1140.8 usec.
Latency comparison between SPDK and the kernel (TCP transport)
System configuration: identical to the SPDK target-side I/O scaling slide above.
Average latency comparisons (usec):

Workload                     SPDK Target   Kernel Target
4k 70% reads / 30% writes    52.86         66.79
4k 100% random writes        41.09         51.43
4k 100% random reads         50.42         59.82

Workload                     SPDK Initiator   Kernel Initiator
4k 70% reads / 30% writes    40.99            52.86
4k 100% random writes        31.14            41.09
4k 100% random reads         35.58            50.42
IOPS per core comparison between SPDK and the kernel on the target side
System configuration: identical to the SPDK target-side I/O scaling slide above.
(Chart) 4k random, 70% reads / 30% writes, QD=32: relative per-core performance vs. number of connections. Kernel = 100% baseline at 1 / 4 / 16 / 32 connections; SPDK = 237.75% / 255.12% / 241.00% / 240.96%.
Further development plan
• Continue enhancing functionality
  • Including compatibility testing against the Linux kernel solution.
• Performance tuning
  • Deeper integration with a user-space stack: VPP + DPDK.
  • Use NIC hardware features for performance improvement (e.g., VMA on Mellanox NICs, load-balancing features on Intel 100 Gbit NICs).
  • Investigate offloading to hardware, e.g., FPGAs and SmartNICs.
Conclusion
• The SPDK NVMe-oF solution is well adopted by the industry. This presentation introduced:
  • RDMA transport enhancements
  • The newly developed TCP transport
• Further development
  • Continue following the NVMe-oF spec and adding more features.
  • Continue performance enhancements and integration with other solutions.
• Call for community activity
  • Bug reports, idea discussion, and patch submissions for NVMe-oF are all welcome.
Backup
SPDK NVMe-oF Timeline
• NVMe over Fabrics Target
  • July 2016: Released with RDMA transport support
  • July 2016 – Oct 2018:
    • Hardening: e.g., Intel test infrastructure construction, discovery simplification, correctness & kernel interop
    • Performance improvements: e.g., read latency improvement, scalability validation, event framework enhancements, multiple-connection performance improvement with group polling
  • Jan 2019: Released with TCP transport support
    • Compatibility: supports both the kernel and SPDK hosts on the TCP transport
    • Optimization: supports integrating different TCP stacks (e.g., kernel, VPP)
• NVMe over Fabrics Host (Initiator)
  • Dec 2016: Released with RDMA transport support
  • Dec 2016 – Oct 2018:
    • Hardening: interoperability testing with the kernel/SPDK NVMe-oF target
    • Performance improvements: e.g., zero copy, SGL support
  • Jan 2019: Released with TCP transport support
    • Compatibility: supports both the kernel and SPDK targets on the TCP transport
General design and implementation
• Adding a new transport to the SPDK NVMe-oF solution is straightforward; it only requires following the SPDK transport abstraction:
  • Host side: implement the transport functions defined in lib/nvme/nvme_transport.c; the code is in lib/nvme/nvme_tcp.c.
  • Target side: implement the transport-related functions defined in lib/nvmf/transport.h; the code is in lib/nvmf/tcp.c.
• Transport abstraction (diagram): FC, RDMA, and TCP transports plug into the common abstraction; the TCP transport sits on a socket layer (POSIX or VPP).
General design and implementation
• Performance design considerations for the TCP transport on the target side:
  • Follow the general SPDK NVMe-oF framework, e.g., an independent polling group on each thread.
  • TCP connection optimization (e.g., read/write) for each qpair.
    • Use the SPDK-encapsulated socket API (see the sketch after this list).
    • Purpose: preserve an interface through which users can plug in a different TCP stack (e.g., VPP) for further optimization.
  • Use a state machine to track the status of NVMe/TCP PDU reception.
  • NVMe command handling on each qpair.
    • Use a state machine to track the life cycle of each NVMe request.
    • Benefit: easier debugging and further performance improvement.
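Here is a hedged sketch of the encapsulated socket API mentioned above, built on spdk/sock.h; the signatures follow roughly the 19.01-era header and may differ in later releases (e.g., an extra implementation-name argument on listen/connect), and the acceptor/poller wiring is omitted.

```c
#include "spdk/sock.h"

/* Hedged sketch: accept one TCP connection and read some bytes through the
 * SPDK socket abstraction. Swapping the underlying implementation (POSIX
 * kernel sockets vs. VPP) does not change this code. */
static void serve_one_connection(const char *ip, int port)
{
	struct spdk_sock *listener;
	struct spdk_sock *conn;
	char buf[8192];
	ssize_t n;

	listener = spdk_sock_listen(ip, port);
	if (listener == NULL) {
		return;
	}

	/* In the real target, accept and recv are driven from pollers. */
	conn = spdk_sock_accept(listener);
	if (conn != NULL) {
		n = spdk_sock_recv(conn, buf, sizeof(buf));
		(void)n;  /* would feed the PDU receive state machine */
		spdk_sock_close(&conn);
	}

	spdk_sock_close(&listener);
}
```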
SPDK NVMe-oF TCP request life cycle for each connection on the target side (same state-machine diagram as shown earlier).
Benchmarks – latency (needs to be updated)
System configuration: 44x Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz (HT off); cores per socket: 22; 16x Samsung 8GB DDR4 @ 2400; 1x Intel SSD DC P3700 Series 375GB @ FW E2010320; SPDK 18.10; Host Dist/Kernel: Fedora 28 / kernel 4.18.16-200; Guest Dist/Kernel: ; Fio ver: fio-3.6-34; Fio workload: blocksize=4k, iodepth=1, iodepth_batch=128, iodepth_low=256, ioengine=libaio, size=10G, ramp_time=0, run_time=300, group_reporting, thread, numjobs=1, direct=1, rw=read/write/rw/randread/randwrite/randrw
Average latency (us), QD=1, MLX5, I/O size = 4KB: RDMA vs. TCP vs. TCP (VMA)

Transport    Read    Write   RW      Randread   Randwrite   Randrw
RDMA         77.25   10.59   48.9    21.59      10.34       88.37
TCP          46.76   34.92   89.62   94.09      44.24       156.52
TCP (VMA)    13.45   13.89   29.34   63.12      13.69       87.44
Benchmark – IOPS (needs to be updated)
System configuration: 44x Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz (HT off); cores per socket: 22; 16x Samsung 8GB DDR4 @ 2400; 1x Intel SSD DC P3700 Series 375GB @ FW E2010320; SPDK 18.10; Host Dist/Kernel: Fedora 28 / kernel 4.18.16-200; Guest Dist/Kernel: ; Fio ver: fio-3.6-34; Fio workload: blocksize=4k, iodepth=1, iodepth_batch=128, iodepth_low=256, ioengine=libaio, size=10G, ramp_time=0, run_time=300, group_reporting, thread, numjobs=1, direct=1, rw=randread/randwrite/randrw (50% read, 50% write)
IOPS (K), QD=128, MLX5, I/O size = 4KB: RDMA vs. TCP vs. TCP (VMA)

Transport    Read   Write   RW      Randread   Randwrite   Randrw
RDMA         705    421     381     479        348         300.2
TCP          97     137     104.9   87         89.7        101.6
TCP (VMA)    338    375     290     316        340         276