TRANSCRIPT
NVMe Over Fabrics—High Performance Flash Moves to Ethernet
PCIe Storage Track
John F. Kim, Director of Storage Marketing at Mellanox Technologies
Flash Memory Summit 2016, Santa Clara, CA
Traditional Ways to Connect Flash
§ Inside the Server
• SCSI (SATA, SAS)
• PCIe (proprietary standards)
• USB
§ Outside the Server
• DAS: SAS, SATA, USB, InfiniBand
• Fibre Channel SAN
• FCoE
• Ethernet TCP (iSCSI, NAS, Object)
New Ways to Connect Flash
§ Inside the Server
• PCIe with standardized NVMe
• NVDIMM
§ Outside the Server
• DAS: Thunderbolt, USB 3.0
• Ethernet RDMA (iSER, SMB Direct)
• Clustered file system (InfiniBand or Ethernet)
• Hyper-converged (Ethernet)
• NVMe over Fabrics
Options for NVMf Network
§ InfiniBand
• Best performance
§ Ethernet (RoCE or iWARP)
• Most widely used
§ Fibre Channel (FC-NVMe standard in progress)
• Most popular network for flash arrays in the enterprise
Why NVMe Over Fabrics
End-to-End NVMe semantics across a range of topologies [diagram from SNIA ESF webcast]
NVMe Queuing Operational Model
1. Host Driver enqueues Submission Queue Entries into the SQ
2. NVMe Controller dequeues Submission Queue Entries
3. NVMe Controller enqueues Completion Queue Entries into the CQ
4. Host Driver dequeues Completion Queue Entries
[Diagram: the NVMe Host Driver (host side) and the NVMe Controller in the NVM Subsystem exchange queue entries through a Fabric Port. A Submission Queue Entry (SQE) carries a Command Id, OpCode, NSID, buffer address (PRP/SGL), and command parameters; a Completion Queue Entry (CQE) carries command parameters, the SQ head pointer, command status, the Command Id, and a phase bit (P).]
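To make the four steps concrete, here is a minimal C sketch of the host side of this queuing model. The structure layouts and the queue_pair, host_submit, and host_reap names are simplified illustrations for this transcript, not the actual NVMe specification structures or any driver's API; real SQEs are 64 bytes, CQEs are 16 bytes, and a real driver also rings doorbell registers.

#include <stdint.h>

/* Simplified illustration of NVMe queue entries. Only the fields shown
 * in the diagram above are kept here. */
struct sqe {
    uint8_t  opcode;        /* command OpCode */
    uint16_t command_id;    /* matched against the eventual completion */
    uint32_t nsid;          /* namespace ID */
    uint64_t buffer_addr;   /* PRP or SGL pointing at host data buffers */
    uint32_t cdw10, cdw11;  /* command parameters */
};

struct cqe {
    uint32_t command_parm;  /* command-specific result */
    uint16_t sq_head;       /* how far the controller has consumed the SQ */
    uint16_t status;        /* command status; bit 0 used as phase bit here */
    uint16_t command_id;    /* identifies the completed command */
};

/* Hypothetical queue pair shared by the host driver and the controller. */
#define QDEPTH 64
struct queue_pair {
    struct sqe sq[QDEPTH];
    struct cqe cq[QDEPTH];
    uint16_t sq_tail;       /* host produces SQEs here (step 1) */
    uint16_t cq_head;       /* host consumes CQEs here (step 4) */
    uint8_t  cq_phase;      /* expected phase bit; starts at 1 with a zeroed CQ */
};

/* Step 1: host driver enqueues an SQE at the SQ tail. */
static void host_submit(struct queue_pair *qp, const struct sqe *cmd)
{
    qp->sq[qp->sq_tail] = *cmd;
    qp->sq_tail = (qp->sq_tail + 1) % QDEPTH;
    /* A real driver now writes sq_tail to the SQ doorbell; over a fabric,
     * the SQE travels in a command capsule instead. Steps 2 and 3 (the
     * controller dequeuing the SQE and posting a CQE) happen on the
     * subsystem side and are not shown. */
}

/* Step 4: host driver dequeues a CQE; returns 1 if one was available. */
static int host_reap(struct queue_pair *qp, struct cqe *out)
{
    struct cqe *c = &qp->cq[qp->cq_head];
    if ((c->status & 1) != qp->cq_phase)
        return 0;                          /* no new completion yet */
    *out = *c;
    qp->cq_head = (qp->cq_head + 1) % QDEPTH;
    if (qp->cq_head == 0)
        qp->cq_phase ^= 1;                 /* phase flips on each wrap */
    /* A real driver now writes cq_head to the CQ head doorbell. */
    return 1;
}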
Why Ethernet for NVMf?
§ Ubiquity
§ 25/50/100GbE today, 200GbE soon
§ RDMA (RoCE and iWARP)
§ Scalability to hundreds of thousands of nodes
§ Standards-based, multi-vendor
§ Lowest price per port
§ Converged fabric capability
§ Standard cables extend to long distances
RDMA Options on Ethernet
§ RoCE uses UDP/IP on Ethernet, no TCP
• InfiniBand transport layered on top of Ethernet
§ iWARP layers on top of TCP/IP
• Offloaded TCP/IP flow control and management
• May include a TCP Offload Engine (TOE)
§ Both RoCE and iWARP support RDMA verbs
• NVMf using verbs can run on top of either transport (see the sketch below)
§ Other options exist without RDMA
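Because RoCE and iWARP expose the same verbs interface, software written against libibverbs runs on either. As a minimal, non-NVMf-specific illustration, the sketch below lists the RDMA-capable devices on a host using libibverbs; a real NVMf initiator would additionally set up queue pairs and memory registrations through the RDMA connection manager.

#include <stdio.h>
#include <infiniband/verbs.h>

/* List RDMA-capable devices visible through the verbs API. Both RoCE
 * and iWARP adapters appear here, since they share the verbs interface.
 * Compile with: gcc list_rdma.c -libverbs */
int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs) {
        perror("ibv_get_device_list");
        return 1;
    }
    for (int i = 0; i < num; i++)
        printf("RDMA device %d: %s\n", i, ibv_get_device_name(devs[i]));
    ibv_free_device_list(devs);
    return 0;
}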
Who Supports RDMA on Ethernet?

Vendor                 RoCE              iWARP             25/50/100GbE
Mellanox               Yes                                 Yes (GA)
Broadcom (Emulex)      Yes (in beta)                       ?
Broadcom (NetXtreme)   ?                                   ?
Chelsio                                  Yes (GA)          Announced
QLogic                 Yes (announced)   Yes (announced)   Yes (in beta)
Mellanox NVMf Demo on RoCE
§ Topology
• Two compute nodes, each with a ConnectX-4 Lx 25Gb/s port
• One storage node with a ConnectX-4 Lx 50Gb/s port and 4 x Intel NVMe devices (P3700/750 series)
• Nodes connected through a switch
Results (BS = 4KB, 16 jobs, IO depth = 64):
• Bandwidth (target side): 5.2 GB/s
• IOPS (target side): 1.3M
• Online cores: 4, each at 50% utilization
• Added fabric latency: ~12us
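(These figures are self-consistent: 1.3M IOPS at 4 KB per I/O works out to roughly 5.2 GB/s, matching the measured bandwidth.)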
[Topology diagram: two ConnectX-4 Lx 25GbE NICs and one ConnectX-4 Lx 50GbE NIC connected through a Mellanox SN2700 Spectrum switch, with 4 Intel P3700 NVMe SSDs on the storage node]
Intel SPDK NVMf Demo
§ Intel SPDK software with user-space target and polling mode (sketched below)
§ Mellanox ConnectX-4 Lx NICs running RoCE
§ 4 Intel NVMe P3700 SSDs connected to a shared PCIe Gen3 x16 bus
§ Intel Xeon-D 1567 CPU on the target side; Intel Xeon E5-2600 v3 on the initiator side
[Diagram: NVMf initiator and NVMf target, each with a Mellanox ConnectX-4 Lx dual-port NIC on PCIe 3.0 x8, linked by 2 x 25GbE with RDMA (RoCE); the target's 4 Intel NVMe SSD P3700 2TB drives attach over PCIe 3.0 x16]
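"Polling mode" here means the user-space target thread continuously checks its queues instead of sleeping on interrupts. The following is not SPDK's actual API; it is a minimal, generic C sketch of a polled completion check, with hypothetical names (cq_view, cqe_ready, poll_loop), intended only to show the idea of trading CPU cycles for lower and more predictable latency.

#include <stdbool.h>
#include <stdint.h>

/* Schematic of a user-space polled-mode completion loop (not SPDK API). */
struct cq_view {
    volatile uint16_t *next_status;  /* status word of the next expected CQE */
    uint8_t expected_phase;          /* toggles on each queue wraparound */
};

static bool cqe_ready(const struct cq_view *cq)
{
    /* A new completion is present when the phase bit matches. */
    return (*cq->next_status & 1) == cq->expected_phase;
}

static void poll_loop(struct cq_view *cqs, int ncq)
{
    for (;;) {                       /* busy-poll; never blocks */
        for (int i = 0; i < ncq; i++) {
            if (cqe_ready(&cqs[i])) {
                /* process the completion, advance the head,
                 * update expected_phase on wrap, ring the doorbell */
            }
        }
    }
}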
E8 Storage: 10x Performance, Half the TCO
[Bar chart comparing millions of IOPS (MIOPS), on a 0 to 10 scale, for a traditional AFA, an NVMe SSD, and the E8 D24]
NVMe over 100GbE RoCE Fabric Demo
NVMf in Action
§ 12-, 16-, 32-TB Flash Arrays
§ 2.25M, 1.75M Rd/Wr IOPS
§ 10.0GB/s, 6.0GB/s Rd/Wr BW
§ < 200us Latency
The FIO test suite demonstrates the high performance and low latency of Mangstor's NX6320 NVMe over Fabrics devices.
@FMS16 Hall D
[Diagram: eight VMs (VM1-VM8), each with its own volume, on an ESXi 6.0 hypervisor sharing a ConnectX-4 100Gb/s adapter with SR-IOV RoCE]
Low Cost of Ownership + Enterprise Reliability
Thin-provisioned, virtual NVMe LUNs presented to the host tier
✓ No custom software or hardware required in the host/application tier
✓ Standard Ethernet connectivity
✓ Erasure coded volumes
✓ Thin provisioned volumes
✓ QoS policy per volume
Power AND Density in an Enterprise-Class Array
• 120 GB/s bandwidth
• 20 million 4K IOPS
• Half a PB of capacity in 4U (*1 PB in Q1 2017)
• Up to 40 x 40Gb/s ports
• High-speed data copies/clones
• Thin provisioning
• Pay-as-you-grow scalability in a 4U chassis
• Zero host footprint
• Erasure coding
• Hot-swappable components
NVMf: High Performance Flash Moves to Ethernet
Thank You!
Flash Memory Summit 2016 Santa Clara, CA