Creating Efficient Storage Networks with Ethernet and NVMe-oF™ · 05/12/2018
TRANSCRIPT
Creating Efficient Storage Networks with Ethernet and NVMe-oF™
Not all Ethernet networks are designed equally
J Metz, Ph.D., Cisco Systems, Inc.
@drjmetz · NVMe Developer Days 2018 · San Diego, CA
HRDW-102: Designing NVMe Storage Systems Wednesday, December 5, 2018 3:30p-4:45p PST
NVMe Transport Modes
• NVMe is a memory-mapped, PCIe model
• Fabrics is message-based; shared memory is optional

NVMe Transports
• Memory: Commands/Responses and Data use Shared Memory. Example: PCI Express
• Message: Commands/Responses use Capsules; Data may use Capsules or Messages. Examples: Fibre Channel, TCP
• Message & Memory: Commands/Responses use Capsules; Data may use Capsules or Shared Memory. Examples: RDMA (InfiniBand, RoCE, iWARP)

Fabric Message-Based Transports:
• Capsule = Encapsulated NVMe Command/Completion within a transport Message
• Data = Transport data exchange mechanism (if any)
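The three transport modes above can be summarized in a small lookup structure. This is purely an illustrative restatement of the table; the names and layout are not from the NVMe-oF specification.

```python
# Hypothetical summary of the three NVMe transport modes described above.
NVME_TRANSPORT_MODES = {
    "memory": {
        "commands_responses": "shared memory",
        "data": "shared memory",
        "examples": ["PCI Express"],
    },
    "message": {
        "commands_responses": "capsules",
        "data": "capsules or messages",
        "examples": ["Fibre Channel", "TCP"],
    },
    "message_and_memory": {
        "commands_responses": "capsules",
        "data": "capsules or shared memory",
        "examples": ["RDMA (InfiniBand, RoCE, iWARP)"],
    },
}

# Example lookup: how does a TCP transport move command data?
mode = next(m for m, v in NVME_TRANSPORT_MODES.items()
            if "TCP" in v["examples"])
# mode == "message": commands use capsules, data uses capsules or messages
```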
What’s Special About NVMe-oF: Bindings
• What is a Binding?
  • “A specification of reliable delivery of data, commands, and responses between a host and an NVM subsystem for an NVMe Transport. The binding may exclude or restrict functionality based on the NVMe Transport’s capabilities”
• I.e., it’s the “glue” that links all the pieces above and below. Examples:
  • SGL Descriptions
  • Data placement restrictions
  • Data transport capabilities
  • Authentication capabilities
NVMe Command Data Transfers (Controller-Initiated)
• Controller initiates the Read or Write of the NVMe Command Data to/from the Host Memory Buffer
• Data transfer operations are transport-specific; examples:
  • PCIe Transport: PCIe Read / PCIe Write operations
  • RDMA Transport: RDMA_READ / RDMA_WRITE operations

[Figure: Host (NVMe Host Driver, Host Memory Buffer) and NVM Subsystem (NVMe Controller, Fabric Port) exchange Read/Write Command Data over the Transport via transport-dependent data transfers]
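The controller-initiated flow above can be sketched in a few lines: for a write, the controller pulls data from host memory; for a read, it pushes data into host memory. All names here are hypothetical stand-ins, not real driver code; the fake transport's methods play the role of the transport-specific operations (PCIe Read/Write or RDMA_READ/RDMA_WRITE).

```python
class FakeTransport:
    """Stand-in for a transport's data movers (PCIe Read/Write or
    RDMA_READ/RDMA_WRITE). Purely illustrative."""
    def read_host_memory(self, mem, region):
        return mem[region]            # e.g. RDMA_READ from the host buffer
    def write_host_memory(self, mem, region, data):
        mem[region] = data            # e.g. RDMA_WRITE into the host buffer

def transfer_command_data(cmd, host_memory, transport):
    """Controller side: the controller, not the host, initiates the move."""
    if cmd["opcode"] == "write":      # host -> controller: fetch the data
        return transport.read_host_memory(host_memory, cmd["buffer"])
    if cmd["opcode"] == "read":       # controller -> host: deliver the data
        transport.write_host_memory(host_memory, cmd["buffer"], cmd["data"])

host_mem = {"buf0": b"payload"}
t = FakeTransport()
got = transfer_command_data({"opcode": "write", "buffer": "buf0"}, host_mem, t)
transfer_command_data({"opcode": "read", "buffer": "buf1", "data": b"block"},
                      host_mem, t)
```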
NVMe Command Data Transfers (In-Capsule Data)
• NVMe Command and Command Data sent together in a Command Capsule
• Reduces latency by avoiding the Controller having to fetch the data from the Host
• SQE SGL Entry will indicate a Capsule Offset type address

[Figure: Host (NVMe Host Driver) sends a Capsule with In-Capsule Data (Command + Data) over the Transport to the NVM Subsystem (NVMe Controller, Fabric Port)]
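A minimal sketch of the in-capsule idea: the 64-byte Submission Queue Entry (SQE) and its data travel in one message, so the SGL entry can point at an offset inside the capsule itself. The layout below is simplified and illustrative, not wire-accurate.

```python
# Simplified NVMe-oF command capsule: 64-byte SQE followed immediately by
# in-capsule data. A real capsule's SGL entry would carry a Capsule Offset
# type address pointing at where the data starts.
SQE_SIZE = 64

def build_capsule(sqe: bytes, data: bytes = b"") -> bytes:
    assert len(sqe) == SQE_SIZE
    return sqe + data          # command and data in a single message

sqe = bytes(SQE_SIZE)                       # placeholder 64-byte SQE
capsule = build_capsule(sqe, b"hello-nvme")
in_capsule_offset = SQE_SIZE                # where the data begins
```

Because the data arrives with the command, the controller never has to issue a follow-up fetch across the fabric, which is exactly the latency saving the bullet above describes.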
Storage Networking Optimization
• You do not need to, and should not, design a network that requires a lot of buffering
• Capacity and over-subscription are not a function of the protocol (NAS, FC, iSCSI, Ceph, NVMe, etc.) but of the application I/O requirements

[Figure: 5-millisecond view of buffer occupancy exceeding the congestion threshold]

Data Center Design Goal: Optimize the balance of end-to-end fabric latency with the ability to absorb traffic peaks and prevent any associated traffic loss
The Network Buffer Paradox
• Not enough buffer – poor utilization of link bandwidth
• Too much buffer – increased latency
• Just enough buffer – best possible link utilization and latency
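The "too much buffer" cost is easy to quantify: a fully occupied buffer adds queuing delay equal to its size divided by the link's drain rate. The figures below are illustrative, not from the slides.

```python
# Worked example of buffer-induced queuing delay.
def buffer_delay_us(buffer_bytes: int, link_gbps: float) -> float:
    """Microseconds needed to drain a full buffer onto one link."""
    link_bytes_per_us = link_gbps * 1e9 / 8 / 1e6   # bytes per microsecond
    return buffer_bytes / link_bytes_per_us

# A 16 MiB buffer draining onto a 10 Gb/s link adds roughly 13 ms of
# queuing delay when full -- enormous next to microsecond-class NVMe media.
delay = buffer_delay_us(16 * 1024 * 1024, 10.0)     # ~13,422 us
```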
Important Concepts: DCTCP, ECN, PFC, Incast
Priority Flow Control (PFC)
• A.k.a. “Lossless Ethernet”
• PFC enables Flow Control on a Per-Priority basis
  • PFC is also called Per-Priority Pause
• Ability to have lossless and lossy priorities at the same time on the same wire
  • Allows traffic to operate over a lossless (“no-drop”) priority independent of other priorities
  • Other traffic assigned to other CoS will continue to transmit and rely on upper-layer protocols for retransmission

[Figure: Lossy and lossless (“no-drop”) traffic sharing the same Ethernet wire]
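The per-priority behavior comes from the PFC frame format itself: a class-enable vector selects which priorities pause, and a per-priority timer says for how long. The sketch below builds a simplified PFC payload (MAC control opcode 0x0101, enable vector, eight pause timers in 512-bit-time quanta); a real frame also carries MAC addresses and EtherType 0x8808, omitted here.

```python
import struct

PFC_OPCODE = 0x0101  # 802.1Qbb per-priority pause

def build_pfc_payload(pause_quanta):
    """pause_quanta maps priority (0-7) -> pause time in quanta.
    Priorities not listed keep transmitting -- that is the whole point."""
    enable_vector = 0
    timers = [0] * 8
    for prio, quanta in pause_quanta.items():
        enable_vector |= 1 << prio
        timers[prio] = quanta
    return struct.pack("!HH8H", PFC_OPCODE, enable_vector, *timers)

# Pause only priority 3 (e.g. the lossless storage class) at maximum quanta;
# lossy traffic on the other seven priorities is unaffected.
payload = build_pfc_payload({3: 0xFFFF})
```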
Explicit Congestion Notification (ECN)
• ECN enables end-to-end congestion notification between two endpoints on an IP network
• In case of congestion, ECN gets the transmitting device to reduce its transmission rate until congestion clears, without pausing traffic

[Figure: Data flow through two ECN-enabled switches; the congested switch marks packets “Congestion Experienced” on the forward path, and a notification returns to the sender after initialization]

ECN Field Values (the two low-order bits of the DiffServ/TOS byte):
• 0b00 – Non ECN-Capable
• 0b10 – ECN-Capable Transport, ECT(0)
• 0b01 – ECN-Capable Transport, ECT(1)
• 0b11 – Congestion Encountered (CE)
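The codepoints above are just bit operations on the TOS/Traffic Class byte. A minimal sketch of what an ECN-enabled switch does under congestion: mark Congestion Encountered instead of dropping, but only if the packet advertised ECN capability.

```python
ECN_NOT_ECT = 0b00   # not ECN-capable
ECN_ECT1    = 0b01   # ECN-Capable Transport (1)
ECN_ECT0    = 0b10   # ECN-Capable Transport (0)
ECN_CE      = 0b11   # Congestion Encountered

def get_ecn(tos: int) -> int:
    """ECN codepoint = two low-order bits; the upper six are DSCP."""
    return tos & 0b11

def mark_ce(tos: int) -> int:
    """Congested-switch behavior: set CE rather than drop, but only
    for packets that were marked ECN-capable by the sender."""
    if get_ecn(tos) in (ECN_ECT0, ECN_ECT1):
        return (tos & ~0b11) | ECN_CE
    return tos   # non-ECT packets would be dropped instead

tos = 0x02            # DSCP 0, ECT(0)
marked = mark_ce(tos) # 0x03: DSCP 0, Congestion Encountered
```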
RoCEv2 Networking Considerations
• Underlying networks should be configured as lossless
  • That means 802.1Qbb (PFC; an old friend from FCoE configs) and ECN
• The Transport Protocol in RoCEv2 does include an end-to-end reliable delivery mechanism with built-in packet retransmission logic
  • Typically implemented in hardware, triggered to recover from lost packets without intervention from the software stack
• Some vendors are telling customers that lossless networks are not required (they’re not, technically)
  • The RoCE v2 specification, however…

[Figure: RoCEv2 protocol stack on the Ethernet wire – Ethernet Link Layer, IP, UDP, and IB Transport Protocol, with the hardware/software boundary marked]
Incast
• “Incast” is a rare but important problem in environments where many streams are heading to the same destination at the same time
• Large “elephant” flows can overrun available buffers
• 2 methods of solving this problem:
  • Increase buffer sizes in the switches
  • Notify the sender to slow down before TCP packets get dropped

[Figure: Buffer available for an incast burst]
Understanding TCP Incast
• Synchronized TCP sessions arrive at a common congestion point (all sessions starting at the same time)
• Each TCP session will grow its window until it detects an indication of congestion (packet loss, in a normal TCP configuration)
• All TCP sessions back off at the same time
  • This is called Incast Collapse

[Figure: Synchronized TCP traffic overflowing a shared buffer; throughput collapses]
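The synchronized grow-then-back-off cycle can be shown with a toy simulation: N senders increase their windows in lockstep, overflow a shared buffer at the same moment, and all halve at once, so aggregate load oscillates instead of settling at capacity. Numbers are purely illustrative.

```python
# Toy model of TCP incast collapse with synchronized senders.
def simulate_incast(n_senders=8, buffer_pkts=64, rounds=12):
    windows = [1] * n_senders
    utilization = []
    for _ in range(rounds):
        offered = sum(windows)
        if offered > buffer_pkts:
            # Shared buffer overflows; every session sees loss and
            # backs off simultaneously -- the "collapse".
            windows = [max(1, w // 2) for w in windows]
        else:
            windows = [w + 1 for w in windows]   # additive increase
        utilization.append(min(offered, buffer_pkts) / buffer_pkts)
    return utilization

util = simulate_incast()
# Utilization ramps to 1.0, then sawtooths (e.g. back down to 0.5)
# instead of converging -- link bandwidth is repeatedly left idle.
```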
Incast Collapse
• Incast collapse is a very specialized case
  • It would need every flow to arrive at exactly the same time
  • The problem is more that the buffer fills up because of elephant flows
• Historically, buffers handle every flow the same
• It could potentially be solved with bigger buffers, particularly with short frames: one solution is to have larger buffers in the switches than the incast burst (avoiding overflow altogether), but this adds latency
Solution 2: Telling the Sender to Slow Down
• Instead of waiting for TCP to drop packets and then adjust the flow rate, why not simply tell the sender to slow down before the packets get dropped?
• Technologies such as Data Center TCP (DCTCP) use Explicit Congestion Notification (ECN) to instruct the sender to do just this
  • In standard TCP, dropped packets are the signal to modify the flow of packets being sent in a congested network
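DCTCP's refinement is not just reacting earlier but reacting proportionally: the sender tracks what fraction of its packets were ECN-marked and cuts its window by that much, rather than halving on every signal. The sketch below uses the commonly cited form of the DCTCP control law with its usual default gain; treat it as illustrative.

```python
# DCTCP control law sketch: alpha estimates the fraction of ECN-marked
# packets, and the window cut scales with alpha.
G = 1 / 16  # estimation gain (a commonly used default)

def dctcp_update(alpha: float, marked: int, total: int) -> float:
    """Update the moving estimate of the marked fraction F:
    alpha <- (1 - G) * alpha + G * F."""
    f = marked / total if total else 0.0
    return (1 - G) * alpha + G * f

def dctcp_cwnd(cwnd: float, alpha: float) -> float:
    """Cut in proportion to congestion: cwnd <- cwnd * (1 - alpha / 2).
    With alpha -> 1 (every packet marked) this becomes TCP's classic
    halving; with light marking the cut is correspondingly mild."""
    return cwnd * (1 - alpha / 2)

# Light congestion: 1 of 10 packets marked barely trims the window.
alpha = dctcp_update(0.0, marked=1, total=10)   # 0.00625
cwnd = dctcp_cwnd(100.0, alpha)                 # ~99.69
```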
DCTCP and Incast Collapse
• DCTCP will prevent Incast Collapse for long-lived flows
  • Notification of congestion via ECN prior to packet loss
  • Sender gets informed that congestion is happening and can slow down traffic
• Without ECN, the packet would have been dropped due to congestion, and the sender would notice this only via TCP timeout

[Figure: DCTCP-enabled IP stacks on both endpoints with an ECN-enabled switch in between; TCP throughput is sustained]
Now… back to NVMe-oF
• Remember…
  • NVMe treats storage with memory-like semantics
  • Think of NVMe devices as remote memory
• The transport arrow between host and subsystem, innocuous as it may seem, is where all the network goodness and badness happens
• We want this arrow to be as short as possible and as reliable as possible
• This is where DCTCP, PFC, and ECN can make life easier

[Figure: Host (NVMe Host Driver) sends a Capsule with In-Capsule Data over the Transport to the NVM Subsystem (NVMe Controller, Fabric Port); the transport arrow is where the network matters]
The Curse of Large Buffers
• NVMe queuing is dependent upon the ongoing communication between the host and the NVMe subsystem controller
• Inserting large buffers in between the host and the storage subsystem:
  • Increases latency
  • Adds potential points of failure/delay
  • Significantly reduces performance

[Figure: The same host-to-NVM-subsystem capsule exchange as before, now with large buffers sitting in the transport path]
Summary
• Not all transports treat congestion the same way
• Increased throughput of NVMe devices (including NVMe-oF) can overwhelm architectures
• RoCE v2
  • Best to use PFC to handle congestion issues before they get to the upper-layer protocol
• NVMe/TCP
  • Typical TCP-based networks are reactive
  • NVMe is a low-latency, high-performance protocol; throwing memory or buffers at the problem will make the storage problem worse, not better
  • More proactive approaches are advisable