
Page 1: Creating Efficient Storage Networks with Ethernet and NVMe-oF™

Creating Efficient Storage Networks with Ethernet and NVMe-oF™

Not all Ethernet networks are designed equally

J Metz, Ph.D., Cisco Systems, Inc. (@drjmetz)

NVMe Developer Days 2018, San Diego, CA

HRDW-102: Designing NVMe Storage Systems
Wednesday, December 5, 2018, 3:30p-4:45p PST

Page 2: NVMe Transport Modes

NVMe Transport Modes

•  NVMe is a memory-mapped, PCIe model
•  Fabrics are message-based; shared memory is optional

NVMe Transports:

•  Memory: Commands/Responses & Data use Shared Memory. Example: PCI Express
•  Message: Commands/Responses use Capsules; Data may use Capsules or Messages. Examples: Fibre Channel, TCP
•  Message & Memory: Commands/Responses use Capsules; Data may use Capsules or Shared Memory. Examples: RDMA (InfiniBand, RoCE, iWARP)

The Message and Message & Memory modes are the fabric (message-based) transports.
Capsule = encapsulated NVMe Command/Completion within a transport Message. Data = transport data exchange mechanism (if any).
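As a rough sketch of the taxonomy above (the type names and structure here are mine, not from the NVMe specification), the three transport modes can be written down as data:

```python
from dataclasses import dataclass
from enum import Enum, auto

class Mechanism(Enum):
    SHARED_MEMORY = auto()
    CAPSULE = auto()
    MESSAGE = auto()

@dataclass(frozen=True)
class TransportMode:
    name: str
    commands: Mechanism   # how Commands/Responses move
    data: tuple           # allowed Data exchange mechanisms
    examples: tuple

MODES = (
    TransportMode("Memory", Mechanism.SHARED_MEMORY,
                  (Mechanism.SHARED_MEMORY,), ("PCI Express",)),
    TransportMode("Message", Mechanism.CAPSULE,
                  (Mechanism.CAPSULE, Mechanism.MESSAGE),
                  ("Fibre Channel", "TCP")),
    TransportMode("Message & Memory", Mechanism.CAPSULE,
                  (Mechanism.CAPSULE, Mechanism.SHARED_MEMORY),
                  ("InfiniBand", "RoCE", "iWARP")),
)
```

Note that every fabric transport moves Commands/Responses in Capsules; the modes differ only in how the Data itself travels.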

Page 3: What’s Special About NVMe-oF: Bindings

What’s Special About NVMe-oF: Bindings

•  What is a Binding? “A specification of reliable delivery of data, commands, and responses between a host and an NVM subsystem for an NVMe Transport. The binding may exclude or restrict functionality based on the NVMe Transport’s capabilities.”
•  I.e., it’s the “glue” that links all the pieces above and below. Examples:
   •  SGL descriptions
   •  Data placement restrictions
   •  Data transport capabilities
   •  Authentication capabilities
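To make that concrete, here is a hypothetical sketch of the kinds of properties a binding pins down for one transport. The field names and values are illustrative only; they are not taken from the NVMe-oF specification:

```python
from dataclasses import dataclass

# Hypothetical sketch: the sorts of things a binding "glues" together for
# one transport. Field names are illustrative, not from the NVMe-oF spec.
@dataclass
class TransportBinding:
    name: str
    sgl_descriptor_types: list      # which SGL descriptor types are allowed
    in_capsule_data_allowed: bool   # a data placement restriction
    max_in_capsule_data: int        # a data transport capability, in bytes
    authentication_methods: list    # authentication capabilities

rdma_binding = TransportBinding(
    name="RDMA",
    sgl_descriptor_types=["Keyed SGL Data Block", "SGL Data Block (in-capsule)"],
    in_capsule_data_allowed=True,
    max_in_capsule_data=8192,       # made-up value for illustration
    authentication_methods=["none"],
)
```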

Page 4: NVMe Command Data Transfers (Controller Initiated)

NVMe Command Data Transfers (Controller Initiated)

•  The Controller initiates the read or write of the NVMe Command Data to/from the Host Memory Buffer
•  Data transfer operations are transport-specific; examples:
   •  PCIe transport: PCIe Read/PCIe Write operations
   •  RDMA transport: RDMA_READ/RDMA_WRITE operations

[Diagram: Host (NVMe Host Driver, Host Memory Buffer) connected through the Transport to the NVM Subsystem (Fabric Port, NVMe Controller). Read Command Data and Write Command Data each move via a transport-dependent data transfer.]
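A minimal sketch of the direction-flipping this implies on an RDMA transport (the class and method names below are mine, standing in for a real RDMA library): an NVMe write becomes an RDMA_READ issued by the controller, and an NVMe read becomes an RDMA_WRITE.

```python
# Illustrative sketch only: how a controller-initiated transfer maps NVMe
# operations onto RDMA verbs. The transport object is a stand-in, not a real API.
class FabricController:
    def __init__(self, transport):
        self.transport = transport      # exposes rdma_read()/rdma_write()

    def handle_nvme_write(self, remote_sgl, length):
        # Host is writing to storage: the controller PULLS the data
        # from the Host Memory Buffer with an RDMA_READ.
        return self.transport.rdma_read(remote_sgl, length)

    def handle_nvme_read(self, remote_sgl, payload):
        # Host is reading from storage: the controller PUSHES the data
        # into the Host Memory Buffer with an RDMA_WRITE.
        return self.transport.rdma_write(remote_sgl, payload)
```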

Page 5: NVMe Command Data Transfers (In-Capsule Data)

NVMe Command Data Transfers (In-Capsule Data)

•  The NVMe Command and the Command Data are sent together in a Command Capsule
•  Reduces latency by avoiding the Controller having to fetch the data from the Host
•  The SQE’s SGL entry indicates a Capsule Offset type address

[Diagram: Host (NVMe Host Driver) sends a capsule carrying both the command and its in-capsule data through the Transport to the NVM Subsystem (Fabric Port, NVMe Controller).]
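A toy packing routine makes the point (the layout below is deliberately simplified and does not reproduce the real SQE/SGL wire format): one message carries the command plus its data, so the controller never has to come back for the payload.

```python
import struct

# Toy packing of a command capsule with in-capsule data. Field sizes and
# layout are simplified for illustration; not the real NVMe-oF format.
def build_capsule(opcode: int, data: bytes) -> bytes:
    # SGL entry of the "Data Block with Capsule Offset" flavor: it says
    # the data starts right after the fixed-size command portion.
    capsule_offset = 64                                  # pretend 64-byte SQE
    sgl = struct.pack("<QI", capsule_offset, len(data))  # offset, length
    sqe = struct.pack("<B", opcode).ljust(64 - len(sgl), b"\x00") + sgl
    return sqe + data                # command + data travel in one message

capsule = build_capsule(0x01, b"payload")   # a single send; no fetch round-trip
```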

Page 6: Storage Networking Optimization

Storage Networking Optimization

•  You do not need to (and should not) design a network that requires a lot of buffering
•  Capacity and over-subscription are not a function of the protocol (NAS, FC, iSCSI, Ceph, NVMe, etc.) but of the application I/O requirements

[Chart: a 5-millisecond view of traffic bursts crossing the congestion threshold.]

Data Center Design Goal: optimize the balance of end-to-end fabric latency against the ability to absorb traffic peaks and prevent any associated traffic loss.

Page 7: The Network Buffer Paradox

The Network Buffer Paradox

•  Not enough buffer – poor utilization of link bandwidth
•  Too much buffer – increased latency
•  Just enough buffer – best possible link utilization and latency
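The latency cost of oversizing is easy to quantify: the worst-case queuing delay a full buffer adds is simply its size divided by the link's drain rate. A back-of-the-envelope example (the numbers are illustrative, not from any particular switch):

```python
# Worst-case queuing delay added by a full buffer = buffer size / link rate.
buffer_bytes = 12 * 1024 * 1024    # 12 MB of packet buffer on one egress port
link_bps     = 25 * 10**9          # 25 Gb/s link

delay_s = (buffer_bytes * 8) / link_bps
print(f"{delay_s * 1e3:.2f} ms")   # ~4.03 ms of added latency when full
```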

Page 8: Important Concepts: DCTCP, ECN, PFC, Incast

Important Concepts: DCTCP, ECN, PFC, Incast

Page 9: Priority Flow Control (PFC)

Priority Flow Control (PFC)

•  A.k.a. “Lossless Ethernet”
•  PFC enables flow control on a per-priority basis
   •  PFC is also called Per-Priority Pause
•  Ability to have lossless and lossy priorities at the same time on the same wire
   •  Allows traffic to operate over a lossless priority independent of other priorities
   •  Traffic assigned to other CoS values will continue to transmit and rely on upper-layer protocols for retransmission

[Diagram: a single Ethernet wire carrying lossy traffic and lossless (“no-drop”) traffic side by side.]
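A minimal model of per-priority pause (my own sketch, not switch firmware; real switches do this in hardware per 802.1Qbb): each of the eight priorities carries an independent pause state, and the scheduler skips paused priorities while the others keep transmitting.

```python
from collections import deque

# Toy model of Per-Priority Pause: eight independent queues plus a pause
# bit per priority, set by received PFC frames.
class PfcPort:
    def __init__(self):
        self.queues = [deque() for _ in range(8)]   # one queue per CoS
        self.paused = [False] * 8

    def receive_pfc_frame(self, priority: int, pause: bool):
        self.paused[priority] = pause               # pause/resume one priority

    def transmit_one(self):
        # Strict priority for simplicity: highest unpaused, non-empty queue wins.
        for prio in reversed(range(8)):
            if not self.paused[prio] and self.queues[prio]:
                return self.queues[prio].popleft()
        return None                                 # everything paused or empty

port = PfcPort()
port.queues[3].append("storage frame")              # lossless class, e.g. CoS 3
port.receive_pfc_frame(3, pause=True)               # peer's buffer is filling
assert port.transmit_one() is None                  # CoS 3 holds; lossy classes would still flow
```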

Page 10: Explicit Congestion Notification (ECN)

Explicit Congestion Notification (ECN)

•  ECN enables end-to-end congestion notification between two endpoints on an IP network
•  In case of congestion, ECN gets the transmitting device to reduce its transmission rate until the congestion clears, without pausing traffic

[Diagram: data flows between two endpoints through ECN-enabled switches; a switch experiencing congestion marks the packets, and the receiver sends a notification back to the sender.]

ECN field values (the two ECN bits of the IP DiffServ/TOS byte):

•  00 – Not ECN-Capable
•  10 – ECN-Capable Transport, ECT(0)
•  01 – ECN-Capable Transport, ECT(1)
•  11 – Congestion Experienced (CE)
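A short sketch of how those two bits behave in practice: a congested ECN-enabled switch sets CE instead of dropping, but only on packets whose sender declared ECN capability.

```python
# The two ECN bits live in the low bits of the IP TOS byte.
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11

def mark_congestion(tos: int) -> int:
    """What an ECN-enabled switch does instead of dropping: set CE,
    but only if the sender marked the packet ECT(0) or ECT(1)."""
    if tos & 0b11 in (ECT0, ECT1):
        return (tos & ~0b11) | CE
    return tos   # Not-ECT traffic gets queued or dropped as usual

tos = 0x00 | ECT0           # sender declares ECN capability
tos = mark_congestion(tos)  # congested switch flips the field to CE
assert tos & 0b11 == CE
```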

Page 11: RoCEv2 Networking Considerations

RoCEv2 Networking Considerations

•  Underlying networks should be configured as lossless
   •  That means 802.1Qbb (PFC, an old friend from FCoE configs) and ECN
•  The transport protocol in RoCEv2 does include an end-to-end reliable delivery mechanism with built-in packet retransmission logic
   •  Typically implemented in hardware, triggered to recover from lost packets without intervention from the software stack
•  Some vendors are telling customers that lossless networks are not required (technically, they’re not)
   •  The RoCE v2 specification, however…

[Diagram: the RoCEv2 stack on the Ethernet wire – Ethernet Link Layer, IP, UDP, and the IB Transport Protocol – with the hardware/software boundary marked.]

Page 12: Incast

Incast

•  “Incast” is a rare but important problem in environments where many streams are heading to the same destination at the same time
•  Large “elephant flows” can overrun available buffers
•  Two methods of solving this problem:
   •  Increase buffer sizes in the switches
   •  Notify the sender to slow down before TCP packets get dropped

[Diagram: switch buffer headroom available to absorb an incast burst.]

Page 13: Understanding TCP Incast

Understanding TCP Incast

•  Synchronized TCP sessions arrive at a common congestion point (all sessions starting at the same time)
•  Each TCP session grows its window until it detects an indication of congestion (packet loss, in a normal TCP configuration)
•  All TCP sessions back off at the same time
   •  This is called Incast Collapse

[Chart: aggregate TCP throughput collapsing as the switch buffer overflows.]
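A toy calculation (my construction, with made-up numbers) shows why synchronized senders overflow a shallow buffer even when each flow is individually modest:

```python
# Toy incast model: N synchronized senders each burst a window of packets
# into one egress port with a fixed buffer. All numbers are illustrative.
N_SENDERS   = 32
WINDOW_PKTS = 10       # packets each sender transmits in the same RTT
BUFFER_PKTS = 100      # egress buffer depth
DRAIN_PKTS  = 40       # packets the link drains per RTT

arrivals = N_SENDERS * WINDOW_PKTS            # 320 packets land at once
queued   = min(arrivals, BUFFER_PKTS + DRAIN_PKTS)
dropped  = arrivals - queued                  # 180 packets lost in one RTT

print(f"arrived={arrivals} dropped={dropped}")
# Every sender sees loss simultaneously, so all back off together;
# that synchronized backoff is the incast collapse.
```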

Page 14: Incast Collapse

Incast Collapse

•  Incast collapse is a very specialized case
   •  It would need every flow to arrive at exactly the same time
   •  The problem is more that the buffer fills up because of elephant flows
•  Historically, buffers handle every flow the same
•  One solution is to make the switch buffers larger than the incast burst (avoiding overflow altogether, particularly with short frames), but this adds latency

Page 15: Solution 2: Telling the Sender to Slow Down

Solution 2: Telling the Sender to Slow Down

•  Instead of waiting for TCP to drop packets and then adjust the flow rate, why not simply tell the sender to slow down before the packets get dropped?
•  Technologies such as Data Center TCP (DCTCP) use Explicit Congestion Notification (ECN) to instruct the sender to do just that
   •  Otherwise, dropped packets are the only signal telling TCP to modify the flow of packets being sent into a congested network

Page 16: DCTCP and Incast Collapse

DCTCP and Incast Collapse

•  DCTCP will prevent Incast Collapse for long-lived flows
•  Notification of congestion arrives via ECN prior to packet loss
   •  The sender is informed that congestion is happening and can slow down its traffic
   •  Without ECN, the packet would have been dropped due to congestion, and the sender would only notice via a TCP timeout

[Chart: with ECN and DCTCP-enabled IP stacks on both ends, aggregate TCP throughput holds steady instead of collapsing.]
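The mechanism behind that graceful slowdown is DCTCP's window update, from the DCTCP algorithm (the parameter values below are illustrative): instead of halving the window on any congestion signal, the sender scales its backoff by the fraction of ECN-marked packets it observed.

```python
# DCTCP window update: back off in proportion to the fraction of packets
# that came back ECN-marked, rather than halving on any single signal.
G = 1 / 16                      # EWMA gain; value is illustrative

def update(alpha: float, cwnd: float, marked: int, total: int):
    frac = marked / total                # fraction of CE-marked packets
    alpha = (1 - G) * alpha + G * frac   # smoothed congestion estimate
    if marked:
        cwnd = cwnd * (1 - alpha / 2)    # mild marking -> mild backoff
    return alpha, cwnd

alpha, cwnd = 0.0, 100.0
alpha, cwnd = update(alpha, cwnd, marked=5, total=100)   # light congestion
print(round(cwnd, 1))   # ~99.8: far gentler than classic TCP's cwnd/2
```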

Page 17: Now… back to NVMe-oF

Now… back to NVMe-oF

•  Remember…
   •  NVMe treats storage with memory-like semantics
   •  Think of NVMe devices as remote memory
•  The transport arrow in the diagram below, innocuous as it may seem, is where all the network goodness and badness happens
•  We want this arrow to be as short as possible and as reliable as possible
•  This is where DCTCP, PFC, and ECN can make life easier

[Diagram: Host (NVMe Host Driver) ↔ Transport ↔ NVM Subsystem (Fabric Port, NVMe Controller), sending a capsule with in-capsule command data.]

Page 18: The Curse of Large Buffers

The Curse of Large Buffers

•  NVMe queuing is dependent upon the ongoing communication between the host and the NVMe subsystem controller
•  Inserting large buffers between the host and the storage subsystem:
   •  Increases latency
   •  Adds potential points of failure/delay
   •  Significantly reduces performance

[Diagram: Host (NVMe Host Driver) ↔ Transport ↔ NVM Subsystem (Fabric Port, NVMe Controller), sending a capsule with in-capsule command data.]

Page 19: Summary

Summary

•  Not all transports treat congestion the same way
•  The increased throughput of NVMe devices (including NVMe-oF) can overwhelm architectures
•  RoCE v2
   •  Best to use PFC to handle congestion issues before they reach the upper-layer protocol
•  NVMe/TCP
   •  Typical TCP-based networks are reactive
   •  NVMe is a low-latency, high-performance protocol; throwing memory or buffers at the problem will make the storage problem worse, not better
   •  More proactive approaches are advisable