
Page 1: Introduction to NVMe Over Fabrics-V3R

Introduction to NVMe over Fabrics

10/2016 v3

Simon Huang

Email:[email protected]

Page 2: Introduction to NVMe Over Fabrics-V3R

Agenda

• What is NVM Express™?
• What’s NVMe over Fabrics?
• Why NVMe over Fabrics?
• Expanding NVMe to Fabrics
• NVMe over Fabrics in the Data Center
• End-to-End NVMe over Fabrics
• NVMe Multi-Fabric Transport Mapping
• NVMe over Fabrics at Storage Tiers
• End-to-End NVMe Model
• Shared Server Flash
• NVMe over Fabrics Products (Examples)
• Recap
• Backup 1 and 2

Page 3: Introduction to NVMe Over Fabrics-V3R

What is NVM Express™?

• Industry standard for PCIe SSDs
• High-performance, low-latency PCIe SSD interface
• Command set + PCIe register interface
• In-box NVMe host drivers for Linux, Windows, VMware, … (a minimal passthrough sketch follows this list)
• Standard h/w drive form factors, mobile to enterprise
• NVMe community is 100+ companies strong and growing
• Learn more at nvmexpress.org
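To make the "command set + in-box driver" point concrete, here is a hedged sketch (not from the original slides) that issues an NVMe Identify Controller admin command through the Linux in-box driver's passthrough ioctl. The device path /dev/nvme0 is an assumption and error handling is minimal.

    /* Hedged sketch: issue an NVMe Identify Controller admin command via the
     * in-box Linux driver's passthrough ioctl. /dev/nvme0 is an assumption. */
    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/nvme_ioctl.h>

    int main(void)
    {
        int fd = open("/dev/nvme0", O_RDONLY);
        if (fd < 0) { perror("open /dev/nvme0"); return 1; }

        unsigned char id[4096] = {0};              /* Identify data is 4 KiB */
        struct nvme_admin_cmd cmd = {0};
        cmd.opcode   = 0x06;                       /* Identify */
        cmd.addr     = (uint64_t)(uintptr_t)id;    /* data buffer */
        cmd.data_len = sizeof(id);
        cmd.cdw10    = 1;                          /* CNS=1: Identify Controller */

        if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) < 0) {
            perror("NVME_IOCTL_ADMIN_CMD");
            close(fd);
            return 1;
        }

        /* Model number occupies bytes 24..63 of the Identify Controller data. */
        printf("Model: %.40s\n", (const char *)&id[24]);
        close(fd);
        return 0;
    }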

Page 4: Introduction to NVMe Over Fabrics-V3R

What’s NVMe over Fabrics?

• Nonvolatile Memory Express (NVMe) over Fabrics is a technology specification designed to enable NVMe message-based commands to transfer data between a host computer and a target solid-state storage device or system over a network such as Ethernet, Fibre Channel, or InfiniBand.

Page 5: Introduction to NVMe Over Fabrics-V3R

Why NVMe over Fabrics?

• End-to-End NVMe semantics across a range of topologies
– Retains NVMe efficiency and performance over network fabrics
– Eliminates unnecessary protocol translations
– Enables low-latency, high-IOPS remote NVMe storage solutions

Page 6: Introduction to NVMe Over Fabrics-V3R

Expanding NVMe to Fabrics

• Built on the common NVMe architecture, with additional definitions to support message-based NVMe operations
• Standardization of NVMe over a range of fabric types
• Initial fabrics: RDMA (RoCE, iWARP, InfiniBand™) and Fibre Channel
• The first specification was released in June 2016
• The NVMe.org Fabrics Linux Driver WG is developing host and target drivers (see the connect sketch below)
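As a concrete illustration of the host-side driver mentioned above, the hedged sketch below connects to an NVMe over Fabrics target the way nvme-cli's "nvme connect" does: by writing an option string to the /dev/nvme-fabrics device exposed by the Linux nvme-fabrics host module. The transport address, port, and subsystem NQN are placeholders, not real endpoints.

    /* Hedged sketch: connect to an NVMe-oF target by writing a fabrics option
     * string to /dev/nvme-fabrics. All values below are placeholders. */
    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        const char *opts =
            "transport=rdma,"                        /* RDMA transport (RoCE/iWARP/IB) */
            "traddr=192.0.2.10,"                     /* target address (placeholder) */
            "trsvcid=4420,"                          /* NVMe-oF default port */
            "nqn=nqn.2016-06.io.example:subsys1";    /* subsystem NQN (placeholder) */

        int fd = open("/dev/nvme-fabrics", O_RDWR);
        if (fd < 0) { perror("open /dev/nvme-fabrics"); return 1; }

        if (write(fd, opts, strlen(opts)) < 0) {
            perror("write fabrics options");
            close(fd);
            return 1;
        }

        /* On success the driver reports the new controller, e.g.
         * "instance=0,cntlid=1"; its namespaces then appear as /dev/nvmeXnY. */
        char reply[128] = {0};
        if (read(fd, reply, sizeof(reply) - 1) > 0)
            printf("connected: %s\n", reply);

        close(fd);
        return 0;
    }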

Page 7: Introduction to NVMe Over Fabrics-V3R

NVMe over Fabrics: Evolution of Non-Volatile Storage in the Data Center

Page 8: Introduction to NVMe Over Fabrics-V3R

End-to-End NVMe over Fabrics

• Extends the efficiency of NVMe over front-end and back-end fabrics
• Enables an efficient end-to-end NVMe model (host <-> NVMe PCIe SSD)

Page 9: Introduction to NVMe Over Fabrics-V3R

NVMe over Fabrics Advantages

• Industry-standard interface (multiple sources)
• Unlimited storage per server
• Scale storage independently of servers
• Highly efficient shared storage
• HA is straightforward
• Greater I/O performance

Page 10: Introduction to NVMe Over Fabrics-V3R

NVMe Multi-Fabric Transport Mapping

Fabric Message Based Transports

Page 11: Introduction to NVMe Over Fabrics-V3R

NVMe over Fabrics at Storage Tiers

Page 12: Introduction to NVMe Over Fabrics-V3R

End-to-End NVMe Model

• NVMe efficiencies scaled across entire fabric

Page 13: Introduction to NVMe Over Fabrics-V3R

Shared Server Flash - NVMe Storage

• RDMA support required for lowest latency
• Ethernet, InfiniBand, or OmniPath fabrics possible
– IB and OmniPath support RDMA natively
– Ethernet has RoCE v1/v2, iWARP, and iSCSI RDMA options
– iSCSI offload has built-in RDMA WRITE
• Disaster Recovery (DR) requires MAN or WAN
– iWARP and iSCSI are the only options that support MAN and WAN

Page 14: Introduction to NVMe Over Fabrics-V3R

NVMe over Fabrics Products (Examples)

Arrays:

• Mangstor (NX6320/NX6325/NX6340) All-Flash Arrays
• EMC DSSD D5

Adapters:

• Chelsio’s Terminator 5
• QLogic FastLinQ QL45611HLCU 100Gb Intelligent Ethernet Adapter

Page 15: Introduction to NVMe Over Fabrics-V3R

Recap

• NVMe was built from the ground up to support a consistent model for NVM interfaces, even across network fabrics

• Simplicity of the protocol enables hardware-automated I/O queues – NVMe transport bridge

• No translation to or from another protocol like SCSI (in firmware/software)

• Inherent parallelism of NVMe multiple I/O Queues is exposed to the host

• NVMe commands and structures are transferred end-to-end
• Maintains the NVMe architecture across a range of fabric types

Page 16: Introduction to NVMe Over Fabrics-V3R

Backup-1

Page 17: Introduction to NVMe Over Fabrics-V3R

Seagate SSD

• 1200.2 Series, SAS 12Gb/s – up to 210K RR IOPS and 25 DWPD

• XM1400 Series, M.2 22110, PCIe G3 x4 – up to 3 DWPD

• XF1400 Series, U.2, PCIe G3 x4 – up to 200K RR IOPS and 3 DWPD

• XP6500 Series, AIC, PCIe G3 x8 – up to 300K RR IOPS

• XP7200 Series, AIC, PCIe G3 x16 – up to 940K RR IOPS

• XP6300 Series, AIC, PCIe G3 x8 – up to 296K RR IOPS

Page 18: Introduction to NVMe Over Fabrics-V3R

Traditional Scale Out Storage

• Support for high-BW/IOPS NVMe preserves the software investment, because it keeps existing software price/performance competitive

• Support for high-BW/IOPS NVMe realizes most of the NVMe speed-up benefits
• Disaster Recovery (DR) requires MAN or WAN

Page 19: Introduction to NVMe Over Fabrics-V3R

RDMA

• RDMA stands for Remote Direct Memory Access and enables one computer to access another’s internal memory directly without involving the destination computer’s operating system. The destination computer’s network adapter moves data straight from the network into an area of application memory, bypassing the OS’s own data buffers and network I/O stack, so the transfer is very fast. A one-sided RDMA transfer does not, by itself, notify the destination application that the data has arrived; a separate message is needed if such notification is required.

• There is no single general-purpose RDMA standard, meaning that implementations are specific to particular servers, network adapters, operating systems, and applications. There are RDMA implementations for Linux and Windows Server 2012, which may use iWARP, RoCE, or InfiniBand as the carrying layer for the transfers.
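The one-sided transfer described above can be illustrated with the OFED verbs API. The following sketch assumes an already-connected queue pair and the peer's buffer address and rkey obtained out of band (normally via librdmacm); it registers a local buffer and posts an RDMA WRITE, and only the local host sees a completion.

    /* Minimal verbs sketch of a one-sided RDMA WRITE. Connection setup is
     * assumed to have happened elsewhere; this only posts the write. */
    #include <stddef.h>
    #include <stdint.h>
    #include <infiniband/verbs.h>

    int rdma_write_example(struct ibv_pd *pd, struct ibv_qp *qp,
                           void *local_buf, size_t len,
                           uint64_t remote_addr, uint32_t rkey)
    {
        /* Register the local buffer so the NIC can DMA out of it. */
        struct ibv_mr *mr = ibv_reg_mr(pd, local_buf, len, IBV_ACCESS_LOCAL_WRITE);
        if (!mr)
            return -1;

        struct ibv_sge sge = {
            .addr   = (uint64_t)(uintptr_t)local_buf,
            .length = (uint32_t)len,
            .lkey   = mr->lkey,
        };

        struct ibv_send_wr wr = {0}, *bad_wr = NULL;
        wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided operation */
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;  /* request a local completion */
        wr.wr.rdma.remote_addr = remote_addr;        /* peer's registered address */
        wr.wr.rdma.rkey        = rkey;               /* peer's memory key */

        if (ibv_post_send(qp, &wr, &bad_wr)) {
            ibv_dereg_mr(mr);
            return -1;
        }

        /* A completion on the local send CQ tells this host the data has been
         * placed; the remote application gets no notification unless a
         * separate message (e.g. a SEND) follows. */
        return 0;
    }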

Page 20: Introduction to NVMe Over Fabrics-V3R

iWARP - Internet Wide Area RDMA Protocol

• iWARP (internet Wide Area RDMA Protocol) implements RDMA over Internet Protocol networks. It is layered on IETF-standard congestion-aware protocols such as TCP and SCTP, and uses a mix of layers, including DDP (Direct Data Placement), MPA (Marker PDU Aligned framing), and a separate RDMA protocol (RDMAP), to deliver RDMA services over TCP/IP. Because of this it is said to have lower throughput and higher latency, and to require higher CPU and memory utilisation, than RoCE.

• For example: "Latency will be higher than RoCE (at least with both Chelsio and Intel/NetEffect implementations), but still well under 10 μs."

• Mellanox says no iWARP support is available at 25, 50, and 100Gbit/s Ethernet speeds. Chelsio says the IETF standard for RDMA is iWARP. It provides the same host interface as InfiniBand and is available in the same OpenFabrics Enterprise Distribution (OFED).

• Chelsio, which positions iWARP as an alternative to InfiniBand, says iWARP is the industry standard for RDMA over Ethernet. High-performance iWARP implementations are available and compete directly with InfiniBand in application benchmarks.
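To illustrate the "same host interface as InfiniBand" point, here is a hedged librdmacm sketch: the same connection-manager calls work unchanged whether the adapter underneath speaks iWARP, RoCE, or InfiniBand. The address and port are placeholders, and the program stops after posting address resolution.

    /* Hedged librdmacm sketch; the fabric type is invisible to this code. */
    #include <stdio.h>
    #include <netdb.h>
    #include <rdma/rdma_cma.h>

    int main(void)
    {
        struct rdma_event_channel *ec = rdma_create_event_channel();
        struct rdma_cm_id *id = NULL;
        struct addrinfo *res = NULL;

        if (!ec || rdma_create_id(ec, &id, NULL, RDMA_PS_TCP))
            return 1;

        /* Resolve the target address (placeholders below). */
        if (getaddrinfo("192.0.2.1", "4420", NULL, &res))
            return 1;
        int rc = rdma_resolve_addr(id, NULL, res->ai_addr, 2000 /* ms */);
        freeaddrinfo(res);
        if (rc)
            return 1;

        /* A real client now waits for RDMA_CM_EVENT_ADDR_RESOLVED on "ec",
         * calls rdma_resolve_route(), creates a queue pair, and finally
         * rdma_connect() -- the same sequence on iWARP, RoCE, and IB. */
        puts("address resolution posted");

        rdma_destroy_id(id);
        rdma_destroy_event_channel(ec);
        return 0;
    }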

Page 21: Introduction to NVMe Over Fabrics-V3R

RoCE - RDMA over Converged Ethernet

• RoCE (RDMA over Converged Ethernet) allows remote direct memory access (RDMA) over an Ethernet network. It operates over layer 2 and layer 3 DCB-capable (DCB - Data Centre Bridging) switches. Such switches comply with the IEEE 802.1 Data Center Bridging standard, which is a set of extensions to traditional Ethernet, geared to providing a lossless data centre transport layer that, Cisco says, helps enable the convergence of LANs and SANs onto a single unified fabric. DCB switches support the Fibre Channel over Ethernet (FCoE) networking protocol.

There are two versions:

• RoCE v1 uses Ethernet as its link-layer protocol and hence allows communication between any two hosts in the same Ethernet broadcast domain.

• RoCE v2 runs RDMA on top of UDP/IP and can therefore be routed.
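On Linux hosts the v1/v2 distinction is visible per GID table entry. The sketch below is an assumption-laden illustration: it presumes a kernel that exposes RoCE GID attributes in sysfs and an RDMA device named mlx5_0, and prints the type of the first few GID entries, e.g. "IB/RoCE v1" or "RoCE v2".

    /* Hedged sketch: print RoCE GID entry types from sysfs. Device name and
     * sysfs layout are assumptions. */
    #include <stdio.h>

    int main(void)
    {
        char path[256], line[64];

        /* Inspect the first few GID table entries of port 1. */
        for (int i = 0; i < 4; i++) {
            snprintf(path, sizeof(path),
                     "/sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/%d", i);
            FILE *f = fopen(path, "r");
            if (!f)
                continue;                              /* entry not populated */
            if (fgets(line, sizeof(line), f))
                printf("GID[%d] type: %s", i, line);   /* e.g. "RoCE v2" */
            fclose(f);
        }
        return 0;
    }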

Page 22: Introduction to NVMe Over Fabrics-V3R

Backup-2

Page 23: Introduction to NVMe Over Fabrics-V3R

Facebook – Lightning NVMe JBOF Architecture

Page 24: Introduction to NVMe Over Fabrics-V3R

Facebook Lightning Target

• Hot-plug. We want the NVMe JBOF to behave like a SAS JBOD when drives are replaced. We don't want to follow the complicated procedure that traditional PCIe hot-plug requires. As a result, we need to be able to robustly support surprise hot-removal and surprise hot-add without causing operating system hangs or crashes.

• Management. PCIe does not yet have an in-band enclosure and chassis management scheme like the SAS ecosystem does. While this is coming, we chose to address this using a more traditional BMC approach, which can be modified in the future as the ecosystem evolves.

• Signal integrity. The decision to maintain the separation of a PEB from the PDPB as well as supporting multiple SSDs per “slot” results in some long PCIe connections through multiple connectors. Extensive simulations, layout optimizations, and the use of low-loss but still low-cost PCB materials should allow us to achieve the bit error rate requirements of PCIe without the use of redrivers/retimers or exotic PCB materials.

• External PCIe cables. We chose to keep the compute head node separate from the storage chassis, as this gives us the flexibility to scale the compute-to-storage ratio as needed. It also allows us to use more powerful CPUs, larger memory footprints, and faster network connections all of which will be needed to take full advantage of high-performance SSDs. As the existing PCIe cables are clunky and difficult to use, we chose to use mini-SAS HD cables (SFF-8644). This also aligns with upcoming external cabling standards. We designed the cables such that they include a full complement of PCIe side-band signals and a USB connection for an out-of-band management connection.

• Power. Current 2.5" NVMe SSDs may consume up to 25W of power! This creates an unnecessary system constraint, and we have chosen to limit the power consumption per slot to 14W. This aligns much better with the switch oversubscription and our performance targets.

Page 25: Introduction to NVMe Over Fabrics-V3R

NVMe JBOF Benefits

• Manageability
• Flexibility
• Modularity
• Performance

Page 26: Introduction to NVMe Over Fabrics-V3R

Manageability via BMC

USB, I2C, Ethernet Out-of-band (OOB)

Page 27: Introduction to NVMe Over Fabrics-V3R

Flexibility in NVMe SSDs

Page 28: Introduction to NVMe Over Fabrics-V3R

Modularity in PCIe Switch

• A common switch board for both trays
• Easily design a new or different version without modifying the rest of the infrastructure

Page 29: Introduction to NVMe Over Fabrics-V3R

Low IO-Watt Performance

Page 30: Introduction to NVMe Over Fabrics-V3R

Ultra-High I/O Performance 5X Throughput + 1200X IOPS

Page 31: Introduction to NVMe Over Fabrics-V3R

OCP All-Flash NVMe Storage

• 2U, 60/30 NVMe SSDs
• Ultra-high IOPS and <10 µs latency
• PCIe 3.0 + U.2 or M.2 NVMe SSD support
• High-density storage system with 60 SSDs (M.2)