TRANSCRIPT
2015 Storage Developer Conference. © Chelsio Communications Inc. All Rights Reserved.
NVMe over Fabrics
Wael Noureddine, Ph.D. Chelsio Communications Inc.
Overview
- Introduction
- NVMe/Fabrics and iWARP
- RDMA Block Device and results
- Closing notes
NVMe Devices
- Accelerating pace of progress
  - Capacity
  - Reliability
  - Efficiency
  - Performance: latency, bandwidth, random-access IOPS
Samsung Keynote, Flash Memory Summit, Santa Clara, August 2015
A Decade of NIC Evolution
- 2003: 7 Gbps aggregate; PCI-X; Physical Functions: 1; DMA Queues: 1; Interrupt: Line; Logical NICs: 1; lock-based synchronization
- 2007: 11 Gbps bidirectional; PCIe Gen1; Physical Functions: 1; DMA Queues: 8; Interrupt: MSI; Logical NICs: 8; up to 8 cores
- 2013: 55 Gbps bidirectional; PCIe Gen3; Physical Functions: 8; DMA Queues: 64K; Interrupt: MSI-X; Logical NICs: 128; lock-free user space I/O
NVMe – Non-Volatile Memory Express
- Modern PCIe interface architecture
  - Benefits from experience with other high performance devices
  - Scales with exponential speed increases
- Standardized for enhanced plug-and-play
  - Register set, feature set, command set
- Parallels to the high performance NIC interface
  - Up to 64K queues, each up to 64K entries deep
  - MSI-X
  - Lock-free operation
- Security
- Data integrity protection
iWARP RDMA over Ethernet
- Open, transparent and mature
  - IETF standard (2007)
- Plug-and-play
  - Standard and proven TCP/IP, exposed through the standard RDMA verbs/rdma_cm interface (see the connection-setup sketch after the diagram below)
  - Built-in reliability, congestion control and flow control
  - Natively routable
- Cost effective
  - Regular switches, same network appliances
  - Works with DCB but does NOT require it; no need for lossless configuration
- No network restrictions
  - Architecture, scale, distance, RTT, link speeds
- Hardware performance
  - Exception processing in HW
  - Ultra low latency
  - High packet rate and bandwidth
[Diagram: two hosts exchanging TCP/IP/Ethernet frames through any switches/routers, across LAN/data center or WAN/long haul links; on each side a Unified Wire RNIC implements RDMA, TCP, IP and Ethernet (TCP/IP in hardware), placing payload directly in host memory and posting notifications to completion queues (CQ); end-to-end CRC protects the transfer.]
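Because iWARP presents the same RDMA verbs and connection manager interface as other RDMA transports, an initiator endpoint can be set up with ordinary librdmacm calls. The following is a minimal sketch, not Chelsio code: the address 192.168.1.10 and port 4420 are placeholders, and error handling is reduced to early returns.

    /* Minimal iWARP connection sketch using librdmacm (placeholder address/port). */
    #include <rdma/rdma_cma.h>
    #include <stdio.h>

    int main(void)
    {
            struct rdma_addrinfo hints = { 0 }, *res;
            struct ibv_qp_init_attr attr = { 0 };
            struct rdma_cm_id *id;

            hints.ai_port_space = RDMA_PS_TCP;   /* connection-oriented, carried over TCP */
            if (rdma_getaddrinfo("192.168.1.10", "4420", &hints, &res))
                    return 1;

            attr.cap.max_send_wr = attr.cap.max_recv_wr = 16;
            attr.cap.max_send_sge = attr.cap.max_recv_sge = 1;
            attr.qp_type = IBV_QPT_RC;
            attr.sq_sig_all = 1;

            /* Resolve the address, allocate the queue pair, and connect. */
            if (rdma_create_ep(&id, res, NULL, &attr))
                    return 1;
            if (rdma_connect(id, NULL))
                    return 1;

            printf("RDMA connection established\n");
            rdma_disconnect(id);
            rdma_destroy_ep(id);
            rdma_freeaddrinfo(res);
            return 0;
    }

Nothing in this setup is specific to iWARP; the same code runs over any rdma_cm transport, which is the plug-and-play point made above.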
NVMe over Fabrics
- Addresses PCIe reach and scalability limits
- Extends native NVMe over a large fabric
- iWARP RDMA over Ethernet was the first proof point
  - Intel NVMe drives, Chelsio 40GbE iWARP
  - Compared to local disk: less than 8 usec additional RTT latency, same IOPS performance
NVMe/Ethernet Fabric with iWARP
[Diagram: Linux and Windows clients (NVMe-optimized, iSCSI, SMB3) connect over an Ethernet front-end fabric, through ANY switch/router, to an appliance front-end built on T5 Unified Adapters; an Ethernet back-end fabric, again through ANY switch/router, connects NVM storage nodes and remote replication over NVMe/Fabrics. T5 supports iSCSI, SMB Direct and NVMe/Fabrics offload simultaneously.]
T5 Storage Protocol Support
[Diagram: T5 hardware provides NIC, TCP offload, RDMA offload, iSCSI offload, FCoE offload and T10-DIX; the network, RDMA, iSCSI and FCoE drivers plus a lower layer driver sit above it; upper layers include NFS, iSCSI, iSER, SMB Direct (NDK), SMB, FCoE, and NVMe/Fabrics and RBD RPC, feeding OS block IO and NBD.]
NVMe/Fabrics Layering
[Diagram: Initiator — Linux file system and Linux block IO in the kernel, over NVMe/Fabrics, RDMA verbs and the RDMA driver, onto a T5 NIC with RDMA and TCP offload. Target — NVMe/Fabrics, RDMA verbs and the RDMA driver in the kernel, over a T5 NIC with RDMA and TCP offload, reaching the NVMe drive through the NVMe device driver.]
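To make the layering concrete, the fragment below hands a command capsule to the RDMA verbs layer as a SEND work request. It is an illustrative sketch only, not the NVMe/Fabrics implementation: send_capsule and its parameters are assumed names, and the queue pair and memory registration are presumed to be set up elsewhere.

    /* Illustrative only: post a registered command buffer as an RDMA SEND. */
    #include <infiniband/verbs.h>
    #include <stdint.h>

    int send_capsule(struct ibv_qp *qp, struct ibv_mr *mr,
                     void *capsule, uint32_t len)
    {
            struct ibv_sge sge = {
                    .addr   = (uintptr_t)capsule,    /* registered command buffer */
                    .length = len,
                    .lkey   = mr->lkey,
            };
            struct ibv_send_wr wr = {
                    .sg_list    = &sge,
                    .num_sge    = 1,
                    .opcode     = IBV_WR_SEND,       /* capsule travels as a SEND */
                    .send_flags = IBV_SEND_SIGNALED, /* completion reported on the CQ */
            };
            struct ibv_send_wr *bad = NULL;

            return ibv_post_send(qp, &wr, &bad);
    }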
RBD Layering
[Diagram: Initiator — Linux file system and Linux block IO in the kernel, over the RDMA block driver, RDMA verbs and the RDMA driver, onto a T5 NIC with RDMA and TCP offload. Target — the RDMA block driver, RDMA verbs, RDMA driver and Linux block IO in the kernel, over a T5 NIC with RDMA and TCP offload, reaching an NVMe drive, SAS drive, etc. through its block device driver.]
Generic RDMA Block Device (RBD)
- Open source Chelsio Linux driver
  - Part of Chelsio's Unified Wire SW
- Virtualizes block devices over an RDMA connection
- Target
  - Claims the physical device using blkdev_get_by_path()
  - Submits IOs arriving from the Initiator via submit_bio()
- Initiator
  - Registers as a BIO device driver
  - Registers connected Target devices as BIO devices using add_disk()
  (see the kernel-call sketch after this list)
- Current RBD protocol limits
  - Max RBD IO size is 128KB
  - Max of 256 outstanding IOs in flight
  - Initiator physical/logical sector size is 4KB
- Management via the rbdctl command
  - The Initiator provides a character device command interface for adding/removing devices
  - A simple command, rbdctl, is used to connect, log in, and register the target devices
- Examples
  - Target: load the rbdt module
  - Initiator: load the rbdi module
  - Add a new target: rbdctl -n -a <ipaddr> -d /dev/nvme0n1
  - Remove an RBD device: rbdctl -r -d /dev/rbdi0
  - List connected devices: rbdctl -l
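For orientation, here is a minimal sketch of the kernel entry points named above, written against a 3.18-era block layer. It is not the actual rbdt/rbdi source: the function names rbd_target_claim, rbd_target_issue and rbd_initiator_register are hypothetical, and submit_bio() takes different arguments on newer kernels.

    /* Hypothetical sketch of the calls named above (3.18-era kernel APIs). */
    #include <linux/module.h>
    #include <linux/fs.h>
    #include <linux/blkdev.h>
    #include <linux/genhd.h>
    #include <linux/bio.h>

    /* Target: claim the backing device (e.g. /dev/nvme0n1) exclusively. */
    static struct block_device *rbd_target_claim(const char *path)
    {
            return blkdev_get_by_path(path,
                                      FMODE_READ | FMODE_WRITE | FMODE_EXCL,
                                      THIS_MODULE);
    }

    /* Target: forward a bio rebuilt from an initiator request to the claimed
     * device; on 3.18, submit_bio() takes the rw flags explicitly. */
    static void rbd_target_issue(struct bio *bio)
    {
            submit_bio(bio->bi_rw, bio);
    }

    /* Initiator: expose a connected remote device as a local disk; 'disk' is
     * a struct gendisk prepared by the driver (alloc_disk(), request queue,
     * a name such as rbdi0). */
    static void rbd_initiator_register(struct gendisk *disk)
    {
            add_disk(disk);
    }

User space then drives these paths through the rbdctl examples listed above.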
RBD vs. NBD Throughput – 40GbE
[Chart: Throughput (Gbps) vs. I/O size (4K to 1M bytes) for RBD_Read, NBD_Read, RBD_Write and NBD_Write.]
Intel Xeon CPU E5-1660 v2 6-core processor clocked at 3.70GHz (HT enabled) with 64 GB of RAM, Chelsio T580-CR, RHEL 6.5 (kernel 3.18.14), standard MTU of 1500B, using RAMdisks on the target (4).
fio --name=<test_type> --iodepth=32 --rw=<test_type> --size=800m --direct=1 --invalidate=1 --fsync_on_close=1 --norandommap --group_reporting --ioengine=libaio --numjobs=4 --bs=<block_size> --runtime=30 --time_based --filename=/dev/rbdi0
RBD vs. NBD Latency – 40GbE
Intel Xeon CPU E5-1660 v2 6-core processor clocked at 3.70GHz (HT enabled) with 64 GB of RAM, Chelsio T580-CR, RHEL 6.5 (kernel 3.18.14), standard MTU of 1500B, using a RAMdisk on the target (1).
fio --name=<test_type> --iodepth=1 --rw=<test_type> --size=800m --direct=1 --invalidate=1 --fsync_on_close=1 --norandommap --group_reporting --ioengine=libaio --numjobs=1 --bs=<block_size> --runtime=30 --time_based --filename=/dev/rbdi0
[Chart: Latency (us) vs. I/O size (4K to 128K bytes) for RBD_Read, NBD_Read, RBD_Write and NBD_Write.]
SSD Latency Results
Supermicro MB, 16 cores (Intel(R) Xeon(R) CPU E5-2687W 0 @ 3.10GHz), 64GB memory, Chelsio T580-CR, RHEL 6.5, linux-3.18.14 kernel, standard MTU of 1500B, through an Ethernet switch, NVMe SSD target.
fio-2.1.10, 1 thread, direct IO, 30 second runs, libaio IO engine, IO depth = 1, random read, write, and read-write.
[Chart: Random Read Latency (usec) vs. I/O size (4 to 256 KB), Local vs. RBD.]
SSD Bandwidth Results
Supermicro MB, 16 cores (Intel(R) Xeon(R) CPU E5-2687W 0 @ 3.10GHz), 64GB memory, Chelsio T580-CR, RHEL 6.5, linux-3.18.14 kernel, standard MTU of 1500B, through an Ethernet switch, SSD target.
fio-2.1.10, 1 thread, direct IO, 30 second runs, libaio IO engine, IO depth = 32, random read, write, and read-write.
[Chart: Random Read Throughput (Gbps, 1 thread) vs. I/O size (4 to 256 KB), Local vs. RBD.]
Local vs. RBD Bandwidth – 2 SSD
[Charts: Read and Write throughput (Gbps) vs. I/O size (4 to 256 KB), Local vs. RBD, with two NVMe SSDs.]
Intel(R) Xeon(R) CPU E5-2687W 0 @ 3.10GHz, 2x6 cores, 64GB RAM, 2x NVMe SSD, Chelsio T580-CR, CentOS Linux release 7.1.1503, kernel 3.17.8, standard MTU of 1500B, through a 40Gb switch.
fio --direct=1 --invalidate=1 --fsync_on_close=1 --norandommap --group_reporting --name=foo --bs=$2 --size=800m --numjobs=8 --ioengine=libaio --iodepth=16 --rw=randwrite --runtime=30 --time_based --filename_format=/dev/rbdi\$jobnum --directory=/
RDMA over Fabrics Evolution
- Yesterday: DMA to/from host memory; the CPU handles command transfer
- Today: peer-to-peer DMA through the PCIe bus; no host memory use
- Tomorrow: network-integrated NVMe storage; no CPU or memory involvement
[Diagram: three stages of an RNIC and SSDs on a PCIe bus with CPU and memory, showing the data path moving from host memory, to peer-to-peer PCIe transfers, to storage integrated directly with the network.]
Conclusions
- NVMe over Fabrics enables scalable, high performance, high efficiency fabrics that leverage disruptive solid-state storage
  - Standard interfaces and middleware
  - Ease of use and widespread support
- iWARP RDMA is ideal for NVMe/Fabrics
  - Familiar and proven foundation
  - Plug-and-play
  - Works with DCB but does not require it: no DCB tax and complexity
  - Scalability, reliability, routability: any switches and any routers over any distance
  - Scales from 40 to 100Gbps and beyond
- NVMe/Fabrics devices with built-in processing power