High Performance File Serving with SMB3 and RDMA via SMB Direct
TRANSCRIPT
2012 Storage Developer Conference. © Microsoft Corporation. All Rights Reserved.
High Performance File Serving with SMB3 and RDMA via SMB Direct
Tom Talpey, Microsoft; Greg Kramer, Microsoft
Protocol
SMB Direct: a new protocol supporting SMB 3.0 over RDMA
Minimal CPU overhead; high bandwidth, low latency
Fabric agnostic: iWARP, InfiniBand, RoCE; IP addressing
IANA port: smbdirect, 5445
[Diagram: File Client and File Server stacks. Client: Application (user mode) over SMB3 Client (kernel mode) over an R-NIC. Server: SMB3 Server over NTFS/SCSI to Disk, with its own R-NIC. The two R-NICs are connected by a network with RDMA support.]
Documented
MS-SMBD: http://msdn.microsoft.com/en-us/library/hh536346.aspx
MS-SMB2: http://msdn.microsoft.com/en-us/library/cc246482.aspx
Windows kRDMA API (NDKPI): http://msdn.microsoft.com/en-us/library/windows/hardware/jj206456.aspx
Part of the Windows Driver Kit; Network Direct (and Verbs) heritage
Implemented
Windows Server 2012: SMB 3.0 over SMB Direct
Supports Multichannel, Continuous Availability, and all other SMB 3.0 features
Basics
SMB Direct is a transport framing with only 3 message types
2-way full-duplex transport which supports:
Datagram-type send/receive exchange, with fragmentation/reassembly for “large” messages
Direct RDMA Read/Write
The SMB 3.0 binding defines transport use:
Client buffer advertisement for READ and WRITE
Server RDMA buffer access (push/pull)
Use
Discovery via SMB 3.0 Multichannel: the “RDMA” attribute of an interface
Negotiated capabilities: SMB Direct version, message and RDMA region sizes, credits
Messages and RDMA Read operations (via the NDK provider)
Three messages

SMB Direct Negotiate Request (sent once, at connection establishment):
  MinVersion (2 octets) | MaxVersion (2 octets)
  Reserved (2 octets) | CreditsRequested (2 octets)
  PreferredSendSize (4 octets)
  MaxReceiveSize (4 octets)
  MaxFragmentedReceiveSize (4 octets)

SMB Direct Negotiate Response (sent once, in reply):
  MinVersion (2 octets) | MaxVersion (2 octets)
  NegotiatedVersion (2 octets) | Reserved (2 octets)
  CreditsRequested (2 octets) | CreditsGranted (2 octets)
  Status (4 octets)
  MaxReadWriteSize (4 octets)
  PreferredSendSize (4 octets)
  MaxReceiveSize (4 octets)
  MaxFragmentedReceiveSize (4 octets)

SMB Direct Data Transfer Header (used for everything else):
  CreditsRequested (2 octets) | CreditsGranted (2 octets)
  Flags (2 octets) | Reserved (2 octets)
  RemainingDataLength (4 octets)
  DataOffset (4 octets)
  DataLength (4 octets)
  Padding, then Data (variable)
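The Data Transfer header can be packed mechanically. A minimal Python sketch: little-endian encoding and 4 bytes of padding (so the 20-byte fixed part is rounded up and the payload begins at offset 24, matching the 24-byte headers in the send-train example later in this deck) are assumptions here; consult MS-SMBD for the normative layout.

```python
import struct

# Fixed fields in slide order: CreditsRequested, CreditsGranted, Flags,
# Reserved, RemainingDataLength, DataOffset, DataLength (20 bytes total).
HDR = struct.Struct('<HHHHIII')

def pack_data_transfer(credits_req, credits_granted, flags, remaining, payload):
    header = HDR.pack(credits_req, credits_granted, flags, 0,
                      remaining, 24, len(payload))
    return header + b'\x00' * 4 + payload  # pad so data starts at offset 24

def unpack_data_transfer(msg):
    creq, cgr, flags, _rsvd, remaining, off, dlen = HDR.unpack_from(msg)
    return creq, cgr, flags, remaining, msg[off:off + dlen]

wire = pack_data_transfer(10, 10, 0, 0, b'SMB3...')
assert unpack_data_transfer(wire)[4] == b'SMB3...'
```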
Transfers
Send/Receive model: a single logical message, possibly sent as a fragmentation “train”
Uses the ordering properties of RDMA
Implements crediting; all SMB 3.0 operations use this
Direct placement model: advertises RDMA regions in a scatter/gather list
SMB 3.0 uses it for SMB2_READ and SMB2_WRITE only, piggybacking on the existing “Channel” field
Send transfers

Example: a 2048-byte SMB3 message fragmented into a three-send train (1000-byte payload per send):
Send 0: SMB Direct HDR (24 bytes) + SMB3 message bytes 0-999; DataOffset = 24, DataLength = 1000, RemainingDataLength = 1048
Send 1: SMB Direct HDR (24 bytes) + SMB3 message bytes 1000-1999; DataOffset = 24, DataLength = 1000, RemainingDataLength = 48
Send 2: SMB Direct HDR (24 bytes) + SMB3 message bytes 2000-2047; DataOffset = 24, DataLength = 48, RemainingDataLength = 0
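The fragmentation arithmetic above can be sketched in a few lines of Python. The 1000-byte payload limit mirrors the example; a real connection would derive it from the negotiated MaxReceiveSize minus header overhead.

```python
def fragment(total_len, max_payload=1000):
    """Split one logical SMB3 message into a send train.
    Returns (DataLength, RemainingDataLength) per SMB Direct send."""
    sends, sent = [], 0
    while sent < total_len:
        chunk = min(max_payload, total_len - sent)
        sent += chunk
        sends.append((chunk, total_len - sent))  # bytes still to follow
    return sends

# Reproduces the 2048-byte example above:
assert fragment(2048) == [(1000, 1048), (1000, 48), (48, 0)]
```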
SMB3 Reads and Writes

SMB3 WRITE REQUEST:
  StructureSize (2 octets) | DataOffset (2 octets)
  Length (4 octets)
  Offset (8 octets)
  FileId (16 octets)
  Channel (4 octets)
  RemainingBytes (4 octets)
  WriteChannelInfoOffset (2 octets) | WriteChannelInfoLength (2 octets)
  Flags (4 octets)
  Buffer (variable)

SMB3 READ REQUEST:
  StructureSize (2 octets) | Padding (1 octet) | Reserved (1 octet)
  Length (4 octets)
  Offset (8 octets)
  FileId (16 octets)
  MinimumCount (4 octets)
  Channel (4 octets)
  RemainingBytes (4 octets)
  ReadChannelInfoOffset (2 octets) | ReadChannelInfoLength (2 octets)
  Flags (4 octets)
  Buffer (variable)

The Channel, RemainingBytes, and channel-info fields occupy previously reserved fields.

Channel array element (one RDMA buffer descriptor):
  Address (8 octets)
  Token (4 octets)
  Length (4 octets)
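A Channel array element is a fixed 16-byte descriptor, so building a scatter/gather list is a simple pack. A sketch in Python; little-endian layout is an assumption consistent with other SMB2 on-the-wire structures, and the address/token values below are made up for illustration.

```python
import struct

# One Channel array element: Address (8) | Token (4) | Length (4)
DESC = struct.Struct('<QII')

def pack_descriptors(regions):
    """regions: iterable of (address, token, length) tuples describing
    registered RDMA buffers to advertise to the server."""
    return b''.join(DESC.pack(addr, tok, length) for addr, tok, length in regions)

blob = pack_descriptors([(0x7F0000001000, 0x1234, 65536)])
assert len(blob) == 16
```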
RDMA transfers

SMB Direct WRITE (client data to server; server pulls):
1. Client sends: SMB Direct HDR + SMB3 HDR + SMB3 WRITE REQ carrying memory descriptors
2. Server issues an RDMA Read to pull the DATA from the client's buffer
3. Server sends: SMB Direct HDR + SMB3 HDR + SMB3 WRITE RESP

SMB Direct READ (server data to client; server pushes):
1. Client sends: SMB Direct HDR + SMB3 HDR + SMB3 READ REQ carrying memory descriptors
2. Server issues an RDMA Write to push the DATA into the client's buffer
3. Server sends: SMB Direct HDR + SMB3 HDR + SMB3 READ RESP
Credits
Bi-directional: a count of ready receive buffers offered to the peer
Dynamic: can increase or decrease at any time (optional to do so)
Used only to control low-level SMBD message exchanges; recycled independently of SMB operations
Relatively small numbers required (hundreds, even for deep random workloads)
Quirks
Interesting corner cases:
“Last credit”: each endpoint always needs 1 to avoid deadlock (but see details in the spec!)
Bi-directional: no requirement that the grants be the same both ways
Async/Cancel/Errors: no reply, multiple replies, unexpected large reply
NOT an RPC-like interface, much as it may resemble one
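The crediting rules, including the last-credit case, can be modeled compactly. This is an illustrative sketch, not the full state machine in MS-SMBD: the idea is that a sender never spends its final send credit on a message that does not itself request credits, or neither side could replenish the other.

```python
# Toy model of SMBD-style crediting between two endpoints.
class Credits:
    def __init__(self, initial_send, receive_buffers):
        self.send = initial_send          # sends the peer will currently accept
        self.receives_posted = receive_buffers

    def can_send(self, credits_requested):
        if self.send <= 0:
            return False
        if self.send == 1 and credits_requested == 0:
            return False                  # would deadlock on the last credit
        return True

    def on_send(self):
        self.send -= 1                    # one credit consumed per send

    def on_receive(self, credits_granted):
        self.receives_posted -= 1         # one posted receive buffer consumed
        self.send += credits_granted      # peer replenished us

c = Credits(initial_send=1, receive_buffers=16)
assert not c.can_send(credits_requested=0)   # last-credit rule
assert c.can_send(credits_requested=10)      # OK: message asks for more
```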
Efficiency
True bi-directional and streaming sends
Can be exposed as a sockets-like interface, with register/unregister/RDMA read-write extensions
RDMA operations and completions: datamover offload to the RNIC
Server “pull” model improves performance
Many options for RDMA efficiency: FRMRs, silent completions, coalescing, etc.
Resources bounded by credits and sizes
Performance
SDC 2011 performance results
Setup: two machines connected by a single 32 Gbps InfiniBand link through an InfiniBand switch
Nehalem: 1 socket x 4 cores @ 2.26 GHz; Westmere: 2 sockets x 6 cores @ 2.66 GHz
Each machine: RAID 0 across 12 SSDs
Results: 160,000 IOPS (1 KiB random reads); 3200 MiB/sec (512 KiB sequential reads)
Current performance results
[Diagram: File Client (SMB 3.0) running SQLIO, connected via two RDMA NICs per machine to the File Server (SMB 3.0). The server has two SAS HBAs, each attached to a JBOD of eight SSDs, pooled with Storage Spaces and formatted NTFS.]
SQLIO: http://www.microsoft.com/en-us/download/details.aspx?displaylang=en&id=20163
Current performance results…
512 KiB sequential reads (sqlio2.exe -T100 -t2 -s60 -b512 -o4 -fsequential -BN -LS, 1 file per volume):
  Avg. MB/sec*: 7,340 | Avg. IOs/sec: ~14K | Avg. %CPU (client): 8.6 | Avg. latency: 1 ms
  Server fully utilized
8 KiB random reads (sqlio2.exe -T100 -t16 -s60 -b8 -o4 -frandom -BN -LS, four files per volume):
  Avg. MB/sec*: 3,711 | Avg. IOs/sec: ~453K | Avg. %CPU (client): 60 | Avg. latency: < 1 ms
  Server fully utilized
* 1 MB = 1,000,000 bytes
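As a sanity check, the two rows above are internally consistent: MB/sec (decimal, per the footnote) divided by the IO size (binary KiB) reproduces the reported IOs/sec.

```python
# Cross-check throughput vs. IOPS for the two result rows.
def iops(mb_per_sec, io_kib):
    # MB is decimal (1,000,000 bytes); IO sizes are binary KiB.
    return mb_per_sec * 1_000_000 / (io_kib * 1024)

assert round(iops(7_340, 512) / 1000) == 14    # ~14K IOs/sec at 512 KiB
assert round(iops(3_711, 8) / 1000) == 453     # ~453K IOs/sec at 8 KiB
```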
Let’s take it to 11!
[Diagram: the same File Client (SMB 3.0, SQLIO) / File Server (SMB 3.0) pair, scaled up: additional RDMA NICs (six shown across the two machines) and six SAS HBAs, each attached to a JBOD of eight SSDs (48 SSDs total), pooled with Storage Spaces and formatted NTFS.]
Let’s take it to 16!

512 KiB sequential reads (sqlio2.exe -T100 -t2 -s60 -b512 -o4 -fsequential -BN -LS, 1 file per volume):
  Avg. MB/sec*: 16,253 | Avg. IOs/sec: ~31K | Avg. %CPU (client): 15 | Avg. latency: 1 ms
16 gigaBYTES (not bits) per second of storage throughput!
* 1 MB = 1,000,000 bytes
NUMA effects on performance
At these speeds, NUMA effects cannot be ignored
To achieve peak performance, the SMB3 / SMB Direct stack must avoid cross-NUMA node memory accesses whenever possible.
Test case: 8 KiB random reads (sqlio2.exe -T100 -t16 -s60 -b8 -o4 -frandom -BN -LS, four files per volume)
NUMA-aware multichannel dispatcher: Avg. MB/sec* 3,711 | Avg. IOs/sec 453K | Avg. %CPU (client) 60 | Avg. latency < 1 ms
NUMA-unaware multichannel dispatcher: Avg. MB/sec* 3,719 | Avg. IOs/sec 454K | Avg. %CPU (client) 76 | Avg. latency < 1 ms
* 1 MB = 1,000,000 bytes
NUMA and SMB3 Multichannel
SMB3 Multichannel can be used to improve performance on NUMA systems: the SMB3 session is split across multiple channels, and channels are affinitized to a set of NUMA nodes. The client dispatches IO requests to maximize performance and minimize cross-NUMA-node memory accesses.
This is one example of how the Windows Server 2012 SMB3 / SMB Direct stack has been optimized for high performance on NUMA systems.
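The dispatch idea can be sketched as follows: pick a channel whose RNIC sits on the same NUMA node as the IO buffer, and fall back to any channel otherwise. The node lookup and the channel names are hypothetical; on Windows the real affinity information comes from the NDKPI provider.

```python
from itertools import cycle

class MultichannelDispatcher:
    """Toy NUMA-aware channel picker (round-robin within a node)."""
    def __init__(self, channels_by_node):
        # channels_by_node: {numa_node: [channel, ...]}
        self._per_node = {n: cycle(chs) for n, chs in channels_by_node.items()}
        self._any = cycle([c for chs in channels_by_node.values() for c in chs])

    def pick(self, buffer_node):
        # Prefer a channel local to the buffer's NUMA node to avoid
        # cross-node memory accesses; otherwise use any channel.
        rr = self._per_node.get(buffer_node)
        return next(rr) if rr else next(self._any)

d = MultichannelDispatcher({0: ['nic0'], 1: ['nic1']})
assert d.pick(0) == 'nic0' and d.pick(1) == 'nic1'
assert d.pick(7) in ('nic0', 'nic1')  # unknown node: fall back to any NIC
```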
That’s great! Now what?
Are there simple improvements we could make to the SMB Direct protocol?
Goals: ease of implementation; increase IOPS; decrease latency; decrease CPU utilization
Where can we reduce IO costs?
Current IO flow (App → SMB Client → Client RNIC → Server RNIC):
1. App issues ReadFile()
2. SMB Client registers the buffer (consumes CPU cycles); register status returned
3. SMB Client sends the SMB request; send status returned
4. Server RNIC performs the RDMA write of the data
5. Server sends the SMB response
6. SMB Client invalidates the registration (consumes CPU cycles); invalidate status returned
7. ReadFile() status returned to the App
Aggressive invalidation: consumes CPU cycles, consumes RNIC/bus cycles, increases interrupts/sec, increases IO latency
Why aggressively invalidate?
Applications will likely reuse the same buffers for subsequent IO requests. So why not cache and reuse buffer registrations?
The peer can RDMA write after the IO has completed: data corruption / system crash / connection loss
The peer can RDMA read after the IO has completed: data leak / connection loss
Registration caches are not robust enough for storage and enterprise server applications.
Why aggressively invalidate?
Invalidation provides strict correctness guarantees with respect to data:
Data is in a consistent state following DMA
The application can safely access its data
The peer no longer has access to the region
No data corruption, crashes, or leaks due to peer-initiated RDMA operations
Aggressive invalidation is a necessary expense, but we might be able to reduce its cost…
Use Send with Invalidate?
Flow with Send with Invalidate (App → SMB Client → Client RNIC → Server RNIC):
1. App issues ReadFile()
2. SMB Client registers the buffer (consumes CPU cycles); register status returned
3. SMB Client sends the SMB request; send status returned
4. Server RNIC performs the RDMA write of the data
5. Server sends the SMB response carrying the token to invalidate; the client RNIC invalidates the registration before indicating the received data
6. ReadFile() status returned to the App
Benefits of send with invalidate...
Reduces RNIC work requests by 1/3rd for small IOs (IOs that require one memory descriptor): fewer CPU cycles, fewer RNIC/bus cycles, fewer interrupts, lower IO latency
Already supported by the major RDMA standards: iWARP, InfiniBand, RoCE
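The 1/3 figure follows from simple counting: for a small IO, the client today posts a register, a send, and an explicit invalidate; with send-with-invalidate the RNIC retires the registration on receipt, so the invalidate work request disappears.

```python
# Back-of-envelope for the "1/3 fewer work requests" claim.
today = ['register', 'send', 'invalidate']
with_send_invalidate = ['register', 'send']

saving = 1 - len(with_send_invalidate) / len(today)
assert abs(saving - 1/3) < 1e-9  # one of three work requests eliminated
```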
Benefits of send with invalidate…
No change to the SMB Direct protocol: make send with invalidate an optional feature; the client continues to invalidate the buffer itself if the server does not.
Minimal change to the SMB3 protocol: the SMB3 read/write request indicates when the server is requested to invalidate the request's memory descriptor via the server's response.
Not a committed plan (investigation only). Feedback?
Summary
SMB3 and SMB Direct allow Windows Server 2012 to efficiently host enterprise application workloads.
The SMB3 / SMB Direct protocols could be enhanced in simple ways to further improve performance: increase IOPS, decrease CPU overhead, decrease latency.
Questions?
http://smb3.info