TRANSCRIPT
SMB Direct, The Secret Decoder Ring
E2EVC 2015 Berlin, June 12th-14th
Didier Van Hoye, Technical Architect & Technology Strategist
Microsoft MVP in Hyper-V
MEET Member
DELL TechCenter Rockstar
http://workinghardinit.wordpress.com @workinghardinit
What?
Why?
Where?
How?
Who? When?
What is SMB Direct?
SMB Direct is SMB over RDMA
So what is RDMA?
Direct Memory Access (DMA) allows access (read/write) to the host memory directly, without the intervention of the CPU(s).
Remote Direct Memory Access (RDMA) extends this ability to remote systems.
What makes RDMA “great”?
Kernel Bypass: applications can perform data transfers directly from user space, without context switches into the kernel.
Zero-Copy: applications can transfer data from one host to another without using the software network stack. Data is read & written, sent & received, directly from/to memory buffers without being copied across the different network layers.
What makes RDMA “great”?
• CPU Bypass / Offload: applications can access remote memory without consuming any host CPU cycles on the remote machine. The remote machine's memory is read/written without involving remote processes (meaning no CPU cycles).
• The caches in the remote CPU(s) won't be filled with the accessed memory content.
• Transport protocol acceleration: message-based transactions (TCP is stream-based).
• Scatter/gather entries support: reading multiple memory buffers and sending them as one, or reading one and writing it to multiple memory buffers.
Q: Where did RDMA grow up?
A: Where the benefits were worth the premium price.
1. Low latency
2. High throughput
3. Reduced CPU footprint
4. “lossless” fabrics (you have to provide this)
HPC, Financial transactions, medical imaging, storage, backup, cloud computing, …
Why do I care?
Why is RDMA important to us?
1. The BIG surge in East-West traffic due to virtualization induces a performance & scalability hit.
2. Storage over file system shares is also highly demanding.
In other words, not using RDMA:
• Slows us down
• Loses us money (we can't leverage the capabilities of our hardware & software)
Modern commodity network hardware …
A modern cloud OS, servers & storage
Flavors of RDMA
Non-RDMA Ethernet (wide variety of NICs)
Pros:
• TCP/IP-based protocol
• Works with any Ethernet switch
• Wide variety of vendors and models
• Support for in-box NIC teaming
Cons:
• High CPU utilization under load
• High latency

iWARP (Chelsio T4/T5)
Pros:
• Low CPU utilization under load
• Low latency
• TCP/IP-based protocol
• Works with any Ethernet switch
• RDMA traffic routable
• Offers up to 40Gbps per NIC port today
Cons:
• Requires enabling firewall rules

RoCE (Mellanox ConnectX-3, ConnectX-3 Pro, ConnectX-4)
Pros:
• Ethernet-based protocol
• Works with Ethernet switches with DCB support
• Routable RoCEv2 is here
• Offers up to 100Gbps per NIC port today
Cons:
• RoCEv1 non-routable
• Requires DCB switches with Priority Flow Control (PFC)

InfiniBand (Mellanox ConnectX-3, ConnectX-3 Pro, ConnectX-4, Connect-IB)
Pros:
• Switches typically less expensive per port
• Switches offer high-speed Ethernet uplinks
• Commonly used in HPC environments
• Offers up to 100Gbps per NIC port today
Cons:
• Not an Ethernet-based protocol
• RDMA traffic not routable via IP infrastructure
• Requires InfiniBand switches
• Requires a subnet manager (typically on the switch)
What is DCB?
• Priority-based Flow Control (PFC). Provides a link-level, flow-control mechanism that can be independently controlled for each priority to ensure zero-loss due to converged-network congestion.
• Quantized Congestion Notification (QCN). Provides end-to-end congestion management for protocols without built-in congestion-control mechanisms. It's also expected to benefit protocols with existing congestion management by providing more timely reactions to network congestion.
• Enhanced Transmission Selection (ETS). Provides a common management framework for bandwidth assignment to traffic classes.
• Data Center Bridging Exchange Protocol (DCBx). A discovery and capability exchange protocol used to convey capabilities and configurations of the other three DCB features between neighbors to ensure consistent configuration across the network.
How does PFC work?
How does ETS work?
What about QCN?
• Missing in Windows & a lot of switches for now
• Complex to achieve across all hops & end to end
• Is the lack of end-to-end congestion notification an issue?
• End to end might not be needed (hop-based in FC, with ingress rate limiting, does the job; it doesn't work in FCoE due to FCF/Layer 3 MAC rewriting) …
• Apparently not (though you won't hear anyone put that in writing), as RoCEv2 is now being used in public clouds without problems …
• Some state the industry silicon is ready, but … the standard/implementation has yet to come to many switches …
• Some say it's a worry; it depends on the design & protocol
• Devil in the details = layer 2, so not routable ;-)
What about DCBX?
• Missing in Windows for now.
• It's a convenience issue, solved by automating your DCB configuration (PowerShell).
• But convenience is important, and I expect Microsoft to look into this and possibly provide it in future versions of Windows.
• The benefit is that the DCB configuration is learned from the switches.
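Until then, the usual approach is to tell Windows to ignore whatever DCBX the switch advertises and configure DCB locally. A one-liner sketch with the in-box DcbQos cmdlet:

# Set the DCBX willing bit to false: Windows will not accept DCB settings from the switch
Set-NetQosDcbxSetting -Willing $false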
How do I configure DCB?
Configure OS, Application & Network
• Don’t mix QoS types
• Configure QoS for all payloads
• You need rNICs (RDMA-capable NICs)
• Configure your switches end to end (ports, uplinks)
• Configure your host & rNICs (see the host-side sketch below)
• You must do PFC (this requires tagged VLANs on hosts & switches)
• You can do ETS (harder to test & make work in heterogeneous environments)
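A minimal host-side sketch using the in-box NetQos cmdlets. Priority 4 and the 90/10 split mirror the switch-side dcb-map shown later in this deck; the adapter name is a made-up example:

# Tag SMB Direct traffic (TCP port 445) with 802.1p priority 4
New-NetQosPolicy "SMBDIRECT" -NetDirectPortMatchCondition 445 -PriorityValue8021Action 4

# PFC on for the SMB priority only, off for all others
Enable-NetQosFlowControl -Priority 4
Disable-NetQosFlowControl -Priority 0,1,2,3,5,6,7

# Reserve bandwidth for SMB via ETS
New-NetQosTrafficClass "SMBDIRECT" -Priority 4 -BandwidthPercentage 90 -Algorithm ETS

# Enable QoS/DCB processing on the rNIC
Enable-NetAdapterQos -Name "RDMA-NIC1"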
Disable 802.3x flow control (global pause)
FTOS#configure
FTOS(conf)#interface range tengigabitethernet 0/0 -47
FTOS(conf-if-range-te-0/0-47)#no flowcontrol rx on tx on
FTOS(conf-if-range-te-0/0-47)#exit
FTOS(conf)#interface range fortyGigE 0/48 , fortyGigE 0/52
FTOS(conf-if-range-fo-0/48-52)#no flowcontrol rx on tx on
FTOS(conf-if-range-fo-0/48-52)#exit
Enable DCB & Configure VLANs
FTOS(conf)#service-class dynamic dot1p
FTOS(conf)#dcb enable
FTOS(conf)#exit
FTOS#copy running-config startup-config
FTOS#reload
FTOS#configure
FTOS(conf)#interface vlan 110
FTOS(conf-if-vl-vlan-id*)#tagged tengigabitethernet 0/0-47
FTOS(conf-if-vl-vlan-id*)#tagged port-channel 3
FTOS(conf-if-vl-vlan-id*)#exit
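The hosts need matching tagged VLANs on their rNICs. A sketch (VLAN 110 mirrors the switch config above; the adapter name is a made-up example):

# Tag the rNIC with VLAN 110 so the 802.1p priority bits have a header to live in
Set-NetAdapter -Name "RDMA-NIC1" -VlanID 110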
Create & configure DCB Map Policy
FTOS(conf)#dcb-map SMBDIRECT
FTOS(conf-dcbmap-profile-name*)#priority-group 0 bandwidth 90 pfc on
FTOS(conf-dcbmap-profile-name*)#priority-group 1 bandwidth 10 pfc off
FTOS(conf-dcbmap-profile-name*)#priority-pgid 1 1 1 1 0 1 1 1
FTOS(conf-dcbmap-profile-name*)#exit

PFC => IEEE 802.1Qbb
The priority-pgid line maps 802.1p priority 4 to priority group 0 (90% bandwidth, PFC on) and all other priorities to group 1 (10%, PFC off); it must match the priority you tag SMB Direct traffic with on the hosts.
Apply DCB map to the switch ports & uplinks
FTOS(conf)#interface range tengigabitethernet 0/0 -47
FTOS(conf-if-range-te-0/0-47)#dcb-map SMBDIRECT
FTOS(conf-if-range-te-0/0-47)#exit
FTOS(conf)#interface range fortyGigE 0/48 , fortyGigE 0/52
FTOS(conf-if-range-fo-0/48,fo-0/52)#dcb-map SMBDIRECT
FTOS(conf-if-range-fo-0/48,fo-0/52)#exit
FTOS(conf)#exit
FTOS#copy running-config startup-config
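With the switch side done, you can sanity-check what the rNIC actually negotiated from Windows. A minimal sketch (the adapter name is a made-up example):

# Show the operational QoS/DCB settings (traffic classes, PFC) as seen by the rNIC
Get-NetAdapterQos -Name "RDMA-NIC1"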
Demo Time
Jumbo Frames?
• InfiniBand? Exists, with an MTU size up to 4K.
– Impact on SMB Direct?
• RoCE? Yes, from the familiar 1500 MTU size up to 9K (or more, depending on your NICs/switches).
– Impact on SMB Direct?
– Also useful if you fall back to non-RDMA (TCP/IP)
• iWARP? Yes, from the familiar 1500 MTU size up to 9K (or more, depending on your NICs/switches).
– Impact on SMB Direct?
– Same as above: handy during fallback to non-RDMA (TCP/IP)!
https://workinghardinit.wordpress.com/2013/11/25/live-migration-can-benefit-from-jumbo-frames/
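Host-side, jumbo frames are one advanced property per adapter. A sketch using the standardized *JumboPacket keyword (the adapter name and the 9014 value are examples; your NICs & switches must support the size):

# Set a 9014-byte jumbo frame size on the rNIC
Set-NetAdapterAdvancedProperty -Name "RDMA-NIC1" -RegistryKeyword "*JumboPacket" -RegistryValue 9014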
Configure SMB Direct On The Windows Host
• InfiniBand?
– https://technet.microsoft.com/en-us/library/dn583823.aspx
• RoCE?
– https://technet.microsoft.com/en-us/library/dn583822.aspx
• iWARP?
– https://technet.microsoft.com/en-us/library/dn583825.aspx
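Whichever guide applies to your fabric, verify that SMB actually negotiated RDMA. A minimal sketch with the in-box SMB cmdlets:

# Both ends must report RDMA Capable = True for SMB Direct to kick in
Get-SmbClientNetworkInterface
Get-SmbServerNetworkInterface

# On the client, inspect live connections to see what SMB Multichannel selected
Get-SmbMultichannelConnection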
Prevent live migration from starving storage traffic
Set-SmbBandwidthLimit -Category LiveMigration -BytesPerSecond 10GB
Set-SmbBandwidthLimit -Category VirtualMachine -BytesPerSecond 10GB
https://workinghardinit.wordpress.com/2013/09/03/preventing-live-migration-over-smb-starving-csv-traffic-in-windows-server-2012-r2-with-set-smbbandwidthlimit/
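Note that Set-SmbBandwidthLimit only works once the SMB Bandwidth Limit feature is installed. A quick sketch:

# Install the SMB Bandwidth Limit feature (Windows Server 2012 R2)
Install-WindowsFeature FS-SMBBW

# Review the limits currently in place
Get-SmbBandwidthLimit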
PerfMon is your friend
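You can pull the same counters from PowerShell. The counter-set names below are what Windows Server 2012 R2 exposes on rNIC hosts, but they may vary with your drivers, so discover them first:

# Discover the SMB Direct & RDMA counter sets and their counters
Get-Counter -ListSet "SMB Direct*" | Select-Object -ExpandProperty Counter
Get-Counter -ListSet "RDMA Activity" | Select-Object -ExpandProperty Counter

# Take a few samples to confirm traffic really flows over RDMA
Get-Counter -Counter (Get-Counter -ListSet "RDMA Activity").Counter -SampleInterval 2 -MaxSamples 5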
SMB Multichannel & SMB Direct “break” sometimes
• Enabling/disabling RDMA
• Enabling/disabling Multichannel
• Enabling/disabling the NICs
• In rare cases a firmware upgrade to the switches can require a reboot of the host(s)
Keep the above in mind when testing & experimenting or troubleshooting. Always start from a clean, known state.
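When in doubt, force the SMB client to re-evaluate its connections instead of guessing. A minimal sketch:

# Re-discover interfaces & re-establish SMB Multichannel connections
Update-SmbMultichannelConnection

# Confirm RDMA is back in use
Get-SmbMultichannelConnection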
SMB Direct requires SMB Multichannel
Multiple RDMA NICs
[Diagram: an SMB Client and an SMB Server, each with two 10GbE/IB RDMA NICs, connected through two 10GbE/IB switches]
BIOS power optimizations still apply!
• Disable C states in BIOS / UEFI
• Disable C1E states in BIOS / UEFI
• Disable PCIe Link Power Management in BIOS; basically set all power optimizations to max performance
• Optimize memory for speed
• The faster the cards & bus speeds the better …
https://workinghardinit.wordpress.com/2013/06/10/still-need-to-optimizing-power-settings-on-dell-12th-generation-servers-for-lightning-fast-hyper-v-live-migrations/
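On the OS side, pair those BIOS settings with the High Performance power plan. A quick sketch:

# Activate the built-in High Performance power plan (scheme_min is its alias)
powercfg /setactive scheme_min

# Verify which plan is active
powercfg /getactivescheme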
The Agony of Choice …
• InfiniBand will be around for a long time (just like FC)
– What will they do to offset Ethernet's speed growth? Their strategy seems to be RoCE, but iWARP is giving them a fight for their money.
• Ethernet is likely to grow fast in:
– New deployments (no InfiniBand in place)
– >10Gbps deployments, but only if price/Gbps drops low enough compared to InfiniBand; that's where Mellanox is outperforming Chelsio (cheap switches & cards)
• What flavor should I use?
– RoCE has shown great potential as the official heir to InfiniBand, but must address concerns around real-life lossless routability and the complexity of DCB (PFC/ETS) in a heterogeneous world
– iWARP has the advantage of TCP/IP routability & ease of deployment, but must address concerns around the need for DCB configuration in larger / high-performance deployments & the overhead of relying on the TCP/IP stack
– Remember … there are successful SMB 3 / SOFS / Storage Spaces deployments out there without SMB Direct … (it all depends), but build for the future …