TRANSCRIPT
Controlling Congestion in New Storage Architectures
September 15, 2015
Today's Presenters

Chad Hintz, SNIA Ethernet Storage Forum Board Member; Solutions Architect, Cisco
David L. Fair, SNIA Ethernet Storage Forum Chair; Intel
SNIA Legal Notice

- The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted.
- Member companies and individual members may use this material in presentations and literature under the following conditions:
  - Any slide or slides used must be reproduced in their entirety without modification.
  - The SNIA must be acknowledged as the source of any material used in the body of any document containing material from these presentations.
- This presentation is a project of the SNIA Education Committee.
- Neither the author nor the presenter is an attorney, and nothing in this presentation is intended to be, or should be construed as, legal advice or an opinion of counsel. If you need legal advice or a legal opinion, please contact your attorney.
- The information presented herein represents the author's personal opinion and current understanding of the relevant issues involved. The author, the presenter, and the SNIA do not assume any responsibility or liability for damages arising out of any reliance on or use of this information. NO WARRANTIES, EXPRESS OR IMPLIED. USE AT YOUR OWN RISK.
Terms

- STP (Spanning Tree Protocol): an older network protocol that ensures a loop-free topology for any bridged Ethernet local area network.
- Spine/Leaf, Clos, Fat-Tree, Multi-Rooted Tree: a network based on routing, with all links active.
- VXLAN: a standards-based Layer 2 overlay scheme over a Layer 3 network.
Agenda

- Data Center "Fabrics": Current State
- Current Congestion Control Mechanisms
- CONGA's Design
- Why Is CONGA Important for IP-Based Storage and Congestion Control?
- Q&A
Storage over Ethernet Needs and Concerns

Reliability and performance concerns:
- Minimize the chance of dropped traffic
- In-order frame delivery
- Minimal oversubscription
- Lossless fabric for FCoE (no drop)
Data Center Fabric: Current State

The data center "fabric" journey:
1. STP (Spanning Tree): loop prevention by blocking links
2. STP with Multi-Chassis EtherChannel
3. Spine-Leaf: Layer 3, all links active
4. VXLAN/EVPN: Layer 3 with a Layer 2 overlay, extending across the MAN/WAN
Multi-Rooted Tree (Spine/Leaf) = Ideal DC Network

The ideal DC network is a giant switch with 1000s of server ports:
- No internal bottlenecks → predictable performance
- Simplifies bandwidth management

We can't build that switch. A multi-rooted tree with 1000s of server ports approximates it, but it has possible bottlenecks, so it needs precise load balancing.
Leaf-Spine DC Fabric

A leaf-spine fabric (hosts H1-H9 attached to leaf switches, leaf switches attached to spine switches) approximates the ideal giant switch.

- How close is leaf-spine to the ideal giant switch?
- What impacts its performance? Link speeds, oversubscription, buffering.
Today: Equal-Cost Multipath (ECMP) Routing

- Picks among equal-cost paths by an algorithm (a hash of the flow's header fields)
  - Randomized load balancing
  - Preserves packet order (good for TCP)

Problems:
- Hash collisions
- No awareness of congestion
- Each flow is mapped to a single path (loss-of-link issues)
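As a minimal sketch of the idea (the hash choice and field encoding below are illustrative, not any particular switch's implementation), ECMP-style selection maps each flow's 5-tuple to one uplink:

```python
import hashlib

def ecmp_pick_uplink(src_ip, dst_ip, proto, src_port, dst_port, num_uplinks):
    """Map a flow's 5-tuple to one of num_uplinks equal-cost paths.

    Every packet of the same flow hashes to the same uplink, which
    preserves packet order but also pins the whole flow to one path,
    regardless of how congested that path is.
    """
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_uplinks

# Two long-lived flows can collide on the same uplink purely by chance:
print(ecmp_pick_uplink("10.0.0.1", "10.0.1.9", 6, 33333, 3260, 4))
print(ecmp_pick_uplink("10.0.0.2", "10.0.1.9", 6, 44444, 3260, 4))
```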
Impact of Link Speed

Three non-oversubscribed topologies, each with 20×10Gbps downlinks:
- 20×10Gbps uplinks
- 5×40Gbps uplinks
- 2×100Gbps uplinks
How Does Link Speed Affect ECMP?

Higher-speed links improve ECMP efficiency. With 11 concurrent 10Gbps flows (55% load):
- 20×10Gbps uplinks: probability of 100% throughput = 3.27%
- 2×100Gbps uplinks: probability of 100% throughput = 99.95%

Source: http://simula.stanford.edu/~alizade/papers/conga-sigcomm14.pdf
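To see where the 3.27% comes from, here is one plausible reconstruction (our assumption, though it reproduces the slide's number): with 20×10Gbps uplinks, each 10Gbps flow needs an uplink to itself, so full throughput requires the hash to place all 11 flows on distinct links:

```latex
P(\text{100\% throughput}) = \prod_{k=0}^{10} \frac{20-k}{20}
                           = \frac{20 \cdot 19 \cdots 10}{20^{11}} \approx 0.0327
```

With 2×100Gbps uplinks, by contrast, each uplink can absorb ten 10Gbps flows, so throughput only suffers in the rare event that the hash piles essentially all 11 flows onto one link, which is why the probability jumps to roughly 99.95%.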
Impact of Link Speed

[Chart: average FCT for large (10MB,∞) background flows, normalized to optimal, vs. load (30-80%), comparing an output-queued switch (OQ-Switch) against the 20x10Gbps, 5x40Gbps, and 2x100Gbps fabrics. Lower is better. Source: http://simula.stanford.edu/~alizade/papers/conga-sigcomm14.pdf]

- 40/100Gbps fabric: approximately the same flow completion time (FCT) as the giant switch (OQ)
- 10Gbps fabric: FCT up to 40% worse than OQ
Storage over Spine-Leaf

- New scale-out storage looks to spread initiators and targets over multiple leaf switches
- Concerns:
  - Multiple hops
  - Potential for increased latency
  - Oversubscription
  - TCP incast
  - Potential buffering issues
Incast Issue with IP-Based Storage

Many initiators (senders), spread over many paths, converge on a single target (receiver). That single point of convergence is where incast events are most severe (iSCSI and other IP-based storage).
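As a rough sketch of why this convergence point hurts, consider the buffer arithmetic at the receiver's port (the buffer size, link speed, and burst parameters below are illustrative assumptions, not measured values):

```python
def incast_overflow_bytes(num_senders, burst_bytes, port_buffer_bytes,
                          link_gbps=10, burst_usec=100):
    """Bytes dropped when synchronized bursts exceed buffer + drain capacity."""
    arriving = num_senders * burst_bytes
    # Bytes the egress link can drain during the burst window:
    drained = int(link_gbps * 1e9 / 8 * burst_usec * 1e-6)
    overflow = arriving - drained - port_buffer_bytes
    return max(0, overflow)

# 32 targets each returning a 64 KB read to one initiator port with a
# 1 MB buffer: the synchronized burst overflows, so TCP sees drops.
print(incast_overflow_bytes(32, 64 * 1024, 1024 * 1024))
```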
Summary

- 40/100Gbps fabric + ECMP ≈ giant switch; some performance loss with a 10Gbps fabric
- Oversubscription (incast) in IP storage networks is very common and has a cascading effect on performance and throughput
Current Congestion Control Mechanisms: Hop-by-Hop

IEEE 802.1Qaz: Enhanced Transmission Selection (ETS)

- Required when consolidating I/O; it's a QoS problem
- Prevents a single traffic class from "hogging" all the bandwidth and starving other classes
- When a given class doesn't fully utilize its allocated bandwidth, the remainder is available to other classes
- Helps accommodate traffic classes of a "bursty" nature

Example: FCoE and other Ethernet traffic sharing a wire 50%/50%.
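A minimal sketch of the allocation behavior described above, assuming a simple proportional redistribution rule (our simplification for illustration, not the 802.1Qaz scheduler itself):

```python
def ets_allocate(link_gbps, guarantees, demands):
    """Allocate link bandwidth among traffic classes, ETS-style.

    guarantees: class -> guaranteed fraction of the link (sums to 1.0)
    demands:    class -> offered load in Gbps
    Bandwidth a class leaves unused is shared among still-hungry
    classes in proportion to their guarantees.
    """
    alloc = {c: min(demands[c], guarantees[c] * link_gbps) for c in demands}
    while True:
        spare = link_gbps - sum(alloc.values())
        hungry = [c for c in demands if demands[c] > alloc[c] + 1e-9]
        if spare < 1e-9 or not hungry:
            return alloc
        total_g = sum(guarantees[c] for c in hungry)
        for c in hungry:
            extra = min(spare * guarantees[c] / total_g, demands[c] - alloc[c])
            alloc[c] += extra

# FCoE guaranteed 50% but offering only 3 Gbps on a 10 Gbps link:
# the LAN class borrows the unused 2 Gbps.
print(ets_allocate(10, {"fcoe": 0.5, "lan": 0.5}, {"fcoe": 3, "lan": 9}))
```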
IEEE 802.1Qbb: Priority Flow Control (PFC)

- PFC enables flow control on a per-priority basis
- Therefore, we can have lossless and lossy priorities at the same time on the same wire
- Allows FCoE to operate over a lossless priority, independent of other priorities
- Traffic assigned to other CoS values continues to transmit and relies on upper-layer protocols for retransmission
- Not only for FCoE traffic
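A minimal sketch of per-priority pause, assuming simple XOFF/XON buffer thresholds (the threshold values and pause-frame handling here are illustrative; the actual mechanism is specified in IEEE 802.1Qbb):

```python
class PfcIngressQueue:
    """Per-priority ingress queue that asks the link peer to pause one
    priority when its buffer crosses XOFF, and to resume at XON."""

    def __init__(self, priority, xoff=80, xon=40):
        self.priority = priority
        self.xoff, self.xon = xoff, xon
        self.depth = 0
        self.paused = False

    def enqueue(self, frames):
        self.depth += frames
        if not self.paused and self.depth >= self.xoff:
            self.paused = True
            self.send_pause(quanta=0xFFFF)   # pause only this priority

    def dequeue(self, frames):
        self.depth = max(0, self.depth - frames)
        if self.paused and self.depth <= self.xon:
            self.paused = False
            self.send_pause(quanta=0)        # resume this priority

    def send_pause(self, quanta):
        print(f"PFC pause frame: priority={self.priority} quanta={quanta}")

q = PfcIngressQueue(priority=3)   # e.g., FCoE carried on priority 3
q.enqueue(90)                     # crosses XOFF: peer pauses priority 3 only
q.dequeue(60)                     # drains below XON: peer resumes
```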
Adding in Spine-Leaf

- Use IEEE ETS to guarantee bandwidth for traffic types
- Use IEEE PFC to create lossless service for FCoE
- Use the Ethernet infrastructure for all kinds of storage
- Improve scalability for all application needs, and maintain high, consistent performance for all traffic types, not just storage
Problems ETS/PFC Do Not Solve

- They do not take into consideration Layer 3 links and ECMP in a spine-leaf topology; they are limited to hop-by-hop links
- PFC was designed for lossless traffic, not typical IP-based storage
- ETS guarantees bandwidth; it does not alleviate congestion
The network paradigm as we know it...

Control and Data Plane

- Two models:
  - Distributed control and data plane (traditional): both planes reside within the physical device
  - Centralized (controller-based / SDN)

What is SDN? Per the ONF definition: "The physical separation of the network control plane from the forwarding plane, and where a control plane controls several devices."
https://www.opennetworking.org/sdn-resources/sdn-definition

In other words: in the SDN paradigm, not all processing happens inside the same device.
CONGA's Design

CONGA in One Slide

1. Leaf switches (top-of-rack) track congestion to other leaves, per path, in near real time
2. They send traffic on the least congested path(s)

Fast feedback loops run between leaf switches, directly in the data plane (e.g., among leaves L0, L1, and L2).

Could this work with a centralized control plane?

- If the control plane is separate, feedback can still be carried in the data plane, but it must then be computed at a central control point (the controller)
- The latency of that round trip, combined with constant change in the network, makes this not a viable option
CONGA in Depth

CONGA operates over a standard DC overlay (VXLAN), which is already broadly supported for virtualizing the physical network. For example, traffic from H1 to H9 is VXLAN-encapsulated at leaf L0 with an outer L0 → L2 header.

[Diagram: VXLAN frame format; hosts H1-H8 attached to leaves L0 and L1, H9 attached to L2.]
CONGA in Depth: Leaf-to-Leaf Congestion

CONGA tracks path-wise congestion metrics (3 bits) between each pair of leaf switches.

- On the forward path (e.g., H1 → H9, taking path 2 from L0 to L2), each packet carries a congestion extent (CE) field in the overlay header, and every link along the path folds in its utilization: pkt.CE ← max(pkt.CE, link.util)
- The destination leaf records the arriving CE per source leaf and per path in a Congestion-From-Leaf table (e.g., L2 records CE=5 for path 2 from L0)
- The destination leaf then feeds the metric back (e.g., L2 → L0: FB-Path=2, FB-Metric=5), and the source leaf stores it per destination leaf and per path in a Congestion-To-Leaf table
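A minimal sketch of the marking and bookkeeping described above (the utilization scale and table layout are assumptions for illustration; the actual design quantizes an estimate of link load into the 3-bit CE field):

```python
# Forward path: each hop folds its local link utilization (quantized
# to 3 bits, 0-7) into the packet's congestion-extent field.
def update_ce(pkt_ce, link_util_3bit):
    return max(pkt_ce, link_util_3bit)

# Destination leaf: remember, per (source leaf, path), the congestion
# seen on packets arriving over that path...
congestion_from_leaf = {}   # (src_leaf, path) -> CE

def on_packet_arrival(src_leaf, path, pkt_ce):
    congestion_from_leaf[(src_leaf, path)] = pkt_ce

# ...and piggyback it back to the source leaf, which keeps the
# congestion-to-leaf table used for load-balancing decisions.
congestion_to_leaf = {}     # (dst_leaf, path) -> CE

def on_feedback(dst_leaf, fb_path, fb_metric):
    congestion_to_leaf[(dst_leaf, fb_path)] = fb_metric

ce = 0
for util in (2, 5, 1):          # three hops on path 2, L0 -> L2
    ce = update_ce(ce, util)
on_packet_arrival("L0", 2, ce)  # L2 records CE=5 for (L0, path 2)
on_feedback("L2", 2, congestion_from_leaf[("L0", 2)])  # back at L0
print(congestion_to_leaf)       # {('L2', 2): 5}
```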
CONGA in Depth: Load-Balancing Decisions

Send traffic on the least congested path, chosen from the Congestion-To-Leaf table at the source leaf. In the example table at L0, the best path p* is path 3 toward L1, and path 0 or 1 toward L2. To avoid packet reordering, the decision is made per flowlet [Kandula et al. 2004] rather than per packet.

Reference: http://groups.csail.mit.edu/netmit/wordpress/wp-content/themes/netmit/papers/texcp-hotnets04.pdf
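A minimal sketch of the path choice, reusing the congestion_to_leaf table from the previous sketch (the table values below are illustrative, chosen to reproduce the slide's example; the full design also factors in local uplink congestion, which we omit):

```python
def pick_paths(dst_leaf, num_paths, congestion_to_leaf):
    """Return the least congested path(s) p* toward dst_leaf."""
    metrics = {p: congestion_to_leaf.get((dst_leaf, p), 0)
               for p in range(num_paths)}
    best = min(metrics.values())
    return [p for p, m in metrics.items() if m == best]

# Illustrative table at L0, matching the slide's p* outcomes:
table = {("L1", 0): 5, ("L1", 1): 1, ("L1", 2): 4, ("L1", 3): 0,
         ("L2", 0): 1, ("L2", 1): 1, ("L2", 2): 4, ("L2", 3): 3}
print(pick_paths("L1", 4, table))  # [3]      -> L0 -> L1: p* = 3
print(pick_paths("L2", 4, table))  # [0, 1]   -> L0 -> L2: p* = 0 or 1
```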
CONGA in Depth: Flowlet Switching

- State-of-the-art ECMP hashes flows (5-tuples) to a single path to prevent reordering TCP packets.
- Flowlet switching* instead routes bursts of packets from the same flow independently. If the idle gap between two bursts of a TCP flow satisfies gap ≥ |d1 - d2|, where d1 and d2 are the delays of the two candidate paths, the second burst can take a different path with no packet reordering.

*Flowlet Switching (Kandula et al. '04): http://groups.csail.mit.edu/netmit/wordpress/wp-content/themes/netmit/papers/texcp-hotnets04.pdf
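A minimal sketch of gap-based flowlet detection (the gap threshold and bookkeeping are illustrative assumptions):

```python
class FlowletRouter:
    """Route a flow's bursts (flowlets) independently: if the idle gap
    since the flow's last packet exceeds the worst-case path-delay
    difference, later packets can switch paths without reordering."""

    def __init__(self, gap_usec):
        self.gap_usec = gap_usec          # must be >= |d1 - d2|
        self.last_seen = {}               # flow -> (timestamp, path)

    def route(self, flow, now_usec, pick_best_path):
        seen = self.last_seen.get(flow)
        if seen and now_usec - seen[0] < self.gap_usec:
            path = seen[1]                # same flowlet: stick to its path
        else:
            path = pick_best_path()       # new flowlet: decide afresh
        self.last_seen[flow] = (now_usec, path)
        return path

r = FlowletRouter(gap_usec=500)
best = iter([2, 0]).__next__              # pretend congestion changed
print(r.route("H1->H9", 0, best))         # 2: new flowlet
print(r.route("H1->H9", 100, best))       # 2: same burst, same path
print(r.route("H1->H9", 900, best))       # 0: gap exceeded, new flowlet
```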
Of Elephants and Mice

- There are two types of flows in the data center:
  - Long-lived flows ("elephants"): data or (block) storage migrations, VM migrations, MapReduce. These are the flows that impact buffers; there are not many of them in data centers, but just a few can be impactful.
  - Short-lived flows ("mice"): web requests, emails, small data requests. These can be bursty.
- How they interact is key: in traditional ECMP, multiple long-lived flows can be mapped to a few links, and if mice are mapped to those same links, it is detrimental to application performance.

We need a new metric to capture this impact: application flow completion time (FCT).
CONGA: Fabric Load Balancing with Dynamic Flow Prioritization

Real traffic is a mix of large (elephant) and small (mice) flows.

- Standard (single priority): large flows severely impact performance (latency and loss) for small flows.
- Dynamic flow prioritization: the fabric automatically gives a higher priority to small flows.

Key idea: the fabric detects the initial few flowlets of each flow and assigns them to a high-priority class. Since mice finish within their first few flowlets, they complete at high priority, while elephants spend most of their life in the standard-priority class.
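A minimal sketch of that key idea (the flowlet budget of three is an arbitrary illustrative choice, not a value from the talk):

```python
class DynamicFlowPrioritizer:
    """Give each flow's first few flowlets high priority; mice finish
    while still 'young', elephants age out to standard priority."""

    HIGH, STANDARD = 0, 1

    def __init__(self, initial_flowlets=3):
        self.initial_flowlets = initial_flowlets
        self.flowlet_count = {}           # flow -> flowlets seen so far

    def classify(self, flow, new_flowlet):
        if new_flowlet:
            self.flowlet_count[flow] = self.flowlet_count.get(flow, 0) + 1
        if self.flowlet_count.get(flow, 0) <= self.initial_flowlets:
            return self.HIGH
        return self.STANDARD

dfp = DynamicFlowPrioritizer()
# A mouse (one flowlet) rides the high-priority class end to end:
print(dfp.classify("mouse", new_flowlet=True))    # 0 (high)
# An elephant's fourth flowlet drops to standard priority:
for _ in range(4):
    cls = dfp.classify("elephant", new_flowlet=True)
print(cls)                                        # 1 (standard)
```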
Why Is CONGA Important for IP-Based Storage?

Storage flows are elephant flows.

- Using flowlet switching, we can break long-lived (block) storage flows into flowlets (bursts) and route them across multiple paths, with no packet reordering (in-order delivery)
- Traffic is sent on the least congested path using CONGA's feedback loop
- Loss of a link in a path means the loss of a flowlet, which is minimal disruption: no TCP reset (iSCSI, NFS, applications); just resend the small burst that was lost
- Object- and file-based short flows (mice) get higher priority and are able to complete faster
CONGA for "Elephant" Flows (block)

[Chart: FCT normalized to ECMP vs. load (10-90%), two panels: mice flows (<100KB) and elephant flows (>10MB), each comparing ECMP and CONGA.]

CONGA is up to 35% better than ECMP for elephants.
CONGA for "Mice" Flows (file or object)

[Chart: same two-panel FCT-vs.-load comparison as above.]

CONGA is up to 40% better than ECMP for mice.
Single Fabric for Storage (block, file, or object) and Data

[Chart: same two-panel FCT-vs.-load comparison as above.]

One fabric serves both: CONGA is up to 35% better than ECMP for elephants and up to 40% better for mice.
Link Failures with Minimal Loss

[Chart: overall average FCT, normalized to optimal, vs. load (10-70%) under a link failure, comparing ECMP and CONGA.]
Summary

CONGA with DCB meets the needs of Storage over Ethernet:

- Minimize the chance of dropped traffic: routing flowlets over the least congested path means link loss has minimal impact
- In-order frame delivery: flowlet switching with CONGA guarantees in-order delivery
- Minimal oversubscription / incast issues: a 40G fabric with enhanced ECMP and mice/elephant flow detection and separation
- Lossless fabric for FCoE (no drop): CONGA and DCB (PFC, ETS) can be implemented together
After This Webcast

- This webcast will be posted to the SNIA Ethernet Storage Forum (ESF) website and available on demand: http://www.snia.org/forums/esf/knowledge/webcasts
- A full Q&A from this webcast, including answers to questions we couldn't get to today, will be posted to the SNIA-ESF blog: http://sniaesfblog.org/
- Follow us on Twitter: @SNIAESF
Q&A

Thank You