Controlling Congestion in New Storage Architectures

TRANSCRIPT

Page 1: Controlling Congestion in New Storage Architectures


Controlling Congestion in New Storage Architectures

September 15, 2015

Page 2: Controlling Congestion in New Storage Architectures

Today’s Presenters

Chad Hintz, SNIA Ethernet Storage Forum Board Member; Solutions Architect, Cisco

David L. Fair, SNIA Ethernet Storage Forum Chair; Intel

Page 3: Controlling Congestion in New Storage Architectures

SNIA Legal Notice

- The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted.
- Member companies and individual members may use this material in presentations and literature under the following conditions:
  - Any slide or slides used must be reproduced in their entirety without modification.
  - The SNIA must be acknowledged as the source of any material used in the body of any document containing material from these presentations.
- This presentation is a project of the SNIA Education Committee.
- Neither the author nor the presenter is an attorney, and nothing in this presentation is intended to be, or should be construed as, legal advice or an opinion of counsel. If you need legal advice or a legal opinion, please contact your attorney.
- The information presented herein represents the author's personal opinion and current understanding of the relevant issues involved. The author, the presenter, and the SNIA do not assume any responsibility or liability for damages arising out of any reliance on or use of this information. NO WARRANTIES, EXPRESS OR IMPLIED. USE AT YOUR OWN RISK.

3

Page 4: Controlling Congestion in New Storage Architectures

Terms

- STP (Spanning Tree Protocol): an older network protocol that ensures a loop-free topology for any bridged Ethernet local area network.
- Spine/Leaf, Clos, Fat-Tree, Multi-Rooted Tree: networks based on routing with all links active.
- VXLAN: a standards-based Layer 2 overlay scheme over a Layer 3 network.

4

Page 5: Controlling Congestion in New Storage Architectures

Agenda

- Data Center “Fabrics”: Current State
- Current Congestion Control Mechanisms
- CONGA’s Design
- Why is CONGA important for IP-based Storage and Congestion Control?
- Q&A

5

Page 6: Controlling Congestion in New Storage Architectures

Storage over Ethernet Needs and Concerns

6

Reliability and performance requirements:
- Minimize the chance of dropped traffic
- In-order frame delivery
- Minimal oversubscription
- Lossless fabric for FCoE (no drop)

Page 7: Controlling Congestion in New Storage Architectures

Data Center Fabric: Current State

7

Page 8: Controlling Congestion in New Storage Architectures

Data Center “Fabric” Journey: STP (Spanning Tree), with blocked links

Page 9: Controlling Congestion in New Storage Architectures

Data Center “Fabric” Journey: STP plus Multi-Chassis EtherChannel

Page 10: Controlling Congestion in New Storage Architectures

Data Center “Fabric” Journey: STP, Multi-Chassis EtherChannel, and Layer 3 Spine-Leaf

Page 11: Controlling Congestion in New Storage Architectures

Data Center “Fabric” Journey: STP, Multi-Chassis EtherChannel, Layer 3 Spine-Leaf, and Layer 3 with a Layer 2 overlay (VXLAN/EVPN) extending across the MAN/WAN

Page 12: Controlling Congestion in New Storage Architectures

Multi-rooted Tree(Spine/Leaf) = Ideal DC Network

12

Ideal DC network: a giant switch with 1000s of server ports.
- No internal bottlenecks, so performance is predictable
- Simplifies bandwidth management

Page 13: Controlling Congestion in New Storage Architectures

Multi-rooted Tree(Spine/Leaf) = Ideal DC Network

13

The ideal DC network is the giant switch, but we can’t build it.

Page 14: Controlling Congestion in New Storage Architectures

Multi-rooted Tree(Spine/Leaf) = Ideal DC Network

14

We can’t build the giant switch, but a multi-rooted tree with 1000s of server ports approximates it.

Page 15: Controlling Congestion in New Storage Architectures

Multi-rooted Tree(Spine/Leaf) = Ideal DC Network

15

The multi-rooted tree approximates the giant switch, but with possible bottlenecks.

Page 16: Controlling Congestion in New Storage Architectures

Multi-rooted Tree(Spine/Leaf) = Ideal DC Network

16

Those possible bottlenecks mean we need precise load balancing.

Page 17: Controlling Congestion in New Storage Architectures

Leaf-Spine DC Fabric

17

Approximates ideal giant switch

(Diagram: hosts H1–H9 attached to leaf switches, with the leaf switches connected to spine switches.)

Page 18: Controlling Congestion in New Storage Architectures

Leaf-Spine DC Fabric

18

Approximates ideal giant switch

(Diagram: hosts H1–H9 attached to leaf switches, with the leaf switches connected to spine switches.)
- How close is Leaf-Spine to the ideal giant switch?
- What impacts its performance? Link speeds, oversubscription, buffering

Page 19: Controlling Congestion in New Storage Architectures

Today: Equal Cost Multipath (ECMP) Routing

Pick among equal-cost paths with a hash algorithm:
- Randomized load balancing
- Preserves packet order (optimal for TCP)

19

Problems:
- Hash collisions
- No awareness of congestion
- Each flow is mapped to a single path (problems when a link is lost)
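To make the hash step concrete, here is a minimal sketch (not any particular switch implementation) of how an ECMP device might map a flow's 5-tuple onto one of several equal-cost uplinks; the uplink names and the choice of MD5 are illustrative.

```python
import hashlib

def ecmp_pick_uplink(src_ip, dst_ip, proto, src_port, dst_port, uplinks):
    """Hash the flow 5-tuple and pick one of the equal-cost uplinks.

    Every packet of the same flow hashes to the same uplink, which preserves
    packet order but ignores how loaded that uplink already is.
    """
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = int.from_bytes(hashlib.md5(key).digest()[:4], "big")
    return uplinks[digest % len(uplinks)]

# Two iSCSI flows from different initiators may still collide on one uplink.
uplinks = ["s0", "s1", "s2", "s3"]
print(ecmp_pick_uplink("10.0.0.1", "10.0.1.9", 6, 51000, 3260, uplinks))
print(ecmp_pick_uplink("10.0.0.2", "10.0.1.9", 6, 52344, 3260, uplinks))
```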

Page 20: Controlling Congestion in New Storage Architectures

Impact of Link Speed

20

Three non-oversubscribed topologies, each with 20×10 Gbps downlinks:
- 20×10 Gbps uplinks
- 5×40 Gbps uplinks
- 2×100 Gbps uplinks

Page 21: Controlling Congestion in New Storage Architectures

How does Link Speed affect ECMP

Higher speed links improve ECMP efficiency

21

With 11×10 Gbps flows (55% load):
- 20×10 Gbps uplinks: probability of 100% throughput = 3.27%
- 2×100 Gbps uplinks: probability of 100% throughput = 99.95%

http://simula.stanford.edu/~alizade/papers/conga-sigcomm14.pdf
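The 3.27% figure can be reproduced with a short calculation: full throughput over 20×10 Gbps uplinks requires all eleven 10 Gbps flows to hash onto distinct uplinks, whereas a 100 Gbps uplink only saturates if all eleven flows land on it. A sketch of the arithmetic (the slide's 99.95% corresponds to the single-uplink view):

```python
def prob_all_distinct(flows, links):
    """Probability that `flows` random hash choices land on distinct links."""
    p = 1.0
    for i in range(flows):
        p *= (links - i) / links
    return p

# 11 x 10G flows over 20 x 10G uplinks: any hash collision loses throughput.
print(f"20 x 10G uplinks: {prob_all_distinct(11, 20):.2%}")        # ~3.27%

# 11 x 10G flows over 2 x 100G uplinks: an uplink only saturates if all 11
# flows (110 Gbps) hash onto it. For one given uplink that probability is
# (1/2)**11, giving ~99.95%; counting either uplink gives ~99.90%.
print(f"2 x 100G uplinks: {1 - (1/2)**11:.2%} / {1 - 2*(1/2)**11:.2%}")
```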

Page 22: Controlling Congestion in New Storage Architectures


Impact of Link Speed

22

(Chart: average FCT, normalized to optimal, vs. load (%), for large (10MB, ∞) background flows; curves for OQ-Switch, 20x10Gbps, 5x40Gbps, and 2x100Gbps fabrics. Lower is better.)

http://simula.stanford.edu/~alizade/papers/conga-sigcomm14.pdf

Page 23: Controlling Congestion in New Storage Architectures


Impact of Link Speed

23

(Chart repeated from the previous slide: average FCT, normalized to optimal, vs. load (%).)

- 40/100 Gbps fabric: roughly the same flow completion time (FCT) as the giant switch (OQ)
- 10 Gbps fabric: FCT up to 40% worse than OQ

Page 24: Controlling Congestion in New Storage Architectures

Storage over Spine-Leaf

- New scale-out storage spreads initiators and targets across multiple leaf switches
- Concerns:
  - Multiple hops
  - Potential for increased latency
  - Oversubscription
  - TCP incast
  - Potential buffering issues

24

Page 25: Controlling Congestion in New Storage Architectures

Incast Issue with IP Based Storage

25

Many initiators (senders) spread over many paths converge on a single target (receiver), a single point of convergence. Incast events are most severe at the receiver (iSCSI and other IP-based storage).
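As a rough, hypothetical illustration of why the receiver is the pressure point, the sketch below assumes N initiators each release a synchronized burst at line rate toward one 10 Gbps target port; everything beyond the port's drain rate must sit in the switch buffer or be dropped. The numbers are made up for illustration.

```python
def incast_buffer_demand(senders, burst_bytes, link_gbps=10):
    """Bytes the egress buffer must absorb when `senders` initiators each
    send `burst_bytes` simultaneously toward one `link_gbps` receiver port.

    Assumes every sender transmits at the same line rate as the receiver
    link, so the aggregate arrival rate is `senders` times the drain rate
    (a deliberately simple model).
    """
    total = senders * burst_bytes
    drain_rate = link_gbps * 1e9 / 8          # bytes per second out of the port
    arrival_rate = senders * drain_rate       # all senders at line rate
    burst_duration = total / arrival_rate     # seconds the burst lasts
    drained = drain_rate * burst_duration     # bytes forwarded during the burst
    return total - drained                    # bytes left queued (or dropped)

# 16 initiators each answering with a 256 KB read response.
print(f"{incast_buffer_demand(16, 256 * 1024) / 1024:.0f} KB of buffering needed")
```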

Page 26: Controlling Congestion in New Storage Architectures

Summary

- 40/100 Gbps fabric + ECMP ≈ giant switch; some performance loss with a 10 Gbps fabric
- Oversubscription (incast) in IP storage networks is very common and has a cascading effect on performance and throughput

26

Page 27: Controlling Congestion in New Storage Architectures

Current Congestion Control Mechanisms Hop-By-Hop

27

Page 28: Controlling Congestion in New Storage Architectures

IEEE 802.1Qaz

Enhanced Transmission Selection (ETS)

- Required when consolidating I/O; it is a QoS problem
- Prevents a single traffic class from hogging all the bandwidth and starving other classes
- When a given class does not fully utilize its allocated bandwidth, the remainder is available to other classes
- Helps accommodate classes with a “bursty” nature

(Diagram: an Ethernet wire with bandwidth split 50%/50% between FCoE and other traffic.)
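A simplified model (not switch firmware, and not the exact 802.1Qaz scheduler) of the sharing behavior described above: each class receives its configured share, and whatever an underloaded class leaves unused is redistributed by weight to classes that still have demand.

```python
def ets_allocate(capacity_gbps, weights, demands):
    """Simplified model of ETS bandwidth sharing.

    weights: configured percentage per traffic class (assumed positive).
    demands: offered load per class in Gbps.
    Classes that use less than their share release the remainder, which is
    redistributed by weight among classes with unmet demand.
    """
    alloc = {c: 0.0 for c in weights}
    remaining = float(capacity_gbps)
    unsatisfied = {c for c in weights if demands[c] > 0}
    while remaining > 1e-9 and unsatisfied:
        total_w = sum(weights[c] for c in unsatisfied)
        share = {c: remaining * weights[c] / total_w for c in unsatisfied}
        for c, extra in share.items():
            give = min(extra, demands[c] - alloc[c])
            alloc[c] += give
            remaining -= give
        unsatisfied = {c for c in unsatisfied if alloc[c] < demands[c] - 1e-9}
    return alloc

# 10 Gbps link, 50/50 split between FCoE and LAN traffic (as in the figure):
# FCoE only offers 3 Gbps, so LAN can borrow the unused 2 Gbps.
print(ets_allocate(10, {"fcoe": 50, "lan": 50}, {"fcoe": 3, "lan": 9}))
```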

Page 29: Controlling Congestion in New Storage Architectures

IEEE 802.1Qbb Priority Flow Control

- PFC enables flow control on a per-priority basis
- Therefore, we can have lossless and lossy priorities at the same time on the same wire
- Allows FCoE to operate over a lossless priority, independent of other priorities
- Traffic assigned to other CoS values continues to transmit and relies on upper-layer protocols for retransmission
- Not only for FCoE traffic

29

(Diagram: FCoE carried as a lossless priority on a shared Ethernet wire.)
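A toy model (not the 802.1Qbb frame format) of the per-priority pause decision: each priority keeps its own buffer accounting, and only a lossless priority that crosses its XOFF threshold triggers a pause for that priority, while lossy priorities keep transmitting and may drop. The thresholds below are illustrative.

```python
from dataclasses import dataclass

@dataclass
class PriorityQueue:
    lossless: bool
    xoff_bytes: int
    xon_bytes: int
    occupancy: int = 0
    paused: bool = False

def on_enqueue(q: PriorityQueue, frame_len: int):
    """Account for an arriving frame and decide whether to signal PAUSE.

    Lossy priorities never pause; under pressure they simply drop and leave
    recovery to upper-layer protocols (e.g. TCP retransmission).
    """
    q.occupancy += frame_len
    if q.lossless and not q.paused and q.occupancy >= q.xoff_bytes:
        q.paused = True
        return "PAUSE"
    return None

def on_dequeue(q: PriorityQueue, frame_len: int):
    q.occupancy = max(0, q.occupancy - frame_len)
    if q.paused and q.occupancy <= q.xon_bytes:
        q.paused = False
        return "RESUME"
    return None

fcoe = PriorityQueue(lossless=True, xoff_bytes=96_000, xon_bytes=48_000)
for _ in range(70):
    msg = on_enqueue(fcoe, 1500)
    if msg:
        print(msg, "signaled for the FCoE priority only")
```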

Page 30: Controlling Congestion in New Storage Architectures

Adding in Spine-Leaf

- Use IEEE ETS to guarantee bandwidth for traffic types
- Use IEEE PFC to create lossless classes for FCoE traffic
- Use the Ethernet infrastructure for all kinds of storage
- Improve scalability for all application needs and maintain high, consistent performance for all traffic types, not just storage

30

Page 31: Controlling Congestion in New Storage Architectures

Problems ETS/PFC do not solve

- They do not take Layer 3 links and ECMP in a spine-leaf topology into consideration; they are limited to hop-by-hop links
- PFC was designed for lossless traffic, not typical IP-based storage
- ETS guarantees bandwidth; it does not alleviate congestion

31

Page 32: Controlling Congestion in New Storage Architectures

The network paradigm as we know it…

Page 33: Controlling Congestion in New Storage Architectures

Control and Data Plane

Two models:
- Distributed control and data plane (traditional)
- Centralized (controller-based, SDN)

33

Page 34: Controlling Congestion in New Storage Architectures

Traditional: the control and data planes reside within the physical device.

Page 35: Controlling Congestion in New Storage Architectures

What is SDN? Per the ONF definition: software-defined networking (SDN) is the physical separation of the network control plane from the forwarding plane, where a control plane controls several devices.

https://www.opennetworking.org/sdn-resources/sdn-definition

Page 36: Controlling Congestion in New Storage Architectures

In other words…

In the SDN paradigm, not all processing happens inside the same device

Page 37: Controlling Congestion in New Storage Architectures

CONGA’s Design

37

Page 38: Controlling Congestion in New Storage Architectures

CONGA in 1 Slide

38

(Diagram: leaf switches L0, L1, and L2 connected through the spine.)
1. Leaf switches (top-of-rack) track congestion to other leaves over different paths in near real time.
2. Traffic is sent on the least congested path(s).
Fast feedback loops run between the leaf switches, directly in the data plane.


Page 40: Controlling Congestion in New Storage Architectures

Could this work with centralized control plane?

- If the control plane is separate, feedback can still be sent in the data plane, but it then has to be computed at a central control point (the controller)
- The latency of doing so, together with the constant change in the network, makes this not a viable option

40


Page 41: Controlling Congestion in New Storage Architectures

CONGA in Depth

CONGA operates over a standard DC overlay (VXLAN), which is already broadly supported for virtualizing the physical network.

(Diagram: a packet from H1 to H9 is VXLAN-encapsulated at leaf L0 with outer addresses L0→L2.)

Page 42: Controlling Congestion in New Storage Architectures

CONGA-In Depth (VXLAN)

42

VXLAN Frame Format
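The frame format on this slide wraps the original Ethernet frame in outer MAC, IP, and UDP headers (UDP destination port 4789) followed by an 8-byte VXLAN header carrying a 24-bit VNI. A small sketch of just the VXLAN header per RFC 7348:

```python
import struct

VXLAN_UDP_PORT = 4789  # IANA-assigned destination port for VXLAN

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header (RFC 7348).

    Byte 0: flags, with bit 0x08 set to mark the VNI as valid.
    Bytes 1-3 and byte 7: reserved (zero).
    Bytes 4-6: the 24-bit VXLAN Network Identifier.
    """
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    return struct.pack("!II", 0x08 << 24, vni << 8)

hdr = vxlan_header(5001)
print(hdr.hex())   # -> '0800000000138900'
```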

Page 43: Controlling Congestion in New Storage Architectures

CONGA In Depth: Leaf-to-Leaf Congestion

(Diagram: a packet from H1 at leaf L0 crosses the fabric to H9 at leaf L2; the four spine paths are labeled 0–3.)

Track path-wise congestion metrics (3 bits) between each pair of leaf switches

Congestion-From-Leaf table at L2: one row per source leaf (L0, L1), one column per path (0–3). Each packet carries its path ID and a congestion extent (CE) field that is updated at every hop:

pkt.CE ← max(pkt.CE, link.util)

In the example, the packet leaves L0 with Path=2, CE=0 and arrives at L2 with Path=2, CE=5; L2 then stores 5 in its Congestion-From-Leaf entry for source L0, path 2.
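A minimal sketch of that update rule (field and variable names are illustrative, not the actual overlay encoding): the packet accumulates the maximum quantized link utilization along its path, and the destination leaf records it per source leaf and path.

```python
def quantize(util: float, bits: int = 3) -> int:
    """Quantize link utilization in [0, 1] to the 3-bit metric CONGA carries."""
    return min(int(util * (2 ** bits)), 2 ** bits - 1)

def forward(pkt: dict, link_utils: list[float]) -> dict:
    """Apply pkt.CE = max(pkt.CE, link.util) at every hop along the path."""
    for util in link_utils:
        pkt["ce"] = max(pkt["ce"], quantize(util))
    return pkt

# Congestion-From-Leaf table at L2: congestion_from[src_leaf][path]
congestion_from = {"L0": [0, 0, 0, 0], "L1": [0, 0, 0, 0]}

pkt = {"src_leaf": "L0", "dst_leaf": "L2", "path": 2, "ce": 0}
pkt = forward(pkt, link_utils=[0.3, 0.7])          # uplink then downlink utilization
congestion_from[pkt["src_leaf"]][pkt["path"]] = pkt["ce"]
print(congestion_from)   # path 2 from L0 now shows CE=5
```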

Page 44: Controlling Congestion in New Storage Architectures

CONGA In Depth: Leaf-to-Leaf Congestion (feedback)

(Diagram: the same topology; congestion metrics measured at L2 are fed back to L0.)

Track path-wise congestion metrics (3 bits) between each pair of leaf switches

Congestion-To-Leaf table at L0: one row per destination leaf (L1, L2), one column per path (0–3). L2 returns the metric it measured by piggybacking feedback on reverse traffic: a packet from L2 to L0 carries FB-Path=2, FB-Metric=5, and L0 records 5 in its Congestion-To-Leaf entry for destination L2, path 2.

Page 45: Controlling Congestion in New Storage Architectures

CONGA-In Depth: LB Decisions

Send each packet on least congested path

Using its Congestion-To-Leaf table, L0 picks the least congested path toward each destination leaf. In the example, the best path p* is path 3 for L0→L1, and paths 0 and 1 are tied for L0→L2.

http://groups.csail.mit.edu/netmit/wordpress/wp-content/themes/netmit/papers/texcp-hotnets04.pdf

Page 46: Controlling Congestion in New Storage Architectures

CONGA-In Depth: LB Decisions

Send each packet on least congested path

(Same example as the previous slide.) Load-balancing decisions are made per flowlet [Kandula et al 2004].

http://groups.csail.mit.edu/netmit/wordpress/wp-content/themes/netmit/papers/texcp-hotnets04.pdf
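A sketch of the per-flowlet decision at the source leaf, using only the Congestion-To-Leaf table shown on the slide; the table values are illustrative but chosen to match the stated results (p* = 3 toward L1, paths 0 and 1 tied toward L2). The full CONGA design also folds in locally measured uplink congestion, which is omitted here.

```python
import random

# Congestion-To-Leaf table at L0: lower value = less congested path.
congestion_to = {
    "L1": [5, 5, 4, 3],   # illustrative values; p* = 3, as on the slide
    "L2": [1, 1, 4, 7],   # illustrative values; p* = 0 or 1 (tied)
}

def pick_path(dst_leaf: str) -> int:
    """Return the least congested path to dst_leaf, breaking ties randomly."""
    row = congestion_to[dst_leaf]
    best = min(row)
    return random.choice([p for p, metric in enumerate(row) if metric == best])

print("L0 -> L1: p* =", pick_path("L1"))
print("L0 -> L2: p* =", pick_path("L2"))
```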

Page 47: Controlling Congestion in New Storage Architectures

CONGA-In Depth: Flowlet Switching

(Diagram: a TCP flow from H1 to H2 split over two paths with one-way delays d1 and d2.)
- State-of-the-art ECMP hashes flows (5-tuples) to a single path to prevent reordering of TCP packets.
- Flowlet switching* routes bursts of packets from the same flow independently.
- No packet reordering, provided the idle gap between bursts satisfies Gap ≥ |d1 – d2|.

*Flowlet Switching (Kandula et al ’04): http://groups.csail.mit.edu/netmit/wordpress/wp-content/themes/netmit/papers/texcp-hotnets04.pdf
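A minimal flowlet detector in the spirit of Kandula et al.: packets that arrive within the gap threshold stay on the flow's current path, and an idle gap larger than the threshold (chosen to be at least |d1 – d2|) starts a new flowlet that may safely take a different path. The threshold and traffic below are illustrative.

```python
import itertools

FLOWLET_GAP = 500e-6   # seconds; must be >= |d1 - d2| for the paths in use

flow_state = {}        # flow 5-tuple -> (last_seen_time, current_path)

def route_packet(flow, now, choose_path):
    """Keep a flow on its path unless an idle gap starts a new flowlet."""
    last_seen, path = flow_state.get(flow, (None, None))
    if last_seen is None or now - last_seen > FLOWLET_GAP:
        path = choose_path()               # e.g. CONGA's least congested path
    flow_state[flow] = (now, path)
    return path

# Toy usage: a burst, a pause longer than the gap, then another burst.
paths = itertools.cycle([0, 1, 2, 3])
times = [0.0, 0.0001, 0.0002, 0.0012, 0.0013]    # 1 ms gap after the 3rd packet
for t in times:
    print(t, "-> path", route_packet(("10.0.0.1", "10.0.1.9", 6, 51000, 3260),
                                     t, lambda: next(paths)))
```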

Page 48: Controlling Congestion in New Storage Architectures

Of Elephants and Mice

- Two types of flows in the data center:
  - Long-lived flows (“elephants”): data or (block) storage migrations, VM migrations, MapReduce. These are the flows that impact buffers; there are not many of them in data centers, but just a few can be impactful.
  - Short-lived flows (“mice”): web requests, emails, small data requests. Can be bursty.
- How they interact is key: with traditional ECMP, multiple long-lived flows can be mapped to a few links, and if mice are mapped to the same links it is detrimental to application performance.

48


Page 50: Controlling Congestion in New Storage Architectures

Of Elephants and Mice


We need a new metric to capture this impact: application flow completion time (FCT).

Page 51: Controlling Congestion in New Storage Architectures

CONGA: Fabric Load Balancing Dynamic Flow Prioritization

Real traffic is a mix of large (elephant) and small (mice) flows.

Key idea: the fabric detects the initial few flowlets of each flow and assigns them to a high-priority class.

Page 52: Controlling Congestion in New Storage Architectures

CONGA: Fabric Load Balancing Dynamic Flow Prioritization

(Diagram: flows F1, F2, and F3 share a single queue.)

Standard (single priority): large flows severely impact performance (latency and loss) for small flows.

Page 53: Controlling Congestion in New Storage Architectures

CONGA: Fabric Load Balancing Dynamic Flow Prioritization

Dynamic flow prioritization: the fabric automatically gives higher priority to small flows. The initial flowlets of every flow go to the high-priority class; the remainder of a large flow goes to the standard-priority class.
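A sketch of the prioritization idea with illustrative names and thresholds: the real mechanism tracks the initial flowlets of each flow, but a simple byte count per flow keeps the sketch short. Mice never exceed the threshold and stay in the high-priority class; elephants are demoted after their first bursts.

```python
HIGH_PRIORITY_BYTES = 100 * 1024    # illustrative threshold: first ~100 KB of a flow

bytes_seen = {}                     # flow 5-tuple -> bytes observed so far

def classify(flow, frame_len):
    """Return the priority class for this frame.

    Short flows never exceed the threshold and stay high priority end to
    end; long flows are moved to the standard class after their first bursts.
    """
    seen = bytes_seen.get(flow, 0) + frame_len
    bytes_seen[flow] = seen
    return "high" if seen <= HIGH_PRIORITY_BYTES else "standard"

# A small metadata request stays "high"; a bulk storage flow is demoted.
mouse = ("10.0.0.1", "10.0.1.9", 6, 40000, 2049)
elephant = ("10.0.0.2", "10.0.1.9", 6, 40001, 3260)
print(classify(mouse, 4 * 1024))
for _ in range(200):
    last = classify(elephant, 9000)
print(last)
```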


Page 55: Controlling Congestion in New Storage Architectures

Why Is CONGA Important for IP-Based Storage?

55

Page 56: Controlling Congestion in New Storage Architectures

Storage Flows are Elephant Flows

- Using flowlet switching, long-lived (block) storage flows can be broken into flowlets (bursts) and routed across multiple paths with no packet reordering (in-order delivery)
- Traffic is sent on the least congested path using CONGA’s feedback loop
- Loss of a link in a path is only the loss of a flowlet, so disruption is minimal
  - No TCP reset (iSCSI, NFS, applications)
  - Only the small burst that was lost is resent
- Object- and file-based short flows (mice) get higher priority and complete faster

56

Page 57: Controlling Congestion in New Storage Architectures

CONGA for “Elephant” flows (block)

57

(Charts: FCT normalized to ECMP vs. load (%), for mice flows (<100KB) and elephant flows (>10MB), comparing ECMP and CONGA.)

CONGA is up to 35% better than ECMP for elephants.

Page 58: Controlling Congestion in New Storage Architectures

CONGA for “Mice” flows (file or object)

58

(Charts repeated from the previous slide: FCT normalized to ECMP vs. load (%), for mice and elephant flows.)

CONGA is up to 40% better than ECMP for mice.

Page 59: Controlling Congestion in New Storage Architectures

Single Fabric for Storage (block, file, or object) and Data

59

(Charts repeated: FCT normalized to ECMP vs. load (%), for mice and elephant flows.)

CONGA is up to 35% better than ECMP for elephants and up to 40% better for mice.

Page 60: Controlling Congestion in New Storage Architectures

Link Failures with Minimal Loss

60

(Chart: overall average FCT, normalized to optimal, vs. load (%), comparing ECMP and CONGA under link failure.)

Page 61: Controlling Congestion in New Storage Architectures

Summary

61

Page 62: Controlling Congestion in New Storage Architectures

CONGA with DCB meets the needs of Storage over Ethernet

62

- Minimize the chance of dropped traffic
- In-order frame delivery
- Minimal oversubscription / incast issues
- Lossless fabric for FCoE (no drop)

Page 63: Controlling Congestion in New Storage Architectures

CONGA with DCB meets the needs of Storage over Ethernet

63

- Minimize the chance of dropped traffic: flowlets are routed over the least congested path, so a link loss has minimal impact
- In-order frame delivery: flowlet switching with CONGA guarantees in-order delivery
- Minimal oversubscription / incast issues: a 40G fabric with enhanced ECMP and mice/elephant flow detection and separation
- Lossless fabric for FCoE (no drop): CONGA and DCB (PFC, ETS) can be implemented together

Page 64: Controlling Congestion in New Storage Architectures

After This Webcast

- This webcast will be posted to the SNIA Ethernet Storage Forum (ESF) website and available on demand: http://www.snia.org/forums/esf/knowledge/webcasts
- A full Q&A from this webcast, including answers to questions we couldn't get to today, will be posted to the SNIA-ESF blog: http://sniaesfblog.org/
- Follow us on Twitter @SNIAESF

64

Page 65: Controlling Congestion in New Storage Architectures

Q&A

Page 66: Controlling Congestion in New Storage Architectures

Thank You