Ethernet Data Center Routing Challenges and 802.1aq/SPB new work
PETER ASHWOOD-SMITH
802.1aq’s 16 ECT can give a perfect spread across 16 uplinks going 2 hops. However:
A) Need to tweak 2nd-layer switch priorities to guarantee all 16 are used.
B) Need at least 16 subnets (C/S-VLANs) to assign one per 802.1aq B-VID.
[Figure: label A) “Tweak Bridge Priorities Here”; label B) at subnets S1 … S16]
Can we eliminate ‘tweaking*’?
• David Allan et al. have a presentation on this so I won’t spend much time on it.
• In general a network with N equal cost paths from ‘some source’ to ‘some destination’ requires #ECT about 25-40% greater than N (to statistically capture them all).
• Therefore when #ECT == N some ‘tweaking’ is usually required (for a DC it is trivial to do, however).
• Dave et al. suggest non-independence between ECT algorithms as a way to address this (maximize diversity) …
*Tweaking = adjusting Bridge Priorities up/down from defaults.
[Figure: two-layer fabric, upper-layer switches A1, A2 … A15, A16 and lower-layer switches B1 … B4, B29 … B32]
• 48-switch non-blocking 2-layer L2 fabric
• 16 at “upper” layer A1..A16
• 32 at “lower” layer B1..B32
• 16 uplinks per Bn, & 160 UNI links per Bn
• 32 downlinks per An
“Example” 802.1aq switching cluster – assume 100GE NNI links/groups
• (16 x 100GE per Bn) x 32 = 512 x 100GE = 51.2T
• 160 x 10GE server links (UNI) per Bn
• (32 x 160)/2 = 2560 servers @ 2 x 10GE each
• uFIB = 16 x 48 B-MAC = 768 entries
• mFIB = 16 subnets x 48 src = 768 entries
16 x 32 x 100GE = 51.2T using 48 x 2T switches
[Figure labels: servers S1,1 … S1,160, S3,1 … S3,160, S32,1 … S32,160; 5120 x 10GE; 16 x 100GE; 160 x 10GE; 32 x 100GE; 1536 FIB/node]
Good numbers “16” & “2” levels.
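The arithmetic on this slide can be checked with a short script. This is only a sketch: the constant and function names are made up for illustration, and the inputs are the figures quoted above (16 upper switches, 32 lower switches, 16 x 100GE uplinks and 160 x 10GE UNI links per Bn, two 10GE links per server, 16 B-VIDs/subnets).

```python
# Inputs copied from the example slide; the names are illustrative only.
UPPER = 16             # A1..A16
LOWER = 32             # B1..B32
UPLINKS_PER_B = 16     # 100GE NNI uplinks per Bn
UNI_PER_B = 160        # 10GE server links per Bn
LINKS_PER_SERVER = 2   # each server attaches with 2 x 10GE
ECT = 16               # one B-VID (and one subnet) per ECT algorithm


def fabric_sizing():
    """Recompute the headline numbers of the example cluster."""
    switches = UPPER + LOWER
    nni_links = UPLINKS_PER_B * LOWER        # number of 100GE links
    uni_links = UNI_PER_B * LOWER            # number of 10GE links
    ufib = ECT * switches                    # one B-MAC per switch per B-VID
    mfib = ECT * switches                    # one source per switch per subnet
    return {
        "switches": switches,
        "nni_links": nni_links,
        "nni_tbps": nni_links * 100 / 1000,  # 100GE links expressed in Tb/s
        "uni_links": uni_links,
        "servers": uni_links // LINKS_PER_SERVER,
        "ufib": ufib,
        "mfib": mfib,
        "fib_per_node": ufib + mfib,
    }


if __name__ == "__main__":
    for name, value in fabric_sizing().items():
        print(f"{name}: {value}")
```

Changing `ECT` or the switch counts shows how quickly the per-node FIB grows, which is the concern raised later about moving to 32 or more ECT.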
For a given ECT-ALGk, Aj is a member of every SPF-TREE(B*,ECT-ALGk)
Properly tuned, no two ECT-ALGorithms will use the same Aj as a fork point.
[Figure: subnets S1 … S16; source node (1); tree for ECT-ALG#12 over the A1 … A16 / B1 … B32 fabric]
Subnet Ni maps to I-SIDj and then to a unique A(j mod 16).
So load spreading allows each Ai to transit a complete subnet.
Problem #1 - Unable to spread further such that Ai and Aj (i != j) each handle a subset of the flows in I-SIDj.
[Figure: I-SIDi and I-SIDj traffic over the A1 … A16 / B1 … B32 fabric]
This is an issue under failure of Aj.
Recovery will move the entire subnet's traffic to another Ai node.
A preferable solution is to spread the affected load over the remaining A*.
[Figure: I-SIDi and I-SIDj traffic over the A1 … A16 / B1 … B32 fabric, with Aj failed]
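The difference between the two recovery behaviours can be illustrated with a small sketch. It is not 802.1aq behaviour: the "next surviving A" recovery rule, the flow identifiers and the CRC32 hash are all assumptions, chosen only to contrast per-I-SID pinning with per-flow spreading.

```python
import zlib

N_UPPER = 16  # A0..A15 stand in for A1..A16


def pinned_a(isid, alive):
    """Per-I-SID pinning: use A(isid mod 16); on failure move the whole
    I-SID to the next surviving A (hypothetical recovery rule)."""
    for step in range(N_UPPER):
        candidate = (isid + step) % N_UPPER
        if candidate in alive:
            return candidate
    raise RuntimeError("no upper-layer switch left")


def hashed_a(isid, flow_id, alive):
    """Per-flow spreading: hash each flow of the I-SID over the survivors."""
    survivors = sorted(alive)
    key = f"{isid}:{flow_id}".encode()
    return survivors[zlib.crc32(key) % len(survivors)]


if __name__ == "__main__":
    alive = set(range(N_UPPER)) - {3}   # A3 has failed
    flows = range(1000)                 # flows inside I-SID 3
    pinned = {pinned_a(3, alive) for _ in flows}
    spread = {hashed_a(3, f, alive) for f in flows}
    print("pinned recovery uses A nodes:", sorted(pinned))
    print("hashed recovery uses A nodes:", sorted(spread))
```

With pinning, every flow of the affected I-SID lands on one surviving A node; with per-flow hashing the same flows are divided among the survivors.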
Possible solution – head end hashing (unicast only)
Allow unicast I-SIDi and I-SIDj traffic to be hashed based on smaller flows to different B-VIDs (ECT-ALGorithms).
This breaks the symmetry and congruence rules but allows edge balancing at smaller granularity.
No changes to multicast.
Requires learning <C-DA, B-DA>, independent of B-VID.
[Figure: I-SIDi and I-SIDj traffic, unicast vs. multicast (Mcast), over the A1 … A16 / B1 … B32 fabric]
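A minimal sketch of head-end hashing as described above, assuming a CRC32 hash, a made-up B-VID range and a plain dictionary for the learned <C-DA, B-DA> bindings. None of these names or values come from 802.1aq; they only show that the B-VID is chosen per flow while the learned B-DA does not depend on it.

```python
import zlib

# Hypothetical set of 16 B-VIDs, one per ECT algorithm.
B_VIDS = list(range(4001, 4017))

# Learned bindings C-DA -> B-DA, kept independent of the B-VID.
fdb = {}


def learn(c_sa, b_sa):
    """Record which backbone MAC a customer MAC was seen behind."""
    fdb[c_sa] = b_sa


def pick_bvid(c_sa, c_da, c_vid, seed=0):
    """Hash a customer flow onto one of the B-VIDs (unicast only)."""
    key = f"{seed}|{c_sa}|{c_da}|{c_vid}".encode()
    return B_VIDS[zlib.crc32(key) % len(B_VIDS)]


def encapsulate(c_sa, c_da, c_vid):
    """Return (B-DA, B-VID) for known unicast, or None if C-DA is unknown."""
    b_da = fdb.get(c_da)
    if b_da is None:
        return None  # unknown unicast/multicast is not hashed in this sketch
    return b_da, pick_bvid(c_sa, c_da, c_vid)


if __name__ == "__main__":
    learn("c-mac-2", "b-mac-B7")
    print(encapsulate("c-mac-1", "c-mac-2", 10))
    print(encapsulate("c-mac-1", "c-mac-9", 10))
```

Because the return direction hashes a different key, the two directions of a conversation may use different B-VIDs, which is the loss of symmetry and congruence noted above.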
Interconnection of fabrics creates more than 16 paths (exponential)
[Figure: A1 … A16 / B1 … B32 fabrics interconnected through C1 and C2; path counts O(16), O(16x2), O(16x2x16)]
Number of paths can grow exponentially with increasing levels.
A constant number of ECT paths is always << the number of paths available in many networks.
Growing 802.1aq ECT to, say, 32 or even 100 ECMP causes larger unicast FIBs.
Horizontal Growth – not too bad but need more ECT-ALGORITHMS.
Horizontal growth by 1 just increases the number of ECT by 1.
Not too big a problem, but we would need to define new ECT (via Opaque).
[Figure: fabric extended with A17, B33 and B34]
General Issue
[Figure: network between S and D, O(degree) wide and O(diameter) long; head end chooses path from N x B-VID]
#paths ~= O(degree^diameter)
So head end ECT in worst case requires O(exp(# B-VIDs))
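The growth in path count can be made concrete with a few lines of arithmetic. The sketch assumes fully meshed stages, so the number of paths is the product of the choices available at each stage; the function name is illustrative.

```python
def count_paths(choices_per_stage):
    """Paths through fully meshed stages = product of per-stage choices."""
    paths = 1
    for choices in choices_per_stage:
        paths *= choices
    return paths


if __name__ == "__main__":
    # The interconnected-fabric example: O(16), O(16x2), O(16x2x16).
    for stages in ([16], [16, 2], [16, 2, 16]):
        print(stages, "->", count_paths(stages))

    # Uniform case: `degree` choices at each of `diameter` stages.
    degree, diameter = 16, 3
    print(count_paths([degree] * diameter), "==", degree ** diameter)
```

A head end that wants to address every path individually needs one B-VID per path, so its state follows the same product.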
A feasible solution …
Re-assign traffic to path at each hop
Tandem “ECMP” just like IP.
Need to keep O(degree) number of next hops.
Only need one B-VID .. removes O(diameter) from state cost.
Flip side is you have no control – just hope for fine-scale statistical distribution.
[Figure: path from S to D over a single B-VID; each hop chooses path from N x next hop]
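Tandem hashing can be sketched as each node independently hashing the flow onto one of its equal-cost next hops. The topology, node names and CRC32 hash below are assumptions for illustration; the node name is mixed into the key so that successive hops do not all make the same choice.

```python
import zlib

# Hypothetical equal-cost next hops toward destination "D".
NEXT_HOPS = {
    "S": ["B1", "B2"],
    "B1": ["A1", "A2"],
    "B2": ["A1", "A2"],
    "A1": ["D"],
    "A2": ["D"],
}


def next_hop(node, flow):
    """Pick one equal-cost next hop for this flow at this node."""
    choices = NEXT_HOPS[node]
    key = f"{node}|{flow}".encode()  # node name acts as a per-hop seed
    return choices[zlib.crc32(key) % len(choices)]


def walk(flow, src="S", dst="D"):
    """Follow the per-hop decisions from src to dst and return the path."""
    path = [src]
    while path[-1] != dst:
        path.append(next_hop(path[-1], flow))
    return path


if __name__ == "__main__":
    for flow in ("10.0.0.1>10.0.1.1", "10.0.0.2>10.0.1.1", "10.0.0.3>10.0.1.7"):
        print(flow, walk(flow))
```

Each node stores only its own next-hop list, O(degree) entries, and no node can steer the path end to end, which matches the trade-off described above.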
What about loops in this mode?
802.1aq Ingress Check is very strong in the case of a single next hop and hence a single possible ingress for an SA.
802.1aq Ingress Check is weakened in the case of multiple next hops and hence multiple possible ingresses for an SA.
However 802.1aq Agreement Protocol functions correctly in the context of multiple possible Next Hops for the same B-VID (refer to Mick’s proof).
But …
Agreement Protocol Concerns
Is it too complex? It is clearly non-trivial; we need implementation/emulation experience.
Is it overly Draconian? For example, the bounds on movement are what is required for a mathematical proof by induction. However, there are probably many cases where further movement would not loop. What is the degree of ‘overkill’?
Is it marketable? – this is unfortunately a legitimate concern!!!
802.1aq can be deployed without AP until we introduce hash-based forwarding, at which point we either require a symmetric AP and/or an on-data-path loop detection/drop mechanism.
Believe that an on-data-path loop detection mechanism is required for hash-based ECMP until we have more experience with AP.
Recommend we standardize a TTL TAG either stand-alone or as a new form of I-TAG.
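The effect of a TTL on a transient loop can be shown with a toy forwarder. The slides do not define the tag format, so this sketch models only the behaviour: the TTL is decremented at every hop and the frame is dropped once it reaches zero. Function and node names are hypothetical.

```python
def forward(start, dst, ttl, next_hop):
    """Forward hop by hop; drop the frame when its TTL is exhausted.

    `next_hop` maps each node to the neighbour it currently forwards to.
    Returns ("delivered" | "dropped", number_of_hops_taken).
    """
    node, hops = start, 0
    while node != dst:
        if ttl == 0:
            return "dropped", hops   # loop (or overlong path) detected
        ttl -= 1
        node = next_hop[node]
        hops += 1
    return "delivered", hops


if __name__ == "__main__":
    healthy = {"X": "Y", "Y": "D"}
    looping = {"X": "Y", "Y": "X"}   # transient loop during reconvergence
    print(forward("X", "D", 8, healthy))
    print(forward("X", "D", 8, looping))
```

The TTL bounds the damage of a loop to at most `ttl` extra hops per frame; it does not prevent the loop from forming, which is why it complements rather than replaces the Agreement Protocol.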
View of New Work Requirements
R1) New ECT-ALGorithms with improved spreading properties.
R2) Allow optional head end hash assignment of 802.1aq SPBM UNI known unicast traffic to one of multiple next hop interfaces/B-VIDs. Very similar to Link Aggregation.
Minimally HASH (seed, C.SA, C.DA, C-VID, [ IP.SA, IP.DA, IP.PROTO ])
R3) Allow optional tandem hash assignment of 802.1aq SPBM B-VID NNI unicast traffic to one of multiple next hop interfaces. Essentially a new SPBM ECT-ALG with its own B-VID (i.e. new ECT-ALGorithms, all usable at same time).
Minimally HASH (seed, B-VID, C.SA, C.DA, C-VID, [ IP.SA, IP.DA, IP.PROTO ])
R4) Minor OA&M changes in support of R2 and R3, because symmetry/congruence broken.
R5) More experience with AP, emulations, simulations etc. + addition of TTL to new I-TAG or a TTL-TAG.