Flowlet Switching: Srikanth Kandula, Shan Sinha & Dina Katabi
High Performance Switching and Routing
Telecom Center Workshop: Sept 4, 1997.
Flowlet Switching
Srikanth Kandula, Shan Sinha & Dina Katabi
ISPs Want to Split Traffic Across Multiple Paths
• Load balancing to remove hot spots
• Rebalance traffic when unpredictable events occur (outages, DoS, BGP reroutes, flash crowds, …)
[Figure: traffic split 30% / 70% across two paths]
[Figure: when unpredictable traffic arrives, traffic is rebalanced and the split changes from 30% / 70% to 70% / 30%]
• Much research on balancing and rebalancing load
• But implementation is hard, particularly with dynamic ratios: either sacrifice accuracy or reorder TCP packets
Problem
1. Given the desired split ratios (possibly dynamic)
2. Split traffic accurately, at the edge router, without reordering TCP's packets
Existing Scheme 1: Packet-Based Splitting
• Assign packets to paths proportional to the desired ratios
• Reorders TCP packets, causing bad throughput
Existing Scheme 2: Flow-Based Splitting
• Assign TCP flows to each path proportional to the desired ratio
1. Flows are not all equal: elephants & mice
2. So, estimate the rate of each TCP flow
3. But rates change with time
4. Too complex
5. Very inaccurate if desired ratios change
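As an illustration of flow-based splitting, here is a minimal Python sketch. It is not taken from the talk: the function name `flow_based_path`, the use of CRC32 as the hash, and the cumulative-ratio lookup are all assumptions made for the example. It shows why a flow never gets reordered (its path is a pure function of its header) and also why accuracy suffers: the split is over flow counts, not bytes, so a single elephant can skew the load.

```python
import zlib

def flow_based_path(flow_key, ratios):
    """Pick a path for a flow by hashing its key into [0, 1) and
    locating that point in the cumulative split ratios.
    (Hypothetical helper; CRC32 is an arbitrary hash choice.)"""
    h = zlib.crc32(repr(flow_key).encode()) / 2**32  # value in [0, 1)
    cumulative = 0.0
    for path, ratio in enumerate(ratios):
        cumulative += ratio
        if h < cumulative:
            return path
    return len(ratios) - 1  # guard against floating-point rounding

# Every packet of the same flow maps to the same path, so no reordering,
# but the bytes per path depend on which flows happen to be elephants.
key = ("10.0.0.1", "10.0.0.2", 1234, 80)
path = flow_based_path(key, [0.3, 0.7])
```

Note that changing `ratios` in this sketch can move an in-progress flow to another path, which is one way to see why flow-based schemes struggle with dynamic ratios.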
How to Split Traffic?
Packet-Based
• Accurate
• Reorders TCP packets
• Easily tracks dynamic ratios
Flow-Based
• Inaccurate
• No packet reordering
• Hard to track if ratios change
Can we combine the best of the two approaches?
This Talk
• Show how to send a single TCP flow down multiple paths without reordering
• Accurately split traffic even when desired ratios are dynamic
• Easy to implement
Flowlet Switching
• If the previous packet from the flow has already left the merging point, the flow can be reassigned to a different path
[Figure: a TCP flow split over two paths, path 1 with Delay = D1 and path 2 with Delay = D2; bursts are separated by Idle ≥ δ]
• Given δ > |D2 − D1|
• Flowlets are bursts from the same flow separated by an idle time of at least δ; they can be switched independently!
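The condition on the idle gap can be checked with a small worked example. The sketch below is illustrative only: the function names and the delay values are assumptions, and it models each path as a fixed one-way delay. A packet sent after an idle gap on a new path arrives in order only if the gap exceeds the amount by which the old path is slower.

```python
def arrives_in_order(gap, d_old, d_new):
    """True if a packet sent `gap` seconds after the previous one, on a path
    with delay d_new, reaches the merging point after the previous packet,
    which travelled on a path with delay d_old."""
    prev_arrival = d_old        # previous packet sent at t = 0 on the old path
    next_arrival = gap + d_new  # next packet sent at t = gap on the new path
    return next_arrival > prev_arrival

def max_delay_difference(path_delays):
    """Largest |Di - Dj| over all pairs of paths; the flowlet timeout
    must exceed this value."""
    return max(path_delays) - min(path_delays)

# Old path 80 ms, new path 40 ms: a 50 ms idle gap is enough, 20 ms is not.
print(arrives_in_order(0.050, 0.080, 0.040))
print(arrives_in_order(0.020, 0.080, 0.040))
```

Moving from a fast path to a slow one never reorders in this model, so only the absolute delay difference matters.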
Implementing Flowlet Switching is Simple
• Router at the split point hashes packet header
• If (Now - Last_Seen) > δ, the flow can change path
• Reassign path proportionally to the desired split ratios
[Figure: hash of (SRCip, DSTip, SRCPort, DSTPort) indexes a table entry holding Last_Seen (s) and Path, e.g. Last_Seen = 9920.2659, Path = 3]
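The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `FlowletSwitch`, the CRC32 hash, the fixed-size table, and the "largest byte deficit" rule for choosing a new path are all choices made for the example. The slides only say that the path is reassigned proportionally to the desired split ratios.

```python
import zlib

class FlowletSwitch:
    """Hypothetical flowlet switch: one small hash table, no per-flow state."""

    def __init__(self, ratios, delta, table_size=1024):
        self.ratios = list(ratios)         # desired split ratios, one per path
        self.delta = delta                 # flowlet timeout in seconds
        self.table = [None] * table_size   # slot -> (last_seen, path)
        self.sent = [0] * len(self.ratios) # bytes sent on each path so far

    def _pick_path(self):
        # Assumption: choose the path furthest below its desired byte share.
        total = sum(self.sent)
        deficits = [r * total - s for r, s in zip(self.ratios, self.sent)]
        return max(range(len(deficits)), key=deficits.__getitem__)

    def route(self, flow_key, size, now):
        slot = zlib.crc32(repr(flow_key).encode()) % len(self.table)
        entry = self.table[slot]
        if entry is None or now - entry[0] > self.delta:
            path = self._pick_path()       # idle long enough: new flowlet
        else:
            path = entry[1]                # same flowlet: keep the path
        self.table[slot] = (now, path)     # refresh Last_Seen
        self.sent[path] += size
        return path

sw = FlowletSwitch(ratios=[0.5, 0.5], delta=0.05)
key = ("10.0.0.1", "10.0.0.2", 1234, 80)
paths = [sw.route(key, 1000, t) for t in (0.0, 0.01, 0.10)]
```

Flows that collide in the same slot simply share an entry in this sketch; they then switch paths together, which costs some rebalancing opportunities but cannot cause reordering.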
Does it Really Work?
• Traces collected on a peering link, an edge link and two core links
• Split vectors (3 paths)
   Static: (.3, .3, .4)
   Dynamic: sinusoidal with amplitude 60%, period 20 min [Akella04, Chuah02]
Is Flowlet Switching Accurate?
Error = (1/N) × Σ over paths of |Desired − Obtained| / Desired
[Figure: bar chart of Error (%) for packet-based, flow-based and flowlet switching under static and dynamic split vectors; y-axis from 0 to 45; bar labels read 0.06%, 2.31%, 12.01%, 0.07%, 3.96%, 40.83%]
Flowlet switching is much more accurate than flow-based switching
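One plausible reading of the error metric on this slide is the average, over the N paths, of the relative deviation between the desired and the obtained share. The sketch below implements that reading; the function name `split_error` and the example numbers are assumptions, not values from the talk.

```python
def split_error(desired, obtained):
    """Mean over paths of |desired - obtained| / desired.
    Both arguments are per-path traffic fractions of equal length."""
    if len(desired) != len(obtained):
        raise ValueError("need one obtained share per desired share")
    n = len(desired)
    return sum(abs(d - o) / d for d, o in zip(desired, obtained)) / n

# Desired an even split, obtained 40% / 60%: each path is off by 20%.
err = split_error([0.5, 0.5], [0.4, 0.6])
```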
Can do Flowlet Switching without Per-Flow State
• #Active flows ~ 50,000, but the router maintains a hash table of < 1000 entries (5KB)
[Figure: error vs. hash table entries (4, 16, 64, 256, 1024, 2048, 4096, 8192); shows avg. and max. of many traces]
• Errors stabilize for a small table
But Where do Flowlets come from?
• Can’t be just timeouts or short flows; most of the bytes are in the elephants
• Why can a large flow be broken into many small flowlets?
• It is well known that TCP usually sends a window in one or a few bursts and waits for acks [Zhang91, Zhang03, Jiang04]
• Some reasons: slow-start; ack compression; window is much smaller than the delay-BW product
Flowlets exist because TCP is bursty at RTT and sub-RTT scales
Flowlets Exist Because TCP is Bursty
• Most flowlets have inter-arrivals less than an RTT, so most flowlets are sub-windows
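To make the notion of a flowlet concrete, the following sketch cuts one flow's packet arrival times into flowlets wherever the idle gap exceeds the timeout. The function name and the sample timestamps are made up for illustration; the strict `>` comparison mirrors the router rule (Now - Last_Seen) > timeout.

```python
def split_into_flowlets(arrival_times, delta):
    """Group a flow's sorted packet arrival times (seconds) into flowlets:
    a new flowlet starts whenever the gap since the previous packet
    exceeds delta."""
    flowlets = []
    for t in arrival_times:
        if flowlets and t - flowlets[-1][-1] <= delta:
            flowlets[-1].append(t)   # still inside the current burst
        else:
            flowlets.append([t])     # idle gap > delta: start a new flowlet
    return flowlets

# Three bursts of one TCP flow, separated by idle periods longer than 50 ms.
times = [0.0, 0.001, 0.002, 0.100, 0.101, 0.300]
bursts = split_into_flowlets(times, delta=0.05)
```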
Why Flowlet Switching is Accurate?
• 80% of bytes are in flowlets smaller than 10KB
• Assigning a flowlet to a path isn’t a long commitment
Why Flowlets can Track Dynamics?
Arrival rate of both flows and flowlets (/sec)
Trace     Flows      Flowlets
Edge      143.16     1454.98
Peering   611.95     8661.43
Core1     3784.10    35287.04
Core2     111.33     2848.76
An order of magnitude more opportunities to rebalance!
Why flowlet switching doesn't need per-flow state?
[Figure: timelines of Flow 1, Flow 2 and Flow 3, and the resulting # active flowlets (0 to 3) over time]
Trace     #Active Flows   #Active Flowlets
Edge      1450.42         18.41
Peering   8477.33         28.08
Core1     47883.33        240.12
Core2     1559.33         50.66
#Active flowlets is 2 orders of magnitude smaller than #active flows, so a very small hash table suffices
Why Flowlet Switching is Possible?
• Why can a large flow be broken into many small flowlets? TCP burstiness at small time scales
• Why is flowlet switching accurate? Small commitment; many more chances to rebalance
• Why does flowlet switching not need per-flow state? Few simultaneously active flowlets
Configuring Flowlet Switching
• Flowlet separation δ > delay difference
• But how to find the delay difference?
• For our traces, which are a diverse collection of traffic within the continental US, δ ≈ 50ms is a good and safe choice!
• Our procedure is a constructive way to find δ
Flowlet Separation of 50ms is Good
Any flowlet timeout in [50, 100] ms yields highly accurate splits
~50ms results in accurate splitting
Flowlet Separation of 50ms is Safe
• Even if the delay difference >> 50ms, the probability of reordering is negligible compared to the drop rate in the Internet (about 1%)
[Figure: plot with y-axis from 0% to 1% in steps of 0.2%]