stress resistant scheduling algorithms for cioq switches prashanth pappu applied research laboratory...
TRANSCRIPT
Stress Resistant Scheduling Algorithms for
CIOQ Switches Prashanth PappuApplied Research Laboratory
Washington University in St Louis
“Stress Resistant Scheduling Algorithms for CIOQ switches”, Prashanth Pappu, Jon Turner. To appear in ICNP 2003.
Prashanth Pappu
Anatomy of a Router
Switch Fabric
IPP
OP
P
LineCard
IPP
OP
P
LineCard
IPP
OP
P
LineCard
IPP
OP
P
LineCard
IPP
OP
P
LineCard
IPP
OP
P
LineCard
ControlProcessor
Port processor queue packets and make routing decisions
Line cards encode data for transmission on target physical layer.
Control processor – routing protocols and monitoring functions.
Prashanth Pappu
Output Queuing
Queuing is done only at output ports. Maximizes throughput. Contentions between packets - only at output ports. Speedup=N, impractical but ideal model.
SwitchingFabric
OutputPorts
InputPorts
Prashanth Pappu
Combined Input Output Queuing (CIOQ)
Use of VOQs. Crossbar configured by centralized scheduler. Bipartite graph matching problem.
SwitchingFabric
OutputPorts
InputPorts
……
CentralizedScheduler
Prashanth Pappu
Stability results Maximum size matching (MSM) – stable for i.i.d,
uniform, admissible traffic. Maximum weight matching (MWM) – stable for
independent, admissible traffic. Too complex, O(N5/2) and O(N3logN). Switch with 10 Gb/s links has < 40 ns to make
scheduling decision. Maximal size matching algorithms – Parallel
Iterative Matching (PIM) and iterative SLIP (iSLIP).
Prashanth Pappu
Parallel iterative Matching (PIM) Iterative matching algorithm
each unmatched input sends request to every output for which it has a queued cell.
unmatched outputs randomly pick a request and send grant.
if input receives multiple grants, it picks one randomly.
O(log N) convergence.
Prashanth Pappu
iterative SLIP (iSLIP) Iterative matching algorithm
unmatched inputs send requests to unmatched outputs (for which they have cells)
unmatched outputs pick a request that appears next in a fixed round-robin order from an input pointer. (input pointer is updated only in first iteration)
if input gets multiple grants, it picks one that appears next in a fixed round robin order from an output pointer. (update of output pointer)
Desynchronization effect. Simple to implement but do not perform well under
extreme traffic conditions.
Prashanth Pappu
Worst case results
Critical Cells First (CCF) can emulate output queuing with speedup =2.
Lowest occupancy output first algorithm is work conserving with speedup =2.
Can be augmented with timestamps to emulate output queuing. (speedup=3)
Prashanth Pappu
LOOFA
Iterative matching algorithm unmatched inputs send requests to outputs with
lowest occupancy (for which they have queued cells)
outputs pick a request randomly and send grant to input
O(N) iterations to perform correctly. Work conserving with speedup of 2. Significant result but not practical.
Prashanth Pappu
Traffic in IP networks Unregulated nature of IP networks can cause
sustained overloads. use of slow congestion control mechanisms limited route diversity makes congested links common use of route selection mechanisms not guided by session
bandwidth needs sudden route changes causing rapid traffic shifts malicious users
How do practical scheduling algorithms perform in these conditions?
Prashanth Pappu
Solution
We use targeted stress tests to study performance of practical scheduling
algorithms under extreme conditions study performance of work conserving scheduling
algorithms under speedups < 2 design stress resistant scheduling algorithms
which maintain throughput under uniform traffic and stress tests and can still be implemented at high speeds.
Prashanth Pappu
Miss fraction
Previous work use average queuing delay as a metric.
Not useful under inadmissible traffic conditions.
Miss fraction
miss fraction = 1 – NA/NI
Determines relative loss in throughput.
Prashanth Pappu
Stress Testphase 1 phase 2 phase 3 phase 4
Adversary approach in overloading (stressing) various outputs.
Output with empty queues have cells queued at various inputs.
Inputs with cells for an empty output also have cells queued forother outputs.
Test can be varied by changing number of participating inputs orphases.
Prashanth Pappu
Stress Test (Example)
PIM (speedup =1.5). Stress test with 3 participating inputs, 4 phases.
Prashanth Pappu
PIM (under uniform traffic)
Average Queuing delays Miss fraction
Prashanth Pappu
iSLIP (under uniform traffic)
Average Queuing delays Miss fraction
Prashanth Pappu
Stress Tests
Test A(Worst case for PIM(4), speedup=2)
Test B(Worst case for LOOFA, speedup=2)
Prashanth Pappu
Stress resistant algorithms Better performance of LOOFA suggests, ordering
outputs is the key. Complete ordering can make algorithms too
complex to implement. But traffic conditions are persistent and change
slowly, use approximate ordering schemes. Lowest Layer Selection (LLS) heuristic which achieves a
coarser ordering of outputs. Odd-even sorting which achieves approximate ordering but
converges to ideal ordering under persistent traffic conditions.
Prashanth Pappu
Lowest Layer Selection achieves coarser ordering bigger layers for larger
queue lengths beyond a queue limit all
outputs are treated equal number of layers
independent of N. algorithms give priority to
outputs in lowest layer in accept phase.
priority encoder or N-way minimum finding circuit can be used on a grant vector.
Prashanth Pappu
Lowest Layer Selection - Random (LLS-R) Iterative matching algorithm
each unmatched input sends request to every output for which it has a queued cell.
unmatched outputs randomly pick a request and send grant.
if input receives multiple grants, it picks one randomly from lowest layer.
O(log N) convergence still holds.
Prashanth Pappu
Lowest Layer Selection –SLIP (LLS-S) Iterative matching algorithm
unmatched inputs send requests to unmatched outputs (for which they have cells)
unmatched outputs pick a request that appears next in a fixed round-robin order from an input pointer. (input pointer is updated only in first iteration)
if input gets multiple grants, it picks one that appears next in the lowest layer in a fixed round robin order from an output pointer. (update of output pointer)
Both LLS-R and LLS-S have the same performance as PIM and iSLIP under uniform traffic.
Prashanth Pappu
Stress TestMiss fractions for LLS-R, LLS-S (using 16 layers) and LOOFA.
Test A Test B
Prashanth Pappu
Stress TestMiss fractions for LLS-S and LLS-R (single iteration) with varying layers.
LLS-S (Test A) LLS-R (Test A)
Prashanth Pappu
Approximate LOOFA (A-LOOFA)
LOOFA is complex but can be used as the basis for a practical algorithm (with similar performance)
Prashanth Pappu
Approximate LOOFA (A-LOOFA)
)()( ,1,1,,1,1,, jijijijijijiji cvrcvrr
Matching in A-LOOFA is accomplished using a simple combinational circuit.
O(N) but constant factor is determined by gate delays (2N times delay in each block).
.13 um ASIC process, gate delays are 25-50 ps. Match can be completed in 3.2-6.4 ns.
)()( 1,,,11,,,1, jijijijijijiji rvcrvcc
Prashanth Pappu
A-LOOFA Columns are ordered using odd-even sort.
for all even j < N, swap Bj and qj with Bj+1 and qj+1, if qj > qj+1.
Similarly, for all odd j < N-1
Rows are ordered using a permutation based on perfect shuffle (to ensure fairness). for all even i<N, generate a pseudo random bit xi.
if xi = 0, values in row i are moved to row i/2 and those in i+1 are moved to (N+i)/2.
else, values in row i are moved to row (N+i)/2 and values in row i+1 are moved to row i/2.
Prashanth Pappu
A-LOOFA performance
Test A Test B