Towards Simple, High-performance Input-Queued Switch Schedulers
Devavrat Shah, Stanford University
Berkeley, Dec 5
Joint work with Paolo Giaccone and Balaji Prabhakar
2
Outline
• Description of input-queued switches
• Scheduling
  – the problem
  – some history
• Simple, high-performance schedulers
  – Laura
  – Serena
  – Apsara
• Conclusions
3
The Input-Queued (IQ) Switch Architecture
• N inputs, N outputs (in the figure, N = 3)
• Time is slotted
  – at most one packet can arrive per time-slot at each input
• Equal-sized cells/packets
• Buffers only at inputs
• A crossbar switches the packets
4
Scheduling
• The crossbar imposes these constraints: in each time-slot
  – only one packet can be transferred to each output
  – only one packet can be transferred from each input
• The scheduling problem: subject to the above constraints, find a matching of inputs and outputs
  – i.e. determine which output will receive a packet from which input in each time-slot
5
Background to switch scheduling
1. [Karol et al. 1987] Throughput is limited due to head-of-line blocking (limited to 58% for Bernoulli IID uniform traffic)
2. [Tamir 1989] Observed that with “Virtual Output Queues” (VOQs) head-of-line blocking is eliminated.
6
Basic Switch Model
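[Figure: N x N switch. Arrivals A11(t), ..., A1N(t) at input 1 through AN1(t), ..., ANN(t) at input N feed virtual output queues with occupancies L11(t), ..., LNN(t); the schedule S(t) configures the crossbar; departures D1(t), ..., DN(t) leave from outputs 1 to N.]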
7
Some definitions
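1. Traffic matrix: $\Lambda = [\lambda_{ij}]$, where $\lambda_{ij} = E[A_{ij}(n)]$.
   If $\sum_i \lambda_{ij} < 1$ for every output $j$ and $\sum_j \lambda_{ij} < 1$ for every input $i$, we say the traffic is "admissible".
2. Service matrix: $S = [s_{ij}]$, where $s_{ij} \in \{0, 1\}$ and $S$ is a permutation matrix.
3. Queue occupancies: $L_{11}(t), \ldots, L_{NN}(t)$.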
8
More background on theory
[Anderson et al. 1993] Finding a schedule is equivalent to finding a matching in the bipartite graph whose nodes are the inputs and the outputs
9
Background
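[McKeown et al. 1995]
(a) Maximum size match does not give 100% throughput.
(b) But maximum weight match can, where the weight can be the queue length or the age of a cell.
[Figure: weighted bipartite graph of inputs and outputs, with its maximum weight matching (MWM) highlighted]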
10
Maximum Weight Matching
• Maximum weight matching (MWM)
  – 100% throughput
  – provable delay bounds for i.i.d. Bernoulli admissible traffic
  – but finding the MWM is like solving a network-flow problem whose complexity is $O(N^3)$, which is too complex for high-speed networks
• We seek to approximate maximum weight matching
• Our goal:
  – obtain a simply implementable approximation to MWM that performs competitively with MWM
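To make concrete what MWM computes, here is a minimal sketch that finds it by brute force over all N! permutations. It is only an illustration for tiny N, not the network-flow method mentioned on the slide; the names `matching_weight`, `mwm_brute_force` and the queue-length matrix `Q` are assumptions made for this example.

```python
from itertools import permutations

def matching_weight(perm, Q):
    """Weight of the matching that connects input i to output perm[i]."""
    return sum(Q[i][perm[i]] for i in range(len(perm)))

def mwm_brute_force(Q):
    """Return the heaviest of all N! matchings (feasible only for small N)."""
    n = len(Q)
    return max(permutations(range(n)), key=lambda p: matching_weight(p, Q))

# Hypothetical queue lengths: Q[i][j] = cells at input i destined for output j.
Q = [[3, 0, 5],
     [4, 1, 0],
     [0, 6, 2]]
best = mwm_brute_force(Q)
print(best, matching_weight(best, Q))
```

The search space grows as N!, which is why the rest of the talk looks for cheaper approximations.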
11
Approximating MWM
• Two performance measures
  – throughput
  – delay
• We first consider simple approximations to MWM that deliver 100% throughput (i.e. stability), and then deal with delay
12
Methods of Approximation
• Randomization
  – well-known method for simplifying implementation
• Using information in packet arrivals
  – since queue-sizes grow due to arrivals, and arrival times are a source of randomness
• Hardware parallelism
  – yields an efficient search procedure
13
Randomization
• The main idea of randomized algorithms is
  – to simplify the decision-making process by basing decisions upon a small, randomly chosen sample from the state rather than upon the complete state
14
An Illustrative Example
• Find the oldest person from a population of 1 billion ($10^9$)
• Deterministic algorithm: linear search
  – has a complexity of 1 billion
• A randomized version: find the oldest of 30 randomly chosen people
  – has a complexity of 30 (ignoring the complexity of random sampling)
• Performance
  – linear search will find the absolute oldest person (rank = 1)
  – if R is the person found by the randomized algorithm, we can make statements like P(R has rank < 100 million) > 0.95, since the chance that all 30 samples miss the oldest 10% is about $0.9^{30} \approx 0.04$
  – thus, the randomized algorithm performs very well with high probability
15
Randomizing Iterative Schemes
• Often, we want to perform some operation iteratively
• Example: find the oldest person each year
• Say in 2001 you choose 30 people at random
  – and store the identity of the oldest person in memory
  – in 2002 you choose 29 new people at random
  – let R be the oldest person from these 29 + 1 = 30 people
• R is then the oldest of 59 samples in all, so P(R has rank < 100 million) is about $1 - 0.9^{59} \approx 0.998$
  – or, P(R has rank < 50 million) is about $1 - 0.95^{59} \approx 0.95$
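The probabilities in the oldest-person example can be checked with a few lines of arithmetic. This is a rough sketch that assumes samples are independent and uniform over the population (sampling with replacement), so the chance that every sample misses the top fraction `f` is `(1 - f) ** samples`; the function name `prob_rank_within` is made up for the example.

```python
def prob_rank_within(fraction, samples):
    """P(the best of `samples` uniform draws lies in the top `fraction`)."""
    return 1.0 - (1.0 - fraction) ** samples

# One-shot: oldest of 30 people, top 10% = rank below 100 million of 1 billion.
p_one_shot = prob_rank_within(0.10, 30)

# With memory: 29 new samples plus the remembered best of the earlier 30,
# i.e. effectively the best of 59 samples.
p_memory_top10 = prob_rank_within(0.10, 59)
p_memory_top5 = prob_rank_within(0.05, 59)

print(round(p_one_shot, 3), round(p_memory_top10, 3), round(p_memory_top5, 3))
```

With the same 30 comparisons per year, remembering one person tightens the guarantee from the top 10% to roughly the top 5% at the same confidence level.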
16
Back to Switch Scheduling: Randomizing MWM
• Choose d matchings at random and use the heaviest one as the schedule
• Ideally we would like to have small d. However:
• Theorem: Even with d = N this algorithm doesn’t yield 100% throughput!
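A minimal sketch of this memoryless randomized scheduler is shown below: draw d matchings uniformly at random and keep the heaviest. The helper names (`weight`, `heaviest_of_d_random`) and the queue matrix `Q` are assumptions for illustration.

```python
import random

def weight(perm, Q):
    """Weight of the matching input i -> output perm[i]."""
    return sum(Q[i][perm[i]] for i in range(len(perm)))

def heaviest_of_d_random(Q, d, rng):
    """Sample d uniform random matchings and return the heaviest one."""
    n = len(Q)
    best = None
    for _ in range(d):
        cand = list(range(n))
        rng.shuffle(cand)          # uniform random permutation
        if best is None or weight(cand, Q) > weight(best, Q):
            best = cand
    return best

rng = random.Random(1)
Q = [[3, 0, 5],
     [4, 1, 0],
     [0, 6, 2]]
schedule = heaviest_of_d_random(Q, d=3, rng=rng)
print(schedule, weight(schedule, Q))
```

Nothing is carried over between time-slots here, which is the weakness the theorem points at.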
17
Proof
18
Simulation Scenario
• Switch size: 32 x 32
• Input traffic (shown for a 4 x 4 switch)
  – Bernoulli i.i.d. inputs
  – diagonal load matrix:
$$\begin{pmatrix} x & y & 0 & 0 \\ 0 & x & y & 0 \\ 0 & 0 & x & y \\ y & 0 & 0 & x \end{pmatrix}$$
    • normalized load = x + y < 1
    • x = 2y
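The diagonal load matrix of the simulation scenario can be built for any switch size as sketched here, assuming the 4 x 4 pattern extends cyclically (x on the main diagonal, y on the next diagonal, wrapping around). The function name `diagonal_load_matrix` is hypothetical.

```python
def diagonal_load_matrix(n, load):
    """n x n load matrix with x on the diagonal, y on the next (cyclic)
    diagonal, where x = 2y and x + y = load."""
    y = load / 3.0
    x = 2.0 * y
    lam = [[0.0] * n for _ in range(n)]
    for i in range(n):
        lam[i][i] = x
        lam[i][(i + 1) % n] = y
    return lam

lam = diagonal_load_matrix(4, 0.9)
for row in lam:
    print(row)
```

Every row and every column sums to the normalized load, so the traffic is admissible whenever the load is below 1.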
19
[Figure: Diagonal traffic. Mean IQ length (log scale, 0.001 to 10000) versus normalized load (0.0 to 1.0), with curves for MWM, R32 and R1.]
20
Crucial Observation
• The state of the switch changes due to arrivals & departures
• Between consecutive time slots, a queue’s length can change at most by 1
  – hence a heavy matching tends to stay heavy
• Therefore
  – "remembering" a heavy matching should help in improving the performance
21
Tassiulas’ Algorithm
• [Tassiulas 1998] proposed the following algorithm based on this observation:
  – let S(t-1) be the matching used at time t-1
  – let R(t) be a matching chosen uniformly at random
  – and let S(t) be the heavier of R(t) and S(t-1)
• This gives 100% throughput!
  – note the boost in throughput is due to the use of memory
• But delays are very large
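The three steps above can be sketched in a few lines. The tiny simulation loop around it (uniform Bernoulli arrivals, one departure per matched non-empty queue) is an assumption added only to show the step in context; `tassiulas_step`, `simulate` and their parameters are illustrative names.

```python
import random

def weight(perm, Q):
    """Weight of the matching input i -> output perm[i]."""
    return sum(Q[i][perm[i]] for i in range(len(perm)))

def tassiulas_step(prev, Q, rng):
    """S(t) = heavier of a uniform random matching R(t) and S(t-1)."""
    cand = list(range(len(Q)))
    rng.shuffle(cand)
    return cand if weight(cand, Q) > weight(prev, Q) else prev

def simulate(n, load, slots, seed=0):
    """Toy simulation: returns the total backlog after `slots` time-slots."""
    rng = random.Random(seed)
    Q = [[0] * n for _ in range(n)]
    S = list(range(n))
    for _ in range(slots):
        for i in range(n):                       # Bernoulli arrivals
            if rng.random() < load:
                Q[i][rng.randrange(n)] += 1
        S = tassiulas_step(S, Q, rng)            # weights use current queues
        for i in range(n):                       # serve matched queues
            if Q[i][S[i]] > 0:
                Q[i][S[i]] -= 1
    return sum(map(sum, Q))

print(simulate(n=4, load=0.5, slots=1000))
```

Only one random matching and one comparison are needed per time-slot, which is what makes the scheme linear in N.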
22
[Figure: Diagonal traffic. Mean IQ length (log scale, 0.01 to 10000) versus normalized load (0.1 to 1.0), with curves for MWM and Tassiulas.]
23
Derandomization
• Let G be a fully-connected graph where each node is one of the N! possible schedules
• Construct a Hamiltonian walk, H(t), on G
  – H(t) cycles through the nodes of G
• At any time t
  – let R(t) = H(t mod N!)
  – and let S(t) be the heavier of R(t) and S(t-1)
• This also has 100% throughput, but delays are large (derandomization will be useful later)
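A minimal sketch of the derandomized variant follows. Since G is fully connected, any fixed ordering of the N! matchings is a Hamiltonian walk; here the lexicographic order from `itertools.permutations` is used purely as an assumption, and the list is materialized, so this only works for small N. The names `make_walk` and `derandomized_step` are illustrative.

```python
from itertools import permutations

def weight(perm, Q):
    """Weight of the matching input i -> output perm[i]."""
    return sum(Q[i][perm[i]] for i in range(len(perm)))

def make_walk(n):
    """Return H, where H(t) visits all N! matchings cyclically."""
    nodes = list(permutations(range(n)))
    def H(t):
        return nodes[t % len(nodes)]
    return H

def derandomized_step(prev, Q, H, t):
    """S(t) = heavier of R(t) = H(t mod N!) and S(t-1)."""
    cand = H(t)
    return cand if weight(cand, Q) > weight(prev, Q) else prev

H = make_walk(3)
Q = [[3, 0, 5],
     [4, 1, 0],
     [0, 6, 2]]
S = (0, 1, 2)
for t in range(6):            # one full cycle of the walk with Q held fixed
    S = derandomized_step(S, Q, H, t)
print(S, weight(S, Q))
```

With the queues held fixed, one full cycle of the walk necessarily reaches the maximum weight matching.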
24
Stability
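• Lemma: Consider an IQ switch with Bernoulli i.i.d. inputs. Let B be a matching algorithm which ensures $W_B(t) \ge W^*(t) - c$ for every t, where $W^*(t)$ is the weight of the maximum weight matching at time t and c is a constant. Then B is stable.
• Theorem: $W_{DER}(t) \ge W^*(t) - 2N \cdot N!$. Therefore, the derandomized algorithm is stable.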
25
Delay
• These simple approximations of MWM yield 100% throughput, but delays are large
• To obtain good delays we’ll present three different algorithms which use the following features:
  – selective remembrance -- Laura
  – information in the arrivals -- Serena
  – hardware parallelism -- Apsara
26
Laura
Tassiulas
• COMP = Maximum
• R(t): uniform sample
Laura
• COMP = Merge, which picks the best edges of two matchings
• R(t): non-uniform sample
[Figure: S(t-1) and R(t) feed COMP, which outputs S(t); S(t) is fed back as S(t-1) for the next time-slot]
27
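Merging Procedure
[Figure: merging S(t-1) and R]
• W(S(t-1)) = 160, with edge weights 10, 10, 10, 70, 60
• W(R) = 150, with edge weights 50, 40, 30, 10, 20
• The union of the two matchings splits into two cycles; in each, S(t-1) edges count positive and R edges negative:
  – 10 - 40 + 10 - 30 + 10 - 50 = -90, so the edges of R are kept
  – 70 - 10 + 60 - 20 = 100, so the edges of S(t-1) are kept
• Merged matching: W(S(t)) = 120 + 130 = 250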
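The merging procedure can be sketched as follows: the union of two matchings decomposes into cycles that alternate between them, and within each cycle the heavier side is kept. The weight matrix below is a hypothetical 5 x 5 instance built to reproduce the totals 160, 150 and 250 from the slide; the actual edge placement in the figure is not recoverable, and `merge` is an illustrative name.

```python
def weight(perm, W):
    """Weight of the matching input i -> output perm[i]."""
    return sum(W[i][perm[i]] for i in range(len(perm)))

def merge(S, R, W):
    """Per cycle of the union of S and R, keep whichever matching is heavier."""
    n = len(S)
    r_inv = [0] * n
    for i, j in enumerate(R):
        r_inv[j] = i                      # input that R connects to output j
    merged = [None] * n
    seen = [False] * n
    for start in range(n):
        if seen[start]:
            continue
        cycle, i = [], start
        while not seen[i]:                # follow S edge, then R edge back
            seen[i] = True
            cycle.append(i)
            i = r_inv[S[i]]
        w_s = sum(W[k][S[k]] for k in cycle)
        w_r = sum(W[k][R[k]] for k in cycle)
        src = S if w_s >= w_r else R
        for k in cycle:
            merged[k] = src[k]
    return merged

# Hypothetical weights: S uses the diagonal, R uses two cycles (3 + 2 inputs).
W = [[10, 40,  0,  0,  0],
     [ 0, 10, 30,  0,  0],
     [50,  0, 10,  0,  0],
     [ 0,  0,  0, 70, 10],
     [ 0,  0,  0, 20, 60]]
S_prev = [0, 1, 2, 3, 4]
R = [1, 2, 0, 4, 3]
S_new = merge(S_prev, R, W)
print(S_new, weight(S_prev, W), weight(R, W), weight(S_new, W))
```

Because each cycle uses the same set of outputs under both matchings, the merged result is always a valid matching, and it is at least as heavy as either input.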
28
Throughput
• Theorem:
  – LAURA is stable under any admissible Bernoulli i.i.d. input traffic.
29
Average Backlog via Simulation
• Switch size: N = 32
• Length of VOQ: QMAX = 10000
• Comparison with
  – iSLIP, iLQF, MUCS, RPA and MWM
30
Simulation
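• Traffic matrices
  – uniform
  – diagonal
  – sparse
  – logdiagonal
(shown as 6 x 6 examples)
$$T_U = \frac{1}{N^2}\begin{pmatrix} 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 & 1 \end{pmatrix}$$
$$T_D = \frac{1}{2N}\begin{pmatrix} 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 \\ 1 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}$$
$$T_S = \frac{1}{3N^2}\begin{pmatrix} 2 & 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 2 & 0 & 0 & 0 \\ 0 & 1 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 2 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 & 2 \\ 1 & 0 & 0 & 0 & 1 & 1 \end{pmatrix}$$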
31
Laura: Diagonal traffic
32
Laura: Sparse traffic
33
Serena
• An increase in queue sizes is due to arrivals
• And arrivals are a source of randomness
• So use arrivals to generate a random matching
34
Serena
[Figure: S(t-1) and R(t) feed Merge, which outputs S(t); S(t) is fed back as S(t-1) for the next time-slot. Here R(t) = matching generated using arrivals]
35
Merging Procedure
[Figure: the arrival graph (Arr) is reduced to a matching R, which is merged with S(t-1) to give S(t)]
• W(S(t-1)) = 209
• W(R) = 121
• W(S(t)) = 243, with edge weights 89, 3, 23, 31, 97
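The slides do not spell out how the arrival graph is turned into a full matching, so the sketch below is one plausible way under stated assumptions: when several inputs receive a cell for the same output, keep the edge with the longest queue, then complete the matching by pairing the remaining inputs and outputs in index order. The function name `matching_from_arrivals` and the `arrivals` dictionary format are made up for the example.

```python
def matching_from_arrivals(arrivals, Q):
    """arrivals: dict {input: output} for inputs that received a cell this slot.
    Returns a full matching as a list perm, with input i -> output perm[i]."""
    n = len(Q)
    winner = {}                               # output -> chosen input
    for i, j in arrivals.items():
        k = winner.get(j)
        if k is None or Q[i][j] > Q[k][j]:    # assumed tie-break: longer queue
            winner[j] = i
    perm = [None] * n
    for j, i in winner.items():
        perm[i] = j
    free_outputs = [j for j in range(n) if j not in winner]
    for i in range(n):                        # assumed completion rule
        if perm[i] is None:
            perm[i] = free_outputs.pop(0)
    return perm

Q = [[0, 5, 0],
     [0, 9, 0],
     [2, 0, 0]]
arrivals = {0: 1, 1: 1, 2: 0}                 # inputs 0 and 1 both target output 1
R = matching_from_arrivals(arrivals, Q)
print(R)
```

The resulting R(t) would then be merged with S(t-1) to produce S(t).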
36
Throughput
• Theorem:
  – SERENA achieves 100% throughput under any admissible i.i.d. Bernoulli traffic pattern
37
Serena: Diagonal traffic
38
Apsara
• One way to obtain MWM is to search the space of all N! matchings
• A natural approximation: If S(t-1) is the current matching, then S(t) is the heaviest matching in a “neighborhood” of S(t-1)
• It turns out that there is a convenient way of defining neighbors (both for theory and for practice)
39
Neighbors
• Neighbors differ from S(t) in ONLY TWO edges (for all values of N)
[Figure: example for a 3 x 3 switch, showing S(t) and its neighbors]
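One way to enumerate such neighbors is to swap the outputs of every pair of inputs, which changes exactly two edges and gives N(N-1)/2 neighbors. This is a minimal sketch; `neighbors` is an illustrative name.

```python
from itertools import combinations

def neighbors(S):
    """All matchings obtained from S by swapping the outputs of two inputs."""
    result = []
    for a, b in combinations(range(len(S)), 2):
        nb = list(S)
        nb[a], nb[b] = nb[b], nb[a]
        result.append(nb)
    return result

for nb in neighbors([0, 1, 2]):
    print(nb)
```

For a 3 x 3 switch this yields three neighbors, each differing from S(t) in two edges.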
40
Apsara
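[Figure: the neighbors N1, N2, ..., Nk of S(t-1) are generated in parallel; MAX selects the heaviest among S(t-1), its neighbors and H(t), the current matching of the Hamiltonian walk; the result S(t) is fed back for the next time-slot]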
41
Apsara: Throughput
• Theorem: Apsara is stable under any admissible i.i.d. Bernoulli traffic (stability is due to the Hamiltonian matching).
• Also, note that $W(S(t)) \ge W(S(t-1), t)$
• Theorem: If $W(S(t)) = W(S(t-1), t)$ then $W(S(t)) \ge 0.5\, W^*(t)$ (this is not enough to ensure stability)
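A sequential sketch of one Apsara step is given below: take the heaviest among S(t-1), its two-edge-swap neighbors, and the current matching of a Hamiltonian walk. In hardware the neighbors would be evaluated in parallel; the walk ordering (lexicographic permutations) and all names here are assumptions for illustration.

```python
from itertools import combinations, permutations

def weight(perm, Q):
    """Weight of the matching input i -> output perm[i]."""
    return sum(Q[i][perm[i]] for i in range(len(perm)))

def apsara_step(prev, Q, walk, t):
    """Heaviest of S(t-1), its neighbors, and H(t) = walk[t mod N!]."""
    candidates = [list(prev)]
    for a, b in combinations(range(len(prev)), 2):
        nb = list(prev)
        nb[a], nb[b] = nb[b], nb[a]           # neighbor: swap two outputs
        candidates.append(nb)
    candidates.append(list(walk[t % len(walk)]))
    return max(candidates, key=lambda p: weight(p, Q))

Q = [[3, 0, 5],
     [4, 1, 0],
     [0, 6, 2]]
walk = list(permutations(range(3)))           # small-N stand-in for H(t)
S0 = [0, 1, 2]
S1 = apsara_step(S0, Q, walk, 0)
S2 = apsara_step(S1, Q, walk, 1)
print(S1, weight(S1, Q), S2, weight(S2, Q))
```

Because S(t-1) is itself a candidate, the weight never decreases from one step to the next when evaluated on the same queue state.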
42
Apsara: Diagonal traffic
43
Limited Parallelism
• The Apsara algorithm searches over $\binom{N}{2}$ neighbors in parallel
• If space is limited to $K$ modules, with $K$ smaller than $\binom{N}{2}$, then search over a randomly chosen subset of size $K$ from all $\binom{N}{2}$ neighbors
• And there are other (good) deterministic ways of searching a smaller neighborhood of matchings
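The randomized fallback for limited parallelism can be sketched by sampling K of the input pairs and generating only those neighbors. The name `sampled_neighbors` and the use of `random.sample` are assumptions for illustration.

```python
import random
from itertools import combinations

def sampled_neighbors(S, k, rng):
    """Return k randomly chosen two-edge-swap neighbors of S (all if k is large)."""
    pairs = list(combinations(range(len(S)), 2))
    chosen = rng.sample(pairs, min(k, len(pairs)))
    result = []
    for a, b in chosen:
        nb = list(S)
        nb[a], nb[b] = nb[b], nb[a]
        result.append(nb)
    return result

rng = random.Random(3)
S = [0, 1, 2, 3]
subset = sampled_neighbors(S, 3, rng)         # 3 of the 6 neighbors
print(subset)
```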
44
Apsara: Limited parallelism
45
Diagonal traffic
46
Conclusions
• We have presented novel scheduling algorithms for input-queued switches
  – Laura
  – Serena
  – Apsara
• They are simple to implement and perform competitively with respect to the Maximum Weight Matching algorithm
47
References
1. L. Tassiulas, “Linear complexity algorithms for maximum throughput in radio networks and input-queued switches,” Proc. INFOCOM 1998.
2. D. Shah, P. Giaccone and B. Prabhakar, “An efficient randomized algorithm for input-queued switch scheduling,” Proc. of Hot Interconnects, 2001.
3. P. Giaccone, D. Shah and B. Prabhakar, “An implementable parallel scheduler for input-queued switches,” Proc. of Hot Interconnects, 2001.
4. P. Giaccone, B. Prabhakar and D. Shah, “Towards simple and efficient scheduler for high-aggregate IQ switches,” submitted to INFOCOM 2002.
5. R. Motwani and P. Raghavan, Randomized Algorithms, Cambridge University Press, 1995.
48
Uniform traffic
49
LogDiagonal traffic
Maximum Throughput
Algorithm   Load 0.99
MWM         0.99
MaxLAURA    0.99
LAURA       0.99
iSLIP       0.84
iLQF        0.97
MUCS        0.99
RPA         0.98