Towards Simple, High-performance Input-Queued Switch Schedulers

Devavrat Shah, Stanford University

Berkeley, Dec 5

Joint work with Paolo Giaccone and Balaji Prabhakar

2

Outline

• Description of input-queued switches
• Scheduling
– the problem
– some history
• Simple, high-performance schedulers
– Laura
– Serena
– Apsara

• Conclusions

3

The Input-Queued (IQ) Switch Architecture

• N inputs, N outputs (in fig, N = 3)
• Time is slotted
– at most one packet can arrive per time-slot at each input
• Equal sized cells/packets
• Buffers only at inputs
• Use a crossbar for switching packets

4

Scheduling

• Crossbar is defined by these constraints: in each time-slot
– only one packet can be transferred to each output
– only one packet can be transferred from each input
• The scheduling problem: subject to the above constraints, find a matching of inputs and outputs
– i.e. determine which output will receive a packet from which input in each time slot
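The crossbar constraints can be sketched as a small validity check. This is only an illustration: the list representation (entry i holds the output served by input i, or None for an idle input) and the name `is_valid_matching` are assumptions, not part of the slides.

```python
def is_valid_matching(match, n):
    """Check the crossbar constraints for one time-slot.

    match[i] is the output served by input i, or None if input i is idle.
    Each input sends at most one packet by construction of the list;
    the check below ensures each output receives at most one packet.
    """
    if len(match) != n:
        return False
    used = [j for j in match if j is not None]
    in_range = all(0 <= j < n for j in used)
    return in_range and len(used) == len(set(used))
```

For a 3 x 3 switch, `[1, 2, 0]` is a valid schedule, while `[1, 1, 0]` is not because output 1 would receive two packets.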

5

Background to switch scheduling

1. [Karol et al. 1987] Throughput is limited due to head-of-line blocking (limited to 58% for Bernoulli IID uniform traffic)

2. [Tamir 1989] Observed that with “Virtual Output Queues” (VOQs) head-of-line blocking is eliminated.

6

Basic Switch Model

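[Figure: N x N switch with arrival processes A11(t), ..., ANN(t), queue occupancies L11(t), ..., LNN(t), crossbar schedule S(t), and departure processes D1(t), ..., DN(t)]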

7

Some definitions

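1. Traffic matrix: $\Lambda = [\lambda_{ij}]$, where $\lambda_{ij} = E[A_{ij}(t)]$.

If $\sum_i \lambda_{ij} < 1$ and $\sum_j \lambda_{ij} < 1$, we say the traffic is "admissible."

2. Service matrix: $S = [s_{ij}]$, where $s_{ij} \in \{0, 1\}$ and $S$ is a permutation matrix.

3. Queue occupancies: $L_{11}(t), \ldots, L_{NN}(t)$

[Figure: occupancy of the queues L11(t), ..., LNN(t)]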
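A minimal sketch of the admissibility condition, reading it in the usual way: every row sum and every column sum of the arrival-rate matrix is strictly below 1. The function name `is_admissible` and the nested-list representation are assumptions made for illustration.

```python
def is_admissible(lam):
    """Return True if no input (row) and no output (column) is overloaded.

    lam[i][j] is the mean arrival rate from input i to output j.
    """
    n = len(lam)
    rows_ok = all(sum(lam[i][j] for j in range(n)) < 1 for i in range(n))
    cols_ok = all(sum(lam[i][j] for i in range(n)) < 1 for j in range(n))
    return rows_ok and cols_ok
```

For example, a 3 x 3 matrix with 0.6 on the diagonal and 0.3 on the next diagonal (wrapping around) loads every input and output at 0.9 and is admissible; raising the entries to 0.7 and 0.35 is not.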

8

More background on theory

[Anderson et al. 1993] A schedule is equivalent to finding a matching in a bipartite graph induced by input and output nodes

9

Background

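[McKeown et al. 1995]
(a) Maximum size match does not give 100% throughput.
(b) But maximum weight match can, where the weight can be the queue-length or the age of a cell.

[Figure: bipartite graph with edge weights 20, 32, 30, 25 and the resulting MWM with edge weights 20, 30, 25]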

10

Maximum Weight Matching

• Maximum weight matching (MWM)
– 100% throughput
– provable delay bounds for i.i.d. Bernoulli admissible traffic
– but, finding MWM is like solving a network-flow problem whose complexity is O(N^3) -- complex for high-speed networks
• We seek to approximate maximum weight matching
• Our goal:
– obtain a simply implementable approximation to MWM that performs competitively with MWM
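To make the object being approximated concrete, here is a brute-force sketch of MWM that tries every permutation. It is an illustration only, not the O(N^3) network-flow approach the slide refers to. It costs N! evaluations and is usable only for tiny N. The names `weight` and `mwm_brute_force` are assumptions.

```python
from itertools import permutations


def weight(match, w):
    """Total weight of a matching; match[i] is the output served by input i."""
    return sum(w[i][j] for i, j in enumerate(match))


def mwm_brute_force(w):
    """Return a maximum weight matching by exhaustive search over all N! matchings."""
    n = len(w)
    return max(permutations(range(n)), key=lambda m: weight(m, w))
```

With weights `[[1, 9, 0], [4, 0, 2], [0, 3, 7]]` (for instance, queue lengths), the heaviest matching is `(1, 0, 2)` with weight 9 + 4 + 7 = 20.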

11

Approximating MWM

• Two performance measures
– throughput
– delay

• We first consider simple approximations to MWM that deliver 100% throughput (i.e. stability), and then deal with delay

12

Methods of Approximation

• Randomization
– well-known method for simplifying implementation
• Using information in packet arrivals
– since queue-sizes grow due to arrivals, and arrival times are a source of randomness
• Hardware parallelism
– yields an efficient search procedure

13

Randomization

• The main idea of randomized algorithms is
– to simplify the decision-making process by basing decisions upon a small, randomly chosen sample from the state rather than upon the complete state

14

An Illustrative Example

• Find the oldest person from a population of 1 billion

• Deterministic algorithm: linear search – has a complexity of 1 billion

• A randomized version: find the oldest of 30 randomly chosen people
– has a complexity of 30 (ignoring complexity of random sampling)
• Performance
– linear search will find the absolute oldest person (rank = 1)
– if R is the person found by the randomized algorithm, we can make statements like P(R has rank < 100 million) ≥ 1 − (9/10)^30 > 0.95
– thus, we can say that the performance of the randomized algorithm is very good with a high probability
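The 0.95 figure can be checked with a one-line calculation. The sketch assumes the 30 people are drawn independently and uniformly (sampling with replacement, a close approximation for a population of 1 billion); the function name `prob_hits_top_fraction` is illustrative.

```python
def prob_hits_top_fraction(fraction, samples):
    """P(at least one of `samples` independent uniform draws is in the top `fraction`)."""
    return 1.0 - (1.0 - fraction) ** samples


# Top 100 million out of 1 billion is the top 10%; 30 samples are drawn.
p = prob_hits_top_fraction(0.1, 30)
```

Here `p` is 1 − 0.9^30, roughly 0.958, which is where the "> 0.95" statement comes from.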

15

Randomizing Iterative Schemes

• Often, we want to perform some operation iteratively

• Example: find the oldest person each year

• Say in 2001 you choose 30 people at random
– and store the identity of the oldest person in memory
– in 2002 you choose 29 new people at random
– let R be the oldest person from these 29 + 1 = 30 people

P(R has rank < 100 million) ≥ 1 − (9/10)^59

or, P(R has rank < 50 million) ≈ 1 − (9/10)^30
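The benefit of memory can be estimated with a small Monte Carlo sketch. It assumes that ranks do not change between the two years, that the population stays at 1 billion, and that samples are independent and uniform; the names `two_year_rank` and `estimate` are illustrative.

```python
import random

POPULATION = 10**9


def two_year_rank(rng):
    """Rank (1 = oldest) of the person held after the second year."""
    # Year 1: the oldest of 30 random people is stored in memory.
    remembered = min(rng.randrange(1, POPULATION + 1) for _ in range(30))
    # Year 2: 29 fresh samples compete with the remembered person.
    fresh = min(rng.randrange(1, POPULATION + 1) for _ in range(29))
    return min(remembered, fresh)


def estimate(threshold, trials=5000, seed=1):
    """Estimate P(rank of R < threshold) over independent two-year runs."""
    rng = random.Random(seed)
    hits = sum(two_year_rank(rng) < threshold for _ in range(trials))
    return hits / trials
```

Under these assumptions R is effectively the best of 59 samples, so the exact values are 1 − 0.9^59 (about 0.998) for rank < 100 million and 1 − 0.95^59 (about 0.95) for rank < 50 million; the estimates should land close to these.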

16

Back to Switch Scheduling: Randomizing MWM

• Choose d matchings at random and use the heaviest one as the schedule

• Ideally we would like to have small d. However:

• Theorem: Even with d = N this algorithm doesn’t yield 100% throughput!
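A minimal sketch of the "heaviest of d random matchings" scheduler described above. The weight matrix `w` (for instance, queue lengths) and the function names are assumptions made for illustration.

```python
import random


def weight(match, w):
    """Total weight of a matching; match[i] is the output served by input i."""
    return sum(w[i][j] for i, j in enumerate(match))


def random_matching(n, rng):
    """Draw a matching uniformly at random as a shuffled permutation."""
    perm = list(range(n))
    rng.shuffle(perm)
    return perm


def heaviest_of_d(w, d, rng):
    """Sample d matchings independently and keep the heaviest one."""
    n = len(w)
    samples = [random_matching(n, rng) for _ in range(d)]
    return max(samples, key=lambda m: weight(m, w))
```

Each time slot starts from scratch here, which is the property the theorem exploits: nothing carries over from one slot to the next.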

17

Proof

18

Simulation Scenario

• Switch Size: 32 x 32
• Input Traffic (shown for a 4 x 4 switch)
– Bernoulli i.i.d. inputs
– diagonal load matrix:

$$\begin{pmatrix} x & y & 0 & 0 \\ 0 & x & y & 0 \\ 0 & 0 & x & y \\ y & 0 & 0 & x \end{pmatrix}$$

• normalized load = x + y < 1
• x = 2y

19

[Figure: Diagonal Traffic. Mean IQ length (log scale, 0.001 to 10000) versus normalized load (0.0 to 1.0) for MWM, R32 and R1]

20

Crucial Observation

• The state of the switch changes due to arrivals & departures

• Between consecutive time slots, a queue's length can change at most by 1
– hence a heavy matching tends to stay heavy
• Therefore
– "remembering" a heavy matching should help in improving the performance

21

Tassiulas’ Algorithm

• [Tassiulas 1998] proposed the following algorithm based on this observation:
– let S(t-1) be the matching used at time t-1
– let R(t) be a matching chosen uniformly at random
– and let S(t) be the heavier of R(t) and S(t-1)
• This gives 100% throughput! Note that the boost in throughput is due to the use of memory
• But delays are very large
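One time slot of the scheme above can be sketched as follows. The weight matrix `w` is assumed to hold the current queue lengths, and the function names are illustrative.

```python
import random


def weight(match, w):
    """Total weight of a matching; match[i] is the output served by input i."""
    return sum(w[i][j] for i, j in enumerate(match))


def tassiulas_step(s_prev, w, rng):
    """Return S(t): the heavier of a uniform random matching R(t) and S(t-1)."""
    r = list(range(len(w)))
    rng.shuffle(r)  # R(t), chosen uniformly at random
    return r if weight(r, w) > weight(s_prev, w) else s_prev
```

Both matchings are weighed with the current weights, so the schedule kept in memory is re-evaluated every slot rather than trusted blindly.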

22

[Figure: Diagonal Traffic. Mean IQ length (log scale, 0.01 to 10000) versus normalized load (0.1 to 1.0) for MWM and Tassiulas]

23

Derandomization

• Let G be a fully-connected graph where each node is one of the N! possible schedules

• Construct a Hamiltonian walk, H(t), on G
– H(t) cycles through the nodes of G
• At any time t
– let R(t) = H(t mod N!)
– and let S(t) be the heavier of R(t) and S(t-1)
• This also has 100% throughput, but delays are large (derandomization will be useful later)
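Because G is fully connected, any fixed ordering of the N! matchings is a Hamiltonian walk. The sketch below uses lexicographic order, which is an assumption chosen for simplicity and not necessarily the walk used by the authors; the function names are illustrative.

```python
from math import factorial


def weight(match, w):
    """Total weight of a matching; match[i] is the output served by input i."""
    return sum(w[i][j] for i, j in enumerate(match))


def hamiltonian_walk(t, n):
    """H(t mod n!): the permutation of range(n) with that lexicographic index."""
    k = t % factorial(n)
    items = list(range(n))
    perm = []
    for pos in range(n, 0, -1):
        # Factorial number system: pick the idx-th remaining item.
        idx, k = divmod(k, factorial(pos - 1))
        perm.append(items.pop(idx))
    return perm


def derandomized_step(s_prev, w, t):
    """Return S(t): the heavier of R(t) = H(t mod N!) and S(t-1)."""
    r = hamiltonian_walk(t, len(w))
    return r if weight(r, w) > weight(s_prev, w) else s_prev
```

With fixed weights, one full cycle of N! steps is guaranteed to visit the maximum weight matching, so the schedule in memory ends up at least that heavy.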

24

Stability

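• Lemma: Consider an IQ switch with Bernoulli i.i.d. inputs. Let B be a matching algorithm which ensures $W_B(t) \ge W^*(t) - c$ for every t, where $W^*(t)$ is the weight of the maximum weight matching at time t and c is a constant. Then B is stable.

• Theorem: $W_{DER}(t) \ge W^*(t) - 2N \cdot N!$. Therefore, the derandomized algorithm is stable.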

25

Delay

• These simple approximations of MWM yield 100% throughput, but delays are large

• To obtain good delays we'll present three different algorithms which use the following features:
– selective remembrance -- Laura
– information in the arrivals -- Serena
– hardware parallelism -- Apsara

26

Laura

[Diagram: S(t-1) and R(t) are combined by COMP to give S(t), which becomes S(t-1) at the next time slot]

Tassiulas
• COMP = Maximum
• R(t) – uniform sample

Laura
• COMP = Merge, picks the best edges of two matchings
• R(t) – non-uniform sample

27

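Merging Procedure

[Figure: Merging S(t-1) and R. S(t-1) has edge weights 10, 10, 10, 70, 60, so W(S(t-1)) = 160; R has edge weights 40, 30, 50, 10, 20, so W(R) = 150. The union of the two matchings splits into two cycles. In the first cycle, S(t-1) minus R gives 10 − 40 + 10 − 30 + 10 − 50 = −90, so the edges of R are kept. In the second cycle, 70 − 10 + 60 − 20 = 100, so the edges of S(t-1) are kept. The merged matching S(t) has W(S(t)) = 250.]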
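The merging procedure can be sketched as a cycle-by-cycle comparison: the union of two matchings decomposes into alternating cycles, and within each cycle the heavier of the two edge sets is kept. The example reuses the weights from the merging figure (W(S(t-1)) = 160, W(R) = 150, W(S(t)) = 250); the placement of those edges on particular inputs and outputs, and the function names, are assumptions.

```python
def weight(match, w):
    """Total weight of a matching; match[i] is the output served by input i."""
    return sum(w[i][j] for i, j in enumerate(match))


def merge(s_prev, r, w):
    """Combine two matchings, keeping the heavier edge set in each cycle."""
    n = len(s_prev)
    r_inv = [0] * n  # r_inv[j] is the input that R connects to output j
    for i, j in enumerate(r):
        r_inv[j] = i
    merged = [None] * n
    seen = [False] * n
    for start in range(n):
        if seen[start]:
            continue
        # Follow input -> S(t-1) output -> R input until the cycle closes.
        cycle = []
        i = start
        while not seen[i]:
            seen[i] = True
            cycle.append(i)
            i = r_inv[s_prev[i]]
        w_s = sum(w[i][s_prev[i]] for i in cycle)
        w_r = sum(w[i][r[i]] for i in cycle)
        source = s_prev if w_s >= w_r else r
        for i in cycle:
            merged[i] = source[i]
    return merged


# Assumed 5 x 5 layout reproducing the weights in the figure.
w = [[10, 40, 0, 0, 0],
     [0, 10, 30, 0, 0],
     [50, 0, 10, 0, 0],
     [0, 0, 0, 70, 10],
     [0, 0, 0, 20, 60]]
s_prev = [0, 1, 2, 3, 4]   # weights 10, 10, 10, 70, 60
r = [1, 2, 0, 4, 3]        # weights 40, 30, 50, 10, 20
s_next = merge(s_prev, r, w)
```

Within a cycle both edge sets cover the same inputs and outputs, so choosing per cycle always yields a valid matching that is at least as heavy as either input.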

28

Throughput

• Theorem:
– LAURA is stable under any admissible Bernoulli i.i.d. input traffic.

29

Average Backlog via Simulation

• Switch size: N = 32

• Length of VOQ: QMAX = 10000

• Comparison with
– iSLIP, iLQF, MUCS, RPA and MWM

30

Simulation

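• Traffic Matrices (shown for a 6 x 6 switch, each up to a scalar normalizing factor)
– uniform
– diagonal
– sparse
– logdiagonal

$$T_U \propto \begin{pmatrix} 1&1&1&1&1&1 \\ 1&1&1&1&1&1 \\ 1&1&1&1&1&1 \\ 1&1&1&1&1&1 \\ 1&1&1&1&1&1 \\ 1&1&1&1&1&1 \end{pmatrix}$$

$$T_D \propto \begin{pmatrix} 1&1&0&0&0&0 \\ 0&1&1&0&0&0 \\ 0&0&1&1&0&0 \\ 0&0&0&1&1&0 \\ 0&0&0&0&1&1 \\ 1&0&0&0&0&1 \end{pmatrix}$$

$$T_S \propto \begin{pmatrix} 2&1&0&0&0&0 \\ 0&1&2&0&0&0 \\ 0&1&1&1&0&0 \\ 0&0&0&2&1&0 \\ 0&0&0&0&1&2 \\ 1&0&0&0&1&1 \end{pmatrix}$$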

31

Laura: Diagonal traffic

32

Laura: Sparse traffic

33

Serena

• Since an increase in queue sizes is due to arrivals
• And arrivals are a source of randomness

Use arrivals to generate a random matching

34

Serena

[Diagram: S(t-1) and R(t) are combined by Merge to give S(t), which becomes S(t-1) at the next time slot]

R(t) = matching generated using arrivals
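One way to turn the arrivals of a time slot into a full matching is sketched below. The slide only states that R(t) is generated from arrivals; keeping the heaviest arrival edge per output and completing the matching with the lowest-numbered free outputs are assumptions made for this illustration, as are the names.

```python
def matching_from_arrivals(arrivals, w):
    """Build a full matching from this slot's arrivals.

    arrivals[i] is the destination output of the cell that arrived at
    input i, or None if nothing arrived. w[i][j] is the edge weight.
    """
    n = len(arrivals)
    owner = {}  # output -> input currently holding it
    for i, j in enumerate(arrivals):
        if j is None:
            continue
        # Several inputs may target one output: keep the heaviest edge.
        if j not in owner or w[i][j] > w[owner[j]][j]:
            owner[j] = i
    match = [None] * n
    for j, i in owner.items():
        match[i] = j
    # Complete the partial matching with the unused outputs (assumed rule).
    free_outputs = [j for j in range(n) if j not in owner]
    for i in range(n):
        if match[i] is None:
            match[i] = free_outputs.pop(0)
    return match
```

Since each input sees at most one arrival per slot, the arrival edges already satisfy the input constraint and only conflicts at the outputs need resolving.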

35

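Merging Procedure

[Figure: Merging in Serena. The arrival graph (Arr-R) is reduced to a matching R with edge weights 89, 3, 5, 23, 1, so W(R) = 121; S(t-1) has W(S(t-1)) = 209. Merging gives S(t) with edge weights 89, 3, 23, 31, 97, so W(S(t)) = 243.]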

36

Throughput

Theorem:
– SERENA achieves 100% throughput under any admissible i.i.d. Bernoulli traffic pattern

37

Serena: Diagonal traffic

38

Apsara

• One way to obtain MWM is to search the space of all N! matchings

• A natural approximation: If S(t-1) is the current matching, then S(t) is the heaviest matching in a “neighborhood” of S(t-1)

• It turns out that there is a convenient way of defining neighbors (both for theory and for practice)

39

Neighbors

Neighbors differ from S(t) in ONLY TWO edges (for all values of N)

[Figure: Example for a 3 x 3 switch, showing S(t) and its neighbors]
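Neighbors of this kind can be generated by swapping the outputs of two inputs, which changes exactly two edges. This is a sketch under that reading of the slide; the function name `neighbors` is an assumption.

```python
from itertools import combinations


def neighbors(match):
    """All matchings that differ from `match` in exactly two edges."""
    result = []
    for a, b in combinations(range(len(match)), 2):
        nb = list(match)
        nb[a], nb[b] = nb[b], nb[a]  # inputs a and b exchange outputs
        result.append(nb)
    return result
```

For a 3 x 3 switch and the matching `[0, 1, 2]`, this yields the three neighbors `[1, 0, 2]`, `[2, 1, 0]` and `[0, 2, 1]`; in general there are N(N-1)/2 of them.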

40

Apsara

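[Diagram: the neighbors N1, N2, ..., Nk of S(t-1) are generated in parallel; MAX picks the heaviest of S(t-1), its neighbors, and H(t) from the Hamiltonian walk as S(t); S(t) becomes S(t-1) at the next time slot]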
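One Apsara time slot can be sketched as a maximum over the previous matching, its two-edge-swap neighbors, and the current matching of the Hamiltonian walk. Hardware would evaluate the candidates in parallel; this sequential version only illustrates the selection. The walk matching is passed in as `h_t`, and all names are assumptions.

```python
from itertools import combinations


def weight(match, w):
    """Total weight of a matching; match[i] is the output served by input i."""
    return sum(w[i][j] for i, j in enumerate(match))


def apsara_step(s_prev, h_t, w):
    """Return S(t): the heaviest of S(t-1), its neighbors, and H(t)."""
    candidates = [list(s_prev), list(h_t)]
    for a, b in combinations(range(len(s_prev)), 2):
        nb = list(s_prev)
        nb[a], nb[b] = nb[b], nb[a]  # neighbor: swap two edges of S(t-1)
        candidates.append(nb)
    return max(candidates, key=lambda m: weight(m, w))
```

Because S(t-1) is itself a candidate, the chosen matching is never lighter than the previous one under the current weights.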

41

Apsara: Throughput

• Theorem: Apsara is stable under any admissible i.i.d. Bernoulli traffic.

(stability due to Hamiltonian matching)

• Also, note that W(S(t)) >= W(S(t-1),t)

• Theorem: If W(S(t)) = W(S(t-1),t) then W(S(t)) >= 0.5 W*(t)

(this is not enough to ensure stability)

42

Apsara: Diagonal traffic

43

Limited Parallelism

• The Apsara algorithm searches over $\binom{N}{2}$ neighbors in parallel

• If space is limited to $K < \binom{N}{2}$ modules, then search over a randomly chosen subset of size K from all $\binom{N}{2}$ neighbors

• And there are other (good) deterministic ways of searching a smaller neighborhood of matchings

44

Apsara: Limited parallelism

45

Diagonal traffic

46

Conclusions

• We have presented novel scheduling algorithms for input-queued switches
– Laura
– Serena
– Apsara

• They are simple to implement and perform competitively with respect to the Maximum Weight Matching algorithm

47

References

1. L. Tassiulas, “Linear complexity algorithms for maximum throughput in radio networks and input-queued switches,” Proc. INFOCOM 1998.

2. D. Shah, P. Giaccone and B. Prabhakar, “An efficient randomized algorithm for input-queued switch scheduling,” Proc. of Hot Interconnects, 2001.

3. P. Giaccone, D. Shah and B. Prabhakar, “An implementable parallel scheduler for input-queued switches,” Proc. of Hot Interconnects, 2001.

4. P. Giaccone, B. Prabhakar and D. Shah, “Towards simple and efficient scheduler for high-aggregate IQ switches,” submitted to INFOCOM 2002.

5. R. Motwani and P. Raghavan, Randomized Algorithms, Cambridge University Press, 1995.

48

Uniform traffic

49

LogDiagonal traffic

Maximum Throughput (Load 0.99)

Algorithm    Maximum Throughput
MWM          0.99
MaxLAURA     0.99
LAURA        0.99
iSLIP        0.84
iLQF         0.97
MUCS         0.99
RPA          0.98
