
Competitive Evaluation of Switch Architectures

David Hay

Technion - Computer Science Department - Ph.D. Thesis PHD-2007-06 - 2007


Competitive Evaluation of Switch Architectures

Research Thesis

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

David Hay

Submitted to the Senate of the Technion - Israel Institute of Technology

Iyyar 5767, Haifa, April 2007


The research thesis was done under the supervision of Prof. Hagit Attiya in the Department of

Computer Science.

Hagit, there are no words to express how grateful I am for your help and patient guidance

all these years. I feel privileged to have worked with you and learned from your experience, as

a researcher, a teacher, but first and foremost as a human being. I felt that you were always

available for me, for any question, any thought or even just for a chat. Among the countless things

I learned from you, I especially appreciate how you perfectly balanced guiding me with maintaining my independence as a researcher. No doubt you are the kind of advisor any student

dreams of (and much much more).

I would like to thank my collaborators, Dr. Isaac Keslassy, Gabriel Scalosub and Prof. Jennifer

L. Welch, for many helpful and fruitful discussions. The periods we worked together were among

the most enjoyable during all my studies. I thank Isaac Keslassy also for organizing my internship in Cisco Systems during summer 2006. This internship contributed significantly to my

research.

I am thankful to my committee members: Prof. Israel Cidon, Dr. Isaac Keslassy, Prof. Yishai

Mansour, Prof. Seffi Naor, Prof. Danny Raz and Dr. Adi Rosen. I benefited tremendously from

your insights and comments.

I would also like to thank all the other people from the Computer Science department, with

whom I worked and who helped me all these years since my undergraduate studies.

Special thanks to Moshe Saikevich for his consistent moral support (and graphic advice and

help). Moshe, without you none of this would have happened.

Last but not least, I thank my parents Yael and Yigal Hay, my grandparents Zvi and Zvia Gracy

and Lia and Jacob Hay, my brothers Eyal, Roee and Assaf, and the rest of my family, for always

being there for me and supporting my decisions and choices.

The generous financial help of the Blankstein and Wolf foundations is gratefully acknowledged.


Contents

Abstract 1

Abbreviations and Notations 3

1 Introduction 5

1.1 Classification of Switch Architectures . . . . . . . . . . . . . . . . . . . . . . . . 7

1.2 Evaluating the Performance of Switch Architectures . . . . . . . . . . . . . . . . . 10

1.2.1 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

1.3 The Packet Scheduling Process Bottleneck . . . . . . . . . . . . . . . . . . . . . . 15

1.4 Overview of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

1.4.1 Relative Queuing Delay in Parallel Packet Switches . . . . . . . . . . . . . 16

1.4.2 Packet-mode Scheduling in Combined Input-Output Queued Switches . . . 17

1.4.3 Jitter Regulation for Multiple Streams . . . . . . . . . . . . . . . . . . . . 18

2 Background 19

2.1 CIOQ Switches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.2 Output-Queued Switch Emulation . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.3 Parallel Packet Switches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.4 Delay Jitter Regulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23


3 Model Definitions 26

4 Relative Queuing Delay in PPS 29

4.1 Summary of Our Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4.2 A Model for Parallel Packet Switches . . . . . . . . . . . . . . . . . . . . . . . . 33

4.3 Lower Bounds on the Relative Queuing Delay . . . . . . . . . . . . . . . . . . . . 35

4.3.1 General Techniques and Observations . . . . . . . . . . . . . . . . . . . . 36

4.3.2 Lower Bounds for Fully-Distributed Demultiplexing Algorithms . . . . . . 40

4.3.3 Lower Bounds for u-RT Demultiplexing Algorithms . . . . . . . . . . . . 48

4.4 Upper Bounds on the Relative Queuing Delay . . . . . . . . . . . . . . . . . . . . 52

4.5 Demultiplexing Algorithms with Optimal RQD . . . . . . . . . . . . . . . . . . . 58

4.5.1 Optimal Fully-Distributed Demultiplexing Algorithms . . . . . . . . . . . 59

4.5.2 Optimal 1-RT Demultiplexing Algorithm . . . . . . . . . . . . . . . . . . 61

4.5.3 Optimal u-RT Demultiplexing Algorithm . . . . . . . . . . . . . . . . . . 67

4.6 Extensions of the PPS model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

4.6.1 The Relative Queuing Delay of an Input-Buffered PPS . . . . . . . . . . . 69

4.6.2 Recursive Composition of PPS . . . . . . . . . . . . . . . . . . . . . . . . 72

5 Packet-Mode Scheduling in CIOQ Switches 76

5.1 Our Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

5.2 A model for packet-mode CIOQ switches . . . . . . . . . . . . . . . . . . . . . . 79

5.3 Simple Upper and Lower Bounds on the Relative Queuing Delay . . . . . . . . . . 80

5.4 Tradeoffs between the speedup and the relative queuing delay . . . . . . . . . . . . 83

5.4.1 Matrix Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

5.4.2 Mimicking an Ideal Shadow Switch with Speedup S ≈ 4 . . . . . . . . . . 86


5.4.3 Mimicking an Ideal Shadow Switch with Speedup S ≈ 2 . . . . . . . . . . 91

5.5 Mimicking an Ideal Shadow Switch with Bounded Buffers . . . . . . . . . . . . . 94

5.6 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

6 Jitter Regulation for Multiple Streams 104

6.1 Our Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

6.2 Model Description, Notation, and Terminology . . . . . . . . . . . . . . . . . . . 106

6.3 Online Multi-Stream Max-Jitter Regulation . . . . . . . . . . . . . . . . . . . . . 108

6.4 An Efficient Offline Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

7 Conclusions 122

Bibliography 127


List of Figures

1 High-level model of a switch and its bottlenecks. . . . . . . . . . . . . . . . . . . 6

2 Combined Input-Output Queued Switch with Virtual Output-Queuing. . . . . . . . 9

3 A 5× 5 PPS with 2 planes in its center stage, without buffers in the input-ports . . 10

4 Illustration of different delay metrics. . . . . . . . . . . . . . . . . . . . . . . . . 14

5 Time-points associated with a cell c ∈ T . . . . . . . . . . . . . . . . . . . . . . . 35

6 Illustration of traffic T in the proof of Theorems 1 and 4. . . . . . . . . . . . . . . 42

7 Illustration of traffic T in the proof of Theorem 5. . . . . . . . . . . . . . . . . . . 47

8 Illustration of traffic Te T |I in the proof of Theorems 6 and 8. . . . . . . . . . . . 50

9 The number of cells arriving until time-slot t − 1, and still queued in plane k by

time-slot τ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

10 Illustration for the different cases in the proof of Theorem 10 . . . . . . . . . . . . 57

11 A (2, 〈2, 2〉)-RPPS with 5 input-ports and 5 output-ports. . . . . . . . . . . . . . . 73

12 Summary of the results described in Chapter 5. . . . . . . . . . . . . . . . . . . . 78

13 Illustration of the proof of Theorem 21. . . . . . . . . . . . . . . . . . . . . . . . 80

14 Simulation results for a switch, operating under uniform traffic. . . . . . . . . . . . 98

15 Simulation results for a switch, operating under spotted traffic. . . . . . . . . . . . 99

16 Simulation results for a switch, operating under diagonal traffic. . . . . . . . . . . 100


17 Trace-Driven simulation results. . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

18 Simulation results of the Store&Forward Greedy Algorithm. . . . . . . . . . . . . 102

19 The multi-stream jitter regulation model. . . . . . . . . . . . . . . . . . . . . . . . 106

20 Geometric view of delay jitter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

21 Geometric view of the right margin of the release band. . . . . . . . . . . . . . . . 113

22 Illustration of Lemma 18. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119


List of Tables

4.1 The relative queuing delay (in time-slots) of a bufferless OC-192 PPS with 1024

ports and speedup 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.2 Illustration of Example 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64


Abstract

To support the growing need for Internet bandwidth, contemporary backbone routers and switches

operate at an external rate of 40 Gb/s with hundreds of ports. At the same time, applications

with stringent quality of service (QoS) requirements call for powerful control mechanisms, such as

packet scheduling and queue management algorithms.

The primary goal of our research is to provide analytic methodologies for designing and evaluating high-capacity, high-speed switches. A unique feature of our approach is a worst-case comparison of switch performance relative to an ideal switch, with no limitations. This competitive

approach is natural due to the central role of incomplete knowledge, and it can reveal the strengths

and weaknesses of the studied mechanisms and indicate important design choices.

We first consider the parallel packet switch (PPS) architecture in which cells are switched

in parallel through intermediate slower switches. We study the effects of this parallelism on the

overall performance and present tight bounds on the average queuing delay introduced by the

switch relative to an ideal output-queued switch. Our lower bounds hold even if the algorithm in

charge of balancing the load among middle-stage switches is randomized.

We also study how variable-size packets can be scheduled contiguously without segmentation

and reassembly in a combined input-output queued (CIOQ) switch. This mode of scheduling became very attractive recently, since most common network protocols (e.g., IP) work with variable-size packets. We present frame-based schedulers that allow a packet-mode CIOQ switch with small

speedup to mimic an ideal output-queued switch with bounded relative queuing delay.

A slightly different line of research involves studying how different QoS measures can be

guaranteed in a stand-alone environment, where traffic arrives at a regulator that should shape it


to meet the demand. We focus on jitter regulators, which should shape the incoming traffic to

be perfectly periodic, and show upper and lower bounds for multi-stream jitter regulation: In

the offline setting, jitter regulation can be solved in polynomial time, while in the online setting

a buffer augmentation is needed in order to compete with the optimal algorithm; the amount of

buffer augmentation depends linearly on the number of streams.


Abbreviations and Notations

N The number of the switch’s ports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

R The external rate of the switch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6

S The speedup of the switch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

K The number of planes in a parallel packet switch . . . . . . . . . . . . . . . . . . . . . 10

r The internal rate of a parallel packet switch . . . . . . . . . . . . . . . . . . . . . . . . . . 10

Lmax The maximum packet size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

orig The input-port at which a cell arrives at the switch . . . . . . . . . . . . . . . . . . . 26

dest The output-port for which a cell is destined . . . . . . . . . . . . . . . . . . . . . . . . . . 26

packet The packet corresponding to a specific cell . . . . . . . . . . . . . . . . . . . . . . . . 26

first The first cell of a packet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

last The last cell of a packet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

T A traffic. The collection of cells arriving at the switch . . . . . . . . . . . . . . . . 26

ta The time-slot at which a cell arrives at the switch . . . . . . . . . . . . . . . . . . . . 26

shift A cell obtained by shifting another cell by a predetermined number of time-slots . . . 26

ESW An execution of the switch SW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

σ Coin tosses sequence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

tlSW The time-slot at which a cell leaves the switch . . . . . . . . . . . . . . . . . . . . . . . 27

ES The execution of the shadow switch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

tlS The time-slot at which a cell leaves the shadow switch . . . . . . . . . . . . . . . . 27

delay The queuing delay of a cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

R The relative queuing delay of a cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

Rmax The maximum relative queuing delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27


Ravg The average relative queuing delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

RAmax The maximum relative queuing delay against adversary A . . . . . . . . . . . . .27

RAavg The average relative queuing delay against adversary A . . . . . . . . . . . . . . . 27

JSW The delay jitter of switch SW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

JS The delay jitter of the shadow switch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

J The relative delay jitter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

S The state space of a demultiplexor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

plane The plane through which a cell is sent (in a PPS) . . . . . . . . . . . . . . . . . . . . . 34

tp The time-slot at which a cell leaves the plane . . . . . . . . . . . . . . . . . . . . . . . . . 34

succ The immediate successor of a cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

C The reachable configuration space of a demultiplexor . . . . . . . . . . . . . . . . . 48

A The number of cells arriving at the switch . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

∆ The imbalance of a plane with respect to an output . . . . . . . . . . . . . . . . . . . 53

Q The length of a queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

L The number of cells leaving a plane . . . . . . . . . . . . . . . . . . . . . . . . 53

B The set of reachable buffer states . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

to The time-slot at which a cell leaves the demultiplexor . . . . . . . . . . . . . . . . 69

tCCF The time-slot at which CCF forwards a cell . . . . . . . . . . . . . . . . . . . . . . . . . . 86

L The set of eligible packet sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

X The inter-release time of cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

M The total number of streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

MJ The max-jitter of a multi-stream traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107


Chapter 1

Introduction

The rapid increase in the demand for Internet bandwidth and the boost in line rates of contemporary

data networks establish the basic nodes at the network core—namely, the switches and routers—as

one of the network’s primary performance bottlenecks.

Today, switches and routers are built to operate with link rate of up to 40 Gb/s and hundreds

of ports. At the same time, contemporary data networks are required to integrate different types of

services (for example, IP traffic with voice and video traffic), implying that the switch (or router)

must meet stringent quality of service (QoS) requirements and provide service differentiation be-

tween applications. In order to cope with these challenging tasks, all routers and switches are

equipped with powerful control mechanisms, such as packet scheduling and queue management

algorithms. As switches become larger and faster, robust parallel and distributed architectures are

often used; these architectures require additional mechanisms for coordination and load balancing.

Varghese [142, Page 302] identifies three bottlenecks in the design of high-speed switches: the

address lookup process, which determines which output link a packet will switch to, the switching

process, which is responsible for forwarding packets from the input-port to the output-port, and

the packet scheduling process, which is done at the output-port and decides how packets leave the

outbound links of the switch. Figure 1 depicts the general structure of a packet switch and indicates

the locations of the above-mentioned bottlenecks.

Figure 1: High-level model of a switch and its bottlenecks. The switch has N input-ports and N output-ports, operating at external rate R.

This thesis focuses primarily on problems arising from the switching process bottleneck (except Chapter 6, which deals with the packet scheduling bottleneck) and aims to provide analytical

methodologies for designing and evaluating the related switch control mechanisms. In addition,

our results allow comparison between different switch architectures.

Generally, given an existing switch architecture in which the line rates, buffering locations

and control lines are specified, we offer switching algorithms and evaluate their performance.

In addition, we prove inherent limitations of the architecture and point out important design

trade-offs. Note that the algorithms and their analysis strongly depend on the investigated switch

architecture; therefore, this research involves a large variety of algorithmic problems.

Most of our results are relativistic, in the sense that they are measured in comparison to an

optimal switch which is not limited by its architecture. Similar to online algorithms, in which

there is no information about future events, the competitive approach taken in this research is

natural due to the central role of incomplete knowledge.

A primary candidate for an ideal switch is an output-queued switch (see Section 1.1), which is

considered optimal with respect to its ability to guarantee different QoS demands. For that reason,


this comparison is often referred to as the ability of a switch to mimic or emulate an output-queued

switch [120].

Because such analysis is not burdened by probabilistic assumptions on the incoming traffic that

can be misleading, it reveals the strengths and weaknesses of the studied mechanisms and architec-

tures. In addition, analytic evaluation, and especially worst-case evaluation, is important because

it allows QoS demands to be guaranteed (unlike empirical evaluation based on simulations).

In the rest of this chapter, we first discuss in Section 1.1 how to classify different switch architectures. In Section 1.2, we overview the methods and metrics for evaluating the performance of switch architectures. Section 1.3 discusses in more depth the problems arising from the packet scheduling

process bottleneck. Section 1.4 overviews the main results of this thesis.

1.1 Classification of Switch Architectures

Karol et al. [78] considered switches with no buffers at all. In these switches, when m cells destined for the same output-port arrive in the same time-slot, m − 1 cells are dropped and one cell is transmitted over the switch fabric.
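As a toy illustration (a hypothetical sketch, not taken from the thesis), one time-slot of such a bufferless switch can be simulated directly from this drop rule:

```python
def bufferless_slot(arrivals):
    """One time-slot of a bufferless N x N switch.

    `arrivals` maps each input-port to the output-port of its arriving
    cell. When m cells contend for one output-port, a single cell is
    transmitted and the other m - 1 are dropped.
    """
    delivered, dropped = {}, 0
    for inp, out in arrivals.items():
        if out in delivered:
            dropped += 1          # contention: cell is lost
        else:
            delivered[out] = inp  # first cell wins the output-port
    return delivered, dropped

# Inputs 0, 1, 2 all target output 0: one cell is delivered, two dropped.
delivered, dropped = bufferless_slot({0: 0, 1: 0, 2: 0, 3: 1})
```

Even this tiny example shows why the loss ratio grows quickly once several inputs favor the same output.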

Even in the case of uniform, well-behaved traffic, such a bufferless switch suffers from a large loss ratio and low throughput. Hence, buffering within the switch is needed to handle conflicts among

different flows. The location of the buffers, their size, and their management, depend on the specific

architecture of the switch, and play a major role in its performance. Therefore, switches are often

classified according to their buffering strategy [16]. In the rest of this section we employ such a

classification and present common architectures by the location of their buffers.

Output-Queued Switches: In output-queued (OQ) switches, a cell arriving at the switch is immediately transferred to the output-port it is destined for. At each time-slot, at most one cell leaves

each output-port; conflicting cells are queued in a buffer at the output-port.

Output-queued switches provide the highest throughput and lowest average cell-delay, since

cells are queued only when the output-port is transmitting a cell. Furthermore, traffic destined

for one output-port does not affect other output-ports, implying that misbehaving flows are easily


isolated.

However, since it is possible that in a specific time-slot all input-ports send a cell to the same destination, output-ports are required to operate at the aggregate rate of the input-ports. This property implies that the output-queued switch architecture does not scale with the number of external ports, and therefore it is impractical for high-speed switches with a large number of ports.
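The queueing discipline described above can be sketched as follows (a minimal model with hypothetical names; it deliberately ignores the rate mismatch that makes the architecture impractical at scale):

```python
from collections import deque

class OutputQueuedSwitch:
    """Idealized OQ switch: one FIFO per output-port."""

    def __init__(self, n_ports):
        self.queues = [deque() for _ in range(n_ports)]

    def arrive(self, cell, dest):
        # Cells cross the fabric immediately; contention is absorbed by
        # the output buffer. This is exactly why each output memory must
        # run at the aggregate input rate: up to N writes per time-slot.
        self.queues[dest].append(cell)

    def tick(self):
        # At most one cell leaves each output-port per time-slot.
        return [q.popleft() if q else None for q in self.queues]
```

For example, two cells arriving for the same output in one slot depart in consecutive slots, while other outputs are unaffected.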

Shared-Memory Switches: Shared-memory switches are a variant of output-queued switches in which buffers are not dedicated to a specific output-port. Naturally, shared memory is more flexible than dedicated memory; such switches require a significantly smaller total buffer than output-queued switches, sometimes only two or three times larger than a single output-buffer of an output-queued switch [139].

Since at each time-slot, all the input-ports can write to the shared memory and all the output-

ports can read from it, the shared memory should operate at the aggregate rate of both the input-

ports and the output-ports. Hence, like output-queued switches, these switches are not practical if

the switch is large or operates at high speed.

Input-Queued and Combined-Input-Output-Queued Switches: Input Queued (IQ) switches,

with buffering at the input-ports, were suggested to reduce the rate at which memory units are

required to operate. Cells arriving at the switch are queued in FIFO input-buffers, and then are

forwarded to the appropriate output-port, as dictated by a centralized scheduler. The switch fabric

that is used in IQ switches is a bufferless crossbar, which puts the following constraint on the

scheduler: At each time-slot at most one cell is forwarded from each input-port and to each output-

port.

The most notorious problem in input-queued switches is the head-of-line (HOL) blocking problem, where a cell destined for an occupied output-port blocks other cells from being forwarded [78]. To eliminate such problems, different buffering policies were suggested. The most common one

is virtual output-queuing (VOQ) in the input-ports [133], where each input-buffer is divided into N different FIFO queues according to the cell destination. In this case, the scheduler makes its scheduling decisions based on the cells located at the head of each queue (i.e., N² cells).

Figure 2: Combined Input-Output Queued Switch with Virtual Output-Queuing. The switch fabric operates at rate S·R, where S is the speedup of the switch.

Since

scheduling decisions are typically made at least once in every time-slot, the scheduler may become

the bottleneck for implementing a high-speed, large switch.

In addition to the virtual output queues, some input-queued switches have speedup; namely, the

switch fabric runs S ≥ 1 times faster than the external line rates (where S is the switch speedup),

enabling the switch to make S scheduling decisions every time-slot. When S > 1, a certain amount

of buffering should be done at the output side of the switch, and therefore such switches are usually

referred to as Combined Input-Output Queued (CIOQ) switches (Figure 2).
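The VOQ bookkeeping at a single input-port can be sketched as follows (hypothetical names; the matching algorithm run by the centralized scheduler is left abstract):

```python
from collections import deque

class VOQInput:
    """Input-port buffer split into N virtual output queues."""

    def __init__(self, n_outputs):
        self.voq = [deque() for _ in range(n_outputs)]

    def arrive(self, cell, dest):
        # Per-destination FIFOs: a cell for a busy output-port no longer
        # blocks cells bound for other output-ports (no HOL blocking).
        self.voq[dest].append(cell)

    def heads(self):
        # The scheduler inspects only the N head-of-line cells of this
        # input-port (N^2 cells switch-wide) when computing a matching.
        return [q[0] if q else None for q in self.voq]
```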

Buffered Crossbar Switches: Recently, CIOQ switches with additional (small) buffers in the

crosspoints were also considered. These buffered crossbar or combined input-crosspoint queued

(CICQ) switches circumvent the major constraint imposed by the bufferless crossbar fabric and

introduce orthogonality between the operations of input-ports and output-ports. This strong prop-

erty greatly simplifies the design of switching algorithms [38, 138], at the expense of having N2

additional buffers that must be allocated and managed.


Figure 3: A 5 × 5 PPS with 2 planes in its center stage, without buffers in the input-ports.

Parallel Packet Switches (PPS): Switching cells in parallel is a natural approach to build switches

with very high external line rate and with a large number of ports. A parallel packet switch

(PPS) [74] is a three-stage Clos network [41], with K < N switches in its center stage, also

called planes. Each plane is an N × N switch operating at rate r < R; each plane is connected

to all input-ports on one side, and to all output-ports on the other side (Figure 3). This model

is based on an architecture used in inverse multiplexing systems [53, 56], and especially on the

inverse multiplexing for ATM (IMA) technology [12, 36].
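As a toy illustration of the demultiplexing task at a PPS input-port, consider a hypothetical round-robin policy that spreads arriving cells over the K slower planes (the thesis studies such load-balancing algorithms and the relative queuing delay they incur):

```python
class RoundRobinDemux:
    """One input-port's demultiplexor: assign each cell to a plane."""

    def __init__(self, k_planes):
        self.k = k_planes
        self.next_plane = 0

    def dispatch(self, cell):
        # Send the cell through the current plane, then rotate, so each
        # plane sees roughly a 1/K fraction of this input's cells and
        # can run at internal rate r < R.
        plane = self.next_plane
        self.next_plane = (self.next_plane + 1) % self.k
        return plane

d = RoundRobinDemux(2)
planes = [d.dispatch(c) for c in range(5)]  # [0, 1, 0, 1, 0]
```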

Iyer and McKeown [70] also consider a variant of the PPS architecture, called input-buffered

PPS, having finite buffers in its input-ports in addition to buffers in its output-ports.

Two additional architectures have a topology similar to the PPS. The Parallel Switching Architecture (PSA) [109] has several combined input-output queued (CIOQ) switches operating in parallel

with no speedup, whereas the Switch-Memory-Switch (SMS) model [121, 122, 127] has M > N

parallel memories that reside between the input and output ports.

1.2 Evaluating the Performance of Switch Architectures

Switch architectures are evaluated by their ability to provide different QoS guarantees. Some of the

important performance figures are the maximum or average delay of cells, the switch throughput,

and cell loss probability. Contemporary network applications necessitate even more sophisticated


performance metrics (e.g., delay jitter). In Section 1.2.1, we discuss these metrics in detail.

These performance figures can be evaluated under different assumptions on the incoming traffic.

A traditional approach is to model the arrival of cells as a stochastic process. The performance

figures are derived from the switch behavior in response to the arrival pattern, which is calculated

either using analytical probabilistic methods (e.g., traditional queuing theory) or using simulations. The most common assumption is uniform traffic, where cell arrival is an independent and identically-distributed (i.i.d.) Bernoulli process with parameter p, 0 < p ≤ 1, and the cell destination is

chosen uniformly among all the output-ports. The simplicity of uniform traffic makes it attractive

in analytical evaluation; however, it usually leads to unrealistic, overly optimistic results compared to real-life traffic.
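The uniform traffic model can be generated as follows (a minimal sketch; `uniform_traffic` is a name of our own choosing):

```python
import random

def uniform_traffic(n_ports, p, n_slots, seed=0):
    """Bernoulli(p) i.i.d. arrivals with uniformly chosen destinations.

    Yields, per time-slot, a map from input-port to the destination of
    its arriving cell; input-ports with no arrival are omitted.
    """
    rng = random.Random(seed)
    for _ in range(n_slots):
        yield {
            inp: rng.randrange(n_ports)   # destination chosen uniformly
            for inp in range(n_ports)
            if rng.random() < p           # arrival with probability p
        }
```

Feeding such a generator into a switch model gives the classical i.i.d. uniform setting; as noted above, results obtained this way tend to be optimistic relative to real traces.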

More sophisticated traffic patterns (e.g., on/off traffic [2, 64] or hot-spot traffic pattern [118])

were suggested to model real traffic more accurately. Unfortunately, such models generally tend to be either unrealistically simple or too complex for closed-form analysis.

A contemporary approach is to use restrictive models that only bound the incoming traffic

rather than exactly characterize it. Our research focuses on such models, which capture the nature

of most known traffic patterns, and yet can be handled analytically.

These models are particularly appealing because a switch can be used as part of a network

(e.g., the Internet, LAN networks or WAN networks) whose traffic characterization can be very

different, and may even change over time. Therefore, a restrictive model that captures all traffic

patterns at once may yield more meaningful results than stochastic processes, which try to exactly

characterize the arrival pattern.

A prime example for a restrictive traffic model is the (R,B)-leaky-bucket model [140]. In this

model, it is only required that the combined rate of flows sharing the same input-port or the same

output-port does not exceed the external rate R of that port by more than a fixed bound B, which is

independent of time [34]. Other examples of restrictive models are the (R, T )-smooth model [60],

which was later used by Borodin et al. [28] for adversarial queuing theory and is often referred to

as the AQT (R, T ) model. Traffic patterns that obey the strong law of large numbers [47] are also

widely used recently, since they enable the usage of fluid models [46] for evaluating the switch

behavior, although the arrival processes remain discrete [47].
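As an illustration of the (R,B)-leaky-bucket constraint described above, the following sketch (the function name and input representation are ours, not from the cited works) checks whether a per-time-slot arrival sequence at a single port stays within rate R plus burst bound B over every interval:

```python
def conforms_leaky_bucket(arrivals, R, B):
    """Check the (R, B)-leaky-bucket constraint for one port: in every
    interval of t consecutive time-slots, at most R * t + B cells arrive.
    `arrivals[t]` is the number of cells arriving at time-slot t."""
    n = len(arrivals)
    for s in range(n):
        total = 0
        for t in range(s, n):
            total += arrivals[t]
            if total > R * (t - s + 1) + B:
                return False
    return True
```

For example, a port with R = 1 admits a burst of 2 cells in one slot only if B ≥ 1.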

Our research takes a competitive approach for evaluating the behavior of switch architectures:

We compare the performance of a switch against an ideal shadow switch receiving the same in-

coming traffic [120], which may be unrestricted or obey one of the restrictive models described

earlier. As mentioned before, since output-queued switches are considered optimal with respect

to their delay and throughput performance, this comparison is referred to as output queued switch

emulation.

The measure of how closely a switch mimics the ideal switch depends on the relevant QoS de-

mand. For example, in [14, 70, 74, 86] the performance figure discussed is the queuing delay, and

therefore the competitive analysis yields the relative queuing delay; namely, the difference

in queuing delay between the evaluated switch and the shadow switch. When switches are allowed

to drop cells, and the figure of interest is the number of cells successfully delivered, competitive

analysis results in a competitive ratio (or equivalently the switch miss-fraction [113]).
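Given per-cell departure times of the evaluated switch and of the ideal shadow switch, the relative queuing delay figures are a one-line computation; the helper below is a hypothetical sketch, not code from the cited works:

```python
def relative_queuing_delay(depart_eval, depart_shadow):
    """Maximum and average relative queuing delay: per-cell difference
    between the evaluated switch's departure time and the ideal shadow
    switch's departure time. Both arguments map cell ids to time-slots."""
    diffs = [depart_eval[c] - depart_shadow[c] for c in depart_shadow]
    return max(diffs), sum(diffs) / len(diffs)
```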

1.2.1 Performance Metrics

In this section, we survey the most common performance metrics used in current research to eval-

uate a single switch. Note that end-to-end evaluations (for example, over IP networks) are outside

the scope of this thesis.

Throughput and Stability: The throughput of a switch can be defined in several ways. One

common definition is the average number of cells which are successfully transmitted by the switch

per time-slot per input-port [16]. In other cases [30], throughput is defined as the maximum rate

at which none of the offered frames (in our case, cells) are dropped by the device (in our case, the

switch). In this case the throughput is usually normalized to the maximum theoretical rate (namely,

R). For example, a 100% throughput means that even in maximum load conditions, no cells are

dropped by the switch. Note that the relation between cell loss rate and this notion of throughput is

not immediate: for example, in an extreme case in which the switch drops a small number

of cells for any incoming traffic (even with the lowest rate), the switch throughput is 0%, while the

loss-rate is very small.

When discussing throughput, a definition of stability comes in handy [104]. A switch is stable (in

the strong sense) if the expected queue length does not grow without bound: that is, if for every

input-port i, output-port j, limt→∞ E(Qi,j(t)) is finite, where Qi,j(t) is the number of cells from

flow (i, j) that are still queued in one of the switch buffers. A switch achieves 100% throughput

if and only if it is stable under all admissible traffic patterns, and therefore, with finite buffers, no cells are

dropped [105].

A stronger stability measure is the ability of the switch to be work-conserving (greedy) [37,

88, 91]. A work-conserving switch guarantees that if a cell is pending for output port j at time-

slot t, then some cell leaves the switch from output-port j at time-slot t. This property prevents

an output-port from being idle unnecessarily, and by that ensures the switch stability, maximizes

its throughput and minimizes its average cell delay. Note that work conservation is a strictly

stronger property than stability since there are switches (e.g., the parallel packet switch, described

in Section 1.1) which are stable but not work-conserving.
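The work-conservation property can be checked directly on an execution trace; the following sketch (representation and function name are ours) verifies that every output-port with a pending cell transmits one in each time-slot:

```python
def is_work_conserving(arrivals, departures, num_slots):
    """arrivals[t] is a list of output-ports for which a cell arrives at
    time-slot t; departures[t] is the set of output-ports that send a cell
    at t. Work conservation: whenever a cell is pending for output j at
    time-slot t, some cell leaves output j at time-slot t."""
    pending = {}  # output-port -> number of queued cells
    for t in range(num_slots):
        for j in arrivals.get(t, []):
            pending[j] = pending.get(j, 0) + 1
        for j in list(pending):
            if pending[j] > 0 and j not in departures.get(t, set()):
                return False  # output j idles while a cell is pending
        for j in departures.get(t, set()):
            if pending.get(j, 0) > 0:
                pending[j] -= 1
    return True
```

The sketch assumes a cell arriving at slot t may already depart at slot t; shifting the check by one slot adapts it to the other timing convention.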

Queue Length and Cell Loss Ratio: A more fine-grained measure than stability is bounding the

queue lengths (also referred to as backlogs), bounding the expected queue length or approximating

the distribution of the lengths (over time). Since buffer sizes play a major role in both the design

and the pricing of the switch, queue length bounds are very important performance figures. More-

over, the figures obtained have great practical importance and usually can be easily translated to

other important bounds (e.g., on cell delays and cell loss ratio).

As a simple example, in a work-conserving output-queued switch, the maximum buffer size

needed for any (R,B) leaky-bucket traffic is B cells [45]. In this case, no cells are dropped and

the maximum queue length is B; if the output queues operate under first-come-first-serve (FCFS)

policy, the maximum latency is B time-slots.

If buffer sizes are bounded, cells that cannot be stored in the buffers are dropped. The number

of cells dropped, compared to the number of cells arrived at the switch, is captured by the cell loss

ratio. Clearly, characterizing the queue lengths is often a first step towards evaluating the cell loss

ratio.

[Figure: a distribution of the number of cells versus delay time, with the propagation delay, average delay, maximum delay and delay jitter marked.]

Figure 4: Illustration of different delay metrics [79, Page 219].

Cell Delay and Queuing Delay: For a wide class of interactive or time-critical applications

(such as voice conversation and other diverse telecommunication services), cell delays are more

important than throughput [94]. In such applications excessive latency can inhibit usability of the

cells, and therefore should be avoided.

Naturally, the maximum cell delay may occur only in extreme situations, implying that this

metric can be overly pessimistic. In such cases, we are interested in the average cell delay.

The dominant causes of delay are fixed propagation delays (e.g., those arising from speed-

of-light restrictions) and queuing delays in the switch [49]. Since propagation delays are a fixed

property of the topology, cell delay is minimized when queuing delays are minimized. Further-

more, propagation delays are strongly dependent on technology, therefore a switch architecture is

best evaluated through queuing delay.

Delay Jitter: Another important QoS parameter is the delay jitter, which is sometimes referred

to as the cell delay variation. Delay jitter is the difference between the maximal and minimal cell

transfer delays.
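Under this definition, the delay jitter of a trace is simply the spread of per-cell delays; a minimal sketch (names are ours):

```python
def delay_jitter(arrival_times, departure_times):
    """Delay jitter as the difference between the maximal and minimal
    cell transfer delays, for cells listed in matching order."""
    delays = [d - a for a, d in zip(arrival_times, departure_times)]
    return max(delays) - min(delays)
```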

Guaranteeing certain delay jitter is especially important in interactive communication (such as

video or audio streaming). In such applications bounding the delay jitter is translated to bounds on

the buffer size at the destination.

Mansour and Patt-Shamir [100] define the delay jitter to measure how far the difference

between delivery times of different packets is from the ideal difference in a perfectly periodic sequence. In

the natural case, where the abstract source of the incoming traffic operates in a perfectly periodic

manner, both definitions are equivalent.

Other Measures: The ATM Forum defines other quality-of-service parameters, which are not

widely used: Cell Misinsertion Rate (CMR), Cell Error Rate (CER) and Severely Errored Cell

Block Ratio (SECBR). Our research does not address these measures. We will assume that all cells

are transmitted over the switch without errors or misinsertions.

1.3 The Packet Scheduling Process Bottleneck

After the switching process bottleneck is resolved (recall Figure 1), incoming packets are stored

at their destinations (that is, the buffers of the respective output-ports) waiting to be scheduled

outside the switch.

A packet-scheduler, which manages the buffers of a single output-port, is responsible for de-

ciding which packet leaves on the output-port's outgoing link in each time-slot. Depending on the

demands from the switch, the packet-schedulers are geared to ensure the relevant performance

metrics out of those described in Section 1.2.1.

It is important to notice that typically flows from different sources traverse the switch at the

same time, and therefore compete for the same switch resources (for example, the switch buffers

or the switch internal transmission lines). One of the key roles of the packet scheduling process is

to protect well-behaved flows from misbehaved ones [34]. This is often called flow isolation, and

the ability to provide such isolation is one of the most important evaluation criteria for switching

architectures. The demand for flow isolation is sometimes formalized by the slightly stronger con-

cept of fairness: When several flows are equally important (i.e., demand the same QoS guarantees)

they should be treated fairly by the switch, and obtain an equal fraction of the switch resources.

Well-known approaches to solve flow isolation and fairness are by allocating per-flow buffers [111]

or by using appropriate queuing disciplines (e.g., GPS [115] or WFQ [50, 147]).

The problem of packet scheduling becomes even more difficult when the buffers of the output-

port have only bounded size. In such cases, the packet schedulers cannot handle every traffic

pattern, and some packets must be dropped. The most common and simple drop mechanism is

tail-drop, in which incoming packets are dropped if the buffer is full. However, modern switches

and routers often implement more sophisticated drop mechanisms such as Random Early Detection

(RED) [55], which aims at optimizing the performance of TCP traffic.

In this research, we take a competitive analysis point of view also when evaluating the packet

scheduling process. We compare the performance of these schedulers to an optimal scheduler

that uses the same buffer size but has complete knowledge of future packet arrivals (that is, an

offline algorithm). In addition, in order to investigate the trade-off between the buffer size and the

scheduler performance, we also investigate resource augmentation scenarios, in which the packet-

scheduler compensates for its lack of knowledge by using additional buffers.

Note that the switching problem and the packet scheduling problem are not orthogonal: One

can devise switching algorithms that aim at optimizing certain QoS guarantees; such switching

algorithms are especially important in IQ switches (recall Section 1.1), in which there are no buffers

in the output-ports. However, the problems often become independent if the switching algorithm

provides output-queued emulation.

1.4 Overview of the Thesis

1.4.1 Relative Queuing Delay in Parallel Packet Switches

We provide analysis of the relative queuing delay of cells in a PPS compared to an ideal switch,

capturing the influence of parallelism on PPS performance [13, 14]. Our lower and upper bounds

on the relative queuing delay depend on the amount and type of information used for balancing the

load among the lower-speed switches and indicate significant differences in the PPS performance:

sharing even out-dated information among input-ports can greatly improve the switch performance.

An attractive paradigm for balancing load on the average is with randomization [17, 108]; even

a very simple strategy ensures, with high probability, maximum load close to the optimal distri-

bution [61]. Given these successful applications of randomization in traditional load balancing

settings and in other high-bandwidth switches [58, 136], it is tempting to employ randomization

in parallel packet switches in order to improve their performance. Nevertheless, we show that

randomization does not help to decrease the average relative queuing delay. This surprising result

holds because the common practice is that switches should not mis-sequence cells [81]. This prop-

erty allows an adversary to exploit a transient increase of the relative queuing delay and perpetuate

it long enough to increase the average relative queuing delay.

On the positive side, we introduce a generic methodology for analyzing the maximal relative

queuing delay by measuring the imbalance between the lower-speed switches. The methodology

is used to devise new optimal algorithms that rely on slightly out-dated global information on the

switch status. It is also used to provide a complete proof of the maximum relative queuing delay

provided by the fractional traffic dispatch algorithm [72, 86].

These results are discussed in Chapter 4.

1.4.2 Packet-mode Scheduling in Combined Input-Output Queued Switches

The need for packet-mode schedulers arises from the fact that in most common network protocols,

traffic is comprised of variable size packets (e.g., IP datagrams), while real-life switches store and

transmit packets as fixed-sized cells, with fragmentation and reassembly done outside the switch.

Packet-mode schedulers consider the linkage between cells that correspond to the same packet and

are constrained so that such cells should be delivered from the switch contiguously [57]. Packet-

aware scheduling schemes avoid the overhead induced by packet segmentation and reassembly that

can become very significant at high speeds.

We devise coarse-grained schedulers that allow a packet-mode combined input output queued

(CIOQ) switch with small speedup to mimic an ideal output-queued switch with bounded relative

queuing delay [15]. The schedulers are coarse-grained, making a scheduling decision only every fixed

number of time-slots, and work in a pipelined manner based on matrix decomposition.

Our schedulers demonstrate a trade-off between the switch speedup and the relative queuing

delay incurred while mimicking an output-queued switch. When the switch is allowed to incur

high relative queuing delay, a speedup arbitrarily close to 2 suffices to mimic an ideal output-

queued switch. This implies that packet-mode scheduling does not require higher speedup than a

cell-based scheduler. The relative queuing delay can be considerably reduced with just a doubling

of the speedup. We also show that it is impossible to achieve zero relative queuing delay (that is, a

perfect emulation), regardless of the switch speedup.

We further evaluate the performance of our scheduler through extensive simulations, both un-

der real-life traffic traces and under various stochastic traffic models. These simulations clearly

indicate that in practice this scheduler performs significantly better than its theoretical bound.

These results are presented in Chapter 5.

1.4.3 Jitter Regulation for Multiple Streams

We also investigate the packet scheduling process. Specifically, we refer to each output-port as a

stand-alone environment with a bounded-sized buffer, and study how different QoS measures can

be guaranteed for a traffic traversing such an environment. A prime example is a jitter regulator,

which should shape the incoming traffic to be perfectly periodic, using a bounded-sized buffer.

While previous work on this topic [100] handles only a single stream, we show upper and

lower bounds for multiple stream jitter regulation in offline and online settings [63]: In the offline

setting, jitter regulation can be solved in polynomial time, while in the online setting a buffer

augmentation is needed in order to compete with the optimal algorithm; the amount of buffer

augmentation depends linearly on the number of streams.

Chapter 6 presents our results on jitter regulation.

Chapter 2

Background

In this chapter we survey the most relevant background to our research. Section 2.1 deals with

related work on CIOQ switches. Section 2.2 describes the research done on OQ emulation. In

Section 2.3, we present the known results on Parallel Packet Switches and related architectures.

We conclude in Section 2.4, by describing the prior work on jitter regulation.

2.1 Prior Work on CIOQ Switches

In a combined input-output queued (CIOQ) switch, described in Section 1.1, arriving cells are

first stored in the input side of the switch and then forwarded over a crossbar switch fabric to the

output-side as dictated by a scheduling algorithm.

The switch fabric operates S times faster than the external rate, where S is the speedup of the

switch, and imposes the following major constraint on the scheduling algorithm: At each time-slot

at most S cells can be forwarded from any input-port and at most S cells can be sent to any output-

port. An alternative approach to express this constraint is by defining a scheduling opportunity (or

scheduling decision), in which at most one cell is forwarded from each input-port and to each

output-port. The speedup S implies that the switch has S scheduling opportunities every time-slot.
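This fabric constraint is easy to state as a predicate over the cells forwarded in one time-slot; the following is a hypothetical sketch:

```python
from collections import Counter

def respects_speedup(forwarded, S):
    """`forwarded` is a list of (input_port, output_port) pairs, one per
    cell forwarded in a single time-slot. The constraint: at most S cells
    leave any input-port and at most S cells reach any output-port."""
    from_input = Counter(i for i, _ in forwarded)
    to_output = Counter(j for _, j in forwarded)
    return (all(c <= S for c in from_input.values())
            and all(c <= S for c in to_output.values()))
```

Equivalently, the forwarded cells must decompose into at most S scheduling opportunities, each a (partial) matching of input-ports to output-ports.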

A common approach to solve the scheduling problem in CIOQ switches is to model the switch

as a bipartite graph G(t) = 〈V1, V2, E〉, where V1 is the set of input-ports, V2 is the set of output-

ports and an edge (v1, v2) ∈ E exists if and only if there is a cell waiting for scheduling from

input-port v1 to output-port v2 at time-slot t. Note that after each scheduling action, a new graph

should be obtained.

A solution to the classic maximum size matching problem achieves 100% throughput under

uniform i.i.d. traffic, but can lead to instability and unfairness under other traffic patterns [104].

These results can be improved by assigning weights on the edges and solving the maximum weight

matching problem. It is shown [104] that if the weights are assigned according to the lengths of

the queues (LQF) or the waiting time of the cell at the head of the queue (OCF), 100% throughput

can be achieved even for non-uniform traffic. However, these algorithms are infeasible in large

and high-speed switches due to the high complexity of maximum weight matching solutions.

To overcome this problem, several algorithms that are based on maximal match-

ings were proposed. In general, these algorithms operate in iterations, such that in each iteration

an unmatched input-port picks an unmatched output-port and adds the edge to the matching, until

the matching converges to a maximal matching (usually after N iterations). The difference be-

tween the maximal-matching based algorithms is in the way conflicting requests are resolved. In

Parallel Iterative Matching (PIM) [5], these requests are resolved randomly (implying that with

high probability O(log N) iterations suffice for the algorithm to converge), while in iSLIP [103]

these requests are resolved using round-robin pointers. Other maximal matching based algorithms

include Wave Front Arbiters (WFA) [132], iLQF [105] and iOCF [105].
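The iterative request-grant structure shared by these algorithms can be sketched as follows, with PIM-style random conflict resolution; this is a simplification (it folds PIM's separate accept phase into the grant loop), and all names are ours:

```python
import random

def iterative_matching(requests, iterations=None):
    """requests[i]: set of output-ports with cells queued at input-port i.
    Each iteration, every unmatched input requests its unmatched outputs;
    each requested output grants one requesting input at random; granted
    pairs join the matching. Converges to a maximal matching."""
    n = len(requests)
    match_in = {}   # input-port -> matched output-port
    match_out = {}  # output-port -> matched input-port
    for _ in range(iterations or n):
        grants = {}
        for i in range(n):
            if i in match_in:
                continue
            for j in requests[i]:
                if j not in match_out:
                    grants.setdefault(j, []).append(i)
        if not grants:
            break  # no unmatched input wants an unmatched output: maximal
        for j, askers in grants.items():
            i = random.choice(askers)  # PIM: resolve conflicts randomly
            if i not in match_in:      # an input may win several grants
                match_in[i] = j
                match_out[j] = i
    return match_in
```

Replacing `random.choice` with round-robin pointers would move the sketch toward iSLIP's conflict resolution.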

Nevertheless, the scheduling algorithm complexity is typically the main performance limitation

of CIOQ switches [24], since scheduling decisions are made every time-slot, meaning that the

scheduling algorithm must operate at least as fast as the external line rate.

One approach to overcome this scheduling complexity is by using randomization, which was

proven extremely successful in simplifying the implementation of algorithms. A prime example

of a linear-time randomized algorithm for CIOQ switches was presented by Tassiulas [136], who

proposed to compare, at each scheduling decision, the weight of the current matching to the weight

of a randomly chosen matching. Giaccone et al. [58] later improved this algorithm and showed

how such randomized algorithms can achieve good delay performance.

Another approach to solve the scheduling problem is by matrix decomposition [31]. Such

solutions assume that the arrival rate of each flow is known and decompose the arrival traffic

rate matrix Λ = [λi,j] (λi,j is the rate of flow (i, j)) into permutation matrices, which are used

periodically as scheduling decisions.
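A minimal Birkhoff-von-Neumann-style sketch of this idea, assuming (our assumption, for frame-based scheduling) an integer rate matrix whose row and column sums are all equal, so that a perfect matching on positive entries always exists:

```python
def decompose(rate_matrix):
    """Greedily split a nonnegative integer matrix with equal row and
    column sums into weighted permutations, returned as (weight, perm)
    pairs where perm[i] is the output matched to input i."""
    M = [row[:] for row in rate_matrix]
    n = len(M)
    result = []
    while any(any(row) for row in M):
        match = [-1] * n  # output j -> input matched to it

        def augment(i, seen):
            # Kuhn's augmenting-path search over positive entries of M.
            for j in range(n):
                if M[i][j] > 0 and j not in seen:
                    seen.add(j)
                    if match[j] == -1 or augment(match[j], seen):
                        match[j] = i
                        return True
            return False

        for i in range(n):
            augment(i, set())
        perm = [None] * n
        for j, i in enumerate(match):
            perm[i] = j
        w = min(M[i][perm[i]] for i in range(n))  # reuse this permutation w times
        for i in range(n):
            M[i][perm[i]] -= w
        result.append((w, perm))
    return result
```

Each extracted permutation can then be used as a scheduling decision for `w` of the frame's time-slots.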

A recent approach is to use a coarse-grained scheduler, which makes a scheduling decision

every predetermined number of time-slots [24, 114]. In such algorithms, a frame is defined as τ

consecutive time-slots, and scheduling decisions are made at the boundaries of these frames. The

scheduling decision should encompass all necessary information for the input-port to schedule

cells for τ time-slots. Notice that matrix decomposition techniques are a promising approach in

devising such frame-based schedulers, as was previously proposed for optical switching and in

satellite-switched time-division multiple access (SS/TDMA) schedulers [83, 95, 137, 145].

Aggarwal et al. [3] combine these three approaches and devise a randomized, coarse-grained

algorithm for matrix decomposition. Basically, they looked at the matrix decomposition problem

as coloring a bipartite multi-graph, and proposed a randomized edge coloring algorithm that colors

the graph with as few colors as possible. Their algorithm achieves nearly optimal results with very

low implementation complexity.

All the above-mentioned schedulers deal only with fixed-size cells. However, some schedulers

that handle variable size packets directly were also proposed. Previous work [57, 101] considers

packet-mode scheduling in an input-queued (IQ) switch with crossbar fabric and no speedup. It

proves analytically that packet-mode IQ switches can achieve 100% throughput, provided that

the input-traffic is well-behaved; this matches the earlier results on cell-based scheduling [104].

Marsan et al. [101] also show that under low load and small packet size variance, packet-mode

schedulers may achieve better average packet delay than cell-based schedulers. A different line

of research used competitive analysis to evaluate packet-mode scheduling, when each packet has

a weight representing its importance and a deadline until which it should be delivered from the

switch [62].

2.2 Prior Work on Output-Queued Switch Emulation

The question of whether a feasible switch architecture can emulate an output-queued switch was

first raised by Prabhakar and McKeown in the context of combined input-output queued switches [120].

They answered this question in the affirmative and presented the first output-queued emulation al-

gorithm, called the most urgent cell first (MUCFA) algorithm, which requires a speedup of 4. Following

this seminal paper, other works investigated the speedup required for a CIOQ switch to emulate

an output-queued switch [35, 91, 130, 131]. A prime example is the critical cells first (CCF)

algorithm [37], which allows a CIOQ switch with speedup of at least 2 to emulate (exactly) an

output-queued switch. In addition, a matching lower bound was also proven [37]: A CIOQ switch

needs speedup S ≥ 2 − 1/N in order to emulate an output-queued switch. Notice that all these algo-

rithms do not make any assumptions on the incoming traffic.

The demand for exact emulation is sometimes relaxed to allow the investigated switch archi-

tecture to lag behind the OQ switch by a fixed and predetermined relative queuing delay [72].

We refer to such relaxed emulation as OQ switch mimicking. In this context, the ability of a CIOQ

switch with small speedup (that is, S < 2) to mimic an OQ switch was investigated in [59]: when

the traffic is well-behaved (that is, obeys one of the restrictive models described in Section 1.2) the

demand of speedup S ≥ 2 − 1/N can be relaxed at the expense of a bounded relative queuing

delay. In Chapter 5, we show that a packet-mode CIOQ switch can provide OQ mimicking.

The ability to emulate or mimic an OQ switch was also investigated under other switch archi-

tectures.

A recent line of research deals with buffered crossbars (described in Section 1.1). Magill et

al. [99] showed that a buffered crossbar with S = 2 can emulate a first-come-first-serve (FCFS)

output-queued switch with any arrival pattern. Furthermore, they also showed that if the buffers at

the crosspoints are of size at least k, more general queuing disciplines can be emulated, namely a

FCFS with k strict priorities. These results were further improved in [38], showing that an OQ switch

with any weighted round-robin scheduler can be emulated using a fully-distributed algorithm (that

is, each input-port and each output-port make independent decisions). Turner [138] investigated

packet-mode schedulers in buffered crossbars and showed that a buffered crossbar switch with

speedup 2 and crosspoint buffers of size 5Lmax, where Lmax is the maximum packet size, can

mimic an output-queued switch with relative queuing delay of (7/2)Lmax time-slots.

2.3 Prior Work on Parallel Packet Switches

The parallel packet switch architecture was first considered by Iyer et al. [70, 72, 74], who evalu-

ated its ability to mimic output-queued switches. Iyer et al. [74] introduced the Centralized PPS

Algorithm (CPA) that allows a PPS with speedup S ≥ 2 to mimic a FCFS output-queued switch

with zero relative queuing delay; here, the speedup S of the switch is the ratio of the aggregate

capacity of the internal traffic lines, connected to an input- or output-port, to the capacity of its

external line (namely, S = Kr/R).

Unfortunately, these algorithms are impractical for real switches, because they gather infor-

mation from all the input-ports in every scheduling decision. To overcome this problem, Iyer and

McKeown [72] suggest a fully-distributed algorithm that works with speedup S = 2 and mimics

a FCFS output-queued switch with relative queuing delay of ⌈NR/r⌉ time-slots. Another family

of fully-distributed algorithms, called fractional traffic dispatch (FTD) [86], works with switch

speedup S ≥ K/⌈K/2⌉, and their relative queuing delay is at least 2NR/r time-slots. However, these

papers did not provide complete and precise proofs for the correctness of the proposed algorithms

and their performance.1

The requirement for additional speedup is relaxed by adding buffers in the demultiplexors.

For such an input-buffered PPS, Iyer and McKeown [72] suggest a fully-distributed algorithm that

allows a PPS with speedup S = 1 to mimic a FCFS output-queued switch with relative queuing

delay of ⌈2NR/r⌉ time-slots.

2.4 Prior Work on Delay Jitter Regulation

The problem of jitter control has received much attention in recent years, along with the increasing

importance of providing QoS guarantees. A prime example is the Differentiated Services (Diff-

Serv) architecture, in which there is a specific requirement to maintain low jitter for Expedited

1See Remarks 1 and 2 in Chapter 4 for further details.

Forwarding (EF) traffic [49].

Jitter regulators, which capture jitter control mechanisms, use an internal buffer to shape the

traffic [79, 100, 149]. These regulators typically use scheduling algorithms that are not work-

conserving, i.e., they might delay releasing a cell even if there are cells in the buffer and the

outgoing links are not fully utilized.

Several algorithms have been proposed with the aim of providing traffic jitter control: A jitter

control algorithm which reconstructs the entire sequence at the destination using a predetermined

maximum delay bound was proposed in [116]. The Jitter-Earliest-Due-Date algorithm proposed

in [143] uses a predetermined maximum delay bound in order to calculate a deadline for every

cell, such that it is released precisely upon its deadline. The Stop-and-Go algorithm proposed

in [60] uses time frames of predetermined lengths in order to regulate traffic, such that cells arriv-

ing in the middle of a frame are only made available for sending in the following time frame. The

Hierarchical-Round-Robin algorithm proposed in [76] uses a framing strategy similar to the one

used in the Stop-and-Go algorithm, but releases are governed by a round robin policy that some-

times allocates non-utilized release time-slots to other streams. Other jitter control algorithms are

surveyed thoroughly in [147].

A slightly different line of research investigated jitter regulation in the Combined Input-Output

Queue switch architecture, forcing the jitter regulator to obey additional constraints posed by the

switching architecture [83].

The problem of jitter control in an adversarial setting was studied by Mansour and Patt-Shamir [100]

in a simplified single-stream model, with only a single abstract source. They present an efficient

offline algorithm, which computes an optimal release schedule in these settings. They further de-

vise an online algorithm, which uses a buffer of size 2B, and produces a release schedule with

the optimal jitter attainable with a buffer of size B, and then show a matching lower bound on the

amount of resource augmentation needed, proving that their online algorithm is optimal in this

sense.

For the same model, Koga [89] presents an optimal offline algorithm and a nearly optimal

online algorithm for the case where a cell can be stored in the buffer for at most a predetermined

amount of time.

The burstiness of the traffic is also captured by its rate jitter, which was first defined as the

short-term average rate of the traffic [76]. Mansour and Patt-Shamir [100] introduced another defini-

tion for the rate jitter, which bounds the difference in cell delivery rates at various times. Since the

difference in delivery time between two successive cells is the reciprocal of the instantaneous delivery

rate, the rate jitter is defined as the difference between the maximal and minimal inter-departure

time.
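Under this inter-departure formulation, the rate jitter of a departure trace is again a one-line computation (sketch, names are ours):

```python
def rate_jitter(departure_times):
    """Rate jitter as the difference between the maximal and minimal
    inter-departure time over successive cells, given sorted times."""
    gaps = [b - a for a, b in zip(departure_times, departure_times[1:])]
    return max(gaps) - min(gaps)
```

A perfectly periodic departure sequence has rate jitter 0.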

Note that delay jitter is more restrictive than rate jitter [100]. Thus, unlike delay jitter

regulators, which completely reconstruct the incoming traffic, rate jitter regulators typically only

partially reconstruct the traffic, and are therefore easier to implement [147].

Chapter 3

Model Definitions

An N × N switch handles either fixed-size cells or variable-size packets arriving at N input-ports

at rate R and destined for N output-ports working at the same rate R. Packets (cells) arrive at the

input-ports and leave the output-ports in a time-slotted manner (that is, all the switch external lines

are synchronized). For variable-size packets, we refer to each part of the packet that is transmitted

during a single slot as a single fixed-size cell, and measure the packet size in cell-units. Unless

otherwise specified, we assume the switch does not drop cells.

For every cell c, ta(c) denotes the time-slot in which cell c arrived at the switch. In addition,

we denote by orig(c) and dest(c) the input-port at which c arrives and the output-port for which c

is destined. packet(c) denotes the packet that corresponds to cell c; first(p), last(p) are the first

and last cells of packet p.

The definitions in the rest of this chapter assume that only fixed-size cells arrive at the switch.

However, they can be easily extended to hold also for variable-size packets.

A traffic T is a finite collection of cells, such that no two cells arrive at the same input-port at

the same time-slot. A flow (i, j) is the collection of cells sent from input-port i to output-port j.1

The projection of a traffic T on a set of input-ports I, denoted by T|I, is {c ∈ T | orig(c) ∈ I}. Since

for any input-port i and traffic T , there are no two cells c1, c2 ∈ T |i such that ta(c1) = ta(c2), the

arrival times of cells in T |i induce a total order on them.

1It is important to notice that a flow at the switch level may correspond to several flows at the network level, all sharing the same input-port and the same output-port of the switch.


For any cell c, shift(c, t) is a cell with the same origin and destination such that ta(shift(c, t)) = ta(c) + t. The shift operation is used for concatenating two finite traffics, T1 and T2, so that T2 starts after the last cell of traffic T1. Formally, T1 T2 is the traffic T1 ∪ {shift(c, t) | c ∈ T2}, where t = 1 + max{ta(c) | c ∈ T1}.

ESW (ALG, T ) is the execution of the switch, using scheduling algorithm ALG in response to

incoming traffic T . If ALG is a randomized algorithm, we denote by ESW (ALG, σ, T ) the execution

ESW (ALG, T ) taking into account the coin-tosses sequence σ obtained by the algorithm. The exact

definition of the execution is determined by the switch architecture that is investigated. Yet, given

the execution ESW (ALG, σ, T ) one can determine uniquely the time-slot in which cell c leaves

the switch for every cell c ∈ T . This time-slot is denoted by tlSW (c, T ) (or tlSW (σ, c, T ) if the

algorithm is randomized).

The switch is compared to a work-conserving shadow switch that receives the same traffic T ,

and obeys the per-flow FCFS discipline; that is, cells with the same origin and the same destination

should leave the switch in their arrival order. We denote the execution of the shadow switch in

response to traffic T by ES(T ), and the time a cell c ∈ T leaves the shadow switch by tlS(c, T ).

Note that tlS(c, T ) ≥ ta(c) + 1.
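A minimal sketch of one such shadow switch, assuming a global-FCFS output-queued discipline (which in particular respects per-flow FCFS); the tuple encoding of cells and the tie-breaking rule for same-slot arrivals are our assumptions, as the text leaves them unspecified.

```python
from collections import defaultdict

def shadow_departures(T):
    """Departure slot tlS(c) of every cell of traffic T under a
    work-conserving, FCFS output-queued shadow switch: each output-port
    delivers one cell per slot, and a cell leaves no earlier than one
    slot after it arrives, so tlS(c) >= ta(c) + 1. Cells are
    (ta, orig, dest) tuples; same-slot ties are broken by input-port
    index (our assumption)."""
    by_dest = defaultdict(list)
    for c in T:
        by_dest[c[2]].append(c)
    tls = {}
    for cells in by_dest.values():
        cells.sort()                      # FCFS arrival order, index tie-break
        free = 0                          # next slot the output line is free
        for c in cells:
            tls[c] = max(c[0] + 1, free)  # work-conserving: no idle stalls
            free = tls[c] + 1
    return tls
```

With two cells for the same output arriving at slot 0, one leaves at slot 1 and the other at slot 2, matching the one-cell-per-slot output rate.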

The relative queuing delay of a cell c ∈ T under a scheduling algorithm ALG and a coin-tosses

sequence σ is R(ALG, σ, c, T) = tlSW(σ, c, T) − tlS(c, T).

Definition 1 For traffic T, scheduling algorithm ALG and coin-tosses sequence σ, the maximum relative queuing delay Rmax(ALG, σ, T) is max_{c∈T} R(ALG, σ, c, T), and the average relative queuing delay Ravg(ALG, σ, T) is (1/|T|) Σ_{c∈T} R(ALG, σ, c, T).
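Definition 1 translates directly into code, given departure-time maps for the switch under test and for the shadow switch (the dict-based encoding is our own):

```python
def relative_queuing_delays(tl_sw, tl_s):
    """Rmax and Ravg of Definition 1: tl_sw maps each cell to its
    departure slot in the switch under test, tl_s to its departure slot
    in the shadow switch; both maps cover the same traffic."""
    R = [tl_sw[c] - tl_s[c] for c in tl_sw]
    return max(R), sum(R) / len(R)
```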

The maximum relative queuing delay of an algorithm ALG against an adversary A is denoted R^A_max(ALG). Specifically, R^A_max(ALG) ≥ R with probability 1 − δ if adversary A can construct a traffic T such that Pr_σ[Rmax(ALG, σ, T) ≥ R] ≥ 1 − δ.

The average relative queuing delay of an algorithm ALG against an adversary A is denoted R^A_avg(ALG). Specifically, R^A_avg(ALG) ≥ R with probability 1 − δ if adversary A can construct a traffic T such that Pr_σ[Ravg(ALG, σ, T) ≥ R] ≥ 1 − δ.

If a switch architecture has a scheduling algorithm ALG such that Rmax(ALG) = 0, we say that


the switch architecture emulates an ideal switch. In case the switch architecture has a scheduling

algorithm ALG for which Rmax(ALG) is bounded, we say that the switch architecture mimics an

ideal switch.

The per-flow delay jitter of a traffic T under a scheduling algorithm ALG and a coin-tosses

sequence σ is the maximal difference in queuing delay of cells originating at the same input port

and destined for the same output port. Specifically,

Definition 2 For traffic T , scheduling algorithm ALG and coin-tosses sequence σ, the delay of a

cell c in T is delay(ALG, σ, c, T ) = tlSW (σ, c, T )− ta(c).

The delay jitter of a flow (i, j) in T is

JSW(ALG, σ, T, i, j) = max_{c∈T_{i,j}} delay(ALG, σ, c, T_{i,j}) − min_{c∈T_{i,j}} delay(ALG, σ, c, T_{i,j}),

where T_{i,j} = {c ∈ T | orig(c) = i and dest(c) = j}. The per-flow delay jitter of the traffic T is JSW(ALG, σ, T) = max_{i,j} JSW(ALG, σ, T, i, j).
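Definition 2 can be sketched under the same tuple encoding of cells used throughout (an assumption of ours, not the text's):

```python
from collections import defaultdict

def per_flow_delay_jitter(T, tl_sw):
    """Per-flow delay jitter of Definition 2: for each flow (i, j), the
    spread between the largest and smallest delay tl(c) - ta(c) over its
    cells; the traffic's jitter is the maximum over all flows. Cells are
    (ta, orig, dest) tuples; tl_sw maps each cell to its departure slot."""
    delays = defaultdict(list)
    for c in T:
        delays[(c[1], c[2])].append(tl_sw[c] - c[0])
    return max(max(d) - min(d) for d in delays.values())
```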

Similarly, let JS(T ) be the per-flow delay jitter of traffic T under the shadow switch. The

relative delay jitter is formally defined as follows:

Definition 3 For traffic T , scheduling algorithm ALG and coin-tosses sequence σ, the relative

delay jitter, denoted by J , is the difference between the per-flow delay jitter of the switch and

the per-flow delay jitter of an optimal shadow work-conserving switch, that is J (ALG, σ, T ) =

JSW (ALG, σ, T )− JS(T ).

Note that if Rmax(ALG, σ, T) = 0 then J(ALG, σ, T) = 0 as well.


Chapter 4

Relative Queuing Delay in PPS

One of the key issues in the design of a PPS (recall Figure 3) is balancing the load of switching

operations among the middle-stage switches, thereby utilizing the parallel capabilities of the

switch. Load balancing is performed by a demultiplexing algorithm, whose goal is to minimize the

concentration of a disproportionate number of cells in a small number of middle-stage switches.

Demultiplexing algorithms can be classified according to the amount and type of information

they use. The strongest type of demultiplexing algorithms are centralized, and make demultiplex-

ing decisions based on global information about the status of the switch. Unfortunately, these

algorithms must operate in a speed proportional to the aggregate incoming traffic rate, and there-

fore, they are impractical. At the other extreme, fully-distributed demultiplexing algorithms rely

only on the local information in the input-port.1 Due to their relative simplicity, they are common

in contemporary switches. A realistic middle ground is what we call u-real-time distributed (u-RT)

demultiplexing algorithms, in which a demultiplexing decision is based on the local information

and global information older than u time slots. Obviously, every fully distributed algorithm is also

a u-real-time distributed algorithm.

The relative queuing delay of the PPS (Definition 1) captures the influence of the parallelism of

the PPS on the performance of the switch, depending on the different demultiplexing algorithms,

and ignores the specific PPS hardware implementation. As we shall prove, the relative queuing

1These are also called independent demultiplexing algorithms [68].


delay is determined solely by the balancing of cells among the planes.

Randomization is successfully applied in traditional load balancing settings [17, 61, 108] and

in other high-bandwidth switches [58, 136]: Even a very simple strategy ensures (with high prob-

ability) maximum load close to the optimal distribution. Therefore, it is tempting to employ ran-

domization to reduce the average imbalance between planes and by that reduce the average relative

queuing delay.

Our main contributions are lower and upper bounds on the relative queuing delay of the PPS.

Our lower bounds hold even when the PPS has to deal only with well-behaved traffics that obey

the leaky-bucket model [140], which makes our results stronger. In addition, we show that ran-

domization does not help to decrease the average relative queuing delay. This somewhat surprising

result holds due to the requirement that switches should not mis-sequence cells [81]. This property

allows an adversary to exploit a transient increase of the relative queuing delay and perpetuate it

sufficiently long to increase the average relative queuing delay. On the other hand, we devise a

general methodology for analyzing the maximum relative queuing delay from above; clearly, this

also bounds (from above) the average relative queuing delay.

4.1 Summary of Our Results

Deterministic Lower Bounds

A bufferless PPS (i.e., without buffers at the input-ports) with fully-distributed demultiplexing

algorithm incurs the highest relative queuing delay and relative delay jitter. If some plane is utilized

by all the demultiplexors, we prove a lower bound of (R/r − 1)N time slots on the relative queuing delay and relative delay jitter, where R/r is the ratio between the PPS external and internal rates. Even in the unrealistic and failure-prone case where the planes are statically partitioned among the demultiplexors, the relative queuing delay and relative delay jitter are at least (R/r − 1)(N/S) time-slots.

Both lower bounds employ leaky-bucket flows with no bursts.

A bufferless PPS with a u-RT demultiplexing algorithm (for any u) has relative queuing delay and relative delay jitter of at least (1 − ūr/R)·ū·(N/S) time-slots, where ū = min{u, (1/2)(R/r)}. In contrast,


Demultiplexor Type                       Planes
                                 OC-3     OC-12    OC-48
Fully-distributed
    unpartitioned              64,512    15,360    3,072
    partitioned                32,256     7,680    1,536
1-RT                              504       480      384
Centralized                         0         0        0

Table 4.1: The relative queuing delay (in time-slots) of a bufferless OC-192 PPS with 1024 ports and speedup 2.
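The entries of Table 4.1 can be reproduced from the lower-bound formulas of this section; the rate ratios R/r = 64, 16 and 4 correspond to OC-3, OC-12 and OC-48 planes inside an OC-192 switch (a sketch under these assumptions):

```python
# Lower bounds on relative queuing delay for a bufferless OC-192 PPS
# with N = 1024 ports and speedup S = 2, per demultiplexor type.
N, S = 1024, 2

def unpartitioned(ratio):            # (R/r - 1) * N
    return (ratio - 1) * N

def partitioned(ratio):              # (R/r - 1) * N / S
    return (ratio - 1) * N // S

def u_rt(ratio, u=1):                # (1 - u*r/R) * u* * N / S
    u_star = min(u, ratio // 2)      # u* = min{u, (1/2)(R/r)}
    return int((1 - u_star / ratio) * u_star * N / S)

for name, ratio in [("OC-3", 64), ("OC-12", 16), ("OC-48", 4)]:
    print(name, unpartitioned(ratio), partitioned(ratio), u_rt(ratio))
```

Running this prints exactly the three numeric columns of Table 4.1 (e.g., 64,512 / 32,256 / 504 for OC-3 planes), with the centralized row being identically 0.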

Iyer et al. [72] present a centralized demultiplexing algorithm for a bufferless PPS with speedup

S ≥ 2, which achieves zero relative queuing delay.

Our lower bound results show that the PPS architecture does not scale with increasing number

of external ports (see Table 4.1 for specific instances). This is significant since great effort is cur-

rently invested in building switches with a large number of ports. Note that large relative queuing

delays usually imply that the buffer sizes at the middle-stage switches and at the external ports

should be large as well, so that the cells can be queued.

For bufferless PPS, it is important to notice that using u-RT demultiplexing algorithm sig-

nificantly reduces the lower bound on the relative queuing delay compared with fully-distributed

demultiplexing algorithm. u-RT demultiplexing algorithms correspond to commercially used poli-

cies like arbitrated crossbar switches [132], in which a request is made by the input-port, and

the cell is sent once a grant is received back from the arbiter. The separation between the lower

bounds implies that employing u-RT demultiplexing algorithms in PPS (even with a considerably

large value of u) may decrease the relative queuing delay dramatically, and still be feasible for

high-speed switches.

Randomized Lower Bounds

We show that an adversary can devise traffic that exhibits with high probability a large average

relative queuing delay. The exact bounds depend on the type of the adversary, the exact restriction

on the order of cells the switch should respect and, as in the deterministic case, on the locality of

information used for cell demultiplexing.

When the PPS respects the arrival order of cells with the same input-port and the same output-


port (that is, per-flow FCFS discipline) and the adversary is adaptive [110], the bounds are equal

(with high probability) to the deterministic lower bounds for maximum relative queuing delay for

all classes of demultiplexing algorithms. The randomized lower bound holds also with an oblivious

adversary, if a PPS obeys a global FCFS policy (that is, all cells to the same destination should

leave the switch according to their arrival order) and a fully-distributed demultiplexing algorithm

is used.

Matching Upper Bounds

To prove that the lower bounds are tight, we devise a methodology for evaluating the relative

queuing delay under global FCFS policies. We show a general upper bound that depends on the

difference between the number of cells with the same destination that are sent through a specific

plane, and the total number of cells with this destination.

Our methodology is employed to prove that the maximal relative queuing delay of the fractional

traffic dispatch (FTD) algorithm [70] is O(N·R/r) time-slots. This matches the lower bound on the

average relative queuing delay introduced by fully-distributed demultiplexing algorithms (even

when randomization is used). This is the first formal and complete correctness proof for this

algorithm.

Remark 1 Iyer and McKeown [70, 72] outline an approach for bounding the relative queuing

delay of FTD, but leave a number of details missing [69]; a previous attempt [86] to complete the

formal proof and precisely bound the relative queuing delay of FTD, turned out to be flawed [85]

(see Remark 2 for further details on the mistake in [86]).

By precisely capturing the crucial factors affecting the relative queuing delay, our methodology

leads to new algorithms that use global information that is u time-slot old. Their maximum relative

queuing delay is O(N) time-slots, asymptotically matching the lower bound on the average relative

queuing delay for this class of demultiplexing algorithms (even when randomization can be used).


PPS Model Extensions

One extension of the PPS model is the input-buffered PPS, which has small buffers also at the input-ports; it can support more elaborate demultiplexing algorithms, since an arriving cell can either be transmitted to one of the middle-stage switches, or be kept in the input-buffer. We show that under a u-RT demultiplexing algorithm, a switch with speedup S ≥ 2 and input-buffers larger than u

can employ a centralized algorithm (e.g., [72]). In contrast, a deterministic fully-distributed demultiplexing algorithm introduces relative queuing delay and relative delay jitter of at least (1 − r/R)(N/S) time-slots, for any buffer size, under leaky-bucket flows with no bursts.

A second extension to the PPS model is by implementing recursively the planes themselves as

parallel packet switches operating at lower rate. We prove lower bounds on the relative queuing

delay for the homogeneous recursive PPS, in which all the demultiplexors in all recursion levels

are of the same type (e.g., fully distributed demultiplexors), and for the monotone recursive PPS,

in which demultiplexors are allowed to share more information as their rate decreases. The lower

bounds generalize the lower bound for the non-recursive PPS model.

4.2 A Model for Parallel Packet Switches

An N ×N PPS is a three-stage Clos network [41], with K < N planes. Each plane is an N ×N

switch operating at rate r < R, and is connected to all input-ports on one side, and to all output-

ports on the other side (recall Figure 3). The speedup S = Kr/R captures the switch over-capacity.

A bufferless PPS has no buffers at its input-ports but can store pending cells in its planes and

in its output-ports. Each cell arriving at input-port i is immediately sent to one of the planes; the

plane through which the cell is sent is determined by a randomized state machine with state set Si,

following some algorithm.

Definition 4 The demultiplexing algorithm of a bufferless input-port i is a function

ALGi : {1, . . . , N} × Si × COINSPACE → {1, . . . , K} × Si


which gives a plane number and the next state, according to the incoming cell destination, the

current state, and the result of a coin-toss that is taken out of a finite and uniform coin-space

COINSPACE. (For a deterministic algorithm, |COINSPACE| = 1.)
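As an illustration of Definition 4, the sketch below is one deterministic, fully-distributed demultiplexor whose state is simply the last plane it used. The round-robin policy, the state encoding and the value K = 10 are our own assumptions for the example, not algorithms taken from the text.

```python
K = 10   # number of planes (arbitrary choice for the example)

def demux(dest, state, coin=0):
    """ALG_i : destination x state x coin-toss -> (plane, next state).
    The state of input-port i is the plane index used for its previous
    cell; each new cell goes to the next plane in round-robin order.
    Deterministic, so |COINSPACE| = 1 and the coin argument is ignored;
    the destination plays no role in this particular policy."""
    plane = (state + 1) % K
    return plane, plane
```

With K > r′ planes, this policy also satisfies the input constraint of Section 4.2: two cells from the same input-port arriving at most r′ slots apart are at most r′ (< K) rotation steps apart, hence land on distinct planes.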

It is important to notice that demultiplexing algorithm ALGi accesses the random coin-tosses

one by one. More precisely, the demultiplexing decision of ALGi at time-slot t depends only

on random coins that were tossed up until time-slot t; the coin-tosses up until time-slot t − 1

are incorporated into the state of ALGi at time-slot t, while the coin-toss of time-slot t appears

explicitly in the definition of ALGi.

We next extend the switch model defined in Chapter 3 to capture the PPS architecture.

ESW (ALG, σ, T ) is the execution of a PPS using demultiplexing algorithm ALG in response to

incoming traffic T , and coin-tosses sequence σ; for all cells in T , the execution indicates the planes

the cells are sent through: {〈c, plane(c, σ, T)〉 | c ∈ T}. For clarity, we denote this execution by

EPPS(ALG, σ, T ).

A state s ∈ Si is reachable if there is a sequence of coin tosses σ and a traffic T , such that

the state-machine reaches state s in execution EPPS(ALG, σ, T ). A switch configuration consists

of the states of all state-machines, and the contents of all the buffers in the switch. A configuration

is reachable if it is reached in an execution of the switch. Since the switch does not have a pre-

determined initial configuration, we assume that for every pair of reachable configurations C1, C2,

there is a finite incoming traffic that causes the switch to transition from C1 to C2.

The internal lines of the switch operate at rate r < R. For simplicity, we assume that r′ ≜ R/r = ⌈R/r⌉; that is, R/r is an integer. This lower rate r imposes an input constraint on the demultiplexing algorithm [74]:

For any two cells c1, c2 in traffic T , if orig(c1) = orig(c2) and |ta(c1) − ta(c2)| ≤ r′ then

plane(c1, σ, T) ≠ plane(c2, σ, T).

Since a PPS has no buffers in its input-ports, cells are immediately sent to one of the planes;

that is, a cell c traverses the internal link between orig(c) and plane(c, σ, T ) at time ta(c) (see

Figure 5).

We assume that both the planes and the output buffers are FCFS and work-conserving. Let

tp(c, σ, T ) be the time-slot in which a cell c ∈ T leaves plane(c, σ, T ), and denote tlPPS(c, σ, T )


[Figure 5 is not reproduced here. It traces a cell c ∈ T through the PPS against the shadow switch: c arrives at orig(c) at time ta(c), leaves plane(c, σ, T) at tp(c, σ, T), and leaves the PPS at tlPPS(c, σ, T); in the shadow switch it leaves dest(c) at tlS(c, T).]

Figure 5: Time-points associated with a cell c ∈ T.

the time-slot it leaves the PPS (that is, tlPPS(c, σ, T ) = tlSW (c, σ, T ) as defined in Chapter 3).

The lower rate of the internal links between the planes to the output ports imposes an output-

constraint [74]:

For every two cells c1, c2 in traffic T , if dest(c1) = dest(c2) and plane(c1, σ, T ) = plane(c2, σ, T )

then |tp(c1, σ, T ) − tp(c2, σ, T )| > r′. To neglect delays caused by the additional stage of the

PPS, a cell can leave the PPS at the same time-slot it arrives at the output-port, provided that

no other cell is leaving at this time-slot, i.e., tlPPS(c, σ, T ) ≥ tp(c, σ, T ). Note however that

tp(c, σ, T ) ≥ ta(c) + 1.
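The input and output constraints above can be phrased as a predicate over a finished execution. The dict-based cell encoding is our own; the two conditions themselves are transcribed directly from the text.

```python
from itertools import combinations

def satisfies_constraints(cells, r_prime):
    """Check the input and output constraints of Section 4.2 on an
    execution. Each cell is a dict with keys ta, orig, dest, plane, tp
    (our encoding). Input constraint: two cells from the same input-port
    arriving at most r' slots apart must use distinct planes. Output
    constraint: two cells leaving the same plane for the same output-port
    must leave it more than r' slots apart."""
    for c1, c2 in combinations(cells, 2):
        if (c1["orig"] == c2["orig"]
                and abs(c1["ta"] - c2["ta"]) <= r_prime
                and c1["plane"] == c2["plane"]):
            return False
        if (c1["dest"] == c2["dest"]
                and c1["plane"] == c2["plane"]
                and abs(c1["tp"] - c2["tp"]) <= r_prime):
            return False
    return True
```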

When it is clear from the context, we omit the traffic T and the coin-tosses sequence σ from

the notations plane(c, σ, T), tp(c, σ, T), tlPPS(c, σ, T), tlS(c, T) and R(ALG, σ, c, T).

4.3 Lower Bounds on the Relative Queuing Delay

The relative queuing delay of a PPS heavily depends on the information available to the demulti-

plexing algorithm. Practical demultiplexing algorithms must operate with local, or out-dated, in-

formation about the status of the switch: flows waiting at other input-ports, contents of the planes’

buffers, etc. As we shall see, such algorithms incur non-negligible queuing delay.

Specifically, in this section we prove lower bounds on the maximal and average relative queuing

delay even when randomization is used. We show lower bounds for deterministic demultiplexing

algorithms. Based on these results, we present lower bounds for randomized demultiplexing algo-


rithms that use an adaptive adversary that sends cells to the switch at each time-slot based on the

algorithm actions at previous slots. We further show that under reasonable assumptions the lower

bounds can be extended to hold with an oblivious adversary, which chooses the entire traffic in

advance, knowing only the demultiplexing algorithm. Moreover, we show that these lower bounds

on the relative queuing delay yield similar lower bounds on the relative delay jitter.

We prove even stronger results, and show that the lower bounds hold even when the traffic is

restricted by the (R,B) leaky-bucket model [34, 45]. This model restricts the traffic from flooding

the switch by requiring that the combined rate of flows sharing the same input-port or the same

output-port does not exceed the external rate R of that port by more than a fixed bound B, which

is independent of time. Specifically, a traffic T is (R,B) leaky-bucket, if for any two time-slots

t1 ≤ t2 and any output-port j, |{c ∈ T | t1 ≤ ta(c) ≤ t2 and dest(c) = j}| ≤ (t2 − t1) + B.
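The (R, B) condition can be transcribed directly, with the external rate normalized to one cell per slot per port (an assumption; the time-slotted model leaves R implicit). It suffices to check windows whose endpoints are arrival times.

```python
def is_leaky_bucket(T, B):
    """Check the (R, B) leaky-bucket condition per output-port: in every
    window [t1, t2] of arrival times, at most (t2 - t1) + B cells of T
    are destined for the same output-port j. Cells are (ta, orig, dest)
    tuples."""
    for j in {c[2] for c in T}:
        arr = sorted(c[0] for c in T if c[2] == j)
        for i1 in range(len(arr)):
            for i2 in range(i1, len(arr)):
                if (i2 - i1 + 1) > (arr[i2] - arr[i1]) + B:
                    return False
    return True
```

For instance, two cells for the same output arriving in the same slot form a burst of size 2 and need B ≥ 2 under this inclusive-window reading.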

4.3.1 General Techniques and Observations

High relative queuing delay is exhibited when cells that are supposed to leave the shadow switch

one after the other, are concentrated in a single plane. We first describe this scenario given a

specific coin-tosses sequence σ, implying the result holds also for a deterministic demultiplexing

algorithm with |COINSPACE| = 1.

Definition 5 An execution EPPS(ALG, σ, T ) is (f, s) weakly-concentrating for output-port j and

plane k if there is a time-slot t such that:

1. Output-port j’s buffer of the shadow switch is empty at time-slot t; and

2. At least f cells destined for output-port j arrive at the switch during time-interval [t, t+ s), and

f out of these cells are sent through the plane k.

We call an execution an (f, s) weakly-concentrating execution, when the plane k and the

output-port j are clear from the context.

The following lemma bounds the relative queuing delay exhibited in (f, s) weakly-concentrating

executions:


Lemma 1 For any (R,B) leaky-bucket traffic T , coin-tosses sequence σ, and (f, s) weakly-concen-

trating execution EPPS(ALG, σ, T ) for output-port j and plane k, the last cell c that is sent from

plane k to output-port j in EPPS(ALG, σ, T) attains R(ALG, σ, c, T) ≥ f · r′ − (s + B).

Proof. We compare the queuing delay of the cells, arriving in time interval [t, t + s), in the PPS

and in the shadow switch. Since the shadow switch is work-conserving, all f cells leave the switch

exactly f time-slots after the first cell is dispatched. On the other hand, a PPS completes this

execution after at least fr′ time-slots, because f cells are sent to the same plane, and only one

cell can be sent from this plane to the output-port every r′ time-slots. Let c be the last of these

cells sent from the plane to the output-port. Hence, the relative queuing delay c attains is at least

fr′ − f time-slots. Since the incoming traffic is (R,B) leaky-bucket, f ≤ s + B, and therefore

R(ALG, σ, c, T ) ≥ fr′ − f ≥ fr′ − (s + B) time-slots.

Lemma 1 implies the following lower bounds on the maximum relative queuing delay and the

maximum relative delay jitter:

Lemma 2 For any (R,B) leaky-bucket traffic T , coin-tosses sequence σ, and (f, s) weakly-con-

centrating execution EPPS(ALG, σ, T ) for output-port j and plane k:

(1) The maximum relative queuing delay Rmax(ALG, σ, T) ≥ f · r′ − (s + B) time-slots.

(2) There is a traffic T′ such that the relative delay jitter J(ALG, σ, T′) ≥ f · r′ − (s + B) time-slots.

Proof. The proof of (1) immediately follows from Lemma 1. Let t be the time-slot defining

EPPS(ALG, σ, T ) as (f, s) weakly-concentrating execution for output-port j and plane k, and

let c be the last cell, arriving at interval [t, t + s), that is sent from plane k to output-port j in

EPPS(ALG, σ, T ).

By the definition of cell c, ta(c) ≤ t + s − 1, while by Lemma 1, tlPPS(c) ≥ t + fr′. This

implies that delay(c, T ) ≥ f · r′ − s + 1.

Let t′ > tlPPS(c, T ) be the first time-slot after tlPPS(c, T ) in which all the buffers of the PPS

are empty. Such time-slot exists since T is finite and both the planes and multiplexors of the PPS

are work-conserving.


Let T′ = T ∪ {c′}, where cell c′ is a cell with orig(c′) = orig(c), dest(c′) = dest(c), and

ta(c′) = t′. Clearly, cell c′ leaves the PPS exactly one time-slot after its arrival. In addition, all

other cells in traffic T ′ leave the PPS exactly as in execution EPPS(ALG, σ, T ). Because cell c and

cell c′ share the same origin and destination, the maximum delay jitter introduced by the PPS is at

least JPPS(ALG, σ, T ′) ≥ (f · r′ − s + 1)− 1 = f · r′ − s time-slots.

Recall that the maximum buffer size needed for any work-conserving switch to work under

(R,B) leaky-bucket traffic is B. Therefore a work-conserving switch, which serves the incoming

cells in an FCFS manner (e.g. FCFS output-queued switch) introduces queuing delay, and therefore

also delay jitter, of at most B time-slots. Thus, the relative delay jitter between the PPS and the

shadow switch is at least (f · r′ − s)−B = f · r′ − (s + B) time-slots, proving (2).

Another key observation is that if the last cell of a traffic attains relative queuing delay R,

then this traffic can be continued so that every added cell attains a relative queuing delay of at least R,

regardless of the random choices made by the demultiplexing algorithm (if any).

We first define how a traffic is continued. A cell c2 ∈ T is the immediate successor of cell c1 ∈ T under demultiplexing algorithm ALG, denoted c2 = succ(c1, T), if tlS(c2, T) = tlS(c1, T) + 1, and

for every coin tosses sequence σ, tlPPS(c2, T ) > tlPPS(c1, T ) in the execution EPPS(ALG, σ, T ).

Namely, a PPS cannot change the order in which c1 and c2 are delivered; this happens for example

when a PPS follows a per-flow FCFS policy and c1, c2 share the same input-port and the same

output-port. Generally, the existence of an immediate successor depends on the priority scheme

supported by the PPS.

Let c be the last cell in a traffic T, i.e., tlS(c, T) = max_{c′∈T} tlS(c′, T). A traffic T′ = {c0, . . . , cn} is a proper continuation of T, if in the execution of the shadow switch in response to

traffic T T ′, all the cells of T ′ are delivered one time-slot after the other without any stalls, and

the delivery times of the cells of T remain unchanged. Formally, T ′ is a proper continuation of T

if in the execution ES(T T ′), c0 = succ(c, T T ′), ci = succ(ci−1, T T ′) for every i, and for

every c′ ∈ T , tlS(c′, T ) = tlS(c′, T T ′) and tlPPS(c′, T ) = tlPPS(c′, T T ′).

We first examine proper continuations by a single cell:

Lemma 3 For any demultiplexing algorithm ALG, coin-tosses sequence σ, and finite traffic T , if


c1 is the last cell of T, and T′ = {c2} is a proper continuation of T, then R(ALG, σθ, c2, T T′) ≥ R(ALG, σ, c1, T) for any coin-toss θ.

Proof. Since T ′ is a proper continuation of T , cell c2 leaves the shadow switch exactly at time-slot

tlS(c1, T T ′) + 1, and in addition tlPPS(c2, T T ′) ≥ tlPPS(c1, T T ′) + 1. Hence,

R(ALG, σθ, c2, T T′) = tlPPS(c2, T T′) − tlS(c2, T T′)

≥ (tlPPS(c1, T T′) + 1) − (tlS(c1, T T′) + 1)

= tlPPS(c1, T) − tlS(c1, T) = R(ALG, σ, c1, T)

It is important to notice that Lemma 3 holds for any coin-toss θ, and therefore it trivially holds

if the demultiplexing algorithm ALG is deterministic.

If the adversary can construct, for every traffic, a proper continuation that is arbitrarily long,

then it can construct a traffic that exhibits an average relative queuing delay that matches the

maximum relative queuing delay. Intuitively, the adversary waits for a cell c that attains Rmax and

then sends many cells, which form a proper continuation (whose length depends on the number of

cells that arrived before c).

Lemma 4 Fix an adversary A, demultiplexing algorithm ALG, a coin-tosses sequence σ and a finite traffic T whose last cell c has R(ALG, σ, c, T) = x. If the adversary A can construct a proper continuation of traffic T whose size is at least ⌈|T|(x − ε)/ε⌉ (ε is an arbitrarily small constant), then R^A_avg(ALG) ≥ x − ε.

Proof. Let ℓ be the number of cells in traffic T, and let T′ be a proper continuation of T such that |T′| = ⌈ℓ(x − ε)/ε⌉. Applying Lemma 3 |T′| times implies that for every cell b in T′ and any coin-tosses sequence σb, R(ALG, σσb, b) ≥ R(ALG, σ, c) = x. Hence,

R^A_avg(ALG) ≥ (1 / (ℓ + ⌈ℓ(x − ε)/ε⌉)) · ⌈ℓ(x − ε)/ε⌉ · x ≥ x − ε.


In order to allow constructing proper continuations of a traffic T with high relative queuing

delay, we extend Definition 5, so that traffic T ends when a concentration occurs:

Definition 6 An execution EPPS(ALG, σ, T ) is (f, s) strongly-concentrating for output-port j and

plane k if it is (f, s) weakly-concentrating and in addition traffic T ends at time-slot t + s.

For brevity, we call such executions (f, s)-concentrating executions.

4.3.2 Lower Bounds for Fully-Distributed Demultiplexing Algorithms

A fully-distributed demultiplexing algorithm demultiplexes a cell, arriving at time-slot t, according

to the input-port’s local information in time interval [0, t]. Since no information is shared between

input ports, we assume that the state si ∈ Si of demultiplexor i does not change, unless a cell

arrives at input-port i. Note that demultiplexing algorithms that change their state even without

receiving a cell are not considered fully distributed, because a common clock-tick is shared among

all input ports. (Such algorithms are covered in Section 4.3.3.)

Lower bound for deterministic fully-distributed algorithm

The relative queuing delay of a PPS with fully-distributed demultiplexing algorithm strongly de-

pends on the number of input-ports that can send a cell, destined for the same output-port, through

the same plane. The following definition captures this switch characteristic under deterministic

algorithm (Definition 8 extends this definition for randomized algorithms):

Definition 7 A deterministic demultiplexing algorithm is d-partitioned if there is a plane k and an output-port j such that at least d input-ports send a cell destined for output-port j through plane

k in one of their reachable configurations.

We next show that a static partition of the planes among the demultiplexors helps to reduce

the relative queuing delay. However, since such partitioning is failure-prone, most existing fully-

distributed algorithms are N -partitioned, meaning that each demultiplexor may use each plane


in order to send cells to each output-port. All our results hold for this class of algorithms by

substituting d = N .

Theorem 1 Any deterministic d-partitioned fully distributed demultiplexing algorithm ALG has

Rmax(ALG) ≥ d(r′ − 1) and J (ALG) ≥ d(r′ − 1) time-slots.

Proof. By the definition of a d-partitioned demultiplexing algorithm, there is an output-port j and

a plane k, such that at least d demultiplexors send a cell destined for j through k in some reachable

configuration. Let I = {i1, i2, . . . , id} be the set of these demultiplexors, and let si ∈ Si be the state

of demultiplexor i ∈ I in configuration Ci, just before a cell is sent to plane k.

Consider traffic T ′i from an arbitrary reachable configuration C which leads the switch to con-

figuration Ci; such traffic exists since C and Ci are reachable, and there is a traffic that causes

the switch to transit between any two reachable configurations. Let Ti = (T ′i )|i; that is, a traffic

in which cells arrive only at input-port i exactly in the same time-slots as in traffic T ′i . Since the

demultiplexing algorithm is fully-distributed, demultiplexor i transits into si. Note that in Ti at

most one cell arrives at the switch in every time-slot, therefore this traffic has no bursts.

Now consider traffic T , which starts with Ti1 . . . Tid , a sequential composition of the traffics

Ti, where i ∈ I . T begins from configuration C, and sequentially for every i ∈ I , the same cells

arrive at the switch in the same time-slots as in traffic Ti, until demultiplexor i reaches state si.

Then, no cells arrive at the switch until all the buffers in all the planes are eventually empty. Finally,

d cells destined for output-port j arrive, one after the other, at different input-ports i ∈ I (one cell in

each time-slot). Since the demultiplexing algorithm is fully-distributed, each demultiplexor i ∈ I

remains in state si, and all the cells are sent through the same plane k (see Figure 6; the last d cells

are denoted ci1 , . . . , cid).

T has no bursts, and cells ci1 , . . . , cid arrive during d consecutive time-slots. These cells arrive

at the switch after the buffer in output-port j is empty.

Thus, by Lemma 2 with f = d, s = d and B = 0, we obtain the stated lower bounds.
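The bound can be replayed numerically. The sketch below uses a simplified timing model of my own (it assumes the i-th concentrated cell completes service at the plane at time i·r′, while the shadow output-queued switch delivers one cell per time-slot) and reproduces the d(r′ − 1) bound of Theorem 1:

```python
# Hedged numeric sketch of the concentration traffic T: d cells for one
# output-port are all sent through a single plane whose link to the
# output runs r' times slower than the external line.
def concentration_delay(d, r_prime):
    # plane k: one departure every r' slots; the i-th cell leaves at i*r'.
    pps_leave = [i * r_prime for i in range(1, d + 1)]
    # shadow switch: one departure per slot; the i-th cell leaves at i.
    shadow_leave = list(range(1, d + 1))
    # maximum relative queuing delay over the d concentrated cells
    return max(p - s for p, s in zip(pps_leave, shadow_leave))

assert concentration_delay(d=8, r_prime=3) == 8 * (3 - 1)  # d(r' - 1)
```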

If the PPS obeys a per-flow FCFS policy, we get the following lower bound on the average

relative queuing delay:



Figure 6: Illustration of traffic T in the proof of Theorems 1 and 4.

Theorem 2 Any deterministic d-partitioned fully distributed demultiplexing algorithm ALG has

Ravg(ALG) ≥ d(r′ − 1)− ε time-slots, where ε > 0 can be made arbitrarily small.

Proof. Let T be the traffic that caused the maximum relative queuing delay as described in the

proof of Theorem 1.

We continue traffic T with traffic T′, which consists of ⌈|T| · (f · r′ − (s + B) − ε)/ε⌉ cells from orig(c) to

dest(c), one cell at each time-slot. T ′ is a proper continuation of traffic T , because both the PPS

and the shadow switch obey a per-flow FCFS policy and all cells in T ′ share the same input-port

and the same output-port.

Hence, Lemma 4 implies that Ravg(ALG) ≥ f · r′ − (s + B) − ε.

Note that the PPS input constraint implies that each demultiplexor must send incoming cells through at least r′ planes. This implies that even under static partitioning, each plane is used by r′N/K demultiplexors on average. Hence, there is a plane k that is used by at least r′N/K = N/S demultiplexors in order to dispatch cells destined for a certain output-port j. By substituting d = N/S in Theorems 1 and 2 we get:

Theorem 3 A bufferless PPS, with a fully-distributed deterministic demultiplexing algorithm, has maximum relative queuing delay and relative delay jitter of (N/S)(r′ − 1) time-slots under a leaky-bucket traffic without bursts. Its average relative queuing delay is (N/S)(r′ − 1) − ε, for arbitrarily small ε > 0.


Lower bound for randomized fully-distributed algorithms with an adaptive adversary

We concentrate now on an adaptive adversary, denoted adp, which sends cells to the switch based on the algorithm's actions.

For every traffic T we examine the probability Prσ[EPPS(ALG, σ, T) is (f, s)-concentrating], taken over all coin-tosses sequences σ, that the execution of ALG given T and σ is (f, s)-concentrating.

Another key observation is that if there is traffic T such that its execution is (f, s)-concentrating

with small but non-negligible probability, an adaptive adversary can construct another execution

that is almost always (f, s)-concentrating:

Lemma 5 If from every configuration C there is an (R,B) leaky-bucket traffic T such that

Prσ[EPPS(ALG, σ, T) is (f, s)-concentrating] ≥ p > 0, then an adaptive adversary can construct an (R,B) leaky-bucket traffic T′ from C such that Prσ[EPPS(ALG, σ, T′) is (f, s)-concentrating] ≥ 1 − δ, where δ can be made arbitrarily small.

Proof. Fix a configuration C; the adaptive adversary constructs the executions from C iteratively. Denote C0 = C. Let Ci be the configuration just before iteration i ≥ 0, and denote by T^i the traffic such that from configuration Ci, Prσ[EPPS(ALG, σ, T^i) is (f, s)-concentrating] ≥ p. The

adversary stops if the last execution is indeed (f, s)-concentrating. Otherwise, it concatenates an

empty traffic of B time-slots (denoted Te) and continues to the next iteration.

Since in each iteration the adversary stops with probability at least p independently of previous iterations, it stops with an (f, s)-concentrating execution at iteration ℓ ≤ ⌈log_{1−p} δ⌉ with probability 1 − δ. Since there are B empty time-slots between the arrival of the last cell of traffic T^i and the arrival of the first cell in T^{i+1}, T′ = T^0 Te . . . Te T^ℓ has burstiness factor B, and its corresponding execution starting from C is (f, s)-concentrating with probability 1 − δ.
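The number of iterations the adversary needs can be checked directly; the sketch below (function and parameter names are mine) verifies that ⌈log_{1−p} δ⌉ independent trials, each succeeding with probability at least p, leave a failure probability of at most δ:

```python
import math

# Hedged sketch of the iteration bound in Lemma 5: the adversary stops
# at each iteration independently with probability at least p, so
# ceil(log_{1-p} delta) iterations suffice to succeed with prob. 1 - delta.
def iterations_needed(p, delta):
    return math.ceil(math.log(delta) / math.log(1 - p))

ell = iterations_needed(p=0.05, delta=1e-6)
assert (1 - 0.05) ** ell <= 1e-6        # failure probability after ell trials
assert (1 - 0.05) ** (ell - 1) > 1e-6   # and ell is the least such number
```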

If both the shadow switch and the PPS are per-flow FCFS, an adaptive adversary can always

construct an arbitrarily long proper continuation of some traffic T . Therefore, we have:


Lemma 6 If from every configuration C there is an (R,B) leaky-bucket traffic T such that Prσ[EPPS(ALG, σ, T) is (f, s)-concentrating] > 0, then with probability 1 − δ, R^adp_avg(ALG) ≥ f · r′ − (s + B) − ε, where ε > 0 and δ > 0 can be made arbitrarily small.

Proof. By Lemma 5, an adaptive adversary can construct a traffic T ′ from configuration C, such

that Prσ [EPPS(ALG, σ, T ′) is (f, s)-concentrating] ≥ 1− δ.

Let c be the last cell of T ′. Lemma 1 implies that with probability 1 − δ, the relative queuing

delay of c is at least f · r′ − (s + B).

When the adaptive adversary observes such a concentration event, it continues with traffic T′′, which consists of ⌈|T′| · (f · r′ − (s + B) − ε)/ε⌉ cells from orig(c) to dest(c), one cell at each time-slot. T′′

is a proper continuation of traffic T ′, because both the PPS and the shadow switch obey a per-flow

FCFS policy and all cells in T ′′ share the same input-port and the same output-port.

Hence, Lemma 4 implies that R^adp_avg(ALG) ≥ f · r′ − (s + B) − ε, with probability 1 − δ.

We now extend Definition 7 to capture randomized demultiplexing algorithms:

Definition 8 A randomized demultiplexing algorithm is d-partitioned if there is a plane k, an

output-port j, and a set of input-ports I , such that |I| ≥ d and the following property holds:

For every input-port i ∈ I and state si ∈ Si, if at least ni cells destined for output-port j arrive at

input-port i after it is in state si, then with probability pi > 0, i sends at least one cell destined for

output-port j through plane k.

We next prove a lower bound for d-partitioned fully distributed demultiplexing algorithms by showing that it is possible to construct a traffic with no bursts that causes, with non-negligible probability, the algorithm to concentrate d cells in a single plane during a time-interval of d time-slots:

Theorem 4 Any randomized d-partitioned fully distributed demultiplexing algorithm ALG has, with probability 1 − δ, R^adp_max(ALG) ≥ d(r′ − 1) time-slots and R^adp_avg(ALG) ≥ d(r′ − 1) − ε time-slots, where ε > 0 and δ > 0 can be made arbitrarily small.


Proof. Given ALG, the adversary pre-computes the set I = {i1, . . . , id} of d input ports, the output-port j and the plane k, and for each input port i ∈ I the values ni and pi > 0, for which the conditions presented in Definition 8 hold.

We now construct a traffic similar to the one used in the proof of Theorem 1.

Fix a configuration C, and for every i ∈ I, let T′i be a traffic consisting of ni cells destined for output-port j that arrive one after the other at input-port i. By the definition of ni, with probability at least pi, there is at least one cell in T′i that is sent through plane k. Let ci be the first such cell; it follows that Prσ[cell ci is sent through plane k in EPPS(ALG, σ, T′i)] ≥ pi. Let Ti be the prefix of T′i that ends with cell ci; that is, Ti = {c ∈ T′i | ta(c) ≤ ta(ci)}. Since the probability to send ci through plane k in execution EPPS(ALG, σ, T′i) depends only on cells that arrive at the switch before cell ci, it follows that for prefix Ti, Prσ[cell ci is sent through plane k in EPPS(ALG, σ, Ti)] ≥ pi as well.

Traffic T is defined as follows: T = (Ti1 \ {ci1}) . . . (Tid \ {cid}) ci1 . . . cid (recall Figure 6). We next show that with non-negligible probability, taken over all coin-tosses sequences σ, all cells ci1, . . . , cid are sent through plane k in the execution of ALG on traffic T.

In traffic T, for each input port i ∈ I, no cells arrive at input port i between Ti \ {ci} and ci. Thus, for each input port i ∈ I and coin-tosses sequence σ, plane(ci, T) = plane(ci, Ti1 . . . Tid). Since the demultiplexors are independent, the probability, taken over all coin-tosses sequences σ, that the last d cells are sent through plane k in execution EPPS(ALG, σ, T) is at least ∏^d_{a=1} pia > 0.

This implies that execution EPPS(ALG, σ, T) is (d, d)-concentrating with non-negligible probability. Since T has no bursts, the claim follows immediately from Lemma 6.

Lower bound for randomized fully-distributed algorithms with an oblivious adversary

We now consider oblivious adversaries, obl, that choose the entire traffic in advance, knowing only the demultiplexing algorithm ALG. R^obl_max(ALG) and R^obl_avg(ALG) denote the maximum and average relative queuing delay of algorithm ALG against such an adversary. We assume that the PPS and the shadow switch obey a global FCFS policy, i.e., cells that share the same output-port should leave the switch in the order of their arrival (with ties broken arbitrarily). Unlike per-flow FCFS


policy, global FCFS policy requires cells to leave in order even if they do not share the same origin.

We next extend Theorem 4 to hold with an oblivious adversary, under a global FCFS discipline.

Theorem 5 Any randomized d-partitioned fully distributed demultiplexing algorithm ALG has R^obl_max(ALG) ≥ d(r′ − 1) time-slots and R^obl_avg(ALG) ≥ d(r′ − 1) − ε time-slots, with probability 1 − δ, where ε > 0 and δ > 0 can be made arbitrarily small.

Proof. Given ALG, the adversary pre-computes the set I = {i1, . . . , id} of d input ports, the output-port j and the plane k, and for each input port i ∈ I the values ni and pi, for which the conditions of Definition 8 hold. Let p = ∏^d_{a=1} pia/nia > 0.

For any input port i ∈ I, let xi be a value chosen uniformly at random from {1, . . . , ni}. Let Ti be a traffic consisting of xi cells from input port i to output port j, and let ci be the last cell of Ti. Traffic T′ is defined as follows: T′ = (Ti1 \ {ci1}) . . . (Tid \ {cid}) ci1 . . . cid. Note that

traffic T ′ is similar to traffic T in the proof of Theorems 1 and 4 (illustrated in Figure 6).

Using traffic T′, the adversary constructs a traffic T (illustrated in Figure 7) whose average relative queuing delay is at least d(r′ − 1) − ε time-slots with probability 1 − δ (the constants δ, ε > 0 can be made arbitrarily small). The construction has two steps:

Step 1 Concatenate ⌈log_{1−p} δ⌉ instances of traffic T′. For each instance, choose independently and uniformly at random the values for xim, 1 ≤ m ≤ d, from {1, . . . , nim}. Let ℓ be the total size of these instances.

Step 2 Concatenate a traffic of size ℓ⌈(d(r′ − 1) − ε)/ε⌉ cells, such that each cell is sent from an arbitrary input port i to output port j.

We first prove that with non-negligible probability, taken over all coin-tosses sequences σ, the execution of ALG on any instance of traffic T′ = (Ti1 \ {ci1}) . . . (Tid \ {cid}) ci1 . . . cid is (d, d)-concentrating, regardless of the initial configuration.

Claim 1 Prσ[EPPS(ALG, σ, T′) sends the last d cells through plane k] ≥ ∏^d_{a=1} pia/nia = p > 0.



Figure 7: Illustration of traffic T in the proof of Theorem 5.

Proof of claim. For any input port i, denote by T̄i the traffic consisting of ni cells from input port i to output port j. By the definition of ni, with probability pi at least one cell in T̄i is sent through plane k. Since xi is chosen uniformly at random from the values {1, . . . , ni}, this cell is the xi-th cell (that is, the cell ci) with probability at least 1/ni. Note that traffic Ti is a prefix of traffic T̄i; since the demultiplexor is bufferless, the decision through which plane to send the cell ci is based only on cells arriving at the switch prior to ci, which implies that cell ci is sent through k with probability at least (1/ni) · pi.

In traffic T′, for each input port i ∈ I, no cells arrive at input port i between Ti \ {ci} and ci. Thus, for each input port i ∈ I and coin-tosses sequence σ, plane(ci, T′) = plane(ci, Ti1 . . . Tid). Since the demultiplexors are independent, the probability, taken over all coin-tosses sequences σ, that execution EPPS(ALG, σ, T′) sends the last d cells through plane k is at least ∏^d_{a=1} pia/nia = p > 0.

In Step 1, the random choices of the T′ instances are independent. Therefore, (d, d)-concentration occurs at least once in Step 1 with probability at least 1 − δ. Let c′ be the last cell of the first instance in which (d, d)-concentration occurs, and T1 be the traffic {c ∈ T | ta(c) ≤ ta(c′)}. Since EPPS(ALG, σ, T1) is (d, d)-concentrating, Lemma 1 implies that R(ALG, σ, c′, T1) ≥ d(r′ − 1).

Let T2 = T \ T1. We next show that T2 is a proper continuation of T1. Intuitively, this is due to the fact that the switches are work-conserving with a FCFS policy, and during each interval of size τ, exactly τ cells destined for the same output-port j arrive at the switch (i.e., there are no stalls between cells in traffic T = T1 T2).

Formally, consider two cells c1, c2 ∈ T such that ta(c2) = ta(c1)+1. The FCFS policy implies


that tlS(c2, T) > tlS(c1, T) and tlPPS(c2, T) > tlPPS(c1, T). In addition, by the construction of traffic T, there is no cell c3 ∈ T such that ta(c1) < ta(c3) < ta(c2). Therefore, the FCFS policy and the work-conservation of the shadow switch imply that tlS(c2, T) = tlS(c1, T) + 1. Hence, for every two cells c1, c2 ∈ T, if ta(c2) = ta(c1) + 1 then c2 = succ(c1, T); in particular, the first cell of T2 is the successor of cell c′. Moreover, since the switches follow a FCFS policy, cells of traffic T2 do not prohibit cells of traffic T1 from being delivered on time; namely, for any cell c ∈ T1, tlS(c, T1) = tlS(c, T1 T2) and tlPPS(c, T1) = tlPPS(c, T1 T2).

Since R(ALG, σ, c′, T1) ≥ d(r′ − 1) and |T2| ≥ ⌈|T1| · (d(r′ − 1) − ε)/ε⌉, Lemma 4 implies that R^obl_avg(ALG) ≥ d(r′ − 1) − ε and R^obl_max(ALG) ≥ d(r′ − 1), with probability 1 − δ.

4.3.3 Lower Bounds for u-RT Demultiplexing Algorithms

While fully-distributed demultiplexing algorithms do not use any global information, in practice demultiplexors may be able to gather some information about the switch status (e.g., through dedicated control lines). Therefore, it is important to consider a broader class of demultiplexing algorithms in which out-dated global information is also used:

Definition 9 A u real-time distributed (u-RT) demultiplexing algorithm demultiplexes a cell, arriving at time-slot t, according to the input-port's local information in time interval [0, t], and to the switch's global information in time interval [0, t − u].

The state transition function of the ith bufferless demultiplexor operating under a u-RT demultiplexing algorithm is Si(t) : Si × C^{t−u+1} × {1, . . . , N} × COINSPACE → Si, where t is the time-slot in which Si is applied, C is the set of all reachable switch configurations, and C^{t−u+1} is the cross-product of t − u + 1 such sets, one for each time-slot in the interval [0, t − u]. Note that a demultiplexor state transition may depend on other demultiplexors' state transitions, and on incoming flows to other input-ports, as long as these events occurred u time-slots before the state transition. The state of a demultiplexor can change even if no cell arrives at the input-port.

The additional global information allows the demultiplexors to reduce the relative queuing delay. For example, when a 1-RT demultiplexing algorithm receives (R, 0) leaky-bucket traffic, it has full information


about the switch status, and therefore it can emulate a centralized algorithm. Yet, lack of information about recent events yields non-negligible relative queuing delay, caused by leaky-bucket traffic with a non-zero burstiness factor, as we shall prove next. A prominent example of 1-RT demultiplexing algorithms (that is, with u = 1) is the class of demultiplexing algorithms that share only a common clock-tick among input-ports. Therefore, a demultiplexor with a 1-RT algorithm may change its state even if no cell arrives at its input-port.

Lower bound for deterministic u-RT algorithms

Let ū = min{u, r′/2}, that is, the minimum between the lag in gathering global information and half the external rate relative to the rate of the planes. We first show lower bounds on the performance of deterministic u-RT algorithms:

Theorem 6 Any deterministic u-RT demultiplexing algorithm ALG has Rmax(ALG) ≥ ū(N/S)(1 − ū/r′) and J(ALG) ≥ ū(N/S)(1 − ū/r′) time-slots. If the PPS obeys a per-flow FCFS policy then Ravg(ALG) ≥ ū(N/S)(1 − ū/r′) − ε time-slots, where ε > 0 can be made arbitrarily small.

Proof. Consider an arbitrary configuration C. Denote by t0 the time-slot in which the PPS is in

configuration C, by x0 the number of cells that arrived at the PPS until time-slot t0, and by n0 the

number of cells stored in one of the PPS’ buffers at time-slot t0.

Consider now the empty traffic Te, in which no cells arrive at the switch at all. We first argue that if Te is long enough, all the buffers of the switch become empty. Specifically, denote by C1 the switch configuration at time-slot t1 = t0 + n0 + (ūN/S)x0 + 1. If there are still cells stored in one of the buffers at time-slot t1, then these cells have relative queuing delay of at least (ūN/S)x0 + 1 time-slots; therefore the average relative queuing delay is more than ūN/S time-slots, and the theorem follows.

Assume now that all the buffers are empty in configuration C1. Fix an output-port j, and consider the traffic T in which cells destined for j arrive simultaneously at all input-ports at each time-slot in the interval [t1, t1 + ū). Note that T is an (R, ūN − ū) leaky-bucket traffic, since for any τ ≥ 1 and time-interval [t, t + τ), the total number of cells arriving at the switch is bounded by τ + (ūN − ū).


Figure 8: Illustration of traffic Te T̄|I in the proof of Theorems 6 and 8.

Since ū ≤ (1/2)(R/r) < R/r, the input constraint implies that two cells arriving at the same input-port are not sent through the same plane. Hence, there is a plane k used by a set I of at least ūN/K input-ports in the execution EPPS(ALG, T); note that since a PPS speedup is at least 1, ūN/K < (R/r)(N/K) ≤ N.

For every input-port i ∈ I, let ci ∈ T|i be a cell such that plane(ci, T|i) = k. Consider the traffic T̄|i = {c | c ∈ T|i and ta(c) ≤ ta(ci)}; that is, T̄|i consists of the cells in T|i that arrive at the switch no later than cell ci.

Now consider traffic T̄|I = ⋃_{i∈I} T̄|i (see Figure 8). Note that both T̄|I and Te T̄|I are (R, ū²N/K − ū) leaky-bucket traffics.

For every input-port i ∈ I, ta(ci) < t1 + ū ≤ t1 + u, which implies that input-port i does not have global information on the switch status after time-slot t1. Hence, the executions EPPS(ALG, T) and EPPS(ALG, T̄|I) are equivalent. Therefore, all the input-ports i ∈ I send their last cell to plane k in EPPS(ALG, Te T̄|I) starting at configuration C. Hence, by Lemma 2, the maximum relative queuing delay and the relative delay jitter are at least R = ū(N/S)(1 − ū/r′) time-slots.

Assume now that the PPS obeys a per-flow FCFS policy. Let c be the last cell of traffic Te T̄|I that attains the maximum relative queuing delay. Consider traffic T′ that consists of ⌈|T̄|I| · (R − ε)/ε⌉ cells from orig(c) to dest(c), one cell at each time-slot. T′ is a proper continuation of traffic T̄|I; thus, by Lemma 4, Ravg(ALG) ≥ ū(N/S)(1 − ū/r′) − ε as required.
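Numerically, the lower bound of Theorem 6 behaves as follows (the function and parameter names are mine, and ū = min{u, r′/2} is computed inline): it grows with the information lag u until u reaches r′/2, where it saturates:

```python
# Hedged numeric sketch of the Theorem 6 lower bound.
def urt_lower_bound(u, r_prime, N, S):
    u_bar = min(u, r_prime / 2)              # \bar{u} = min{u, r'/2}
    return u_bar * (N / S) * (1 - u_bar / r_prime)

# the bound increases with the lag u ...
assert urt_lower_bound(1, 4, 64, 2) < urt_lower_bound(2, 4, 64, 2)
# ... and saturates once u >= r'/2:
assert urt_lower_bound(2, 4, 64, 2) == urt_lower_bound(8, 4, 64, 2)
```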

By substituting the minimal value ū = 1, we get the following general result:

Corollary 7 Any deterministic u-RT demultiplexing algorithm, u ≥ 1, has relative queuing delay and relative delay jitter of at least (N/S)(1 − 1/r′) time-slots, under leaky-bucket traffic with burstiness factor N/K − 1.

Lower bound for randomized u-RT algorithms with an adaptive adversary

We next give a lower bound on the average relative queuing delay of randomized u-RT demultiplexing algorithms. The proof is based on Theorem 6 and Lemma 6:

Theorem 8 Any randomized u-RT demultiplexing algorithm ALG has, with probability 1 − δ, R^adp_max(ALG) ≥ ū(N/S)(1 − ū/r′) time-slots and R^adp_avg(ALG) ≥ ū(N/S)(1 − ū/r′) − ε time-slots, where ε > 0 and δ > 0 can be made arbitrarily small.

Proof. Consider an arbitrary configuration C and the traffics Te and T , whose constructions are

described in the proof of Theorem 6.

It is important to notice that the input constraint implies that for every coin-tosses sequence σ there is a plane k used by a set I of at least ūN/K input-ports in the execution EPPS(ALG, σ, T).

For every input-port i ∈ I, let ci ∈ T|i be a cell such that plane(ci, T|i) = k. Consider the traffics T̄|i = {c | c ∈ T|i and ta(c) ≤ ta(ci)} and T̄|I = ⋃_{i∈I} T̄|i (recall Figure 8).

For every input-port i ∈ I, ta(ci) < t1 + ū ≤ t1 + u, which implies that input-port i does not have global information on the switch status after time-slot t1. Hence, the executions EPPS(ALG, σ, T) and EPPS(ALG, σ, T̄|I) are equivalent. Therefore, with probability at least ∏_{i∈I} (1/|COINSPACE|)^{|T̄|i|} ≥ (1/|COINSPACE|^ū)^{|I|} ≥ (1/|COINSPACE|^ū)^{ūN/K} > 0, taken over the coin-tosses sequences σ, all the input-ports i ∈ I send their last cell to plane k in EPPS(ALG, σ, Te T̄|I) starting at configuration C. Hence, configuration C satisfies the conditions of Lemma 6 and the claim follows.

The question whether the lower bound for u-RT demultiplexing algorithms (described in Theorem 8) can be extended to hold with an oblivious adversary is left open. The proof technique described in this section will most likely fail to provide such an extension, since the worst-case

traffics that are used in order to prove lower bounds for u-RT algorithms have bursts. Unfortunately, the burstiness accumulates when concatenating bursty traffics, unless there is a gap of a certain number of time-slots in which no cells arrive at the over-loaded output-port. Large bursts

may justify high queuing delay of cells, and hence result in a low relative queuing delay. On the other hand, a gap in which no cells arrive at the over-loaded output reduces the relative queuing delay of cells that arrive immediately after it. This implies that the adversary should identify the concentration and then choose to continue the traffic without a gap (as in Lemma 4).

4.4 Upper Bounds on the Relative Queuing Delay

This section presents a methodology for bounding Rmax(ALG, σ, T ) for an arbitrary traffic T and

coin-tosses sequence σ. We fix some traffic T and omit the notations ALG, σ and T . For simplicity

assume T begins after time-slot 0, and that at time-slot 0 (i.e., at “the beginning of time”), no cells

arrive at the switch, and therefore all the queues are empty. Our analysis depends on the realistic

assumption that the PPS obeys the global FCFS policy.

Cells are queued in a bufferless PPS either within the planes or within the multiplexors residing

at the output-ports. A simple situation in which queuing in a multiplexor happens is when the

output-port is flooded, but in this case, cells also suffer from high queuing delay in the shadow

switch, and the relative queuing delay is small. A more complicated situation is when a cell arrives

at the multiplexor out of order, and should wait for previous cells to arrive from their planes. In

this case, the relative queuing delay is a by-product of queuing within the other planes (of some

preceding cell)—the relative queuing delay of the waiting cell is at most the relative queuing delay

of some preceding cell that was queued only in the planes. This observation is captured in the next

lemma:

Lemma 7 There is a cell c such that tlPPS(c) = tp(c) and R(c) = Rmax.

Proof. Let c be the first cell to leave the PPS such that R(c) = Rmax. Assume that tlPPS(c) > tp(c); since the multiplexor buffer is work-conserving, at time-slot tlPPS(c) − 1 another cell c′ leaves the PPS from output-port dest(c). Hence tlPPS(c′) = tlPPS(c) − 1, and therefore R(c′) = tlPPS(c′) − tlS(c′) = tlPPS(c) − 1 − tlS(c′). Since c′ leaves the PPS before c and the shadow switch is FCFS, tlS(c′) ≤ tlS(c) − 1. Hence the relative queuing delay of c′ is R(c′) ≥ tlPPS(c) − tlS(c) = R(c) = Rmax, contradicting the minimality of c.


Consider a single cell c, and focus on the queuing within plane(c), caused by the lower rate

on the link from plane(c) to dest(c). Since both the PPS and the shadow switch are FCFS, cells

arriving at the switch after cell c cannot prohibit c from being transmitted on time. We present an

upper bound that depends only on the disproportion of the number of cells sent through plane(c)

to dest(c). Relating this quantity and the queue lengths at time-slot ta(c) is not immediate, since

it is possible that the shadow switch is busy when the plane is idle, and vice versa.

Let Aj(t1, t2) be the number of cells destined for output-port j that arrive at the switch during time interval [t1, t2], and A^k_j(t1, t2) be the number of these cells that are sent through plane k. The following definition captures the imbalance between planes:

Definition 10 For a plane k, output-port j and time-slots 0 ≤ t1 ≤ t2:

1. The imbalance of time interval [t1, t2] is ∆^k_j(t1, t2) = A^k_j(t1, t2) − (1/r′)Aj(t1, t2).

2. The imbalance by time-slot t2 is ∆^k_j(t2) = max_{t1≤t2} ∆^k_j(t1, t2).

3. The maximum imbalance is ∆^k_j = max_{t2} ∆^k_j(t2).

Clearly, ∆^k_j ≥ ∆^k_j(t2) ≥ ∆^k_j(t1, t2) for every output-port j, plane k and time-slots t1 ≤ t2. In addition, the imbalance is superadditive:

Property 1 For every output-port j, plane k and time-slots t1 ≤ t2,

∆^k_j(t2) ≥ ∆^k_j(t1 − 1) + ∆^k_j(t1, t2)

Proof. By Definition 10, there is a time-slot t′1 ≤ t1 − 1 such that ∆^k_j(t1 − 1) = ∆^k_j(t′1, t1 − 1) = A^k_j(t′1, t1 − 1) − (1/r′)Aj(t′1, t1 − 1). Since ∆^k_j(t1, t2) = A^k_j(t1, t2) − (1/r′)Aj(t1, t2), we have:

∆^k_j(t1, t2) + ∆^k_j(t1 − 1) = A^k_j(t′1, t1 − 1) + A^k_j(t1, t2) − (1/r′)(Aj(t′1, t1 − 1) + Aj(t1, t2)) = ∆^k_j(t′1, t2) ≤ ∆^k_j(t2)
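Definition 10 and Property 1 lend themselves to a mechanical check. In the sketch below (the data layout and names are my own), arrivals[t] lists the (plane, output) pairs of the cells arriving at time-slot t, and superadditivity is asserted on random traffic:

```python
import random

# Hedged sketch of Definition 10.
def delta(arrivals, k, j, t1, t2, r_prime):
    """Imbalance of time interval [t1, t2]: A^k_j(t1,t2) - (1/r')A_j(t1,t2)."""
    cells = [c for t in range(t1, t2 + 1) for c in arrivals[t]]
    A_j = sum(1 for plane, out in cells if out == j)
    A_kj = sum(1 for plane, out in cells if out == j and plane == k)
    return A_kj - A_j / r_prime

def delta_by(arrivals, k, j, t2, r_prime):
    """Imbalance by time-slot t2: maximum over all t1 <= t2."""
    return max(delta(arrivals, k, j, t1, t2, r_prime) for t1 in range(t2 + 1))

# Property 1 (superadditivity) on random traffic: K = 3 planes, one output.
random.seed(7)
arrivals = [[(random.randrange(3), 0)] for _ in range(40)]
for t1 in range(1, 40):
    assert delta_by(arrivals, 0, 0, 39, 2.0) + 1e-9 >= \
           delta_by(arrivals, 0, 0, t1 - 1, 2.0) + delta(arrivals, 0, 0, t1, 39, 2.0)
```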

Let Qj(t) be the length of the jth queue in the shadow switch after time-slot t; similarly, Q^k_j(t) is the length of the jth queue of plane k of the PPS after time-slot t. Let L^k_j(t1, t2) be the number of cells destined for output-port j that leave plane k during time interval [t1, t2]. Note that Q^k_j(t) = A^k_j(0, t) − L^k_j(0, t).

Time-slot t1 is the beginning of a (k, j) busy period for time-slot t2 ≥ t1, if it is the last time-slot before t2 such that Q^k_j(t1 − 1) = 0. Note that this expression is well defined because at time-slot 0 all the queues are empty. Since Q^k_j(t1) > Q^k_j(t1 − 1), a cell c arrives at the switch at time-slot t1, and therefore exactly one cell destined for j leaves plane k in time-interval (t1 + 1 − r′, t1 + 1]. This is either cell c itself, or another cell that prohibits c from using the link, and therefore is sent at most r′ time-slots before time-slot t1 + 1. Since the queue is never empty until time-slot t2, one cell is sent to j exactly every r′ time-slots after the first cell. This implies that the number of cells sent from k satisfies L^k_j(t1, t2) ≥ ⌊((t2 − t1) + 1)/r′⌋.
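The departure count during a busy period is a simple floor expression, and the schedule can be replayed directly. The sketch below (names are mine) assumes the worst case in which the first departure completes r′ − 1 slots after the busy period starts:

```python
# Hedged sketch of the busy-period bound: a plane whose (k, j) queue is
# never empty during [t1, t2] sends one cell to output j every r' slots,
# so at least floor(((t2 - t1) + 1) / r') cells leave in the interval.
def min_departures(t1, t2, r_prime):
    return ((t2 - t1) + 1) // r_prime

# replay: first departure at t1 + r' - 1, then one every r' slots.
def replay(t1, t2, r_prime):
    return sum(1 for _ in range(t1 + r_prime - 1, t2 + 1, r_prime))

for span in range(1, 50):
    assert replay(0, span - 1, 4) == min_departures(0, span - 1, 4)
```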

Remark 2 Khotimsky and Krishnan [86] defined busy periods only with respect to an output-port

j. This points to a flaw in their proof, which ignores situations when the optimal shadow switch is

busy sending cells to output-port j, while a specific plane in the PPS is idle part of the time [85].

These situations are the main source of complication in our proof.

The following lemma bounds how badly a plane can perform relative to the shadow switch, by

comparing their busy periods:

Lemma 8 If Qj(t − 1) = 0 then for every plane k and for every δ ∈ {0, . . . , ∆^k_j(t − 1)r′},

L^k_j(0, (t − 1) + δ) ≥ A^k_j(0, t − 1) − ⌈(∆^k_j(t − 1)r′ − δ)/r′⌉

Proof. If there is a time-slot t1 ∈ [t − 1, t − 1 + δ] such that Q^k_j(t1) = 0, then by time-slot t1, no cells destined for j are waiting in plane k. That is, L^k_j(0, (t − 1) + δ) ≥ L^k_j(0, t1) = A^k_j(0, t1) ≥ A^k_j(0, t − 1), and the lemma follows.

Otherwise, let t2 be the beginning of a (k, j) busy period for time-slot (t − 1) + δ. During time interval [t2, (t − 1) + δ] plane k sends a cell to output j every r′ time-slots, therefore:

L^k_j(t2, (t − 1) + δ) ≥ ⌊(t + δ − t2)/r′⌋ (4.1)


Figure 9: The number of cells arriving until time-slot t − 1, and still queued in plane k by time-slot τ.

On the other hand, Qj(t−1) = 0 implies that for every time-slot t3 ≤ t−1, Aj(t3, t−1) ≤ t−t3

(otherwise the jth buffer of the shadow switch is not empty after time-slot t− 1). In particular:

Aj(t2, t− 1) ≤ t− t2 (4.2)

Using these inequalities we bound L^k_j(0, (t − 1) + δ):

L^k_j(0, (t − 1) + δ) = L^k_j(0, t2 − 1) + L^k_j(t2, (t − 1) + δ)

  ≥ L^k_j(0, t2 − 1) + ⌊(t + δ − t2)/r′⌋   by (4.1)

  ≥ A^k_j(0, t2 − 1) + ⌊(t + δ − t2)/r′⌋   since Q^k_j(t2 − 1) = 0

  ≥ A^k_j(0, t2 − 1) + ⌊δ/r′ + Aj(t2, t − 1)/r′⌋   by (4.2)

  = A^k_j(0, t2 − 1) + A^k_j(t2, t − 1) + ⌊δ/r′ − ∆^k_j(t2, t − 1)⌋   by Definition 10

  ≥ A^k_j(0, t − 1) − ⌈(∆^k_j(t − 1)r′ − δ)/r′⌉

By substituting δ = 0 in Lemma 8, we get the following corollary, demonstrating the relation between the imbalance and the queue size at the beginning of a busy period:

Corollary 9 If Qj(t − 1) = 0 then for every plane k, Q^k_j(t − 1) ≤ max{0, ⌈∆^k_j(t − 1)⌉}.

We complete the proof by bounding the lag between the time a cell leaves the plane it is sent

through and the time it should leave the shadow switch:

Theorem 10 The maximum relative queuing delay of cells destined for output-port j and sent through plane k is bounded by max{0, r′(∆^k_j + 1) + Bj}, where Bj is the maximum number of cells destined for output-port j that arrive at the switch in the same time-slot.

Proof. By Lemma 7, it suffices to bound tp(c) − tlS(c) for every cell c. Since tp(c) − tlS(c) = (tp(c) − ta(c)) − (tlS(c) − ta(c)), it suffices to bound only the difference between the time a cell spends in the plane, tp(c) − ta(c), and the time it spends in the shadow switch, tlS(c) − ta(c). Since both switches operate under the FCFS policy, these values depend solely on the lengths of the corresponding queues when cell c arrives.

Let t1 be the earliest time-slot, such that the buffer of output-port j in the shadow switch is

never empty during time-interval [t1, ta(c)]; if no such time-slot exists let t1 = ta(c).

First, we bound tlS(c) − ta(c) from below. The buffer in the shadow switch is empty at time-slot t1 − 1, and then the switch is continuously busy during time-interval [t1, ta(c) − 1], transmitting exactly one cell at each time-slot to output-port j. This implies that Qj(ta(c) − 1) = Aj(t1, ta(c) − 1) − (ta(c) − t1). All the cells in the queue should leave the switch after time-slot ta(c) and before tlS(c), therefore:

    tlS(c) − ta(c) > Aj(t1, ta(c) − 1) − (ta(c) − t1)

Since Aj(ta(c), ta(c)) ≤ Bj, and tlS(c) − ta(c) is an integer, it follows that:

    tlS(c) − ta(c) ≥ Aj(t1, ta(c)) − Bj + t1 − ta(c) + 1                         (4.3)

Recall that by Corollary 9, Q^k_j(t1 − 1) ≤ max{0, ⌈∆^k_j(t1 − 1)⌉}. There are two cases to consider, depending on whether all the cells that were queued in plane k at time-slot t1 left the plane before the arrival of cell c (see Figure 10):

Case 1: ta(c) ≤ t1 + ∆^k_j(t1 − 1)r′. Since plane k is FCFS and work-conserving, it transfers every cell in its queue in exactly r′ time-slots, except cell c, which is considered transferred in the

Figure 10: Illustration of the different cases in the proof of Theorem 10: in Case 1, cell c arrives before time-slot t1 + ∆^k_j(t1 − 1)r′; in Case 2, it arrives after that time-slot.

first time-slot of its transmission:

    tp(c) − ta(c) ≤ r′Q^k_j(ta(c)) + 1
                  = r′(A^k_j(0, ta(c)) − L^k_j(0, ta(c))) + 1            by the definition of Q^k_j(ta(c))
                  ≤ r′(A^k_j(0, ta(c)) − A^k_j(0, t1 − 1) + ⌈(r′∆^k_j(t1 − 1) + t1 − ta(c))/r′⌉) + 1
                        by Lemma 8, since ta(c) ∈ [t1, t1 + ∆^k_j(t1 − 1)r′] and L^k_j(0, ta(c)) ≥ L^k_j(0, ta(c) − 1)
                  ≤ r′(A^k_j(0, ta(c)) − A^k_j(0, t1 − 1) + (r′∆^k_j(t1 − 1) + t1 − ta(c))/r′ + 1) + 1
                  ≤ r′A^k_j(t1, ta(c)) + r′∆^k_j(t1 − 1) + t1 − ta(c) + r′ + 1
                  = Aj(t1, ta(c)) + r′∆^k_j(t1, ta(c)) + r′∆^k_j(t1 − 1) − ta(c) + t1 + r′ + 1    by Definition 10
                  ≤ Aj(t1, ta(c)) + r′(∆^k_j(ta(c)) + 1) − ta(c) + t1 + 1        by Property 1    (4.4)

By (4.4) and (4.3), tp(c) − tlS(c) ≤ r′(∆^k_j(ta(c)) + 1) + Bj.

Case 2: ta(c) > t1 + ∆^k_j(t1 − 1)r′. If Q^k_j(ta(c)) = 0 then cell c is immediately delivered to the output-port, i.e., tp(c) = ta(c) + 1 ≤ tlS(c), and the claim holds since tp(c) − tlS(c) ≤ 0.

If Q^k_j(ta(c)) > 0, let t2 be the beginning of a (k, j) busy period for ta(c). Note that by the

choice of t2, L^k_j(t2, ta(c)) ≥ ⌊(ta(c) − t2 + 1)/r′⌋. Hence, we have:

    tp(c) − ta(c) ≤ r′Q^k_j(ta(c)) + 1
                  = r′(A^k_j(t2, ta(c)) − L^k_j(t2, ta(c))) + 1           since Q^k_j(t2 − 1) = 0
                  ≤ r′(A^k_j(t2, ta(c)) − ⌈(ta(c) − (t2 − 1))/r′⌉) + 1    since plane k is continuously busy
                  ≤ Aj(t2, ta(c)) + r′∆^k_j(t2, ta(c)) − r′⌈(ta(c) − (t2 − 1))/r′⌉ + 1
                  ≤ Aj(t1, ta(c)) + r′(∆^k_j(t2, ta(c)) + 1) + t1 − ta(c) + (t2 − t1) − Aj(t1, t2 − 1) + 1    (4.5)

By the choice of t1, the output-buffer of the shadow switch is empty at time-slot t1 − 1, and not empty during time-interval [t1, t2 − 1]. This implies that (t2 − t1) ≤ Aj(t1, t2 − 1), and therefore (4.5) implies:

    tp(c) − ta(c) ≤ Aj(t1, ta(c)) + r′(∆^k_j(ta(c)) + 1) + t1 − ta(c) + 1        (4.6)

By (4.6) and (4.3), tp(c) − tlS(c) ≤ r′(∆^k_j(ta(c)) + 1) + Bj.

4.5 Demultiplexing Algorithms with Optimal RQD

This section presents several demultiplexing algorithms and uses the methodology described in

Section 4.4 in order to bound their relative queuing delay.

First, we revisit the fractional traffic dispatch algorithm (FTD) [70] and prove that its relative queuing delay is at most (N + 1)r′ time-slots. Then, for a PPS with speedup S > 2, we introduce a variant of the FTD algorithm that is 2N/S-partitioned; its relative queuing delay is at most (2N/S + 1)r′ + N(1 − 2/S) time-slots, matching the lower bound for fully distributed demultiplexing algorithms (Theorem 4).

Finally, we present novel 1-RT and u-RT demultiplexing algorithms with relative queuing delay of at most 3N + r′ time-slots (Sections 4.5.2 and 4.5.3). Both algorithms have optimal relative queuing delay for a PPS with constant speedup.

4.5.1 Optimal Fully-Distributed Demultiplexing Algorithms

Iyer and McKeown [70] presented the best-known example of a fully-distributed demultiplexing algorithm. In this algorithm, there is a window of size r′ time-slots that slides over the sequence of cells in each flow (i, j). The algorithm maintains a window constraint that ensures that two cells in the same window are not sent through the same plane. An equivalent variation of the algorithm, called the fractional traffic dispatch algorithm (FTD), statically divides each flow into blocks of size r′ [70, 86].

The demultiplexing algorithm chooses the plane through which a cell is sent arbitrarily from the set of planes that do not violate the window constraint and the input-constraint described in Section 4.2. A speedup of S ≥ 2 suffices for the algorithm to work correctly [70].
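As an illustration of how the two constraints interact, consider the following Python sketch (hypothetical code, not from the thesis; the class and parameter names are invented). It keeps the last r′ plane choices per flow (the window constraint) and per input-port (the input constraint), and picks any plane violating neither:

```python
# Hypothetical sketch of an FTD-style demultiplexor (not the thesis code).
# Assumes K = 2*r_prime planes are available, i.e., speedup S >= 2.
from collections import deque

class FTDDemultiplexor:
    def __init__(self, num_outputs, r_prime):
        self.planes = set(range(2 * r_prime))
        # Last r' plane choices per flow (i, j): the sliding-window constraint.
        self.window = [deque(maxlen=r_prime) for _ in range(num_outputs)]
        # Last r' plane choices of this input-port: the input constraint.
        self.recent = deque(maxlen=r_prime)

    def dispatch(self, j):
        """Return a plane for a cell destined to output-port j."""
        forbidden = set(self.window[j]) | set(self.recent)
        k = min(self.planes - forbidden)  # FTD may choose arbitrarily
        self.window[j].append(k)
        self.recent.append(k)
        return k

d = FTDDemultiplexor(num_outputs=4, r_prime=3)
picks = [d.dispatch(0) for _ in range(6)]
# No plane repeats within any window of r' = 3 consecutive cells of the flow.
assert all(len(set(picks[i:i + 3])) == 3 for i in range(4))
```

The sketch only illustrates the bookkeeping; the proof that speedup S ≥ 2 always leaves a non-violating plane to choose is the correctness argument of [70].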

A simple application of Theorem 10 and the fact that Bj ≤ N shows:

Theorem 11 Ravg(FTD) ≤ Rmax(FTD) ≤ (N + 1)r′.

Proof. Let Ai→j(t1, t2) be the number of cells in flow (i, j) that arrive at the switch during time-interval [t1, t2], and A^k_{i→j}(t1, t2) be the number of these cells that are sent through plane k.

    ∆^k_j(t1, t2) = A^k_j(t1, t2) − Aj(t1, t2)/r′                                   by Definition 10
                  = Σ_{i=1}^{N} A^k_{i→j}(t1, t2) − Aj(t1, t2)/r′
                  ≤ Σ_{i=1}^{N} ⌈Ai→j(t1, t2)/r′⌉ − Aj(t1, t2)/r′                   due to the window constraint
                  ≤ Σ_{i=1}^{N} (Ai→j(t1, t2)/r′ + (r′ − 1)/r′) − Aj(t1, t2)/r′     since Ai→j, r′ are integers
                  = N(r′ − 1)/r′

By Theorem 10, Rmax(FTD) ≤ (N + 1)r′, since Bj ≤ N.
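For completeness, the arithmetic behind this final step can be written out; substituting the imbalance bound just derived and Bj ≤ N into the bound of Theorem 10 gives:

```latex
\[
R_{\max}(\mathrm{FTD})
  \le r'\left(\frac{N(r'-1)}{r'} + 1\right) + N
  = N(r'-1) + r' + N
  = (N+1)r'.
\]
```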

For a PPS with speedup S > 2, a 2N/S-partitioned variant of FTD can yield better relative queuing

Algorithm 1 Partitioned Fractional Traffic Dispatch (PART-FTD) Algorithm

Local variables at demultiplexor i:
    M[N][r′]: matrix of values in {1, . . . , 2r′, ⊥}, initially all ⊥
    R[r′]: vector of values in {1, . . . , 2r′, ⊥}, initially all ⊥
    S[N + 1]: vector of values in {0, . . . , r′ − 1}, initially all 0

1:  int procedure DISPATCH(cell c) at demultiplexor i
2:      j ← dest(c)
3:      D ← {k ∈ {1, . . . , 2r′} | ∃a ∈ {0, . . . , r′ − 1}, M[j − 1][a] = k}    ⊲ Planes that violate the window-constraint
4:      E ← {k ∈ {1, . . . , 2r′} | ∃a ∈ {0, . . . , r′ − 1}, R[a] = k}           ⊲ Planes that violate the input-constraint
5:      choose k ∈ {1, . . . , 2r′} \ (D ∪ E)
6:      M[j − 1][S[j − 1]] ← k              ⊲ Update for future window-constraint calculations
7:      R[S[N]] ← k                         ⊲ Update for future input-constraint calculations
8:      S[j − 1] ← (S[j − 1] + 1) mod r′    ⊲ Update pointer for cyclic use of the vector
9:      S[N] ← (S[N] + 1) mod r′            ⊲ Update pointer for cyclic use of the vector
10:     return k + 2r′⌊i/(2N/S)⌋ − 1
11: end procedure

delay, matching the lower bounds described in Theorems 4 and 5. In this algorithm, denoted PART-FTD (see pseudo-code in Algorithm 1), demultiplexor i uses only planes 2r′⌊i/(2N/S)⌋, . . . , 2r′(⌊i/(2N/S)⌋ + 1) − 1. This implies that each demultiplexor uses exactly 2r′ planes, as required for the correctness of FTD, but each plane is used by at most 2N/S demultiplexors.
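The static partition of planes among demultiplexor groups can be sketched as follows (illustrative Python with invented names and 0-indexed planes; the thesis numbers planes from 1, and the parameter values are hypothetical):

```python
# Illustrative sketch: the static plane partition used by PART-FTD.
# Demultiplexor i may use only the 2r' planes of its group of 2N/S inputs.
def part_ftd_planes(i, r_prime, N, S):
    group = i // (2 * N // S)          # demultiplexors are grouped 2N/S at a time
    base = 2 * r_prime * group
    return list(range(base, base + 2 * r_prime))

# Hypothetical example: N = 8 inputs, speedup S = 4, r' = 2, so groups of
# 2N/S = 4 demultiplexors share each block of 2r' = 4 planes.
assert part_ftd_planes(0, 2, 8, 4) == [0, 1, 2, 3]
assert part_ftd_planes(3, 2, 8, 4) == [0, 1, 2, 3]
assert part_ftd_planes(4, 2, 8, 4) == [4, 5, 6, 7]
```

Each group then runs plain FTD over its private 2r′ planes, which is why the per-plane load drops from N to 2N/S demultiplexors.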

Theorem 12 Ravg(PART-FTD) ≤ Rmax(PART-FTD) ≤ (2N/S + 1)r′ + N(1 − 2/S).

Proof. We use the same calculations as in the proof of Theorem 11. The only difference is that

    ∆^k_j(t1, t2) = A^k_j(t1, t2) − Aj(t1, t2)/r′ ≤ Σ_{i=1}^{N} ⌈Ai→j(t1, t2)/r′⌉ − Aj(t1, t2)/r′ ≤ (2N/S) · (r′ − 1)/r′

since at most 2N/S demultiplexors can send cells through plane k. Therefore, by Theorem 10, Rmax(PART-FTD) ≤ (2N/S + 1)r′ + N(1 − 2/S).
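Again, the closing arithmetic is routine; substituting this imbalance bound and Bj ≤ N into Theorem 10 gives:

```latex
\[
R_{\max}(\text{PART-FTD})
  \le r'\left(\frac{2N}{S}\cdot\frac{r'-1}{r'} + 1\right) + N
  = \frac{2N}{S}(r'-1) + r' + N
  = \left(\frac{2N}{S}+1\right)r' + N\left(1-\frac{2}{S}\right).
\]
```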

4.5.2 Optimal 1-RT Demultiplexing Algorithm

We describe a 1-RT demultiplexing algorithm that matches the lower bound presented in Theo-

rem 8. Informally, Algorithm 2 divides the set of planes into two equal-size sets, V0 and V1, and its

operations with respect to cells destined for a specific output-port into two phases. In each phase,

the algorithm sends cells destined for a specific output-port through a different set of planes (i.e.,

V0 or V1). After every time-slot, each input-port collects global information about the switch, and

uses it to calculate the imbalance for each plane k and each output-port j. In the next time-slot

each input-port sends a cell to output-port j only through planes with low (or zero) imbalance.

Intuitively, a phase i ends when there are no balanced planes in Vi to use. Then, in the next phase,

the demultiplexors use the planes of the set V1−i.

To avoid situations in which all the input-ports send cells through the same plane, we divide the input-ports into N/r′ sets of size r′, and ensure that, under no circumstances, two input-ports in the same set send a cell destined for the same output-port through the same plane. This is done by calculating the actions of the other input-ports in the same set as if they indeed received a cell destined for the same output-port.

With respect to each output-port j, planes are divided into three levels according to their imbalance (see Definition 10): balanced planes with imbalance ∆^k_j(t) ≤ 0, slightly imbalanced planes whose imbalance satisfies 0 < ∆^k_j(t) < N/r′, and extremely imbalanced planes with imbalance ∆^k_j(t) ≥ N/r′. At the beginning of each time-slot, a set of eligible planes, denoted F[j], is calculated for every destination j: a plane is eligible for output-port j if it is balanced with respect to output-port j, or if it has never been extremely imbalanced with respect to output-port j since the last phase change. Phase i is changed to phase 1 − i when all planes k ∈ V1−i become balanced (the set Q[j] maintains the planes of V1−i that are still imbalanced; the phase changes when Q[j] = ∅).
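The three-level classification can be made concrete with a small Python sketch (hypothetical code, not from the thesis; it computes the interval-maximum imbalance of Definition 10 by brute force over all interval start points):

```python
# Illustrative sketch of the imbalance levels used by the 1-RT algorithm.
# Following Definition 10, the imbalance over [t1, t2] is
# A_k_j(t1, t2) - A_j(t1, t2)/r', and the imbalance at t is the maximum
# over all interval start points t1 <= t.
def imbalance(arrivals_k, arrivals_all, r_prime, t):
    """arrivals_k[s]: cells for j sent to plane k at slot s;
    arrivals_all[s]: all cells for j arriving at slot s."""
    best = float("-inf")
    for t1 in range(t + 1):
        a_k = sum(arrivals_k[t1:t + 1])
        a_j = sum(arrivals_all[t1:t + 1])
        best = max(best, a_k - a_j / r_prime)
    return best

def classify(delta, N, r_prime):
    if delta <= 0:
        return "balanced"
    if delta < N / r_prime:
        return "slightly imbalanced"
    return "extremely imbalanced"

# Hypothetical trace: plane k takes 3 of the 4 cells arriving in slots 0..1.
d = imbalance([2, 1, 0], [2, 2, 0], r_prime=2, t=2)
assert d == 1.0                      # 3 - 4/2 over the interval [0, 2]
assert classify(d, N=8, r_prime=2) == "slightly imbalanced"
```

The brute-force maximum is only for clarity; the algorithm itself obtains these values from the global information shared after every time-slot.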

Example 1 Suppose that at time-slot t = 0, phase[0] changed from 1 to 0, ∆^9_0(0) = 6.5, and all other planes in V1 have imbalance at most 6.5. In addition, we assume that planes 1 and 2 did not receive any cells before time-slot 0.

The demultiplexors are divided into 4 sets: {0, 1}, {2, 3}, {4, 5}, {6, 7}. Upon receiving a cell, each demultiplexor calculates the behavior of all demultiplexors in its set that have a smaller index

and ensures that it will not send the cell through the same plane as them. Table 4.2 shows the

plane number through which each demultiplexor would have sent a cell destined for output-port 0,

if such a cell arrives at the switch. The actual arrivals are marked in framed boxes, and are taken

into account in the following time-slots.

At time-slot 1, demultiplexor 0 will send a cell through the first plane in V0 (that is, plane 1).

On the other hand, demultiplexor 1 must avoid sending its cell through plane 1 and therefore it will

use plane 2. Similarly, demultiplexors 2, 4 and 6 will use plane 1 and demultiplexors 3, 5 and 7

will use plane 2.

Algorithm 2 1-RT Algorithm

Constants:
    V0 = {1, . . . , K/2};  V1 = {K/2 + 1, . . . , K}
Shared:
    F[N]: N sets of planes, initially all V0          ⊲ cells for j can be sent only through F[j]
    R[N][r′]: matrix of values in {1, . . . , K, ⊥}, initially all ⊥    ⊲ holding input-constraints
    t: value in {0, . . . , r′ − 1}, initially 0       ⊲ cyclic pointer to matrix R
    Q[N], L[N]: N sets of planes, initially all ∅
    M[N]: N sets of planes, initially all {1, . . . , K}
    phase[N]: vector of values in {0, 1}, initially all 0

1: void procedure ADVANCE-CLOCK( )                     ⊲ invoked at the beginning of each time-slot
2:     For every j ∈ {1, . . . , N}: CALCULATE(j)
3:     For every j ∈ {1, . . . , N}: F[j] ← UPDATE(j)
4:     Update the matrix R[N][r′] according to global information
5:     t ← (t + 1) mod r′
6: end procedure

1: int procedure DISPATCH(cell c) at demultiplexor i
2:     j ← dest(c)
3:     p ← ⌊i/r′⌋
4:     set B ← ∅
5:     for x ← r′p to i do
6:         E ← {k ∈ {1, . . . , K} | ∃a ∈ {0, . . . , r′ − 1}, R[x][a] = k}
7:         k ← min(F[j] \ (B ∪ E))
8:         B ← B ∪ {k}
9:     end for
10:    R[i][t] ← k                                     ⊲ can be read by other input-ports only in the next time-slot
11:    return k
12: end procedure

1: set procedure UPDATE(int j)
2:     set S ← F[j]
3:     Q[j] ← Q[j] \ M[j]
4:     if Q[j] = ∅ then                                ⊲ change phase
5:         Q[j] ← {1, . . . , K} \ M[j]
6:         phase[j] ← (1 − phase[j])
7:         S ← V_phase[j]
8:     else
9:         S ← S \ L[j]
10:    end if
11:    return S
12: end procedure

1: void procedure CALCULATE(int j)
2:     set A ← {k | ∆^k_j(t) > N/r′}                   ⊲ using global information
3:     M[j] ← {k | ∆^k_j(t) ≤ 0}                       ⊲ using global information
4:     L[j] ← (L[j] ∪ A) \ M[j]
5: end procedure

    Time slot          0     1     2     3     4     5     6     7     8
    Demultiplexor 0          1     2     1     2     3     9    10     1
    Demultiplexor 1          2     1     2     3     2    10     9     2
    Demultiplexor 2          1     2     1     2     2     9    10     1
    Demultiplexor 3          2     1     2     3     3    10     9     2
    Demultiplexor 4          1     2     1     2     2     9     9     1
    Demultiplexor 5          2     1     2     3     3    10    10     2
    Demultiplexor 6          1     1     2     2     2     9     9     1
    Demultiplexor 7          2     2     1     3     3    10    10     2
    ∆^1_0(t)                1.5   3.5    4     3    2.5   0.5    0     0
    ∆^9_0(t)          6.5    5     3    1.5   0.5    0     0    0.5   0.5

Table 4.2: Illustration of Example 1. The plane number through which each demultiplexor would have sent a cell destined for output-port 0, if such a cell had arrived at the switch. Actual arrivals are marked in framed boxes. No cells destined for other output-ports arrive in this time interval.

At time-slot 2, demultiplexor 0 cannot use plane 1 due to the input-constraint. Therefore, it

will use plane 2 and demultiplexor 1 will use plane 1. Plane 1 becomes extremely imbalanced

after time-slot 2 and therefore it is not eligible to receive cells for output-port 0 in the following

time-slots. Although plane 1 becomes slightly imbalanced after time-slot 3, Algorithm 2 dictates

it is still not eligible for output-port 0, since the phase has not changed yet.

The phase changes after time-slot 5, because for every plane k ∈ V1, ∆^k_0(5) ≤ 0. This implies that planes from the set V1 are used for sending cells destined for output-port 0 in the following time-slots. At this time, Q[0] = {1, 2}, since ∆^1_0(5) = 2.5 and ∆^2_0(5) = 0.5. The phase changes again after time-slot 7, since both ∆^1_0(7) and ∆^2_0(7) are not positive.

To prove the correctness of Algorithm 2, we start with two lemmas.

The first lemma shows that the imbalance between each plane and each output-port is bounded.

Lemma 9 In Algorithm 2, for every plane k ∈ V0 ∪ V1 and output-port j, ∆^k_j < 2N/r′.

Proof. Clearly, if ∆^k_j(t3) > N/r′ then k ∈ L[j] at the beginning of time-slot t3 + 1 (procedure CALCULATE, Line 4). Therefore, k ∉ F[j] at the beginning of time-slot t3 + 1 (procedure UPDATE, Line 9), and cells are not sent through plane k until a time-slot t3′ > t3 + 1 in which ∆^k_j(t3′ − 1) ≤ 0.

This observation also holds if the phase changes at the beginning of time-slot t3 + 1, since Q[j] = ∅ at Line 4 yields that V_phase[j] ⊆ M[j] at Line 7.

For every two input-ports i1 and i2, if ⌊i1/r′⌋ = ⌊i2/r′⌋, then i1 and i2 do not send cells destined for the same output-port through the same plane in the same time-slot (procedure DISPATCH). This implies that the maximum number of cells destined for the same output-port and sent through the same plane in a single time-slot is N/r′.

By Definition 10, if a plane does not receive cells destined for output-port j in time-slot t1, then ∆^k_j(t1) ≤ ∆^k_j(t1 − 1). This implies that there is a time-slot t1 in which plane k receives cells destined for j, and ∆^k_j(t1) = ∆^k_j. In the worst case, ∆^k_j(t1 − 1) = N/r′ and k receives N/r′ cells destined for j.

Assume towards a contradiction that ∆^k_j(t1) ≥ 2N/r′. Then there is a time-slot t2 such that ∆^k_j(t2, t1) ≥ 2N/r′. Note that ∆^k_j(t1, t1) < N/r′, since A^k_j(t1, t1) ≤ N/r′ and Aj(t1, t1) ≥ A^k_j(t1, t1). This implies that t2 < t1, and therefore, by Definition 10:

    ∆^k_j(t2, t1) = A^k_j(t2, t1) − (1/r′)Aj(t2, t1)
                  = A^k_j(t2, t1 − 1) + A^k_j(t1, t1) − (1/r′)Aj(t2, t1 − 1) − (1/r′)Aj(t1, t1)
                  = ∆^k_j(t2, t1 − 1) + ∆^k_j(t1, t1) < 2N/r′

This contradicts the choice of t2, and the claim follows.

The second property is a simple conclusion from Lemma 9:

Lemma 10 If 2N cells destined for output-port j arrive at a PPS operating under Algorithm 2 during time-interval [t1, t2], and none of them is sent through plane k, then ∆^k_j(t2) ≤ 0.

Proof. By Definition 10, there is a time-slot t3 such that ∆^k_j(t2) = ∆^k_j(t3, t2).

If t3 ≥ t1, then ∆^k_j(t3, t2) ≤ 0, since A^k_j(t3, t2) = 0. Otherwise,

    ∆^k_j(t3, t2) = ∆^k_j(t3, t1 − 1) + ∆^k_j(t1, t2)
                  = ∆^k_j(t3, t1 − 1) + A^k_j(t1, t2) − (1/r′)Aj(t1, t2)

By Lemma 9 and Definition 10, ∆^k_j(t3, t1 − 1) ≤ ∆^k_j ≤ 2N/r′. Since A^k_j(t1, t2) = 0 and Aj(t1, t2) ≥ 2N, it follows that ∆^k_j(t3, t2) ≤ 0 also in this case.

The final theorem shows that a speedup of 8 suffices for this demultiplexing algorithm to achieve optimal relative queuing delay. Note that such a high speedup is considered impractical for real switches; yet, Algorithm 2 demonstrates that the lower bound presented in Theorem 8 is tight for u = 1.

Theorem 13 Speedup S = 8 suffices for Algorithm 2 to work correctly with maximum relative

queuing delay of 3N + r′ time-slots.

Proof. It suffices to show that every time Line 7 of procedure DISPATCH is executed, F[j] \ (B ∪ E) ≠ ∅, and a plane can be chosen. Clearly, at each step |B| ≤ r′ and |E| < r′; therefore the claim follows if |F[j]| > 2r′. Since F[j] is changed only by procedure UPDATE(j), it suffices to show that |F[j]| > 2r′ after any execution of UPDATE(j).

Assume, without loss of generality, that phase = 0 after an execution of procedure UPDATE(j) at time-slot t1. Assume, by way of contradiction, that |F[j]| ≤ 2r′ at time-slot t1. Clearly, from Line 7 and the fact that |V0| = |V1| = K/2 = Sr′/2 = 4r′ > 2r′, it follows that phase = 0 after time-slot t1 − 1. This implies that |V0 ∩ L[j]| ≥ 2r′.

Denote by t2 the last time-slot in which phase was changed from 1 to 0 (t2 = 0 if no such time-slot exists). At time-slot t2, when executing Line 4, Q[j] is empty, and therefore all planes k ∈ V0 are in M[j] at time-slot t2. This implies that for every k ∈ V0, ∆^k_j(t2) ≤ 0.

Let k be a plane in V0 ∩ L[j]. By the definition of L[j], there is a time-slot t3 ∈ [t2, t1] such that ∆^k_j(t3) > N/r′. Let t4 be the last time-slot such that ∆^k_j(t3) = ∆^k_j(t4, t3). If t4 < t2 then

    ∆^k_j(t4, t3) = ∆^k_j(t4, t2) + ∆^k_j(t2 + 1, t3)
                  ≤ ∆^k_j(t2) + ∆^k_j(t2 + 1, t3) ≤ ∆^k_j(t2 + 1, t3)

and therefore t4 is not maximal. Hence t4 ≥ t2 and [t4, t3] ⊆ [t2, t1]. Since ∆^k_j(t4, t3) = A^k_j(t4, t3) − (1/r′)Aj(t4, t3) > N/r′, and Aj(t4, t3) ≥ A^k_j(t4, t3), it follows that A^k_j(t4, t3) > N/(r′ − 1).

Because |V0 ∩ L[j]| ≥ 2r′, the number of cells arriving at the switch and destined for j during time-interval [t2, t1] is at least 2r′ · N/(r′ − 1) > 2N. Since during time-interval [t2, t1] no cells are sent to any plane in V1, Lemma 10 implies that every plane k ∈ V1 has ∆^k_j(t1) ≤ 0, and in particular all planes in Q[j]. This yields that Q[j] becomes empty, and the phase changes at least once during time-interval [t2, t1], which contradicts the choice of t1 and t2.

By Lemma 9 and Theorem 10, the relative queuing delay of the algorithm is at most 3N + r′.
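Concretely, plugging the bound of Lemma 9 and Bj ≤ N into Theorem 10 yields:

```latex
\[
R_{\max} \le r'\left(\Delta^k_j + 1\right) + B_j
         < r'\left(\frac{2N}{r'} + 1\right) + N
         = 3N + r'.
\]
```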

4.5.3 Optimal u-RT Demultiplexing Algorithm

Algorithm 2 can be used as a building block for u-RT algorithms with u > 1. Algorithm 3 runs u

instances of Algorithm 2 in a round-robin manner, such that in each time-slot only one instance is

active (that is, the ith instance is active on time-slots i, i + u, i + 2u, i + 3u etc.). Since there are u

time-slots between two consecutive times in which the same instance is active, global information

on the previous time the instance was active can be shared among the demultiplexors. In addition,

each instance of Algorithm 2 has its own set of 8r′ planes, hence Algorithm 3 needs speedup

S = 8u.
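The round-robin wrapper itself is simple; the following Python sketch (hypothetical code mirroring the structure of Algorithm 3, with invented class names) shows how each instance sees only every uth time-slot:

```python
# Illustrative sketch of Algorithm 3's round-robin wrapper: u independent
# instances of the 1-RT algorithm, exactly one active per time-slot.
class URTDispatcher:
    def __init__(self, instances):
        self.instances = instances       # u instances, each with its own planes
        self.x = len(instances) - 1      # cyclic pointer, initially u - 1

    def advance_clock(self):             # invoked at the beginning of each slot
        self.x = (self.x + 1) % len(self.instances)
        self.instances[self.x].advance_clock()

    def dispatch(self, cell):
        return self.instances[self.x].dispatch(cell)

# A stub instance just records the time-slots on which it was active.
class StubInstance:
    def __init__(self):
        self.active_slots = []
    def advance_clock(self):
        pass
    def dispatch(self, cell):
        self.active_slots.append(cell)

u = 3
d = URTDispatcher([StubInstance() for _ in range(u)])
for slot in range(6):
    d.advance_clock()
    d.dispatch(slot)
# Instance i handles exactly slots i, i + u, i + 2u, ...
assert d.instances[0].active_slots == [0, 3]
```

Because u time-slots separate consecutive activations of an instance, the global information it needs can be disseminated in time, which is what makes the construction u-RT.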

We next bound the imbalance under Algorithm 3:

Lemma 11 In Algorithm 3, for every plane k and output-port j, ∆^k_j < 2N/r′.

Proof. Assume towards a contradiction that there is a traffic T, a plane k and an output-port j such that ∆^k_j ≥ 2N/r′. Let t be the first time-slot in which ∆^k_j(t) ≥ 2N/r′, and let x = t mod u. The choice of t and Algorithm 3 imply that a cell is sent through plane k at time-slot t by instance x.

Let T′ be the traffic consisting of the cells of traffic T handled by instance x, that is, T′ = {c | c ∈ T, (ta(c) − x) mod u = 0}. Let round(c) = (ta(c) − x)/u be the number of times instance x was active until cell c arrived at the switch.

Consider the traffic T̄ in which each cell c of traffic T′ arrives at the switch at time-slot round(c), that is, T̄ = {shift(c, round(c) − ta(c)) | c ∈ T′}. Let Āj(t1, t2) be the number of cells in traffic

Algorithm 3 u-RT Algorithm

Shared:
    ALG[u]: u instances of Algorithm 2             ⊲ each instance with its own planes and shared variables
    x: value in {0, . . . , u − 1}, initially u − 1 ⊲ cyclic pointer to array ALG

1: void procedure ADVANCE-CLOCK( )                 ⊲ invoked at the beginning of each time-slot
2:     x ← (x + 1) mod u
3:     ALG[x].ADVANCE-CLOCK( )                     ⊲ invoke procedure ADVANCE-CLOCK on the xth instance
4: end procedure

1: int procedure DISPATCH(cell c) at demultiplexor i
2:     return ALG[x].DISPATCH(c)                   ⊲ invoke procedure DISPATCH on the xth instance
3: end procedure

T̄ destined for output-port j that arrive at the switch during time interval [t1, t2], and Ā^k_j(t1, t2) be the number of these cells that are sent through plane k by Algorithm 2. Similarly, following Definition 10, ∆̄^k_j(t1, t2) = Ā^k_j(t1, t2) − (1/r′)Āj(t1, t2), and ∆̄^k_j(t2) = max_{t1≤t2} ∆̄^k_j(t1, t2).

Since only instance x sends cells to plane k, and the dispatching decisions of instance x in response to traffic T are the same as the decisions of Algorithm 2 in response to traffic T̄, it follows that for every time t′ < t, A^k_j(t′, t) = Ā^k_j(⌈(t′ − x)/u⌉, (t − x)/u). On the other hand, T̄ contains a subset of T's cells destined for output-port j, and therefore Aj(t′, t) ≥ Āj(⌈(t′ − x)/u⌉, (t − x)/u). This implies that ∆̄^k_j((t − x)/u) ≥ ∆^k_j(t) ≥ 2N/r′, contradicting Lemma 9.

Lemma 11, Theorem 13 and Theorem 10 immediately imply:

Theorem 14 For any u ≥ 1 and a PPS with speedup S = 8u, there is a u-RT demultiplexing algorithm ALG such that Rmax(ALG) ≤ 3N + r′.

Note that speedup S = 8u is not feasible in real-life switches. Therefore, Algorithm 3 has

only theoretical importance. We leave for further research the question whether there is an optimal

u-RT demultiplexing algorithm which requires a speedup that does not depend on u.

4.6 Extensions of the PPS model

4.6.1 The Relative Queuing Delay of an Input-Buffered PPS

We extend our definitions for the bufferless PPS model to the case in which there are buffers in the input-ports. In an input-buffered PPS, when a cell arrives, the demultiplexor either sends the cell to one of the planes or keeps it in its buffer. In every time-slot, the demultiplexor sends any number of buffered cells to the planes, provided that the rate constraints on the lines between the input-port and any plane are preserved. In this section, we consider only deterministic demultiplexing algorithms (a discussion on extending the bounds to randomized algorithms appears at the end of the section).

We refer to the buffer residing at input-port i with finite size s as a vector bi ∈ {1, . . . , N, ⊥}^s. An element of this vector contains the destination of the cell stored at the corresponding place in the buffer. Empty places in the buffer are indicated by ⊥ in the vector. The size of the buffer at input-port i is denoted |bi|.

The demultiplexor state machine is changed to include the state of the input-port buffer. Bi denotes the set of reachable states of the buffer residing in input-port i. We refer to the set of states of the ith demultiplexor as Si × Bi. A switch configuration also includes the content of the input-buffers.

Definition 11 The demultiplexing algorithm of the demultiplexor residing at input-port i with an input-buffer is a function

    ALGi : {1, . . . , N, ⊥} × Si × Bi → Si × {1, . . . , K, ⊥}^{|bi|+1}

which gives the next state and a vector of size |bi| + 1, according to the incoming cell's destination (⊥ if no cell arrives), the current state, and the content of the buffer.

The vector of size |bi| + 1 that is returned by the function ALGi states through which plane to

send the cell in the corresponding place in the buffer; the last element of the vector refers to the

incoming cell; ⊥ indicates that the corresponding cell remains in the buffer. We denote by to(c, T )

the time-slot in which cell c ∈ T is sent from the input-port to one of the planes; since cells can be queued in the input-port, to(c, T) can be larger than ta(c), unlike in the bufferless PPS model.
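The shape of the function in Definition 11 can be illustrated with a small Python sketch (hypothetical code; None plays the role of ⊥, and the policy shown is a trivial placeholder, not an algorithm from the thesis):

```python
# Illustrative sketch of the dispatch-function shape in Definition 11.
# The returned vector has one entry per buffer slot plus one for the
# incoming cell; None (playing the role of ⊥) keeps a cell buffered.
def demo_algorithm(incoming_dest, state, buffer):
    """A trivial placeholder policy (hypothetical, for shape only): flush
    every buffered cell to plane 1 and keep the incoming cell buffered."""
    decisions = [1 if dest is not None else None for dest in buffer]
    decisions.append(None)               # the incoming cell stays buffered
    return state, decisions              # next state + plane vector

state, decisions = demo_algorithm(incoming_dest=2, state=0,
                                  buffer=[3, None, 1])
assert decisions == [1, None, 1, None]   # |b_i| + 1 = 4 entries
```

Any real demultiplexing algorithm in this model differs only in how it fills the decision vector; the signature itself is what Definition 11 fixes.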

When measuring the relative queuing delay in an input-buffered PPS, the queuing of cells both

in the input-buffers and the planes’ buffers of the PPS should be compared to the queuing of cells

in the output-buffers of the shadow switch. Generally, input buffers increase the flexibility of the

demultiplexing algorithms, which leads to weaker lower bounds.

We prove these lower bounds by constructing (f, s) weakly-concentrating executions (recall

Definition 5):

Theorem 15 Any deterministic fully-distributed demultiplexing algorithm ALG with any input-buffers' sizes has Rmax(ALG) ≥ (N/S)(1 − 1/r′) and J(ALG) ≥ (N/S)(1 − 1/r′) time-slots.

Proof. Let C be a switch configuration in which all the buffers in the switch are empty, and denote by t0 the time-slot in which the PPS is in configuration C. Denote the state of demultiplexor i in configuration C by (s0_i, b0_i). Clearly, b0_i = ⊥^{|bi|}.

For every input-port i, consider the traffic T|i = {ci}, which consists of a single cell ci with orig(ci) = i, dest(ci) = j and ta(ci) = t0. Note that to(ci) ≤ N/S time-slots; otherwise T|i causes a relative queuing delay greater than N/S time-slots. Let (sf_i, bf_i) be the demultiplexor state just before this cell is sent. Clearly, T|i has no bursts, since only one cell arrives at the switch.

Since K < N, there exists a plane k and a set of N/K demultiplexors I = {i1, . . . , iN/K} such that plane(ci) = k for every i ∈ I.

Now consider another traffic T = T|i1 ◦ T|i2 ◦ · · · ◦ T|iN/K. That is, traffic T begins in configuration C, and for every time-slot t ∈ {t0, . . . , t0 + N/K − 1} one cell, destined for output-port j, arrives at input-port i ∈ I. Note that in every time-slot at most one cell arrives at the switch; therefore this traffic has no bursts.

Since ALG is a fully-distributed demultiplexing algorithm, and all the buffers are empty in configuration C, a demultiplexor i ∈ I does not change its state until the first cell arrives. Before T and T|i begin, demultiplexor i is in state (s0_i, b0_i), and its individual flow under T is exactly the same as under T|i (only one cell destined for output-port j arrives). Therefore, demultiplexor

i ∈ I changes its state to (sf_i, bf_i) and sends its cell to plane k, implying that EPPS(ALG, T) is an (N/K, N/K) weakly-concentrating execution for output-port j and plane k.

Applying Lemma 2 with f = N/K, s = N/K and B = 0 yields lower bounds of (N/K)r′ − N/K = (N/S)(1 − 1/r′) on the maximum relative queuing delay and the relative delay jitter.

Unlike fully-distributed demultiplexing algorithms, the size of the input-buffers affects the rel-

ative queuing delay in an input-buffered PPS under u-RT demultiplexing algorithms. A PPS that

can store u cells in each input-port is able to support a u-RT demultiplexing algorithm that guarantees relative queuing delay of at most u time-slots, by simulating the CPA algorithm [74]. Note that CPA assumes the PPS is a global FCFS switch, i.e., cells leave an output-port in FCFS order, regardless of the input-port from which they originated.

Theorem 16 There is a u-RT demultiplexing algorithm for a global FCFS input-buffered PPS,

with buffer size at least u and speedup S ≥ 2, and a relative queuing delay of at most u time-slots.

This algorithm may be impractical; yet, it demonstrates that a lower bound of Ω(N) time-slots does not hold when the input-buffers are sufficiently large. When buffers are smaller than u, we show that a global FCFS deterministic input-buffered PPS has relative queuing delay of (N/S)(1 − 1/r′) time-slots, under leaky-bucket traffic with burstiness factor u(N/K − 1):

Theorem 17 Any deterministic u-RT demultiplexing algorithm ALG with input-buffers' sizes smaller than u has Rmax(ALG) ≥ (N/S)(1 − 1/r′) and J(ALG) ≥ (N/S)(1 − 1/r′) time-slots.

Proof. Let C be the switch configuration at time t0, and assume that at this time all the buffers

in the switch are empty. Let T |i be a traffic that is comprised of cells c with orig(c) = i and

dest(c) = j such that one cell arrives at input-port i in each time-slot, until the first cell destined

for output-port j is sent to one of the planes. This execution takes less than u time-slots, because otherwise input-port i queues more cells in its input-buffer than its capacity allows. Note that T|i is a leaky-bucket traffic with no bursts.

Since the PPS is FCFS, and only cells of traffic T|i arrive at the switch, the first cell to leave input-port i's buffer is the first cell of traffic T|i. We denote this cell by ci.


Technion - Computer Science Department - Ph.D. Thesis PHD-2007-06 - 2007

Page 82: Competitive Evaluation of Switch Architectures

Since K < N, there exists a plane k and a set of demultiplexors I ⊆ {1, . . . , N} of size N/K, such that plane(ci) = k for every i ∈ I. Let t′ = max{to(ci, T|i) | i ∈ I}.

Now compose all traffics T|i for i ∈ I and append the time-interval (t′, t′ + u(N/K − 1)] in which no cells arrive at the switch. T = ⋃i∈I T|i denotes the composite traffic, starting from configuration C at time t0, which is an (R, u(N/K − 1)) leaky-bucket traffic.

Every demultiplexor i ∈ I goes through the same state transitions in response to T|i and T, since composing the traffics does not change the switch configurations in the time interval [0, t0], to(ci) − u < t0, and its local information is identical in T|i and T. Hence, demultiplexor i sends the cell ci to plane k in time-slot to(ci, T) = to(ci, T|i) < t0 + u.

Notice that under traffic T, in the time interval [t0, t0] (that is, the first time-slot), N/K cells destined for the same output-port arrive at the switch and are sent through the same plane. Furthermore, the burst of traffic T during this interval is N/K − 1. Thus, Lemma 2 with f = N/K, s = 1 and B = N/K − 1 implies that Rmax(ALG) = J(ALG) = (N/K)r′ − (1 + N/K − 1) = (N/S)(1 − 1/r′), as required.

We leave for future research the question of whether these lower bounds apply also to the average relative queuing delay and to randomized algorithms. The major difference between these proofs and the proofs of our other lower bounds (described in Section 4.3) is that they employ executions in which a concentration occurs at the beginning of the traffic rather than at its very end (that is, weakly-concentrating executions). Therefore, it is unclear how a proper continuation of this traffic can be devised. Another interesting future research direction is to present a methodology and algorithms that match the lower bounds for an input-buffered PPS.

4.6.2 Recursive Composition of PPS

Another extension of the PPS model is implementing the planes themselves as PPSs (operating at a lower rate). A (q, K)-recursive-PPS ((q, K)-RPPS) is defined recursively as follows:

Base case: A (1, 〈k1〉)-RPPS is a PPS with k1 planes, operating at external rate R and internal rate r1, as described in Section 4.2.

Recursion step: A (q + 1, K · 〈kq+1〉)-RPPS is a (q, K)-RPPS whose planes are replaced with PPS


Figure 11: A (2, 〈2, 2〉)-RPPS with 5 input-ports and 5 output-ports.

switches, each with kq+1 planes, that operate at external rate rq and internal rate rq+1. Note that rq > rq+1. K · 〈kq+1〉 denotes the concatenation of the vector 〈kq+1〉 after the vector K. This composition is described in Figure 11.

When a cell arrives at a K-RPPS, it is demultiplexed through a chain of q demultiplexors (where q is the length of the vector K) until it is sent to an output-queued switch. It is important to notice that each demultiplexor in this chain handles traffic originating only from a single input-port. The collection of demultiplexors that handle all flows originating from input-port i, denoted Gi, forms a tree of height q with ∏_{i=1}^{q} K[i] leaves; the level of each demultiplexor is its distance from the root of the corresponding tree Gi.
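The tree structure above can be sketched with a few lines of code. This is a minimal illustration, not part of the thesis; the helper names (`rpps_leaves`, `level_of`) and the path encoding of tree nodes are assumptions made for the example.

```python
from math import prod

def rpps_leaves(K):
    """Number of leaves of the demultiplexor tree G_i for a (q, K)-RPPS:
    the product of the plane counts over all recursion levels."""
    return prod(K)

def level_of(path):
    """The level of a demultiplexor is its distance from the root of G_i;
    here a node is encoded as its root-to-node path of plane choices."""
    return len(path)

# The (2, <2, 2>)-RPPS of Figure 11: 2 planes, each replaced by a 2-plane PPS.
K = [2, 2]
leaves = rpps_leaves(K)      # prod K[i] = 2 * 2 = 4 leaves
root_level = level_of([])    # the root demultiplexor sits at level 0
leaf_level = level_of([1, 0])  # a leaf of the height-2 tree sits at level 2
```

In the homogeneous case discussed next, `leaves` is exactly the quantity that replaces K in the lower bounds of Section 4.3.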

In the homogeneous case, where all the demultiplexors in Gi are of the same type, Gi can be considered as a single (yet complex) demultiplexor of this type. Therefore, all the lower-bound results described in Section 4.3 hold after substituting K with ∏_{i=1}^{q} K[i] and r with rq.

For simplicity, we present the results only for N-partitioned fully-distributed demultiplexing algorithms:

Corollary 18 Any homogeneous RPPS that uses (randomized) N-partitioned fully-distributed demultiplexing algorithms has, with probability 1 − δ, an average relative queuing delay of at least N(R/rq − 1) − ε time-slots, against adaptive and oblivious adversaries, where ε > 0 and δ > 0 can


be made arbitrarily small (δ = 0 for deterministic algorithms).

As in Theorem 5, the lower bound against an oblivious adversary holds only if the RPPS obeys a global FCFS policy.

Corollary 19 Any homogeneous RPPS that uses (randomized) u-RT demultiplexing algorithms has, with probability 1 − δ, an average relative queuing delay of at least (uN/S′)(1 − u·rq/R) − ε time-slots, against an adaptive adversary, where S′ = (rq/R)·∏_{i=1}^{q} K[i], u = R/(2rq), and ε, δ > 0 can be made arbitrarily small (δ = 0 for deterministic algorithms).

These results imply that building a PPS recursively and homogeneously does not improve its relative queuing delay. Note that a similar approach may be applied in order to analyze an input-buffered RPPS; hence, as in Theorems 16 and 17, the lower bound on the relative queuing delay sometimes depends on the relation between the buffer size and the type of information used.

Since sharing information is more feasible as the external rate of the switch decreases, it is interesting to also investigate a monotone K-RPPS in which the switches at levels 1, . . . , v operate under fully-distributed algorithms, the switches at levels v + 1, . . . , w operate under u-RT demultiplexing algorithms, and the switches at levels w + 1, . . . , q are centralized. All demultiplexing algorithms can be either deterministic or randomized. For brevity, we assume all u-RT demultiplexors operate with the same parameter u, and identify such a recursive PPS by the tuple 〈K, v, w, u〉.

If all demultiplexors are bufferless, Corollaries 18 and 19 imply the following lower bound:

Corollary 20 A monotone (randomized) 〈K, v, w, u〉-RPPS has, with probability 1 − δ, an average relative queuing delay of at least

max{ N(R/rv − 1), (uN/S′)(1 − u·rw/rv) } − ε

time-slots, against an adaptive adversary, where S′ = (rw/rv)·∏_{i=v+1}^{w} K[i], u = rv/(2rw), and ε, δ > 0 can be made arbitrarily small (δ = 0 for deterministic algorithms).


Proof. Consider a single input-port i and the collection of demultiplexors that handle all flows originating from input-port i.

On one hand, the demultiplexors at levels 1, . . . , v form a homogeneous fully-distributed demultiplexor. Therefore, Corollary 18 implies that it attains, with probability 1 − δ, an average relative queuing delay of at least N(R/rv − 1) − ε time-slots.

On the other hand, the demultiplexors at levels v + 1, . . . , w form a collection of homogeneous u-RT distributed demultiplexors. Therefore, Corollary 19 implies that each of these demultiplexors attains, with probability 1 − δ, an average relative queuing delay of at least (uN/S′)(1 − u·rw/rv) − ε time-slots.

Therefore, the overall average relative queuing delay is as claimed.
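The two terms of Corollary 20's bound can be evaluated numerically. The sketch below is purely illustrative; the port count and rates are hypothetical sample values, and the function name is an assumption made for the example.

```python
from math import prod

def monotone_rpps_bound(N, R, r_v, r_w, K, v, w):
    """Evaluate the two terms of Corollary 20's lower bound (ignoring the
    arbitrarily small epsilon), using u = r_v / (2 r_w) and
    S' = (r_w / r_v) * prod_{i=v+1}^{w} K[i] from the corollary."""
    u = r_v / (2 * r_w)
    S_prime = (r_w / r_v) * prod(K[v:w])        # K is 0-indexed here
    term_fd = N * (R / r_v - 1)                 # fully-distributed levels 1..v
    term_rt = (u * N / S_prime) * (1 - u * r_w / r_v)  # u-RT levels v+1..w
    return max(term_fd, term_rt)

# Hypothetical 16-port switch; note that u * r_w / r_v = 1/2 by the choice of u.
bound = monotone_rpps_bound(N=16, R=16.0, r_v=8.0, r_w=4.0, K=[2, 2, 2], v=1, w=3)
```

With these sample rates the fully-distributed term dominates, which matches the intuition that the slowest fully-distributed level governs the delay.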

We leave for further research the construction of algorithms for recursive PPSs and the analysis of other combinations of demultiplexing algorithms (e.g., when some of the demultiplexors are bufferless and some have input-buffers).


Chapter 5

Packet-Mode Scheduling in CIOQ Switches

In many network protocols, from very large Wide Area Networks (WANs) to small Networks on Chips (NoCs), traffic is comprised of variable-size packets. A prime example is provided by IP datagrams, whose sizes typically vary from 40 to 1500 bytes [112]. Real-life switches, however, operate with fixed-size cells, which are easier to buffer and schedule synchronously in the electronic domain.

Transmitting packets over cell-based switches requires the use of packet segmentation and reassembly modules, resulting in significant computation and communication overhead [77]. Cell-based scheduling is expected to become an even more crucial problem as the use of optics becomes widespread, since future switches could deal with packets in the optical spectrum and might be unable to afford their segmentation and reassembly. Cell-based schedulers, unaware of packet boundaries, may cause performance degradation; indeed, packet-aware switches typically have better drop rates, since they may reduce the number of retransmissions by ensuring that only complete packets are sent over the switch fabric (cf. [141, Page 44]).

Packet-mode schedulers [57, 101] bridge this gap by delivering packets contiguously over the

switch fabric, implying that until a packet is fully transmitted, neither its originating port nor its

destination port can handle different packets.

It is imperative to explore whether packet-mode schedulers can provide performance guarantees similar to those of cell-based schedulers. We address this question by focusing on CIOQ switches


and investigating whether a packet-mode CIOQ switch can mimic an ideal shadow switch with

bounded relative queuing delay.

5.1 Our Results

In this chapter, we present packet-mode schedulers for CIOQ switches that mimic an ideal switch with bounded relative queuing delay. Since such mimicking requires CIOQ switches with a certain speedup, we further investigate the trade-off between the speedup of the switch and its relative queuing delay.

We devise pipelined frame-based schedulers, in which scheduling decisions are made at frame boundaries. Our schedulers and their analysis rely on matrix decomposition techniques. At each frame, a demand matrix, representing the total size of the packets between each input-output pair, is decomposed into permutations that dictate the scheduling decisions in the next frame. The major challenge in these decompositions is ensuring contiguous packet delivery while decomposing the demand matrix into as few permutations as possible.

We show that, contrary to a cell-based CIOQ switch, a packet-mode CIOQ switch cannot exactly emulate an ideal shadow switch (e.g., an output-queued (OQ) switch), whatever the speedup. However, once we allow a bounded relative queuing delay, we find that a packet-mode CIOQ switch does not require a fundamentally higher speedup than a cell-based CIOQ switch.

Specifically, we show (Theorem 26) that a speedup of 2 + O(1/Rmax) suffices to ensure that a packet-mode CIOQ switch mimics an ideal switch with maximum relative queuing delay Rmax = O(N · lcm(Lmax)) time-slots, where Lmax is the maximum packet size, and lcm(Lmax) is the least common multiple of 1, . . . , Lmax. This result also holds in the common case where only a few packet sizes are legal, and the resulting relative queuing delay is O(N · lcm(L)) time-slots, where L is the restricted set of legal packet sizes. It is important to note that if L = {1, . . . , Lmax}, then lcm(L) is exponential in Lmax, since it is bounded from below by the primorial of Lmax, Lmax#, and from above by the factorial of Lmax, Lmax!; both Lmax# and Lmax! are exponential in Lmax.
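The growth of lcm(Lmax) is easy to observe numerically. The short sketch below is illustrative only; the helper name `lcm_upto` is an assumption made for the example.

```python
from math import lcm, factorial

def lcm_upto(L):
    """lcm(L_max) as used in the text: the least common multiple of 1..L."""
    return lcm(*range(1, L + 1))

# lcm(1..L) is bounded above by L! and grows exponentially in L, so a
# relative queuing delay of O(N * lcm(L_max)) blows up quickly with L_max.
values = {L: lcm_upto(L) for L in (5, 10, 20)}
for L, v in values.items():
    assert v <= factorial(L)   # the factorial upper bound from the text
```

For instance, lcm(1..10) = 2520 already, while lcm(1..20) exceeds 2 · 10^8, illustrating why restricting the set L of legal packet sizes matters.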

The relative queuing delay can be significantly reduced with just a doubling of the speedup. We show (Theorem 25) that a speedup of 4 + O(1/Rmax) suffices to ensure that a packet-mode


Figure 12: Summary of our results. The solid line represents emulation of an ideal switch with unbounded buffer size, while the dashed line represents emulation of an ideal switch with buffer size B per output-port. The relative queuing delay scale is logarithmic.

CIOQ switch mimics an ideal shadow switch with a more reasonable relative queuing delay of Rmax = O(N · Lmax) time-slots.

The relative queuing delay can be further reduced to only Lmax − 1 time-slots, if the speedup is increased to 2Lmax (Theorem 22). In addition, we show (Theorem 21) that it is impossible to achieve a relative queuing delay of less than Lmax/2 − 3, regardless of the speedup used. In particular, packet-mode schedulers cannot exactly emulate OQ switches (with no relative queuing delay).

Finally, we consider mimicking an ideal switch with a bounded buffer size B at each output-port. Extending the matrix decomposition techniques, we show (Corollary 28) that with a smaller speedup of 1 + O(1/Rmax) and relative queuing delay Rmax = O(B + N · lcm(Lmax)), a packet-mode CIOQ switch mimics an ideal shadow switch with buffer size B.

Figure 12 summarizes our results and demonstrates the trade-off between the speedup required

for switch mimicking and the resulting relative queuing delay.


5.2 A model for packet-mode CIOQ switches

This section extends the model defined in Chapter 3 to capture packet-mode switches in general,

and specifically packet-mode CIOQ switches.

In a packet-mode switch, packets of variable size traverse the switch contiguously. The packet size is measured in cell-units, where the minimal packet size is one cell and the maximum packet size is Lmax cells. All cells of the same packet arrive at the switch contiguously at the same input-port and are destined for the same output-port. Therefore, we refer to a packet simply as a sequence of cells and assume that its size is known upon the arrival of its first cell (e.g., the total size is written in the header).

Packet-mode switches are required to ensure that cells of the same packet leave the switch contiguously; that is, cells of the same packet should leave the switch one after the other, with no interleaving of cells from other packets.

A packet-mode switch should further provide a relaxed notion of a first-come-first-served (FCFS) discipline: if the last cell of packet p arrives at the switch before the first cell of packet p′, and both packets share the same output-port, then all cells of packet p should leave the switch before the cells of packet p′. We denote this partial order of packets by p ≺ p′ (i.e., packet p should be handled before packet p′).

An ideal packet-mode shadow switch (e.g., a packet-mode OQ switch) should also be work-conserving: namely, if a cell is pending for output-port j at time-slot t, then some cell leaves the switch from output-port j at time-slot t [37, 88, 91]. We denote by tlS(c) the time-slot at which cell c is delivered by the shadow switch. Contiguous packet delivery implies that for any packet p = (c1, . . . , cℓ), tlS(ci) = tlS(cj) + (i − j) for 1 ≤ j ≤ i ≤ ℓ.

Recall that in a CIOQ switch with speedup S, packets arriving at rate R are first buffered at the input side and then forwarded over the switch fabric to the output side, as dictated by a scheduling algorithm (see Figure 2). Packets that arrive at input-port i and are destined for output-port j are stored at the input side of the switch in a separate buffer, called a virtual output-queue and denoted VOQij. The switch fabric operates at rate S · R, where S is the speedup of the switch,


Figure 13: Illustration of the proof of Theorem 21; white packets are destined for output-port 1, while the gray packet is destined for output-port 2.

implying that the switch has S scheduling opportunities (or scheduling decisions) every time-slot.¹

A packet-mode CIOQ switch ensures that if a packet p from input-port i to output-port j consists of the cells (c1, . . . , cℓ), then after cell c1 is transmitted across the switch fabric, no cells of packets other than p are transmitted from input-port i or to output-port j until cell cℓ is transmitted. Naturally, cells of the same packet are transmitted in order.

It is possible that some input-port i starts transmitting cells of a packet p before all the cells of packet p have arrived at the switch. Since the speedup of the switch is typically greater than 1, this may cause the switch to under-utilize its speedup. For example, suppose that the first cell c1 of a packet p = (c1, c2, . . . , cℓ) arrives at input-port i at time-slot ta(c1) and is immediately sent to output-port j in the first scheduling opportunity of time-slot ta(c1). Since cell c2 arrives at the switch only at time-slot ta(c2) = ta(c1) + 1, no cells can be sent from input-port i or to output-port j for the next S − 1 scheduling opportunities (even if there are cells of other packets in one of the relevant buffers).

5.3 Simple Upper and Lower Bounds on the Relative Queuing Delay

We show that a packet-mode CIOQ switch cannot mimic an ideal shadow switch with a small relative queuing delay, regardless of the CIOQ switch speedup. In particular, this result implies that a packet-mode CIOQ switch cannot exactly emulate an OQ switch, whatever the speedup used. This runs

¹For non-integral speedup values, the speedup S is the average number of such scheduling decisions per time-slot, where at each time-slot the switch makes between ⌊S⌋ and ⌈S⌉ scheduling decisions [59].


against the conventional wisdom that “speedup N solves every problem”.

Theorem 21 A packet-mode CIOQ switch cannot mimic an ideal switch with a relative queuing delay Rmax < Lmax/2 − 3 time-slots.

Proof. Assume towards a contradiction that the CIOQ switch mimics an ideal shadow switch with relative queuing delay Rmax < Lmax/2 − 3, and consider the following traffic, comprising only three packets (see Figure 13): At time-slot 1, a packet p1 of size Lmax arrives at input-port 1, destined for output-port 1. At time-slot Rmax + 2, another packet, denoted p2, of size 1 arrives at input-port 2, destined for output-port 1. At time-slot Rmax + 3, a packet p3 of size Lmax arrives at input-port 2, destined for output-port 2.

At time-slot 1, packet p1 is the only packet destined for output-port 1; since the shadow switch is work-conserving, the first cell of p1 is delivered by the shadow switch at time-slot 1, implying it must be delivered by the CIOQ switch by time-slot Rmax + 1. Packet-mode scheduling restricts the switch from delivering cells of other packets to output-port 1 until the last cell of packet p1 is delivered. Since the last cell of packet p1 arrives at the switch at time-slot Lmax, output-port 1 is busy handling p1 at least until time-slot Lmax.

Using the same arguments, the first cell of packet p3 must be delivered to output-port 2 by time-slot 2Rmax + 3, and input-port 2 is busy handling p3 at least until time-slot Lmax + Rmax + 2.

Since Lmax > 2Rmax + 3, packet p2 cannot be delivered to output-port 1 until time-slot Lmax + Rmax + 2. But packet p2 is delivered by the shadow switch at time-slot Lmax + 1, implying that its relative queuing delay is at least Rmax + 1, contradicting the assumption.
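The timing argument above can be replayed numerically. The sketch below is an illustration of the proof's arithmetic, not part of the thesis; the function name and the concrete values of Lmax and Rmax are assumptions made for the example.

```python
def theorem21_conflict(L_max, R_max):
    """Replay the timing constraints from the proof of Theorem 21 for a
    candidate delay R_max < L_max/2 - 3; returns True if the three-packet
    traffic forces p2's relative queuing delay to exceed R_max."""
    assert L_max > 2 * R_max + 3   # p3 must start while output-port 1 is busy
    # p1's last cell arrives at time-slot L_max, so output-port 1 is busy
    # with p1 in the CIOQ switch at least until then.
    output1_busy_until = L_max
    # p3 occupies input-port 2 at least until L_max + R_max + 2.
    input2_busy_until = L_max + R_max + 2
    # p2 needs both output-port 1 and input-port 2, so it cannot be
    # delivered before both are free.
    earliest_p2 = max(output1_busy_until, input2_busy_until)
    shadow_p2 = L_max + 1          # the shadow delivers p2 right after p1
    return earliest_p2 - shadow_p2 > R_max

# Any R_max below L_max/2 - 3 yields a contradiction:
conflict = theorem21_conflict(L_max=100, R_max=40)
```

Here the forced delay is (Lmax + Rmax + 2) − (Lmax + 1) = Rmax + 1, matching the proof.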

Note that this result holds because the CIOQ switch waits for the cells of the different packets to arrive; under the situation described in the proof of Theorem 21, the switch in fact degrades to working at the external line rate (i.e., with S = 1), as an IQ switch. The result is therefore consistent with the known result that IQ switches with speedup 1 cannot emulate output-queued switches [37].

We now show that a CIOQ switch can mimic a shadow switch with relative queuing delay of

Lmax − 1 time-slots provided it has a sufficiently large speedup of 2Lmax. The algorithm closely


follows the CCF algorithm, which emulates (precisely) a cell-based OQ switch with speedup S =

2 [37].

Intuitively, multiplying the speedup by the maximum packet size Lmax reduces the problem

of packet-mode switching to cell-based switching: Each cell-based scheduling decision can be

mapped to Lmax contiguous packet-mode scheduling decisions, implying that a packet can be

transmitted contiguously. In addition, a relative queuing delay of Lmax − 1 time-slots allows the

scheduler to wait until a packet fully arrives at the switch before it is scheduled. The following

theorem captures this simple result:

Theorem 22 A packet-mode CIOQ switch with speedup S = 2Lmax can mimic an ideal shadow switch with a relative queuing delay of Lmax − 1 time-slots.

Proof. For each time-slot t, let traffic T(t) be the collection of cells that arrive at the switch by time-slot t, and let T′(t) ⊆ T(t) be the traffic comprising only the cells in T(t) that are the first cells of their corresponding packets. Denote by t′CCF(c) the time-slot in which the CCF algorithm with speedup S = 2 schedules a cell c of traffic T′(t) over the switch fabric, and let tlS′(c) be the time-slot in which c leaves a cell-based OQ switch that handles traffic T′.

The packet-mode CCF algorithm (PM-CCF) simulates the behavior of the cell-based CCF: for each packet p of traffic T(t), PM-CCF forwards the entire packet p contiguously over the switch fabric in time-slot tPM−CCF(p) = t′CCF(first(p)) + Lmax − 1.

Since the cell-based CCF works with speedup S = 2, for each time-slot t there are at most two cells that share the same input or output port and are forwarded over the switch fabric by the cell-based CCF in time-slot t. PM-CCF works correctly since it has 2Lmax scheduling opportunities at each time-slot, and therefore it can schedule the packets corresponding to these two cells entirely in the same time-slot t. In addition, the contiguous arrival of packets at the input-ports ensures that packet p has fully arrived at the switch by time-slot t′CCF(first(p)) + Lmax − 1.

For each cell c of traffic T = ⋃t T(t), tlS(c) denotes the time-slot in which c leaves the packet-mode shadow switch. Note that tlS(c) ≥ tlS(first(packet(c))) ≥ tlS′(first(packet(c))), because cells corresponding to the same packet are delivered in order and traffic T′ = ⋃t T′(t) is a subset of traffic T. Since the cell-based CCF emulates a cell-based OQ switch, it follows that for each cell c of traffic T:

tlS(c) ≥ tlS′(first(packet(c))) ≥ t′CCF(first(packet(c))) = tPM−CCF(c) − (Lmax − 1).

This implies that every cell c can be delivered from a CIOQ switch with packet-mode CCF at time-slot tlS(c) + Lmax − 1, and the claim follows.
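The PM-CCF rule from the proof is a one-line time shift. The sketch below illustrates it under stated assumptions: the per-packet CCF schedule times and packet names are hypothetical inputs, and `pm_ccf_times` is a name invented for the example.

```python
L_MAX = 4  # maximum packet size in cells; PM-CCF runs with speedup 2 * L_MAX

def pm_ccf_times(ccf_first_cell_time):
    """The PM-CCF rule of Theorem 22: forward packet p contiguously in
    time-slot t'_CCF(first(p)) + L_max - 1, by which time the packet has
    fully arrived (cells of a packet arrive contiguously)."""
    return {p: t + L_MAX - 1 for p, t in ccf_first_cell_time.items()}

# Hypothetical slots at which the cell-based CCF (speedup 2) schedules the
# first cells of two packets:
pm = pm_ccf_times({"p1": 3, "p2": 5})
# A single PM-CCF time-slot offers 2 * L_MAX scheduling opportunities, so a
# whole packet of size <= L_MAX fits in the time-slot computed above.
```

Note that the shift by Lmax − 1 is exactly the relative queuing delay paid for mapping a cell-based schedule to a contiguous packet-mode one.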

This result only demonstrates the possibility of mimicking an ideal shadow switch with bounded delay, since a speedup of S = 2Lmax is unreasonable in practical switches.

Furthermore, this result also shows that although cut-through CIOQ switches (that is, switches that do not wait for packets to fully arrive at the switch before starting to schedule them) may provide smaller delay in cell-mode scheduling, in packet-mode scheduling it is more profitable to use store-and-forward CIOQ switches, which wait for packets to fully arrive at the switch before starting to schedule them.

In the rest of this chapter, we show how to achieve a similar result with a smaller speedup, by presenting a tradeoff between the speedup and the relative queuing delay: as the speedup of the switch increases, the relative queuing delay needed for mimicking a shadow switch decreases.

5.4 Tradeoffs between the speedup and the relative queuing delay

Our scheduling algorithms operate in a frame-based pipelined manner, with scheduling decisions made only at the frame boundaries. At each frame boundary, the algorithms first construct several demand matrices, and then decompose these matrices into permutations (or sub-permutations). The algorithms satisfy the demands by scheduling the cells in the next frame according to the resulting permutations.

The algorithms and their analysis rely on some results of matrix theory, which are presented


next.

5.4.1 Matrix Decomposition

Definition 12 A permutation P is a 0-1 matrix such that the sum of each row and the sum of each

column is exactly 1. A sub-permutation P is a 0-1 matrix such that the sum of each row and the

sum of each column is at most 1.

In the rest of this chapter, for simplicity, we refer to sub-permutations as permutations. The

following definition captures the fact that the number of cells that should be scheduled from a

single input-port or to a single output-port is bounded:

Definition 13 A matrix A ∈ ℕ^{N×N} is C-bounded if the sum of each row and each column in A is at most C.

A classical result says that any C-bounded matrix A can be decomposed into C permutations whose sum dominates A:

Theorem 23 (BIRKHOFF–VON NEUMANN DECOMPOSITION [25, 52, 144]) If a matrix A ∈ ℕ^{N×N} is C-bounded by an integer C, then there are C permutations P1, . . . , PC such that A ≤ ∑_{i=1}^{C} Pi.

Note that since all values in the matrix A are integers, the same result can be obtained using König's theorem, which bounds the chromatic index of a bipartite graph by its maximum vertex degree [90].

The Birkhoff–von Neumann decomposition implies that every C-bounded demand matrix can be scheduled, cell by cell, in C scheduling opportunities (or, equivalently, in ⌈C/S⌉ time-slots), when permutation Pi dictates the scheduling in opportunity i. However, such a scheduling may violate the packet-mode restrictions, since there is no relation between adjacent permutations in the sequence. For reasons that will become clear shortly, we are interested in the following class of permutations:


Definition 14 A maximal matching for a matrix A = [aij] is a permutation matrix P = [pij] ≤ A such that if pij = 0 and aij > 0, then there exists i′ such that pi′j = 1 or j′ such that pij′ = 1.

Intuitively, a permutation P ≤ A is a maximal matching for a matrix A if no element can be added to P such that the resulting matrix is still a permutation and is dominated by A.

The next theorem shows that if a matrix is decomposed by any sequence of maximal matchings, then the number of permutations needed is at most twice the number needed in Theorem 23. The decomposition of a C-bounded matrix A works iteratively: in each iteration m, a maximal matching P(m) for the matrix A(m − 1) is found and then subtracted from A(m − 1) to form A(m) (negative values are treated as zeros). The procedure stops when A(m) = 0.

We next show that this happens after at most 2C − 1 iterations, regardless of the choice of the maximal matching in each iteration, implying that the matrix A is decomposed into fewer than 2C permutations.

Theorem 24 ([145, THEOREM 2.2]) For every C-bounded matrix A ∈ ℕ^{N×N}, the decomposition procedure described above stops after at most 2C − 1 iterations.

Proof. Denote A(0) = A and let P(m) = [p(m)ij] be the maximal matching found in iteration m. Let A(m) = [a(m)ij] be the matrix resulting from subtracting the permutation P(m) from the matrix A(m − 1).

If A(2C − 1) ≠ 0, then there exist i, j such that a(2C − 1)ij > 0. Let a(2C − 1)ij = k and a(0)ij = ℓ. This implies that p(m)ij = 1 in exactly ℓ − k of the permutations P(m) (1 ≤ m ≤ 2C − 1), and therefore p(m)ij = 0 in (2C − 1) − ℓ + k such permutations.

Note that for every m ≤ 2C − 1, a(m)ij > 0. Therefore, Definition 14 yields that if p(m)ij = 0, then there is either i′ such that p(m)i′j = 1 or j′ such that p(m)ij′ = 1.

However, the sum of either row i or column j, excluding a(0)ij, is at most C − ℓ. This implies that 2(C − ℓ) ≥ (2C − 1) − ℓ + k, which is a contradiction since ℓ, k ≥ 1.
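The decomposition procedure of Theorem 24 can be simulated directly. The sketch below is illustrative: the greedy scan is just one way to produce a maximal matching (any choice works, per the theorem), and the example matrix is hypothetical.

```python
import copy

def maximal_matching(A):
    """Greedily build a maximal matching for A (Definition 14): scan entries
    and pick (i, j) with A[i][j] > 0 whose row and column are still free."""
    n = len(A)
    used_row, used_col = set(), set()
    P = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if A[i][j] > 0 and i not in used_row and j not in used_col:
                P[i][j] = 1
                used_row.add(i)
                used_col.add(j)
    return P

def decompose(A):
    """Iteratively subtract maximal matchings until A reaches 0, returning
    the number of iterations (Theorem 24 bounds it by 2C - 1)."""
    A = copy.deepcopy(A)
    n, iterations = len(A), 0
    while any(A[i][j] for i in range(n) for j in range(n)):
        P = maximal_matching(A)
        for i in range(n):
            for j in range(n):
                A[i][j] = max(0, A[i][j] - P[i][j])
        iterations += 1
    return iterations

A = [[2, 1, 0], [0, 2, 1], [1, 0, 2]]   # a 3-bounded matrix (C = 3)
C = 3
iters = decompose(A)
assert iters <= 2 * C - 1               # at most 2C - 1 = 5 iterations
```

On this particular matrix the greedy choice finishes in fewer than 2C − 1 iterations; the theorem guarantees the bound for every choice of maximal matchings.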


5.4.2 Mimicking an Ideal Shadow Switch with Speedup S ≈ 4

Our schedulers operate by constructing a demand matrix at each frame boundary, and then using

the result of decomposing this matrix for scheduling decisions in the next frame. The relative

queuing delay of the schedulers corresponds to the size of the frame, while the speedup of the

switch is determined by the ratio between the frame size and the number of permutations obtained

in the decomposition.

A key insight is that packet-mode shadow switches can be implemented by a push-in-first-out (PIFO) cell-based OQ switch. In such OQ switches, arriving cells are placed in an arbitrary location in their destination's buffer, and the switch always outputs the cells at the head of its buffers [37]. The PIFO policy is an extension of the first-in-first-out (FIFO) policy that can also implement QoS-aware (Quality-of-Service-aware) algorithms, such as WFQ and strict priority.

In our case, it allows us to implement packet-mode shadow switches as follows: the first cell of a packet p arriving at the switch is placed at the end of the relevant OQ switch buffer. Each consecutive cell ci of packet p is placed immediately after cell ci−1; in each time-slot, the cell at the head of the buffer departs from the switch. Since cells of the same packet are placed one after the other in the buffer, they leave the OQ switch contiguously. In addition, if p ≺ p′, then the last cell of packet p is placed in the buffer before the first cell of packet p′, implying that packet p is served before packet p′.
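The PIFO placement rule above can be sketched for a single output-port buffer. This is a minimal illustration under simplifying assumptions (all arrivals happen before departures, since departures shift buffer indices); the class and method names are invented for the example.

```python
class PacketModePIFOBuffer:
    """Sketch of the PIFO placement rule for one output-port: the first cell
    of a packet goes to the end of the buffer, and every consecutive cell is
    placed immediately after the previous cell of the same packet."""

    def __init__(self):
        self.cells = []      # buffer content; the head is at index 0
        self.last_pos = {}   # packet id -> index of its last placed cell

    def arrive(self, packet_id, cell_index):
        if cell_index == 0:                        # first cell: end of buffer
            self.cells.append((packet_id, cell_index))
            self.last_pos[packet_id] = len(self.cells) - 1
        else:                                      # place right after c_{i-1}
            pos = self.last_pos[packet_id] + 1
            self.cells.insert(pos, (packet_id, cell_index))
            self.last_pos[packet_id] = pos
            # entries pushed right by the insert must have positions shifted
            for p, i in self.last_pos.items():
                if p != packet_id and i >= pos:
                    self.last_pos[p] = i + 1

    def depart(self):
        return self.cells.pop(0)   # the head cell leaves each time-slot

buf = PacketModePIFOBuffer()
buf.arrive("p", 0)
buf.arrive("p", 1)
buf.arrive("q", 0)    # q's first cell arrives between p's cells...
buf.arrive("p", 2)    # ...yet p's third cell is placed before q's first
```

Draining the buffer yields p's cells contiguously, then q's, exactly the contiguous-departure property the text relies on.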

Notice that, using the CCF algorithm, a cell-based CIOQ switch with speedup S = 2 can emulate a cell-based OQ switch with any PIFO discipline [37], and in particular the discipline described above. However, the CCF algorithm ensures only that packets depart contiguously from the switch; it does not deliver the packets contiguously over the switch fabric (that is, from the input-ports to the output-ports). Nevertheless, our next algorithms use this underlying CCF algorithm in order to construct the demand matrix of each frame.

Let tCCF(c) be the time-slot in which a cell c is forwarded over the switch fabric by this CCF algorithm. Clearly, tCCF(c) ≤ tlS(c). We have the following lemma:

Lemma 12 If a scheduling algorithm ALG schedules the cell last(p) of every packet p by time-slot tCCF(last(p)) + δ, then the maximum relative queuing delay of ALG is at most δ + Lmax − 1,


where Lmax is the maximum packet size.

Proof. Consider a cell c, let k be its place in packet(c), and let ℓ be the size of packet(c). The contiguous packet delivery in the shadow switch dictates that tlS(c) = tlS(last(packet(c))) − (ℓ − k). Let tALG(c) be the time-slot in which ALG forwards cell c over the switch fabric. Since both ALG and CCF forward the cells of packet(c) in their order within the packet,

tALG(c) ≤ tALG(last(packet(c))) ≤ tCCF(last(packet(c))) + δ ≤ tlS(last(packet(c))) + δ = tlS(c) + δ + ℓ − k ≤ tlS(c) + δ + ℓ − 1 ≤ tlS(c) + δ + Lmax − 1.

This implies that every cell c is at the output side of the switch by time-slot tlS(c) + δ + Lmax − 1, and therefore ALG can output cell c from the CIOQ switch at time-slot tlS(c) + δ + Lmax − 1. Notice that ALG does not transmit two cells c, c′ at the same time-slot from the same output-port, since tlS(c) + δ + Lmax − 1 = tlS(c′) + δ + Lmax − 1 implies that tlS(c) = tlS(c′), contradicting the definition of the shadow switch.

We now explore the trade-off between the speedup S at which the CIOQ switch operates and its relative queuing delay. We devise a frame-based scheduler in which the demand matrix of each frame is built according to the times at which the underlying CCF algorithm forwards cells over the switch fabric. In addition, packets that were not fully forwarded by the CCF algorithm by the frame boundary are queued in the input-side of the switch until the next frame. Thus, the CCF algorithm determines which packets should be delivered by a packet-mode CIOQ switch in each frame, as captured by the next definition:

Definition 15 For every input-port i, output-port j, frame size τ and frame number k > 0, the set of eligible cells of frame k, denoted a_ij(τ, k), includes all cells c ∉ ⋃_{k′<k} a_ij(τ, k′) such that all cells c′ ∈ packet(c) have t_CCF(c′) ≤ kτ. By convention, a_ij(τ, 0) = ∅.


Notice that, by definition, all the cells of a packet p are in the same set of eligible cells.

The next lemma bounds the number of cells sharing an input-port or an output-port that should be scheduled within the same frame:

Lemma 13 For every input-port i, output-port j, frame size τ and frame number k > 0, Σ_{j′=1}^N |a_ij′(τ, k)| ≤ 2τ + N(Lmax − 1) and Σ_{i′=1}^N |a_i′j(τ, k)| ≤ 2τ + N(Lmax − 1).

Proof. Note that the CCF algorithm works with a CIOQ switch with speedup 2. Thus, the number of cells c that share the same input-port (output-port) and have been forwarded by the CCF within frame k (namely, (k − 1)τ < t_CCF(c) ≤ kτ) is at most 2τ.

Since in each virtual output-queue VOQ_{i,j} all cells of the same packet p are stored one after the other, no cell of a different packet is forwarded by the CCF between cells of packet p. Therefore, for each pair (i, j), only cells of one packet are in a_ij(τ, k) and were forwarded by the CCF before time-slot (k − 1)τ; we next bound the number of such cells. Since the maximum packet size is Lmax and the last cell of each packet was forwarded by the CCF after time-slot (k − 1)τ, at most Lmax − 1 such cells share the same input-port and the same output-port. Thus, the number of such cells that share an input-port (output-port) is at most N(Lmax − 1). This implies that both Σ_{j′=1}^N |a_ij′(τ, k)| and Σ_{i′=1}^N |a_i′j(τ, k)| are bounded by 2τ + N(Lmax − 1).

Lemma 13 and Theorem 23 imply that the eligible cells of each frame can be scheduled within 2τ + N(Lmax − 1) scheduling opportunities. Unfortunately, the decomposition described in Theorem 23 does not ensure that the packet-mode scheduling constraints are satisfied and therefore cannot be used directly. For example, consider the matrix

A = [a_ij] =
    [ 3 1 2 0 ]
    [ 0 2 2 2 ]
    [ 1 2 2 1 ]
    [ 2 1 0 3 ],

in which, for example, element a_{1,1} represents a single packet of size 3 and elements a_{2,2}, a_{2,3}, a_{2,4} represent packets of size 2. The following decomposition of A into six permutations, A = P1 + P2 + P3 + P4 + P5 + P6, where

P1 = [ 1 0 0 0 ; 0 1 0 0 ; 0 0 1 0 ; 0 0 0 1 ]
P2 = [ 1 0 0 0 ; 0 0 1 0 ; 0 1 0 0 ; 0 0 0 1 ]
P3 = [ 1 0 0 0 ; 0 0 0 1 ; 0 0 1 0 ; 0 1 0 0 ]
P4 = [ 0 0 1 0 ; 0 1 0 0 ; 1 0 0 0 ; 0 0 0 1 ]
P5 = [ 0 1 0 0 ; 0 0 1 0 ; 0 0 0 1 ; 1 0 0 0 ]
P6 = [ 0 0 1 0 ; 0 0 0 1 ; 0 1 0 0 ; 1 0 0 0 ],

violates the packet-mode constraints: contiguous transmission of packet a_{1,1} requires that the first three permutations be scheduled contiguously. On the other hand, each permutation i ∈ {1, 2, 3} must also be adjacent to permutation i + 3 in order to ensure contiguous transmission of packet a_{2,i+1}. These requirements cannot be satisfied simultaneously, since they imply that at least one permutation must be adjacent to three other permutations.
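The example above can be checked mechanically. The following sketch (using the matrices and packet placement of the example) verifies that the six permutations sum to A, and then brute-forces all 6! orderings to confirm that none keeps every packet contiguous:

```python
from itertools import permutations

A = [[3, 1, 2, 0], [0, 2, 2, 2], [1, 2, 2, 1], [2, 1, 0, 3]]
perms = {
    1: [[1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1]],
    2: [[1,0,0,0], [0,0,1,0], [0,1,0,0], [0,0,0,1]],
    3: [[1,0,0,0], [0,0,0,1], [0,0,1,0], [0,1,0,0]],
    4: [[0,0,1,0], [0,1,0,0], [1,0,0,0], [0,0,0,1]],
    5: [[0,1,0,0], [0,0,1,0], [0,0,0,1], [1,0,0,0]],
    6: [[0,0,1,0], [0,0,0,1], [0,1,0,0], [1,0,0,0]],
}
# the six permutations indeed sum to A
assert all(sum(P[i][j] for P in perms.values()) == A[i][j]
           for i in range(4) for j in range(4))

# slots serving the packets used in the argument: packet a_{1,1} (size 3) is
# served in permutations 1-3; packet a_{2,i+1} (size 2) in permutations i, i+3
packets = {"a11": {1, 2, 3}, "a22": {1, 4}, "a23": {2, 5}, "a24": {3, 6}}

def contiguous(order, slots):
    pos = sorted(order.index(s) for s in slots)
    return pos[-1] - pos[0] == len(pos) - 1

# ...but no ordering of the six permutations keeps all these packets contiguous
feasible = any(all(contiguous(order, s) for s in packets.values())
               for order in permutations(range(1, 7)))
assert not feasible
```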

To circumvent this problem, we use Theorem 24 and introduce a different decomposition algorithm, which guarantees contiguous packet delivery but requires twice as many scheduling opportunities: At each frame boundary, the algorithm counts the number of cells in each set a_ij(τ, k) and constructs a matrix B(k) = [b_ij] accordingly (namely, b_ij = |a_ij(τ, k)|). Then, the algorithm repeatedly builds maximal matchings for matrix B(k) and keeps contiguous packet delivery in the following manner: if a cell from input-port i to output-port j is forwarded in some iteration of the algorithm, and there are more cells from i to j that were not forwarded yet, then the algorithm keeps the matching between i and j for the next iteration. (This procedure is sometimes called exhaustive service matching [96].)

Since the algorithm uses only maximal matchings, Theorem 24 yields that the algorithm needs twice as many iterations as the Birkhoff–von Neumann decomposition in order to decompose matrix B(k). In particular, for every frame size τ, the algorithm needs at most 4τ + 2N(Lmax − 1) iterations to complete. This implies that it can mimic an ideal switch with a speedup arbitrarily close to 4, while attaining a relative queuing delay of O(N·Lmax).

Theorem 25 A packet-mode CIOQ switch with speedup S = 4 + (2N(Lmax − 1) − 1)/τ can mimic an OQ switch with a relative queuing delay of 2τ + Lmax − 2 time-slots.

Proof. Fix a frame size τ and let B(k) = [b_ij] be the N × N matrix such that b_ij = |a_ij(τ, k)|. Lemma 13 implies that the sum of each row and each column of B(k) is at most 2τ + N(Lmax − 1).

Algorithm 4 works by repeatedly constructing maximal matchings P for matrix B(k). If a cell in the set a_ij(τ, k) is forwarded in some iteration of the algorithm, and there are more cells in a_ij(τ, k) to be forwarded, the algorithm keeps the matching between input-port i and output-port j for the next iteration. Therefore, cells of a specific set are forwarded contiguously. Hence, Definition 15 implies that Algorithm 4 forwards all the cells corresponding to a specific packet contiguously; this clearly satisfies the packet-mode scheduling constraints.

All matchings used by Algorithm 4 are maximal, and the sum of each column and each row in B(k) is at most 2τ + N(Lmax − 1). Theorem 24 implies that Algorithm 4 needs at most

2 · (2τ + N(Lmax − 1)) − 1 = 4τ + 2N(Lmax − 1) − 1

iterations to complete. Thus, with speedup 4 + (2N(Lmax − 1) − 1)/τ the algorithm schedules all cells corresponding to B(k) within the next frame, that is, by time-slot (k + 1)τ.

Consider the last cell last(p) of some packet p. Definition 15 implies that if last(p) ∈ a_ij(τ, k) then t_CCF(last(p)) > (k − 1)τ. Since Algorithm 4 schedules last(p) by time-slot (k + 1)τ, it follows that the relative queuing delay of last(p) is at most 2τ − 1. By Lemma 12, the relative queuing delay is at most 2τ + Lmax − 2.

Notice that for switch speedup S > 4, the relative queuing delay induced by this algorithm is (2N(Lmax − 1) − 1)/(S − 4) + Lmax − 2 time-slots.


Algorithm 4 Coarse-Grained Maximal Matchings

Local Variables:
  B: matrix of values in ℕ, initially B = B(k)
  P: matrix of values in {0, 1}, initially all 0

procedure SCHEDULE(matrix B)
  while B ≠ 0 do
    for all P[i][j] do
      if P[i][j] = 1 and B[i][j] = 0 then
        P[i][j] := 0
      end if
    end for
    P := MAX-MATCH(B, P)        ▷ returns a maximal matching of B that dominates P
    for all P[i][j] do
      if P[i][j] = 1 and B[i][j] > 0 then
        forward a cell from input i to output j
      end if
    end for
    B := B − P
    for all B[i][j] do          ▷ avoid negative values in B
      B[i][j] := max{B[i][j], 0}
    end for
  end while
end procedure

matrix procedure MAX-MATCH(matrix B, matrix P)
  while there are i, j such that B[i][j] ≥ 1 and Σ_{j′=1}^N P[i][j′] = 0 and Σ_{i′=1}^N P[i′][j] = 0 do
    P[i][j] := 1
  end while
  return P
end procedure

5.4.3 Mimicking an Ideal Shadow Switch with Speedup S ≈ 2

Notice that, for each frame, the scheduler described in Theorem 25 schedules all eligible cells with the same origin and the same destination contiguously, implying that in fact it considers them as a single packet. Using a more fine-grained scheduler and the Birkhoff–von Neumann decomposition, we now show that a smaller speedup, arbitrarily close to 2, suffices, albeit with a larger relative queuing delay.

This is done in the context of the common situation where packet sizes are restricted to be from a set L (cf. [112, 128]). Notice that this case generalizes the unrestricted packet size case, where L = {1, . . . , Lmax}. Let lcm(L) be the least common multiple of all elements in L.

Theorem 26 A packet-mode CIOQ switch with speedup S = 2 + N(Lmax·lcm(L) − 1)/τ can mimic an ideal shadow switch with a relative queuing delay of 2τ + Lmax − 2 time-slots.

Proof. Fix a frame size τ . For every packet size ` ∈ L, let aij(τ, `, k) ⊆ aij(τ, k) be the set of

eligible cells (recall Definition 15) that correspond to packets of size `. Let B(`, k) = [b(`, k)ij]

be the matrix with values b(`, k)ij = |aij(τ,`,k)|`

, that is, the number of eligible packets of size ` in

frame k.

For every packet size `, the algorithm first tries to concatenate lcm(L)/` packets one after the

other in order to get one mega-packet of size lcm(L), each such mega-packet consists of packets

of the same size. The matrix B(lcm(L), k) = [b((lcm(L), k)ij] counts the number of such mega-

packets:

b ((lcm(L), k)ij =∑`∈L

⌊` · b(`, k)ij

lcm(L)

We first bound the sum of each row and each column of the matrix B(lcm(L), k). Consider some row i of the matrix (the proof for a column j follows analogously):

Σ_{j=1}^N b(lcm(L), k)_ij = Σ_{j=1}^N Σ_{ℓ∈L} ⌊ ℓ · b(ℓ, k)_ij / lcm(L) ⌋
    ≤ Σ_{j=1}^N Σ_{ℓ∈L} ℓ · b(ℓ, k)_ij / lcm(L)
    = (1/lcm(L)) Σ_{j=1}^N Σ_{ℓ∈L} |a_ij(τ, ℓ, k)|
    = (1/lcm(L)) Σ_{j=1}^N |a_ij(τ, k)|
    ≤ (1/lcm(L)) · (2τ + N(Lmax − 1))        (by Lemma 13)

By Theorem 23, the matrix B(lcm(L), k) can be decomposed into (2τ + N(Lmax − 1))/lcm(L) permutations. Denote by P(lcm(L), k) the set of these permutations.

We now turn to deal with leftover packets. Let the matrices B′(ℓ, k) = [b′(ℓ, k)_ij] count the number of packets of size ℓ that are not concatenated into mega-packets. Note that b′(ℓ, k)_ij ≤ (lcm(L) − 1)/ℓ, since it is the remainder of dividing b(ℓ, k)_ij by lcm(L)/ℓ. This implies that the sum of each row and each column of matrix B′(ℓ, k) is bounded by N(lcm(L) − 1)/ℓ. By Theorem 23, the matrix B′(ℓ, k) can be decomposed into N·(lcm(L) − 1)/ℓ permutations. Let P(ℓ, k) be the set of the permutations used to decompose the matrix B′(ℓ, k).

After obtaining these sets of permutations, the algorithm forwards contiguously all the mega-packets by holding each permutation P ∈ P(lcm(L), k) for lcm(L) consecutive iterations. Then, for every ℓ ∈ L, the algorithm holds each permutation P ∈ P(ℓ, k) for ℓ consecutive iterations. Clearly, all the cells of a specific packet are forwarded contiguously, and the algorithm satisfies the packet-mode scheduling constraints.

The number of iterations needed for the algorithm to complete is bounded by:

lcm(L) · (2τ + N(Lmax − 1))/lcm(L) + Σ_{ℓ∈L} ℓ · N·(lcm(L) − 1)/ℓ ≤ 2τ + N(Lmax − 1) + N · Lmax · (lcm(L) − 1)

Thus, with a speedup of 2 + N·(Lmax·lcm(L) − 1)/τ the algorithm schedules all cells corresponding to frame k within the next frame. This implies that for each packet p the maximum relative queuing delay of cell last(p) is less than two frame sizes, namely at most 2τ − 1 time-slots. Hence, Lemma 12 implies that the maximum relative queuing delay is at most 2τ + Lmax − 2.

Note that for switch speedup S > 2, the relative queuing delay induced by this algorithm is N·(Lmax·lcm(L) − 1)/(S − 2) + Lmax − 2 time-slots.
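The mega-packet bookkeeping in this proof can be sketched as follows. The input format (a dict of per-size count matrices) is our own; the arithmetic is exactly the concatenation and remainder computation above:

```python
from functools import reduce
from math import gcd

def frame_decomposition_counts(b, L):
    """Split per-size packet counts into mega-packet counts and leftovers.

    b[l][i][j] = number of eligible packets of size l from input i to output j
    (hypothetical input format). A mega-packet of size lcm(L) concatenates
    lcm(L)//l same-size packets; leftovers are the remainders.
    """
    lcm_L = reduce(lambda x, y: x * y // gcd(x, y), L)
    n = len(b[L[0]])
    # b(lcm(L), k)_ij = sum over l of floor(l * b(l, k)_ij / lcm(L))
    mega = [[sum((l * b[l][i][j]) // lcm_L for l in L) for j in range(n)]
            for i in range(n)]
    # b'(l, k)_ij = b(l, k)_ij mod (lcm(L) / l)
    leftovers = {l: [[b[l][i][j] % (lcm_L // l) for j in range(n)]
                     for i in range(n)] for l in L}
    return lcm_L, mega, leftovers
```

For instance, with L = {2, 3} (so lcm(L) = 6), seven size-2 packets and four size-3 packets at one input-output pair yield 2 + 2 = 4 mega-packets and one leftover size-2 packet, conserving all 26 cells.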

Furthermore, it is important to notice that even though the algorithm described in Theorem 26 employs a more sophisticated decomposition than Algorithm 4, both algorithms have the same overall time-complexity: their time-complexity is solely determined by the complexity of the underlying CCF algorithm, which is invoked twice every time-slot (while decomposition is done only once every τ time-slots).

5.5 Mimicking an Ideal Shadow Switch with Bounded Buffers

In many practical applications, CIOQ switches are required to emulate shadow switches with bounded buffer size. We show that a smaller speedup suffices for mimicking an ideal shadow switch with output buffer size B. Intuitively, the reason for this better performance is that an ideal shadow switch with bounded buffers cannot handle all incoming traffic types without dropping cells. Therefore, by using the extra information about the legal incoming traffic types, the CIOQ switch can optimize its scheduling decisions, resulting in a simpler and more efficient scheduling algorithm.

Unlike the previous algorithms, algorithms for bounded mimicking do not rely on the CCF algorithm; they use the following definition and lemma, which are adapted from Definition 15 and Lemma 13:

Definition 16 For every input-port i, output-port j, frame size τ and frame number k > 0, the set of eligible cells of frame k, denoted a_ij(τ, k), is the set of cells c that are delivered successfully by the ideal switch, such that c ∉ ⋃_{k′<k} a_ij(τ, k′) and all cells c′ ∈ packet(c) arrive at the switch before time-slot kτ. By convention, a_ij(τ, 0) = ∅.

As in Definition 15, all the cells of each packet p are in the same set of eligible cells. The next lemma bounds the size of these sets.

Lemma 14 For every input-port i, output-port j, frame size τ and frame number k > 0, Σ_{j′=1}^N |a_ij′(τ, k)| ≤ τ + B + N(Lmax − 1) and Σ_{i′=1}^N |a_i′j(τ, k)| ≤ τ + B + N(Lmax − 1).

Proof. Clearly, at most τ cells arrive at each input-port between time-slots (k − 1)τ and kτ. We next show that at most τ + B cells that arrive between time-slots (k − 1)τ and kτ and are destined for a single output-port j are successfully delivered by the shadow switch.

Assume, by way of contradiction, that ℓ1 > τ + B cells destined for output-port j arrive at the switch within frame k and are not dropped by the shadow switch. Let ℓ2 ≥ 0 be the number of cells stored in the buffer of output-port j at time-slot (k − 1)τ. By the definition of a switch, at most τ cells are delivered from output-port j between time-slots (k − 1)τ and kτ, hence the number of cells stored in the buffer by the end of frame k is at least ℓ1 + ℓ2 − τ > B cells, contradicting the fact that the buffer size is B.

Since all cells of the same packet p arrive at the switch contiguously, for each pair (i, j) only cells of one packet are in a_ij(τ, k) and arrived at the switch before time-slot (k − 1)τ. Since the maximum packet size is Lmax and the last cell of each packet arrives after time-slot (k − 1)τ, the number of such cells that share the same input-port and the same output-port is bounded by Lmax − 1. Thus, the number of such cells that share the same input-port (output-port) is bounded by N(Lmax − 1), and the sum is therefore bounded by τ + B + N(Lmax − 1).
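The τ + B part of this bound is easy to check by simulation. The sketch below is a simplified single-output-buffer model (our own, with drop-tail acceptance and one departure per time-slot), under which the cells accepted during a frame of τ slots never exceed τ + B:

```python
def delivered_in_frame(arrivals, B, backlog=0):
    """Simulate one output buffer of size B over a frame.

    arrivals[t] = cells arriving for this output at time-slot t; each slot,
    one buffered cell departs, then arrivals fill the buffer up to B and the
    excess is dropped. Returns the number of cells accepted in the frame.
    """
    buf, accepted = backlog, 0
    for a in arrivals:
        if buf > 0:
            buf -= 1                 # at most one cell leaves the output per slot
        take = min(a, B - buf)       # accept only what fits in the buffer
        buf += take
        accepted += take
    return accepted
```

Even under saturating arrivals, a frame of τ = 10 slots with B = 3 accepts at most τ + B = 13 cells, matching the counting argument in the proof.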

In order to mimic an ideal switch, the CIOQ switch drops all cells that are dropped by the shadow switch. By employing Lemma 14 in the proofs of Theorems 25 and 26, respectively, we get the following results:

Corollary 27 A packet-mode CIOQ switch with speedup S = 2 + (2B + 2N(Lmax − 1) − 1)/τ can mimic an ideal shadow switch with buffer size B with a relative queuing delay of 2τ + Lmax − 2 time-slots.

Corollary 28 A packet-mode CIOQ switch with speedup S = 1 + (B + N·(Lmax·lcm(L) − 1))/τ can mimic an ideal shadow switch with buffer size B with a relative queuing delay of 2τ + Lmax − 2 time-slots.

5.6 Simulation Results

Analytically, the algorithm described in Section 5.4.3 has a prohibitive relative queuing delay of O(N · Lmax · lcm(L)) time-slots and therefore has only theoretical importance.² Conversely, Algorithm 4 requires a speedup S ≈ 4 in order to mimic an ideal switch with a reasonable relative queuing delay of O(N·Lmax) time-slots.

²Even when lcm(L) = Lmax, the relative queuing delay is O(N·Lmax²).

In this section, we show that in practice Algorithm 4 outperforms its analytical worst-case bounds (from Theorem 25): even with a modest speedup, it achieves a small relative queuing delay. The results are obtained by conducting extensive simulation experiments under various synthetic and trace-driven traffic patterns.

Basically, in order to demonstrate the tradeoff between the speedup and the relative queuing delay, we conduct the following simulation: given the incoming traffic and a fixed frame size, we measure the loss ratio of packets under various speedup values. Then, we present the speedup required to achieve less than 0.1% packet drop. Note that, unlike our theoretical upper bounds, we allow a small amount of cell drops; this clearly represents real-life situations in which switches are allowed to drop cells in extreme situations. Furthermore, this metric is especially important because the delay incurred in ideal switches (e.g., in OQ switches) is very well-studied, and from the relative queuing delay one can easily derive absolute bounds on the cell or packet delay of a packet-mode CIOQ switch.
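The required-speedup metric used throughout this section reduces to a simple scan over the measured loss ratios. A minimal sketch, assuming the measurements are stored as a dict from simulated speedup to packet-loss ratio (a format of our choosing):

```python
def required_speedup(loss_ratio, threshold=0.001):
    """Smallest simulated speedup whose measured packet-loss ratio is below
    the threshold (0.1% by default).

    loss_ratio: dict mapping speedup value -> measured packet-loss ratio for
    one fixed frame size. Returns None if no simulated speedup qualifies.
    """
    feasible = [s for s, r in sorted(loss_ratio.items()) if r < threshold]
    return feasible[0] if feasible else None
```

Repeating this for each frame size produces one curve of the speedup/frame-size plots discussed below.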

We first consider several stochastic traffic patterns, which are generally modeled as ON-OFF processes: the ON period length is chosen according to a specific packet size distribution (that is, each ON period models the arrival of a single packet), while the OFF period is distributed geometrically with some probability p; the parameter p is chosen so that a certain load is achieved.

Specifically, we study the following three stochastic traffic patterns. These patterns were also used by Marsan et al. [101] in order to investigate the performance of a packet-mode input-queued switch (with no speedup). It is important to notice that our results are even stronger than real-life performance, since some of the traffic patterns are chosen specifically to reflect starvation and unfairness due to the contiguous forwarding of large packets [101]:

1. Uniform traffic: Packet sizes are chosen uniformly at random in the range [1, 192]. For each packet, its destination is chosen uniformly at random among all output-ports. This uniform traffic setting is considered due to its frequent use in simulations and stochastic analysis of switch performance (see Chapter 1.2). Note that the maximum packet size is the Maximum Transmission Unit (MTU) of IP over ATM, measured in ATM cells [10, 101].

2. Spotted traffic: Packet sizes are 100 cells with probability 0.5 and 3 cells with probability 0.5; packet destinations are chosen according to the following 8 × 8 matrix: each input-port i chooses a destination uniformly at random among all destinations with entry 1 in row i.

   1 1 1 0 1 0 1 0
   0 1 0 1 1 1 0 1
   1 0 1 0 1 1 1 0
   1 1 0 1 0 1 0 1
   1 0 1 0 1 0 1 1
   0 1 0 1 0 1 1 1
   1 0 1 1 1 0 1 0
   0 1 1 1 0 1 0 1

   Since the sum of each row and each column of the matrix is five, each input-port sends packets to 5 output-ports, and each output-port receives packets from 5 input-ports. Notice that this specific traffic matrix aims to highlight starvation and loss of throughput due to the contiguous forwarding of large packets [101].

3. Diagonal traffic: Packet destinations are chosen uniformly at random. For every cell c, if orig(c) = dest(c) then the packet size is 100; otherwise, the packet size is 1. In this traffic pattern, the flows on the diagonal of the switching matrix consist only of long packets, while the flows off the diagonal consist only of short packets. Like the spotted traffic setting, this traffic pattern stresses the effects of contiguously delivering packets of variable sizes [101].
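The ON-OFF source model shared by these patterns can be sketched as follows. The function and parameter names are ours; each ON period emits the cells of one packet back-to-back, and each OFF slot ends with probability off_prob, giving geometrically distributed OFF lengths:

```python
import random

def on_off_source(rng, num_slots, packet_size, off_prob, num_ports):
    """Sketch of one ON-OFF abstract source (not the thesis's simulator).

    Returns a list of (arrival_slot, destination, size) packet records.
    packet_size is a callable drawing a packet size, e.g. uniform over [1, 192]
    for the uniform traffic pattern.
    """
    t, packets = 0, []
    while t < num_slots:
        size = packet_size(rng)            # ON period length = packet size
        dest = rng.randrange(num_ports)    # uniform destination choice
        packets.append((t, dest, size))
        t += size                          # ON: one cell per time-slot
        while t < num_slots and rng.random() >= off_prob:
            t += 1                         # OFF: geometric length
    return packets
```

Tuning off_prob relative to the mean packet size sets the offered load, since the load equals E[ON] / (E[ON] + E[OFF]).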

All simulations are 100,000 time-slots long and were performed on a 16 × 16 switch (except the spotted traffic simulations, which were performed on an 8 × 8 switch, as in the setting described in [101]).

For each traffic pattern (stochastic or trace-driven), we fix a certain speedup S and a frame size τ. For each frame k, Algorithm 4 constructs a demand matrix B(k) and then decomposes the demand matrix B(k − 1) of the previous frame into a sequence of scheduling decisions. Under the fixed speedup S, Algorithm 4 schedules at most S · τ of these scheduling decisions, and drops all packets with cells in the remaining scheduling decisions. We measure the loss ratio (in terms of packets) under all values of S and τ.

Figure 14: Simulation results for a 16 × 16 switch, operating under the uniform traffic pattern and input loads 0.25, 0.5, 0.75 and 1.0. The plot shows, for each frame size, the speedup required to achieve less than a 0.1% packet drop ratio.

Figures 14, 15 and 16 present the speedup required to achieve less than a 0.1% packet drop ratio under different loads and different stochastic traffic patterns. As expected, the results demonstrate the tradeoff: Algorithm 4 needs a larger speedup to achieve a smaller relative queuing delay (that is, a smaller frame size). Moreover, the results show that as the load of the traffic increases, the speedup required by Algorithm 4 also increases.

Interestingly, these results show that, even in extreme situations, a speedup of less than 2 suffices to achieve ideal switch mimicking with a frame size of only 8·Lmax time-slots. This can be explained by carefully investigating the reasons behind the upper bound of Theorem 25: a speedup S ≥ 4 is required due to frames in which the underlying CCF algorithm forwards 2τ cells from the same input-port or to the same output-port; moreover, the additional factor of 2 is caused by a poor selection of maximal matchings, resulting in an inefficient contiguous decomposition as captured by Theorem 24. Under non-adversarial traffic, these two situations rarely occur in practice, and especially not simultaneously. A relative queuing delay of (2N(Lmax − 1) − 1)/(S − 4) + Lmax − 2 time-slots occurs in an even more extreme situation: when there is a frame k and an input-port i (output-port j) such that for every flow (i, j) there is a packet p whose first cell is sent by the underlying CCF algorithm before time-slot (k − 1)τ and whose last cell is sent by the CCF between time-slots (k − 1)τ and kτ. Clearly, this situation hardly ever happens.

Figure 15: Simulation results for an 8 × 8 switch, operating under the spotted traffic pattern and input loads 0.25, 0.5 and 0.75. The plot shows, for each frame size, the speedup required to achieve less than a 0.1% packet drop ratio.

Figure 16: Simulation results for a 16 × 16 switch, operating under the diagonal traffic pattern and input loads 0.25, 0.5, 0.75 and 1.0. The plot shows, for each frame size, the speedup required to achieve less than a 0.1% packet drop ratio.

We also conducted trace-driven simulations using trace data of TCP traffic over OC-48 links; this trace data was taken from CAIDA [43]. We investigate the performance of Algorithm 4 under this real traffic and show that, also in this non-synthetic case, it performs better than its theoretical upper bounds. To the best of our knowledge, these are the first trace-driven simulations of packet-mode CIOQ switches.

Figure 17 presents the performance of Algorithm 4 in the trace-driven experiments. We conducted these experiments at a granularity of 30 bytes (that is, the cell unit size is 30 bytes), yielding a maximum packet size, Lmax, of 50 cells (i.e., 1500 bytes). Furthermore, we compressed the traffic so that each input-port is fully utilized (that is, 100% load). Compressing the traces to 100% load intuitively represents the worst-case traffic that should be handled by the switch; this intuition is further confirmed by our previous experiments, which show that as the traffic load increases, the required speedup also increases. As in the previous synthetic traffic patterns, these trace-driven simulations also show that Algorithm 4 performs better than its theoretical bounds.

Figure 17: Trace-driven simulation results for a 16 × 16 switch, operating under 1.0 input load. The plot shows, for each frame size, the speedup required to achieve less than a 0.1% packet drop ratio.

Finally, we compare the performance of Algorithm 4 to two simple greedy algorithms.

The cut-through greedy algorithm gets a certain relative queuing delay Rmax as a parameter, and ensures that each packet either attains a relative queuing delay less than Rmax or is dropped. Specifically, the algorithm randomly chooses a maximal matching over all packets arriving at the input-side of the switch (even if a packet has not fully arrived at the switch) and, similarly to Algorithm 4, keeps an input-output pair matched until the corresponding packet is fully transmitted. Before a packet is selected for transmission, its relative queuing delay is compared to Rmax, and the packet is dropped if it is above the threshold. Our simulations show that the cut-through greedy algorithm never achieves a 0.1% packet drop ratio, no matter what the speedup of the switch is and what relative queuing delay threshold Rmax is chosen. These results coincide with the lower bound described in Theorem 21, implying that a cut-through algorithm cannot mimic an ideal switch.

Figure 18: Simulation results of the store&forward greedy algorithm for a 16 × 16 switch, operating under 1.0 input load and trace-driven traffic. The plot shows the speedup required to achieve less than a 0.1% packet drop ratio under each relative queuing delay threshold.

The store&forward greedy algorithm operates exactly as the cut-through greedy algorithm but schedules only fully-arrived packets. Although this algorithm potentially introduces an additional relative queuing delay of Lmax, our trace-driven simulations, described in Figure 18, show that this algorithm converges very fast to less than a 0.1% packet drop ratio, and in fact it outperforms Algorithm 4.

It is important to notice that the actions of the store&forward greedy algorithm and Algorithm 4 are very similar; the main difference is that the store&forward greedy algorithm does not operate in a frame-based manner, and therefore it introduces a smaller relative queuing delay. Yet, the store&forward greedy algorithm has no known analytical worst-case upper bounds.


Chapter 6

Jitter Regulation for Multiple Streams

The notion of delay jitter (or Cell Delay Variation [11]), defined as the difference between the maximal and minimal end-to-end delays of different cells, captures the smoothness of the traffic. The need for efficient mechanisms to provide such smooth and continuous traffic is mostly motivated by the increasing popularity of interactive communication, and in particular video/audio streaming.¹

Controlling traffic distortions within the network, and in particular jitter control, has the effect of moderating the traffic throughout the network [147]. This is important when a service provider in a QoS network must meet service level agreements (SLAs) with its customers. In such cases, moderating high congestion states in switches along the network results in the provider's ability to satisfy the guarantees of more customers [134].

Jitter control mechanisms have been extensively studied in recent years (see Section 2.4). These are usually modelled as jitter regulators that use internal buffers in order to shape the traffic, so that cells leave the regulator in the most periodic manner possible. Upon arrival, cells are stored in the buffer until their planned release time, or until a buffer overflow occurs. This indicates a tradeoff between the buffer size and the best attainable jitter: as the buffer space increases, one can expect to obtain lower jitter.
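The jitter measure itself is straightforward to state in code. The helper below is a hypothetical illustration of the definition, not part of any regulator:

```python
def delay_jitter(send_times, release_times):
    """Delay jitter of a single stream: the difference between the maximal and
    minimal end-to-end delays of its cells.

    send_times[i] and release_times[i] are the times cell i was sent by its
    source and released by the regulator, respectively.
    """
    delays = [r - s for s, r in zip(send_times, release_times)]
    return max(delays) - min(delays)
```

In particular, releasing the cells of a rate-R stream exactly 1/R time units apart, after any fixed offset, yields zero jitter, which is why the regulator aims for such a schedule.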

In this chapter, we investigate the problem of finding an optimal jitter release schedule, given a predetermined buffer size. This problem was first raised by Mansour and Patt-Shamir [100], who considered only a single-stream setting. In practice, however, jitter regulators handle multiple streams simultaneously and must provide low jitter for each stream separately and independently.

¹For example, 6.98 billion video streams were initiated by U.S. users during August 2006, while the U.S. streaming audience increased by 4 percent from July 2006 to reach 110.3 million streamers in August 2006, representing about 64 percent of the total U.S. Internet audience [42].

In the multi-stream model, the traffic arriving at the regulator is an interleaving of M streams originating from M independent abstract sources (see Figure 19). Each abstract source i sends a stream of fixed-size cells in a fully periodic manner with rate Ri; these cells arrive at a jitter regulator after traversing the network. Variable end-to-end delays caused by transient congestion throughout the network may result in such a stream arriving at the regulator in a non-periodic fashion. The regulator knows the value of Ri and strives to release consecutive cells 1/Ri time units apart, thus re-shaping the traffic into its original form. Moreover, the order in which cells are released by each abstract source is assumed to be respected throughout the network. This implies that the cells of the same stream arrive at the regulator in order (but not necessarily equally spaced), and the regulator should also maintain this order. We refer to this property as the FIFO constraint.

Note that the FIFO constraint should be respected in each stream independently, but not necessarily over all incoming traffic. This implies that in the multi-stream model, the order in which cells are released is not known a priori. This lack of knowledge is an inherent difference from the case where there is only one abstract source, and it poses a major difficulty in devising algorithms for multi-stream jitter regulation (as we describe in detail in Section 6.4).

6.1 Our Results

We present algorithms and tight lower bounds for jitter regulation in this multiple streams environ-

ment, both in offline and online settings. This answers a primary open question posed in [100].

We evaluate the performance of a regulator in the multi-stream model by considering the max-

imum jitter obtained on any stream. We show that, somewhat surprisingly, the offline problem can

be solved in polynomial time. This is done by characterizing a collection of optimal schedules,

and showing that their properties can be used to devise an offline algorithm that efficiently finds a

release schedule that attains the optimal jitter.

We use a competitive analysis approach in order to examine the online problem.

Figure 19: The multi-stream jitter regulation model.

In this setting, by sizing up the buffer to a size of 2MB and statically partitioning the buffer equally among the M

streams, applying the algorithm described in [100, Algorithm B] on each stream separately yields

an algorithm that obtains the optimal max-jitter possible with a buffer of size B. We show that

such a resource augmentation cannot be avoided, by proving that any online algorithm needs a

buffer of size at least MB in order to obtain a jitter within a bounded factor from the optimal jitter

possible with a buffer of size B. We further show that these results also apply when the objective

is to minimize the average jitter attained by the M streams. These results indicate that online jitter

regulation does not scale well as the number of streams increases unless the buffer is sized up

proportionally.

6.2 Model Description, Notation, and Terminology

We adapt the following definitions from [100]:

Definition 17 Given a traffic T = 〈ci | 0 ≤ i ≤ n〉 such that cell ci arrives at time ta(ci), we

define the following:

1. A release schedule s for traffic T defines the release time of cells in T . Specifically, for each

cell ci, tls(ci, T ) denotes the time at which cell ci ∈ T is released from the regulator under


schedule s. Note that for every ci ∈ T, ta(ci) ≤ tls(ci, T ).

2. A release schedule s for T is B-feasible if at any time t,

|{ ci ∈ T | ta(ci) ≤ t < tls(ci, T ) }| ≤ B.

That is, there are never more than B cells in the buffer simultaneously.

3. The delay jitter of T under a release schedule s is

J (s, T ) = max0≤i,k≤n { tls(ci, T ) − tls(ck, T ) − (i − k)X }

where X = 1/R is the inter-release time of T (i.e., X is the difference between the release

times of any two consecutive cells from the abstract source).

It is important to notice that since the abstract source generates perfectly periodic traffic, this

definition of delay jitter coincides with the notion of Cell Delay Variation. This definition also

coincides with Definition 2, if JS(T ) = 0 and each flow (i, j) corresponds to a different stream

Ti,j .

Note also that unlike previous chapters, we do not assume that time is slotted; therefore, the times at which cells arrive at and leave the regulator can be arbitrary non-negative real numbers in R+.
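As a concrete illustration of items 2 and 3 of Definition 17, the following Python sketch computes the delay jitter of a release schedule and checks B-feasibility with a simple event sweep. The function names and the list-based representation of traffics and schedules are illustrative assumptions, not the thesis' notation.

```python
# Illustrative sketch of Definition 17: a traffic is a list of arrival times
# and a schedule is a list of release times, both indexed by cell number;
# X = 1/R is the inter-release time of the abstract source.

def delay_jitter(releases, X):
    """J(s, T) = max over i, k of tls(c_i) - tls(c_k) - (i - k) * X."""
    # Equivalently, the spread of the offsets tls(c_i) - i * X.
    offsets = [t - i * X for i, t in enumerate(releases)]
    return max(offsets) - min(offsets)

def is_b_feasible(arrivals, releases, B):
    """True iff at most B cells ever occupy the buffer simultaneously."""
    # Cell c_i occupies the buffer during [ta(c_i), tls(c_i)); on identical
    # times, releases (-1) are processed before arrivals (+1).
    events = sorted([(t, 1) for t in arrivals] + [(t, -1) for t in releases],
                    key=lambda e: (e[0], e[1]))
    occupancy = 0
    for _, delta in events:
        occupancy += delta
        if occupancy > B:
            return False
    return True
```

For instance, a perfectly periodic release schedule has delay jitter zero, regardless of how irregular the arrivals were.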

We first extend Definition 17 to a traffic T that is an interleaving of M traffics T1, . . . , TM .

We call each traffic Ti ⊆ T a stream and denote by XTi the inter-release time of stream Ti. We

assume for simplicity that all streams have the same inter-release time X; all our results extend

immediately to the case where this does not hold.

Let cj denote the j’th cell (in order of arrival) of the interleaving of the streams T , and let cij

denote the j’th cell of the single stream Ti. A release schedule should obey a per-stream FIFO

discipline, in which cells of the same stream are released in the order of their arrival.

Let J (s, Ti) be the jitter of a single stream Ti obtained by a release schedule s. We use the

following metric to evaluate multi-stream release schedules:


Definition 18 The max-jitter of a multi-stream traffic T = ⋃i∈{1,...,M} Ti obtained by a release schedule s is the maximal jitter obtained by any of the streams composing the traffic; that is,

MJ (s, T ) = max1≤k≤M J (s, Tk).

In what follows, given any algorithm A, we denote by J (A, Ti) (MJ (A, T )) the jitter (max-

jitter) corresponding to the schedule produced by A given stream Ti (traffic T ).
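Definition 18 translates directly into code. In the following sketch (all names illustrative, not the thesis' notation), per-stream release times are kept in a dictionary keyed by stream id, and all streams share the inter-release time X:

```python
# Sketch of Definition 18: the max-jitter is the largest per-stream delay
# jitter over all streams composing the traffic.

def max_jitter(releases_by_stream, X):
    def stream_jitter(releases):
        offsets = [t - j * X for j, t in enumerate(releases)]
        return max(offsets) - min(offsets)
    return max(stream_jitter(r) for r in releases_by_stream.values())
```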

One can take a geometric view of delay jitter by considering a two dimensional plane where

the x-axis denotes time and the y-axis denotes the cell number (see Figure 20). We first consider

the case of a single stream T . Given a release schedule s, a point at coordinates 〈t, j〉 is marked if

tls(cj, T ) = t (that is, if the j’th cell is released at time t). The release band is the band with slope

R = 1/X that encloses all the marked points and has minimal width, where the width of the band

is the maximal difference in the x-axis coordinates between its margins. The jitter obtained by s is

the width of its release band, and therefore our objective is to find a schedule with the narrowest

release band.

Under the multi-stream model, we associate every stream Ti with a different color i. A point

at coordinates 〈t, j〉 is colored with color i if tls(cij, T ) = t. Any schedule s induces a separate

release band for each stream Ti ⊆ T that encloses all points with color i. Schedule s is therefore

characterized by M release bands.

6.3 Online Multi-Stream Max-Jitter Regulation

As mentioned previously, an online algorithm with buffer size 2MB which statically partitions the

buffer equally among the M streams and applies the algorithm described in [100, Algorithm B]

on each stream separately, obtains the optimal max-jitter possible with a buffer of size B. In this

section we show that this result is tight up to a factor of 2, by showing that in order to obtain a

max-jitter within a bounded factor from the optimal max-jitter possible with a buffer of size B,

any online algorithm needs a buffer of size at least MB cells. Hence, in order to maintain any

reasonable jitter performance, it is necessary to increase the buffer size in linear proportion to the


Figure 20: Geometric view of delay jitter. Arrivals are marked with dotted circles and releases are marked with full circles. The jitter of stream Ti is the width of the band with slope R = 1/X enclosing all releases.

number of streams.
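The upper-bound construction mentioned above can be sketched as a thin wrapper: a buffer of size 2MB is statically partitioned so that each of the M streams gets a private buffer of 2B cells, and a single-stream regulator (Algorithm B of [100], treated here as an opaque callable) runs on each stream independently. All identifiers below are illustrative assumptions, not the thesis' code.

```python
# Sketch of the static-partition upper bound (names illustrative).

def demultiplex(cells, M):
    """Split an interleaved traffic into M per-stream arrival lists.
    cells: list of (arrival_time, stream_id), in order of arrival."""
    streams = {i: [] for i in range(M)}
    for t, i in cells:
        streams[i].append(t)          # per-stream FIFO order is preserved
    return streams

def static_partition_regulator(cells, M, B, single_stream_regulator):
    """Run the given single-stream regulator on each stream with buffer 2B."""
    streams = demultiplex(cells, M)
    return {i: single_stream_regulator(arrivals, 2 * B)
            for i, arrivals in streams.items()}
```

Because each stream is regulated in isolation, the per-stream guarantee of the single-stream algorithm carries over directly, at the cost of the 2M-fold buffer augmentation discussed above.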

Theorem 29 For every online algorithm ALG with an internal buffer of size smaller than MB,

and for any x > 0, there exists a traffic consisting of M streams, forcing ALG to have max-jitter at

least x, while the optimal jitter possible with a buffer of size B is zero.

Proof. Let ALG be an online algorithm with a buffer of size at most MB − 1. Consider the

following traffic T : For every 0 ≤ k ≤ B− 1, M cells arrive at the regulator at time k ·X , one for

every stream.

Since the buffer size is at most MB − 1 and MB cells arrived by time t1 = (B − 1)X , it

follows that ALG releases a cell by time t1, say of stream Ti. Consider the following continuation

for T : Given some x > 0, in time t′ ≥ t1 + BX + x, a single cell of stream Ti arrives at the

regulator.

Note that ALG releases the first cell of stream Ti by time t1, the last cell of stream Ti cannot

be sent prior to time t′, and Ti consists of B + 1 cells. Let s be the schedule produced by ALG. It


follows that

J (ALG, Ti) ≥ tls(ciB, T ) − tls(ci0, T ) − (B − 0)X
≥ t′ − t1 − BX
≥ x + t1 + BX − t1 − BX = x,

which can be arbitrarily large. It follows also that MJ (ALG, T ) ≥ x. On the other hand, note

that for any choice of x, the optimal max-jitter possible with a buffer of size B is zero: Every cell

of a stream other than Ti is released immediately upon its arrival, and for every 0 ≤ j ≤ B, cell cij ∈ Ti is released in time t′ − (B − j)X . Since every stream other than Ti does not consume any

buffer space, it is easy to verify that there are at most B cells in the buffer at any time. Clearly,

every stream attains zero jitter by this release schedule.

Theorem 29 implies that if the buffer size is smaller than MB then there are scenarios in which

an optimal schedule attains zero jitter for all streams, while any online algorithm can be forced

to produce a schedule with arbitrarily large max-jitter. This fact immediately implies that even if

the objective is to minimize the average jitter obtained by the different streams, the same lower

bound holds. Since the online algorithm, which statically partitions the buffer, minimizes the jitter

of each stream independently, it clearly minimizes the overall average jitter as well, thus providing

a matching upper bound (up to a factor of 2).

Moreover, we are able to prove a more general lower bound:

Theorem 30 For every online algorithm ALG with an internal buffer size smaller than

max{ MB, M(B − 1) + B + 1 },

there exists a traffic consisting of M streams, such that ALG attains max-jitter strictly greater than

the optimal jitter possible with a buffer of size B.

Proof. Let ALG be an online algorithm with a buffer of size at most M(B− 1) + B. Consider the

following traffic T : For every 0 ≤ k ≤ B− 1, M cells arrive at the regulator at time k ·X , one for

every stream. The traffic stops if ALG releases a cell before time t′ = BX .

If ALG releases a cell before time t′, the claim follows from the proof of Theorem 29.


Therefore, assume now that ALG does not release any cells before time t′, implying that in

time t′ there are MB cells in the buffer. Note that this implies that ALG has buffer size at least

MB. Consider the following continuation for T : In time t′, B + 1 cells of stream T1 arrive at the

regulator.

Since ALG has a buffer of size at most M(B−1)+B = MB +B−M , and before time t′ ALG

has not released any of the MB + B + 1 cells in T , it must release at least M + 1 cells in time t′.

By the pigeonhole principle it follows that at least two of the released cells correspond to the same

stream, say Ti. Therefore, the release schedule produced by ALG, denoted by s, has max-jitter of

at least the jitter attained by stream Ti:

tls(ci0, T ) − tls(ci1, T ) − (0 − 1)X = t′ − t′ − (0 − 1)X = X.

Hence, MJ (ALG, T ) is strictly greater than zero. On the other hand, the optimal max-jitter possible

with a buffer of size B is zero. To see this, consider the following release schedule: Every cell of

a stream other than T1 is released immediately upon its arrival, and for every 0 ≤ j ≤ 2B, cell

c1j ∈ T1 is released in time t′ − (B − j)X . Similarly to the previous case, every stream obtains

a zero jitter by this release schedule, and no more than B cells are stored simultaneously in the

buffer since every cell, except the last B cells of stream T1, is sent immediately upon arrival, and

no two cells of the same stream are sent simultaneously.

Note that Theorem 30 requires a greater resource augmentation than Theorem 29; on the other hand, Theorem 29 implies unbounded competitiveness, whereas Theorem 30 only implies that no online algorithm can obtain the optimal jitter. Furthermore, the lower bound described

in Theorem 30 exactly coincides with the result of the single stream model (i.e., M = 1) [100].

6.4 An Efficient Offline Algorithm

This section presents an efficient offline algorithm that generates a release schedule with optimal

max-jitter.

Given a traffic T that is an interleaving of M streams, consider a total order π = (c′0, . . . , c′n)


on the release schedule of cells in T that respects the FIFO order in each stream separately. The

release schedule, which attains the optimal max-jitter and respects π, can be found using similar

arguments to the ones in [100, Algorithm A]: Cell c′j can be stored in the buffer only until cell

c′j+B arrives, imposing strict bounds on the release time of each cell. In particular, it follows that

for every traffic T , there exists an optimal release schedule. Unfortunately, it is computationally

intractable to enumerate over all possible total orders, hence a more sophisticated approach should

be considered.

We first discuss properties of schedules that achieve optimal max-jitter. Then, we show that these properties make it possible to find an optimal schedule in polynomial time.

For every cell cj ∈ T, one can intuitively consider t = ta(cj)− jX as the time at which c0 ∈ T

should be sent, so that cj ∈ T is sent immediately upon its arrival, in a perfectly periodic release

schedule. For any stream T , denote by τ(T ) = maxj { ta(cj) − jX | cj ∈ T }. From a geometric

point of view, τ(T ) is a lower bound on the intersection between the time axis and the right margin

of any release band (see Figure 21(a)), since otherwise the cell defining τ(Ti) would have to be

released prior to its arrival.
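The quantity τ(T ) can be computed in a single linear pass over the arrival times of the stream; a small sketch (illustrative names, not the thesis' code):

```python
# Sketch: tau(T) = max_j { ta(c_j) - j * X }, computed from arrivals alone.

def tau(arrivals, X):
    return max(t - j * X for j, t in enumerate(arrivals))
```

For a stream that arrives perfectly periodically starting at time t0, every term equals t0, so τ(T ) = t0; any transient network delay can only increase the maximum.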

Given a release schedule s for a traffic T , a stream Ti ⊆ T is said to be aligned in s if there is no cell cik ∈ Ti such that tls(cik, T ) > τ(Ti) + kX . Clearly, if Ti is aligned in s, then the cell cij ∈ Ti that defines τ(Ti) satisfies tls(cij, T ) = ta(cij). Geometrically, the right margin of a release band corresponding to an aligned stream Ti intersects the time axis in point 〈τ(Ti), 0〉 (see Figure 21(b)).

A release schedule s for traffic T is said to be aligned, if every stream is aligned in s. The

following lemma shows that one can iteratively align the streams of an optimal schedule without

increasing the overall jitter:

Lemma 15 For every traffic T , there exists an optimal aligned schedule s.

Proof. Given an optimal schedule s′ for traffic T with ℓ < M aligned streams, we prove that s′ can be changed into an aligned schedule (i.e., with M aligned streams), maintaining its optimality. We first show that s′ can be altered into an optimal schedule with ℓ + 1 aligned streams. Let Ti


Figure 21: Geometric view of the right margin of the release band: (a) a non-aligned schedule; (b) an aligned schedule. Arrivals are marked by dotted circles and releases by full circles.

be one of the non-aligned streams in s′, and consider the following schedule s:

tls(ckj , T ) =
min{ tls′(ckj , T ), τ(Ti) + jX }   if k = i
tls′(ckj , T )   if k ≠ i

Clearly, for every stream other than Ti the schedule remains unchanged; therefore, it suffices to consider only stream Ti. Since tls′(cij, T ) ≥ ta(cij) and τ(Ti) + jX ≥ ta(cij), s is a release schedule, and it can easily be verified that s satisfies the FIFO constraint. Schedule s is B-feasible, since s′ is B-feasible and for any cell cij ∈ Ti, tls(cij, T ) ≤ tls′(cij, T ). Stream Ti is aligned in s, since every cell cij ∈ Ti satisfies tls(cij, T ) ≤ τ(Ti) + jX . Hence, s has ℓ + 1 aligned streams.

In order to prove that s is optimal, it suffices to show that tls(cij, T ) − tls(cim, T ) − (j − m)X ≤ J (s′, Ti) for every two cells cij, cim ∈ Ti.

Assume without loss of generality that tls(cij, T ) − jX ≥ tls(cim, T ) − mX ; if this does not hold, then the term above is negative, and we can simply swap the roles of cij and cim.

We distinguish between four possible cases. In the first case, where tls(cij, T ) = tls′(cij, T ) and tls(cim, T ) = tls′(cim, T ), the result follows immediately from the definition of J (s′, Ti). In the second case, where tls(cij, T ) = τ(Ti) + jX and tls(cim, T ) = τ(Ti) + mX , the term is zero, which is at most J (s′, Ti) by definition. The third case to consider is when tls(cij, T ) = tls′(cij, T ) and tls(cim, T ) = τ(Ti) + mX . This implies that tls′(cij, T ) < τ(Ti) + jX , thus tls′(cij, T ) − jX < τ(Ti). Therefore tls(cij, T ) − jX = tls′(cij, T ) − jX < τ(Ti) = tls(cim, T ) − mX , contradicting the assumption on cij and cim. The last case to consider is when tls(cij, T ) = τ(Ti) + jX and tls(cim, T ) = tls′(cim, T ). Similarly to the previous case, this implies that τ(Ti) < tls′(cij, T ) − jX , and therefore tls(cij, T ) − tls(cim, T ) − (j − m)X = τ(Ti) − (tls′(cim, T ) − mX) < tls′(cij, T ) − jX − (tls′(cim, T ) − mX) ≤ J (s′, Ti), as required.

Applying the same arguments repeatedly alters schedule s′ into an aligned schedule and pre-

serves its optimality.

Next we show that the optimality of a schedule s is maintained even if cells that are stored in

the buffer are released earlier, as long as their new release time satisfies the FIFO order and remains

within a release band of width MJ (s, T ):

Lemma 16 Let s be an optimal schedule for traffic T . Then, for every stream Ti ⊆ T and for

every J ∈ [J (s, Ti),MJ (s, T )], the new schedule s′ defined by

tls′(ckj , T ) =
max{ ta(ckj ), τ(Ti) − J + jX }   if k = i
tls(ckj , T )   if k ≠ i

is B-feasible and MJ (s′, T ) = MJ (s, T ). Furthermore, if s is aligned then so is s′.

Proof. Since s′ only changes the release schedule of stream Ti, it clearly preserves the FIFO order

and jitter of each stream other than Ti.

We first show that s′ respects the FIFO order of cells in Ti. Let cij be any cell in Ti. If tls′(cij, T ) = ta(cij), then its release time is at most ta(cij+1) ≤ tls′(cij+1, T ). Otherwise, tls′(cij, T ) = τ(Ti) − J + jX ≤ τ(Ti) − J + (j + 1)X ≤ tls′(cij+1, T ).

In order to bound the max-jitter of s′, it suffices to show that J (s′, Ti) ≤ MJ (s, T ). Consider any pair of cells cia, cib ∈ Ti. By the definition of s′, tls′(cia, T ) ≥ τ(Ti) − J + aX . On the other hand, tls′(cib, T ) = max{ ta(cib), τ(Ti) − J + bX } ≤ τ(Ti) + bX , since ta(cib) ≤ τ(Ti) + bX by the definition of τ(Ti). Hence, tls′(cib, T ) − tls′(cia, T ) ≤ J + (b − a)X , which implies that J (s′, Ti) = maxa,b { tls′(cib, T ) − tls′(cia, T ) − (b − a)X } ≤ J ≤ MJ (s, T ).


Assume by way of contradiction that s′ is not B-feasible, and let t be any time at which a set P of more than B cells is stored in the buffer. Since the release schedule of any stream Tk other than Ti is identical under both s and s′, every cell ckj ∈ P with k ≠ i is also stored in the buffer at time t under schedule s. Note that no cell in P is released upon its arrival. Hence, for every cell cij ∈ P ,

tls′(cij, T ) = τ(Ti) − J + jX   (by the definition of s′)
≤ τ(Ti) − J (s, Ti) + jX   (since J ∈ [J (s, Ti),MJ (s, T )])
= ta(cik) − kX − J (s, Ti) + jX   (for cik ∈ Ti defining τ(Ti))
≤ tls(cik, T ) − (k − j)X − J (s, Ti)   (since ta(cik) ≤ tls(cik, T ))
≤ tls(cik, T ) − (k − j)X − ( tls(cik, T ) − tls(cij, T ) − (k − j)X ) = tls(cij, T )   (by the definition of J (s, Ti)).

Therefore, all cells cij ∈ P are stored in the buffer at time t under schedule s as well, contradicting the B-feasibility of s.

We conclude the proof by showing that if s is aligned then s′ is also aligned. For any stream Tk ≠ Ti, schedules s and s′ are identical on Tk, and therefore Tk is aligned in s′. Assume by way of contradiction that Ti is not aligned in s′; then there is a cell cij ∈ Ti such that tls′(cij, T ) > τ(Ti) + jX . Since tls′(cij, T ) = max{ ta(cij), τ(Ti) − J + jX }, it follows that ta(cij) > τ(Ti) + jX , contradicting the maximality of τ(Ti).

By iteratively applying Lemma 16 with J = MJ (s, T ) on all streams, we get:

Corollary 31 Given an optimal aligned schedule s for traffic T , the schedule s′ defined by

tls′(ckj , T ) = max{ ta(ckj ), τ(Tk) − MJ (s, T ) + jX }

is an optimal aligned schedule.

The following lemma bounds from below the release time of cells in a schedule. Intuitively,

this lemma defines the left margin of the release band.


Lemma 17 For any schedule s for traffic T , every stream Ti ⊆ T , and every cell cij ∈ Ti, tls(cij, T ) ≥ τ(Ti) − J (s, Ti) + jX .

Proof. Assume by way of contradiction that there exist a stream Ti and a cell cij ∈ Ti such that tls(cij, T ) < τ(Ti) − J (s, Ti) + jX . Let cik ∈ Ti be the cell defining τ(Ti). Since tls(cik, T ) ≥ ta(cik),

J (s, Ti) ≥ tls(cik, T ) − tls(cij, T ) − (k − j)X
> ta(cik) − (τ(Ti) − J (s, Ti) + jX) − (k − j)X
= ta(cik) − (ta(cik) − kX) + J (s, Ti) − jX − kX + jX = J (s, Ti),

which is a contradiction.

Lemma 17 indicates an important property of aligned optimal schedules. In such schedules,

the jitter of any stream can be characterized by the release time of a single cell, as depicted in the

following corollary:

Corollary 32 For any aligned schedule s for traffic T and every stream Ti ⊆ T ,

J (s, Ti) = maxj { τ(Ti) − tls(cij, T ) + jX }.

The following lemma shows that at least one of the widest release bands, corresponding to

some stream Ti attaining the max-jitter, has its left margin determined by the following event: An

arrival of a cell causing a buffer overflow (possibly of another stream), which necessitates some

cell of Ti to be released earlier than desired.

Lemma 18 Let s be an aligned optimal schedule for traffic T . There exist a stream Ti ⊆ T that attains the max-jitter and a cell cij ∈ Ti such that tls(cij, T ) = τ(Ti) − MJ (s, T ) + jX and tls(cij, T ) = ta(cℓ) for some cell cℓ ∈ T .

Proof. We show by contradiction that if the claim does not hold for an optimal aligned schedule,

then such a schedule can be altered into a new schedule with max-jitter strictly less than the original

schedule. Formally, consider an aligned optimal schedule s for T . Let T ′ = { Ti | J (s, Ti) = MJ (s, T ) }, and for every Ti ∈ T ′, let Pi = { cij ∈ Ti | tls(cij, T ) = τ(Ti) − MJ (s, T ) + jX }. From a geometric point of view, Pi consists of all the cells in Ti whose release time lies on the left margin of Ti's release band. Finally, let P = ⋃Ti∈T ′ Pi. Assume by way of contradiction that for every cj ∈ P , there is no cell cℓ ∈ T such that tls(cj, T ) = ta(cℓ).

Note first that in such a case, MJ (s, T ) > 0: otherwise, since s is aligned, for each stream Ti the cell cik defining τ(Ti) satisfies both tls(cik, T ) = ta(cik) and tls(cik, T ) = τ(Ti) − 0 + kX , so cik ∈ Pi is released exactly upon its arrival, contradicting the assumption.

The altered schedule s′ is obtained by postponing the release of all the cells in P for some

positive amount of time. As we shall prove, schedule s′ is B-feasible, and has a max-jitter strictly

less than MJ (s, T ), contradicting the optimality of s.

For each cell cik ∈ P that is the j'th cell of T (i.e., cik = cj), the exact amount of postponement is determined by the following constraints:

1. Avoiding buffer overflow: do not postpone beyond the first arrival of a cell after tls(cj, T ). This constraint is captured by

δ(cj) =
mincℓ: ta(cℓ)>tls(cj ,T ) { ta(cℓ) − tls(cj, T ) }   if cj is not the last cell in T
∞   otherwise.

2. Maintaining FIFO order: recalling that cj = cik, do not postpone beyond tls(cik+1, T ). This constraint is captured by

ε(cj) =
tls(cik+1, T ) − tls(cik, T )   if cj is not the last cell in Ti
∞   otherwise.

Let δ = mincj∈P δ(cj) and ε = mincj∈P ε(cj), capturing the amounts of time that satisfy these constraints for all cells in P .

It is important to notice that δ and ε are strictly greater than zero: δ > 0 by its definition, and ε > 0 since if there were a cell cik ∈ Pi such that tls(cik, T ) = tls(cik+1, T ), then tls(cik+1, T ) = τ(Ti) − MJ (s, T ) + kX < τ(Ti) − MJ (s, T ) + (k + 1)X , contradicting Lemma 17.


For the purpose of analysis, define for every stream Ti ∈ T ′,

ρ(Ti) = mincik∈Ti\Pi { tls(cik, T ) − (τ(Ti) − MJ (s, T ) + kX) }.

ρ(Ti) captures how far the rest of the stream is from the left margin. Since J (s, Ti) > 0 for any Ti ∈ T ′, the set Ti \ Pi is not empty, so ρ(Ti) is positive and finite. Let ρ = minTi∈T ′ ρ(Ti). It follows that ρ is also positive and finite.

Let ∆ = min{ δ, ε, ρ }, and consider the following schedule that, as we shall prove, attains a max-jitter strictly smaller than MJ (s, T ):

tls′(cj, T ) =
tls(cj, T ) + ∆/2   if cj ∈ P
tls(cj, T )   otherwise

Note that this schedule is well-defined since ∆ is positive and finite.

We first prove that s′ is B-feasible and maintains FIFO order. Assume by way of contradiction that s′ is not B-feasible, and let t be the first time at which the number of cells in the buffer exceeds B. By the minimality of t, there exists a cell that arrives at time t. For every cell cj ∈ P , no cell arrives at the buffer in the interval [tls(cj, T ), tls(cj, T ) + ∆/2] because ∆ ≤ δ(cj), implying that t is not in any such interval. But then the definition of s′ yields that the content of the buffer at time t is the same under schedules s and s′, contradicting the B-feasibility of s. The FIFO order of s′ is maintained since ∆ ≤ ε(cj) for every cj ∈ P .

We conclude the proof by showing that MJ (s′, T ) < MJ (s, T ). Consider any Ti ∈ T ′ and any cik ∈ Ti. If cik ∈ P then, by the definition of s′ and Lemma 17, tls′(cik, T ) = tls(cik, T ) + ∆/2 ≥ τ(Ti) − MJ (s, T ) + kX + ∆/2. The same holds for cik ∉ P : since ρ(Ti) ≥ ∆ > ∆/2, it follows that tls′(cik, T ) = tls(cik, T ) ≥ τ(Ti) − MJ (s, T ) + kX + ρ(Ti) > τ(Ti) − MJ (s, T ) + kX + ∆/2. Hence, for every cik,

τ(Ti) − tls′(cik, T ) + kX ≤ τ(Ti) − (τ(Ti) − MJ (s, T ) + kX + ∆/2) + kX
= MJ (s, T ) − ∆/2
< J (s, Ti).


Figure 22: Outline of arrivals (dotted circles) and releases (full circles) for cells of the stream Ti that attains the max-jitter, in an aligned release schedule, as discussed in Corollary 31 and in Lemma 18. The square represents an arrival of some cell in T causing a buffer overflow.

By Corollary 32, J (s′, Ti) < J (s, Ti) for any stream Ti ∈ T ′. The jitter of any other stream

remains unchanged, therefore MJ (s′, T ) < MJ (s, T ), contradicting the optimality of s.

Finally, we conclude this section by showing that there exists a polynomial-time algorithm

that finds an optimal schedule for the multi-stream max-jitter problem. Algorithm 5 depicts the

pseudo-code of this algorithm.

Theorem 33 Algorithm 5 finds an optimal schedule for the multi-stream max-jitter problem. Its time complexity is O(n³), where n = ∑M i=1 |Ti|.

Proof. Assume a feasible schedule exists. Lemma 18 implies that there exist an optimal schedule s and a stream Ti such that MJ (s, T ) = τ(Ti) − ta(cℓ) + kX , for some cells cik ∈ Ti and cℓ ∈ T . Note that for any stream Ti, the value of τ(Ti) can be computed in linear time, using only the arrival times of the cells of stream Ti (see Algorithm 5, lines 18–19).

It follows that by enumerating over all possible choices of pairs (cik, cℓ), one can find the collection of possible values of the optimal jitter (see Algorithm 5, function OFFLINE).

For every such value J , computing an aligned release schedule attaining jitter J and verifying that it is B-feasible can be done in linear time, by checking the feasibility of the schedule defined in Corollary 31 assuming MJ (s, T ) = J (see Algorithm 5, functions COMPUTESCHEDULE and ISFEASIBLE).


Algorithm 5 Algorithm OFFLINE

1: boolean function ISFEASIBLE(traffic T , schedule s, buffer size B)
2:   BufferOccupancy ← 0
3:   T ′ ← MERGE(T , s) ▷ for identical values, those of s appear before those of T
4:   for every e ∈ T ′ in increasing order do
5:     if e ∈ T then
6:       BufferOccupancy ← BufferOccupancy + 1
7:     else ▷ i.e., e ∈ s
8:       BufferOccupancy ← BufferOccupancy − 1
9:     end if
10:    if BufferOccupancy > B then
11:      return FALSE
12:    end if
13:  end for
14:  return TRUE
15: end function

16: schedule function COMPUTESCHEDULE(traffic T , jitter J)
17:   s ← NULL
18:   for every stream Ti ∈ T do
19:     τ(Ti) ← maxj { ta(cij) − jX } ▷ defines the right margin
20:   end for
21:   for every cell cij ∈ T do
22:     tls(cij, T ) ← max{ ta(cij), τ(Ti) − J + jX }
23:   end for
24:   return s
25: end function

26: schedule function OFFLINE(traffic T , buffer size B)
27:   Array of jitter values MinJitter[M ]; initially all ∞
28:   Array of schedules MinSchedule[M ]; initially all NULL
29:   for every stream Ti ∈ T do
30:     τ(Ti) ← maxj { ta(cij) − jX }
31:     for every cij ∈ Ti do
32:       for every ck ∈ T do
33:         if ta(ck) ≥ ta(cij) and ta(ck) ≤ τ(Ti) + jX then
34:           λ(Ti) ← ta(ck) − jX
35:           s ← COMPUTESCHEDULE(T , τ(Ti) − λ(Ti))
36:           if ISFEASIBLE(T , s, B) and MinJitter[i] > τ(Ti) − λ(Ti) then
37:             MinJitter[i] ← τ(Ti) − λ(Ti)
38:             MinSchedule[i] ← s
39:           end if
40:         end if
41:       end for
42:     end for
43:   end for
44:   ℓ ← arg min MinJitter ▷ the value of (any) argument i for which MinJitter[i] is minimal
45:   return MinSchedule[ℓ]
46: end function
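A direct, unoptimized Python rendering of Algorithm 5 may clarify the control flow. The data layout (per-stream arrival-time lists, schedules as dictionaries keyed by stream id) and all identifiers below are illustrative assumptions, not the thesis' code, and no attempt is made to improve on the cubic enumeration.

```python
# Sketch of Algorithm 5: streams share the inter-release time X, and a
# schedule maps each stream id to its cells' release times.
import math

def is_feasible(all_arrivals, schedule, B):
    """ISFEASIBLE: sweep the merged arrival (+1) and release (-1) events;
    on identical times, releases are processed before arrivals."""
    events = [(t, 1) for t in all_arrivals]
    events += [(t, -1) for rel in schedule.values() for t in rel]
    events.sort(key=lambda e: (e[0], e[1]))
    occupancy = 0
    for _, delta in events:
        occupancy += delta
        if occupancy > B:
            return False
    return True

def compute_schedule(streams, X, J):
    """COMPUTESCHEDULE: release c^i_j at max{ta(c^i_j), tau(T_i) - J + j*X}."""
    schedule = {}
    for i, arrivals in streams.items():
        tau_i = max(t - j * X for j, t in enumerate(arrivals))
        schedule[i] = [max(t, tau_i - J + j * X)
                       for j, t in enumerate(arrivals)]
    return schedule

def offline(streams, X, B):
    """OFFLINE: enumerate the candidate jitter values tau(T_i) - (ta(c_k) - j*X)
    identified by Lemma 18, and keep the smallest B-feasible one."""
    all_arrivals = [t for arrivals in streams.values() for t in arrivals]
    best_jitter, best_schedule = math.inf, None
    for i, arrivals in streams.items():
        tau_i = max(t - j * X for j, t in enumerate(arrivals))
        for j, ta_ij in enumerate(arrivals):
            for ta_k in all_arrivals:
                if ta_ij <= ta_k <= tau_i + j * X:
                    candidate = tau_i - (ta_k - j * X)
                    s = compute_schedule(streams, X, candidate)
                    if is_feasible(all_arrivals, s, B) and candidate < best_jitter:
                        best_jitter, best_schedule = candidate, s
    return best_jitter, best_schedule
```

The triple loop in `offline` mirrors lines 29–43 of the pseudocode, and the linear-time feasibility check per candidate gives the O(n³) bound of Theorem 33.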


Chapter 7

Conclusions

At the heart of the Internet, and crucially shaping its performance, are routers and switches. The communication needs of current applications necessitate that switches (routers) have high speed and high port-count. This dissertation has taken a competitive approach and evaluated the

performance of a switch by comparison to the performance of an ideal switch.

Chapter 4 focused on the Parallel Packet Switch architecture and presented tight bounds on the average relative queuing delay, which hold with high probability even if randomization is used. This generally implies that, unlike in other load-balancing problems, randomization does not reduce the relative queuing delay. Our lower bounds rely on the switches being FCFS, but can be generalized to other priority schemes.

We believe that the techniques presented in this dissertation can be applied to other switch architectures to show that randomization does not decrease the relative queuing delay in these architectures either. In contrast, it is important to note that randomized algorithms may decrease the complexity of the packet scheduling process, which is one of the primary performance bottlenecks of contemporary switches.

We also introduced a novel class of demultiplexing algorithms for the PPS (called u-RT algorithms) that use local information together with complete global information older than u time-slots. It is important to design u-RT demultiplexing algorithms that exchange a small, practical amount of information (e.g., credits) between demultiplexors. Moreover, it is important to study the performance of synchronized demultiplexing algorithms that share only a global clock (these algorithms are currently classified as 1-RT demultiplexing algorithms).

In addition, we presented lower bounds for two extensions of the PPS model: input-buffered PPS and recursive PPS. An interesting direction for future research is to devise efficient demultiplexing algorithms for these architectures that beat the lower bounds presented for the bufferless PPS model.

Chapter 5 investigated packet-mode scheduling in CIOQ switches. We showed that even though packet-mode scheduling imposes very confining restrictions on scheduling algorithms, a speedup arbitrarily close to 2 suffices to mimic an ideal shadow switch, provided some relative queuing delay can be tolerated. This result matches the lower bound for cell-based scheduling, implying that, somewhat surprisingly, no additional speedup is required to keep packets contiguous over the switch fabric.

We studied the trade-off between the relative queuing delay and the speedup of the switch by presenting upper bounds on the speedup required to achieve a given relative queuing delay, leaving the question of their optimality for future research. Note that by Theorem 21, Lmax/2 − 3 is a lower bound on the relative queuing delay, regardless of the switch speedup.

It is also interesting to explore packet-mode scheduling in the PPS architecture. Such a packet-mode PPS ensures that all cells of the same packet are delivered contiguously to the output-port, eliminating the need for reassembly buffers in the output-port. However, since the middle-stage switches operate at a lower rate, it is impossible to deliver consecutive cells through the same plane; therefore, unlike in CIOQ switches, contiguous packet delivery over the switch fabric is not required.

Iyer et al. [74, Theorem 5] showed that a speedup S ≥ 3 suffices for a PPS to emulate a PIFO OQ switch using a centralized algorithm. Notice that a packet-mode PPS is not required to maintain contiguous packet delivery over its fabric. Thus, as discussed in Chapter 5 (Page 86), this result implies that a packet-mode PPS can emulate an ideal shadow switch (with no relative queuing delay).1

1 The algorithm described in [74, Theorem 5] requires that the departure time of a cell c is known upon its arrival. This can easily be satisfied, since we assumed that the size of a packet is known upon the arrival of its first cell.

The question of whether fully-distributed or u-RT demultiplexing algorithms can


provide such emulation, and what their relative queuing delay would be, is left for future research.

Chapter 6 examined the problem of jitter regulation and, specifically, the trade-off between the buffer size available at the regulator and the optimal jitter attainable using such a buffer. We dealt with the realistic case in which the regulator must handle many streams concurrently, with the objective of minimizing the maximum jitter attained by any of these streams.

Since real-life networks clearly have finite-capacity links, it is also interesting to investigate the behavior of a jitter regulator that handles multiple streams simultaneously while its outgoing links have bounded bandwidth (either shared among all streams or dedicated to each stream separately). In addition, since regulators might be allowed to drop cells, it is of interest to examine the relations between buffer size, optimal jitter, and drop ratio.

Furthermore, it is important to incorporate jitter regulation within existing switching architectures, so that traffic leaves the switch already shaped. This is straightforward when cells are immediately available in the output buffer upon their arrival (e.g., in OQ switches). However, in other switch architectures (e.g., in CIOQ switches), jitter regulation and scheduling must be integrated. Such mechanisms are of special interest since they cope with both the switching bottleneck and the packet scheduling bottleneck simultaneously (recall Figure 1).

In a broader context, it is appealing to apply the techniques and methodologies presented in this dissertation to other (existing and future) switch architectures. A prominent example is the buffered crossbar architecture (see Section 1.1, Page 9), which has drawn a lot of attention recently. Even though some competitive evaluations of this architecture already exist in the literature (e.g., [38, 99, 138]), many questions are left unresolved. Among these, perhaps the most appealing is to fully understand the relations between the speedup of the switch, the size of the buffers in the crosspoints, and the relative queuing delay.

This dissertation considers only unicast traffic, in which each cell is destined for a single output-port. Given the tremendous growth in video relay usage in recent years, multicast traffic is becoming crucial. In multicast traffic, each cell has a set of f destinations (called its fanout set) to which it should be delivered. A trivial way to support such multicast traffic is to replicate each cell f times and treat the copies as unicast cells. However, as the size of the fanout set increases, this approach becomes infeasible. On the positive side, common switch architectures have built-in


multicast capabilities; a prominent example is the CIOQ switch, in which a multicast cell can be delivered to several output-ports in a single time-slot (regardless of the switch speedup). Therefore, a promising direction for future research is to investigate the ability of a switch with built-in multicast capabilities to handle multicast traffic while providing good QoS guarantees, such as small delay and small jitter.
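To make the cost of the naive approach concrete, here is a small back-of-the-envelope sketch (our own illustration with a deliberately simplified fabric model, not a construction from the dissertation): with unicast replication a cell crosses the fabric once per destination in its fanout set, whereas a fabric with built-in multicast delivers it to all destinations in a single crossing.

```python
def fabric_crossings(fanout_sets, builtin_multicast):
    """Fabric crossings needed at one input to deliver each cell to its
    fanout set: one crossing per cell with built-in multicast, and one per
    (cell, destination) pair with naive unicast replication."""
    if builtin_multicast:
        return len(fanout_sets)
    return sum(len(f) for f in fanout_sets)
```

Two cells with fanout sets {1, 2, 3} and {4} need four crossings under replication but only two with built-in multicast; as fanout sets grow, the replication cost grows linearly with f while the multicast cost stays fixed per cell.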

There are also several ways to extend the competitive model described in this dissertation and potentially gain further insight into the behavior of feasible switch architectures.

In order to provide an exact emulation, our model requires that at each time-slot exactly the same cells be transmitted by the investigated switch and by the shadow switch. We further relaxed the model to allow a bounded relative queuing delay between the switches, yet still required that the order of cells (determined by a priority scheme) be preserved. It is of theoretical interest to investigate whether relaxing these requirements further, so that cells may be mis-sequenced, can reduce the relative queuing delay or the speedup required for such emulation. Note that under such a setting the switch is still required to achieve the same throughput and overall delay guarantees as the shadow switch (for example, by being work-conserving).

Another extension of our model involves incorporating cell drops. In practice, buffer sizes are bounded, and cells that cannot be stored in the buffers are dropped. When a competitive approach is used to evaluate switch architectures or scheduling policies with bounded buffer sizes, a multiplicative competitive ratio is used to evaluate cell losses. Namely, the competitive ratio C is the ratio between the number of cells transmitted by an optimal algorithm (often referred to as an adversary) and the number of cells transmitted by the evaluated algorithm. However, these evaluations usually assume that the adversary uses an offline algorithm, which has complete knowledge of future arrivals, while operating under the same architecture. Since offline algorithms are not realistic, these results tend to be overly pessimistic.
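As a toy illustration of this multiplicative measure (a sketch under our own simplified assumptions, not a construction from the dissertation), consider a drop-tail FIFO that receives a burst of cells and transmits at most one cell per time-slot, compared against the same queue with a larger buffer standing in for the adversary's delivered-cell count.

```python
def delivered_droptail(arrivals, B):
    """Cells eventually transmitted by a drop-tail FIFO with buffer B:
    arrivals[t] cells arrive at time-slot t, at most one cell departs per
    time-slot, and arrivals overflowing the buffer are dropped."""
    queued, delivered = 0, 0
    for a in arrivals:
        queued = min(B, queued + a)   # excess arrivals are lost
        if queued:                    # one departure per time-slot
            queued -= 1
            delivered += 1
    return delivered + queued         # remaining cells drain eventually

def competitive_ratio(opt_delivered, alg_delivered):
    """Multiplicative competitive ratio C for cell losses: cells transmitted
    by the optimal (adversary) algorithm over cells transmitted by the
    evaluated algorithm."""
    return float('inf') if alg_delivered == 0 else opt_delivered / alg_delivered
```

For a burst of 3 cells into a buffer of size 2, only 2 cells get through, while a buffer of size 3 delivers all three; the resulting ratio 3/2 quantifies the loss of the smaller-buffered switch on this trace.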

It is also worthwhile to consider a model in which the adversary operates under an ideal switch architecture with a given (bounded) buffer size and without any knowledge of future arrivals. Under such a model, one can compare a given switch with its adversary by evaluating the relative queuing delay, the relative loss ratio, and the trade-offs between these two objectives. We believe that such evaluations may indicate important and realistic design choices. As a first step towards this


goal, we showed how a packet-mode CIOQ switch can mimic an ideal switch with bounded buffers (see Section 5.5).


Bibliography

[1] Inverse Multiplexing over ATM (IMA): A Breakthrough WAN Technology for Corporate

Networks. 3Com Corporation, 1997.

[2] A. Adas. Traffic models in broadband networks. IEEE Communications Magazine, 35(7):

82–89, July 1997.

[3] G. Aggarwal, R. Motwani, D. Shah, and A. Zhu. Switch scheduling via randomized edge

coloring. In 44th Symposium on Foundations of Computer Science (FOCS), pages 502–513,

2003.

[4] E. Altman, Z. Liu, and R. Righter. Scheduling of an input-queued switch to achieve maximal

throughput. Probability in the Engineering and Informational Sciences, 14:327–334, 2000.

[5] T. E. Anderson, S. S. Owicki, J. B. Saxe, and C. P. Thacker. High-speed switch scheduling

for local-area networks. ACM Transactions on Computer Systems, 11(4):319–352, 1993.

[6] M. Andrews, B. Awerbuch, A. Fernandez, J. Kleinberg, T. Leighton, and Z. Liu. Universal

stability results for greedy Contention-Resolution protocols. Journal of the ACM, 48(1):

39–69, 2001.

[7] M. Arpaci and J. A. Copeland. Buffer Management for Shared-Memory ATM Switches.

IEEE Communications Surveys and Tutorials, 3(1), 2000.

[8] A. Aslam and K. J. Christensen. A parallel packet switch with multiplexors containing

virtual input queues. Computer Communications, 27(13):1248–1263, 2004.

[9] J. Aspnes. Randomized protocols for asynchronous consensus. Distributed Computing, 16

(2-3):165–175, 2003.


[10] R. Atkinson. RFC 1626: Default IP MTU for use over ATM AAL5, May 1994.

[11] Traffic Management Specification. The ATM Forum, March 1999. Version 4.1, AF-TM-

0121.000.

[12] Inverse Multiplexing for ATM (IMA) specification. The ATM Forum, March 1999. Version

1.1, AF-PHY-0086.001.

[13] H. Attiya and D. Hay. Randomization does not reduce the average delay in parallel packet

switches. In the 17th ACM Symposium on Parallelism in Algorithms and Architectures

(SPAA), pages 11–20, 2005.

[14] H. Attiya and D. Hay. The inherent queuing delay of parallel packet switches. IEEE Trans-

actions on Parallel and Distributed Systems, 17(9):1048–1056, 2006.

[15] H. Attiya, D. Hay, and I. Keslassy. Packet-mode emulation of output-queued switches. In

the 18th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages

138–147, 2006.

[16] R.Y. Awdeh and H.T. Mouftah. Survey of ATM switch architectures. Computer Networks

and ISDN Systems, 27(12):1567–1613, 1995.

[17] Y. Azar, A. Broder, A. Karlin, and E. Upfal. Balanced allocations. SIAM Journal on Com-

puting, 29(1):180–200, 1999.

[18] A. Bar-Noy, R. Bhatia, J. Naor, and B. Schieber. Minimizing service and operation costs of

periodic scheduling. Mathematics of Operations Research, 27(3):518–544, 2002.

[19] Y. Bartal, A. Fiat, and Y. Rabani. Competitive algorithms for distributed data management.

Journal of Computer and System Sciences, 51(3):341–358, 1995.

[20] S. Baruah, G. Buttazzo, S. Gorinsky, and G. Lipari. Scheduling periodic task systems to

minimize output jitter. In 6th International Conference on Real-Time Computing Systems

and Applications, pages 62–68, 1999.

[21] S. Ben-David, A. Borodin, R. Karp, G. Tardos, and A. Wigderson. On the power of ran-

domization in online algorithms. Algorithmica, 11(1):2–14, 1994.


[22] J.C.R. Bennett and H. Zhang. WF²Q: Worst-case fair weighted fair queueing. In IEEE INFOCOM, March 1996.

[23] J. C.R. Bennett and H. Zhang. Hierarchical packet fair queueing algorithms. IEEE/ACM

Transactions on Networking, 5(5):675–689, October 1997.

[24] A. Bianco, P. Giaccone, E. Leonardi, and F. Neri. A framework for differential frame-based

matching algorithms in input-queued switches. In IEEE INFOCOM, 2004.

[25] G. Birkhoff. Tres observaciones sobre el algebra lineal. Univ. Nac. Tucuman Rev. Ser. A, 5:

147–151, 1946.

[26] T. Blackwell, K. Chang, H.T. Kung, and D. Lin. Credit-based flow control for ATM networks. In Proc. of the First Annual Conference on Telecommunications R&D in Massachusetts, 1994.

[27] A. Borodin and R. El-Yaniv. Online Computation and Competitive Analysis. Cambridge

University Press, 1998.

[28] A. Borodin, J. Kleinberg, P. Raghavan, M. Sudan, and D. P. Williamson. Adversarial Queue-

ing Theory. Journal of the ACM, 48(1):13–38, 2001.

[29] J.Y. Le Boudec. Network calculus made easy. Technical Report EPFL-DI 96/218, École Polytechnique Fédérale de Lausanne (EPFL), 1996.

[30] S. Bradner. Internet RFC 1242: Benchmarking terminology for network interconnection

devices, July 1991.

[31] C. S. Chang, W. J. Chen, and H. Y. Huang. Birkhoff-von Neumann input buffered crossbar

switches. In IEEE INFOCOM, pages 1614–1623, 2000.

[32] C.S. Chang, D.S. Lee, and Y.S. Jou. Load balanced Birkhoff-von Neumann switches, part

I: one-stage buffering. Computer Communications, 25:611–622, 2002.

[33] K. Chang and H.T. Kung. Receiver-oriented adaptive buffer allocation in credit-based flow control for ATM networks. In IEEE INFOCOM, pages 239–252, 1995.


[34] A. Charny. Providing QoS guarantees in input buffered crossbar switches with speedup.

PhD thesis, Massachusetts Institute Of Technology, September 1998.

[35] A. Charny, P. Krishna, N.S. Patel, and R.J. Simcoe. Algorithms for providing bandwidth and

delay guarantees in input-buffered crossbars with speedup. In Sixth IEEE/IFIP International

Workshop on Quality of Service, pages 235–244, 1998.

[36] F. M. Chiussi, D. A. Khotimsky, and S. Krishnan. Generalized inverse multiplexing of

switched ATM connections. In IEEE Globecom, pages 3134–3140, 1998.

[37] S. Chuang, A. Goel, N. McKeown, and B. Prabhakar. Matching output queueing with a

combined input output queued switch. In IEEE INFOCOM, pages 1169–1178, 1999.

[38] S. Chuang, S. Iyer, and N. McKeown. Practical algorithms for performance guarantees in

buffered crossbars. In IEEE INFOCOM, 2005.

[39] ATM Switch Router Software Configuration Guide. Cisco Systems, Inc, 2001. 12.1(6)EY.

[40] Cisco 12000 Series Gigabit Switch Routers. Cisco Systems, Inc. Available online at: http://www.cisco.com/warp/public/cc/pd/rt/12000/prodlit/gsr_ov.pdf.

[41] C. Clos. A study of non-blocking switching networks. Bell System Technical Journal, pages

406–424, 1953.

[42] comScore Networks. comScore releases August U.S. Video Metrix rankings, October 2006.

Available online at: http://www.comscore.com/press/release.asp?press=1035.

[43] Cooperative Association for Internet Data Analysis (CAIDA). http://www.caida.org/.

[44] R.L. Cruz. A calculus for network delay, Part II: Network analysis. IEEE Transactions on Information Theory, pages 132–141, 1991.

[45] R.L. Cruz. A calculus for network delay, Part I: Network elements in isolation. IEEE Transactions on Information Theory, pages 114–131, 1991.

[46] J. Dai. A fluid-limit model criterion for instability of multiclass queueing networks. Annals

of Applied Probability, 6:751–757, 1996.


[47] J. G. Dai and B. Prabhakar. The throughput of data switches with and without speedup. In

IEEE INFOCOM, pages 556–564, 2000.

[48] W.J. Dally and B. Towles. Principles and practices of interconnection networks, chapter 7,

pages 145–158. Elsevier, 2004.

[49] B. Davie, A. Charny, J.C.R. Bennett, K. Benson, J.Y. Le Boudec, W. Courtney, S. Davari,

V. Firoiu, and D. Stiliadis. Internet RFC 3246: An expedited forwarding PHB (per-hop

behavior), March 2002.

[50] A. Demers, S. Keshav, and S. Shenker. Analysis and simulation of a fair queueing algorithm.

Journal of Internetworking Research and Experience, pages 3–26, October 1990.

[51] E.A. Dinic. Algorithm for solution of a problem of maximum flow in a network with power estimation. Soviet Mathematics Doklady, 11(5):1277–1280, 1970.

[52] L. Dulmage and I. Halperin. On a theorem of Frobenius-König and J. von Neumann’s game of hide and seek. Trans. Roy. Soc. Canada III, 49:23–29, 1955.

[53] J. Duncanson. Inverse multiplexing. IEEE Communications Magazine, 32(4):34–41, April

1994.

[54] V. Firoiu, J. Le Boudec, D. Towsley, and Z. Zhang. Theories and models for internet quality

of service. Proceedings of the IEEE, 90(9):1565–1591, September 2002.

[55] S. Floyd and V. Jacobson. Random early detection gateways for congestion avoidance.

IEEE/ACM Transactions on Networking, 1(4):397–413, 1993.

[56] P. Fredette. The past, present, and future of inverse multiplexing. IEEE Communications

Magazine, 32(4):42–46, April 1994.

[57] Y. Ganjali, A. Keshavarzian, and D. Shah. Input queued switches: cell switching vs. packet

switching. In IEEE INFOCOM, volume 3, pages 1651–1658, March 2003.

[58] P. Giaccone, B. Prabhakar, and D. Shah. Randomized scheduling algorithms for high-

aggregate bandwidth switches. IEEE Journal on Selected Areas in Communications, 21

(4):546–559, May 2003.


[59] P. Giaccone, E. Leonardi, B. Prabhakar, and D. Shah. Delay bounds for combined input-

output switches with low speedup. Performance Evaluation, 55(1-2):113–128, 2004.

[60] S.J. Golestani. A stop-and-go queueing framework for congestion management. In ACM

SIGCOMM, pages 8–18, 1990.

[61] G. Gonnet. Expected length of the longest probe sequence in hash coding searching. Journal

of the ACM, 28(2):289–304, April 1981.

[62] D. Guez, A. Kesselman, and A. Rosen. Packet-mode policies for input-queued switches.

In the 16th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages

93–102, 2004.

[63] D. Hay and G. Scalosub. Jitter regulation for multiple streams. In the 13th Annual European

Symposium on Algorithms (ESA), pages 496–507, 2005.

[64] H. Heffes and D. M. Lucantoni. Markov-modulated characterization of packetized voice

and data traffic and related statistical multiplexer performance. IEEE Journal on Selected

Areas in Communications, 4(6):856–868, 1986.

[65] M. Hluchyj and M. Karol. Queueing in high-performance packet switching. IEEE Journal

on Selected Areas in Communications, 6(12):1587–1597, December 1988.

[66] J.E. Hopcroft and R.M. Karp. An n^{5/2} algorithm for maximum matchings in bipartite graphs. SIAM Journal on Computing, 2(4):225–231, 1973.

[67] A. Hung, G. Kesidis, and N. McKeown. ATM input-buffered switches with guaranteed-rate

property. In IEEE ISCC, pages 331–335, 1998.

[68] S. Iyer. Analysis of a packet switch with memories running slower than the line rate. Mas-

ter’s thesis, Stanford University, May 2000.

[69] S. Iyer. Personal Communication, October 2006.

[70] S. Iyer and N. McKeown. Making parallel packet switches practical. In IEEE INFOCOM,

pages 1680–1687, 2001.


[71] S. Iyer and N. McKeown. On the speedup required for a multicast parallel packet switch. IEEE Communications Letters, 5(6):269–271, June 2001.

[72] S. Iyer and N. McKeown. Analysis of the parallel packet switch architecture. IEEE/ACM

Transactions on Networking, pages 314–324, 2003.

[73] S. Iyer and N. McKeown. Maximum size matchings and input queued switches. In 40th Annual Allerton Conference on Communication, Control, and Computing, October 2002.

[74] S. Iyer, A.A. Awadallah, and N. McKeown. Analysis of a packet switch with memories

running slower than the line rate. In IEEE INFOCOM, pages 529–537, 2000.

[75] P. Jayanti, K. Tan, and S. Toueg. Time and space lower bounds for nonblocking implemen-

tations. SIAM Journal on Computing, 30(2):438–456, 2000.

[76] C.R. Kalmanek, H. Kanakia, and S. Keshav. Rate-controlled servers for very high-speed

networks. In IEEE Globecom, pages 12–20, December 1990.

[77] K. Kar, T. Lakshman, D. Stiliadis, and L. Tassiulas. Reduced complexity input buffered switches. In HOT Interconnects, pages 13–20, 2000.

[78] M. Karol, M. Hluchyj, and S. Morgan. Input versus output queueing on a space-division

packet switch. IEEE Transactions on Communications, 35(12):1347–1356, December 1987.

[79] S. Keshav. An Engineering Approach to Computer Networking. Addison-Wesley Publishing

Co., 1997.

[80] I. Keslassy. Load Balanced Router. PhD thesis, Stanford University, June 2004.

[81] I. Keslassy and N. McKeown. Maintaining packet order in two-stage switches. In IEEE

INFOCOM, 2002.

[82] I. Keslassy, C.S. Chang, N. McKeown, and D.S. Lee. Optimal load-balancing. In IEEE

INFOCOM, 2005.

[83] I. Keslassy, M. Kodialam, T.V. Lakshman, and D. Stiliadis. On guaranteed smooth schedul-

ing for input-queued switches. IEEE/ACM Transactions on Networking, 13(6):1364–1375,

December 2005.


[84] A. Kesselman and Y. Mansour. Harmonic buffer management policy for shared memory

switches. In IEEE INFOCOM, 2002.

[85] D. Khotimsky. Personal Communication, January 2004.

[86] D. Khotimsky and S. Krishnan. Stability analysis of a parallel packet switch with bufferless

input demultiplexors. In IEEE International Conference on Communications (ICC), pages

100–106, 2001.

[87] D. Khotimsky and S. Krishnan. Evaluation of open-loop sequence control schemes for multi-path switches. In IEEE International Conference on Communications (ICC), volume 4, pages 2116–2120, 2002.

[88] L. Kleinrock. Queueing Systems, Volume II. John Wiley & Sons, 1975.

[89] H. Koga. Jitter regulation in an internet router with delay constraint. Journal of Scheduling,

4(6):355–377, 2001.

[90] D. König. Gráfok és alkalmazásuk a determinánsok és a halmazok elméletére. Matematikai és Természettudományi Értesítő, 34:104–119, 1916.

[91] P. Krishna, N.S. Patel, A. Charny, and R.J. Simcoe. On the speedup required for work-

conserving crossbar switches. IEEE Journal on Selected Areas in Communications, 17(6):

1057–1066, June 1999.

[92] H. T. Kung and A. Chapman. The FCVC (Flow-Controlled Virtual Channels) Proposal for

ATM Networks. In International Conference on Network Protocols, pages 116–127, 1993.

[93] H. T. Kung, T. Blackwell, and A. Chapman. Credit-based flow control for ATM networks:

credit update protocol, adaptive credit allocation and statistical multiplexing. In ACM SIG-

COMM, pages 101–114, 1994.

[94] E. Leonardi, M. Mellia, F. Neri, and M. A. Marsan. Bounds on average delays and queue size

averages and variances in input queued cell-based switches. In IEEE INFOCOM, volume 3,

pages 1095–1103, 2001.


[95] X. Li and M. Hamdi. On scheduling optical packet switches with reconfiguration delay.

IEEE Journal on Selected Areas in Communications, 21(7):1156–1164, 2003.

[96] Y. Li, S. Panwar, and H.J. Chao. Exhaustive service matching algorithms for input queued

switches. In IEEE Workshop on High Performance Switching and Routing, pages 253–258,

2004.

[97] M. Liu, S.H. Kuo, and W.C. Su. Performance analysis for a packet-mode combined input-output queued switch. In IEEE International Conference on Networking, Sensing and Control, pages 117–121, March 2004.

[98] Inverse Multiplexing for ATM, Expanding the Revenue Opportunities for Converged Ser-

vices over ATM. Lucent Technologies, 2001.

[99] R. B. Magill, C. E. Rohrs, and R. L. Stevenson. Output-queued switch emulation by fabrics

with limited memory. IEEE Journal on Selected Areas in Communications, 21(4):606–615,

2003.

[100] Y. Mansour and B. Patt-Shamir. Jitter control in QoS networks. IEEE/ACM Transactions

on Networking, 9(4):492–502, August 2001.

[101] M. A. Marsan, A. Bianco, P. Giaccone, E. Leonardi, and F. Neri. Packet-mode scheduling in

input-queued cell-based switches. IEEE/ACM Transactions on Networking, 10(5):666–678,

October 2002.

[102] M. A. Marsan, A. Bianco, P. Giaccone, E. Leonardi, and F. Neri. Multicast traffic in input-

queued switches: optimal scheduling and maximum throughput. IEEE/ACM Transactions

on Networking, 11(3):465–477, 2003.

[103] N. McKeown. The iSLIP scheduling algorithm for input-queued switches. IEEE/ACM Transactions on Networking, 7(2):188–201, 1999.

[104] N. McKeown, A. Mekkittikul, V. Anantharam, and J. Walrand. Achieving 100% through-

put in an input-queued switch. IEEE Transactions on Communications, 47(8):1260–1267,

August 1999.


[105] A. Mekkittikul. Scheduling Non-uniform Traffic in High Speed Packet Switches and Routers.

PhD thesis, Stanford University, November 1998.

[106] A. Mekkittikul and N. McKeown. A practical scheduling algorithm to achieve 100%

throughput in input-queued switches. In IEEE INFOCOM, volume 2, pages 792–799, April

1998.

[107] M.D. Mitzenmacher. The power of two choices in randomized load balancing. PhD thesis, University of California at Berkeley, Fall 1996.

[108] M.D. Mitzenmacher. On the analysis of randomized load balancing schemes. Theory of Computing Systems, 32:361–386, 1999.

[109] S. Mneimneh, V. Sharma, and K.Y. Siu. Switching using parallel input-output queued

switches with no speedup. IEEE/ACM Transactions on Networking, 10(5):653–665, 2002.

[110] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.

[111] J. Nagle. RFC 970: On packet switches with infinite storage, December 1985.

[112] WAN packet size distribution. The National Laboratory for Applied Network Research.

Available online at: http://www.nlanr.net/NA/Learn/packetsizes.html.

[113] P. Pappu, J. Parwatikar, J. Turner, and K. Wong. Distributed queueing in scalable high

performance routers. In IEEE INFOCOM, 2003.

[114] P. Pappu, J. Turner, and K. Wong. Work-conserving distributed schedulers for terabit routers.

In ACM SIGCOMM, pages 257–268, 2004.

[115] A. K. Parekh and R. G. Gallager. A generalized processor sharing approach to flow control in

integrated service networks: the single node case. IEEE/ACM Transactions on Networking,

1:344–357, 1993.

[116] C. Partridge. Internet RFC 1257: Isochronous applications do not require jitter-controlled

networks, September 1991.


[117] M.R. Pearlman, Z.J. Haas, P. Sholander, and S.S. Tabrizi. On the impact of alternate path

routing for load balancing in mobile ad hoc networks. In MobiHoc, pages 3–10, 2000.

[118] G. F. Pfister and V. A. Norton. Hot spot contention and combining in multistage intercon-

nection networks. IEEE Transactions on Computers, 34(10):943–948, 1985.

[119] Inverse multiplexing over ATM works today. PMC-Sierra, Inc., 2002. http://www.electronicstalk.com/news/pmc/pmc121.html.

[120] B. Prabhakar and N. McKeown. On the speedup required for combined input and output

queued switching. Automatica, 35(12):1909–1920, December 1999.

[121] A. Prakash, A. Aziz, and V. Ramachandran. A near optimal schedule for switch-memory-

switch routers. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA),

pages 343–352, 2003.

[122] A. Prakash, A. Aziz, and V. Ramachandran. Randomized parallel schedulers for switch-

memory-switch routers: Analysis and numerical studies. In IEEE INFOCOM, 2004.

[123] F. Rendl. On the complexity of decomposing matrices arising in satellite communication.

Operations Research Letters, 4:5–8, 1985.

[124] A. Rosen. A note on models for non-probabilistic analysis of packet switching networks. Information Processing Letters, 84(5):237–240, December 2002.

[125] J. Sgall. On-line scheduling - a survey. On-line Algorithms: The State of the Art, pages

196–231, 1998.

[126] D. Shah and M. Kopikare. Delay bounds for approximate maximum weight matching algo-

rithms for input queued switches. In IEEE INFOCOM, 2002.

[127] S. Sharif, A. Aziz, and A. Prakash. An O(log^2 N) parallel algorithm for output queuing. In IEEE INFOCOM, 2002.

[128] R. Sinha, C. Papadopoulos, and J. Heidemann. Internet Packet Size Distributions: Some Observations, 2005. Available online at: http://netweb.usc.edu/~rsinha/pkt-sizes/.


[129] D.D. Sleator and R.E. Tarjan. Amortized efficiency of list update and paging rules. Communications of the ACM, 28(2):202–208, 1985.

[130] D.C. Stephens and H. Zhang. Implementing distributed packet fair queueing in a scalable architecture. In IEEE INFOCOM, pages 282–290, 1998.

[131] I. Stoica and H. Zhang. Exact emulation of an output queueing switch by a combined input output queueing switch. In Sixth IEEE/IFIP International Workshop on Quality of Service, pages 218–224, 1998.

[132] Y. Tamir and H.C. Chi. Symmetric crossbar arbiters for VLSI communication switches. IEEE Transactions on Parallel and Distributed Systems, 4(1):13–27, January 1993.

[133] Y. Tamir and G.L. Frazier. High-performance multi-queue buffers for VLSI communications switches. In Proceedings of the 15th Annual International Symposium on Computer Architecture, 1988.

[134] A.S. Tanenbaum. Computer Networks. Prentice Hall, fourth edition, 2003.

[135] R.E. Tarjan. Data Structures and Network Algorithms. Society for Industrial and Applied Mathematics, 1983.

[136] L. Tassiulas. Linear complexity algorithms for maximum throughput in radio networks and input queued switches. In IEEE INFOCOM, pages 533–539, 1998.

[137] B. Towles and W.J. Dally. Guaranteed scheduling for switches with configuration overhead. IEEE/ACM Transactions on Networking, 11(5):835–847, 2003.

[138] J. Turner. Strong performance guarantees for asynchronous crossbar schedulers. In IEEE INFOCOM, 2006. To appear.

[139] J. Turner and N. Yamanaka. Architectural choices in large scale ATM switches. IEICE Transactions, E81-B(2):120–137, February 1998.

[140] J.S. Turner. New directions in communications (or which way to the information age?). IEEE Communications Magazine, 24(10):8–15, October 1986.


[141] ERX-700/1400, Edge Routing Switch. Unisphere Solutions, Inc. http://www.juniper.net/techpubs/software/erx/erx130/product overview.pdf.

[142] G. Varghese. Network Algorithmics: An Interdisciplinary Approach to Designing Fast Networked Devices. Morgan Kaufmann Publishers, Inc., 2005.

[143] D.C. Verma, H. Zhang, and D. Ferrari. Guaranteeing delay jitter bounds in packet switching networks. In Proceedings of TriComm, pages 35–46, 1991.

[144] J. von Neumann. A certain zero-sum two-person game equivalent to the optimal assignment problem. Contributions to the Theory of Games, 2:5–12, 1953.

[145] T. Weller and B. Hajek. Scheduling nonuniform traffic in a packet-switching system with small propagation delay. IEEE/ACM Transactions on Networking, 5(6):813–823, 1997.

[146] C-S. Wu, J-C. Jiau, and K-J. Chen. Characterizing traffic behavior and providing end-to-end service guarantees within ATM networks. In IEEE INFOCOM, volume 1, pages 336–344, 1997.

[147] H. Zhang. Service disciplines for guaranteed performance service in packet switching networks. Proceedings of the IEEE, 83(10):1374–1396, October 1995.

[148] H. Zhang. Providing end-to-end performance guarantees using non-work-conserving disciplines. Computer Communications: Special Issue on System Support for Multimedia Computing, 18(10), October 1995.

[149] H. Zhang and D. Ferrari. Rate-controlled service disciplines. Journal of High-Speed Networks, 3(4):389–412, 1994.

Competitive Evaluation of Switch Architectures

Research Thesis

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

David Hay

Submitted to the Senate of the Technion - Israel Institute of Technology

Iyyar 5767 Haifa April 2007

The research thesis was done under the supervision of Prof. Hagit Attiya in the Department of Computer Science.

Hagit, there are no words to express how grateful I am for your help and patient guidance throughout these years. I feel privileged to have had the opportunity to work with you and to learn from you, as a researcher, as a teacher, but first and foremost as a human being. I always felt that your door was open to me, for any question, any thought, or even just a chat. Among the countless things I learned from you, I especially appreciate how you perfectly balanced guiding me with preserving my freedom and independence as a researcher. No doubt you are the kind of advisor any student dreams of (and much, much more).

I would like to thank the people with whom I collaborated: Dr. Isaac Keslassy, Gabriel Scalosub and Prof. Jennifer L. Welch. The periods we worked together were among the most enjoyable of all my studies. I also thank Dr. Isaac Keslassy for organizing my visit to Cisco during the summer of 2006; this visit contributed greatly to my research.

I thank the members of my examining committees: Prof. Yishay Mansour, Prof. Seffi Naor, Prof. Israel Cidon, Dr. Isaac Keslassy, Dr. Adi Rosén and Prof. Danny Raz. I benefited greatly from your comments and insights.

I would also like to thank all the other people in the Department of Computer Science with whom I worked and who helped me during my years in the department, starting from my undergraduate studies.

Special thanks to Moshe Saykevich for the constant moral support (and for the advice and the help with graphics). Moshe, without your support none of this would have happened.

Last, though not in order of importance, I thank my parents, Yael and Yigal Hay, my grandparents, Zvi and Zvia and Lia and Yaakov, my brothers, Eyal, Roee and Asaf, and the rest of my family, for always being there at my side and supporting every decision and choice I made.

I thank the Wolf and Blankstein funds for their generous financial support of my studies.

Abstract

The rapid increase, both in the bandwidth demanded by Internet users and in the rates at which modern communication networks operate, is turning the basic network elements - the switches and routers - into one of the bottlenecks of overall network performance.

Today, existing switches and routers operate at rates of up to 40 gigabits per second, with hundreds of input and output ports. Moreover, contemporary communication networks are required to integrate several types of traffic (for example, IP traffic together with voice and video traffic), and the switch (or router) must therefore meet strict quality-of-service requirements, which vary from application to application. To cope with these tasks, all modern routers and switches are equipped with sophisticated control mechanisms, such as packet-scheduling and queue-management algorithms. As switches become larger and faster, parallel and distributed architectures are increasingly used; these architectures require additional mechanisms for coordinating and balancing the load among their various components.

Three typical bottlenecks affecting the performance of a modern switch can be identified: the address-lookup process, which determines the output port to which each packet arriving at the switch is destined; the switching process, which is responsible for transferring packets from the input ports to the output ports; and the packet-scheduling process, which decides which packet leaves a given output port of the switch at any given time.

This thesis focuses mainly on problems arising from the bottleneck created by the switching process. It contributes analytical research methods for evaluating the performance of the control mechanisms in various switches, and its results make it possible to compare different switch architectures.

In general, given an existing switch architecture for which the packet arrival rates, the placement of the buffers and the placement of the control lines are known, we devise switching algorithms and analyze their performance. In addition, we prove inherent limitations of the architecture and point out important switch-design choices. It is important to emphasize that the proposed switching algorithms depend closely on the specific architecture for which they are intended; this research therefore involves a wide variety of different algorithmic problems.

Most of the results presented in this research are relative: they are measured in comparison with an optimal switch that is not constrained by its architecture. As in the analysis of on-line algorithms, which have no information about future events, this competitive approach is natural, since the amount of information that is inaccessible to the switching algorithm is one of the most significant factors in its performance. It is important to note that such competitive analysis does not depend on probabilistic assumptions about the incoming traffic (assumptions that can be misleading in certain cases), which allows it to expose the strengths and weaknesses of the mechanisms under study. Furthermore, analytical evaluation, and especially worst-case analysis, is of particular importance, since it makes it possible to guarantee a certain quality of service (in contrast to experimental evaluation, which is usually based on simulations and does not cover all possible situations).

We now survey the main results of this research.

First, we studied the parallel packet switch (PPS) architecture, in which packets arriving at the switch are sent to their destinations in parallel through intermediate switches that operate at a slower rate. We analytically characterized the relative queuing delay of cells (a cell is a packet of fixed, uniform size) in the PPS architecture compared with an optimal switch; this measure captures the effect of the switch's parallelism on its overall performance. The upper and lower bounds depend on the amount and type of information available to the algorithm that balances the load among the intermediate switches, and they expose significant differences between the performance of different algorithms: sharing information among the different input ports, even if the information is outdated, significantly improves the switch's performance.

A common paradigm for balancing load on average is randomization: even very simple randomized strategies guarantee, with high probability, a maximum load that is close to the optimal distribution. In light of the success of randomization in traditional load-balancing problems and in other switches, it is tempting to use randomization in the PPS architecture as well, in order to improve its performance. However, we show that randomization cannot reduce the average queuing delay. This surprising result stems from the practical requirement that a switch not change the order in which cells leave it; this property allows an adversary to exploit a transient increase in the relative delay and to perpetuate it long enough for the average delay to grow accordingly.

In contrast, we present a general methodology for analyzing the maximum delay by measuring the imbalance among the slower switches. This methodology is used to design new optimal algorithms that rely on slightly outdated information about the global state of the switch. Moreover, it provides, for the first time, a complete analysis of well-known distributed algorithms.

A second research topic is the performance of the combined input-output queued (CIOQ) switch architecture, and in particular switching algorithms that handle variable-length packets. The need to switch variable-length packets directly stems from the fact that under most well-known communication protocols, traffic consists of variable-length packets (for example, IP packets). Contemporary switches, however, treat these packets as fixed-length cells, segmenting each packet into cells and reassembling it outside the switch. Switches that handle variable-length packets directly are constrained to transfer each packet contiguously from the input ports to the output ports, without segmenting it; in return, such switches avoid the overhead of packet segmentation and reassembly, which can be very significant in fast switches (for example, optical switches).

In this research, we devised switching algorithms that provide contiguous packet transfer in CIOQ switches. We showed that a small speedup of the internal rate of the switch suffices to achieve emulation (with bounded relative delay) of an optimal switch. The algorithms we presented operate at low granularity; that is, they make switching decisions only once every several time units (in contrast to common algorithms for this architecture, which operate several times in every time unit). The algorithms are based on a matrix-decomposition technique.

These algorithms expose a trade-off between different characteristics of the switch: its internal speed versus the relative delay it achieves. When the relative delay is large, the internal speed required of the switch is only twice the packet arrival rate; since a similar requirement is known for cell switching, this result implies that the switch does not need a higher internal speed in order to keep packet transfer contiguous. The relative delay can be reduced significantly by increasing the internal speed. Nevertheless, we proved that emulating an optimal switch with no relative delay at all is impossible, regardless of the internal speed of the switch.

In addition, we evaluated the performance of our algorithms by simulations. The simulation results show that in practice the presented algorithms perform better than their theoretical bounds.

A third research topic is providing quality-of-service guarantees in an isolated environment, in which the incoming traffic arrives at a regulator that must reshape it to conform to the quality-of-service requirements. In this research we concentrated on the delay jitter regulator, which must shape the incoming traffic so that it is as periodic (or smooth) as possible. The demand for smooth traffic is common mainly in interactive communication (such as voice or video traffic); in such applications, bounds on the delay jitter translate directly into bounds on the buffer size at the destination. Following the sharp increase in video traffic in recent years, delay jitter has become an important measure in modern switches.

Delay jitter regulators use internal buffers to reshape the traffic, and it is therefore important to study the interplay between the buffer size and the regulator's success in its task. Moreover, in realistic situations such regulators must handle several flows simultaneously and achieve small delay jitter on each of them separately.

In this research, we presented an off-line algorithm for a multi-flow regulator that achieves the optimal delay jitter for a given buffer size B; the algorithm runs in polynomial time. For on-line algorithms, on the other hand, we presented an upper bound and a lower bound showing that the buffer size required to achieve the optimal delay jitter depends linearly on the number of flows.

Technion - Computer Science Department - Ph.D. Thesis PHD-2007-06 - 2007