switch architectures
Post on 15-Jan-2016
1
Switch Architectures
Input Queued, Output Queued,
Combined Input and Output Queued
2
Outline
I. Introduction
II. System Model
III. The Least Cushion First/Most Urgent First Algorithm
IV. Conclusion
3
Ⅰ. Introduction
Exponential growth of Internet traffic demands large-scale switches
Common switch architectures:
Output Queued
  High performance
  Easier to provide QoS guarantees
  Has a serious scaling problem
Input Queued
  More scalable
  Suffers from HOL blocking
  Virtual Output Queues (VOQs) can improve performance
  Difficult to provide QoS guarantees
4
Output Queued-Shared Bus
[Diagram: four input ports connected over a shared bus to four output ports, each output port with its own queue]
5
Output Queued-Shared Memory
[Diagram: four input ports writing into a single shared memory, read out by four output ports]
6
Input Queued
[Diagram: each of the four input ports has a single FIFO queue holding cells destined to output ports 1–4]
7
Input Queued with VOQ
[Diagram: each input port maintains a separate virtual output queue (VOQ) for each of the four output ports]
8
Ⅰ. Introduction
Memory BW requirements for three common switch architectures
(S: link speed; N: switch size, N×N):

  Architecture                  Memory BW   Example (S = 10 Gbps, N = 16)
  Input queued                  2S          20 Gbps
  Output queued, shared bus     (N+1)S      170 Gbps
  Output queued, shared memory  2NS         320 Gbps

Input queueing is necessary!
The switch fabric can be sped up to improve performance: the CIOQ switch
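The table values can be checked with a few lines of Python (a sketch of the arithmetic only; the per-line rationales in the comments are the usual read/write accounting, stated here as assumptions):

```python
# Memory bandwidth requirements from the table above,
# with S = link speed (Gbps) and N = switch size.
S = 10   # Gbps
N = 16

input_queued = 2 * S            # one write + one read per cell time
oq_shared_bus = (N + 1) * S     # up to N writes + 1 read per output port
oq_shared_mem = 2 * N * S       # N writes + N reads into the shared memory

print(input_queued, oq_shared_bus, oq_shared_mem)  # 20 170 320
```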
9
Ⅰ. Introduction
Matching Algorithms for Performance Improvement:

  Matching        Complexity (per iteration)  Description
  Maximum         O(N^2.5)                    Achieves 100% throughput under uniform traffic
  Maximum weight  O(N^3 log N)                Achieves 100% throughput under either uniform or non-uniform traffic
  Maximal         O(N)                        Achieves 100% throughput with a speedup of 2 times
  Stable          O(N^2)                      Exactly emulates an OQ switch with a speedup of 2 times
10
Ⅰ. Introduction
Exact Emulation: under identical input traffic, the departure times of every cell from both CIOQ switch and OQ switch are identical.
[Diagram: a CIOQ switch and an emulated OQ switch, each with inputs 1…N and outputs 1…N, fed identical input traffic and producing identical departure patterns]
11
Ⅰ. Introduction
We propose a new scheduling algorithm called the least cushion first / most urgent first (LCF/MUF) algorithm:
O(N) complexity with parallel comparators
Exactly emulates an OQ switch with a speedup of 2 times
No constraint on service discipline
12
Ⅱ. System Model
[Diagram: CIOQ switch model — input queues, a switching fabric with speedup = 2, and output queues]
13
Ⅱ. System Model
The switch fabric is sped up by a factor of 2. There are 2 scheduling phases in slot k, referred to as phase k.1 and phase k.2.
A cell delivered to its destined output port in phase k.1 can be transmitted out of the output port in the same slot (i.e., cut-through).
A cell delivered in phase k.2 can only be transmitted in slot k+1 or later.
14
Ⅱ. System Model
15
Ⅲ. The Least Cushion First / Most Urgent First Algorithm
Let x_ij denote a cell at input port i destined to output port j.
Definition 1: The cushion of cell x_ij, C(x_ij): the number of cells residing in output port j which will depart the emulated OQ switch earlier than cell x_ij.
Definition 2: The cushion between input port i and output port j, C(i,j): the minimum of C(x_ij) over all cells at input port i destined to output port j. If there is no cell destined to output port j, then C(i,j) is set to ∞.
16
Ⅲ. The Least Cushion First / Most Urgent First Algorithm
Definition 3: The scheduling matrix of an N×N switch is an N×N square matrix whose (i,j)th entry equals C(i,j).
Definition 4: The input thread of cell x_ij at input port i, IT(x_ij): the set of cells at input port i which have a cushion smaller than or equal to C(x_ij), excluding cell x_ij itself. Let |IT(x_ij)| denote the size of IT(x_ij).
17
Ⅲ. The Least Cushion First / Most Urgent First Algorithm
18
Ⅲ. The Least Cushion First / Most Urgent First Algorithm
LCF / MUF Algorithm
Step 1:
Select the (i,j)th entry which satisfies C(i,j) = min_{k,l} {C(k,l)} (Least Cushion First). If the selected entry is ∞, then stop.
If more than one entry with the least cushion resides in different columns, select a column (i.e., an output port) arbitrarily.
For the selected column, say column j, determine the row i whose cell x_ij is the most urgent among all cells x_kj at all input ports (Most Urgent First).
19
Ⅲ. The Least Cushion First / Most Urgent First Algorithm
LCF / MUF Algorithm
Step 2: Eliminate the ith row and the jth column of the scheduling matrix (i.e., match output port j to input port i).
If the reduced matrix becomes null, then stop. Otherwise, use the reduced matrix and go to Step 1.
Consider, for example, the scheduling matrix given on page 13.
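The two steps can be sketched in Python. The cushion matrix C and the urgency table are inputs assumed by this sketch (the algorithm obtains them from the emulated OQ switch), and the tie-breaking details are simplified; function and variable names are mine, not the authors':

```python
INF = float("inf")

def lcf_muf(C, urgency):
    """C[i][j]: cushion between input i and output j (INF if no cell).
    urgency[i][j]: urgency of input i's cell for output j (smaller = more urgent)."""
    n = len(C)
    free_in, free_out = set(range(n)), set(range(n))
    matches = []
    while free_in and free_out:
        # Step 1, Least Cushion First: smallest cushion among unmatched entries
        least = min(C[i][j] for i in free_in for j in free_out)
        if least == INF:              # no remaining cell: stop
            break
        # pick a column (output port) containing the least cushion
        j = next(j for j in free_out
                 if any(C[i][j] == least for i in free_in))
        # Most Urgent First: among inputs holding a cell for output j,
        # take the one whose cell is most urgent
        i = min((i for i in free_in if C[i][j] < INF),
                key=lambda i: urgency[i][j])
        matches.append((i, j))
        # Step 2: eliminate row i and column j, then repeat
        free_in.discard(i)
        free_out.discard(j)
    return matches
```

For instance, `lcf_muf([[0, INF], [1, 0]], [[0, 9], [1, 0]])` matches input 0 to output 0 and input 1 to output 1.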
20
Ⅳ. Conclusion
We propose a new scheduling algorithm, the least cushion first / most urgent first (LCF/MUF) algorithm
  Exactly emulates an OQ switch
  No constraint on service discipline
Implementation issues of the LCF / MUF algorithm
  A switch has to know the cushions of all cells and the relative departure order of cells destined to the same output port
  It could be difficult to obtain this information for a dynamic priority assignment scheme (e.g., WFQ)
  Feasible for static priority assignment schemes
21
Outline
Systolic Array
Binary Heap
Pipelined Heap
Hardware Design
22
The Systolic Array Priority Queue
[Diagram: blocks 1 through n arranged with non-increasing priority values; the highest value exits from block 1 and a new value enters at the head; each block holds a permanent data register and a temporary register]
n = 1000
Hardware required: 1000 comparators, 2000 registers.
Performance: constant time.
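The behavior can be illustrated with a small software sketch; the loop stands in for what the hardware does in parallel across all blocks, and the function names are mine:

```python
# Software sketch of a systolic-array priority queue.
# Blocks hold values in non-increasing order; an inserted value ripples
# through, with each block keeping the larger of (stored, incoming).
NEG_INF = float("-inf")

def insert(blocks, value):
    incoming = value
    for i, stored in enumerate(blocks):
        # keep the larger value, pass the smaller one onward
        if incoming > stored:
            blocks[i], incoming = incoming, stored
    return blocks

def dequeue(blocks):
    # the highest value always sits in block 1 (index 0)
    highest = blocks[0]
    blocks[:-1] = blocks[1:]   # shift everything one block left
    blocks[-1] = NEG_INF
    return highest

q = [NEG_INF] * 4
for v in (3, 16, 7):
    insert(q, v)
print(dequeue(q))  # -> 16
print(q[0])        # -> 7
```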
23
The Binary Heap Priority Queue
[Diagram: a binary max-heap shown both as a binary tree (nodes indexed 1 through 15) and as a linear array]
n = 1000
Hardware required: 1 comparator, 1 register, 1 SRAM.
Performance: O(log n).
24
The Pipelined-Heap
Modified binary heap data structure
Constant-time operation. Similar to the Systolic Array.
Good hardware scalability. Similar to the Binary Heap.
25
P-heap Data Structure (B,T)
[Diagram: a P-heap with four levels, consisting of a Binary Array B holding node values and per-node capacities (indexed 1 through 15), and a Token Array T with one entry per level holding an operation, a value, and a position]
26
The Enqueue (Insert) Operation
[Figure: panels (a) local-enqueue(1) and (b) local-enqueue(2) — an enq token carrying value 9 moves from level 1 to level 2 of the P-heap]
27
Enqueue (contd)
[Figure: panels (c) local-enqueue(3), (d) local-enqueue(4), and (e) — the enq token continues down the tree until the inserted value is placed in a node with spare capacity and the token is cleared]
28
The Dequeue (Delete) Operation
[Figure: panels (a) and (b) local-dequeue(1) — the root value 16 is removed and a deq token is placed at level 1]
29
Dequeue (contd)
[Figure: panels (c) local-dequeue(2), (d) local-dequeue(3), and (e) — the deq token moves down level by level, pulling the larger child value up, until the heap property is restored]
30
Pipelined Operation
[Figure: four snapshots of the pipeline — enqueue and dequeue tokens at different levels of the P-heap (levels 1 through 6) make progress concurrently, one token per level]
31
Hardware Requirements
log N SRAMs represent the Binary Array B (N = size of the P-heap).
log N registers represent the Token Array T.
log N comparators are required, one for each level of the P-heap.
32
Binary Heap
[Diagram: the heap viewed as a binary tree (16 at node 1; 11, 12 at nodes 2–3; 8, 11, 9 at nodes 4–6) and as an array A[1..6] = 16, 11, 12, 8, 11, 9]
Left(i) = 2*i
Right(i) = 2*i + 1
Parent(i) = i / 2
Heap property: A[i] >= A[Left(i)] and A[i] >= A[Right(i)]
33
Binary Heap : Insert Operation
[Diagram: inserting 14 into the heap (16; 11, 12; 8, 10, 9). The new value is appended at index 7; since 14 > 12 = A[Parent(7)], the two are swapped, yielding (16; 11, 14; 8, 10, 9, 12)]
34
Binary Heap : Delete Operation
[Diagram: deleting the maximum from the heap (16; 11, 14; 8, 10, 9, 12). The root 16 is removed, the last element 12 moves to the root, and 12 is swapped with its larger child 14, yielding (14; 11, 12; 8, 10, 9)]
35
Binary Heap Operations
Both insert and delete are O(log N) operations (i.e., proportional to the number of levels in the tree)
2*i can be implemented as left shift
i / 2 can be implemented as right shift
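A minimal Python sketch of both operations over a 1-indexed array, using the shift implementations of 2*i and i/2 noted above (function names are mine, not from the slides):

```python
# Binary max-heap stored in a 1-indexed array; heap[0] is unused.
def insert(heap, value):
    heap.append(value)
    i = len(heap) - 1
    # sift up while the parent is smaller
    while i > 1 and heap[i >> 1] < heap[i]:
        heap[i >> 1], heap[i] = heap[i], heap[i >> 1]
        i >>= 1                       # Parent(i) = i / 2 as a right shift

def delete_max(heap):
    top, last = heap[1], heap.pop()
    if len(heap) > 1:
        heap[1] = last                # move the last element to the root
        i, n = 1, len(heap) - 1
        while True:
            left, largest = i << 1, i  # Left(i) = 2*i as a left shift
            if left <= n and heap[left] > heap[largest]:
                largest = left
            if left + 1 <= n and heap[left + 1] > heap[largest]:
                largest = left + 1
            if largest == i:
                break                 # heap property restored
            heap[i], heap[largest] = heap[largest], heap[i]
            i = largest
    return top

h = [None]                            # index 0 unused
for v in (16, 11, 12, 8, 10, 9, 14):
    insert(h, v)
print(delete_max(h))  # -> 16
print(h[1])           # -> 14
```

This reproduces the slide example: after deleting 16, the last element 12 moves to the root and is swapped with its larger child 14.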
36
Some Scheduling Algorithms
Outline
PIM
RRM
iSLIP (a better solution)
37
Scheduling Algorithms
When we use a crossbar switch, we require a scheduling algorithm that matches inputs with outputs.
This is equivalent to finding a bipartite matching on a graph with N input and N output vertices.
The algorithm configures the fabric during each cell time and decides which inputs will be connected to which outputs.
38
Scheduling packets
For example, with P(input #, output #) = order to leave:
P(1,1)=1, P(1,2)=3
P(3,2)=3, P(3,4)=1
P(4,4)=2
[Diagram: a crossbar switch with the input side on the left and the output side on the right]
The scheduling algorithm needs to decide the path and order of packets through the crossbar switch.
39
High performance systems
Usually, we design algorithms with the following properties:
High Throughput
Starvation Free
Fast
Simple to Implement
40
Parallel Iterative Matching (PIM)
PIM consists of three steps:
Step 1: Request
Step 2: Grant
Step 3: Accept
Each decision is made randomly.
41
The mathematical model of the algorithm
We can assume that every input In[i] maintains the following state information:
Table Ri[0] … Ri[N-1], where Ri[k] = 1 if In[i] has a request for Out[k] (0 otherwise)
Table Gdi[0] … Gdi[N-1], where Gdi[k] = 1 if In[i] receives a grant from Out[k] (0 otherwise)
Variable Ai, where Ai = k if In[i] accepts the grant from Out[k] (-1 if no output is accepted)
42
The mathematical model (cont'd)
Every output Out[k] maintains the following state information:
Table Rdk[0] … Rdk[N-1], where Rdk[i] = 1 if Out[k] receives a request from In[i] (0 otherwise)
Variable Gk, where Gk = i if Out[k] sends a grant to In[i] (-1 if no input is granted)
Variable Adk, where Adk = 1 if the grant from Out[k] is accepted (0 otherwise)
43
The model of PIM
Therefore, we can represent the PIM algorithm as follows:
[Model figure not preserved in the transcript]
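The model figure did not survive the transcript; as a substitute, here is a hedged Python sketch of one PIM iteration over the request sets (representing the state as Python sets and dicts is my assumption, not the original model):

```python
import random

def pim_iteration(requests, rng=random):
    """requests[i] = set of output ports unmatched input i has cells for
    (the Request step is implicit in this argument)."""
    n = len(requests)
    # Grant step: each output randomly picks one requesting input
    grants = {}                       # output k -> granted input
    for k in range(n):
        requesters = [i for i in range(n) if k in requests[i]]
        if requesters:
            grants[k] = rng.choice(requesters)
    # Accept step: each input randomly accepts one of its grants
    accepts = {}                      # input i -> accepted output
    for i in range(n):
        granted = [k for k in grants if grants[k] == i]
        if granted:
            accepts[i] = rng.choice(granted)
    return accepts                    # matched (input -> output) pairs
```

In the full algorithm these three steps are iterated over the still-unmatched inputs and outputs until no more matches are found.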
44
An example of PIM algorithm
P(1,1)=1, P(1,2)=3
P(3,2)=3, P(3,4)=1
P(4,4)=2
[Diagram: panels (a) Request, (b) Grant, (c) Accept for one PIM iteration, followed by a second iteration]
45
Problems with PIM
Hard to implement randomness in hardware
Unfairness occurs among connections when outputs are oversubscribed
Throughput is limited to approximately 63% for a single iteration:
1 - (1 - 1/N)^N → 1 - 1/e ≈ 63%
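The 63% limit can be checked numerically (a small sketch; the function name is mine):

```python
import math

# Single-iteration PIM throughput limit: 1 - (1 - 1/N)^N,
# which tends to 1 - 1/e (about 63%) as the switch size N grows.
def pim_single_iter_throughput(N):
    return 1 - (1 - 1 / N) ** N

for N in (2, 16, 256):
    print(N, round(pim_single_iter_throughput(N), 3))
print(round(1 - math.exp(-1), 3))  # limiting value, -> 0.632
```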
46
The unfairness problem
λ1,1 = 1
λ1,2 = 1
λ2,1 = 1
μ1,1 = 1/4
μ1,2 = 3/4
μ2,1 = 3/4
47
Round-Robin Matching Algorithm (RRM)
Uses rotating priority to match inputs and outputs
Needs a pointer gi to identify the highest-priority element
Applies rotating priority on both inputs and outputs
48
The model of RRM
49
RRM scheduling
P(1,1)=1, P(1,2)=3
P(3,2)=3, P(3,4)=1
P(4,4)=2
[Diagram: panels (a) Request, (b) Grant, and (c) Accept, showing the rotating grant pointers g2 and g4 at the outputs and the accept pointer a1 at input 1]
50
Synchronization Problem
When an output receives a request, it chooses an input to grant, and gi always moves to a new value (whether or not the grant is accepted).
For example:
λ1,1 = λ1,2 = 1
λ2,1 = λ2,2 = 1
μ1,1 = μ1,2 = 1/4
μ2,1 = μ2,2 = 1/4
Efficiency = 50%
51
iSLIP algorithm
Used to fix the synchronization problem of RRM
Changes its pointer gi only when the grant is accepted by the input; otherwise the pointer gi keeps its value
Solves the synchronization problem and achieves 100% throughput
52
The model of iSLIP
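The model figure is not preserved in the transcript; a minimal Python sketch of one iSLIP iteration with grant pointers g and accept pointers a (the data representation and function name are my assumptions):

```python
def islip_iteration(requests, g, a):
    """requests[i] = set of outputs input i has cells for.
    g[k]: output k's grant pointer; a[i]: input i's accept pointer.
    Pointers advance only when a grant is accepted."""
    n = len(requests)
    # Grant: each output picks the requesting input at or after g[k]
    grants = {}
    for k in range(n):
        for d in range(n):
            i = (g[k] + d) % n
            if k in requests[i]:
                grants[k] = i
                break
    # Accept: each input picks the granting output at or after a[i]
    matches = {}
    for i in range(n):
        granting = {k for k, gi in grants.items() if gi == i}
        for d in range(n):
            k = (a[i] + d) % n
            if k in granting:
                matches[i] = k
                a[i] = (k + 1) % n   # advance one beyond the match
                g[k] = (i + 1) % n
                break
    return matches
```

Running this twice on a fully loaded 2×2 switch shows the desynchronization effect: the first slot yields one match, and from the second slot on both inputs are matched every slot.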
53
Example of iSLIP
λ1,1 = λ1,2 = 1
λ2,1 = λ2,2 = 1
[Diagram: the 1st, 2nd, and 3rd matches — after the pointers desynchronize, both inputs are matched in every slot]
100% throughput is achieved
54
Comparison of three algorithms