switch architectures
Post on 15-Jan-2016
1
Switch Architectures
Input Queued, Output Queued,
Combined Input and Output Queued
2
Outline
I. Introduction
II. System Model
III. The Least Cushion First/Most Urgent First Algorithm
IV. Conclusion
3
Ⅰ. Introduction
Exponential growth of Internet traffic demands large-scale switches
Common switch architectures:
Output Queued
  High performance
  Easier to provide QoS guarantees
  Has a serious scaling problem
Input Queued
  More scalable
  Suffers from HOL blocking
  Virtual Output Queues (VOQs) can improve performance
  Difficult to provide QoS guarantees
4
Output Queued-Shared Bus
[Diagram: four input ports connected over a shared bus to four output ports, each output port with its own queue]
5
Output Queued-Shared Memory
[Diagram: four input ports writing into a single shared memory, read out by four output ports]
6
Input Queued
[Diagram: each of the four input ports has a single FIFO queue holding cells destined to output ports 1–4]
7
Input Queued with VOQ
[Diagram: each input port maintains a separate virtual output queue (VOQ) for each of the four output ports]
8
Ⅰ. Introduction
Memory BW requirements for three common switch architectures
(S: link speed; N: switch size, N×N):

  Architecture                  Memory BW   Example (S = 10 Gbps, N = 16)
  Input queued                  2S          20 Gbps
  Output queued, shared bus     (N+1)S      170 Gbps
  Output queued, shared memory  2NS         320 Gbps

Input queueing is necessary!
The switch fabric can be sped up to improve performance: the CIOQ switch
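The table values can be checked with a few lines of Python (a sketch of the arithmetic only; the per-line rationales in the comments are the usual read/write accounting, stated here as assumptions):

```python
# Memory bandwidth requirements from the table above,
# with S = link speed (Gbps) and N = switch size.
S = 10   # Gbps
N = 16

input_queued = 2 * S            # one write + one read per cell time
oq_shared_bus = (N + 1) * S     # up to N writes + 1 read per output port
oq_shared_mem = 2 * N * S       # N writes + N reads into the shared memory

print(input_queued, oq_shared_bus, oq_shared_mem)  # 20 170 320
```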
9
Ⅰ. Introduction
Matching Algorithms for Performance Improvement:

  Matching        Complexity (per iteration)  Description
  Maximum         O(N^2.5)                    Achieves 100% throughput under uniform traffic
  Maximum weight  O(N^3 log N)                Achieves 100% throughput under either uniform or non-uniform traffic
  Maximal         O(N)                        Achieves 100% throughput with a speedup of 2 times
  Stable          O(N^2)                      Exactly emulates an OQ switch with a speedup of 2 times
10
Ⅰ. Introduction
Exact Emulation: under identical input traffic, the departure times of every cell from both CIOQ switch and OQ switch are identical.
[Diagram: a CIOQ switch and an emulated OQ switch, each with inputs 1…N and outputs 1…N, fed identical input traffic and producing identical departure patterns]
11
Ⅰ. Introduction
We propose a new scheduling algorithm called the least cushion first / most urgent first (LCF/MUF) algorithm:
O(N) complexity with parallel comparators
Exactly emulates an OQ switch with a speedup of 2 times
No constraint on service discipline
12
Ⅱ. System Model
[Diagram: CIOQ switch model — input queues, a switching fabric with speedup = 2, and output queues]
13
Ⅱ. System Model
The switch fabric is sped up by a factor of 2. There are 2 scheduling phases in slot k, referred to as phase k.1 and phase k.2.
A cell delivered to its destined output port in phase k.1 can be transmitted out of the output port in the same slot (i.e., cut-through).
A cell delivered in phase k.2 can only be transmitted in slot k+1 or later.
14
Ⅱ. System Model
15
Ⅲ. The Least Cushion First / Most Urgent First Algorithm
Let x_ij denote a cell at input port i destined to output port j.
Definition 1: The cushion of cell x_ij, C(x_ij): the number of cells residing in output port j which will depart the emulated OQ switch earlier than cell x_ij.
Definition 2: The cushion between input port i and output port j, C(i,j): the minimum of C(x_ij) over all cells at input port i destined to output port j. If there is no cell destined to output port j, then C(i,j) is set to ∞.
16
Ⅲ. The Least Cushion First / Most Urgent First Algorithm
Definition 3: The scheduling matrix of an N×N switch is an N×N square matrix whose (i,j)th entry equals C(i,j).
Definition 4: The input thread of cell x_ij at input port i, IT(x_ij): the set of cells at input port i which have a cushion smaller than or equal to C(x_ij), excluding cell x_ij itself. Let |IT(x_ij)| denote the size of IT(x_ij).
17
Ⅲ. The Least Cushion First / Most Urgent First Algorithm
18
Ⅲ. The Least Cushion First / Most Urgent First Algorithm
LCF / MUF Algorithm
Step 1:
Select the (i,j)th entry which satisfies C(i,j) = min_{k,l} {C(k,l)} (Least Cushion First). If the selected entry is ∞, then stop.
If more than one entry with the least cushion resides in different columns, select a column (i.e., an output port) arbitrarily.
For the selected column, say column j, determine the row i whose cell x_ij is the most urgent among all cells x_kj at all input ports (Most Urgent First).
19
Ⅲ. The Least Cushion First / Most Urgent First Algorithm
LCF / MUF Algorithm
Step 2: Eliminate the ith row and the jth column of the scheduling matrix (i.e., match output port j to input port i).
If the reduced matrix becomes null, then stop. Otherwise, use the reduced matrix and go to Step 1.
Consider, for example, the scheduling matrix given on page 13.
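The two steps can be sketched in Python. The cushion matrix C and the urgency table are inputs assumed by this sketch (the algorithm obtains them from the emulated OQ switch), and the tie-breaking details are simplified; function and variable names are mine, not the authors':

```python
INF = float("inf")

def lcf_muf(C, urgency):
    """C[i][j]: cushion between input i and output j (INF if no cell).
    urgency[i][j]: urgency of input i's cell for output j (smaller = more urgent)."""
    n = len(C)
    free_in, free_out = set(range(n)), set(range(n))
    matches = []
    while free_in and free_out:
        # Step 1, Least Cushion First: smallest cushion among unmatched entries
        least = min(C[i][j] for i in free_in for j in free_out)
        if least == INF:              # no remaining cell: stop
            break
        # pick a column (output port) containing the least cushion
        j = next(j for j in free_out
                 if any(C[i][j] == least for i in free_in))
        # Most Urgent First: among inputs holding a cell for output j,
        # take the one whose cell is most urgent
        i = min((i for i in free_in if C[i][j] < INF),
                key=lambda i: urgency[i][j])
        matches.append((i, j))
        # Step 2: eliminate row i and column j, then repeat
        free_in.discard(i)
        free_out.discard(j)
    return matches
```

For instance, `lcf_muf([[0, INF], [1, 0]], [[0, 9], [1, 0]])` matches input 0 to output 0 and input 1 to output 1.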
20
Ⅳ. Conclusion
We propose a new scheduling algorithm, the least cushion first / most urgent first (LCF/MUF) algorithm
  Exactly emulates an OQ switch
  No constraint on service discipline
Implementation issues of the LCF / MUF algorithm
  A switch has to know the cushions of all cells and the relative departure order of cells destined to the same output port
  It could be difficult to obtain this information for a dynamic priority assignment scheme (e.g., WFQ)
  Feasible for static priority assignment schemes
21
Outline
Systolic Array
Binary Heap
Pipelined Heap
Hardware Design
22
The Systolic Array Priority Queue
[Diagram: blocks 1 through n arranged with non-increasing priority values; the highest value exits from block 1 and a new value enters at the head; each block holds a permanent data register and a temporary register]
n = 1000
Hardware required: 1000 comparators, 2000 registers.
Performance: constant time.
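The behavior can be illustrated with a small software sketch; the loop stands in for what the hardware does in parallel across all blocks, and the function names are mine:

```python
# Software sketch of a systolic-array priority queue.
# Blocks hold values in non-increasing order; an inserted value ripples
# through, with each block keeping the larger of (stored, incoming).
NEG_INF = float("-inf")

def insert(blocks, value):
    incoming = value
    for i, stored in enumerate(blocks):
        # keep the larger value, pass the smaller one onward
        if incoming > stored:
            blocks[i], incoming = incoming, stored
    return blocks

def dequeue(blocks):
    # the highest value always sits in block 1 (index 0)
    highest = blocks[0]
    blocks[:-1] = blocks[1:]   # shift everything one block left
    blocks[-1] = NEG_INF
    return highest

q = [NEG_INF] * 4
for v in (3, 16, 7):
    insert(q, v)
print(dequeue(q))  # -> 16
print(q[0])        # -> 7
```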
23
The Binary Heap Priority Queue
[Diagram: a binary max-heap shown both as a binary tree (nodes indexed 1 through 15) and as a linear array]
n = 1000
Hardware required: 1 comparator, 1 register, 1 SRAM.
Performance: O(log n).
24
The Pipelined-Heap
Modified binary heap data structure
Constant-time operation. Similar to the Systolic Array.
Good hardware scalability. Similar to the Binary Heap.
25
P-heap Data Structure (B,T)
[Diagram: a P-heap with four levels, consisting of a Binary Array B holding node values and per-node capacities (indexed 1 through 15), and a Token Array T with one entry per level holding an operation, a value, and a position]
26
The Enqueue (Insert) Operation
[Figure: panels (a) local-enqueue(1) and (b) local-enqueue(2) — an enq token carrying value 9 moves from level 1 to level 2 of the P-heap]
27
Enqueue (contd)
[Figure: panels (c) local-enqueue(3), (d) local-enqueue(4), and (e) — the enq token continues down the tree until the inserted value is placed in a node with spare capacity and the token is cleared]
28
The Dequeue (Delete) Operation
[Figure: panels (a) and (b) local-dequeue(1) — the root value 16 is removed and a deq token is placed at level 1]
29
Dequeue (contd)
[Figure: panels (c) local-dequeue(2), (d) local-dequeue(3), and (e) — the deq token moves down level by level, pulling the larger child value up, until the heap property is restored]
30
Pipelined Operation
[Figure: four snapshots of the pipeline — enqueue and dequeue tokens at different levels of the P-heap (levels 1 through 6) make progress concurrently, one token per level]
31
Hardware Requirements
log N SRAMs represent the Binary Array B (N = size of the P-heap).
log N registers represent the Token Array T.
log N comparators are required, one for each level of the P-heap.
32
Binary Heap
[Diagram: the heap viewed as a binary tree (16 at node 1; 11, 12 at nodes 2–3; 8, 11, 9 at nodes 4–6) and as an array A[1..6] = 16, 11, 12, 8, 11, 9]
Left(i) = 2*i
Right(i) = 2*i + 1
Parent(i) = i / 2
Heap property: A[i] >= A[Left(i)] and A[i] >= A[Right(i)]
33
Binary Heap : Insert Operation
[Diagram: inserting 14 into the heap (16; 11, 12; 8, 10, 9). The new value is appended at index 7; since 14 > 12 = A[Parent(7)], the two are swapped, yielding (16; 11, 14; 8, 10, 9, 12)]
34
Binary Heap : Delete Operation
[Diagram: deleting the maximum from the heap (16; 11, 14; 8, 10, 9, 12). The root 16 is removed, the last element 12 moves to the root, and 12 is swapped with its larger child 14, yielding (14; 11, 12; 8, 10, 9)]
35
Binary Heap Operations
Both insert and delete are O(log N) operations (i.e., proportional to the number of levels in the tree)
2*i can be implemented as left shift
i / 2 can be implemented as right shift
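A minimal Python sketch of both operations over a 1-indexed array, using the shift implementations of 2*i and i/2 noted above (function names are mine, not from the slides):

```python
# Binary max-heap stored in a 1-indexed array; heap[0] is unused.
def insert(heap, value):
    heap.append(value)
    i = len(heap) - 1
    # sift up while the parent is smaller
    while i > 1 and heap[i >> 1] < heap[i]:
        heap[i >> 1], heap[i] = heap[i], heap[i >> 1]
        i >>= 1                       # Parent(i) = i / 2 as a right shift

def delete_max(heap):
    top, last = heap[1], heap.pop()
    if len(heap) > 1:
        heap[1] = last                # move the last element to the root
        i, n = 1, len(heap) - 1
        while True:
            left, largest = i << 1, i  # Left(i) = 2*i as a left shift
            if left <= n and heap[left] > heap[largest]:
                largest = left
            if left + 1 <= n and heap[left + 1] > heap[largest]:
                largest = left + 1
            if largest == i:
                break                 # heap property restored
            heap[i], heap[largest] = heap[largest], heap[i]
            i = largest
    return top

h = [None]                            # index 0 unused
for v in (16, 11, 12, 8, 10, 9, 14):
    insert(h, v)
print(delete_max(h))  # -> 16
print(h[1])           # -> 14
```

This reproduces the slide example: after deleting 16, the last element 12 moves to the root and is swapped with its larger child 14.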
36
Some Scheduling Algorithms
Outline
PIM
RRM
iSLIP (a better solution)
37
Scheduling Algorithms
When we use a crossbar switch, we require a scheduling algorithm that matches inputs with outputs.
This is equivalent to finding a bipartite matching on a graph with N input and N output vertices.
The algorithm configures the fabric during each cell time and decides which inputs will be connected to which outputs.
38
Scheduling packets
For example, with P(input #, output #) = order to leave:
P(1,1)=1, P(1,2)=3
P(3,2)=3, P(3,4)=1
P(4,4)=2
[Diagram: a crossbar switch with the input side on the left and the output side on the right]
The scheduling algorithm needs to decide the path and order of packets through the crossbar switch.
39
High performance systems
Usually, we design algorithms with the following properties:
High Throughput
Starvation Free
Fast
Simple to Implement
40
Parallel Iterative Matching (PIM)
PIM consists of three steps:
Step 1: Request
Step 2: Grant
Step 3: Accept
Each decision is made randomly.
41
The mathematical model of the algorithm
We can assume that every input In[i] maintains the following state information:
Table Ri[0] … Ri[N-1], where Ri[k] = 1 if In[i] has a request for Out[k] (0 otherwise)
Table Gdi[0] … Gdi[N-1], where Gdi[k] = 1 if In[i] receives a grant from Out[k] (0 otherwise)
Variable Ai, where Ai = k if In[i] accepts the grant from Out[k] (-1 if no output is accepted)
42
The mathematical model (cont'd)
Every output Out[k] maintains the following state information:
Table Rdk[0] … Rdk[N-1], where Rdk[i] = 1 if Out[k] receives a request from In[i] (0 otherwise)
Variable Gk, where Gk = i if Out[k] sends a grant to In[i] (-1 if no input is granted)
Variable Adk, where Adk = 1 if the grant from Out[k] is accepted (0 otherwise)
43
The model of PIM
Therefore, we can represent the PIM algorithm as follows:
[Model figure not preserved in the transcript]
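The model figure did not survive the transcript; as a substitute, here is a hedged Python sketch of one PIM iteration over the request sets (representing the state as Python sets and dicts is my assumption, not the original model):

```python
import random

def pim_iteration(requests, rng=random):
    """requests[i] = set of output ports unmatched input i has cells for
    (the Request step is implicit in this argument)."""
    n = len(requests)
    # Grant step: each output randomly picks one requesting input
    grants = {}                       # output k -> granted input
    for k in range(n):
        requesters = [i for i in range(n) if k in requests[i]]
        if requesters:
            grants[k] = rng.choice(requesters)
    # Accept step: each input randomly accepts one of its grants
    accepts = {}                      # input i -> accepted output
    for i in range(n):
        granted = [k for k in grants if grants[k] == i]
        if granted:
            accepts[i] = rng.choice(granted)
    return accepts                    # matched (input -> output) pairs
```

In the full algorithm these three steps are iterated over the still-unmatched inputs and outputs until no more matches are found.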
44
An example of PIM algorithm
P(1,1)=1, P(1,2)=3
P(3,2)=3, P(3,4)=1
P(4,4)=2
[Diagram: panels (a) Request, (b) Grant, (c) Accept for one PIM iteration, followed by a second iteration]
45
Problems with PIM
Hard to implement randomness in hardware
Unfairness occurs among connections when outputs are oversubscribed
Throughput is limited to approximately 63% for a single iteration:
1 - (1 - 1/N)^N → 1 - 1/e ≈ 63%
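The 63% limit can be checked numerically (a small sketch; the function name is mine):

```python
import math

# Single-iteration PIM throughput limit: 1 - (1 - 1/N)^N,
# which tends to 1 - 1/e (about 63%) as the switch size N grows.
def pim_single_iter_throughput(N):
    return 1 - (1 - 1 / N) ** N

for N in (2, 16, 256):
    print(N, round(pim_single_iter_throughput(N), 3))
print(round(1 - math.exp(-1), 3))  # limiting value, -> 0.632
```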
46
The unfairness problem
λ1,1 = 1
λ1,2 = 1
λ2,1 = 1
μ1,1 = 1/4
μ1,2 = 3/4
μ2,1 = 3/4
47
Round-Robin Matching Algorithm (RRM)
Uses rotating priority to match inputs and outputs
Needs a pointer gi to identify the highest-priority element
Applies rotating priority on both inputs and outputs
48
The model of RRM
49
RRM scheduling
P(1,1)=1, P(1,2)=3
P(3,2)=3, P(3,4)=1
P(4,4)=2
[Diagram: panels (a) Request, (b) Grant, and (c) Accept, showing the rotating grant pointers g2 and g4 at the outputs and the accept pointer a1 at input 1]
50
Synchronization Problem
When an output receives a request, it chooses an input to grant, and gi always moves to a new value (whether or not the grant is accepted).
For example:
λ1,1 = λ1,2 = 1
λ2,1 = λ2,2 = 1
μ1,1 = μ1,2 = 1/4
μ2,1 = μ2,2 = 1/4
Efficiency = 50%
51
iSLIP algorithm
Used to fix the synchronization problem of RRM
Changes its pointer gi only when the grant is accepted by the input; otherwise the pointer gi keeps its value
Solves the synchronization problem and achieves 100% throughput
52
The model of iSLIP
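The model figure is not preserved in the transcript; a minimal Python sketch of one iSLIP iteration with grant pointers g and accept pointers a (the data representation and function name are my assumptions):

```python
def islip_iteration(requests, g, a):
    """requests[i] = set of outputs input i has cells for.
    g[k]: output k's grant pointer; a[i]: input i's accept pointer.
    Pointers advance only when a grant is accepted."""
    n = len(requests)
    # Grant: each output picks the requesting input at or after g[k]
    grants = {}
    for k in range(n):
        for d in range(n):
            i = (g[k] + d) % n
            if k in requests[i]:
                grants[k] = i
                break
    # Accept: each input picks the granting output at or after a[i]
    matches = {}
    for i in range(n):
        granting = {k for k, gi in grants.items() if gi == i}
        for d in range(n):
            k = (a[i] + d) % n
            if k in granting:
                matches[i] = k
                a[i] = (k + 1) % n   # advance one beyond the match
                g[k] = (i + 1) % n
                break
    return matches
```

Running this twice on a fully loaded 2×2 switch shows the desynchronization effect: the first slot yields one match, and from the second slot on both inputs are matched every slot.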
53
Example of iSLIP
λ1,1 = λ1,2 = 1
λ2,1 = λ2,2 = 1
[Diagram: the 1st, 2nd, and 3rd matches — after the pointers desynchronize, both inputs are matched in every slot]
100% throughput is achieved
54
Comparison of three algorithms