
Post on 26-Mar-2015


1

High Performance Switching and Routing
Telecom Center Workshop: Sept 4, 1997.

EE384Y: Packet Switch Architectures, Part II

Load-balanced Switch

(Borrowed from Isaac Keslassy’s Defense Talk)

Nick McKeown
Professor of Electrical Engineering and Computer Science, Stanford University

nickm@stanford.edu
http://www.stanford.edu/~nickm

2

The Arbitration Problem

A packet switch fabric is reconfigured for every packet transfer.

For example, at 160Gb/s, a new IP packet can arrive every 2ns.

The configuration is picked to maximize throughput and not waste capacity.

Known algorithms are probably too slow.
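
As a quick check of the 2ns figure, the sketch below computes how long one packet occupies the line. The 40-byte minimum IP packet size is an assumption; the slide does not state which packet size its number is based on.

```python
def packet_time_ns(packet_bytes: int, line_rate_gbps: float) -> float:
    """Time to receive one packet at the given line rate, in nanoseconds."""
    bits = packet_bytes * 8
    # bits divided by Gb/s gives nanoseconds
    return bits / line_rate_gbps


MIN_IP_PACKET_BYTES = 40  # assumption: minimum-size IP packet

print(packet_time_ns(MIN_IP_PACKET_BYTES, 160.0))
```

Under that assumption the fabric has to be reconfigured once per minimum-size packet time, which is the budget any arbitration algorithm would have to meet.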

3

Approach

We know that a crossbar with VOQs, under uniform Bernoulli i.i.d. arrivals, gives 100% throughput for each of the following scheduling algorithms:

- Pick a permutation uniformly at random (uar) from all permutations.
- Pick a permutation uar from a set of N permutations in which each input-output pair (i,j) is connected exactly once.
- From the same set as above, repeatedly cycle through a fixed sequence of N different permutations.

Can we make non-uniform, bursty traffic uniform “enough” for the above to hold?
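
One set of N permutations with the property used by the second and third algorithms is the set of cyclic shifts. The sketch below builds it and counts how often each input-output pair is connected. The function names are illustrative, and cyclic shifts are only one possible choice of set.

```python
def cyclic_permutations(n):
    """Return n permutations; perms[t][i] is the output connected to input i."""
    return [[(i + t) % n for i in range(n)] for t in range(n)]


def connection_counts(perms):
    """counts[i][j] = number of permutations that connect input i to output j."""
    n = len(perms[0])
    counts = [[0] * n for _ in range(n)]
    for perm in perms:
        for i, j in enumerate(perm):
            counts[i][j] += 1
    return counts


perms = cyclic_permutations(4)
counts = connection_counts(perms)
print(counts)  # every entry is 1: each pair (i, j) is connected exactly once
```

Cycling through `perms` in order gives each pair (i,j) exactly one connection every N time-slots, so no scheduling decision is needed at run time.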

4

Design Example

Goals

- Scale to High Linecard Speeds (160Gb/s)
  - No Centralized Scheduler
  - Optical Switch Fabric
  - Low Packet-Processing Complexity
- Scale to High Number of Linecards (640)
- Provide Performance Guarantees
  - 100% Throughput Guarantee
  - No Packet Reordering

Stanford “Optics in Routers” project
http://yuba.stanford.edu/or/

Some challenging numbers:

- 100Tb/s
- 160Gb/s linecards
- 640 linecards

5

Outline

- Basic idea of load-balancing
- Packet mis-sequencing
- An optical switch fabric
- Scaling number of linecards
- Arbitrary arrangement of linecards

6

100% Throughput in a Mesh Fabric

[Figure: inputs and outputs, each with an external line at rate R, joined by a full mesh. The mesh links are marked "?" and R.]

Router capacity = NR
Switch capacity = N^2 R

7

If Traffic Is Uniform

[Figure: a full mesh between inputs and outputs, with external lines at rate R and every mesh link at rate R/N.]

8

Real Traffic is Not Uniform

[Figure: a full mesh between inputs and outputs, with external lines at rate R and mesh links at rate R/N, annotated with "?".]

9

Load-Balanced Switch

[Figure: two full meshes in series, a load-balancing stage followed by a forwarding stage. External lines run at rate R and every mesh link at rate R/N.]

100% throughput for weakly mixing traffic (Valiant, C.-S. Chang)

10

Load-Balanced Switch

[Figure: the two-stage mesh with links at rate R/N; packets labeled 1, 2 and 3 are shown entering the switch.]

11

Load-Balanced Switch

[Figure: the two-stage mesh with links at rate R/N; packets labeled 1, 2 and 3 are shown on separate paths through the switch.]

12

Intuition: 100% Throughput

[Figure: the two-stage mesh with external lines at rate R and mesh links at rate R/N.]

Arrivals to second mesh: $b = \frac{1}{N} U a$, where $U$ is the $N \times N$ all-ones matrix and $a$ is the matrix of arrival rates

Capacity of second mesh: $C = \frac{R}{N} U$

Second mesh: arrival rate < service rate: $C - b = \frac{1}{N}(RU - Ua) > 0$

[C.-S. Chang]
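
The intuition above can be checked numerically. The sketch below takes a non-uniform rate matrix a in which no output is oversubscribed, computes the arrivals b to the second mesh as 1/N of each column sum of a, and compares every entry with the link capacity R/N. The rate values and function name are made up for illustration.

```python
def second_mesh_arrivals(a):
    """b = (1/N) U a: every middle port receives 1/N of the traffic for each output."""
    n = len(a)
    col_sums = [sum(a[i][j] for i in range(n)) for j in range(n)]
    return [[col_sums[j] / n for j in range(n)] for _ in range(n)]


R = 1.0
# Illustrative non-uniform rates: a[i][j] is the rate from input i to output j.
a = [[0.9, 0.0, 0.0],
     [0.0, 0.5, 0.4],
     [0.0, 0.4, 0.5]]

capacity = R / len(a)  # each mesh link runs at R/N
b = second_mesh_arrivals(a)

# Sending a directly over a single mesh would overload some R/N links ...
direct_fits = all(rate <= capacity for row in a for rate in row)
# ... but after load-balancing every second-mesh link is below capacity.
balanced_fits = all(rate < capacity for row in b for rate in row)
print(direct_fits, balanced_fits)
```

The check succeeds because each entry of b depends only on the total load of one output, which stays below R whenever that output is not oversubscribed.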

13

Another way of thinking about it

[Figure: external inputs 1..N, internal inputs 1..N and external outputs 1..N, connected by a load-balancing cyclic shift followed by a switching cyclic shift.]

Load Balancing

- First stage load-balances incoming packets
- Second stage is a cyclic shift

14

Load-Balanced Switch

[Figure: external inputs 1..N, internal inputs 1..N and external outputs 1..N, connected by a load-balancing cyclic shift followed by a switching cyclic shift; packets labeled 1 and 2 are shown in transit.]

15

Main Result [Chang et al.]:

1. Consider a periodic sequence of permutation matrices: $P(t) = \hat{P}^{\hat{t}}$, where $\hat{P}$ is a one-cycle permutation matrix (for example, a TDM sequence), and $\hat{t} = t \bmod N$.
2. If the 1st stage is scheduled by a sequence of permutation matrices: $P_1(t) = P(t - \phi_1)$, where $\phi_1$ is a random starting phase, and
3. The 2nd stage is scheduled by a sequence of permutation matrices: $P_2(t) = P(t - \phi_2)$,
4. Then the switch gives 100% throughput for a very broad range of traffic types.

Observation: 1st stage makes non-uniform traffic uniform, and breaks up burstiness.
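
A small simulation can illustrate the result. The sketch below is a simplified model, not the construction from the proof: both stages use the same cyclic shift with starting phase 0, packets cross the first stage in the slot in which they arrive, and the traffic is "diagonal" (input i sends only to output i), which is about as non-uniform as admissible traffic gets. All names and parameter values are illustrative.

```python
import random
from collections import deque


def simulate_load_balanced(n, load, slots, seed=0):
    """Two cyclic-shift stages with per-output FIFOs (VOQs) at the middle ports."""
    rng = random.Random(seed)
    # voq[k][j]: packets waiting at middle port k for output j
    voq = [[deque() for _ in range(n)] for _ in range(n)]
    arrived = delivered = max_backlog = 0
    for t in range(slots):
        # Stage 1: input i is connected to middle port (i + t) mod n.
        for i in range(n):
            if rng.random() < load:
                dest = i  # diagonal traffic: input i only talks to output i
                voq[(i + t) % n][dest].append(t)
                arrived += 1
        # Stage 2: middle port k is connected to output (k + t) mod n.
        for k in range(n):
            j = (k + t) % n
            if voq[k][j]:
                voq[k][j].popleft()
                delivered += 1
        max_backlog = max(max_backlog, arrived - delivered)
    return arrived, delivered, max_backlog


arrived, delivered, max_backlog = simulate_load_balanced(n=8, load=0.9, slots=5000)
print(delivered / arrived, max_backlog)
```

In this model almost every packet that arrives is delivered and the backlog stays small, even though a single cyclic-shift stage on its own would connect input i to output i only once every N slots.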

16

Outline of Chang’s Proof

1. Let $a(t)$ be the matrix of arrivals at time $t$, where $a_{ij}(t) = 1$ indicates an arrival at $i$ for $j$.
2. Let $b(t) = P_1(t)\,a(t)$ be the input traffic to the second stage.
3. Let $q(t)$ be the queue length matrix:

$$q(t+1) = \max\bigl(q(t) + b(t+1) - P_2(t+1),\ 0\bigr),$$

which expands to

$$q(t) = \max_{0 \le s \le t} \sum_{\tau = s+1}^{t} \bigl(b(\tau) - P_2(\tau)\bigr).$$

Theorem: If no output is oversubscribed, $q(t)$ converges to steady state: $q(t) \to q(\infty)$.

Proof: With $e$ the all-ones column vector and $\lambda = E[a(t)]$,

$$E[b(t)] = E[P_1(t)\,a(t)] = E[P_1(t)]\,E[a(t)] = \frac{1}{N}\, e e^{T} \lambda,$$

$$\lim_{t \to \infty} \frac{1}{t} \sum_{s=1}^{t} \bigl(b(s) - P_2(s)\bigr) = \frac{1}{N}\, e e^{T} \lambda - \frac{1}{N}\, e e^{T} < 0.$$

Holds under some mild conditions on $a(t)$ (weakly mixing arrival processes).
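
The step from the queue recursion to its expanded max-of-sums form can be checked numerically for a single queue. In the sketch below, x(t) stands for one entry of b(t) - P_2(t); the random sequence and function names are illustrative only.

```python
import random


def queue_by_recursion(x):
    """q(t+1) = max(q(t) + x(t+1), 0), starting from q(0) = 0."""
    q = 0
    history = [q]
    for step in x:
        q = max(q + step, 0)
        history.append(q)
    return history


def queue_by_expansion(x, t):
    """q(t) = max over 0 <= s <= t of the sum of x(s+1..t); the empty sum is 0."""
    # x[0] holds x(1), so x(s+1), ..., x(t) is the slice x[s:t].
    return max(sum(x[s:t]) for s in range(t + 1))


rng = random.Random(1)
x = [rng.choice([-1, 0, 1]) for _ in range(200)]
history = queue_by_recursion(x)
matches = all(history[t] == queue_by_expansion(x, t) for t in range(len(x) + 1))
print(matches)
```

The expanded form makes the stability argument visible: if the long-run average of x is negative, sums over long windows drift downward and the maximum is attained over a short recent window.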

17

Outline

- Basic idea of load-balancing
- Packet mis-sequencing
- An optical switch fabric
- Scaling number of linecards
- Arbitrary arrangement of linecards

18

Packet Reordering

[Figure: the two-stage mesh with links at rate R/N; packets labeled 1 and 2 are shown inside the switch.]

19

Bounding Delay Difference Between Middle Ports

[Figure: the two-stage mesh with links at rate R/N; packets labeled 1 and 2 are shown at the middle ports, with queue contents measured in cells.]

20

UFS (Uniform Frame Spreading)

[Figure: the two-stage mesh with links at rate R/N; packets labeled 1, 2 and 3 form a frame at an input, and packets 1 and 2 are shown inside the switch.]

21

FOFF (Full Ordered Frames First)

[Figure: the two-stage mesh with links at rate R/N; packets labeled 1 and 2 are shown inside the switch.]

22

FOFF (Full Ordered Frames First)

Input Algorithm

- N FIFO queues corresponding to the N output flows
- Spread each flow uniformly: if last packet was sent to middle port k, send next to k+1.
- Every N time-slots, pick a flow:
  - If full frame exists, pick it and spread like UFS
  - Else if all frames are partial, pick one in round-robin order and send it

[Figure: an input with N FIFO queues; one queue holds packets 1, 2, 3 and 4, another holds packets 1 and 2.]
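
The input algorithm above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class and method names are hypothetical, a "full frame" is taken to mean at least N queued packets for one output, and full frames are also chosen in round-robin order, which the slide does not specify.

```python
from collections import deque


class FoffInput:
    """Hypothetical sketch of one FOFF input with N per-output FIFO queues."""

    def __init__(self, n):
        self.n = n
        self.queues = [deque() for _ in range(n)]  # one FIFO per output flow
        self.next_port = [0] * n  # next middle port for each flow
        self.rr_full = 0          # round-robin pointer over full frames (assumption)
        self.rr_partial = 0       # round-robin pointer over partial frames

    def enqueue(self, output, packet):
        self.queues[output].append(packet)

    def _pick(self, start, eligible):
        for offset in range(self.n):
            j = (start + offset) % self.n
            if eligible(j):
                return j
        return None

    def schedule_frame(self):
        """Call once every N time-slots; returns [(middle_port, output, packet), ...]."""
        n = self.n
        j = self._pick(self.rr_full, lambda f: len(self.queues[f]) >= n)
        if j is not None:
            self.rr_full = (j + 1) % n
        else:
            j = self._pick(self.rr_partial, lambda f: len(self.queues[f]) > 0)
            if j is None:
                return []
            self.rr_partial = (j + 1) % n
        sends = []
        for _ in range(min(n, len(self.queues[j]))):
            port = self.next_port[j]
            sends.append((port, j, self.queues[j].popleft()))
            self.next_port[j] = (port + 1) % n  # continue where the flow left off
        return sends
```

For example, with N = 3, a flow holding four packets is served before a flow holding two, and a partial frame picks up at the middle port where that flow last stopped.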

23

Bounding Reordering

[Figure: the two-stage mesh with links at rate R/N; packets labeled 1, 2, 3 and two packets labeled N are shown inside the switch.]

24

FOFF

Output properties

- N FIFO queues corresponding to the N middle ports
- Buffer size less than N^2 packets
- If there are N^2 packets, one of the head-of-line packets is in order

[Figure: an output with N FIFO queues holding packets labeled 1, 2, 3 and 4.]
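
One way to picture the output side is a resequencer that only ever looks at head-of-line packets. The sketch below is an assumption-heavy illustration: it detects "in order" with a per-flow sequence number, which the slides do not describe, and all names are hypothetical.

```python
from collections import deque


class FoffOutput:
    """Hypothetical sketch of one output with N FIFO queues, one per middle port."""

    def __init__(self, n):
        self.queues = [deque() for _ in range(n)]
        self.expected = {}  # next sequence number per flow (assumed mechanism)

    def receive(self, middle_port, flow, seq):
        self.queues[middle_port].append((flow, seq))

    def release(self):
        """Return one in-order head-of-line packet, or None if no head is in order."""
        for queue in self.queues:
            if queue:
                flow, seq = queue[0]
                if seq == self.expected.get(flow, 0):
                    queue.popleft()
                    self.expected[flow] = seq + 1
                    return (flow, seq)
        return None


out = FoffOutput(3)
out.receive(1, "x", 1)   # arrives early through middle port 1
out.receive(0, "x", 0)
out.receive(2, "x", 2)
released = [out.release() for _ in range(4)]
print(released)
```

Only the N heads are inspected per decision, which is the kind of constant-work operation the output properties point to.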

25

FOFF Properties

Property 1: FOFF maintains packet order.

Property 2: FOFF has O(1) complexity.

Property 3: Congestion buffers operate independently.

Property 4: FOFF maintains an average packet delay within a constant of that of an ideal output-queued router.

Corollary: FOFF has 100% throughput for any adversarial traffic.

26

Output-Queued Router?

[Figure: inputs and outputs, each with an external line at rate R, joined by a full mesh. The mesh links are marked "?" and R.]

27

Outline

- Basic idea of load-balancing
- Packet mis-sequencing
- An optical switch fabric
- Scaling number of linecards
- Arbitrary arrangement of linecards

28

From Two Meshes to One Mesh

[Figure: the two-stage mesh with links at rate R/N; an In port and an Out port are marked as belonging to one linecard.]

29

From Two Meshes to One Mesh

[Figure: four linecards, each with an In and an Out port, attached to a first mesh and a second mesh by links at rate R. One linecard is highlighted.]

30

From Two Meshes to One Mesh

[Figure: four linecards, each with an In and an Out port, attached to a single combined mesh by links at rate 2R.]

31

Many Fabric Options

Options

- Space: Full uniform mesh
- Time: Round-robin crossbar
- Wavelength: Static WDM

[Figure: linecards, each with an In and an Out port, attached to any spreading device over channels C1, C2, …, CN. One linecard is highlighted.]

N channels each at rate 2R/N

32

AWGR (Arrayed Waveguide Grating Router)
A Passive Optical Component

- Wavelength i on input port j goes to output port (i+j-1) mod N
- Can shuffle information from different inputs

[Figure: an NxN AWGR with Linecards 1, 2, …, N on the input side and Linecards 1, 2, …, N on the output side; each input carries wavelengths λ1, λ2 … λN.]
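
The routing rule can be tabulated to see the shuffle. The sketch below applies (i+j-1) mod N with 1-based wavelength and port indices, and treats a result of 0 as port N; that last convention is an assumption, since the slide does not say how the residue 0 is numbered.

```python
def awgr_output(wavelength, input_port, n):
    """Output port for wavelength index i on input port j (both 1-based)."""
    port = (wavelength + input_port - 1) % n
    return port if port != 0 else n  # assumption: residue 0 denotes port N


def routing_table(n):
    """Map (wavelength, input_port) to output port for an NxN AWGR."""
    return {(i, j): awgr_output(i, j, n)
            for i in range(1, n + 1) for j in range(1, n + 1)}


n = 4
table = routing_table(n)
for j in range(1, n + 1):
    print("input", j, "->", [table[(i, j)] for i in range(1, n + 1)])
```

Each input reaches every output on a different wavelength, and a given wavelength from different inputs never lands on the same output, so the device behaves like a fixed full mesh with no switching and no power.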

33

Static WDM Switching: Packaging

[Figure: four linecards, each with an In and an Out port, attached to an AWGR labeled "Passive and Almost Zero Power". Each linecard sends channels A, B, C, D into the AWGR; the four AWGR outputs carry A, A, A, A; B, B, B, B; C, C, C, C; and D, D, D, D.]

N WDM channels, each at rate 2R/N

34

Outline

- Basic idea of load-balancing
- Packet mis-sequencing
- An optical switch fabric
- Scaling number of linecards
- Arbitrary arrangement of linecards

35

Scaling Problem

For N < 64, an AWGR is a good solution. We want N = 640. Need to decompose.

36

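A Different Representation of the Mesh

[Figure: four linecards, each with an In and an Out port, attached to a mesh; links are labeled R and 2R. The same linecards are redrawn with the transmit and receive sides separated.]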

37

A Different Representation of the Mesh

[Figure: four linecards drawn on a transmit side and a receive side, fully connected by links at rate 2R/N; external lines run at rate R.]

38

Example: N=8

[Figure: linecards 1 to 8 on the transmit side fully connected to linecards 1 to 8 on the receive side by links at rate 2R/8.]

39

When N is Too Large
Decompose into groups (or racks)

[Figure: linecards 1 to 8 on each side, arranged into groups. Links are labeled 2R at the linecards, 4R at the groups and 4R/4 between groups.]

40

When N is Too Large
Decompose into groups (or racks)

[Figure: Group/Rack 1 to Group/Rack G on each side, each holding linecards 1, 2, …, L. Each linecard attaches at rate 2R, each group aggregates rate 2RL, and each group-to-group link runs at rate 2RL/G.]

41

Outline

- Basic idea of load-balancing
- Packet mis-sequencing
- An optical switch fabric
- Scaling number of linecards
- Arbitrary arrangement of linecards

42

When Linecards Fail

[Figure: Group/Rack 1 to Group/Rack G on each side, each holding linecards 1, 2, …, L. Each linecard attaches at rate 2R, each group aggregates rate 2RL, and each group-to-group link runs at rate 2RL/G.]

Solution: replace mesh with sum of permutations

[Figure: the group-to-group mesh drawn as a sum of permutations built from links at rate 2RL/G, with 2RL = G * 2RL/G.]

43

Hybrid Electro-Optical Architecture
Using MEMS Switches

[Figure: Group/Rack 1 to Group/Rack G on each side, each holding linecards 1, 2, …, L attached at rate 2R. The groups are interconnected through MEMS switches.]

44

When Linecards Fail

[Figure: Group/Rack 1 to Group/Rack G on each side, each holding linecards 1, 2, …, L attached at rate 2R. The groups are interconnected through MEMS switches.]

45

Fiber Link Capacity

[Figure: Group/Rack 1 to Group/Rack G on each side, each holding linecards 1, 2, …, L attached at rate 2R and interconnected through MEMS switches. Each fiber is fed by laser/modulators through a MUX.]

Link Capacity ≈ 64 λ’s * 5 Gb/s/λ = 320 Gb/s = 2R

46

Example: 2 Groups of 2 Linecards

[Figure: Group/Rack 1 and Group/Rack 2 on each side, each holding linecards 1 and 2 attached at rate 2R. Each group aggregates rate 4R, and the links between groups run at rate 2R.]

47

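Number of MEMS Switches

G groups, $L_i$ linecards in group $i$, $N = \sum_{i=1}^{G} L_i$, $L = \max_k L_k$.

Theorem: $M \equiv L + G - 1$ MEMS switches are sufficient for bandwidth.

Examples:

- $N = 640$, $L = 16$, $G = 40$: $M = 55$
- $L = G = \sqrt{N}$: $M \approx 2\sqrt{N}$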
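
The count in the theorem is simple enough to compute directly from a list of group sizes. The sketch below reproduces the first example; the function name is illustrative, and the uneven-rack input at the end is a made-up case.

```python
def mems_switches_needed(linecards_per_group):
    """M = L + G - 1, with L the largest group size and G the number of groups."""
    num_groups = len(linecards_per_group)
    largest_group = max(linecards_per_group)
    return largest_group + num_groups - 1


groups = [16] * 40                      # 40 racks of 16 linecards each
print(sum(groups))                      # total linecards N
print(mems_switches_needed(groups))     # M for the first example

uneven = [16, 12, 3, 9]                 # hypothetical partially populated racks
print(mems_switches_needed(uneven))
```

Because the bound depends only on the largest group and the number of groups, removing linecards from some racks never increases the number of switches required.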

48

Packet Schedule

[Figure: Group A and Group B on each side, each holding linecards 1 and 2 attached at rate 2R. Each group aggregates rate 4R, and the links between groups run at rate 2R.]

49

Rules for Packet Schedule

At each time-slot:

- Each transmitting linecard sends one packet
- Each receiving linecard receives one packet
- (MEMS constraint) Each transmitting group i sends at most one packet to each receiving group j through each MEMS connecting them

In a schedule of N time-slots:

- Each transmitting linecard sends exactly one packet to each receiving linecard

50

Packet Schedule

                        T+1  T+2  T+3  T+4
Tx Group A   Tx LC A1    ?    ?    ?    ?
             Tx LC A2    ?    ?    ?    ?
Tx Group B   Tx LC B1    ?    ?    ?    ?
             Tx LC B2    ?    ?    ?    ?

51

Packet Schedule

                        T+1  T+2  T+3  T+4
Tx Group A   Tx LC A1   A1   A2   B1   B2
             Tx LC A2   B2   A1   A2   B1
Tx Group B   Tx LC B1   B1   B2   A1   A2
             Tx LC B2   A2   B1   B2   A1

52

Bad Packet Schedule

                        T+1  T+2  T+3  T+4
Tx Group A   Tx LC A1   A1   A2   B1   B2
             Tx LC A2   B2   A1   A2   B1
Tx Group B   Tx LC B1   B1   B2   A1   A2
             Tx LC B2   A2   B1   B2   A1

53

Group Schedule

             T+1  T+2  T+3  T+4
Tx Group A   AB   AB   AB   AB
Tx Group B   AB   AB   AB   AB

54

Good Packet Schedule

                        T+1  T+2  T+3  T+4
Tx Group A   Tx LC A1   A1   A2   B1   B2
             Tx LC A2   B2   B1   A2   A1
Tx Group B   Tx LC B1   B1   B2   A1   A2
             Tx LC B2   A2   A1   B2   B1

Theorem: There exists a polynomial-time algorithm that finds the correct packet schedule.
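
The difference between the bad and the good schedule can be checked mechanically against the rules for a packet schedule. The sketch below is a checker only, not the polynomial-time construction from the theorem. It assumes the group of a linecard is the first letter of its name, and it reads the limit of one packet per group pair per time-slot off the group schedule of this two-group example.

```python
def group_of(linecard):
    return linecard[0]  # assumption: linecard "A1" belongs to group "A"


def check_schedule(schedule, max_per_group_pair=1):
    """schedule maps each Tx linecard to the list of its receivers, one per slot."""
    senders = list(schedule)
    problems = []
    for tx, row in schedule.items():
        if sorted(row) != sorted(senders):
            problems.append(f"{tx} does not reach every linecard exactly once")
    for t in range(len(senders)):
        receivers = [schedule[tx][t] for tx in senders]
        if len(set(receivers)) != len(receivers):
            problems.append(f"slot {t + 1}: a linecard receives two packets")
        load = {}
        for tx in senders:
            pair = (group_of(tx), group_of(schedule[tx][t]))
            load[pair] = load.get(pair, 0) + 1
        for (src, dst), count in load.items():
            if count > max_per_group_pair:
                problems.append(f"slot {t + 1}: {count} packets from group {src} to group {dst}")
    return problems


BAD = {"A1": ["A1", "A2", "B1", "B2"], "A2": ["B2", "A1", "A2", "B1"],
       "B1": ["B1", "B2", "A1", "A2"], "B2": ["A2", "B1", "B2", "A1"]}
GOOD = {"A1": ["A1", "A2", "B1", "B2"], "A2": ["B2", "B1", "A2", "A1"],
        "B1": ["B1", "B2", "A1", "A2"], "B2": ["A2", "A1", "B2", "B1"]}

print(check_schedule(BAD))
print(check_schedule(GOOD))
```

Both schedules satisfy the per-linecard rules; only the group-level constraint separates them, which is why the group schedule is fixed first.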

55

Outline

- Basic idea of load-balancing
- Packet mis-sequencing
- An optical switch fabric
- Scaling number of linecards
- Arbitrary arrangement of linecards
