
Page 1: High Performance Routing

1

High Performance Switching and Routing
Telecom Center Workshop: Sept 4, 1997.

High Performance Routing

Nick McKeown
Assistant Professor of Electrical Engineering and Computer Science, Stanford University

Abrizio/PMC-Sierra Inc.

[email protected] http://www.stanford.edu/~nickm

Page 2: High Performance Routing

2

Outline

• Outline
• Review: What is a Router?
• The Evolution of Routers
• Single-stage switching: The Fork-Join Router

Page 3: High Performance Routing

3

Outline

• Switching is the bottleneck in a router.
• The trend has been to overcome limitations in memory bandwidth:
  – Shared memory -> Single-stage, crossbar-based, combined input and output queued (CIOQ).
• …and reduce power per-rack & per-system:
  – Single-box systems -> Multi-rack systems (LCS).

Page 4: High Performance Routing

4

Outline (2)

• What comes next?
• Multistage switches solve the wrong problem:
  – N^2 is not the problem.
  – Multistage switches are more blocking, more power-hungry and less predictable.
• Parallel single-stage switches (e.g. the Fork-Join Router) are non-blocking, use less power, can achieve just as high a capacity, and can be predictable.

Page 5: High Performance Routing

5

Outline

• Outline• Review: What is a Router?• The Evolution of Routers• Single-stage switching:

The Fork-Join Router

Page 6: High Performance Routing

6

Basic Architectural Components

[Diagram: control plane above the datapath. Control plane: routing protocols, routing table, reservation/admission control. Datapath (per-packet processing): 1. Ingress (forwarding table, packet classification, policing & access control), 2. Interconnect (switching), 3. Egress (output scheduling).]

Page 7: High Performance Routing

7

Basic Architectural Components

[Diagram: datapath, per-packet processing. 1. Ingress: on each linecard, a forwarding table, classifier table, policing & access control, and a forwarding decision. 2. Interconnect. 3. Egress. Limitations: memory bandwidth at ingress and egress; interconnect bandwidth, power & arbitration in between.]

Page 8: High Performance Routing

8

Outline

• Outline• Review: What is a Router?• The Evolution of Routers• Single-stage switching:

The Fork-Join Router

Page 9: High Performance Routing

9

First Generation Routers

[Diagram: a shared backplane connecting a CPU (with route table and buffer memory) and several line interfaces (MACs). Packets cross the backplane as fixed-length "DMA" blocks or cells and are reassembled on the egress linecard; the links carry fixed-length cells or variable-length packets.]

Typically <0.5Gb/s aggregate capacity

Page 10: High Performance Routing

10

First Generation Routers
Queueing Structure: Shared Memory

[Diagram: N inputs and N outputs sharing one memory.]

• Large, single dynamically allocated memory buffer: N writes per "cell" time, N reads per "cell" time.
• Limited by memory bandwidth (illustrated below).
• A large body of work has proven and made possible:
  – Fairness
  – Delay Guarantees
  – Delay Variation Control
  – Loss Guarantees
  – Statistical Guarantees
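As a quick, illustrative check of that limit: with N ports at line rate R, the one shared memory must sustain N writes plus N reads per cell time, i.e. roughly 2NR of raw bandwidth. The port count and rate below are made-up example numbers:

```python
# Illustrative only: aggregate memory bandwidth a shared-memory router needs.
# Every cell is written once and read once, so the single shared memory must
# sustain N writes + N reads per cell time, i.e. about 2*N*R.

ports = 16              # N, example value
line_rate_gbps = 2.5    # R per port, example value

required_memory_bw_gbps = 2 * ports * line_rate_gbps
print(required_memory_bw_gbps)   # 80.0 Gb/s of memory bandwidth for 16 x 2.5Gb/s ports
```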

Page 11: High Performance Routing

11

Second Generation Routers

[Diagram: a route table CPU and several linecards on a shared bus. Each linecard has a MAC, a forwarding cache, and buffer memory. Annotations: slow path up to the CPU; drop policy (or backpressure) on ingress; output link scheduling on egress.]

Typically <5Gb/s aggregate capacity

Page 12: High Performance Routing

12

Second Generation Routers
As caching became ineffective

[Diagram: the same bus-based design, but each linecard now carries a full forwarding table rather than a forwarding cache, and the route table CPU serves as an exception processor.]

Page 13: High Performance Routing

13

Second Generation Routers
Queueing Structure: Combined Input and Output Queueing

[Diagram: input and output queues on the linecards, connected by a shared bus.]

1 write per "cell" time, 1 read per "cell" time. Rate of writes/reads determined by bus speed.

Page 14: High Performance Routing

14

Third Generation Routers

[Diagram: a switched backplane connecting linecards and a CPU card. Each linecard has a MAC, a forwarding table, and local buffer memory; the CPU card holds the routing table.]

Typically <50Gb/s aggregate capacity

Page 15: High Performance Routing

15

Third Generation Routers
Queueing Structure

[Diagram: per-linecard queues connected through a switch fabric.]

1 write per "cell" time, 1 read per "cell" time. Rate of writes/reads determined by switch fabric speedup.

Page 16: High Performance Routing

16

Third Generation Routers

[Rack diagram: 7’ tall, 19” or 23” wide.]

• Size-constrained: 19” or 23” wide.
• Power-constrained: ~<6kW. Supply: 100A/200A maximum at 48V.
• QoS unfriendly: input congestion.

Page 17: High Performance Routing

17

Fourth Generation Routers/Switches

[Diagram: linecards connected to a separate switch core over optical links up to 2km long, running the LCS protocol.]

Page 18: High Performance Routing

18

Fourth Generation Routers/Switches

The LCS Protocol

What is LCS?
1. Credit-based flow control: enables separation (sketched below).
2. Label-based multicast: enables scaling.

Its Benefits
1. Large Number of Ports. Separation enables a large number of ports in multiple racks.
2. Minimizes Switch Core Complexity and Power. The switch core can be bufferless and lossless; QoS, discard etc. are performed on the linecard.
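The slide only names credit-based flow control; the following is a minimal, generic sketch of the idea (the class, window size, and cell names are illustrative assumptions, not the actual LCS message format):

```python
# Minimal sketch of credit-based flow control between a linecard (sender)
# and a bufferless switch core (receiver). All names and the window size
# are illustrative, not the real LCS protocol.

from collections import deque

class CreditSender:
    def __init__(self, initial_credits: int):
        self.credits = initial_credits      # cells we may send right now
        self.backlog = deque()              # cells waiting on the linecard

    def enqueue(self, cell):
        self.backlog.append(cell)

    def try_send(self):
        """Send at most one cell, and only if a credit is available."""
        if self.credits > 0 and self.backlog:
            self.credits -= 1
            return self.backlog.popleft()
        return None                         # otherwise hold the cell on the linecard

    def receive_credit(self, n: int = 1):
        """Core returns credits as it forwards cells, so queues stay on the linecard."""
        self.credits += n

# Usage: the long optical link only needs to carry cells and credit returns,
# which is what lets the linecards sit in separate racks from the core.
sender = CreditSender(initial_credits=4)
for i in range(6):
    sender.enqueue(f"cell-{i}")
sent = [sender.try_send() for _ in range(6)]    # only 4 go out...
sender.receive_credit(2)                        # ...until credits come back
sent += [sender.try_send(), sender.try_send()]
print([c for c in sent if c])                   # cell-0 .. cell-5
```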

Page 19: High Performance Routing

19

Fourth Generation Routers/Switches

Queueing Structure

[Diagram: each linecard performs lookup & drop policy, holds virtual output queues, and does output scheduling; the bufferless switch core contains the switch fabric and switch arbitration.]

1 write per "cell" time, 1 read per "cell" time. Rate of writes/reads determined by switch fabric speedup.

Typically <5Tb/s aggregate capacity

Page 20: High Performance Routing

20

Myths about CIOQ-based crossbar switches

1. “Input-queued crossbars have low throughput”
   – An input-queued crossbar can achieve throughput as high as any switch.
2. “Crossbars don’t support multicast traffic well”
   – A crossbar inherently supports multicast efficiently.
3. “Crossbars don’t scale well”
   – Today, it is the number of chip I/Os, not the number of crosspoints, that limits the size of a switch fabric. Expect 5Tb/s crossbar switches.

Page 21: High Performance Routing

21

Myths about CIOQ-based crossbar switches (2)

4. “Crossbar switches can’t support delay/QoS guarantees”

– With an internal speedup of 2, a CIOQ switch can precisely emulate a shared memory switch for all traffic.

Page 22: High Performance Routing

22

What makes sense today?

                  Shared Memory   Input Queued   CIOQ    Multistage
Blocking          No              No             No      Yes
Speedup           High            High           Small   High
Emulation of SM   Yes             No             Yes     No
Multicast         Good            Good           Good    Poor
Resequencing      No              No             No      Yes
Power             Low             OK             OK      High
Packaging         -               OK             OK      Complex

Page 23: High Performance Routing

23

What makes sense tomorrow?

Single-stage (if possible):
– Reduces complexity
– Minimizes interconnect b/w
– Minimizes power

Page 24: High Performance Routing

24

Outline

• Outline
• Review: What is a Router?
• The Evolution of Routers
• Single-stage switching: The Fork-Join Router

Page 25: High Performance Routing

25

Buffer Memory
How Fast Can I Make a Packet Buffer?

[Diagram: a 5ns SRAM buffer memory with a 64-byte wide bus on each side.]

Rough Estimate:
– 5ns per memory operation.
– Two memory operations per packet.
– Therefore, maximum 51.2Gb/s (worked below).
– In practice, closer to 40Gb/s.
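The 51.2Gb/s figure follows directly from the numbers on the slide; a quick check of the arithmetic (bus width and access time taken from the slide):

```python
# Check of the slide's rough estimate: a 5ns SRAM with a 64-byte wide
# data path, and two memory operations (one write, one read) per packet.

access_time_ns = 5          # one memory operation every 5ns
bus_width_bits = 64 * 8     # 64-byte wide bus = 512 bits per operation
ops_per_packet = 2          # each packet is written once and read once

# Each 64-byte chunk needs 2 operations, i.e. 10ns, to pass through the buffer.
time_per_chunk_ns = ops_per_packet * access_time_ns
rate_gbps = bus_width_bits / time_per_chunk_ns   # bits per ns == Gb/s

print(rate_gbps)   # 51.2, matching the slide's maximum of 51.2Gb/s
```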

Page 26: High Performance Routing

26

Buffer Memory
Is It Going to Get Better?

[Two trend sketches vs. time: one for Specmarks, memory size, and gate density; one for memory bandwidth (to core).]

Page 27: High Performance Routing

27

Fork-Join Router
Sponsored by NSF and ITRI

How can we:
– Increase capacity.
– Reduce power per subsystem.

While at the same time…
– Keep the system simple.
– Support line rates faster than memory bandwidth.
– Support guaranteed services.

Increase parallelism. Multiple racks. Single-stage buffering. Pkt-by-pkt load balancing (sketched below).

Hmmm….?
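A minimal sketch of what packet-by-packet load balancing over k slower layers could look like. The round-robin dispatch below is an illustrative assumption, not the Fork-Join algorithm itself; the point is only that each layer sees roughly 1/k of the traffic:

```python
# Illustrative sketch: spread an input running at line rate R over k parallel
# layers, each of which then only needs memory/lookup bandwidth of about R/k.
# The round-robin choice here is an assumption for illustration; the real
# Fork-Join/Parallel Packet Switch dispatch is more careful (see the
# work-conservation and emulation theorems later in the deck).

from collections import deque

class ForkStage:
    def __init__(self, k: int):
        self.layers = [deque() for _ in range(k)]   # one slower switch per layer
        self.next_layer = 0

    def dispatch(self, packet, egress_port: int):
        """Tag the packet with its egress port and hand it to one layer."""
        self.layers[self.next_layer].append((egress_port, packet))
        self.next_layer = (self.next_layer + 1) % len(self.layers)

# Each layer then runs an ordinary output-queued switch at rate R/k, and a
# join stage at each output merges the k layer outputs back to rate R.
fork = ForkStage(k=4)
for i in range(8):
    fork.dispatch(packet=f"pkt-{i}", egress_port=i % 2)
print([len(q) for q in fork.layers])   # [2, 2, 2, 2] -> load spread evenly
```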

Page 28: High Performance Routing

28

The Fork-Join Router

[Diagram: N input links at rate R fork across k parallel router layers and join back onto N output links at rate R; the fork (demultiplex) and join (multiplex) elements are bufferless.]

Page 29: High Performance Routing

29

The Fork-Join Router

• Advantages
  – Single stage of buffering
  – 1/k the power per subsystem
  – 1/k the memory bandwidth per subsystem
  – 1/k the forwarding table lookup rate per subsystem

Page 30: High Performance Routing

30

The Fork-Join Router

• Questions
  – Switching: What is the performance?
  – Forwarding Lookups: How do they work?

Page 31: High Performance Routing

31

A Parallel Packet Switch

[Diagram: N input links at rate R fork across k output-queued switches and join back onto N output links at rate R. Each arriving packet is tagged with its egress port.]

Page 32: High Performance Routing

32

Performance Questions

1. Can it be work-conserving?
2. Can it emulate a single big output queued switch?
3. Can it support delay guarantees, strict-priorities, WFQ, …?

Page 33: High Performance Routing

33

Work Conservation

[Diagram: a single input at rate R forks over k output-queued switches through internal links of rate R/k (the input link constraint), and the layers join to a single output at rate R through links of rate R/k (the output link constraint).]

Page 34: High Performance Routing

34

Work Conservation

[Diagram: the same setup, with example cells numbered 1–5 walked through the layers to show how the input and output link constraints restrict which cells can be written and read in each cell time.]

Page 35: High Performance Routing

35

Work Conservation

[Diagram: the full N-port parallel packet switch, with every internal link sped up from R/k to S(R/k).]

Page 36: High Performance Routing

36

Precise Emulation of an Output Queued Switch

[Diagram: an N-port output queued switch alongside an N-port parallel packet switch, with the question "= ?": can the parallel packet switch precisely emulate the output queued switch?]

Page 37: High Performance Routing

37

Parallel Packet Switch
Theorems

1. If S > 2k/(k+2) ≈ 2, then a parallel packet switch can be work-conserving for all traffic.

2. If S > 2k/(k+2) ≈ 2, then a parallel packet switch can precisely emulate a FCFS output-queued switch for all traffic.

Page 38: High Performance Routing

38

Parallel Packet Switch
Theorems

3. If S > 3k/(k+3) ≈ 3, then a parallel packet switch can precisely emulate a switch with WFQ, strict priorities, and other types of QoS, for all traffic. (The bounds in theorems 1–3 are evaluated numerically below.)
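To make the bounds concrete, a quick numeric check of the speedup expressions from theorems 1–3 for a few values of k:

```python
# Evaluate the speedup bounds from the theorems: S > 2k/(k+2) for work
# conservation and FCFS emulation, and S > 3k/(k+3) for WFQ and other QoS
# emulation. Both stay below 2 and 3 respectively and approach those
# values as the number of layers k grows.

def fcfs_bound(k: int) -> float:
    return 2 * k / (k + 2)

def qos_bound(k: int) -> float:
    return 3 * k / (k + 3)

for k in (2, 4, 8, 32, 128):
    print(f"k={k:>3}: S > {fcfs_bound(k):.3f} (FCFS), S > {qos_bound(k):.3f} (QoS)")
# k=  2: S > 1.000 (FCFS), S > 1.200 (QoS)
# k=128: S > 1.969 (FCFS), S > 2.931 (QoS)
```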

Page 39: High Performance Routing

39

Parallel Packet Switch
Theorems

4. If S > 2, then a parallel packet switch with a small co-ordination buffer running at rate R can precisely emulate a switch with WFQ, strict priorities, and other types of QoS, for all traffic.

Page 40: High Performance Routing

40

The Fork-Join Router

• Questions
  – Switching: What is the performance?
  – Forwarding Lookups: How do they work?

Page 41: High Performance Routing

41

The Fork-Join Router
Lookahead Forwarding Table Lookups

Packet tagged with egress port at the next router.
Lookup performed in parallel at rate R/k. (A sketch of the idea follows below.)
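The slide only names the mechanism; below is a minimal sketch of how lookahead lookups might work, assuming (per the slide) that each router computes and tags the egress port for the next router, so the fork stage never needs a full lookup at rate R. All table structures and names are illustrative:

```python
# Illustrative sketch (not the actual Fork-Join design) of lookahead lookups:
# the fork stage at line rate R only reads a tag written by the previous
# router, while each slower layer (rate R/k) looks up the egress port the
# packet will need at the NEXT router and rewrites the tag.

from dataclasses import dataclass

@dataclass
class Packet:
    dest: str            # destination (simplified to an exact-match key)
    egress_tag: int      # egress port at THIS router, computed upstream

class ForkJoinRouter:
    def __init__(self, next_hop_tables: dict):
        # egress port here -> forwarding table of the router behind that port
        self.next_fib = next_hop_tables

    def fork(self, pkt: Packet) -> int:
        # No lookup needed at rate R: just read the precomputed tag.
        return pkt.egress_tag

    def layer_process(self, pkt: Packet) -> Packet:
        # Inside one layer, running at R/k: look up the egress port at the
        # next router and overwrite the tag before the packet leaves.
        pkt.egress_tag = self.next_fib[pkt.egress_tag][pkt.dest]
        return pkt

# Usage sketch: the packet arrives tagged for port 3 here, and leaves tagged
# with port 7, which it will use at the router behind port 3.
r = ForkJoinRouter({3: {"10.0.0.0/16": 7}})
p = Packet("10.0.0.0/16", egress_tag=3)
print(r.fork(p))                      # 3
print(r.layer_process(p).egress_tag)  # 7
```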

Page 42: High Performance Routing

42

The Fork-Join Router

[Diagram: the Fork-Join router again: N inputs at rate R forked across k layers and joined back onto N outputs at rate R.]

Expect >50Tb/s aggregate capacity

Page 43: High Performance Routing

43

Conclusions

• The main problems are power (supply and dissipation) and memory bandwidth.

• Multi-stage switches solve the wrong problem.

• Single-stage switches are here to stay.

• Very high capacity single-stage electronic routers are feasible.