high performance routing
1
High Performance Switching and Routing
Telecom Center Workshop: Sept 4, 1997.
High Performance Routing
Nick McKeown
Assistant Professor of Electrical Engineering and Computer Science, Stanford University
Abrizio/PMC-Sierra Inc.
[email protected] http://www.stanford.edu/~nickm
2
Outline
• Outline
• Review: What is a Router?
• The Evolution of Routers
• Single-stage switching: The Fork-Join Router
3
Outline
• Switching is the bottleneck in a router.
• The trend has been to overcome limitations in memory bandwidth:
  – Shared memory -> Single-stage, crossbar-based, combined input and output queued (CIOQ).
• …and reduce power per-rack & per-system:
  – Single box systems -> Multi-rack systems (LCS).
4
Outline (2)
• What comes next?
• Multistage switches solve the wrong problem:
  – N^2 is not the problem.
  – Multistage switches are more blocking, more power-hungry and less predictable.
• Parallel single-stage switches (e.g. the Fork-Join Router) are non-blocking, use less power, can achieve as high capacity, and can be predictable.
5
Outline
• Outline
• Review: What is a Router?
• The Evolution of Routers
• Single-stage switching: The Fork-Join Router
6
Basic Architectural Components
[Figure: router block diagram. Control plane: Routing Protocols, Routing Table, Reservation, Admission Control. Datapath (per-packet processing): Packet Classification, Policing & Access Control, Forwarding Table, Switching, Output Scheduling. Stages: 1. Ingress, 2. Interconnect, 3. Egress.]
7
Basic Architectural Components
Datapath: per-packet processing
[Figure: three stages, 1. Ingress, 2. Interconnect, 3. Egress. Each linecard shown contains a Forwarding Table, a Classifier Table, Policing & Access Control, and a Forwarding Decision block. Limitations marked on the figure: memory b/w (marked twice); interconnect b/w; power & arbitration.]
8
Outline
• Outline
• Review: What is a Router?
• The Evolution of Routers
• Single-stage switching: The Fork-Join Router
9
First Generation Routers
[Figure: a shared backplane connecting a CPU (with route table and buffer memory) to several line interfaces, each with a MAC.]
• Fixed length “DMA” blocks or cells. Reassembled on egress linecard.
• Fixed length cells or variable length packets.
Typically <0.5Gb/s aggregate capacity
10
First Generation Routers
Queueing Structure: Shared Memory
[Figure: Inputs 1 to N write into, and Outputs 1 to N read from, one shared memory.]
Large, single dynamically allocated memory buffer:
– N writes per “cell” time
– N reads per “cell” time.
Limited by memory bandwidth.
A large body of work has shown how a shared memory switch can provide:
– Fairness
– Delay Guarantees
– Delay Variation Control
– Loss Guarantees
– Statistical Guarantees
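The memory bandwidth limit above can be made concrete with a small calculation. This is a minimal sketch, not taken from the slides: the function name and the example port counts and line rate are illustrative assumptions. It only encodes the statement that all N ports write once and read once per cell time.

```python
def shared_memory_bandwidth_gbps(num_ports: int, line_rate_gbps: float) -> float:
    """Bandwidth the shared memory must sustain, in Gb/s.

    Every cell time the memory absorbs one write from each of the N inputs
    and one read for each of the N outputs, so it runs at 2 * N * R.
    """
    write_bandwidth = num_ports * line_rate_gbps  # N writes per cell time
    read_bandwidth = num_ports * line_rate_gbps   # N reads per cell time
    return write_bandwidth + read_bandwidth


if __name__ == "__main__":
    # Hypothetical example values: 2.5 Gb/s ports at several port counts.
    for n in (8, 16, 32):
        print(n, "ports ->", shared_memory_bandwidth_gbps(n, 2.5), "Gb/s")
```

The required bandwidth grows linearly with the port count, which is why a single shared memory stops scaling once 2NR exceeds what one memory system can deliver.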
11
Second Generation Routers
[Figure: a CPU with route table and buffer memory, connected to line cards; each line card has a MAC, buffer memory, and a forwarding cache. Labels: Slow Path; Drop Policy; Drop Policy or Backpressure; Output Link Scheduling.]
Typically <5Gb/s aggregate capacity
12
Second Generation Routers
As caching became ineffective
[Figure: a CPU with route table, connected to line cards; each line card has a MAC, buffer memory, and a forwarding table in place of the forwarding cache. Label: Exception Processor.]
13
Second Generation Routers
Queueing Structure: Combined Input and Output Queueing
[Figure: input queues and output queues connected by a bus.]
– 1 write per “cell” time
– 1 read per “cell” time
Rate of writes/reads determined by bus speed
14
Third Generation Routers
[Figure: line cards (each with a MAC, local buffer memory, and a forwarding table) and a CPU card (CPU, memory, routing table), connected through line interfaces to a switched backplane.]
Typically <50Gb/s aggregate capacity
15
Third Generation Routers
Queueing Structure
[Figure: input queues and output queues connected by a switch.]
– 1 write per “cell” time
– 1 read per “cell” time
Rate of writes/reads determined by switch fabric speedup
16
Third Generation Routers
[Figure: an equipment rack, 19” or 23” wide and 7’ tall.]
• Size-constrained: 19” or 23” wide.
• Power-constrained: ~<6kW.
• QoS unfriendly: input congestion.
Supply: 100A/200A maximum at 48V
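The supply figures translate directly into a per-rack power ceiling. The sketch below is plain arithmetic (power = current × voltage); the function name is an illustrative assumption, and it ignores conversion losses and any derating a real installation would apply.

```python
def supply_power_kw(current_amps: float, voltage_volts: float = 48.0) -> float:
    """Raw DC power available from a feed, in kilowatts (P = I * V)."""
    return current_amps * voltage_volts / 1000.0


if __name__ == "__main__":
    for amps in (100, 200):
        print(f"{amps} A at 48 V -> {supply_power_kw(amps):.1f} kW")
```

A 100A feed gives 4.8kW and a 200A feed gives 9.6kW, the same order as the power constraint quoted for a single rack.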
17
Fourth Generation Routers/Switches
[Figure: linecards connected to a switch core by optical links up to 2km long, using the LCS protocol.]
18
Fourth Generation Routers/Switches
The LCS Protocol
What is LCS?
1. Credit-based flow control: enables separation.
2. Label-based multicast: enables scaling.
Its Benefits
1. Large Number of Ports. Separation enables large number of ports in multiple racks.
2. Minimizes Switch Core Complexity and Power. Switch core can be bufferless and lossless. QoS, discard etc. performed on linecard.
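To illustrate why credit-based flow control makes a link lossless, here is a minimal sketch of the generic technique. It is not the LCS wire format or state machine, which the slides do not spell out; the class name, method names, and the slot count are all hypothetical.

```python
from collections import deque


class CreditedLink:
    """Sender side of a generic credit-based link (illustrative only)."""

    def __init__(self, receiver_slots: int):
        # One credit per free buffer slot at the far end of the link.
        self.credits = receiver_slots
        # Cells held at the sender until a credit is available.
        self.pending = deque()

    def enqueue(self, cell):
        self.pending.append(cell)

    def try_send(self):
        """Send one cell if a credit is available; otherwise hold it."""
        if self.credits == 0 or not self.pending:
            return None
        self.credits -= 1  # the cell will occupy one slot at the receiver
        return self.pending.popleft()

    def on_credit(self, count: int = 1):
        """The receiver drained `count` slots and returned that many credits."""
        self.credits += count


if __name__ == "__main__":
    link = CreditedLink(receiver_slots=2)
    for cell in ("a", "b", "c"):
        link.enqueue(cell)
    print(link.try_send(), link.try_send(), link.try_send())
    link.on_credit()
    print(link.try_send())
```

Because a cell is only sent against a credit, the receiver can never be overrun, so queueing and discard decisions stay at the sender. The round-trip time of the link (for example over a long optical run) determines how many credits are needed to keep it busy.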
19
Fourth Generation Routers/Switches
Queueing Structure
[Figure: each ingress linecard performs lookup & drop policy and holds virtual output queues; the bufferless switch core contains the switch fabric and switch arbitration; each egress linecard performs output scheduling.]
– 1 write per “cell” time
– 1 read per “cell” time
Rate of writes/reads determined by switch fabric speedup
Typically <5Tb/s aggregate capacity
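The virtual output queue arrangement can be sketched as one FIFO per output on each ingress linecard, plus an arbiter that pairs inputs with outputs each cell time. The code below is a hedged illustration: the class and function names are hypothetical, and the first-fit greedy arbiter is a stand-in chosen for brevity, not the arbitration algorithm used in any router described in this talk.

```python
from collections import deque


class VOQLinecard:
    """Ingress linecard with one queue per output (virtual output queues)."""

    def __init__(self, num_outputs: int):
        self.voq = [deque() for _ in range(num_outputs)]

    def enqueue(self, cell, output: int):
        self.voq[output].append(cell)

    def requests(self):
        """Outputs for which this linecard currently holds at least one cell."""
        return [out for out, queue in enumerate(self.voq) if queue]


def greedy_match(linecards):
    """One arbitration round: each input and each output is used at most once."""
    taken_outputs = set()
    match = {}
    for inp, card in enumerate(linecards):
        for out in card.requests():
            if out not in taken_outputs:
                taken_outputs.add(out)
                match[inp] = out
                break
    return match


if __name__ == "__main__":
    cards = [VOQLinecard(2), VOQLinecard(2)]
    cards[0].enqueue("x", output=0)
    cards[1].enqueue("y", output=0)  # contends with input 0 for output 0
    cards[1].enqueue("z", output=1)  # would be stuck behind "y" in a single FIFO
    print(greedy_match(cards))
```

In this example input 1 loses output 0 to input 0 but can still send its cell for output 1 in the same cell time, which a single FIFO per input would not allow.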
20
Myths about CIOQ-based crossbar switches
1. “Input-queued crossbars have low throughput”
   – An input-queued crossbar can have as high throughput as any switch.
2. “Crossbars don’t support multicast traffic well”
   – A crossbar inherently supports multicast efficiently.
3. “Crossbars don’t scale well”
   – Today, it is the number of chip I/Os, not the number of crosspoints, that limits the size of a switch fabric. Expect 5Tb/s crossbar switches.
21
Myths about CIOQ-based crossbar switches (2)
4. “Crossbar switches can’t support delay/QoS guarantees”
– With an internal speedup of 2, a CIOQ switch can precisely emulate a shared memory switch for all traffic.
22
What makes sense today?
                 Shared Memory  Input Queued  CIOQ   Multistage
Blocking         No             No            No     Yes
Speedup          High           High          Small  High
Emulation of SM  Yes            No            Yes    No
Multicast        Good           Good          Good   Poor
Resequencing     No             No            No     Yes
Power            Low            OK            OK     High
Packaging        -              OK            OK     Complex
23
What makes sense tomorrow?
Single-stage (if possible):
– Reduces complexity
– Minimizes interconnect b/w
– Minimizes power
24
Outline
• Outline
• Review: What is a Router?
• The Evolution of Routers
• Single-stage switching: The Fork-Join Router
25
Buffer Memory
How Fast Can I Make a Packet Buffer?
[Figure: a 5ns SRAM buffer memory with a 64-byte wide bus in and a 64-byte wide bus out.]
Rough Estimate:
– 5ns per memory operation.
– Two memory operations per packet.
– Therefore, maximum 51.2Gb/s.
– In practice, closer to 40Gb/s.
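The rough estimate can be reproduced with a few lines of arithmetic. This sketch assumes, as the figure suggests, that each memory operation moves one full 64-byte bus width and that every packet costs one write plus one read; the function name and parameters are illustrative.

```python
def buffer_throughput_gbps(access_time_ns: float,
                           bus_width_bytes: int,
                           ops_per_packet: int = 2) -> float:
    """Peak packet-buffer throughput in Gb/s.

    Each packet needs `ops_per_packet` memory operations (one write on
    arrival, one read on departure), and each operation moves one bus
    width of data.
    """
    bits_per_transfer = bus_width_bytes * 8
    time_per_packet_ns = access_time_ns * ops_per_packet
    return bits_per_transfer / time_per_packet_ns  # bits per ns equals Gb/s


if __name__ == "__main__":
    print(buffer_throughput_gbps(access_time_ns=5, bus_width_bytes=64))
```

With 5ns SRAM and a 64-byte bus this gives 512 bits every 10ns, i.e. the 51.2Gb/s ceiling quoted above. Halving the access time or doubling the bus width are the only levers in this model.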
26
Buffer Memory
Is It Going to Get Better?
[Figure: two plots against time. One shows Specmarks, memory size, and gate density; the other shows memory bandwidth (to core).]
27
Fork-Join Router
Sponsored by NSF and ITRI
How can we:
– Increase capacity. -> Increase parallelism.
– Reduce power per subsystem. -> Multiple racks.
While at the same time…
– Keep the system simple. -> Single-stage buffering.
– Support line rates faster than memory bandwidth. -> Pkt-by-pkt load balancing.
– Support guaranteed services. -> Hmmm….?
28
The Fork-Join Router
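[Figure: N input ports and N output ports, each at rate R, connected through k parallel routers (1, 2, …, k). The stages that fork and join the traffic are labeled bufferless.]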
29
The Fork-Join Router
• Advantages
– Single stage of buffering
– Power per subsystem reduced by a factor of k
– Memory bandwidth reduced by a factor of k
– Forwarding table lookup rate reduced by a factor of k
30
The Fork-Join Router
• Questions
– Switching: What is the performance?
– Forwarding Lookups: How do they work?
31
A Parallel Packet Switch
[Figure: N input ports at rate R feed k parallel output queued switches (1, 2, …, k), which feed N output ports at rate R.]
Arriving packet tagged with egress port
32
Performance Questions
1. Can it be work-conserving?
2. Can it emulate a single big output queued switch?
3. Can it support delay guarantees, strict-priorities, WFQ, …?
33
Work Conservation
[Figure: one input at rate R and one output at rate R, connected through k output queued switches by internal links of rate R/k. Labels: Input Link Constraint; Output Link Constraint.]
34
Work Conservation
[Figure: the same arrangement with numbered cells (1 to 5) spread across the k layers over internal links of rate R/k. Label: Output Link Constraint.]
35
Work Conservation
[Figure: N input ports at rate R, k output queued switches, and N output ports at rate R; the internal links now run at rate S(R/k), where S is the speedup.]
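One way to read the input link constraint: an internal link of rate S(R/k) needs k/S external cell times to carry one cell, so a demultiplexer cannot hand a second cell to the same layer until that time has passed. The sketch below models only this constraint. The class name and the first-fit layer choice are assumptions made for illustration; a real demultiplexer must also respect the output link constraint when choosing a layer.

```python
import math


class Demultiplexer:
    """Spreads arriving cells over k layers, honoring the input link constraint."""

    def __init__(self, k: int, speedup: float = 1.0):
        # Cell times an internal link of rate S*(R/k) is busy per cell.
        self.busy_slots = math.ceil(k / speedup)
        # Earliest cell time at which each layer's input link is idle again.
        self.free_at = [0] * k

    def dispatch(self, now: int):
        """Return the first layer whose link is idle at time `now`, or None."""
        for layer, free_time in enumerate(self.free_at):
            if free_time <= now:
                self.free_at[layer] = now + self.busy_slots
                return layer
        return None


if __name__ == "__main__":
    demux = Demultiplexer(k=3)
    print([demux.dispatch(t) for t in range(4)])
    fast = Demultiplexer(k=4, speedup=2.0)
    print([fast.dispatch(t) for t in range(4)])
```

With one arrival per cell time there is always an idle link, because at most ceil(k/S) - 1 of the k links can still be busy. Speedup widens that choice, which is the freedom the demultiplexer needs to also satisfy the output side.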
36
Precise Emulation of an Output Queued Switch
[Figure: is an N×N Parallel Packet Switch equivalent to an N×N Output Queued Switch?]
37
Parallel Packet Switch
Theorems
1. If S > 2k/(k+2) (≈ 2 for large k) then a parallel packet switch can be work-conserving for all traffic.
2. If S > 2k/(k+2) (≈ 2 for large k) then a parallel packet switch can precisely emulate a FCFS output-queued switch for all traffic.
38
Parallel Packet Switch
Theorems
3. If S > 3k/(k+3) (≈ 3 for large k) then a parallel packet switch can precisely emulate a switch with WFQ, strict priorities, and other types of QoS, for all traffic.
39
Parallel Packet Switch
Theorems
4. If S > 2 then a parallel packet switch with a small co-ordination buffer at rate R, can precisely emulate a switch with WFQ, strict priorities, and other types of QoS, for all traffic.
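The speedup bounds in the theorems are easy to tabulate. The sketch below evaluates the two expressions 2k/(k+2) and 3k/(k+3) as they appear on the slides; the function names are illustrative, and the code says nothing about how the theorems are proved.

```python
def fcfs_speedup_bound(k: int) -> float:
    """Bound from theorems 1 and 2: S must exceed 2k/(k+2)."""
    return 2 * k / (k + 2)


def qos_speedup_bound(k: int) -> float:
    """Bound from theorem 3: S must exceed 3k/(k+3)."""
    return 3 * k / (k + 3)


if __name__ == "__main__":
    print(" k   2k/(k+2)   3k/(k+3)")
    for k in (2, 4, 8, 16, 64, 1024):
        print(f"{k:4d}   {fcfs_speedup_bound(k):.3f}      {qos_speedup_bound(k):.3f}")
```

Both bounds rise with k and approach 2 and 3 respectively, so for a switch with many layers the required speedup is effectively 2 for FCFS emulation and 3 for QoS emulation, while theorem 4 brings the QoS case back down to 2 at the cost of a small co-ordination buffer.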
40
The Fork-Join Router
• Questions
– Switching: What is the performance?
– Forwarding Lookups: How do they work?
41
The Fork-Join Router
Lookahead Forwarding Table Lookups
– Packet tagged with egress port at next router
– Lookup performed in parallel at rate R/k
42
The Fork-Join Router
[Figure: the Fork-Join Router, with N inputs and N outputs at rate R connected through k parallel routers (1, 2, …, k).]
Expect >50Tb/s aggregate capacity
43
Conclusions
• The main problems are power (supply and dissipation) and memory bandwidth.
• Multi-stage switches solve the wrong problem.
• Single-stage switches are here to stay.
• Very high capacity single-stage electronic routers are feasible.