can google route? building a high-speed switch from commodity hardware guido appenzeller, matthew...

21
Can Google Route? Building a High-Speed Switch from Commodity Hardware Guido Appenzeller, Matthew Holliman Q2/2002

Upload: brett-sparks

Post on 23-Dec-2015

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Can Google Route? Building a High-Speed Switch from Commodity Hardware Guido Appenzeller, Matthew Holliman Q2/2002

Can Google Route?

Building a High-Speed Switch from Commodity Hardware

Guido Appenzeller, Matthew HollimanQ2/2002

Page 2: Can Google Route? Building a High-Speed Switch from Commodity Hardware Guido Appenzeller, Matthew Holliman Q2/2002

Outline

Motiavation: Cost of switches The Basic Unit Switching Fabrics Analysis & Reality Check

Page 3: Can Google Route? Building a High-Speed Switch from Commodity Hardware Guido Appenzeller, Matthew Holliman Q2/2002

Price Survey of Gigabit SwitchesApples and Oranges

PC Gigabit Ethernet Card $34.99 (32 Bit) $49.35 (64 Bit)

Layer 2 24 Port Switch $2,200 16 Port Switch $1,500 8 Port Switch $830 4 Port Switch $460

Layer 3 24 Port Switch $7,467

Routers (High End) CISCO 7600

Max 15-80 GbitBase price $60kLine Cards $10kTotal $100k-250k???

CISCO 12400Max 160 GbitBase Price $120kLine Cards $25k-$50kTotal $300-$1m???

Page 4: Can Google Route? Building a High-Speed Switch from Commodity Hardware Guido Appenzeller, Matthew Holliman Q2/2002

Cost/Gigabit vs. Total Switching CapacityCost/Gigabit dramatically increases with aggregate speed

0

1000

2000

3000

4000

5000

6000

7000

8000

0 50 100 150 200

Total Maximum Throughput (kBits/ s)

Co

st

US

$/

GB

it

Page 5: Can Google Route? Building a High-Speed Switch from Commodity Hardware Guido Appenzeller, Matthew Holliman Q2/2002

Let’s build a Switch the Google wayUse large numbers of cheap PC components

Use cheap PC boards ($250) 16 MBytes of memory No Disk

Use cheap copper Gigabit Ethernet Cards Use Clos Networks to build larger fabrics out of smaller

ones

PC

PC

PC

PC

PC

PC

AddressLookup

OutputQueuing

Ethernet Ethernet

Page 6: Can Google Route? Building a High-Speed Switch from Commodity Hardware Guido Appenzeller, Matthew Holliman Q2/2002

NICFIFO

Ethernet frame arrivals

NIC transferspacket(s)

PCI busarbitration/address

PCI burstwrite

Terminatetransaction

NIC raisesinterrupt

CPU processing

CPU

DRAM

Gig-e Gig-e

MCH

PCI bus

PC bus layout

PC Data FlowThe PCI bus is the bottleneck

3.2 GB/s

0.5 GB/s

Page 7: Can Google Route? Building a High-Speed Switch from Commodity Hardware Guido Appenzeller, Matthew Holliman Q2/2002

Incoming packet buffer

Interrupt

Readpacketheader

Copy packet to

output queue

DRAMaccess

LookupRead

packetheader

Copy packet to

output queueLookup …

NIC transferspacket(s)

Computational FlowInterrupts are a bottle neck for short packets

Packet processing is done from/to DRAM Packets are written from to network cards in bursts to

save IRQ overhead and PCI bandwidth

5 μs ~100 ns

Page 8: Can Google Route? Building a High-Speed Switch from Commodity Hardware Guido Appenzeller, Matthew Holliman Q2/2002

Per Port Throughput vs. Burst Size We need 66Mhz, 64-bit system

0 500 1000 15002

3

4

5

6

7

8

9

10

11x 100 Mbits

Burst Size

Per-

port

read/w

rite

bandw

idth

33 MHz, 32-bit66 MHz, 32-bit33 MHz, 64-bit66 MHz, 64-bit

1 GBitThreshold

Page 9: Can Google Route? Building a High-Speed Switch from Commodity Hardware Guido Appenzeller, Matthew Holliman Q2/2002

0 500 1000 1500100

101

102

103

104

Packet size

CPU

clo

cks

per

byte

N=1, 5 us N=10, 5 us N=30, 5 us N=1, 11 us N=10, 11 us N=30, 11 us

CPU clock cycles per byte vs. Packet SizeFor 100%throughput we need to aggregate short packets

1 GBitThreshold

bytes

AveragePacket

Size

N = # of packets per burst5us = HW latency only11us = With driver latency

Page 10: Can Google Route? Building a High-Speed Switch from Commodity Hardware Guido Appenzeller, Matthew Holliman Q2/2002

PC Performance SummaryToday’s PCs are just fast enough to operate as a 4x4 switch

To build a 4x4 half duplex (2x2 full duplex) switch we need: 66 MHz/64 Bit PCI bus 1 Gbyte/s Memory Bandwidth NIC must have sufficient buffers to aggregate short

packets to bursts (about 2kBytes)

Software has to run w/o interrupts e.g. Linux in halted mode

Page 11: Can Google Route? Building a High-Speed Switch from Commodity Hardware Guido Appenzeller, Matthew Holliman Q2/2002

Building larger SwitchesClos Network for an 16x16 Switch made of 4x4 Switches

Requires 12 PCs 48 Network Cards 8 GBit capacity Total cost: $5400

3Com is cheaper

Page 12: Can Google Route? Building a High-Speed Switch from Commodity Hardware Guido Appenzeller, Matthew Holliman Q2/2002

Building larger Switches Clos Network for an 256x256 Switch out of 16x16 switches

Requires 576 PCs 2304 network cards 128 Gbit capacity Total cost: $260k Now we are cheaper!

Well, sort of…

1

2

3

16

1

16

Page 13: Can Google Route? Building a High-Speed Switch from Commodity Hardware Guido Appenzeller, Matthew Holliman Q2/2002

Switch size vs. Basic Units NeededScaling is slightly worse than n log(n)

How many 4x4 switches do we need for an NxN switch? 4 x 4 1 switch 16 x 16 12 switches 256 x 256 576 switches 42^n x 42^n 3n42^n/4 switches

General: N x N needs (N/4) log4N (1.5)^log2log4N switches

Could you build a switch with less basic units Maybe, but not much Theoretical limit is O( (N/4) log4N ) Differing term (1.5)^log2log4N is small

Page 14: Can Google Route? Building a High-Speed Switch from Commodity Hardware Guido Appenzeller, Matthew Holliman Q2/2002

Scheduling – The Problem

How do we do scheduling? For n=k Clos Network we need dynamic

matching For 256x256 alogrithm is time-consuming

How to pass traffic information between inputs, scheduler and nodes More network connections are costly Timing critical

Page 15: Can Google Route? Building a High-Speed Switch from Commodity Hardware Guido Appenzeller, Matthew Holliman Q2/2002

Solution: Buffered Clos Networks

Two ideas:

1. Add a randomization stage (Chang et. al.) Now we can use round robin as scheduler This is deterministic and requires no synchronization

between nodes

2. Use the PC’s capability to buffer data Each node has a few Mbytes If there is a collision re-send packets

We use randomization

Page 16: Can Google Route? Building a High-Speed Switch from Commodity Hardware Guido Appenzeller, Matthew Holliman Q2/2002

Randomized, Buffering Clos Network

Stage 1: Pseudo Random

(no coordination needed) Never blocks

Stage 2: Round Robin

(no coordination needed) Never blocks.

PseudoRandom Round Robin

Page 17: Can Google Route? Building a High-Speed Switch from Commodity Hardware Guido Appenzeller, Matthew Holliman Q2/2002

Stability Analysis of the Switch First stage Analysis

Matching is random, distribution of packets on middle column of nodes is I.I.D. Uniform

No blocking can occur Queue length at the middle stage is equivalent to an IQR

with k inputs, VOQs and Round Robin scheduling• We know such a IQR has 100% throughput under I.I.D.

Uniform traffic Second stage

No blocking can occur, 100% throughput if all VOQs in middle stage are occupied

Queue length at the middle stage is equivalent to an output queued router with k inputs.

• Output queued router has 100% throughput

System has 100% throughput for any admissible traffic

Page 18: Can Google Route? Building a High-Speed Switch from Commodity Hardware Guido Appenzeller, Matthew Holliman Q2/2002

Reality Check

This might look like a good idea… Cheap Scalable – switch can grow Some Redundancy - node failure reduces throughput

by 1/16 worst case

…but is probably not exactly what Carriers want High power consumption (50 kW vs. 2.5 kW) Major space requirements (10-20 racks) Packet reordering can happen (TCP won’t like it) Maintenance – One PC will fail per day!

Page 19: Can Google Route? Building a High-Speed Switch from Commodity Hardware Guido Appenzeller, Matthew Holliman Q2/2002

Research OutlookWhy this could still be interesting

We can do this in hardware Implement in VLSI Build from chipsets that 24x24 switch manufacturers

use. We could use better base units

E.g. 1.15 TBit half duplex (fastest in the world?)• 576x576 using 24x24 Netgear switches (GS524T )• Cost: $158k• (We might get a volume discount from Netgear)

So far we don’t use intelligence of the nodes We can re-configure matchings periodically Distribute lookup

Page 20: Can Google Route? Building a High-Speed Switch from Commodity Hardware Guido Appenzeller, Matthew Holliman Q2/2002

Backup

Page 21: Can Google Route? Building a High-Speed Switch from Commodity Hardware Guido Appenzeller, Matthew Holliman Q2/2002

Randomized/HoQ Buffering Clos Network

Stage 1: Pseudo Random

(no coordination needed) Never blocks

Stage 2: Head of Queue

(no coordination needed) If it blocks, buffer packet

and resend.

Note: We can overprovision middle layer!

PseudoRandom Round Robin