Theia: Networking for Ultra-Dense Data Centers
meg walraed-sullivan, Jitendra Padhye, David A. Maltz, Microsoft
HotNets 2014
Simple and Cheap
SeaMicro
Ultra-Dense Data Centers (UDDCs)
• Data centers are expensive to build
• So we try to pack more hardware into existing data centers
• One way: pack more CPUs into a rack
FireBox HP Moonshot Intel RSA
UDDC Challenges
• System management
• Power and cooling
• Failure recovery
• How to tailor applications
• Networking
Traditional ToR-based architectures are no longer appropriate due to monetary cost, physical space requirements, and oversubscription.
Data Center Networks Today
[Diagram: racks of servers, each rack with a ToR]
Why Rethink the Architecture?
[Diagram labels: Rack with fewer servers and a ToR; Rack with hundreds/thousands of servers or SoCs and many ToRs]
• Problem: need to connect hundreds/thousands of servers
  • To each other
  • To rest of data center
• Naïve solutions won’t work (cost, power, space)
  • Can’t build a thousand-port ToR
  • Can’t add many ToRs per rack
• Trade star topology for fixed, direct-connect topology
  • Upside: cheap, no power, small physical space
  • Downside: lose full bisection bandwidth, flexible topology
Theia
• Preliminary design for UDDC network architecture
  • Building out and evaluating new vendor hardware
  • Design will undoubtedly change as we progress
• Goal is a simple, practical, cheap design
  • Beg, borrow, and steal from existing technologies
  • Throw hardware at the problem when it is cheap, software when not
• Theia is meant to start a conversation about UDDCs
The Theia Architecture
[Diagram: a rack with hundreds of servers, divided into SubRacks]
• Start with traditional rack
• Divide servers into SubRacks
• Replace ToR with fixed circuit interconnect (patch panel)
• Connect racks to one another using spare patch panel ports
Theia Architecture: SubRacks
• SubRack ≈ 1-2 rack units
• CPUs connected via In-Chassis Switch (ICS)
• Like our own “mini ToR,” but:
  • ICS-to-CPU connections are copper, not cable
  • Very little physical space required
[Diagram: a rack holding 10s of SubRacks; each SubRack holds 10s of CPUs connected to an ICS]
Theia Architecture: SubRacks
• Can tune (at deployment time) number of downlinks (ICS-CPU) vs. uplinks (ICS-patch panel and rest of rack)
• Tradeoff at ICS: aggregation vs. oversubscription
  • Oversubscription ratio: # uplinks : # CPUs
[Diagram: SubRacks of CPUs, each with an ICS; annotation: initially, ≤ ten uplinks]
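The downlink/uplink tradeoff at the ICS can be sketched with a few lines of arithmetic. This is a minimal illustration, not Theia's tooling: the function name `ics_port_split`, the fixed 40-port ICS, and the one-downlink-per-CPU rule are all assumptions made for the example. The ratio is reported in the form used on the slide (# uplinks : # CPUs).

```python
from fractions import Fraction


def ics_port_split(total_ports: int, uplinks: int) -> dict:
    """Split a hypothetical ICS's ports into uplinks and CPU-facing downlinks."""
    if not 0 < uplinks < total_ports:
        raise ValueError("uplinks must be between 1 and total_ports - 1")
    # Assumption: every port that is not an uplink serves exactly one CPU.
    cpus = total_ports - uplinks
    return {
        "uplinks": uplinks,
        "cpus": cpus,
        # Ratio as written on the slide: # uplinks : # CPUs.
        "uplinks_per_cpu": Fraction(uplinks, cpus),
    }


if __name__ == "__main__":
    # Hypothetical 40-port ICS, tuned at deployment time.
    for uplinks in (2, 5, 10):
        split = ics_port_split(total_ports=40, uplinks=uplinks)
        print(f'{split["uplinks"]} uplinks : {split["cpus"]} CPUs '
              f'= {float(split["uplinks_per_cpu"]):.3f} uplinks per CPU')
```

Under these assumptions, moving ports from downlinks to uplinks lowers the CPU count per SubRack but gives each CPU a larger share of uplink capacity.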
Theia Architecture: Patch Panel
[Diagram: a rack of SubRacks connected through the patch panel]
• Patch panel connects SubRacks to one another
  • 10s of SubRacks with ~10 uplinks = hundreds of ports
• Optical patch panel implements a fixed circuit topology
  • No active components
  • Draws no power
  • Compact
  • Adds no queuing delay
  • Cabling is simple (underlying topology is hidden)
• Tradeoff: cost (power, space, $) vs. fixed, direct topology
Theia Architecture: Inter-Rack Connectivity
• Repurpose “leftover” patch panel ports to interconnect racks
• Link between 2 racks may be groups of multiple links
• Build larger topology w/ each rack as a “super node”
What about oversubscription?
• At this scale, oversubscription is unavoidable
• More rack-locality can be expected
Tune this oversubscription by allocating patch panel ports to in-rack interconnect (purple) or inter-rack interconnect (red)
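The port allocation described above can be sketched as a simple budget. This is an illustrative model only: the function `patch_panel_budget`, the per-ICS split, and the example figures (30 SubRacks, 10 uplinks each) are assumptions chosen to match the "10s of SubRacks with ~10 uplinks" order of magnitude, not values from the deck.

```python
def patch_panel_budget(subracks: int, uplinks_per_ics: int,
                       in_rack_per_ics: int) -> dict:
    """Count patch panel ports used for in-rack vs. inter-rack interconnect.

    Assumption: each ICS dedicates `in_rack_per_ics` of its uplinks to links
    toward other SubRacks in the same rack, and the rest leave the rack.
    """
    if not 0 <= in_rack_per_ics <= uplinks_per_ics:
        raise ValueError("in_rack_per_ics must be between 0 and uplinks_per_ics")
    inter_rack_per_ics = uplinks_per_ics - in_rack_per_ics
    return {
        "total_ports": subracks * uplinks_per_ics,
        "in_rack_ports": subracks * in_rack_per_ics,
        "inter_rack_ports": subracks * inter_rack_per_ics,
    }


if __name__ == "__main__":
    # Sweep the split for a hypothetical rack of 30 SubRacks, 10 uplinks each.
    for in_rack in (10, 8, 6):
        budget = patch_panel_budget(subracks=30, uplinks_per_ics=10,
                                    in_rack_per_ics=in_rack)
        print(in_rack, budget)
```

Shifting ports toward inter-rack links raises capacity between racks at the cost of in-rack connectivity, which is the tuning knob the slide refers to.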
What about through-traffic?
• Traffic passes through intermediate racks
• Traffic traverses the patch panel (and therefore ICSs)
Patch Panel Topology
• Graph in which
  • Each node is an ICS (and its corresponding SubRack)
  • Links are implemented by patch panel internals
• What we care about:
  • Minimize through-traffic: latency and failure resilience
  • Support a wide range of graph sizes: UDDCs are still new
  • No dependency between number of nodes and ports per node
  • Reduce disruptions caused by failures and miscablings
Patch Panel Topology Options
Hypercube: constraints on number of nodes, port counts, dependency between the two (similar for torus, DCell, BCube, etc.)
Jellyfish: allows for organic growth, but this is not needed with a fixed-topology patch panel
Circulant Graph: can build a performant graph w/ any number of nodes, port counts
Initial Topology: Circulant Graph
• N nodes, numbered 0, …, N−1
• With p ports/node
• Strides S={…s…} s.t. node i connects to nodes i±s (mod N)
• …A ring with “short cuts”
• Key is to pick good shortcuts given N and p
S={1,6}: Avg Path Len = 1.933, ½ are 2-hops, Worst = 3 hops
S={3,8}: Avg Path Len = 2.6, ~Even split between 1, 2, 3, and 4 hops
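The path-length figures for the two stride sets can be checked with a breadth-first search over the circulant graph. The sketch below is an illustration rather than the authors' code; the function names are hypothetical, and N = 16 is an assumption, since the slide does not state the node count (16 is the value for which both sets of figures come out as quoted).

```python
from collections import Counter, deque


def circulant_neighbors(i: int, n: int, strides) -> set:
    """Neighbors of node i: nodes i+s and i-s (mod n) for every stride s."""
    return {(i + s) % n for s in strides} | {(i - s) % n for s in strides}


def hop_counts(n: int, strides, source: int = 0) -> dict:
    """Breadth-first search: shortest hop count from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in circulant_neighbors(u, n, strides):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def path_stats(n: int, strides):
    """Return (average path length, worst-case hops, histogram of hop counts).

    A circulant graph looks the same from every node, so one source suffices.
    """
    dist = hop_counts(n, strides)
    if len(dist) < n:
        raise ValueError("stride set does not connect the graph")
    hops = [d for node, d in dist.items() if node != 0]
    return sum(hops) / len(hops), max(hops), Counter(hops)


if __name__ == "__main__":
    for strides in ((1, 6), (3, 8)):
        avg, worst, hist = path_stats(16, strides)  # N = 16 is an assumption
        print(strides, round(avg, 3), worst, dict(sorted(hist.items())))
```

With N = 16, S={1,6} gives 4 one-hop, 8 two-hop, and 3 three-hop destinations (average 29/15 ≈ 1.933), while S={3,8} gives 3, 4, 4, and 4 destinations at 1 to 4 hops (average 2.6). Stride 8 is its own inverse when N = 16, so it contributes only one link per node, which is part of why that set performs worse.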
Circulant Graph
• Initial reasons for choosing: Flexibility
  • Wide range of graph sizes
  • No dependency between port count and number of nodes
• Turns out to be quite performant
  • Low amount of “through” traffic
  • Resilient to failure in connectivity, performance, and consistency
  • Simple, elegant routing and forwarding
  • Miswirings likely to cause isomorphic graphs
Circulant Graph Average Path Lengths
[Chart: best avg. path length across stride sets (y-axis, 0 to 18) vs. circulant graph size <# nodes, # strides> (x-axis; 16 to 32 nodes in steps of 2, plus 48 and 64 nodes; 1 to 5 strides each)]
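One way to obtain a "best average path length across stride sets" for a given graph size is exhaustive search. The sketch below is an assumption about how such values could be computed, not a description of how the chart was produced; the function names are hypothetical, and candidate strides are taken from 1 to N/2 because stride s and stride N−s yield the same links.

```python
from collections import deque
from itertools import combinations
import math


def avg_path_length(n: int, strides) -> float:
    """Average hop count from node 0 to all other nodes (inf if disconnected)."""
    dist = [-1] * n
    dist[0] = 0
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for s in strides:
            for v in ((u + s) % n, (u - s) % n):
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
    if min(dist) < 0:
        return math.inf
    return sum(dist) / (n - 1)


def best_stride_set(n: int, k: int):
    """Try every set of k distinct strides in 1..n//2; return (best average, strides)."""
    candidates = combinations(range(1, n // 2 + 1), k)
    return min((avg_path_length(n, s), s) for s in candidates)


if __name__ == "__main__":
    # Small sizes only: the number of stride sets grows quickly with n and k.
    for n in (16, 18, 20):
        for k in (1, 2, 3):
            best, strides = best_stride_set(n, k)
            print(f"<{n}, {k}>: best avg path length {best:.3f} with S={strides}")
```

With a single stride the graph is a plain ring, so the average grows linearly with N (64/15 ≈ 4.27 for 16 nodes). Adding a second stride to a 16-node graph brings the best average down to 29/15 ≈ 1.93.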
Latency and Through-Traffic
Summary
• ToR-based architecture won’t work for UDDCs
• Theia: Preliminary architecture to support 1000s of CPUs/rack
  • Flexibility of packet-switched network over fixed circuit topology
• Just the beginning of this conversation:
  • Other in-rack topologies…
  • Inter-rack connectivity: will our proposal scale to data center size?
  • Routing and addressing: different protocols for inter- and intra-rack?
  • Tailoring topology to workload and workload to (dense) topology