
CamCube - Rethinking the Data Center Cluster

Paolo Costa costa@imperial.ac.uk

joint work with Austin Donnelly, Greg O’Shea, Antony Rowstron (MSRC), Hussam Abu-Libdeh (Intern, Cornell), Simon Schubert (Intern, EPFL)


A New Software Stack


Dremel

Dryad/DryadLINQ


Network is a critical component

Focus of this talk: how to make it easy to design and deploy efficient data center applications

Building Data Center Applications is Hard!

Abstraction Reality


• Application logical topologies

Dynamo

MapReduce

Tree

Dremel

Databus

• Data center physical topology


Abstraction & Reality Mismatch


Switches

Router

One logical hop is mapped to multiple physical hops


Two disjoint logical paths share some physical links


Issue #1: Oversubscription

Switches

Router

Bandwidth gets scarce as you move up the tree

Locality is key to performance

Issue #2: Path collision

The network allocates paths independently

Applications cannot modify the way packets are routed

Addressing These Issues…

• Oversubscription: Fat-tree [SIGCOMM’08], VL2 [SIGCOMM’09], …

• Path collision: Hedera [NSDI’10], MPTCP [SIGCOMM’11], SPAIN [NSDI’10], …

• TCP Incast: DCTCP [SIGCOMM’10], ICTCP [CoNEXT’10], FDS [OSDI’12], …

• Traffic prioritization: Orchestra [SIGCOMM’11], D2TCP [SIGCOMM’12], …

• Fair sharing: Seawall [NSDI’11], FairCloud [SIGCOMM’12], …


Applications & Network Gap

The network is a black box for applications (and vice versa)

Applications perspective

10.0.1.4 10.0.2.3

• Applications only see IP addresses
  − Hard to infer locality & congestion

• No control on packet routing
  − Point-to-point only

• Need to reverse-engineer the network

[Figure annotation: "Why slow?"]

Network perspective

• The network only sees packets

• No insights about application behaviour

• Has to infer application patterns

[Figure annotations: "Are these related?", "Long vs. short flows?"]

Internet & Data Centers

Internet

• Multiple administration domains

• Heterogeneous HW and network

• Topology not known

• Malicious software

• This is due to how the Internet was designed…
  − …but data centers are not mini-Internets

Strict layer isolation

Data Centers

• Single administration domain

• Homogeneous HW and network
  − x86 and Ethernet

• Topology known
  − and can be customised

• Trusted components
  − e.g., using virtualization


How can we exploit this flexibility to improve efficiency and reduce complexity?


CamCube

How can we design a data center closer to what a distributed systems builder expects?


• Today: The network is a given and apps adapt to it

• CamCube: Adapt the network to the apps’ needs


Direct-Connect topology

Servers are directly interconnected to each other (no switches / routers)

Physical Ethernet cable

A fully connected mesh topology would be ideal

All logical topologies can be mapped perfectly


Not very scalable

Node degree grows linearly with N (high server load and cabling complexity)

Which topology?

• Various options available
  − Trees, rings, hypercubes, tori, …

• Scalable
  − Node degree is constant (=6)

• Fault-tolerant
  − High degree of multi-path

• Easy to wire
  − Only short links are needed

• Trade-off
  − Increased hop count

2D Torus

3D Torus
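The trade-off above can be checked numerically. The following is a minimal sketch, not CamCube code: the function names are illustrative, and the 3 x 3 x 3 size is simply the testbed size quoted later in the talk. It compares node degree and cable count for a fully connected mesh against a 3D torus, and computes the worst-case hop count on the torus.

```python
from itertools import product

def mesh_degree(n):
    """Links per server in a fully connected mesh of n servers."""
    return n - 1

def torus_neighbors(node, k):
    """The six 1-hop neighbors of a node in a k x k x k torus (k >= 3)."""
    x, y, z = node
    return [
        ((x + 1) % k, y, z), ((x - 1) % k, y, z),
        (x, (y + 1) % k, z), (x, (y - 1) % k, z),
        (x, y, (z + 1) % k), (x, y, (z - 1) % k),
    ]

def torus_hops(a, b, k):
    """Minimum hop count between two nodes, wrapping around each axis."""
    return sum(min((p - q) % k, (q - p) % k) for p, q in zip(a, b))

k = 3
nodes = list(product(range(k), repeat=3))
n = len(nodes)                     # number of servers in the torus
mesh_links = n * (n - 1) // 2      # cables needed for a full mesh
torus_links = n * 6 // 2           # each cable is shared by two servers
max_hops = max(torus_hops((0, 0, 0), b, k) for b in nodes)

print(f"servers={n} mesh degree={mesh_degree(n)} torus degree=6")
print(f"mesh links={mesh_links} torus links={torus_links} max hops={max_hops}")
```

The torus keeps the degree fixed at six however many servers are added; the price is that some pairs of servers are several hops apart.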


Network Visibility

Today:

• Limited network visibility (IP addresses only)
  − Hard to infer server location
  − Hard to infer congestion

CamCube:

• Nodes have (x,y,z) coordinates
  − Easy to understand locality

• Servers have full visibility on the status of network links

[Figure: torus axes x, y, z with example nodes (1,2,1) and (1,2,2), contrasted with IP addresses 10.0.1.4 and 10.0.2.3]

Packet Routing

Today:

• Single routing protocol
  − Point-to-point only

CamCube:

• Servers can intercept, process, and forward packets
  − Multiple custom routing protocols
  − e.g., multicast, multipath
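To make the multipath idea concrete, here is a small sketch of how a service running on the servers could pick alternative routes on a torus. It is an illustration only: the slides do not describe CamCube's routing protocols, and `dimension_order_path` and `step_toward` are hypothetical names. The sketch corrects one axis at a time; changing the axis order yields a different path between the same endpoints.

```python
def step_toward(src, dst, k):
    """Unit step (+1 or -1) along one axis, taking the shorter way round."""
    forward = (dst - src) % k
    return 1 if forward <= k - forward else -1

def dimension_order_path(src, dst, k, order):
    """Route from src to dst by fixing one axis at a time, in the given order."""
    path = [tuple(src)]
    cur = list(src)
    for axis in order:
        while cur[axis] != dst[axis]:
            cur[axis] = (cur[axis] + step_toward(cur[axis], dst[axis], k)) % k
            path.append(tuple(cur))
    return path

# Two routes between the same pair of servers in a 3 x 3 x 3 torus.
p_xy = dimension_order_path((0, 0, 0), (1, 1, 0), 3, order=(0, 1, 2))
p_yx = dimension_order_path((0, 0, 0), (1, 1, 0), 3, order=(1, 0, 2))
print(p_xy)
print(p_yx)
```

In this example the two routes share only their endpoints, so an application could send over both at once. That is not guaranteed for every pair of servers or every pair of axis orders.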


Packet Processing

Today:

• Application-agnostic packet processing
  − Typically header-only
  − e.g., OpenFlow

CamCube:

• Application-specific packet processing
  − Servers understand the application semantics
  − e.g., caching, aggregation


CamCube Services

• Several services have been implemented on top of CamCube, including:

• CamKey
  − Key-value store

• Camdoop
  − MapReduce-like system

• CamGraph
  − Graph processing engine

• TCP/IP service
  − Enables running unmodified TCP applications


Key-based Routing

• Packets are routed based on the key rather than the server address

• Inspired by Distributed Hash Tables (DHTs)
  − The (x,y,z) coordinates define a key-space

• 160-bit keys are expressed as (x,y,z,w)
  − If alive, (x,y,z) is the server responsible for the key
  − Otherwise, keys are re-mapped to 1-hop neighbors based on w

• Example
  − (2,2,0,27) -> (2,2,0), (2,1,0), (1,2,0), …
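The mapping rule above can be sketched as follows. This is a hedged illustration, not the CamCube implementation: the slides say only that re-mapping to 1-hop neighbors is "based on w", so rotating the neighbor list by w is an assumption, and `candidate_servers` and `responsible_server` are hypothetical names. The resulting order is not meant to reproduce the order in the slide's example.

```python
def neighbors(node, k):
    """The six 1-hop neighbors of a node in a k x k x k torus."""
    x, y, z = node
    return [
        ((x + 1) % k, y, z), ((x - 1) % k, y, z),
        (x, (y + 1) % k, z), (x, (y - 1) % k, z),
        (x, y, (z + 1) % k), (x, y, (z - 1) % k),
    ]

def candidate_servers(key, k):
    """Home server first, then its neighbors in an order derived from w.

    ASSUMPTION: the rotation by w is a stand-in for the real ordering rule.
    """
    x, y, z, w = key
    ring = neighbors((x, y, z), k)
    shift = w % len(ring)
    return [(x, y, z)] + ring[shift:] + ring[:shift]

def responsible_server(key, k, alive):
    """First alive candidate, or None if the home and all neighbors are down."""
    for server in candidate_servers(key, k):
        if server in alive:
            return server
    return None

k = 3
all_servers = {(x, y, z) for x in range(k) for y in range(k) for z in range(k)}
key = (2, 2, 0, 27)
print(responsible_server(key, k, all_servers))
print(responsible_server(key, k, all_servers - {(2, 2, 0)}))
```

Because every server can compute the same candidate list locally, a packet addressed to a key still reaches a live server after the home server fails, without any directory lookup.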

[Figure: 3D torus with axes x, y, z; nodes (2,2,0), (2,1,0), and (1,2,0) highlighted]

CamKey

• Reliable high-performance key-value store
  − Combination of BigTable + memcached

Two components:

• Replicated store
  − Ensures fault tolerance

• Caching service
  − Provides high performance


Replicated Store

hash(ID) = e689eb3… = (2,2,0,27)

Data object IDs are hashed using SHA-1 and the result is interpreted as 4D coordinates

Primary replica: (2,2,0)

The primary replica is stored at the server responsible for the key

(2,2,0,27) -> (2,2,0), (2,1,0), (1,2,0), …

Secondary replica: (2,1,0)

The first secondary replica is stored at the server that will become responsible for the key if the primary fails

Secondary replica: (1,2,0)

The second secondary replica is stored on the next server on the list, and so on

High locality: secondary replicas are 1-hop neighbors

Disjoint paths can be used
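The replica placement described above can be sketched in a few lines. This is an illustration under stated assumptions: the split of the SHA-1 digest into (x,y,z,w), the neighbor ordering by w, and the default of three copies are all hypothetical, since the slides do not specify them. The point it demonstrates is that the first secondary is, by construction, the server that takes over the key when the primary fails.

```python
import hashlib

K = 3  # torus side; 3 x 3 x 3 matches the testbed described later

def key_from_id(object_id, k=K):
    """Hash an ID with SHA-1 and read the 160-bit digest as (x, y, z, w).

    ASSUMPTION: this byte split is illustrative only.
    """
    digest = hashlib.sha1(object_id.encode("utf-8")).digest()
    x, y, z = (digest[i] % k for i in range(3))
    w = int.from_bytes(digest[3:], "big")
    return (x, y, z, w)

def candidates(key, k=K):
    """Home server, then its six neighbors rotated by w (ASSUMED ordering)."""
    x, y, z, w = key
    ring = [
        ((x + 1) % k, y, z), ((x - 1) % k, y, z),
        (x, (y + 1) % k, z), (x, (y - 1) % k, z),
        (x, y, (z + 1) % k), (x, y, (z - 1) % k),
    ]
    shift = w % len(ring)
    return [(x, y, z)] + ring[shift:] + ring[:shift]

def replica_servers(key, alive, copies=3, k=K):
    """Primary first, then secondaries: the first `copies` alive candidates."""
    return [s for s in candidates(key, k) if s in alive][:copies]

alive = {(x, y, z) for x in range(K) for y in range(K) for z in range(K)}
key = (2, 2, 0, 27)
before = replica_servers(key, alive)
after = replica_servers(key, alive - {before[0]})
print(before)
print(after)
```

After the primary fails, the old first secondary appears at the head of the list, so it already holds the data when it becomes responsible for the key.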

Client transparency: clients do not need to know the replica identity

Key-based routing is used to deliver packets

Route to (2,2,0,27)


Caching Service

hash(ID) = e689eb3… = (2,2,0,27)

f(2,2,0,27) -> (1,1,0,27), (3,1,0,27), (1,3,0,27), (3,3,0,27),…

For each key, we generate c additional keys that represent the locations of caches

These cache keys are assigned to servers using the usual mapping function

When a server looks up a key, the path is chosen so as to pass through the closest cache

On a cache miss, the lookup request is forwarded to the primary replica and the response is cached on the way back

Subsequent requests for the same key are intercepted on-path and the associated value is returned

Write operations always go to the primary replica and caches are invalidated
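The lookup and write behaviour walked through above can be summarised in a toy model. It is a sketch under assumptions, not CamKey itself: the class and method names are hypothetical, a single set of cache locations stands in for the per-key output of f (whose definition the slides do not give), and a 4 x 4 x 4 torus is assumed only so that the example coordinates from the slide are valid.

```python
def torus_hops(a, b, k):
    """Minimum hop count between two nodes on a k x k x k torus."""
    return sum(min((p - q) % k, (q - p) % k) for p, q in zip(a, b))

class CachedStore:
    """Toy model: one primary replica plus on-path caches for its keys."""

    def __init__(self, k, primary, cache_nodes):
        self.k = k
        self.primary = primary
        self.store = {}                              # data at the primary replica
        self.caches = {node: {} for node in cache_nodes}

    def closest_cache(self, client):
        """Cache server with the fewest hops from the client."""
        return min(self.caches, key=lambda node: torus_hops(client, node, self.k))

    def lookup(self, client, key):
        """Return (value, source); a miss is served by the primary and cached."""
        cache = self.caches[self.closest_cache(client)]
        if key in cache:
            return cache[key], "cache"
        value = self.store[key]                      # forwarded to the primary
        cache[key] = value                           # cached on the way back
        return value, "primary"

    def write(self, key, value):
        """Writes go to the primary; every cached copy is invalidated."""
        self.store[key] = value
        for cache in self.caches.values():
            cache.pop(key, None)

store = CachedStore(4, (2, 2, 0), [(1, 1, 0), (3, 1, 0), (1, 3, 0), (3, 3, 0)])
store.write("img", "v1")
first = store.lookup((0, 0, 0), "img")
second = store.lookup((0, 0, 0), "img")
store.write("img", "v2")
third = store.lookup((0, 0, 0), "img")
print(first, second, third)
```

Only the first lookup and the lookup after the write reach the primary; the one in between is answered by the cache on the path.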


Evaluation

Testbed
− 27-server CamCube (3 x 3 x 3)
− Quad-core 2.27 GHz, 12 GB RAM
− Six 1 Gbps ports per server
− Runtime & services implemented in user-space (C#)

Workload: Image store
− 9 external servers (up to 150 concurrent requests)
− Insert: 1.47 MB average image size
− Lookup: 3.55 KB average thumbnail size

Insert Throughput

[Figure: insert throughput (Gbps, 0 to 6; higher is better) vs. concurrent insert requests (0 to 150; load increases to the right). Series: switch, CamKey, switch (no disk), CamKey (no disk)]

Figure annotations: Disk I/O bounded; Server bandwidth bounded

CamKey exploits disjoint paths to create replicas

Lookup Throughput

[Figure: lookup rate (reqs/s, 0 to 160,000) vs. concurrent lookup requests (0 to 150) for switch, CamKey (disabled cache), and CamKey. Higher is better; load increases along the x-axis.]

Plot annotations:
− Latency 0.83 ms (median), 1.70 ms (95th perc)
− Latency 0.97 ms (median), 2.13 ms (95th perc)
− Higher hop count
− Caches reduce hop count

Failures

[Figure: CamKey insert throughput (Gbps, 0 to 6) over time (0 to 140 s).]

− A random server fails every 10 s
− Only 18 servers left

CamCube Services

• Several services have been implemented on top of CamCube, including:

• CamKey − Key-value store

• Camdoop − MapReduce-like system

• CamGraph − Graph processing engine

• TCP/IP service − Enables running unmodified TCP applications


MapReduce

• Map − Processes input data and generates (key, value) pairs

• Shuffle − Distributes the intermediate pairs to the reduce tasks

• Reduce − Aggregates all values associated with each key

[Diagram: an input file is divided into Chunks 0 to 2, each processed by a Map Task; the intermediate results flow to three Reduce Tasks, which produce the final results.]
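The three phases can be illustrated with a word count, the example workload named later in the talk. This is a single-process sketch for illustration only; the function names are hypothetical and nothing here reflects how Camdoop or any MapReduce framework is implemented.

```python
from collections import defaultdict

def map_task(chunk):
    # Map: emit a (word, 1) pair for every word in the input chunk.
    return [(word, 1) for word in chunk.split()]

def shuffle(intermediate, num_reducers):
    # Shuffle: route each pair to a reduce task chosen by hashing its key,
    # so that all values for one key end up at the same reduce task.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in intermediate:
        for key, value in pairs:
            partitions[hash(key) % num_reducers][key].append(value)
    return partitions

def reduce_task(partition):
    # Reduce: aggregate all values associated with each key.
    return {key: sum(values) for key, values in partition.items()}

def word_count(chunks, num_reducers=3):
    intermediate = [map_task(chunk) for chunk in chunks]
    result = {}
    for partition in shuffle(intermediate, num_reducers):
        result.update(reduce_task(partition))
    return result
```

For instance, `word_count(["a b a", "b c", "a"])` counts three occurrences of `a`, two of `b`, and one of `c`.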


Shuffle Phase

[Diagram: Splits 0 to 2, each feeding a Map Task; the intermediate results flow from the Map Tasks to three Reduce Tasks.]

• Shuffle phase is challenging for data center networks
− All-to-all traffic pattern with O(N^2) flows

• Often a bottleneck for MapReduce jobs
− Led to proposals for full-bisection bandwidth
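To make the quadratic growth concrete, the sketch below counts shuffle flows under a simplifying assumption that is not stated on the slide: each of the N servers runs one map task and one reduce task, and every map task sends one flow to every reduce task on a different server. The function name is hypothetical.

```python
def shuffle_flows(num_servers):
    # Assumption: one map task and one reduce task per server; each map task
    # opens one network flow to every reduce task hosted on another server.
    return num_servers * (num_servers - 1)
```

Under this assumption a 27-server cluster has 27 x 26 = 702 flows, and doubling the cluster to 54 servers roughly quadruples that to 2,862.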


Data Reduction

• The final results are typically much smaller than the intermediate results (e.g., WordCount)

• In most Facebook jobs, the final size is 5.4% of the intermediate size

• In most Yahoo jobs, the ratio is 8.2%

[Diagram: an input file is divided into Splits 0 to 2, each processed by a Map Task; the intermediate results flow to three Reduce Tasks, which produce the final results.]


How can we exploit this to reduce the traffic and improve the performance of the shuffle phase?


Aggregation Tree

• We could use aggregation trees to perform multiple steps of aggregation to reduce inter-rack traffic − e.g., rack-level aggregation
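A rough sketch of rack-level aggregation follows. It assumes the reduce function is associative and commutative (summing counts per key), and it measures traffic as the number of (key, value) pairs leaving each rack. The function names and the nested-list representation of racks are hypothetical and chosen only for illustration.

```python
from collections import Counter

def merge(partials):
    # Combine partial results by summing the counts for each key.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

def inter_rack_pairs(racks, aggregate_in_rack):
    # Count the (key, value) pairs that leave each rack on the way to the reducer.
    sent = 0
    for servers in racks:
        if aggregate_in_rack:
            sent += len(merge(servers))            # one merged result per rack
        else:
            sent += sum(len(s) for s in servers)   # every server sends its own pairs
    return sent
```

The more keys the servers in a rack have in common, the larger the saving; if their key sets are disjoint, aggregating inside the rack sends the same number of pairs as before.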


Mapping a tree…

… on a traditional topology

• Mismatch between logical and physical topology
− Link shared by all children
[Diagram labels: Rack, Switch]

… on CamCube

• 1:1 mapping between logical and physical topology

• Packets are aggregated on path (=> less traffic)
− Only one child per link


Camdoop: Improve the performance of the shuffle phase by reducing the traffic rather than by increasing the bandwidth


Workload Parameter

• Output size / intermediate size (S)
− S=1 (no aggregation)
   o All map outputs have a disjoint set of keys
− S=1/N ≈ 0 (full aggregation)
   o All map outputs share the same set of keys
− We use synthetic workloads to explore different values of S
   o Intermediate data size is 22.2 GB (843 MB/server)

[Diagram: an input file is divided into Splits 0 to 2, each processed by a Map Task; the intermediate results flow to three Reduce Tasks, which produce the output results.]
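The two extremes of S can be checked with a small calculation. The sketch assumes sizes are measured in (key, value) pairs, that each map task emits exactly one pair per key, and that reduction leaves one pair per distinct key; `output_ratio` is a hypothetical helper, not part of Camdoop.

```python
def output_ratio(map_outputs):
    # S = output size / intermediate size, measured in (key, value) pairs.
    # `map_outputs` holds one set of keys per map task.
    intermediate = sum(len(keys) for keys in map_outputs)
    output = len(set().union(*map_outputs))
    return output / intermediate
```

With N = 4 map tasks, disjoint key sets give S = 1, while four identical key sets give S = 1/4, matching the S = 1/N case above.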


Evaluation

[Figure: time (s, log scale, 1 to 1000) vs. output size / intermediate size (S, 0 to 1) for Baseline, Camdoop (no agg.), and Camdoop. Lower is better; S ≈ 0 is full aggregation and S = 1 is no aggregation.]

Plot annotations:
− Running on the switch using TCP
− Impact of in-network aggregation
− Facebook reported aggregation ratio
− Impact of running on CamCube

Summary

• Data centers present both unique challenges and opportunities to network designers

• Good time to revisit previous assumptions and rethink application and protocol design

• CamCube
− Enables applications to “control” the network
− Removes distinction between computation and network devices
