sylvia ratnasamy, paul francis, mark handley, richard karp, scott shenker a scalable, content-...

Sylvia Ratnasamy, Paul Francis, Mark Handley,

Richard Karp, Scott Shenker

A Scalable, Content-Addressable Network (CAN)

ACIRI U.C.Berkeley Tahoe Networks

1 2 3

1,2 3 1

1,2 1

Slide Credits (for UO CIS 410/510)

• Ratnasamy et.al. from SIGCOMM ‘01

• Ken Birman from CIS 514 Cornell

CAN: solution

• Virtual d-dimensional Cartesian coordinate space

• entire space is partitioned amongst all the nodes – every node “owns” a zone in the overall space– point = node that owns the enclosing zone

• State per node is O(d)• Routing between nodes is O(d * n1/d) , d dimensions, n nodes assuming evenly

distributed nodes

CAN Virtual Space

• Virtual d-dimensionalCartesian coordinatesystem on a d-torus– Example: 2-d [0,1]x[1,0]

– Note: coordinates can be

rational numbers !!

– Infinitely expandable

• Subspaces (zones) dynamically partitionedamong all nodes

CAN: simple example

1

CAN: simple example(vertical split first)

1 2

CAN: simple example(horizontal split next)

1

2

3

CAN: simple example

1

2

3

4

CAN: simple example

Storing and Retrieving (K,V) in CAN

• Pair (K,V) is stored bymapping key K to a point P in the space using a uniform hash function and storing (K,V) at the node in the zone containing P

• Retrieve entry (K,V) by applying the same hash function to map K to P and retrieve entry from node in zone containing P– If P is not contained in the zone of the requesting node or its neighboring

zones, route request to neighbor node in zone nearest P

CAN: simple example

I

CAN: simple example

node I::insert(K,V)

I

(1) a = hx(K)

CAN: simple example

x = a

node I::insert(K,V)

I

(1) a = hx(K)

b = hy(K)

CAN: simple example

x = a

y = b

node I::insert(K,V)

I

(1) a = hx(K)

b = hy(K)

CAN: simple example

(2) route(K,V) -> (a,b)

node I::insert(K,V)

I

CAN: simple example

(2) route(K,V) -> (a,b)

(3) (a,b) stores (K,V)

(K,V)

node I::insert(K,V)

I(1) a = hx(K)

b = hy(K)

CAN: simple example

(2) route “retrieve(K)” to (a,b)

(K,V)

(1) a = hx(K)

b = hy(K)

node J::retrieve(K)

J

Routing in a CAN

• Each node maintains a table of the IP address and virtual coordinate zone of each local neighbor

• Follow path through the Cartesian space from source to destination coordinates

• Use greedy routing to neighbor closest to destination

• For d-dimensional space partitioned into n equal zones, nodes maintain 2d neighbors– Average routing path length:

⎟⎟⎠

⎞⎜⎜⎝

⎛⎟⎠

⎞⎜⎝

⎛ dnd 1

4

CAN: routing table

CAN: routing ???

(a,b)

(x,y)

CAN: node insertion

I

new node1) discover some node “I” already in CAN

CAN: node insertion

2) pick random point in space

I

(p,q)

new node

CAN: node insertion

(p,q)

3) I routes to (p,q), discovers node J

I

J

new node

CAN: node insertion

newJ

4) split J’s zone in half… new owns one half

CAN: node failures

• Need to repair the space

– recover database• soft-state updates• use replication, rebuild database from replicas

– repair routing • takeover algorithm

CAN: takeover algorithm• Simple failures

– know your neighbor’s neighbors– when a node fails, one of its neighbors takes over its

zone

• More complex failure modes– simultaneous failure of multiple adjacent nodes – scoped flooding to discover neighbors

• Only the failed node’s immediate neighbors are required for recovery

Evaluation

• Scalability

• Low-latency

• Load balancing

• Robustness

CAN: scalability• For a uniformly partitioned space with n nodes and

d dimensions – per node, number of neighbors is 2d– average routing path is (dn1/d)/4 hops– simulations show that the above results hold in practice

• Can scale the network without increasing per-node state. Unlimited growth possible.

• Chord/Plaxton/Tapestry/Buzz– log(n) nbrs with log(n) hops

CAN: low-latency

• Problem– latency stretch = (CAN routing delay)

(IP routing delay)– application-level routing may lead to high

stretch

• Solution– increase dimensions– heuristics

• RTT-weighted routing• multiple nodes per zone (peer nodes)• deterministically replicate entries

CAN: low-latency

#nodes

Late

ncy

str

etc

h

0

20

40

60

80

100

120

140

160

180

16K 32K 65K 131K

#dimensions = 2

w/o heuristics

w/ heuristics

0

2

4

6

8

10

CAN: low-latency

#nodes

Late

ncy

str

etc

h

16K 32K 65K 131K

#dimensions = 10

w/o heuristics

w/ heuristics

CAN: load balancing

• Two pieces

– Dealing with hot-spots• popular (key,value) pairs• nodes cache recently requested entries• overloaded node replicates popular entries at

neighbors

– Uniform coordinate space partitioning• uniformly spread (key,value) entries• uniformly spread out routing load

Uniform Partitioning

• Added check – at join time, pick a zone– check neighboring zones– pick the largest zone and split that one

0

20

40

60

80

100

Uniform Partitioning

V 2V 4V 8V

Volume

Perce

nta

ge

of n

od

es

w/o check

w/ check

V = total volumen

V16

V 8

V 4

V 2

65,000 nodes, 3 dimensions

CAN: Robustness

• Completely distributed – no single point of failure

• Not exploring database recovery

• Resilience of routing– can route around trouble

Routing resilience

destination

source

Routing resilience

Routing resilience

destination

Routing resilience

• Node X::route(D)

If (X cannot make progress to D) – check if any neighbor of X can make

progress– if yes, forward message to one such nbr

Routing resilience

Routing resilience

Routing resilience

0

20

40

60

80

100

2 4 6 8 10dimensions

Pr(

suc c

es s

ful ro

uti

ng

)

CAN size = 16K nodesPr(node failure) = 0.25

Routing resilience

0

20

40

60

80

100

0 0.25 0.5 0.75

CAN size = 16K nodes#dimensions = 10

Pr(node failure)

Pr(

suc c

es s

ful ro

uti

ng

)

Topologically Sensitive Overlay Construction for CAN:

Distributed Binning • Idea

– well known set of landmark machines – each CAN node, measures its RTT to each landmark– orders the landmarks in order of increasing RTT– Nodes are sorted into bins based on the landmark order

• CAN construction– place nodes from the same bin close together on the CAN

Distributed Binning

Basic Binning Example• 5 landmark machines: L1, L2, L3, L4, L5 • CAN node pings the 5 landmarks and orders them

from closest to farthest based on RTTbin label is 21453 which is one of 5! possible orderings

• Bin label gets translated into CAN coordinatesDivide into 5! Subspaces by cycling through the dimensionsXYZXYZXYZ … gets divided into m, m-1, m-2, … portions Thus, X axis divided into 5*2, Y divided into 4*1, Z divided into 3.

Sophisticated Distributed Binning

• Divide RTTs into ranges:– Level 0: [0,100) ms– Level 1: [100,200) ms ,etc.

• Node bin includes the order of the landmarks and level:E.g. Node A’s distance to landmarks L1, L2,L3 is 232 ms, 51ms, and 117 ms. IBin ordering is L2, L3, L1.

Bin + level level ordering is 012.

Distributed Binning

– 4 Landmarks (placed at 5 hops away from each other)– naïve partitioning

number of nodes

256 1K 4K

late

ncy

Str

etc

h

5

10

15

20

256 1K 4K

w/o binning w/ binning

w/o binning w/ binning

#dimensions=2 #dimensions=4

Fault Tolerance: Multiple Hash Functions

• Improve data availability by using k hash functions to map a single key to k points in the coordinate space

• Replicate (K,V) and storeat k distinct nodes

• (K,V) is only unavailablewhen all k replicas aresimultaneouslyunavailable

• Authors suggest queryingall k nodes in parallel toreduce average lookup latency

sylvia ratnasamy, paul francis, mark handley, richard karp, scott shenker a scalable, content-...

Documents

v node

b node

p slide

node j

j new node slide

y slide

q new node slide

node insertion bootstrap