Sylvia Ratnasamy (UC Berkeley Dissertation 2002)
Paul Francis
Mark Handley
Richard Karp
Scott Shenker
A Scalable, Content Addressable Network
Slides by the authors above
with additions/modifications by Michael Isaacs (UCSB) and Tony Hannan (Georgia Tech)
Outline
• Introduction
• Design
• Design Improvements
• Evaluation
• Related Work
• Conclusion
Internet-scale hash tables

• Hash tables
– essential building block in software systems
• Internet-scale distributed hash tables
– equally valuable to large-scale distributed systems
• peer-to-peer systems
– Napster, Gnutella, Groove, FreeNet, MojoNation…
• large-scale storage management systems
– Publius, OceanStore, PAST, Farsite, CFS ...
• mirroring on the Web
Content-Addressable Network (CAN)
• CAN: Internet-scale hash table
• Interface
– insert(key, value)
– value = retrieve(key)
• Properties
– scalable
– operationally simple
– good performance
• Related systems: Chord, Plaxton/Tapestry ...
Problem Scope
• provide the hash table interface
• scalability
• robustness
• performance
Not in scope
• security
• anonymity
• higher-level primitives
– keyword searching
– mutable content
Outline
• Introduction
• Design
• Design Improvements
• Evaluation
• Related Work
• Conclusion
CAN: basic idea

[Figure sequence: a collection of nodes, each holding a table of (key, value) pairs; a node issues insert(K1,V1), which is routed through the system; (K1,V1) is stored at the node responsible for K1; a later retrieve(K1) is routed to that same node]
CAN: solution
• Virtual Cartesian coordinate space
• Entire space is partitioned amongst all the nodes
– every node “owns” a zone in the overall space
• Abstraction
– can store data at “points” in the space
– can route from one “point” to another
• Point = node that owns the enclosing zone
CAN: simple example

[Figure sequence: nodes 1, 2, 3, 4 join a 2-d coordinate space in turn; each join splits an existing zone, so every node owns one rectangular zone]

CAN: simple example

node I::insert(K,V)
(1) a = hx(K)
    b = hy(K)
(2) route(K,V) -> (a,b)
(3) node at (a,b) stores (K,V)

CAN: simple example

node J::retrieve(K)
(1) a = hx(K)
    b = hy(K)
(2) route “retrieve(K)” to (a,b)
(3) node at (a,b) returns (K,V)
CAN

Data stored in the CAN is addressed by name (i.e. key), not location (i.e. IP address)
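As a rough illustration of the key-to-point mapping behind insert/retrieve, here is a small Python sketch. It assumes SHA-1 as the hash (the design only requires uniform hash functions, one per dimension), and the function name key_to_point is illustrative, not from the paper.

import hashlib

def key_to_point(key: str, d: int = 2) -> tuple:
    # Hash the key once, then carve the digest into d independent
    # coordinates, each normalized into [0, 1).
    digest = hashlib.sha1(key.encode()).digest()
    chunk = len(digest) // d
    coords = []
    for i in range(d):
        piece = digest[i * chunk:(i + 1) * chunk]
        coords.append(int.from_bytes(piece, "big") / 2 ** (8 * chunk))
    return tuple(coords)

# insert(K, V): route (K, V) to the node owning the zone containing key_to_point(K)
# retrieve(K): route the request to the same point; the owning node returns V
print(key_to_point("K1"))   # e.g. a point like (0.23..., 0.77...) in a 2-d CAN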
CAN: routing table

[Figure: a node’s routing table holds the IP address and zone coordinates of each of its immediate neighbors]

CAN: routing

[Figure: a message destined for point (x,y) is forwarded greedily, neighbor by neighbor, toward the zone containing (x,y)]

A node only maintains state for its immediate neighboring nodes
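A minimal sketch of the greedy forwarding rule in Python. The Neighbor shape and the IP addresses are illustrative assumptions; the termination test (stop when no neighbor is closer than the current node) is a simplification of "the destination lies in our own zone".

import math
from collections import namedtuple

Neighbor = namedtuple("Neighbor", "address zone_center")

def next_hop(own_center, neighbors, dest):
    # Greedy forwarding: hand the message to the neighbor whose zone
    # center is closest to the destination point, provided that
    # neighbor is closer to dest than we are.
    best = min(neighbors, key=lambda n: math.dist(n.zone_center, dest))
    if math.dist(best.zone_center, dest) < math.dist(own_center, dest):
        return best.address
    return None   # no neighbor is closer: deliver locally

hop = next_hop((0.1, 0.1),
               [Neighbor("10.0.0.2", (0.4, 0.1)), Neighbor("10.0.0.3", (0.1, 0.4))],
               dest=(0.9, 0.2))
print(hop)   # 10.0.0.2 (the neighbor making the most progress toward dest)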
CAN: node insertion

1) New node discovers some node “I” already in the CAN
2) New node picks a random point (p,q) in the space
3) I routes to (p,q) and discovers node J, the current owner of that zone
4) J’s zone is split in half; the new node owns one half

Inserting a new node affects only a single other node and its immediate neighbors
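A minimal sketch of steps 2 and 4 in Python, assuming 2-d zones represented as (xmin, xmax, ymin, ymax) tuples (an illustrative choice). The sketch splits along the longer side for simplicity; the actual design picks the split dimension from a fixed ordering so that zones can later be re-merged.

import random

def random_point():
    # Step 2: the joining node picks a random point in the unit square
    return (random.random(), random.random())

def split_zone(zone):
    # Step 4: halve the occupant's zone; the occupant keeps one half
    # and the new node takes the other.
    xmin, xmax, ymin, ymax = zone
    if (xmax - xmin) >= (ymax - ymin):
        mid = (xmin + xmax) / 2
        return (xmin, mid, ymin, ymax), (mid, xmax, ymin, ymax)
    mid = (ymin + ymax) / 2
    return (xmin, xmax, ymin, mid), (xmin, xmax, mid, ymax)

occupant_half, newcomer_half = split_zone((0.0, 1.0, 0.0, 0.5))
print(occupant_half, newcomer_half)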
CAN: node failures
• Need to repair the space
– recover the database
• state updates
• use replication; rebuild the database from replicas
– repair routing
• takeover algorithm
CAN: takeover algorithm
• Nodes periodically send update messages to their neighbors
• An absent update message indicates a failure
– each neighbor that detects the failure sends a recovery message to the takeover node
– the takeover node takes over the failed node’s zone
CAN: takeover node
• The takeover node is the node whose VID is numerically closest to the failed node’s VID.
• The VID is a binary string giving the path from the root to the node’s zone in the binary partition tree.

[Figure: binary partition tree of the coordinate space, with zone VIDs such as 0, 10, 11, 100, 101]
CAN: VID predecessor / successor
• To handle multiple neighbors failing simultaneously, each node maintains its predecessor and successor in the VID ordering (idea borrowed from Chord).
• This guarantees that the takeover node can always be found.
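A toy sketch in Python of choosing the takeover node. Interpreting a variable-length VID as a binary fraction so that VIDs can be compared "numerically" is an illustrative assumption, not spelled out on these slides.

def vid_value(vid: str) -> float:
    # Treat a VID bit string as a binary fraction in [0, 1) so that
    # VIDs of different lengths can be compared numerically.
    return sum(int(bit) / 2 ** (i + 1) for i, bit in enumerate(vid))

def takeover_node(failed_vid: str, live_vids: list) -> str:
    # The takeover node is the live node whose VID is numerically
    # closest to the failed node's VID.
    return min(live_vids, key=lambda v: abs(vid_value(v) - vid_value(failed_vid)))

print(takeover_node("100", ["101", "11", "0"]))   # -> "101"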
Only the failed node’s immediate neighbors are required for recovery
CAN: node failures
Basic Design recap
• Basic CAN
– completely distributed
– self-organizing
– nodes only maintain state for their immediate neighbors
Outline
• Introduction
• Design
• Design Improvements
• Evaluation
• Related Work
• Conclusion
Design Improvement Objectives
• Reduce latency of CAN routing
– reduce CAN path length
– reduce per-hop latency (IP path length/latency per CAN hop)
• Improve CAN robustness
– data availability
– routing
Design Improvement Tradeoffs
Routing performance and system robustness
vs.
increased per-node state and system complexity
Design Improvement Techniques
1. Increase number of dimensions in coordinate space
2. Multiple coordinate spaces (realities)
3. Better CAN routing metrics (measure & consider time)
4. Overloading coordinate zones (peer nodes)
5. Multiple hash functions
6. Geography-sensitive construction of the CAN overlay network (distributed binning)
7. More uniform partitioning (when new nodes enter)
8. Caching & Replication techniques for “hot-spots”
9. Background load balancing algorithm
1. Increase dimensions
2. Multiple realities
• Multiple (r) coordinate spaces
• Each node maintains a different zone in each reality
• Contents of hash table replicated on each reality
• Routing chooses neighbor closest in any reality (can make large jumps towards target)
• Advantages
– greater data availability
– fewer hops
Multiple realities
Multiple dimensions vs realities
3. Time-sensitive routing
• Each node measures round-trip time (RTT) to each of its neighbors
• Routing message is forwarded to the neighbor with the maximum ratio of progress to RTT
• Advantage
– reduces per-hop latency (25-40%, depending on the number of dimensions)
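A small Python sketch of the progress-to-RTT forwarding rule. The neighbor-entry shape (address, zone_center, rtt) is assumed for illustration.

import math

def best_neighbor(own_center, neighbors, dest):
    # neighbors: list of (address, zone_center, rtt) tuples (assumed shape)
    # Forward to the neighbor with the maximum ratio of progress toward
    # dest to measured round-trip time, rather than raw progress alone.
    here = math.dist(own_center, dest)
    def ratio(entry):
        _, center, rtt = entry
        return (here - math.dist(center, dest)) / rtt
    candidates = [n for n in neighbors if ratio(n) > 0]
    return max(candidates, key=ratio)[0] if candidates else None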
4. Zone overloading
• Node maintains a list of “peers” in the same zone
• Zone is not split until it already holds MAXPEERS nodes
• Only the least-RTT node from each neighboring zone is maintained as a neighbor
• Data is replicated or split across the peers in a zone
• Advantages
– reduced per-hop latency
– reduced path length (sharing zones has the same effect as reducing the total number of nodes)
– improved fault tolerance (a zone is vacant only when all of its peers fail)
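A toy sketch of the join decision under zone overloading (Python; the MAXPEERS value and return strings are purely illustrative).

MAXPEERS = 4   # illustrative threshold, not a value fixed by the design

def handle_join(zone_peers, new_node):
    # zone_peers: the nodes currently sharing this zone.
    # The zone is split only once it already holds MAXPEERS nodes;
    # otherwise the newcomer simply joins the peer list.
    if len(zone_peers) < MAXPEERS:
        zone_peers.append(new_node)
        return "joined existing zone"
    return "split the zone between the resident peers and the newcomer"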
Zone overload

[Figure: measurements averaged over CANs of 2^8 to 2^18 nodes]
5. Multiple hash functions
• (key, value) mapped to “k” multiple points (and nodes)
• Queries can be executed in parallel along the “k” paths, or sent only to the point closest to the querier in the space
• Advantages
– improved data availability
– reduced path length
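A Python sketch of mapping one key to k points. Salting one SHA-1 digest with the replica index is just one way to obtain k hash functions; the function name key_to_points is illustrative.

import hashlib

def key_to_points(key: str, k: int = 3, d: int = 2):
    # Derive k hash functions from one by salting with the replica
    # index i, so the same key maps to k different points (and zones).
    points = []
    for i in range(k):
        digest = hashlib.sha1(f"{i}:{key}".encode()).digest()
        chunk = len(digest) // d
        points.append(tuple(
            int.from_bytes(digest[j * chunk:(j + 1) * chunk], "big") / 2 ** (8 * chunk)
            for j in range(d)))
    return points

# A retrieve(K) can be issued to all k points in parallel, or only to
# the point closest to the querier in the coordinate space.
print(key_to_points("K1"))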
Multiple hash functions
6. Geographically distributed binning
• Distributed binning of CAN nodes based on their relative distances from m IP landmarks (such as root DNS servers)
• m! partitions of coordinate space
• New node joins partition associated with its landmark ordering
• Topologically close nodes will be assigned to the same partition
• Nodes in same partition divide the space up randomly
• Disadvantage: coordinate space is no longer uniformly populated
– background load balancing could alleviate this problem
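A minimal Python sketch of how a node's landmark ordering determines its bin (landmark names and RTT values are hypothetical).

def landmark_bin(rtt_ms):
    # rtt_ms: measured round-trip time to each of the m landmarks,
    # e.g. {"L1": 40, "L2": 12, "L3": 95}.
    # A node's bin is its landmark ordering from nearest to farthest;
    # m landmarks give m! possible orderings, hence m! partitions.
    return tuple(sorted(rtt_ms, key=rtt_ms.get))

print(landmark_bin({"L1": 40, "L2": 12, "L3": 95}))   # ('L2', 'L1', 'L3')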
Distributed binning
7. More uniform partitioning
• When adding a node to a zone, the occupant compares zone volumes with its neighbors, and the largest of those zones is the one that gets split
• Advantage
– 85% of nodes end up with the average load, as opposed to 40% without this technique, and the largest zone volume drops from 8x to 2x the average
8. Caching & Replication
• A node can cache values of keys it has recently requested.
• A node can push out “hot” keys to its neighbors.
9. Load balancing
• Merge and split zones to balance loads
Outline
• Introduction
• Design
• Design Improvements
• Evaluation
• Related Work
• Conclusion
Evaluation Metrics
1. Path length
2. Neighbor state
3. Latency
4. Volume
5. Routing fault tolerance
6. Data availability
CAN: scalability
• For a uniformly partitioned space with n nodes and d dimensions
– per node, the number of neighbors is 2d
– the average routing path is O(d · n^(1/d)) hops
• Simulations show that the above results hold in practice (Transit-Stub topologies from the GT-ITM generator)
• Can scale the network without increasing per-node state
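A back-of-the-envelope check of these scaling claims in Python (illustrative numbers, constant factors omitted; not data from the paper's simulations).

n, d = 2 ** 20, 10            # ~1M nodes in a 10-dimensional CAN (illustrative)
neighbors = 2 * d             # per-node state: 20 neighbors, independent of n
path_hops = d * n ** (1 / d)  # O(d * n^(1/d)) ~= 10 * 4 = 40 hops
print(neighbors, round(path_hops, 1))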
CAN: latency
• latency stretch = (CAN routing delay) / (IP routing delay)
• To reduce latency, apply the design improvements:
– increase dimensions (d)
– multiple nodes per zone (peer nodes) (p)
– multiple “realities” ie coordinate spaces (r)
– multiple hash functions (k)
– use RTT weighted routing
– distributed binning
– uniform partitioning
– caching & replication (for “hot spots”)
Overall improvement
CAN: Robustness
• Completely distributed – no single point of failure
• Not exploring database recovery
• Resilience of routing
– can route around trouble
Routing resilience

[Figure sequence: a message is routed from a source node to a destination node; when nodes along the original path fail, the message is routed around the failed zones to reach the destination]
• Node X::route(D)
– if X cannot make progress toward D, check whether any neighbor of X can make progress
– if yes, forward the message to one such neighbor
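One possible reading of this rule as a Python sketch (node objects with zone_center, alive, and neighbors attributes are assumptions made for illustration).

import math

def route_step(node, dest):
    # Try neighbors in order of how close their zones are to the
    # destination and hand the message to the first live one that
    # still makes progress, so a failed best next hop does not block
    # the greedy path.
    here = math.dist(node.zone_center, dest)
    for n in sorted(node.neighbors, key=lambda n: math.dist(n.zone_center, dest)):
        if n.alive and math.dist(n.zone_center, dest) < here:
            return n
    return None   # no live neighbor makes progress; fall back to zone recovery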
Outline
• Introduction
• Design
• Design Improvements
• Evaluation
• Related Work
• Conclusion
Related Algorithms
• Distance Vector (DV) and Link State (LS) algorithms in IP routing
– require widespread dissemination of topology information
– not well suited to dynamic topologies like file-sharing networks
• Hierarchical routing algorithms
– too centralized, single points of stress
• Other DHTs
– Plaxton, Tapestry, Pastry, Chord
Related Systems
• DNS
• OceanStore (uses Tapestry)
• Publius: Web-publishing system with high anonymity
• P2P file-sharing systems
– Napster, Gnutella, KaZaA, Freenet
Outline
• Introduction
• Design
• Design Improvements
• Evaluation
• Related Work
• Conclusion
Conclusion: Purpose
• CAN
– an Internet-scale hash table
– a potential building block in Internet applications
Conclusion: Scalability
• O(d) per-node state
• O(d * n^(1/d)) path length
For d = (log n)/2, both path length and per-node state are O(log n), as in other DHTs (with this choice n^(1/d) is a constant, so d · n^(1/d) = O(log n))
Conclusion: Latency
• With the main design improvements, latency is less than twice the IP path latency
Conclusion: Robustness
• Decentralized, can route around trouble
• With certain design improvements (replication), high data availability
Weaknesses / Future Work
• Security
– denial-of-service attacks
• hard to combat, because a malicious node can act as a malicious server as well as a malicious client
• Mutable content
• Search