viceroy: scalable emulation of butterfly networks for distributed hash tables

37
Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables By: Dahlia Malkhi, Moni Naor & David Ratajzcak Nov. 11, 2003 Presented by Zhenlei Jia Nov. 11, 2004

Upload: garrett-burgess

Post on 31-Dec-2015

12 views

Category:

Documents


0 download

DESCRIPTION

Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables. By: Dahlia Malkhi, Moni Naor & David Ratajzcak Nov. 11, 2003 Presented by Zhenlei Jia Nov. 11, 2004. Acknowledgments. Some of the following slides are adapted from the slides created by the authors of the paper. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

By: Dahlia Malkhi, Moni Naor & David RatajzcakNov. 11, 2003

Presented by Zhenlei JiaNov. 11, 2004

Page 2: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

Acknowledgments

Some of the following slides are adapted from the slides created by the authors of the paper

Page 3: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

Outline Outline DHT Properties Viceroy

Structure Routing Algorithm Join/Leave Bounding In-degree: Bucket Solution Fault Tolerance

Summary

Page 4: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

DHT What’s DHT

Store (key, value) pairs Lookup Join/Leave

Examples CAN, Pastry, Tapestry, Chord etc.

Page 5: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

DHT Properties Dilation

Efficient lookup, usually O(log(n)) Maintenance cost

Support dynamic environment Control messages, affected servers

Degree Number of opened connections Servers impacted by node join/leave Heartbeat, graceful leave

Page 6: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

DHT Properties (cont.) Congestion:

Peers should share the routing load evenly Load (of a node): the probability that it is on a ro

ute with random source and destination. If path length = O(log(n)) then on average, each

node is on n2 x O(log(n))/n = O(nlog(n)) routes. Average load = O(nlogn)/n2 = O(log(n))/n

Page 7: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

Previous Works Node-

degree Dilation Congestion Topology

Chord (MIT)

log(n) log(n) log(n)/n Hypercube

Tapestry (Berkeley)

log(n) log(n) log(n)/n Hypercube

CAN (ICSI)

d dn(1/d) dn(1/d)/n d-dim. tourus

Small worlds

3 log2(n) log2(n)/n Cube connected cycle

Ours 3-7 log(n) log(n)/n Butterfly

Page 8: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

Intuition

• Route is a combination of links of appropriate size

• Chord: Each node has ALL log(n) links

• Viceroy

• Each node has ONE of the long-range links

• A link of length 1/2k points to a node has link of length 1/2k+1

Chrod

Page 9: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

000 001 010 011 100 101 110 111Level 1

Level 2

Level 3

Level 4

A Butterfly Network Each node has

ONE of the long-range links

A link of length 1/2k points to a node has link of length 1/2k+1

Nodes “share” each other’s long link

Routing

1.Route to root

2.Route to right group

3.Route to right level

Path: O(log(n))

Degree: O(1)

Page 10: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

A Viceroy network

Level 1

Level 2

Level 3

•Ideally, there should be log(n) levels

•There is not a global counter

•Later, we will see how a node can estimate log(n) locally

0 1

00010010

0011

0100

0101

0110

10001001

1011

110111101111

Page 11: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

Structure: Nodes Node

Id: 128 bits binary string, u Level: positive integer, u.level

Order of ids b1b2…bk ∑i=1…k bi/2i

Each node has a SUCCESSOR and a PREDECESSORSUCC(u), PRED(u)

Node u stores the keys k such that u≤k<SUCC(u)

Page 12: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

Structure: Nodes

0 1

x SUCC(x)

PRED(x)

Keys stored on x

Lemma 2.1

Let n0 = 1/d(x, SUCC(x)), then w.h.p. (i.e. p>1-1/n1+e) that

log(n)-log(log(n))-O(1) <log(n0) ≤3log(n)

Node x selects level from 1…log(n0) uniformly randomly

Page 13: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

Structure: Links A node u in level k has six out links

2 x Short: SUCCESSOR ,PREDECESSOR 2 x Medium: (left) closest level-(k+1) node whos

e id matches u.id[k] and is smaller than u.id. 1 x Long: the closest level-(k+1) node with prefix

u1…uk-1(1-uk)(?) u1…uk-1(1-uk)uk+1…uw* where w=log(n0)-log(log(n0))

1 x Parent: closest level-(k-1) node Also keeps track of in-bound links

Page 14: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

Structure: Links

Level 1

Level 2

Level 3

0 1

00010010

0011

0100

0101

0110

10001001

1011

110111101111

Short link Short link

Parent link, to level k-1

Medium linkMatches x[k]0*

Long link, cross over about 1/2k

Matches u[w] except kth bit. (11*)

Matches 1*Wrong!

Page 15: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

Routing: AlgorithmLOOKUP(x, y):Initialization: set cur to xProceed to root: while cur.level > 1:

cur = cur.parentGreedy search:

if cur.id ≤ y < SUCC(cur).id, return cur. Otherwise, choose m from links of cur that minimize d(m, y), move to m and repeat.

Demo: http://www.cs.huji.ac.il/labs/danss/anatt/viceroy.html

Page 16: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

Routing: Example

Level 1

Level 2

Level 3

0 1

00010010

0011

0100

0101

0110

10001001

1011

110111101111y

x

Page 17: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

One Observation

Level 1

Level 2

Level 3

0 1

00010010

0011

0100

0101

0110

10001001

1011

110111101111

Page 18: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

Routing: Analysis (1)

Level 1

Level 2

Level 3

0 1

00010010

0011

0100

0101

0110

10001001

1011

110111101111y

x

Page 19: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

Routing: Analysis (2) Expected path length = O(log(n))

log(n ) to `level-1’ node log(n ) for traveling among clusters log(n ) for final local search

Page 20: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

Routing: Theorems Theorem 4.4

The path length from x to y is O(log(n)) w.h.p. Proof is based on several lemmas Lemma 4.1

For every node u with a level u.level < log(n)-log(log(n)), the number of nodes between u and u.Medium-left (Medium-right), if it exists, is at most 6log2(n) w.h.p.

Page 21: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

Routing: Theorems (2) Lemma 4.2

In the greedy search phase of a lookup of value Y from node x, let the j’th greedy step vj, for 1 ≤ j ≤ m, be such that vj is more than O(log2(n)) nodes away from y. Then w.h.p. node vj is reached over a Medium or Long link, and hence satisfies vj.level = j and vj[j] = Y[j].

m = log(n)-2loglog(n)-log(3+e) W.h.p. within m steps, we are n/2m = 6log2(n) nodes

away from the destination

Page 22: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

Routing: Theorems (3) Lemma 4.3

Let v be a node that is O(log2(n)) nodes away from the target y. Then w.h.p., within O(log(n)) greedy steps that target y is reached from v.

Theorem 4.4 The total length of a route from x to y is O(log(n)) w.h.p.

Theorem 4.6 Expected load on every node is O(log(n)/n). The load on every node is log2(n)/n w.h.p.

Theorem 4.7Every node u has in-degree O(log(n)) w.h.p.

Page 23: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

Join: Algorithm1. Choose identifier: select a random 128 bits x1x2…x128

2. Setup short links: invoke LOOKUP(x), let x’ be the result node. Insert x between x’ and x’.SUUCESSOR.

3. Choose level: let k be the maximal number of matching prefix bits between x and either SUCC(x) or PRED(x), choose level from 1…k.

4. Set parent link: If SUCC(x) has level x.level-1, set x.parent to it. Otherwise, move to SUCC(x) and repeat.

5. Set long link: p = x1…xk-1(1-xk)xk+1…xw

Invoke LOOKUP(p), stop after a node at level x.level+1 and matches p

is reached.

Page 24: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

Join: Algorithm (cont.)6. Set medium links: Denote p = x1x2…xx.level. If SUCC(x) has prefix

p and level x.level+1, set x.Medium-right link to it. Otherwise, move the SUCC(x) and repeat.

7. Set inbound links: Denote p = x1x2…xx.level.

Set inbound Medium links: Following SUCC links, so long as successor y has a prefix p and a level different from x.level, if y.level = x.level-1, set y.Medium-left to x.Set inbound long links: Following SUCC links, find y that has a prefix matches p and has level x.level. Take any inbound links that is closer to x than y.Set inbound parent links: Following PRED link, find y such that y.level = x.level+1. Repeat until meet a node in same level as x.

Page 25: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

Join: Example

Level 1

Level 2

Level 3

0 1

00010010

0011

0100

0101

0110

10001001

1011

110111101111

0111

Lookup(x)

0111

Set Medium link: O(lg2n) w.h.pp = x1x2…xk (01)If y[k] != p: stopIf y[k]=p and y.level=k+1: set Medium linkOtherwise, move to succ(y)

STOP

Set Parent link:Following SUCC link, find a node has level k-1.

Set long linkP = x1…xk-1(1-xk)…xw stop at level k+1?In this case, find 00*

Set inbound long links:Following short links, find y such that y[k]=x[k] and y.level = x.level, check y’s inbound links.

X

Page 26: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

Join: Analysis LOOKUP takes O(log(n)) messages w.h.p. Travels on short links during link setting phase is O(lg2n) w.h.

p. A Medium link is within 6log2(n) nodes from x w.h.p. Similar for others

Theorem 5.1: A JOIN operation by a new node x incurs expected O(log(n)) number of messages, and O(log2(n)) messages w.h.p.The expected number of nodes that change their state as a result of x’s join is constant, and w.h.p is O(log(n)).

Because node x has O(log(n)) in-degrees w.h.p.Similar results holds for LEAVE.

Page 27: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

Bounding In-degrees Theorem 4.7

Every node has expected constant in-degree, and has O(log(n)) in-degree w.h.p.

In-degree=# of servers affected by join/leave

How to guarantee constant in-degree? Bucket solution

A background process to balance the assignment of levels

Page 28: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

Bucket Solution: Intuition

Level k-1

Level k

• Node x has log(n) in-degree, assuming Medium Right

~log(n)

x

• Too many nodes at level k-1;• Improve the level selection procedure

Too few nodes at level k

Page 29: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

Bucket Solution

The name space is divided into non-overlapped buckets.

A bucket contains m nodes, where log(n) ≤m ≤ clog(n), for c>2.

In a buckets, levels are NOT assigned randomly

For each 1≤j≤log(n), there are 1…c nodes at level j in each bucket

In(x) < 7c (?? 2c)

0 1

000100100011

01000101

0110

100010011011

110111101111

Page 30: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

Maintaining Bucket Size n can be accurately estimated When bucket size exceeds clog(n), the bucket

is split into two equal size buckets. When bucket size drops below log(n), it is

merged with a neighbor bucket. Further more, if the merged bucket is greater than log(n)x(2c+2)/3, the new bucket is split into two buckets.

(c+1)/3 > 1 since c>2 Buckets are organized into a ring, which can

be merged or split with O(1) message.

Page 31: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

Maintain Level Property Node join/leave without merging or splitting O(1)

Join: size < clog(n), choose a level that has less that c nodes

Leave: If it is the only node in its level, find another level that has two nodes, reassign level j to one of them.

Bucket merge or split may result in a reassignment of the levels to all nodes in the bucket(s) O(log(n))

Merging/splitting are expensive, but they do not happen very often

After a merging or splitting of buckets, at least log(n) (c-2)/3 JOIN/LEAVE must happen in this bucket until another merging or splitting of this bucket is performed

Amortized Overhead = c/((c-2)/3) = O(1) for c>2

Page 32: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

Amortized analysis

Log(n)

clog(n)

d1 d2

d1, d2 > (c-2)/3

New bucket size

Max bucket size

min(c/2lgn, (c+1)/3lgn) max(c/2lgn, (2c+2)/3lgn)

Page 33: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

Viceroy has no built in support for fault tolerance

Viceroy requires graceful leave Leaves are NOT the same as failures

Performance is sensitive to failure External techniques:

Thickening Edges State Machine Replication

Fault Tolerance

Page 34: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

State Machine Replication

Old

NewSMR SMR

Super node

Viceroy nodes

Page 35: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

Related Works De Bruijn Graph Based Network

Distance halving D2B Koorde

Others Symphony (Small world model) Ulysses (ButterFly, log(n), log(n)/loglogn)

Page 36: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

Summary Constant out-degree Expected constant in-degree

O(log(n )) w.h.p. O(1) with bucket solution

O(log(n )) path length w.h.p Expected log(n )/n load:

O(log2(n)/n) w.h.p. Weakness/improvements:

Not Locality Aware No Fault Tolerance Support Due to the lack of flexibility of ButterFly network

Page 37: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables

Question

Photo by Peter J. Bryant