viceroy: scalable emulation of butterfly networks for distributed hash tables
DESCRIPTION
Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables. By: Dahlia Malkhi, Moni Naor & David Ratajzcak Nov. 11, 2003 Presented by Zhenlei Jia Nov. 11, 2004. Acknowledgments. Some of the following slides are adapted from the slides created by the authors of the paper. - PowerPoint PPT PresentationTRANSCRIPT
Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables
By: Dahlia Malkhi, Moni Naor & David RatajzcakNov. 11, 2003
Presented by Zhenlei JiaNov. 11, 2004
Acknowledgments
Some of the following slides are adapted from the slides created by the authors of the paper
Outline Outline DHT Properties Viceroy
Structure Routing Algorithm Join/Leave Bounding In-degree: Bucket Solution Fault Tolerance
Summary
DHT What’s DHT
Store (key, value) pairs Lookup Join/Leave
Examples CAN, Pastry, Tapestry, Chord etc.
DHT Properties Dilation
Efficient lookup, usually O(log(n)) Maintenance cost
Support dynamic environment Control messages, affected servers
Degree Number of opened connections Servers impacted by node join/leave Heartbeat, graceful leave
DHT Properties (cont.) Congestion:
Peers should share the routing load evenly Load (of a node): the probability that it is on a ro
ute with random source and destination. If path length = O(log(n)) then on average, each
node is on n2 x O(log(n))/n = O(nlog(n)) routes. Average load = O(nlogn)/n2 = O(log(n))/n
Previous Works Node-
degree Dilation Congestion Topology
Chord (MIT)
log(n) log(n) log(n)/n Hypercube
Tapestry (Berkeley)
log(n) log(n) log(n)/n Hypercube
CAN (ICSI)
d dn(1/d) dn(1/d)/n d-dim. tourus
Small worlds
3 log2(n) log2(n)/n Cube connected cycle
Ours 3-7 log(n) log(n)/n Butterfly
Intuition
• Route is a combination of links of appropriate size
• Chord: Each node has ALL log(n) links
• Viceroy
• Each node has ONE of the long-range links
• A link of length 1/2k points to a node has link of length 1/2k+1
Chrod
000 001 010 011 100 101 110 111Level 1
Level 2
Level 3
Level 4
A Butterfly Network Each node has
ONE of the long-range links
A link of length 1/2k points to a node has link of length 1/2k+1
Nodes “share” each other’s long link
Routing
1.Route to root
2.Route to right group
3.Route to right level
Path: O(log(n))
Degree: O(1)
A Viceroy network
Level 1
Level 2
Level 3
•Ideally, there should be log(n) levels
•There is not a global counter
•Later, we will see how a node can estimate log(n) locally
0 1
00010010
0011
0100
0101
0110
10001001
1011
110111101111
Structure: Nodes Node
Id: 128 bits binary string, u Level: positive integer, u.level
Order of ids b1b2…bk ∑i=1…k bi/2i
Each node has a SUCCESSOR and a PREDECESSORSUCC(u), PRED(u)
Node u stores the keys k such that u≤k<SUCC(u)
Structure: Nodes
0 1
x SUCC(x)
PRED(x)
Keys stored on x
Lemma 2.1
Let n0 = 1/d(x, SUCC(x)), then w.h.p. (i.e. p>1-1/n1+e) that
log(n)-log(log(n))-O(1) <log(n0) ≤3log(n)
Node x selects level from 1…log(n0) uniformly randomly
Structure: Links A node u in level k has six out links
2 x Short: SUCCESSOR ,PREDECESSOR 2 x Medium: (left) closest level-(k+1) node whos
e id matches u.id[k] and is smaller than u.id. 1 x Long: the closest level-(k+1) node with prefix
u1…uk-1(1-uk)(?) u1…uk-1(1-uk)uk+1…uw* where w=log(n0)-log(log(n0))
1 x Parent: closest level-(k-1) node Also keeps track of in-bound links
Structure: Links
Level 1
Level 2
Level 3
0 1
00010010
0011
0100
0101
0110
10001001
1011
110111101111
Short link Short link
Parent link, to level k-1
Medium linkMatches x[k]0*
Long link, cross over about 1/2k
Matches u[w] except kth bit. (11*)
Matches 1*Wrong!
Routing: AlgorithmLOOKUP(x, y):Initialization: set cur to xProceed to root: while cur.level > 1:
cur = cur.parentGreedy search:
if cur.id ≤ y < SUCC(cur).id, return cur. Otherwise, choose m from links of cur that minimize d(m, y), move to m and repeat.
Demo: http://www.cs.huji.ac.il/labs/danss/anatt/viceroy.html
Routing: Example
Level 1
Level 2
Level 3
0 1
00010010
0011
0100
0101
0110
10001001
1011
110111101111y
x
Routing: Analysis (1)
Level 1
Level 2
Level 3
0 1
00010010
0011
0100
0101
0110
10001001
1011
110111101111y
x
Routing: Analysis (2) Expected path length = O(log(n))
log(n ) to `level-1’ node log(n ) for traveling among clusters log(n ) for final local search
Routing: Theorems Theorem 4.4
The path length from x to y is O(log(n)) w.h.p. Proof is based on several lemmas Lemma 4.1
For every node u with a level u.level < log(n)-log(log(n)), the number of nodes between u and u.Medium-left (Medium-right), if it exists, is at most 6log2(n) w.h.p.
Routing: Theorems (2) Lemma 4.2
In the greedy search phase of a lookup of value Y from node x, let the j’th greedy step vj, for 1 ≤ j ≤ m, be such that vj is more than O(log2(n)) nodes away from y. Then w.h.p. node vj is reached over a Medium or Long link, and hence satisfies vj.level = j and vj[j] = Y[j].
m = log(n)-2loglog(n)-log(3+e) W.h.p. within m steps, we are n/2m = 6log2(n) nodes
away from the destination
Routing: Theorems (3) Lemma 4.3
Let v be a node that is O(log2(n)) nodes away from the target y. Then w.h.p., within O(log(n)) greedy steps that target y is reached from v.
Theorem 4.4 The total length of a route from x to y is O(log(n)) w.h.p.
Theorem 4.6 Expected load on every node is O(log(n)/n). The load on every node is log2(n)/n w.h.p.
Theorem 4.7Every node u has in-degree O(log(n)) w.h.p.
Join: Algorithm1. Choose identifier: select a random 128 bits x1x2…x128
2. Setup short links: invoke LOOKUP(x), let x’ be the result node. Insert x between x’ and x’.SUUCESSOR.
3. Choose level: let k be the maximal number of matching prefix bits between x and either SUCC(x) or PRED(x), choose level from 1…k.
4. Set parent link: If SUCC(x) has level x.level-1, set x.parent to it. Otherwise, move to SUCC(x) and repeat.
5. Set long link: p = x1…xk-1(1-xk)xk+1…xw
Invoke LOOKUP(p), stop after a node at level x.level+1 and matches p
is reached.
Join: Algorithm (cont.)6. Set medium links: Denote p = x1x2…xx.level. If SUCC(x) has prefix
p and level x.level+1, set x.Medium-right link to it. Otherwise, move the SUCC(x) and repeat.
7. Set inbound links: Denote p = x1x2…xx.level.
Set inbound Medium links: Following SUCC links, so long as successor y has a prefix p and a level different from x.level, if y.level = x.level-1, set y.Medium-left to x.Set inbound long links: Following SUCC links, find y that has a prefix matches p and has level x.level. Take any inbound links that is closer to x than y.Set inbound parent links: Following PRED link, find y such that y.level = x.level+1. Repeat until meet a node in same level as x.
Join: Example
Level 1
Level 2
Level 3
0 1
00010010
0011
0100
0101
0110
10001001
1011
110111101111
0111
Lookup(x)
0111
Set Medium link: O(lg2n) w.h.pp = x1x2…xk (01)If y[k] != p: stopIf y[k]=p and y.level=k+1: set Medium linkOtherwise, move to succ(y)
STOP
Set Parent link:Following SUCC link, find a node has level k-1.
Set long linkP = x1…xk-1(1-xk)…xw stop at level k+1?In this case, find 00*
Set inbound long links:Following short links, find y such that y[k]=x[k] and y.level = x.level, check y’s inbound links.
X
Join: Analysis LOOKUP takes O(log(n)) messages w.h.p. Travels on short links during link setting phase is O(lg2n) w.h.
p. A Medium link is within 6log2(n) nodes from x w.h.p. Similar for others
Theorem 5.1: A JOIN operation by a new node x incurs expected O(log(n)) number of messages, and O(log2(n)) messages w.h.p.The expected number of nodes that change their state as a result of x’s join is constant, and w.h.p is O(log(n)).
Because node x has O(log(n)) in-degrees w.h.p.Similar results holds for LEAVE.
Bounding In-degrees Theorem 4.7
Every node has expected constant in-degree, and has O(log(n)) in-degree w.h.p.
In-degree=# of servers affected by join/leave
How to guarantee constant in-degree? Bucket solution
A background process to balance the assignment of levels
Bucket Solution: Intuition
Level k-1
Level k
• Node x has log(n) in-degree, assuming Medium Right
~log(n)
x
• Too many nodes at level k-1;• Improve the level selection procedure
Too few nodes at level k
Bucket Solution
The name space is divided into non-overlapped buckets.
A bucket contains m nodes, where log(n) ≤m ≤ clog(n), for c>2.
In a buckets, levels are NOT assigned randomly
For each 1≤j≤log(n), there are 1…c nodes at level j in each bucket
In(x) < 7c (?? 2c)
0 1
000100100011
01000101
0110
100010011011
110111101111
Maintaining Bucket Size n can be accurately estimated When bucket size exceeds clog(n), the bucket
is split into two equal size buckets. When bucket size drops below log(n), it is
merged with a neighbor bucket. Further more, if the merged bucket is greater than log(n)x(2c+2)/3, the new bucket is split into two buckets.
(c+1)/3 > 1 since c>2 Buckets are organized into a ring, which can
be merged or split with O(1) message.
Maintain Level Property Node join/leave without merging or splitting O(1)
Join: size < clog(n), choose a level that has less that c nodes
Leave: If it is the only node in its level, find another level that has two nodes, reassign level j to one of them.
Bucket merge or split may result in a reassignment of the levels to all nodes in the bucket(s) O(log(n))
Merging/splitting are expensive, but they do not happen very often
After a merging or splitting of buckets, at least log(n) (c-2)/3 JOIN/LEAVE must happen in this bucket until another merging or splitting of this bucket is performed
Amortized Overhead = c/((c-2)/3) = O(1) for c>2
Amortized analysis
Log(n)
clog(n)
d1 d2
d1, d2 > (c-2)/3
New bucket size
Max bucket size
min(c/2lgn, (c+1)/3lgn) max(c/2lgn, (2c+2)/3lgn)
Viceroy has no built in support for fault tolerance
Viceroy requires graceful leave Leaves are NOT the same as failures
Performance is sensitive to failure External techniques:
Thickening Edges State Machine Replication
Fault Tolerance
Related Works De Bruijn Graph Based Network
Distance halving D2B Koorde
Others Symphony (Small world model) Ulysses (ButterFly, log(n), log(n)/loglogn)
Summary Constant out-degree Expected constant in-degree
O(log(n )) w.h.p. O(1) with bucket solution
O(log(n )) path length w.h.p Expected log(n )/n load:
O(log2(n)/n) w.h.p. Weakness/improvements:
Not Locality Aware No Fault Tolerance Support Due to the lack of flexibility of ButterFly network