![Page 1: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/1.jpg)
Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables
By: Dahlia Malkhi, Moni Naor & David RatajzcakNov. 11, 2003
Presented by Zhenlei JiaNov. 11, 2004
![Page 2: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/2.jpg)
Acknowledgments
Some of the following slides are adapted from the slides created by the authors of the paper
![Page 3: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/3.jpg)
Outline Outline DHT Properties Viceroy
Structure Routing Algorithm Join/Leave Bounding In-degree: Bucket Solution Fault Tolerance
Summary
![Page 4: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/4.jpg)
DHT What’s DHT
Store (key, value) pairs Lookup Join/Leave
Examples CAN, Pastry, Tapestry, Chord etc.
![Page 5: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/5.jpg)
DHT Properties Dilation
Efficient lookup, usually O(log(n)) Maintenance cost
Support dynamic environment Control messages, affected servers
Degree Number of opened connections Servers impacted by node join/leave Heartbeat, graceful leave
![Page 6: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/6.jpg)
DHT Properties (cont.) Congestion:
Peers should share the routing load evenly Load (of a node): the probability that it is on a ro
ute with random source and destination. If path length = O(log(n)) then on average, each
node is on n2 x O(log(n))/n = O(nlog(n)) routes. Average load = O(nlogn)/n2 = O(log(n))/n
![Page 7: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/7.jpg)
Previous Works Node-
degree Dilation Congestion Topology
Chord (MIT)
log(n) log(n) log(n)/n Hypercube
Tapestry (Berkeley)
log(n) log(n) log(n)/n Hypercube
CAN (ICSI)
d dn(1/d) dn(1/d)/n d-dim. tourus
Small worlds
3 log2(n) log2(n)/n Cube connected cycle
Ours 3-7 log(n) log(n)/n Butterfly
![Page 8: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/8.jpg)
Intuition
• Route is a combination of links of appropriate size
• Chord: Each node has ALL log(n) links
• Viceroy
• Each node has ONE of the long-range links
• A link of length 1/2k points to a node has link of length 1/2k+1
Chrod
![Page 9: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/9.jpg)
000 001 010 011 100 101 110 111Level 1
Level 2
Level 3
Level 4
A Butterfly Network Each node has
ONE of the long-range links
A link of length 1/2k points to a node has link of length 1/2k+1
Nodes “share” each other’s long link
Routing
1.Route to root
2.Route to right group
3.Route to right level
Path: O(log(n))
Degree: O(1)
![Page 10: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/10.jpg)
A Viceroy network
Level 1
Level 2
Level 3
•Ideally, there should be log(n) levels
•There is not a global counter
•Later, we will see how a node can estimate log(n) locally
0 1
00010010
0011
0100
0101
0110
10001001
1011
110111101111
![Page 11: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/11.jpg)
Structure: Nodes Node
Id: 128 bits binary string, u Level: positive integer, u.level
Order of ids b1b2…bk ∑i=1…k bi/2i
Each node has a SUCCESSOR and a PREDECESSORSUCC(u), PRED(u)
Node u stores the keys k such that u≤k<SUCC(u)
![Page 12: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/12.jpg)
Structure: Nodes
0 1
x SUCC(x)
PRED(x)
Keys stored on x
Lemma 2.1
Let n0 = 1/d(x, SUCC(x)), then w.h.p. (i.e. p>1-1/n1+e) that
log(n)-log(log(n))-O(1) <log(n0) ≤3log(n)
Node x selects level from 1…log(n0) uniformly randomly
![Page 13: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/13.jpg)
Structure: Links A node u in level k has six out links
2 x Short: SUCCESSOR ,PREDECESSOR 2 x Medium: (left) closest level-(k+1) node whos
e id matches u.id[k] and is smaller than u.id. 1 x Long: the closest level-(k+1) node with prefix
u1…uk-1(1-uk)(?) u1…uk-1(1-uk)uk+1…uw* where w=log(n0)-log(log(n0))
1 x Parent: closest level-(k-1) node Also keeps track of in-bound links
![Page 14: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/14.jpg)
Structure: Links
Level 1
Level 2
Level 3
0 1
00010010
0011
0100
0101
0110
10001001
1011
110111101111
Short link Short link
Parent link, to level k-1
Medium linkMatches x[k]0*
Long link, cross over about 1/2k
Matches u[w] except kth bit. (11*)
Matches 1*Wrong!
![Page 15: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/15.jpg)
Routing: AlgorithmLOOKUP(x, y):Initialization: set cur to xProceed to root: while cur.level > 1:
cur = cur.parentGreedy search:
if cur.id ≤ y < SUCC(cur).id, return cur. Otherwise, choose m from links of cur that minimize d(m, y), move to m and repeat.
Demo: http://www.cs.huji.ac.il/labs/danss/anatt/viceroy.html
![Page 16: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/16.jpg)
Routing: Example
Level 1
Level 2
Level 3
0 1
00010010
0011
0100
0101
0110
10001001
1011
110111101111y
x
![Page 17: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/17.jpg)
One Observation
Level 1
Level 2
Level 3
0 1
00010010
0011
0100
0101
0110
10001001
1011
110111101111
![Page 18: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/18.jpg)
Routing: Analysis (1)
Level 1
Level 2
Level 3
0 1
00010010
0011
0100
0101
0110
10001001
1011
110111101111y
x
![Page 19: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/19.jpg)
Routing: Analysis (2) Expected path length = O(log(n))
log(n ) to `level-1’ node log(n ) for traveling among clusters log(n ) for final local search
![Page 20: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/20.jpg)
Routing: Theorems Theorem 4.4
The path length from x to y is O(log(n)) w.h.p. Proof is based on several lemmas Lemma 4.1
For every node u with a level u.level < log(n)-log(log(n)), the number of nodes between u and u.Medium-left (Medium-right), if it exists, is at most 6log2(n) w.h.p.
![Page 21: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/21.jpg)
Routing: Theorems (2) Lemma 4.2
In the greedy search phase of a lookup of value Y from node x, let the j’th greedy step vj, for 1 ≤ j ≤ m, be such that vj is more than O(log2(n)) nodes away from y. Then w.h.p. node vj is reached over a Medium or Long link, and hence satisfies vj.level = j and vj[j] = Y[j].
m = log(n)-2loglog(n)-log(3+e) W.h.p. within m steps, we are n/2m = 6log2(n) nodes
away from the destination
![Page 22: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/22.jpg)
Routing: Theorems (3) Lemma 4.3
Let v be a node that is O(log2(n)) nodes away from the target y. Then w.h.p., within O(log(n)) greedy steps that target y is reached from v.
Theorem 4.4 The total length of a route from x to y is O(log(n)) w.h.p.
Theorem 4.6 Expected load on every node is O(log(n)/n). The load on every node is log2(n)/n w.h.p.
Theorem 4.7Every node u has in-degree O(log(n)) w.h.p.
![Page 23: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/23.jpg)
Join: Algorithm1. Choose identifier: select a random 128 bits x1x2…x128
2. Setup short links: invoke LOOKUP(x), let x’ be the result node. Insert x between x’ and x’.SUUCESSOR.
3. Choose level: let k be the maximal number of matching prefix bits between x and either SUCC(x) or PRED(x), choose level from 1…k.
4. Set parent link: If SUCC(x) has level x.level-1, set x.parent to it. Otherwise, move to SUCC(x) and repeat.
5. Set long link: p = x1…xk-1(1-xk)xk+1…xw
Invoke LOOKUP(p), stop after a node at level x.level+1 and matches p
is reached.
![Page 24: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/24.jpg)
Join: Algorithm (cont.)6. Set medium links: Denote p = x1x2…xx.level. If SUCC(x) has prefix
p and level x.level+1, set x.Medium-right link to it. Otherwise, move the SUCC(x) and repeat.
7. Set inbound links: Denote p = x1x2…xx.level.
Set inbound Medium links: Following SUCC links, so long as successor y has a prefix p and a level different from x.level, if y.level = x.level-1, set y.Medium-left to x.Set inbound long links: Following SUCC links, find y that has a prefix matches p and has level x.level. Take any inbound links that is closer to x than y.Set inbound parent links: Following PRED link, find y such that y.level = x.level+1. Repeat until meet a node in same level as x.
![Page 25: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/25.jpg)
Join: Example
Level 1
Level 2
Level 3
0 1
00010010
0011
0100
0101
0110
10001001
1011
110111101111
0111
Lookup(x)
0111
Set Medium link: O(lg2n) w.h.pp = x1x2…xk (01)If y[k] != p: stopIf y[k]=p and y.level=k+1: set Medium linkOtherwise, move to succ(y)
STOP
Set Parent link:Following SUCC link, find a node has level k-1.
Set long linkP = x1…xk-1(1-xk)…xw stop at level k+1?In this case, find 00*
Set inbound long links:Following short links, find y such that y[k]=x[k] and y.level = x.level, check y’s inbound links.
X
![Page 26: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/26.jpg)
Join: Analysis LOOKUP takes O(log(n)) messages w.h.p. Travels on short links during link setting phase is O(lg2n) w.h.
p. A Medium link is within 6log2(n) nodes from x w.h.p. Similar for others
Theorem 5.1: A JOIN operation by a new node x incurs expected O(log(n)) number of messages, and O(log2(n)) messages w.h.p.The expected number of nodes that change their state as a result of x’s join is constant, and w.h.p is O(log(n)).
Because node x has O(log(n)) in-degrees w.h.p.Similar results holds for LEAVE.
![Page 27: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/27.jpg)
Bounding In-degrees Theorem 4.7
Every node has expected constant in-degree, and has O(log(n)) in-degree w.h.p.
In-degree=# of servers affected by join/leave
How to guarantee constant in-degree? Bucket solution
A background process to balance the assignment of levels
![Page 28: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/28.jpg)
Bucket Solution: Intuition
Level k-1
Level k
• Node x has log(n) in-degree, assuming Medium Right
~log(n)
x
• Too many nodes at level k-1;• Improve the level selection procedure
Too few nodes at level k
![Page 29: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/29.jpg)
Bucket Solution
The name space is divided into non-overlapped buckets.
A bucket contains m nodes, where log(n) ≤m ≤ clog(n), for c>2.
In a buckets, levels are NOT assigned randomly
For each 1≤j≤log(n), there are 1…c nodes at level j in each bucket
In(x) < 7c (?? 2c)
0 1
000100100011
01000101
0110
100010011011
110111101111
![Page 30: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/30.jpg)
Maintaining Bucket Size n can be accurately estimated When bucket size exceeds clog(n), the bucket
is split into two equal size buckets. When bucket size drops below log(n), it is
merged with a neighbor bucket. Further more, if the merged bucket is greater than log(n)x(2c+2)/3, the new bucket is split into two buckets.
(c+1)/3 > 1 since c>2 Buckets are organized into a ring, which can
be merged or split with O(1) message.
![Page 31: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/31.jpg)
Maintain Level Property Node join/leave without merging or splitting O(1)
Join: size < clog(n), choose a level that has less that c nodes
Leave: If it is the only node in its level, find another level that has two nodes, reassign level j to one of them.
Bucket merge or split may result in a reassignment of the levels to all nodes in the bucket(s) O(log(n))
Merging/splitting are expensive, but they do not happen very often
After a merging or splitting of buckets, at least log(n) (c-2)/3 JOIN/LEAVE must happen in this bucket until another merging or splitting of this bucket is performed
Amortized Overhead = c/((c-2)/3) = O(1) for c>2
![Page 32: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/32.jpg)
Amortized analysis
Log(n)
clog(n)
d1 d2
d1, d2 > (c-2)/3
New bucket size
Max bucket size
min(c/2lgn, (c+1)/3lgn) max(c/2lgn, (2c+2)/3lgn)
![Page 33: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/33.jpg)
Viceroy has no built in support for fault tolerance
Viceroy requires graceful leave Leaves are NOT the same as failures
Performance is sensitive to failure External techniques:
Thickening Edges State Machine Replication
Fault Tolerance
![Page 34: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/34.jpg)
State Machine Replication
Old
NewSMR SMR
Super node
Viceroy nodes
![Page 35: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/35.jpg)
Related Works De Bruijn Graph Based Network
Distance halving D2B Koorde
Others Symphony (Small world model) Ulysses (ButterFly, log(n), log(n)/loglogn)
![Page 36: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/36.jpg)
Summary Constant out-degree Expected constant in-degree
O(log(n )) w.h.p. O(1) with bucket solution
O(log(n )) path length w.h.p Expected log(n )/n load:
O(log2(n)/n) w.h.p. Weakness/improvements:
Not Locality Aware No Fault Tolerance Support Due to the lack of flexibility of ButterFly network
![Page 37: Viceroy: Scalable Emulation of Butterfly Networks For Distributed Hash Tables](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812de6550346895d9341d7/html5/thumbnails/37.jpg)
Question
Photo by Peter J. Bryant