Minimizing Churn in Distributed Systems
P. Brighten Godfrey, Scott Shenker, and Ion Stoica
UC Berkeley
SIGCOMM’06
Road Map
Introduction
Simulation
Basic Properties
Analysis
Applications
Discussion
Conclusion
Introduction
Churn: change in the set of participating nodes due to joins, graceful leaves, and failures
A quantitative guide to the churn resulting from node selection strategies
Analytically characterize the performance of the strategies
Compare the performance of the strategies on different real traces
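Churn Simulation Model
System model
Node status: up (in use, or available) or down
Nodes in use: $k = \alpha n$, $0 < \alpha \le 1$
Definition of churn
$$C = \frac{1}{T} \sum_{\text{events } i} \frac{|U_{i-1} \ominus U_i|}{\max\{|U_{i-1}|, |U_i|\}}$$
where $U_i$ is the set of in-use nodes after event $i$, $\ominus$ is the symmetric difference, and $T$ is the run length
Example: two nodes fail and are replaced by others
$$C = \frac{1}{T}\left(\frac{2}{k} + \frac{2}{k}\right)$$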
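The churn definition on this slide (for each event, the fraction of in-use nodes that changed, summed and divided by the run length T) can be sketched in a few lines. This is a minimal illustration, not the authors' simulator; the function name `churn`, the node labels, and the values k = 4 and T = 10 are assumptions chosen for the example.

```python
from typing import Iterable, Set


def churn(in_use_sets: Iterable[Set[str]], run_length: float) -> float:
    """Sum over consecutive events of (changed nodes / larger set size), divided by T."""
    sets = list(in_use_sets)
    total = 0.0
    for prev, curr in zip(sets, sets[1:]):
        changed = len(prev ^ curr)            # size of the symmetric difference
        largest = max(len(prev), len(curr))
        if largest > 0:
            total += changed / largest
    return total / run_length


# Hypothetical run: k = 4 nodes in use, two fail, then two others replace them.
k, T = 4, 10.0
history = [
    {"a", "b", "c", "d"},   # initial in-use set
    {"a", "b"},             # c and d fail
    {"a", "b", "e", "f"},   # e and f replace them
]
example_churn = churn(history, T)
```

With these inputs the two events each contribute 2/k, so the result agrees with the slide's example of two failures followed by two replacements.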
Selection Strategies
Predictive fixed strategies
Fixed Decent: select randomly from the 50% of nodes with more uptime
Fixed Most Available: the nodes with the most time up
Fixed Longest Lived: the nodes with the greatest average session time
Agnostic fixed strategies
Fixed Random
Predictive replacement strategies
Max Expectation: greatest expected remaining uptime
Longest Uptime: longest current uptime
Optimal
Agnostic replacement strategies
Random Replacement (RR)
Passive Preference List: replace a node only after it fails
Active Preference List
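To make the replacement strategies concrete, here is a hedged sketch of how three of them might choose a replacement once an in-use node fails. The function names, the `up_since` map, and the `ranking` list are hypothetical, and the assumption that a Passive Preference List picks the most preferred available node is an interpretation, not something the slide states.

```python
import random
from typing import Dict, List, Optional


def random_replacement(available: List[str], rng: random.Random) -> Optional[str]:
    """RR: pick a uniformly random available node."""
    return rng.choice(available) if available else None


def longest_uptime(available: List[str], up_since: Dict[str, float]) -> Optional[str]:
    """Longest Uptime: pick the available node whose current session started earliest."""
    return min(available, key=lambda n: up_since[n]) if available else None


def passive_preference_list(available: List[str], ranking: List[str]) -> Optional[str]:
    """Passive PL (assumed reading): pick the highest-ranked node that is available."""
    available_set = set(available)
    for node in ranking:
        if node in available_set:
            return node
    return None


# Hypothetical state at the moment a replacement is needed.
available = ["n1", "n2", "n3"]
up_since = {"n1": 50.0, "n2": 10.0, "n3": 30.0}   # time each current session began
ranking = ["n3", "n1", "n2"]                      # most preferred first

rr_choice = random_replacement(available, random.Random(0))
lu_choice = longest_uptime(available, up_since)
pl_choice = passive_preference_list(available, ranking)
```

RR needs no state beyond the list of available nodes, whereas the other two need uptime information or a ranking.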
Traces
Synthetic traces
Pareto-distributed session times with PDF
$$f(x) = \frac{a\, b^{a}}{(x + b)^{a+1}}$$
$a = 1.5$ and $b$ fixed so that the mean is 30 minutes
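As an illustration of how such synthetic session times could be drawn, the sketch below assumes the PDF on this slide is the shifted Pareto form f(x) = a b^a / (x + b)^(a+1), whose mean is b / (a - 1) for a > 1. Under that assumption, a = 1.5 and a 30-minute mean give b = 15 minutes. The function names are hypothetical.

```python
import random

A = 1.5                         # exponent a from the slide
MEAN_MINUTES = 30.0             # target mean from the slide
B = MEAN_MINUTES * (A - 1.0)    # assumed mean b / (a - 1), so b = 15 minutes


def pareto_cdf(x: float, a: float = A, b: float = B) -> float:
    """CDF of the assumed PDF f(x) = a * b**a / (x + b)**(a + 1), for x >= 0."""
    return 1.0 - (b / (x + b)) ** a


def sample_session_time(u: float, a: float = A, b: float = B) -> float:
    """Inverse-CDF sampling: map u in [0, 1) to a session time in minutes."""
    return b * ((1.0 - u) ** (-1.0 / a) - 1.0)


rng = random.Random(1)
sessions = [sample_session_time(rng.random()) for _ in range(5)]
```

Because the tail is heavy (the variance is infinite for a = 1.5), empirical means converge slowly, so short runs can differ noticeably from 30 minutes.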
Simulation Setup
Event-based simulator
Selection algorithm reacts immediately after each change
Chord protocol simulator
No message loss, except when a node fails while a datagram is in flight
At least 10 trials
Sample 1000 random nodes
95% confidence intervals
Basic Properties
Synthetic Pareto lifetimes
Fixed k = 50
All fixed strategies perform the same, because every node has the same mean session time
Benefit of Replacement Strategies
1.3 to 5 times improvement
Dynamically selecting nodes for a long-running distributed application would be worthwhile
Benefit of Replacement Strategies
The best fixed strategies match the performance of the best replacement strategy
The traces are shorter
Agnostic Strategies
RR is worse for small k, but is within a factor of 2 of Max Expectation
RR is 1.2 to 3 times better than Passive PL and 2.5 to 10 times better than Active PL
Analysis of Fixed and PL strategies
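Fixed strategies
Nodes recover instantaneously
Each failure and recovery, normalized by time: $\frac{1}{T}\left(\frac{1}{k} + \frac{1}{k}\right) = \frac{2}{kT}$
Number of failures of one node with mean session time $\mu$: $\frac{T}{\mu}$
Expected churn: $\frac{2}{kT} \cdot k \cdot \frac{T}{\mu} = \frac{2}{\mu}$
Passive Preference List strategies
If k is large, the same as Fixed strategies
Active Preference List strategies
Pay more, because they switch back after the preferred node recovers: each extra node change adds $\frac{1}{kT}$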
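The Fixed-strategy argument on this slide can be checked numerically. The sketch below assumes, as the slide does, that nodes recover instantaneously, and it reads the three steps as: each failure plus recovery costs (1/T)(1/k + 1/k), and a node with mean session time mu fails about T/mu times in a run of length T. The function name and the per-node list of means are hypothetical.

```python
from typing import Sequence


def fixed_strategy_expected_churn(mean_session_times: Sequence[float]) -> float:
    """Expected churn of a fixed set of k nodes under instantaneous recovery."""
    k = len(mean_session_times)
    # Per node: (T / mu) failures, each costing (1/T) * (2/k), so T cancels out.
    return sum((2.0 / k) * (1.0 / mu) for mu in mean_session_times)


# Hypothetical setup: k = 50 nodes, all with a 30-minute mean session time.
uniform_churn = fixed_strategy_expected_churn([30.0] * 50)

# Mixed means: a few stable nodes lower the total.
mixed_churn = fixed_strategy_expected_churn([30.0] * 40 + [300.0] * 10)
```

When every node has the same mean, the result is 2 divided by that mean regardless of k, which is why picking nodes with longer sessions is the only way a fixed strategy can reduce churn in this model.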
Analysis of Random Replacement
Intuition: the waiting time paradox
RR is (roughly) selecting the current session of a random node
This is biased towards longer sessions
RR does very badly when stable nodes are rare
Example: one node has mean session time r >> 1 and all others have mean 1
The churn of RR is about 2, far above that of the best fixed strategy
Churn rate: [equation for the expected churn of RR not recoverable from the transcript]
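The waiting time paradox behind this intuition can be shown with a small calculation. A session of length x covers a share of the timeline proportional to x, so the session in progress at a random instant has mean length E[X^2] / E[X], which is never below E[X]. This is a generic illustration of that length bias, not the paper's derivation; the function names and the two session lengths are assumptions.

```python
from typing import Sequence


def mean_session(lengths: Sequence[float]) -> float:
    """Plain average session length, E[X], over equally likely lengths."""
    return sum(lengths) / len(lengths)


def mean_session_at_random_time(lengths: Sequence[float]) -> float:
    """Mean length of the session covering a uniformly random instant: E[X^2] / E[X]."""
    return sum(x * x for x in lengths) / sum(lengths)


# Hypothetical node whose sessions last 1 or 10 time units with equal probability.
lengths = [1.0, 10.0]
plain_mean = mean_session(lengths)                    # 5.5
biased_mean = mean_session_at_random_time(lengths)    # 101 / 11, about 9.18
```

Picking a node that is currently up therefore tends to land in one of its longer sessions, which is the sense in which RR is biased towards longer sessions.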
Analysis of Random Replacement
Agreement of the analysis with a simulation for n = 20 and the previous Pareto-distributed session time plot
Characteristics of Random Replacement
X' is more skewed than X:
if E[X'] = E[X], then $E[X' \mid X' \ge x'] \ge E[X \mid X \ge x]$
where x' and x are the yth percentile values of X' and X
The churn of RR decreases as the distributions become more "skewed"
If the session time distributions are stable and have equal mean, RR's expected churn is at most twice the expected churn of any Fixed or Preference List strategy
Anycast
Whenever its current server fails, a client obtains a list of the m servers to which it has the lowest latency and connects to a random one of these m
Switching to another server is not counted
Latencies were obtained from a synthetic edge network delay space generator
It is modeled on measurements of latency between DNS servers
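The server selection rule described on this slide can be sketched as follows. The function name `pick_server`, the latency table, and the server names are hypothetical; the only behavior taken from the slide is choosing uniformly at random among the m live servers with the lowest latency.

```python
import random
from typing import Dict, List


def pick_server(latency_ms: Dict[str, float], up: List[str], m: int,
                rng: random.Random) -> str:
    """Return a random server among the m live servers with the lowest latency."""
    candidates = sorted(up, key=lambda s: latency_ms[s])[:m]
    if not candidates:
        raise ValueError("no live server available")
    return rng.choice(candidates)


# Hypothetical latencies from one client to four servers.
latency_ms = {"a": 10.0, "b": 20.0, "c": 30.0, "d": 40.0}
rng = random.Random(0)

nearest = pick_server(latency_ms, ["a", "b", "c", "d"], m=1, rng=rng)
one_of_two = pick_server(latency_ms, ["a", "b", "c", "d"], m=2, rng=rng)
after_failure = pick_server(latency_ms, ["b", "c", "d"], m=1, rng=rng)
```

With m = 1 the client always takes the lowest-latency live server, and as m grows towards the total number of servers the choice approaches a uniformly random one.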
Anycast
Trade-off between server list size m and latency t
t increases => Passive PL
m increases => RR
Hybrid: $\omega \cdot \text{latency} + (1 - \omega) \cdot \text{uptime}$
As ω decreases: from Passive PL to Longest Uptime
Anycast
When the session time is small, the end host experiences the mean server failure rate, as in Active PL
DHT Neighbor Selection
Long-distance neighbors
Deterministic topology (Active PL)
Randomized topology (RR)
Simulation
Sample n nodes from the Gnutella trace
Feed them into the Chord protocol simulator
Two nodes send a message to a node with a single key
It counts as failed when the two messages are lost
DHT Neighbor Selection
Randomized topologies are more stable, but have slightly longer routes
Randomized topologies can also reduce maintenance bandwidth
Multicast
Select one of m suitable nodes as parent
Suitable: has available bandwidth to serve another child
Strategies: Longest Uptime, Minimum Depth, Minimum Latency
Homogeneous bandwidth
DHT Replica Placement
Root set (Passive PL)
Nodes with IDs closest to the key (object) should keep the replicas
Root directory (RR)
The directory is replicated in the same way as the root set
Replicas may be on any node in the system
Simulation
Lazy replication
On equal footing
DHT Replica Placement
There are many permanent failures in Gnutella traces
Discussion
When would one use Random Replacement?
To minimize churn, use Longest Uptime
RR would be easier to implement
Uptime is not easy to determine: network problems, lying nodes
What about load balance?
The results do not address fairness between users
Conclusion
A guide to the performance of a range of node selection strategies on real-world traces
Highlight and explain analytically the good performance of RR relative to smart strategies
Explain the performance implications of a variety of existing distributed systems designs