Probabilistically Bounded Staleness (PBS)
Peter Bailis (@pbailis), Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica
UC Berkeley
@ Twitter, 6.22.12
1. Fast  2. Scalable  3. Available
solution: replicate for 1. request capacity 2. reliability
keep replicas in sync → slow
alternative: sync later → inconsistent
⇧ consistency, ⇧ latency: contact more replicas, read more recent data
⇩ consistency, ⇩ latency: contact fewer replicas, read less recent data
eventual consistency: “if no new updates are made to the object, eventually all accesses will return the last updated value” (W. Vogels, CACM 2008)
How eventual? How long do I have to wait?
How consistent? What happens if I don’t wait?
problem: no guarantees with eventual consistency
solution: consistency prediction (PBS)
technique: measure latencies, use WARS model
Dynamo: Amazon’s Highly Available Key-value Store (SOSP 2007)
Dynamo-style systems: Cassandra (Apache, DataStax), Riak, Project Voldemort
Used by: Adobe, Aol, Ask.com, Best Buy, Boeing, Cisco, Comcast, Digg, GitHub, Gilt Groupe, Gowalla, IBM, Joyent Cloud, Morningstar, Mozilla, Netflix, Palantir, Rackspace, Rhapsody, Shazam, Soundcloud, Spotify, Yammer
N = 3 replicas, R=3 read:
client → Coordinator: read(“key”)
Coordinator → R1, R2, R3: read(“key”)
R1, R2, R3 → Coordinator: (“key”, 1)
Coordinator → client: (“key”, 1)
N = 3 replicas, R=1 read:
client → Coordinator: read(“key”)
Coordinator sends the read to all replicas, waits for the first reply
R1 → Coordinator: (“key”, 1)
Coordinator → client: (“key”, 1); the remaining replies arrive later
N replicas per key; read: wait for R replies; write: wait for W acks
W=1 write:
client → Coordinator: write(“key”, 2)
Coordinator → R1, R2, R3: write(“key”, 2)
R1 → Coordinator: ack; with W=1, the write completes after this single response
meanwhile, a second Coordinator issues an R=1 read(“key”)
R2 or R3, which have not yet applied the write, may reply first: (“key”, 1) → stale read
once the write reaches R2 and R3, an R=1 read returns (“key”, 2)
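The stale-read scenario above can be sketched in a few lines. This is a toy in-memory model (all names are ours, not any real system’s API): replicas keep a versioned value per key, the write is acknowledged after a single replica applies it (W=1), and an R=1 read served by a not-yet-updated replica returns the old version.

```python
class Replica:
    """Toy replica: stores (value, version) per key; newest version wins."""
    def __init__(self):
        self.store = {}

    def apply_write(self, key, value, version):
        cur = self.store.get(key)
        if cur is None or version > cur[1]:
            self.store[key] = (value, version)

    def read(self, key):
        return self.store.get(key)

replicas = [Replica() for _ in range(3)]          # N = 3
for rep in replicas:
    rep.apply_write("key", 1, version=1)          # initial value everywhere

# W=1 write of ("key", 2): only R1 has applied it when the ack returns
replicas[0].apply_write("key", 2, version=2)

# An R=1 read that happens to be served by R3 first is stale
assert replicas[2].read("key") == (1, 1)

# Once the write reaches the other replicas, R=1 reads see the new value
for rep in replicas[1:]:
    rep.apply_write("key", 2, version=2)
assert replicas[2].read("key") == (2, 2)
```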
if R + W > N: “strong” consistency
else: eventual consistency
R + W > N: strong consistency; R + W ≤ N: lower latency
Cassandra: R=W=1, N=3 by default (1+1 ≯ 3)
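The R + W > N rule can be checked by brute force. A small sketch (the function name is ours): a read quorum and a write quorum are guaranteed to overlap exactly when their sizes sum to more than N.

```python
from itertools import combinations

def quorums_always_intersect(N, R, W):
    """Brute force: does every read quorum of size R share at least one
    replica with every write quorum of size W?"""
    replicas = range(N)
    return all(set(rq) & set(wq)
               for rq in combinations(replicas, R)
               for wq in combinations(replicas, W))

# Equivalent to the R + W > N condition:
assert quorums_always_intersect(3, 2, 2)        # 2 + 2 > 3: strong
assert not quorums_always_intersect(3, 1, 1)    # Cassandra default: 1 + 1 ≯ 3
```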
http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/
"In the general case, we typically use [Cassandra’s] consistency level of [R=W=1], which provides
maximum performance. Nice!" --D. Williams, “HBase vs Cassandra: why we moved” February 2010
http://www.reddit.com/r/programming/comments/bcqhi/reddits_now_running_on_cassandra/c0m3wh6
Consistency or Bust: Breaking a Riak Cluster (NoSQL Primer, July 31, 2011)
Low Value Data: n=2, r=1, w=1
Mission Critical Data: n=5, r=1, w=5, dw=5
http://www.slideshare.net/Jkirkell/breaking-a-riak-cluster
Voldemort @ LinkedIn:
N=3 not required; “some consistency”: R=W=1, N=2
“very low latency and high availability”: R=W=1, N=3
(Alex Feinberg, personal communication)
Anecdotally, EC is “worthwhile” for many kinds of data.
How eventual? How consistent? “eventual and consistent enough”
Can we do better?
Probabilistically Bounded Staleness: can’t make promises, but can give expectations
PBS is: a way to quantify latency-consistency trade-offs
what’s the latency cost of consistency? what’s the consistency cost of latency?
an “SLA” for consistency
t-visibility: probability p of consistent reads after t seconds
(e.g., 99.9% of reads will be consistent after 10 ms)
How eventual?
t-visibility depends on: 1) message delays, 2) background version exchange (anti-entropy)
anti-entropy: only decreases staleness; comes in many flavors; hard to guarantee a rate
focus on message delays: steady state; with failures: unavailable or sloppy
Coordinator ↔ Replica, once per replica (Time →):
write → replica; ack → coordinator (wait for W responses)
t seconds elapse
read → replica; response → coordinator (wait for R responses)
the response is stale if the read arrives at the replica before the write does
N=2, W=1, R=1 (good case, Time →):
writes sent to both replicas; one ack returns → write completes (W=1)
read sent; the first response (R=1) reflects the write: good
N=2, W=1, R=1 (bad case, Time →):
one ack returns → write completes (W=1) while the second replica’s write is still in flight
read sent; the first response comes from the replica that has not yet seen the write: bad (stale)
Timeline recap: write → ack (wait for W responses); t seconds elapse; read → response (wait for R responses); a response is stale if the read arrives before the write, once per replica.
In the quorum example: the R=1 read returned (“key”, 1) because R3 replied before the last write arrived!
The WARS model names the four message delays, sampled once per replica (Coordinator ↔ Replica, Time →):
(W) write: Coordinator → Replica
(A) ack: Replica → Coordinator (wait for W responses)
t seconds elapse
(R) read: Coordinator → Replica
(S) response: Replica → Coordinator (wait for R responses)
a replica’s response is stale if the read arrives before the write
Solving WARS analytically: hard. Monte Carlo methods: easier.
To use WARS: gather latency data, then run a Monte Carlo simulation by sampling.
Sample latencies (ms):
W: 53.2, 44.5, 101.1, ...
A: 10.3, 8.2, 11.3, ...
R: 15.3, 22.4, 19.8, ...
S: 9.6, 14.2, 6.7, ...
How eventual? key: WARS model; need: latencies
t-visibility: consistent reads with probability p after t seconds
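A minimal sketch of that Monte Carlo approach, under simplifying assumptions (exponential W/A/R/S latencies with made-up means; real use would resample from measured traces as above): draw one W, A, R, S sample per replica, find when the write commits (the Wth ack), and check whether any replica in the read quorum had received the write before the read request reached it.

```python
import random

# Hypothetical latency distributions (exponential, mean in ms);
# not real measurements.
MEAN = {"W": 10.0, "A": 5.0, "R": 5.0, "S": 5.0}

def sample(kind):
    return random.expovariate(1.0 / MEAN[kind])

def consistent_read(N=3, R=1, W=1, t=0.0):
    """One Monte Carlo trial of the WARS model: True if a read started
    t ms after the write commits sees the written value."""
    w = [sample("W") for _ in range(N)]  # write reaches replica i at w[i]
    a = [sample("A") for _ in range(N)]  # ack delay back to coordinator
    r = [sample("R") for _ in range(N)]  # read request delay
    s = [sample("S") for _ in range(N)]  # read response delay
    commit = sorted(w[i] + a[i] for i in range(N))[W - 1]  # Wth ack arrives
    # the coordinator returns after the R fastest responses
    quorum = sorted(range(N), key=lambda i: r[i] + s[i])[:R]
    # consistent if any read-quorum replica had the write when the read
    # request arrived there (at commit + t + r[i])
    return any(w[i] <= commit + t + r[i] for i in quorum)

def t_visibility(t, trials=20_000, **kw):
    return sum(consistent_read(t=t, **kw) for _ in range(trials)) / trials
```

For example, `t_visibility(10.0)` estimates the probability that reads issued 10 ms after a write commits return that write or something newer.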
How consistent? What happens if I don’t wait?
The probability of reading data older than k versions is exponentially reduced in k:
Pr(reading latest write) = 99%
Pr(reading one of last two writes) = 99.9%
Pr(reading one of last three writes) = 99.99%
https://issues.apache.org/jira/browse/CASSANDRA-4261
WARS simulation accuracy (Cassandra cluster, injected latencies):
t-staleness RMSE: 0.28%; latency N-RMSE: 0.48%
Yammer: 100K+ companies, uses Riak
LinkedIn: 150M+ users, built and uses Voldemort
production latencies fit Gaussian mixtures
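Fitting production latencies with Gaussian mixtures lets the simulator resample realistic delays. A sketch with invented parameters (these are not the actual LinkedIn or Yammer fits): a fast-path component plus a heavy slow tail.

```python
import random

# Hypothetical two-component Gaussian mixture for a write-latency
# distribution: (weight, mean ms, stddev ms). Made up for illustration.
WRITE_MIX = [(0.9, 5.0, 1.0),    # fast path
             (0.1, 50.0, 10.0)]  # slow tail (GC pause, disk flush, ...)

def mixture_sample(components):
    """Pick a component by weight, then draw a Gaussian sample,
    clamped at zero since latencies cannot be negative."""
    u, acc = random.random(), 0.0
    for weight, mu, sigma in components:
        acc += weight
        if u <= acc:
            return max(0.0, random.gauss(mu, sigma))
    # guard against floating-point rounding in the weights
    return max(0.0, random.gauss(components[-1][1], components[-1][2]))
```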
LNKD-DISK, N=3:
99.9% consistent reads: R=2, W=1; t = 13.6 ms; latency: 12.53 ms
100% consistent reads: R=3, W=1; latency: 15.01 ms
(latency is combined read and write latency at the 99.9th percentile)
16.5% faster. worthwhile?
LNKD-SSD, N=3:
99.9% consistent reads: R=1, W=1; t = 1.85 ms; latency: 1.32 ms
100% consistent reads: R=3, W=1; latency: 4.20 ms
(latency is combined read and write latency at the 99.9th percentile)
59.5% faster
Back to the WARS timeline (write (W), ack (A), read (R), response (S), once per replica): the write delay (W) is the critical factor in staleness.
[Figure: CDFs of write latency (ms, log scale 10⁻² to 10³) for W=1, W=2, W=3; N=3; distributions LNKD-SSD, LNKD-DISK, YMMR, WAN]
In the WARS timeline, SSDs reduce variance compared to disks!
YMMR, N=3:
99.9% consistent reads: R=1, W=1; t = 202.0 ms; latency: 43.3 ms
100% consistent reads: R=3, W=1; latency: 230.06 ms
(latency is combined read and write latency at the 99.9th percentile)
81.1% faster
Workflow: 1. Tracing  2. Simulation  3. Tune N, R, W  4. Profit
https://issues.apache.org/jira/browse/CASSANDRA-4261
problem: no guarantees with eventual consistency
solution: consistency prediction (PBS)
technique: measure latencies, use WARS model
consistency is a metric: to measure, to predict
R+W: strong consistency vs. lower latency
PBS: latency vs. consistency trade-offs; simple modeling with WARS; models staleness in time and versions
eventual consistency: often fast, often consistent; PBS helps explain when and why
demo: pbs.cs.berkeley.edu/#demo
@pbailis
VLDB 2012 early print: tinyurl.com/pbsvldb
Cassandra patch: tinyurl.com/pbspatch
Extra Slides
Related Work
Quorum system theory: e.g., probabilistic quorums, k-quorums
Deterministic staleness: e.g., TACT/conits, FRACS
Consistency verification: e.g., Golab et al. (PODC ’11), Bermbach and Tai (M4WSOC ’11)
PBS and apps: tolerating staleness requires either:
staleness-tolerant data structures: timelines, logs (cf. commutative data structures, logical monotonicity)
or asynchronous compensation code: detect violations after data is returned (see paper; cf. “Building on Quicksand”: memories, guesses, apologies); write code to fix any errors; minimize (compensation cost) × (# of expected anomalies)
Read only newer data? (monotonic reads session guarantee)
# versions tolerable staleness = client’s read rate / global write rate (for a given key)
Failure? Treat failures as latency spikes.
How long do partitions last? Over what time interval?
99.9% uptime/yr ⇒ 8.76 hours downtime/yr; 8.76 consecutive hours down ⇒ a bad 8-hour rolling average
hide in the tail of the distribution OR continuously evaluate the SLA and adjust
[Figure: CDFs of write latency (ms, log scale 10⁻² to 10³) for W=1, W=2 and of read latency for R=3; N=3; distributions LNKD-SSD, LNKD-DISK, YMMR, WAN. LNKD-SSD and LNKD-DISK are identical for reads]
Probabilistic quorum systems:
p_inconsistent = C(N−W, R) / C(N, R)
(the probability that a read quorum of R replicas misses all W replicas holding the latest write)
How consistent?
k-staleness: probability p of reading one of the last k versions:
p = 1 − (C(N−W, R) / C(N, R))^k
closed-form solution, static quorum choice
⟨k,t⟩-staleness: versions and time
approximation: exponentiate t-staleness by k
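The probabilistic-quorum and k-staleness expressions above can be computed directly (the helper names are ours):

```python
from math import comb

def p_inconsistent(N, R, W):
    # Probability that a uniformly chosen read quorum of R replicas misses
    # all W replicas holding the latest write: C(N-W, R) / C(N, R).
    # comb(n, k) is 0 when k > n, so R + W > N yields probability 0.
    return comb(N - W, R) / comb(N, R)

def k_staleness(N, R, W, k):
    # Probability of reading one of the last k versions:
    # 1 - p_inconsistent ** k (grows toward 1 as k increases).
    return 1 - p_inconsistent(N, R, W) ** k
```

For example, with N=3, R=W=1, a read misses the lone written replica with probability 2/3, and tolerating one extra version of staleness (k=2) raises the success probability accordingly.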
“strong” consistency: reads return the last written value or newer (defined w.r.t. real time, when the read started)
Write to W, read from R replicas (N = 3 replicas: R1, R2, R3):
R=W=3: { {R1, R2, R3} }
R=W=2: { {R1, R2}, {R2, R3}, {R1, R3} } → quorum system: guaranteed intersection
R=W=1: { {R1}, {R2}, {R3} } → partial quorum system: may not intersect
Synthetic, exponential distributions: N=3, W=1, R=1
varying the write distribution: W = 1/4× ARS, W = 10× ARS
concurrent writes: deterministically choose
(e.g., an R=2 Coordinator receiving both (“key”, 1) and (“key”, 2))