Peter Bailis, Shivaram Venkataraman, Mike Franklin, Joe Hellerstein, Ion Stoica
VLDB 2012
UC Berkeley
Probabilistically Bounded Staleness for Practical Partial Quorums
PBS
NOTE TO READER
watch a recording (of an earlier talk) at: http://vimeo.com/37758648
play with a live demo at: http://pbs.cs.berkeley.edu/#demo
R+W
strong consistency, higher latency ↔ eventual consistency, lower latency
consistency is a continuum, not a binary choice: strong ↔ eventual
latency vs. consistency
informed by practice
not in this talk: availability, partitions, failures
our focus:
quantify eventual consistency: wall-clock time (“how eventual?”), versions (“how consistent?”)
analyze real-world systems: EC is often strongly consistent; describe when and why
our contributions
intro, system model, practice, metrics, insights, integration
Apache, DataStax
Project Voldemort
Dynamo: Amazon’s Highly Available Key-value Store
SOSP 2007
Adobe
Cisco
Digg
Gowalla
IBM
Morningstar
Netflix, Palantir
Rackspace
Rhapsody
Shazam
Spotify
Soundcloud
Mozilla
Ask.com, Yammer, Aol
GitHub, Joyent Cloud
Best Buy
LinkedIn, Boeing
Comcast
Cassandra
Riak, Voldemort, Gilt Groupe
N replicas/key; read: wait for R replies; write: wait for W acks
N=3, R=2
if R+W > N: “strong” consistency
else: eventual consistency
reads return the last acknowledged write or an in-flight write (per-key)
“strong” consistency: regular register (R+W > N)
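As a minimal sketch (not from the talk), the quorum-overlap rule can be checked directly; `is_strong` is a hypothetical helper name:

```python
def is_strong(n: int, r: int, w: int) -> bool:
    """Read and write quorums are guaranteed to intersect, giving
    regular-register ("strong") semantics, exactly when R + W > N."""
    return r + w > n

# Cassandra's defaults: R=W=1, N=3 -> only eventual consistency
print(is_strong(3, 1, 1))  # False
print(is_strong(3, 2, 2))  # True
```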
Latency (LinkedIn disk-based model, N=3):

R | 99th  | 99.9th
1 | 1x    | 1x
2 | 1.59x | 2.35x
3 | 4.8x  | 6.13x

W | 99th  | 99.9th
1 | 1x    | 1x
2 | 2.01x | 1.9x
3 | 4.96x | 14.96x
⇧consistency, ⇧latency: wait for more replicas, read more recent data
⇩consistency, ⇩latency: wait for fewer replicas, read less recent data
eventual consistency: “if no new updates are made to the object, eventually all accesses will return the last updated value”
W. Vogels, CACM 2008
R+W ≤ N
How eventual? How long do I have to wait?
How consistent? What happens if I don’t wait?
R+W
strong consistency, higher latency ↔ eventual consistency, lower latency
intro, system model, practice, metrics, insights, integration
Cassandra: R=W=1, N=3 by default (1+1 ≯ 3) ⇒ eventual consistency
“maximum performance”, “very low latency”
okay for “most data”, the “general case”
in the wild
anecdotally, EC is “good enough” for many kinds of data
How eventual? How consistent?
“eventual and consistent enough”
Probabilistically Bounded Staleness
can’t make promises; can give expectations
Can we do better?
intro, system model, practice, metrics, insights, integration
How eventual? How long do I have to wait?
t-visibility: probability p of consistent reads after t seconds
(e.g., 10 ms after write, 99.9% of reads consistent)
How eventual?
t-visibility depends on
messaging and processing delays
Coordinator ↔ Replica, over time (once per replica):
write → ack: wait for W responses
t seconds elapse
read → response: wait for R responses
response is stale if read arrives before write
Example (W=1, R=1, N=2): Alice writes; replica R1 acks, so the write commits; Bob reads from replica R2 before the write arrives there, and R2 returns an inconsistent (stale) response.
WARS model (Coordinator ↔ Replica, once per replica):
write (W) → ack (A): wait for W responses
t seconds elapse
read (R) → response (S): wait for R responses
response is stale if read arrives before write
solving WARS analytically requires order statistics over dependent variables; instead: Monte Carlo methods
to use WARS:
1. gather latency data (ms), e.g.:
W: 53.2, 44.5, 101.1, ...
A: 10.3, 8.2, 11.3, ...
R: 15.3, 22.4, 19.8, ...
S: 9.6, 14.2, 6.7, ...
2. run simulation (Monte Carlo, sampling), e.g. draw W=44.5, A=11.3, R=15.3, S=14.2
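The sampling step can be sketched as a small Monte Carlo estimator. This is an illustrative reconstruction, not the authors’ code: latencies are drawn from an assumed exponential distribution rather than traced data, and `sample_consistent` / `p_consistent` are hypothetical names.

```python
import random

def sample_consistent(n=3, w=1, r=1, t=10.0, mean_ms=4.0):
    """One Monte Carlo trial of the WARS model.

    W: write-request latency, A: ack latency,
    R: read-request latency, S: response latency
    (all assumed i.i.d. exponential with mean `mean_ms` here).
    """
    exp = lambda: random.expovariate(1.0 / mean_ms)
    Ws = [exp() for _ in range(n)]
    As = [exp() for _ in range(n)]
    Rs = [exp() for _ in range(n)]
    Ss = [exp() for _ in range(n)]
    # coordinator commits the write after the w-th fastest ack
    commit = sorted(Ws[i] + As[i] for i in range(n))[w - 1]
    read_start = commit + t  # read issued t ms after the write commits
    # the r replicas whose responses arrive back first
    responders = sorted(range(n), key=lambda i: Rs[i] + Ss[i])[:r]
    # replica i returns the new value iff the write reached it
    # before the read request did
    return any(Ws[i] <= read_start + Rs[i] for i in responders)

def p_consistent(t, trials=20000):
    """Estimate t-visibility: fraction of consistent reads at time t."""
    return sum(sample_consistent(t=t) for _ in range(trials)) / trials
```

Sweeping t traces out the t-visibility curve: larger t drives the estimate toward 1, and larger R or W quorums shift the whole curve up at the cost of latency.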
WARS accuracy: real Cassandra cluster, varying latencies:
t-visibility RMSE: 0.28%; latency N-RMSE: 0.48%
How eventual?
key: WARS model; need: latencies
t-visibility: consistent reads with probability p after t seconds
intro, system model, practice, metrics, insights, integration
Yammer (100K+ companies) uses Riak
LinkedIn (175M+ users) built and uses Voldemort
production latencies fit Gaussian mixtures
N=3; latency is combined read and write latency at the 99.9th percentile
LNKD-DISK, N=3:
R=3, W=1, 100% consistent: latency 15.01 ms
R=2, W=1, t=13.6 ms, 99.9% consistent: latency 12.53 ms (16.5% faster)
worthwhile?
LNKD-SSD, N=3 (latency is combined read and write latency at the 99.9th percentile):
R=3, W=1, 100% consistent: latency 4.20 ms
R=1, W=1, t=1.85 ms, 99.9% consistent: latency 1.32 ms (59.5% faster)
[Figure: CDFs of write latency (ms, log scale) for W=1, 2, 3 under the LNKD-SSD, LNKD-DISK, YMMR, and WAN latency models; N=3]
WARS diagram (Coordinator ↔ Replica, once per replica): write (W) → ack (A), wait for W responses; t seconds elapse; read (R) → response (S), wait for R responses; response is stale if read arrives before write
SSDs reduce variance compared to disks!
Yammer, N=3: latency ⇩81.1% (187 ms) for a 99.9th-percentile t-visibility of 202 ms
in the paper:
How consistent? k-staleness (versions); <k,t>-staleness: versions and time
monotonic reads
quorum load
latency distributions
WAN model
varying quorum sizes
staleness detection
intro, system model, practice, metrics, insights, integration
Integration: 1. tracing, 2. simulation, 3. tune N, R, W
Project Voldemort
https://issues.apache.org/jira/browse/CASSANDRA-4261
Related Work
Quorum Systems
• probabilistic quorums [PODC ’97]
• deterministic k-quorums [DISC ’05, ’06]
Consistency Verification
• Golab et al. [PODC ’11]
• Bermbach and Tai [M4WSOC ’11]
• Wada et al. [CIDR ’11]
• Anderson et al. [HotDep ’10]
• transactional consistency: Zellag and Kemme [ICDE ’11], Fekete et al. [VLDB ’09]
Latency-Consistency
• Daniel Abadi [Computer ’12]
• Kraska et al. [VLDB ’09]
Bounded Staleness Guarantees
• TACT [OSDI ’00]
• FRACS [ICDCS ’03]
• AQuA [IEEE TPDS ’03]
R+W
strong consistency, higher latency ↔ eventual consistency, lower latency
consistency is a continuum, not a binary choice: strong ↔ eventual
quantify eventual consistency: model staleness in time and versions
latency-consistency trade-offs: analyze real systems and hardware
quantify which choice is best and explain why EC is often strongly consistent
PBS: pbs.cs.berkeley.edu
Extra Slides
PBS and apps
staleness requires either:
staleness-tolerant data structures (timelines, logs)
cf. commutative data structures, logical monotonicity
or asynchronous compensation: detect violations after data is returned (see paper); write code to fix any errors
cf. “Building on Quicksand”: memories, guesses, apologies
minimize: (compensation cost) × (# of expected anomalies)
Read only newer data? (monotonic reads session guarantee)
tolerable staleness (# versions, for a given key) = global write rate / client’s read rate
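Under the assumption above (writes arriving at a global rate while one client reads at its own rate), the tolerable version staleness is just a ratio; this sketch and the name `tolerable_version_staleness` are illustrative:

```python
def tolerable_version_staleness(global_write_rate, client_read_rate):
    """Versions written to a key between two successive reads by one
    client; a read this stale can still look monotonic to that client."""
    return global_write_rate / client_read_rate

# e.g. 100 writes/s to the key, client reads 10 times/s
print(tolerable_version_staleness(100.0, 10.0))  # 10.0
```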
Failure? Treat failures as latency spikes.
How long do partitions last? over what time interval?
99.9% uptime/yr ⇒ 8.76 hours downtime/yr
8.76 consecutive hours down ⇒ a bad 8-hour rolling average
hide in tail of distribution OR continuously evaluate SLA and adjust
[Figure: CDFs of read latency (R=3) and write latency (W=1, 2) under LNKD-SSD, LNKD-DISK, YMMR, WAN; LNKD-SSD and LNKD-DISK identical for reads; N=3]
<k,t>-staleness: versions and time
approximation: exponentiate t-staleness by k
(synthetic, exponential distributions; W = 1/4× ARS and W = 10× ARS; N=3, W=1, R=1)
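The “exponentiate t-staleness by k” approximation can be sketched directly: assuming independence across the k most recent versions, a read misses all k of them with the single-version staleness probability raised to the k. The function name here is illustrative:

```python
def p_consistent_kt(p_consistent_t, k):
    """<k,t>-consistency under the independence approximation:
    a read is <k,t>-stale only if it misses each of the k most
    recent versions, i.e. with probability (1 - p_t) ** k."""
    p_stale_t = 1.0 - p_consistent_t
    return 1.0 - p_stale_t ** k

# tolerating k=3 versions turns 90% t-consistency into ~99.9%
print(p_consistent_kt(0.9, 3))  # -> approximately 0.999
```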
concurrent writes: deterministically choose between versions, e.g. (“key”, 1) vs. (“key”, 2)
(Coordinator, R=2, N=3 replicas)
Read example, R=3, N=3: client sends read(“key”) to the coordinator; the coordinator sends the read to replicas R1, R2, R3, each storing (“key”, 1); all three respond, and the coordinator returns (“key”, 1) to the client.
Read example, R=1, N=3 (send read to all): the coordinator returns after the first response, (“key”, 1).
Write then read, W=1, R=1, N=3: the coordinator writes (“key”, 2) to R1, R2, R3 and acks the client after the first replica ack; a read(“key”) issued before the write reaches the remaining replicas can return the stale (“key”, 1); eventually all replicas hold (“key”, 2).