
RethinkDB 2.1.5
2016-01-04

In this Jepsen report, we’ll verify RethinkDB’s support for linearizable operations using majority reads and writes, and explore assorted read and write anomalies when consistency levels are relaxed. This work was funded by RethinkDB, and conducted in accordance with the Jepsen ethics policy.

1 Background

RethinkDB is an open-source, horizontally scalable document store. Similar to MongoDB, documents are hierarchical, dynamically typed, schemaless objects. Each document is uniquely identified by an id key within a table, which in turn is scoped to a DB. On top of this key-value structure, a composable query language allows users to operate on data within documents, or across multiple documents—performing joins, aggregations, etc. However, only operations on a single document are atomic—queries which access multiple keys may read and write inconsistent data.

RethinkDB shards data across nodes by primary key, maintaining replicas of each key across n nodes for redundancy. For each shard, a single replica is designated a primary, which serializes all updates (and strong reads) to that shard’s documents—allowing linearizable writes, updates, and reads against a single key.

If a primary dies, we’re in a bit of a pickle. Rethink can still offer stale reads from any remaining replica of that shard, but in order to make any changes, or perform a linearizable read, we need a new primary—and one which is guaranteed to have the most recent committed state from the previous primary. Prior to version 2.1, RethinkDB would not automatically promote a new primary—an operator would have to ensure the old primary was truly down, remove it from the cluster, and promote a remaining replica to a primary by hand. Since every node in a cluster is typically a primary for some shards, the loss of any single node would lead to the unavailability of ~1/n of the keyspace.

RethinkDB 2.1, released earlier this year, introduced automatic promotion of primaries using the Raft consensus algorithm. Every node hosting a shard in a given table maintains a Raft ensemble, which stores table membership, shard metadata, primary roles, and so on. This allows RethinkDB to automatically promote a new primary replica for a shard so long as:

1. A majority of the table’s nodes are fully connected to one another, and

2. A majority of the shard’s replicas are available to the table’s majority component

So: if we have at least three nodes, and at least three replicas per shard, a RethinkDB table should seamlessly tolerate the failure or isolation of a single node. Five replicas, and the cluster will tolerate two failures, and so on. This majority-quorum strategy is common for linearizable systems like etcd, Zookeeper, or Riak’s strong buckets—and places Rethink in similar availability territory to MongoDB, Galera Cluster, et al.

RethinkDB also supports non-voting replicas, which asynchronously follow the normal replicas’ state, are not eligible for automatic promotion, and don’t take part in the Raft ensemble. These replicas do not provide the usual Rethink consistency guarantees, and are best suited to geographic redundancy for disaster recovery, or for read-heavy workloads where consistency isn’t important. In this analysis, every replica is a voting replica.

2 Consistency guarantees

Like MongoDB, Riak, Cassandra, and other KV stores, RethinkDB does not offer atomic multi-key operations. In this analysis, we’ll concern ourselves strictly with single-key consistency.

RethinkDB chooses strong defaults for update consistency, and weak defaults for reads. By default, updates (inserts, writes, modifications, deletes, etc) to a key are linearizable, which means they appear to take place atomically at some point in time between the client’s request and the server’s acknowledgement. For reads, the default behavior is to allow any primary to service a request using its in-memory state, which could allow stale or dirty reads.

Like Postgres, RethinkDB by default will not acknowledge writes until they’re fsynced to disk. Users may obtain better performance at the cost of crash safety by relaxing the table’s or request’s durability setting from hard to soft. As with all databases, the filesystem, operating system, device drivers, and hardware must cooperate for fsync to provide crash safety. In this analysis, we use hard durability and do not explore crash safety.
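For example, a single request can relax durability by passing the documented durability optarg. Here is a hedged sketch that mirrors the insert call used later in this report; db, tbl, id, value, and conn stand in for bindings from the surrounding test:

;; Hedged sketch: a soft-durability insert. The {"durability" "soft"} optarg is
;; documented by RethinkDB; db, tbl, id, value, and conn come from context.
;; This analysis itself always uses hard durability.
(r/run (r/insert (r/table (r/db db) tbl)
                 {:id id, :val value}
                 {"durability" "soft"})
       conn)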

Some databases let you tune write safety on a per-transaction basis, which can lead to confusing semantics when weakly-isolated transactions interleave with stronger ones. RethinkDB, like Riak, enforces write safety at the table level: all updates to a table’s keys use the same transactional isolation. A table’s write_acks can be either:

• single: A primary can acknowledge a write to a client without the acknowledgement of other replicas, or

• majority: A majority of replicas must acknowledge a write first
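Write acks (and durability) live in the table’s configuration. A hedged sketch of setting them through RethinkDB’s table_config system table, assuming the field names from RethinkDB’s documentation and an open connection conn:

;; Hedged sketch: set table-level write safety via the table_config system
;; table. Field names follow RethinkDB's docs; this update applies to every
;; table, so a real test would filter by table name first.
(r/run (r/update (r/table (r/db "rethinkdb") "table_config")
                 {:write_acks "majority"   ; or "single"
                  :durability "hard"})     ; or "soft"
       conn)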

The difference is in request latency; operations must be fully replicated at some point, so both modes have the same throughput, and writes always go to a primary, so a majority quorum must be present for write availability. This differentiates RethinkDB from AP databases like Cassandra, Riak, and Aerospike, which offer total write availability at the cost of linearizability, sequential consistency, etc.

Unlike write safety, read safety is controllable on a per-request basis. This makes sense: reads never impact the correctness of other reads, and relaxed consistency is often preferable for large read-only queries which can tolerate some fuzziness—e.g. analytics. RethinkDB offers three read_mode flavors:

• outdated: The local in-memory state of any replica

• single: The local in-memory state of any replica which thinks it’s a primary

• majority: Values safely committed to disk on a majority of replicas

Outdated and single are, I believe, equivalent in terms of safety guarantees, though outdated will likely exhibit read anomalies constantly, where single should only show dirty or stale reads during failures. Outdated can improve availability, latency and throughput, because all replicas can serve reads, not just primaries. Majority is much stronger, offering linearizable reads, but at the cost of a special sync request to every replica, to which a majority must respond before the read can be returned to the client. The sync requests could be omitted, piggybacking leader state on existing write traffic, but any linearizable op must still incur an additional round-trip’s worth of latency to ensure consensus.
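Since read_mode is chosen per request, our client bakes it into the table term it queries. A hedged sketch of what that might look like, using the raw term constructor that appears in the read snippet later in this report; the :TABLE term type and the :read-mode optarg key are assumptions about clj-rethinkdb’s query builder (the official drivers pass read_mode as an optional argument to table):

;; Hedged sketch: build a table term carrying a per-request read mode.
;; read-mode is one of "outdated", "single", or "majority"; the term type and
;; optarg key here are assumptions, not confirmed clj-rethinkdb API.
(defn table-with-read-mode
  [db-name tbl-name read-mode]
  (term :TABLE [(r/db db-name) tbl-name] {:read-mode read-mode}))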

So, how do these features interact? My interpretation, after conferring with the Rethink team, is:

              w=single                                  w=majority
r=outdated    Lost updates, dirty reads, stale reads    Dirty reads, stale reads
r=single      Lost updates, dirty reads, stale reads    Dirty reads, stale reads
r=majority    Lost updates, stale reads                 Linearizable

Rethink’s documentation claims that majority/majority guarantees linearizability. The other cases are a little trickier.

If writes are relaxed to single, we could see a situation where an isolated primary accepts a write which is later lost, because it has not been acknowledged by a majority of nodes. Subsequently, a disjoint majority of replicas can service reads of an earlier value: a stale read. A read against that isolated primary could not, however, succeed with r=majority, likely preventing dirty reads.

If writes are performed at majority, we prevent lost updates—because any write must be acknowledged by a majority of nodes, and therefore be present on any subsequently elected primary. Stale reads are still possible at r=single, because an isolated primary (or any node, for r=outdated) could serve read requests, while a newer primary accepts writes. We could also encounter dirty reads: a single or outdated read could see a majority write while it’s being propagated to other replicas, but before those replicas have responded. If the primary does not receive acknowledgement from a majority of replicas, and a new primary is elected without that write, the write would have failed—yet still have been visible to a client.

Finally, single writes and single reads should allow all anomalies we discussed above: lost updates from the lack of majority writes, plus dirty and stale reads.

So, the question becomes: does RethinkDB truly offer linearizable operations at majority/majority? Are writes still safe with single reads? Are these anomalies purely theoretical, or observable in practice? Let’s write a test to find out.

2.1 Setup

Installation for Rethink is fairly straightforward. We simply add their debian repository, install the rethinkdb package, and set up a log file. To prevent RethinkDB from exploiting synchronized clocks, we’ll use a libfaketime shim to skew clocks and run time at a different rate for each process.

We construct a configuration file based on the stock config. We want to create as many leadership transitions as possible in a short time, so we lower the heartbeat timeout to two seconds once the cluster is up and running.
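One way to adjust this at runtime is through RethinkDB’s admin API. A hedged sketch, assuming the cluster_config system table and its heartbeat_timeout_secs field (names from RethinkDB’s documentation) and an open connection conn:

;; Hedged sketch: lower the Raft heartbeat timeout to two seconds via the
;; cluster_config system table, once the cluster is up.
(r/run (r/update (r/get (r/table (r/db "rethinkdb") "cluster_config")
                        "heartbeat")
                 {:heartbeat_timeout_secs 2})
       conn)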

We tell Jepsen how to install, configure, and start each node, how to clear the logs and data files between runs, and what logfiles to snarf at the end of each test. After starting the DB, we spin awaiting a connection to each node. Jepsen ensures that every node reaches this point before beginning the test.

2.2 Operations

With the database installed, we turn to test semantics. Clients in Jepsen take invocation operations, apply those ops to the system under test, and return corresponding completion operations. Our operations will consist of writes, reads, and compare-and-sets (“cas”, for short). We define generator functions that construct these operations over small integers.

(defn w   [_ _] {:type :invoke, :f :write, :value (rand-int 5)})
(defn r   [_ _] {:type :invoke, :f :read})
(defn cas [_ _] {:type :invoke, :f :cas, :value [(rand-int 5) (rand-int 5)]})

For instance, we might generate a write like {:type :invoke, :f :write, :value 2}, or a compare-and-set like {:type :invoke, :f :cas, :value [2 4]}, which means “set the value to 4 if, and only if, the value is currently 2.”

In prior Jepsen analyses, we’d operate on a single key throughout the entire test, which is simple, but comes at a cost. As the test proceeds and Bad Things(TM) happen to the database, more and more processes will time out or crash. We cannot tell whether a crashed process’s operations will take place now, or in five years, which means as the test goes on, the number of concurrent operations gradually rises. The number of orders for those operations rises exponentially: at every juncture, we must take all permutations of every possible subset of pending ops. Any more than a handful of crashed processes, and linearizability verification can take years.

This places practical limits on the duration and request rate of a linearizability test, which makes it harder to detect anomalies. We need a better strategy.

I redesigned the Knossos linearizability checker based on the linear algorithm described by Gavin Lowe. Extensive profiling and optimization work led to significant speedups for pathological histories that the original Knossos algorithm choked on. This implementation performs the just-in-time linearization partial order reduction proposed by Lowe, with additional optimizations: we precompute the entire state space for the model, which allows us to explore configurations without actually calling the model transition code, or allocating new objects. Because we re-use the same model objects, we can precompute their hashcodes and use reference equality for comparisons—which removes the need for Lowe’s union-find optimization. Most importantly, determinism allows us to skip the exploration of equivalent configurations, which dramatically prunes the search space.
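To illustrate the precomputation idea (a toy sketch, not Knossos itself): for a compare-and-set register over the values 0-4, every legal transition can be tabulated up front, so the checker can follow transitions with a map lookup instead of calling model code or allocating new state objects.

;; Toy sketch of precomputing a register's state space: map [state op] to the
;; resulting state, or nil when the op is illegal in that state.
(def values (vec (range 5)))

(def transitions
  (into {}
        (for [state values
              op    (concat (map (fn [v] {:f :write, :value v}) values)
                            (map (fn [v] {:f :read,  :value v}) values)
                            (for [a values, b values] {:f :cas, :value [a b]}))]
          [[state op]
           (case (:f op)
             :write (:value op)
             :read  (when (= state (:value op)) state)
             :cas   (let [[a b] (:value op)]
                      (when (= state a) b)))])))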

Unfortunately, these optimizations were not enough: tests of longer than ~100 seconds would bring the checker to its knees. Peter Alvaro suggested a key insight: we may not need to analyze a single register over the lifespan of the whole test. If linearizability violations occur on short timescales, we can operate over several distinct keys and analyze each one independently. Each key’s history is short enough to analyze, while the test as a whole encompasses tens to hundreds of times more operations—each one a chance for the system to fail.

A new namespace, jepsen.independent, supports these kinds of analyses. We can lift operations on a single key, like {:f :write, :value 3}, to operations on [key value] tuples; e.g., to write 3 to key 1, we’d invoke {:f :write, :value [1 3]}. We use sequential-generator to construct operations over integer keys and, for each key, emit a mix of reads, writes, and compare-and-set operations, one per second, for sixty seconds. Then sequential-generator begins anew with the next key.

:concurrency 10
:generator (std-gen (independent/sequential-generator
                      (range)
                      (fn [k]
                        (->> (gen/reserve 5 (gen/mix [w cas]) r)
                             (gen/delay 1)
                             (gen/limit 60)))))

gen/reserve assigns five of our ten processes to (gen/mix [w cas]): a random mixture of write and compare-and-set operations. The remaining processes perform reads.

reserve is critical for identifying dirty and stale reads. When we partition the network, updates can stall for 5-10 seconds because a majority is no longer available. If we don’t reserve specific processes for reads, every process would (at some point) attempt a write, block on leader election, and for a brief window no operations would take place—effectively blinding the test to consistency violations. By dedicating some processes to reads, we can detect transient read anomalies through these transitions.
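At the end of the test, the per-key histories are checked separately. A hedged sketch of the wiring, assuming the jepsen.independent namespace described above and Knossos’ compare-and-set register model; these names come from this era of Jepsen and may have shifted since:

;; Hedged sketch: check each key's sub-history independently against a
;; compare-and-set register model.
:model   (model/cas-register)
:checker (independent/checker checker/linearizable)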

2.3 Writing a client

With our operations constructed, we need a client which takes these operations and performs them against a RethinkDB cluster. At startup, one client creates a fresh table for the test, with five replicas—one for each node. We tell Rethink to use a particular write-acks level for the table, and establish a replica on each node. Once the table is configured, we wait for our changes to propagate.
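A hedged sketch of that one-time setup; the database and table names are illustrative, and the replicas optarg follows RethinkDB’s table-creation API, though clj-rethinkdb’s exact signature may differ:

;; Hedged sketch: create the test table with a replica on each of five nodes.
(r/run (r/table-create (r/db "jepsen") "cas" {:replicas 5}) conn)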

In response to operations, we extract the [key, value] tuple from the operation, and construct a Rethink query fragment for fetching that key, using the specified read-mode. Then we dispatch based on the type of the operation. Reads simply run a get query, extracting a single field val from the document. We return the read value as the :value for the completion op.

:read (assoc op
             :type :ok
             :value (independent/tuple
                      id
                      (r/run (term :DEFAULT
                                   [(r/get-field row "val") nil])
                             (:conn this))))

Writes behave similarly: we perform an upsert using conflict: update, setting the val field to the write op’s value.

:write (do (run! (r/insert (r/table (r/db db) tbl)
                           {:id id, :val value}
                           {"conflict" "update"})
                 (:conn this))
           (assoc op :type :ok))


Compare and set takes advantage of Rethink’s functional API: we use update against the row query fragment, providing a function of the current row which dispatches based on the current value of val. If val is equal to the compare-and-set predicate value, we update the document to value'. Otherwise, we abort the update. RethinkDB returns a map with the number of errors and replaced rows, which we use to determine whether the compare-and-set succeeded.

:cas (let [[value value'] value
           res (r/run
                 (r/update
                   row
                   (r/fn [row]
                     (r/branch
                       (r/eq (r/get-field row "val") value) ; predicate
                       {:val value'}                        ; true branch
                       (r/error "abort"))))                 ; false branch
                 (:conn this))]
       (assoc op :type (if (and (= (:errors res) 0)
                                (= (:replaced res) 1))
                         :ok
                         :fail)))

While this is more verbose than a native compare-and-set operator, Rethink’s compositional query language and first-class support for functions and control flow primitives allow for fine-grained control over single-document transformations. Like SQL’s stored procedures, shipping complex logic to the database can improve locality and cut round-trips. The API, AST, and serialization format are structurally similar to one another, which I prefer to the query-string mangling ubiquitous in SQL libraries—and it’s relatively easy to wrap up queries as parameterizable functions. However, the absence of multi-document concurrency control prevents Rethink’s query language from reaching its full potential: we cannot (in general) safely read a value from one document and use it to update another, or make two updates and guarantee their simultaneous visibility.

Happily, RethinkDB explicitly distinguishes between failed and indeterminate operations with a dedicated status code. We use a small macro to trap Rethink’s exceptions and construct an appropriate completion operation based on whether the request failed (:fail), or might have failed (:info). Since reads are pure, we can also consider all read errors outright failures, which lowers the number of crashed ops and reduces load on the linearizability checker.
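A minimal sketch of such a macro, assuming the client signals errors as exceptions; definite-failure? is a hypothetical predicate standing in for the client’s dedicated status code:

;; Hedged sketch: convert client exceptions into completion ops. Reads are
;; pure, so their errors are always definite failures; other ops only fail
;; definitively when the error says so, and otherwise crash with :info.
(defmacro with-errors
  [op & body]
  `(try ~@body
        (catch Exception e#
          (let [definite?# (or (= :read (:f ~op))
                               (definite-failure? e#))] ; hypothetical predicate
            (assoc ~op
                   :type  (if definite?# :fail :info)
                   :error (.getMessage e#))))))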

We must also be careful to check the return values for each write: RethinkDB throws exceptions for catastrophic cases, but to allow for partial failures, Rethink’s client will return maps with error counts instead of throwing for all errors.
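For example, a write helper might inspect the result map before declaring success. A small sketch, assuming the insert result carries the :errors, :replaced, and :first_error fields documented for RethinkDB write operations:

;; Hedged sketch: treat a nonzero error count as a definite failure.
(let [res (r/run (r/insert (r/table (r/db db) tbl)
                           {:id id, :val value}
                           {"conflict" "update"})
                 (:conn this))]
  (if (pos? (:errors res 0))
    (assoc op :type :fail, :error (:first_error res))
    (assoc op :type :ok)))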

2.4 Availability

We’ll run these tests for about 500 seconds, while cutting the network in half every ten seconds. Experimentation with partially isolated topologies, SIGSTOP/SIGCONT, isolating primaries only, etc. has so far yielded equivalent results to a simple majority/minority split. Shorter timescales can confuse RethinkDB for longer periods, as it struggles through multiple rounds of leader timeout and election, but it reliably recovers given a stable configuration for ~heartbeat_timeout + 10 seconds, and often faster.
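A hedged sketch of that failure schedule, using Jepsen’s partition-random-halves nemesis and the generator combinators from this era of Jepsen; the timings are illustrative:

;; Hedged sketch: partition the network into random halves on a ten-second
;; cadence for the duration of the test.
:nemesis   (nemesis/partition-random-halves)
:generator (gen/nemesis
             (gen/seq (cycle [(gen/sleep 10)
                              {:type :info, :f :start}
                              (gen/sleep 10)
                              {:type :info, :f :stop}])))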


This plot shows the latency of operations over a full 500-second test: blue ops are successful, red definitely failed, and purple ones might have succeeded or might have failed. Grey regions show when the network was partitioned. We can see that each partition is accompanied by a brief spike in latency—around 2.5 seconds—before operations resume as normal.

On a shorter timescale, we can see the recovery behavior in finer detail. Rethink delivers a few transient failures or crashes for reads, writes, and compare-and-set operations just after a partition begins. After cluster reconfiguration, we see partial errors for all three classes of operations, until the partition resolves and the cluster heals. Once it heals, we return to a healthy pattern. Note that compare-and-set (cas) failures are expected in this workload: they only succeed if they guess the current value correctly.

Why do we see partial failure? Because even though reads and writes only require a single node to acknowledge, Rethink’s design mandates that those operations take place on a primary node—and even after reconfiguration, some nodes won’t be able to talk to a primary!


We see a similar pattern for majority writes and single reads: because the window for concurrent primaries is short, there’s not a significant difference in availability or latency.

However, a marked change occurs if we choose majority reads and single writes: read latencies, formerly ~2-3 times lower than writes, jump to the level of writes. This is because Rethink implements majority reads by issuing an empty write to ensure that the current primary is still legal. Once the old primary steps down, operations against the minority side fail or crash quickly.


The scale is slightly different for this graph of majority/majority, but the numbers are effectively the same. Majority writes don’t have a significant impact on availability, because even single writes have to talk to a primary—and in order to have a primary, you need a majority of nodes connected. There should be a difference in client latency, but this network is fast enough that other costs dominate.

3 Linearizability

So, Rethink fails over reliably within a few seconds of a partition, and offers high availability—though not total availability. We also suspect that during these transitions, Rethink could exhibit lost updates, dirty reads, and stale reads, depending on the write_acks and read_mode employed. Only majority/majority should ensure linearizability.

3.1 Single writes, single reads

Consider this fragment of a history just after the network partitions, using single reads and writes. Each line shows a process (e.g. 194) invoking and completing (:ok), crashing (:info), or failing (:fail) an operation. Here the key being acted on is 15, and we begin by reading the value 0.

9    :ok      :read   [15 0]
...
194  :invoke  :write  [15 3]
7    :fail    :read   [15 nil]    "Cannot perform read: lost contact with primary replica"
292  :info    :cas    [15 [0 1]]  "Cannot perform write: lost contact with primary replica"
194  :ok      :write  [15 3]
141  :info    :write  [15 3]      "Cannot perform write: lost contact with primary replica"
6    :fail    :read   [15 nil]    "Cannot perform read: lost contact with primary replica"
8    :fail    :read   [15 nil]    "Cannot perform read: lost contact with primary replica"
373  :info    :cas    [15 [1 0]]  "Cannot perform write: lost contact with primary replica"
5    :invoke  :read   [15 nil]
5    :ok      :read   [15 3]
170  :invoke  :cas    [15 [1 4]]
170  :info    :cas    [15 [1 4]]  "Cannot perform write: The primary replica isn't connected to a quorum of replicas. The write was not performed."
9    :invoke  :read   [15 nil]
9    :fail    :read   [15 nil]    "Cannot perform read: The primary replica isn't connected to a quorum of replicas. The read was not performed, you can do an outdated read using `read_mode=\"outdated\"`."
7    :invoke  :read   [15 nil]
302  :invoke  :cas    [15 [0 0]]
7    :ok      :read   [15 0]
194  :invoke  :cas    [15 [3 0]]
302  :fail    :cas    [15 [0 0]]
194  :info    :cas    [15 [3 0]]  "Cannot perform write: The primary replica isn't connected to a quorum of replicas. The write was not performed."
151  :invoke  :cas    [15 [1 0]]
6    :invoke  :read   [15 nil]
6    :ok      :read   [15 0]

Did you catch that? I sure didn’t, but luckily computers are good at finding these sorts of errors. Process 194 writes 3, which is read by process 5, and then, for no apparent reason, process 7 reads 0—a value from earlier in the test. Knossos spots this anomaly, and tells us that somewhere between line 73 (read 3) and line 80 (read 0), it was impossible to find a legal linearization. It also knows exactly what crashed operations were pending at the time of that illegal read:

:failures
{15
 {:valid? false,
  :configs ({:model {:value 3},
             :pending [{:type :invoke, :f :read, :value 0,     :process 7,   :index 78}
                       {:type :invoke, :f :cas,  :value [1 4], :process 170, :index 74}
                       {:type :invoke, :f :cas,  :value [1 0], :process 373, :index 48}]}
            ... a few dozen other configurations)
  :previous-ok {:type :ok, :f :read, :value 3, :process 5, :index 73},
  :op          {:type :ok, :f :read, :value 0, :process 7, :index 80}}}

This is still hard to reason about, so I’ve written a renderer to show the failure visually. Time flows from left to right, and each horizontal track shows the activity of a single process. Horizontal bars show the beginning and completion of an operation, and the color of a bar shows whether that operation was successful (green) or crashed (yellow).

If the history is linearizable, we should be able to draw some path which moves strictly to the right, touching every green operation—and possibly, crashed operations, since they may have happened. Legal paths are shown in black, and the resulting states are drawn as vertical lines in that operation. When a transition would be illegal, its path and state are shown in red. Hovering over any part of a path highlights every path that touches that component. For clarity, we’ve collapsed most of the equivalent transitions, so you’ll see multiple paths highlighted at once.


Hovering over the top line tells us that we can’t simply read 3 then read 0: there has to be a write in between in order for the state to change. We have a few crashed operations from earlier in the history that might take effect during this time—a write of 3 by process 141, and three compare-and-set operations. The write could go through, as shown by the black line from read 3 to write 3, but that doesn’t help us reach 0. No linearization exists. Given that subsequent reads see 0, and multiple processes attempted to write 3 just prior, I suspect the read of 3 is either a lost update or a stale read: both expected behaviors for single/single mode.

3.2 Single writes, majority reads

When we use single writes and majority reads, it’s still possible to see lost updates.

{:valid? false,
 :previous-ok {:type :ok, :f :write, :value 1, :process 580, :index 181},
 :op          {:type :ok, :f :read,  :value 3, :process 8,   :index 200},
 :configs
 ({:model {:value 1},
   :pending [{:type :invoke, :f :read, :value 3,     :process 8,   :index 199}
             {:type :invoke, :f :cas,  :value [3 4], :process 580, :index 196}
             ... eighty zillion lines ...]})}


In this history, process 580 successfully writes 1, and two subsequent operations complete which require the value to be 3. None of the crashed operations allow us to reach a state of 3; it’s as if the write of 1 never happened. This is a lost write, which occurs when a primary acknowledges a write before having fully replicated it, and a new primary comes to power without having seen the write. To prevent this behavior, we can write with majority.

3.3 Majority writes, single reads

As expected, we can also find linearization anomalies with majority writes and single reads. Before a partition begins, this register is 0. We have a few crashed writes of 1, which are visible to processes 5 and 8—then process 6 reads 0 again. Eliding some operations:

8    :ok      :read   [10 0]
...
133  :invoke  :write  [10 1]
131  :invoke  :write  [10 1]
:nemesis :info :start "Cut off {:n4 #{:n3 :n2 :n5}, :n1 #{:n3 :n2 :n5}, :n3 #{:n4 :n1}, :n2 #{:n4 :n1}, :n5 #{:n4 :n1}}"
5    :invoke  :read   [10 nil]
5    :ok      :read   [10 1]
9    :invoke  :read   [10 nil]
8    :invoke  :read   [10 nil]
8    :ok      :read   [10 1]
5    :invoke  :read   [10 nil]
5    :ok      :read   [10 1]
8    :invoke  :read   [10 nil]
8    :ok      :read   [10 1]
...
5    :invoke  :read   [10 nil]
5    :ok      :read   [10 1]
...
6    :invoke  :read   [10 nil]
141  :invoke  :cas    [10 [4 0]]
6    :ok      :read   [10 0]


It’s difficult to tell exactly what happened in these kinds of histories, but I suspect process 131 and/or 133 attempted a write of 1, which became visible on a primary just before that primary was isolated by a partition. Those writes crashed because their acknowledgements never arrived, but until the old primary steps down, other processes on that side of the partition can continue to read that uncommitted state: a dirty read. Once the new primary steps up, it lacks the write of 1 and continues with the prior state of 0.

3.4 Majority writes, majority reads

I’ve run hundreds of tests against RethinkDB at majority/majority, at various timescales, request rates, concurrencies, and with different types of failures. Consistent with the documentation, I have never found a linearization failure with these settings. If you use hard durability, majority writes, and majority reads, single-document ops in RethinkDB appear safe.

3.5 Discussion

As far as I can ascertain, RethinkDB’s safety claims are accurate. You can lose updates if you write with anything less than majority, and see assorted read anomalies with single or outdated reads, but majority/majority appears linearizable.

Rethink’s defaults prevent lost updates (offering linearizable writes, compare-and-set, etc), but do allow dirty and stale reads. In many cases this is a fine tradeoff to make, and significantly improves read latency. On the other hand, dirty and stale reads create the potential for lost updates in non-transactional read-modify-write cycles. If one, say, renders a web page for a user based on dirty reads, the user could take action based on that invalid view of the world, and cause invalid data to be written back to the database. Similarly, programs which hand off state to one another through RethinkDB could lose or corrupt state by allowing stale reads. Beware of sidechannels.

Where these anomalies matter, RethinkDB users should use majority reads. There is no significant availability impact to choosing majority reads, though latencies rise significantly. Conversely, if read availability, latency, or throughput are paramount, you can use outdated reads with essentially the same safety guarantees as single—though you’ll likely see continuous, rather than occasional, read anomalies.

Rethink’s safety documentation is generally of high quality—the only thing I’d add is a description of the allowable read anomalies for weaker consistency settings. Most of my feedback for the RethinkDB team is minor: I’d prefer standardizing on either thrown or checked errors for all types of operations, instead of a mix, to prevent cases where users assume exceptions and forget to check results. Error codes for indeterminate vs definite failures are new in the Clojure client, and require knowing some magic constants, but I appreciate their presence nonetheless.

I’ve hesitated to recommend RethinkDB in the past because prior to 2.1, an operator had to intervene to handle network or node failures. However, 2.1’s automatic failover converges reasonably quickly, and its claimed safety invariants appear to hold under partitions. I’m comfortable recommending it for users seeking a schema-less document store where inter-document consistency isn’t required. Users might also consider MongoDB, which recently introduced options for stronger read consistency similar to Rethink—the two also offer similar availability properties and data models. For inter-key consistency, a configuration store like Zookeeper, or a synchronously replicated SQL database like Postgres might make more sense. Where availability is paramount, users might consider an AP document or KV store like Couch, Riak, or Cassandra.

I’ve quite enjoyed working with the Rethink team on this analysis—they’d outlined possible failure scenarios in advance, helped me refine tests that weren’t aggressive enough, and were generally eager to see their system put to the test. This sort of research wanders through all kinds of false starts and dead ends, but Rethink’s engineers were always patient and supportive of my experimentation.

You can reproduce these results by setting up your own Jepsen cluster, checking out the Jepsen repo at 6cf557a, and running lein test in the rethink/ directory. Jepsen will run tests for all four read/write combinations and spit out analyses in store/. This test relies on unreleased features from clj-rethinkdb 0.12.0-SNAPSHOT, which you can build locally by cloning their repo and running lein install.

This work was funded by RethinkDB, and conducted in accordance with the Jepsen ethics policy. I am indebted to Caitie McCaffrey, Coda Hale, Camille Fournier, and Peter Bailis for their review comments. My thanks as well to the RethinkDB team, especially Daniel Mewes, Tim Maxwell, Jeroen Habraken, Michael Lucy, and Slava Akhmechet.
