
Building Consistent Transactions with Inconsistent Replication
UW Technical Report UW-CSE-14-12-01

Irene Zhang, Naveen Kr. Sharma, Adriana Szekeres, Arvind Krishnamurthy, Dan R. K. Ports

University of Washington
{iyzhang, naveenks, aaasz, arvind, drkp}@cs.washington.edu

Abstract

Application programmers increasingly prefer distributed storage systems with distributed transactions and strong consistency (e.g., Google’s Spanner) for their strong guarantees and ease of use. Unfortunately, existing transactional storage systems are expensive to use because they rely on expensive replication protocols like Paxos for fault-tolerance. In this paper, we take a new approach to make transactional storage systems more affordable; we eliminate consistency from the replication protocol, while still providing distributed transactions with strong consistency to applications.

This paper presents TAPIR – the Transactional Application Protocol for Inconsistent Replication – the first transaction protocol to use a replication protocol, inconsistent replication, that provides fault-tolerance with no consistency. By enforcing strong consistency only in the transaction protocol, TAPIR is able to commit transactions in a single round-trip and schedule distributed transactions with no centralized coordination. We demonstrate the use of TAPIR in TAPIR-KV, a key-value store that provides high-performance transactional storage. Compared to systems using conventional transaction protocols that require replication with strong consistency, TAPIR-KV has 2× better latency and throughput.

1. Introduction

Distributed storage systems provide fault-tolerance and availability for large-scale web applications like Google and Amazon. Increasingly, application programmers prefer systems that support distributed transactions with strong consistency¹ to help them manage application complexity and concurrency in a distributed environment. Several recent systems [3, 7, 15, 19] reflect this trend, most notably Google’s Spanner system [8], which supports linearizable transactions.

¹ Strong consistency for transactions is sometimes referred to as linearizable isolation or external consistency. We use the terms interchangeably in this paper.

For application programmers, distributed transactional storage with strong consistency comes at a price. These systems commonly use replication for fault-tolerance; unfortunately, replication protocols with strong consistency, like Paxos [21], impose a high performance cost. They require cross-replica coordination on every operation, increasing the latency of the system, and typically need a designated leader, reducing the throughput of the system.

A number of transactional storage systems have addressed some of these performance limitations. Some systems reduce throughput limitations [15, 33] or wide-area latency [19, 30], while others tackle latency and throughput for read-only transactions [8], commutative transactions [19] or independent transactions [9]. Still others have sought to provide better performance at the cost of a limited transaction model [2, 44] or a weaker consistency model [29, 40]. However, none of these systems have been able to improve latency and throughput for general-purpose, replicated read-write transactions with strong consistency.

In this paper, we use a new approach to reduce the cost of replicated, read-write transactions and make transactional storage more affordable for application programmers. Our key insight is that existing transactional storage systems waste work and performance by integrating a distributed transaction protocol and a replication protocol that both enforce strong consistency. Instead, we show that it is possible to provide distributed transactions with better performance and the same transaction and consistency model using replication with no consistency.

To demonstrate our approach, we designed TAPIR – the Transactional Application Protocol for Inconsistent Replication. TAPIR uses a new replication technique, called inconsistent replication (IR), which provides fault-tolerance without consistency. IR requires no synchronous cross-replica coordination or designated leaders. Using IR, TAPIR is able to commit a distributed read-write transaction in a single round-trip and schedule transactions across partitions and replicas with no centralized coordination.

Conventional transaction protocols cannot be easily combined with inconsistent replication. IR only guarantees that successful operations execute at a majority of replicas; however, replicas can execute operations in different orders and can be missing any operation at any time. To support IR’s weak consistency model, TAPIR integrates several novel techniques:

• Loosely synchronized clocks for optimistic transaction ordering at clients. TAPIR clients schedule transactions by proposing a transaction ordering, which is then accepted or rejected by TAPIR’s replicated storage servers.

• New use of optimistic concurrency control to detect conflicts with only a partial transaction history. TAPIR servers use quorum intersection with optimistic concurrency control to ensure that any transactions that violate the linearizable transaction ordering are detected and aborted by at least one server.

• Multi-versioning for executing transactions out-of-order. TAPIR servers use multi-versioned storage to install updates out-of-order and still converge on a single, consistent storage state.

We implement TAPIR in a new distributed transactional key-value store called TAPIR-KV, which supports linearizable ACID transactions over a partitioned set of keys. We compare TAPIR-KV’s protocol to several conventional transactional storage systems, including one based on the Spanner system, and a non-transactional key-value store with eventual consistency. We find that TAPIR-KV reduces commit latency by 50% and increases throughput by more than 2× compared to standard systems and achieves performance close to that of the system with eventual consistency.

In this paper, we make the following significant contributions to our understanding of designing distributed, replicated transaction systems:

• We define inconsistent replication, a new replication technique that provides fault-tolerance without consistency. (Section 3)

• We design TAPIR, a new distributed transaction protocol that provides transactions with strong consistency using replication with no consistency. (Sections 4–6)

• We design and evaluate TAPIR-KV, a key-value store that uses inconsistent replication and TAPIR to provide high-performance transactional storage. (Section 7)

Figure 1: Architecture of a distributed transactional storage system using replication. Distributed transaction protocols coordinate transactions across partitioned data, or shards. They typically consist of two components: (1) an atomic commitment protocol like two-phase commit to coordinate distributed transactions across shards and (2) a concurrency control mechanism like strict two-phase locking to schedule transactions within each shard. A replication protocol with strong consistency like Paxos is responsible for keeping replicas in each shard synchronized.

2. Background

Replication protocols have become an important component in distributed storage systems. While classic storage systems use local disk for durability, modern storage systems commonly incorporate replication for better fault-tolerance and availability. Some new systems [35, 41] replace on-disk storage altogether with in-memory replication.

Replicated, transactional storage systems must implement both a distributed transaction protocol and a replication protocol. Liskov [27] and Gray [17] described one of the earliest architectures for these systems, which is still used today by many systems [3, 8, 15, 19, 27, 36]. This architecture places the distributed transaction protocol on top of the replication protocol, as shown in Figure 1. Other alternatives have been proposed as well (e.g., Replicated Commit [30]).

Conventional transaction protocols [4] assume the availability of an ordered, fault-tolerant log. This ordered log abstraction is easy to provide with disk; in fact, sequential writes are more efficient for disk-based storage than random writes. Replicating this log abstraction is more complicated and expensive. To enforce a serial ordering of operations to the replicated log, transactional storage systems must use a replication protocol with strong consistency, like Paxos [21], Viewstamped Replication (VR) [34] or virtual synchrony [5].


Replication protocols with strong consistency present a fundamental performance limitation for storage systems. To enforce a strict serial ordering of operations across replicas, they require cross-replica coordination on every operation. To avoid additional coordination when replicas disagree on operation ordering, most implementations use a designated leader to decide operation ordering and coordinate each operation. This leader presents a throughput bottleneck, while cross-replica coordination adds latency.

Combining these replication protocols with a distributed transaction protocol results in a large amount of complex distributed coordination for transactional storage systems. Figure 2 shows the large number of messages needed to coordinate a single read-write transaction in a system like Spanner. Significant work [23, 25, 31] has been done to reduce the cost of these replication protocols; however, much of it centers on commutative operations [6, 22, 32] and does not apply to general-purpose transaction protocols.

Maintaining the ordered log abstraction means that replicated transactional storage systems use expensive distributed coordination to enforce strict serial ordering in two places: the transaction protocol enforces a serial ordering of transactions across data partitions or shards, and the replication protocol enforces a serial ordering of operations within a shard. TAPIR eliminates this redundancy and its associated performance cost. As a result, TAPIR is able to avoid the performance limitations of consistent replication protocols.

Conventional transaction protocols depend on the order of operations in the log to determine the ordering of transactions in the system, so they cannot easily support a replication protocol without strong consistency. TAPIR carefully avoids this constraint to support inconsistent replication. Section 4 discusses some of the challenges that we faced in designing TAPIR to enforce a strict serial ordering of transactions without a strict serial ordering of operations across replicas.

3. Inconsistent Replication

The main objective of TAPIR is to build a protocol for consistent, linearizable transactions on top of a replication layer with no consistency guarantees. To achieve this, we must carefully design the weak guarantees of the replication protocol to support a higher-level protocol with stronger guarantees. This section defines inconsistent replication, a new replication protocol with fault-tolerance, but no consistency guarantees.

Figure 2: Example read-write transaction using two-phase commit, Viewstamped Replication and strict two-phase locking. Each zone represents an availability region, which could be a cluster, datacenter or geographic region. Each shard holds a partition of the data stored in the system and is replicated across zones for fault-tolerance. For each read-write transaction, there is a large amount of distributed coordination. The transaction protocol must coordinate reads with the designated leader in each shard to acquire locks. To commit a transaction, the transaction protocol coordinates across shards and then the replication protocol must coordinate within each shard.

3.1 IR Overview

Inconsistent replication provides unordered, fault-tolerant operations using a state machine replication model. Applications invoke operations through IR for fault-tolerant execution across replicas, and replicas execute operations in the order they receive them. Operations might be the puts and gets of a replicated key-value storage system or the prepares and commits of a transactional system. Unlike typical state machine replication, inconsistent replication treats application operations as unordered. Each replica executes operations in a different order; thus, unless all operations are commutative, replicas will be in an inconsistent state.

Because IR replicas are inconsistent and execute operations in different orders, IR does not expect replicas to always return the same result. Executing an operation at enough replicas to provide fault-tolerance (e.g., f + 1 replicas to survive f simultaneous failures) does not ensure that the results from each replica will agree. Therefore, IR provides two types of guarantees: (1) fault-tolerance for an operation and (2) both consensus and fault-tolerance for an operation result. For example, IR must ensure consensus and fault-tolerance for TAPIR operations whose result determines transaction ordering.

We define two types of IR operations with these different guarantees. When applications invoke inconsistent operations, IR only ensures that the operation will not be lost for up to f simultaneous failures. When applications invoke consensus operations, IR preserves the operation and the result given by the majority of the replicas if enough replicas return the same result. While both operation types are fault-tolerant, consensus operations serve as the basic building block for satisfying applications’ consistency guarantees. TAPIR prepares are consensus operations, while commits and aborts are inconsistent operations.

IR provides these guarantees for up to f simultaneous server failures and any number of client failures. IR supports only fail-stop failures, not Byzantine faults or actively malicious servers. We assume an asynchronous network where messages can be lost or delivered out of order; however, we assume that messages sent repeatedly will be eventually delivered if the replica is up.

IR does not rely on synchronous disk writes; it ensures guarantees are maintained even if clients or replicas lose state on failure. This property allows IR to provide better performance, especially within a datacenter, compared to Paxos and its variants, which require synchronous disk writes and recovery from disk on failure. IR also provides better fault-tolerance because this property allows it to tolerate disk failures at replicas.

3.2 IR Protocol

Figure 3 summarizes the IR interfaces and state at clients and replicas. Applications invoke their operations with a Client Interface call. IR clients communicate with IR replicas to execute operations and collect their responses. IR uses 2f + 1 replicas to tolerate f simultaneous failures. Using a larger group size to tolerate fewer than f failures is possible, but does not make sense because it requires a larger quorum size.

Inconsistent operations (replicated but unordered) return every result from all replicas that execute the operation. Successful inconsistent operations have executed at enough replicas to ensure that they are fault-tolerant.

Client Interface:
  InvokeInconsistentOperation(op) → results
  InvokeConsensusOperation(op) → finalized, result

Client State:
  • client id - unique identifier for the client
  • operation counter - number of sent operations

Replica Upcalls:
  Execute(op) → result
  Recover(op)
  Recover(op, res)

Replica State:
  • state - denotes current replica state, which is either NORMAL (processing operations) or VIEW-CHANGING (participating in recovery)
  • record - unordered set of operations and their results

Figure 3: Summary of inconsistent replication interfaces and client/replica state.

Successful consensus operations have executed with the same result at a majority of replicas. A successful consensus operation is considered finalized when the client has received enough matching results to also ensure that the result of the operation is not lost (i.e., the operation was executed with the same outcome at a supermajority of replicas). Consensus operations may fail because the replicas do not return enough matching results, e.g., if there are conflicting concurrent operations.

Each client keeps an operation counter, which, combined with the client id, uniquely identifies operations to replicas. Each IR replica keeps an unordered record of executed operations.

IR uses four sub-protocols: operation processing, replica recovery, client recovery and group membership change. This section details the first two only; the last two are identical to the VR [28] protocol.

3.2.1 Operation Processing Protocol

To begin, we describe IR’s normal-case operation processing protocol without failures. Due to IR’s weak guarantees, replicas do not communicate on each operation. Instead, the IR client simply contacts IR replicas and collects their responses. The protocol follows:

1. The client sends 〈REQUEST, id, op〉 to all replicas, where id is the message id (a tuple of the client id and message counter) and op is the operation.

2. Each replica writes op and its result to its record, then responds 〈REPLY, id, res〉, where res is the result of the operation.

3. For inconsistent operations, once the client receives f + 1 responses from replicas, it returns all results to the application.

4. For consensus operations, once the client receives f + 1 responses from replicas with matching results, it returns the result to the application. Once the client receives ⌈3f/2⌉ + 1 matching responses, it finalizes the result.

IR clients retry inconsistent operations until they succeed and consensus operations until they finalize or time out. Replicas that receive operations they have already executed can safely ignore them. A successful operation requires at least a single round-trip to f + 1 replicas. However, IR needs ⌈3f/2⌉ + 1 matching responses to finalize a consensus operation; otherwise, IR cannot guarantee that the majority result will persist at a majority across failures. This quorum requirement is the same as in Fast Paxos [23] and related protocols because IR must be able to determine both that a consensus operation succeeded and what result the replicas agreed upon.

Consensus operations can time out, either because the group does not reach consensus or because more than ⌊f/2⌋ replicas are unreachable. TAPIR is designed to cope with this using a slow path that relies only on inconsistent operations.
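To make the quorum sizes concrete, the following sketch shows how an IR client might classify the replies collected so far for a consensus operation; the function name and data shapes are illustrative, not taken from the TAPIR implementation.

from collections import Counter

def classify_consensus_responses(responses, f):
    # responses: list of results returned so far by individual replicas
    # (the group has 2f + 1 replicas). Returns "finalized", "successful",
    # or "pending", following the quorum rules described above.
    if not responses:
        return "pending"
    result, count = Counter(responses).most_common(1)[0]
    fast_quorum = -(-3 * f // 2) + 1   # ceil(3f/2) + 1
    if count >= fast_quorum:
        return "finalized"             # result guaranteed to survive f failures
    if count >= f + 1:
        return "successful"            # operation survives; its result may not
    return "pending"                   # keep collecting or retrying

For example, with f = 2 (five replicas), four matching replies finalize the result, while three matching replies only make the operation successful.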

3.2.2 IR Recovery Protocol

IR does not rely on synchronous disk writes, so failed replicas may lose their records of previously executed operations. IR’s recovery protocol is carefully designed to ensure that a failed replica recovers any operation that it may have executed previously and that could still succeed.

Without this property, successful IR operations could be lost. For example, suppose an IR client receives a quorum of responses and reports success to the application. Then, each replica in the quorum fails in sequence and each loses the operation, leading to the previously successful operation being lost by the group.

Ensuring this property is challenging because IR does not have a leader like VR. The VR leader tracks every operation that the group processes and every response from replicas. While recovering IR replicas can poll a quorum of other replicas to find successful operations, they cannot poll all of the clients to find operations they may have executed and their responses. To deal with this situation, the recovering replica must instead ensure that any operation that it executed but now cannot recover does not succeed.

For this purpose, we introduce viewstamps into the IR protocol. Each replica maintains a current view number and includes this view number in responses to clients.² Replicas also divide their records into views based on the view number when the replica executed the operation. Each IR replica can be in one of two states: either NORMAL or VIEW-CHANGING. IR replicas only process operations in the NORMAL state.

For an operation to be considered successful (or finalized), the IR client must receive responses with matching view numbers. On recovery, failed replicas force a view change, ensuring that any operation that has not already achieved a quorum in the previous view will not succeed. The recovery protocol follows:

1. The recovering replica sends 〈START-VIEW-CHANGE〉 to all replicas.

2. Each replica that receives a START-VIEW-CHANGE increments its current view number by 1 and then responds with 〈DO-VIEW-CHANGE, r, v〉, where r is its unordered record of executed operations, and v is the newly updated view number. The replica sets its state to VIEW-CHANGING and stops responding to client requests.

3. Once the recovering replica receives f + 1 responses, it updates its record using the received records:

(a) For any consensus operation with matching results in at least ⌈f/2⌉ + 1 of the received records (i.e., operations that were finalized), the replica calls Recover(op, res), where res is the majority result. It then adds the operation to its record.

(b) For any consensus operation in a received record without a matching majority result (i.e., successful but not finalized), the replica recovers the operation using Recover(op) without a result. It then adds the operation to its record.

(c) For any inconsistent operation in a received record, the replica calls Recover(op) and adds the operation to its record.

4. The recovering replica then sends a 〈START-VIEW, v_new〉, where v_new is the max of the view numbers from the other replicas.

5. Any replica that receives START-VIEW checks whether v_new is higher than or equal to its current view number. If so, the replica updates its current view number and enters the NORMAL state; otherwise, it stays in its current state. The replica then always replies with a 〈START-VIEW-REPLY, v_new〉 message.

6. After the recovering replica receives f START-VIEW-REPLY responses, it enters the NORMAL state and resumes processing client requests. At this point, the replica is considered to be recovered.

If a non-recovering replica receives a START-VIEW-CHANGE without a matching START-VIEW, after a timeout it assumes the role of the recovering replica to force the group into the new view and back into the NORMAL state to continue processing operations.

² If clients receive responses with different view numbers, they send back the largest view number received. This ensures that replicas are not forever stuck in different views, leaving the group unable to process operations.
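The merge performed in step 3 can be sketched as follows; the record representation (a dict from operation id to operation type and result) is an assumption made for illustration and is not the TAPIR implementation’s data structure.

from collections import Counter, defaultdict

def merge_records(received_records, f):
    # received_records: the f + 1 records returned in DO-VIEW-CHANGE
    # messages; each maps op_id -> (op_type, result), where op_type is
    # "consensus" or "inconsistent". Returns the Recover upcalls to make.
    tallies = defaultdict(Counter)
    op_types = {}
    for record in received_records:
        for op_id, (op_type, result) in record.items():
            op_types[op_id] = op_type
            tallies[op_id][result] += 1

    finalized_quorum = -(-f // 2) + 1                    # ceil(f/2) + 1
    upcalls = []
    for op_id, counts in tallies.items():
        result, count = counts.most_common(1)[0]
        if op_types[op_id] == "consensus" and count >= finalized_quorum:
            upcalls.append((op_id, "Recover", result))   # case (a): finalized
        else:
            upcalls.append((op_id, "Recover", None))     # cases (b) and (c)
    return upcalls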

3.3 Correctness

We give a brief sketch of correctness. Correctness requires that (1) successful operations are not lost, and (2) finalized consensus operation results are not lost. To ensure these guarantees, we show that IR maintains the following properties, to tolerate up to f simultaneous failures out of a total of 2f + 1 replicas:

P1. Successful operations are always in at least one replica record out of any set of f + 1 replicas.

P2. Finalized consensus operation results are always in the records of a majority out of any set of f + 1 replicas.

In the absence of failures, the operation processing protocol ensures that any successful operation is in the record of at least f + 1 replicas and that at least ⌈3f/2⌉ + 1 replicas executed any finalized consensus operation and recorded the same result. P1 follows from the fact that any two sets of size f + 1 share at least one replica. P2 follows from the fact that any two sets of sizes ⌈3f/2⌉ + 1 and f + 1, respectively, share at least ⌈f/2⌉ + 1 replicas, which constitute a majority of any set of size f + 1.
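Concretely, counting over the 2f + 1 replicas gives

\[
\left( \left\lceil \tfrac{3f}{2} \right\rceil + 1 \right) + (f + 1) - (2f + 1)
  \;=\; \left\lceil \tfrac{f}{2} \right\rceil + 1 \;>\; \tfrac{f + 1}{2},
\]

so the shared replicas always form a majority of any set of f + 1 replicas.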

However, since replicas can lose records on failures, we must prove that the recovery protocol maintains these properties after failures as well. The recovery protocol ensures the following necessary properties will be met after recovery:

P3. A majority of the replicas, including the recovered replica, are in a view greater than that of any previously successful operation.

P4. The recovered replica gets all operations that were successful in previous views and any finalized results from previous views. Any ongoing operations are never successful.

These properties are sufficient to ensure the general properties, P1 and P2, stated above.

P3 is ensured by the view change protocol. Since at least 1 of the f + 1 replicas that respond to START-VIEW-CHANGE must have the maximum view number in the group, and since every replica that responds to a START-VIEW-CHANGE will increment its view, the new view must be larger than any previous view.

P4 is ensured because any operations that were successful in views lower than the new view must appear in the joined records of any f + 1 replicas by quorum intersection. No further operations will become successful in any previous view because at least f + 1 replicas have incremented their view numbers prior to sending a DO-VIEW-CHANGE message. Any in-progress operations will therefore be unable to achieve a quorum. The equivalent holds for consensus operations with a finalized result because the quorum intersection for these is at least ⌈f/2⌉ + 1.

3.4 Building Atop IR

Due to its weak consistency guarantees, IR is an efficient protocol for replication. It has no leader to order operations, which increases throughput, and it has no cross-replica coordination, which reduces latency for each operation. In this respect, IR resembles protocols like Generalized Paxos [22] and EPaxos [32], which also do not require ordering for commutative operations. However, the goal in these protocols is still to maintain replicas in a consistent state while executing operations in different orders. IR has a completely different goal: to ensure fault-tolerance for operations and their results instead of consistent replication.

With replicas in inconsistent states, applications or protocols built on top of IR may be inconsistent as well. IR does not attempt to synchronize inconsistent application state; the application must reconcile any inconsistencies caused by replicas executing operations in different orders. Section 4 describes how TAPIR does so. In addition, IR offers weak liveness guarantees for consensus operations.

For checkpointing purposes, replicas can periodically synchronize or gossip to find missed operations; however, this is not necessary for correctness. Without a leader, no replica is guaranteed to have all of the successful operations. Thus, to get a complete view, applications must join the records of f + 1 replicas.

By carefully selecting the quorum size for inconsistent and consensus operations, IR provides sufficient fault-tolerance guarantees to be able to construct a higher-level protocol with strong consistency. Other systems could use IR with different quorum sizes to provide weaker consistency guarantees. For example, Dynamo could use IR with a quorum of all replicas for put operations to provide eventual consistency. We took advantage of this flexibility when building comparison systems for our evaluation.

4. TAPIR Overview

TAPIR is a distributed transaction protocol that relies on inconsistent replication. This section describes the protocol and the techniques that TAPIR uses to guarantee strong consistency for transactions using IR’s weak guarantees.

Figure 4 shows the messages sent during a sample transaction in TAPIR. Compared to the protocol shown in Figure 2, TAPIR has three immediately apparent advantages:

1. Reads go to the closest replica. Unlike protocols that must send reads to the leader, TAPIR sends reads to the replica closest to the client.

2. Successful transactions commit in one round-trip. Unlike protocols that use consistent replication, TAPIR commits most transactions in a single round-trip by eliminating cross-replica coordination.

3. No leader needed. Unlike protocols that order operations at a leader, TAPIR replicas all process the same number of messages, eliminating a bottleneck.

4.1 System Model

TAPIR is designed to provide distributed transactions for a scalable storage architecture. We assume a system that partitions data into shards and replicates each shard across a set of storage servers for availability and fault-tolerance. Clients are front-end application servers located in the same or another datacenter as the storage servers; they are not end-hosts or user machines. They can access a directory of storage servers and directly map data to them using a technique like consistent hashing [18].

TAPIR provides a general storage and transaction interface to clients through a client-side library. Clients begin a transaction, then read and write during the transaction’s execution period. During this period, the client is free to abort the transaction. Once the client finishes execution, it commits the transaction, atomically and durably committing the executed operations across all shards that participated in the transaction. TAPIR ensures a linearizable ordering and fault-tolerance for the committed transaction for up to f simultaneous server failures and any client failures.

Figure 4: TAPIR protocol.

4.2 Protocol Overview

Figure 5 shows TAPIR’s interfaces and state at both the client and the replica. Applications perform reads and writes within a transaction and then durably and atomically commit them.

Each TAPIR client supports one ongoing transaction at a time. In addition to the client’s client id, the client stores the state for the ongoing transaction, including the transaction id and read and write set. TAPIR clients communicate with TAPIR replicas to read, commit, and abort transactions. TAPIR executes Begin operations locally and buffers Write operations at the client until commit; these operations need not contact replicas.

TAPIR replicas are grouped into shards; each shard stores a subset of the key space. Read operations are sent directly to any TAPIR replica in the appropriate shard because they do not require durability. TAPIR uses IR to replicate Prepare, Commit and Abort operations across replicas in each shard participating in the transaction (i.e., holding one of the keys read or written in the transaction).


Client Interface:
  Begin()
  Read(key) → object
  Write(key, object)
  Commit()
  Abort()

Client State:
  • client id - unique client identifier
  • transaction - ongoing transaction id, read set, write set

Replica Interface:
  Read(key) → object, version
  Read(key, version) → object
  Prepare(transaction, timestamp) → 〈PREPARE-OK || ABSTAIN || ABORT || RETRY, timestamp〉
  Commit(transaction, timestamp)
  Abort(transaction, timestamp)

Replica State:
  • prepared transactions - list of transactions that this replica is prepared to commit
  • transaction log - log of committed and aborted transactions in timestamp order
  • store - versioned data store
  • log table - table of transactions for which this replica group is the backup coordinator; includes the currently designated coordinator and the outcome of the transaction

Figure 5: Summary of TAPIR interfaces and client and replica state.

TAPIR uses timestamp ordering for transactions. A transaction’s timestamp reflects its place in the global linearizable order. TAPIR replicas keep transactions in a transaction log in timestamp order; they also maintain a multi-versioned data store, where each version of an object is identified by the timestamp of the transaction that wrote the version. TAPIR replicas serve reads from the versioned data store and maintain the transaction log for synchronization and checkpointing. Like other two-phase commit protocols, TAPIR replicas maintain a list of prepared transactions.
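A minimal sketch of such a multi-versioned store, with versions identified by transaction timestamps, follows; the class and method names are illustrative and are not the TAPIR-KV code.

class VersionedStore:
    def __init__(self):
        self.versions = {}   # key -> list of (timestamp, value), kept sorted

    def install(self, key, value, ts):
        # Install the version identified by the writing transaction's timestamp.
        # Installs may arrive out of order (Section 5.1.3), so keep them sorted.
        versions = self.versions.setdefault(key, [])
        versions.append((ts, value))
        versions.sort(key=lambda v: v[0])

    def read_latest(self, key):
        # Serve a transactional read: latest value plus its version id.
        ts, value = self.versions[key][-1]
        return value, ts

    def read_at(self, key, snapshot_ts):
        # Newest version written at or before snapshot_ts, or None.
        older = [(ts, value) for ts, value in self.versions.get(key, [])
                 if ts <= snapshot_ts]
        return older[-1] if older else None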

TAPIR uses two-phase commit (2PC) with optimistic concurrency control (OCC) to commit transactions at a single timestamp. The TAPIR client also serves as the two-phase commit coordinator. TAPIR uses an optimistic ordering approach for committing transactions, similar to CLOCC [1]. Clients choose a proposed timestamp for the committing transaction in Prepare. TAPIR replicas then use OCC to validate that the transaction is consistent with the proposed timestamp.

Unlike other two-phase commit protocols, TAPIR replicas can respond to Prepare operations in four ways. PREPARE-OK and ABORT are the usual two-phase commit responses, indicating that no conflicts were found or a conflict with a committed transaction was found, respectively. The other two responses are designed to reduce the number of aborts in TAPIR. ABSTAIN indicates that the replica cannot prepare the transaction at this time, which usually occurs when finding a conflict with a prepared transaction. RETRY indicates that the transaction might succeed if the client uses the retry timestamp returned with the result.

Prepare is a consensus operation since replicas can return different results; Commit and Abort are inconsistent operations. The latter are carefully designed to enable TAPIR replicas to consistently commit or abort transactions at any time.

Because Prepare is a consensus operation, it may not finalize, meaning the result will not survive failures. However, TAPIR must still be able to make progress and eventually commit or abort the transaction. TAPIR copes with this situation using a slow path, which writes the transaction outcome to a backup coordinator group before reporting it to the application. Once TAPIR replicates the outcome across the backup coordinator group, replicas can determine the result of the Prepare from the transaction outcome.

4.3 Building TAPIR on an IR Foundation

Numerous aspects of TAPIR are explicitly designed to cope with inconsistent replication. These design techniques make it possible for TAPIR to provide linearizable transactions from IR’s weak guarantees. TAPIR must: (1) order transactions using unordered operations, (2) detect conflicts between transactions with an incomplete transaction history, and (3) reconcile different operation results from inconsistent replicas and application state at inconsistent replicas. This section highlights how TAPIR tackles these challenges.

4.3.1 Ordering Transactions.

The first challenge is how to order transactions when the replication layer provides only unordered operations. Conventional transaction protocols use the order of operations at storage servers to order transactions, but this cannot work with inconsistent replication.


Instead, TAPIR uses an optimistic transaction ordering technique, similar to the one employed in CLOCC [1], where clients propose timestamps to order their transactions. To help clients pick timestamps, TAPIR uses loosely synchronized clocks. As other work has noted [38], the Precision Time Protocol (PTP) [39] makes clock synchronization easily achievable in today’s datacenters. With hardware support, this protocol can achieve sub-microsecond accuracy. Even without hardware support, PTP lets us achieve microsecond accuracy, far less than TAPIR requires. Across the wide-area, we found that sub-millisecond accuracy was possible in VMs on the widely available Google Compute Engine [16].

TAPIR’s performance depends on clients proposing closely clustered timestamps. Increasing clock skew increases the latency for committing transactions since clients must retry their transactions. Retrying does not force the transaction to re-execute; it incurs only an additional round-trip to the replicas. As our evaluation shows, TAPIR is robust even to high clock skews; we measured a retry rate that was well below 1% even with clock skews of several milliseconds and a high-contention Zipf distribution of accesses. As a rough metric, TAPIR sees no negative performance impact as long as the clock skew stays within a few percent of the latency between clients and servers.

It is important to note that TAPIR does not depend on clock synchronization for correctness, only performance. This differs from Spanner [8], where consistency guarantees can be violated if the clock skew exceeds the uncertainty bound given by TrueTime. Thus, TAPIR’s performance depends only on the actual clock skew, unlike Spanner, which must wait for a conservatively estimated uncertainty bound on every transaction.
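Since proposed timestamps must be unique (Section 5.1.2), a client can form them from its loosely synchronized local clock and its client id. The sketch below assumes retry timestamps are returned as plain clock values; these names and shapes are illustrative only.

import time

def propose_timestamp(client_id):
    # Unique, roughly clock-ordered proposal: ties across clients are
    # broken by client id, and Python tuples compare lexicographically.
    return (time.time(), client_id)

def retry_timestamp(retry_times, client_id):
    # After a RETRY result, propose the maximum of the returned retry
    # timestamps and the local time (see Section 5.1.2).
    return (max([time.time()] + list(retry_times)), client_id)

Skewed clocks only make RETRY results more likely; they do not affect correctness.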

4.3.2 Detecting Conflicts.

Conventional distributed transaction protocols use concurrency control to prevent conflicts between concurrent transactions. However, most concurrency control protocols check a transaction for conflicts against all previous transactions. Every TAPIR replica might have a different set of transactions, so TAPIR must detect conflicts between transactions with an incomplete transaction history.

TAPIR solves this problem using optimistic concurrency control and quorum intersection. OCC validation checks occur between the committing transaction and one previous transaction at a time. Thus, it is not necessary for a single server to perform all the checks. Since IR ensures that every Commit executes at at least 1 replica in any set of f + 1 replicas, and every Prepare executes at at least ⌈f/2⌉ replicas in any set of f + 1, at least one replica will detect any possible OCC conflict between transactions, thus ensuring correctness.

4.3.3 Reconciling Inconsistency.

TAPIR must reconcile both inconsistent replica state and inconsistent operation results.

The result of Commit and Abort is the same regardless of their execution order. The replica state also stays consistent. For Commit, the replica will always commit the transaction to the multi-versioned data store using the transaction timestamp and then log the transaction. For Abort, the replica will abort the transaction if it is prepared and then log the transaction.

Both the result and replica state after a Prepare differ if they are executed in a different order at each replica. We cope with this inconsistency as follows. First, replica state diverges only until the replicas all execute the Commit or Abort, so it is not important to reconcile replica state. Second, TAPIR manages inconsistent results by using consensus operations in IR. Consensus operations always return the majority result and preserve that result across failures. Thus, TAPIR can rely on the majority result to decide whether to commit or abort a transaction. As a further optimization, TAPIR replicas give an ABSTAIN result when they are unsure of the outcome of the conflicting transaction; this prevents transactions from aborting unnecessarily due to inconsistency.

5. TAPIR Protocol

This section details the TAPIR protocol, which provides linearizable transactions using inconsistent replication. TAPIR provides the usual atomicity, consistency, isolation and durability guarantees for transactions. It also supports the strictest level of isolation: external consistency or linearizability. More specifically, TAPIR’s transaction timestamps reflect an externally consistent ordering, ensuring that if transaction A commits before transaction B, then A will have a lower timestamp than B.

TAPIR provides these guarantees using a transaction processing protocol layered on top of inconsistent replication. It also relies on inconsistent replication for replica recovery; however, TAPIR uses a coordinator recovery protocol for two-phase commit coordinator failures. The remainder of this section describes these protocols in detail.

5.1 Transaction Processing Protocol

TAPIR clients communicate with TAPIR replicas to coordinate linearizable transactions. TAPIR replicas process transactions unless replicas in the shard are in IR recovery mode. This section describes the protocols for reading objects in a transaction and then committing.

5.1.1 Read Protocol

TAPIR’s read protocol is identical to other optimistic concurrency control protocols for sharded multi-versioned data stores. The TAPIR client sends read requests to any replica server in the shard for that data object. The server returns the latest version of the object along with the timestamp of the transaction that committed that version, which serves as the version ID. The client adds the key and the version ID to the transaction’s read set. If the client has previously read the object in the transaction, it can retrieve the same version by sending the version ID with the read request to the server.

Unlike other optimistic concurrency control protocols, TAPIR servers are inconsistent. Any replica could miss any transaction at any time. However, the OCC validation process detects any inconsistencies on commit and the transaction will abort.
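A client-side sketch of this read path follows; the client object, its read_set, and the send_read helper are assumptions made for illustration.

def transactional_read(client, key):
    if key in client.read_set:
        # Previously read in this transaction: ask for the same version.
        return client.send_read(key, version=client.read_set[key])
    # Any replica in the key's shard returns the latest version and the
    # timestamp of the transaction that wrote it (the version id).
    value, version_id = client.send_read(key)
    client.read_set[key] = version_id   # recorded for OCC validation
    return value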

5.1.2 Commit Protocol

During the execution period, the TAPIR client accumulates the read and write sets for the transaction. Once the application decides to commit, TAPIR starts the commit protocol to find a timestamp at which it can serialize the transaction’s reads and writes.

The TAPIR client starts the protocol by selecting a proposed timestamp. Proposed timestamps must be unique, so clients use a tuple of their local time and their client id. The TAPIR client serves as the two-phase commit coordinator, sending the Prepare with the proposed timestamp to all participants.

The result of Prepare at each replica depends on the outcome of the TAPIR validation checks. As noted, TAPIR validation checks have four possible outcomes:

1. The replica checks for conflicts with other transactions. For each read, it checks that the version remains valid (i.e., that it has not accepted a write for that key before the proposed timestamp). For each write, it checks that it has not accepted a read or write for that key after the proposed timestamp.³

2. If there are no conflicts, the participant replica adds the transaction to the list of prepared transactions and replies PREPARE-OK.

3. If a read conflict exists with a previously committed transaction, then the participant replica cannot commit the transaction; it sends an ABORT.

4. If a write conflict exists with a previously committed transaction, the participant replies RETRY with the timestamp of the conflicting transaction. If the client retries the transaction at a timestamp after the conflicting transaction, it may commit.

5. If a conflict exists with a prepared transaction, the participant replica replies ABSTAIN.

If Prepare succeeds in each participant shard, the TAPIR client proceeds as follows:

1. If the result is PREPARE-OK at every shard, then the client waits for IR to finalize the Prepare result. As an optimization, while it waits, it can begin a slow-path commit (described below).

2. If the result at any shard is ABORT, then the client sends Abort to all participants.

3. If the result at any shard is RETRY, then the client retries the transaction with a new proposed timestamp: the maximum of the returned retry timestamps and the local time.

4. Otherwise, the client retries the transaction with a new proposed timestamp.

Once IR declares the Prepare result to be finalized in every participant shard, the TAPIR client reports the outcome to the application. If the Prepare times out without finalizing the result in one of the participant shards, then the TAPIR client must take a slow path to finish the transaction. If Prepare succeeded with PREPARE-OK in every shard, the client commits the transaction; otherwise it aborts. To complete the transaction, the client first logs the outcome of the transaction to the backup coordinator group. It then notifies the application and sends a Commit or Abort to all participant replicas. The client uses the same slow path to abort the transaction if Prepare does not succeed in any shard because there were not enough matching responses.

³ These conflict checks ensure serializability and external consistency; they could be relaxed slightly (e.g., following the Thomas Write Rule [42]) for higher performance if external consistency is not required.
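The replica-side validation that produces these four results can be sketched as follows; the dictionaries standing in for the replica’s prepared list and committed history are illustrative, not the actual TAPIR-KV structures.

def occ_check(committed_writes, committed_reads, prepared, read_set, write_set,
              proposed_ts):
    # committed_writes / committed_reads: key -> list of commit timestamps.
    # prepared: key -> set of proposed timestamps of prepared transactions.
    # read_set: key -> version timestamp read by this transaction.
    # write_set: iterable of keys written. Returns (result, retry_ts).
    for key, version in read_set.items():
        if any(version < ts < proposed_ts for ts in committed_writes.get(key, [])):
            return "ABORT", None        # read conflict with a committed txn
        if any(version < ts < proposed_ts for ts in prepared.get(key, set())):
            return "ABSTAIN", None      # read conflict with a prepared txn
    for key in write_set:
        later = [ts for ts in committed_writes.get(key, []) +
                 committed_reads.get(key, []) if ts > proposed_ts]
        if later:
            return "RETRY", max(later)  # write conflict with a committed txn
        if any(ts > proposed_ts for ts in prepared.get(key, set())):
            return "ABSTAIN", None      # write conflict with a prepared txn
    return "PREPARE-OK", None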


5.1.3 Out-of-order Execution

Since TAPIR relies on inconsistent replication, it must be able to execute operations out-of-order. For example, TAPIR may receive the Commit for a transaction before the Prepare. Thus, in addition to normal-case execution, TAPIR performs the following checks for each operation:

• Prepare: If the transaction has been committed or aborted (logged in the transaction log), ignore. Otherwise, TAPIR validation checks are run.

• Commit: Commit the transaction to the transaction log and update the data store. If prepared, remove from the prepared transaction list.

• Abort: Log the abort in the transaction log. If prepared, remove from the prepared list.
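These checks amount to a small dispatch at the replica; a sketch follows, where the state dictionary loosely mirrors Figure 5 and the validate argument stands in for the OCC checks of Section 5.1.2 (all names are illustrative).

def execute_prepare(state, txn_id, txn, proposed_ts, validate):
    if txn_id in state["transaction_log"]:
        return None                     # already committed or aborted: ignore
    return validate(state, txn, proposed_ts)

def execute_commit(state, txn_id, writes, ts):
    state["transaction_log"][txn_id] = ("COMMIT", ts)
    for key, value in writes.items():   # install versioned updates
        state["store"].setdefault(key, []).append((ts, value))
    state["prepared"].pop(txn_id, None) # remove if prepared

def execute_abort(state, txn_id, ts):
    state["transaction_log"][txn_id] = ("ABORT", ts)
    state["prepared"].pop(txn_id, None)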

5.2 Recovery Protocols

TAPIR supports up to f simultaneous failures per shard as well as arbitrary client failures. Because it uses the client as the two-phase commit coordinator, TAPIR must have protocols for recovering from both replica and coordinator failures.

5.2.1 Replica Recovery

On replica recovery, IR recovers and executes operations in an arbitrary order and gives any finalized consensus operation its finalized result. TAPIR already supports out-of-order execution of Commit and Abort, so it need not implement additional support for these operations.

For finalized Prepare operations, TAPIR skips the validation checks and simply prepares any transaction where the finalized result was PREPARE-OK.

For successful but unfinalized Prepare operations, TAPIR adds the transaction to the prepared list, but it returns ABSTAIN if queried about the transaction. These transactions exist in an in-between state until either committed or aborted. TAPIR does this because the transaction might still commit; however, the recovering replica does not know if it returned PREPARE-OK originally, so it cannot participate in committing the transaction.

5.2.2 Coordinator Recovery

TAPIR uses the client as coordinator for two-phase commit. Because the coordinator may fail, TAPIR uses a group of 2f + 1 backup coordinators for every transaction. Coordinator recovery uses a coordinator change protocol, conceptually similar to Viewstamped Replication’s view change protocol. The currently active backup coordinator is identified by indexing into the list with a coordinator-view number; it is the only coordinator permitted to log an outcome for the transaction.

If the current coordinator is suspected to have failed, a backup coordinator executes a view change in the coordinator group. In doing so, it receives any slow-path outcome that was logged by a previous coordinator. If such an outcome has been logged, the new coordinator must follow that decision; it notifies the client and all replicas in every shard.

If the new coordinator does not find a logged outcome, it sends a CoordinatorChange(transaction, view-num) message to all replicas in participating shards. Upon receiving this message, replicas agree not to process messages from the previous coordinator; they also reply to the new coordinator with any previous Prepare result for the transaction. Once the CoordinatorChange is successful (at f + 1 replicas in each participating shard), the new coordinator determines the outcome of the transaction in the following way:

• If any replica in any shard has recorded a Commit or Abort, it must be preserved.

• If any shard has fewer than ⌈f/2⌉ + 1 PREPARE-OK responses, the transaction could not have committed on the fast path, so the new coordinator aborts it.

• If at least ⌈f/2⌉ + 1 replicas in every shard have PREPARE-OK responses, the outcome of the transaction is uncertain: it may or may not have committed on the fast path. However, every conflicting transaction must have used the slow path. The new coordinator polls the coordinator (or backup coordinators) of each of these transactions until they have completed. If those transactions committed, it aborts the transaction; otherwise, it sends Prepare operations to the remaining replicas until it receives a total of f + 1 PREPARE-OK responses and then commits.

The backup coordinator then completes the transaction in the normal slow-path way: it logs the outcome to the coordinator group, notifies the client, and sends a Commit or Abort to all replicas of all shards.
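The decision rule for the new coordinator can be summarized in a short sketch; the inputs are assumptions about how the CoordinatorChange replies are tallied, not the protocol’s actual message formats.

def backup_coordinator_decision(recorded_outcome, prepare_ok_counts, f):
    # recorded_outcome: "COMMIT", "ABORT", or None if no replica in any
    # shard reported a recorded Commit or Abort.
    # prepare_ok_counts: shard -> number of PREPARE-OK results among the
    # f + 1 replicas that answered the CoordinatorChange in that shard.
    if recorded_outcome is not None:
        return recorded_outcome            # preserve any recorded outcome
    threshold = -(-f // 2) + 1             # ceil(f/2) + 1
    if any(n < threshold for n in prepare_ok_counts.values()):
        return "ABORT"                     # fast-path commit was impossible
    return "UNCERTAIN"                     # poll conflicting coordinators,
                                           # then commit or abort as described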

5.3 Correctness

We give a brief sketch of how TAPIR maintains the following properties given up to f failures in each replica group and any number of client failures:

• Isolation. There exists a global linearizable ordering of committed transactions.

• Atomicity. If a transaction commits at any participating shard, it commits at them all.

• Durability. Committed transactions stay committed, maintaining the original linearizable order.

Isolation. Because Prepare is a consensus operation, a transaction X can be committed only if f + 1 replicas return PREPARE-OK. We show that this means the transaction is consistent with the serial timestamp ordering of transactions.

If a replica returns PREPARE-OK, it has not prepared or committed any conflicting transactions. If a conflicting transaction Y had committed, then there is one common participant shard where at least f + 1 replicas responded PREPARE-OK to Y. However, those replicas would not return PREPARE-OK to X. Thus, by quorum intersection, X cannot obtain f + 1 PREPARE-OK responses.

By the same reasoning, ABORT is just an optimization. If a replica detects a conflict with committed transaction Y, then it may safely return ABORT and know that at least f + 1 replicas will give matching results. At least f + 1 replicas must have voted to commit Y, and at least f + 1 replicas processed the Commit operation. Thus, it knows that X will never get a PREPARE-OK result.

This property remains true even in the presence of replica failures and recovery. IR ensures that any Prepare that succeeded with a PREPARE-OK persists for up to f failures. It does not ensure that the result is maintained. However, our recovery protocol for Prepare without a result is conservative and keeps the transaction in the prepared state in case it commits. This ensures that other new transactions that conflict cannot acquire the necessary quorum.

Atomicity. A transaction commits at a replica only if a coordinator has sent the Commit operation after a majority of replicas in each participating shard have agreed to Prepare the transaction. In this case, the coordinator will send Commit operations to all shards.

If the coordinator fails after deciding to commit, the coordinator recovery protocol ensures that the backup coordinator makes the same decision. If the previous coordinator committed the transaction on the slow path, the commit outcome was logged to the backup coordinators; if it was committed on the fast path, the new coordinator finds the quorum of PREPARE-OK results by polling f + 1 replicas in each shard.

Durability. If there are no coordinator failures, a transaction is eventually finalized through an IR inconsistent operation (Commit/Abort), which ensures that the decision is never lost. As described above, for coordinator failures, the coordinator recovery protocol ensures that the coordinator's decision is not lost.

6. TAPIR Extensions

We now describe several useful extensions to TAPIR.

6.1 Read-only Transactions.

Since it uses a multi-versioned store, TAPIR easily supports interactive read-only transactions at a snapshot. However, since TAPIR replicas are inconsistent, it is important to ensure that: (1) reads are up-to-date and (2) later write transactions do not invalidate the reads. To achieve these properties, TAPIR replicas keep a read timestamp for each object.

TAPIR's read-only transactions have a single round-trip fast path that sends the Read to only one replica. If that replica has a validated version of the object whose write timestamp precedes the snapshot timestamp and whose read timestamp follows the snapshot timestamp, we know that the returned object is valid, because it is up-to-date, and will remain valid, because it will not be overwritten later. If the replica lacks a validated version, TAPIR uses the slow path and executes a QuorumRead through IR as an inconsistent operation. A QuorumRead updates the read timestamp, ensuring that at least f + 1 replicas do not accept writes that would invalidate the read.
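As a minimal sketch of the fast-path check a replica performs, with the per-version fields named by us (boundary handling at equal timestamps is our assumption):

  #include <cstdint>

  // Hypothetical version record kept by a TAPIR replica for one object.
  struct Version {
    uint64_t write_ts;  // timestamp of the transaction that wrote this version
    uint64_t read_ts;   // highest timestamp known to have read this version
    bool validated;     // the version has been committed at this replica
  };

  // A validated version can answer a snapshot read locally if it was written
  // before the snapshot and its read timestamp already extends past it.
  bool CanServeFastPathRead(const Version& v, uint64_t snapshot_ts) {
    return v.validated && v.write_ts <= snapshot_ts && snapshot_ts <= v.read_ts;
  }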

The protocol for read-only transactions follows:

1. The TAPIR client chooses a snapshot timestamp for the transaction; for example, the client's local time.

2. The client sends Read(key, version), where key is what the application wants to read and version is the snapshot timestamp.

3. If the replica has a validated version of the object, it returns it. Otherwise, it returns a failure.

4. If the client was not able to get the value from the replica, the client executes a QuorumRead(key, version) through IR as an inconsistent operation.

5. Any replica that receives QuorumRead returns the latest version of the object from the data store. It also writes the read to the transaction log and updates the data store to ensure that it will not prepare for transactions that would invalidate the read.

6. The client returns the object with the highest timestamp to the application.

As a quick sketch of correctness, it is always safe to read a version of the key that is validated at the snapshot timestamp.


Because the write timestamp for the version is before the snapshot timestamp, the version was valid at the snapshot timestamp, and because the read timestamp is after the snapshot timestamp, the version will continue to be valid at the snapshot timestamp. If the replica does not have a validated version, the replicated QuorumRead ensures both that the client gets the latest version of the object (because it must be present at at least one of any f + 1 replicas) and that a later write transaction cannot overwrite the version (because the QuorumRead is replicated at f + 1 replicas).

Since TAPIR also uses loosely synchronized clocks, TAPIR can be combined with Spanner's algorithm for providing linearizable read-only transactions. Like Spanner, this combination would require waits at the client, which acts as the coordinator in TAPIR, for the uncertainty bound, and it provides linearizability guarantees only as long as the clock skew does not exceed that bound.

6.2 Synchronization.

Individual TAPIR replicas may be missing some committed transactions. While these "holes" do not affect TAPIR's correctness, storage systems using TAPIR may prefer to have a full transaction log at certain points to checkpoint state to durable storage. Synchronization in TAPIR requires that replicas write sync timestamps into their transaction log and track their latest sync point. Replicas do not accept prepares for any transaction with a timestamp before the sync point.

TAPIR implements synchronization with a pair of operations, Sync and FinishSync. To ensure that Sync and FinishSync are fault-tolerant, we send them through IR as inconsistent operations. Since IR can reorder Sync and FinishSync operations, we send timestamp ranges for the synchronization with each operation. The steps are as follows (a sketch of the replica-side handlers appears after the list):

1. A synchronizing replica sends Sync(start, end) to the replica group through IR, where the interval from start to end is the portion of time over which the replica would like to have the full log. For example, the replica can pick start to be its local time at the last sync and end to be its current time.

2. Other replicas execute Sync by returning all of their transactions with timestamps between start and end. If end is greater than the replica's last sync point, then the replica updates its sync point to end and stops accepting transactions with smaller timestamps.

3. Once the synchronizing replica has received f results from IR, it joins the received transaction logs with its own and sends the complete log back in FinishSync(start, end, log). It also updates its transaction log and data store by committing any transaction in log that is not already in its log.

4. Any replica that receives FinishSync updates its sync point if it has not already. The replica also updates its own transaction log and data store by committing every transaction in log that is not already in its log.

At this point, every replica that participated in the synchronization and has a new sync timestamp in its log is guaranteed to have all of the transactions that the group committed between start and end.
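The replica-side handlers can be sketched as follows, under our own simplified types (integer timestamps and a log modeled as a map from timestamp to transaction record); in TAPIR, Sync and FinishSync are delivered through IR rather than invoked directly.

  #include <cstdint>
  #include <map>
  #include <string>

  // Hypothetical, simplified replica state for illustration only.
  using Timestamp = uint64_t;
  using TxnLog = std::map<Timestamp, std::string>;  // timestamp -> txn record

  struct ReplicaState {
    TxnLog log;
    Timestamp sync_point = 0;  // no prepares accepted at or below this point
  };

  // Sync(start, end): return this replica's transactions in the range and
  // advance the sync point so that older prepares are refused.
  TxnLog HandleSync(ReplicaState& r, Timestamp start, Timestamp end) {
    TxnLog result(r.log.lower_bound(start), r.log.upper_bound(end));
    if (end > r.sync_point) r.sync_point = end;
    return result;
  }

  // FinishSync(start, end, log): merge the complete log and advance the sync
  // point if this replica has not already done so.
  void HandleFinishSync(ReplicaState& r, Timestamp start, Timestamp end,
                        const TxnLog& merged) {
    (void)start;  // the merged log already covers [start, end]
    if (end > r.sync_point) r.sync_point = end;
    for (const auto& [ts, txn] : merged) {
      r.log.emplace(ts, txn);  // adds only transactions not already present
    }
  }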

6.3 Serializability.

TAPIR is somewhat limited in its ability to accept transactions out of order because it provides linearizability. Thus, TAPIR replicas cannot accept writes that are older than the last write that they accepted for the same key, and they cannot accept reads of older versions of the same key.

However, if TAPIR's guarantees are slightly weakened to provide serializability, TAPIR can then accept proposed timestamps any time in the past as long as they respect serializable transaction ordering. This optimization requires tracking the timestamp of the transaction that last read each version, as well as the timestamp of the transaction that wrote each version.

With this optimization, TAPIR can now accept: (1) reads of past versions, as long as the read timestamp is before the write timestamp of the next version, and (2) writes in the past (the Thomas Write Rule), as long as the write timestamp is after the read timestamp of the previous version and before the write timestamp of the next version.
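A sketch of the resulting acceptance checks, with the per-version metadata named by us for illustration:

  #include <cstdint>

  // Hypothetical per-version metadata; next_write_ts is the write timestamp
  // of the following version, or UINT64_MAX if no later version exists yet.
  struct VersionWindow {
    uint64_t write_ts;       // transaction that wrote this version
    uint64_t read_ts;        // latest transaction that read this version
    uint64_t next_write_ts;  // write timestamp of the next version
  };

  // A read at timestamp t of this version is acceptable if it falls before
  // the next version was written.
  bool CanAcceptRead(const VersionWindow& v, uint64_t t) {
    return v.write_ts <= t && t < v.next_write_ts;
  }

  // A write at timestamp t can slot in after this version (the Thomas Write
  // Rule case) if it follows every read of this version and precedes the
  // next version's write.
  bool CanAcceptWrite(const VersionWindow& v, uint64_t t) {
    return v.read_ts < t && t < v.next_write_ts;
  }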

6.4 Synchronous log writes.

Given the ability to synchronously log to durable storage (e.g., hard disk or NVRAM), we can reduce the quorum requirements for TAPIR. As long as we can recover the log after failures, we can reduce the replica group size to 2f + 1 and reduce all consensus and synchronization quorums to f + 1.

6.5 Retry Timestamp Selection.

A client can increase the likelihood that the participant replicas will accept its proposed timestamp by proposing a very large timestamp, as this decreases the chance that the participant replicas have already accepted a higher timestamp.


Thus, to decrease the chances that clients will have to retry forever, clients can exponentially increase their proposed timestamp on each retry.
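For example, a client might back off in timestamp space as in the following sketch; the initial margin and the helper name are our assumptions.

  #include <cstdint>

  // Hypothetical helper: pick the next proposed timestamp after 'retries'
  // failed attempts, pushing exponentially further ahead of local time so the
  // proposal is unlikely to trail timestamps the replicas already accepted.
  uint64_t NextProposedTimestampUs(uint64_t local_time_us, int retries) {
    const uint64_t base_margin_us = 1000;            // assumed 1 ms margin
    const int capped = retries < 32 ? retries : 32;  // avoid shifting too far
    return local_time_us + (base_margin_us << capped);
  }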

6.6 Tolerating Very High Skew.

If there is significant clock skew between servers and clients, TAPIR can use waits at the participant replicas to decrease the likelihood that transactions will arrive out of timestamp order. On receiving each prepare message, the participant replica can wait (for the error bound period) to see if other transactions with smaller timestamps arrive. After the wait, the replica can process the transaction in timestamp order with respect to the other transactions that have arrived. This wait increases the chances that the participant replica will receive transactions in timestamp order and decreases the number of transactions that it will have to reject for arriving out of order.
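A simplified, batch-style sketch of this wait, assuming the replica buffers prepare timestamps and has an estimate of the clock error bound (the buffering and function are ours, and a real replica would schedule rather than block):

  #include <algorithm>
  #include <chrono>
  #include <cstdint>
  #include <thread>
  #include <vector>

  // Hold incoming prepares for the clock error bound, then hand them to the
  // concurrency-control check in timestamp order.
  void ProcessPreparesWithSkewWait(std::vector<uint64_t>& pending_timestamps,
                                   std::chrono::milliseconds error_bound,
                                   void (*do_prepare)(uint64_t)) {
    std::this_thread::sleep_for(error_bound);  // let smaller timestamps arrive
    std::sort(pending_timestamps.begin(), pending_timestamps.end());
    for (uint64_t ts : pending_timestamps) do_prepare(ts);
  }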

7. Evaluation

To evaluate the effectiveness of TAPIR, we measure an implementation of TAPIR against several conventional designs of both transactional and weakly consistent storage systems. Our evaluation compares and analyzes their performance in local-area and wide-area environments.

7.1 Experimental Setup

We implemented TAPIR in a transactional key-value store, called TAPIR-KV. TAPIR-KV consists of 4,921 lines of C++ code, not including the testing framework. We also built five comparison systems and a workload generator based on the Retwis [24] benchmark. We ran our experiments in three testing environments: single cluster, single datacenter, and wide-area, cross-datacenter.

Test Systems. We compare TAPIR-KV with five systems:

• Txn-Locking - standard two-phase commit with strict two-phase locking (S2PL) and Viewstamped Replication (VR).

• Txn-OCC - standard two-phase commit with optimistic concurrency control (OCC) and VR.

• Span-Locking - our implementation of the Spanner protocol.

• Span-OCC - our implementation of the Spanner protocol with OCC instead of S2PL.

• QWStore - an eventual-consistency system with no transactions (e.g., Cassandra [20], Dynamo [11]), built with new IR operations that implement a read-anywhere, write-everywhere policy.

The transactional systems buffer writes at the client until commit and use the client as the coordinator.

Test Environment. We ran our TAPIR experiments with three setups:

• Single cluster - 10 quad-core servers with 24 GB RAM, running Ubuntu 12.04, connected to a fat-tree network with 12 switches.

• Single datacenter - 15 Google Compute Engine VMs with two cores, running Debian 7, in three availability zones in the same geographic region (US-Central).

• Cross-datacenter - 6 Google Compute Engine VMs running Debian 7 in three geographic regions (US, Europe, and Asia) connected over the wide-area network.

These three deployments offer round-trip times that differ by three orders of magnitude: 100 µs, 1 ms, and 100-200 ms. For most experiments, we used 3 replicas per shard, with replicas placed across availability zones or geographic regions.

Within our cluster, we synchronized clocks using the Linux PTP [39] kernel module without hardware support. For our two Google Compute Engine deployments, we use Google's existing VM clock synchronization mechanism. Lacking access to Spanner's TrueTime clock skew error bounds, we treated them as negligible compared to the network latency.

7.2 Latency and Clock Skew Measurements

We first present a profile of our different test environments. Since TAPIR depends on loosely synchronized clocks, we measured clock skew within a cluster using PTP, as well as between Google Compute Engine VMs around the world. We measure clock skew by sending a ping message between user-level processes, with timestamps taken on either end. We calculate skew by subtracting the timestamp taken at the destination from the sender's timestamp plus rtt/2, assuming symmetric latencies.
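A minimal sketch of this estimate in code (the helper and its arguments are ours):

  // Estimate clock skew from a user-level ping, assuming symmetric one-way
  // latencies: with no skew, the ping arrives at t_send + rtt/2 on the
  // sender's clock, so the difference from the destination's clock is skew.
  double EstimateSkewMs(double t_send_ms,  // sender clock at transmission
                        double t_dest_ms,  // destination clock at reception
                        double rtt_ms) {   // round-trip time at the sender
    return (t_send_ms + rtt_ms / 2.0) - t_dest_ms;
  }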

Within our cluster, the machines had a 150 µs round-trip time between them, and we were able to synchronize the servers with an average skew of 6 µs using PTP. We report the average skew because TAPIR's performance depends on the average, not worst-case, skew.

Table 1 reports the average skew and latency between Google Compute Engine's three geographic regions. Within each region, we average over the availability zones.


Table 1: Cluster round-trip times and clock skews between Google Compute VMs.

              Latency (ms)               Clock Skew (ms)
          US      Europe   Asia      US      Europe   Asia
US        1.2     111.3    166.5     3.4     1.3      1.86
Europe    111.0   0.8      261.8     2.1     0.1      1.9
Asia      166.7   263.2    10.8      2.6     1.8      0.3

Overall, the clock skew is low, demonstrating the feasibility of synchronizing clocks in the wide area.

Google gives Compute Engine VMs access to Google's reliable wide-area network infrastructure. We implemented all communication using UDP and saw few packet drops and little variation in round-trip times. However, there was a long tail to the clock skew. The worst-case clock skew that we saw was 27 ms, suggesting that Spanner could sometimes violate consistency if its TrueTime error bounds rose to only 5 ms between synchronizations.

7.3 Microbenchmarks

For the microbenchmarks, we deployed our six test systems with a single shard of 3 replicas. We use the same number of replicas for all systems for a fair throughput comparison, but TAPIR-KV needs replies from all 3 replicas for its fast-path commit, while the other transactional storage systems (TXN-LOCK, TXN-OCC, SPAN-LOCK, SPAN-OCC) only need replies from 2 out of 3 replicas. QWStore must wait for replies from all replicas for writes.

Our microbenchmark consists of single-key, read-modify-write transactions with a uniform distribution of accesses across 1 million keys. This workload stresses the transactional systems because the short transactions have high coordination overhead.

Latency. Figure 6 gives latencies for all six systems deployed in a single cluster, a single datacenter, and across datacenters. Within a cluster, TAPIR provides transactions at the same cost as the non-transactional QWStore; both systems require one round trip to read and another round trip to all replicas to either write or commit the transaction. The other transactional systems have lower performance because they need two round trips to commit, with TXN-OCC being the slowest because it also needs a round trip to a timestamp server.

Figure 6: Latency for read-modify-write transactions in our three test environments (cluster, datacenter, and wide-area), broken down into read, write, and commit. For the wide-area deployment, we place the leader replica in a different datacenter from the client.

Moving to VMs in a datacenter, the computation cost outweighs the communication cost, so TAPIR-KV's fewer round trips have less impact on latency. TAPIR still matches the performance of QWStore, but, because the round-trip time between nodes is only 1 ms, TAPIR and QWStore offer only slightly better performance than the four transactional storage systems.

With VMs in different datacenters, the communication cost outweighs the computation cost. For these experiments, we place the leader for the shard in the US with the clients in Asia. The advantage of OCC compared to locking becomes visible: being able to read at the closest replica, rather than having to go to the leader to acquire locks, reduces the cost of the OCC systems. TAPIR-KV offers better performance than the other OCC systems because it needs only a single round trip to all of the replicas to commit the transaction, compared to two round trips to the two closest replicas for the other systems. And again, TAPIR-KV matches the performance of the non-transactional QWStore.

Throughput. While TAPIR's latency benefits are reduced in our datacenter deployment, TAPIR still provides throughput benefits. As shown in Figure 7, TAPIR provides 2.6× the throughput of the other transactional systems because it is leaderless. Each TAPIR-KV replica processes the same number of messages, so all three replicas run at peak throughput. In comparison, the other transactional systems bottleneck on the leader, which must coordinate replication. We do not show QWStore, but it provides roughly 2× the throughput of TAPIR; QWStore has higher throughput because it does not provide atomicity.


Figure 7: Throughput for the read-modify-write microbenchmark deployed in Google VMs in a single datacenter.

7.4 Retwis Experiments

For our application experiments, we deploy 5 shards of 3 replicas each using Google Compute Engine VMs. We generate a synthetic workload based on the Retwis application, similar to YCSB-T [12]. Retwis is an open-source Twitter clone designed to use the Redis key-value store. Retwis has a number of Twitter functions (e.g., add user, post tweet, get timeline, follow user) that perform puts and gets on Redis. We treat each function as a transaction and generate a synthetic workload from these functions; for example, the FollowUser transaction has 2 reads and 2 puts. We use 1 million keys distributed across the 5 shards and vary the access distribution (e.g., uniform, Zipf) to vary the level of contention on different keys.
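For illustration, a FollowUser transaction in our workload generator could look like the following sketch; the client interface and the key-naming scheme are placeholders (backed here by an in-memory map), not TAPIR-KV's actual API.

  #include <map>
  #include <string>

  // Hypothetical minimal client interface standing in for the TAPIR-KV
  // client library; here it is backed by a local map so the sketch compiles.
  class Client {
   public:
    void Begin() {}
    std::string Get(const std::string& key) { return store_[key]; }
    void Put(const std::string& key, const std::string& value) {
      store_[key] = value;
    }
    bool Commit() { return true; }  // the real client runs 2PC over TAPIR

   private:
    std::map<std::string, std::string> store_;
  };

  // FollowUser: two gets and two puts, executed as a single transaction.
  bool FollowUser(Client& c, const std::string& follower,
                  const std::string& followee) {
    c.Begin();
    std::string following = c.Get("following:" + follower);
    std::string followers = c.Get("followers:" + followee);
    c.Put("following:" + follower, following + "," + followee);
    c.Put("followers:" + followee, followers + "," + follower);
    return c.Commit();
  }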

7.4.1 Datacenter Performance

To measure performance within a datacenter, we deploy our VMs across availability zones in the US-Central region. All experiments use the Retwis workload with a Zipf coefficient of 0.6 to approximate the contention in real-life workloads.

Latency. Figure 8 reports the average latency for the FollowUser transaction at peak throughput for the six systems. Note again that the computation costs outweigh the communication costs in the datacenter with VMs, so TAPIR-KV provides less latency benefit than in other environments. TAPIR-KV still provides lower latency than the other transactional systems, except for SPAN-OCC; SPAN-OCC achieves the same latency as TAPIR-KV, but with half the peak throughput. QWStore, however, provides lower latency at higher peak throughput than all of the transactional systems because it avoids coordinating transactions with two-phase commit.

Figure 8: Latency for the FollowUser transaction with a Zipf distribution (coefficient 0.6) at peak throughput for each system (TXN-OCC, TXN-LOCK, SPAN-OCC, SPAN-LOCK, TAPIR, QWSTORE).

Throughput. As Figure 9 shows, the peak throughput for TAPIR-KV is twice that of the other transactional storage systems because TAPIR-KV does not need a leader to coordinate between replicas in a shard. However, QWStore achieves twice the throughput of TAPIR-KV because it does not need to coordinate replicas within a shard or transactions across shards.

Figure 9: Peak throughput for Retwis transactions with Zipf distribution.

7.4.2 Cross-datacenter Experiments

For wide-area experiments, we place one replica from each shard in each geographic region. For the systems that use consistent replication and have a leader, we fix the location of the leader in the US and move the client between the US, Europe, and Asia. Figure 10 gives latency results for the FollowUser transaction in the Retwis workload. As before, we use a Zipf distribution with a 0.6 coefficient.

Figure 10: Wide-area latency for the FollowUser transaction, with the leader in the US and the client in the US, Europe, or Asia.

When the client shares a datacenter with the leader, the latency of the other transactional systems is better than that of TAPIR-KV and QWStore.


The other transactional systems are faster because they can commit with a round trip to the leader and another replica in the closest geographic region, while TAPIR-KV must contact all replicas to commit in the fast path. QWStore is the slowest of all because it does not buffer writes, so it needs a round trip to all of the replicas on each of the two write operations in the FollowUser function.

When the client is in a different datacenter from the leader, TAPIR-KV provides the best performance of all of the comparison systems. The locking-based transactional systems suffer because they must go to the leader for read locks. QWStore has better performance because it can read at any replica; however, it still takes two round trips to write. The OCC-based transactional storage systems provide the best performance among the comparison systems in this case; they all buffer writes and read at a replica in the local datacenter. However, TAPIR can commit the transaction in a single round trip to all replicas, compared to two round trips to the two closest replicas for the other OCC systems.

8. Related Work

TAPIR is the first distributed transaction protocol to support a replication protocol with no consistency guarantees. It uses no writes to disk, no replica leader, and no coordination between replicas on commit to provide extremely high-performance read-write transactions. This section discusses other work with similar goals and mechanisms.

8.1 Replication

A large body of work [23, 25, 31] addresses efficient replication with strong consistency guarantees. In particular, significant effort has gone into supporting commutative operations in consistent replication protocols [6, 22, 32]. Commutative operations do not require a consistent ordering across replicas, so they can execute without a leader or cross-replica coordination. Inconsistent replication supports only unordered operations, so all operations commute by definition.

Dynamo [11] and other systems [10, 13, 14, 26, 40, 43] chose replication techniques with weak consistency guarantees for performance. However, weak consistency guarantees can be hard to understand, and programmers increasingly find them difficult to use to satisfy many applications' complex requirements [8]. TAPIR offers the best of both worlds: inconsistent replication provides performance on par with weakly consistent systems, but TAPIR provides strong consistency guarantees in the form of linearizable transactions to application programmers.

The inconsistent replication protocol shares many features with Viewstamped Replication [34]. Like VR, inconsistent replication is designed for in-memory replication without relying on synchronous disk writes. The possibility of data loss on replica failure, which does not arise in protocols like Paxos [21] that assume durable writes, necessitates the use of viewstamps in both replication protocols. Our decision to focus on in-memory replication is motivated by the popularity of recent in-memory systems like RAMCloud [35] and H-Store [41].

8.2 Distributed Transactions

There has been significant work on improving the performance of distributed transactions by limiting the transaction model [2, 44] or the consistency guarantees [29, 40]. This work largely assumes durable writes and does not consider replication.

Like TAPIR, Granola [9] uses loosely synchronized clocks; however, it provides higher performance only for independent transactions and can only be used with a consistent replication protocol. Other transactional systems have looked at commutative transactions [19] or executing transactions out of order [33]. TAPIR naturally supports preparing and executing commutative or non-conflicting transactions in any order due to its use of optimistic transaction ordering and multi-versioned storage to support inconsistent replication.

The CLOCC [1] distributed transaction protocol also uses loosely synchronized clocks for optimistic transaction ordering; however, it did not consider replication. CLOCC was later combined with VR [27], but it does not achieve the same performance as TAPIR because VR enforces strong consistency guarantees. Gray and Lamport's [17] use of Paxos [21] with locking and two-phase commit has similar performance limitations.

More recently, numerous systems [3, 7, 8, 15, 19, 36, 37, 44] have implemented distributed transactions with different consistent replication protocols; however, they are all limited by their need for cross-replica coordination. Spanner [8] uses loosely synchronized clocks to address latency and throughput limitations in replicated transactional systems for read-only transactions, but it uses a standard protocol for read-write transactions. Its approach for read-only transactions is complementary to TAPIR's high-performance read-write transactions, and the two could easily be combined.


While other systems improve latency for read-write transactions (e.g., MDCC [19] and Replicated Commit [30] for wide-area environments) or throughput (e.g., Warp [15] for datacenter environments), none address both latency and throughput across a wide range of environments as TAPIR does.

9. Conclusion

This paper presented TAPIR, the first distributed transaction protocol designed to use inconsistent replication. TAPIR uses three techniques that leverage loosely synchronized clocks and the properties of optimistic concurrency control and two-phase commit to provide linearizable transactions with inconsistent replication. We designed and built TAPIR-KV, a distributed transactional key-value store, along with several comparison systems. Our experiments reveal that TAPIR-KV can lower commit latency by 50% and increase throughput by 2× relative to conventional transactional storage systems and, in many cases, can match the performance of a weakly consistent system without transaction support.

References

[1] A. Adya, R. Gruber, B. Liskov, and U. Maheshwari. Efficient optimistic concurrency control using loosely synchronized clocks. In Proc. of SIGMOD, 1995.
[2] M. K. Aguilera, A. Merchant, M. Shah, A. Veitch, and C. Karamanolis. Sinfonia: a new paradigm for building scalable distributed systems. In Proc. of SOSP, 2007.
[3] J. Baker, C. Bond, J. Corbett, J. Furman, A. Khorlin, J. Larson, J.-M. Leon, Y. Li, A. Lloyd, and V. Yushprakh. Megastore: Providing scalable, highly available storage for interactive services. In Proc. of CIDR, 2011.
[4] P. A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery in Database Systems. Addison Wesley Publishing Company, 1987.
[5] K. Birman and T. A. Joseph. Exploiting virtual synchrony in distributed systems. In Proc. of SOSP, 1987.
[6] L. Camargos, R. Schmidt, and F. Pedone. Multicoordinated Paxos. Technical Report 2007/02, University of Lugano Faculty of Informatics, Jan. 2007.
[7] B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A. Jacobsen, N. Puz, D. Weaver, and R. Yerneni. PNUTS: Yahoo!'s hosted data serving platform. In Proc. of VLDB, 2008.
[8] J. C. Corbett et al. Spanner: Google's globally-distributed database. In Proc. of OSDI, 2012.
[9] J. Cowling and B. Liskov. Granola: low-overhead distributed transaction coordination. In Proc. of USENIX ATC, 2012.
[10] K. Daudjee and K. Salem. Lazy database replication with snapshot isolation. In Proc. of VLDB, 2006.
[11] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. In Proc. of SOSP, 2007.
[12] A. Dey, A. Fekete, and U. Rohm. Scalable transactions across heterogeneous NoSQL key-value data stores. In Proc. of VLDB, 2013.
[13] S. Elnikety, S. Dropsho, and F. Pedone. Tashkent: Uniting durability with transaction ordering for high-performance scalable database replication. In Proc. of EuroSys, 2006.
[14] S. Elnikety, S. Dropsho, and W. Zwaenepoel. Tashkent+: Memory-aware load balancing and update filtering in replicated databases. In Proc. of EuroSys, 2007.
[15] R. Escriva, B. Wong, and E. G. Sirer. Warp: Multi-key transactions for key-value stores. Technical report, Cornell University, Nov. 2013.
[16] Google Compute Engine. https://cloud.google.com/products/compute-engine/.
[17] J. Gray and L. Lamport. Consensus on transaction commit. ACM Transactions on Database Systems, 2006.
[18] D. Karger, E. Lehman, T. Leighton, R. Panigrahy, M. Levine, and D. Lewin. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the world wide web. In Proc. of STOC, 1997.
[19] T. Kraska, G. Pang, M. J. Franklin, S. Madden, and A. Fekete. MDCC: multi-data center consistency. In Proc. of EuroSys, 2013.
[20] A. Lakshman and P. Malik. Cassandra: a decentralized structured storage system. ACM SIGOPS Operating Systems Review, 2010.
[21] L. Lamport. Paxos made simple. ACM SIGACT News, 2001.
[22] L. Lamport. Generalized consensus and Paxos. Technical Report 33, Microsoft Research, 2005.
[23] L. Lamport. Fast Paxos. Distributed Computing, 19(2), 2006.
[24] C. Leau. Spring Data Redis - Retwis-J, 2013. http://docs.spring.io/spring-data/data-keyvalue/examples/retwisj/current/.
[25] C. Li, D. Porto, A. Clement, J. Gehrke, N. Preguica, and R. Rodrigues. Making geo-replicated systems fast as possible, consistent when necessary. In Proc. of OSDI, 2012.

[26] Y. Lin, B. Kemme, M. Patiño-Martínez, and R. Jiménez-Peris. Middleware based data replication providing snapshot isolation. In Proc. of SIGMOD, 2005.
[27] B. Liskov, M. Castro, L. Shrira, and A. Adya. Providing persistent objects in distributed systems. In Proc. of ECOOP, 1999.
[28] B. Liskov and J. Cowling. Viewstamped replication revisited, 2012.
[29] W. Lloyd, M. J. Freedman, M. Kaminsky, and D. G. Andersen. Don't settle for eventual: Scalable causal consistency for wide-area storage with COPS. In Proc. of SOSP, 2011.
[30] H. Mahmoud, F. Nawab, A. Pucher, D. Agrawal, and A. El Abbadi. Low-latency multi-datacenter databases using replicated commit. In Proc. of VLDB, 2013.
[31] Y. Mao, F. P. Junqueira, and K. Marzullo. Mencius: building efficient replicated state machines for WANs. In Proc. of OSDI, 2008.
[32] I. Moraru, D. G. Andersen, and M. Kaminsky. There is more consensus in egalitarian parliaments. In Proc. of SOSP, 2013.
[33] S. Mu, Y. Cui, Y. Zhang, W. Lloyd, and J. Li. Extracting more concurrency from distributed transactions. In Proc. of OSDI, 2014.
[34] B. M. Oki and B. H. Liskov. Viewstamped replication: A new primary copy method to support highly-available distributed systems. In Proc. of PODC, 1988.
[35] D. Ongaro, S. M. Rumble, R. Stutsman, J. Ousterhout, and M. Rosenblum. Fast crash recovery in RAMCloud. In Proc. of SOSP, 2011.
[36] S. Patterson, A. J. Elmore, F. Nawab, D. Agrawal, and A. El Abbadi. Serializability, not serial: Concurrency control and availability in multi-datacenter datastores. In Proc. of VLDB, 2012.
[37] D. Peng and F. Dabek. Large-scale incremental processing using distributed transactions and notifications. In Proc. of OSDI, 2010.
[38] J. Perry, A. Ousterhout, H. Balakrishnan, D. Shah, and H. Fugal. Fastpass: A centralized zero-queue datacenter network. In Proc. of SIGCOMM, 2014.
[39] IEEE 1588 Standard for a Precision Clock Synchronization Protocol for Networked Measurement and Control Systems. http://www.nist.gov/el/isd/ieee/ieee1588.cfm.
[40] Y. Sovran, R. Power, M. K. Aguilera, and J. Li. Transactional storage for geo-replicated systems. In Proc. of SOSP, 2011.
[41] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland. The end of an architectural era: (it's time for a complete rewrite). In Proc. of VLDB, 2007.
[42] R. H. Thomas. A majority consensus approach to concurrency control for multiple copy databases. ACM Transactions on Database Systems, 4(2):180–209, June 1979.
[43] S. Wu and B. Kemme. Postgres-R(SI): Combining replica control with concurrency control based on snapshot isolation. In Proc. of ICDE, 2005.
[44] Y. Zhang, R. Power, S. Zhou, Y. Sovran, M. K. Aguilera, and J. Li. Transaction chains: achieving serializability with low latency in geo-distributed storage systems. In Proc. of SOSP, 2013.
