The VLDB Journal (2015) 24:681–705
DOI 10.1007/s00778-014-0377-7

SPECIAL ISSUE PAPER

VLL: a lock manager redesign for main memory database systems

Kun Ren · Alexander Thomson · Daniel J. Abadi

Received: 14 February 2014 / Revised: 9 October 2014 / Accepted: 18 December 2014 / Published online: 4 January 2015
© Springer-Verlag Berlin Heidelberg 2015

Abstract Lock managers are increasingly becoming a bottleneck in database systems that use pessimistic concurrency control. In this paper, we introduce very lightweight locking (VLL), an alternative approach to pessimistic concurrency control for main memory database systems, which avoids almost all overhead associated with traditional lock manager operations. We also propose a protocol called selective contention analysis (SCA), which enables systems implementing VLL to achieve high transactional throughput under high-contention workloads. We implement these protocols both in a traditional single-machine multi-core database server setting and in a distributed database where data are partitioned across many commodity machines in a shared-nothing cluster. Furthermore, we show how VLL and SCA can be extended to enable range locking. Our experiments show that VLL dramatically reduces locking overhead and thereby increases transactional throughput in both settings.

Keywords Lightweight locking · Main memory · Lock manager · Deterministic · Contention · Scalability

1 Introduction

As the price of main memory continues to drop, increasingly many transaction processing applications keep the

K. Ren (B) · D. J. Abadi
Yale University, New Haven, CT, USA
e-mail: [email protected]

D. J. Abadi
e-mail: [email protected]

A. Thomson
Google, New York, NY, USA
e-mail: [email protected]

bulk (or even all) of their active datasets in main memory at all times. This has greatly improved performance of OLTP database systems, since disk IO is eliminated as a bottleneck.

As a rule, when one bottleneck is removed, others appear. In the case of main memory database systems, one common bottleneck is the lock manager, especially under workloads with high contention. One study reported that 16–25% of transaction time is spent interacting with the lock manager in a main memory DBMS [12]. However, these experiments were run on a single-core machine with no physical contention for lock data structures. Other studies show even larger amounts of lock manager overhead when there are transactions running on multiple cores competing for access to the lock manager [14,22,29]. As the number of cores per machine continues to grow, lock managers will become even more of a performance bottleneck.

Although locking protocols are not implemented in a uniform way across all database systems, the most common way to implement a lock manager is as a hash table that maps each lockable record's primary key to a linked list of lock requests for that record [2,4,5,11,34]. This list is typically preceded by a lock head that tracks the current lock state for that item. For thread safety, the lock head generally stores a mutex object, which is acquired before lock requests and releases to ensure that adding or removing elements from the linked list always occurs within a critical section. Every lock release also invokes a traversal of the linked list for the purpose of determining what lock request should inherit the lock next.
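For concreteness, such a lock table might be declared roughly as follows. This is a minimal C++ sketch, not the implementation of any particular system; all type and field names are illustrative.

#include <list>
#include <mutex>
#include <string>
#include <unordered_map>

enum class LockMode { Shared, Exclusive };

// One pending or granted request in a record's lock request queue.
struct LockRequest {
  int txn_id;
  LockMode mode;
  bool granted;
};

// The "lock head" guarding a single record's request list.
struct LockHead {
  std::mutex latch;                 // acquired around every request/release
  std::list<LockRequest> requests;  // granted requests first, then waiters
};

// The lock manager: each lockable record's primary key maps to its
// lock head and linked list of lock requests.
std::unordered_map<std::string, LockHead> lock_table;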

These hash table lookups, latch acquisitions, and linked list operations are main memory operations and would therefore be a negligible component of the cost of executing any transaction that accesses data on disk. In main memory database systems, however, these operations are not negligible. The additional memory accesses, cache misses, CPU cycles, and critical sections invoked by lock manager operations can approach or exceed the costs of executing the actual transaction logic. Furthermore, as the increase in cores and processors per server leads to an increase in concurrency (and therefore lock contention), the size of the linked list of transaction requests per lock increases—along with the associated cost to traverse this list upon each lock release.

We argue that it is therefore necessary to revisit the design of the lock manager in modern main memory database systems. In this paper, we explore two major changes to the lock manager. First, we move all lock information away from a central locking data structure, instead co-locating lock information with the raw data being locked (as suggested in the past [8]). For example, a tuple in a main memory database is supplemented with additional (hidden) attributes that contain the row-level lock information for that tuple. Therefore, a single memory access retrieves both the data and lock information in a single cache line, potentially removing additional cache misses.

Second, we remove all information about which transactions have outstanding requests for particular locks from the lock data structures. Therefore, instead of a linked list of requests per lock, we use a simple semaphore containing the number of outstanding requests for that lock (alternatively, two semaphores—one for read requests and one for write requests). After removing the bulk of the lock manager's main data structure, it is no longer trivial to determine which transaction should inherit a lock upon its release by a previous owner. One key contribution of our work is therefore a solution to this problem. Our basic technique is to force all locks to be requested by a transaction at once, and to order the transactions by the order in which they request their locks. We use this global transaction order to figure out which transaction should be unblocked and allowed to run as a consequence of the most recent lock release.

The combination of these two techniques—which we call very lightweight locking (VLL)—incurs far less overhead than maintaining a traditional lock manager, but it also tracks less total information about contention between transactions. Under high-contention workloads, this can result in reduced concurrency and poor CPU utilization. To ameliorate this problem, we also propose an optimization called selective contention analysis (SCA), which—only when needed—efficiently computes the most useful subset of the contention information that is tracked in full by traditional lock managers at all times.

Our experiments show that VLL dramatically reduces lock management overhead, both in the context of a traditional database system running on a single (multi-core) server, and when used in a distributed database system that partitions data across machines in a shared-nothing cluster.

In such partitioned systems, the distributed commit protocol (typically two-phase commit) is often the primary bottleneck, rather than the lock manager. However, recent work on deterministic database systems such as Calvin [36,37] has shown how two-phase commit can be eliminated for distributed transactions, increasing throughput by up to an order of magnitude—and consequently reintroducing the lock manager as a major bottleneck. Fortunately, deterministic database systems like Calvin lock all data for a transaction at the very start of executing the transaction. Since this element of Calvin's execution protocol satisfies VLL's lock request ordering requirement, VLL fits naturally into the design of deterministic systems. When we compare VLL (implemented within the Calvin framework) against Calvin's native lock manager, which uses the traditional design of a hash table of request queues, we find that VLL enables an even greater throughput advantage than that which Calvin has already achieved over traditional nondeterministic execution schemes in the presence of distributed transactions.

We also propose an extension of VLL (called VLLR) that locks ranges of rows rather than individual rows. Experiments show that VLLR outperforms two traditional range locking mechanisms.

2 Very lightweight locking

The category of "main memory database systems" encompasses many different database architectures, including single-server (multi-processor) architectures and a plethora of emerging partitioned system designs. The VLL protocol is designed to be as general as possible, with specific optimizations for the following architectures:

– Multiple threads execute transactions on a single-server, shared memory system.

– Data are partitioned across processors (possibly spanning multiple independent servers). At each partition, a single thread executes transactions serially.

– Data are partitioned arbitrarily (e.g., across multiple machines in a cluster); within each partition, multiple worker threads operate on data.

The third architecture (multiple partitions, each running multiple worker threads) is the most general case; the first two architectures are in fact special cases of the third. In the first, the number of partitions is one, and in the second, each partition limits its pool of worker threads to just one. For the sake of generality, we introduce VLL in the context of the most general case in the upcoming sections, but we also point out the advantages and trade-offs of running VLL in the other two architectures.

2.1 The VLL algorithm

The biggest difference between VLL and traditional lock manager implementations is that VLL stores each record's "lock table entry" not as a linked list in a separate lock table, but rather as a pair of integer values (CX, CS) immediately preceding the record's value in storage, which represent the number of transactions requesting exclusive and shared locks on the record, respectively. When no transaction is accessing a record, its CX and CS values are both 0.

In addition, a global queue of transaction requests (called TxnQueue) is kept at each partition, tracking all active transactions in the order in which they requested their locks.

When a transaction arrives at a partition, it attempts to request locks on all records at that partition that it will access in its lifetime. Each lock request takes the form of incrementing the corresponding record's CX or CS value, depending on whether an exclusive or shared lock is needed. Exclusive locks are considered to be "granted" to the requesting transaction if CX = 1 and CS = 0 after the request, since this means that no other shared or exclusive locks are currently held on the record. Similarly, a transaction is considered to have acquired a shared lock if CX = 0, since that means that no exclusive locks are held on the record.
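A minimal C++ sketch of this counter manipulation is shown below. The Record layout is hypothetical; the paper specifies only that the counters immediately precede the record's value. No atomics are needed because, as described next, all lock requests happen inside a critical section.

#include <cstdint>

// A record with its lock state co-located: the two counters sit
// immediately before the value, so one memory access can pull both
// into cache. The payload size is chosen arbitrarily for illustration.
struct Record {
  uint32_t cx = 0;  // outstanding exclusive-lock requests (CX)
  uint32_t cs = 0;  // outstanding shared-lock requests (CS)
  char value[56];   // record payload
};

// Request a lock by incrementing a counter; the return value says
// whether the lock was granted immediately.
bool RequestExclusive(Record& r) {
  r.cx++;
  return r.cx == 1 && r.cs == 0;  // no other lock is held
}

bool RequestShared(Record& r) {
  r.cs++;
  return r.cx == 0;  // no exclusive lock is held
}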

Once a transaction has requested its locks, it is added to the TxnQueue. Both the requesting of the locks and the adding of the transaction to the queue happen inside the same critical section (so that only one transaction at a time within a partition can go through this step). In order to reduce the size of the critical section, the transaction attempts to figure out its entire read set and write set in advance of entering this critical section. This process is not always trivial and may require some exploratory actions. Furthermore, multi-partition transaction lock requests have to be coordinated. This process is discussed further in Sect. 2.4.

Upon leaving the critical section, VLL decides how to proceed based on two factors:

– Whether or not the transaction is local or distributed. A local transaction is one whose read and write sets include records that all reside on the same partition; distributed transactions may access a set of records spanning multiple data partitions.

– Whether or not the transaction successfully acquired all of its locks immediately upon requesting them. Transactions that acquire all locks immediately are termed free. Those which fail to acquire at least one lock are termed blocked.

VLL handles each transaction differently based on whether it is free or blocked:

– Free transactions are immediately executed. Once completed, the transaction releases its locks (i.e., it decrements every CX or CS value that it originally incremented) and removes itself from the TxnQueue.¹ Note, however, that if the free transaction is distributed, then it may have to wait for remote read results, and therefore may not complete immediately.

– Blocked transactions cannot execute fully, since not all locks have been acquired. Instead, these are tagged in the TxnQueue as blocked. Blocked transactions are not allowed to begin executing until they are explicitly unblocked by the VLL algorithm.

In short, all transactions—free and blocked, local and distributed—are placed in the TxnQueue, but only free transactions begin execution immediately.
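Continuing the sketch above (and reusing its hypothetical Record helpers), the request-and-enqueue step might look like the following; note that the critical section covers both the counter increments and the queue insertion.

#include <deque>
#include <mutex>
#include <vector>

enum class TxnState { Free, Blocked };

struct Txn {
  std::vector<Record*> read_set;   // records to be read-locked
  std::vector<Record*> write_set;  // records to be write-locked
  TxnState state = TxnState::Free;
};

std::deque<Txn*> txn_queue;  // the TxnQueue for this partition
std::mutex queue_latch;      // guards lock requests + enqueue together

// Request every lock and enter the TxnQueue in one critical section,
// then classify the transaction as free or blocked.
void BeginTransaction(Txn* t) {
  std::lock_guard<std::mutex> guard(queue_latch);
  bool all_granted = true;
  for (Record* r : t->write_set) all_granted &= RequestExclusive(*r);
  for (Record* r : t->read_set) all_granted &= RequestShared(*r);
  t->state = all_granted ? TxnState::Free : TxnState::Blocked;
  txn_queue.push_back(t);  // free txns execute now; blocked ones wait
}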

Since there is no lock management data structure to record which transactions are waiting for data locked by other transactions, there is no way for a transaction to hand over its locks directly to another transaction when it finishes. An alternative mechanism is therefore needed to determine when blocked transactions can be unblocked and executed. One possible way to accomplish this is for a background thread to examine each blocked transaction in the TxnQueue and examine the CX and CS values of each data item for which the transaction requested a lock. If the transaction incremented CX for a particular item, and now CX is down to 1 and CS is 0 for that item (indicating that no other active transactions have locked that item), then the transaction clearly has an exclusive lock on it. Similarly, if the transaction incremented CS and now CX is down to 0, the transaction has a shared lock on the item. If all data items that it requested are now available, the transaction can be unblocked and executed.

The problem with this approach is that if another transaction entered the TxnQueue and incremented CX for the same data item that a transaction blocked in the TxnQueue already incremented, then both transactions will be blocked forever, since CX will always be at least 2.

Fortunately, this situation can be resolved by a simple observation: a blocked transaction that reaches the front of the TxnQueue will always be able to be unblocked and executed—no matter how large CX and CS are for the data items it accesses. To see why this is the case, note that each transaction requests all locks and enters the queue all within the same critical section. Therefore, if a transaction makes it to the front of the queue, this means that all transactions that requested their locks before it have now completed. Furthermore, all transactions that requested their locks after it will be blocked if their read and write sets conflict.

Since the front of the TxnQueue can always be unblocked and run to completion, every transaction in the TxnQueue will eventually be able to be unblocked. Therefore, in addition to reducing lock manager overhead, this technique also guarantees that there will be no deadlock within a partition. (We explain how distributed deadlock is avoided in Sect. 2.4.) Note that a blocked transaction now has two ways to become unblocked: either it makes it to the front of the queue (meaning that all transactions that requested locks before it have finished completely), or it becomes the only transaction remaining in the queue that requested locks on each of the keys in its read set and write set. We discuss a more sophisticated technique for unblocking transactions in Sect. 2.6.

¹ The transaction is not required to be at the front of the TxnQueue when it is removed. In this sense, TxnQueue is not, strictly speaking, a queue.

One problem that VLL sometimes faces is that as the TxnQueue grows in size, the probability of a new transaction being able to immediately acquire all its locks decreases, since the transaction can only acquire its locks if it does not conflict with any transaction in the entire TxnQueue.

We therefore artificially limit the number of transactions that may enter the TxnQueue—if the size exceeds a threshold, the system temporarily ceases to process new transactions and shifts its processing resources to finding transactions in the TxnQueue that can be unblocked (see Sect. 2.6). In practice, we have found that this threshold should be tuned depending on the contention ratio of the workload. High-contention workloads run best with smaller TxnQueue size limits, since the probability of a new transaction not conflicting with any element in the TxnQueue is smaller. A longer TxnQueue is acceptable for lower-contention workloads. In order to automatically account for this tuning parameter, we set the threshold not by the size of the TxnQueue, but rather by the number of blocked transactions in the TxnQueue, since high-contention workloads will reach this threshold sooner than low-contention workloads.

Figure 1 shows the pseudocode for the basic VLL algorithm (for simplicity, this pseudocode only considers local transactions). Each worker thread in the system executes the VLLMainLoop function. Figure 2 depicts an example execution trace for a sequence of transactions.

2.2 Arrayed VLL

As discussed above, VLL usually prefers to colocate the CX and CS values for each record with the record itself. Unlike traditional lock management data structures, which are typically implemented as linked lists, CX and CS are both simple integers and are therefore much easier to integrate into the data records themselves. This leads to improved cache/memory bandwidth utilization, as a single request from memory brings both the record and the lock information about the record into cache.

On the other hand, one disadvantage of this approach is that it spreads out lock information across the entire dataset. If there is code that only accesses lock information (without accessing the data itself), it is better to have all the lock information stored together, rather than interspersed with the data. One example of this is the deterministic lock manager of Calvin. In order to implement the deterministic locking algorithm described in previous work [35], Calvin sometimes puts its lock manager in a separate thread that is dedicated to just lock management. Since this thread never touches the raw data, it would be preferable to keep all lock management data together in order to ensure that the cache local to the core on which this lock manager thread runs is filled with lock data and not polluted with record data as well.

Fig. 1 Pseudocode for the VLL algorithm
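Since Figure 1 itself is not reproduced here, the following sketch approximates the overall shape of VLLMainLoop as the text describes it, reusing BeginTransaction and CanUnblock from the sketches above; the remaining helpers are stubs standing in for parts of the system not shown.

bool TxnQueueTooFull() { return false; }     // stub: blocked-txn threshold hit?
Txn* GetNewTxnRequest() { return nullptr; }  // stub: next client transaction
void ExecuteAndRelease(Txn* t) {}            // stub: run txn logic, then
                                             // decrement counters and dequeue

void VLLMainLoop() {
  for (;;) {
    if (TxnQueueTooFull()) {
      // Stop admitting new work; try to unblock the oldest transaction.
      if (!txn_queue.empty() && CanUnblock(txn_queue.front())) {
        txn_queue.front()->state = TxnState::Free;
        ExecuteAndRelease(txn_queue.front());
      }
    } else if (Txn* t = GetNewTxnRequest()) {
      BeginTransaction(t);  // request all locks + enqueue, atomically
      if (t->state == TxnState::Free) ExecuteAndRelease(t);
    }
  }
}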

Fig. 2 Example execution of a sequence of transactions {A, B, C, D, E} using VLL. Each transaction's read and write set is shown in the top left box. Free transactions are shown with white backgrounds in the TxnQueue, and blocked transactions are shown as black. Transaction logic and record values are omitted, since VLL depends only on the keys of the records in transactions' read and write sets

We therefore designed another implementation of VLL, which we call "arrayed VLL", which uses a vector or array data structure to store integer lock information consecutively. For example, in C++, we store all CX information in one vector and all CS information in a second vector to represent read–write locks and read locks, respectively. The ith element of each vector corresponds to the CX and CS values for the ith record.
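A sketch of the arrayed layout, under the same assumptions as the earlier sketches:

#include <cstddef>
#include <cstdint>
#include <vector>

// Arrayed VLL: the counters live in dense vectors separate from the
// records, so a dedicated lock manager thread touches only this data.
// The i-th element of each vector holds record i's counters.
std::vector<uint32_t> cx_array;  // exclusive-lock counters
std::vector<uint32_t> cs_array;  // shared-lock counters

bool RequestExclusiveArrayed(std::size_t i) {
  return ++cx_array[i] == 1 && cs_array[i] == 0;
}

bool RequestSharedArrayed(std::size_t i) {
  ++cs_array[i];
  return cx_array[i] == 0;
}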

The overhead of arrayed VLL is a little larger than that of the colocated version of VLL, since data and lock information are accessed in two separate requests; however, when the lock manager is running in a separate thread, this implementation is preferable, since the lock information being requested is often already in the cache of the core running the lock manager (especially when the number of records is small, or when the record accesses are skewed).

We will further compare the performance differences between arrayed VLL and colocated VLL in Sect. 3.

2.3 Single-threaded VLL

Thus far, we have discussed the most general version of VLL, in which multiple threads may process different transactions simultaneously within each partition. It is also possible to run VLL in single-threaded mode. Such an approach would be useful in H-Store style settings [33], where data are partitioned across cores within a machine (or within a cluster of machines), and there is only one thread assigned to each partition. These partitions execute independently of one another unless a transaction spans multiple partitions, in which case the partitions need to coordinate processing.

In the general version of VLL described above, once a thread begins executing a transaction, it does nothing else until the transaction is complete. For distributed transactions that perform remote reads, this may involve sleeping for some period of time while waiting for another partition to send the read results over the network. If single-threaded VLL were implemented simply by running only one thread (on each partition) according to the previous specification, the result would be a serial execution of transactions (within each partition). This wastes resources: when the thread needs to sleep while waiting for a message from a remote node, no other progress can be made within that partition.

In order to improve concurrency in single-threaded VLL implementations, we allow transactions to enter a third state (in addition to "blocked" and "free"). This third state, "waiting", indicates that a transaction was previously executing but could not complete without the result of an outstanding remote read request. When a transaction triggers this condition and enters the "waiting" state, the main execution thread puts it aside and searches for a new transaction to execute. Conversely, when the main thread is looking for a transaction to execute, in addition to considering the first transaction on the TxnQueue and any new transaction requests, it may resume execution of any "waiting" transaction which has received remote read results since entering the "waiting" state.

In other words, while there is still only one thread per partition, this thread now has the capability to work on more than one transaction at once—switching over to a different transaction (instead of sleeping) when waiting for messages from distributed nodes. Since there is still only one thread per partition, this implementation shares the H-Store advantage of not requiring latches/critical sections around lock acquisition. Figure 3 shows the pseudocode for the distributed single-threaded VLL algorithm (note the lack of critical sections relative to Fig. 1).

Fig. 3 Pseudocode for the single-threaded VLL algorithm
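The following standalone sketch condenses the idea; it is our approximation, not the paper's actual Figure 3 pseudocode. All helpers are stubs, and note that nothing here takes a latch.

#include <deque>
#include <vector>

enum class State { Free, Blocked, Waiting };
struct Txn { State state = State::Free; };

std::deque<Txn*> queue;     // this partition's TxnQueue
std::vector<Txn*> waiting;  // txns parked on outstanding remote reads

Txn* GetNewTxnRequest() { return nullptr; }            // stub
bool RequestAllLocks(Txn*) { return true; }            // stub
bool RemoteReadsArrived(const Txn*) { return false; }  // stub
void Resume(Txn*) {}                                   // stub
void Execute(Txn*) {}                                  // stub

void SingleThreadedMainLoop() {
  for (;;) {
    // 1. Resume any waiting transaction whose remote reads arrived.
    for (Txn* t : waiting)
      if (RemoteReadsArrived(t)) Resume(t);
    // 2. The front of the TxnQueue can always be unblocked and run.
    if (!queue.empty() && queue.front()->state == State::Blocked) {
      queue.front()->state = State::Free;
      Execute(queue.front());
    }
    // 3. Admit a new transaction; no critical section is needed,
    //    since this is the partition's only thread.
    if (Txn* t = GetNewTxnRequest()) {
      t->state = RequestAllLocks(t) ? State::Free : State::Blocked;
      queue.push_back(t);
      if (t->state == State::Free) Execute(t);
    }
  }
}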

2.4 Impediments to acquiring all locks at once

As discussed in Sect. 2.1, in order to guarantee that the head of the TxnQueue is always eligible to run (which has the added benefit of eliminating deadlocks), VLL requires that all locks for a transaction be acquired together in a critical section. There are two complications that make this nontrivial:

– The read and write sets of a transaction may not be known before running the transaction. An example of this is a transaction that updates a tuple that is accessed through a secondary index lookup. Without first doing the lookup, it is hard to predict what records the transaction will access—and therefore what records it must lock.

– Since each partition has its own TxnQueue and the critical section in which it is modified is local to a partition, different partitions may not begin processing transactions in the same order. This could lead to distributed deadlock, where one partition gets all its locks and activates a transaction, while that transaction is "blocked" in the TxnQueue of another partition.

In order to overcome the first problem, before the transaction enters the critical section, we allow it to perform whatever reads it needs to (at no isolation) in order to figure out what data it will access (for example, it performs the secondary index lookups). This can be done in the GetNewTxnRequest function that is called in the pseudocode shown in Figs. 1 and 3. After performing these exploratory reads, the transaction enters the critical section and requests the locks that it discovered it would likely need. Once the transaction gets its locks and is handed off to an execution thread, it runs as normal unless it discovers that it does not have a lock for something it needs to access (this could happen if, for example, the secondary index was updated immediately after the exploratory lookup was performed and now returns a different value). In such a scenario, the transaction aborts, releases its locks, and submits itself to the database system to be restarted as a completely new transaction.

There are two possible solutions to the second problem. The first is simply to allow distributed deadlocks to occur and to run a deadlock detection protocol that aborts deadlocked transactions. The second approach is to coordinate across partitions to ensure that multi-partition transactions are added to the TxnQueue in the same order on each partition.

As will be discussed further in Sect. 3, we tried both approaches and found that for high-contention workloads, the first solution is problematic, since the overhead of detecting and handling distributed deadlock completely negates the VLL advantage of reducing the overhead of lock management. Meanwhile, the second approach, while adding nontrivial coordination overhead, is still able to yield improved performance. For low-contention workloads, both approaches are viable.

However, recent work on deterministic database systems [35–37] shows that the coordination overhead of the second approach can be reduced by performing it before beginning transactional execution. In short, deterministic database systems such as Calvin order all transactions across partitions, and this order can be leveraged by VLL to avoid distributed deadlock. Furthermore, since deterministic systems have been shown to be a particularly promising approach in main memory database systems [35], the integration of VLL and deterministic database systems seems to be a particularly good match. We therefore used the second approach, with coordination happening before transaction execution, for our implementation in most experiments of this paper.

2.5 Trade-offs of VLL

VLL's primary strength lies in its extremely low overhead in comparison with that of traditional lock management approaches. VLL essentially "compresses" a standard lock manager's linked list of lock requests into two integers. Furthermore, by placing these integers inside the tuple itself, both the lock information and the data itself can be retrieved with a single memory access, minimizing total cache misses.

The main disadvantage of VLL is a potential loss in concurrency. Traditional lock managers use the information contained in lock request queues to figure out whether a lock can be granted to a particular transaction. Since VLL does not have these lock queues, it can only test more selective predicates on the state: (a) whether this is the only lock in the queue, or (b) whether it is so old that it is impossible for any other transaction to precede it in any lock queue.

As a result, it is common for scenarios to arise under VLL where a transaction cannot run even though it "should" be able to run (and would be able to run under a standard lock manager design). Consider, for example, the following sequence of transactions:

txn   Write set
A     x
B     y
C     x, z
D     z

Suppose A and B are both running in executor threads (and are therefore still in the TxnQueue) when C and D come along. Since transaction C conflicts with A on record x, and D conflicts with C on z, both are put on the TxnQueue in blocked mode. VLL's "lock table state" would then look like the following (as compared to the state of a standard lock table implementation):

VLL                      Standard
Key   Cx   Cs            Key   Request queue
x     2    0             x     A, C
y     1    0             y     B
z     2    0             z     C, D

TxnQueue: A, B, C, D

Next, suppose that A completes and releases its locks. The lock tables would then appear as follows:

VLL                      Standard
Key   Cx   Cs            Key   Request queue
x     1    0             x     C
y     1    0             y     B
z     2    0             z     C, D

TxnQueue: B, C, D

Since C appears at the head of all its request queues, a standard implementation would know that C could safely be run, whereas VLL is not able to determine that.

When contention is low, VLL's inability to immediately identify transactions that could potentially be unblocked is not costly. However, under higher-contention workloads, and especially when there are distributed transactions in the workload, VLL's resource utilization suffers, and additional optimizations are necessary. We discuss such optimizations in the next section.

2.6 Selective contention analysis (SCA)

For high-contention and high-percentage multi-partition workloads, VLL spends a growing percentage of CPU cycles in the state described in Sect. 2.5 above, in which no transaction can be found that is known to be safe to execute—whereas a standard lock manager would have been able to find one. In order to maximize CPU resource utilization, we introduce the idea of SCA.

SCA simulates the standard lock manager's ability to detect which transactions should inherit released locks. It does this by spending work examining contention—but only when CPUs would otherwise be sitting idle (i.e., the TxnQueue is full and there are no obviously unblockable transactions). SCA therefore enables VLL to selectively increase its lock management overhead when (and only when) it is beneficial to do so.

Any transaction in the TxnQueue that is in the "blocked" state conflicted, at the time it was added, with one of the transactions that preceded it in the queue. Since then, however, the transaction(s) that caused it to become blocked may have completed and released their locks. As the transaction gets closer and closer to the head of the queue, it therefore becomes much less likely to be "actually" blocked.

In general, the ith transaction in the TxnQueue can only conflict now with up to (i − 1) prior transactions, whereas it previously had to contend with up to TxnQueueSizeLimit prior transactions. Therefore, SCA starts at the front of the queue and works its way through the queue looking for a transaction to execute. The whole while, it keeps two bit arrays, DX and DS, each of size 100 kB (so that both will easily fit inside an L2 cache of size 256 kB) and initialized to all 0s. SCA then maintains the invariant that after scanning the first i transactions:

– DX[j] = 1 iff an element of one of the scanned transactions' write sets hashes to j
– DS[k] = 1 iff an element of one of the scanned transactions' read sets hashes to k

Therefore, if at any point the next transaction scanned (call it Tnext) has the properties:

– DX[hash(key)] = 0 for all keys in Tnext's read set
– DX[hash(key)] = 0 for all keys in Tnext's write set
– DS[hash(key)] = 0 for all keys in Tnext's write set

then Tnext does not conflict with any of the prior scanned transactions and can safely be run.²

In other words, SCA traverses the TxnQueue starting with the oldest transactions, looking for a transaction that is ready to run and does not conflict with any older transaction. Pseudocode for SCA is provided in Fig. 4.

Fig. 4 SCA pseudocode
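A self-contained C++ approximation of this scan (again, not the paper's actual Figure 4 pseudocode) is given below; the 100 kB arrays are modeled as std::bitsets, and the hash function is arbitrary.

#include <bitset>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

constexpr std::size_t kBits = 100 * 1024 * 8;  // two 100 kB bit arrays

struct Txn {
  std::vector<std::string> read_set, write_set;
  bool blocked = false;
};

std::size_t H(const std::string& key) {
  return std::hash<std::string>{}(key) % kBits;
}

// Scan the TxnQueue oldest-first; return the first blocked transaction
// that conflicts with no earlier transaction (nullptr if none exists).
Txn* RunSCA(std::vector<Txn*>& txn_queue) {
  static std::bitset<kBits> dx, ds;  // static: too large for the stack
  dx.reset();
  ds.reset();
  for (Txn* t : txn_queue) {
    if (t->blocked) {
      bool runnable = true;
      // Reads conflict with earlier writes; writes conflict with
      // earlier reads and earlier writes.
      for (const auto& k : t->read_set) runnable &= !dx[H(k)];
      for (const auto& k : t->write_set) runnable &= !dx[H(k)] && !ds[H(k)];
      if (runnable) return t;  // safe to unblock: no false positives
    }
    // Fold this transaction's footprint into the bit arrays before
    // scanning younger transactions.
    for (const auto& k : t->write_set) dx[H(k)] = true;
    for (const auto& k : t->read_set) ds[H(k)] = true;
  }
  return nullptr;
}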

SCA is actually "selective" in two different ways. First, it only gets activated when it is really needed (in contrast to a traditional lock manager, which always pays the cost of tracking lock contention, even when this information will not end up being used). Second, rather than doing an expensive all-to-all conflict analysis between active transactions (which is what traditional lock managers track at all times), SCA is able to limit its analysis to those transactions that are (a) most likely to be able to run immediately and (b) least expensive to check.

² Although there may be some false negatives (in which an "actually" runnable transaction is still perceived as blocked) due to the need to hash the entire key space into a 100 kB bitstring, this algorithm gives no false positives.

In order to improve the performance of our implementation of SCA, we include a minor optimization that reduces the CPU overhead of running SCA. Each key needs to be hashed into the 100 kB bitstring, but hashing every key for each transaction as we iterate through the TxnQueue can be expensive. We therefore cache (inside the transaction state) the results of the hash function the first time SCA encounters a transaction. If that transaction is still in the TxnQueue the next time SCA iterates through the queue, the algorithm may then use the saved list of offsets corresponding to the keys read and written by that transaction to set the appropriate bits in the SCA bitstring, rather than having to re-hash each key.

2.7 Very lightweight locking of ranges (VLLR)

For workloads that frequently read, write, or delete many consecutive rows within the same transaction, it can be useful to lock ranges of rows rather than many individual "point" rows. This technique is particularly useful in avoiding phantoms [26] and in the common use case of deleting an entity whose data rows all share a common prefix of their primary key. Some database systems, such as Spanner, use range locks exclusively in place of hash table-based "point" locks [7]. This section describes an extension of VLL (called VLLR for brevity) that locks ranges of rows rather than individual rows.

VLLR works by locking bitstring prefixes. When locking a range of primary keys, the range is represented first as a range R of (lexicographically sorted) bitstrings, and then a set P of bitstrings is generated such that for each bitstring K of the range, there exists an element p ∈ P that is a prefix of K. One simple way to generate P is to choose the set consisting of the longest common prefix of the minimum and maximum bitstrings in R.³
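The simple P-construction just described amounts to a longest-common-prefix computation. A sketch, treating keys as bitstrings with one character per bit purely for illustration:

#include <algorithm>
#include <cstddef>
#include <string>

// Longest common prefix of the range endpoints; the resulting single
// prefix covers every key in [lo, hi], and possibly many more.
std::string LongestCommonPrefix(const std::string& lo,
                                const std::string& hi) {
  std::size_t n = std::min(lo.size(), hi.size());
  std::size_t i = 0;
  while (i < n && lo[i] == hi[i]) ++i;
  return lo.substr(0, i);
}

// Footnote 3's example: LongestCommonPrefix("00111100", "01000010")
// returns "0", i.e., this method locks half of an 8-bit key space.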

Given a set P of prefixes to lock, VLLR proceeds in a manner similar to hierarchical locking [10]. In traditional hierarchical locking, intention locks are acquired from coarse to fine granularity before the actual lock is acquired on the target key—for example, intention locks may be acquired on the database, then the table, then the page before an actual lock for a row is requested. VLLR's approach is analogous in that it requests intention locks on all nonempty prefixes of each element in P.

³ This can sometimes result in locking a much larger part of the key space than needed. For example, in an 8-bit key space, locking the range R = [00111100, 01000010] with this method results in locking all keys with prefix 0xxxxxxx—half the key space rather than the necessary 7/256 of the key space. In this particular case, P could instead consist of the prefixes 001111xx, 0100000x, and 01000010, thus locking exactly the rows in R. Or P could consist of 001111xx and 010000xx, locking the rows in R plus one additional row (01000011). Under some workloads, locking larger ranges than necessary may incur expensive and unnecessary contention costs, justifying the costs of finer-grained R → P transformations. Under other workloads, coarser-grained prefix locking may be acceptable and users may profit from the reduced locking overhead.

Fig. 5 Lock mode conflicts for VLLR (Xs indicate conflict):

      Cx   Cs   Ix   Is
Cx    X    X    X    X
Cs    X         X
Ix    X    X
Is    X

To manage lock states, VLLR uses four counters per key: Cx, Cs, Ix, and Is. When requesting a lock on a prefix p ∈ P, Cx[p] or Cs[p] is incremented (depending on whether the request is for an exclusive or shared lock), and for each nonempty strict prefix pj of p, Ix[pj] or Is[pj] is incremented, respectively. For each incremented counter, all conflicting counters are checked (see Fig. 5 for a table of which pairs of counters conflict) to determine whether the lock is acquired.
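A sketch of the exclusive-lock path under these rules, with the conflict checks taken from the Fig. 5 matrix; the shared path is symmetric and omitted, the release path is the mirror-image decrement, and all names are illustrative.

#include <cstdint>
#include <string>
#include <unordered_map>

// Per-prefix lock state for VLLR.
struct RangeLockState {
  uint32_t cx = 0, cs = 0;  // actual exclusive / shared locks
  uint32_t ix = 0, is = 0;  // intention exclusive / shared locks
};

std::unordered_map<std::string, RangeLockState> range_locks;

// Request an exclusive lock on prefix p: bump Cx[p] and Ix on every
// nonempty strict prefix of p, checking conflicts as we go.
bool RequestRangeExclusive(const std::string& p) {
  bool granted = true;
  RangeLockState& s = range_locks[p];
  s.cx++;
  // Cx conflicts with every other counter (including another Cx).
  granted &= (s.cx == 1 && s.cs == 0 && s.ix == 0 && s.is == 0);
  for (std::size_t len = 1; len < p.size(); ++len) {
    RangeLockState& a = range_locks[p.substr(0, len)];
    a.ix++;
    // Ix conflicts with Cx and Cs at the ancestor prefix.
    granted &= (a.cx == 0 && a.cs == 0);
  }
  return granted;
}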

Note that the lock management overhead of VLLR is bounded by one increment/decrement for each bit in the combined elements of the union of the P sets for each transaction. Overlapping lock ranges incur no extra lock management work, as they do in traditional range locking mechanisms where ranges may have to be split.

The primary difference between VLLR and VLL is the maintenance of two extra counters, corresponding to Ix and Is. Therefore, it is straightforward to apply SCA to VLLR in a manner analogous to how it is applied for VLL—SCA simply has to add two extra bitmaps corresponding to Ix and Is (note, however, that SCA for VLLR will have to update more bits within the bitstrings per transaction to account for the additional prefixes that must be locked). Figure 6 shows the pseudocode for VLLR. Figure 7 depicts an example execution trace for a sequence of transactions.

Fig. 6 Pseudocode for the VLLR algorithm

Fig. 7 Example execution of a sequence of transactions {A, B, C, D} using the VLLR algorithm. Each transaction requests an exclusive lock on a range. Currently executing transactions are shown with white backgrounds in the TxnQueue, and blocked transactions are shown as black

3 Experimental evaluation

To evaluate VLL and SCA, we ran several experiments comparing VLL (with and without SCA) against alternative schemes in a number of contexts. We describe our experimental setup in Sect. 3.1, and the remainder of this section presents our experimental results.

3.1 Experimental setup

In this section, we describe our experimental setup. In order to fully evaluate the VLL protocol, we implement three different versions of VLL (one single-machine version and two distributed versions) and compare each version to systems of corresponding designs using traditional locking methods. In total, we implemented and evaluated nine systems. Section 3.1.1 describes each of these nine systems. After describing these systems, we describe the benchmarks used to evaluate them in Sect. 3.1.2.

3.1.1 Compared systems

We separate our experiments into two groups: single-machine experiments and experiments in which data are partitioned across multiple machines in a shared-nothing cluster.

In our single-machine experiments, we ran VLL (exactly as described in Sect. 2.1) in a multi-threaded environment on a multi-processor machine. As a comparison point, we implemented a traditional two-phase locking (2PL) protocol inside the same main memory database system prototype. This allows for an apples-to-apples comparison in which the only difference is the locking protocol.

For our distributed, shared-nothing cluster experiments, we considered two different partitioning implementations: one where each multicore machine in the cluster contains a partition, and the multi-threaded version of VLL runs on each machine (as described in Sect. 2.4); and one "H-Store-style" implementation where each core on each machine has its own separate partition, and each partition runs all transactions in a single worker thread (as described in Sect. 2.3).

As discussed earlier, distributed versions of VLL are susceptible to entering distributed deadlock states. We presented two potential solutions to this problem: (a) detecting deadlocks by analyzing transaction dependencies and aborting transactions to break cycles, and (b) avoiding deadlocks via global coordination of lock acquisition order. We built and experimented with both options.

We implement the first solution in the context of a traditional System-R* design [27], using two-phase commit (2PC) for distributed transactions, but with the lock manager on each machine replaced by our VLL implementation from Sect. 2.1. As a comparison point, we compare against the exact same traditional System-R* design, with the same mechanisms for detecting and resolving distributed deadlock (discussed later in this section), but using a traditional hash table-based lock manager implementing the two-phase locking protocol instead of VLL.

The second option can leverage the Calvin design (which, as described in Sect. 2.4, also avoids distributed deadlock by creating a global order of all transactions). Since globally ordering transactions also has the nice property of enabling deterministic transaction execution, and 2PC is not necessary in deterministic systems [37], we allow our implementation of the second option to take advantage of this property as well and eschew 2PC. We therefore compare this implementation with the original version of Calvin. The two implementations use the exact same code for everything but lock management—including the code for choosing the global order of transactions, the code for deterministic transactional execution without two-phase commit, and the code for scheduling transactions to run in worker threads once all locks have been successfully acquired. The only difference is that we replaced Calvin's hash table-based lock manager with our VLL implementation. Since Calvin runs all lock management operations in a separate thread from the worker threads that execute transaction logic, we use the array-based implementation of VLL described in Sect. 2.2.

In summary, we compare four different distributed systems that contain multiple threads per partition: a traditional R* design (1) with and (2) without VLL, and a Calvin design (3) with and (4) without VLL. The traditional R* design uses 2PC and implements deadlock detection and resolution (using a waits-for graph—see below), whereas the Calvin design avoids deadlocks and processes distributed transactions without 2PC.

For the distributed "H-Store"-style design, where each core on each machine has its own separate partition and each partition contains just one thread, we implement and compare three different implementations. First, we implement VLL (exactly as described in Sect. 2.3) directly in the Calvin codebase. Each partition, running on each core, runs the single-threaded version of VLL whose pseudocode was given in Fig. 3.

We compare our single-threaded VLL implementation against two similar alternatives. First, we compare against an H-Store implementation⁴ [33] that also partitions data across cores (so that an 8-server cluster, where each server has five cores devoted to processing transactions, is partitioned into 40 partitions) and executes all transactions at each partition serially within a single thread, removing the need for locking or latching of shared data structures. H-Store therefore has the advantage of not requiring any overhead for lock management, but the disadvantage of not allowing any concurrency within a partition (and for distributed transactions it must sit and wait for remote messages instead of working on other transactions in the interim).

Second, we compare against a less extreme version of H-Store that uses a traditional lock manager inside each partition to enable concurrent transaction execution within a partition, so that the system does not need to sit idle while waiting for remote messages as part of distributed transactions. This implementation should therefore see improved performance for workloads that contain many distributed transactions, at the cost of increased overhead for lock management.

⁴ Although we refer to this system as "H-Store" in the discussion that follows, we actually implemented the H-Store protocol within the Calvin framework in order to provide as fair a comparison as possible.

Since we built each of these three "H-Store-style" implementations inside the Calvin code base, they are all deterministic, and all can avoid 2PC for distributed transactions (just like the latter two implementations for the multi-threaded experiments). The only difference is lock management. Regular H-Store uses no locks and serializes transactions within each partition, lock manager-based H-Store uses a traditional lock manager in each partition, and VLL-based H-Store uses the single-threaded VLL implementation from Sect. 2.3.

We spent a lot of time investigating multiple different deadlock detection and resolution protocols. Although we used a timeout-based protocol in an earlier version of this paper [30], we found that the waits-for graph implementation from Gray [9] and Stonebraker [32] performs better in practice. We therefore carefully implemented the waits-for graph protocol for deadlock detection and also optimized the "blocked transactions threshold" in our system in order to reduce deadlock by limiting the number of blocked transactions allowed in the system. We thereby managed to substantially improve the performance of our prototype for those implementations in which deadlock is possible, relative to our previous work [30].

Our prototype is implemented in C++. All the experiments measuring throughput were conducted on Amazon EC2 using Double Extra Large instances (m3.2xlarge), which promise 30 GB of memory and 26 EC2 Compute Units—eight virtual cores with 3.25 EC2 Compute Units each (where each EC2 Compute Unit provides roughly the CPU capacity of a 1.0–1.2 GHz 2007 Opteron or 2007 Xeon processor). Experiments were run on a shared-nothing cluster of eight of these Double Extra Large EC2 instances, unless stated otherwise.

In order to minimize the effects of irrelevant components of the database system on our results, we devote three out of eight cores on every machine to those components that are completely independent of the locking scheme (e.g., client load generation, performance monitoring instrumentation, intra-process communications, input logging, etc.) and devote the remaining five cores to worker threads and lock management threads. For all techniques that we compare in the experiments, we tuned the worker thread pool size by hand, increasing the number of worker threads until throughput decreased due to too much thread contention.

For single-threaded system implementations, we leveraged these five virtual cores by creating five data partitions per machine, so that every execution engine used one core to handle transactions associated with its partition.

For configurations where data were not partitioned within a machine, we devoted all five virtual cores to worker threads when running the distributed 2PL protocol with deadlock detection. In the Calvin-based deadlock-free mechanisms, we dedicated one core entirely to the lock manager thread, leaving four cores for worker threads.

3.1.2 Benchmarks

The first set of experiments we present in this paper uses the same microbenchmark experimented with in several recently published papers [30,37]. Each microbenchmark transaction reads 10 records and updates a value at each record. Of the 10 records accessed, one is chosen from a small set of 'hot' records, and the rest are chosen from a larger set of 'cold' records. Contention levels between transactions can be finely tuned by varying the size of the set of hot records. In the subsequent discussion, we use the term contention index to refer to the probability of any two transactions conflicting with one another. Therefore, for this microbenchmark, if the number of hot records is 1,000, the contention index would be 0.001. If there is only one hot record (which would then be modified by every single transaction), the contention index would be 1. The set of cold records is always large enough that transactions are extremely unlikely to conflict due to accessing the same cold record.
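The access pattern can be summarized by a small sketch of a hypothetical transaction generator; with num_hot hot records, two transactions collide on the hot set with probability 1/num_hot, which is exactly the contention index.

#include <cstdint>
#include <random>
#include <vector>

// Choose the record ids for one microbenchmark transaction: one hot
// record plus nine cold ones. Hot records occupy ids [0, num_hot);
// cold records follow. Duplicate cold picks are ignored for brevity.
std::vector<uint64_t> ChooseRecordIds(uint64_t num_hot, uint64_t num_cold,
                                      std::mt19937_64& rng) {
  std::vector<uint64_t> ids;
  ids.push_back(rng() % num_hot);  // the single hot record
  for (int i = 0; i < 9; ++i)      // nine cold records
    ids.push_back(num_hot + rng() % num_cold);
  return ids;
}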

The transactions in this microbenchmark are short and simple—reading and updating 10 records can be performed very quickly. A relatively high percentage of transaction time is spent acquiring locks, since 10 separate data items must be locked, and once a lock is acquired, very little time is spent performing the required actions on the locked item (a simple read and update). We therefore refer to this transactional workload as our "short" microbenchmark.

In order to experiment with workloads where a smaller percentage of transaction time is spent acquiring locks, we introduce an altered version of the microbenchmark where 10 ms of CPU computation must be performed on each data item (in addition to the read and update). We call this version of the microbenchmark the "long" microbenchmark; each transaction in this benchmark takes approximately 150 ms in total (which is similar to the New Order transaction in the TPC-C benchmark). Transactions in the "long" microbenchmark take approximately three times longer than "short" microbenchmark transactions (150 vs. 50 ms).

To expand our experiments beyond our simple microbenchmarks, we also implement the full TPC-C benchmark. The TPC-C benchmark models the OLTP workload of an eCommerce order processing system. TPC-C consists of a mix of five transactions (New Order 45%, Payment 43%, Order Status 4%, Stock Level 4%, and Delivery 4%) with different properties, and it contains both read–write heavy transactions and read-only transactions. Since the transaction logic of TPC-C is complex and the contention index is high, our TPC-C experiments are most similar to our experiments on the "long" microbenchmark under high contention.

3.2 Multi-core, single-server experiments

This section compares the performance of VLL against two-phase locking. For VLL, we analyze performance with and without the SCA optimization. We also implement two versions of 2PL: a "traditional" implementation that detects deadlocks and aborts deadlocked transactions, and a deadlock-free variant of 2PL in which a transaction places all of its lock requests in a single atomic step (where the data that must be locked are determined in an identical way as in VLL, as described in Sect. 2.4). However, this modified version of 2PL still differs from VLL in that it uses a traditional lock management data structure.

As a baseline for all four systems, we include a "no locking" scheme, which represents the performance of the same system with all locking completely removed (and any isolation guarantees completely forgone). This allows us to clearly see the overhead of acquiring and releasing locks, maintaining lock management data structures for each scheme, and waiting for blocked transactions when there is contention.

We benchmark our system using both the "long" microbenchmark transactions and the "short" microbenchmark transactions described in Sect. 3.1.2.

Figure 8a shows the transactional throughput the system is able to achieve under the four alternative locking schemes for "long" microbenchmark transactions, and Fig. 8c shows the throughput for "short" microbenchmark transactions.

When contention is low (below 0.02), VLL (with andwithout SCA) yields near-optimal throughput. As contention

[Figure: plots of throughput (txns/sec) versus contention index for Local VLL with SCA, Local VLL, Local Standard 2PL, Local Deadlock-free 2PL, and Local No locking. Panels: (a) "long" transactions under a deadlock-free workload; (b) "long" transactions under a workload in which deadlocks are possible; (c) "short" transactions under a deadlock-free workload; (d) "short" transactions under a workload in which deadlocks are possible.]

Fig. 8 Transactional throughput versus contention under various microbenchmark workloads


As contention increases, however, the TxnQueue starts to fill up with blocked transactions, and the SCA optimization is able to improve performance relative to the basic VLL scheme by selectively unblocking transactions and "unclogging" the execution engine. This effect is much stronger for the "long" transaction microbenchmark (where SCA boosts VLL's performance by up to 41%) than for the "short" transaction microbenchmark, where the benefit of SCA is more muted. This is explained by the fact that transactions in the "short" microbenchmark are executed so quickly that SCA's ability to remove transactions from the TxnQueue slightly earlier than they would otherwise be removed upon reaching the head of the queue yields only a very small improvement.

Interestingly, although SCA is able to improve throughput relative to basic VLL at high contention levels, at extremely high contention levels SCA once again approaches basic VLL in performance (this effect is seen for both "long" transactions and "short" transactions). This is because at extremely high contention, almost every transaction conflicts with every other transaction, and SCA becomes unable to find transactions that can be removed early from the TxnQueue. Therefore, the shape of the SCA curve relative to the basic VLL curve is a bubble, with close to identical performance at low and extremely high contention, and an improvement observable only at medium-to-high contention levels.

At the very left-hand side of the figure, where contention is very low, transactions are always able to acquire all their locks and run immediately. Therefore, the difference between the "no locking" throughput and the throughput of each of the other implementations can be completely explained by the overhead of acquiring locks in each implementation. For the standard 2PL implementation, for "long" microbenchmark transactions, we see that the locking overhead of 2PL is about 22%. This number is consistent with previous measurements of locking overhead in main memory databases [12,22]. However, the overhead reaches 43% for the 2PL implementation for "short" microbenchmark transactions. This is because, as described in Sect. 3.1.2, for the "short" microbenchmark, a greater percentage of transaction time is spent acquiring locks. (The "long" transaction benchmark is more similar to the real-world workloads with which the above-cited studies experimented.)

Meanwhile, the overhead of VLL is only 1.5% for the "long" microbenchmark transactions and 10.2% for the "short" microbenchmark transactions, providing evidence of VLL's lighter-weight lock manager implementation relative to 2PL. By colocating lock information with data to increase cache locality and by representing lock information in only two integers per record, VLL is able to lock data with extremely low overhead. However, the multi-threaded version of VLL must still acquire and release locks in a critical section, and this causes a bottleneck that becomes visible for the "short" microbenchmark transactions.
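As a rough illustration of how little state this representation requires, the sketch below colocates two counters with each record, assuming they count outstanding exclusive and shared lock requests (the names cx and cs, and the Record layout, are illustrative assumptions, not the exact definitions from our implementation). A request that is not granted immediately would place the transaction in the TxnQueue.

```cpp
// Minimal sketch of VLL-style colocated lock state: two integers per
// record, interpreted here as counts of outstanding exclusive (cx)
// and shared (cs) lock requests. Illustrative only.
#include <cstdint>
#include <string>

struct Record {
  int32_t cx = 0;      // outstanding exclusive-lock requests
  int32_t cs = 0;      // outstanding shared-lock requests
  std::string value;   // raw data, colocated with the lock state
};

// Called inside the (latched) lock acquisition step; returns true if
// the lock is granted immediately.
bool RequestExclusive(Record& r) {
  ++r.cx;
  return r.cx == 1 && r.cs == 0;  // no earlier holder or requester
}

bool RequestShared(Record& r) {
  ++r.cs;
  return r.cx == 0;  // granted unless an exclusive request is pending
}

// Release just decrements a counter: there is no per-record request
// queue to latch and traverse, which is where the savings come from.
void ReleaseExclusive(Record& r) { --r.cx; }
void ReleaseShared(Record& r) { --r.cs; }
```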

The deadlock-free 2PL implementation also acquires and releases locks in a critical section. However, the critical section of the deadlock-free 2PL implementation is more expensive than the corresponding critical section of VLL, since the process of acquiring or releasing a lock in the deadlock-free 2PL implementation uses the much heavier-weight traditional hash-based lock manager data structure. For the "short" transaction microbenchmark, this critical section becomes such a bottleneck that the implementation consistently yielded extremely poor performance and was unaffected by the contention index.

As contention increases, the throughput of both VLL and 2PL decreases (since the system is "clogged" by blocked transactions), and the two schemes approach one another in performance as the additional information that the 2PL scheme keeps in the lock manager becomes increasingly useful (as more transactions block). However, even at high contention, the performance of VLL with SCA is similar to that of 2PL, since SCA can quickly construct the relevant part of the transactional data dependencies on the fly. VLL with the SCA optimization therefore has the advantage that it eliminates the lock manager overhead when possible, while still reconstructing the information stored in standard lock tables when this is needed to make progress on transaction execution.

Since this workload involves locking only one hot item per transaction, there is (approximately) no risk of transactions deadlocking.5 This hides a major disadvantage of 2PL relative to VLL: 2PL must detect and resolve deadlocks, while VLL does not, since VLL is always deadlock-free regardless of the transactional workload (due to the way it orders its lock acquisitions). In order to model other types of workloads in which multiple records per transaction can be contested (which can lead to deadlock for the traditional locking schemes), we increase the number of hot records per transaction and present the results in Fig. 8b, d. Although the presence of deadlocks reduces the performance of the 2PL-based locking implementations at high contention indexes (when deadlock becomes likely), our careful optimizations of the deadlock detection and resolution protocols in our prototype kept the effect of deadlock relatively small compared to the previous experiments. Therefore, Fig. 8b is similar to Fig. 8a, and Fig. 8d is similar to Fig. 8c. Nonetheless, it is clear that the modified workload that allows deadlocks does reduce the performance of 2PL, while VLL is unaffected by the change in workload.

5 The exception is that deadlock is possible in this workload if transactions conflict on cold items. This was rare enough, however, that we observed no deadlocks in our experiments.


[Figure: plots of throughput (txns/sec) versus % distributed transactions for Per-core VLL, Per-core VLL with SCA (colocated VLL), Per-core VLL with SCA (arrayed VLL), Per-core lock manager, and H-Store. Panels: (a) "short" transactions under low contention; (b) "short" transactions under high contention; (c) "long" transactions under low contention; (d) "long" transactions under high contention.]

Fig. 9 Microbenchmark throughput of partition-per-core systems, varying how many transactions span multiple partitions

3.3 Distributed database experiments

In this section, we examine the performance of VLL in a distributed (shared-nothing) database architecture. We start with some experiments on the microbenchmark and then analyze performance on TPC-C.

As described in Sect. 3.1.1, we implemented VLL both for systems with one partition per machine and for systems with one partition per core (in the style of H-Store). Section 3.3.1 presents our experiments with the partition-per-core systems, and Sect. 3.3.2 presents our experiments with the partition-per-machine systems. Section 3.3.3 then presents our experiments on TPC-C.

3.3.1 Partition-per-core experiments

In our first set of distributed experiments, we used the microbenchmark (both "long" transactions and "short" transactions) and varied both lock contention and the percentage of multi-partition transactions.

These experiments ran on eight machines, each of which had five cores devoted to transaction processing, so the partition-per-core systems (H-Store, per-core VLL, per-core lock manager) split data across 40 partitions, while the partition-per-machine systems (Calvin, 2PL + 2PC, Nondeterministic VLL + 2PC, Deterministic VLL) split data across eight partitions. For the low-contention tests, each partition contained 1,000,000 records, of which 10,000 were hot, yielding a contention index of 0.0001. For the high-contention tests, we reduced the number of hot records to 100 per partition, resulting in a contention index of 0.01.

Figure 9 presents the results of these experiments. When comparing the performance of VLL with and without the SCA optimization, it is clear that for both "short" transactions and "long" transactions, SCA is extremely important under high contention, and VLL's throughput is poor without it. The SCA optimization outperforms basic VLL by up to about 100% with "long" transactions and about 130% with "short" transactions. The benefit of SCA is thus much larger for these distributed system experiments than for the single-machine experiments presented in Sect. 3.2. This is because for the single-machine experiments, the head of the TxnQueue can always run (the only reason a transaction enters the TxnQueue is that it blocks waiting for locks, and once a transaction reaches the head of the queue, it is guaranteed to be able to acquire all of its locks), so progress can always be made by running transactions from the head of the queue. For the distributed database experiments, the head of the queue can be stalled, waiting for a remote message. Without SCA, the entire TxnQueue must wait for this remote message. SCA, however, is able to find other transactions in the TxnQueue to run while the first transaction is waiting for remote data.
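For intuition, the following is a simplified sketch of the kind of front-to-back scan an SCA pass performs over the TxnQueue: it accumulates the footprints of all transactions ahead in the queue and returns the first blocked transaction that conflicts with none of them. The Txn fields and the hash sets here are illustrative assumptions; the actual implementation uses much more compact conflict-tracking structures.

```cpp
// Simplified, illustrative SCA pass over the TxnQueue.
#include <deque>
#include <string>
#include <unordered_set>
#include <vector>

struct Txn {
  std::vector<std::string> read_set;
  std::vector<std::string> write_set;
  bool blocked = false;
};

Txn* RunSCA(const std::deque<Txn*>& txn_queue) {
  std::unordered_set<std::string> prior_reads, prior_writes;
  for (Txn* txn : txn_queue) {
    if (txn->blocked) {
      bool conflict = false;
      // A write conflicts with any earlier read or write of the key.
      for (const auto& k : txn->write_set)
        if (prior_reads.count(k) || prior_writes.count(k)) conflict = true;
      // A read conflicts only with earlier writes of the key.
      for (const auto& k : txn->read_set)
        if (prior_writes.count(k)) conflict = true;
      if (!conflict) return txn;  // safe to execute ahead of its turn
    }
    // Accumulate this transaction's footprint before moving on.
    for (const auto& k : txn->read_set) prior_reads.insert(k);
    for (const auto& k : txn->write_set) prior_writes.insert(k);
  }
  return nullptr;  // nothing can be unblocked right now
}
```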

Under low contention, however, the SCA optimization actually hinders performance slightly when there are more than 60% multi-partition transactions. We observe three reasons for this effect. First, under low contention, very few transactions are blocked waiting for locks, so SCA has a smaller number of potential transactions to unblock.

Second, since multi-partition transactions take longer than single-partition transactions, the TxnQueue typically contains many multi-partition transactions waiting for read results from other partitions. The higher the percentage of multi-partition transactions, the longer the TxnQueue tends to be (recall that the queue length is limited by the number of blocked transactions, not the total number of transactions). Since SCA iterates through the TxnQueue each time it is called, the overhead of each SCA run therefore increases with the multi-partition percentage.

Third, since low-contention workloads typically have more multi-partition transactions per blocked transaction in the TxnQueue, each blocked transaction has a higher probability of being blocked behind a multi-partition transaction. This further reduces SCA's ability to find transactions to unblock, since multi-partition transactions are slower to finish, and there is nothing SCA can do to accelerate a multi-partition transaction.

Despite all of these reasons for the reduced effectiveness of SCA under low contention, SCA is only slower than regular VLL by a very small amount. This is because SCA runs are only ever initiated when the CPU would otherwise be idle, so SCA comes with a cost only if the CPU would have left its idle state before SCA finishes its iteration through the TxnQueue. Overall, VLL with the SCA optimization performs similarly to regular VLL under low contention and up to 130% faster under high contention.

When comparing colocated VLL (where the lock information is stored inside the records themselves) and arrayed VLL (where the lock information is stored in separate vectors, as described in Sect. 2.2), we see that with fewer multi-partition transactions, the colocated version of VLL outperforms arrayed VLL, since arrayed VLL's extra memory accesses represent approximately 10% of the cost of each "short" microbenchmark transaction (and approximately 3–4% overhead for each "long" microbenchmark transaction) when the transaction executes entirely locally. However, as distributed transactions are added into the mix, the overhead of cross-partition communication and coordination begins to dominate execution costs, and the benefit of colocating lock counters with data becomes much less visible.
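The two layouts being compared can be sketched as follows (the types are illustrative); the point is simply that the colocated variant keeps the counters in the same record it guards, while the arrayed variant pays an extra memory access into separate vectors.

```cpp
// Illustrative sketch of the colocated versus arrayed VLL layouts.
#include <cstdint>
#include <string>
#include <vector>

// Colocated VLL: lock counters embedded in the record itself, so one
// cache-line fetch serves both locking and data access.
struct ColocatedRecord {
  int32_t cx = 0;
  int32_t cs = 0;
  std::string value;
};

// Arrayed VLL: data and lock counters stored in separate vectors,
// indexed by record id.
struct ArrayedStore {
  std::vector<std::string> values;  // values[i] is record i's data
  std::vector<int32_t> cx;          // cx[i] and cs[i] are record i's
  std::vector<int32_t> cs;          // lock counters
};
```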

Although the difference between arrayed VLL and colocated VLL is small, the difference between either VLL scheme and the non-VLL schemes is much larger. When compared to the per-core lock manager scheme, the VLL schemes improve performance by 10–30% when there are few distributed transactions, but the difference becomes smaller as more distributed transactions are added to the workload. Both the per-core lock manager implementation and the VLL implementation use locking to enable a transaction processing thread to work on other transactions while waiting for remote messages during a distributed transaction. Therefore, the only difference between these schemes is the locking overhead. This locking overhead difference is a greater percentage of total transaction execution time for local transactions and "short" transactions, but becomes less noticeable as transactions become longer or distributed.

The H-Store (serial execution) implementation performs extremely poorly as more multi-partition transactions are added to the workload. This is because H-Store partitions have no mechanism for concurrent transaction execution within a partition, so a partition must sit idle any time it depends on an outstanding remote read result to complete a multi-partition transaction.6 Since H-Store is designed for partitionable workloads (fewer than 10% multi-partition transactions), it is most interesting to compare H-Store and VLL at the left-hand side of the graph. Even at 0% multi-partition transactions, H-Store is unable to significantly outperform VLL, despite the fact that VLL acquires locks before executing a transaction, while H-Store does not. This observation further highlights the extremely low overhead of lock management with VLL.

Since Fig. 9 measured throughput at only two different contention levels, Fig. 10 presents the throughput of these systems when contention is varied more directly. We fix the percentage of multi-partition transactions at 20% for this experiment.

When comparing VLL with and without the SCA optimization, it is clear that SCA becomes increasingly important as contention increases.

6 Subsequent versions of H-Store proposed to speculatively execute transactions during this idle time [16], but this can lead to cascading aborts and wasted work if there would have been a conflict. We do not implement speculative execution in our H-Store prototype, since H-Store is designed for workloads with small percentages of multi-partition transactions (for which speculative execution is not necessary), and our purpose for including it as a comparison point is only to analyze how it compares to VLL on these target single-partition workloads.


[Figure: plot of throughput (txns/sec) versus contention index for Per-core VLL, Per-core VLL with SCA, Per-core lock manager, and H-Store.]

Fig. 10 Throughput of partition-per-core systems while varying contention

However, SCA always outperforms basic VLL, even at low contention. Interestingly, we do not see the decrease in the effectiveness of the SCA optimization at extremely high contention levels that we saw in the single-machine experiments in Fig. 8. This is because, as explained above, in the presence of distributed transactions, the head of the TxnQueue may stall waiting for remote messages, and SCA's ability to find other transactions to work on while the head of the queue is stalled is extremely important.

As contention increases, the throughput difference between VLL/SCA and the per-core lock manager becomes smaller, and under very high contention, the per-core lock manager eventually slightly outperforms VLL/SCA. This inflection point exemplifies the conditions under which the benefit of fully tracking contention data at all times outweighs the cost of maintaining standard lock queues. To the right of this point, information contained in the lock manager can be used to unblock transactions much more quickly than VLL can, and this advantage outweighs the additional locking costs.

H-Store is unaffected by contention, since it processes transactions serially.

3.3.2 Partition-per-machine experiments

Figure 11 shows the results of our partition-per-machine experiments. As described in Sect. 3.1.1, we experiment with four systems: (1) a traditional System-R* design for a distributed database system that uses two-phase locking (with distributed deadlock detection and resolution) for concurrency control and two-phase commit for committing distributed transactions, which we call "2PL + 2PC"; (2) an identical version of the System-R* implementation with the traditional hash-based lock manager replaced by VLL (with the SCA optimization), which we call "Nondeterministic VLL with SCA + 2PC"; (3) an implementation of Calvin that uses determinism to avoid both deadlock and 2PC for distributed transactions, which we call "Deterministic Locking (Calvin)"; and (4) an identical version of Calvin with its traditional hash-based lock manager replaced by VLL (with the SCA optimization), which we call "Deterministic VLL with SCA".

When comparing the nondeterministic System-R*-design systems ("2PL + 2PC" vs. "Nondeterministic VLL with SCA + 2PC"), it is clear that VLL provides a greater benefit for "short" transactions than for "long" transactions. This is because, as explained above, "short" transactions spend a greater percentage of transaction time acquiring and releasing locks, so reducing the overhead of the lock manager yields greater benefits for "short" transactions. This difference is only visible at low contention and a low percentage of distributed transactions. At high contention, both nondeterministic systems have problems with distributed deadlock and perform similarly poorly. At a high percentage of distributed transactions, both systems are limited by 2PC and the costs of generating and processing network messages, and again the locking overhead is less of a bottleneck.

When comparing the deterministic systems to each other ("Deterministic Locking (Calvin)" vs. "Deterministic VLL"), the benefit of VLL is much greater for "short" transactions. This is because Calvin devotes an entire core to lock management and serializes lock acquisitions within one thread running on this core (to ensure its deterministic guarantees). For short transactions, the worker threads running on the other cores overwhelm the lock manager thread with lock requests, and this thread is unable to keep up with demand, becoming a massive bottleneck. Therefore, Calvin is almost totally unaffected by the percentage of distributed transactions in Fig. 11a: the lock manager is the bottleneck. By eliminating this bottleneck, VLL yields up to a factor of 2.3 improvement in performance. For long transactions, however, the lock manager thread in Calvin is never a bottleneck. Therefore, the performance of Calvin and VLL is identical for "long" transactions (although the core running the lock manager thread is significantly under-utilized in the VLL implementation).

At high contention and a high percentage of multi-partition transactions, the performance of "Deterministic Locking (Calvin)" and "Deterministic VLL" becomes more similar, since throughput is limited by contention on data rather than by lock manager overhead.

Although this paper does not focus on comparing deterministic and nondeterministic database systems, it is nonetheless interesting to note that the deterministic systems generally perform better than the nondeterministic systems, especially when there are large numbers of distributed transactions, since the deterministic systems avoid two-phase commit and distributed deadlock. One exception to this rule is for short transactions at low contention, where Calvin is significantly outperformed by traditional 2PL + 2PC because of its bottleneck in the lock manager.


[Figure: plots of throughput (txns/sec) versus % distributed transactions for Deterministic VLL with SCA, Deterministic Locking (Calvin), 2PL + 2PC, and Nondeterministic VLL with SCA + 2PC. Panels: (a) "short" transactions under low contention; (b) "short" transactions under high contention; (c) "long" transactions under low contention; (d) "long" transactions under high contention.]

Fig. 11 Microbenchmark throughput of partition-per-machine systems, varying how many transactions span multiple partitions

However, VLL completely removes this bottleneck and enables Calvin to outperform the nondeterministic system at almost every data point. Although VLL is a very good fit for Calvin's deterministic locking scheme, it clearly improves the performance of nondeterministic designs as well. Even at high contention levels, VLL is not outperformed by the traditional hash-based lock management schemes, thanks to the SCA optimization.

3.3.3 TPC-C experiments

Although our microbenchmark allowed us to carefully vary the length, complexity, contention, and fraction of multi-partition transactions in a transaction processing workload, we now present results on the better-known TPC-C benchmark. We keep the length, complexity, and contention levels identical to the TPC-C benchmark specifications, but vary the percentage of multi-partition transactions from 0 to 100% (the TPC-C specification calls for 10% of transactions to access remote warehouses, and since not every remote warehouse is in a different partition, in practice the actual percentage of distributed transactions in TPC-C is less than 10%).

Just like the microbenchmark experiments, we experiment with partition-per-core and partition-per-machine systems. The partition-per-core systems partition the TPC-C data across 40 10-warehouse partitions (each machine contains five partitions, and each partition holds 10 warehouses), while the partition-per-machine systems partition the data across eight 20-warehouse partitions (20 warehouses per machine). The average contention index for the partition-per-core setup is approximately 0.02, and for the partition-per-machine setup it is approximately 0.01.

Figure 12a shows the throughput results for the partition-per-core systems. At first glance, the shape and relative performance of each implementation in this figure are very similar to those of the same implementations in Fig. 9d. This is because TPC-C transactions are similar in length to the "long" transactions of our microbenchmark (TPC-C New Order transactions are slightly longer), and similar in contention index to the 0.01 contention index used in Fig. 9d.


[Figure: plots of throughput (txns/sec) versus % distributed transactions. Panels: (a) throughput of "partition-per-core" systems (Per-core VLL, Per-core VLL with SCA, Per-core lock manager, H-Store); (b) throughput of "partition-per-machine" systems (Deterministic VLL, Deterministic VLL with SCA, Deterministic Locking (Calvin), 2PL + 2PC).]

Fig. 12 TPC-C experiments

Therefore, the results (and their analysis) are similar to those of the "long", high-contention microbenchmark.

There are two important results to reiterate. First, the SCA optimization improves performance relative to regular VLL by 40–145% when there are many multi-partition transactions. This is because, as described in Sect. 3.3.1, SCA finds other transactions to unblock when the head of the TxnQueue is stalled waiting for a remote message, and this function is critical when there are many multi-partition transactions in the workload.

Second, since TPC-C transactions are slightly longer than "long" microbenchmark transactions, the performance of VLL (with SCA) and that of the per-core lock manager implementation (which uses a traditional hash table lock manager) are closer to each other in Fig. 12a than in Fig. 9d. In general, when contention is high and there are many distributed transactions, the lock manager is not a bottleneck (for the reasons discussed in Sect. 3.3.1). Therefore, switching from a hash-based lock manager to VLL does not lead to a large improvement in throughput (and, without the SCA optimization, leads to a large decrease in throughput).

However, one should not conclude from these results that VLL is not beneficial for TPC-C. As mentioned above, the actual TPC-C specification calls for less than 10% distributed transactions. At that point in Fig. 12a, VLL (with SCA) outperforms the hash-based locking scheme by 8%.

Figure 12b shows the throughput results for the partition-per-machine systems. Once again, the shape and relative performance of the implementations in this figure are similar to the "long", high-contention microbenchmark (in this case, Fig. 11d for the partition-per-machine microbenchmark). One difference between TPC-C and the microbenchmark experiments is that the throughput drops less significantly as the percentage of distributed transactions increases. This is because in our microbenchmarks, each distributed transaction touches one "hot" key per partition, whereas in TPC-C, each distributed transaction touches one "hot" key in total (not one per partition). Therefore, with more distributed transactions, the contention index actually decreases.

When there are no distributed transactions, the 2PL + 2PC implementation (the System-R* design) outperforms the deterministic systems. This is because, as mentioned above, TPC-C transactions are comparable to the "long" transactions of our microbenchmark, which, as described in Sect. 3.3.2, leads to a relatively small number of lock requests per second, and thus reduced work for the lock acquisition thread. Therefore, the deterministic design of devoting an entire core to lock acquisition (see Sect. 3.3.2) is costly: this core is significantly under-utilized, and potential CPU resources are wasted. In contrast, the 2PL + 2PC implementation is able to fully utilize all CPU resources. However, as more transactions become distributed, the disadvantages of the System-R* design around 2PC and distributed deadlock present themselves, and the deterministic systems begin to outperform the 2PL + 2PC system.

3.3.4 Scalability experiments

Figure 13 shows the results of an experiment in which we test the scalability of the different schemes when there are 20% multi-partition transactions. We scale from two to 48 machines in the cluster. We used the microbenchmark with "long" transactions under both low contention and high contention.


[Figure: plots for Deterministic Locking (Calvin), Per-core VLL with SCA, 2PL + 2PC, and H-Store, each at low and high contention with 20% distributed transactions. Panels: (a) total throughput (txns/sec) versus number of machines; (b) per-node throughput (txns/sec) versus number of machines.]

Fig. 13 Scalability experiments (both partition-per-machine and partition-per-core systems are included)

We experimented with both partition-per-core and partition-per-machine systems; in particular, we used the H-Store and per-core VLL (with SCA) schemes from the partition-per-core systems, and the 2PL + 2PC and Calvin schemes from the partition-per-machine systems.

Figure 13a shows total throughput measurements, and Fig. 13b shows per-node throughput measurements. These figures show that VLL achieves linear scalability similar to Calvin's and is therefore able to maintain (and extend) its performance advantage at scale. However, neither the VLL nor the Calvin scheme achieved perfect linear scalability under high contention. This is because both systems experienced increased execution progress skew as the number of machines grew. Each machine occasionally falls behind briefly in execution due to random variation in workloads, random fluctuations in RPC latency, and so on, slowing down other machines; the more machines there are, the more likely it is that at least one machine is experiencing this at any given time. Under higher contention and with more distributed transactions, the system is more sensitive to other machines falling behind and being slow to serve reads, so the effects of execution progress skew are more pronounced in these scenarios, resulting in performance degradation as more machines are added.

The 2PL + 2PC scheme scales as gracefully as the VLL and Calvin schemes under low contention. However, its performance degrades more steeply under high contention. In addition to being affected by execution progress skew, the 2PL + 2PC scheme was vulnerable to an increase in distributed deadlocks with increasing contention. These deadlocks further increase contention and significantly degrade the scalability of the system.

H-Store also scales poorly, since all distributed transactions need to be executed serially.

3.4 VLLR evaluation

In this section, we evaluate the performance of VLLR (as described in Sect. 2.7), with and without SCA. We first compare VLLR to traditional mechanisms for locking ranges. Then, we compare VLLR to the basic VLL algorithm, which requests one lock per element in the range.

3.4.1 VLLR versus traditional mechanisms for locking ranges

This section compares VLLR against two traditional mechanisms for locking ranges. Our first baseline ("Standard Range Lock Manager") explicitly maps key ranges to lock request queues, fragmenting ranges as needed when new ranges are inserted that partially overlap existing ranges. Since this technique requires ranges to be sorted, the underlying data structure uses a std::map (which is internally implemented as a red-black tree). Our second baseline ("Hierarchical Lock Manager") is a hash table-based lock manager that performs the same bitwise-prefix hierarchical locking as VLLR, acquiring intention locks for each nonempty prefix of each element of P.
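As an illustration of the bitwise-prefix hierarchy, the sketch below enumerates the lock targets for an l-bit prefix of a 16-bit key: the full prefix itself plus an intention lock on every shorter nonempty prefix. The (bits, length) encoding is our own illustrative choice, not necessarily the structure used by either implementation.

```cpp
// Illustrative enumeration of hierarchical prefix-lock targets.
#include <cstdint>
#include <utility>
#include <vector>

using Prefix = std::pair<uint16_t, int>;  // (prefix bits, length in bits)

std::vector<Prefix> LockTargets(uint16_t key, int len) {
  std::vector<Prefix> targets;
  for (int l = 1; l <= len; ++l) {
    // The top l bits of the key identify its length-l prefix.
    uint16_t bits = static_cast<uint16_t>(key >> (16 - l));
    targets.push_back({bits, l});
  }
  // Intention-lock every entry except the last, which is locked
  // outright.
  return targets;
}
```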

To compare the performance and contention characteristics of these techniques in isolation from other factors,7 we implemented a single-machine, single-threaded benchmark in which each transaction locked a single range, and we then imposed artificial waits on a varying fraction of transactions to simulate remote reads.

7 For example, the distributed deadlock detection mechanisms differ for Standard Range Locking compared to the other mechanisms, because partially overlapping ranges must be split, so the number of lock queues that a transaction's requests are in may increase over time without the transaction requesting new locks.


[Figure: plots of throughput (txns/sec) versus % distributed transactions for VLLR, VLLR with SCA, Standard Range Lock Manager, and Hierarchical Lock Manager. Panels: (a) 100-microsecond "distributed transaction" delays, low contention; (b) 100-microsecond delays, high contention; (c) 500-microsecond delays, low contention; (d) 500-microsecond delays, high contention.]

Fig. 14 Range locking throughput with different artificial delays for distributed transactions

To vary contention, we chose ranges whose start and end points, when represented as bitstrings, had shared prefixes of varying length. We chose two scenarios: a low-contention scenario in which the start and limit of each range shared a prefix of, on average, 15 bits, and a high-contention scenario in which the start and limit of each range shared a prefix of, on average, 6 bits. For VLLR and the Hierarchical Lock Manager, we translated each key range to a singleton prefix set consisting of only the longest shared prefix between the start and limit. As a result, the Standard Range Lock Manager observed contention rates of 0.0001 and 0.0125, respectively, while VLLR and the Hierarchical Lock Manager observed significantly higher contention rates: 0.0002 and 0.0178. However, under high contention, VLLR and the Hierarchical Lock Manager required significantly fewer intention locks, which resulted in lower CPU overhead. Figure 14 shows our measured results with 100-microsecond (Fig. 14a, b) and 500-microsecond (Fig. 14c, d) delays, and with low (Fig. 14a, c) and high (Fig. 14b, d) contention.
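The translation from a key range to its longest shared prefix can be sketched as follows for 16-bit keys (the function name is illustrative): XOR exposes the first bit at which the two endpoints differ, and everything above that bit is the shared prefix.

```cpp
// Illustrative range-to-prefix translation over 16-bit keys.
#include <cstdint>

// Returns the length, in bits, of the longest prefix shared by both
// endpoints of the range [start, limit].
int SharedPrefixLength(uint16_t start, uint16_t limit) {
  uint16_t diff = start ^ limit;  // set bits mark where they differ
  int len = 0;
  while (len < 16 && !(diff & (1u << (15 - len)))) ++len;
  return len;  // 16 means start == limit, i.e., a "point" range
}
```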

Before comparing VLLR to the other schemes, we examine VLLR with and without SCA. Under high contention, SCA proves similarly critical to VLLR as it is to VLL; it costs little, and without it VLLR's throughput drops significantly when there are distributed transaction delays. As with VLL, VLLR benefits little from SCA when transaction stalls are short or infrequent.

The Standard Range Lock Manager incurred higher CPU overhead than VLLR due to red-black tree operations and range splitting, but its performance declined more gracefully when distributed transaction delays were simulated, because it experienced lower contention than the prefix-based VLLR implementation (recall that it locks exact key ranges rather than prefixes that represent conservative estimates of key ranges).

The Hierarchical Lock Manager, on the other hand, incurred very high CPU overhead when enqueuing and dequeuing requests for its many intention locks, significantly limiting throughput. This CPU cost is so severe that it remains the sole throughput bottleneck in Fig. 14a–c, even when 100% of transactions incur delays during execution.


[Figure: plots of throughput (txns/sec) versus average number of keys per range for VLLR and Basic VLL. Panels: (a) short transactions; (b) longer transactions.]

Fig. 15 VLLR versus basic VLL algorithm

Only under high contention and frequent long delays (exactly the conditions under which the Standard Range Lock Manager outperforms VLLR, illustrating extreme contention costs) is throughput limited by contention.

3.4.2 VLLR versus basic VLL algorithm

We now examine the performance of VLLR versus that of the basic VLL algorithm, which implements range locks by simply acquiring locks on all records inside the range. In this experiment, we use the same VLLR implementation and experimental transaction as in Sect. 3.4.1, with 100% single-partition transactions. The bitstring size we use in this experiment is 16, and we vary the longest common prefix of the bitstring range from 16 down to 7 bits. For example, if the shared prefix is 11 bits, then the five least-significant bits are not shared, which allows 2^5 = 32 unique values within that range. Since the average range is half of the total possible range size, the average range size for a shared prefix of 11 bits is 16 records. Therefore, by varying the longest common prefix of the bitstring range from 16 down to 7 bits, we are in effect experimenting with ranges of average size between 1 and 256 records.
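This arithmetic can be checked with a small helper (illustrative only), mapping a shared-prefix length over 16-bit keys to the average number of keys in the range:

```cpp
// Illustrative check of the range-size arithmetic above.
#include <cstdint>

uint32_t AverageRangeSize(int shared_prefix_bits) {
  // 16 - p low-order bits are free to vary within the range.
  uint32_t possible = 1u << (16 - shared_prefix_bits);  // 2^(16-p)
  // The average range spans half the possible keys; a full 16-bit
  // shared prefix degenerates to a single-key "point" range.
  return possible > 1 ? possible / 2 : 1;  // p = 11 -> 16 records
}
```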

Figure 15a shows the results of our experiment for short transactions, and Fig. 15b shows the results for long transactions. When ranges are short, basic VLL outperforms VLLR. This is because VLL only has to lock a few records in the range, while VLLR has to acquire all the hierarchical prefix locks; in other words, VLLR must perform more lock acquisitions overall than VLL. However, as the range gets larger, VLL must acquire more individual record locks, and thus its performance degrades with larger ranges. In contrast, VLLR actually has to acquire fewer locks as the range size increases, since longer ranges result in shorter shared prefixes, and thus fewer hierarchical prefix locks that VLLR needs to acquire. For shorter transactions, the cost of the additional locks that VLL must acquire is a larger relative percentage of the transaction cost, and thus the difference between VLL and VLLR for short transactions with large ranges is more pronounced.

In summary, when ranges are short, and especially in the extreme case of a range of size one (i.e., a "point" access), basic VLL performs well. However, VLLR yields significantly higher throughput for longer range accesses.

3.5 CPU costs

The above experiments show the performance of the different locking schemes by measuring total throughput under different conditions. These results reflect both (a) the CPU overhead of each scheme and (b) the amount of concurrency achieved by the system.

In order to tease apart these two components of performance, we measured the CPU cost per transaction of requesting and releasing locks for each locking scheme. These experiments were not run in the context of a fully loaded system; rather, we measured the CPU cost of each mechanism in complete isolation.

Figure 16 shows the results of our isolated CPU cost evaluation, which demonstrates the reduced CPU costs that VLL achieves by eliminating the lock manager. We find that the Calvin and 2PL schemes (which both use traditional lock managers) have an order of magnitude higher CPU cost than the VLL schemes. The CPU cost of multi-threaded VLL is a little higher than that of single-threaded VLL, since multi-threaded VLL has the additional cost of acquiring and releasing a latch around the acquisition of locks.


Locking mechanism       Per-transaction CPU cost (µs)
Traditional 2PL         21.07
Deterministic Calvin    20.18
Multi-threaded VLL       1.83
Single-threaded VLL      0.69

Fig. 16 Locking overhead per transaction

4 Related work

The System-R lock manager described in [9] is the lock manager design that has been adopted by most database systems [13]. In order to reduce the cost of locking, several methods for reducing the number of lock calls have been published [17,24,28]. While these schemes may partially mitigate the cost of locking in main memory systems, they do not address the root cause of high lock manager overhead: the size and complexity of the data structure used to store lock requests.

Kimura et al. [20] presented a Lightweight Intent Lock (LIL), which maintains a set of lightweight counters in a global lock table. However, this proposal does not colocate the counters with the raw data (to improve cache locality), and if a transaction does not acquire all of its locks immediately, its thread blocks, waiting to receive a message from the thread of another transaction that has released its locks. VLL differs from this approach in using the global transaction order to determine which transaction should be unblocked and allowed to run as a consequence of the most recent lock release.

The idea of colocating a record's lock state with the record itself in main memory databases was proposed almost two decades ago [8,25]. However, this proposal involved maintaining a linked list of "Lock Control Blocks" (LCBs) for each record, significantly complicating the underlying data structures used to store records, whereas VLL aims to simplify lock tracking structures by compressing all per-record lock state into a simple pair of integers.

There has been much effort devoted to improving the implementation of the lock manager in database systems. Horikawa [38], Shore-MT [15], and Jung et al. [18] all improve multicore scalability by carefully optimizing the lock manager, removing latching and critical sections wherever possible. However, these systems all keep the basic two-phase locking design and improve performance by removing bottlenecks in the lock manager. This differs from our design, which moves the lock data structures outside of the lock manager and fundamentally changes what lock information is tracked (and how). DORA [29] partitions the lock manager across cores on a multicore machine, which eliminates long chains of lock waits on a centralized lock manager and increases cache affinity. However, it does not change the fundamental hash-based lock manager data structure.

Given the high overhead of the lock manager when a database is entirely in main memory [12,14,22], some researchers have observed that executing transactions serially, without any concurrency control, can yield significant throughput improvements in main memory database systems [16,19,33,40]. Such an approach works well only when the workload can be partitioned across cores, with very few multi-partition transactions. VLL enjoys some of the advantages of reduced locking overhead while still performing well for a much larger variety of workloads.

Other attempts to avoid locks in main memory database systems include optimistic concurrency control schemes and multi-version concurrency control schemes [1,3,5,21,22,39], and some industry products also use these concurrency control schemes: for example, HANA uses MVCC [23], Hekaton uses optimistic MVCC [6], and Google F1 uses OCC (in addition to some pessimistic locking) [31]. While these schemes eliminate locking overhead, they introduce other sources of overhead. In particular, optimistic schemes can incur overhead from aborted transactions when the optimistic assumption fails (in addition to data access tracking overhead), and multi-version schemes use additional (expensive) memory resources to store multiple copies of data.

Key range locking, as pioneered by Lomet [26], has been shown to be useful in workloads that frequently operate on many consecutive rows within the same transaction. While not all database systems have implemented key range locking, some of them, such as Spanner [7], use the range locking technique exclusively.

5 Conclusion and future work

We have presented VLL, a protocol for main memory database systems that avoids the costs of maintaining the data structures kept by traditional lock managers, and that therefore yields higher transactional throughput than traditional implementations. VLL colocates lock information (two simple semaphores) with the raw data and forces all of a transaction's locks to be acquired at once. Although the reduced lock information can complicate the question of when to allow blocked transactions to acquire locks and proceed, SCA allows transactional dependency information to be created as needed (and only as much information as is needed to unblock transactions). This optimization allows our VLL protocol to achieve high concurrency in transaction execution, even under high contention.

We showed how VLL can be implemented in traditional single-server multi-core database systems as well as in deterministic multi-server database systems. The experiments we presented demonstrate that VLL can significantly outperform standard two-phase locking, deterministic locking, and H-Store-style serial execution schemes, without inhibiting scalability or interfering with other components of the database system. Furthermore, VLL is highly compatible with both standard (nondeterministic) approaches to transaction processing and deterministic database systems like Calvin.

We also showed that VLL can easily be extended to support range locking by adding two extra counters and a prefix locking algorithm; we call this technique VLLR. Experiments demonstrate that VLLR outperforms alternative range locking approaches.

Our focus in this paper was on database systems that update data in place. In future work, we intend to investigate multi-versioned variants of the VLL protocol and to integrate hierarchical locking approaches into VLL.

Acknowledgments We would like to thank the anonymous reviewers for their detailed and insightful comments. This work was sponsored by the NSF under grants IIS-0845643 and IIS-1249722, and by a Sloan Research Fellowship.

References

1. Agrawal, D., Sengupta, S.: Modular synchronization in distributed, multiversion databases: version control and concurrency control. IEEE TKDE 5, 126–137 (1993)

2. Agrawal, R., Carey, M.J., Livny, M.: Concurrency control performance modeling: alternatives and implications. ACM Trans. Database Syst. 12, 609–654 (1987)

3. Mahmoud, H.A., Arora, V., Nawab, F., Agrawal, D., Abbadi, A.E.: MaaT: effective and scalable coordination of distributed transactions in the cloud. In: Proceedings of PVLDB, vol. 7, no. 5 (2014)

4. Bernstein, P.A., Goodman, N.: Concurrency control in distributed database systems. ACM Comput. Surv. 13(2), 185–221 (1981)

5. Bernstein, P.A., Hadzilacos, V., Goodman, N.: Concurrency Control and Recovery in Database Systems. Addison-Wesley, Reading (1987)

6. Diaconu, C., Freedman, C., Ismert, E., Larson, P.-Å., Mittal, P., Stonecipher, R., Verma, N., Zwilling, M.: Hekaton: SQL Server's memory-optimized OLTP engine. In: SIGMOD (2013)

7. Corbett, J.C., et al.: Spanner: Google's globally-distributed database. In: Proceedings of OSDI (2012)

8. Gottemukkala, V., Lehman, T.: Locking and latching in a memory-resident database system. In: VLDB (1992)

9. Gray, J.: Notes on Database Operating Systems. Operating Systems, An Advanced Course. Springer, Berlin (1979)

10. Gray, J.N., Lorie, R.A., Putzolu, G.R., Traiger, I.L.: Granularity of locks and degrees of consistency in a shared database. In: Proceedings of the IFIP Working Conference on Modelling of Database Management Systems (1975)

11. Gray, J., Reuter, A.: Transaction Processing: Concepts and Techniques. Morgan Kaufmann, New York (1993)

12. Harizopoulos, S., Abadi, D.J., Madden, S.R., Stonebraker, M.: OLTP through the looking glass, and what we found there. In: SIGMOD (2008)

13. Hellerstein, J.M., Stonebraker, M., Hamilton, J.: Architecture of a database system. Found. Trends Databases 1(2), 141–259 (2007)

14. Johnson, R., Pandis, I., Ailamaki, A.: Improving OLTP scalability using speculative lock inheritance. PVLDB 2(1), 479–489 (2009)

15. Johnson, R., Pandis, I., Hardavellas, N., Ailamaki, A., Falsafi, B.: Shore-MT: a scalable storage manager for the multicore era. In: Proceedings of EDBT (2009)

16. Jones, E.P.C., Abadi, D.J., Madden, S.R.: Concurrency control for partitioned databases. In: SIGMOD (2010)

17. Joshi, A.: Adaptive locking strategies in a multi-node data sharing environment. In: VLDB (1991)

18. Jung, H., Han, H., Fekete, A.D., Heiser, G., Yeom, H.Y.: A scalable lock manager for multicores. In: Proceedings of SIGMOD (2013)

19. Kemper, A., Neumann, T., Finis, J., Funke, F., Leis, V., Mühe, H., Mühlbauer, T., Rödiger, W.: Transaction processing in the hybrid OLTP/OLAP main-memory database system HyPer. IEEE Data Eng. Bull. 36(2), 41–47 (2013)

20. Kimura, H., Graefe, G., Kuno, H.: Efficient locking techniques for databases on modern hardware. In: Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures (2012)

21. Kung, H.T., Robinson, J.T.: On optimistic methods for concurrency control. TODS 6(2), 213–226 (1981)

22. Larson, P., Blanas, S., Diaconu, C., Freedman, C., Patel, J., Zwilling, M.: High-performance concurrency control mechanisms for main-memory databases. In: PVLDB (2011)

23. Lee, J., Muehle, M., May, N., Faerber, F., Sikka, V., Plattner, H., Krueger, J., Grund, M.: High-performance transaction processing in SAP HANA. IEEE Data Eng. Bull. 36(2), 28–33 (2013)

24. Lehman, T.: Design and performance evaluation of a main memory relational database system. Ph.D. thesis, University of Wisconsin-Madison (1986)

25. Lehman, T.J., Gottemukkala, V.: The Design and Performance Evaluation of a Lock Manager for a Memory-Resident Database System. Performance of Concurrency Control Mechanisms in Centralised Database Systems. Addison-Wesley, Reading (1996)

26. Lomet, D.: Key range locking strategies for improved concurrency. In: Proceedings of VLDB (1993)

27. Mohan, C., Lindsay, B.G., Obermarck, R.: Transaction management in the R* distributed database management system. ACM Trans. Database Syst. 11(4), 378–396 (1986)

28. Mohan, C., Narang, I.: Recovery and coherency-control protocols for fast inter-system page transfer and fine-granularity locking in a shared disks transaction environment. In: VLDB (1991)

29. Pandis, I., Johnson, R., Hardavellas, N., Ailamaki, A.: Data-oriented transaction execution. PVLDB 3(1), 928–939 (2010)

30. Ren, K., Thomson, A., Abadi, D.J.: Lightweight locking for main memory database systems. In: PVLDB (2013)

31. Shute, J., Vingralek, R., Samwel, B., Handy, B., Whipkey, C., Rollins, E., Oancea, M., Littlefield, K., Menestrina, D., Ellner, S., Cieslewicz, J., Rae, I., Stancescu, T., Apte, H.: F1: a distributed SQL database that scales. In: VLDB (2013)

32. Stonebraker, M.: Concurrency control and consistency of multiple copies of data in distributed INGRES. IEEE Trans. Softw. Eng. SE-5, 188–194 (1979)

33. Stonebraker, M., Madden, S.R., Abadi, D.J., Harizopoulos, S., Hachem, N., Helland, P.: The end of an architectural era (it's time for a complete rewrite). In: VLDB, Vienna, Austria (2007)

34. Thomasian, A.: Two-phase locking performance and its thrashing behavior. TODS 18(4), 579–625 (1993)

35. Thomson, A., Abadi, D.J.: The case for determinism in database systems. In: VLDB (2010)

36. Thomson, A., Abadi, D.J.: Modularity and scalability in Calvin. IEEE Data Eng. Bull. 36(2), 48–55 (2013)

37. Thomson, A., Diamond, T., Shao, P., Ren, K., Weng, S.-C., Abadi, D.J.: Calvin: fast distributed transactions for partitioned database systems. In: SIGMOD (2012)

38. Horikawa, T.: Latch-free data structures for DBMS: design, implementation, and evaluation. In: Proceedings of SIGMOD (2013)

39. Tu, S., Zheng, W., Kohler, E., Liskov, B., Madden, S.: Speedy transactions in multicore in-memory databases. In: Proceedings of SOSP, pp. 18–32 (2013)

40. Whitney, A., Shasha, D., Apter, S.: High volume transaction processing without concurrency control, two phase commit, SQL or C++. In: HPTS (1997)
