The Impact of RDMA on Agreement [Extended Version]

Marcos K. Aguilera, VMware
Naama Ben-David, CMU
Rachid Guerraoui, EPFL
Virendra Marathe, Oracle
Igor Zablotchi, EPFL
ABSTRACT

Remote Direct Memory Access (RDMA) is becoming widely available in data centers. This technology allows a process to directly read and write the memory of a remote host, with a mechanism to control access permissions. In this paper, we study the fundamental power of these capabilities. We consider the well-known problem of achieving consensus despite failures, and find that RDMA can improve the inherent trade-off in distributed computing between failure resilience and performance. Specifically, we show that RDMA allows algorithms that simultaneously achieve high resilience and high performance, while traditional algorithms had to choose one or the other. With Byzantine failures, we give an algorithm that requires only n ≥ 2fP + 1 processes (where fP is the maximum number of faulty processes) and decides in two (network) delays in common executions. With crash failures, we give an algorithm that requires only n ≥ fP + 1 processes and also decides in two delays. Both algorithms tolerate a minority of memory failures inherent to RDMA, and they provide safety in asynchronous systems and liveness with standard additional assumptions.
1 INTRODUCTION

In recent years, a technology known as Remote Direct Memory Access (RDMA) has made its way into data centers, earning a spotlight in distributed systems research. RDMA provides the traditional send/receive communication primitives, but also allows a process to directly read/write remote memory. Research work shows that RDMA leads to some new and exciting distributed algorithms [3, 9, 25, 31, 45, 49].
RDMA provides a different interface from previous communication mechanisms, as it combines message-passing with shared-memory [3]. Furthermore, to safeguard the remote memory, RDMA provides protection mechanisms to grant and revoke access for reading and writing data. This mechanism is fine grained: an application can choose subsets of remote memory called regions to protect; it can choose whether a region can be read, written, or both; and it can choose individual processes to be given access, where different processes can have different accesses. Furthermore, protections are dynamic: they can be changed by the application over time. In this paper, we lay the groundwork for a theoretical understanding of these RDMA capabilities, and we show that they lead to distributed algorithms that are inherently more powerful than before.
While RDMA brings additional power, it also introduces some challenges. With RDMA, the remote memories are subject to failures that cause them to become unresponsive. This behavior differs from traditional shared memory, which is often assumed to be reliable.¹ In this paper, we show that the additional power of RDMA more than compensates for these challenges.
Our main contribution is to show that RDMA improves on the fundamental trade-off in distributed systems between failure resilience and performance. Specifically, we show how a consensus protocol can use RDMA to achieve both high resilience and high performance, while traditional algorithms had to choose one or the other. We illustrate this on the fundamental problem of achieving consensus, and capture the above RDMA capabilities as an M&M model [3], in which processes can use both message-passing and shared-memory. We consider asynchronous systems and require safety in all executions and liveness under standard additional assumptions (e.g., partial synchrony). We measure resilience by the number of failures an algorithm tolerates, and performance by the number of (network) delays in common-case executions. Failure resilience and performance depend on whether processes fail by crashing or by being Byzantine, so we consider both.
With Byzantine failures, we consider the consensus problem called weak Byzantine agreement, defined by Lamport [37]. We give an algorithm that (a) requires only n ≥ 2fP + 1 processes (where fP is the maximum number of faulty processes) and (b) decides in two delays in the common case. With crash failures, we give the first algorithm for consensus that requires only n ≥ fP + 1 processes and decides in two delays in the common case. With both Byzantine and crash failures, our algorithms can also tolerate crashes of memories: only m ≥ 2fM + 1 memories are required, where fM is the maximum number of faulty memories. Furthermore, with crash failures, we improve resilience further, to tolerate crashes of a minority of the combined set of memories and processes.
Our algorithms appear to violate known impossibility results: it is known that with message-passing, Byzantine agreement requires n ≥ 3fP + 1 even if the system is synchronous [44], while consensus with crash failures requires n ≥ 2fP + 1 if the system is partially synchronous [27]. There is no contradiction: our algorithms rely on the power of RDMA, not available in other systems.
RDMA's power comes from two features: (1) simultaneous access to message-passing and shared-memory, and (2) dynamic permissions. Intuitively, shared-memory helps resilience, message-passing helps performance, and dynamic permissions help both.
To see how shared-memory helps resilience, consider the Disk Paxos algorithm [29], which uses shared-memory (disks) but no messages. Disk Paxos requires only n ≥ fP + 1 processes, matching the resilience of our algorithm. However, Disk Paxos is not as fast: it takes at least four delays. In fact, we show that no shared-memory consensus algorithm can decide in two delays (Section 6).

¹There are a few studies of failure-prone memory, as we discuss in related work.

arXiv:1905.12143v2 [cs.DC] 25 Feb 2021
To see how message-passing helps performance, consider the Fast Paxos algorithm [39], which uses message-passing and no shared-memory. Fast Paxos decides in only two delays in common executions, but it requires n ≥ 2fP + 1 processes.
Of course, the challenge is achieving both high resilience and good performance in a single algorithm. This is where RDMA's dynamic permissions shine. Clearly, dynamic permissions improve resilience against Byzantine failures, by preventing a Byzantine process from overwriting memory and making it useless. More surprising, perhaps, is that dynamic permissions help performance, by providing an uncontended instantaneous guarantee: if each process revokes the write permission of other processes before writing to a register, then a process that writes successfully knows that it executed uncontended, without having to take additional steps (e.g., to read the register). We use this technique in our algorithms for both Byzantine and crash failures.
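The revoke-then-write technique described above can be sketched in a few lines. The sketch below is a minimal illustration under our own naming (Region, exclusive_write), not the paper's algorithm: a region tracks a dynamic set of permitted writers, and a write that succeeds after revoking everyone else certifies the absence of contention without an extra read.

```python
# Minimal sketch of "revoke, then write" (illustrative names, not the paper's code).

class Region:
    """A memory region with a dynamic write-permission set."""
    def __init__(self, writers):
        self.writers = set(writers)  # processes currently allowed to write
        self.value = None

    def change_permission(self, new_writers):
        self.writers = set(new_writers)

    def write(self, pid, v):
        # The memory checks permission; an unauthorized write is rejected (nak).
        if pid not in self.writers:
            return "nak"
        self.value = v
        return "ack"

def exclusive_write(region, pid, v):
    """Revoke everyone else's write permission, then write.

    If the write returns ack, pid knows it executed uncontended: no
    later write by another process can succeed until permissions change.
    """
    region.change_permission({pid})
    return region.write(pid, v)

r = Region(writers={"p1", "p2", "p3"})
assert exclusive_write(r, "p1", 42) == "ack"
assert r.write("p2", 99) == "nak"   # p2 lost its permission
assert r.value == 42
```

In real RDMA hardware the permission check happens on the NIC, so the revocation and the write are the only round trips needed.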
In summary, our contributions are as follows:
β’ We consider distributed systems with RDMA, and we propose a model that captures some of its key properties while accounting for failures of processes and memories, with support for dynamic permissions.
β’ We show that the shared-memory part of our RDMA model improves resilience: our Byzantine agreement algorithm requires only n ≥ 2fP + 1 processes.
β’ We show that shared-memory by itself does not permit consensus algorithms that decide in two steps in common executions.
β’ We show that with dynamic permissions, we can improve the performance of our Byzantine agreement algorithm to decide in two steps in common executions.
β’ We give similar results for the case of crash failures: decision in two steps while requiring only n ≥ fP + 1 processes.
β’ Our algorithms can tolerate the failure of memories, up to aminority of them.
The rest of the paper is organized as follows. Section 2 gives an overview of related work. In Section 3 we formally define the RDMA-compliant M&M model that we use in the rest of the paper, and specify the agreement problems that we solve. We then proceed to present the main contributions of the paper. Section 4 presents our fast and resilient Byzantine agreement algorithm. In Section 5 we consider the special case of crash-only failures, and show an improvement of the algorithm and tolerance bounds for this setting. In Section 6 we briefly outline a lower bound showing that the dynamic permissions of RDMA are necessary for achieving our results. Finally, in Section 7 we discuss the semantics of RDMA in practice, and how our model reflects these features.
To ease readability, most proofs have been deferred to the appendices.
2 RELATED WORK

RDMA. Many high-performance systems were recently proposed using RDMA, such as distributed key-value stores [25, 31], communication primitives [25, 33], and shared address spaces across clusters [25]. Kalia et al. [32] provide guidelines for designing systems using RDMA. RDMA has also been applied to solve consensus [9, 45, 49]. Our model shares similarities with DARE [45] and APUS [49], which modify queue-pair state at run time to prevent or allow access to memory regions, similar to our dynamic permissions. These systems perform better than TCP/IP-based solutions by exploiting the better raw performance of RDMA, without changing the fundamental communication complexity or failure resilience of the consensus protocol. Similarly, Rüsch et al. [46] use RDMA as a replacement for TCP/IP in existing BFT protocols.

M&M. Message-and-memory (M&M) refers to a broad class of models that combine message-passing with shared-memory, introduced by Aguilera et al. in [3]. In that work, Aguilera et al. consider M&M models without memory permissions and failures, and show that such models lead to algorithms that are more robust to failures and asynchrony. In particular, they give a consensus algorithm that tolerates more crash failures than message-passing systems but is more scalable than shared-memory systems, as well as a leader election algorithm that reduces the synchrony requirements. In this paper, our goal is to understand how memory permissions and failures in RDMA impact agreement.

Byzantine Fault Tolerance. Lamport, Shostak and Pease [40, 44] show that Byzantine agreement can be solved in synchronous systems iff n ≥ 3fP + 1. With unforgeable signatures, Byzantine agreement can be solved iff n ≥ 2fP + 1. In asynchronous systems subject to failures, consensus cannot be solved [28]. However, this result is circumvented by making additional assumptions for liveness, such as randomization [10] or partial synchrony [18, 27]. Many Byzantine agreement algorithms focus on safety and implicitly use the additional assumptions for liveness. Even with signatures, asynchronous Byzantine agreement can be solved only if n ≥ 3fP + 1 [15].
It is well known that the resilience of Byzantine agreement varies depending on model assumptions such as synchrony, signatures, non-equivocation, and the exact variant of the problem to be solved. A system with non-equivocation is one that can prevent a Byzantine process from sending different values to different processes. Table 1 summarizes some known results relevant to this paper.
Work       | Synchrony | Signatures | Non-Equiv  | Strong Validity | Resiliency
[40]       | Yes       | Yes        | No         | Yes             | 2f + 1
[40]       | Yes       | No         | No         | Yes             | 3f + 1
[4, 41]    | No        | No         | Yes        | Yes             | 3f + 1
[21]       | No        | Yes        | No         | Yes             | 3f + 1
[21]       | No        | No         | Yes        | Yes             | 3f + 1
[21]       | No        | Yes        | Yes        | Yes             | 2f + 1
This paper | No        | Yes        | Yes (RDMA) | No              | 2f + 1

Table 1: Known fault tolerance results for Byzantine agreement.
Our Byzantine agreement results share similarities with results for shared memory. Malkhi et al. [41] and Alon et al. [4] show bounds on the resilience of strong and weak consensus in a model with reliable memory but Byzantine processes. They also provide consensus protocols, using read-write registers enhanced with sticky bits (write-once memory) and access control lists not unlike our permissions. Bessani et al. [11] propose an alternative to sticky bits and access control lists through Policy-Enforced Augmented Tuple Spaces. All these works handle Byzantine failures with powerful objects rather than registers. Bouzid et al. [13] show that 3fP + 1 processes are necessary for strong Byzantine agreement with read-write registers.
Some prior work solves Byzantine agreement with 2fP + 1 processes using specialized trusted components that Byzantine processes cannot control [19, 20, 22, 23, 34, 48]. Some schemes decide in two delays but require a large trusted component: a coordinator [19], reliable broadcast [23], or message ordering [34]. For us, permission checking in RDMA is a trusted component of sorts, but it is small and readily available.
At a high level, our improved Byzantine fault tolerance is achieved by preventing equivocation by Byzantine processes, thereby effectively translating each Byzantine failure into a crash failure. Such translations from one type of failure into a less serious one have appeared extensively in the literature [8, 15, 21, 43]. Early work [8, 43] shows how to translate a crash-tolerant algorithm into a Byzantine-tolerant algorithm in the synchronous setting. Bracha [14] presents a similar translation for the asynchronous setting, in which n ≥ 3fP + 1 processes are required to tolerate fP Byzantine failures. Bracha's translation relies on the definition and implementation of a reliable broadcast primitive, very similar to the one in this paper. However, we show that using the capabilities of RDMA, we can implement it with higher fault tolerance.

Faulty memory. Afek et al. [2] and Jayanti et al. [30] study the problem of masking the benign failures of shared memory or objects. We use their ideas of replicating data across memories. Abraham et al. [1] consider honest processes but malicious memory.

Common-case executions. Many systems and algorithms tolerate adversarial scheduling but optimize for common-case executions without failures, asynchrony, contention, etc. (e.g., [12, 24, 26, 35, 36, 39, 42]). None of these match both the resilience and performance of our algorithms. Some algorithms decide in one delay but require n ≥ 5fP + 1 for Byzantine failures [47] or n ≥ 3fP + 1 for crash failures [16, 24].
3 MODEL AND PRELIMINARIES

We consider a message-and-memory (M&M) model, which allows processes to use both message-passing and shared-memory [3]. The system has n processes Π = {p1, . . . , pn} and m (shared) memories M = {μ1, . . . , μm}. Processes communicate by accessing memories or sending messages. Throughout the paper, memory refers to the shared memories, not the local state of processes.
The system is asynchronous in that it can experience arbitrary delays. We expect algorithms to satisfy the safety properties of the problems we consider under this asynchronous system. For liveness, we require additional standard assumptions, such as partial synchrony, randomization, or failure detection.
Memory permissions. Each memory consists of a set of registers. To control access, an algorithm groups those registers into a set of (possibly overlapping) memory regions, and then defines permissions for those memory regions. Formally, a memory region mr of a memory μ is a subset of the registers of μ. We often refer to mr without specifying the memory μ explicitly. Each memory region mr has a permission, which consists of three disjoint sets of processes Rmr, Wmr, RWmr indicating whether each process can read, write, or read-write the registers in the region. We say that p has read permission on mr if p ∈ Rmr or p ∈ RWmr; we say that p has write permission on mr if p ∈ Wmr or p ∈ RWmr. In the special case when Rmr = Π \ {p}, Wmr = ∅, RWmr = {p}, we say that mr is a Single-Writer Multi-Reader (SWMR) region; registers in mr correspond to the traditional notion of SWMR registers. Note that a register may belong to several regions, and a process may have access to the register on one region but not another; this models the existing RDMA behavior. Intuitively, when reading or writing data, a process specifies the region and the register, and the system uses the region to determine if access is allowed (we make this precise below).

Figure 1: Our model with processes and memories, which may both fail. Processes can send messages to each other or access registers in the memories. Registers in a memory are grouped into memory regions that may overlap, but in our algorithms they do not. Each region has a permission indicating which processes can read, write, and read-write the registers in the region (shown for two regions).
Permission change. An algorithm indicates an initial permission for each memory region mr. Subsequently, the algorithm may wish to change the permission of mr during execution. For that, processes can invoke an operation changePermission(mr, new_perm), where new_perm is a triple (R, W, RW). This operation returns no results and it is intended to modify Rmr, Wmr, RWmr to R, W, RW. To tolerate Byzantine processes, an algorithm can restrict processes from changing permissions. For that, the algorithm specifies a function legalChange(p, mr, old_perm, new_perm) which returns a boolean indicating whether process p can change the permission of mr to new_perm when the current permission is old_perm. More precisely, when changePermission is invoked, the system evaluates legalChange to determine whether changePermission takes effect or becomes a no-op. When legalChange always returns false, we say that the permissions are static; otherwise, the permissions are dynamic.
Accessing memories. Processes access the memories via operations write(mr, r, v) and read(mr, r), for memory region mr, register r, and value v. A write(mr, r, v) by process p changes register r to v and returns ack if r ∈ mr and p has write permission on mr; otherwise, the operation returns nak. A read(mr, r) by process p returns the last value successfully written to r if r ∈ mr and p has read permission on mr; otherwise, the operation returns nak. In our algorithms, a register belongs to exactly one region, so we omit the mr parameter from write and read operations.
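The permission machinery just defined (regions over registers, the R/W/RW sets, changePermission gated by legalChange, and read/write returning ack or nak) can be sketched as follows. Names such as Memory and Perm are our own illustrative choices, not part of the model's formal vocabulary.

```python
# Sketch of the model's permission machinery (illustrative names).

from dataclasses import dataclass, field

@dataclass
class Perm:
    R: set = field(default_factory=set)    # read-only processes
    W: set = field(default_factory=set)    # write-only processes
    RW: set = field(default_factory=set)   # read-write processes

class Memory:
    def __init__(self, legal_change):
        self.regs = {}       # register name -> value
        self.regions = {}    # region name -> (set of registers, Perm)
        self.legal_change = legal_change

    def add_region(self, mr, registers, perm):
        self.regions[mr] = (set(registers), perm)
        for r in registers:
            self.regs.setdefault(r, None)

    def write(self, p, mr, r, v):
        regs, perm = self.regions[mr]
        if r in regs and (p in perm.W or p in perm.RW):
            self.regs[r] = v
            return "ack"
        return "nak"

    def read(self, p, mr, r):
        regs, perm = self.regions[mr]
        if r in regs and (p in perm.R or p in perm.RW):
            return self.regs[r]
        return "nak"

    def change_permission(self, p, mr, new_perm):
        regs, old_perm = self.regions[mr]
        if self.legal_change(p, mr, old_perm, new_perm):
            self.regions[mr] = (regs, new_perm)  # takes effect
        # otherwise: no-op

# An SWMR region for p1: everyone else reads, only p1 writes; static permissions.
mem = Memory(legal_change=lambda p, mr, old, new: False)
mem.add_region("mr1", ["x"], Perm(R={"p2", "p3"}, RW={"p1"}))
assert mem.write("p1", "mr1", "x", 7) == "ack"
assert mem.write("p2", "mr1", "x", 8) == "nak"
assert mem.read("p2", "mr1", "x") == 7
```

The legal_change hook mirrors how an algorithm restricts Byzantine processes: an illegal changePermission silently becomes a no-op rather than an error.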
Sending messages. Processes can also communicate by sending messages over a set of directed links. We assume messages are unique. If there is a link from process p to process q, then p can send messages to q. Links satisfy two properties: integrity and no-loss. Given two correct processes p and q, integrity requires that a message m be received by q from p at most once and only if m was previously sent by p to q. No-loss requires that a message m sent from p to q be eventually received by q. In our algorithms, we typically assume a fully connected network so that every pair of correct processes can communicate. We also consider the special case when there are no links (see below).
Executions and steps. An execution is a sequence of process steps. In each step, a process does the following, according to its local state: (1) sends a message or invokes an operation on a memory (read, write, or changePermission), (2) tries to receive a message or a response from an outstanding operation, and (3) changes local state. We require a process to have at most one outstanding operation on each memory.
Failures. A memory may fail by crashing, which causes subsequent operations on its registers to hang without returning a response. Because the system is asynchronous, a process cannot differentiate a crashed memory from a slow one. We assume there is an upper bound fM on the number of memories that may crash. Processes may fail by crashing or becoming Byzantine. If a process crashes, it stops taking steps forever. If a process becomes Byzantine, it can deviate arbitrarily from the algorithm. However, that process cannot operate on memories without the required permission. We assume there is an upper bound fP on the number of processes that may be faulty. Where the context is clear, we omit the P and M subscripts from the number of failures, f.
Signatures. Our algorithms assume unforgeable signatures: there are primitives sign(v) and sValid(p, v) which, respectively, sign a value v and determine whether v is signed by process p.
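As a runnable stand-in for these primitives, the sketch below uses HMACs with per-process keys known to the verifier. That symmetric-key setup is an assumption made only to keep the sketch self-contained; true unforgeable signatures would use a public-key scheme (e.g., Ed25519), and the key table here is hypothetical.

```python
# Stand-in for sign/sValid using HMACs (assumption: verifier knows each
# process's key; real deployments would use public-key signatures).

import hashlib
import hmac
import pickle

KEYS = {"p1": b"key-p1", "p2": b"key-p2"}   # hypothetical per-process keys

def sign(p, v):
    """Return v together with p's authentication tag over it."""
    blob = pickle.dumps(v)
    tag = hmac.new(KEYS[p], blob, hashlib.sha256).digest()
    return (p, v, tag)

def sValid(p, signed):
    """Check that `signed` carries value authenticated by process p."""
    signer, v, tag = signed
    if signer != p:
        return False
    expected = hmac.new(KEYS[p], pickle.dumps(v), hashlib.sha256).digest()
    return hmac.compare_digest(tag, expected)

s = sign("p1", (3, "hello"))
assert sValid("p1", s)
assert not sValid("p2", s)   # wrong signer
```

The algorithms only need the interface: sign produces a value checkable by sValid, and no process can forge a tag for another process's value.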
Messages and disks. The model defined above includes two common models as special cases. In the message-passing model, there are no memories (m = 0), so processes can communicate only by sending messages. In the disk model [29], there are no links, so processes can communicate only via memories; moreover, each memory has a single region which always permits all processes to read and write all registers.
Consensus

In the consensus problem, processes propose an initial value and must make an irrevocable decision on a value. With crash failures, we require the following properties:

β’ Uniform Agreement. If processes p and q decide vp and vq, then vp = vq.
β’ Validity. If some process decides v, then v is the initial value proposed by some process.
β’ Termination. Eventually all correct processes decide.

We expect Agreement and Validity to hold in an asynchronous
system, while Termination requires standard additional assumptions (partial synchrony, randomization, failure detection, etc.). With Byzantine failures, we change these definitions so the problem can be solved. We consider weak Byzantine agreement [37], with the following properties:
β’ Agreement. If correct processes p and q decide vp and vq, then vp = vq.
β’ Validity. With no faulty processes, if some process decides v, then v is the input of some process.
β’ Termination. Eventually all correct processes decide.

Complexity of algorithms. We are interested in the performance
of algorithms in common-case executions, when the system is synchronous and there are no failures. In those cases, we measure performance using the notion of delays, which extends message-delays to our model. Under this metric, computations are instantaneous, each message takes one delay, and each memory operation takes two delays. Intuitively, a delay represents the time incurred by the network to transmit a message; a memory operation takes two delays because its hardware implementation requires a round trip. We say that a consensus protocol is k-deciding if, in common-case executions, some process decides in k delays.
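The delay metric lends itself to a toy accounting, sketched below with our own helper name (delays); the step labels are illustrative, not the paper's notation.

```python
# Toy accounting of the delay metric: a message costs 1 delay, a memory
# operation costs 2 (its hardware implementation is a round trip), and
# local computation is free.

COST = {"msg": 1, "mem": 2}

def delays(steps):
    """Total delays along one causal chain of steps."""
    return sum(COST[s] for s in steps)

# A Fast-Paxos-style common case (proposer -> acceptors -> learner)
# is two message delays, hence 2-deciding.
assert delays(["msg", "msg"]) == 2

# A Disk-Paxos-style common case needs at least two memory operations
# in sequence, hence at least four delays.
assert delays(["mem", "mem"]) == 4
```

This is exactly the gap the paper closes: the RDMA algorithms reach the 2-delay message-passing bound while keeping shared-memory-level resilience.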
4 BYZANTINE FAILURES

We now consider Byzantine failures and give a 2-deciding algorithm for weak Byzantine agreement with n ≥ 2fP + 1 processes and m ≥ 2fM + 1 memories. The algorithm consists of the composition of two sub-algorithms: a slow one that always works, and a fast one that gives up under hard conditions.
The first sub-algorithm, called Robust Backup, is developed in two steps. We first implement a reliable broadcast primitive, which prevents Byzantine processes from sending different values to different processes. Then, we use the framework of Clement et al. [21] combined with this primitive to convert a message-passing consensus algorithm that tolerates crash failures into a consensus algorithm that tolerates Byzantine failures. This yields Robust Backup.² It uses only static permissions and assumes memories are split into SWMR regions. Therefore, this sub-algorithm works in the traditional shared-memory model with SWMR registers, and it may be of independent interest.
The second sub-algorithm is called Cheap Quorum. It uses dynamic permissions to decide in two delays using one signature in common executions. However, the sub-algorithm gives up if the system is not synchronous or there are Byzantine failures.
Finally, we combine both sub-algorithms using ideas from the Abstract framework of Aublin et al. [7]. More precisely, we start by running Cheap Quorum; if it aborts, we run Robust Backup. There is a subtlety: for this idea to work, Robust Backup must decide on a value v if Cheap Quorum decided v previously. To do that, Robust Backup decides on a preferred value if at least f + 1 processes have this value as input. To do so, we use the classic crash-tolerant Paxos algorithm (run under the Robust Backup algorithm to ensure Byzantine tolerance) but with an initial set-up phase that ensures this safe decision. We call the protocol Preferential Paxos.

²The attentive reader may wonder why at this point we have not achieved a 2-deciding algorithm already: if we apply Clement et al. [21] to a 2-deciding crash-tolerant algorithm (such as Fast Paxos [12]), will the result not be a 2-deciding Byzantine-tolerant algorithm? The answer is no, because Clement et al. needs reliable broadcast, which incurs at least 6 delays.
4.1 The Robust Backup Sub-Algorithm

We develop Robust Backup using the construction by Clement et al. [21], which we now explain. Clement et al. show how to transform a message-passing algorithm A that tolerates fP crash failures into a message-passing algorithm that tolerates fP Byzantine failures in a system where n ≥ 2fP + 1 processes, assuming unforgeable signatures and a non-equivocation mechanism. They do so by implementing trusted message-passing primitives, T-send and T-receive, using non-equivocation and signature verification on every message. Processes include their full history with each message, and then verify locally whether a received message is consistent with the protocol. This restricts Byzantine behavior to crash failures.
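The T-send/T-receive idea can be sketched schematically. The sketch below is our own simplification, not Clement et al.'s construction: it omits signatures and non-equivocation and models only the "attach full history, verify consistency" step, with step_ok standing in for the protocol's legality check.

```python
# Schematic sketch of trusted message passing (our simplification of the
# Clement et al. idea; signatures and non-equivocation are omitted).

def t_send(history, m):
    """Record m in the sender's history; the full history travels with it."""
    history.append(m)
    return list(history)   # in the real construction, signed as well

def t_receive(transferred, step_ok):
    """Accept iff the claimed history is a legal protocol execution.

    step_ok(prefix, msg) is the protocol's transition check: is msg a
    legal next message after the history `prefix`?
    """
    for i, msg in enumerate(transferred):
        if not step_ok(transferred[:i], msg):
            return None    # inconsistent history: treat sender as crashed
    return transferred[-1]

# Toy protocol rule: messages must carry strictly increasing sequence numbers.
step_ok = lambda prefix, msg: not prefix or msg > prefix[-1]

h = []
wire = t_send(h, 1)
wire = t_send(h, 2)
assert t_receive(wire, step_ok) == 2
assert t_receive([1, 1], step_ok) is None   # protocol-violating history rejected
```

Because a receiver drops any message whose accompanying history violates the protocol, a Byzantine sender can misbehave only by going silent, which is how Byzantine behavior is reduced to crash behavior.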
To apply this construction in our model, we show that our modelcan implement non-equivocation and message passing. We first showthat shared-memory with SWMR registers (and no memory failures)can implement these primitives, and then show how our model canimplement shared-memory with SWMR registers.
4.1.1 Reliable Broadcast. Consider a shared-memory system. We present a way to prevent equivocation through a solution to the reliable broadcast problem, which we recall below. Note that our definition of reliable broadcast includes the sequence number k (as opposed to being single-shot) so as to facilitate the integration with the Clement et al. construction, as we explain in Section 4.1.2.
DEFINITION 1. Reliable broadcast is defined in terms of two primitives, broadcast(k, m) and deliver(k, m, q). When a process p invokes broadcast(k, m) we say that p broadcasts (k, m). When a process p invokes deliver(k, m, q) we say that p delivers (k, m) from q. Each correct process p must invoke broadcast(k, ·) with k one higher than p's previous invocation (and first invocation with k = 1). The following holds:

(1) If a correct process p broadcasts (k, m), then all correct processes eventually deliver (k, m) from p.
(2) If p and q are correct processes, p delivers (k, m) from r, and q delivers (k, m′) from r, then m = m′.
(3) If a correct process delivers (k, m) from a correct process p, then p must have broadcast (k, m).
(4) If a correct process delivers (k, m) from p, then all correct processes eventually deliver (k, m′) from p for some m′.
Algorithm 2 shows how to implement reliable broadcast that istolerant to a minority of Byzantine failures in shared-memory usingSWMR registers.
To broadcast its k-th message m, p simply signs (k, m) and writes it in slot Value[p, k, p] of its memory.³

³The indexing of the slots is as follows: the first index is the writer of the SWMR register, the second index is the sequence number of the message, and the third index is the sender of the message.
Algorithm 2: Reliable Broadcast

SWMR Value[n,M,n]; initialized to ⊥. Value[p] is an array of SWMR(p) registers.
SWMR L1Proof[n,M,n]; initialized to ⊥. L1Proof[p] is an array of SWMR(p) registers.
SWMR L2Proof[n,M,n]; initialized to ⊥. L2Proof[p] is an array of SWMR(p) registers.

Code for process p
  last[n]: local array with last k delivered from each process. Initially, last[q] = 0
  state[n]: local array of registers. state[q] ∈ {WaitForSender, WaitForL1Proof, WaitForL2Proof}. Initially, state[q] = WaitForSender

broadcast(k,m) {
  Value[p,k,p].write(sign((k,m))); }

value checkL2Proof(q,k) {
  for i ∈ Π {
    proof = L2Proof[i,k,q].read();
    if (proof != ⊥ && check(proof)) {
      L2Proof[p,k,q].write(proof);
      return proof.msg; } }
  return null; }

for q in Π in parallel {
  while true {
    try_deliver(q); } }

try_deliver(q) {
  k = last[q];
  val = checkL2Proof(q,k);
  if (val != null) {
    deliver(k, val, q);
    last[q] += 1;
    state[q] = WaitForSender;
    return; }

  if (state[q] == WaitForSender) {
    val = Value[q,k,q].read();
    if (val == ⊥ || !sValid(q, val) || val.k != k)
      return;
    Value[p,k,q].write(sign(val));
    state[q] = WaitForL1Proof; }

  if (state[q] == WaitForL1Proof) {
    checkedVals = ∅;
    for i ∈ Π {
      val = Value[i,k,q].read();
      if (val != ⊥ && sValid(i, val) && val.k == k) {
        add val to checkedVals; } }
    if (size(checkedVals) ≥ majority && checkedVals contains only one value) {
      l1prf = sign(checkedVals);
      L1Proof[p,k,q].write(l1prf);
      state[q] = WaitForL2Proof; } }

  if (state[q] == WaitForL2Proof) {
    checkedL1Proofs = ∅;
    for i in Π {
      proof = L1Proof[i,k,q].read();
      if (checkL1Proof(proof)) {
        add proof to checkedL1Proofs; } }
    if (size(checkedL1Proofs) ≥ majority) {
      l2prf = sign(checkedL1Proofs);
      L2Proof[p,k,q].write(l2prf); } } }
Delivering a message from another process is more involved, requiring verification steps to ensure that all correct processes will eventually deliver the same message and no other. The high-level idea is that before delivering a message (k, m) from q, each process p checks that no other process saw a different value from q, and waits to hear that "enough" other processes also saw the same value. More specifically, each process p has 3 slots per process per sequence number, that only p can write to, but all processes can read from. These slots are initialized to ⊥, and p uses them to write the values that it has seen. The 3 slots represent 3 levels of "proofs" that this value is correct; for each process q and sequence number k, p has a slot to write (1) the initial value v it read from q for k, (2) a proof that at least f + 1 processes saw the same value v from q for k, and (3) a proof that at least f + 1 processes wrote a proof of seeing value v from q for k in their second slot. We call these slots the Value slot, the L1Proof slot, and the L2Proof slot, respectively.
We note that each such valid proof has signed copies of only one value for the message. Any proof that shows copies of two different values, or a value that is not signed, is not considered valid. If a proof has copies of only value v, we say that this proof supports v.
To deliver a value v from process q with sequence number k, process p must successfully write a valid proof-of-proofs in its L2Proof slot supporting value v (we call this an L2 proof). It has two options for how to do this. First, if it sees a valid L2 proof in some other process i's L2Proof[i, k, q] slot, it copies this proof over to its own L2Proof slot, and can then deliver the value that this proof supports. If p does not find a valid L2 proof in another process's slot, it must try to construct one itself. We now describe how this is done.
A correct process p goes through three stages when constructing a valid L2 proof for (k, m) from q. In the pseudocode, the three stages are denoted using states that p goes through: WaitForSender, WaitForL1Proof, and WaitForL2Proof.
In the first stage, WaitForSender, p reads q's Value[q, k, q] slot. If p finds a (k, m) pair, p signs and copies it to its Value[p, k, q] slot and enters the WaitForL1Proof state.
In the second stage, WaitForL1Proof, p reads all Value[r, q, k] slots, for r ∈ Π. If all the values p reads are correctly signed and equal to (k, v), and if there are at least f + 1 such values, then p compiles them into an L1 proof, which it signs and writes to L1Proof[p, q, k]; p then enters the WaitForL2Proof state.
In the third stage, WaitForL2Proof, p reads all L1Proof[r, q, k] slots, for r ∈ Π. If p finds at least f + 1 valid signed L1 proofs for (k, v), then p compiles them into an L2 proof, which it signs and writes to L2Proof[p, q, k]. The next time that p scans the L2Proof[·, q, k] slots, p will see its own L2 proof (or some other valid proof for (k, v)) and deliver (k, v).
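The three stages can be sketched in a toy single-machine model as follows. This is our own illustration, not the paper's code: the slot layout follows the description above, but the function names are ours, and "signatures" are modeled as plain (signer, payload) pairs rather than real cryptography.

```python
# Toy model of the three proof stages of reliable broadcast.
# Slots are dictionaries keyed by (writer, sender, seqno); a "signature"
# is just a (signer, payload) pair standing in for a real signature.

F = 1                    # assumed maximum number of Byzantine processes
N = 2 * F + 1            # n >= 2f + 1
PROCS = list(range(N))

Value, L1Proof, L2Proof = {}, {}, {}   # shared SWMR slots

def sign(p, payload):
    return (p, payload)

def broadcast(q, k, v):
    # The sender writes its signed (k, v) pair into its own Value slot.
    Value[(q, q, k)] = sign(q, (k, v))

def wait_for_sender(p, q, k):
    # Stage 1: copy the sender's (k, v) pair into p's own Value slot, signed.
    entry = Value.get((q, q, k))
    if entry is not None:
        Value[(p, q, k)] = sign(p, entry[1])
        return True
    return False

def wait_for_l1_proof(p, q, k, kv):
    # Stage 2: collect f+1 signed copies of the same (k, v) pair.
    copies = [Value[(r, q, k)] for r in PROCS if (r, q, k) in Value]
    matching = tuple(c for c in copies if c[1] == kv)
    if len(matching) >= F + 1:
        L1Proof[(p, q, k)] = sign(p, matching)
        return True
    return False

def wait_for_l2_proof(p, q, k, kv):
    # Stage 3: collect f+1 valid L1 proofs for (k, v); then deliver.
    proofs = [L1Proof[(r, q, k)] for r in PROCS if (r, q, k) in L1Proof]
    supporting = tuple(pr for pr in proofs if all(c[1] == kv for c in pr[1]))
    if len(supporting) >= F + 1:
        L2Proof[(p, q, k)] = sign(p, supporting)
        return kv            # delivered
    return None
```

Running the stages in order for a correct sender makes every process construct an L2 proof and deliver the same (k, v) pair.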
This three-stage validation ensures the following crucial property: no two valid L2 proofs can support different values. Intuitively, this property holds because both L1 and L2 proofs must copy at least f + 1 values from the previous stage, so at least one correct process is involved in the quorum needed to construct each proof. Because correct processes read the slots of all others at each stage before constructing the next proof, and because they never overwrite or delete values that they already wrote, no two correct processes can create valid L1 proofs for different values, since one must see the Value slot of the other. Thus, no two processes, Byzantine or otherwise, can construct valid L2 proofs for different values.
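The counting behind this argument is standard quorum intersection: with n = 2f + 1, any two sets of f + 1 processes overlap, since (f + 1) + (f + 1) = n + 1 > n. A quick exhaustive check (our own illustration, not from the paper):

```python
from itertools import combinations

# With n = 2f + 1 processes, any two quorums of f + 1 processes intersect:
# |A| + |B| = 2f + 2 = n + 1 > n, so |A ∩ B| >= 1.
def min_quorum_overlap(f):
    n = 2 * f + 1
    quorums = list(combinations(range(n), f + 1))
    return min(len(set(a) & set(b)) for a in quorums for b in quorums)

for f in range(1, 4):
    assert min_quorum_overlap(f) == 1   # overlap guaranteed, but may be just 1
```

Note that the single common process may itself be Byzantine; the argument in the text therefore also relies on correct processes reading all slots and never overwriting their own values.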
Notably, a weaker version of broadcast that does not require Property 4 (i.e., Byzantine consistent broadcast [17, Module 3.11]) can be solved with just the first stage of Algorithm 2, without the L1 and L2 proofs. The purpose of those proofs is to ensure that the 4th property holds; that is, to enable all correct processes to deliver a value once some correct process has delivered it.
In the appendix, we formally prove the above intuition and arrive at the following lemma.
LEMMA 4.1. Reliable broadcast can be solved in shared memory with SWMR regular registers with n ≥ 2f + 1 processes.
4.1.2 Applying Clement et al.'s Construction. Clement et al. show that given unforgeable transferable signatures and non-equivocation, one can reduce Byzantine failures to crash failures in message-passing systems [21]. They define non-equivocation as a predicate valid_p for each process p, which takes a sequence number and a value and evaluates to true for just one value per sequence number. All processes must be able to call the same valid_p predicate, and every invocation of it must terminate.
We now show how to use reliable broadcast to implement messages with transferable signatures and non-equivocation as defined by Clement et al. [21]. Note that our reliable broadcast mechanism already involves transferable signatures, so to send and receive signed messages, one can simply broadcast and deliver them. However, broadcast and deliver alone are not enough to satisfy the requirements of the valid_p predicate of Clement et al. The problem occurs when trying to validate nested messages recursively.
In particular, recall that in Clement et al.'s construction, whenever a message is sent, the entire history of that process, including all messages it has sent and received, is attached. Consider two Byzantine processes p1 and p2, and assume that p1 attempts to equivocate in its kth message, signing both (k, m) and (k, m'). Assume therefore that no correct process delivers any message from p1 in its kth round. However, since p2 is also Byzantine, it could claim to have delivered (k, m) from p1. If p2 then sends a message that includes (p1, k, m) as part of its history, a correct process p receiving p2's message must recursively verify the history p2 sent. To do so, p can call try_deliver on (p1, k). However, since no correct process delivered any message for (p1, k), it is possible that this call never returns.
To solve this issue, we introduce a validate operation that can be used along with broadcast and deliver to validate the correctness of a given message. The validate operation is very simple: it takes a process id, a sequence number, and a message value m, and simply runs the checkL2proof helper function. If the function returns a proof supporting m, validate returns true. Otherwise it returns false. The pseudocode is shown in Algorithm 3.
In this way, Algorithms 2 and 3 together provide signed messages and a non-equivocation primitive. Thus, combined with the construction of Clement et al. [21], we immediately get the following result.
THEOREM 4.2. There exists an algorithm for weak Byzantine agreement in a shared-memory system with SWMR regular registers, signatures, and up to f_P Byzantine processes, where n ≥ 2f_P + 1.
Algorithm 3: Validate Operation for Reliable Broadcast

  bool validate(q, k, m) {
    val = checkL2proof(q, k);
    if (val == m) {
      return true; }
    return false; }
Non-equivocation in our model. To convert the above algorithm to our model, where memory may fail, we use the ideas in [2, 6, 30] to implement failure-free SWMR regular registers from the fail-prone memory, and then run weak Byzantine agreement using those regular registers. To implement an SWMR register, a process writes to or reads from all memories, and waits for a majority to respond. When reading, if p sees exactly one distinct non-⊥ value v across the memories, it returns v; otherwise, it returns ⊥.
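The register emulation just described can be sketched as follows. This is a minimal crash-only model of our own; the Memory class and its fields are assumptions, and a crashed memory is modeled as simply never responding.

```python
# Sketch: emulating a failure-free SWMR regular register on top of
# m >= 2*f_M + 1 fail-prone memories. Non-responding (crashed) memories
# are modeled by omitting them from the replies.

BOT = None  # models the initial value ⊥

class Memory:
    def __init__(self):
        self.cell = BOT
        self.crashed = False

def write(mems, v):
    acks = 0
    for mem in mems:
        if not mem.crashed:          # a crashed memory never acks
            mem.cell = v
            acks += 1
    return acks > len(mems) // 2     # wait for a majority of acks

def read(mems):
    replies = [mem.cell for mem in mems if not mem.crashed]
    assert len(replies) > len(mems) // 2, "need a majority to respond"
    non_bot = {v for v in replies if v is not BOT}
    # exactly one distinct non-⊥ value -> return it; otherwise return ⊥
    return non_bot.pop() if len(non_bot) == 1 else BOT
```

For example, with three memories (m = 3, f_M = 1), a value written before one memory crashes is still returned by a subsequent read.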
DEFINITION 2. Let A be a message-passing algorithm. Robust Backup(A) is the algorithm A in which all send and receive operations are replaced by T-send and T-receive operations (respectively) implemented with reliable broadcast.
Thus we get the following lemma, from the result of Clement et al. [21], Lemma 4.1, and the above handling of memory failures.
LEMMA 4.3. If A is a consensus algorithm that is tolerant to f_P process crash failures, then Robust Backup(A) is a weak Byzantine agreement algorithm that is tolerant to up to f_P Byzantine processes and f_M memory crashes, where n ≥ 2f_P + 1 and m ≥ 2f_M + 1, in the message-and-memory model.
The following theorem is an immediate corollary of the lemma.
THEOREM 4.4. There exists an algorithm for Weak Byzantine Agreement in a message-and-memory model with up to f_P Byzantine processes and f_M memory crashes, where n ≥ 2f_P + 1 and m ≥ 2f_M + 1.
4.2 The Cheap Quorum Sub-Algorithm

We now give an algorithm that decides in two delays in common executions, in which the system is synchronous and there are no failures. It requires only one signature for a fast decision, whereas the best prior algorithm requires 6f_P + 2 signatures and n ≥ 3f_P + 1 [7]. Our algorithm, called Cheap Quorum, is not in itself a complete consensus algorithm; it may abort in some executions. If Cheap Quorum aborts, it outputs an abort value, which is used to initialize Robust Backup so that their composition preserves weak Byzantine agreement. This composition is inspired by the Abstract framework of Aublin et al. [7].
The algorithm has a special process ℓ, say ℓ = p1, which serves both as a leader and a follower. Other processes act only as followers. The memory is partitioned into n + 1 regions: Region[p] for each p ∈ Π, plus an extra region Region[ℓ] for p1, in which it proposes a value. Initially, Region[p] is a regular SWMR region where p is the writer. Unlike in Algorithm 2, some of the permissions are dynamic: processes may remove p1's write permission to Region[ℓ] (i.e., the legalChange function returns false to any permission change request, except for ones revoking p1's permission to write to Region[ℓ]).
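The legalChange policy just described amounts to a one-line predicate. A minimal sketch, where the argument shape is our assumption (the paper specifies only the policy's effect):

```python
# Sketch of Cheap Quorum's legalChange policy: the only permission change
# ever allowed is revoking p1's write access to Region[ℓ]. A request is
# modeled as (region, kind, new_holders), where new_holders is the set of
# processes that would hold `kind` access after the change.

P1 = "p1"
LEADER_REGION = "Region[l]"

def legal_change(region, kind, new_holders):
    return (region == LEADER_REGION
            and kind == "W"
            and P1 not in new_holders)
```

For example, the request made in panic mode, which sets the write set of Region[ℓ] to the empty set, is legal, while any attempt to grant or restore write access to Region[ℓ] is not.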
Algorithm 4: Cheap Quorum normal operation (code for process p)

  // Leader code
  propose(v) {
    sign(v);
    status = Value[ℓ].write(v);
    if (status == nak) Panic_mode();
    else decide(v); }

  // Follower code
  propose(w) {
    do { v = read(Value[ℓ]);
         for all q ∈ Π do pan[q] = read(Panic[q]);
    } until (v ≠ ⊥ || pan[q] == true for some q || timeout);
    if (v ≠ ⊥ && sValid(p1, v)) {
      sign(v);
      write(Value[p], v);
      do { for all q ∈ Π do val[q] = read(Value[q]);
           if |{q : val[q] == v}| ≥ n then {
             Proof[p].write(sign(val[1..n]));
             for all q ∈ Π do prf[q] = read(Proof[q]);
             if |{q : verifyProof(prf[q]) == true}| ≥ n { decide(v); exit; } }
           for all q ∈ Π do pan[q] = read(Panic[q]);
      } until (pan[q] == true for some q || timeout); }
    Panic_mode(); }
Algorithm 5: Cheap Quorum panic mode (code for process p)

  panic_mode() {
    Panic[p] = true;
    changePermission(Region[ℓ], R: Π, W: {}, RW: {}); // remove write permission
    v = read(Value[p]);
    prf = read(Proof[p]);
    if (v ≠ ⊥) { Abort with ⟨v, prf⟩; return; }
    LVal = read(Value[ℓ]);
    if (LVal ≠ ⊥) { Abort with ⟨LVal, ⊥⟩; return; }
    Abort with ⟨myInput, ⊥⟩; }
Processes initially execute under a normal mode in common-case executions, but may switch to panic mode if they intend to abort, as in [7]. The pseudocode of the normal mode is in Algorithm 4. Region[p] contains three registers Value[p], Panic[p], and Proof[p], initially set to ⊥, false, and ⊥, respectively. To propose v, the leader p1 signs v and writes it to Value[ℓ]. If the write is successful (it may fail because its write permission was removed), then p1 decides v; otherwise p1 calls Panic_mode(). Note that all processes, including p1, continue their execution after deciding; however, p1 never decides again after it has decided as the leader. A follower p checks whether p1 wrote to Value[ℓ] and, if so, whether the value is properly signed. If so, p signs v, writes it to Value[p], and waits for other processes to write the same value to their own Value slots. If p sees 2f_P + 1 copies of v signed by different processes, p assembles these copies into a unanimity proof, which it signs and writes to Proof[p]. p then waits for 2f_P + 1 unanimity proofs for v to appear in the Proof slots, and checks that they are valid, in which case p decides v. This waiting continues until a timeout expires^4, at which time p calls Panic_mode(). In Panic_mode(), a
^4 The timeout is chosen to be an upper bound on the communication, processing, and computation delays in the common case.
Figure 6: Interactions of the components of the Fast & Robust Algorithm. [Diagram: Cheap Quorum (running over RDMA) either commits a value directly or passes an abort value to Robust Backup, i.e., Preferential Paxos running over reliable broadcast (over RDMA), which then commits a value.]
process p sets Panic[p] to true to tell other processes that it is panicking; other processes periodically check whether they should panic too. p then removes write permission from Region[ℓ], and decides on a value to abort with: Value[p] if it is non-⊥, Value[ℓ] if it is non-⊥, or p's input value. If p has a unanimity proof in Proof[p], it adds it to the abort value.
In Appendix B, we prove the correctness of Cheap Quorum, and in particular we show the following two important agreement properties:
LEMMA 4.5 (CHEAP QUORUM DECISION AGREEMENT). Let p and q be correct processes. If p decides v_1 while q decides v_2, then v_1 = v_2.
LEMMA 4.6 (CHEAP QUORUM ABORT AGREEMENT). Let p and q be correct processes (possibly identical). If p decides v in Cheap Quorum while q aborts from Cheap Quorum, then v will be q's abort value. Furthermore, if q is a follower, q's abort proof is a correct unanimity proof.
The above construction assumes a fail-free memory with regular registers, but we can extend it to tolerate memory failures using the approach of Section 4.1, noting that each register has a single writer process.
4.3 Putting it Together: the Fast & Robust Algorithm
The final algorithm, called Fast & Robust, combines Cheap Quorum (§4.2) and Robust Backup (§4.1), as we now explain. Recall that Robust Backup is parameterized by a message-passing consensus algorithm A that tolerates crash failures. A can be any such algorithm (e.g., Paxos).
Roughly, in Fast & Robust, we run Cheap Quorum and, if it aborts, we use a process's abort value as its input value to Robust Backup. However, we must carefully glue the two algorithms together to ensure that if some correct process decided v in Cheap Quorum, then v is the only value that can be decided in Robust Backup.
For this purpose, we propose a simple wrapper for Robust Backup, called Preferential Paxos. Preferential Paxos first runs a set-up phase, in which processes may adopt new values, and then runs Robust Backup with the new values. More specifically, there are some preferred input values v_1 ... v_n, ordered by priority. We guarantee that every process adopts one of the top f_P + 1 priority inputs. In particular, this means that if a majority of processes get the highest-priority value, v_1, as input, then v_1 is guaranteed to be the decision value. The set-up phase is simple: all processes send each other their input values. Each process p waits to receive n - f_P such messages, and adopts the value with the highest priority that it sees. This is the value that p uses as its input to Paxos. The pseudocode for Preferential Paxos is given in Algorithm 8 in Appendix C, where we also prove the following lemma about Preferential Paxos:
LEMMA 4.7 (PREFERENTIAL PAXOS PRIORITY DECISION). Preferential Paxos implements weak Byzantine agreement with n ≥ 2f_P + 1 processes. Furthermore, let v_1, ..., v_n be the input values of an instance C of Preferential Paxos, ordered by priority. The decision value of correct processes is always one of v_1, ..., v_{f_P+1}.
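The set-up phase's guarantee follows from simple counting: a process misses at most f_P of the n input values, so the best value it sees is among the top f_P + 1 priorities. A small sketch of the adoption rule (the function name and representation are ours):

```python
# Sketch of Preferential Paxos's set-up phase: each process receives at
# least n - f of the n input values and adopts the highest-priority one.
# A lower index in priority_order means a higher priority.

def adopt(received, priority_order):
    rank = {v: i for i, v in enumerate(priority_order)}
    return min(received, key=lambda v: rank[v])

f = 2
n = 2 * f + 1
inputs = ["v1", "v2", "v3", "v4", "v5"]       # v1 has the highest priority

# Worst case: a process misses exactly the f best values; even then it
# adopts a value among the top f + 1 priorities.
assert adopt(inputs[f:], inputs) == "v3"      # v_{f+1}
assert adopt(["v5", "v2", "v4"], inputs) == "v2"
```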
We can now describe Fast & Robust in detail. We start executing Cheap Quorum. If Cheap Quorum aborts, we execute Preferential Paxos, with each process receiving its abort value from Cheap Quorum as its input value to Preferential Paxos. We define the priorities of inputs to Preferential Paxos as follows.
DEFINITION 3 (INPUT PRIORITIES FOR PREFERENTIAL PAXOS). The input values for Preferential Paxos as it is used in Fast & Robust are split into three sets (here, p1 is the leader of Cheap Quorum):

  • U = {v | v contains a correct unanimity proof}
  • S = {v | v ∉ U ∧ v contains the signature of p1}
  • B = {v | v ∉ U ∧ v ∉ S}

The priority order of the input values is such that for all values v_U ∈ U, v_S ∈ S, and v_B ∈ B, priority(v_U) > priority(v_S) > priority(v_B).
Figure 6 shows how the various algorithms presented in this section come together to form the Fast & Robust algorithm. In Appendices B and C, we show that Fast & Robust is correct, with the following key lemma:
LEMMA 4.8 (COMPOSITION LEMMA). If some correct process decides a value v in Cheap Quorum before an abort, then v is the only value that can be decided in Preferential Paxos with priorities as defined in Definition 3.
THEOREM 4.9. There exists a 2-deciding algorithm for Weak Byzantine Agreement in a message-and-memory model with up to f_P Byzantine processes and f_M memory crashes, where n ≥ 2f_P + 1 and m ≥ 2f_M + 1.
5 CRASH FAILURES

We now restrict ourselves to crash failures of processes and memories. Clearly, we can use the algorithms of Section 4 in this setting to obtain a 2-deciding consensus algorithm with n ≥ 2f_P + 1 and m ≥ 2f_M + 1. However, this is overkill, since those algorithms use sophisticated mechanisms (signatures, non-equivocation) to guard against Byzantine behavior. With only crash failures, we now show it is possible to retain the efficiency of a 2-deciding algorithm while improving resiliency. In Section 5.1, we first give a 2-deciding algorithm that allows the crash of all but one process (n ≥ f_P + 1) and a minority of memories (m ≥ 2f_M + 1). In Section 5.2, we improve resilience further by giving a 2-deciding algorithm that
Algorithm 7: Protected Memory Paxos (code for process p)

   1 Registers: for i = 1..m, p ∈ Π,
   2   slot[i,p]: tuple (minProp, accProp, value) // in memory i
   3 Ω: failure detector that returns current leader

   5 startPhase2(i) {
   6   add i to ListOfReady;
   7   while (size(ListOfReady) < majority of memories) {}
   8   Phase2Started = true; }

  10 propose(v) {
  11   repeat forever {
  12     wait until Ω == p; // wait to become leader
  13     propNr = a higher value than any proposal number seen before;
  14     CurrentVal = v;
  15     CurrentMaxProp = 0;
  16     Phase2Started = false;
  17     ListOfReady = ∅;
  18     for every memory i in parallel {
  19       if (p != p1 || not first attempt) {
  20         getPermission(i);
  21         success = write(slot[i,p], (propNr, ⊥, ⊥));
  22         if (not success) { abort(); }
  23         vals = read all slots from i;
  24         if (vals contains a non-null value) {
  25           val = v ∈ vals with highest propNr;
  26           if (val.propNr > propNr) { abort(); }
  27           atomic {
  28             if (val.propNr > CurrentMaxProp) {
  29               if (Phase2Started) { abort(); }
  30               CurrentVal = val.value;
  31               CurrentMaxProp = val.propNr;
  32         } } }
  33         startPhase2(i); }
  34       // done phase 1 or (p == p1 && p1's first attempt)
  35       success = write(slot[i,p], (propNr, propNr, CurrentVal));
  36       if (not success) { abort(); }
  37     } until this has been done at a majority of the memories, or until 'abort' has been called
  38     if (loop completed without abort) {
  39       decide CurrentVal; } } }
tolerates crashes of a minority of the combined set of memories and processes.
5.1 Protected Memory Paxos

Our starting point is the Disk Paxos algorithm [29], which works in a system with processes and memories where n ≥ f_P + 1 and m ≥ 2f_M + 1. This is our resiliency goal, but Disk Paxos takes four delays in common executions. Our new algorithm, called Protected Memory Paxos, removes two delays; it retains the structure of Disk Paxos but uses permissions to skip steps. Initially, some fixed leader ℓ = p1 has exclusive write permission to all memories; if another process becomes leader, it takes the exclusive permission. Having exclusive permission lets a leader ℓ optimize execution, because ℓ can do two things simultaneously: (1) write its consensus proposal and (2) determine whether another leader took over. Specifically, if ℓ succeeds in (1), it knows no leader ℓ' took over, because ℓ' would have taken the permission. Thus ℓ avoids the last read in Disk Paxos, saving two delays. Of course, care must be taken to implement this without violating safety.
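The permission trick can be illustrated with a toy model (ours, not the paper's code): a write succeeds only while the writer still holds the region's exclusive permission, so a successful proposal write doubles as a check that no new leader has taken over.

```python
# Toy model of the Protected Memory Paxos insight: a leader's write succeeds
# only if it still holds the exclusive write permission, so a successful
# write simultaneously stores the proposal and proves no new leader took over.

class MemoryRegion:
    def __init__(self, holder):
        self.holder = holder      # process with exclusive write permission
        self.data = {}

    def get_permission(self, p):  # a new leader takes the permission
        self.holder = p

    def write(self, p, key, value):
        if self.holder != p:      # permission was taken: the write fails
            return False
        self.data[key] = value
        return True

region = MemoryRegion(holder="p1")
assert region.write("p1", "slot", ("prop1", "v"))      # leader p1 succeeds
region.get_permission("p2")                            # p2 becomes leader
assert not region.write("p1", "slot", ("prop2", "v"))  # p1's write now fails
```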
The pseudocode of Protected Memory Paxos is in Algorithm 7. Each memory has one memory region, and at any time exactly one process can write to the region. Each memory i holds a register slot[i, p] for each process p. Intuitively, slot[i, p] is intended for p to write, but p may not have write permission to do so if it is not the leader; in that case, no process writes slot[i, p].
When a process p becomes the leader, it must execute a sequence of steps on a majority of the memories to successfully commit a value. It is important that p execute all of these steps on each of the memories that counts toward its majority; otherwise, two leaders could miss each other's values and commit conflicting values. We therefore present the pseudocode for this algorithm as a parallel-for loop (lines 18-37), with one thread per memory that p accesses. The algorithm has two phases, similar to Paxos, where the second phase may begin only after the first phase has been completed for a majority of the memories. We represent this in the code with a barrier that waits for a majority of the threads.
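The barrier corresponds to startPhase2 in Algorithm 7 and can be modeled deterministically as a counter released once a majority of the per-memory threads arrive (a sketch of ours; real code would block, as line 7 of the algorithm does):

```python
# Sketch of the majority barrier between phase 1 and phase 2: phase 2 may
# start only once a majority of the per-memory threads finished phase 1.

class MajorityBarrier:
    def __init__(self, m):
        self.m = m
        self.ready = set()        # memories whose thread completed phase 1

    def arrive(self, i):
        self.ready.add(i)

    def phase2_may_start(self):
        return len(self.ready) > self.m // 2

b = MajorityBarrier(m=5)
for i in (0, 1):
    b.arrive(i)
assert not b.phase2_may_start()   # only 2 of 5: not yet a majority
b.arrive(2)
assert b.phase2_may_start()       # 3 of 5: majority reached
```

Stragglers are naturally accommodated: a thread that finishes phase 1 late simply calls arrive after the barrier has already opened.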
When a process p becomes leader, it executes the prepare phase (the first leader p1 can skip this phase in its first execution of the loop), where, for each memory, p attempts to (1) acquire exclusive write permission, (2) write a new proposal number in its slot, and (3) read all slots of that memory. p waits to succeed in executing these steps on a majority of the memories. If any of p's writes fail or p finds a proposal with a higher proposal number, then p gives up. This is represented by an abort in the pseudocode; when an abort is executed, the for loop terminates. We assume that when the for loop terminates (either because some thread has aborted or because a majority of threads have reached the end of the loop), all threads of the for loop are terminated and control returns to the main loop (lines 11-37).
If p does not abort, it adopts the value with the highest proposal number of all those it read in the memories. To make it clear that races should be avoided among parallel threads in the pseudocode, we wrap this part in an atomic environment.
In the next phase, each of p's threads writes its value to its slot on its memory. If a write fails, p gives up. If p succeeds, this is where we optimize time: p can simply decide, whereas Disk Paxos must read the memories again.
Note that it is possible that some of the memories that made up the majority that passed the initial barrier crash later on. To prevent p from stalling forever in such a situation, it is important that straggler threads that complete phase 1 later on be allowed to participate in phase 2. However, if such a straggler thread observes a more up-to-date value in its memory than the one adopted by p for phase 2, this must be taken into account. In this case, to avoid inconsistencies, p must abort its current attempt and restart the loop from scratch.
The code ensures that some correct process eventually decides, but it is easy to extend it so that all correct processes decide [18], by having a decided process broadcast its decision. Also, the code shows one instance of consensus, with p1 as the initial leader. With many consensus instances, the leader terminates one instance and becomes the default leader in the next.
THEOREM 5.1. Consider a message-and-memory model with up to f_P process crashes and f_M memory crashes, where n ≥ f_P + 1 and m ≥ 2f_M + 1. There exists a 2-deciding algorithm for consensus.
5.2 Aligned Paxos

We now further enhance the failure resilience. We show that memories and processes are equivalent agents, in that it suffices for a majority of the agents (processes and memories together) to remain alive to solve consensus. Our new algorithm, Aligned Paxos, achieves this resiliency. To do so, the algorithm relies on the ability to use both the messages and the memories in our model; permissions are not needed. The key idea is to align a message-passing algorithm and a memory-based algorithm to use any majority of agents. We align Paxos [38] and Protected Memory Paxos so that their decisions are coordinated. More specifically, Protected Memory Paxos and Paxos have two phases. To align these algorithms, we factor out their differences and replace their steps with an abstraction that is implemented differently for each algorithm. The result is our Aligned Paxos algorithm, which has two phases, each with three steps: communicate, hear back, and analyze. Each step treats processes and memories separately, and translates the results of operations on different agents to a common language. We implement the steps using their analogues in Paxos and Protected Memory Paxos^5. The pseudocode of Aligned Paxos is shown in Appendix E.
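The agent abstraction can be summarized by a single counting rule (our illustration, not the paper's pseudocode): a phase completes once any majority of the n + m agents responds, whatever the mix of processes and memories.

```python
# Sketch of Aligned Paxos's resilience condition: processes and memories
# are interchangeable agents, and a phase needs replies from any majority
# of the n + m agents.

def phase_complete(process_replies, memory_replies, n, m):
    return process_replies + memory_replies > (n + m) // 2

n, m = 3, 2                              # 5 agents; a majority is 3
assert phase_complete(3, 0, n, m)        # processes alone can form a majority
assert phase_complete(1, 2, n, m)        # so can any mixed set of 3 agents
assert not phase_complete(0, 2, n, m)    # 2 of 5 agents is not enough
```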
6 DYNAMIC PERMISSIONS ARE NECESSARY FOR EFFICIENT CONSENSUS
In §5.1, we showed how dynamic permissions can improve the performance of Disk Paxos. Are dynamic permissions necessary? We prove that with shared memory (or disks) alone, one cannot achieve 2-deciding consensus, even if the memory never fails, it has static permissions, processes may only fail by crashing, and the system is partially synchronous in the sense that eventually there is a known upper bound on the time it takes a correct process to take a step [27]. This result applies a fortiori to the Disk Paxos model [29].
THEOREM 6.1. Consider a partially synchronous shared-memory model with registers, where registers can have arbitrary static permissions, memory never fails, and at most one process may fail by crashing. No consensus algorithm is 2-deciding.
PROOF. Assume by contradiction that A is an algorithm in the stated model that is 2-deciding. That is, there is some execution E of A in which some process p decides a value v in 2 delays. We denote by R and W the sets of objects that p reads and writes in E, respectively. Note that since p decides in 2 delays in E, R and W must be disjoint, by the definition of operation delay and the fact that a process has at most one outstanding operation per object. Furthermore, p must issue all of its reads and writes without waiting for the response of any operation.
Consider an execution E' in which p reads from the same set R of objects and writes the same values as in E to the same set W of objects. All of the read operations that p issues return by some time t_0, but the write operations of p are delayed for a long time. Another process p' begins its proposal of a value v' ≠ v after t_0. Since no process other than p' writes to any objects, E' is indistinguishable to p' from an execution in which it runs alone. Since A is a correct consensus algorithm that terminates if there is no contention, p' must
^5 We believe other implementations are possible. For example, replacing the Protected Memory Paxos implementation for memories with the Disk Paxos implementation yields an algorithm that does not use permissions.
eventually decide value v'. Let t' be the time at which p' decides. All of p's write operations terminate and are linearized in E' after time t'. Execution E' is indistinguishable to p from execution E, in which it ran alone. Therefore, p decides v ≠ v', violating agreement. □
Theorem 6.1, together with the Fast Paxos algorithm of Lamport [39], shows that an atomic read-write shared-memory model is strictly weaker than the message-passing model in its ability to solve consensus quickly. This result may be of independent interest, since the classic shared-memory and message-passing models are often seen as equivalent, because of the seminal computational equivalence result of Attiya, Bar-Noy, and Dolev [6]. Interestingly, it is known that shared memory can tolerate more failures when solving consensus (with randomization or partial synchrony) [5, 15], which suggests that shared memory is strictly stronger than message passing for solving consensus. However, our result shows that there are aspects in which message passing is stronger than shared memory. In particular, message passing can solve consensus faster than shared memory in well-behaved executions.
7 RDMA IN PRACTICE

Our model is meant to reflect the capabilities of RDMA, while providing a clean abstraction to reason about. We now give an overview of how RDMA works, and how features of our model can be implemented using RDMA.
RDMA enables a remote process to access local memory directly through the network interface card (NIC), without involving the CPU. For a piece of local memory to be accessible to a remote process p, the CPU has to register that memory region and associate it with the appropriate connection (called a Queue Pair) for p. The association of a registered memory region and a queue pair is done indirectly through a protection domain: both memory regions and queue pairs are associated with a protection domain, and a queue pair Q can be used to access a memory region R if Q and R are in the same protection domain. The CPU must also specify what access level (read, write, read-write) is allowed to the memory region in each protection domain. A local memory area can thus be registered and associated with several queue pairs, with the same or different access levels, by associating it with one or more protection domains. Each RDMA connection can be used by the remote server to access registered memory regions using a unique region-specific key created as part of the registration process.
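The registration rules above can be condensed into a toy model (ours; the real RDMA verbs API, e.g., ibv_reg_mr, is more involved): a queue pair may access a region only if both are associated with the same protection domain, and only at the region's registered access level.

```python
# Toy model of RDMA protection domains: a queue pair qp may access a
# registered region only if both belong to the same protection domain,
# and only at the access level given at registration time.

class ProtectionDomain:
    def __init__(self):
        self.queue_pairs = set()
        self.regions = {}            # region name -> access level string

    def register_region(self, region, access):   # access: "r", "w", or "rw"
        self.regions[region] = access

    def add_queue_pair(self, qp):
        self.queue_pairs.add(qp)

    def allowed(self, qp, region, op):           # op: "r" or "w"
        return qp in self.queue_pairs and op in self.regions.get(region, "")

pd = ProtectionDomain()
pd.add_queue_pair("qp-to-q")
pd.register_region("values-row", "r")            # read-only for remote q
assert pd.allowed("qp-to-q", "values-row", "r")
assert not pd.allowed("qp-to-q", "values-row", "w")
```

Registering the same memory area in several protection domains, with different access levels, is what gives different remote processes different permissions.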
As highlighted by previous work [45], failures of the CPU, NIC, and DRAM can be seen as independent (e.g., arbitrary delays, too many bit errors, and failed ECC checks, respectively). For instance, zombie servers, in which the CPU is blocked but RDMA requests can still be served, account for roughly half of all failures [45]. This motivates our choice to treat processes and memory separately in our model. In practice, if a CPU fails permanently, the memory will eventually also become unreachable through RDMA; however, in such cases memory may remain available long enough for ongoing operations to complete. Also, in practical settings it is possible for full-system crashes to occur (e.g., machine restarts), which correspond to a process and a memory failing at the same time; this is allowed by our model.
Memory regions in our model correspond to RDMA memory regions. Static permissions can be implemented by making the appropriate memory region registrations before the execution of the algorithm; these permissions then persist during execution without CPU involvement. Dynamic permissions require the host CPU to change the access levels; this should be done in the OS kernel: the kernel creates regions and controls their permissions, and then shares memory with user-space processes. In this way, Byzantine processes cannot change permissions illegally. The assumption is that the kernel is not Byzantine. Alternatively, future hardware support similar to SGX could even allow parts of the kernel to be Byzantine.
Using RDMA, a process p can grant permissions to a remote process q by registering memory regions with the appropriate access permissions (read, write, or read-write) and sending the corresponding key to q. p can revoke permissions dynamically by simply deregistering the memory region.
For our reliable broadcast algorithm, each process can register the two-dimensional array of values in read-only mode with a protection domain. All the queue pairs used by that process are also created in the context of the same protection domain. Additionally, the process can preserve write access to its row via another registration of just that row with the protection domain, thus enabling single-writer multiple-reader access. Thereafter, the reliable broadcast algorithm can be implemented trivially by using RDMA reads and writes by all processes. Reliable broadcast with unreliable memories is similarly straightforward, since failure of the memory ensures that no process will be able to access that memory.
For Cheap Quorum, the static memory region registrations are straightforward, as above. To revoke the leader's write permission, it suffices for a region's host process to deregister the memory region. Panic messages can be relayed using RDMA message sends.
In our crash-only consensus algorithm, we leverage the capability of registering overlapping memory regions in a protection domain. As in the above algorithms, each process uses one protection domain for RDMA accesses. Queue pairs for connections to all other processes are associated with this protection domain. The process's entire slot array is registered with the protection domain in read-only mode. In addition, the same slot array can be dynamically registered (and deregistered) in write mode based on incoming write permission requests: a proposer requests write permission using an RDMA message send. In response, the acceptor first deregisters write permission for the immediately previous proposer. The acceptor then registers the slot array in write mode and responds to the proposer with the new key associated with the newly registered slot array. Reads of the slot array are performed by the proposer using RDMA reads. The subsequent second-phase RDMA write of the value can be performed on the slot array as long as the proposer continues to have write permission to it. The RDMA write fails if the acceptor granted write permission to another proposer in the meantime.
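The handover protocol just described keeps at most one live write registration per slot array; a stale key makes the old proposer's RDMA write fail. A minimal model of ours, with keys as integers rather than real RDMA rkeys:

```python
# Sketch of the write-permission handover: the acceptor keeps at most one
# live write registration (one key) for its slot array; granting a new
# proposer first invalidates the old key, so stale writes fail.

class Acceptor:
    def __init__(self):
        self.write_key = None         # (holder, key) of current registration
        self.next_key = 0

    def request_write_permission(self, proposer):
        self.next_key += 1            # deregister the old key, register a new one
        self.write_key = (proposer, self.next_key)
        return self.next_key

    def rdma_write(self, proposer, key):
        return self.write_key == (proposer, key)  # stale keys fail

acc = Acceptor()
k1 = acc.request_write_permission("p1")
assert acc.rdma_write("p1", k1)
k2 = acc.request_write_permission("p2")   # p1's registration is revoked
assert not acc.rdma_write("p1", k1)       # p1's write now fails
```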
Acknowledgements. We wish to thank the anonymous reviewers for their helpful comments on improving the paper. This work has been supported in part by the European Research Council (ERC) Grant 339539 (AOC), and a Microsoft PhD Fellowship.
REFERENCES
[1] Ittai Abraham, Gregory Chockler, Idit Keidar, and Dahlia Malkhi. Byzantine disk paxos: optimal resilience with byzantine shared memory. Distributed computing (DIST), 18(5):387–408, 2006.
[2] Yehuda Afek, David S. Greenberg, Michael Merritt, and Gadi Taubenfeld. Computing with faulty shared memory. In ACM Symposium on Principles of Distributed Computing (PODC), pages 47–58, August 1992.
[3] Marcos K. Aguilera, Naama Ben-David, Irina Calciu, Rachid Guerraoui, Erez Petrank, and Sam Toueg. Passing messages while sharing memory. In ACM Symposium on Principles of Distributed Computing (PODC), pages 51–60. ACM, 2018.
[4] Noga Alon, Michael Merritt, Omer Reingold, Gadi Taubenfeld, and Rebecca N. Wright. Tight bounds for shared memory systems accessed by byzantine processes. Distributed computing (DIST), 18(2):99–109, 2005.
[5] James Aspnes and Maurice Herlihy. Fast randomized consensus using shared memory. Journal of Algorithms, 11(3):441–461, 1990.
[6] Hagit Attiya, Amotz Bar-Noy, and Danny Dolev. Sharing memory robustly in message-passing systems. Journal of the ACM (JACM), 42(1):124–142, 1995.
[7] Pierre-Louis Aublin, Rachid Guerraoui, Nikola Knežević, Vivien Quéma, and Marko Vukolić. The next 700 BFT protocols. ACM Transactions on Computer Systems (TOCS), 32(4):12, 2015.
[8] Rida Bazzi and Gil Neiger. Optimally simulating crash failures in a byzantine environment. In International Workshop on Distributed Algorithms (WDAG), pages 108–128. Springer, 1991.
[9] Jonathan Behrens, Ken Birman, Sagar Jha, Matthew Milano, Edward Tremel, Eugene Bagdasaryan, Theo Gkountouvas, Weijia Song, and Robbert Van Renesse. Derecho: Group communication at the speed of light. Technical report, Cornell University, 2016.
[10] Michael Ben-Or. Another advantage of free choice (extended abstract): Completely asynchronous agreement protocols. In ACM Symposium on Principles of Distributed Computing (PODC), pages 27–30. ACM, 1983.
[11] Alysson Neves Bessani, Miguel Correia, Joni da Silva Fraga, and Lau Cheuk Lung. Sharing memory between byzantine processes using policy-enforced tuple spaces. IEEE Transactions on Parallel and Distributed Systems, 20(3):419–432, 2009.
[12] Romain Boichat, Partha Dutta, Svend Frølund, and Rachid Guerraoui. Reconstructing paxos. SIGACT News, 34(2):42–57, March 2003.
[13] Zohir Bouzid, Damien Imbs, and Michel Raynal. A necessary condition for byzantine k-set agreement. Information Processing Letters, 116(12):757–759, 2016.
[14] Gabriel Bracha. Asynchronous byzantine agreement protocols. Information and Computation, 75(2):130–143, 1987.
[15] Gabriel Bracha and Sam Toueg. Asynchronous consensus and broadcast protocols. Journal of the ACM (JACM), 32(4):824–840, 1985.
[16] Francisco Brasileiro, Fabíola Greve, Achour Mostéfaoui, and Michel Raynal. Consensus in one communication step. In International Conference on Parallel Computing Technologies, pages 42–50. Springer, 2001.
[17] Christian Cachin, Rachid Guerraoui, and Luís E. T. Rodrigues. Introduction to Reliable and Secure Distributed Programming (2. ed.). Springer, 2011.
[18] Tushar Deepak Chandra and Sam Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM (JACM), 43(2):225–267, 1996.
[19] Byung-Gon Chun, Petros Maniatis, and Scott Shenker. Diverse replication for single-machine byzantine-fault tolerance. In USENIX Annual Technical Conference (ATC), pages 287–292, 2008.
[20] Byung-Gon Chun, Petros Maniatis, Scott Shenker, and John Kubiatowicz. Attested append-only memory: making adversaries stick to their word. In ACM Symposium on Operating Systems Principles (SOSP), pages 189–204, 2007.
[21] Allen Clement, Flavio Junqueira, Aniket Kate, and Rodrigo Rodrigues. On the (limited) power of non-equivocation. In ACM Symposium on Principles of Distributed Computing (PODC), pages 301–308. ACM, 2012.
[22] Miguel Correia, Nuno Ferreira Neves, and Paulo Veríssimo. How to tolerate half less one byzantine nodes in practical distributed systems. In International Symposium on Reliable Distributed Systems (SRDS), pages 174–183, 2004.
[23] Miguel Correia, Giuliana S. Veronese, and Lau Cheuk Lung. Asynchronous byzantine consensus with 2f+1 processes. In ACM Symposium on Applied Computing (SAC), pages 475–480. ACM, 2010.
[24] Dan Dobre and Neeraj Suri. One-step consensus with zero-degradation. In International Conference on Dependable Systems and Networks (DSN), pages 137–146. IEEE Computer Society, 2006.
[25] Aleksandar Dragojević, Dushyanth Narayanan, Miguel Castro, and Orion Hodson. FaRM: Fast remote memory. In USENIX Symposium on Networked Systems Design and Implementation (NSDI), pages 401–414, 2014.
[26] Partha Dutta, Rachid Guerraoui, and Leslie Lamport. How fast can eventual synchrony lead to consensus? In International Conference on Dependable Systems and Networks (DSN), pages 22–27. IEEE, 2005.
[27] Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. Consensus in the presence of partial synchrony. Journal of the ACM (JACM), 35(2):288–323, 1988.
[28] Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM (JACM), 1985.
[29] Eli Gafni and Leslie Lamport. Disk paxos. Distributed computing (DIST), 16(1):1–20, 2003.
[30] Prasad Jayanti, Tushar Deepak Chandra, and Sam Toueg. Fault-tolerant wait-free shared objects. Journal of the ACM (JACM), 45(3):451–500, May 1998.
[31] Anuj Kalia, Michael Kaminsky, and David G. Andersen. Using RDMA efficiently for key-value services. ACM SIGCOMM Computer Communication Review, 44(4):295–306, 2015.
[32] Anuj Kalia, Michael Kaminsky, and David G. Andersen. Design guidelines for high performance RDMA systems. In USENIX Annual Technical Conference (ATC), page 437, 2016.
[33] Anuj Kalia, Michael Kaminsky, and David G. Andersen. FaSST: Fast, scalable and simple distributed transactions with two-sided (RDMA) datagram RPCs. In USENIX Symposium on Operating System Design and Implementation (OSDI), volume 16, pages 185–201, 2016.
[34] Rüdiger Kapitza, Johannes Behl, Christian Cachin, Tobias Distler, Simon Kuhnle, Seyed Vahid Mohammadi, Wolfgang Schröder-Preikschat, and Klaus Stengel. CheapBFT: resource-efficient byzantine fault tolerance. In European Conference on Computer Systems (EuroSys), pages 295–308, 2012.
[35] Idit Keidar and Sergio Rajsbaum. On the cost of fault-tolerant consensus when there are no faults: preliminary version. ACM SIGACT News, 32(2):45–63, 2001.
[36] Klaus Kursawe. Optimistic byzantine agreement. In International Symposium on Reliable Distributed Systems (SRDS), pages 262–267, October 2002.
[37] Leslie Lamport. The weak byzantine generals problem. Journal of the ACM (JACM), 30(3):668–676, July 1983.
[38] Leslie Lamport. The part-time parliament. ACM Transactions on Computer Systems (TOCS), 16(2):133–169, 1998.
[39] Leslie Lamport. Fast paxos. Distributed computing (DIST), 19(2):79–103, 2006.
[40] Leslie Lamport, Robert Shostak, and Marshall Pease. The byzantine generals problem. ACM Transactions on Programming Languages and Systems (TOPLAS), 4(3):382–401, 1982.
[41] Dahlia Malkhi, Michael Merritt, Michael K. Reiter, and Gadi Taubenfeld. Objects shared by byzantine processes. Distributed computing (DIST), 16(1):37–48, 2003.
[42] Jean-Philippe Martin and Lorenzo Alvisi. Fast byzantine consensus. IEEE Transactions on Dependable and Secure Computing (TDSC), 3(3):202–215, July 2006.
[43] Gil Neiger and Sam Toueg. Automatically increasing the fault-tolerance of distributed algorithms. Journal of Algorithms, 11(3):374–419, 1990.
[44] Marshall Pease, Robert Shostak, and Leslie Lamport. Reaching agreement in the presence of faults. Journal of the ACM (JACM), 27(2):228–234, 1980.
[45] Marius Poke and Torsten Hoefler. DARE: High-performance state machine replication on RDMA networks. In Symposium on High-Performance Parallel and Distributed Computing (HPDC), pages 107–118. ACM, 2015.
[46] Signe Rüsch, Ines Messadi, and Rüdiger Kapitza. Towards low-latency byzantine agreement protocols using RDMA. In IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), pages 146–151. IEEE, 2018.
[47] Yee Jiun Song and Robbert van Renesse. Bosco: One-step byzantine asynchronous consensus. In International Symposium on Distributed Computing (DISC), pages 438–450, September 2008.
[48] Giuliana Santos Veronese, Miguel Correia, Alysson Neves Bessani, Lau Cheuk Lung, and Paulo Veríssimo. Efficient byzantine fault-tolerance. IEEE Trans. Computers, 62(1):16–30, 2013.
[49] Cheng Wang, Jianyu Jiang, Xusheng Chen, Ning Yi, and Heming Cui. APUS: Fast and scalable paxos on RDMA. In Symposium on Cloud Computing (SoCC), pages 94–107. ACM, 2017.
A CORRECTNESS OF RELIABLE BROADCAST
OBSERVATION 1. In Algorithm 2, if p is a correct process, then no slot that belongs to p is written to more than once.
PROOF. Since p is correct, p never writes on any slot more than once. Furthermore, since all slots are single-writer registers, no other process can write on these slots. □
OBSERVATION 2. In Algorithm 2, correct processes invoke and return from try_deliver(q) infinitely often, for all q ∈ Π.
PROOF. The try_deliver() function does not contain any blocking steps, loops or goto statements. Thus, if a correct process invokes try_deliver(), it will eventually return. Therefore, for a fixed q, the infinite loop at line 24 will invoke and return from try_deliver(q) infinitely often. Since the parallel for loop at line 22 performs the infinite loop in parallel for each q ∈ Π, the observation holds. □
PROOF OF LEMMA 4.1. We prove the lemma by showing that Algorithm 2 correctly implements reliable broadcast. That is, we need to show that Algorithm 2 satisfies the four properties of reliable broadcast.
Property 1. Let p be a correct process that broadcasts (k, m). We show that all correct processes eventually deliver (k, m) from p. Assume by contradiction that there exists some correct process q which does not deliver (k, m). Furthermore, assume without loss of generality that k is the smallest key for which Property 1 is broken. That is, all correct processes must eventually deliver all messages (k', m') from p, for k' < k. Thus, all correct processes must eventually increment last[p] to k.
We consider two cases, depending on whether or not some process eventually writes an L2 proof for some (k, m') message from p in its L2Proof slot.
First consider the case where no process ever writes an L2 proof of any value (k, m') from p. Since p is correct, upon broadcasting (k, m), p must sign and write (k, m) into Value[p, k, p] at line 12. By Observation 1, (k, m) will remain in that slot forever. Because of this, and because there is no L2 proof, all correct processes, after reaching last[p] = k, will eventually read (k, m) in line 37, write it into their own Value slot in line 40 and change their state to WaitForL1Proof.
Furthermore, since p is correct and we assume signatures are unforgeable, no process q can write any other valid value (k', m') ≠ (k, m) into its Value[p, k, q] slot. Thus, eventually each correct process q will add at least f + 1 copies of (k, m) to its checkedVals, write an L1 proof consisting of these values into L1Proof[p, k, q] in line 52, and change its state to WaitForL2Proof.
Therefore, all correct processes will eventually read at least f + 1 valid L1 proofs for (k, m) in line 58 and construct and write valid L2 proofs for (k, m). This contradicts the assumption that no L2 proof ever gets written.
In the case where there is some L2 proof, by the argument above, the only value it can prove is (k, m). Therefore, all correct processes will see at least one valid L2 proof at delivery. This contradicts our assumption that q is correct but does not deliver (k, m) from p.
Property 2. We now prove the second property of reliable broadcast. Let q and q' be any two correct processes, and p be some process, such that q delivers (k, m) from p and q' delivers (k, m') from p. Assume by contradiction that m ≠ m'.
Since q and q' are correct, they must have seen valid L2 proofs at line 16 before delivering (k, m) and (k, m') respectively. Let P and P' be those valid proofs for (k, m) and (k, m') respectively. P (resp. P') consists of at least f + 1 valid L1 proofs; therefore, at least one of those proofs was created by some correct process r (resp. r'). Since r (resp. r') is correct, it must have written (k, m) (resp. (k, m')) to its Value slot in line 40. Note that after copying a value to their slot, in the WaitForL1Proof state, correct processes read all Value slots in line 46. Thus, both r and r' read all Value slots before compiling their L1 proofs for (k, m) (resp. (k, m')).
Assume without loss of generality that r wrote (k, m) before r' wrote (k, m'); by Observation 1, it must then be the case that r' later saw both (k, m) and (k, m') when it read all Value slots (line 46). Since r' is correct, it cannot have then compiled an L1 proof for (k, m') (the check at line 50 failed). We have reached a contradiction.
Property 3. We show that if a correct process p delivers (k, m) from a correct process p', then p' broadcast (k, m). Correct processes only deliver values for which a valid L2 proof exists (lines 16–30). Therefore, p must have seen a valid L2 proof P for (k, m). P consists of at least f + 1 L1 proofs for (k, m) and each L1 proof consists of at least f + 1 matching copies of (k, m), signed by p'. Since p' is correct and we assume signatures are unforgeable, p' must have broadcast (k, m) (otherwise p' would not have attached its signature to (k, m)).
Property 4. Let q be a correct process such that q delivers (k, m) from p. We show that all correct processes must deliver (k, m') from p, for some m'.
By construction of the algorithm, if q delivers (k, m) from p, then for all i < k there exists m_i such that q delivered (i, m_i) from p before delivering (k, m) (this is because q can only deliver (k, m) if last[p] = k, and last[p] is only incremented to k after q delivers (k − 1, m_{k−1})).
Assume by contradiction that there exists some correct process r which does not deliver (k, m') from p, for any m'. Further assume without loss of generality that k is the smallest key for which r does not deliver any message from p. Thus, r must have delivered (i, m'_i) from p for all i < k; thus, r must have incremented last[p] to k. Since r never delivers any message from p for key k, r's last[p] will never increase past k.
Since q delivers (k, m) from p, q must have written a valid L2 proof P of (k, m) in its L2Proof slot in line 18. By Observation 1, P will remain in q's L2Proof[p, k, q] slot forever. Thus, at least one of the slots L2Proof[p, k, ·] will forever contain a valid L2 proof. Since r's last[p] eventually reaches k and never increases past k, r will eventually (by Observation 2) see a valid L2 proof in line 16 and deliver a message for key k from p. We have reached a contradiction. □
B CORRECTNESS OF CHEAP QUORUM
We prove that Cheap Quorum satisfies certain useful properties that will help us show that it composes with Preferential Paxos to form a correct weak Byzantine agreement protocol. For the proofs, we first formalize some terminology. We say that a process p proposed a value v by time t if it successfully executes line 4; that is, p receives the response ack in line 4 by t. When a process aborts, note that it outputs a tuple. We say that the first element of its tuple is its abort value, and the second is its abort proof. We sometimes say that a process p aborts with value v and proof pf, meaning that p outputs (v, pf) in its abort. Furthermore, the value in a process p's Proof region is called a correct unanimity proof if it contains n copies of the same value, each correctly signed by a different process.
OBSERVATION 3. In Cheap Quorum, no value written by a correct process is ever overwritten.
PROOF. By inspecting the code, we can see that the correct behavior is for processes to never overwrite any values. Furthermore, since all regions are initially single-writer, and the legalChange function never allows another process to acquire write permission on a region that it cannot write to initially, no other process can overwrite these values. □
LEMMA B.1 (CHEAP QUORUM VALIDITY). In Cheap Quorum, if there are no faulty processes and some process decides v, then v is the input of some process.
PROOF. If p = p1, the lemma is trivially true, because p1 can only decide on its input value. If p ≠ p1, p can only decide on a value v if it read that value from the leader's region. Since only the leader can write to its region, it follows that p can only decide on a value that was proposed by the leader (p1). □
LEMMA B.2 (CHEAP QUORUM TERMINATION). If a correct process p proposes some value, every correct process q will decide a value or abort.
PROOF. Clearly, if q = p1 proposes a value, then q decides. Now let q ≠ p1 be a correct follower and assume p1 is a correct leader that proposes v. Since p1 proposed v, p1 was able to write v in the leader region, where v remains forever by Observation 3. Clearly, if q eventually enters panic mode, then it eventually aborts; there is no waiting done in panic mode. If q never enters panic mode, then q eventually sees v in the leader region and eventually finds 2f + 1 copies of v in the regions of other followers (otherwise q would enter panic mode). Thus q eventually decides v. □
LEMMA B.3 (CHEAP QUORUM PROGRESS). If the system is synchronous and all processes are correct, then no correct process aborts in Cheap Quorum.
PROOF. Assume the contrary: there exists an execution in which the system is synchronous and all processes are correct, yet some process aborts. Processes can only abort after entering panic mode, so let t be the first time at which a process enters panic mode and let p be that process. Since p cannot have seen any other process declare panic, p must have either timed out at line 12 or 22, or its checks failed on line 13. However, since the entire system is synchronous and p1 is correct, p could not have panicked because of a time-out at line 12. So, p1 must have written its value v, correctly signed, to p1's region at a time t' < t. Therefore, p also could not have panicked by failing its checks on line 13. Finally, since all processes are correct and the system is synchronous, all processes must have seen p1's value and copied it to their slots. Thus, p must have seen these values and decided on v at line 20, contradicting the assumption that p entered panic mode. □
LEMMA B.4 (LEMMA 4.5: CHEAP QUORUM DECISION AGREEMENT). Let p and q be correct processes. If p decides v1 while q decides v2, then v1 = v2.
PROOF. Assume the property does not hold: p decided some value v1 and q decided some different value v2. Since p decided v1, p must have seen a copy of v1 at 2fP + 1 replicas, including q. But then q cannot have decided v2, because by Observation 3, v1 never gets overwritten from q's region, and by the code, q can only decide a value written in its region. □
LEMMA B.5 (LEMMA 4.6: CHEAP QUORUM ABORT AGREEMENT). Let p and q be correct processes (possibly identical). If p decides v in Cheap Quorum while q aborts from Cheap Quorum, then v will be q's abort value. Furthermore, if p is a follower, q's abort proof is a correct unanimity proof.
PROOF. If p = q, the property follows immediately, because of lines 4 through 6 of panic mode. If p ≠ q, we consider two cases:
• If p is a follower, then for p to decide, all processes, and in particular q, must have replicated both v and a correct proof of unanimity before p decided. Therefore, by Observation 3, v and the unanimity proof are still there when q executes the panic code in lines 4 through 6. Therefore q will abort with v as its value and a correct unanimity proof as its abort proof.
• If p is the leader, then first note that since p is correct, by Observation 3, v remains the value written in the leader's Value region. There are two cases. If q has replicated a value into its Value region, then it must have read it from Value[p1], and therefore it must be v. Again by Observation 3, v must still be the value written in q's Value region when q executes the panic code. Therefore q aborts with value v. Otherwise, if q has not replicated a value, then q's Value region must be empty at the time of the panic, since the legalChange function disallows other processes from writing on that region. Therefore q reads v from Value[p1] and aborts with v. □
LEMMA B.6. Cheap Quorum is 2-deciding.
PROOF. Consider an execution in which every process is correct and the system is synchronous. Then no process will enter panic mode (by Lemma B.3) and thus p1 will not have its permission revoked. p1 will therefore be able to write its input value to p1's region and decide after this single write (2 delays). □
C CORRECTNESS OF THE FAST & ROBUST
The following is the pseudocode of Preferential Paxos. Recall that T-send and T-receive are the trusted message-passing primitives that are implemented in [21] using non-equivocation and signatures.
Algorithm 8: Preferential Paxos, code for process p
1 propose((v, priorityTag)) {
2   T-send (v, priorityTag) to all;
3   Wait to T-receive (val, priorityTag) from n − fP processes;
4   best = value with highest priority out of messages received;
5   RobustBackup(Paxos).propose(best); }
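Lines 3–4 of the pseudocode can be sketched concretely. The snippet below is a minimal illustration, not the paper's implementation; in particular, it assumes a lower priorityTag number means higher priority, whereas the algorithm only requires some total order on tags.

```python
# Sketch of the value-adoption step of Preferential Paxos: after
# T-receiving tagged values from n - fP processes, adopt the value
# with the highest priority among those received.

def pick_best(received, n, fP):
    """received: list of (value, priorityTag) pairs; lower tag = higher priority."""
    assert len(received) >= n - fP, "must hear from at least n - fP processes"
    return min(received, key=lambda vp: vp[1])[0]

# With n = 5 and fP = 2, a process waits for n - fP = 3 messages.
# It can miss at most fP = 2 higher-priority values, so the adopted
# value is always among the fP + 1 = 3 top-priority inputs.
msgs = [("v3", 3), ("v1", 1), ("v4", 4)]
assert pick_best(msgs, n=5, fP=2) == "v1"
```

This is exactly the counting argument used in the proof of Lemma C.1 below: waiting for n − fP tagged values bounds the number of missed higher-priority inputs by fP.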
LEMMA C.1 (LEMMA 4.7: PREFERENTIAL PAXOS PRIORITY DECISION). Preferential Paxos implements weak Byzantine agreement with n ≥ 2fP + 1 processes. Furthermore, let v1, . . . , vn be the input values of an instance C of Preferential Paxos, ordered by priority. The decision value of correct processes is always one of v1, . . . , v_{fP+1}.
PROOF. By Lemma 4.3, Robust Backup(Paxos) solves weak Byzantine agreement with n ≥ 2fP + 1 processes. Note that before calling Robust Backup(Paxos), each process may change its input, but only to the input of another process. Thus, by the correctness and fault tolerance of Paxos, Preferential Paxos clearly solves weak Byzantine agreement with n ≥ 2fP + 1 processes. Thus we only need to show that Preferential Paxos satisfies the priority decision property with 2fP + 1 processes that may only fail by crashing.
Since Robust Backup(Paxos) satisfies validity, if all processes call Robust Backup(Paxos) in line 5 with a value v that is one of the fP + 1 top-priority values (that is, v ∈ {v1, . . . , v_{fP+1}}), then the decision of correct processes will also be in {v1, . . . , v_{fP+1}}. So we just need to show that every process indeed adopts one of the top fP + 1 values. Note that each process p waits to see n − fP values, and then picks the highest-priority value that it saw. No process can lie or pick a different value, since we use T-send and T-receive throughout. Thus, p can miss at most fP values that are higher priority than the one that it adopts. □
We now prove the following key composition property, which shows that the composition of Cheap Quorum and Preferential Paxos is safe.
LEMMA C.2 (LEMMA 4.8: COMPOSITION LEMMA). If some correct process decides a value v in Cheap Quorum before an abort, then v is the only value that can be decided in Preferential Paxos with priorities as defined in Definition 3.
PROOF. To prove this lemma, we mainly rely on two properties: Cheap Quorum Abort Agreement (Lemma 4.6) and Preferential Paxos Priority Decision (Lemma 4.7). We consider two cases.
Case 1. Some correct follower process p ≠ p1 decided v in Cheap Quorum. Then note that by Lemma 4.6, all correct processes aborted with value v and a correct unanimity proof. Since n ≥ 2f + 1, there are at least f + 1 correct processes. Note that by the way we assign priorities to inputs of Preferential Paxos in the composition of the two algorithms, all correct processes have inputs with the highest priority. Therefore, by Lemma 4.7, the only decision value possible in Preferential Paxos is v. Furthermore, note that by Lemma 4.5, if any other correct process decided in Cheap Quorum, that process's decision value was also v.
Case 2. Only the leader, p1, decides in Cheap Quorum, and p1 is correct. Then by Lemma 4.6, all correct processes aborted with value v. Since p1 is correct, v is signed by p1. It is possible that some of the processes also had a correct unanimity proof as their abort proof. However, note that in this scenario, all correct processes (at least f + 1 processes) had inputs with either the highest or second-highest priorities, all with the same abort value. Therefore, by Lemma 4.7, the decision value must have been the value of one of these inputs. Since all these inputs had the same value v, v must be the decision value of Preferential Paxos. □
THEOREM C.3 (END-TO-END VALIDITY). In the Fast & Robust algorithm, if there are no faulty processes and some process decides v, then v is the input of some process.
PROOF. Note that by Lemmas B.1 and 4.7, this holds for each of the algorithms individually. Furthermore, recall that the abort values of Cheap Quorum become the input values of Preferential Paxos, and the set-up phase does not invent new values. Therefore, we just have to show that if Cheap Quorum aborts, then all abort values are inputs of some process. Note that by the code in panic mode, if Cheap Quorum aborts, a process p can output an abort value from one of three sources: its own Value region, the leader's Value region, or its own input value. Clearly, if its abort value is its input, then we are done. Furthermore, note that a correct leader only writes its input in the Value region, and correct followers only write a copy of the leader's Value region in their own region. Since there are no faults, this means that only the input of the leader may be written in any Value region, and therefore all processes always abort with some process's input as their abort value. □
THEOREM C.4 (END-TO-END AGREEMENT). In the Fast & Robust algorithm, if p and q are correct processes such that p decides v1 and q decides v2, then v1 = v2.
PROOF. First note that by Lemmas 4.5 and 4.7, each of the algorithms satisfies this individually. Thus Lemma 4.8 implies the theorem. □
THEOREM C.5 (END-TO-END TERMINATION). In the Fast & Robust algorithm, if some correct process is eventually the sole leader forever, then every correct process eventually decides.
PROOF. Assume that some correct process ℓ is eventually the sole leader forever, and let t be the time when ℓ last becomes leader. Now consider some correct process q that has not decided before t. We consider several cases:
(1) If q is executing Preferential Paxos at time t, then q will eventually decide, by termination of Preferential Paxos (Lemma 4.7).
(2) If q is executing Cheap Quorum at time t, we distinguish two sub-cases:
(a) ℓ is also executing as the leader of Cheap Quorum at time t. Then ℓ will eventually propose a value, so q will either decide in Cheap Quorum or abort from Cheap Quorum (by Lemma B.2) and decide in Preferential Paxos by Lemma 4.7.
(b) ℓ is executing in Preferential Paxos. Then ℓ must have panicked and aborted from Cheap Quorum. Thus, q will also abort from Cheap Quorum and decide in Preferential Paxos by Lemma 4.7. □
Note that to strengthen Theorem C.5 to general termination as stated in our model, we require the additional standard assumption [38] that some correct process ℓ is eventually the sole leader forever. In practice, however, ℓ does not need to be the sole leader forever, but rather long enough that all correct processes decide.
D CORRECTNESS OF PROTECTED MEMORY PAXOS
In this section, we present the proof of Theorem 5.1. We do so by showing that Algorithm 7 satisfies all of the properties stated in the theorem.
We first show that Algorithm 7 correctly implements consensus, starting with validity. Intuitively, validity is preserved because each process that writes any value to a slot either writes its own input value or adopts a value that was previously written to a slot. We show that every value written to any slot must have been the input of some process.
THEOREM D.1 (VALIDITY). In Algorithm 7, if a process p decides a value v, then v was the input of some process.
PROOF. Assume by contradiction that some process p decides a value v and v is not the input of any process. Since v is not the input value of p, p must have adopted v by reading it from some process p' at line 23. Note also that a process cannot adopt the initial value ⊥, and thus v must have been written to p''s memory by some other process. Thus we can define a sequence of processes s_1, s_2, . . . , s_j, where s_i adopts v from the location where it was written by s_{i+1} and s_1 = p. This sequence is necessarily finite, since a finite number of steps have been taken up to the point when p decided v. Therefore, there must be a last element of the sequence, s_j, who wrote v in line 35 without having adopted v. This implies v was s_j's input value, a contradiction. □
We now focus on agreement.
THEOREM D.2 (AGREEMENT). In Algorithm 7, for any processes p and q, if p and q decide values v_p and v_q respectively, then v_p = v_q.
Before showing the proof of the theorem, we first introduce the following useful lemmas.
LEMMA D.3. The values a leader accesses on remote memory cannot change between when it reads them and when it writes them.
PROOF. Recall that each memory only allows write access to the most recent process that acquired it. In particular, this means that each memory gives write access to only one process at a time. Note that the only place at which a process acquires write permission on a memory is at the very beginning of its run, before reading the values written on the memory. In particular, for each memory m, a process does not issue a read on m before its permission request on m successfully completes. Therefore, if a process p succeeds in writing on memory m, then no other process could have acquired m's write permission after p did, and therefore no other process could have changed the values written on m after p's read of m. □
LEMMA D.4. If a leader writes values v_i and v_j at line 35 with the same proposal number to memories i and j, respectively, then v_i = v_j.
PROOF. Assume by contradiction that a leader ℓ writes different values v1 ≠ v2 with the same proposal number. Since each thread of ℓ executes the phase-2 write (line 35) at most once per proposal number, it must be that different threads τ1 and τ2 of ℓ wrote v1 and v2, respectively. If ℓ does not perform phase 1 (i.e., if ℓ = p1 and this is ℓ's first attempt), then it is impossible for τ1 and τ2 to write different values, since CurrentVal was set to v at line 14 and was not changed afterwards. Otherwise (if ℓ performs phase 1), let t1 and t2 be the times at which τ1 and τ2 executed line 8, respectively (τ1 and τ2 must have done so, since we assume that they both reached the phase-2 write at line 35). Assume wlog that t1 ≤ t2. Due to the check and abort at line 29, CurrentVal cannot change after t1 while keeping the same proposal number. Thus, when τ1 and τ2 perform their phase-2 writes (after t1), CurrentVal has the same value as it did at t1; it is therefore impossible for τ1 and τ2 to write different values. We have reached a contradiction. □
LEMMA D.5. If a process p performs phase 1 and then writes to some memory m with proposal number b at line 35, then p must have written b to m at line 21 and read from m at line 23.
PROOF. Let π be the thread of π which writes to π at line 35. If phase 1 is performed (i.e., the condition at line 19 is satisfied), then by construction π cannot reach line 35 without first performing lines 21 and 23. Since π only communicates with π, it must be that lines 21 and 23 are performed on π. □
PROOF OF THEOREM D.2. Assume by contradiction that π£π ≠ π£π. Let ππ and ππ be the proposal numbers with which π£π and π£π are decided, respectively. Let ππ (resp. ππ) be the set of memories to which π (resp. π) successfully wrote in phase 2, line 35, before deciding π£π (resp. π£π). Since ππ and ππ are both majorities, their intersection must be non-empty. Let π be any memory in ππ ∩ ππ.
We first consider the case in which one of the processes did not perform phase 1 before deciding (i.e., one of the processes is π1 and it decided on its first attempt). Let that process be π wlog. Further assume wlog that π is the first process to enter phase 2 with a value different from π£π. π's phase 2 write on π must have preceded π obtaining permissions from π (otherwise, π's write would have failed due to lack of permissions). Thus, π must have seen π's value during its read on π at line 23, and thus π cannot have adopted its own value. Since π is the first process to enter phase 2 with a value different from π£π, π cannot have adopted any other value than π£π, so π must have adopted π£π. Contradiction.
We now consider the remaining case: both π and π performed phase 1 before deciding. We assume wlog that ππ < ππ and that ππ is the smallest proposal number larger than ππ for which some process enters phase 2 with CurrentVal ≠ π£π.
Since ππ < ππ, π's read at π must have preceded π's phase 1 write at π (otherwise π would have seen π's larger proposal number and aborted). This implies that π's phase 2 write at π must have preceded π's phase 1 write at π (by Lemma D.3). Thus π must have seen π£π during its read and cannot have adopted its own input value. However, π cannot have adopted π£π, so π must have adopted π£π from some other slot π π that π saw during its read. It must be that π π.minProposal < ππ, otherwise π would have aborted. Since π π.minProposal ≥ π π.accProposal for any slot, it follows that π π.accProposal < ππ. If π π.accProposal < ππ, π cannot have adopted π π.value in line 30 (it would have adopted π£π instead). Thus it must be that ππ ≤ π π.accProposal < ππ; however, this is impossible because we assumed that ππ is the smallest proposal number larger than ππ for which some process enters phase 2 with CurrentVal ≠ π£π. We have reached a contradiction. □
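The quorum-intersection step of this proof, namely that ππ ∩ ππ is non-empty, can be checked exhaustively for a small example (an illustrative sketch, not part of the algorithm):

```python
# Quorum-intersection check: for n memories, any two subsets of size
# greater than n/2 (majorities) share at least one memory.
import itertools

n = 5                                   # example number of memories
memories = range(n)
majorities = [set(c)
              for k in range(n // 2 + 1, n + 1)
              for c in itertools.combinations(memories, k)]

# every pair of majorities intersects, so Mp and Mq always share a memory
assert all(a & b for a in majorities for b in majorities)
```

This is the standard pigeonhole argument: two sets each larger than n/2 cannot be disjoint within a universe of size n.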
Finally, we prove that the termination property holds.
THEOREM D.6 (TERMINATION). Eventually, all correct processes decide.
LEMMA D.7. If a correct process π is executing the for loop at lines 18–37, then π will eventually exit from the loop.
PROOF. The threads of the for loop perform the following potentially blocking steps: obtaining permission (line 20), writing (lines 21 and 35), reading (line 23), and waiting for other threads (the barrier at line 7 and the exit condition at line 37). By our assumption that a majority of memories are correct, a majority of the threads of the for loop must eventually obtain permission in line 20 and invoke the write at line 21. If one of these writes fails due to lack of permission, the loop aborts and we are done. Otherwise, a majority of threads will perform the read at line 23. If some thread aborts at line 22 or 26, then the loop aborts and we are done. Otherwise, a majority of threads must add themselves to ListOfReady, pass the barrier at line 7 and invoke the write at line 35. If some such write fails, the loop aborts; otherwise, a majority of threads will reach the check at line 37 and thus the loop will exit. □
PROOF OF THEOREM D.6. The Ω failure detector guarantees that eventually, all processes trust the same correct process π. Let π‘ be the time after which all correct processes trust π forever. By Lemma D.7, at some time π‘′ ≥ π‘, all correct processes except π will be blocked at line 12. Therefore, the minProposal values of all memories, on all slots except those of π, stop increasing. Thus, eventually, π picks a propNr that is larger than all others written on any memory, and stops restarting at line 26. Furthermore, since no process other than π is executing any steps of the algorithm, and in particular, no process other than π ever acquires any memory after time π‘′, π never loses its permission on any of the memories. So, all writes executed by π on any correct memory must return ack. Therefore, π will decide and broadcast its decision to all. All correct processes will receive π's decision and decide as well. □
To complete the proof of Theorem 5.1, we now show that Algorithm 7 is 2-deciding.
THEOREM D.8. Algorithm 7 is 2-deciding.
PROOF. Consider an execution in which π1 is timely, and no process's failure detector ever suspects π1. Then, since no process other than π1 considers itself the leader, and processes do not deviate from their protocols, no process calls changePermission on any memory. Furthermore, π1 does not perform phase 1 (lines 19–33), since it is π1's first attempt. Thus, since π1 initially has write permission on all memories, all of π1's phase 2 writes succeed. Therefore, π1 terminates, deciding its own proposed value π£, after one write per memory. □
E PSEUDOCODE FOR THE ALIGNED PAXOS
Algorithm 9: Aligned Paxos

    A = (P, M) is the set of acceptors

    propose(v){
        resps = []  // prepare empty responses list
        choose propNr bigger than any seen before
        for all a in A{
            cflag = communicate1(a, propNr)
            resp = hearback1(a)
            if (cflag){ resps.append((a, resp)) } }
        wait until resps has responses from a majority of A
        next = analyze1(resps)
        if (next == RESTART) restart;
        resps = []
        for all a in A{
            cflag = communicate2(a, next)
            resp = hearback2(a)
            if (cflag){ resps.append((a, resp)) } }
        wait until resps has responses from a majority of A
        next = analyze2(next, resps)
        if (next == RESTART) restart;
        decide next; }
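For concreteness, the proposer loop above can be exercised end to end with a minimal runnable sketch (our own simplification: memory-style acceptors only, permission checks elided, and the communicate/hearback/analyze helpers inlined; `MemAcceptor` and the list-based slot layout are hypothetical names):

```python
# Single-proposer walk through phases 1 and 2 of the Aligned Paxos loop,
# against in-memory acceptors. Slots follow the paper's layout:
# (minProposal, accProposal, value).
RESTART = "RESTART"

class MemAcceptor:
    """Memory-style acceptor: one slot per process."""
    def __init__(self, procs):
        self.slots = {q: [0, 0, None] for q in procs}

    def write(self, p, triple):       # permission checks elided in this sketch
        self.slots[p] = list(triple)
        return True

    def read(self):
        return {q: list(s) for q, s in self.slots.items()}

def propose(p, v, memories, prop_nr):
    # Phase 1 (communicate1 + hearback1): write propNr into our slot,
    # then read back every slot of every memory.
    infos = []
    for m in memories:
        if m.write(p, (prop_nr, 0, None)):
            infos.append(m.read())
    # analyze1: restart on a higher minProposal, else adopt the value
    # with the highest accProposal (falling back to our own input v).
    best_acc, chosen = 0, v
    for info in infos:
        for s in info.values():
            if s[0] > prop_nr:
                return RESTART
            if s[2] is not None and s[1] > best_acc:
                best_acc, chosen = s[1], s[2]
    # Phase 2 (communicate2 + analyze2): write the chosen value and
    # decide once a majority of writes succeeded.
    acks = sum(1 for m in memories if m.write(p, (prop_nr, prop_nr, chosen)))
    return chosen if acks > len(memories) // 2 else RESTART

procs = ["p", "q"]
mems = [MemAcceptor(procs) for _ in range(3)]
assert propose("p", "A", mems, 1) == "A"  # first proposer decides its input
assert propose("q", "B", mems, 2) == "A"  # later proposer adopts the decided value
```

The second call illustrates the agreement argument of Theorem D.2: the later proposer reads the earlier accepted value from the intersecting memory and adopts it instead of its own input.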
Algorithm 15: Analyze Phase 2 (code for process π)

    value analyze2(value v, responses resps){
        if there are at least |A|/2 + 1 resps such that resp.response == ack{
            return v }
        return RESTART }
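For memory agents, analyze2 amounts to a majority-ack count (a hypothetical Python rendering of the pseudocode above; the representation of responses as (agent, response) pairs follows Algorithm 9):

```python
# analyze2 sketch: keep the proposed value only if a majority of the
# agents acknowledged the phase 2 write; otherwise restart.
RESTART = "RESTART"

def analyze2(v, resps, num_agents):
    acks = sum(1 for _agent, response in resps if response == "ack")
    return v if acks >= num_agents // 2 + 1 else RESTART

assert analyze2("x", [("a1", "ack"), ("a2", "ack"), ("a3", "nack")], 3) == "x"
assert analyze2("x", [("a1", "ack")], 3) == RESTART
```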
Algorithm 10: Communicate Phase 1 (code for process π)

    bool communicate1(agent a, value propNr){
        if (a is memory){
            changePermission(a, {R: Π - {p}, W: ∅, RW: {p}});  // acquire write permission
            return write(a[p], {propNr, -, -}) }
        else{
            send prepare(propNr) to a
            return true } }
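The changePermission call can be read as replacing the memory's permission sets wholesale: before its phase 1 write, the proposer makes itself the sole writer while leaving every other process read access. A sketch of this semantics (our own model; `change_permission` and the dictionary representation are hypothetical):

```python
# Permission-change sketch: the new permission assignment gives p
# read-write access and everyone else read-only access, revoking all
# previously held write permissions in one step.
def change_permission(perm, procs, p):
    perm["R"] = procs - {p}   # all other processes may still read
    perm["W"] = set()         # nobody gets write-only access
    perm["RW"] = {p}          # only p may read and write

procs = {"p", "q", "r"}
perm = {"R": set(), "W": set(), "RW": set(procs)}  # initially everyone can write
change_permission(perm, procs, "p")
assert perm["RW"] == {"p"}
assert "q" not in perm["W"] | perm["RW"]           # q can no longer write
```

This single-step revocation is what makes a failed write meaningful in the proofs: a write fails exactly when some other proposer has since made itself the sole writer.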
Algorithm 11: Hear Back Phase 1 (code for process π)

    value hearback1(agent a){
        if (a is memory){
            for all processes q{
                localInfo[q] = read(a[q]) }
            return localInfo }
        else{
            return value received from a } }
Algorithm 12: Analyze Phase 1 (code for process π)

    // responses is a list of (agent, response) pairs
    value analyze1(responses resps){
        for resp in resps{
            if (resp.agent is memory){
                for all slots s in resp.response{
                    if (s.minProposal > propNr) return RESTART }
                (v, accProposal) = value and accProposal of the slot
                    with the highest accProposal that had a value } }
        return v where (v, accProposal) is the highest accProposal
            seen in resps.response of all agents }
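For memory agents, analyze1 reduces to the scan sketched below (a hypothetical Python rendering; the slot dictionaries are our own representation of the (minProposal, accProposal, value) fields):

```python
# analyze1 sketch: restart if any slot carries a higher minProposal than
# our own; otherwise adopt the value with the highest accProposal seen,
# falling back to our own input when no slot holds a value.
RESTART = "RESTART"

def analyze1(resps, prop_nr, my_val):
    best_acc, best_val = 0, my_val
    for _agent, info in resps:               # info maps process -> slot
        for s in info.values():
            if s["minProposal"] > prop_nr:
                return RESTART
            if s["value"] is not None and s["accProposal"] > best_acc:
                best_acc, best_val = s["accProposal"], s["value"]
    return best_val

slots = {"p": {"minProposal": 1, "accProposal": 1, "value": "A"},
         "q": {"minProposal": 2, "accProposal": 0, "value": None}}
assert analyze1([("m1", slots)], 2, "B") == "A"      # adopt the accepted value
assert analyze1([("m1", slots)], 1, "B") == RESTART  # a higher minProposal was seen
```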
Algorithm 13: Communicate Phase 2 (code for process π)

    bool communicate2(agent a, value msg){
        if (a is memory){
            return write(a[p], {msg.propNr, msg.propNr, msg.val}) }
        else{
            send accepted(msg.propNr, msg.val) to a
            return true } }
Algorithm 14: Hear Back Phase 2 (code for process π)

    value hearback2(agent a){
        if (a is memory){
            return ack }
        else{
            return value received from a } }