![Page 1: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/1.jpg)
Distributed Consensus: Why Can't We All Just Agree?
Heidi Howard
PhD Student @ University of Cambridge
@heidiann360 hh360.user.srcf.net
![Page 2: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/2.jpg)
Sometimes inconsistency is not an option
• Distributed locking
• Safety critical systems
• Distributed scheduling
• Strongly consistent databases
• Blockchain
• Leader election
• Orchestration services
• Distributed file systems
• Coordination & configuration
Anything which requires guaranteed agreement
![Page 3: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/3.jpg)
What is Distributed Consensus?
“The process of reaching agreement over state between unreliable hosts connected by unreliable networks, all operating asynchronously”
![Page 4: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/4.jpg)
![Page 5: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/5.jpg)
A walk through time
We are going to take a journey through the developments in distributed consensus, spanning over three decades. Stops include:
• FLP Result & CAP Theorem
• Viewstamped Replication, Paxos & Multi-Paxos
• State Machine Replication
• Paxos Made Live, Zookeeper & Raft
• Flexible Paxos
Bob
![Page 6: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/6.jpg)
Fischer, Lynch & Paterson Result
We begin with a slippery start
Impossibility of distributed consensus with one faulty process
Michael Fischer, Nancy Lynch and Michael Paterson
ACM SIGACT-SIGMOD Symposium on Principles of Database Systems, 1983
![Page 7: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/7.jpg)
FLP Result
We cannot guarantee agreement in an asynchronous system where even one host might fail.
Why?
We cannot reliably detect failures. We cannot know for sure the difference between a slow host/network and a failed host.
Note: We can still guarantee safety; the issue is limited to guaranteeing liveness.
![Page 8: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/8.jpg)
Solution to FLP
In practice:
We approximate reliable failure detectors using heartbeats and timers. We accept that sometimes the service will not be available (when it could be).
In theory:
We make weak assumptions about the synchrony of the system e.g. messages arrive within a year.
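As a rough illustration of the "in practice" approach, here is a minimal sketch of a heartbeat-and-timer failure detector. The class name, the `timeout` parameter and the injectable `clock` are assumptions made for this example, not part of the talk.

```python
import time


class HeartbeatFailureDetector:
    """Suspects a peer once no heartbeat has arrived within `timeout` seconds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock      # injectable so the example is deterministic
        self.last_seen = {}     # peer id -> time of the last heartbeat

    def heartbeat(self, peer):
        """Record that a heartbeat from `peer` has just arrived."""
        self.last_seen[peer] = self.clock()

    def suspected(self, peer):
        """Return True if `peer` is suspected to have failed.

        This is only a guess: a slow peer or a slow network looks
        exactly the same as a dead peer.
        """
        last = self.last_seen.get(peer)
        return last is None or self.clock() - last > self.timeout


# Usage with a fake clock.
now = [0.0]
detector = HeartbeatFailureDetector(timeout=2.0, clock=lambda: now[0])
detector.heartbeat("node-1")
now[0] = 1.5
alive_at_1_5 = not detector.suspected("node-1")
now[0] = 4.0
suspected_at_4 = detector.suspected("node-1")
```

A detector like this can wrongly suspect a healthy but slow node, which is the availability cost the slide accepts.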
![Page 9: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/9.jpg)
Viewstamped Replication
the forgotten algorithm
Viewstamped Replication Revisited
Barbara Liskov and James Cowling
MIT Tech Report MIT-CSAIL-TR-2012-021
Not the original from 1988, but recommended
![Page 10: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/10.jpg)
Viewstamped Replication
In my view, the pioneering algorithm in the field of distributed consensus.
Approach: Select one node to be the ‘master’. The master is responsible for replicating decisions. Once a decision has been replicated onto a majority of nodes, it is committed.
We rotate the master when the old master fails, with agreement from the majority of nodes.
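The commit rule above can be sketched in a few lines. This is only an illustration of "replicated onto a majority", not the Viewstamped Replication protocol itself: the `Replica` class and `master_commit` function are hypothetical names, and master rotation is left out.

```python
class Replica:
    """A node that stores decisions in a log (hypothetical helper)."""

    def __init__(self, up=True):
        self.up = up
        self.log = []

    def replicate(self, decision):
        """Store the decision and acknowledge, unless this node is down."""
        if not self.up:
            return False
        self.log.append(decision)
        return True


def master_commit(master, backups, decision):
    """Replicate `decision`; it is committed once a majority holds it."""
    master.log.append(decision)
    copies = 1 + sum(b.replicate(decision) for b in backups)
    cluster_size = 1 + len(backups)
    return copies >= cluster_size // 2 + 1


# Five nodes: with two backups down the decision still commits,
# with three backups down it does not.
two_down = master_commit(
    Replica(), [Replica(), Replica(), Replica(up=False), Replica(up=False)], "x=1")
three_down = master_commit(
    Replica(), [Replica(), Replica(up=False), Replica(up=False), Replica(up=False)], "x=1")
```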
![Page 11: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/11.jpg)
Paxos
Lamport’s consensus algorithm
The Part-Time Parliament
Leslie Lamport
ACM Transactions on Computer Systems, May 1998
![Page 12: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/12.jpg)
Paxos
The textbook algorithm for reaching consensus on a single value.
• two phase process: promise and commit
• each requiring majority agreement (aka quorums)
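A minimal in-memory sketch of the two phases, using the slide's "promise" and "commit" terminology. It assumes synchronous calls in place of messages, and the names `Acceptor`, `propose` and `reachable` are illustrative choices rather than anything defined in the talk.

```python
class Acceptor:
    def __init__(self):
        self.promised = None    # highest proposal number promised ("P")
        self.committed = None   # last accepted (number, value) pair ("C")

    def on_promise(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if self.promised is None or n > self.promised:
            self.promised = n
            return True, self.committed
        return False, None

    def on_commit(self, n, value):
        """Phase 2: accept (n, value) unless a higher number was promised."""
        if self.promised is None or n >= self.promised:
            self.promised = n
            self.committed = (n, value)
            return True
        return False


def propose(cluster, n, value, reachable=None):
    """Run both phases; return the chosen value, or None if no majority."""
    majority = len(cluster) // 2 + 1
    indexes = range(len(cluster)) if reachable is None else reachable
    nodes = [cluster[i] for i in indexes]

    # Phase 1: collect promises from a majority.
    replies = [node.on_promise(n) for node in nodes]
    granted = [previous for ok, previous in replies if ok]
    if len(granted) < majority:
        return None
    # If any node already committed a value, adopt the highest-numbered one.
    earlier = [previous for previous in granted if previous is not None]
    if earlier:
        value = max(earlier, key=lambda pair: pair[0])[1]

    # Phase 2: ask a majority to commit.
    acks = sum(node.on_commit(n, value) for node in nodes)
    return value if acks >= majority else None


cluster = [Acceptor() for _ in range(3)]
bob = propose(cluster, 13, "B")      # Bob's value is chosen
alice = propose(cluster, 22, "A")    # Alice must adopt the chosen value
stale = propose(cluster, 5, "X")     # an old proposal number is refused
```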
![Page 13: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/13.jpg)
Paxos Example - Failure Free
![Page 14: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/14.jpg)
[Diagram: three nodes (1, 2, 3), each with an empty promise (P) and commit (C) state.]
![Page 15: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/15.jpg)
[Diagram: the same three nodes, all with empty P and C.]
Incoming request from Bob (B)
![Page 16: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/16.jpg)
Phase 1
[Diagram: one node sets P: 13 and sends "Promise (13)?" to the other two nodes.]
![Page 17: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/17.jpg)
Phase 1
[Diagram: all three nodes now hold P: 13; the other two nodes reply OK.]
![Page 18: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/18.jpg)
Phase 2
[Diagram: one node records C: 13, B and sends "Commit (13, B)?" to the other two nodes.]
![Page 19: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/19.jpg)
Phase 2
[Diagram: all three nodes hold P: 13 and C: 13, B; the other two nodes reply OK.]
![Page 20: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/20.jpg)
[Diagram: all three nodes hold P: 13 and C: 13, B; an OK reply is shown.]
Bob is granted the lock
![Page 21: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/21.jpg)
Paxos Example - Node Failure
![Page 22: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/22.jpg)
[Diagram: three nodes (1, 2, 3), each with an empty promise (P) and commit (C) state.]
![Page 23: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/23.jpg)
Phase 1
Incoming request from Bob (B)
[Diagram: one node sets P: 13 and sends "Promise (13)?" to the other two nodes.]
![Page 24: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/24.jpg)
Phase 1
[Diagram: all three nodes hold P: 13; the other two nodes reply OK.]
![Page 25: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/25.jpg)
Phase 2
[Diagram: one node records C: 13, B; a single "Commit (13, B)?" message is shown.]
![Page 26: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/26.jpg)
Phase 2
[Diagram: two nodes hold P: 13 and C: 13, B; the third holds P: 13 with an empty C.]
![Page 27: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/27.jpg)
[Diagram: the same state as before: two nodes hold P: 13 and C: 13, B; the third holds P: 13 with an empty C.]
Incoming request from Alice (A)
![Page 28: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/28.jpg)
Phase 1
[Diagram: the node with the empty C sets P: 22 and sends "Promise (22)?".]
![Page 29: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/29.jpg)
Phase 1
[Diagram: a second node updates to P: 22 and replies "OK (13, B)", reporting the value it has already committed.]
![Page 30: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/30.jpg)
Phase 2
[Diagram: the proposing node records C: 22, B and sends "Commit (22, B)?".]
![Page 31: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/31.jpg)
Phase 2
[Diagram: two nodes hold P: 22 and C: 22, B; the third still holds P: 13 and C: 13, B. An OK reply and a NO reply are shown.]
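The key step in this example is that the proposal numbered 22 ends up carrying Bob's value B rather than Alice's, because a promise reply reported the earlier commit (13, B). A small sketch of that selection rule, with `choose_value` as a hypothetical helper name:

```python
def choose_value(promise_replies, own_value):
    """Pick the value a proposer may send in phase 2.

    `promise_replies` holds, for each node that promised, either None
    or the (number, value) pair that node had already committed.
    """
    earlier = [reply for reply in promise_replies if reply is not None]
    if not earlier:
        return own_value                      # free to use our own value
    return max(earlier, key=lambda pair: pair[0])[1]


# One reply carries no earlier commit, the other reports (13, "B").
alice_value = choose_value([None, (13, "B")], own_value="A")
```

Here `alice_value` is "B", matching the "Commit (22, B)?" message in the slides.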
![Page 32: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/32.jpg)
Paxos Example - Conflict
![Page 33: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/33.jpg)
Phase 1 - Bob
[Diagram: all three nodes hold P: 13 with an empty C.]
![Page 34: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/34.jpg)
Phase 1 - Alice
[Diagram: all three nodes hold P: 21 with an empty C.]
![Page 35: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/35.jpg)
Phase 1 - Bob
[Diagram: all three nodes hold P: 33 with an empty C.]
![Page 36: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/36.jpg)
Phase 1 - Alice
[Diagram: all three nodes hold P: 41 with an empty C.]
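The conflict can be reproduced with a short simulation using the proposal numbers from the slides (13, 21, 33, 41). The interleaving is an assumption about what the animation shows: each proposer's phase 1 lands before its rival attempts phase 2, so every commit attempt is refused and nothing is ever decided.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0

    def promise(self, n):
        """Phase 1: grant the promise only for a higher number."""
        if n > self.promised:
            self.promised = n
            return True
        return False

    def commit(self, n):
        """Phase 2: refuse anything below the promised number."""
        return n >= self.promised


def majority(votes):
    return sum(votes) > len(votes) // 2


acceptors = [Acceptor() for _ in range(3)]
schedule = [("Bob", 13), ("Alice", 21), ("Bob", 33), ("Alice", 41)]
holding = {}   # proposer -> number of the promise they currently hold
events = []    # (proposer, phase, number, succeeded)

for proposer, n in schedule:
    granted = majority([a.promise(n) for a in acceptors])
    events.append((proposer, "promise", n, granted))
    rival = "Alice" if proposer == "Bob" else "Bob"
    if rival in holding:
        # The rival's older promise has just been overtaken.
        ok = majority([a.commit(holding[rival]) for a in acceptors])
        events.append((rival, "commit", holding[rival], ok))
    holding[proposer] = n
```

Every "promise" event succeeds and every "commit" event fails, which is the liveness problem the conflict example illustrates.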
![Page 37: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/37.jpg)
What does Paxos give us?
Safety - Decisions are always final
Liveness - Decision will be reached as long as a majority of nodes are up and able to communicate*. Clients must wait two round trips to the majority of nodes, sometimes longer.
*plus our weak synchrony assumptions for the FLP result
![Page 38: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/38.jpg)
Multi-Paxos
Lamport’s leader-driven consensus algorithm
Paxos Made Moderately Complex
Robbert van Renesse and Deniz Altinbuken
ACM Computing Surveys, April 2015
Not the original, but highly recommended
![Page 39: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/39.jpg)
Multi-Paxos
Lamport’s insight:
Phase 1 is not specific to the request so can be done before the request arrives and can be reused for multiple instances of Paxos.
Implication:
Bob now only has to wait one round trip
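A sketch of the idea under simplifying assumptions: phase 1 runs once for a leader's proposal number and covers every log slot, after which each request needs only phase 2. The `Leader` and `Acceptor` classes and the `round_trips` counter are illustrative, and recovery of values reported during phase 1 is deliberately omitted.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0
        self.log = {}    # slot -> (number, value)

    def on_promise(self, n):
        if n > self.promised:
            self.promised = n
            return True
        return False

    def on_commit(self, n, slot, value):
        if n >= self.promised:
            self.promised = n
            self.log[slot] = (n, value)
            return True
        return False


class Leader:
    def __init__(self, acceptors, n):
        self.acceptors = acceptors
        self.n = n
        self.next_slot = 0
        self.round_trips = 0
        self.majority = len(acceptors) // 2 + 1

    def elect(self):
        """Phase 1, done once before any request arrives."""
        self.round_trips += 1
        granted = sum(a.on_promise(self.n) for a in self.acceptors)
        return granted >= self.majority

    def submit(self, value):
        """Phase 2 only: one round trip per client request."""
        self.round_trips += 1
        slot = self.next_slot
        acks = sum(a.on_commit(self.n, slot, value) for a in self.acceptors)
        if acks >= self.majority:
            self.next_slot += 1
            return slot
        return None


leader = Leader([Acceptor() for _ in range(3)], n=1)
elected = leader.elect()
slots = [leader.submit(v) for v in ["lock(B)", "unlock(B)", "lock(A)"]]
```

Three requests cost one election round trip plus three phase 2 round trips, so each client waits for a single round trip.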
![Page 40: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/40.jpg)
State Machine Replication
fault-tolerant services using consensus
Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial
Fred Schneider
ACM Computing Surveys, 1990
![Page 41: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/41.jpg)
State Machine Replication (SMR)
A general technique for making a service, such as a database, fault-tolerant.
[Diagram: two clients talking to a single application.]
![Page 42: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/42.jpg)
[Diagram: three application replicas and two clients, each shown with a consensus module (five in total), connected by the network.]
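The core of SMR can be sketched as a deterministic state machine applied to an agreed log. Here the log is simply a Python list standing in for the output of the consensus layer, and `KeyValueStore` is a hypothetical application.

```python
class KeyValueStore:
    """A deterministic state machine: same commands, same order, same state."""

    def __init__(self):
        self.state = {}

    def apply(self, command):
        op, key, arg = command
        if op == "set":
            self.state[key] = arg
        elif op == "add":
            self.state[key] = self.state.get(key, 0) + arg
        return self.state.get(key)


# The agreed order of commands, as the consensus layer would produce it.
log = [("set", "C", 1), ("add", "C", 1), ("set", "B", 0)]

replicas = [KeyValueStore() for _ in range(3)]
for replica in replicas:
    for command in log:
        replica.apply(command)

states = [replica.state for replica in replicas]
```

Because every replica applies the same log in the same order, all three end in the same state and any one of them can answer clients.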
![Page 43: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/43.jpg)
![Page 44: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/44.jpg)
CAP Theorem
You cannot have your cake and eat it
CAP Theorem
Eric Brewer
Presented at Symposium on Principles of Distributed Computing, 2000
![Page 45: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/45.jpg)
Consistency, Availability & Partition Tolerance - Pick Two
[Diagram: four nodes (1, 2, 3, 4) with two clients, B and C.]
![Page 46: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/46.jpg)
Paxos Made Live & Chubby
How Google uses Paxos
Paxos Made Live - An Engineering Perspective
Tushar Chandra, Robert Griesemer and Joshua Redstone
ACM Symposium on Principles of Distributed Computing, 2007
![Page 47: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/47.jpg)
Isn’t this a solved problem?
“There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system.
In order to build a real-world system, an expert needs to use numerous ideas scattered in the literature and make several relatively small protocol extensions.
The cumulative effort will be substantial and the final system will be based on an unproven protocol.”
![Page 48: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/48.jpg)
Paxos Made Live
Paxos Made Live documents the challenges in constructing Chubby, a distributed coordination service built using Multi-Paxos and state machine replication.
![Page 49: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/49.jpg)
Challenges
• Handling disk failure and corruption
• Dealing with limited storage capacity
• Effectively handling read-only requests
• Dynamic membership & reconfiguration
• Supporting transactions
• Verifying safety of the implementation
![Page 50: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/50.jpg)
Fast Paxos
Like Multi-Paxos, but faster
Fast Paxos
Leslie Lamport
Microsoft Research Tech Report MSR-TR-2005-112
![Page 51: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/51.jpg)
Fast Paxos
Paxos: Any node can commit a value in 2 RTTs
Multi-Paxos: The leader node can commit a value in 1 RTT
But, what about any node committing a value in 1 RTT?
![Page 52: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/52.jpg)
Fast Paxos
We can bypass the leader node for many operations, so any node can commit a value in 1 RTT.
However, we must increase the size of the quorum.
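To give a feel for how much larger the quorum gets, the sketch below assumes the intersection requirement from Lamport's Fast Paxos report: any classic quorum and any two fast quorums must share at least one node. For quorums defined purely by size, that holds when `classic + 2 * fast > 2 * n`. The function names are illustrative.

```python
def majority(n):
    return n // 2 + 1


def min_fast_quorum(n, classic=None):
    """Smallest fast quorum size for n nodes (size-based quorums only).

    Assumed requirement: any classic quorum and any two fast quorums
    share a node, which for sizes means classic + 2 * fast > 2 * n.
    """
    classic = majority(n) if classic is None else classic
    for fast in range(classic, n + 1):
        if classic + 2 * fast > 2 * n:
            return fast
    return None


sizes = {n: (majority(n), min_fast_quorum(n)) for n in (3, 5, 7)}
```

With majority classic quorums, five nodes need a fast quorum of four and seven nodes need six.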
![Page 53: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/53.jpg)
Zookeeper
The open source solution
Zookeeper: wait-free coordination for internet-scale systems
Hunt et al.
USENIX ATC 2010
Code: zookeeper.apache.org
![Page 54: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/54.jpg)
Zookeeper
Consensus for the masses.
It utilizes and extends Multi-Paxos for strong consistency.
Unlike “Paxos made live”, this is clearly discussed and openly available.
![Page 55: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/55.jpg)
Egalitarian Paxos
Don’t restrict yourself unnecessarily
There Is More Consensus in Egalitarian Parliaments
Iulian Moraru, David G. Andersen, Michael Kaminsky
SOSP 2013
also see Generalized Consensus and Paxos
![Page 56: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/56.jpg)
Egalitarian Paxos
The basis of SMR is that every replica of an application receives the same commands in the same order.
However, sometimes the ordering can be relaxed…
![Page 57: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/57.jpg)
[Diagram: the six commands C=1, B?, C=C+1, C?, B=0 and B=C shown twice, once as a total ordering and once as a partial ordering.]
![Page 58: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/58.jpg)
C=1 B? C=C+1 C? B=0 B=C
Many possible orderings
[Diagram: three alternative orderings of the same six commands.]
![Page 59: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/59.jpg)
Egalitarian Paxos
Allow requests to be out-of-order if they are commutative.
Conflict becomes much less common.
Works well in combination with Fast Paxos.
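One common way to decide whether two requests may be reordered is to compare the variables they read and write. The sketch below applies that conservative test to the six commands from the earlier slides; the read and write sets and the `commute` helper are assumptions made for illustration, not the check used by Egalitarian Paxos itself.

```python
# Read and write sets for the commands on the ordering slides.
COMMANDS = {
    "C=1":   {"reads": set(), "writes": {"C"}},
    "B?":    {"reads": {"B"}, "writes": set()},
    "C=C+1": {"reads": {"C"}, "writes": {"C"}},
    "C?":    {"reads": {"C"}, "writes": set()},
    "B=0":   {"reads": set(), "writes": {"B"}},
    "B=C":   {"reads": {"C"}, "writes": {"B"}},
}


def commute(a, b):
    """True if neither command writes a variable the other reads or writes."""
    x, y = COMMANDS[a], COMMANDS[b]
    x_touches = x["reads"] | x["writes"]
    y_touches = y["reads"] | y["writes"]
    return not (x["writes"] & y_touches or y["writes"] & x_touches)


reorderable = commute("C=1", "B?")        # touch different variables
must_order = not commute("C=1", "C=C+1")  # both write C
```

Only pairs that fail this test need an agreed relative order, which is why conflicts become less common.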
![Page 60: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/60.jpg)
Raft Consensus
Paxos made understandable
In Search of an Understandable Consensus Algorithm
Diego Ongaro and John Ousterhout
USENIX ATC 2014
![Page 61: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/61.jpg)
Raft
Raft has taken the wider community by storm, largely due to its understandable description.
It’s another variant of SMR with Multi-Paxos.
Key features:
• Really strong leadership - all other nodes are passive
• Various optimizations - e.g. dynamic membership and log compaction
![Page 62: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/62.jpg)
Flexible Paxos
Paxos made scalable
Flexible Paxos: Quorum intersection revisited
Heidi Howard, Dahlia Malkhi, Alexander Spiegelman
ArXiv:1608.06696
![Page 63: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/63.jpg)
Majorities are not needed
Usually, we require majorities to agree so we can guarantee that all quorums (groups) intersect.
This work shows that not all quorums need to intersect. Only a quorum used for phase 1 (leader election) and a quorum used for phase 2 (replication) must intersect.
This applies to all algorithms in this class: Paxos, Viewstamped Replication, Zookeeper, Raft etc.
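For quorums defined only by their size, the intersection requirement reduces to simple arithmetic: a phase 1 quorum of size `q1` and a phase 2 quorum of size `q2` among `n` nodes always share a node when `q1 + q2 > n`. A small sketch, with hypothetical function names:

```python
def is_safe(n, q1, q2):
    """True if every phase 1 quorum meets every phase 2 quorum."""
    return 1 <= q1 <= n and 1 <= q2 <= n and q1 + q2 > n


def leader_election_quorum(n, q2):
    """Smallest phase 1 quorum that is safe for a given phase 2 quorum."""
    return n - q2 + 1


classic_majorities = is_safe(5, 3, 3)          # the usual configuration
small_replication = is_safe(10, 8, 3)          # replicate to 3 of 10
too_small = is_safe(10, 7, 3)                  # 7 + 3 does not exceed 10
needed = leader_election_quorum(10, 3)
```

The trade-off is that shrinking the replication quorum makes the common case cheaper while making leader election require more nodes.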
![Page 64: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/64.jpg)
Example: Non-strict majorities
Phase 2 Replication quorum
Phase 1 Leader election quorum
![Page 65: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/65.jpg)
Example: Counting quorums
Replication quorum Leader election quorum
![Page 66: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/66.jpg)
Example: Group quorums
Replication quorum Leader election quorum
![Page 67: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/67.jpg)
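How strong is the leadership?
[Diagram: a spectrum running from "Strong Leadership" to "Leaderless", with regions labelled "Leader driven" and "Leader only when needed". Algorithms placed on it: Raft, Viewstamped Replication, Multi-Paxos, Zookeeper, Chubby, Fast Paxos, Paxos and Egalitarian Paxos.]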
![Page 68: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/68.jpg)
Who is the winner?
Depends on the award:
• Best for minimum latency: Viewstamped Replication
• Most widely used open source project: Zookeeper
• Easiest to understand: Raft
• Best for WANs: Egalitarian Paxos
![Page 69: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/69.jpg)
Future
1. More scalable consensus algorithms utilizing Flexible Paxos.
2. A clearer understanding of consensus and better explained algorithms.
3. Consensus in challenging settings such as geo-replicated systems.
![Page 70: why cant we agree deck - QCon · “The process of reaching agreement over state between unreliable hosts connected by unreliable networks, ... Paxos algorithm and the needs of a](https://reader036.vdocuments.us/reader036/viewer/2022070107/602296621f8eb736204fae82/html5/thumbnails/70.jpg)
Summary
Do not be discouraged by impossibility results and dense abstract academic papers.
Don’t give up on consistency. Consensus is achievable, even performant and scalable.
Find the right algorithm and quorum system for your specific domain. There is no single silver bullet.
@heidiann360