CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why can’t we all just get along?)


TRANSCRIPT

Page 1: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

CS 3700 Networks and Distributed Systems

Distributed Consensus and Fault Tolerance

(or, why can't we all just get along?)

Page 2: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Black Box Online Services

• Storing and retrieving data from online services is commonplace
• We tend to treat these services as black boxes
• Data goes in, we assume outputs are correct
• We have no idea how the service is implemented

Black Box Service

Page 3: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Black Box Online Services


debit_transaction(-$75) → OK
get_recent_transactions() → […, "-$75", …]

Page 4: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Black Box Online Services


add_item_to_cart("Cheerios") → OK
get_cart() → ["Lucky Charms", "Cheerios"]

Page 5: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Black Box Online Services


post_update("I LOLed") → OK
get_newsfeed() → […, {"txt": "I LOLed", "likes": 87}]

Page 6: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Peeling Back the Curtain

• How are large services implemented?
• Different types of services may have different requirements
• Leads to different design decisions

Black Box Service?

Page 7: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Centralization

• Advantages of centralization
  • Easy to set up and deploy
  • Consistency is guaranteed (assuming a correct software implementation)
• Shortcomings
  • No load balancing
  • Single point of failure

[Diagram: Bob sends debit_transaction(-$75) → OK and get_account_balance() → $225 to a single server, whose state goes from Bob: $300 to Bob: $225.]

Page 8: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Sharding

• Advantages of sharding
  • Better load balancing
  • If done intelligently, may allow incremental scalability (see the sketch below)
• Shortcomings
  • Failures are still devastating

[Diagram: accounts are split across two shards, <A-M> and <N-Z>; Bob's debit_account(-$75) → OK and get_account_balance() → $225 are routed to the <A-M> shard, whose state goes from Bob: $300 to Bob: $225.]
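To make key-range sharding concrete, here is a minimal sketch in Python (all class and function names are hypothetical; the slides do not prescribe any implementation). Each account is routed to exactly one shard based on its first letter, mirroring the <A-M> / <N-Z> split above.

    # Minimal sketch of key-range sharding, mirroring the <A-M> / <N-Z> split.
    class Shard:
        def __init__(self, name):
            self.name = name
            self.accounts = {}                  # account name -> balance

        def debit_account(self, who, amount):
            self.accounts[who] = self.accounts.get(who, 0) + amount
            return "OK"

        def get_account_balance(self, who):
            return self.accounts.get(who, 0)

    class ShardedBank:
        """Routes each account to exactly one shard by its first letter."""
        def __init__(self):
            self.shards = [("A", "M", Shard("<A-M>")), ("N", "Z", Shard("<N-Z>"))]

        def _route(self, who):
            first = who[0].upper()
            for low, high, shard in self.shards:
                if low <= first <= high:
                    return shard
            raise KeyError(who)

        def debit_account(self, who, amount):
            return self._route(who).debit_account(who, amount)

        def get_account_balance(self, who):
            return self._route(who).get_account_balance(who)

    bank = ShardedBank()
    bank.debit_account("Bob", 300)              # opening balance lands on <A-M>
    print(bank.debit_account("Bob", -75))       # OK
    print(bank.get_account_balance("Bob"))      # 225

If the <A-M> server fails, every account in that range becomes unavailable, which is why failures remain devastating even though the load is now split.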

Page 9: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Replication

• Advantages of replication
  • Better load balancing of reads (potentially)
  • Resilience against failure; high availability (with some caveats)
• Shortcomings
  • How do we maintain consistency?

[Diagram: the <A-M> shard is now replicated on three servers; Bob's debit_account(-$75) is applied on every replica (100% Agreement), so each copy goes from Bob: $300 to Bob: $225, and get_account_balance() returns $225.]

Page 10: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Consistency Failures

[Diagram: a leader tries to replicate the update Bob: $300 → Bob: $225 to its replicas; several things can go wrong.]
• No agreement: some replicas apply Bob: $225 while others still hold Bob: $300
• No ACK: the leader cannot disambiguate cases where requests and responses are lost
• Timeout: asynchronous networks are problematic; a slow replica is indistinguishable from a crashed one
• Too few replicas reachable: again, no agreement

Page 11: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Byzantine Failures

[Diagram: two replicas hold Bob: $300 while a third reports Bob: $1000; there is no agreement. In some cases, replicas may be buggy or malicious.]

• When discussing distributed systems, failures due to malice are known as Byzantine failures
• The name comes from the Byzantine generals problem
• More on this later…

Page 12: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Problem and Definitions

• Build a distributed system that meets the following goals:
  • The system should be able to reach consensus
    • Consensus [n]: general agreement
  • The system should be consistent
    • Data should be correct; no integrity violations
  • The system should be highly available
    • Data should be accessible even in the face of arbitrary failures
• Challenges:
  • Many, many different failure modes
  • Theory tells us that these goals are impossible to achieve (more on this later)

Page 13: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

• Distributed Commits (2PC and 3PC)
• Theory (FLP and CAP)
• Quorums (Paxos)

Page 14: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Forcing Consistency

• One approach to building distributed systems is to force them to be consistent
  • Guarantee that all replicas receive an update…
  • …or none of them do
• If consistency is guaranteed, then reaching consensus is trivial

[Diagram: debit_account(-$75) returns OK only after all three replicas move from Bob: $300 to Bob: $225; a later debit_account(-$50) that cannot be applied on every replica returns Error instead of leaving only some replicas at Bob: $175.]

Page 15: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Distributed Commit Problem

• Application that performs operations on multiple replicas or databases
  • We want to guarantee that all replicas get updated, or none do
• Distributed commit problem:
  1. An operation is committed when all participants can perform the action
  2. Once a commit decision is reached, all participants must perform the action
• These two steps give rise to the Two Phase Commit protocol

Page 16: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Motivating Transactions

• System becomes inconsistent if any individual action fails

[Diagram: transfer_money(Alice, Bob, $100) is implemented as debit_account(Alice, -$100) followed by debit_account(Bob, $100). If the first call returns OK but the second returns Error, Alice drops from $600 to $500 while Bob stays at $300 instead of reaching $400, leaving the system inconsistent.]

Page 17: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Simple Transactions

• Actions inside a transaction behave as a single action

[Diagram: transfer_money(Alice, Bob, $100) wraps debit_account(Alice, -$100) and debit_account(Bob, $100) inside begin_transaction() … end_transaction(); both debits apply together, so Alice goes from $600 to $500, Bob goes from $300 to $400, and the client receives OK.]

At this point, if there haven't been any errors, we say the transaction is committed.

Page 18: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Simple Transactions

• If any individual action fails, the whole transaction fails
• Failed transactions have no side effects
• Incomplete results during transactions are hidden

[Diagram: inside begin_transaction(), debit_account(Alice, -$100) succeeds but debit_account(Bob, $100) returns Error; the transaction fails, so Alice stays at $600 and Bob stays at $300.]

Page 19: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

ACID Properties

• Traditional transactional databases support the following:
  1. Atomicity: all or none; if the transaction fails, then no changes are applied to the database
  2. Consistency: there are no violations of database integrity
  3. Isolation: partial results from incomplete transactions are hidden
  4. Durability: the effects of committed transactions are permanent
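As a concrete single-node illustration of atomicity and isolation, here is a minimal sketch using Python's built-in sqlite3 module (the table layout, account names, and overdraft check are illustrative assumptions, not part of the slides). Either both updates apply or neither does.

    import sqlite3

    # Minimal single-node sketch of an atomic transfer, illustrating the ACID
    # properties above. Table layout and the overdraft rule are hypothetical.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    db.executemany("INSERT INTO accounts VALUES (?, ?)",
                   [("Alice", 600), ("Bob", 300)])
    db.commit()

    def transfer_money(src, dst, amount):
        try:
            with db:  # BEGIN ... COMMIT; rolls back automatically on an exception
                db.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                           (amount, src))
                db.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                           (amount, dst))
                (bal,) = db.execute("SELECT balance FROM accounts WHERE name = ?",
                                    (src,)).fetchone()
                if bal < 0:              # simulate the Error case from the slides
                    raise ValueError("insufficient funds")
        except Exception:
            return "Error"               # no partial effects remain
        return "OK"

    print(transfer_money("Alice", "Bob", 100))   # OK: Alice $500, Bob $400
    print(transfer_money("Alice", "Bob", 9999))  # Error: balances unchanged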

Page 20: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Two Phase Commits (2PC)

• Well-known techniques are used to implement transactions in centralized databases
  • E.g. journaling (append-only logs)
  • Out of scope for this class (take a database class, or CS 5600)
• Two Phase Commit (2PC) is a protocol for implementing transactions in a distributed setting
  • The protocol operates in rounds
  • Assume we have a leader or coordinator that manages transactions
  • Each replica states that it is ready to commit
  • The leader decides the outcome and instructs replicas to commit or abort
• Assume no byzantine faults (i.e. nobody is malicious)

Page 21: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

2PC Example

[Diagram: columns for the Leader and Replicas 1-3, time flowing downward. The leader sends "txid = 678; value = y" to all three replicas (each currently stores x); each replica replies "ready txid = 678"; the leader sends "commit txid = 678"; each replica replies "committed txid = 678" and now stores y.]

• Begin by distributing the update

• Txid is a logical clock

• Wait to receive “ready to commit” from all replicas

• Tell replicas to commit

• At this point, all replicas are guaranteed to be up-to-date
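A rough in-memory sketch of the leader's side of this round, in Python (the Replica objects and return values are stand-ins for network messages that could be lost or time out; none of these names come from the slides):

    # Sketch of one 2PC round driven by the leader.
    class Replica:
        def __init__(self):
            self.pending = {}     # txid -> value staged but not yet committed
            self.committed = {}   # txid -> value

        def prepare(self, txid, value):
            self.pending[txid] = value
            return "ready"        # a failed replica would time out or return an error

        def commit(self, txid):
            self.committed[txid] = self.pending.pop(txid)
            return "committed"

        def abort(self, txid):
            self.pending.pop(txid, None)
            return "aborted"

    def two_phase_commit(replicas, txid, value):
        # Phase 1: distribute the update and collect "ready" votes.
        votes = [r.prepare(txid, value) for r in replicas]
        if any(v != "ready" for v in votes):
            for r in replicas:
                r.abort(txid)
            return "aborted"
        # Phase 2: everyone voted ready, so everyone must commit.
        for r in replicas:
            r.commit(txid)
        return "committed"

    replicas = [Replica(), Replica(), Replica()]
    print(two_phase_commit(replicas, 678, "y"))   # committed
    print(replicas[0].committed)                  # {678: 'y'}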

Page 22: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Failure Modes

• Replica failure
  • Before or during the initial promise phase
  • Before or during the commit
• Leader failure
  • Before receiving all promises
  • Before or during sending commits
  • Before receiving all committed messages

Page 23: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Replica Failure (1)

[Diagram: the leader sends "txid = 678; value = y"; Replicas 1 and 2 reply "ready txid = 678" but Replica 3 fails and never does. The leader sends "abort txid = 678", the replicas reply "aborted txid = 678", and nobody commits y.]

• Error: not all replicas are “ready”

• The same thing happens if a write or a “ready” is dropped, a replica times out, or a replica returns an error

Page 24: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Replica Failure (2)

[Diagram: every replica replies "ready txid = 678" and the leader sends "commit txid = 678"; Replicas 1 and 2 commit y and reply "committed txid = 678", but Replica 3 fails before committing and still holds x. The leader resends "commit txid = 678" until Replica 3 recovers, commits y, and replies "committed txid = 678".]

• Known inconsistent state
• Leader must keep retrying until all commits succeed

Page 25: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Replica Failure (2)

[Diagram: when Replica 3 reboots, it sends "stat txid = 678" to the leader, receives "commit txid = 678", commits y, and replies "committed txid = 678".]

• Replicas attempt to resume unfinished transactions when they reboot
• Finally, the system is consistent and may proceed

Page 26: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Leader Failure

• What happens if the leader crashes?
  • The leader must constantly be writing its state to permanent storage
  • It must pick up where it left off once it reboots
• If there are unconfirmed transactions
  • Send new write messages, wait for "ready to commit" replies
• If there are uncommitted transactions
  • Send new commit messages, wait for "committed" replies
• Replicas may see duplicate messages during this process
  • Thus, it's important that every transaction have a unique txid (see the sketch below)
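A tiny sketch (hypothetical, not from the slides) of how a replica can treat a recovering leader's duplicate messages idempotently by keying everything on the txid:

    # Sketch of a replica that tolerates duplicate 2PC messages from a
    # recovering leader by keying all bookkeeping on the transaction id.
    class Replica:
        def __init__(self):
            self.pending = {}      # txid -> value staged but not committed
            self.committed = {}    # txid -> value

        def on_write(self, txid, value):
            if txid in self.committed:   # duplicate of a finished transaction
                return "committed"
            self.pending[txid] = value   # (re)staging the same value is harmless
            return "ready"

        def on_commit(self, txid):
            if txid not in self.committed:
                self.committed[txid] = self.pending.pop(txid)
            return "committed"           # duplicate commits change nothing

    r = Replica()
    r.on_write(678, "y")
    r.on_commit(678)
    print(r.on_commit(678))   # still "committed"; the duplicate is ignored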

Page 27: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Allowing Progress

• Key problem: what if the leader crashes and never recovers?
  • By default, replicas block until contacted by the leader
  • Can the system make progress?
• Yes, under limited circumstances
  • After sending a "ready to commit" message, each replica starts a timer
  • The first replica whose timer expires elects itself as the new leader
  • Query the other replicas for their status
  • Send "commits" to all replicas if they are all "ready"
• However, this only works if all the replicas are alive and reachable
  • If a replica crashes or is unreachable, deadlock is unavoidable

Page 28: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

New Leader

[Diagram: the original leader fails after collecting "ready txid = 678". Replica 2's timeout expires; it sends "stat txid = 678" to the other replicas, they reply "ready txid = 678", Replica 2 sends "commit txid = 678", and every replica replies "committed txid = 678" and now stores y.]

• Replica 2's timeout expires, begins recovery procedure
• System is consistent again

Page 29: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Deadlock

[Diagram: the leader fails after collecting "ready txid = 678", and Replica 3 is also unreachable. Replica 2's timeout expires; it sends "stat txid = 678" and Replica 1 replies "ready txid = 678", but Replica 3's status is unknown, so recovery stalls.]

• Replica 2's timeout expires, begins recovery procedure
• Cannot proceed, but cannot abort

Page 30: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Garbage Collection

• 2PC is somewhat of a misnomer: there is actually a third phase
  • Garbage collection
• Replicas must retain records of past transactions, just in case the leader fails
  • Example: suppose the leader crashes, reboots, and attempts to commit a transaction that has already been committed
  • Replicas must remember that this past transaction was already committed, since committing a second time may lead to inconsistencies
• In practice, the leader periodically tells replicas to garbage collect
  • Transactions with txid <= some txid in the past may be deleted

Page 31: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

2PC Summary

• Message complexity: O(2n)
• The good: guarantees consistency
• The bad:
  • Write performance suffers if there are failures during the commit phase
  • Does not scale gracefully (possible, but difficult to do)
  • A pure 2PC system blocks all writes if the leader fails
  • Smarter 2PC systems still block all writes if the leader + 1 replica fail
• 2PC sacrifices availability in favor of consistency

Page 32: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Can 2PC be Fixed?

• The issue with 2PC is its reliance on the centralized leader
  • Only the leader knows if a transaction is 100% ready to commit or not
  • Thus, if the leader + 1 replica fail, recovery is impossible
• Potential solution: Three Phase Commit
  • Add an additional round of communication
  • Tell all replicas to prepare to commit, before actually committing
• The state of the system can always be deduced by a subset of alive replicas that can communicate with each other
  • … unless there are partitions (more on this later)

Page 33: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

3PC Example

[Diagram: columns for the Leader and Replicas 1-3, time flowing downward. The leader sends "txid = 678; value = y"; each replica replies "ready txid = 678"; the leader sends "prepare txid = 678" and each replica replies "prepared txid = 678"; finally the leader sends "commit txid = 678" and each replica replies "committed txid = 678", after which every replica stores y.]

• Begin by distributing the update
• Wait to receive "ready to commit" from all replicas
• Tell all replicas that everyone is "ready to commit"
• Tell replicas to commit
• At this point, all replicas are guaranteed to be up-to-date
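The essential difference from 2PC is the extra prepared state. Below is a rough sketch of the replica states and the recovery rule they enable (state and function names are mine, not the protocol's):

    # Sketch of the 3PC replica states: INIT -> READY -> PREPARED -> COMMITTED.
    # Useful invariant: a replica only reaches PREPARED after *every* replica
    # voted ready, which is what lets a recovering leader decide safely.
    class Replica3PC:
        def __init__(self):
            self.state, self.value = "INIT", None

        def on_write(self, txid, value):   # phase 1: stage the update
            self.value, self.state = value, "READY"
            return "ready"

        def on_prepare(self, txid):        # phase 2: everyone was ready
            self.state = "PREPARED"
            return "prepared"

        def on_commit(self, txid):         # phase 3: apply the update
            self.state = "COMMITTED"
            return "committed"

    def recovery_decision(states):
        """What a newly elected leader can deduce from the replicas it reaches
        (the leader-failure slides below walk through both cases)."""
        if any(s in ("PREPARED", "COMMITTED") for s in states):
            return "commit"   # prepare was sent, so every replica had voted ready
        return "abort"        # nobody is past READY, so nobody can have committed

    print(recovery_decision(["READY", "READY"]))      # abort
    print(recovery_decision(["PREPARED", "READY"]))   # commit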

Page 34: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Leader Failures

[Diagram: the leader distributes "txid = 678; value = y" and fails after receiving "ready txid = 678" replies. Replica 2's timeout expires; it sends "stat txid = 678" and hears only "ready txid = 678" back, so it sends "abort txid = 678", the replicas reply "aborted txid = 678", and everyone keeps x.]

• Begin by distributing the update
• Wait to receive "ready to commit" from all replicas
• Replica 2's timeout expires, begins recovery procedure
• Replica 3 cannot be in the committed state, thus okay to abort
• System is consistent again

Page 35: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Leader Failures

[Diagram: the leader sends "prepare txid = 678", collects "prepared txid = 678", and fails after sending "commit txid = 678" to only some replicas. Replica 2's timeout expires; it sends "stat txid = 678", hears "prepared txid = 678", and sends "commit txid = 678" so that every replica ends up storing y.]

• Replica 2's timeout expires, begins recovery procedure
• All replicas must have been ready to commit
• System is consistent again

Page 36: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Oh Great, I Fixed Everything!

• Wrong
• 3PC is not robust against network partitions
• What is a network partition?
  • A split in the network, such that full n-to-n connectivity is broken
  • i.e. not all servers can contact each other
• Partitions split the network into two or more disjoint subnetworks
• How can a network partition occur?
  • A switch or a router may fail, or it may receive an incorrect routing rule
  • A cable connecting two racks of servers may develop a fault
• Network partitions are very real; they happen all the time

Page 37: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Partitioning

[Diagram: the leader distributes "txid = 678; value = y" and collects "ready txid = 678", and then the network partitions into two subnets. The leader, together with Replica 1, completes the prepare and commit phases and stores y, assuming Replicas 2 and 3 have failed. On the other side of the partition, leader recovery is initiated and the transaction is aborted, so Replicas 2 and 3 keep x.]

• Network partitions into two subnets!
• Leader assumes Replicas 2 and 3 have failed, moves on
• System is inconsistent

Page 38: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

3PC Summary

• Adds an additional phase vs. 2PC
  • Message complexity: O(3n)
  • Really four phases with garbage collection
• The good: allows the system to make progress under more failure conditions
• The bad:
  • The extra round of communication makes 3PC even slower than 2PC
  • Does not work if the network partitions
    • 2PC will simply deadlock if there is a partition, rather than become inconsistent
• In practice, nobody uses 3PC
  • The additional complexity and performance penalty just isn't worth it
  • Loss of consistency during partitions is a deal breaker

Page 39: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

• Distributed Commits (2PC and 3PC)
• Theory (FLP and CAP)
• Quorums (Paxos)

Page 40: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

A Moment of Reflection

• Goals, revisited:
  • The system should be able to reach consensus
    • Consensus [n]: general agreement
  • The system should be consistent
    • Data should be correct; no integrity violations
  • The system should be highly available
    • Data should be accessible even in the face of arbitrary failures
• Achieving these goals may be harder than we thought :(
  • Huge number of failure modes
  • Network partitions are difficult to cope with
  • We haven't even considered byzantine failures

Page 41: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

What Can Theory Tell Us?

• Let's assume the network is synchronous and reliable
  • The algorithm can be divided into discrete rounds
  • If a message from r is not received in a round, then r must be faulty
    • Since we're assuming synchrony, packets cannot be delayed arbitrarily
• During each round, r may send m <= n messages
  • n is the total number of replicas
  • A replica might crash before sending all n messages
• If we are willing to tolerate f total failures (f < n), how many rounds of communication do we need to guarantee consensus?

Page 42: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Consensus in a Synchronous System

• Initialization:
  • All replicas choose a value, 0 or 1 (can generalize to more values if you want)
• Properties:
  • Agreement: all non-faulty processes ultimately choose the same value
    • Either 0 or 1 in this case
  • Validity: if a replica decides on a value, then at least one replica must have started with that value
    • This prevents the trivial solution of all replicas always choosing 0, which is technically perfect consensus but is practically useless
  • Termination: the system must converge in finite time

Page 43: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Algorithm Sketch

• Each replica maintains a map M of all known values
  • Initially, M contains only the replica's own value
  • e.g. M = {'replica1': 0}
• Each round, broadcast M to all other replicas
  • On receipt, construct the union of the received M and the local M
• The algorithm terminates when all non-faulty replicas have the values from all other non-faulty replicas
  • Example with three non-faulty replicas (1, 3, and 5):
    • M = {'replica1': 0, 'replica3': 1, 'replica5': 0}
  • The final value is min(M.values())
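A direct simulation of this sketch in Python (simplified: crashed replicas are silent from the very first round, so this does not exercise the mid-round crashes that motivate the f + 1 bound on the next page):

    # Simulation of the synchronous consensus sketch: every replica keeps a map
    # M of known initial values, broadcasts it each round, merges what it hears,
    # and finally decides min(M.values()).
    def synchronous_consensus(initial_values, crashed, f):
        """initial_values: {replica_id: 0 or 1}; crashed: ids that never send;
        f: the number of failures we are willing to tolerate."""
        alive = [r for r in initial_values if r not in crashed]
        M = {r: {r: initial_values[r]} for r in alive}    # each replica's local map

        for _ in range(f + 1):                            # f + 1 rounds
            broadcasts = {r: dict(M[r]) for r in alive}   # everyone sends its map
            for receiver in alive:
                for sender in alive:
                    M[receiver].update(broadcasts[sender])   # union of the maps

        # All non-faulty replicas now hold the same map, hence the same decision.
        return {r: min(M[r].values()) for r in alive}

    values = {"replica1": 0, "replica2": 1, "replica3": 1, "replica4": 0, "replica5": 0}
    print(synchronous_consensus(values, crashed={"replica2", "replica4"}, f=2))
    # every surviving replica decides min(0, 1, 0) = 0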

Page 44: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Bounding Convergence Time

• How many rounds will it take if we are willing to tolerate f failures?
  • f + 1 rounds
• Key insight: all replicas must be sure that all replicas that did not crash have the same information (so they can make the same decision)
• Proof sketch, assuming f = 2
  • The worst case scenario is that replicas crash during rounds 1 and 2
  • During round 1, replica x crashes
    • All other replicas don't know if x is alive or dead
  • During round 2, replica y crashes
    • Now it is clear that x is not alive, but unknown if y is alive or dead
  • During round 3, no more replicas may crash
    • All replicas are guaranteed to receive updated info from all other replicas
    • The final decision can be made

Page 45: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

A More Realistic Model

• The previous result is interesting, but unrealistic
  • We assumed that the network is synchronous and reliable
  • Of course, neither of these things is true in reality
• What if the network is asynchronous and reliable?
  • Replicas may take an arbitrarily long time to respond to messages
• Let's also assume that all faults are crash faults
  • i.e. if a replica has a problem, it crashes and never wakes up
  • No byzantine faults

Page 46: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

The FLP Result

There is no asynchronous algorithm that achieves consensus on a 1-bit value in the presence of crash faults. The result is true even if no crash actually occurs!

• This is known as the FLP result
  • Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson, 1985
• It is an extremely powerful result because:
  • If you can't agree on 1 bit, generalizing to larger values isn't going to help you
  • If you can't converge with crash faults, there is no way you can converge with byzantine faults
  • If you can't converge on a reliable network, there is no way you can on an unreliable network

Page 47: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

FLP Proof Sketch

• In an asynchronous system, a replica x cannot tell whether a non-responsive replica y has crashed or is just slow
• What can x do?
  • If it waits, it may block forever, since it might never receive the message from y
  • If it decides, it may find out later that y made a different decision
• The proof constructs a scenario where each attempt to decide is overruled by a delayed, asynchronous message
  • Thus, the system oscillates between 0 and 1 and never converges

Page 48: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Impact of FLP

• FLP proves that any fault-tolerant distributed algorithm attempting to reach consensus has runs that never terminate
  • These runs are extremely unlikely ("probability zero")
  • Yet they imply that we can't find a totally correct solution
  • And so "consensus is impossible" really means "consensus is not always possible"
• So what can we do?
  • Use randomization and probabilistic guarantees (gossip protocols)
  • Avoid consensus, use quorum systems (Paxos or Raft)
  • In other words, trade off consistency in favor of availability

Page 49: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Consistency vs. Availability

• FLP states that perfect consistency is impossible
  • Practically, we can get close to perfect consistency, but at significant cost
    • e.g. using 3PC
  • Availability begins to suffer dramatically under failure conditions
• Is there a way to formalize the tradeoff between consistency and availability?

Page 50: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Eric Brewer's CAP Theorem

• CAP theorem for distributed data replication
  • Consistency: updates to data are applied to all or none
  • Availability: must be able to access all data
  • Network Partition Tolerance: failures can split the network into disjoint subnetworks
• The Brewer Theorem
  • No system can simultaneously achieve C and A and P
• Typical interpretation: "C, A, and P: choose 2"
  • In practice, all networks may partition, thus you must choose P
  • So a better interpretation might be "C or A: choose 1"
• Never formally proved or published
  • Yet widely accepted as a rule of thumb

Page 51: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

CAP Examples

• A+P (available under partitions)
  • Client can always read
  • Impact of partitions: not consistent
• C+P (consistent under partitions)
  • Reads must always return accurate results
  • Impact of partitions: no availability

[Diagram: in the A+P system, Write(key, 1) is accepted and Read(key, 1) still succeeds even though the Replicate(key, 2) message is cut off by the partition; in the C+P system, the same partition causes the Read to fail with "Error: ServiceUnavailable".]

Page 52: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

"C or A: Choose 1"

• Taken to the extreme, CAP suggests a binary division in distributed systems
  • Your system is either consistent or available
• In practice, it's more like a spectrum of possibilities

[Diagram: a spectrum from Perfect Consistency to Always Available. At one end, financial information must be correct; at the other, serve content to all visitors regardless of correctness; most systems attempt to balance correctness with availability.]

Page 53: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

• Distributed Commits (2PC and 3PC)
• Theory (FLP and CAP)
• Quorums (Paxos)

Page 54: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Strong Consistency, Revisited

• 2PC and 3PC achieve strong consistency, but they have significant shortcomings
  • 2PC cannot make progress in the face of leader + 1 replica failures
  • 3PC loses its consistency guarantees in the face of network partitions
• Where do we go from here?
  • Observation: 2PC and 3PC attempt to reach 100% agreement
  • What if 51% of the replicas agree?

Page 55: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Quorum Systems

• In law, a quorum is the minimum number of members of a deliberative body necessary to conduct the business of that group
  • When quorum is not met, a legislative body cannot hold a vote, and cannot change the status quo
  • E.g. imagine if only 1 senator showed up to vote in the Senate
• Distributed systems can also use a quorum approach to consensus
  • Essentially a voting approach
  • If a quorum of replicas (out of N) agree, then an update can be committed

Page 56: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Advantages of Quorums

• Availability: quorum systems are more resilient in the face of failures
  • Quorum systems can be designed to tolerate both benign and byzantine failures
• Efficiency: can significantly reduce communication complexity
  • Do not require all servers in order to perform an operation
  • Require only a subset of them for each operation

Page 57: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

High-Level Quorum Example

[Diagram: a write with timestamp 3 (Bob: $375) is committed on a quorum of the replicas, while some replicas still hold only older versions (ts 1: Bob: $300; ts 2: Bob: $400). A read queries a quorum of replicas and returns the value with the highest timestamp.]

Timestamp / Balance: 1 → $300, 2 → $400, 3 → $375

Challenges:
1. Ensuring that at least a quorum of replicas commit each update
2. Ensuring that updates have the correct logical ordering
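A rough sketch of this timestamped quorum read/write idea in Python (simplified and hypothetical: a single writer and randomly chosen quorums). Because any two majorities of N replicas overlap, a read quorum always contains at least one replica that saw the latest committed write.

    import random

    # Sketch of majority-quorum writes and reads with timestamps.
    N = 5
    QUORUM = N // 2 + 1                      # majority: 3 out of 5
    replicas = [dict() for _ in range(N)]    # each holds {key: (timestamp, value)}

    def quorum_write(key, ts, value):
        for i in random.sample(range(N), QUORUM):   # any majority will do
            replicas[i][key] = (ts, value)
        return "OK"

    def quorum_read(key):
        chosen = random.sample(range(N), QUORUM)
        versions = [replicas[i].get(key, (0, None)) for i in chosen]
        return max(versions)[1]              # the newest timestamp wins

    quorum_write("Bob", 1, "$300")
    quorum_write("Bob", 2, "$400")
    quorum_write("Bob", 3, "$375")
    print(quorum_read("Bob"))                # "$375"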

Page 58: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Paxos

• A replication protocol that ensures a global ordering of updates
  • All writes into the system are ordered in logical time
  • Replicas agree on the order of committed writes
• Uses a quorum approach to consensus
  • One replica is always chosen as the leader
  • The protocol moves forward as long as a quorum of replicas agree
• The "Paxos protocol" is actually a theoretical proof
  • We'll be discussing a concrete implementation of the protocol
  • Paxos for System Builders, Jonathan Kirsch and Yair Amir. http://www.cs.jhu.edu/~jak/docs/paxos_for_system_builders.pdf

Page 59: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

History of Paxos

• Developed by Turing Award winner Leslie Lamport
• First published as a tech report in 1989
  • The journal refused to publish it; nobody understood the protocol
• Formally published in 1998
  • Again, nobody understands it
• Leslie Lamport publishes "Paxos Made Simple" in 2001
  • People start to get the protocol
• Reaches widespread fame in 2006-2007
  • Used by Google in their Chubby distributed lock service
  • Zookeeper is the open-source version of Chubby

Page 60: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Paxos at a High-Level

1. Replicas elect a leader and agree on the view number
  • The view is a logical clock that divides time into epochs
  • During each view, there is a single leader
2. The leader collects promises from the replicas
  • Replicas promise to only accept proposals from the current or future views
  • This prevents replicas from going "back in time"
  • The leader learns about proposed updates from the previous view that haven't yet been accepted
3. The leader proposes updates and replicas accept them
  • Start by completing unfinished updates from the previous view
  • Then move on to new writes from clients

Page 61: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

View Selection

[Diagram: Replicas 0-4, time flowing downward. All replicas start at view 0; after broadcasting and adopting the largest view heard (0 0 0 0 0 → 3 3 2 2 2 → 3 3 3 3 3), every replica converges on view 3.]

• All replicas have a view number
• Goal is to have all replicas agree on the view
• Broadcast the view to all other replicas
  • If a replica receives broadcasts with the same view, assume the view is correct
  • If a replica receives a broadcast with a larger view, adopt it and rebroadcast
• The leader is the replica with ID = view % N
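A small sketch of that adoption rule (hypothetical and greatly simplified; the real protocol also handles retransmission and failure detection): replicas keep adopting the largest view they hear about until everyone agrees, and the leader for a view is simply view % N.

    # Sketch of view agreement: adopt the largest view seen and repeat.
    N = 5
    views = [0, 0, 3, 2, 2]                  # replicas currently disagree

    def agree_on_view(views):
        changed = True
        while changed:                       # keep "broadcasting" until stable
            changed = False
            for receiver in range(N):
                for sender in range(N):
                    if views[sender] > views[receiver]:
                        views[receiver] = views[sender]   # adopt the larger view
                        changed = True
        return views

    views = agree_on_view(views)
    print(views)                              # [3, 3, 3, 3, 3]
    print("leader is replica", views[0] % N)  # leader = view % N = replica 3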

Page 62: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Prepare/Promise

[Diagram: Replicas 0-4, all at view 5 with clock 13. The leader broadcasts "prepare view=5 clock=13" and each replica replies "promise view=5 clock=13".]

• The leader must ensure that a quorum of replicas exists
• Replicas promise to not accept any messages with view < v
• Replicas won't elect a new leader until the current one fails (measured using a timeout)

Page 63: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Commit/Accept

[Diagram: a client sends a write to the leader (all replicas at view 5, clock 13). The leader proposes the new value y; each replica writes it to temporary storage and replies "accept clock=14"; the leader sends "commit clock=14", every replica commits y at clock 14, and the client receives OK.]

• All client requests are serialized through the leader
• Replicas write the new value to temporary storage
• Increment the clock and commit the new value after receiving "accept" messages
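Roughly, a replica's side of this exchange looks like the sketch below (heavily simplified and hypothetical: one in-flight update, no view changes, no quorum counting at the leader). The proposed value sits in temporary storage until the commit arrives, at which point the clock advances.

    # Sketch of a replica handling the Paxos-style commit/accept exchange above.
    class PaxosReplica:
        def __init__(self, view=5, clock=13):
            self.view, self.clock = view, clock
            self.value = "x"            # last committed value
            self.tentative = {}         # clock -> proposed value (temporary storage)

        def on_propose(self, view, clock, value):
            if view < self.view:        # honor the promise: ignore old views
                return None
            self.tentative[clock] = value
            return ("accept", clock)

        def on_commit(self, clock):
            self.value = self.tentative.pop(clock)   # make the value permanent
            self.clock = clock                       # advance the logical clock
            return ("committed", clock)

    r = PaxosReplica()
    print(r.on_propose(5, 14, "y"))     # ('accept', 14)
    print(r.on_commit(14))              # ('committed', 14)
    print(r.value, r.clock)             # y 14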

Page 64: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Paxos Review

• Three phases: elect a leader, collect promises, commit/accept
  • Message complexity: roughly O(n^2 + n)
  • However, more like O(n^2) in steady state (repeated commit/accept)
• Two logical clocks:
  1. The view increments each time the leader changes
    • Replicas promise not to accept updates from prior views
  2. The clock increments after each update/write
    • Maintains the global ordering of updates
• Replicas set a timeout every time they hear from the leader
  • Increment the view and elect a new leader if the timeout expires
• The protocol moves forward once a quorum of replicas are in agreement

Page 65: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Failure Modes

1. What happens if a commit fails?
2. What happens during a partition?
3. What happens after the leader fails?

Page 66: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Bad Commit

[Diagram: the leader proposes an update at clock=14, but only some replicas reply "accept clock=14", so the commit cannot go through. The leader retries the accept/commit round until a quorum accepts, after which all reachable replicas commit the new value at clock 14.]

• What happens if a quorum does not accept a commit?
• Leader must retry until quorum is reached, or broadcast an abort
• Replicas that fall behind can reconcile by downloading missed updates from a peer

Page 67: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Partitions (1)

[Diagram: five replicas at view 0 are partitioned into a group of three and a group of two. The group of three elects a new leader (view 1) and continues committing updates; the group of two cannot.]

• What happens during a partition?
• The group with a quorum (if one exists) keeps making progress
  • This may require a new leader election
• A group with fewer than a quorum of replicas cannot accept updates
• Once the partition is fixed, either:
  • Hold a new leader election and move forward
  • Or reconcile with up-to-date peers

Page 68: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Partitions (2)

[Diagram: the minority group is still at view 0 while the majority group has moved to view 1 and committed new values. When the partition heals, the view = 0 group's messages are rejected, and its replicas adopt view 1 and reconcile.]

• What happens when the view = 0 group attempts to rejoin?
• Promises for view = 1 prevent the old leader from interfering with the new quorum

Page 69: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Leader Failure (1)

[Diagram: the old leader sends "commit clock=14" (value y) to only one replica before failing, so no quorum saw the update. The surviving replicas increment the view and elect a new leader, which sends "prepare clock=13" and collects "promise clock=13" replies that say nothing about the update. The new leader then commits a different update as clock=14', which the replica already holding y at clock 14 rejects.]

• What happens if there is an uncommitted update with no quorum?
• The new leader is unaware of the uncommitted update
• The leader announces a new update with clock=14', which is rejected by Replica 3
• Replica 3 is desynchronized and must reconcile with another replica

Page 70: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Leader Failure (2)

[Diagram: as before, the old leader sends "commit clock=14" (value y) to only one replica before failing. This time the promises collected by the new leader report the uncommitted update, so it recommits the original clock=14 update and every replica ends up storing y.]

• What happens if there is an uncommitted update with no quorum?
• The new leader is aware of the uncommitted update
• The leader must recommit the original clock=14 update

Page 71: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Leader Failure (3)

[Diagram: the old leader sends "commit clock=14" (value y) to a quorum of replicas before failing. The survivors increment the view and elect a new leader, which sends prepares and collects promises; since the update reached a quorum, at least one promise reports it, so the new leader recommits the original clock=14 update.]

• What happens if there is an uncommitted update with a quorum?
• By definition, the new leader must become aware of the uncommitted update
  • Recall that the leader must collect promises from a quorum, and any two quorums overlap
• The leader must recommit the original clock=14 update

Page 72: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

The Devil in the Details

• Clearly, Paxos is complicated
• Things we haven't covered:
  • Reconciliation: how to bring a replica up-to-date
  • Managing the queue of updates from clients
    • Updates may be sent to any replica
    • Replicas are responsible for responding to clients who contact them
    • Replicas may need to re-forward updates if the leader changes
  • Garbage collection
    • Replicas need to remember the exact history of updates, in case the leader changes
    • Periodically, these lists need to be garbage collected

Page 73: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Odds and Ends
• Byzantine Generals
• Gossip

Page 74: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Byzantine Generals Problem

• Name coined by Leslie Lamport
• Several Byzantine generals are laying siege to an enemy city
• They have to agree on a common strategy: attack or retreat
• They can only communicate by messenger
• Some generals may be traitors (their identity is unknown)

Do you see the connection with the consensus problem?

Page 75: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Byzantine Distributed Systems

• Goals
  1. All loyal lieutenants obey the same order
  2. If the commanding general is loyal, then every loyal lieutenant obeys the order he sends
• Can the problem be solved?
  • Yes, iff there are at least 3m+1 generals in the presence of m traitors
  • E.g. if there are 3 generals, even 1 traitor makes the problem unsolvable
• Bazillion variations on the basic problem
  • What if messages are cryptographically signed (i.e. they are unforgeable)?
  • What if communication is not g x g (i.e. some pairs of generals cannot communicate)?
• Most algorithms have byzantine versions (e.g. Byzantine Paxos)
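Turned around, the 3m+1 bound says n generals can tolerate at most ⌊(n-1)/3⌋ traitors; a trivial, purely illustrative check:

    # The 3m+1 bound above: how many traitors can n generals tolerate?
    def max_traitors(n_generals):
        return (n_generals - 1) // 3

    for n in (3, 4, 7, 10):
        print(n, "generals tolerate", max_traitors(n), "traitor(s)")
    # 3 -> 0 (even one traitor is fatal), 4 -> 1, 7 -> 2, 10 -> 3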

Page 76: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Alternatives to Quorums

• Quorums favor consistency over availability
  • If no quorum exists, then the system stops accepting writes
  • There is significant overhead in maintaining consistent replicated state
• What if eventual consistency is okay?
  • Favor availability over consistency
  • Results may be stale or incorrect sometimes (hopefully only in rare cases)
• Gossip protocols
  • Replicas periodically, randomly exchange state with each other
  • No strong consistency guarantees, but…
  • Surprisingly fast and reliable convergence to up-to-date state
  • Requires vector clocks or better in order to causally order events
  • Extreme cases of divergence may be irreconcilable
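A toy sketch of gossip-style anti-entropy (illustrative assumptions only: a single version counter stands in for the vector clocks a real system would need, and each replica exchanges state with one random peer per round):

    import random

    # Toy gossip sketch: replicas hold {key: (version, value)} and, each round,
    # merge state with one random peer, keeping the higher version of every key.
    N = 8
    replicas = [dict() for _ in range(N)]
    replicas[0]["bob"] = (1, "$300")     # an update initially known to one replica

    def merge(a, b):
        for key in set(a) | set(b):
            newest = max(a.get(key, (0, None)), b.get(key, (0, None)))
            a[key] = b[key] = newest     # both peers keep the newer version

    def gossip_round():
        for i in range(N):
            peer = random.choice([j for j in range(N) if j != i])
            merge(replicas[i], replicas[peer])

    rounds = 0
    while any("bob" not in r for r in replicas):
        gossip_round()
        rounds += 1
    print("all replicas learned the update after", rounds, "round(s)")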

Page 77: CS 3700 Networks and Distributed Systems Distributed Consensus and Fault Tolerance (or, why cant we all just get along?)

Sources

1. Some slides courtesy of Cristina Nita-Rotaru (http://cnitarot.github.io/courses/ds_Fall_2016/)

2. The Part-Time Parliament, Leslie Lamport. http://research.microsoft.com/en-us/um/people/lamport/pubs/lamport-paxos.pdf

3. Paxos Made Simple, Leslie Lamport. http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf

4. Paxos for System Builders, Jonathan Kirsch and Yair Amir. http://www.cs.jhu.edu/~jak/docs/paxos_for_system_builders.pdf

5. The Chubby Lock Service for Loosely-Coupled Distributed Systems, Mike Burrows. http://research.google.com/archive/chubby-osdi06.pdf

6. Paxos Made Live – An Engineering Perspective, Tushar Deepak Chandra, Robert Griesemer, Joshua Redstone. http://research.google.com/archive/paxos_made_live.pdf