computing with byzantine shared memory
DESCRIPTION
Computing with Byzantine Shared Memory. Topics in Reliable Distributed Systems Fall 2004-2005, Idit Keidar. Sources. Byzantine Disk Paxos: Optimal Resilience with Byzantine Shared Memory Abraham, Chockler, Keidar, Malkhi, PODC04 - PowerPoint PPT PresentationTRANSCRIPT
Computing with Byzantine Shared Memory
Topics in Reliable Distributed Systems
Fall 2004-2005, Idit Keidar
Sources
• Byzantine Disk Paxos: Optimal Resilience with Byzantine Shared Memory– Abraham, Chockler, Keidar, Malkhi, PODC04
• Optimal Resilience Wait-Free Storage from Byzantine Components: Inherent Costs and Solutions– Chockler, Keidar, Malkhi, FuDiCo II, June 2004.
• Some slides stolen from Chockler and Malkhi
Internet Information Services
InternetInternet
Internet Information Services
webweb
Network storage[Fleet (Hebrew U and Lucent) Agile (GTech), Coca (Cornell),
SBQL (UTexas), others]
Internet Information Services
webweb
Peer-to-peer storage[Oceanstore, etc.]
Storage Area Network (SAN)
Large-Scale Deployment
• Clients come and go– cannot be expected to be around – hence, direct communication infeasible
• Communication among storage nodes is also infeasible– In SAN impossible– Thin (scalable) servers, more logic at clients
• Some of the storage nodes can be compromised by a malicious adversary
Fault/Security Scenarios
• Client crash, unresponsiveness
• Access control for clients– access restricted to data they own– if bypassed, what can we do?
• A malicious client can mess up the data anyway
• Servers entrusted with others’ data– compromised servers a bigger problem – Byzantine faults possible
Assumptions
• Any number of client failures (crash)– no Byzantine faults (assume access control)
• Threshold (t-out-of-n) of storage components (per object) can be faulty– Byzantine, unresponsive: NR-Arbitrary Faults
• Monitoring/reconfiguration service– Each object has enough healthy replicas
• Naming service: e.g., DHT
Example: Internet-Wide StoreNS
Client:
ha=lookup(a); read(ha); write(ha); …
X
Y
Z
A
ha=
{X,Y
,Z,A
}
Data-Centric ReplicationNo server-to-server comm
NR-arbitrary failures
No client-to-client commBounded, unknown number of clients
Some System Examples
• Fleet, HUJI and Lucent • Rosebud, MIT• Agile, GTech • Coca, Cornell • SBQL, UTexas • OceanStore, Berkley• Pasis, CMU• others
Our Focus: Foundations
• Wanted: generic services useful for applications – Reliable R/W registers– Consensus
• Which register semantics can be supported?– Atomic, regular, safe– Termination conditions
• At which cost?– Failure resilience– Communication rounds– Memory consumption
• By understanding tradeoffs, system designers will be able to intelligently decide what to choose
Formal Model
• Asynchronous shared memory– servers=shared objects, clients=processes
• Shared objects may experience NR-arbitrary failures [JCT98]– faulty object can respond with arbitrary value
or fail to respond
• Processes can fail by crashing
Previous Work I: Wait-Free Constructions
• Safe register: A read that does not overlap a write returns the last register’s value – Malkhi & Reiter 98: n > 4t
• One round read/write, unbounded timestamps
– Jayanti, Chandra & Toueg 98: n > 5t• One round read/write, no timestamps
• Self-construction: implement safe register from collection of fault-prone safe registers
Byzantine Quorum Systems: Example [Malkhi and Reiter 98]
• At most one server can be penetrated
x = 7, t = 1
x = 7
x = 0t = 0
x = 2t = 5
x = 7t = 1
x = 7t = 1
Byzantine Quorum Systems: Example [Malkhi and Reiter 98]
x = 7, t = 1
x = 7
x = 0t = 0
x = 0t = 0
x = 7t = 1
x = 7t = 1
• Why timestamps?
Previous Work II: Optimal Failure Resilience (n>3t)
• Synchronous system [Bazzi, DC’00] – Safe register
• Servers communicate with each other (finish operations for faulty clients) [Martin, Alvisi & Dahlin, DISC’02] – MWMR atomic register
• Reliable clients [Attiya & Bar-Or, SRDS’03]Missing: optimal-resilience data-centric
asynchronous implementations resilient to process failures
Optimal Resilience: The Challenge
(v, 1)
The writer The reader
(v0,0) (v0,0) (v0,0)
delayed
n=4, t=1
Optimal Resilience: The Challenge
ack
The writer The reader
(v,1) (v,1) (v0,0)
Optimal Resilience: The Challenge
The writer The reader
(v,1) (v,1) (v0,0)
delayed
Optimal Resilience: The Challenge
The writer The reader
(v,1) (v,1) (v0,0)
(v0,0)(v0,0)
(v,1) ?
Cannot return v0
Optimal Resilience: The Challenge
The writer The reader
(v0,0)
(v0,0)(v0,0)
(v,1) ?
Cannot return v1
(v0,0) (v0,0)
No write happened
Reliable Writer Solution: Wait
The writer The reader
(v,1) (v,1) (v0,0)
(v0,0)(v0,0)(v,1)
(v,1)(v,1)
Faulty Writer Scenario
The writer The reader
Cannot wait!
(v, 1)
(v, 1)
(v0,0)(v0,0)(v,1)
?
What Does This Mean?
Is a solution with n=3t+1 impossible?
Lower Bound for Optimal Resilience
• The reader cannot return any value, and cannot wait for more, so what can it do?– Invoke more rounds!
• Will this help? – No! Exactly the same thing can happen in every
round.
• Conclusion?
Write Lower Bound
For 1W1R binary safe register(weakest meaningful object type)
construction from any base object typeif n ≤ 4t and processes can crashemulating Write in one round is impossible
• To emulate Write, there must be at least one base object on which two operations are invoked
Two Rounds Save the Day!
(v, 1)
The writer The reader
pw=v0,0w=v0,0
pw=v0,0w=v0,0
pw=v0,0w=v0,0
pw=v,1w=v0,0
pw=v,1w=v0,0
pw=v0,0w=v0,0
pw=v,1w=v,1
pw=v,1w=v,1
pw=v0,0w=v0,0
Writer Fails During Pre-Write
The writer The reader
pw=v,1w=v0,0
pw=v0,0w=v0,0
pw=v0,0w=v0,0
Can return v0
(v, 1)
Writer Fails During Write
The writer The reader
pw=v,1w=v,1
pw=v,1w=v0,0
pw=v0,0w=v0,0
Can wait to hear more v,1
(v, 1)
Write Never Happened
The writer The reader
pw=v0,0w=v0,0
pw=v0,0w=v0,0
pw=v0,0w=v0,0
pw=v,1w=v,1
Can wait to hear more without v,1
Can Read Always Complete in One Round?
Overlapping Write Scenario
The writer The reader
pw=v,1w=v0,0
pw=v0,0w=v0,0
pw=v0,0w=v0,0
(v, 1)
pw=v,1w=v0,0
pw=v,1w=v0,0
pw=v,1w=v,1
(v0,0)(v0,0)(v,1)
Cannot wait
Solution?
• More read rounds• By the next round, all objects responding to the
Read will have seen the first phase of the Write– By causality
• But what if there are additional overlapping writes?– Safe register implementation can deduce in the next
round (in general, in min(f+2,t+1) rounds) that there is an overlapping write; returns arbitrary value
Determining Value to Return
• Read values (from w field) are candidates to return• If there are 2t+1 responses without v (in either pw or w)
then v is no longer a candidate – Either was not written before Read, or some Write overlaps read
• If the highest timestamped candidate occurs t+1 times, it can be returned
• If the candidates set is empty, return any value– There must be an overlapping Write
• Each round, wait for at least one more response than in previous rounds
• After min(t+1,f+2) rounds, it will be possible to return
Summary: Wait-Free Safe Register Construction for n>3t
• Invokes a bounded number of Read rounds– Optimal round complexity
• Constant storage • Unbounded timestamps
• Read takes two rounds (after synchronization) in (eventually) synchronous runs
Summary: Lower Bounds for n>3t and Fault-Prone Clients
• 1W1R binary safe register construction:– WRITE: 2 communication rounds are necessary
irrespective of base object types• Termination: obstruction freedom
– READ: min(t+1,f+2) communication rounds are necessary irrespective of base object types if readers do not modify base objects
• Termination: Every write terminates, read terminates if eventually runs in isolation (Finite-Write Termination)
A Note on Safe Registers
• Wait-free safe registers are too weak to be directly useful by applications– They are good for lower bounds
• Bounded constructions of wait-free atomic registers from safe registers known, but…Complicated
Add communication rounds and memory
Too costly to deploy in a distributed setting
Registers with Stronger Semantics
• Can we come up with an efficient, direct construction of a regular/atomic register?
• Yes, if we are ready to compromise wait freedom
Termination Conditions
• Wait-Freedom:– Every operation must complete within a finite number
of steps regardless of steps of other processes
• Lock-Freedom:– If there are concurrent operations, at least one of them
must complete
• Obstruction- Freedom:– If one process runs alone (without interference) for
sufficiently long, it must complete its operation
Notes
• Wait-freedom often hard to achieve
• Lock-freedom and obstruction-freedom are achieved by simpler algorithms
What is the Right Condition?
• Ultimately, we want to solve consensus– For state machine replication– Want it to be wait-free
• Alas, impossible in asynchronous systems• We need a leader-oracle in order to solve
consensus in asynchronous shared memory– ensures that at most one process writes
(proposes values), while others only read
Reminder: Disk Paxos
forever() if leader (by leader oracle) then
b ← choose new unique rankwrite b,v,n, to xiread all xj’sif some xj.bal > b then continueif all read xj.val’s = then
v ← my initial valueelse v ← read val with highest numn ← bwrite b,v,n, to xiread all xj’sif some xi.bal >b then continuewrite b,v,n,v to xireturn v
else /* non-leader */read xld where ld is leaderif xld .dec ≠ bot then return xld .dec
Observation: Only leader writes;others just read.Leader stops writing upon deciding
What is the Right Condition? (Cont’d)
• Will Lock-Freedom help? – No, because as long as the leader does not
succeed to write, no decision is possible
• Is Wait-Freedom really needed?
Finite-Writes Termination• Finite-Writes Termination (FW-termination)
– Every write (by a correct process) eventually returns
– Every read (by a correct process) eventually returns unless infinitely many writes are invoked
• Wait-free consensus with an leader oracle is solvable with FW-terminating regular 1WMR registers
Finite-Writes Termination
obstruction - free traces
lock-free traces
FW-terminatingtraces
wait-free
traces
Consensus w/ FW-Terminating Registers
• We observe that existing wait-free shared-memory failure-detector-based () consensus algorithms* work with FW-Terminating registers– Lo and Hadzilacos’96– Gafni and Lamport’00
*with small modifications
Back to the Register Emulation
• To implement a regular FW-Termination register, Read keeps invoking rounds until the highest timestamped value appears in t+1 responses
• When overlapping Writes stop occurring, Read eventually returns– Number of rounds unbounded by construction
The Complete System
Conclusions
• Optimally resilient R/W register constructions out of Byzantine-fault-prone base objects
• Lower bounds – 2 rounds for write; min(t+1,f+2) rounds
• Matching wait-free safe construction• Simple and direct construction of FW-terminating
regular registers– Efficient in synchronous runs
• Wait-free -based Consensus is implementable with FW-terminating regular registers