computing with byzantine shared memory

Computing with Byzantine Shared Memory

Topics in Reliable Distributed Systems

Fall 2004-2005, Idit Keidar

Sources

• Byzantine Disk Paxos: Optimal Resilience with Byzantine Shared Memory– Abraham, Chockler, Keidar, Malkhi, PODC04

• Optimal Resilience Wait-Free Storage from Byzantine Components: Inherent Costs and Solutions– Chockler, Keidar, Malkhi, FuDiCo II, June 2004.

• Some slides stolen from Chockler and Malkhi

Internet Information Services

InternetInternet


webweb

Network storage[Fleet (Hebrew U and Lucent) Agile (GTech), Coca (Cornell),

SBQL (UTexas), others]


webweb

Peer-to-peer storage[Oceanstore, etc.]

Storage Area Network (SAN)

Large-Scale Deployment

• Clients come and go– cannot be expected to be around – hence, direct communication infeasible

• Communication among storage nodes is also infeasible– In SAN impossible– Thin (scalable) servers, more logic at clients

• Some of the storage nodes can be compromised by a malicious adversary

Fault/Security Scenarios

• Client crash, unresponsiveness

• Access control for clients– access restricted to data they own– if bypassed, what can we do?

• A malicious client can mess up the data anyway

• Servers entrusted with others’ data– compromised servers a bigger problem – Byzantine faults possible

Assumptions

• Any number of client failures (crash)– no Byzantine faults (assume access control)

• Threshold (t-out-of-n) of storage components (per object) can be faulty– Byzantine, unresponsive: NR-Arbitrary Faults

• Monitoring/reconfiguration service– Each object has enough healthy replicas

• Naming service: e.g., DHT

Example: Internet-Wide StoreNS

Client:

ha=lookup(a); read(ha); write(ha); …

X

Y

Z

A

ha=

{X,Y

,Z,A

}

Data-Centric ReplicationNo server-to-server comm

NR-arbitrary failures

No client-to-client commBounded, unknown number of clients

Some System Examples

• Fleet, HUJI and Lucent • Rosebud, MIT• Agile, GTech • Coca, Cornell • SBQL, UTexas • OceanStore, Berkley• Pasis, CMU• others

Our Focus: Foundations

• Wanted: generic services useful for applications – Reliable R/W registers– Consensus

• Which register semantics can be supported?– Atomic, regular, safe– Termination conditions

• At which cost?– Failure resilience– Communication rounds– Memory consumption

• By understanding tradeoffs, system designers will be able to intelligently decide what to choose

Formal Model

• Asynchronous shared memory– servers=shared objects, clients=processes

• Shared objects may experience NR-arbitrary failures [JCT98]– faulty object can respond with arbitrary value

or fail to respond

• Processes can fail by crashing

Previous Work I: Wait-Free Constructions

• Safe register: A read that does not overlap a write returns the last register’s value – Malkhi & Reiter 98: n > 4t

• One round read/write, unbounded timestamps

– Jayanti, Chandra & Toueg 98: n > 5t• One round read/write, no timestamps

• Self-construction: implement safe register from collection of fault-prone safe registers

Byzantine Quorum Systems: Example [Malkhi and Reiter 98]

• At most one server can be penetrated

x = 7, t = 1

x = 7

x = 0t = 0

x = 2t = 5

x = 7t = 1

x = 7t = 1

Byzantine Quorum Systems: Example [Malkhi and Reiter 98]

x = 7, t = 1

x = 7

x = 0t = 0

x = 0t = 0

x = 7t = 1

x = 7t = 1

• Why timestamps?

Previous Work II: Optimal Failure Resilience (n>3t)

• Synchronous system [Bazzi, DC’00] – Safe register

• Servers communicate with each other (finish operations for faulty clients) [Martin, Alvisi & Dahlin, DISC’02] – MWMR atomic register

• Reliable clients [Attiya & Bar-Or, SRDS’03]Missing: optimal-resilience data-centric

asynchronous implementations resilient to process failures

Optimal Resilience: The Challenge

(v, 1)

The writer The reader

(v0,0) (v0,0) (v0,0)

delayed

n=4, t=1


ack


(v,1) (v,1) (v0,0)



(v,1) (v,1) (v0,0)

delayed



(v,1) (v,1) (v0,0)

(v0,0)(v0,0)

(v,1) ?

Cannot return v0



(v0,0)

(v0,0)(v0,0)

(v,1) ?

Cannot return v1

(v0,0) (v0,0)

No write happened

Reliable Writer Solution: Wait


(v,1) (v,1) (v0,0)

(v0,0)(v0,0)(v,1)

(v,1)(v,1)

Faulty Writer Scenario


Cannot wait!

(v, 1)

(v, 1)

(v0,0)(v0,0)(v,1)

?

What Does This Mean?

Is a solution with n=3t+1 impossible?

Lower Bound for Optimal Resilience

• The reader cannot return any value, and cannot wait for more, so what can it do?– Invoke more rounds!

• Will this help? – No! Exactly the same thing can happen in every

round.

• Conclusion?

Write Lower Bound

For 1W1R binary safe register(weakest meaningful object type)

construction from any base object typeif n ≤ 4t and processes can crashemulating Write in one round is impossible

• To emulate Write, there must be at least one base object on which two operations are invoked

Two Rounds Save the Day!

(v, 1)


pw=v0,0w=v0,0

pw=v0,0w=v0,0

pw=v0,0w=v0,0

pw=v,1w=v0,0

pw=v,1w=v0,0

pw=v0,0w=v0,0

pw=v,1w=v,1

pw=v,1w=v,1

pw=v0,0w=v0,0

Writer Fails During Pre-Write


pw=v,1w=v0,0

pw=v0,0w=v0,0

pw=v0,0w=v0,0

Can return v0

(v, 1)

Writer Fails During Write


pw=v,1w=v,1

pw=v,1w=v0,0

pw=v0,0w=v0,0

Can wait to hear more v,1

(v, 1)

Write Never Happened


pw=v0,0w=v0,0

pw=v0,0w=v0,0

pw=v0,0w=v0,0

pw=v,1w=v,1

Can wait to hear more without v,1

Can Read Always Complete in One Round?

Overlapping Write Scenario


pw=v,1w=v0,0

pw=v0,0w=v0,0

pw=v0,0w=v0,0

(v, 1)

pw=v,1w=v0,0

pw=v,1w=v0,0

pw=v,1w=v,1

(v0,0)(v0,0)(v,1)

Cannot wait

Solution?

• More read rounds• By the next round, all objects responding to the

Read will have seen the first phase of the Write– By causality

• But what if there are additional overlapping writes?– Safe register implementation can deduce in the next

round (in general, in min(f+2,t+1) rounds) that there is an overlapping write; returns arbitrary value

Determining Value to Return

• Read values (from w field) are candidates to return• If there are 2t+1 responses without v (in either pw or w)

then v is no longer a candidate – Either was not written before Read, or some Write overlaps read

• If the highest timestamped candidate occurs t+1 times, it can be returned

• If the candidates set is empty, return any value– There must be an overlapping Write

• Each round, wait for at least one more response than in previous rounds

• After min(t+1,f+2) rounds, it will be possible to return

Summary: Wait-Free Safe Register Construction for n>3t

• Invokes a bounded number of Read rounds– Optimal round complexity

• Constant storage • Unbounded timestamps

• Read takes two rounds (after synchronization) in (eventually) synchronous runs

Summary: Lower Bounds for n>3t and Fault-Prone Clients

• 1W1R binary safe register construction:– WRITE: 2 communication rounds are necessary

irrespective of base object types• Termination: obstruction freedom

– READ: min(t+1,f+2) communication rounds are necessary irrespective of base object types if readers do not modify base objects

• Termination: Every write terminates, read terminates if eventually runs in isolation (Finite-Write Termination)

A Note on Safe Registers

• Wait-free safe registers are too weak to be directly useful by applications– They are good for lower bounds

• Bounded constructions of wait-free atomic registers from safe registers known, but…Complicated

Add communication rounds and memory

Too costly to deploy in a distributed setting

Registers with Stronger Semantics

• Can we come up with an efficient, direct construction of a regular/atomic register?

• Yes, if we are ready to compromise wait freedom

Termination Conditions

• Wait-Freedom:– Every operation must complete within a finite number

of steps regardless of steps of other processes

• Lock-Freedom:– If there are concurrent operations, at least one of them

must complete

• Obstruction- Freedom:– If one process runs alone (without interference) for

sufficiently long, it must complete its operation

Notes

• Wait-freedom often hard to achieve

• Lock-freedom and obstruction-freedom are achieved by simpler algorithms

What is the Right Condition?

• Ultimately, we want to solve consensus– For state machine replication– Want it to be wait-free

• Alas, impossible in asynchronous systems• We need a leader-oracle in order to solve

consensus in asynchronous shared memory– ensures that at most one process writes

(proposes values), while others only read

Reminder: Disk Paxos

forever() if leader (by leader oracle) then

b ← choose new unique rankwrite b,v,n, to xiread all xj’sif some xj.bal > b then continueif all read xj.val’s = then

v ← my initial valueelse v ← read val with highest numn ← bwrite b,v,n, to xiread all xj’sif some xi.bal >b then continuewrite b,v,n,v to xireturn v

else /* non-leader */read xld where ld is leaderif xld .dec ≠ bot then return xld .dec

Observation: Only leader writes;others just read.Leader stops writing upon deciding

What is the Right Condition? (Cont’d)

• Will Lock-Freedom help? – No, because as long as the leader does not

succeed to write, no decision is possible

• Is Wait-Freedom really needed?

Finite-Writes Termination• Finite-Writes Termination (FW-termination)

– Every write (by a correct process) eventually returns

– Every read (by a correct process) eventually returns unless infinitely many writes are invoked

• Wait-free consensus with an leader oracle is solvable with FW-terminating regular 1WMR registers

Finite-Writes Termination

obstruction - free traces

lock-free traces

FW-terminatingtraces

wait-free

traces

Consensus w/ FW-Terminating Registers

• We observe that existing wait-free shared-memory failure-detector-based () consensus algorithms* work with FW-Terminating registers– Lo and Hadzilacos’96– Gafni and Lamport’00

*with small modifications

Back to the Register Emulation

• To implement a regular FW-Termination register, Read keeps invoking rounds until the highest timestamped value appears in t+1 responses

• When overlapping Writes stop occurring, Read eventually returns– Number of rounds unbounded by construction

The Complete System

Conclusions

• Optimally resilient R/W register constructions out of Byzantine-fault-prone base objects

• Lower bounds – 2 rounds for write; min(t+1,f+2) rounds

• Matching wait-free safe construction• Simple and direct construction of FW-terminating

regular registers– Efficient in synchronous runs

• Wait-free -based Consensus is implementable with FW-terminating regular registers

computing with byzantine shared memory

Documents