
Distributed Algorithms: Practical Byzantine Fault Tolerance

Alberto Montresor

University of Trento, Italy

2017/01/06

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

References

M. Abd-El-Malek, G. Ganger, G. Goodson, M. Reiter, and J. Wylie. Fault-scalable Byzantine fault-tolerant services. In Proc. of the ACM Symposium on Operating Systems Principles, SOSP’05, Oct. 2005.

M. Castro and B. Liskov. Practical Byzantine fault tolerance. In Proc. of the 3rd Symposium on Operating systems design and implementation, OSDI’99, pages 173–186, New Orleans, Louisiana, USA, 1999. USENIX Association. http://www.disi.unitn.it/~montreso/ds/papers/PbftOsdi.pdf

M. Castro and B. Liskov. Practical Byzantine fault tolerance and proactive recovery. ACM Trans. Comput. Syst., 20:398–461, Nov. 2002. http://www.disi.unitn.it/~montreso/ds/papers/PbftTocs.pdf

A. Clement, M. Kapritsos, S. Lee, Y. Wang, L. Alvisi, M. Dahlin, and T. Riche. UpRight cluster services. In Proc. of the ACM Symposium on Operating Systems Principles, SOSP’09, Oct. 2009. http://www.disi.unitn.it/~montreso/ds/papers/UpRight.pdf

A. Clement, E. Wong, L. Alvisi, M. Dahlin, and M. Marchetti. Making Byzantine fault tolerant systems tolerate Byzantine faults. In Proc. of the 6th USENIX symposium on Networked systems design and implementation, NSDI’09, pages 153–168. USENIX Association, 2009. http://www.disi.unitn.it/~montreso/ds/papers/Aardvark.pdf

J. Cowling, D. Myers, B. Liskov, R. Rodrigues, and L. Shrira. HQ replication: A hybrid quorum protocol for Byzantine fault tolerance. In Proc. of the Symposium on Operating systems design and implementation, OSDI’06, Nov. 2006.

S. Gaertner, M. Bourennane, C. Kurtsiefer, A. Cabello, and H. Weinfurter. Experimental demonstration of a quantum protocol for Byzantine agreement and liar detection. Physical Review Letters, 100(7), Feb. 2008.

R. Kotla, A. Clement, E. Wong, L. Alvisi, and M. Dahlin. Zyzzyva: Speculative Byzantine fault tolerance. In Proc. of the ACM Symposium on Operating Systems Principles (SOSP’07), Stevenson, WA, Oct. 2007. ACM. http://www.disi.unitn.it/~montreso/ds/papers/Zyzzyva.pdf

L. Lamport, R. Shostak, and M. Pease. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems (TOPLAS), 4(3):382–401, 1982. http://www.disi.unitn.it/~montreso/ds/papers/ByzantineGenerals.pdf

http://creativecommons.org/licenses/by-sa/4.0/




Contents

1 Introduction
2 Byzantine generals
3 Practical Byzantine Fault Tolerance
4 Beyond PBFT
    Overview
5 Zyzzyva
    Introduction
    Three cases
    The case of the missing phase
    View changes
6 Aardvark
7 UpRight

Introduction

Motivation

Processes may exhibit arbitrary (Byzantine) behavior
- Malicious attacks
  - They lie
  - They collude
- Software errors
  - Arbitrary states, messages

Examples

Amazon outage (2008), “Root cause was a single bit flip in internal state messages” [1]

Shuttle Mission STS-124 (2008), 3-1 disagreement on sensors during fuel loading (on Earth!) [2]

[1] http://status.aws.amazon.com/s3-20080720.html
[2] https://c3.nasa.gov/dashlink/resources/624/

Alberto Montresor (UniTN) DS - BFT 2017/01/06 1 / 80


Introduction

History

State of the art at the end of the 90s
- Theoretically feasible algorithms to tolerate Byzantine failures, but inefficient in practice
- Assume synchrony: known bounds for message delays and processing speed
- Most importantly: the synchrony assumption is needed for correctness. What about DoS?

Bibliography

L. Lamport, R. Shostak, and M. Pease. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems (TOPLAS), 4(3):382–401, 1982. http://www.disi.unitn.it/~montreso/ds/papers/ByzantineGenerals.pdf



Byzantine generals

[Figure: generals exchanging conflicting orders (“Attack!”, “Wait…”, “Attack! No, wait! Surrender!”). From cs4410 fall 08 lecture.]

Specification

A commanding general must send an order to his n − 1 lieutenant generals such that:
- IC1: All loyal lieutenants obey the same order
- IC2: If the commanding general is loyal, then every loyal lieutenant obeys the order he sends

Assumptions (“Oral” messages):
- Every message that is sent is received correctly
- The receiver of a message knows who sent it
- The absence of a message can be detected


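Impossibility results

Under the “Oral” messages assumption, no solution with three generals can handle even a single traitor

[Figure, two panels, each with the commanding general and lieutenants 1 and 2. First panel: the commanding general sends “Attack!” to both lieutenants, and one lieutenant tells the other “He said ‘Retreat’!”. Second panel: the commanding general sends “Retreat!” to one lieutenant and “Attack!” to the other, and again one lieutenant tells the other “He said ‘Retreat’!”.]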


“Oral Message” algorithm OM(m)

Algorithm OM(0)
1. The commander sends its value to every lieutenant
2. Each lieutenant uses the value he received from the commander, or uses retreat if he received no value

Algorithm OM(m), m > 0
1. The commander sends its value to every lieutenant
2. ∀i, let v_i be the value lieutenant i receives from the commander, or retreat if it has received no value. Lieutenant i acts as the commander of algorithm OM(m − 1) to send the value v_i to each of the n − 2 other lieutenants
3. ∀j ≠ i, let v_j be the value received by i from j in Step 2 of algorithm OM(m − 1), or retreat if no value. Lieutenant i uses the value majority(v_1, ..., v_{n−1}) (deterministic function)

The recursive
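The recursion in OM(m) can be made concrete with a small simulation. The following is a minimal sketch, not a faithful network implementation: the traitor model (`send` flips the value for odd-numbered receivers) and all function names are assumptions chosen for illustration, and ties in `majority` fall back to retreat.

```python
from collections import Counter

RETREAT = "retreat"
ATTACK = "attack"


def majority(values):
    """Deterministic majority; ties (and empty input) fall back to RETREAT."""
    if not values:
        return RETREAT
    ranked = Counter(values).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return RETREAT
    return ranked[0][0]


def send(sender, receiver, value, traitors):
    """Hypothetical traitor model: a traitor flips the value it sends to
    odd-numbered receivers; loyal senders forward the value unchanged."""
    if sender in traitors and receiver % 2 == 1:
        return ATTACK if value == RETREAT else RETREAT
    return value


def om(m, commander, lieutenants, value, traitors):
    """Run OM(m); return {lieutenant: value it uses}."""
    # Step 1: the commander sends its value to every lieutenant.
    received = {i: send(commander, i, value, traitors) for i in lieutenants}
    if m == 0:
        return received
    # Step 2: each lieutenant j relays what it received by acting as the
    # commander of OM(m - 1) towards the other lieutenants.
    relayed = {}
    for j in lieutenants:
        others = [i for i in lieutenants if i != j]
        relayed[j] = om(m - 1, j, others, received[j], traitors)
    # Step 3: each lieutenant takes the majority of its own value and the
    # values relayed by the others.
    decisions = {}
    for i in lieutenants:
        values = [received[i]] + [relayed[j][i] for j in lieutenants if j != i]
        decisions[i] = majority(values)
    return decisions
```

With four generals and one traitorous lieutenant, `om(1, 0, [1, 2, 3], ATTACK, {3})` lets the loyal lieutenants 1 and 2 both use the commander's order. With only three generals, `om(1, 0, [1, 2], ATTACK, {2})` leaves lieutenant 1 with a tie, which matches the impossibility result for n = 3.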


“Oral Message” Algorithm Example – OM(1)

[Figure: OM(1) with commander C and lieutenants L1, L2, L3; the arrows between them are labeled with the exchanged values A and R.]


Oral messages

Theorem

For any m, Algorithm OM(m) satisfies conditions IC1 and IC2 if there are more than 3m generals and at most m traitors

Problems:
- message paths of length up to m + 1 (expensive)
- absence of messages must be detected via time-out (vulnerable to DoS)



An attacker may compromise the safety of a service by delaying non-faulty nodes or the communication between them until they are tagged as faulty and excluded from the replica group. Such a denial-of-service attack is generally easier than gaining control over a non-faulty node.

Practical Byzantine Fault Tolerance

A Byzantine “renaissance”

Bibliography

M. Castro and B. Liskov. Practical Byzantine fault tolerance and proactive recovery. ACM Trans. Comput. Syst., 20:398–461, Nov. 2002. http://www.disi.unitn.it/~montreso/ds/papers/PbftTocs.pdf

Contributions

First state machine replication protocol that survives Byzantine faults in asynchronous networks

Live under weak Byzantine assumptions: Byzantine Paxos!

Implementation of a Byzantine-fault-tolerant distributed file system

Experiments measuring cost of replication technique




Assumptions

System model
- Asynchronous distributed system with N processes
- Unreliable channels

Unbreakable cryptography
- Message m is signed by its sender i, and we write 〈m〉σ(i), through:
  - Public/private key pairs
  - Message authentication codes (MAC)
- A digest d(m) of message m is produced through collision-resistant hash functions
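The two cheap primitives above, digests and MACs, can be sketched with the Python standard library. This is only an illustration under stated assumptions: SHA-256 and HMAC-SHA256 are stand-ins chosen here (the slides do not prescribe particular algorithms), and the function names are hypothetical.

```python
import hashlib
import hmac


def digest(message: bytes) -> str:
    """d(m): digest from a collision-resistant hash (SHA-256 assumed here)."""
    return hashlib.sha256(message).hexdigest()


def mac(shared_key: bytes, message: bytes) -> bytes:
    """MAC computed with a secret key shared by sender and receiver."""
    return hmac.new(shared_key, message, hashlib.sha256).digest()


def verify_mac(shared_key: bytes, message: bytes, tag: bytes) -> bool:
    """Recompute the MAC and compare in constant time."""
    return hmac.compare_digest(mac(shared_key, message), tag)
```

A receiver that shares `shared_key` with the sender accepts a message only if `verify_mac` returns True; unlike a public-key signature, the tag cannot convince a third party that does not hold the key.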



MACs (message authentication codes) are based on secret keys (client-server, server-server)


Assumptions

Failure model
- Up to f Byzantine servers
- N > 3f total servers
- (Potentially Byzantine clients)

Independent failures
- Different implementations of the service
- Different operating systems
- Different root passwords, different administrators



Specification

State machine replication
- Replicated service with a state and deterministic operations operating on it
- Clients issue a request and block waiting for reply

Safety
- The system satisfies linearizability, provided that N ≥ 3f + 1
- Regardless of “faulty clients”...
  - all operations performed by faulty clients are observed in a consistent way by non-faulty clients
- The algorithm does not rely on synchrony to provide safety...

Liveness
- It relies on synchrony to provide liveness
- Assumes delay(t) does not grow faster than t indefinitely
- Weak assumption: it holds if network faults are eventually repaired
- Circumvents the FLP impossibility result



Optimality

Theorem

To tolerate up to f malicious nodes, N must be at least 3f + 1

Proof.

It must be possible to proceed after communicating with N − f replicas, because the f faulty replicas may not respond

But the f replicas not responding may be just slow, so f of those that responded might be faulty

The correct replicas who responded (at least N − 2f) must outnumber the faulty replicas, so

N − 2f > f ⇒ N > 3f



Optimality

So N > 3f is needed to ensure that correct replicas outnumber faulty ones in the reply set

N = 3f + 1 is enough; more replicas are useless
- more and larger messages
- without improving resiliency
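The counting argument above is easy to check numerically. A minimal sketch (the function names are hypothetical) that encodes the bound and the "correct responders outnumber faulty ones" condition from the proof:

```python
def min_replicas(f: int) -> int:
    """Smallest N that tolerates f Byzantine replicas: N = 3f + 1."""
    return 3 * f + 1


def max_faults(n: int) -> int:
    """Largest f tolerated by N replicas, i.e. the largest f with N > 3f."""
    return (n - 1) // 3


def correct_outnumber_faulty(n: int, f: int) -> bool:
    """After hearing from N - f replicas, up to f of the responders may be
    faulty; the remaining N - 2f correct ones must outnumber them."""
    return (n - 2 * f) > f
```

Note that `max_faults(4) == max_faults(6) == 1`: going from 4 to 6 replicas adds messages without tolerating an additional fault, which is why more than 3f + 1 replicas do not improve resiliency until the next multiple is reached.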



Processes and views

Replica IDs: 0 . . . N − 1

Replicas move through a sequence of configurations called views

During view v:
- Primary replica is i: i = v mod N
- The others are backups

View changes are carried out when the primary appears to have failed



The algorithm

To invoke an operation, the client sends a request to the primary

The primary multicasts the request to the backups

Quorums are employed to guarantee ordering on operations

When an order has been agreed, replicas execute the request and send a reply to the client

When the client receives at least f + 1 identical replies, it is satisfied

[Figure: message flow among the client, the primary, and backups 1 to 3.]



Problems

The primary could be faulty!
- could ignore commands; assign same sequence number to different requests; skip sequence numbers; etc.
- backups monitor primary’s behavior and trigger view changes to replace faulty primary

Backups could be faulty!
- could incorrectly store commands forwarded by a correct primary
- use dissemination Byzantine quorum systems

Faulty replicas could incorrectly respond to the client!
- Client waits for f + 1 matching replies before accepting response



The general idea

Algorithm steps are justified by certificates
- Sets (quorums) of signed messages from distinct replicas proving that a property of interest holds

With quorums of size at least 2f + 1
- Any two quorums intersect in at least one correct replica
- There is always one quorum that contains only non-faulty replicas
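Both quorum properties can be verified by brute force for small f. The sketch below assumes N = 3f + 1 replicas and, without loss of generality, labels replicas 0 .. f − 1 as the faulty ones; the function name is hypothetical.

```python
from itertools import combinations


def quorum_properties(f: int):
    """Check the two quorum properties for N = 3f + 1 and quorum size 2f + 1.

    Returns (intersection_ok, clean_quorum_exists).
    """
    n = 3 * f + 1
    size = 2 * f + 1
    faulty = set(range(f))  # assumption: replicas 0..f-1 are the faulty ones
    quorums = [set(q) for q in combinations(range(n), size)]
    # Any two quorums share at least one replica that is not faulty.
    intersection_ok = all((a & b) - faulty for a, b in combinations(quorums, 2))
    # At least one quorum contains no faulty replica at all.
    clean_quorum_exists = any(not (q & faulty) for q in quorums)
    return intersection_ok, clean_quorum_exists
```

The underlying arithmetic is that two quorums of size 2f + 1 out of 3f + 1 replicas overlap in at least (2f + 1) + (2f + 1) − (3f + 1) = f + 1 replicas, so at least one member of the overlap is correct.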

[Figure: four servers, each showing its state, and clients issuing “write A”. One of the four write messages is lost (X), so servers 1 to 3 record A while server 4 does not. A second panel repeats the scenario with “write B”, again with one lost message (X).]



Protocol schema

Normal operation
- How the protocol works in the absence of failures
- hopefully, the common case

View changes
- How to depose a faulty primary and elect a new one

Garbage collection
- How to reclaim the storage used to keep certificates

Recovery
- How to make a faulty replica behave correctly again (not here)



State

The internal state of each replica includes:
- the state of the actual service
- a message log containing all the messages the replica has accepted
- an integer denoting the replica’s current view



Client request

[Figure: message timeline for the primary and backups 1 to 3; phase shown: Request.]

〈request, o, t, c〉σ(c)
- o: state machine operation
- t: timestamp (used to ensure exactly-once semantics)
- c: client id
- σ(c): client signature



Pre-prepare phase

[Figure: message timeline for the primary and backups 1 to 3; phases shown: Request, Pre-prepare.]

〈〈pre-prepare, v, n, d(m)〉σ(p), m〉
- v: current view
- n: sequence number
- d(m): digest of client message
- σ(p): primary signature
- m: client message



Requests are not included in pre-prepare messages to keep them small. This is important because pre-prepare messages are used as a proof that the request was assigned sequence number n in view v in view changes.


Pre-prepare phase

〈〈pre-prepare, v, n, d(m)〉σ(p), m〉

Correct replica i accepts pre-prepare if:
- the pre-prepare message is well-formed
- the current view of i is v
- i has not accepted another pre-prepare for v, n with a different digest
- n is between two water-marks L and H (to avoid sequence number exhaustion caused by faulty primaries)

Each accepted pre-prepare message is stored in the accepting replica’s message log (including the primary’s)

Non-accepted pre-prepare messages are just discarded
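The acceptance test for pre-prepare messages can be sketched as a predicate over the replica's local state. This is an illustration under assumptions: the class and field names are hypothetical, the well-formedness and signature checks are assumed to have been done by the caller, and the water-mark bounds are taken as L < n ≤ H.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PrePrepare:
    view: int
    seq: int
    digest: str  # d(m)


@dataclass
class Replica:
    view: int
    low: int   # water-mark L
    high: int  # water-mark H
    # (view, seq) -> digest of the accepted pre-prepare
    pre_prepares: dict = field(default_factory=dict)

    def accept_pre_prepare(self, msg: PrePrepare) -> bool:
        """Apply the acceptance rules; log the message if it is accepted."""
        if msg.view != self.view:
            return False  # not the replica's current view
        if not (self.low < msg.seq <= self.high):
            return False  # outside the water-marks
        previous = self.pre_prepares.get((msg.view, msg.seq))
        if previous is not None and previous != msg.digest:
            return False  # conflicting pre-prepare for the same (v, n)
        self.pre_prepares[(msg.view, msg.seq)] = msg.digest
        return True
```

A faulty primary that assigns the same sequence number to two different requests is caught by the third check: the second pre-prepare is discarded by every correct replica that accepted the first.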



Prepare phase

[Figure: message timeline for the primary and backups 1 to 3; phases shown: Request, Pre-prepare, Prepare.]

〈prepare, v, n, d(m)〉σ(i)

Accepted by correct replica j if:
- the prepare message is well-formed
- current view of j is v
- n is between two water-marks L and H



Replicas that send prepare accept the sequence number n for m in view v

Each accepted prepare message is stored in the accepting replica’s message log



Prepare certificate (P-certificate)

Replica i produces a prepare certificate prepared(m, v, n, i) iff its log holds:
- The request m
- A pre-prepare for m in view v with sequence number n
- 2f prepare messages from different backups that match the pre-prepare

prepared(m, v, n, i) means that a quorum of (2f + 1) replicas agrees with assigning sequence number n to m in view v

Theorem

There are no two non-faulty replicas i, j such that prepared(m, v, n, i) and prepared(m′, v, n, j), with m ≠ m′

Proof?
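The definition of prepared(m, v, n, i) translates directly into a predicate over the message log. The sketch below assumes a hypothetical log layout (a dict with `requests`, `pre_prepares` and `prepares` entries) and identifies a request by its digest; it is not the layout of any particular implementation.

```python
def prepared(log, digest, view, seq, f):
    """True iff the log holds a P-certificate for (digest, view, seq).

    Assumed log layout:
      log["requests"]     : set of digests of requests held by the replica
      log["pre_prepares"] : dict (view, seq) -> digest
      log["prepares"]     : set of (view, seq, digest, sender) tuples
    """
    has_request = digest in log["requests"]
    has_pre_prepare = log["pre_prepares"].get((view, seq)) == digest
    # Count distinct backups whose prepare matches the pre-prepare.
    senders = {s for (v, n, d, s) in log["prepares"]
               if (v, n, d) == (view, seq, digest)}
    return has_request and has_pre_prepare and len(senders) >= 2 * f
```

The 2f matching prepares plus the primary's pre-prepare give the quorum of 2f + 1 replicas mentioned above. Since two such quorums share at least one correct replica, and a correct replica never sends two conflicting prepares for the same (v, n), two different requests cannot both be prepared with the same view and sequence number.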



Commit phase

[Figure: message timeline for the primary and backups 1 to 3; phases shown: Request, Pre-prepare, Prepare, Commit.]

〈commit, v, n, d(m), i〉σ(i)

After having collected a P-certificate prepared(m, v, n, i), replica i sends a commit message

Accepted if:
- The commit message is well-formed
- Current view of i is v
- n is between two water-marks L and H



Commit certificate (C-Certificate)

Commit certificates ensure total order across views
- we guarantee that we can’t miss prepare certificates during a view change

A replica has a certificate committed(m, v, n, i) if:
- it had a P-certificate prepared(m, v, n, i)
- log contains 2f + 1 matching commits from different replicas (possibly including its own)

Replica executes a request after it gets a commit certificate for it, and has cleared all requests with smaller sequence numbers
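The commit certificate and the in-order execution rule can be sketched as two small helpers. The names are hypothetical, and the sketch assumes that sequence numbers are contiguous, so that a request is "cleared" once every smaller sequence number has been executed.

```python
def committed(has_p_certificate: bool, commit_senders, f: int) -> bool:
    """C-certificate: a P-certificate plus 2f + 1 matching commits from
    distinct replicas (commit_senders lists the senders of matching commits)."""
    return has_p_certificate and len(set(commit_senders)) >= 2 * f + 1


def executable(committed_seqs, last_executed: int):
    """Sequence numbers that can be executed now, in order: a request runs
    only after all requests with smaller sequence numbers have run."""
    ready = []
    nxt = last_executed + 1
    while nxt in committed_seqs:
        ready.append(nxt)
        nxt += 1
    return ready
```

For example, with committed sequence numbers {1, 2, 4} and nothing executed yet, only 1 and 2 can run; 4 has to wait until 3 is committed.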



Reply phase

[Figure: message timeline for the primary and backups 1 to 3; phases shown: Request, Pre-prepare, Prepare, Commit, Reply.]

〈reply, v, t, c, i, r〉σ(i)

r is the reply

Client waits for f + 1 replies with the same t, r

If the client does not receive replies soon enough, it broadcasts the request to all replicas
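The client-side rule (wait for f + 1 replies with the same t and r) can be sketched as follows. The tuple layout and function name are assumptions for illustration, and signature checking is left out.

```python
def accepted_result(replies, timestamp, f):
    """Return r once f + 1 distinct replicas sent the same r for this
    timestamp; return None if no result has enough support yet.

    replies: iterable of (replica_id, t, r) tuples with valid signatures.
    """
    supporters = {}
    for replica_id, t, r in replies:
        if t == timestamp:
            supporters.setdefault(r, set()).add(replica_id)
    for r, senders in supporters.items():
        if len(senders) >= f + 1:
            return r
    return None
```

Because at most f replicas are faulty, f + 1 matching replies from distinct replicas always include one from a correct replica, so the accepted result is the correct one. Counting distinct replica ids matters: a single faulty replica repeating its reply must not reach the threshold on its own.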



View change

An unsatisfied backup i mutinies:
- stops accepting messages (except view-change and new-view)
- multicasts 〈view-change, v + 1, P, i〉σ(i)
- P contains a P-certificate P_m for each request m (up to a given number, see garbage collection)

Mutiny succeeds if the new primary collects a new-view certificate V:
- a set containing 2f + 1 view-change messages
- indicating that 2f + 1 distinct replicas (including itself) support the change of leadership



View change

The “primary elect” p′ (replica (v + 1) mod N):

extracts from the new-view certificate V the highest sequence number h of any message for which V contains a P-certificate

creates a new pre-prepare message for any client message m with sequence number n ≤ h and adds it to the set O
- if there is a P-certificate for n, m in V: O ← O ∪ 〈pre-prepare, v + 1, n, d_m〉σ(p′)
- otherwise: O ← O ∪ 〈pre-prepare, v + 1, n, d_null〉σ(p′)

p′ multicasts 〈new-view, v + 1, V, O〉σ(p′)
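The construction of the set O by the primary elect can be sketched as below. Several simplifications are assumptions of this sketch, not part of the slides: the P-certificates found in V are already reduced to one digest per sequence number, `low` stands for the sequence number below which nothing needs to be re-proposed, `NULL_DIGEST` is a hypothetical marker for the null request, and signatures are omitted.

```python
NULL_DIGEST = "null"  # hypothetical marker for the null request digest


def build_new_view_pre_prepares(new_view, p_certificates, low=0):
    """Build O for the new view.

    p_certificates: dict seq -> digest, one entry per request for which V
    contains a P-certificate.
    Returns a list of ("pre-prepare", view, seq, digest) tuples.
    """
    if not p_certificates:
        return []
    h = max(p_certificates)  # highest sequence number with a P-certificate
    return [
        ("pre-prepare", new_view, n, p_certificates.get(n, NULL_DIGEST))
        for n in range(low + 1, h + 1)
    ]
```

Filling the gaps with null requests keeps the sequence contiguous, so replicas can execute in order in the new view without waiting for sequence numbers that were never prepared. Since the computation is deterministic given V, a backup can repeat it to verify the O it receives.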



View change

Backup accepts a 〈new-view, v + 1, V, O〉σ(p′) message for v + 1 if
- it is signed properly by p′
- V contains valid view-change messages for v + 1
- the correctness of O can be locally verified (repeating the primary’s computation)

Actions:
- Adds all entries in O to its log (so did p′!)
- Multicasts a prepare for each message in O
- Adds all prepares to the log and enters new view



Garbage collection

A correct replica keeps in its log the messages about request o until:
- o has been executed by a majority of correct replicas, and
- this fact can be proven during a view change

Truncate log with stable checkpoints
- Each replica i periodically (after processing k requests) checkpoints its state and multicasts 〈checkpoint, n, d, i〉
  - n: last executed request
  - d: state digest

A set S containing 2f + 1 equivalent checkpoint messages from distinct processes is a proof of the checkpoint’s correctness (stable checkpoint certificate)
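Detecting a stable checkpoint and truncating the log can be sketched as follows. The message layout `(n, d, i)` mirrors the checkpoint message above; the function names and the dict-based log are assumptions for illustration.

```python
def stable_checkpoint(checkpoint_msgs, f):
    """Return the highest (n, d) backed by 2f + 1 distinct replicas,
    or None if no checkpoint is stable yet.

    checkpoint_msgs: iterable of (n, d, i) tuples.
    """
    support = {}
    for n, d, i in checkpoint_msgs:
        support.setdefault((n, d), set()).add(i)
    stable = [key for key, senders in support.items()
              if len(senders) >= 2 * f + 1]
    return max(stable, default=None)


def truncate_log(log, stable_seq):
    """Drop every log entry with sequence number <= the stable checkpoint."""
    return {n: entry for n, entry in log.items() if n > stable_seq}
```

Once `stable_checkpoint` returns `(n, d)`, the 2f + 1 messages behind it form the certificate a replica can present during a view change, so entries up to n are no longer needed in the log.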



View Change, revisited

Message 〈view-change, v + 1, n, S, C, P, i〉σ(i)
- n: the sequence number of the last stable checkpoint
- S: the last stable checkpoint
- C: the checkpoint certificate (2f + 1 checkpoint messages)

Message 〈new-view, v + 1, n, V, O〉σ(p′)
- n: the sequence number of the last stable checkpoint
- V, O: contain only requests with sequence number larger than n



Optimizations

Reducing replies
- One replica designated to send reply to client
- Other replicas send digest of the reply

Lower latency for writes (4 messages)
- Replicas respond at Prepare phase (tentative execution)
- Client waits for 2f + 1 matching responses

Fast reads (one round trip)
- Client sends to all; they respond immediately
- Client waits for 2f + 1 matching responses



Optimizations: cryptography

Reducing overhead
- Public-key cryptography only for view changes
- MACs (message authentication codes) for all other messages

To give an idea (Pentium 200 MHz)
- Generating a 1024-bit RSA signature of an MD5 digest: 43 ms
- Generating a MAC of the same message: 10 µs



Application: Byzantine NFS server

generate them are refreshed very frequently.

There are no published performance numbers for SecureRing [16] but it would be slower than Rampart because its algorithm has more message delays and signature operations in the critical path.

7.3 Andrew Benchmark

The Andrew benchmark [15] emulates a software development workload. It has five phases: (1) creates subdirectories recursively; (2) copies a source tree; (3) examines the status of all the files in the tree without examining their data; (4) examines every byte of data in all the files; and (5) compiles and links the files.

We use the Andrew benchmark to compare BFS with two other file system configurations: NFS-std, which is the NFS V2 implementation in Digital Unix, and BFS-nr, which is identical to BFS but with no replication. BFS-nr ran two simple UDP relays on the client, and on the server it ran a thin veneer linked with a version of snfsd from which all the checkpoint management code was removed. This configuration does not write modified file system state to disk before replying to the client. Therefore, it does not implement NFS V2 protocol semantics, whereas both BFS and NFS-std do.

Out of the 18 operations in the NFS V2 protocol only getattr is read-only because the time-last-accessed attribute of files and directories is set by operations that would otherwise be read-only, e.g., read and lookup. The result is that our optimization for read-only operations can rarely be used. To show the impact of this optimization, we also ran the Andrew benchmark on a second version of BFS that modifies the lookup operation to be read-only. This modification violates strict Unix file system semantics but is unlikely to have adverse effects in practice.

For all configurations, the actual benchmark code ran at the client workstation using the standard NFS client implementation in the Digital Unix kernel with the same mount options. The most relevant of these options for the benchmark are: UDP transport, 4096-byte read and write buffers, allowing asynchronous client writes, and allowing attribute caching.

We report the mean of 10 runs of the benchmark for each configuration. The sample standard deviation for the total time to run the benchmark was always below 2.6% of the reported value but it was as high as 14% for the individual times of the first four phases. This high variance was also present in the NFS-std configuration. The estimated error for the reported mean was below 4.5% for the individual phases and 0.8% for the total.

Table 2 shows the results for BFS and BFS-nr. The comparison between BFS-strict and BFS-nr shows that the overhead of Byzantine fault tolerance for this service is low: BFS-strict takes only 26% more time to run the complete benchmark. The overhead is lower than what was observed for the micro-benchmarks because the client spends a significant fraction of the elapsed time computing between operations, i.e., between receiving the reply to an operation and issuing the next request, and operations at the server perform some computation. But the overhead is not uniform across the benchmark phases. The main reason for this is a variation in the amount of time the client spends computing between operations; the first two phases have a higher relative overhead because the client spends approximately 40% of the total time computing between operations, whereas it spends approximately 70% during the last three phases.

phase   BFS strict      BFS r/o lookup   BFS-nr
1        0.55 (57%)      0.47 (34%)       0.35
2        9.24 (82%)      7.91 (56%)       5.08
3        7.24 (18%)      6.45 (6%)        6.11
4        8.77 (18%)      7.87 (6%)        7.41
5       38.68 (20%)     38.38 (19%)      32.12
total   64.48 (26%)     61.07 (20%)      51.07

Table 2: Andrew benchmark: BFS vs BFS-nr. The times are in seconds.

The table shows that applying the read-only optimization to lookup improves the performance of BFS significantly and reduces the overhead relative to BFS-nr to 20%. This optimization has a significant impact in the first four phases because the time spent waiting for lookup operations to complete in BFS-strict is at least 20% of the elapsed time for these phases, whereas it is less than 5% of the elapsed time for the last phase.

phase   BFS strict      BFS r/o lookup   NFS-std
1        0.55 (-69%)     0.47 (-73%)      1.75
2        9.24 (-2%)      7.91 (-16%)      9.46
3        7.24 (35%)      6.45 (20%)       5.36
4        8.77 (32%)      7.87 (19%)       6.60
5       38.68 (-2%)     38.38 (-2%)      39.35
total   64.48 (3%)      61.07 (-2%)      62.52

Table 3: Andrew benchmark: BFS vs NFS-std. The times are in seconds.

Table 3 shows the results for BFS vs NFS-std. These results show that BFS can be used in practice: BFS-strict takes only 3% more time to run the complete benchmark. Thus, one could replace the NFS V2 implementation in Digital Unix, which is used daily by many users, by BFS without affecting the latency perceived by those users. Furthermore, BFS with the read-only optimization for the lookup operation is actually 2% faster than NFS-std.

The overhead of BFS relative to NFS-std is not the



Reality Check

Examples of systems that have adopted Byzantine Fault Tolerance:

Boeing 777 Aircraft Information Management System

Boeing 777/787 flight control system

SpaceX Dragon flight control system

Bitcoin



Acknowledgments: Lorenzo Alvisi


Beyond PBFT: Overview

After PBFT, several other papers started to appear:

HQ: J. Cowling, D. Myers, B. Liskov, R. Rodrigues, and L. Shrira. HQ replication: A hybrid quorum protocol for Byzantine fault tolerance. In Proc. of the Symposium on Operating systems design and implementation, OSDI’06, Nov. 2006

Q/U: M. Abd-El-Malek, G. Ganger, G. Goodson, M. Reiter, and J. Wylie. Fault-scalable Byzantine fault-tolerant services. In Proc. of the ACM Symposium on Operating Systems Principles, SOSP’05, Oct. 2005

The end result has been to complicate the adoption of Byzantine solutions.


“In the regions we studied (up to f = 5), if contention is low and low latency is the main issue, then if it is acceptable to use 5f + 1 replicas, Q/U is the best choice, else HQ is the best since it outperforms PBFT with a batch size of 1.”

“Otherwise, PBFT is the best choice in this region: It can handle high contention workloads, and it can beat the throughput of both HQ and Q/U through its use of batching.”

“Outside of this region, we expect HQ will scale best: HQ’s throughput decreases more slowly than Q/U’s (because of the latter’s larger message and processing costs) and PBFT’s (where eventually batching cannot compensate for the quadratic number of messages).”


Zyzzyva: Introduction

Zyzzyva [3]

OSDI’06

R. Kotla, A. Clement, E. Wong, L. Alvisi, and M. Dahlin. Zyzzyva: Speculative Byzantine fault tolerance. In Proc. of the ACM Symposium on Operating Systems Principles (SOSP’07), Stevenson, WA, Oct. 2007. ACM.

One protocol to rule them all!

Zyzzyva is the last word on BFT!

(Is it?)

[Image: http://www.flickr.com/photos/matthewfch/2478230533/]

[3] Zyzzyva is the last word of the English dictionary, apart from Zyzzyzus


Replica coordination

All correct replicas execute the same sequence of commands

For each received command c, correct replicas:
- Agree on c’s position in the sequence
- Execute c in the agreed upon order
- Reply to the client



How it is done now

[Figure: message timeline for the primary and backups 1 to 3, starting from the client request.]




The engineer’s Rule of thumb

Citation

“Handle normal and worst case separately as a rule, because the requirements for the two are quite different: the normal case must be fast; the worst case must make some progress”

Butler Lampson, “Hints for Computer System Design”



How Zyzzyva does it

[Figure: message timeline for the primary and replicas 1 to 3, starting from the client request.]



Specification for State Machine Replication

Stability

A command is stable at a replica once its position in the sequence cannot change

Safety

Correct clients only process replies to stable commands

Liveness

All commands issued by correct clients eventually become stable and elicit a reply



Enforcing safety

Safety requires:
- Correct clients only process replies to stable commands

...but RSM implementations enforce instead:
- Correct replicas only execute and reply to commands that are stable

Service performs an output commit with each reply



Speculative BFT (Trust, but verify)

Replicas execute and reply to a command without knowing whether it is stable
- trust order provided by primary
- no explicit replica agreement!

Correct client, before processing reply, verifies that it corresponds to stable command
- if not, client takes action to ensure liveness



Verifying stability

Necessary condition for stability in Zyzzyva:
- A command c can become stable only if a majority of correct replicas agree on its position in the sequence

Client can process a response for c iff:
- a majority of correct replicas agrees on c’s position
- the set of replies is incompatible, for all possible future executions, with a majority of correct replicas agreeing on a different command holding c’s current position



History

History H_{i,k} is the sequence of the first k commands executed by replica i

On receipt of a command c from the primary, replica appends c to its command history

Replica reply for c includes:
- the application-level response
- the corresponding command history

Additional details:
- Can be hashed through incremental hashing
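Incremental hashing lets a replica summarize its whole command history in a fixed-size value that is updated in constant time per command. The sketch below is one possible construction, chosen here for illustration (a SHA-256 hash chain with an all-zero initial value); the slides do not specify the hash function or the chaining rule.

```python
import hashlib

EMPTY_HISTORY = b"\x00" * 32  # assumed hash of the empty history


def extend_history_hash(prev_hash: bytes, command: bytes) -> bytes:
    """Hash of H_{i,k} computed from the hash of H_{i,k-1} and command k."""
    return hashlib.sha256(prev_hash + hashlib.sha256(command).digest()).digest()


def history_hash(commands) -> bytes:
    """Hash of a whole history, folding the commands in execution order."""
    h = EMPTY_HISTORY
    for c in commands:
        h = extend_history_hash(h, c)
    return h
```

Two replicas report the same hash only if they executed the same commands in the same order (up to hash collisions), so a client can compare histories by comparing these short values instead of full command sequences.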


Zyzzyva: Three cases

Case 1: Unanimity

[Figure: the client sends c to the primary; the primary forwards <c, k> to replicas 1 to 3; all four replicas answer the client with <r_i, H_{i,k}>, i = 1..4.]

Client processes response if all replies match:

r_1 = . . . = r_4 ∧ H_{1,k} = . . . = H_{4,k}


Some comments:

Note that although a client has a proof that the request position in the command history is irrevocably set, no server has such a proof

Comparison of histories may be based on incremental hash

Three message hops to complete the request in the good case

Is it safe to accept the reply in this case?

All processes have agreed on ordering

Correct processes cannot change their mind later

New primary can ask n − f replicas for their histories


Case 2: A majority of correct replicas agree

[Figure: the client sends c to the primary; the primary forwards <c, k> to replicas 1 to 3; the figure shows three replies <r_1, H_{1,k}>, <r_2, H_{2,k}>, <r_3, H_{3,k}> to the client.]

Is it safe to accept such a message?


Zyzzyva Three cases


Primary

Replica 1

Replica 2

Replica 3

c

<c,k>

<c,k>

< r1,H

1,k >

< r2,H

2,k >

< r3,H

3,k >

Consider this case...


Zyzzyva Three cases


Primary

Replica 1

Replica 2

Replica 3

c

<c,k>

<c,k>

<c,k>

< ri,H

i,k >

CC=<H1,k

, H2,k

, H3,k>

Client sends to all a commit certificate containing 2f + 1 matchinghistories


Zyzzyva Three cases


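[Figure: message diagram with Primary and Replicas 1-3. After the replies $\langle r_i, H_{i,k} \rangle$ and the commit certificate $CC = \langle H_{1,k}, H_{2,k}, H_{3,k} \rangle$, the replicas answer with ack messages.]

Client processes response if it receives at least 2f + 1 acks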


Zyzzyva Three cases


Safe?

The certificate proves that a majority of correct processes agree on c's position in the sequence

This is incompatible with a majority backing a different command for that position
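The incompatibility claim above rests on quorum intersection. A small brute-force sketch, assuming n = 3f + 1 replicas and certificates built from 2f + 1 replicas (the function name `min_quorum_overlap` is made up for this example):

```python
from itertools import combinations


def min_quorum_overlap(f: int) -> int:
    """Smallest overlap between any two sets of 2f + 1 replicas out of 3f + 1."""
    n = 3 * f + 1
    quorum = 2 * f + 1
    replicas = range(n)
    return min(
        len(set(a) & set(b))
        for a in combinations(replicas, quorum)
        for b in combinations(replicas, quorum)
    )
```

For small f the overlap is always at least f + 1, so any two such sets share at least one correct replica. A correct replica holds a single command at position k, hence two certificates for different commands at the same position cannot both exist.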

Stability

Stability depends on matching command histories

Stability is prefix-closed:
  - If a command with sequence number k is stable, then so is every command with sequence number k′ < k


Zyzzyva Three cases

Case 3: None of the above

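[Figure: message diagram with Primary and Replicas 1-3. A single <c,k> message is shown; the client receives only two replies, $\langle r_1, H_{1,k} \rangle$ and $\langle r_2, H_{2,k} \rangle$.]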

Fewer than 2f + 1 replies match

The client retransmits c to all replicas, hinting that the primary may be faulty
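The three cases can be summarized from the client's point of view. This is a simplified sketch, assuming n = 3f + 1 replicas and that each reply is reduced to a (response, history digest) pair; `Outcome` and `classify_replies` are hypothetical names, and timeouts and signature checks are left out.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    COMPLETE = "case 1: all 3f + 1 replies match"
    SEND_COMMIT_CERTIFICATE = "case 2: at least 2f + 1 replies match"
    RETRANSMIT = "case 3: fewer than 2f + 1 replies match"


def classify_replies(replies, f: int) -> Outcome:
    """replies: list of (response, history_digest) pairs, one per replica that answered."""
    if not replies:
        return Outcome.RETRANSMIT
    # Size of the largest group of identical replies.
    _, matching = Counter(replies).most_common(1)[0]
    if matching == 3 * f + 1:
        return Outcome.COMPLETE
    if matching >= 2 * f + 1:
        return Outcome.SEND_COMMIT_CERTIFICATE
    return Outcome.RETRANSMIT
```

For f = 1, four identical replies complete the request immediately, three identical replies lead to the commit-certificate round, and anything less makes the client retransmit c to all replicas.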


Zyzzyva The case of the missing phase

The case of the missing phase

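[Figure: a message diagram with Primary and Backups 1-3 handling a Request, shown next to the Zyzzyva diagram with Primary and Replicas 1-3 (replies $\langle r_i, H_{i,k} \rangle$, commit certificate $CC = \langle H_{1,k}, H_{2,k}, H_{3,k} \rangle$, acks).]

Where did the third phase go?

Why was it there to begin with?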



The missing phase – commit

Consider this scenario:

f malicious replicas, including the primary

The primary stops communicating with f correct replicas

They go on strike: they stop accepting messages in this view and ask for a view change

f + f replicas stop accepting messages, f + 1 replicas keep working

The remaining f + 1 replicas are not enough to conclude the pre-prepare and prepare phases

The f correct processes that are asking for a view change are not enough to conclude one, so there is no opportunity to regain liveness by electing a new primary
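The counting behind this stalemate can be written out explicitly. A sketch under the usual PBFT assumptions, which this slide does not restate: n = 3f + 1 replicas and quorums of 2f + 1. The function name and the dictionary keys are illustrative only.

```python
def strike_scenario(f: int) -> dict:
    """Counts for the scenario above: f faulty replicas, f correct replicas on strike."""
    n = 3 * f + 1
    faulty = f                          # includes the primary
    on_strike = f                       # correct replicas asking for a view change
    working = n - faulty - on_strike    # correct replicas still in the view: f + 1
    quorum = 2 * f + 1                  # assumed PBFT quorum size
    return {
        "n": n,
        "working": working,
        "quorum": quorum,
        "can_prepare": working >= quorum,
        "can_view_change": on_strike >= quorum,
    }
```

For every f ≥ 1 both flags are false: the f + 1 working replicas cannot form a quorum without the faulty ones, and the f strikers cannot complete a view change.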



The missing phase – commit

The third phase of PBFT breaks this stalemate:

The remaining f + 1 replicas
  - either gather the evidence necessary to complete the request,
  - or determine that a view change is necessary

Commit phase needed for liveness


Zyzzyva View changes

Where did the third phase go?

In PBFT

What compromises liveness in the previous scenario is that the PBFT view change protocol lets correct replicas commit to a view change and become silent in a view without any guarantee that their action will lead to the view change

In Zyzzyva

A correct replica does not abandon view v unless it is guaranteed that every other correct replica will do the same, forcing a new view and a new primary



View change

Two phases:
  - Processes unsatisfied with the current primary send a message 〈i-hate-the-primary, v〉 to all
  - If a process collects f + 1 i-hate-the-primary messages, it sends a message to all containing such messages and starts a new view change (similar to the traditional one)

The extra phase of the agreement protocol is moved to the view change protocol
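A sketch of the f + 1 threshold from the replica's side. The class `ViewChangeTracker` and its method are hypothetical names, and message authentication is omitted.

```python
class ViewChangeTracker:
    """Collects i-hate-the-primary messages for the current view."""

    def __init__(self, f: int, view: int):
        self.f = f
        self.view = view
        self.accusers = set()   # distinct senders seen for this view

    def on_hate_message(self, sender: int, view: int) -> bool:
        """Record one message; return True once f + 1 distinct senders accuse the primary."""
        if view != self.view:
            return False        # message for another view: ignore
        self.accusers.add(sender)
        return len(self.accusers) >= self.f + 1
```

With f + 1 distinct senders, at least one accusation comes from a correct replica, so f faulty replicas alone cannot force a view change. Since the f + 1 messages are forwarded to all, every correct replica ends up seeing the same evidence.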



Optimizations

Checkpoint protocol to garbage collect histories

Replacing digital signatures with MAC

Replicating application state at only 2f + 1 replicas

Batching



Performance

[Fig. 4 (R. Kotla et al.): Realized throughput (Kops/sec) vs. number of clients for the 0/0 benchmark, for systems configured to tolerate f = 1 faults. Series: Unreplicated, Zyzzyva, Zyzzyva5, PBFT, HQ, Q/U max throughput, Zyzzyva (B=10), Zyzzyva5 (B=10), PBFT (B=10).]

[Fig. 5: Latency per request (ms) vs. throughput (Kops/sec) for systems with increasing batch sizes. Series: Zyzzyva and PBFT with B = 1, 10, 20, 40.]

[...] compared to Zyzzyva. However, as Figure 5 shows, further increases in batch size do not significantly improve Zyzzyva’s performance. Conversely, PBFT’s performance peaks with a batch size of 20, where Zyzzyva’s throughput advantage reduces to 23%.

Source: ACM Transactions on Computer Systems, Vol. 27, No. 4, Article 7, Publication date: December 2009.



Discussion

What have you learned?

Do you agree on the principles?


Aardvark

Aardvark (NSDI’09)

A. Clement, E. Wong, L. Alvisi, M. Dahlin, and M. Marchetti. Making Byzantine fault tolerant systems tolerate Byzantine faults. In Proc. of the 6th USENIX Symposium on Networked Systems Design and Implementation, NSDI’09, pages 153–168. USENIX Association, 2009.

A new beginning!

Image: http://en.wikipedia.org/wiki/File:Porc_formiguer.JPG

Note: Aardvark is the first word of the English dictionary (Oritteropo in Italian).

Aardvark

From the article

Surviving vs tolerating

Although current BFT systems can survive Byzantine faults without compromising safety, we contend that a system that can be made completely unavailable by a simple Byzantine failure can hardly be said to tolerate Byzantine faults.


Aardvark

Conventional wisdom

Handle normal and worst case separately
  - remain safe in worst case
  - make progress in normal case

Maximize performance when
  - the network is synchronous
  - all clients and servers behave correctly

Futile
  - it yields diminishing returns on the common case


Aardvark

Conventional wisdom

Misguided
  - it encourages systems that fail to deliver BFT

Maximize performance when
  - the network is synchronous
  - all clients and servers behave correctly



Aardvark

Conventional wisdom

Misguided
  - it encourages systems that fail to deliver BFT

Dangerous
  - it encourages fragile optimizations



Aardvark

Blueprint

Build the system around an execution path that:
  - provides acceptable performance across the broadest set of executions
  - is easy to implement
  - is robust against Byzantine attempts to push the system away from it


Aardvark

Revisiting conventional wisdom

Signatures are expensive – use MACs
  - Faulty clients can use MACs to generate ambiguity
  - Aardvark requires clients to sign requests

View changes are to be avoided
  - Aardvark uses regular view changes to maintain high throughput despite faulty primaries

Hardware multicast is a boon
  - Aardvark uses separate work queues for clients and individual replicas
  - Aardvark uses a fully connected topology among replicas (separate NICs)


Aardvark

MAC Attack

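[Figure: message diagram with Primary and Replicas 1-3. The client's command c reaches the primary, which forwards <c,k> to the three replicas; all four nodes are marked ✔.]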


Aardvark

MAC Attack

[Figure: same message diagram with Primary and Replicas 1-3. The primary is marked ✔, while Replicas 1-3 are marked ✗.]
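A sketch of how a faulty client could create this kind of ambiguity with MACs, assuming the client shares a separate secret key with each node and attaches one MAC per node to its request (an authenticator). The keys and node names below are invented for the example; only Python's standard `hmac` and `hashlib` modules are used.

```python
import hashlib
import hmac


def mac(key: bytes, message: bytes) -> bytes:
    """MAC of a message under one pairwise key."""
    return hmac.new(key, message, hashlib.sha256).digest()


def make_authenticator(keys: dict, message: bytes) -> dict:
    """One MAC per node, each computed with the key shared with that node."""
    return {node: mac(key, message) for node, key in keys.items()}


def verify(node: str, keys: dict, message: bytes, authenticator: dict) -> bool:
    """A node can check only its own entry of the authenticator."""
    return hmac.compare_digest(authenticator[node], mac(keys[node], message))


# Hypothetical pairwise keys between one client and the four nodes.
keys = {"primary": b"k0", "replica1": b"k1", "replica2": b"k2", "replica3": b"k3"}
request = b"c"

honest = make_authenticator(keys, request)
# A faulty client keeps the entry for the primary valid and corrupts the others.
forged = {node: (tag if node == "primary" else b"\x00" * 32)
          for node, tag in honest.items()}
```

With `forged`, the primary accepts the request while every other replica rejects it, and no node can tell whether the client or the primary misbehaved. A digital signature does not have this problem because every node verifies the same value, which is consistent with Aardvark requiring clients to sign requests.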


Aardvark

Throughput

System     Best case   Faulty client   Client flood   Faulty primary   Faulty replica
PBFT       62K         0               crash          1K               250
Q/U        24K         0               crash          NA               19K
HQ         15K         NA              4.5K           NA               crash
Zyzzyva    80K         0               crash          crash            0
Aardvark   39K         39K             7.8K           37K              11K


UpRight

UpRight

Bibliography

A. Clement, M. Kapritsos, S. Lee, Y. Wang, L. Alvisi, M. Dahlin, and T. Riche. UpRight cluster services. In Proc. of the ACM Symposium on Operating Systems Principles, SOSP’09, Oct. 2009. http://www.disi.unitn.it/~montreso/ds/papers/UpRight.pdf

A new (B)FT replication library

Minimal intrusiveness for existing apps

Adequate performance

Goal:
  - ease BFT deployment
  - make explicit the incremental cost of BFT
  - switching to BFT: simple change in a config file



UpRight

UpRight

u = max number of failures to ensure liveness

r = max number of commission failures to preserve safety

[Figure: failure classes shown: Crash, Omission, Commission, Byzantine.]

r = u = f: BFT

r = 0: CFT


UpRight

UpRight

Exposes incremental cost of BFT
  - Byzantine agreement
  - if r << u, BFT ≈ CFT in replication cost

Allows richer design options
  - Byzantine faults are rare: u > r
  - Safety more critical than liveness: r > u
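To make the replication-cost claim concrete, here is a small sketch. It assumes that the agreement stage needs 2u + r + 1 replicas; this bound is not stated on the slides and is taken as an assumption here (it does reduce to the familiar 3f + 1 for r = u = f and to 2u + 1 for r = 0).

```python
def agreement_replicas(u: int, r: int) -> int:
    """Replicas assumed necessary to stay live with u failures and safe with r commission failures."""
    return 2 * u + r + 1


def bft_replicas(f: int) -> int:
    """Classic BFT setting: r = u = f."""
    return agreement_replicas(f, f)


def cft_replicas(f: int) -> int:
    """Crash fault tolerance: no commission failures, r = 0."""
    return agreement_replicas(f, 0)
```

For example, tolerating u = 3 failures of which at most r = 1 is a commission failure would take 8 replicas under this assumption, against 7 for pure CFT and 10 for full BFT.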


UpRight

Reality Check

UpRight (Java; latest update Oct. 2009): https://code.google.com/archive/p/upright/

ArchiStar-BFT (Java; latest update May 2015): https://github.com/archistar/archistar-bft

Bft-SMaRt (Java; latest update Apr. 2016): http://bft-smart.github.io/library/

UpRight

For (far in the) future lectures

S. Gaertner, M. Bourennane, C. Kurtsiefer, A. Cabello, and H. Weinfurter. Experimental demonstration of a quantum protocol for byzantine agreement and liar detection. Physical Review Letters, 100(7), Feb. 2008.


UpRight

Reading material

M. Castro and B. Liskov. Practical Byzantine fault tolerance. In Proc. of the 3rd Symposium on Operating Systems Design and Implementation, OSDI’99, pages 173–186, New Orleans, Louisiana, USA, 1999. USENIX Association. http://www.disi.unitn.it/~montreso/ds/papers/PbftOsdi.pdf

R. Kotla, A. Clement, E. Wong, L. Alvisi, and M. Dahlin. Zyzzyva: Speculative byzantine fault tolerance. In Proc. of the ACM Symposium on Operating Systems Principles (SOSP’07), Stevenson, WA, Oct. 2007. ACM. http://www.disi.unitn.it/~montreso/ds/papers/Zyzzyva.pdf

A. Clement, E. Wong, L. Alvisi, M. Dahlin, and M. Marchetti. Making Byzantine fault tolerant systems tolerate Byzantine faults. In Proc. of the 6th USENIX Symposium on Networked Systems Design and Implementation, NSDI’09, pages 153–168. USENIX Association, 2009. http://www.disi.unitn.it/~montreso/ds/papers/Aardvark.pdf




