
Biscotti: A Ledger for Private and Secure Peer-to-Peer Machine Learning

Muhammad Shayan, University of British Columbia ([email protected])

Clement Fung, University of British Columbia ([email protected])

Chris J.M. Yoon, University of British Columbia ([email protected])

Ivan Beschastnikh, University of British Columbia ([email protected])

Abstract

Federated Learning is the current state of the art in supporting secure multi-party machine learning (ML): data is maintained on the owner's device and the updates to the model are aggregated through a secure protocol [12]. However, this process assumes a trusted centralized infrastructure for coordination, and clients must trust that the central service does not use the byproducts of client data. In addition, a group of malicious clients could also harm the performance of the model by carrying out a poisoning attack [25].

As a response, we propose Biscotti: a fully decentralized peer-to-peer (P2P) approach to multi-party ML, which uses blockchain and cryptographic primitives to coordinate a privacy-preserving ML process between peering clients. Our evaluation demonstrates that Biscotti is scalable, fault tolerant, and defends against known attacks. For example, Biscotti is able to protect the privacy of an individual client's update and the performance of the global model at scale when 30% of the peers are adversaries trying to poison the model [25].

1 Introduction

A common thread in machine learning (ML) applications is the collection of massive amounts of training data, which is often centralized for analysis. However, when training ML models in a multi-party setting, users must share their potentially sensitive information with a centralized service.

Federated learning [12, 32] is a prominent solution for large-scale secure multi-party ML: clients train a shared model with a secure aggregator without revealing their underlying data or computation. However, doing so introduces a subtle threat: clients, who previously acted as passive data contributors, are now actively involved in the training process [25]. This presents new privacy and security challenges.

Prior work has demonstrated that adversaries can attack the shared model through poisoning attacks [10, 40], in which an adversary contributes adversarial updates to influence shared model parameters. Adversaries can also attack the privacy of other clients in federated learning. In an information leakage attack, an adversary poses as an honest client and attempts to steal or de-anonymize sensitive training data through careful observation and isolation of a victim's model updates [24, 33].

Solutions to these two attacks are in tension and are inherently difficult to achieve together: client contributions or data can be made public and verifiable to prevent poisoning, but this violates the privacy guarantees of federated learning. Client contributions can be made more private, but this eliminates the potential for holding adversaries accountable. Prior work has attempted to solve these two attacks individually through centralized anomaly detection [48], differential privacy [4, 20, 22] or secure aggregation [12]. However, a private and decentralized solution that solves both attacks concurrently does not yet exist. In addition, these approaches are inapplicable in contexts where a trusted centralized authority does not exist. This is the focus of our work.

Because ML does not require strong consensus or consistency to converge [43], traditional strong consensus protocols such as Byzantine Fault Tolerant (BFT) protocols [15] are too restrictive for machine learning workloads. To facilitate private, verifiable, crowd-sourced computation, distributed ledgers (blockchains) [38] have emerged. Through design elements such as publicly verifiable proof of work, eventual consistency, and ledger-based consensus, blockchains have been used for a variety of decentralized multi-party tasks such as currency management [23, 38], archival data storage [5, 35] and financial auditing [39]. Despite this wide range of applications, a fully realized, accessible system for large scale multi-party ML that is robust to both attacks on the global model and attacks on other clients does not exist.

We propose Biscotti, a decentralized public peer-to-peer system that co-designs a privacy-preserving multi-party ML process with a blockchain ledger. Peering clients join Biscotti and contribute to a ledger to train a global model, under the assumption that peers are willing to collaborate on building ML models, but are unwilling to share their data.


Each peer has a local data set that could be used to train the model. Biscotti is designed to support stochastic gradient descent (SGD) [14], an optimization algorithm that iteratively selects a batch of training examples, computes their gradients with respect to the current model parameters, and takes steps in the direction that minimizes the loss function. SGD is general purpose and can be used to train a variety of models, including neural networks [17].
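As a concrete illustration of the SGD step described above, the following sketch (ours, not code from Biscotti; the least-squares gradient, hyperparameters, and random data are arbitrary stand-ins) shows one iteration producing the update ∆w that a peer would go on to mask and share:

```python
import numpy as np

def sgd_step(w, batch_X, batch_y, grad_fn, lr=0.01):
    """One SGD iteration: compute the mini-batch gradient of the loss at the
    current parameters w and step in the direction that reduces the loss."""
    grad = grad_fn(w, batch_X, batch_y)   # d-dimensional gradient
    delta_w = -lr * grad                  # the update a peer would contribute
    return w + delta_w, delta_w

def least_squares_grad(w, X, y):
    # Gradient of a simple least-squares loss, used here as a stand-in model.
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(10, 5)), rng.normal(size=10)  # a sampled mini-batch
w = np.zeros(5)
w, delta_w = sgd_step(w, X, y, least_squares_grad)
```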

The Biscotti blockchain coordinates ML training between the peers. Peers in the system are weighted by the value, or stake, that they have in the system. Inspired by prior work [23], Biscotti uses consistent hashing based on proof of stake in combination with verifiable random functions (VRFs) [34] to select key roles that help to arrange the privacy and security of peer SGD updates. Our use of stake prevents groups of colluding peers from overtaking the system without sufficient stake ownership.

With Biscotti's design, our primary contribution is to combine several prior techniques into one coherent system that provides secure and private multi-party machine learning in a highly distributed setting. In particular, Biscotti prevents peers from poisoning the model through the Multi-KRUM defense [11] and provides privacy through differentially private noise [4] and Shamir secrets for secure aggregation [47]. Most importantly, Biscotti combines these techniques without impacting the accuracy of the final model, achieving model performance identical to federated learning.

We evaluated Biscotti on Azure and considered its performance, scalability, churn tolerance, and ability to withstand different attacks. We found that Biscotti can train an MNIST softmax model with 200 peers on a 60,000 image dataset in 266.7 minutes, which is 14x slower than a similar federated learning deployment. Biscotti is fault tolerant to node churn every 15 seconds across 100 nodes, and converges even with such churn. Furthermore, we show that Biscotti is resilient to information leakage attacks [33] that require knowledge of a client's SGD update, and to a label-flipping poisoning attack [25] from prior work.

2 Challenges and contributions

We now describe the key challenges in designing a peer-to-peer (P2P) solution for multi-party ML and the key pieces in Biscotti's design that resolve each of these challenges.

Sybil attacks: Consistent hashing using VRFs and proof of stake.

In a P2P setting, adversaries can collude or generate aliases to increase their influence in a sybil attack [19].

Biscotti uses a consistent hashing protocol based on the hash of the last block and verifiable random functions (VRFs) (see Appendix D) to select a subset of peers that are responsible for the different stages of the training process: adding noise to updates, validating an update, and securely aggregating the update. To mitigate the effect of sybils, the protocol selects peers proportionally to their stake. A peer's stake is reputation that the peer acquires by contributing to the system. This ensures that an adversary cannot increase their influence in the system by creating multiple peers without increasing their total contribution (stake).

Poisoning attacks: Update validation using Multi-KRUM. In multi-party ML, peers possess a relatively disjoint and private subset of the training data. As mentioned above, privacy can be exploited by adversaries to provide cover for poisoning attacks.

In multi-party settings, known baseline models are not available to peers, so Biscotti validates an SGD update by evaluating it with respect to the updates submitted by other peers. Biscotti validates updates using the Multi-KRUM algorithm [11], which rejects updates that push the model away from the direction of the majority of the updates. More precisely, in each round, a committee of peers is selected to filter out poisoned updates by a majority vote, and each member of the committee uses Multi-KRUM to filter out poisoned SGD updates. Multi-KRUM guarantees to filter out f out of n malicious updates such that 2f + 2 < n. We demonstrate that by using Multi-KRUM, Biscotti can handle poisonous updates from up to 30% malicious clients.

Information leakage attacks: Random verifier peers and differentially private updates using pre-committed noise. By observing a peer's SGD updates in each verification round, an adversary can perform an information leakage attack [33] and recover details about a victim's training data (for details on information leakage attacks, see Appendix E).

Biscotti prevents such attacks during update verification in two ways. First, it uses consistent hashing on the SHA-256 hash of the last block to ensure that malicious peers cannot deterministically select themselves to verify a victim's gradient. Second, peers send differentially private updates [4, 20] to verifier peers: before sending a gradient to a verifier, pre-committed ε-differentially private noise is added to the update, masking the peer's gradient in a way that neither the peer nor the attacker can influence or observe (see Figure 2). By verifying noised SGD updates, peers in Biscotti can verify the updates of other peers without observing their un-noised versions.

Utility loss with differential privacy: Secure update aggregation and cryptographic commitments. The blockchain-based ledger of model updates allows for auditing of state, but this transparency is counter to the goal of privacy-preserving multi-party ML. For example, the ledger trivially leaks information if we store SGD updates directly.

Verification of differentially private updates discussed above is one piece of the puzzle. The second piece is secure update aggregation: a block in Biscotti does not store updates from individual peers, but rather an aggregate that obfuscates any single peer's contribution during that round of SGD. Biscotti uses a verifiable secret sharing scheme [27] to aggregate the updates so that any individual update remains hidden through cryptographic commitments (for background, see Appendix C).


Figure 1: The utility trade-off if differentially private noise is added directly to the updates in the ledger. Biscotti, which aggregates non-noisy updates, has the best utility (lowest test error).

Figure 2: Visualized privacy trade-off for curves in Figure 1.


However, secure aggregation can either be done on the differentially-private updates or on the original updates with no noise. This design choice represents a privacy-utility tradeoff in the trained model. By aggregating differentially-private updates, the noise is pushed into the ledger and the final model has lower utility. By aggregating original updates, we can achieve utility that is identical to federated learning training. Figure 1 illustrates this trade-off for learning a softmax model to recognize hand-written digits. The model is trained using the MNIST dataset over 100 iterations for different values of ε. Smaller ε means more noise and more privacy, but lower utility (higher test error). The bottom-most line in the figure is Biscotti, which aggregates original updates: a design choice that we have made to prioritize utility. By default Biscotti uses a batch value of 35. Figure 2 illustrates the privacy-utility trade-off visually for batches of size 35 with noisy updates as compared to Biscotti (right-most image), which aggregates un-noised updates. The images are reconstructed using an information leakage attack [33] on the aggregated gradients of two different machine learning models. The top row of images is constructed from aggregated gradients of the MNIST model with respect to the 0 digit class. The bottom row shows results for a softmax model trained to recognize gender from faces using the Labelled Faces in the Wild (LFW) dataset. The pictures are reconstructed with respect to the female class. These results demonstrate that although aggregation protects individual training examples, it does reveal information about what a certain class looks like on average in the aggregate. Differentially private noise would need to be included in the aggregate to provide this additional level of protection. However, our design favours utility, therefore we choose to remove the noise before aggregating the updates in the ledger.

3 Assumptions and threat model

Like federated learning, Biscotti assumes a public machine learning system that peers can join or leave at any time. Biscotti assumes that users are willing to collaborate on building ML models, but are unwilling to directly share their data when doing so [32]. This need for collaboration arises because a model trained only on an individual peer's data would exhibit poor performance, while a model trained by all participants will have near-optimal performance.

3.1 Design assumptions

Proof of stake.

Peers in Biscotti need to arrive at a consensus on the state of the model for each round of training. Proof of Stake (POS) [23, 28] is a recently proposed consensus mechanism for cryptocurrencies. Consensus using POS is fast because it delegates the consensus responsibility to a subset of peers each round. The peers responsible are selected each round based on the amount of stake that they possess in the system. In cryptocurrencies, the stake of a peer refers to the amount of value (money) that a peer holds. POS relies on the assumption that a subset of peers holding a significant fraction of the stake will not attempt to subvert the system. For a detailed background on Proof of Stake, please refer to Appendix F.

In Biscotti, we define stake as a measure of a peer's contribution to the system. Peers acquire stake by providing SGD updates or by facilitating the consensus process. Thus, the stake that a peer accrues during the training process is proportional to their contribution to the trained model. We assume that all nodes can obtain the fraction of stake of any other peer using a function over the current state of the blockchain. In addition, we also assume that at any point in time more than 70% of the stake in the system is honest and that the stake function is properly bootstrapped. The initial stake distribution at the start may be derived from an online data sharing marketplace, a shared reputation score among competing agencies, or auxiliary information on a social network.

Blockchain topology. Each peer is connected to some subset of other peers in a topology that allows flooding-based dissemination of blocks that eventually reach all peers. For example, this could be a random mesh topology with flooding, similar to the one used for block dissemination in Bitcoin [38]. Peers that go offline during training and join later are bootstrapped on the current state of the blockchain by other peers in the system.

Machine learning. We assume that ML training parameters are known to all peers: the model, its hyperparameters, its optimization algorithm and the learning goal of the system (these are distributed in the first block). In a non-adversarial setting, peers have local datasets that they wish to keep private. When peers draw a sample from this data to compute an SGD update, we assume that this is done uniformly and independently [13]. This ensures that the Multi-KRUM gradient validation [11] is accurate.

3.2 Attacker assumptions

Peers may be adversarial and send malicious updates to perform a poisoning attack on the shared model or an information leakage attack against a target peer victim. In doing so, we assume that the adversary may control multiple peers in a sybil attack [19] but does not control more than 30% of the total peers. Although peers may be able to increase the number of peers they control in the system, we assume that adversaries cannot artificially increase their stake in the system except by providing valid updates that pass Multi-KRUM [11].

When adversaries perform a poisoning attack, we assume that their goal is to harm the performance of the final global model. Our defense relies on filtering out malicious updates that are sufficiently different from those of honest clients and are pushing the global model towards some sub-optimal objective. For the purposes of this work, we limit the adversaries to a label flipping poisoning attack [25], in which they mislabel a certain class in their dataset, causing the model to learn to misclassify it. This does not include attacks on unused parts of the model topology, like backdoor attacks [6], attacks based on gradient-ascent techniques [37], or adaptive attacks based on knowledge of the poisoning defense [9].

When adversaries perform an information leakage attack, we assume that they aim to learn properties of a victim's local dataset. Due to the vulnerabilities of secure aggregation, we do not consider information leakage attacks with side information [21, 45] and class-level information leakage attacks on federated learning [24], which attempt to learn the properties of an entire target class.

4 Biscotti design

Biscotti implements peer-to-peer ML trained using SGD. For this process, Biscotti's design has the following goals:

• Convergence to an optimal global model (the model trained without adversaries in a federated learning setting)
• Poisoning is prevented by verifying peer contributions to the model
• Peer training data is kept private and information leakage attacks on training data are prevented
• Colluding peers cannot gain influence without acquiring sufficient stake

Biscotti meets these goals through a blockchain-based design that we describe in this section.

Design overview. Peers join Biscotti and collaboratively train a global model. Each block in the distributed ledger represents a single iteration of SGD and the ledger contains the state of the global model at each iteration. Figure 3 overviews the Biscotti design with a step-by-step illustration of what happens during a single SGD iteration in which a single block is generated.

In each iteration, peers locally compute SGD updates (step 1 in Figure 3). Since SGD updates need to be kept private, each peer first masks their update using differentially private noise. This noise is obtained from a unique set of noising peers for each client, selected by a VRF [34] (steps 2 and 3).

The masked updates are validated by a verification committee to defend against poisoning. Each member of the verification committee signs the commitment to the peer's unmasked update if it passes Multi-KRUM (step 4). If the majority of the committee signs an update (step 5), the signed update is divided into Shamir secret shares (step 6) and given to an aggregation committee. The aggregation committee uses a secure protocol to aggregate the unmasked updates (step 7). All peers who contribute a share to the final update, along with the peers chosen for the verification and aggregation committees, receive additional stake in the system.

The aggregate of the updates is added to the global model in a newly created block which is disseminated to all the peers and appended to the ledger (step 8). Using the updated global model and stake, the peers repeat the process (step 1).

Next, we describe how we bootstrap the training process.

Figure 3: The ordered steps in a single iteration of the Biscotti algorithm.

4.1 Initializing the training

Biscotti peers initialize the training process using information in the first (genesis) block. We assume that this block is distributed out of band by a trusted authority and the information within it is reliable. The centralized trusted authority only plays the role of facilitating and bootstrapping the training process by distributing the genesis block that makes public information available to all peers in the system. It is not entrusted with individual SGD updates of the peers that could potentially leak private information of a peer's data. Each peer that joins Biscotti obtains the following information from the genesis block:

• Initial model state w0, and the expected number of iterations T
• The commitment public key PK for creating commitments to SGD updates (see Appendix C)
• The public keys PKi of all other peers in the system, which are used to create and verify signatures during verification
• Pre-commitments to T iterations of differentially private noise ζ1..T for each SGD iteration by each peer (see Figure 6 and Appendix B)
• The initial stake distribution among the peers
• A stake update function for updating a peer's stake when a new block is appended

Figure 4: Block contents at iteration t. The block stores the commitment to each peer's update COMM(∆wi), the verifier commitment signatures for each update, the updated stake of each peer, the aggregate of updates, the global model wt, and the previous block hash. Note that wt is computed as wt−1 + Σu ∆wu, where wt−1 is the global model stored in the block at iteration t−1 and j is the set of verifiers for iteration t.

4.2 Blockchain design

Distributed ledgers are constructed by appending read-only blocks to a chain structure and disseminating blocks using a gossip protocol. Each block maintains a pointer to its previous block as a cryptographic hash of the contents in that block.

Each block in Biscotti (Figure 4) contains, in addition to the previous block hash pointer, an aggregate (∆w) of SGD updates from multiple peers and a snapshot of the global model wt at iteration t. Newly appended blocks to the ledger store the aggregated updates ∑∆wi of multiple peers. To verify that the aggregate was honestly computed, individual updates need to be made part of the block. However, storing them would leak information about the private training data of individuals. We solve this problem by using polynomial commitments [27]. Polynomial commitments take an SGD update and map it to a point on an elliptic curve (see Appendix C for details). By including a list of commitments for each peer i's update COMM(∆wi) in the block, we can provide both privacy and verifiability of the aggregate. The commitments provide privacy by hiding the individual updates, yet can be homomorphically combined to verify that the update to the global model by the aggregator ∑∆wi was computed honestly. The following equality holds if the list of committed updates equals the aggregate sum:

COMM(∑i ∆wi) = ∏i COMM(∆wi)    (2)
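Biscotti's actual commitments are Kate-style polynomial commitments over the bn256 elliptic curve (Appendix C). As a simplified stand-in to show why Equation (2) holds, the toy sketch below commits to scalar updates as g^w mod p over a prime field; the modulus, base, and scalar updates are illustrative assumptions only:

```python
# Toy multiplicative commitments over a prime field (NOT the Kate/bn256
# polynomial commitments that Biscotti uses; illustration of Equation (2) only).
P = 2**127 - 1   # a prime modulus, chosen arbitrarily for this sketch
G = 5            # a fixed base

def comm(update):
    # Commit to a (scalar) update as G^update mod P.
    return pow(G, update, P)

updates = [3, 7, 11]   # scalar stand-ins for the peer updates dw_i
product_of_commitments = 1
for dw in updates:
    product_of_commitments = (product_of_commitments * comm(dw)) % P

# Equation (2): the commitment to the sum equals the product of commitments.
assert comm(sum(updates)) == product_of_commitments
```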

The training process continues for a specified number of iterations T, upon which the learning process is terminated and each peer extracts the global model from the final block.

4.3 Using stake for role selection

For each iteration in Biscotti, a consistent hashing protocol weighted by stake designates roles (noiser, verifier, aggregator) to some peers in the system. The protocol ensures that the influence of a peer is bounded by their stake (i.e., adversaries cannot trivially increase their influence through sybils). Peers can take on multiple roles in a given iteration but cannot be a verifier and aggregator in the same round. The verification and aggregation committees are the same for all peers, but the noising committee is unique to each peer.

Figure 5: Biscotti uses a consistent hashing protocol based on the current stake to determine the roles for each iteration.

Biscotti uses consistent hashing to select peers for the verification and aggregation committees (Figure 5). The initial SHA-256 hash of the last block is repeatedly re-hashed: each new hash result is mapped onto a hash ring where portions of the ring are proportionally assigned to peers based on their stake. The peer in whose region the hash lies is chosen for the verifier/aggregator role. The process is repeated until the verifier/aggregator committees reach the required size. This provides the same stake-based property as Algorand [23]: a peer's probability of winning this lottery is proportional to their stake. Since an adversary cannot predict the future state of a block until it is created, they cannot speculate on the outputs of consistent hashing and strategically perform attacks.
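The sketch below illustrates one way this stake-weighted selection by repeated hashing could be implemented (a simplified reading of the protocol; the peer names, ring layout, and the rule of skipping already-selected peers are our assumptions):

```python
import hashlib

def committee_from_stake(last_block_hash: bytes, stake: dict, size: int):
    """Select a committee by repeatedly re-hashing the last block's hash and
    mapping each digest onto a ring whose regions are proportional to stake."""
    total = sum(stake.values())
    ring, acc = [], 0
    for peer, s in sorted(stake.items()):   # cumulative stake regions
        acc += s
        ring.append((acc, peer))

    committee, h = [], last_block_hash
    while len(committee) < size:
        h = hashlib.sha256(h).digest()
        point = int.from_bytes(h, "big") % total
        peer = next(p for bound, p in ring if point < bound)
        if peer not in committee:           # one committee seat per peer
            committee.append(peer)
    return committee

# Example: a peer's chance of selection is proportional to its stake.
stake = {"peer1": 10, "peer2": 30, "peer3": 60}
print(committee_from_stake(b"hash of the last block", stake, size=2))
```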

Unlike verification and aggregation, the noising committee is different for each peer, for additional privacy. Each peer can arrive at a unique committee for itself via consistent hashing by using a different initial hash. The hash computed should be random yet globally verifiable by other peers in the system. In Biscotti, a peer computes this hash by passing their secret key SKi and the SHA-256 hash of the last block to a verifiable random function (VRF). By virtue of the peer's secret key, the hash computed is unique to the peer, resulting in distinct committees for different peers. The VRF hash is also accompanied by a proof that can be combined with the peer's public key, allowing other peers to verify that the correct noise providers were selected by the peer.

At each iteration, peers run the consistent hashing protocol to determine whether they are an aggregator or verifier. Peers that are part of the verification and aggregation committees do not contribute updates for that round. Each peer that is not an aggregator or verifier computes their set of noise providers. The contributing peers then obtain noise to hide their updates by following the noising protocol.

4.4 Noising protocol

Sending a peer's SGD update directly to another peer may leak sensitive information about their private dataset [33]. To prevent this, peers use differential privacy to hide their updates prior to verification by adding noise sampled from a normal distribution. This ensures that each step is (ε, δ)-differentially private by standard arguments [20] (see Appendix B for formalisms).

Using pre-committed noise to thwart poisoning. Attackers may maliciously use the noising protocol to execute poisoning or information leakage attacks. For example, a peer can send a poisonous update ∆wpoison and add noise ζp that un-poisons this update to resemble an honest update ∆w, such that ∆wpoison + ζp = ∆w. By doing this, a verifier observes the honest update ∆w, but the poisonous update ∆wpoison is applied to the model because the noise is removed in the final aggregate.

To prevent this, Biscotti requires that every peer pre-commits the noise vector ζt for every iteration t ∈ [1..T] in the genesis block. Since the updates cannot be generated in advance without knowledge of the global model, a peer cannot effectively commit noise that unpoisons an update. Furthermore, Biscotti requires that the noise that is added is taken from a different peer than the one creating the update. This peer is determined using a noising VRF, which further restricts the control that a malicious peer has over the noise used to sneak poisoned updates past verification.

Using a VRF-chosen set of noisers to thwart information leakage. A further issue may occur in which a noising peer A and a verifier B collude in an information leakage attack against a victim peer C. The noising peer A can commit a set of zero noise that does not hide the original gradient value at all. When the victim peer C sends its noised gradient to the verifier B, B performs an information leakage attack on client C's gradient, recovering its original training data. This attack is viable because the verifier B knows that A is providing the random noise and that A provided zero noise.

This attack motivates Biscotti's use of a private VRF that selects a group of noising peers based on the victim C's secret key. In doing so, an adversary cannot pre-determine whether their noise will be used in a specific verification round by a particular victim, and also cannot pre-determine if the other peers in the noising committee will be malicious in a particular round. Our results in Figure 16 show that the probability of an information leakage is negligible given a sufficient number of noising peers.


Figure 6: Peers commit noise to an N by T structure. Each row i contains all the noise committed by a single peer i, and each column t contains potential noise to be used during iteration t. When committing noise at an iteration i, peers execute a VRF and request ζ_i^k from peer k.

Protocol description. For an ML workload that may be expected to run for a specific number of iterations T, each peer i generates T noise vectors ζt and commits these noise vectors into the ledger, storing a table of size N by T (Figure 6). When a peer is ready to contribute an update in an iteration, it runs the noising VRF and contacts each noising peer k, requesting the noise vector ζk pre-committed in the genesis block as COMM(ζk). The peer then uses a verifier VRF to determine the set of verifiers. The peer masks their update using this noise and submits to these verifiers the masked update, a commitment to the noise, and a commitment to the unmasked update. It also submits the noise VRF proof that attests to the verifier that its noise is sourced from peers that are part of its noise VRF set.
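A simplified sketch of the noise pre-commitment and masking steps is given below (ours, not Biscotti's implementation; the Gaussian-mechanism calibration of the noise scale is a standard formulation that stands in for the exact calibration in Appendix B, and the commitments and VRF proofs are omitted):

```python
import numpy as np

def precommit_noise(T, d, epsilon, delta, sensitivity=1.0, seed=0):
    """Generate the T noise vectors a peer pre-commits to in the genesis block.
    Sigma uses the standard Gaussian-mechanism calibration (an assumption;
    Biscotti's exact calibration is given in its Appendix B)."""
    sigma = np.sqrt(2 * np.log(1.25 / delta)) * sensitivity / epsilon
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, sigma, size=(T, d))

def mask_update(delta_w, noise_vectors):
    """Mask an SGD update with the noise retrieved from the VRF-selected
    noising peers for the current iteration."""
    return delta_w + sum(noise_vectors)

# Example: two noising peers contribute their pre-committed noise for iteration t.
T, d, t = 100, 25, 7
noise_tables = [precommit_noise(T, d, epsilon=2, delta=1e-5, seed=k) for k in (1, 2)]
delta_w = np.zeros(d)                      # a peer's (placeholder) SGD update
masked = mask_update(delta_w, [table[t] for table in noise_tables])
```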

4.5 Verification protocol

The verifier peers are responsible for filtering out malicious updates in a round by running Multi-KRUM on the received pool of updates and accepting the top majority of the updates received each round. Each verifier receives the following from each peer i:

• The masked SGD update: (∆wi + ∑k ζk)
• A commitment to the SGD update: COMM(∆wi)
• The set of k noise commitments: {COMM(ζ1), COMM(ζ2), ..., COMM(ζk)}
• A VRF proof confirming the identity of the k noise peers

When a verifier receives a masked update from another peer, it can confirm that the masked SGD update is consistent with the commitments to the unmasked update and the noise by using the homomorphic property of the commitments [27]. A masked update is legitimate if the following equality holds:

COMM(∆wi + ∑k ζk) = COMM(∆wi) ∗ ∏k COMM(ζk)

Once the verifier receives a sufficiently large number of updates R, it proceeds with selecting the best updates using Multi-KRUM, as follows:

• For every peer i, the verifier calculates a score s(i), which is the sum of Euclidean distances of i's update to the closest R − f − 2 updates:

s(i) = ∑i→j ‖(∆wi + ∑k ζi,k) − (∆wj + ∑k ζj,k)‖²

where i→j denotes the fact that (∆wj + ∑k ζj,k) belongs to the R − f − 2 closest updates to (∆wi + ∑k ζi,k).

• The R − f peers with the lowest scores are selected while the rest are rejected.

• The verifier signs COMM(∆wi) for all the updates that are accepted.

To prevent a malicious sybil verifier from accepting all updates of its colluders in this stage, we require a peer to obtain signatures from a majority of the verifiers to get their update accepted. Once a peer receives a majority of signatures from the verifiers, the update can be disseminated for aggregation.
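A minimal sketch of the Multi-KRUM selection that a verifier performs on the R masked updates it has sampled is shown below (our simplified reading of [11]; the function and parameter names are ours):

```python
import numpy as np

def multi_krum(updates, f):
    """Score each update by its summed squared distance to its R - f - 2
    closest neighbours and keep the R - f lowest-scoring updates."""
    R = len(updates)
    closest = R - f - 2
    dists = np.array([[np.sum((u - v) ** 2) for v in updates] for u in updates])
    scores = []
    for i in range(R):
        others = np.delete(dists[i], i)          # distances to the other updates
        scores.append(np.sum(np.sort(others)[:closest]))
    return np.argsort(scores)[: R - f]           # indices of accepted updates

# Example: 7 well-behaved updates and 3 outliers, with f = 3 (2f + 2 < R = 10).
rng = np.random.default_rng(0)
updates = [rng.normal(0, 0.1, size=5) for _ in range(7)] + \
          [rng.normal(5, 0.1, size=5) for _ in range(3)]
print(multi_krum(updates, f=3))                  # the outliers are rejected
```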

4.6 Aggregation protocol

All peers with a sufficient number of signatures from the verification stage submit their SGD updates for aggregation, which will eventually be appended to the global model. The update equation in SGD (see Appendix A) can be re-written as:

wt+1 = wt + ∑i ∆wi

where the sum is taken over the verified updates, ∆wi is the verified SGD update of peer i, and wt is the global model at iteration t.

However, individual updates contain sensitive information and cannot be directly shared for aggregation. This presents a privacy dilemma: no peer should observe an update from any other peer, but the sum of the updates must be stored in a block.

The objective of the aggregation protocol is to enable a set of m aggregators, predetermined by consistent hashing, to compute ∑i ∆wi without observing any individual updates. Biscotti uses a technique that preserves the privacy of the individual updates if at least half of the m aggregators participate honestly in the aggregation phase. This guarantee holds if consistent hashing selects a majority of honest aggregators, which is likely when the majority of stake is honest.

Biscotti achieves the above guarantees using polynomial commitments (see Appendix C) combined with verifiable secret sharing [47] of individual updates. The update of length d is encoded as a degree-d polynomial, which can be broken down into n shares such that n = 2(d + 1). These n shares are distributed equally among the m aggregators. Since an update can be reconstructed using (d + 1) shares, it would require m/2 colluding aggregators to compromise the privacy of an individual update. Therefore, given that any majority of aggregators is honest and does not collude, the privacy of an individual peer update is preserved.

A peer with a verified update already possesses a commitment C = COMM(∆wi(x)) to its SGD update signed by a majority of the verifiers from the previous step. To compute and distribute its update shares among the aggregators, peer i runs the following secret sharing procedure:

1. The peer computes the required set of secret shares sm,i = {(z, ∆wi(z)) | z ∈ Z} for aggregator m. In order to ensure that an adversary does not provide shares from a poisoned update, the peer computes a set of associated witnesses witm,i = {COMM(Ψz(x)) | Ψz(x) = (∆wi(x) − ∆wi(z)) / (x − z)}. These witnesses allow the aggregator to verify that the secret share belongs to the update ∆wi committed to in C. The peer then sends <C, sm,i, witm,i> to each aggregator along with the signatures obtained in the verification stage.

2. After receiving the above vector from peer i, the aggregator m runs the following sequence of validations:

(a) m ensures that C has passed the validation phase by verifying that it has the signature of the majority of the verification set.

(b) m verifies that in each share (z, ∆wi(z)) ∈ sm,i, the value ∆wi(z) is the correct evaluation at z of the polynomial committed to in C (for details, see Appendix C).

Once every aggregator has received shares for the minimum number of updates u required for a block, each aggregator aggregates its individual shares and shares the aggregate with all of the other aggregators. As soon as an aggregator receives the aggregated d + 1 shares from at least half of the aggregators, it can compute the aggregate sum of the updates and create the next block. The protocol to recover ∑i=1..u ∆wi is as follows:

1. All m aggregators broadcast the sum of their accepted shares and witnesses: < ∑i=1..u sm,i , ∑j=1..u witm,j >

2. Each aggregator verifies the aggregated shares broadcast by each of the other aggregators by checking the consistency of the aggregated shares and witnesses.

3. Given that m obtains the shares from m/2 aggregators including itself, m can interpolate the aggregated shares to determine the aggregated secret ∑j=1..u ∆wj.

Once m has recovered ∑i=1..u ∆wi, it can create a block with the updated global model. All commitments to the updates and the signature lists that contributed to the aggregate are added to the block. The block is then disseminated in the network. Any peer in the system can verify that all updates are verified by looking at the signature list, and can homomorphically combine the commitments to check that the update to the global model was computed honestly (see Section 4.2). If any of these conditions are violated, the block is rejected.
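The additive property of Shamir shares that this protocol relies on can be sketched as follows. This is a toy version with scalar updates and without the witnesses or commitment checks; the field modulus, polynomial degree, and evaluation points are our simplifications. Each aggregator sums the shares it holds point-wise, and interpolating the summed shares recovers only the sum of the updates, not any individual one:

```python
import random

PRIME = 2**61 - 1   # field modulus for the sketch (an arbitrary Mersenne prime)

def share(secret, degree, xs, rng):
    """Shamir-share `secret` with a random degree-`degree` polynomial,
    evaluated at the points xs (one point per share held by an aggregator)."""
    coeffs = [secret] + [rng.randrange(PRIME) for _ in range(degree)]
    def poly(x):
        return sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME
    return {x: poly(x) for x in xs}

def reconstruct_at_zero(points):
    """Lagrange-interpolate the shares at x = 0 to recover the shared value."""
    total = 0
    for xi, yi in points.items():
        num, den = 1, 1
        for xj in points:
            if xj != xi:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        total = (total + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return total

rng = random.Random(0)
xs = [1, 2, 3, 4, 5]                 # evaluation points held by aggregators
updates = [11, 22, 33]               # scalar stand-ins for verified peer updates
all_shares = [share(u, degree=2, xs=xs, rng=rng) for u in updates]

# Aggregators sum the shares they hold point-wise; interpolation yields the sum.
summed = {x: sum(s[x] for s in all_shares) % PRIME for x in xs}
assert reconstruct_at_zero(summed) == sum(updates) % PRIME
```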

4.7 Blockchain consensus

Because the consistent hashing based subsets are globally observable by each peer, and are based only on the SHA-256 hash of the latest block in the chain, ledger forks should rarely occur in Biscotti. For an update to be included in the ledger at any iteration, the same noising/verification/aggregation committees are used. Thus, race conditions between aggregators will not cause forks in the ledger to occur as frequently as in, e.g., Bitcoin [38].

When a peer observes a more recent ledger state through the gossip protocol, it can catch up by verifying that the computation performed is correct by running the consistent hashing protocol for the ledger state and by verifying the signatures of the designated verifiers and aggregators for each new block.

In Biscotti, each verification and aggregation step occurs only for a specified duration. Any updates that are not successfully propagated in this period of time are dropped: Biscotti does not append stale updates to the model once competing blocks have been committed to the ledger. This synchronous SGD model is acceptable for large scale ML workloads, which have been shown to be tolerant of bounded asynchrony [43]. However, these stale updates could be leveraged in future iterations if their learning rate is decayed [17]. We leave this optimization for future work.

5 Implementation

We implemented Biscotti in 4,500 lines of Go 1.10 and 1,000 lines of Python 2.7.12 and released it as an open source project (https://github.com/DistributedML/Biscotti). We use Go to handle all networking and distributed systems aspects of our design. We used PyTorch 0.4.1 [41] to generate SGD updates and noise during training. By building on the general-purpose API in PyTorch, Biscotti can support any model that can be optimized using SGD. We use the go-python v1.0 [3] library to interface between Python and Go. Since go-python does not support Python versions beyond 2.7.12, we were limited to using Python 2.7 for our implementation.

We use the kyber v.2 [2] and CONIKS 0.1.1 [1] libraries to implement the cryptographic parts of our system. We use CONIKS to implement our VRF function and kyber to implement the commitment scheme and public/private key mechanisms. To bootstrap clients with the noise commitments and public keys, we use an initial genesis block. We used the bn256 curve API in kyber for generating our commitments and public keys that form the basis of the aggregation protocol and verifier signatures. For signing updates, we use the Schnorr signature scheme [46] instead of ECDSA because multiple verifier Schnorr signatures can be aggregated together into one signature [31]. Therefore, our block size remains constant as the verifier set grows.

6 Evaluation

We had several goals when designing the evaluation of our Biscotti prototype. We wanted to demonstrate that (1) Biscotti is robust to poisoning attacks, (2) Biscotti protects the privacy of an individual client's data, and (3) Biscotti is scalable, fault-tolerant and can be used to train different ML models.

For experiments done in a distributed setting, we deployed Biscotti to train an ML model across 20 Azure A4m v2 virtual machines, each with 4 CPU cores and 32 GB of RAM. We deployed a varying number of peers on each of the VMs. The VMs were spread across six locations: West US, East US, Central India, Japan East, Australia East and Western Europe. We measured the error of the global model against a test set and ran each experiment for 100 iterations. To evaluate Biscotti's defense mechanisms, we ran known inference and poisoning attacks on federated learning [33, 25] and measured their effectiveness under various attack scenarios and Biscotti parameters. We also evaluated the performance implications of our design by isolating specific components of our system and varying the committee sizes with different numbers of peers. For all our experiments, we deployed Biscotti with the parameter values in Table 3 unless stated otherwise.

We evaluated Biscotti with logistic regression and softmax classifiers on the Credit Card and MNIST datasets, respectively. The softmax classifier contains a one layer neural network with a binary cross entropy loss at the end. However, due to the general-purpose PyTorch API, we claim that Biscotti can generalize to models of arbitrary size and complexity, as long as they can be optimized with SGD and can be stored in our block structure. We evaluate logistic regression with a credit card fraud dataset from the UC Irvine data repository [18], which uses an individual's financial and personal information to predict whether or not they will default on their next credit card payment. We evaluate softmax with the canonical MNIST [30] dataset, a task that involves predicting a digit based on its image.

The MNIST and Credit Card datasets have 60,000 and 21,000 training examples respectively. We used 5-fold cross validation on the training set to decide on the hyperparameters of batch size, momentum and learning rate indicated in Table 2.

Dataset       Model Type   Train/Test Examples   Params (d)
Credit Card   LogReg       21000/9000            25
MNIST         Softmax      60000/10000           7850

Table 1: The dataset/model types used in the experiments

Model/Dataset        Batch Size   Learning Rate   Momentum
Credit Card/LogReg   10           0.01            0
MNIST/Softmax        10           0.001           0.75

Table 2: The hyperparameters used in the experiments

The MNIST and Credit Card models were tested using separately held-out test sets of 10,000 and 9,000 examples respectively. For all experiments, the training dataset was divided equally among the peers unless stated otherwise. As a result, each client possesses 600 training examples for MNIST and 210 examples for the Credit Card dataset. The number of training examples per peer is small in our experimental setting and might not be representative of large-scale datasets.

In summary, Tables 1 and 2 show the datasets, model types, numbers of training examples, the number of parameters d in each model, and the hyperparameters that were used in our experiments.

In local training we were able to train a model on the Credit Card and MNIST datasets with accuracy of 98% and 92%, respectively.

6.1 Minimum committee size to prevent collusion

In Biscotti, the verification and aggregation stages involve committees that use a majority voting scheme to reach consensus. By making these committee sizes large enough, we can prevent an adversary controlling a certain fraction of the stake from acting maliciously. An adversary holding the majority vote can act maliciously by accepting poisoned updates or by recovering an individual peer's updates during aggregation. In this section, we analyze the smallest committee size needed such that the probability of an adversary having the majority is below a threshold.

Using the consistent hashing protocol, the probability of a peer being selected is proportional to their stake. Hence, the probability p of an adversary having the majority in a committee of size k can be calculated as:

p = ∑i=k/2+1..k (k choose i) s^i (1 − s)^(k−i)

where s is the fraction of stake controlled by the adversary.


Parameter                                      Default Value
Privacy budget (ε)                             2
Delta (δ)                                      10^−5
Number of peers                                100
Number of noisers                              2
Number of verifiers                            3
Number of aggregators                          3
Number of samples needed for Multi-KRUM (R)    70
Adversary upper bound (f < (R−2)/2)            33
Number of updates/block (u = R/2)              35
Initial stake                                  Uniform, 10 each
Stake update                                   Linear, +5

Table 3: The default parameters used for all our experiments, unless stated otherwise.

By assuming that p follows a binomial distribution, we obtain a loose upper bound on the probability of an adversary controlling the majority in Biscotti. A binomial distribution assumes sampling peers with replacement, allowing a peer to be elected more than once to the committee. Since Biscotti limits a peer to only one vote in the committee, the actual probability of the adversary controlling the majority in Biscotti is less than p.

Since p is an upper bound, we can safely use it to calculate the smallest committee size that bounds p below a threshold t. To obtain the minimum committee size, we use a brute force approach: we try out different committee sizes and pick the smallest size that causes p to fall below the threshold.
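This bound and the brute-force search can be computed directly; a small sketch follows (the function names are ours):

```python
from math import comb

def collusion_probability(k, s):
    """Binomial upper bound on an adversary with stake fraction s holding a
    strict majority of a committee of size k (the formula above)."""
    return sum(comb(k, i) * s**i * (1 - s)**(k - i)
               for i in range(k // 2 + 1, k + 1))

def min_committee_size(s, threshold, max_k=500):
    """Smallest committee size whose collusion probability is below threshold."""
    for k in range(1, max_k + 1):
        if collusion_probability(k, s) <= threshold:
            return k
    return None

# Example: 30% adversarial stake at a 0.01 threshold yields a committee of 26.
print(min_committee_size(0.30, 0.01))
```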

Figure 7 plots the minimum committee size needed against adversarial stake for probability thresholds of 0.01, 0.05 and 0.001. The minimum committee size is independent of the number of nodes and grows exponentially with adversarial stake in the system. Since our experimental evaluation is limited to training for 100 rounds, the probability threshold t of an adversary controlling the majority needs to be less than 0.01. For this threshold, a committee size of 26 protects against an adversary controlling 30% of the stake in the system.

6.2 Tolerating poisoning attacks

In this section, we evaluate Biscotti's performance when we subject the system to a label flipping poisoning attack as in Huang et al. [25]. We investigate the different parameter settings required for Biscotti to successfully defend against a poisoning attack and then evaluate how well it performs under attack from 30% malicious nodes compared to a federated learning baseline.

Attack Rate vs Received Updates. Biscotti requires each peer in the verification committee to collect a sufficient sample of updates before running Multi-KRUM.


Figure 7: Size of committee needed such that the probability of an adversary successfully colluding is below a threshold.


Figure 8: Evaluating the effect of the number of sampled updates each round on KRUM's performance.

We evaluated the effect of collecting varying percentages of the total updates in each round on the effectiveness of Multi-KRUM with different poisoning rates for the MNIST dataset. To ensure uniformity and to eliminate latency effects in the collection of updates, in these experiments the verifiers waited for updates from all peers that are not assigned to a committee and then randomly sampled a specified number of updates. In addition, we also ensured that all verifiers deterministically sampled the same set of updates by using the last block hash as the random seed. This allows us to investigate how well Multi-KRUM performs in a decentralized ML setting like Biscotti, unlike the original Multi-KRUM paper [11] in which updates from all clients are collected before running Multi-KRUM.

To evaluate the success of an attack, we define the attack rate as the fraction of target labels that are incorrectly predicted. An attack rate of 1 specifies a successful attack, while a value close to zero indicates an unsuccessful one.

Figure 8 shows the results. As the percentage of poisoners in the system increases, a higher fraction of updates needs to be collected for Multi-KRUM to be effective (to achieve a low attack rate). A large sample ensures that the poisoners do not gain a majority in the set being fed to Multi-KRUM; otherwise Multi-KRUM cannot prevent poisoned updates from leaking into the model. The results show that each round, updates need to be sampled from 70% of the peers to prevent poisoning by 30% of the nodes.


Figure 9: KRUM's performance in defending against a 30% attack on the MNIST dataset for different settings of ε noise.


Figure 10: Federated learning and Biscotti's test error on the Credit Card dataset with 30% of poisoners.

Attack Rate vs Noise. To ensure that updates are kept private in the verification stage, differentially private noise is added to each update before it is sent to the verifier. This noise is parametrized by ε and δ. The δ parameter indicates the probability with which plain ε-differential privacy is broken and is ideally set to a value lower than 1/|d|, where d is the dimension of the dataset. Hence, we set δ to 10^−5 in all our experiments. ε represents privacy loss, and a lower value of ε indicates that more noise is added to the updates. We investigate the effect of the ε value on the performance of Multi-KRUM with 30% poisoners in a 100-node deployment with 70 received updates in each round on the MNIST dataset. Figure 9 shows that Multi-KRUM loses its effectiveness at values of ε ≤ 1 but performs well at values of ε ≥ 2.

Biscotti vs Federated Learning Baseline. We deployed Biscotti and federated learning and subjected both systems to a poisoning attack while training on the Credit Card and MNIST datasets. Using the parameters from the above experiments, Biscotti sampled 70% of the updates and used a value of 2 for the privacy budget ε.

We introduced 30% of the peers into the system with the same malicious dataset: for Credit Card they labeled all defaults as non-defaults, and for MNIST these peers labeled all 1s as 7s.


Figure 11: Federated learning and Biscotti's test error on the MNIST dataset with 30% of poisoners.


Figure 12: Attack rate comparison of federated learning and Biscotti on the MNIST dataset with 30% of poisoners.

Figures 10 and 11 show the test error for both datasets compared to federated learning. The results show that for both datasets the poisoned federated learning deployment struggled to converge. By contrast, Biscotti performed as well as the un-poisoned federated learning baseline on the test set.

We show the attack rate in Biscotti and federated learning for MNIST in Figure 12. For the CreditCard dataset, the attack rate is the same as the test error since there is only one class in the dataset. The figure illustrates that in federated learning, since no defense mechanisms exist, a poisoning attack creates a tug of war between honest and malicious peers. By contrast, in Biscotti the attack rate settles after some initial jostling.

The jostling results in Biscotti taking a larger number of iterations to converge than un-poisoned federated learning. Convergence is slower for Biscotti because a small verification committee of 3 peers was used for this experiment. A smaller committee size allowed the poisoning peers to frequently control a majority in the committee and accept updates from fellow malicious peers. Biscotti eventually converged because the honest peers gained influence over time, reducing the chance that malicious peers take control of the majority vote in each round. In this experiment, the stake of the honest clients went from 70% to 87%.



Figure 13: Federated learning and Biscotti's test error on the MNIST dataset with 30% of poisoners and a larger committee size.


Figure 14: Attack rate comparison of federated learning and Biscotti on the MNIST dataset with 30% of poisoners and a larger committee size.

To demonstrate the effect of the committee size on convergence under attack, we run the same experiment for the MNIST dataset with an increased committee size of 26 peers. As shown in Section 6.1, a committee size of 26 provides protection against an adversary controlling 30% of the stake in the system. Since peers that are part of committees do not contribute updates in that particular round, 100 nodes are not enough to collect the 70% of updates needed to protect against poisoning with Multi-KRUM. Therefore, for this experiment we increased the number of nodes to 200. The results in Figure 13 show that, with the increased committee size, Biscotti converges in the same number of iterations as un-poisoned federated learning. Figure 14 shows that the attack rate stays below 24.9% after the first 5 iterations because the poisoners were unable to get their updates accepted during the training process.

6.3 Privacy evaluation

In this section, we evaluate the privacy provided by secure aggregation in Biscotti. We subject Biscotti to an information leakage attack [33] and demonstrate that the effectiveness of this attack decreases with the number of securely aggregated updates in a block. We also show that the probability of a successful collusion attack to recover an individual client's private gradient decreases as the size of the committees grows.

Information leakage from aggregated gradients. We subject the aggregated gradients from several different datasets to the gradient-based information leakage attack described in [33]. We invert the aggregated gradient, knowing that the gradient of the weights associated with each class in the fully connected softmax layer is directly proportional to the input features. To infer the original features, we take the gradients for a single class and invert them with respect to all the classes in the CIFAR-10 (10 classes of objects/animals) and LFW (2 classes, male/female) datasets. We also invert the gradients for three classes (0, 3, 5) on the MNIST dataset. We visualize the gradient in grayscale after reshaping it to the original feature dimensions in Figure 15. The aggregated gradient contains data sampled from a mixture of classes, including the target class. Our results show that having a larger number of participants in the aggregate decreases the impact of this attack. As shown in Figure 15, as the number of updates batched together increases, it becomes harder to distinguish individual training examples.

By default Biscotti aggregates/batches 35 updates. Figure 15 illustrates how difficult it is to identify individual class examples from the inverted images. It may be possible to infer what a class looks like from the inversions, but if the class does not represent an individual's training set, secure aggregation provides privacy. For example, in LFW we obtain an image that represents what a male/female looks like, but we do not gain any information about the individual training examples. Hence, the privacy gained depends on how close the training examples are to the class representative.

Recovering a client's individual gradient. We also evaluate a proposed attack on the noising protocol, which aims to de-anonymize peer gradients. This attack is performed when a verifier colludes with several malicious peers. When bootstrapping the system, the malicious peers pre-commit noise that sums to 0. As a result, when a noising committee selects these noise elements for verification, the masked gradient is not actually masked by any noise, allowing a malicious peer to perform an information leakage attack on the victim.

We evaluated the effectiveness of this attack by varying the proportion of malicious stake in the system and calculating the probability that the adversary can unmask the updates for various sizes of the noising committee. A malicious peer can unmask an update if it controls all the noisers for a peer and has at least one verifier in the verification committee. Figure 16 shows the probability of a privacy violation as the proportion of adversarial stake increases, for noising committee sizes of 3, 5, and 10.

When the number of noisers for an iteration is 3, an adversary needs at least 15% of the stake to successfully unmask an SGD update.


Figure 15: The results of an information leakage attack on different numbers of aggregated gradients for different classes.


Figure 16: Probability of a successful collusion attack to recover an individual client's gradient.

This trend continues when 5 noisers are used: over 30% of the stake must be malicious. When the number of noisers is 10 (which adds minimal overhead according to Figure 18), privacy violations do not occur even with 50% of malicious stake. By using stake-based consistent hashing to select noising clients, Biscotti prevents adversaries from performing information leakage attacks on other clients unless their proportion of stake in the system is overwhelmingly large.
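To make the structure of this analysis concrete, the following Monte Carlo sketch estimates the probability that colluders control every noiser chosen for a victim and hold at least one verifier seat. It assumes, as a simplification of stake-weighted consistent hashing, that each committee seat is won independently with probability equal to the adversary's stake fraction; the function names are ours, not the prototype's:

import random

def collusion_probability(adv_stake, n_noisers, n_verifiers, trials=100000, seed=0):
    """Estimate P(adversary controls all noisers AND holds >= 1 verifier seat),
    assuming each seat is won independently with probability adv_stake."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        all_noisers = all(rng.random() < adv_stake for _ in range(n_noisers))
        any_verifier = any(rng.random() < adv_stake for _ in range(n_verifiers))
        if all_noisers and any_verifier:
            hits += 1
    return hits / trials

# Example: 30% adversarial stake, 5 noisers, 3 verifiers.
# print(collusion_probability(0.30, 5, 3))

Under this independence simplification the probability reduces to adv_stake^n_noisers * (1 - (1 - adv_stake)^n_verifiers); the simulation is only meant to show how the committee-selection events compose.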

6.4 Performance, scalability and fault tolerance

In this section, we evaluate the overhead of each stage in Biscotti and investigate how the overhead changes as the number of peers increases. In addition, we measure the effect on Biscotti's performance as we scale the size of the different committees. We compare the performance of Biscotti against federated learning. Finally, we examine whether churn affects Biscotti's convergence.


Figure 17: Breakdown of time spent in different mechanisms in Biscotti (Figure 3) in deployments with varying numbers of peers.

Performance cost break-down. To break down the overhead in Biscotti, we deployed Biscotti over a varying number of peers training on MNIST. We captured the total amount of time spent in each of the major phases of our algorithm in Figure 3: collecting the noise from each of the noising clients (steps 2 and 3), executing verification via Multi-KRUM and collecting the signatures (steps 4 and 5), and securely aggregating the SGD update (steps 6 and 7). Figure 17 shows the breakdown of the average cost per iteration for each stage in deployments of 40, 60, 80, and 100 nodes over 3 runs. During this experiment, the committee sizes were kept constant at the default values in Table 3.

The results show that the cost of each stage is almost constant as we vary the number of peers in the system. Biscotti spends most of its time in the aggregation phase since it requires the aggregators to collect secret shares of all the accepted updates and share the aggregated shares with each other to generate a block. The noising phase is the fastest since it only involves a round trip to each of the noisers, while the verification stage involves collecting a predefined percentage (70%) of updates from all the nodes in an asynchronous manner to run Multi-KRUM. The time per iteration also stays fairly constant as the number of nodes in the system increases.


Figure 18: The average amount of time taken per iteration with an increasing number of noisers, verifiers, or aggregators.

Scaling up committee sizes in Biscotti. We evaluate Biscotti's performance as we change the size of the noiser/verifier/aggregator sets. For this we deploy Biscotti on Azure with the MNIST dataset and a fixed size of 100 peers, and vary only the number of noisers used for each SGD update, the number of verifiers in each verification committee, and the number of aggregators used in secure aggregation. Each time one of these sizes was changed, the change was performed in isolation; the other committees used a size of 3. Figure 18 plots the average time taken per iteration over 3 runs of the system.

The results show that increasing the number of noisers, verifiers, or aggregators increases the time per iteration. The iteration time increases slightly with more noisers because additional peers must be contacted for noise. Increasing the number of aggregators leads to a larger overhead because the secret shares are distributed to more aggregators and recovering the aggregate requires coordination among more peers. Lastly, a large number of verifiers results in frequent timeouts in the mining stage. Verifiers wait for the first 70 updates and select 37 updates, while miners wait for shares from the first 35 updates before initiating the aggregate recovery process. With an increased verifier set, the size of the intersection of updates accepted by a majority of verifiers frequently falls below 35 because each verifier runs Multi-KRUM on a different set of updates. Hence, the aggregators wait until a timeout for 35 updates when the actual number of updates accepted during the round is lower. Since the timeout is a constant value of 90 seconds, the verifier overhead does not increase significantly when the verifier set grows from 10 to 26. However, this increased verifier overhead could be lowered by decreasing the number of updates that aggregators wait for before starting the coordination.

Baseline performance. We execute Biscotti in a baseline deployment and compare it to the original federated learning baseline [32].


Figure 19: Comparing the time to convergence of Biscotti to a federated learning baseline.


Figure 20: Comparing the iterations to convergence of Biscotti to a federated learning baseline.

We run Biscotti with 5 noisers, 26 verifiers, and 26 aggregators. As demonstrated earlier, these committee sizes protect against an adversary holding 30% of the stake unmasking updates in the noising (Section 6.3) and aggregation (Section 6.1) stages. They also ensure convergence under a poisoning attack by 30% of the nodes (Section 6.2). The MNIST dataset [30] was divided into 200 equal partitions, each of which was shared with an honest peer on an Azure cluster of 20 VMs, with each VM hosting 10 peers. These Biscotti/federated learning peers collaborated on training an MNIST softmax classifier, and after 100 iterations both models approached the global optimum. To ensure a fair comparison, the number of updates included in the model each round is kept the same: federated learning selects 35% of the 200 nodes at random every round and receives updates from them, while Biscotti includes the first 35% of verified updates in the block. The convergence rate over time for both systems is plotted in Figure 19, and the same convergence over the number of iterations is shown in Figure 20. In this deployment, Biscotti takes about 13.8 times longer than federated learning (266.7 minutes vs 20.8 minutes), yet achieves similar model performance (92% accuracy) after 100 iterations.

Training with node churn. A key feature of Biscotti's P2P design is that it is resilient to node churn (node joins and failures).



Figure 21: The impact of churn on model performance.

For example, the failure of any Biscotti node does not block or prevent the system from converging. We evaluate Biscotti's resilience to peer churn by performing a CreditCard deployment with 100 peers and failing a peer at random at a specified rate. In each specified time interval, a peer is chosen at random and killed. In the next time interval, a new peer joins the network and is bootstrapped, maintaining a constant total of 100 peers. Figure 21 shows the rate of convergence for failure rates of 4, 2, and 1 peer(s) failing and joining per minute. When a verifier or an aggregator fails, Biscotti moves to the next iteration after a timeout, so such failures do not harm convergence.

Even with churn, Biscotti is able to make progress towards the global objective. We found that Biscotti is resilient to churn rates of up to 4 nodes per minute (1 node joining and 1 node failing every 15 seconds).

7 Limitations

KRUM limitations. For KRUM to be effective against a large number of poisoners, it needs to observe a large number of samples in each round. This may not always be possible in a decentralized system with node churn. In particular, this may compromise the system because adversaries may race to contribute a majority of the samples. In addition, Multi-KRUM could also reject updates from peers whose data differs from that of the rest of the peers, e.g., if only one peer in the system has 1s. We did not face this issue in our experiments because we partitioned the data uniformly. However, Biscotti is compatible with other poisoning detection approaches and can integrate any poisoning detection mechanism that uses SGD updates and does not require individual update histories. For example, a previous version of Biscotti used RONI [8] to validate updates.

Deep Learning. We showed that Biscotti is able to train softmax and logistic regression models with 7,850 and 25 parameters, respectively. However, Biscotti might not scale to a large deep learning model with millions of parameters due to the communication overhead. Strategies to reduce this overhead might involve learning updates in a restricted parameter space or compressing model updates [29]. We leave the training of deep learning models with Biscotti to future work.

Leakage from the aggregate model. Since no differential privacy is added to the updates recorded in the ledger, Biscotti is vulnerable to attacks that exploit privacy leakage

from the model itself [21, 45]. Apart from differential privacy, these attacks can also be mitigated by adding proper regularization [45, 33], such as dropout, or they may be irrelevant if a class does not represent an individual's training data.

Stake limitations. The stake that a client possesses plays a significant role in determining the chance that the client is selected as a noiser/verifier/aggregator. Our assumption is that a large stake-holder will not subvert the system because (1) they accrue more stake by participating and (2) their stake is tied to a monetary reward at the end of training. However, a malicious client could pose as honest, accrue stake for a pre-defined period of time, and then turn malicious to subvert the system. We leave the design of a robust stake mechanism that protects against such attacks to future work.

8 Related work

Biscotti’s novelty lies in its ability to simultaneously handlepoisoning attacks and preserve privacy in a P2P multi-partytraining context.Securing ML. Similar to KRUM, AUROR [48] and ANTI-DOTE [44] are alternative techniques to defend against poi-soning that rely on anomaly detection. AUROR has beenmainly proposed for the model averaging use case and usesk-means clustering on a subset of important features to detectanomalies. ANTIDOTE uses a robust PCA detector to pro-tect against attackers trying to evade anomaly-based meth-ods.

Other defenses such as TRIM [26] and RONI [8] filter out poisoned data from a centralized training set based on its impact on performance on the dataset. TRIM trains a model to fit a subset of samples selected at random, and identifies a training sample as an outlier if the error when fitting the model to the sample is higher than a threshold. RONI trains a model with and without a data point and rejects the point if it degrades the performance of the classifier by a certain amount.

Finally, Baracaldo et al. [7] employ data provenance as a measure against poisoning attacks by tracking the history of the training data and removing information from anomalous sources.

Poisoning attacks. Apart from the label flipping attacks [25] we evaluated, gradient ascent techniques [26, 37] are a popular way of generating poisoned samples one at a time by solving a bi-level optimization problem. Backdoor attacks have also been proposed to make the classifier misclassify an image if it contains certain pixels or a backdoor key. Defending against such attacks during training is a hard and open problem in the literature.

Privacy attacks. Shokri et al. [50] demonstrated a membership inference attack using shadow model training that learns whether a victim's data was used to train the model. Follow-up work [45] showed that it is quite easy to launch this attack in a black-box setting. In addition, model inversion attacks [21] have been proposed to invert class images from


the final trained model. However, we assume that these attacks are only effective if a class represents a significant chunk of a person's data, e.g., in facial recognition systems.

Additional security work in federated learning has required a public key infrastructure that validates the identity of users [12], but prior sybil work has shown that relying on public key infrastructure for user validation is insufficient [52, 53].

Privacy-preserving ML. Cheng et al. recently proposed the use of TEEs and privacy-preserving smart contracts to provide security and privacy to multi-party ML tasks [16], but some TEEs have been found to be vulnerable [51]. Another solution uses two-party computation [36] and encrypts both the model parameters and the training data when performing multi-party ML.

Differential privacy is often applied to multi-party ML systems to provide privacy [22, 4, 49]. However, its use introduces privacy-utility tradeoffs. Biscotti uses differential privacy during update validation, but does not push noise into the trained model, thus favouring utility.

9 Conclusion

The emergence of large scale multi-party ML workloads and distributed ledgers for scalable consensus have produced two rapid and powerful trends. Biscotti's design lies at their confluence. To the best of our knowledge, this is the first system to provide privacy-preserving peer-to-peer ML through a distributed ledger, while simultaneously considering poisoning attacks. And, unlike prior work, Biscotti does not rely on trusted execution environments or specialized hardware. In our evaluation we demonstrated that Biscotti can coordinate a collaborative learning process across 100 peers and produces a final model that is similar in utility to state of the art federated learning alternatives. We also illustrated its ability to withstand poisoning and information leakage attacks, as well as frequent failures and joins of nodes (one node joining and one node leaving every 15 seconds).

Our Biscotti prototype is open source, runs on commodity hardware, and interfaces with PyTorch, a popular framework for machine learning.

Acknowledgments

We would like to express our thanks to Margo Seltzer for her helpful feedback and comments. We would like to further thank Matheus Stolet for his help in evaluating the prototype at scale, and Heming Zhang for his help with bootstrapping the deployment of Biscotti on PyTorch. We would also like to thank Gleb Naumenko, Nico Ritschel, Jodi Spacek, and Michael Hou for their work on the initial Biscotti prototype.

This research has been sponsored by the Huawei Innovation Research Program (HIRP), Project No: HO2018085305. We also acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), 2014-04870.

References

[1] A CONIKS Implementation in Golang. https://github.com/coniks-sys/coniks-go, 2018.

[2] DEDIS Advanced Crypto Library for Go. https://github.com/dedis/kyber, 2018.

[3] go-python. https://github.com/sbinet/go-python, 2018.

[4] ABADI, M., CHU, A., GOODFELLOW, I., MCMAHAN, B., MIRONOV, I., TALWAR, K., AND ZHANG, L. Deep learning with differential privacy. In 23rd ACM Conference on Computer and Communications Security (2016), CCS.

[5] AZARIA, A., EKBLAW, A., VIEIRA, T., AND LIPPMAN, A. MedRec: Using blockchain for medical data access and permission management. OBD '16.

[6] BAGDASARYAN, E., VEIT, A., HUA, Y., ESTRIN, D., AND SHMATIKOV, V. How To Backdoor Federated Learning. ArXiv e-prints (2018).

[7] BARACALDO, N., CHEN, B., LUDWIG, H., AND SAFAVI, J. A. Mitigating poisoning attacks on machine learning models: A data provenance based approach. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security (2017), AISec.

[8] BARRENO, M., NELSON, B., JOSEPH, A. D., AND TYGAR, J. D. The Security of Machine Learning. Machine Learning 81, 2 (2010).

[9] BHAGOJI, A. N., CHAKRABORTY, S., MITTAL, P., AND CALO, S. Analyzing federated learning through an adversarial lens. In Proceedings of the 36th International Conference on Machine Learning (2019).

[10] BIGGIO, B., NELSON, B., AND LASKOV, P. Poisoning Attacks Against Support Vector Machines. In Proceedings of the 29th International Conference on Machine Learning (2012), ICML.

[11] BLANCHARD, P., EL MHAMDI, E. M., GUERRAOUI, R., AND STAINER, J. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems 30, NIPS. 2017.

[12] BONAWITZ, K., IVANOV, V., KREUTER, B., MARCEDONE, A., MCMAHAN, H. B., PATEL, S., RAMAGE, D., SEGAL, A., AND SETH, K. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (2017), CCS.

[13] BOTTOU, L. On-line learning in neural networks. Cambridge University Press, 1999.

[14] BOTTOU, L. Large-Scale Machine Learning with Stochastic Gradient Descent. COMPSTAT '10. 2010.

[15] CASTRO, M., AND LISKOV, B. Practical byzantine fault tolerance. In Proceedings of the Third Symposium on Operating Systems Design and Implementation (1999), OSDI '99.

[16] CHENG, R., ZHANG, F., KOS, J., HE, W., HYNES, N., JOHNSON, N., JUELS, A., MILLER, A., AND SONG, D. Ekiden: A platform for confidentiality-preserving, trustworthy, and performant smart contract execution. ArXiv e-prints (2018).

[17] DEAN, J., CORRADO, G. S., MONGA, R., CHEN, K., DEVIN, M., LE, Q. V., MAO, M. Z., RANZATO, M., SENIOR, A., TUCKER, P., YANG, K., AND NG, A. Y. Large scale distributed deep networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1 (2012), NIPS.

[18] DHEERU, D., AND KARRA TANISKIDOU, E. UCI machine learning repository, 2017.


[19] DOUCEUR, J. J. The sybil attack. IPTPS '02.

[20] DWORK, C., AND ROTH, A. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science 9, 3-4 (2014).

[21] FREDRIKSON, M., JHA, S., AND RISTENPART, T. Model Inversion Attacks That Exploit Confidence Information and Basic Countermeasures. In Proceedings of the 2015 ACM SIGSAC Conference on Computer and Communications Security (2015), CCS.

[22] GEYER, R. C., KLEIN, T., AND NABI, M. Differentially private federated learning: A client level perspective. NIPS Workshop: Machine Learning on the Phone and other Consumer Devices (2017).

[23] GILAD, Y., HEMO, R., MICALI, S., VLACHOS, G., AND ZELDOVICH, N. Algorand: Scaling byzantine agreements for cryptocurrencies. In Proceedings of the 26th Symposium on Operating Systems Principles (2017), SOSP.

[24] HITAJ, B., ATENIESE, G., AND PEREZ-CRUZ, F. Deep Models Under the GAN: Information Leakage from Collaborative Deep Learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (2017), CCS.

[25] HUANG, L., JOSEPH, A. D., NELSON, B., RUBINSTEIN, B. I., AND TYGAR, J. D. Adversarial Machine Learning. In Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence (2011), AISec.

[26] JAGIELSKI, M., OPREA, A., BIGGIO, B., LIU, C., NITA-ROTARU, C., AND LI, B. Manipulating machine learning: Poisoning attacks and countermeasures for regression learning. In 2018 IEEE Symposium on Security and Privacy, SP 2018, Proceedings, 21-23 May 2018, San Francisco, California, USA.

[27] KATE, A., ZAVERUCHA, G. M., AND GOLDBERG, I. Constant-size commitments to polynomials and their applications.

[28] KIAYIAS, A., RUSSELL, A., DAVID, B., AND OLIYNYKOV, R. Ouroboros: A provably secure proof-of-stake blockchain protocol. In Advances in Cryptology – CRYPTO 2017 (2017).

[29] KONEČNÝ, J., MCMAHAN, H. B., YU, F. X., RICHTARIK, P., SURESH, A. T., AND BACON, D. Federated learning: Strategies for improving communication efficiency. In NIPS Workshop on Private Multi-Party Machine Learning (2016).

[30] LECUN, Y., BOTTOU, L., BENGIO, Y., AND HAFFNER, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 11 (1998).

[31] MAXWELL, G., POELSTRA, A., SEURIN, Y., AND WUILLE, P. Simple schnorr multi-signatures with applications to bitcoin. Cryptology ePrint Archive, Report 2018/068, 2018.

[32] MCMAHAN, H. B., MOORE, E., RAMAGE, D., HAMPSON, S., AND Y ARCAS, B. A. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (2017), AISTATS.

[33] MELIS, L., SONG, C., DE CRISTOFARO, E., AND SHMATIKOV, V. Inference Attacks Against Collaborative Learning. ArXiv e-prints (2018).

[34] MICALI, S., VADHAN, S., AND RABIN, M. Verifiable random functions. In Proceedings of the 40th Annual Symposium on Foundations of Computer Science (1999), FOCS.

[35] MILLER, A., JUELS, A., SHI, E., PARNO, B., AND KATZ, J. Permacoin: Repurposing bitcoin work for data preservation. In Proceedings of the 2014 IEEE Symposium on Security and Privacy (2014), S&P.

[36] MOHASSEL, P., AND ZHANG, Y. SecureML: A system for scalable privacy-preserving machine learning, 2017.

[37] MUNOZ GONZALEZ, L., BIGGIO, B., DEMONTIS, A., PAUDICE, A., WONGRASSAMEE, V., LUPU, E. C., AND ROLI, F. Towards poisoning of deep learning algorithms with back-gradient optimization. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security (2017).

[38] NAKAMOTO, S. Bitcoin: A peer-to-peer electronic cash system.

[39] NARULA, N., VASQUEZ, W., AND VIRZA, M. zkLedger: Privacy-preserving auditing for distributed ledgers. In 15th USENIX Symposium on Networked Systems Design and Implementation (2018), NSDI.

[40] NELSON, B., BARRENO, M., CHI, F. J., JOSEPH, A. D., RUBINSTEIN, B. I. P., SAINI, U., SUTTON, C., TYGAR, J. D., AND XIA, K. Exploiting Machine Learning to Subvert Your Spam Filter. In Proceedings of the 1st Usenix Workshop on Large-Scale Exploits and Emergent Threats (2008), LEET.

[41] PASZKE, A., GROSS, S., CHINTALA, S., CHANAN, G., YANG, E., DEVITO, Z., LIN, Z., DESMAISON, A., ANTIGA, L., AND LERER, A. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop (2017).

[42] PEDERSEN, T. P. Non-interactive and information-theoretic secure verifiable secret sharing. In Proceedings of the 11th Annual International Cryptology Conference (1992), CRYPTO.

[43] RECHT, B., RE, C., WRIGHT, S., AND NIU, F. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems 24, NIPS. 2011.

[44] RUBINSTEIN, B. I., NELSON, B., HUANG, L., JOSEPH, A. D., LAU, S.-H., RAO, S., TAFT, N., AND TYGAR, J. D. ANTIDOTE: Understanding and Defending Against Poisoning of Anomaly Detectors. In Proceedings of the 9th ACM SIGCOMM Conference on Internet Measurement (2009), IMC.

[45] SALEM, A., ZHANG, Y., HUMBERT, M., FRITZ, M., AND BACKES, M. ML-Leaks: Model and data independent membership inference attacks and defenses on machine learning models.

[46] SCHNORR, C. P. Efficient identification and signatures for smart cards. In Advances in Cryptology — CRYPTO '89 Proceedings (1990).

[47] SHAMIR, A. How to share a secret. Communications of the ACM (1979).

[48] SHEN, S., TOPLE, S., AND SAXENA, P. Auror: Defending against poisoning attacks in collaborative deep learning systems. In Proceedings of the 32nd Annual Conference on Computer Security Applications (2016), ACSAC.

[49] SHOKRI, R., AND SHMATIKOV, V. Privacy-preserving deep learning. In Proceedings of the 2015 ACM SIGSAC Conference on Computer and Communications Security (2015), CCS.

[50] SHOKRI, R., STRONATI, M., SONG, C., AND SHMATIKOV, V. Membership inference attacks against machine learning models. In IEEE Symposium on Security and Privacy (2017), IEEE Computer Society, pp. 3–18.

[51] VAN BULCK, J., MINKIN, M., WEISSE, O., GENKIN, D., KASIKCI, B., PIESSENS, F., SILBERSTEIN, M., WENISCH, T. F., YAROM, Y., AND STRACKX, R. Foreshadow: Extracting the keys to the Intel SGX kingdom with transient out-of-order execution. In Proceedings of the 27th USENIX Security Symposium (2018), USENIX SEC.

[52] VISWANATH, B., BASHIR, M. A., ZAFAR, M. B., BOUGET, S., GUHA, S., GUMMADI, K. P., KATE, A., AND MISLOVE, A. Strength in numbers: Robust tamper detection in crowd computations. In Proceedings of the 3rd ACM Conference on Online Social Networks (2015), COSN.

[53] WANG, G., WANG, B., WANG, T., NIKA, A., ZHENG, H., AND ZHAO, B. Y. Defending against sybil devices in crowdsourced mapping services. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services (2016), MobiSys.


Appendix A Federated learning and Distributed SGD

Given a set of training data, a model structure, and a proposed learning task, ML algorithms train an optimal set of parameters, resulting in a model that optimally performs this task. In Biscotti, we assume stochastic gradient descent (SGD) [14] as the optimization algorithm.

In federated learning [32], a shared model is updated through SGD. Each client uses their local training data and their latest snapshot of the shared model state to compute the optimal update on the model parameters. The model is then updated and clients update their local snapshot of the shared model before performing a local iteration again. The model parameters w are updated at each iteration t as follows:

w_{t+1} = w_t - \eta_t \left( \lambda w_t + \frac{1}{b} \sum_{(x_i, y_i) \in B_t} \nabla l(w_t, x_i, y_i) \right)    (1a)

where η_t represents a degrading learning rate, λ is a regularization parameter that prevents over-fitting, B_t represents a gradient batch of local training data examples (x_i, y_i) of size b, and ∇l represents the gradient of the loss function.

SGD is a general learning algorithm that can be used to train a variety of models, including neural networks [14]. A typical heuristic involves running SGD for a fixed number of iterations or halting when the magnitude of the gradient falls below a threshold. When this occurs, model training is considered complete and the shared model state w_t is returned as the optimal model w*.

In a multi-party ML setting, federated learning assumes that clients possess training data that is not identically and independently distributed (non-IID) across clients. In other words, each client possesses a subset of the global dataset with properties distinct from the global distribution.

When performing SGD across clients with partitioned data sources, we redefine the SGD update Δ_{i,t} w_g of each client i at iteration t to be:

\Delta_{i,t} w_g = \lambda \eta_t w_g + \frac{\eta_t}{b} \sum_{(x, y) \in B_{i,t}} \nabla l(w_g, x, y)    (1b)

where the distinction is that the gradient is computed on a global model w_g, and the gradient steps are taken using a local batch B_{i,t} of data from client i. When all SGD updates are collected, they are averaged and applied to the model, resulting in a new global model. The process then proceeds iteratively until convergence.
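A minimal sketch of Equation (1b) and the averaging step, assuming a caller-supplied grad_loss(w, x, y) that returns the per-example loss gradient (that helper and the function names are illustrative, not part of the Biscotti prototype):

import numpy as np

def local_sgd_update(w_global, batch_x, batch_y, grad_loss, lr, reg):
    """Client i's update on the global model (Eq. 1b):
    reg * lr * w_g + (lr / b) * sum of per-example gradients."""
    b = len(batch_x)
    grad_sum = sum(grad_loss(w_global, x, y) for x, y in zip(batch_x, batch_y))
    return reg * lr * w_global + (lr / b) * grad_sum

def apply_round(w_global, client_updates):
    """Average the collected client updates and apply them to the global model."""
    return w_global - np.mean(client_updates, axis=0)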

To increase privacy guarantees in federated learning, secure aggregation protocols have been added at the central server [12] such that no individual client's SGD update is directly observable by the server or other clients. However, this relies on a centralized service to perform the aggregation and does not provide security against adversarial attacks on ML.

Appendix B Differentially private stochastic gradient descent

Data: Batch size b, learning rate η_t, privacy parameters (ε, δ), expected update dimension d
Result: Precommitted noise ζ_t for an SGD update, for all iterations t ∈ [1..T]
for iteration t ∈ [1..T] do
    // Sample noise of length d for each expected sample in the batch
    for example i ∈ [1..b] do
        Sample noise ζ_i = N(0, σ²I) where σ = √(2 log(1.25/δ)) / ε
    end
    ζ_t = (η_t / b) Σ_i ζ_i
    Commit ζ_t to the genesis block at column t.
end

Algorithm 1: Precommitting differentially private noise for SGD, taken from Abadi et al. [4].

We use the concept of (ε, δ) differential privacy as explained in Abadi et al. [4]. Each SGD step becomes (ε, δ) differentially private if we sample normally distributed noise as shown in Algorithm 1. Each client commits noise to the genesis block for all expected iterations T. We also require that the norm of the gradients be clipped to a maximum of 1 so that the noise does not completely obfuscate the gradient.

This precommitted noise is designed such that a neutral third party can aggregate a client update Δ_{i,t} w_g from Equation (1b) and precommitted noise ζ_t from Algorithm 1 without any additional information. The noise is generated without any prior knowledge of the SGD update it will be applied to, while retaining the computation and guarantees provided by prior work. The noisy SGD update follows from aggregation:

\tilde{\Delta}_{i,t} w_g = \Delta_{i,t} w_g + \zeta_t
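A sketch of this precommitment step under the assumptions above (a constant learning rate in place of η_t, and gradients clipped to norm 1 elsewhere); the function names and array layout are ours, not the prototype's:

import numpy as np

def precommit_noise(T, b, d, lr, eps, delta, seed=0):
    """Generate one noise vector per iteration (Algorithm 1 sketch).
    Returns a (T, d) array; row t is committed to the genesis block at column t."""
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    zeta = np.empty((T, d))
    for t in range(T):
        per_example = rng.normal(0.0, sigma, size=(b, d))  # zeta_i ~ N(0, sigma^2 I)
        zeta[t] = (lr / b) * per_example.sum(axis=0)        # zeta_t = (eta_t / b) * sum_i zeta_i
    return zeta

def add_noise(update, zeta_t):
    """Noisy update released for verification: Delta + zeta_t."""
    return update + zeta_t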

Appendix C Polynomial commitments and verifiable secret sharing

Polynomial commitments [27] are a scheme that allows commitments to a secret polynomial for verifiable secret sharing [47]. This allows the committer to distribute secret shares of a secret polynomial among a set of nodes, along with witnesses that prove in zero-knowledge that each secret share belongs to the committed polynomial. The polynomial commitment is constructed as follows:

Given two groups G_1 and G_2 with generators g_1 and g_2 of prime order p such that there exists an asymmetric bilinear pairing e : G_1 × G_2 → G_T for which the t-SDH assumption holds, a commitment public key (PK) is generated such that


PK = \{g_1, g_1^{\alpha}, g_1^{\alpha^2}, \ldots, g_1^{\alpha^t}\} \in G_1^{t+1}

where α is the secret key. The committer can create a commitment to a polynomial \phi(x) = \sum_{j=0}^{t} \phi_j x^j of degree t using the commitment PK such that:

COMM(PK, \phi(x)) = \prod_{j=0}^{\deg(\phi)} \left( g_1^{\alpha^j} \right)^{\phi_j}

Given a polynomial φ(x) and a commitment COMM(φ(x)), it is trivial to verify whether the commitment was generated using the given polynomial. Moreover, we can multiply two commitments to obtain a commitment to the sum of the polynomials in the commitments by leveraging their homomorphic property:

COMM(\phi_1(x) + \phi_2(x)) = COMM(\phi_1(x)) \cdot COMM(\phi_2(x))

Once the committer has generated COMM(φ(x)), it can carry out an (n, t) secret sharing scheme to share the polynomial among a set of n participants such that, in the recovery phase, a subset of at least t participants can compute the secret polynomial. All secret shares (i, φ(i)) given to the participants are evaluations of the polynomial at a unique point i and are accompanied by a commitment to a witness polynomial COMM(ψ_i(x)) such that

\psi_i(x) = \frac{\phi(x) - \phi(i)}{x - i}

By leveraging the divisibility property of

the two polynomials {φ(x), ψ_i(x)} and the bilinear pairing function e, it is trivial to verify that a secret share comes from the committed polynomial [27]. This is carried out by evaluating whether the following equality holds:

e(COMM(\phi_i(x)), g_2) \stackrel{?}{=} e\left( COMM(\psi_i(x)), \frac{g_2^{\alpha}}{g_2^{i}} \right) \cdot e(g_1, g_2)^{\Delta w_i(i)}

If the above holds, then the share is accepted. Otherwise, the share is rejected.

This commitment scheme is unconditionally binding and computationally hiding, given that the Discrete Logarithm assumption holds [27]. The scheme can be extended to provide unconditional hiding by combining it with Pedersen commitments [42]; however, we do not implement this.
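For intuition, the following sketches only the plain (n, t) secret-sharing layer that the commitments attach to, over a toy prime field and without the commitments, witnesses, or pairing checks; all names and the field size are illustrative assumptions:

import random

PRIME = 2**61 - 1  # toy prime field; a real deployment works in the pairing group order

def make_shares(secret, n, t, seed=0):
    """Evaluate a random degree-(t-1) polynomial with phi(0) = secret at points 1..n;
    any t of the resulting shares suffice to recover the secret."""
    rng = random.Random(seed)
    coeffs = [secret % PRIME] + [rng.randrange(PRIME) for _ in range(t - 1)]
    def phi(x):
        return sum(c * pow(x, j, PRIME) for j, c in enumerate(coeffs)) % PRIME
    return [(i, phi(i)) for i in range(1, n + 1)]

def recover(shares):
    """Lagrange interpolation at x = 0 from any t shares."""
    secret = 0
    for xi, yi in shares:
        num, den = 1, 1
        for xj, _ in shares:
            if xj != xi:
                num = (num * (-xj)) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret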

Appendix D Verifiable random functions

A verifiable random function VRF_sk(x) takes as input a random seed x and a secret key sk. It outputs two values: a hash and a proof. The hash output is a hashlen-bit value that is uniquely determined by sk and is therefore unique to a peer. The proof allows anyone with the peer's public key pk to check that the hash was indeed generated by a client holding the corresponding private key. Therefore, each client in the system can deterministically produce a hash that cannot be faked and is unique to that client for the given seed value.

In Biscotti, a VRF is used to select the unique committee that provides noise to a peer for protecting its update. A peer uses as inputs to the VRF its own secret key and the SHA-256 hash of the previous block. To select a committee, the hash output of the VRF is fed to a consistent hashing procedure: the hash is mapped onto a hash ring where each peer is assigned a segment proportional to its stake (see Figure 5). The peer in whose portion of the ring the hash lies is selected as part of the committee. The process is repeated until the committee reaches the required size.
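A minimal sketch of this stake-weighted consistent hashing step; the data layout, re-hashing loop, and names are illustrative assumptions, and the prototype's actual ring construction may differ:

import hashlib

def select_committee(vrf_hash, stake, size):
    """Select `size` distinct peers. Each peer owns a slice of [0, 1) proportional
    to its stake; the VRF output (bytes) is re-hashed until enough distinct peers
    have been hit. Assumes size <= number of peers."""
    total = float(sum(stake.values()))
    ring, acc = [], 0.0
    for peer_id in sorted(stake):            # fixed order so every peer builds the same ring
        acc += stake[peer_id] / total
        ring.append((acc, peer_id))

    committee, digest = [], vrf_hash
    while len(committee) < size:
        digest = hashlib.sha256(digest).digest()
        point = int.from_bytes(digest, "big") / 2.0**256
        peer = next((pid for bound, pid in ring if point < bound), ring[-1][1])
        if peer not in committee:
            committee.append(peer)
    return committee

# Example: select_committee(vrf_output_bytes, {"peer_a": 10, "peer_b": 5, "peer_c": 5}, 2)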

Appendix E Information leakage attacks

When doing collaborative learning, each client computes their gradients by back-propagating the loss through the entire network, from the last layer to the first layer. The gradient for a layer is computed using the layer's features and the error propagated from the layer above. If the layers are sequential and fully connected, then the output of layer l+1 is computed as follows:

h_{l+1} = W_l h_l

where W_l is the weight matrix. Hence, the gradient of the error with respect to W_l is computed as follows:

\frac{dE}{dW_l} = \frac{dE}{dh_{l+1}} \cdot h_l

Note that the gradient of the weights dE/dW_l is computed using the inner products from the layer above and the features of that particular layer. Therefore, the gradients of the first layer are proportional to the input features. By exploiting this property, observation of gradient updates can be used to infer feature values, which are in turn based on the participant's private training data.

In the information leakage attack that we launch on Biscotti, we select the values of the gradient in the first layer that correspond to a certain class, rescale the values to lie between (0, 255), and visualize the resulting image.
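A sketch of that visualization step for a softmax model whose first-layer gradient has shape (num_classes, num_features); the function name and normalization details are our assumptions:

import numpy as np

def invert_class_gradient(first_layer_grad, class_idx, image_shape):
    """Rescale one class's first-layer gradient row to [0, 255] and reshape
    it to the input dimensions so it can be viewed as a grayscale image."""
    g = np.asarray(first_layer_grad, dtype=np.float64)[class_idx]
    g = (g - g.min()) / (g.max() - g.min() + 1e-12)  # normalize to [0, 1]
    return (255.0 * g).reshape(image_shape).astype(np.uint8)

# Example (MNIST): invert_class_gradient(grad_W, class_idx=3, image_shape=(28, 28))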

Appendix F Proof of stake

Blockchain-based systems need a mechanism to prevent arbitrary peers from proposing blocks and extending the chain at the same time. Otherwise, anyone could extend the blockchain, and nothing would stop it from developing forks at a rate equal to the number of users; eventually there would be no consensus.


A rate limiting solution is needed to prevent the ledger from being extended arbitrarily by all the peers. Proof of Work (POW) and Proof of Stake (POS) are two popular solutions to this problem.

Proof of Work uses a puzzle to solve the rate limiting problem, but it has limitations. Peers who want to propose the next block must solve a hard cryptographic puzzle that is trivial for other peers to verify. The peer that solves the puzzle first gets to propose the next block. The puzzle acts as a rate limiter because peers gain no advantage in being the next proposer by creating new identities (Sybils). In addition to preventing Sybils, it acts as a probabilistic back-off mechanism that limits forks in the system. It does not completely eliminate forks, and users have to wait for a certain number of blocks (6 blocks in Bitcoin) before their transaction is confirmed as accepted. This leads to long wait times (60 minutes for Bitcoin) for transactions to be accepted. Furthermore, it is energy consuming since solving the puzzles takes a lot of computational power. Despite these limitations of long wait times and high energy consumption, POW is a widely used solution.

To mitigate the problems of Proof of Work, Proof of Stake has recently been proposed as an alternative. POS-based systems assign the responsibility of proposing blocks each round to specific peers. The probability that a peer is selected to propose is proportional to the value (stake) that they hold in the system. In cryptocurrencies, the stake of a peer is the amount of money they have in the system. Based on the distribution of stake at that particular time, a predefined algorithm chooses a peer or subset of peers responsible for proposing the next block. The algorithm ensures that at any time a group of malicious nodes holding a certain amount of stake in the system cannot act maliciously and take over the system. All other users observe the protocol messages, which allows them to learn the agreed-upon block.

For cryptocurrencies, Algorand [23] is a popular proposal for a proof-of-stake based system. In Algorand, each peer is assigned a priority for two particular roles: block proposal and block selection. To determine their priority for a particular role, each peer runs a verifiable random function (VRF) (see Appendix D) using their private key, the role (selection/proposal), and a random seed that is public information on the blockchain. The VRF outputs a pseudo-random hash value that the user passes through cryptographic sortition to determine their influence in proposing or selecting a block for that particular round. At a high level, cryptographic sortition is a random algorithm that assigns each user a priority proportional to the user's account balance. If the user's priority is above a threshold, they are selected for that particular role. Subsequently, all users assigned a proposal role propose a block based on their priority. To reach consensus on a single block proposal, peers assigned the selection role agree

on one proposal for the next block using a multi-step Byzantine Agreement protocol. In each step, the peers responsible for that step vote for a proposal until, in some step, enough users have agreed on a proposal. This proposal becomes the next block in the chain.

Similar to Algorand, Biscotti uses verifiable random functions to assign roles to peers in the system. However, instead of using the cryptographic sortition algorithm, Biscotti uses the consistent hashing protocol described in Section 4.3 to select a peer for a role such that the probability of being selected is proportional to the peer's stake. In Biscotti, we define stake to be the reputation that a peer acquires over the training process by contributing updates.
