
Analyzing Federated Learning through an Adversarial Lens

Arjun Nitin Bhagoji*1, Supriyo Chakraborty2, Prateek Mittal1, Seraphin Calo2

Abstract

Federated learning distributes model training among a multitude of agents, who, guided by privacy concerns, perform training using their local data but share only model parameter updates, for iterative aggregation at the server to train an overall global model. In this work, we explore how the federated learning setting gives rise to a new threat, namely model poisoning, different from traditional data poisoning. Model poisoning is carried out by an adversary controlling a small number of malicious agents (usually 1) with the aim of causing the global model to misclassify a set of chosen inputs with high confidence. We explore a number of attack strategies for deep neural networks, starting with targeted model poisoning using boosting of the malicious agent's update to overcome the effects of other agents. We also propose two critical notions of stealth to detect malicious updates. We bypass these by including them in the adversarial objective to carry out stealthy model poisoning. We improve attack stealth with the use of an alternating minimization strategy which alternately optimizes for stealth and the adversarial objective. We also empirically demonstrate that Byzantine-resilient aggregation strategies are not robust to our attacks. Our results show that effective and stealthy model poisoning attacks are possible, highlighting vulnerabilities in the federated learning setting.

*Work done at I.B.M. Research. 1Princeton University 2I.B.M. T.J. Watson Research Center. Correspondence to: Arjun Nitin Bhagoji <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

1. Introduction

Federated learning (McMahan et al., 2017) has recently emerged as a popular implementation of distributed stochastic optimization for large-scale deep neural network training. It is formulated as a multi-round strategy in which the training of a neural network model is distributed between multiple agents. In each round, a random subset of agents, with local data and computational resources, is selected for training. The selected agents perform model training and share only the parameter updates with a centralized parameter server that facilitates aggregation of the updates. Motivated by privacy concerns, the server is designed to have no visibility into an agent's local data and training process.

In this work, we exploit this lack of transparency in the agent updates, and explore the possibility of an adversary controlling a small number of malicious agents (usually just 1) performing a model poisoning attack. The adversary's objective is to cause the jointly trained global model to misclassify a set of chosen inputs with high confidence, i.e., it seeks to poison the global model in a targeted manner. Since the attack is targeted, the adversary also attempts to ensure that the global model converges to a point with good performance on the test or validation data. We note that these inputs are not modified to induce misclassification as in the phenomenon of adversarial examples (Carlini & Wagner, 2017; Szegedy et al., 2013). Rather, their misclassification is a product of the adversarial manipulations of the training process. We focus on an adversary which directly performs model poisoning instead of data poisoning (Biggio et al., 2012; Rubinstein et al., 2009; Mei & Zhu, 2015; Xiao et al., 2015; Koh & Liang, 2017; Chen et al., 2017a; Jagielski et al., 2018), as the agents' data is never shared with the server. In fact, model poisoning subsumes dirty-label data poisoning in the federated learning setting (see Section 5.1 for a detailed quantitative comparison).

Model poisoning also has a connection to a line of work on defending against Byzantine adversaries, which considers a threat model where the malicious agents can send arbitrary gradient updates (Blanchard et al., 2017; Chen et al., 2017b; Mhamdi et al., 2018; Chen et al., 2018; Yin et al., 2018) to the server. However, the adversarial goal in these cases is to ensure that a distributed implementation of the Stochastic Gradient Descent (SGD) algorithm converges to 'sub-optimal to utterly ineffective models' (Mhamdi et al., 2018), while the aim of the defenses is to ensure convergence. On the other hand, we consider adversaries aiming only to cause targeted poisoning. In fact, we show that targeted model poisoning is effective even with the use of Byzantine-resilient aggregation mechanisms (Section 4). Concurrent and independent work (Bagdasaryan et al., 2018) considers both single and multiple agents performing poisoning via model replacement at convergence time. In contrast, our goal is to induce targeted misclassification in the global model even when it is far from convergence, while maintaining its accuracy for most tasks.

1.1. Contributions

We design attacks on federated learning that achieve targeted poisoning of the global model while maintaining convergence. Our threat model considers adversaries controlling a small number of malicious agents (usually 1) and with no visibility into the updates that will be provided by the other agents. All of our experiments are on DNNs trained on the Fashion-MNIST (Xiao et al., 2017) and Adult Census1 datasets. Our code (https://github.com/inspire-group/ModelPoisoning) and a technical report (Bhagoji et al., 2018) are available.

Targeted model poisoning: In each round, the malicious agent generates its update by optimizing for a malicious objective designed to cause targeted misclassification. However, the presence of a multitude of other agents which are simultaneously providing updates makes this challenging. We thus use explicit boosting of the malicious agent's update, which is designed to negate the combined effect of the benign agents. Our evaluation demonstrates that this attack enables an adversary controlling a single malicious agent to achieve targeted misclassification at the global model with 100% confidence while ensuring convergence of the global model for deep neural networks trained on both datasets.

Stealthy model poisoning: We introduce notions of stealth for the adversary based on accuracy checking on the test/validation data and weight update statistics, and empirically show that targeted model poisoning with explicit boosting can be detected in all rounds with the use of these stealth metrics. Accordingly, we modify the malicious objective to account for these stealth metrics to carry out stealthy model poisoning, which allows the malicious weight update to avoid detection for a majority of the rounds. Finally, we propose an alternating minimization formulation that accounts for both model poisoning and stealth, and enables the malicious weight update to avoid detection in almost all rounds.

Attacking Byzantine-resilient aggregation: We investigate the possibility of model poisoning when the server uses Byzantine-resilient aggregation mechanisms such as Krum (Blanchard et al., 2017) and coordinate-wise median (Yin et al., 2018) instead of weighted averaging. We show that targeted model poisoning of deep neural networks with high confidence is effective even with the use of these aggregation mechanisms.

1 https://archive.ics.uci.edu/ml/datasets/adult

Connections to data poisoning and interpretability: We show that standard dirty-label data poisoning attacks (Chen et al., 2017a) are not effective in the federated learning setting, even when the number of incorrectly labeled examples is on the order of the local training data held by each agent. Finally, we use a suite of interpretability techniques to generate visual explanations of the decisions made by a global model with and without a targeted backdoor. Interestingly, we observe that the explanations are nearly visually indistinguishable, exposing the fragility of these techniques.

2. Federated Learning and Model Poisoning

In this section, we formulate both the learning paradigm and the threat model that we consider throughout the paper. Operating in the federated learning paradigm, where model weights are shared instead of data, gives rise to the model poisoning attacks that we investigate.

2.1. Federated Learning

The federated learning setup consists of $K$ agents, each with access to data $D_i$, where $|D_i| = l_i$. The total number of samples is $\sum_i l_i = l$. Each agent keeps its share of the data (referred to as a shard) private, i.e. $D_i = \{x_{i1}, \ldots, x_{il_i}\}$ is not shared with the server $S$. The server is attempting to train a classifier $f$ with global parameter vector $w_G \in \mathbb{R}^n$, where $n$ is the dimensionality of the parameter space. This parameter vector is obtained by distributed training and aggregation over the $K$ agents with the aim of generalizing well over $D_{test}$, the test data. Federated learning can handle both i.i.d. and non-i.i.d. partitioning of training data.

At each time step $t$, a random subset of $k$ agents is chosen for synchronous aggregation (McMahan et al., 2017). Every agent $i \in [k]$ minimizes the empirical loss over its own data shard $D_i$ (approximately, for non-convex loss functions, since global minima cannot be guaranteed), by starting from the global weight vector $w_G^t$ and running an algorithm such as SGD for $E$ epochs with a batch size of $B$. At the end of its run, each agent obtains a local weight vector $w_i^{t+1}$ and computes its local update $\delta_i^{t+1} = w_i^{t+1} - w_G^t$, which is sent back to the server. To obtain the global weight vector $w_G^{t+1}$ for the next iteration, any aggregation mechanism can be used. In Section 3, we use weighted averaging based aggregation for our experiments: $w_G^{t+1} = w_G^t + \sum_{i \in [k]} \alpha_i \delta_i^{t+1}$, where $\alpha_i = l_i / l$ and $\sum_i \alpha_i = 1$. In Section 4, we study the effect of our attacks on the Byzantine-resilient aggregation mechanisms 'Krum' (Blanchard et al., 2017) and coordinate-wise median (Yin et al., 2018).
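As a concrete illustration of the weighted-averaging aggregation step above, the following NumPy sketch (our own illustration, not the paper's released code; agent updates are represented as flat parameter vectors) computes $w_G^{t+1} = w_G^t + \sum_i \alpha_i \delta_i^{t+1}$:

import numpy as np

def aggregate_weighted_average(w_global, updates, shard_sizes):
    """Weighted-averaging aggregation: w_G^{t+1} = w_G^t + sum_i alpha_i * delta_i^{t+1}.

    w_global: flat global parameter vector w_G^t, shape (n,)
    updates: list of k local updates delta_i^{t+1}, each of shape (n,)
    shard_sizes: list of k shard sizes l_i, used to form alpha_i = l_i / l
    """
    alphas = np.asarray(shard_sizes, dtype=float)
    alphas /= alphas.sum()  # alpha_i = l_i / l, so sum_i alpha_i = 1
    weighted_sum = sum(a * d for a, d in zip(alphas, updates))
    return w_global + weighted_sum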

2.2. Threat Model: Model Poisoning

Traditional poisoning attacks deal with a malicious agent who poisons some fraction of the data in order to ensure that the learned model satisfies some adversarial goal. We consider instead an agent who poisons the model updates it sends back to the server.

Attack Model: We make the following assumptions regarding the adversary: (i) they control exactly one non-colluding, malicious agent with index $m$ (limited effect of malicious updates on the global model); (ii) the data is distributed among the agents in an i.i.d. fashion (making it easier to discriminate between benign and possible malicious updates and harder to achieve attack stealth); (iii) the malicious agent has access to a subset of the training data $D_m$ as well as to auxiliary data $D_{aux}$ drawn from the same distribution as the training and test data that are part of its adversarial objective. Our aim is to explore the possibility of a successful model poisoning attack even for a highly constrained adversary.

Adversarial Goals: The adversary's goal is to ensure the targeted misclassification of the auxiliary data by the classifier learned at the server. The auxiliary data consists of samples $\{x_i\}_{i=1}^r$ with true labels $\{y_i\}_{i=1}^r$ that have to be classified as desired target classes $\{\tau_i\}_{i=1}^r$, implying that the adversarial objective is

$$A(D_m \cup D_{aux}, w_G^t) = \max_{w_G^t} \sum_{i=1}^{r} \mathbb{1}\left[f(x_i; w_G^t) = \tau_i\right]. \quad (1)$$
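For concreteness, the objective in Eq. 1 is simply a count of targeted misclassifications on the auxiliary data. The short sketch below is our own illustration (not part of the paper's code) and assumes a generic predict(w, x) function that returns a class label:

def adversarial_success(predict, w_global, aux_samples, target_classes):
    """Count auxiliary samples classified into the adversary's desired target classes
    (the indicator sum in Eq. 1). `predict` maps (weights, input) -> predicted label."""
    hits = sum(1 for x, tau in zip(aux_samples, target_classes)
               if predict(w_global, x) == tau)
    return hits  # the adversary wants this to equal r = len(aux_samples)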

We note that in contrast to previous threat models considered for Byzantine-resilient learning, the adversary's aim is not to prevent convergence of the global model (Yin et al., 2018) or to cause it to converge to a bad minimum (Mhamdi et al., 2018). Thus, any attack strategy used by the adversary must ensure that the global model converges to a point with good performance on the test set. Going beyond the standard federated learning setting, it is plausible that the server may implement measures to detect aberrant models. To bypass such measures, the adversary must also conform to notions of stealth that we define and justify next.

2.3. Stealth metrics

Given an update from an agent, there are two critical properties that the server can check. First, the server can verify whether the update, in isolation, would improve or worsen the global model's performance on a validation set. Second, the server can check if that update is very different statistically from other updates. We note that neither of these properties is checked as a part of standard federated learning, but we use these to raise the bar for a successful attack.

Accuracy checking: The server checks the validation accuracy of $w_i^t = w_G^{t-1} + \delta_i^t$, the model obtained by adding the update from agent $i$ to the current state of the global model. If the resulting model has a validation accuracy much lower than that of the model obtained by aggregating all the other updates, $w_{G \setminus i}^t = w_G^{t-1} + \sum_{j \neq i} \delta_j^t$, the server can flag the update as being anomalous. For the malicious agent, this implies that it must satisfy the following in order to be chosen at time step $t$: $\sum_{\{x_j, y_j\} \in D_{test}} \mathbb{1}[f(x_j; w_m^t) = y_j] - \mathbb{1}[f(x_j; w_{G \setminus m}^t) = y_j] < \gamma_t$, where $\gamma_t$ is a threshold the server defines to reject updates. This threshold determines how much performance variation the server can tolerate and can be varied over time. A large threshold will be less effective at identifying anomalous updates, but an overly small one could identify benign updates as anomalous, due to natural variation in the data and training process.

Weight update statistics: The range of pairwise distances between a particular update and the rest provides an indication of how different that update is from the rest when using an appropriate distance metric $d(\cdot, \cdot)$. In previous work, pairwise distances were used to define 'Krum' (Blanchard et al., 2017), but as we show in Section 4, its reliance on absolute, instead of relative, distance values makes it vulnerable to our attacks. Thus, we rely on the full range of pairwise distances, which can be computed for all agent updates; for an agent to be flagged as anomalous, their range of distances must differ from the others by a server-defined, time-dependent threshold $\kappa_t$. In particular, for the malicious agent, we compute the range as $R_m = [\min_{i \in [k] \setminus m} d(\delta_m^t, \delta_i^t), \max_{i \in [k] \setminus m} d(\delta_m^t, \delta_i^t)]$. Let $R^l_{\min,[k]\setminus m}$ and $R^u_{\max,[k]\setminus m}$ be the minimum lower bound and maximum upper bound of the range for all other agents. Then, for the malicious agent to not be flagged as anomalous, we need that $\max\{|R_m^u - R^l_{\min,[k]\setminus m}|, |R_m^l - R^u_{\max,[k]\setminus m}|\} < \kappa_t$. This condition ensures that the range of distances between the malicious agent and any other agent is not too different from that between any other two agents, and also controls the length of $R_m$. We find that it is also instructive to compare the histogram of weight updates for benign and malicious agents, as these can be very different depending on the attack strategy used. These provide a useful qualitative notion of stealth, which can be used to understand attack behavior.
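The two stealth checks can be summarized in code. The sketch below is our own illustration (not the paper's released code); the accuracy check follows the prose description above (flag an update whose validation accuracy is much lower than that of the aggregate of the remaining updates), and the range check compares each agent's range of pairwise l2 distances against the extreme endpoints over the other agents:

import numpy as np

def accuracy_check(candidate_acc, rest_acc, gamma_t):
    """Accuracy-checking stealth metric: the candidate update passes if its validation
    accuracy is not more than gamma_t below that of the aggregate of all other updates."""
    return (rest_acc - candidate_acc) < gamma_t  # True -> passes the check

def pairwise_distance_ranges(updates):
    """Range [min, max] of l2 distances from each update to all the others."""
    ranges = []
    for i, d_i in enumerate(updates):
        dists = [np.linalg.norm(d_i - d_j) for j, d_j in enumerate(updates) if j != i]
        ranges.append((min(dists), max(dists)))
    return ranges

def range_check(ranges, idx, kappa_t):
    """Weight-update-statistics stealth metric: agent idx passes if the endpoints of its
    distance range stay within kappa_t of the extreme endpoints over all other agents."""
    lo_m, hi_m = ranges[idx]
    others = [r for j, r in enumerate(ranges) if j != idx]
    lo_min = min(lo for lo, _ in others)   # R^l_{min, [k]\m}
    hi_max = max(hi for _, hi in others)   # R^u_{max, [k]\m}
    return max(abs(hi_m - lo_min), abs(lo_m - hi_max)) < kappa_t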

2.4. Experimental setup

We evaluate our attack strategies using two qualitatively different datasets. The first is an image dataset, Fashion-MNIST (Xiao et al., 2017), for which we use a 3-layer Convolutional Neural Network (CNN) with dropout as the model architecture. With centralized training, this model achieves 91.7% accuracy on the test set. The second dataset is the UCI Adult Census dataset3, for which we use a fully connected neural network achieving 84.8% accuracy on the test set (Fernández-Delgado et al., 2014) as the model architecture. Further details about datasets and models are in Section 1 of the Supplementary.

For both datasets, we study the case with the number of agents K set to 10 and 100. When K = 10, all the agents are chosen at every iteration, while with K = 100, a tenth of the agents are chosen at random every iteration. We run federated learning until a pre-specified test accuracy (91% for Fashion-MNIST and 84% for the Adult Census data) is reached or the maximum number of time steps has elapsed (40 for K = 10 and 50 for K = 100). In Section 3, for illustrative purposes, we mostly consider the case where the malicious agent aims to misclassify a single example in a desired target class (r = 1). For the Fashion-MNIST dataset, the example belongs to class '5' (sandal), with the aim of misclassifying it in class '7' (sneaker), and for the Adult dataset it belongs to class '0', with the aim of misclassifying it in class '1'. We also consider the case with r = 10 but defer these results to the Supplementary material owing to space constraints.

3. Strategies for Model Poisoning Attacks

In this section, we use the adversarial goals laid out in the previous section to formulate the adversarial optimization problem. We then show how explicit boosting can achieve targeted model poisoning. We further explore attack strategies that add stealth and improve convergence.

3.1. Adversarial optimization setup

From Eq. 1, two challenges for the adversary are immediately clear. First, the objective represents a difficult combinatorial optimization problem, so we relax Eq. 1 in terms of the cross-entropy loss, for which automatic differentiation can be used. Second, the adversary does not have access to the global parameter vector $w_G^t$ for the current iteration and can only influence it through the weight update $\delta_m^t$ it provides to the server $S$. So, it performs the optimization over $\hat{w}_G^t$, an estimate of the value of $w_G^t$ based on all the information $I_m^t$ available to the adversary. The objective function for the adversary to achieve targeted model poisoning on the $t$-th iteration is

$$\underset{\delta_m^t}{\arg\min} \; L(\{x_i, \tau_i\}_{i=1}^r, \hat{w}_G^t), \quad \text{s.t.} \; \hat{w}_G^t = g(I_m^t), \quad (2)$$

3 https://archive.ics.uci.edu/ml/datasets/adult

[Figure 1: Targeted model poisoning for CNN on Fashion-MNIST data. The global model's confidence on the malicious objective (poisoning) and the accuracy on validation data (targeted) are shown. Axes: confidence and classification accuracy vs. time; curves: Validation Accuracy (Global), Conf. (5→7) on Global.]

where $g(\cdot)$ is an estimator. For the rest of this section, we use the estimate $\hat{w}_G^t = w_G^{t-1} + \alpha_m \delta_m^t$, implying that the malicious agent ignores the updates from the other agents but accounts for scaling at aggregation. This assumption is enough to ensure the attack works in practice.

3.2. Targeted model poisoning for standard federated learning

The adversary can directly optimize the adversarial objective $L(\{x_i, \tau_i\}_{i=1}^r, \hat{w}_G^t)$ with $\hat{w}_G^t = w_G^{t-1} + \alpha_m \delta_m^t$. However, this setup implies that the optimizer has to account for the scaling factor $\alpha_m$ implicitly. In practice, we find that when using a gradient-based optimizer such as SGD, explicit boosting is much more effective. The rest of the section focuses on explicit boosting; an analysis of implicit boosting is deferred to Section 2 of the Supplementary.

Explicit Boosting: Mimicking a benign agent, the malicious agent can run $E_m$ steps of a gradient-based optimizer (such as Adam (Kingma & Ba, 2015)) starting from $w_G^{t-1}$ to obtain $w_m^t$, which minimizes the loss over $\{x_i, \tau_i\}_{i=1}^r$. The malicious agent then obtains an initial update $\tilde{\delta}_m^t = w_m^t - w_G^{t-1}$. However, since the malicious agent's update tries to ensure that the model learns labels different from the true labels for the data of its choice ($D_{aux}$), it has to overcome the effect of scaling, which would otherwise mostly nullify the desired classification outcomes. This happens because the learning objective for all the other agents is very different from that of the malicious agent, especially in the i.i.d. case. The final weight update sent back by the malicious agent is then $\delta_m^t = \lambda \tilde{\delta}_m^t$, where $\lambda$ is the factor by which the malicious agent boosts the initial update. Note that with $\hat{w}_G^t = w_G^{t-1} + \alpha_m \delta_m^t$ and $\lambda = \frac{1}{\alpha_m}$, we have $\hat{w}_G^t = w_m^t$, implying that if the estimation was exact, the global weight vector should now satisfy the malicious agent's objective.
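The explicit boosting procedure can be sketched as follows. This is our own illustration (not the paper's released code) and assumes a generic malicious_grad(w) function returning the gradient of the malicious loss $L(\{x_i, \tau_i\}_{i=1}^r, w)$ at $w$; plain SGD is used here for simplicity, whereas the paper's experiments use Adam:

def explicit_boosting_update(w_global_prev, malicious_grad, lam, E_m=5, lr=0.01):
    """Targeted model poisoning via explicit boosting.

    w_global_prev: global weights w_G^{t-1} received from the server (flat NumPy vector)
    malicious_grad: callable returning the gradient of the malicious loss at w
    lam: boosting factor lambda (e.g., 1 / alpha_m = K under weighted averaging)
    E_m: number of local optimization steps; lr: step size
    """
    w = w_global_prev.copy()
    for _ in range(E_m):
        w = w - lr * malicious_grad(w)     # minimize the loss over {x_i, tau_i}
    delta_init = w - w_global_prev         # initial update
    return lam * delta_init                # boosted update sent to the server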

Results: In the attack with explicit boosting, the malicious agent uses $E_m = 5$ to obtain $\tilde{\delta}_m^t$, and then boosts it by $\frac{1}{\alpha_m} = K$. The results for the case with K = 10 for the Fashion-MNIST data are shown in Figure 1. The attack is clearly successful at causing the global model to classify the chosen example in the target class. In fact, after $t = 3$, the global model is highly confident in its (incorrect) prediction. Further, the global model converges with good performance on the validation set in spite of the targeted poisoning for one example. Results for the Adult Census dataset (Section 3 of the Supplementary) demonstrate that targeted model poisoning is possible across datasets and models. Thus, the explicit boosting attack is able to achieve targeted poisoning in the federated learning setting.

[Figure 2: Comparison of weight update distributions for benign and malicious agents for the targeted model poisoning attack for CNN on Fashion-MNIST data. Histogram of weight values for benign vs. malicious updates.]

Performance on stealth metrics: While the explicit boosting attack does not take stealth metrics into account, it is instructive to study properties of the model update it generates. Compared to the weight update from a benign agent, the update from the malicious agent is much sparser and has a smaller range (Figure 2). In Figure 3, the spread of L2 distances between all benign updates and between the malicious update and the benign updates is plotted. For the baseline attack, both the minimum and maximum distance away from any of the benign updates keep decreasing over time steps, while they remain relatively constant for the other agents. In Figure 4a, the accuracy of the malicious model on the validation data (Acc. Mal (Targeted)) is shown, which is much lower than the global model's accuracy.

3.3. Stealthy model poisoning

As discussed in Section 2.3, there are two properties which the server can use to detect anomalous updates: accuracy on validation data and weight update statistics. In order to maintain stealth with respect to both of these properties, the adversary can add loss terms corresponding to both of those metrics to the model poisoning objective function from Eq. 2 and improve targeted model poisoning. First, in order to improve the accuracy on validation data, the adversary adds the training loss over the malicious agent's local data shard $D_m$ ($L(D_m, w_G^t)$) to the objective. Since the training data is i.i.d. with the validation data, this will ensure that the malicious agent's update is similar to that of a benign agent in terms of validation loss and will make it challenging for the server to flag the malicious update as anomalous.

[Figure 3: Range of $\ell_2$ distances between all benign agents and between the malicious agent and the benign agents, over time, for the targeted model poisoning, stealthy model poisoning, and alternating minimization attacks.]

Second, the adversary needs to ensure that its update is as close as possible to the benign agents' updates in the appropriate distance metric. For our experiments, we use the $\ell_p$ norm with $p = 2$. Since the adversary does not have access to the updates for the current time step $t$ that are generated by the other agents, it constrains $\delta_m^t$ with respect to $\delta_{ben}^{t-1} = \sum_{i \in [k] \setminus m} \alpha_i \delta_i^{t-1}$, which is the average update from all the other agents for the previous iteration, which the malicious agent has access to. Thus, the adversary adds $\rho \|\delta_m^t - \delta_{ben}^{t-1}\|_2$ to its objective as well. We note that the addition of the training loss term is not sufficient to ensure that the malicious weight update is close to that of the benign agents, since there could be multiple local minima with similar loss values. Overall, the adversarial objective then becomes:

$$\underset{\delta_m^t}{\arg\min} \; \lambda L(\{x_i, \tau_i\}_{i=1}^r, \hat{w}_G^t) + L(D_m, w_m^t) + \rho \|\delta_m^t - \delta_{ben}^{t-1}\|_2 \quad (3)$$

Note that for the training loss, the optimization is just performed with respect to $w_m^t = w_G^{t-1} + \delta_m^t$, as a benign agent would do. Using explicit boosting, $\hat{w}_G^t$ is replaced by $w_m^t$ as well, so that only the portion of the loss corresponding to the malicious objective gets boosted by a factor $\lambda$.
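A minimal sketch of the objective in Eq. 3, written as a single scalar loss, is given below. This is our own illustration and assumes callables malicious_loss(w) and train_loss(w) for the two loss terms; in the explicit boosting variant only the malicious term is scaled by lambda:

import numpy as np

def stealthy_poisoning_loss(delta_m, w_global_prev, malicious_loss, train_loss,
                            delta_ben_prev, lam, rho):
    """Eq. 3: lambda * L({x_i, tau_i}, w) + L(D_m, w) + rho * ||delta_m^t - delta_ben^{t-1}||_2,
    evaluated at w = w_G^{t-1} + delta_m^t."""
    w_m = w_global_prev + delta_m
    distance_term = rho * np.linalg.norm(delta_m - delta_ben_prev)
    return lam * malicious_loss(w_m) + train_loss(w_m) + distance_term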

Results and effect on stealth: From Figure 4a, it is clear that the stealthy model poisoning attack is able to cause targeted poisoning of the global model. We set the accuracy threshold $\gamma_t$ to be 10%, which implies that the malicious model is chosen for 10 iterations out of 15. This is in contrast to the targeted model poisoning attack, which never has validation accuracy within 10% of the global model. Further, the weight update distribution for the stealthy poisoning attack (Figure 4b) is similar to that of a benign agent, owing to the additional terms in the loss function. Finally, in Figure 3, we see that the range of $\ell_2$ distances for the malicious agent $R_m$ is close to that between benign agents (see Section 2.3).

[Figure 4: Stealthy model poisoning for CNN on Fashion-MNIST. (a) Confidence on the malicious objective and accuracy on validation data for $w_G^t$; stealth with respect to accuracy checking is also shown for both the stealthy and targeted model poisoning attacks; we use $\lambda = 10$ and $\rho = 10^{-4}$. (b) Comparison of weight update distributions for benign and malicious agents.]

Concurrent work on model poisoning boosts the entire update (instead of just the malicious loss component) when the global model is close to convergence to attempt model replacement (Bagdasaryan et al., 2018), but this strategy is ineffective when the model has not converged.

3.4. Alternating minimization for improved model poisoning

While the stealthy model poisoning attack ensures targeted poisoning of the global model while maintaining stealth according to the two conditions required, it does not ensure that the malicious agent's update is chosen in every iteration. To achieve this, we propose an alternating minimization attack strategy which decouples the targeted objective from the stealth objectives, providing finer control over the relative effect of the two objectives. It works as follows for iteration $t$. For each epoch $i$, the adversarial objective is first minimized starting from $w_m^{i-1,t}$, giving an update vector $\delta_m^{i,t}$. This is then boosted by a factor $\lambda$ and added to $w_m^{i-1,t}$. Finally, the stealth objective for that epoch is minimized starting from $w_m^{i,t} = w_m^{i-1,t} + \lambda \delta_m^{i,t}$, providing the malicious weight vector $w_m^{i,t}$ for the next epoch. The malicious agent can run this alternating minimization until both the adversarial and stealth objectives have sufficiently low values. Further, the independent minimization allows each objective to be optimized for a different number of steps, depending on which is more difficult to achieve. In particular, we find that optimizing the stealth objective for a larger number of steps each epoch compared to the malicious objective leads to better stealth performance while maintaining targeted poisoning.

[Figure 5: Alternating minimization attack with distance constraints for CNN on Fashion-MNIST data. Stealth with respect to accuracy checking is also shown. Curves: Val. Acc. (Global), Conf. (5→7) Global, Val. Acc. Mal., over time.]
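The alternating minimization strategy for one server iteration $t$ can be sketched as below. This is our own illustration, assuming adv_step(w) and stealth_step(w) each perform one gradient step on the adversarial and stealth objectives respectively (names and step counts are ours):

def alternating_minimization(w_global_prev, adv_step, stealth_step, lam,
                             n_epochs=10, stealth_steps_per_epoch=10):
    """Alternate between the boosted adversarial objective and the stealth objective.

    In each epoch: (1) minimize the adversarial objective from the current malicious
    weights, (2) boost the resulting update by lambda and apply it, (3) minimize the
    stealth objective, running it for more steps, which the paper finds improves stealth."""
    w_m = w_global_prev.copy()
    for _ in range(n_epochs):
        w_adv = adv_step(w_m)                  # minimize adversarial objective
        delta_adv = w_adv - w_m
        w_m = w_m + lam * delta_adv            # boosted malicious component
        for _ in range(stealth_steps_per_epoch):
            w_m = stealth_step(w_m)            # minimize stealth objective
    return w_m - w_global_prev                 # final update delta_m^t sent to the server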

Results and effect on stealth: The adversarial objective is achieved at the global model with high confidence starting from time step $t = 2$, and the global model converges to a point with good performance on the validation set. This attack can bypass the accuracy checking method, as the accuracy on validation data of the malicious model is close to that of the global model. In Figure 3, we can see that the distance spread for this attack closely follows and even overlaps that of benign updates throughout, thus achieving complete stealth with respect to both properties.

4. Attacking Byzantine-resilient aggregation

There has been considerable recent work that has proposed gradient aggregation mechanisms for distributed learning that ensure convergence of the global model (Blanchard et al., 2017; Chen et al., 2017b; Mhamdi et al., 2018; Chen et al., 2018; Yin et al., 2018). However, the aim of the Byzantine adversaries considered in this line of work is to ensure convergence to ineffective models, i.e. models with poor classification performance. The goal of the adversary we consider is targeted model poisoning, which implies convergence to an effective model on the test data. This difference in objectives leads to the lack of robustness of these Byzantine-resilient aggregation mechanisms to our attacks. We consider the efficient aggregation mechanisms Krum (Blanchard et al., 2017) and coordinate-wise median (Yin et al., 2018) for our evaluation, both of which are provably Byzantine-resilient and converge under appropriate conditions on the loss function (these conditions do not hold for neural networks, so the guarantees are only empirical).

Krum: Given $n$ agents of which $f$ are Byzantine, Krum requires that $n \geq 2f + 3$. At any time step $t$, updates $(\delta_1^t, \ldots, \delta_n^t)$ are received at the server. For each $\delta_i^t$, the $n - f - 2$ closest (in terms of $L_p$ norm) other updates are chosen to form a set $C_i$ and their distances added up to give a score $S(\delta_i^t) = \sum_{\delta \in C_i} \|\delta_i^t - \delta\|$. Krum then chooses $\delta_{krum} = \delta_i^t$ with the lowest score, which is added to the global weights to give $w_G^{t+1} = w_G^t + \delta_{krum}$. In Figure 6, we see the effect of the alternating minimization attack on Krum with a boosting factor of $\lambda = 2$ for a federated learning setup with 10 agents. Since there is no need to overcome the constant scaling factor $\alpha_m$, the attack can use a much smaller boosting factor $\lambda$ than the number of agents to ensure model poisoning. The malicious agent's update is chosen by Krum for 26 of 40 time steps, which leads to the malicious objective being met. Further, the global model converges to a point with good performance, as the malicious agent has added the training loss to its stealth objective. With the use of targeted model poisoning, we can cause Krum to converge to a model with poor performance as well.
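A sketch of the Krum selection rule described above, in NumPy, is given below (our own illustration, using the $\ell_2$ norm; updates are flat vectors):

import numpy as np

def krum(updates, f):
    """Select the update with the lowest Krum score.

    For each update, the score is the sum of l2 distances to its n - f - 2 closest
    other updates; the update with the lowest score is returned and added to the
    global weights by the server."""
    n = len(updates)
    scores = []
    for i in range(n):
        dists = sorted(np.linalg.norm(updates[i] - updates[j]) for j in range(n) if j != i)
        scores.append(sum(dists[: n - f - 2]))   # n - f - 2 closest other updates
    return updates[int(np.argmin(scores))]       # delta_krum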

Coordinate-wise median: Given the set of updates $\{\delta_i^t\}_{i=1}^k$ at time step $t$, the aggregate update is $\delta^t := \mathrm{coomed}\{\{\delta_i^t\}_{i=1}^k\}$, which is a vector with its $j$-th coordinate $\delta^t(j) = \mathrm{med}\{\delta_i^t(j)\}$, where med is the 1-dimensional median. Using targeted model poisoning with a boosting factor of $\lambda = 1$, i.e. no boosting, the malicious objective is met with confidence close to 0.9 for 11 of 14 time steps (Figure 6). We note that in this case, unlike with Krum, there is convergence to an effective global model. We believe this occurs because coordinate-wise median does not simply pick one of the updates to apply to the global model and does indeed use information from all the agents while computing the new update. Thus, model poisoning attacks are effective against two completely different Byzantine-resilient aggregation mechanisms.
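Coordinate-wise median aggregation is essentially a one-liner in NumPy (our own illustration):

import numpy as np

def coordinate_wise_median(updates):
    """Aggregate update whose j-th coordinate is the 1-D median of the agents' j-th coordinates."""
    return np.median(np.stack(updates, axis=0), axis=0)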

[Figure 6: Model poisoning attacks with Byzantine-resilient aggregation mechanisms. We use targeted model poisoning for coomed and alternating minimization for Krum. Curves: validation accuracy and confidence on the malicious objective for the global model under Krum and coomed, over time.]

5. Discussion

5.1. Model poisoning vs. data poisoning

In this section, we elucidate the differences between model poisoning and data poisoning both qualitatively and quantitatively. Data poisoning attacks largely fall into two categories: clean-label (Muñoz-González et al., 2017; Koh & Liang, 2017) and dirty-label (Chen et al., 2017a; Gu et al., 2017; Liu et al., 2017). Clean-label attacks assume that the adversary cannot change the label of any training data, as there is a process by which data is certified as belonging to the correct class, and the poisoning of data samples has to be imperceptible. On the other hand, to carry out dirty-label poisoning, the adversary just has to introduce a number of copies of the data sample it wishes to misclassify, with the desired target label, into the training set, since there is no requirement that a data sample belong to the correct class. Dirty-label data poisoning has been shown to achieve high-confidence targeted misclassification for deep neural networks with the addition of around 50 poisoned samples to the training data (Chen et al., 2017a).

Dirty-label data poisoning in federated learning: In our comparison with data poisoning, we use the dirty-label data poisoning framework for two reasons. First, federated learning operates under the assumption that data is never shared, only learned models. Thus, the adversary is not concerned with notions of imperceptibility for data certification. Second, clean-label data poisoning assumes access at train time to the global parameter vector, which is absent in the federated learning setting. Using the same experimental setup as before (CNN on Fashion-MNIST data, 10 agents chosen every time step), we add copies of the sample that is to be misclassified to the training set of the malicious agent with the appropriate target label. We experiment with two settings. In the first, we add multiple copies of the same sample to the training set. In the second, we add a small amount of random uniform noise to each pixel (Chen et al., 2017a) when generating copies. We observe that even when we add 1000 copies of the sample to the training set, the data poisoning attack is completely ineffective at causing targeted poisoning in the global model. This occurs because the malicious agent's update is scaled, which again underscores the importance of boosting while performing model poisoning. We note also that if the update generated using data poisoning is boosted, it affects the performance of the global model, as the entire update is boosted, not just the malicious part. Thus, model poisoning attacks are much more effective than data poisoning in the federated learning setting.


[Figure 7: Interpretation of benign (5 → 5) and malicious (5 → 7) model decisions via visualization of feature relevance and representations for a randomly chosen auxiliary data sample.]

5.2. Interpreting poisoned models

Neural networks are often treated as black boxes with little transparency into their internal representation or understanding of the underlying basis for their decisions. Interpretability techniques are designed to alleviate these problems by analyzing various aspects of the network. These include (i) identifying the relevant features in the input pixel space for a particular decision via Layerwise Relevance Propagation (LRP) techniques (Montavon et al., 2015); (ii) visualizing the association between neuron activations and image features (Guided Backprop (Springenberg et al., 2014), DeConvNet (Zeiler & Fergus, 2014)); (iii) using gradients for attributing prediction scores to input features (e.g., Integrated Gradients (Sundararajan et al., 2017)), or generating sensitivity and saliency maps (SmoothGrad (Smilkov et al., 2017), Gradient Saliency Maps (Simonyan et al., 2013)). The semantic relevance of the generated visualization, relative to the input, is then used to explain the model decision.

We used a suite of these techniques to try and discriminate between the behavior of a benign global model and one that has been trained to satisfy the adversarial objective of misclassifying a single example. Figure 7 compares the output of the various techniques for both the benign and malicious models on a random auxiliary data sample. Targeted perturbation of the model parameters coupled with tightly bounded noise ensures that the internal representations, and the relevant input features used by the two models, for the same input, are almost visually indistinguishable. This further exposes the fragility of interpretability methods (Adebayo et al., 2018).

5.3. Improving attack performance through estimation

In this section, we look at how the malicious agent can choose a better estimate for the effect of the other agents' updates at each time step that it is chosen. The adversary's goal is to choose an appropriate estimate for $\delta_{[k]\setminus m}^t = \sum_{i \in [k] \setminus m} \alpha_i \delta_i^t$. The following information is available to them from the previous time steps they were chosen: i) global parameter vectors $w_G^{t_0}, \ldots, w_G^{t-1}$; ii) malicious weight updates $\delta_m^{t_0}, \ldots, \delta_m^t$; and iii) local training data shard $D_m$, where $t_0$ is the first time step at which the malicious agent is chosen.

Previous step estimate: The malicious agent's estimate $\hat{\delta}_{[k]\setminus m}^t$ assumes that the other agents' cumulative updates were the same at each step since $t'$ (the last time step at which the malicious agent was chosen), i.e. $\hat{\delta}_{[k]\setminus m}^t = \frac{w_G^t - w_G^{t'} - \delta_m^{t'}}{t - t'}$. In the case when the malicious agent is chosen at every time step, this reduces to $\hat{\delta}_{[k]\setminus m}^t = \delta_{[k]\setminus m}^{t-1}$.

Pre-optimization correction: Having computed an estimate of the cumulative updates from the other agents, the malicious agent plugs it into its estimate of the global parameter vector, i.e. $\hat{w}_G^t = w_G^{t-1} + \hat{\delta}_{[k]\setminus m}^t + \alpha_m \delta_m^t$. In other words, the malicious agent optimizes for $\delta_m^t$ assuming it has an accurate estimate of the other agents' updates. For attacks which use explicit boosting, this involves starting from $w_G^{t-1} + \hat{\delta}_{[k]\setminus m}^t$ instead of just $w_G^{t-1}$.

Table 1. Comparison of confidence of targeted misclassification with and without the use of previous step estimation for the targeted model poisoning and alternating minimization attacks.

Attack     | Targeted Model Poisoning | Alternating Minimization
Estimation | None  | Previous step    | None  | Previous step
t = 2      | 0.63  | 0.82             | 0.17  | 0.47
t = 3      | 0.93  | 0.98             | 0.34  | 0.89
t = 4      | 0.99  | 1.0              | 0.88  | 1.0
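The previous step estimate and pre-optimization correction can be sketched as follows (our own illustration; variable names are ours and weights are flat NumPy vectors):

def previous_step_estimate(w_global_t, w_global_tprime, delta_m_tprime, t, t_prime):
    """Estimate of the other agents' cumulative update, assuming it was the same at every
    step since t', the last step at which the malicious agent was chosen:
    (w_G^t - w_G^{t'} - delta_m^{t'}) / (t - t')."""
    return (w_global_t - w_global_tprime - delta_m_tprime) / (t - t_prime)

def corrected_starting_point(w_global_prev, est_benign_update):
    """Pre-optimization correction: start the malicious optimization from
    w_G^{t-1} + estimated benign update instead of w_G^{t-1} alone."""
    return w_global_prev + est_benign_update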

Results: Attacks using previous step estimation with the pre-optimization correction are more effective at achieving the adversarial objective for both the targeted model poisoning and alternating minimization attacks. In Table 1, we can see that the global model misclassifies the desired sample with a higher confidence when using previous step estimation in the first few iterations.

6. Conclusion

In this paper, we have started an exploration of the vulnerability of federated learning to model poisoning adversaries, which can take advantage of the very privacy these models are designed to provide. In future work, we plan to explore more sophisticated detection strategies at the server, which can provide guarantees against the type of attacker we have considered here. In particular, notions of distances between weight distributions may be promising defensive tools. Our attacks in this paper demonstrate that federated learning in its basic form is very vulnerable to model poisoning adversaries, as are recently proposed Byzantine-resilient aggregation mechanisms. While notions of stealth can make these attacks more challenging, they can be overcome, demonstrating that robustness to attackers of the type considered here is yet to be achieved.


Acknowledgements

This research was sponsored by the U.S. Army Research Laboratory and the U.K. Ministry of Defence under Agreement Number W911NF-16-3-0001, the National Science Foundation under grant CNS-1553437, by Intel through the Intel Faculty Research Award, by the Office of Naval Research through the Young Investigator Program (YIP) Award and by the Army Research Office through the Young Investigator Program (YIP) Award. Arjun Nitin Bhagoji would also like to thank Siemens for supporting him through the FutureMakers Fellowship. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Army Research Laboratory, the U.S. Government, the U.K. Ministry of Defence or the U.K. Government. The U.S. and U.K. Governments are authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

References

Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pp. 9525–9536, 2018.

Bagdasaryan, E., Veit, A., Hua, Y., Estrin, D., and Shmatikov, V. How to backdoor federated learning. arXiv preprint arXiv:1807.00459, 2018.

Bhagoji, A. N., Chakraborty, S., Mittal, P., and Calo, S. Analyzing federated learning through an adversarial lens. arXiv preprint arXiv:1811.12470, 2018.

Biggio, B., Nelson, B., and Laskov, P. Poisoning attacks against support vector machines. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 1807–1814, 2012.

Blanchard, P., El Mhamdi, E. M., Guerraoui, R., and Stainer, J. Machine learning with adversaries: Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, 2017.

Carlini, N. and Wagner, D. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. IEEE, 2017.

Chen, L., Wang, H., Charles, Z. B., and Papailiopoulos, D. S. DRACO: Byzantine-resilient distributed training via redundant gradients. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.

Chen, X., Liu, C., Li, B., Lu, K., and Song, D. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526, 2017a.

Chen, Y., Su, L., and Xu, J. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proc. ACM Meas. Anal. Comput. Syst., 1(2), 2017b.

Fernández-Delgado, M., Cernadas, E., Barro, S., and Amorim, D. Do we need hundreds of classifiers to solve real world classification problems? The Journal of Machine Learning Research, 15(1):3133–3181, 2014.

Gu, T., Dolan-Gavitt, B., and Garg, S. BadNets: Identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733, 2017.

Jagielski, M., Oprea, A., Biggio, B., Liu, C., Nita-Rotaru, C., and Li, B. Manipulating machine learning: Poisoning attacks and countermeasures for regression learning. In IEEE Symposium on Security and Privacy, 2018.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Koh, P. W. and Liang, P. Understanding black-box predictions via influence functions. In ICML, 2017.

Liu, Y., Ma, S., Aafer, Y., Lee, W.-C., Zhai, J., Wang, W., and Zhang, X. Trojaning attack on neural networks. In NDSS, 2017.

McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017.

Mei, S. and Zhu, X. Using machine teaching to identify optimal training-set attacks on machine learners. In AAAI, 2015.

Mhamdi, E. M. E., Guerraoui, R., and Rouault, S. The hidden vulnerability of distributed learning in Byzantium. In ICML, 2018.

Montavon, G., Bach, S., Binder, A., Samek, W., and Müller, K. Explaining nonlinear classification decisions with deep Taylor decomposition. arXiv preprint arXiv:1512.02479, 2015.

Muñoz-González, L., Biggio, B., Demontis, A., Paudice, A., Wongrassamee, V., Lupu, E. C., and Roli, F. Towards poisoning of deep learning algorithms with back-gradient optimization. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security. ACM, 2017.


Rubinstein, B. I., Nelson, B., Huang, L., Joseph, A. D., Lau, S.-h., Rao, S., Taft, N., and Tygar, J. Stealthy poisoning attacks on PCA-based anomaly detectors. ACM SIGMETRICS Performance Evaluation Review, 37(2):73–74, 2009.

Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.

Smilkov, D., Thorat, N., Kim, B., Viégas, F. B., and Wattenberg, M. SmoothGrad: Removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.

Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. A. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.

Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365, 2017.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Xiao, H., Biggio, B., Brown, G., Fumera, G., Eckert, C., and Roli, F. Is feature selection secure against training data poisoning? In ICML, 2015.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

Yin, D., Chen, Y., Ramchandran, K., and Bartlett, P. Byzantine-robust distributed learning: Towards optimal statistical rates. arXiv preprint arXiv:1803.01498, 2018.

Zeiler, M. and Fergus, R. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, 13th European Conference, Proceedings, 2014.