
SIGNSGD: Compressed Optimisation for Non-Convex Problems

Jeremy Bernstein 1 2 Yu-Xiang Wang 2 3 Kamyar Azizzadenesheli 4 Anima Anandkumar 1 2

Abstract

Training large neural networks requires distributing learning across multiple workers, where the cost of communicating gradients can be a significant bottleneck. SIGNSGD alleviates this problem by transmitting just the sign of each minibatch stochastic gradient. We prove that it can get the best of both worlds: compressed gradients and SGD-level convergence rate. The relative ℓ1/ℓ2 geometry of gradients, noise and curvature informs whether SIGNSGD or SGD is theoretically better suited to a particular problem. On the practical side we find that the momentum counterpart of SIGNSGD is able to match the accuracy and convergence speed of ADAM on deep Imagenet models. We extend our theory to the distributed setting, where the parameter server uses majority vote to aggregate gradient signs from each worker, enabling 1-bit compression of worker-server communication in both directions. Using a theorem by Gauss (1823) we prove that majority vote can achieve the same reduction in variance as full precision distributed SGD. Thus, there is great promise for sign-based optimisation schemes to achieve fast communication and fast convergence. Code to reproduce experiments is to be found at https://github.com/jxbz/signSGD.

1. Introduction

Deep neural networks have learnt to solve numerous natural human tasks (LeCun et al., 2015; Schmidhuber, 2015). Training these large-scale models can take days or even weeks. The learning process can be accelerated by distributing training over multiple processors: either GPUs linked within a single machine, or even multiple machines linked together. Communication between workers is typically handled using a parameter-server framework (Li et al., 2014), which involves repeatedly communicating the gradients of every parameter in the model. This can still be time-intensive for large-scale neural networks. The communication cost can be reduced if gradients are compressed before being transmitted. In this paper, we analyse the theory of robust schemes for gradient compression.

1 Caltech, 2 Amazon AI, 3 UC Santa Barbara, 4 UC Irvine. Correspondence to: Jeremy Bernstein <[email protected]>, Yu-Xiang Wang <[email protected]>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Algorithm 1 SIGNSGD
Input: learning rate δ, current point x_k
  g̃_k ← stochasticGradient(x_k)
  x_{k+1} ← x_k − δ sign(g̃_k)

Algorithm 2 SIGNUM
Input: learning rate δ, momentum constant β ∈ (0, 1), current point x_k, current momentum m_k
  g̃_k ← stochasticGradient(x_k)
  m_{k+1} ← β m_k + (1 − β) g̃_k
  x_{k+1} ← x_k − δ sign(m_{k+1})


An elegant form of gradient compression is just to take the sign of each coordinate of the stochastic gradient vector, which we call SIGNSGD. The algorithm is as simple as throwing away the exponent and mantissa of a 32-bit floating point number. Sign-based methods have been studied at least since the days of RPROP (Riedmiller & Braun, 1993). This algorithm inspired many popular optimisers, like RMSPROP (Tieleman & Hinton, 2012) and ADAM (Kingma & Ba, 2015). But researchers were interested in RPROP and variants because of their robust and fast convergence, and not their potential for gradient compression.
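For concreteness, the two updates in Algorithms 1 and 2 can be written in a few lines of NumPy. The sketch below is ours, not the released code; the name stochastic_gradient is a stand-in for any mini-batch gradient oracle.

```python
# Minimal NumPy sketch of the SIGNSGD and SIGNUM updates (Algorithms 1 and 2).
import numpy as np

def signsgd_step(x, stochastic_gradient, lr=0.01):
    """One SIGNSGD step: move against the sign of the stochastic gradient."""
    g = stochastic_gradient(x)
    return x - lr * np.sign(g)

def signum_step(x, m, stochastic_gradient, lr=0.01, beta=0.9):
    """One SIGNUM step: update the momentum, then move against its sign."""
    g = stochastic_gradient(x)
    m = beta * m + (1.0 - beta) * g
    return x - lr * np.sign(m), m
```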

Until now there has been no rigorous theoretical explanation for the empirical success of sign-based stochastic gradient methods. The sign of the stochastic gradient is a biased approximation to the true gradient, making it more challenging to analyse compared to standard SGD. In this paper, we provide extensive theoretical analysis of sign-based methods for non-convex optimisation under transparent assumptions. We show that SIGNSGD is especially efficient in problems with a particular ℓ1 geometry: when gradients are as dense or denser than stochasticity and curvature, then SIGNSGD can converge with a theoretical rate that has similar or even better dimension dependence than SGD. We find empirically that both gradients and noise are dense in deep learning problems, consistent with the observation that SIGNSGD converges at a similar rate to SGD in practice.

Algorithm 3 Distributed training by majority vote
Input: learning rate δ, current point x_k, # workers M each with an independent gradient estimate g̃_m(x_k)
on server:
  pull sign(g̃_m) from each worker
  push sign[ Σ_{m=1}^{M} sign(g̃_m) ] to each worker
on each worker:
  x_{k+1} ← x_k − δ sign[ Σ_{m=1}^{M} sign(g̃_m) ]




We then analyse SIGNSGD in the distributed setting where the parameter server aggregates gradient signs of the workers by a majority vote. Thus we allow worker-server communication to be 1-bit compressed in both directions. We prove that the theoretical speedup matches that of distributed SGD, under natural assumptions that are validated by experiments.

We also extend our theoretical framework to the SIGNUM optimiser, which takes the sign of the momentum. Our theory suggests that momentum may be useful for controlling a tradeoff between bias and variance in the estimate of the stochastic gradient. On the practical side, we show that SIGNUM easily scales to large Imagenet models, and provided the learning rate and weight decay are tuned, all other hyperparameter settings, such as momentum, weight initialiser, learning rate schedules and data augmentation, may be lifted from an SGD implementation.

2. Related Work

Distributed machine learning: From the information theoretic angle, Suresh et al. (2017) study the communication limits of estimating the mean of a general quantity known about only through samples collected from M workers. In contrast, we focus exclusively on communication of gradients for optimisation, which allows us to exploit the fact that we do not care about incorrectly communicating small gradients in our theory. Still our work has connections with information theory. When the parameter server aggregates gradients by majority vote, it is effectively performing maximum likelihood decoding of a repetition encoding of the true gradient sign that is supplied by the M workers.

As for existing gradient compression schemes, Seide et al. (2014) and Strom (2015) demonstrated empirically that 1-bit quantisation can still give good performance whilst dramatically reducing gradient communication costs in distributed systems. Alistarh et al. (2017) and Wen et al. (2017) provide schemes with theoretical guarantees by using random number generators to ensure that the compressed gradient is still an unbiased approximation to the true gradient.

Table 1. The communication cost of different gradient compression schemes, when training a d-dimensional model with M workers.

ALGORITHM                        # BITS PER ITERATION
SGD (Robbins & Monro, 1951)      64Md
QSGD (Alistarh et al., 2017)     (2 + log(2M + 1))Md
TERNGRAD (Wen et al., 2017)      (2 + log(2M + 1))Md
SIGNSGD with majority vote       2Md

Whilst unbiasedness allows these schemes to bootstrap SGD theory, it unfortunately comes at the cost of hugely inflated variance, and this variance explosion¹ basically renders the SGD-style bounds vacuous in the face of the empirical success of these algorithms. The situation only gets worse when the parameter server must aggregate and send back the received gradients, since merely summing up quantised updates reduces the quantisation efficiency. We compare the schemes in Table 1; notice how the existing schemes pick up log factors in the transmission from parameter-server back to workers. Our proposed approach is different, in that we directly employ the sign gradient which is biased. This avoids the randomisation needed for constructing an unbiased quantised estimate, avoids the problem of variance exploding in the theoretical bounds, and even enables 1-bit compression in both directions between parameter-server and workers, at no theoretical loss compared to distributed SGD.

Deep learning: stochastic gradient descent (Robbins & Monro, 1951) is a simple and extremely effective optimiser for training neural networks. Still Riedmiller & Braun (1993) noted the good practical performance of sign-based methods like RPROP for training deep nets, and since then variants such as RMSPROP (Tieleman & Hinton, 2012) and ADAM (Kingma & Ba, 2015) have become increasingly popular. ADAM updates the weights according to the mean divided by the root mean square of recent gradients. Let $\langle \cdot \rangle_\beta$ denote an exponential moving average with timescale β, and $\tilde g$ the stochastic gradient. Then

$$\text{ADAM step} \;\sim\; \frac{\langle \tilde g \rangle_{\beta_1}}{\sqrt{\langle \tilde g^2 \rangle_{\beta_2}}}.$$

Therefore taking the time scale of the exponential moving averages to zero, β1, β2 → 0, yields SIGNSGD:

$$\text{SIGNSGD step} \;=\; \operatorname{sign}(\tilde g) \;=\; \frac{\tilde g}{\sqrt{\tilde g^2}}.$$
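A quick numerical check of this limit (ours, with an arbitrary example gradient and the usual small ε in the denominator):

```python
# With beta_1 = beta_2 = 0, an ADAM-style step g / sqrt(g^2 + eps) collapses to
# sign(g), up to the small eps term. Example gradient values are hypothetical.
import numpy as np

g = np.array([-2.5, 0.3, 1e-4, 4.0])      # hypothetical stochastic gradient
eps = 1e-8
adam_like_step = g / np.sqrt(g**2 + eps)  # beta_1, beta_2 -> 0 limit of ADAM
print(adam_like_step)                     # approx [-1.  1.  0.71  1.]; eps damps the tiny entry
print(np.sign(g))                         # [-1.  1.  1.  1.]
```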

To date there has been no convincing theory of the {RPROP, RMSPROP, ADAM} family of algorithms, known as 'adaptive gradient methods'. Indeed Reddi et al. (2018) point out problems in the original convergence proof of ADAM, even in the convex setting.

¹For the version of QSGD with 1-bit compression, the variance explosion is by a factor of √d, where d is the number of weights. It is common to have d > 10^8 in modern deep networks.


Since SIGNSGD belongs to this same family of algorithms, we expect that our theoretical analysis should be relevant for all algorithms in the family. In a parallel work, Balles & Hennig (2017) explore the connection between SIGNSGD and ADAM in greater detail, though their theory is more restricted and lives in the convex world, and they do not analyse SIGNUM as we do but employ it on heuristic grounds.

Optimisation: much of classic optimisation theory focuses on convex problems, where local information in the gradient tells you global information about the direction to the minimum. Whilst elegant, this theory is less relevant for modern problems in deep learning which are non-convex. In non-convex optimisation, finding the global minimum is generally intractable. Theorists usually settle for measuring some restricted notion of success, such as rate of convergence to stationary points (Ghadimi & Lan, 2013; Allen-Zhu, 2017a) or local minima (Nesterov & Polyak, 2006). Though Dauphin et al. (2014) suggest saddle points should abound in neural network error landscapes, practitioners report not finding this a problem in practice (Goodfellow et al., 2015), and therefore a theory of convergence to stationary points is useful and informative.

On the algorithmic level, the non-stochastic version of SIGNSGD can be viewed as the classical steepest descent algorithm with ℓ∞-norm (see, e.g., Boyd & Vandenberghe, 2004, Section 9.4). The convergence of steepest descent is well-known (see Karimi et al., 2016, Appendix C, for an analysis of signed gradient updates under the Polyak-Łojasiewicz condition). Carlson et al. (2016) study a stochastic version of the algorithm, but again under an ℓ∞ majorisation. To the best of our knowledge, we are the first to study the convergence of signed gradient updates under an (often more natural) ℓ2 majorisation (Assumption 2).

Experimental benchmarks: throughout the paper we will make use of the CIFAR-10 (Krizhevsky, 2009) and Imagenet (Russakovsky et al., 2015) datasets. As for neural network architectures, we train Resnet-20 (He et al., 2016a) on CIFAR-10, and Resnet-50 v2 (He et al., 2016b) on Imagenet.

3. Convergence Analysis of SIGNSGD

We begin our analysis of sign stochastic gradient descent in the non-convex setting. The standard assumptions of the stochastic optimisation literature are nicely summarised by Allen-Zhu (2017b). We will use more fine-grained assumptions. SIGNSGD can exploit this additional structure, much as ADAGRAD (Duchi et al., 2011) exploits sparsity. We emphasise that these fine-grained assumptions do not lose anything over typical SGD assumptions, since our assumptions can be obtained from SGD assumptions and vice versa.

Assumption 1 (Lower bound). For all x and some constant $f_*$, we have objective value $f(x) \ge f_*$.

This assumption is standard and necessary for guaranteed convergence to a stationary point.

The next two assumptions will naturally encode notions of heterogeneous curvature and gradient noise.

Assumption 2 (Smooth). Let g(x) denote the gradient of the objective f(·) evaluated at point x. Then ∀x, y we require that for some non-negative constant vector $\vec L := [L_1, \dots, L_d]$,

$$\left| f(y) - \left[ f(x) + g(x)^T (y - x) \right] \right| \;\le\; \frac{1}{2} \sum_i L_i (y_i - x_i)^2 .$$

For twice differentiable f, this implies that $-\mathrm{diag}(\vec L) \prec H \prec \mathrm{diag}(\vec L)$. This is related to the slightly weaker coordinate-wise Lipschitz condition used in the block coordinate descent literature (Richtarik & Takac, 2014).

Lastly, we assume that we have access to the following stochastic gradient oracle:

Assumption 3 (Variance bound). Upon receiving query $x \in \mathbb{R}^d$, the stochastic gradient oracle gives us an independent unbiased estimate $\tilde g$ that has coordinate bounded variance:

$$\mathbb{E}[\tilde g(x)] = g(x), \qquad \mathbb{E}\left[(\tilde g(x)_i - g(x)_i)^2\right] \le \sigma_i^2$$

for a vector of non-negative constants $\vec\sigma := [\sigma_1, \dots, \sigma_d]$.

Bounded variance may be unrealistic in practice, since as x → ∞ the variance might well diverge. Still this assumption is useful for understanding key properties of stochastic optimisation algorithms. In our theorem, we will be working with a mini-batch of size $n_k$ in the kth iteration, and the corresponding mini-batch stochastic gradient is modeled as the average of $n_k$ calls to the above oracle at $x_k$. This squashes the variance bound on $\tilde g(x)_i$ to $\sigma_i^2 / n_k$.

Assumptions 2 and 3 are different from the assumptions typically used for analysing the convergence properties of SGD (Nesterov, 2013; Ghadimi & Lan, 2013), but they are natural to the geometry induced by algorithms with signed updates such as SIGNSGD and SIGNUM.

Assumption 2 is more fine-grained than the standard assumption, which is recovered by defining the ℓ2 Lipschitz constant $L := \|\vec L\|_\infty = \max_i L_i$. Then Assumption 2 implies that

$$\left| f(y) - \left[ f(x) + g(x)^T (y - x) \right] \right| \;\le\; \frac{L}{2}\, \|y - x\|_2^2 ,$$

which is the standard assumption of Lipschitz smoothness.

Page 4: signSGD: Compressed Optimisation for Non-Convex Problems · 2018-08-09 · SIGNSGD: Compressed Optimisation for Non-Convex Problems Jeremy Bernstein 1 2Yu-Xiang Wang2 3 Kamyar Azizzadenesheli4

SIGNSGD: Compressed Optimisation for Non-Convex Problems

Assumption 3 is more fine-grained than the standard stochastic gradient oracle assumption used for SGD analysis. But again, the standard variance bound is recovered by defining $\sigma^2 := \|\vec\sigma\|_2^2$. Then Assumption 3 implies that

$$\mathbb{E}\,\|\tilde g(x) - g(x)\|_2^2 \;\le\; \sigma^2 ,$$

which is the standard assumption of bounded total variance.

Under these assumptions, we have the following result:

Theorem 1 (Non-convex convergence rate of SIGNSGD). Run Algorithm 1 for K iterations under Assumptions 1 to 3. Set the learning rate and mini-batch size (independently of step k) as

$$\delta_k = \frac{1}{\sqrt{\|\vec L\|_1 K}}, \qquad n_k = K.$$

Let N be the cumulative number of stochastic gradient calls up to step K, i.e. N = O(K²). Then we have

$$\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1} \|g_k\|_1\right]^2 \;\le\; \frac{1}{\sqrt{N}}\left[\sqrt{\|\vec L\|_1}\left(f_0 - f_* + \frac{1}{2}\right) + 2\,\|\vec\sigma\|_1\right]^2 .$$

The proof is given in Section B of the supplementary material. It follows the well known strategy of relating the norm of the gradient to the expected improvement made in a single algorithmic step, and comparing this with the total possible improvement under Assumption 1. A key technical challenge we overcome is in showing how to directly deal with a biased approximation to the true gradient. Here we will provide some intuition about the proof.

To pass the stochasticity through the non-linear sign operation in a controlled fashion, we need to prove the key statement that at the kth step, for the ith gradient component,

$$\mathbb{P}\left[\operatorname{sign}(\tilde g_{k,i}) \ne \operatorname{sign}(g_{k,i})\right] \;\le\; \frac{\sigma_{k,i}}{|g_{k,i}|}.$$

This formalises the intuition that the probability of the sign of a component of the stochastic gradient being incorrect should be controlled by the signal-to-noise ratio of that component. When a component's gradient is large, the probability of making a mistake is low, and one expects to make good progress. When the gradient is small compared to the noise, the probability of making mistakes can be high, but due to the large batch size this only happens when we are already close to a stationary point.
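A small Monte Carlo sketch (ours, using Gaussian noise and hypothetical values of the gradient component and noise level) illustrates the inequality:

```python
# Monte Carlo sanity check of P[sign flip] <= sigma / |g| for a single
# gradient component with mean g and standard deviation sigma.
import numpy as np

rng = np.random.default_rng(0)
g, sigma, trials = 1.5, 1.0, 1_000_000      # hypothetical signal and noise levels
samples = g + sigma * rng.standard_normal(trials)
flip_prob = np.mean(np.sign(samples) != np.sign(g))
print(f"empirical flip probability {flip_prob:.3f} <= bound {sigma / abs(g):.3f}")
# roughly 0.067 <= 0.667 for these values; the bound is loose here but valid.
```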

The large batch size in the theorem may seem unusual, but large batch training actually presents a systems advantage (Goyal et al., 2017) since it can be parallelised. The number of gradient calls N is the important quantity to measure convergence, but large batch training achieves N gradient calls in only O(√N) iterations whereas small batch training needs O(N) iterations. Fewer iterations also means fewer rounds of communication in the distributed setting. Convergence guarantees can be extended to the small batch case under the additional assumption of unimodal symmetric gradient noise using Lemma D.1 in the supplementary, but we leave this for future work. Experiments in this paper were indeed conducted in the small batch regime.

Another unusual feature requiring discussion is the ℓ1 geometry of SIGNSGD. The convergence rate strikingly depends on the ℓ1-norm of the gradient, the stochasticity and the curvature. To understand this better, let's define a notion of density of a high-dimensional vector $\vec v \in \mathbb{R}^d$ as follows:

$$\phi(\vec v) := \frac{\|\vec v\|_1^2}{d\,\|\vec v\|_2^2}. \qquad (1)$$

To see that this is a natural definition of density, notice that for a fully dense vector, $\phi(\vec v) = 1$, and for a fully sparse vector, $\phi(\vec v) = 1/d \approx 0$. We trivially have that $\|\vec v\|_1^2 \le \phi(\vec v)\, d^2\, \|\vec v\|_\infty^2$, so this notion of density provides an easy way to translate from norms in ℓ1 to both ℓ2 and ℓ∞.
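A few lines of NumPy (ours; the helper name density is hypothetical) confirm the two extremes:

```python
# Sketch of the density measure phi(v) = ||v||_1^2 / (d * ||v||_2^2).
import numpy as np

def density(v):
    v = np.asarray(v, dtype=float)
    return np.linalg.norm(v, 1) ** 2 / (v.size * np.linalg.norm(v, 2) ** 2)

d = 1000
dense_v = np.ones(d)                       # fully dense vector
sparse_v = np.zeros(d); sparse_v[0] = 1.0  # fully sparse vector
print(density(dense_v))                    # 1.0
print(density(sparse_v))                   # 1/d = 0.001
```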

Remember that under our assumptions, SGD-style assumptions hold with Lipschitz constant $L := \|\vec L\|_\infty$ and total variance bound $\sigma^2 := \|\vec\sigma\|_2^2$. Using our notion of density we can translate our constants into the language of SGD:

$$\|g_k\|_1^2 = \phi(g_k)\, d\, \|g_k\|_2^2 \ge \phi(g)\, d\, \|g_k\|_2^2 ,$$
$$\|\vec L\|_1^2 \le \phi(\vec L)\, d^2\, \|\vec L\|_\infty^2 = \phi(\vec L)\, d^2 L^2 ,$$
$$\|\vec\sigma\|_1^2 = \phi(\vec\sigma)\, d\, \|\vec\sigma\|_2^2 = \phi(\vec\sigma)\, d\, \sigma^2 ,$$

where we have assumed φ(g) to be a lower bound on the gradient density over the entire space. Using that (x + y)² ≤ 2(x² + y²) and changing variables in the bound, we reach the following result for SIGNSGD:

$$\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1} \|g_k\|_2\right]^2 \;\le\; \frac{2}{\sqrt{N}}\left[\frac{\sqrt{\phi(\vec L)}}{\phi(g)}\, L \left(f_0 - f_* + \frac{1}{2}\right)^2 + 4\,\frac{\phi(\vec\sigma)}{\phi(g)}\,\sigma^2\right],$$

whereas, for comparison, a typical SGD bound (proved in Supplementary C) is

$$\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1} \|g_k\|_2^2\right] \;\le\; \frac{1}{\sqrt{N}}\left[2L\,(f_0 - f_*) + \sigma^2\right].$$

The bounds are very similar, except, most notably, for the appearance of the ratios of densities R1 and R2, defined as

$$R_1 := \frac{\sqrt{\phi(\vec L)}}{\phi(g)}, \qquad R_2 := \frac{\phi(\vec\sigma)}{\phi(g)}.$$


Naively comparing the bounds suggests breaking into cases:

(I) R1 ≫ 1 and R2 ≫ 1. This means that both the curvature and the stochasticity are much denser than the typical gradient, and the comparison suggests SGD is better suited than SIGNSGD.

(II) NOT[R1 ≫ 1] and NOT[R2 ≫ 1]. This means that neither curvature nor stochasticity are much denser than the gradient, and the comparison suggests that SIGNSGD may converge as fast or faster than SGD, and also get the benefits of gradient compression.

(III) Neither of the above holds, for example R1 ≫ 1 and R2 ≪ 1. Then the comparison is indeterminate about whether SIGNSGD or SGD is more suitable.

Let's briefly provide some intuition to understand how it's possible that SIGNSGD could outperform SGD. Imagine a scenario where the gradients are dense but there is a sparse set of extremely noisy components. Then the dynamics of SGD will be dominated by this noise, and (unless the learning rate is reduced a lot) SGD will effectively perform a random walk along these noisy components, paying less attention to the gradient signal. SIGNSGD however will treat all components equally, so it will scale down the sparse noise and scale up the dense gradients comparatively, and thus make good progress. See Figure A.1 in the supplementary for a simple example of this.

Still we must be careful when comparing upper bounds, and interpreting the dependence on curvature density is more subtle than noise density. This is because the SGD bound proved in Supplementary C is slacker under situations of sparse curvature than dense curvature. That is to say that SGD, like SIGNSGD, benefits under situations of sparse curvature, but this is not reflected in the SGD bound. The potentially slack step in SGD's analysis is in switching from $L_i$ to $\|\vec L\|_\infty$. Because of this it is safer to interpret the curvature comparison as telling us a regime where SIGNSGD is expected to lose out to SGD (rather than vice versa). This happens when R1 ≫ 1 and gradients are sparser than curvature. Intuitively, in this case SIGNSGD will push many components in highly curved directions even though these components had small gradient, and this can be undesirable.

To summarise, our theory suggests that when gradients are dense, SIGNSGD should be more robust to large stochasticity on a sparse set of coordinates. When gradients are sparse, SGD should be more robust to dense curvature and noise. In practice for deep networks, we find that SIGNSGD converges about as fast as SGD. That would suggest that we are either in regime (II) or (III) above. But what is the real situation for the error landscape of deep neural networks?

To measure gradient and noise densities in practice, we use Welford's algorithm (Welford, 1962; Knuth, 1997) to compute the true gradient g and its stochasticity vector $\vec\sigma$ at every epoch of training for a Resnet-20 model on CIFAR-10. Welford's algorithm is numerically stable and only takes a single pass through the data to compute the vectorial mean and variance. Therefore if we train a network for 160 epochs, we make an additional 160 passes through the data to evaluate these gradient statistics. Results are plotted in Figure 1. Notice that the gradient density and noise density are of the same order throughout training, and this indeed puts us in regime (II) or (III) as predicted by our theory.
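A minimal sketch of the one-pass Welford accumulator we rely on is given below; it is a standard textbook formulation rather than the exact code used for our experiments.

```python
# One-pass Welford accumulator for the per-coordinate mean and variance of
# mini-batch gradients, as used for the density measurements in Figure 1.
import numpy as np

class Welford:
    """Numerically stable running mean and variance of gradient vectors."""
    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)   # running sum of squared deviations from the mean

    def update(self, g):
        self.n += 1
        delta = g - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (g - self.mean)

    def variance(self):
        return self.m2 / (self.n - 1)   # unbiased per-coordinate variance (n >= 2)

# Usage: feed one mini-batch gradient per call to update(), then read off the
# mean gradient and the square root of variance() as the stochasticity vector.
```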

[Figure 1: gradient density φ(g) (left panel) and variance density φ(σ) (right panel) plotted against training epoch for SGD, SIGNUM and ADAM.]

Figure 1. Gradient and noise density during an entire training run of a Resnet-20 model on the CIFAR-10 dataset. Results are averaged over 3 repeats for each of 3 different training algorithms, and corresponding error bars are plotted. At the beginning of every epoch, at that fixed point in parameter space, we do a full pass over the data to compute the exact mean of the stochastic gradient, g, and its exact standard deviation vector $\vec\sigma$ (square root of the diagonal of the covariance matrix). The density measure $\phi(\vec v) := \frac{\|\vec v\|_1^2}{d\|\vec v\|_2^2}$ is 1 for a fully dense vector and ≈ 0 for a fully sparse vector. Notice that both gradient and noise are dense, and moreover the densities appear to be coupled during training. Noticeable jumps occur at epoch 80 and 120 when the learning rate is decimated. Our stochastic gradient oracle (Assumption 3) is fine-grained enough to encode such dense geometries of noise.


In Figure A.2 of the supplementary, we present preliminary evidence that this finding generalises, by showing that gradients are dense across a range of datasets and network architectures. We have not devised an efficient means to measure curvature densities, which we leave for future work.

4. Majority Rule: the Power of Democracy in the Multi-Worker Setting

In the most common form of distributed training, workers (such as GPUs) each evaluate gradients on their own split of the data, and send the results up to a parameter-server. The parameter server aggregates the results and transmits them back to each worker (Li et al., 2014).

Up until this point in the paper, we have only analysed SIGNSGD, where the update is of the form

$$x_{k+1} = x_k - \delta\,\operatorname{sign}(\tilde g).$$


[Figure 2: histograms of stochastic gradient noise; top row CIFAR-10, bottom row Imagenet; columns from left to right correspond to SGD, SIGNUM and ADAM.]

Figure 2. Histograms of the noise in the stochastic gradient, each plot for a different randomly chosen parameter (not cherry-picked). Top row: Resnet-20 architecture trained to epoch 50 on CIFAR-10 with a batch size of 128. Bottom row: Resnet-50 architecture trained to epoch 50 on Imagenet with a batch size of 256. From left to right: model trained with SGD, SIGNUM, ADAM. All noise distributions appear to be unimodal and approximately symmetric. For a batch size of 256 Imagenet images, the central limit theorem has visibly kicked in and the distributions look Gaussian.


To get the benefits of compression we want the mth worker to send the sign of the gradient evaluated only on its portion of the data. This suggests an update of the form

$$x_{k+1} = x_k - \delta \sum_{m=1}^{M} \operatorname{sign}(\tilde g_m). \qquad \text{(good)}$$

This scheme is good since what gets sent to the parameter server will be 1-bit compressed. But what gets sent back almost certainly will not. Could we hope for a scheme where all communication is 1-bit compressed?

What about the following scheme:

$$x_{k+1} = x_k - \delta\,\operatorname{sign}\left[\sum_{m=1}^{M} \operatorname{sign}(\tilde g_m)\right]. \qquad \text{(best)}$$

This is called majority vote, since each worker is essentially voting with its belief about the sign of the true gradient. The parameter server counts the votes, and sends its 1-bit decision back to every worker.
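A minimal sketch (ours, not the released implementation) of one majority vote step makes the communication pattern explicit; worker_gradients stands in for the M local stochastic gradients.

```python
# One distributed SIGNSGD step with majority vote (Algorithm 3): each of M
# workers sends 1 bit per coordinate up, the server sends 1 bit per coordinate back.
import numpy as np

def majority_vote_step(x, worker_gradients, lr=0.01):
    # each worker transmits only the sign of its own stochastic gradient
    worker_votes = [np.sign(g) for g in worker_gradients]
    # the server counts the votes and broadcasts the sign of the tally;
    # with an odd number of workers and non-zero votes there are no ties
    aggregate = np.sign(np.sum(worker_votes, axis=0))
    return x - lr * aggregate
```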

The machinery of Theorem 1 is enough to establish convergence for the (good) scheme, but majority vote is more elegant and more communication efficient, therefore we focus on this scheme from here on.

In Theorem 2 we first establish the general convergence rate of majority vote, followed by a regime where majority vote enjoys a variance reduction from $\|\vec\sigma\|_1$ to $\|\vec\sigma\|_1/\sqrt{M}$.

Theorem 2 (Non-convex convergence rate of distributed SIGNSGD with majority vote). Run Algorithm 3 for K iterations under Assumptions 1 to 3. Set the learning rate and mini-batch size for each worker (independently of step k) as

$$\delta_k = \frac{1}{\sqrt{\|\vec L\|_1 K}}, \qquad n_k = K.$$

Then (a) majority vote with M workers converges at least as fast as SIGNSGD in Theorem 1.

And (b) further assuming that the noise in each component of the stochastic gradient is unimodal and symmetric about the mean (e.g. Gaussian), majority vote converges at the improved rate

$$\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1} \|g_k\|_1\right]^2 \;\le\; \frac{1}{\sqrt{N}}\left[\sqrt{\|\vec L\|_1}\left(f_0 - f_* + \frac{1}{2}\right) + \frac{2}{\sqrt{M}}\,\|\vec\sigma\|_1\right]^2 ,$$

where N is the cumulative number of stochastic gradient calls per worker up to step K.

The proof is given in the supplementary material, but here we sketch some details. Consider the signal-to-noise ratio of a single component of the stochastic gradient, defined as $S := |g_i|/\sigma_i$. For S < 1 the gradient is small and it doesn't matter if we get the sign wrong. For S > 1, we can show using a one-sided version of Chebyshev's inequality (Cantelli, 1928) that the failure probability q of that sign bit on an individual worker satisfies q < 1/2. This means that the parameter server is essentially receiving a repetition code $R_M$, and the majority vote decoder is known to drive down the failure probability of a repetition code exponentially in the number of repeats (MacKay, 2002).

Remark: Part (a) of the theorem does not describe a speedup over just using a single machine, and that might hint that all those extra M − 1 workers are a waste in this setting. This is not the case. From the proof sketch above, it should be clear that part (a) is an extremely conservative statement. In particular, we expect all regions of training where the signal-to-noise ratio of the stochastic gradient satisfies S > 1 to enjoy a significant speedup due to variance reduction. It's just that since we don't get the speedup when S < 1, it's hard to express this in a compact bound.

To sketch a proof for part (b), note that a sign bit from each worker is a Bernoulli trial; call its failure probability q. We can get a tight control of q by a convenient tail bound owing to Gauss (1823) that holds under conditions of unimodal symmetry. Then the sum of bits received by the parameter server is a binomial random variable, and we can use Cantelli's inequality to bound its tail. This turns out to be enough to get tight enough control on the error probability of the majority vote decoder to prove the theorem.
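A small Monte Carlo sketch (ours, with a hypothetical Gaussian gradient component whose per-worker failure probability is below 1/2) illustrates how quickly the vote drives the error down:

```python
# Repetition-code picture: per-worker sign bits are Bernoulli trials with
# failure probability q < 1/2, so the majority vote error decays with M.
import numpy as np

rng = np.random.default_rng(0)
g, sigma, trials = 1.0, 2.0, 200_000          # hypothetical component; per-worker q ~ 0.31
for M in [1, 7, 25, 99]:                      # odd M avoids ties
    votes = np.sign(g + sigma * rng.standard_normal((trials, M)))
    wrong = np.mean(np.sign(votes.sum(axis=1)) != np.sign(g))
    print(f"M = {M:3d}  majority-vote failure probability ~ {wrong:.4f}")
# failure probability drops from roughly 0.31 at M = 1 towards zero as M grows.
```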

Remark 1: assuming that the stochastic gradient of each worker is approximately symmetric and unimodal is very reasonable. In particular, for increasing mini-batch size it will be an ever-better approximation by the central limit theorem. Figure 2 plots histograms of real stochastic gradients for neural networks. Even at batch size 256 the stochastic gradient for an Imagenet model already looks Gaussian.

Remark 2: if you delve into the proof of Theorem 2 and graph all of the inequalities, you will notice that some of them are uniformly slack. This suggests that the assumptions of symmetry and unimodality can actually be relaxed to only hold approximately. This raises the possibility of proving a relaxed form of Gauss' inequality and using a third moment bound in the Berry-Esseen theorem to derive a minimal batch size for which the majority vote scheme is guaranteed to work by the central limit theorem. We leave this for future work.

Remark 3: why does this theorem have anything to do with unimodality or symmetry at all? It's because there exist very skewed or bimodal random variables X with mean µ such that P[sign(X) = sign(µ)] is arbitrarily small. This can either be seen by applying Cantelli's inequality, which is known to be tight, or by playing with distributions like

$$\mathbb{P}[X = x] = \begin{cases} 0.1 & \text{if } x = 50, \\ 0.9 & \text{if } x = -1. \end{cases}$$

Distributions like these are a problem because it means that adding more workers will actually drive up the error probability rather than driving it down. The beauty of the central limit theorem is that even for such a skewed and bimodal distribution, the mean of just a few tens of samples will already start to look Gaussian.
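A few lines of NumPy (ours) make the point concrete for this particular distribution:

```python
# The mean of this skewed distribution is positive, yet a single sample gets
# the sign wrong 90% of the time; averaging a few tens of samples fixes it.
import numpy as np

rng = np.random.default_rng(0)
values, probs = np.array([50.0, -1.0]), np.array([0.1, 0.9])
mean = values @ probs                                    # = 4.1 > 0
single = rng.choice(values, size=100_000, p=probs)
print(np.mean(np.sign(single) == np.sign(mean)))         # ~ 0.10
batch = rng.choice(values, size=(100_000, 30), p=probs).mean(axis=1)
print(np.mean(np.sign(batch) == np.sign(mean)))          # ~ 0.96
```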

5. Extending the Theory to SIGNUM

Momentum is a popular trick used by neural network practitioners that can, in our experience, speed up the training of deep neural networks and improve the robustness of algorithms to other hyperparameter settings. Instead of taking steps according to the gradient, momentum algorithms take steps according to a running average of recent gradients.

Existing theoretical analyses of momentum often rely on the absence of gradient stochasticity (e.g. Jin et al. (2017)) or convexity (e.g. Goh (2017)) to show that momentum's asymptotic convergence rate can beat gradient descent.

It is easy to incorporate momentum into SIGNSGD, merely by taking the sign of the momentum. We call the resulting algorithm SIGNUM and present the algorithmic step formally in Algorithm 2. SIGNUM fits into our theoretical framework, and we prove its convergence rate in Theorem 3.

Theorem 3 (Convergence rate of SIGNUM). In Algorithm 2, set the learning rate, mini-batch size and momentum parameter respectively as

$$\delta_k = \frac{\delta}{\sqrt{k+1}}, \qquad n_k = k+1, \qquad \beta.$$

Our analysis also requires a warmup period to let the bias in the momentum settle down. The warmup should last for C(β) iterations, where C is a constant that depends on the momentum parameter β as follows:

$$C(\beta) = \min_{C \in \mathbb{Z}^+} C \quad \text{s.t.} \quad \frac{C}{2}\,\beta^C \le \frac{1}{1-\beta^2}\cdot\frac{1}{C+1} \quad \text{and} \quad \beta^{C+1} \le \frac{1}{2}.$$

Note that for β = 0.9 we have C = 54, which is negligible. For the first C(β) iterations, accumulate the momentum as normal, but use the sign of the stochastic gradient to make updates instead of the sign of the momentum.

Let N be the cumulative number of stochastic gradient calls up to step K, i.e. N = O(K²). Then for K ≫ C we have

$$\mathbb{E}\left[\frac{1}{K-C}\sum_{k=C}^{K-1}\|g_k\|_1\right]^2 = O\left(\frac{1}{\sqrt{N}}\left[\frac{f_C - f_*}{\delta} + (1 + \log N)\left(\frac{\delta\,\|\vec L\|_1}{1-\beta} + \|\vec\sigma\|_1\sqrt{1-\beta}\right)\right]^2\right),$$

where we have used O(·) to hide numerical constants and the β-dependent constant C.
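The warmup constant can be computed directly from these conditions; the sketch below is ours and assumes the reconstruction of the definition of C(β) written in the theorem above.

```python
# Compute the warmup constant C(beta), assuming the reconstructed definition
# above: (C/2) beta^C <= 1 / ((1 - beta^2)(C + 1))  and  beta^(C+1) <= 1/2.
def warmup_constant(beta):
    C = 1
    while not ((C / 2) * beta**C <= 1 / ((1 - beta**2) * (C + 1))
               and beta**(C + 1) <= 0.5):
        C += 1
    return C

print(warmup_constant(0.9))   # 54, matching the value quoted in the theorem
```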

The proof is the greatest technical challenge of the paper, and is given in the supplementary material. We focus on presenting the proof in a modular form, anticipating that parts may be useful in future theoretical work. It involves a very general master lemma, Lemma E.1, which can be used to help prove all the theorems in this paper.


[Figure 3: top-1 test accuracy (left panel) and top-1 train accuracy (right panel) against epoch for SGD, SIGNUM, ADAM and SGD without weight decay.]

Figure 3. Imagenet train and test accuracies using the momentum version of SIGNSGD, called SIGNUM, to train Resnet-50 v2. We based our implementation on an open source implementation by github.com/tornadomeet. Initial learning rate and weight decay were tuned on a separate validation set split off from the training set, and all other hyperparameters were chosen to be those found favourable for SGD by the community. There is a big jump at epoch 95 when we switch off data augmentation. SIGNUM gets test set performance approximately the same as ADAM, better than SGD without weight decay, but about 2% worse than SGD with a well-tuned weight decay.

Remark 1: switching optimisers after a warmup period is in fact commonly done by practitioners (Akiba et al., 2017).

Remark 2: the theory suggests that momentum can be used to control a bias-variance tradeoff in the quality of stochastic gradient estimates. Sending β → 1 kills the variance term in $\|\vec\sigma\|_1$ due to averaging gradients over a longer time horizon. But averaging in stale gradients induces bias due to curvature of f(x), and this blows up the $\delta\|\vec L\|_1$ term.

Remark 3: for generality, we state this theorem with a tunable learning rate δ. For variety, we give this theorem in any-time form with a growing batch size and decaying learning rate. This comes at the cost of log factors appearing.

We benchmark SIGNUM on Imagenet (Figure 3) and CIFAR-10 (Figure A.3 of the supplementary). The full results of a giant hyperparameter grid search for the CIFAR-10 experiments are also given in the supplementary. SIGNUM's performance rivals ADAM's in all experiments.

6. Discussion

Gradient compression schemes like TERNGRAD (Wen et al., 2017) quantise gradients into three levels {0, ±1}. This is desirable when the ternary quantisation is sparse, since it can allow further compression. Our scheme of majority vote should easily be compatible with a ternary quantisation, in both directions of communication. This can be cast as "majority vote with abstention". The scheme is as follows: workers send their vote to the parameter server, unless they are very unsure about the sign of the true gradient, in which case they send zero. The parameter-server counts the votes, and if quorum is not reached (i.e. too many workers disagreed or abstained) the parameter-server sends back zero. This extended algorithm should readily fit into our theory.

In Section 2 we pointed out that SIGNSGD and SIGNUM are closely related to ADAM. In all our experiments we find that SIGNUM and ADAM have very similar performance, although both lose out to SGD by about 2% test accuracy on Imagenet. Wilson et al. (2017) observed that ADAM tends to generalise slightly worse than SGD. Though it is still unclear why this is the case, perhaps it could be because we don't know how to properly regularise such methods. Whilst we found that neither standard weight decay nor the suggestion of Loshchilov & Hutter (2017) completely closed our Imagenet test set gap with SGD, it is possible that some other regularisation scheme might. One idea, suggested by our theory, is that SIGNSGD could be squashing down noise levels. There is some evidence (Smith & Le, 2018) that a certain level of noise can be good for generalisation, biasing the optimiser towards wider valleys in the objective function. Perhaps, then, adding Gaussian noise to the SIGNUM update might help it generalise better. This can be achieved in a communication efficient manner in the distributed setting by sharing a random seed with each worker, and then generating the same noise on each worker.
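A minimal sketch (ours; the function name and seeding scheme are illustrative, not from the paper) of that seed-sharing trick:

```python
# Every worker seeds an identical RNG from (shared_seed, step), so all of them
# generate bitwise-identical Gaussian noise without communicating the noise vector.
import numpy as np

def shared_noise(step, dim, shared_seed=1234, scale=0.01):
    rng = np.random.default_rng([shared_seed, step])
    return scale * rng.standard_normal(dim)
```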

Finally, in Section 3 we discuss some geometric implications of our theory, and provide an efficient and robust experimental means of measuring one aspect, the ratio between noise and gradient density, through the Welford algorithm. We believe that since this density ratio is easy to measure, it may be useful to help guide those doing architecture search, to find network architectures which are amenable to fast training through gradient compression schemes.

7. Conclusion

We have presented a general framework for studying sign-based methods in stochastic non-convex optimisation. We present non-vacuous bounds for gradient compression schemes, and elucidate the special ℓ1 geometries under which these schemes can be expected to succeed. Our theoretical framework is broad enough to handle signed-momentum schemes, like SIGNUM, and also multi-worker distributed schemes, like majority vote.

Our work touches upon interesting aspects of the geometry of high-dimensional error surfaces, which we wish to explore in future work. But the next step for us will be to reach out to members of the distributed systems community to help benchmark the majority vote algorithm, which shows such great theoretical promise for 1-bit compression in both directions between parameter-server and workers.


Acknowledgments

The authors are grateful to the anonymous reviewers for their helpful comments, as well as Jiawei Zhao, Michael Tschannen, Julian Salazar, Tan Nguyen, Fanny Yang, Mu Li, Aston Zhang and Zack Lipton for useful discussions. Thanks to Ryan Tibshirani for pointing out the connection to steepest descent.

KA is supported in part by NSF Career Award CCF-1254106 and Air Force FA9550-15-1-0221. AA is supported in part by Microsoft Faculty Fellowship, Google Faculty Research Award, Adobe Grant, NSF Career Award CCF-1254106, and AFOSR YIP FA9550-15-1-0221.

References

Akiba, T., Suzuki, S., and Fukuda, K. Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes. arXiv:1711.04325, 2017.

Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding. In Advances in Neural Information Processing Systems (NIPS-17), 2017.

Allen-Zhu, Z. Natasha: Faster Non-Convex Stochastic Optimization via Strongly Non-Convex Parameter. In International Conference on Machine Learning (ICML-17), 2017a.

Allen-Zhu, Z. Natasha 2: Faster Non-Convex Optimization Than SGD. arXiv:1708.08694, 2017b.

Balles, L. and Hennig, P. Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients. arXiv:1705.07774, 2017.

Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.

Cantelli, F. P. Sui confini della probabilità. Atti del Congresso Internazionale dei Matematici, 1928.

Carlson, D., Hsieh, Y., Collins, E., Carin, L., and Cevher, V. Stochastic spectral descent for discrete graphical models. IEEE Journal of Selected Topics in Signal Processing, 10(2):296–311, 2016.

Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization. In Advances in Neural Information Processing Systems (NIPS-14), 2014.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

Gauss, C. F. Theoria combinationis observationum erroribus minimis obnoxiae, pars prior. Commentationes Societatis Regiae Scientiarum Gottingensis Recentiores, 1823.

Ghadimi, S. and Lan, G. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

Glorot, X. and Bengio, Y. Understanding the Difficulty of Training Deep Feedforward Neural Networks. In Artificial Intelligence and Statistics (AISTATS-10), 2010.

Goh, G. Why Momentum Really Works. Distill, 2017.

Goodfellow, I. J., Vinyals, O., and Saxe, A. M. Qualitatively Characterizing Neural Network Optimization Problems. In International Conference on Learning Representations (ICLR-15), 2015.

Goyal, P., Dollar, P., Girshick, R. B., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv:1706.02677, 2017.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR-16), pp. 770–778, 2016a.

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision (ECCV-16), pp. 630–645. Springer, 2016b.

Jin, C., Netrapalli, P., and Jordan, M. I. Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent. arXiv:1711.10456, 2017.

Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD-16), pp. 795–811. Springer, 2016.

Kingma, D. P. and Ba, J. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations (ICLR-15), 2015.

Knuth, D. E. The Art of Computer Programming, Volume 2 (3rd Ed.): Seminumerical Algorithms. Addison-Wesley Longman Publishing Co., Inc., 1997.

Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto, 2009.

LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436, 2015.


Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.-Y. Scaling Distributed Machine Learning with the Parameter Server. In Symposium on Operating Systems Design and Implementation (OSDI-14), pp. 583–598, 2014.

Loshchilov, I. and Hutter, F. Fixing Weight Decay Regularization in Adam. arXiv:1711.05101, 2017.

MacKay, D. J. C. Information Theory, Inference & Learning Algorithms. Cambridge University Press, 2002.

Nesterov, Y. Introductory Lectures on Convex Optimization: A Basic Course. Springer, 2013.

Nesterov, Y. and Polyak, B. Cubic Regularization of Newton Method and its Global Performance. Mathematical Programming, 2006.

Pukelsheim, F. The Three Sigma Rule. The American Statistician, 1994.

Reddi, S. J., Kale, S., and Kumar, S. On the Convergence of Adam and Beyond. In International Conference on Learning Representations (ICLR-18), 2018.

Richtarik, P. and Takac, M. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1–38, 2014.

Riedmiller, M. and Braun, H. A Direct Adaptive Method for Faster Backpropagation Learning: the RPROP Algorithm. In International Conference on Neural Networks (ICNN-93), pp. 586–591. IEEE, 1993.

Robbins, H. and Monro, S. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 1951.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

Schmidhuber, J. Deep Learning in Neural Networks: an Overview. Neural Networks, 2015.

Seide, F., Fu, H., Droppo, J., Li, G., and Yu, D. 1-Bit Stochastic Gradient Descent and Application to Data-Parallel Distributed Training of Speech DNNs. In Conference of the International Speech Communication Association (INTERSPEECH-14), 2014.

Smith, S. L. and Le, Q. V. A Bayesian Perspective on Generalization and Stochastic Gradient Descent. In International Conference on Learning Representations (ICLR-18), 2018.

Strom, N. Scalable distributed DNN training using commodity GPU cloud computing. In Conference of the International Speech Communication Association (INTERSPEECH-15), 2015.

Suresh, A. T., Yu, F. X., Kumar, S., and McMahan, H. B. Distributed Mean Estimation with Limited Communication. In International Conference on Machine Learning (ICML-17), 2017.

Tieleman, T. and Hinton, G. RMSprop. Coursera: Neural Networks for Machine Learning, Lecture 6.5, 2012.

Welford, B. P. Note on a Method for Calculating Corrected Sums of Squares and Products. Technometrics, 1962.

Wen, W., Xu, C., Yan, F., Wu, C., Wang, Y., Chen, Y., and Li, H. TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning. In Advances in Neural Information Processing Systems (NIPS-17), 2017.

Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., and Recht, B. The Marginal Value of Adaptive Gradient Methods in Machine Learning. In Advances in Neural Information Processing Systems (NIPS-17), 2017.


A. Further experimental results

Figure A.1. A simple toy problem where SIGNSGD converges faster than SGD. The objective function is just a quadratic $f(x) = \frac{1}{2}x^2$ for $x \in \mathbb{R}^{100}$. The gradient of this function is just g(x) = x. We construct an artificial stochastic gradient by adding Gaussian noise N(0, 100²) to only the first component of the gradient. Therefore the noise is extremely sparse. The initial point is sampled from a unit variance spherical Gaussian. For each algorithm we tune a separate, constant learning rate, finding 0.001 best for SGD and 0.01 best for SIGNSGD. SIGNSGD appears more robust to the sparse noise in this problem. Results are averaged over 50 repeats with ±1 standard deviation shaded.

[Figure A.2: log-log plot of $\|g\|_1^2 / \|g\|_2^2$ against the number of weights, for MNIST logistic regression, an MNIST LeNet variant, Resnet-20 on CIFAR-10 and Resnet-50 on Imagenet.]

Figure A.2. Measuring gradient density via ratio of norms, over a range of datasets and architectures. For each network, we take a point in parameter space provided by the Xavier initialiser (Glorot & Bengio, 2010). We do a full pass over the data to compute the full gradient at this point. It is remarkably dense in all cases.


[Figure A.3, top row: validation accuracy heatmaps over initial learning rate and weight decay for SGD (momentum 0.9), SIGNUM (β = 0.9) and ADAM (β1 = 0.9). Bottom row: test accuracy, train accuracy and train loss against epoch for SGD, SIGNUM and ADAM.]

Figure A.3. CIFAR-10 results using SIGNUM to train a Resnet-20 model. Top: validation accuracies from a hyperparameter sweep on a separate validation set carved out from the training set. We used this to tune initial learning rate, weight decay and momentum for all algorithms. All other hyperparameter settings were chosen as in (He et al., 2016a) as found favourable for SGD. The hyperparameter sweep for other values of momentum is plotted in Figure A.4 of the supplementary. Bottom: there is little difference between the final test set performance of the algorithms. SIGNUM closely resembles ADAM in all of these plots.


[Figure A.4: nine validation accuracy heatmaps over initial learning rate and weight decay, for SGD, SIGNSGD and ADAM at momentum settings 0.0, 0.5 and 0.9.]

Figure A.4. Results of a massive grid search over hyperparameters for training Resnet-20 (He et al., 2016a) on CIFAR-10 (Krizhevsky, 2009). All non-algorithm specific hyperparameters (such as learning rate schedules) were set as in (He et al., 2016a). In ADAM, β2 and ε were chosen as recommended in (Kingma & Ba, 2015). Data was divided according to a {45k/5k/10k} {train/val/test} split. Validation accuracies are plotted above, and the best performer on the validation set was chosen for the final test run (shown in Figure A.3). All algorithms at the least get close to the baseline reported in (He et al., 2016a) of 91.25%. Note the broad similarity in general shape of the heatmap between ADAM and SIGNSGD, supporting a notion of algorithmic similarity. Also note that whilst SGD has a larger region of very high-scoring hyperparameter configurations, SIGNSGD and ADAM appear stable over a larger range of learning rates. Notice that the SGD heat map shifts up for increasing momentum, since the implementation of SGD in the mxnet deep learning framework actually couples the momentum and learning rate parameters.

Figure A.4. Results of a massive grid search over hyperparameters for training Resnet-20 (He et al., 2016a) on CIFAR-10 (Krizhevsky, 2009). All non-algorithm specific hyperparameters (such as learning rate schedules) were set as in (He et al., 2016a). In ADAM, β2 and ε were chosen as recommended in (Kingma & Ba, 2015). Data was divided according to a {45k/5k/10k} {train/val/test} split. Validation accuracies are plotted above, and the best performer on the validation set was chosen for the final test run (shown in Figure A.3). All algorithms at least get close to the baseline of 91.25% reported in (He et al., 2016a). Note the broad similarity in the general shape of the heatmap between ADAM and SIGNSGD, supporting a notion of algorithmic similarity. Also note that whilst SGD has a larger region of very high-scoring hyperparameter configurations, SIGNSGD and ADAM appear stable over a larger range of learning rates. Notice that the SGD heat map shifts up for increasing momentum, since the implementation of SGD in the mxnet deep learning framework actually couples the momentum and learning rate parameters.


B. Proving the convergence rate of SIGNSGD

Theorem 1 (Non-convex convergence rate of SIGNSGD). Run algorithm 1 for $K$ iterations under Assumptions 1 to 3. Set the learning rate and mini-batch size (independently of step $k$) as
$$\delta_k = \frac{1}{\sqrt{\|\vec L\|_1 K}}, \qquad n_k = K.$$
Let $N$ be the cumulative number of stochastic gradient calls up to step $K$, i.e. $N = O(K^2)$. Then we have
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_1\right]^2 \le \frac{1}{\sqrt{N}}\left[\sqrt{\|\vec L\|_1}\left(f_0 - f_* + \frac{1}{2}\right) + 2\|\vec\sigma\|_1\right]^2.$$

Proof. First let's bound the improvement of the objective during a single step of the algorithm for one instantiation of the noise. $I[\cdot]$ is the indicator function, $g_{k,i}$ denotes the $i$th component of the true gradient $g(x_k)$, and $\tilde g_k$ is a stochastic sample obeying Assumption 3.

First take Assumption 2, plug in the step from Algorithm 1, and decompose the improvement to expose the stochasticity-induced error:
\begin{align*}
f_{k+1} - f_k &\le g_k^T (x_{k+1} - x_k) + \sum_{i=1}^d \frac{L_i}{2}(x_{k+1} - x_k)_i^2 \\
&= -\delta_k\, g_k^T \mathrm{sign}(\tilde g_k) + \delta_k^2 \sum_{i=1}^d \frac{L_i}{2} \\
&= -\delta_k \|g_k\|_1 + \frac{\delta_k^2}{2}\|\vec L\|_1 + 2\delta_k \sum_{i=1}^d |g_{k,i}|\, I[\mathrm{sign}(\tilde g_{k,i}) \ne \mathrm{sign}(g_{k,i})]
\end{align*}

Next we find the expected improvement at time k + 1 conditioned on the previous iterate.

$$\mathbb{E}[f_{k+1} - f_k \mid x_k] \le -\delta_k \|g_k\|_1 + \frac{\delta_k^2}{2}\|\vec L\|_1 + 2\delta_k \sum_{i=1}^d |g_{k,i}|\, \mathbb{P}[\mathrm{sign}(\tilde g_{k,i}) \ne \mathrm{sign}(g_{k,i})]$$

So the expected improvement crucially depends on the probability that each component of the sign vector is correct, which is intuitively controlled by the relative scale of the gradient to the noise. To make this rigorous, first relax the probability, then use Markov's inequality followed by Jensen's inequality:
\begin{align*}
\mathbb{P}[\mathrm{sign}(\tilde g_{k,i}) \ne \mathrm{sign}(g_{k,i})] &\le \mathbb{P}[|\tilde g_{k,i} - g_{k,i}| \ge |g_{k,i}|] \\
&\le \frac{\mathbb{E}[|\tilde g_{k,i} - g_{k,i}|]}{|g_{k,i}|} \\
&\le \frac{\sqrt{\mathbb{E}[(\tilde g_{k,i} - g_{k,i})^2]}}{|g_{k,i}|} \\
&= \frac{\sigma_{k,i}}{|g_{k,i}|}
\end{align*}


$\sigma_{k,i}^2$ refers to the variance of the $i$th component of the $k$th stochastic gradient estimate, computed over a mini-batch of size $n_k$. Therefore, by Assumption 3, we have that $\sigma_{k,i} \le \sigma_i/\sqrt{n_k}$.

We now substitute these results and our learning rate and mini-batch settings into the expected improvement:
\begin{align*}
\mathbb{E}[f_{k+1} - f_k \mid x_k] &\le -\delta_k \|g_k\|_1 + \frac{2\delta_k}{\sqrt{n_k}}\|\vec\sigma\|_1 + \frac{\delta_k^2}{2}\|\vec L\|_1 \\
&= -\frac{1}{\sqrt{\|\vec L\|_1 K}}\|g_k\|_1 + \frac{2}{\sqrt{\|\vec L\|_1}\,K}\|\vec\sigma\|_1 + \frac{1}{2K}
\end{align*}

Now extend the expectation over randomness in the trajectory, and perform a telescoping sum over the iterations:

\begin{align*}
f_0 - f_* &\ge f_0 - \mathbb{E}[f_K] \\
&= \mathbb{E}\left[\sum_{k=0}^{K-1} f_k - f_{k+1}\right] \\
&\ge \mathbb{E}\sum_{k=0}^{K-1}\left[\frac{1}{\sqrt{\|\vec L\|_1 K}}\|g_k\|_1 - \frac{1}{2\sqrt{\|\vec L\|_1}\,K}\left(4\|\vec\sigma\|_1 + \sqrt{\|\vec L\|_1}\right)\right] \\
&= \sqrt{\frac{K}{\|\vec L\|_1}}\;\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_1\right] - \frac{1}{2\sqrt{\|\vec L\|_1}}\left(4\|\vec\sigma\|_1 + \sqrt{\|\vec L\|_1}\right)
\end{align*}

We can rearrange this inequality to yield the rate:

$$\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_1\right] \le \frac{1}{\sqrt{K}}\left[\sqrt{\|\vec L\|_1}\left(f_0 - f_* + \frac{1}{2}\right) + 2\|\vec\sigma\|_1\right]$$

Since we are growing our mini-batch size, it will take $N = O(K^2)$ gradient calls to reach step $K$. Substitute this in, square the result, and we are done.
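For concreteness, here is a minimal NumPy sketch of Algorithm 1 run with the learning rate and mini-batch schedule prescribed by Theorem 1. The quadratic objective, Gaussian noise model and all constants are illustrative assumptions made for this sketch; they are not the paper's experimental setup.

import numpy as np

# Toy objective f(x) = sum_i (L_i / 2) x_i^2, so the true gradient is L_vec * x.
# Stochastic gradients add Gaussian noise of per-coordinate scale sigma_vec.
rng = np.random.default_rng(0)
d = 10
L_vec = rng.uniform(0.5, 2.0, size=d)       # per-coordinate smoothness constants (vector L)
sigma_vec = rng.uniform(0.1, 1.0, size=d)   # per-coordinate noise scales (vector sigma)

def stochastic_grad(x, n):
    # mini-batch of n noisy gradient samples, averaged
    noise = rng.normal(0.0, sigma_vec, size=(n, d)).mean(axis=0)
    return L_vec * x + noise

K = 100
delta = 1.0 / np.sqrt(L_vec.sum() * K)      # delta_k = 1 / sqrt(||L||_1 K), as in Theorem 1
x = rng.normal(size=d)

for k in range(K):
    g_tilde = stochastic_grad(x, n=K)       # n_k = K, as in Theorem 1
    x = x - delta * np.sign(g_tilde)        # Algorithm 1: signSGD step

print("final l1 norm of true gradient:", np.abs(L_vec * x).sum())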

C. Large and small batch SGD

Algorithm C.1 SGD
Input: learning rate $\delta$, current point $x_k$
  $\tilde g_k \leftarrow \mathrm{stochasticGradient}(x_k)$
  $x_{k+1} \leftarrow x_k - \delta\,\tilde g_k$

For comparison with SIGNSGD theory, here we present non-convex convergence rates for SGD. These are classical results and we are not sure of the earliest reference.

Notably, we get exactly the same rate for large and small batch SGD when measuring convergence in terms of the number of stochastic gradient calls. Although the rates are the same for a given number of gradient calls $N$, the large batch setting is preferred (in theory) since it achieves these $N$ gradient calls in only $\sqrt{N}$ iterations, whereas the small batch setting requires $N$ iterations. Fewer iterations in the large batch case implies a smaller wall-clock time to reach a given accuracy (assuming the large batch can be parallelised) as well as fewer rounds of communication in the distributed setting. These systems benefits of large batch learning have been observed by practitioners (Goyal et al., 2017).
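As a toy illustration of this accounting (a numerical example constructed for this sketch, not taken from the paper), the snippet below counts iterations and gradient calls for the two mini-batch regimes of Theorem C.1 at a fixed budget of N stochastic gradient evaluations.

import math

N = 10**6  # illustrative budget of stochastic gradient calls

# Large batch: n_k = K, so K iterations consume K^2 gradient calls, i.e. K = sqrt(N).
K_large = math.isqrt(N)
print(f"large batch: {K_large} iterations of batch size {K_large} -> {K_large**2} gradient calls")

# Small batch: n_k = 1, so the same budget takes N iterations (and N rounds of communication).
print(f"small batch: {N} iterations of batch size 1 -> {N} gradient calls")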


Theorem C.1 (Non-convex convergence rate of SGD). Run algorithm C.1 for $K$ iterations under Assumptions 1 to 3. Define $L := \|\vec L\|_\infty$ and $\sigma^2 := \|\vec\sigma\|_2^2$. Set the learning rate and mini-batch size (independently of step $k$) as either
$$\text{(large batch)}\quad \delta_k = \frac{1}{L},\ n_k = K; \qquad\qquad \text{(small batch)}\quad \delta_k = \frac{1}{L\sqrt{K}},\ n_k = 1.$$
Let $N$ be the cumulative number of stochastic gradient calls up to step $K$, i.e. $N = O(K^2)$ for large batch and $N = O(K)$ for small batch. Then, in either case, we have
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_2^2\right] \le \frac{1}{\sqrt{N}}\left[2L(f_0 - f_*) + \sigma^2\right].$$

Proof. The proof begins the same way for the large and small batch cases.

First we bound the improvement of the objective during a single step of the algorithm for one instantiation of the noise. $g_k$ denotes the true gradient at step $k$ and $\tilde g_k$ is a stochastic sample obeying Assumption 3.

Take Assumption 2 and plug in the algorithmic step.

\begin{align*}
f_{k+1} - f_k &\le g_k^T(x_{k+1} - x_k) + \sum_{i=1}^d \frac{L_i}{2}(x_{k+1} - x_k)_i^2 \\
&\le g_k^T(x_{k+1} - x_k) + \frac{\|\vec L\|_\infty}{2}\|x_{k+1} - x_k\|_2^2 \\
&= -\delta_k\, g_k^T \tilde g_k + \delta_k^2\,\frac{L}{2}\|\tilde g_k\|_2^2
\end{align*}

Next we find the expected improvement at time $k+1$ conditioned on the previous iterate.
$$\mathbb{E}[f_{k+1} - f_k \mid x_k] \le -\delta_k\|g_k\|_2^2 + \delta_k^2\,\frac{L}{2}\left(\sigma_k^2 + \|g_k\|_2^2\right).$$
$\sigma_k^2$ refers to the variance of the $k$th stochastic gradient estimate, computed over a mini-batch of size $n_k$. Therefore, by Assumption 3, we have that $\sigma_k^2 \le \sigma^2/n_k$.

First let’s substitute in the (large batch) hyperparameters.

$$\mathbb{E}[f_{k+1} - f_k \mid x_k] \le -\frac{1}{L}\|g_k\|_2^2 + \frac{1}{2L}\left(\frac{\sigma^2}{K} + \|g_k\|_2^2\right) = -\frac{1}{2L}\|g_k\|_2^2 + \frac{1}{2L}\,\frac{\sigma^2}{K}.$$

Now extend the expectation over randomness in the trajectory, and perform a telescoping sum over the iterations:

\begin{align*}
f_0 - f_* &\ge f_0 - \mathbb{E}[f_K] \\
&= \mathbb{E}\left[\sum_{k=0}^{K-1} f_k - f_{k+1}\right] \\
&\ge \frac{1}{2L}\,\mathbb{E}\sum_{k=0}^{K-1}\left[\|g_k\|_2^2 - \frac{\sigma^2}{K}\right]
\end{align*}


We can rearrange this inequality to yield the rate:

$$\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_2^2\right] \le \frac{1}{K}\left[2L(f_0 - f_*) + \sigma^2\right].$$

Since we are growing our mini-batch size, it will take $N = O(K^2)$ gradient calls to reach step $K$. Substitute this in and we are done for the (large batch) case.

Now we need to show that the same result holds for the (small batch) case. Following the initial steps of the large batch proof, we get

$$\mathbb{E}[f_{k+1} - f_k \mid x_k] \le -\delta_k\|g_k\|_2^2 + \delta_k^2\,\frac{L}{2}\left(\sigma_k^2 + \|g_k\|_2^2\right).$$

This time $\sigma_k^2 = \sigma^2$. Substituting this and our learning rate and mini-batch settings into the expected improvement:

$$\mathbb{E}[f_{k+1} - f_k \mid x_k] \le -\frac{1}{L\sqrt{K}}\|g_k\|_2^2 + \frac{1}{2LK}\left(\sigma^2 + \|g_k\|_2^2\right) \le -\frac{1}{2L\sqrt{K}}\|g_k\|_2^2 + \frac{1}{2L}\,\frac{\sigma^2}{K}.$$

Now extend the expectation over randomness in the trajectory, and perform a telescoping sum over the iterations:

\begin{align*}
f_0 - f_* &\ge f_0 - \mathbb{E}[f_K] \\
&= \mathbb{E}\left[\sum_{k=0}^{K-1} f_k - f_{k+1}\right] \\
&\ge \frac{1}{2L}\,\mathbb{E}\sum_{k=0}^{K-1}\left[\frac{\|g_k\|_2^2}{\sqrt{K}} - \frac{\sigma^2}{K}\right].
\end{align*}

We can rearrange this inequality to yield the rate:

$$\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_2^2\right] \le \frac{1}{\sqrt{K}}\left[2L(f_0 - f_*) + \sigma^2\right].$$

It will take $N = O(K)$ gradient calls to reach step $K$. Substitute this in and we are done.

D. Proving the convergence rate of distributed SIGNSGD with majority vote

Theorem 2 (Non-convex convergence rate of distributed SIGNSGD with majority vote). Run algorithm 3 for $K$ iterations under Assumptions 1 to 3. Set the learning rate and mini-batch size for each worker (independently of step $k$) as
$$\delta_k = \frac{1}{\sqrt{\|\vec L\|_1 K}}, \qquad n_k = K.$$

Then (a) majority vote with M workers converges at least as fast as SIGNSGD in Theorem 1.

And (b) further assuming that the noise in each component of the stochastic gradient is unimodal and symmetric about the mean (e.g. Gaussian), majority vote converges at the improved rate:
$$\mathbb{E}\left[\frac{1}{K}\sum_{k=0}^{K-1}\|g_k\|_1\right]^2 \le \frac{1}{\sqrt{N}}\left[\sqrt{\|\vec L\|_1}\left(f_0 - f_* + \frac{1}{2}\right) + \frac{2}{\sqrt{M}}\|\vec\sigma\|_1\right]^2$$

where N is the cumulative number of stochastic gradient calls per worker up to step K.
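A minimal sketch of the server-side aggregation in Algorithm 3, with the M worker gradients simulated by adding Gaussian noise to a fixed true gradient. The noise model, dimensions and variable names are illustrative assumptions for this sketch only.

import numpy as np

rng = np.random.default_rng(1)
d, M = 5, 7                               # dimension and number of workers (M odd avoids ties)
g_true = rng.normal(size=d)               # true gradient, known here only for the simulation
sigma = 1.0

# Each worker sends the sign of its own stochastic gradient (1 bit per coordinate, worker -> server).
worker_signs = np.sign(g_true + rng.normal(0.0, sigma, size=(M, d)))

# The server sums the signs and broadcasts the sign of the sum (1 bit per coordinate, server -> worker).
vote = np.sign(worker_signs.sum(axis=0))  # element-wise majority vote

delta = 0.01
x = np.zeros(d)
x_next = x - delta * vote                 # every worker applies the same update

print("fraction of coordinates where the vote matches sign(g_true):", np.mean(vote == np.sign(g_true)))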


Before we introduce the unimodal symmetric assumption, let's first address the claim that $M$-worker majority vote is at least as good as single-worker SIGNSGD (as in Theorem 1), using only Assumptions 1 to 3.

Proof of (a). Recall that a crucial step in Theorem 1 is showing that

$$|g_i|\,\mathbb{P}[\mathrm{sign}(\tilde g_i) \ne \mathrm{sign}(g_i)] \le \sigma_i$$

for component i of the stochastic gradient with variance bound σi.

The only difference in majority vote is that instead of using $\mathrm{sign}(\tilde g_i)$ to approximate $\mathrm{sign}(g_i)$, we are instead using $\mathrm{sign}\left[\sum_{m=1}^M \mathrm{sign}(\tilde g_{m,i})\right]$. If we can show that the same bound in terms of $\sigma_i$ holds instead for
$$|g_i|\,\mathbb{P}\left[\mathrm{sign}\left[\sum_{m=1}^M \mathrm{sign}(\tilde g_{m,i})\right] \ne \mathrm{sign}(g_i)\right] \tag{$\star$}$$

then we are done, since the machinery of Theorem 1 can then be directly applied.

Define the signal-to-noise ratio of a component of the stochastic gradient as $S := |g_i|/\sigma_i$. Note that when $S \le 1$ then $(\star)$ is trivially satisfied, so we need only consider the case that $S > 1$. $S$ should really be labeled $S_i$ but we abuse notation.

Without loss of generality, assume that $g_i$ is negative; then using Assumption 3 and Cantelli's inequality (Cantelli, 1928) we get, for the failure probability $q$ of a single worker,
$$q := \mathbb{P}[\mathrm{sign}(\tilde g_i) \ne \mathrm{sign}(g_i)] = \mathbb{P}[\tilde g_i - g_i \ge |g_i|] \le \frac{1}{1 + \frac{g_i^2}{\sigma_i^2}}.$$
For $S > 1$ we therefore have failure probability $q < \frac{1}{2}$. If the failure probability of a single worker is smaller than $\frac{1}{2}$, then the server is essentially receiving a repetition code $R_M$ of the true gradient sign. Majority vote is the maximum likelihood decoder of the repetition code, and of course decreases the probability of error; see e.g. (MacKay, 2002). Therefore in all regimes of $S$ we have that

$$(\star) \le |g_i|\,\mathbb{P}[\mathrm{sign}(\tilde g_i) \ne \mathrm{sign}(g_i)] \le \sigma_i$$

and we are done.

That's all well and good, but what we'd really like to show is that using $M$ workers provides a speedup by reducing the variance. Is
$$(\star) \overset{?}{\le} \frac{\sigma_i}{\sqrt{M}} \tag{$\dagger$}$$
too much to hope for?

Well, in the regime where $S \gg 1$ such a speedup is very reasonable, since $q \ll \frac{1}{2}$ by Cantelli and the repetition code then supplies an exponential reduction in failure rate. But we need to exclude very skewed or bimodal distributions where $q > \frac{1}{2}$ and adding more voting workers will not help. That brings us naturally to the following lemma:

Lemma D.1 (Failure probability of a sign bit under conditions of unimodal symmetric gradient noise). Let $\tilde g_i$ be an unbiased stochastic approximation to gradient component $g_i$, with variance bounded by $\sigma_i^2$. Further assume that the noise distribution is unimodal and symmetric. Define the signal-to-noise ratio $S := \frac{|g_i|}{\sigma_i}$. Then we have that
$$\mathbb{P}[\mathrm{sign}(\tilde g_i) \ne \mathrm{sign}(g_i)] \le \begin{cases} \frac{2}{9}\frac{1}{S^2} & \text{if } S > \frac{2}{\sqrt{3}}, \\[4pt] \frac{1}{2} - \frac{S}{2\sqrt{3}} & \text{otherwise,} \end{cases}$$
which is in all cases less than $\frac{1}{2}$.


Proof. Recall Gauss' inequality for a unimodal random variable $X$ with mode $\nu$ and expected squared deviation from the mode $\tau^2$ (Gauss, 1823; Pukelsheim, 1994):
$$\mathbb{P}[|X - \nu| > k] \le \begin{cases} \frac{4}{9}\frac{\tau^2}{k^2} & \text{if } \frac{k}{\tau} > \frac{2}{\sqrt{3}}, \\[4pt] 1 - \frac{k}{\sqrt{3}\,\tau} & \text{otherwise.} \end{cases}$$
By the symmetry assumption, the mode is equal to the mean, so we may substitute the mean $\mu = \nu$ and the variance $\sigma^2 = \tau^2$:
$$\mathbb{P}[|X - \mu| > k] \le \begin{cases} \frac{4}{9}\frac{\sigma^2}{k^2} & \text{if } \frac{k}{\sigma} > \frac{2}{\sqrt{3}}, \\[4pt] 1 - \frac{k}{\sqrt{3}\,\sigma} & \text{otherwise.} \end{cases}$$

Without loss of generality assume that $g_i$ is negative. Then applying symmetry followed by Gauss, the failure probability for the sign bit satisfies:
\begin{align*}
\mathbb{P}[\mathrm{sign}(\tilde g_i) \ne \mathrm{sign}(g_i)] &= \mathbb{P}[\tilde g_i - g_i \ge |g_i|] \\
&= \frac{1}{2}\,\mathbb{P}[|\tilde g_i - g_i| \ge |g_i|] \\
&\le \begin{cases} \frac{2}{9}\frac{\sigma_i^2}{g_i^2} & \text{if } \frac{|g_i|}{\sigma_i} > \frac{2}{\sqrt{3}}, \\[4pt] \frac{1}{2} - \frac{|g_i|}{2\sqrt{3}\,\sigma_i} & \text{otherwise} \end{cases} \\
&= \begin{cases} \frac{2}{9}\frac{1}{S^2} & \text{if } S > \frac{2}{\sqrt{3}}, \\[4pt] \frac{1}{2} - \frac{S}{2\sqrt{3}} & \text{otherwise.} \end{cases}
\end{align*}
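As a quick numerical sanity check of Lemma D.1 (not part of the proof), the sketch below estimates the sign-failure probability under Gaussian noise, which is unimodal and symmetric, and compares it with the bound. The particular values of the gradient component and noise scale are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(2)

def lemma_d1_bound(S):
    # piecewise bound on P[sign(g_tilde) != sign(g)] at signal-to-noise ratio S = |g| / sigma
    if S > 2.0 / np.sqrt(3.0):
        return (2.0 / 9.0) / S**2
    return 0.5 - S / (2.0 * np.sqrt(3.0))

g, sigma, trials = -0.3, 1.0, 200_000              # illustrative gradient component and noise scale
g_tilde = g + rng.normal(0.0, sigma, size=trials)
failure_rate = np.mean(np.sign(g_tilde) != np.sign(g))

S = abs(g) / sigma
print(f"empirical failure rate {failure_rate:.3f} vs bound {lemma_d1_bound(S):.3f}")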

We now have everything we need to prove part (b) of Theorem 2.

Proof of (b). If we can show $(\dagger)$ we'll be done, since the machinery of Theorem 1 follows through with $\sigma$ replaced everywhere by $\frac{\sigma}{\sqrt{M}}$. Note that the important quantity appearing in $(\star)$ is
$$\sum_{m=1}^M \mathrm{sign}(\tilde g_{m,i}).$$
Let $Z$ count the number of workers with the correct sign bit. To ensure that
$$\mathrm{sign}\left[\sum_{m=1}^M \mathrm{sign}(\tilde g_{m,i})\right] = \mathrm{sign}(g_i),$$

$Z$ must be larger than $\frac{M}{2}$. But $Z$ is the sum of $M$ independent Bernoulli trials, and is therefore binomial with success probability $p$ and failure probability $q$ to be determined. Therefore we have reduced proving $(\dagger)$ to showing that
$$\mathbb{P}\left[Z \le \frac{M}{2}\right] \le \frac{1}{\sqrt{M}\,S} \tag{$\S$}$$
where $Z$ is the number of successes of a binomial random variable $b(M, p)$ and $S$ is our signal-to-noise ratio $S := \frac{|g_i|}{\sigma_i}$.

Let’s start by getting a bound on the success probability p (or equivalently failure probability q) of a single Bernoulli trial.

By Lemma D.1, which critically relies on unimodal symmetric gradient noise, the failure probability for the sign bit of a single worker satisfies:
$$q := \mathbb{P}[\mathrm{sign}(\tilde g_i) \ne \mathrm{sign}(g_i)] \le \begin{cases} \frac{2}{9}\frac{1}{S^2} & \text{if } S > \frac{2}{\sqrt{3}}, \\[4pt] \frac{1}{2} - \frac{S}{2\sqrt{3}} & \text{otherwise} \end{cases} \;=: q(S)$$


where we have defined $q(S)$ to be our $S$-dependent bound on $q$. Since $q \le q(S) < \frac{1}{2}$, there is hope to show $(\dagger)$. Define $\epsilon$ to be the defect of $q$ from one half, and let $\epsilon(S)$ be its $S$-dependent bound:
$$\epsilon := \frac{1}{2} - q = p - \frac{1}{2} \ge \frac{1}{2} - q(S) =: \epsilon(S).$$
Now that we have an analytical handle on the random variable $Z$, we may proceed to show $(\S)$. There are a number of different inequalities that we can use to bound the tail of a binomial random variable, but Cantelli's inequality will be good enough for our purposes.

Let $\bar Z := M - Z$ denote the number of failures. $\bar Z$ is binomial with mean $\mu_{\bar Z} = Mq$ and variance $\sigma^2_{\bar Z} = Mpq$. Then using Cantelli we get

\begin{align*}
\mathbb{P}\left[Z \le \frac{M}{2}\right] = \mathbb{P}\left[\bar Z \ge \frac{M}{2}\right] &= \mathbb{P}\left[\bar Z - \mu_{\bar Z} \ge \frac{M}{2} - \mu_{\bar Z}\right] \\
&= \mathbb{P}\left[\bar Z - \mu_{\bar Z} \ge M\epsilon\right] \\
&\le \frac{1}{1 + \frac{M^2\epsilon^2}{Mpq}} \\
&\le \frac{1}{1 + \frac{M}{\frac{1}{4\epsilon^2} - 1}}
\end{align*}

Now using the fact that $\frac{1}{1 + x^2} \le \frac{1}{2x}$ we get
$$\mathbb{P}\left[Z \le \frac{M}{2}\right] \le \frac{\sqrt{\frac{1}{4\epsilon^2} - 1}}{2\sqrt{M}}.$$

To finish, we need only show that $\sqrt{\frac{1}{4\epsilon^2} - 1}$ is smaller than $\frac{2}{S}$, or equivalently that its square is smaller than $\frac{4}{S^2}$. Well, plugging in our bound on $\epsilon$ we get that
$$\frac{1}{4\epsilon^2} - 1 \le \frac{1}{4\epsilon(S)^2} - 1,$$
where
$$\epsilon(S) = \begin{cases} \frac{1}{2} - \frac{2}{9}\frac{1}{S^2} & \text{if } S > \frac{2}{\sqrt{3}}, \\[4pt] \frac{S}{2\sqrt{3}} & \text{otherwise.} \end{cases}$$
First take the case $S \le \frac{2}{\sqrt{3}}$. Then $\epsilon(S)^2 = \frac{S^2}{12}$ and $\frac{1}{4\epsilon(S)^2} - 1 = \frac{3}{S^2} - 1 < \frac{4}{S^2}$. Now take the case $S > \frac{2}{\sqrt{3}}$. Then $\epsilon(S) = \frac{1}{2} - \frac{2}{9}\frac{1}{S^2}$ and we have
$$\frac{1}{4\epsilon(S)^2} - 1 = \frac{1}{S^2}\,\frac{\frac{8}{9} - \frac{16}{81}\frac{1}{S^2}}{1 - \frac{8}{9}\frac{1}{S^2} + \frac{16}{81}\frac{1}{S^4}} < \frac{1}{S^2}\,\frac{\frac{8}{9}}{1 - \frac{8}{9}\frac{1}{S^2}} < \frac{4}{S^2}$$
by the condition on $S$.

So we have shown both cases, which proves $(\S)$, from which we get $(\dagger)$, and we are done.

E. General recipes for the convergence of approximate sign gradient methods

Now we generalize the arguments in the proof of SIGNSGD and prove a master lemma that provides a general recipe for analyzing the approximate sign gradient method. This allows us to handle momentum and the majority voting schemes, hence proving Theorem 3 and Theorem 2.

Lemma E.1 (Convergence rate for a class of approximate sign gradient methods). Let $C$ and $K$ be integers satisfying $0 < C \ll K$. Consider the algorithm given by $x_{k+1} = x_k - \delta_k\,\mathrm{sign}(v_k)$, for a fixed positive sequence $\delta_k$, and where $v_k \in \mathbb{R}^d$ is a measurable and square integrable function of the entire history up to time $k$, including $x_1, \dots, x_k$, $v_1, \dots, v_{k-1}$


and all $N_k$ stochastic gradient oracle calls up to time $k$. Let $g_k = \nabla f(x_k)$. If Assumption 1 and Assumption 2 are true and in addition, for $k = C, C+1, C+2, \dots, K$,
$$\mathbb{E}\left[\sum_{i=1}^d |g_{k,i}|\,\mathbb{P}[\mathrm{sign}(v_{k,i}) \ne \mathrm{sign}(g_{k,i}) \mid x_k]\right] \le \xi(k) \tag{2}$$
where the expectation is taken over all random variables, and the rate $\xi(k)$ obeys $\xi(k) \to 0$ as $k \to \infty$, then we have

$$\frac{1}{K - C}\sum_{k=C}^{K-1}\mathbb{E}\|g_k\|_1 \le \frac{f_C - f_* + 2\sum_{k=C}^{K-1}\delta_k\,\xi(k) + \sum_{k=C}^{K-1}\frac{\delta_k^2\|\vec L\|_1}{2}}{(K - C)\min_{C \le k \le K-1}\delta_k}.$$

In particular, if $\delta_k = \delta/\sqrt{k}$ and $\xi(k) = \kappa/\sqrt{k}$ for some problem dependent constant $\kappa$, then we have
$$\frac{1}{K - C}\sum_{k=C}^{K-1}\mathbb{E}\|g_k\|_1 \le \frac{\frac{f_C - f_*}{\delta} + \left(2\kappa + \frac{\|\vec L\|_1\delta}{2}\right)(\log K + 1)}{(K - C)/\sqrt{K}}.$$

Proof. Our general strategy will be to show that the expected objective improvement at each step will be good enough to guarantee a convergence rate in expectation. First let's bound the improvement of the objective during a single step of the algorithm for $k \ge C$, and then take expectation. Note that $I[\cdot]$ is the indicator function, and $w_k[i]$ denotes the $i$th component of the vector $w_k$.

By Assumption 2

\begin{align*}
f_{k+1} - f_k &\le g_k^T(x_{k+1} - x_k) + \sum_{i=1}^d \frac{L_i}{2}(x_{k+1,i} - x_{k,i})^2 && \text{Assumption 2} \\
&= -\delta_k\, g_k^T\mathrm{sign}(v_k) + \delta_k^2\,\frac{\|\vec L\|_1}{2} && \text{by the update rule} \\
&= -\delta_k\|g_k\|_1 + 2\delta_k\sum_{i=1}^d |g_k[i]|\, I[\mathrm{sign}(v_k[i]) \ne \mathrm{sign}(g_k[i])] + \delta_k^2\,\frac{\|\vec L\|_1}{2} && \text{by identity}
\end{align*}

Now, for $k \ge C$ we need to find the expected improvement at time $k+1$ conditioned on $x_k$, where the expectation is over the randomness of the stochastic gradient oracle. Note that $\mathbb{P}[E]$ denotes the probability of event $E$.
$$\mathbb{E}[f_{k+1} - f_k \mid x_k] \le -\delta_k\|g_k\|_1 + 2\delta_k\sum_{i=1}^d |g_k[i]|\,\mathbb{P}\left[\mathrm{sign}(v_k[i]) \ne \mathrm{sign}(g_k[i]) \,\middle|\, x_k\right] + \delta_k^2\,\frac{\|\vec L\|_1}{2}.$$

Note that gk becomes fixed when we condition on xk. Further take expectation over xk, and apply (2). We get:

$$\mathbb{E}[f_{k+1} - f_k] \le -\delta_k\,\mathbb{E}[\|g_k\|_1] + 2\delta_k\,\xi(k) + \frac{\delta_k^2\|\vec L\|_1}{2}. \tag{3}$$

Rearrange the terms and sum (3) over $k = C, C+1, \dots, K-1$:
$$\sum_{k=C}^{K-1}\delta_k\,\mathbb{E}[\|g_k\|_1] \le \sum_{k=C}^{K-1}\left(\mathbb{E}f_k - \mathbb{E}f_{k+1}\right) + 2\sum_{k=C}^{K-1}\delta_k\,\xi(k) + \sum_{k=C}^{K-1}\frac{\delta_k^2\|\vec L\|_1}{2}$$

Dividing both sides by $\left[(K - C)\min_{C \le k \le K-1}\delta_k\right]$, using a telescoping sum over $\mathbb{E}f_k$ and using that $f(x) \ge f_*$ for all $x$, we get

$$\frac{1}{K - C}\sum_{k=C}^{K-1}\mathbb{E}\|g_k\|_1 \le \frac{f_C - f_* + 2\sum_{k=C}^{K-1}\delta_k\,\xi(k) + \sum_{k=C}^{K-1}\frac{\delta_k^2\|\vec L\|_1}{2}}{(K - C)\min_{C \le k \le K-1}\delta_k}$$

and the proof is complete by noting that, on the left-hand side, each $\delta_k$ may be lower bounded by $\min_{C \le k \le K-1}\delta_k$.


To use the above lemma for analyzing SIGNUM and the majority voting scheme, it suffices to check condition (2) for each algorithm.

One possible way to establish (2) is to show that $v_k$ is a good approximation of the gradient $g_k$ in expected absolute value.

Lemma E.2 (Estimation to testing reduction). Equation (2) is true if, for every $k$,
$$\sum_{i=1}^d \mathbb{E}|v_k[i] - g_k[i]| \le \xi(k). \tag{4}$$

Proof. First note that for any two random variables $a, b \in \mathbb{R}$,
$$\mathbb{P}\left[\mathrm{sign}(a) \ne \mathrm{sign}(b)\right] \le \mathbb{P}\left[|a - b| > |b|\right].$$

Conditioning on $x_k$ and applying the above inequality to every $i = 1, \dots, d$ inside the expectation of (2), we have
$$\sum_{i=1}^d |g_k[i]|\,\mathbb{P}\left[\mathrm{sign}(v_k[i]) \ne \mathrm{sign}(g_k[i]) \,\middle|\, x_k\right] \le \sum_{i=1}^d |g_k[i]|\,\mathbb{P}\left[|v_k[i] - g_k[i]| > |g_k[i]| \,\middle|\, x_k\right] \le \sum_{i=1}^d \mathbb{E}\left[|v_k[i] - g_k[i]| \,\middle|\, x_k\right].$$
Note that the final $\le$ uses Markov's inequality, and the constant $|g_k[i]|$ cancels out.

The proof is complete by taking the expectation on both sides and applying (4).

Note that the proof of this lemma uses Markov's inequality in the same way information-theoretic lower bounds are often proved in statistics: by reducing estimation to testing.

Another handy feature of the result is that we do not require the approximation to hold for every possible $x_k$. It is okay that for some $x_k$ the approximation is much worse, as long as those $x_k$ appear with small probability according to the algorithm. This feature enables us to analyze momentum and hence prove the convergence of SIGNUM.

F. Analysis for SIGNUM

Recall our definition of the key random variables used in SIGNUM.

$$g_k := \nabla f(x_k), \qquad \tilde g_k := \frac{1}{n_k}\sum_{j=1}^{n_k}\tilde g^{(j)}(x_k),$$
$$m_k := \frac{1-\beta}{1-\beta^{k+1}}\sum_{t=0}^k \beta^t g_{k-t}, \qquad \tilde m_k := \frac{1-\beta}{1-\beta^{k+1}}\sum_{t=0}^k \beta^t \tilde g_{k-t}.$$

SIGNUM effectively uses $v_k = \tilde m_k$ and also $\delta_k = O(1/\sqrt{k})$.
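A minimal sketch of this update on a toy problem (the quadratic objective, noise model and constants are illustrative assumptions made here). Note that the running momentum of Algorithm 2 equals $\tilde m_k$ up to the positive factor $\frac{1-\beta}{1-\beta^{k+1}}$, which does not affect its sign, so the sketch simply tracks the unnormalised momentum.

import numpy as np

rng = np.random.default_rng(3)
d, K = 10, 200
beta = 0.9
L_vec = rng.uniform(0.5, 2.0, size=d)
sigma = 0.5

x = rng.normal(size=d)
m = np.zeros(d)                                           # momentum buffer

for k in range(K):
    g_tilde = L_vec * x + rng.normal(0.0, sigma, size=d)  # toy stochastic gradient
    m = beta * m + (1.0 - beta) * g_tilde                 # Algorithm 2: momentum update
    # sign(m) equals sign(m_tilde_k), since m = (1 - beta^{k+1}) * m_tilde_k
    x = x - (0.1 / np.sqrt(k + 1)) * np.sign(m)           # delta_k = O(1 / sqrt(k))

print("final l1 norm of true gradient:", np.abs(L_vec * x).sum())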

Before we prove the convergence of SIGNUM, we first prove a utility lemma about the random variable $Z_k := \tilde g_k - g_k$. Note that in this lemma quantities like $Z_k$, $Y_k$, $|Z_k|$ and $Z_k^2$ are considered vectors, so the lemma is a statement about each component of the vectors separately, and all operations such as $(\cdot)^2$ are pointwise operations.

Lemma F.1 (Cumulative error of stochastic gradient). For any $k < \infty$ and fixed weights $-\infty < \alpha_1, \dots, \alpha_k < \infty$, $\sum_{l=1}^k \alpha_l Z_l$ is a Martingale. In particular,

$$\mathbb{E}\left[\left(\sum_{l=1}^k \alpha_l Z_l\right)^2\right] \le \sum_{l=1}^k \alpha_l^2\,\vec\sigma^2.$$


Proof. We simply check the definition of a Martingale. Denote $Y_k := \sum_{l=1}^k \alpha_l Z_l$. First, we have that
\begin{align*}
\mathbb{E}[|Y_k|] = \mathbb{E}\left[\left|\sum_{l=1}^k \alpha_l Z_l\right|\right] &\le \sum_l |\alpha_l|\,\mathbb{E}[|Z_l|] && \text{triangle inequality} \\
&= \sum_l |\alpha_l|\,\mathbb{E}\left[\mathbb{E}[|Z_l| \mid x_l]\right] && \text{law of total probability} \\
&\le \sum_l |\alpha_l|\,\mathbb{E}\left[\sqrt{\mathbb{E}[Z_l^2 \mid x_l]}\right] && \text{Jensen's inequality} \\
&\le \sum_l |\alpha_l|\,\vec\sigma < \infty
\end{align*}

Second, again using the law of total probability,

\begin{align*}
\mathbb{E}[Y_{k+1} \mid Y_1, \dots, Y_k] = \mathbb{E}\left[\sum_{l=1}^{k+1}\alpha_l Z_l \,\middle|\, \alpha_1 Z_1, \dots, \alpha_k Z_k\right] &= Y_k + \alpha_{k+1}\,\mathbb{E}\left[Z_{k+1} \mid \alpha_1 Z_1, \dots, \alpha_k Z_k\right] \\
&= Y_k + \alpha_{k+1}\,\mathbb{E}\left[\mathbb{E}\left[Z_{k+1} \mid x_{k+1}, \alpha_1 Z_1, \dots, \alpha_k Z_k\right] \mid \alpha_1 Z_1, \dots, \alpha_k Z_k\right] \\
&= Y_k + \alpha_{k+1}\,\mathbb{E}\left[\mathbb{E}\left[Z_{k+1} \mid x_{k+1}\right] \mid \alpha_1 Z_1, \dots, \alpha_k Z_k\right] \\
&= Y_k
\end{align*}

This completes the proof that it is indeed a Martingale. We now make use of the properties of Martingale difference sequences to establish a variance bound on the Martingale.

\begin{align*}
\mathbb{E}\left[\left(\sum_{l=1}^k \alpha_l Z_l\right)^2\right] &= \sum_{l=1}^k \mathbb{E}[\alpha_l^2 Z_l^2] + 2\sum_{l<j}\mathbb{E}[\alpha_l\alpha_j Z_l Z_j] \\
&= \sum_{l=1}^k \alpha_l^2\,\mathbb{E}\left[\mathbb{E}[Z_l^2 \mid Z_1, \dots, Z_{l-1}]\right] + 2\sum_{l<j}\alpha_l\alpha_j\,\mathbb{E}\left[Z_l\,\mathbb{E}\left[\mathbb{E}[Z_j \mid Z_1, \dots, Z_{j-1}] \,\middle|\, Z_l\right]\right] \\
&= \sum_{l=1}^k \alpha_l^2\,\mathbb{E}\left[\mathbb{E}\left[\mathbb{E}[Z_l^2 \mid x_l, Z_1, \dots, Z_{l-1}] \mid Z_1, \dots, Z_{l-1}\right]\right] + 0 \\
&\le \sum_{l=1}^k \alpha_l^2\,\vec\sigma^2.
\end{align*}

The consequence of this lemma is that we are able to treat $Z_1, \dots, Z_k$ as if they were independent, even though they are not: clearly $Z_l$ depends on $Z_1, \dots, Z_{l-1}$ through $x_l$.

Lemma F.2 (Gradient approximation in SIGNUM). The version of the SIGNUM algorithm that takes $v_k = \tilde m_k$, and all parameters according to Theorem 3, obeys, for all integers $C \le k \le K$,
$$\left\|\mathbb{E}|\tilde m_k - g_k|\right\|_1 \le \frac{2}{\sqrt{k+1}}\left(8\|\vec L\|_1\delta\,\frac{\beta}{1-\beta} + \frac{\sqrt{3}\,\|\vec\sigma\|_1}{\sqrt{1-\beta}}\right).$$

Proof. For each $i \in [d]$ we will use the following non-standard "bias-variance" decomposition:
$$\mathbb{E}\left[|\tilde m_k[i] - g_k[i]|\right] \le \underbrace{\mathbb{E}\left[|m_k[i] - g_k[i]|\right]}_{(*)} + \underbrace{\mathbb{E}\left[|\tilde m_k[i] - m_k[i]|\right]}_{(**)} \tag{5}$$


We will first bound (∗∗) and then deal with (∗).

Note that $(**) = \frac{1-\beta}{1-\beta^{k+1}}\,\mathbb{E}\left[\left|\sum_{t=0}^k \beta^t Z_{k-t}\right|\right]$. Using Jensen's inequality and applying Lemma F.1 with our choice of $\alpha_1, \dots, \alpha_l$ (including the effect of the increasing batch size), we get that for $k \ge C$
$$(**) \le \frac{1-\beta}{1-\beta^{k+1}}\sqrt{\mathbb{E}\left[\left|\sum_{t=0}^k \beta^t Z_{k-t}\right|^2\right]} \le \frac{1-\beta}{1-\beta^{k+1}}\sqrt{\sum_{t=0}^k\left[\beta^{2t}\,\frac{\vec\sigma^2}{k-t+1}\right]},$$

where

\begin{align*}
\sum_{t=0}^k\left[\beta^{2t}\,\frac{\vec\sigma^2}{k-t+1}\right] &= \sum_{t=0}^{\frac{k}{2}}\left[\beta^{2t}\,\frac{\vec\sigma^2}{k-t+1}\right] + \sum_{t=\frac{k}{2}+1}^{k}\left[\beta^{2t}\,\frac{\vec\sigma^2}{k-t+1}\right] && \text{break up sum} \\
&\le \sum_{t=0}^{\frac{k}{2}}\left[\beta^{2t}\right]\frac{\vec\sigma^2}{\frac{k}{2}+1} + \sum_{t=\frac{k}{2}+1}^{k}\left[\beta^k\,\vec\sigma^2\right] && \text{bound summands} \\
&\le \frac{1}{1-\beta^2}\,\frac{\vec\sigma^2}{\frac{k}{2}+1} + \frac{k}{2}\,\beta^k\,\vec\sigma^2 && \text{geometric series} \\
&\le \frac{3}{1-\beta^2}\,\frac{\vec\sigma^2}{k+1} && \text{since } k \ge C
\end{align*}

Combining, and again using our condition that $k \ge C$, we get
$$(**) \le \frac{1-\beta}{1-\beta^{k+1}}\sqrt{\frac{3}{1-\beta^2}}\,\frac{\vec\sigma}{\sqrt{k+1}} \le \frac{2\sqrt{3}}{\sqrt{1-\beta}}\,\frac{\vec\sigma}{\sqrt{k+1}}. \tag{6}$$

We now turn to bounding $(*)$, the "bias" term.

\begin{align*}
\mathbb{E}[|m_k - g_k|] &= \mathbb{E}\left[\left|\frac{1-\beta}{1-\beta^{k+1}}\sum_{t=0}^k\left[\beta^t g_{k-t}\right] - \frac{1-\beta}{1-\beta^{k+1}}\sum_{t=0}^k\left[\beta^t g_k\right]\right|\right] && \text{since } \tfrac{1-\beta}{1-\beta^{k+1}}\textstyle\sum_{t=0}^k\beta^t = 1 \\
&\le \frac{1-\beta}{1-\beta^{k+1}}\sum_{t=0}^k\left[\beta^t\,\mathbb{E}[|g_{k-t} - g_k|]\right] \\
&\le 2(1-\beta)\sum_{t=1}^k\beta^t\,\mathbb{E}[|g_{k-t} - g_k|] && \text{since } k \ge C \tag{7}
\end{align*}

To proceed, we need the following lemma.

Lemma F.3. Under Assumption 2, for any sign vector $s \in \{-1, 1\}^d$, any $x \in \mathbb{R}^d$ and any $\epsilon \le \delta$,
$$\|g(x + \epsilon s) - g(x)\|_1 \le 2\epsilon\|\vec L\|_1.$$

Proof. By Taylor's theorem,
$$g(x + \epsilon s) - g(x) = \left[\int_{t=0}^1 H(x + t\epsilon s)\,dt\right]\epsilon s.$$
Let $v := \mathrm{sign}(g(x + \epsilon s) - g(x))$ and $H := \left[\int_{t=0}^1 H(x + t\epsilon s)\,dt\right]$, and moreover use $H_+$ to denote the psd part of $H$ and $H_-$ to denote the nsd part of $H$. Namely, $H = H_+ - H_-$.

We can write
\begin{align*}
\|g(x + \epsilon s) - g(x)\|_1 &= v^T\left(g(x + \epsilon s) - g(x)\right) = v^T H(\epsilon s) = \epsilon\, v^T H_+ s - \epsilon\, v^T H_- s \\
&= \epsilon\,\langle H_+^{1/2}v,\, H_+^{1/2}s\rangle - \epsilon\,\langle H_-^{1/2}v,\, H_-^{1/2}s\rangle \le \epsilon\,\|H_+^{1/2}v\|\,\|H_+^{1/2}s\| + \epsilon\,\|H_-^{1/2}v\|\,\|H_-^{1/2}s\|. \tag{8}
\end{align*}


Note that Assumption 2 implies the semidefinite ordering
$$H_+ \preceq \mathrm{diag}(\vec L) \quad\text{and}\quad H_- \preceq \mathrm{diag}(\vec L),$$
and thus $\max\{s^T H_+ s,\; s^T H_- s\} \le \sum_{i=1}^d L_i = \|\vec L\|_1$ for all $s \in \{-1, 1\}^d$.

The proof is complete by observing that both $v$ and $s$ are sign vectors in (8).

Using the above lemma and the fact that our update rule always follows a sign vector with learning rate smaller than $\delta$, we have

\begin{align*}
\|g_{k-t} - g_k\|_1 &\le \sum_{l=0}^{t-1}\|g_{k-l} - g_{k-l-1}\|_1 \\
&\le 2\|\vec L\|_1\sum_{l=0}^{t-1}\delta_{k-l-1} \le \sum_{l=0}^{t-1}\frac{2\|\vec L\|_1\delta}{\sqrt{k-l}} \\
&\le 2\|\vec L\|_1\delta\int_{k-t}^{k}\frac{dx}{\sqrt{x}} = 4\|\vec L\|_1\delta\left(\sqrt{k} - \sqrt{k-t}\right) \\
&\le 4\|\vec L\|_1\delta\,\sqrt{k}\left(1 - \sqrt{1 - \tfrac{t}{k}}\right) \le 4\|\vec L\|_1\delta\,\frac{t}{\sqrt{k}} && \text{for } 0 \le x \le 1,\ 1 - x \le \sqrt{1 - x} \tag{9}
\end{align*}

It follows that
\begin{align*}
\sum_i\mathbb{E}[|m_k[i] - g_k[i]|] = \mathbb{E}\left[\|m_k - g_k\|_1\right] &\le 2(1-\beta)\sum_{t=1}^k\beta^t\,\mathbb{E}\|g_{k-t} - g_k\|_1 && \text{apply (7)} \\
&\le \frac{8(1-\beta)\|\vec L\|_1\delta}{\sqrt{k}}\sum_{t=1}^{\infty}t\beta^t && \text{apply (9) and extend sum to }\infty \\
&\le \frac{8(1-\beta)\|\vec L\|_1\delta}{\sqrt{k}}\,\frac{\beta}{(1-\beta)^2} && \text{derivative of geometric progression} \\
&\le \frac{16\|\vec L\|_1\delta}{\sqrt{k+1}}\,\frac{\beta}{1-\beta} && \text{for } k \ge 1,\ \sqrt{\tfrac{k+1}{k}} \le 2 \tag{10}
\end{align*}

Substituting (6) into (5), summing both sides over $i$, and then plugging in (10), we get the statement of the lemma.

The proof of Theorem 3 now follows in a straightforward manner. Note that Lemma F.2 only kicks in after a warmup period of C iterations, with C as specified in Theorem 3. In theory it does not matter what you do during this warmup period, provided you accumulate the momentum as normal and take steps according to the prescribed learning rate and mini-batch schedules. One option is to just stay put and not update the parameter vector for the first C iterations. This is wasteful, since no progress will be made on the objective. A better option in practice is to take steps using the sign of the stochastic gradient (i.e. do SIGNSGD) instead of SIGNUM during the warmup period.
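A compact sketch of this second option (a hypothetical implementation detail, not code from the paper): the momentum is accumulated throughout, but the step direction is the sign of the raw stochastic gradient for the first C iterations and the sign of the momentum afterwards. The toy gradient, warmup length and learning rate constant are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(4)
d, K, C = 10, 200, 20                    # C = illustrative warmup length
beta = 0.9
L_vec = rng.uniform(0.5, 2.0, size=d)

x, m = rng.normal(size=d), np.zeros(d)
for k in range(K):
    g_tilde = L_vec * x + rng.normal(0.0, 0.5, size=d)    # toy stochastic gradient
    m = beta * m + (1.0 - beta) * g_tilde                  # momentum accumulated as normal
    direction = np.sign(g_tilde) if k < C else np.sign(m)  # signSGD during warmup, then SIGNUM
    x = x - (0.1 / np.sqrt(k + 1)) * direction             # delta_k = O(1 / sqrt(k))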

Proof of Theorem 3. Substitute Lemma F.2 as $\xi(k)$ into Lemma E.1 and check that $\xi(k) = O(1/\sqrt{k})$, $\delta_k = O(1/\sqrt{k})$, $\min_k\delta_k = \delta/\sqrt{K}$ and $C \ll K$; in addition, note that by the increasing mini-batch size $N_K = O(K^2)$. Substitute $K = O(\sqrt{N})$ and take the square on both sides of the inequality. (We can take the $\min_k$ out of the square since all the arguments are nonnegative and $(\cdot)^2$ is monotonic on $\mathbb{R}_+$.)