an accelerated decentralized stochastic proximal algorithm ... · msda (scaman et al.,2017) global...

HAL Id: hal-02280763https://hal.archives-ouvertes.fr/hal-02280763

Submitted on 6 Sep 2019

HAL is a multi-disciplinary open accessarchive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come fromteaching and research institutions in France orabroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, estdestinée au dépôt et à la diffusion de documentsscientifiques de niveau recherche, publiés ou non,émanant des établissements d’enseignement et derecherche français ou étrangers, des laboratoirespublics ou privés.

An Accelerated Decentralized Stochastic ProximalAlgorithm for Finite Sums

Hadrien Hendrikx, Francis Bach, Laurent Massoulié

To cite this version:Hadrien Hendrikx, Francis Bach, Laurent Massoulié. An Accelerated Decentralized Stochastic Proxi-mal Algorithm for Finite Sums. Neural Information Processing systems, Dec 2019, Vancouver, Canada.�hal-02280763�

https://hal.archives-ouvertes.fr/hal-02280763

https://hal.archives-ouvertes.fr

AN ACCELERATED DECENTRALIZED STOCHASTIC PROXIMAL

ALGORITHM FOR FINITE SUMS

HADRIEN HENDRIKX † FRANCIS BACH LAURENT MASSOULIE †

Abstract. Modern large-scale finite-sum optimization relies on two key aspects: distribution and

stochastic updates. For smooth and strongly convex problems, existing decentralized algorithms

are slower than modern accelerated variance-reduced stochastic algorithms when run on a singlemachine, and are therefore not efficient. Centralized algorithms are fast, but their scaling is

limited by global aggregation steps that result in communication bottlenecks. In this work,

we propose an efficient Accelerated Decentralized stochastic algorithm for Finite Sums namedADFS, which uses local stochastic proximal updates and randomized pairwise communications

between nodes. On n machines, ADFS learns from nm samples in the same time it takes optimalalgorithms to learn from m samples on one machine. This scaling holds until a critical networksize is reached, which depends on communication delays, on the number of samples m, and on the

network topology. We provide a theoretical analysis based on a novel augmented graph approachcombined with a precise evaluation of synchronization times and an extension of the accelerated

proximal coordinate gradient algorithm to arbitrary sampling. We illustrate the improvement ofADFS over state-of-the-art decentralized approaches with experiments.

1. Introduction

The success of machine learning models is mainly due to their capacity to train on huge amountsof data. Distributed systems can be used to process more data than one computer can store or toincrease the pace at which models are trained by splitting the work among many computing nodes.In this work, we focus on problems of the form:

minθ∈Rd

n∑i=1

fi(θ), where fi(θ) =

m∑j=1

fi,j(θ) +σi2‖θ‖2. (1)

This is the typical `2-regularized empirical risk minimization problem with n computing nodesthat have m local training examples each. The function fi,j represents the loss function for thej-th training example of node i and is assumed to be convex and Li,j-smooth (Nesterov, 2013;Bubeck, 2015). These problems are usually solved by first-order methods, and the basic distributedalgorithms compute gradients in parallel over several machines (Nedic and Ozdaglar, 2009). Anotherway to speed up training is to use stochastic algorithms (Bottou, 2010; Defazio et al., 2014; Johnsonand Zhang, 2013), that take advantage of the finite sum structure of the problem to use cheaperiterations while preserving fast convergence. This paper aims at bridging the gap between stochasticand decentralized algorithms when local functions are smooth and strongly convex. In the rest ofthis paper, following Scaman et al. (Scaman et al., 2017), we assume that nodes are linked by acommunication network and can only exchange messages with their neighbours. We further assumethat each communication takes time τ and that processing one sample, i.e., computing the proximaloperator for a single function fi,j , takes time 1. The proximal operator of a function fi,j is definedby proxηfi,j (x) = arg minv

12η‖v − x‖

2 + fi,j(v). The condition number of the Laplacian matrix of

the graph representing the communication network is denoted γ. This natural constant appearsin the running time of many decentralized algorithms and is for instance of order O(1) for thecomplete graph and O(n−1) for the 2D grid. More generally, γ−1/2 is typically of the same order asthe diameter of the graph. Following notations from Xiao et al. (Xiao et al., 2019), we define thebatch and stochastic condition numbers κb and κs (which are classical quantities in the analysis offinite sum optimization) such that for all i, κb ≥Mi/σi where Mi is the smoothness constant of

INRIA - Departement d’informatique de l’ENS, Ecole normale superieure, CNRS, INRIA, PSL ResearchUniversity, 75005 Paris, France

† Microsoft-INRIA Joint Centre

E-mail address: [email protected], [email protected], [email protected].

1

arX

iv:1

905.

1139

4v2

[m

ath.

OC

] 1

2 Ju

n 20

19

2 AN ACCELERATED DECENTRALIZED STOCHASTIC PROXIMAL ALGORITHM FOR FINITE SUMS

Algorithm Synchrony Stochastic Time

Point-SAGA (Defazio, 2016) N/A X nm+√nmκs

MSDA (Scaman et al., 2017) Global × √κb(m+ τ√

γ

)ESDACD (Hendrikx et al., 2019) Local × (m+ τ)

√κbγ

DSBA (Shen et al., 2018) Global X(m+ κs + γ−1

)(1 + τ)

ADFS (this paper) Local X m+√mκs + (1 + τ)

√κsγ

Table 1. Comparison of various state-of-the-art decentralized algorithms to reachaccuracy ε in regular graphs. Constant factors are omitted, as well as the log

(ε−1)

factorin the Time column. Reported runtime for Point-SAGA corresponds to running it on a

single machine with nm samples. To allow for direct comparison, we assume thatcomputing a dual gradient of a function fi as required by MSDA and ESDACD takestime m, although it is generally more expensive than to compute m separate proximal

operators of single fi,j functions.

the function fi and κs ≥ κi, with κi = 1 +∑mj=1 Li,j/σi the stochastic condition number of node i.

Although κs is always bigger than κb, it is generally of the same order of magnitude, leading to thepractical superiority of stochastic algorithms. The next paragraphs discuss the relevant state ofthe art for both distributed and stochastic methods, and Table 1 sums up the speeds of the maindecentralized algorithms available to solve Problem (1). Although it is not a distributed algorithm,Point-SAGA (Defazio, 2016), an optimal single-machine algorithm, is also presented for comparison.

Centralized gradient methods. A simple way to split work between nodes is to distributegradient computations and to aggregate them on a parameter server. Provided the network is fastenough, this allows the system to learn from the datasets of n workers in the same time one workerwould need to learn from its own dataset. Yet, these approaches are very sensitive to stochasticdelays, slow nodes, and communication bottlenecks. Asynchronous methods may be used (Rechtet al., 2011; Leblond et al., 2017; Xiao et al., 2019) to address the first two issues, but computinggradients on older (or even inconsistent) versions of the parameter harms convergence (Chen et al.,2016). Therefore, this paper focuses on decentralized algorithms, which are generally less sensitiveto communication bottlenecks (Lian et al., 2017).

Decentralized gradient methods. In their synchronous versions, decentralized algorithmsalternate rounds of computations (in which all nodes compute gradients with respect to their localdata) and communications, in which nodes exchange information with their direct neighbors (Duchiet al., 2012; Shi et al., 2015; Nedic et al., 2017; Tang et al., 2018; He et al., 2018). Communicationsteps often consist in averaging gradients or parameters with neighbours, and can thus be abstractedas multiplication by a so-called gossip matrix. MSDA (Scaman et al., 2017) is a batch decentralizedsynchronous algorithm, and it is optimal with respect to the constants γ and κb, among batchalgorithms that can only perform these two operations. Instead of performing global synchronousupdates, some approaches inspired from gossip algorithms (Boyd et al., 2006) use randomizedpairwise communications (Nedic and Ozdaglar, 2009; Johansson et al., 2009; Colin et al., 2016).This for example allows fast nodes to perform more updates in order to benefit from their increasedcomputing power. These randomized algorithms do not suffer from the usual worst-case analyses ofbounded-delay asynchronous algorithms, and can thus have fast rates because the step-size doesnot need to be reduced in the presence of delays. For example, ESDACD (Hendrikx et al., 2019)achieves the same optimal speed as MSDA when batch computations are faster than communications(τ > m). However, both use gradients of the Fenchel conjugates of the full local functions, whichare generally much harder to get than regular gradients.

Stochastic algorithms for finite sums. All distributed methods presented earlier are batchmethods that rely on computing full gradient steps of each function fi. Stochastic methods performupdates based on randomly chosen functions fi,j . In the smooth and strongly convex setting,they can be coupled with variance reduction (Schmidt et al., 2017; Shalev-Shwartz and Zhang,2013; Johnson and Zhang, 2013; Defazio et al., 2014) and acceleration, to achieve the m+

√mκs

AN ACCELERATED DECENTRALIZED STOCHASTIC PROXIMAL ALGORITHM FOR FINITE SUMS 3

optimal finite-sum rate, which greatly improves over the m√κb batch optimum when the dataset is

large. Examples of such methods include Accelerated-SDCA (Shalev-Shwartz and Zhang, 2014),APCG (Lin et al., 2015), Point-SAGA (Defazio, 2016) or Katyusha (Allen-Zhu, 2017).

Decentralized stochastic methods. In the smooth and strongly convex setting, DSA (Mokhtariand Ribeiro, 2016) and later DSBA (Shen et al., 2018) are two linearly converging stochasticdecentralized algorithms. DSBA uses the proximal operator of individual functions fi,j to signif-icantly improve over DSA in terms of rates. Yet, DSBA does not enjoy the

√mκs accelerated

rates and needs an excellent network with very fast communications. Indeed, nodes need tocommunicate each time they process a single sample, resulting in many communication steps.CHOCO-SGD (Koloskova et al., 2019) is a simple decentralized stochastic algorithm with supportfor compressed communications. Yet, it is not linearly convergent and it requires to communicatebetween each gradient step as well. Therefore, to the best of our knowledge, there is no decentralizedstochastic algorithm with accelerated linear convergence rate or low communication complexitywithout sparsity assumptions (i.e., sparse features in linear supervised learning).

ADFS.The main contribution of this paper is a locally synchronous Accelerated Decentralizedstochastic algorithm for Finite Sums, named ADFS. It is very similar to APCG for empirical riskminimization in the limit case n = 1 (single machine), for which it gets the same m+

√mκs rate.

Besides, this rate stays unchanged when the number of machines grows, meaning that ADFS canprocess n times more data in the same amount of time on a network of size n. This scaling lastsas long as (1 + τ)

√κsγ− 1

2 < m +√mκs. This means that ADFS is at least as fast as MSDA

unless both the network is extremely fast (communications are faster than evaluating a singleproximal operator) and the diameter of the graph is very large compared to the size of the localfinite sums. Therefore, ADFS outperforms MSDA and DSBA in most standard machine learningsettings, combining optimal network scaling with the efficient distribution of optimal sequentialfinite-sum algorithms. Note however that, similarly to DSBA and Point-SAGA, ADFS requiresevaluating proxfi,j , which requires solving a local optimization problem. Yet, in the case of linearmodels such as logistic regression, it is only a constant factor slower than computing ∇fi,j , and itis especially much faster than computing the gradient of the conjugate of the full dual functions∇f∗i required by ESDACD and MSDA.

ADFS is based on three novel technical contributions: (i) a novel augmented graph approachwhich yields the dual formulation of Section 2, (ii) an extension of the APCG algorithm to arbitrarysampling that is applied to the dual problem in order to get the generic algorithm of Section 3, and(iii) the analysis of local synchrony, which is performed in Section 4. Finally, Section 5 presents arelevant choice of parameters leading to the rates shown in Table 1, and an experimental comparisonis done in Section 6. A Python implementation of ADFS is also provided in supplementary material.

2. Model and Derivations

Figure 1. Illustration of the augmented graph for n = 3 and m = 3.


We now specify our approach to solve the problem in Equation (1). The first (classical) stepconsists in considering that all nodes have a local parameter, but that all local parameters shouldbe equal because the goal is to have the global minimizer of the sum. Therefore, the problem writes:

minθ∈Rn×d

n∑i=1

fi(θ(i)) such that θ(i) = θ(j) if j ∈ N (i), (2)

where N (i) represents the neighbors of node i in the communication graph. Then, ESDACD andMSDA are obtained by applying accelerated (coordinate) gradient descent to an appropriate dualformulation of Problem (2). In the dual formulation, constraints become variables and so updatinga dual coordinate consists in performing an update along an edge of the network. In this work,we consider a new virtual graph in order to get a stochastic algorithm for finite sums. Thetransformation is sketched in Figure 1, and consists in replacing each node of the initial network bya star network. The centers of the stars are connected by the actual communication network, andthe center of the star network replacing node i has the local function f comm

i : x 7→ σi

2 ‖x‖2. The

center of node i is then connected with m nodes whose local functions are the functions fi,j forj ∈ {1, ...,m}. If we denote E the number of edges of the initial graph, then the augmented graphhas n(1 +m) nodes and E + nm edges.

Then, we consider one parameter vector θ(i,j) for each function fi,j and one vector θ(i) for eachfunction f comm

i . Therefore, there is one parameter vector for each node in the augmented graph. Weimpose the standard constraint that the parameter of each node must be equal to the parametersof its neighbors, but neighbors are now taken in the augmented graph. This yields the followingminimization problem:

minθ∈Rn(1+m)×d

n∑i=1

[ m∑j=1

fi,j(θ(i,j)) +

σi2‖θ(i)‖2

]such that θ(i) = θ(j) if j ∈ N (i), and θ(i,j) = θ(i) ∀j ∈ {1, ..,m}.

(3)

In the rest of the paper, we use letters k, ` to refer to any nodes in the augmented graph, andletters i, j to specifically refer to a communication node and one of its virtual nodes. More precisely,we denote (k, `) the edge between the nodes k and ` in the augmented graph. Note that k and `can be virtual or communication nodes. We denote e(k) the unit vector of Rn(1+m) correspondingto node k, and ek` the unit vector of RE+nm corresponding to edge (k, `). To clearly make thedistinction between node variables and edge variables, for any matrix on the set of nodes of theaugmented graph x ∈ Rn(1+m)×d we write that x(k) = xT e(k) for k ∈ {1, ..., n(1 +m)} (superscriptnotation) and for any matrix on the set of edges of the augmented graph λ ∈ R(E+nm)×d we writethat λk` = λT ek` (subscript notation) for any edge (k, `). For node variables, we use the subscriptnotation with a t to denote time, for instance in Algorithm 1. By a slight abuse of notations, we useindices (i, j) instead of (k, `) when specifically refering to virtual edges (or virtual nodes) and denoteλij instead of λi,(i,j) the virtual edge between node i and node (i, j) in the augmented graph. The

constraints of Problem (3) can be rewritten AT θ = 0 in matrix form, where A ∈ Rn(1+m)×(nm+E)

is such that Aek` = µk`(e(k) − e(`)) for some µk` > 0. Then, the dual formulation of this problem

writes:

maxλ∈R(nm+E)×d

−n∑i=1

[ m∑j=1

f∗i,j

((Aλ)(i,j)

)+

1

2σi‖(Aλ)(i)‖2

], (4)

where the parameter λ is the Lagrange multiplier associated with the constraints of Problem (3)—more precisely, for an edge (k, `), λk` ∈ Rd is the Lagrange multiplier associated with the constraintµk`(e

(k) − e(`))T θ = 0. At this point, the functions fi,j are only assumed to be convex (and notnecessarily strongly convex) meaning that the functions f∗i,j are potentially non-smooth. Thisproblem could be bypassed by transferring some of the quadratic penalty from the communicationnodes to the virtual nodes before going to the dual formulation. Yet, this approach fails whenm is large because the smoothness parameter of f∗i,j would scale as m/σi at best, whereas asmoothness of order 1/σi is required to match optimal finite-sum methods. A better option isto consider the f∗i,j terms as non-smooth and perform proximal updates on them. The rate ofproximal gradient methods such as APCG (Lin et al., 2015) does not depend on the strong convexityparameter of the non-smooth functions f∗i,j . Each f∗i,j is (1/Li,j)-strongly convex (because fi,j was(Li,j)-smooth), so we can rewrite the previous equation in order to transfer all the strong convexity


to the communication node. Noting that (Aλ)(i,j) = −µijλij when node (i, j) is a virtual nodeassociated with node i, we rewrite the dual problem as:

minλ∈R(E+nm)×d

qA(λ) +

n∑i=1

m∑j=1

˜f∗i,j(λij), (5)

with ˜f∗i,j : x 7→ f∗i,j(−µijx) − µ2ij

2Li,j‖x‖2 and qA : x 7→ Trace

(12x

TATΣ−1Ax), where Σ is the

diagonal matrix such that e(i)T

Σe(i) = σi if i is a center node and e(i,j)T

Σe(i,j) = Li,j if it isthe virtual node (i, j). Since dual variables are associated with edges, using coordinate descentalgorithms on dual formulations from a well-chosen augmented graph of constraints allows us tohandle both computations and communications in the same framework. Indeed, choosing a variablecorresponding to an actual edge of the network results in a communication along this edge, whereaschoosing a virtual edge results in a local computation step. Then, we balance the ratio betweencommunications and computations by simply adjusting the probability of picking a given kind ofedges.

3. The Algorithm: ADFS Iterations and Expected Error

In this section, we detail our new ADFS algorithm. In order to obtain it, we introduce ageneralized version of the APCG algorithm (Lin et al., 2015) which we detail in Appendix A.Then we apply it to Problem (5) to get Algorithm 1. Due to lack of space, we only present thesmooth version of ADFS here, but a non-smooth version is presented in Appendix B, along withthe derivations required to obtain Algorithm 1 and Theorem 1. We denote A† the pseudo inverse ofA and Wk` ∈ Rn(1+m)×n(1+m) the matrix such that Wk` = (e(k) − e(`))(e(k) − e(`))T for any edge(k, `). Note that variables xt, yt and vt from Algorithm 1 are variables associated with the nodes ofthe augmented graph and are therefore matrices in Rn(1+m)×d (one row for each node). They areobtained by multiplying the dual variables of the proximal coordinate gradient algorithm appliedto the dual problem of Equation (5) by A on the left. We denote σA = λ+min(ATΣ−1A) the smallestnon-zero eigenvalue of the matrix ATΣ−1A.

Algorithm 1 ADFS(A, (σi), (Li,j), (µk`), (pk`), ρ)

1: σA = λ+min(ATΣ−1A), ηk` =ρµ2

k`

σApk`, Rk` = eTk`A

†Aek` {Initialization}2: x0 = y0 = v0 = z0 = 0(n+nm)×d

3: for t = 0 to K − 1 do {Run for K iterations}4: yt = 1

1+ρ (xt + ρvt)

5: Sample edge (k, `) with probability pk` {Edge sampled from the augmented graph}6: zt+1 = vt+1 = (1− ρ)vt + ρyt − ηk`Wk`Σ

−1yt {Nodes k and ` communicate yt}7: if (k, `) is the virtual edge between node i and virtual node (i, j) then

8: v(i,j)t+1 = proxηij f∗i,j

(z(i,j)t+1

){Virtual node update using fi,j}

9: v(i)t+1 = z

(i)t+1 + z

(i,j)t+1 − v

(i,j)t+1 {Center node update}

10: end if11: xt+1 = yt + ρRk`

pk`(vt+1 − (1− ρ)vt − ρyt)

12: end for13: return θK = Σ−1vK {Return primal parameter}

Theorem 1. We denote θ? the minimizer of the primal function F : x 7→∑ni=1 fi(x) and θ?A a

minimizer of the dual function F ∗A = qA + ψ. Then θt as output by Algorithm 1 verifies:

E[‖θt − θ?‖2

]≤ C0(1− ρ)t, if ρ2 ≤ min

k`

λ+min(ATΣ−1A)

Σ−1kk + Σ−1``

p2k`µ2k`Rk`

, (6)

with C0 = λmax(ATΣ−2A)[‖A†Aθ?A‖2 + 2σ−1A (F ∗A(0)− F ∗A(θ?A))

].

We discuss several aspects related to the implementation of Algorithm 1 below, and provide itsPython implementation in supplementary material.


Convergence rate. The parameter ρ controls the convergence rate of ADFS. It is defined by theminimum of the individual rates for each edge, which explicitly depend on parameters related tothe functions themselves (1/(Σ−1kk + Σ−1`` )), to the graph topology (Rk` = eTk`A

†Aek`), to a mix of

both (λ+min(ATΣ−1A)/µ2k`) and to the sampling probabilities of the edges (p2k`). Note that these

quantities are very different depending on whether edges are virtual or not. In Section 5, wecarefully choose the free parameters µk` and pk` to get the best convergence speed.

Sparse updates. Although the updates of Algorithm 1 involve all nodes of the network, itis actually possible to implement them efficiently so that only two nodes are actually involved

in each update, as described below. Indeed, Wk` is a very sparse matrix so(Wk`Σ

−1yt)(k)

=

µ2k`(Σ

−1k y

(k)t − Σ−1` y

(`)t ) = −

(Wk`Σ

−1yt)(`)

and(Wk`Σ

−1yt)(h)

= 0 for h 6= k, `. Therefore, onlythe following situations can happen:

(1) Communication updates: If (k, `) is a communication edge, the update only requiresnodes k and ` to exchange parameters and perform a weighted difference between them.

(2) Local updates: If (k, `) is the virtual edge between node i and its j-th virtual node,parameters exchange of line 4 is local, and the proximal term involves function fi,j only.

(3) Convex combinations: If we choose h 6= k, ` then v(h)t+1 and y

(h)t+1 are obtained by convex

combinations of y(h)t and v

(h)t so the update is cheap and local. Besides, nodes actually need

the value of their parameters only when they perform updates of type 1 or 2. Therefore,they can simply store how many updates of this type they should have done and performthem all at once before each communication or local update.

Primal proximal step. Algorithm 1 uses proximal steps performed on f∗i,j : x→ f∗i,j(−µi,jx)−µ2ij

2Li,j‖x‖2 instead of fi,j . Yet, it is possible to use Moreau identity to express proxηf∗i,j

using only

the proximal operator of fi,j , which can easily be evaluated for many objective functions. Notehowever that this may require choosing a smaller value for ρ. The exact derivations are presentedin Appendix B.3.

Linear case. For many standard machine learning problems, fi,j(θ) = `(XTi,jθ) with Xi,j ∈ Rd.

This implies that f∗i,j(θ) = +∞ whenever θ /∈ Vec (Xi,j). Therefore, the proximal steps on theFenchel conjugate only have support on Xi,j , meaning that they are one-dimensional problems thatcan be solved in constant time using for example the Newton method when no analytical solutionis available. Warm starts (initializing on the previous solution) can also be used for solving thelocal problems even faster so that in the end, a one-dimensional proximal update is only a constanttime slower than a gradient update. Note that this also allows to store parameters vt and yt asscalar coefficients for virtual nodes, thus greatly reducing the memory footprint of ADFS.

4. Distributed Execution and Synchronization Time

Theorem 1 gives bounds on the expected error after a given number of iterations. To assess theactual speed of the algorithm, it is still required to know how long executing a given number ofiterations takes. This is easy with synchronous algorithms such as MSDA or DSBA, in which allnodes iteratively perform local updates or communication rounds. In this case, executing ncomp

computing rounds and ncomm communication rounds simply takes time ncomp + τncomm. ADFSrelies on randomized pairwise communications, so it is necessary to sample a schedule, i.e., a randomsequence of edges from the augmented graph, and evaluate how fast this schedule can be executed.Note that the execution time crucially depends on how many edges can be updated in parallel,which itself depends on the graph and on the random schedule sampled.

Shared schedule. Even though they only actively take part in a small fraction of the updates, allnodes need to execute the same schedule to correctly implement Algorithm 1. To generate thisshared schedule, all nodes are given a seed and the sampling probabilities of all edges. This allowsthem to avoid deadlocks and to precisely know how many convex combinations to perform betweenvt and yt.


Figure 2. Illustration of parallel execution and local synchrony. Nodes from a toy graphexecute the schedule [(A,C), (B,D), (A,B), (D), (C,D)], where (D) means that node D

performs a local update. Each node needs to execute its updates in the partial orderdefined by the schedule. In particular, node C has to perform update (A,C) and then

update (C,D), so it is idle between times τ and τ + 1 because it needs to wait for node Dto finish its local update before the communication update (C,D) can start. We assume

τ > 1 since the local update terminates before the communication update (A,B).Contrary to synchronous algorithms, no global notion of rounds exist and some nodes

(such as node D) perform more updates than others.

Execution time. The problem of bounding the probability that a random schedule of fixed lengthexceeds a given execution time can be cast in the framework of fork-join queuing networks withblocking (Zeng et al., 2018). In particular, queuing theory (Baccelli et al., 1992) tells us that theaverage time per iteration exists for any fixed probability distribution over a given augmentedgraph. Unfortunately, existing quantitative results are not precise enough for our purpose so wegeneralize the method introduced by Hendrikx et al. (Hendrikx et al., 2019) to get a finer bound.While their result is valid when the only possible operation is communicating with a neighbor, weextend it to the case in which nodes can also perform local computations. For the rest of thispaper, we denote pcomm the probability of performing a communication update and pcomp theprobability of performing a local update. They are such that pcomp + pcomm = 1. We also definepmaxcomm = nmaxk

∑`∈N (k) pk`/2, where neighbors are in the communication network only. When all

nodes have the same probability to participate in an update, pmaxcomm = pcomm. Then, the following

theorem holds (see proof in Appendix C):

Theorem 2. Let T (t) be the time needed for the system to execute a schedule of size t, i.e., titerations of Algorithm 1. If all nodes perform local computations with probability pcomp/n withpcomp > pmax

comm or if τ > 1 then there exists C < 24 such that:

P(

1

tT (t) ≤ C

n

(pcomp + 2τpmax

comm

))→ 1 as t→∞ (7)

Note that the constant C is a worst-case estimate and that it is much smaller for homogeneouscommunication probabilities. This novel result states that the number of iterations that Algorithm 1can perform per unit of time increases linearly with the size of the network. This is possiblebecause each iteration only involves two nodes so many iterations can be done in parallel. Theassumption pcomp > pcomm is responsible for the 1 + τ factor instead of τ in Table 1, which preventsADFS from benefiting from network acceleration when communications are cheap (τ < 1). Notethat this is an actual restriction of following a schedule, as detailed in Appendix C. Yet, networkoperations generally suffer from communication protocols overhead whereas computing a singleproximal update often either has a closed-form solution or is a simple one-dimensional problem inthe linear case. Therefore, assuming τ > 1 is not very restrictive in the finite-sum setting.

5. Performances and Parameters Choice in the Homogeneous Setting

We now prove the time to convergence of ADFS presented in Table 1, and detail the conditionsunder which it holds. Indeed, Section 3 presents ADFS in full generality but the different parametershave to be chosen carefully to reach optimal speed. In particular, we have to choose the coefficientsµ to make sure that the graph augmentation trick does not cause the smallest positive eigenvalue ofATΣ−1A to shrink too much. Similarly, ρ is defined in Equation (6) by a minimum over all edges of a


(a) Higgs, n = 4 (b) Higgs, n = 100 (c) Covtype, n = 100

Figure 3. Performances of various decentralized algorithms on the logistic regressiontask with m = 104 points per node, regularization parameter σ = 1 and communication

delays τ = 5 on 2D grid networks of different sizes.

given quantity. This quantity heavily depends on whether the edge is an actual communication edgeor a virtual edge. One can trade pcomp for pcomm so that the minimum is the same for both kind ofedges, but Theorem 2 tells us that this is only possible as long as pcomp > pcomm. More specifically,we define L = AcommA

Tcomm ∈ Rn×n the Laplacian of the communication graph, with Acomm ∈ Rn×E

such that Acommek` = µk`(e(k) − e(`)) for all edge (k, `) ∈ Ecomm, the set of communication edges.

Then, we define γ = min(k,`)∈Ecomm λ+min(L)n2/(µ2k`Rk`E

2). As shown in Appendix D.2, γ ≈ γfor regular graphs such as the complete graph or the grid, justifying the use of γ instead of γ inTable 1. We assume for simplicity that σi = σ and that κi = 1 + σ−1i

∑mj=1 Li,j = κs for all nodes.

For virtual edges, we choose µ2ij = λ+min(L)Li,j/(σκi) and pij = pcomp(1 + Li,jσ

−1i )

12 /(nScomp)

with Scomp = n−1∑ni=1

∑mj=1(1 +Li,jσ

−1i )

12 . For communications edges (k, `) ∈ Ecomm, we choose

pk` = pcomm/E and µ2k` = 1/2.

Theorem 3. If we choose pcomm = min(

1/2,(

1 + Scomp

√γ/κs

)−1). Then, running Algorithm 1

for Kε = ρ−1 log(ε−1) iterations guarantees E[‖θKε

− θ?‖2]≤ C0ε, and takes time T (Kε), with:

T (Kε) ≤√

2C

(m+

√mκs +

√2

(1 + 4τ

)√κsγ

)log(1/ε)

with probability tending to 1 as ρ−1 log(ε−1)→∞, with the same C0 and C < 24 as in Theorems 1and 2.

Theorem 3 assumes that all communication probabilities and condition numbers are exactlyequal in order to ease reading. A more detailed version with rates for more heterogeneous settingscan be found in Appendix D. Note that while algorithms such as MSDA required to use polynomialsof the initial gossip matrix to model several consecutive communication steps, we can more directlytune the amount of communication and computation steps simply by adjusting pcomp and pcomm.

6. Experiments

In this section, we illustrate the theoretical results by showing how ADFS compares withMSDA (Scaman et al., 2017), ESDACD (Hendrikx et al., 2019), Point-SAGA (Defazio, 2016), andDSBA (Shen et al., 2018). All algorithms (except for DSBA, for which we fine-tuned the step-size)were run with out-of-the-box hyperparameters given by theory on data extracted from the standardHiggs and Covtype datasets from LibSVM. The underlying graph is assumed to be a 2D gridnetwork. Experiments were run in a distributed manner on an actual computing cluster. Yet, plotsare shown for idealized times in order to abstract implementation details as well as ensure thatreported timings were not impacted by the cluster status. All the details of the experimental setupas well as a comparison with centralized algorithms can be found in Appendix E. An implementationof ADFS is also available in supplementary material.


Figure 3a shows that, as predicted by theory, ADFS and Point-SAGA have similar rates on smallnetworks. In this case, ADFS uses more computing power but has a small overhead. Figures 3band 3c use a much larger grid to evaluate how these algorithms scale. In this setting, Point-SAGAis the slowest algorithm since it has 100 times less computing power available. MSDA performsquite well on the Covtype dataset thanks to its very good network scaling. Yet, the m

√κ factor in

its rate makes it scales poorly with the condition number κ, which explains why it struggles on theHiggs dataset. DSBA is slow as well despite the fine-tuning because it is the only non-acceleratedmethod, and it has to communicate after each proximal step, thus having to wait for a time τ = 5at each step. ESDACD does not perform well either because m > τ and it has to perform asmany batch computing steps as communication steps. ADFS does not suffer from any of thesedrawbacks and therefore outperforms other approaches by a large margin on these experiments.This illustrates the fact that ADFS combines the strengths of accelerated stochastic algorithms,such as Point-SAGA, and fast decentralized algorithms, such as MSDA.

7. Conclusion

In this paper, we provided a novel accelerated stochastic algorithm for decentralized optimization.To the best of our knowledge, it is the first decentralized algorithm that successfully leverages thefinite-sum structure of the objective functions to match the rates of the best known sequentialalgorithms while having the network scaling of optimal batch algorithms. The analysis in this papercould be extended to better handle heterogeneous settings, both in terms of hardware (computingtimes, delays) and local functions (different regularities). Finally, finding a locally synchronousalgorithm that can take advantage of arbitrarily low communication delays (beyond the τ > 1 limit)to scale to large graphs is still an open problem.

Acknowledgement

We acknowledge support from the European Research Council (grant SEQUOIA 724063).

References

Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. InProceedings of Symposium on Theory of Computing, pages 1200–1205, 2017.

Francois Baccelli, Guy Cohen, Geert Jan Olsder, and Jean-Pierre Quadrat. Synchronization andLinearity: an Algebra for Discrete Event Systems. John Wiley & Sons Ltd, 1992.

Leon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings ofCOMPSTAT, pages 177–186. Springer, 2010.

Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms.IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.

Sebastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends R© inMachine Learning, 8(3-4):231–357, 2015.

Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. Revisitingdistributed synchronous SGD. arXiv preprint arXiv:1604.00981, 2016.

Igor Colin, Aurelien Bellet, Joseph Salmon, and Stephan Clemencon. Gossip dual averaging fordecentralized optimization of pairwise functions. In Proceedings of the International Conferenceon International Conference on Machine Learning-Volume 48, pages 1388–1396, 2016.

Aaron Defazio. A simple practical accelerated method for finite sums. In Advances in NeuralInformation Processing Systems, pages 676–684, 2016.

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient methodwith support for non-strongly convex composite objectives. In Advances in Neural InformationProcessing Systems, pages 1646–1654, 2014.

John C Duchi, Alekh Agarwal, and Martin J Wainwright. Dual averaging for distributed optimization:Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.

Olivier Fercoq and Peter Richtarik. Accelerated, parallel, and proximal coordinate descent. SIAMJournal on Optimization, 25(4):1997–2023, 2015.

Lie He, An Bian, and Martin Jaggi. Cola: Decentralized linear learning. In Advances in NeuralInformation Processing Systems, pages 4536–4546, 2018.


Hadrien Hendrikx, Francis Bach, and Laurent Massoulie. Accelerated decentralized optimizationwith local updates for smooth and strongly convex objectives. In Artificial Intelligence andStatistics, 2019.

Bjorn Johansson, Maben Rabi, and Mikael Johansson. A randomized incremental subgradientmethod for distributed optimization in networked systems. SIAM Journal on Optimization, 20(3):1157–1170, 2009.

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variancereduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

Anastasia Koloskova, Sebastian U Stich, and Martin Jaggi. Decentralized stochastic optimizationand gossip algorithms with compressed communication. International Conference on MachineLearning, 2019.

Remi Leblond, Fabian Pedregosa, and Simon Lacoste-Julien. ASAGA: Asynchronous parallel SAGA.In Artificial Intelligence and Statistics, pages 46–54, 2017.

Yin Tat Lee and Aaron Sidford. Efficient accelerated coordinate descent methods and fasteralgorithms for solving linear systems. In Annual Symposium on Foundations of Computer Science(FOCS), pages 147–156, 2013.

Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralizedalgorithms outperform centralized algorithms? a case study for decentralized parallel stochasticgradient descent. In Advances in Neural Information Processing Systems, pages 5330–5340, 2017.

Qihang Lin, Zhaosong Lu, and Lin Xiao. An accelerated proximal coordinate gradient method. InAdvances in Neural Information Processing Systems, pages 3059–3067, 2014.

Qihang Lin, Zhaosong Lu, and Lin Xiao. An accelerated randomized proximal coordinate gradientmethod and its application to regularized empirical risk minimization. SIAM Journal onOptimization, 25(4):2244–2273, 2015.

Tao Lin, Sebastian U. Stich, and Martin Jaggi. Don’t use large mini-batches, use local SGD. arXivpreprint arXiv:1808.07217, 2018.

Aryan Mokhtari and Alejandro Ribeiro. DSA: Decentralized double stochastic averaging gradientalgorithm. Journal of Machine Learning Research, 17(1):2165–2199, 2016.

Angelia Nedic and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization.IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

Angelia Nedic, Alex Olshevsky, and Wei Shi. Achieving geometric convergence for distributedoptimization over time-varying graphs. SIAM Journal on Optimization, 27(4):2597–2633, 2017.

Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Bourse, volume 87. SpringerScience & Business Media, 2013.

Yurii Nesterov and Sebastian U. Stich. Efficiency of the accelerated coordinate descent method onstructured optimization problems. SIAM Journal on Optimization, 27(1):110–123, 2017.

Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends R© in Optimization,1(3):127–239, 2014.

Kumar Kshitij Patel and Aymeric Dieuleveut. Communication trade-offs for synchronized distributedSGD with large step size. arXiv preprint arXiv:1904.11325, 2019.

Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach toparallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems,pages 693–701, 2011.

Kevin Scaman, Francis Bach, Sebastien Bubeck, Yin Tat Lee, and Laurent Massoulie. Optimalalgorithms for smooth and strongly convex distributed optimization in networks. In InternationalConference on Machine Learning, pages 3027–3036, 2017.

Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochasticaverage gradient. Mathematical Programming, 162(1-2):83–112, 2017.

Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularizedloss minimization. Journal of Machine Learning Research, 14(Feb):567–599, 2013.

Shai Shalev-Shwartz and Tong Zhang. Accelerated proximal stochastic dual coordinate ascent forregularized loss minimization. In International Conference on Machine Learning, pages 64–72,2014.

Zebang Shen, Aryan Mokhtari, Tengfei Zhou, Peilin Zhao, and Hui Qian. Towards more efficientstochastic decentralized learning: Faster convergence and sparse communication. In InternationalConference on Machine Learning, pages 4631–4640, 2018.


Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. Extra: An exact first-order algorithm fordecentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.

Hanlin Tang, Xiangru Lian, Ming Yan, Ce Zhang, and Ji Liu. D2: Decentralized training overdecentralized data. In International Conference on Machine Learning, pages 4855–4863, 2018.

Lin Xiao, Adams Wei Yu, Qihang Lin, and Weizhu Chen. DSCOVR: Randomized primal-dual blockcoordinate algorithms for asynchronous distributed optimization. Journal of Machine LearningResearch, 20(43):1–58, 2019.

Yun Zeng, Augustin Chaintreau, Don Towsley, and Cathy H Xia. Throughput scalability analysisof fork-join queueing networks. Operations Research, 66(6):1728–1743, 2018.


Section A is a self-contained section with the statement and proofs of the extended APCGalgorithm. Then, Section B presents the derivations required to obtain ADFS from the extendedAPCG algorithm. Section C is dedicated to the study of waiting time in the locally synchronousmodel, and the analysis of the speed of ADFS for a specific choice of parameters is then given inSection D. Finally, Section E details the experimental setting and gives additional experimentsinvolving centralized algorithms.

Appendix A. Generalized APCG

In this section, we study the generic problem of accelerated proximal coordinate descent. Wegive an algorithm that works with arbitrary sampling of the coordinates, thus yielding a strongerresult than state-of-the-art approaches (Lin et al., 2015; Fercoq and Richtarik, 2015). This is akey contribution that allows to obtain fast rates when sampling probabilities are heterogeneousand determined by the problem. It is especially useful in our case to pick different probabilities forcomputing and for communicating. We also extend the result to the case in which the function isstrongly convex only on a subspace. Since this section of the Appendix is intended to detail theextended APCG general algorithm for a generic problem, it is mostly self-contained, and notationsare in particular different from the rest of the paper. More specifically, we study the followinggeneric problem:

minx∈Rd

fA(x) +

d∑i=1

ψi(x(i)), (8)

where all the functions ψi are convex and fA is such that there exists a matrix A such that fA is(σA)-strongly convex on Ker(A)⊥, the orthogonal of the kernel of A. Since, A†A is the projector onKer(A)⊥, (recall that A† is the pseudo-inverse of A), the strong convexity on this subspace can bewritten as the fact that for all x, y ∈ Rd:

fA(x)− fA(y)≥∇fA(y)TA†A(x− y) + σA

2 (x− y)TA†A(x− y). (9)

Note that this implies that fA is constant on Ker(A), so in particular there exists a function f suchthat for any x ∈ Rd, fA(x) = f(Ax). In this case, σA is such that xTAT∇2f(y)Ax ≥ σA‖x‖2 forany x ∈ Ker(A)⊥ and y ∈ Rd. Besides, fA is assumed to be (Mi)-smooth in direction i meaningthat its gradient in the direction i (noted ∇ifA) is (Mi)-Lipschitz. This is the general setting of the

problem of Equation (5), that can be recovered by taking fA = qA and ψi = f∗i . Proximal coordinategradient algorithms are known to work well for these problems, which is why we would like touse APCG (Lin et al., 2014). Yet, we would like to pick different probabilities for computing andcommunication edges, whereas APCG only handles uniform coordinates sampling. Furthermore, thefirst term is strongly-convex only on the orthogonal of the kernel of the matrix A, so APCG cannotbe applied straightforwardly. Therefore, we introduce an extended version of APCG, presented inAlgorithm 2, and we explicit its rate in Theorem 4. This extended APCG can then directly beapplied to solve the problem of Equation (5).

A.1. Algorithm and results

In this appendix, since there is no need to distinguish between primal and dual variables variablesas in the main text, we denote ei ∈ Rd the unit vector corresponding to coordinate i, and x(i) = eTi xfor any x ∈ Rd. Let Ri = eTi A

†Aei and pi be the probability that coordinate i is picked to beupdated. Constant S is such that S2 ≥ MiRi

p2ifor all i. Then, following the approaches of Nesterov

and Stich (Nesterov and Stich, 2017) and Lin et al. (Lin et al., 2015), we fix A0, B0 ∈ R andrecursively define sequences αt, βt, at, At and Bt such that:

a2t+1S2 = At+1Bt+1, Bt+1 = Bt + σAat+1, At+1 = At + at+1,

αt =at+1

At+1, βt =

σAat+1

Bt+1.

Finally, we introduce the sequences (yt), (vt) and (xt), that are all initialized at 0, and (wt) suchthat for all t, wt = (1− βt)vt + βtyt. We define ηi,t = at+1

Bt+1piand the proximal operator:

proxηi,tψi: x 7→ arg min

v

1

2ηi,t‖v − x‖2 + ψi(v).

We denote ∇ifA = eieTi ∇fA the coordinate gradient of fA along direction i. For generalized APCG


Algorithm 2 Generalized APCG(A0, B0, S, σA)

y0 = 0, v0 = 0, t = 0while t < T doyt = (1−αt)xt+αt(1−βt)vt

1−αtβt

Sample i with probability pivt+1 = zt+1 = (1− βt)vt + βtyt − ηi,t∇ifA(yt)

v(i)t+1 = proxηiψi

(z(i)t+1

)xt+1 = yt + αtRi

pi(vt+1 − (1− βt)vt − βtyt)

end while

to work well, the proximal operator needs to be taken in the subspace defined by the projectorA†A, and so the non-smooth ψi terms have to be separable after composition with A†A. Since A†Ais a projector, this constraint is equivalent to stating that either Ri = 1 (projection does not affectthe coordinate i), or ψi = 0 (no proximal update to make).

Assumption 1. The functions fA and ψ are such that equation Equation (9) holds for someσA ≥ 0 and for all i ∈ Rd, fA is (Mi)-smooth in direction i and ψ and A are such that eitherRi = 1 or ψi = 0.

This natural assumption allows us to formulate the proximal update in standard squared normsince the proximal operator is only used for coordinates i for which A†Aei = ei. Then, we formulateAlgorithm 2 and analyze its rate in Theorem 4.

Theorem 4. Let F : x 7→ fA(x) +∑di=1 ψi

(x(i))

such that Assumption 1 holds. If S is such that

S2 ≥ MiRi

p2iand 1− βt − αtRi

pi≥ 0, the sequences vt and xt generated by APCG verify:

BtE[‖vt − θ?‖2A†A

]+ 2At [E [F (xt)]− F (θ?)] ≤ C0,

where C0 = B0‖v0 − θ?‖2 + 2A0 [F (x0)− F (θ?)] and θ? is a minimizer of F . The rate of APCGdepends on S through the sequences αt and βt.

Our extended APCG algorithm is also closely related with an arbitrary sampling version ofAPPROX Fercoq and Richtarik (2015). Yet, APPROX has an explicit formulation with a moreflexible block selection rule than choosing only one coordinate at a time. Similarly to Lee andSidford Lee and Sidford (2013), it also uses iterations that can be more efficient, especially in thelinear case. These extensions can also be applied to APCG under the same assumptions, but this isbeyond the scope of this paper. Theorem 4 is a general method that in particular requires to setvalues for A0, B0, α0 and β0. The two following corollaries give choices of parameters dependingon whether σA > 0 or σA = 0, along with the rate of APCG in these cases.

Corollary 1 (Strongly Convex case). Let F be such that it verifies the assumptions of Theorem 4.If σA > 0, we can choose for all t ∈ N αt = βt = ρ and At = σ−1A Bt = (1− ρ)−t with ρ =

√σAS

−1.

In this case, the condition 1−βt− αtRi

pi≥ 0 can be weakened to 1− αtRi

pi≥ 0 and it is automatically

satisfied by our choice of S, αt and βt. In this case, APCG converges linearly with rate ρ, as shownby the following result:

σAE[‖vt − θ?‖2A†A

]+ 2 [E [F (xt)]− F (θ?)] ≤ C0(1− ρ)t

Corollary 2 (Convex case). Let F be such that it verifies the assumptions of Theorem 4. If σA = 0,we can choose βt = 0, B0 = 1 and A0 = 3B0

S2p2Rwith pR = mini

piRi

. In this case, the condition

1− βt − αtRi

pi≥ 0 is always satisfied for our choice of S and the error verifies:

E [F (xt)]− F (θ?) ≤ 2

t2

[S2r2t +

6

p2R[F (x0)− F (θ?)]

],

with r2t = ‖v0 − θ?‖2A†A − E[‖vt − θ?‖2A†A].

In the convex case, we only have control over the objective function F and not over the parameters.This in particular means that it is only possible to have guarantees on the dual objective in thecase of non-smooth ADFS.


A.2. Proof of Theorem 4

Before starting the proof, we define wt = (1− βt)vt + βtyt, and:

V ti (v) =Bt+1pi2at+1

‖v − w(i)t + ηie

Ti ∇f(yt)‖2 + ψi(v).

Then, we give the following lemma, that we prove later:

Lemma 1. If either 1−βt− αt

pi≥ 0 or αt = βt and 1− αt

pi≥ 0 for any i such that ψi 6= 0, then for

any t and i such that ψi 6= 0, we can write x(i)t =

∑tl=0 δ

(i)t (l)v

(i)l such that

∑tl=0 δ

(i)t (l) = 1 and

for any l, δ(i)t (l) ≥ 0. We define ψ

(i)t =

∑tl=0 δ

(i)t (l)ψi(v

(i)l ) and ψt =

∑di=1 ψ

(i)t . Then, if Ri = 1

whenever ψi 6= 0, ψ(xt) ≤ ψt and:

Eit[ψt+1

]≤ αtψ(vt+1) + (1− αt)ψt. (10)

where v(i)t+1 = arg minv V

ti (v) for all i. In particular, v

(it)t+1 = v

(it)t+1 and v

(j)t+1 = w

(j)t for j 6= it.

Note that Lemma 1 is a generalization to arbitrary sampling probabilities of the beginning ofthe proof in (Lin et al., 2015). We can now prove the main theorem.

Proof of Theorem 4. This proof follows the same general structure as Nesterov and Stich (Nesterovand Stich, 2017). In particular, it follows from expanding the ‖vt+1 − θ?‖2 term. In the originalproof, vt+1 = wt − g where g is a gradient term so the expansion is rather straightforward. In ourcase, vt+1 is defined by a proximal mapping so a bit more work is required. Yet, similar terms willappear, plus the function values of the non-smooth term that we control with Lemma 1. We startby showing the following equality:

Bt+1pi2at+1

[‖vt+1 − θ?‖2A†A+‖vt+1 − wt‖2A†A − ‖θ? − wt‖2A†A]

≤ 〈∇ifA(yt), θ? − vt+1〉A†A + ψi

(θ?(i)

)− ψi

(v(i)t+1

).

(11)

When ψi = 0, it follows from using vt+1 = wt − at+1

Bt+1pi∇ifA(yt) and basic algebra (expanding the

squared terms).When ψi 6= 0, A†Aei = ei because eTi A

†Aei = 1 and A†A is a projector. Therefore, we obtain

‖vt+1 − θ?‖2A†A − ‖wt − θ?‖2A†A = ‖v(i)t+1 − θ?

(i)‖2 − ‖w(i)t − θ?

(i)‖2, (12)

because vt+1 is equal to wt for coordinates other than i. We now use the strong convexity of V tiat points v

(i)t+1 (its minimizer, by definition) and θ?(i) (i-th coordinate of a minimizer of F ) to

write that V ti (v(i)t+1) + Bt+1pi

2at+1‖v(i)t+1 − θ?

(i)‖2 ≤ V ti (θ?(i)). This is a key step from the proof of Lin et

al. (Lin et al., 2015). Then, expanding the V ti terms yields:

Bt+1pi2at+1

[‖v(i)t+1 − θ?

(i)‖2 + ‖v(i)t+1 − w(i)t +

at+1

Bt+1pi∇ifA(yt)‖2 − ‖θ? − wt +

at+1

Bt+1pi∇ifA(yt)‖2

]≤ ψi

(θ?(i)

)− ψi

(v(i)t+1

).

We can now retrieve Equation (11) by pulling gradient terms out of the squares and using Equa-tion (12). We now evaluate each term of Equation (11). First of all, we use the form of xt+1 andthe fact that wt − vt+1 = eTi (wt − vt+1) (only one coordinate is updated) to show:

E[at+1

pi〈∇ifA(yt), θ

? − vt+1〉A†A]

= at+1E[〈 1

pi∇ifA(yt), θ

? − wt〉A†A]

+At+1E[〈∇ifA(yt),

αtpi

(wt − vt+1)〉A†A]

= at+1〈∇fA(yt), θ? − wt〉A†A +At+1E [〈∇ifA(yt), yt − xt+1〉] ,

where we used that Ri = eTi A†Aei and yt − xt+1 = αtRi

pi(wt − vt+1).

The rest of this proof closely follows the analysis from Hendrikx et al. (Hendrikx et al., 2019),which is an adaptation of Nesterov and Stich (Nesterov and Stich, 2017) to strong convexity onlyon a subspace. The main difference is that it is also necessary to control the function values of ψ,


which is done using Lemma 1. For the first term, we use the strong convexity of f as well as thefact that wt = yt − 1−αt

αt(xt − yt) to obtain:

at+1∇fA(yt)TA†A(θ? − wt) = at+1∇fA(yt)

TA†A

(θ? − yt +

1− αtαt

(xt − yt))

≤ at+1

(fA(θ?)− fA(yt)−

1

2σA‖yt − θ?‖2A†A +

1− αtαt

(fA(xt)− fA(yt))

)≤ at+1fA(θ?)−At+1fA(yt) +AtfA(xt)−

1

2at+1σA‖yt − θ?‖2A†A.

For the second term, we use the fact that xt+1 − yt has support on ei only (just like vt+1 −wt) andthe directional smoothness of fA to obtain:

At+1〈∇ifA(yt), yt − xt+1〉 ≤ At+1

[fA(yt)− fA(xt+1) +

Mi

2‖xt+1 − yt‖2

]≤ At+1 (fA(yt)− fA(xt+1)) +

Bt+1

2

MiRip2i

a2t+1

At+1Bt+1Ri‖eTi (vt+1 − wt)‖2

≤ At+1 (fA(yt)− fA(xt+1)) +Bt+1

2‖vt+1 − wt‖2A†A.

Noting ∆fA(xt) = E [f(xt)]− fA(θ?) and remarking that at+1 = At+1 −At, we obtain, using thatαt = at+1

At+1:

E[at+1

pi〈∇ifA(yt), θ

? − vt+1〉A†A]≤ At∆fA(xt)−At+1∆fA(xt+1) +

Bt+1

2E[‖wt − vt+1‖2A†A

]− at+1σA

2‖yt − θ?‖2A†A.

Using Lemma 1, we derive in the same way:

E[at+1

pi

[ψi

(θ?(i)

)− ψi

(v(i)t+1

)]]= at+1ψ(θ?)−At+1αtψ(vt+1)

≤ At(ψt − ψ(θ?)

)−At+1

(ψt+1 − ψ(θ?)

).

Now, we can multiply Equation (11) by at+1

piand take the expectation over i. The ‖vt+1 − wt‖2A†A

terms cancel and we obtain:

Bt+1

2E[‖vt+1 − θ?‖2A†A

]+At+1∆FA(xt+1) ≤ At∆FA(xt) +

Bt+1

2‖wt − θ?‖2A†A −

at+1σA2‖yt − θ?‖2A†A,

where ∆FA(xt) = ∆fA(xt) +E[ψt

]−ψ(θ?). Convexity of the squared norm yields ‖wt− θ?‖2A†A ≤

(1−βt)‖vt−θ?‖2A†A+βt‖yt−θ?‖2A†A. Now remarking that Bt+1(1−βt) = Bt and at+1σA = Bt+1βt,and summing the inequalities until t = 0, we obtain:

Bt‖vt − θ?‖2A†A + 2At∆FA(xt) ≤ 2A0∆FA(x0) +B0‖v0 − θ?‖2A†A.

We finish the proof by using the fact that ψ(xt) ≤ ψt and ψ(x0) = ψ0 since x0 = v0. �

Now that we have proven Theorem 4, we can proceed to the proof of Lemma 1.

Proof of Lemma 1. This lemma is a generalization of the lemma from APCG with arbitraryprobabilities (instead of uniform ones). It still uses the fact that xt can be written as a convexcombination of (vl)l≤t, but it requires to use a different convex combination for each coordinateof xt, thus crucially exploiting the separability of the proximal term. If coordinate i is such that

ψi = 0, then ψ(i)t+1 ≤ αtψi(v

(i)t+1) + (1−αt)ψ(i)

t is automatically satisfied for any δ(i)t . For coordinates

i such that ψi 6= 0 (and so Ri = 1), we start by expressing xt+1 in terms of xt, vt+1 and vt . Moreprecisely, we write that for any t > 0:

x(i)t+1 = y

(i)t +

αtpi

(v(i)t+1 − w

(i)t ).


Indeed, either coordinate i is updated at time t or v(i)t+1 = w

(i)t so the previous equation always

holds. We can then develop the wt and yt terms to obtain x(i)t+1 only in function of x

(i)t , v

(i)t and

v(i)t+1:

x(i)t+1 =

αtpiv(i)t+1 +

(1− αtβt

pi

)y(i)t −

αt(1− βt)pi

v(i)t

=αtpiv(i)t+1 +

(1− αtβt

pi

)(1− αt)x(i)t + αt(1− βt)v(i)t

1− αtβt− αt(1− βt)

piv(i)t

=αtpiv(i)t+1 + αt(1− βt)

[1− αtβt

pi

1− αtβt− 1

pi

]v(i)t +

(1− αtβt

pi

)(1− αt)1− αtβt

x(i)t

=αtpiv(i)t+1 +

αt(1− βt)1− αtβt

(1− 1

pi

)v(i)t +

(1− αtβt

pi


x(i)t .

At this point, all coefficients sum to 1. Indeed, they all sum to 1 at the first line and we have

expressed w(i)t and then y

(i)t as convex combinations of other terms, thus keeping the value of the

sum unchanged. Yet, pi < 1 so the coefficient on the second term is negative. Fortunately, it is

possible to show that the v(i)t term in the decomposition of x

(i)t is large enough so that the v

(i)t term

in the decomposition of x(i)t+1 is positive. More precisely, we now show by recursion that for t ≥ 0:

x(i)t+1 =

αtpiv(i)t+1 +

t∑l=0

δ(i)t+1(l)v

(i)l , (13)

with δ(i)t+1(l) ≥ 0 for l ≤ t. For t = 0, x0 = v0 and x

(i)1 = α0

piv(i)1 +

(1− α0

pi

)v(i)0 . We now assume

that Equation (13) holds for a given t > 0, and expand δ(i)t+1(t) to show that it is positive. Using

that δ(i)t (t) = αt

pi, we write:

δ(i)t+1(t) =


(1− 1

pi

)+αtpi

(1− αtβt

pi


=αt

1− αtβt

[(1− βt)

(1− 1

pi

)+

(1− αt)pi

(1− αtβt

pi

)]=

αt1− αtβt

[1− βt −

1

pi+βtpi

+1

pi− αtpi− (1− αt)

αtβtp2i

]=

αt1− αtβt

[(1− βt −

αtpi

)+βtpi

(1− (1− αt)

αtpi

)].

We conclude that δ(i)t+1(t) ≥ 0 since 1− βt − αt

pi≥ 0. Note that this condition can be weakened to

1− α2t

p2i≥ 0 when βt = αt or when βt = 0. We also deduce from the form of x

(i)t+1 that for l < t, the

only coefficients on v(i)l in the development of x

(i)t+1 come from the x

(i)t term and so:

δ(i)t+1(l) =

(1− αtβt

pi


δ(i)t (l), (14)

so these coefficients are positive as well. Since they also sum to 1, it implies that x(i)t is a convex

combination of the v(i)l for l ≤ t, and we use the convexity of ψi to write:

ψi(x(i)t ) = ψi

(t∑l=0

δ(i)t (l)v

(i)l

)≤

t∑l=0

δ(i)t (l)ψi(v

(i)l ) = ψ

(i)t .


Now, we can properly express ψ(i)t+1 using the decomposition of x

(i)t+1 in terms of δ

(i)t+1:

E[ψ(i)t+1

]= E

[αtpiψi(v

(i)t+1)

]+αt(1− βt)1− αtβt

(1− 1

pi

)ψi(v

(i)t ) +

(1− αtβt

pi

)1− αt

1− αtβt

t∑l=0

δ(i)t (l)ψi(v

(i)l )

= αtψi(v(i)t+1) + (1− pi)

αtpiψi(w

(i)t ) +


(1− 1

pi

)ψi(v

(i)t ) +

(1− αtβt

pi

)1− αt

1− αtβtψ(i)t

At this point, we use the convexity of ψi to develop ψi(w(i)t ) and then ψi(y

(i)t ) in the following way:

ψi(w(i)t ) ≤ (1− βt)ψi(v(i)t ) + βtψi(y

(i)t )

≤ (1− βt)ψi(v(i)t ) +βt

1− αtβt

[(1− αt)ψi(x(i)t ) + αt(1− βt)ψi(v(i)t )

]=

1− βt1− αtβt

ψi(v(i)t ) +

βt(1− αt)1− αtβt

ψi(x(i)t ).

If we plug these expressions into the development of E[ψ(i)t+1

], the ψi(v

(i)t ) terms cancel and we

obtain:

E[ψ(i)t+1

]≤ αtψi(v(i)t+1) + αt

(1

pi− 1

)βt(1− αt)1− αtβt

ψi(x(i)t ) +

(1− αtβt

pi

)1− αt

1− αtβtψ(i)t

We now use the fact that ψi(x(i)t ) ≤ ψ(i)

t (by convexity of ψi) to get:

E[ψ(i)t+1

]≤ αtψi(v(i)t+1) +

1− αt1− αtβt

[αtβt

(1

pi− 1

)+

(1− αtβt

pi

)]ψ(i)t

≤ αtψi(v(i)t+1) + (1− αt)ψ(i)t

This holds for any coordinate i and so E[ψt+1

]≤ αtψ(vt+1 + (1 − αt)ψt for all t ≥ 0, which

finishes the proof of the lemma. �

A.3. Proof of the corollaries

Now that that we have proven the main result, we show how specific choices of parameters leadto fast algorithms.

Proof of Corollary 1. If σA > 0, then the parameters can be chosen as αt = βt = ρ =√σA

S , withAt = (1− ρ)−t and Bt = σAAt. These expressions can then be plugged into the recursion to verifythat they do satisfy it. This choice is classic and slightly suboptimal for small values of t comparedwith the choice made by Nesterov and Stich (Nesterov and Stich, 2017).

Yet, it remains to prove that αtRi ≤ pi to verify the assumptions of Theorem 4. This assumptionwas directly verified in the case of APCG thanks to the uniform probabilities. We show that thisalso holds in the arbitrary-sampling formulation with strong convexity on a subspace, and thisresult validates our choice of parameters. In particular, we write:

αtRi = ρRi ≤√σAS

Ri ≤√σARiMi

pi.

Then, we take x? such that ∇fA(x?) = 0 and use the smoothness and (A†A)-strong convexity offA to write that for any coordinate i and h > 0:

Mi

2‖hei‖2 ≥ f(x? + hei)− f(x?) ≥ σA

2‖hei‖A†A.

In particular, this means that Mi ≥ σARi, which means that αtRi ≤ pi for all i. �


Proof of Corollary 2. If σA = 0 then we have to choose βt = 0 for all t. Then, we can chooseBt = B0 for a any B0 > 0 and a direct recursion yields:

At+1 = At +B0

2S2

(1 +

√1 + 4S2B−10 At

).

This in particular shows that (At) is an increasing sequence. Coefficients (at) can be computedusing

at+1 = At+1 −At =B0

2S2

(1 +

√1 + 4S2B−10 At

),

and so the sequence (αt) is given by:

αt =at+1

At+1= 1− At

At+1= 1− At

At + B0

2S2

(1 +√

1 + 4S2b−1At) .

We would like to choose A0 = 0 but this would lead to α0 = 1 and would not respect αtRi ≤ pi forall i so we choose A0 > 0. Since (At) is an increasing sequence, (αt) is a decreasing sequence, andin particular it is enough to verify that α0 ≤ pR with pR = mini pi/Ri to have that αtRi ≤ pi forall i and all t. Therefore, we want A0 > 0, such that:

pR ≥ 1− 1

1 + B0

2A0S2

(1 +

√1 + 4S2B−10 A0

) ,which can be rewritten pR

1−pR ≥1X

(1 +√

1 + 2X)

with X = 2S2B−10 A0. If pR ≥ 1 then the

equation is verified for any value of A0 and we could actually even have chosen A0 = 0. Otherwise,we choose X ≥ 6/p2R ≥ 6. Then:

pR1− pR

≥ pR ≥√

6√X

=

√X(√

6−√

2)

X+

1

X

√2X ≥ 2

X+

√2X

X≥ 1

X(1 +

√1 + 2X),

where we have used that√X(√

6−√

2) ≥ 2 for X ≥ 6 and 1 +√

2X ≥√

1 + 2X. In particular,the constraint αtRi ≤ pi is satisfied with the choice:

A0 =3B0

S2p2R.

Note that the constant 3 is not tight but allows for an easy expression which always hold. Since

A0 > 0, a direct recursion yields At ≥ B0t2

4S2 . We call r2t = ‖v0 − θ?A‖2A†A − E[‖vt − θ?A‖2A†A], andFt = E[FA(xt)]− FA(θ?A), then:

Ft ≤1

2At

(B0r

2t + 2A0F0

)=

B0

2At

(r2t +

2

S2p2min

F0

)≤ 2S2

t2

(r2t +

6

S2p2RF0

),

which finishes the proof.�

Appendix B. Algorithm Derivation

B.1. Projection of virtual edges

Theorem A requires that for any coordinate i, either the proximal part ψi = 0 or the coordinateis such that eTi A

†Aei = 1, which is equivalent to having A†Aei = ei. In our case, ψk` = 0 when(k, `) is a communication edge. Lemma 2 is a small result that shows that the projection conditionis satisfied by virtual edges.

Lemma 2. If (k, `) is a virtual edge then Rk` = 1.

Proof. Let x ∈ RE+nm such that Ax = 0. From the definition of A, either x = 0 or the support ofx is a cycle of the graph. Indeed, for any edge (k, `), Aek` has non-zero weights only on nodes k and`. Virtual nodes have degree one, so virtual edges are parts of no cycles and therefore xT ek,` = 0for all virtual edges (k, `). Operator A†A is the projection operator on the orthogonal the kernel ofA, so it is the identity on virtual edges. �


B.2. From edge variables to node variables

Taking the dual formulation implies that variables are associated with edges rather than nodes.Although it could be possible to work with edge variables, it is generally inefficient. Indeed, thealgorithm needs variable Ayt instead of variable yt for the gradient computation so standardmethods work directly with Ayt (Scaman et al., 2017; Hendrikx et al., 2019).

In this section, we call vt, yt and zt the dual variable sequences in RE+nm obtained by applyingAlgorithm 2 on the dual problem of Equation 5. The new update equations can be retrieved bymultiplying each line of Algorithm 2 by A on the left, so that for example vt = Avt. Yet, there isstill a zt+1 term because of the presence of the proximal update. More specifically, we write for thevirtual edge between node i and its j-th virtual node:

vTt+1eij = proxηijψi,j

(zTt+1eij

). (15)

Fortunately, this update only modifies vt+1 when ψi,j 6= 0. This means that zt+1 is only modifiedfor local computation edges. Since local computation nodes only have one neighbour, the form ofA ensures that for any z ∈ Rn(1+m) and virtual edge (k, `) corresponding to node i and its j-thvirtual node, (Az)(i,j) = −µk`zk`. In particular, if node k is the center node i and node ` is thevirtual node (i, j), the proximal update can be rewritten:

(Avt+1)(i,j)

= −µijproxηijψi,j

(− 1

µij(Azt+1)(i,j)

)= −µij arg min

v

1

2ηij‖v −

(− 1

µij(Azt+1)(i,j)

)‖2 + ψi,j(v)

= −µij arg minv

1

2ηijµ2ij

‖ − µijv − (Azt+1)(i,j)‖2 + f∗i,j(−µijv)−µ2ij

2Li,j‖v‖2

= arg minv

1

2ηijµ2ij

‖v − (Azt+1)(i,j)‖2 + f∗i,j(v)− 1

2Li,j‖v‖2

= proxηijµ2ij f∗i,j

((Azt+1)

(i,j)),

where f∗i,j : x→ f∗i,j(x)− 12Li,j‖x‖2. For the center node, the update can be written:

(Avt+1)(i)

= (Azt+1)(i) − µijeTij zt+1 + µijproxηijψi,j

(− 1

µij(Azt+1)

(i,j)

)= (Azt+1)

(i)+ (Azt+1)

(i,j) − proxηijµ2k`f∗i,j

((Azt+1)

(i,j)).

B.3. Primal proximal updates

Moreau identity (Parikh and Boyd, 2014) provides a way to retrieve the proximal operator of

f∗ using the proximal operator of f , but this does not directly apply to f∗i,j , making its proximal

update hard to compute when no analytical formula is available to compute f∗i,j . Fortunately, the

proximal operator of f∗i,j can be retrieved from the proximal operator of f∗i,j . More specifically, if

we denote ηij = ηijµ2ij (it is clear in this section that they refer to the edge between node i and its

virtual node j), then we can also express the update only in terms of f∗i,j :

proxηij f∗i,j

((Azt+1)

(i,j))

= arg minv

1

2ηij‖v − (Azt+1)

(i,j) ‖2 + f∗i,j(v)− 1

2Li,j‖v‖2

= arg minv

1

2

(η−1ij − L

−1i,j

)‖v‖2 − η−1ij v

T (Azt+1)(i,j)

+ f∗i,j(v)

= arg minv

1

2(η−1ij − L

−1i,j

)−1 ‖v − (1− ηijL−1i,j )−1 (Azt+1)(i,j) ‖2 + f∗i,j(v)

= prox(η−1ij −L

−1i,j )−1f∗i,j

((1− ηijL−1i,j

)−1(Azt+1)

(i,j)).

Then, we use the identity:

prox(ηf)∗(x) = ηproxη−1f∗(η−1x

), (16)


and the Moreau identity to write that:

proxηf∗(x) = x− ηproxη−1f

(η−1x

). (17)

This allows us to retrieve the proximal operator on f∗i,j using only the proximal operator on fi,j :(1− ηijL−1i,j

)proxηij f∗i,j

((Azt+1)

(i,j))

= (Azt+1)(i,j) − ηijprox(η−1

ij −L−1i,j )f

(η−1ij (Azt+1)

(i,j)).

(18)Note that the previous calculations are valid as long as ηijL

−1i,j ≤ 1 for all virtual edges. A way

to bound this is to replace by the values of µ2ij and σA to get:

ρ ≤ κi2κpij .

The constraint ρ < minij pij was already enforced by APCG, so this simply gives another constraintthat is generally verified unless nodes have very different local objectives (which should not happenif m is big enough).

B.4. Smooth case

If the functions fi,j are smooth then the functions f∗i,j are strongly convex and so function qAis strongly convex. ADFS can then be obtained by applying Algorithm 2 to Problem (5). Thevalue of S is obtained by remarking that qA is µ2

ij

(Σ−1i + Σ−1j

)smooth in the direction (i, j) and

λ+min

(ATΣ−1A

)strongly convex on the orthogonal of the kernel of A. Lemma 2 guarantees that

either Ri = 1 (virtual edges) or ψi = 0 (communication edges), so we can apply Corollary 1 to get:

BtE[‖vt − θ?A‖2A†A

]+ 2At [E [F ∗A(Axt)]− F ∗A(θ?A)] ≤ C0,

where vt and xt are the dual variables and C0 is the same as in Theorem 1. ADFS works withvariables vt = Avt and xt = Axt instead. Then, we use the fact that for any x, F ∗A(x) = F ∗A(A†Ax)to write that E [F ∗A(xt)] = E

[F ∗A(A†xt)

]. Following Lin et al. (Lin et al., 2015), and noting

q : x 7→ 12x

TΣ−1x the primal optimal point θ? can be retrieved as θ? = ∇q(Aθ?A) = Σ−1Aθ?A, whereθ?A is the optimal dual parameter. Finally,

λmax(ATΣ−2A)−1‖θt − θ?‖2 ≤ λmax(ATΣ−2A)−1‖Σ−1A(vt − θ?A)‖2 ≤ ‖vt − θ?A‖2A†A,

which finishes the proof of Theorem 1. Note that APCG also gives a guarantee in terms of dualfunction values at points xt but we drop it in order to have a simpler statement.

B.5. Non-smooth setting

Extended APCG can be applied to the problem of Equation (5) even if function qA is not stronglyconvex on the orthogonal of the Kernel of A. This is for example the case when the functions fi,j arenot smooth so that Σ−1 has diagonal entries equal to 0 and therefore Ker(ATΣ−1A) 6⊂ Ker(A) soσA = 0. In this case, the choice of coefficients from Corollary 2 leads to Algorithm 3, a formulationof ADFS that provides error guarantees when primal functions fi,j are not smooth. More formally,

if we define F ∗ : x→∑ni=1

[∑mj=1 f

∗i,j

(x(i,j)

)+ 1

2σi‖x(i)‖2

], then:

Theorem 5. If the functions fi,j are non-smooth then NS-ADFS guarantees:

E [F ∗(xt)]− F ∗(θ?) ≤2

t2

[S2

λ+min(ATA)r2t +

6

p2R[F ∗(x0)− F ∗(θ?)]

],

with r2t = ‖v0 − θ?‖2 − ‖vt − θ?‖2.

The guarantees provided by Theorem 5 are weaker than in the smooth setting. In particular, welose linear convergence and get the classical accelerated sublinear O(1/t2) rate. We also lose thebound on the primal parameters— recovering primal guarantees is beyond the scope of this work.

Note that the extra λ+min(ATA) term comes from the fact that Theorem 5 is formulated withprimal parameter sequences xt = Axt. Also note that αt = O

(t−1), and at+1

Bt+1= O (t). The leading

constant governing the convergence rate isλ+min(ATA)

S2 , which is very related to the constant for the


Algorithm 3 NS-ADFS

x0 = 0, v0 = 0, t = 0, pR = mini pi/Ri, A0 = 3B0

S2p2R, ηij =

µ2ij

pij

while t < T doAt+1 = At + 1

2S2

(1 +√

1 + 4S2At)

at+1 = At+1 −At, αt = at+1

At+1

yt = (1− αt)xt + αtvtSample (i, j) with probability pijvt+1 = zt+1 = vt − at+1ηijWijΣ

−1ytif (i, j) is a computation edge then

v(i,j)t+1 = proxat+1ηijf∗i,j

(z(i,j)t+1

)v(i)t+1 = z

(i)t+1 + z

(i,j)t+1 − v

(i,j)t+1

end ifxt+1 = yt +

αtRij

pij(vt+1 − vt)

end whilereturn θt = Σ−1vt

smooth case, simply that the Σ−1 factor is removed. Therefore, we can obtain in the same way

that if we choose µ2ij =

λ+min(L)

1+m when (i, j) is a computation edge then we get:

λ+min(ATA) ≥ λ+min(L)

2(m+ 1).

Optimizing parameter ρ in order to minimize time yields ρcomp = ρcomm again, now leading inthe homogeneous case to choosing:

p∗comm =

(1 +

√γm2

2(1 +m)

)−1.

Appendix C. Average Time per Iteration

C.1. More communications implies more waiting

A fundamental assumption for Theorem 2 is to assume that pcomm < pcomp. In particular, itprevents pcomm from being too high since pcomm + pcomp = 1. Although this assumption seemsquite restrictive in the first place, it is very intuitive to want to avoid pcomm from being too high,especially in the limit of pcomm → 1 and τ arbitrarily small. Consider that one node (say node 0)starts a local update at some point. Communications are very fast compared to computations soit is very likely that the neighbors of node 0 will only perform communication updates, and theywill do so until they have to perform one with node 0. At this point, they will have to wait untilnode 0 finishes its local computation, which can take a long time. Now that the neighbors of node0 are also blocked waiting for the computation to finish, their neighbors will start establishing adependence on them rather quickly. If the probability of computing is small enough and if thecomputing time is large enough, all nodes will sooner or later need to wait for node 0 to finish itslocal update before they can continue with the execution of their part of the schedule. In the end,only node 0 will actually be performing computations while all the others will be waiting.

This phenomenon is not restricted to the limit case presented above and the synchronization costblows up as soon as pcomm > pcomp and τ < 1. In the proof below, the goal is to bound the totalexpected weight

∑ni=1E [Xt(i, w)] for w higher than a given threshold. Local computing operations

will move mass from small values of w to higher values of w. On the other hand, communicationoperations will introduce synchronization between two nodes, thus increasing the total availablemass

∑w≥0

∑ni=1E [Xt(i, w)] (and not just moving it to higher values of w) because it will duplicate

the mass for Xt(i, w) to Xt(j, w) if nodes i and j communicate. This is the technical reason whypcomm < pcomp is needed for this proof.


C.2. Detailed average time per iteration proof

The goal of this section is to prove Theorem 2. The proof is an extension of the proof ofTheorem 2 from Hendrikx et al. (Hendrikx et al., 2019). Similarly, we denote t the number ofiterations that the algorithm performs and τ ijc the random variable denoting the time taken by acommunication on edge (i, j). Similarly, τ il denotes the time taken by a local computation at nodei. Then, we introduce the random variable Xt(i, w) such that if edge (i, j) is activated at time t+ 1(with probability pij), then for all w ∈ N∗:

Xt+1(i, w) = Xt(i, w − τ ijc (t)) +Xt(j, w − τ ijc (t)),

where τ ijc (t) is the realization of τ ijc corresponding to the time taken by activating edge (i, j)at time t. If node i is chosen for a local computation, which happens with probability pcomp

i thenXt+1(i, w + τ il (t)) = Xt(i, w) for all w. Otherwise, Xt+1(j, w) = Xt(j, w) for all w. At time t = 0,X0(i, 0) = 1 and X0(i, w) = 0 for all w. Lemma 3 gives a bound on the probability that the timetaken by the algorithm to complete t iterations is greater than a given value, depending on variablesXt. Note that a Lemma similar to the one by Hendrikx et al. (Hendrikx et al., 2019) holds althoughvariable X has been modified.

Lemma 3. We denote Tmax(t) the time at which the last node of the system finishes iteration t.Then for all ν > 0:

P (Tmax(t) ≥ νt) ≤∑w≥νt

n∑i=1

E[Xt(i, w)

].

Proof. We first prove by induction on t that for any i ∈ {1, .., n}:

Ti(t) = maxw∈N,Xt(i,w)>0

w. (19)

To ease notations, we write wmax(i, t) = maxw∈N,Xt(i,w)>0 w. The property is true for t = 0because Ti(0) = 0 for all i.

We now assume that it is true for some fixed t > 0 and we assume that edge (k, l) has beenactivated at time t. For all i /∈ {k, l}, Ti(t+ 1) = Ti(t) and for all w ∈ N∗, Xt+1(i, w) = Xt(i, w) sothe property is true. Besides, if j 6= l,

wmax(k, t+ 1) = maxw∈N∗,Xt(k,w−τc(t))+Xt(l,w−τkl

c (t))>0w

= maxw∈N,Xt(i,w)+Xt(i,w)>0

w + τklc (t)

= τc(t) + max (wmax(k, t), wmax(l, t))

= τklc (t) + max (Tk(t), Tl(t)) = Tk(t+ 1).

Similarly if k = l (a local computation is performed at iteration t), then wmax(k, t + 1) =τkl (t) +wmax(k, t) = Tk(t) + τkl (t) = Tk(t+ 1). Then, we use the union bound and the the fact thathaving Xt(i, w) > 0 is equivalent to having Xt(i, w) ≥ 1 since Xt(i, w) is integer valued to showthat:

P (Tmax(t) ≥ νt) = P(

maxw,

∑ni=1X

ti (w)>0

w ≥ νt)≤ P

(∪w≥νt

n∑i=1

Xti (w) ≥ 1

)≤∑w≥νt

P

(n∑i=1

Xti (w) ≥ 1

),

so using Markov inequality yields:

P (Tmax(t) ≥ νt) ≤∑w≥νt

n∑i=1

E[Xti (w)

]. (20)

�

Variables Xti are obtained by linear recursions, so Lemma 3 allows us to bound the growth of

variables with a simple recursion formula instead of evaluating a maximum. We write pcompi and

pcommi the probability that node i performs a computation (respectively communication) update at a

given time step, and pi = pcompi +pcomm

i . We introduce pcomp = mini pcompi and pcomp = maxi p

compi

(and the same for communication probabilities).


Lemma 4. For all i, and all ν > 0, if 12 ≥ pcomp = pcomp ≥ pcomm then:

∑w≥(νc+νl)t

n∑i=1

E[Xt(i, w)

]→ 0 when t→∞, (21)

with νc = 6pcτc and νl = 9plτl where pc = 4pcomm and pl = pcomp.

Note that the constants in front of the ν parameters are very loose.

Proof. Taking the expectation over the edges that can be activated gives, with τ ijc (τ) the probabilitythat τ ijc takes value τ (and the same for τl):

E[Xt+1(i, w)

]= (1− pi)E

[Xt(i, w)

]+pcomm

n∑j=1

pij

∞∑τ=0

τ ijc (τ)(E[Xt(i, w − τ)

]+ E

[Xt(j, w − τ)

])+ pcomp

i

∞∑τ=0

τ ijl (τ)E[Xt(i, w − τ)

].

In particular, for all i, E[Xt+1(i, w)

]≤ Xt(w) where X0(w) = 1 if w = 0 and:

Xt+1(w) = (1− p) Xt(w) + 2pcomm

∞∑τ=0

τmaxc (τ)Xt(w − τ) + pcomp

∞∑τ=0

τmaxl (τ)Xt(w − τ). (22)

with τmaxc (τ) = maxij τ

ijc (τ) (and the same for τl). We now introduce φt(z) =

∑w∈N z

wXt(w).We denote φc and φl the generating functions of τmax

c (τ) and τmaxl (τ). A direct recursion leads to:

φt(z) =(1− pcomm − pcomp + pcompφl(z) + 2pcommφc(z)

)t=(φ1(z)

)t. (23)

We denote φbin(p, t) the generating function associated with the binomial law of parameters pand t. With this definition, we have:

φbin(pc, t)(φc(z))φbin(pl, t)(φl(z)) = [(1− pc)(1− pl) + (1− pc)plφl(z) + (1− pl)pcφc(z) + pcplφc(z)φl(z)]t,

(24)so we can define:

φt+(z) = (1 + δ)tφbin(pc, t)(φc(z))φbin(pl, t)(φl(z)), (25)

where pc, pl and δ are such that:

pc1− pc

≥ 2pcomm

1− p,

pl1− pl

=pcomp

1− pand δ ≥ 1− p

(1− pc)(1− pl)− 1.

Since pcomp = pcomp then p ≥ pcomp. Therefore, these conditions are satisfied for pc and pl as

given by Lemma 4 and δ = (1−pc)−1−1. Then (1+δ)(1−pc)(1−pl) ≥ 1−p, (1+δ)(1−pc)pl ≥ pcomp

and (1 + δ)(1− pl)pc ≥ 2pcomm. This means that if we write φ1(z) = a0 + acφc(z) + alφl(z) andφ1+(z) = b0 + bcφc(z) + blφl(z) then b0 ≥ a0, bc ≥ ac and bl ≥ al. In particular, all the coefficientsof φt are smaller than the coefficients of φt+ where both functions are integral series. Therefore, ifwe call Zt the random variables associated with the generating function (1 + δ)−tφt+ then for alli, t, w:

E[Xt(i, w)

]≤ (1 + δ)tP (Zt = w) , (26)

where Zt = Ztc +Ztl = Bin(pc, t)(Zc) +Bin(pl, Zl)(τl) where Zc and Zl are the random variablesmodeling the time of one communication or computation update. We can then use the boundp(Zt ≥ (νc + νl)t) ≤ p(Ztc ≥ νct) + p(Ztl ≥ νlt). This way, we can bound the communication andcomputation costs independently. Then, we write a Chernoff bound, i.e. for any λ > 0:

P(Ztc ≥ νt

)≤ e−λνtE

[eλZ

tc

]= e−λνtE

[eλZc

]t= e−λνt

[1− pc + pc

∞∑τ=0

pc(τ)eλτ

]t,


where Sc is the sum of t i.i.d. random variables drawn from τc. If Zc = τc with probability 1(deterministic delays) then this reduces to:

P(Ztc ≥ νct

)≤ e−λνct

[1− pc + pce

λτc].

Finally, we take νc = kpcτc, λ = 1τc

ln(k) and we use the basic inequality ln(1 + x) ≥ x1+x to

show that:

− ln[P(Ztc ≥ νct

)]≥ t[λνc − pc

(eλτc − 1

)]≥ t(k(ln(k)− 1)− 1)pc. (27)

Using the same log inequality and the fact that pc ≥ 12 yields:

ln (1 + δ) = − ln(1− pc) ≤pc

1− pc≤ 2pc. (28)

Therefore, choosing k = 6 ensures that k(ln(k)− 1)− 1 ≥ 3 and so:

(1 + δ)tP(Ztc ≥ νct

)≤ e−tpc . (29)

We can apply the same reasoning to Ztl , and the bound is still valid with k = 9 becausepl = pcomp ≥ pcomm = pc/4. We finish the proof by using Equation (26). �

Appendix D. Algorithm Performances

ADFS has a linear convergence rate because it results from using generalized APCG. Yet, itis not straightforward to derive hyperparameters that lead to a rate that is fast and that can beeasily interpreted. The goal of this section is to choose such parameters when the functions fi,j aresmooth.

D.1. Smallest eigenvalue of the Laplacian of the augmented graph

The strong convexity of qA on the orthogonal of the kernel of A is equal to σA = λ+min

(ATΣ−1A

),

the smallest non-zero eigenvalue of ATΣ−1A. Indeed, Σ−1 is a diagonal matrix with strictly positiveentries only so Ker(ATΣ−1A) = Ker(A). The goal of this section is to prove that for a meaningfulchoice of µ, the smallest eigenvalue of the Laplacian of the augmented graph is not too smallcompared to the Laplacian of the actual graph. More specifically, we prove the following result:

Lemma 5. If for all virtual edge between a node i and its virtual node j, µij is such that

µ2ij =

λ+min(L)

σκiLi,j and σ, κ are such that for all i, σ ≥ σi and κ ≥ κi then:

λ+min(L) ≥ λ+min(L)

2σκ. (30)

Proof. All non-zero singular values of a matrix MTM are also singular values of the matrix MMT ,and so λ+min(ATΣ−1A) = λ+min

(Σ−1/2AATΣ−1/2

). We denote L = Σ−1/2AATΣ−1/2.

Then, we denote µ2ij the weight of the virtual edge (i, j) and M the diagonal matrix of size nm

which is such that e(i,j)TMe(i,j) = µ2

ij for all virtual nodes. Mn,m is the matrix of size n × nmsuch that (Mn,meij)

(i) = µ2ij for all virtual edges (i, j) and all other entries are equal to 0. Finally,

S is the diagonal matrix of size n such that Si =∑mj=1 µ

2ij . All communication nodes are linked by

the true graph, whereas all virtual nodes are linked to their corresponding communication node.Then, if we denote L the Laplacian matrix of the original true graph, the rescaled Laplacian matrixof the augmented graph writes:

L = Σ−1/2(L+ S −Mn,m

−MTn,m M

)Σ−1/2. (31)

Therefore, if we split Σ into two diagonal blocks D1 (for the communication nodes) and D2 (forthe computation nodes) and apply the block determinant formula, we obtain:

det(D− 1

21 ATAD

− 12

1 − λId) = det(D−12 M − λId)

× det(D−1/21 LD

−1/21 +D−11 S − λId−

D− 1

21 Mn,mD

− 12

2

(D−12 M − λId

)−1D− 1

22 MT

n,mD− 1

21 ).


Then, we choose M such that D−12 M = diag(α1, ..., αn), meaning that µ2ij = αiLi,j . With

this choice, D− 1

21 Mn,mD

− 12

2

(D−12 M − λId

)−1D− 1

22 MT

n,mD− 1

21 is a diagonal matrix where the i-th

coefficient is equal to

1

σi

m∑j=1

µ4ij

1

µ2ij − Li,jλ

= κiα2i

αi − λ, (32)

where Si =∑mj=1 Li,j and κi = 1 + Si

σi. On the other hand, D−11 S is also a diagonal matrix

where the i-th entry is equal to ακi. Therefore, the solutions of det(L− λId) = 0 are λ = αi andthe solutions of:

det(D−1/21 LD

−1/21 −∆λ) = 0 (33)

with ∆λ a diagonal matrix such that (∆λ)i,i =(

1σi

∑mj=1 µ

4ij

(1

µ2ij−Li,jλ

− 1)

+ λ)

. All the

entries of ∆λ grow with λ, meaning that the smallest solution λ∗ of Equation (33) is lower boundedby the smallest solution of:

det

(λ+min(L)

σId−∆λ

). (34)

If αi > 0 and we choose λ 6= αi, then the other singular values of L are lower bounded by theminimum over all i of the solution of:

ν −

1

σi

m∑j=1

µ4ij

(1

µ2ij − Li,jλ

− 1

)+ λ

= 0, (35)

where ν =λ+min(L)

σ which, with our choice of µij gives:

ν −(αiSiσi

(αi

αi − λ− 1

)+ λ

)= 0, (36)

that can be rewritten:λ2 − λ (ν + αiκi) + αiν = 0. (37)

Therefore, noting λ∗i the smallest solution of this system for a given i, we get:

λ∗i ≥1

2

(αiκi + ν −

√(ν + αiκi)

2 − 4ναi

). (38)

In particular, we choose αi = νκi

and use that√

1− x ≤ 1− x2 to show:

λ∗i ≥ ν(

1−√

1− κ−1i)≥ ν

2κi. (39)

The other eigenvalues are given by the values that zero out diagonal terms of the lower rightcorner. These are the solutions of µ2

ij = Li,jλ, yielding λ = αi ≥ λ∗i . Therefore, λ+min(L) ≥ mini λ∗i ,

which finishes the proof.�

D.2. Eigengap of the augmented graph

This section aims as justifying the γ notation. Recall that it is defined such that γ =

min(k,`)∈Ecommλ+min(L)n

2

µ2kè

TkÀ†AekÈ2 . We show in this section that for any given family of regular graphs,

there exists a constant Cγ independent of the size of the graph such that Cγ γ ≥ γ. Matrix Adepends on µ, and we consider in this section that µ2

k` = µ0 for all communication edges (k, `).Similar results can be obtained when µ is heterogeneous.

Regular graphs. We say that a family of graph is regular if there exists Cγ > 0 such thateTkÀ

†Aek` ≤ Cγ nE for any n > 2.Recall that E is the number of edges (usually constrained by the graph family and the number

of nodes), and eTkÀ†Aek` is the effective resistance of edge (k, `). This assumption seems a bit

technical but it simply requires that all edges contribute equally to the connectivity of the graph,


and therefore is related to how symmetric the graph is. In particular, it is verified with Cγ = 1 forany completely symmetric graph, such as the complete graph or the ring. Since eTkÀ

†Aek` ≤ 1, it isalso satisfied any time the ratio n/E is bounded below, and in particular for the grid, the hypercube,or any graph with bounded degree. Under these assumptions, and for any communication edge(k, `):

λ+min(L)n2

µ2kè

TkÀ†AekÈ2

≥ γ

Cγ

λmax(L)n

µ2kÈ

≥ γ

Cγ

Trace(L)

µ20E

= 2γ

Cγ.

Here, we used the fact that Trace(L) = 2µ20E, which can be deduced directly from the form of A

(each edge has weight µ20 and contributes two times, one for each end). We conclude by using the

fact that since the previous inequalities are true for any (k, `) ∈ Ecomm, it is in particular true for γ.

D.3. Communication rate and local rate

We know that the rate of ADFS can be written as the minimum of a given quantity over alledges of the graph. This quantity will be very different whether we consider communication edgesor virtual edges. In this section, we give lower bounds for each type of edge, and show that we cantrade one for the other by adjusting the probability of communication.

Lemma 6. With the choice of parameters of Theorem 3, parameter ρ satisfies:

ρ ≥ 1√2n

min

(pcomm∆p

√γ

2κ, pcomp

√rκ

Scomp

). (40)

Proof. Recall that the rate ρ is defined as:

ρ2 = mink`

p2k`µ2kè

TkÀ†Aek`

λ+min(L)

σ−1k + σ−1`. (41)

Therefore, for communication edges the rate writes:

ρ2comm ≥ min(k,`)∈Ecomm

(1

σk+

1

σ`

)−1p2k`

µ2kè

TkÀ†Aek`

λ+min (L)

2σκ. (42)

If we take σk = σ for all k and p2k` ≥ ∆2pp

2comm/|E|2 (corresponding to a homogeneous case), then

we can make γ appear to obtain:

ρ2comm ≥ ∆2p

γ

κ

p2comm

4n2. (43)

For “computation edges”, we can write:

ρ2comp ≥ minij

p2ij

2(σ−1i + L−1i,j

) σκi

λ+min (L)Li,j

λ+min (L)

σκ, (44)

because eTijA†Aeij = 1 when (i, j) is a virtual edge (because it is part of no cycle). Since

Scomp = 1n

∑ni=1

∑mj=1

√1 + Li,jσ

−1i , this can be rewritten:

ρ2comp ≥rκ2

p2comp

n2S2comp

. (45)

�

D.4. Execution time

Now that we have specified the rate of ADFS (improvement per iteration), we can bound thetime needed to reach precision ε by plugging in the expected time to execute the schedule. Inparticular, we show in this section Theorem 6, which is a more precise version of Theorem 3.

We introduce ∆p, rκ and cτ to quantify how heterogeneous the system is. More specifically,

we can define σ = maxi σi, κi = 1 + σ−1i∑mj=1 Li,j and κs = maxi κi. Since they are not all

equal, we introduce rκ = mini κi/κs. We choose the probabilities of virtual edges, such that∑mj=1 pij is constant for all i and such that pij = pcomp(1 + Li,jσ

−1i )

12 /(nScomp) for Scomp =

n−1∑ni=1

∑mj=1(1 + Li,jσ

−1i )

12 . When (k, `) is a communication edge, we further assume that

pk` ≥ ∆ppcomm/|E| for some constant ∆p ≤ 1 and pmaxcomm ≤ cτpcomm for some cτ > 0.


Theorem 6. We choose µ2k` = 1

2 for communication edges, µ2ij =

λ+min(L)

σκiLi,j for computation edges

and pcomm = min(

12 ,(

1 +√

γκmin

Scomp

)−1 ). Then, running Algorithm 1 for K = ρ−1 log

(ε−1)

iterations guarantees E[‖θK − θ?‖2

]≤ C0ε, and takes time T (K), with T (K) bounded by:

T (K) ≤ 2C

(m+

√mκs√

2rκ+

(1 + 4cττ)

∆p

√κsγ

)log

(1

ε

)with probability tending to 1 as ρ−1 log

(ε−1)→∞, where C is the same as in Theorem 2.

In heterogeneous settings, σi and sampling probabilities may be adapted to recover goodguarantees, but this is beyond the scope of this paper. Note that taking computing probabilitiesexactly equal for all nodes is not necessary to ensure convergence, and only slightly slows downconvergence. Indeed, it is always possible to analyze a schedule for which all nodes have exactlythe same probability of local update by adding a probability of doing nothing for time 1 as a localupdate to the nodes that are chosen less frequently. If we denote pwait the probability that we needto add so that all nodes have the same probability of being selected, then pcomp + pcomm = 1− pwait

so θcomp will be slightly smaller for a given pcomm. The actual algorithm can only be faster so thisjust gives a rough upper bound on the time to convergence.

Proof. Using Theorem 2 on the average time per iteration, we know that as long as pcomp > pcomm,the execution time of the algorithm verifies the following bound for some C > 0 with high probability:

T (K) ≤ C

n(pcomp + 2τpmax

comm)K (46)

Algorithm 1 requires − log(1/ε)/ log(1− ρ) iterations to reach error ε. Using that log(1 + x) ≤ xfor any x > −1, we get that using Kε = log(1/ε)ρ−1 instead also guarantees to make error lessthan ε. We now optimize the bound in ρ:

T (Kε)

log (ε−1)≤ C

nρ(pcomp + 2τpmax

comm) (47)

If we rewrite this in terms of ρcomm and ρcomp, we obtain:

T (Kε)

log (ε−1)≤ C max (T1(pcomm), T2(pcomm)) (48)

with

T1(pcomm) =1

nρcomm(pcomp + 2cττpcomm) =

2

∆p

(2cττ − 1 +

1

pcomm

)√κ

γ(49)

and

T2(pcomm) = Scomp

√2

rκ

(1 + (2cττ − 1)pcomm

1− pcomm

)= Scomp

√2

rκ

(1 + 2τ

pcomm

1− pcomm

)(50)

T1 is a continuous decreasing function of pcomm with T1 →∞ when pcomm → 0. Similarly, T2is a continuous increasing function of pcomm such that pcomm → ∞ when pcomm → 1. Therefore,the best upper bound on the execution time is given by taking pcomm = p∗ where p∗ is such thatT1(p∗) = T2(p∗) and so ρcomm(p∗) = ρcomp(p∗).

T (Kε)

log (ε−1)≤ CT1(p∗) (51)

Then, p∗ can be found by finding the root in ]0, 1[ of a second degree polynomial. In particular,p∗ is the solution of:

p2comp = p2comm

γ∆2p

2κrκS2comp = (1− pcomm)2 (52)

which leads to p∗ =(

1 +√

γ2κmin

∆pScomp

)−1.


T (Kε)

log (ε−1)≤ 2

C

∆p

(2cττ − 1 +

1

p∗

)√κ

γ

≤ 2C

(2τ

cτ∆p

√κ

γ+

1√2rκ

Scomp

)Finally, we use the concavity of the square root to show that:

Scomp =1

n

n∑i=1

m∑j=1

√1 + Li,jσ

−1i

≤ 1

n

n∑i=1

m

√√√√ m∑j=1

1

m

(1 + Li,jσ

−1i

)≤ 1

n

n∑i=1

m

√1 +

1

m(κi − 1)

≤ m+√mκ

Yet, this analysis only works as long as p∗ ≤ 1/2. When this constraint is not respected, weknow that: γS2

comp ≤ 2κrκ. In this case, we can simply choose pcomp = pcomm = 12 and then

ρcomm ≤ ρcomp, so

T (Kε)

log (ε−1)≤ CT1

(1

2

)= 2

C

∆p(1 + 2cττ)

√κ

γ(53)

The sum of the two bounds is a valid upper bound in all situations, which finishes the proof. �

Appendix E. Experimental setting

E.1. Experimental Setting

We detail in this section the exact experimental setting in which simulations were made. Allalgorithms used out-of-the-box parameters given by theory. Batch algorithms as well as ESDACDwere given the exact κb. The datasets we used are the first million samples of the Higgs dataset (11million samples and 28 attributes) and the Covtype.binary.scale dataset (581,012 samples and 54attributes). Both datasets are available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html. To obtain the local dataset Xi ∈ Rm×d of each node, we drew m samplesat random from the base dataset, so that datasets of different nodes may overlap. We used thelogistic loss with quadratic regularization, meaning that the function at node i is:

fi : θi 7→m∑j=1

log(1 + exp(−li,jXT

i,jθi))

+σi2‖θi‖2,

where li,j ∈ {−1, 1} is the label associated with Xi,j , the k-th sample of node i. We chose m = 104

and σ = 1 for all simulations. Note that local functions are not normalized (not divided by m) so thisactually corresponds to a regularization value of σi = 10−4 with usual formulations. Computationdelays were chosen constant equal to 1 and communication delays constant equal to 5.

As said in the main text, plots are shown for idealized times in order to abstract implementationdetails as well as ensure that reported timings were not impacted by the cluster status (availablebandwidth for example). Note that for ADFS, nodes perform the schedule described in Section 4and are considered free to start the next iteration as soon as they send their a gradient as long asthey already received the neighbor’s gradient (non-blocking send). Note that although Algorithm 1returns vector Σ−1vt to compute the error, we used the vector Σ−1yt instead. Both have similarasymptotic convergence rates but the error was more stable using Σ−1yt. The error that we plot isthe average error over all nodes at a given time. More specifically, all nodes compute the error atspecific iteration number as F (Σ−1yt). Then, we average all these errors and the time reported isthe time at which the last node finishes this iteration.

Similarly to Table 1, we assume that computing the dual gradient of a function fi is as long ascomputing m proximal operators of fi,j functions. This greatly benefits to MSDA and ESDACD

https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html

https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html


(a) Higgs, τ = 5 (b) Covtype, τ = 5 (c) Higgs, τ = 50

Figure 4. Simulations on the logistic regression task with m = 104 points per node,regularization parameter σ = 1 on grid networks of size 100.

since in the case of logistic regression, the proximal operator for one sample has no analytic solutionbut can be efficiently computed as the result of a one-dimensional optimization problem (Shalev-Shwartz and Zhang, 2013). The inner problem corresponding to computing ∇f∗i was solved byperforming 500 steps of accelerated gradient descent. For Point-SAGA, ADFS and DSBA, 1Dprox were computed using 5 steps of Newton’s method (in one dimension). Both used warm-starts,i.e. the initial parameter for these inner problems was the solution for the last time the problemwas solved. The step-size α of DSBA was chosen as 1/(4Lmax) instead of 1/(24Lmax) whereLmax = maxi,j Li,j . DSBA was unstable for larger values of α.

E.2. Centralized Algorithms

In this section, we perform a quick comparison with the centralized algorithm Katyusha (Allen-Zhu, 2017). We assume that the allreduce communication steps take time ∆ where ∆ is the diameterof the graph. We implement the mini-batch version of this algorithm and set the mini-batch sizeso that b = 1 + ∆τ , i.e., the algorithm spends as much time computing as communicating (notcounting the full gradient steps). Counting computation time in terms of effective passes over thedataset is slightly unfair to Katyusha that has a cheaper per-example cost. Yet, this is only a(small) constant factor in the case of logistic regression.

We observe on Figures 4a and 4b that Katyusha and ADFS have comparable rates when τ = 5.This is mainly due to the fact that communications are quite fast so the effective mini-batch size is9100 in this case (diameter of the graph is 18 so 91 per node), which is quite small compared to the106 total samples. Figure 4c shows that ADFS can outperform Katyusha on the Higgs dataset (onwhich it was slower when taking τ = 5) when delays are big (τ = 50). Indeed, the effective batchsize in this case is 91000, which is about 10% of the dataset and so Katyusha does not take fulladvantage of the stochastic optimization speedup. Yet, it is still significantly faster than MSDA.Note that in the case of τ = 50, we set pcomm such that τpcomm = pcomp. This choice slightly reducedthe number communications and led to a faster algorithm by reducing communication time andsynchronization barriers. Overall, we see that, contrary to existing decentralized methods, ADFScan be competitive with a distributed implementation of Katyusha, especially when delays are high.Yet, a more detailed study reporting actual computing times with fully optimized implementationswould be needed to compare the algorithms further. Indeed, some simulation choices favoredADFS (normalized time, neglecting overhead induced by the prox), whereas other favored Katyusha(constant delays, homogeneous setting).

More fundamentally, Katyusha and ADFS are based on two distinct distribution paradigms.On the one hand, centralized algorithms use less noisy gradients because they have an effectivemini-batch size of at least n. This grants them linear speedup given that the batch size is smallenough. Yet, the batch size usually has to be quite high because it needs to grow linearly withthe communication time and the diameter of the graph in order to avoid spending more timecommunicating than computing so centralized approaches are not necessarily the best option onhigh-latency networks. On the other hand, decentralized algorithms such as ADFS can workwith very small batches but they do not get the mini-batch noise reduction from computing on


n nodes in parallel the way Katyusha does. Indeed, similarly to “Local-SGD” Lin et al. (2018);Patel and Dieuleveut (2019) approaches, each node locally runs an accelerated variance-reducedalgorithm. This confirms that decentralized algorithms, and in particular ADFS, can be well-suitedfor distributed stochastic optimization with delays.

E.3. Code

A Python implementation of ADFS is given in supplementary material. The goal of this code isto show how to implement ADFS and encourage its use as a baseline. The code implements ADFSto solve the Logistic Regression problem on a 2D grid. It generates a synthetic binary classificationdataset. Our implementation leverages the fact that Logistic Regression is a linear model to onlystore 2 scalars per virtual node instead of 2 full vectors, thus showing how to use sparse updates.The code is not optimized and not intended to be particularly fast, but rather to show how to gofrom the pseudo-code in Algorithm 1 to an actual implementation.

an accelerated decentralized stochastic proximal algorithm ... · msda (scaman et al.,2017) global...

Documents