
Sections: Introduction and Motivation · Distributed Gradient Descent Variant · Communication Computation Decoupled DGD Variants · Conclusions & Future Work

Balancing Computation and Communication inDistributed Optimization

Ermin Wei

Department of Electrical Engineering and Computer Science, Northwestern University

DIMACS, Rutgers University

Aug 23, 2017

Balancing Computation & Communication 1/38


Collaborators

Albert S. Berahas, Raghu Bollapragada, Nitish Shirish Keskar


Overview

1 Introduction and Motivation

2 Distributed Gradient Descent Variant

3 Communication Computation Decoupled DGD Variants

4 Conclusions & Future Work


Problem Formulation

min_{x ∈ ℝ^p}  f(x) = ∑_{i=1}^n f_i(x)

Applications: Sensor Networks, Robotic Teams, Machine Learning.

- Parameter estimation in sensor networks (bottleneck: communication)
- Multi-agent cooperative control and coordination (bottleneck: battery)
- Large-scale computation (bottleneck: computation)
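The sum structure above is what makes the problem distributable: each node only ever needs its own f_i and ∇f_i. A minimal sketch with synthetic quadratic local objectives (all data below is made up for illustration, not from the talk):

```python
import numpy as np

# Synthetic instance of f(x) = sum_i f_i(x) with quadratic local objectives
# f_i(x) = 0.5 x^T A_i x + b_i^T x  (illustrative data only).
rng = np.random.default_rng(0)
n, p = 5, 3
A = []
for _ in range(n):
    M = rng.standard_normal((p, p))
    A.append(M @ M.T + np.eye(p))            # symmetric positive definite A_i
b = [rng.standard_normal(p) for _ in range(n)]

def grad_fi(i, x):
    return A[i] @ x + b[i]                   # ∇f_i(x) = A_i x + b_i

def grad_f(x):
    return sum(grad_fi(i, x) for i in range(n))  # ∇f(x) = Σ_i ∇f_i(x)

# The minimizer of the aggregate quadratic solves (Σ A_i) x* = -Σ b_i.
x_star = -np.linalg.solve(sum(A), sum(b))
```

Only the aggregate gradient ∑_i ∇f_i needs to vanish at the solution; no single node's gradient has to.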


Algorithm Evaluation

Typical numerical results (measured in iterations, time, or communication rounds).

[Figure: Error = |F − F*|/F* vs. number of iterations, comparing "Existing A", "Existing B", and "Our method".]

Evaluation framework should reflect features of different applications.


Problem Formulation

min_{x ∈ ℝ^p}  f(x) = ∑_{i=1}^n f_i(x)

Applications: Sensor Networks, Robotic Teams, Machine Learning.

Distributed Setting: Consensus Optimization problem

min_{x_i ∈ ℝ^p}  f(x) = ∑_{i=1}^n f_i(x_i)    s.t.  x_i = x_j,  ∀ i, j ∈ N_i

- each node i has a local copy x_i of the parameter vector
- at optimality, consensus is achieved among all the nodes in the network


Consensus Optimization Problem

min_{x_i ∈ ℝ^p}  f(x) = ∑_{i=1}^n f_i(x_i)    s.t.  x_i = x_j,  ∀ i, j ∈ N_i

- x is the concatenation of all the local x_i's
- W is a doubly-stochastic matrix that defines the connections in the network

x = [x_1; x_2; …; x_n] ∈ ℝ^{np},   W = [w_{ij}]_{i,j=1}^n ∈ ℝ^{n×n},   Z = W ⊗ I_p ∈ ℝ^{np×np}
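Concretely, Z = W ⊗ I_p can be built with a Kronecker product, and consensus vectors (all x_i equal) are fixed points of Z. A small sketch using the 4-node cycle mixing matrix that appears later in the talk:

```python
import numpy as np

# Mixing matrix of a 4-node cycle (each node averages itself and its two neighbors).
W = np.array([[1/2, 1/4, 1/4, 0.0],
              [1/4, 1/2, 0.0, 1/4],
              [1/4, 0.0, 1/2, 1/4],
              [0.0, 1/4, 1/4, 1/2]])
p = 2
Z = np.kron(W, np.eye(p))       # Z = W ⊗ I_p acts on the stacked vector x ∈ R^{np}

# W is symmetric and doubly stochastic: rows and columns both sum to one.
assert np.allclose(W, W.T)
assert np.allclose(W.sum(axis=0), 1.0) and np.allclose(W.sum(axis=1), 1.0)

# If every node holds the same vector, Zx = x: consensus points satisfy Zx = x.
x = np.tile(np.array([3.0, -1.0]), 4)   # x_1 = x_2 = x_3 = x_4 = (3, -1)
assert np.allclose(Z @ x, x)
```

For a connected graph, the fixed points of Z are exactly the consensus vectors, which is why the pairwise constraints can be replaced by Zx = x.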

[Figure: an example network of 9 nodes; node i holds a local variable x_i and a local objective f_i.]

Compactly, using Z, the consensus constraint becomes

min_{x_i ∈ ℝ^p}  f(x) = ∑_{i=1}^n f_i(x_i)    s.t.  Zx = x


Literature Review

1. Sublinearly Converging Methods: DGD [Tsitsiklis and Bertsekas, 1989; Nedic and Ozdaglar, 2009; Sundhar Ram et al., 2010; Tsianos and Rabbat, 2012; Yuan et al., 2015; Zeng and Yin, 2016], ...

2. Linearly and Superlinearly Converging Methods: EXTRA [Shi et al., 2015], DIGing [Nedic et al., 2017], NEXT [Lorenzo and Scutari, 2015], Aug-DGM [Xu et al., 2015], NN-EXTRA [Mokhtari et al., 2016], [Qu and Li, 2017], DQN [Eisen et al., 2017], NN [Mokhtari et al., 2014, 2015], ...

3. Communication Efficient Methods: [Chen and Ozdaglar, 2012], [Shamir et al., 2014], [Chow et al., 2016], [Lan et al., 2017], [Tsianos et al., 2012], [Zhang and Lin, 2015], ...

4. Asynchronous Methods: [Tsitsiklis and Bertsekas, 1989], [Tsitsiklis et al., 1986], [Sundhar Ram et al., 2009], [Wei and Ozdaglar, 2013], [Mansoori and Wei, 2017], [Zhang and Kwok, 2014], [Wu et al., 2017], ...

Cost = #Communications × c_c + #Computations × c_g

where c_c is the cost of one communication round and c_g the cost of one gradient computation.


Goal of the Project

- Develop an algorithmic framework, independent of the underlying method, to balance computation and communication in distributed optimization
- Prove convergence for methods that use the framework
- Show that the framework can be applied to many consensus optimization problems (with different communication and computation costs)
- Illustrate empirically that methods utilizing the framework outperform their base algorithms for specific applications


This talk

First stage of the project

Multiple consensus in DGD (theoretically and in practice)

Design a flexible first-order algorithm that decouples the two operations

Investigate the method theoretically and empirically

By-product: variants of DGD with exact convergence

Not in this talk (ongoing work)

Multiple gradients

Extend framework to different algorithms (e.g., EXTRA, NN) or asynchronous methods


Distributed Gradient Descent (DGD)

DGD [Tsitsiklis and Bertsekas, 1989; Nedic and Ozdaglar, 2009; Sundhar Ram et al., 2010; Tsianos and Rabbat, 2012; Yuan et al., 2015; Zeng and Yin, 2016]

x_{i,k+1} = ∑_{j ∈ N_i ∪ {i}} w_{ij} x_{j,k} − α ∇f_i(x_{i,k}),   i = 1, …, n

or, in stacked form:  x_{k+1} = Z x_k − α ∇f(x_k)

where

x = [x_1; x_2; …; x_n] ∈ ℝ^{np},   ∇f(x_k) = [∇f_1(x_{1,k}); ∇f_2(x_{2,k}); …; ∇f_n(x_{n,k})] ∈ ℝ^{np},   Z = W ⊗ I_p ∈ ℝ^{np×np}


- Diminishing α: sub-linear convergence to the solution
- Constant α: linear convergence to an O(α) neighborhood of the solution
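A minimal DGD sketch on synthetic quadratics illustrates the constant-step behavior: the iterates contract quickly but stall in a small neighborhood of x⋆. All data and parameter choices below are illustrative, not the talk's experiment:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 4, 2
A = []
for _ in range(n):
    M = rng.standard_normal((p, p))
    A.append(M @ M.T + np.eye(p))                  # strongly convex local quadratics
b = [rng.standard_normal(p) for _ in range(n)]

W = np.array([[1/2, 1/4, 1/4, 0.0],                # 4-node cycle mixing matrix
              [1/4, 1/2, 0.0, 1/4],
              [1/4, 0.0, 1/2, 1/4],
              [0.0, 1/4, 1/4, 1/2]])

def dgd(alpha, iters):
    x = np.zeros((n, p))                           # row i holds node i's copy x_i
    for _ in range(iters):
        grads = np.stack([A[i] @ x[i] + b[i] for i in range(n)])
        x = W @ x - alpha * grads                  # x_{k+1} = Z x_k - α ∇f(x_k)
    return x

x_star = -np.linalg.solve(sum(A), sum(b))          # minimizer of f = Σ f_i
x = dgd(alpha=0.005, iters=5000)
err = np.linalg.norm(x.mean(axis=0) - x_star)      # distance of the average iterate
```

Rerunning with a smaller α shrinks err further, consistent with the O(α) neighborhood; a diminishing step length drives it to zero at a sub-linear rate.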


DGD – Questions

DGD

x_{k+1} = Z x_k − α ∇f(x_k)

Is it necessary to do one consensus step and one optimization (gradient) step?

If not, what is the interpretation of methods that do more consensus/optimization steps?

What convergence guarantees can be proven for such methods?

How do these variants perform in practice?

DGD^t

x_{k+1} = Z^t x_k − α ∇f(x_k),   Z^t = W^t ⊗ I_p


DGD – Assumptions & Definitions

Assumptions

1. Each component function f_i is strongly convex with parameter μ_i > 0 and has L_i-Lipschitz continuous gradients
2. The mixing matrix W is symmetric and doubly stochastic with β < 1 (β is the second-largest eigenvalue)

Definitions

x̄_k = (1/n) ∑_{i=1}^n x_{i,k},   ∇f(x_k) = ∑_{i=1}^n ∇f_i(x_{i,k}),   ∇f(x̄_k) = ∑_{i=1}^n ∇f_i(x̄_k)


DGD – Theory – Bounded distance to minimum

Theorem (Bounded distance to minimum) [Yuan et al., 2015]

Suppose Assumptions 1 & 2 hold, and let the step length satisfy

α ≤ min{ (1 + λ_n(W))/L_f ,  1/(μ_f + L_f) }

where μ_f is the strong convexity parameter of f and L_f is the Lipschitz constant of the gradient of f. Then, for all k = 0, 1, ...

‖x̄_{k+1} − x⋆‖² ≤ c₁² ‖x̄_k − x⋆‖² + c₃²/(1 − β)²

where

c₁² = 1 − αc₂ + αδ − α²δc₂,   c₂ = μ_f L_f/(μ_f + L_f),   c₃² = α³(α + δ⁻¹) L² D²,   D = √(2L(∑_{i=1}^n f_i(0) − f⋆)),

x⋆ = argmin_x f(x), and δ > 0.


DGDt – Theory – Bounded distance to minimum

Theorem (Bounded distance to minimum) [ASB, RB, NSK and EW, 2017]

Suppose Assumptions 1 & 2 hold, and let the step length satisfy

α ≤ min{ (1 + λ_n(W^t))/L_f ,  1/(μ_f + L_f) }

where μ_f is the strong convexity parameter of f and L_f is the Lipschitz constant of the gradient of f. Then, for all k = 0, 1, ...

‖x̄_{k+1} − x⋆‖² ≤ c₁² ‖x̄_k − x⋆‖² + c₃²/(1 − β^t)²

where

c₁² = 1 − αc₂ + αδ − α²δc₂,   c₂ = μ_f L_f/(μ_f + L_f),   c₃² = α³(α + δ⁻¹) L² D²,   D = √(2L(∑_{i=1}^n f_i(0) − f⋆)),

x⋆ = argmin_x f(x), and δ > 0.


DGD – Theory – Comments

- Can show a similar error neighborhood for ‖x_{i,k} − x⋆‖
- Theoretical results are similar in nature to DGD (constant α): linear convergence to a neighborhood of the solution
- Improved neighborhood: O(1/(1 − β)²) vs. O(1/(1 − β^t)²)
- But the neighborhood cannot be eliminated by increased communication
- Drawback: requires extra communication
- Effectively, DGD^t is DGD with a different underlying graph (different weights in W)

Example: two four-node graphs and powers of their mixing matrices.

Cycle graph (edges 1–2, 1–3, 2–4, 3–4):

W = [ 1/2  1/4  1/4  0
      1/4  1/2  0    1/4
      1/4  0    1/2  1/4
      0    1/4  1/4  1/2 ],

W² = [ 3/8  1/4  1/4  1/8
       1/4  3/8  1/8  1/4
       1/4  1/8  3/8  1/4
       1/8  1/4  1/4  3/8 ],

W¹⁰ ≈ all entries 1/4 (the fully mixed matrix)

Path graph (1–2–3–4):

W = [ 2/3  1/3  0    0
      1/3  1/3  1/3  0
      0    1/3  1/3  1/3
      0    0    1/3  2/3 ],

W² = [ 5/9  1/3  1/9  0
       1/3  1/3  2/9  1/9
       1/9  2/9  1/3  1/3
       0    1/9  1/3  5/9 ],

W³ ≈ [ 0.48  0.33  0.15  0.04
       0.33  0.30  0.22  0.15
       0.15  0.22  0.30  0.33
       0.04  0.15  0.33  0.48 ]
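These powers are easy to verify numerically: W^t is itself a doubly stochastic mixing matrix whose fill-in reflects t-hop neighborhoods, and it approaches the all-(1/n) matrix as t grows. A quick check for the cycle-graph example:

```python
import numpy as np

W = np.array([[1/2, 1/4, 1/4, 0.0],      # the 4-node cycle example
              [1/4, 1/2, 0.0, 1/4],
              [1/4, 0.0, 1/2, 1/4],
              [0.0, 1/4, 1/4, 1/2]])

W2 = np.linalg.matrix_power(W, 2)
W10 = np.linalg.matrix_power(W, 10)

# W^2 matches the matrix shown above ...
expected_W2 = np.array([[3/8, 1/4, 1/4, 1/8],
                        [1/4, 3/8, 1/8, 1/4],
                        [1/4, 1/8, 3/8, 1/4],
                        [1/8, 1/4, 1/4, 3/8]])
assert np.allclose(W2, expected_W2)

# ... and W^10 is within ~1e-3 of the all-1/4 matrix, since the deviation
# decays like beta^t (here beta = 1/2, so (1/2)^10 ≈ 1e-3).
assert np.allclose(W10, np.full((4, 4), 0.25), atol=1e-3)

# Every power is still symmetric and doubly stochastic, hence a valid mixing matrix.
assert np.allclose(W10.sum(axis=1), 1.0)
```

This is exactly the sense in which DGD^t is DGD on a denser, better-connected weighted graph.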

DGD(t) – Numerical Results

Problem: Quadratic

f(x) = (1/2) ∑_{i=1}^n xᵀA_i x + b_iᵀx

- each node i ∈ {1, …, n} has local data A_i ∈ ℝ^{n_i×p} and b_i ∈ ℝ^{n_i}
- Parameters: n = 10, p = 10, n_i = 10, κ = 10²
- Methods: DGD(1,1), DGD(1,2), DGD(1,5), DGD(1,10)
- Graph: 4-cyclic graph, w_ii = 1/5 and w_ij = 1/5 if j ∈ N_i, 0 otherwise
- Shows the effect of multiple consensus steps per gradient step

[Figure: Relative error vs. iterations (left) and vs. cost (right) for DGD(1,1), DGD(1,2), DGD(1,5), DGD(1,10). Quadratic, n = 10, p = 10, n_i = 10, κ = 10². Cost = #Communications × 1 + #Computations × 1.]
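The shape of this experiment can be sketched as follows: DGD(1, t) performs t consensus rounds per gradient step and is charged unit costs c_c = c_g = 1. The instance below is a small synthetic quadratic for illustration, not the talk's n = 10 problem:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 4, 2
A = []
for _ in range(n):
    M = rng.standard_normal((p, p))
    A.append(M @ M.T + np.eye(p))              # strongly convex local quadratics
b = [rng.standard_normal(p) for _ in range(n)]
W = np.array([[1/2, 1/4, 1/4, 0.0],            # 4-node cycle mixing matrix
              [1/4, 1/2, 0.0, 1/4],
              [1/4, 0.0, 1/2, 1/4],
              [0.0, 1/4, 1/4, 1/2]])
x_star = -np.linalg.solve(sum(A), sum(b))

def dgd_1t(t, alpha=0.005, iters=2000):
    """DGD(1, t): x_{k+1} = W^t x_k - alpha grad f(x_k), unit costs c_c = c_g = 1."""
    Wt = np.linalg.matrix_power(W, t)
    x = np.zeros((n, p))
    for _ in range(iters):
        grads = np.stack([A[i] @ x[i] + b[i] for i in range(n)])
        x = Wt @ x - alpha * grads
    cost = iters * t * 1 + iters * 1           # communications + computations
    err = np.linalg.norm(x.mean(axis=0) - x_star)
    return err, cost

results = {t: dgd_1t(t) for t in (1, 2, 5, 10)}
```

Per iteration, larger t yields a smaller error neighborhood, but each iteration is charged t communications: exactly the trade-off the right-hand plot captures.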


Operators

DGD

xk+1 = Zxk − α∇f(xk)

Methods

DGD: (T − I +W)[xk ] = Zxk − α∇f(xk)

WT: W[T [xk ]] = Zxk − αZ∇f(xk)

A special case of the algorithm appeared as CTA (Combined then Adapt)and ATC in [Sayed, 13] for quadratic problems

1 What can be proven about T [W[x]] and W[T [x]] variants of DGD?2 How do these methods perform in practice?3 Advantages and limitations of the methods?

Balancing Computation & Communication 20/38

Introduction and MotivationDistributed Gradient Descent Variant

Communication Computation Decoupled DGD VariantsConclusions & Future Work

Operators

DGD

xk+1 = Zxk − α∇f(xk)

Operators

W[x] = Zx

T [x] = x− α∇f(x)

Methods

DGD: (T − I +W)[xk ] = Zxk − α∇f(xk)

WT: W[T [xk ]] = Zxk − αZ∇f(xk)

A special case of the algorithm appeared as CTA (Combined then Adapt)and ATC in [Sayed, 13] for quadratic problems

1 What can be proven about T [W[x]] and W[T [x]] variants of DGD?2 How do these methods perform in practice?3 Advantages and limitations of the methods?

Balancing Computation & Communication 20/38

Introduction and MotivationDistributed Gradient Descent Variant

Communication Computation Decoupled DGD VariantsConclusions & Future Work

Operators

DGD

xk+1 = Zxk − α∇f(xk)

Operators

Operators

DGD: xk+1 = Zxk − α∇f(xk)

TW: yk = Zxk, xk+1 = yk − α∇f(yk)

Operators:
W[x] = Zx (communication / consensus)
T[x] = x − α∇f(x) (computation / gradient step)

Methods:
DGD: (T − I + W)[xk] = Zxk − α∇f(xk)
TW: T[W[xk]] = Zxk − α∇f(Zxk)
WT: W[T[xk]] = Zxk − αZ∇f(xk)

A special case of the algorithm appeared as CTA (Combine Then Adapt) and ATC (Adapt Then Combine) in [Sayed, 13] for quadratic problems.

1 What can be proven about the T[W[x]] and W[T[x]] variants of DGD?
2 How do these methods perform in practice?
3 What are the advantages and limitations of these methods?
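To make the operator notation concrete, here is a minimal runnable sketch (our own, not the authors' code) of one DGD, TW, and WT step on a toy 3-node path network with scalar quadratics f_i(x) = (x − b_i)²/2; the mixing matrix Z, data b, and step size alpha are made-up toy values.

```python
import numpy as np

# Toy 3-node path graph with a symmetric, doubly stochastic mixing matrix Z.
# Each node i holds f_i(x) = (x - b_i)^2 / 2, so grad f_i(x_i) = x_i - b_i.
Z = np.array([[0.75, 0.25, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.25, 0.75]])
b = np.array([1.0, 2.0, 3.0])   # minimizer of sum_i f_i is mean(b) = 2
alpha = 0.1

def W(x):
    """Communication step: each node averages with its neighbors."""
    return Z @ x

def T(x):
    """Computation step: each node takes a local gradient step."""
    return x - alpha * (x - b)

def dgd_step(x):   # DGD: Z x_k - alpha * grad f(x_k)
    return W(x) - alpha * (x - b)

def tw_step(x):    # TW = T[W[x]]: communicate first, then compute
    return T(W(x))

def wt_step(x):    # WT = W[T[x]]: compute first, then communicate
    return W(T(x))
```

With a fixed step size and one consensus round per iteration, iterating `tw_step` drives the node average to the true minimizer but leaves a residual disagreement across nodes, which is the error term the multi-consensus variants below are designed to shrink.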

TW – Methods, Assumptions & Definitions

Methods:
TWt: t (predetermined) consensus steps for every gradient step
TW+: increasing number of consensus steps

Assumptions:
1 Assumptions 1 & 2, same as before

Definitions:
x̄k = (1/n) Σ_{i=1}^n xi,k,  ∇f(yk) = Σ_{i=1}^n ∇fi(yi,k),  ∇f(x̄k) = Σ_{i=1}^n ∇fi(x̄k)

TWt – Theory – Bounded distance to minimum

Theorem (Bounded distance to minimum) [ASB, RB, NSK and EW, 2017]

Suppose Assumptions 1 & 2 hold, and let the step length satisfy
α ≤ min{ (1 + λn(Wt)) / Lf, 1 / (µf + Lf) },
where µf is the strong convexity parameter of f and Lf is the Lipschitz constant of the gradient of f. Then, for all k = 0, 1, ...,
‖xi,k − x⋆‖ ≤ ci^k ‖x⋆‖ + (c3 / √(1 − c2)) βt + βt αD,
where x⋆ = arg minx f(x) and δ > 0.

TW+ Theory (Increasing Consensus)

Can we increase the number of consensus steps and converge to the solution?

Increase t(k) so that the error term O(β^{t(k)}) vanishes.

A similar idea appeared in [Chen and Ozdaglar, 2012] for nonsmooth problems.

The resulting TW_{t(k)} algorithm converges exactly, as long as we keep increasing the number of consensus steps.

TW+ – Theory – Bounded distance to minimum

Theorem (Bounded distance to minimum) [ASB, RB, NSK and EW, 2017]

Suppose Assumptions 1 & 2 hold, t(k) = k, and let the step length satisfy
α ≤ min{ 1 / Lf, 1 / (µf + Lf) },
where µf is the strong convexity parameter of f and Lf is the Lipschitz constant of the gradient of f. Then, for all k = 0, 1, ...,
‖xi,k − x⋆‖ ≤ C ρ^k,
where x⋆ = arg minx f(x), for some constants C, ρ.

When t(k) = k, to reach an ε-accurate solution we need O(log(1/ε)) gradient evaluations and O((log(1/ε))²) rounds of communication.
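A hedged sketch of the increasing-consensus idea on a toy problem (the network, data, schedule t(k) = k + 1, and the choice to report the post-communication iterates y are our own assumptions, not the paper's exact setup): as t(k) grows, the consensus error shrinks like β^{t(k)} and the post-communication iterates approach the common minimizer.

```python
import numpy as np

# Same toy setup as before: 3-node path graph, scalar quadratics f_i(x) = (x - b_i)^2 / 2.
Z = np.array([[0.75, 0.25, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.25, 0.75]])
b = np.array([1.0, 2.0, 3.0])   # minimizer of sum_i f_i is mean(b) = 2
alpha = 0.1

def tw_increasing(num_iters):
    """TW with an increasing consensus schedule t(k) = k + 1."""
    x = np.zeros(3)
    y = x
    for k in range(num_iters):
        y = x
        for _ in range(k + 1):      # t(k) = k + 1 communication rounds
            y = Z @ y
        x = y - alpha * (y - b)     # one local gradient step at y
    # Report the post-communication iterates: their disagreement shrinks
    # like beta^{t(k)}, so they approach the exact minimizer.
    return y
```

This matches the slide's message: killing the O(β^{t(k)}) term buys exact convergence at the price of an increasing communication budget per gradient step.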

Numerical Experiments

Methods: DGD, TW(1,1,-), TW(10,1,-), TW(1,10,-), TW(1,1,k), TW(1,1,500), TW(1,1,1000)

Problem: Quadratic
f(x) = (1/2) Σ_{i=1}^n xᵀAi x + biᵀ x,
where each node i ∈ {1, ..., n} has local data Ai ∈ R^{ni×p} and bi ∈ R^{ni}.

Parameters: n = 10, p = 10, ni = 10, κ = L/µ = 10^4

Graph: 4-cyclic graph

Numerical Experiments – Quadratic Problems

[Figure: relative error vs. iterations (left) and vs. cost (right) for DGD, TW(1,1,-), TW(10,1,-), TW(1,10,-), TW(1,1,k), TW(1,1,500), and TW(1,1,1000).]

Quadratic. n = 10, p = 10, ni = 10, κ = 10^4.

Cost = #Communications × 1 + #Computations × 1

Experiments – Quadratic Problems – Different Costs

[Figure: relative error vs. cost for DGD and the TW variants under three cost models. Left: cg = 10, cc = 1; Center: cg = 1, cc = 1; Right: cg = 1, cc = 10.]

Quadratic. n = 10, p = 10, ni = 10, κ = 10^4.

Cost = #Communications × cc + #Computations × cg
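The cost model above is simple enough to state as a one-liner. The helper below (our own naming; the reading of TW(tc, tg, ·) as "tc consensus rounds and tg gradient steps per iteration" is our interpretation of the slides' labels) reproduces the three panel settings.

```python
def total_cost(iters, consensus_per_iter, grads_per_iter, cc=1.0, cg=1.0):
    """Cost = #Communications * cc + #Computations * cg, accumulated over iterations."""
    return iters * (consensus_per_iter * cc + grads_per_iter * cg)
```

For example, over 100 iterations a TW(10,1,-)-style scheme pays 100·(10·cc + 1·cg): cheap when communication is cheap (cc = 1), expensive when it is not (cc = 10), which is exactly why the ranking of the methods changes across the three panels.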

Numerical Experiments – Logistic Regression

Problem: Logistic Regression – Binary Classification (Mushroom Dataset)
f(x) = (1/(n·ni)) Σ_{i=1}^n Σ_{j=1}^{ni} log(1 + exp(−(bi)j · xᵀ(Ai)j,·)),
where A ∈ R^{n·ni×p} and b ∈ {−1, 1}^{n·ni}, and each node i = 1, ..., n has a portion of A and b: Ai ∈ R^{ni×p} and bi ∈ R^{ni}.

Parameters: n = 10, p = 114, ni = 812, κ = 10^4

Graph: 4-cyclic graph
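A minimal sketch (ours, not the authors' code) of the per-node loss and gradient for this objective, using small random data as a stand-in for the mushroom dataset (all sizes and names here are illustrative); a finite-difference check confirms the gradient formula.

```python
import numpy as np

# Tiny stand-in data: n nodes, each with ni examples in R^p, labels in {-1, +1}.
rng = np.random.default_rng(0)
n, ni, p = 3, 5, 4
A = [rng.standard_normal((ni, p)) for _ in range(n)]
b = [rng.choice([-1.0, 1.0], size=ni) for _ in range(n)]

def local_loss(i, x):
    """Node i's share of f: (1/(n*ni)) * sum_j log(1 + exp(-b_ij * a_ij^T x))."""
    z = b[i] * (A[i] @ x)
    return np.sum(np.log1p(np.exp(-z))) / (n * ni)

def local_grad(i, x):
    """Gradient of node i's share: -(1/(n*ni)) * A_i^T (b_i * sigmoid(-b_i * A_i x))."""
    z = b[i] * (A[i] @ x)
    s = 1.0 / (1.0 + np.exp(z))   # sigmoid(-z)
    return -(A[i].T @ (b[i] * s)) / (n * ni)
```

In a distributed run, `local_grad(i, x_i)` is the only gradient information node i contributes per computation step; everything else travels through the consensus rounds.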

Numerical Experiments – Logistic Regression

[Figure: relative error vs. iterations (left) and vs. cost (right) for DGD, TW(1,1,-), TW(10,1,-), TW(1,10,-), TW(1,1,k), TW(1,1,250), and TW(1,1,500).]

Logistic Regression – mushroom. n = 10, p = 114, ni = 812.

Cost = #Communications × 1 + #Computations × 1


Overview

1 Introduction and Motivation

2 Distributed Gradient Descent Variant

3 Communication Computation Decoupled DGD Variants

4 Conclusions & Future Work


Final Remarks

Most distributed optimization algorithms do one communication and one computation per iteration.

Showed the effect (theoretically and empirically) of doing multiple consensus steps in DGD.

Proposed a variant of DGD, TW, that decouples the two operations (consensus and computation) and converges to the solution by performing multiple consensus steps:

DGD: xk+1 = Zxk − α∇f(xk)
TW: xk+1 = Zxk − α∇f(Zxk)

It is important to balance communication and computation to get the best performance in terms of cost; the right balance depends on the application (e.g., the relative costs of communication and computation).

Future Work

Apply the framework to other algorithms (exact, second-order, asynchronous, ...).

Construct a framework to do multiple gradient steps.

Adapt the number of gradient and communication steps algorithmically.

Other considerations: memory access, partial blocks, quantization effects, dynamic environments.

Backup Slides

DGD – Theory – Bounded gradients

Lemma (Bounded gradients) [Yuan et al., 2015]

Suppose Assumption 1 holds, and let the step size satisfy
α ≤ (1 + λn(W)) / L,
where λn(W) is the smallest eigenvalue of W and L = maxi Li. Then, starting from xi,0 = 0 (i = 1, 2, ..., n), the sequence xi,k generated by DGD converges. In addition, we also have
‖∇f(xk)‖ ≤ D = √( 2L (1/n) Σ_{i=1}^n (fi(0) − fi⋆) )    (1)
for all k = 1, 2, ..., where fi⋆ = fi(xi⋆) and xi⋆ = arg minx fi(x).

DGDt – Theory – Bounded gradients

Lemma (Bounded gradients) [ASB, RB, NSK and EW, 2017]

Suppose Assumption 1 holds, and let the step size satisfy
α ≤ (1 + λn(Wt)) / L,
where λn(Wt) is the smallest eigenvalue of Wt and L = maxi Li. Then, starting from xi,0 = 0 (i = 1, 2, ..., n), the sequence xi,k generated by DGDt converges. In addition, we also have
‖∇f(xk)‖ ≤ D = √( 2L (1/n) Σ_{i=1}^n (fi(0) − fi⋆) )    (1)
for all k = 1, 2, ..., where fi⋆ = fi(xi⋆) and xi⋆ = arg minx fi(x).

DGD – Theory – Bounded deviation from mean

Lemma (Bounded deviation from mean) [Yuan et al., 2015]

If (1) and Assumption 2 hold, then the total deviation from the mean is bounded, namely,
‖xi,k − x̄k‖ ≤ αD / (1 − β)
for all k and i. Moreover, if in addition Assumption 1 holds, then
‖∇fi(xi,k) − ∇fi(x̄k)‖ ≤ αDLi / (1 − β)
‖∇f(xk) − ∇f(x̄k)‖ ≤ αDL / (1 − β)
for all k and i.
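The shape of this bound can be checked empirically. The sketch below (our own toy setup, not the paper's experiments) runs DGD on scalar quadratics and verifies that the worst observed deviation from the mean stays below αD̂/(1 − β), where D̂ is the largest stacked-gradient norm seen along the run (a stand-in for the lemma's constant D) and β is the second-largest eigenvalue modulus of Z.

```python
import numpy as np

# Toy 3-node path graph; node i holds f_i(x) = (x - b_i)^2 / 2, so grad f_i(x) = x - b_i.
Z = np.array([[0.75, 0.25, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.25, 0.75]])
b = np.array([1.0, 2.0, 3.0])
alpha = 0.1
beta = sorted(np.abs(np.linalg.eigvals(Z)))[-2]   # second-largest eigenvalue modulus

x = np.zeros(3)
max_dev, max_grad = 0.0, 0.0
for k in range(500):
    x = Z @ x - alpha * (x - b)                   # one DGD step
    max_dev = max(max_dev, np.max(np.abs(x - x.mean())))
    max_grad = max(max_grad, np.linalg.norm(x - b))
```

On this problem the iterates' deviation from their mean settles at roughly αD'/(1 − β) scale rather than shrinking to zero, which is the steady-state disagreement the lemma quantifies.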

DGDt – Theory – Bounded deviation from mean

Lemma (Bounded deviation from mean) [ASB, RB, NSK and EW, 2017]

If (1) and Assumption 2 hold, then the total deviation from the mean is bounded, namely,
‖xi,k − x̄k‖ ≤ αD / (1 − βt)
for all k and i. Moreover, if in addition Assumption 1 holds, then
‖∇fi(xi,k) − ∇fi(x̄k)‖ ≤ αDLi / (1 − βt)
‖∇f(xk) − ∇f(x̄k)‖ ≤ αDL / (1 − βt)
for all k and i.

TWt – Theory – Bounded deviation from mean

Lemma (Bounded deviation from mean) [ASB, RB, NSK and EW, 2017]

If Assumption 2 holds, then the total deviation from the mean is bounded, namely,
‖yi,k − x̄k‖ ≤ βt αD k(k + 1) / 2
for all k and i. Moreover, if in addition Assumption 1 holds, then
‖∇fi(yi,k) − ∇fi(x̄k)‖ ≤ βt αD Li k(k + 1) / 2
‖∇f(yk) − ∇f(x̄k)‖ ≤ βt αD L k(k + 1) / 2
for all k and i.

DGDt Numerical Results

[Figure: relative error vs. iterations, number of gradient evaluations, number of communications, and cost for DGD(1,1), DGD(1,2), DGD(1,5), and DGD(1,10).]

Quadratic. n = 10, p = 10, ni = 10, κ = 10^2.

Experiments – Quadratic Problems

[Figure: relative error vs. iterations, number of gradient evaluations, number of communications, and cost for DGD, TW(1,1,-), TW(10,1,-), TW(1,10,-), TW(1,1,k), TW(1,1,500), and TW(1,1,1000).]

Quadratic. n = 10, d = 10, ni = 10, κ = 10^4.

Experiments – Logistic Regression

[Figure: relative error vs. iterations, number of gradient evaluations, number of communications, and cost for DGD, TW(1,1,-), TW(10,1,-), TW(1,10,-), TW(1,1,k), TW(1,1,250), and TW(1,1,500).]

Logistic Regression – mushroom. n = 10, d = 114, ni = 812.
