TRANSCRIPT
Large Scale Machine Learning Over Networks

Francis Bach
INRIA - École Normale Supérieure, Paris, France

Joint work with Kevin Scaman, Hadrien Hendrikx, Laurent Massoulié, Sébastien Bubeck, and Yin-Tat Lee

PAISS Summer School - October 5, 2019
Scientific context

• Proliferation of digital data
  – Personal data
  – Industry
  – Scientific, from bioinformatics to humanities

• Need for automated processing of massive data

• Series of "hypes":
  Big data → Data science → Machine Learning → Deep Learning → Artificial Intelligence
Recent progress in perception (vision, audio, text)

(Figure: translation and visual-relation examples, "person ride dog", from translate.google.fr and Peyré et al., 2017)

(1) Massive data
(2) Computing power
(3) Methodological and scientific progress

"Intelligence" = models + algorithms + data + computing power
Outline

1. Parametric supervised learning on a single machine
   – Machine learning ≈ optimization of finite sums
   – From batch to stochastic gradient methods
   – Linearly-convergent stochastic methods for convex problems

2. Machine learning over networks
   – Centralized and decentralized methods
   – From network averaging to optimization
   – Distributing the fastest single-machine algorithms
Parametric supervised machine learning

• Data: n observations (xi, yi) ∈ X × Y, i = 1, …, n
• Prediction function h(x, θ) ∈ R parameterized by θ ∈ R^d

• Advertising: n > 10^9
  – Φ(x) ∈ {0, 1}^d, d > 10^9
  – Navigation history + ad

• Linear predictions
  – h(x, θ) = θ⊤Φ(x)
Parametric supervised machine learning

• Data: n observations (xi, yi) ∈ X × Y, i = 1, …, n
• Prediction function h(x, θ) ∈ R parameterized by θ ∈ R^d

• Image classification example: x1, …, x6 with labels y1 = y2 = y3 = 1 and y4 = y5 = y6 = −1

• Neural networks (n, d > 10^6): h(x, θ) = θm⊤ σ(θm−1⊤ σ(· · · θ2⊤ σ(θ1⊤ x)))
Parametric supervised machine learning

• Data: n observations (xi, yi) ∈ X × Y, i = 1, …, n
• Prediction function h(x, θ) ∈ R parameterized by θ ∈ R^d

• (Regularized) empirical risk minimization:

  min_{θ∈R^d} (1/n) Σ_{i=1}^n ℓ(yi, h(xi, θ)) + λΩ(θ) = (1/n) Σ_{i=1}^n fi(θ)

  data fitting term + regularizer
Parametric supervised machine learning

• (Regularized) empirical risk minimization, least-squares regression:

  min_{θ∈R^d} (1/2n) Σ_{i=1}^n (yi − h(xi, θ))^2 + λΩ(θ) = (1/n) Σ_{i=1}^n fi(θ)
Parametric supervised machine learning

• (Regularized) empirical risk minimization, logistic regression:

  min_{θ∈R^d} (1/n) Σ_{i=1}^n log(1 + exp(−yi h(xi, θ))) + λΩ(θ) = (1/n) Σ_{i=1}^n fi(θ)
Parametric supervised machine learning

• Data: n observations (xi, yi) ∈ X × Y, i = 1, …, n, i.i.d.
• Prediction function h(x, θ) ∈ R parameterized by θ ∈ R^d

• (Regularized) empirical risk minimization:

  min_{θ∈R^d} (1/n) Σ_{i=1}^n ℓ(yi, h(xi, θ)) + λΩ(θ) = (1/n) Σ_{i=1}^n fi(θ)

  data fitting term + regularizer

• Optimization: minimization of the regularized risk (training cost)
• Statistics: guarantees on E_{p(x,y)} ℓ(y, h(x, θ)) (testing cost)
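As a concrete sketch of the objective above, here is the regularized empirical risk for logistic regression with an ℓ2 regularizer; the synthetic data and all names are illustrative, not from the lecture:

```python
import numpy as np

# Hypothetical synthetic data: n observations with d features.
# Rows of X play the role of the features Phi(x_i); y_i in {-1, +1}.
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n))

def regularized_risk(theta, X, y, lam):
    """(1/n) sum_i log(1 + exp(-y_i theta^T x_i)) + (lam/2) ||theta||^2."""
    margins = y * (X @ theta)
    losses = np.logaddexp(0.0, -margins)   # stable log(1 + exp(-m))
    return losses.mean() + 0.5 * lam * theta @ theta

theta0 = np.zeros(d)
print(regularized_risk(theta0, X, y, lam=0.1))  # log(2) ≈ 0.6931 at theta = 0
```

At θ = 0 every margin is zero, so the risk equals log 2 regardless of the data, which is a convenient sanity check.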
Smoothness and (strong) convexity

• A function g: R^d → R is L-smooth if and only if it is twice differentiable and

  ∀θ ∈ R^d, |eigenvalues[g′′(θ)]| ≤ L

• Machine learning
  – With g(θ) = (1/n) Σ_{i=1}^n ℓ(yi, h(xi, θ))
  – Smooth prediction function θ ↦ h(xi, θ) + smooth loss
Smoothness and (strong) convexity

• A twice differentiable function g: R^d → R is convex if and only if

  ∀θ ∈ R^d, eigenvalues[g′′(θ)] ≥ 0
Smoothness and (strong) convexity

• A twice differentiable function g: R^d → R is μ-strongly convex if and only if

  ∀θ ∈ R^d, eigenvalues[g′′(θ)] ≥ μ

  – Condition number κ = L/μ ≥ 1 (small κ: well-conditioned problem; large κ: ill-conditioned problem)
Smoothness and (strong) convexity

• Convexity in machine learning
  – With g(θ) = (1/n) Σ_{i=1}^n ℓ(yi, h(xi, θ))
  – Convex loss and linear predictions h(x, θ) = θ⊤Φ(x)

• Relevance of convex optimization
  – Easier design and analysis of algorithms
  – Global minimum vs. local minimum vs. stationary points
  – Gradient-based algorithms only need convexity for their analysis
Smoothness and (strong) convexity

• Strong convexity in machine learning
  – With g(θ) = (1/n) Σ_{i=1}^n ℓ(yi, h(xi, θ))
  – Strongly convex loss and linear predictions h(x, θ) = θ⊤Φ(x)
  – Invertible covariance matrix (1/n) Σ_{i=1}^n Φ(xi)Φ(xi)⊤ ⇒ n ≥ d
  – Even when μ > 0, μ may be arbitrarily small

• Adding regularization by (μ/2)‖θ‖^2
  – Creates additional bias unless μ is small, but reduces variance
  – Typically L/√n ≥ μ ≥ L/n ⇒ κ ∈ [√n, n]
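For regularized least-squares the Hessian is explicit, so the condition number can be computed directly; a minimal sketch with synthetic data (all names and values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 10
X = rng.standard_normal((n, d))
mu = 0.1   # regularization strength (assumed value for illustration)

# Hessian of theta -> (1/2n)||y - X theta||^2 + (mu/2)||theta||^2,
# which is constant in theta: (1/n) X^T X + mu I.
H = X.T @ X / n + mu * np.eye(d)
eigs = np.linalg.eigvalsh(H)
L, mu_eff = eigs.max(), eigs.min()
kappa = L / mu_eff
print(kappa)  # condition number; regularization guarantees mu_eff >= mu
```

Without the μI term, μ_eff could be arbitrarily close to zero (or exactly zero when n < d); the regularizer caps κ at L/μ.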
Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on R^d

• Gradient descent: θt = θt−1 − γt g′(θt−1) (with line search)
  – g(θt) − g(θ∗) ≤ O(1/t)
  – g(θt) − g(θ∗) ≤ O((1 − 1/κ)^t) = O(e^{−t/κ}) if μ-strongly convex
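A minimal sketch of this linear rate, on a strongly convex quadratic whose eigenvalues give κ = 10 (the step size 1/L and the example are illustrative choices):

```python
import numpy as np

# Gradient descent on g(theta) = 0.5 * theta^T A theta with step 1/L.
A = np.diag([1.0, 10.0])        # eigenvalues: mu = 1, L = 10, so kappa = 10
L_smooth = 10.0

theta = np.array([1.0, 1.0])
for t in range(200):
    grad = A @ theta
    theta = theta - grad / L_smooth   # theta_t = theta_{t-1} - gamma g'(theta_{t-1})

g = 0.5 * theta @ A @ theta
print(g)  # excess cost decays roughly like (1 - 1/kappa)^t
```

The slow coordinate (eigenvalue μ = 1) contracts by a factor 1 − 1/κ per step, which is exactly the linear rate on the slide.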
Iterative methods for minimizing smooth functions

• Acceleration (Nesterov, 1983): second-order recursion

  θt = ηt−1 − γt g′(ηt−1) and ηt = θt + δt(θt − θt−1)

  – Good choice of momentum term δt ∈ [0, 1)
  – g(θt) − g(θ∗) ≤ O(1/t^2)
  – g(θt) − g(θ∗) ≤ O((1 − 1/√κ)^t) = O(e^{−t/√κ}) if μ-strongly convex
  – Optimal rates after t = O(d) iterations (Nesterov, 2004)
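The accelerated recursion can be sketched on the same kind of quadratic; the constant momentum δ = (√κ − 1)/(√κ + 1) is the standard choice for the strongly convex case (example values are illustrative):

```python
import numpy as np

A = np.diag([1.0, 100.0])       # mu = 1, L = 100, so kappa = 100
L_s, mu_s = 100.0, 1.0
kappa = L_s / mu_s
delta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)  # momentum coefficient

theta = np.array([1.0, 1.0])
theta_prev = theta.copy()
for t in range(300):
    eta = theta + delta * (theta - theta_prev)       # look-ahead point eta_{t-1}
    theta_prev = theta
    theta = eta - (A @ eta) / L_s                    # gradient step at eta

print(0.5 * theta @ A @ theta)  # contracts roughly like (1 - 1/sqrt(kappa))^t
```

Plain gradient descent on this problem contracts like (1 − 1/100)^t; the momentum term improves this to roughly (1 − 1/10)^t.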
Iterative methods for minimizing smooth functions

• Assumption: g convex and smooth on R^d

• Gradient descent: θt = θt−1 − γt g′(θt−1)
  – O(1/t) convergence rate for convex functions
  – O(e^{−t/κ}) linear rate if strongly convex ⇔ complexity = O(nd · κ log(1/ε))

• Newton method: θt = θt−1 − g′′(θt−1)^{−1} g′(θt−1)
  – O(e^{−ρ2^t}) quadratic rate ⇔ complexity = O((nd^2 + d^3) · log log(1/ε))

• Key insights for machine learning (Bottou and Bousquet, 2008)
  1. No need to optimize below statistical error
  2. Cost functions are averages
  3. Testing error is more important than training error
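The quadratic rate of Newton's method is easy to see on a one-dimensional smooth, strongly convex function; the particular g below is an illustrative choice, not from the lecture:

```python
import numpy as np

# Newton's method on g(theta) = log(1 + exp(theta)) + 0.5 * theta^2.
# g'(theta) = sigmoid(theta) + theta, g''(theta) = sigmoid'(theta) + 1 >= 1.
def g_prime(t):
    return 1.0 / (1.0 + np.exp(-t)) + t

def g_second(t):
    s = 1.0 / (1.0 + np.exp(-t))
    return s * (1.0 - s) + 1.0

theta = 3.0
for _ in range(6):                # quadratic rate: a handful of steps suffices
    theta = theta - g_prime(theta) / g_second(theta)

print(abs(g_prime(theta)))        # gradient magnitude, near machine precision
```

The number of correct digits roughly doubles per iteration, matching the O(log log(1/ε)) iteration count on the slide; each iteration, however, needs the Hessian, hence the d^2/d^3 cost terms.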
Stochastic gradient descent (SGD) for finite sums

  min_{θ∈R^d} g(θ) = (1/n) Σ_{i=1}^n fi(θ)

• Iteration: θt = θt−1 − γt f′_{i(t)}(θt−1)
  – Sampling with replacement: i(t) a random element of {1, …, n}
  – Polyak-Ruppert averaging: θ̄t = (1/(t+1)) Σ_{u=0}^t θu

• Convergence rate if each fi is convex and L-smooth, and g is μ-strongly convex:

  E g(θ̄t) − g(θ∗) ≤ O(1/√t) if γt = 1/(L√t)
  E g(θ̄t) − g(θ∗) ≤ O(L/(μt)) = O(κ/t) if γt = 1/(μt)

  – No adaptivity to strong convexity in general
  – Running-time complexity: O(d · κ/ε)

• NB: a single pass leads to bounds on the testing error
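A minimal SGD sketch on a least-squares finite sum, with sampling with replacement, the γt = 1/(L√t) step size, and a Polyak-Ruppert running average; the synthetic noiseless data and the crude smoothness estimate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 5
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d)
y = X @ theta_star                    # noiseless data for a clean illustration

theta = np.zeros(d)
theta_avg = np.zeros(d)
L_est = np.max(np.sum(X**2, axis=1))  # per-sample smoothness bound max_i ||x_i||^2
for t in range(1, 20001):
    i = rng.integers(n)               # sampling with replacement
    grad_i = (X[i] @ theta - y[i]) * X[i]   # gradient of f_i at theta
    theta -= grad_i / (L_est * np.sqrt(t))  # gamma_t = 1/(L sqrt(t))
    theta_avg += (theta - theta_avg) / t    # Polyak-Ruppert running average

print(np.linalg.norm(theta_avg - theta_star))
```

Each iteration touches a single observation, so the cost per step is O(d), independent of n, as the slide emphasizes.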
Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) Σ_{i=1}^n fi(θ) with fi(θ) = ℓ(yi, h(xi, θ)) + λΩ(θ)

• Batch gradient descent: θt = θt−1 − γ∇g(θt−1) = θt−1 − (γ/n) Σ_{i=1}^n ∇fi(θt−1)
  – Linear ("exponential") convergence rate in O(e^{−t/κ}) for strongly convex problems
  – Can be accelerated to O(e^{−t/√κ}) (Nesterov, 1983)
  – Iteration complexity is linear in n

• Stochastic gradient descent: θt = θt−1 − γt ∇f_{i(t)}(θt−1)
  – Sampling with replacement: i(t) a random element of {1, …, n}
  – Convergence rate in O(κ/t)
  – Iteration complexity is independent of n
Stochastic vs. deterministic methods

• Goal = best of both worlds: a linear rate with O(d) iteration cost and a simple choice of step size

  (Figure: log(excess cost) vs. time; the deterministic curve decreases linearly but slowly per pass, the stochastic curve drops fast then stalls, and the new methods follow the fast initial drop while keeping the linear rate)
Recent progress in single machine optimization

• Variance reduction
  – Exponential (linear) convergence with O(d) iteration cost
  – SAG (Le Roux, Schmidt, and Bach, 2012)
  – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
  – SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

  θt = θt−1 − γ[∇f_{i(t)}(θt−1) + (1/n) Σ_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1}]

  (with y_i^t the stored value at time t of the gradient of the i-th function)
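The update above can be sketched in SAGA style: keep a table of the last gradient seen for each f_i and use it to de-bias the stochastic direction. The least-squares instance, step size, and names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 4
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)

def grad_i(theta, i):
    """Gradient of f_i(theta) = 0.5 * (y_i - x_i^T theta)^2."""
    return (X[i] @ theta - y[i]) * X[i]

theta = np.zeros(d)
table = np.zeros((n, d))              # y_i^t: stored per-function gradients
table_mean = table.mean(axis=0)
gamma = 1.0 / (3.0 * np.max(np.sum(X**2, axis=1)))  # standard 1/(3L) step
for t in range(5000):
    i = rng.integers(n)
    g = grad_i(theta, i)
    theta -= gamma * (g - table[i] + table_mean)    # variance-reduced direction
    table_mean += (g - table[i]) / n                # keep the mean up to date
    table[i] = g                                    # refresh the stored gradient

print(np.linalg.norm(X @ theta - y))  # residual shrinks linearly
```

The correction term (table_mean − table[i]) has zero mean over the random index i, so the direction stays unbiased while its variance vanishes as θ approaches the optimum, which is what restores the linear rate with O(d) cost per iteration.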
Recent progress in single machine optimization

• Running time to reach precision ε (with κ = condition number):

  Stochastic gradient descent:  d × κ × (1/ε)
  Gradient descent:             d × nκ × log(1/ε)
  Variance reduction:           d × (n + κ) × log(1/ε)

  – Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
  – Matching upper and lower bounds of complexity
Outline

1. Parametric supervised learning on a single machine
   – Machine learning ≈ optimization of finite sums
   – From batch to stochastic gradient methods
   – Linearly-convergent stochastic methods for convex problems

2. Machine learning over networks
   – Centralized and decentralized methods
   – From network averaging to optimization
   – Distributing the fastest single-machine algorithms
Distribution in machine learning (and beyond)

• Machine learning through optimization:

  min_{θ∈R^d} (1/n) Σ_{i=1}^n fi(θ) = g(θ)

  – fi(θ): error of the model defined by θ on the dataset indexed by i
  – Example: fi(θ) = (1/mi) Σ_{j=1}^{mi} ℓ(yij, θ⊤Φ(xij)) if node i holds mi observations

• Each dataset/function fi is only accessible by node i in a graph
  – Massive datasets, multiple machines/cores
  – Communication/legal constraints

• Goal: minimize communication and local computation costs
Distribution in machine learning (and beyond)

• Why not simply distribute a simple single machine algorithm?
  – (Accelerated) gradient descent (see, e.g., Nesterov, 2004): θt = θt−1 − γ∇g(θt−1)
  – Requires √κ log(1/ε) full gradient computations to reach precision ε
  – Need to perform distributed averaging over a network
Classical algorithms for distributed averaging

• Goal: given n observations ξ1, …, ξn ∈ R, compute

  θ∗ = (1/n) Σ_{i=1}^n ξi = argmin_{θ∈R} (1/n) Σ_{i=1}^n (θ − ξi)^2

• Centralized algorithms
  – Compute a spanning tree with diameter ≤ 2∆
  – Master/slave algorithm: ∆ communication steps + no error

• Application to centralized distributed optimization
  – √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
  – "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)

• Robustness?
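The master/slave scheme can be sketched in a few lines, assuming for simplicity a star topology (∆ = 1) where every node talks directly to the root; the function name and setup are illustrative, not from the lecture:

```python
def master_slave_average(values):
    """Master/slave averaging: values go up the tree, the mean comes back down."""
    total = 0.0
    for v in values:                 # "up" pass: each node sends its value to the root
        total += v
    mean = total / len(values)
    return [mean] * len(values)      # "down" pass: the root broadcasts the average

print(master_slave_average([1.0, 2.0, 3.0, 6.0]))  # every node ends up holding 3.0
```

On a general spanning tree the up and down passes each take ∆ rounds, and the result is exact; the price is that the root is a single point of failure, which motivates the decentralized gossip methods below.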
Classical algorithms for distributed averaging

• Decentralized algorithms - gossip (Boyd et al., 2006)
  – Replace θi by a weighted average of its neighbors: Σ_{j=1}^n Wij θj
  – Potential asynchrony, changing network

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θt = Wθt−1 = W^t θ0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θt = Wθt−1 = W^t θ0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
  – Consequence: eigenvalues(W) ∈ [−1, 1]
  – Eigengap γ = λ1(W) − λ2(W) = 1 − λ2(W)
  – γ^{−1} = mixing time of the associated Markov chain
  – Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)
Illustration of synchronous gossip
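The original illustration is a figure; in the same spirit, here is a minimal simulation of synchronous gossip on a ring of n nodes. The weights (1/2 on self, 1/4 per neighbour) are an illustrative choice that makes W symmetric and doubly stochastic:

```python
import numpy as np

# Build the gossip matrix W for a ring of n nodes.
n = 10
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

xi = np.arange(n, dtype=float)   # local values xi_1, ..., xi_n
target = xi.mean()               # the average every node should reach

theta = xi.copy()
for t in range(500):
    theta = W @ theta            # one synchronous communication round

print(np.max(np.abs(theta - target)))  # all nodes close to the average
```

Deviations from the average shrink by a factor λ2(W) per round, so the number of rounds needed scales with the inverse eigengap 1/γ, which grows with the ring size, matching the convergence statement above.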
![Page 2: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/2.jpg)
Scientific context
bull Proliferation of digital data
ndash Personal data
ndash Industry
ndash Scientific from bioinformatics to humanities
bull Need for automated processing of massive data
Scientific context
bull Proliferation of digital data
ndash Personal data
ndash Industry
ndash Scientific from bioinformatics to humanities
bull Need for automated processing of massive data
bull Series of ldquohypesrdquo
Big data rarr Data science rarr Machine Learning
rarr Deep Learning rarr Artificial Intelligence
Recent progress in perception (vision audio text)
person ride dog
From translategooglefr From Peyre et al (2017)
Recent progress in perception (vision audio text)
person ride dog
From translategooglefr From Peyre et al (2017)
(1) Massive data
(2) Computing power
(3) Methodological and scientific progress
Recent progress in perception (vision audio text)
person ride dog
From translategooglefr From Peyre et al (2017)
(1) Massive data
(2) Computing power
(3) Methodological and scientific progress
ldquoIntelligencerdquo = models + algorithms + data
+ computing power
Recent progress in perception (vision audio text)
person ride dog
From translategooglefr From Peyre et al (2017)
(1) Massive data
(2) Computing power
(3) Methodological and scientific progress
ldquoIntelligencerdquo = models + algorithms + data
+ computing power
Outline
1 Parametric supervised learning on a single machine
minus Machine learning asymp optimization of finite sums
minus From batch to stochastic gradient methods
minus Linearly-convergent stochastic methods for convex problems
2 Machine learning over networks
ndash Centralized and decentralized methods
ndash From network averaging to optimization
ndash Distributing the fastest single machine algorithms
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull Advertising n gt 109
ndash Φ(x) isin 0 1d d gt 109
ndash Navigation history + ad
- Linear predictions
- h(x θ) = θ⊤Φ(x)
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull Advertising n gt 109
ndash Φ(x) isin 0 1d d gt 109
ndash Navigation history + ad
bull Linear predictions
ndash h(x θ) = θ⊤Φ(x)
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
x1 x2 x3 x4 x5 x6
y1 = 1 y2 = 1 y3 = 1 y4 = minus1 y5 = minus1 y6 = minus1
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
x1 x2 x3 x4 x5 x6
y1 = 1 y2 = 1 y3 = 1 y4 = minus1 y5 = minus1 y6 = minus1
ndash Neural networks (n d gt 106) h(x θ) = θ⊤mσ(θ⊤mminus1σ(middot middot middot θ⊤2 σ(θ⊤1 x))
x y
θ1θ3
θ2
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
ℓ(
yi h(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
data fitting term + regularizer
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
2n
nsum
i=1
(
yi minus h(xi θ))2
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
(least-squares regression)
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
log(
1 + exp(minusyih(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
(logistic regression)
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
ℓ(
yi h(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
data fitting term + regularizer
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
ℓ(
yi h(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
data fitting term + regularizer
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
ℓ(
yi h(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
data fitting term + regularizer
bull Optimization optimization of regularized risk training cost
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n iid
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
ℓ(
yi h(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
data fitting term + regularizer
bull Optimization optimization of regularized risk training cost
bull Statistics guarantees on Ep(xy)ℓ(y h(x θ)) testing cost
Smoothness and (strong) convexity
bull A function g Rd rarr R is L-smooth if and only if it is twice
differentiable and
forallθ isin Rd
∣
∣eigenvalues[
gprimeprime(θ)]∣
∣ 6 L
smooth non-smooth
Smoothness and (strong) convexity
bull A function g Rd rarr R is L-smooth if and only if it is twice
differentiable and
forallθ isin Rd
∣
∣eigenvalues[
gprimeprime(θ)]∣
∣ 6 L
bull Machine learning
ndash with g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Smooth prediction function θ 7rarr h(xi θ) + smooth loss
Smoothness and (strong) convexity

• A twice differentiable function g: R^d → R is convex if and only if

∀θ ∈ R^d, eigenvalues[g″(θ)] ≥ 0

• It is μ-strongly convex if and only if

∀θ ∈ R^d, eigenvalues[g″(θ)] ≥ μ

– Condition number κ = L/μ ≥ 1

(figure: level sets for small κ = L/μ vs. large κ = L/μ)

• Convexity in machine learning
– With g(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ))
– Convex loss and linear predictions h(x, θ) = θ⊤Φ(x)

• Relevance of convex optimization
– Easier design and analysis of algorithms
– Global minimum vs. local minimum vs. stationary points
– Gradient-based algorithms only need convexity for their analysis

• Strong convexity in machine learning
– Strongly convex loss and linear predictions h(x, θ) = θ⊤Φ(x)
– Invertible covariance matrix (1/n) Σ_{i=1}^n Φ(x_i)Φ(x_i)⊤ ⇒ n ≥ d
– Even when μ > 0, μ may be arbitrarily small

• Adding regularization by (μ/2)||θ||²
– Creates additional bias unless μ is small, but reduces variance
– Typically L/√n ≥ μ ≥ L/n ⇒ κ ∈ [√n, n]
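For least-squares with linear predictions, the Hessian is constant, so L, μ, and κ can be read off its eigenvalues directly; a small sketch on synthetic Gaussian features (data and regularization level are illustrative assumptions) also shows how ℓ²-regularization caps the condition number:

```python
import numpy as np

# For least-squares, g(theta) = (1/2n)||y - X theta||^2 has constant Hessian
# H = X^T X / n; L and mu are its extreme eigenvalues.
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
H = X.T @ X / n
eigs = np.linalg.eigvalsh(H)           # ascending eigenvalues
mu, L = eigs[0], eigs[-1]
kappa = L / mu                         # condition number

# Regularizing by (lam/2)||theta||^2 shifts every eigenvalue up by lam,
# which reduces the condition number (roughly to L/lam when mu << lam).
lam = 1e-2
kappa_reg = (L + lam) / (mu + lam)
```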
Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on R^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) (line search)

g(θ_t) − g(θ∗) ≤ O(1/t)
g(θ_t) − g(θ∗) ≤ O((1 − 1/κ)^t) = O(e^{−t/κ}) if μ-strongly convex
⇔ O(κ log(1/ε)) iterations ⇔ complexity = O(nd · κ log(1/ε))

• Acceleration (Nesterov, 1983): second-order recursion

θ_t = η_{t−1} − γ_t g′(η_{t−1}) and η_t = θ_t + δ_t (θ_t − θ_{t−1})

– Good choice of momentum term δ_t ∈ [0, 1)

g(θ_t) − g(θ∗) ≤ O(1/t²)
g(θ_t) − g(θ∗) ≤ O((1 − 1/√κ)^t) = O(e^{−t/√κ}) if μ-strongly convex

– Optimal rates after t = O(d) iterations (Nesterov, 2004)

• Newton method: θ_t = θ_{t−1} − g″(θ_{t−1})^{−1} g′(θ_{t−1})

– O(e^{−ρ 2^t}) quadratic rate ⇔ O(log log(1/ε)) iterations
⇔ complexity = O((nd² + d³) · log log(1/ε))

• Key insights for machine learning (Bottou and Bousquet, 2008)

1. No need to optimize below the statistical error
2. Cost functions are averages
3. Testing error is more important than training error
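The gap between the O(e^{−t/κ}) and O(e^{−t/√κ}) rates is easy to observe numerically; below is a sketch on a toy diagonal quadratic with κ = 100, using the constant momentum δ = (√κ − 1)/(√κ + 1), a standard choice for the strongly convex case (the problem instance is an assumption, not from the slides):

```python
import numpy as np

# Gradient descent vs. Nesterov acceleration on a toy diagonal quadratic
# g(theta) = 0.5 * sum_k lam_k * theta_k^2, with mu = 1 and L = 100 (kappa = 100).
d = 20
lams = np.linspace(1.0, 100.0, d)       # Hessian eigenvalues
grad = lambda th: lams * th
L, mu = lams[-1], lams[0]

theta0 = np.ones(d)

# Plain gradient descent with step 1/L: rate (1 - 1/kappa)^t
theta = theta0.copy()
for _ in range(200):
    theta = theta - grad(theta) / L

# Nesterov's recursion with constant momentum: rate (1 - 1/sqrt(kappa))^t
delta = (np.sqrt(L / mu) - 1) / (np.sqrt(L / mu) + 1)
theta_acc, eta = theta0.copy(), theta0.copy()
for _ in range(200):
    theta_next = eta - grad(eta) / L
    eta = theta_next + delta * (theta_next - theta_acc)
    theta_acc = theta_next
```

After 200 iterations the slowest coordinate of plain gradient descent has only decayed by 0.99^200 ≈ 0.13, while the accelerated iterate is already very close to the optimum θ∗ = 0.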
Stochastic gradient descent (SGD) for finite sums

min_{θ ∈ R^d} g(θ) = (1/n) Σ_{i=1}^n f_i(θ)

• Iteration: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})
– Sampling with replacement: i(t) random element of {1, …, n}
– Polyak-Ruppert averaging: θ̄_t = (1/(t+1)) Σ_{u=0}^t θ_u

• Convergence rate if each f_i is convex, L-smooth, and g is μ-strongly convex:

E g(θ̄_t) − g(θ∗) ≤ O(1/√t) if γ_t = 1/(L√t)
E g(θ̄_t) − g(θ∗) ≤ O(L/(μt)) = O(κ/t) if γ_t = 1/(μt)

– No adaptivity to strong convexity in general
– Running-time complexity: O(d · κ/ε)

• NB: a single pass leads to bounds on the testing error
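A minimal single-pass sketch of SGD with Polyak-Ruppert averaging on a synthetic least-squares problem; the step size γ_t = 1/(L√t) follows the first regime above, with L the largest per-sample smoothness constant (the data and all constants are illustrative assumptions):

```python
import numpy as np

# One pass of SGD with Polyak-Ruppert averaging on synthetic least-squares data.
rng = np.random.default_rng(2)
n, d = 500, 5
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
y = X @ theta_star + 0.1 * rng.normal(size=n)

g = lambda th: 0.5 * np.mean((y - X @ th) ** 2)   # objective g(theta)
L = (X ** 2).sum(axis=1).max()                    # each f_i is L-smooth (L = max ||x_i||^2)

theta = np.zeros(d)
theta_bar = np.zeros(d)
for t in range(1, n + 1):                         # single pass over the data
    i = rng.integers(n)                           # sampling with replacement
    grad_i = -(y[i] - X[i] @ theta) * X[i]        # gradient of f_i
    theta = theta - grad_i / (L * np.sqrt(t))     # step gamma_t = 1/(L sqrt(t))
    theta_bar += (theta - theta_bar) / (t + 1)    # running Polyak-Ruppert average
```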
Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) Σ_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λΩ(θ)

• Batch gradient descent: θ_t = θ_{t−1} − γ ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) Σ_{i=1}^n ∇f_i(θ_{t−1})
– Linear (exponential) convergence rate in O(e^{−t/κ}) for strongly convex problems
– Can be accelerated to O(e^{−t/√κ}) (Nesterov, 1983)
– Iteration complexity is linear in n

• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t ∇f_{i(t)}(θ_{t−1})
– Sampling with replacement: i(t) random element of {1, …, n}
– Convergence rate in O(κ/t)
– Iteration complexity is independent of n

• Goal = best of both worlds: linear rate with O(d) iteration cost and a simple choice of step size

(figure: log(excess cost) vs. time for the deterministic, stochastic, and new methods)
Recent progress in single machine optimization

• Variance reduction
– Exponential convergence with O(d) iteration cost
– SAG (Le Roux, Schmidt, and Bach, 2012)
– SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
– SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) Σ_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]

(with y_i^t the stored value at time t of the gradient of the i-th function)

• Running time to reach precision ε (with κ = condition number):

Stochastic gradient descent: d × κ × (1/ε)
Gradient descent: d × nκ × log(1/ε)
Variance reduction: d × (n + κ) × log(1/ε)

– Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
– Matching upper and lower bounds of complexity
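The update above can be sketched in SVRG form, where the stored gradients y_i are all taken at a periodic snapshot, so only the snapshot's full gradient needs to be kept; a toy noiseless least-squares instance (the data, step size 0.1/L_max, and epoch length are illustrative assumptions, not the tuned constants from the papers):

```python
import numpy as np

# SVRG-style variance reduction on a toy least-squares problem: stochastic
# gradients are corrected with gradients stored at a periodic snapshot,
# restoring a linear convergence rate with O(d) cost per iteration.
rng = np.random.default_rng(3)
n, d = 200, 5
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
y = X @ theta_star                     # noiseless, so theta_star is the minimizer

full_grad = lambda th: X.T @ (X @ th - y) / n
L_max = (X ** 2).sum(axis=1).max()     # largest per-sample smoothness constant
step = 0.1 / L_max                     # conservative illustrative step size

theta = np.zeros(d)
for epoch in range(30):
    snapshot = theta.copy()
    g_snap = full_grad(snapshot)       # full gradient, recomputed once per epoch
    for _ in range(n):
        i = rng.integers(n)
        g_i = (X[i] @ theta - y[i]) * X[i]
        g_i_old = (X[i] @ snapshot - y[i]) * X[i]
        theta -= step * (g_i - g_i_old + g_snap)   # variance-reduced update
```

The correction g_i_old − g_snap has zero mean over i, so the update is an unbiased gradient estimate whose variance vanishes as θ and the snapshot approach the optimum.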
Outline

1. Parametric supervised learning on a single machine
– Machine learning ≈ optimization of finite sums
– From batch to stochastic gradient methods
– Linearly convergent stochastic methods for convex problems

2. Machine learning over networks
– Centralized and decentralized methods
– From network averaging to optimization
– Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)

• Machine learning through optimization:

min_{θ ∈ R^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)

– f_i(θ): error of the model defined by θ on the dataset indexed by i
– Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_ij, θ⊤Φ(x_ij)) for m_i observations

• Each dataset/function f_i only accessible by node i in a graph

(figure: network with nodes 1, …, 9)

– Massive datasets, multiple machines/cores
– Communication/legal constraints

• Goal: minimize communication and local computation costs

• Why not simply distribute a simple single machine algorithm?
– (Accelerated) gradient descent (see, e.g., Nesterov, 2004): θ_t = θ_{t−1} − γ ∇g(θ_{t−1})
– Requires √κ log(1/ε) full gradient computations to reach precision ε
– Need to perform distributed averaging over a network
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ R, compute

θ∗ = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ ∈ R} (1/n) Σ_{i=1}^n (θ − ξ_i)²

• Centralized algorithms
– Compute a spanning tree with diameter ≤ 2∆
– Master/slave algorithm: ∆ communication steps + no error

(figure: network and a spanning tree extracted from it)

• Application to centralized distributed optimization
– √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
– "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)

• Robustness?
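The master/slave scheme can be sketched as two sweeps over a spanning tree: subtree sums travel up to the root, then the exact average is broadcast back down; the 9-node graph below is a hypothetical stand-in for the slide's network:

```python
from collections import deque

# Master/slave averaging over a BFS spanning tree (illustrative 9-node graph).
adj = {1: [3, 9], 2: [4, 5], 3: [1, 6], 4: [2, 7], 5: [2, 6],
       6: [3, 5], 7: [4, 8], 8: [7, 9], 9: [1, 8]}
values = {i: float(i) for i in adj}          # observation xi_i held at node i

# Build a spanning tree rooted at node 1 by breadth-first search
parent, order = {1: None}, [1]
queue = deque([1])
while queue:
    u = queue.popleft()
    for v in adj[u]:
        if v not in parent:
            parent[v] = u
            order.append(v)
            queue.append(v)

# Up phase: each node sends the sum (and size) of its subtree to its parent
subtree_sum = dict(values)
subtree_count = {i: 1 for i in adj}
for u in reversed(order):                    # process leaves first
    if parent[u] is not None:
        subtree_sum[parent[u]] += subtree_sum[u]
        subtree_count[parent[u]] += subtree_count[u]

average = subtree_sum[1] / subtree_count[1]  # the root now holds the exact mean
# Down phase: broadcasting `average` back down costs another tree-depth steps.
```

Both sweeps take at most ∆ communication rounds each, matching the "∆ communication steps + no error" claim on the slide.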
Classical algorithms for distributed averaging

• Decentralized algorithms – gossip (Boyd et al., 2006)
– Replace θ_i by a weighted average of its neighbors, Σ_{j=1}^n W_ij θ_j
– Potential asynchrony, changing network

• Synchronous gossip (all nodes simultaneously)
– Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
– Typical assumption: W symmetric doubly stochastic matrix
Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)
– Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
– Typical assumption: W symmetric doubly stochastic matrix
– Consequence: eigenvalues(W) ∈ [−1, 1]
– Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
– γ^{−1} = mixing time of the associated Markov chain
– Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)
Illustration of synchronous gossip

(figure: gossip iterations on the example network)
Parametric supervised machine learning

• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, …, n

• Prediction function h(x, θ) ∈ R parameterized by θ ∈ R^d

• Advertising: n ≥ 10^9
– Φ(x) ∈ {0, 1}^d, d ≥ 10^9
– Navigation history + ad

• Linear predictions
– h(x, θ) = θ⊤Φ(x)

(figure: six example images x_1, …, x_6 with labels y_1 = y_2 = y_3 = 1 and y_4 = y_5 = y_6 = −1)

– Neural networks (n, d ≥ 10^6): h(x, θ) = θ_m⊤ σ(θ_{m−1}⊤ σ(· · · θ_2⊤ σ(θ_1⊤ x) · · ·))

(figure: neural network diagram with parameters θ_1, θ_2, θ_3 mapping x to y)

• (Regularized) empirical risk minimization:

min_{θ ∈ R^d} (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ)) + λΩ(θ) = (1/n) Σ_{i=1}^n f_i(θ)

(data fitting term + regularizer)

– Least-squares regression: min_{θ ∈ R^d} (1/2n) Σ_{i=1}^n (y_i − h(x_i, θ))² + λΩ(θ)
– Logistic regression: min_{θ ∈ R^d} (1/n) Σ_{i=1}^n log(1 + exp(−y_i h(x_i, θ))) + λΩ(θ)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O(eminust(microL)) = O(eminustκ)
(small κ = Lmicro) (large κ = Lmicro)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
(small κ = Lmicro) (large κ = Lmicro)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
bull Acceleration (Nesterov 1983) second-order recursion
θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)
ndash Good choice of momentum term δt isin [0 1)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
bull Acceleration (Nesterov 1983) second-order recursion
θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)
ndash Good choice of momentum term δt isin [0 1)
g(θt)minus g(θlowast) 6 O(1t2)
g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust
radicκ) if micro-strongly convex
ndash Optimal rates after t = O(d) iterations (Nesterov 2004)
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr O(log log 1ε) iterations
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
bull Key insights for machine learning (Bottou and Bousquet 2008)
1 No need to optimize below statistical error
2 Cost functions are averages
3 Testing error is more important than training error
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
bull Key insights for machine learning (Bottou and Bousquet 2008)
1 No need to optimize below statistical error
2 Cost functions are averages
3 Testing error is more important than training error
Stochastic gradient descent (SGD) for finite sums

min_{θ∈ℝ^d} g(θ) = (1/n) ∑_{i=1}^n f_i(θ)

• Iteration: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) is a random element of {1, …, n}
  – Polyak-Ruppert averaging: θ̄_t = (1/(t+1)) ∑_{u=0}^t θ_u

• Convergence rate if each f_i is convex, L-smooth and g is μ-strongly convex:

  E g(θ̄_t) − g(θ∗) ≤ O(1/√t) if γ_t = 1/(L√t)
  E g(θ̄_t) − g(θ∗) ≤ O(L/(μt)) = O(κ/t) if γ_t = 1/(μt)

  – No adaptivity to strong convexity in general
  – Running-time complexity: O(d · κ/ε)

• NB: a single pass leads to bounds on the testing error
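The iteration above, with sampling with replacement and Polyak-Ruppert averaging, can be sketched on a toy least-squares problem (the data and the choice L = max_i ‖x_i‖² are assumptions made for the example):

```python
import numpy as np

# Toy least-squares problem (placeholder data): g(theta) = (1/n) sum_i (1/2)(x_i^T theta - y_i)^2
rng = np.random.default_rng(1)
n, d = 1000, 3
X = rng.standard_normal((n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)
theta_opt = np.linalg.solve(X.T @ X, X.T @ y)  # closed-form minimizer, for reference

L = (np.linalg.norm(X, axis=1) ** 2).max()  # per-sample smoothness constant (assumption)
theta = np.zeros(d)
theta_bar = np.zeros(d)
for t in range(1, 50001):
    i = rng.integers(n)                     # sampling with replacement
    g_i = (X[i] @ theta - y[i]) * X[i]      # gradient of f_i at theta
    theta = theta - g_i / (L * np.sqrt(t))  # step size gamma_t = 1/(L sqrt(t))
    theta_bar += (theta - theta_bar) / t    # running Polyak-Ruppert average
```

Each iteration touches a single observation, so its cost is independent of n; the averaged iterate θ̄_t is what the O(1/√t) guarantee applies to.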
Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) ∑_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λΩ(θ)

• Batch gradient descent: θ_t = θ_{t−1} − γ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) ∑_{i=1}^n ∇f_i(θ_{t−1})
  – Exponential convergence rate in O(e^{−t/κ}) for strongly convex problems
  – Can be accelerated to O(e^{−t/√κ}) (Nesterov, 1983)
  – Iteration complexity is linear in n

• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t ∇f_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) is a random element of {1, …, n}
  – Convergence rate in O(κ/t)
  – Iteration complexity is independent of n
Stochastic vs. deterministic methods

• Goal = best of both worlds: linear rate with O(d) iteration cost and a simple choice of step size

[Figure: log(excess cost) vs. time — the deterministic method decreases steadily but with costly iterations, the stochastic method drops quickly then stalls, and the new methods combine both behaviors]
Recent progress in single machine optimization

• Variance reduction
  – Exponential convergence with O(d) iteration cost
  – SAG (Le Roux, Schmidt, and Bach, 2012)
  – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
  – SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) ∑_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]

(with y_i^t the stored value at time t of the gradient of the i-th function)

• Running-time to reach precision ε (with κ = condition number)

  Stochastic gradient descent:  d × κ × 1/ε
  Gradient descent:             d × nκ × log(1/ε)
  Variance reduction:           d × (n + κ) × log(1/ε)

  – Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
  – Matching upper and lower bounds of complexity
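The variance-reduced update above can be sketched in the style of SAGA, keeping a table of stored gradients y_i and their running mean; this is an illustrative toy implementation on regularized least-squares, not the reference code of any of the cited papers:

```python
import numpy as np

# Toy regularized least-squares: f_i(theta) = (1/2)(x_i^T theta - y_i)^2 + (lam/2)||theta||^2
rng = np.random.default_rng(2)
n, d, lam = 500, 4, 0.1
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.05 * rng.standard_normal(n)
theta_opt = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)  # exact minimizer

def grad_i(theta, i):
    return (X[i] @ theta - y[i]) * X[i] + lam * theta

L = (np.linalg.norm(X, axis=1) ** 2).max() + lam  # per-sample smoothness
theta = np.zeros(d)
table = np.array([grad_i(theta, i) for i in range(n)])  # stored gradients y_i
avg = table.mean(axis=0)

for _ in range(50 * n):  # roughly 50 effective passes over the data
    i = rng.integers(n)
    g = grad_i(theta, i)
    theta = theta - (g - table[i] + avg) / (3.0 * L)  # variance-reduced step
    avg += (g - table[i]) / n                         # keep the mean up to date
    table[i] = g
```

The correction term (g − table[i] + avg) has the same expectation as the plain stochastic gradient but vanishing variance at the optimum, which is what turns the sublinear SGD rate into a linear one.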
Outline

1. Parametric supervised learning on a single machine
   – Machine learning ≈ optimization of finite sums
   – From batch to stochastic gradient methods
   – Linearly-convergent stochastic methods for convex problems

2. Machine learning over networks
   – Centralized and decentralized methods
   – From network averaging to optimization
   – Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)

• Machine learning through optimization:

min_{θ∈ℝ^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)

  – f_i(θ): error of the model defined by θ on the dataset indexed by i
  – Example: f_i(θ) = (1/m_i) ∑_{j=1}^{m_i} ℓ(y_{ij}, θ⊤Φ(x_{ij})) for m_i observations

• Each dataset/function f_i is only accessible by node i in a graph

[Figure: network of nodes 1–9]

  – Massive datasets, multiple machines/cores
  – Communication/legal constraints

• Goal: minimize communication and local computation costs
Distribution in machine learning (and beyond)

• Machine learning through optimization:

min_{θ∈ℝ^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)

  – f_i(θ): error of the model defined by θ on the dataset indexed by i

• Why not simply distribute a simple single machine algorithm?
  – (Accelerated) gradient descent (see, e.g., Nesterov, 2004): θ_t = θ_{t−1} − γ∇g(θ_{t−1})
  – Requires √κ log(1/ε) full gradient computations to reach precision ε
  – Need to perform distributed averaging over a network
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ ℝ, compute

θ∗ = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈ℝ} (1/n) ∑_{i=1}^n (θ − ξ_i)²

• Centralized algorithms
  – Compute a spanning tree with diameter ≤ 2∆
  – Master/slave algorithm: ∆ communication steps + no error

• Application to centralized distributed optimization
  – √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
  – "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)

• Robustness?
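The master/slave scheme can be sketched as a two-pass traversal of an assumed spanning tree: values are summed upward toward the root, then the mean is broadcast back down (∆ communication steps each way, no error):

```python
from collections import defaultdict

def tree_average(values, parent):
    """Average values over a spanning tree: sum upward to the root, broadcast the mean down.

    values: dict node -> local observation; parent: dict node -> parent (root maps to None).
    """
    children = defaultdict(list)
    root = None
    for node, p in parent.items():
        if p is None:
            root = node
        else:
            children[p].append(node)

    def subtree_sum(node):  # upward pass: slaves send partial sums to the master
        return values[node] + sum(subtree_sum(c) for c in children[node])

    mean = subtree_sum(root) / len(values)
    return {node: mean for node in values}  # downward pass: master broadcasts the mean

# Chain 1-2-3-4 rooted at node 1, with local observations xi_i = i
out = tree_average({1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0}, {1: None, 2: 1, 3: 2, 4: 3})
```

The result is exact after one round trip, but the scheme depends on the tree (and in particular the root) staying alive, which is the robustness concern raised above.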
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ ℝ, compute θ∗ = (1/n) ∑_{i=1}^n ξ_i

• Decentralized algorithms – gossip (Boyd et al., 2006)
  – Replace θ_i by a weighted average of its neighbors: ∑_{j=1}^n W_{ij} θ_j
  – Potential asynchrony, changing network

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = Wθ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = Wθ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
  – Consequence: eigenvalues(W) ∈ [−1, 1]
  – Eigengap: γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
  – γ^{−1} = mixing time of the associated Markov chain
  – Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)
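A small sketch of synchronous gossip on an assumed ring of n = 9 nodes, with a symmetric doubly stochastic W (half weight on itself, a quarter on each neighbor); it also computes the eigengap γ that governs the number of iterations:

```python
import numpy as np

# Ring of n nodes; W averages each node with its two neighbors (symmetric, doubly stochastic)
n = 9
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

xi = np.arange(n, dtype=float)  # local observations xi_1, ..., xi_n
mean = xi.mean()

eigs = np.sort(np.linalg.eigvalsh(W))[::-1]
gamma = eigs[0] - eigs[1]       # eigengap 1 - lambda_2(W)

theta, t = xi.copy(), 0
while np.abs(theta - mean).max() > 1e-6:
    theta = W @ theta           # synchronous gossip step theta_t = W theta_{t-1}
    t += 1
# t scales like (1/gamma) * log(1/epsilon)
```

Every node converges to the global mean using only neighbor communication; for the ring, γ shrinks like 1/n², which is why graph topology matters so much for gossip.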
Illustration of synchronous gossip
Parametric supervised machine learning

• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, …, n

• Prediction function h(x, θ) ∈ ℝ parameterized by θ ∈ ℝ^d

• Advertising: n > 10⁹
  – Φ(x) ∈ {0, 1}^d, d > 10⁹
  – Navigation history + ad

• Linear predictions
  – h(x, θ) = θ⊤Φ(x)

• Example: images x_1, …, x_6 with labels y_1 = y_2 = y_3 = 1 and y_4 = y_5 = y_6 = −1
  – Neural networks (n, d > 10⁶): h(x, θ) = θ_m⊤ σ(θ_{m−1}⊤ σ(⋯ θ_2⊤ σ(θ_1⊤ x)))

[Figure: network diagram mapping x to y through layers θ_1, θ_2, θ_3]
Parametric supervised machine learning

• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, …, n, i.i.d.

• Prediction function h(x, θ) ∈ ℝ parameterized by θ ∈ ℝ^d

• (Regularized) empirical risk minimization:

min_{θ∈ℝ^d} (1/n) ∑_{i=1}^n ℓ(y_i, h(x_i, θ)) + λΩ(θ) = (1/n) ∑_{i=1}^n f_i(θ)

(data fitting term + regularizer)

  – Least-squares regression: ℓ(y_i, h(x_i, θ)) = ½ (y_i − h(x_i, θ))²
  – Logistic regression: ℓ(y_i, h(x_i, θ)) = log(1 + exp(−y_i h(x_i, θ)))

• Optimization: optimization of the regularized risk (training cost)

• Statistics: guarantees on E_{p(x,y)} ℓ(y, h(x, θ)) (testing cost)
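The empirical risk minimization objective above can be written down directly; the sketch below instantiates it with the logistic loss, a linear predictor h(x, θ) = θ⊤Φ(x) with Φ taken as the identity, and random placeholder data (all assumptions made for the example):

```python
import numpy as np

# Placeholder data; Phi taken as the identity so h(x, theta) = theta^T x
rng = np.random.default_rng(3)
n, d, lam = 100, 3, 0.1
X = rng.standard_normal((n, d))
y = rng.choice([-1.0, 1.0], size=n)

def f_i(theta, i):
    # f_i = logistic loss + (lam/2)||theta||^2 (regularizer Omega(theta) = ||theta||^2 / 2)
    return np.log1p(np.exp(-y[i] * (X[i] @ theta))) + 0.5 * lam * theta @ theta

def g(theta):
    # training cost: the average (1/n) sum_i f_i(theta)
    return sum(f_i(theta, i) for i in range(n)) / n

# At theta = 0 every example incurs the same logistic loss log(2)
value_at_zero = g(np.zeros(d))
```

Writing the cost as an average of n per-observation terms f_i is exactly the finite-sum structure that the batch, stochastic, and variance-reduced methods of the first part exploit.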
Smoothness and (strong) convexity

• A function g: ℝ^d → ℝ is L-smooth if and only if it is twice differentiable and

∀θ ∈ ℝ^d, |eigenvalues[g″(θ)]| ≤ L

[Figure: a smooth function vs. a non-smooth one]

• Machine learning
  – With g(θ) = (1/n) ∑_{i=1}^n ℓ(y_i, h(x_i, θ))
  – Smooth prediction function θ ↦ h(x_i, θ) + smooth loss
Smoothness and (strong) convexity

• A twice differentiable function g: ℝ^d → ℝ is convex if and only if

∀θ ∈ ℝ^d, eigenvalues[g″(θ)] ≥ 0

• It is μ-strongly convex if and only if

∀θ ∈ ℝ^d, eigenvalues[g″(θ)] ≥ μ

[Figure: a convex function vs. a strongly convex one]

  – Condition number: κ = L/μ ≥ 1 (well conditioned when κ is small, ill conditioned when κ is large)
Smoothness and (strong) convexity

• Convexity in machine learning
  – With g(θ) = (1/n) ∑_{i=1}^n ℓ(y_i, h(x_i, θ))
  – Convex loss and linear predictions h(x, θ) = θ⊤Φ(x)

• Relevance of convex optimization
  – Easier design and analysis of algorithms
  – Global minimum vs. local minimum vs. stationary points
  – Gradient-based algorithms only need convexity for their analysis
Smoothness and (strong) convexity

• Strong convexity in machine learning
  – With g(θ) = (1/n) ∑_{i=1}^n ℓ(y_i, h(x_i, θ))
  – Strongly convex loss and linear predictions h(x, θ) = θ⊤Φ(x)
  – Invertible covariance matrix (1/n) ∑_{i=1}^n Φ(x_i)Φ(x_i)⊤ ⇒ n ≥ d
  – Even when μ > 0, μ may be arbitrarily small

• Adding regularization by (μ/2)‖θ‖²
  – Creates additional bias unless μ is small, but reduces variance
  – Typically L/√n ≥ μ ≥ L/n ⇒ κ ∈ [√n, n]
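For least-squares with linear predictions, the Hessian is the (constant) covariance matrix, so L, μ, and κ can be read off its eigenvalues; a small sketch with random placeholder features:

```python
import numpy as np

# Placeholder features; for least-squares the Hessian is the covariance matrix
rng = np.random.default_rng(4)
n, d = 200, 5
Phi = rng.standard_normal((n, d))

H = Phi.T @ Phi / n                 # g''(theta) for g(theta) = (1/2n) sum_i (y_i - theta^T Phi(x_i))^2
eigs = np.linalg.eigvalsh(H)
L, mu = eigs.max(), eigs.min()      # smoothness and strong convexity constants
kappa = L / mu                      # condition number

reg = 0.01
kappa_reg = (L + reg) / (mu + reg)  # adding (reg/2)||theta||^2 shifts all eigenvalues up
```

Here n > d makes the covariance invertible (μ > 0), and the regularizer improves the condition number at the price of biasing the solution, exactly the trade-off stated above.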
Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on ℝ^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) (with line search)

g(θ_t) − g(θ∗) ≤ O(1/t)
g(θ_t) − g(θ∗) ≤ O((1 − 1/κ)^t) = O(e^{−t/κ}) if μ-strongly convex

• Acceleration (Nesterov, 1983): second-order recursion

θ_t = η_{t−1} − γ_t g′(η_{t−1}) and η_t = θ_t + δ_t(θ_t − θ_{t−1})

  – Good choice of momentum term δ_t ∈ [0, 1)

g(θ_t) − g(θ∗) ≤ O(1/t²)
g(θ_t) − g(θ∗) ≤ O((1 − 1/√κ)^t) = O(e^{−t/√κ}) if μ-strongly convex

  – Optimal rates after t = O(d) iterations (Nesterov, 2004)
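A sketch of the second-order recursion above on a strongly convex quadratic, using the classical constant momentum δ = (√κ − 1)/(√κ + 1) (an assumption; the slide leaves δ_t generic), compared with plain gradient descent:

```python
import numpy as np

# Strongly convex quadratic g(theta) = (1/2) theta^T H theta - b^T theta (placeholder data)
rng = np.random.default_rng(5)
d = 20
A = rng.standard_normal((d, d))
H = A.T @ A / d + 0.05 * np.eye(d)
b = rng.standard_normal(d)
theta_opt = np.linalg.solve(H, b)

eigs = np.linalg.eigvalsh(H)
L, mu = eigs.max(), eigs.min()
gamma = 1.0 / L
delta = (np.sqrt(L / mu) - 1.0) / (np.sqrt(L / mu) + 1.0)  # constant momentum term

def run(accelerated, T=300):
    theta = prev = np.zeros(d)
    for _ in range(T):
        eta = theta + delta * (theta - prev) if accelerated else theta
        prev = theta
        theta = eta - gamma * (H @ eta - b)  # gradient step at the extrapolated point
    return np.linalg.norm(theta - theta_opt)

err_gd, err_acc = run(False), run(True)
```

With the same per-iteration cost, the accelerated recursion contracts like (1 − 1/√κ)^t instead of (1 − 1/κ)^t, which is a dramatic difference when κ is large.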
(3) Methodological and scientific progress
Recent progress in perception (vision audio text)
person ride dog
From translategooglefr From Peyre et al (2017)
(1) Massive data
(2) Computing power
(3) Methodological and scientific progress
ldquoIntelligencerdquo = models + algorithms + data
+ computing power
Recent progress in perception (vision audio text)
person ride dog
From translategooglefr From Peyre et al (2017)
(1) Massive data
(2) Computing power
(3) Methodological and scientific progress
ldquoIntelligencerdquo = models + algorithms + data
+ computing power
Outline

1. Parametric supervised learning on a single machine
– Machine learning ≈ optimization of finite sums
– From batch to stochastic gradient methods
– Linearly-convergent stochastic methods for convex problems

2. Machine learning over networks
– Centralized and decentralized methods
– From network averaging to optimization
– Distributing the fastest single machine algorithms
Parametric supervised machine learning

• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, …, n
• Prediction function h(x, θ) ∈ ℝ parameterized by θ ∈ ℝ^d

• Advertising: n ≥ 10^9
– Φ(x) ∈ {0, 1}^d, d ≥ 10^9
– Navigation history + ad
• Linear predictions
– h(x, θ) = θ^⊤ Φ(x)

• Image classification: images x_1, …, x_6 with labels y_1 = y_2 = y_3 = 1 and y_4 = y_5 = y_6 = −1
• Neural networks (n, d ≥ 10^6): h(x, θ) = θ_m^⊤ σ(θ_{m−1}^⊤ σ(· · · θ_2^⊤ σ(θ_1^⊤ x)))
[Figure: a network mapping x to y through weights θ_1, θ_2, θ_3]
Parametric supervised machine learning

• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, …, n, i.i.d.
• Prediction function h(x, θ) ∈ ℝ parameterized by θ ∈ ℝ^d
• (Regularized) empirical risk minimization:

min_{θ∈ℝ^d} (1/n) ∑_{i=1}^n ℓ(y_i, h(x_i, θ)) + λΩ(θ) = (1/n) ∑_{i=1}^n f_i(θ)

(data fitting term + regularizer)

• Least-squares regression: min_{θ∈ℝ^d} (1/(2n)) ∑_{i=1}^n (y_i − h(x_i, θ))^2 + λΩ(θ)
• Logistic regression: min_{θ∈ℝ^d} (1/n) ∑_{i=1}^n log(1 + exp(−y_i h(x_i, θ))) + λΩ(θ)

• Optimization: optimization of the regularized risk (training cost)
• Statistics: guarantees on E_{p(x,y)} ℓ(y, h(x, θ)) (testing cost)
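As a concrete instance of the objective above, here is a minimal NumPy sketch (not from the slides; the synthetic data and the value of λ are made up for illustration) of the regularized logistic-regression training cost and its gradient, with linear predictions h(x, θ) = θ^⊤x:

```python
import numpy as np

def logistic_erm(theta, X, y, lam):
    """Regularized empirical risk for the logistic loss with linear
    predictions h(x, theta) = theta^T x.  Returns (cost, gradient)."""
    n = X.shape[0]
    margins = y * (X @ theta)                      # y_i * h(x_i, theta)
    cost = np.mean(np.log1p(np.exp(-margins))) + 0.5 * lam * theta @ theta
    # d/d theta of log(1 + exp(-m_i)) is -y_i x_i / (1 + exp(m_i))
    grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / n + lam * theta
    return cost, grad

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
theta_true = rng.standard_normal(5)
y = np.sign(X @ theta_true)
cost, grad = logistic_erm(np.zeros(5), X, y, lam=0.1)
```

At θ = 0 every margin is zero, so the cost equals log 2, which gives a quick sanity check of the implementation.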
Smoothness and (strong) convexity

• A function g : ℝ^d → ℝ is L-smooth if and only if it is twice differentiable and
∀θ ∈ ℝ^d, |eigenvalues[g″(θ)]| ≤ L
– Machine learning: with g(θ) = (1/n) ∑_{i=1}^n ℓ(y_i, h(x_i, θ)), this holds for a smooth prediction function θ ↦ h(x_i, θ) + a smooth loss

• A twice differentiable function g : ℝ^d → ℝ is convex if and only if
∀θ ∈ ℝ^d, eigenvalues[g″(θ)] ≥ 0

• A twice differentiable function g : ℝ^d → ℝ is μ-strongly convex if and only if
∀θ ∈ ℝ^d, eigenvalues[g″(θ)] ≥ μ
– Condition number κ = L/μ ≥ 1

• Convexity in machine learning
– With g(θ) = (1/n) ∑_{i=1}^n ℓ(y_i, h(x_i, θ))
– Convex loss and linear predictions h(x, θ) = θ^⊤ Φ(x)

• Relevance of convex optimization
– Easier design and analysis of algorithms
– Global minimum vs. local minimum vs. stationary points
– Gradient-based algorithms only need convexity for their analysis

• Strong convexity in machine learning
– Strongly convex loss and linear predictions h(x, θ) = θ^⊤ Φ(x)
– Invertible covariance matrix (1/n) ∑_{i=1}^n Φ(x_i)Φ(x_i)^⊤ ⇒ n ≥ d
– Even when μ > 0, μ may be arbitrarily small

• Adding regularization by (μ/2)‖θ‖²
– Creates additional bias unless μ is small, but reduces variance
– Typically L/√n ≥ μ ≥ L/n ⇒ κ ∈ [√n, n]
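In the regularized least-squares case the constants L, μ, and κ above can be read directly off the Hessian H = (1/n) X^⊤X + μI. A small NumPy sketch (synthetic data; `mu_reg` plays the role of the added regularization μ, with the L/√n scaling mentioned on the slide):

```python
import numpy as np

# Hessian of g(theta) = ||X theta - y||^2 / (2n) + (mu/2) ||theta||^2 is
# H = X^T X / n + mu I; its extreme eigenvalues are the smoothness constant L
# and the strong-convexity constant mu, giving kappa = L / mu.
rng = np.random.default_rng(0)
n, d = 1000, 20
X = rng.standard_normal((n, d))
mu_reg = 1.0 / np.sqrt(n)                 # regularization ~ L / sqrt(n)
H = X.T @ X / n + mu_reg * np.eye(d)
eigs = np.linalg.eigvalsh(H)              # ascending order
L, mu = eigs[-1], eigs[0]
kappa = L / mu
```

Since (1/n) X^⊤X is positive semidefinite, μ is at least `mu_reg`, and κ ≥ 1 always.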
Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on ℝ^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) (possibly with line search)
– g(θ_t) − g(θ∗) ≤ O(1/t) for convex functions
– g(θ_t) − g(θ∗) ≤ O((1 − 1/κ)^t) = O(e^{−t/κ}) if μ-strongly convex
⇔ O(κ log(1/ε)) iterations, complexity = O(nd · κ log(1/ε))

• Acceleration (Nesterov, 1983): second-order recursion
θ_t = η_{t−1} − γ_t g′(η_{t−1}) and η_t = θ_t + δ_t(θ_t − θ_{t−1})
– Good choice of momentum term δ_t ∈ [0, 1)
– g(θ_t) − g(θ∗) ≤ O(1/t²), and O((1 − 1/√κ)^t) = O(e^{−t/√κ}) if μ-strongly convex
– Optimal rates after t = O(d) iterations (Nesterov, 2004)

• Newton method: θ_t = θ_{t−1} − g″(θ_{t−1})^{−1} g′(θ_{t−1})
– O(e^{−ρ 2^t}) quadratic rate ⇔ O(log log(1/ε)) iterations, complexity = O((nd² + d³) · log log(1/ε))

• Key insights for machine learning (Bottou and Bousquet, 2008)
1. No need to optimize below statistical error
2. Cost functions are averages
3. Testing error is more important than training error
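The gap between the O(e^{−t/κ}) and O(e^{−t/√κ}) rates above is easy to observe on a strongly convex quadratic. A sketch (synthetic problem; the constant momentum δ = (√κ − 1)/(√κ + 1) is the classical choice for the strongly convex case, one valid instance of "good choice of δ_t"):

```python
import numpy as np

# Gradient descent vs. Nesterov acceleration on g(theta) = theta^T H theta / 2 - b^T theta.
rng = np.random.default_rng(0)
d = 50
A = rng.standard_normal((d, d))
H = A.T @ A / d + 0.01 * np.eye(d)        # mu-strongly convex, L-smooth Hessian
eigs = np.linalg.eigvalsh(H)
L, mu = eigs[-1], eigs[0]
kappa = L / mu
b = rng.standard_normal(d)
theta_star = np.linalg.solve(H, b)
g = lambda th: 0.5 * th @ H @ th - b @ th
grad = lambda th: H @ th - b

def gd(T):
    th = np.zeros(d)
    for _ in range(T):                    # theta_t = theta_{t-1} - g'(theta_{t-1}) / L
        th -= grad(th) / L
    return g(th) - g(theta_star)          # suboptimality

def nesterov(T):
    th, eta = np.zeros(d), np.zeros(d)
    delta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
    for _ in range(T):
        th_new = eta - grad(eta) / L      # gradient step from the extrapolated point
        eta = th_new + delta * (th_new - th)
        th = th_new
    return g(th) - g(theta_star)
```

For this κ (a few hundred), 200 accelerated iterations reduce the suboptimality far more than 200 plain gradient steps.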
Stochastic gradient descent (SGD) for finite sums

min_{θ∈ℝ^d} g(θ) = (1/n) ∑_{i=1}^n f_i(θ)

• Iteration: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})
– Sampling with replacement: i(t) random element of {1, …, n}
– Polyak-Ruppert averaging: θ̄_t = (1/(t+1)) ∑_{u=0}^t θ_u

• Convergence rate if each f_i is convex L-smooth and g is μ-strongly convex:
– E g(θ̄_t) − g(θ∗) ≤ O(1/√t) if γ_t = 1/(L√t)
– E g(θ̄_t) − g(θ∗) ≤ O(L/(μt)) = O(κ/t) if γ_t = 1/(μt)
– No adaptivity to strong convexity in general
– Running-time complexity: O(d · κ/ε)

• NB: a single pass leads to bounds on the testing error
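On the toy finite sum g(θ) = (1/n) ∑_i (θ − ξ_i)²/2 (where μ = L = 1 and the minimizer is the mean of the ξ_i), the SGD iteration with γ_t = 1/(μt) and Polyak-Ruppert averaging is a few lines. A sketch with made-up data:

```python
import numpy as np

# SGD on g(theta) = (1/n) sum_i (theta - xi_i)^2 / 2, so f_i'(theta) = theta - xi_i.
rng = np.random.default_rng(0)
n = 10_000
xi = rng.standard_normal(n) + 3.0          # dataset; minimizer is xi.mean()
theta, theta_avg = 0.0, 0.0
for t in range(1, n + 1):
    i = rng.integers(n)                    # sampling with replacement
    theta -= (theta - xi[i]) / t           # gamma_t = 1/(mu t) with mu = 1
    theta_avg += (theta - theta_avg) / t   # running Polyak-Ruppert average
```

For this quadratic, γ_t = 1/t makes θ_t exactly the running average of the sampled ξ_i, so one pass lands near the mean.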
Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) ∑_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λΩ(θ)

• Batch gradient descent: θ_t = θ_{t−1} − γ ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) ∑_{i=1}^n ∇f_i(θ_{t−1})
– Linear (i.e., exponential) convergence rate in O(e^{−t/κ}) for strongly convex problems
– Can be accelerated to O(e^{−t/√κ}) (Nesterov, 1983)
– Iteration complexity is linear in n

• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t ∇f_{i(t)}(θ_{t−1})
– Sampling with replacement: i(t) random element of {1, …, n}
– Convergence rate in O(κ/t)
– Iteration complexity is independent of n

• Goal = best of both worlds: linear rate with O(d) iteration cost, and a simple choice of step size

[Figure: log(excess cost) vs. time — the deterministic method converges linearly but with costly iterations, the stochastic method has cheap iterations but a sublinear rate, and the new methods combine both advantages.]
Recent progress in single machine optimization

• Variance reduction
– Exponential convergence with O(d) iteration cost
– SAG (Le Roux, Schmidt, and Bach, 2012)
– SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
– SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

θ_t = θ_{t−1} − γ [∇f_{i(t)}(θ_{t−1}) + (1/n) ∑_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1}]

(with y_i^t the stored value at time t of the gradient of the i-th function)

• Running time to reach precision ε (with κ = condition number):
– Stochastic gradient descent: d × κ × (1/ε)
– Gradient descent: d × nκ × log(1/ε)
– Variance reduction: d × (n + κ) × log(1/ε)
– Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
– Matching upper and lower bounds of complexity
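The variance-reduced update above (in its SAGA form) can be sketched on the toy quadratics f_i(θ) = (θ − ξ_i)²/2, where each gradient is θ − ξ_i. Data and the step size γ = 1/(3L) are illustrative choices, not from the slides:

```python
import numpy as np

# SAGA: theta_t = theta_{t-1} - gamma [ f'_{i(t)}(theta_{t-1}) - y_{i(t)} + mean(y) ],
# with y_i the last stored gradient of f_i.  O(1) cost per step, linear convergence.
rng = np.random.default_rng(0)
n = 100
xi = rng.standard_normal(n) + 3.0      # minimizer of the finite sum is xi.mean()
theta = 0.0
y = theta - xi                          # stored gradients y_i = f_i'(theta_0)
y_mean = y.mean()
gamma = 1.0 / 3.0                       # ~ 1/(3L) with L = 1 here
for _ in range(100 * n):
    i = rng.integers(n)
    g_i = theta - xi[i]                 # fresh gradient of the sampled function
    theta -= gamma * (g_i - y[i] + y_mean)
    y_mean += (g_i - y[i]) / n          # keep the running mean of stored gradients
    y[i] = g_i
```

After a few hundred effective passes the iterate matches the exact mean to high precision, unlike plain SGD with a constant step, whose variance would not vanish.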
Outline

1. Parametric supervised learning on a single machine
– Machine learning ≈ optimization of finite sums
– From batch to stochastic gradient methods
– Linearly-convergent stochastic methods for convex problems

2. Machine learning over networks
– Centralized and decentralized methods
– From network averaging to optimization
– Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)

• Machine learning through optimization:

min_{θ∈ℝ^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)

– f_i(θ): error of the model defined by θ on the dataset indexed by i
– Example: f_i(θ) = (1/m_i) ∑_{j=1}^{m_i} ℓ(y_{ij}, θ^⊤Φ(x_{ij})) if node i holds m_i observations

• Each dataset / function f_i only accessible by node i in a graph
– Massive datasets, multiple machines/cores
– Communication / legal constraints
• Goal: minimize communication and local computation costs

• Why not simply distribute a simple single machine algorithm?
– (Accelerated) gradient descent (see, e.g., Nesterov, 2004): θ_t = θ_{t−1} − γ ∇g(θ_{t−1})
– Requires √κ log(1/ε) full gradient computations to reach precision ε
– Need to perform distributed averaging over a network
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ ℝ, compute

θ∗ = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈ℝ} (1/n) ∑_{i=1}^n (θ − ξ_i)²

• Centralized algorithms
– Compute a spanning tree with diameter ≤ 2∆
– Master/slave algorithm: ∆ communication steps + no error

• Application to centralized distributed optimization
– √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
– "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)

• Robustness?
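The master/slave scheme above can be sketched in a few lines: build a spanning tree rooted at the master, aggregate partial sums up the tree, and the master recovers the exact average (a broadcast back down would then take ∆ more steps). The graph, node values, and function name are illustrative:

```python
from collections import deque

def tree_average(adjacency, values, master=0):
    """Exact averaging along a BFS spanning tree rooted at the master node."""
    n = len(values)
    parent = {master: None}
    order = [master]
    q = deque([master])
    while q:                              # BFS builds the spanning tree
        u = q.popleft()
        for v in adjacency[u]:
            if v not in parent:
                parent[v] = u
                order.append(v)
                q.append(v)
    total = list(values)
    for u in reversed(order):             # up-sweep: children send sums to parents
        if parent[u] is not None:
            total[parent[u]] += total[u]
    return total[master] / n              # master now holds the exact sum

graph = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
avg = tree_average(graph, [1.0, 2.0, 3.0, 4.0])   # exact mean: 2.5
```

Unlike gossip, this produces the average with no error after a number of rounds proportional to the tree depth, at the price of a coordination point.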
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ ℝ, compute

θ∗ = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈ℝ} (1/n) ∑_{i=1}^n (θ − ξ_i)²

• Decentralized algorithms - gossip (Boyd et al., 2006)
– Replace θ_i by a weighted average of its neighbors: ∑_{j=1}^n W_{ij} θ_j
– Potential asynchrony, changing network

• Synchronous gossip (all nodes simultaneously)
– Main iteration: θ_t = Wθ_{t−1} = W^t θ_0 = W^t ξ
– Typical assumption: W symmetric doubly stochastic matrix
Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)
– Main iteration: θ_t = Wθ_{t−1} = W^t θ_0 = W^t ξ
– Typical assumption: W symmetric doubly stochastic matrix
– Consequence: eigenvalues(W) ∈ [−1, 1]
– Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
– γ^{−1} = mixing time of the associated Markov chain
– Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)
Illustration of synchronous gossip
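The iteration θ_t = Wθ_{t−1} can be simulated directly. A sketch on a ring of n = 20 nodes, where each node averages itself with its two neighbors (one symmetric doubly stochastic choice of W, made up for illustration):

```python
import numpy as np

# Synchronous gossip on a ring: every node keeps weight 1/2 on itself and
# 1/4 on each neighbor, so W is symmetric and doubly stochastic.
n = 20
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

rng = np.random.default_rng(0)
xi = rng.standard_normal(n)               # local observations xi_i
theta = xi.copy()
for _ in range(500):                      # theta_t = W theta_{t-1}
    theta = W @ theta

eigs = np.sort(np.linalg.eigvalsh(W))
gap = 1.0 - eigs[-2]                      # eigengap gamma = 1 - lambda_2(W)
err = np.max(np.abs(theta - xi.mean()))   # distance of all nodes to the mean
```

Here the eigengap of the ring is small (of order 1/n²), which is why several hundred iterations are needed before every node's value is close to the global mean.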
![Page 6: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/6.jpg)
Recent progress in perception (vision audio text)
person ride dog
From translategooglefr From Peyre et al (2017)
(1) Massive data
(2) Computing power
(3) Methodological and scientific progress
ldquoIntelligencerdquo = models + algorithms + data
+ computing power
Recent progress in perception (vision audio text)
person ride dog
From translategooglefr From Peyre et al (2017)
(1) Massive data
(2) Computing power
(3) Methodological and scientific progress
ldquoIntelligencerdquo = models + algorithms + data
+ computing power
Outline
1 Parametric supervised learning on a single machine
minus Machine learning asymp optimization of finite sums
minus From batch to stochastic gradient methods
minus Linearly-convergent stochastic methods for convex problems
2 Machine learning over networks
ndash Centralized and decentralized methods
ndash From network averaging to optimization
ndash Distributing the fastest single machine algorithms
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull Advertising n gt 109
ndash Φ(x) isin 0 1d d gt 109
ndash Navigation history + ad
- Linear predictions
- h(x θ) = θ⊤Φ(x)
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull Advertising n gt 109
ndash Φ(x) isin 0 1d d gt 109
ndash Navigation history + ad
bull Linear predictions
ndash h(x θ) = θ⊤Φ(x)
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
x1 x2 x3 x4 x5 x6
y1 = 1 y2 = 1 y3 = 1 y4 = minus1 y5 = minus1 y6 = minus1
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
x1 x2 x3 x4 x5 x6
y1 = 1 y2 = 1 y3 = 1 y4 = minus1 y5 = minus1 y6 = minus1
ndash Neural networks (n d gt 106) h(x θ) = θ⊤mσ(θ⊤mminus1σ(middot middot middot θ⊤2 σ(θ⊤1 x))
x y
θ1θ3
θ2
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
ℓ(
yi h(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
data fitting term + regularizer
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
2n
nsum
i=1
(
yi minus h(xi θ))2
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
(least-squares regression)
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
log(
1 + exp(minusyih(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
(logistic regression)
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
ℓ(
yi h(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
data fitting term + regularizer
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
ℓ(
yi h(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
data fitting term + regularizer
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
ℓ(
yi h(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
data fitting term + regularizer
bull Optimization optimization of regularized risk training cost
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n iid
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
ℓ(
yi h(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
data fitting term + regularizer
bull Optimization optimization of regularized risk training cost
bull Statistics guarantees on Ep(xy)ℓ(y h(x θ)) testing cost
Smoothness and (strong) convexity
bull A function g Rd rarr R is L-smooth if and only if it is twice
differentiable and
forallθ isin Rd
∣
∣eigenvalues[
gprimeprime(θ)]∣
∣ 6 L
smooth non-smooth
Smoothness and (strong) convexity
bull A function g Rd rarr R is L-smooth if and only if it is twice
differentiable and
forallθ isin Rd
∣
∣eigenvalues[
gprimeprime(θ)]∣
∣ 6 L
bull Machine learning
ndash with g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Smooth prediction function θ 7rarr h(xi θ) + smooth loss
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt 0
convex
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
convex
stronglyconvex
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
ndash Condition number κ = Lmicro gt 1
(small κ = Lmicro) (large κ = Lmicro)
Smoothness and (strong) convexity

• Convexity in machine learning:
  – With g(θ) = (1/n) ∑_{i=1}^n ℓ(y_i, h(x_i, θ))
  – Convex loss and linear predictions h(x, θ) = θ⊤Φ(x)

• Relevance of convex optimization:
  – Easier design and analysis of algorithms
  – Global minimum vs. local minima vs. stationary points
  – Gradient-based algorithms only need convexity for their analysis
Smoothness and (strong) convexity

• Strong convexity in machine learning:
  – With g(θ) = (1/n) ∑_{i=1}^n ℓ(y_i, h(x_i, θ))
  – Strongly convex loss and linear predictions h(x, θ) = θ⊤Φ(x)
  – Invertible covariance matrix (1/n) ∑_{i=1}^n Φ(x_i)Φ(x_i)⊤ ⇒ n ≥ d
  – Even when μ > 0, μ may be arbitrarily small

• Adding regularization by (μ/2)‖θ‖²:
  – Creates additional bias unless μ is small, but reduces variance
  – Typically L/√n ≥ μ ≥ L/n ⇒ κ ∈ [√n, n]
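As an illustration (not from the slides), the constants L, μ and κ of a regularized least-squares objective can be read off the eigenvalues of the empirical covariance matrix; the data, dimensions, and the choice μ ~ L/√n below are synthetic assumptions:

```python
import numpy as np

# Smoothness (L), strong convexity (mu) and condition number (kappa) of a
# regularized least-squares objective, read off the eigenvalues of the
# empirical covariance matrix. Synthetic data, for illustration only.
rng = np.random.default_rng(0)
n, d = 200, 10
Phi = rng.standard_normal((n, d))        # rows are features Phi(x_i)

# g(theta) = (1/2n)||y - Phi theta||^2 + (mu/2)||theta||^2 has Hessian
# g''(theta) = (1/n) Phi^T Phi + mu I, independent of theta.
cov = Phi.T @ Phi / n
eig = np.linalg.eigvalsh(cov)

mu_reg = eig.max() / np.sqrt(n)          # typical regularization mu ~ L/sqrt(n)
L = eig.max() + mu_reg
mu = eig.min() + mu_reg
kappa = L / mu

assert kappa >= 1.0
```

Here κ ≥ 1 always holds, and without the regularization term μ would equal the smallest covariance eigenvalue, which can be arbitrarily close to zero.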
Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on ℝ^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) (with line search)

  g(θ_t) − g(θ*) ≤ O(1/t)
  g(θ_t) − g(θ*) ≤ O((1 − 1/κ)^t) = O(e^{−t/κ}) if μ-strongly convex

• Acceleration (Nesterov, 1983): second-order recursion

  θ_t = η_{t−1} − γ_t g′(η_{t−1})  and  η_t = θ_t + δ_t(θ_t − θ_{t−1})

  – Good choice of momentum term δ_t ∈ [0, 1)

  g(θ_t) − g(θ*) ≤ O(1/t²)
  g(θ_t) − g(θ*) ≤ O((1 − 1/√κ)^t) = O(e^{−t/√κ}) if μ-strongly convex

  – Optimal rates after t = O(d) iterations (Nesterov, 2004)
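The gap between the e^{−t/κ} and e^{−t/√κ} rates can be checked numerically; a minimal sketch on a synthetic strongly convex quadratic (all constants below are illustrative assumptions):

```python
import numpy as np

# Compare plain gradient descent with Nesterov's accelerated recursion on a
# strongly convex quadratic g(theta) = 1/2 theta^T A theta (kappa = 100).
rng = np.random.default_rng(0)
d = 20
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
spectrum = np.linspace(1.0, 100.0, d)      # mu = 1, L = 100
A = Q @ np.diag(spectrum) @ Q.T
grad = lambda v: A @ v
g = lambda v: 0.5 * v @ A @ v

L_, mu_ = spectrum.max(), spectrum.min()
kappa = L_ / mu_
theta0 = rng.standard_normal(d)

# Plain gradient descent, constant step 1/L
theta = theta0.copy()
for _ in range(200):
    theta = theta - grad(theta) / L_

# Nesterov acceleration, constant momentum (sqrt(kappa)-1)/(sqrt(kappa)+1)
delta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
th, eta = theta0.copy(), theta0.copy()
for _ in range(200):
    th_new = eta - grad(eta) / L_
    eta = th_new + delta * (th_new - th)
    th = th_new

# accelerated iterate is far closer to the minimum (theta* = 0) than plain GD
assert g(th) < g(theta) < g(theta0)
```

With κ = 100, after 200 iterations plain gradient descent has contracted the slowest direction only by (1 − 1/κ)^200 ≈ e^{−2}, while the accelerated recursion contracts roughly like (1 − 1/√κ)^200 ≈ e^{−20}.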
Iterative methods for minimizing smooth functions

• Assumption: g convex and smooth on ℝ^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1})
  – O(1/t) convergence rate for convex functions
  – O(e^{−t/κ}) linear rate if strongly convex ⇔ O(κ log(1/ε)) iterations,
    complexity = O(nd · κ log(1/ε))

• Newton method: θ_t = θ_{t−1} − g″(θ_{t−1})^{−1} g′(θ_{t−1})
  – O(e^{−ρ2^t}) quadratic rate ⇔ O(log log(1/ε)) iterations,
    complexity = O((nd² + d³) · log log(1/ε))

• Key insights for machine learning (Bottou and Bousquet, 2008):
  1. No need to optimize below statistical error
  2. Cost functions are averages
  3. Testing error is more important than training error
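The O(nd² + d³) per-iteration cost and the very small iteration count of Newton's method can be seen on ℓ2-regularized logistic regression; a sketch on synthetic data (the data, regularization value, and iteration budget are assumptions for illustration):

```python
import numpy as np

# Newton's method on l2-regularized logistic regression: a handful of
# iterations reaches near machine precision (quadratic rate), at cost
# O(n d^2 + d^3) per step. Synthetic data, for illustration only.
rng = np.random.default_rng(0)
n, d, lam = 500, 5, 0.1
X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d) + rng.standard_normal(n))

def grad_hess(theta):
    m = y * (X @ theta)
    s = 1.0 / (1.0 + np.exp(m))            # sigma(-m), derivative weights
    g = -(X * (y * s)[:, None]).mean(axis=0) + lam * theta
    w = s * (1.0 - s)                      # Hessian weights sigma'(m)
    H = (X * w[:, None]).T @ X / n + lam * np.eye(d)
    return g, H

theta = np.zeros(d)
for _ in range(10):
    g, H = grad_hess(theta)
    theta = theta - np.linalg.solve(H, g)  # Newton step: solve is O(d^3)

g_final, _ = grad_hess(theta)
assert np.linalg.norm(g_final) < 1e-6     # gradient essentially zero
```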
Stochastic gradient descent (SGD) for finite sums

  min_{θ∈ℝ^d} g(θ) = (1/n) ∑_{i=1}^n f_i(θ)

• Iteration: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) random element of {1, …, n}
  – Polyak–Ruppert averaging: θ̄_t = (1/(t+1)) ∑_{u=0}^t θ_u

• Convergence rate if each f_i is convex, L-smooth, and g is μ-strongly convex:

  E g(θ̄_t) − g(θ*) ≤ O(1/√t)             if γ_t = 1/(L√t)
  E g(θ̄_t) − g(θ*) ≤ O(L/(μt)) = O(κ/t)  if γ_t = 1/(μt)

  – No adaptivity to strong convexity in general
  – Running-time complexity: O(d · κ/ε)

• NB: a single pass leads to bounds on the testing error
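The iteration above, with the step size γ_t = 1/(μt) and a running Polyak–Ruppert average, can be sketched as follows on a regularized least-squares finite sum (data and constants are synthetic assumptions):

```python
import numpy as np

# SGD with decaying step gamma_t = 1/(mu t) and Polyak-Ruppert averaging on a
# mu-strongly convex regularized least-squares objective:
# g(theta) = (1/2n)||y - X theta||^2 + (mu/2)||theta||^2.
rng = np.random.default_rng(0)
n, d, mu = 1000, 5, 1.0
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Exact minimizer, for reference
theta_star = np.linalg.solve(X.T @ X / n + mu * np.eye(d), X.T @ y / n)

theta = np.zeros(d)
avg = np.zeros(d)
for t in range(1, 20001):
    i = rng.integers(n)                               # sampling with replacement
    g_i = X[i] * (X[i] @ theta - y[i]) + mu * theta   # gradient of f_i
    theta = theta - g_i / (mu * t)                    # gamma_t = 1/(mu t)
    avg += (theta - avg) / (t + 1)                    # running average

assert np.linalg.norm(avg - theta_star) < 0.5 * np.linalg.norm(theta_star)
```

Each iteration touches a single observation, so the per-iteration cost is O(d), independent of n.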
Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) ∑_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λΩ(θ)

• Batch gradient descent: θ_t = θ_{t−1} − γ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) ∑_{i=1}^n ∇f_i(θ_{t−1})
  – Linear (exponential) convergence rate in O(e^{−t/κ}) for strongly convex problems
  – Can be accelerated to O(e^{−t/√κ}) (Nesterov, 1983)
  – Iteration complexity is linear in n

• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t ∇f_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) random element of {1, …, n}
  – Convergence rate in O(κ/t)
  – Iteration complexity is independent of n
Stochastic vs. deterministic methods

• Goal = best of both worlds: linear rate with O(d) iteration cost, and a simple choice of step size

[Figure: log(excess cost) vs. time; deterministic methods decrease linearly but slowly per pass, stochastic methods improve quickly then stall, and the new methods combine both behaviors]
Recent progress in single machine optimization

• Variance reduction:
  – Exponential convergence with O(d) iteration cost
  – SAG (Le Roux, Schmidt, and Bach, 2012)
  – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
  – SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

  θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) − y_{i(t)}^{t−1} + (1/n) ∑_{i=1}^n y_i^{t−1} ]

  (with y_i^t the value of the gradient of the i-th function stored at time t)

• Running time to reach precision ε (with κ = condition number):

  Stochastic gradient descent:  d × κ × (1/ε)
  Gradient descent:             d × nκ × log(1/ε)
  Variance reduction:           d × (n + κ) × log(1/ε)

  – Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
  – Matching upper and lower bounds of complexity
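The variance-reduced update above can be sketched as a small SAGA loop on a least-squares finite sum; the stored gradients y_i play the role of the table below (data, step size, and iteration budget are illustrative assumptions):

```python
import numpy as np

# SAGA on a finite sum of regularized least-squares losses:
# f_i(theta) = 1/2 (y_i - x_i^T theta)^2 + (mu/2)||theta||^2.
rng = np.random.default_rng(0)
n, d, mu = 200, 5, 0.1
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def grad_i(i, th):
    return X[i] * (X[i] @ th - y[i]) + mu * th

theta = np.zeros(d)
table = np.array([grad_i(i, theta) for i in range(n)])  # stored gradients y_i
table_mean = table.mean(axis=0)
L = np.max(np.sum(X**2, axis=1)) + mu       # per-function smoothness bound
gamma = 1.0 / (3 * L)                       # classical SAGA step size

for _ in range(50 * n):                     # ~50 effective passes over the data
    i = rng.integers(n)
    g_new = grad_i(i, theta)
    # variance-reduced direction: new grad - stored grad + mean of stored grads
    theta = theta - gamma * (g_new - table[i] + table_mean)
    table_mean += (g_new - table[i]) / n    # keep the running mean in sync
    table[i] = g_new

theta_star = np.linalg.solve(X.T @ X + n * mu * np.eye(d), X.T @ y)
assert np.linalg.norm(theta - theta_star) < 1e-2
```

Unlike plain SGD, the correction term has zero mean and vanishes at the optimum, which is what allows a constant step size and the linear (n + κ) log(1/ε) rate.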
Outline

1. Parametric supervised learning on a single machine
   – Machine learning ≈ optimization of finite sums
   – From batch to stochastic gradient methods
   – Linearly convergent stochastic methods for convex problems

2. Machine learning over networks
   – Centralized and decentralized methods
   – From network averaging to optimization
   – Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)

• Machine learning through optimization:

  min_{θ∈ℝ^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)

  – f_i(θ): error of the model defined by θ on the dataset indexed by i
  – Example: f_i(θ) = (1/m_i) ∑_{j=1}^{m_i} ℓ(y_ij, θ⊤Φ(x_ij)) if node i holds m_i observations

• Each dataset/function f_i only accessible by node i in a graph

  [Figure: network graph on nodes 1–9]

  – Massive datasets, multiple machines/cores
  – Communication/legal constraints

• Goal: minimize communication and local computation costs
Distribution in machine learning (and beyond)

• Why not simply distribute a simple single machine algorithm?
  – (Accelerated) gradient descent (see, e.g., Nesterov, 2004): θ_t = θ_{t−1} − γ∇g(θ_{t−1})
  – Requires √κ log(1/ε) full gradient computations to reach precision ε
  – Needs to perform distributed averaging over a network
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ ℝ, compute

  θ* = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈ℝ} (1/n) ∑_{i=1}^n (θ − ξ_i)²

• Centralized algorithms:
  – Compute a spanning tree with diameter ≤ 2Δ
  – Master/slave algorithm: Δ communication steps + no error

• Application to centralized distributed optimization:
  – √κ log(1/ε) gradient steps and √κ Δ log(1/ε) communication steps
  – "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)

• Robustness?
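The master/slave scheme can be sketched as two passes over the spanning tree: an up-pass that accumulates partial sums toward the root, and a down-pass that broadcasts the exact average (Δ communication steps each way). The toy tree and values below are illustrative assumptions:

```python
# Master/slave averaging on a spanning tree rooted at node 1.
# Up-pass: each node sends (subtree sum, subtree size) to its parent.
# Down-pass: the root broadcasts the exact average. No error is incurred.
children = {1: [2, 3], 2: [4, 5], 3: [6], 4: [], 5: [], 6: []}
xi = {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0, 5: 5.0, 6: 6.0}   # local observations

def up_pass(node):
    """Return (sum, count) of the subtree rooted at `node`."""
    s, c = xi[node], 1
    for child in children[node]:
        cs, cc = up_pass(child)
        s, c = s + cs, c + cc
    return s, c

total, count = up_pass(1)
average = total / count                      # exact mean computed at the root
theta = {node: average for node in xi}       # down-pass broadcast

assert abs(theta[4] - 3.5) < 1e-12           # (1+2+3+4+5+6)/6
```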
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ ℝ, compute

  θ* = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈ℝ} (1/n) ∑_{i=1}^n (θ − ξ_i)²

• Decentralized algorithms – gossip (Boyd et al., 2006):
  – Replace θ_i by a weighted average of its neighbors: ∑_{j=1}^n W_ij θ_j
  – Potential asynchrony, changing network

• Synchronous gossip (all nodes simultaneously):
  – Main iteration: θ_t = Wθ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously):
  – Main iteration: θ_t = Wθ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
  – Consequence: eigenvalues(W) ∈ [−1, 1]
  – Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
  – γ^{−1} = mixing time of the associated Markov chain
  – Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)
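The iteration θ_t = W^t ξ and its (1/γ) log(1/ε) iteration count can be checked on a small example; the ring topology and lazy-walk weights below are illustrative assumptions:

```python
import numpy as np

# Synchronous gossip on a ring of n nodes: theta_t = W theta_{t-1}, with W
# symmetric doubly stochastic; convergence to the mean is governed by the
# eigengap gamma = 1 - lambda_2(W).
n = 10
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i + 1) % n] = 0.25
    W[i, (i - 1) % n] = 0.25                # lazy random walk on the ring

rng = np.random.default_rng(0)
xi = rng.standard_normal(n)                 # local observations xi_i
mean = xi.mean()

eigs = np.sort(np.linalg.eigvalsh(W))[::-1]
gamma = eigs[0] - eigs[1]                   # eigengap (lambda_1 = 1)

theta = xi.copy()
T = int(10 / gamma)                         # ~ (1/gamma) log(1/eps) iterations
for _ in range(T):
    theta = W @ theta                       # every node averages its neighbors

assert abs(eigs[0] - 1.0) < 1e-10           # doubly stochastic: top eigenvalue 1
assert np.max(np.abs(theta - mean)) < 1e-3 * np.max(np.abs(xi - mean))
```

On the ring, γ shrinks like 1/n², which is why poorly connected networks need many more gossip rounds than well-connected ones.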
Illustration of synchronous gossip
Smoothness and (strong) convexity
bull A function g Rd rarr R is L-smooth if and only if it is twice
differentiable and
forallθ isin Rd
∣
∣eigenvalues[
gprimeprime(θ)]∣
∣ 6 L
smooth non-smooth
Smoothness and (strong) convexity
bull A function g Rd rarr R is L-smooth if and only if it is twice
differentiable and
forallθ isin Rd
∣
∣eigenvalues[
gprimeprime(θ)]∣
∣ 6 L
bull Machine learning
ndash with g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Smooth prediction function θ 7rarr h(xi θ) + smooth loss
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt 0
convex
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
convex
stronglyconvex
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
ndash Condition number κ = Lmicro gt 1
(small κ = Lmicro) (large κ = Lmicro)
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)
bull Relevance of convex optimization
ndash Easier design and analysis of algorithms
ndash Global minimum vs local minimum vs stationary points
ndash Gradient-based algorithms only need convexity for their analysis
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Strong convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Strong convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)
ndash Invertible covariance matrix 1n
sumni=1Φ(xi)Φ(xi)
⊤ rArr n gt d
ndash Even when micro gt 0 micro may be arbitrarily small
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Strong convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)
ndash Invertible covariance matrix 1n
sumni=1Φ(xi)Φ(xi)
⊤ rArr n gt d
ndash Even when micro gt 0 micro may be arbitrarily small
bull Adding regularization by micro2θ2
ndash creates additional bias unless micro is small but reduces variance
ndash Typically Lradicn gt micro gt Ln rArr κ isin [
radicn n]
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O(eminust(microL)) = O(eminustκ)
(small κ = Lmicro) (large κ = Lmicro)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
(small κ = Lmicro) (large κ = Lmicro)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
bull Acceleration (Nesterov 1983) second-order recursion
θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)
ndash Good choice of momentum term δt isin [0 1)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
bull Acceleration (Nesterov 1983) second-order recursion
θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)
ndash Good choice of momentum term δt isin [0 1)
g(θt)minus g(θlowast) 6 O(1t2)
g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust
radicκ) if micro-strongly convex
ndash Optimal rates after t = O(d) iterations (Nesterov 2004)
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr O(log log 1ε) iterations
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
bull Key insights for machine learning (Bottou and Bousquet 2008)
1 No need to optimize below statistical error
2 Cost functions are averages
3 Testing error is more important than training error
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
bull Key insights for machine learning (Bottou and Bousquet 2008)
1 No need to optimize below statistical error
2 Cost functions are averages
3 Testing error is more important than training error
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
bull NB single pass leads to bounds on testing error
Stochastic vs deterministic methods

• Minimizing g(θ) = (1/n) ∑_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λΩ(θ)

• Batch gradient descent: θ_t = θ_{t−1} − γ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) ∑_{i=1}^n ∇f_i(θ_{t−1})
  – Exponential convergence rate in O(e^{−t/κ}) for strongly convex problems
  – Can be accelerated to O(e^{−t/√κ}) (Nesterov, 1983)
  – Iteration complexity is linear in n

• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t ∇f_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) random element of {1, . . . , n}
  – Convergence rate in O(κ/t)
  – Iteration complexity is independent of n
Stochastic vs deterministic methods

• Goal = best of both worlds: linear rate with O(d) iteration cost, with a simple choice of step size

(schematic plot: log(excess cost) vs. time, comparing the deterministic, stochastic, and new methods)
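To make the trade-off in the schematic concrete, the following toy comparison (illustrative data and step sizes, not from the lecture) gives batch gradient descent and SGD the same budget of single-gradient evaluations on a regularized least-squares problem:

```python
import numpy as np

# Batch GD vs SGD on g(theta) = (1/(2n))*||y - X theta||^2 + (lam/2)*||theta||^2,
# with the same total budget of single-gradient evaluations
# (one batch step costs n evaluations; one SGD step costs 1).
rng = np.random.default_rng(0)
n, d, lam, passes = 500, 10, 0.1, 50
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def g(theta):
    r = X @ theta - y
    return r @ r / (2 * n) + lam / 2 * theta @ theta

L_full = np.linalg.eigvalsh(X.T @ X / n).max() + lam   # smoothness of g
L_max = (X ** 2).sum(axis=1).max() + lam               # smoothness of each f_i

theta_batch = np.zeros(d)
for _ in range(passes):                    # 'passes' full-gradient steps
    grad = X.T @ (X @ theta_batch - y) / n + lam * theta_batch
    theta_batch -= grad / L_full

theta_sgd = np.zeros(d)
for t in range(1, passes * n + 1):         # same budget, one sample at a time
    i = rng.integers(n)
    grad_i = (X[i] @ theta_sgd - y[i]) * X[i] + lam * theta_sgd
    theta_sgd -= grad_i / (L_max * np.sqrt(t))
```

After a pass or two of work SGD is already close to the optimum, but the batch method keeps converging linearly and eventually reaches much higher precision; the variance-reduced methods below combine both behaviors.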
Recent progress in single machine optimization

• Variance reduction
  – Exponential convergence with O(d) iteration cost
  – SAG (Le Roux, Schmidt, and Bach, 2012)
  – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
  – SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) ∑_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]

(with y_i^t : stored value at time t of the gradient of the i-th function)

• Running-time to reach precision ε (with κ = condition number):

  Stochastic gradient descent : d × κ × (1/ε)
  Gradient descent            : d × nκ × log(1/ε)
  Variance reduction          : d × (n + κ) × log(1/ε)

  – Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
  – Matching upper and lower bounds of complexity
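The update above can be sketched in SAGA form for f_i(θ) = ½(y_i − x_i⊤θ)² + (λ/2)‖θ‖²; the step size γ = 1/(3L) and the synthetic data in the usage note are illustrative assumptions:

```python
import numpy as np

def saga(X, y, lam, gamma, n_steps, seed=0):
    """Minimal SAGA sketch for f_i(theta) = 0.5*(y_i - x_i^T theta)^2
    + (lam/2)*||theta||^2. Keeps a table of stored gradients y_i and applies
    theta <- theta - gamma * (grad_i(theta) - y_i + mean of stored gradients)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    stored = np.zeros((n, d))            # y_i: last gradient seen for each f_i
    g_avg = stored.mean(axis=0)          # (1/n) sum_i y_i, kept incrementally
    for _ in range(n_steps):
        i = rng.integers(n)
        g_i = (X[i] @ theta - y[i]) * X[i] + lam * theta
        theta -= gamma * (g_i - stored[i] + g_avg)   # variance-reduced step
        g_avg += (g_i - stored[i]) / n               # update the running mean
        stored[i] = g_i
    return theta
```

With the constant step γ ≈ 1/(3L), where L = max_i ‖x_i‖² + λ, the iterates converge linearly, at O(d) cost per iteration, as in the table above.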
Outline

1. Parametric supervised learning on a single machine
   – Machine learning ≈ optimization of finite sums
   – From batch to stochastic gradient methods
   – Linearly-convergent stochastic methods for convex problems

2. Machine learning over networks
   – Centralized and decentralized methods
   – From network averaging to optimization
   – Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)

• Machine learning through optimization:

min_{θ∈ℝ^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)

  – f_i(θ): error of the model defined by θ on the dataset indexed by i
  – Example: f_i(θ) = (1/m_i) ∑_{j=1}^{m_i} ℓ(y_ij, θ^⊤Φ(x_ij)) if m_i observations

• Each dataset / function f_i only accessible by node i in a graph (e.g., a network of 9 machines)
  – Massive datasets, multiple machines/cores
  – Communication / legal constraints

• Goal: minimize communication and local computation costs

• Why not simply distribute a simple single machine algorithm?
  – (Accelerated) gradient descent (see, e.g., Nesterov, 2004): θ_t = θ_{t−1} − γ∇g(θ_{t−1})
  – Requires √κ log(1/ε) full gradient computations to reach precision ε
  – Need to perform distributed averaging over a network
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, . . . , ξ_n ∈ ℝ, compute

θ* = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈ℝ} (1/n) ∑_{i=1}^n (θ − ξ_i)²

• Centralized algorithms
  – Compute a spanning tree with diameter ≤ 2∆
  – Master/slave algorithm: ∆ communication steps + no error

• Application to centralized distributed optimization
  – √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
  – "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulie, 2017)

• Robustness?
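The master/slave scheme can be sketched as two sweeps over the spanning tree: an upward pass aggregates partial sums toward the root, and a downward pass broadcasts the exact average, each taking at most ∆ communication steps. The 9-node tree below is a hypothetical example standing in for the slide's network:

```python
# Hypothetical spanning tree over nodes 1..9, rooted at node 1.
children = {1: [3, 4], 3: [9], 4: [2, 7], 7: [5], 9: [6, 8],
            2: [], 5: [], 6: [], 8: []}
xi = {i: float(i) for i in children}      # one scalar observation per node

def subtree_sum(node):
    # Upward pass: each node sends (sum, count) of its subtree to its parent;
    # the number of communication rounds is the tree depth, at most Delta.
    s, c = xi[node], 1
    for child in children[node]:
        cs, cc = subtree_sum(child)
        s, c = s + cs, c + cc
    return s, c

s, c = subtree_sum(1)
average = s / c   # the root holds the exact mean, then broadcasts it back down
```

Unlike gossip, this computes the average exactly in O(∆) steps, but it relies on a distinguished root: if the master or an internal node fails, the whole computation does, which motivates the robustness question above.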
Classical algorithms for distributed averaging

• Decentralized algorithms – gossip (Boyd et al., 2006)
  – Replace θ_i by a weighted average of its neighbors: ∑_{j=1}^n W_ij θ_j
  – Potential asynchrony, changing network

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = Wθ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = Wθ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
  – Consequence: eigenvalues(W) ∈ [−1, 1]
  – Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
  – γ^{−1} = mixing time of the associated Markov chain
  – Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)
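A minimal sketch of synchronous gossip, on an assumed ring of 20 nodes where W averages each node with its two neighbors (symmetric and doubly stochastic, so the iterates contract toward the mean at a rate governed by the eigengap):

```python
import numpy as np

# Synchronous gossip theta_t = W theta_{t-1} on a ring of n nodes.
n = 20
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25              # lazy random walk on the ring

xi = np.arange(n, dtype=float)            # one observation per node
mean = xi.mean()

eigs = np.sort(np.linalg.eigvalsh(W))[::-1]
gap = eigs[0] - eigs[1]                   # eigengap gamma = 1 - lambda_2(W)

theta = xi.copy()
for t in range(2000):
    theta = W @ theta                     # one synchronous gossip round
# after roughly (1/gap) * log(1/eps) rounds, every node holds the mean
```

On the ring the eigengap scales like 1/n², so poorly connected graphs mix slowly, which is exactly the (1/γ) log(1/ε) iteration count above.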
Illustration of synchronous gossip

(figure: gossip iterations on the example network)
Parametric supervised machine learning

• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, . . . , n, i.i.d.
• Prediction function h(x, θ) ∈ ℝ parameterized by θ ∈ ℝ^d

• Advertising: n ≥ 10^9
  – Φ(x) ∈ {0, 1}^d, d ≥ 10^9
  – Navigation history + ad
• Linear predictions
  – h(x, θ) = θ^⊤Φ(x)

• Binary classification example: six images x_1, . . . , x_6 with labels y_1 = y_2 = y_3 = 1, y_4 = y_5 = y_6 = −1
  – Neural networks (n, d ≥ 10^6): h(x, θ) = θ_m^⊤ σ(θ_{m−1}^⊤ σ(· · · θ_2^⊤ σ(θ_1^⊤ x)))

• (Regularized) empirical risk minimization:

min_{θ∈ℝ^d} (1/n) ∑_{i=1}^n ℓ(y_i, h(x_i, θ)) + λΩ(θ) = (1/n) ∑_{i=1}^n f_i(θ)

data fitting term + regularizer

  – Least-squares regression: min_{θ∈ℝ^d} (1/(2n)) ∑_{i=1}^n (y_i − h(x_i, θ))² + λΩ(θ)
  – Logistic regression: min_{θ∈ℝ^d} (1/n) ∑_{i=1}^n log(1 + exp(−y_i h(x_i, θ))) + λΩ(θ)

• Optimization: optimization of the regularized risk (training cost)
• Statistics: guarantees on E_{p(x,y)} ℓ(y, h(x, θ)) (testing cost)
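As a sketch, the logistic-regression instance of the objective above, with linear predictions h(x, θ) = θ^⊤x and Ω(θ) = ½‖θ‖² (function names and data are illustrative assumptions):

```python
import numpy as np

def logistic_risk(theta, X, y, lam):
    """Regularized empirical risk (1/n) sum_i log(1 + exp(-y_i theta^T x_i))
    + (lam/2)*||theta||^2, with labels y_i in {-1, +1}."""
    margins = y * (X @ theta)                 # y_i * h(x_i, theta)
    return np.logaddexp(0.0, -margins).mean() + lam / 2 * theta @ theta

def logistic_risk_grad(theta, X, y, lam):
    """Gradient: (1/n) sum_i -y_i * sigma(-y_i theta^T x_i) * x_i + lam*theta."""
    margins = y * (X @ theta)
    weights = -y / (1.0 + np.exp(margins))    # -y_i * sigma(-margin_i)
    return X.T @ weights / len(y) + lam * theta
```

Using `np.logaddexp` keeps the loss numerically stable for large negative margins; the gradient is what every method in these slides (batch, SGD, variance-reduced) repeatedly evaluates.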
Smoothness and (strong) convexity

• A function g : ℝ^d → ℝ is L-smooth if and only if it is twice differentiable and

∀θ ∈ ℝ^d, |eigenvalues[g″(θ)]| ≤ L

• Smoothness in machine learning
  – With g(θ) = (1/n) ∑_{i=1}^n ℓ(y_i, h(x_i, θ))
  – Smooth prediction function θ ↦ h(x_i, θ) + smooth loss

• A twice differentiable function g : ℝ^d → ℝ is convex if and only if

∀θ ∈ ℝ^d, eigenvalues[g″(θ)] ≥ 0

and µ-strongly convex if and only if

∀θ ∈ ℝ^d, eigenvalues[g″(θ)] ≥ µ

  – Condition number κ = L/µ ≥ 1

• Convexity in machine learning
  – With g(θ) = (1/n) ∑_{i=1}^n ℓ(y_i, h(x_i, θ))
  – Convex loss and linear predictions h(x, θ) = θ^⊤Φ(x)

• Relevance of convex optimization
  – Easier design and analysis of algorithms
  – Global minimum vs. local minimum vs. stationary points
  – Gradient-based algorithms only need convexity for their analysis

• Strong convexity in machine learning
  – Strongly convex loss and linear predictions h(x, θ) = θ^⊤Φ(x)
  – Invertible covariance matrix (1/n) ∑_{i=1}^n Φ(x_i)Φ(x_i)^⊤ ⇒ n ≥ d
  – Even when µ > 0, µ may be arbitrarily small

• Adding regularization by (µ/2)‖θ‖²
  – Creates additional bias unless µ is small, but reduces variance
  – Typically L/√n ≥ µ ≥ L/n ⇒ κ ∈ [√n, n]
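For ℓ2-regularized least squares these constants can be read directly off the Hessian X^⊤X/n + µI, whose extreme eigenvalues give L and the strong-convexity constant; a small numerical illustration on synthetic data (shapes and values are assumptions):

```python
import numpy as np

# g(theta) = (1/(2n))*||y - X theta||^2 + (mu/2)*||theta||^2 has constant
# Hessian X^T X / n + mu*I: its largest eigenvalue is L, its smallest is the
# strong-convexity constant, and their ratio is the condition number kappa.
rng = np.random.default_rng(0)
n, d, mu = 1000, 20, 0.01
X = rng.standard_normal((n, d))

hessian = X.T @ X / n + mu * np.eye(d)
eigs = np.linalg.eigvalsh(hessian)
L, mu_total = eigs.max(), eigs.min()
kappa = L / mu_total
```

Without the regularizer (µ = 0), a rank-deficient or badly conditioned covariance drives the smallest eigenvalue toward zero and κ toward infinity, which is the point of the last two bullets above.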
Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on ℝ^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) (line search)
  – g(θ_t) − g(θ*) ≤ O(1/t) convergence rate for convex functions
  – g(θ_t) − g(θ*) ≤ O((1 − 1/κ)^t) = O(e^{−t/κ}) if µ-strongly convex
    ⟺ O(κ log(1/ε)) iterations ⟺ complexity = O(nd · κ log(1/ε))

• Acceleration (Nesterov, 1983): second-order recursion

θ_t = η_{t−1} − γ_t g′(η_{t−1}) and η_t = θ_t + δ_t(θ_t − θ_{t−1})

  – Good choice of momentum term δ_t ∈ [0, 1)
  – g(θ_t) − g(θ*) ≤ O(1/t²)
  – g(θ_t) − g(θ*) ≤ O((1 − 1/√κ)^t) = O(e^{−t/√κ}) if µ-strongly convex
  – Optimal rates after t = O(d) iterations (Nesterov, 2004)

• Newton method: θ_t = θ_{t−1} − g″(θ_{t−1})^{−1} g′(θ_{t−1})
  – O(e^{−ρ·2^t}) quadratic rate ⟺ O(log log(1/ε)) iterations
    ⟺ complexity = O((nd² + d³) · log log(1/ε))

• Key insights for machine learning (Bottou and Bousquet, 2008)
  1. No need to optimize below statistical error
  2. Cost functions are averages
  3. Testing error is more important than training error
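The gap between the O(e^{−t/κ}) and O(e^{−t/√κ}) rates is easy to see on a strongly convex quadratic; a minimal sketch (the diagonal matrix and constant momentum δ = (√κ − 1)/(√κ + 1) are illustrative assumptions for the strongly convex setting):

```python
import numpy as np

# Gradient descent vs Nesterov acceleration on g(theta) = 0.5*theta^T A theta,
# a synthetic quadratic with L = 1, mu = 0.01, hence kappa = 100.
d = 50
A = np.diag(np.linspace(0.01, 1.0, d))
g = lambda th: 0.5 * th @ (A @ th)
grad = lambda th: A @ th
L, mu = 1.0, 0.01
kappa = L / mu
theta0 = np.ones(d)

theta_gd = theta0.copy()
for _ in range(200):                      # plain gradient descent, step 1/L
    theta_gd -= grad(theta_gd) / L

delta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)   # momentum term
theta_acc, theta_prev = theta0.copy(), theta0.copy()
for _ in range(200):                      # Nesterov's second-order recursion
    eta = theta_acc + delta * (theta_acc - theta_prev)
    theta_prev = theta_acc
    theta_acc = eta - grad(eta) / L
```

With κ = 100, acceleration shrinks the effective rate from roughly (1 − 1/100)^t to (1 − 1/10)^t, so after the same 200 iterations the accelerated iterate is far closer to the optimum.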
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
bull NB single pass leads to bounds on testing error
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)
ndash Iteration complexity is independent of n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
new
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
(with yti stored value at time t of gradient of the i-th function)
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
bull Running-time to reach precision ε (with κ = condition number)
Stochastic gradient descent dtimes∣
∣
∣κ times 1
ε
Gradient descent dtimes∣
∣
∣nκ times log 1
ε
Variance reduction dtimes∣
∣
∣(n+ κ) times log 1
ε
ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ
ndash Matching upper and lower bounds of complexity
Outline
1 Parametric supervised learning on a single machine
ndash Machine learning asymp optimization of finite sums
ndash From batch to stochastic gradient methods
ndash Linearly-convergent stochastic methods for convex problems
2 Machine learning over networks
minus Centralized and decentralized methods
minus From network averaging to optimization
minus Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
ndash Massive datasets multiple machines cores
ndash Communication legal constraints
bull Goal Minimize communication and local computation costs
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
bull Why not simply distributing a simple single machine algorithm
ndash (accelerated) gradient descent (see eg Nesterov 2004)
θt = θtminus1 minus γnablag(θtminus1)
ndash Requiresradicκ log 1
ε full gradient computations to reach precision ε
ndash Need to perform distributed averaging over a network
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
bull Robustness
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
ndash Consequence Eigenvalues(W ) isin [minus1 1]
ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )
ndash γminus1 = mixing time of the associated Markov chain
1 3
24
57
6 9
8
ndash Need 1γ log
1ε iterations to reach precision ε (for classical averaging)
Illustration of synchronous gossip
Parametric supervised machine learning

• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, ..., n, i.i.d.
• Prediction function h(x, θ) ∈ R parameterized by θ ∈ R^d

• Advertising: n > 10⁹
  – Φ(x) ∈ {0, 1}^d, d > 10⁹
  – Navigation history + ad
• Linear predictions
  – h(x, θ) = θ⊤Φ(x)
[figure: example images x_1, ..., x_6 with labels y_1 = y_2 = y_3 = 1 and y_4 = y_5 = y_6 = −1]
– Neural networks (n, d > 10⁶): h(x, θ) = θ_m⊤ σ(θ_{m−1}⊤ σ(··· θ_2⊤ σ(θ_1⊤ x)))
[figure: network mapping x to y through weight layers θ_1, θ_2, θ_3]
Parametric supervised machine learning

• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, ..., n
• Prediction function h(x, θ) ∈ R parameterized by θ ∈ R^d

• (Regularized) empirical risk minimization:

  min_{θ∈R^d} (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ)) + λΩ(θ) = (1/n) Σ_{i=1}^n f_i(θ)

  data fitting term + regularizer
• Least-squares regression:

  min_{θ∈R^d} (1/2n) Σ_{i=1}^n (y_i − h(x_i, θ))² + λΩ(θ) = (1/n) Σ_{i=1}^n f_i(θ)
• Logistic regression:

  min_{θ∈R^d} (1/n) Σ_{i=1}^n log(1 + exp(−y_i h(x_i, θ))) + λΩ(θ) = (1/n) Σ_{i=1}^n f_i(θ)
• Optimization: optimization of the regularized risk (training cost)
• Statistics: guarantees on E_{p(x,y)} ℓ(y, h(x, θ)) (testing cost)
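To make the ERM objective concrete, here is a small sketch (my own illustration, with synthetic data) of regularized logistic regression with linear predictions h(x, θ) = θ⊤x, minimized by plain gradient descent:

```python
import numpy as np

def objective(theta, X, y, lam):
    # (1/n) sum_i log(1 + exp(-y_i x_i^T theta)) + (lam/2) ||theta||^2
    margins = y * (X @ theta)
    return np.mean(np.log1p(np.exp(-margins))) + 0.5 * lam * theta @ theta

def gradient(theta, X, y, lam):
    # chain rule: d/d theta of log(1 + exp(-y x^T theta)) = -y x / (1 + exp(y x^T theta))
    margins = y * (X @ theta)
    s = -y / (1.0 + np.exp(margins))
    return X.T @ s / len(y) + lam * theta

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d) + 0.1 * rng.normal(size=n))

lam = 0.1
theta = np.zeros(d)
for _ in range(500):                       # plain (batch) gradient descent
    theta -= 0.5 * gradient(theta, X, y, lam)
```

The λ‖θ‖²/2 term makes the objective λ-strongly convex, which is what gives gradient descent its linear rate here.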
Smoothness and (strong) convexity

• A function g: R^d → R is L-smooth if and only if it is twice differentiable and

  ∀θ ∈ R^d, |eigenvalues[g″(θ)]| ≤ L

[figures: a smooth vs. a non-smooth function]
• Machine learning
  – With g(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ))
  – Smooth prediction function θ ↦ h(x_i, θ) + smooth loss
• A twice differentiable function g: R^d → R is convex if and only if

  ∀θ ∈ R^d, eigenvalues[g″(θ)] ≥ 0
• A twice differentiable function g: R^d → R is µ-strongly convex if and only if

  ∀θ ∈ R^d, eigenvalues[g″(θ)] ≥ µ

[figures: a convex vs. a strongly convex function]
  – Condition number κ = L/µ ≥ 1
[figures: a well-conditioned function (small κ = L/µ) vs. an ill-conditioned one (large κ = L/µ)]
• Convexity in machine learning
  – With g(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ))
  – Convex loss and linear predictions h(x, θ) = θ⊤Φ(x)
• Relevance of convex optimization
  – Easier design and analysis of algorithms
  – Global minimum vs. local minimum vs. stationary points
  – Gradient-based algorithms only need convexity for their analysis
• Strong convexity in machine learning
  – With g(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ))
  – Strongly convex loss and linear predictions h(x, θ) = θ⊤Φ(x)
  – Invertible covariance matrix (1/n) Σ_{i=1}^n Φ(x_i)Φ(x_i)⊤ ⇒ n ≥ d
  – Even when µ > 0, µ may be arbitrarily small
• Adding regularization by (µ/2)‖θ‖²
  – Creates additional bias unless µ is small, but reduces variance
  – Typically L/√n ≥ µ ≥ L/n ⇒ κ ∈ [√n, n]
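These quantities are easy to inspect numerically: for least-squares with linear predictions, g″(θ) is the empirical covariance matrix X⊤X/n, so L and µ are its extreme eigenvalues, and adding (λ/2)‖θ‖² shifts every eigenvalue by λ. A small sketch (synthetic data, my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.normal(size=(n, d))

# Hessian of the least-squares risk (1/2n) ||y - X theta||^2 (independent of theta)
H = X.T @ X / n
eigs = np.linalg.eigvalsh(H)   # ascending order
mu, L = eigs[0], eigs[-1]      # strong-convexity and smoothness constants
kappa = L / mu                 # condition number

lam = 0.1                      # regularization shifts all eigenvalues up by lam...
kappa_reg = (L + lam) / (mu + lam)  # ...which always reduces the condition number
```

With n ≥ d and generic data, µ > 0 but can be tiny, which is exactly why regularization is used to control κ.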
Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on R^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) (line search)

  g(θ_t) − g(θ⋆) ≤ O(1/t)
  g(θ_t) − g(θ⋆) ≤ O((1 − 1/κ)^t) = O(e^{−t/κ}) if µ-strongly convex
• Acceleration (Nesterov, 1983): second-order recursion

  θ_t = η_{t−1} − γ_t g′(η_{t−1}) and η_t = θ_t + δ_t (θ_t − θ_{t−1})

  – Good choice of momentum term δ_t ∈ [0, 1)
  g(θ_t) − g(θ⋆) ≤ O(1/t²)
  g(θ_t) − g(θ⋆) ≤ O((1 − 1/√κ)^t) = O(e^{−t/√κ}) if µ-strongly convex

  – Optimal rates after t = O(d) iterations (Nesterov, 2004)
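A minimal sketch of the two recursions above on an ill-conditioned quadratic (the constant momentum δ = (√κ − 1)/(√κ + 1) is the standard choice in the strongly convex case; the problem instance is my own):

```python
import numpy as np

def gd(grad, theta0, gamma, T):
    # plain gradient descent: theta_t = theta_{t-1} - gamma g'(theta_{t-1})
    theta = theta0.copy()
    for _ in range(T):
        theta = theta - gamma * grad(theta)
    return theta

def nesterov(grad, theta0, gamma, delta, T):
    # theta_t = eta_{t-1} - gamma g'(eta_{t-1}); eta_t = theta_t + delta (theta_t - theta_{t-1})
    theta_prev = theta0.copy()
    eta = theta0.copy()
    for _ in range(T):
        theta = eta - gamma * grad(eta)
        eta = theta + delta * (theta - theta_prev)
        theta_prev = theta
    return theta

# ill-conditioned quadratic g(theta) = 0.5 theta^T A theta, minimum at 0
mu_, L_ = 0.01, 1.0
A = np.diag(np.linspace(mu_, L_, 50))
grad = lambda th: A @ th
theta0 = np.ones(50)

kappa = L_ / mu_
delta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)

T = 300
err_gd = np.linalg.norm(gd(grad, theta0, 1 / L_, T))
err_acc = np.linalg.norm(nesterov(grad, theta0, 1 / L_, delta, T))
```

With κ = 100 the accelerated recursion contracts roughly like (1 − 1/√κ)^t versus (1 − 1/κ)^t for plain gradient descent, a dramatic difference after a few hundred iterations.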
Iterative methods for minimizing smooth functions

• Assumption: g convex and smooth on R^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1})
  – O(1/t) convergence rate for convex functions
  – O(e^{−t/κ}) linear rate if strongly convex ⇔ O(κ log(1/ε)) iterations ⇔ complexity = O(nd · κ log(1/ε))

• Newton method: θ_t = θ_{t−1} − g″(θ_{t−1})⁻¹ g′(θ_{t−1})
  – O(e^{−ρ 2^t}) quadratic rate ⇔ O(log log(1/ε)) iterations ⇔ complexity = O((nd² + d³) · log log(1/ε))

• Key insights for machine learning (Bottou and Bousquet, 2008)
  1. No need to optimize below statistical error
  2. Cost functions are averages
  3. Testing error is more important than training error
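The contrast between the linear and quadratic rates can be seen even in one dimension; below is a small sketch of Newton's method on the toy function g(θ) = e^θ + e^{−θ} (my own example, minimized at θ = 0), where a handful of iterations reaches machine precision:

```python
import math

def newton(g_prime, g_prime2, theta0, T):
    # theta_t = theta_{t-1} - g''(theta_{t-1})^{-1} g'(theta_{t-1})
    theta = theta0
    for _ in range(T):
        theta -= g_prime(theta) / g_prime2(theta)
    return theta

# g(theta) = exp(theta) + exp(-theta): strictly convex, minimum at theta = 0
g_prime = lambda t: math.exp(t) - math.exp(-t)
g_prime2 = lambda t: math.exp(t) + math.exp(-t)

theta = newton(g_prime, g_prime2, 1.0, 5)   # 5 Newton steps from theta = 1
```

The number of correct digits roughly doubles at each step (quadratic convergence), but each step needs the Hessian, which is what makes Newton's method expensive in high dimension (the d³ term above).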
Stochastic gradient descent (SGD) for finite sums

  min_{θ∈R^d} g(θ) = (1/n) Σ_{i=1}^n f_i(θ)

• Iteration: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) random element of {1, ..., n}
  – Polyak-Ruppert averaging: θ̄_t = (1/(t+1)) Σ_{u=0}^t θ_u
• Convergence rate if each f_i is convex L-smooth and g is µ-strongly convex:

  E g(θ̄_t) − g(θ⋆) ≤ O(1/√t) if γ_t = 1/(L√t)
  E g(θ̄_t) − g(θ⋆) ≤ O(L/(µt)) = O(κ/t) if γ_t = 1/(µt)

  – No adaptivity to strong convexity in general
  – Running-time complexity: O(d · κ/ε)
• NB: a single pass leads to bounds on the testing error
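A minimal sketch of SGD with Polyak-Ruppert averaging on a synthetic least-squares problem (the step size γ_t = 1/(L√t) follows the slide; the data and all other details are my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 5
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
y = X @ theta_star + 0.1 * rng.normal(size=n)

# each f_i(theta) = 0.5 (x_i^T theta - y_i)^2 is ||x_i||^2-smooth
L = np.max(np.sum(X**2, axis=1))

theta = np.zeros(d)
theta_bar = np.zeros(d)
T = 20000
for t in range(1, T + 1):
    i = rng.integers(n)                     # sampling with replacement
    grad_i = (X[i] @ theta - y[i]) * X[i]   # stochastic gradient f'_i(theta)
    theta -= grad_i / (L * np.sqrt(t))      # gamma_t = 1/(L sqrt(t))
    theta_bar += (theta - theta_bar) / t    # running Polyak-Ruppert average
```

Each iteration touches a single observation, so the cost per step is O(d), independent of n; the averaged iterate smooths out the gradient noise.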
Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) Σ_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λΩ(θ)
• Batch gradient descent: θ_t = θ_{t−1} − γ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) Σ_{i=1}^n ∇f_i(θ_{t−1})
  – Exponential convergence rate in O(e^{−t/κ}) for strongly convex problems
  – Can be accelerated to O(e^{−t/√κ}) (Nesterov, 1983)
  – Iteration complexity is linear in n
• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t ∇f_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) random element of {1, ..., n}
  – Convergence rate in O(κ/t)
  – Iteration complexity is independent of n
Stochastic vs. deterministic methods

• Goal = best of both worlds: linear rate with O(d) iteration cost, simple choice of step size

[figure: log(excess cost) vs. time for deterministic, stochastic, and new (variance-reduced) methods]
Recent progress in single machine optimization

• Variance reduction
  – Exponential convergence with O(d) iteration cost
  – SAG (Le Roux, Schmidt, and Bach, 2012)
  – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
  – SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

  θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) Σ_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]
  (with y_i^t the stored value at time t of the gradient of the i-th function)
• Running time to reach precision ε (with κ = condition number):

  Stochastic gradient descent:  d × κ × (1/ε)
  Gradient descent:             d × nκ × log(1/ε)
  Variance reduction:           d × (n + κ) × log(1/ε)

  – Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
  – Matching upper and lower bounds of complexity
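A minimal sketch of the update above in the spirit of SAGA, on a noiseless synthetic least-squares problem (the step size 1/(3L) is the one commonly suggested in the SAGA analysis; all other details are my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
y = X @ theta_star                           # noiseless, so theta_star is the exact minimizer

def saga(X, y, gamma, T):
    n, d = X.shape
    theta = np.zeros(d)
    # table of stored per-function gradients y_i, initialized at theta = 0
    grads = (X @ theta - y)[:, None] * X     # grad f_i(theta) = (x_i^T theta - y_i) x_i
    grad_mean = grads.mean(axis=0)
    for _ in range(T):
        i = rng.integers(n)
        g_new = (X[i] @ theta - y[i]) * X[i]
        # variance-reduced direction: new gradient, minus stored one, plus table average
        theta -= gamma * (g_new - grads[i] + grad_mean)
        grad_mean += (g_new - grads[i]) / n  # keep the running mean in sync
        grads[i] = g_new
    return theta

L = np.max(np.sum(X**2, axis=1))             # max smoothness constant over the f_i
theta = saga(X, y, gamma=1 / (3 * L), T=20000)
```

Unlike plain SGD, the correction term makes the update an unbiased gradient estimate whose variance vanishes at the optimum, which is what restores the linear (exponential) rate with O(d) cost per iteration.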
Outline

1. Parametric supervised learning on a single machine
   – Machine learning ≈ optimization of finite sums
   – From batch to stochastic gradient methods
   – Linearly convergent stochastic methods for convex problems

2. Machine learning over networks
   – Centralized and decentralized methods
   – From network averaging to optimization
   – Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)

• Machine learning through optimization:

  min_{θ∈R^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)

  – f_i(θ): error of the model defined by θ on the dataset indexed by i
  – Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_ij, θ⊤Φ(x_ij)) if m_i observations

• Each dataset/function f_i only accessible by node i in a graph
  – Massive datasets, multiple machines/cores
  – Communication/legal constraints

• Goal: minimize communication and local computation costs
• Why not simply distribute a simple single machine algorithm?
  – (Accelerated) gradient descent (see, e.g., Nesterov, 2004): θ_t = θ_{t−1} − γ∇g(θ_{t−1})
  – Requires √κ log(1/ε) full gradient computations to reach precision ε
  – Need to perform distributed averaging over a network
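A naive way to combine the two ingredients is decentralized gradient descent: each node does one gossip averaging step followed by one local gradient step. A sketch (ring topology and quadratic local functions are my own illustration); note that with a constant step size the nodes only agree up to an O(γ) error, although their average converges to the global optimum:

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, d = 10, 3
# node i holds f_i(theta) = 0.5 ||theta - xi_i||^2, so the global optimum is mean(xi)
xi = rng.normal(size=(n_nodes, d))

# doubly stochastic gossip matrix of a ring: average with the two neighbours
W = np.zeros((n_nodes, n_nodes))
for i in range(n_nodes):
    W[i, i] = 0.5
    W[i, (i - 1) % n_nodes] = 0.25
    W[i, (i + 1) % n_nodes] = 0.25

theta = np.zeros((n_nodes, d))   # one local iterate per node
gamma = 0.05
for t in range(2000):
    grads = theta - xi                 # local gradients: grad f_i(theta_i)
    theta = W @ theta - gamma * grads  # gossip step + local gradient step

theta_star = xi.mean(axis=0)
```

Removing this residual O(γ) consensus error (while keeping fast rates) is precisely what the more refined distributed methods discussed next are designed for.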
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, ..., ξ_n ∈ R
  – Compute θ⋆ = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) Σ_{i=1}^n (θ − ξ_i)²

• Centralized algorithms
  – Compute a spanning tree with diameter ≤ 2∆
  – Master/slave algorithm: ∆ communication steps + no error
• Application to centralized distributed optimization
  – √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
  – "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)
• Robustness?
Illustration of synchronous gossip
![Page 10: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/10.jpg)
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull Advertising n gt 109
ndash Φ(x) isin 0 1d d gt 109
ndash Navigation history + ad
- Linear predictions
- h(x θ) = θ⊤Φ(x)
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull Advertising n gt 109
ndash Φ(x) isin 0 1d d gt 109
ndash Navigation history + ad
bull Linear predictions
ndash h(x θ) = θ⊤Φ(x)
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
x1 x2 x3 x4 x5 x6
y1 = 1 y2 = 1 y3 = 1 y4 = minus1 y5 = minus1 y6 = minus1
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
x1 x2 x3 x4 x5 x6
y1 = 1 y2 = 1 y3 = 1 y4 = minus1 y5 = minus1 y6 = minus1
ndash Neural networks (n d gt 106) h(x θ) = θ⊤mσ(θ⊤mminus1σ(middot middot middot θ⊤2 σ(θ⊤1 x))
x y
θ1θ3
θ2
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
ℓ(
yi h(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
data fitting term + regularizer
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
2n
nsum
i=1
(
yi minus h(xi θ))2
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
(least-squares regression)
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
log(
1 + exp(minusyih(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
(logistic regression)
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
ℓ(
yi h(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
data fitting term + regularizer
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
ℓ(
yi h(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
data fitting term + regularizer
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
ℓ(
yi h(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
data fitting term + regularizer
bull Optimization optimization of regularized risk training cost
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n iid
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
ℓ(
yi h(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
data fitting term + regularizer
bull Optimization optimization of regularized risk training cost
bull Statistics guarantees on Ep(xy)ℓ(y h(x θ)) testing cost
Smoothness and (strong) convexity
bull A function g Rd rarr R is L-smooth if and only if it is twice
differentiable and
forallθ isin Rd
∣
∣eigenvalues[
gprimeprime(θ)]∣
∣ 6 L
smooth non-smooth
Smoothness and (strong) convexity
bull A function g Rd rarr R is L-smooth if and only if it is twice
differentiable and
forallθ isin Rd
∣
∣eigenvalues[
gprimeprime(θ)]∣
∣ 6 L
bull Machine learning
ndash with g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Smooth prediction function θ 7rarr h(xi θ) + smooth loss
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt 0
convex
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
convex
stronglyconvex
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
ndash Condition number κ = Lmicro gt 1
(small κ = Lmicro) (large κ = Lmicro)
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)
bull Relevance of convex optimization
ndash Easier design and analysis of algorithms
ndash Global minimum vs local minimum vs stationary points
ndash Gradient-based algorithms only need convexity for their analysis
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Strong convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Strong convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)
ndash Invertible covariance matrix 1n
sumni=1Φ(xi)Φ(xi)
⊤ rArr n gt d
ndash Even when micro gt 0 micro may be arbitrarily small
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Strong convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)
ndash Invertible covariance matrix 1n
sumni=1Φ(xi)Φ(xi)
⊤ rArr n gt d
ndash Even when micro gt 0 micro may be arbitrarily small
bull Adding regularization by micro2θ2
ndash creates additional bias unless micro is small but reduces variance
ndash Typically Lradicn gt micro gt Ln rArr κ isin [
radicn n]
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O(eminust(microL)) = O(eminustκ)
(small κ = Lmicro) (large κ = Lmicro)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
(small κ = Lmicro) (large κ = Lmicro)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
bull Acceleration (Nesterov 1983) second-order recursion
θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)
ndash Good choice of momentum term δt isin [0 1)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
bull Acceleration (Nesterov 1983) second-order recursion
θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)
ndash Good choice of momentum term δt isin [0 1)
g(θt)minus g(θlowast) 6 O(1t2)
g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust
radicκ) if micro-strongly convex
ndash Optimal rates after t = O(d) iterations (Nesterov 2004)
Iterative methods for minimizing smooth functions

• Assumption: g convex and smooth on ℝ^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1})

  – O(1/t) convergence rate for convex functions
  – O(e^(−t/κ)) linear rate if strongly convex ⇔ O(κ log(1/ε)) iterations,
    i.e., complexity = O(nd · κ log(1/ε))

• Newton method: θ_t = θ_{t−1} − g″(θ_{t−1})^(−1) g′(θ_{t−1})

  – O(e^(−ρ 2^t)) quadratic rate ⇔ O(log log(1/ε)) iterations,
    i.e., complexity = O((nd² + d³) · log log(1/ε))

• Key insights for machine learning (Bottou and Bousquet, 2008)

  1. No need to optimize below statistical error
  2. Cost functions are averages
  3. Testing error is more important than training error
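To see the quadratic rate in practice, a small sketch (illustrative, not from the slides): Newton's method on a one-dimensional logistic-plus-ridge objective, where the gradient shrinks to machine precision in a handful of iterations:

```python
import math

def newton_1d(grad, hess, theta0, iters):
    # theta_t = theta_{t-1} - g''(theta_{t-1})^{-1} g'(theta_{t-1})
    theta = theta0
    for _ in range(iters):
        theta -= grad(theta) / hess(theta)
    return theta

# g(theta) = log(1 + exp(-theta)) + theta^2 / 2  (logistic loss plus a ridge term)
grad = lambda th: -1.0 / (1.0 + math.exp(th)) + th
hess = lambda th: math.exp(th) / (1.0 + math.exp(th)) ** 2 + 1.0

theta = newton_1d(grad, hess, 0.0, iters=8)
residual = abs(grad(theta))   # essentially zero after very few iterations
```

Each iteration roughly squares the gradient norm, which is exactly the O(log log(1/ε)) iteration count on the slide.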
Stochastic gradient descent (SGD) for finite sums

min_{θ∈ℝ^d} g(θ) = (1/n) Σ_{i=1}^n f_i(θ)

• Iteration: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})

  – Sampling with replacement: i(t) a random element of {1, …, n}
  – Polyak-Ruppert averaging: θ̄_t = (1/(t+1)) Σ_{u=0}^t θ_u

• Convergence rate if each f_i is convex and L-smooth, and g is μ-strongly convex:

  E g(θ̄_t) − g(θ∗) ≤ O(1/√t) if γ_t = 1/(L√t)
  E g(θ̄_t) − g(θ∗) ≤ O(L/(μt)) = O(κ/t) if γ_t = 1/(μt)

  – No adaptivity to strong convexity in general
  – Running-time complexity: O(d · κ/ε)

• NB: a single pass leads to bounds on the testing error
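A minimal NumPy sketch of the iteration above on a synthetic least-squares problem (the data and constants are illustrative assumptions): sampling with replacement, step size γ_t = 1/(L√t), and a running Polyak-Ruppert average:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def grad_fi(theta, i):
    # gradient of f_i(theta) = 1/2 (x_i^T theta - y_i)^2
    return (X[i] @ theta - y[i]) * X[i]

L = np.max(np.sum(X ** 2, axis=1))      # smoothness bound: max_i ||x_i||^2
theta = np.zeros(d)
theta_avg = np.zeros(d)
for t in range(1, 20001):
    i = rng.integers(n)                                    # sampling with replacement
    theta = theta - grad_fi(theta, i) / (L * np.sqrt(t))   # gamma_t = 1/(L sqrt(t))
    theta_avg += (theta - theta_avg) / (t + 1)             # running average of iterates

g = lambda th: 0.5 * np.mean((X @ th - y) ** 2)
g_star = g(np.linalg.lstsq(X, y, rcond=None)[0])
excess = g(theta_avg) - g_star          # decays like O(1/sqrt(t))
```

Note the averaged iterate is what the O(1/√t) guarantee applies to; the last iterate is noisier.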
Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) Σ_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λΩ(θ)

• Batch gradient descent: θ_t = θ_{t−1} − γ ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) Σ_{i=1}^n ∇f_i(θ_{t−1})

  – Linear (exponential) convergence rate in O(e^(−t/κ)) for strongly convex problems
  – Can be accelerated to O(e^(−t/√κ)) (Nesterov, 1983)
  – Iteration complexity is linear in n

• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t ∇f_{i(t)}(θ_{t−1})

  – Sampling with replacement: i(t) a random element of {1, …, n}
  – Convergence rate in O(κ/t)
  – Iteration complexity is independent of n
Stochastic vs. deterministic methods

• Goal = best of both worlds: linear rate with O(d) iteration cost and a simple choice of step size

[Figure: log(excess cost) vs. time, comparing the deterministic, stochastic, and new methods]
Recent progress in single machine optimization

• Variance reduction

  – Exponential convergence with O(d) iteration cost
  – SAG (Le Roux, Schmidt, and Bach, 2012)
  – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
  – SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

  θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) Σ_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]

  (with y_i^t the stored value at time t of the gradient of the i-th function)

• Running time to reach precision ε (with κ = condition number):

  Stochastic gradient descent:  d × κ × (1/ε)
  Gradient descent:             d × nκ × log(1/ε)
  Variance reduction:           d × (n + κ) × log(1/ε)

  – Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
  – Matching upper and lower bounds of complexity
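The variance-reduced update above can be sketched as SAGA on a small ridge-regression problem (the synthetic data and the step size γ = 1/(3L) are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)
lam = 0.1                                  # ridge term makes each f_i strongly convex

def grad_fi(theta, i):
    # gradient of f_i(theta) = 1/2 (x_i^T theta - y_i)^2 + lam/2 ||theta||^2
    return (X[i] @ theta - y[i]) * X[i] + lam * theta

L = np.max(np.sum(X ** 2, axis=1)) + lam   # smoothness bound for the f_i
gamma = 1.0 / (3.0 * L)
theta = np.zeros(d)
table = np.array([grad_fi(theta, i) for i in range(n)])   # stored gradients y_i^t
table_mean = table.mean(axis=0)

for t in range(20000):
    i = rng.integers(n)
    g_new = grad_fi(theta, i)
    # variance-reduced step: new gradient + mean of stored gradients - stored gradient i
    theta = theta - gamma * (g_new + table_mean - table[i])
    table_mean += (g_new - table[i]) / n   # keep the running mean of the table
    table[i] = g_new

full_grad = X.T @ (X @ theta - y) / n + lam * theta   # gradient of g at theta
# ||full_grad|| is tiny: linear convergence with only O(d) work per iteration
```

Unlike plain SGD, the correction term drives the update's variance to zero at the optimum, which is what restores the linear rate with O(d) iterations.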
Outline

1. Parametric supervised learning on a single machine

  – Machine learning ≈ optimization of finite sums
  – From batch to stochastic gradient methods
  – Linearly convergent stochastic methods for convex problems

2. Machine learning over networks

  – Centralized and decentralized methods
  – From network averaging to optimization
  – Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)

• Machine learning through optimization:

  min_{θ∈ℝ^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)

  – f_i(θ): error of the model defined by θ on the dataset indexed by i
  – Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_{ij}, θ⊤Φ(x_{ij})) for m_i observations

• Each dataset/function f_i is only accessible by node i in a graph

  [Figure: network of nodes 1-9]

  – Massive datasets, multiple machines/cores
  – Communication/legal constraints

• Goal: minimize communication and local computation costs

• Why not simply distribute a simple single machine algorithm?

  – (Accelerated) gradient descent (see, e.g., Nesterov, 2004): θ_t = θ_{t−1} − γ ∇g(θ_{t−1})
  – Requires √κ log(1/ε) full gradient computations to reach precision ε
  – Need to perform distributed averaging over a network
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ ℝ, compute

  θ∗ = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ∈ℝ} (1/n) Σ_{i=1}^n (θ − ξ_i)²

• Centralized algorithms

  – Compute a spanning tree with diameter ≤ 2Δ
  – Master/slave algorithm: Δ communication steps + no error

  [Figure: network of nodes 1-9 and a spanning tree of the same network]

• Application to centralized distributed optimization

  – √κ log(1/ε) gradient steps and √κ Δ log(1/ε) communication steps
  – "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)

• Robustness?
Classical algorithms for distributed averaging

• Decentralized algorithms - gossip (Boyd et al., 2006)

  – Replace θ_i by a weighted average of its neighbors: Σ_{j=1}^n W_{ij} θ_j
  – Potential asynchrony, changing network

  [Figure: network of nodes 1-9]

• Synchronous gossip (all nodes simultaneously)

  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)

  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
  – Consequence: eigenvalues(W) ∈ [−1, 1]
  – Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
  – γ^(−1) = mixing time of the associated Markov chain
  – Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)
Illustration of synchronous gossip
Parametric supervised machine learning

• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, …, n, i.i.d.

• Prediction function h(x, θ) ∈ ℝ parameterized by θ ∈ ℝ^d

  [Figure: six example images x_1, …, x_6 with labels y_1 = y_2 = y_3 = 1 and y_4 = y_5 = y_6 = −1]

• Advertising: n > 10^9

  – Φ(x) ∈ {0, 1}^d, d > 10^9
  – Navigation history + ad

• Linear predictions

  – h(x, θ) = θ⊤Φ(x)

• Neural networks (n, d > 10^6): h(x, θ) = θ_m⊤ σ(θ_{m−1}⊤ σ(··· θ_2⊤ σ(θ_1⊤ x)))

• (Regularized) empirical risk minimization:

  min_{θ∈ℝ^d} (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ)) + λΩ(θ) = (1/n) Σ_{i=1}^n f_i(θ)

  = data fitting term + regularizer

  – Least-squares regression: min_{θ∈ℝ^d} (1/(2n)) Σ_{i=1}^n (y_i − h(x_i, θ))² + λΩ(θ)
  – Logistic regression: min_{θ∈ℝ^d} (1/n) Σ_{i=1}^n log(1 + exp(−y_i h(x_i, θ))) + λΩ(θ)

• Optimization: optimization of the regularized risk (training cost)

• Statistics: guarantees on E_{p(x,y)} ℓ(y, h(x, θ)) (testing cost)
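As a concrete rendering of the objective above, a short sketch (illustrative, not from the slides) of the regularized logistic-regression risk and its gradient with linear predictions h(x, θ) = θ⊤x, verified against a finite difference:

```python
import numpy as np

def logistic_risk(theta, X, y, lam):
    # (1/n) sum_i log(1 + exp(-y_i theta^T x_i)) + lam/2 ||theta||^2
    margins = y * (X @ theta)
    return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * lam * theta @ theta

def logistic_risk_grad(theta, X, y, lam):
    margins = y * (X @ theta)
    weights = -y / (1.0 + np.exp(margins))   # derivative of the loss w.r.t. the margin
    return X.T @ weights / len(y) + lam * theta

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = np.sign(X @ rng.normal(size=4))
theta = rng.normal(size=4)

# finite-difference check of the gradient in a random direction v
v = rng.normal(size=4)
eps = 1e-6
fd = (logistic_risk(theta + eps * v, X, y, 0.1)
      - logistic_risk(theta - eps * v, X, y, 0.1)) / (2 * eps)
an = logistic_risk_grad(theta, X, y, 0.1) @ v
```

Each f_i here is one summand plus its share of the regularizer, which is the decomposition the finite-sum algorithms in this deck operate on.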
Smoothness and (strong) convexity

• A function g: ℝ^d → ℝ is L-smooth if and only if it is twice differentiable and

  ∀θ ∈ ℝ^d, |eigenvalues[g″(θ)]| ≤ L

  [Figures: a smooth function vs. a non-smooth function]

• A twice differentiable function g: ℝ^d → ℝ is convex if and only if

  ∀θ ∈ ℝ^d, eigenvalues[g″(θ)] ≥ 0

  and μ-strongly convex if and only if

  ∀θ ∈ ℝ^d, eigenvalues[g″(θ)] ≥ μ

  – Condition number κ = L/μ ≥ 1

  [Figures: a convex vs. a strongly convex function; level sets for small κ = L/μ vs. large κ = L/μ]

• Machine learning, with g(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ))

  – Smoothness: smooth prediction function θ ↦ h(x_i, θ) + smooth loss
  – Convexity: convex loss and linear predictions h(x, θ) = θ⊤Φ(x)
  – Strong convexity: strongly convex loss, linear predictions, and invertible
    covariance matrix (1/n) Σ_{i=1}^n Φ(x_i)Φ(x_i)⊤ ⇒ n ≥ d
  – Even when μ > 0, μ may be arbitrarily small

• Relevance of convex optimization

  – Easier design and analysis of algorithms
  – Global minimum vs. local minimum vs. stationary points
  – Gradient-based algorithms only need convexity for their analysis

• Adding regularization by (μ/2)‖θ‖²

  – Creates an additional bias unless μ is small, but reduces the variance
  – Typically L/√n ≥ μ ≥ L/n ⇒ κ ∈ [√n, n]
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O(eminust(microL)) = O(eminustκ)
(small κ = Lmicro) (large κ = Lmicro)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
(small κ = Lmicro) (large κ = Lmicro)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
bull Acceleration (Nesterov 1983) second-order recursion
θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)
ndash Good choice of momentum term δt isin [0 1)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
bull Acceleration (Nesterov 1983) second-order recursion
θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)
ndash Good choice of momentum term δt isin [0 1)
g(θt)minus g(θlowast) 6 O(1t2)
g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust
radicκ) if micro-strongly convex
ndash Optimal rates after t = O(d) iterations (Nesterov 2004)
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr O(log log 1ε) iterations
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
bull Key insights for machine learning (Bottou and Bousquet 2008)
1 No need to optimize below statistical error
2 Cost functions are averages
3 Testing error is more important than training error
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
bull Key insights for machine learning (Bottou and Bousquet 2008)
1 No need to optimize below statistical error
2 Cost functions are averages
3 Testing error is more important than training error
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
bull NB single pass leads to bounds on testing error
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)
ndash Iteration complexity is independent of n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
new
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
(with yti stored value at time t of gradient of the i-th function)
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
bull Running-time to reach precision ε (with κ = condition number)
Stochastic gradient descent dtimes∣
∣
∣κ times 1
ε
Gradient descent dtimes∣
∣
∣nκ times log 1
ε
Variance reduction dtimes∣
∣
∣(n+ κ) times log 1
ε
ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ
ndash Matching upper and lower bounds of complexity
Outline
1 Parametric supervised learning on a single machine
ndash Machine learning asymp optimization of finite sums
ndash From batch to stochastic gradient methods
ndash Linearly-convergent stochastic methods for convex problems
2 Machine learning over networks
minus Centralized and decentralized methods
minus From network averaging to optimization
minus Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
ndash Massive datasets multiple machines cores
ndash Communication legal constraints
bull Goal Minimize communication and local computation costs
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
bull Why not simply distributing a simple single machine algorithm
ndash (accelerated) gradient descent (see eg Nesterov 2004)
θt = θtminus1 minus γnablag(θtminus1)
ndash Requiresradicκ log 1
ε full gradient computations to reach precision ε
ndash Need to perform distributed averaging over a network
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ R,

    compute θ* = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ ∈ R} (1/n) Σ_{i=1}^n (θ − ξ_i)²

• Centralized algorithms
– Compute a spanning tree with diameter ≤ 2∆
– Master/slave algorithm: ∆ communication steps + no error

• Application to centralized distributed optimization
– √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
– "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)

• Robustness?
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ R,

    compute θ* = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ ∈ R} (1/n) Σ_{i=1}^n (θ − ξ_i)²

• Decentralized algorithms – gossip (Boyd et al., 2006)
– Replace θ_i by a weighted average of its neighbors: Σ_{j=1}^n W_ij θ_j
– Potential asynchrony, changing network

• Synchronous gossip (all nodes simultaneously)
– Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
– Typical assumption: W symmetric doubly stochastic matrix
Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)
– Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
– Typical assumption: W symmetric doubly stochastic matrix
– Consequence: eigenvalues(W) ∈ [−1, 1]
– Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
– γ^{−1} = mixing time of the associated Markov chain
– Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)
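The iteration θ_t = W θ_{t−1} can be sketched in a few lines. The ring network and weights below are an illustrative assumption (not from the slides); W is symmetric and doubly stochastic, so every node's value converges to the global average:

```python
import numpy as np

# Synchronous gossip on a ring of n nodes: theta_t = W theta_{t-1},
# with W symmetric doubly stochastic, so theta_t -> mean(xi) at every node.
n = 8
xi = np.arange(n, dtype=float)          # one observation per node

# Ring graph: each node averages itself with its two neighbours.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

theta = xi.copy()
for _ in range(500):                    # ~ (1/gamma) log(1/eps) rounds
    theta = W @ theta                   # one communication round

assert np.allclose(theta, xi.mean(), atol=1e-9)
```

The number of rounds needed grows with the mixing time 1/γ of the ring, matching the (1/γ) log(1/ε) count above.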
Illustration of synchronous gossip

[figure: gossip iterations on the example network]
Parametric supervised machine learning

• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, …, n
• Prediction function h(x, θ) ∈ R parameterized by θ ∈ R^d

– Example: images x_1, …, x_6 with labels y_1 = y_2 = y_3 = 1 and y_4 = y_5 = y_6 = −1
– Neural networks (n, d > 10^6): h(x, θ) = θ_m^⊤ σ(θ_{m−1}^⊤ σ(··· θ_2^⊤ σ(θ_1^⊤ x)))
Parametric supervised machine learning

• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, …, n
• Prediction function h(x, θ) ∈ R parameterized by θ ∈ R^d
• (Regularized) empirical risk minimization:

    min_{θ ∈ R^d} (1/(2n)) Σ_{i=1}^n ( y_i − h(x_i, θ) )² + λΩ(θ) = (1/n) Σ_{i=1}^n f_i(θ)

    (least-squares regression)
Parametric supervised machine learning

• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, …, n
• Prediction function h(x, θ) ∈ R parameterized by θ ∈ R^d
• (Regularized) empirical risk minimization:

    min_{θ ∈ R^d} (1/n) Σ_{i=1}^n log( 1 + exp(−y_i h(x_i, θ)) ) + λΩ(θ) = (1/n) Σ_{i=1}^n f_i(θ)

    (logistic regression)
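A hedged sketch of this objective with linear predictions h(x, θ) = θ^⊤x: the toy data, ℓ₂ regularizer, and step size below are assumptions of the sketch, not prescribed by the slides. Plain gradient descent on the regularized logistic risk:

```python
import numpy as np

# Regularized logistic regression:
#   g(theta) = (1/n) sum_i log(1 + exp(-y_i x_i^T theta)) + (lam/2) ||theta||^2
rng = np.random.default_rng(0)
n, d, lam = 100, 5, 0.1
X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n))

def g(theta):
    # log(1 + exp(-m)) computed stably as logaddexp(0, -m)
    return np.mean(np.logaddexp(0.0, -y * (X @ theta))) + 0.5 * lam * theta @ theta

def grad(theta):
    s = 1.0 / (1.0 + np.exp(y * (X @ theta)))   # sigmoid(-y_i x_i^T theta)
    return -(X.T @ (y * s)) / n + lam * theta

theta = np.zeros(d)
for _ in range(200):
    theta -= 0.5 * grad(theta)                  # constant step, chosen by hand

assert g(theta) < g(np.zeros(d))                # objective decreased
assert np.linalg.norm(grad(theta)) < 1e-3       # near the minimizer
```

With λ > 0 the objective is λ-strongly convex, so the constant-step iteration converges linearly, as discussed in the following slides.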
Parametric supervised machine learning

• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, …, n, i.i.d.
• Prediction function h(x, θ) ∈ R parameterized by θ ∈ R^d
• (Regularized) empirical risk minimization:

    min_{θ ∈ R^d} (1/n) Σ_{i=1}^n ℓ( y_i, h(x_i, θ) ) + λΩ(θ) = (1/n) Σ_{i=1}^n f_i(θ)

    = data fitting term + regularizer

• Optimization: optimization of the regularized risk (training cost)
• Statistics: guarantees on E_{p(x,y)} ℓ(y, h(x, θ)) (testing cost)
Smoothness and (strong) convexity

• A function g: R^d → R is L-smooth if and only if it is twice differentiable and

    ∀θ ∈ R^d, |eigenvalues[g″(θ)]| ≤ L

• Machine learning
– With g(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ))
– Smooth prediction function θ ↦ h(x_i, θ) + smooth loss
Smoothness and (strong) convexity

• A twice differentiable function g: R^d → R is convex if and only if

    ∀θ ∈ R^d, eigenvalues[g″(θ)] ≥ 0
Smoothness and (strong) convexity

• A twice differentiable function g: R^d → R is μ-strongly convex if and only if

    ∀θ ∈ R^d, eigenvalues[g″(θ)] ≥ μ

– Condition number: κ = L/μ ≥ 1

[figures: well-conditioned level sets (small κ = L/μ) vs. elongated level sets (large κ = L/μ)]
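A minimal illustration of these definitions on a hand-picked quadratic (an assumption of this sketch): for g(θ) = ½ θ^⊤Hθ the Hessian g″ ≡ H is constant, so L and μ are its extreme eigenvalues and κ = L/μ:

```python
import numpy as np

# For g(theta) = 0.5 theta^T H theta, g''(theta) = H everywhere, so
# L = lambda_max(H), mu = lambda_min(H), kappa = L / mu >= 1.
H = np.array([[4.0, 1.0],
              [1.0, 2.0]])
eigs = np.linalg.eigvalsh(H)    # eigenvalues in ascending order
mu, L = eigs[0], eigs[-1]
kappa = L / mu

assert mu > 0                   # strongly convex
assert kappa >= 1.0
```

Here μ = 3 − √2 ≈ 1.59 and L = 3 + √2 ≈ 4.41, so g is μ-strongly convex and L-smooth with κ ≈ 2.8.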
Smoothness and (strong) convexity

• A twice differentiable function g: R^d → R is μ-strongly convex if and only if

    ∀θ ∈ R^d, eigenvalues[g″(θ)] ≥ μ

• Convexity in machine learning
– With g(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ))
– Convex loss and linear predictions h(x, θ) = θ^⊤Φ(x)

• Relevance of convex optimization
– Easier design and analysis of algorithms
– Global minimum vs. local minimum vs. stationary points
– Gradient-based algorithms only need convexity for their analysis
Smoothness and (strong) convexity

• A twice differentiable function g: R^d → R is μ-strongly convex if and only if

    ∀θ ∈ R^d, eigenvalues[g″(θ)] ≥ μ

• Strong convexity in machine learning
– With g(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ))
– Strongly convex loss and linear predictions h(x, θ) = θ^⊤Φ(x)
– Invertible covariance matrix (1/n) Σ_{i=1}^n Φ(x_i)Φ(x_i)^⊤ ⇒ n ≥ d
– Even when μ > 0, μ may be arbitrarily small!

• Adding regularization by (μ/2)‖θ‖²
– Creates additional bias unless μ is small, but reduces variance
– Typically L/√n ≥ μ ≥ L/n ⇒ κ ∈ [√n, n]
Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on R^d
• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) (line search)

    g(θ_t) − g(θ*) ≤ O(1/t)
    g(θ_t) − g(θ*) ≤ O((1 − 1/κ)^t) = O(e^{−t/κ}) if μ-strongly convex
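The linear rate can be seen concretely on an assumed diagonal quadratic with L = 10 and μ = 1 (so κ = 10), using the step γ = 1/L — each iteration shrinks the excess cost by at least the factor (1 − 1/κ):

```python
import numpy as np

# Gradient descent theta_t = theta_{t-1} - gamma g'(theta_{t-1}) on
# g(theta) = 0.5 theta^T H theta, whose gradient is H theta.
H = np.diag([10.0, 1.0])                 # L = 10, mu = 1, kappa = 10
L_, mu_ = 10.0, 1.0
gamma = 1.0 / L_

g = lambda t: 0.5 * t @ H @ t            # optimal value is 0 at theta* = 0
theta = np.array([1.0, 1.0])
errs = []
for _ in range(50):
    theta = theta - gamma * (H @ theta)  # one gradient step
    errs.append(g(theta))

# Excess cost decays at least like (1 - mu/L)^t: linear convergence.
assert errs[-1] < errs[0] * (1 - mu_ / L_) ** 49
assert errs[-1] < 1e-4
```

The slow coordinate (eigenvalue μ) governs the rate, which is why a large κ = L/μ makes plain gradient descent slow.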
Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on R^d
• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1})

    g(θ_t) − g(θ*) ≤ O(1/t)
    g(θ_t) − g(θ*) ≤ O((1 − 1/κ)^t) = O(e^{−t/κ}) if μ-strongly convex

• Acceleration (Nesterov, 1983): second-order recursion

    θ_t = η_{t−1} − γ_t g′(η_{t−1})  and  η_t = θ_t + δ_t (θ_t − θ_{t−1})

– Good choice of momentum term δ_t ∈ [0, 1)

    g(θ_t) − g(θ*) ≤ O(1/t²)
    g(θ_t) − g(θ*) ≤ O((1 − 1/√κ)^t) = O(e^{−t/√κ}) if μ-strongly convex

– Optimal rates after t = O(d) iterations (Nesterov, 2004)
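A sketch of the second-order recursion on an assumed quadratic with κ = 100, using the standard strongly convex choices γ = 1/L and constant momentum δ = (√κ − 1)/(√κ + 1); these parameter choices are this sketch's assumption. The accelerated run reaches a much smaller objective than plain gradient descent in the same number of iterations:

```python
import numpy as np

# Nesterov's recursion: theta_t = eta_{t-1} - gamma g'(eta_{t-1}),
#                       eta_t = theta_t + delta (theta_t - theta_{t-1}).
H = np.diag([100.0, 1.0])                # quadratic with L = 100, mu = 1
L_, mu_ = 100.0, 1.0
gamma = 1.0 / L_
kappa = L_ / mu_
delta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)

def run(accel, iters=200):
    theta = theta_prev = np.array([1.0, 1.0])
    for _ in range(iters):
        eta = theta + delta * (theta - theta_prev) if accel else theta
        theta_prev, theta = theta, eta - gamma * (H @ eta)
    return 0.5 * theta @ H @ theta       # g(theta); optimum value is 0

assert run(True) < run(False)            # acceleration converges faster
```

This matches the rate improvement from e^{−t/κ} to e^{−t/√κ}: with κ = 100, acceleration is roughly ten times faster per iteration in the exponent.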
Iterative methods for minimizing smooth functions

• Assumption: g convex and smooth on R^d
• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1})
– O(1/t) convergence rate for convex functions
– O(e^{−t/κ}) linear rate if strongly convex ⇔ complexity = O(nd · κ log(1/ε))

• Newton method: θ_t = θ_{t−1} − g″(θ_{t−1})^{−1} g′(θ_{t−1})
– O(e^{−ρ2^t}) quadratic rate ⇔ complexity = O((nd² + d³) · log log(1/ε))

• Key insights for machine learning (Bottou and Bousquet, 2008)
1. No need to optimize below the statistical error
2. Cost functions are averages
3. Testing error is more important than training error
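The quadratic rate of Newton's method can be seen on a one-dimensional example chosen here for illustration, g(θ) = e^θ + e^{−θ}, whose minimizer is θ = 0; the error roughly squares at each step, so a handful of iterations suffice:

```python
import math

# Newton step: theta_t = theta_{t-1} - g'(theta_{t-1}) / g''(theta_{t-1}).
theta = 1.0
for _ in range(6):
    gp = math.exp(theta) - math.exp(-theta)    # g'(theta)
    gpp = math.exp(theta) + math.exp(-theta)   # g''(theta) > 0 (convex)
    theta -= gp / gpp                          # here: theta - tanh(theta)

assert abs(theta) < 1e-10                      # ~machine precision in 6 steps
```

This is the log log(1/ε) iteration count from the slide; the price per iteration is forming and inverting the d × d Hessian, hence the O(nd² + d³) cost in higher dimension.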
Stochastic gradient descent (SGD) for finite sums

    min_{θ ∈ R^d} g(θ) = (1/n) Σ_{i=1}^n f_i(θ)

• Iteration: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})
– Sampling with replacement: i(t) is a random element of {1, …, n}
– Polyak–Ruppert averaging: θ̄_t = (1/(t+1)) Σ_{u=0}^t θ_u

• Convergence rate if each f_i is convex, L-smooth, and g is μ-strongly convex:

    E g(θ̄_t) − g(θ*) ≤ O(1/√t) if γ_t = 1/(L√t)
    E g(θ̄_t) − g(θ*) ≤ O(L/(μt)) = O(κ/t) if γ_t = 1/(μt)

– No adaptivity to strong convexity in general
– Running-time complexity: O(d · κ/ε)

• NB: a single pass leads to bounds on the testing error
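A sketch of SGD with Polyak–Ruppert averaging on the simplest finite sum from the averaging slides, f_i(θ) = ½(θ − ξ_i)² (so μ = L = 1 and the minimizer is the mean of the ξ_i), with the step γ_t = 1/(μt); the toy data values are this sketch's assumption:

```python
import numpy as np

# SGD: theta_t = theta_{t-1} - gamma_t f'_{i(t)}(theta_{t-1}),
# with f'_i(theta) = theta - xi_i and gamma_t = 1/(mu t), mu = 1.
rng = np.random.default_rng(0)
xi = rng.standard_normal(50) + 3.0       # minimizer is xi.mean()

theta, theta_bar = 0.0, 0.0
T = 20000
for t in range(1, T + 1):
    i = rng.integers(len(xi))            # sampling with replacement
    theta -= (1.0 / t) * (theta - xi[i]) # stochastic gradient step
    theta_bar += (theta - theta_bar) / t # online Polyak-Ruppert average

assert abs(theta_bar - xi.mean()) < 0.1  # O(kappa/t) accuracy, not exact
```

Unlike the batch methods above, the per-iteration cost does not depend on n, but the iterate only approaches θ* at a sublinear O(κ/t) rate — the gap that variance reduction closes later.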
Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) Σ_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λΩ(θ)

• Batch gradient descent: θ_t = θ_{t−1} − γ ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) Σ_{i=1}^n ∇f_i(θ_{t−1})
– Exponential (linear) convergence rate in O(e^{−t/κ}) for strongly convex problems
– Can be accelerated to O(e^{−t/√κ}) (Nesterov, 1983)
– Iteration complexity is linear in n

• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t ∇f_{i(t)}(θ_{t−1})
– Sampling with replacement: i(t) is a random element of {1, …, n}
– Convergence rate in O(κ/t)
– Iteration complexity is independent of n
Stochastic vs. deterministic methods

• Goal = best of both worlds: linear rate with O(d) iteration cost

[figure: log(excess cost) vs. time for deterministic, stochastic, and new methods]
Recent progress in single machine optimization

• Variance reduction
– Exponential (linear) convergence with O(d) iteration cost
– SAG (Le Roux, Schmidt, and Bach, 2012)
– SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
– SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

    θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) Σ_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]

    (with y_i^t the stored value at time t of the gradient of the i-th function)
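The update above (SAGA-style, with a table of stored gradients y_i) can be sketched on the same toy finite sum f_i(θ) = ½(θ − ξ_i)²; the step size γ and the toy data are assumptions of this sketch, with γ chosen in the spirit of γ = 1/(3L) for L = 1:

```python
import numpy as np

# SAGA-style variance-reduced step:
#   theta <- theta - gamma [ f'_i(theta) - y_i + (1/n) sum_j y_j ]
# with y_i the last gradient evaluated for function i.
rng = np.random.default_rng(0)
xi = rng.standard_normal(20) + 5.0
n = len(xi)

theta = 0.0
y = np.zeros(n)                          # stored per-function gradients
y_mean = y.mean()
gamma = 0.3                              # constant step: linear convergence

for _ in range(2000):
    i = rng.integers(n)
    g_i = theta - xi[i]                  # fresh gradient of f_i
    theta -= gamma * (g_i - y[i] + y_mean)
    y_mean += (g_i - y[i]) / n           # keep the running mean in sync
    y[i] = g_i                           # overwrite the stored gradient

assert abs(theta - xi.mean()) < 1e-4     # converges to the exact optimum
```

Unlike plain SGD with a constant step, the correction term −y_i + ȳ drives the gradient-estimate variance to zero at the optimum, which is what permits a constant step and a linear rate with O(d) per-iteration cost.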
Recent progress in single machine optimization

• Variance reduction
– Exponential (linear) convergence with O(d) iteration cost
– SAG (Le Roux, Schmidt, and Bach, 2012)
– SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
– SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

• Running time to reach precision ε (with κ = condition number):

    Stochastic gradient descent:  d × κ × 1/ε
    Gradient descent:             d × nκ × log(1/ε)
    Variance reduction:           d × (n + κ) × log(1/ε)

– Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
– Matching upper and lower bounds of complexity
Outline

1. Parametric supervised learning on a single machine
– Machine learning ≈ optimization of finite sums
– From batch to stochastic gradient methods
– Linearly convergent stochastic methods for convex problems

2. Machine learning over networks
– Centralized and decentralized methods
– From network averaging to optimization
– Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
ndash Massive datasets multiple machines cores
ndash Communication legal constraints
bull Goal Minimize communication and local computation costs
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
bull Why not simply distributing a simple single machine algorithm
ndash (accelerated) gradient descent (see eg Nesterov 2004)
θt = θtminus1 minus γnablag(θtminus1)
ndash Requiresradicκ log 1
ε full gradient computations to reach precision ε
ndash Need to perform distributed averaging over a network
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
bull Robustness
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
ndash Consequence Eigenvalues(W ) isin [minus1 1]
ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )
ndash γminus1 = mixing time of the associated Markov chain
1 3
24
57
6 9
8
ndash Need 1γ log
1ε iterations to reach precision ε (for classical averaging)
Illustration of synchronous gossip
![Page 13: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/13.jpg)
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
x1 x2 x3 x4 x5 x6
y1 = 1 y2 = 1 y3 = 1 y4 = minus1 y5 = minus1 y6 = minus1
ndash Neural networks (n d gt 106) h(x θ) = θ⊤mσ(θ⊤mminus1σ(middot middot middot θ⊤2 σ(θ⊤1 x))
x y
θ1θ3
θ2
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
ℓ(
yi h(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
data fitting term + regularizer
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
2n
nsum
i=1
(
yi minus h(xi θ))2
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
(least-squares regression)
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
log(
1 + exp(minusyih(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
(logistic regression)
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
ℓ(
yi h(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
data fitting term + regularizer
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
ℓ(
yi h(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
data fitting term + regularizer
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
ℓ(
yi h(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
data fitting term + regularizer
bull Optimization optimization of regularized risk training cost
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n iid
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
ℓ(
yi h(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
data fitting term + regularizer
bull Optimization optimization of regularized risk training cost
bull Statistics guarantees on Ep(xy)ℓ(y h(x θ)) testing cost
Smoothness and (strong) convexity
bull A function g Rd rarr R is L-smooth if and only if it is twice
differentiable and
forallθ isin Rd
∣
∣eigenvalues[
gprimeprime(θ)]∣
∣ 6 L
smooth non-smooth
Smoothness and (strong) convexity
bull A function g Rd rarr R is L-smooth if and only if it is twice
differentiable and
forallθ isin Rd
∣
∣eigenvalues[
gprimeprime(θ)]∣
∣ 6 L
bull Machine learning
ndash with g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Smooth prediction function θ 7rarr h(xi θ) + smooth loss
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt 0
convex
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
convex
stronglyconvex
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
ndash Condition number κ = Lmicro gt 1
(small κ = Lmicro) (large κ = Lmicro)
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)
bull Relevance of convex optimization
ndash Easier design and analysis of algorithms
ndash Global minimum vs local minimum vs stationary points
ndash Gradient-based algorithms only need convexity for their analysis
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Strong convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Strong convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)
ndash Invertible covariance matrix 1n
sumni=1Φ(xi)Φ(xi)
⊤ rArr n gt d
ndash Even when micro gt 0 micro may be arbitrarily small
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Strong convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)
ndash Invertible covariance matrix 1n
sumni=1Φ(xi)Φ(xi)
⊤ rArr n gt d
ndash Even when micro gt 0 micro may be arbitrarily small
bull Adding regularization by micro2θ2
ndash creates additional bias unless micro is small but reduces variance
ndash Typically Lradicn gt micro gt Ln rArr κ isin [
radicn n]
Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on ℝ^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) (line search)
  g(θ_t) − g(θ∗) ≤ O(1/t)
  g(θ_t) − g(θ∗) ≤ O((1 − 1/κ)^t) = O(e^(−t/κ)) if µ-strongly convex

• Acceleration (Nesterov, 1983): second-order recursion
  θ_t = η_{t−1} − γ_t g′(η_{t−1}) and η_t = θ_t + δ_t(θ_t − θ_{t−1})
  – Good choice of momentum term δ_t ∈ [0, 1)
  g(θ_t) − g(θ∗) ≤ O(1/t²)
  g(θ_t) − g(θ∗) ≤ O((1 − 1/√κ)^t) = O(e^(−t/√κ)) if µ-strongly convex
  – Optimal rates after t = O(d) iterations (Nesterov, 2004)
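The two rates can be seen on a toy quadratic (an assumed 2-d example, not from the slides): with κ = 100 and the classical constant momentum δ = (√L − √µ)/(√L + √µ), the accelerated recursion contracts like (1 − 1/√κ)^t, far faster than plain gradient descent's (1 − 1/κ)^t.

```python
import numpy as np

# Assumed toy objective: g(theta) = 0.5 * theta' A theta, with L = 100, mu = 1, kappa = 100
A = np.diag([100.0, 1.0])
L, mu = 100.0, 1.0
grad = lambda theta: A @ theta

theta_gd = np.array([1.0, 1.0])      # plain gradient descent iterate
theta = np.array([1.0, 1.0])         # accelerated iterate
eta = theta.copy()
delta = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))  # momentum term

for t in range(200):
    theta_gd = theta_gd - grad(theta_gd) / L
    theta_prev = theta
    theta = eta - grad(eta) / L      # gradient step at the extrapolated point ...
    eta = theta + delta * (theta - theta_prev)   # ... followed by momentum

# after 200 iterations, the accelerated iterate is orders of magnitude closer to 0
```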
Iterative methods for minimizing smooth functions

• Assumption: g convex and smooth on ℝ^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1})
  – O(1/t) convergence rate for convex functions
  – O(e^(−t/κ)) linear rate if strongly convex ⇔ complexity = O(nd · κ log(1/ε))

• Newton method: θ_t = θ_{t−1} − g″(θ_{t−1})⁻¹ g′(θ_{t−1})
  – O(e^(−ρ 2^t)) quadratic rate ⇔ complexity = O((nd² + d³) · log log(1/ε))

• Key insights for machine learning (Bottou and Bousquet, 2008)
  1. No need to optimize below statistical error
  2. Cost functions are averages
  3. Testing error is more important than training error
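The quadratic rate O(e^(−ρ 2^t)) means the error is roughly squared at each step, i.e. the number of correct digits doubles. A minimal illustration on an assumed 1-d strictly convex function g(θ) = e^θ + e^(−θ), minimized at θ∗ = 0:

```python
import math

def newton_step(theta):
    # one Newton step on g(theta) = exp(theta) + exp(-theta)
    g1 = math.exp(theta) - math.exp(-theta)   # g'(theta)
    g2 = math.exp(theta) + math.exp(-theta)   # g''(theta)
    return theta - g1 / g2                    # equals theta - tanh(theta)

theta = 1.0
errors = []
for _ in range(4):
    theta = newton_step(theta)
    errors.append(abs(theta))
# errors shrink quadratically: each error is below the square of the previous one
```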
Stochastic gradient descent (SGD) for finite sums

min_{θ∈ℝ^d} g(θ) = (1/n) ∑_{i=1}^n f_i(θ)

• Iteration: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) random element of {1, …, n}
  – Polyak-Ruppert averaging: θ̄_t = (1/(t+1)) ∑_{u=0}^t θ_u

• Convergence rate if each f_i is convex, L-smooth, and g is µ-strongly convex:
  E g(θ̄_t) − g(θ∗) ≤ O(1/√t) if γ_t = 1/(L√t)
  E g(θ̄_t) − g(θ∗) ≤ O(L/(µt)) = O(κ/t) if γ_t = 1/(µt)
  – No adaptivity to strong convexity in general
  – Running-time complexity: O(d · κ/ε)

• NB: a single pass leads to bounds on the testing error
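A concrete sketch of SGD with the γ_t = 1/(µt) schedule and Polyak-Ruppert averaging on an assumed regularized least-squares problem (the data, the shift t₀ that keeps the first few steps stable, and the running-average form are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 5
theta_true = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = X @ theta_true + 0.1 * rng.standard_normal(n)

mu = 1.0  # strong convexity supplied by an added (mu/2) * ||theta||^2 penalty

def grad_i(theta, i):
    # gradient of f_i(theta) = 0.5 * (y_i - x_i' theta)^2 + (mu/2) * ||theta||^2
    return (X[i] @ theta - y[i]) * X[i] + mu * theta

theta = np.zeros(d)
avg = np.zeros(d)
t0 = 20                                            # assumed shift for early-step stability
for t in range(1, 20001):
    i = rng.integers(n)                            # sampling with replacement
    theta -= grad_i(theta, i) / (mu * (t + t0))    # gamma_t ~ 1/(mu t)
    avg += (theta - avg) / t                       # running Polyak-Ruppert average

# the averaged iterate approaches the regularized minimizer
theta_reg = np.linalg.solve(X.T @ X / n + mu * np.eye(d), X.T @ y / n)
```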
Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) ∑_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λΩ(θ)

• Batch gradient descent: θ_t = θ_{t−1} − γ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) ∑_{i=1}^n ∇f_i(θ_{t−1})
  – Exponential convergence rate in O(e^(−t/κ)) for strongly convex problems
  – Can be accelerated to O(e^(−t/√κ)) (Nesterov, 1983)
  – Iteration complexity is linear in n

• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t ∇f_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) random element of {1, …, n}
  – Convergence rate in O(κ/t)
  – Iteration complexity is independent of n
Stochastic vs. deterministic methods

• Goal = best of both worlds: linear rate with O(d) iteration cost, simple choice of step size

(figure: log(excess cost) vs. time — the deterministic method decreases steadily but each iteration is expensive, the stochastic method drops quickly then stalls, and the new methods combine both behaviours)
Recent progress in single machine optimization

• Variance reduction
  – Exponential convergence with O(d) iteration cost
  – SAG (Le Roux, Schmidt, and Bach, 2012)
  – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
  – SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

  θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) ∑_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]

  (with y_i^t the stored value at time t of the gradient of the i-th function)

• Running time to reach precision ε (with κ = condition number):

  Stochastic gradient descent:  d × κ × 1/ε
  Gradient descent:             d × nκ × log(1/ε)
  Variance reduction:           d × (n + κ) × log(1/ε)

  – Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
  – Matching upper and lower bounds of complexity
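As a concrete sketch, here is SAGA on a small assumed regularized least-squares problem: `table` plays the role of the stored gradients y_i^t, the update matches the displayed formula, and the step size follows the standard γ ≈ 1/(3L) prescription. The data and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 10
X = rng.standard_normal((n, d))
theta_true = rng.standard_normal(d)
y = X @ theta_true
mu = 0.1  # added (mu/2) * ||theta||^2 regularization for strong convexity

def grad_i(theta, i):
    return (X[i] @ theta - y[i]) * X[i] + mu * theta

theta = np.zeros(d)
table = np.array([grad_i(theta, i) for i in range(n)])  # stored gradients y_i
table_mean = table.mean(axis=0)
L_max = (X ** 2).sum(axis=1).max() + mu
gamma = 1.0 / (3.0 * L_max)              # standard SAGA step size

for t in range(20000):
    i = rng.integers(n)
    g = grad_i(theta, i)
    # variance-reduced gradient estimate: unbiased, with variance vanishing at the optimum
    theta -= gamma * (g - table[i] + table_mean)
    table_mean += (g - table[i]) / n     # keep the table average in sync
    table[i] = g

# linear convergence to the exact regularized minimizer (no noise floor)
theta_reg = np.linalg.solve(X.T @ X / n + mu * np.eye(d), X.T @ y / n)
```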
Outline

1. Parametric supervised learning on a single machine
   – Machine learning ≈ optimization of finite sums
   – From batch to stochastic gradient methods
   – Linearly convergent stochastic methods for convex problems

2. Machine learning over networks
   – Centralized and decentralized methods
   – From network averaging to optimization
   – Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)

• Machine learning through optimization

  min_{θ∈ℝ^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)

  – f_i(θ): error of the model defined by θ on the dataset indexed by i
  – Example: f_i(θ) = (1/m_i) ∑_{j=1}^{m_i} ℓ(y_ij, θᵀΦ(x_ij)) if m_i observations

• Each dataset/function f_i only accessible by node i in a graph

  (figure: example network graph with nodes 1–9)

  – Massive datasets, multiple machines/cores
  – Communication/legal constraints

• Goal: minimize communication and local computation costs

• Why not simply distribute a simple single machine algorithm?
  – (Accelerated) gradient descent (see, e.g., Nesterov, 2004): θ_t = θ_{t−1} − γ∇g(θ_{t−1})
  – Requires √κ log(1/ε) full gradient computations to reach precision ε
  – Need to perform distributed averaging over a network
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ ℝ, compute

  θ∗ = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈ℝ} (1/n) ∑_{i=1}^n (θ − ξ_i)²

• Centralized algorithms
  – Compute a spanning tree with diameter ≤ 2∆
  – Master/slave algorithm: ∆ communication steps + no error

  (figure: example network graph and a spanning tree over its 9 nodes)

• Application to centralized distributed optimization
  – √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
  – "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)

• Robustness?
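A minimal sketch of the master/slave idea on an assumed spanning tree over the 9 nodes of the example graph: partial sums and counts are aggregated toward the root, which then holds the exact average and can broadcast it back down — the number of communication rounds is the tree depth, and there is no approximation error.

```python
# Assumed spanning tree rooted at node 1 (parent -> children); values xi_i = i
tree = {1: [2, 3], 2: [4, 5], 3: [6], 6: [7, 8, 9]}
values = {i: float(i) for i in range(1, 10)}

def subtree_sum(node):
    # each slave reports the sum and count of its subtree to its parent
    s, c = values[node], 1
    for child in tree.get(node, []):
        cs, cc = subtree_sum(child)
        s, c = s + cs, c + cc
    return s, c

total, count = subtree_sum(1)   # aggregation phase: leaves -> root
mean = total / count            # the master now holds the exact average ...
                                # ... and broadcasts it back down the tree
```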
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ ℝ, compute

  θ∗ = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈ℝ} (1/n) ∑_{i=1}^n (θ − ξ_i)²

• Decentralized algorithms – gossip (Boyd et al., 2006)
  – Replace θ_i by a weighted average of its neighbors: ∑_{j=1}^n W_ij θ_j
  – Potential asynchrony, changing network

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = Wθ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = Wθ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
  – Consequence: eigenvalues(W) ∈ [−1, 1]
  – Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
  – γ⁻¹ = mixing time of the associated Markov chain
  – Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)
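The iteration θ_t = Wθ_{t−1} and the (1/γ) log(1/ε) iteration count can be sketched on a 9-node ring (the ring topology and the gossip weights are assumed for illustration): every node converges to the global mean at a rate governed by the eigengap of W.

```python
import numpy as np

n = 9
# Gossip matrix for a ring: each node keeps half its value and averages the rest
# with its two neighbours; W is symmetric and doubly stochastic.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

xi = np.arange(1.0, n + 1)           # initial observations xi_1, ..., xi_n
theta = xi.copy()

eigs = np.sort(np.linalg.eigvalsh(W))[::-1]
gap = eigs[0] - eigs[1]              # eigengap gamma = 1 - lambda_2(W)
T = int(np.ceil(np.log(1e6) / gap))  # ~ (1/gamma) log(1/eps) iterations for eps = 1e-6

for _ in range(T):
    theta = W @ theta                # synchronous gossip step

# all nodes are now close to the global mean 5.0
```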
Illustration of synchronous gossip
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
ℓ(
yi h(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
data fitting term + regularizer
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
2n
nsum
i=1
(
yi minus h(xi θ))2
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
(least-squares regression)
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
log(
1 + exp(minusyih(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
(logistic regression)
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
ℓ(
yi h(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
data fitting term + regularizer
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
ℓ(
yi h(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
data fitting term + regularizer
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
ℓ(
yi h(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
data fitting term + regularizer
bull Optimization optimization of regularized risk training cost
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n iid
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
ℓ(
yi h(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
data fitting term + regularizer
bull Optimization optimization of regularized risk training cost
bull Statistics guarantees on Ep(xy)ℓ(y h(x θ)) testing cost
Smoothness and (strong) convexity
bull A function g Rd rarr R is L-smooth if and only if it is twice
differentiable and
forallθ isin Rd
∣
∣eigenvalues[
gprimeprime(θ)]∣
∣ 6 L
smooth non-smooth
Smoothness and (strong) convexity
bull A function g Rd rarr R is L-smooth if and only if it is twice
differentiable and
forallθ isin Rd
∣
∣eigenvalues[
gprimeprime(θ)]∣
∣ 6 L
bull Machine learning
ndash with g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Smooth prediction function θ 7rarr h(xi θ) + smooth loss
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt 0
convex
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
convex
stronglyconvex
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
ndash Condition number κ = Lmicro gt 1
(small κ = Lmicro) (large κ = Lmicro)
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)
bull Relevance of convex optimization
ndash Easier design and analysis of algorithms
ndash Global minimum vs local minimum vs stationary points
ndash Gradient-based algorithms only need convexity for their analysis
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Strong convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Strong convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)
ndash Invertible covariance matrix 1n
sumni=1Φ(xi)Φ(xi)
⊤ rArr n gt d
ndash Even when micro gt 0 micro may be arbitrarily small
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Strong convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)
ndash Invertible covariance matrix 1n
sumni=1Φ(xi)Φ(xi)
⊤ rArr n gt d
ndash Even when micro gt 0 micro may be arbitrarily small
bull Adding regularization by micro2θ2
ndash creates additional bias unless micro is small but reduces variance
ndash Typically Lradicn gt micro gt Ln rArr κ isin [
radicn n]
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O(eminust(microL)) = O(eminustκ)
(small κ = Lmicro) (large κ = Lmicro)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
(small κ = Lmicro) (large κ = Lmicro)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
bull Acceleration (Nesterov 1983) second-order recursion
θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)
ndash Good choice of momentum term δt isin [0 1)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
bull Acceleration (Nesterov 1983) second-order recursion
θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)
ndash Good choice of momentum term δt isin [0 1)
g(θt)minus g(θlowast) 6 O(1t2)
g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust
radicκ) if micro-strongly convex
ndash Optimal rates after t = O(d) iterations (Nesterov 2004)
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr O(log log 1ε) iterations
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
bull Key insights for machine learning (Bottou and Bousquet 2008)
1 No need to optimize below statistical error
2 Cost functions are averages
3 Testing error is more important than training error
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
bull Key insights for machine learning (Bottou and Bousquet 2008)
1 No need to optimize below statistical error
2 Cost functions are averages
3 Testing error is more important than training error
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
bull NB single pass leads to bounds on testing error
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)
ndash Iteration complexity is independent of n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
new
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
(with yti stored value at time t of gradient of the i-th function)
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
bull Running-time to reach precision ε (with κ = condition number)
Stochastic gradient descent dtimes∣
∣
∣κ times 1
ε
Gradient descent dtimes∣
∣
∣nκ times log 1
ε
Variance reduction dtimes∣
∣
∣(n+ κ) times log 1
ε
ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ
ndash Matching upper and lower bounds of complexity
Outline
1 Parametric supervised learning on a single machine
ndash Machine learning asymp optimization of finite sums
ndash From batch to stochastic gradient methods
ndash Linearly-convergent stochastic methods for convex problems
2 Machine learning over networks
minus Centralized and decentralized methods
minus From network averaging to optimization
minus Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
ndash Massive datasets multiple machines cores
ndash Communication legal constraints
bull Goal Minimize communication and local computation costs
Distribution in machine learning (and beyond)
• Machine learning through optimization:
min_{θ∈R^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)
– f_i(θ): error of the model defined by θ on the dataset indexed by i
• Why not simply distribute a simple single-machine algorithm?
– (Accelerated) gradient descent (see, e.g., Nesterov, 2004):
θ_t = θ_{t−1} − γ∇g(θ_{t−1})
– Requires √κ log(1/ε) full gradient computations to reach precision ε
– Need to perform distributed averaging over a network
Classical algorithms for distributed averaging
• Goal: given n observations ξ_1, …, ξ_n ∈ R,
compute θ* = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) Σ_{i=1}^n (θ − ξ_i)²
• Centralized algorithms
– Compute a spanning tree with diameter ≤ 2∆
– Master/slave algorithm: ∆ communication steps + no error
• Application to centralized distributed optimization
– √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
– "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)
• Robustness?
Classical algorithms for distributed averaging
• Goal: given n observations ξ_1, …, ξ_n ∈ R,
compute θ* = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) Σ_{i=1}^n (θ − ξ_i)²
• Decentralized algorithms – gossip (Boyd et al., 2006)
– Replace θ_i by a weighted average of its neighbors: Σ_{j=1}^n W_ij θ_j
– Potential asynchrony, changing network
• Synchronous gossip (all nodes simultaneously)
– Main iteration: θ_t = Wθ_{t−1} = W^t θ_0 = W^t ξ
– Typical assumption: W symmetric doubly stochastic matrix
Convergence of synchronous gossip
• Synchronous gossip (all nodes simultaneously)
– Main iteration: θ_t = Wθ_{t−1} = W^t θ_0 = W^t ξ
– Typical assumption: W symmetric doubly stochastic matrix
– Consequence: eigenvalues(W) ∈ [−1, 1]
– Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
– γ^{−1} = mixing time of the associated Markov chain
– Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)
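The geometric rate governed by the eigengap is easy to check numerically. Below is a minimal sketch (not from the slides): synchronous gossip on an assumed 10-node ring, with W keeping weight 1/2 on each node and 1/4 on each neighbor, which is symmetric and doubly stochastic, so θ_t = W^t ξ converges to the average at rate λ_2(W)^t.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10                      # number of nodes (assumed ring topology)
xi = rng.normal(size=n)     # one observation per node

# Symmetric, doubly stochastic gossip matrix on a ring:
# each node averages itself (1/2) with its two neighbours (1/4 each).
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

eigs = np.sort(np.linalg.eigvalsh(W))
gamma = eigs[-1] - eigs[-2]          # eigengap = 1 - lambda_2(W)

theta = xi.copy()
for t in range(200):
    theta = W @ theta                # theta_t = W theta_{t-1}

err = np.abs(theta - xi.mean()).max()
```

Here γ ≈ 0.095 (λ_2 = 1/2 + (1/2)cos(2π/10)), and 200 iterations drive the deviation from the true mean well below 10⁻⁶, consistent with the (1/γ) log(1/ε) estimate.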
Illustration of synchronous gossip
Parametric supervised machine learning
• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, …, n
• Prediction function h(x, θ) ∈ R parameterized by θ ∈ R^d
• (Regularized) empirical risk minimization:
min_{θ∈R^d} (1/(2n)) Σ_{i=1}^n (y_i − h(x_i, θ))² + λΩ(θ) = (1/n) Σ_{i=1}^n f_i(θ)
(least-squares regression)
Parametric supervised machine learning
• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, …, n
• Prediction function h(x, θ) ∈ R parameterized by θ ∈ R^d
• (Regularized) empirical risk minimization:
min_{θ∈R^d} (1/n) Σ_{i=1}^n log(1 + exp(−y_i h(x_i, θ))) + λΩ(θ) = (1/n) Σ_{i=1}^n f_i(θ)
(logistic regression)
Parametric supervised machine learning
• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, …, n, i.i.d.
• Prediction function h(x, θ) ∈ R parameterized by θ ∈ R^d
• (Regularized) empirical risk minimization:
min_{θ∈R^d} (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ)) + λΩ(θ) = (1/n) Σ_{i=1}^n f_i(θ)
= data fitting term + regularizer
• Optimization: optimization of the regularized risk (training cost)
• Statistics: guarantees on E_{p(x,y)} ℓ(y, h(x, θ)) (testing cost)
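To fix notation, here is a small numerical sketch (assumed synthetic data; the regularizer Ω(θ) = ‖θ‖²/2 is one illustrative choice, not prescribed by the slides) of the regularized logistic-regression objective g and its gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))          # rows play the role of Phi(x_i)
y = np.sign(rng.normal(size=n))      # labels in {-1, +1}
lam = 0.1                            # lambda, assumed for illustration

def g(theta):
    # (1/n) sum_i log(1 + exp(-y_i theta^T x_i)) + (lam/2) ||theta||^2
    margins = y * (X @ theta)
    return np.mean(np.log1p(np.exp(-margins))) + 0.5 * lam * theta @ theta

def g_grad(theta):
    # gradient of the objective above
    margins = y * (X @ theta)
    s = -y / (1.0 + np.exp(margins))     # per-example logistic-loss derivative
    return X.T @ s / n + lam * theta
```

At θ = 0 every margin is zero, so g(0) = log 2, the loss of the uninformative predictor.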
Smoothness and (strong) convexity
• A function g : R^d → R is L-smooth if and only if it is twice differentiable and
∀θ ∈ R^d, |eigenvalues[g″(θ)]| ≤ L
• Machine learning
– With g(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ))
– Smooth prediction function θ ↦ h(x_i, θ) + smooth loss
Smoothness and (strong) convexity
• A twice differentiable function g : R^d → R is convex if and only if
∀θ ∈ R^d, eigenvalues[g″(θ)] ≥ 0
Smoothness and (strong) convexity
• A twice differentiable function g : R^d → R is µ-strongly convex if and only if
∀θ ∈ R^d, eigenvalues[g″(θ)] ≥ µ
– Condition number κ = L/µ ≥ 1
Smoothness and (strong) convexity
• A twice differentiable function g : R^d → R is µ-strongly convex if and only if
∀θ ∈ R^d, eigenvalues[g″(θ)] ≥ µ
• Convexity in machine learning
– With g(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ))
– Convex loss and linear predictions h(x, θ) = θ^⊤Φ(x)
• Relevance of convex optimization
– Easier design and analysis of algorithms
– Global minimum vs. local minimum vs. stationary points
– Gradient-based algorithms only need convexity for their analysis
Smoothness and (strong) convexity
• A twice differentiable function g : R^d → R is µ-strongly convex if and only if
∀θ ∈ R^d, eigenvalues[g″(θ)] ≥ µ
• Strong convexity in machine learning
– With g(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ))
– Strongly convex loss and linear predictions h(x, θ) = θ^⊤Φ(x)
– Invertible covariance matrix (1/n) Σ_{i=1}^n Φ(x_i)Φ(x_i)^⊤ ⇒ n ≥ d
– Even when µ > 0, µ may be arbitrarily small
• Adding regularization by (µ/2)‖θ‖²
– Creates additional bias unless µ is small, but reduces variance
– Typically L/√n ≥ µ ≥ L/n ⇒ κ ∈ [√n, n]
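For regularized least-squares, the Hessian is constant, so L, µ, and κ are simply the extreme eigenvalues of (1/n)Φ^⊤Φ + µI. A short sketch on assumed synthetic features (the regularization level is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.normal(size=(n, d))     # rows play the role of Phi(x_i)
mu_reg = 0.1                    # added regularization (mu/2) ||theta||^2, assumed

# Hessian of the regularized least-squares objective
# g''(theta) = (1/n) X^T X + mu I  (independent of theta)
H = X.T @ X / n + mu_reg * np.eye(d)
eigs = np.linalg.eigvalsh(H)    # ascending eigenvalues of the symmetric H
L, mu = eigs.max(), eigs.min()
kappa = L / mu                  # condition number kappa = L / mu >= 1
```

The regularizer lifts every eigenvalue by µ, so µ ≥ mu_reg even when the empirical covariance is nearly singular; this is exactly the bias/variance trade-off mentioned above.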
Iterative methods for minimizing smooth functions
• Assumption: g convex and L-smooth on R^d
• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) (line search)
g(θ_t) − g(θ*) ≤ O(1/t)
g(θ_t) − g(θ*) ≤ O((1 − 1/κ)^t) = O(e^{−t/κ}) if µ-strongly convex
Iterative methods for minimizing smooth functions
• Assumption: g convex and L-smooth on R^d
• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1})
g(θ_t) − g(θ*) ≤ O(1/t)
g(θ_t) − g(θ*) ≤ O((1 − 1/κ)^t) = O(e^{−t/κ}) if µ-strongly convex
• Acceleration (Nesterov, 1983): second-order recursion
θ_t = η_{t−1} − γ_t g′(η_{t−1}) and η_t = θ_t + δ_t(θ_t − θ_{t−1})
– Good choice of momentum term δ_t ∈ [0, 1)
g(θ_t) − g(θ*) ≤ O(1/t²)
g(θ_t) − g(θ*) ≤ O((1 − 1/√κ)^t) = O(e^{−t/√κ}) if µ-strongly convex
– Optimal rates after t = O(d) iterations (Nesterov, 2004)
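The gap between the (1 − 1/κ)^t and (1 − 1/√κ)^t rates shows up clearly in a numerical sketch. The quadratic below is assumed (diagonal, κ = 100, chosen only so each coordinate is transparent), and the constant momentum δ = (√κ − 1)/(√κ + 1) is the standard choice in the strongly convex case:

```python
import numpy as np

d = 50
# Diagonal strongly convex quadratic g(theta) = 1/2 sum_i lam_i theta_i^2,
# with mu = 1 and L = 100, hence kappa = 100 (assumed test problem).
lam = np.linspace(1.0, 100.0, d)
g = lambda th: 0.5 * np.sum(lam * th ** 2)
grad = lambda th: lam * th

L, mu = lam.max(), lam.min()
kappa = L / mu
delta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)  # momentum for mu-strongly convex g

theta_gd = np.ones(d)    # plain gradient descent iterate
theta_prev = np.ones(d)  # theta_{t-1} in the accelerated recursion
eta = np.ones(d)         # eta_{t-1}
for t in range(300):
    theta_gd = theta_gd - grad(theta_gd) / L             # gamma = 1/L
    theta_acc = eta - grad(eta) / L                      # theta_t = eta_{t-1} - gamma g'(eta_{t-1})
    eta = theta_acc + delta * (theta_acc - theta_prev)   # eta_t = theta_t + delta (theta_t - theta_{t-1})
    theta_prev = theta_acc
```

After 300 iterations, plain gradient descent is still limited by the slow e^{−t/κ} rate on the smallest-curvature coordinate, while the accelerated iterate `theta_prev` is many orders of magnitude closer to the optimum.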
Iterative methods for minimizing smooth functions
• Assumption: g convex and smooth on R^d
• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1})
– O(1/t) convergence rate for convex functions
– O(e^{−t/κ}) linear rate if strongly convex ⇔ complexity = O(nd · κ log(1/ε))
• Newton method: θ_t = θ_{t−1} − g″(θ_{t−1})^{−1} g′(θ_{t−1})
– O(e^{−ρ2^t}) quadratic rate ⇔ complexity = O((nd² + d³) · log log(1/ε))
• Key insights for machine learning (Bottou and Bousquet, 2008)
1. No need to optimize below statistical error
2. Cost functions are averages
3. Testing error is more important than training error
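The quadratic rate is visible in practice: a sketch of Newton's method on an assumed regularized logistic-regression problem (synthetic data, λ chosen for illustration), where the gradient norm drops to numerical precision within a handful of O(nd² + d³) steps:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 5
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d) + 0.5 * rng.normal(size=n))  # labels in {-1, +1}
lam = 0.1   # ridge regularization, assumed for illustration

def grad(theta):
    p = 1.0 / (1.0 + np.exp(y * (X @ theta)))    # sigma(-y_i x_i^T theta)
    return -(X.T @ (y * p)) / n + lam * theta

def hess(theta):
    p = 1.0 / (1.0 + np.exp(y * (X @ theta)))
    w = p * (1.0 - p)                            # logistic curvature weights
    return (X.T * w) @ X / n + lam * np.eye(d)

theta = np.zeros(d)
for t in range(15):                              # a few Newton steps suffice
    theta = theta - np.linalg.solve(hess(theta), grad(theta))

gnorm = np.linalg.norm(grad(theta))
```

Each step costs a Hessian build plus a d × d solve; with n and d this small that is cheap, but the nd² term is exactly why Newton does not scale to large n and d.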
Stochastic gradient descent (SGD) for finite sums
min_{θ∈R^d} g(θ) = (1/n) Σ_{i=1}^n f_i(θ)
• Iteration: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})
– Sampling with replacement: i(t) a random element of {1, …, n}
– Polyak–Ruppert averaging: θ̄_t = (1/(t+1)) Σ_{u=0}^t θ_u
• Convergence rate if each f_i is convex, L-smooth, and g is µ-strongly convex:
E g(θ̄_t) − g(θ*) ≤ O(1/√t) if γ_t = 1/(L√t)
E g(θ̄_t) − g(θ*) ≤ O(L/(µt)) = O(κ/t) if γ_t = 1/(µt)
– No adaptivity to strong convexity in general
– Running-time complexity: O(d · κ/ε)
• NB: a single pass leads to bounds on the testing error
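The iteration above can be sketched on an assumed regularized least-squares problem. The step size follows the 1/(µt) decay from the slide, with an additive offset (my assumption, not from the slides) so that the first few steps are not destabilized by a large γ:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
lam = 0.1    # ridge term: makes each f_i lam-strongly convex (assumed)

def fi_grad(theta, i):
    # gradient of f_i(theta) = 1/2 (y_i - x_i^T theta)^2 + (lam/2) ||theta||^2
    return (X[i] @ theta - y[i]) * X[i] + lam * theta

mu = lam
R2 = np.max(np.sum(X ** 2, axis=1))   # offset keeping early iterations stable
theta = np.zeros(d)
theta_bar = np.zeros(d)               # Polyak-Ruppert running average
for t in range(1, 50001):
    i = rng.integers(n)
    theta -= fi_grad(theta, i) / (R2 + mu * t)   # gamma_t ~ 1/(mu t)
    theta_bar += (theta - theta_bar) / t

# exact minimizer of g for comparison
theta_opt = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
err = np.linalg.norm(theta_bar - theta_opt)
```

Note the O(κ/t) rate: the averaged iterate approaches θ*, but only polynomially fast, which is what variance reduction improves on below.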
Stochastic vs. deterministic methods
• Minimizing g(θ) = (1/n) Σ_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λΩ(θ)
• Batch gradient descent: θ_t = θ_{t−1} − γ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) Σ_{i=1}^n ∇f_i(θ_{t−1})
– Linear (exponential) convergence rate in O(e^{−t/κ}) for strongly convex problems
– Can be accelerated to O(e^{−t/√κ}) (Nesterov, 1983)
– Iteration complexity is linear in n
• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t ∇f_{i(t)}(θ_{t−1})
– Sampling with replacement: i(t) a random element of {1, …, n}
– Convergence rate in O(κ/t)
– Iteration complexity is independent of n
Stochastic vs. deterministic methods
• Goal = best of both worlds: linear rate with O(d) iteration cost, with a simple choice of step size
(figure: log(excess cost) vs. time, comparing deterministic, stochastic, and new methods)
Recent progress in single machine optimization
• Variance reduction
– Exponential convergence with O(d) iteration cost
– SAG (Le Roux, Schmidt, and Bach, 2012)
– SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
– SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.
θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) Σ_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]
(with y_i^t the stored value at time t of the gradient of the i-th function)
Recent progress in single machine optimization
• Variance reduction
– Exponential convergence with O(d) iteration cost
– SAG (Le Roux, Schmidt, and Bach, 2012)
– SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
– SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.
• Running time to reach precision ε (with κ = condition number):
Stochastic gradient descent: d × κ × 1/ε
Gradient descent: d × nκ × log(1/ε)
Variance reduction: d × (n + κ) × log(1/ε)
– Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
– Matching upper and lower bounds of complexity
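The stored-gradient update above can be sketched as SAGA on an assumed regularized least-squares problem (the 1/(3L) step size is the one used in the SAGA analysis); linear convergence to near machine precision is visible after a few passes over the data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.01 * rng.normal(size=n)
lam = 0.1    # ridge term, assumed for illustration

def fi_grad(theta, i):
    # gradient of f_i(theta) = 1/2 (y_i - x_i^T theta)^2 + (lam/2) ||theta||^2
    return (X[i] @ theta - y[i]) * X[i] + lam * theta

L = np.max(np.sum(X ** 2, axis=1)) + lam   # smoothness constant of each f_i
gamma = 1.0 / (3 * L)                      # SAGA step size

theta = np.zeros(d)
table = np.array([fi_grad(theta, i) for i in range(n)])  # stored gradients y_i^t
table_mean = table.mean(axis=0)

for t in range(20000):
    i = rng.integers(n)
    gi = fi_grad(theta, i)
    # variance-reduced direction: new gradient - stored gradient + table average
    theta -= gamma * (gi - table[i] + table_mean)
    table_mean += (gi - table[i]) / n
    table[i] = gi

theta_opt = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
err = np.linalg.norm(theta - theta_opt)
```

Each iteration touches a single gradient, so the per-iteration cost is O(d) like SGD, yet the error decays linearly like batch gradient descent: the (n + κ) log(1/ε) row of the table above.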
Outline
1 Parametric supervised learning on a single machine
ndash Machine learning asymp optimization of finite sums
ndash From batch to stochastic gradient methods
ndash Linearly-convergent stochastic methods for convex problems
2 Machine learning over networks
minus Centralized and decentralized methods
minus From network averaging to optimization
minus Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
ndash Massive datasets multiple machines cores
ndash Communication legal constraints
bull Goal Minimize communication and local computation costs
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
bull Why not simply distributing a simple single machine algorithm
ndash (accelerated) gradient descent (see eg Nesterov 2004)
θt = θtminus1 minus γnablag(θtminus1)
ndash Requiresradicκ log 1
ε full gradient computations to reach precision ε
ndash Need to perform distributed averaging over a network
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
bull Robustness
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
ndash Consequence Eigenvalues(W ) isin [minus1 1]
ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )
ndash γminus1 = mixing time of the associated Markov chain
1 3
24
57
6 9
8
ndash Need 1γ log
1ε iterations to reach precision ε (for classical averaging)
Illustration of synchronous gossip
![Page 16: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/16.jpg)
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
log(
1 + exp(minusyih(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
(logistic regression)
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
ℓ(
yi h(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
data fitting term + regularizer
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
ℓ(
yi h(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
data fitting term + regularizer
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
ℓ(
yi h(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
data fitting term + regularizer
bull Optimization optimization of regularized risk training cost
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n iid
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
ℓ(
yi h(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
data fitting term + regularizer
bull Optimization optimization of regularized risk training cost
bull Statistics guarantees on Ep(xy)ℓ(y h(x θ)) testing cost
Smoothness and (strong) convexity
bull A function g Rd rarr R is L-smooth if and only if it is twice
differentiable and
forallθ isin Rd
∣
∣eigenvalues[
gprimeprime(θ)]∣
∣ 6 L
smooth non-smooth
Smoothness and (strong) convexity
bull A function g Rd rarr R is L-smooth if and only if it is twice
differentiable and
forallθ isin Rd
∣
∣eigenvalues[
gprimeprime(θ)]∣
∣ 6 L
bull Machine learning
ndash with g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Smooth prediction function θ 7rarr h(xi θ) + smooth loss
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt 0
convex
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
convex
stronglyconvex
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
ndash Condition number κ = Lmicro gt 1
(small κ = Lmicro) (large κ = Lmicro)
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)
bull Relevance of convex optimization
ndash Easier design and analysis of algorithms
ndash Global minimum vs local minimum vs stationary points
ndash Gradient-based algorithms only need convexity for their analysis
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Strong convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Strong convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)
ndash Invertible covariance matrix 1n
sumni=1Φ(xi)Φ(xi)
⊤ rArr n gt d
ndash Even when micro gt 0 micro may be arbitrarily small
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Strong convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)
ndash Invertible covariance matrix 1n
sumni=1Φ(xi)Φ(xi)
⊤ rArr n gt d
ndash Even when micro gt 0 micro may be arbitrarily small
bull Adding regularization by micro2θ2
ndash creates additional bias unless micro is small but reduces variance
ndash Typically Lradicn gt micro gt Ln rArr κ isin [
radicn n]
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O(eminust(microL)) = O(eminustκ)
(small κ = Lmicro) (large κ = Lmicro)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
(small κ = Lmicro) (large κ = Lmicro)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
bull Acceleration (Nesterov 1983) second-order recursion
θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)
ndash Good choice of momentum term δt isin [0 1)
Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on ℝᵈ

• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1})
  – g(θ_t) − g(θ*) ≤ O(1/t)
  – g(θ_t) − g(θ*) ≤ O((1 − 1/κ)^t) = O(e^{−t/κ}) if µ-strongly convex

• Acceleration (Nesterov, 1983): second-order recursion
  θ_t = η_{t−1} − γ_t g′(η_{t−1}) and η_t = θ_t + δ_t(θ_t − θ_{t−1})
  – Good choice of momentum term δ_t ∈ [0, 1)
  – g(θ_t) − g(θ*) ≤ O(1/t²)
  – g(θ_t) − g(θ*) ≤ O((1 − 1/√κ)^t) = O(e^{−t/√κ}) if µ-strongly convex
  – Optimal rates after t = O(d) iterations (Nesterov, 2004)
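The two rates can be observed on a small strongly convex quadratic. A minimal NumPy sketch (illustrative problem data; the constant momentum δ = (√κ − 1)/(√κ + 1) is the standard choice for the strongly convex case):

```python
import numpy as np

# Quadratic g(θ) = ½ θ⊤Aθ with L = 1 (largest) and µ = 0.01 (smallest eigenvalue)
A = np.diag([1.0, 0.01])                   # condition number κ = 100
L, mu = 1.0, 0.01
grad = lambda th: A @ th
g = lambda th: 0.5 * th @ A @ th
theta0 = np.array([1.0, 1.0])

def gd(T, gamma=1/L):
    th = theta0.copy()
    for _ in range(T):
        th = th - gamma * grad(th)         # plain gradient step
    return th

def nesterov(T, gamma=1/L):
    delta = (np.sqrt(L/mu) - 1) / (np.sqrt(L/mu) + 1)   # momentum δ
    th, th_prev = theta0.copy(), theta0.copy()
    for _ in range(T):
        eta = th + delta * (th - th_prev)  # extrapolation step
        th, th_prev = eta - gamma * grad(eta), th
    return th

print(g(gd(200)), g(nesterov(200)))        # accelerated method is far closer to 0
```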
Iterative methods for minimizing smooth functions

• Assumption: g convex and smooth on ℝᵈ

• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1})
  – O(1/t) convergence rate for convex functions
  – O(e^{−t/κ}) linear rate if strongly convex ⇔ complexity = O(nd · κ log(1/ε))

• Newton method: θ_t = θ_{t−1} − g″(θ_{t−1})⁻¹ g′(θ_{t−1})
  – O(e^{−ρ·2^t}) quadratic rate ⇔ complexity = O((nd² + d³) · log log(1/ε))

• Key insights for machine learning (Bottou and Bousquet, 2008)
  1. No need to optimize below statistical error
  2. Cost functions are averages
  3. Testing error is more important than training error
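The quadratic (digit-doubling) rate of Newton's method is easy to observe in one dimension. A minimal sketch on g(θ) = e^θ + e^{−θ}, an illustrative strictly convex function minimized at θ* = 0 (not from the slides):

```python
import numpy as np

# g(θ) = e^θ + e^{−θ}: smooth, strictly convex, minimizer θ* = 0
gp  = lambda th: np.exp(th) - np.exp(-th)   # g′(θ)
gpp = lambda th: np.exp(th) + np.exp(-th)   # g″(θ)

th = 1.0
errors = []
for _ in range(5):
    th = th - gp(th) / gpp(th)              # Newton step
    errors.append(abs(th))
print(errors)  # the number of correct digits roughly doubles each step
```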
Stochastic gradient descent (SGD) for finite sums

min_{θ∈ℝᵈ} g(θ) = (1/n) Σ_{i=1}^n f_i(θ)

• Iteration: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) random element of {1, …, n}
  – Polyak–Ruppert averaging: θ̄_t = (1/(t+1)) Σ_{u=0}^t θ_u

• Convergence rate if each f_i is convex, L-smooth and g is µ-strongly convex:
  E g(θ̄_t) − g(θ*) ≤ O(1/√t) if γ_t = 1/(L√t)
  E g(θ̄_t) − g(θ*) ≤ O(L/(µt)) = O(κ/t) if γ_t = 1/(µt)
  – No adaptivity to strong convexity in general
  – Running-time complexity: O(d · κ/ε)

• NB: a single pass leads to bounds on testing error
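A minimal sketch of the iteration on an illustrative 1-D least-squares problem, where µ = L = 1 so the step size γ_t = 1/(µt) = 1/t; with this step size the iterate is exactly the running mean of the sampled points, which makes the behavior easy to check:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
xi = rng.standard_normal(n)                 # data points ξ_1, …, ξ_n

# f_i(θ) = ½(θ − ξ_i)²  ⇒  g(θ) = (1/n) Σ f_i(θ) is 1-strongly convex (µ = L = 1)
theta = 0.0
theta_avg = 0.0
T = 20000
for t in range(1, T + 1):
    i = rng.integers(n)                     # sampling with replacement
    theta -= (theta - xi[i]) / t            # θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})
    theta_avg += (theta - theta_avg) / t    # Polyak–Ruppert running average

theta_star = xi.mean()                      # the minimizer of g
print(abs(theta - theta_star), abs(theta_avg - theta_star))
```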
Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) Σ_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λΩ(θ)

• Batch gradient descent: θ_t = θ_{t−1} − γ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) Σ_{i=1}^n ∇f_i(θ_{t−1})
  – Linear (exponential) convergence rate in O(e^{−t/κ}) for strongly convex problems
  – Can be accelerated to O(e^{−t/√κ}) (Nesterov, 1983)
  – Iteration complexity is linear in n

• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t ∇f_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) random element of {1, …, n}
  – Convergence rate in O(κ/t)
  – Iteration complexity is independent of n
Stochastic vs. deterministic methods

• Goal = best of both worlds: linear rate with O(d) iteration cost, and a simple choice of step size

[Plot: log(excess cost) vs. time — deterministic (fast rate, expensive iterations), stochastic (cheap iterations, slow rate), and new methods combining both]
Recent progress in single machine optimization

• Variance reduction
  – Exponential convergence with O(d) iteration cost
  – SAG (Le Roux, Schmidt, and Bach, 2012)
  – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
  – SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

  θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) Σ_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]

  (with y_i^t the stored value at time t of the gradient of the i-th function)

• Running time to reach precision ε (with κ = condition number):

  Stochastic gradient descent    d × κ × 1/ε
  Gradient descent               d × nκ × log(1/ε)
  Variance reduction             d × (n + κ) × log(1/ε)

  – Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
  – Matching upper and lower bounds of complexity
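The update above is the SAGA iteration. A minimal NumPy sketch on an illustrative noiseless least-squares problem (the step size γ = 1/(3L) is a commonly used safe choice, an assumption here, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d)
y = X @ theta_star                          # f_i(θ) = ½(y_i − x_i⊤θ)²

L = max(np.sum(X**2, axis=1))               # smoothness constant of the individual f_i
gamma = 1.0 / (3 * L)                       # a standard safe SAGA step size

theta = np.zeros(d)
y_table = (X @ theta - y)[:, None] * X      # stored gradients y_i ← ∇f_i(θ_0)
y_mean = y_table.mean(axis=0)

for t in range(5000):
    i = rng.integers(n)
    g_i = (X[i] @ theta - y[i]) * X[i]      # fresh gradient ∇f_i(θ_{t−1})
    # variance-reduced direction: ∇f_i(θ) − y_i + (1/n) Σ_j y_j
    theta = theta - gamma * (g_i - y_table[i] + y_mean)
    y_mean += (g_i - y_table[i]) / n        # keep the running mean in sync
    y_table[i] = g_i                        # store the fresh gradient

print(np.linalg.norm(theta - theta_star))   # linear convergence despite cheap steps
```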
Outline

1. Parametric supervised learning on a single machine
   – Machine learning ≈ optimization of finite sums
   – From batch to stochastic gradient methods
   – Linearly-convergent stochastic methods for convex problems

2. Machine learning over networks
   – Centralized and decentralized methods
   – From network averaging to optimization
   – Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)

• Machine learning through optimization

  min_{θ∈ℝᵈ} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)

  – f_i(θ): error of the model defined by θ on the dataset indexed by i
  – Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_ij, θ⊤Φ(x_ij)) if m_i observations

• Each dataset/function f_i only accessible by node i in a graph

  [Figure: network graph with nodes 1–9]

  – Massive datasets, multiple machines/cores
  – Communication/legal constraints

• Goal: minimize communication and local computation costs
Distribution in machine learning (and beyond)

• Machine learning through optimization

  min_{θ∈ℝᵈ} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)

  – f_i(θ): error of the model defined by θ on the dataset indexed by i

• Why not simply distribute a simple single machine algorithm?
  – (Accelerated) gradient descent (see, e.g., Nesterov, 2004): θ_t = θ_{t−1} − γ∇g(θ_{t−1})
  – Requires √κ log(1/ε) full gradient computations to reach precision ε
  – Need to perform distributed averaging over a network
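A minimal sketch of the point being made: one step of (non-accelerated) distributed gradient descent needs the average of the local gradients across nodes. Here the network is simulated in a single process with illustrative data; in a real deployment the `np.mean` over gradients is exactly what the averaging primitive must compute.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, m, d = 4, 25, 3
# node i holds its own dataset (X_i, y_i); f_i is the local least-squares error
Xs = [rng.standard_normal((m, d)) for _ in range(n_nodes)]
theta_star = rng.standard_normal(d)
ys = [X @ theta_star for X in Xs]

def local_grad(i, theta):
    # ∇f_i(θ) for f_i(θ) = (1/2m) ‖y_i − X_i θ‖²
    return Xs[i].T @ (Xs[i] @ theta - ys[i]) / m

theta = np.zeros(d)
gamma = 0.3
for _ in range(300):
    # each node computes its local gradient; the network must average them
    grads = [local_grad(i, theta) for i in range(n_nodes)]
    theta = theta - gamma * np.mean(grads, axis=0)

print(np.linalg.norm(theta - theta_star))
```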
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ ℝ,
  compute θ* = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ∈ℝ} (1/n) Σ_{i=1}^n (θ − ξ_i)²

• Centralized algorithms
  – Compute a spanning tree with diameter ≤ 2∆
  – Master/slave algorithm: ∆ communication steps + no error

  [Figure: network graph with nodes 1–9 and its spanning tree]

• Application to centralized distributed optimization
  – √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
  – "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)

• Robustness
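The master/slave scheme can be sketched as a sum-and-count reduction up a spanning tree followed by a broadcast of the exact average. A toy single-process simulation (the 9-node tree below is illustrative, not the one in the slides):

```python
import numpy as np

# A toy spanning tree on 9 nodes (root = 0): values are summed up the tree,
# then the exact average is broadcast back down — no approximation error.
children = {0: [1, 2], 1: [3, 4], 2: [5, 6], 3: [7, 8],
            4: [], 5: [], 6: [], 7: [], 8: []}
xi = np.arange(9, dtype=float)              # ξ_1, …, ξ_n held at the nodes

def subtree_sum_and_count(node):
    # one "communication step" per tree level: children report to their parent
    s, c = xi[node], 1
    for child in children[node]:
        cs, cc = subtree_sum_and_count(child)
        s, c = s + cs, c + cc
    return s, c

s, c = subtree_sum_and_count(0)
average = s / c                             # the root now broadcasts this value
print(average)                              # exact mean of 0..8
```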
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ ℝ, compute θ* = (1/n) Σ_{i=1}^n ξ_i

• Decentralized algorithms – gossip (Boyd et al., 2006)
  – Replace θ_i by a weighted average of its neighbors: Σ_{j=1}^n W_ij θ_j
  – Potential asynchrony, changing network

  [Figure: network graph with nodes 1–9]

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = Wθ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric, doubly stochastic matrix
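A minimal sketch of synchronous gossip (the ring topology and the 1/2–1/4–1/4 weights are illustrative assumptions; any symmetric doubly stochastic W supported on the graph works):

```python
import numpy as np

# Synchronous gossip on a ring of n nodes: θ_t = W θ_{t−1}, with W symmetric
# and doubly stochastic (each node averages itself with its two neighbours).
n = 9
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

rng = np.random.default_rng(0)
xi = rng.standard_normal(n)                 # initial values ξ held at the nodes
theta = xi.copy()
for _ in range(500):
    theta = W @ theta                       # one synchronous gossip round

print(np.max(np.abs(theta - xi.mean())))    # every node ≈ the global average
```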
Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = Wθ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric, doubly stochastic matrix
  – Consequence: eigenvalues(W) ∈ [−1, 1]
  – Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
  – γ⁻¹ = mixing time of the associated Markov chain
  – Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)

  [Figure: network graph with nodes 1–9]
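The eigengap can be computed directly. A minimal sketch on the same illustrative ring topology as above, showing how γ = 1 − λ₂(W) shrinks as the network grows (like Θ(1/n²) on a ring), which drives up the number of gossip rounds needed:

```python
import numpy as np

def ring_W(n):
    # symmetric doubly stochastic gossip matrix on a ring of n nodes
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 0.5
        W[i, (i - 1) % n] = 0.25
        W[i, (i + 1) % n] = 0.25
    return W

gaps = []
for n in (10, 20, 40):
    lam = np.sort(np.linalg.eigvalsh(ring_W(n)))[::-1]   # descending eigenvalues
    gaps.append(lam[0] - lam[1])                          # eigengap γ = 1 − λ₂(W)
print(gaps)  # shrinks roughly by 4× each time n doubles, i.e. Θ(1/n²)
```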
Illustration of synchronous gossip

[Figure: gossip iterations on a network]
Parametric supervised machine learning

• Data: n observations (x_i, y_i) ∈ 𝒳 × 𝒴, i = 1, …, n, i.i.d.

• Prediction function h(x, θ) ∈ ℝ parameterized by θ ∈ ℝᵈ

• (Regularized) empirical risk minimization:

  min_{θ∈ℝᵈ} (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ)) + λΩ(θ) = (1/n) Σ_{i=1}^n f_i(θ)

  (data fitting term + regularizer)

• Optimization: optimization of the regularized risk (training cost)

• Statistics: guarantees on E_{p(x,y)} ℓ(y, h(x, θ)) (testing cost)
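A minimal sketch of the objective (illustrative synthetic data; the logistic loss with a ridge regularizer is used as one example of ℓ + λΩ, not prescribed by the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))             # features Φ(x_i) as rows
y = np.sign(X @ rng.standard_normal(d))     # binary labels in {−1, +1}

def f(theta, lam=0.1):
    # regularized empirical risk with logistic loss ℓ(y, u) = log(1 + e^{−yu})
    margins = y * (X @ theta)
    return np.mean(np.log1p(np.exp(-margins))) + lam / 2 * theta @ theta

theta0 = np.zeros(d)
print(f(theta0))                            # equals log 2 at θ = 0
```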
Smoothness and (strong) convexity

• A function g: ℝᵈ → ℝ is L-smooth if and only if it is twice differentiable and
  ∀θ ∈ ℝᵈ, |eigenvalues[g″(θ)]| ≤ L

• Machine learning
  – With g(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ))
  – Smooth prediction function θ ↦ h(x_i, θ) + smooth loss
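For least squares, the Hessian is constant and L can be computed exactly. A minimal NumPy sketch with illustrative data, including a numerical check of the smoothness inequality ‖∇g(θ) − ∇g(θ′)‖ ≤ L‖θ − θ′‖:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 4
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# For g(θ) = (1/2n)‖y − Xθ‖², the Hessian g″(θ) = (1/n) X⊤X is constant,
# so the smoothness constant L is its largest eigenvalue.
H = X.T @ X / n
L = np.linalg.eigvalsh(H)[-1]

grad = lambda th: X.T @ (X @ th - y) / n
t1, t2 = rng.standard_normal(d), rng.standard_normal(d)
lhs = np.linalg.norm(grad(t1) - grad(t2))
rhs = L * np.linalg.norm(t1 - t2)
print(lhs, rhs)                             # lhs ≤ rhs
```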
Smoothness and (strong) convexity

• A twice differentiable function g: ℝᵈ → ℝ is convex if and only if
  ∀θ ∈ ℝᵈ, eigenvalues[g″(θ)] ≥ 0
Smoothness and (strong) convexity

• A twice differentiable function g: ℝᵈ → ℝ is µ-strongly convex if and only if
  ∀θ ∈ ℝᵈ, eigenvalues[g″(θ)] ≥ µ

  – Condition number κ = L/µ ≥ 1
Smoothness and (strong) convexity

• A twice differentiable function g: ℝᵈ → ℝ is µ-strongly convex if and only if
  ∀θ ∈ ℝᵈ, eigenvalues[g″(θ)] ≥ µ

• Convexity in machine learning
  – With g(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ))
  – Convex loss and linear predictions h(x, θ) = θ⊤Φ(x)

• Relevance of convex optimization
  – Easier design and analysis of algorithms
  – Global minimum vs. local minimum vs. stationary points
  – Gradient-based algorithms only need convexity for their analysis
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O(eminust(microL)) = O(eminustκ)
(small κ = Lmicro) (large κ = Lmicro)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
(small κ = Lmicro) (large κ = Lmicro)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
bull Acceleration (Nesterov 1983) second-order recursion
θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)
ndash Good choice of momentum term δt isin [0 1)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
bull Acceleration (Nesterov 1983) second-order recursion
θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)
ndash Good choice of momentum term δt isin [0 1)
g(θt)minus g(θlowast) 6 O(1t2)
g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust
radicκ) if micro-strongly convex
ndash Optimal rates after t = O(d) iterations (Nesterov 2004)
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr O(log log 1ε) iterations
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
bull Key insights for machine learning (Bottou and Bousquet 2008)
1 No need to optimize below statistical error
2 Cost functions are averages
3 Testing error is more important than training error
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
bull Key insights for machine learning (Bottou and Bousquet 2008)
1 No need to optimize below statistical error
2 Cost functions are averages
3 Testing error is more important than training error
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
bull NB single pass leads to bounds on testing error
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)
ndash Iteration complexity is independent of n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
new
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
(with yti stored value at time t of gradient of the i-th function)
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
bull Running-time to reach precision ε (with κ = condition number)
Stochastic gradient descent dtimes∣
∣
∣κ times 1
ε
Gradient descent dtimes∣
∣
∣nκ times log 1
ε
Variance reduction dtimes∣
∣
∣(n+ κ) times log 1
ε
ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ
ndash Matching upper and lower bounds of complexity
Outline
1 Parametric supervised learning on a single machine
ndash Machine learning asymp optimization of finite sums
ndash From batch to stochastic gradient methods
ndash Linearly-convergent stochastic methods for convex problems
2 Machine learning over networks
minus Centralized and decentralized methods
minus From network averaging to optimization
minus Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
ndash Massive datasets multiple machines cores
ndash Communication legal constraints
bull Goal Minimize communication and local computation costs
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
bull Why not simply distributing a simple single machine algorithm
ndash (accelerated) gradient descent (see eg Nesterov 2004)
θt = θtminus1 minus γnablag(θtminus1)
ndash Requiresradicκ log 1
ε full gradient computations to reach precision ε
ndash Need to perform distributed averaging over a network
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ ℝ,
  compute θ* = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ ∈ ℝ} (1/n) Σ_{i=1}^n (θ − ξ_i)²
• Centralized algorithms
  – Compute a spanning tree with diameter ≤ 2∆
  – Master/slave algorithm: ∆ communication steps + no error
  [figure: 9-node network graph and its spanning tree]
• Application to centralized distributed optimization
  – √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
  – "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)
• Robustness?
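A minimal sketch of the master/slave scheme on a hypothetical rooted spanning tree (node 0 as master): values are summed up the tree, then the mean is broadcast down, giving the exact average in O(diameter) communication steps.

```python
import numpy as np

# Hypothetical 9-node spanning tree, rooted at node 0.
children = {0: [1, 2], 1: [3, 4], 2: [5, 6], 5: [7, 8]}

def subtree_sum(node, values):
    # "Up" phase: each node sends the sum of its subtree to its parent.
    return values[node] + sum(subtree_sum(c, values)
                              for c in children.get(node, []))

rng = np.random.default_rng(0)
values = rng.normal(size=9)          # one observation per node

mean = subtree_sum(0, values) / 9    # master computes the average...
# ...then broadcasts it back down: every node ends with the exact mean.
assert np.isclose(mean, values.mean())
```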
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ ℝ,
  compute θ* = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ ∈ ℝ} (1/n) Σ_{i=1}^n (θ − ξ_i)²
• Decentralized algorithms – gossip (Boyd et al., 2006)
  – Replace θ_i by a weighted average of its neighbors: Σ_{j=1}^n W_ij θ_j
  – Potential asynchrony, changing network
  [figure: 9-node network graph]
• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix

Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
  – Consequence: eigenvalues(W) ∈ [−1, 1]
  – Eigengap: γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
  – γ^{−1} = mixing time of the associated Markov chain
  – Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)
Illustration of synchronous gossip
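In place of the animation, a minimal simulation of synchronous gossip on a 9-node ring; the weights below are one illustrative symmetric doubly stochastic choice (self-weight 1/2, each neighbour 1/4).

```python
import numpy as np

def gossip_matrix_ring(n):
    # Symmetric doubly stochastic W for a ring graph.
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 0.5
        W[i, (i - 1) % n] = 0.25
        W[i, (i + 1) % n] = 0.25
    return W

n = 9
rng = np.random.default_rng(0)
xi = rng.normal(size=n)              # one observation per node
W = gossip_matrix_ring(n)

theta = xi.copy()
for t in range(200):                 # theta_t = W theta_{t-1} = W^t xi
    theta = W @ theta

# Every node ends close to the global mean (error decays like lambda_2^t).
assert np.allclose(theta, xi.mean(), atol=1e-6)
```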
Parametric supervised machine learning

• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, …, n, i.i.d.
• Prediction function h(x, θ) ∈ ℝ parameterized by θ ∈ ℝ^d
• (Regularized) empirical risk minimization:
  min_{θ ∈ ℝ^d} (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ)) + λ Ω(θ) = (1/n) Σ_{i=1}^n f_i(θ)
  (data-fitting term + regularizer)
• Optimization: minimization of the regularized risk (training cost)
• Statistics: guarantees on E_{p(x,y)} ℓ(y, h(x, θ)) (testing cost)
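As a concrete instance of the objective above (hypothetical data; logistic loss, linear predictions, and the ℓ2 regularizer Ω(θ) = ‖θ‖²/2):

```python
import numpy as np

# Regularized empirical risk for the logistic loss with linear
# predictions h(x, theta) = theta^T x; all data here is synthetic.
def empirical_risk(theta, X, y, lam):
    margins = y * (X @ theta)                  # y_i in {-1, +1}
    losses = np.log1p(np.exp(-margins))        # logistic loss
    return losses.mean() + lam / 2 * theta @ theta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sign(rng.normal(size=100))
theta = np.zeros(5)

r = empirical_risk(theta, X, y, lam=0.1)
# At theta = 0 the logistic loss equals log 2 for every sample.
assert abs(r - np.log(2)) < 1e-12
```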
Smoothness and (strong) convexity

• A function g : ℝ^d → ℝ is L-smooth if and only if it is twice differentiable and
  ∀θ ∈ ℝ^d, |eigenvalues[g″(θ)]| ≤ L
  [figure: smooth vs. non-smooth functions]
• Machine learning
  – With g(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ))
  – Smooth prediction function θ ↦ h(x_i, θ) + smooth loss
Smoothness and (strong) convexity

• A twice differentiable function g : ℝ^d → ℝ is convex if and only if
  ∀θ ∈ ℝ^d, eigenvalues[g″(θ)] ≥ 0
• A twice differentiable function g : ℝ^d → ℝ is μ-strongly convex if and only if
  ∀θ ∈ ℝ^d, eigenvalues[g″(θ)] ≥ μ
  – Condition number: κ = L/μ ≥ 1
• Convexity in machine learning
  – With g(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ))
  – Convex loss and linear predictions h(x, θ) = θ⊤Φ(x)
• Relevance of convex optimization
  – Easier design and analysis of algorithms
  – Global minimum vs. local minimum vs. stationary points
  – Gradient-based algorithms only need convexity for their analysis
• Strong convexity in machine learning
  – Strongly convex loss and linear predictions h(x, θ) = θ⊤Φ(x)
  – Invertible covariance matrix (1/n) Σ_{i=1}^n Φ(x_i)Φ(x_i)⊤ ⇒ n ≥ d
  – Even when μ > 0, μ may be arbitrarily small
• Adding regularization by (μ/2)‖θ‖²
  – Creates an additional bias unless μ is small, but reduces the variance
  – Typically L/√n ≥ μ ≥ L/n ⇒ κ ∈ [√n, n]
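A small numerical check of these claims, on synthetic features: with n < d the empirical covariance is singular (μ = 0), and adding the μ-regularizer restores strong convexity with a finite condition number.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 50                        # fewer observations than dimensions
Phi = rng.normal(size=(n, d))
H = Phi.T @ Phi / n                  # Hessian of the unregularized risk

eigs = np.linalg.eigvalsh(H)
assert eigs.min() < 1e-10            # singular: not strongly convex

mu = 1e-3                            # illustrative regularization level
eigs_reg = np.linalg.eigvalsh(H + mu * np.eye(d))
kappa = eigs_reg.max() / eigs_reg.min()   # now finite, but large when mu is small
assert eigs_reg.min() > 0.99 * mu
```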
Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on ℝ^d
• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) (with line search)
  – g(θ_t) − g(θ*) ≤ O(1/t)
  – g(θ_t) − g(θ*) ≤ O((1 − 1/κ)^t) = O(e^{−t/κ}) if μ-strongly convex
• Acceleration (Nesterov, 1983): second-order recursion
  θ_t = η_{t−1} − γ_t g′(η_{t−1}) and η_t = θ_t + δ_t (θ_t − θ_{t−1})
  – Good choice of momentum term δ_t ∈ [0, 1)
  – g(θ_t) − g(θ*) ≤ O(1/t²)
  – g(θ_t) − g(θ*) ≤ O((1 − 1/√κ)^t) = O(e^{−t/√κ}) if μ-strongly convex
  – Optimal rates after t = O(d) iterations (Nesterov, 2004)
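As an illustration (a synthetic badly conditioned quadratic; all constants hypothetical), plain gradient descent against the recursion above with the classical strongly convex momentum δ = (√κ − 1)/(√κ + 1):

```python
import numpy as np

d = 20
A = np.diag(np.linspace(1e-2, 1.0, d))   # Hessian: mu = 1e-2, L = 1
L, mu = 1.0, 1e-2
kappa = L / mu                            # condition number = 100
g = lambda x: 0.5 * x @ A @ x             # minimum value 0 at x = 0

theta_gd = np.ones(d)                     # plain gradient descent
theta, eta = np.ones(d), np.ones(d)       # accelerated iterates
delta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)

for t in range(300):
    theta_gd -= (1 / L) * (A @ theta_gd)
    theta_new = eta - (1 / L) * (A @ eta)          # gradient step at eta
    eta = theta_new + delta * (theta_new - theta)  # momentum extrapolation
    theta = theta_new

# Acceleration wins on this instance: e^{-t/sqrt(kappa)} vs e^{-t/kappa}.
assert g(theta) < g(theta_gd)
```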
Iterative methods for minimizing smooth functions

• Assumption: g convex and smooth on ℝ^d
• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1})
  – O(1/t) convergence rate for convex functions
  – O(e^{−t/κ}) linear rate if strongly convex ⇔ O(κ log(1/ε)) iterations
    ⇔ complexity = O(nd · κ log(1/ε))
• Newton method: θ_t = θ_{t−1} − g″(θ_{t−1})^{−1} g′(θ_{t−1})
  – O(e^{−ρ 2^t}) quadratic rate ⇔ O(log log(1/ε)) iterations
    ⇔ complexity = O((nd² + d³) · log log(1/ε))
• Key insights for machine learning (Bottou and Bousquet, 2008)
  1. No need to optimize below the statistical error
  2. Cost functions are averages
  3. Testing error is more important than training error
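A sketch of the quadratic rate: Newton's method on an illustrative one-dimensional strongly convex function f(t) = log(1 + eᵗ) + t²/2 (function and starting point are hypothetical, not from the slides).

```python
import numpy as np

# f(t) = log(1 + e^t) + t^2 / 2, strongly convex with f'' >= 1.
f_grad = lambda t: 1.0 / (1.0 + np.exp(-t)) + t                 # f'(t)
f_hess = lambda t: np.exp(-t) / (1.0 + np.exp(-t)) ** 2 + 1.0   # f''(t)

t = 2.0
errs = []
for _ in range(6):
    t = t - f_grad(t) / f_hess(t)   # Newton step
    errs.append(abs(f_grad(t)))     # gradient magnitude as error proxy

# Each iteration roughly squares the error: quadratic convergence,
# hence the O(log log(1/eps)) iteration count.
assert errs[-1] < 1e-10
```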
Stochastic gradient descent (SGD) for finite sums

min_{θ ∈ ℝ^d} g(θ) = (1/n) Σ_{i=1}^n f_i(θ)

• Iteration: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) a random element of {1, …, n}
  – Polyak–Ruppert averaging: θ̄_t = (1/(t+1)) Σ_{u=0}^t θ_u
• Convergence rate, if each f_i is convex and L-smooth and g is μ-strongly convex:
  E g(θ̄_t) − g(θ*) ≤ O(1/√t) if γ_t = 1/(L√t)
  E g(θ̄_t) − g(θ*) ≤ O(L/(μt)) = O(κ/t) if γ_t = 1/(μt)
  – No adaptivity to strong convexity in general
  – Running-time complexity: O(d · κ/ε)
• NB: a single pass leads to bounds on the testing error
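A minimal sketch of SGD with γ_t = 1/(μt) on the averaging objective from the gossip slides, g(θ) = (1/n) Σ_i (θ − ξ_i)²/2, which is 1-strongly convex (μ = 1, so γ_t = 1/t); data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
xi = rng.normal(size=n)                 # observations xi_1, ..., xi_n

theta = 0.0
for t in range(1, 20001):
    i = rng.integers(n)                 # sampling with replacement
    # SGD step with gamma_t = 1/t; for this objective the iterate is
    # exactly the running average of the sampled observations.
    theta -= (theta - xi[i]) / t

# O(kappa/t) convergence in expectation toward theta* = mean(xi).
assert abs(theta - xi.mean()) < 0.05
```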
Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) Σ_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λ Ω(θ)
• Batch gradient descent: θ_t = θ_{t−1} − γ ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) Σ_{i=1}^n ∇f_i(θ_{t−1})
  – Exponential (linear) convergence rate in O(e^{−t/κ}) for strongly convex problems
  – Can be accelerated to O(e^{−t/√κ}) (Nesterov, 1983)
  – Iteration complexity is linear in n
• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t ∇f_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) a random element of {1, …, n}
  – Convergence rate in O(κ/t)
  – Iteration complexity is independent of n
• Goal = best of both worlds: a linear rate with O(d) iteration cost, and a simple choice of step size
  [figure: log(excess cost) vs. time for the deterministic, stochastic, and new methods]
Recent progress in single machine optimization

• Variance reduction
  – Exponential convergence with O(d) iteration cost
  – SAG (Le Roux, Schmidt, and Bach, 2012)
  – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
  – SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

  θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) Σ_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]

  (with y_i^t the stored value at time t of the gradient of the i-th function)

• Running time to reach precision ε (with κ = condition number):

  Stochastic gradient descent:  d × κ × (1/ε)
  Gradient descent:             d × nκ × log(1/ε)
  Variance reduction:           d × (n + κ) × log(1/ε)

  – Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
  – Matching upper and lower bounds of complexity
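A sketch of the variance-reduced update above, in the SAGA form, on a synthetic ridge-regression finite sum (data and constants illustrative; γ = 1/(3L) is the standard SAGA step size).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true
lam = 0.1                                  # ridge regularization

def grad_i(theta, i):                      # gradient of f_i
    return (X[i] @ theta - y[i]) * X[i] + lam * theta

L = np.max(np.sum(X**2, axis=1)) + lam     # smoothness constant
gamma = 1 / (3 * L)

theta = np.zeros(d)
table = np.array([grad_i(theta, i) for i in range(n)])   # stored y_i
table_mean = table.mean(axis=0)

for t in range(10000):
    i = rng.integers(n)
    g = grad_i(theta, i)
    theta -= gamma * (g - table[i] + table_mean)   # variance-reduced step
    table_mean += (g - table[i]) / n               # keep the average in sync
    table[i] = g

# Compare against the exact minimizer of the ridge objective.
theta_star = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
assert np.linalg.norm(theta - theta_star) < 1e-3
```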
Outline

1. Parametric supervised learning on a single machine
   – Machine learning ≈ optimization of finite sums
   – From batch to stochastic gradient methods
   – Linearly convergent stochastic methods for convex problems
2. Machine learning over networks
   – Centralized and decentralized methods
   – From network averaging to optimization
   – Distributing the fastest single-machine algorithms
Distribution in machine learning (and beyond)

• Machine learning through optimization:
  min_{θ ∈ ℝ^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)
  – f_i(θ): error of the model defined by θ on the dataset indexed by i
  – Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_ij, θ⊤Φ(x_ij)) for m_i local observations
• Each dataset/function f_i is only accessible by node i in a graph
  – Massive datasets: multiple machines/cores
  – Communication/legal constraints
• Goal: minimize communication and local computation costs
• Why not simply distribute a single-machine algorithm?
  – (Accelerated) gradient descent (see, e.g., Nesterov, 2004): θ_t = θ_{t−1} − γ ∇g(θ_{t−1})
  – Requires √κ log(1/ε) full gradient computations to reach precision ε
  – Need to perform distributed averaging over a network
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
ℓ(
yi h(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
data fitting term + regularizer
bull Optimization optimization of regularized risk training cost
Parametric supervised machine learning
bull Data n observations (xi yi) isin Xtimes Y i = 1 n iid
bull Prediction function h(x θ) isin R parameterized by θ isin Rd
bull (regularized) empirical risk minimization
minθisinRd
1
n
nsum
i=1
ℓ(
yi h(xi θ))
+ λΩ(θ)
=1
n
nsum
i=1
fi(θ)
data fitting term + regularizer
bull Optimization optimization of regularized risk training cost
bull Statistics guarantees on Ep(xy)ℓ(y h(x θ)) testing cost
Smoothness and (strong) convexity
bull A function g Rd rarr R is L-smooth if and only if it is twice
differentiable and
forallθ isin Rd
∣
∣eigenvalues[
gprimeprime(θ)]∣
∣ 6 L
smooth non-smooth
Smoothness and (strong) convexity
bull A function g Rd rarr R is L-smooth if and only if it is twice
differentiable and
forallθ isin Rd
∣
∣eigenvalues[
gprimeprime(θ)]∣
∣ 6 L
bull Machine learning
ndash with g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Smooth prediction function θ 7rarr h(xi θ) + smooth loss
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt 0
convex
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
convex
stronglyconvex
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
ndash Condition number κ = Lmicro gt 1
(small κ = Lmicro) (large κ = Lmicro)
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)
bull Relevance of convex optimization
ndash Easier design and analysis of algorithms
ndash Global minimum vs local minimum vs stationary points
ndash Gradient-based algorithms only need convexity for their analysis
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Strong convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Strong convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)
ndash Invertible covariance matrix 1n
sumni=1Φ(xi)Φ(xi)
⊤ rArr n gt d
ndash Even when micro gt 0 micro may be arbitrarily small
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Strong convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)
ndash Invertible covariance matrix 1n
sumni=1Φ(xi)Φ(xi)
⊤ rArr n gt d
ndash Even when micro gt 0 micro may be arbitrarily small
bull Adding regularization by micro2θ2
ndash creates additional bias unless micro is small but reduces variance
ndash Typically Lradicn gt micro gt Ln rArr κ isin [
radicn n]
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O(eminust(microL)) = O(eminustκ)
(small κ = Lmicro) (large κ = Lmicro)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
(small κ = Lmicro) (large κ = Lmicro)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
bull Acceleration (Nesterov 1983) second-order recursion
θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)
ndash Good choice of momentum term δt isin [0 1)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
bull Acceleration (Nesterov 1983) second-order recursion
θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)
ndash Good choice of momentum term δt isin [0 1)
g(θt)minus g(θlowast) 6 O(1t2)
g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust
radicκ) if micro-strongly convex
ndash Optimal rates after t = O(d) iterations (Nesterov 2004)
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr O(log log 1ε) iterations
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
bull Key insights for machine learning (Bottou and Bousquet 2008)
1 No need to optimize below statistical error
2 Cost functions are averages
3 Testing error is more important than training error
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
bull Key insights for machine learning (Bottou and Bousquet 2008)
1 No need to optimize below statistical error
2 Cost functions are averages
3 Testing error is more important than training error
Stochastic gradient descent (SGD) for finite sums

    min_{θ ∈ R^d} g(θ) = (1/n) Σ_{i=1}^n f_i(θ)

• Iteration: θ_t = θ_{t−1} − γ_t f'_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) random element of {1, ..., n}
  – Polyak–Ruppert averaging: θ̄_t = (1/(t+1)) Σ_{u=0}^t θ_u
• Convergence rate if each f_i is convex, L-smooth, and g µ-strongly convex:

    E g(θ̄_t) − g(θ*) ≤ O(1/√t)              if γ_t = 1/(L√t)
    E g(θ̄_t) − g(θ*) ≤ O(L/(µt)) = O(κ/t)   if γ_t = 1/(µt)

  – No adaptivity to strong convexity in general
  – Running-time complexity: O(d · κ/ε)
• NB: a single pass leads to bounds on the testing error
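A minimal SGD sketch on a made-up least-squares instance, with f_i(θ) = ½(x_iᵀθ − y_i)². The labels are noiseless, so the model interpolates the data; in that special regime SGD with a constant step converges linearly, which is a stronger setting than the decaying-step rates quoted above (all names and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 2
X = rng.standard_normal((n, d))
theta_star = np.array([1.0, -2.0])
y = X @ theta_star  # noiseless labels: the optimum interpolates

L_max = max(np.sum(X * X, axis=1))  # smoothness constant of the worst f_i
gamma = 1.0 / (2.0 * L_max)         # constant step, safe under interpolation

theta = np.zeros(d)
for t in range(5000):
    i = rng.integers(n)                    # sampling with replacement
    grad_i = (X[i] @ theta - y[i]) * X[i]  # stochastic gradient f'_{i(t)}
    theta = theta - gamma * grad_i

print(np.linalg.norm(theta - theta_star))  # close to 0
```

With noisy labels one would instead use the decaying steps γ_t = 1/(L√t) or 1/(µt) together with Polyak–Ruppert averaging, as on the slide.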
Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) Σ_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λΩ(θ)
• Batch gradient descent: θ_t = θ_{t−1} − γ ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) Σ_{i=1}^n ∇f_i(θ_{t−1})
  – Linear ("exponential") convergence rate in O(e^{−t/κ}) for strongly convex problems
  – Can be accelerated to O(e^{−t/√κ}) (Nesterov, 1983)
  – Iteration complexity is linear in n
• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t ∇f_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) random element of {1, ..., n}
  – Convergence rate in O(κ/t)
  – Iteration complexity is independent of n
Stochastic vs. deterministic methods

• Goal = best of both worlds: linear rate with O(d) iteration cost,
  and a simple choice of step size

  [Figure: log(excess cost) vs. time, comparing the deterministic method
   (linear rate, costly iterations) with the stochastic one (cheap
   iterations, sublinear rate)]
Recent progress in single machine optimization

• Variance reduction
  – Exponential convergence with O(d) iteration cost
  – SAG (Le Roux, Schmidt, and Bach, 2012)
  – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
  – SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

    θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) Σ_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]
  (with y_i^t the stored value, at time t, of the gradient of the i-th function)
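The update above can be sketched on a toy least-squares problem (the data, step size, and iteration budget are made up for illustration). A table of stored gradients y_i makes the stochastic update unbiased with vanishing variance:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 2
X = rng.standard_normal((n, d))
theta_star = np.array([0.5, 1.5])
y = X @ theta_star  # f_i(theta) = 0.5 * (x_i^T theta - y_i)^2

def grad_i(theta, i):
    return (X[i] @ theta - y[i]) * X[i]

L_max = max(np.sum(X * X, axis=1))
gamma = 1.0 / (3.0 * L_max)

theta = np.zeros(d)
table = np.array([grad_i(theta, i) for i in range(n)])  # stored gradients y_i
table_mean = table.mean(axis=0)

for t in range(5000):
    i = rng.integers(n)
    g_new = grad_i(theta, i)
    # SAGA step: fresh gradient minus the stale stored one, plus the table mean
    theta = theta - gamma * (g_new - table[i] + table_mean)
    table_mean = table_mean + (g_new - table[i]) / n  # keep the mean in O(d)
    table[i] = g_new

print(np.linalg.norm(theta - theta_star))
```

Unlike plain SGD with a constant step on noisy data, the correction term drives the update's variance to zero at the optimum, which is what permits the linear rate with O(d) iteration cost.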
• Running-time to reach precision ε (with κ = condition number):

    Stochastic gradient descent :  d × κ × (1/ε)
    Gradient descent            :  d × nκ × log(1/ε)
    Variance reduction          :  d × (n + κ) × log(1/ε)

  – Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
  – Matching upper and lower bounds of complexity
Outline

1. Parametric supervised learning on a single machine
   – Machine learning ≈ optimization of finite sums
   – From batch to stochastic gradient methods
   – Linearly convergent stochastic methods for convex problems

2. Machine learning over networks
   – Centralized and decentralized methods
   – From network averaging to optimization
   – Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)

• Machine learning through optimization:

    min_{θ ∈ R^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)

  – f_i(θ): error of the model defined by θ on the dataset indexed by i
  – Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_ij, θ^T Φ(x_ij)) if m_i observations

• Each dataset/function f_i only accessible by node i in a graph

  [Figure: network of 9 nodes]
  – Massive datasets, multiple machines/cores
  – Communication/legal constraints

• Goal: minimize communication and local computation costs
• Why not simply distribute a simple single machine algorithm?
  – (Accelerated) gradient descent (see, e.g., Nesterov, 2004):

        θ_t = θ_{t−1} − γ ∇g(θ_{t−1})

  – Requires √κ log(1/ε) full gradient computations to reach precision ε
  – Need to perform distributed averaging over a network
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, ..., ξ_n ∈ R, compute

    θ* = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ ∈ R} (1/n) Σ_{i=1}^n (θ − ξ_i)²
• Centralized algorithms
  – Compute a spanning tree with diameter ≤ 2∆
  – Master/slave algorithm: ∆ communication steps + no error

  [Figure: network of 9 nodes and its spanning tree]
• Application to centralized distributed optimization
  – √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
  – "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)
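A toy master/slave averaging pass on a rooted spanning tree (the tree layout and node values are made up for illustration): partial sums travel leaf-to-root in at most ∆ communication rounds, the root computes the exact average, and a broadcast sends it back down with no error:

```python
# A hypothetical rooted tree, given by parent pointers (node 1 is the root).
parent = {1: None, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}
xi = {1: 4.0, 2: 8.0, 3: 15.0, 4: 16.0, 5: 23.0, 6: 42.0, 7: 0.0}

children = {v: [] for v in parent}
for v, p in parent.items():
    if p is not None:
        children[p].append(v)

def subtree_sum(v):
    # "Up" phase: each node forwards (sum, count) of its subtree to its parent.
    s, c = xi[v], 1
    for w in children[v]:
        sw, cw = subtree_sum(w)
        s, c = s + sw, c + cw
    return s, c

total, count = subtree_sum(1)          # root gathers the global sum and count
mean = total / count                   # root computes the exact average
estimates = {v: mean for v in parent}  # "down" phase: broadcast to every node

print(mean)  # exact average of all xi, after ~2*Delta rounds
```

This exactness is what distinguishes the centralized scheme from gossip below, which only converges to the average geometrically.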
• Robustness?
• Decentralized algorithms – gossip (Boyd et al., 2006)
  – Replace θ_i by a weighted average of its neighbors: Σ_{j=1}^n W_ij θ_j
  – Potential asynchrony, changing network

  [Figure: network of 9 nodes]
• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric, doubly stochastic matrix
Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric, doubly stochastic matrix
  – Consequence: eigenvalues(W) ∈ [−1, 1]
  – Eigengap: γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
  – γ^{−1} = mixing time of the associated Markov chain
  – Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)
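The iteration θ_t = W^t ξ can be simulated directly; a small sketch on a made-up ring of n nodes, where each node keeps weight 1/2 and gives 1/4 to each neighbour, so W is symmetric and doubly stochastic:

```python
import numpy as np

n = 10
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

rng = np.random.default_rng(2)
xi = rng.standard_normal(n)  # one observation per node

theta = xi.copy()
for t in range(500):
    theta = W @ theta  # each node averages with its ring neighbours

eigvals = np.sort(np.linalg.eigvalsh(W))[::-1]
gap = eigvals[0] - eigvals[1]  # eigengap gamma = 1 - lambda_2(W)

print(np.max(np.abs(theta - xi.mean())), gap)
```

Every node's value approaches the global mean at rate λ_2(W)^t, so the number of iterations to precision ε scales as (1/γ) log(1/ε); a poorly connected topology (small γ) mixes slowly.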
Illustration of synchronous gossip

  [Figure: gossip iterations on a network]
Parametric supervised machine learning

• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, ..., n, i.i.d.

• Prediction function h(x, θ) ∈ R parameterized by θ ∈ R^d

• (Regularized) empirical risk minimization:

    min_{θ ∈ R^d} (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ)) + λΩ(θ) = (1/n) Σ_{i=1}^n f_i(θ)

    (data fitting term + regularizer)

• Optimization: optimization of the regularized risk (training cost)

• Statistics: guarantees on E_{p(x,y)} ℓ(y, h(x, θ)) (testing cost)
Smoothness and (strong) convexity

• A function g : R^d → R is L-smooth if and only if it is twice
  differentiable and

    ∀θ ∈ R^d,  |eigenvalues[g''(θ)]| ≤ L

  [Figure: a smooth and a non-smooth function]
• Machine learning
  – With g(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ))
  – Smooth prediction function θ ↦ h(x_i, θ) + smooth loss
• A twice differentiable function g : R^d → R is convex
  if and only if

    ∀θ ∈ R^d,  eigenvalues[g''(θ)] ≥ 0
• A twice differentiable function g : R^d → R is µ-strongly convex
  if and only if

    ∀θ ∈ R^d,  eigenvalues[g''(θ)] ≥ µ

  [Figure: a convex and a strongly convex function]
  – Condition number: κ = L/µ ≥ 1

  [Figure: well-conditioned level sets (small κ = L/µ) vs. ill-conditioned ones (large κ = L/µ)]
• Convexity in machine learning
  – With g(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ))
  – Convex loss and linear predictions h(x, θ) = θ^T Φ(x)
• Relevance of convex optimization
  – Easier design and analysis of algorithms
  – Global minimum vs. local minimum vs. stationary points
  – Gradient-based algorithms only need convexity for their analysis
• Strong convexity in machine learning
  – With g(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ))
  – Strongly convex loss and linear predictions h(x, θ) = θ^T Φ(x)
  – Invertible covariance matrix (1/n) Σ_{i=1}^n Φ(x_i)Φ(x_i)^T ⇒ n ≥ d
  – Even when µ > 0, µ may be arbitrarily small
• Adding regularization by (µ/2)‖θ‖²
  – Creates additional bias unless µ is small, but reduces variance
  – Typically L/√n ≥ µ ≥ L/n ⇒ κ ∈ [√n, n]
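For regularized least squares the constants above can be read off the Hessian, (1/n) Φᵀ Φ + µ I, whose eigenvalues shift by µ. A sketch on made-up random features, using the L/√n scale for the regularizer as on the slide:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 200, 20
Phi = rng.standard_normal((n, d))
H = Phi.T @ Phi / n            # covariance = Hessian without regularization
eigs = np.linalg.eigvalsh(H)

mu_reg = eigs.max() / np.sqrt(n)  # regularization at the typical L/sqrt(n) scale
L = eigs.max() + mu_reg           # largest Hessian eigenvalue (smoothness)
mu = eigs.min() + mu_reg          # smallest Hessian eigenvalue (strong convexity)
kappa = L / mu

kappa_unreg = eigs.max() / eigs.min()
print(kappa, kappa_unreg)  # regularization reduces the condition number
```

The trade-off in the bullet above is visible here: a larger µ shrinks κ (faster optimization) at the price of more bias in the estimator.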
Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on R^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g'(θ_{t−1}) (with line search)

    g(θ_t) − g(θ*) ≤ O(1/t)
    g(θ_t) − g(θ*) ≤ O((1 − 1/κ)^t) = O(e^{−t/κ}) if µ-strongly convex
• Acceleration (Nesterov, 1983): second-order recursion

    θ_t = η_{t−1} − γ_t g'(η_{t−1})   and   η_t = θ_t + δ_t (θ_t − θ_{t−1})

  – Good choice of momentum term δ_t ∈ [0, 1)
    g(θ_t) − g(θ*) ≤ O(1/t²)
    g(θ_t) − g(θ*) ≤ O((1 − 1/√κ)^t) = O(e^{−t/√κ}) if µ-strongly convex

  – Optimal rates after t = O(d) iterations (Nesterov, 2004)
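The κ vs. √κ gap can be observed on an illustrative quadratic g(θ) = ½ θᵀAθ with Hessian eigenvalues in [µ, L]; for strongly convex problems the constant momentum δ = (√κ − 1)/(√κ + 1) is the classical choice (the problem sizes and tolerance below are made up):

```python
import numpy as np

mu, L = 0.01, 1.0
kappa = L / mu
d = 50
A = np.diag(np.linspace(mu, L, d))  # quadratic with spectrum in [mu, L]
theta0 = np.ones(d)

def iters_to_tol(accelerated, tol=1e-9, max_iter=100000):
    theta, eta = theta0.copy(), theta0.copy()
    delta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1) if accelerated else 0.0
    for t in range(1, max_iter + 1):
        theta_new = eta - (1.0 / L) * (A @ eta)        # gradient step at eta
        eta = theta_new + delta * (theta_new - theta)  # momentum extrapolation
        theta = theta_new
        if np.linalg.norm(A @ theta) < tol:            # gradient-norm stopping rule
            return t
    return max_iter

t_gd, t_agd = iters_to_tol(False), iters_to_tol(True)
print(t_gd, t_agd)  # acceleration needs roughly a sqrt(kappa) factor fewer iterations
```

With δ = 0 the recursion reduces to plain gradient descent, so the two runs differ only in the momentum term.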
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
bull NB single pass leads to bounds on testing error
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)
ndash Iteration complexity is independent of n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
new
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
(with yti stored value at time t of gradient of the i-th function)
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
bull Running-time to reach precision ε (with κ = condition number)
Stochastic gradient descent dtimes∣
∣
∣κ times 1
ε
Gradient descent dtimes∣
∣
∣nκ times log 1
ε
Variance reduction dtimes∣
∣
∣(n+ κ) times log 1
ε
ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ
ndash Matching upper and lower bounds of complexity
Outline
1 Parametric supervised learning on a single machine
ndash Machine learning asymp optimization of finite sums
ndash From batch to stochastic gradient methods
ndash Linearly-convergent stochastic methods for convex problems
2 Machine learning over networks
minus Centralized and decentralized methods
minus From network averaging to optimization
minus Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
ndash Massive datasets multiple machines cores
ndash Communication legal constraints
bull Goal Minimize communication and local computation costs
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
bull Why not simply distributing a simple single machine algorithm
ndash (accelerated) gradient descent (see eg Nesterov 2004)
θt = θtminus1 minus γnablag(θtminus1)
ndash Requiresradicκ log 1
ε full gradient computations to reach precision ε
ndash Need to perform distributed averaging over a network
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
bull Robustness
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
ndash Consequence Eigenvalues(W ) isin [minus1 1]
ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )
ndash γminus1 = mixing time of the associated Markov chain
1 3
24
57
6 9
8
ndash Need 1γ log
1ε iterations to reach precision ε (for classical averaging)
Illustration of synchronous gossip
![Page 21: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/21.jpg)
Smoothness and (strong) convexity

• A function g : ℝ^d → ℝ is L-smooth if and only if it is twice differentiable and
    ∀θ ∈ ℝ^d, |eigenvalues[g″(θ)]| ≤ L
  – In machine learning, with g(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ)):
    smooth prediction function θ ↦ h(x_i, θ) + smooth loss

• A twice differentiable function g : ℝ^d → ℝ is convex if and only if
    ∀θ ∈ ℝ^d, eigenvalues[g″(θ)] ≥ 0

• A twice differentiable function g : ℝ^d → ℝ is μ-strongly convex if and only if
    ∀θ ∈ ℝ^d, eigenvalues[g″(θ)] ≥ μ
  – Condition number κ = L/μ ≥ 1

• Convexity in machine learning
  – With g(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ))
  – Convex loss and linear predictions h(x, θ) = θ^⊤Φ(x)

• Relevance of convex optimization
  – Easier design and analysis of algorithms
  – Global minimum vs. local minimum vs. stationary points
  – Gradient-based algorithms only need convexity for their analysis

• Strong convexity in machine learning
  – With g(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ))
  – Strongly convex loss and linear predictions h(x, θ) = θ^⊤Φ(x)
  – Invertible covariance matrix (1/n) Σ_{i=1}^n Φ(x_i)Φ(x_i)^⊤ ⇒ n ≥ d
  – Even when μ > 0, μ may be arbitrarily small

• Adding regularization by (μ/2)‖θ‖²
  – Creates additional bias unless μ is small, but reduces variance
  – Typically L/√n ≥ μ ≥ L/n ⇒ κ ∈ [√n, n]
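As a concrete illustration (not from the slides, synthetic data): for a least-squares objective the Hessian is constant, so the smoothness constant L, the strong-convexity constant μ, and the condition number κ are just the extreme eigenvalues of (1/n)XᵀX.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))        # rows are feature vectors Φ(x_i)

# For least squares g(θ) = (1/2n)‖Xθ − y‖², the Hessian g″(θ) = (1/n)XᵀX
# is constant in θ, so L and μ are its extreme eigenvalues.
H = X.T @ X / n
eigs = np.linalg.eigvalsh(H)           # sorted in ascending order

mu, L = eigs[0], eigs[-1]              # strong convexity and smoothness
kappa = L / mu                         # condition number κ = L/μ ≥ 1
print(mu, L, kappa)
```

With n ≥ d and generic data the covariance matrix is invertible, so μ > 0; shrinking n toward d drives μ (and hence the convergence rates below) toward zero.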
Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on ℝ^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1})
  – g(θ_t) − g(θ∗) ≤ O(1/t) convergence rate for convex functions
  – g(θ_t) − g(θ∗) ≤ O((1 − 1/κ)^t) = O(e^{−t/κ}) (linear rate) if μ-strongly convex
    ⇔ O(κ log(1/ε)) iterations, i.e., complexity = O(nd · κ log(1/ε))

• Acceleration (Nesterov, 1983): second-order recursion
    θ_t = η_{t−1} − γ_t g′(η_{t−1}) and η_t = θ_t + δ_t(θ_t − θ_{t−1})
  – Good choice of momentum term δ_t ∈ [0, 1)
  – g(θ_t) − g(θ∗) ≤ O(1/t²)
  – g(θ_t) − g(θ∗) ≤ O((1 − 1/√κ)^t) = O(e^{−t/√κ}) if μ-strongly convex
  – Optimal rates after t = O(d) iterations (Nesterov, 2004)

• Newton method: θ_t = θ_{t−1} − g″(θ_{t−1})^{−1} g′(θ_{t−1})
  – O(e^{−ρ·2^t}) quadratic rate ⇔ O(log log(1/ε)) iterations,
    i.e., complexity = O((nd² + d³) · log log(1/ε))

• Key insights for machine learning (Bottou and Bousquet, 2008)
  1. No need to optimize below statistical error
  2. Cost functions are averages
  3. Testing error is more important than training error
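The gradient-descent and accelerated recursions above can be sketched on a synthetic ill-conditioned quadratic (all data and constants below are illustrative, not from the slides); for μ-strongly convex problems a standard constant momentum choice is δ = (√κ − 1)/(√κ + 1).

```python
import numpy as np

# Minimize g(θ) = ½ θᵀAθ on a quadratic with κ = 100, comparing plain
# gradient descent with Nesterov's accelerated recursion.
rng = np.random.default_rng(0)
d = 20
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
vals = np.linspace(1e-2, 1.0, d)         # μ = 1e-2, L = 1 → κ = 100
A = U @ np.diag(vals) @ U.T
L, mu = vals[-1], vals[0]
grad = lambda th: A @ th
theta0 = rng.standard_normal(d)

# Gradient descent with step 1/L: contraction (1 − 1/κ) per step
th = theta0.copy()
for _ in range(300):
    th = th - grad(th) / L
gd_err = np.linalg.norm(th)

# Accelerated recursion θ_t = η_{t−1} − γ g′(η_{t−1}), η_t = θ_t + δ(θ_t − θ_{t−1})
kappa = L / mu
delta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
th, th_prev = theta0.copy(), theta0.copy()
for _ in range(300):
    eta = th + delta * (th - th_prev)
    th, th_prev = eta - grad(eta) / L, th
acc_err = np.linalg.norm(th)

print(gd_err, acc_err)
```

After the same 300 iterations, the accelerated iterate is many orders of magnitude closer to the minimizer, matching the e^{−t/κ} vs. e^{−t/√κ} rates.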
Stochastic gradient descent (SGD) for finite sums

    min_{θ∈ℝ^d} g(θ) = (1/n) Σ_{i=1}^n f_i(θ)

• Iteration: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) a random element of {1, …, n}
  – Polyak-Ruppert averaging: θ̄_t = (1/(t+1)) Σ_{u=0}^t θ_u

• Convergence rate if each f_i is convex, L-smooth, and g is μ-strongly convex:
    E g(θ̄_t) − g(θ∗) ≤ O(1/√t) if γ_t = 1/(L√t)
    E g(θ̄_t) − g(θ∗) ≤ O(L/(μt)) = O(κ/t) if γ_t = 1/(μt)
  – No adaptivity to strong convexity in general
  – Running-time complexity: O(d · κ/ε)

• NB: a single pass leads to bounds on testing error
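A minimal sketch of SGD with sampling with replacement and Polyak-Ruppert averaging, on a synthetic μ-strongly convex least-squares problem. The step-size offset t₀ ≈ L/μ (so that γ_t = 1/(μ(t + t₀)) starts no larger than roughly 1/L) is a standard stabilization added here as an assumption; it is not on the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, mu = 500, 5, 0.1
X = rng.standard_normal((n, d))
theta_true = rng.standard_normal(d)
y = X @ theta_true + 0.1 * rng.standard_normal(n)

# f_i(θ) = ½(x_iᵀθ − y_i)² + (μ/2)‖θ‖², so g = (1/n) Σ f_i is μ-strongly convex
def grad_i(theta, i):
    return (X[i] @ theta - y[i]) * X[i] + mu * theta

H = X.T @ X / n + mu * np.eye(d)          # Hessian of g
L = np.linalg.eigvalsh(H)[-1]             # smoothness constant of g
t0 = int(np.ceil(L / mu))                 # offset keeping early steps ≤ ~1/L

theta = np.zeros(d)
theta_bar = np.zeros(d)                   # Polyak-Ruppert running average
T = 20000
for t in range(1, T + 1):
    i = rng.integers(n)                   # sampling with replacement
    theta -= grad_i(theta, i) / (mu * (t + t0))   # γ_t = 1/(μ(t + t0))
    theta_bar += (theta - theta_bar) / t  # running average of the iterates

theta_star = np.linalg.solve(H, X.T @ y / n)
print(np.linalg.norm(theta_bar - theta_star))
```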
Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) Σ_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λΩ(θ)

• Batch gradient descent: θ_t = θ_{t−1} − γ ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) Σ_{i=1}^n ∇f_i(θ_{t−1})
  – Exponential convergence rate in O(e^{−t/κ}) for strongly convex problems
  – Can be accelerated to O(e^{−t/√κ}) (Nesterov, 1983)
  – Iteration complexity is linear in n

• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t ∇f_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) a random element of {1, …, n}
  – Convergence rate in O(κ/t)
  – Iteration complexity is independent of n

• Goal = best of both worlds: linear rate with O(d) iteration cost
  and a simple choice of step size

[figure: log(excess cost) vs. time — deterministic methods converge linearly but each iteration is expensive; stochastic methods are cheap per iteration but sublinear; new methods achieve both]
Recent progress in single machine optimization

• Variance reduction
  – Exponential convergence with O(d) iteration cost
  – SAG (Le Roux, Schmidt, and Bach, 2012)
  – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
  – SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

    θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) Σ_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]

  (with y_i^t the stored value at time t of the gradient of the i-th function)

• Running time to reach precision ε (with κ = condition number):

    Stochastic gradient descent:  d × κ × (1/ε)
    Gradient descent:             d × nκ × log(1/ε)
    Variance reduction:           d × (n + κ) × log(1/ε)

  – Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
  – Matching upper and lower bounds of complexity
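The variance-reduced update above is what SAGA implements: a table of the last gradient seen for each f_i, plus its running mean. A minimal NumPy sketch on synthetic regularized least squares (the step size 1/(3 L_max) follows the usual SAGA analysis; all data and constants are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 5, 0.1
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)

# f_i(θ) = ½(x_iᵀθ − y_i)² + (λ/2)‖θ‖²
def grad_i(theta, i):
    return (X[i] @ theta - y[i]) * X[i] + lam * theta

L_max = (X ** 2).sum(axis=1).max() + lam   # largest per-function smoothness
gamma = 1.0 / (3 * L_max)                  # constant step from the SAGA analysis

theta = np.zeros(d)
table = np.array([grad_i(theta, i) for i in range(n)])   # stored gradients y_i
table_mean = table.mean(axis=0)

for _ in range(50 * n):                    # ~50 effective passes over the data
    i = rng.integers(n)
    g_new = grad_i(theta, i)
    theta -= gamma * (g_new - table[i] + table_mean)     # variance-reduced step
    table_mean += (g_new - table[i]) / n                 # keep the mean in sync
    table[i] = g_new

theta_star = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
print(np.linalg.norm(theta - theta_star))
```

A constant step size suffices because the correction term drives the variance of the gradient estimate to zero at the optimum, which is what restores the linear rate.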
Outline

1. Parametric supervised learning on a single machine
   – Machine learning ≈ optimization of finite sums
   – From batch to stochastic gradient methods
   – Linearly-convergent stochastic methods for convex problems

2. Machine learning over networks
   – Centralized and decentralized methods
   – From network averaging to optimization
   – Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)

• Machine learning through optimization:

    min_{θ∈ℝ^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)

  – f_i(θ): error of the model defined by θ on the dataset indexed by i
  – Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_ij, θ^⊤Φ(x_ij)) if node i holds m_i observations

• Each dataset/function f_i is only accessible by node i in a graph

[figure: example network with nodes 1–9]

  – Massive datasets, multiple machines/cores
  – Communication/legal constraints

• Goal: minimize communication and local computation costs

• Why not simply distribute a simple single machine algorithm?
  – (Accelerated) gradient descent (see, e.g., Nesterov, 2004): θ_t = θ_{t−1} − γ ∇g(θ_{t−1})
  – Requires √κ log(1/ε) full gradient computations to reach precision ε
  – Need to perform distributed averaging over a network
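A toy sketch of this setup (synthetic data; the shard layout and helper names are hypothetical, chosen for illustration): each node holds its own observations and exposes only its local average loss f_i, and one step of distributed gradient descent needs the network average of the local gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, d = 4, 3
# each shard: rows are observations, columns [:d] = Φ(x_ij), column d = y_ij
shards = [rng.standard_normal((int(m), d + 1))
          for m in rng.integers(10, 20, size=n_nodes)]

def f_i(theta, shard):        # local average squared loss on node i
    Phi, y_loc = shard[:, :d], shard[:, d]
    return 0.5 * np.mean((Phi @ theta - y_loc) ** 2)

def grad_f_i(theta, shard):   # local gradient, computable by node i alone
    Phi, y_loc = shard[:, :d], shard[:, d]
    return Phi.T @ (Phi @ theta - y_loc) / len(shard)

def g(theta):                 # global objective g(θ) = (1/n) Σ_i f_i(θ)
    return np.mean([f_i(theta, s) for s in shards])

# One step of distributed gradient descent: average the local gradients
# over the network (the costly communication step), then move.
theta = np.zeros(d)
full_grad = np.mean([grad_f_i(theta, s) for s in shards], axis=0)
theta = theta - 0.1 * full_grad
print(g(np.zeros(d)), g(theta))
```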
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ ℝ, compute

    θ∗ = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ∈ℝ} (1/n) Σ_{i=1}^n (θ − ξ_i)²

• Centralized algorithms
  – Compute a spanning tree with diameter ≤ 2∆
  – Master/slave algorithm: ∆ communication steps + no error

[figure: a 9-node network and a spanning tree extracted from it]

• Application to centralized distributed optimization
  – √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
  – "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)
  – Robustness?

• Decentralized algorithms – gossip (Boyd et al., 2006)
  – Replace θ_i by a weighted average of its neighbors: Σ_{j=1}^n W_ij θ_j
  – Potential asynchrony, changing network

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W a symmetric, doubly stochastic matrix
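One standard way to build such a W (an assumption here — the slides do not fix a particular construction) is with Metropolis weights, which yield a symmetric, doubly stochastic matrix for any undirected graph:

```python
import numpy as np

# Metropolis weights: W_ij = 1/(1 + max(deg_i, deg_j)) on each edge,
# W_ii = 1 − Σ_{j≠i} W_ij. Symmetric and doubly stochastic by construction.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 3)]   # small example graph
n = 5
deg = np.zeros(n, dtype=int)
for i, j in edges:
    deg[i] += 1
    deg[j] += 1

W = np.zeros((n, n))
for i, j in edges:
    W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
W[np.diag_indices(n)] = 1.0 - W.sum(axis=1)

eigs = np.sort(np.linalg.eigvalsh(W))[::-1]   # descending: λ_1 = 1 first
gamma = eigs[0] - eigs[1]                     # eigengap γ = 1 − λ_2(W)
print(gamma)
```

For a connected graph, λ_1 = 1 is simple, so γ > 0 and the gossip iteration contracts toward consensus.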
Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W a symmetric, doubly stochastic matrix
  – Consequence: eigenvalues(W) ∈ [−1, 1]
  – Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
  – γ^{−1} = mixing time of the associated Markov chain
  – Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip
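The illustration can be reproduced with a minimal simulation on a ring graph (an assumed topology, for illustration only): every node repeatedly averages with its two neighbors and all values converge to the global mean, at a rate governed by the eigengap γ = 1 − λ_2(W).

```python
import numpy as np

# Synchronous gossip θ_t = W θ_{t−1} on a ring of n nodes, with W
# symmetric and doubly stochastic (self-weight ½, neighbors ¼ each).
n = 20
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

rng = np.random.default_rng(0)
xi = rng.standard_normal(n)        # one observation ξ_i per node
theta = xi.copy()
for t in range(2000):
    theta = W @ theta              # one gossip round

print(np.max(np.abs(theta - xi.mean())))   # deviation from consensus
```

On the ring, γ = ½(1 − cos(2π/n)) shrinks as O(1/n²), which is why poorly connected topologies need many more rounds than well-connected ones.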
Smoothness and (strong) convexity
bull A function g Rd rarr R is L-smooth if and only if it is twice
differentiable and
forallθ isin Rd
∣
∣eigenvalues[
gprimeprime(θ)]∣
∣ 6 L
bull Machine learning
ndash with g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Smooth prediction function θ 7rarr h(xi θ) + smooth loss
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt 0
convex
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
convex
stronglyconvex
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
ndash Condition number κ = Lmicro gt 1
(small κ = Lmicro) (large κ = Lmicro)
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)
bull Relevance of convex optimization
ndash Easier design and analysis of algorithms
ndash Global minimum vs local minimum vs stationary points
ndash Gradient-based algorithms only need convexity for their analysis
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Strong convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Strong convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)
ndash Invertible covariance matrix 1n
sumni=1Φ(xi)Φ(xi)
⊤ rArr n gt d
ndash Even when micro gt 0 micro may be arbitrarily small
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Strong convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)
ndash Invertible covariance matrix 1n
sumni=1Φ(xi)Φ(xi)
⊤ rArr n gt d
ndash Even when micro gt 0 micro may be arbitrarily small
bull Adding regularization by micro2θ2
ndash creates additional bias unless micro is small but reduces variance
ndash Typically Lradicn gt micro gt Ln rArr κ isin [
radicn n]
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O(eminust(microL)) = O(eminustκ)
(small κ = Lmicro) (large κ = Lmicro)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
(small κ = Lmicro) (large κ = Lmicro)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
bull Acceleration (Nesterov 1983) second-order recursion
θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)
ndash Good choice of momentum term δt isin [0 1)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
bull Acceleration (Nesterov 1983) second-order recursion
θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)
ndash Good choice of momentum term δt isin [0 1)
g(θt)minus g(θlowast) 6 O(1t2)
g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust
radicκ) if micro-strongly convex
ndash Optimal rates after t = O(d) iterations (Nesterov 2004)
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr O(log log 1ε) iterations
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
bull Key insights for machine learning (Bottou and Bousquet 2008)
1 No need to optimize below statistical error
2 Cost functions are averages
3 Testing error is more important than training error
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
bull Key insights for machine learning (Bottou and Bousquet 2008)
1 No need to optimize below statistical error
2 Cost functions are averages
3 Testing error is more important than training error
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
bull NB single pass leads to bounds on testing error
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)
ndash Iteration complexity is independent of n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
new
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
(with yti stored value at time t of gradient of the i-th function)
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
bull Running-time to reach precision ε (with κ = condition number)
Stochastic gradient descent dtimes∣
∣
∣κ times 1
ε
Gradient descent dtimes∣
∣
∣nκ times log 1
ε
Variance reduction dtimes∣
∣
∣(n+ κ) times log 1
ε
ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ
ndash Matching upper and lower bounds of complexity
Outline
1 Parametric supervised learning on a single machine
ndash Machine learning asymp optimization of finite sums
ndash From batch to stochastic gradient methods
ndash Linearly-convergent stochastic methods for convex problems
2 Machine learning over networks
minus Centralized and decentralized methods
minus From network averaging to optimization
minus Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
ndash Massive datasets multiple machines cores
ndash Communication legal constraints
bull Goal Minimize communication and local computation costs
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
• Why not simply distribute a simple single-machine algorithm?
  – (Accelerated) gradient descent (see, e.g., Nesterov, 2004):

      θ_t = θ_{t−1} − γ ∇g(θ_{t−1})

  – Requires √κ log(1/ε) full gradient computations to reach precision ε
  – Each step needs distributed averaging over the network
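To make the point concrete, here is a minimal NumPy sketch (not from the slides) of this pattern: every node computes the gradient of its own local f_i, and each step requires one distributed averaging of those gradients. The quadratic local losses, the sizes, and the step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: n nodes, each holding a local quadratic loss
# f_i(theta) = 0.5 * ||A_i theta - b_i||^2, with g the average of the f_i.
n, d = 5, 3
A = [rng.standard_normal((10, d)) for _ in range(n)]
b = [rng.standard_normal(10) for _ in range(n)]

def local_gradient(i, theta):
    """Gradient of f_i, computed by node i on its own data."""
    return A[i].T @ (A[i] @ theta - b[i])

theta = np.zeros(d)
gamma = 0.01
for t in range(500):
    # One "communication round": gather all local gradients and
    # broadcast the averaged update (distributed averaging).
    grad = sum(local_gradient(i, theta) for i in range(n)) / n
    theta -= gamma * grad

# Compare with the centralized minimizer of g (stacked least squares).
A_full, b_full = np.vstack(A), np.concatenate(b)
theta_star = np.linalg.lstsq(A_full, b_full, rcond=None)[0]
print(np.linalg.norm(theta - theta_star))  # small
```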
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ R,
  compute θ* = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ ∈ R} (1/n) Σ_{i=1}^n (θ − ξ_i)²
• Centralized algorithms
  – Compute a spanning tree with diameter ≤ 2∆
  – Master/slave algorithm: ∆ communication steps + no error

  [Figure: a spanning tree extracted from the 9-node network]
• Application to centralized distributed optimization
  – √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
  – "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)
• Robustness?
• Decentralized algorithms – gossip (Boyd et al., 2006)
  – Replace θ_i by a weighted average of its neighbors: Σ_{j=1}^n W_ij θ_j
  – Potential asynchrony, changing network
• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric, doubly stochastic matrix
Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric, doubly stochastic matrix
  – Consequence: eigenvalues(W) ∈ [−1, 1]
  – Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
  – γ^{−1} = mixing time of the associated Markov chain
  – Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)
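A minimal NumPy sketch of synchronous gossip (the ring topology is an assumed choice for illustration): W is symmetric and doubly stochastic, and iterating θ_t = W θ_{t−1} drives every node to the average of the ξ_i.

```python
import numpy as np

# Synchronous gossip on an assumed ring of n nodes: each node
# averages itself (weight 0.5) with its two neighbors (0.25 each),
# so W is symmetric and doubly stochastic.
n = 10
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

xi = np.random.default_rng(1).standard_normal(n)  # local observations
theta = xi.copy()
for t in range(500):
    theta = W @ theta  # theta_t = W theta_{t-1} = W^t xi

# Every node converges to the global average of the xi_i,
# at a rate governed by the eigengap of W.
print(np.max(np.abs(theta - xi.mean())))
```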
Illustration of synchronous gossip

[Figure: gossip averaging iterations on a network]
Smoothness and (strong) convexity

• A twice differentiable function g: R^d → R is convex
  if and only if

    ∀θ ∈ R^d, eigenvalues[g″(θ)] ≥ 0
• A twice differentiable function g: R^d → R is µ-strongly convex
  if and only if

    ∀θ ∈ R^d, eigenvalues[g″(θ)] ≥ µ
  – Condition number: κ = L/µ ≥ 1

  [Figure: level sets for small vs. large condition number κ = L/µ]
• Convexity in machine learning
  – With g(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ))
  – Convex loss and linear predictions: h(x, θ) = θ⊤Φ(x)
• Relevance of convex optimization
  – Easier design and analysis of algorithms
  – Global minimum vs. local minimum vs. stationary points
  – Gradient-based algorithms only need convexity for their analysis
• Strong convexity in machine learning
  – With g(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ))
  – Strongly convex loss and linear predictions: h(x, θ) = θ⊤Φ(x)
  – Invertible covariance matrix (1/n) Σ_{i=1}^n Φ(x_i)Φ(x_i)⊤ ⇒ n ≥ d
  – Even when µ > 0, µ may be arbitrarily small
• Adding regularization by (µ/2)‖θ‖²
  – Creates additional bias unless µ is small, but reduces variance
  – Typically L/√n ≥ µ ≥ L/n ⇒ κ ∈ [√n, n]
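An illustrative NumPy sketch (with assumed synthetic Gaussian features) of how regularization controls the condition number: the Hessian of (1/2n)‖Xθ − y‖² + (µ/2)‖θ‖² is X⊤X/n + µI, so choosing µ of order L/√n caps κ at about √n + 1.

```python
import numpy as np

# Illustrative sketch: Hessian of regularized least squares
#   g(theta) = (1/2n) ||X theta - y||^2 + (mu/2) ||theta||^2
# is X^T X / n + mu * I; regularization lifts the smallest eigenvalue.
rng = np.random.default_rng(2)
n, d = 200, 50
X = rng.standard_normal((n, d))           # assumed synthetic features

eigs = np.linalg.eigvalsh(X.T @ X / n)    # eigenvalues of unregularized Hessian
L = eigs.max()

mu = L / np.sqrt(n)                       # typical scaling mu ~ L / sqrt(n)
kappa = (L + mu) / (eigs.min() + mu)      # condition number after regularization
print(f"L = {L:.2f}, kappa = {kappa:.1f}, bound sqrt(n)+1 = {np.sqrt(n) + 1:.1f}")
```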
Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on R^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) (with line search)
  – g(θ_t) − g(θ*) ≤ O(1/t)
  – g(θ_t) − g(θ*) ≤ O(e^{−t(µ/L)}) = O(e^{−t/κ}) if µ-strongly convex
• Acceleration (Nesterov, 1983): second-order recursion

    θ_t = η_{t−1} − γ_t g′(η_{t−1})   and   η_t = θ_t + δ_t (θ_t − θ_{t−1})

  – Good choice of momentum term δ_t ∈ [0, 1)
  – g(θ_t) − g(θ*) ≤ O(1/t²)
  – g(θ_t) − g(θ*) ≤ O((1 − 1/√κ)^t) = O(e^{−t/√κ}) if µ-strongly convex
  – Optimal rates after t = O(d) iterations (Nesterov, 2004)
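An illustrative NumPy comparison (not from the slides) of plain gradient descent against Nesterov's recursion on an assumed strongly convex quadratic with κ = 100; the momentum δ = (√L − √µ)/(√L + √µ) is the standard strongly convex choice.

```python
import numpy as np

# Assumed test problem: g(theta) = 0.5 theta^T H theta with
# eigenvalues spread in [1, 100], i.e. mu = 1, L = 100, kappa = 100.
rng = np.random.default_rng(3)
d = 20
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
H = Q @ np.diag(np.linspace(1.0, 100.0, d)) @ Q.T
grad = lambda th: H @ th                      # g'(theta) = H theta

mu, L = 1.0, 100.0
gamma = 1.0 / L
delta = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))

theta0 = rng.standard_normal(d)

# Plain gradient descent: contraction roughly (1 - 1/kappa) per step.
th = theta0.copy()
for _ in range(200):
    th -= gamma * grad(th)

# Nesterov acceleration: contraction roughly (1 - 1/sqrt(kappa)).
th_acc, eta = theta0.copy(), theta0.copy()
for _ in range(200):
    th_new = eta - gamma * grad(eta)          # theta_t = eta_{t-1} - gamma g'(eta_{t-1})
    eta = th_new + delta * (th_new - th_acc)  # eta_t = theta_t + delta (theta_t - theta_{t-1})
    th_acc = th_new

print(np.linalg.norm(th), np.linalg.norm(th_acc))
```

After the same number of iterations, the accelerated iterate is much closer to the minimizer (the origin) than the plain one.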
Iterative methods for minimizing smooth functions

• Assumption: g convex and smooth on R^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1})
  – O(1/t) convergence rate for convex functions
  – O(e^{−t/κ}) linear rate if strongly convex ⇔ O(κ log(1/ε)) iterations
    ⇔ complexity = O(nd · κ log(1/ε))

• Newton method: θ_t = θ_{t−1} − g″(θ_{t−1})^{−1} g′(θ_{t−1})
  – O(e^{−ρ 2^t}) quadratic rate ⇔ O(log log(1/ε)) iterations
    ⇔ complexity = O((nd² + d³) · log log(1/ε))

• Key insights for machine learning (Bottou and Bousquet, 2008)
  1. No need to optimize below statistical error
  2. Cost functions are averages
  3. Testing error is more important than training error
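A small sketch of Newton's method on an assumed separable toy function g(θ) = Σ_j [log(1 + e^{θ_j}) − 0.3 θ_j], whose Hessian is diagonal; a handful of iterations reaches machine precision, illustrating the quadratic rate.

```python
import numpy as np

# Assumed toy problem: g(theta) = sum_j [log(1 + exp(theta_j)) - 0.3 theta_j],
# so g'(theta) = sigmoid(theta) - 0.3 and g''(theta) is diagonal.
def gprime(th):
    return 1.0 / (1.0 + np.exp(-th)) - 0.3

def gsecond(th):
    s = 1.0 / (1.0 + np.exp(-th))
    return s * (1.0 - s)            # diagonal of the Hessian, always > 0

th = np.zeros(4)
for t in range(6):
    # Newton step; elementwise division since the Hessian is diagonal.
    th = th - gprime(th) / gsecond(th)

# The minimizer satisfies sigmoid(theta*) = 0.3, i.e. theta* = log(0.3/0.7).
theta_star = np.log(0.3 / 0.7)
print(np.max(np.abs(th - theta_star)))
```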
Stochastic gradient descent (SGD) for finite sums

    min_{θ ∈ R^d} g(θ) = (1/n) Σ_{i=1}^n f_i(θ)

• Iteration: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) random element of {1, …, n}
  – Polyak-Ruppert averaging: θ̄_t = (1/(t+1)) Σ_{u=0}^t θ_u
• Convergence rate if each f_i is convex, L-smooth, and g is µ-strongly convex:

    E g(θ̄_t) − g(θ*) ≤ O(1/√t)             if γ_t = 1/(L√t)
    E g(θ̄_t) − g(θ*) ≤ O(L/(µt)) = O(κ/t)   if γ_t = 1/(µt)

  – No adaptivity to strong convexity in general
  – Running-time complexity: O(d · κ/ε)
• NB: a single pass leads to bounds on the testing error
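A minimal SGD sketch with Polyak-Ruppert averaging on assumed synthetic least-squares data, using a decaying step size γ_t = 1/(L√t) with L the largest individual smoothness constant (an illustrative choice, matching the convex-case rate above).

```python
import numpy as np

# Assumed synthetic problem: f_i(theta) = 0.5 (x_i^T theta - y_i)^2,
# with noiseless labels so that theta_true minimizes every f_i.
rng = np.random.default_rng(4)
n, d = 1000, 5
X = rng.standard_normal((n, d))
theta_true = rng.standard_normal(d)
y = X @ theta_true

L = np.max(np.sum(X**2, axis=1))   # smoothness bound for the individual f_i
theta = np.zeros(d)
theta_bar = np.zeros(d)
T = 20000
for t in range(1, T + 1):
    i = rng.integers(n)                      # sampling with replacement
    g_i = (X[i] @ theta - y[i]) * X[i]       # stochastic gradient f'_i(theta)
    theta -= g_i / (L * np.sqrt(t))          # step size 1/(L sqrt(t))
    theta_bar += (theta - theta_bar) / t     # running Polyak-Ruppert average

print(np.linalg.norm(theta - theta_true), np.linalg.norm(theta_bar - theta_true))
```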
Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) Σ_{i=1}^n f_i(θ), with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λΩ(θ)
• Batch gradient descent: θ_t = θ_{t−1} − γ ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) Σ_{i=1}^n ∇f_i(θ_{t−1})
  – Exponential convergence rate in O(e^{−t/κ}) for convex problems
  – Can be accelerated to O(e^{−t/√κ}) (Nesterov, 1983)
  – Iteration complexity is linear in n
• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t ∇f_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) random element of {1, …, n}
  – Convergence rate in O(κ/t)
  – Iteration complexity is independent of n
Stochastic vs. deterministic methods

• Goal = best of both worlds: linear rate with O(d) iteration cost
  and a simple choice of step size

  [Figure: log(excess cost) vs. time for deterministic, stochastic, and new methods]
Recent progress in single machine optimization

• Variance reduction
  – Exponential convergence with O(d) iteration cost
  – SAG (Le Roux, Schmidt, and Bach, 2012)
  – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
  – SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

    θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) Σ_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]
  (with y_i^t the stored value at time t of the gradient of the i-th function)
• Running-time to reach precision ε (with κ = condition number)

    Stochastic gradient descent :  d × κ       × 1/ε
    Gradient descent            :  d × nκ      × log(1/ε)
    Variance reduction          :  d × (n + κ) × log(1/ε)

  – Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
  – Matching upper and lower bounds of complexity
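A minimal SAGA-style sketch of the variance-reduced update above, on assumed synthetic least-squares data: `table` stores the y_i (last seen gradient of each f_i), and the step size γ = 1/(3L) is the classical SAGA choice.

```python
import numpy as np

# Assumed synthetic problem: f_i(theta) = 0.5 (x_i^T theta - y_i)^2.
rng = np.random.default_rng(5)
n, d = 200, 5
X = rng.standard_normal((n, d))
theta_true = rng.standard_normal(d)
y = X @ theta_true + 0.01 * rng.standard_normal(n)

def grad_i(i, th):
    return (X[i] @ th - y[i]) * X[i]        # gradient of f_i

L = np.max(np.sum(X**2, axis=1))            # max individual smoothness
gamma = 1.0 / (3 * L)                       # classical SAGA step size

theta = np.zeros(d)
table = np.array([grad_i(i, theta) for i in range(n)])  # stored y_i
table_mean = table.mean(axis=0)

for t in range(20000):
    i = rng.integers(n)
    g_new = grad_i(i, theta)
    # Variance-reduced step: new gradient, minus stored one, plus old average.
    theta -= gamma * (g_new - table[i] + table_mean)
    table_mean += (g_new - table[i]) / n    # keep the average in sync
    table[i] = g_new

theta_star = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.linalg.norm(theta - theta_star))
```

Unlike plain SGD, the iterate converges linearly to the exact minimizer of the finite sum with a constant step size.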
Outline
1 Parametric supervised learning on a single machine
ndash Machine learning asymp optimization of finite sums
ndash From batch to stochastic gradient methods
ndash Linearly-convergent stochastic methods for convex problems
2 Machine learning over networks
minus Centralized and decentralized methods
minus From network averaging to optimization
minus Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
ndash Massive datasets multiple machines cores
ndash Communication legal constraints
bull Goal Minimize communication and local computation costs
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
bull Why not simply distributing a simple single machine algorithm
ndash (accelerated) gradient descent (see eg Nesterov 2004)
θt = θtminus1 minus γnablag(θtminus1)
ndash Requiresradicκ log 1
ε full gradient computations to reach precision ε
ndash Need to perform distributed averaging over a network
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
bull Robustness
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
ndash Consequence Eigenvalues(W ) isin [minus1 1]
ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )
ndash γminus1 = mixing time of the associated Markov chain
1 3
24
57
6 9
8
ndash Need 1γ log
1ε iterations to reach precision ε (for classical averaging)
Illustration of synchronous gossip
![Page 24: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/24.jpg)
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
convex
stronglyconvex
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
ndash Condition number κ = Lmicro gt 1
(small κ = Lmicro) (large κ = Lmicro)
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)
bull Relevance of convex optimization
ndash Easier design and analysis of algorithms
ndash Global minimum vs local minimum vs stationary points
ndash Gradient-based algorithms only need convexity for their analysis
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Strong convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Strong convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)
ndash Invertible covariance matrix 1n
sumni=1Φ(xi)Φ(xi)
⊤ rArr n gt d
ndash Even when micro gt 0 micro may be arbitrarily small
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Strong convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)
ndash Invertible covariance matrix 1n
sumni=1Φ(xi)Φ(xi)
⊤ rArr n gt d
ndash Even when micro gt 0 micro may be arbitrarily small
bull Adding regularization by micro2θ2
ndash creates additional bias unless micro is small but reduces variance
ndash Typically Lradicn gt micro gt Ln rArr κ isin [
radicn n]
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O(eminust(microL)) = O(eminustκ)
(small κ = Lmicro) (large κ = Lmicro)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
(small κ = Lmicro) (large κ = Lmicro)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
bull Acceleration (Nesterov 1983) second-order recursion
θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)
ndash Good choice of momentum term δt isin [0 1)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
bull Acceleration (Nesterov 1983) second-order recursion
θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)
ndash Good choice of momentum term δt isin [0 1)
g(θt)minus g(θlowast) 6 O(1t2)
g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust
radicκ) if micro-strongly convex
ndash Optimal rates after t = O(d) iterations (Nesterov 2004)
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr O(log log 1ε) iterations
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
bull Key insights for machine learning (Bottou and Bousquet 2008)
1 No need to optimize below statistical error
2 Cost functions are averages
3 Testing error is more important than training error
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
bull Key insights for machine learning (Bottou and Bousquet 2008)
1 No need to optimize below statistical error
2 Cost functions are averages
3 Testing error is more important than training error
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
bull NB single pass leads to bounds on testing error
Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) Σᵢ₌₁ⁿ fᵢ(θ) with fᵢ(θ) = ℓ(yᵢ, h(xᵢ, θ)) + λΩ(θ)

• Batch gradient descent: θ_t = θ_{t−1} − γ ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) Σᵢ₌₁ⁿ ∇fᵢ(θ_{t−1})
  – Exponential convergence rate in O(e^{−t/κ}) for strongly convex problems
  – Can be accelerated to O(e^{−t/√κ}) (Nesterov, 1983)
  – Iteration complexity is linear in n

• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t ∇f_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) random element of {1, …, n}
  – Convergence rate in O(κ/t)
  – Iteration complexity is independent of n
Stochastic vs. deterministic methods

• Goal = best of both worlds: linear rate with O(d) iteration cost and a simple choice of step size

[Figure: log(excess cost) vs. time — deterministic methods converge linearly with expensive iterations, stochastic methods have cheap iterations but a sublinear rate; the new methods combine both advantages]
Recent progress in single machine optimization

• Variance reduction: exponential convergence with O(d) iteration cost
  – SAG (Le Roux, Schmidt, and Bach, 2012)
  – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
  – SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

  θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) Σᵢ₌₁ⁿ y_i^{t−1} − y_{i(t)}^{t−1} ]

  (with y_i^t the stored value at time t of the gradient of the i-th function)

• Running time to reach precision ε (with κ = condition number):

  Stochastic gradient descent:  d × κ       × 1/ε
  Gradient descent:             d × nκ      × log(1/ε)
  Variance reduction:           d × (n + κ) × log(1/ε)

  – Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
  – Matching upper and lower bounds of complexity
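The update displayed above is the SAGA step: the stored gradient table removes the variance of the stochastic direction, which is what restores the linear rate. A minimal sketch on a synthetic ridge-regression problem (data and constants are illustrative assumptions; γ = 1/(3L) is the step size suggested in the SAGA paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic ridge instance (illustrative): f_i(theta) = 0.5*(x_i^T theta - y_i)^2 + 0.5*mu*||theta||^2
n, d, mu = 200, 5, 0.1
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)

def grad_i(theta, i):
    return (X[i] @ theta - y[i]) * X[i] + mu * theta

L = np.max(np.sum(X ** 2, axis=1)) + mu   # smoothness of each f_i
gamma = 1.0 / (3 * L)                     # classical SAGA step size

theta = np.zeros(d)
table = np.array([grad_i(theta, i) for i in range(n)])  # stored gradients y_i
table_mean = table.mean(axis=0)

for t in range(20000):
    i = rng.integers(n)
    g_new = grad_i(theta, i)
    # variance-reduced direction: grad_i(theta) - y_i + (1/n) sum_j y_j
    theta -= gamma * (g_new - table[i] + table_mean)
    table_mean += (g_new - table[i]) / n   # maintain the running mean in O(d)
    table[i] = g_new

full_grad = lambda th: (X.T @ (X @ th - y)) / n + mu * th
```

Each iteration touches one gradient plus O(d) bookkeeping, yet the full gradient is driven to zero at a linear rate, matching the d × (n + κ) × log(1/ε) row of the table.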
Outline

1. Parametric supervised learning on a single machine
   – Machine learning ≈ optimization of finite sums
   – From batch to stochastic gradient methods
   – Linearly convergent stochastic methods for convex problems

2. Machine learning over networks
   – Centralized and decentralized methods
   – From network averaging to optimization
   – Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)

• Machine learning through optimization:

  min_{θ∈ℝᵈ} (1/n) Σᵢ₌₁ⁿ fᵢ(θ) = g(θ)

  – fᵢ(θ): error of the model defined by θ on the dataset indexed by i
  – Example: fᵢ(θ) = (1/mᵢ) Σⱼ₌₁^{mᵢ} ℓ(yᵢⱼ, θ⊤Φ(xᵢⱼ)) for mᵢ observations

• Each dataset/function fᵢ is only accessible by node i in a graph
  – Massive datasets, multiple machines/cores
  – Communication/legal constraints

[Figure: network of nine numbered nodes]

• Goal: minimize communication and local computation costs

• Why not simply distribute a simple single machine algorithm?
  – (Accelerated) gradient descent (see, e.g., Nesterov, 2004): θ_t = θ_{t−1} − γ ∇g(θ_{t−1})
  – Requires √κ log(1/ε) full gradient computations to reach precision ε
  – Need to perform distributed averaging over a network
Classical algorithms for distributed averaging

• Goal: given n observations ξ₁, …, ξₙ ∈ ℝ, compute

  θ∗ = (1/n) Σᵢ₌₁ⁿ ξᵢ = argmin_{θ∈ℝ} (1/n) Σᵢ₌₁ⁿ (θ − ξᵢ)²

• Centralized algorithms
  – Compute a spanning tree with diameter ≤ 2∆
  – Master/slave algorithm: ∆ communication steps + no error

[Figure: spanning tree extracted from the nine-node network]

• Application to centralized distributed optimization
  – √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
  – "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)

• Robustness
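The master/slave reduction can be sketched in a few lines (the nine-node topology below is a made-up stand-in for the slides' figure; any connected graph behaves the same): each node forwards a (sum, count) pair toward the root along a spanning tree, and after at most ∆ up-steps the root holds the exact average, with no error.

```python
import numpy as np
from collections import defaultdict, deque

# Hypothetical 9-node connected network (edges invented for illustration)
edges = [(0, 1), (1, 2), (1, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8)]
n = 9
xi = np.arange(1.0, n + 1)           # one observation xi_i per node

adj = defaultdict(list)
for a, b in edges:
    adj[a].append(b); adj[b].append(a)

# Build a spanning tree rooted at node 0 via BFS
parent = {0: None}
order = []
q = deque([0])
while q:
    u = q.popleft(); order.append(u)
    for v in adj[u]:
        if v not in parent:
            parent[v] = u; q.append(v)

# Up-pass: each node sends (sum, count) to its parent; BFS order reversed
# guarantees children are processed before their parents
sums, counts = xi.copy(), np.ones(n)
for u in reversed(order):
    if parent[u] is not None:
        sums[parent[u]] += sums[u]
        counts[parent[u]] += counts[u]

theta_star = sums[0] / counts[0]     # exact average at the root
```

A down-pass broadcasting θ∗ back along the tree finishes in another ∆ communication steps, which is the cost quoted above.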
Classical algorithms for distributed averaging

• Goal: given n observations ξ₁, …, ξₙ ∈ ℝ, compute

  θ∗ = (1/n) Σᵢ₌₁ⁿ ξᵢ = argmin_{θ∈ℝ} (1/n) Σᵢ₌₁ⁿ (θ − ξᵢ)²

• Decentralized algorithms – gossip (Boyd et al., 2006)
  – Replace θᵢ by a weighted average of its neighbors: Σⱼ₌₁ⁿ Wᵢⱼ θⱼ
  – Potential asynchrony, changing network

[Figure: nine-node network]

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = W θ_{t−1} = Wᵗ θ₀ = Wᵗ ξ
  – Typical assumption: W symmetric doubly stochastic matrix
Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = W θ_{t−1} = Wᵗ θ₀ = Wᵗ ξ
  – Typical assumption: W symmetric doubly stochastic matrix
  – Consequence: eigenvalues(W) ∈ [−1, 1]
  – Eigengap γ = λ₁(W) − λ₂(W) = 1 − λ₂(W)
  – γ⁻¹ = mixing time of the associated Markov chain
  – Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)
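A small simulation of synchronous gossip, using a hypothetical ring network with a lazy random-walk matrix W (symmetric, doubly stochastic): running roughly (1/γ) log(1/ε) iterations of θ_t = W θ_{t−1} drives every node to the global average.

```python
import numpy as np

# Hypothetical ring of n nodes; W averages each node with its two neighbours
n = 20
xi = np.random.default_rng(0).standard_normal(n)

W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25          # symmetric, doubly stochastic

eigs = np.sort(np.linalg.eigvalsh(W))[::-1]
gap = eigs[0] - eigs[1]               # eigengap gamma = 1 - lambda_2(W)

theta = xi.copy()
for t in range(int(10 / gap)):        # ~ (1/gamma) log(1/eps) iterations
    theta = W @ theta                 # theta_t = W theta_{t-1}
```

On the ring the eigengap scales like 1/n², which is why gossip on poorly connected graphs needs many more communication rounds than the spanning-tree scheme.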
Illustration of synchronous gossip

[Figure: synchronous gossip iterations on the example network]
Smoothness and (strong) convexity

• A twice differentiable function g : ℝᵈ → ℝ is μ-strongly convex if and only if

  ∀θ ∈ ℝᵈ, eigenvalues[g″(θ)] ≥ μ

  – Condition number κ = L/μ ≥ 1 (small κ: well-conditioned; large κ: ill-conditioned)

• Convexity in machine learning
  – With g(θ) = (1/n) Σᵢ₌₁ⁿ ℓ(yᵢ, h(xᵢ, θ))
  – Convex loss and linear predictions h(x, θ) = θ⊤Φ(x)

• Relevance of convex optimization
  – Easier design and analysis of algorithms
  – Global minimum vs. local minimum vs. stationary points
  – Gradient-based algorithms only need convexity for their analysis

• Strong convexity in machine learning
  – Strongly convex loss and linear predictions h(x, θ) = θ⊤Φ(x)
  – Invertible covariance matrix (1/n) Σᵢ₌₁ⁿ Φ(xᵢ)Φ(xᵢ)⊤ ⇒ n ≥ d
  – Even when μ > 0, μ may be arbitrarily small

• Adding regularization by (μ/2)‖θ‖²
  – Creates additional bias unless μ is small, but reduces variance
  – Typically L/√n ≥ μ ≥ L/n ⇒ κ ∈ [√n, n]
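To make the last point concrete, here is a quick numerical check (on made-up Gaussian data) of how regularization by (μ/2)‖θ‖² moves the condition number: the Hessian of the regularized least-squares objective is H + μI, so every eigenvalue shifts up by μ and κ is capped accordingly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative least-squares objective g(theta) = (1/2n)||X theta - y||^2:
# its Hessian is H = (1/n) X^T X, so L and mu are its extreme eigenvalues
n, d = 50, 10
X = rng.standard_normal((n, d))
H = X.T @ X / n
eigs = np.linalg.eigvalsh(H)          # ascending order
L, mu = eigs[-1], eigs[0]
kappa = L / mu                        # condition number without regularization

# Regularizing by (mu_reg/2)*||theta||^2 shifts every eigenvalue up by mu_reg
mu_reg = L / np.sqrt(n)               # typical choice L/sqrt(n) from the slide
kappa_reg = (L + mu_reg) / (mu + mu_reg)
```

Since κ_reg ≤ L/μ_reg + 1 = √n + 1, the choice μ = L/√n indeed lands κ at the lower end of the [√n, n] range quoted above.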
Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on ℝᵈ

• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) (with line search)

  g(θ_t) − g(θ∗) ≤ O(1/t) for convex functions
  g(θ_t) − g(θ∗) ≤ O((1 − 1/κ)ᵗ) = O(e^{−t/κ}) if μ-strongly convex
  ⇔ O(κ log(1/ε)) iterations, i.e., complexity O(nd · κ log(1/ε))

• Acceleration (Nesterov, 1983): second-order recursion

  θ_t = η_{t−1} − γ_t g′(η_{t−1})  and  η_t = θ_t + δ_t (θ_t − θ_{t−1})

  – Good choice of momentum term δ_t ∈ [0, 1)

  g(θ_t) − g(θ∗) ≤ O(1/t²)
  g(θ_t) − g(θ∗) ≤ O((1 − 1/√κ)ᵗ) = O(e^{−t/√κ}) if μ-strongly convex

  – Optimal rates after t = O(d) iterations (Nesterov, 2004)

• Newton method: θ_t = θ_{t−1} − g″(θ_{t−1})⁻¹ g′(θ_{t−1})
  – O(e^{−ρ 2ᵗ}) quadratic rate ⇔ O(log log(1/ε)) iterations, i.e., complexity O((nd² + d³) · log log(1/ε))

• Key insights for machine learning (Bottou and Bousquet, 2008)
  1. No need to optimize below statistical error
  2. Cost functions are averages
  3. Testing error is more important than training error
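The gap between the O(e^{−t/κ}) and O(e^{−t/√κ}) rates is easy to see numerically. A quick comparison on a made-up ill-conditioned quadratic with κ = 100, using the constant momentum δ = (√κ − 1)/(√κ + 1), the standard choice for strongly convex problems:

```python
import numpy as np

# Illustrative ill-conditioned quadratic: g(theta) = 0.5 * theta^T A theta
d = 50
vals = np.linspace(1e-2, 1.0, d)      # mu = 1e-2, L = 1.0, so kappa = 100
A = np.diag(vals)
mu, Lc = vals[0], vals[-1]
g = lambda th: 0.5 * th @ (A @ th)
grad = lambda th: A @ th

theta0 = np.ones(d)
T = 300

# Plain gradient descent with step 1/L: contraction (1 - 1/kappa) per step
th = theta0.copy()
for _ in range(T):
    th -= grad(th) / Lc
gd_val = g(th)

# Nesterov acceleration: contraction (1 - 1/sqrt(kappa)) per step
th, prev = theta0.copy(), theta0.copy()
delta = (np.sqrt(Lc / mu) - 1) / (np.sqrt(Lc / mu) + 1)
for _ in range(T):
    eta = th + delta * (th - prev)    # momentum extrapolation
    prev = th
    th = eta - grad(eta) / Lc         # gradient step at the extrapolated point
acc_val = g(th)
```

After 300 iterations, plain gradient descent has contracted errors by roughly (0.99)³⁰⁰ while the accelerated method has contracted them by roughly (0.9)³⁰⁰, many orders of magnitude further.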
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
bull NB single pass leads to bounds on testing error
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)
ndash Iteration complexity is independent of n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
new
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
(with yti stored value at time t of gradient of the i-th function)
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
bull Running-time to reach precision ε (with κ = condition number)
Stochastic gradient descent dtimes∣
∣
∣κ times 1
ε
Gradient descent dtimes∣
∣
∣nκ times log 1
ε
Variance reduction dtimes∣
∣
∣(n+ κ) times log 1
ε
ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ
ndash Matching upper and lower bounds of complexity
Outline
1 Parametric supervised learning on a single machine
ndash Machine learning asymp optimization of finite sums
ndash From batch to stochastic gradient methods
ndash Linearly-convergent stochastic methods for convex problems
2 Machine learning over networks
minus Centralized and decentralized methods
minus From network averaging to optimization
minus Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
ndash Massive datasets multiple machines cores
ndash Communication legal constraints
bull Goal Minimize communication and local computation costs
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
bull Why not simply distributing a simple single machine algorithm
ndash (accelerated) gradient descent (see eg Nesterov 2004)
θt = θtminus1 minus γnablag(θtminus1)
ndash Requiresradicκ log 1
ε full gradient computations to reach precision ε
ndash Need to perform distributed averaging over a network
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
bull Robustness
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
ndash Consequence Eigenvalues(W ) isin [minus1 1]
ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )
ndash γminus1 = mixing time of the associated Markov chain
1 3
24
57
6 9
8
ndash Need 1γ log
1ε iterations to reach precision ε (for classical averaging)
Illustration of synchronous gossip
![Page 26: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/26.jpg)
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Convex loss and linear predictions h(x θ) = θ⊤Φ(x)
bull Relevance of convex optimization
ndash Easier design and analysis of algorithms
ndash Global minimum vs local minimum vs stationary points
ndash Gradient-based algorithms only need convexity for their analysis
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Strong convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Strong convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)
ndash Invertible covariance matrix 1n
sumni=1Φ(xi)Φ(xi)
⊤ rArr n gt d
ndash Even when micro gt 0 micro may be arbitrarily small
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Strong convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)
ndash Invertible covariance matrix 1n
sumni=1Φ(xi)Φ(xi)
⊤ rArr n gt d
ndash Even when micro gt 0 micro may be arbitrarily small
bull Adding regularization by micro2θ2
ndash creates additional bias unless micro is small but reduces variance
ndash Typically Lradicn gt micro gt Ln rArr κ isin [
radicn n]
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O(eminust(microL)) = O(eminustκ)
(small κ = Lmicro) (large κ = Lmicro)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
(small κ = Lmicro) (large κ = Lmicro)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
bull Acceleration (Nesterov 1983) second-order recursion
θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)
ndash Good choice of momentum term δt isin [0 1)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
bull Acceleration (Nesterov 1983) second-order recursion
θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)
ndash Good choice of momentum term δt isin [0 1)
g(θt)minus g(θlowast) 6 O(1t2)
g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust
radicκ) if micro-strongly convex
ndash Optimal rates after t = O(d) iterations (Nesterov 2004)
Iterative methods for minimizing smooth functions

• Assumption: g convex and smooth on ℝ^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1})
  – O(1/t) convergence rate for convex functions
  – O(e^{−t/κ}) linear rate if strongly convex ⇔ O(κ log(1/ε)) iterations ⇔ complexity = O(nd · κ log(1/ε))

• Newton method: θ_t = θ_{t−1} − g″(θ_{t−1})^{−1} g′(θ_{t−1})
  – O(e^{−ρ·2^t}) quadratic rate ⇔ O(log log(1/ε)) iterations ⇔ complexity = O((nd² + d³) · log log(1/ε))

• Key insights for machine learning (Bottou and Bousquet, 2008)
  1. No need to optimize below the statistical error
  2. Cost functions are averages
  3. Testing error is more important than training error
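The quadratic rate O(e^{−ρ·2^t}) means the number of correct digits roughly doubles at each Newton step. A minimal sketch (the test function e^θ + e^{−θ}, smooth and strictly convex with minimizer θ* = 0, is an illustrative choice, not from the slides):

```python
import math

# Newton's method theta_t = theta_{t-1} - g''(theta)^{-1} g'(theta)
# on g(theta) = exp(theta) + exp(-theta); the step equals theta - tanh(theta).
# The error roughly squares at each iteration ("digits double").

def newton(theta, steps):
    errs = []
    for _ in range(steps):
        g1 = math.exp(theta) - math.exp(-theta)   # g'(theta)
        g2 = math.exp(theta) + math.exp(-theta)   # g''(theta)
        theta = theta - g1 / g2                   # Newton step
        errs.append(abs(theta))                   # distance to theta* = 0
    return errs

errs = newton(1.0, 5)
print(errs)  # errors shrink roughly quadratically: ~2e-1, ~4e-3, ~3e-8, ...
```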
Stochastic gradient descent (SGD) for finite sums

min_{θ∈ℝ^d} g(θ) = (1/n) ∑_{i=1}^n f_i(θ)

• Iteration: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) a random element of {1, …, n}
  – Polyak–Ruppert averaging: θ̄_t = (1/(t+1)) ∑_{u=0}^t θ_u

• Convergence rate if each f_i is convex and L-smooth and g is μ-strongly convex:

  E g(θ̄_t) − g(θ*) ≤ O(1/√t) if γ_t = 1/(L√t)
  E g(θ̄_t) − g(θ*) ≤ O(L/(μt)) = O(κ/t) if γ_t = 1/(μt)

  – No adaptivity to strong convexity in general
  – Running-time complexity: O(d · κ/ε)

• NB: a single pass leads to bounds on the testing error
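The iteration with γ_t = 1/(μt) and Polyak–Ruppert averaging can be sketched in a few lines (an illustrative toy finite sum f_i(θ) = ½(θ − ξ_i)², so μ = L = 1 and the minimizer is the mean of the ξ_i; not from the slides):

```python
import random

# SGD with step size gamma_t = 1/(mu*t) and Polyak-Ruppert averaging on
# g(theta) = (1/n) sum_i 0.5*(theta - xi_i)^2, minimized at the mean of xi.

random.seed(0)
n = 100
xi = [random.gauss(0.0, 1.0) for _ in range(n)]
theta_star = sum(xi) / n

theta, avg = 5.0, 5.0
mu = 1.0
for t in range(1, 20001):
    i = random.randrange(n)            # sampling with replacement
    grad_i = theta - xi[i]             # f_i'(theta)
    theta -= grad_i / (mu * t)         # gamma_t = 1/(mu*t)
    avg += (theta - avg) / (t + 1)     # running Polyak-Ruppert average

print(abs(theta - theta_star), abs(avg - theta_star))  # both small
```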
Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) ∑_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λΩ(θ)

• Batch gradient descent: θ_t = θ_{t−1} − γ ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) ∑_{i=1}^n ∇f_i(θ_{t−1})
  – Exponential (linear) convergence rate in O(e^{−t/κ}) for strongly convex problems
  – Can be accelerated to O(e^{−t/√κ}) (Nesterov, 1983)
  – Iteration complexity is linear in n

• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t ∇f_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) a random element of {1, …, n}
  – Convergence rate in O(κ/t)
  – Iteration complexity is independent of n

• Goal = best of both worlds: linear rate with O(d) iteration cost, and a simple choice of step size

  [figure: log(excess cost) vs. time for deterministic, stochastic, and new methods]
Recent progress in single machine optimization

• Variance reduction
  – Exponential convergence with O(d) iteration cost
  – SAG (Le Roux, Schmidt, and Bach, 2012)
  – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
  – SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

  θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) ∑_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]

  (with y_i^t the stored value at time t of the gradient of the i-th function)

• Running time to reach precision ε (with κ = condition number):

  Stochastic gradient descent: d × κ × (1/ε)
  Gradient descent:            d × nκ × log(1/ε)
  Variance reduction:          d × (n + κ) × log(1/ε)

  – Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
  – Matching upper and lower bounds of complexity
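The bracketed update above (fresh gradient, minus the stored one, plus the average of stored gradients) can be sketched concretely. This is a SAGA-style illustration on a toy finite sum; the quadratic f_i and the step size γ = 1/(3L) are illustrative assumptions, not prescriptions from the slides:

```python
import random

# Variance-reduced step: theta <- theta - gamma*(f_i'(theta) - y_i + mean(y)),
# then the stored gradient y_i is refreshed.  Toy finite sum:
# f_i(theta) = 0.5*(theta - xi_i)^2, minimized at the mean of xi (L = mu = 1).

random.seed(1)
n = 50
xi = [random.uniform(-1.0, 1.0) for _ in range(n)]
theta_star = sum(xi) / n

theta = 10.0
y = [theta - x for x in xi]        # stored gradients y_i = f_i'(theta_0)
y_mean = sum(y) / n
gamma = 1.0 / 3.0                  # step of order 1/L

for t in range(5000):
    i = random.randrange(n)
    g_i = theta - xi[i]                        # fresh gradient of f_i
    theta -= gamma * (g_i - y[i] + y_mean)     # unbiased, variance-reduced step
    y_mean += (g_i - y[i]) / n                 # maintain average of stored grads
    y[i] = g_i                                 # refresh stored gradient

print(abs(theta - theta_star))  # linear convergence with O(d) cost per step
```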
Outline

1. Parametric supervised learning on a single machine
   – Machine learning ≈ optimization of finite sums
   – From batch to stochastic gradient methods
   – Linearly convergent stochastic methods for convex problems

2. Machine learning over networks
   – Centralized and decentralized methods
   – From network averaging to optimization
   – Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)

• Machine learning through optimization:

  min_{θ∈ℝ^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)

  – f_i(θ): error of the model defined by θ on the dataset indexed by i
  – Example: f_i(θ) = (1/m_i) ∑_{j=1}^{m_i} ℓ(y_ij, θ⊤Φ(x_ij)) if node i holds m_i observations

• Each dataset/function f_i only accessible by node i in a graph
  – Massive datasets, multiple machines/cores
  – Communication/legal constraints

• Goal: minimize communication and local computation costs

• Why not simply distribute a simple single machine algorithm?
  – (Accelerated) gradient descent (see, e.g., Nesterov, 2004): θ_t = θ_{t−1} − γ ∇g(θ_{t−1})
  – Requires √κ log(1/ε) full gradient computations to reach precision ε
  – Need to perform distributed averaging over a network
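The need for averaging is visible already in the simplest scheme: each node computes its local gradient, and one step of gradient descent on g just averages them. A minimal sketch (toy quadratic f_i(θ) = ½(θ − ξ_i)² with hypothetical per-node data; not from the slides):

```python
# One round of "distributed gradient descent": every node evaluates its local
# gradient, the gradients are averaged (one communication round), and the
# shared parameter is updated.

xi = [1.0, 3.0, 5.0, 7.0]          # one local dataset summary per node
n = len(xi)

def local_grad(i, theta):          # computed by node i only
    return theta - xi[i]           # f_i'(theta) for f_i = 0.5*(theta - xi_i)^2

theta, gamma = 0.0, 0.5
for t in range(30):
    grads = [local_grad(i, theta) for i in range(n)]   # one averaging round
    theta -= gamma * sum(grads) / n                    # gradient step on g

print(theta)  # converges linearly to the global mean 4.0
```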
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ ℝ, compute

  θ* = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈ℝ} (1/n) ∑_{i=1}^n (θ − ξ_i)²

• Centralized algorithms
  – Compute a spanning tree with diameter ≤ 2Δ
  – Master/slave algorithm: Δ communication steps + no error

• Application to centralized distributed optimization
  – √κ log(1/ε) gradient steps and √κ Δ log(1/ε) communication steps
  – "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)
  – Robustness?

• Decentralized algorithms – gossip (Boyd et al., 2006)
  – Replace θ_i by a weighted average of its neighbors: ∑_{j=1}^n W_ij θ_j
  – Potential asynchrony, changing network

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
  – Consequence: eigenvalues(W) ∈ [−1, 1]
  – Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
  – γ^{−1} = mixing time of the associated Markov chain
  – Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)
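The iteration θ_t = W θ_{t−1} is easy to simulate. A minimal sketch (the ring topology and the specific weights — 1/2 on self, 1/4 on each neighbor, a symmetric doubly stochastic choice — are illustrative assumptions, not from the slides):

```python
# Synchronous gossip on a ring of n nodes: every node simultaneously replaces
# its value by a weighted average of itself and its two neighbors.
# All nodes converge to the global average of the initial observations.

n = 8
xi = [float(i) for i in range(n)]          # one observation per node
mean = sum(xi) / n

theta = list(xi)
for t in range(200):
    theta = [0.5 * theta[i]
             + 0.25 * theta[(i - 1) % n]
             + 0.25 * theta[(i + 1) % n]
             for i in range(n)]            # theta_t = W theta_{t-1}

dev = max(abs(v - mean) for v in theta)
print(dev)  # every node is now very close to the mean
```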
Illustration of synchronous gossip

[figure: gossip iterations on a 9-node network]
Smoothness and (strong) convexity

• A twice differentiable function g: ℝ^d → ℝ is μ-strongly convex if and only if

  ∀θ ∈ ℝ^d, eigenvalues[g″(θ)] ≥ μ

• Convexity in machine learning
  – With g(θ) = (1/n) ∑_{i=1}^n ℓ(y_i, h(x_i, θ))
  – Convex loss and linear predictions h(x, θ) = θ⊤Φ(x)

• Relevance of convex optimization
  – Easier design and analysis of algorithms
  – Global minimum vs. local minimum vs. stationary points
  – Gradient-based algorithms only need convexity for their analysis

• Strong convexity in machine learning
  – Strongly convex loss and linear predictions h(x, θ) = θ⊤Φ(x)
  – Invertible covariance matrix (1/n) ∑_{i=1}^n Φ(x_i)Φ(x_i)⊤ ⇒ n ≥ d
  – Even when μ > 0, μ may be arbitrarily small

• Adding regularization by (μ/2)‖θ‖²
  – Creates additional bias unless μ is small, but reduces variance
  – Typically L/√n ≥ μ ≥ L/n ⇒ κ ∈ [√n, n]
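For least squares with linear predictions, the Hessian is the covariance matrix, so μ and L are its extreme eigenvalues, and adding (reg/2)·‖θ‖² shifts every eigenvalue up by reg. A minimal sketch (the 2-d feature vectors are hypothetical, chosen nearly collinear so that μ is tiny; not from the slides):

```python
import math

# Condition number of a least-squares problem before and after regularization.
# Hessian = (1/n) sum_i Phi(x_i) Phi(x_i)^T; mu = lambda_min, L = lambda_max.

def sym2x2_eigs(a, b, c):
    # eigenvalues of the symmetric matrix [[a, b], [b, c]]
    s = math.sqrt((a - c) ** 2 + 4 * b * b)
    return ((a + c - s) / 2, (a + c + s) / 2)

phis = [(1.0, 0.99), (1.0, 1.01), (-1.0, -1.0)]   # nearly collinear features
n = len(phis)
a = sum(p[0] * p[0] for p in phis) / n
b = sum(p[0] * p[1] for p in phis) / n
c = sum(p[1] * p[1] for p in phis) / n

mu, L = sym2x2_eigs(a, b, c)       # mu > 0 but tiny: huge condition number
reg = 0.01
mu_reg, L_reg = mu + reg, L + reg  # eigenvalues of the regularized Hessian
print(L / mu, L_reg / mu_reg)      # condition number drops sharply
```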
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O(eminust(microL)) = O(eminustκ)
(small κ = Lmicro) (large κ = Lmicro)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
(small κ = Lmicro) (large κ = Lmicro)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
bull Acceleration (Nesterov 1983) second-order recursion
θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)
ndash Good choice of momentum term δt isin [0 1)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
bull Acceleration (Nesterov 1983) second-order recursion
θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)
ndash Good choice of momentum term δt isin [0 1)
g(θt)minus g(θlowast) 6 O(1t2)
g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust
radicκ) if micro-strongly convex
ndash Optimal rates after t = O(d) iterations (Nesterov 2004)
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr O(log log 1ε) iterations
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
bull Key insights for machine learning (Bottou and Bousquet 2008)
1 No need to optimize below statistical error
2 Cost functions are averages
3 Testing error is more important than training error
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
bull Key insights for machine learning (Bottou and Bousquet 2008)
1 No need to optimize below statistical error
2 Cost functions are averages
3 Testing error is more important than training error
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
bull NB single pass leads to bounds on testing error
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)
ndash Iteration complexity is independent of n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
new
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
(with yti stored value at time t of gradient of the i-th function)
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
bull Running-time to reach precision ε (with κ = condition number)
Stochastic gradient descent dtimes∣
∣
∣κ times 1
ε
Gradient descent dtimes∣
∣
∣nκ times log 1
ε
Variance reduction dtimes∣
∣
∣(n+ κ) times log 1
ε
ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ
ndash Matching upper and lower bounds of complexity
Outline
1 Parametric supervised learning on a single machine
ndash Machine learning asymp optimization of finite sums
ndash From batch to stochastic gradient methods
ndash Linearly-convergent stochastic methods for convex problems
2 Machine learning over networks
minus Centralized and decentralized methods
minus From network averaging to optimization
minus Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
ndash Massive datasets multiple machines cores
ndash Communication legal constraints
bull Goal Minimize communication and local computation costs
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
bull Why not simply distributing a simple single machine algorithm
ndash (accelerated) gradient descent (see eg Nesterov 2004)
θt = θtminus1 minus γnablag(θtminus1)
ndash Requiresradicκ log 1
ε full gradient computations to reach precision ε
ndash Need to perform distributed averaging over a network
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
bull Robustness
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
ndash Consequence Eigenvalues(W ) isin [minus1 1]
ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )
ndash γminus1 = mixing time of the associated Markov chain
1 3
24
57
6 9
8
ndash Need 1γ log
1ε iterations to reach precision ε (for classical averaging)
Illustration of synchronous gossip
![Page 28: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/28.jpg)
Smoothness and (strong) convexity

• A twice differentiable function g : R^d → R is µ-strongly convex if and only if

  ∀θ ∈ R^d, eigenvalues[g″(θ)] ≥ µ

• Strong convexity in machine learning
– With g(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ))
– Strongly convex loss and linear predictions h(x, θ) = θ⊤Φ(x)
– Invertible covariance matrix (1/n) Σ_{i=1}^n Φ(x_i)Φ(x_i)⊤ ⇒ n ≥ d
– Even when µ > 0, µ may be arbitrarily small

• Adding regularization by (µ/2)‖θ‖²
– creates additional bias (unless µ is small), but reduces variance
– Typically L/√n ≥ µ ≥ L/n ⇒ κ ∈ [√n, n]
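For a regularized least-squares objective, the strong-convexity constant can be read directly off the Hessian spectrum. A minimal NumPy sketch (the data and the regularization level µ are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
Phi = rng.standard_normal((n, d))   # feature vectors Phi(x_i), stacked row-wise
mu = 0.1                            # regularization strength (illustrative)

# g(theta) = (1/2n) ||y - Phi theta||^2 + (mu/2) ||theta||^2 has the
# constant Hessian g''(theta) = Phi^T Phi / n + mu I.
H = Phi.T @ Phi / n + mu * np.eye(d)
eigs = np.linalg.eigvalsh(H)

# g is L-smooth and mu'-strongly convex with L = max eigenvalue,
# mu' = min eigenvalue >= mu (the covariance matrix is PSD).
L, mu_eff = eigs.max(), eigs.min()
kappa = L / mu_eff                  # condition number kappa = L / mu'
print(L, mu_eff, kappa)
```

Without the µI term, the smallest eigenvalue of the covariance matrix can be arbitrarily close to zero (and is exactly zero when n < d), which is the point of the slide.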
Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on R^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) (with line search)
– g(θ_t) − g(θ∗) ≤ O(1/t)
– g(θ_t) − g(θ∗) ≤ O((1 − 1/κ)^t) = O(e^{−t/κ}) if µ-strongly convex, with κ = L/µ

• Acceleration (Nesterov, 1983): second-order recursion

  θ_t = η_{t−1} − γ_t g′(η_{t−1}) and η_t = θ_t + δ_t(θ_t − θ_{t−1})

– Good choice of momentum term δ_t ∈ [0, 1)
– g(θ_t) − g(θ∗) ≤ O(1/t²)
– g(θ_t) − g(θ∗) ≤ O((1 − 1/√κ)^t) = O(e^{−t/√κ}) if µ-strongly convex
– Optimal rates after t = O(d) iterations (Nesterov, 2004)
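The gap between plain gradient descent and the accelerated recursion is easy to observe on a toy ill-conditioned quadratic. A hedged sketch, using the classical γ = 1/L step size and the constant momentum δ = (√κ − 1)/(√κ + 1) for strongly convex problems:

```python
import numpy as np

# Quadratic g(theta) = 0.5 theta^T A theta with mu = 0.01, L = 1, kappa = 100.
A = np.diag(np.linspace(0.01, 1.0, 20))
L, mu = 1.0, 0.01
kappa = L / mu
theta0 = np.ones(20)

def grad(theta):
    return A @ theta

# Plain gradient descent: theta_t = theta_{t-1} - (1/L) g'(theta_{t-1})
theta = theta0.copy()
for _ in range(200):
    theta = theta - grad(theta) / L
gd_err = 0.5 * theta @ A @ theta

# Nesterov acceleration: theta_t = eta_{t-1} - (1/L) g'(eta_{t-1}),
#                        eta_t  = theta_t + delta (theta_t - theta_{t-1})
delta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
theta, eta = theta0.copy(), theta0.copy()
for _ in range(200):
    theta_new = eta - grad(eta) / L
    eta = theta_new + delta * (theta_new - theta)
    theta = theta_new
acc_err = 0.5 * theta @ A @ theta

print(gd_err, acc_err)   # acceleration reaches a much lower excess cost
```

With κ = 100, the contraction factors e^{−t/κ} vs. e^{−t/√κ} differ by an order of magnitude in the exponent, which is exactly what the two final errors reflect.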
Iterative methods for minimizing smooth functions

• Assumption: g convex and smooth on R^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1})
– O(1/t) convergence rate for convex functions
– O(e^{−t/κ}) linear rate if strongly convex ⇔ complexity = O(nd · κ log(1/ε))

• Newton method: θ_t = θ_{t−1} − g″(θ_{t−1})^{−1} g′(θ_{t−1})
– O(e^{−ρ·2^t}) quadratic rate ⇔ complexity = O((nd² + d³) · log log(1/ε))

• Key insights for machine learning (Bottou and Bousquet, 2008)
1. No need to optimize below the statistical error
2. Cost functions are averages
3. Testing error is more important than training error
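The quadratic rate of Newton's method (roughly doubling the number of correct digits per iteration, hence the log log(1/ε) iteration count) shows up even in one dimension. A small sketch on an illustrative smooth, strictly convex function with minimizer θ∗ = 0:

```python
import math

# Minimize g(theta) = log(1 + e^theta) + log(1 + e^-theta), a smooth strictly
# convex function, symmetric around its minimizer theta* = 0.
def g1(theta):   # first derivative: sigma(theta) - sigma(-theta)
    return 1 / (1 + math.exp(-theta)) - 1 / (1 + math.exp(theta))

def g2(theta):   # second derivative: 2 sigma(theta) (1 - sigma(theta)) > 0
    s = 1 / (1 + math.exp(-theta))
    return 2 * s * (1 - s)

theta = 1.0
errors = []
for _ in range(6):
    theta = theta - g1(theta) / g2(theta)   # Newton step
    errors.append(abs(theta))               # distance to theta* = 0

print(errors)   # error collapses super-linearly toward machine precision
```

Each gradient-descent step on this problem would shrink the error by only a constant factor; the Newton iterates above essentially square (or better) the error at each step.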
Stochastic gradient descent (SGD) for finite sums

  min_{θ∈R^d} g(θ) = (1/n) Σ_{i=1}^n f_i(θ)

• Iteration: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})
– Sampling with replacement: i(t) random element of {1, …, n}
– Polyak–Ruppert averaging: θ̄_t = (1/(t+1)) Σ_{u=0}^t θ_u

• Convergence rate if each f_i is convex and L-smooth, and g is µ-strongly convex:
– E g(θ̄_t) − g(θ∗) ≤ O(1/√t) if γ_t = 1/(L√t)
– E g(θ̄_t) − g(θ∗) ≤ O(L/(µt)) = O(κ/t) if γ_t = 1/(µt)
– No adaptivity to strong convexity in general
– Running-time complexity: O(d · κ/ε)

• NB: a single pass leads to bounds on the testing error
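The SGD iteration with the γ_t = 1/(µt) step size and Polyak–Ruppert averaging can be sketched on a synthetic regularized least-squares sum. One caveat not on the slide: 1/(µt) is larger than 1/L for the first ≈ κ iterations, so the sketch caps the step at 1/L early on, a common practical safeguard; data and µ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
mu = 0.1                                        # regularization -> mu-strong convexity

# f_i(theta) = 0.5 (y_i - x_i^T theta)^2 + (mu/2) ||theta||^2
def grad_i(theta, i):
    return (X[i] @ theta - y[i]) * X[i] + mu * theta

L = np.max(np.sum(X ** 2, axis=1)) + mu         # smoothness constant of the f_i
theta = np.zeros(d)
theta_bar = np.zeros(d)
for t in range(1, 20001):
    i = rng.integers(n)                         # sampling with replacement
    gamma = min(1 / L, 1 / (mu * t))            # gamma_t = 1/(mu t), capped early
    theta = theta - gamma * grad_i(theta, i)
    theta_bar += (theta - theta_bar) / (t + 1)  # running Polyak-Ruppert average

# Exact minimizer of g, for comparison
theta_star = np.linalg.solve(X.T @ X / n + mu * np.eye(d), X.T @ y / n)
err = np.linalg.norm(theta_bar - theta_star)
print(err)
```

The averaged iterate lands close to θ∗ at the O(κ/t) rate, but does not reach it exactly for any fixed t, consistent with the O(d · κ/ε) complexity rather than a log(1/ε) one.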
Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) Σ_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λΩ(θ)

• Batch gradient descent: θ_t = θ_{t−1} − γ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) Σ_{i=1}^n ∇f_i(θ_{t−1})
– Exponential (linear) convergence rate in O(e^{−t/κ}) for strongly convex problems
– Can be accelerated to O(e^{−t/√κ}) (Nesterov, 1983)
– Iteration complexity is linear in n

• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t ∇f_{i(t)}(θ_{t−1})
– Sampling with replacement: i(t) random element of {1, …, n}
– Convergence rate in O(κ/t)
– Iteration complexity is independent of n
Stochastic vs. deterministic methods

• Goal = best of both worlds: linear rate with O(d) iteration cost, and a simple choice of step size

(figure: log(excess cost) vs. time — the deterministic method decreases linearly but each iteration is costly, the stochastic method drops quickly at first then slows down, and the new methods combine the fast start with the linear rate)
Recent progress in single machine optimization

• Variance reduction
– Exponential (linear) convergence with O(d) iteration cost
– SAG (Le Roux, Schmidt, and Bach, 2012)
– SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
– SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

  θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) Σ_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]

(with y_i^t the stored value, at time t, of the gradient of the i-th function)

• Running time to reach precision ε (with κ = condition number)

  Stochastic gradient descent: d × κ × (1/ε)
  Gradient descent:            d × nκ × log(1/ε)
  Variance reduction:          d × (n + κ) × log(1/ε)

– Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
– Matching upper and lower bounds of complexity
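The variance-reduced update above can be sketched in SAGA style on a small regularized least-squares sum. This is a minimal illustration, not the papers' reference implementation; the step size γ = 1/(3L) is one common choice for SAGA, and the data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
mu = 0.1

# f_i(theta) = 0.5 (y_i - x_i^T theta)^2 + (mu/2) ||theta||^2
def grad_i(theta, i):
    return (X[i] @ theta - y[i]) * X[i] + mu * theta

L = np.max(np.sum(X ** 2, axis=1)) + mu        # smoothness constant of the f_i
gamma = 1 / (3 * L)

theta = np.zeros(d)
table = np.array([grad_i(theta, i) for i in range(n)])   # stored gradients y_i^t
avg = table.mean(axis=0)                                 # their running average

for _ in range(20000):
    i = rng.integers(n)
    g_new = grad_i(theta, i)
    # SAGA step: fresh gradient, minus the stored one, plus the stored average
    theta = theta - gamma * (g_new - table[i] + avg)
    avg += (g_new - table[i]) / n              # keep the average in sync, O(d) cost
    table[i] = g_new

theta_star = np.linalg.solve(X.T @ X / n + mu * np.eye(d), X.T @ y / n)
err = np.linalg.norm(theta - theta_star)
print(err)
```

Unlike plain SGD with the same O(d) per-iteration cost, the last iterate (no averaging needed) converges linearly to θ∗, which is the point of the (n + κ) log(1/ε) row in the table.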
Outline

1. Parametric supervised learning on a single machine
– Machine learning ≈ optimization of finite sums
– From batch to stochastic gradient methods
– Linearly convergent stochastic methods for convex problems

2. Machine learning over networks
– Centralized and decentralized methods
– From network averaging to optimization
– Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)

• Machine learning through optimization:

  min_{θ∈R^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)

– f_i(θ): error of the model defined by θ on the dataset indexed by i
– Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_ij, θ⊤Φ(x_ij)) for m_i observations

• Each dataset/function f_i is only accessible by node i in a graph

(figure: a connected network of 9 nodes)

– Massive datasets, multiple machines/cores
– Communication/legal constraints

• Goal: minimize communication and local computation costs
Distribution in machine learning (and beyond)

• Machine learning through optimization:

  min_{θ∈R^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)

– f_i(θ): error of the model defined by θ on the dataset indexed by i

• Why not simply distribute a simple single machine algorithm?
– (Accelerated) gradient descent (see, e.g., Nesterov, 2004): θ_t = θ_{t−1} − γ∇g(θ_{t−1})
– Requires √κ log(1/ε) full gradient computations to reach precision ε
– Need to perform distributed averaging over a network
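Each full gradient ∇g is the average of the local gradients ∇f_i, so one gradient step costs one round of local computation plus one network averaging. A toy single-process simulation of this (the nodes, local functions, and step size are all illustrative; no real network is involved):

```python
import numpy as np

rng = np.random.default_rng(3)
n_nodes, d = 9, 4
# Each node i holds a local quadratic f_i(theta) = 0.5 ||theta - c_i||^2
centers = rng.standard_normal((n_nodes, d))

def local_grad(i, theta):
    return theta - centers[i]

theta = np.zeros(d)
gamma = 0.5   # here L = mu = 1, so any step size in (0, 2) converges
for _ in range(50):
    grads = [local_grad(i, theta) for i in range(n_nodes)]  # local computation
    theta = theta - gamma * np.mean(grads, axis=0)          # averaging = communication

# The minimizer of g = (1/n) sum_i f_i is the mean of the centers
dist_err = np.linalg.norm(theta - centers.mean(axis=0))
print(dist_err)
```

The expensive primitive here is `np.mean(grads, ...)`: on a real network that average must itself be computed by message passing, which is why the next slides study distributed averaging on its own.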
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ R, compute

  θ∗ = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) Σ_{i=1}^n (θ − ξ_i)²

• Centralized algorithms
– Compute a spanning tree with diameter ≤ 2∆
– Master/slave algorithm: ∆ communication steps + no error

• Application to centralized distributed optimization
– √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
– "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)

• Robustness?
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ R, compute

  θ∗ = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) Σ_{i=1}^n (θ − ξ_i)²

• Decentralized algorithms – gossip (Boyd et al., 2006)
– Replace θ_i by a weighted average of its neighbors: Σ_{j=1}^n W_ij θ_j
– Potential asynchrony, changing network

(figure: a connected network of 9 nodes)

• Synchronous gossip (all nodes simultaneously)
– Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
– Typical assumption: W symmetric doubly stochastic matrix
Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)
– Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
– Typical assumption: W symmetric doubly stochastic matrix
– Consequence: eigenvalues(W) ∈ [−1, 1]
– Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
– γ^{−1} = mixing time of the associated Markov chain
– Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)
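The iteration θ_t = W^t ξ can be simulated directly. A sketch on a ring of 9 nodes, with W the lazy random-walk matrix (symmetric and doubly stochastic); the graph and weights are illustrative choices, not prescribed by the slide:

```python
import numpy as np

n = 9
# Ring graph: node i is connected to i-1 and i+1 (mod n).
# Lazy random-walk weights: keep 1/2, give 1/4 to each neighbor.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

assert np.allclose(W, W.T) and np.allclose(W.sum(axis=1), 1.0)  # doubly stochastic

xi = np.arange(n, dtype=float)     # initial local values xi_i
theta = xi.copy()
for _ in range(200):               # theta_t = W theta_{t-1}
    theta = W @ theta

# Every node approaches the global average; the rate is governed by the eigengap.
eigs = np.sort(np.linalg.eigvalsh(W))[::-1]
gap = eigs[0] - eigs[1]            # gamma = 1 - lambda_2(W)
print(theta, xi.mean(), gap)
```

For this ring the eigengap is small (≈ 0.12), so the deviation from the average shrinks by roughly a factor (1 − γ) per round, matching the (1/γ) log(1/ε) iteration count; poorly connected graphs mix correspondingly more slowly.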
Illustration of synchronous gossip
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Strong convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)
ndash Invertible covariance matrix 1n
sumni=1Φ(xi)Φ(xi)
⊤ rArr n gt d
ndash Even when micro gt 0 micro may be arbitrarily small
Smoothness and (strong) convexity
bull A twice differentiable function g Rd rarr R is micro-strongly convex
if and only if
forallθ isin Rd eigenvalues
[
gprimeprime(θ)]
gt micro
bull Strong convexity in machine learning
ndash With g(θ) = 1n
sumni=1 ℓ(yi h(xi θ))
ndash Strongly convex loss and linear predictions h(x θ) = θ⊤Φ(x)
ndash Invertible covariance matrix 1n
sumni=1Φ(xi)Φ(xi)
⊤ rArr n gt d
ndash Even when micro gt 0 micro may be arbitrarily small
bull Adding regularization by micro2θ2
ndash creates additional bias unless micro is small but reduces variance
ndash Typically Lradicn gt micro gt Ln rArr κ isin [
radicn n]
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O(eminust(microL)) = O(eminustκ)
(small κ = Lmicro) (large κ = Lmicro)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
(small κ = Lmicro) (large κ = Lmicro)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
bull Acceleration (Nesterov 1983) second-order recursion
θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)
ndash Good choice of momentum term δt isin [0 1)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
bull Acceleration (Nesterov 1983) second-order recursion
θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)
ndash Good choice of momentum term δt isin [0 1)
g(θt)minus g(θlowast) 6 O(1t2)
g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust
radicκ) if micro-strongly convex
ndash Optimal rates after t = O(d) iterations (Nesterov 2004)
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr O(log log 1ε) iterations
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
bull Key insights for machine learning (Bottou and Bousquet 2008)
1 No need to optimize below statistical error
2 Cost functions are averages
3 Testing error is more important than training error
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
bull Key insights for machine learning (Bottou and Bousquet 2008)
1 No need to optimize below statistical error
2 Cost functions are averages
3 Testing error is more important than training error
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
bull NB single pass leads to bounds on testing error
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)
ndash Iteration complexity is independent of n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
new
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
(with yti stored value at time t of gradient of the i-th function)
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
bull Running-time to reach precision ε (with κ = condition number)
Stochastic gradient descent dtimes∣
∣
∣κ times 1
ε
Gradient descent dtimes∣
∣
∣nκ times log 1
ε
Variance reduction dtimes∣
∣
∣(n+ κ) times log 1
ε
ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ
ndash Matching upper and lower bounds of complexity
Outline
1 Parametric supervised learning on a single machine
ndash Machine learning asymp optimization of finite sums
ndash From batch to stochastic gradient methods
ndash Linearly-convergent stochastic methods for convex problems
2 Machine learning over networks
minus Centralized and decentralized methods
minus From network averaging to optimization
minus Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
ndash Massive datasets multiple machines cores
ndash Communication legal constraints
bull Goal Minimize communication and local computation costs
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
bull Why not simply distributing a simple single machine algorithm
ndash (accelerated) gradient descent (see eg Nesterov 2004)
θt = θtminus1 minus γnablag(θtminus1)
ndash Requiresradicκ log 1
ε full gradient computations to reach precision ε
ndash Need to perform distributed averaging over a network
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
bull Robustness
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
Convergence of synchronous gossip
• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
  – Consequence: eigenvalues(W) ∈ [−1, 1]
  – Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
  – γ^{−1} = mixing time of the associated Markov chain
  – Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)
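The iteration θ_t = Wθ_{t−1} is easy to simulate. In the sketch below, the 4-node cycle and its weight matrix W are illustrative assumptions (not from the slides); W is symmetric and doubly stochastic with eigengap γ = 1 − λ_2(W) = 0.5, so the disagreement between nodes shrinks like 0.5^t while the average is preserved exactly.

```python
# Minimal synchronous gossip simulation; the 4-node cycle and its weight
# matrix W are illustrative assumptions.
def gossip_step(W, theta):
    # One synchronous round: every node averages over its neighbors.
    n = len(theta)
    return [sum(W[i][j] * theta[j] for j in range(n)) for i in range(n)]

def synchronous_gossip(W, xi, iterations):
    # theta_t = W theta_{t-1} = W^t xi
    theta = list(xi)
    for _ in range(iterations):
        theta = gossip_step(W, theta)
    return theta

# Symmetric doubly stochastic W for a 4-cycle with self-loops:
# eigengap gamma = 1 - lambda_2(W) = 0.5, so errors shrink like 0.5^t.
W = [[0.50, 0.25, 0.00, 0.25],
     [0.25, 0.50, 0.25, 0.00],
     [0.00, 0.25, 0.50, 0.25],
     [0.25, 0.00, 0.25, 0.50]]
xi = [0.0, 4.0, 8.0, 4.0]
theta = synchronous_gossip(W, xi, 50)   # every entry is close to mean(xi) = 4.0
```

After 50 rounds every node holds the global mean to within roughly 0.5^50, matching the (1/γ) log(1/ε) iteration count above.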
Illustration of synchronous gossip
Smoothness and (strong) convexity
• A twice differentiable function g : ℝ^d → ℝ is µ-strongly convex if and only if
  ∀θ ∈ ℝ^d, eigenvalues[g″(θ)] ≥ µ
• Strong convexity in machine learning
  – With g(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, h(x_i, θ))
  – Strongly convex loss and linear predictions h(x, θ) = θ⊤Φ(x)
  – Invertible covariance matrix (1/n) Σ_{i=1}^n Φ(x_i)Φ(x_i)⊤ ⇒ n ≥ d
  – Even when µ > 0, µ may be arbitrarily small
• Adding regularization by (µ/2)‖θ‖²
  – creates additional bias unless µ is small, but reduces variance
  – Typically L/√n ≥ µ ≥ L/n ⇒ κ ∈ [√n, n]
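To make the "µ may be arbitrarily small" point concrete, here is a small assumed example (not from the slides): for least squares with linear predictions, the Hessian is the feature covariance matrix, so µ and L are its extreme eigenvalues and κ = L/µ blows up when features are nearly collinear.

```python
# Hypothetical numerical check: for least squares with linear predictions,
# the Hessian of g is the covariance (1/n) sum_i phi_i phi_i^T, so mu and L
# are its smallest/largest eigenvalues. The 2-d feature set below is an
# assumption chosen so the two coordinates are nearly collinear.
def covariance_2d(features):
    n = len(features)
    h = [[0.0, 0.0], [0.0, 0.0]]
    for a, b in features:
        h[0][0] += a * a / n
        h[0][1] += a * b / n
        h[1][1] += b * b / n
    h[1][0] = h[0][1]
    return h

def eigenvalues_2x2(h):
    # Closed form for a symmetric 2x2 matrix: tr/2 -+ sqrt((tr/2)^2 - det)
    tr = h[0][0] + h[1][1]
    det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
    disc = ((tr / 2) ** 2 - det) ** 0.5
    return tr / 2 - disc, tr / 2 + disc

features = [(1.0, 0.1), (1.0, -0.1), (-1.0, 0.05), (-1.0, -0.05)]
mu, L = eigenvalues_2x2(covariance_2d(features))
kappa = L / mu   # mu > 0 yet tiny, so the condition number is large
```

Here the covariance is invertible (µ > 0), yet κ is already above 100 for this tiny example.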
Iterative methods for minimizing smooth functions
• Assumption: g convex and L-smooth on ℝ^d
• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) (+ line search)
  g(θ_t) − g(θ*) ≤ O(1/t)
  g(θ_t) − g(θ*) ≤ O((1 − 1/κ)^t) = O(e^{−t/κ}) if µ-strongly convex (condition number κ = L/µ)
• Acceleration (Nesterov, 1983): second-order recursion
  θ_t = η_{t−1} − γ_t g′(η_{t−1}) and η_t = θ_t + δ_t(θ_t − θ_{t−1})
  – Good choice of momentum term δ_t ∈ [0, 1)
  g(θ_t) − g(θ*) ≤ O(1/t²)
  g(θ_t) − g(θ*) ≤ O((1 − 1/√κ)^t) = O(e^{−t/√κ}) if µ-strongly convex
  – Optimal rates after t = O(d) iterations (Nesterov, 2004)
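The gap between e^{−t/κ} and e^{−t/√κ} is visible even on a toy problem. The sketch below is an assumed example (not from the slides): an ill-conditioned quadratic with κ = 100, comparing plain gradient descent to the second-order recursion above with the classical momentum choice δ = (√κ − 1)/(√κ + 1).

```python
# Assumed quadratic example g(x) = 0.5*(L*x1^2 + mu*x2^2) with kappa = L/mu = 100,
# comparing gradient descent to Nesterov's second-order recursion.
def grad(x, L=100.0, mu=1.0):
    return [L * x[0], mu * x[1]]

def run_gd(x0, steps, L=100.0, mu=1.0):
    x, gamma = list(x0), 1.0 / L
    for _ in range(steps):
        g = grad(x, L, mu)
        x = [x[0] - gamma * g[0], x[1] - gamma * g[1]]
    return x

def run_nesterov(x0, steps, L=100.0, mu=1.0):
    kappa = L / mu
    delta = (kappa ** 0.5 - 1) / (kappa ** 0.5 + 1)  # classical momentum choice
    x_prev, x, gamma = list(x0), list(x0), 1.0 / L
    for _ in range(steps):
        eta = [x[i] + delta * (x[i] - x_prev[i]) for i in range(2)]  # momentum
        g = grad(eta, L, mu)
        x_prev, x = x, [eta[i] - gamma * g[i] for i in range(2)]     # gradient step
    return x

def sq_norm(x):
    return x[0] ** 2 + x[1] ** 2

# After 50 steps, the accelerated iterate is much closer to the minimizer 0.
gd = run_gd([1.0, 1.0], 50)
acc = run_nesterov([1.0, 1.0], 50)
```

With γ = 1/L, plain gradient descent contracts the slow coordinate by (1 − 1/κ) per step, while the accelerated recursion contracts roughly by (1 − 1/√κ).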
Iterative methods for minimizing smooth functions
• Assumption: g convex and smooth on ℝ^d
• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1})
  – O(1/t) convergence rate for convex functions
  – O(e^{−t/κ}) linear rate if strongly convex ⇔ O(κ log(1/ε)) iterations, complexity = O(nd · κ log(1/ε))
• Newton method: θ_t = θ_{t−1} − g″(θ_{t−1})^{−1} g′(θ_{t−1})
  – O(e^{−ρ2^t}) quadratic rate ⇔ O(log log(1/ε)) iterations, complexity = O((nd² + d³) · log log(1/ε))
• Key insights for machine learning (Bottou and Bousquet, 2008)
  1. No need to optimize below statistical error
  2. Cost functions are averages
  3. Testing error is more important than training error
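The O(log log(1/ε)) behaviour is easy to observe on a one-dimensional toy objective. The choice g(θ) = cosh(θ), minimized at θ* = 0, is an assumption for illustration only.

```python
# Newton's method theta_t = theta_{t-1} - g''(theta_{t-1})^{-1} g'(theta_{t-1})
# on the assumed toy objective g(theta) = cosh(theta): the number of correct
# digits roughly doubles at every step (quadratic rate).
import math

def newton_minimize(dg, d2g, theta0, iterations):
    theta = theta0
    for _ in range(iterations):
        theta -= dg(theta) / d2g(theta)
    return theta

# g = cosh, g' = sinh, g'' = cosh; minimizer is theta* = 0
theta = newton_minimize(math.sinh, math.cosh, 1.0, 5)
```

Five iterations already drive the error far below 1e-12, well past any statistical error, which is why log log(1/ε) iterations suffice in theory (and why, per insight 1 above, such precision is rarely needed in practice).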
Stochastic gradient descent (SGD) for finite sums
  min_{θ∈ℝ^d} g(θ) = (1/n) Σ_{i=1}^n f_i(θ)
• Iteration: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) random element of {1, …, n}
  – Polyak–Ruppert averaging: θ̄_t = (1/(t+1)) Σ_{u=0}^t θ_u
• Convergence rate if each f_i is convex, L-smooth, and g is µ-strongly convex:
  E g(θ̄_t) − g(θ*) ≤ O(1/√t) if γ_t = 1/(L√t)
  E g(θ̄_t) − g(θ*) ≤ O(L/(µt)) = O(κ/t) if γ_t = 1/(µt)
  – No adaptivity to strong convexity in general
  – Running-time complexity: O(d · κ/ε)
• NB: a single pass leads to bounds on the testing error
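A minimal runnable sketch of this iteration, on an assumed example (not from the slides): f_i(θ) = (θ − ξ_i)²/2, so g is 1-strongly convex, the step size is γ_t = 1/(µt), and the Polyak–Ruppert average is returned.

```python
# Illustrative SGD run (assumed example): minimize g(theta) = (1/n) sum_i f_i(theta)
# with f_i(theta) = (theta - xi_i)^2 / 2, which is 1-strongly convex, using
# gamma_t = 1/(mu*t) and Polyak-Ruppert averaging.
import random

def sgd_polyak(xi, steps, mu=1.0, seed=0):
    rng = random.Random(seed)
    theta, running_sum = 0.0, 0.0
    for t in range(1, steps + 1):
        i = rng.randrange(len(xi))      # sampling with replacement
        grad = theta - xi[i]            # f_i'(theta)
        theta -= grad / (mu * t)        # step size gamma_t = 1/(mu t)
        running_sum += theta
    return running_sum / steps          # Polyak-Ruppert average

xi = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
theta_bar = sgd_polyak(xi, 20000)       # close to theta* = mean(xi) = 5.0
```

The averaged iterate approaches θ* = 5.0 at the O(κ/t) rate; each iteration touches a single f_i, so the cost per step is independent of n.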
Stochastic vs. deterministic methods
• Minimizing g(θ) = (1/n) Σ_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λΩ(θ)
• Batch gradient descent: θ_t = θ_{t−1} − γ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) Σ_{i=1}^n ∇f_i(θ_{t−1})
  – Exponential (linear) convergence rate in O(e^{−t/κ}) for strongly convex problems
  – Can be accelerated to O(e^{−t/√κ}) (Nesterov, 1983)
  – Iteration complexity is linear in n
• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t ∇f_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) random element of {1, …, n}
  – Convergence rate in O(κ/t)
  – Iteration complexity is independent of n
Stochastic vs. deterministic methods
• Goal = best of both worlds: linear rate with O(d) iteration cost and a simple choice of step size
[figure: log(excess cost) vs. time for deterministic, stochastic, and new methods]
Recent progress in single machine optimization
• Variance reduction
  – Exponential convergence with O(d) iteration cost
  – SAG (Le Roux, Schmidt, and Bach, 2012)
  – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
  – SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.
  θ_t = θ_{t−1} − γ[∇f_{i(t)}(θ_{t−1}) + (1/n) Σ_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1}]
  (with y_i^t the stored value at time t of the gradient of the i-th function)
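A SAGA-style instance of this update, on an assumed quadratic example (not from the slides): a table y_i of stored gradients is kept, and each step uses the fresh gradient minus the stored one plus the table's mean, so the correction has zero mean and the variance vanishes as the iterates converge, allowing a constant step size.

```python
# Minimal SAGA-style sketch (assumed example f_i(theta) = (theta - xi_i)^2 / 2):
# update theta with grad f_i(theta) - y_i + mean(y), then refresh the stored y_i,
# matching the variance-reduced iteration above.
import random

def saga(xi, steps, gamma=0.1, seed=0):
    rng = random.Random(seed)
    n = len(xi)
    theta = 0.0
    y = [0.0 - x for x in xi]                  # stored gradients f_i'(theta_0) at theta_0 = 0
    y_mean = sum(y) / n
    for _ in range(steps):
        i = rng.randrange(n)
        g = theta - xi[i]                      # fresh gradient of the sampled f_i
        theta -= gamma * (g - y[i] + y_mean)   # variance-reduced step
        y_mean += (g - y[i]) / n               # update the table and its mean
        y[i] = g
    return theta

xi = [1.0, 2.0, 3.0, 4.0]
theta = saga(xi, 2000)   # converges linearly to theta* = mean(xi) = 2.5
```

Unlike plain SGD with a constant step, the iterate converges linearly to θ*, at cost O(d) per iteration plus the gradient table.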
Recent progress in single machine optimization
• Running-time to reach precision ε (with κ = condition number):
  – Stochastic gradient descent: d × κ × 1/ε
  – Gradient descent: d × nκ × log(1/ε)
  – Variance reduction: d × (n + κ) × log(1/ε)
  – Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
  – Matching upper and lower bounds of complexity
Outline
1. Parametric supervised learning on a single machine
  – Machine learning ≈ optimization of finite sums
  – From batch to stochastic gradient methods
  – Linearly-convergent stochastic methods for convex problems
2. Machine learning over networks
  – Centralized and decentralized methods
  – From network averaging to optimization
  – Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)
• Machine learning through optimization: min_{θ∈ℝ^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)
  – f_i(θ): error of the model defined by θ on the dataset indexed by i
  – Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_ij, θ⊤Φ(x_ij)) if node i holds m_i observations
• Each dataset/function f_i only accessible by node i in a graph
  – Massive datasets, multiple machines/cores
  – Communication/legal constraints
• Goal: minimize communication and local computation costs
Distribution in machine learning (and beyond)
• Machine learning through optimization: min_{θ∈ℝ^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)
• Why not simply distribute a simple single machine algorithm?
  – (Accelerated) gradient descent (see, e.g., Nesterov, 2004): θ_t = θ_{t−1} − γ∇g(θ_{t−1})
  – Requires √κ log(1/ε) full gradient computations to reach precision ε
  – Need to perform distributed averaging over a network
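Combining the two ingredients gives the simplest decentralized scheme, sketched below under assumed choices (the 4-node cycle, its weights, and the local functions f_i(θ) = (θ − ξ_i)²/2 are all illustrative): each node takes a gradient step on its own f_i, then a synchronous gossip step with W.

```python
# Hedged sketch of decentralized gradient descent (all example choices assumed):
# local gradient step on f_i(theta_i) = (theta_i - xi_i)^2 / 2, then a gossip
# averaging step with a symmetric doubly stochastic W.
def decentralized_gd(W, xi, steps, gamma=0.1):
    n = len(xi)
    theta = [0.0] * n
    for _ in range(steps):
        theta = [theta[i] - gamma * (theta[i] - xi[i]) for i in range(n)]      # local step
        theta = [sum(W[i][j] * theta[j] for j in range(n)) for i in range(n)]  # gossip step
    return theta

W = [[0.50, 0.25, 0.00, 0.25],
     [0.25, 0.50, 0.25, 0.00],
     [0.00, 0.25, 0.50, 0.25],
     [0.25, 0.00, 0.25, 0.50]]
xi = [1.0, 3.0, 5.0, 7.0]
theta = decentralized_gd(W, xi, 500)
# nodes approach a neighborhood of theta* = mean(xi) = 4.0; with a constant
# step size a residual O(gamma) disagreement remains
```

The residual bias of this plain combination (nodes agree only up to O(γ)) is one motivation for the more refined distributed optimization methods discussed in the rest of the talk.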
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O(eminust(microL)) = O(eminustκ)
(small κ = Lmicro) (large κ = Lmicro)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1) (line search)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
(small κ = Lmicro) (large κ = Lmicro)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
bull Acceleration (Nesterov 1983) second-order recursion
θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)
ndash Good choice of momentum term δt isin [0 1)
Iterative methods for minimizing smooth functions
bull Assumption g convex and L-smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
g(θt)minus g(θlowast) 6 O(1t)
g(θt)minus g(θlowast) 6 O((1minus1κ)t) = O(eminustκ) if micro-strongly convex
bull Acceleration (Nesterov 1983) second-order recursion
θt = ηtminus1 minus γtgprime(ηtminus1) and ηt = θt + δt(θt minus θtminus1)
ndash Good choice of momentum term δt isin [0 1)
g(θt)minus g(θlowast) 6 O(1t2)
g(θt)minus g(θlowast) 6 O((1minus1radicκ)t) = O(eminust
radicκ) if micro-strongly convex
ndash Optimal rates after t = O(d) iterations (Nesterov 2004)
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr O(log log 1ε) iterations
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
bull Key insights for machine learning (Bottou and Bousquet 2008)
1 No need to optimize below statistical error
2 Cost functions are averages
3 Testing error is more important than training error
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
bull Key insights for machine learning (Bottou and Bousquet 2008)
1 No need to optimize below statistical error
2 Cost functions are averages
3 Testing error is more important than training error
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
bull NB single pass leads to bounds on testing error
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)
ndash Iteration complexity is independent of n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
new
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
(with yti stored value at time t of gradient of the i-th function)
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
bull Running-time to reach precision ε (with κ = condition number)
Stochastic gradient descent dtimes∣
∣
∣κ times 1
ε
Gradient descent dtimes∣
∣
∣nκ times log 1
ε
Variance reduction dtimes∣
∣
∣(n+ κ) times log 1
ε
ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ
ndash Matching upper and lower bounds of complexity
Outline
1 Parametric supervised learning on a single machine
ndash Machine learning asymp optimization of finite sums
ndash From batch to stochastic gradient methods
ndash Linearly-convergent stochastic methods for convex problems
2 Machine learning over networks
minus Centralized and decentralized methods
minus From network averaging to optimization
minus Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
ndash Massive datasets multiple machines cores
ndash Communication legal constraints
bull Goal Minimize communication and local computation costs
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
bull Why not simply distributing a simple single machine algorithm
ndash (accelerated) gradient descent (see eg Nesterov 2004)
θt = θtminus1 minus γnablag(θtminus1)
ndash Requiresradicκ log 1
ε full gradient computations to reach precision ε
ndash Need to perform distributed averaging over a network
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
bull Robustness
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
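A minimal simulation of the synchronous iteration θ_t = W θ_{t−1}, assuming a symmetric doubly stochastic W on a 5-node ring (the mixing weights below are an assumption for illustration, not from the slides):

```python
import numpy as np

# symmetric doubly stochastic W for a 5-node ring:
# each node keeps weight 1/2 and gives 1/4 to each neighbor (an assumption)
n = 5
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

xi = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # local observations
theta = xi.copy()
for t in range(200):
    theta = W @ theta                      # theta_t = W theta_{t-1} = W^t xi
# theta converges to mean(xi) * ones at geometric rate (1 - gamma)
```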
Convergence of synchronous gossip
• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
  – Consequence: eigenvalues(W) ∈ [−1, 1]
  – Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
  – γ⁻¹ = mixing time of the associated Markov chain
  – Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)
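The eigengap γ and the resulting iteration count (1/γ) log(1/ε) can be checked numerically; here for an illustrative doubly stochastic W on a 5-node ring (the weights are an assumption):

```python
import numpy as np

def eigengap(W):
    """gamma = lambda_1(W) - lambda_2(W) = 1 - lambda_2(W)
    for a symmetric doubly stochastic W."""
    lam = np.sort(np.linalg.eigvalsh(W))[::-1]   # eigenvalues, descending
    return lam[0] - lam[1]

# illustrative doubly stochastic W on a 5-node ring
n = 5
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

gamma = eigengap(W)                       # here 1 - (1/2 + cos(2*pi/5)/2)
eps = 1e-6
iters = int(np.ceil(np.log(1.0 / eps) / gamma))   # ~ (1/gamma) log(1/eps)
```

On this well-connected little ring γ ≈ 0.345, so a few tens of synchronous iterations reach ε = 10⁻⁶; on poorly connected graphs γ shrinks and the iteration count grows as 1/γ.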
Illustration of synchronous gossip
[figure: synchronous gossip on the example network]
Iterative methods for minimizing smooth functions

• Assumption: g convex and L-smooth on ℝ^d
• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1})
  – g(θ_t) − g(θ*) ≤ O(1/t)
  – g(θ_t) − g(θ*) ≤ O((1 − 1/κ)^t) = O(e^{−t/κ}) if µ-strongly convex
• Acceleration (Nesterov, 1983): second-order recursion
  θ_t = η_{t−1} − γ_t g′(η_{t−1}) and η_t = θ_t + δ_t (θ_t − θ_{t−1})
  – Good choice of momentum term δ_t ∈ [0, 1)
  – g(θ_t) − g(θ*) ≤ O(1/t²)
  – g(θ_t) − g(θ*) ≤ O((1 − 1/√κ)^t) = O(e^{−t/√κ}) if µ-strongly convex
  – Optimal rates after t = O(d) iterations (Nesterov, 2004)
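A minimal sketch of the two recursions above (the quadratic objective and the constant momentum δ = (√κ − 1)/(√κ + 1) are my own illustrative choices): on a strongly convex quadratic with condition number κ = L/µ, the accelerated iterate contracts like (1 − 1/√κ)^t versus (1 − 1/κ)^t for plain gradient descent.

```python
import numpy as np

# g(theta) = 0.5 theta^T A theta with spectrum in [mu, L]; minimizer is 0.
d, L, mu = 50, 100.0, 1.0
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
A = Q @ np.diag(np.linspace(mu, L, d)) @ Q.T

def grad(theta):
    return A @ theta

kappa = L / mu
delta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)  # momentum term delta_t
theta_gd = theta_acc = eta = np.ones(d)

for t in range(200):
    theta_gd = theta_gd - (1 / L) * grad(theta_gd)   # rate (1 - 1/kappa)^t
    new_theta = eta - (1 / L) * grad(eta)            # gradient step at eta
    eta = new_theta + delta * (new_theta - theta_acc)  # momentum extrapolation
    theta_acc = new_theta                            # rate (1 - 1/sqrt(kappa))^t

print(np.linalg.norm(theta_gd), np.linalg.norm(theta_acc))
```

After 200 iterations the accelerated iterate is orders of magnitude closer to the minimizer, which is the gap between e^{−t/κ} and e^{−t/√κ} made concrete.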
Iterative methods for minimizing smooth functions

• Assumption: g convex and smooth on ℝ^d
• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1})
  – O(1/t) convergence rate for convex functions
  – O(e^{−t/κ}) linear rate if strongly convex ⇔ O(κ log(1/ε)) iterations
    ⇔ complexity = O(nd · κ log(1/ε))
• Newton method: θ_t = θ_{t−1} − g″(θ_{t−1})⁻¹ g′(θ_{t−1})
  – O(e^{−ρ2^t}) quadratic rate ⇔ O(log log(1/ε)) iterations
    ⇔ complexity = O((nd² + d³) · log log(1/ε))
• Key insights for machine learning (Bottou and Bousquet, 2008)
  1. No need to optimize below statistical error
  2. Cost functions are averages
  3. Testing error is more important than training error
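The O(e^{−ρ2^t}) quadratic rate means the number of correct digits roughly doubles per Newton step. A one-dimensional sketch (my own toy objective, not from the slides): g(θ) = e^θ − 2θ, minimized at θ* = ln 2.

```python
import math

# Newton's method: theta <- theta - g'(theta) / g''(theta).
# Here g(theta) = exp(theta) - 2*theta, so g' = exp(theta) - 2, g'' = exp(theta).
theta = 2.0
errors = []
for _ in range(6):
    g1 = math.exp(theta) - 2.0
    g2 = math.exp(theta)
    theta -= g1 / g2               # one Newton step
    errors.append(abs(theta - math.log(2.0)))

print(errors)  # the error roughly squares at each step (quadratic rate)
```

Six steps already drive the error to machine precision, illustrating the O(log log(1/ε)) iteration count.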
Stochastic gradient descent (SGD) for finite sums

    min_{θ∈ℝ^d} g(θ) = (1/n) Σ_{i=1}^n f_i(θ)

• Iteration: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) random element of {1, …, n}
  – Polyak–Ruppert averaging: θ̄_t = (1/(t+1)) Σ_{u=0}^t θ_u
• Convergence rate if each f_i is convex and L-smooth and g is µ-strongly convex:
  – E g(θ̄_t) − g(θ*) ≤ O(1/√t) if γ_t = 1/(L√t)
  – E g(θ̄_t) − g(θ*) ≤ O(L/(µt)) = O(κ/t) if γ_t = 1/(µt)
  – No adaptivity to strong convexity in general
  – Running-time complexity: O(d · κ/ε)
• NB: a single pass leads to bounds on testing error
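A minimal sketch of the γ_t = 1/(µt) regime with Polyak–Ruppert averaging (the one-dimensional least-squares objective is my own toy setup): g(θ) = (1/n) Σ_i (θ − ξ_i)², which is 2-strongly convex with minimizer equal to the empirical mean of the ξ_i.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
xi = rng.normal(loc=3.0, size=n)
mu = 2.0                                  # g''(theta) = 2 for this objective

theta, theta_bar = 0.0, 0.0
for t in range(1, 50001):
    i = rng.integers(n)                   # sampling with replacement
    grad_i = 2.0 * (theta - xi[i])        # gradient of f_i(theta) = (theta - xi_i)^2
    theta -= grad_i / (mu * t)            # step size gamma_t = 1/(mu t)
    theta_bar += (theta - theta_bar) / t  # running Polyak-Ruppert average

print(theta_bar, xi.mean())               # both close to the empirical mean
```

With this step size the iterate is exactly a running average of the sampled ξ_i, so the O(κ/t) rate is visible directly; the averaged iterate θ̄_t smooths out the remaining fluctuations.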
Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) Σ_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λΩ(θ)

• Batch gradient descent: θ_t = θ_{t−1} − γ ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) Σ_{i=1}^n ∇f_i(θ_{t−1})
  – Linear ("exponential") convergence rate in O(e^{−t/κ}) for strongly convex problems
  – Can be accelerated to O(e^{−t/√κ}) (Nesterov, 1983)
  – Iteration complexity is linear in n

• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t ∇f_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) random element of {1, …, n}
  – Convergence rate in O(κ/t)
  – Iteration complexity is independent of n

• Goal = best of both worlds: a linear rate with O(d) iteration cost and a
  simple choice of step size

[Figure: log(excess cost) vs. time — deterministic (steady linear decrease),
stochastic (fast start, then slow), new methods (best of both)]
Recent progress in single machine optimization

• Variance reduction: exponential convergence with O(d) iteration cost
  – SAG (Le Roux, Schmidt, and Bach, 2012)
  – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
  – SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

    θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) Σ_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]

  (with y_i^t the stored value, at time t, of the gradient of the i-th function)

• Running time to reach precision ε (with κ = condition number)

    Stochastic gradient descent:  d × κ       × 1/ε
    Gradient descent:             d × nκ      × log(1/ε)
    Variance reduction:           d × (n + κ) × log(1/ε)

  – Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
  – Matching upper and lower bounds of complexity
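A minimal SAGA-style sketch of the variance-reduced update above (the noiseless least-squares instance and the conservative step size γ = 1/(3L) are my own choices): the table of stored gradients y_i plays the role of y_i^{t−1} in the formula, and its mean is maintained incrementally.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true                     # noiseless targets: minimum is w_true

# f_i(theta) = 0.5 * (x_i^T theta - y_i)^2; gradient is x_i (x_i^T theta - y_i).
theta = np.zeros(d)
table = np.array([X[i] * (X[i] @ theta - y[i]) for i in range(n)])  # stored y_i
avg = table.mean(axis=0)           # (1/n) sum_i y_i, kept up to date below
gamma = 1.0 / (3 * np.max(np.sum(X**2, axis=1)))  # 1/(3 L_max), conservative

for t in range(20000):
    i = rng.integers(n)
    g_new = X[i] * (X[i] @ theta - y[i])
    # Variance-reduced direction: grad f_i(theta) - y_i + (1/n) sum_j y_j
    theta -= gamma * (g_new - table[i] + avg)
    avg += (g_new - table[i]) / n  # update the mean of stored gradients
    table[i] = g_new

print(np.linalg.norm(theta - w_true))   # linear convergence toward theta*
```

Unlike plain SGD, the correction term has zero mean and vanishing variance at the optimum, which is what permits a constant step size and the (n + κ) log(1/ε) running time.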
Outline

1. Parametric supervised learning on a single machine
   – Machine learning ≈ optimization of finite sums
   – From batch to stochastic gradient methods
   – Linearly convergent stochastic methods for convex problems

2. Machine learning over networks
   – Centralized and decentralized methods
   – From network averaging to optimization
   – Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)

• Machine learning through optimization:

    min_{θ∈ℝ^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)

  – f_i(θ): error of the model defined by θ on the dataset indexed by i
  – Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_ij, θ⊤Φ(x_ij)) for m_i observations

• Each dataset/function f_i is only accessible by node i in a graph
  – Massive datasets, multiple machines/cores
  – Communication/legal constraints

[Figure: network graph with nodes 1–9]

• Goal: minimize communication and local computation costs

• Why not simply distribute a simple single machine algorithm?
  – (Accelerated) gradient descent (see, e.g., Nesterov, 2004): θ_t = θ_{t−1} − γ ∇g(θ_{t−1})
  – Requires √κ log(1/ε) full gradient computations to reach precision ε
  – Need to perform distributed averaging over a network
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ ℝ,
  compute θ* = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ∈ℝ} (1/n) Σ_{i=1}^n (θ − ξ_i)²

• Centralized algorithms
  – Compute a spanning tree with diameter ≤ 2∆
  – Master/slave algorithm: ∆ communication steps + no error

[Figure: network graph with nodes 1–9 and its spanning tree]

• Application to centralized distributed optimization
  – √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
  – "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)

• Robustness?
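A minimal sketch of the master/slave scheme (the six-node tree below is my own toy topology): an upward pass aggregates partial (sum, count) pairs toward the root, then the root broadcasts the average back down, giving the exact answer in O(∆) communication steps, unlike gossip.

```python
# Spanning tree given as parent pointers; node 0 is the master (root).
parent = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2}
xi = {0: 1.0, 1: 3.0, 2: 5.0, 3: 7.0, 4: 9.0, 5: 11.0}   # local observations

children = {i: [] for i in xi}
for c, p in parent.items():
    children[p].append(c)

def subtree_sum(node):
    # Upward pass: each node sends the (sum, count) of its subtree to its parent.
    s, cnt = xi[node], 1
    for c in children[node]:
        cs, cc = subtree_sum(c)
        s, cnt = s + cs, cnt + cc
    return s, cnt

total, count = subtree_sum(0)
theta_star = total / count   # master then broadcasts theta* down the tree
print(theta_star)            # exact average, no residual error
```

The trade-off flagged by "Robustness?" is that this exactness relies on the tree and the master staying alive, which motivates the decentralized gossip algorithms below.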
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr O(log log 1ε) iterations
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
bull Key insights for machine learning (Bottou and Bousquet 2008)
1 No need to optimize below statistical error
2 Cost functions are averages
3 Testing error is more important than training error
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
bull Key insights for machine learning (Bottou and Bousquet 2008)
1 No need to optimize below statistical error
2 Cost functions are averages
3 Testing error is more important than training error
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
Stochastic gradient descent (SGD) for finite sums

min_{θ∈R^d} g(θ) = (1/n) Σ_{i=1}^n f_i(θ)

• Iteration: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) a random element of {1, …, n}
  – Polyak-Ruppert averaging: θ̄_t = (1/(t+1)) Σ_{u=0}^t θ_u

• Convergence rate if each f_i is convex and L-smooth and g is μ-strongly convex:
  E g(θ̄_t) − g(θ_*) ≤ O(1/√t)  if γ_t = 1/(L√t)
  E g(θ̄_t) − g(θ_*) ≤ O(L/(μt)) = O(κ/t)  if γ_t = 1/(μt)
  – No adaptivity to strong convexity in general
  – Running-time complexity: O(d · κ/ε)

• NB: a single pass leads to bounds on the testing error
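A minimal sketch of this iteration on a synthetic regularized least-squares sum (all data and constants are illustrative). The 1/(μt) schedule is capped at 1/L_max to keep the first few iterates stable, a standard practical variant:

```python
import numpy as np

# Synthetic finite sum (illustrative): f_i(theta) = 1/2 (x_i.theta - y_i)^2 + mu/2 |theta|^2
rng = np.random.default_rng(1)
n, d, mu = 500, 3, 0.5
X = rng.standard_normal((n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)

def grad_i(theta, i):                        # gradient of a single term f_i
    return (X[i] @ theta - y[i]) * X[i] + mu * theta

L_max = np.max(np.sum(X * X, axis=1)) + mu   # per-sample smoothness bound
theta = np.zeros(d)
theta_bar = np.zeros(d)
for t in range(20000):
    i = rng.integers(n)                      # sampling with replacement
    gamma = min(1.0 / L_max, 1.0 / (mu * (t + 1)))   # ~ 1/(mu t) schedule
    theta -= gamma * grad_i(theta, i)
    theta_bar += (theta - theta_bar) / (t + 1)       # Polyak-Ruppert running average

# exact minimizer of the full objective, for comparison
theta_star = np.linalg.solve(X.T @ X / n + mu * np.eye(d), X.T @ y / n)
print(np.linalg.norm(theta_bar - theta_star))
```

The averaged iterate θ̄_t is much less noisy than the last iterate θ_t, which is the point of Polyak-Ruppert averaging.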
Stochastic vs deterministic methods

• Minimizing g(θ) = (1/n) Σ_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λΩ(θ)

• Batch gradient descent: θ_t = θ_{t−1} − γ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) Σ_{i=1}^n ∇f_i(θ_{t−1})
  – Exponential convergence rate in O(e^{−t/κ}) for convex problems
  – Can be accelerated to O(e^{−t/√κ}) (Nesterov, 1983)
  – Iteration complexity is linear in n

• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t ∇f_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) a random element of {1, …, n}
  – Convergence rate in O(κ/t)
  – Iteration complexity is independent of n
Stochastic vs deterministic methods

• Goal = best of both worlds:
  – Linear rate, with O(d) iteration cost
  – Simple choice of step size

[Figure: log(excess cost) vs. time for the deterministic, stochastic, and new methods]
Recent progress in single machine optimization

• Variance reduction
  – Exponential convergence with O(d) iteration cost
  – SAG (Le Roux, Schmidt, and Bach, 2012)
  – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
  – SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

  θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) Σ_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]

  (with y_i^t the stored value at time t of the gradient of the i-th function)

• Running-time to reach precision ε (with κ = condition number):

  Stochastic gradient descent:  d × κ × (1/ε)
  Gradient descent:             d × nκ × log(1/ε)
  Variance reduction:           d × (n + κ) × log(1/ε)

  – Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
  – Matching upper and lower bounds of complexity
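The update above is the SAGA-style rule; here is a compact numpy sketch on the same kind of synthetic regularized least-squares sum (data are illustrative; γ = 1/(3 L_max) follows the standard SAGA recommendation):

```python
import numpy as np

# Synthetic finite sum (illustrative): f_i(theta) = 1/2 (x_i.theta - y_i)^2 + mu/2 |theta|^2
rng = np.random.default_rng(2)
n, d, mu = 200, 3, 0.5
X = rng.standard_normal((n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)

def grad_i(theta, i):
    return (X[i] @ theta - y[i]) * X[i] + mu * theta

table = np.zeros((n, d))               # y_i: stored gradient of each f_i
table_mean = table.mean(axis=0)
L_max = np.max(np.sum(X * X, axis=1)) + mu
gamma = 1.0 / (3 * L_max)              # constant step size

theta = np.zeros(d)
for t in range(20000):
    i = rng.integers(n)
    g = grad_i(theta, i)
    # theta_t = theta_{t-1} - gamma [ grad f_i(theta) + mean(y) - y_i ]
    theta -= gamma * (g + table_mean - table[i])
    table_mean += (g - table[i]) / n   # keep the running mean in sync
    table[i] = g

theta_star = np.linalg.solve(X.T @ X / n + mu * np.eye(d), X.T @ y / n)
print(np.linalg.norm(theta - theta_star))
```

Unlike plain SGD, the last iterate converges linearly with a constant step size: the stored-gradient correction drives the variance of the update to zero at the optimum, at the price of O(nd) memory for the table.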
Outline

1. Parametric supervised learning on a single machine
   – Machine learning ≈ optimization of finite sums
   – From batch to stochastic gradient methods
   – Linearly-convergent stochastic methods for convex problems

2. Machine learning over networks
   – Centralized and decentralized methods
   – From network averaging to optimization
   – Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)

• Machine learning through optimization:
  min_{θ∈R^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)
  – f_i(θ): error of the model defined by θ on the dataset indexed by i
  – Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_ij, θ^⊤Φ(x_ij)) if node i holds m_i observations

• Each dataset/function f_i only accessible by node i in a graph
  [Figure: network of 9 nodes]
  – Massive datasets, multiple machines/cores
  – Communication/legal constraints

• Goal: minimize communication and local computation costs
Distribution in machine learning (and beyond)

• Machine learning through optimization:
  min_{θ∈R^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)
  – f_i(θ): error of the model defined by θ on the dataset indexed by i

• Why not simply distribute a simple single machine algorithm?
  – (Accelerated) gradient descent (see, e.g., Nesterov, 2004): θ_t = θ_{t−1} − γ∇g(θ_{t−1})
  – Requires √κ log(1/ε) full gradient computations to reach precision ε
  – Need to perform distributed averaging over a network
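To make the cost concrete, here is a toy sketch of distributing plain gradient descent (entirely illustrative: nine hypothetical nodes, each holding one quadratic f_i). Every iteration requires one network-wide averaging of the local gradients:

```python
import numpy as np

# Illustrative setup: node i holds f_i(theta) = 1/2 theta.A_i.theta - b_i.theta
rng = np.random.default_rng(5)
n, d = 9, 4
A = [np.eye(d) * (1 + rng.random()) for _ in range(n)]   # local curvatures
b = [rng.standard_normal(d) for _ in range(n)]

def local_grad(i, theta):                 # computed by node i alone
    return A[i] @ theta - b[i]

theta = np.zeros(d)
gamma = 0.4
for _ in range(200):
    grads = [local_grad(i, theta) for i in range(n)]     # in parallel on each node
    g = sum(grads) / n                    # one distributed averaging round
    theta -= gamma * g                    # every node takes the same step

theta_star = np.linalg.solve(sum(A) / n, sum(b) / n)
print(np.linalg.norm(theta - theta_star))
```

The computation parallelizes trivially, but each of the √κ log(1/ε) gradient steps hides an averaging round, which is why the cost of distributed averaging is studied next.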
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ R,
  compute θ_* = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) Σ_{i=1}^n (θ − ξ_i)²

• Centralized algorithms
  – Compute a spanning tree with diameter ≤ 2Δ
  – Master/slave algorithm: Δ communication steps + no error

• Application to centralized distributed optimization
  – √κ log(1/ε) gradient steps and √κ Δ log(1/ε) communication steps
  – "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)

• Robustness
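A minimal sketch of the master/slave idea on an assumed spanning tree (the tree and node numbering are made up): values are summed up the tree to the root, which then broadcasts the exact average back down — about Δ communication steps each way, with no error:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 9
xi = rng.standard_normal(n)                # one observation per node

# Hypothetical spanning tree rooted at node 0 (parent of each node)
parent = [None, 0, 0, 1, 1, 2, 2, 3, 3]
children = [[j for j in range(n) if parent[j] == i] for i in range(n)]

def subtree_sum(i):                        # up-pass: each node forwards its subtree sum
    return xi[i] + sum(subtree_sum(c) for c in children[i])

theta_star = subtree_sum(0) / n            # the root computes the exact average...
estimate = np.full(n, theta_star)          # ...and broadcasts it down the tree

print(np.allclose(estimate, xi.mean()))
```

Exactness after finitely many steps is what decentralized gossip, discussed next, gives up in exchange for robustness to failures and the absence of a coordinating root.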
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ R,
  compute θ_* = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) Σ_{i=1}^n (θ − ξ_i)²

• Decentralized algorithms – gossip (Boyd et al., 2006)
  – Replace θ_i by a weighted average of its neighbors: Σ_{j=1}^n W_ij θ_j
  – Potential asynchrony, changing network

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = Wθ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = Wθ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
  – Consequence: eigenvalues(W) ∈ [−1, 1]
  – Eigengap: γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
  – γ^{−1} = mixing time of the associated Markov chain
  [Figure: network of 9 nodes]
  – Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)
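A sketch of synchronous gossip on an assumed 9-node cycle with a lazy random-walk matrix W (this particular W is illustrative; any symmetric doubly stochastic matrix respecting the graph works). The number of iterations to reach precision ε scales with the inverse eigengap:

```python
import numpy as np

n = 9
xi = np.random.default_rng(4).standard_normal(n)   # one value per node

# Lazy random walk on a cycle: symmetric, doubly stochastic, eigenvalues in [0, 1]
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

eigs = np.sort(np.linalg.eigvalsh(W))[::-1]
gap = eigs[0] - eigs[1]                 # eigengap gamma = 1 - lambda_2(W)

theta = xi.copy()
T = int(np.ceil(np.log(1e6) / gap))     # ~ (1/gamma) log(1/eps) iterations
for _ in range(T):
    theta = W @ theta                   # each node averages with its neighbors

print(np.max(np.abs(theta - xi.mean())))  # every node is close to the global mean
```

Poorly connected graphs (e.g., long cycles or chains) have a small eigengap, hence a long mixing time and slow averaging; well-connected graphs average in a few rounds.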
Illustration of synchronous gossip

[Figure: illustration of gossip iterations on a network]
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr O(κ log 1ε) iterations
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr O(log log 1ε) iterations
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
bull Key insights for machine learning (Bottou and Bousquet 2008)
1 No need to optimize below statistical error
2 Cost functions are averages
3 Testing error is more important than training error
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
bull Key insights for machine learning (Bottou and Bousquet 2008)
1 No need to optimize below statistical error
2 Cost functions are averages
3 Testing error is more important than training error
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
bull NB single pass leads to bounds on testing error
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)
ndash Iteration complexity is independent of n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
new
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
(with yti stored value at time t of gradient of the i-th function)
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
bull Running-time to reach precision ε (with κ = condition number)
Stochastic gradient descent dtimes∣
∣
∣κ times 1
ε
Gradient descent dtimes∣
∣
∣nκ times log 1
ε
Variance reduction dtimes∣
∣
∣(n+ κ) times log 1
ε
ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ
ndash Matching upper and lower bounds of complexity
Outline
1 Parametric supervised learning on a single machine
ndash Machine learning asymp optimization of finite sums
ndash From batch to stochastic gradient methods
ndash Linearly-convergent stochastic methods for convex problems
2 Machine learning over networks
minus Centralized and decentralized methods
minus From network averaging to optimization
minus Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
ndash Massive datasets multiple machines cores
ndash Communication legal constraints
bull Goal Minimize communication and local computation costs
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
bull Why not simply distributing a simple single machine algorithm
ndash (accelerated) gradient descent (see eg Nesterov 2004)
θt = θtminus1 minus γnablag(θtminus1)
ndash Requiresradicκ log 1
ε full gradient computations to reach precision ε
ndash Need to perform distributed averaging over a network
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
bull Robustness
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
ndash Consequence Eigenvalues(W ) isin [minus1 1]
ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )
ndash γminus1 = mixing time of the associated Markov chain
1 3
24
57
6 9
8
ndash Need 1γ log
1ε iterations to reach precision ε (for classical averaging)
Illustration of synchronous gossip
![Page 38: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/38.jpg)
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
bull Key insights for machine learning (Bottou and Bousquet 2008)
1 No need to optimize below statistical error
2 Cost functions are averages
3 Testing error is more important than training error
Iterative methods for minimizing smooth functions
bull Assumption g convex and smooth on Rd
bull Gradient descent θt = θtminus1 minus γt gprime(θtminus1)
ndash O(1t) convergence rate for convex functions
ndash O(eminustκ) linear if strongly-convex hArr complexity = O(nd middot κ log 1ε)
bull Newton method θt = θtminus1 minus gprimeprime(θtminus1)minus1gprime(θtminus1)
ndash O(
eminusρ2t)
quadratic rate hArr complexity = O((nd2 + d3) middot log log 1ε)
bull Key insights for machine learning (Bottou and Bousquet 2008)
1 No need to optimize below statistical error
2 Cost functions are averages
3 Testing error is more important than training error
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
bull NB single pass leads to bounds on testing error
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)
ndash Iteration complexity is independent of n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
new
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
(with yti stored value at time t of gradient of the i-th function)
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
bull Running-time to reach precision ε (with κ = condition number)
Stochastic gradient descent dtimes∣
∣
∣κ times 1
ε
Gradient descent dtimes∣
∣
∣nκ times log 1
ε
Variance reduction dtimes∣
∣
∣(n+ κ) times log 1
ε
ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ
ndash Matching upper and lower bounds of complexity
Outline
1 Parametric supervised learning on a single machine
ndash Machine learning asymp optimization of finite sums
ndash From batch to stochastic gradient methods
ndash Linearly-convergent stochastic methods for convex problems
2 Machine learning over networks
minus Centralized and decentralized methods
minus From network averaging to optimization
minus Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
ndash Massive datasets multiple machines cores
ndash Communication legal constraints
bull Goal Minimize communication and local computation costs
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
bull Why not simply distributing a simple single machine algorithm
ndash (accelerated) gradient descent (see eg Nesterov 2004)
θt = θtminus1 minus γnablag(θtminus1)
ndash Requiresradicκ log 1
ε full gradient computations to reach precision ε
ndash Need to perform distributed averaging over a network
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ R,

  compute θ* = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) Σ_{i=1}^n (θ − ξ_i)²

• Centralized algorithms

– Compute a spanning tree with diameter ≤ 2∆
– Master/slave algorithm: ∆ communication steps + no error

• Application to centralized distributed optimization

– √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
– "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)

• Robustness
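The master/slave scheme can be sketched as follows — a hypothetical 9-node spanning tree where partial sums travel up to the root and the exact average is broadcast back down, in at most 2∆ communication steps with no error:

```python
# Hypothetical spanning tree on 9 nodes (root = node 0); each node holds one
# observation ξ_i. Upward pass: partial sums; downward pass: broadcast mean.
children = {0: [1, 2], 1: [3, 4], 2: [5, 6], 3: [7, 8],
            4: [], 5: [], 6: [], 7: [], 8: []}
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]   # ξ_1 … ξ_9

def subtree_sum(node):
    """Return (sum, count) of the subtree rooted at `node` (upward pass)."""
    s, c = values[node], 1
    for ch in children[node]:
        cs, cc = subtree_sum(ch)
        s, c = s + cs, c + cc
    return s, c

total, count = subtree_sum(0)     # root aggregates in ≤ ∆ steps
average = total / count           # exact mean; broadcast down in ≤ ∆ steps
```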
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ R,

  compute θ* = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) Σ_{i=1}^n (θ − ξ_i)²

• Decentralized algorithms – gossip (Boyd et al., 2006)

– Replace θ_i by a weighted average of its neighbors: Σ_{j=1}^n W_ij θ_j
– Potential asynchrony, changing network

• Synchronous gossip (all nodes simultaneously)

– Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
– Typical assumption: W symmetric doubly stochastic matrix
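A minimal simulation of the synchronous iteration θ_t = Wθ_{t−1}, assuming a hypothetical ring of 9 nodes with weight 1/3 on self and on each neighbor (a symmetric doubly stochastic W):

```python
import numpy as np

n = 9
W = np.zeros((n, n))              # symmetric doubly stochastic gossip matrix
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0

xi = np.arange(1.0, n + 1.0)      # one observation per node
theta = xi.copy()
for _ in range(500):
    theta = W @ theta             # every node averages itself and its neighbors
# all coordinates converge to the network-wide mean of ξ
```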
Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)

– Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
– Typical assumption: W symmetric doubly stochastic matrix
– Consequence: eigenvalues(W) ∈ [−1, 1]
– Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
– γ^{−1} = mixing time of the associated Markov chain
– Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)
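Using the same hypothetical ring of 9 nodes as before, one can check numerically that the per-iteration contraction toward consensus is governed by λ_2(W), i.e. by the eigengap γ = 1 − λ_2(W):

```python
import numpy as np

n = 9
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0

eigs = np.sort(np.linalg.eigvalsh(W))[::-1]   # eigs[0] == 1 (doubly stochastic)
gamma = eigs[0] - eigs[1]                     # eigengap

xi = np.arange(1.0, n + 1.0)
dist = lambda th: np.linalg.norm(th - xi.mean())
theta = xi.copy()
for _ in range(50):
    theta = W @ theta
rate = (dist(theta) / dist(xi)) ** (1 / 50)   # empirical contraction factor
# rate ≈ eigs[1] = 1 − gamma, hence ~ (1/gamma) log(1/ε) iterations overall
```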
Illustration of synchronous gossip

[figure: gossip iterations on the example network]
Iterative methods for minimizing smooth functions

• Assumption: g convex and smooth on R^d

• Gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1})

– O(1/t) convergence rate for convex functions
– O(e^{−t/κ}) linear rate if strongly convex ⇔ complexity = O(nd · κ log(1/ε))

• Newton method: θ_t = θ_{t−1} − g″(θ_{t−1})^{−1} g′(θ_{t−1})

– O(e^{−ρ 2^t}) quadratic rate ⇔ complexity = O((nd² + d³) · log log(1/ε))

• Key insights for machine learning (Bottou and Bousquet, 2008)

1. No need to optimize below statistical error
2. Cost functions are averages
3. Testing error is more important than training error
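A sketch of Newton's quadratic (doubly exponential) rate on a hypothetical smooth, strongly convex 1-D function g(θ) = log(1 + e^θ) + θ²/2, chosen only for illustration:

```python
import math

g1 = lambda t: 1.0 / (1.0 + math.exp(-t)) + t                 # g'(θ)
g2 = lambda t: math.exp(-t) / (1.0 + math.exp(-t)) ** 2 + 1.0  # g''(θ) ≥ 1

theta = 2.0
for _ in range(6):
    theta -= g1(theta) / g2(theta)    # Newton step: θ ← θ − g''(θ)⁻¹ g'(θ)
# a handful of steps suffice: the error is roughly squared at every iteration
```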
Stochastic gradient descent (SGD) for finite sums

  min_{θ∈R^d} g(θ) = (1/n) Σ_{i=1}^n f_i(θ)

• Iteration: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})

– Sampling with replacement: i(t) random element of {1, …, n}
– Polyak–Ruppert averaging: θ̄_t = (1/(t+1)) Σ_{u=0}^t θ_u

• Convergence rate if each f_i is convex and L-smooth, and g is µ-strongly convex:

  E g(θ̄_t) − g(θ*) ≤ O(1/√t)             if γ_t = 1/(L√t)
  E g(θ̄_t) − g(θ*) ≤ O(L/(µt)) = O(κ/t)  if γ_t = 1/(µt)

– No adaptivity to strong convexity in general
– Running-time complexity: O(d · κ/ε)

• NB: a single pass leads to bounds on testing error
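A minimal sketch on a toy least-squares problem; the synthetic data, the assumed strong-convexity constant µ, and the shifted step size γ_t = 1/(µ(t + t₀)) (for early-iteration stability) are assumptions beyond the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 3
X = rng.standard_normal((n, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.01 * rng.standard_normal(n)

mu, t0 = 1.0, 100                       # assumed strong-convexity constant, shift
theta = np.zeros(d)
avg = np.zeros(d)
for t in range(1, 20 * n + 1):
    i = rng.integers(n)                 # sampling with replacement
    g_i = X[i] * (X[i] @ theta - y[i])  # ∇f_i(θ) for squared loss
    theta -= g_i / (mu * (t + t0))      # step size γ_t ∝ 1/(µ t)
    avg += (theta - avg) / t            # Polyak–Ruppert running average
```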
Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) Σ_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, h(x_i, θ)) + λΩ(θ)

• Batch gradient descent: θ_t = θ_{t−1} − γ ∇g(θ_{t−1}) = θ_{t−1} − (γ/n) Σ_{i=1}^n ∇f_i(θ_{t−1})

– Linear ("exponential") convergence rate in O(e^{−t/κ}) for strongly convex problems
– Can be accelerated to O(e^{−t/√κ}) (Nesterov, 1983)
– Iteration complexity is linear in n

• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t ∇f_{i(t)}(θ_{t−1})

– Sampling with replacement: i(t) random element of {1, …, n}
– Convergence rate in O(κ/t)
– Iteration complexity is independent of n
Stochastic vs. deterministic methods

• Goal = best of both worlds: linear rate with O(d) iteration cost and a simple choice of step size

[figure: log(excess cost) vs. time — deterministic (linear rate, costly iterations), stochastic (cheap iterations, slow tail), new methods (both)]
Recent progress in single machine optimization

• Variance reduction

– Exponential convergence with O(d) iteration cost
– SAG (Le Roux, Schmidt, and Bach, 2012)
– SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
– SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

  θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) Σ_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]

  (with y_i^t the value of the gradient of the i-th function stored at time t)
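A sketch of the displayed update in SAGA style, keeping a table of stored gradients y_i, on a toy ridge-regression problem; the data, regularization λ, and step size γ are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
lam = 0.1

def grad_i(i, th):   # ∇f_i for f_i(θ) = ½(x_iᵀθ − y_i)² + (λ/2)‖θ‖²
    return X[i] * (X[i] @ th - y[i]) + lam * th

theta = np.zeros(d)
table = np.array([grad_i(i, theta) for i in range(n)])   # stored gradients y_i
table_mean = table.mean(axis=0)
gamma = 0.01
for _ in range(40 * n):
    i = rng.integers(n)
    g = grad_i(i, theta)
    theta -= gamma * (g - table[i] + table_mean)   # variance-reduced step
    table_mean += (g - table[i]) / n               # maintain (1/n) Σ y_i
    table[i] = g                                   # refresh stored gradient
```

Unlike plain SGD, a constant step size can be used here because the correction term drives the gradient-estimate variance to zero at the optimum.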
Recent progress in single machine optimization

• Running time to reach precision ε (with κ = condition number)

  Stochastic gradient descent:  d × κ × 1/ε
  Gradient descent:             d × nκ × log(1/ε)
  Variance reduction:           d × (n + κ) × log(1/ε)

– Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
– Matching upper and lower bounds of complexity
Outline

1. Parametric supervised learning on a single machine

– Machine learning ≈ optimization of finite sums
– From batch to stochastic gradient methods
– Linearly-convergent stochastic methods for convex problems

2. Machine learning over networks

– Centralized and decentralized methods
– From network averaging to optimization
– Distributing the fastest single machine algorithms
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
bull NB single pass leads to bounds on testing error
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)
ndash Iteration complexity is independent of n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
Stochastic vs. deterministic methods

• Goal = best of both worlds: linear rate with O(d) iteration cost, and a simple choice of step size

[Figure: log(excess cost) vs. time — the deterministic method has a fast linear rate but expensive iterations, the stochastic method has cheap iterations but a slow rate, and the new methods combine the linear rate with the O(d) iteration cost.]
Recent progress in single machine optimization

• Variance reduction
– Exponential convergence with O(d) iteration cost
– SAG (Le Roux, Schmidt, and Bach, 2012)
– SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
– SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) ∑_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]

(with y_i^t the stored value at time t of the gradient of the i-th function)

• Running time to reach precision ε (with κ = condition number):

Stochastic gradient descent:  d × κ × (1/ε)
Gradient descent:             d × nκ × log(1/ε)
Variance reduction:           d × (n + κ) × log(1/ε)

– Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
– Matching upper and lower bounds of complexity
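A minimal SAGA-style sketch of the update above, assuming ridge least-squares losses f_i; the 1/(3L) step size and the zero initialization of the stored gradients are choices of this sketch, not from the slides:

```python
import numpy as np

def saga(X, y, lam=0.1, n_epochs=50, seed=0):
    """SAGA-style variance reduction for
    f_i(theta) = (x_i^T theta - y_i)^2 / 2 + (lam/2) ||theta||^2.
    Update: theta <- theta - gamma [ grad_i(theta) - stored_i + mean of stored ],
    refreshing one stored gradient per step (O(d) cost, linear convergence)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    L = np.max(np.sum(X**2, axis=1)) + lam  # per-function smoothness bound
    gamma = 1.0 / (3.0 * L)                 # a standard safe SAGA step size
    theta = np.zeros(d)
    stored = np.zeros((n, d))               # y_i: last gradient computed for f_i
    g_avg = np.zeros(d)                     # running mean of the stored gradients
    for i in rng.integers(n, size=n_epochs * n):
        g_new = (X[i] @ theta - y[i]) * X[i] + lam * theta
        theta -= gamma * (g_new - stored[i] + g_avg)
        g_avg += (g_new - stored[i]) / n    # keep the mean up to date in O(d)
        stored[i] = g_new
    return theta

rng = np.random.default_rng(1)
X = rng.standard_normal((150, 5))
y = rng.standard_normal(150)
theta_saga = saga(X, y)
theta_star = np.linalg.solve(X.T @ X / 150 + 0.1 * np.eye(5), X.T @ y / 150)
```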
Outline

1. Parametric supervised learning on a single machine
– Machine learning ≈ optimization of finite sums
– From batch to stochastic gradient methods
– Linearly convergent stochastic methods for convex problems

2. Machine learning over networks
– Centralized and decentralized methods
– From network averaging to optimization
– Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)

• Machine learning through optimization: min_{θ∈R^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)
– f_i(θ): error of the model defined by θ on the dataset indexed by i
– Example: f_i(θ) = (1/m_i) ∑_{j=1}^{m_i} ℓ(y_ij, θ⊤Φ(x_ij)) for m_i observations

• Each dataset/function f_i is only accessible by node i in a graph
[Figure: communication graph over nodes 1–9]
– Massive datasets, multiple machines/cores
– Communication and legal constraints
• Goal: minimize communication and local computation costs

• Why not simply distribute a simple single machine algorithm?
– (Accelerated) gradient descent (see, e.g., Nesterov, 2004): θ_t = θ_{t−1} − γ ∇g(θ_{t−1})
– Requires √κ log(1/ε) full gradient computations to reach precision ε
– Needs distributed averaging over the network at every step
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ R, compute
θ∗ = (1/n) ∑_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) ∑_{i=1}^n (θ − ξ_i)²

• Centralized algorithms
– Compute a spanning tree with diameter ≤ 2∆
– Master/slave algorithm: ∆ communication steps + no error
[Figure: communication graph over nodes 1–9 and one of its spanning trees]

• Application to centralized distributed optimization
– √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
– "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulie, 2017)

• Robustness?
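The master/slave scheme can be sketched as two sweeps over a spanning tree: an "up" sweep that aggregates partial sums toward the root and a "down" sweep that broadcasts the mean back, for roughly 2∆ communication rounds and an exact answer. The tree and values below are assumed for illustration (not the slides' 9-node graph):

```python
# Master/slave averaging on a given spanning tree rooted at node 0.
# The tree below is an assumed example: depth 2, six nodes.
children = {0: [1, 2], 1: [3, 4], 2: [5], 3: [], 4: [], 5: []}
xi = {0: 4.0, 1: 8.0, 2: 1.0, 3: 3.0, 4: 2.0, 5: 6.0}

def subtree_sum(node):
    """Up sweep: each node sends (sum, count) of its subtree to its parent."""
    s, c = xi[node], 1
    for ch in children[node]:
        cs, cc = subtree_sum(ch)
        s, c = s + cs, c + cc
    return s, c

def broadcast(node, value, result):
    """Down sweep: the root's mean is propagated to every node."""
    result[node] = value
    for ch in children[node]:
        broadcast(ch, value, result)

total, count = subtree_sum(0)   # root = master
mean = total / count            # exact average, no error
estimates = {}
broadcast(0, mean, estimates)
```

Exactness is what makes the scheme attractive for distributed optimization, but it relies on the root and the tree staying up, hence the robustness question above.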
Classical algorithms for distributed averaging

• Decentralized algorithms – gossip (Boyd et al., 2006)
– Replace θ_i by a weighted average of its neighbors: ∑_{j=1}^n W_ij θ_j
– Potential asynchrony, changing network
[Figure: communication graph over nodes 1–9]

• Synchronous gossip (all nodes simultaneously)
– Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
– Typical assumption: W symmetric, doubly stochastic matrix
Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)
– Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
– Typical assumption: W symmetric, doubly stochastic matrix
– Consequence: eigenvalues(W) ∈ [−1, 1]
– Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
– γ^{−1} = mixing time of the associated Markov chain
– Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)
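The synchronous gossip iteration θ_t = Wθ_{t−1} can be simulated directly; the ring network and weights below are assumptions of this sketch (the slides use a different 9-node graph):

```python
import numpy as np

# Synchronous gossip on a ring of n nodes: theta_t = W theta_{t-1}, with W a
# symmetric doubly stochastic matrix (here: 1/2 self-weight, 1/4 to each ring
# neighbour). All nodes converge to the average of the initial values, at a
# geometric rate governed by the eigengap gamma = 1 - lambda_2(W).
n = 8
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

rng = np.random.default_rng(0)
xi = rng.standard_normal(n)
mean = xi.mean()

eigs = np.sort(np.linalg.eigvalsh(W))[::-1]
gamma = eigs[0] - eigs[1]        # eigengap: lambda_1 - lambda_2 = 1 - lambda_2

theta = xi.copy()
for t in range(200):
    theta = W @ theta            # each node averages with its neighbours
err = np.abs(theta - mean).max()
```

For this ring, γ = 1/2 − (1/2)cos(2π/n), so poorly connected (large) rings mix slowly, consistent with the (1/γ) log(1/ε) iteration count above.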
Illustration of synchronous gossip

[Figure: successive gossip iterates on the example network; not recoverable from the transcript.]
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
bull NB single pass leads to bounds on testing error
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)
ndash Iteration complexity is independent of n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
new
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
(with yti stored value at time t of gradient of the i-th function)
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
bull Running-time to reach precision ε (with κ = condition number)
Stochastic gradient descent dtimes∣
∣
∣κ times 1
ε
Gradient descent dtimes∣
∣
∣nκ times log 1
ε
Variance reduction dtimes∣
∣
∣(n+ κ) times log 1
ε
ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ
ndash Matching upper and lower bounds of complexity
Outline
1 Parametric supervised learning on a single machine
ndash Machine learning asymp optimization of finite sums
ndash From batch to stochastic gradient methods
ndash Linearly-convergent stochastic methods for convex problems
2 Machine learning over networks
minus Centralized and decentralized methods
minus From network averaging to optimization
minus Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
ndash Massive datasets multiple machines cores
ndash Communication legal constraints
bull Goal Minimize communication and local computation costs
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
bull Why not simply distributing a simple single machine algorithm
ndash (accelerated) gradient descent (see eg Nesterov 2004)
θt = θtminus1 minus γnablag(θtminus1)
ndash Requiresradicκ log 1
ε full gradient computations to reach precision ε
ndash Need to perform distributed averaging over a network
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
bull Robustness
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
ndash Consequence Eigenvalues(W ) isin [minus1 1]
ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )
ndash γminus1 = mixing time of the associated Markov chain
1 3
24
57
6 9
8
ndash Need 1γ log
1ε iterations to reach precision ε (for classical averaging)
Illustration of synchronous gossip
![Page 43: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/43.jpg)
Stochastic gradient descent (SGD) for finite sums
minθisinRd
g(θ) =1
n
nsum
i=1
fi(θ)
bull Iteration θt = θtminus1 minus γtfprimei(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Polyak-Ruppert averaging θt =
1t+1
sumtu=0 θu
bull Convergence rate if each fi is convex L-smooth and g micro-strongly-
convex
Eg(θt)minus g(θlowast) 6
O(1radict) if γt = 1(L
radict)
O(L(microt)) = O(κt) if γt = 1(microt)
ndash No adaptivity to strong-convexity in general
ndash Running-time complexity O(d middot κε)
bull NB single pass leads to bounds on testing error
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)
ndash Iteration complexity is independent of n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
new
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
(with yti stored value at time t of gradient of the i-th function)
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
bull Running-time to reach precision ε (with κ = condition number)
Stochastic gradient descent dtimes∣
∣
∣κ times 1
ε
Gradient descent dtimes∣
∣
∣nκ times log 1
ε
Variance reduction dtimes∣
∣
∣(n+ κ) times log 1
ε
ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ
ndash Matching upper and lower bounds of complexity
Outline
1 Parametric supervised learning on a single machine
ndash Machine learning asymp optimization of finite sums
ndash From batch to stochastic gradient methods
ndash Linearly-convergent stochastic methods for convex problems
2 Machine learning over networks
minus Centralized and decentralized methods
minus From network averaging to optimization
minus Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
ndash Massive datasets multiple machines cores
ndash Communication legal constraints
bull Goal Minimize communication and local computation costs
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
bull Why not simply distributing a simple single machine algorithm
ndash (accelerated) gradient descent (see eg Nesterov 2004)
θt = θtminus1 minus γnablag(θtminus1)
ndash Requiresradicκ log 1
ε full gradient computations to reach precision ε
ndash Need to perform distributed averaging over a network
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
bull Robustness
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
ndash Consequence Eigenvalues(W ) isin [minus1 1]
ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )
ndash γminus1 = mixing time of the associated Markov chain
1 3
24
57
6 9
8
ndash Need 1γ log
1ε iterations to reach precision ε (for classical averaging)
Illustration of synchronous gossip
![Page 44: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/44.jpg)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)
ndash Iteration complexity is independent of n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
new
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
Recent progress in single machine optimization

• Variance reduction
  – Exponential convergence with O(d) iteration cost
  – SAG (Le Roux, Schmidt, and Bach, 2012)
  – SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
  – SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

  θt = θt−1 − γ [ ∇f_{i(t)}(θt−1) + (1/n) ∑_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]

  (with y_i^t the stored value at time t of the gradient of the i-th function)
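The update above can be sketched as follows. This is an illustrative SAGA-style implementation on synthetic least-squares data; the problem, step size, and iteration count are assumptions for the sketch, not from the slides:

```python
import numpy as np

# SAGA-style variance reduction on an illustrative least-squares finite sum
rng = np.random.default_rng(1)
n, d = 100, 5
X = rng.standard_normal((n, d))
theta_true = rng.standard_normal(d)
y = X @ theta_true

def grad_i(theta, i):
    # gradient of f_i(theta) = (x_i.theta - y_i)^2
    return 2.0 * X[i] * (X[i] @ theta - y[i])

theta = np.zeros(d)
stored = np.zeros((n, d))            # y_i: last stored gradient of each f_i
stored_mean = stored.mean(axis=0)    # (1/n) sum_i y_i, maintained in O(d)
gamma = 0.005                        # constant step size (hypothetical choice)

for t in range(5000):
    i = rng.integers(n)
    g = grad_i(theta, i)
    # variance-reduced, unbiased gradient estimate, as in the slide's update
    theta -= gamma * (g - stored[i] + stored_mean)
    # update the stored gradient and its running mean in O(d)
    stored_mean += (g - stored[i]) / n
    stored[i] = g

print("SAGA error:", np.linalg.norm(theta - theta_true))
```

Note that a constant step size suffices: as θ approaches the optimum, the stored gradients approach the true per-function gradients, so the variance of the estimate vanishes and the iterate converges linearly.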
• Running-time to reach precision ε (with κ = condition number)

  Stochastic gradient descent:  d × κ × 1/ε
  Gradient descent:             d × nκ × log(1/ε)
  Variance reduction:           d × (n + κ) × log(1/ε)

  – Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
  – Matching upper and lower bounds of complexity
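Plugging illustrative values into the table makes the comparison concrete. The numbers below are arbitrary, deliberately chosen with κ > n so that the accelerated bound n + √(nκ) improves on n + κ (for κ ≤ n it does not):

```python
import math

# Illustrative running times from the table above (arbitrary problem sizes)
d, n, kappa, eps = 100, 10**6, 10**8, 1e-6
log_term = math.log(1 / eps)

sgd      = d * kappa / eps                            # d · κ · 1/ε
gd       = d * n * kappa * log_term                   # d · nκ · log(1/ε)
vr       = d * (n + kappa) * log_term                 # d · (n + κ) · log(1/ε)
vr_accel = d * (n + math.sqrt(n * kappa)) * log_term  # d · (n + √(nκ)) · log(1/ε)

print(f"SGD        {sgd:.1e}")
print(f"GD         {gd:.1e}")
print(f"VR         {vr:.1e}")
print(f"VR (accel) {vr_accel:.1e}")
```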
Outline

1. Parametric supervised learning on a single machine
   – Machine learning ≈ optimization of finite sums
   – From batch to stochastic gradient methods
   – Linearly-convergent stochastic methods for convex problems

2. Machine learning over networks
   – Centralized and decentralized methods
   – From network averaging to optimization
   – Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)

• Machine learning through optimization:
  min_{θ∈R^d} (1/n) ∑_{i=1}^n f_i(θ) = g(θ)
  – f_i(θ): error of the model defined by θ on the dataset indexed by i
  – Example: f_i(θ) = (1/m_i) ∑_{j=1}^{m_i} ℓ(y_ij, θ⊤Φ(x_ij)) for m_i local observations

• Each dataset/function f_i only accessible by node i in a graph
  – Massive datasets, multiple machines/cores
  – Communication/legal constraints

[Figure: network graph over nodes 1–9]

• Goal: minimize communication and local computation costs

• Why not simply distribute a simple single-machine algorithm?
  – (Accelerated) gradient descent (see, e.g., Nesterov, 2004): θt = θt−1 − γ∇g(θt−1)
  – Requires √κ log(1/ε) full gradient computations to reach precision ε
  – Need to perform distributed averaging over a network
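The distributed finite-sum objective can be sketched as follows, with hypothetical local datasets (x_ij, y_ij) held by each node (all names and data below are illustrative). Each f_i is computable locally, but the global g requires gathering values from every node:

```python
import numpy as np

# Sketch: node i holds m_i observations (x_ij, y_ij); squared loss, identity features
rng = np.random.default_rng(2)
d, n_nodes = 3, 4
Phi = lambda x: x                    # feature map (identity in this sketch)

local_data = []
for i in range(n_nodes):
    m_i = int(rng.integers(20, 40))  # node-dependent dataset size
    X_i = rng.standard_normal((m_i, d))
    y_i = rng.standard_normal(m_i)
    local_data.append((X_i, y_i))

def f_i(theta, i):
    # local average loss at node i, computable without any communication
    X_i, y_i = local_data[i]
    return np.mean((y_i - Phi(X_i) @ theta) ** 2)

def g(theta):
    # global objective: average of the n local functions;
    # evaluating it requires communication with every node
    return np.mean([f_i(theta, i) for i in range(n_nodes)])

print("g(0) =", g(np.zeros(d)))
```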
Classical algorithms for distributed averaging

• Goal: given n observations ξ1, …, ξn ∈ R, compute
  θ∗ = (1/n) ∑_{i=1}^n ξi = argmin_{θ∈R} (1/n) ∑_{i=1}^n (θ − ξi)²

• Centralized algorithms
  – Compute a spanning tree with diameter ⩽ 2∆
  – Master/slave algorithm: ∆ communication steps + no error

[Figure: network graph over nodes 1–9 and a spanning tree rooted at the master]

• Application to centralized distributed optimization
  – √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
  – "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)

• Robustness?

• Decentralized algorithms – gossip (Boyd et al., 2006)
  – Replace θi by a weighted average of its neighbors: ∑_{j=1}^n Wij θj
  – Potential asynchrony, changing network

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θt = Wθt−1 = W^t θ0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix

Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θt = Wθt−1 = W^t θ0 = W^t ξ
  – Typical assumption: W symmetric doubly stochastic matrix
  – Consequence: eigenvalues(W) ∈ [−1, 1]
  – Eigengap: γ = λ1(W) − λ2(W) = 1 − λ2(W)
  – γ^{−1} = mixing time of the associated Markov chain
  – Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)

Illustration of synchronous gossip
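A minimal simulation of synchronous gossip, assuming a cycle graph and one standard choice of symmetric doubly stochastic W (keep weight 1/2, send 1/4 to each neighbor); the graph and weights are illustrative assumptions:

```python
import numpy as np

# Synchronous gossip theta_t = W theta_{t-1} on a cycle of n nodes
n = 20
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25   # left neighbor on the cycle
    W[i, (i + 1) % n] = 0.25   # right neighbor on the cycle

rng = np.random.default_rng(3)
xi = rng.standard_normal(n)    # one observation per node
target = xi.mean()             # consensus value theta_*

eig = np.sort(np.linalg.eigvalsh(W))[::-1]
gamma = eig[0] - eig[1]        # eigengap = 1 - lambda_2(W)

theta, t = xi.copy(), 0
while np.max(np.abs(theta - target)) > 1e-6 and t < 10_000:
    theta = W @ theta          # every node averages with its neighbors
    t += 1

print(f"eigengap {gamma:.4f}, converged in {t} iterations")
```

The observed iteration count is on the order of (1/γ) log(1/ε), as stated on the slide: on poorly connected graphs such as this cycle the eigengap is small and gossip mixes slowly.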
![Page 45: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/45.jpg)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)
ndash Iteration complexity is independent of n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
new
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
(with yti stored value at time t of gradient of the i-th function)
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
bull Running-time to reach precision ε (with κ = condition number)
Stochastic gradient descent dtimes∣
∣
∣κ times 1
ε
Gradient descent dtimes∣
∣
∣nκ times log 1
ε
Variance reduction dtimes∣
∣
∣(n+ κ) times log 1
ε
ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ
ndash Matching upper and lower bounds of complexity
Outline
1 Parametric supervised learning on a single machine
ndash Machine learning asymp optimization of finite sums
ndash From batch to stochastic gradient methods
ndash Linearly-convergent stochastic methods for convex problems
2 Machine learning over networks
minus Centralized and decentralized methods
minus From network averaging to optimization
minus Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
ndash Massive datasets multiple machines cores
ndash Communication legal constraints
bull Goal Minimize communication and local computation costs
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
bull Why not simply distributing a simple single machine algorithm
ndash (accelerated) gradient descent (see eg Nesterov 2004)
θt = θtminus1 minus γnablag(θtminus1)
ndash Requiresradicκ log 1
ε full gradient computations to reach precision ε
ndash Need to perform distributed averaging over a network
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
bull Robustness
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
ndash Consequence Eigenvalues(W ) isin [minus1 1]
ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )
ndash γminus1 = mixing time of the associated Markov chain
1 3
24
57
6 9
8
ndash Need 1γ log
1ε iterations to reach precision ε (for classical averaging)
Illustration of synchronous gossip
![Page 46: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/46.jpg)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)
ndash Iteration complexity is independent of n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
new
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
(with yti stored value at time t of gradient of the i-th function)
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
bull Running-time to reach precision ε (with κ = condition number)
Stochastic gradient descent dtimes∣
∣
∣κ times 1
ε
Gradient descent dtimes∣
∣
∣nκ times log 1
ε
Variance reduction dtimes∣
∣
∣(n+ κ) times log 1
ε
ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ
ndash Matching upper and lower bounds of complexity
Outline
1 Parametric supervised learning on a single machine
ndash Machine learning asymp optimization of finite sums
ndash From batch to stochastic gradient methods
ndash Linearly-convergent stochastic methods for convex problems
2 Machine learning over networks
minus Centralized and decentralized methods
minus From network averaging to optimization
minus Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
ndash Massive datasets multiple machines cores
ndash Communication legal constraints
bull Goal Minimize communication and local computation costs
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
bull Why not simply distributing a simple single machine algorithm
ndash (accelerated) gradient descent (see eg Nesterov 2004)
θt = θtminus1 minus γnablag(θtminus1)
ndash Requiresradicκ log 1
ε full gradient computations to reach precision ε
ndash Need to perform distributed averaging over a network
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
bull Robustness
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
ndash Consequence Eigenvalues(W ) isin [minus1 1]
ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )
ndash γminus1 = mixing time of the associated Markov chain
1 3
24
57
6 9
8
ndash Need 1γ log
1ε iterations to reach precision ε (for classical averaging)
Illustration of synchronous gossip
![Page 47: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/47.jpg)
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
ndash Exponential convergence rate in O(eminustκ) for convex problems
ndash Can be accelerated to O(eminustradicκ) (Nesterov 1983)
ndash Iteration complexity is linear in n
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
ndash Sampling with replacement i(t) random element of 1 nndash Convergence rate in O(κt)
ndash Iteration complexity is independent of n
Stochastic vs deterministic methods
bull Minimizing g(θ) =1
n
nsum
i=1
fi(θ) with fi(θ) = ℓ(
yi h(xi θ))
+ λΩ(θ)
bull Batch gradient descent θt = θtminus1minusγnablag(θtminus1) = θtminus1minusγ
n
nsum
i=1
nablafi(θtminus1)
bull Stochastic gradient descent θt = θtminus1 minus γtnablafi(t)(θtminus1)
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
new
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
(with yti stored value at time t of gradient of the i-th function)
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
bull Running-time to reach precision ε (with κ = condition number)
Stochastic gradient descent dtimes∣
∣
∣κ times 1
ε
Gradient descent dtimes∣
∣
∣nκ times log 1
ε
Variance reduction dtimes∣
∣
∣(n+ κ) times log 1
ε
ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ
ndash Matching upper and lower bounds of complexity
Outline
1 Parametric supervised learning on a single machine
ndash Machine learning asymp optimization of finite sums
ndash From batch to stochastic gradient methods
ndash Linearly-convergent stochastic methods for convex problems
2 Machine learning over networks
minus Centralized and decentralized methods
minus From network averaging to optimization
minus Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
ndash Massive datasets multiple machines cores
ndash Communication legal constraints
bull Goal Minimize communication and local computation costs
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
bull Why not simply distributing a simple single machine algorithm
ndash (accelerated) gradient descent (see eg Nesterov 2004)
θt = θtminus1 minus γnablag(θtminus1)
ndash Requiresradicκ log 1
ε full gradient computations to reach precision ε
ndash Need to perform distributed averaging over a network
Classical algorithms for distributed averaging

• Goal: given $n$ observations $\xi_1, \ldots, \xi_n \in \mathbb{R}$, compute
$$\theta_* = \frac{1}{n} \sum_{i=1}^{n} \xi_i = \operatorname*{argmin}_{\theta \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^{n} (\theta - \xi_i)^2$$

• Centralized algorithms
– Compute a spanning tree with diameter $\leq 2\Delta$
– Master/slave algorithm: $\Delta$ communication steps + no error

[figure: example network on nodes 1–9 and its spanning tree]

• Application to centralized distributed optimization
– $\sqrt{\kappa} \log \frac{1}{\varepsilon}$ gradient steps and $\sqrt{\kappa}\, \Delta \log \frac{1}{\varepsilon}$ communication steps
– "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)

• Robustness?
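A sketch of exact averaging over a spanning tree, under an assumed 9-node topology (the edge list is illustrative): partial sums travel up the tree to the master node, which then holds the exact average; a down-sweep would broadcast it back with no error.

```python
import numpy as np
from collections import deque

# undirected network on nodes 0..8 (hypothetical topology with one cycle)
edges = [(0, 1), (1, 2), (1, 3), (3, 4), (4, 5), (5, 6), (4, 7), (7, 8), (2, 4)]
adj = {i: [] for i in range(9)}
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)

xi = np.arange(1.0, 10.0)        # local values xi_1, ..., xi_9 (synthetic)

# BFS spanning tree rooted at the master node 0
parent = {0: None}
order = [0]
q = deque([0])
while q:
    u = q.popleft()
    for v in adj[u]:
        if v not in parent:
            parent[v] = u
            order.append(v)
            q.append(v)

# up-sweep: each node sends the partial sum/count of its subtree to its parent
subtree_sum = {i: xi[i] for i in range(9)}
subtree_cnt = {i: 1 for i in range(9)}
for v in reversed(order):        # BFS order reversed => children before parents
    if parent[v] is not None:
        subtree_sum[parent[v]] += subtree_sum[v]
        subtree_cnt[parent[v]] += subtree_cnt[v]

mean = subtree_sum[0] / subtree_cnt[0]   # master holds the exact average
```

Because the tree has diameter at most $2\Delta$, the up-sweep and down-sweep each finish in at most $\Delta$ communication steps.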
Classical algorithms for distributed averaging

• Decentralized algorithms – gossip (Boyd et al., 2006)
– Replace $\theta_i$ by a weighted average of its neighbors, $\sum_{j=1}^{n} W_{ij} \theta_j$
– Potential asynchrony, changing network

[figure: example network on nodes 1–9]

• Synchronous gossip (all nodes simultaneously)
– Main iteration: $\theta_t = W \theta_{t-1} = W^t \theta_0 = W^t \xi$
– Typical assumption: $W$ symmetric doubly stochastic matrix
Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)
– Main iteration: $\theta_t = W \theta_{t-1} = W^t \theta_0 = W^t \xi$
– Typical assumption: $W$ symmetric doubly stochastic matrix
– Consequence: eigenvalues of $W$ lie in $[-1, 1]$
– Eigengap: $\gamma = \lambda_1(W) - \lambda_2(W) = 1 - \lambda_2(W)$
– $\gamma^{-1}$ = mixing time of the associated Markov chain
– Need $\frac{1}{\gamma} \log \frac{1}{\varepsilon}$ iterations to reach precision $\varepsilon$ (for classical averaging)
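A small simulation of synchronous gossip (a ring network is assumed purely for illustration): repeated multiplication by a symmetric doubly stochastic $W$ drives every local value to the global mean, at a rate governed by the eigengap.

```python
import numpy as np

n = 9
xi = np.arange(1.0, 10.0)          # initial local values (synthetic)

# W: symmetric doubly stochastic matrix for a ring -- each node averages
# itself with its two neighbours with equal weight 1/3
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0

theta = xi.copy()
for t in range(500):
    theta = W @ theta              # theta_t = W theta_{t-1} = W^t xi

# every node converges to the global mean; the speed is set by the eigengap
lam = np.sort(np.linalg.eigvalsh(W))[::-1]
gamma = lam[0] - lam[1]            # gamma = 1 - lambda_2(W)
```

The error after $t$ rounds shrinks like $\lambda_2(W)^t = (1-\gamma)^t$, which is the $\frac{1}{\gamma}\log\frac{1}{\varepsilon}$ iteration count above.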
Illustration of synchronous gossip
Stochastic vs. deterministic methods

• Minimizing $g(\theta) = \frac{1}{n} \sum_{i=1}^{n} f_i(\theta)$ with $f_i(\theta) = \ell(y_i, h(x_i, \theta)) + \lambda \Omega(\theta)$

• Batch gradient descent: $\theta_t = \theta_{t-1} - \gamma \nabla g(\theta_{t-1}) = \theta_{t-1} - \frac{\gamma}{n} \sum_{i=1}^{n} \nabla f_i(\theta_{t-1})$

• Stochastic gradient descent: $\theta_t = \theta_{t-1} - \gamma_t \nabla f_{i(t)}(\theta_{t-1})$

• Goal = best of both worlds: linear rate with $O(d)$ iteration cost and a simple choice of step size

[figure: log(excess cost) vs. time for deterministic, stochastic, and new methods]
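A hedged sketch contrasting the two updates on synthetic least squares (the data and step-size choices are illustrative): batch gradient descent touches all $n$ points per iteration, while stochastic gradient descent touches one random point with a decaying step $\gamma_t$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)  # noisy targets

def full_grad(theta):
    # gradient of g(theta) = ||X theta - y||^2 / (2n)
    return X.T @ (X @ theta - y) / n

# batch gradient descent: O(nd) per iteration, linear convergence
theta_gd = np.zeros(d)
for t in range(500):
    theta_gd -= 0.1 * full_grad(theta_gd)

# stochastic gradient descent: O(d) per iteration, decaying step gamma_t
theta_sgd = np.zeros(d)
for t in range(1, 20001):
    i = rng.integers(n)
    theta_sgd -= (5.0 / (t + 100)) * (X[i] @ theta_sgd - y[i]) * X[i]
```

GD reaches the minimizer to high accuracy but pays $n$ gradients per step; SGD is cheap per step but only hovers near the solution, which is the trade-off the figure depicts.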
Recent progress in single-machine optimization

• Variance reduction
– Exponential convergence with $O(d)$ iteration cost
– SAG (Le Roux, Schmidt, and Bach, 2012)
– SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
– SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

$$\theta_t = \theta_{t-1} - \gamma \Big[ \nabla f_{i(t)}(\theta_{t-1}) + \frac{1}{n} \sum_{i=1}^{n} y_i^{t-1} - y_{i(t)}^{t-1} \Big]$$

(with $y_i^t$ the stored value at time $t$ of the gradient of the $i$-th function)

• Running time to reach precision $\varepsilon$ (with $\kappa$ = condition number):
– Stochastic gradient descent: $d \times \kappa \times \frac{1}{\varepsilon}$
– Gradient descent: $d \times n\kappa \times \log \frac{1}{\varepsilon}$
– Variance reduction: $d \times (n + \kappa) \times \log \frac{1}{\varepsilon}$

– Can be accelerated (e.g., Lan, 2015): $n + \kappa \Rightarrow n + \sqrt{n\kappa}$
– Matching upper and lower bounds of complexity
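The update displayed above is the SAGA rule. A sketch on synthetic least squares (step size and data are illustrative, not prescribed by the slides): a table of stored gradients $y_i$ and their running average turn the single-sample gradient into an unbiased, variance-reduced estimate, giving a linear rate at $O(d)$ cost per iteration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)            # noiseless targets for a clean check

def grad_i(theta, i):
    # gradient of f_i(theta) = (x_i^T theta - y_i)^2 / 2
    return (X[i] @ theta - y[i]) * X[i]

theta = np.zeros(d)
table = np.array([grad_i(theta, i) for i in range(n)])  # stored gradients y_i
avg = table.mean(axis=0)                                # (1/n) sum_i y_i
gamma = 1.0 / (3.0 * np.max(np.sum(X**2, axis=1)))      # step from max smoothness

for t in range(30000):
    i = rng.integers(n)
    g_new = grad_i(theta, i)
    # SAGA step: theta <- theta - gamma [grad f_i - y_i + (1/n) sum_j y_j]
    theta -= gamma * (g_new - table[i] + avg)
    avg += (g_new - table[i]) / n                       # keep the average current
    table[i] = g_new
```

Unlike plain SGD, the correction term vanishes at the optimum, so a constant step size suffices and convergence is exponential.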
Outline

1. Parametric supervised learning on a single machine
– Machine learning ≈ optimization of finite sums
– From batch to stochastic gradient methods
– Linearly convergent stochastic methods for convex problems

2. Machine learning over networks
– Centralized and decentralized methods
– From network averaging to optimization
– Distributing the fastest single-machine algorithms
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
ndash Massive datasets multiple machines cores
ndash Communication legal constraints
bull Goal Minimize communication and local computation costs
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
bull Why not simply distributing a simple single machine algorithm
ndash (accelerated) gradient descent (see eg Nesterov 2004)
θt = θtminus1 minus γnablag(θtminus1)
ndash Requiresradicκ log 1
ε full gradient computations to reach precision ε
ndash Need to perform distributed averaging over a network
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
bull Robustness
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
ndash Consequence Eigenvalues(W ) isin [minus1 1]
ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )
ndash γminus1 = mixing time of the associated Markov chain
1 3
24
57
6 9
8
ndash Need 1γ log
1ε iterations to reach precision ε (for classical averaging)
Illustration of synchronous gossip
![Page 49: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/49.jpg)
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
new
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
(with yti stored value at time t of gradient of the i-th function)
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
bull Running-time to reach precision ε (with κ = condition number)
Stochastic gradient descent dtimes∣
∣
∣κ times 1
ε
Gradient descent dtimes∣
∣
∣nκ times log 1
ε
Variance reduction dtimes∣
∣
∣(n+ κ) times log 1
ε
ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ
ndash Matching upper and lower bounds of complexity
Outline
1 Parametric supervised learning on a single machine
ndash Machine learning asymp optimization of finite sums
ndash From batch to stochastic gradient methods
ndash Linearly-convergent stochastic methods for convex problems
2 Machine learning over networks
minus Centralized and decentralized methods
minus From network averaging to optimization
minus Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
ndash Massive datasets multiple machines cores
ndash Communication legal constraints
bull Goal Minimize communication and local computation costs
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
bull Why not simply distributing a simple single machine algorithm
ndash (accelerated) gradient descent (see eg Nesterov 2004)
θt = θtminus1 minus γnablag(θtminus1)
ndash Requiresradicκ log 1
ε full gradient computations to reach precision ε
ndash Need to perform distributed averaging over a network
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
bull Robustness
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
ndash Consequence Eigenvalues(W ) isin [minus1 1]
ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )
ndash γminus1 = mixing time of the associated Markov chain
1 3
24
57
6 9
8
ndash Need 1γ log
1ε iterations to reach precision ε (for classical averaging)
Illustration of synchronous gossip
![Page 50: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/50.jpg)
Stochastic vs deterministic methods
bull Goal = best of both worlds Linear rate with O(d) iteration cost
Simple choice of step size
time
log(excess
cost)
deterministic
stochastic
new
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
(with yti stored value at time t of gradient of the i-th function)
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
bull Running-time to reach precision ε (with κ = condition number)
Stochastic gradient descent dtimes∣
∣
∣κ times 1
ε
Gradient descent dtimes∣
∣
∣nκ times log 1
ε
Variance reduction dtimes∣
∣
∣(n+ κ) times log 1
ε
ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ
ndash Matching upper and lower bounds of complexity
Outline
1 Parametric supervised learning on a single machine
ndash Machine learning asymp optimization of finite sums
ndash From batch to stochastic gradient methods
ndash Linearly-convergent stochastic methods for convex problems
2 Machine learning over networks
minus Centralized and decentralized methods
minus From network averaging to optimization
minus Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
ndash Massive datasets multiple machines cores
ndash Communication legal constraints
bull Goal Minimize communication and local computation costs
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
bull Why not simply distributing a simple single machine algorithm
ndash (accelerated) gradient descent (see eg Nesterov 2004)
θt = θtminus1 minus γnablag(θtminus1)
ndash Requiresradicκ log 1
ε full gradient computations to reach precision ε
ndash Need to perform distributed averaging over a network
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
bull Robustness
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
ndash Consequence Eigenvalues(W ) isin [minus1 1]
ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )
ndash γminus1 = mixing time of the associated Markov chain
1 3
24
57
6 9
8
ndash Need 1γ log
1ε iterations to reach precision ε (for classical averaging)
Illustration of synchronous gossip
![Page 51: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/51.jpg)
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
θt = θtminus1 minus γ[
nablafi(t)(θtminus1)+1
n
nsum
i=1
ytminus1i minus ytminus1
i(t)
]
(with yti stored value at time t of gradient of the i-th function)
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
bull Running-time to reach precision ε (with κ = condition number)
Stochastic gradient descent dtimes∣
∣
∣κ times 1
ε
Gradient descent dtimes∣
∣
∣nκ times log 1
ε
Variance reduction dtimes∣
∣
∣(n+ κ) times log 1
ε
ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ
ndash Matching upper and lower bounds of complexity
Outline
1 Parametric supervised learning on a single machine
ndash Machine learning asymp optimization of finite sums
ndash From batch to stochastic gradient methods
ndash Linearly-convergent stochastic methods for convex problems
2 Machine learning over networks
minus Centralized and decentralized methods
minus From network averaging to optimization
minus Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
ndash Massive datasets multiple machines cores
ndash Communication legal constraints
bull Goal Minimize communication and local computation costs
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
bull Why not simply distributing a simple single machine algorithm
ndash (accelerated) gradient descent (see eg Nesterov 2004)
θt = θtminus1 minus γnablag(θtminus1)
ndash Requiresradicκ log 1
ε full gradient computations to reach precision ε
ndash Need to perform distributed averaging over a network
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
• Centralized algorithms

– Compute a spanning tree with diameter ≤ 2Δ
– Master/slave algorithm: Δ communication steps + no error

[Figure: 9-node network and a spanning tree rooted at a master node]
• Application to centralized distributed optimization

– √κ log(1/ε) gradient steps and √κ Δ log(1/ε) communication steps
– "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)
• Robustness?
• Decentralized algorithms – gossip (Boyd et al., 2006)

– Replace θ_i by a weighted average of its neighbors, Σ_{j=1}^n W_ij θ_j
– Potential asynchrony, changing network
• Synchronous gossip (all nodes simultaneously)

– Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
– Typical assumption: W symmetric doubly stochastic matrix
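As a concrete (hypothetical) instance of the iteration θ_t = W θ_{t−1}, the sketch below runs synchronous gossip on a 9-node ring with a symmetric doubly stochastic W: every node keeps weight 1/2 and gives 1/4 to each ring neighbour, so all local values converge to the network-wide average of the ξ_i.

```python
import numpy as np

# Symmetric doubly stochastic gossip matrix on a ring of n nodes.
n = 9
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

rng = np.random.default_rng(1)
xi = rng.standard_normal(n)      # one observation per node
theta = xi.copy()
for t in range(400):
    theta = W @ theta            # one synchronous gossip round

# After enough rounds, every node holds (approximately) mean(xi).
```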
Convergence of synchronous gossip

• Synchronous gossip (all nodes simultaneously)

– Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
– Typical assumption: W symmetric doubly stochastic matrix
– Consequence: eigenvalues(W) ∈ [−1, 1]
– Eigengap γ = λ_1(W) − λ_2(W) = 1 − λ_2(W)
– γ^{−1} = mixing time of the associated Markov chain
– Need (1/γ) log(1/ε) iterations to reach precision ε (for classical averaging)
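These quantities are easy to check numerically. The sketch below (a hypothetical 9-node ring with gossip weights 1/2, 1/4, 1/4) computes the eigengap γ = 1 − λ_2(W) and verifies that one gossip round contracts the deviation from the mean by at least the factor λ_2 = 1 − γ, which is what yields the (1/γ) log(1/ε) iteration count.

```python
import numpy as np

n = 9
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

# W is symmetric, so its spectrum is real; sort eigenvalues decreasingly.
eigs = np.sort(np.linalg.eigvalsh(W))[::-1]
gamma = eigs[0] - eigs[1]        # eigengap; eigs[0] == 1 (doubly stochastic)

def dev(v):
    # norm of the deviation from the (preserved) network average
    return np.linalg.norm(v - v.mean())

xi = np.arange(n, dtype=float)
ratio = dev(W @ xi) / dev(xi)    # contraction factor of one gossip round
# ratio <= lambda_2 = 1 - gamma, since the deviation lives in the
# subspace orthogonal to the constant eigenvector.
```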
Illustration of synchronous gossip
Recent progress in single machine optimization

• Variance reduction

– Exponential convergence with O(d) iteration cost
– SAG (Le Roux, Schmidt, and Bach, 2012)
– SVRG (Johnson and Zhang, 2013; Zhang et al., 2013)
– SAGA (Defazio, Bach, and Lacoste-Julien, 2014), etc.

θ_t = θ_{t−1} − γ [ ∇f_{i(t)}(θ_{t−1}) + (1/n) Σ_{i=1}^n y_i^{t−1} − y_{i(t)}^{t−1} ]

(with y_i^t the stored value at time t of the gradient of the i-th function)
• Running time to reach precision ε (with κ = condition number)

– Stochastic gradient descent: d × κ × (1/ε)
– Gradient descent: d × nκ × log(1/ε)
– Variance reduction: d × (n + κ) × log(1/ε)

– Can be accelerated (e.g., Lan, 2015): n + κ ⇒ n + √(nκ)
– Matching upper and lower bounds of complexity
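As an illustration of the variance-reduced update above, here is a minimal SAGA-style sketch (hypothetical data, not the speaker's code) on the least-squares functions f_i(θ) = (a_i⊤θ − b_i)²/2, keeping a table of stored gradients y_i and their running mean so that each step costs O(d) yet the method converges linearly to the exact minimizer.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 5
A = rng.standard_normal((n, d))   # hypothetical design, row a_i defines f_i
b = rng.standard_normal(n)

def grad_i(i, theta):
    # gradient of f_i(theta) = (a_i . theta - b_i)^2 / 2
    return (A[i] @ theta - b[i]) * A[i]

theta = np.zeros(d)
y = np.stack([grad_i(i, theta) for i in range(n)])  # stored gradients y_i
y_mean = y.mean(axis=0)
L = (A ** 2).sum(axis=1).max()    # smoothness constant of the worst f_i
gamma = 1.0 / (3.0 * L)           # classical SAGA step size

for t in range(20000):
    i = rng.integers(n)
    g = grad_i(i, theta)
    # variance-reduced step: grad f_i(theta) - y_i + (1/n) sum_j y_j
    theta -= gamma * (g - y[i] + y_mean)
    y_mean += (g - y[i]) / n      # keep the running mean consistent
    y[i] = g                      # refresh the stored gradient
```

Unlike plain SGD, the correction term (− y_i + mean of the y_j) has zero mean and shrinks near the optimum, which is what allows a constant step size and the (n + κ) log(1/ε) running time in the table above.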
Outline

1. Parametric supervised learning on a single machine
– Machine learning ≈ optimization of finite sums
– From batch to stochastic gradient methods
– Linearly-convergent stochastic methods for convex problems

2. Machine learning over networks
– Centralized and decentralized methods
– From network averaging to optimization
– Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
ndash Massive datasets multiple machines cores
ndash Communication legal constraints
bull Goal Minimize communication and local computation costs
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
bull Why not simply distributing a simple single machine algorithm
ndash (accelerated) gradient descent (see eg Nesterov 2004)
θt = θtminus1 minus γnablag(θtminus1)
ndash Requiresradicκ log 1
ε full gradient computations to reach precision ε
ndash Need to perform distributed averaging over a network
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
bull Robustness
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
ndash Consequence Eigenvalues(W ) isin [minus1 1]
ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )
ndash γminus1 = mixing time of the associated Markov chain
1 3
24
57
6 9
8
ndash Need 1γ log
1ε iterations to reach precision ε (for classical averaging)
Illustration of synchronous gossip
![Page 53: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/53.jpg)
Recent progress in single machine optimization
bull Variance reduction
ndash Exponential convergence with O(d) iteration cost
ndash SAG (Le Roux Schmidt and Bach 2012)
ndash SVRG (Johnson and Zhang 2013 Zhang et al 2013)
ndash SAGA (Defazio Bach and Lacoste-Julien 2014) etc
bull Running-time to reach precision ε (with κ = condition number)
Stochastic gradient descent dtimes∣
∣
∣κ times 1
ε
Gradient descent dtimes∣
∣
∣nκ times log 1
ε
Variance reduction dtimes∣
∣
∣(n+ κ) times log 1
ε
ndash Can be accelerated (eg Lan 2015) n+ κ rArr n+radicnκ
ndash Matching upper and lower bounds of complexity
Outline
1 Parametric supervised learning on a single machine
ndash Machine learning asymp optimization of finite sums
ndash From batch to stochastic gradient methods
ndash Linearly-convergent stochastic methods for convex problems
2 Machine learning over networks
minus Centralized and decentralized methods
minus From network averaging to optimization
minus Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
ndash Massive datasets multiple machines cores
ndash Communication legal constraints
bull Goal Minimize communication and local computation costs
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
bull Why not simply distributing a simple single machine algorithm
ndash (accelerated) gradient descent (see eg Nesterov 2004)
θt = θtminus1 minus γnablag(θtminus1)
ndash Requiresradicκ log 1
ε full gradient computations to reach precision ε
ndash Need to perform distributed averaging over a network
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
bull Robustness
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
ndash Consequence Eigenvalues(W ) isin [minus1 1]
ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )
ndash γminus1 = mixing time of the associated Markov chain
1 3
24
57
6 9
8
ndash Need 1γ log
1ε iterations to reach precision ε (for classical averaging)
Illustration of synchronous gossip
![Page 54: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/54.jpg)
Outline
1 Parametric supervised learning on a single machine
ndash Machine learning asymp optimization of finite sums
ndash From batch to stochastic gradient methods
ndash Linearly-convergent stochastic methods for convex problems
2 Machine learning over networks
minus Centralized and decentralized methods
minus From network averaging to optimization
minus Distributing the fastest single machine algorithms
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
ndash Massive datasets multiple machines cores
ndash Communication legal constraints
bull Goal Minimize communication and local computation costs
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
bull Why not simply distributing a simple single machine algorithm
ndash (accelerated) gradient descent (see eg Nesterov 2004)
θt = θtminus1 minus γnablag(θtminus1)
ndash Requiresradicκ log 1
ε full gradient computations to reach precision ε
ndash Need to perform distributed averaging over a network
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
bull Robustness
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
ndash Consequence Eigenvalues(W ) isin [minus1 1]
ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )
ndash γminus1 = mixing time of the associated Markov chain
1 3
24
57
6 9
8
ndash Need 1γ log
1ε iterations to reach precision ε (for classical averaging)
Illustration of synchronous gossip
![Page 55: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/55.jpg)
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
ndash Massive datasets multiple machines cores
ndash Communication legal constraints
bull Goal Minimize communication and local computation costs
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
bull Why not simply distributing a simple single machine algorithm
ndash (accelerated) gradient descent (see eg Nesterov 2004)
θt = θtminus1 minus γnablag(θtminus1)
ndash Requiresradicκ log 1
ε full gradient computations to reach precision ε
ndash Need to perform distributed averaging over a network
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
bull Robustness
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
ndash Consequence Eigenvalues(W ) isin [minus1 1]
ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )
ndash γminus1 = mixing time of the associated Markov chain
1 3
24
57
6 9
8
ndash Need 1γ log
1ε iterations to reach precision ε (for classical averaging)
Illustration of synchronous gossip
![Page 56: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/56.jpg)
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
1 3
24
57
6 9
8
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
ndash Massive datasets multiple machines cores
ndash Communication legal constraints
bull Goal Minimize communication and local computation costs
Distribution in machine learning (and beyond)

• Machine learning through optimization:
  min_{θ∈R^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)
  – f_i(θ): error of the model defined by θ on the dataset indexed by i
• Why not simply distribute a single-machine algorithm?
  – (Accelerated) gradient descent (see, e.g., Nesterov, 2004):
    θ_t = θ_{t−1} − γ ∇g(θ_{t−1})
  – Requires √κ log(1/ε) full gradient computations to reach precision ε
  – Need to perform distributed averaging over a network
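The last bullet can be made concrete: with local least-squares losses (a hypothetical setup chosen for simplicity), ∇g(θ) is the average of the n local gradients, so every gradient step hides one distributed-averaging round:

```python
import numpy as np

# Sketch: least-squares f_i on local data, so nabla g(theta) is the average
# of the n local gradients -- the quantity that each step of gradient descent
# must aggregate over the network.
rng = np.random.default_rng(1)
n, d, m = 5, 3, 40                           # nodes, dimension, points per node
X = [rng.normal(size=(m, d)) for _ in range(n)]
theta_true = rng.normal(size=d)
y = [Xi @ theta_true for Xi in X]            # noiseless labels for simplicity

def local_grad(i, theta):
    # gradient of f_i(theta) = (1/(2m)) ||X_i theta - y_i||^2
    return X[i].T @ (X[i] @ theta - y[i]) / m

theta = np.zeros(d)
step = 0.1                                   # step size (assumed small enough)
for t in range(300):
    grad = sum(local_grad(i, theta) for i in range(n)) / n  # averaging round
    theta -= step * grad

print("distance to minimizer:", np.linalg.norm(theta - theta_true))
```

In a real deployment the `sum(...)/n` line is the expensive part: it is a network-wide averaging operation, performed once per iteration.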
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ R,
  compute θ∗ = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) Σ_{i=1}^n (θ − ξ_i)^2
• Centralized algorithms
  – Compute a spanning tree with diameter ≤ 2∆
  – Master/slave algorithm: ∆ communication steps + no error
• Application to centralized distributed optimization
  – √κ log(1/ε) gradient steps and √κ ∆ log(1/ε) communication steps
  – "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)
• Robustness?
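The master/slave scheme can be sketched as two passes over a spanning tree (a hypothetical 6-node tree here): partial sums travel up to the master, then the exact average is broadcast back down, in a number of rounds proportional to the tree depth:

```python
from collections import deque

# Hypothetical master/slave averaging on a spanning tree: aggregate partial
# sums up to the root (master), then broadcast the exact average back down.
# Communication rounds ~ tree depth <= Delta, and the result has no error.
edges = {1: [2, 3], 2: [1, 4, 5], 3: [1, 6], 4: [2], 5: [2], 6: [3]}
xi = {1: 0.5, 2: -1.0, 3: 2.0, 4: 1.5, 5: 0.0, 6: 3.0}

# BFS from the master (node 1) gives the spanning-tree structure.
parent, order = {1: None}, []
q = deque([1])
while q:
    u = q.popleft()
    order.append(u)
    for v in edges[u]:
        if v not in parent:
            parent[v] = u
            q.append(v)

# Upward pass: each node forwards the sum of the values in its subtree.
subtree_sum = dict(xi)
for u in reversed(order):                    # process leaves first
    if parent[u] is not None:
        subtree_sum[parent[u]] += subtree_sum[u]

average = subtree_sum[1] / len(xi)           # exact average at the master
# Downward pass would broadcast `average` along the same tree edges.
print("exact average:", average)
```

Unlike gossip, this is exact after finitely many rounds, but it relies on the tree (and in particular the master) staying up, which is the "robustness" question on the slide.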
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ R,
  compute θ∗ = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) Σ_{i=1}^n (θ − ξ_i)^2
• Decentralized algorithms – gossip (Boyd et al., 2006)
  – Replace θ_i by a weighted average of its neighbors: Σ_{j=1}^n W_ij θ_j
  – Potential asynchrony, changing network
• Synchronous gossip (all nodes simultaneously)
  – Main iteration: θ_t = W θ_{t−1} = W^t θ_0 = W^t ξ
  – Typical assumption: W symmetric, doubly stochastic matrix
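One standard way to build such a W (a common choice, not necessarily the one used in the lecture) is with Metropolis weights, which are symmetric and doubly stochastic for any graph; a short sketch on a hypothetical 4-node graph:

```python
import numpy as np

# Metropolis weights: W_ij = 1/(1 + max(deg_i, deg_j)) for neighbors i~j,
# and W_ii chosen so that each row sums to 1. By symmetry the columns also
# sum to 1, so W is symmetric and doubly stochastic.
neighbors = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
n = len(neighbors)
deg = {i: len(nb) for i, nb in neighbors.items()}

W = np.zeros((n, n))
for i, nb in neighbors.items():
    for j in nb:
        W[i, j] = 1.0 / (1 + max(deg[i], deg[j]))
    W[i, i] = 1.0 - W[i].sum()               # remaining mass stays at node i

theta = np.array([4.0, 0.0, 2.0, 2.0])       # initial values, mean 2.0
for _ in range(100):
    # each node takes the weighted average of itself and its neighbors
    theta = np.array([W[i] @ theta for i in range(n)])

print("symmetric:", np.allclose(W, W.T))
print("values after gossip:", theta)         # all close to the mean 2.0
```

Each update only uses values of direct neighbors, which is what makes the scheme decentralized.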
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
ndash Consequence Eigenvalues(W ) isin [minus1 1]
ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )
ndash γminus1 = mixing time of the associated Markov chain
1 3
24
57
6 9
8
ndash Need 1γ log
1ε iterations to reach precision ε (for classical averaging)
Illustration of synchronous gossip
![Page 57: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/57.jpg)
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Each dataset function fi only accessible by node i in a graph
ndash Massive datasets multiple machines cores
ndash Communication legal constraints
bull Goal Minimize communication and local computation costs
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
bull Why not simply distributing a simple single machine algorithm
ndash (accelerated) gradient descent (see eg Nesterov 2004)
θt = θtminus1 minus γnablag(θtminus1)
ndash Requiresradicκ log 1
ε full gradient computations to reach precision ε
ndash Need to perform distributed averaging over a network
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
bull Robustness
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
ndash Consequence Eigenvalues(W ) isin [minus1 1]
ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )
ndash γminus1 = mixing time of the associated Markov chain
1 3
24
57
6 9
8
ndash Need 1γ log
1ε iterations to reach precision ε (for classical averaging)
Illustration of synchronous gossip
![Page 58: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/58.jpg)
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
bull Why not simply distributing a simple single machine algorithm
ndash (accelerated) gradient descent (see eg Nesterov 2004)
θt = θtminus1 minus γnablag(θtminus1)
ndash Requiresradicκ log 1
ε full gradient computations to reach precision ε
ndash Need to perform distributed averaging over a network
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
bull Robustness
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
ndash Consequence Eigenvalues(W ) isin [minus1 1]
ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )
ndash γminus1 = mixing time of the associated Markov chain
1 3
24
57
6 9
8
ndash Need 1γ log
1ε iterations to reach precision ε (for classical averaging)
Illustration of synchronous gossip
![Page 59: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/59.jpg)
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
bull Robustness
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
ndash Consequence Eigenvalues(W ) isin [minus1 1]
ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )
ndash γminus1 = mixing time of the associated Markov chain
1 3
24
57
6 9
8
ndash Need 1γ log
1ε iterations to reach precision ε (for classical averaging)
Illustration of synchronous gossip
![Page 60: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/60.jpg)
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
bull Robustness
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
ndash Consequence Eigenvalues(W ) isin [minus1 1]
ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )
ndash γminus1 = mixing time of the associated Markov chain
1 3
24
57
6 9
8
ndash Need 1γ log
1ε iterations to reach precision ε (for classical averaging)
Illustration of synchronous gossip
![Page 61: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/61.jpg)
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
bull Robustness
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
ndash Consequence Eigenvalues(W ) isin [minus1 1]
ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )
ndash γminus1 = mixing time of the associated Markov chain
1 3
24
57
6 9
8
ndash Need 1γ log
1ε iterations to reach precision ε (for classical averaging)
Illustration of synchronous gossip
![Page 62: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/62.jpg)
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
1 3
24
57
6 9
81 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Centralized algorithms
ndash Compute a spanning tree with diameter 6 2∆
ndash Masterslave algorithm ∆ communication steps + no error
bull Application to centralized distributed optimization
ndashradicκ log 1
ε gradient steps andradicκ∆log 1
ε communication steps
ndash ldquoOptimalrdquo (Scaman Bach Bubeck Lee and Massoulie 2017)
bull Robustness
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
1 3
24
57
6 9
8
Classical algorithms for distributed averaging
bull Goal Given n observations ξ1 ξn isin R
ndash Compute θlowast =1
n
nsum
i=1
ξi = argminθisinR
1
n
nsum
i=1
(θ minus ξi)2
bull Decentralized algorithms - gossip (Boyd et al 2006)
ndash Replace θi by a weighted average of its neighborssumn
j=1Wijθjndash Potential asynchrony changing network
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
Convergence of synchronous gossip
bull Synchronous gossip (all nodes simultaneously)
ndash Main iteration θt = Wθtminus1 = W tθ0 = W tξ
ndash Typical assumption W symmetric doubly stochastic matrix
ndash Consequence Eigenvalues(W ) isin [minus1 1]
ndash Eigengap γ = λ1(W )minus λ2(W ) = 1minus λ2(W )
ndash γminus1 = mixing time of the associated Markov chain
1 3
24
57
6 9
8
ndash Need 1γ log
1ε iterations to reach precision ε (for classical averaging)
Illustration of synchronous gossip
![Page 63: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/63.jpg)
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ R,
  compute θ* = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) Σ_{i=1}^n (θ − ξ_i)^2

• Centralized algorithms
– Compute a spanning tree with diameter ≤ 2Δ
– Master/slave algorithm: Δ communication steps + no error

• Application to centralized distributed optimization
– √κ log(1/ε) gradient steps and √κ Δ log(1/ε) communication steps
– "Optimal" (Scaman, Bach, Bubeck, Lee, and Massoulié, 2017)

• Robustness
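As a concrete (hypothetical) sketch of the master/slave scheme: partial sums are aggregated up a spanning tree to the root, and the exact average is broadcast back down, in about Δ communication rounds each way and with no approximation error, unlike gossip.

```python
# Sketch: master/slave averaging on a spanning tree.
def tree_average(children, root, values):
    """children: dict node -> list of child nodes in the spanning tree."""
    def up(u):  # aggregate (sum, count) from the leaves to the root
        s, c = values[u], 1
        for v in children.get(u, []):
            sv, cv = up(v)
            s, c = s + sv, c + cv
        return s, c

    s, c = up(root)
    return s / c  # this exact value is then broadcast down the tree

children = {0: [1, 2], 1: [3, 4]}            # a depth-2 tree on 5 nodes
values = {i: float(i + 1) for i in range(5)}
avg = tree_average(children, 0, values)      # (1+2+3+4+5)/5
```

The recursion stands in for the upward message passing; in a real deployment each level of the tree is one communication round, hence the Δ-step count.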
Classical algorithms for distributed averaging

• Goal: given n observations ξ_1, …, ξ_n ∈ R,
  compute θ* = (1/n) Σ_{i=1}^n ξ_i = argmin_{θ∈R} (1/n) Σ_{i=1}^n (θ − ξ_i)^2

• Decentralized algorithms – gossip (Boyd et al., 2006)
– Replace θ_i by a weighted average of its neighbors: Σ_{j=1}^n W_ij θ_j
– Potential asynchrony, changing network

[figure: example network on nodes 1–9]
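A sketch of the asynchronous flavour (assuming the randomized pairwise variant in the spirit of Boyd et al., 2006): at each step a single random edge (i, j) wakes up and both endpoints replace their values by their average, so only neighbours ever communicate and the global mean is preserved at every step.

```python
import random

def pairwise_gossip(edges, theta, steps, seed=0):
    """Randomized pairwise gossip: average across one random edge per step."""
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(steps):
        i, j = rng.choice(edges)
        theta[i] = theta[j] = (theta[i] + theta[j]) / 2.0
    return theta

edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]   # a 5-node ring
theta = pairwise_gossip(edges, [1.0, 2.0, 3.0, 4.0, 5.0], steps=500)
spread = max(theta) - min(theta)                   # contracts to consensus
```

Because each update is a local average, the vector of values contracts toward the consensus value (1+2+3+4+5)/5 = 3 while its mean never changes.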
Decentralized optimization

• Mixing gossip and optimization
– Nedic and Ozdaglar (2009); Duchi et al. (2012); Wei and Ozdaglar (2012);
  Iutzeler et al. (2013); Shi et al. (2015); Jakovetic et al. (2015);
  Nedich et al. (2016); Mokhtari et al. (2016); Colin et al. (2016);
  Scaman et al. (2017); etc.

• Lower bound on complexity (Scaman et al., 2017)
– √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
– Plain gossip not optimal (need to gossip gradients with increasing precision)

• Is this lower bound achievable?
Dual reformulation (Jakovetic et al., 2015)

min_{θ ∈ R^d} Σ_{i=1}^n f_i(θ)
  = min_{θ(1), …, θ(n) ∈ R^d} Σ_{i=1}^n f_i(θ(i)) such that ∀ i ∼ j, θ(i) = θ(j)
  = min_{θ(1), …, θ(n) ∈ R^d} max_{∀ i∼j, λ_ij ∈ R^d} Σ_{i=1}^n f_i(θ(i)) + Σ_{i∼j} λ_ij^⊤ (θ(i) − θ(j))
  = max_{∀ i∼j, λ_ij ∈ R^d} min_{θ(1), …, θ(n) ∈ R^d} Σ_{i=1}^n f_i(θ(i)) + Σ_{i=1}^n [θ(i)]^⊤ linear_i(λ)
  = max_{∀ i∼j, λ_ij ∈ R^d} Σ_{i=1}^n function_i(λ) = max_{∀ i∼j, λ_ij ∈ R^d} function(λ)

• Accelerated gradient descent (Scaman et al., 2017)
  ⇔ alternating local gradient computations and a gossip step
– √(κ/γ) log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
– Not optimal ⇒ need accelerated gossip
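To make the dual reformulation concrete, here is a sketch of plain dual gradient ascent (not the accelerated method of Scaman et al.) for the quadratic averaging case f_i(θ) = (θ − ξ_i)^2/2 with scalar θ on a small ring: the inner minimization is local to each node, and the dual gradient θ(i) − θ(j) is local to each edge, so every step uses only neighbour-to-neighbour communication.

```python
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]   # 5-node ring, i ~ j pairs
xi = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n = len(xi)

lam = np.zeros(len(edges))   # one dual variable lambda_ij per edge
eta = 0.25                   # step size <= 1 / lambda_max(Laplacian)
for _ in range(200):
    # Inner min over theta: for f_i(t) = (t - xi_i)^2 / 2,
    # theta_i = xi_i - (signed sum of incident lambdas).
    g = np.zeros(n)
    for e, (i, j) in enumerate(edges):
        g[i] += lam[e]
        g[j] -= lam[e]
    theta = xi - g
    # Dual gradient ascent: d(dual)/d lambda_ij = theta_i - theta_j.
    for e, (i, j) in enumerate(edges):
        lam[e] += eta * (theta[i] - theta[j])
```

At the dual optimum the multipliers enforce the equality constraints, so all θ(i) agree with the average of the ξ_i.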
Accelerated gossip

• Regular gossip
– Iterations: θ_t = W^t θ_0

• Accelerated gossip
– Chebyshev acceleration (Auzinger, 2011; Arioli and Scott, 2014)
– Shift-register gossip (Cao et al., 2006)
– Linear combinations ⇔ η_t = Σ_{k=0}^t α_k θ_k = Σ_{k=0}^t α_k W^k ξ = P_t(W) ξ
– The optimal polynomial P_t is the Chebyshev polynomial
– Can be computed online at the same cost as regular gossip, e.g.,
  θ_t = ω_t W θ_{t−1} + (1 − ω_t) θ_{t−2}
– Replaces γ^{−1} by γ^{−1/2} in the rates
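A minimal sketch of the acceleration (assuming λ_2, hence γ, is known exactly; the three-term recurrence below is the standard Chebyshev semi-iterative scheme, not the exact recursion of any cited paper): η_t = P_t(W) ξ with P_t a rescaled Chebyshev polynomial normalized so that P_t(1) = 1, computed online at one matrix-vector product per step, just like regular gossip.

```python
import numpy as np

def chebyshev_gossip(W, xi, T, lam2):
    """Accelerated gossip via the Chebyshev three-term recurrence.

    Uses a_t = T_t(1/lam2), where T_t are the Chebyshev polynomials, and
    assumes the non-consensus spectrum of W lies in [-lam2, lam2]."""
    x_prev, x = xi.copy(), W @ xi        # P_0(W) = I, P_1(W) = W
    a_prev, a = 1.0, 1.0 / lam2
    for _ in range(2, T + 1):
        a_next = 2.0 * a / lam2 - a_prev
        x_next = (2.0 * a / (lam2 * a_next)) * (W @ x) - (a_prev / a_next) * x_prev
        x_prev, x, a_prev, a = x, x_next, a, a_next
    return x

# Ring of n nodes, W averaging each node with its two neighbours.
n = 20
W = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i, i + 1):
        W[i, j % n] = 1.0 / 3.0

rng = np.random.default_rng(1)
xi = rng.normal(size=n)
lam2 = np.sort(np.abs(np.linalg.eigvalsh(W)))[-2]

T = 30
plain = np.linalg.norm(np.linalg.matrix_power(W, T) @ xi - xi.mean())
accel = np.linalg.norm(chebyshev_gossip(W, xi, T, lam2) - xi.mean())
```

With the same number of communication rounds T, the accelerated error decays roughly like (1 − √γ)^T instead of (1 − γ)^T, which is the γ^{−1} → γ^{−1/2} improvement in the rates.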
Illustration of accelerated gossip
![Page 72: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/72.jpg)
Decentralized optimization
bull Mixing gossip and optimization
bull Lower bound on complexity (Scaman et al 2017)
ndashradicκ log 1
ε gradient steps andradic
κγ log 1ε communication steps
ndash Plain gossip not optimal
(need to gossip gradients with increasing precision)
Decentralized optimization
bull Mixing gossip and optimization
bull Lower bound on complexity (Scaman et al 2017)
ndashradicκ log 1
ε gradient steps andradic
κγ log 1ε communication steps
ndash Plain gossip not optimal
(need to gossip gradients with increasing precision)
bull Is this lower bound achievable
Dual reformulation (Jakovetic et al 2015)
minθisinRd
nsum
i=1
fi(θ) = minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) such that foralli sim j θ(i) = θ(j)
minθ(1)θ(n)isinRd
maxforallisimjλijisinRd
nsum
i=1
fi(θ(i)) +
sum
isimj
λ⊤ij(θ
(i)minusθ(j))
Dual reformulation (Jakovetic et al 2015)
minθisinRd
nsum
i=1
fi(θ) = minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) such that foralli sim j θ(i) = θ(j)
= minθ(1)θ(n)isinRd
maxforallisimjλijisinRd
nsum
i=1
fi(θ(i)) +
sum
isimj
λ⊤ij(θ
(i)minusθ(j))
Dual reformulation (Jakovetic et al 2015)
minθisinRd
nsum
i=1
fi(θ) = minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) such that foralli sim j θ(i) = θ(j)
= minθ(1)θ(n)isinRd
maxforallisimjλijisinRd
nsum
i=1
fi(θ(i)) +
sum
isimj
λ⊤ij(θ
(i)minusθ(j))
= maxforallisimjλijisinRd
minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) +
nsum
i=1
[θ(i)]⊤lineari(λ)
Dual reformulation (Jakovetic et al 2015)
minθisinRd
nsum
i=1
fi(θ) = minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) such that foralli sim j θ(i) = θ(j)
= minθ(1)θ(n)isinRd
maxforallisimjλijisinRd
nsum
i=1
fi(θ(i)) +
sum
isimj
λ⊤ij(θ
(i)minusθ(j))
= maxforallisimjλijisinRd
minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) +
nsum
i=1
[θ(i)]⊤lineari(λ)
= maxforallisimjλijisinRd
nsum
i=1
functioni(λ) = maxforallisimjλijisinRd
function(λ)
Dual reformulation (Jakovetic et al 2015)
minθisinRd
nsum
i=1
fi(θ) = minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) such that foralli sim j θ(i) = θ(j)
= minθ(1)θ(n)isinRd
maxforallisimjλijisinRd
nsum
i=1
fi(θ(i)) +
sum
isimj
λ⊤ij(θ
(i)minusθ(j))
= maxforallisimjλijisinRd
minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) +
nsum
i=1
[θ(i)]⊤lineari(λ)
= maxforallisimjλijisinRd
nsum
i=1
functioni(λ) = maxforallisimjλijisinRd
function(λ)
bull Accelerated gradient descent (Scaman et al 2017)
hArr alternating local gradient computations and a gossip step
ndashradic
κγ log 1ε gradient steps and
radic
κγ log 1ε communication steps
ndash Not optimal rArr need accelerated gossip
Accelerated gossip
bull Regular gossip
ndash Iterations θt = W tθ0
bull Accelerated gossip
ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)
ndash Shift-register gossip (Cao et al 2006)
ndash Linear combinations hArr ηt =t
sum
k=0
αkθk =t
sum
k=0
αkWkξ = Pt(W )ξ
ndash Optimal polynomial is the Chebyshev polynomial
ndash Can be computed online with same cost as regular gossip eg
θt = ωtWθtminus1 + (1minus ωt)θtminus1
ndash Replace γminus1 by γminus12 in rates
Accelerated gossip
bull Regular gossip
ndash Iterations θt = W tθ0
bull Accelerated gossip
ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)
ndash Shift-register gossip (Cao et al 2006)
ndash Linear combinations hArr ηt =t
sum
k=0
αkθk =t
sum
k=0
αkWkξ = Pt(W )ξ
ndash Optimal polynomial is the Chebyshev polynomial
ndash Can be computed online with same cost as regular gossip eg
θt = ωtWθtminus1 + (1minus ωt)θtminus1
ndash Replace γminus1 by γminus12 in rates
Illustration of accelerated gossip
![Page 73: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/73.jpg)
Decentralized optimization
bull Mixing gossip and optimization
bull Lower bound on complexity (Scaman et al 2017)
ndashradicκ log 1
ε gradient steps andradic
κγ log 1ε communication steps
ndash Plain gossip not optimal
(need to gossip gradients with increasing precision)
bull Is this lower bound achievable
Dual reformulation (Jakovetic et al 2015)
minθisinRd
nsum
i=1
fi(θ) = minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) such that foralli sim j θ(i) = θ(j)
minθ(1)θ(n)isinRd
maxforallisimjλijisinRd
nsum
i=1
fi(θ(i)) +
sum
isimj
λ⊤ij(θ
(i)minusθ(j))
Dual reformulation (Jakovetic et al 2015)
minθisinRd
nsum
i=1
fi(θ) = minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) such that foralli sim j θ(i) = θ(j)
= minθ(1)θ(n)isinRd
maxforallisimjλijisinRd
nsum
i=1
fi(θ(i)) +
sum
isimj
λ⊤ij(θ
(i)minusθ(j))
Dual reformulation (Jakovetic et al 2015)
minθisinRd
nsum
i=1
fi(θ) = minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) such that foralli sim j θ(i) = θ(j)
= minθ(1)θ(n)isinRd
maxforallisimjλijisinRd
nsum
i=1
fi(θ(i)) +
sum
isimj
λ⊤ij(θ
(i)minusθ(j))
= maxforallisimjλijisinRd
minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) +
nsum
i=1
[θ(i)]⊤lineari(λ)
Dual reformulation (Jakovetic et al 2015)
minθisinRd
nsum
i=1
fi(θ) = minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) such that foralli sim j θ(i) = θ(j)
= minθ(1)θ(n)isinRd
maxforallisimjλijisinRd
nsum
i=1
fi(θ(i)) +
sum
isimj
λ⊤ij(θ
(i)minusθ(j))
= maxforallisimjλijisinRd
minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) +
nsum
i=1
[θ(i)]⊤lineari(λ)
= maxforallisimjλijisinRd
nsum
i=1
functioni(λ) = maxforallisimjλijisinRd
function(λ)
Dual reformulation (Jakovetic et al 2015)
minθisinRd
nsum
i=1
fi(θ) = minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) such that foralli sim j θ(i) = θ(j)
= minθ(1)θ(n)isinRd
maxforallisimjλijisinRd
nsum
i=1
fi(θ(i)) +
sum
isimj
λ⊤ij(θ
(i)minusθ(j))
= maxforallisimjλijisinRd
minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) +
nsum
i=1
[θ(i)]⊤lineari(λ)
= maxforallisimjλijisinRd
nsum
i=1
functioni(λ) = maxforallisimjλijisinRd
function(λ)
bull Accelerated gradient descent (Scaman et al 2017)
hArr alternating local gradient computations and a gossip step
ndashradic
κγ log 1ε gradient steps and
radic
κγ log 1ε communication steps
ndash Not optimal rArr need accelerated gossip
Accelerated gossip
bull Regular gossip
ndash Iterations θt = W tθ0
bull Accelerated gossip
ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)
ndash Shift-register gossip (Cao et al 2006)
ndash Linear combinations hArr ηt =t
sum
k=0
αkθk =t
sum
k=0
αkWkξ = Pt(W )ξ
ndash Optimal polynomial is the Chebyshev polynomial
ndash Can be computed online with same cost as regular gossip eg
θt = ωtWθtminus1 + (1minus ωt)θtminus1
ndash Replace γminus1 by γminus12 in rates
Accelerated gossip
bull Regular gossip
ndash Iterations θt = W tθ0
bull Accelerated gossip
ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)
ndash Shift-register gossip (Cao et al 2006)
ndash Linear combinations hArr ηt =t
sum
k=0
αkθk =t
sum
k=0
αkWkξ = Pt(W )ξ
ndash Optimal polynomial is the Chebyshev polynomial
ndash Can be computed online with same cost as regular gossip eg
θt = ωtWθtminus1 + (1minus ωt)θtminus1
ndash Replace γminus1 by γminus12 in rates
Illustration of accelerated gossip
![Page 74: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/74.jpg)
Dual reformulation (Jakovetic et al 2015)
minθisinRd
nsum
i=1
fi(θ) = minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) such that foralli sim j θ(i) = θ(j)
minθ(1)θ(n)isinRd
maxforallisimjλijisinRd
nsum
i=1
fi(θ(i)) +
sum
isimj
λ⊤ij(θ
(i)minusθ(j))
Dual reformulation (Jakovetic et al 2015)
minθisinRd
nsum
i=1
fi(θ) = minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) such that foralli sim j θ(i) = θ(j)
= minθ(1)θ(n)isinRd
maxforallisimjλijisinRd
nsum
i=1
fi(θ(i)) +
sum
isimj
λ⊤ij(θ
(i)minusθ(j))
Dual reformulation (Jakovetic et al 2015)
minθisinRd
nsum
i=1
fi(θ) = minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) such that foralli sim j θ(i) = θ(j)
= minθ(1)θ(n)isinRd
maxforallisimjλijisinRd
nsum
i=1
fi(θ(i)) +
sum
isimj
λ⊤ij(θ
(i)minusθ(j))
= maxforallisimjλijisinRd
minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) +
nsum
i=1
[θ(i)]⊤lineari(λ)
Dual reformulation (Jakovetic et al 2015)
minθisinRd
nsum
i=1
fi(θ) = minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) such that foralli sim j θ(i) = θ(j)
= minθ(1)θ(n)isinRd
maxforallisimjλijisinRd
nsum
i=1
fi(θ(i)) +
sum
isimj
λ⊤ij(θ
(i)minusθ(j))
= maxforallisimjλijisinRd
minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) +
nsum
i=1
[θ(i)]⊤lineari(λ)
= maxforallisimjλijisinRd
nsum
i=1
functioni(λ) = maxforallisimjλijisinRd
function(λ)
Dual reformulation (Jakovetic et al 2015)
minθisinRd
nsum
i=1
fi(θ) = minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) such that foralli sim j θ(i) = θ(j)
= minθ(1)θ(n)isinRd
maxforallisimjλijisinRd
nsum
i=1
fi(θ(i)) +
sum
isimj
λ⊤ij(θ
(i)minusθ(j))
= maxforallisimjλijisinRd
minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) +
nsum
i=1
[θ(i)]⊤lineari(λ)
= maxforallisimjλijisinRd
nsum
i=1
functioni(λ) = maxforallisimjλijisinRd
function(λ)
bull Accelerated gradient descent (Scaman et al 2017)
hArr alternating local gradient computations and a gossip step
ndashradic
κγ log 1ε gradient steps and
radic
κγ log 1ε communication steps
ndash Not optimal rArr need accelerated gossip
Accelerated gossip
bull Regular gossip
ndash Iterations θt = W tθ0
bull Accelerated gossip
ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)
ndash Shift-register gossip (Cao et al 2006)
ndash Linear combinations hArr ηt =t
sum
k=0
αkθk =t
sum
k=0
αkWkξ = Pt(W )ξ
ndash Optimal polynomial is the Chebyshev polynomial
ndash Can be computed online with same cost as regular gossip eg
θt = ωtWθtminus1 + (1minus ωt)θtminus1
ndash Replace γminus1 by γminus12 in rates
Accelerated gossip
bull Regular gossip
ndash Iterations θt = W tθ0
bull Accelerated gossip
ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)
ndash Shift-register gossip (Cao et al 2006)
ndash Linear combinations hArr ηt =t
sum
k=0
αkθk =t
sum
k=0
αkWkξ = Pt(W )ξ
ndash Optimal polynomial is the Chebyshev polynomial
ndash Can be computed online with same cost as regular gossip eg
θt = ωtWθtminus1 + (1minus ωt)θtminus1
ndash Replace γminus1 by γminus12 in rates
Illustration of accelerated gossip
![Page 75: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/75.jpg)
Dual reformulation (Jakovetic et al 2015)
minθisinRd
nsum
i=1
fi(θ) = minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) such that foralli sim j θ(i) = θ(j)
= minθ(1)θ(n)isinRd
maxforallisimjλijisinRd
nsum
i=1
fi(θ(i)) +
sum
isimj
λ⊤ij(θ
(i)minusθ(j))
Dual reformulation (Jakovetic et al 2015)
minθisinRd
nsum
i=1
fi(θ) = minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) such that foralli sim j θ(i) = θ(j)
= minθ(1)θ(n)isinRd
maxforallisimjλijisinRd
nsum
i=1
fi(θ(i)) +
sum
isimj
λ⊤ij(θ
(i)minusθ(j))
= maxforallisimjλijisinRd
minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) +
nsum
i=1
[θ(i)]⊤lineari(λ)
Dual reformulation (Jakovetic et al 2015)
minθisinRd
nsum
i=1
fi(θ) = minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) such that foralli sim j θ(i) = θ(j)
= minθ(1)θ(n)isinRd
maxforallisimjλijisinRd
nsum
i=1
fi(θ(i)) +
sum
isimj
λ⊤ij(θ
(i)minusθ(j))
= maxforallisimjλijisinRd
minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) +
nsum
i=1
[θ(i)]⊤lineari(λ)
= maxforallisimjλijisinRd
nsum
i=1
functioni(λ) = maxforallisimjλijisinRd
function(λ)
Dual reformulation (Jakovetic et al 2015)
minθisinRd
nsum
i=1
fi(θ) = minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) such that foralli sim j θ(i) = θ(j)
= minθ(1)θ(n)isinRd
maxforallisimjλijisinRd
nsum
i=1
fi(θ(i)) +
sum
isimj
λ⊤ij(θ
(i)minusθ(j))
= maxforallisimjλijisinRd
minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) +
nsum
i=1
[θ(i)]⊤lineari(λ)
= maxforallisimjλijisinRd
nsum
i=1
functioni(λ) = maxforallisimjλijisinRd
function(λ)
bull Accelerated gradient descent (Scaman et al 2017)
hArr alternating local gradient computations and a gossip step
ndashradic
κγ log 1ε gradient steps and
radic
κγ log 1ε communication steps
ndash Not optimal rArr need accelerated gossip
Accelerated gossip
bull Regular gossip
ndash Iterations θt = W tθ0
bull Accelerated gossip
ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)
ndash Shift-register gossip (Cao et al 2006)
ndash Linear combinations hArr ηt =t
sum
k=0
αkθk =t
sum
k=0
αkWkξ = Pt(W )ξ
ndash Optimal polynomial is the Chebyshev polynomial
ndash Can be computed online with same cost as regular gossip eg
θt = ωtWθtminus1 + (1minus ωt)θtminus1
ndash Replace γminus1 by γminus12 in rates
Accelerated gossip
bull Regular gossip
ndash Iterations θt = W tθ0
bull Accelerated gossip
ndash Chebyshev acceleration (Auzinger 2011 Arioli and Scott 2014)
ndash Shift-register gossip (Cao et al 2006)
ndash Linear combinations hArr ηt =t
sum
k=0
αkθk =t
sum
k=0
αkWkξ = Pt(W )ξ
ndash Optimal polynomial is the Chebyshev polynomial
ndash Can be computed online with same cost as regular gossip eg
θt = ωtWθtminus1 + (1minus ωt)θtminus1
ndash Replace γminus1 by γminus12 in rates
Illustration of accelerated gossip
![Page 76: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/76.jpg)
Dual reformulation (Jakovetic et al 2015)
minθisinRd
nsum
i=1
fi(θ) = minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) such that foralli sim j θ(i) = θ(j)
= minθ(1)θ(n)isinRd
maxforallisimjλijisinRd
nsum
i=1
fi(θ(i)) +
sum
isimj
λ⊤ij(θ
(i)minusθ(j))
= maxforallisimjλijisinRd
minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) +
nsum
i=1
[θ(i)]⊤lineari(λ)
Dual reformulation (Jakovetic et al 2015)
minθisinRd
nsum
i=1
fi(θ) = minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) such that foralli sim j θ(i) = θ(j)
= minθ(1)θ(n)isinRd
maxforallisimjλijisinRd
nsum
i=1
fi(θ(i)) +
sum
isimj
λ⊤ij(θ
(i)minusθ(j))
= maxforallisimjλijisinRd
minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) +
nsum
i=1
[θ(i)]⊤lineari(λ)
= maxforallisimjλijisinRd
nsum
i=1
functioni(λ) = maxforallisimjλijisinRd
function(λ)
Dual reformulation (Jakovetic et al 2015)
minθisinRd
nsum
i=1
fi(θ) = minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) such that foralli sim j θ(i) = θ(j)
= minθ(1)θ(n)isinRd
maxforallisimjλijisinRd
nsum
i=1
fi(θ(i)) +
sum
isimj
λ⊤ij(θ
(i)minusθ(j))
= maxforallisimjλijisinRd
minθ(1)θ(n)isinRd
nsum
i=1
fi(θ(i)) +
nsum
i=1
[θ(i)]⊤lineari(λ)
= maxforallisimjλijisinRd
nsum
i=1
functioni(λ) = maxforallisimjλijisinRd
function(λ)
bull Accelerated gradient descent (Scaman et al 2017)
hArr alternating local gradient computations and a gossip step
ndashradic
κγ log 1ε gradient steps and
radic
κγ log 1ε communication steps
ndash Not optimal rArr need accelerated gossip
Accelerated gossip

• Regular gossip
  – Iterations: θ_t = W^t θ_0
• Accelerated gossip
  – Chebyshev acceleration (Auzinger, 2011; Arioli and Scott, 2014)
  – Shift-register gossip (Cao et al., 2006)
  – Linear combinations ⇔ η_t = Σ_{k=0}^t α_k θ_k = Σ_{k=0}^t α_k W^k ξ = P_t(W) ξ
  – Optimal polynomial is the Chebyshev polynomial
  – Can be computed online with the same cost as regular gossip, e.g., θ_t = ω_t W θ_{t−1} + (1 − ω_t) θ_{t−2}
  – Replace γ^{−1} by γ^{−1/2} in rates
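The two schemes can be compared numerically. The sketch below assumes a synchronous ring of nodes and the constant-weight shift-register recursion, with ω set from the second eigenvalue of W (a common tuning in the accelerated-gossip literature; the exact schedules in the cited papers may differ). After the same number of multiplications by W, the accelerated iterate is far closer to the network average.

```python
import math

# Gossip on a ring of n = 10 nodes with W = I/2 + A/4 (doubly stochastic).
n = 10
x0 = [float(i) for i in range(n)]             # one value per node
mean = sum(x0) / n

def gossip_step(x):
    # One synchronous gossip round: x_i <- x_i/2 + (x_{i-1} + x_{i+1})/4
    return [0.5 * x[i] + 0.25 * x[(i - 1) % n] + 0.25 * x[(i + 1) % n]
            for i in range(n)]

def err(x):
    return max(abs(v - mean) for v in x)

# Regular gossip: theta_t = W^t theta_0, 50 rounds.
x = x0[:]
for _ in range(50):
    x = gossip_step(x)

# Shift-register accelerated gossip:
#   theta_t = omega * W theta_{t-1} + (1 - omega) * theta_{t-2}
lam2 = 0.5 + 0.5 * math.cos(2 * math.pi / n)      # second-largest eigenvalue of W
omega = 2.0 / (1.0 + math.sqrt(1.0 - lam2 ** 2))  # constant weight (assumed tuning)
prev, cur = x0[:], gossip_step(x0)
for _ in range(49):                                # 50 W-multiplications in total
    w_cur = gossip_step(cur)
    prev, cur = cur, [omega * w_cur[i] + (1 - omega) * prev[i] for i in range(n)]

print("regular:", err(x), "accelerated:", err(cur))
```

Both recursions preserve the network average (the weights ω and 1 − ω sum to one and W is doubly stochastic), so acceleration costs nothing in communication per round.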
Illustration of accelerated gossip
• ⇒ Optimal complexity for optimization (Scaman et al., 2017)
Distribution in machine learning (and beyond)

• Machine learning through optimization
  min_{θ∈R^d} (1/n) Σ_{i=1}^n f_i(θ) = g(θ)
  – f_i(θ): error of the model defined by θ on the dataset indexed by i
  – Example: f_i(θ) = (1/m_i) Σ_{j=1}^{m_i} ℓ(y_ij, θ^⊤ Φ(x_ij)) if m_i observations
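This finite-sum structure is what makes distribution possible: with equal-size local datasets, the average of the per-node gradients ∇f_i equals the full-batch gradient of g. A small self-contained check (toy synthetic data and the logistic loss; all names and sizes here are illustrative, not from the talk):

```python
import math
import random

random.seed(0)
# Synthetic dataset split across n nodes with m observations each (toy sizes).
n, m, d = 4, 5, 3
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n * m)]
y = [random.choice([-1.0, 1.0]) for _ in range(n * m)]
theta = [0.3, -0.1, 0.2]

def grad_logistic(rows, labels):
    # Gradient of (1/|rows|) * sum_j log(1 + exp(-y_j * theta^T x_j))
    g = [0.0] * d
    for xr, yy in zip(rows, labels):
        s = sum(t * xi for t, xi in zip(theta, xr))
        coef = -yy / (1.0 + math.exp(yy * s))
        for k in range(d):
            g[k] += coef * xr[k] / len(rows)
    return g

# Full-batch gradient of g(theta) = (1/n) sum_i f_i(theta) ...
full = grad_logistic(X, y)
# ... equals the average of the per-node gradients nabla f_i(theta).
per_node = [grad_logistic(X[i * m:(i + 1) * m], y[i * m:(i + 1) * m])
            for i in range(n)]
avg = [sum(g[k] for g in per_node) / n for k in range(d)]
```

Each node only touches its own m observations, so the expensive part of a gradient step parallelizes exactly across the network.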
• Single machine vs. "optimal" decentralized algorithm

  Algorithm                      | gradient steps | communication
  Single machine algorithm       | nm + √(nmκ)    | 0
  MSDA (Scaman et al., 2017)     | m√κ            | √(κ/γ)
• MSDA (Scaman et al., 2017)
  – √κ log(1/ε) gradient steps and √(κ/γ) log(1/ε) communication steps
  – "Optimal", but still not adapted to machine learning
  – Huge slowdown when going from 1 to 2 machines
  – Only synchronous
Decentralized algorithms for machine learning (Hendrikx, Bach, and Massoulié, 2019)

• Trade-offs between gradient and communication steps
  – Adapted to functions of the type f_i(θ) = (1/m) Σ_{j=1}^m ℓ(y_ij, θ^⊤ Φ(x_ij))
  – Allows for partial asynchrony
• n computing nodes with m observations each

  Algorithm                      | gradient steps | communication
  Single machine algorithm       | nm + √(nmκ)    | 0
  MSDA (Scaman et al., 2017)     | m√κ            | √(κ/γ)
  ADFS (Hendrikx et al., 2019)   | m + √(mκ)      | √(κ/γ)
ADFS - Algorithm principle

• Minimizing Σ_{i=1}^n Σ_{j=1}^m f_ij(θ) + (σ_i/2) ‖θ‖²
  – Create an equivalent graph
  – Dual randomized coordinate ascent (with non-uniform sampling)
  – Decoupling of data and gossip steps
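The "dual randomized coordinate ascent" idea can be sketched on the simple quadratic consensus problem used earlier (this is an illustration of the principle, not the actual ADFS algorithm): at each step, sample one dual edge variable and maximize the dual exactly in that coordinate, which only involves the two incident nodes.

```python
import random

random.seed(0)
# Toy consensus problem: f_i(theta) = (theta - c_i)^2 / 2 on the path 0-1-2-3-4.
c = [1.0, 4.0, 2.0, 8.0, 5.0]
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
lam = [0.0] * len(edges)

def primal(lam):
    # Closed-form inner minimization: theta_i = c_i - (B lam)_i,
    # with B the signed node-edge incidence matrix.
    g = [0.0] * len(c)
    for e, (i, j) in enumerate(edges):
        g[i] += lam[e]
        g[j] -= lam[e]
    return [c[k] - g[k] for k in range(len(c))]

for _ in range(3000):
    e = random.randrange(len(edges))          # sample one dual coordinate
    i, j = edges[e]
    theta = primal(lam)
    # Exact maximization over lam[e]: the dual is quadratic in this coordinate
    # with curvature -(B^T B)_ee = -2 and gradient theta_i - theta_j.
    lam[e] += (theta[i] - theta[j]) / 2.0

theta = primal(lam)    # consensus at the average of c
```

Because a coordinate update touches a single edge, updates on disjoint edges can proceed in parallel, which is the intuition behind the partial asynchrony mentioned above.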
Decentralized algorithms for machine learning (Hendrikx, Bach, and Massoulié, 2019)

• Running times on an actual cluster
  – Logistic regression with m = 10^4 observations per node in R^28
  – Two-dimensional grid network
  – [Plots: n = 4 and n = 100]
Decentralized algorithms for machine learning (Hendrikx, Bach, and Massoulié, 2019)

• Running times on an actual cluster
  – Logistic regression with nm ≈ 10^5 observations in R^47236
  – Two-dimensional grid network with n = 100 nodes
Conclusions

• Distributed decentralized machine learning
  – Distributing the fastest single-machine algorithms
  – n machines and m observations per machine
  – From nm + √(nmκ) (single machine) to m + √(mκ) gradient steps
  – Linear speed-ups for well-conditioned problems
• Extensions
  – Beyond convex problems
  – Matching running-time complexity lower bounds
  – Experiments on large-scale clouds
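As a back-of-the-envelope check of the speed-up claim, one can plug illustrative values (hypothetical sizes, chosen only for the arithmetic) into the two gradient-step counts above:

```python
import math

# Hypothetical sizes: n machines, m observations per machine, condition number kappa.
n, m, kappa = 100, 10_000, 1_000.0
single = n * m + math.sqrt(n * m * kappa)   # single machine: nm + sqrt(nm * kappa)
distributed = m + math.sqrt(m * kappa)      # decentralized:  m + sqrt(m * kappa)
speedup = single / distributed              # roughly n for well-conditioned problems

print(f"{single:.0f} vs {distributed:.0f} gradient steps, speedup ~{speedup:.0f}x")
```

For mildly conditioned problems (κ small relative to nm) the first terms dominate and the speed-up approaches n, which is the linear speed-up stated in the conclusions.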
References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591-608, 2014.
W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.
L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems (NIPS), 2008.
Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508-2530, 2006.
Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952-959, 2006.
Igor Colin, Aurélien Bellet, Joseph Salmon, and Stéphan Clémençon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388-1396, 2016.
Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.
John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592-606, 2012.
Hadrien Hendrikx, Francis Bach, and Laurent Massoulié. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.
Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In 52nd IEEE Conference on Decision and Control (CDC), pages 3671-3676, 2013.
Dušan Jakovetić, José M. F. Moura, and João Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922-936, 2015.
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.
G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.
A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507-522, 2016.
Angelia Nedić and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48-61, 2009.
A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.
Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k²). Soviet Math. Doklady, 269(3):543-547, 1983.
Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.
Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027-3036, 2017.
Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944-966, 2015.
Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st IEEE Conference on Decision and Control (CDC), pages 5445-5450, 2012.
L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.
![Page 83: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/83.jpg)
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Single machine vs ldquooptimalrdquo decentralized algorithm
Algorithm gradient steps communication
Single machine algorithm nm+radicnmκ 0
MSDA (Scaman et al 2017) mradicκ
radic
κγ
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull MSDA (Scaman et al 2017)
ndashradicκ log 1
ε gradient steps andradic
κγ log 1ε communication steps
ndash ldquoOptimalrdquo but still not adapted to machine learning
ndash Huge slow down when going from 1 to 2 machines
ndash Only synchronous
Decentralized algorithms for machine learning
(Hendrikx Bach and Massoulie 2019)
bull Trade-offs between gradient and communication steps
ndash Adapted to functions of the type fi(θ) =1
m
msum
j=1
ℓ(yij θ⊤Φ(xij))
ndash Allows for partial asynchrony
bull n computing nodes with m observations each
Algorithm gradient steps communication
Single machine algorithm nm+radicnmκ 0
MSDA (Scaman et al 2017) mradicκ
radic
κγ
ADFS (Hendrikx et al 2019) m+radicmκ
radic
κγ
ADFS - Algorithm principle
bull Minimizingnsum
i=1
msum
j=1
fij(θ) +σi
2θ2
ndash Create an equivalent graph
ndash Dual randomized coordinate ascent (with non uniform sampling)
ndash Decoupling of data and gossip steps
Decentralized algorithms for machine learning
(Hendrikx Bach and Massoulie 2019)
bull Running times on an actual cluster
ndash Logistic regression with m = 104 observations per node in R28
ndash Two-dimensional grid network
n = 4 n = 100
Decentralized algorithms for machine learning
(Hendrikx Bach and Massoulie 2019)
bull Running times on an actual cluster
ndash Logistic regression with mn asymp 105 observations in R47236
ndash Two-dimensional grid network with n = 100 nodes
Conclusions
bull Distributed decentralized machine learning
ndash Distributing the fastest single machine algorithms
ndash n machines and m observations per machine
ndash From nm+radicnmκ (single machine) to m+
radicmκ gradient steps
ndash Linear speed-ups for well-conditioned problems
Conclusions
bull Distributed decentralized machine learning
ndash Distributing the fastest single machine algorithms
ndash n machines and m observations per machine
ndash From nm+radicnmκ (single machine) to m+
radicmκ gradient steps
ndash Linear speed-ups for well-conditioned problems
bull Extensions
ndash Beyond convex problems
ndash Matching running time complexity lower bounds
ndash Experiments on large-scale clouds
References
M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)
591ndash608 2014
W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011
L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008
Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms
IEEE transactions on information theory 52(6)2508ndash2530 2006
Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed
computation In 44th Annual Allerton Conference on Communication Control and Computation
pages 952ndash959 2006
Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for
decentralized optimization of pairwise functions In International Conference on Machine Learning
pages 1388ndash1396 2016
Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method
with support for non-strongly convex composite objectives In Advances in Neural Information
Processing Systems 2014
John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization
Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606
2012
Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic
gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019
Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed
optimization using a randomized alternating direction method of multipliers In Decision and
Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013
Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed
augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015
Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance
reduction In Advances in Neural Information Processing Systems 2013
G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv
2015
N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence
rate for strongly-convex optimization with finite training sets In Advances in Neural Information
Processing Systems (NIPS) 2012
A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact
linear convergence rate for consensus optimization IEEE Transactions on Signal and Information
Processing over Networks 2(4)507ndash522 2016
Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization
IEEE Transactions on Automatic Control 54(1)48ndash61 2009
A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization
over time-varying graphs ArXiv e-prints 2016
Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)
Soviet Math Doklady 269(3)543ndash547 1983
Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004
Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal
algorithms for smooth and strongly convex distributed optimization in networks In International
Conference on Machine Learning pages 3027ndash3036 2017
Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized
consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015
Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st
Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012
L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of
full gradients In Advances in Neural Information Processing Systems 2013
![Page 84: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/84.jpg)
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull Single machine vs ldquooptimalrdquo decentralized algorithm
Algorithm gradient steps communication
Single machine algorithm nm+radicnmκ 0
MSDA (Scaman et al 2017) mradicκ
radic
κγ
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull MSDA (Scaman et al 2017)
ndashradicκ log 1
ε gradient steps andradic
κγ log 1ε communication steps
ndash ldquoOptimalrdquo but still not adapted to machine learning
ndash Huge slow down when going from 1 to 2 machines
ndash Only synchronous
Decentralized algorithms for machine learning
(Hendrikx Bach and Massoulie 2019)
bull Trade-offs between gradient and communication steps
ndash Adapted to functions of the type fi(θ) =1
m
msum
j=1
ℓ(yij θ⊤Φ(xij))
ndash Allows for partial asynchrony
bull n computing nodes with m observations each
Algorithm gradient steps communication
Single machine algorithm nm+radicnmκ 0
MSDA (Scaman et al 2017) mradicκ
radic
κγ
ADFS (Hendrikx et al 2019) m+radicmκ
radic
κγ
ADFS - Algorithm principle
bull Minimizingnsum
i=1
msum
j=1
fij(θ) +σi
2θ2
ndash Create an equivalent graph
ndash Dual randomized coordinate ascent (with non uniform sampling)
ndash Decoupling of data and gossip steps
Decentralized algorithms for machine learning
(Hendrikx Bach and Massoulie 2019)
bull Running times on an actual cluster
ndash Logistic regression with m = 104 observations per node in R28
ndash Two-dimensional grid network
n = 4 n = 100
Decentralized algorithms for machine learning
(Hendrikx Bach and Massoulie 2019)
bull Running times on an actual cluster
ndash Logistic regression with mn asymp 105 observations in R47236
ndash Two-dimensional grid network with n = 100 nodes
Conclusions
bull Distributed decentralized machine learning
ndash Distributing the fastest single machine algorithms
ndash n machines and m observations per machine
ndash From nm+radicnmκ (single machine) to m+
radicmκ gradient steps
ndash Linear speed-ups for well-conditioned problems
Conclusions
bull Distributed decentralized machine learning
ndash Distributing the fastest single machine algorithms
ndash n machines and m observations per machine
ndash From nm+radicnmκ (single machine) to m+
radicmκ gradient steps
ndash Linear speed-ups for well-conditioned problems
bull Extensions
ndash Beyond convex problems
ndash Matching running time complexity lower bounds
ndash Experiments on large-scale clouds
References
M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)
591ndash608 2014
W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011
L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008
Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms
IEEE transactions on information theory 52(6)2508ndash2530 2006
Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed
computation In 44th Annual Allerton Conference on Communication Control and Computation
pages 952ndash959 2006
Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for
decentralized optimization of pairwise functions In International Conference on Machine Learning
pages 1388ndash1396 2016
Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method
with support for non-strongly convex composite objectives In Advances in Neural Information
Processing Systems 2014
John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization
Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606
2012
Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic
gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019
Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed
optimization using a randomized alternating direction method of multipliers In Decision and
Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013
Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed
augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015
Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance
reduction In Advances in Neural Information Processing Systems 2013
G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv
2015
N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence
rate for strongly-convex optimization with finite training sets In Advances in Neural Information
Processing Systems (NIPS) 2012
A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact
linear convergence rate for consensus optimization IEEE Transactions on Signal and Information
Processing over Networks 2(4)507ndash522 2016
Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization
IEEE Transactions on Automatic Control 54(1)48ndash61 2009
A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization
over time-varying graphs ArXiv e-prints 2016
Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)
Soviet Math Doklady 269(3)543ndash547 1983
Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004
Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal
algorithms for smooth and strongly convex distributed optimization in networks In International
Conference on Machine Learning pages 3027ndash3036 2017
Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized
consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015
Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st
Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012
L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of
full gradients In Advances in Neural Information Processing Systems 2013
![Page 85: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/85.jpg)
Distribution in machine learning (and beyond)
bull Machine learning through optimization
minθisinRd
1
n
nsum
i=1
fi(θ) = g(θ)
ndash fi(θ) error of model defined by θ on dataset indexed by i
ndash Example fi(θ) =1
mi
misum
j=1
ℓ(yij θ⊤Φ(xij)) if mi observations
bull MSDA (Scaman et al 2017)
ndashradicκ log 1
ε gradient steps andradic
κγ log 1ε communication steps
ndash ldquoOptimalrdquo but still not adapted to machine learning
ndash Huge slow down when going from 1 to 2 machines
ndash Only synchronous
Decentralized algorithms for machine learning
(Hendrikx Bach and Massoulie 2019)
bull Trade-offs between gradient and communication steps
ndash Adapted to functions of the type fi(θ) =1
m
msum
j=1
ℓ(yij θ⊤Φ(xij))
ndash Allows for partial asynchrony
bull n computing nodes with m observations each
Algorithm gradient steps communication
Single machine algorithm nm+radicnmκ 0
MSDA (Scaman et al 2017) mradicκ
radic
κγ
ADFS (Hendrikx et al 2019) m+radicmκ
radic
κγ
ADFS - Algorithm principle
bull Minimizingnsum
i=1
msum
j=1
fij(θ) +σi
2θ2
ndash Create an equivalent graph
ndash Dual randomized coordinate ascent (with non uniform sampling)
ndash Decoupling of data and gossip steps
Decentralized algorithms for machine learning
(Hendrikx Bach and Massoulie 2019)
bull Running times on an actual cluster
ndash Logistic regression with m = 104 observations per node in R28
ndash Two-dimensional grid network
n = 4 n = 100
Decentralized algorithms for machine learning
(Hendrikx Bach and Massoulie 2019)
bull Running times on an actual cluster
ndash Logistic regression with mn asymp 105 observations in R47236
ndash Two-dimensional grid network with n = 100 nodes
Conclusions
bull Distributed decentralized machine learning
ndash Distributing the fastest single machine algorithms
ndash n machines and m observations per machine
ndash From nm+radicnmκ (single machine) to m+
radicmκ gradient steps
ndash Linear speed-ups for well-conditioned problems
Conclusions
bull Distributed decentralized machine learning
ndash Distributing the fastest single machine algorithms
ndash n machines and m observations per machine
ndash From nm+radicnmκ (single machine) to m+
radicmκ gradient steps
ndash Linear speed-ups for well-conditioned problems
bull Extensions
ndash Beyond convex problems
ndash Matching running time complexity lower bounds
ndash Experiments on large-scale clouds
References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.
W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.
L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems (NIPS), 2008.
Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.
Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.
Igor Colin, Aurélien Bellet, Joseph Salmon, and Stéphan Clémençon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.
Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.
John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.
Hadrien Hendrikx, Francis Bach, and Laurent Massoulié. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical report, arXiv:1901.09865, 2019.
Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In 52nd IEEE Conference on Decision and Control (CDC), pages 3671–3676, 2013.
Dušan Jakovetić, José M. F. Moura, and João Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.
G. Lan. An optimal randomized incremental gradient method. Technical report, arXiv:1507.02000, 2015.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.
A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.
Angelia Nedić and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. arXiv e-prints, 2016.
Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k²). Soviet Math. Doklady, 269(3):543–547, 1983.
Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.
Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.
Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st IEEE Conference on Decision and Control (CDC), pages 5445–5450, 2012.
L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.
![Page 86: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/86.jpg)
Decentralized algorithms for machine learning
(Hendrikx Bach and Massoulie 2019)
bull Trade-offs between gradient and communication steps
ndash Adapted to functions of the type fi(θ) =1
m
msum
j=1
ℓ(yij θ⊤Φ(xij))
ndash Allows for partial asynchrony
bull n computing nodes with m observations each
Algorithm gradient steps communication
Single machine algorithm nm+radicnmκ 0
MSDA (Scaman et al 2017) mradicκ
radic
κγ
ADFS (Hendrikx et al 2019) m+radicmκ
radic
κγ
ADFS - Algorithm principle
bull Minimizingnsum
i=1
msum
j=1
fij(θ) +σi
2θ2
ndash Create an equivalent graph
ndash Dual randomized coordinate ascent (with non uniform sampling)
ndash Decoupling of data and gossip steps
Decentralized algorithms for machine learning
(Hendrikx Bach and Massoulie 2019)
bull Running times on an actual cluster
ndash Logistic regression with m = 104 observations per node in R28
ndash Two-dimensional grid network
n = 4 n = 100
Decentralized algorithms for machine learning
(Hendrikx Bach and Massoulie 2019)
bull Running times on an actual cluster
ndash Logistic regression with mn asymp 105 observations in R47236
ndash Two-dimensional grid network with n = 100 nodes
Conclusions
bull Distributed decentralized machine learning
ndash Distributing the fastest single machine algorithms
ndash n machines and m observations per machine
ndash From nm+radicnmκ (single machine) to m+
radicmκ gradient steps
ndash Linear speed-ups for well-conditioned problems
Conclusions
bull Distributed decentralized machine learning
ndash Distributing the fastest single machine algorithms
ndash n machines and m observations per machine
ndash From nm+radicnmκ (single machine) to m+
radicmκ gradient steps
ndash Linear speed-ups for well-conditioned problems
bull Extensions
ndash Beyond convex problems
ndash Matching running time complexity lower bounds
ndash Experiments on large-scale clouds
References
M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)
591ndash608 2014
W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011
L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008
Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms
IEEE transactions on information theory 52(6)2508ndash2530 2006
Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed
computation In 44th Annual Allerton Conference on Communication Control and Computation
pages 952ndash959 2006
Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for
decentralized optimization of pairwise functions In International Conference on Machine Learning
pages 1388ndash1396 2016
Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method
with support for non-strongly convex composite objectives In Advances in Neural Information
Processing Systems 2014
John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization
Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606
2012
Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic
gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019
Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed
optimization using a randomized alternating direction method of multipliers In Decision and
Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013
Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed
augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015
Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance
reduction In Advances in Neural Information Processing Systems 2013
G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv
2015
N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence
rate for strongly-convex optimization with finite training sets In Advances in Neural Information
Processing Systems (NIPS) 2012
A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact
linear convergence rate for consensus optimization IEEE Transactions on Signal and Information
Processing over Networks 2(4)507ndash522 2016
Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization
IEEE Transactions on Automatic Control 54(1)48ndash61 2009
A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization
over time-varying graphs ArXiv e-prints 2016
Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)
Soviet Math Doklady 269(3)543ndash547 1983
Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004
Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal
algorithms for smooth and strongly convex distributed optimization in networks In International
Conference on Machine Learning pages 3027ndash3036 2017
Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized
consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015
Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st
Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012
L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of
full gradients In Advances in Neural Information Processing Systems 2013
![Page 87: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/87.jpg)
ADFS - Algorithm principle
bull Minimizingnsum
i=1
msum
j=1
fij(θ) +σi
2θ2
ndash Create an equivalent graph
ndash Dual randomized coordinate ascent (with non uniform sampling)
ndash Decoupling of data and gossip steps
Decentralized algorithms for machine learning
(Hendrikx Bach and Massoulie 2019)
bull Running times on an actual cluster
ndash Logistic regression with m = 104 observations per node in R28
ndash Two-dimensional grid network
n = 4 n = 100
Decentralized algorithms for machine learning
(Hendrikx Bach and Massoulie 2019)
bull Running times on an actual cluster
ndash Logistic regression with mn asymp 105 observations in R47236
ndash Two-dimensional grid network with n = 100 nodes
Conclusions
bull Distributed decentralized machine learning
ndash Distributing the fastest single machine algorithms
ndash n machines and m observations per machine
ndash From nm+radicnmκ (single machine) to m+
radicmκ gradient steps
ndash Linear speed-ups for well-conditioned problems
Conclusions
bull Distributed decentralized machine learning
ndash Distributing the fastest single machine algorithms
ndash n machines and m observations per machine
ndash From nm+radicnmκ (single machine) to m+
radicmκ gradient steps
ndash Linear speed-ups for well-conditioned problems
bull Extensions
ndash Beyond convex problems
ndash Matching running time complexity lower bounds
ndash Experiments on large-scale clouds
References
M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)
591ndash608 2014
W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011
L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008
Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms
IEEE transactions on information theory 52(6)2508ndash2530 2006
Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed
computation In 44th Annual Allerton Conference on Communication Control and Computation
pages 952ndash959 2006
Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for
decentralized optimization of pairwise functions In International Conference on Machine Learning
pages 1388ndash1396 2016
Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method
with support for non-strongly convex composite objectives In Advances in Neural Information
Processing Systems 2014
John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization
Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606
2012
Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic
gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019
Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed
optimization using a randomized alternating direction method of multipliers In Decision and
Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013
Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed
augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015
Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance
reduction In Advances in Neural Information Processing Systems 2013
G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv
2015
N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence
rate for strongly-convex optimization with finite training sets In Advances in Neural Information
Processing Systems (NIPS) 2012
A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact
linear convergence rate for consensus optimization IEEE Transactions on Signal and Information
Processing over Networks 2(4)507ndash522 2016
Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization
IEEE Transactions on Automatic Control 54(1)48ndash61 2009
A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization
over time-varying graphs ArXiv e-prints 2016
Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)
Soviet Math Doklady 269(3)543ndash547 1983
Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004
Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal
algorithms for smooth and strongly convex distributed optimization in networks In International
Conference on Machine Learning pages 3027ndash3036 2017
Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized
consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015
Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st
Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012
L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of
full gradients In Advances in Neural Information Processing Systems 2013
![Page 88: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/88.jpg)
Decentralized algorithms for machine learning
(Hendrikx Bach and Massoulie 2019)
bull Running times on an actual cluster
ndash Logistic regression with m = 104 observations per node in R28
ndash Two-dimensional grid network
n = 4 n = 100
Decentralized algorithms for machine learning
(Hendrikx Bach and Massoulie 2019)
bull Running times on an actual cluster
ndash Logistic regression with mn asymp 105 observations in R47236
ndash Two-dimensional grid network with n = 100 nodes
Conclusions
bull Distributed decentralized machine learning
ndash Distributing the fastest single machine algorithms
ndash n machines and m observations per machine
ndash From nm+radicnmκ (single machine) to m+
radicmκ gradient steps
ndash Linear speed-ups for well-conditioned problems
Conclusions
bull Distributed decentralized machine learning
ndash Distributing the fastest single machine algorithms
ndash n machines and m observations per machine
ndash From nm+radicnmκ (single machine) to m+
radicmκ gradient steps
ndash Linear speed-ups for well-conditioned problems
bull Extensions
ndash Beyond convex problems
ndash Matching running time complexity lower bounds
ndash Experiments on large-scale clouds
References
M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)
591ndash608 2014
W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011
L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008
Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms
IEEE transactions on information theory 52(6)2508ndash2530 2006
Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed
computation In 44th Annual Allerton Conference on Communication Control and Computation
pages 952ndash959 2006
Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for
decentralized optimization of pairwise functions In International Conference on Machine Learning
pages 1388ndash1396 2016
Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method
with support for non-strongly convex composite objectives In Advances in Neural Information
Processing Systems 2014
John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization
Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606
2012
Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic
gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019
Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed
optimization using a randomized alternating direction method of multipliers In Decision and
Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013
Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed
augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015
Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance
reduction In Advances in Neural Information Processing Systems 2013
G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv
2015
N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence
rate for strongly-convex optimization with finite training sets In Advances in Neural Information
Processing Systems (NIPS) 2012
A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact
linear convergence rate for consensus optimization IEEE Transactions on Signal and Information
Processing over Networks 2(4)507ndash522 2016
Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization
IEEE Transactions on Automatic Control 54(1)48ndash61 2009
A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization
over time-varying graphs ArXiv e-prints 2016
Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)
Soviet Math Doklady 269(3)543ndash547 1983
Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004
Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal
algorithms for smooth and strongly convex distributed optimization in networks In International
Conference on Machine Learning pages 3027ndash3036 2017
Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized
consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015
Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st
Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012
L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of
full gradients In Advances in Neural Information Processing Systems 2013
![Page 89: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/89.jpg)
Decentralized algorithms for machine learning
(Hendrikx Bach and Massoulie 2019)
bull Running times on an actual cluster
ndash Logistic regression with mn asymp 105 observations in R47236
ndash Two-dimensional grid network with n = 100 nodes
Conclusions
bull Distributed decentralized machine learning
ndash Distributing the fastest single machine algorithms
ndash n machines and m observations per machine
ndash From nm+radicnmκ (single machine) to m+
radicmκ gradient steps
ndash Linear speed-ups for well-conditioned problems
Conclusions
bull Distributed decentralized machine learning
ndash Distributing the fastest single machine algorithms
ndash n machines and m observations per machine
ndash From nm+radicnmκ (single machine) to m+
radicmκ gradient steps
ndash Linear speed-ups for well-conditioned problems
bull Extensions
ndash Beyond convex problems
ndash Matching running time complexity lower bounds
ndash Experiments on large-scale clouds
References
M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)
591ndash608 2014
W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011
L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008
Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms
IEEE transactions on information theory 52(6)2508ndash2530 2006
Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed
computation In 44th Annual Allerton Conference on Communication Control and Computation
pages 952ndash959 2006
Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for
decentralized optimization of pairwise functions In International Conference on Machine Learning
pages 1388ndash1396 2016
Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method
with support for non-strongly convex composite objectives In Advances in Neural Information
Processing Systems 2014
John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization
Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606
2012
Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic
gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019
Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed
optimization using a randomized alternating direction method of multipliers In Decision and
Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013
Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed
augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015
Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance
reduction In Advances in Neural Information Processing Systems 2013
G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv
2015
N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence
rate for strongly-convex optimization with finite training sets In Advances in Neural Information
Processing Systems (NIPS) 2012
A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact
linear convergence rate for consensus optimization IEEE Transactions on Signal and Information
Processing over Networks 2(4)507ndash522 2016
Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization
IEEE Transactions on Automatic Control 54(1)48ndash61 2009
A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization
over time-varying graphs ArXiv e-prints 2016
Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)
Soviet Math Doklady 269(3)543ndash547 1983
Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004
Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal
algorithms for smooth and strongly convex distributed optimization in networks In International
Conference on Machine Learning pages 3027ndash3036 2017
Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized
consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015
Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st
Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012
L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of
full gradients In Advances in Neural Information Processing Systems 2013
![Page 90: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/90.jpg)
Conclusions
bull Distributed decentralized machine learning
ndash Distributing the fastest single machine algorithms
ndash n machines and m observations per machine
ndash From nm+radicnmκ (single machine) to m+
radicmκ gradient steps
ndash Linear speed-ups for well-conditioned problems
Conclusions
bull Distributed decentralized machine learning
ndash Distributing the fastest single machine algorithms
ndash n machines and m observations per machine
ndash From nm+radicnmκ (single machine) to m+
radicmκ gradient steps
ndash Linear speed-ups for well-conditioned problems
bull Extensions
ndash Beyond convex problems
ndash Matching running time complexity lower bounds
ndash Experiments on large-scale clouds
References
M Arioli and J Scott Chebyshev acceleration of iterative refinement Numerical Algorithms 66(3)
591ndash608 2014
W Auzinger Iterative Solution of Large Linear Systems Lecture notes TU Wien 2011
L Bottou and O Bousquet The tradeoffs of large scale learning In Adv NIPS 2008
Stephen Boyd Arpita Ghosh Balaji Prabhakar and Devavrat Shah Randomized gossip algorithms
IEEE transactions on information theory 52(6)2508ndash2530 2006
Ming Cao Daniel A Spielman and Edmund M Yeh Accelerated gossip algorithms for distributed
computation In 44th Annual Allerton Conference on Communication Control and Computation
pages 952ndash959 2006
Igor Colin Aurelien Bellet Joseph Salmon and Stephan Clemencon Gossip dual averaging for
decentralized optimization of pairwise functions In International Conference on Machine Learning
pages 1388ndash1396 2016
Aaron Defazio Francis Bach and Simon Lacoste-Julien SAGA A fast incremental gradient method
with support for non-strongly convex composite objectives In Advances in Neural Information
Processing Systems 2014
John C Duchi Alekh Agarwal and Martin J Wainwright Dual averaging for distributed optimization
Convergence analysis and network scaling IEEE Transactions on Automatic control 57(3)592ndash606
2012
Hadrien Hendrikx Francis Bach and Laurent Massoulie Asynchronous accelerated proximal stochastic
gradient for strongly convex distributed finite sums Technical Report 190109865 arXiv 2019
Franck Iutzeler Pascal Bianchi Philippe Ciblat and Walid Hachem Asynchronous distributed
optimization using a randomized alternating direction method of multipliers In Decision and
Control (CDC) 2013 IEEE 52nd Annual Conference on pages 3671ndash3676 IEEE 2013
Dusan Jakovetic Jose MF Moura and Joao Xavier Linear convergence rate of a class of distributed
augmented lagrangian algorithms IEEE Transactions on Automatic Control 60(4)922ndash936 2015
Rie Johnson and Tong Zhang Accelerating stochastic gradient descent using predictive variance
reduction In Advances in Neural Information Processing Systems 2013
G Lan An optimal randomized incremental gradient method Technical Report 150702000 arXiv
2015
N Le Roux M Schmidt and F Bach A stochastic gradient method with an exponential convergence
rate for strongly-convex optimization with finite training sets In Advances in Neural Information
Processing Systems (NIPS) 2012
A Mokhtari W Shi Q Ling and A Ribeiro A decentralized second-order method with exact
linear convergence rate for consensus optimization IEEE Transactions on Signal and Information
Processing over Networks 2(4)507ndash522 2016
Angelia Nedic and Asuman Ozdaglar Distributed subgradient methods for multi-agent optimization
IEEE Transactions on Automatic Control 54(1)48ndash61 2009
A Nedich A Olshevsky and W Shi Achieving geometric convergence for distributed optimization
over time-varying graphs ArXiv e-prints 2016
Y Nesterov A method for solving a convex programming problem with rate of convergence O(1k2)
Soviet Math Doklady 269(3)543ndash547 1983
Y Nesterov Introductory lectures on convex optimization a basic course Kluwer 2004
Kevin Scaman Francis Bach Sebastien Bubeck Yin Tat Lee and Laurent Massoulie Optimal
algorithms for smooth and strongly convex distributed optimization in networks In International
Conference on Machine Learning pages 3027ndash3036 2017
Wei Shi Qing Ling Gang Wu and Wotao Yin EXTRA An exact first-order algorithm for decentralized
consensus optimization SIAM Journal on Optimization 25(2)944ndash966 2015
Ermin Wei and Asuman Ozdaglar Distributed alternating direction method of multipliers In 51st
Annual Conference on Decision and Control (CDC) pages 5445ndash5450 IEEE 2012
L Zhang M Mahdavi and R Jin Linear convergence with condition number independent access of
full gradients In Advances in Neural Information Processing Systems 2013
![Page 91: Large Scale Machine Learning Over Networks · Large Scale Machine Learning Over Networks Francis Bach INRIA - Ecole Normale Sup´erieure, Paris, France ÉCOLENORMALE SUPÉRIEURE Joint](https://reader035.vdocuments.us/reader035/viewer/2022081405/5f082a3a7e708231d420a946/html5/thumbnails/91.jpg)
Conclusions
bull Distributed decentralized machine learning
ndash Distributing the fastest single machine algorithms
ndash n machines and m observations per machine
ndash From nm+radicnmκ (single machine) to m+
radicmκ gradient steps
ndash Linear speed-ups for well-conditioned problems
bull Extensions
ndash Beyond convex problems
ndash Matching running time complexity lower bounds
ndash Experiments on large-scale clouds
References

M. Arioli and J. Scott. Chebyshev acceleration of iterative refinement. Numerical Algorithms, 66(3):591–608, 2014.
W. Auzinger. Iterative Solution of Large Linear Systems. Lecture notes, TU Wien, 2011.
L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Adv. NIPS, 2008.
Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, 2006.
Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In 44th Annual Allerton Conference on Communication, Control, and Computation, pages 952–959, 2006.
Igor Colin, Aurélien Bellet, Joseph Salmon, and Stéphan Clémençon. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pages 1388–1396, 2016.
Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 2014.
John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.
Hadrien Hendrikx, Francis Bach, and Laurent Massoulié. Asynchronous accelerated proximal stochastic gradient for strongly convex distributed finite sums. Technical Report 1901.09865, arXiv, 2019.
Franck Iutzeler, Pascal Bianchi, Philippe Ciblat, and Walid Hachem. Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, pages 3671–3676. IEEE, 2013.
Dušan Jakovetić, José M. F. Moura, and João Xavier. Linear convergence rate of a class of distributed augmented Lagrangian algorithms. IEEE Transactions on Automatic Control, 60(4):922–936, 2015.
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.
G. Lan. An optimal randomized incremental gradient method. Technical Report 1507.02000, arXiv, 2015.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems (NIPS), 2012.
A. Mokhtari, W. Shi, Q. Ling, and A. Ribeiro. A decentralized second-order method with exact linear convergence rate for consensus optimization. IEEE Transactions on Signal and Information Processing over Networks, 2(4):507–522, 2016.
Angelia Nedić and Asuman Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.
A. Nedich, A. Olshevsky, and W. Shi. Achieving geometric convergence for distributed optimization over time-varying graphs. ArXiv e-prints, 2016.
Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k^2). Soviet Math. Doklady, 269(3):543–547, 1983.
Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2004.
Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In International Conference on Machine Learning, pages 3027–3036, 2017.
Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015.
Ermin Wei and Asuman Ozdaglar. Distributed alternating direction method of multipliers. In 51st Annual Conference on Decision and Control (CDC), pages 5445–5450. IEEE, 2012.
L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In Advances in Neural Information Processing Systems, 2013.