

Breaking the Nonsmooth Barrier: A Scalable Parallel Method for Composite Optimization

Fabian Pedregosa†], Rémi Leblond†, Simon Lacoste-Julien?

†INRIA and École Normale Supérieure, Paris, France. ]Currently at UC Berkeley. ?MILA and DIRO, Université de Montréal, Canada.

Summary

Optimization methods need to be adapted to the parallel setting to leverage modern computer architectures.

Highly efficient variants of stochastic gradient descent have been recently proposed, such as Hogwild [1], Kromagnon [2], and ASAGA [3].

They assume that the objective function is smooth, and so are inapplicable to problems such as the Lasso, optimization with convex constraints, etc.

Main contributions:

1. Sparse Proximal SAGA, a sparse variant of the linearly-convergent proximal SAGA algorithm.

2. ProxASAGA, the first parallel asynchronous variance-reduced method that supports nonsmooth composite objective functions.

Problem Setting

Objective: develop a parallel asynchronous method for problems of the form

\[
\underset{x \in \mathbb{R}^p}{\text{minimize}} \;\; \frac{1}{n} \sum_{i=1}^{n} f_i(x) + h(x) ,
\]

• f_i is differentiable with L-Lipschitz gradient.

• h is block-separable (h(x) = Σ_B h_B([x]_B)) and "simple" in the sense that we have access to
\[
\operatorname{prox}_{\gamma h}(z) \stackrel{\text{def}}{=} \arg\min_x \; \gamma h(x) + \tfrac{1}{2}\|x - z\|^2 .
\]

⟹ includes the Lasso, group Lasso, or box constraints (see the prox sketch below).
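For concreteness, a minimal NumPy sketch (not from the poster; the function names are illustrative) of two such "simple" proximal operators: soft-thresholding for the ℓ1 norm used by the Lasso, and Euclidean projection for box constraints.

```python
import numpy as np

def prox_l1(z, gamma):
    """prox of gamma * ||.||_1: coordinate-wise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def prox_box(z, lo=0.0, hi=1.0):
    """prox of the indicator of the box [lo, hi]^p: projection onto the box
    (the step size gamma plays no role for an indicator function)."""
    return np.clip(z, lo, hi)
```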

Variance-reduced stochastic gradient methods are natural candidates due to their state-of-the-art performance on these problems and their recent asynchronous variants.

The SAGA algorithm [4] maintains the current iterate x ∈ ℝ^p and a table of historical gradients α ∈ ℝ^(n×p), with running average ᾱ := (1/n) Σ_j α_j. At each iteration, sample i ∈ {1, ..., n} and compute (x⁺, α⁺) as
\[
x^+ = \operatorname{prox}_{\gamma h}\!\big(x - \gamma(\nabla f_i(x) - \alpha_i + \bar{\alpha})\big) \; ; \qquad \alpha_i^+ = \nabla f_i(x) .
\]
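A minimal NumPy sketch of one such iteration (illustrative only; the callables grad_fi and prox_h and the storage layout are assumptions, not the authors' implementation):

```python
import numpy as np

def saga_step(x, alpha, alpha_bar, i, grad_fi, prox_h, gamma):
    """One dense SAGA iteration for min (1/n) sum_i f_i(x) + h(x).

    x         : current iterate, shape (p,)
    alpha     : table of historical gradients, shape (n, p)
    alpha_bar : running average of the rows of alpha, shape (p,)
    grad_fi   : grad_fi(x, i) -> gradient of f_i at x
    prox_h    : prox_h(z, gamma) -> prox of gamma * h at z
    Updates alpha and alpha_bar in place and returns the new iterate.
    """
    g_i = grad_fi(x, i)
    v = g_i - alpha[i] + alpha_bar                    # variance-reduced gradient estimate
    x_new = prox_h(x - gamma * v, gamma)              # proximal gradient step
    alpha_bar += (g_i - alpha[i]) / alpha.shape[0]    # keep the average in sync
    alpha[i] = g_i                                    # update the historical gradient
    return x_new
```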

Difficulty of a Composite Extension

• Existing methods perform best when updates are sparse.

• Even in the presence of sparse gradients, the SAGA update is not sparse due to the presence of ᾱ and the prox.

• Existing convergence proofs bound the noise from asynchrony using the Lipschitz constant of the gradient. This property does not extend to the composite case.

A New Sequential Algorithm: Sparse Proximal SAGA

The algorithm relies on the following quantities:

• Extended support T_i: the set of blocks that intersect supp(∇f_i),
\[
T_i \stackrel{\text{def}}{=} \{ B \in \mathcal{B} : \operatorname{supp}(\nabla f_i) \cap B \neq \emptyset \} .
\]

• For each block B ∈ ℬ, d_B := n/n_B, where n_B := Σ_i 1{B ∈ T_i} is the number of extended supports T_i that contain B.

• D_i is a diagonal matrix defined block-wise as [D_i]_{B,B} := d_B · 1{B ∈ T_i} · I_{|B|}.

• ϕ_i is a block-wise reweighting of h: ϕ_i(x) := Σ_{B ∈ T_i} d_B h_B([x]_B).

Justification. The following properties are verified:

• ϕ_i(x) is zero outside T_i; D_i x is zero outside T_i (sparsity).

• E_i[ϕ_i] = h; E_i[D_i] = I (unbiasedness).
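These constants can be precomputed once from the gradient supports and the block partition. A small sketch of such a precomputation (the helper name and data structures are assumptions, not the authors' code):

```python
import numpy as np

def sparsity_constants(supports, blocks, n):
    """Compute extended supports T_i and reweighting constants d_B = n / n_B.

    supports : list of n index arrays, supports[i] = supp(grad f_i)
    blocks   : list of index arrays forming a partition of {0, ..., p-1}
    Returns (T, d): T[i] is the sorted list of block indices in T_i, and
    d[b] = n / n_B for block b (0 for blocks that appear in no T_i).
    """
    block_of = {}                              # coordinate -> block index
    for b, B in enumerate(blocks):
        for j in B:
            block_of[j] = b
    T = [sorted({block_of[j] for j in supports[i]}) for i in range(n)]
    n_B = np.zeros(len(blocks))
    for Ti in T:
        n_B[Ti] += 1                           # how many extended supports contain each block
    d = np.divide(n, n_B, out=np.zeros_like(n_B), where=n_B > 0)
    return T, d
```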

Algorithm. As in SAGA, it maintains the current iterate x ∈ ℝ^p and a table of historical gradients α ∈ ℝ^(n×p). At each iteration, it samples an index i ∈ {1, ..., n} and computes the next iterate (x⁺, α⁺) as
\[
v_i = \nabla f_i(x) - \alpha_i + D_i \bar{\alpha} \; ; \qquad
x^+ = \operatorname{prox}_{\gamma \varphi_i}(x - \gamma v_i) \; ; \qquad
\alpha_i^+ = \nabla f_i(x) .
\]
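A minimal sketch of one Sparse Proximal SAGA iteration built on the quantities above (illustrative; prox_hB(z, gamma, b) is an assumed helper returning the prox of gamma * h_B for block b, and the data layout follows the earlier sketches):

```python
import numpy as np

def sparse_prox_saga_step(x, alpha, alpha_bar, i, grad_fi, prox_hB,
                          gamma, T, d, blocks):
    """One Sparse Proximal SAGA iteration; only blocks in T[i] are touched.

    Updates x, alpha and alpha_bar in place.
    """
    n = alpha.shape[0]
    g_i = grad_fi(x, i)                        # nonzero only on supp(grad f_i)
    for b in T[i]:
        B = blocks[b]
        # v_i = grad f_i(x) - alpha_i + D_i alpha_bar, restricted to block B
        v_B = g_i[B] - alpha[i][B] + d[b] * alpha_bar[B]
        # block-wise prox of phi_i = sum_{B in T_i} d_B h_B with step gamma
        x[B] = prox_hB(x[B] - gamma * v_B, gamma * d[b], b)
        # maintain the running average and the gradient table
        alpha_bar[B] += (g_i[B] - alpha[i][B]) / n
        alpha[i][B] = g_i[B]
```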

Features

• Per-iteration cost in O(|T_i|).

• Easy to implement (compared to the lagged update approach [5]).

• Amenable to parallelization.

Convergence Analysis

For step size γ = 1/(5L) and f µ-strongly convex (µ > 0), Sparse Proximal SAGA converges geometrically in expectation. At iteration t we have
\[
\mathbb{E}\|x_t - x^*\|^2 \leq \Big(1 - \tfrac{1}{5}\min\big\{\tfrac{1}{n}, \tfrac{1}{\kappa}\big\}\Big)^{t} C_0 ,
\]
with C_0 = ‖x_0 − x*‖² + (1/(5L²)) Σ_{i=1}^{n} ‖α_i^0 − ∇f_i(x*)‖² and κ = L/µ (the condition number).
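As a quick sanity check of the bound, a throwaway helper (not from the poster) that evaluates its right-hand side; for instance, in the big-data regime (n ≥ κ) one pass over the data (t = n) contracts the bound by roughly exp(-1/5).

```python
def sparse_prox_saga_bound(t, n, kappa, C0):
    """E||x_t - x*||^2 <= (1 - min(1/n, 1/kappa) / 5)**t * C0."""
    rho = 1.0 - min(1.0 / n, 1.0 / kappa) / 5.0
    return rho ** t * C0
```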

Implications

• Same convergence rate as SAGA, with cheaper updates.

• In the "big data regime" (n ≥ κ): rate in O(1/n).

• In the "ill-conditioned regime" (n ≤ κ): rate in O(1/κ).

• Adaptivity to strong convexity, i.e., no need to know the strong convexity parameter to obtain linear convergence.

A New Parallel Algorithm: Proximal Asynchronous SAGA (ProxASAGA)

Proximal Asynchronous SAGA (ProxASAGA) runs Sparse Proximal SAGA asynchronously and without locks, updating x, α, and ᾱ in shared memory.

All read/write operations to shared memory are inconsistent, i.e., there are no vector-level locks while reading or writing.

1:  keep doing in parallel
2:      Sample i uniformly in {1, ..., n}
3:      [x̂]_{T_i} = inconsistent read of x on T_i
4:      α̂_i = inconsistent read of α_i
5:      [ᾱ]_{T_i} = inconsistent read of ᾱ on T_i
6:      [δα]_{S_i} = [∇f_i(x̂)]_{S_i} − [α̂_i]_{S_i}        (S_i := supp(∇f_i))
7:      [v̂]_{T_i} = [δα]_{T_i} + [D_i ᾱ]_{T_i}
8:      [δx]_{T_i} = [prox_{γϕ_i}(x̂ − γv̂)]_{T_i} − [x̂]_{T_i}
9:      for B in T_i do
10:         for b in B do
11:             [x]_b ← [x]_b + [δx]_b                      ▷ atomic
12:             if b ∈ supp(∇f_i) then
13:                 [ᾱ]_b ← [ᾱ]_b + (1/n)[δα]_b             ▷ atomic
14:             end if
15:         end for
16:     end for
17:     α_i ← ∇f_i(x̂)  (scalar update)                     ▷ atomic
18: end parallel loop
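A toy Python rendering of the parallel loop above, for illustration only: CPython threads do not run truly in parallel because of the GIL, and NumPy in-place updates are not the atomic coordinate updates the pseudocode requires; the data layout and helper signatures follow the earlier sketches.

```python
import threading
import numpy as np

def prox_asaga(grad_fi, prox_hB, x, alpha, alpha_bar, T, d, blocks,
               gamma, n_threads=4, iters_per_thread=10_000):
    """Run the ProxASAGA inner loop on n_threads Python threads (toy sketch)."""
    n = alpha.shape[0]

    def worker(seed):
        rng = np.random.default_rng(seed)
        for _ in range(iters_per_thread):
            i = int(rng.integers(n))            # sample i uniformly
            g_i = grad_fi(x, i)                 # inconsistent read of x
            for b in T[i]:
                B = blocks[b]
                delta_alpha_B = g_i[B] - alpha[i][B]
                v_B = delta_alpha_B + d[b] * alpha_bar[B]
                delta_x_B = prox_hB(x[B] - gamma * v_B, gamma * d[b], b) - x[B]
                x[B] += delta_x_B               # coordinate-wise, atomic in the real algorithm
                # delta_alpha is zero off supp(grad f_i), so this matches the pseudocode
                alpha_bar[B] += delta_alpha_B / n
                alpha[i][B] = g_i[B]            # memory update

    threads = [threading.Thread(target=worker, args=(s,)) for s in range(n_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return x
```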

Perturbed Iterate Framework

Problem: Analysis of asynchronous parallel algorithms is hard.

Solution: Cast them as sequential algorithms working on perturbed inputs. Distinguish:

• x̂_t: the inconsistent vector read by a core. The counter t is incremented when a core finishes reading the parameters (after-read labeling [3]).

• x_t: the virtual iterate, defined by x_{t+1} := x_t − γ g_t with g_t = g(x̂_t, v̂_{i_t}, i_t), where
\[
g(x, v, i) = \tfrac{1}{\gamma}\big( x - \operatorname{prox}_{\gamma \varphi_i}(x - \gamma v) \big) .
\]

Interpret x̂_t as a noisy version of x_t due to asynchrony. This is a generalization of the perturbed iterate framework [2, 3] to composite objectives.

Analysis preliminaries

Definition (measure of sparsity). Let ∆ := max_{B ∈ ℬ} |{i : T_i ∋ B}| / n. This is the normalized maximum number of times that a block appears in the extended supports. We always have 1/n ≤ ∆ ≤ 1.

Definition (delay bound). τ is a uniform bound on the maximum delay between two iterations processed concurrently.
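The sparsity measure ∆ can be computed directly from the extended supports (a sketch, reusing the hypothetical T structure from the earlier helper):

```python
def sparsity_measure(T, n_blocks, n):
    """Delta = max_B |{i : B in T_i}| / n."""
    counts = [0] * n_blocks
    for Ti in T:
        for b in Ti:
            counts[b] += 1
    return max(counts) / n
```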

Convergence guarantee of ProxASAGA

Suppose τ ≤ 1/(10√∆). Then:

• If κ ≥ n, then with step size γ = 1/(36L), ProxASAGA converges geometrically with rate factor Ω(1/κ).

• If κ < n, then with step size γ = 1/(36nµ), ProxASAGA converges geometrically with rate factor Ω(1/n).

In both cases, the convergence rate is the same as for Sparse Proximal SAGA ⟹ ProxASAGA is linearly faster, up to a constant factor. In both cases the step size does not depend on τ.

If τ ≤ 6κ, a universal step size of Θ(1/L) achieves a rate similar to Sparse Proximal SAGA, making the method adaptive to local strong convexity (knowledge of κ is not required).
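A small helper summarizing the step-size case split above (a sketch; in practice L and µ may only be known approximately):

```python
def prox_asaga_step_size(n, L, mu):
    """Theoretical step size: 1/(36 L) if kappa >= n, else 1/(36 n mu)."""
    kappa = L / mu
    return 1.0 / (36.0 * L) if kappa >= n else 1.0 / (36.0 * n * mu)
```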

Experimental results

Comparison on 3 large-scale datasets with an elastic-net regularized logistic regression model:

\[
\underset{x}{\text{minimize}} \;\; \frac{1}{n} \sum_{i=1}^{n} \log\big( 1 + \exp(-b_i a_i^{\top} x) \big) + \frac{\lambda_1}{2}\|x\|_2^2 + \lambda_2 \|x\|_1 ,
\]

Dataset   | n           | p          | density   | L     | ∆
KDD 2010  | 19,264,097  | 1,163,024  | 10⁻⁶      | 28.12 | 0.15
KDD 2012  | 149,639,105 | 54,686,452 | 2 × 10⁻⁷  | 1.25  | 0.85
Criteo    | 45,840,617  | 1,000,000  | 4 × 10⁻⁵  | 1.25  | 0.89
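For reference, a sketch of the per-sample pieces of the objective above that the algorithms consume (hypothetical helpers; note that folding the ℓ2 term into each f_i makes its gradient dense, and keeping updates sparse requires a reweighting of that term in the spirit of D_i, omitted here):

```python
import numpy as np

def elastic_net_pieces(a_i, b_i, lam1, lam2):
    """Return the gradient of f_i(x) = log(1 + exp(-b_i a_i^T x)) + (lam1/2)||x||^2
    and the prox of h(x) = lam2 ||x||_1 (soft-thresholding)."""
    def grad_fi(x):
        z = -b_i * a_i.dot(x)
        sigma = 1.0 / (1.0 + np.exp(-z))        # derivative of log(1 + exp(z))
        return -b_i * sigma * a_i + lam1 * x

    def prox_h(z, gamma):
        return np.sign(z) * np.maximum(np.abs(z) - gamma * lam2, 0.0)

    return grad_fi, prox_h
```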

Highlights: ProxASAGA significantly outperforms existing methods, with a 6x to 12x speedup over the sequential version.

References

1. Niu, F., Recht, B., Ré, C. & Wright, S. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS (2011).

2. Mania, H. et al. Perturbed iterate analysis for asynchronous stochastic optimization. SIAM Journal on Optimization (2017).

3. Leblond, R., Pedregosa, F. & Lacoste-Julien, S. ASAGA: Asynchronous parallel SAGA. In AISTATS (2017).

4. Defazio, A. et al. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS (2014).

5. Schmidt, M., Le Roux, N. & Bach, F. Minimizing finite sums with the stochastic average gradient. Mathematical Programming (2016).