Stochastic Optimization
Part III: Advanced topics of stochastic optimization

†‡ Taiji Suzuki

† Tokyo Institute of Technology
  Graduate School of Information Science and Engineering
  Department of Mathematical and Computing Sciences
‡ JST, PRESTO

MLSS2015@Kyoto

1 / 62

Outline

1 Stochastic optimization for structured regularization
    Structured regularization
    Alternating Direction Method of Multipliers (ADMM)
    Stochastic ADMM for online data
    Stochastic ADMM for batch data

2 Parallel and distributed optimization

3 Further interesting topics

2 / 62


Regularized learning problem

Lasso:

    min_{x∈R^p}  (1/n) Σ_{i=1}^n (y_i − z_i^⊤ x)^2 + ∥x∥_1,

where the ∥x∥_1 term is the regularization.

General regularized learning problem:

    min_{x∈R^p}  (1/n) Σ_{i=1}^n f_i(z_i^⊤ x) + ψ(x).

Difficulty: sparsity-inducing regularization is usually non-smooth.

5 / 62


Proximal mapping

Regularized learning problem:

    min_{x∈R^p}  (1/n) Σ_{i=1}^n f_i(z_i^⊤ x) + ψ(x).

Let g_t ∈ ∂_x ( (1/n) Σ_{i=1}^n f_i(z_i^⊤ x) ) |_{x=x^(t−1)} and ḡ_t = (1/t) Σ_{τ=1}^t g_τ.

Proximal gradient descent:

    x^(t) = argmin_{x∈R^p} { g_t^⊤ x + ψ(x) + (1/(2η_t)) ∥x − x^(t−1)∥^2 }.

Regularized dual averaging:

    x^(t) = argmin_{x∈R^p} { ḡ_t^⊤ x + ψ(x) + (1/(2η_t)) ∥x∥^2 }.

A key computation is the proximal mapping:

    prox(q|ψ) := argmin_x { ψ(x) + (1/2) ∥x − q∥^2 }.

6 / 62

Example of proximal mapping: ℓ1 regularization

    prox(q | C∥·∥_1) = argmin_x { C∥x∥_1 + (1/2)∥x − q∥^2 } = ( sign(q_j) max(|q_j| − C, 0) )_j.

→ Soft-thresholding function. Analytic form.

There are also many regularization functions for which computing the prox mapping is difficult.
→ Structured regularization.

7 / 62
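As a concrete illustration, here is a minimal NumPy sketch (not part of the slides) of the soft-thresholding operator and one proximal gradient step for the Lasso objective above; the function names and the step size eta are placeholders chosen for this example.

    import numpy as np

    def prox_l1(q, C):
        """Soft-thresholding: prox of C*||.||_1 evaluated at q."""
        return np.sign(q) * np.maximum(np.abs(q) - C, 0.0)

    def proximal_gradient_step(x, Z, y, C, eta):
        """One proximal gradient step for (1/n)||Zx - y||^2 + C||x||_1."""
        n = Z.shape[0]
        grad = (2.0 / n) * Z.T @ (Z @ x - y)     # gradient of the smooth squared loss
        return prox_l1(x - eta * grad, eta * C)  # prox of the scaled l1 term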

Examples of structured regularization

Overlapped group lasso:

    ψ(w) = C Σ_{g∈G} ∥w_g∥_q    (q > 1; typically q = 2, ∞).

The groups may have overlaps. It is difficult to compute the proximal mapping.

Application (1): Genome Wide Association Study (GWAS) (Balding '06, McCarthy et al. '08)

8 / 62

Application of group reg. (2)

Sentence regularization for text classification (Yogatama and Smith, 2014).

The words occurring in the same sentence are grouped:

    ψ(w) = Σ_{d=1}^D Σ_{s=1}^{S_d} λ_{d,s} ∥w_{(d,s)}∥_2,

where d indexes a document and s indexes a sentence.

9 / 62

(Generalized) Fused Lasso and TV-denoising

Fused lasso (Tibshirani et al. (2005), Jacob et al. (2009)):

    ψ(x) = C Σ_{(i,j)∈E} |x_i − x_j|.

TV-denoising (Rudin et al., 1992, Chambolle, 2004):

    ψ(X) = C Σ_{i,j} √( |X_{i+1,j} − X_{i,j}|^2 + |X_{i,j+1} − X_{i,j}|^2 ).

(Figure: a small graph on nodes x1,…,x5 illustrating the edge set E.)

Examples: genome data analysis by fused lasso (Tibshirani and Taylor, 2011); image restoration (Mairal et al., 2009).

10 / 62

Other examples

Robust PCA (Candes et al., 2009).
Low rank tensor estimation (Signoretto et al., 2010; Tomioka et al., 2011).
Dictionary learning (Kasiviswanathan et al., 2012; Rakotomamonjy, 2013).

11 / 62

Solutions

Developing a sophisticated method for each regularization: Jacob et al. (2009), Yuan et al. (2011).
Submodular optimization: Bach (2010), Kawahara et al. (2009), Bach et al. (2012).
Decomposing the proximal mapping: Yu (2013).
Applying a linear transformation to make the regularization simpler.
→ Alternating Direction Method of Multipliers (ADMM).

12 / 62

Outline

1 Stochastic optimization for structured regularization
    Structured regularization
    Alternating Direction Method of Multipliers (ADMM)
    Stochastic ADMM for online data
    Stochastic ADMM for batch data

2 Parallel and distributed optimization

3 Further interesting topics

13 / 62

Linear transformation, decomposition technique

Overlapped group lasso: ψ(w) = C Σ_{g∈G} ∥w_g∥.
※ It is difficult to compute the proximal mapping.

Solution:

Prepare a regularizer ψ̃ whose proximal mapping is easily computable.
Let ψ̃(B^⊤ w) = ψ(w), and utilize the proximal mapping w.r.t. ψ̃.

(Figure: B^⊤ w duplicates the overlapping coordinates so that the groups become disjoint.)

Decompose into independent groups:

    ψ̃(y) = C Σ_{g'∈G'} ∥y_{g'}∥,

    prox(q|ψ̃) = ( q_{g'} max{ 1 − C/∥q_{g'}∥, 0 } )_{g'∈G'}.

14 / 62

Another example

Graph-guided regularization:

    ψ(w) = C Σ_{(i,j)∈E} |w_i − w_j|.

(Figure: a small graph on nodes x1,…,x5 with edge set E.)

Define

    ψ̃(y) = C Σ_{e∈E} |y_e|,    y = B^⊤ w = (w_i − w_j)_{(i,j)∈E},

so that ψ̃(B^⊤ w) = ψ(w), and

    prox(q|ψ̃) = ( q_e max{ 1 − C/|q_e|, 0 } )_{e∈E}.

→ Soft-thresholding function.

15 / 62
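To make the decomposition concrete, here is a small NumPy sketch (an illustration, not the slides' code) that builds the edge-difference matrix B^⊤ for a given edge list and evaluates the prox of ψ̃(y) = C Σ_e |y_e|, which is coordinate-wise soft-thresholding; the 5-node edge set below is a hypothetical example.

    import numpy as np

    def edge_difference_matrix(p, edges):
        """B^T maps w in R^p to (w_i - w_j) over the edge list."""
        Bt = np.zeros((len(edges), p))
        for k, (i, j) in enumerate(edges):
            Bt[k, i] = 1.0
            Bt[k, j] = -1.0
        return Bt

    def prox_edge_l1(q, C):
        """prox of C * sum_e |y_e|: soft-threshold each edge coordinate."""
        return np.sign(q) * np.maximum(np.abs(q) - C, 0.0)

    # Hypothetical 5-node graph (the exact edges of the slide's figure are not recoverable)
    edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
    Bt = edge_difference_matrix(5, edges)
    w = np.array([1.0, 0.9, -0.2, -0.1, 1.1])
    y = prox_edge_l1(Bt @ w, C=0.3)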

Optimizing a composite objective function with a linear constraint

    min_x  (1/n) Σ_{i=1}^n f_i(z_i^⊤ x) + ψ(B^⊤ x)
    ⇔  min_{x,y}  (1/n) Σ_{i=1}^n f_i(z_i^⊤ x) + ψ(y)   s.t.  y = B^⊤ x.

Augmented Lagrangian:

    L(x, y, λ) = (1/n) Σ_i f_i(z_i^⊤ x) + ψ(y) + λ^⊤(y − B^⊤ x) + (ρ/2)∥y − B^⊤ x∥^2.

Then

    inf_{x,y} sup_λ L(x, y, λ)

yields the solution of the original problem.

The augmented Lagrangian is the basis of the method of multipliers (Hestenes, 1969, Powell, 1969, Rockafellar, 1976).

16 / 62

Method of multipliers

    min_{x,y} { f(x) + ψ(y)  s.t.  Ax + By = 0 }

    L(x, y, λ) = f(x) + ψ(y) + λ^⊤(Ax + By) + (ρ/2)∥Ax + By∥^2

Method of multipliers (Hestenes, 1969, Powell, 1969):

    (x^(t), y^(t)) = argmin_{(x,y)} L(x, y, λ^(t−1)),
    λ^(t) = λ^(t−1) + ρ(Ax^(t) + By^(t)).

Remark:

The update of λ corresponds to gradient ascent on L.
It is easy to check that

    ∇_x ( f(x) + ⟨λ^(t), Ax + By^(t)⟩ ) |_{x=x^(t)} = 0,
    ∇_y ( ψ(y) + ⟨λ^(t), Ax^(t) + By⟩ ) |_{y=y^(t)} = 0.

If Ax^(t) + By^(t) = 0, this gives the optimality condition.

17 / 62


Alternating Direction Method of Multipliers (Douglas-Rachford splitting)

    min_{x,y} { f(x) + ψ(y)  s.t.  Ax + By = 0 }

    L(x, y, λ) = f(x) + ψ(y) + λ^⊤(Ax + By) + (ρ/2)∥Ax + By∥^2

Alternating Direction Method of Multipliers (Gabay and Mercier, 1976):

    x^(t) = argmin_x L(x, y^(t−1), λ^(t−1)),
    y^(t) = argmin_y L(x^(t), y, λ^(t−1)),
    λ^(t) = λ^(t−1) + ρ(Ax^(t) + By^(t)).

ADMM converges to the optimum (Mota et al., 2011).
O(1/t) convergence in general (He and Yuan, 2012).
Linear convergence for a strongly convex objective (Deng and Yin, 2012, Hong and Luo, 2012).

18 / 62


ADMM for structured regularization

    min_x { f(x) + ψ(B^⊤ x) }  ⇔  min_{x,y} { f(x) + ψ(y)  s.t.  y = B^⊤ x }

    L(x, y, λ) = f(x) + ψ(y) + λ^⊤(y − B^⊤ x) + (ρ/2)∥y − B^⊤ x∥^2,

where f(x) = (1/n) Σ_i f_i(z_i^⊤ x).

ADMM for structured regularization:

    x^(t) = argmin_x { f(x) + λ^(t−1)⊤(−B^⊤ x) + (ρ/2)∥y^(t−1) − B^⊤ x∥^2 },
    y^(t) = argmin_y { ψ(y) + λ^(t−1)⊤ y + (ρ/2)∥y − B^⊤ x^(t)∥^2 }
          ( = prox(B^⊤ x^(t) − λ^(t−1)/ρ | ψ/ρ) ),
    λ^(t) = λ^(t−1) − ρ(B^⊤ x^(t) − y^(t)).

The update of y is given by the proximal mapping w.r.t. the simple ψ.
→ Usually available in analytic form.
However, the computation of (1/n) Σ_{i=1}^n f_i(z_i^⊤ x) is still heavy.
→ Stochastic versions of ADMM have been developed (Suzuki, 2013, Ouyang et al., 2013, Suzuki, 2014).

19 / 62


Example: ADMM for Lasso

    min_x { (1/(2n)) ∥Zx − Y∥^2 + C∥x∥_1 }

ADMM for Lasso (with the constraint y = x):

    x^(t) = ( Z^⊤Z/n + ρI )^{−1} ( Z^⊤Y/n + λ^(t−1) + ρ y^(t−1) ),
    y^(t) = ST_{C/ρ}( x^(t) − λ^(t−1)/ρ ),
    λ^(t) = λ^(t−1) − ρ( x^(t) − y^(t) ),

where ST_η(x) = ( sign(x_j) max{|x_j| − η, 0} )_j.

20 / 62
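Here is a minimal NumPy sketch of ADMM for the Lasso, written directly from the generic ADMM updates of the previous slide (so the sign conventions follow that derivation rather than being copied from this slide); Z, Y, C and rho are the quantities above, and the function names are placeholders for this example.

    import numpy as np

    def soft_threshold(x, eta):
        return np.sign(x) * np.maximum(np.abs(x) - eta, 0.0)

    def admm_lasso(Z, Y, C, rho=1.0, iters=100):
        """ADMM for min_x (1/(2n))||Zx - Y||^2 + C||x||_1 with constraint y = x."""
        n, p = Z.shape
        x = np.zeros(p); y = np.zeros(p); lam = np.zeros(p)   # lam is the multiplier
        M = Z.T @ Z / n + rho * np.eye(p)   # the x-update solves a fixed linear system
        ZtY = Z.T @ Y / n
        for _ in range(iters):
            x = np.linalg.solve(M, ZtY + lam + rho * y)
            y = soft_threshold(x - lam / rho, C / rho)
            lam = lam - rho * (x - y)
        return y   # the sparse iterate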

Outline

1 Stochastic optimization for structured regularization
    Structured regularization
    Alternating Direction Method of Multipliers (ADMM)
    Stochastic ADMM for online data
    Stochastic ADMM for batch data

2 Parallel and distributed optimization

3 Further interesting topics

21 / 62

Table of literature

Stochastic methods for regularized learning problems:

           | Normal                                                | ADMM
  Online   | • Proximal gradient type (Nesterov, 2007)             | Online-ADMM (Wang and Banerjee, 2012)
           |   FOBOS (Duchi and Singer, 2009)                      | SGD-ADMM (Suzuki, 2013, Ouyang et al., 2013)
           | • Dual averaging type (Nesterov, 2009)                | RDA-ADMM (Suzuki, 2013)
           |   RDA (Xiao, 2009)                                    |
  Batch    | SDCA, Stochastic Dual Coordinate Ascent               | SDCA-ADMM (Suzuki, 2014)
           |   (Shalev-Shwartz and Zhang, 2013)                    |
           | SAG, Stochastic Averaging Gradient                    |
           |   (Le Roux et al., 2013)                              |
           | SVRG, Stochastic Variance Reduced Gradient            |
           |   (Johnson and Zhang, 2013)                           |

Online type: O(1/√T) in general, O(log(T)/T) or O(1/T) for a strongly convex objective.
Batch type: linear convergence.

22 / 62

Reminder of online stochastic optimization

Let g_t ∈ ∂ℓ_t(x_t) and ḡ_t = (1/t) Σ_{τ=1}^t g_τ   (where ℓ_t(x) = ℓ(z_t, x)).

OPG (Online Proximal Gradient Descent):

    x_{t+1} = argmin_x { ⟨g_t, x⟩ + ψ(x) + (1/(2η_t)) ∥x − x_t∥^2 }

RDA (Regularized Dual Averaging; Xiao (2009), Nesterov (2009)):

    x_{t+1} = argmin_x { ⟨ḡ_t, x⟩ + ψ(x) + (1/(2η_t)) ∥x∥^2 }

These update rules are computed via the proximal mapping associated with ψ.
Efficient for a simple regularization such as ℓ1 regularization.
How about structured regularization? → ADMM.

23 / 62


OPG-ADMM

Ordinary OPG:  x_{t+1} = argmin_x { ⟨g_t, x⟩ + ψ(x) + (1/(2η_t)) ∥x − x_t∥^2 }.

OPG-ADMM:

    x_{t+1} = argmin_{x∈X} { g_t^⊤ x − λ_t^⊤(B^⊤ x − y_t) + (ρ/2)∥B^⊤ x − y_t∥^2 + (1/(2η_t)) ∥x − x_t∥^2_{G_t} },
    y_{t+1} = argmin_{y∈Y} { ψ(y) − λ_t^⊤(B^⊤ x_{t+1} − y) + (ρ/2)∥B^⊤ x_{t+1} − y∥^2 }
            = prox(B^⊤ x_{t+1} − λ_t/ρ | ψ),
    λ_{t+1} = λ_t − ρ(B^⊤ x_{t+1} − y_{t+1}).

The update rules of y_{t+1} and λ_{t+1} are the same as in ordinary ADMM.
prox(·|ψ) is usually analytically obtained.
G_t is any positive definite matrix.

24 / 62

RDA-ADMM

Ordinary RDA:  w_{t+1} = argmin_w { ⟨ḡ_t, w⟩ + ψ(w) + (1/(2η_t)) ∥w∥^2 }.

RDA-ADMM:

Let x̄_t = (1/t) Σ_{τ=1}^t x_τ,  λ̄_t = (1/t) Σ_{τ=1}^t λ_τ,  ȳ_t = (1/t) Σ_{τ=1}^t y_τ,  ḡ_t = (1/t) Σ_{τ=1}^t g_τ.

    x_{t+1} = argmin_{x∈X} { ḡ_t^⊤ x − (Bλ̄_t)^⊤ x + (ρ/(2t)) ∥B^⊤ x∥^2
                             + ρ(B^⊤ x̄_t − ȳ_t)^⊤ B^⊤ x + (1/(2η_t)) ∥x∥^2_{G_t} },
    y_{t+1} = prox(B^⊤ x_{t+1} − λ_t/ρ | ψ),
    λ_{t+1} = λ_t − ρ(B^⊤ x_{t+1} − y_{t+1}).

The update rules of y_{t+1} and λ_{t+1} are the same as in ordinary ADMM.

25 / 62

Simplified version

Letting G_t take a specific form, the update rule is much simplified
(G_t = γI − (ρ/t) η_t BB^⊤ for RDA-ADMM, and G_t = γI − ρ η_t BB^⊤ for OPG-ADMM).

Online stochastic ADMM:

1. Update of x:
    (OPG-ADMM)  x_{t+1} = Π_X [ −(η_t/γ) { g_t − B(λ_t − ρB^⊤ x_t + ρ y_t) } + x_t ],
    (RDA-ADMM)  x_{t+1} = Π_X [ −(η_t/γ) { ḡ_t − B(λ̄_t − ρB^⊤ x̄_t + ρ ȳ_t) } ].
2. Update of y:
    y_{t+1} = prox(B^⊤ x_{t+1} − λ_t/ρ | ψ).
3. Update of λ:
    λ_{t+1} = λ_t − ρ(B^⊤ x_{t+1} − y_{t+1}).

Fast computation. Easy implementation.

26 / 62
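The following is a minimal sketch of one round of the simplified OPG-ADMM update above (illustrative only): the stochastic gradient g, the projection onto X, the prox of ψ, and the parameters γ, ρ, η_t are all supplied by the caller.

    import numpy as np

    def opg_admm_step(x, y, lam, g, B, prox_psi, eta, gamma=1.0, rho=1.0,
                      project=lambda v: v):
        """One simplified OPG-ADMM round: returns (x_{t+1}, y_{t+1}, lam_{t+1}).

        g        : stochastic subgradient of the loss at x_t
        B        : matrix appearing in the regularizer psi(B^T x)
        prox_psi : function q -> prox(q | psi)
        project  : projection onto the domain X (identity if X = R^p)
        """
        x_new = project(x - (eta / gamma) * (g - B @ (lam - rho * (B.T @ x) + rho * y)))
        y_new = prox_psi(B.T @ x_new - lam / rho)
        lam_new = lam - rho * (B.T @ x_new - y_new)
        return x_new, y_new, lam_new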

Convergence analysis

We bound the expected risk:

    P(x) = E_Z[ℓ(Z, x)] + ψ(x).

Assumptions:

(A1) ∃G s.t. every g ∈ ∂_x ℓ(z, x) satisfies ∥g∥ ≤ G for all z, x.
(A2) ∃L s.t. every g ∈ ∂ψ(y) satisfies ∥g∥ ≤ L for all y.
(A3) ∃R s.t. every x ∈ X satisfies ∥x∥ ≤ R.

27 / 62

Convergence rate: bounded gradient

(A1) ∃G s.t. every g ∈ ∂_x ℓ(z, x) satisfies ∥g∥ ≤ G for all z, x.
(A2) ∃L s.t. every g ∈ ∂ψ(y) satisfies ∥g∥ ≤ L for all y.
(A3) ∃R s.t. every x ∈ X satisfies ∥x∥ ≤ R.

Theorem (Convergence rate of RDA-ADMM)
Under (A1), (A2), (A3), we have

    E_{z_{1:T−1}}[P(x̄_T) − P(x*)] ≤ (1/T) Σ_{t=2}^T η_{t−1} G^2/(2(t−1)) + (γ/η_T)∥x*∥^2 + K/T.

Theorem (Convergence rate of OPG-ADMM)
Under (A1), (A2), (A3), we have

    E_{z_{1:T−1}}[P(x̄_T) − P(x*)] ≤ (1/(2T)) Σ_{t=2}^T max{ γ/η_t − γ/η_{t−1}, 0 } R^2
                                   + (1/T) Σ_{t=1}^T (η_t/2) G^2 + K/T.

Both methods have convergence rate O(1/√T) by letting η_t = η_0 √t for RDA-ADMM and η_t = η_0/√t for OPG-ADMM.

28 / 62

Convergence rate: strongly convex

(A4) There exist σ_f, σ_ψ ≥ 0 and P, Q ⪰ O such that

    E_Z[ℓ(Z, x) + (x′ − x)^⊤ ∇_x ℓ(Z, x)] + (σ_f/2)∥x − x′∥^2_P ≤ E_Z[ℓ(Z, x′)],
    ψ(y) + (y′ − y)^⊤ ∇ψ(y) + (σ_ψ/2)∥y − y′∥^2_Q ≤ ψ(y′),

for all x, x′ ∈ X and y, y′ ∈ Y, and ∃σ > 0 satisfying

    σI ⪯ σ_f P + (ρσ_ψ/(2ρ + σ_ψ)) Q.

The update rule of RDA-ADMM is modified to

    x_{t+1} = argmin_{x∈X} { ḡ_t^⊤ x − λ̄_t^⊤ B^⊤ x + (ρ/(2t))∥B^⊤ x∥^2 + ρ(B^⊤ x̄_t − ȳ_t)^⊤ B^⊤ x
                             + (1/(2η_t))∥x∥^2_{G_t} + (tσ/2)∥x − x̄_t∥^2 }.

29 / 62

Convergence analysis: strongly convex

Theorem (Convergence rate of RDA-ADMM)
Under (A1), (A2), (A3), (A4), we have

    E_{z_{1:T−1}}[P(x̄_T) − P(x*)] ≤ (1/(2T)) Σ_{t=2}^T G^2 / ( (t−1)/η_{t−1} + tσ ) + (γ/η_T)∥x*∥^2 + K/T.

Theorem (Convergence rate of OPG-ADMM)
Under (A1), (A2), (A3), (A4), we have

    E_{z_{1:T−1}}[P(x̄_T) − P(x*)] ≤ (1/(2T)) Σ_{t=2}^T max{ γ/η_t − γ/η_{t−1} − σ, 0 } R^2
                                   + (1/T) Σ_{t=1}^T (η_t/2) G^2 + K/T.

Both methods have convergence rate O(log(T)/T) by letting η_t = η_0 t for RDA-ADMM and η_t = η_0/t for OPG-ADMM.
This can be improved to O(1/T) by weighted averaging (Azadi and Sra, 2014).

30 / 62

Numerical experiments

(Figure: expected risk vs. CPU time (s), log-log scale; methods: OPG-ADMM, RDA-ADMM, NL-ADMM, RDA, OPG.)
Figure: Artificial data: 1024 dim, 512 samples, overlapped group lasso.

(Figure: classification error vs. CPU time (s); methods: OPG-ADMM, RDA-ADMM, NL-ADMM, RDA.)
Figure: Real data (Adult, a9a): 15,252 dim, 32,561 samples, overlapped group lasso + ℓ1 reg.

31 / 62

Related methods

O(n/T) convergence (improved from O(1/√T)) in a batch setting: Zhong and Kwok (2014).
Acceleration of stochastic ADMM: Azadi and Sra (2014).
Parallel computing with stochastic ADMM: Wang et al. (2014).

32 / 62

Outline

1 Stochastic optimization for structured regularization
    Structured regularization
    Alternating Direction Method of Multipliers (ADMM)
    Stochastic ADMM for online data
    Stochastic ADMM for batch data

2 Parallel and distributed optimization

3 Further interesting topics

33 / 62

Batch setting

In the batch setting, the data are fixed. We just minimize the objective function

    (1/n) Σ_{i=1}^n f_i(a_i^⊤ x) + ψ(B^⊤ x).

ADMM version of SDCA: converges linearly, with

    T ≥ (n + γ/λ) log(1/ϵ)

iterations to achieve ϵ accuracy for a γ-smooth loss and λ-strongly convex regularization.

34 / 62

Dual problem

Let A = [a_1, a_2, . . . , a_n] ∈ R^{p×n}.

    min_w { (1/n) Σ_{i=1}^n f_i(a_i^⊤ w) + ψ(B^⊤ w) }                                    (P: Primal)

    = − min_{x∈R^n, y∈R^d} { (1/n) Σ_{i=1}^n f_i*(x_i) + ψ*(y/n)  s.t.  Ax + By = 0 }    (D: Dual)

Optimality condition:

    a_i^⊤ w* ∈ ∇f_i*(x_i*),    (1/n) y* ∈ ∇ψ(u)|_{u=B^⊤ w*},    Ax* + By* = 0.

⋆ Each coordinate x_i corresponds to one observation a_i.

35 / 62

SDCA-ADMM

Let the augmented Lagrangian be

    L(x, y, w) := Σ_{i=1}^n f_i*(x_i) + n ψ*(y/n) − ⟨w, Ax + By⟩ + (ρ/2)∥Ax + By∥^2.

Basic algorithm: for each t = 1, 2, . . ., choose i ∈ {1, . . . , n} uniformly at random, and update

    y^(t) ← argmin_y { L(x^(t−1), y, w^(t−1)) + (1/2)∥y − y^(t−1)∥^2_Q },
    x_i^(t) ← argmin_{x_i∈R} { L([x_i; x_{\i}^(t−1)], y^(t), w^(t−1)) + (1/2)∥x_i − x_i^(t−1)∥^2_{G_{i,i}} },
    w^(t) ← w^(t−1) − ξρ{ n(Ax^(t) + By^(t)) − (n − 1)(Ax^(t−1) + By^(t−1)) }.

Q and G_{i,i} are positive definite matrices that satisfy some conditions.
Only the i-th coordinate x_i is updated.
The update of the multiplier w should be modified as above.

36 / 62

SDCA-ADMM

Let the augmented Lagrangian be

    L(x, y, w) := Σ_{i=1}^n f_i*(x_i) + n ψ*(y/n) − ⟨w, Ax + By⟩ + (ρ/2)∥Ax + By∥^2.

Split the index set {1, . . . , n} into K groups (I_1, I_2, . . . , I_K).

Block coordinate SDCA-ADMM: for each t = 1, 2, . . ., choose k ∈ {1, . . . , K} uniformly at random, set I = I_k, and update

    y^(t) ← argmin_y { L(x^(t−1), y, w^(t−1)) + (1/2)∥y − y^(t−1)∥^2_Q },
    x_I^(t) ← argmin_{x_I∈R^{|I|}} { L([x_I; x_{\I}^(t−1)], y^(t), w^(t−1)) + (1/2)∥x_I − x_I^(t−1)∥^2_{G_{I,I}} },
    w^(t) ← w^(t−1) − ξρ{ n(Ax^(t) + By^(t)) − (n − n/K)(Ax^(t−1) + By^(t−1)) }.

Q and G_{I,I} are positive definite matrices that satisfy some conditions.

37 / 62

Simplified algorithm

Setting

    Q = ρ(η_B I_d − B^⊤B),    G_{I,I} = ρ(η_{Z,I} I_{|I|} − Z_I^⊤ Z_I),

then, by the relation prox(q|ψ) + prox(q|ψ*) = q (Moreau decomposition), we obtain the following update rule.

Simplified algorithm:

For q^(t) = y^(t−1) + (B^⊤/(ρη_B)) { w^(t−1) − ρ(Z x^(t−1) + B y^(t−1)) }, let

    y^(t) ← q^(t) − prox( q^(t) | nψ(ρη_B ·)/(ρη_B) ).

For p_I^(t) = x_I^(t−1) + (Z_I^⊤/(ρη_{Z,I})) { w^(t−1) − ρ(Z x^(t−1) + B y^(t)) }, let

    x_i^(t) ← prox( p_i^(t) | f_i*/(ρη_{Z,I}) )    (∀i ∈ I).

⋆ The update of x can be parallelized.

38 / 62
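Purely as a structural illustration of the update rule above, here is a schematic NumPy sketch of one SDCA-ADMM round; the two proximal operators, the sampled block I, and the parameters rho, eta_B, eta_ZI, xi are assumed to be supplied by the caller, and the function names are invented for this sketch.

    import numpy as np

    def sdca_admm_round(x, y, w, Z, B, I, rho, eta_B, eta_ZI, xi,
                        prox_psi_term, prox_fconj_i):
        """One round of the simplified SDCA-ADMM update (schematic sketch).

        x : dual coordinates in R^n, y : auxiliary dual block, w : multiplier in R^p
        Z : p x n data matrix, B : p x d regularization matrix, I : sampled index block
        prox_psi_term(q)     -> prox(q | n*psi(rho*eta_B * .)/(rho*eta_B))
        prox_fconj_i(i, p_i) -> prox(p_i | f_i^* / (rho*eta_ZI))
        """
        n = Z.shape[1]
        r_prev = Z @ x + B @ y                              # residual Zx^{t-1} + By^{t-1}
        q = y + B.T @ (w - rho * r_prev) / (rho * eta_B)
        y_new = q - prox_psi_term(q)                        # Moreau decomposition step
        p_I = x[I] + Z[:, I].T @ (w - rho * (Z @ x + B @ y_new)) / (rho * eta_ZI)
        x_new = x.copy()
        for j, i in enumerate(I):                           # parallelizable over i in I
            x_new[i] = prox_fconj_i(i, p_I[j])
        r_new = Z @ x_new + B @ y_new
        w_new = w - xi * rho * (n * r_new - (n - len(I)) * r_prev)
        return x_new, y_new, w_new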

Convergence analysis

x*: the optimal x;  Y*: the set of optimal y (not necessarily unique);  w*: the optimal w.

Assumptions:

There exists γ > 0 such that, for all x_i ∈ R,

    f_i*(x_i) − f_i*(x_i*) ≥ ⟨∇f_i*(x_i*), x_i − x_i*⟩ + ∥x_i − x_i*∥^2/(2γ).

There exist λ, v_ψ > 0 such that, for all y, u, there exists y* ∈ Y* such that

    ψ*(y/n) − ψ*(y*/n) ≥ ⟨B^⊤ w*, y/n − y*/n⟩ + (v_ψ/2)∥P_{Ker(B)}(y/n − y*/n)∥^2,
    ψ(u) − ψ(B^⊤ w*) ≥ ⟨y*/n, u − B^⊤ w*⟩ + (λ/2)∥u − B^⊤ w*∥^2.

B^⊤ is injective.

39 / 62

Define

    F_D(x, y) := (1/n) Σ_{i=1}^n f_i*(x_i) + ψ*(y/n) − ⟨w*, Ax/n + By/n⟩,

    R_D(x, y, w) := F_D(x, y) − F_D(x*, y*) + (1/(2n^2 ξρ))∥w − w*∥^2 + (ρ(1−ξ)/(2n))∥Ax + By∥^2
                    + (1/(2n))∥x − x*∥^2_{I/γ + H} + (1/(2nK))∥y − y*∥^2_Q.

Theorem (Linear convergence of SDCA-ADMM)
Let H be a matrix such that H_{I,I} = ρ A_I^⊤ A_I + G_{I,I} for all I ∈ {I_1, . . . , I_K}, let

    µ = min{ 1/(4(1 + γσ_max(H))),
             λρσ_min(B^⊤B) / (2 max{1/n, 4λρ, 4λσ_max(Q)}),
             (K v_ψ/n) / (4σ_max(Q)),
             K σ_min(BB^⊤) / (4σ_max(Q)(ργσ_max(A^⊤A) + 4)) },

let ξ = 1/(4n), and let C_1 = R_D(x^(0), y^(0), w^(0)). Then we have

    (dual residual)      E[R_D(x^(t), y^(t), w^(t))] ≤ (1 − µ/K)^t C_1,
    (primal variable)    E[∥w^(t) − w*∥^2] ≤ (nρ/2) (1 − µ/K)^t C_1.

For t ≥ C′ (K/µ) log(C″ n/ϵ), we have E[∥w^(t) − w*∥^2] ≤ ϵ.

40 / 62

Convergence analysis

Assumptions:
  f_i is γ-smooth.
  ψ is λ-strongly convex.
  Other technical conditions.

With the setting ρ = min{1, 1/γ} and K = n,

    t ≥ C (n + γ/λ) log(C′/ϵ)

gives E[∥w^(t) − w*∥^2] ≤ ϵ  (γ/λ plays the role of the condition number).
The rate is as good as that of ordinary SDCA.

Non-stochastic methods (e.g., ADMM on the dual) require

    t ≥ C n (γ/λ) log(C′/ϵ).

41 / 62

Numerical experiments: loss function

Binary classification. Smoothed hinge loss:

    f_i(u) = 0                    (y_i u ≥ 1),
             1/2 − y_i u          (y_i u < 0),
             (1 − y_i u)^2 / 2    (otherwise).

(Figure: plot of the smoothed hinge loss.)

⇒ The proximal mapping is analytically obtained:

    prox(u | f_i*/C) = (Cu − y_i)/(1 + C)    (−1 ≤ (Cu y_i − 1)/(1 + C) ≤ 0),
                       −y_i                  ((Cu y_i − 1)/(1 + C) < −1),
                       0                     (otherwise).

42 / 62
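As a concrete illustration of the two formulas above, here is a minimal NumPy sketch (not the slides' code) of the smoothed hinge loss and of the prox of f_i*/C given by the case formula; labels y are assumed to be in {−1, +1} and the function names are placeholders.

    import numpy as np

    def smoothed_hinge(u, y):
        """Smoothed hinge loss f_i(u) for label y, vectorized over the margin."""
        m = y * u
        return np.where(m >= 1, 0.0,
               np.where(m < 0, 0.5 - m, 0.5 * (1.0 - m) ** 2))

    def prox_conj_smoothed_hinge(u, y, C):
        """prox(u | f_i^*/C) following the case formula above."""
        v = (C * u * y - 1.0) / (1.0 + C)
        return np.where((v >= -1.0) & (v <= 0.0), (C * u - y) / (1.0 + C),
               np.where(v < -1.0, -y, 0.0))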

Numerical experiments (artificial data): setting

Overlapped group regularization:

    ψ(X) = C ( Σ_{i=1}^{32} ∥X_{:,i}∥ + Σ_{j=1}^{32} ∥X_{j,:}∥ + 0.01 × Σ_{i,j} X_{i,j}^2 / 2 ),    X ∈ R^{32×32}.

43 / 62

Numerical experiments (artificial data): results

(Figure, three panels vs. CPU time (s): (a) excess objective value (empirical risk), (b) test loss, (c) classification error; methods: RDA, OPG-ADMM, RDA-ADMM, OL-ADMM, Batch-ADMM, SDCA-ADMM.)

Figure: Artificial data (n = 5,120, d = 1024). Overlapped group lasso. Mini-batch size n/K = 50.

44 / 62

Numerical experiments (artificial data): results

(Figure, three panels vs. CPU time (s): (a) excess objective value (empirical risk), (b) test loss, (c) classification error; methods: RDA, OPG-ADMM, RDA-ADMM, OL-ADMM, Batch-ADMM, SDCA-ADMM.)

Figure: Artificial data (n = 51,200, d = 1024). Overlapped group lasso. Mini-batch size n/K = 50.

45 / 62

Numerical experiments (real data): setting

Graph regularization:

    ψ(w) = C_1 Σ_{i=1}^p |w_i| + C_2 Σ_{(i,j)∈E} |w_i − w_j| + 0.01 × ( C_1 Σ_{i=1}^p |w_i|^2 + C_2 Σ_{(i,j)∈E} |w_i − w_j|^2 ),

where E is the set of edges obtained from the similarity matrix.
The similarity graph is obtained by Graphical Lasso (Yuan and Lin, 2007, Banerjee et al., 2008).

46 / 62

Numerical experiments (real data): results

(Figure, three panels vs. CPU time (s): (a) objective function (empirical risk), (b) test loss, (c) classification error; methods: OPG-ADMM, RDA-ADMM, OL-ADMM, SDCA-ADMM.)

Figure: Real data (20 Newsgroups, n = 12,995). Graph regularization. Mini-batch size n/K = 50.

47 / 62

Numerical experiments (real data): results

(Figure, three panels vs. CPU time (s): (a) objective function (empirical risk), (b) test loss, (c) classification error; methods: OPG-ADMM, RDA-ADMM, OL-ADMM, SDCA-ADMM.)

Figure: Real data (a9a, n = 32,561). Graph regularization. Mini-batch size n/K = 50.

48 / 62

Outline

1 Stochastic optimization for structured regularization
    Structured regularization
    Alternating Direction Method of Multipliers (ADMM)
    Stochastic ADMM for online data
    Stochastic ADMM for batch data

2 Parallel and distributed optimization

3 Further interesting topics

49 / 62

Distributed computing

Q: How can we deal with huge data that cannot be handled on one computational node?
A: Distributed computing.

Challenge: the communication cost trade-off, i.e., communication inside a node vs. between nodes via the network.

We briefly introduce two approaches:
  Simple averaging of SGD: Zinkevich et al. (2010), Zhang et al. (2013).
  Distributed dual coordinate descent (COCOA+): Ma et al. (2015).

50 / 62

Simple averaging

(Figure: K nodes computing x[1], x[2], x[3], x[4] independently.)

Run independent SGDs on K nodes.
Take the average of the K final solutions:

    x̄_K = (1/K) Σ_{k=1}^K x[k].

Just one synchronization, so the communication cost is low. How about convergence?

51 / 62
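A minimal sketch of the one-shot averaging scheme, assuming an ℓ2-regularized least-squares objective as a stand-in for the loss and a sequential simulation of the K nodes (none of this is from the slides):

    import numpy as np

    def sgd_least_squares(Z, y, steps, eta0, lam, rng):
        """SGD on the lam-strongly-convex objective (1/2)(z^T x - y)^2 + (lam/2)||x||^2."""
        x = np.zeros(Z.shape[1])
        for t in range(1, steps + 1):
            i = rng.integers(Z.shape[0])
            grad = (Z[i] @ x - y[i]) * Z[i] + lam * x
            x -= (eta0 / (lam * t)) * grad            # 1/(lam*t)-type step size
        return x

    def one_shot_average(Z, y, K, steps, eta0=1.0, lam=0.1, seed=0):
        """Run K independent SGDs on disjoint shards and average the K solutions."""
        rng = np.random.default_rng(seed)
        shards = np.array_split(rng.permutation(len(y)), K)
        sols = [sgd_least_squares(Z[s], y[s], steps, eta0, lam,
                                  np.random.default_rng(seed + 1 + k))
                for k, s in enumerate(shards)]
        return np.mean(sols, axis=0)                  # single synchronization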

Assumptions:

The loss function is sufficiently smooth (the second-order derivative exists and is Lipschitz continuous and bounded).
The expected loss is λ-strongly convex.
Each node runs T iterations of SGD.

Theorem (Zhang et al., 2013)
With an appropriate step size,

    E[∥x̄_K − x*∥^2] ≤ C ( G^2/(K T λ^2) + 1/T^{3/2} ).

As K is increased, the main term improves linearly.
However, too large a K is not effective. In fact, for λ ∈ (0, 1/√T), it is shown that (Shamir et al., 2014)

    E[∥x̄_K − x*∥^2] ≥ C/(λ^2 T).

52 / 62

COCOA+ (Ma et al., 2015)

We divide the sample into K groups {G_k}_k:

    {1, . . . , n} = ∪_{k=1}^K G_k,    G_k ∩ G_{k′} = ∅.

Dual problem of regularized ERM:

    D(y) = (1/n) Σ_{i=1}^n f_i*(y_i) + ψ*( −(1/n) A y )

         = (1/n) Σ_{k=1}^K ( Σ_{i∈G_k} f_i*(y_i) )    [divided into K groups]
           + ψ*( −(1/n) Σ_{k=1}^K A_{G_k} y_{G_k} )   [needs synchronization]

where A_{G_k} = [a_{i_1}, . . . , a_{i_{|G_k|}}] ∈ R^{p×|G_k|} with i_j ∈ G_k, and y_{G_k} = (y_i)_{i∈G_k}.

53 / 62

(Figure: the dual variables are split into blocks G1, G2, G3; each node runs a small coordinate descent on its block in parallel, and then the results are summed up at a synchronization step.)

54 / 62
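The outer loop described above can be sketched as follows; this is only a schematic, assuming a caller-supplied approximate local solver for each group's subproblem, an aggregation parameter nu, and a mapping from the shared vector back to the primal variable via the gradient of ψ* (all of which are assumptions of this sketch, not details given on the slides).

    import numpy as np

    def cocoa_outer_loop(A, groups, local_solver, grad_psi_conj, n_rounds, nu=1.0):
        """Schematic COCOA+-style outer loop (illustration only).

        groups        : list of index arrays partitioning {0,...,n-1}
        local_solver  : (y_block, A_block, shared_v) -> approximate update for that block
        grad_psi_conj : gradient of psi^*, mapping the shared vector to the primal w
        """
        p, n = A.shape
        y = np.zeros(n)
        v = A @ y / n                        # shared vector that needs synchronization
        for _ in range(n_rounds):
            deltas = [local_solver(y[g], A[:, g], v) for g in groups]   # run in parallel
            for g, d in zip(groups, deltas): # synchronization: sum up the local updates
                y[g] += nu * d
            v = A @ y / n
        w = grad_psi_conj(-v)                # map back to the primal variable
        return w, y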

1. f_i is γ_f-smooth.
2. ψ is λ-strongly convex.
3. Each subproblem decreases the objective by a factor of Θ.

Define the separability of the problem as

    σ_min := max_{y∈R^n} ∥Ay∥^2 / Σ_{k=1}^K ∥A_{G_k} y_{G_k}∥^2,    σ_max := max_k max_{y_{G_k}∈R^{|G_k|}} ∥A_{G_k} y_{G_k}∥^2 / ∥y_{G_k}∥^2.

Theorem
Under an appropriate setting of the parameters, after T iterations with

    T ≥ C (σ_min σ_max γ_f/(λn) + 1)/(1 − Θ) · log(1/ϵ),

it holds that E[D(y^(T)) − D(y*)] ≤ ϵ. Furthermore, after T iterations with

    T ≥ C (σ_min σ_max γ_f/λ + 1)/(1 − Θ) · log( (σ_min σ_max γ_f/λ + 1)/((1 − Θ)ϵ) ),

it holds that E[P(w^(t)) − D(y^(t))] ≤ ϵ.

55 / 62

It is shown that σ_min ≤ K and σ_max ≤ n/K. Then

    T ≥ C (γ_f/λ + 1)/(1 − Θ) · log(1/ϵ)

achieves ϵ accuracy. This is equivalent to the iteration number of batch gradient methods on a strongly convex function.

Typically Θ = ( 1 − 1/(n/K + γ_f/λ) )^t, where t is the number of inner iterations.

Total computational time (worst case), taking t = n/K + γ_f/λ inner iterations:

    t (1 + γ_f/λ) log(1/ϵ) = ( n/K + γ_f/λ )( 1 + γ_f/λ ) log(1/ϵ).

Huge learning problems can be optimized on a distributed system with a linear convergence rate.

56 / 62

Outline

1 Stochastic optimization for structured regularization
    Structured regularization
    Alternating Direction Method of Multipliers (ADMM)
    Stochastic ADMM for online data
    Stochastic ADMM for batch data

2 Parallel and distributed optimization

3 Further interesting topics

57 / 62


Stochastic primal dual coordinate method (Zhang and Lin, 2015)

    min_x { (1/n) Σ_{i=1}^n f_i(a_i^⊤ x) + ψ(x) }
    = min_x max_y { (1/n) Σ_{i=1}^n ( ⟨x, a_i y_i⟩ − f_i*(y_i) ) + ψ(x) },

using f_i(a_i^⊤ x) = sup_{y_i} { ⟨x, a_i y_i⟩ − f_i*(y_i) }.

x and y_i (with i chosen randomly) are updated alternately.

Iteration complexity:

    T ≥ ( n + √(γ/λ) ) log(1/ϵ).

58 / 62

Multi-armed bandit

Maximize the sum of rewards earned through a sequence of plays.

Formulated by Robbins (1952).
Optimal strategy: Lai and Robbins (1985).
UCB strategy: Auer et al. (2002), Burnetas and Katehakis (1996).
Thompson sampling (Bayesian strategy): Thompson (1933).

Continuous version of the bandit: Bayesian optimization (Mockus, 1975, Mockus and Mockus, 1991, Srinivas et al., 2012, Snoek et al., 2012).
  Gaussian process regression is used to search for the peak of a function.
  Practically useful for hyper-parameter tuning of deep learning.

59 / 62

Stochastic gradient Langevin Monte Carlo (Welling and Teh, 2011)

Goal: efficient sampling from the posterior distribution.

Randomly choose a small mini-batch I_t ⊆ {1, . . . , n} and update the parameter θ_t:

    θ_t = θ_{t−1} + η_t ( ∇_θ log π(θ)  [prior]  + (n/|I_t|) Σ_{i∈I_t} ∇_θ log p(x_i|θ)  [likelihood] ) + ϵ_t ξ,

where ξ ∼ N(0, η_t I) and Σ_{t=1}^∞ ϵ_t = ∞, Σ_{t=1}^∞ ϵ_t^2 < ∞.

Related topic: stochastic variational inference (Hoffman et al., 2013).

Bayesian inference for large datasets. Practically very useful. Supported by some theoretical background.

60 / 62
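A minimal sketch of SGLD in the standard Welling-Teh form, assuming a single step-size sequence ϵ_t for both the gradient step and the injected noise (the slide writes the two scales with separate symbols); the gradient callbacks and step-size schedule are placeholders supplied by the caller.

    import numpy as np

    def sgld(theta0, X, grad_log_prior, grad_log_lik, step_size, n_steps,
             batch_size=32, seed=0):
        """Stochastic gradient Langevin dynamics; returns one sample per iteration.

        grad_log_prior(theta)  : gradient of log pi(theta)
        grad_log_lik(theta, x) : gradient of log p(x | theta) for one data point
        step_size(t)           : eps_t with sum(eps_t) = inf and sum(eps_t^2) < inf
        """
        rng = np.random.default_rng(seed)
        n = len(X)
        theta = np.asarray(theta0, dtype=float).copy()
        samples = []
        for t in range(1, n_steps + 1):
            eps = step_size(t)
            idx = rng.choice(n, size=min(batch_size, n), replace=False)
            grad = grad_log_prior(theta)
            grad = grad + (n / len(idx)) * np.sum([grad_log_lik(theta, X[i]) for i in idx], axis=0)
            theta = theta + 0.5 * eps * grad + rng.normal(0.0, np.sqrt(eps), size=theta.shape)
            samples.append(theta.copy())
        return samples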

Connection to learning theory

    min_x { (1/t) Σ_{τ=1}^t ℓ(z_τ, x) + λ_t ∥x∥_1 }

Solve the regularized learning problem in the online setting (the data z_t come one after another).
The regularization parameter λ_t should go to zero as t → ∞.

Question: Can we achieve statistically optimal estimation? → Yes.

RADAR (ℓ1-regularization): Agarwal et al. (2012).
REASON (stochastic ADMM): Sedghi et al. (2014).

    ∥x_T − x*∥^2 ≤ s log(p)/T    for an s-sparse truth x*.

These analyses discuss optimization and statistics simultaneously.

61 / 62

Summary of Part III

Stochastic ADMM for structured sparsity
  Online proximal gradient method and regularized dual averaging method for ADMM (online)
  Stochastic dual coordinate descent for ADMM (batch)
  Convergence results similar to those of the ordinary (non-ADMM) methods

Distributed stochastic optimization
  Simple averaging
  Distributed SDCA (COCOA+)

62 / 62

References

A. Agarwal, S. Negahban, and M. J. Wainwright. Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions. In Advances in Neural Information Processing Systems, pages 1538-1546, 2012.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.

S. Azadi and S. Sra. Towards an optimal stochastic alternating direction method of multipliers. In Proceedings of the 31st International Conference on Machine Learning, pages 620-628, 2014.

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1-106, 2012.

F. R. Bach. Structured sparsity-inducing norms through submodular functions. In Advances in Neural Information Processing Systems, pages 118-126, 2010.

O. Banerjee, L. E. Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485-516, 2008.

A. N. Burnetas and M. N. Katehakis. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2):122-142, 1996.

A. Chambolle. An algorithm for total variation minimization and applications. Journal of Mathematical Imaging and Vision, 20(1-2):89-97, 2004.

W. Deng and W. Yin. On the global and linear convergence of the generalized alternating direction method of multipliers. Technical report, Rice University CAAM TR12-14, 2012.

J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2873-2908, 2009.

D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite-element approximations. Computers & Mathematics with Applications, 2:17-40, 1976.

B. He and X. Yuan. On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700-709, 2012.

M. Hestenes. Multiplier and gradient methods. Journal of Optimization Theory and Applications, 4:303-320, 1969.

M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303-1347, 2013.

M. Hong and Z.-Q. Luo. On the linear convergence of the alternating direction method of multipliers. arXiv preprint arXiv:1208.3922, 2012.

L. Jacob, G. Obozinski, and J.-P. Vert. Group lasso with overlap and graph lasso. In Proceedings of the 26th International Conference on Machine Learning, 2009.

R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315-323. Curran Associates, Inc., 2013. URL http://papers.nips.cc/paper/4937-accelerating-stochastic-gradient-descent-using-predictive-variance-reduction.pdf.

Y. Kawahara, K. Nagano, K. Tsuda, and J. A. Bilmes. Submodularity cuts and applications. In Advances in Neural Information Processing Systems, pages 916-924, 2009.

T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985.

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems 25, 2013.

C. Ma, V. Smith, M. Jaggi, M. Jordan, P. Richtarik, and M. Takac. Adding vs. averaging in distributed primal-dual optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1973-1982, 2015.

J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Non-local sparse models for image restoration. In Computer Vision, 2009 IEEE 12th International Conference on, pages 2272-2279. IEEE, 2009.

J. Mockus. On Bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, pages 400-404. Springer, 1975.

J. Mockus and L. Mockus. Bayesian approach to global optimization and application to multiobjective and constrained problems. Journal of Optimization Theory and Applications, 70(1):157-172, 1991.

J. F. Mota, J. M. Xavier, P. M. Aguiar, and M. Puschel. A proof of convergence for the alternating direction method of multipliers applied to polyhedral-constrained functions. arXiv preprint arXiv:1112.2295, 2011.

Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report 76, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 2007.

Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, Series B, 120:221-259, 2009.

H. Ouyang, N. He, L. Q. Tran, and A. Gray. Stochastic alternating direction method of multipliers. In Proceedings of the 30th International Conference on Machine Learning, 2013.

M. Powell. A method for nonlinear constraints in minimization problems. In R. Fletcher, editor, Optimization, pages 283-298. Academic Press, London, New York, 1969.

H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527-535, 1952.

R. T. Rockafellar. Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of Operations Research, 1:97-116, 1976.

L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1):259-268, 1992.

H. Sedghi, A. Anandkumar, and E. Jonckheere. Multi-step stochastic ADMM in high dimensions: Applications to sparse optimization and matrix decomposition. In Advances in Neural Information Processing Systems, pages 2771-2779, 2014.

S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. Technical report, 2013. arXiv:1211.2717.

O. Shamir, N. Srebro, and T. Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In Proceedings of the 32nd International Conference on Machine Learning, pages 1000-1008, 2014.

J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951-2959, 2012.

N. Srinivas, A. Krause, S. M. Kakade, and M. W. Seeger. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250-3265, 2012.

T. Suzuki. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In Proceedings of the 30th International Conference on Machine Learning, pages 392-400, 2013.

T. Suzuki. Stochastic dual coordinate ascent with alternating direction method of multipliers. In Proceedings of the 31st International Conference on Machine Learning, pages 736-744, 2014.

W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, pages 285-294, 1933.

R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B, 67(1):91-108, 2005.

R. J. Tibshirani and J. Taylor. The solution path of the generalized lasso. Annals of Statistics, 39(3):1335-1371, 2011. doi: 10.1214/11-AOS878. URL http://dx.doi.org/10.1214/11-AOS878.

H. Wang and A. Banerjee. Online alternating direction method. In Proceedings of the 29th International Conference on Machine Learning, 2012.

H. Wang, A. Banerjee, and Z.-Q. Luo. Parallel direction method of multipliers. In Advances in Neural Information Processing Systems, pages 181-189, 2014.

M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681-688, 2011.

L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems 23, 2009.

D. Yogatama and N. A. Smith. Making the most of bag of words: Sentence regularization with alternating direction method of multipliers. In Proceedings of the 31st International Conference on Machine Learning, pages 656-664, 2014.

Y.-L. Yu. On decomposing the proximal map. In Advances in Neural Information Processing Systems, pages 91-99, 2013.

L. Yuan, J. Liu, and J. Ye. Efficient methods for overlapping group lasso. In Advances in Neural Information Processing Systems 24, 2011.

M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19-35, 2007.

Y. Zhang and X. Lin. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 353-361, 2015.

Y. Zhang, J. C. Duchi, and M. Wainwright. Communication-efficient algorithms for statistical optimization. Journal of Machine Learning Research, 14:3321-3363, 2013.

W. Zhong and J. Kwok. Fast stochastic alternating direction method of multipliers. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 46-54, 2014.

M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 2595-2603, 2010.