Stochastic Optimization
Part III: Advanced topics of stochastic optimization

†‡ Taiji Suzuki

† Tokyo Institute of Technology
  Graduate School of Information Science and Engineering
  Department of Mathematical and Computing Sciences
‡ JST, PRESTO

MLSS2015@Kyoto

1 / 62

Outline

1 Stochastic optimization for structured regularization
    Structured regularization
    Alternating Direction Method of Multipliers (ADMM)
    Stochastic ADMM for online data
    Stochastic ADMM for batch data

2 Parallel and distributed optimization

3 Further interesting topics

2 / 62


Regularized learning problem

Lasso:

    min_{x∈R^p}  (1/n) Σ_{i=1}^n (y_i − z_i^⊤ x)^2 + ∥x∥_1,

where the ∥x∥_1 term is the regularization.

General regularized learning problem:

    min_{x∈R^p}  (1/n) Σ_{i=1}^n f_i(z_i^⊤ x) + ψ(x).

Difficulty: sparsity-inducing regularization is usually non-smooth.

5 / 62


Proximal mapping

Regularized learning problem:

    min_{x∈R^p}  (1/n) Σ_{i=1}^n f_i(z_i^⊤ x) + ψ(x).

Let g_t ∈ ∂_x ( (1/n) Σ_{i=1}^n f_i(z_i^⊤ x) ) |_{x=x^(t−1)} and ḡ_t = (1/t) Σ_{τ=1}^t g_τ.

Proximal gradient descent:

    x^(t) = argmin_{x∈R^p} { g_t^⊤ x + ψ(x) + (1/(2η_t)) ∥x − x^(t−1)∥^2 }.

Regularized dual averaging:

    x^(t) = argmin_{x∈R^p} { ḡ_t^⊤ x + ψ(x) + (1/(2η_t)) ∥x∥^2 }.

A key computation is the proximal mapping:

    prox(q|ψ) := argmin_x { ψ(x) + (1/2) ∥x − q∥^2 }.

6 / 62

Example of proximal mapping: ℓ1 regularization

    prox(q | C∥·∥_1) = argmin_x { C∥x∥_1 + (1/2)∥x − q∥^2 } = ( sign(q_j) max(|q_j| − C, 0) )_j.

→ Soft-thresholding function. Analytic form.

There are also many regularization functions for which computing the prox mapping is difficult.
→ Structured regularization.

7 / 62
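As a concrete illustration, here is a minimal NumPy sketch (not part of the slides) of the soft-thresholding operator and one proximal gradient step for the Lasso objective above; the function names and the step size eta are placeholders chosen for this example.

    import numpy as np

    def prox_l1(q, C):
        """Soft-thresholding: prox of C*||.||_1 evaluated at q."""
        return np.sign(q) * np.maximum(np.abs(q) - C, 0.0)

    def proximal_gradient_step(x, Z, y, C, eta):
        """One proximal gradient step for (1/n)||Zx - y||^2 + C||x||_1."""
        n = Z.shape[0]
        grad = (2.0 / n) * Z.T @ (Z @ x - y)     # gradient of the smooth squared loss
        return prox_l1(x - eta * grad, eta * C)  # prox of the scaled l1 term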

Examples of structured regularization

Overlapped group lasso:

    ψ(w) = C Σ_{g∈G} ∥w_g∥_q    (q > 1; typically q = 2, ∞).

The groups may have overlaps. It is difficult to compute the proximal mapping.

Application (1): Genome Wide Association Study (GWAS) (Balding '06, McCarthy et al. '08)

8 / 62

Application of group reg. (2)

Sentence regularization for text classification (Yogatama and Smith, 2014).

The words occurring in the same sentence are grouped:

    ψ(w) = Σ_{d=1}^D Σ_{s=1}^{S_d} λ_{d,s} ∥w_{(d,s)}∥_2,

where d indexes a document and s indexes a sentence.

9 / 62

(Generalized) Fused Lasso and TV-denoising

Fused lasso (Tibshirani et al. (2005), Jacob et al. (2009)):

    ψ(x) = C Σ_{(i,j)∈E} |x_i − x_j|.

TV-denoising (Rudin et al., 1992, Chambolle, 2004):

    ψ(X) = C Σ_{i,j} √( |X_{i+1,j} − X_{i,j}|^2 + |X_{i,j+1} − X_{i,j}|^2 ).

(Figure: a small graph on nodes x1,…,x5 illustrating the edge set E.)

Examples: genome data analysis by fused lasso (Tibshirani and Taylor, 2011); image restoration (Mairal et al., 2009).

10 / 62

Other examples

Robust PCA (Candes et al., 2009).
Low rank tensor estimation (Signoretto et al., 2010; Tomioka et al., 2011).
Dictionary learning (Kasiviswanathan et al., 2012; Rakotomamonjy, 2013).

11 / 62

Solutions

Developing a sophisticated method for each regularization: Jacob et al. (2009), Yuan et al. (2011).
Submodular optimization: Bach (2010), Kawahara et al. (2009), Bach et al. (2012).
Decomposing the proximal mapping: Yu (2013).
Applying a linear transformation to make the regularization simpler.
→ Alternating Direction Method of Multipliers (ADMM).

12 / 62

Outline

1 Stochastic optimization for structured regularization
    Structured regularization
    Alternating Direction Method of Multipliers (ADMM)
    Stochastic ADMM for online data
    Stochastic ADMM for batch data

2 Parallel and distributed optimization

3 Further interesting topics

13 / 62

Linear transformation, decomposition technique

Overlapped group lasso: ψ(w) = C Σ_{g∈G} ∥w_g∥.
※ It is difficult to compute the proximal mapping.

Solution:

Prepare a regularizer ψ̃ whose proximal mapping is easily computable.
Let ψ̃(B^⊤ w) = ψ(w), and utilize the proximal mapping w.r.t. ψ̃.

(Figure: B^⊤ w duplicates the overlapping coordinates so that the groups become disjoint.)

Decompose into independent groups:

    ψ̃(y) = C Σ_{g'∈G'} ∥y_{g'}∥,

    prox(q|ψ̃) = ( q_{g'} max{ 1 − C/∥q_{g'}∥, 0 } )_{g'∈G'}.

14 / 62

Another example

Graph-guided regularization:

    ψ(w) = C Σ_{(i,j)∈E} |w_i − w_j|.

(Figure: a small graph on nodes x1,…,x5 with edge set E.)

Define

    ψ̃(y) = C Σ_{e∈E} |y_e|,    y = B^⊤ w = (w_i − w_j)_{(i,j)∈E},

so that ψ̃(B^⊤ w) = ψ(w), and

    prox(q|ψ̃) = ( q_e max{ 1 − C/|q_e|, 0 } )_{e∈E}.

→ Soft-thresholding function.

15 / 62
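To make the decomposition concrete, here is a small NumPy sketch (an illustration, not the slides' code) that builds the edge-difference matrix B^⊤ for a given edge list and evaluates the prox of ψ̃(y) = C Σ_e |y_e|, which is coordinate-wise soft-thresholding; the 5-node edge set below is a hypothetical example.

    import numpy as np

    def edge_difference_matrix(p, edges):
        """B^T maps w in R^p to (w_i - w_j) over the edge list."""
        Bt = np.zeros((len(edges), p))
        for k, (i, j) in enumerate(edges):
            Bt[k, i] = 1.0
            Bt[k, j] = -1.0
        return Bt

    def prox_edge_l1(q, C):
        """prox of C * sum_e |y_e|: soft-threshold each edge coordinate."""
        return np.sign(q) * np.maximum(np.abs(q) - C, 0.0)

    # Hypothetical 5-node graph (the exact edges of the slide's figure are not recoverable)
    edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
    Bt = edge_difference_matrix(5, edges)
    w = np.array([1.0, 0.9, -0.2, -0.1, 1.1])
    y = prox_edge_l1(Bt @ w, C=0.3)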

Optimizing a composite objective function with a linear constraint

    min_x  (1/n) Σ_{i=1}^n f_i(z_i^⊤ x) + ψ(B^⊤ x)
    ⇔  min_{x,y}  (1/n) Σ_{i=1}^n f_i(z_i^⊤ x) + ψ(y)   s.t.  y = B^⊤ x.

Augmented Lagrangian:

    L(x, y, λ) = (1/n) Σ_i f_i(z_i^⊤ x) + ψ(y) + λ^⊤(y − B^⊤ x) + (ρ/2)∥y − B^⊤ x∥^2.

Then

    inf_{x,y} sup_λ L(x, y, λ)

yields the solution of the original problem.

The augmented Lagrangian is the basis of the method of multipliers (Hestenes, 1969, Powell, 1969, Rockafellar, 1976).

16 / 62

Method of multipliers

    min_{x,y} { f(x) + ψ(y)  s.t.  Ax + By = 0 }

    L(x, y, λ) = f(x) + ψ(y) + λ^⊤(Ax + By) + (ρ/2)∥Ax + By∥^2

Method of multipliers (Hestenes, 1969, Powell, 1969):

    (x^(t), y^(t)) = argmin_{(x,y)} L(x, y, λ^(t−1)),
    λ^(t) = λ^(t−1) + ρ(Ax^(t) + By^(t)).

Remark:

The update of λ corresponds to gradient ascent on L.
It is easy to check that

    ∇_x ( f(x) + ⟨λ^(t), Ax + By^(t)⟩ ) |_{x=x^(t)} = 0,
    ∇_y ( ψ(y) + ⟨λ^(t), Ax^(t) + By⟩ ) |_{y=y^(t)} = 0.

If Ax^(t) + By^(t) = 0, this gives the optimality condition.

17 / 62


Alternating Direction Method of Multipliers (Douglas-Rachford splitting)

    min_{x,y} { f(x) + ψ(y)  s.t.  Ax + By = 0 }

    L(x, y, λ) = f(x) + ψ(y) + λ^⊤(Ax + By) + (ρ/2)∥Ax + By∥^2

Alternating Direction Method of Multipliers (Gabay and Mercier, 1976):

    x^(t) = argmin_x L(x, y^(t−1), λ^(t−1)),
    y^(t) = argmin_y L(x^(t), y, λ^(t−1)),
    λ^(t) = λ^(t−1) + ρ(Ax^(t) + By^(t)).

ADMM converges to the optimum (Mota et al., 2011).
O(1/t) convergence in general (He and Yuan, 2012).
Linear convergence for a strongly convex objective (Deng and Yin, 2012, Hong and Luo, 2012).

18 / 62


ADMM for structured regularization

    min_x { f(x) + ψ(B^⊤ x) }  ⇔  min_{x,y} { f(x) + ψ(y)  s.t.  y = B^⊤ x }

    L(x, y, λ) = f(x) + ψ(y) + λ^⊤(y − B^⊤ x) + (ρ/2)∥y − B^⊤ x∥^2,

where f(x) = (1/n) Σ_i f_i(z_i^⊤ x).

ADMM for structured regularization:

    x^(t) = argmin_x { f(x) + λ^(t−1)⊤(−B^⊤ x) + (ρ/2)∥y^(t−1) − B^⊤ x∥^2 },
    y^(t) = argmin_y { ψ(y) + λ^(t−1)⊤ y + (ρ/2)∥y − B^⊤ x^(t)∥^2 }
          ( = prox(B^⊤ x^(t) − λ^(t−1)/ρ | ψ/ρ) ),
    λ^(t) = λ^(t−1) − ρ(B^⊤ x^(t) − y^(t)).

The update of y is given by the proximal mapping w.r.t. the simple ψ.
→ Usually available in analytic form.
However, the computation of (1/n) Σ_{i=1}^n f_i(z_i^⊤ x) is still heavy.
→ Stochastic versions of ADMM have been developed (Suzuki, 2013, Ouyang et al., 2013, Suzuki, 2014).

19 / 62


Example: ADMM for Lasso

    min_x { (1/(2n)) ∥Zx − Y∥^2 + C∥x∥_1 }

ADMM for Lasso (with the constraint y = x):

    x^(t) = ( Z^⊤Z/n + ρI )^{−1} ( Z^⊤Y/n + λ^(t−1) + ρ y^(t−1) ),
    y^(t) = ST_{C/ρ}( x^(t) − λ^(t−1)/ρ ),
    λ^(t) = λ^(t−1) − ρ( x^(t) − y^(t) ),

where ST_η(x) = ( sign(x_j) max{|x_j| − η, 0} )_j.

20 / 62
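Here is a minimal NumPy sketch of ADMM for the Lasso, written directly from the generic ADMM updates of the previous slide (so the sign conventions follow that derivation rather than being copied from this slide); Z, Y, C and rho are the quantities above, and the function names are placeholders for this example.

    import numpy as np

    def soft_threshold(x, eta):
        return np.sign(x) * np.maximum(np.abs(x) - eta, 0.0)

    def admm_lasso(Z, Y, C, rho=1.0, iters=100):
        """ADMM for min_x (1/(2n))||Zx - Y||^2 + C||x||_1 with constraint y = x."""
        n, p = Z.shape
        x = np.zeros(p); y = np.zeros(p); lam = np.zeros(p)   # lam is the multiplier
        M = Z.T @ Z / n + rho * np.eye(p)   # the x-update solves a fixed linear system
        ZtY = Z.T @ Y / n
        for _ in range(iters):
            x = np.linalg.solve(M, ZtY + lam + rho * y)
            y = soft_threshold(x - lam / rho, C / rho)
            lam = lam - rho * (x - y)
        return y   # the sparse iterate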

Outline

1 Stochastic optimization for structured regularization
    Structured regularization
    Alternating Direction Method of Multipliers (ADMM)
    Stochastic ADMM for online data
    Stochastic ADMM for batch data

2 Parallel and distributed optimization

3 Further interesting topics

21 / 62

Table of literature

Stochastic methods for regularized learning problems:

           | Normal                                                | ADMM
  Online   | • Proximal gradient type (Nesterov, 2007)             | Online-ADMM (Wang and Banerjee, 2012)
           |   FOBOS (Duchi and Singer, 2009)                      | SGD-ADMM (Suzuki, 2013, Ouyang et al., 2013)
           | • Dual averaging type (Nesterov, 2009)                | RDA-ADMM (Suzuki, 2013)
           |   RDA (Xiao, 2009)                                    |
  Batch    | SDCA, Stochastic Dual Coordinate Ascent               | SDCA-ADMM (Suzuki, 2014)
           |   (Shalev-Shwartz and Zhang, 2013)                    |
           | SAG, Stochastic Averaging Gradient                    |
           |   (Le Roux et al., 2013)                              |
           | SVRG, Stochastic Variance Reduced Gradient            |
           |   (Johnson and Zhang, 2013)                           |

Online type: O(1/√T) in general, O(log(T)/T) or O(1/T) for a strongly convex objective.
Batch type: linear convergence.

22 / 62

Reminder of online stochastic optimization

Let g_t ∈ ∂ℓ_t(x_t) and ḡ_t = (1/t) Σ_{τ=1}^t g_τ   (where ℓ_t(x) = ℓ(z_t, x)).

OPG (Online Proximal Gradient Descent):

    x_{t+1} = argmin_x { ⟨g_t, x⟩ + ψ(x) + (1/(2η_t)) ∥x − x_t∥^2 }

RDA (Regularized Dual Averaging; Xiao (2009), Nesterov (2009)):

    x_{t+1} = argmin_x { ⟨ḡ_t, x⟩ + ψ(x) + (1/(2η_t)) ∥x∥^2 }

These update rules are computed via the proximal mapping associated with ψ.
Efficient for a simple regularization such as ℓ1 regularization.
How about structured regularization? → ADMM.

23 / 62


OPG-ADMM

Ordinary OPG:  x_{t+1} = argmin_x { ⟨g_t, x⟩ + ψ(x) + (1/(2η_t)) ∥x − x_t∥^2 }.

OPG-ADMM:

    x_{t+1} = argmin_{x∈X} { g_t^⊤ x − λ_t^⊤(B^⊤ x − y_t) + (ρ/2)∥B^⊤ x − y_t∥^2 + (1/(2η_t)) ∥x − x_t∥^2_{G_t} },
    y_{t+1} = argmin_{y∈Y} { ψ(y) − λ_t^⊤(B^⊤ x_{t+1} − y) + (ρ/2)∥B^⊤ x_{t+1} − y∥^2 }
            = prox(B^⊤ x_{t+1} − λ_t/ρ | ψ),
    λ_{t+1} = λ_t − ρ(B^⊤ x_{t+1} − y_{t+1}).

The update rules of y_{t+1} and λ_{t+1} are the same as in ordinary ADMM.
prox(·|ψ) is usually analytically obtained.
G_t is any positive definite matrix.

24 / 62

RDA-ADMM

Ordinary RDA:  w_{t+1} = argmin_w { ⟨ḡ_t, w⟩ + ψ(w) + (1/(2η_t)) ∥w∥^2 }.

RDA-ADMM:

Let x̄_t = (1/t) Σ_{τ=1}^t x_τ,  λ̄_t = (1/t) Σ_{τ=1}^t λ_τ,  ȳ_t = (1/t) Σ_{τ=1}^t y_τ,  ḡ_t = (1/t) Σ_{τ=1}^t g_τ.

    x_{t+1} = argmin_{x∈X} { ḡ_t^⊤ x − (Bλ̄_t)^⊤ x + (ρ/(2t)) ∥B^⊤ x∥^2
                             + ρ(B^⊤ x̄_t − ȳ_t)^⊤ B^⊤ x + (1/(2η_t)) ∥x∥^2_{G_t} },
    y_{t+1} = prox(B^⊤ x_{t+1} − λ_t/ρ | ψ),
    λ_{t+1} = λ_t − ρ(B^⊤ x_{t+1} − y_{t+1}).

The update rules of y_{t+1} and λ_{t+1} are the same as in ordinary ADMM.

25 / 62

Simplified version

Letting G_t take a specific form, the update rule is much simplified
(G_t = γI − (ρ/t) η_t BB^⊤ for RDA-ADMM, and G_t = γI − ρ η_t BB^⊤ for OPG-ADMM).

Online stochastic ADMM:

1. Update of x:
    (OPG-ADMM)  x_{t+1} = Π_X [ −(η_t/γ) { g_t − B(λ_t − ρB^⊤ x_t + ρ y_t) } + x_t ],
    (RDA-ADMM)  x_{t+1} = Π_X [ −(η_t/γ) { ḡ_t − B(λ̄_t − ρB^⊤ x̄_t + ρ ȳ_t) } ].
2. Update of y:
    y_{t+1} = prox(B^⊤ x_{t+1} − λ_t/ρ | ψ).
3. Update of λ:
    λ_{t+1} = λ_t − ρ(B^⊤ x_{t+1} − y_{t+1}).

Fast computation. Easy implementation.

26 / 62
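The following is a minimal sketch of one round of the simplified OPG-ADMM update above (illustrative only): the stochastic gradient g, the projection onto X, the prox of ψ, and the parameters γ, ρ, η_t are all supplied by the caller.

    import numpy as np

    def opg_admm_step(x, y, lam, g, B, prox_psi, eta, gamma=1.0, rho=1.0,
                      project=lambda v: v):
        """One simplified OPG-ADMM round: returns (x_{t+1}, y_{t+1}, lam_{t+1}).

        g        : stochastic subgradient of the loss at x_t
        B        : matrix appearing in the regularizer psi(B^T x)
        prox_psi : function q -> prox(q | psi)
        project  : projection onto the domain X (identity if X = R^p)
        """
        x_new = project(x - (eta / gamma) * (g - B @ (lam - rho * (B.T @ x) + rho * y)))
        y_new = prox_psi(B.T @ x_new - lam / rho)
        lam_new = lam - rho * (B.T @ x_new - y_new)
        return x_new, y_new, lam_new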

Convergence analysis

We bound the expected risk:

    P(x) = E_Z[ℓ(Z, x)] + ψ(x).

Assumptions:

(A1) ∃G s.t. every g ∈ ∂_x ℓ(z, x) satisfies ∥g∥ ≤ G for all z, x.
(A2) ∃L s.t. every g ∈ ∂ψ(y) satisfies ∥g∥ ≤ L for all y.
(A3) ∃R s.t. every x ∈ X satisfies ∥x∥ ≤ R.

27 / 62

Convergence rate: bounded gradient

(A1) ∃G s.t. every g ∈ ∂_x ℓ(z, x) satisfies ∥g∥ ≤ G for all z, x.
(A2) ∃L s.t. every g ∈ ∂ψ(y) satisfies ∥g∥ ≤ L for all y.
(A3) ∃R s.t. every x ∈ X satisfies ∥x∥ ≤ R.

Theorem (Convergence rate of RDA-ADMM)
Under (A1), (A2), (A3), we have

    E_{z_{1:T−1}}[P(x̄_T) − P(x*)] ≤ (1/T) Σ_{t=2}^T η_{t−1} G^2/(2(t−1)) + (γ/η_T)∥x*∥^2 + K/T.

Theorem (Convergence rate of OPG-ADMM)
Under (A1), (A2), (A3), we have

    E_{z_{1:T−1}}[P(x̄_T) − P(x*)] ≤ (1/(2T)) Σ_{t=2}^T max{ γ/η_t − γ/η_{t−1}, 0 } R^2
                                   + (1/T) Σ_{t=1}^T (η_t/2) G^2 + K/T.

Both methods have convergence rate O(1/√T) by letting η_t = η_0 √t for RDA-ADMM and η_t = η_0/√t for OPG-ADMM.

28 / 62

Convergence rate: strongly convex

(A4) There exist σ_f, σ_ψ ≥ 0 and P, Q ⪰ O such that

    E_Z[ℓ(Z, x) + (x′ − x)^⊤ ∇_x ℓ(Z, x)] + (σ_f/2)∥x − x′∥^2_P ≤ E_Z[ℓ(Z, x′)],
    ψ(y) + (y′ − y)^⊤ ∇ψ(y) + (σ_ψ/2)∥y − y′∥^2_Q ≤ ψ(y′),

for all x, x′ ∈ X and y, y′ ∈ Y, and ∃σ > 0 satisfying

    σI ⪯ σ_f P + (ρσ_ψ/(2ρ + σ_ψ)) Q.

The update rule of RDA-ADMM is modified to

    x_{t+1} = argmin_{x∈X} { ḡ_t^⊤ x − λ̄_t^⊤ B^⊤ x + (ρ/(2t))∥B^⊤ x∥^2 + ρ(B^⊤ x̄_t − ȳ_t)^⊤ B^⊤ x
                             + (1/(2η_t))∥x∥^2_{G_t} + (tσ/2)∥x − x̄_t∥^2 }.

29 / 62

Convergence analysis: strongly convex

Theorem (Convergence rate of RDA-ADMM)
Under (A1), (A2), (A3), (A4), we have

    E_{z_{1:T−1}}[P(x̄_T) − P(x*)] ≤ (1/(2T)) Σ_{t=2}^T G^2 / ( (t−1)/η_{t−1} + tσ ) + (γ/η_T)∥x*∥^2 + K/T.

Theorem (Convergence rate of OPG-ADMM)
Under (A1), (A2), (A3), (A4), we have

    E_{z_{1:T−1}}[P(x̄_T) − P(x*)] ≤ (1/(2T)) Σ_{t=2}^T max{ γ/η_t − γ/η_{t−1} − σ, 0 } R^2
                                   + (1/T) Σ_{t=1}^T (η_t/2) G^2 + K/T.

Both methods have convergence rate O(log(T)/T) by letting η_t = η_0 t for RDA-ADMM and η_t = η_0/t for OPG-ADMM.
This can be improved to O(1/T) by weighted averaging (Azadi and Sra, 2014).

30 / 62

Numerical experiments

(Figure: expected risk vs. CPU time (s), log-log scale; methods: OPG-ADMM, RDA-ADMM, NL-ADMM, RDA, OPG.)
Figure: Artificial data: 1024 dim, 512 samples, overlapped group lasso.

(Figure: classification error vs. CPU time (s); methods: OPG-ADMM, RDA-ADMM, NL-ADMM, RDA.)
Figure: Real data (Adult, a9a): 15,252 dim, 32,561 samples, overlapped group lasso + ℓ1 reg.

31 / 62

Related methods

O(n/T) convergence (improved from O(1/√T)) in a batch setting: Zhong and Kwok (2014).
Acceleration of stochastic ADMM: Azadi and Sra (2014).
Parallel computing with stochastic ADMM: Wang et al. (2014).

32 / 62

Outline

1 Stochastic optimization for structured regularization
    Structured regularization
    Alternating Direction Method of Multipliers (ADMM)
    Stochastic ADMM for online data
    Stochastic ADMM for batch data

2 Parallel and distributed optimization

3 Further interesting topics

33 / 62

Batch setting

In the batch setting, the data are fixed. We just minimize the objective function

    (1/n) Σ_{i=1}^n f_i(a_i^⊤ x) + ψ(B^⊤ x).

ADMM version of SDCA: converges linearly, with

    T ≥ (n + γ/λ) log(1/ϵ)

iterations to achieve ϵ accuracy for a γ-smooth loss and λ-strongly convex regularization.

34 / 62

Dual problem

Let A = [a_1, a_2, . . . , a_n] ∈ R^{p×n}.

    min_w { (1/n) Σ_{i=1}^n f_i(a_i^⊤ w) + ψ(B^⊤ w) }                                    (P: Primal)

    = − min_{x∈R^n, y∈R^d} { (1/n) Σ_{i=1}^n f_i*(x_i) + ψ*(y/n)  s.t.  Ax + By = 0 }    (D: Dual)

Optimality condition:

    a_i^⊤ w* ∈ ∇f_i*(x_i*),    (1/n) y* ∈ ∇ψ(u)|_{u=B^⊤ w*},    Ax* + By* = 0.

⋆ Each coordinate x_i corresponds to one observation a_i.

35 / 62

SDCA-ADMM

Let the augmented Lagrangian be

    L(x, y, w) := Σ_{i=1}^n f_i*(x_i) + n ψ*(y/n) − ⟨w, Ax + By⟩ + (ρ/2)∥Ax + By∥^2.

Basic algorithm: for each t = 1, 2, . . ., choose i ∈ {1, . . . , n} uniformly at random, and update

    y^(t) ← argmin_y { L(x^(t−1), y, w^(t−1)) + (1/2)∥y − y^(t−1)∥^2_Q },
    x_i^(t) ← argmin_{x_i∈R} { L([x_i; x_{\i}^(t−1)], y^(t), w^(t−1)) + (1/2)∥x_i − x_i^(t−1)∥^2_{G_{i,i}} },
    w^(t) ← w^(t−1) − ξρ{ n(Ax^(t) + By^(t)) − (n − 1)(Ax^(t−1) + By^(t−1)) }.

Q and G_{i,i} are positive definite matrices that satisfy some conditions.
Only the i-th coordinate x_i is updated.
The update of the multiplier w should be modified as above.

36 / 62

SDCA-ADMM

Let the augmented Lagrangian be

    L(x, y, w) := Σ_{i=1}^n f_i*(x_i) + n ψ*(y/n) − ⟨w, Ax + By⟩ + (ρ/2)∥Ax + By∥^2.

Split the index set {1, . . . , n} into K groups (I_1, I_2, . . . , I_K).

Block coordinate SDCA-ADMM: for each t = 1, 2, . . ., choose k ∈ {1, . . . , K} uniformly at random, set I = I_k, and update

    y^(t) ← argmin_y { L(x^(t−1), y, w^(t−1)) + (1/2)∥y − y^(t−1)∥^2_Q },
    x_I^(t) ← argmin_{x_I∈R^{|I|}} { L([x_I; x_{\I}^(t−1)], y^(t), w^(t−1)) + (1/2)∥x_I − x_I^(t−1)∥^2_{G_{I,I}} },
    w^(t) ← w^(t−1) − ξρ{ n(Ax^(t) + By^(t)) − (n − n/K)(Ax^(t−1) + By^(t−1)) }.

Q and G_{I,I} are positive definite matrices that satisfy some conditions.

37 / 62

Simplified algorithm

Setting

    Q = ρ(η_B I_d − B^⊤B),    G_{I,I} = ρ(η_{Z,I} I_{|I|} − Z_I^⊤ Z_I),

then, by the relation prox(q|ψ) + prox(q|ψ*) = q (Moreau decomposition), we obtain the following update rule.

Simplified algorithm:

For q^(t) = y^(t−1) + (B^⊤/(ρη_B)) { w^(t−1) − ρ(Z x^(t−1) + B y^(t−1)) }, let

    y^(t) ← q^(t) − prox( q^(t) | nψ(ρη_B ·)/(ρη_B) ).

For p_I^(t) = x_I^(t−1) + (Z_I^⊤/(ρη_{Z,I})) { w^(t−1) − ρ(Z x^(t−1) + B y^(t)) }, let

    x_i^(t) ← prox( p_i^(t) | f_i*/(ρη_{Z,I}) )    (∀i ∈ I).

⋆ The update of x can be parallelized.

38 / 62
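Purely as a structural illustration of the update rule above, here is a schematic NumPy sketch of one SDCA-ADMM round; the two proximal operators, the sampled block I, and the parameters rho, eta_B, eta_ZI, xi are assumed to be supplied by the caller, and the function names are invented for this sketch.

    import numpy as np

    def sdca_admm_round(x, y, w, Z, B, I, rho, eta_B, eta_ZI, xi,
                        prox_psi_term, prox_fconj_i):
        """One round of the simplified SDCA-ADMM update (schematic sketch).

        x : dual coordinates in R^n, y : auxiliary dual block, w : multiplier in R^p
        Z : p x n data matrix, B : p x d regularization matrix, I : sampled index block
        prox_psi_term(q)     -> prox(q | n*psi(rho*eta_B * .)/(rho*eta_B))
        prox_fconj_i(i, p_i) -> prox(p_i | f_i^* / (rho*eta_ZI))
        """
        n = Z.shape[1]
        r_prev = Z @ x + B @ y                              # residual Zx^{t-1} + By^{t-1}
        q = y + B.T @ (w - rho * r_prev) / (rho * eta_B)
        y_new = q - prox_psi_term(q)                        # Moreau decomposition step
        p_I = x[I] + Z[:, I].T @ (w - rho * (Z @ x + B @ y_new)) / (rho * eta_ZI)
        x_new = x.copy()
        for j, i in enumerate(I):                           # parallelizable over i in I
            x_new[i] = prox_fconj_i(i, p_I[j])
        r_new = Z @ x_new + B @ y_new
        w_new = w - xi * rho * (n * r_new - (n - len(I)) * r_prev)
        return x_new, y_new, w_new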

Convergence analysis

x*: the optimal x;  Y*: the set of optimal y (not necessarily unique);  w*: the optimal w.

Assumptions:

There exists γ > 0 such that, for all x_i ∈ R,

    f_i*(x_i) − f_i*(x_i*) ≥ ⟨∇f_i*(x_i*), x_i − x_i*⟩ + ∥x_i − x_i*∥^2/(2γ).

There exist λ, v_ψ > 0 such that, for all y, u, there exists y* ∈ Y* such that

    ψ*(y/n) − ψ*(y*/n) ≥ ⟨B^⊤ w*, y/n − y*/n⟩ + (v_ψ/2)∥P_{Ker(B)}(y/n − y*/n)∥^2,
    ψ(u) − ψ(B^⊤ w*) ≥ ⟨y*/n, u − B^⊤ w*⟩ + (λ/2)∥u − B^⊤ w*∥^2.

B^⊤ is injective.

39 / 62

Define

    F_D(x, y) := (1/n) Σ_{i=1}^n f_i*(x_i) + ψ*(y/n) − ⟨w*, Ax/n + By/n⟩,

    R_D(x, y, w) := F_D(x, y) − F_D(x*, y*) + (1/(2n^2 ξρ))∥w − w*∥^2 + (ρ(1−ξ)/(2n))∥Ax + By∥^2
                    + (1/(2n))∥x − x*∥^2_{I/γ + H} + (1/(2nK))∥y − y*∥^2_Q.

Theorem (Linear convergence of SDCA-ADMM)
Let H be a matrix such that H_{I,I} = ρ A_I^⊤ A_I + G_{I,I} for all I ∈ {I_1, . . . , I_K}, let

    µ = min{ 1/(4(1 + γσ_max(H))),
             λρσ_min(B^⊤B) / (2 max{1/n, 4λρ, 4λσ_max(Q)}),
             (K v_ψ/n) / (4σ_max(Q)),
             K σ_min(BB^⊤) / (4σ_max(Q)(ργσ_max(A^⊤A) + 4)) },

let ξ = 1/(4n), and let C_1 = R_D(x^(0), y^(0), w^(0)). Then we have

    (dual residual)      E[R_D(x^(t), y^(t), w^(t))] ≤ (1 − µ/K)^t C_1,
    (primal variable)    E[∥w^(t) − w*∥^2] ≤ (nρ/2) (1 − µ/K)^t C_1.

For t ≥ C′ (K/µ) log(C″ n/ϵ), we have E[∥w^(t) − w*∥^2] ≤ ϵ.

40 / 62

Convergence analysis

Assumptions:
  f_i is γ-smooth.
  ψ is λ-strongly convex.
  Other technical conditions.

With the setting ρ = min{1, 1/γ} and K = n,

    t ≥ C (n + γ/λ) log(C′/ϵ)

gives E[∥w^(t) − w*∥^2] ≤ ϵ  (γ/λ plays the role of the condition number).
The rate is as good as that of ordinary SDCA.

Non-stochastic methods (e.g., ADMM on the dual) require

    t ≥ C n (γ/λ) log(C′/ϵ).

41 / 62

Numerical experiments: loss function

Binary classification. Smoothed hinge loss:

    f_i(u) = 0                    (y_i u ≥ 1),
             1/2 − y_i u          (y_i u < 0),
             (1 − y_i u)^2 / 2    (otherwise).

(Figure: plot of the smoothed hinge loss.)

⇒ The proximal mapping is analytically obtained:

    prox(u | f_i*/C) = (Cu − y_i)/(1 + C)    (−1 ≤ (Cu y_i − 1)/(1 + C) ≤ 0),
                       −y_i                  ((Cu y_i − 1)/(1 + C) < −1),
                       0                     (otherwise).

42 / 62
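As a concrete illustration of the two formulas above, here is a minimal NumPy sketch (not the slides' code) of the smoothed hinge loss and of the prox of f_i*/C given by the case formula; labels y are assumed to be in {−1, +1} and the function names are placeholders.

    import numpy as np

    def smoothed_hinge(u, y):
        """Smoothed hinge loss f_i(u) for label y, vectorized over the margin."""
        m = y * u
        return np.where(m >= 1, 0.0,
               np.where(m < 0, 0.5 - m, 0.5 * (1.0 - m) ** 2))

    def prox_conj_smoothed_hinge(u, y, C):
        """prox(u | f_i^*/C) following the case formula above."""
        v = (C * u * y - 1.0) / (1.0 + C)
        return np.where((v >= -1.0) & (v <= 0.0), (C * u - y) / (1.0 + C),
               np.where(v < -1.0, -y, 0.0))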

Numerical experiments (artificial data): setting

Overlapped group regularization:

    ψ(X) = C ( Σ_{i=1}^{32} ∥X_{:,i}∥ + Σ_{j=1}^{32} ∥X_{j,:}∥ + 0.01 × Σ_{i,j} X_{i,j}^2 / 2 ),    X ∈ R^{32×32}.

43 / 62

Numerical experiments (artificial data): results

(Figure, three panels vs. CPU time (s): (a) excess objective value (empirical risk), (b) test loss, (c) classification error; methods: RDA, OPG-ADMM, RDA-ADMM, OL-ADMM, Batch-ADMM, SDCA-ADMM.)

Figure: Artificial data (n = 5,120, d = 1024). Overlapped group lasso. Mini-batch size n/K = 50.

44 / 62

Numerical experiments (artificial data): results

(Figure, three panels vs. CPU time (s): (a) excess objective value (empirical risk), (b) test loss, (c) classification error; methods: RDA, OPG-ADMM, RDA-ADMM, OL-ADMM, Batch-ADMM, SDCA-ADMM.)

Figure: Artificial data (n = 51,200, d = 1024). Overlapped group lasso. Mini-batch size n/K = 50.

45 / 62

Numerical experiments (real data): setting

Graph regularization:

    ψ(w) = C_1 Σ_{i=1}^p |w_i| + C_2 Σ_{(i,j)∈E} |w_i − w_j| + 0.01 × ( C_1 Σ_{i=1}^p |w_i|^2 + C_2 Σ_{(i,j)∈E} |w_i − w_j|^2 ),

where E is the set of edges obtained from the similarity matrix.
The similarity graph is obtained by Graphical Lasso (Yuan and Lin, 2007, Banerjee et al., 2008).

46 / 62

Numerical experiments (real data): results

(Figure, three panels vs. CPU time (s): (a) objective function (empirical risk), (b) test loss, (c) classification error; methods: OPG-ADMM, RDA-ADMM, OL-ADMM, SDCA-ADMM.)

Figure: Real data (20 Newsgroups, n = 12,995). Graph regularization. Mini-batch size n/K = 50.

47 / 62

Numerical experiments (real data): results

(Figure, three panels vs. CPU time (s): (a) objective function (empirical risk), (b) test loss, (c) classification error; methods: OPG-ADMM, RDA-ADMM, OL-ADMM, SDCA-ADMM.)

Figure: Real data (a9a, n = 32,561). Graph regularization. Mini-batch size n/K = 50.

48 / 62

Outline

1 Stochastic optimization for structured regularization
    Structured regularization
    Alternating Direction Method of Multipliers (ADMM)
    Stochastic ADMM for online data
    Stochastic ADMM for batch data

2 Parallel and distributed optimization

3 Further interesting topics

49 / 62

Distributed computing

Q: How can we deal with huge data that cannot be handled on one computational node?
A: Distributed computing.

Challenge: the communication cost trade-off, i.e., communication inside a node vs. between nodes via the network.

We briefly introduce two approaches:
  Simple averaging of SGD: Zinkevich et al. (2010), Zhang et al. (2013).
  Distributed dual coordinate descent (COCOA+): Ma et al. (2015).

50 / 62

Simple averaging

(Figure: K nodes computing x[1], x[2], x[3], x[4] independently.)

Run independent SGDs on K nodes.
Take the average of the K final solutions:

    x̄_K = (1/K) Σ_{k=1}^K x[k].

Just one synchronization, so the communication cost is low. How about convergence?

51 / 62
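A minimal sketch of the one-shot averaging scheme, assuming an ℓ2-regularized least-squares objective as a stand-in for the loss and a sequential simulation of the K nodes (none of this is from the slides):

    import numpy as np

    def sgd_least_squares(Z, y, steps, eta0, lam, rng):
        """SGD on the lam-strongly-convex objective (1/2)(z^T x - y)^2 + (lam/2)||x||^2."""
        x = np.zeros(Z.shape[1])
        for t in range(1, steps + 1):
            i = rng.integers(Z.shape[0])
            grad = (Z[i] @ x - y[i]) * Z[i] + lam * x
            x -= (eta0 / (lam * t)) * grad            # 1/(lam*t)-type step size
        return x

    def one_shot_average(Z, y, K, steps, eta0=1.0, lam=0.1, seed=0):
        """Run K independent SGDs on disjoint shards and average the K solutions."""
        rng = np.random.default_rng(seed)
        shards = np.array_split(rng.permutation(len(y)), K)
        sols = [sgd_least_squares(Z[s], y[s], steps, eta0, lam,
                                  np.random.default_rng(seed + 1 + k))
                for k, s in enumerate(shards)]
        return np.mean(sols, axis=0)                  # single synchronization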

Assumptions:

The loss function is sufficiently smooth (the second-order derivative exists and is Lipschitz continuous and bounded).
The expected loss is λ-strongly convex.
Each node runs T iterations of SGD.

Theorem (Zhang et al., 2013)
With an appropriate step size,

    E[∥x̄_K − x*∥^2] ≤ C ( G^2/(K T λ^2) + 1/T^{3/2} ).

As K is increased, the main term improves linearly.
However, too large a K is not effective. In fact, for λ ∈ (0, 1/√T), it is shown that (Shamir et al., 2014)

    E[∥x̄_K − x*∥^2] ≥ C/(λ^2 T).

52 / 62

COCOA+ (Ma et al., 2015)

We divide the sample into K groups {G_k}_k:

    {1, . . . , n} = ∪_{k=1}^K G_k,    G_k ∩ G_{k′} = ∅.

Dual problem of regularized ERM:

    D(y) = (1/n) Σ_{i=1}^n f_i*(y_i) + ψ*( −(1/n) A y )

         = (1/n) Σ_{k=1}^K ( Σ_{i∈G_k} f_i*(y_i) )    [divided into K groups]
           + ψ*( −(1/n) Σ_{k=1}^K A_{G_k} y_{G_k} )   [needs synchronization]

where A_{G_k} = [a_{i_1}, . . . , a_{i_{|G_k|}}] ∈ R^{p×|G_k|} with i_j ∈ G_k, and y_{G_k} = (y_i)_{i∈G_k}.

53 / 62

(Figure: the dual variables are split into blocks G1, G2, G3; each node runs a small coordinate descent on its block in parallel, and then the results are summed up at a synchronization step.)

54 / 62
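The outer loop described above can be sketched as follows; this is only a schematic, assuming a caller-supplied approximate local solver for each group's subproblem, an aggregation parameter nu, and a mapping from the shared vector back to the primal variable via the gradient of ψ* (all of which are assumptions of this sketch, not details given on the slides).

    import numpy as np

    def cocoa_outer_loop(A, groups, local_solver, grad_psi_conj, n_rounds, nu=1.0):
        """Schematic COCOA+-style outer loop (illustration only).

        groups        : list of index arrays partitioning {0,...,n-1}
        local_solver  : (y_block, A_block, shared_v) -> approximate update for that block
        grad_psi_conj : gradient of psi^*, mapping the shared vector to the primal w
        """
        p, n = A.shape
        y = np.zeros(n)
        v = A @ y / n                        # shared vector that needs synchronization
        for _ in range(n_rounds):
            deltas = [local_solver(y[g], A[:, g], v) for g in groups]   # run in parallel
            for g, d in zip(groups, deltas): # synchronization: sum up the local updates
                y[g] += nu * d
            v = A @ y / n
        w = grad_psi_conj(-v)                # map back to the primal variable
        return w, y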

1. f_i is γ_f-smooth.
2. ψ is λ-strongly convex.
3. Each subproblem decreases the objective by a factor of Θ.

Define the separability of the problem as

    σ_min := max_{y∈R^n} ∥Ay∥^2 / Σ_{k=1}^K ∥A_{G_k} y_{G_k}∥^2,    σ_max := max_k max_{y_{G_k}∈R^{|G_k|}} ∥A_{G_k} y_{G_k}∥^2 / ∥y_{G_k}∥^2.

Theorem
Under an appropriate setting of the parameters, after T iterations with

    T ≥ C (σ_min σ_max γ_f/(λn) + 1)/(1 − Θ) · log(1/ϵ),

it holds that E[D(y^(T)) − D(y*)] ≤ ϵ. Furthermore, after T iterations with

    T ≥ C (σ_min σ_max γ_f/λ + 1)/(1 − Θ) · log( (σ_min σ_max γ_f/λ + 1)/((1 − Θ)ϵ) ),

it holds that E[P(w^(t)) − D(y^(t))] ≤ ϵ.

55 / 62

It is shown that σ_min ≤ K and σ_max ≤ n/K. Then

    T ≥ C (γ_f/λ + 1)/(1 − Θ) · log(1/ϵ)

achieves ϵ accuracy. This is equivalent to the iteration number of batch gradient methods on a strongly convex function.

Typically Θ = ( 1 − 1/(n/K + γ_f/λ) )^t, where t is the number of inner iterations.

Total computational time (worst case), taking t = n/K + γ_f/λ inner iterations:

    t (1 + γ_f/λ) log(1/ϵ) = ( n/K + γ_f/λ )( 1 + γ_f/λ ) log(1/ϵ).

Huge learning problems can be optimized on a distributed system with a linear convergence rate.

56 / 62

Outline

1 Stochastic optimization for structured regularization
    Structured regularization
    Alternating Direction Method of Multipliers (ADMM)
    Stochastic ADMM for online data
    Stochastic ADMM for batch data

2 Parallel and distributed optimization

3 Further interesting topics

57 / 62


Stochastic primal dual coordinate method (Zhang and Lin, 2015)

    min_x { (1/n) Σ_{i=1}^n f_i(a_i^⊤ x) + ψ(x) }
    = min_x max_y { (1/n) Σ_{i=1}^n ( ⟨x, a_i y_i⟩ − f_i*(y_i) ) + ψ(x) },

using f_i(a_i^⊤ x) = sup_{y_i} { ⟨x, a_i y_i⟩ − f_i*(y_i) }.

x and y_i (with i chosen randomly) are updated alternately.

Iteration complexity:

    T ≥ ( n + √(γ/λ) ) log(1/ϵ).

58 / 62

Multi-armed bandit

Maximize the sum of rewards earned through a sequence of plays.

Formulated by Robbins (1952).
Optimal strategy: Lai and Robbins (1985).
UCB strategy: Auer et al. (2002), Burnetas and Katehakis (1996).
Thompson sampling (Bayesian strategy): Thompson (1933).

Continuous version of the bandit: Bayesian optimization (Mockus, 1975, Mockus and Mockus, 1991, Srinivas et al., 2012, Snoek et al., 2012).
  Gaussian process regression is used to search for the peak of a function.
  Practically useful for hyper-parameter tuning of deep learning.

59 / 62

Stochastic gradient Langevin Monte Carlo (Welling and Teh, 2011)

Goal: efficient sampling from the posterior distribution.

Randomly choose a small mini-batch I_t ⊆ {1, . . . , n} and update the parameter θ_t:

    θ_t = θ_{t−1} + η_t ( ∇_θ log π(θ)  [prior]  + (n/|I_t|) Σ_{i∈I_t} ∇_θ log p(x_i|θ)  [likelihood] ) + ϵ_t ξ,

where ξ ∼ N(0, η_t I) and Σ_{t=1}^∞ ϵ_t = ∞, Σ_{t=1}^∞ ϵ_t^2 < ∞.

Related topic: stochastic variational inference (Hoffman et al., 2013).

Bayesian inference for large datasets. Practically very useful. Supported by some theoretical background.

60 / 62
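A minimal sketch of SGLD in the standard Welling-Teh form, assuming a single step-size sequence ϵ_t for both the gradient step and the injected noise (the slide writes the two scales with separate symbols); the gradient callbacks and step-size schedule are placeholders supplied by the caller.

    import numpy as np

    def sgld(theta0, X, grad_log_prior, grad_log_lik, step_size, n_steps,
             batch_size=32, seed=0):
        """Stochastic gradient Langevin dynamics; returns one sample per iteration.

        grad_log_prior(theta)  : gradient of log pi(theta)
        grad_log_lik(theta, x) : gradient of log p(x | theta) for one data point
        step_size(t)           : eps_t with sum(eps_t) = inf and sum(eps_t^2) < inf
        """
        rng = np.random.default_rng(seed)
        n = len(X)
        theta = np.asarray(theta0, dtype=float).copy()
        samples = []
        for t in range(1, n_steps + 1):
            eps = step_size(t)
            idx = rng.choice(n, size=min(batch_size, n), replace=False)
            grad = grad_log_prior(theta)
            grad = grad + (n / len(idx)) * np.sum([grad_log_lik(theta, X[i]) for i in idx], axis=0)
            theta = theta + 0.5 * eps * grad + rng.normal(0.0, np.sqrt(eps), size=theta.shape)
            samples.append(theta.copy())
        return samples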

Connection to learning theory

    min_x { (1/t) Σ_{τ=1}^t ℓ(z_τ, x) + λ_t ∥x∥_1 }

Solve the regularized learning problem in the online setting (the data z_t come one after another).
The regularization parameter λ_t should go to zero as t → ∞.

Question: Can we achieve statistically optimal estimation? → Yes.

RADAR (ℓ1-regularization): Agarwal et al. (2012).
REASON (stochastic ADMM): Sedghi et al. (2014).

    ∥x_T − x*∥^2 ≤ s log(p)/T    for an s-sparse truth x*.

These analyses discuss optimization and statistics simultaneously.

61 / 62

Summary of Part III

Stochastic ADMM for structured sparsity
  Online proximal gradient method and regularized dual averaging method for ADMM (online)
  Stochastic dual coordinate descent for ADMM (batch)
  Convergence results similar to those of the ordinary (non-ADMM) methods

Distributed stochastic optimization
  Simple averaging
  Distributed SDCA (COCOA+)

62 / 62

References

A. Agarwal, S. Negahban, and M. J. Wainwright. Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions. In Advances in Neural Information Processing Systems, pages 1538-1546, 2012.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.

S. Azadi and S. Sra. Towards an optimal stochastic alternating direction method of multipliers. In Proceedings of the 31st International Conference on Machine Learning, pages 620-628, 2014.

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1-106, 2012.

F. R. Bach. Structured sparsity-inducing norms through submodular functions. In Advances in Neural Information Processing Systems, pages 118-126, 2010.

O. Banerjee, L. E. Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485-516, 2008.

A. N. Burnetas and M. N. Katehakis. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics, 17(2):122-142, 1996.

A. Chambolle. An algorithm for total variation minimization and applications. Journal of Mathematical Imaging and Vision, 20(1-2):89-97, 2004.

W. Deng and W. Yin. On the global and linear convergence of the generalized alternating direction method of multipliers. Technical report, Rice University CAAM TR12-14, 2012.

J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2873-2908, 2009.

D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite-element approximations. Computers & Mathematics with Applications, 2:17-40, 1976.

B. He and X. Yuan. On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM Journal on Numerical Analysis, 50(2):700-709, 2012.

M. Hestenes. Multiplier and gradient methods. Journal of Optimization Theory and Applications, 4:303-320, 1969.

M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303-1347, 2013.

M. Hong and Z.-Q. Luo. On the linear convergence of the alternating direction method of multipliers. arXiv preprint arXiv:1208.3922, 2012.

L. Jacob, G. Obozinski, and J.-P. Vert. Group lasso with overlap and graph lasso. In Proceedings of the 26th International Conference on Machine Learning, 2009.

R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315-323. Curran Associates, Inc., 2013. URL http://papers.nips.cc/paper/4937-accelerating-stochastic-gradient-descent-using-predictive-variance-reduction.pdf.

Y. Kawahara, K. Nagano, K. Tsuda, and J. A. Bilmes. Submodularity cuts and applications. In Advances in Neural Information Processing Systems, pages 916-924, 2009.

T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22, 1985.

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems 25, 2013.

C. Ma, V. Smith, M. Jaggi, M. Jordan, P. Richtarik, and M. Takac. Adding vs. averaging in distributed primal-dual optimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1973-1982, 2015.

J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Non-local sparse models for image restoration. In Computer Vision, 2009 IEEE 12th International Conference on, pages 2272-2279. IEEE, 2009.

J. Mockus. On Bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, pages 400-404. Springer, 1975.

J. Mockus and L. Mockus. Bayesian approach to global optimization and application to multiobjective and constrained problems. Journal of Optimization Theory and Applications, 70(1):157-172, 1991.

J. F. Mota, J. M. Xavier, P. M. Aguiar, and M. Puschel. A proof of convergence for the alternating direction method of multipliers applied to polyhedral-constrained functions. arXiv preprint arXiv:1112.2295, 2011.

Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report 76, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 2007.

Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, Series B, 120:221-259, 2009.

H. Ouyang, N. He, L. Q. Tran, and A. Gray. Stochastic alternating direction method of multipliers. In Proceedings of the 30th International Conference on Machine Learning, 2013.

M. Powell. A method for nonlinear constraints in minimization problems. In R. Fletcher, editor, Optimization, pages 283-298. Academic Press, London, New York, 1969.

H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527-535, 1952.

R. T. Rockafellar. Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of Operations Research, 1:97-116, 1976.

L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1):259-268, 1992.

H. Sedghi, A. Anandkumar, and E. Jonckheere. Multi-step stochastic ADMM in high dimensions: Applications to sparse optimization and matrix decomposition. In Advances in Neural Information Processing Systems, pages 2771-2779, 2014.

S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. Technical report, 2013. arXiv:1211.2717.

O. Shamir, N. Srebro, and T. Zhang. Communication-efficient distributed optimization using an approximate Newton-type method. In Proceedings of the 32nd International Conference on Machine Learning, pages 1000-1008, 2014.

J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951-2959, 2012.

N. Srinivas, A. Krause, S. M. Kakade, and M. W. Seeger. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250-3265, 2012.

T. Suzuki. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In Proceedings of the 30th International Conference on Machine Learning, pages 392-400, 2013.

T. Suzuki. Stochastic dual coordinate ascent with alternating direction method of multipliers. In Proceedings of the 31st International Conference on Machine Learning, pages 736-744, 2014.

W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, pages 285-294, 1933.

R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B, 67(1):91-108, 2005.

R. J. Tibshirani and J. Taylor. The solution path of the generalized lasso. Annals of Statistics, 39(3):1335-1371, 2011. doi: 10.1214/11-AOS878. URL http://dx.doi.org/10.1214/11-AOS878.

H. Wang and A. Banerjee. Online alternating direction method. In Proceedings of the 29th International Conference on Machine Learning, 2012.

H. Wang, A. Banerjee, and Z.-Q. Luo. Parallel direction method of multipliers. In Advances in Neural Information Processing Systems, pages 181-189, 2014.

M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681-688, 2011.

L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems 23, 2009.

D. Yogatama and N. A. Smith. Making the most of bag of words: Sentence regularization with alternating direction method of multipliers. In Proceedings of the 31st International Conference on Machine Learning, pages 656-664, 2014.

Y.-L. Yu. On decomposing the proximal map. In Advances in Neural Information Processing Systems, pages 91-99, 2013.

L. Yuan, J. Liu, and J. Ye. Efficient methods for overlapping group lasso. In Advances in Neural Information Processing Systems 24, 2011.

M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19-35, 2007.

Y. Zhang and X. Lin. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 353-361, 2015.

Y. Zhang, J. C. Duchi, and M. Wainwright. Communication-efficient algorithms for statistical optimization. Journal of Machine Learning Research, 14:3321-3363, 2013.

W. Zhong and J. Kwok. Fast stochastic alternating direction method of multipliers. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 46-54, 2014.

M. Zinkevich, M. Weimer, L. Li, and A. J. Smola. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems, pages 2595-2603, 2010.