
ADMM: A Usage Study

Quanming Yao

April 6, 2017


Overview

1. A Brief Review

2. Three Application Examples

3. The ADMM Algorithm

4. ADMM: Extensions


That One Famous Paper

Three years ago it had already collected 1000 citations: a lot of people are using it.

Paper website: http://stanford.edu/~boyd/admm.html


Proximal Gradient Descent: Review

Optimization problem

$$\min_x \; f(x) + g(x)$$

Two fundamental assumptions:

A1. f is Lipschitz smooth.

A2. g admits a cheap, closed-form solution for the proximal step, i.e., for

$$\min_x \; \frac{1}{2}\|x - z\|_2^2 + g(x)$$

First attempt: accelerated proximal gradient (APG) descent.
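For concreteness, A2 holds for g(x) = λ‖x‖₁, whose proximal step is elementwise soft-thresholding. A minimal NumPy sketch (the name prox_l1 is mine, not from the slides):

```python
import numpy as np

def prox_l1(z, lam):
    """Proximal step for g(x) = lam * ||x||_1:
    argmin_x 0.5 * ||x - z||_2^2 + lam * ||x||_1,
    solved elementwise by soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

print(prox_l1(np.array([3.0, -0.5, 1.2]), 1.0))  # [2. -0.  0.2]
```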


ADMM: An Overview

When either assumption A1 or A2 fails, APG cannot be applied or becomes too slow.

Alternating Direction Method of Multipliers (ADMM):

convex case: the most general optimization approach with a convergence guarantee

nonconvex case: good empirical performance on many problems

most important: easy to use

ADMM serves as an alternative when APG fails.


Example: Robust PCA

Robust PCA: data is contaminated by sparse errors

$$\min_X \; \underbrace{\|D - X\|_1}_{f} + \lambda\|X\|_*$$

f is not smooth.


Example: Fused Lasso

Fused lasso: the signal is mostly flat, with only a few jumps.

[Figure: example fits with (a) ℓ1-norm, (b) ℓ2-norm, (c) fused lasso penalties.]

$$\min_{x \in \mathbb{R}^d} \; \frac{1}{2}\|y - Dx\|_2^2 + \lambda\underbrace{\|Gx\|_1}_{g}, \qquad G = \begin{pmatrix} +1 & -1 & & \\ & \ddots & \ddots & \\ & & +1 & -1 \end{pmatrix} \in \mathbb{R}^{(d-1)\times d}$$

g has no closed-form solution for its proximal step.


Example: Matrix Completion with Box Constraint

MovieLens: ratings in [1, 5].

Images: pixels in [0, 255].

$$\min_X \; \frac{1}{2}\sum_{(i,j)\in\Omega} (X_{ij} - O_{ij})^2 + \lambda\|X\|_* \quad \text{s.t.} \; \underbrace{1 \le X_{ij} \le 5}_{\text{constraint}}$$

The problem carries extra constraints on top of the plain f(x) + g(x) form.


ADMM: An Illustration

Consider an optimization problem where f and g are convex:

$$\min_{x,y} \; f(x) + g(y) \quad \text{s.t.} \; Ax = By$$

ADMM works on the augmented Lagrangian

$$L(x, y, p) = \underbrace{f(x) + g(y) + p^\top(Ax - By)}_{\text{standard Lagrangian}} + \frac{\rho}{2}\|Ax - By\|_2^2$$

p: the dual parameter.

$\frac{\rho}{2}\|Ax - By\|_2^2$: the augmented term.

ρ: the penalty parameter, which must be positive.


ADMM: An Illustration

$$\min_{x,y} \max_p \; L(x, y, p) \equiv f(x) + g(y) + p^\top(Ax - By) + \frac{\rho}{2}\|Ax - By\|_2^2$$

Optimization procedure:

$$x_{t+1} = \arg\min_x \; f(x) + p_t^\top(Ax - By_t) + \frac{\rho}{2}\|Ax - By_t\|_2^2, \tag{1}$$

$$y_{t+1} = \arg\min_y \; g(y) + p_t^\top(Ax_{t+1} - By) + \frac{\rho}{2}\|Ax_{t+1} - By\|_2^2, \tag{2}$$

$$p_{t+1} = p_t + \rho\,(Ax_{t+1} - By_{t+1}). \tag{3}$$

Alternating direction:

(1) is a descent step that minimizes L w.r.t. x (and similarly (2) w.r.t. y);

(3) is an ascent step that maximizes L w.r.t. p.

When f and g are convex, an O(1/T) rate is guaranteed [4].
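The loop structure of (1)-(3) is easy to sketch. Below is a minimal, hypothetical Python skeleton: the argmin oracles x_step and y_step are supplied by the caller, and the demo instantiates them for simple quadratics f(x) = ½‖x − a‖², g(y) = ½‖y − b‖² with A = B = I, where both argmins have closed forms (all names and values are illustrative):

```python
import numpy as np

def admm(x_step, y_step, A, B, y0, p0, rho=1.0, iters=100):
    """Generic ADMM loop for min f(x) + g(y) s.t. Ax = By.
    x_step and y_step are the argmin oracles of steps (1) and (2)."""
    y, p = y0, p0
    for _ in range(iters):
        x = x_step(y, p, rho)          # step (1): descend in x
        y = y_step(x, p, rho)          # step (2): descend in y
        p = p + rho * (A @ x - B @ y)  # step (3): dual ascent
    return x, y, p

# Toy instance with f(x) = 0.5*||x - a||^2, g(y) = 0.5*||y - b||^2 and
# A = B = I, so both argmin steps have closed forms.
a, b = np.array([1.0, 2.0]), np.array([3.0, 0.0])
I2 = np.eye(2)
x_step = lambda y, p, rho: (a - p + rho * y) / (1 + rho)
y_step = lambda x, p, rho: (b + p + rho * x) / (1 + rho)
x, y, p = admm(x_step, y_step, I2, I2, np.zeros(2), np.zeros(2))
print(x, y)  # both approach the consensus solution (a + b) / 2 = [2., 1.]
```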


ADMM: Application to Robust PCA

Robust PCA: $\min_X \underbrace{\|D - X\|_1}_{f} + \lambda\underbrace{\|X\|_*}_{g}$

Reformulation: introduce Y to decouple f and g,

$$\min_{X,Y} \; \|Y\|_1 + \lambda\|X\|_* \quad \text{s.t.} \; X + Y = D.$$

Augmented Lagrangian:

$$L(X, Y, P) \equiv \|Y\|_1 + \lambda\|X\|_* + \langle P, X + Y - D\rangle + \frac{\rho}{2}\|X + Y - D\|_F^2$$


ADMM: Application to Robust PCA

$$\min_{X,Y} \max_P \; \|Y\|_1 + \lambda\|X\|_* + \langle P, X + Y - D\rangle + \frac{\rho}{2}\|X + Y - D\|_F^2$$

ADMM procedure:

$$X_{t+1} = \arg\min_X \; \lambda\|X\|_* + \langle P_t, X + Y_t - D\rangle + \frac{\rho}{2}\|X + Y_t - D\|_F^2$$

$$\phantom{X_{t+1}} = \arg\min_X \; \frac{1}{2}\left\|X - \left(D - Y_t - \frac{1}{\rho}P_t\right)\right\|_F^2 + \frac{\lambda}{\rho}\|X\|_*$$

$$\phantom{X_{t+1}} = \operatorname{prox}_{\frac{\lambda}{\rho}\|\cdot\|_*}\!\left(Z_t^X\right) \quad \text{(a proximal step with the nuclear norm)}$$

where $Z_t^X = D - Y_t - \frac{1}{\rho}P_t$.

SVD on $Z_t^X = U\Sigma V^\top$ gives the closed form [1]: $X_{t+1} = U\max\left(\Sigma - \frac{\lambda}{\rho}I,\, 0\right)V^\top$
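This closed form is the singular value thresholding (SVT) operator of [1]. A minimal NumPy sketch (the name svt is mine):

```python
import numpy as np

def svt(Z, tau):
    """Singular value thresholding: prox of tau * ||.||_* at Z."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

# X-update of the slide: X_{t+1} = svt(D - Y_t - P_t / rho, lam / rho)
```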


ADMM: Application to Robust PCA

$$\min_{X,Y} \max_P \; \|Y\|_1 + \lambda\|X\|_* + \langle P, X + Y - D\rangle + \frac{\rho}{2}\|X + Y - D\|_F^2$$

ADMM procedure:

$$Y_{t+1} = \arg\min_Y \; \|Y\|_1 + \langle P_t, X_{t+1} + Y - D\rangle + \frac{\rho}{2}\|X_{t+1} + Y - D\|_F^2$$

$$\phantom{Y_{t+1}} = \arg\min_Y \; \frac{1}{2}\left\|Y - \left(D - X_{t+1} - \frac{1}{\rho}P_t\right)\right\|_F^2 + \frac{1}{\rho}\|Y\|_1$$

$$\phantom{Y_{t+1}} = \operatorname{prox}_{\frac{1}{\rho}\|\cdot\|_1}\!\left(Z_t^Y\right) \quad \text{(a proximal step with the } \ell_1\text{-norm)}$$

where $Z_t^Y = D - X_{t+1} - \frac{1}{\rho}P_t$.

Closed form: $[Y_{t+1}]_{ij} = \operatorname{sign}\!\left([Z_t^Y]_{ij}\right)\max\!\left(\left|[Z_t^Y]_{ij}\right| - \frac{1}{\rho},\, 0\right)$


ADMM: Application to Robust PCA

Reformulation:

$$\min_{X,Y} \; \|Y\|_1 + \lambda\|X\|_* \quad \text{s.t.} \; X + Y = D.$$

ADMM procedure:

$$X_{t+1} = \operatorname{prox}_{\frac{\lambda}{\rho}\|\cdot\|_*}\!\left(Z_t^X\right), \quad \text{where } Z_t^X = D - Y_t - \tfrac{1}{\rho}P_t$$

$$Y_{t+1} = \operatorname{prox}_{\frac{1}{\rho}\|\cdot\|_1}\!\left(Z_t^Y\right), \quad \text{where } Z_t^Y = D - X_{t+1} - \tfrac{1}{\rho}P_t$$

$$P_{t+1} = P_t + \rho\,(X_{t+1} + Y_{t+1} - D)$$

Here ADMM is effectively the only choice: smoothing techniques and interior-point methods can be used, but they are slow.
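Putting the three steps together gives a compact solver. A NumPy sketch under my own naming, with a synthetic low-rank-plus-sparse D and an illustrative λ (not the author's reference code):

```python
import numpy as np

def soft(Z, tau):
    """Elementwise soft-thresholding: prox of tau * ||.||_1."""
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def svt(Z, tau):
    """Singular value thresholding: prox of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def robust_pca_admm(D, lam, rho=1.0, iters=200):
    """ADMM for min ||Y||_1 + lam * ||X||_*  s.t.  X + Y = D."""
    X, Y, P = (np.zeros_like(D) for _ in range(3))
    for _ in range(iters):
        X = svt(D - Y - P / rho, lam / rho)   # nuclear-norm proximal step
        Y = soft(D - X - P / rho, 1.0 / rho)  # l1 proximal step
        P = P + rho * (X + Y - D)             # dual ascent step
    return X, Y

# Synthetic low-rank data plus sparse corruptions; lam is illustrative.
rng = np.random.default_rng(0)
L0 = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 50))
S = 10.0 * rng.standard_normal((50, 50)) * (rng.random((50, 50)) < 0.05)
X, Y = robust_pca_admm(L0 + S, lam=np.sqrt(50.0))
print(np.linalg.norm(X - L0) / np.linalg.norm(L0))  # relative recovery error
```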


ADMM: Application to Robust PCA

[Figure: convergence curves on the convex problem; the number on each curve is the value of ρ.]

too large ρ: increase; too small ρ: decrease

Once ρ > 0, convergence is guaranteed; I usually just set ρ = 1.


ADMM: Other Two Examples

Fused lasso: $\min_{x\in\mathbb{R}^d} \frac{1}{2}\|y - Dx\|_2^2 + \lambda\|Gx\|_1$

$$\min_{x,z} \; \frac{1}{2}\|y - Dx\|_2^2 + \lambda\|z\|_1 \quad \text{s.t.} \; z = Gx$$

closed-form solution for x

proximal step with the ℓ1-norm for z (a code sketch follows below)

Matrix completion: $\min_X \frac{1}{2}\sum_{(i,j)\in\Omega}(X_{ij} - O_{ij})^2 + \lambda\|X\|_*$ s.t. $1 \le X_{ij} \le 5$

$$\min_{X,Z} \; \frac{1}{2}\sum_{(i,j)\in\Omega}(X_{ij} - O_{ij})^2 + \lambda\|X\|_* \quad \text{s.t.} \; 1 \le Z_{ij} \le 5, \; X = Z$$

simple projection for Z

no closed form for X (we will use linearization later)
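A sketch of the resulting fused-lasso ADMM, assuming G is the (d−1)×d first-order difference matrix: the x-step solves the linear system (DᵀD + ρGᵀG)x = Dᵀy + Gᵀ(ρz − p), the z-step is soft-thresholding, and the demo observes a piecewise-constant signal directly (D = I); all names are mine:

```python
import numpy as np

def soft(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def fused_lasso_admm(y, D, lam, rho=1.0, iters=500):
    """ADMM for min 0.5*||y - Dx||_2^2 + lam*||z||_1  s.t.  z = Gx."""
    d = D.shape[1]
    # First-order difference matrix G with rows (+1, -1), shape (d-1) x d.
    G = np.eye(d - 1, d) - np.eye(d - 1, d, k=1)
    x, z, p = np.zeros(d), np.zeros(d - 1), np.zeros(d - 1)
    M = D.T @ D + rho * G.T @ G  # x-step matrix is fixed across iterations
    for _ in range(iters):
        x = np.linalg.solve(M, D.T @ y + G.T @ (rho * z - p))  # closed form on x
        z = soft(G @ x + p / rho, lam / rho)                   # l1 prox on z
        p = p + rho * (G @ x - z)                              # dual update
    return x

# Noisy piecewise-constant signal, observed directly (D = I).
rng = np.random.default_rng(0)
truth = np.repeat([0.0, 2.0, -1.0], 30)
y = truth + 0.3 * rng.standard_normal(truth.size)
x_hat = fused_lasso_admm(y, np.eye(truth.size), lam=2.0)
print(np.abs(x_hat - truth).max())
```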


ADMM: Improvements

Useful extensions

Multiple blocks of parameters

Increasing penalty parameter

Linearization and acceleration

Nonconvex optimization


Multi-Block ADMM

An example:

$$\min_x \; \sum_{i=1}^m f_i(x) \;\;\rightarrow\;\; \min_{x,\,y^i} \; \sum_{i=1}^m f_i(y^i) \quad \text{s.t.} \; x = y^i.$$

Each f_i is a convex function or the indicator function of a convex set.

$$L(x, y^1, \dots, y^m, p^1, \dots, p^m) = \sum_{i=1}^m f_i(y^i) + (p^i)^\top(x - y^i) + \frac{\rho}{2}\left\|x - y^i\right\|_2^2$$

ADMM procedure (a code sketch follows below):

for each i, get $y^i_{t+1} = \arg\min_{y^i} \; f_i(y^i) + (p^i_t)^\top(x_t - y^i) + \frac{\rho}{2}\|x_t - y^i\|_2^2$

get $x_{t+1} = \arg\min_x \; \sum_{i=1}^m (p^i_t)^\top(x - y^i_{t+1}) + \frac{\rho}{2}\|x - y^i_{t+1}\|_2^2$

for each i, update $p^i_{t+1} = p^i_t + \rho\left(x_{t+1} - y^i_{t+1}\right)$

Convergence is not guaranteed in general [2], but it works well in practice.
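A minimal consensus sketch with three scalar quadratics fᵢ(y) = ½(y − aᵢ)², chosen so every argmin above has a closed form (the data and all names are illustrative, not from the slides):

```python
import numpy as np

# Consensus ADMM for min_x sum_i f_i(x), rewritten as
# min sum_i f_i(y^i) s.t. x = y^i, with f_i(y) = 0.5 * (y - a_i)^2.
a = np.array([1.0, 4.0, 7.0])          # one quadratic f_i per entry
rho = 1.0
x, y, p = 0.0, np.zeros(a.size), np.zeros(a.size)
for _ in range(100):
    y = (a + p + rho * x) / (1 + rho)  # all y^i-steps, in parallel
    x = np.mean(y) - np.mean(p) / rho  # x-step: average over the blocks
    p = p + rho * (x - y)              # one dual variable per block
print(x)  # approaches the minimizer mean(a) = 4.0
```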


Increasing Penalty Parameter

Robust PCA:

$$\min_{X,Y} \; \|Y\|_1 + \|X\|_* \quad \text{s.t.} \; X + Y = D.$$

ADMM procedure, starting from a small ρ_0 = 0.001:

$$X_{t+1} = \operatorname{prox}_{\frac{1}{\rho_t}\|\cdot\|_*}\!\left(Z_t^X\right), \quad \text{where } Z_t^X = D - Y_t - \tfrac{1}{\rho_t}P_t$$

$$Y_{t+1} = \operatorname{prox}_{\frac{1}{\rho_t}\|\cdot\|_1}\!\left(Z_t^Y\right), \quad \text{where } Z_t^Y = D - X_{t+1} - \tfrac{1}{\rho_t}P_t$$

$$P_{t+1} = P_t + \rho_t\,(X_{t+1} + Y_{t+1} - D)$$

$$\rho_{t+1} = c\,\rho_t, \quad c > 1 \; \text{(i.e., increase exponentially)}$$

This yields a fast approximate solution: fast convergence to a limit point (not an optimal one) [5].
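The only change to the earlier robust PCA loop is the schedule on ρ. A sketch reusing svt() and soft() from the robust PCA code above, with ρ₀ and c as on the slide and a synthetic D:

```python
import numpy as np

D = np.random.default_rng(0).standard_normal((50, 50))
X, Y, P = (np.zeros_like(D) for _ in range(3))
rho, c = 1e-3, 1.1                       # small rho_0, exponential growth
for _ in range(200):
    X = svt(D - Y - P / rho, 1.0 / rho)  # prox weight now depends on rho_t
    Y = soft(D - X - P / rho, 1.0 / rho)
    P = P + rho * (X + Y - D)
    rho = c * rho                        # rho_{t+1} = c * rho_t, c > 1
```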


Linearization and Acceleration

$$\min_X \; \frac{1}{2}\sum_{(i,j)\in\Omega} (X_{ij} - O_{ij})^2 + \lambda\|X\|_* \quad \text{s.t.} \; 1 \le Z_{ij} \le 5, \; X = Z$$

The minimization over X is

$$X_{t+1} = \arg\min_X \; \underbrace{\frac{1}{2}\sum_{(i,j)\in\Omega} (X_{ij} - O_{ij})^2}_{f(X)} + \langle P_t, X - Z_t\rangle + \frac{\rho}{2}\|X - Z_t\|_F^2 + \lambda\|X\|_*$$

It has no closed form and must be solved with another algorithm such as APG.


Linearization and Acceleration

Second-order approximation of f(X) at X_t [8]:

$$\arg\min_X \; f(X_t) + \langle X - X_t, \nabla f(X_t)\rangle + \frac{L}{2}\|X - X_t\|_F^2 + \langle P_t, X - Z_t\rangle + \frac{\rho}{2}\|X - Z_t\|_F^2 + \lambda\|X\|_*$$

$$= \arg\min_X \; \frac{1}{2}\left\|X - \frac{L X_t + \rho Z_t - P_t - \nabla f(X_t)}{\rho + L}\right\|_F^2 + \frac{\lambda}{\rho + L}\|X\|_*$$

A proximal step with the nuclear norm: closed-form solution.

Acceleration may also be applied (similar to APG) [3, 7].

If linearization is not used, there is no need for acceleration.
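As a sketch, one such linearized X-update reduces to a single SVT call (reusing svt() from the robust PCA code; mask is the 0/1 indicator of Ω, and L = 1 because ∇f(X) = mask ⊙ (X − O) is 1-Lipschitz; the function name is mine):

```python
import numpy as np

def linearized_x_step(X, Z, P, O, mask, lam, rho, L=1.0):
    """One linearized X-update for the matrix completion problem.
    f(X) = 0.5 * sum_{(i,j) in Omega} (X_ij - O_ij)^2, so
    grad f(X) = mask * (X - O) with Lipschitz constant L = 1."""
    grad = mask * (X - O)
    V = (L * X + rho * Z - P - grad) / (rho + L)
    return svt(V, lam / (rho + L))  # nuclear-norm proximal step
```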


Nonconvex Optimization

Nonconvex robust PCA [6]:

$$\min_{X,Y} \; \sum_{i=1}^m \sum_{j=1}^n \kappa(Y_{ij}) + \lambda\sum_{i=1}^m \kappa(\sigma_i(X)) \quad \text{s.t.} \; X + Y = D,$$

where $\kappa(x) = \log\left(1 + \frac{|x|}{\theta}\right)$.

No convergence guarantee in general, but it works well in practice:

use the standard ADMM procedure (no linearization or acceleration);

make ρ increase exponentially.


ADMM: Application to Robust PCA

ADMM on the nonconvex problem: no convergence guarantee in general.


References

[1] Cai, J., Candès, E. J., and Shen, Z. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization 20 (2010), 1956–1982.

[2] Chen, C., He, B., Ye, Y., and Yuan, X. The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent. Mathematical Programming 155, 1-2 (2016), 57–79.

[3] Goldstein, T., O'Donoghue, B., and Setzer, S. Fast alternating direction optimization methods. SIAM Journal on Imaging Sciences 7, 3 (2014).

[4] He, B., and Yuan, X. On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM Journal on Numerical Analysis 50, 2 (2012), 700–709.

[5] Lin, Z., Chen, M., and Ma, Y. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv preprint arXiv:1009.5055 (2010).

[6] Oh, T.-H., Tai, Y.-W., Bazin, J.-C., Kim, H., and Kweon, I. S. Partial sum minimization of singular values in robust PCA: Algorithm and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 4 (2016), 744–758.

[7] Ouyang, Y., Chen, Y., Lan, G., and Pasiliao Jr., E. An accelerated linearized alternating direction method of multipliers. SIAM Journal on Imaging Sciences 8, 1 (2015), 644–681.

[8] Yang, J., and Yuan, X. Linearized augmented Lagrangian and alternating direction methods for nuclear norm minimization. Mathematics of Computation 82, 281 (2013), 301–329.
